<a href="https://colab.research.google.com/github/FraGoTe/Analisis-Estadistico-Textos/blob/master/ClasificacionTextos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# STEP 1: Add the required libraries

The following libraries are used later on. If they are not available, they can be installed from their respective websites. Note that some resources are fetched (when needed) with `nltk.download`.

In [1]:
import pandas as pd
import numpy as np
import nltk
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# STEP 2: Set the random seed

This makes the results reproducible on every run (as long as the script stays the same); otherwise each run would produce different results. The seed can be set to any number. For more information, see https://www.sharpsightlabs.com/blog/numpy-random-seed/

In [0]:
np.random.seed(500)
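As a small illustration of what the seed does (a sketch, not part of the original notebook), resetting the same seed makes NumPy produce the same sequence of random numbers again:

```python
import numpy as np

np.random.seed(500)
first = np.random.rand(3)

# Resetting the seed restarts the same pseudo-random sequence
np.random.seed(500)
second = np.random.rand(3)

print(np.array_equal(first, second))  # True
```

Functions that draw from NumPy's global random state, such as `train_test_split` when no `random_state` is given, therefore behave the same way on every full run of the notebook.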


# STEP 3: Load the corpus

The dataset (corpus) can be loaded into a pandas DataFrame, a structure for manipulating tabular data, using the `read_csv` function. As you can see, the encoding is set to 'latin-1' because the text contains several special characters (accented letters). Here the corpus is read from a URL, but you could also link a Drive account or upload a file.

In [0]:
urlCorpus ='https://raw.githubusercontent.com/githila/data/master/text.csv'
Corpus = pd.read_csv(urlCorpus,encoding='latin-1')

# STEP 4: Data pre-processing

This is an important step in any data mining process. It essentially means transforming raw data into a format that NLP models can understand. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and it is likely to contain many errors. Data pre-processing is a proven method for resolving such issues, and it helps classification algorithms produce better results.

Two of the techniques used for data pre-processing are explained below (alongside other, simpler ones).

<b>Tokenization:</b> The process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called "tokens". The list of tokens becomes the input for further processing. The NLTK library provides the functions <i>word_tokenize</i> and <i>sent_tokenize</i> to easily split a text (paragraph/document) into a list of words or sentences, respectively.

<b>Word Stemming/Lemmatization:</b> The goal of both processes is to reduce the inflected forms of each word to a common base or root. <i>Lemmatization</i> is closely related to <i>stemming</i>. The difference is that a stemmer operates on a single word without knowledge of its context, and therefore cannot distinguish between words that have different meanings depending on their part of speech. However, stemmers are typically easier to implement and run faster, so their reduced accuracy may not matter for some applications.
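To make the stemming side of this comparison concrete, here is a minimal sketch using NLTK's `PorterStemmer`. The word list is an illustrative assumption, and the stemmer is used only for contrast: the notebook itself relies on `WordNetLemmatizer`.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A stemmer applies suffix-stripping rules to each word in isolation
words = ["running", "studies", "better"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Note that a stem such as "studi" is not a dictionary word, and "better" is left as it is, whereas a lemmatizer that is told the word is an adjective can map it to "good".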


<img src="https://raw.githubusercontent.com/pepe3059/NLP/master/figures/lema.png"
     alt="lemma"
     style="float: left; margin-right: 10px;" />

The complete script that performs the pre-processing steps mentioned above is shown below. You can add or remove steps to suit the dataset you are working with.

<ol type="1">
    <li>Remove blank rows from the data (if any)</li>
    <li>Convert all text to lowercase</li>
    <li>Word tokenization</li>
    <li>Remove stop words</li>
    <li>Remove non-alphabetic text (symbols and special characters)</li>
    <li>Word lemmatization</li>
</ol>

In [0]:
# Step - a (1): Remove blank rows, if any, and renumber the index so that
# the Corpus.loc[index, ...] writes below stay aligned with the rows.
Corpus = Corpus.dropna(subset=['text']).reset_index(drop=True)

# Step - b (2): Change all the text to lower case. This is required as Python treats 'dog' and 'DOG' as different strings
Corpus['text'] = [entry.lower() for entry in Corpus['text']]

# Step - c (3): Tokenization: each entry in the corpus is broken into a list of words
Corpus['text'] = [word_tokenize(entry) for entry in Corpus['text']]

# Step - d (4): Remove stop words and non-alphabetic tokens, then lemmatize.
# WordNetLemmatizer needs POS tags to know whether a word is a noun, verb, adjective, etc. The default here is noun
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

# Build the stop word set and the lemmatizer once, outside the loop
stop_words = set(stopwords.words('english'))
word_Lemmatized = WordNetLemmatizer()

for index, entry in enumerate(Corpus['text']):
    # Words that pass the filters for this entry
    Final_words = []
    # pos_tag provides the tag, i.e. whether the word is a noun (N), a verb (V) or something else
    for word, tag in pos_tag(entry):
        # Keep only alphabetic tokens that are not stop words
        if word not in stop_words and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word, tag_map[tag[0]])
            Final_words.append(word_Final)
    # The processed words for each entry are stored in 'text_final'
    Corpus.loc[index, 'text_final'] = str(Final_words)

<img src="https://raw.githubusercontent.com/pepe3059/NLP/master/figures/preprocessed.png"
     alt="Text after all the pre-processing steps are performed"
     style="float: left; margin-right: 10px;" />
     Figure. Text after all the pre-processing steps have been performed.

# STEP 5: Prepare the training and test datasets

The corpus is split into two datasets, <b>training</b> and <b>test</b>. The training dataset is used to fit the model, and predictions are made on the test dataset. This can be done with the <b>train_test_split</b> function from the sklearn library. The training data holds 70% of the corpus and the remaining 30% is used for testing, as set by the parameter <b>test_size=0.3</b>.

In [0]:
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus['text_final'],Corpus['label'],test_size=0.3)
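The 70/30 split can be checked on a toy example. This is only a sketch: the ten documents and their labels are made up, and `random_state` and `stratify` are optional arguments that the notebook does not use. They pin the split to a fixed seed and keep the class proportions equal in both parts.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten documents with balanced labels
texts = [f"doc {i}" for i in range(10)]
labels = ["pos", "neg"] * 5

train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.3, random_state=500, stratify=labels
)

print(len(train_x), len(test_x))  # 7 3
```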

<img src="https://raw.githubusercontent.com/githila/data/master/contentds.png"
     alt="Content of each dataset"
     style="float: left; margin-right: 10px;" />
     
     Figure. Contents of each dataset.

# STEP 6: Encoding

The target variable is label-encoded: the string-typed **categorical** data in the dataset is transformed into numerical values that the model can understand.

In [0]:
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
# Reuse the mapping learned from the training labels instead of re-fitting on the test labels
Test_Y = Encoder.transform(Test_Y)
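The following sketch shows why the encoder should learn its mapping from the training labels only. The label values are hypothetical. If the test split happens to lack a class, re-fitting on it silently produces a different numbering.

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["neg", "neu", "pos", "neg"]
test_labels = ["pos", "pos"]  # hypothetical test split that lacks two classes

encoder = LabelEncoder()
train_y = encoder.fit_transform(train_labels)      # learns neg=0, neu=1, pos=2
consistent = encoder.transform(test_labels)        # reuses that mapping
refit = LabelEncoder().fit_transform(test_labels)  # relearns from scratch: pos=0

print(list(consistent), list(refit))  # [2, 2] [0, 0]
```

When both splits contain every class, the two approaches give the same numbers, because `LabelEncoder` sorts the class names. Calling `transform` on the test labels simply removes the dependence on that coincidence.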

<img src="https://raw.githubusercontent.com/pepe3059/NLP/master/figures/encoding.png"
     alt="text encoding"
     style="float: left; margin-right: 10px;" />

# STEP 7: Word vectorization

This is the general process of turning a collection of text documents into numerical feature vectors. There are several methods for converting text data into vectors the model can understand, and a very popular one is TF-IDF. The acronym stands for "Term Frequency - Inverse Document Frequency", the two components of the resulting scores assigned to each word.

Term frequency: how often a given word appears within a document.

Inverse document frequency: downscales words that appear in many documents.

TF-IDF scores are word frequency scores that try to highlight the more interesting words, that is, words that are frequent in one document but not across all documents.
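For readers who want to see the arithmetic, the sketch below reproduces scikit-learn's default TF-IDF by hand on three made-up documents. With default settings, `TfidfVectorizer` uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, multiplies it by the raw term counts, and scales each row to unit Euclidean length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["good movie", "bad movie", "good good plot"]  # toy corpus

# Term frequency: raw counts of each word per document
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each word occurs
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts and normalize each row to unit length
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

library = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, library))  # True
```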

The following code fits the TF-IDF model on the whole corpus. This lets the vectorizer build a vocabulary of the words it has learned from the corpus and assign a unique integer to each of them. There will be at most 5000 unique words/features, since the parameter max_features=5000 is set.
Finally, Train_X and Test_X are transformed into the vectorized Train_X_Tfidf and Test_X_Tfidf. For each row, these now hold the unique integer of each word in it together with its importance as computed by TF-IDF.



In [0]:
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus['text_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

To see the vocabulary learned from the corpus:

In [10]:
print(Tfidf_vect.vocabulary_)




This prints something like

{'even': 1459, 'sound': 4067, 'track': 4494, 'beautiful': 346, 'paint': 3045, 'mind': 2740, 'well': 4864, 'would': 4952, 'recomend': 3493, 'people': 3115, 'hate': 1961, 'video': 4761 …………}


The vectorized data looks as follows.

In [11]:
print(Train_X_Tfidf)

  (0, 4502)	0.3763188267807246
  (0, 4501)	0.15031494427382475
  (0, 3974)	0.35868777245753825
  (0, 3890)	0.2515140235472667
  (0, 3858)	0.2690675584422277
  (0, 3748)	0.34695623926050195
  (0, 3658)	0.2896999547088821
  (0, 3561)	0.29449641491430995
  (0, 2922)	0.229683025366997
  (0, 1940)	0.13406125327954532
  (0, 1536)	0.17761496997588844
  (0, 517)	0.321056290554803
  (0, 488)	0.12303572865008613
  (0, 238)	0.2448559358109696
  (1, 4687)	0.21384275526442909
  (1, 4069)	0.3566872275481094
  (1, 3434)	0.21279175847748263
  (1, 3319)	0.8157357261127677
  (1, 2595)	0.2173336717856602
  (1, 1252)	0.2074693534878867
  (1, 598)	0.1614401835472762
  (2, 4734)	0.21251405574612364
  (2, 4621)	0.17383471522304228
  (2, 4464)	0.11898591577849023
  (2, 4197)	0.13515537469996092
  :	:
  (6998, 2522)	0.11512409752599596
  (6998, 2130)	0.13650214385741868
  (6998, 1976)	0.07126908030410523
  (6998, 1788)	0.22013355385880556
  (6998, 1755)	0.19935027840675415
  (6998, 1719)	0.13508979239544552
  

<img src="https://raw.githubusercontent.com/githila/data/master/vector.png"
     alt="Output: — 1: Row number of ‘Train_X_Tfidf’, 2: Unique Integer number of each word in the first row, 3: Score calculated by TF-IDF Vectorizer"
     style="float: left; margin-right: 10px;" />
     
  Output:
  
  1: Row number in 'Train_X_Tfidf',
  
  2: Unique integer of each word in the first row,
  
  3: Score computed by the TF-IDF vectorizer

The datasets are now ready to be fed into different classification algorithms.
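The `(row, column)  score` triples printed above can be turned back into words by inverting the vectorizer's `vocabulary_` dictionary. The snippet below is a sketch on a made-up three-document corpus, and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["great sound track", "beautiful sound", "hate video"]  # toy corpus
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps word -> column index; invert it to map index -> word
index_to_word = {int(idx): word for word, idx in vectorizer.vocabulary_.items()}

# Each sparse row stores its non-zero column indices and their scores
row = matrix[0]
decoded = {index_to_word[int(col)]: float(score)
           for col, score in zip(row.indices, row.data)}

for word, score in sorted(decoded.items(), key=lambda item: -item[1]):
    print(f"{word}: {score:.3f}")
```

In this toy corpus "sound" occurs in two documents, so it receives a lower score in the first row than "great" or "track", which occur in only one.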

# STEP 8: Use machine learning algorithms to predict the outcome

First, let's try the Naive Bayes classifier. You can read more about it <a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier">here</a>.

In [12]:
# fit the training dataset on the NB classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)

Naive Bayes Accuracy Score ->  83.1


Next is the SVM (Support Vector Machine). You can read more about it <a href="https://en.wikipedia.org/wiki/Support_vector_machine">here</a>.

In [13]:
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)

SVM Accuracy Score ->  84.7
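An alternative way to wire the vectorizer and a classifier together is a scikit-learn pipeline. This sketch is not what the notebook does. The notebook fits TF-IDF on the whole corpus, while a pipeline fits the vectorizer on the training texts only, so no information from the test set reaches the vocabulary or the IDF weights. The toy texts and labels are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data
train_texts = [
    "great film loved it",
    "wonderful great acting",
    "terrible film hated it",
    "awful boring acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

# The vectorizer is fitted only on whatever is passed to fit()
model = make_pipeline(TfidfVectorizer(max_features=5000), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["great wonderful", "awful terrible"])
print(list(predictions))  # ['pos', 'neg']
```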


# **Assignment** 3

## 1. Try skipping lemmatization and stemming on the English text


In [0]:
urlCorpus ='https://raw.githubusercontent.com/githila/data/master/text.csv'
Corpus = pd.read_csv(urlCorpus,encoding='latin-1')

In [0]:
# Step - a (1): Remove blank rows, if any, and renumber the index so that
# the Corpus.loc[index, ...] writes below stay aligned with the rows.
Corpus = Corpus.dropna(subset=['text']).reset_index(drop=True)

# Step - b (2): Change all the text to lower case. This is required as Python treats 'dog' and 'DOG' as different strings
Corpus['text'] = [entry.lower() for entry in Corpus['text']]

# Step - c (3): Tokenization: each entry in the corpus is broken into a list of words
Corpus['text'] = [word_tokenize(entry) for entry in Corpus['text']]

# Step - d (4): Remove stop words and non-alphabetic tokens. No lemmatization or stemming is applied here
stop_words = set(stopwords.words('english'))

for index, entry in enumerate(Corpus['text']):
    # Keep only alphabetic tokens that are not stop words
    Final_words = [word for word in entry if word not in stop_words and word.isalpha()]
    # The processed words for each entry are stored in 'text_final'
    Corpus.loc[index, 'text_final'] = str(Final_words)

In [0]:
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus['text_final'],Corpus['label'],test_size=0.3)

In [0]:
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus['text_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

In [22]:
print(Tfidf_vect.vocabulary_)




In [23]:
print(Train_X_Tfidf)

  (0, 114)	0.12500150178806232
  (0, 284)	0.12174334877660022
  (0, 393)	0.11764099262930167
  (0, 448)	0.06153804934979068
  (0, 565)	0.12464939038764301
  (0, 617)	0.1810979956575249
  (0, 1137)	0.1817587441723539
  (0, 1199)	0.19120705827948992
  (0, 1518)	0.14233386957937402
  (0, 1615)	0.15854518306692794
  (0, 1664)	0.10247730840345592
  (0, 1684)	0.21938029944817708
  (0, 1686)	0.22501523617241648
  (0, 1910)	0.0996461163371844
  (0, 2456)	0.4293786690193012
  (0, 2494)	0.22205665850068532
  (0, 2883)	0.08685963184933673
  (0, 2927)	0.11485332550494431
  (0, 3174)	0.1817587441723539
  (0, 3306)	0.15602882738890556
  (0, 3477)	0.15090914395115854
  (0, 3518)	0.1521603563075731
  (0, 3525)	0.11644021754340296
  (0, 3848)	0.2169369750629383
  (0, 3892)	0.13967244300062748
  :	:
  (6998, 3918)	0.2073400467992851
  (6998, 4463)	0.11786635915885912
  (6998, 4518)	0.26627847921174685
  (6998, 4532)	0.47659326151492004
  (6999, 89)	0.2126966126897102
  (6999, 90)	0.15707549112904717

In [33]:
# fit the training dataset on the NB classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)

Naive Bayes Accuracy Score ->  83.76666666666667


In [21]:
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)

SVM Accuracy Score ->  85.23333333333333


## Tune the parameters of Multinomial Naive Bayes and SVM to improve the results

In [43]:
# fit the training dataset on the NB classifier
Naive = naive_bayes.MultinomialNB(alpha=1.8, fit_prior=True)
Naive.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)

Naive Bayes Accuracy Score ->  83.8


In [62]:
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
SVM = svm.SVC(C=0.5, kernel='sigmoid', gamma='scale')
SVM.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)

SVM Accuracy Score ->  85.8
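The parameter values above were chosen by hand. A more systematic option, shown here only as a sketch on made-up data, is to let `GridSearchCV` try several values with cross-validation on the training set. The grid values and the step names `tfidf` and `nb` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four positive and four negative documents
texts = [
    "great film loved it", "wonderful great acting",
    "loved the wonderful plot", "great plot great acting",
    "terrible film hated it", "awful boring acting",
    "hated the boring plot", "awful plot terrible acting",
]
labels = ["pos"] * 4 + ["neg"] * 4

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# Candidate smoothing values for MultinomialNB
param_grid = {"nb__alpha": [0.1, 0.5, 1.0, 1.8]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

The same pattern works for `svm.SVC` by swapping the last pipeline step and searching over, for example, `C` and `kernel`.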


## Classify Spanish texts

In [162]:
urlCorpus ='https://raw.githubusercontent.com/FraGoTe/Analisis-Estadistico-Textos/master/elcomercio201911281.csv'
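Corpus = pd.read_csv(urlCorpus,encoding='latin-1')
# Keep an untouched copy for the noun/verb experiment further down;
# a plain alias (Corpus2 = Corpus) would share the pre-processing applied to Corpus
Corpus2 = Corpus.copy()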

print(Corpus.head())

                                                text     label
0  <p>Tras 17 a&ntilde;os sin comicios electorale...  Politica
1  <p>El presidente Pedro Pablo Kuczynski (<stron...  Politica
2  <p>La presidenta de la Comisi&oacute;n de Defe...  Politica
3  <p>A m&aacute;s tardar el 19 de mayo pr&oacute...  Politica
4  <p>La Procuradur&iacute;a P&uacute;blica Ad Ho...  Politica


In [0]:
!pip install w3lib

In [0]:
# Helper functions
import re
from w3lib.html import replace_entities

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', replace_entities(text))
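To see what this clean-up does to a row of the Spanish corpus, here is an equivalent sketch that uses only the standard library. `html.unescape` stands in for `w3lib`'s `replace_entities`, and the sample string is modeled on the corpus preview above.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_html(text):
    """Decode HTML entities, then drop anything that looks like a tag."""
    return TAG_PATTERN.sub("", html.unescape(text))

sample = "<p>Tras 17 a&ntilde;os sin comicios <strong>electorales</strong></p>"
print(strip_html(sample))  # Tras 17 años sin comicios electorales
```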

In [0]:
# Step - a (1): Remove blank rows, if any, and renumber the index so that
# the Corpus.loc[index, ...] writes below stay aligned with the rows.
Corpus = Corpus.dropna(subset=['text']).reset_index(drop=True)

# Step - b (2): Change all the text to lower case and strip the HTML markup
Corpus['text'] = [remove_tags(entry.lower()) for entry in Corpus['text']]

# Step - c (3): Tokenization: each entry in the corpus is broken into a list of words
Corpus['text'] = [word_tokenize(entry) for entry in Corpus['text']]

# Step - d (4): Remove stop words and non-alphabetic tokens, then lemmatize.
# Note: pos_tag and WordNetLemmatizer are English resources, so Spanish words tend to pass through unchanged.
# By default the POS is set to noun
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

for index,entry in enumerate(Corpus['text']):
    # Declaring Empty List to store the words that follow the rules for this step
    Final_words = []
    # Initializing WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(entry):
        # Below condition is to check for Stop words and consider only alphabets
        if word not in stopwords.words('spanish') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    # The final processed set of words for each iteration will be stored in 'text_final'
    Corpus.loc[index,'text_final'] = str(Final_words)
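Since WordNet is an English resource, one possible alternative for Spanish text is NLTK's `SnowballStemmer`, which ships rules for Spanish and needs no extra downloads. This is only a sketch of a swapped-in technique, not what the notebook uses, and the word list is illustrative.

```python
from nltk.stem import SnowballStemmer

# Rule-based Spanish stemmer (an alternative to the English WordNetLemmatizer)
stemmer = SnowballStemmer("spanish")

words = ["presidente", "presidentes", "electorales", "votos"]
for word in words:
    print(word, "->", stemmer.stem(word))
```

Singular and plural forms such as "presidente" and "presidentes" collapse to the same stem, which shrinks the vocabulary seen by the TF-IDF vectorizer.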

In [0]:
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus['text_final'],Corpus['label'],test_size=0.3)

In [0]:
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
# Reuse the mapping learned from the training labels instead of re-fitting on the test labels
Test_Y = Encoder.transform(Test_Y)

In [0]:
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus['text_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

In [156]:
print(Tfidf_vect.vocabulary_)

{'tras': 4690, 'años': 420, 'comicios': 806, 'electorales': 1470, 'fernando': 1776, 'belaunde': 470, 'terry': 4592, 'regreso': 3915, 'régimen': 4079, 'democrático': 1203, 'convertirse': 993, 'virtual': 4875, 'presidente': 3629, 'república': 3963, 'día': 1415, 'hoy': 2049, 'victoria': 4851, 'sillón': 4238, 'presidencial': 3626, 'segunda': 4149, 'vez': 4843, 'gobierno': 1931, 'militar': 2669, 'total': 4658, 'votos': 4912, 'candidato': 605, 'acción': 24, 'popular': 3554, 'venció': 4820, 'cercanos': 686, 'contendores': 959, 'campo': 599, 'apra': 263, 'luis': 2483, 'ppc': 3584, 'necesario': 2814, 'ir': 2285, 'vuelta': 4914, 'nueva': 2882, 'carta': 633, 'posible': 3567, 'ser': 4178, 'elegido': 1476, 'mayoría': 2594, 'simple': 4245, 'declaraciones': 1165, 'comercio': 803, 'afirmó': 103, 'duda': 1401, 'superaría': 4403, 'ampliamente': 202, 'votación': 4910, 'pues': 3763, 'resultados': 4003, 'centro': 680, 'computación': 850, 'instalado': 2208, 'partido': 3223, 'mismo': 2699, 'mandatario': 2528

In [157]:
print(Train_X_Tfidf)

  (0, 4991)	0.04928420438357507
  (0, 4969)	0.07683691576012375
  (0, 4901)	0.07683691576012375
  (0, 4852)	0.04777263960929639
  (0, 4799)	0.056021313261774584
  (0, 4792)	0.08161697058027045
  (0, 4755)	0.24485091174081136
  (0, 4754)	0.307347663040495
  (0, 4702)	0.07683691576012375
  (0, 4701)	0.08161697058027045
  (0, 4678)	0.08161697058027045
  (0, 4637)	0.052795264503228724
  (0, 4601)	0.08835407945846996
  (0, 4583)	0.06753847696012079
  (0, 4445)	0.07683691576012375
  (0, 4366)	0.08835407945846996
  (0, 4310)	0.06161205349386307
  (0, 4211)	0.05184553430537852
  (0, 4210)	0.035205710763425414
  (0, 4178)	0.03269289402923193
  (0, 4146)	0.03421834577043071
  (0, 4130)	0.35341631783387983
  (0, 3972)	0.07312921719220929
  (0, 3940)	0.07312921719220929
  (0, 3793)	0.08161697058027045
  :	:
  (103, 2664)	0.11353509293555106
  (103, 2591)	0.07702032789137497
  (103, 2495)	0.1858967318009866
  (103, 2461)	0.1858967318009866
  (103, 2395)	0.08747774262221746
  (103, 2385)	0.179104375

In [158]:
# fit the training dataset on the NB classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)

Naive Bayes Accuracy Score ->  86.66666666666667


In [161]:
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)

SVM Accuracy Score ->  91.11111111111111


## Restrict the vocabulary to nouns and verbs and check the performance


In [163]:
print(Corpus2.head())

                                                text     label
0  <p>Tras 17 a&ntilde;os sin comicios electorale...  Politica
1  <p>El presidente Pedro Pablo Kuczynski (<stron...  Politica
2  <p>La presidenta de la Comisi&oacute;n de Defe...  Politica
3  <p>A m&aacute;s tardar el 19 de mayo pr&oacute...  Politica
4  <p>La Procuradur&iacute;a P&uacute;blica Ad Ho...  Politica


In [0]:
# Step - a (1): Remove blank rows, if any, and renumber the index so that
# the Corpus2.loc[index, ...] writes below stay aligned with the rows.
Corpus2 = Corpus2.dropna(subset=['text']).reset_index(drop=True)

# Step - b (2): Change all the text to lower case and strip the HTML markup
Corpus2['text'] = [remove_tags(entry.lower()) for entry in Corpus2['text']]

# Step - c (3): Tokenization: each entry in the corpus is broken into a list of words
Corpus2['text'] = [word_tokenize(entry) for entry in Corpus2['text']]

# Step - d (4): Remove stop words and non-alphabetic tokens, then lemmatize.
# Only the noun and verb tags are kept in the POS map; every other tag falls back to noun
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['V'] = wn.VERB

for index,entry in enumerate(Corpus2['text']):
    # Declaring Empty List to store the words that follow the rules for this step
    Final_words = []
    # Initializing WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(entry):
        # Below condition is to check for Stop words and consider only alphabets
        if word not in stopwords.words('spanish') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    # The final processed set of words for each iteration will be stored in 'text_final'
    Corpus2.loc[index,'text_final'] = str(Final_words)

In [0]:
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus2['text_final'],Corpus2['label'],test_size=0.3)

In [0]:
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
# Reuse the mapping learned from the training labels instead of re-fitting on the test labels
Test_Y = Encoder.transform(Test_Y)

In [0]:
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus2['text_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

In [172]:
print(Tfidf_vect.vocabulary_)

{'tras': 4456, 'años': 417, 'comicios': 803, 'electorales': 1466, 'fernando': 1772, 'belaunde': 467, 'terry': 4314, 'regreso': 3794, 'régimen': 3957, 'democrático': 1200, 'convertirse': 990, 'virtual': 4789, 'presidente': 3508, 'república': 3842, 'día': 1411, 'hoy': 2045, 'victoria': 4741, 'sillón': 4097, 'presidencial': 3505, 'segunda': 4027, 'vez': 4722, 'gobierno': 1927, 'militar': 2665, 'total': 4387, 'votos': 4829, 'candidato': 602, 'acción': 23, 'popular': 3433, 'venció': 4668, 'cercanos': 683, 'contendores': 956, 'villanueva': 4772, 'campo': 596, 'apra': 260, 'luis': 2479, 'ppc': 3463, 'necesario': 2810, 'ir': 2281, 'vuelta': 4837, 'nueva': 2877, 'carta': 630, 'posible': 3446, 'ser': 4056, 'elegido': 1472, 'mayoría': 2590, 'simple': 4103, 'declaraciones': 1162, 'comercio': 800, 'afirmó': 101, 'duda': 1397, 'superaría': 4210, 'ampliamente': 199, 'votación': 4827, 'pues': 3642, 'resultados': 3881, 'centro': 677, 'computación': 847, 'instalado': 2204, 'partido': 3179, 'mismo': 2695

In [173]:
# fit the training dataset on the NB classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)

Naive Bayes Accuracy Score ->  86.66666666666667


In [174]:
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)

SVM Accuracy Score ->  84.44444444444444


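As can be seen, the results do not improve: Naive Bayes stays at 86.67% while SVM drops from 91.11% to 84.44%. The noun-and-verb filter does not work very well here. Note that the code above only narrows the POS map passed to the lemmatizer. It does not discard tokens with other tags, and `pos_tag` uses an English tagger by default, so its tags are likely unreliable on Spanish text. Because each run draws a new random train/test split, part of the difference may also come from the split itself.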
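A filter that actually drops everything except nouns and verbs could look like the sketch below. It works on `(word, tag)` pairs like those returned by `pos_tag`. The function name and the tagged sentence are hypothetical, and a tagger trained on Spanish would be needed for the Spanish corpus.

```python
def keep_nouns_and_verbs(tagged_tokens):
    """Keep tokens whose Penn Treebank tag starts with N (noun) or V (verb)."""
    return [word for word, tag in tagged_tokens if tag[:1] in ("N", "V")]

# Hypothetical tagger output for one sentence
tagged = [
    ("the", "DT"), ("president", "NN"), ("quickly", "RB"),
    ("signed", "VBD"), ("new", "JJ"), ("laws", "NNS"),
]

print(keep_nouns_and_verbs(tagged))  # ['president', 'signed', 'laws']
```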

## Allow new documents to be supplied (uploaded as a file)

In [190]:
from google.colab import files
import io
uploaded = files.upload()
fileUp = list(uploaded.keys()) 
Corpus2 = pd.read_csv(fileUp[0],encoding='latin-1')
Corpus2.head()

# Step - a (1): Remove blank rows, if any, and renumber the index so that
# the Corpus2.loc[index, ...] writes below stay aligned with the rows.
Corpus2 = Corpus2.dropna(subset=['text']).reset_index(drop=True)

# Step - b (2): Change all the text to lower case and strip the HTML markup
Corpus2['text'] = [remove_tags(entry.lower()) for entry in Corpus2['text']]

# Step - c (3): Tokenization: each entry in the corpus is broken into a list of words
Corpus2['text'] = [word_tokenize(entry) for entry in Corpus2['text']]

# Step - d (4): Remove stop words and non-alphabetic tokens, then lemmatize.
# Only the noun and verb tags are kept in the POS map; every other tag falls back to noun
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['V'] = wn.VERB

for index,entry in enumerate(Corpus2['text']):
    # Declaring Empty List to store the words that follow the rules for this step
    Final_words = []
    # Initializing WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(entry):
        # Below condition is to check for Stop words and consider only alphabets
        if word not in stopwords.words('spanish') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    # The final processed set of words for each iteration will be stored in 'text_final'
    Corpus2.loc[index,'text_final'] = str(Final_words)

Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus2['text_final'],Corpus2['label'],test_size=0.3)
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
# Reuse the mapping learned from the training labels instead of re-fitting on the test labels
Test_Y = Encoder.transform(Test_Y)
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus2['text_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
print(Tfidf_vect.vocabulary_)
# fit the training dataset on the NB classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)

Saving elcomercio20191128.csv to elcomercio20191128 (1).csv
{'tras': 4456, 'años': 417, 'comicios': 803, 'electorales': 1466, 'fernando': 1772, 'belaunde': 467, 'terry': 4314, 'regreso': 3794, 'régimen': 3957, 'democrático': 1200, 'convertirse': 990, 'virtual': 4789, 'presidente': 3508, 'república': 3842, 'día': 1411, 'hoy': 2045, 'victoria': 4741, 'sillón': 4097, 'presidencial': 3505, 'segunda': 4027, 'vez': 4722, 'gobierno': 1927, 'militar': 2665, 'total': 4387, 'votos': 4829, 'candidato': 602, 'acción': 23, 'popular': 3433, 'venció': 4668, 'cercanos': 683, 'contendores': 956, 'villanueva': 4772, 'campo': 596, 'apra': 260, 'luis': 2479, 'ppc': 3463, 'necesario': 2810, 'ir': 2281, 'vuelta': 4837, 'nueva': 2877, 'carta': 630, 'posible': 3446, 'ser': 4056, 'elegido': 1472, 'mayoría': 2590, 'simple': 4103, 'declaraciones': 1162, 'comercio': 800, 'afirmó': 101, 'duda': 1397, 'superaría': 4210, 'ampliamente': 199, 'votación': 4827, 'pues': 3642, 'resultados': 3881, 'centro': 677, 'computac
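The cell above retrains both models on the uploaded file. If the goal is instead to label new, unlabeled documents with a model that is already trained, the fitted vectorizer and encoder can be reused as in the sketch below. The tiny training set and the "Deportes" label are hypothetical; the corpus preview only shows "Politica".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical, already pre-processed training documents
train_texts = [
    "presidente congreso ley", "gobierno ministro congreso",
    "gol partido estadio", "equipo gol entrenador",
]
train_labels = ["Politica", "Politica", "Deportes", "Deportes"]

encoder = LabelEncoder()
vectorizer = TfidfVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_texts),
               encoder.fit_transform(train_labels))

# New documents are only transformed, never used for fitting
new_documents = ["el congreso aprobó la ley", "gol del equipo en el estadio"]
encoded = classifier.predict(vectorizer.transform(new_documents))

# Map the numeric predictions back to the original label names
predicted = encoder.inverse_transform(encoded)
print(list(predicted))  # ['Politica', 'Deportes']
```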