## **Activity 2: Text classification exercise**
## **Álvaro Payo**

## **1. Classification without removing stopwords, with max_features = 20**

### **1.1. Importing libraries and data**

In [None]:
#Library imports
import spacy
import nltk
import pandas as pd 
import numpy as np 

import string
punctuation = set(string.punctuation)

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

nlp_español = spacy.load('es_core_news_lg')  

In [None]:
#Read the file tweets.txt (a multi-character separator requires the Python parsing engine)
tweets = pd.read_csv("tweets.txt", header = None, encoding = 'UTF-8', sep = '::::', engine = 'python')
tweets.columns = ['Texto', 'Etiqueta']
tweets.head() 


| | Texto | Etiqueta |
|---|---|---|
| 0 | Salgo de #VeoTV , que día más largoooooo... | None |
| 1 | @PauladeLasHeras No te libraras de ayudar me/n... | neutro |
| 2 | @marodriguezb Gracias MAR | positivo |
| 3 | Off pensando en el regalito Sinde, la que se v... | negativo |
| 4 | Conozco a alguien q es adicto al drama! Ja ja ... | positivo |


### **1.2. Pre-processing**

In [None]:
#Two labels (one positivo and one negativo) were read with a leading ':'.
tweets['Etiqueta'].value_counts()

positivo     2883
negativo     2182
None         1482
neutro        670
:negativo       1
:positivo       1
Name: Etiqueta, dtype: int64

In [None]:
#To make all labels consistent, replace the ones read with a leading ':' by their proper name.
tweets['Etiqueta'] = tweets['Etiqueta'].replace({":positivo":"positivo", ":negativo":"negativo"})

In [None]:
#Check that only four labels remain: None, neutro, positivo and negativo.
tweets['Etiqueta'].unique()

array(['None', 'neutro', 'positivo', 'negativo'], dtype=object)

In [None]:
tweets_count = pd.DataFrame(tweets.groupby(['Etiqueta'])['Etiqueta'].count().rename('Count'))
tweets_count['Porcentaje'] = tweets_count['Count']/tweets_count['Count'].sum()
tweets_count = tweets_count.sort_values(by = "Count", ascending=False).reset_index()
tweets_count 

| | Etiqueta | Count | Porcentaje |
|---|---|---|---|
| 0 | positivo | 2884 | 0.399501 |
| 1 | negativo | 2183 | 0.302396 |
| 2 | None | 1482 | 0.205292 |
| 3 | neutro | 670 | 0.092811 |


After fixing the label names with replace(), I build a DataFrame with the number of tweets per label and the share each label represents of the total.

About 70% of the tweets (39.95% + 30.24%) are labelled positivo or negativo.
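
The same count-and-share table can also be obtained with value_counts. The following is a minimal sketch on a made-up label series (the `labels` data is hypothetical, not the real tweets), meant only to illustrate the idea:

```python
import pandas as pd

# Toy labels standing in for tweets['Etiqueta'] (hypothetical data).
labels = pd.Series(["positivo", "negativo", "positivo", "None", "neutro", "positivo"])

counts = labels.value_counts()                # absolute frequency per label
shares = labels.value_counts(normalize=True)  # relative frequency per label

# Combine both into one table, mirroring the column names used above.
summary = pd.DataFrame({"Count": counts, "Porcentaje": shares}).reset_index()
summary.columns = ["Etiqueta", "Count", "Porcentaje"]
print(summary)
```

With normalize=True the shares add up to 1, so no manual division by the total is needed.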

In [None]:
#Tokenize and strip punctuation
def tokenize(sentence):
    tokens = []
    for token in sentence.split():
        new_token = []
        for character in token:
            if character not in punctuation:
                new_token.append(character.lower())
        if new_token:
            tokens.append("".join(new_token))
    return tokens 

In [None]:
print(type(tweets.head()["Texto"]))
tweets.head()["Texto"].apply(tokenize)

<class 'pandas.core.series.Series'>


0        [salgo, de, veotv, que, día, más, largoooooo]
1    [pauladelasheras, no, te, libraras, de, ayudar...
2                         [marodriguezb, gracias, mar]
3    [off, pensando, en, el, regalito, sinde, la, q...
4    [conozco, a, alguien, q, es, adicto, al, drama...
Name: Texto, dtype: object

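As the preview shows, the text is now "normalized": everything is lowercase and punctuation such as exclamation marks, the # of hashtags and the @ of mentions is gone. This is only a preview of the first five rows; the result is not stored, and the TfidfVectorizer used below applies its own tokenization to the raw Texto column.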
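
If the custom tokenization were meant to drive the features as well, one option would be to hand the function to the vectorizer. This is only a sketch under that assumption (the notebook itself does not do this); the function is repeated here in a compact form so the block runs on its own:

```python
import string
from sklearn.feature_extraction.text import TfidfVectorizer

punctuation = set(string.punctuation)

def tokenize(sentence):
    """Split on whitespace, drop ASCII punctuation and lowercase each token."""
    tokens = []
    for token in sentence.split():
        cleaned = "".join(ch.lower() for ch in token if ch not in punctuation)
        if cleaned:
            tokens.append(cleaned)
    return tokens

# token_pattern=None tells scikit-learn that the default regex is intentionally unused.
vec = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
X = vec.fit_transform(["Salgo de #VeoTV , que día más largoooooo..."])
print(sorted(vec.vocabulary_))
```

Unlike the default token pattern, this keeps whatever the function returns, including one-letter tokens such as "q".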

In [None]:
docs = tweets.iloc[:,0] 
categs = tweets.iloc[:,-1] 

I separate the documents from their categories (docs and categs are pandas Series).

The labels have to be separated from the texts so that the Tf-idf matrix is built from the text only, and so that both can be split into train and test sets and passed to the classification algorithms.

In [None]:
print("Datos es tipo: ", type(tweets))
print("Docs es tipo: ", type(docs))
print("Categs es tipo: ", type(categs))

Datos es tipo:  <class 'pandas.core.frame.DataFrame'>
Docs es tipo:  <class 'pandas.core.series.Series'>
Categs es tipo:  <class 'pandas.core.series.Series'>


### **1.3. Tf-idf (Term Frequency - Inverse Document Frequency)**

In [None]:
tfifd_vec_mf = TfidfVectorizer(max_features = 20)
TFIDF_mf = tfifd_vec_mf.fit_transform(docs)

In [None]:
#Visualizing the Tf-idf matrix

#Get the vocabulary to label the columns (get_feature_names() was removed in scikit-learn 1.2).
vocab_tfidf_mf = tfifd_vec_mf.get_feature_names_out()

#Build a DataFrame to show the result: the Tf-idf weight of each term in each document.
pd.DataFrame(TFIDF_mf.toarray(), columns = vocab_tfidf_mf)



Unnamed: 0,co,con,de,del,el,en,es,http,la,las,lo,los,no,para,por,que,rt,se,un,una
0,0.000000,0.000000,0.576433,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.817144,0.000000,0.000000,0.000000,0.0
1,0.000000,0.000000,0.508857,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.860851,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
2,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
3,0.000000,0.000000,0.175414,0.0,0.206463,0.213866,0.0,0.000000,0.407815,0.0,0.358263,0.0,0.296754,0.000000,0.000000,0.248665,0.000000,0.660531,0.000000,0.0
4,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,1.0,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7214,0.410222,0.000000,0.000000,0.0,0.000000,0.413801,0.0,0.411646,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.700738,0.000000,0.000000,0.0
7215,0.432670,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.434172,0.416123,0.0,0.000000,0.0,0.000000,0.000000,0.671664,0.000000,0.000000,0.000000,0.000000,0.0
7216,0.299191,0.431245,0.495079,0.0,0.000000,0.000000,0.0,0.300230,0.000000,0.0,0.000000,0.0,0.000000,0.444568,0.000000,0.000000,0.000000,0.000000,0.437755,0.0
7217,0.423247,0.000000,0.350179,0.0,0.412162,0.426940,0.0,0.424716,0.407060,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0


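Each row represents a document and each column a term of the vocabulary. The 20 columns are the terms with the highest total frequency across the corpus, which is how max_features selects them. Since stopwords have not been removed, most of them are function words (de, la, que...), plus tokens that most likely come from links and retweets (http, co, rt).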
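
To illustrate how max_features trims the vocabulary, here is a small sketch on a hypothetical three-sentence corpus (not the tweets of this notebook), keeping only the two most frequent terms:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical). Corpus-wide counts: el=3, come=3, gato=2, others=1.
corpus = [
    "el gato come pescado",
    "el perro come carne",
    "el gato come",
]

# Keep only the 2 terms with the highest total count across the corpus.
vec = TfidfVectorizer(max_features=2)
X = vec.fit_transform(corpus)

print(sorted(vec.vocabulary_))  # the two retained terms
print(X.shape)                  # one row per document, one column per retained term
```

Every other term is discarded before the weights are computed, so a document made up only of rare words ends up as a row of zeros, as happens with row 2 in the table above.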

### **1.4. Train/test split**

In [None]:
docs_train, docs_test, categs_train, categs_test = train_test_split(TFIDF_mf, categs, test_size = 0.25, 
                                                                    random_state = 50)

### **1.5. Applying classification algorithms**

**1.5.1. Naive Bayes classifier**

In [None]:
#Train the NB classifier
clf = MultinomialNB()
clf.fit(docs_train, categs_train) 

In [None]:
#Predict on the test set
categs_pred_mf = clf.predict(docs_test)

In [None]:
#Confusion Matrix
cm = confusion_matrix(categs_test, categs_pred_mf)
cm 

array([[ 78,  29,   0, 242],
       [ 14, 155,   0, 375],
       [  4,  62,   0, 102],
       [ 51, 116,   0, 577]])

In [None]:
#Metrics
acc_train_mf = clf.score(docs_train, categs_train)
acc_test_mf = clf.score(docs_test, categs_test)

print("Accuracy train: ", acc_train_mf)
print("Accuracy test: ", acc_test_mf)
print("Fiabilidad: ", acc_test_mf / acc_train_mf)  

Accuracy train:  0.4619504987070558
Accuracy test:  0.4487534626038781
Fiabilidad:  0.9714319258446206
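
The test accuracy can be read directly off the confusion matrix, and the same matrix gives the recall of each class. The sketch below re-enters the Naive Bayes matrix printed above. It assumes the usual scikit-learn behaviour of sorting string labels, which for these four labels gives None, negativo, neutro, positivo (uppercase sorts before lowercase):

```python
import numpy as np

# Naive Bayes confusion matrix from above (rows: true class, columns: predicted class).
# Assumed label order: 'None', 'negativo', 'neutro', 'positivo'.
cm = np.array([
    [78, 29, 0, 242],
    [14, 155, 0, 375],
    [4, 62, 0, 102],
    [51, 116, 0, 577],
])

accuracy = np.trace(cm) / cm.sum()               # correct predictions / all predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)  # share of each true class that is recovered

print(round(accuracy, 4))
print(recall_per_class.round(3))
```

The third column is all zeros: under that label order, the classifier never predicts neutro, so its recall is 0. The "Fiabilidad" figure printed above is simply test accuracy divided by train accuracy.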


**1.5.2. SVM (Support Vector Machine)**

In [None]:
#Linear SVM with scikit-learn's default hyperparameters
#(C = 1.0, loss = 'squared_hinge', penalty = 'l2', multi_class = 'ovr', max_iter = 1000)
classifier = LinearSVC()
classifier.fit(docs_train, categs_train)

In [None]:
#Confusion Matrix
categs_pred_mf2 = classifier.predict(docs_test)
cm_mf = confusion_matrix(categs_test, categs_pred_mf2)
cm_mf

array([[105,  54,   0, 190],
       [ 39, 242,   0, 263],
       [  9,  83,   0,  76],
       [ 87, 185,   0, 472]])

In [None]:
#Accuracy
accuracy_mf = accuracy_score(categs_test, categs_pred_mf2)
print(f"Accuracy: {accuracy_mf:.4%}")

Accuracy: 45.3740%


## **2. Classification removing stopwords, with max_features = 20**

### **2.1. Creating the new DataFrame**

In [None]:
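nltk.download('stopwords', quiet = True)  #fetch the corpus if it is not installed yet
stopwords = set(nltk.corpus.stopwords.words('spanish'))
#Drop stopwords (case-sensitive comparison on whitespace-separated words)
tweets['Text_stopwords'] = tweets['Texto'].apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords]))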

In [None]:
tweets_stopwords = pd.DataFrame ()

tweets_stopwords['Texto'] = tweets['Text_stopwords']
tweets_stopwords['Etiqueta'] = tweets['Etiqueta']

tweets_stopwords.head() 

| | Texto | Etiqueta |
|---|---|---|
| 0 | Salgo #VeoTV , día largoooooo... | None |
| 1 | @PauladeLasHeras No libraras ayudar me/nos. Be... | neutro |
| 2 | @marodriguezb Gracias MAR | positivo |
| 3 | Off pensando regalito Sinde, va SGAE van corru... | negativo |
| 4 | Conozco alguien q adicto drama! Ja ja ja suena... | positivo |


After running all the steps on the original DataFrame, I remove the stopwords and repeat the same steps to see whether the models differ.

A priori, classification on the DataFrame without stopwords should be more accurate, since the empty words (words that carry little meaning on their own) have been removed from the original text.
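
One detail worth illustrating: the filter in section 2.1 compares each word exactly as written against the stopword list, so capitalized forms such as "No" in row 1 of the table above survive. That is a plausible reason why el, en, la, no and que still show up in the vocabulary of section 2.3 once the vectorizer lowercases the text. The sketch below uses a tiny hand-picked stopword set as a hypothetical stand-in for the NLTK list (assumed to be all lowercase) and compares the two behaviours:

```python
# Tiny hand-picked subset of Spanish stopwords (hypothetical stand-in for the NLTK list).
stopwords = {"el", "la", "no", "de", "que", "en"}

text = "El gobierno No baja la luz"

# Case-sensitive filter, as in section 2.1: "El" and "No" are kept.
case_sensitive = " ".join(w for w in text.split() if w not in stopwords)

# Case-insensitive variant: compare the lowercased word instead.
case_insensitive = " ".join(w for w in text.split() if w.lower() not in stopwords)

print(case_sensitive)    # El gobierno No baja luz
print(case_insensitive)  # gobierno baja luz
```

Switching to the case-insensitive comparison would change the results reported below, so it is shown here only as a possible refinement.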

### **2.2. Pre-processing**

In [None]:
print(type(tweets_stopwords.head()["Texto"]))
tweets_stopwords.head()["Texto"].apply(tokenize)

<class 'pandas.core.series.Series'>


0                      [salgo, veotv, día, largoooooo]
1    [pauladelasheras, no, libraras, ayudar, menos,...
2                         [marodriguezb, gracias, mar]
3    [off, pensando, regalito, sinde, va, sgae, van...
4    [conozco, alguien, q, adicto, drama, ja, ja, j...
Name: Texto, dtype: object

In [None]:
docs_stopwords = tweets_stopwords.iloc[:,0] 
categs_stopwords = tweets_stopwords.iloc[:,-1] 

In [None]:
print("Datos es tipo: ", type(tweets_stopwords))
print("Docs es tipo: ", type(docs_stopwords))
print("Categs es tipo: ", type(categs_stopwords))

Datos es tipo:  <class 'pandas.core.frame.DataFrame'>
Docs es tipo:  <class 'pandas.core.series.Series'>
Categs es tipo:  <class 'pandas.core.series.Series'>


### **2.3. Tf-idf (Term Frequency - Inverse Document Frequency)**

In [None]:
TFIDF_stopwords = tfifd_vec_mf.fit_transform(docs_stopwords)

In [None]:
#Visualizing the Tf-idf matrix

#Get the vocabulary to label the columns (get_feature_names() was removed in scikit-learn 1.2).
vocab_tfidf_stopwords = tfifd_vec_mf.get_feature_names_out()

#Build a DataFrame to show the result: the Tf-idf weight of each term in each document.
pd.DataFrame(TFIDF_stopwords.toarray(), columns = vocab_tfidf_stopwords)



Unnamed: 0,ahora,co,día,el,en,es,gobierno,gracias,hoy,http,la,mañana,no,pp,psoe,que,rajoy,rt,si,un
0,0.0,0.000000,1.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0000,0.0
1,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.710651,0.000000,0.000000,0.0,0.0,0.703545,0.0,0.0,0.0,0.000000,0.000000,0.0000,0.0
2,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,1.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0000,0.0
3,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0000,0.0
4,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7214,0.0,0.244933,0.0,0.0,0.0,0.0,0.0,0.499408,0.464880,0.245783,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.418393,0.4889,0.0
7215,0.0,0.705881,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.708331,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0000,0.0
7216,0.0,0.705881,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.708331,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0000,0.0
7217,0.0,0.323152,0.0,0.0,0.0,0.0,0.0,0.000000,0.613338,0.324273,0.0,0.0,0.000000,0.0,0.0,0.0,0.643612,0.000000,0.0000,0.0


### **2.4. Train/test split**

In [None]:
docs_train_stopwords, docs_test_stopwords, categs_train_stopwords, categs_test_stopwords = train_test_split(TFIDF_stopwords, categs_stopwords, test_size = 0.25, 
                                                                    random_state = 50)

### **2.5. Applying classification algorithms**

**2.5.1. Naive Bayes classifier**

In [None]:
#Train the NB classifier
clf = MultinomialNB()
clf.fit(docs_train_stopwords, categs_train_stopwords) 

In [None]:
#Predict on the test set
categs_pred_stopwords = clf.predict(docs_test_stopwords)

In [None]:
#Confusion Matrix
cm_stopwords = confusion_matrix(categs_test_stopwords, categs_pred_stopwords)
cm_stopwords 

array([[  4,  88,   0, 257],
       [  6, 228,   0, 310],
       [  0,  71,   0,  97],
       [  3, 175,   0, 566]])

In [None]:
#Metrics
acc_train_stopwords = clf.score(docs_train_stopwords, categs_train_stopwords)
acc_test_stopwords = clf.score(docs_test_stopwords, categs_test_stopwords)

print("Accuracy train: ", acc_train_stopwords)
print("Accuracy test: ", acc_test_stopwords)
print("Fiabilidad: ", acc_test_stopwords / acc_train_stopwords) 

Accuracy train:  0.444588104913188
Accuracy test:  0.4421052631578947
Fiabilidad:  0.9944154111910437


**2.5.2. SVM (Support Vector Machine)**

In [None]:
#Linear SVM with scikit-learn's default hyperparameters
#(C = 1.0, loss = 'squared_hinge', penalty = 'l2', multi_class = 'ovr', max_iter = 1000)
classifier = LinearSVC()
classifier.fit(docs_train_stopwords, categs_train_stopwords)

In [None]:
#Confusion Matrix
categs_pred_stopwords2 = classifier.predict(docs_test_stopwords)
cm_stopwords2 = confusion_matrix(categs_test_stopwords, categs_pred_stopwords2)
cm_stopwords2

array([[  5,  88,   0, 256],
       [  6, 230,   0, 308],
       [  1,  70,   0,  97],
       [  4, 179,   0, 561]])

In [None]:
#Accuracy (accuracy_score is already imported in section 1.1)
accuracy = accuracy_score(categs_test_stopwords, categs_pred_stopwords2)
print(f"Accuracy: {accuracy:.4%}")

Accuracy: 44.0997%


## **3. Classification without removing stopwords and without max_features**
To have another model to compare against, I add this one, which follows the same steps as the first classification but without setting max_features.

I start from the Tf-idf matrix, since everything before it is already loaded and that is where max_features changes.

### **3.1. Tf-idf (Term Frequency - Inverse Document Frequency)**

In [None]:
#Tf-idf
tfifd_vec_nmf = TfidfVectorizer(max_features = None)
TFIDF_nmf = tfifd_vec_nmf.fit_transform(docs)

In [None]:
#Visualizing the Tf-idf matrix

#Get the vocabulary to label the columns (get_feature_names() was removed in scikit-learn 1.2).
vocab_tfidf_nmf = tfifd_vec_nmf.get_feature_names_out()

#Build a DataFrame to show the result: the Tf-idf weight of each term in each document.
pd.DataFrame(TFIDF_nmf.toarray(), columns = vocab_tfidf_nmf)



Unnamed: 0,00,000,000m,000m2,000mill,001,00h,00habemus,00w5eh53,01,...,única,únicas,único,únicos,útil,útiles,お元気ですか,心から応援しています,日本の友人たちに思いを馳せずにはいられません,日本の皆様
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7214,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7215,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7216,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
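
With no cap on the vocabulary the matrix becomes very wide and, as the preview above suggests, almost entirely zeros. A quick way to check this without converting to a dense array is to inspect the sparse matrix itself. This is a sketch on a hypothetical toy corpus; the same attributes exist on TFIDF_nmf:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical), not the tweets of this notebook.
corpus = ["hoy llueve mucho", "mañana no llueve", "gracias por venir hoy"]

# max_features=None keeps every term found in the corpus.
X = TfidfVectorizer(max_features=None).fit_transform(corpus)

n_docs, n_terms = X.shape
density = X.nnz / (n_docs * n_terms)  # fraction of cells that are non-zero
print(n_docs, n_terms, round(density, 3))
```

On a real corpus the density is usually far lower than in this toy case, which is why scikit-learn returns a sparse matrix and why toarray() is best reserved for small previews.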


### **3.2. Train/test split**

In [None]:
docs_train_nmf, docs_test_nmf, categs_train_nmf, categs_test_nmf = train_test_split(TFIDF_nmf, categs, test_size = 0.25, 
                                                                    random_state = 50)

### **3.3. Applying classification algorithms**

**3.3.1. Naive Bayes classifier**

In [None]:
#Train the NB classifier
clf = MultinomialNB()
clf.fit(docs_train_nmf, categs_train_nmf) 

In [None]:
#Predict on the test set
categs_pred_nmf = clf.predict(docs_test_nmf)

In [None]:
#Confusion Matrix
cm_nmf = confusion_matrix(categs_test_nmf, categs_pred_nmf)
cm_nmf

array([[ 29,  48,   0, 272],
       [  0, 349,   0, 195],
       [  2,  67,   0,  99],
       [  3,  93,   0, 648]])

In [None]:
#Metrics
acc_train_nmf = clf.score(docs_train_nmf, categs_train_nmf)
acc_test_nmf = clf.score(docs_test_nmf, categs_test_nmf)

print("Accuracy train: ", acc_train_nmf)
print("Accuracy test: ", acc_test_nmf)
print("Fiabilidad: ", acc_test_nmf / acc_train_nmf) 

Accuracy train:  0.7367934983376432
Accuracy test:  0.5684210526315789
Fiabilidad:  0.7714794632608093


**3.3.2. SVM (Support Vector Machine)**

In [None]:
#Linear SVM with scikit-learn's default hyperparameters
#(C = 1.0, loss = 'squared_hinge', penalty = 'l2', multi_class = 'ovr', max_iter = 1000)
classifier = LinearSVC()
classifier.fit(docs_train_nmf, categs_train_nmf)

In [None]:
#Confusion Matrix
categs_pred_nmf2 = classifier.predict(docs_test_nmf)
cm_nmf2 = confusion_matrix(categs_test_nmf, categs_pred_nmf2)
cm_nmf2

array([[168,  62,   8, 111],
       [ 53, 364,  19, 108],
       [ 18,  76,   8,  66],
       [ 67, 142,   9, 526]])

In [None]:
#Accuracy
accuracy_nmf = accuracy_score(categs_test_nmf, categs_pred_nmf2)
print(f"Accuracy: {accuracy_nmf:.4%}")

Accuracy: 59.0582%


## **4. Discussion of the results**
Having run the classification with max_features = 20, with stopwords removed, and without max_features, I compare the metrics obtained for each model and classifier.

On the one hand, removing stopwords has little effect, and the small change is not an improvement: test accuracy drops slightly for both classifiers (Naive Bayes from 44.88% to 44.21%, SVM from 45.37% to 44.10%), while the Fiabilidad value (test accuracy divided by train accuracy) rises from 0.971 to 0.994. This comparison is based on sections 1 and 2 of this notebook.

On the other hand, setting max_features or not does matter. Without it, test accuracy is clearly better (Naive Bayes 56.84%, SVM 59.06%), but Fiabilidad is worse: it falls from 0.971 to 0.771, because train accuracy (73.68%) grows much more than test accuracy. In other words, the model fits the training data more closely than it generalizes. This comparison is based on sections 1 and 3 of this notebook.
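
The comparison can be made explicit by putting the reported Naive Bayes accuracies side by side. The sketch below simply re-enters the train and test values printed in sections 1.5.1, 2.5.1 and 3.3.1, rounded to four decimals; the dictionary keys are descriptive names chosen here for illustration:

```python
# (train accuracy, test accuracy) of MultinomialNB as printed in the notebook,
# rounded to four decimals.
reported = {
    "1: stopwords kept, max_features=20": (0.4620, 0.4488),
    "2: stopwords removed, max_features=20": (0.4446, 0.4421),
    "3: stopwords kept, no max_features": (0.7368, 0.5684),
}

summary = {}
for name, (train_acc, test_acc) in reported.items():
    summary[name] = {
        "gap": train_acc - test_acc,    # how much accuracy is lost on unseen data
        "ratio": test_acc / train_acc,  # the "Fiabilidad" value
    }
    print(f"{name}: gap={summary[name]['gap']:.4f}, ratio={summary[name]['ratio']:.3f}")
```

The third configuration has the best test accuracy and also the largest train/test gap, which is the trade-off described above.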