# Text classification with Logistic Regression and Random Forest (sklearn)

- This tutorial is inspired by: https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/

In this tutorial we will try to classify Amazon product reviews (Alexa) into two categories, positive or negative (sentiment analysis).

We will use a simple approach:
- we will represent each text as a dense vector (the document embedding returned by spaCy's `doc.vector`)
- we will use Machine Learning algorithms to learn models from those vector representations
- we will evaluate the models using a confusion matrix

In [8]:
#NLP
import spacy
nlp = spacy.load("en_core_web_sm")
print(spacy.__version__)
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
import string

#SKLEARN
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.linear_model import LogisticRegression # Logistic Regression

#PANDAS
import pandas as pd

3.6.1


# 2. Dataset: Alexa

We will use a real dataset: a set of reviews of Amazon Alexa products. The dataset comes as a tab-separated file (.tsv). It has five columns:
- __rating__: the star rating (out of 5) that each user gave to Alexa.
- __date__: the date of the review.
- __variation__: the Alexa product model that the user reviewed.
- __verified_reviews__: the text of the review.
- __feedback__: a label, 0 or 1, indicating whether the overall sentiment is negative (0) or positive (1).
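
To make the file layout concrete, here is a minimal sketch of reading tab-separated text with pandas. It uses an in-memory string instead of the real file; the first row is copied from the dataset preview, while the second row is made up purely for illustration.

```python
import io
import pandas as pd

# Two rows in the same tab-separated layout as the dataset.
# The second row is hypothetical and only serves to show a negative label.
tsv_text = (
    "rating\tdate\tvariation\tverified_reviews\tfeedback\n"
    "5\t31-Jul-18\tCharcoal Fabric\tLove my Echo!\t1\n"
    "1\t30-Jul-18\tBlack Dot\tStopped working after a week.\t0\n"
)

# sep="\t" tells pandas that the columns are separated by tabs, not commas.
toy_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(toy_df.columns.tolist())
print(toy_df.dtypes)
```

The same `sep="\t"` argument is what lets `pd.read_csv` parse the actual `.tsv` file.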

In [9]:
# Loading TSV file
df_amazon = pd.read_csv("datasets/amazon_alexa.tsv", sep="\t")

In [10]:
# Display the dataframe (pandas shows the first and last rows)
df_amazon

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1
...,...,...,...,...,...
3136,5,30-Jul-18,Black Dot,"Perfect for kids, adults and everyone in betwe...",1
3137,5,30-Jul-18,Black Dot,"Listening to music, searching locations, check...",1
3138,5,30-Jul-18,Black Dot,"I do love these things, i have them running my...",1
3139,5,30-Jul-18,White Dot,Only complaint I have is that the sound qualit...,1


In [11]:
# shape of dataframe
df_amazon.shape

(3141, 5)

In [12]:
# Feedback Value count
df_amazon.feedback.value_counts()

feedback
1    2887
0     254
Name: count, dtype: int64
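
The counts above show a strong class imbalance. As a rough sketch, the label column can be rebuilt from those two counts to see the class proportions and the accuracy that a trivial "always positive" model would reach; the variable names here are illustrative only.

```python
import pandas as pd

# Rebuild the label column from the printed counts (2887 positive, 254 negative).
feedback = pd.Series([1] * 2887 + [0] * 254, name="feedback")

# normalize=True returns fractions instead of raw counts.
proportions = feedback.value_counts(normalize=True)
print(proportions)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = proportions.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any classifier trained on this data should be judged against that baseline, not against 50%.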

In [13]:
# Drop rows whose 'verified_reviews' value is NaN (dropna does not remove blank strings)
df_amazon.dropna(subset=['verified_reviews'], inplace=True)

# Cast the column to string
df_amazon['verified_reviews'] = df_amazon['verified_reviews'].astype(str)


# 3. Preprocessing and vector representation

In [18]:
# Run the spaCy pipeline on a review and return its document embedding
def preprocess_text(text):
    doc = nlp(text)
    return doc.vector

In [19]:
preprocess_text("Sometimes while playing a game, you can answer a question correctly but Alexa says you got it wrong and answers the same as you.  I like being able to turn lights on and off while away from home.")

array([ 0.00345765, -0.17519678, -0.18819846, -0.3426981 , -0.17147252,
       -0.05003924,  0.32743692,  0.0682259 ,  0.24946326,  0.09846789,
        0.21378204, -0.0832724 , -0.34635288,  0.0586556 ,  0.04942907,
       -0.06561576, -0.12984063, -0.05251141,  0.18348958, -0.11443489,
       -0.1121325 ,  0.3676017 , -0.11177018, -0.44883752, -0.11810745,
       -0.01279029,  0.31800777,  0.08982971, -0.01777774,  0.44952226,
       -0.12545277,  0.09406837,  0.30558962, -0.02307621,  0.2056234 ,
       -0.24215038,  0.09411351, -0.09909975,  0.32156393,  0.10834395,
       -0.11529841,  0.11083379, -0.1061615 , -0.04444008,  0.16055773,
        0.0672716 ,  0.22021754,  0.01888091,  0.31585997, -0.1683473 ,
       -0.17460029, -0.25383267, -0.07693887, -0.26698208, -0.03254918,
       -0.02606801, -0.08824483,  0.08321569,  0.2032924 , -0.07650842,
        0.03502554, -0.15798518, -0.26991764,  0.01951681,  0.12650762,
        0.31005645,  0.28461054, -0.32284218, -0.00412248,  0.18

`doc.vector` is an attribute provided by spaCy that returns a numeric vector representing a text document processed by the pretrained language pipeline. It is a dense, fixed-length representation of the document; the dimensions are learned and carry semantic information about the text, although they are not individually interpretable.

When spaCy processes a text, it assigns a vector to each token using the underlying pipeline. `doc.vector` combines these per-token vectors, by default by averaging them, into a single vector for the whole document. Note that `en_core_web_sm` does not ship with static word vectors, so with this pipeline the document vector is built from context-sensitive token representations rather than from pretrained word embeddings.
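
The idea of combining per-token vectors into one document vector can be sketched with plain NumPy. This is a minimal illustration assuming simple averaging (mean pooling); the 4-dimensional token vectors are hypothetical values, not output from spaCy.

```python
import numpy as np

# Hypothetical 4-dimensional vectors for three tokens
# (vectors produced by a real pipeline are much longer).
token_vectors = np.array([
    [0.2, -0.1,  0.0, 0.5],   # "Love"
    [0.0,  0.3,  0.1, 0.1],   # "my"
    [0.4,  0.1, -0.4, 0.3],   # "Echo"
], dtype=np.float32)

# Mean pooling: average over the token axis to get one vector per document.
doc_vector = token_vectors.mean(axis=0)
print(doc_vector.shape)
print(doc_vector)
```

Because the result has the same length regardless of how many tokens the text contains, every review maps to a feature vector of identical size, which is what a scikit-learn classifier expects.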

In [20]:
df_amazon['embeddings'] = df_amazon['verified_reviews'].apply(preprocess_text)

In [21]:
df_amazon

Unnamed: 0,rating,date,variation,verified_reviews,feedback,embeddings
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1,"[-1.5827643, -0.8735091, -0.29689175, 0.089411..."
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1,"[-0.8273477, -0.0959396, 0.5433444, -0.6608756..."
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1,"[0.0034576526, -0.17519678, -0.18819846, -0.34..."
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1,"[0.062283773, 0.043004736, -0.18952252, -0.223..."
4,5,31-Jul-18,Charcoal Fabric,Music,1,"[-1.4732523, 0.9739316, -0.84583044, 0.7157047..."
...,...,...,...,...,...,...
3136,5,30-Jul-18,Black Dot,"Perfect for kids, adults and everyone in betwe...",1,"[0.24252883, 0.015927391, -0.15306765, -0.1406..."
3137,5,30-Jul-18,Black Dot,"Listening to music, searching locations, check...",1,"[-0.20896895, 0.058918573, -0.063608415, -0.41..."
3138,5,30-Jul-18,Black Dot,"I do love these things, i have them running my...",1,"[-0.24185584, -0.07062987, -0.22071868, 0.1401..."
3139,5,30-Jul-18,White Dot,Only complaint I have is that the sound qualit...,1,"[0.059714638, -0.26009345, -0.07460302, 0.0011..."


# 4. Training the classification model

In [23]:
import numpy as np

# Split the data into training and test sets
X = np.array(df_amazon['embeddings'].tolist())
y = df_amazon['feedback']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
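
With labels as imbalanced as these, one option (not used in this notebook) is to pass `stratify` to `train_test_split` so that both halves keep the same share of negative reviews. The sketch below uses synthetic stand-in data; the sample count, feature count and class ratio are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 96 features, 8% negatives (assumed values).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 96))
y_toy = np.array([0] * 80 + [1] * 920)

# stratify=y_toy preserves the class ratio in both the training and the test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=42, stratify=y_toy
)

print("negatives in train:", int((y_tr == 0).sum()))
print("negatives in test: ", int((y_te == 0).sum()))
```

Without stratification, a random split can leave the test set with noticeably more or fewer negatives than the training set, which makes the per-class metrics less stable.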

In [25]:
from sklearn.ensemble import RandomForestClassifier

# Build and train the classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
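
The title also mentions logistic regression, which this notebook imports but does not train. As a hedged sketch of how such a model could be fitted on the same kind of feature matrix, the example below uses synthetic data from `make_classification` rather than the Alexa embeddings; all parameter values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Synthetic, imbalanced stand-in for the embedding matrix (not the Alexa data).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=20, weights=[0.1, 0.9], random_state=42
)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_syn, y_syn, test_size=0.5, random_state=42, stratify=y_syn
)

# max_iter is raised above the default to give the solver room to converge.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_fit, y_fit)

pred = log_reg.predict(X_eval)
accuracy = metrics.accuracy_score(y_eval, pred)
print("Logistic Regression Accuracy:", accuracy)
```

Since both estimators share the `fit`/`predict` interface, swapping `RandomForestClassifier` for `LogisticRegression` in the notebook would only require changing the line that builds the classifier.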

# 5. Evaluating the classification model

In [26]:
# Predicting with a test dataset
predicted = clf.predict(X_test)
print(predicted)

# Accuracy, precision and recall (the classifier trained above is a Random Forest)
print("Random Forest Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Random Forest Precision:",metrics.precision_score(y_test, predicted))
print("Random Forest Recall:",metrics.recall_score(y_test, predicted))

[1 1 1 ... 1 1 1]
Random Forest Accuracy: 0.9369426751592357
Random Forest Precision: 0.9361702127659575
Random Forest Recall: 1.0


In [27]:
# Confusion matrix (rows: true class, columns: predicted class)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, predicted)  # separate name so the imported function is not overwritten
print(cm)

# Per-class precision, recall and F1
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted))

[[  19   99]
 [   0 1452]]
              precision    recall  f1-score   support

           0       1.00      0.16      0.28       118
           1       0.94      1.00      0.97      1452

    accuracy                           0.94      1570
   macro avg       0.97      0.58      0.62      1570
weighted avg       0.94      0.94      0.92      1570
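
To see where the reported numbers come from, the metrics can be recomputed by hand from the confusion matrix printed above. This is a small worked example using only NumPy; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = np.array([[19, 99],
               [0, 1452]])

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_pos = tp / (tp + fp)   # precision for class 1
recall_pos = tp / (tp + fn)      # recall for class 1
recall_neg = tn / (tn + fp)      # recall for class 0

print(f"accuracy={accuracy:.4f} precision(1)={precision_pos:.4f}")
print(f"recall(1)={recall_pos:.4f} recall(0)={recall_neg:.4f}")
```

The model identifies only 19 of the 118 negative reviews in the test set, so the high accuracy largely reflects the class imbalance rather than a real ability to detect negative sentiment.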

