<p align="center">
  <img src="https://i.ytimg.com/vi/Wm8ftqDZUVk/maxresdefault.jpg" alt="FIUBA" width="25%"/>
  </p>
  
# **Practical Assignment 2: Film Reviews**
### **Group**: 11 - Los Pandas 🐼
### **Term**: 2ºC 2023
### **Grader**: Mateo Suster
### **Members**:
- ### 106861 - Labollita, Francisco
- ### 102312 - Mundani Vegega, Ezequiel
- ###  97263 - Otegui, Matías Iñaki

# Random Forest Model

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB, ComplementNB, BernoulliNB
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics import (confusion_matrix, accuracy_score, f1_score, precision_score,
                             recall_score, make_scorer, classification_report)

In [7]:
reviews = pd.read_csv('train.csv')

## Bag-of-words implementation

In [8]:
vectorizerTotal = CountVectorizer(strip_accents='unicode', dtype='uint16')
vectorizerTotal.fit_transform(reviews['review_es'])

# First 20 elements
print(vectorizerTotal.get_feature_names_out()[:20])
# Elements from the middle
print(vectorizerTotal.get_feature_names_out()[10000:10020])
# Last 20 elements
print(vectorizerTotal.get_feature_names_out()[-20:])

['00' '000' '00000' '00000000000' '0000000000001' '00000001' '00001'
 '0001' '00015' '000dm' '001' '002' '003830' '006' '0069' '007' '0079'
 '007the' '0080' '0083']
['antisocial' 'antisociales' 'antiste' 'antisunciados' 'antit'
 'antitabaco' 'antitanque' 'antiterroristas' 'antitesis' 'antitetico'
 'antithesis' 'antithetical' 'antitica' 'antitm' 'antitreideros'
 'antitrust' 'antivirus' 'antiwar' 'antm' 'antoina']
['zyuranger' 'zz' 'zzzz' 'zzzzip' 'zzzzz' 'zzzzzzzzz' 'zzzzzzzzzzzzzz'
 'zzzzzzzzzzzzzzzzz' 'zzzzzzzzzzzzzzzzzzzz' 'zzzzzzzzzzzzzzzzzzzzz'
 'zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz'
 'zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz' 'æbler' 'æon' 'æsthetic'
 'østbye' 'þo' 'þorleifsson' 'יגאל' 'כרמון']


The output shows that several "words" are actually numbers, that some contain characters outside the Spanish alphabet, and that ordinary Spanish words are present as expected.

## Feature engineering on the bag of words

First, since every word that starts a sentence is capitalized, words containing a single capital letter are converted entirely to lowercase. That way the two variants of "hermosa" in the following example count as the same word: "Hermosa película" and "Esta película es hermosa".
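`CountVectorizer` already lowercases its input by default (`lowercase=True`), so the vectorizer used in this notebook applies this step on its own. The sketch below illustrates the effect with the standard library only: the `tokenize` helper is a hypothetical stand-in that lowercases the text and then applies the same token pattern `CountVectorizer` uses by default (two or more word characters).

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase first, then keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = ["Hermosa película", "Esta película es hermosa"]

# Count every token across both example sentences
counts = Counter(token for doc in docs for token in tokenize(doc))

print(counts["hermosa"])  # 2
print(counts["Hermosa"])  # 0
```

Both spellings end up under the single key `hermosa`, which is the behavior the paragraph above asks for.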

In [9]:
matrizApariciones = vectorizerTotal.fit_transform(reviews['review_es'])

In [10]:
matrizSiAparece = matrizApariciones.toarray()
matrizApariciones = matrizApariciones.toarray()

In [11]:
matrizSiAparece[matrizSiAparece > 0] = 1

In [12]:
words_df = pd.DataFrame()
words_df['Palabra'] = vectorizerTotal.get_feature_names_out()
words_df['Apariciones Totales'] = matrizApariciones.sum(axis=0).tolist() # How many times the word appears
words_df['Apariciones'] = matrizSiAparece.sum(axis=0).tolist()           # How many reviews contain the word

In [13]:
# Count how many positive reviews contain each word
listaAparicionesPositivas = np.zeros(shape=len(matrizSiAparece[0])).astype('int32')
for i in range(reviews.shape[0]):
    if (reviews.iloc[i]['sentimiento'] == 'positivo'):
        listaAparicionesPositivas += matrizSiAparece[i]

# Count how many negative reviews contain each word
listaAparicionesNegativas = np.zeros(shape=len(matrizSiAparece[0])).astype('int32')
for i in range(reviews.shape[0]):
    if (reviews.iloc[i]['sentimiento'] == 'negativo'):
        listaAparicionesNegativas += matrizSiAparece[i]
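The two loops above can also be written with boolean masks, which avoids iterating over the reviews one by one. A minimal sketch on toy data: the `presence` array and the `labels` vector are made up for illustration and stand in for `matrizSiAparece` and `reviews['sentimiento']`.

```python
import numpy as np

# Toy presence matrix: 4 reviews (rows) x 3 words (columns), 1 = the word appears
presence = np.array([[1, 0, 1],
                     [1, 1, 0],
                     [0, 1, 1],
                     [1, 0, 0]], dtype='uint16')
labels = np.array(['positivo', 'negativo', 'positivo', 'negativo'])

# Select the rows of each class with a boolean mask, then sum column-wise
positive_counts = presence[labels == 'positivo'].sum(axis=0)
negative_counts = presence[labels == 'negativo'].sum(axis=0)

print(positive_counts)  # [1 1 2]
print(negative_counts)  # [2 1 0]
```

The result per word is the same as accumulating row by row; only the selection of rows changes.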

In [14]:
words_df['Apariciones positivas'] = listaAparicionesPositivas
words_df['Apariciones negativas'] = listaAparicionesNegativas
words_df['Fracción apariciones positivas'] = words_df['Apariciones positivas'] / words_df['Apariciones']
words_df['Fracción apariciones negativas'] = words_df['Apariciones negativas'] / words_df['Apariciones']
words_df['Tasa de positividad'] = (words_df['Apariciones positivas'] - words_df['Apariciones negativas']) / words_df['Apariciones']
words_df.sort_values(by='Apariciones', inplace=True, ascending=False)
words_df.head(10)

| | Palabra | Apariciones Totales | Apariciones | Apariciones positivas | Apariciones negativas | Fracción apariciones positivas | Fracción apariciones negativas | Tasa de positividad |
|---|---|---|---|---|---|---|---|---|
| 41364 | de | 661907 | 47992 | 23949 | 24043 | 0.499021 | 0.500979 | -0.001959 |
| 128119 | que | 395365 | 47245 | 23501 | 23744 | 0.497428 | 0.502572 | -0.005143 |
| 93125 | la | 405160 | 47147 | 23496 | 23651 | 0.498356 | 0.501644 | -0.003288 |
| 55382 | en | 276429 | 45938 | 22882 | 23056 | 0.498106 | 0.501894 | -0.003788 |
| 53511 | el | 253915 | 45037 | 22381 | 22656 | 0.496947 | 0.503053 | -0.006106 |
| 162711 | una | 170883 | 43530 | 21815 | 21715 | 0.501149 | 0.498851 | 0.002297 |
| 58627 | es | 183244 | 43210 | 21708 | 21502 | 0.502384 | 0.497616 | 0.004767 |
| 162707 | un | 186195 | 43041 | 21390 | 21651 | 0.496968 | 0.503032 | -0.006064 |
| 111087 | no | 145805 | 42253 | 19918 | 22335 | 0.471398 | 0.528602 | -0.057203 |
| 60252 | esta | 119728 | 40952 | 20067 | 20885 | 0.490013 | 0.509987 | -0.019975 |


# No idea whether the code below works; I will fix it later -Eze

In [None]:
# Boolean masks selecting the reviews of each class
esPositiva = (reviews['sentimiento'] == 'positivo').to_numpy()
esNegativa = (reviews['sentimiento'] == 'negativo').to_numpy()

# Rebuild the table in vocabulary order so each row lines up with a matrix column
words_df = pd.DataFrame()
words_df['Palabra'] = vectorizerTotal.get_feature_names_out()
words_df['Apariciones'] = matrizApariciones.sum(axis=0)  # total occurrences of the word

# Fraction of reviews (overall and per class) in which each word appears
words_df['Tasa de aparición'] = matrizSiAparece.sum(axis=0) / len(reviews)
words_df['Tasa de aparición positivas'] = matrizSiAparece[esPositiva].sum(axis=0) / esPositiva.sum()
words_df['Tasa de aparición negativas'] = matrizSiAparece[esNegativa].sum(axis=0) / esNegativa.sum()

In [12]:
# Positivity rate
# f(TP, TN, TA) = (TP - TN) / (2*TA)

words_df['Tasa de positividad'] = (words_df['Tasa de aparición positivas'] -
                                   words_df['Tasa de aparición negativas']) / (2 * words_df['Tasa de aparición'])
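As a quick check of the formula, the sketch below applies it to rates copied from the `words_df.head(20)` output in this notebook (rows `00` and `00000`). The helper name `positivity_rate` is hypothetical. Assuming both classes have the same number of reviews, TA is the average of TP and TN, which is why dividing by 2*TA keeps the result between -1 and 1.

```python
def positivity_rate(tp, tn, ta):
    """(TP - TN) / (2 * TA), with all three arguments given as appearance rates."""
    return (tp - tn) / (2 * ta)

# Row '00': appears slightly more often in negative reviews
rate_00 = positivity_rate(tp=0.00316, tn=0.0042, ta=0.00368)
print(round(rate_00, 6))  # -0.141304

# Row '00000': appears only in positive reviews, so the rate reaches its maximum
rate_00000 = positivity_rate(tp=8e-05, tn=0.0, ta=4e-05)
print(rate_00000)  # 1.0
```

Both values match the `Tasa de positividad` column of the table.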

In [13]:
words_df.head(20)

| | Palabra | Apariciones | Tasa de aparición | Tasa de aparición positivas | Tasa de aparición negativas | Tasa de positividad |
|---|---|---|---|---|---|---|
| 0 | 00 | 213 | 0.00368 | 0.00316 | 0.0042 | -0.141304 |
| 1 | 000 | 613 | 0.00896 | 0.0078 | 0.01012 | -0.129464 |
| 2 | 00000 | 4 | 4e-05 | 8e-05 | 0.0 | 1.0 |
| 3 | 00000000000 | 2 | 2e-05 | 0.0 | 4e-05 | -1.0 |
| 4 | 0000000000001 | 1 | 2e-05 | 0.0 | 4e-05 | -1.0 |
| 5 | 00000001 | 1 | 2e-05 | 0.0 | 4e-05 | -1.0 |
| 6 | 00001 | 2 | 4e-05 | 0.0 | 8e-05 | -1.0 |
| 7 | 0001 | 1 | 2e-05 | 4e-05 | 0.0 | 1.0 |
| 8 | 00015 | 1 | 2e-05 | 0.0 | 4e-05 | -1.0 |
| 9 | 000dm | 1 | 2e-05 | 0.0 | 4e-05 | -1.0 |


In [14]:
len(words_df)

172382

In [20]:
# Drop rows whose "Palabra" column contains a digit (raw string avoids an invalid escape)
words_df = words_df[~words_df['Palabra'].str.contains(r'\d', na=False)]

# Drop rows whose "Apariciones" column is less than 3
words_df = words_df[words_df['Apariciones'] >= 3]
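To make the effect of the two filters concrete, here is a small sketch on a made-up vocabulary. The DataFrame `toy_words` and its counts are hypothetical and only mimic the shape of `words_df`.

```python
import pandas as pd

# Hypothetical toy vocabulary; 'Apariciones' holds made-up counts
toy_words = pd.DataFrame({
    'Palabra': ['00', '007the', 'pelicula', 'zzzz', 'hermosa'],
    'Apariciones': [213, 1, 500, 2, 40],
})

# Drop tokens that contain any digit
no_digits = toy_words[~toy_words['Palabra'].str.contains(r'\d', na=False)]

# Drop tokens that appear fewer than 3 times
kept = no_digits[no_digits['Apariciones'] >= 3]

print(kept['Palabra'].tolist())  # ['pelicula', 'hermosa']
```

The digit filter removes `00` and `007the` regardless of how frequent they are, and the count filter then removes the rare token `zzzz`.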

### Training a candidate Random Forest model

In [None]:
rfc_default = RandomForestClassifier()
rfc_default.get_params()

In [None]:
# Split the raw text first so the vocabulary is learned from the training reviews only
text_train, text_test, y_train, y_test = train_test_split(reviews['review_es'],
                                                          reviews['sentimiento'],
                                                          test_size=0.3,   # 70/30 split
                                                          random_state=2)  # seed

# Bag-of-words features for the classifier
vectorizer_rf = CountVectorizer()
x_train = vectorizer_rf.fit_transform(text_train)
x_test = vectorizer_rf.transform(text_test)

In [None]:
rfc = RandomForestClassifier(max_features='sqrt',
                             oob_score=True,
                             random_state=2,
                             n_jobs=-1,
                             criterion="entropy",
                             min_samples_leaf=5,
                             min_samples_split=5,
                             n_estimators=50)

rfc_model = rfc.fit(X=x_train, y=y_train)

y_test_pred = rfc_model.predict(x_test)

In [25]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    reviews['review_es'], reviews['sentimiento'], test_size=0.2, random_state=42)

# Build a document-term matrix with CountVectorizer
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)

# Train the Random Forest model
clf = RandomForestClassifier(max_features='sqrt',
                             oob_score=True,
                             random_state=2,
                             n_jobs=-1,
                             criterion="entropy",
                             min_samples_leaf=5,
                             min_samples_split=5,
                             n_estimators=50)
clf.fit(X_train_counts, y_train)

# Transform the test data and make predictions
X_test_counts = vectorizer.transform(X_test)
y_pred = clf.predict(X_test_counts)

# Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    negativo       0.84      0.83      0.84      4961
    positivo       0.84      0.84      0.84      5039

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000



In [28]:
# Load the test data
df_test = pd.read_csv('test.csv')

# Make sure the test DataFrame has the same structure as the training DataFrame
# In this case, it needs a 'review_es' column

# Transform the test data and make predictions
X_test_counts = vectorizer.transform(df_test['review_es'])
y_pred_test = clf.predict(X_test_counts)

# Add the predictions to the test DataFrame
df_test['sentimiento'] = y_pred_test

df_test.drop("review_es", axis=1, inplace=True)

# Save the test DataFrame with the predictions to a new csv file
df_test.to_csv('sample_solution.csv', index=False)

## Model training (FROM HERE ON THIS IS NOT DONE :C)

First, the best classifier type for the model is determined by trying the Multinomial, Complement and Bernoulli Naive Bayes classifiers. Their hyperparameters are then optimized.

In [None]:
reviews_x = reviews['review_es'].copy()
reviews_y = reviews['sentimiento'].copy()

x_train, x_test, y_train, y_test = train_test_split(reviews_x, reviews_y, test_size=0.30, random_state=0)

In [None]:
classifiers = [
    MultinomialNB(),
    ComplementNB(),
    BernoulliNB(),
]

vectorizers = [
    CountVectorizer(),
    TfidfVectorizer()
]

for v in vectorizers:
    for c in classifiers:
        model = make_pipeline(v, c)

        model.fit(x_train, y_train)

        predicted_categories = model.predict(x_test)

        print("For", v, ",", c, "the accuracy is", round(accuracy_score(y_test, predicted_categories), 4))

## Analysis of the best trained model

The best model turned out to be a CountVectorizer with a multinomial classifier.

In [None]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(x_train, y_train)

y_train_pred = model.predict(x_train)
y_test_pred = model.predict(x_test)


In [None]:
y_train_bool = (y_train == 'positivo').astype(int)
y_train_pred_bool = (y_train_pred == 'positivo').astype(int)
y_test_bool = (y_test == 'positivo').astype(int)
y_test_pred_bool = (y_test_pred == 'positivo').astype(int)

train_score = f1_score(y_train_bool.values, y_train_pred_bool)
test_score = f1_score(y_test_bool.values, y_test_pred_bool)

print("Confusion matrix for the test data:")
cm = confusion_matrix(y_test, y_test_pred)
sns.heatmap(cm, cmap='Blues',annot=True,fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True');

In [None]:
accuracy=accuracy_score(y_train_bool, y_train_pred_bool)
recall=recall_score(y_train_bool, y_train_pred_bool)
f1=f1_score(y_train_bool, y_train_pred_bool)
precision=precision_score(y_train_bool, y_train_pred_bool)

print("Metrics on the training set")
print("Accuracy: ", round(accuracy, 3))
print("Recall: ", round(recall, 3))
print("Precision: ", round(precision, 3))
print("F1 score: ", round(f1, 3))

accuracy=accuracy_score(y_test_bool,y_test_pred_bool)
recall=recall_score(y_test_bool,y_test_pred_bool)
f1=f1_score(y_test_bool,y_test_pred_bool)
precision=precision_score(y_test_bool,y_test_pred_bool)

print("\nMetrics on the test set")
print("Accuracy: ", round(accuracy, 3))
print("Recall: ", round(recall, 3))
print("Precision: ", round(precision, 3))
print("F1 score: ", round(f1, 3))
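For reference, the four metrics printed above all follow from the confusion-matrix counts. A worked sketch with hypothetical counts (they are not results from this notebook), taking `positivo` as the positive class:

```python
# Hypothetical confusion-matrix counts, with 'positivo' as the positive class
tp, fp, fn, tn = 80, 20, 10, 90

accuracy = (tp + tn) / (tp + fp + fn + tn)           # share of correct predictions
precision = tp / (tp + fp)                           # how many predicted positives are right
recall = tp / (tp + fn)                              # how many true positives are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.85 0.8 0.889 0.842
```

Comparing these numbers between the training and test sets is what reveals overfitting: a large gap means the model memorized the training reviews.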

## Grid Search

In [None]:
model = Pipeline([("tfidf", TfidfVectorizer()), ("mnb", MultinomialNB())])

params_grid = {
        'tfidf__ngram_range': [(1,1), (1,2), (2,2)],
        'tfidf__max_features': [1000, 10000, 100000],
        'mnb__alpha': [0.001, 0.01, 0.1],
}

scorer_fn = make_scorer(f1_score, pos_label='positivo')
kfoldcv = StratifiedKFold(n_splits=5)

gridcv = GridSearchCV(estimator=model,
                      param_grid = params_grid,
                      scoring=scorer_fn,
                      cv=kfoldcv
                      )

model = gridcv.fit(x_train,y_train)

y_pred = model.predict(x_test)
score = f1_score(y_test, y_pred, pos_label='positivo')
print("Parameters:", gridcv.best_params_, "\nF1 score: ", round(score, 3))
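The cost of this search grows with the product of the grid sizes. The sketch below counts the work implied by the grid and the 5-fold split defined above, using only the standard library; the variable names `combinations` and `n_cv_fits` are introduced here for illustration.

```python
from itertools import product

params_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__max_features': [1000, 10000, 100000],
    'mnb__alpha': [0.001, 0.01, 0.1],
}
n_splits = 5

# Every combination of one value per hyperparameter
combinations = list(product(*params_grid.values()))

# Each combination is fitted once per cross-validation fold
n_cv_fits = len(combinations) * n_splits

print(len(combinations), n_cv_fits)  # 27 135
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best combination on the whole training split, so adding a fourth hyperparameter with three values would roughly triple the running time.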



In [None]:
y_train_pred = model.predict(x_train)

y_train_bool = (y_train == 'positivo').astype(int)
y_train_pred_bool = (y_train_pred == 'positivo').astype(int)
y_test_bool = (y_test == 'positivo').astype(int)
y_test_pred_bool = (y_pred == 'positivo').astype(int)

train_score = f1_score(y_train_bool.values, y_train_pred_bool)
test_score = f1_score(y_test_bool.values, y_test_pred_bool)

print("Confusion matrix for the test data:")
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, cmap='Blues',annot=True,fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True');

In [None]:
accuracy=accuracy_score(y_train_bool, y_train_pred_bool)
recall=recall_score(y_train_bool, y_train_pred_bool)
f1=f1_score(y_train_bool, y_train_pred_bool)
precision=precision_score(y_train_bool, y_train_pred_bool)

print("Metrics on the training set")
print("Accuracy: ", round(accuracy, 3))
print("Recall: ", round(recall, 3))
print("Precision: ", round(precision, 3))
print("F1 score: ", round(f1, 3))

accuracy=accuracy_score(y_test_bool,y_test_pred_bool)
recall=recall_score(y_test_bool,y_test_pred_bool)
f1=f1_score(y_test_bool,y_test_pred_bool)
precision=precision_score(y_test_bool,y_test_pred_bool)

print("\nMetrics on the test set")
print("Accuracy: ", round(accuracy, 3))
print("Recall: ", round(recall, 3))
print("Precision: ", round(precision, 3))
print("F1 score: ", round(f1, 3))

## New hypothesis: filtering the reviews

In this assignment it was not necessary to analyze and filter the dataset before building the Naive Bayes model. Still, the idea came up of quantifying how negative or positive each review is and then modifying the dataset accordingly. There are two ways to filter it: keeping only the most extreme reviews, or keeping only the most neutral ones.

A library called TextBlob can be used to attempt this. Its sentiment analysis assigns a polarity score to a given piece of text: a pre-built analyzer evaluates the words and phrases in the text and returns a number indicating how positive or negative the sentiment is. To our understanding, the default analyzer relies on an English sentiment lexicon, so its coverage of the Spanish text in `review_es` is probably limited.

In [None]:
from textblob import TextBlob

reviews_hip = reviews.copy()

def quantify_reviews(review):
    analysis = TextBlob(review)
    return analysis.sentiment.polarity

The function quantify_reviews() takes a review, wraps it in a TextBlob object and analyzes its sentiment, returning a float between -1 and 1, where 1 is extremely positive and -1 is extremely negative.

The next step is to find the optimal thresholds to test the hypothesis and decide which reviews to remove before training the model.

In [None]:
reviews_hip['score'] = reviews_hip['review_es'].apply(quantify_reviews)

pos_threshold = 0.1
neg_threshold = -0.305
filtered_data = reviews_hip[((reviews_hip['score'] < pos_threshold) & (reviews_hip['score'] > neg_threshold))]
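The two filtering strategies mentioned earlier are complements of each other. A minimal sketch on made-up polarity scores: the DataFrame `toy` and its values are hypothetical, while the thresholds are the ones used in the cell above.

```python
import pandas as pd

# Hypothetical reviews with made-up polarity scores
toy = pd.DataFrame({
    'review_es': ['r1', 'r2', 'r3', 'r4', 'r5'],
    'score': [-0.6, -0.2, 0.0, 0.05, 0.4],
})
neg_threshold, pos_threshold = -0.305, 0.1

# Keep only the most neutral reviews (same mask as in the notebook)
neutral = toy[(toy['score'] > neg_threshold) & (toy['score'] < pos_threshold)]

# Keep only the most extreme reviews (the complement of the mask above)
extreme = toy[(toy['score'] <= neg_threshold) | (toy['score'] >= pos_threshold)]

print(neutral['review_es'].tolist())  # ['r2', 'r3', 'r4']
print(extreme['review_es'].tolist())  # ['r1', 'r5']
```

Every review lands in exactly one of the two subsets, so the thresholds also control how much training data is discarded.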

In [None]:
reviews_hip_x = filtered_data['review_es'].copy()
reviews_hip_y = filtered_data['sentimiento'].copy()

x_train, x_test, y_train, y_test = train_test_split(reviews_hip_x, reviews_hip_y, test_size=0.30, random_state=0)

In [None]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(x_train, y_train)

y_train_pred = model.predict(x_train)
y_test_pred = model.predict(x_test)

In [None]:
y_train_bool = (y_train == 'positivo').astype(int)
y_train_pred_bool = (y_train_pred == 'positivo').astype(int)
y_test_bool = (y_test == 'positivo').astype(int)
y_test_pred_bool = (y_test_pred == 'positivo').astype(int)

train_score = f1_score(y_train_bool.values, y_train_pred_bool)
test_score = f1_score(y_test_bool.values, y_test_pred_bool)

print("Confusion matrix for the test data:")
cm = confusion_matrix(y_test, y_test_pred)
sns.heatmap(cm, cmap='Blues',annot=True,fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True');

In [None]:
accuracy=accuracy_score(y_train_bool, y_train_pred_bool)
recall=recall_score(y_train_bool, y_train_pred_bool)
f1=f1_score(y_train_bool, y_train_pred_bool)
precision=precision_score(y_train_bool, y_train_pred_bool)

print("Metrics on the training set")
print("Accuracy: ", round(accuracy, 3))
print("Recall: ", round(recall, 3))
print("Precision: ", round(precision, 3))
print("F1 score: ", round(f1, 3))

accuracy=accuracy_score(y_test_bool,y_test_pred_bool)
recall=recall_score(y_test_bool,y_test_pred_bool)
f1=f1_score(y_test_bool,y_test_pred_bool)
precision=precision_score(y_test_bool,y_test_pred_bool)

print("\nMetrics on the test set")
print("Accuracy: ", round(accuracy, 3))
print("Recall: ", round(recall, 3))
print("Precision: ", round(precision, 3))
print("F1 score: ", round(f1, 3))

After trying different thresholds manually and testing both ways of filtering, the best results came from keeping the most neutral reviews, with a threshold of -0.305 < x < 0.1. Next, values close to this threshold are tried in search of a better result.

In [None]:
best_theshold = (0,0)
best_score = 0


for pos_threshold in np.arange(0.095, 0.1055, 0.0025):
    for neg_threshold in np.arange(-0.315, -0.295, 0.0025):
        #print("Threshold: " + str(neg_threshold) + " < x < " + str(pos_threshold))

        filtered_data = reviews_hip[((reviews_hip['score'] < pos_threshold) & (reviews_hip['score'] > neg_threshold))]

        reviews_hip_x = filtered_data['review_es'].copy()
        reviews_hip_y = filtered_data['sentimiento'].copy()

        x_train, x_test, y_train, y_test = train_test_split(reviews_hip_x, reviews_hip_y, test_size=0.30, random_state=0)
        
        model = make_pipeline(TfidfVectorizer(), MultinomialNB())
        model.fit(x_train, y_train)

        y_train_pred = model.predict(x_train)
        y_test_pred = model.predict(x_test)

        y_train_bool = (y_train == 'positivo').astype(int)
        y_train_pred_bool = (y_train_pred == 'positivo').astype(int)
        y_test_bool = (y_test == 'positivo').astype(int)
        y_test_pred_bool = (y_test_pred == 'positivo').astype(int)

        f1=f1_score(y_test_bool,y_test_pred_bool)

        #print("\nMetrics on the test set")
        #print("F1 score: ", round(f1, 3))

        #print("-----------------------------------------------------")

        if f1 > best_score:
            best_score = f1
            best_theshold = [neg_threshold, pos_threshold]


print("Best threshold: " + str(best_theshold[0]) + " < x < " + str(round(best_theshold[1], 3)))
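One detail of the search above: with a non-integer step, `np.arange` may include or drop the last value depending on floating-point rounding, and the NumPy documentation suggests `np.linspace` for such cases. A hedged alternative for building the same grids is sketched below. The names `pos_grid` and `neg_grid` are introduced here, and including the endpoint -0.295 is a choice of this sketch rather than something the notebook states.

```python
import numpy as np

# Five candidate upper thresholds: 0.095, 0.0975, 0.1, 0.1025, 0.105
pos_grid = np.linspace(0.095, 0.105, num=5)

# Nine candidate lower thresholds from -0.315 to -0.295, endpoint included
neg_grid = np.linspace(-0.315, -0.295, num=9)

# Number of (neg_threshold, pos_threshold) pairs the double loop would visit
print(len(pos_grid) * len(neg_grid))  # 45
```

Because the number of points is given explicitly, the size of the search no longer depends on how the step accumulates rounding error.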

In [None]:
filtered_data = reviews_hip[((reviews_hip['score'] < best_theshold[1]) & (reviews_hip['score'] > best_theshold[0]))]

reviews_hip_x = filtered_data['review_es'].copy()
reviews_hip_y = filtered_data['sentimiento'].copy()

x_train, x_test, y_train, y_test = train_test_split(reviews_hip_x, reviews_hip_y, test_size=0.30, random_state=0)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(x_train, y_train)

y_train_pred = model.predict(x_train)
y_test_pred = model.predict(x_test)

In [None]:

y_train_bool = (y_train == 'positivo').astype(int)
y_train_pred_bool = (y_train_pred == 'positivo').astype(int)
y_test_bool = (y_test == 'positivo').astype(int)
y_test_pred_bool = (y_test_pred == 'positivo').astype(int)

train_score = f1_score(y_train_bool.values, y_train_pred_bool)
test_score = f1_score(y_test_bool.values, y_test_pred_bool)

print("Confusion matrix for the test data:")
cm = confusion_matrix(y_test, y_test_pred)
sns.heatmap(cm, cmap='Blues',annot=True,fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True');

In [None]:
accuracy=accuracy_score(y_train_bool, y_train_pred_bool)
recall=recall_score(y_train_bool, y_train_pred_bool)
f1=f1_score(y_train_bool, y_train_pred_bool)
precision=precision_score(y_train_bool, y_train_pred_bool)

print("Metrics on the training set")
print("Accuracy: ", round(accuracy, 3))
print("Recall: ", round(recall, 3))
print("Precision: ", round(precision, 3))
print("F1 score: ", round(f1, 3))

accuracy=accuracy_score(y_test_bool,y_test_pred_bool)
recall=recall_score(y_test_bool,y_test_pred_bool)
f1=f1_score(y_test_bool,y_test_pred_bool)
precision=precision_score(y_test_bool,y_test_pred_bool)

print("\nMetrics on the test set")
print("Accuracy: ", round(accuracy, 3))
print("Recall: ", round(recall, 3))
print("Precision: ", round(precision, 3))
print("F1 score: ", round(f1, 3))

## Hypothesis + Grid Search

A Grid Search is applied to the new filtered dataset.

In [None]:
filtered_data = reviews_hip[((reviews_hip['score'] < best_theshold[1]) & (reviews_hip['score'] > best_theshold[0]))]

reviews_hip_x = filtered_data['review_es'].copy()
reviews_hip_y = filtered_data['sentimiento'].copy()

x_train, x_test, y_train, y_test = train_test_split(reviews_hip_x, reviews_hip_y, test_size=0.30, random_state=0)


model = Pipeline([("tfidf", TfidfVectorizer()), ("mnb", MultinomialNB())])

params_grid = {
        'tfidf__ngram_range': [(1,1), (1,2), (2,2)],
        'tfidf__max_features': [1000, 10000, 100000],
        'mnb__alpha': [0.001, 0.01, 0.1],
}

scorer_fn = make_scorer(f1_score, pos_label='positivo')
kfoldcv = StratifiedKFold(n_splits=5)

gridcv = GridSearchCV(estimator=model,
                      param_grid = params_grid,
                      scoring=scorer_fn,
                      cv=kfoldcv
                      )

model = gridcv.fit(x_train,y_train)

y_pred = model.predict(x_test)
score = f1_score(y_test, y_pred, pos_label='positivo')
print("Parameters:", gridcv.best_params_, "\nF1 score: ", round(score, 3))

In [None]:
y_train_pred = model.predict(x_train)

y_train_bool = (y_train == 'positivo').astype(int)
y_train_pred_bool = (y_train_pred == 'positivo').astype(int)
y_test_bool = (y_test == 'positivo').astype(int)
y_test_pred_bool = (y_pred == 'positivo').astype(int)

train_score = f1_score(y_train_bool.values, y_train_pred_bool)
test_score = f1_score(y_test_bool.values, y_test_pred_bool)

print("Confusion matrix for the test data:")
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, cmap='Blues',annot=True,fmt='g')
plt.xlabel('Predicted')
plt.ylabel('True');

In [None]:
accuracy=accuracy_score(y_train_bool, y_train_pred_bool)
recall=recall_score(y_train_bool, y_train_pred_bool)
f1=f1_score(y_train_bool, y_train_pred_bool)
precision=precision_score(y_train_bool, y_train_pred_bool)

print("Metrics on the training set")
print("Accuracy: ", round(accuracy, 3))
print("Recall: ", round(recall, 3))
print("Precision: ", round(precision, 3))
print("F1 score: ", round(f1, 3))

accuracy=accuracy_score(y_test_bool,y_test_pred_bool)
recall=recall_score(y_test_bool,y_test_pred_bool)
f1=f1_score(y_test_bool,y_test_pred_bool)
precision=precision_score(y_test_bool,y_test_pred_bool)

print("\nMetrics on the test set")
print("Accuracy: ", round(accuracy, 3))
print("Recall: ", round(recall, 3))
print("Precision: ", round(precision, 3))
print("F1 score: ", round(f1, 3))

## Predicting the test set

In [None]:
test = pd.read_csv('test.csv')

predictions = pd.DataFrame()
predictions['ID'] = test['ID'].values
predictions['sentimiento'] = model.predict(test['review_es'])

predictions.to_csv('sample_submission.csv', index=False)