# TF-IDF 10-Fold Cross-Validation - Metacritics Sentiment Analysis
[Panda](https://github.com/PANDA-UFSCar) - Universidade Federal de São Carlos, 2023/2
## Authors: Bárbara Dib Oliveira and [Letícia Bossato Marchezi](linkedin.com/in/letmarchezi/)

## Algorithms
*   **SVM**: Works well on high-dimensional data with few samples
*   **Naive Bayes**: Generally good for sentiment analysis, though more effective on large datasets
*   **Random Forest**: Handles overfitting well; effective on large datasets
*   **K-nearest neighbors**: Simple; suits datasets that are not very large and whose decision boundaries are irregular

---

*   **XGBoost**: Generally good accuracy and precision; versatile across many types of data.
*   **Convolutional neural network (CNN)**: Can capture complex patterns in the data
*   **BERT**: Captures both local and long-range context effectively
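
As a minimal sketch, the four classical models evaluated later in this notebook could be collected as shown below. The `classifiers` dictionary and its keys are assumptions made for illustration; the notebook itself builds each model inline with default arguments. The random forest is seeded with 42 here, the same seed the notebook uses for its fold splitter.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical registry of the classical models; the names are illustrative only.
classifiers = {
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=42),  # seeded for reproducibility
    "KNN": KNeighborsClassifier(),
}

for name, model in classifiers.items():
    print(name, type(model).__name__)
```

Seeding the random forest matters because its bootstrap sampling is random, so an unseeded run may give slightly different scores each time.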


## Reminders
* Random state = 42
* Use a (70, 30) train/test split
* Scale the data (after splitting it into training and test sets): `inst_scaler = preprocessing.StandardScaler(with_mean=False)`
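
The sketch below illustrates these reminders on a tiny made-up corpus: a stratified 70/30 split with seed 42, then TF-IDF and `StandardScaler(with_mean=False)`, both fitted on the training portion only. The toy texts, the labels, and all variable names are assumptions for illustration. `with_mean=False` is used because centering would destroy the sparsity of the TF-IDF matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up reviews, for illustration only.
texts = [
    "great movie", "terrible plot", "loved the cast", "boring and slow",
    "a superb film", "awful acting", "really fun", "not good",
    "wonderful story", "bad movie",
]
labels = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]

# 70/30 split with the fixed seed from the reminders.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# Fit the vectorizer on the training texts only, then reuse it on the test texts.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Scale after the split: statistics come from the training data only.
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train_tfidf)
X_test_scaled = scaler.transform(X_test_tfidf)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler before the split would let test-set statistics leak into training, which is why the reminder places it after the split.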


## Importing the preprocessed file

In [None]:
import numpy as np
import pandas as pd
import ast
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
from google.colab import drive

drive.mount('/content/gdrive', force_remount=True)
%cd "gdrive/MyDrive/Grupo 1 - Processamento de Linguagem Natural/Data/"

Mounted at /content/gdrive
/content/gdrive/.shortcut-targets-by-id/1ub11KA5pjUO4RCNqv5VFfBB4Tnv21ooN/Grupo 1 - Processamento de Linguagem Natural/Data


In [None]:
%ls

[0m[01;34m'csv base'[0m/   preprocessed-metacritics-total.csv   [01;34mrascunhos_filmes[0m/


## Algorithms

In [None]:
df_geral = pd.read_csv("preprocessed-metacritics-total.csv")
df_geral

Unnamed: 0,Movie name,Review,Created at,Score,Genre
0,Arrival,"['denis', 'villeneuve', 'shows', 'us', 'all', ...","OCT 3, 2022",1.0,Mistery
1,Arrival,"['amy', 'adams', 'gives', 'a', 'superb', 'perf...","MAR 7, 2022",1.0,Mistery
2,Arrival,"['this', 'movie', 'is', 'not', 'for', 'everyon...","DEC 6, 2019",1.0,Mistery
3,Arrival,"['arrival', 'is', 'one', 'of', 'my', 'favorite...","APR 3, 2020",1.0,Mistery
4,Arrival,"['i', 'do', 'not', 'think', 'this', 'movie', '...","MAR 2, 2020",1.0,Mistery
...,...,...,...,...,...
6475,Norm of the North,"['ugh', 'anything', 'but', 'this', 'movie', 'i...","APR 3, 2021",-1.0,Animation
6476,Norm of the North,"['this', 'is', 'a', 'pathetic', 'attempt', 'at...","JUL 9, 2016",-1.0,Animation
6477,Star Wars: The Clone Wars,"['this', 'movie', 'was', 'never', 'interesting...","AUG 2, 2011",-1.0,Animation
6478,Norm of the North,"['this', 'is', 'even', 'a', 'movie', 'i', 'tho...","JUN 25, 2016",-1.0,Animation


In [None]:
df_geral.head()

Unnamed: 0,Movie name,Review,Created at,Score,Genre
0,Arrival,"['denis', 'villeneuve', 'shows', 'us', 'all', ...","OCT 3, 2022",1.0,Mistery
1,Arrival,"['amy', 'adams', 'gives', 'a', 'superb', 'perf...","MAR 7, 2022",1.0,Mistery
2,Arrival,"['this', 'movie', 'is', 'not', 'for', 'everyon...","DEC 6, 2019",1.0,Mistery
3,Arrival,"['arrival', 'is', 'one', 'of', 'my', 'favorite...","APR 3, 2020",1.0,Mistery
4,Arrival,"['i', 'do', 'not', 'think', 'this', 'movie', '...","MAR 2, 2020",1.0,Mistery


### Training


In [None]:
# Parse each stringified token list (e.g. "['a', 'b']") back into a Python list
df_geral['Review'] = df_geral['Review'].apply(ast.literal_eval)
df_geral.head()

Unnamed: 0,Movie name,Review,Created at,Score,Genre
0,Arrival,"[denis, villeneuve, shows, us, all, his, talen...","OCT 3, 2022",1.0,Mistery
1,Arrival,"[amy, adams, gives, a, superb, performance, in...","MAR 7, 2022",1.0,Mistery
2,Arrival,"[this, movie, is, not, for, everyone, if, you,...","DEC 6, 2019",1.0,Mistery
3,Arrival,"[arrival, is, one, of, my, favorite, sci, fi, ...","APR 3, 2020",1.0,Mistery
4,Arrival,"[i, do, not, think, this, movie, is, about, th...","MAR 2, 2020",1.0,Mistery


In [None]:
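# Join the tokens back into one string per review, the input TfidfVectorizer expects by default
X = df_geral['Review'].apply(lambda x: " ".join(x))
# Sentiment labels (1.0 and -1.0 in the rows shown above)
y = df_geral['Score']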
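
To make the two preparation steps concrete, here is a minimal sketch on a two-row frame that mimics the CSV layout. The `toy` frame, its contents, and the `toy_X`/`toy_y` names are assumptions for illustration, not data from the project.

```python
import ast
import pandas as pd

# Toy frame mimicking the CSV, where each review is a stringified token list.
toy = pd.DataFrame({
    "Review": ["['great', 'movie']", "['boring', 'plot']"],
    "Score": [1.0, -1.0],
})

# Step 1: turn the string "['great', 'movie']" into the list ['great', 'movie'].
toy["Review"] = toy["Review"].apply(ast.literal_eval)

# Step 2: join the tokens into a single whitespace-separated string per review.
toy_X = toy["Review"].apply(" ".join)
toy_y = toy["Score"]

print(toy_X.tolist())  # ['great movie', 'boring plot']
```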

In [None]:
print(len(X))

6480


In [None]:
print(len(y))

6480


In [None]:
print(X)

0       denis villeneuve shows us all his talent in th...
1       amy adams gives a superb performance in what c...
2       this movie is not for everyone if you wanted a...
3       arrival is one of my favorite sci fi movies of...
4       i do not think this movie is about that what m...
                              ...                        
6475    ugh anything but this movie i remember watchin...
6476    this is a pathetic attempt at grabbing childre...
6477    this movie was never interesting when it was n...
6478    this is even a movie i thought i was some sort...
6479    plot was boring got bored and dissaapointed in...
Name: Review, Length: 6480, dtype: object


In [None]:
print(y)

0       1.0
1       1.0
2       1.0
3       1.0
4       1.0
       ... 
6475   -1.0
6476   -1.0
6477   -1.0
6478   -1.0
6479   -1.0
Name: Score, Length: 6480, dtype: float64


In [None]:
def custom_cross_val_metrics(classifier, X, y, k=10, verbose=False):
    """
    Perform stratified k-fold cross-validation and calculate metrics for each fold.

    Parameters:
    - classifier: The classifier to evaluate.
    - X: The raw review texts, one string per sample (TF-IDF is fitted inside each fold).
    - y: The target labels.
    - k: The number of folds for cross-validation.
    - verbose: If True, print per-fold metrics during the run.

    Returns:
    - List of dictionaries containing metrics for each fold.
    """
    metrics_list = []
    Tf_vectorizer = TfidfVectorizer()
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)

    # Convert X and y to NumPy arrays
    X = np.array(X)
    y = np.array(y)

    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # TF-IDF transformation: fit on the training fold only, then apply it to the test fold
        X_train_tfidf = Tf_vectorizer.fit_transform(X_train).toarray()
        X_test_tfidf = Tf_vectorizer.transform(X_test).toarray()

        # Fit the classifier on the training data
        classifier.fit(X_train_tfidf, y_train)

        # Make predictions on the test data
        y_pred = classifier.predict(X_test_tfidf)

        # Calculate metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, average='macro')
        recall = recall_score(y_test, y_pred, average='macro')
        f1 = f1_score(y_test, y_pred, average='macro')

        metrics_dict = {
            'Accuracy': accuracy,
            'Precision': precision,
            'Recall': recall,
            'F1-Score': f1
        }

        metrics_list.append(metrics_dict)
        if verbose:
            # Print only the fold that just finished, not every earlier fold again
            print(f"Classifier: {classifier}")
            print(f"Fold {len(metrics_list)} Metrics:")
            for metric, value in metrics_dict.items():
                print(f"{metric}: {value}")
            print()

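    # Report the sizes of the last fold. The messages are in Portuguese:
    # training samples, test samples, and number of features in the last split.
    print(f"Quantidade de amostras no treinamento: {len(X_train_tfidf)}")
    print(f"Quantidade de amostras no teste: {len(X_test_tfidf)}")
    print(f"Quantidade de features no último split: {len(X_train_tfidf[0])}\n")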
    return metrics_list
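
For comparison, the same per-fold procedure can be sketched with scikit-learn's `Pipeline` and `cross_validate`, which refit the vectorizer on each training fold automatically. This is an alternative formulation under stated assumptions, not the function the notebook uses: `pipeline_cross_val_metrics`, the toy texts, and the metric key names are hypothetical. Unlike the function above, it keeps the TF-IDF matrix sparse, so it only suits classifiers that accept sparse input (`SVC` does; `GaussianNB` does not).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def pipeline_cross_val_metrics(classifier, texts, labels, k=10):
    """Hypothetical helper: mean macro-averaged metrics over k stratified folds."""
    # The pipeline refits TF-IDF on each training fold, so no test text leaks in.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scoring = {
        "accuracy": "accuracy",
        "precision": "precision_macro",
        "recall": "recall_macro",
        "f1": "f1_macro",
    }
    scores = cross_validate(pipeline, texts, labels, cv=skf, scoring=scoring)
    return {name: float(scores[f"test_{name}"].mean()) for name in scoring}


# Made-up reviews, for illustration only.
toy_texts = [
    "great film", "loved it", "superb acting", "wonderful story",
    "great fun", "loved the cast",
    "bad film", "hated it", "awful acting", "boring story",
    "bad plot", "hated the cast",
]
toy_labels = [1] * 6 + [-1] * 6

toy_means = pipeline_cross_val_metrics(SVC(), toy_texts, toy_labels, k=2)
print(toy_means)
```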

In [None]:
def calc_mean_metrics(metrics_list_dic):
    # Initialize a dictionary to store the sum of each metric
    mean_metrics = {'Accuracy': 0, 'Precision': 0, 'Recall': 0, 'F1-Score': 0}

    # Calculate the sum of each metric
    for metrics_dict in metrics_list_dic:
        for metric, value in metrics_dict.items():
            mean_metrics[metric] += value

    # Calculate the mean of each metric
    num_metrics = len(metrics_list_dic)
    mean_metrics = {metric: mean_metrics[metric] / num_metrics for metric in mean_metrics}

    # Print the mean of each metric
    for metric, value in mean_metrics.items():
        print(f"Mean {metric}: {value}")
    return mean_metrics
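
A list of per-fold dictionaries like the one this function receives can also be summarized with pandas, which gives the spread across folds along with the mean. The sketch below uses placeholder numbers invented for illustration (they are not results from this notebook), and `example_metrics`, `fold_frame`, and `summary` are hypothetical names.

```python
import pandas as pd

# Placeholder per-fold metrics, invented for illustration only.
example_metrics = [
    {"Accuracy": 0.50, "Precision": 0.40, "Recall": 0.50, "F1-Score": 0.45},
    {"Accuracy": 0.70, "Precision": 0.60, "Recall": 0.70, "F1-Score": 0.65},
]

# One row per fold, one column per metric.
fold_frame = pd.DataFrame(example_metrics)

# Mean and sample standard deviation of each metric across folds.
summary = fold_frame.agg(["mean", "std"])
print(summary)
```

Reporting the standard deviation next to the mean helps judge whether a gap between two classifiers is larger than the fold-to-fold variation.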

In [None]:
#-----------------------RF-------------------------------
print("Random Forest:")
metrics_list = custom_cross_val_metrics(RandomForestClassifier(), X, y, k=10,verbose=False)
mean_metric_rf = calc_mean_metrics(metrics_list)

Random Forest:
Quantidade de amostras no treinamento: 5832
Quantidade de amostras no teste: 648
Quantidade de features no último split: 22831

Mean Accuracy: 0.6231481481481482
Mean Precision: 0.623377522191409
Mean Recall: 0.6231481481481481
Mean F1-Score: 0.6228151077149594
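
The output above reports 5832 training samples and 22831 features, and the evaluation function converts the TF-IDF matrix to a dense array with `.toarray()`. A quick estimate of what that dense training matrix occupies, assuming `TfidfVectorizer`'s default `float64` dtype (the variable names are illustrative):

```python
import numpy as np

n_train_samples = 5832  # training samples reported in the output above
n_features = 22831      # features reported in the output above
bytes_per_value = np.dtype(np.float64).itemsize  # 8 bytes, assuming float64

# Size of the dense training matrix produced by .toarray() for one fold.
dense_bytes = n_train_samples * n_features * bytes_per_value
print(f"{dense_bytes / 1e9:.2f} GB")  # 1.07 GB
```

This is roughly 1 GB per fold for the training matrix alone. The dense form is required by `GaussianNB`, while the other classifiers used here also accept the sparse matrix directly.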


In [None]:
#-----------------------KNN-------------------------------
print("KNN:")
metrics_list = custom_cross_val_metrics(KNeighborsClassifier(), X, y, k=10,verbose=False)
mean_metric_knn = calc_mean_metrics(metrics_list)

KNN:
Quantidade de amostras no treinamento: 5832
Quantidade de amostras no teste: 648
Quantidade de features no último split: 22831

Mean Accuracy: 0.4489197530864198
Mean Precision: 0.4832794601227143
Mean Recall: 0.4489197530864197
Mean F1-Score: 0.4185902614956108


In [None]:
#-----------------------SVM-------------------------------
print("SVM:")
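# Note: this run uses k=2, while the other classifiers use k=10, so the scores are not directly comparable
metrics_list = custom_cross_val_metrics(SVC(), X, y, k=2,verbose=False)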
mean_metric_svm = calc_mean_metrics(metrics_list)

SVM:
Quantidade de amostras no treinamento: 3240
Quantidade de amostras no teste: 3240
Quantidade de features no último split: 17437

Mean Accuracy: 0.6561728395061728
Mean Precision: 0.663530649988453
Mean Recall: 0.6561728395061729
Mean F1-Score: 0.6584035848169634


In [None]:
#-----------------------NB-------------------------------
print("GaussianNB:")
metrics_list = custom_cross_val_metrics(GaussianNB(), X, y, k=10,verbose=False)
mean_metric_nb = calc_mean_metrics(metrics_list)

GaussianNB:
Quantidade de amostras no treinamento: 5832
Quantidade de amostras no teste: 648
Quantidade de features no último split: 22831

Mean Accuracy: 0.45632716049382716
Mean Precision: 0.4529605160868394
Mean Recall: 0.45632716049382716
Mean F1-Score: 0.45222851855028995
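
The four sets of mean metrics printed above can be gathered into a single table for side-by-side comparison. In this sketch the values are copied from the outputs above; the `mean_results` and `comparison` names are assumptions, and in the notebook the same dictionaries are available as `mean_metric_rf`, `mean_metric_knn`, `mean_metric_svm`, and `mean_metric_nb`. Keep in mind that the SVM figures come from a 2-fold run, while the others come from 10 folds.

```python
import pandas as pd

# Mean metrics copied from the outputs above (SVM: k=2, all others: k=10).
mean_results = {
    "Random Forest": {"Accuracy": 0.6231481481481482, "Precision": 0.623377522191409,
                      "Recall": 0.6231481481481481, "F1-Score": 0.6228151077149594},
    "KNN": {"Accuracy": 0.4489197530864198, "Precision": 0.4832794601227143,
            "Recall": 0.4489197530864197, "F1-Score": 0.4185902614956108},
    "SVM": {"Accuracy": 0.6561728395061728, "Precision": 0.663530649988453,
            "Recall": 0.6561728395061729, "F1-Score": 0.6584035848169634},
    "GaussianNB": {"Accuracy": 0.45632716049382716, "Precision": 0.4529605160868394,
                   "Recall": 0.45632716049382716, "F1-Score": 0.45222851855028995},
}

# One row per classifier, sorted from best to worst macro F1.
comparison = pd.DataFrame(mean_results).T.sort_values("F1-Score", ascending=False)
print(comparison.round(4))
```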
