#### *ISEL - DEI - LEIM*
## Machine Learning [T52D]
### Lab Assignment 2: Classification of IMDb Movie Reviews

João Madeira ($48630$), 
Renata Góis ($51038$),
Bruno Pereira ($51811$)

**Instructor:**
- Prof. Gonçalo Xufre Silva

In [131]:
import numpy as np
import matplotlib.pyplot as plt
import pickle as p
import re
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.metrics import confusion_matrix
import sklearn.preprocessing as pp

In [53]:
with open("resources/imdbFull.p", "rb") as f:
    D = p.load(f)
print("Keys:", D.keys())

reviews = D['data']
sentiments = D['target']

print(len(reviews), "reviews")

Keys: dict_keys(['data', 'target', 'DESCR'])
50000 reviews


This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.
For more details see: http://ai.stanford.edu/~amaas/data/sentiment/

GPT's answer when I asked what my options for stemmers are:

| Method                   | Aggressiveness | Quality | Speed  | Best for                    |
| ------------------------ | -------------- | ------- | ------ | --------------------------- |
| **Porter**               | Medium         | ✔✔      | Fast   | Classic NLP                 |
| **Snowball**             | Medium         | ✔✔✔     | Fast   | Best stemmer for English    |
| **Lancaster**            | High           | ✔       | Fast   | Rare cases; very aggressive |
| **Lemmatizer (spaCy)**   | Low            | ⭐⭐⭐⭐    | Medium | Best accuracy               |
| **Lemmatizer (WordNet)** | Low            | ⭐⭐⭐     | Medium | Simpler lemmatization       |

So I opted for Snowball, the "best stemmer for English" according to the table, since stemming is what we are doing here and, from what I gathered, a lemmatizer $\neq$ a stemmer.
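To make the "aggressiveness" column more concrete, the sketch below runs the three NLTK stemmers from the table on a handful of words. The word list is an arbitrary illustration and is not taken from the dataset.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# The three stemmers listed in the table above.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Illustrative words only (an assumption, not sampled from the reviews).
words = ["running", "movies", "generously", "disappointing"]

# Stem every word with every stemmer so the outputs can be compared side by side.
stems = {name: [s.stem(w) for w in words] for name, s in stemmers.items()}

for name, result in stems.items():
    print(f"{name:>9}: {result}")
```

Comparing the printed rows shows where the stemmers agree and where one of them cuts a word down further than the others.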

In [116]:
stemmer = SnowballStemmer("english")

def clean_review(string):
    # Replace HTML line-break tags with a space
    string = string.replace('<br />', ' ')  
    # Remove words with 20 or more letters
    string = re.sub(r'\b[a-zA-Z]{20,}\b', ' ', string)
    # Remove words containing a character repeated 3 or more times in a row (e.g., "yaaass", "omgggg")
    string = re.sub(r'\b\w*(.)\1{2,}\w*\b', ' ', string)
    # Keep letters only
    string = re.sub(r'[^a-zA-Z]', ' ', string)
    # Collapse consecutive whitespace
    string = re.sub(r'\s+', ' ', string).strip()
    # Normalize to lowercase
    string = string.lower()
    # Apply stemming
    string = " ".join(stemmer.stem(w) for w in string.split())
    return string

reviews = [clean_review(rev) for rev in reviews]

output = {"data": reviews, "target": sentiments}
# Use a context manager so the file is flushed and closed after writing
with open("resources/imdbPreProcessed.p", "wb") as f:
    p.dump(output, f)

In [122]:
custom_stopwords = list(ENGLISH_STOP_WORDS - {'no', 'not', 'nor'})

tfidfVector = TfidfVectorizer(min_df=3,                    # Drop terms that appear in fewer than 3 reviews
                        max_df=0.8,                        # Drop terms that appear in more than 80% of the reviews
                        max_features=30000,                # Keep at most 30,000 features
                        ngram_range=(1,2),                 # Use unigrams and bigrams (good, very good, pretty bad)
                        token_pattern=r'\b[a-zA-Z]{3,}\b', # Ignore words with fewer than 3 letters (this also drops "no")
                        sublinear_tf=True,                 # Term frequency grows logarithmically instead of linearly
                        stop_words=custom_stopwords        # Remove English stop words except "no", "not" and "nor"
                        )

In [123]:
with open("resources/imdbPreProcessed.p", "rb") as f:
    pre_processed_data = p.load(f)
reviews = pre_processed_data['data']
sentiments = pre_processed_data['target']

tfidfVector = tfidfVector.fit(reviews)

tokens = tfidfVector.get_feature_names_out()
X = tfidfVector.transform(reviews)
# X = X.astype(np.float32)  # would halve RAM usage; astype returns a copy, so the result must be assigned
len(tokens)

30000

### Split into training, test, and validation sets
| Training | Test | Validation |
|--|--|--|
|40k (80%)|5k (10%)|5k (10%)|

In [124]:
# Split: 80% training / 20% temporary (validation + test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, sentiments,
    test_size=0.2,
    random_state=42,
    stratify=sentiments
)

# Split the temporary set 50/50 into validation and test (10% / 10% of the total)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5,
    random_state=42,
    stratify=y_temp
)
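As a quick sanity check of the two-stage split, the sketch below repeats the same two `train_test_split` calls on synthetic data and prints the resulting sizes and class balance. The toy arrays `X_toy` and `y_toy` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 samples, two balanced classes.
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.tile([0, 1], 500)

# Stage 1: 80% training, 20% held out.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Stage 2: split the held-out part in half (validation / test).
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

sizes = (len(X_tr), len(X_va), len(X_te))
balance = [float(y.mean()) for y in (y_tr, y_va, y_te)]
print("sizes:", sizes)
print("fraction of class 1:", balance)
```

Because both calls use `stratify`, the class proportions stay close to those of the full set in all three subsets.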

In [115]:
param_grid = {
    'penalty': ['l2'],
    'solver': ['saga'],
    'C': [1, 2, 5],
    'max_iter': [50,100]
}

grid_search = GridSearchCV(LogisticRegression(random_state=42, n_jobs=1), param_grid,cv=3)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

{'C': 1, 'max_iter': 50, 'penalty': 'l2', 'solver': 'saga'}
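`best_params_` only reports the winner. To judge how much the candidates actually differ, it can help to look at `cv_results_` as well. The sketch below does this on synthetic data from `make_classification`; the data, the grid values and the default solver are assumptions chosen so the example runs in a few seconds, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption; the notebook uses the TF-IDF matrix instead).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1, 10]},
    cv=3,
)
search.fit(X_toy, y_toy)

# Mean and spread of the cross-validated accuracy for every candidate.
results = search.cv_results_
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(params, f"{mean:.3f} +/- {std:.3f}")

print("best:", search.best_params_, f"{search.best_score_:.3f}")
```

If the mean scores differ by less than their standard deviations, the choice between candidates is within noise and the simplest or cheapest setting is a reasonable pick.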


In [125]:
lr = LogisticRegression(penalty='l2', solver='saga', C=1, max_iter=50, random_state=42)
lr = lr.fit(X_train, y_train)

In [126]:
test_predicted = lr.predict(X_test)
cm = confusion_matrix(y_test, test_predicted)
print("Test set")
print(f"number of errors : {np.sum(test_predicted != y_test)}")
print(f"accuracy : {np.round((np.sum(test_predicted == y_test)/y_test.shape[0])*100,2)}%")
print(f"Confusion matrix : \n{cm}")

Test set
number of errors : 2836
accuracy : 43.28%
Confusion matrix : 
[[828  47  47  40   7  10   3  30]
 [278  28  50  63   8  11   3  18]
 [200  23  81 102  17  27   7  39]
 [140  24  65 163  47  22   7  65]
 [ 28   6  11  41 121 124  19 130]
 [ 19   4  10  31  98 159  21 244]
 [ 10   1   6   8  38  91  31 276]
 [ 46   1   8   5  36  95  29 753]]
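The summary numbers can be recovered from the test-set confusion matrix alone, which also gives the per-class recall. The sketch below copies the matrix printed above. Its last step rests on an assumption the notebook does not confirm: that the eight classes are ordered by rating, with the first four being negative reviews and the last four positive ones.

```python
import numpy as np

# Test-set confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm_test = np.array([
    [828,  47,  47,  40,   7,  10,   3,  30],
    [278,  28,  50,  63,   8,  11,   3,  18],
    [200,  23,  81, 102,  17,  27,   7,  39],
    [140,  24,  65, 163,  47,  22,   7,  65],
    [ 28,   6,  11,  41, 121, 124,  19, 130],
    [ 19,   4,  10,  31,  98, 159,  21, 244],
    [ 10,   1,   6,   8,  38,  91,  31, 276],
    [ 46,   1,   8,   5,  36,  95,  29, 753],
])

total = cm_test.sum()
correct = np.trace(cm_test)            # diagonal = correctly classified reviews
errors = total - correct
accuracy = correct / total

# Recall per class: correct predictions divided by the number of true samples.
recall = np.diag(cm_test) / cm_test.sum(axis=1)

# ASSUMPTION: classes 0-3 are negative ratings and classes 4-7 are positive ones.
polarity_correct = cm_test[:4, :4].sum() + cm_test[4:, 4:].sum()
polarity_accuracy = polarity_correct / total

print(f"errors: {errors}, accuracy: {accuracy:.2%}")
print("recall per class:", np.round(recall, 3))
print(f"polarity accuracy (under the assumption above): {polarity_accuracy:.2%}")
```

Most of the mass sits in the two diagonal 4x4 blocks, so under that assumption the model separates the two halves of the label range far better than the 8-class accuracy suggests; the errors are mostly confusions between neighbouring classes.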


In [128]:
valid_predicted = lr.predict(X_val)
cm = confusion_matrix(y_val, valid_predicted)
print("Validation set")
print(f"number of errors : {np.sum(valid_predicted != y_val)}")
print(f"accuracy : {np.round((np.sum(valid_predicted == y_val)/y_val.shape[0])*100,2)}%")
print(f"Confusion matrix : \n{cm}")

Validation set
number of errors : 2832
accuracy : 43.36%
Confusion matrix : 
[[817  34  50  56   6  10   5  34]
 [266  35  42  72  10   9   2  22]
 [189  39  83 114  23  16   5  27]
 [129  19  71 188  51  30   7  38]
 [ 29   8  21  41 129 111  15 127]
 [ 26   5   6  26 106 126  39 252]
 [ 13   2   6   8  33  87  35 277]
 [ 40   2   5  16  32  93  30 755]]


In [133]:
# with_mean=False: centering is not supported for sparse matrices
sc = pp.StandardScaler(with_mean=False).fit(X_train, y_train)

# Standardized training set
X_train_n = sc.transform(X_train)

# Standardized test set
X_test_n = sc.transform(X_test)

# Standardized validation set
X_val_n = sc.transform(X_val)
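A small sketch of what `StandardScaler(with_mean=False)` does to a sparse matrix: each column is divided by its standard deviation, zeros stay zero, and the result is still sparse. The 4x2 matrix below is made up for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Made-up sparse matrix standing in for the TF-IDF features.
X_small = sparse.csr_matrix(np.array([
    [0.0, 2.0],
    [0.0, 0.0],
    [3.0, 4.0],
    [1.0, 0.0],
]))

scaler = StandardScaler(with_mean=False).fit(X_small)
X_scaled = scaler.transform(X_small)

print("still sparse:", sparse.issparse(X_scaled))
print("non-zeros before/after:", X_small.nnz, X_scaled.nnz)
print("column variances after scaling:", X_scaled.toarray().var(axis=0))
```

Only the scale changes: the columns end up with unit variance but are not centered, which is what allows the sparsity pattern to be preserved.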