# Training pipelines: LogisticRegression

This notebook contains the training pipelines used to obtain the best classifier model for the problem developed in EP2.

The LogisticRegression model accepts many different parameterizations (see the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression). Here we explore three: *C*, *penalty*, and *solver*.

The trained pipelines are organized by **features + LogisticRegression:solver**.  
That is, there is one pipeline for each pair of *feature* and *solver* explored here.  

The **features** we explore are:  

1. Bag of Words  
2. TF/TF-IDF  
3. Character n-grams  
4. Embeddings

The **embedding models** we explore here are:  

1. BAAI bge-3
2. Google Gemma 300m 

PS: I have NOT YET BEEN ABLE TO GENERATE the Gemma 300m embeddings, so there is no need to test the part of the EP that runs with them.


And the *LogisticRegression* **solvers** we explore here are:  

1. lbfgs  
2. liblinear  
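As a minimal sketch of the "one pipeline per feature and solver pair" layout described above, the combinations can be enumerated as follows. The labels are the ones this notebook stores in `best_models_list`; the name `planned_runs` is only illustrative and is not used elsewhere in the notebook.

```python
from itertools import product

# Feature labels as they are recorded in best_models_list later on.
# The two embedding models count as separate feature sets.
features = [
    "BoW",
    "TF",
    "CHAR-NGram",
    "embeddings baai-bge-3",
    "embeddings google-gemma-300m",
]
solvers = ["lbfgs", "liblinear"]

# One grid search is run for every (feature, solver) pair.
planned_runs = [{"features": f, "solver": s} for f, s in product(features, solvers)]

for run in planned_runs:
    print(f"{run['features']} + {run['solver']}")
```

Each of these pairs corresponds to one `GridSearchCV` cell further down.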

You can find an already parameterized run of the best model found by these pipelines in the notebook model.ipynb, **which is our submission version of EP2**.

## Bootstrap Imports

In [1]:
import pandas as pd
import numpy as np

import os
import sys
from pathlib import Path

### Import test: project Lib.utils

In [2]:
filedir = Path(os.getcwd())
base_path = filedir.resolve().parents[3]
sys.path.append(str(base_path))

from Lib.utils import printhello
printhello()

HELLO!


## Configure run variables

In [7]:
sep = ";"
dec = ","
quotech = "\""
encoding = "latin-1"


EP_dir = "EP2"
CSV_input_name = "ep2-train.csv"
path_to_archive = f"../../../../Traindata/{EP_dir}/{CSV_input_name}"


do_print = True
if do_print:
    print(f"Path to csv input is:  {path_to_archive}")

Path to csv input is:  ../../../../Traindata/EP2/ep2-train.csv


### Configure reproducibility variables

In [4]:
random_state = 12345

### List of best models

In [5]:
best_models_list = []

## Data preprocessing

### Import data from the CSV

In [8]:
df = pd.read_csv(
    path_to_archive,
    na_values=['na'],
    sep=sep,
    decimal=dec,
    quotechar=quotech,
    encoding=encoding,
    encoding_errors='strict',
)
print(df.shape)
print(df.columns)

(43678, 2)
Index(['req_text', 'profession'], dtype='object')


### Shuffling the data

The input .csv is heavily ordered by class. Feeding the rows to the models in that order during training introduces bias, so the data must be shuffled to ensure randomness.  
The splitter classes in sklearn.model_selection, such as the StratifiedKFold used later on, accept a `shuffle` parameter that can be set to True to shuffle the data further.

Note that it is also important to make the shuffle reproducible by specifying a hardcoded seed (in this case random_state=12345).
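The following is a small sketch of the `shuffle` option mentioned above, on made-up, class-ordered toy labels (not the EP2 data). It assumes the goal is a reproducible, stratified 5-fold split. When `cv` is given as an integer for a classifier, `GridSearchCV` uses `StratifiedKFold` without shuffling, so a splitter like the one below would have to be passed explicitly through `cv=`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

random_state = 12345

# Toy labels sorted by class, mimicking the ordering of the input CSV.
y = np.array([0] * 20 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)


def make_folds(seed):
    """Return the test indices of each fold for a shuffled, stratified split."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in cv.split(X, y)]


folds_a = make_folds(random_state)
folds_b = make_folds(random_state)

# Count how many samples of each class land in every test fold.
class_counts = [np.bincount(y[idx], minlength=2).tolist() for idx in folds_a]

print("Same folds with the same seed:", folds_a == folds_b)
print("Per-fold class counts:", class_counts)
```

With a fixed seed the folds are identical across runs, and every fold keeps the 2:1 class ratio of the toy labels.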

In [None]:
print("Shape before shuffle:", df.shape)

df = df.sample(frac=1, random_state=random_state).reset_index(drop=True) # DO NOT CHANGE random_state: it MUST be 12345, or the experiments will NO LONGER BE REPRODUCIBLE

print("Shape after shuffle:", df.shape)

Shape before shuffle: (43678, 2)
Shape after shuffle: (43678, 2)


### Data cleaning

In [None]:
from Lib.utils import clean_text
#def clean_text(text, do_lowercase: bool, rem_emails: bool, rem_urls: bool, normalize_whitespaces: bool):

df['req_text_cleaned'] = df['req_text'].apply(lambda row_text: clean_text(
        row_text, 
        do_lowercase=True, 
        rem_emails=True, 
        rem_urls=True, 
        normalize_whitespaces=True
    ))

df['req_text'] = df['req_text_cleaned']
df = df.drop(columns=['req_text_cleaned']) # Drop the temporary column
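The implementation of `Lib.utils.clean_text` is not shown in this notebook. As a hypothetical sketch only, a function with the same keyword arguments could look like the one below. The name `clean_text_sketch`, the regular expressions, and the handling of non-string values are all assumptions and may differ from the project's real helper.

```python
import re

# Assumed patterns; the project's real helper may use different ones.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text_sketch(text, do_lowercase=True, rem_emails=True,
                      rem_urls=True, normalize_whitespaces=True):
    """Hypothetical stand-in for Lib.utils.clean_text."""
    if not isinstance(text, str):
        # Assumption: missing values (NaN) become an empty string.
        return ""
    if do_lowercase:
        text = text.lower()
    if rem_urls:
        text = URL_RE.sub(" ", text)
    if rem_emails:
        text = EMAIL_RE.sub(" ", text)
    if normalize_whitespaces:
        # Collapse runs of whitespace and trim both ends.
        text = WHITESPACE_RE.sub(" ", text).strip()
    return text


sample = "Contact  ME at John.Doe@Example.com or https://example.org/x  now"
cleaned_sample = clean_text_sketch(sample)
print(cleaned_sample)
```

Removing URLs before e-mail addresses is a design choice of this sketch, so that an `@` inside a URL is not treated as part of an address.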

# Training the LogisticRegression models

## Imports

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest, chi2

## Training Features: Bag of Words, TF-IDF, Character n-grams

### Disable Warnings

In [None]:
import warnings
from sklearn.exceptions import ConvergenceWarning

# Ignore ConvergenceWarning warnings
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Ignore UserWarnings (e.g. those about l1_ratio)
warnings.filterwarnings("ignore", category=UserWarning)

### Feature: Bag of words 

#### Pipeline definition

In [None]:
BoW_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('kbest', SelectKBest(score_func=chi2)),
    ('classifier', LogisticRegression(class_weight='balanced')),
]) 

#### Training with the lbfgs solver

In [None]:
BoW_lbfgs_parameters = {
    'kbest__k': [10, 30, 50, 80, 100, 200, 500, 750, 1500, 4000, 10000, 'all'],
    'classifier__max_iter': [3000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs'],
}

In [None]:
BoW_lbfgs_classifier = GridSearchCV(BoW_pipeline, BoW_lbfgs_parameters, 
                                       cv=10, n_jobs=-1, scoring="accuracy", verbose=1, error_score = np.nan)
BoW_lbfgs_classifier.fit(df["req_text"].fillna(""), df["profession"].values)

In [None]:
print("Best mean accuracy:", BoW_lbfgs_classifier.best_score_)
print("Best parameters:", BoW_lbfgs_classifier.best_params_)

best_models_list.append({
    "features": "BoW",
    "solver": "lbfgs",
    "accuracy": BoW_lbfgs_classifier.best_score_,
    "params": BoW_lbfgs_classifier.best_params_,
})

#### Training with the liblinear solver

In [None]:
BoW_liblinear_parameters = {
    'kbest__k': [10, 30, 50, 80, 100, 200, 500, 750, 1500, 4000, 10000, 'all'],
    'classifier__max_iter': [3000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear'],
}

In [None]:
BoW_liblinear_classifier = GridSearchCV(BoW_pipeline, BoW_liblinear_parameters, 
                                       cv=10, n_jobs=-1, scoring="accuracy", verbose=1, error_score = np.nan)
BoW_liblinear_classifier.fit(df["req_text"].fillna(""), df["profession"].values)

In [None]:
print("Best mean accuracy:", BoW_liblinear_classifier.best_score_)
print("Best parameters:", BoW_liblinear_classifier.best_params_)

best_models_list.append({
    "features": "BoW",
    "solver": "liblinear",
    "accuracy": BoW_liblinear_classifier.best_score_,
    "params": BoW_liblinear_classifier.best_params_,
})

### Feature: TF-IDF

#### Pipeline definition

In [None]:
TF_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('kbest', SelectKBest(score_func=chi2)),
    ('classifier', LogisticRegression(class_weight='balanced')),
])

#### Training with the lbfgs solver

In [None]:
TF_lbfgs_parameters = {
    'tfidf__use_idf': [True, False],
    'kbest__k': [10, 30, 50, 80, 100, 200, 500, 750, 1500, 4000, 10000, 'all'],
    'classifier__max_iter': [3000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs'],
}
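The `tfidf__use_idf` switch in the grid above is what lets a single pipeline cover both TF and TF-IDF features. A minimal sketch on two made-up sentences (not the EP2 data) shows the difference. With the default `norm='l2'`, both variants produce unit-length rows, but only `use_idf=True` upweights terms that occur in fewer documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus: "the" and "sat" occur in both documents, "cat" in only one.
docs = ["the cat sat", "the dog sat down"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)
col = vect.vocabulary_  # maps each term to its column index

# use_idf=False: L2-normalized term frequencies only.
tf_only = TfidfTransformer(use_idf=False).fit_transform(counts).toarray()
# use_idf=True: term frequencies reweighted by inverse document frequency.
tf_idf = TfidfTransformer(use_idf=True).fit_transform(counts).toarray()

print("TF     cat vs sat:", tf_only[0, col["cat"]], tf_only[0, col["sat"]])
print("TF-IDF cat vs sat:", tf_idf[0, col["cat"]], tf_idf[0, col["sat"]])
```

In the first document every term appears once, so the TF weights are equal, while TF-IDF gives the rarer term "cat" a larger weight than "sat".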

In [None]:
TF_lbfgs_classifier = GridSearchCV(TF_pipeline, TF_lbfgs_parameters, 
                                       cv=10, n_jobs=-1, scoring="accuracy", verbose=1, error_score = np.nan)
TF_lbfgs_classifier.fit(df["req_text"].fillna(""), df["profession"].values)

In [None]:
print("Best mean accuracy:", TF_lbfgs_classifier.best_score_)
print("Best parameters:", TF_lbfgs_classifier.best_params_)

best_models_list.append({
    "features": "TF",
    "solver": "lbfgs",
    "accuracy": TF_lbfgs_classifier.best_score_,
    "params": TF_lbfgs_classifier.best_params_,
})

#### Training with the liblinear solver

In [None]:
TF_liblinear_parameters = {
    'tfidf__use_idf': [True, False],
    'kbest__k': [10, 30, 50, 80, 100, 200, 500, 750, 1500, 4000, 10000, 'all'],
    'classifier__max_iter': [3000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear'],
}

In [None]:
TF_liblinear_classifier = GridSearchCV(TF_pipeline, TF_liblinear_parameters, 
                                       cv=10, n_jobs=-1, scoring="accuracy", verbose=1, error_score = np.nan)
TF_liblinear_classifier.fit(df["req_text"].fillna(""), df["profession"].values)

In [None]:
print("Best mean accuracy:", TF_liblinear_classifier.best_score_)
print("Best parameters:", TF_liblinear_classifier.best_params_)

best_models_list.append({
    "features": "TF",
    "solver": "liblinear",
    "accuracy": TF_liblinear_classifier.best_score_,
    "params": TF_liblinear_classifier.best_params_,
})

### Feature: Character n-grams

#### Pipeline definition

In [None]:
NGram_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('kbest', SelectKBest(score_func=chi2)),
    ('classifier', LogisticRegression(class_weight='balanced')),
]) 

#### Training with the lbfgs solver

In [None]:
CHAR_NGram_lbfgs_parameters = { 
    'vect__ngram_range': [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11), (12, 12)],
    'vect__analyzer': ["char", "char_wb"],
    'kbest__k': [100, 200, 750, 4000, 10000, 'all'],
    'classifier__max_iter': [3000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs'],
}
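The grid above searches over two `CountVectorizer` analyzers. As a small illustration on a made-up string, `"char"` slides over the whole text, so n-grams can span word boundaries, whereas `"char_wb"` builds n-grams only inside each word and pads the word edges with a space. The sketch uses trigrams, which corresponds to the `(3, 3)` entry of `vect__ngram_range`.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "ab cd"

# build_analyzer() returns the callable that turns raw text into n-grams.
char_analyzer = CountVectorizer(analyzer="char", ngram_range=(3, 3)).build_analyzer()
char_wb_analyzer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).build_analyzer()

char_grams = char_analyzer(text)
char_wb_grams = char_wb_analyzer(text)

# repr() makes the leading and trailing spaces visible.
print("char:   ", [repr(g) for g in char_grams])
print("char_wb:", [repr(g) for g in char_wb_grams])
```

Only the `"char"` analyzer yields the trigram `"b c"`, which crosses the space between the two words.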

In [None]:
CHAR_NGram_lbfgs_classifier = GridSearchCV(NGram_pipeline, CHAR_NGram_lbfgs_parameters, 
                                       cv=10, n_jobs=2, scoring="accuracy", verbose=1, error_score = np.nan)
CHAR_NGram_lbfgs_classifier.fit(df["req_text"].fillna(""), df["profession"].values)

In [None]:
print("Best mean accuracy:", CHAR_NGram_lbfgs_classifier.best_score_)
print("Best parameters:", CHAR_NGram_lbfgs_classifier.best_params_)

best_models_list.append({
    "features": "CHAR-NGram",
    "solver": "lbfgs",
    "accuracy": CHAR_NGram_lbfgs_classifier.best_score_,
    "params": CHAR_NGram_lbfgs_classifier.best_params_,
})

#### Training with the liblinear solver

In [None]:
CHAR_NGram_liblinear_parameters = { 
    'vect__ngram_range': [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11), (12, 12)],
    'vect__analyzer': ["char", "char_wb"],
    'kbest__k': [10, 30, 50, 80, 100, 200, 500, 750, 1500, 3000, 10000, 'all'],
    'classifier__max_iter': [3000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear'],
}
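To gauge the cost of this search before launching it, the number of candidates can be estimated as the product of the lengths of the value lists, and the number of model fits as that count times the number of folds. This is a back-of-the-envelope sketch using a copy of the grid above; it does not account for the final refit on the full data.

```python
from math import prod

# Copy of the liblinear character n-gram grid defined above.
grid = {
    'vect__ngram_range': [(n, n) for n in range(1, 13)],
    'vect__analyzer': ["char", "char_wb"],
    'kbest__k': [10, 30, 50, 80, 100, 200, 500, 750, 1500, 3000, 10000, 'all'],
    'classifier__max_iter': [3000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear'],
}
cv_folds = 10

# Every combination of values is one candidate; each candidate is fit once per fold.
n_candidates = prod(len(values) for values in grid.values())
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

This is the largest grid in the notebook, so trimming `vect__ngram_range` or `kbest__k` is the most direct way to shorten the run.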

In [None]:
CHAR_NGram_liblinear_classifier = GridSearchCV(NGram_pipeline, CHAR_NGram_liblinear_parameters, 
                                       cv=10, n_jobs=2, scoring="accuracy", verbose=1, error_score = np.nan)
CHAR_NGram_liblinear_classifier.fit(df["req_text"].fillna(""), df["profession"].values)

In [None]:
print("Best mean accuracy:", CHAR_NGram_liblinear_classifier.best_score_)
print("Best parameters:", CHAR_NGram_liblinear_classifier.best_params_)

best_models_list.append({
    "features": "CHAR-NGram",
    "solver": "liblinear",
    "accuracy": CHAR_NGram_liblinear_classifier.best_score_,
    "params": CHAR_NGram_liblinear_classifier.best_params_,
})

## Training Features: Embeddings

### Algorithm and parameter declaration

In [None]:
logress = LogisticRegression(class_weight='balanced')

logress_parameters_lbfgs = {
    'max_iter': [3000],
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l2'],
    'solver': ['lbfgs']
}

logress_parameters_liblinear = {
    'max_iter': [3000],
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

### Feature: Embeddings BAAI-bge-3

#### Importing the embeddings

In [None]:
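# 1. Load the X and y data
# np.load expects a .npy/.npz file; this path ends in .ipynb, so confirm it points to the saved array
X_baai = np.load('../../Embeddings/npys/gen_baai_bge_3.ipynb')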
y_baai = df["profession"].values

Generating BAAI embeddings
Shape of the embeddings (X): (500, 1024)
embeddings generated
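Precomputed embeddings are only usable if they have exactly one row per example in `df`, in the same order as the shuffled DataFrame. The recorded output above shows 500 rows, while the CSV has 43678, so a row-count check before fitting can save a failed grid search. The sketch below is a hypothetical loader: the helper name `load_embeddings` and the temporary demo file are assumptions, and a row-count check cannot detect a mismatch in row order.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_embeddings(npy_path, expected_rows):
    """Load a 2-D embedding matrix from a .npy file and validate its row count."""
    X = np.load(npy_path)
    if X.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {X.shape}")
    if X.shape[0] != expected_rows:
        raise ValueError(
            f"embeddings have {X.shape[0]} rows, but {expected_rows} labels were given"
        )
    return X


# Demo with a throwaway file holding random numbers (not real embeddings).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_embeddings.npy"
    np.save(demo_path, np.random.default_rng(0).normal(size=(5, 4)))

    X_demo = load_embeddings(demo_path, expected_rows=5)
    try:
        load_embeddings(demo_path, expected_rows=6)
        mismatch_message = None
    except ValueError as exc:
        mismatch_message = str(exc)

print(X_demo.shape)
print(mismatch_message)
```

In this notebook the call would take `expected_rows=len(df)`, assuming the embeddings were generated from the same shuffled rows.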


#### Training with the lbfgs solver

In [None]:
logress_lbfgs_baai = GridSearchCV(logress, logress_parameters_lbfgs, 
                                       cv=10, n_jobs=-1, scoring="accuracy", verbose=1, error_score=np.nan)
# 3. Fit (train) using the precomputed embeddings
logress_lbfgs_baai.fit(X_baai, y_baai)

In [None]:
print("Best mean accuracy:", logress_lbfgs_baai.best_score_)
print("Best parameters:", logress_lbfgs_baai.best_params_)

best_models_list.append({
    "features": "embeddings baai-bge-3",
    "solver": "lbfgs",
    "accuracy": logress_lbfgs_baai.best_score_,
    "params": logress_lbfgs_baai.best_params_,
})

#### Training with the liblinear solver

In [None]:
logress_liblinear_baai = GridSearchCV(logress, logress_parameters_liblinear, 
                                       cv=10, n_jobs=-1, scoring="accuracy", verbose=1, error_score=np.nan)
# 3. Fit (train) using the precomputed embeddings
logress_liblinear_baai.fit(X_baai, y_baai)

In [None]:
print("Best mean accuracy:", logress_liblinear_baai.best_score_)
print("Best parameters:", logress_liblinear_baai.best_params_)

best_models_list.append({
    "features": "embeddings baai-bge-3",
    "solver": "liblinear",
    "accuracy": logress_liblinear_baai.best_score_,
    "params": logress_liblinear_baai.best_params_,
})

### Feature: Embeddings Google-Gemma-300m

#### Importing the embeddings

In [None]:
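# 1. Load the X and y data
# np.load expects a .npy/.npz file; this path ends in .ipynb, so confirm it points to the saved array
X_gemma = np.load('../../Embeddings/npys/gen_google_gemma_300m.ipynb')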
y_gemma = df["profession"].values

#### Training with the lbfgs solver

In [None]:
logress_lbfgs_gemma = GridSearchCV(logress, logress_parameters_lbfgs, 
                                       cv=10, n_jobs=-1, scoring="accuracy", verbose=1, error_score=np.nan)
# 3. Fit (train) using the precomputed embeddings
logress_lbfgs_gemma.fit(X_gemma, y_gemma)

In [None]:
print("Best mean accuracy:", logress_lbfgs_gemma.best_score_)
print("Best parameters:", logress_lbfgs_gemma.best_params_)

best_models_list.append({
    "features": "embeddings google-gemma-300m",
    "solver": "lbfgs",
    "accuracy": logress_lbfgs_gemma.best_score_,
    "params": logress_lbfgs_gemma.best_params_,
})

#### Training with the liblinear solver

In [None]:
logress_liblinear_gemma = GridSearchCV(logress, logress_parameters_liblinear, 
                                       cv=10, n_jobs=-1, scoring="accuracy", verbose=1, error_score=np.nan)
# 3. Fit (train) using the precomputed embeddings
logress_liblinear_gemma.fit(X_gemma, y_gemma)

In [None]:
print("Best mean accuracy:", logress_liblinear_gemma.best_score_)
print("Best parameters:", logress_liblinear_gemma.best_params_)

best_models_list.append({
    "features": "embeddings google-gemma-300m",
    "solver": "liblinear",
    "accuracy": logress_liblinear_gemma.best_score_,
    "params": logress_liblinear_gemma.best_params_,
})

# Select the best parameters

In [None]:
best_score = -1
best = 0
for idx, candidate in enumerate(best_models_list):
    if candidate["accuracy"] > best_score:
        best = idx
        best_score = candidate["accuracy"]

print(f"The best classifier found by the pipelines is -->    feature={best_models_list[best]['features']} + solver={best_models_list[best]['solver']}\n")
print(f"Best accuracy found:  {best_models_list[best]['accuracy']}")
print(f"Best parameters found:  {best_models_list[best]['params']}")
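The selection loop above can also be written with `max` and a key function. The sketch below is one possible variant, not the notebook's own code: `pick_best` is a hypothetical helper, and the accuracy values in `demo_models` are invented placeholders, not results from these pipelines. It skips entries whose score is NaN, which can happen because the searches use `error_score=np.nan`.

```python
import math


def pick_best(candidates):
    """Return the candidate with the highest non-NaN accuracy."""
    valid = [c for c in candidates if not math.isnan(c["accuracy"])]
    if not valid:
        raise ValueError("no candidate has a valid accuracy")
    # On ties, max() keeps the first candidate, like the strict '>' in the loop.
    return max(valid, key=lambda c: c["accuracy"])


# Placeholder entries with made-up scores, shaped like best_models_list items.
demo_models = [
    {"features": "BoW", "solver": "lbfgs", "accuracy": 0.70, "params": {}},
    {"features": "TF", "solver": "liblinear", "accuracy": float("nan"), "params": {}},
    {"features": "CHAR-NGram", "solver": "lbfgs", "accuracy": 0.75, "params": {}},
]

best_demo = pick_best(demo_models)
print(f"feature={best_demo['features']} + solver={best_demo['solver']}")
```

Raising an error on an empty or all-NaN list makes a failed run visible instead of silently reporting the first entry.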