# Training pipelines: LogisticRegression

This notebook contains the training pipelines used to obtain the best classifier for the problem developed in EP1.

The LogisticRegression model accepts many different parameterizations (see the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression). Here we explore three: *C*, *penalty*, and *solver*.

The trained pipelines are organized by **features + LogisticRegression:solver**.
That is, there is one pipeline for each pair of *feature* and *solver* explored here.

The **features** we explore are:

1. Bag of Words  
2. TF/TF-IDF  
3. Word NGrams  
4. Char NGrams  

The *LogisticRegression* **solvers** we explore are:

1. lbfgs  
2. liblinear  
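
As a rough sketch of the feature and solver pairing described above, the combinations can be enumerated with the standard library. The labels are the ones this notebook stores in best_models_list; the variable names `features`, `solvers` and `runs` are illustrative only.

```python
from itertools import product

# Feature families and solvers, labelled as in best_models_list.
features = ["BoW", "TF", "WORD-NGram", "CHAR-NGram"]
solvers = ["lbfgs", "liblinear"]

# One grid search is run for each (feature, solver) pair.
runs = list(product(features, solvers))
for feature, solver in runs:
    print(f"{feature} + {solver}")
```

This yields eight pairs, matching the eight grid searches run in the sections below.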

A pre-parameterized run of the best model found by these pipelines is available in the notebook model.ipynb, **which is our submission version of EP1**.

## Bootstrap Imports

In [None]:
import pandas as pd
import numpy as np

import os
import sys
from pathlib import Path

### Import test: the project's Lib.utils

In [None]:
filedir = Path(os.getcwd())
base_path = filedir.resolve().parents[3]
sys.path.append(str(base_path))

from Lib.utils import printhello
printhello()

## Configure execution variables

In [None]:
sep = ";"
dec = ","
quotech = "\""
encoding = "utf8"


EP_dir = "EP1"
CSV_input_name = "train_arcaico_moderno.csv"
path_to_archive = f"../../../../Traindata/{EP_dir}/{CSV_input_name}"


do_print = True
if do_print:
    print(f"Path to csv input is:  {path_to_archive}")

### Configure reproducibility variables

In [None]:
random_state = 12345

### List of best models

In [None]:
best_models_list = []

## Data preprocessing

### Import data from the CSV

In [None]:
df = pd.read_csv(
    path_to_archive,
    na_values=['na'],
    sep=sep,
    decimal=dec,
    quotechar=quotech,
    encoding=encoding,
    encoding_errors='strict',
)
print(df.shape)
print(df.columns)

### Shuffling the data

The input .csv is heavily ordered by class. Feeding the rows to the models in that order introduces bias, so the data must be shuffled first.
The splitter classes in sklearn.model_selection, such as StratifiedKFold (which GridSearchCV uses for classifiers when cv is an integer, as it is further below), accept a shuffle parameter that can be set to True to shuffle the data again before splitting. With an integer cv that extra shuffle is off, so the shuffle done here is the only one applied.

It is also important to make the shuffle reproducible by fixing the seed to a hardcoded value (here, random_state=100).

In [None]:
print("Shape before shuffle:", df.shape)

df = df.sample(frac=1, random_state=100).reset_index(drop=True)

print("Shape after shuffle:", df.shape)
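
To illustrate why the fixed seed matters, here is a minimal standard-library sketch. It is only a stand-in for df.sample: the helper `shuffled` and the toy labels `class_a` and `class_b` are hypothetical and are not taken from the dataset.

```python
import random

def shuffled(rows, seed):
    """Return a shuffled copy of rows using a dedicated, seeded generator."""
    rng = random.Random(seed)
    out = list(rows)
    rng.shuffle(out)
    return out

# Toy stand-in for a class-sorted input: one label first, then the other.
labels = ["class_a"] * 5 + ["class_b"] * 5

first = shuffled(labels, seed=100)
second = shuffled(labels, seed=100)

# Same seed, same permutation: the run can be repeated exactly.
print(first == second)
```

The same reasoning applies to df.sample(frac=1, random_state=100): rerunning the cell with the same seed should reproduce the same row order.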

# Pipelines

## Imports

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import ParameterGrid

## Disable warnings

In [None]:
import warnings
from sklearn.exceptions import ConvergenceWarning

# Ignore ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Ignore all UserWarnings (e.g. those related to l1_ratio)
warnings.filterwarnings("ignore", category=UserWarning)
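
The filters above stay active for the rest of the session. If a narrower scope is preferred, one possible alternative is the standard-library warnings.catch_warnings context manager, sketched below. The function `noisy_step` is a hypothetical stand-in for a fit call and is not part of this project.

```python
import warnings

def noisy_step():
    """Hypothetical stand-in for a fit call that emits a UserWarning."""
    warnings.warn("example warning raised during fitting", UserWarning)
    return "done"

with warnings.catch_warnings():
    # Filters set inside this block are undone when the block exits.
    warnings.simplefilter("ignore", category=UserWarning)
    result = noisy_step()

print(result)
```

This keeps unrelated warnings visible in later cells, at the cost of wrapping each fit call.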

## Feature: Bag of words 

#### Pipeline definition

In [None]:
BoW_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('kbest', SelectKBest(score_func=chi2)),
    ('classifier', LogisticRegression(class_weight='balanced')),
]) 

#### Solver lbfgs: BoW

In [None]:
BoW_lbfgs_parameters = {
    'kbest__k': [10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 200, 750, 4000, 10000, 'all'],
    'classifier__max_iter': [1000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs'],
}


BoW_lbfgs_classifier = GridSearchCV(BoW_pipeline, BoW_lbfgs_parameters, 
                                       cv=10, n_jobs=-1, scoring="accuracy", verbose=1, error_score = np.nan)
BoW_lbfgs_classifier.fit(df["text"].fillna(""), df["style"].values)
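
To get a feel for the cost of the search above, the number of candidates can be estimated by multiplying the lengths of the value lists in the grid. This is a sketch using only the standard library; the names `grid`, `n_candidates`, `n_folds` and `n_fits` are illustrative. scikit-learn's ParameterGrid, imported earlier, reports the same candidate count through len().

```python
from math import prod

# Same values as BoW_lbfgs_parameters.
grid = {
    'kbest__k': [10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90,
                 100, 200, 750, 4000, 10000, 'all'],
    'classifier__max_iter': [1000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs'],
}

# An exhaustive grid search evaluates every combination of values.
n_candidates = prod(len(values) for values in grid.values())

# Each candidate is fitted once per cross-validation fold (cv=10).
n_folds = 10
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Since refit defaults to True in GridSearchCV, one more fit on the full data follows the cross-validated fits.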

In [None]:
print("Best mean accuracy:", BoW_lbfgs_classifier.best_score_)
print("Best parameters:", BoW_lbfgs_classifier.best_params_)

best_models_list.append({
    "features": "BoW",
    "solver": "lbfgs",
    "accuracy": BoW_lbfgs_classifier.best_score_,
    "params": BoW_lbfgs_classifier.best_params_,
})

#### Solver liblinear: BoW

In [None]:
BoW_liblinear_parameters = {
    'kbest__k': [10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 200, 750, 4000, 10000, 'all'],
    'classifier__max_iter': [1000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear'],
}


BoW_liblinear_classifier = GridSearchCV(BoW_pipeline, BoW_liblinear_parameters, 
                                       cv=10, n_jobs=-1, scoring="accuracy", verbose=1, error_score = np.nan)
BoW_liblinear_classifier.fit(df["text"].fillna(""), df["style"].values)

In [None]:
print("Best mean accuracy:", BoW_liblinear_classifier.best_score_)
print("Best parameters:", BoW_liblinear_classifier.best_params_)

best_models_list.append({
    "features": "BoW",
    "solver": "liblinear",
    "accuracy": BoW_liblinear_classifier.best_score_,
    "params": BoW_liblinear_classifier.best_params_,
})

## Feature: TF and TF-IDF

#### Pipeline definition

In [None]:
TF_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('kbest', SelectKBest(score_func=chi2)),
    ('classifier', LogisticRegression(class_weight='balanced')),
])
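
The grids for this pipeline toggle tfidf__use_idf, which switches between plain term frequency and TF-IDF weighting. The sketch below shows the difference in pure Python, assuming scikit-learn's default smoothed idf formula, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. The helper `tfidf_rows` and the toy corpus are hypothetical.

```python
import math

def tfidf_rows(docs, use_idf=True):
    """Return one L2-normalized {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # Smoothed idf (assumed to match the smooth_idf=True default).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}

    rows = []
    for tokens in docs:
        counts = {}
        for term in tokens:
            counts[term] = counts.get(term, 0) + 1
        weights = {t: c * (idf[t] if use_idf else 1.0) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else weights)
    return rows

# Hypothetical toy corpus: "a" is in every document, "b" in only one.
docs = [["a", "b"], ["a", "c"], ["a", "a", "d"]]
tf_only = tfidf_rows(docs, use_idf=False)
tf_idf = tfidf_rows(docs, use_idf=True)

print(tf_only[0])
print(tf_idf[0])
```

With use_idf=False both terms of the first document get the same weight; with use_idf=True the rarer term "b" outweighs the ubiquitous "a".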

#### Solver lbfgs: TF/TF-IDF

In [None]:
TF_lbfgs_parameters = {
    'tfidf__use_idf': [True, False],
    'kbest__k': [10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 200, 750, 4000, 10000, 'all'],
    'classifier__max_iter': [1000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs'],
}

TF_lbfgs_classifier = GridSearchCV(TF_pipeline, TF_lbfgs_parameters, 
                                       cv=10, n_jobs=-1, scoring="accuracy", verbose=1, error_score = np.nan)
TF_lbfgs_classifier.fit(df["text"].fillna(""), df["style"].values)

In [None]:
print("Best mean accuracy:", TF_lbfgs_classifier.best_score_)
print("Best parameters:", TF_lbfgs_classifier.best_params_)

best_models_list.append({
    "features": "TF",
    "solver": "lbfgs",
    "accuracy": TF_lbfgs_classifier.best_score_,
    "params": TF_lbfgs_classifier.best_params_,
})

#### Solver liblinear: TF/TF-IDF

In [None]:
TF_liblinear_parameters = {
    'tfidf__use_idf': [True, False],
    'kbest__k': [10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 200, 750, 4000, 10000, 'all'],
    'classifier__max_iter': [1000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear'],
}

TF_liblinear_classifier = GridSearchCV(TF_pipeline, TF_liblinear_parameters, 
                                       cv=10, n_jobs=-1, scoring="accuracy", verbose=1, error_score = np.nan)
TF_liblinear_classifier.fit(df["text"].fillna(""), df["style"].values)

In [None]:
print("Best mean accuracy:", TF_liblinear_classifier.best_score_)
print("Best parameters:", TF_liblinear_classifier.best_params_)

best_models_list.append({
    "features": "TF",
    "solver": "liblinear",
    "accuracy": TF_liblinear_classifier.best_score_,
    "params": TF_liblinear_classifier.best_params_,
})

## Feature: N-grams

#### Pipeline definition

In [None]:
NGram_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('kbest', SelectKBest(score_func=chi2)),
    ('classifier', LogisticRegression(class_weight='balanced')),
]) 
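
The grids in this section vary vect__ngram_range and vect__analyzer. As a simplified illustration of what an n-gram range selects, the sketch below builds word and character n-grams in pure Python. It assumes whitespace tokenization and lowercasing only; CountVectorizer applies its own token pattern, and char_wb additionally pads at word boundaries, so its real output can differ. The helpers `word_ngrams` and `char_ngrams` are hypothetical.

```python
def word_ngrams(text, ngram_range):
    """Word n-grams for every n in the inclusive range (lo, hi)."""
    tokens = text.lower().split()
    lo, hi = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]

def char_ngrams(text, ngram_range):
    """Character n-grams for every n in the inclusive range (lo, hi)."""
    text = text.lower()
    lo, hi = ngram_range
    return [
        text[i:i + n]
        for n in range(lo, hi + 1)
        for i in range(len(text) - n + 1)
    ]

print(word_ngrams("the quick fox", (1, 2)))
print(char_ngrams("abcd", (2, 3)))
```

A range such as (1, 2) keeps unigrams and bigrams together, while (2, 2) keeps bigrams only, which is why the vocabulary size and memory use grow quickly with wider ranges.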

### WORD NGRAMS

#### Solver lbfgs: Word NGrams

In [None]:
WORD_NGram_lbfgs_parameters = { 
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2), (3, 3)],
    'vect__analyzer': ["word"],
    'kbest__k': [10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 200, 750, 4000, 10000, 'all'],
    'classifier__max_iter': [1000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs'],
}


WORD_NGram_lbfgs_classifier = GridSearchCV(NGram_pipeline, WORD_NGram_lbfgs_parameters, 
                                       cv=10, n_jobs=2, scoring="accuracy", verbose=1, error_score = np.nan)
WORD_NGram_lbfgs_classifier.fit(df["text"].fillna(""), df["style"].values)

In [None]:
print("Best mean accuracy:", WORD_NGram_lbfgs_classifier.best_score_)
print("Best parameters:", WORD_NGram_lbfgs_classifier.best_params_)

best_models_list.append({
    "features": "WORD-NGram",
    "solver": "lbfgs",
    "accuracy": WORD_NGram_lbfgs_classifier.best_score_,
    "params": WORD_NGram_lbfgs_classifier.best_params_,
})

#### Solver liblinear: Word NGrams

In [None]:
WORD_NGram_liblinear_parameters = { 
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2), (3, 3)],
    'vect__analyzer': ["word"],
    'kbest__k': [100, 200, 750, 4000, 10000, 'all'],  # Reduced grid so the search fits in the available memory
    'classifier__max_iter': [1000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear'],
}

WORD_NGram_liblinear_classifier = GridSearchCV(NGram_pipeline, WORD_NGram_liblinear_parameters, 
                                       cv=10, n_jobs=2, scoring="accuracy", verbose=1, error_score = np.nan)
WORD_NGram_liblinear_classifier.fit(df["text"].fillna(""), df["style"].values)

In [None]:
print("Best mean accuracy:", WORD_NGram_liblinear_classifier.best_score_)
print("Best parameters:", WORD_NGram_liblinear_classifier.best_params_)

best_models_list.append({
    "features": "WORD-NGram",
    "solver": "liblinear",
    "accuracy": WORD_NGram_liblinear_classifier.best_score_,
    "params": WORD_NGram_liblinear_classifier.best_params_,
})

### CHAR NGRAMS

#### Solver lbfgs: Char NGrams

In [None]:
CHAR_NGram_lbfgs_parameters = { 
    'vect__ngram_range': [(1, 1), (2, 2), (3, 3), (4, 4)],
    'vect__analyzer': ["char", "char_wb"],
    'kbest__k': [100, 200, 750, 4000, 10000, 'all'],
    'classifier__max_iter': [1000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs'],
}


CHAR_NGram_lbfgs_classifier = GridSearchCV(NGram_pipeline, CHAR_NGram_lbfgs_parameters, 
                                       cv=10, n_jobs=2, scoring="accuracy", verbose=1, error_score = np.nan)
CHAR_NGram_lbfgs_classifier.fit(df["text"].fillna(""), df["style"].values)

In [None]:
print("Best mean accuracy:", CHAR_NGram_lbfgs_classifier.best_score_)
print("Best parameters:", CHAR_NGram_lbfgs_classifier.best_params_)

best_models_list.append({
    "features": "CHAR-NGram",
    "solver": "lbfgs",
    "accuracy": CHAR_NGram_lbfgs_classifier.best_score_,
    "params": CHAR_NGram_lbfgs_classifier.best_params_,
})

#### Solver liblinear: Char NGrams

In [None]:
CHAR_NGram_liblinear_parameters = { 
    'vect__ngram_range': [(1, 1), (2, 2), (3, 3), (4, 4)],
    'vect__analyzer': ["char", "char_wb"],
    'kbest__k': [10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 200, 750, 4000, 10000, 'all'],
    'classifier__max_iter': [1000],
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear'],
}


CHAR_NGram_liblinear_classifier = GridSearchCV(NGram_pipeline, CHAR_NGram_liblinear_parameters, 
                                       cv=10, n_jobs=2, scoring="accuracy", verbose=1, error_score = np.nan)
CHAR_NGram_liblinear_classifier.fit(df["text"].fillna(""), df["style"].values)

In [None]:
print("Best mean accuracy:", CHAR_NGram_liblinear_classifier.best_score_)
print("Best parameters:", CHAR_NGram_liblinear_classifier.best_params_)

best_models_list.append({
    "features": "CHAR-NGram",
    "solver": "liblinear",
    "accuracy": CHAR_NGram_liblinear_classifier.best_score_,
    "params": CHAR_NGram_liblinear_classifier.best_params_,
})

## Select the best parameters

In [None]:
best_score = -1
best = 0
for idx, candidate in enumerate(best_models_list):
    if candidate["accuracy"] > best_score:
        best = idx
        best_score = candidate["accuracy"]

# Single quotes inside the f-string keep this valid on Python versions before 3.12.
print(f"The best classifier found by the pipelines is -->    feature={best_models_list[best]['features']} + solver={best_models_list[best]['solver']}\n")
print(f"Best accuracy found:  {best_models_list[best]['accuracy']}")
print(f"Best parameters found:  {best_models_list[best]['params']}")
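
The selection loop above could also be written with the built-in max and a key function. The sketch below uses placeholder entries shaped like those in best_models_list; the accuracy values are made up for illustration and are not results from this notebook.

```python
# Placeholder entries with made-up accuracies, shaped like best_models_list.
candidates = [
    {"features": "BoW", "solver": "lbfgs", "accuracy": 0.80, "params": {}},
    {"features": "TF", "solver": "liblinear", "accuracy": 0.85, "params": {}},
    {"features": "CHAR-NGram", "solver": "lbfgs", "accuracy": 0.82, "params": {}},
]

# max returns the first entry with the highest accuracy, which matches
# the strict ">" comparison used in the loop.
best_entry = max(candidates, key=lambda entry: entry["accuracy"])

print(best_entry["features"], best_entry["solver"], best_entry["accuracy"])
```

Working with the entry itself rather than its index avoids repeating the best_models_list[best] lookup in each print.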