<a href="https://colab.research.google.com/github/TNK443/RecPadroes/blob/main/07_Generalizacao_AjusteCaractTITANIC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Feature Tuning**

## This assignment has two parts:

### **PART 1.**

Modify the shared code so that the classification performance under RepeatedKFold reaches a higher mean score (at least 0.85). This can be done simply by choosing a set of parameters in the grid search, by selecting better attributes, or by preprocessing the features better.

In [1]:
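import pandas as pd

# Kaggle Titanic files: train.csv has every column of test.csv plus the target.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X = train[list(test.columns)]
# Target: the single column of train that test lacks ('Survived'),
# squeezed to a 1-D Series so scikit-learn does not warn about a column vector.
y = train[train.columns[~train.columns.isin(test.columns)]].squeeze(axis=1)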

In [2]:
from sklearn.base import BaseEstimator, TransformerMixin

def extraiPronome(nome):
    return nome.split(',')[1].split('.')[0].strip()

class AtributosDesejados(BaseEstimator, TransformerMixin):
    def __init__(self, excluirName=True):
        self.excluirName = excluirName
    def fit(self, X, y=None):
        self.colunasIndesejadas = ['PassengerId','Ticket','Cabin']#,'Fare','Embarked']
        if self.excluirName:
            self.colunasIndesejadas.append('Name')
        return self

    def transform(self, X, y=None):
        Xdrop = X.drop(self.colunasIndesejadas,axis=1)

        if 'Name' not in self.colunasIndesejadas:
            # Xdrop['Name'] = Xdrop['Name'].apply(extraiPronome)
            # Raw string so that '\.' is a literal dot and not an invalid string escape.
            Xdrop['Name'] = Xdrop['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
            title_mapping = {"Mr":0, 
                             "Miss":1, "Mlle":1, "Ms":1, "Lady":1,
                             "Mrs":2, "Mme":2, "Sir":2,
                             "Master":3, "Dr":3, "Rev":3, "Col":3, "Major":3, "Countess":3,
                             "Jonkheer":3, "Don":3, "Dona":3, "Capt":3}
            Xdrop['Name'] = Xdrop['Name'].map(title_mapping)

        if 'Age' not in self.colunasIndesejadas:
            Xdrop.loc[ Xdrop['Age'] <= 16, 'Age'] = 0
            Xdrop.loc[(Xdrop['Age'] > 16) & (Xdrop['Age'] <= 26), 'Age'] = 1
            Xdrop.loc[(Xdrop['Age'] > 26) & (Xdrop['Age'] <= 40), 'Age'] = 2
            Xdrop.loc[(Xdrop['Age'] > 40) & (Xdrop['Age'] <= 60), 'Age'] = 3
            Xdrop.loc[ Xdrop['Age'] > 60, 'Age'] = 4

        return Xdrop
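
To make the two encodings in `AtributosDesejados.transform` concrete, here is a minimal sketch that applies the same regular expression to a few passenger names and reproduces the age bands. The sample names, the sample ages, and the shortened `title_mapping` are illustrative assumptions, not part of the notebook, and `pd.cut` is used as a compact stand-in for the chain of `.loc` assignments rather than as the notebook's own method. Titles missing from the mapping and missing ages stay `NaN`, which the median imputer in the numeric pipeline fills later.

```python
import numpy as np
import pandas as pd

# Illustrative names in the Titanic "Surname, Title. Given names" format (assumed sample).
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Palsson, Master. Gosta Leonard",
    "Doe, Archduke. John",  # hypothetical title that is not in the mapping
])

# Same pattern as the transformer: a space, a run of letters, then a literal dot.
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

# Abbreviated copy of the notebook's mapping (only four entries shown).
title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3}
title_codes = titles.map(title_mapping)  # unmapped titles become NaN

# Age bands (-inf, 16], (16, 26], (26, 40], (40, 60], (60, inf) coded 0..4.
ages = pd.Series([4.0, 16.0, 22.0, 35.0, 50.0, 70.0, np.nan])
age_bins = pd.cut(ages, bins=[-np.inf, 16, 26, 40, 60, np.inf], labels=False)

print(titles.tolist())
print(title_codes.tolist())
print(age_bins.tolist())
```

Both encodings produce numeric columns, so `AtributosNumericos` picks them up and they are standardized together with the other numeric features.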

In [3]:
from sklearn.base import BaseEstimator, TransformerMixin

class AtributosNumericos(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.colunasNumericas = X.select_dtypes(include='number').columns
        return self
    def transform(self, X, y=None):
        return X[self.colunasNumericas].to_numpy()

In [4]:
from sklearn.base import BaseEstimator, TransformerMixin

class AtributosCategoricos(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.colunasCategoricas = X.select_dtypes(include='object').columns
        return self
    def transform(self, X, y=None):
        return X[self.colunasCategoricas].to_numpy()

In [5]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion

trataAtributos = Pipeline([
    ('unecaracteristicas', FeatureUnion([
        ('pipenum', Pipeline([
            ('atributos_numericos', AtributosNumericos()),
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ])),
        ('pipecat', Pipeline([
            ('atributos_categoricos', AtributosCategoricos()),
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]))
    ])),
])

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_validate, RepeatedKFold
import numpy as np

pipetotal = Pipeline([
    ('atributosDesejados', AtributosDesejados()),
    ('trataAtributos', trataAtributos),
    ('classificador', RandomForestClassifier())
])

parametros = {
    'atributosDesejados__excluirName': [False],

    # RandomForestClassifier parameters
    'classificador__n_estimators':[7],                  #int, default=100 [7,11]
    'classificador__criterion': ['entropy',],           #{'gini', 'entropy'}, default='gini'
    'classificador__max_depth': [4]                    #int, default=None
    # 'classificador__min_samples_split':[7],             #int or float, default=2
    # 'classificador__min_samples_leaf': [1]              #int or float, default=1
    # 'classificador__min_weight_fraction_leaf': [0.0], #float, default=0.0
    # 'classificador__max_features': ['auto'],          #{'auto', 'sqrt', 'log2'}, int or float, default='auto'
    # 'classificador__max_leaf_nodes': [],              #int, default=None
    # 'classificador__min_impurity_decrease': [0.0],    #float, default=0.0
    # 'classificador__min_impurity_split': [],          #float, default=None
    # 'classificador__bootstrap': [True],               #bool, default=True
    # 'classificador__oob_score': [False],              #bool, default=False
    # 'classificador__n_jobs': [],                      #int, default=None
    # 'classificador__random_state': [],                #int, RandomState instance or None, default=None
    # 'classificador__verbose': [0],                    #int, default=0
    # 'classificador__warm_start': [False],             #bool, default=False
    # 'classificador__class_weight': [],                #{'balanced', 'balanced_subsample'}, dict or list of dicts, default=None
    # 'classificador__ccp_alpha': [0.0],                #non-negative float, default=0.0
    # 'classificador__max_samples': []                  #int or float, default=None
}
modelo = GridSearchCV(pipetotal, param_grid=parametros
                      # GridSearchCV parameters
                      ,scoring='roc_auc' #'roc_auc', #str, callable, list, tuple, or dict, default=None
                              # For Classification:[    
                              #'accuracy'(83),'balanced_accuracy'(80),'top_k_accuracy'(fail),'average_precision'(86),'neg_brier_score'(-12),
                              #'f1'(75),'f1_micro'(82),'f1_macro'(81),'f1_weighted'(83),'f1_samples'(fail),
                              #'neg_log_loss'(-42),'precision'(80),'recall'(73),'jaccard'(62),
                              #'roc_auc'(87),'roc_auc_ovr'(86),'roc_auc_ovo'(86),'roc_auc_ovr_weighted'(87),'roc_auc_ovo_weighted'(86)]
                      ,n_jobs=-1    #int, default=None || -1 means using all processors. 
                      # ,pre_dispatch='2*n_jobs') #int or str, default='2*n_jobs' 
                      # ,cv=RepeatedKFold(  # RepeatedKFold parameters
                                          # n_splits=10,     #int, default=5
                                          # n_repeats=5,     #int, default=10
                                          # random_state=None  #int, RandomState instance or None, default=None
                                        # )
                      # ,refit            #bool, str, or callable, default=True
                      # ,verbose          #int
                      # ,error_score      #'raise' or numeric, default=np.nan
                      # ,return_train_score #bool, default=False
                      )

scores = cross_validate(modelo, X, y, cv=RepeatedKFold(),n_jobs=-1)
scores['test_score'], np.mean(scores['test_score']), np.std(scores['test_score'])

(array([0.87662253, 0.83084577, 0.84839317, 0.88526507, 0.85984427,
        0.86175918, 0.86046967, 0.81108974, 0.86804693, 0.9253337 ,
        0.82967491, 0.87108974, 0.8586624 , 0.88562834, 0.84083496,
        0.85131062, 0.88216991, 0.89540423, 0.82281746, 0.88724919,
        0.86981013, 0.8745414 , 0.84854715, 0.88636682, 0.82068452,
        0.87522523, 0.81288981, 0.88702725, 0.86187771, 0.85186688,
        0.88641026, 0.89596861, 0.80530428, 0.85964912, 0.86862361,
        0.82781672, 0.88398896, 0.87229092, 0.8863853 , 0.85401174,
        0.83819102, 0.81739075, 0.86074561, 0.88825832, 0.91087571,
        0.84525702, 0.84510281, 0.90002781, 0.8665107 , 0.86673077]),
 0.8624177752892659,
 0.02669848713671924)

**np.mean(scores['test_score']): 0.8624177752892659** (mean ROC AUC, because the grid search is configured with `scoring='roc_auc'`)
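
Because `cross_validate` is called without a `scoring` argument, it falls back to the estimator's own `score` method, and a `GridSearchCV` built with `scoring='roc_auc'` scores with ROC AUC. The array above therefore holds 50 ROC AUC values, since the `RepeatedKFold()` defaults are 5 splits repeated 10 times. The sketch below shows one way to report accuracy and ROC AUC side by side. It runs on synthetic data from `make_classification` as a stand-in for the Titanic files; the dataset, the `random_state` values, and the `*_demo` names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, cross_validate

# Synthetic binary problem standing in for the Titanic data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Same hyperparameters as the grid above, with a fixed seed for repeatability.
clf = RandomForestClassifier(n_estimators=7, criterion='entropy',
                             max_depth=4, random_state=0)

# RepeatedKFold defaults: n_splits=5, n_repeats=10, i.e. 50 train/test splits.
cv = RepeatedKFold(random_state=0)

# Request both metrics explicitly so the reported number is unambiguous.
demo_scores = cross_validate(clf, X_demo, y_demo, cv=cv,
                             scoring=['accuracy', 'roc_auc'])

for metric in ('accuracy', 'roc_auc'):
    values = demo_scores[f'test_{metric}']
    print(f'{metric}: mean={values.mean():.3f} std={values.std():.3f} n={len(values)}')
```

With a list of metrics, the result dictionary has one `test_<metric>` key per metric instead of the single `test_score` key used above.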

### **PART 2.**

Resubmit the tuned model's predictions to Kaggle. Was the new model's performance superior, inferior, or equal (difference smaller than 0.03)?

In [None]:
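from IPython.display import clear_output

modelo.fit(X, y)
y_pred = modelo.predict(test)
# Copy the id column so that adding 'Survived' does not write into a slice of `test`.
result = test[['PassengerId']].copy()
result['Survived'] = y_pred
result.to_csv('submission.csv', index=False)
clear_output()  # hide the fit/predict output in the notebook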

**Was the new model's performance superior, inferior, or equal (difference smaller than 0.03)?**

The **Kaggle score** of the **previous model** (the initial code provided by the professor) **was approximately 0.77**. The **new model** presented above (after several rounds of tuning and testing) reached a score of **at most approximately 0.80134**, a **difference of roughly 0.03**.
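
The comparison can be checked with a few lines of arithmetic. This is only a sketch using the two scores quoted above; the variable names are illustrative, and because 0.77 is itself an approximation, the result sits right at the 0.03 threshold and should be read with that caveat.

```python
# Scores quoted in the text; 0.77 is an approximate value.
previous_score = 0.77
new_score = 0.80134
threshold = 0.03  # differences smaller than this count as "equal" in the assignment

difference = new_score - previous_score
if abs(difference) < threshold:
    verdict = 'equal'
elif difference > 0:
    verdict = 'superior'
else:
    verdict = 'inferior'

print(f'difference = {difference:.5f} -> {verdict}')
```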
