# Machine Learning in Practice - Kaggle

### Study of the Heart Failure Clinical Dataset - Estimating the Probability of Death

In [1]:
## importing the libraries

import pandas as pd
import numpy as np

## preprocessing

from sklearn.model_selection import train_test_split  ## dataset split
from sklearn.compose import ColumnTransformer
import category_encoders as ce
from sklearn.pipeline import Pipeline  ## pipeline creation
from sklearn.impute import SimpleImputer

## ML model - XGBoost could also be tested

from sklearn.tree import DecisionTreeClassifier  ## tree model

In [2]:
## reading the file

df = pd.read_csv('heart_failure_clinical_records_dataset.csv')

df.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


In [3]:
df.shape

(299, 13)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


In [5]:
df.nunique().sort_values()

anaemia                       2
diabetes                      2
high_blood_pressure           2
sex                           2
smoking                       2
DEATH_EVENT                   2
ejection_fraction            17
serum_sodium                 27
serum_creatinine             40
age                          47
time                        148
platelets                   176
creatinine_phosphokinase    208
dtype: int64

If there were many columns and you had to decide, by an arbitrary rule, which ones are numeric and which are not, you could proceed as follows:

```python
features_categoricas = df.loc[:, df.nunique() < N].columns
```

The code above treats every column with fewer than N unique values as categorical. In the same way, you could invert the comparison (`>=` instead of `<`) to select the numeric columns.
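
A minimal sketch of that rule on a toy frame. The data, the threshold `N = 3`, and the names `toy`, `categorical_features` and `numerical_features` are made up for this illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame, invented only for this illustration
toy = pd.DataFrame({
    'smoker': [0, 1, 0, 1, 1, 0],
    'sex': [1, 1, 0, 0, 1, 0],
    'age': [75.0, 55.0, 65.0, 50.0, 60.0, 42.0],
    'platelets': [265000.0, 263358.03, 162000.0, 210000.0, 327000.0, 204000.0],
})

N = 3  # arbitrary threshold chosen for this toy example

# Number of distinct values per column
n_unique = toy.nunique()

# Columns below the threshold are treated as categorical, the rest as numeric
categorical_features = n_unique[n_unique < N].index.tolist()
numerical_features = n_unique[n_unique >= N].index.tolist()

print(categorical_features)
print(numerical_features)
```

The threshold is only a heuristic: a truly numeric column that happens to have few distinct values would end up in the categorical list, so the result is worth checking by hand.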

At first, our dataset contains only numeric variables, including the dummies (0 or 1).

In [6]:
## converting data from numeric to categorical - using .map

df['diabetes'] = df['diabetes'].map({1: 'yes', 0: 'no'})

In [7]:
## checking the change

df['diabetes'].value_counts()

diabetes
no     174
yes    125
Name: count, dtype: int64
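
One thing to watch when using `.map` with a dictionary: any value that is not a key of the dictionary becomes `NaN`. A small sketch on a made-up Series, where the code `2` is a hypothetical unexpected value:

```python
import pandas as pd

# Hypothetical column with an unexpected code (2) that is not in the mapping
codes = pd.Series([1, 0, 2, 1])

mapped = codes.map({1: 'yes', 0: 'no'})

# The unmapped value turns into NaN, so counting missing values is a cheap check
n_missing = int(mapped.isna().sum())

print(mapped.tolist())
print(n_missing)
```

This is also why running the `.map` cell a second time would turn the whole column into `NaN`: after the first run the values are `'yes'`/`'no'`, which are no longer keys of the dictionary.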

In [8]:
df.columns

Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
       'DEATH_EVENT'],
      dtype='object')

## Machine Learning Model

In [33]:
features = ['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time']

target = ['DEATH_EVENT']  ## in the example the target is selected directly, without the brackets

In [10]:
features

['age',
 'anaemia',
 'creatinine_phosphokinase',
 'diabetes',
 'ejection_fraction',
 'high_blood_pressure',
 'platelets',
 'serum_creatinine',
 'serum_sodium',
 'sex',
 'smoking',
 'time']

In [11]:
X = df[features]


y = df[target]

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  ## in the example reset_index is not needed

In [13]:
X_train.reset_index(drop=True, inplace=True)

In [14]:
X_test.reset_index(drop=True, inplace=True)

In [15]:
y_train.reset_index(drop=True, inplace=True)

In [16]:
y_test.reset_index(drop=True, inplace=True)
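
The four `reset_index` cells can also be written in one step by rebinding the names instead of using `inplace=True`. The sketch below uses toy data and adds `stratify`, which is not used in this notebook; it is shown only as an option for keeping the class proportions equal in both splits. All `toy` names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 10 rows, balanced binary target
toy = pd.DataFrame({'x': range(10), 'y': [0, 1] * 5})

X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    toy[['x']], toy['y'],
    test_size=0.2,
    random_state=42,
    stratify=toy['y'],  # assumption: keep the 50/50 class ratio in both splits
)

# Rebind each piece to a copy with a fresh 0..n-1 index
X_toy_train, X_toy_test, y_toy_train, y_toy_test = (
    part.reset_index(drop=True)
    for part in (X_toy_train, X_toy_test, y_toy_train, y_toy_test)
)

print(X_toy_train.index.tolist())
print(y_toy_test.tolist())
```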

In [17]:
X_test.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time
0,70.0,0,582,no,40,0,51000.0,2.7,136,1,1,250
1,50.0,1,298,no,35,0,362000.0,0.9,140,1,1,240
2,45.0,0,2442,yes,30,0,334000.0,1.1,139,1,0,129
3,80.0,1,123,no,35,1,388000.0,9.4,133,1,1,10
4,42.0,0,102,yes,40,0,237000.0,1.2,140,1,0,74


In [18]:
X_train.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time
0,75.0,1,246,no,15,0,127000.0,1.2,137,1,0,10
1,75.0,0,99,no,38,1,224000.0,2.5,134,1,0,162
2,60.667,1,104,yes,30,0,389000.0,1.5,136,1,0,171
3,52.0,0,132,no,30,0,218000.0,0.7,136,1,1,112
4,94.0,0,582,yes,38,1,263358.03,1.83,134,1,0,27


In [19]:
y_train.head()

Unnamed: 0,DEATH_EVENT
0,1
1,1
2,1
3,0
4,1


In [20]:
y_test.head()

Unnamed: 0,DEATH_EVENT
0,0
1,0
2,1
3,1
4,0


In [31]:
categorical = ['diabetes']  ## in the example the feature is selected without the brackets

numerical = ['age', 'anaemia', 'creatinine_phosphokinase',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time']

## Example If There Were Other Variables:

- Variable 1: Social class, which takes the values 1, 2, 3, 4, 5.
- Variable 2: State, which takes the values SP, RJ, DF.
- Variable 3: Latitude, which takes latitude values.

Worth noting:

- Variable 1 arrives as numeric but needs to be treated as categorical; treating it as a plain number would be a mistake.
- Likewise, latitude and longitude should not be treated as ordinary numeric features; they describe a location and should be used as such.
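
For the first point, one simple way to stop pandas from treating coded classes as plain numbers is to convert the column to the `category` dtype. This is a sketch on invented data; the column name `social_class` is hypothetical.

```python
import pandas as pd

# Hypothetical column: social class coded as integers 1 to 5
toy = pd.DataFrame({'social_class': [1, 2, 3, 4, 5, 3]})

# After the conversion the values are labels, not quantities
toy['social_class'] = toy['social_class'].astype('category')

dtype_name = str(toy['social_class'].dtype)
categories = toy['social_class'].cat.categories.tolist()

print(dtype_name)
print(categories)
```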

How to Apply Specific Treatments to Each Column?

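```python
transformer = ColumnTransformer([
    ('nome_da_transformacao', transformacao_a_ser_feita, colunas_afetadas),
    # ... one (name, transformation, columns) tuple per group of columns
])
```
```python
transformer.fit_transform(X_train)  # learn the transformation parameters on the training data only
transformer.transform(X_test)       # reuse them on the test data, without fitting again
```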
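
A concrete sketch of the template above, using the three hypothetical variables from the list (social class, state, latitude). The data and all names are invented, and the choice of `OneHotEncoder` for the categorical groups is an assumption made for this illustration; it is not the encoder used in the rest of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one missing state and one missing latitude
toy = pd.DataFrame({
    'social_class': [1, 2, 3, 5, 2, 4],
    'state': ['SP', 'RJ', 'DF', 'SP', np.nan, 'RJ'],
    'latitude': [-23.5, -22.9, -15.8, np.nan, -23.6, -22.8],
})

# Group 2 needs two steps: fill the missing state, then one-hot encode it
state_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])

toy_transformer = ColumnTransformer([
    # Group 1: numeric codes treated as categories
    ('social_class', OneHotEncoder(handle_unknown='ignore'), ['social_class']),
    # Group 2: text categories with imputation
    ('state', state_pipe, ['state']),
    # Group 3: numeric column, missing values replaced by the median
    ('latitude', SimpleImputer(strategy='median'), ['latitude']),
])

toy_transformed = toy_transformer.fit_transform(toy)

# 5 social classes + 3 states + 1 latitude column
print(toy_transformed.shape)
```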

In [22]:
### transforming the categorical features

categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', ce.TargetEncoder()),
])

transformer = ColumnTransformer([
    ('categorical_transformer', categorical_pipe, categorical),
    ('numerical_transformer', SimpleImputer(strategy='median'), numerical),
])

X_train_transformed = transformer.fit_transform(X_train, y_train)

X_test_transformed = transformer.transform(X_test)
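
To get a feeling for what a target encoder does: the basic idea is to replace each category with the mean of the target for that category. The sketch below shows only this plain, unsmoothed version on invented data; `ce.TargetEncoder` applies smoothing on top of it, so its numbers will generally differ.

```python
import pandas as pd

# Toy training data, invented for illustration
toy = pd.DataFrame({
    'diabetes': ['yes', 'no', 'yes', 'no', 'no', 'yes'],
    'DEATH_EVENT': [1, 0, 0, 0, 1, 1],
})

# Mean of the target per category, learned from the training rows only
category_means = toy.groupby('diabetes')['DEATH_EVENT'].mean()

# Each category is replaced by its learned mean
toy['diabetes_encoded'] = toy['diabetes'].map(category_means)

print(category_means)
```

Because the encoding is computed from the target, it has to be learned on the training data and then only applied to the test data. That is why the cell above calls `fit_transform(X_train, y_train)` but just `transform(X_test)`.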

In [23]:
X_train.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time
0,75.0,1,246,no,15,0,127000.0,1.2,137,1,0,10
1,75.0,0,99,no,38,1,224000.0,2.5,134,1,0,162
2,60.667,1,104,yes,30,0,389000.0,1.5,136,1,0,171
3,52.0,0,132,no,30,0,218000.0,0.7,136,1,1,112
4,94.0,0,582,yes,38,1,263358.03,1.83,134,1,0,27


In [24]:
print(transformer)

ColumnTransformer(transformers=[('categorical_transformer',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('encoder', TargetEncoder())]),
                                 ['diabetes']),
                                ('numerical_transformer',
                                 SimpleImputer(strategy='median'),
                                 ['age', 'anaemia', 'creatinine_phosphokinase',
                                  'ejection_fraction', 'high_blood_pressure',
                                  'platelets', 'serum_creatinine',
                                  'serum_sodium', 'sex', 'smoking', 'time'])])


In [25]:
tree = DecisionTreeClassifier()

In [26]:
tree.fit(X_train_transformed, y_train)

In [27]:
tree.predict(X_test_transformed)

array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=int64)

A new dataset arrives, with new customers (here, patients). What should be done?

Transform the data with the already fitted transformer and apply the model to the result:

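```python

novos_clientes = pd.read_csv('.csv')

# assumes the new file stores diabetes as 0/1, like the original CSV
novos_clientes['diabetes'] = novos_clientes['diabetes'].map({1: 'yes', 0: 'no'})

novos_clientes_2 = novos_clientes[features]

# reuse the fitted transformer: transform only, never fit again on new data
novos_clientes_transformed = transformer.transform(novos_clientes_2)

predicao_final = tree.predict(novos_clientes_transformed)

# predicao_final holds the predicted class for each new patient (1 = death event, 0 = no death event).
```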
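
One way to make this step harder to get wrong is to put the preprocessing and the model into a single `Pipeline`, so that `predict` always transforms the data first. The sketch below uses invented data and hypothetical names, swaps in `OneHotEncoder` for the target encoder to stay within scikit-learn, and sets `random_state` so the tree is reproducible.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy training data, invented for illustration
train = pd.DataFrame({
    'diabetes': ['yes', 'no', 'no', 'yes', 'no', 'yes'],
    'age': [75.0, 55.0, 65.0, 50.0, 60.0, 42.0],
    'DEATH_EVENT': [1, 0, 1, 0, 0, 1],
})
toy_features = ['diabetes', 'age']

preprocess = ColumnTransformer([
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['diabetes']),
    ('numerical', SimpleImputer(strategy='median'), ['age']),
])

# Preprocessing and model are fitted together and applied together
model = Pipeline([
    ('preprocess', preprocess),
    ('tree', DecisionTreeClassifier(random_state=42)),
])
model.fit(train[toy_features], train['DEATH_EVENT'])

# Hypothetical new patients, one of them with a missing age
new_patients = pd.DataFrame({'diabetes': ['no', 'yes'], 'age': [58.0, None]})

# The pipeline imputes and encodes before predicting
final_prediction = model.predict(new_patients[toy_features])
print(final_prediction)
```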

In [28]:
y_pred = tree.predict(X_test_transformed)

y_pred

array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=int64)

In [29]:
y_test.values

array([[0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1]], dtype=int64)
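
Comparing the two arrays by eye is error-prone. A confusion matrix counts the four possible outcomes, and the usual metrics follow directly from those counts. The labels below are invented toy values, not the notebook's results.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary 0/1 labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # of the predicted 1s, how many are real
recall = tp / (tp + fn)                      # of the real 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)
```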

In [30]:
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, precision_score, recall_score

print(f'Acurácia: {accuracy_score(y_test, y_pred):.2f}')
print(f'ROC/AUC: {roc_auc_score(y_test, y_pred):.2f}')
print(f'F1-Score: {f1_score(y_test, y_pred):.2f}')
print(f'Precision: {precision_score(y_test, y_pred):.2f}')
print(f'Recall: {recall_score(y_test, y_pred):.2f}')

Acurácia: 0.68
ROC/AUC: 0.65
F1-Score: 0.56
Precision: 0.67
Recall: 0.48
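
A note on the ROC/AUC line: `roc_auc_score` accepts hard 0/1 predictions, but it is usually computed from scores or probabilities, because the ranking information is lost once the scores are cut at a threshold. A small sketch with invented numbers:

```python
from sklearn.metrics import roc_auc_score

# Toy values, invented for illustration
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.6, 0.7, 0.8]

# Hard labels obtained by cutting the scores at 0.5
y_label_toy = [int(score >= 0.5) for score in y_score_toy]

auc_from_scores = roc_auc_score(y_true_toy, y_score_toy)
auc_from_labels = roc_auc_score(y_true_toy, y_label_toy)

print(auc_from_scores, auc_from_labels)
```

In this notebook that would mean passing `tree.predict_proba(X_test_transformed)[:, 1]` instead of `y_pred`. For a fully grown decision tree the probabilities are often just 0 or 1, so the difference may be small here.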


### - Challenge 1: Run a Classifier on Another Dataset.

### - Challenge 2: Run a Model with 4 Groups of Columns, Each with a Different Treatment.