# 09 - Transforming data with dummies

Because of the results of `DecisionTreeClassifier` and `RandomForestClassifier` on the previous datasets (the best result reached only 77.5% accuracy on the test set), new versions of the processed data will be generated in an attempt to improve the score for these algorithms. The hypothesis is that some important information was lost in the previous processing. To address this, I will use `get_dummies` from `pandas` and include some information that I did not have before. For now, the handling of missing data stays the same.

## Setting up the environment

In [1]:
import pandas as pd

## Transforming the training data

In [2]:
titanic = pd.read_csv('https://raw.githubusercontent.com/SalatielBairros/kaggle-titanic/main/data/processed/train_complete.csv')
titanic.sample(2)

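| | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Floor | Embarked | title | Relateds | faixa | possui_cabine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 767 | 767 | 768 | 0 | 3 | Mangan, Miss. Mary | female | 30 | 0 | 0 | 364850 | 7.75 | SC | Q | Miss | 0 | mulher_solteira | False |
| 76 | 76 | 77 | 0 | 3 | Staneff, Mr. Ivan | male | 40 | 0 | 0 | 349208 | 7.8958 | SC | S | Mr | 0 | pessoa_adulta | False |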


Removing the unnecessary columns:

In [4]:
titanic.drop(columns=['Unnamed: 0', 'PassengerId', 'Name', 'Ticket', 'Fare', 'title', 'Relateds', 'faixa', 'possui_cabine'], inplace=True)
titanic.sample(2)

| | Survived | Pclass | Sex | Age | SibSp | Parch | Floor | Embarked |
|---|---|---|---|---|---|---|---|---|
| 664 | 1 | 3 | male | 20 | 1 | 0 | SC | S |
| 777 | 1 | 3 | female | 5 | 0 | 0 | SC | S |


In [6]:
dummies = pd.get_dummies(titanic)
dummies.sample(10)

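| | Survived | Pclass | Age | SibSp | Parch | Sex_female | Sex_male | Floor_A | Floor_B | Floor_C | Floor_D | Floor_E | Floor_F | Floor_G | Floor_SC | Floor_T | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 880 | 1 | 2 | 25 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 315 | 1 | 3 | 26 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 360 | 0 | 3 | 40 | 1 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 512 | 1 | 1 | 36 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 193 | 1 | 2 | 3 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 566 | 0 | 3 | 19 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 791 | 0 | 2 | 16 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 410 | 0 | 3 | 41 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 162 | 0 | 3 | 26 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 60 | 0 | 3 | 22 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |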


The added analysis fields and the fields that identify the passenger were removed, while the floor and port of embarkation fields were kept. The dummies were then created from the remaining columns.
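As a minimal illustration of what `get_dummies` does to the categorical columns, the sketch below applies it to a small frame with made-up values (the `toy` frame is hypothetical, not taken from the dataset). The `dtype=int` argument is an assumption added here so that the result uses 0/1 as in the output above; recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy frame (hypothetical values): one numeric and two categorical columns
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Numeric columns pass through unchanged; each categorical column becomes
# one 0/1 column per category, named <column>_<category>
toy_dummies = pd.get_dummies(toy, dtype=int)
print(toy_dummies.columns.tolist())
```

Each row ends up with exactly one 1 among the columns derived from the same original field.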

In [7]:
dummies.to_csv('../../data/processed/train_dummies.csv')
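The `Unnamed: 0` column that had to be dropped earlier is most likely a side effect of saving with the default `to_csv` settings, which write the index as an unnamed first column. The sketch below shows the difference on a small hypothetical frame, using an in-memory buffer instead of a file. Whether to switch to `index=False` here depends on how later notebooks read the file, so treat it as an option rather than a required change.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the dummies table
frame = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1]})

# Default behaviour: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(frame.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```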

## Transforming the test data

In [17]:
p_teste = pd.read_csv('https://raw.githubusercontent.com/SalatielBairros/kaggle-titanic/main/data/processed/test_processed.csv')
p_teste.head(2)

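| | Unnamed: 0 | PassengerId | Pclass | Sex | Age | SibSp | Parch | Floor | Relateds | possui_cabine | acompanhado | faixa_etaria |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 892 | 3 | male | 34 | 0 | 0 | SC | 0 | False | False | jovem_adulto |
| 1 | 1 | 893 | 3 | female | 47 | 1 | 0 | SC | 1 | False | True | adulto_idoso |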


In [18]:
o_teste = pd.read_csv('https://raw.githubusercontent.com/SalatielBairros/kaggle-titanic/main/data/original/test.csv')
o_teste.head(2)

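| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |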


In [19]:
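# Copy Embarked from the original test file; rows are matched by index label
p_teste['Embarked'] = o_teste['Embarked']
# Drop the leftover index column and the derived analysis fields
p_teste.drop(columns=['Unnamed: 0', 'Relateds', 'possui_cabine', 'acompanhado', 'faixa_etaria'], inplace=True)
p_teste.head(2)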

| | PassengerId | Pclass | Sex | Age | SibSp | Parch | Floor | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | male | 34 | 0 | 0 | SC | Q |
| 1 | 893 | 3 | female | 47 | 1 | 0 | SC | S |
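Assigning `o_teste['Embarked']` directly matches rows by index label, which works here because both files appear to list the passengers in the same order (both start at `PassengerId` 892). If that ordering could not be relied on, joining on `PassengerId` would be a safer alternative. The sketch below uses two small hypothetical frames, deliberately in different row order, to show the idea.

```python
import pandas as pd

# Hypothetical miniature versions of the processed and original test frames,
# with the rows of the second one in a different order
processed = pd.DataFrame({"PassengerId": [892, 893], "Pclass": [3, 3]})
original = pd.DataFrame({"PassengerId": [893, 892], "Embarked": ["S", "Q"]})

# Join on the key so the result does not depend on row order;
# validate raises an error if PassengerId is duplicated on either side
merged = processed.merge(
    original[["PassengerId", "Embarked"]],
    on="PassengerId",
    how="left",
    validate="one_to_one",
)
print(merged)
```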


In [20]:
t_dummies = pd.get_dummies(p_teste)
t_dummies.sample(10)

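| | PassengerId | Pclass | Age | SibSp | Parch | Sex_female | Sex_male | Floor_A | Floor_B | Floor_C | Floor_D | Floor_E | Floor_F | Floor_G | Floor_SC | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 144 | 1036 | 1 | 42 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 311 | 1203 | 3 | 22 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 17 | 909 | 3 | 21 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 326 | 1218 | 2 | 12 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 349 | 1241 | 2 | 31 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 402 | 1294 | 1 | 22 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 206 | 1098 | 3 | 35 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 145 | 1037 | 3 | 31 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 405 | 1297 | 2 | 20 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 21 | 913 | 3 | 9 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |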


In [22]:
t_dummies.to_csv('../../data/processed/test_dummies.csv')
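Comparing the two outputs, the training dummies include a `Floor_T` column that does not appear in the test dummies, and the test frame still carries `PassengerId`. A model fitted on the training columns would likely need the test frame to expose the same features in the same order. One possible way to line them up, sketched here on small hypothetical frames, is to reindex the test columns against the training features and fill missing dummies with 0.

```python
import pandas as pd

# Hypothetical one-row frames: the training data has a category (Floor_T)
# that never occurs in the test data
train_small = pd.DataFrame({"Survived": [1], "Pclass": [1], "Floor_SC": [0], "Floor_T": [1]})
test_small = pd.DataFrame({"PassengerId": [892], "Pclass": [3], "Floor_SC": [1]})

# Feature names are the training columns without the target
feature_names = train_small.drop(columns=["Survived"]).columns

# Keep only the training features, in the same order; absent dummies become 0
aligned_test = test_small.reindex(columns=feature_names, fill_value=0)
print(aligned_test)
```

`PassengerId` is dropped from the aligned features by this step, so it would need to be kept separately if it is required for a submission file.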