# 09 - Transforming data with dummies

Because of the results of `DecisionTreeClassifier` and `RandomForestClassifier` on the previous datasets (the best result reached only 77.5% accuracy on the test set), new versions of the processed data will be generated in an attempt to improve the score for these algorithms. The hypothesis is that some important information was lost in the previous processing. To address this, I will use `get_dummies` from `pandas` and include some information that was not used before. For now, the handling of missing data remains the same.

## Setting up the environment

In [1]:
import pandas as pd

## Transforming the training data

In [2]:
titanic = pd.read_csv('https://raw.githubusercontent.com/SalatielBairros/kaggle-titanic/main/data/processed/train_complete.csv')
titanic.sample(2)

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Floor,Embarked,title,Relateds,faixa,possui_cabine
765,765,766,1,1,"Hogeboom, Mrs. John C (Anna Andrews)",female,51,1,0,13502,77.9583,D,S,Mrs,1,pessoa_adulta,True
706,706,707,1,2,"Kelly, Mrs. Florence ""Fannie""",female,45,0,0,223596,13.5,SC,S,Mrs,0,pessoa_adulta,False


Removing the unnecessary columns:

In [3]:
titanic.drop(columns=['Unnamed: 0', 'PassengerId', 'Name', 'Ticket', 'Fare', 'title', 'Relateds', 'faixa', 'possui_cabine'], inplace=True)
titanic.sample(2)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Floor,Embarked
690,1,1,male,31,1,0,B,S
326,0,3,male,61,0,0,SC,S


In [4]:
dummies = pd.get_dummies(titanic)
dummies.sample(10)

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Sex_female,Sex_male,Floor_A,Floor_B,Floor_C,Floor_D,Floor_E,Floor_F,Floor_G,Floor_SC,Floor_T,Embarked_C,Embarked_Q,Embarked_S
685,0,2,25,1,2,0,1,0,0,0,0,0,0,0,1,0,1,0,0
764,0,3,16,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1
38,0,3,18,2,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1
734,0,2,23,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1
657,0,3,32,1,1,1,0,0,0,0,0,0,0,0,1,0,0,1,0
797,1,3,31,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1
198,1,3,12,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0
186,1,3,32,1,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0
853,1,1,16,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,1
561,0,3,40,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1


The analysis fields added earlier and the fields that identify the passenger were removed, while the floor (`Floor`) and port of embarkation (`Embarked`) fields were kept. Dummy variables were then created for the categorical columns (`Sex`, `Floor` and `Embarked`).
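
To make the effect of `get_dummies` concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the Titanic data. It shows that only the text columns are expanded into one column per category, while numeric columns pass through unchanged. Note that recent pandas versions return boolean dummy columns by default, whereas the outputs in this notebook show 0/1 integers.

```python
import pandas as pd

# Toy frame with hypothetical values, mimicking the column types in the notebook.
toy = pd.DataFrame({
    "Survived": [1, 0, 1],
    "Sex": ["female", "male", "female"],
    "Embarked": ["S", "Q", "S"],
})

# Text columns become one indicator column per category;
# the numeric column "Survived" is left as it is.
toy_dummies = pd.get_dummies(toy)

print(list(toy_dummies.columns))
# ['Survived', 'Sex_female', 'Sex_male', 'Embarked_Q', 'Embarked_S']
```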

In [5]:
dummies.to_csv('../../data/processed/train_dummies.csv')
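
The `Unnamed: 0` columns seen when loading the processed files are what `read_csv` typically produces when the file was saved with its index. The sketch below, which uses an in-memory buffer and made-up values instead of the real files, compares saving with and without `index=False`. Whether to keep the index is a design choice; this notebook keeps it and drops the extra column after loading.

```python
import io

import pandas as pd

# Small hypothetical frame standing in for the processed data.
frame = pd.DataFrame({"Survived": [1, 0], "Pclass": [1, 3]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False only the data columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'Survived', 'Pclass']
print(cols_without_index)  # ['Survived', 'Pclass']
```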

## Transforming the test data

In [6]:
p_teste = pd.read_csv('https://raw.githubusercontent.com/SalatielBairros/kaggle-titanic/main/data/processed/test_processed.csv')
p_teste.head(2)

Unnamed: 0.1,Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Floor,Relateds,possui_cabine,acompanhado,faixa_etaria
0,0,892,3,male,34,0,0,SC,0,False,False,jovem_adulto
1,1,893,3,female,47,1,0,SC,1,False,True,adulto_idoso


In [7]:
o_teste = pd.read_csv('https://raw.githubusercontent.com/SalatielBairros/kaggle-titanic/main/data/original/test.csv')
o_teste.head(2)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S


In [8]:
p_teste['Embarked'] = o_teste['Embarked']
p_teste.drop(columns=['Unnamed: 0', 'Relateds', 'possui_cabine', 'acompanhado', 'faixa_etaria'], inplace=True)
p_teste.head(2)

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Floor,Embarked
0,892,3,male,34,0,0,SC,Q
1,893,3,female,47,1,0,SC,S
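
The assignment `p_teste['Embarked'] = o_teste['Embarked']` matches rows by index, which works here because both files appear to list the passengers in the same order (892, 893, ...). As a hedged alternative, the sketch below joins on `PassengerId` instead, so the result does not depend on row order. The frames and values are hypothetical stand-ins for `p_teste` and `o_teste`.

```python
import pandas as pd

# Hypothetical stand-ins: the processed frame is deliberately in a different order.
processed = pd.DataFrame({"PassengerId": [893, 892], "Sex": ["female", "male"]})
original = pd.DataFrame({"PassengerId": [892, 893], "Embarked": ["Q", "S"]})

# Left join on the passenger key; validate raises if a key is duplicated on either side.
merged = processed.merge(
    original[["PassengerId", "Embarked"]],
    on="PassengerId",
    how="left",
    validate="one_to_one",
)

print(merged)
```

The rows keep the order of `processed`, and each one receives the `Embarked` value of the passenger with the same `PassengerId`.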


In [9]:
t_dummies = pd.get_dummies(p_teste)
t_dummies.sample(10)

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Sex_female,Sex_male,Floor_A,Floor_B,Floor_C,Floor_D,Floor_E,Floor_F,Floor_G,Floor_SC,Embarked_C,Embarked_Q,Embarked_S
327,1219,1,46,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0
243,1135,3,23,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
4,896,3,22,1,1,1,0,0,0,0,0,0,0,0,1,0,0,1
391,1283,1,51,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1
210,1102,3,32,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
134,1026,3,43,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
78,970,2,30,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
176,1068,2,20,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1
115,1007,3,18,1,0,0,1,0,0,0,0,0,0,0,1,1,0,0
162,1054,2,26,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1
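
Comparing the two outputs, the training dummies include a `Floor_T` column that does not appear in the test dummies, presumably because no test passenger has that floor value. Most scikit-learn estimators expect the same feature columns at fit and predict time, so one possible fix is to reindex the test frame to the training columns. The sketch below uses small hypothetical frames. With the real data, the target (`Survived`) and the identifier (`PassengerId`) would need to be set aside first.

```python
import pandas as pd

# Hypothetical frames: "T" only occurs in train, "D" only occurs in test.
train = pd.DataFrame({"Age": [30, 45], "Floor": ["T", "SC"]})
test = pd.DataFrame({"Age": [22, 51], "Floor": ["SC", "D"]})

train_d = pd.get_dummies(train)  # Age, Floor_SC, Floor_T
test_d = pd.get_dummies(test)    # Age, Floor_D, Floor_SC

# Align test to the training columns: columns missing in test are
# filled with 0, and columns unknown to train are dropped.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_aligned.columns))
```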


In [10]:
t_dummies.to_csv('../../data/processed/kaggle_test_dummies.csv')