<a href="https://colab.research.google.com/github/Eric-Mendes/treino-de-pandas/blob/main/analise_titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Titanic Dataset Analysis

In [1]:
# Import the required libraries and load the dataframe
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [2]:
print("df shape with NaN rows:", df.shape)
print("df shape without NaN rows:", df.dropna().shape)
print("Information lost: %.2f%%" % (100 * (1 - df.dropna().shape[0] / df.shape[0])))

df shape with NaN rows: (891, 15)
df shape without NaN rows: (182, 15)
Information lost: 79.57%


[Data Exploration] Dropping every row that has at least one missing value discards about 80% of the records (709 of 891). A better strategy may be to fill the missing values in some way.
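Before choosing between dropping and filling, it can help to see which columns hold the missing values. The sketch below is a minimal illustration on a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset); it counts missing values per column and compares a blanket `dropna()` with one restricted to a single column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the Titanic data.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# Number of missing values in each column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# Rows kept by a blanket dropna() versus one restricted to "age".
rows_all = len(toy.dropna())
rows_age_only = len(toy.dropna(subset=["age"]))
print(rows_all, rows_age_only)
```

When one sparse column accounts for most of the loss, restricting `dropna` with `subset=` or dropping that column keeps far more rows than removing every incomplete record.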

In [3]:
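# Encode "alive" as 0/1 so it can be compared with "survived"
df['alive'] = df['alive'].map({'no': 0, 'yes': 1})
# Rows where the two columns disagree (an empty result means they always match)
df.loc[df['alive'] != df['survived']]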

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone


[Feature Engineering] The empty result above shows that the alive and survived columns never disagree, so they are interchangeable. I suggest dropping one of them (I will remove "alive", since I find "survived" the more readable attribute name).

In [4]:
df = df.drop(labels='alive', axis='columns')



---



In [5]:
df['who'].value_counts()

man      537
woman    271
child     83
Name: who, dtype: int64

In [6]:
df['adult_male'].value_counts()

True     537
False    354
Name: adult_male, dtype: int64

[Feature Engineering] Note that the information in the "adult_male" column appears to be inferable from the "who" column: both report 537 men. We will test this hypothesis and, if it holds, keep the "who" column.

In [7]:
# Rows flagged as adult males whose "who" value is not "man" (expected: none)
df.loc[df['adult_male'] & (df['who'] != 'man')]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alone
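
The filter above checks one direction only: no row flagged `adult_male` has a `who` other than "man". A stricter check compares the flag with `who == 'man'` on every row, which covers both directions at once. This is a minimal sketch on a made-up frame (`toy` is hypothetical), not the notebook's own cell.

```python
import pandas as pd

# Hypothetical miniature frame with the same two columns.
toy = pd.DataFrame({
    "who": ["man", "woman", "child", "man"],
    "adult_male": [True, False, False, True],
})

# adult_male is redundant only if it equals (who == "man") on every row.
derived = toy["who"] == "man"
is_redundant = bool((toy["adult_male"] == derived).all())
print(is_redundant)
```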


In [8]:
df = df.drop(labels='adult_male', axis='columns')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,deck,embark_town,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,,Southampton,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,C,Cherbourg,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,,Southampton,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,C,Southampton,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,,Southampton,True




---



[Feature Engineering] The "class" and "pclass" columns also carry the same information. We will keep "pclass" because it is already numeric.

In [9]:
df['class'].value_counts()

Third     491
First     216
Second    184
Name: class, dtype: int64

In [10]:
df['pclass'].value_counts()

3    491
1    216
2    184
Name: pclass, dtype: int64
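
Matching value counts suggest that the two columns agree, but they do not prove it row by row. One way to confirm it is a cross-tabulation, sketched here on a made-up frame (`toy` is hypothetical): if each label pairs with exactly one code and vice versa, every row and every column of the table has a single non-zero cell.

```python
import pandas as pd

# Hypothetical miniature frame with the same two columns.
toy = pd.DataFrame({
    "pclass": [3, 1, 3, 2, 1],
    "class": ["Third", "First", "Third", "Second", "First"],
})

# Count how often each (class, pclass) pair occurs.
table = pd.crosstab(toy["class"], toy["pclass"])
print(table)

# One non-zero cell per row: each label maps to a single code.
one_code_per_label = bool(((table > 0).sum(axis=1) == 1).all())
# One non-zero cell per column: each code maps to a single label.
one_label_per_code = bool(((table > 0).sum(axis=0) == 1).all())
print(one_code_per_label, one_label_per_code)
```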

In [11]:
df = df.drop(labels='class', axis='columns')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,who,deck,embark_town,alone
0,0,3,male,22.0,1,0,7.25,S,man,,Southampton,False
1,1,1,female,38.0,1,0,71.2833,C,woman,C,Cherbourg,False
2,1,3,female,26.0,0,0,7.925,S,woman,,Southampton,True
3,1,1,female,35.0,1,0,53.1,S,woman,C,Southampton,False
4,0,3,male,35.0,0,0,8.05,S,man,,Southampton,True




---



In [12]:
df['embarked'].value_counts()

S    644
C    168
Q     77
Name: embarked, dtype: int64

In [13]:
df['embark_town'].value_counts()

Southampton    644
Cherbourg      168
Queenstown      77
Name: embark_town, dtype: int64

[Feature Engineering] More equivalent columns :)
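
The counts above sum to 889 of the 891 rows, so both columns have missing entries, and a row-wise comparison has to treat two missing values as a match. The sketch below assumes the code-to-town mapping implied by the matching counts (S, C, Q to Southampton, Cherbourg, Queenstown) and runs on a made-up frame (`toy` and `port_names` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame, including one row missing in both columns.
toy = pd.DataFrame({
    "embarked": ["S", "C", np.nan, "Q"],
    "embark_town": ["Southampton", "Cherbourg", np.nan, "Queenstown"],
})

# Assumed mapping from port code to town name.
port_names = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}
expected_town = toy["embarked"].map(port_names)

# A row matches when the names are equal or when both values are missing.
both_missing = expected_town.isna() & toy["embark_town"].isna()
matches = (expected_town == toy["embark_town"]) | both_missing
columns_equivalent = bool(matches.all())
print(columns_equivalent)
```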

In [14]:
df = df.drop(labels='embark_town', axis='columns')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,who,deck,alone
0,0,3,male,22.0,1,0,7.25,S,man,,False
1,1,1,female,38.0,1,0,71.2833,C,woman,C,False
2,1,3,female,26.0,0,0,7.925,S,woman,,True
3,1,1,female,35.0,1,0,53.1,S,woman,C,False
4,0,3,male,35.0,0,0,8.05,S,man,,True


In [15]:
# Drop only the rows where "age" is missing
df = df.dropna(subset=['age'])
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,who,deck,alone
0,0,3,male,22.0,1,0,7.25,S,man,,False
1,1,1,female,38.0,1,0,71.2833,C,woman,C,False
2,1,3,female,26.0,0,0,7.925,S,woman,,True
3,1,1,female,35.0,1,0,53.1,S,woman,C,False
4,0,3,male,35.0,0,0,8.05,S,man,,True


In [18]:
# Cast age to integer; astype(int) truncates any fractional ages
df['age'] = df['age'].astype(int)
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,who,deck,alone
0,0,3,male,22,1,0,7.25,S,man,,False
1,1,1,female,38,1,0,71.2833,C,woman,C,False
2,1,3,female,26,0,0,7.925,S,woman,,True
3,1,1,female,35,1,0,53.1,S,woman,C,False
4,0,3,male,35,0,0,8.05,S,man,,True
