# Titanic Analysis

![Image](Files/img.png)

 - Replicating the same preprocessing steps from the [second analysis](https://github.com/DevTheo25/Titanic_ML/blob/main/Segunda_analise.ipynb).

## Third Analysis
 - In this analysis we add a new column, **Mr_Miss_in_Name**, that records whether the passenger has **Mr** or **Miss** in their name

In [132]:
# Import pandas
import pandas as pd

In [133]:
# Read the training dataset
train = pd.read_csv("Files/train.csv")

In [134]:
# Read and preview the test dataset
test = pd.read_csv("Files/test.csv")
test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [135]:
# Define a function that checks whether the name contains Mr or Miss
def name_value(name):
    if "Mr" in name or "Miss" in name:
        return 1
    else:
        return 0

In [136]:
# Add a new column: 1 if the person has Mr or Miss in their name, 0 otherwise
train["Mr_Miss_in_Name"] = train["Name"].map(name_value)
test["Mr_Miss_in_Name"] = test["Name"].map(name_value)
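
Because `name_value` uses a plain substring test, `"Mr" in name` is also true for names that contain "Mrs". If the intent is to flag only the exact titles, a word-boundary regular expression is one possible alternative. The helper `has_mr_or_miss` below is a hypothetical name and not part of the original notebook; switching to it would change the feature values and therefore the scores reported later.

```python
import re

# Hypothetical alternative to name_value: match "Mr" or "Miss" only as whole titles.
# \b is a word boundary, so "Mrs" does not count as "Mr".
TITLE_PATTERN = re.compile(r"\b(Mr|Miss)\b")

def has_mr_or_miss(name):
    """Return 1 if the name contains the standalone title Mr or Miss, else 0."""
    return 1 if TITLE_PATTERN.search(name) else 0

print(has_mr_or_miss("Kelly, Mr. James"))                  # 1
print(has_mr_or_miss("Wilkes, Mrs. James (Ellen Needs)"))  # 0
print("Mr" in "Wilkes, Mrs. James (Ellen Needs)")          # True (substring test)
```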

In [137]:
# Inspect the cardinality (number of distinct values) of each column
train.nunique().sort_values(ascending=False)

PassengerId        891
Name               891
Ticket             681
Fare               248
Cabin              147
Age                 88
SibSp                7
Parch                7
Pclass               3
Embarked             3
Survived             2
Sex                  2
Mr_Miss_in_Name      2
dtype: int64

In [138]:
# Drop columns from the training data
train = train.drop(["Name", "Ticket", "Cabin"], axis=1)

In [139]:
# Drop columns from the test data
test = test.drop(["Name", "Ticket", "Cabin"], axis=1)

### Handling columns with null values

In [140]:
# Compute the mean of the Age column
train["Age"].mean()

29.69911764705882

In [141]:
# Replace null values in the Age column with the training-set mean
train.loc[train.Age.isnull(), "Age"] = train["Age"].mean()

In [142]:
# Replace null values in the Age column with the test-set mean
test.loc[test.Age.isnull(), "Age"] = test["Age"].mean()
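
Mean imputation replaces each missing value with the average of the observed values. The cells above fill the test set with the test-set mean; a common alternative is to reuse the mean computed on the training data so that both sets are transformed identically. The sketch below illustrates that idea in plain Python; `fill_with_mean` and the sample ages are hypothetical and only serve the illustration.

```python
from statistics import mean

def fill_with_mean(values, fill_value=None):
    """Replace None entries with fill_value, or with the mean of the observed values."""
    if fill_value is None:
        fill_value = mean(v for v in values if v is not None)
    return [fill_value if v is None else v for v in values]

# Illustrative ages only, not rows from the Titanic files
train_ages = [22.0, None, 26.0, 36.0]
test_ages = [34.5, None, 62.0]

train_mean = mean(v for v in train_ages if v is not None)  # (22 + 26 + 36) / 3 = 28.0
train_filled = fill_with_mean(train_ages)
# Reuse the training mean for the test set instead of recomputing it there
test_filled = fill_with_mean(test_ages, fill_value=train_mean)

print(train_filled)  # [22.0, 28.0, 26.0, 36.0]
print(test_filled)   # [34.5, 28.0, 62.0]
```

With pandas, the same choice could be written as `test["Age"] = test["Age"].fillna(train["Age"].mean())`.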

In [143]:
train.isnull().sum().sort_values(ascending=False)

Embarked           2
PassengerId        0
Survived           0
Pclass             0
Sex                0
Age                0
SibSp              0
Parch              0
Fare               0
Mr_Miss_in_Name    0
dtype: int64

In [144]:
test.isnull().sum().sort_values(ascending=False)

Fare               1
PassengerId        0
Pclass             0
Sex                0
Age                0
SibSp              0
Parch              0
Embarked           0
Mr_Miss_in_Name    0
dtype: int64

In [145]:
# Replace null values in the Fare column with the column mean
test.loc[test.Fare.isnull(), "Fare"] = test.Fare.mean()

In [146]:
test.isnull().sum().sort_values(ascending=False)

PassengerId        0
Pclass             0
Sex                0
Age                0
SibSp              0
Parch              0
Fare               0
Embarked           0
Mr_Miss_in_Name    0
dtype: int64

----

# Understanding the text columns

- Next, we handle the text columns

In [147]:
train.head()

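| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Mr_Miss_in_Name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | 1 |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | 1 |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | 1 |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | 1 |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | 1 |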


- There are 2 text columns, **Sex** and **Embarked**

In [148]:
train["Sex"].value_counts()

male      577
female    314
Name: Sex, dtype: int64

- The **Sex** column can be converted to binary because it has only 2 distinct values

In [149]:
# Convert the Sex column to binary (1 = female, 0 = male)
train['Sex_binary'] = train["Sex"].apply(lambda x: 1 if x == "female" else 0)

In [150]:
# Apply the same conversion to the test data
test['Sex_binary'] = test["Sex"].apply(lambda x: 1 if x == "female" else 0)

In [151]:
# Drop the original Sex column from both datasets
train = train.drop(["Sex"], axis=1)
test = test.drop("Sex", axis=1)

- For the **Embarked** column we need to use **one-hot encoding**: each distinct value in "Embarked" becomes a new binary column
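
As a rough illustration of what one-hot encoding produces, the plain-Python sketch below turns each port code into three 0/1 indicators. The function `one_hot` and the fixed category list are assumptions made for this example; the notebook itself relies on `pd.get_dummies`. Fixing the category list up front keeps the column order identical for training and test data, even if one of them lacks a category.

```python
# Categories of Embarked, listed explicitly (an assumption for this sketch)
CATEGORIES = ["C", "Q", "S"]

def one_hot(value, categories=CATEGORIES, prefix="Embarked"):
    """Map one categorical value to a dict of 0/1 indicator columns."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}

for port in ["S", "C", "Q"]:
    print(port, one_hot(port))
# S {'Embarked_C': 0, 'Embarked_Q': 0, 'Embarked_S': 1}
# C {'Embarked_C': 1, 'Embarked_Q': 0, 'Embarked_S': 0}
# Q {'Embarked_C': 0, 'Embarked_Q': 1, 'Embarked_S': 0}
```

In this sketch a missing value (`None`) maps to all zeros, which is also how `pd.get_dummies` treats nulls by default.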

In [152]:
train["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [153]:
# Apply one-hot encoding
encoded = pd.get_dummies(train['Embarked'], prefix='Embarked')

# Concatenate the encoded columns with the original dataframe
train = pd.concat([train, encoded], axis=1)

# Drop the original "Embarked" column
train = train.drop("Embarked", axis=1)
train.head()

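| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Mr_Miss_in_Name | Sex_binary | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 1 | 0 | 0 | 0 | 1 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 1 | 1 | 0 | 0 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 1 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 1 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 1 | 0 | 0 | 0 | 1 |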


In [154]:
# Apply the same encoding to the test data
encoded = pd.get_dummies(test['Embarked'], prefix='Embarked')

# Concatenate the encoded columns with the original dataframe
test = pd.concat([test, encoded], axis=1)

# Drop the original "Embarked" column
test = test.drop("Embarked", axis=1)
test.head()

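| | PassengerId | Pclass | Age | SibSp | Parch | Fare | Mr_Miss_in_Name | Sex_binary | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | 34.5 | 0 | 0 | 7.8292 | 1 | 0 | 0 | 1 | 0 |
| 1 | 893 | 3 | 47.0 | 1 | 0 | 7.0 | 1 | 1 | 0 | 0 | 1 |
| 2 | 894 | 2 | 62.0 | 0 | 0 | 9.6875 | 1 | 0 | 0 | 1 | 0 |
| 3 | 895 | 3 | 27.0 | 0 | 0 | 8.6625 | 1 | 0 | 0 | 0 | 1 |
| 4 | 896 | 3 | 22.0 | 1 | 1 | 12.2875 | 1 | 1 | 0 | 0 | 1 |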


In [155]:
train.head()

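| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Mr_Miss_in_Name | Sex_binary | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 1 | 0 | 0 | 0 | 1 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 1 | 1 | 0 | 0 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 1 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 1 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 1 | 0 | 0 | 0 | 1 |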


# Building the Models

- Splitting the training data into training and validation sets

In [156]:
# Separate the features (X) from the target (y)
X = train.drop(["Survived", "PassengerId"], axis=1)
y = train["Survived"]

In [157]:
# Import train_test_split
from sklearn.model_selection import train_test_split

In [158]:
treino_X, valid_X, treino_y, valid_y = train_test_split(X, y, random_state=1)
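
`train_test_split` shuffles the rows and holds part of them out for validation. When `test_size` is not given, scikit-learn reserves 25% of the rows, which for the 891 training rows means 668 for training and 223 for validation. The sketch below mimics the idea in plain Python. `simple_split` is a hypothetical helper; it does not reproduce scikit-learn's exact shuffling, so the selected rows would differ.

```python
import math
import random

def simple_split(rows, test_fraction=0.25, seed=1):
    """Shuffle row indices and hold out a fraction of them for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_valid = math.ceil(len(rows) * test_fraction)  # round the hold-out size up
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    return [rows[i] for i in train_idx], [rows[i] for i in valid_idx]

rows = list(range(891))  # stand-in for the 891 training rows
train_part, valid_part = simple_split(rows)
print(len(train_part), len(valid_part))  # 668 223
```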

- Random Forest model

In [159]:
# Import the model and the accuracy metric
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [160]:
# Train the Random Forest
modelo_rf = RandomForestClassifier(random_state=1)
modelo_rf.fit(treino_X, treino_y)

RandomForestClassifier(random_state=1)

In [161]:
# Predict on the validation data
predicao_rf = modelo_rf.predict(valid_X)

In [162]:
# Accuracy of the Random Forest model (arguments in y_true, y_pred order)
acuracia_rf = accuracy_score(valid_y, predicao_rf)

- KNeighborsClassifier model

In [163]:
# Import the model
from sklearn.neighbors import KNeighborsClassifier

In [164]:
# Train the model
modelo_kn = KNeighborsClassifier(n_neighbors=3)
modelo_kn.fit(treino_X, treino_y)

KNeighborsClassifier(n_neighbors=3)

In [165]:
# Predict on the validation data
predicao_kn = modelo_kn.predict(valid_X)

In [166]:
# Accuracy
acuracia_kn = accuracy_score(valid_y, predicao_kn)

- Logistic Regression model

In [167]:
# Import the model
from sklearn.linear_model import LogisticRegression

In [168]:
# Train the model
modelo_lr = LogisticRegression(random_state=1, max_iter=1000)
modelo_lr.fit(treino_X, treino_y)

LogisticRegression(max_iter=1000, random_state=1)

In [169]:
# Predict on the validation data
predicao_lr = modelo_lr.predict(valid_X)

In [170]:
# Accuracy
acuracia_lr = accuracy_score(valid_y, predicao_lr)

## Model Results

In [171]:
print(f"RandomForest accuracy: {acuracia_rf}\n")
print(f"KNeighborsClassifier accuracy: {acuracia_kn}\n")
print(f"LogisticRegression accuracy: {acuracia_lr}")

RandomForest accuracy: 0.7713004484304933

KNeighborsClassifier accuracy: 0.7309417040358744

LogisticRegression accuracy: 0.7892376681614349
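
Accuracy is the share of validation rows whose predicted label equals the true label. A minimal plain-Python version, with a hypothetical `accuracy` helper and made-up labels, might look like this:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Made-up labels for illustration only
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(accuracy(y_true, y_pred))  # 0.75
```

Assuming the validation set holds 223 rows (the default 25% split of 891), the Logistic Regression score of about 0.789 corresponds to 176 correct predictions.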


----

# Predicting the test data with the Logistic Regression model

In [174]:
X_train = train.drop(["PassengerId", "Survived"], axis=1)
y_train = train["Survived"]

In [175]:
X_test = test.drop("PassengerId", axis=1)

In [176]:
# Train the Logistic Regression model on the full training data
modelo_final = LogisticRegression(random_state=1, max_iter=1000)
modelo_final.fit(X_train, y_train)

LogisticRegression(max_iter=1000, random_state=1)

In [177]:
# Predict on the test data
predicao_final = modelo_final.predict(X_test)

In [179]:
sub = pd.Series(predicao_final, index=test["PassengerId"], name="Survived")
sub.to_csv("Files/Terceira_predicao.csv", header=True)