# Spaceship Titanic

- Using the [data available on Kaggle](https://www.kaggle.com/competitions/spaceship-titanic)
    - **Competition** dataset
    - Submissions are evaluated by **accuracy**

### Re-importing the datasets and cleaning the data
- Importing the original files

In [1]:
# Importing pandas
import pandas as pd

In [2]:
# Viewing the original training set
treino_org = pd.read_csv('train.csv')
treino_org.head(3)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False


In [3]:
# Viewing the original test set
teste_org = pd.read_csv('test.csv')
teste_org.head(3)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,0013_01,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,Nelly Carsoning
1,0018_01,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,Lerome Peckers
2,0019_01,Europa,True,C/0/S,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0,Sabih Unhearfus


### Using the new models to make the prediction
- Let's try to extract information from the Cabin column, which was dropped at the start
- The side of the ship the cabin is on may be a useful feature

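#### Handling null values
- The PassengerId can help identify some of the cabins
- The field has the format "aaaa_bb", where "aaaa" identifies the travel group and "bb" is the passenger's number within that group, so a passenger with a missing cabin can borrow the cabin of another passenger with the same "aaaa" prefix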
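As a minimal sketch of how the identifier can be decomposed, the snippet below splits `PassengerId` into its two parts. The sample rows and the column names `Group`, `NumberInGroup` and `GroupSize` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Illustrative sample; the notebook reads the real IDs from train.csv
df = pd.DataFrame({'PassengerId': ['0001_01', '0003_01', '0003_02']})

# Split "aaaa_bb" into the group prefix and the number within the group
parts = df['PassengerId'].str.split('_', expand=True)
df['Group'] = parts[0]                      # hypothetical column name
df['NumberInGroup'] = parts[1].astype(int)  # hypothetical column name

# How many passengers share the same group prefix
df['GroupSize'] = df.groupby('Group')['PassengerId'].transform('count')
print(df)
```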

In [4]:
# Creating index filters for null and non-null Cabin
idx_null = treino_org.index[treino_org.Cabin.isnull()]
idx = treino_org.index[treino_org.Cabin.notnull()]

In [5]:
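# Fills a missing Cabin with the Cabin of another passenger from the same group
# Note: relies on the global `idx` (rows with a known Cabin) defined in the previous cell
def compCabin(passId, df):
    passId = passId.split('_')
    for val in df.loc[idx].PassengerId.values:
        if (passId[0] in val):
            # .values returns a one-element array, hence the brackets in the output below
            return df.loc[df.PassengerId == val, 'Cabin'].values
    return 'NONE'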

In [6]:
# Applying the function above
treino_org.loc[idx_null, 'Cabin'] = treino_org.loc[idx_null].apply(lambda x: compCabin(x.PassengerId, treino_org), axis=1)
treino_org.loc[idx_null, 'Cabin']

15            NONE
93            NONE
103        [B/5/P]
222           NONE
227       [F/47/S]
           ...    
8209     [B/339/S]
8475     [B/296/P]
8485     [B/297/P]
8509    [G/1476/P]
8656          NONE
Name: Cabin, Length: 199, dtype: object
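The same group-based fill can be sketched without a Python loop and with plain strings instead of one-element arrays. This is an alternative under stated assumptions, not the notebook's method: the sample frame is made up, and `first_cabin` is a hypothetical helper name.

```python
import pandas as pd

# Illustrative sample with two missing cabins
df = pd.DataFrame({
    'PassengerId': ['0001_01', '0002_01', '0002_02', '0003_01'],
    'Cabin': ['B/0/P', 'F/0/S', None, None],
})

# Group prefix ("aaaa") of each passenger
group = df['PassengerId'].str.split('_').str[0]

# First known cabin per group; groups with no known cabin are absent from this lookup
known = df['Cabin'].notna()
first_cabin = df.loc[known, 'Cabin'].groupby(group[known]).first()

# Fill from the group where possible, then fall back to the 'NONE' placeholder
df['Cabin'] = df['Cabin'].fillna(group.map(first_cabin)).fillna('NONE')
print(df)
```

Here passenger `0002_02` inherits the cabin of `0002_01`, while `0003_01` has no group member with a known cabin and keeps the placeholder.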

In [7]:
# Keeping only the information about the side
treino_org.Cabin = treino_org.Cabin.apply(lambda x: x[-1:])
treino_org.Cabin

0       P
1       S
2       S
3       S
4       S
       ..
8688    P
8689    S
8690    S
8691    S
8692    S
Name: Cabin, Length: 8693, dtype: object

In [8]:
# Checking the mode
treino_org.Cabin.mode()

0    S
Name: Cabin, dtype: object

In [9]:
# Filling the rest with the mode ('E' is the last character of the 'NONE' placeholder)
treino_org.loc[treino_org.Cabin == 'E', 'Cabin'] = 'S'
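The cells above identify unfilled rows by the `'E'` left over from slicing `'NONE'`. A possible alternative, sketched here on made-up values, is to split the cabin string on `/` so that the placeholder becomes a genuine missing value, and then fill it with the mode. The variable names are assumptions for illustration.

```python
import pandas as pd

# Illustrative cabin values in the "deck/num/side" format plus the placeholder
cabin = pd.Series(['B/0/P', 'F/0/S', 'A/0/S', 'NONE'])

# Third field of "deck/num/side"; 'NONE' has no third field, so it becomes NaN
side = cabin.str.split('/').str[2]

# Fill the missing sides with the most frequent value (the mode)
most_common = side.mode()[0]
side = side.fillna(most_common)
print(side.tolist())
```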

#### Doing the same for the test set

In [10]:
# Creating index filters for null and non-null Cabin
idx_null = teste_org.index[teste_org.Cabin.isnull()]
idx = teste_org.index[teste_org.Cabin.notnull()]

In [11]:
# Applying the function above
teste_org.loc[idx_null, 'Cabin'] = teste_org.loc[idx_null].apply(lambda x: compCabin(x.PassengerId, teste_org), axis=1)
teste_org.loc[idx_null, 'Cabin']

18         [B/0/S]
99            NONE
135           NONE
147           NONE
180       [F/81/P]
           ...    
4209          NONE
4248          NONE
4249          NONE
4258    [G/1501/P]
4273          NONE
Name: Cabin, Length: 100, dtype: object

In [12]:
# Keeping only the information about the side
teste_org.Cabin = teste_org.Cabin.apply(lambda x: x[-1:])
teste_org.Cabin

0       S
1       S
2       S
3       S
4       S
       ..
4272    S
4273    E
4274    P
4275    P
4276    S
Name: Cabin, Length: 4277, dtype: object

In [13]:
# Checking the mode
teste_org.Cabin.mode()

0    S
Name: Cabin, dtype: object

In [14]:
# Filling the rest with the mode ('E' is the last character of the 'NONE' placeholder)
teste_org.loc[teste_org.Cabin == 'E', 'Cabin'] = 'S'

### Importing the most recent dataset

- Importing what was done in [Part 3 - Feature Engineering](https://github.com/PedroALage/Projetos/blob/main/Data_Science/Spaceship_Titanic/Parte3_EngRecursos.ipynb)

In [15]:
# Viewing the training set
treino = pd.read_csv('treino_trat3.csv')
treino.head(3)

Unnamed: 0,PassengerId,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,VIPCheck,CryoSCheck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,JovemSono
0,0001_01,39.0,0.0,0.0,0.0,0.0,0.0,False,0,0,0,1,0,0,0,1,0
1,0002_01,24.0,1.786885,0.096774,0.78125,7.418919,0.758621,True,0,0,1,0,0,0,0,1,0
2,0003_01,58.0,0.704918,38.451613,0.0,90.743243,0.844828,False,1,0,0,1,0,0,0,1,0


In [16]:
# Viewing the test set
teste = pd.read_csv('teste_trat3.csv')
teste.head(3)

Unnamed: 0,PassengerId,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,VIPCheck,CryoSCheck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,JovemSono
0,0013_01,27.0,0.0,0.0,0.0,0.0,0.0,0,1,1,0,0,0,0,1,0
1,0018_01,19.0,0.0,0.088235,0.0,44.809524,0.0,0,0,1,0,0,0,0,1,0
2,0019_01,31.0,0.0,0.0,0.0,0.0,0.0,0,1,0,1,0,1,0,0,0


In [17]:
# Function that encodes the side information (1 for 'S', 0 otherwise)
def findSide(cabin):
    if (cabin[-1:] == 'S'):
        return 1
    else:
        return 0

In [18]:
# Adding the information to the training set
treino['Side'] = treino_org.apply(lambda x: findSide(str(x.Cabin)), axis=1)
treino.head(3)

Unnamed: 0,PassengerId,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,VIPCheck,CryoSCheck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,JovemSono,Side
0,0001_01,39.0,0.0,0.0,0.0,0.0,0.0,False,0,0,0,1,0,0,0,1,0,0
1,0002_01,24.0,1.786885,0.096774,0.78125,7.418919,0.758621,True,0,0,1,0,0,0,0,1,0,1
2,0003_01,58.0,0.704918,38.451613,0.0,90.743243,0.844828,False,1,0,0,1,0,0,0,1,0,1


In [19]:
# Adding the information to the test set
teste['Side'] = teste_org.apply(lambda x: findSide(str(x.Cabin)), axis=1)
teste.head(3)

Unnamed: 0,PassengerId,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,VIPCheck,CryoSCheck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,JovemSono,Side
0,0013_01,27.0,0.0,0.0,0.0,0.0,0.0,0,1,1,0,0,0,0,1,0,1
1,0018_01,19.0,0.0,0.088235,0.0,44.809524,0.0,0,0,1,0,0,0,0,1,0,1
2,0019_01,31.0,0.0,0.0,0.0,0.0,0.0,0,1,0,1,0,1,0,0,0,1
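The two cells above assign `Side` row by row and depend on the original and the engineered frames sharing the same row order. A hedged alternative is to encode the side with a vectorised comparison and attach it by `PassengerId`. The frames `raw` and `feat` below are illustrative stand-ins for `treino_org` and `treino`, with made-up rows in a deliberately different order.

```python
import pandas as pd

# Stand-in for treino_org after the Cabin column was reduced to the side letter
raw = pd.DataFrame({'PassengerId': ['0001_01', '0002_01', '0003_01'],
                    'Cabin': ['P', 'S', 'S']})

# Stand-in for treino, with rows in a different order on purpose
feat = pd.DataFrame({'PassengerId': ['0003_01', '0001_01', '0002_01'],
                     'Age': [58.0, 39.0, 24.0]})

# Vectorised equivalent of findSide: 1 when the cabin ends in 'S', 0 otherwise
raw['Side'] = raw['Cabin'].str.endswith('S').astype(int)

# Attach Side by PassengerId instead of relying on matching row order
feat['Side'] = feat['PassengerId'].map(raw.set_index('PassengerId')['Side'])
print(feat)
```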


### Models used

- Selecting algorithms different from those used in the previous parts
- Considering the [other algorithms available in scikit-learn](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)
    - **Logistic Regression**
        - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
    - **Random Forest**
        - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
    - **MLPClassifier (Neural Networks)**
        - https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
- Before using the algorithms, the training set must be split into **training and validation**
    - Using **train_test_split**
        - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [20]:
# Importing train_test_split
from sklearn.model_selection import train_test_split

In [21]:
# Splitting the training set into X and y
X = treino.drop(['PassengerId', 'Transported'], axis=1)
y = treino.Transported

In [22]:
# Splitting into training and validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=42)

- For **Logistic Regression**

In [23]:
# Importing
from sklearn.linear_model import LogisticRegression

In [24]:
# Creating the classifier
clf_rl = LogisticRegression(random_state=42, max_iter=1000)

In [25]:
# Fitting the model on the training data
clf_rl = clf_rl.fit(X_train, y_train)

In [26]:
# Predicting on the validation set
y_pred_rl = clf_rl.predict(X_val)

- For **Random Forest**

In [27]:
# Importing
from sklearn.ensemble import RandomForestClassifier

In [28]:
# Creating the classifier
clf_rf = RandomForestClassifier(random_state=42)

In [29]:
# Fitting the model on the training data
clf_rf = clf_rf.fit(X_train, y_train)

In [30]:
# Predicting on the validation set
y_pred_rf = clf_rf.predict(X_val)

- And for **MLPClassifier (Neural Networks)**

In [31]:
# Importing
from sklearn.neural_network import MLPClassifier

In [32]:
# Creating the classifier
clf_mp = MLPClassifier(random_state=42, max_iter=1000)

In [33]:
# Fitting the model on the training data
clf_mp = clf_mp.fit(X_train, y_train)

In [34]:
# Predicting on the validation set
y_pred_mp = clf_mp.predict(X_val)
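The three fit and predict sequences above follow the same pattern, so they could also be written as one loop. The sketch below assumes synthetic data from `make_classification` in place of the notebook's `X` and `y`, and the `models` and `scores` dictionaries are illustrative names; the accuracies it produces say nothing about the Spaceship Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for X and y (assumption: the real features come from treino)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Same three classifiers and parameters as in the cells above
models = {
    'LogisticRegression': LogisticRegression(random_state=42, max_iter=1000),
    'RandomForest': RandomForestClassifier(random_state=42),
    'MLPClassifier': MLPClassifier(random_state=42, max_iter=1000),
}

# Fit each model on the training split and score it on the validation split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_val, model.predict(X_val))

for name, score in scores.items():
    print(f'{name}: {score:.4f}')
```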

- **Evaluating the models**
    - Accuracy (the evaluation metric used in the competition):
        - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
    - Confusion matrix (helps visualize how the errors are distributed):
        - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

- Evaluating **accuracy**

In [35]:
# Importing
from sklearn.metrics import accuracy_score

In [36]:
# For Logistic Regression
accuracy_score(y_val, y_pred_rl)

0.7811084001394214

In [37]:
# For Random Forest
accuracy_score(y_val, y_pred_rf)

0.7828511676542349

In [38]:
# For MLPClassifier (Neural Networks)
accuracy_score(y_val, y_pred_mp)

0.7765772046009063

- Evaluating the **confusion matrix**

In [39]:
# Importing
from sklearn.metrics import confusion_matrix

In [40]:
# For Logistic Regression
confusion_matrix(y_val, y_pred_rl)

array([[1040,  384],
       [ 244, 1201]], dtype=int64)

In [41]:
# For Random Forest
confusion_matrix(y_val, y_pred_rf)

array([[1119,  305],
       [ 318, 1127]], dtype=int64)

In [42]:
# For MLPClassifier (Neural Networks)
confusion_matrix(y_val, y_pred_mp)

array([[1022,  402],
       [ 239, 1206]], dtype=int64)
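To connect the two metrics, the accuracy can be recomputed from a confusion matrix, along with precision and recall for the `True` (transported) class. The sketch below uses the Logistic Regression matrix reported above and relies on scikit-learn's convention of true classes in rows and predicted classes in columns, with labels in sorted order (`False`, `True`).

```python
import numpy as np

# Confusion matrix reported above for Logistic Regression
# rows: true class, columns: predicted class, label order: False, True
cm = np.array([[1040, 384],
               [244, 1201]])

# Unpack true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all validation rows classified correctly
precision = tp / (tp + fp)       # of the rows predicted True, the share that were True
recall = tp / (tp + fn)          # of the rows that were True, the share predicted True

print(f'accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}')
```

The recomputed accuracy agrees with the value returned by `accuracy_score` for the same model.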

#### Storing the changes

In [43]:
# Exporting the updated datasets to csv
treino.to_csv('treino_trat5.csv', index=False)
teste.to_csv('teste_trat5.csv', index=False)

### Making the prediction for the test data
- We'll use the best-performing model to predict on the test set

In [101]:
# Viewing X_train
X_train.head(3)

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,VIPCheck,CryoSCheck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,JovemSono,Side
4696,35.0,21.918033,0.526882,1.78125,0.0,0.0,0,0,0,0,1,0,0,1,0,1
5946,28.0,0.0,1.634409,6.71875,0.405405,8.793103,0,0,1,0,0,0,0,1,0,0
227,43.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,1,0,0,1,0,0


In [102]:
# Viewing the test set
teste.head(3)

Unnamed: 0,PassengerId,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,VIPCheck,CryoSCheck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,JovemSono,Side
0,0013_01,27.0,0.0,0.0,0.0,0.0,0.0,0,1,1,0,0,0,0,1,0,1
1,0018_01,19.0,0.0,0.088235,0.0,44.809524,0.0,0,0,1,0,0,0,0,1,0,1
2,0019_01,31.0,0.0,0.0,0.0,0.0,0.0,0,1,0,1,0,1,0,0,0,1


In [113]:
# For the test set to match the training set, we need to drop the id column
X_teste = teste.drop('PassengerId',axis=1)
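Before calling `predict`, it can help to confirm that the test features carry the same columns as the training features, in the same order. This is a minimal sketch on made-up two-column frames; the names `missing` and `extra` are illustrative.

```python
import pandas as pd

# Illustrative stand-ins for X_train and teste, with columns in a different order
X_train = pd.DataFrame({'Age': [35.0], 'Side': [1]})
teste = pd.DataFrame({'PassengerId': ['0013_01'], 'Side': [1], 'Age': [27.0]})

X_teste = teste.drop('PassengerId', axis=1)

# Columns the model was trained on but the test set lacks, and the reverse
missing = set(X_train.columns) - set(X_teste.columns)
extra = set(X_teste.columns) - set(X_train.columns)
if missing or extra:
    raise ValueError(f'column mismatch: missing={missing}, extra={extra}')

# Reorder the test columns to match the training order
X_teste = X_teste[X_train.columns]
print(list(X_teste.columns))
```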

In [114]:
# Using the chosen model (MLPClassifier) on the test set
y_pred = clf_mp.predict(X_teste)

In [115]:
# Creating a new column in the test set with the prediction
teste['Transported'] = y_pred

In [116]:
# Selecting only the PassengerId and Transported columns for the submission
base_envio = teste[['PassengerId','Transported']]

In [117]:
# Exporting to a csv
base_envio.to_csv('resultados5.csv',index=False)
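A few sanity checks on the submission frame can catch formatting slips before uploading. The sketch below assumes the competition expects exactly the columns `PassengerId` and `Transported` with boolean predictions; `check_submission` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd


def check_submission(df):
    """Hypothetical helper: return a list of problems found in a submission frame."""
    if list(df.columns) != ['PassengerId', 'Transported']:
        return ['unexpected columns: ' + ', '.join(map(str, df.columns))]
    problems = []
    if not df['PassengerId'].is_unique:
        problems.append('duplicated PassengerId')
    if df['Transported'].isnull().any():
        problems.append('missing predictions')
    if not pd.api.types.is_bool_dtype(df['Transported']):
        problems.append('predictions are not boolean')
    return problems


# Illustrative stand-in for base_envio
base_envio = pd.DataFrame({'PassengerId': ['0013_01', '0018_01'],
                           'Transported': [True, False]})
print(check_submission(base_envio))
```

An empty list means none of the checks found a problem.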

## Result

- The Neural Network model did not have the best accuracy on the validation split, but it gave the best result overall
- The model got 79.72% of the evaluated predictions right
- This is a small drop compared with the previous one
    
<img src="pkgImages/tentativa5.png" width=900>