# Spaceship Titanic

- Using the [data available on Kaggle](https://www.kaggle.com/competitions/spaceship-titanic)
    - **Competition** dataset
    - Submissions are scored by **accuracy**
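
Since the competition metric is accuracy, it may help to recall that it is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels (the `accuracy` helper and the sample values are illustrative, not part of the notebook):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Hypothetical labels: 3 of the 4 predictions are right
y_true = [True, False, True, True]
y_pred = [True, False, False, True]
print(accuracy(y_true, y_pred))  # 0.75
```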

### Re-importing the datasets and applying the data treatment
- Loading the output of [Part 5 - Reviewing the dataset](https://github.com/PedroALage/Projetos/blob/main/Data_Science/Spaceship_Titanic/Parte5_RevisandoBase.ipynb)

In [1]:
# Importing pandas
import pandas as pd

In [2]:
# Previewing the training set
treino = pd.read_csv('treino_trat5.csv')
treino.head(3)

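| Unnamed: 0 | PassengerId | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | VIPCheck | CryoSCheck | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | JovemSono | Side |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0002_01 | 24.0 | 1.786885 | 0.096774 | 0.78125 | 7.418919 | 0.758621 | True | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 0003_01 | 58.0 | 0.704918 | 38.451613 | 0.0 | 90.743243 | 0.844828 | False | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |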


In [3]:
# Previewing the test set
teste = pd.read_csv('teste_trat5.csv')
teste.head(3)

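| Unnamed: 0 | PassengerId | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | VIPCheck | CryoSCheck | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | JovemSono | Side | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0013_01 | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | True |
| 1 | 0018_01 | 19.0 | 0.0 | 0.088235 | 0.0 | 44.809524 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | False |
| 2 | 0019_01 | 31.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | True |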


### Using the new models to make predictions

- Selecting algorithms different from those used in the previous parts
- Considering the [other algorithms available in scikit-learn](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)
    - **Logistic Regression**
        - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
    - **Random Forest**
        - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
    - **MLPClassifier (Neural Networks)**
        - https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier


- Now, **in addition to train_test_split**:
    - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- Also using a grid search (**GridSearchCV**) to estimate the best parameters
    - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [4]:
# Importing train_test_split
from sklearn.model_selection import train_test_split

In [5]:
# Splitting the training set into X and y
X = treino.drop(['PassengerId', 'Transported'], axis=1)
y = treino.Transported

In [6]:
# Splitting into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
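
With `test_size=0.2`, scikit-learn rounds the validation share up and gives the remainder to training. The sketch below assumes 8,693 training rows (the size of the Kaggle training file; the notebook itself never prints the row count), which is consistent with the 1,739 validation rows that the confusion matrices later in this notebook add up to. The arrays are placeholders, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 8693  # assumption: row count of the Kaggle training file
X_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder features
y_demo = np.arange(n_rows) % 2             # placeholder labels

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_va))  # 6954 1739
```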

- For **Logistic Regression**

In [7]:
# Importing
from sklearn.linear_model import LogisticRegression

In [8]:
# Creating the classifier
clf_rl = LogisticRegression(random_state=42)

In [19]:
# Defining the parameters (a single combination; the wider grid is kept commented out below)
parametros_rl = {'C': [0.01], 'max_iter': [100], 'penalty': ['l2'], 'solver': ['newton-cg']}
# {
#     'penalty': ['l1', 'l2', 'elasticnet', None]
#     , 'C': [0.01, 0.1, 1, 10]
#     , 'solver': ['lbfgs', 'liblinear', 'newton-cg', 'saga']
#     , 'max_iter': [100, 500, 1000]
# }
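
Not every `penalty`/`solver` pair in the commented-out grid is valid in scikit-learn: for example, `lbfgs` and `newton-cg` do not support `'l1'`, and `'elasticnet'` is only available with `saga`. One way to avoid failed fits is to pass a list of grids, each restricted to compatible combinations. The sketch below only counts candidates with `ParameterGrid`; the particular grouping is an illustration, not what the notebook ran.

```python
from sklearn.model_selection import ParameterGrid

common = {'C': [0.01, 0.1, 1, 10], 'max_iter': [100, 500, 1000]}

# Full cross product from the commented-out grid (includes invalid pairs)
full_grid = {
    'penalty': ['l1', 'l2', 'elasticnet', None],
    'solver': ['lbfgs', 'liblinear', 'newton-cg', 'saga'],
    **common,
}

# Illustrative restriction to solver/penalty pairs that scikit-learn supports
valid_grids = [
    {'penalty': ['l2'], 'solver': ['lbfgs', 'liblinear', 'newton-cg', 'saga'], **common},
    {'penalty': ['l1'], 'solver': ['liblinear', 'saga'], **common},
]

print(len(ParameterGrid(full_grid)))    # 192
print(len(ParameterGrid(valid_grids)))  # 72
```

`GridSearchCV` accepts the same list-of-dicts form for its `param_grid` argument.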

- For **Random Forest**

In [10]:
# Importing
from sklearn.ensemble import RandomForestClassifier

In [11]:
# Creating the classifier
clf_rf = RandomForestClassifier(random_state=42)

In [20]:
# Defining the parameters (a single combination; the wider grid is kept commented out below)
parametros_rf = {'criterion': ['entropy'],
 'max_depth': [12],
 'max_features': ['sqrt'],
 'min_samples_leaf': [5],
 'n_estimators': [1000]}
# {
#     'n_estimators': [500, 1000, 5000]
#     , 'criterion': ['gini', 'entropy', 'log_loss']
#     , 'max_depth': [4, 8, 12, None]
#     , 'max_features': ['sqrt', 'log2', None]
#     , 'min_samples_leaf': [1, 3, 5]
# }
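
The size of the commented-out grid shows why narrowing it matters: every combination is fitted once per fold. A rough count, using the grid above and the 8-fold `KFold` configured later in this notebook (refitting the best model on the full training split adds one more fit):

```python
from math import prod

# Sizes of the commented-out Random Forest grid
grid_sizes = {
    'n_estimators': 3,      # [500, 1000, 5000]
    'criterion': 3,         # ['gini', 'entropy', 'log_loss']
    'max_depth': 4,         # [4, 8, 12, None]
    'max_features': 3,      # ['sqrt', 'log2', None]
    'min_samples_leaf': 3,  # [1, 3, 5]
}
n_splits = 8  # folds used by KFold in this notebook

n_candidates = prod(grid_sizes.values())
n_fits = n_candidates * n_splits
print(n_candidates, n_fits)  # 324 2592
```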

- And for **MLPClassifier (Neural Networks)**

In [13]:
# Importing
from sklearn.neural_network import MLPClassifier

In [14]:
# Creating the classifier
clf_mp = MLPClassifier(random_state=42)

In [21]:
# Defining the parameters (a single combination; the wider grid is kept commented out below)
parametros_mp = {'alpha': [0.1], 'max_iter': [200], 'solver': ['adam']}
# {
#     'solver': ['lbfgs', 'sgd', 'adam']
#     , 'alpha': [10.0**(-1), 10.0**(-4), 10.0**(-7)]  
#     , 'max_iter': [200, 600, 1000]
# }
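
`MLPClassifier` is sensitive to feature scale, and the scikit-learn documentation recommends scaling the inputs. The notebook does not do this. As a hedged sketch of how scaling could be combined with the same kind of grid search, a `StandardScaler` can be placed in a pipeline so that it is fitted only on each training fold. The data here is synthetic and the grid is reduced to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

pipe = make_pipeline(StandardScaler(), MLPClassifier(random_state=42, max_iter=200))

# Step parameters are addressed as <step name>__<parameter>
param_grid = {'mlpclassifier__alpha': [1e-4, 1e-1]}

cv = KFold(n_splits=4, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=cv)
search.fit(X_demo, y_demo)
print(search.best_params_)
```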

- **Running the grid search**

In [22]:
# Suppressing warnings
import warnings
warnings.filterwarnings('ignore')

In [23]:
# Importing KFold and GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

In [24]:
# For Logistic Regression
kfold_rl = KFold(shuffle=True, random_state=42, n_splits=8)
gsearch_rl = GridSearchCV(clf_rl, parametros_rl, scoring='accuracy', cv=kfold_rl)
gsearch_rl = gsearch_rl.fit(X_train, y_train)

In [25]:
# For Random Forest
kfold_rf = KFold(shuffle=True, random_state=42, n_splits=8)
gsearch_rf = GridSearchCV(clf_rf, parametros_rf, scoring='accuracy', cv=kfold_rf)
gsearch_rf = gsearch_rf.fit(X_train, y_train)

In [26]:
# For MLPClassifier
kfold_mp = KFold(shuffle=True, random_state=42, n_splits=8)
gsearch_mp = GridSearchCV(clf_mp, parametros_mp, scoring='accuracy', cv=kfold_mp)
gsearch_mp = gsearch_mp.fit(X_train, y_train)
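
The three cells above repeat the same pattern, so they could be folded into a loop. A minimal sketch on synthetic data, with smaller models and grids than the notebook's so that it runs quickly (the `models` dictionary and the reduced settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# name -> (estimator, parameter grid); illustrative, reduced grids
models = {
    'rl': (LogisticRegression(random_state=42), {'C': [0.01, 1]}),
    'rf': (RandomForestClassifier(random_state=42),
           {'n_estimators': [20], 'max_depth': [4, 8]}),
}

kfold = KFold(n_splits=4, shuffle=True, random_state=42)
searches = {}
for name, (clf, grid) in models.items():
    # fit() returns the fitted GridSearchCV object itself
    searches[name] = GridSearchCV(clf, grid, scoring='accuracy', cv=kfold).fit(X_demo, y_demo)

for name, gs in searches.items():
    print(name, round(gs.best_score_, 4), gs.best_params_)
```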

- **Checking the best scores**

In [27]:
# Checking the best score for Logistic Regression
gsearch_rl.best_score_

0.7999697432641562

In [28]:
# For Random Forest
gsearch_rf.best_score_

0.8062993532002698

In [30]:
# and for MLPClassifier
gsearch_mp.best_score_

0.7979560996256763

- **And the best parameters**

In [31]:
# Checking the best parameters for Logistic Regression
gsearch_rl.best_params_

{'C': 0.01, 'max_iter': 100, 'penalty': 'l2', 'solver': 'newton-cg'}

In [32]:
# For Random Forest
gsearch_rf.best_params_

{'criterion': 'entropy',
 'max_depth': 12,
 'max_features': 'sqrt',
 'min_samples_leaf': 5,
 'n_estimators': 1000}

In [33]:
# and for MLPClassifier
gsearch_mp.best_params_

{'alpha': 0.1, 'max_iter': 200, 'solver': 'adam'}
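
`best_score_` and `best_params_` only report the winner. When the grid has more than one candidate, `cv_results_` shows how every combination performed, including the spread across folds. A sketch on synthetic data (the grid values are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

kfold = KFold(n_splits=4, shuffle=True, random_state=42)
gs = GridSearchCV(
    LogisticRegression(random_state=42, max_iter=500),
    {'C': [0.01, 0.1, 1, 10]},
    scoring='accuracy',
    cv=kfold,
).fit(X_demo, y_demo)

# One row per candidate, best-ranked first
results = (
    pd.DataFrame(gs.cv_results_)
    [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)
print(results)
```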

- **Predicting on the validation data with each of the best models**

In [34]:
# For Logistic Regression
clf_best_rl = gsearch_rl.best_estimator_
y_pred_rl = clf_best_rl.predict(X_val)

In [35]:
# For Random Forest
clf_best_rf = gsearch_rf.best_estimator_
y_pred_rf = clf_best_rf.predict(X_val)

In [36]:
# and for MLPClassifier
clf_best_mp = gsearch_mp.best_estimator_
y_pred_mp = clf_best_mp.predict(X_val)

- Let's **evaluate the models** again
    - Accuracy (the evaluation metric used in the competition):
        - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
    - Confusion matrix (helps visualize how the errors are distributed):
        - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

- Evaluating **accuracy**

In [37]:
# Importing
from sklearn.metrics import accuracy_score

In [38]:
# For Logistic Regression
accuracy_score(y_val, y_pred_rl)

0.7791834387579069

In [39]:
# For Random Forest
accuracy_score(y_val, y_pred_rf)

0.7814836112708453

In [40]:
# For MLPClassifier (Neural Networks)
accuracy_score(y_val, y_pred_mp)

0.7780333525014376

- Evaluating the **confusion matrix**

In [41]:
# Importing
from sklearn.metrics import confusion_matrix

In [42]:
# For Logistic Regression
confusion_matrix(y_val, y_pred_rl)

array([[614, 247],
       [137, 741]], dtype=int64)

In [43]:
# For Random Forest
confusion_matrix(y_val, y_pred_rf)

array([[647, 214],
       [166, 712]], dtype=int64)

In [44]:
# For MLPClassifier (Neural Networks)
confusion_matrix(y_val, y_pred_mp)

array([[628, 233],
       [153, 725]], dtype=int64)
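
Each matrix follows scikit-learn's layout: rows are the true classes and columns the predicted ones, in the order `False`, `True`. The accuracy values reported earlier can be recovered from the diagonal, and the off-diagonal cells show where each model errs. A small check using the three matrices printed above:

```python
# Confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]]
matrices = {
    'Logistic Regression': [[614, 247], [137, 741]],
    'Random Forest': [[647, 214], [166, 712]],
    'MLPClassifier': [[628, 233], [153, 725]],
}

metrics = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    total = tn + fp + fn + tp
    metrics[name] = {
        'accuracy': (tn + tp) / total,  # diagonal over all samples
        'precision': tp / (tp + fp),    # of predicted True, how many were True
        'recall': tp / (tp + fn),       # of actual True, how many were found
    }
    print(name, {k: round(v, 4) for k, v in metrics[name].items()})
```

All three models produce more false positives than false negatives, i.e. they tend to over-predict `Transported = True`.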

### Making predictions for the test data
- Let's use a tuned model to predict on the test set (the cell below uses the MLPClassifier, although the Random Forest had the highest validation accuracy)

In [45]:
# Previewing X_train
X_train.head(3)

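| Unnamed: 0 | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | VIPCheck | CryoSCheck | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | JovemSono | Side |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2333 | 28.0 | 0.0 | 0.591398 | 0.0 | 8.864865 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2589 | 17.0 | 0.0 | 12.849462 | 0.96875 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 8302 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |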


In [46]:
# Previewing the test set
teste.head(3)

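| Unnamed: 0 | PassengerId | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | VIPCheck | CryoSCheck | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars | Destination_55 Cancri e | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | JovemSono | Side | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0013_01 | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | True |
| 1 | 0018_01 | 19.0 | 0.0 | 0.088235 | 0.0 | 44.809524 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | False |
| 2 | 0019_01 | 31.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | True |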


In [54]:
# To give the test set the same columns as X_train, drop the id column and the existing Transported column
X_teste = teste.drop(['PassengerId', 'Transported'],axis=1)

In [55]:
# Using the tuned MLPClassifier on the test set
y_pred = clf_best_mp.predict(X_teste)

In [56]:
# Storing the predictions in the Transported column of the test set
teste['Transported'] = y_pred

In [57]:
# Keeping only the PassengerId and Transported columns for submission
base_envio = teste[['PassengerId','Transported']]

In [58]:
# Exporting to a CSV file
base_envio.to_csv('resultados6.csv',index=False)
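
Before uploading, it can be worth checking that the file has the shape Kaggle expects: one row per test passenger, with exactly the `PassengerId` and `Transported` columns and boolean predictions. A hedged sketch with a hypothetical `check_submission` helper and a tiny made-up frame:

```python
import pandas as pd

def check_submission(df):
    """Hypothetical sanity checks for a Spaceship Titanic submission frame."""
    assert list(df.columns) == ['PassengerId', 'Transported'], "unexpected columns"
    assert df['PassengerId'].is_unique, "duplicate passenger ids"
    assert df['Transported'].dtype == bool, "predictions should be True/False"
    assert not df.isna().any().any(), "missing values in submission"
    return True

# Made-up example in the same shape as base_envio
demo = pd.DataFrame({
    'PassengerId': ['0013_01', '0018_01', '0019_01'],
    'Transported': [True, False, True],
})
print(check_submission(demo))  # True
```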

## Result

- The model predicted 79.19% of the test cases correctly
- This is a slight drop compared to the previous attempt
    
<img src="pkgImages/tentativa6.png" width=900>