# Spaceship Titanic

- Using the [data available on Kaggle](https://www.kaggle.com/competitions/spaceship-titanic)
    - **Competition** dataset
    - Results are evaluated by **accuracy**

### Re-importing the datasets
- Importing what was done in [Part 0](https://github.com/PedroALage/Projetos/blob/main/Data_Science/Spaceship_Titanic/Parte0_TratandoDados.ipynb)

In [1]:
# Importing pandas
import pandas as pd

In [2]:
# Loading and previewing the training set
treino = pd.read_csv('treino_trat1.csv')
treino.head(3)

| | PassengerId | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False |
| 1 | 0002_01 | Earth | False | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True |
| 2 | 0003_01 | Europa | False | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False |


In [3]:
# Checking the dataset info
treino.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8693 non-null   object 
 2   CryoSleep     8693 non-null   bool   
 3   Destination   8693 non-null   object 
 4   Age           8693 non-null   float64
 5   VIP           8693 non-null   bool   
 6   RoomService   8693 non-null   float64
 7   FoodCourt     8693 non-null   float64
 8   ShoppingMall  8693 non-null   float64
 9   Spa           8693 non-null   float64
 10  VRDeck        8693 non-null   float64
 11  Transported   8693 non-null   bool   
dtypes: bool(3), float64(6), object(3)
memory usage: 636.8+ KB


In [4]:
# And the null values
treino.isnull().sum().sort_values(ascending=False)

PassengerId     0
HomePlanet      0
CryoSleep       0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Transported     0
dtype: int64

- **Previewing the test set**

In [5]:
# Loading and previewing the test set
teste = pd.read_csv('teste_trat1.csv')
teste.head(3)

| | PassengerId | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0013_01 | Earth | True | TRAPPIST-1e | 27.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0018_01 | Earth | False | TRAPPIST-1e | 19.0 | False | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 |
| 2 | 0019_01 | Europa | True | 55 Cancri e | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [6]:
# Checking the dataset info
teste.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4277 entries, 0 to 4276
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   4277 non-null   object 
 1   HomePlanet    4277 non-null   object 
 2   CryoSleep     4277 non-null   bool   
 3   Destination   4277 non-null   object 
 4   Age           4277 non-null   float64
 5   VIP           4277 non-null   bool   
 6   RoomService   4277 non-null   float64
 7   FoodCourt     4277 non-null   float64
 8   ShoppingMall  4277 non-null   float64
 9   Spa           4277 non-null   float64
 10  VRDeck        4277 non-null   float64
dtypes: bool(2), float64(6), object(3)
memory usage: 309.2+ KB


In [7]:
# Checking for null values
teste.isnull().sum().sort_values(ascending=False)

PassengerId     0
HomePlanet      0
CryoSleep       0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
dtype: int64

### Keeping only the non-text columns

In [8]:
# Getting the non-text columns of the training set
col_treino_nr = treino.columns[treino.dtypes != 'object']
col_treino_nr

Index(['CryoSleep', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall',
       'Spa', 'VRDeck', 'Transported'],
      dtype='object')

In [9]:
# Selecting only the non-text (numeric and boolean) columns of the training set
treino_nr = treino.loc[:,col_treino_nr]

In [10]:
# And for the test set
col_teste_nr = teste.columns[teste.dtypes != 'object']
col_teste_nr

Index(['CryoSleep', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall',
       'Spa', 'VRDeck'],
      dtype='object')

In [11]:
# And the non-text columns of the test set
teste_nr = teste.loc[:,col_teste_nr]
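The same kind of selection can also be sketched with `DataFrame.select_dtypes`, which picks columns by dtype directly. The small frame below is made up for illustration and only mimics a few of the columns shown above; it includes numeric and boolean columns explicitly rather than excluding `object`, which is an assumption about what "non-text" should mean here.

```python
import pandas as pd

# Toy frame mimicking a few columns of the training set (values are illustrative)
df = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01"],
    "HomePlanet": ["Europa", "Earth"],
    "CryoSleep": [False, False],
    "Age": [39.0, 24.0],
    "Transported": [False, True],
})

# Keep only numeric and boolean columns; text columns are dropped
df_nr = df.select_dtypes(include=["number", "bool"])

print(list(df_nr.columns))
```

The original column order is preserved, so the result can be used the same way as `treino_nr` and `teste_nr`.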

### Model to classify the data
- Candidates to test:
    - **Decision tree classifier**
        - https://scikit-learn.org/stable/modules/tree.html#classification
    - **K-nearest neighbors classifier**
        - https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
    - **Logistic regression**
        - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

- Splitting the training set into **training and validation**
    - We will do this with **train_test_split**
        - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [12]:
# Importing train_test_split
from sklearn.model_selection import train_test_split

In [13]:
# Splitting the training set into X and y
X = treino_nr.drop(['Transported'], axis=1)
y = treino.Transported

In [14]:
# Splitting into training and validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=42)
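To make the effect of `test_size=0.33` concrete, here is a small sketch that splits placeholder arrays with the same number of rows as the training set (8693, per `treino.info()`). The arrays `X_demo` and `y_demo` are stand-ins invented for this example, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 8693  # rows in the training set, per treino.info() above
X_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder features
y_demo = np.arange(n_rows) % 2             # placeholder labels

# Same split parameters as in the cell above
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

print(len(X_tr), len(X_va))
# 5824 2869
```

The validation part holds 2869 rows, which is the total of each confusion matrix in the evaluation section. Fixing `random_state` keeps the shuffle reproducible between runs.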

- For the **decision tree classifier**

In [15]:
# Importing
from sklearn import tree

In [16]:
# Creating the classifier
clf_ac = tree.DecisionTreeClassifier()

In [17]:
# Fitting on the training data
clf_ac = clf_ac.fit(X_train, y_train)

In [18]:
# Predicting on the validation set
y_pred_ac = clf_ac.predict(X_val)

- For the **KNeighborsClassifier**

In [19]:
# Importing
from sklearn.neighbors import KNeighborsClassifier

In [20]:
# Creating the classifier
clf_knn = KNeighborsClassifier(n_neighbors=3)

In [21]:
# Fitting on the training data
clf_knn = clf_knn.fit(X_train, y_train)

In [22]:
# Predicting on the validation set
y_pred_knn = clf_knn.predict(X_val)

- For **logistic regression**

In [23]:
# Importing
from sklearn.linear_model import LogisticRegression

In [24]:
# Creating the classifier
clf_rl = LogisticRegression()

In [25]:
# Fitting on the training data
clf_rl = clf_rl.fit(X_train, y_train)

In [26]:
# Predicting on the validation set
y_pred_rl = clf_rl.predict(X_val)

### Evaluating the models
- Metrics used for this analysis:
    - Accuracy (the evaluation metric used in the competition):
        - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
    - Confusion matrix (helps visualize how the errors are distributed):
        - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

- Evaluating the **accuracy**

In [27]:
# Importing
from sklearn.metrics import accuracy_score

In [28]:
# For the tree
accuracy_score(y_val, y_pred_ac)

0.7305681422098292

In [29]:
# For KNN
accuracy_score(y_val, y_pred_knn)

0.7577553154409202

In [30]:
# For logistic regression
accuracy_score(y_val, y_pred_rl)

0.7713489020564657
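The fit, predict, and score steps repeated for each model can be sketched as a single loop. The data below is synthetic (from `make_classification`), so the printed scores will not match the ones above. Setting `random_state` on the tree and `max_iter=1000` on the logistic regression are assumptions added here for reproducibility and convergence; the notebook itself uses the defaults.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (not the competition data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Same three model families as above; random_state and max_iter are assumptions
models = {
    "tree": DecisionTreeClassifier(random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "logreg": LogisticRegression(max_iter=1000),
}

# Fit each model on the training part and score it on the validation part
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_va, model.predict(X_va))

# Print from best to worst validation accuracy
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```

Keeping the models in a dictionary makes it easy to add another candidate without repeating the cells.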

- Evaluating the **confusion matrix**

In [44]:
# Importing
from sklearn.metrics import confusion_matrix

In [45]:
# For the tree
confusion_matrix(y_val, y_pred_ac)

array([[ 919,  505],
       [ 268, 1177]], dtype=int64)

In [46]:
# For KNN
confusion_matrix(y_val, y_pred_knn)

array([[1009,  415],
       [ 280, 1165]], dtype=int64)

In [47]:
# For logistic regression
confusion_matrix(y_val, y_pred_rl)

array([[1051,  373],
       [ 283, 1162]], dtype=int64)
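As a cross-check, the accuracy of each model can be recomputed from its confusion matrix. The sketch below copies the three matrices from the outputs above and relies on scikit-learn's layout: rows are true labels, columns are predictions, and labels are sorted, so for `False`/`True` targets the matrix reads `[[TN, FP], [FN, TP]]`. The helper name `accuracy_from_matrix` is made up for this example.

```python
# Confusion matrices copied from the outputs above,
# laid out as [[TN, FP], [FN, TP]]
matrices = {
    "tree": [[919, 505], [268, 1177]],
    "knn": [[1009, 415], [280, 1165]],
    "logreg": [[1051, 373], [283, 1162]],
}

def accuracy_from_matrix(cm):
    """Correct predictions (diagonal) divided by all predictions."""
    (tn, fp), (fn, tp) = cm
    return (tn + tp) / (tn + fp + fn + tp)

accuracies = {name: accuracy_from_matrix(cm) for name, cm in matrices.items()}
for name, acc in accuracies.items():
    print(f"{name}: {acc:.4f}")
# tree: 0.7306
# knn: 0.7578
# logreg: 0.7713
```

The values agree with the `accuracy_score` outputs. The matrices also show that all three models produce more false positives than false negatives (505 vs 268, 415 vs 280, 373 vs 283).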

### Predicting on the test data
- We will use the model with the best validation accuracy (logistic regression, 0.771) to predict on the test set

In [50]:
# Previewing X_train
X_train.head(3)

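| | CryoSleep | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|---|---|
| 4696 | False | 35.0 | False | 1337.0 | 49.0 | 57.0 | 0.0 | 0.0 |
| 5946 | False | 28.0 | False | 0.0 | 152.0 | 215.0 | 30.0 | 510.0 |
| 227 | True | 43.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |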


In [51]:
# Previewing the test set (non-text columns only)
X_teste = teste_nr
X_teste.head(3)

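| | CryoSleep | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|---|---|
| 0 | True | 27.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | False | 19.0 | False | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 |
| 2 | True | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |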
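The two previews above serve to confirm that the test features line up with the features the model was trained on. That comparison can also be sketched in code. The frames below are toy stand-ins for `X_train` and `X_teste`, with the test columns deliberately out of order to show the realignment.

```python
import pandas as pd

# Toy stand-ins for X_train and X_teste (values are illustrative)
X_train_demo = pd.DataFrame(
    {"CryoSleep": [False, True], "Age": [35.0, 43.0], "Spa": [0.0, 0.0]}
)
X_teste_demo = pd.DataFrame(
    {"Age": [27.0, 19.0], "CryoSleep": [True, False], "Spa": [0.0, 2823.0]}
)

# Same set of columns, regardless of order?
same_set = set(X_train_demo.columns) == set(X_teste_demo.columns)

# Reorder the test columns to match the training order before predicting
X_teste_aligned = X_teste_demo[list(X_train_demo.columns)]
same_order = list(X_teste_aligned.columns) == list(X_train_demo.columns)

print(same_set, same_order)
```

In the notebook the columns already match, so this check would simply pass.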


In [38]:
# Using logistic regression on the test set
y_pred = clf_rl.predict(X_teste)

In [39]:
# Creating a new column in the test set with the predictions
teste['Transported'] = y_pred

In [40]:
# Selecting only the PassengerId and Transported columns for the submission
resultado1 = teste[['PassengerId', 'Transported']]

In [41]:
# Exporting to a CSV
resultado1.to_csv('resultado1.csv', index=False)
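Before uploading, it can help to read the exported file back and check its shape. The sketch below does this in memory with a toy submission, assuming the expected format is one row per test passenger with `PassengerId` and a boolean `Transported` column. The frame `submission` and the row count are illustrative; the real file should have 4277 rows, per `teste.info()`.

```python
import io

import pandas as pd

# Toy submission standing in for resultado1 (values are illustrative)
submission = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01", "0019_01"],
    "Transported": [True, False, True],
})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

expected_rows = 3  # would be 4277 for the real test set
print(list(check.columns))                # column names, without an index column
print(len(check) == expected_rows)        # one row per passenger
print(check["PassengerId"].is_unique)     # no duplicated ids
print(check["Transported"].dtype)         # boolean predictions
```

Passing `index=False`, as the notebook does, is what keeps an extra unnamed index column out of the file.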

- **Result**
    - The model got 78% of the predictions right
<img src="pkgImages/tentativa1.png" width=900>