## Introduction

In this notebook, I took the easy route and did no feature engineering.
I tried three different ways of handling the missing data and checked the accuracy of the prediction model for each of them.
Interestingly, the simplest method (deleting the rows with missing data) had the highest accuracy of the three.
Keep in mind that each method is validated on its own version of the training data, so the first method is scored only on rows that had no missing values.

## Import libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, train_test_split
from lightgbm import LGBMClassifier
import time
import warnings
warnings.filterwarnings('ignore')
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Data Preparation

In [None]:
spaceship_train = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
spaceship_test = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
submission = pd.read_csv('/kaggle/input/spaceship-titanic/sample_submission.csv')

In [None]:
list(spaceship_train.columns)

PassengerId: A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

HomePlanet: The planet the passenger departed from, typically their planet of permanent residence.

CryoSleep: Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

Cabin: The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

Destination: The planet the passenger will be debarking to.

Age: The age of the passenger.

VIP: Whether the passenger has paid for special VIP service during the voyage.

RoomService, FoodCourt, ShoppingMall, Spa, VRDeck: Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

Name: The first and last names of the passenger.

Transported: Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
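
The `gggg_pp` and `deck/num/side` formats described above can be taken apart with pandas string methods. This is only a sketch on made-up rows; the new column names (`Group`, `NumberInGroup`, `Deck`, `CabinNum`, `Side`) are my own choices, and the rest of this notebook does not use them.

```python
import pandas as pd

# Toy rows that follow the documented formats (values are made up).
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Cabin": ["B/0/P", "A/0/S", None],
})

# PassengerId has the form gggg_pp: group number, then number within the group.
toy[["Group", "NumberInGroup"]] = toy["PassengerId"].str.split("_", expand=True)

# Cabin has the form deck/num/side; a missing cabin gives missing parts.
toy[["Deck", "CabinNum", "Side"]] = toy["Cabin"].str.split("/", expand=True)

print(toy)
```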

### 1-Train Data

In [None]:
print('Number of data: ',len(spaceship_train)*len(spaceship_train.columns))
print('Number of missing values: ',sum(spaceship_train.isna().sum()))
for col in spaceship_train.columns:
    print("\n"'--- %s ---'%col)
    print(spaceship_train[col].value_counts())
    print('Type is: ',spaceship_train[col].dtype)
    print('Unique values: ',spaceship_train[col].nunique())
    print('Null values: ',spaceship_train[col].isna().sum())

The results obtained from the observations:

- The total number of rows is 8693 and the total number of columns is 14.

- The table holds 121702 cells (8693 × 14), of which 2324 (about 1.9%) are missing.

- Every column has missing values except PassengerId and Transported (the target column).

- Of the 12 columns with missing values:
    - 4 columns (HomePlanet, CryoSleep, Destination, and VIP) are categorical.
    - 2 columns (Cabin and Name) are text.
    - The remaining 6 columns (Age, RoomService, FoodCourt, ShoppingMall, Spa, and VRDeck) are continuous.
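
The per-column loop above prints a lot of output. A more compact view of the same information could look like the sketch below. It runs on a made-up toy frame rather than `spaceship_train`, and the `summary` column names are my own.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for spaceship_train (values are made up).
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Mars", "Europa"],
    "Age": [39.0, 24.0, np.nan, np.nan],
    "Transported": [False, True, False, True],
})

# One row per column: its dtype, the missing count, and the missing percentage.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(2),
})
print(summary)
print("Total cells:", toy.size, "| missing cells:", int(toy.isna().sum().sum()))
```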

### 2-Test Data

In [None]:
print('Number of data: ',len(spaceship_test)*len(spaceship_test.columns))
print('Number of missing values: ',sum(spaceship_test.isna().sum()))
for col in spaceship_test.columns:
    print("\n"'--- %s ---'%col)
    print(spaceship_test[col].value_counts())
    print('Type is: ',spaceship_test[col].dtype)
    print('Unique values: ',spaceship_test[col].nunique())
    print('Null values: ',spaceship_test[col].isna().sum())

The results obtained from the observations:

- The total number of rows is 4277 and the total number of columns is 13.
- The table holds 55601 cells (4277 × 13), of which 1117 (about 2.0%) are missing.
- Every column except PassengerId has missing values.
- Of the 12 columns with missing values:
    - 4 columns (HomePlanet, CryoSleep, Destination, and VIP) are categorical.
    - 2 columns (Cabin and Name) are text.
    - The remaining 6 columns (Age, RoomService, FoodCourt, ShoppingMall, Spa, and VRDeck) are continuous.

### 3-Submission

In [None]:
submission.info()

## Data Preprocessing

In this section, I prepare the data in three ways:

1. Delete the rows that have at least one missing value.
2. Fill the missing values using imputation.
3. Use imputation and add a column that marks which values were filled in.
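
To make the difference between the three approaches concrete, here is a minimal sketch on a made-up four-row frame. The toy column values and the names `dropped`, `imputed`, and `flagged` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing value in each column (values are made up).
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0, 60.0],
                    "Spa": [0.0, 10.0, np.nan, 30.0]})
cols = ["Age", "Spa"]

# 1) Drop every row that has at least one missing value.
dropped = toy.dropna()

# 2) Fill missing values with the column median.
imputed = toy.copy()
imputed[cols] = SimpleImputer(strategy="median").fit_transform(toy[cols])

# 3) Same imputation, plus a flag column recording where values were missing.
flagged = imputed.copy()
for col in cols:
    flagged[col + "_was_missing"] = toy[col].isna()

print(len(dropped), "rows survive dropna")
print(flagged)
```

The first approach shrinks the data (two of four rows here), while the other two keep every row.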

### 1- First Method

In [None]:
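# .copy() gives independent frames, so the column assignments below do not act on a view
spaceship_train_dropna = spaceship_train.dropna().copy()
spaceship_test_dropna = spaceship_test.dropna().copy()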

In [None]:
categorical_col = ['HomePlanet','CryoSleep','Destination','VIP']
Text_col = ['Cabin','Name']
continous_col = ['Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']

In [None]:
for col in categorical_col:
    spaceship_train_dropna[col] = spaceship_train_dropna[col].astype(str)
    spaceship_test_dropna[col] = spaceship_test_dropna[col].astype(str)
    # Fit one encoder on train and test together so both share the same integer codes
    encoder = LabelEncoder().fit(pd.concat([spaceship_train_dropna[col], spaceship_test_dropna[col]]))
    spaceship_train_dropna[col] = encoder.transform(spaceship_train_dropna[col])
    spaceship_test_dropna[col] = encoder.transform(spaceship_test_dropna[col])
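
Integer codes from `LabelEncoder` are only comparable between train and test when a single fitted encoder is used for both. The sketch below shows what can go wrong otherwise; the test split without "Earth" is hypothetical and only there to make the effect visible.

```python
from sklearn.preprocessing import LabelEncoder

train_vals = ["Earth", "Europa", "Mars"]
test_vals = ["Europa", "Mars"]  # hypothetical test split that contains no "Earth"

# Separate encoders: "Europa" is coded 1 in train but 0 in test.
separate_train = LabelEncoder().fit_transform(train_vals)
separate_test = LabelEncoder().fit_transform(test_vals)

# One encoder fitted on train and reused for test: the codes agree.
shared = LabelEncoder().fit(train_vals)
shared_test = shared.transform(test_vals)

print(list(separate_train), list(separate_test), list(shared_test))
```

A shared encoder raises an error on categories it has not seen, so in practice it has to be fitted on every value that can occur.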

In [None]:
spaceship_train_dropna.drop(["Name" ,"Cabin","PassengerId"] , axis = 1 ,inplace = True)
spaceship_test_dropna.drop(["Name" ,"Cabin","PassengerId"] , axis = 1 ,inplace = True)

### 2- Second Method

In [None]:
spaceship_train_imputation = spaceship_train.copy()
spaceship_test_imputation = spaceship_test.copy()

In [None]:
categorical_col = ['HomePlanet','CryoSleep','Destination','VIP']
Text_col = ['Cabin','Name']
continous_col = ['Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']

In [None]:
imputer = SimpleImputer(strategy='median')
imputer.fit(spaceship_train_imputation[continous_col])
spaceship_train_imputation[continous_col] = imputer.transform(spaceship_train_imputation[continous_col])
spaceship_test_imputation[continous_col] = imputer.transform(spaceship_test_imputation[continous_col])

In [None]:
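# Assign the result back instead of calling fillna(inplace=True) on a column selection
spaceship_train_imputation["HomePlanet"] = spaceship_train_imputation["HomePlanet"].fillna('N')
spaceship_test_imputation["HomePlanet"] = spaceship_test_imputation["HomePlanet"].fillna('N')
for col in categorical_col:
    spaceship_train_imputation[col] = spaceship_train_imputation[col].astype(str)
    spaceship_test_imputation[col] = spaceship_test_imputation[col].astype(str)
    # Fit one encoder on train and test together so both share the same integer codes
    encoder = LabelEncoder().fit(pd.concat([spaceship_train_imputation[col], spaceship_test_imputation[col]]))
    spaceship_train_imputation[col] = encoder.transform(spaceship_train_imputation[col])
    spaceship_test_imputation[col] = encoder.transform(spaceship_test_imputation[col])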

In [None]:
spaceship_train_imputation.drop(["Name" ,"Cabin","PassengerId"] , axis = 1 ,inplace = True)
spaceship_test_imputation.drop(["Name" ,"Cabin","PassengerId"] , axis = 1 ,inplace = True)

### 3- Third Method

In [None]:
spaceship_train_EXimput = spaceship_train.copy()
spaceship_test_EXimput = spaceship_test.copy()

In [None]:
missing_cols = ['HomePlanet','CryoSleep','Destination','VIP','Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']
for col in missing_cols:
    spaceship_train_EXimput[col + '_was_missing'] = spaceship_train_EXimput[col].isnull()
    spaceship_test_EXimput[col + '_was_missing'] = spaceship_test_EXimput[col].isnull()

In [None]:
imputer = SimpleImputer(strategy='median')
imputer.fit(spaceship_train_EXimput[continous_col])
spaceship_train_EXimput[continous_col] = imputer.transform(spaceship_train_EXimput[continous_col])
spaceship_test_EXimput[continous_col] = imputer.transform(spaceship_test_EXimput[continous_col])
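
For the continuous columns, scikit-learn can produce the imputed values and the "was missing" flags in one step through the `add_indicator=True` option of `SimpleImputer`. This is a sketch of that alternative on a made-up array, not what the cells above do.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two toy feature columns, each with one missing value (values are made up).
X_train = np.array([[20.0, 0.0],
                    [np.nan, 10.0],
                    [40.0, np.nan],
                    [60.0, 30.0]])

# add_indicator=True appends one 0/1 column per feature that had missing values in fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
out = imputer.fit_transform(X_train)

# Columns 0-1 hold the imputed values, columns 2-3 the missing flags.
print(out)
```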

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
spaceship_train_EXimput["HomePlanet"] = spaceship_train_EXimput["HomePlanet"].fillna('N')
spaceship_test_EXimput["HomePlanet"] = spaceship_test_EXimput["HomePlanet"].fillna('N')
categorical_col_eximput = [
    'HomePlanet', 'CryoSleep', 'Destination', 'VIP',
    'HomePlanet_was_missing', 'CryoSleep_was_missing', 'Destination_was_missing', 'VIP_was_missing',
    'Age_was_missing', 'RoomService_was_missing', 'FoodCourt_was_missing',
    'ShoppingMall_was_missing', 'Spa_was_missing', 'VRDeck_was_missing',
]
for col in categorical_col_eximput:
    spaceship_train_EXimput[col] = spaceship_train_EXimput[col].astype(str)
    spaceship_test_EXimput[col] = spaceship_test_EXimput[col].astype(str)
    # Fit one encoder on train and test together so both share the same integer codes
    encoder = LabelEncoder().fit(pd.concat([spaceship_train_EXimput[col], spaceship_test_EXimput[col]]))
    spaceship_train_EXimput[col] = encoder.transform(spaceship_train_EXimput[col])
    spaceship_test_EXimput[col] = encoder.transform(spaceship_test_EXimput[col])

In [None]:
spaceship_train_EXimput.drop(["Name" ,"Cabin","PassengerId"] , axis = 1 ,inplace = True)
spaceship_test_EXimput.drop(["Name" ,"Cabin","PassengerId"] , axis = 1 ,inplace = True)

In [None]:
spaceship_train_EXimput = spaceship_train_EXimput.reindex(columns = [col for col in spaceship_train_EXimput.columns if col != 'Transported'] + ['Transported'])

In [None]:
submission

## Modeling

In this section, I run a LightGBM model on the three datasets created in the previous step and check which method gives the best accuracy.
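
The evaluation uses `StratifiedKFold`, which splits the rows so that every validation fold keeps roughly the same class balance as the full data. A small sketch on made-up labels (60 negatives, 40 positives) shows the effect; the toy arrays are assumptions, the splitter settings match the ones used in this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 60 negatives and 40 positives (made-up data).
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))  # feature values do not influence how the folds are built

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=12)

# Share of positives in each validation fold.
fold_pos_rates = [y[valid_idx].mean() for _, valid_idx in skf.split(X, y)]
print(fold_pos_rates)
```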

In [None]:
def My_lgbm(train,test):
    lgb_params = {
        'objective' : 'binary',
        'n_estimators' :100,
        'learning_rate' : 0.1,
        'verbose' : -1  # silence LightGBM logs; fit() no longer accepts verbose in LightGBM 4+
    }
    
    n_splits = 10
    lgb_predictions = 0
    lgb_scores = []
    # Every column except the last one (Transported) is used as a feature
    LGBM_FEATURES = list(train.columns)[:-1]
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=12)
    for fold, (train_idx, valid_idx) in enumerate(skf.split(train[LGBM_FEATURES], train['Transported'])):
        X_train, X_valid = train.iloc[train_idx][LGBM_FEATURES], train.iloc[valid_idx][LGBM_FEATURES]
        y_train , y_valid =  train['Transported'].iloc[train_idx] , train['Transported'].iloc[valid_idx]
    
        model = LGBMClassifier(**lgb_params)
        model.fit(X_train, y_train)
    
        preds_valid = model.predict(X_valid)
        acc = accuracy_score(y_valid,  preds_valid)
        lgb_scores.append(acc)
    
        # acc is a fraction in [0, 1]; the % format spec prints it as a percentage
        print(f"Fold={fold+1}, Accuracy score: {acc:.2%}")
        # Accumulate the share of folds that predict True for each test row
        test_preds = model.predict(test[LGBM_FEATURES]) 
        lgb_predictions += test_preds/n_splits
    

    print("")
    print("Mean Accuracy :", np.mean(lgb_scores))
    
print("\n\n\nFirst Method : Drop null Values")
My_lgbm(spaceship_train_dropna,spaceship_test_dropna)
print("\n\n\nSecond Method : Use Imputation")
My_lgbm(spaceship_train_imputation,spaceship_test_imputation)
print("\n\n\nThird Method : Use Extension to Imputation")
My_lgbm(spaceship_train_EXimput,spaceship_test_EXimput)

In [None]:
lgb_params = {
        'objective' : 'binary',
        'n_estimators' :100,
        'learning_rate' : 0.1,
        'verbose' : -1  # silence LightGBM logs; fit() no longer accepts verbose in LightGBM 4+
}
    
n_splits = 10
lgb_predictions = 0
lgb_scores = []
# Every column except the last one (Transported) is used as a feature
LGBM_FEATURES = list(spaceship_train_EXimput.columns)[:-1]
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=12)
for fold, (train_idx, valid_idx) in enumerate(skf.split(spaceship_train_EXimput[LGBM_FEATURES], spaceship_train_EXimput['Transported'])):
    
    X_train, X_valid = spaceship_train_EXimput.iloc[train_idx][LGBM_FEATURES], spaceship_train_EXimput.iloc[valid_idx][LGBM_FEATURES]
    y_train , y_valid =  spaceship_train_EXimput['Transported'].iloc[train_idx] , spaceship_train_EXimput['Transported'].iloc[valid_idx]
    
    model = LGBMClassifier(**lgb_params)
    model.fit(X_train, y_train)
    
    preds_valid = model.predict(X_valid)
    acc = accuracy_score(y_valid,  preds_valid)
    lgb_scores.append(acc)
    # Accumulate the share of folds that predict True for each test row
    test_preds = model.predict(spaceship_test_EXimput[LGBM_FEATURES]) 
    lgb_predictions += test_preds/n_splits
# Majority vote: True when more than half of the fold models predict True
lgb_predictions = lgb_predictions > 0.5
submission['Transported'] = lgb_predictions
submission.to_csv("submission.csv",index=False)
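
The final prediction above averages the hard True/False predictions of the fold models and thresholds the result at 0.5, which amounts to a majority vote. Here is a minimal sketch of that step with made-up votes from four hypothetical folds:

```python
import numpy as np

# Hypothetical hard predictions from 4 fold models for 3 test rows (True = transported).
fold_votes = np.array([
    [True,  False, True],
    [True,  False, False],
    [True,  True,  False],
    [False, False, True],
])

vote_share = fold_votes.mean(axis=0)  # fraction of folds voting True for each row
final = vote_share > 0.5              # strict majority; an exact tie resolves to False

print(vote_share, final)
```

Averaging `predict_proba` outputs instead of hard votes would be another option, since it keeps each model's confidence; that is a different technique from the one used in this notebook.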