In V8:

**EDA:**
More in-depth EDA and visualization

**Outliers:**

1-removing rows with 5 or more outliers

**Missing values:**

1-Using IterativeImputer to impute missing values

2-Using Target Encoding for categorical features


**Scaling:**

1-PowerTransformer

**Models:**

1-A voting classifier with LGBM,XGBC,GBC

In [None]:
import numpy as np 
import pandas as pd 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# 1) Load The Data

## 1-1) Import Libraries

In [None]:
# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import seaborn as sns
import missingno as mno

## 1-2) Load Dataset

In [None]:
train=pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")
test=pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")
sample=pd.read_csv("/kaggle/input/spaceship-titanic/sample_submission.csv")

In [None]:
train_set=train.copy()
test_set=test.copy()

# 2) Summarize the Dataset

## 2-1) Dimensions of the Dataset

In [None]:
train_set.shape

In [None]:
test_set.shape

## 2-2) Peek at the Data

In [None]:
train_set.head(10) 

In [None]:
train_set.info()

## 2-3) Statistical Summary:

In [None]:
train_set.describe()

We can see that:
* the numeric features are on different scales
* there are some missing values

In [None]:
train_set.Transported.value_counts()

We can see that the two target classes are almost evenly distributed.

# 3) Data Visualization

We are going to look at two types of plots:
*  Univariate plots to better understand each attribute.
*  Multivariate plots to better understand the relationships between attributes.

## 3-1) Univariate Plots

### 3-1-1) Univariate Plots for Numerical Features

In [None]:
#this will create a box plot for numeric values
train_set.plot(kind='box',subplots=True,layout=(2,3),sharex=False,sharey=False,figsize=(15,15))
plt.show()

In [None]:
#this will create a histogram for numeric values
train_set.hist(figsize=(10,10))

Only 'Age' has a roughly Gaussian distribution; the rest are highly skewed!
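To put a number on "highly skewed", here is a minimal sketch on synthetic data (not the competition data). The `toy` frame and its column names are made up for illustration; `DataFrame.skew()` returns roughly 0 for a symmetric column and a clearly positive value for a right-skewed one.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins (assumption): one symmetric column, one right-skewed column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=30, scale=10, size=5000),
    "right_skewed": rng.exponential(scale=300, size=5000),
})

# Sample skewness per column: near 0 means symmetric, large positive means a long right tail
skewness = toy.skew()
print(skewness)
```

On the real data, `train_set.select_dtypes('float').skew()` would give the same kind of summary.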

### 3-1-2) Univariate Plot for Categorical Features

In [None]:
sns.set(rc={'figure.figsize':(10,10)})
fig, axes = plt.subplots(2, 2)
names=['HomePlanet','CryoSleep','Destination','VIP']

for name, ax in zip(names, axes.flatten()):
    sns.countplot(x=name,data=train_set,ax=ax)

## 3-2) Multivariate Plots

### 3-2-1) Multivariate Plots for Numerical Data

In [None]:
#sns.heatmap(data=train_set.corr(),annot=True) 
#I won't create this heatmap because the default method for .corr() is Pearson, which is best suited to roughly Gaussian data and can be heavily influenced by outliers.
#The numerical features here are highly skewed, so this heatmap would not give reliable answers!

In [None]:
sns.pairplot(data=train_set.select_dtypes(['number','bool']),hue="Transported",palette='CMRmap')

Apparently, there is no linear correlation between the numerical columns.
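One possible alternative to a Pearson heatmap, sketched here on synthetic data only, is a rank-based (Spearman) correlation, which depends on the ordering of the values and not on their scale. The `toy` frame below is hypothetical: `y` grows exponentially with `x`, so the relation is perfectly monotonic but far from linear.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y is a monotonic but strongly non-linear function of x
x = np.arange(1, 101, dtype=float)
toy = pd.DataFrame({"x": x, "y": np.exp(x / 10)})

# Pearson measures linear association; Spearman measures monotonic association
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

The same `method="spearman"` argument could be tried on the numerical columns of `train_set`, although I have not done that in this notebook.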

Let's take a closer look at the diagonal plots...

In [None]:
fig, axes = plt.subplots(2, 3,figsize=(15,15))
names=['Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']

for name, ax in zip(names, axes.flatten()):
    sns.kdeplot(x=name,hue='Transported',data=train_set,ax=ax)

In [None]:
fig, axes = plt.subplots(2, 3,figsize=(18,15))
names=['Age','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']

for name, ax in zip(names, axes.flatten()):
    sns.stripplot(y=name,x='Transported',data=train_set,ax=ax)

From the last two plots, we can see:

1-**Age:** 
* Children (up to 10-12) have a higher chance of being Transported
* Adults (20-40) have a higher chance of **not** being Transported
* Rest have almost an equal chance of being Transported

2-**RoomService:**
* People who spent no money on RoomService have a higher chance of being Transported. As the expenditure goes higher, fewer people are Transported, and at the extreme expenditures there are no Transported people at all.

3-**FoodCourt:**
* People who spent nothing on FoodCourt are less likely to be Transported. As the expenditure goes higher, the chance of being Transported becomes almost equal, until about 17000; beyond that point there are only a few people, but all of them are Transported.

4-**ShoppingMall:**
* There seems to be little to no distinction between Transported and not Transported people based on ShoppingMall expenditure, so this feature doesn't seem to be very informative.

5-**Spa:**
* People who spent little to no money on Spa have a higher chance of being Transported

6-**VRDeck:**
* Same as Spa

### 3-2-2) Multivariate Plots for Categorical Data

In [None]:
fig, axes = plt.subplots(2, 2)
names=['HomePlanet','CryoSleep','Destination','VIP']

for name, ax in zip(names, axes.flatten()):
    sns.barplot(x=name,y='Transported',data=train_set,ax=ax)
    ax.set( ylabel="Transportation Probability")

I'll explore the relevance of each categorical feature to the target later, using a chi2 test, but what we can tell so far is:

1- People in the VIP section have a lower chance of being Transported

2- People in CryoSleep have a much higher chance of being Transported

I think this is because VIP people were awake and probably scattered across different parts of the spaceship (shopping, eating, or whatever), so they were less safe in case of a collision, whereas people in cryosleep were probably in a strong container, which would keep them safe.

3- For Destination and HomePlanet, the plot shows how the Transportation Probability changes across their values.

# 4) Handling Missing Values and Outliers

## 4-1) Outliers

Using the function below, I'll detect rows that contain at least n outliers, and I'll drop the rows with 5 or more outliers.

In [None]:
from collections import Counter
def outlier_detect(df,n,cols):
    rows,to_drop=[],[]
    for col in cols:
        Q1=np.nanpercentile(df[col],25)
        Q3=np.nanpercentile(df[col],75)
        IQR=Q3-Q1
        outlier_point=1.5*IQR
        rows.extend(df[(df[col]<Q1-outlier_point)|(df[col]>Q3+outlier_point)].index)
    for r,c in Counter(rows).items():
        if c>=n: to_drop.append(r)
    return to_drop

In [None]:
to_drop=outlier_detect(train_set,5,train_set.select_dtypes('float').columns)

In [None]:
train_set.drop(to_drop,inplace=True,axis=0)
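To make the IQR rule more concrete, here is a small sketch on a toy frame. It uses the same idea as `outlier_detect` but is rewritten for illustration; the helper name `rows_with_many_outliers` and the `toy` data are made up. A value counts as an outlier when it lies more than 1.5 × IQR below Q1 or above Q3, and a row is flagged when it has at least `n` such values.

```python
from collections import Counter

import numpy as np
import pandas as pd

def rows_with_many_outliers(df, n, cols):
    """Return index labels of rows that are IQR outliers in at least n of the given columns."""
    counts = Counter()
    for col in cols:
        q1, q3 = np.nanpercentile(df[col], [25, 75])
        fence = 1.5 * (q3 - q1)
        mask = (df[col] < q1 - fence) | (df[col] > q3 + fence)
        counts.update(df.index[mask])
    return [idx for idx, c in counts.items() if c >= n]

# Toy data (assumption): row 9 is extreme in "a" and "b", row 8 only in "c"
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000],
    "b": [1, 2, 3, 4, 5, 6, 7, 8, 9, 500],
    "c": [1, 2, 3, 4, 5, 6, 7, 8, 900, 10],
}, dtype=float)

flagged = rows_with_many_outliers(toy, n=2, cols=toy.columns)
print(flagged)
```

With `n=2` only the row that is extreme in two columns is flagged; the row that is extreme in a single column is kept.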

## 4-2) Missing and Duplicates

In [None]:
#there are no rows with all null values
train_set.isna().all(axis=1).unique()

In [None]:
#there are no duplicates
train_set.duplicated().any()

In [None]:
missing_prcnt=[train_set[col].isna().sum()/train_set.shape[0] *100 for col in train_set.columns]

In [None]:
miss_tbl=pd.DataFrame(missing_prcnt,columns=['%missing'],index=train_set.columns)

In [None]:
miss_tbl

Let's see if there's a pattern to missing values

In [None]:
mno.matrix(train_set, figsize = (20, 6))

They seem pretty random...
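As a rough numeric complement to the matrix plot, one could correlate the missingness indicators of the columns: values near 0 suggest that the gaps are unrelated, while values near 1 mean two columns tend to be missing together. This is only a sketch on a made-up `toy` frame, not a result from the competition data.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): "a" and "c" are always missing together, "b" is not
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [np.nan, 2.0, 3.0, 4.0, np.nan, 6.0],
    "c": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
})

# Percentage of missing values per column
missing_pct = toy.isna().mean() * 100

# Correlation between the 0/1 "is missing" indicators of each pair of columns
missing_corr = toy.isna().astype(int).corr()
print(missing_pct)
print(missing_corr)
```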

We'll get back to the missing values and impute them.

# 5) Exploring Features

## 5-1) Name

I noticed that some people share the same last name, so I'm going to extract the last names and see if I can derive a feature named "Family" from them!

In [None]:
train_set[['Name','Last']] = train_set.Name.str.split(" ", expand=True)

In [None]:
train_set.drop('Name',inplace=True,axis=1)

In [None]:
fam_size=train_set.Last.value_counts()

In [None]:
fam_size

In [None]:
train_set['Last']=train_set['Last'].map(fam_size)
train_set['Last']=train_set['Last'].astype('object')

In [None]:
g=sns.barplot(x='Last',y='Transported',data=train_set)
g.set( ylabel="Transportation Probability")
#Generally, it looks like smaller families had a higher chance of being Transported.

In [None]:
# for the sake of better naming
train_set=train_set.rename(columns={'Last':'Fsize'})
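Here is the family-size idea on a tiny example, as a sketch. The names in `toy` are invented; the point is that `value_counts()` ignores missing names, so passengers without a name end up with a missing `Fsize` that the imputer has to fill later.

```python
import pandas as pd

# Made-up passengers (assumption); the last one has no name
toy = pd.DataFrame({"Name": ["Maham Ofracculy", "Juanna Vines",
                             "Altark Susent", "Solam Susent", None]})

# Second token of the name is treated as the last name
last = toy["Name"].str.split(" ", expand=True)[1]

# Family size = number of passengers sharing that last name
toy["Fsize"] = last.map(last.value_counts())
print(toy)
```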

## 5-2) Cabin

In [None]:
#This shows how many people were in each cabin
cabin_cap= train_set.Cabin.value_counts()

I'm going to replace each cabin name with the number of people in that cabin.

In [None]:
train_set['Cabin']=train_set['Cabin'].map(cabin_cap)
train_set['Cabin']=train_set['Cabin'].astype('object')

In [None]:
train_set.Cabin.unique()

In [None]:
g=sns.barplot(x='Cabin',y='Transported',data=train_set)
g.set( ylabel="Transportation Probability")
#it seems that people in cabins with moderate capacity had a higher chance of being Transported

## 5-3) Chi2 test

**Let's do a chi2 test to see if there are any features with a pvalue higher than 0.05**

In [None]:
from scipy.stats import chi2_contingency
def chi2_calc(df,target):
    scores=[]
    for col in df.columns:
        ct=pd.crosstab(df[col],target)
        stat,p,dof,expected=chi2_contingency(ct)
        scores.append(p)
    return pd.DataFrame(scores, index=df.columns, columns=['P value']).sort_values(by='P value')

In [None]:
chi2_calc(train_set.select_dtypes(['object']),train_set.Transported)

**So all the categorical features are significantly associated with the target (p-value < 0.05), except PassengerId, which is fine because it's an identifier and not really a feature.**
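A small sketch of why an identifier-like column fails the test while an informative one passes. The `toy` frame is synthetic and the helper `chi2_p_value` is a name I made up; it wraps the same `pd.crosstab` and `chi2_contingency` calls used above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_p_value(feature, target):
    """p-value of the chi2 independence test between two categorical series."""
    table = pd.crosstab(feature, target)
    _, p, _, _ = chi2_contingency(table)
    return p

# Synthetic data (assumption): CryoSleep is strongly related to the target, Id is unique per row
toy = pd.DataFrame({
    "Id": [f"p{i}" for i in range(100)],
    "CryoSleep": [True] * 50 + [False] * 50,
    "Transported": [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40,
})

p_cryo = chi2_p_value(toy["CryoSleep"], toy["Transported"])
p_id = chi2_p_value(toy["Id"], toy["Transported"])
print(p_cryo, p_id)
```

Because every `Id` value appears exactly once, the contingency table has as many rows as passengers and the test finds no evidence of association.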

In [None]:
train_set.drop('PassengerId',inplace=True,axis=1)

In [None]:
y=train_set.pop('Transported')

In [None]:
y=y.map({False:0,True:1})

In [None]:
y

# 6) PipeLine Implementation
Here are the steps I'm going to take:

* 1- Use two pipelines, one for categorical data and one for numerical data. Two things happen in each pipeline:

1-1 Numerical pipeline: scaling and imputing

1-2 Categorical pipeline: encoding and imputing

* 2- Use a ColumnTransformer to apply each pipeline to the columns of its data type

* 3- Try a number of models to see which ones work better
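The steps above can be sketched on a toy frame as follows. To keep the sketch dependent only on scikit-learn, I swapped in `SimpleImputer`, `StandardScaler` and `OrdinalEncoder`, which are not the transformers used in this notebook; the `toy` data and the step names are also assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy data (assumption) with one missing value per column
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "HomePlanet": pd.Series(["Earth", "Mars", np.nan, "Earth"], dtype=object),
})

# One pipeline per data type (stand-in transformers, not the notebook's)
numeric_steps = Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())])
category_steps = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                           ("encode", OrdinalEncoder())])

# The ColumnTransformer routes each column list to its own pipeline
toy_ct = ColumnTransformer([("num", numeric_steps, ["Age"]),
                            ("cat", category_steps, ["HomePlanet"])])
out = toy_ct.fit_transform(toy)
print(out)
```

The output columns follow the order of the transformers: numeric columns first, then categorical ones.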

In [None]:
train_set.info()

In [None]:
num=list(train_set.select_dtypes('float').columns)

In [None]:
cat=list(train_set.select_dtypes(['object']).columns)

**Note: OrdinalEncoder (without an explicit category order) works like LabelEncoder, but it can be applied to multiple columns at once,
whereas LabelEncoder transforms a single column. Both encode the categories in ascending (sorted) order starting from 0, meaning A will be 0 and B will be 1**
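A quick check of this behaviour on a made-up array: `OrdinalEncoder` handles both columns at once, `LabelEncoder` handles a single column, and in both cases the sorted categories receive codes starting at 0.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Two categorical columns (made-up values)
X = np.array([["B", "x"], ["A", "y"], ["C", "x"]], dtype=object)

# OrdinalEncoder: every column is encoded independently, categories sorted ascending
ordinal_codes = OrdinalEncoder().fit_transform(X)

# LabelEncoder: works on one 1-D column only
label_codes = LabelEncoder().fit_transform(X[:, 0])

print(ordinal_codes)
print(label_codes)
```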

In [None]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler,StandardScaler,MinMaxScaler
from sklearn.preprocessing import OrdinalEncoder,OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer,KNNImputer,SimpleImputer
from category_encoders import MEstimateEncoder,PolynomialEncoder,BackwardDifferenceEncoder,LeaveOneOutEncoder,QuantileEncoder
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import PowerTransformer

  
random_state=0

#setting up the imputer for numerical features
NImputer=IterativeImputer(random_state=random_state,tol=1e-5,max_iter=20)

#setting up the imputer for categorical features (we can experiment with this imputer too; I'll leave it here just in case)
CImputer=IterativeImputer(estimator=LinearDiscriminantAnalysis(),random_state=random_state,tol=1e-5,max_iter=20)

#target encoder: it's usually used for high cardinality features. 
t_encoder = MEstimateEncoder(m=10, random_state=random_state,handle_missing='return_nan')

#PolynomialEncoder
p_encoder=PolynomialEncoder()

#BackwardDifferenceCoding
b_encoder=BackwardDifferenceEncoder()

#QuantileEncoder
q_encoder=QuantileEncoder(m=10)

#wrapping a log function in a FunctionTransformer, which gives it fit and transform methods, because numerical columns are mostly skewed
def log_transform(x):
    return np.log1p(x)  # same as log(x + 1)
log_pip=FunctionTransformer(log_transform)

#Using the PowerTransformer class to make the distribution of the numerical columns more Gaussian-like
pt=PowerTransformer()

#creating two preprocessing pipelines for categorical and numerical data types
numeric_pip=Pipeline(steps=[('PowerTransformer',pt),('NImputer',NImputer)])
category_pip=Pipeline(steps=[('MEstimateEncoder',t_encoder),('NImputer',NImputer)])


#creating a column transformer to implement the transformation
ct=ColumnTransformer(transformers=[('num',numeric_pip,num),('cat',category_pip,cat)])
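For intuition about the target encoder, here is a hand-rolled sketch of the m-estimate idea: each category is replaced by its target mean, pulled toward the global mean with strength `m`. This illustrates the formula only and is not a re-implementation of `MEstimateEncoder`; the function name `m_estimate_encode` and the toy series are assumptions.

```python
import pandas as pd

def m_estimate_encode(categories, target, m=10.0):
    """Replace each category by (positives + prior * m) / (count + m)."""
    prior = target.mean()  # global target mean
    stats = target.groupby(categories).agg(["sum", "count"])
    smoothed = (stats["sum"] + prior * m) / (stats["count"] + m)
    return categories.map(smoothed)

# Toy data (assumption): 3 passengers from Earth (1 positive), 1 from Mars (positive)
toy_cat = pd.Series(["Earth", "Earth", "Earth", "Mars"])
toy_y = pd.Series([1, 0, 0, 1])

encoded = m_estimate_encode(toy_cat, toy_y, m=2.0)
print(encoded)
```

With a prior of 0.5 and `m=2`, Earth becomes (1 + 1) / (3 + 2) = 0.4 and Mars becomes (1 + 1) / (1 + 2) ≈ 0.67, so the rare category is shrunk toward the prior more strongly than a plain mean would be.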

In [None]:
#This is for testing to see if the columntransformer works properly
tst = Pipeline([
('coltrns',ct), #COLUMN TRANSFORMER
])
xx=tst.fit_transform(train_set,y)

In [None]:
xx=pd.DataFrame(xx)
xx.head()

## 6-1) Models and Scoring

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier,GradientBoostingClassifier,AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
classifiers = []
classifiers.append(SVC(probability=True,random_state=random_state)) 
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state,learning_rate=0.1))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state)) 
classifiers.append(KNeighborsClassifier())
classifiers.append(LogisticRegression(random_state = random_state,C=0.5,solver='liblinear')) 
classifiers.append(LinearDiscriminantAnalysis())
classifiers.append(MLPClassifier(random_state=random_state, max_iter=500,tol=0.01))
classifiers.append(GaussianNB())
classifiers.append(XGBClassifier(random_state=random_state))
classifiers.append(LGBMClassifier(random_state=random_state))

# CREATING A FOR LOOP FOR SCORING EACH MODEL
cv_results = []
cv = KFold(n_splits=2,shuffle=True,random_state=random_state)
for classifier in classifiers :
    classif = Pipeline([
('coltrns',ct), #COLUMN TRANSFORMER
('classifier', classifier)])
    cvs=cross_val_score(classif, train_set, y, scoring = "accuracy", cv = cv, n_jobs=-1)
    cv_results.append(cvs)

cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

#CREATING A DATAFRAME OF MODEL SCORES
#The names are listed in the same order in which the classifiers were appended above
cv_res = pd.DataFrame({"CrossValMeans":cv_means,"CrossValSDs": cv_std,"Algorithm":["SVC","DecisionTree","AdaBoostClassifier",
                                                                                 "RandomForestClassifier",
                                                                                 "ExtraTreesClassifier",
                                                                                 "GradientBoostingClassifier",
                                                                                 "KNeighborsClassifier",
                                                                                 "LogisticRegression",
                                                                                 "LinearDiscriminantAnalysis",
                                                                                 "MLPClassifier",
                                                                                 "GaussianNB",
                                                                                 "XGBClassifier",
                                                                                 "LGBMClassifier"]})
#PLOTTING
#Sorting once so that each error bar belongs to the bar it is drawn on
cv_sorted = cv_res.sort_values(by='CrossValMeans')
g = sns.barplot(x="CrossValMeans",y="Algorithm",data = cv_sorted, palette="twilight_shifted_r",**{'xerr':cv_sorted['CrossValSDs'].values})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")
g=sns.set(rc={'figure.figsize':(5,5)})

cv_res.sort_values(by='CrossValMeans')
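A hand-written list of names can silently drift out of sync with the list of classifiers. One way to avoid that, sketched here on synthetic data with only two stand-in models, is to derive each name from the estimator itself with `type(clf).__name__`. The variables `X_toy`, `y_toy`, `toy_classifiers` and `toy_res` are made up for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification problem (assumption), not the competition data
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
toy_classifiers = [LogisticRegression(max_iter=1000),
                   DecisionTreeClassifier(random_state=0)]
toy_cv = KFold(n_splits=2, shuffle=True, random_state=0)

rows = []
for clf in toy_classifiers:
    scores = cross_val_score(clf, X_toy, y_toy, scoring="accuracy", cv=toy_cv)
    # The name comes from the estimator, so it cannot be mislabeled
    rows.append({"Algorithm": type(clf).__name__,
                 "CrossValMeans": scores.mean(),
                 "CrossValSDs": scores.std()})

toy_res = pd.DataFrame(rows).sort_values(by="CrossValMeans")
print(toy_res)
```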

# 7) Learning Curve
Here we're going to take a look at the learning curves, using the function below, before tuning the models.

In [None]:
from sklearn.model_selection import learning_curve
def plot_learning_curve(my_model, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """Generate a simple plot of the test and training learning curve"""
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        my_model, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes,scoring="accuracy")
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)


    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    plt.grid()
    return plt

In [None]:
for i in [3,5,11,12]: #Indices of the four highest-scoring classifiers
    model = Pipeline([
        ('coltrns',ct), #COLUMN TRANSFORMER
        ('classifier', classifiers[i])])
    plot_learning_curve(model,type(classifiers[i]).__name__,train_set,y)

The less the training curve changes while staying well above the cross-validation curve, the more the model is overfit: it learns the training data extremely well, but it cannot generalize to the validation set and performs poorly there.
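The gap between the two curves can also be read off numerically. As a sketch on synthetic, noisy data (not the competition data), an unpruned decision tree memorizes the training folds, so its training score stays at 1.0 while the validation score stays lower. The names `X_toy`, `y_toy` and `gap` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (assumption)
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   flip_y=0.2, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X_toy, y_toy, cv=3,
    train_sizes=np.linspace(0.2, 1.0, 4), scoring="accuracy")

# Mean training score minus mean validation score at each training size
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
print(sizes, gap)
```

A gap that does not shrink as the training size grows points to overfitting.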

# 8) Model Tuning using GridSearchCV

Because the data is highly skewed, and because deleting all outliers would remove a big portion of the dataset, I'm going to use tree-based models as the final models for tuning, since they are less sensitive to skew and outliers. They also have the highest scores in the chart, and they are:

* Random Forest Classifier
* XGBClassifier
* LGBMClassifier
* Gradient Boosting Classifier
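When a model sits inside a `Pipeline`, the grid keys take the form `<step name>__<parameter>`, which is why the grids in this section use the `classifier__` prefix. Here is a minimal sketch on synthetic data; the scaler step, the tiny grid and the `toy_*` names are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption), small enough for a quick search
X_toy, y_toy = make_classification(n_samples=120, n_features=5, random_state=0)

toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", GradientBoostingClassifier(n_estimators=10, random_state=0)),
])

# Keys are "<step name>__<parameter name>"
toy_grid = {"classifier__max_depth": [1, 2],
            "classifier__subsample": [0.5, 0.8]}

toy_search = GridSearchCV(toy_pipe, toy_grid, cv=2, scoring="accuracy")
toy_search.fit(X_toy, y_toy)

# 2 x 2 values give 4 candidate combinations
n_candidates = len(toy_search.cv_results_["params"])
print(n_candidates, toy_search.best_params_)
```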


In [None]:
from sklearn.model_selection import GridSearchCV

## 8-1) Gradient Boost

[Reference](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/) for tuning gradient boost

In [None]:
#GB
gb_model = Pipeline([
('coltrns',ct), #COLUMN TRANSFORMER
('classifier', GradientBoostingClassifier(random_state=random_state))])

#there are two types of parameters to be tuned here: tree-based and boosting parameters.
#In general, lower the learning rate and increase the number of estimators proportionally to get more robust models.
gb_grid =  {
    
    #Generally the default value of 0.1 works but somewhere between 0.05 to 0.2 should work for different problems
    "classifier__learning_rate":[0.1],
    
    #This should range around 40-70. Remember to choose a value on which your system can work fairly fast.
    #This is because it will be used for testing various scenarios and determining the tree parameters.
    "classifier__n_estimators":[60,70], 
    
    #This should be ~0.5-1% of total values
    "classifier__min_samples_split":[40,50],
    
    #Can be selected based on intuition. This is just used for preventing overfitting
    "classifier__min_samples_leaf" : [120,150],
    
    #Should be chosen (5-8) based on the number of observations and predictors.
    "classifier__max_depth" :[8,10],
    
    #.8 is a commonly used start value
    "classifier__subsample":[.8,.5],
   
    "classifier__max_features":['log2']

    }

gsgb = GridSearchCV(gb_model,gb_grid , cv=cv, scoring="accuracy", n_jobs= -1, verbose = 1)

gsgb.fit(train_set,y)
# gb_model.fit(train_set,y)

gb_best = gsgb.best_estimator_

# Best score
display(gsgb.best_score_)
display(gb_best)

## 8-2) LGBM

In [None]:
lgbm_model = Pipeline([
('coltrns',ct), #COLUMN TRANSFORMER
('classifier', LGBMClassifier(random_state=random_state))])

lgbm_grid =  {
    'classifier__num_leaves': [31, 127],
    'classifier__reg_alpha': [0.1, 0.5],
    'classifier__min_data_in_leaf': [30, 50, 100, 300, 400],
    'classifier__reg_lambda': [0,.5, 1]
    }

gslgbm = GridSearchCV(lgbm_model,lgbm_grid , cv=cv, scoring="accuracy", n_jobs= -1, verbose = 1)

gslgbm.fit(train_set,y)

lgbm_best = gslgbm.best_estimator_

# Best score
display(gslgbm.best_score_)

display(lgbm_best)

## 8-3) XGBClassifier

In [None]:
xgbc_model = Pipeline([
('coltrns',ct), #COLUMN TRANSFORMER
('classifier', XGBClassifier(random_state=random_state))])

xgbc_grid =  {
              
              'classifier__learning_rate': [0.03,0.01], 
              'classifier__max_depth': [8,10],
              'classifier__min_child_weight': [20,50],
              'classifier__subsample': [.5,.8],
              'classifier__colsample_bytree': [.5,.8],
              'classifier__n_estimators': [100] 
              }
gsxgbc = GridSearchCV(xgbc_model,xgbc_grid , cv=cv, scoring="accuracy", n_jobs= -1, verbose = 1)

gsxgbc.fit(train_set,y)

xgbc_best = gsxgbc.best_estimator_

# Best score
display(gsxgbc.best_score_)

display(xgbc_best)

# 9) Transforming and Predicting Test Set

In [None]:
#Creating Fsize Feature For Test Set
test_set[['Name','Last']] = test_set.Name.str.split(" ", expand=True)
test_set.drop('Name',inplace=True,axis=1)
fam_size=test_set.Last.value_counts()
test_set['Last']=test_set['Last'].map(fam_size)
test_set['Last']=test_set['Last'].astype('object')
test_set=test_set.rename(columns={'Last':'Fsize'})

#Replacing Cabin names with their capacity using the number of people in it
cabin_cap= test_set.Cabin.value_counts()
test_set['Cabin']=test_set['Cabin'].map(cabin_cap)
test_set['Cabin']=test_set['Cabin'].astype('object')

#dropping PassengerId
test_set.drop('PassengerId',inplace=True,axis=1)

In [None]:
test_xgbc = pd.Series(xgbc_best.predict(test_set), name="xgbc")
test_lgbm = pd.Series(lgbm_best.predict(test_set), name="lgbm")
test_gb = pd.Series(gb_best.predict(test_set), name="gb")


# Concatenate all classifier results
ensemble_results = pd.concat([test_xgbc,test_lgbm,test_gb],axis=1)


g= sns.heatmap(ensemble_results.corr(),annot=True)

#The results mainly agree with each other, but in general it's better to have strong models that are not highly correlated, so that they can
#cover each other's flaws in prediction to a degree!

In [None]:
from sklearn.ensemble import VotingClassifier
votingC = VotingClassifier(estimators=[ ('XGBC', xgbc_best),('LGBM',lgbm_best),('GB',gb_best)], voting='soft', n_jobs=-1)

votingC.fit(train_set,y)
predictions = pd.DataFrame(votingC.predict(test_set)).values
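With `voting='soft'`, the ensemble averages the class probabilities of its members and predicts the class with the highest average. The sketch below checks this on synthetic data with three stand-in scikit-learn models (not the tuned pipelines from this notebook); all `toy_*` names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption), not the competition data
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

toy_vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
                ("gb", GradientBoostingClassifier(n_estimators=20, random_state=0))],
    voting="soft")
toy_vote.fit(X_toy, y_toy)

# Average the members' probabilities by hand and compare with the ensemble
member_probas = np.stack([est.predict_proba(X_toy) for est in toy_vote.estimators_])
avg_proba = member_probas.mean(axis=0)
manual_pred = toy_vote.classes_[avg_proba.argmax(axis=1)]
print(np.allclose(avg_proba, toy_vote.predict_proba(X_toy)))
```

This is also why less correlated members help: averaging only changes the outcome on the rows where the models disagree.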

In [None]:
output = pd.DataFrame({'PassengerId': test['PassengerId'], 'Transported': predictions.flatten()})

In [None]:
output.hist()

# 10) Submission

In [None]:
# output.to_csv('submission.csv', index=False)
# print("Your submission was successfully saved!")

# Thank you for reading!

We can explore the features further: for example, maybe the HomePlanet-Destination pairs have a statistically significant relation with the target. We can also experiment more with tuning and model selection. There are a lot of things we can do, and I hope this notebook gives someone some ideas for their project!

**I'd be happy to receive your feedback and suggestions on how to improve my work :)**