<center><h2 style='color:red'>Titanic: Simple Models For Beginners With EDA</h2></center>

* **1- Introduction**
* **2- Data Preparation**
* **3- Data Visualization**
* **4- Preprocessing data for machine learning**
* **5- Machine Learning**
    * 5.1 Tree Based Models
    * 5.2 Classic ML Models
* **6- Submitting**
    * 6.1 Ensemble
<hr>

**Introduction**<br>
This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.<br><br>
**Goal**<br>
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

# Data Preparation

In [None]:
# Disabling warnings
import warnings
warnings.simplefilter("ignore")

In [None]:
# Import Main libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Import Visualization lib.
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

# processing
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

In [None]:
import os
print(os.listdir('../input'))

In [None]:
# Load the train, test, and sample-submission files into DataFrames
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
sub = pd.read_csv('../input/gender_submission.csv')

In [None]:
# Show first 5 rows of train data
train_df.head()

In [None]:
# data size
print("Train Data Size: ", train_df.shape)
print("Test Data Size:  ", test_df.shape)

In [None]:
# Count missing (NaN) values per column
train_df.isnull().sum()

In [None]:
test_df.isnull().sum()
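
Raw counts are easier to compare once they are turned into shares. The sketch below uses a small made-up frame (not the real Titanic data) to show how `isnull().mean()` gives the fraction of missing values per column; the name `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives a boolean frame; its column mean is the missing fraction
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

The same one-liner should work on `train_df` and `test_df` to rank columns by how incomplete they are.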

1- The "Cabin" and "Age" columns have many missing (NaN) values.
2- We do not need the "PassengerId" column.

So we need to fix that.

We fix the data using three different methods: mean imputation, forward fill, and dropping columns.

SimpleImputer is the scikit-learn class for imputing missing values.
You can find the available strategies here:
[impute univariate feature imputation](https://scikit-learn.org/stable/modules/impute.html#univariate-feature-imputation)

In [None]:
# Recent scikit-learn versions require keyword arguments here
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# 1st: fill missing ages with the column mean (no hard-coded row count)
train_df['Age'] = imputer.fit_transform(train_df[['Age']]).ravel()
# 2nd: forward-fill missing embarkation ports
train_df['Embarked'] = train_df['Embarked'].ffill()
# 3rd: drop columns we will not use
train_df.drop(['PassengerId', 'Name', 'Cabin'], axis=1, inplace=True)

test_df['Age'] = imputer.fit_transform(test_df[['Age']]).ravel()
test_df['Embarked'] = test_df['Embarked'].ffill()
test_df['Fare'] = test_df['Fare'].ffill()
test_df.drop(['PassengerId', 'Name', 'Cabin'], axis=1, inplace=True)
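
A variant worth knowing: fit the imputer on the training data only, then reuse the learned mean on the test data with `transform`. The sketch below shows this on made-up frames; `toy_train` and `toy_test` are hypothetical names, not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages; NaN marks a missing value
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform learns the training mean (30.0) and fills the gap with it
toy_train["Age"] = imputer.fit_transform(toy_train[["Age"]]).ravel()

# transform reuses the training mean instead of computing a new one
toy_test["Age"] = imputer.transform(toy_test[["Age"]]).ravel()

print(toy_train["Age"].tolist())  # [20.0, 30.0, 30.0, 40.0]
print(toy_test["Age"].tolist())   # [30.0, 50.0]
```

Whether this changes the leaderboard score is not something this sketch can say; it only keeps the test set from influencing the fill value.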

# Data Visualization

In [None]:
sns.countplot(x='Survived', hue='Sex', data=train_df)

In [None]:
sns.countplot(x='Embarked', hue='Survived', data=train_df)

In [None]:
sns.countplot(x='SibSp', hue='Survived', data=train_df)

In [None]:
sns.countplot(x='Pclass', hue='Survived', data=train_df)
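
Count plots show absolute numbers; a survival rate per group is a handy numeric companion. This is a minimal sketch on made-up rows (the frame `toy` is an assumption, not the real data) using `groupby` and `mean`.

```python
import pandas as pd

# Made-up passengers: 3 women (2 survived) and 3 men (1 survived)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate
rate = toy.groupby("Sex")["Survived"].mean()
print(rate)
```

Swapping `"Sex"` for `"Pclass"`, `"Embarked"`, or `"SibSp"` on `train_df` should give the rates behind each plot.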

In [None]:
plt.figure(figsize=(10,5))
# distplot is deprecated in recent seaborn; histplot with kde=True is the replacement
sns.histplot(train_df['Age'], bins=24, color='b', kde=True)

In [None]:
plt.figure(figsize=(12, 8))
plt.title('Titanic Correlation of Features', y=1.05, size=15)
# numeric_only=True skips the object columns (needed on pandas 2.x)
sns.heatmap(train_df.corr(numeric_only=True), linewidths=0.1, vmax=1.0, 
            square=True, linecolor='white', annot=True);

# Preprocessing data for machine learning

In [None]:
train_df.info()

As you can see, three columns have the "object" data type, so we must convert them to numbers.

In [None]:
objects_cols = train_df.select_dtypes("object").columns
objects_cols

LabelEncoder encodes labels with values between 0 and (n_classes - 1). Here we apply it to each object column to turn its categories into integer codes.

In [None]:
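le = LabelEncoder()
# apply() re-fits the encoder on every column, separately for train and test,
# so the codes line up only when both frames contain the same set of categories
train_df[objects_cols] = train_df[objects_cols].apply(le.fit_transform)
test_df[objects_cols] = test_df[objects_cols].apply(le.fit_transform)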
train_df[objects_cols].head()
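
One thing to keep in mind with LabelEncoder: codes are assigned from the sorted set of values seen during `fit`, so fitting train and test separately can give the same string different codes. The sketch below illustrates this on made-up ticket strings and shows one possible alternative, fitting on the combined values; all names here are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up ticket values; "PC 17599" appears in both sets
toy_train = pd.Series(["A/5 21171", "PC 17599", "113803"])
toy_test = pd.Series(["PC 17599", "330911"])

# Separate fits: each encoder only knows its own values
sep_train = LabelEncoder().fit_transform(toy_train)
sep_test = LabelEncoder().fit_transform(toy_test)
print(sep_train[1], sep_test[0])  # code of "PC 17599" in train vs. test

# Shared fit: one encoder sees every value, so codes agree
shared = LabelEncoder().fit(pd.concat([toy_train, toy_test]))
shared_train = shared.transform(toy_train)
shared_test = shared.transform(toy_test)
print(shared_train[1], shared_test[0])
```

For columns such as `Sex`, where both sets hold the same categories, the two approaches should give identical codes.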

In [None]:
train_df.head()

In [None]:
# model selection
from sklearn.model_selection import train_test_split, cross_val_score

# normalization
from sklearn.preprocessing import StandardScaler

# tree based models
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

# classic ml models
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# metrics
from sklearn.metrics import accuracy_score

In [None]:
# Features (X) and target (y)
X = train_df.drop(['Survived'], axis=1).values
y = train_df['Survived'].values

### Normalization

**StandardScaler: Standardize features by removing the mean and scaling to unit variance.**

In [None]:
scale = StandardScaler()
scale.fit(X)

# transform data
X = scale.transform(X)
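
To make the definition concrete, the sketch below checks on a made-up array that StandardScaler matches the formula `(x - mean) / std` computed by hand per column. The array `toy_X` is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: 3 rows, 2 columns on different scales
toy_X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaler = StandardScaler().fit(toy_X)
scaled = scaler.transform(toy_X)

# Same computation by hand; NumPy's std defaults to the population
# standard deviation (ddof=0), which is what StandardScaler uses
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

After scaling, each column has mean 0 and unit variance, so features measured in different units contribute on a comparable scale.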

# Machine Learning

In [None]:
# Split the data into 80% training and 20% held-out test data to check model accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
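
As a quick sanity check of what an 80/20 split produces, the sketch below runs `train_test_split` on made-up arrays with 891 rows (the size of the training file). The `stratify` argument is an optional addition that this notebook does not use; it keeps the class balance similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data with the same number of rows as the Titanic training set
toy_X = np.arange(891 * 2).reshape(891, 2)
toy_y = np.arange(891) % 2  # alternating 0/1 labels

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.2,
    random_state=42,   # fixed seed makes the split reproducible
    stratify=toy_y,    # optional: preserve the class ratio in both parts
)

print(X_tr.shape, X_te.shape)  # (712, 2) (179, 2)
```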

**The idea for the Model class comes from the "Heart Disease - Classifications" kernel:
[Heart Disease - Classifications](https://www.kaggle.com/elcaiseri/heart-disease-classifications)**

In [None]:
class Model:
    def __init__(self, model):
        self.model = model
        self.X, self.y = X, y
        self.X_train, self.X_test, self.y_train, self.y_test = X_train, X_test, y_train, y_test
        
        self.train()
    
    def model_name(self):
        model_name = type(self.model).__name__
        return model_name
        
    def cross_validation(self, cv=5):
        print(f"Evaluate {self.model_name()} score by cross-validation...")
        CVS = cross_val_score(self.model, self.X, self.y, scoring='accuracy', cv=cv)
        print(CVS)
        print("="*60, "\nMean accuracy of cross-validation: ", CVS.mean())
    
    def train(self):
        print(f"Training {self.model_name()} Model...")
        # use the split stored on the instance rather than the globals
        self.model.fit(self.X_train, self.y_train)
        print("Model Trained.")
        
    def prediction(self, test_x=None, test=False):
        if not test:
            y_pred = self.model.predict(self.X_test)
        else:
            y_pred = self.model.predict(test_x)
            
        return y_pred
    
    def accuracy(self):
        y_pred = self.prediction()
        y_test = self.y_test
        
        # scikit-learn's convention is (y_true, y_pred)
        acc = accuracy_score(y_test, y_pred)
        print(f"{self.model_name()} Model Accuracy: ", acc)
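
The class above receives an `X` that was scaled once on the full training set before cross-validation. A possible alternative, sketched here on synthetic data, is to wrap the scaler and the model in a pipeline so the scaler is re-fit inside each fold. The data from `make_classification` and the use of `make_pipeline` are assumptions for illustration, not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the Titanic features
toy_X, toy_y = make_classification(n_samples=200, n_features=6, random_state=42)

# The pipeline fits the scaler on each training fold only,
# then applies it to that fold's validation part
pipe = make_pipeline(StandardScaler(), SVC(C=0.4, random_state=42))

scores = cross_val_score(pipe, toy_X, toy_y, scoring="accuracy", cv=5)
print(scores)
print("Mean accuracy:", scores.mean())
```

On a dataset this small the difference is usually minor, but the pipeline keeps validation rows out of the scaling statistics.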

### Tree Based Models

In [None]:
xgb = XGBClassifier(random_state=42, n_estimators=222)
xgb = Model(xgb)

xgb.cross_validation()

In [None]:
xgb.accuracy()

In [None]:
rfc = RandomForestClassifier(random_state=42)
rfc = Model(rfc)

rfc.cross_validation()

In [None]:
rfc.accuracy()

### Classic ML Models

In [None]:
gnb = GaussianNB()
gnb = Model(gnb)

gnb.cross_validation()

In [None]:
gnb.accuracy()

In [None]:
svc = SVC(C=0.4, random_state=42, probability=True)
svc = Model(svc)

svc.cross_validation()

In [None]:
svc.accuracy()

# Submitting

In [None]:
# Preview the test set before predicting
test_df.head()

# Scale the test set with the scaler fitted on the training data
test_X = scale.transform(test_df.values)

In [None]:
xgb_pred = xgb.prediction(test_x=test_X, test=True)
gnb_pred = gnb.prediction(test_x=test_X, test=True)
svc_pred = svc.prediction(test_x=test_X, test=True)
rfc_pred = rfc.prediction(test_x=test_X, test=True)

In [None]:
sub['Survived'] = xgb_pred # Best solo Submission (Top 5% LB)
sub.to_csv('xgb_submission.csv', index=False)
sub.head(10)

In [None]:
sub['Survived'] = rfc_pred
sub.to_csv('rfc_submission.csv', index=False)
sub.head(10)

In [None]:
sub['Survived'] = gnb_pred
sub.to_csv('gnb_submission.csv', index=False)
sub.head(10)

In [None]:
sub['Survived'] = svc_pred
sub.to_csv('svc_submission.csv', index=False)
sub.head(10)

### Ensemble Submissions

In [None]:
# predict probability for each model
xgb_preds = xgb.model.predict_proba(test_X)
rfc_preds = rfc.model.predict_proba(test_X)
gnb_preds = gnb.model.predict_proba(test_X)
svc_preds = svc.model.predict_proba(test_X)

# ensemble all models
preds = (xgb_preds + rfc_preds + gnb_preds + svc_preds) / 4

In [None]:
sub['Survived'] = preds.argmax(axis=1)
sub.to_csv('submission.csv', index=False)
sub.head(10)
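
The averaging step is a form of soft voting: each model contributes its class probabilities, and the class with the highest mean probability wins. The sketch below walks through it on made-up probabilities for two passengers; the arrays `p_a`, `p_b`, and `p_c` are hypothetical and do not come from the models above.

```python
import numpy as np

# Made-up probabilities from three models for two passengers;
# columns are [P(not survived), P(survived)]
p_a = np.array([[0.9, 0.1], [0.4, 0.6]])
p_b = np.array([[0.6, 0.4], [0.3, 0.7]])
p_c = np.array([[0.7, 0.3], [0.6, 0.4]])

# Average the probabilities, then pick the most likely class per row
avg = (p_a + p_b + p_c) / 3
labels = avg.argmax(axis=1)

print(avg.round(3))
print(labels)  # [0 1]
```

For the second passenger, `p_c` alone would predict class 0, but the two more confident models outweigh it in the average.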

<h3>Thanks for being here. <span style='color:red'>UPVOTE</span> if you found this useful, and feel free to leave comments.</h3>