#Titanic - Machine Learning from Disaster

###We will be using the [Titanic datasets](https://www.kaggle.com/c/titanic/)

### Read the data using pandas dataframe

Using pandas, we convert the raw CSV file into a structured DataFrame (much like an Excel sheet). For the supported file formats, please refer to the [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

In [1]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/henseljahja/learn-ml/main/titanic_train_data.csv")
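
As an offline illustration of the same `read_csv` call, the sketch below parses a few rows from an in-memory string instead of the URL. The three rows are copied from the training data; `sample_csv` and `sample_df` are names made up for this example.

```python
import io
import pandas as pd

# Three rows copied from the training data, kept in memory so no download is needed.
sample_csv = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
"""

# read_csv accepts a path, a URL, or any file-like object.
sample_df = pd.read_csv(io.StringIO(sample_csv))

print(sample_df.shape)            # (rows, columns)
print(sample_df["Cabin"].isna())  # empty fields are parsed as NaN
```

Quoted fields such as the passenger names keep their embedded commas, and empty fields (here `Cabin`) come back as NaN.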

Show the first 5 rows (the default) of the DataFrame

In [2]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [None]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.38,2.31,29.7,0.52,0.38,32.2
std,257.35,0.49,0.84,14.53,1.1,0.81,49.69
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.12,0.0,0.0,7.91
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.45
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.33


In [None]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [None]:
df.Survived.value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [None]:
df.Pclass.value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [None]:
df.Sex.value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [None]:
df.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

#1. Preprocessing

**First, let's separate the categorical columns from the numerical columns in the dataset**

In [3]:
#Select the necessary columns for Categorical & Numerical
#Numerical Attributes 
num_attribs = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
#Categorical Attributes
cat_attribs = ["Sex", "Embarked"]
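
The two lists above are picked by hand. As a rough cross-check, pandas can split columns by dtype; the sketch below shows this on a small made-up frame (`toy`, `numeric_cols`, and `categorical_cols` are hypothetical names).

```python
import pandas as pd

# A tiny stand-in for the Titanic frame, with one missing value per dtype.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, None],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],
})

# Split the columns into numeric and non-numeric groups by dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

On the full dataset a dtype-based split would also pick up `PassengerId` and `Survived` as numeric and `Name`, `Ticket`, and `Cabin` as non-numeric, so the hand-picked lists are still needed to leave those out.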

##1.1 Numerical Value

###1.1.1 Imputer

**`sklearn.impute` provides a class called `SimpleImputer`, which fills the NaN values in the numerical columns using one of four strategies: `mean`, `median`, `most_frequent`, or `constant`. Please refer to the documentation at [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)**
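
To make the four strategies concrete, here is a minimal sketch on a made-up column of ages with one missing entry. The array `ages` and the dictionary `filled` are illustrative only and are not part of the Titanic data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column with a single missing value in the last row.
ages = np.array([[20.0], [20.0], [30.0], [70.0], [np.nan]])

filled = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    # Keep only the value that replaced the NaN.
    filled[strategy] = imputer.fit_transform(ages)[-1, 0]

# "constant" fills with fill_value (set explicitly here).
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
filled["constant"] = constant_imputer.fit_transform(ages)[-1, 0]

for strategy, value in filled.items():
    print(f"{strategy:>13}: {value}")
```

The median is less sensitive to outliers (such as the 70 above) than the mean, which is one common reason to choose it for skewed columns like `Fare`.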

In [None]:
#Let's create the Numerical Pipeline for preprocessing
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
  ('imputer', SimpleImputer(strategy='median')),
  ('std_scaler', StandardScaler())
])

In [None]:
df_num = num_pipeline.fit_transform(df[num_attribs])

In [None]:
pd.DataFrame(data=df_num,columns = num_attribs)

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare
0,0.83,-0.57,0.43,-0.47,-0.50
1,-1.57,0.66,0.43,-0.47,0.79
2,0.83,-0.26,-0.47,-0.47,-0.49
3,-1.57,0.43,0.43,-0.47,0.42
4,0.83,0.43,-0.47,-0.47,-0.49
...,...,...,...,...,...
886,-0.37,-0.18,-0.47,-0.47,-0.39
887,-1.57,-0.80,-0.47,-0.47,-0.04
888,0.83,-0.10,0.43,2.01,-0.18
889,-1.57,-0.26,-0.47,-0.47,-0.04
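
The values above are standardized: after median imputation, each column is shifted and scaled to zero mean and unit variance. The following sketch checks this on a small made-up array (`toy_num` is hypothetical), using the same two pipeline steps.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two made-up numeric columns; the first has one missing value.
toy_num = np.array([[22.0, 7.25], [38.0, 71.28], [np.nan, 7.93], [35.0, 53.10]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
scaled = toy_pipeline.fit_transform(toy_num)

# The median learned for the first column is what replaces the NaN.
median_first_col = toy_pipeline.named_steps["imputer"].statistics_[0]
print(median_first_col)
print(scaled.mean(axis=0).round(6))  # per-column mean after scaling
print(scaled.std(axis=0).round(6))   # per-column standard deviation after scaling
```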


In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

In [None]:
# Inspired from stackoverflow.com/questions/25239958
class MostFrequentImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],
                                        index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.most_frequent_)
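
scikit-learn's own `SimpleImputer(strategy="most_frequent")` also accepts string columns, so it may serve as an alternative to the custom class above. This is a sketch of that alternative on made-up data (`toy_cat` is hypothetical), not the approach used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Two string columns; the last row is missing in both.
toy_cat = pd.DataFrame({
    "Sex": ["male", "female", "male", np.nan],
    "Embarked": ["S", "S", "C", np.nan],
}, dtype=object)

imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(toy_cat)  # returns a NumPy array, not a DataFrame

print(filled[-1])  # the previously missing row
```

One difference is the return type: the custom class returns a DataFrame, while `SimpleImputer` returns a NumPy array by default.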

In [None]:
#Let's create the pipeline for the Categorical Attributes
from sklearn.preprocessing import OneHotEncoder
cat_pipeline = Pipeline([
        ("imputer", MostFrequentImputer()),
        #Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`
        ("cat_encoder", OneHotEncoder(sparse=False)),
    ])

In [None]:
df_cat = cat_pipeline.fit_transform(df[cat_attribs])

In [None]:
#Let's combine both of those into a full pipeline
from sklearn.compose import ColumnTransformer

full_pipeline = ColumnTransformer([
  ("num", num_pipeline, num_attribs),
  ("cat", cat_pipeline, cat_attribs)
])

In [None]:
df_processed = full_pipeline.fit_transform(df)

In [None]:
df_processed

array([[ 0.82737724, -0.56573646,  0.43279337, ...,  0.        ,
         0.        ,  1.        ],
       [-1.56610693,  0.66386103,  0.43279337, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.82737724, -0.25833709, -0.4745452 , ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 0.82737724, -0.1046374 ,  0.43279337, ...,  0.        ,
         0.        ,  1.        ],
       [-1.56610693, -0.25833709, -0.4745452 , ...,  1.        ,
         0.        ,  0.        ],
       [ 0.82737724,  0.20276197, -0.4745452 , ...,  0.        ,
         1.        ,  0.        ]])

In [None]:
preprocessed_attribs = num_attribs + cat_attribs

In [None]:
preprocessed_attribs

['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex', 'Embarked']

In [None]:
df_processed.shape

(891, 10)
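
The processed array has 10 columns while `preprocessed_attribs` lists only 7 names, because one-hot encoding expands each categorical column into one column per category. The sketch below recovers the expanded names on a small made-up frame. `toy`, `toy_transformer`, and `all_names` are hypothetical, and the categorical branch uses only `OneHotEncoder` to keep the example self-contained.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_attribs = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
cat_attribs = ["Sex", "Embarked"]

# Made-up rows covering both sexes and all three embarkation ports.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 1],
    "Fare": [7.25, 71.28, 7.93, 8.05],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

toy_transformer = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("std_scaler", StandardScaler())]), num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
toy_processed = toy_transformer.fit_transform(toy)

# Each categorical column contributes one output column per learned category.
encoder = toy_transformer.named_transformers_["cat"]
onehot_names = [f"{col}_{cat}"
                for col, cats in zip(cat_attribs, encoder.categories_)
                for cat in cats]
all_names = num_attribs + onehot_names

print(toy_processed.shape)
print(all_names)
```

With 5 numerical columns, 2 categories for `Sex`, and 3 for `Embarked`, the total is 5 + 2 + 3 = 10 columns.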

#2. Machine Learning Modelling

In [None]:
#Selecting the X_train, y_train
X_train = df_processed
y_train = df.Survived

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
from sklearn.model_selection import GridSearchCV

###2.1 Support Vector Machine Classifier

In [None]:
#SVC Model
from sklearn.svm import SVC
svm_clf = SVC()
svm_scores = cross_val_score(svm_clf, X_train, y_train, cv=10)

In [None]:
display(svm_scores,svm_scores.mean())

array([0.8       , 0.85393258, 0.76404494, 0.87640449, 0.83146067,
       0.78651685, 0.82022472, 0.78651685, 0.86516854, 0.85393258])

0.8238202247191012
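
For intuition about what `cross_val_score` computes, the following sketch reproduces it with an explicit loop on synthetic data. It assumes the default behavior for classifiers, where an integer `cv` means stratified folds without shuffling. `X_toy` and `y_toy` are generated data, not the Titanic set.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic binary classification data standing in for X_train / y_train.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SVC()

# Manual version: train on 9 folds, score accuracy on the held-out fold.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X_toy, y_toy):
    fold_clf = clone(clf)  # fresh, unfitted copy for every fold
    fold_clf.fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_clf.score(X_toy[test_idx], y_toy[test_idx]))
manual_scores = np.array(manual_scores)

# Library version.
library_scores = cross_val_score(clf, X_toy, y_toy, cv=10)

print(np.allclose(manual_scores, library_scores))
```

Each of the ten numbers is the accuracy on one held-out fold, and their mean is the single figure usually reported.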

In [None]:
svc_param_grid = {
    'C': [0.01,0.1,1,10,100,0.03,0.3,3,30,300],
    'kernel': ["linear","poly","rbf","sigmoid"],
    "gamma" : ["scale","auto"]
    }
 
svc_grid_search = GridSearchCV(svm_clf, svc_param_grid, cv=5,
                           scoring='accuracy',verbose = 5)
svc_grid_search.fit(X_train,y_train)

Fitting 5 folds for each of 80 candidates, totalling 400 fits
[CV] C=0.01, gamma=scale, kernel=linear ..............................
[CV] .. C=0.01, gamma=scale, kernel=linear, score=0.793, total=   0.0s
[CV] C=0.01, gamma=scale, kernel=linear ..............................
[CV] .. C=0.01, gamma=scale, kernel=linear, score=0.809, total=   0.0s
[CV] C=0.01, gamma=scale, kernel=linear ..............................
[CV] .. C=0.01, gamma=scale, kernel=linear, score=0.781, total=   0.0s
[CV] C=0.01, gamma=scale, kernel=linear ..............................
[CV] .. C=0.01, gamma=scale, kernel=linear, score=0.753, total=   0.0s
[CV] C=0.01, gamma=scale, kernel=linear ..............................
[CV] .. C=0.01, gamma=scale, kernel=linear, score=0.787, total=   0.0s
[CV] C=0.01, gamma=scale, kernel=poly ................................
[CV] .... C=0.01, gamma=scale, kernel=poly, score=0.615, total=   0.0s
[CV] C=0.01, gamma=scale, kernel=poly ................................
[CV] .... C=0.0

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s remaining:    0.0s


[CV] ..... C=0.01, gamma=scale, kernel=rbf, score=0.618, total=   0.0s
[CV] C=0.01, gamma=scale, kernel=rbf .................................
[CV] ..... C=0.01, gamma=scale, kernel=rbf, score=0.618, total=   0.0s
[CV] C=0.01, gamma=scale, kernel=rbf .................................
[CV] ..... C=0.01, gamma=scale, kernel=rbf, score=0.612, total=   0.0s
[CV] C=0.01, gamma=scale, kernel=sigmoid .............................
[CV] . C=0.01, gamma=scale, kernel=sigmoid, score=0.626, total=   0.0s
[CV] C=0.01, gamma=scale, kernel=sigmoid .............................
[CV] . C=0.01, gamma=scale, kernel=sigmoid, score=0.624, total=   0.0s
[CV] C=0.01, gamma=scale, kernel=sigmoid .............................
[CV] . C=0.01, gamma=scale, kernel=sigmoid, score=0.618, total=   0.0s
[CV] C=0.01, gamma=scale, kernel=sigmoid .............................
[CV] . C=0.01, gamma=scale, kernel=sigmoid, score=0.618, total=   0.0s
[CV] C=0.01, gamma=scale, kernel=sigmoid .............................
[CV] .

KeyboardInterrupt: ignored

In [None]:
display(svc_grid_search.best_score_, svc_grid_search.best_params_)
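
The "80 candidates, totalling 400 fits" line in the log follows directly from the grid: every combination of values is one candidate, and each candidate is fitted once per fold. A small sketch that counts them with `ParameterGrid`, without training anything:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the SVC search.
svc_param_grid = {
    "C": [0.01, 0.1, 1, 10, 100, 0.03, 0.3, 3, 30, 300],
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "gamma": ["scale", "auto"],
}

n_candidates = len(ParameterGrid(svc_param_grid))  # 10 * 4 * 2 combinations
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 80 400
```

Counting candidates first is a cheap way to estimate run time before launching a search; `GridSearchCV` also accepts `n_jobs=-1` to spread the fits over all CPU cores.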

###2.2 Random Forest Classifier

In [None]:
#Random Forest Model
from sklearn.ensemble import RandomForestClassifier
rfc_clf = RandomForestClassifier()
rfc_scores = cross_val_score(rfc_clf, X_train, y_train, cv=10)

In [None]:
display(rfc_scores,rfc_scores.mean())

In [None]:
rfc_param_grid = {
    'n_estimators' : list(range(50,250)),
    #'auto' was an alias of 'sqrt' for classifiers and is not accepted by scikit-learn >= 1.3
    'max_features': ['sqrt', 'log2']
    }
 
rfc_grid_search = GridSearchCV(rfc_clf, rfc_param_grid, cv=5,
                           scoring='accuracy',verbose = 5)
rfc_grid_search.fit(X_train,y_train)

In [None]:
display(rfc_grid_search.best_score_, rfc_grid_search.best_params_)
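
The grid above contains several hundred candidates, each fitted five times, which can take a long time for a random forest. One possible alternative, not used in this notebook, is `RandomizedSearchCV`, which samples a fixed number of candidates. The sketch below runs on synthetic data; `X_toy`, `y_toy`, and the small `n_iter` are assumptions chosen to keep it fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train / y_train.
X_toy, y_toy = make_classification(n_samples=150, n_features=8, random_state=0)

param_distributions = {
    "n_estimators": randint(50, 250),   # integers drawn from [50, 250)
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,            # only 5 sampled candidates instead of the full grid
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_toy, y_toy)

print(search.best_score_, search.best_params_)
```

The trade-off is that a random search may miss the single best combination, but it usually gets close at a fraction of the cost.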

###2.3 K-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_scores = cross_val_score(knn_clf, X_train, y_train, cv=10)

In [None]:
display(knn_scores, knn_scores.mean())

In [None]:
knn_param_grid = {
    'weights' : ['uniform', 'distance'],
    'n_neighbors' : [x for x in range(1,100)]
} 
knn_grid_search = GridSearchCV(knn_clf, knn_param_grid, cv=5,
                           scoring='accuracy',verbose = 5)
knn_grid_search.fit(X_train,y_train)

In [None]:
display(knn_grid_search.best_score_, knn_grid_search.best_params_)

### 2.4 XGB Classifier

In [None]:
from xgboost import XGBClassifier

In [None]:
xgb_clf = XGBClassifier()

In [None]:
xgb_clf.fit(X_train, y_train)

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
xgb_cvs = cross_val_score(xgb_clf, X_train, y_train,verbose=3, cv=10)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s


[CV]  ................................................................
[CV] .................................... , score=0.789, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.809, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.764, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.843, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.865, total=   0.1s
[CV]  ................................................................
[CV] .................................... , score=0.820, total=   0.0s
[CV]  ................................................................
[CV] .................................... , score=0.854, total=   0.0s
[CV]  

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.5s finished


In [None]:
display(xgb_cvs.mean())

0.821585518102372

###2.5 Lazy Predict

**Lazy Predict is a library that trains many classification algorithms with very little code and sorts them from best to worst performing. Please refer to the documentation at [Lazy Predict](https://lazypredict.readthedocs.io/en/latest/readme.html)**

Installation:
1. Mac, Windows & Linux
`pip install lazypredict`
2. Google Colab
`!pip install lazypredict`

In [None]:
!pip install lazypredict

In [None]:
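from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyClassifier

#Hold out half of the training data for evaluating each model
X_train_lp, X_test_lp, y_train_lp, y_test_lp = train_test_split(X_train,y_train,test_size=.5,random_state =123)

#custom_metric takes a callable or None; accuracy is reported by default
lp_clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = lp_clf.fit(X_train_lp, X_test_lp, y_train_lp, y_test_lp)

display(models)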

100%|██████████| 30/30 [00:01<00:00, 27.42it/s]


Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
XGBClassifier,0.83,0.82,0.82,0.83,0.06
NuSVC,0.84,0.81,0.81,0.83,0.02
LGBMClassifier,0.82,0.81,0.81,0.82,0.04
SVC,0.83,0.81,0.81,0.83,0.03
RandomForestClassifier,0.8,0.8,0.8,0.8,0.17
AdaBoostClassifier,0.8,0.79,0.79,0.8,0.09
BaggingClassifier,0.8,0.79,0.79,0.8,0.03
LogisticRegression,0.79,0.78,0.78,0.79,0.02
GaussianNB,0.79,0.78,0.78,0.79,0.01
KNeighborsClassifier,0.79,0.78,0.78,0.79,0.03
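
Conceptually, a ranking like the one above can be produced with plain scikit-learn by fitting several models on the same split and sorting by a metric. The sketch below is a simplified stand-in on synthetic data, not Lazy Predict's actual implementation; the choice of four models and the names `candidates` and `ranking` are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic data and a 50/50 split, mirroring the split used above.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.5, random_state=123)

candidates = {
    "SVC": SVC(),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(),
}

# Fit every model with default settings and record its test accuracy.
rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    rows.append({"Model": name, "Accuracy": accuracy_score(y_te, model.predict(X_te))})

ranking = pd.DataFrame(rows).set_index("Model").sort_values("Accuracy", ascending=False)
print(ranking)
```

Such a ranking uses untuned defaults and a single split, so it is best read as a shortlist for further tuning rather than a final verdict.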


#3. Full Pipeline with Predictions

In [None]:
full_pipeline_with_predictor = Pipeline([
  ("preprocessing", full_pipeline),
  ("svc", SVC(C = 3, gamma =  'auto', kernel =  'rbf'))
])

In [None]:
full_pipeline_with_predictor.fit(df.drop("Survived",axis = 1), df.Survived)

In [None]:
final_model = full_pipeline_with_predictor
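
Bundling preprocessing and the classifier into one `Pipeline` means new raw rows, including rows with missing values, can be passed straight to `predict`; the imputation and scaling statistics learned during `fit` are reused. A minimal sketch on made-up data follows. `train`, `new_rows`, and the reduced column set are hypothetical, and `SVC` is left at its defaults.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Made-up training rows with one missing Age.
train = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.46, 51.86],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

preprocessing = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("std_scaler", StandardScaler())]), ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])
model = Pipeline([("preprocessing", preprocessing), ("svc", SVC())])
model.fit(train.drop("Survived", axis=1), train["Survived"])

# New raw rows, one with a missing Age, go through the same preprocessing.
new_rows = pd.DataFrame({"Age": [np.nan, 30.0], "Fare": [8.05, 60.0],
                         "Sex": ["male", "female"]})
predictions = model.predict(new_rows)
print(predictions)
```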

#4. Predict the Test Set

In [None]:
test_set = pd.read_csv("/content/{/content}/competitions/titanic/test.csv")

In [None]:
final_predictions = final_model.predict(test_set)

In [None]:
submissions_data = {"PassengerId" : test_set.PassengerId.values,
                    "Survived" : final_predictions
}

In [None]:
submissions = pd.DataFrame(data = submissions_data, columns=["PassengerId","Survived"])

In [None]:
submissions.to_csv("submission.csv",index=False)

In [None]:
!kaggle competitions submit -c titanic -f submission.csv -m "Message"