
# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint


## Learning Objectives


By the end of this mini-hackathon you will be able to:
* Perform data preprocessing
* Apply different ML algorithms to the **Titanic** dataset
* Build an ensemble with `VotingClassifier`
* Participate in the Kaggle competition and submit predictions

## Dataset Description

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of many passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

[Dataset link: Kaggle competition](https://www.kaggle.com/competitions/titanic)


### Data Set Characteristics:

**PassengerId:** ID of the passenger

**Survived:** Whether the passenger survived (0 = No, 1 = Yes)

**Pclass:** Ticket class, a proxy for socio-economic status (SES)
  * 1st = Upper
  * 2nd = Middle
  * 3rd = Lower

**Name:** Surname, First Names of the Passenger

**Sex:** Gender of the Passenger

**Age:** Age of the Passenger

**SibSp:** Number of siblings/spouses of the passenger aboard the Titanic

**Parch:** Number of parents/children of the passenger aboard the Titanic

**Ticket:** Ticket number

**Fare:** Passenger fare

**Cabin:** Cabin number

**Embarked:** Port of Embarkation
  * S = Southampton
  * C = Cherbourg
  * Q = Queenstown


## Problem Statement

Build a predictive model that answers the question "What sort of people were more likely to survive?" using the Titanic's passenger data (i.e., name, age, gender, socio-economic class, etc.).

In [1]:
# @title Download the datasets
from IPython import get_ipython

ipython = get_ipython()

notebook="U1_MH1_Data_Munging" #name of the notebook

def setup():
    # Download the training file and the Kaggle test file with wget via the %sx magic
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/titanic.csv")
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/test_titanic.csv")
    print("Data downloaded successfully")

setup()

Data downloaded successfully


In [2]:
!ls

sample_data  test_titanic.csv  titanic.csv


**Note:** Use **titanic.csv** for training and testing, and **test_titanic.csv** for the predictions submitted to the Kaggle competition.

## Exercise 1 - Load and Explore the Data (2 Marks)

* Understand different features in the training dataset
* Understand the data types of each column
* Identify the columns with missing values




#### Import Required Packages

In [3]:
import pandas as pd
import numpy as np

In [4]:
# Load the dataset
dataset = pd.read_csv("titanic.csv")
dataset.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
dataset.shape

(891, 12)

In [6]:
# Getting information about the dataset
dataset.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [7]:
dataset.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
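
The dtype and missing-value checks above can be folded into one summary table. The following is a minimal sketch on a small made-up frame; the `toy` data and the `summarize_columns` helper are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count and missing share for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(1),
    })

# Hypothetical toy data standing in for titanic.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(dataset)` on the real frame would give the same three columns for every feature, which makes it easier to spot columns such as Cabin where most values are missing.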

## Exercise 02: Split the data into train and test sets (1 Mark)
Note: Apply all your data preprocessing steps to the train set first and keep the test set aside.

In [8]:
from sklearn.model_selection import train_test_split
X = dataset.drop("Survived", axis=1)
y = dataset['Survived']
x_train,x_test,y_train,y_test = train_test_split(X,y, test_size=0.25, random_state=41)
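
When the two classes are not equally frequent, a plain random split can leave the train and test sets with slightly different class proportions. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that preserves those proportions. A minimal sketch on made-up labels; `X_demo` and `y_demo` are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 40% positive labels
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 40 + [0] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=41, stratify=y_demo
)

# Both parts keep the 40% positive rate of the full data
print(y_tr.mean(), y_te.mean())
```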


## Exercise 03: Data Cleaning and Processing (15 Marks)
### 3.1 Working on the "Cabin" column (2 Marks)
Find the unique entries in the Cabin column. Passengers can be labelled in two categories: with a cabin or without one. Check the data type (use: `type`) of each Cabin entry. Map string entries to 1 (passenger has a cabin) and all other entries to 0 (no cabin recorded). Write a function for this operation, apply it to the Cabin column, and store the result in a new column named "Has_cabin" containing only 0 or 1.





In [9]:
x_train['Cabin'].unique()

array([nan, 'C106', 'F2', 'G6', 'F G73', 'B96 B98', 'C85', 'C46', 'C93',
       'C23 C25 C27', 'A24', 'C148', 'F G63', 'C123', 'C83', 'C126',
       'B50', 'D33', 'E10', 'C87', 'E34', 'E67', 'E101', 'B20', 'B35',
       'B86', 'A19', 'C125', 'A34', 'D', 'D36', 'C32', 'D30', 'D7',
       'F E69', 'E25', 'B38', 'A31', 'A36', 'B19', 'B30', 'D20', 'C54',
       'C30', 'C118', 'C2', 'E24', 'E17', 'C22 C26', 'C49', 'C62 C64',
       'B73', 'F4', 'B58 B60', 'D9', 'A20', 'B28', 'A10', 'B4', 'C124',
       'D26', 'D46', 'C103', 'B18', 'B49', 'C90', 'B80', 'D49', 'E63',
       'E49', 'C65', 'C52', 'E33', 'C82', 'B37', 'E44', 'D47', 'E58',
       'B51 B53 B55', 'T', 'F38', 'B22', 'A16', 'A5', 'B3', 'A26', 'E40',
       'B57 B59 B63 B66', 'E12', 'C70', 'B41', 'D11', 'D45', 'F33',
       'C101', 'A6', 'E46', 'B71', 'B79', 'C78', 'B77', 'C68', 'B94',
       'B82 B84', 'D28', 'D50', 'D10 D12', 'D17', 'E38', 'E50', 'C95',
       'A7', 'B5', 'D48', 'C99', 'D56', 'E68', 'E8', 'D19', 'C92', 'E77',
      

In [10]:
# Check whether the Cabin entry with index label 1 is a string
isinstance(x_train['Cabin'].loc[1], str)

True

In [11]:
def encoded_cabin_column(x_train):
  # Return a list with 1 where Cabin holds a string and 0 where it is missing (NaN)
  new_cabin_val = []
  for val in x_train['Cabin']:
    if isinstance(val, str):
      new_cabin_val.append(1)
    else:
      new_cabin_val.append(0)
  return new_cabin_val



In [12]:
def set_encoded_cabin_column(x_train):
  value = encoded_cabin_column(x_train)
  x_train['Has_cabin'] = np.array(value)

set_encoded_cabin_column(x_train)
x_train.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Has_cabin
668,669,3,"Cook, Mr. Jacob",male,43.0,0,0,A/5 3536,8.05,,S,0
143,144,3,"Burke, Mr. Jeremiah",male,19.0,0,0,365222,6.75,,Q,0
604,605,1,"Homer, Mr. Harry (""Mr E Haven"")",male,35.0,0,0,111426,26.55,,C,0
298,299,1,"Saalfeld, Mr. Adolphe",male,,0,0,19988,30.5,C106,S,1
559,560,3,"de Messemaeker, Mrs. Guillaume Joseph (Emma)",female,36.0,1,0,345572,17.4,,S,0
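
The loop above can also be written as a single vectorized expression. A minimal sketch, assuming missing cabins are stored as NaN (as in the output above); the `toy` frame is made up for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Cabin": [np.nan, "C85", "B96 B98", np.nan]})

# notna() is True where a cabin value is present; cast the booleans to 0/1
toy["Has_cabin"] = toy["Cabin"].notna().astype(int)
print(toy["Has_cabin"].tolist())
```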


### 3.2 Working on "SibSp" & "Parch" columns (1 Mark)
Combine the "SibSp" and "Parch" columns into a new column named "family_size" that represents the total number of passengers on one ticket. Each ticket may cover siblings/spouses (SibSp = number of siblings/spouses aboard) or parents/children (Parch = number of parents/children aboard) in addition to the passenger who booked it.

  

In [13]:
def create_family_column(x_train):
  x_train['family_size'] = x_train['SibSp'] + x_train['Parch']+1

create_family_column(x_train)
x_train.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Has_cabin,family_size
668,669,3,"Cook, Mr. Jacob",male,43.0,0,0,A/5 3536,8.05,,S,0,1
143,144,3,"Burke, Mr. Jeremiah",male,19.0,0,0,365222,6.75,,Q,0,1
604,605,1,"Homer, Mr. Harry (""Mr E Haven"")",male,35.0,0,0,111426,26.55,,C,0,1
298,299,1,"Saalfeld, Mr. Adolphe",male,,0,0,19988,30.5,C106,S,1,1
559,560,3,"de Messemaeker, Mrs. Guillaume Joseph (Emma)",female,36.0,1,0,345572,17.4,,S,0,2


In [14]:
x_train['family_size'].unique()

array([ 1,  2,  3,  4,  6, 11,  7,  8,  5])
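
As a quick check of the arithmetic, `family_size` counts the passenger plus any relatives aboard, so a passenger travelling alone gets 1. The sketch below uses a made-up `toy` frame and `DataFrame.assign`, which returns a new frame rather than modifying the input in place (a design choice that differs from the in-place helper above):

```python
import pandas as pd

toy = pd.DataFrame({"SibSp": [0, 1, 1], "Parch": [0, 0, 2]})

# assign() returns a new frame and leaves `toy` unchanged
with_family = toy.assign(family_size=toy["SibSp"] + toy["Parch"] + 1)
print(with_family["family_size"].tolist())
```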

### 3.3 Working on the "Embarked" column (2 Marks)
The "Embarked" column gives the port of embarkation: Cherbourg (C), Queenstown (Q), or Southampton (S), so its entries fall into three categories. Fill the missing rows in this column with the most frequent category, then map the categorical string entries to numerical values.



In [15]:
x_train['Embarked'].isnull().sum()

1

In [16]:
def fill_and_map_embark(x_train):
  # Fill missing ports with the most frequent one, then encode C/Q/S as 0/1/2
  mode = x_train['Embarked'].mode()
  x_train['Embarked'] = x_train['Embarked'].fillna(mode[0])
  x_train['Embarked'] = x_train['Embarked'].map({'C':0,'Q':1,'S':2})
  x_train['Embarked'] = x_train['Embarked'].astype('int')

In [17]:
fill_and_map_embark(x_train)
x_train.head()


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Has_cabin,family_size
668,669,3,"Cook, Mr. Jacob",male,43.0,0,0,A/5 3536,8.05,,2,0,1
143,144,3,"Burke, Mr. Jeremiah",male,19.0,0,0,365222,6.75,,1,0,1
604,605,1,"Homer, Mr. Harry (""Mr E Haven"")",male,35.0,0,0,111426,26.55,,0,0,1
298,299,1,"Saalfeld, Mr. Adolphe",male,,0,0,19988,30.5,C106,2,1,1
559,560,3,"de Messemaeker, Mrs. Guillaume Joseph (Emma)",female,36.0,1,0,345572,17.4,,2,0,2
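
One design point worth noting: `fill_and_map_embark` computes the mode from whichever frame it receives, so a test set is filled with its own most frequent port. A hedged alternative is to learn the fill value on the training data and reuse it. In the sketch below, `fit_embarked_mode` and `apply_embarked` are hypothetical helper names and both frames are made up:

```python
import numpy as np
import pandas as pd

EMBARKED_CODES = {"C": 0, "Q": 1, "S": 2}  # same mapping as the cell above

def fit_embarked_mode(train: pd.DataFrame) -> str:
    """Most frequent port in the training data."""
    return train["Embarked"].mode()[0]

def apply_embarked(df: pd.DataFrame, fill_value: str) -> pd.DataFrame:
    """Fill missing ports with a fixed value, then map to integer codes."""
    out = df.copy()
    filled = out["Embarked"].fillna(fill_value)
    out["Embarked"] = filled.map(EMBARKED_CODES).astype(int)
    return out

train_toy = pd.DataFrame({"Embarked": ["S", "S", "C", np.nan]})
test_toy = pd.DataFrame({"Embarked": ["Q", np.nan]})

mode = fit_embarked_mode(train_toy)  # learned from the training frame only
train_enc = apply_embarked(train_toy, mode)
test_enc = apply_embarked(test_toy, mode)
print(train_enc["Embarked"].tolist(), test_enc["Embarked"].tolist())
```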


### 3.4 Working on the "Age" column (2 Marks)
Find the number of NaN entries in the Age column and their row indices. Calculate the mean and standard deviation of the Age column and check its distribution. The missing values can be filled with randomly generated integers between (mean - standard deviation) and (mean + standard deviation). Use: `np.isnan`, `np.random.randint`, and DataFrame slicing. Convert the Age column to an integer data type.



In [18]:
np.isnan(x_train['Age']).sum()

136

In [19]:
import statistics as stat
import math

In [20]:
def fill_null_age(value):
  # Statistics of the observed (non-missing) ages
  data =  value['Age'][~np.isnan(value['Age'])]
  mean = np.mean(data)
  std_dev = stat.stdev(data)
  n = len(data)
  std_err = std_dev / math.sqrt(n)  # standard error of the mean

  # Upper bound for the random ages. Note: this draws from [0, upper],
  # not from the (mean - std, mean + std) range suggested in the task text.
  upper = int(np.ceil(mean + std_dev + std_err))

  missing = value['Age'].isnull()
  random_number = np.random.randint(0, upper + 1, size=missing.sum())
  # Assign through .loc to avoid chained assignment (SettingWithCopyWarning)
  value.loc[missing, 'Age'] = random_number
  value['Age'] = value['Age'].astype('int')

In [21]:
fill_null_age(x_train)


In [22]:
x_train['Age'].isnull().sum()

0

In [23]:
x_train.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Has_cabin,family_size
668,669,3,"Cook, Mr. Jacob",male,43,0,0,A/5 3536,8.05,,2,0,1
143,144,3,"Burke, Mr. Jeremiah",male,19,0,0,365222,6.75,,1,0,1
604,605,1,"Homer, Mr. Harry (""Mr E Haven"")",male,35,0,0,111426,26.55,,0,0,1
298,299,1,"Saalfeld, Mr. Adolphe",male,10,0,0,19988,30.5,C106,2,1,1
559,560,3,"de Messemaeker, Mrs. Guillaume Joseph (Emma)",female,36,1,0,345572,17.4,,2,0,2
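
The exercise text suggests drawing the missing ages from roughly one standard deviation around the mean, whereas `fill_null_age` above draws from a wider range starting at 0. Below is a minimal sketch of the range described in the text, assuming a seeded NumPy generator for reproducibility; the helper name `fill_age_within_one_std` and the toy data are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_age_within_one_std(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Fill missing ages with random integers near [mean - std, mean + std]."""
    out = df.copy()
    observed = out["Age"].dropna()
    # Round the bounds outward so they are valid integer limits
    low = int(np.floor(observed.mean() - observed.std()))
    high = int(np.ceil(observed.mean() + observed.std()))

    missing = out["Age"].isnull()
    rng = np.random.default_rng(seed)
    # integers() excludes the upper bound, hence high + 1
    out.loc[missing, "Age"] = rng.integers(low, high + 1, size=missing.sum())
    out["Age"] = out["Age"].astype(int)
    return out

toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0, np.nan, np.nan]})
filled = fill_age_within_one_std(toy)
print(filled["Age"].tolist())
```

Working on a copy and assigning through `.loc` keeps the input frame untouched and avoids chained assignment.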


### 3.5 Working on the "Sex" column (1 Mark)
Map the Sex column as 'female': 0, 'male': 1, and convert it to an integer data type.



In [24]:
def sex_column_mapping(x_train):
  # Encode female as 0 and male as 1, stored as integers
  x_train['Sex'] = x_train['Sex'].map({'female':0,'male':1}).astype('int')

sex_column_mapping(x_train)
x_train.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Has_cabin,family_size
668,669,3,"Cook, Mr. Jacob",1,43,0,0,A/5 3536,8.05,,2,0,1
143,144,3,"Burke, Mr. Jeremiah",1,19,0,0,365222,6.75,,1,0,1
604,605,1,"Homer, Mr. Harry (""Mr E Haven"")",1,35,0,0,111426,26.55,,0,0,1
298,299,1,"Saalfeld, Mr. Adolphe",1,10,0,0,19988,30.5,C106,2,1,1
559,560,3,"de Messemaeker, Mrs. Guillaume Joseph (Emma)",0,36,1,0,345572,17.4,,2,0,2


### 3.8 Drop the columns (1 Mark)

Drop the columns: "PassengerId", "Name", "SibSp", "Parch", "Ticket", "Cabin"

Now apply different ML algorithms and check the accuracy of your model.



In [None]:
def drop_columns(x_train):
  x_train.drop(["PassengerId", "Name", "SibSp" , "Parch", "Ticket", "Cabin"], axis=1,inplace=True)

In [None]:
drop_columns(x_train)
x_train.head()

Unnamed: 0,Pclass,Sex,Age,Fare,Embarked,Has_cabin,family_size
668,3,1,43,8.05,2,0,1
143,3,1,19,6.75,1,0,1
604,1,1,35,26.55,0,0,1
298,1,1,37,30.5,2,1,1
559,3,0,36,17.4,2,0,2


### 3.9 Apply StandardScaler (1 Mark)

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
x_train = sc.fit_transform(x_train)
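
`StandardScaler` standardizes each column to zero mean and unit variance using statistics learned in `fit`. The NumPy sketch below mirrors that computation on made-up arrays to show why a test set is transformed with the training mean and standard deviation rather than its own; it illustrates the idea and is not scikit-learn's implementation.

```python
import numpy as np

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# "fit": learn per-column statistics from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the same statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(test_scaled)
```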

### 3.10 Create a single function for preprocessing the test set (X_test) and apply it. (4 Marks)
#### **Note**: All the pre-processing steps that were applied on the train set before ML Modelling are also applied on the test set before passing through the predict function.

In [None]:
## Create a function
def preprocessing(x_test):
  # Apply the same steps, in the same order, as on the train set
  set_encoded_cabin_column(x_test)  # adds Has_cabin (calls encoded_cabin_column internally)
  create_family_column(x_test)
  fill_and_map_embark(x_test)
  fill_null_age(x_test)
  sex_column_mapping(x_test)
  drop_columns(x_test)


In [None]:
## Applying the above function
preprocessing(x_test)

x_test.head()



Unnamed: 0,Pclass,Sex,Age,Fare,Embarked,Has_cabin,family_size
454,3,1,35,8.05,2,0,1
624,3,1,21,16.1,2,0,1
537,1,0,30,106.425,0,0,1
685,2,1,25,41.5792,0,0,4
396,3,0,31,7.8542,2,0,1


### 3.11 Apply the StandardScaler transformation to x_test (1 Mark)

In [None]:
x_test=sc.transform(x_test)

## Exercise  4. Apply multiple ML algorithms along with an ensemble technique (voting classifier) and display the accuracy (7 Marks)
#### Expected Accuracy >= 80%  


In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

In [None]:
# One entry per model so each configuration is easy to read and compare
models = [
  SGDClassifier(loss='log_loss', max_iter=1000, tol=0.0001, random_state=41),
  LogisticRegression(tol=0.0001, random_state=41, solver='saga', max_iter=700),
  KNeighborsClassifier(n_neighbors=5),
  SVC(kernel='rbf', tol=0.0001),
  DecisionTreeClassifier(criterion='entropy', max_depth=6, random_state=41),
  RandomForestClassifier(n_estimators=80, criterion='gini', max_depth=7, random_state=41),
  BaggingClassifier(estimator=SVC(), n_estimators=10, random_state=41),
  VotingClassifier(estimators=[('svc', SVC()), ('knn', KNeighborsClassifier()), ('lr', LogisticRegression())]),
]

for model in models:
  model.fit(x_train, y_train)
  prediction = model.predict(x_test)
  accuracy = accuracy_score(y_test,prediction)

  print(model)
  print(f'train accuracy {model.score(x_train,y_train)}')
  print(f'test accuracy {accuracy}')
  print('-----------------------------------------------------------')

SGDClassifier(loss='log_loss', random_state=41, tol=0.0001)
train accuracy 0.7934131736526946
test accuracy 0.8295964125560538
-----------------------------------------------------------
LogisticRegression(max_iter=700, random_state=41, solver='saga')
train accuracy 0.8023952095808383
test accuracy 0.8251121076233184
-----------------------------------------------------------
KNeighborsClassifier()
train accuracy 0.8458083832335329
test accuracy 0.820627802690583
-----------------------------------------------------------
SVC(tol=0.0001)
train accuracy 0.8248502994011976
test accuracy 0.8565022421524664
-----------------------------------------------------------
DecisionTreeClassifier(criterion='entropy', max_depth=6, random_state=41)
train accuracy 0.8622754491017964
test accuracy 0.8116591928251121
-----------------------------------------------------------
RandomForestClassifier(max_depth=7, n_estimators=80, random_state=41)
train accuracy 0.9101796407185628
test accuracy 0.83408071

In [None]:
model = VotingClassifier(estimators=[('svc',SVC()),('knn',KNeighborsClassifier()),('lr',LogisticRegression())])
model.fit(x_train, y_train)
prediction = model.predict(x_test)
accuracy = accuracy_score(y_test,prediction)

print(model)
print(f'train accuracy {model.score(x_train,y_train)}')
print(f'test accuracy {accuracy}')

VotingClassifier(estimators=[('svc', SVC()), ('knn', KNeighborsClassifier()),
                             ('lr', LogisticRegression())])
train accuracy 0.8308383233532934
test accuracy 0.8340807174887892
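
By default `VotingClassifier` uses hard voting: each fitted estimator predicts a label and the majority label wins. The NumPy sketch below reproduces that rule on made-up predictions from three hypothetical binary classifiers; it illustrates the idea rather than scikit-learn's internals.

```python
import numpy as np

# Rows = classifiers, columns = samples (hypothetical 0/1 predictions)
predictions = np.array([
    [1, 0, 1, 0],  # e.g. SVC
    [1, 1, 0, 0],  # e.g. KNN
    [0, 1, 1, 0],  # e.g. logistic regression
])

# With binary labels and an odd number of voters, the majority label
# is 1 exactly when more than half of the votes are 1
votes_for_one = predictions.sum(axis=0)
majority = (votes_for_one > predictions.shape[0] / 2).astype(int)
print(majority.tolist())
```

Each of the first three samples gets two votes for class 1 and is labelled 1; the last sample gets none and is labelled 0.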


## Exercise  5. Pre-process the test_set for Kaggle Submission (3 Marks)
The same preprocessing function and the same standard scaler must be applied to this test set before passing it to the predict function.

#### Understanding the test set:

In [None]:
dataset2 = pd.read_csv("test_titanic.csv")
dataset2.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


#### Note: The initial train set had no missing entries in the "Fare" column, but the submission test set has one missing entry in this column.

#### The preprocessing needs a minor change to address this.

In [None]:
dataset2.shape


(418, 11)

In [None]:
dataset2.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [None]:
passengerID = dataset2.loc[:,'PassengerId'].values
passengerID

array([ 892,  893,  894,  895,  896,  897,  898,  899,  900,  901,  902,
        903,  904,  905,  906,  907,  908,  909,  910,  911,  912,  913,
        914,  915,  916,  917,  918,  919,  920,  921,  922,  923,  924,
        925,  926,  927,  928,  929,  930,  931,  932,  933,  934,  935,
        936,  937,  938,  939,  940,  941,  942,  943,  944,  945,  946,
        947,  948,  949,  950,  951,  952,  953,  954,  955,  956,  957,
        958,  959,  960,  961,  962,  963,  964,  965,  966,  967,  968,
        969,  970,  971,  972,  973,  974,  975,  976,  977,  978,  979,
        980,  981,  982,  983,  984,  985,  986,  987,  988,  989,  990,
        991,  992,  993,  994,  995,  996,  997,  998,  999, 1000, 1001,
       1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010, 1011, 1012,
       1013, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023,
       1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034,
       1035, 1036, 1037, 1038, 1039, 1040, 1041, 10

In [None]:
dataset2[dataset2['Fare'].isnull()]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
152,1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S


In [None]:
preprocessing(dataset2)
dataset2.isnull().sum()



Pclass         0
Sex            0
Age            0
Fare           1
Embarked       0
Has_cabin      0
family_size    0
dtype: int64

In [None]:
mean = dataset2['Fare'].mean()
dataset2['Fare'] = dataset2['Fare'].fillna(mean)
dataset2.head()

Unnamed: 0,Pclass,Sex,Age,Fare,Embarked,Has_cabin,family_size
0,3,1,34,7.8292,1,0,1
1,3,0,47,7.0,2,0,2
2,2,1,62,9.6875,1,0,1
3,3,1,27,8.6625,2,0,1
4,3,0,22,12.2875,2,0,3
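
The cell above fills the missing fare with the mean of the submission file itself. A hedged alternative is to reuse a statistic computed on the training data, for example the median, which is less sensitive to a few very high fares. In this sketch both frames are made up and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

train_toy = pd.DataFrame({"Fare": [7.25, 8.05, 26.55, 512.33]})
submit_toy = pd.DataFrame({"Fare": [7.83, np.nan, 9.69]})

# Learn the fill value from the training data only
train_median = train_toy["Fare"].median()

# fillna() inside assign() returns a new frame with the gap filled
submit_filled = submit_toy.assign(Fare=submit_toy["Fare"].fillna(train_median))
print(train_median, submit_filled["Fare"].isnull().sum())
```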


In [None]:
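# Note: fit_transform re-fits the scaler on the submission set. To reuse the
# mean and scale learned from the training set, sc.transform(dataset2) would
# be the matching call.
transformed_dataset2 = sc.fit_transform(dataset2)
transformed_dataset2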

array([[ 0.87348191,  0.75592895,  0.32491724, ..., -0.47091535,
        -0.52752958, -0.5534426 ],
       [ 0.87348191, -1.32287566,  1.25614103, ...,  0.70076689,
        -0.52752958,  0.10564289],
       [-0.31581919,  0.75592895,  2.33063002, ..., -0.47091535,
        -0.52752958, -0.5534426 ],
       ...,
       [ 0.87348191,  0.75592895,  0.61144764, ...,  0.70076689,
        -0.52752958, -0.5534426 ],
       [ 0.87348191,  0.75592895, -0.96446954, ...,  0.70076689,
        -0.52752958, -0.5534426 ],
       [ 0.87348191,  0.75592895, -0.39140875, ..., -1.64259759,
        -0.52752958,  0.76472838]])

## Exercise  6. Prediction for test data for submission (1 Mark)

In [None]:
# model = VotingClassifier(estimators=[('svc',SVC()),('knn',KNeighborsClassifier()),('lr',LogisticRegression())])
pred = model.predict(transformed_dataset2)
pred

array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

## Exercise  7. Saving the CSV file for submission (1 Mark)
Create a CSV file whose first column is "PassengerId", taken from the submission test set (test_titanic.csv), and whose second column is "Survived", holding the predictions for that set.

In [None]:
# Kaggle expects the column names "PassengerId" and "Survived"
df = pd.DataFrame({'PassengerId':passengerID,'Survived':pred})
df.head(10)

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
5,897,0
6,898,1
7,899,0
8,900,1
9,901,0


In [None]:
df.to_csv('Submission_file.csv', index=False)
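
Before uploading, it can help to sanity-check the file. The sketch below assumes the format Kaggle describes for this competition (a `PassengerId` column and a binary `Survived` column, one row per test passenger); `check_submission` is a hypothetical helper and the `demo` frame is made up.

```python
import pandas as pd

def check_submission(df: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(df.columns) == ["PassengerId", "Survived"], "unexpected columns"
    assert len(df) == expected_rows, "unexpected number of rows"
    assert df["PassengerId"].is_unique, "duplicate passenger ids"
    assert set(df["Survived"].unique()) <= {0, 1}, "predictions must be 0 or 1"

demo = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
check_submission(demo, expected_rows=3)
print("submission looks well-formed")
```

For the real file, `expected_rows` would be the number of rows in test_titanic.csv (418 according to the `shape` output earlier).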

* Create and download "Submission_file.csv", submit it to Kaggle, and check your leaderboard position.
* **Note:** The data preprocessing steps suggested here are for learning purposes only and are not necessarily the best ones. Try other preprocessing approaches and compare the resulting accuracy.