<a href="https://colab.research.google.com/github/Aryan12Dubey/Titanic-Classification/blob/main/Titanic_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning Objectives


At the end of the mini-hackathon you will be able to:
* Perform data preprocessing
* Apply different ML algorithms on the **Titanic** dataset
* Build an ensemble with VotingClassifier
* Participate and submit predictions in the Kaggle competition

## Dataset Description

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of many passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

[ Data Set Link: Kaggle competition](https://www.kaggle.com/competitions/titanic)

<br/>

### Data Set Characteristics:

**PassengerId:** Id of the Passenger

**Survived:** Survived or Not information

**Pclass:** Socio-economic status (SES)
  * 1st = Upper
  * 2nd = Middle
  * 3rd = Lower

**Name:** Surname, First Names of the Passenger

**Sex:** Gender of the Passenger

**Age:** Age of the Passenger

**SibSp:**	No. of siblings/spouse of the passenger aboard the Titanic

**Parch:**	No. of parents/children of the passenger aboard the Titanic

**Ticket:**	Ticket number

**Fare:** Passenger fare

**Cabin:**	Cabin number

**Embarked:** Port of Embarkation
  * S = Southampton
  * C = Cherbourg
  * Q = Queenstown


## Problem Statement

Build a predictive model that answers the question: “what sort of people were more likely to survive?” using the Titanic passenger data (i.e., name, age, gender, socio-economic class, etc.).

In [None]:
# @title Download the datasets
from IPython import get_ipython

ipython = get_ipython()

notebook="U1_MH1_Data_Munging" #name of the notebook

def setup():
    # Fetch both CSV files into the working directory with wget
    # (run_line_magic replaces the deprecated ipython.magic call)
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/titanic.csv")
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/test_titanic.csv")
    print("Data downloaded successfully")

setup()

In [None]:
!ls

**Note:** Use **titanic.csv** for training and testing, and **test_titanic.csv** for submitting predictions to the Kaggle competition.

## Exercise 1 - Load and Explore the Data

* Understand different features in the training dataset
* Understand the data types of each column
* Identify the columns with missing values
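
The checks listed above can be sketched on a small toy frame. The values below are made up for illustration and `toy` is a hypothetical name; on the real data the same calls would be applied to the frame loaded from `titanic.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Fare": [7.25, 71.28, 7.92],
})

# Data type of each column
print(toy.dtypes)

# Number of missing values per column, keeping only columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```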




#### Import Required Packages

In [None]:
import pandas as pd
import numpy as np


In [None]:
Data = pd.read_csv('titanic.csv')
Data.head()

In [None]:
Data.info()

## Exercise 2 - Split the data into train and test sets
Note: Apply all your data preprocessing steps in the train set first and keep the test set aside.

In [None]:
label = Data['Survived']
feature = Data.drop(['Survived'],axis=1)

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(feature,label,test_size=0.2,random_state=42)

In [None]:
x_train.shape,y_train.shape

## Exercise 3 - Data Cleaning and Processing
### 3.1 Working on the "Cabin" column
Find the unique entries in the Cabin column. We can label all passengers in two categories: having a cabin or not. Check the data type (use: type) of each entry in the Cabin column. Convert string entries into 1 (passengers with a cabin) and all other entries into 0 (passengers without a cabin). Write a function for this operation, apply it to the Cabin column, and create another column named "Has_Cabin" containing only 0 or 1 entries.
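
As a possible vectorised alternative to looping over the entries, the flag can be derived from `notna()`. This is a minimal sketch on a toy Series; `cabin` and `has_cabin_flag` are hypothetical names, not part of the exercise.

```python
import numpy as np
import pandas as pd

cabin = pd.Series(["C85", np.nan, "E46", np.nan], name="Cabin")

# Missing cabins are stored as float NaN, recorded cabins as str
print(cabin.map(type).unique())

# notna() is True where a cabin is recorded; cast the booleans to 0/1
has_cabin_flag = cabin.notna().astype(int)
print(has_cabin_flag.tolist())
```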





In [None]:
x_train['Cabin'].unique()
x_train.dtypes
ss=x_train['Cabin']
ss.dtypes

In [None]:
x_test.head()

In [None]:
def has_cabin(data):
  Has_cabin = []
  for i in data:
    if str(i)=='nan':
      Has_cabin.append(0)
    else:
      Has_cabin.append(1)
  return Has_cabin


In [None]:
Has_Cabin=has_cabin(x_train['Cabin'])
x_train['Has_Cabin']=Has_Cabin
x_train.shape

### 3.2 Working on "SibSp" & "Parch" columns
Combine the "SibSp" and "Parch" columns into a new column named "Family_size" that represents the total number of passengers on one ticket. Each ticket may include siblings/spouses (SibSp = number of siblings/spouses aboard) or parents/children (Parch = number of parents/children aboard) along with the passenger who booked the ticket, so Family_size = SibSp + Parch + 1.



In [None]:
Family_size = x_train['SibSp']+x_train['Parch']+1
x_train['Family_size'] = Family_size

In [None]:
x_train.head()

### 3.3 Working on the "Embarked" column
The "Embarked" column represents the port of embarkation: Cherbourg (C), Queenstown (Q), and Southampton (S), so its entries fall into three categories. Fill in the missing rows in this column with the most frequent category, then map the categorical string entries to numerical values.
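
One way to sketch this step on a toy Series, filling only this column with its mode before mapping. The names `embarked`, `most_frequent`, and `codes` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

embarked = pd.Series(["S", "C", np.nan, "S", "Q"], name="Embarked")

# Most frequent category among the non-missing entries
most_frequent = embarked.mode()[0]

# Fill only this column, then map the strings to integer codes
codes = embarked.fillna(most_frequent).map({"S": 0, "Q": 1, "C": 2})
print(codes.tolist())
```

Filling a single column, rather than calling `fillna` on the whole frame, leaves the missing values in other columns such as Age available for their own imputation step.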



In [None]:
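x_train['Embarked'] = x_train['Embarked'].map({'C':2,'Q':1,'S':0})


In [None]:
# Fill only the Embarked column with its most frequent category;
# a frame-wide fillna(2) would also overwrite the missing Age and Cabin values
x_train['Embarked'] = x_train['Embarked'].fillna(x_train['Embarked'].mode()[0]).astype(int)
x_train['Embarked'].isna().sum()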

In [None]:
x_train.isna().sum()

### 3.4 Working on the "Age" column
Find the number of NaN entries in the Age column and their row indices. Calculate the mean and standard deviation of the Age column and check its distribution. We can fill the missing values with randomly generated integers between (mean - standard deviation) and (mean + standard deviation). Use: np.isnan, np.random.randint, and DataFrame slicing. Convert the Age column to an integer data type.
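
A minimal sketch of this imputation on a toy Series. It uses a seeded `np.random.default_rng` generator in place of `np.random.randint` so the result is repeatable; that substitution, and the names `age`, `mask`, `low`, and `high`, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable
age = pd.Series([22.0, np.nan, 38.0, 30.0, np.nan, 54.0], name="Age")

mask = age.isna()
low = int(age.mean() - age.std())   # lower bound, truncated to an integer
high = int(age.mean() + age.std())  # upper bound (exclusive)

# One random integer in [low, high) per missing entry, assigned by boolean mask
age.loc[mask] = rng.integers(low, high, size=mask.sum())
age = age.astype(int)
print(age.tolist())
```

Assigning through the boolean mask places the values by position among the missing rows. Passing a freshly built Series to `fillna` would align on index labels instead and could leave most gaps unfilled.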



In [None]:
x_train.loc[pd.isna(x_train["Age"]), :].index

In [None]:
mean_of_age = x_train['Age'].mean()
STD_of_age = x_train['Age'].std()
size = x_train['Age'].isna().sum()  # number of missing ages in the train set

In [None]:
# Draw one random integer age per missing entry from [mean - std, mean + std)
out_arr = np.random.randint(low=int(mean_of_age - STD_of_age), high=int(mean_of_age + STD_of_age), size=size)
out_arr

In [None]:
# Assign by boolean mask; fillna(pd.Series(out_arr)) would align on index labels and miss most rows
x_train.loc[x_train['Age'].isna(), 'Age'] = out_arr

In [None]:
x_train['Age'] = x_train['Age'].astype(int)
x_train['Age'].isna().sum()

### 3.5 Working on the "Sex" column
Map the Sex column as 'female' : 0, 'male': 1, and convert it into an integer data type.



In [None]:
x_train['Sex'] = x_train['Sex'].map({'male':1,'female':0})

### 3.8 Drop the columns

Drop the columns: "PassengerId", "Name", "SibSp", "Parch", "Ticket", "Cabin"

Now apply different ML algorithms and check the accuracy of your model.



In [None]:
x_train = x_train.drop(["PassengerId", "Name", "SibSp","Parch", "Ticket", "Cabin"],axis=1)

### 3.9 Apply StandardScaler

In [None]:
from sklearn.preprocessing import StandardScaler
STD = StandardScaler()
# fit_transform learns the mean and standard deviation and scales in one step
x_train = STD.fit_transform(x_train)

### 3.10 Create a single function for preprocessing the test set (X_test) and apply it.
#### **Note**: All the pre-processing steps that were applied on the train set before ML Modelling are also applied on the test set before passing through the predict function.
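
For the scaling step in particular, a common way to honour this note is to fit the scaler on the train set only and reuse it on the test set. The sketch below uses made-up arrays; `train`, `test`, and `scaler` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the train statistics
print(test_scaled)
```

Refitting a second scaler on the test set would put the two sets on different scales, so a model trained on one would see shifted inputs on the other.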

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(feature,label,test_size=0.33,random_state=42)

In [None]:
## Create a function


def has_cabin1(data):
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()
    # 1 if a cabin is recorded, 0 if the entry is missing
    data['Has_Cabin'] = data['Cabin'].notna().astype(int)
    return Family(data)

def Family(data_f):
    # SibSp + Parch + 1 counts the passenger as well, matching section 3.2
    data_f['Family_size'] = data_f['SibSp'] + data_f['Parch'] + 1
    return Embark(data_f)

def Embark(data_e):
    data_e['Embarked'] = data_e['Embarked'].map({'C':2,'Q':1,'S':0})
    # Fill only the Embarked column, using its most frequent category
    data_e['Embarked'] = data_e['Embarked'].fillna(data_e['Embarked'].mode()[0]).astype(int)
    return Age(data_e)

def Age(data_a):
    mean_of_age = data_a['Age'].mean()
    STD_of_age = data_a['Age'].std()
    size1 = data_a['Age'].isna().sum()
    # Random integer ages drawn from [mean - std, mean + std)
    out_arr = np.random.randint(low=int(mean_of_age - STD_of_age), high=int(mean_of_age + STD_of_age), size=size1)
    data_a.loc[data_a['Age'].isna(), 'Age'] = out_arr
    data_a['Age'] = data_a['Age'].astype(int)
    return Sex(data_a)

def Sex(data_s):
    data_s['Sex'] = data_s['Sex'].map({'male':1,'female':0})
    data_s = data_s.drop(["PassengerId", "Name", "SibSp","Parch", "Ticket", "Cabin"],axis=1)
    return data_s


In [None]:
## Applying the above functions
from sklearn.preprocessing import StandardScaler
x_train = has_cabin1(x_train)
x_test = has_cabin1(x_test)
# Fit the scaler on the train set only, then reuse it for the test set
STD = StandardScaler()
x_train = STD.fit_transform(x_train)
x_test = STD.transform(x_test)
x_train.shape,y_train.shape

## Exercise 4 - Apply multiple ML algorithms along with an ensemble technique (VotingClassifier) and display the accuracy
#### Expected Accuracy >= 80%
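
To illustrate what hard voting does, the sketch below trains three classifiers on synthetic data and checks that the ensemble prediction equals the majority of the individual predictions. The synthetic data from `make_classification` and all variable names are assumptions for illustration, not the Titanic features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data (labels are 0 and 1)
X, y = make_classification(n_samples=300, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(random_state=0)),
        ("svc", SVC()),
    ],
    voting="hard",
)
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)

# With three voters and two classes, the majority is "at least two votes for 1"
votes = np.array([est.predict(X_te) for est in ensemble.estimators_])
manual = (votes.sum(axis=0) >= 2).astype(int)
print(acc, (manual == ensemble.predict(X_te)).all())
```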


In [None]:
from sklearn.linear_model import LogisticRegression
LR=LogisticRegression()
LR.fit(x_train,y_train)
LR.score(x_test,y_test)

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=8, random_state=0)
clf.fit(x_train, y_train)
clf.score(x_test,y_test)

In [None]:
from sklearn.tree import DecisionTreeClassifier
DTC = DecisionTreeClassifier(random_state=0, max_depth=6)
DTC.fit(x_train,y_train)
DTC.score(x_test,y_test)

In [None]:
from sklearn.ensemble import BaggingClassifier
BG = BaggingClassifier()
BG.fit(x_train,y_train)
BG.score(x_test,y_test)

In [None]:
from sklearn.svm import SVC
clf1 = SVC()
clf1.fit(x_train, y_train)
clf1.score(x_test,y_test)

In [None]:
from sklearn.ensemble import VotingClassifier

voting = VotingClassifier(estimators=[ ('SVC', clf1),('LR', LR), ('RF', clf)],voting='hard')

voting.fit(x_train, y_train)
voting.score(x_test,y_test)

## Exercise 5 - Pre-process the test set for Kaggle submission
Again, we have to apply the same preprocessing function and standard scaler to this test set before passing it to the predict function.

#### Understanding the test set:

In [None]:
kaggle_data = pd.read_csv('test_titanic.csv')
PassengerId=kaggle_data['PassengerId']
kaggle_data


#### Note: In the initial train set there were no missing entries in the "Fare" column. But, now for the submission test set, there is one missing entry in this column.

#### There will be a minor change in the preprocess function to address the above issue.

In [None]:

def has_cabin2(data):
    # Work on a copy so kaggle_data itself is left untouched
    data = data.copy()
    # 1 if a cabin is recorded, 0 if the entry is missing
    data['Has_Cabin'] = data['Cabin'].notna().astype(int)
    return Family(data)

def Family(data_f):
    # SibSp + Parch + 1 counts the passenger as well, matching section 3.2
    data_f['Family_size'] = data_f['SibSp'] + data_f['Parch'] + 1
    return Embark(data_f)

def Embark(data_e):
    data_e['Embarked'] = data_e['Embarked'].map({'C':2,'Q':1,'S':0})
    # Fill only the Embarked column, so the missing Fare is left for the next cell
    data_e['Embarked'] = data_e['Embarked'].fillna(data_e['Embarked'].mode()[0]).astype(int)
    return Age(data_e)

def Age(data_a):
    mean_of_age = data_a['Age'].mean()
    STD_of_age = data_a['Age'].std()
    size1 = data_a['Age'].isna().sum()
    # Random integer ages drawn from [mean - std, mean + std)
    out_arr = np.random.randint(low=int(mean_of_age - STD_of_age), high=int(mean_of_age + STD_of_age), size=size1)
    data_a.loc[data_a['Age'].isna(), 'Age'] = out_arr
    data_a['Age'] = data_a['Age'].astype(int)
    return Sex(data_a)

def Sex(data_s):
    data_s['Sex'] = data_s['Sex'].map({'male':1,'female':0})
    data_s = data_s.drop(["PassengerId", "Name", "SibSp","Parch", "Ticket", "Cabin"],axis=1)
    return data_s

In [None]:
kaggle = has_cabin2(kaggle_data)
# Fill the single missing fare with the column mean
kaggle['Fare']=kaggle['Fare'].fillna(value=kaggle['Fare'].mean())


In [None]:
kaggle.isna().sum()

In [None]:
# Reuse the scaler already fitted on the train set (STD) instead of refitting on the submission data
kaggle = STD.transform(kaggle)


## Exercise 6 - Prediction on the test data for submission

In [None]:
prediction = voting.predict(kaggle)
prediction

## Exercise 7 - Saving the CSV file for submission
Create a CSV file whose first column is "PassengerId" from test_titanic.csv and whose second column, "Survived", stores the predictions for that file.

In [None]:
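Prediction = pd.DataFrame(prediction,columns=['Survived'])
Final_result =pd.concat([PassengerId,Prediction],axis=1)
# Write the submission file without the index column ('submission.csv' is an example filename)
Final_result.to_csv('submission.csv', index=False)
Final_result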