# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint


## Learning Objectives


At the end of the mini-hackathon you will be able to:
* Perform data preprocessing
* Apply different ML algorithms on the **Titanic** dataset
* Build an ensemble with `VotingClassifier`


## Dataset Description

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of many passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

[Data Set Link: Kaggle competition](https://www.kaggle.com/competitions/titanic)

<br/>

### Data Set Characteristics:

**PassengerId:** Id of the Passenger

**Survived:** Whether the passenger survived (0 = No, 1 = Yes)

**Pclass:** Socio-economic status (SES)
  * 1st = Upper
  * 2nd = Middle
  * 3rd = Lower

**Name:** Surname, First Names of the Passenger

**Sex:** Gender of the Passenger

**Age:** Age of the Passenger

**SibSp:** No. of siblings/spouses of the passenger aboard the Titanic

**Parch:** No. of parents/children of the passenger aboard the Titanic

**Ticket:** Ticket number

**Fare:** Passenger fare

**Cabin:** Cabin number

**Embarked:** Port of Embarkation
  * S = Southampton
  * C = Cherbourg
  * Q = Queenstown


## Problem Statement

Build a predictive model that answers the question: “what sort of people were more likely to survive?” using the Titanic's passenger data (i.e., name, age, gender, socio-economic class, etc.).

In [None]:
# @title Download the datasets
from IPython import get_ipython

ipython = get_ipython()

notebook = "U1_MH1_Data_Munging"  # name of the notebook

def setup():
    # Download the training and submission CSV files with wget.
    # run_line_magic replaces the deprecated ipython.magic(...) call.
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/titanic.csv")
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/test_titanic.csv")
    print("Data downloaded successfully")

setup()

In [None]:
!ls

## Exercise 1 - Load and Explore the Data (2 Marks)

* Understand different features in the training dataset
* Understand the data types of each column
* Note which columns have missing values
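
As a quick way to see which columns have missing values, the sketch below counts NaN entries per column. The small `toy` frame and its values are made up for illustration and only stand in for the Titanic training data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic training data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count missing entries per column and keep only columns that have any
missing_counts = toy.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

The same two lines can be run on `dataset` to cross-check the non-null counts reported by `dataset.info()`.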




#### Import Required Packages

In [None]:
# Import libraries
import pandas as pd
import numpy as np
# Load the given dataset
dataset = pd.read_csv('/content/titanic.csv')

In [None]:
# Getting information about the dataset
dataset.info()

#NOTE: In the output check the non-null count information. Age - 714/891; Cabin - 204/891; Embarked - 889/891

In [None]:
dataset.head()

## Exercise 02: Split the data into train and test sets (1 Mark)
Note: Apply all your data preprocessing steps in the train set first and keep the test set aside.
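
The sketch below shows one possible way to make the split repeatable and keep the class balance similar in both parts. The toy data, `test_size=0.25`, `random_state=42`, and the use of `stratify` are assumptions for illustration, not requirements of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up features and a binary target with a 60/40 class balance
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([0] * 60 + [1] * 40)

# random_state makes the split repeatable; stratify keeps the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape, y_te.mean())
```

With stratification, the share of positive labels in the test part matches the share in the full data.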

In [None]:
# Load Target - Survived
y = dataset.Survived

#Load Features
X = dataset.drop(columns = ['Survived'])

# check shape of X & y
X.shape, y.shape, type(X), type(y)


In [None]:
from sklearn.model_selection import train_test_split

# Split into train and test sets; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# X_train and X_test are already DataFrames; copy them so that later
# column assignments do not act on slices of X
X_train_df = X_train.copy()
X_test_df = X_test.copy()

X_train.shape, X_test.shape, y_train.shape, y_test.shape

## Exercise 03: Data Cleaning and Processing (15 Marks)
### 3.1 Working on the "Cabin" column (2 Marks)
Find the unique entries in the Cabin column. Every passenger falls into one of two categories: has a cabin or does not. Check the data type (use: `type`) of each Cabin entry: a recorded cabin is a string, while a missing one is NaN. Map string entries to 1 (passenger with a cabin) and all other entries to 0 (passenger without a cabin). Write a function for this operation, apply it to the Cabin column, and store the result in a new column named "Has_cabin" that contains only 0 or 1.
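
A minimal sketch of the type-based approach described above, using `apply` on a made-up Cabin column. The helper name `has_cabin` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def has_cabin(entry):
    # A recorded cabin is a string; a missing cabin is NaN (a float)
    return 1 if isinstance(entry, str) else 0

toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})
toy["Has_cabin"] = toy["Cabin"].apply(has_cabin)
print(toy)
```

A vectorized alternative is `toy["Cabin"].notnull().astype(int)`, which gives the same result without a Python-level loop.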





In [None]:
# Create the "Has_cabin" column: 1 for passengers with a cabin, 0 for passengers without one.
def ConvertStrToNum(ds, col_name):
  ds['Has_cabin'] = ds[col_name].notnull().astype(int)

In [None]:
print(X_train_df.shape[0])
# info() prints its summary directly and returns None, so it is not wrapped in print()
X_train_df['Cabin'].info()

In [None]:
# Create the Has_cabin column. The function is vectorized,
# so one call covers every row and no loop is needed.
ConvertStrToNum(X_train_df, 'Cabin')

In [None]:
# Check that the changes were applied
X_train_df['Has_cabin'].info()
print(X_train_df['Has_cabin'].sum())

In [None]:
X_train_df.head()

### 3.2 Working on "SibSp" & "Parch" columns (1 Mark)
Combine the "SibSp" and "Parch" columns into a new column named "family_size" that represents the size of the family group travelling with each passenger. A passenger may be accompanied by siblings/spouses (SibSp = number of siblings/spouses aboard) or parents/children (Parch = number of parents/children aboard).

  

In [None]:
# Create 'family_size' by adding SibSp and Parch (relatives aboard, not counting the passenger)
X_train_df['family_size'] = X_train_df['SibSp'] + X_train_df['Parch']

In [None]:
X_train_df.head()

### 3.3 Working on the "Embarked" column (2 Marks)
The "Embarked" column records the port of embarkation: Cherbourg (C), Queenstown (Q), or Southampton (S), so its entries fall into three categories. Fill the missing rows in this column with the most frequent category, then map the categorical string entries to numbers.
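
A small sketch of both steps on made-up data: fill with the most frequent value, then map to integers. The `port_codes` dictionary is a hypothetical coding chosen for illustration; any fixed mapping works as long as the same one is reused for every data set.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# Most frequent port among the non-missing entries
most_frequent = toy["Embarked"].mode()[0]

# Assign the result back instead of relying on inplace=True on a column
toy["Embarked"] = toy["Embarked"].fillna(most_frequent)

# Hypothetical integer codes for the three ports
port_codes = {"S": 0, "C": 1, "Q": 2}
toy["Embarked"] = toy["Embarked"].map(port_codes).astype(int)
print(toy["Embarked"].tolist())
```

An explicit dictionary keeps the codes stable across the train set, the test split, and the submission set.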



In [None]:
value_counts = X_train_df['Embarked'].value_counts()
value_counts

In [None]:
max_string = value_counts.idxmax()
max_string

In [None]:
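# Assign the filled column back; calling fillna(..., inplace=True) on a selected
# column is deprecated in recent pandas versions and may not update the DataFrame
X_train_df['Embarked'] = X_train_df['Embarked'].fillna(max_string)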

In [None]:
X_train_df['Embarked'].info()
X_train_df.shape[0]

### 3.4 Working on the "Age" column (2 Marks)
Find the number of NaN entries in the Age column and their row indexes. Calculate the mean and standard deviation of the Age column and check its distribution. Fill the missing values with randomly generated integers between (mean - standard deviation) and (mean + standard deviation). Use: np.isnan; np.random.randint; DataFrame slicing. Finally, convert the Age column to an integer data type.
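
The sketch below illustrates the same idea in vectorized form on made-up ages. It uses a seeded NumPy `Generator` (`rng.integers`) in place of `np.random.randint` so the fill values are repeatable; the seed and toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the fill values are repeatable

toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan, 54.0]})

mask = toy["Age"].isna()
mean_age = toy["Age"].mean()   # NaN entries are skipped
std_age = toy["Age"].std()
low, high = int(mean_age - std_age), int(mean_age + std_age)

# rng.integers excludes the upper bound by default, hence high + 1
toy.loc[mask, "Age"] = rng.integers(low, high + 1, size=mask.sum())
toy["Age"] = toy["Age"].astype(int)
print(low, high, toy["Age"].tolist())
```

Filling all missing rows in one assignment avoids a Python loop over the index list.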



In [None]:
# Hints: np.isnan; np.random.randint; slicing a DataFrame
Age_na_mask = X_train_df['Age'].isna()

# Count the number of NaN values
nan_count = Age_na_mask.sum()

# Capture the indexes where the value is NaN
nan_index_list = X_train_df.index[Age_na_mask].tolist()

In [None]:
print("NAN Count=",nan_count)
print("index list", nan_index_list)

In [None]:
# Calculate the mean of the column (NaN entries are skipped)
mean_value = X_train_df['Age'].mean()

# Calculate the standard deviation of the column
std_value = X_train_df['Age'].std()

min_value = int(mean_value - std_value)
max_value = int(mean_value + std_value)

# Fill each missing age with a random integer in [min_value, max_value);
# np.random.randint excludes the upper bound
for index in nan_index_list:
  X_train_df.at[index, 'Age'] = np.random.randint(min_value, max_value)

# Convert column to int
X_train_df['Age'] = X_train_df['Age'].astype(int)

In [None]:
print(mean_value)
print(std_value)
print(min_value)
print(max_value)
#X_train_df['Age'].isnull().sum()
X_train_df['Age'].head()

### 3.5 Working on the "Sex" column (1 Mark)
Map the Sex column as 'female': 0, 'male': 1, and convert it to an integer data type.



In [None]:
# Create the mapping dictionary
mapping_dict = {'female': 0, 'male': 1}

# Map string to integer
X_train_df['Sex'] = X_train_df['Sex'].map(mapping_dict)

# Convert it to int type
X_train_df['Sex'] = X_train_df['Sex'].astype(int)

In [None]:
X_train_df['Sex'].head()

### 3.6 Optional - Working on the "Name" column
Extract the titles (Mr, Mrs, Miss, and so on) from the Name column, map them to numbers, and convert the result to an integer data type. Use: regular expressions.
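
A possible sketch of the title extraction, assuming names follow the "Surname, Title. First names" pattern described in the data set characteristics. The regular expression and the `title_codes` mapping are illustrative choices, not part of the exercise.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rice, Master. Eugene",
    "Uruchurtu, Don. Manuel E",
])

# Title = the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Hypothetical coding: common titles get their own code, the rest share 0
title_codes = {"Mr": 1, "Mrs": 2, "Miss": 3, "Master": 4}
title_num = titles.map(title_codes).fillna(0).astype(int)
print(titles.tolist(), title_num.tolist())
```

Grouping rare titles under one code keeps the feature from having many categories with only a handful of passengers each.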

### 3.7 Optional - Working on the "Fare" column
We can convert Fare into categorical entries such as Low, Medium, and High.
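
One way to sketch this binning is `pd.cut` with explicit edges. The cut points 10 and 50 and the toy fares below are hypothetical values chosen for illustration; on the real data the edges would need to be chosen from the Fare distribution.

```python
import pandas as pd

fares = pd.Series([7.25, 8.05, 13.0, 26.0, 31.0, 71.28, 512.33])

# Explicit edges (hypothetical cut points chosen for illustration)
edges = [0, 10, 50, float("inf")]
labels = ["Low", "Medium", "High"]
fare_band = pd.cut(fares, bins=edges, labels=labels, include_lowest=True)
print(fare_band.tolist())
```

Explicit or quantile-based edges (`pd.qcut`) may suit Fare better than equal-width bins, since a few very high fares can stretch equal-width bins so that most passengers land in the lowest one.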



In [None]:
# Split the Fare range into 3 equal-width bins; retbins=True also returns the bin edges
fare_bins, bin_boundaries = pd.cut(dataset['Fare'], bins=3, retbins=True)

print(bin_boundaries)

# Summary statistics of Fare (Series has mean(), not average())
print(dataset['Fare'].max())
print(dataset['Fare'].min())
print(dataset['Fare'].mean())
# Required labels
#Labels = ['Low', 'Medium', 'High']

#dataset['Fare'] = pd.cut(dataset['Fare'], bins = bin_boundaries, labels = Labels, include_lowest=True)
#dataset['Fare'].head()





### 3.8 Drop the columns (1 Mark)

Drop the columns: "PassengerId", "Name", "SibSp", "Parch", "Ticket", and "Cabin".

Now apply different ML algorithms and check the accuracy of your model.



In [None]:
#X_train_data = pd.DataFrame(X_train)
#X_train_data

In [None]:
df_X_train = X_train_df.drop(columns = ['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'],inplace = False)
df_X_train.info()
#df_X_train

### 3.9 Apply StandardScaler (1 Mark)

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
#for column in df_X_train.columns:
  #df_X_train[column] = le.fit_transform(df_X_train[column])

df_X_train['Embarked'] = le.fit_transform(df_X_train['Embarked'])

df_X_train.head()
#df_X_train.info()

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df_X_train)
X_training = scaler.transform(df_X_train)
X_training.shape, y_train.shape

In [None]:
X_training

### 3.10 Create a single function for preprocessing the test set (X_test) and apply it. (4 Marks)
#### **Note**: All the pre-processing steps that were applied on the train set before ML Modelling are also applied on the test set before passing through the predict function.
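
For scaling in particular, "the same steps" usually means reusing the statistics learned from the train set rather than learning new ones from the test set. A minimal sketch on made-up numbers (the name `scaler_demo` is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns mean and std from train
test_scaled = scaler_demo.transform(test)        # reuses those statistics
print(test_scaled.ravel())
```

Because the test rows are scaled with the training mean and standard deviation, a value equal to the training mean maps to 0 in both sets.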

In [None]:
def Data_Pre_Processing(X_test):
  # Work on a copy so the caller's DataFrame is left unchanged
  X_test = X_test.copy()

  ############## Create Has_cabin Column ##############
  # The helper is vectorized, so a single call covers every row
  ConvertStrToNum(X_test, 'Cabin')

  ############## Create family_size Column ##############
  X_test['family_size'] = X_test['SibSp'] + X_test['Parch']

  ############## Process Embarked Column ##############
  # Fill missing ports with the most frequent category
  max_string = X_test['Embarked'].value_counts().idxmax()
  X_test['Embarked'] = X_test['Embarked'].fillna(max_string)

  ############## Process Age Column ##############
  Age_na_mask = X_test['Age'].isna()
  # Capture the indexes where the value is NaN
  nan_index_list = X_test.index[Age_na_mask].tolist()
  # Mean and standard deviation of the known ages
  mean_value = X_test['Age'].mean()
  std_value = X_test['Age'].std()

  min_value = int(mean_value - std_value)
  max_value = int(mean_value + std_value)

  # Fill each missing age with a random integer in [min_value, max_value)
  for index in nan_index_list:
    X_test.at[index, 'Age'] = np.random.randint(min_value, max_value)

  # Convert column to int
  X_test['Age'] = X_test['Age'].astype(int)

  ############## Map Sex Column ##############
  mapping_dict = {'female': 0, 'male': 1}
  X_test['Sex'] = X_test['Sex'].map(mapping_dict).astype(int)

  return X_test

In [None]:
X_test_col_list = X_test.columns.tolist()
X_test_df = pd.DataFrame(X_test, columns=X_test_col_list)
X_test_df = Data_Pre_Processing(X_test_df)
X_test_df.info()

In [None]:
X_test_df.head()

### 3.11 Apply the StandardScaler transformation to X_test (1 Mark)

In [None]:
#X_test_data = pd.DataFrame(X_test)

df_X_test = X_test_df.drop(columns = ['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'],inplace = False)
df_X_test.info()

In [None]:
# Reuse the encoder fitted on the training set so each port keeps the same integer code
df_X_test['Embarked'] = le.transform(df_X_test['Embarked'])
df_X_test.head()
df_X_test.info()



In [None]:
# Reuse the scaler already fitted on the training features in section 3.9;
# fitting a new scaler on the test split would scale the two sets differently
df_X_test = scaler.transform(df_X_test)
df_X_test.shape, y_test.shape

## Exercise 4. Apply multiple ML algorithms along with an ensemble technique (VotingClassifier) and display the accuracy (7 Marks)
#### Expected Accuracy >= 80%  
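
The sketch below shows hard voting on synthetic data, with accuracy measured on rows held out from fitting. The data from `make_classification`, the choice of three estimators, and all names ending in `_demo` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed Titanic features
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

vc_demo = VotingClassifier(
    estimators=[
        ("lg", LogisticRegression(max_iter=300)),
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
    ],
    voting="hard",  # each estimator casts one vote; the majority class wins
)
vc_demo.fit(X_tr, y_tr)

# Score on rows the ensemble did not see during fitting
held_out_accuracy = accuracy_score(y_te, vc_demo.predict(X_te))
print(round(held_out_accuracy, 3))
```

Accuracy on the training rows tends to be optimistic, so the held-out figure is the one to compare against the expected accuracy.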


In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

In [None]:
sv = SVC(kernel='linear')
dt = DecisionTreeClassifier(max_depth=5)
lg = LogisticRegression(max_iter=300)
knn = KNeighborsClassifier(n_neighbors=3)
vc = VotingClassifier(estimators=[('sv',sv),('dt',dt),('lg',lg),('knn',knn)],voting='hard')

In [None]:
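# Fit on the scaled training features rather than the unscaled DataFrame
vc.fit(X_training, y_train)
pred_final = vc.predict(X_training)
print("Train accuracy:", accuracy_score(y_train, pred_final))
# Held-out accuracy on the split scaled in section 3.11 is the figure to compare with the 80% target
print("Test accuracy:", accuracy_score(y_test, vc.predict(df_X_test)))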

In [None]:
#df_X_train.head()
y_train.head()


In [None]:
pred_final

## Exercise 5. Pre-process the test_set (3 Marks)
Apply the same preprocessing function and standard scaler to this test set before passing it to the predict function.

#### Understanding the test set:

In [None]:
dataset_test = pd.read_csv('/content/test_titanic.csv')

In [None]:
#print(dataset_test.info())
#print(dataset_test.head())
dataset_test.isna().sum()

#### Note: In the training set there were no missing entries in the "Fare" column, but the submission test set has one missing entry in this column.

#### The preprocessing therefore needs one extra step to fill this missing value.
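
One option for this extra step is to fill the gap with a statistic taken from the training fares. The sketch uses the median on made-up numbers; the choice of median and the variable names are assumptions, and the mean would work the same way.

```python
import numpy as np
import pandas as pd

train_fares = pd.Series([7.25, 8.05, 13.0, 26.0, 71.28])
new_fares = pd.Series([7.90, np.nan, 30.0])

# The statistic comes from the training data and is then applied to the new data
fare_fill_value = train_fares.median()
new_fares = new_fares.fillna(fare_fill_value)
print(new_fares.tolist())
```

Taking the fill value from the training data keeps the submission set from influencing its own preprocessing.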

In [None]:
X_df = Data_Pre_Processing(dataset_test)
X_df.info()

In [None]:
X_df.isna().sum()

In [None]:
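# Assign the filled column back; fillna(..., inplace=True) on a selected column is deprecated in recent pandas
X_df['Fare'] = X_df['Fare'].fillna(X_df['Fare'].mean())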

In [None]:
X_df.isna().sum()

## Exercise 6. Prediction for test data (2 Marks)

In [None]:
dataset_test_file = X_df.drop(columns = ['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin'],inplace = False)
dataset_test_file.info()



In [None]:
dataset_test_file.head()

In [None]:
# Reuse the encoder fitted earlier so each port keeps the same integer code
dataset_test_file['Embarked'] = le.transform(dataset_test_file['Embarked'])

dataset_test_file.info()

In [None]:
# Reuse the scaler fitted earlier instead of fitting a new one on the submission set
dataset_test_file_1 = scaler.transform(dataset_test_file)
dataset_test_file_1.shape

# Keep a DataFrame copy of the scaled features for inspection
col_list_final = dataset_test_file.columns.tolist()
df_final = pd.DataFrame(dataset_test_file_1, columns=col_list_final)

# Predict survival for every passenger in the submission set
pred_test_file = vc.predict(dataset_test_file_1)
pred_test_file

In [None]:
df_final

In [None]:
# Predict for the first passenger only, without overwriting the full predictions
pred_first_row = vc.predict(dataset_test_file_1[:1])
pred_first_row