# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint
### Not for Grading

## Learning Objectives


At the end of the experiment you will be able to:
* apply different learning algorithms to the **Titanic** dataset
* combine them into an ensemble with scikit-learn's `VotingClassifier`

## Dataset Description

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of many passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Build a predictive model that answers the question: “What sort of people were more likely to survive?” using the Titanic's passenger data (i.e., name, age, gender, socio-economic class, etc.).

<br/>

### Data Set Characteristics:

**PassengerId:** Id of the Passenger

**Survived:** Whether the passenger survived (0 = No, 1 = Yes)

**Pclass:** Ticket class, used as a proxy for socio-economic status (SES)
  * 1st = Upper
  * 2nd = Middle
  * 3rd = Lower

**Name:** Surname, First Names of the Passenger

**Sex:** Gender of the Passenger

**Age:** Age of the Passenger

**SibSp:** Number of siblings / spouses of the passenger aboard the Titanic

**Parch:** Number of parents / children of the passenger aboard the Titanic

**Ticket:** Ticket number

**Fare:** Passenger fare

**Cabin:** Cabin number

**Embarked:** Port of Embarkation
  * S = Southampton
  * C = Cherbourg
  * Q = Queenstown


In [None]:
!wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/titanic.csv

## Import Required Packages

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

### Exercise 01: Load the data

* Understand different features in the training dataset

* Drop the unwanted columns, which are unlikely to contribute to the prediction of Survived
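
As a rough sketch of how the features might be inspected before deciding what to drop, the snippet below uses a small made-up frame (the `toy` values are assumptions for illustration, not rows from `titanic.csv`). Column dtypes and distinct-value counts are one way to spot identifier-like or free-text columns.

```python
import pandas as pd

# Toy frame with a few Titanic-style columns (values are made up for illustration)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, 38.0, None, 35.0],
})

# dtypes show which columns are numeric and which hold text
print(toy.dtypes)

# Distinct values per column (NaN is not counted); a column that is
# unique in every row behaves like an identifier rather than a feature
n_unique = toy.nunique()
print(n_unique)

# Drop a column by name; columns=... is equivalent to passing axis=1
reduced = toy.drop(columns=["Name"])
print(reduced.columns.tolist())
```

The same calls can be run on `data` once it is loaded.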


In [None]:
# Load the dataset
data = pd.read_csv("titanic.csv")

In [None]:
# Print first ten records
data.head(10)

In [None]:
# Drop the unwanted columns, specifying axis=1
data = data.drop(["Name","Ticket","Fare","Cabin","SibSp","Parch"], axis=1)
data.head()

### Exercise 02: Data Cleaning 

* Find the number of missing values in each column

    * Hint: pd.isnull( )
* Fill in the missing values in the "Age" and "Embarked" columns of the above data set

    * For Age, replace the missing values with mean
    * For Embarked, replace the missing values with mode
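
The imputation steps above can be sketched on a small made-up frame (the `toy` values are assumptions for illustration). This version looks up the mode programmatically with `mode()[0]` instead of typing the value by hand, which is one possible alternative rather than the approach this notebook takes.

```python
import pandas as pd

# Toy frame with one missing value in each column (made-up values)
toy = pd.DataFrame({
    "Age": [20.0, None, 40.0, 30.0],
    "Embarked": ["S", "C", None, "S"],
})

# Missing values per column, computed in one vectorised call
missing_before = toy.isnull().sum()
print(missing_before)

# mean() skips NaN by default, so this averages the three known ages
age_mean = toy["Age"].mean()
toy["Age"] = toy["Age"].fillna(age_mean)

# mode() returns a Series because ties are possible; take the first entry
embarked_mode = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)

missing_after = toy.isnull().sum()
print(missing_after)
```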

In [None]:
# Count the missing values in each column
features = list(data.columns.values)

for feat in features:
    print(feat, ": ", sum(pd.isnull(data[feat])))

In [None]:
# Calculate the mean of Age column
data["Age"].mean()

In [None]:
data["Age"] = data["Age"].fillna(data["Age"].mean())
data["Age"]

In [None]:
# Calculate the mode of the Embarked column
data["Embarked"].mode()

In [None]:
# Fill the missing values; note the lowercase "s", which Exercise 03 converts to "S"
data["Embarked"] = data["Embarked"].fillna("s")
data["Embarked"]

In [None]:
sum(pd.isnull(data["Embarked"]))

### Exercise 03: Convert categorical values to numerical 

*  The Sex feature contains the categorical values male and female. Replace them with 1 and 2

*  The Embarked feature contains both `S` and `s`. Replace `s` with `S`

*  The Embarked feature contains the categorical values S, C and Q. Replace them with 1, 2 and 3
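
A frame-wide `replace` touches every column that happens to contain the value. As a hedged alternative, the sketch below maps each column separately with `Series.map` and a dictionary. The `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy frame with mixed-case Embarked values (made-up values)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "s", "Q"],
})

# Map one column at a time so other columns are never touched
toy["Sex"] = toy["Sex"].map({"male": 1, "female": 2})

# Normalise case first, then map the port codes to integers
toy["Embarked"] = toy["Embarked"].str.upper().map({"S": 1, "C": 2, "Q": 3})

print(toy)
```

Any value missing from the dictionary becomes `NaN` under `map`, so it is worth checking `isnull().sum()` afterwards.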

In [None]:
# Convert categorical values to numerical using replace
data = data.replace('male', 1)
data = data.replace('female', 2)

In [None]:
data["Embarked"].unique()

In [None]:
data = data.replace('s','S')

In [None]:
data = data.replace(['S','C','Q'],[1,2,3])
data

### Exercise 04: Use Survived as the labels and the remaining columns as the features

* Find the shape of the features and labels


In [None]:
X1 = data.drop('Survived',axis=1) # Features
y1 = data['Survived']  # Labels

In [None]:
print(X1.head())

In [None]:
print(X1.shape, y1.shape)

### Exercise 05:  Split the data into train and test sets 
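
`train_test_split` shuffles randomly, so the reported accuracy can change from run to run. A minimal sketch of a reproducible, class-balanced split is shown below on synthetic arrays. The arrays `X` and `y` and the seed `42` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features, 60/40 class balance (made up)
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify=y keeps the class ratio
# the same in the train and test portions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)
print("Positive share in test set:", y_te.mean())
```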




In [None]:
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.1)

In [None]:
print(X_train.shape,X_test.shape, y_train.shape, y_test.shape)

### Exercise 06: Perform Ensemble Technique 


* Create LogisticRegression, DecisionTreeClassifier and SVC objects

* Combine LogisticRegression, DecisionTreeClassifier and SVC in a VotingClassifier

* Fit the model with X_train and y_train

* Predict the labels for X_test

* Find the accuracy using sklearn metrics for actual response values (y_test) and predicted response values (y_pred)

Hint: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
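
By default `VotingClassifier` uses `voting='hard'`, meaning each model casts one vote per sample and the most frequent label wins. The sketch below reproduces that idea by hand with NumPy for binary 0/1 labels. The `preds` array is hypothetical and does not come from the models in this notebook.

```python
import numpy as np

# Hypothetical predictions from three models on four samples
preds = np.array([
    [1, 0, 1, 0],  # e.g. logistic regression
    [1, 1, 0, 0],  # e.g. decision tree
    [0, 1, 1, 0],  # e.g. SVC
])

# With 0/1 labels and an odd number of voters, the majority label is 1
# exactly when more than half of the models predict 1
votes_for_one = preds.sum(axis=0)
majority = (votes_for_one > preds.shape[0] / 2).astype(int)

print(majority)
```

Soft voting (`voting='soft'`) averages predicted probabilities instead, which requires every estimator to support `predict_proba`; for `SVC` that means constructing it with `probability=True`.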

In [None]:
# Create an object for all the algorithms
model1 = LogisticRegression()
model2 = DecisionTreeClassifier()
model3 = SVC()

# Combine the models in a VotingClassifier (hard voting by default)
model = VotingClassifier(estimators=[('LR', model1), ('DT', model2), ('SVC', model3)])

# Fit the ensemble on the training set
model.fit(X_train, y_train)

# Predict labels for the test set
y_pred = model.predict(X_test)

# Calculate the accuracy
print("Accuracy (in %):", accuracy_score(y_test, y_pred) * 100)