# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint
### Not for Grading

## Learning Objective

At the end of the experiment, you will be able to:

* perform data pre-processing
* build a Bagging classifier

## Dataset

### Description

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of many passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Build a predictive model that answers the question “What sort of people were more likely to survive?” using the Titanic's passenger data (i.e., name, age, gender, socio-economic class, etc.).

<br/>

### Data Set Characteristics:

**PassengerId:** Id of the Passenger

**Ticket_Class:** Socio-economic status (SES)
  * 1st = Upper
  * 2nd = Middle
  * 3rd = Lower

**Name:** Surname, First Names of the Passenger

**Sex:** Gender of the Passenger

**Age:** Age of the Passenger

**Siblings_Spouse:** Number of siblings/spouses of the passenger aboard the Titanic

**Parents_Children:** Number of parents/children of the passenger aboard the Titanic

**Ticket_Number:** Ticket number

**Fare:** Passenger fare

**Cabin:** Cabin number

**Embarked:** Port of Embarkation

**Survived:** Whether the passenger survived (Yes or No)

In [None]:
!  wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Titanic.csv
!  wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Titanic_Test.csv

In [None]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

## Data Pre-Processing

#### Exercise 01: Load the data and print the first five records

**Hint:** https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html

In [None]:
df = pd.read_csv('Titanic.csv')
df.head()

#### Exercise 02: Data Cleaning and Analysis

* Generate [Descriptive Statistics](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) of the dataframe

* Count [NaN values in each column](https://stackoverflow.com/a/26266451) of the dataframe

* Compute the mean age of the passengers who survived and use it to fill the NaN ages of the survivors
  * **Hint:** [DataFrame Where](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html)
  * **Note:** The `where` method keeps values where the condition is **`True`** and replaces them where it is **`False`**

* Compute the mean age of the passengers who did not survive and use it to fill the *remaining NaN records* in the Age column

* Analyze how each column affects the prediction of a person's survival, and decide accordingly whether to drop the column or fill its NaN values.

* **Example:**
  * The PassengerId column carries no information about the survival of a person, hence it can be dropped
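
The `where` behaviour described in the note above is easy to get backwards, so here is a minimal sketch on a toy DataFrame. The column names mirror the dataset, but the rows and values are assumptions made up for illustration, not records from `Titanic.csv`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from Titanic.csv)
toy = pd.DataFrame({
    "Survived": ["Yes", "Yes", "No", "No", "Yes"],
    "Age": [20.0, np.nan, 40.0, np.nan, 30.0],
})

# Mean age of survivors; mean() skips NaN values by default
mean_survived = toy.loc[toy["Survived"] == "Yes", "Age"].mean()

# Rows that need filling: survivors whose age is missing
needs_fill = toy["Age"].isna() & (toy["Survived"] == "Yes")

# where() keeps values where the condition is True and replaces the rest,
# so the mask has to be negated
toy["Age"] = toy["Age"].where(~needs_fill, mean_survived)

print(toy)
```

Only the survivor with a missing age is filled; the missing age of the non-survivor is left as NaN for the next step.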






In [None]:
df.describe()

In [None]:
df.isna().sum()

In [None]:
####### Alternative if participants do not want to use df.where() #########
# df.loc[df.Age.isna() & (df.Survived == 'Yes'), 'Age'] = df[df.Survived == 'Yes'].Age.mean()
# df.Age

# df.loc[df.Age.isna() & (df.Survived == 'No'), 'Age'] = df[df.Survived == 'No'].Age.mean()
# df.Age

In [None]:
# Replacing Age NaNs with mean values based on Survived value

# other way of doing it without using df.where()

# means = df.groupby(['Survived'])['Age'].mean()
# df = df.set_index('Survived')
# df['Age'] = df['Age'].fillna(means)
# df = df.reset_index()
# df.isna().sum()

In [None]:
# Note: NaN must be detected with isna(); a comparison such as `== np.nan` is always False
#df['Age'] = np.where((df["Survived"]=="Yes") & (df["Age"].isna()), round(df[df.Survived=='Yes'].Age.mean(),1), df['Age'])

In [None]:
# Finding the mean age of "Survived" people

meanS= df[df.Survived=='Yes'].Age.mean()
df.Age = df.Age.where(~((df.Age.isna()) & (df['Survived']=='Yes')), meanS)
df.isna().sum()

In [None]:
# Finding the mean age of "Not Survived" people

meanNS = df[df.Survived == 'No'].Age.mean()
# Assign the result back instead of using inplace=True on an attribute-accessed column
df['Age'] = df['Age'].fillna(meanNS)
df.isna().sum()

In [None]:
# Dropping useless columns

df.drop(columns=['PassengerId', 'Name','Ticket_Number', 'Fare', 'Cabin'], inplace=True)
df.head()

In [None]:
#### IF PARTICIPANTS ASK WHETHER THE "EMBARKED" COLUMN IS NECESSARY OR NOT, GIVE THE EXPLANATION BELOW
# Checking whether the "Embarked" column is important for the analysis, that is, whether survival depends on the port of embarkation

# Combine both conditions into a single boolean mask to avoid chained boolean indexing
survivedQ = df[(df.Embarked == 'Queenstown') & (df.Survived == 'Yes')].shape[0]
not_survivedQ = df[(df.Embarked == 'Queenstown') & (df.Survived == 'No')].shape[0]

survivedC = df[(df.Embarked == 'Cherbourg') & (df.Survived == 'Yes')].shape[0]
not_survivedC = df[(df.Embarked == 'Cherbourg') & (df.Survived == 'No')].shape[0]

survivedS = df[(df.Embarked == 'Southampton') & (df.Survived == 'Yes')].shape[0]
not_survivedS = df[(df.Embarked == 'Southampton') & (df.Survived == 'No')].shape[0]

print(survivedQ, not_survivedQ)
print(survivedC, not_survivedC)
print(survivedS, not_survivedS)

# The survival rate changes significantly depending on the port at which the passengers boarded the ship, so the Embarked column is useful and should not be dropped
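
The same per-port comparison can be written more compactly with `pd.crosstab`. The following is a rough sketch on a toy DataFrame; the rows are assumptions made up for illustration, not records from `Titanic.csv`, so the resulting numbers say nothing about the real data.

```python
import pandas as pd

# Toy data (hypothetical rows, not taken from Titanic.csv)
toy = pd.DataFrame({
    "Embarked": ["Queenstown", "Queenstown", "Cherbourg", "Cherbourg",
                 "Southampton", "Southampton", "Southampton", "Southampton"],
    "Survived": ["Yes", "No", "Yes", "Yes", "No", "No", "No", "Yes"],
})

# Raw counts of survivors and non-survivors per port
counts = pd.crosstab(toy["Embarked"], toy["Survived"])

# normalize="index" turns each row into proportions that sum to 1,
# which makes ports with different passenger counts comparable
rates = pd.crosstab(toy["Embarked"], toy["Survived"], normalize="index")

print(counts)
print(rates)
```

Comparing proportions rather than raw counts matters when the ports carry very different numbers of passengers.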

#### Exercise 03: Convert categorical values to numerical
**Hint:** Use [Sklearn LabelEncoder's](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) fit_transform method
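
As a quick illustration of the hint, the sketch below applies `LabelEncoder` to a small array. The values are assumptions chosen for illustration; the point is that classes are sorted before integer codes are assigned, and that `inverse_transform` recovers the original labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy values (hypothetical, not read from Titanic.csv)
sex = np.array(["male", "female", "female", "male"])

le = LabelEncoder()
codes = le.fit_transform(sex)

# classes_ holds the sorted unique labels; the position of a label is its code
print(le.classes_)
print(codes)

# The fitted encoder can map the integer codes back to the original labels
print(le.inverse_transform(codes))
```

Keeping one fitted encoder per column, as the exercise solution does, is what makes this reverse mapping possible later.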

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
# YOUR CODE HERE

le_t = preprocessing.LabelEncoder()
df['Ticket_Class'] = le_t.fit_transform(df['Ticket_Class'])

le_s = preprocessing.LabelEncoder()
df['Sex'] = le_s.fit_transform(df['Sex'])

le_e = preprocessing.LabelEncoder()
df['Embarked'] = le_e.fit_transform(df['Embarked'])

le_sur = preprocessing.LabelEncoder()
df['Survived'] = le_sur.fit_transform(df['Survived'])
df.head()

In [None]:
##### IF PARTICIPANTS ASK WHETHER THERE IS A SHORTER METHOD, WE CAN SUGGEST THIS, BUT IT HAS ITS OWN LIMITATIONS

# labelledData = df.select_dtypes(include=object).apply(lambda x: pd.factorize(x)[0])
# IntData = df.select_dtypes(exclude=object)
# new_df = pd.concat([IntData,labelledData], axis=1)
# new_df = new_df[df.columns]
# new_df.head()

#### Exercise 05: Use the **Survived** column as the target labels and the remaining columns as the features

* Print the shape of the features and labels


In [None]:
# YOUR CODE HERE
features = df.iloc[:, :-1]
labels = df.iloc[:, -1]
print(features.shape)
print(labels.shape)

#### Exercise 06:  Split the data into train and test sets




In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

#### Exercise 07: Build the classification model using bagging technique
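
To make the idea behind bagging concrete, here is a hand-rolled sketch: train several decision trees, each on a bootstrap sample (drawn with replacement), then combine their predictions by majority vote. It uses synthetic data from `make_classification` rather than the Titanic features, and the names `trees`, `all_preds`, and `bagged_pred` are illustrative. It shows the general technique and is not scikit-learn's implementation; `BaggingClassifier` differs in details, for example it averages predicted probabilities when the base estimator provides them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)

n_estimators = 10
trees = []
for _ in range(n_estimators):
    # Bootstrap sample: draw len(X) row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    trees.append(tree)

# Stack predictions into shape (n_estimators, n_samples)
all_preds = np.stack([t.predict(X) for t in trees])

# Majority vote over 0/1 labels: predict 1 when at least half the trees say 1
bagged_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)

# Accuracy on the training data, shown only to confirm the ensemble runs
print((bagged_pred == y).mean())
```

Because each tree sees a different resampling of the data, their individual errors tend to differ, and voting reduces the variance of the combined prediction.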

In [None]:
from sklearn.ensemble import BaggingClassifier

Bag = BaggingClassifier()
Bag.fit(X_train, y_train) 
bag_y_pred = Bag.predict(X_test)

# Accuracy score of the Bagging classifier model
accuracy_score(y_test, bag_y_pred)