## A Hands-on Workshop series in Machine Learning
#### Instructor: Dr. Aashita Kesarwani

First, we import the relevant Python modules:

In [1]:
import numpy as np
import pandas as pd

# The module re is for regular expressions
import re

import warnings
warnings.filterwarnings('ignore')



We load the [Titanic dataset from Kaggle](https://www.kaggle.com/c/titanic), stored in a `csv` file, into a dataframe using the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/AashitaK/datasets/main/titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The [columns](https://www.kaggle.com/c/titanic/data) are described as follows.

|Variable|Definition|Key|
|:---|:---|:---|
|PassengerId|Passenger ID| |
|Survived|Survival|0 = No, 1 = Yes|
|Pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|Sex|Sex| |
|Age|Age in years| |
|SibSp|# of siblings / spouses aboard the Titanic| |
|Parch|# of parents / children aboard the Titanic| |
|Ticket|Ticket number| |
|Fare|Passenger fare| |
|Cabin|Cabin number| |
|Embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|

We fill the missing values in some of the columns as explained in the previous session.

In [3]:
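# Extract the title (e.g. "Mr.") from each name; the raw string avoids an invalid-escape warning
df['Title'] = df['Name'].apply(lambda name: re.findall(r"\w+[.]", name)[0])

# Merge equivalent titles and group the uncommon ones under 'Rare'
df['Title'] = df['Title'].replace({'Ms.': 'Miss.', 'Mlle.': 'Miss.', 'Dr.': 'Rare', 'Mme.': 'Mrs.',
                                   'Major.': 'Rare', 'Lady.': 'Rare', 'Sir.': 'Rare', 'Col.': 'Rare',
                                   'Capt.': 'Rare', 'Countess.': 'Rare', 'Jonkheer.': 'Rare',
                                   'Dona.': 'Rare', 'Don.': 'Rare', 'Rev.': 'Rare'})

# Fill missing ages with the median age of passengers sharing the same title
df['MedianAge'] = df.groupby('Title')['Age'].transform("median")
df['Age'] = df['Age'].fillna(df['MedianAge'])
df = df.drop(['Title', 'MedianAge', 'Cabin'], axis=1)
# Fill missing ports of embarkation with 'S' (Southampton)
df['Embarked'] = df['Embarked'].fillna('S')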
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S
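
To see the imputation step in isolation, here is a minimal sketch on a small made-up dataframe. The names and ages in `toy` are hypothetical and are not taken from the Titanic file; the sketch only illustrates how the title is extracted and how `transform("median")` fills the gaps.

```python
import re
import pandas as pd

# Hypothetical rows for illustration only (not from the Titanic file)
toy = pd.DataFrame({
    'Name': ['Smith, Mr. John', 'Jones, Miss. Ann', 'Brown, Mr. Tom', 'Lee, Miss. Eva'],
    'Age': [40.0, 20.0, None, None],
})

# r"\w+[.]" matches a run of word characters followed by a period, e.g. "Mr."
toy['Title'] = toy['Name'].apply(lambda name: re.findall(r"\w+[.]", name)[0])

# transform("median") returns a Series aligned with the original rows,
# holding the median age of each row's title group
toy['MedianAge'] = toy.groupby('Title')['Age'].transform("median")

# Only the missing ages are replaced; known ages are left untouched
toy['Age'] = toy['Age'].fillna(toy['MedianAge'])
print(toy[['Title', 'Age']])
```

Because `transform` keeps the original index, the group medians line up row by row with `Age`, which is what lets `fillna` accept them directly.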


Let's now check the missing values.

In [4]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

We also create a new column `GroupSize`, as seen in the previous session: the larger of the family size and the number of passengers sharing the same ticket.

In [5]:
df['Family'] = df['SibSp'] + df['Parch'] + 1
df['TicketCount'] = df.groupby('Ticket')['Name'].transform("count")
df['GroupSize'] = df[['Family', 'TicketCount']].max(axis=1)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Family,TicketCount,GroupSize
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,2,1,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,2,1,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,1,1,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,2,2,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,1,1,1
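
The following sketch repeats the same three steps on a made-up dataframe, so the effect of taking the row-wise maximum is easy to check by hand. The passengers and ticket labels in `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical passengers: A and B are a couple on one ticket,
# C and D are unrelated travellers sharing a ticket,
# E has two siblings aboard but holds a ticket of their own
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'SibSp': [1, 1, 0, 0, 2],
    'Parch': [0, 0, 0, 0, 0],
    'Ticket': ['T1', 'T1', 'T2', 'T2', 'T3'],
})

# Family size counts the passenger plus relatives aboard
toy['Family'] = toy['SibSp'] + toy['Parch'] + 1

# Number of passengers sharing each ticket, aligned with the original rows
toy['TicketCount'] = toy.groupby('Ticket')['Name'].transform("count")

# Row-wise maximum of the two group estimates
toy['GroupSize'] = toy[['Family', 'TicketCount']].max(axis=1)
print(toy[['Name', 'Family', 'TicketCount', 'GroupSize']])
```

Passengers C and D show why the ticket count helps: they have no relatives aboard, yet they are not travelling alone.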


### Encoding categorical variables

Let us check the datatype of each column. Hint: Use [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html).

In [6]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Embarked        object
Family           int64
TicketCount      int64
GroupSize        int64
dtype: object

Let us encode some of the categorical columns with numerical values, as seen in the previous session.

In [7]:
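# Encode Sex as 0/1; mapping the single column leaves matching strings elsewhere untouched
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
# One-hot encode the port of embarkation; dtype='uint8' keeps the 0/1 output across pandas versions
port_df = pd.get_dummies(df['Embarked'], prefix='Port', dtype='uint8')
df = pd.concat([df, port_df], axis=1).drop(['Embarked', 'Name', 'Ticket'], axis=1)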
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Family,TicketCount,GroupSize,Port_C,Port_Q,Port_S
0,1,0,3,0,22.0,1,0,7.25,2,1,2,0,0,1
1,2,1,1,1,38.0,1,0,71.2833,2,1,2,1,0,0
2,3,1,3,1,26.0,0,0,7.925,1,1,1,0,0,1
3,4,1,1,1,35.0,1,0,53.1,2,2,2,0,0,1
4,5,0,3,0,35.0,0,0,8.05,1,1,1,0,0,1


In [8]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Sex              int64
Age            float64
SibSp            int64
Parch            int64
Fare           float64
Family           int64
TicketCount      int64
GroupSize        int64
Port_C           uint8
Port_Q           uint8
Port_S           uint8
dtype: object
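
As a small illustration of the two encodings, the sketch below applies them to a made-up dataframe. The rows in `toy` are hypothetical. It uses `Series.map` to restrict the binary encoding to the `Sex` column, and passes `dtype='uint8'` to `get_dummies` as an assumption on my part, so that the indicator columns hold 0/1 regardless of the pandas version (newer versions default to booleans).

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# Binary category: a single 0/1 column is enough
toy['Sex'] = toy['Sex'].map({'male': 0, 'female': 1})

# Category with three unordered values: one indicator column per value
ports = pd.get_dummies(toy['Embarked'], prefix='Port', dtype='uint8')

# Attach the indicator columns and drop the original text column
encoded = pd.concat([toy, ports], axis=1).drop('Embarked', axis=1)
print(encoded)
```

Each row has exactly one of `Port_C`, `Port_Q`, `Port_S` set to 1, so no artificial ordering is imposed on the ports.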

Next, do the same for other columns if required. 

Finally, we select the input features `X` and the label `y` for our model.

In [None]:
X = df[['Age', 'GroupSize']]  # Pick columns that you think are useful
y = df['Survived'].astype('category')

### Build a logistic classifier using scikit-learn
Steps:
* Split the data into training and validation sets
* Define the logistic classifier
* Fit the logistic classifier
* Get accuracy scores on the training and validation sets

In [None]:
from sklearn.model_selection import train_test_split
# default is 75% / 25% train-test split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
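
To check what the default split does, here is a minimal sketch on synthetic arrays. The data in `X_toy` and `y_toy` is made up, and the `stratify` argument is an optional addition that the cell above does not use; it asks scikit-learn to keep the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features, 40% of labels equal to 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# With no test_size given, 25% of the samples go to the validation set
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_va.shape)   # sizes of the two parts
print(y_tr.mean(), y_va.mean()) # share of class 1 in each part
```

Fixing `random_state` makes the shuffle reproducible, so the accuracy scores computed later do not change from run to run.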

In [None]:
from sklearn.linear_model import LogisticRegression
LR_clf = LogisticRegression()
LR_clf.fit(X_train, y_train)

print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(LR_clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on validation set: {:.2f}'
     .format(LR_clf.score(X_valid, y_valid)))
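
For intuition about what the fitted classifier computes, the sketch below reproduces its predicted probabilities by hand on synthetic data. The arrays `X_toy` and `y_toy` and the name `clf` are made up for illustration. For a binary problem, the probability of class 1 is the sigmoid of a linear combination of the features, and `score` is the fraction of labels predicted correctly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with two features and some label noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + noise > 0).astype(int)

clf = LogisticRegression()
clf.fit(X_toy, y_toy)

# Linear score z = w . x + b, then the sigmoid maps it to a probability
z = X_toy @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Compare with scikit-learn's probability for class 1
sk_proba = clf.predict_proba(X_toy)[:, 1]
print(np.allclose(manual_proba, sk_proba))

# Accuracy: predict class 1 when the probability exceeds 0.5
manual_acc = np.mean((manual_proba > 0.5).astype(int) == y_toy)
print(manual_acc, clf.score(X_toy, y_toy))
```

The learned weights in `coef_` indicate how strongly, and in which direction, each feature moves the predicted probability.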

### Build a neural network using Keras

Refer to the other notebook `Primer on Keras` for code.

#### Acknowledgment:
* The [Titanic dataset](https://www.kaggle.com/c/titanic), openly available on Kaggle, is used in the exercises.
