## A Hands-on Workshop series in Machine Learning
#### Instructor: Dr. Aashita Kesarwani

First, we import the relevant Python modules:

In [None]:
import numpy as np
import pandas as pd

# The module re is for regular expressions
import re

import warnings
warnings.filterwarnings('ignore')

We load the [Titanic dataset from Kaggle](https://www.kaggle.com/c/titanic), stored in a `csv` file, as a dataframe using the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/AashitaK/datasets/main/titanic.csv')
df.head()

[Description for the columns](https://www.kaggle.com/c/titanic/data) is as follows.  

|Variable|Definition|Key|
|:---|:---|:---|
|PassengerId|Passenger ID| |
|Survived|Survival|0 = No, 1 = Yes|
|Pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|Sex|Sex| |
|Age|Age in years| |
|SibSp|# of siblings / spouses aboard the Titanic| |
|Parch|# of parents / children aboard the Titanic| |
|Ticket|Ticket number| |
|Fare|Passenger fare| |
|Cabin|Cabin number| |
|Embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|

We fill the missing values in some of the columns as explained in the previous session.

In [None]:
# Extract the title (the first word ending in a period) from each name
df['Title'] = df['Name'].apply(lambda name: re.findall(r"\w+[.]", name)[0])

# Merge equivalent titles and group the uncommon ones under 'Rare'
df['Title'] = df['Title'].replace({
    'Ms.': 'Miss.', 'Mlle.': 'Miss.', 'Dr.': 'Rare', 'Mme.': 'Mrs.',
    'Major.': 'Rare', 'Lady.': 'Rare', 'Sir.': 'Rare', 'Col.': 'Rare',
    'Capt.': 'Rare', 'Countess.': 'Rare', 'Jonkheer.': 'Rare',
    'Dona.': 'Rare', 'Don.': 'Rare', 'Rev.': 'Rare'})

df['MedianAge'] = df.groupby('Title')['Age'].transform("median")
df['Age'] = df['Age'].fillna(df['MedianAge'])
df = df.drop(['Title', 'MedianAge', 'Cabin'], axis=1)
df['Embarked'] = df['Embarked'].fillna('S')
df.head()
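
To see what the title extraction and the group-wise median fill do, here is a minimal sketch on a hypothetical five-row dataframe. The names and ages in `toy` are made up for illustration and are not taken from the Titanic file.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical toy data (not from the Titanic dataset)
toy = pd.DataFrame({
    'Name': ['Smith, Mr. John', 'Brown, Mr. Adam', 'Lee, Miss. Anna',
             'Hall, Miss. Ruth', 'Cole, Mr. Evan'],
    'Age': [40.0, np.nan, 20.0, np.nan, 30.0],
})

# The first word followed by a period is taken as the title
toy['Title'] = toy['Name'].apply(lambda name: re.findall(r"\w+[.]", name)[0])

# transform("median") returns one value per row:
# the median age of the title group that the row belongs to
toy['MedianAge'] = toy.groupby('Title')['Age'].transform("median")

# Fill each missing age with the median age of passengers sharing the title
toy['Age'] = toy['Age'].fillna(toy['MedianAge'])
print(toy)
```

The missing age in the `Mr.` group is filled with 35 (the median of 40 and 30), and the missing age in the `Miss.` group is filled with 20.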

Let's now check for any remaining missing values.

In [None]:
df.isnull().sum()

We also create a new column `GroupSize`, as seen in the previous session. It is the larger of the family size and the number of passengers sharing the same ticket.

In [None]:
df['Family'] = df['SibSp'] + df['Parch'] + 1
df['TicketCount'] = df.groupby('Ticket')['Name'].transform("count")
df['GroupSize'] = df[['Family', 'TicketCount']].max(axis=1)
df.head()
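
The following sketch shows why `GroupSize` takes the maximum of the two counts, using a hypothetical five-passenger dataframe (all values in `toy` are invented for illustration). Friends travelling on a shared ticket have no family aboard, while a passenger with relatives on other tickets has a ticket count of one; the maximum captures both cases.

```python
import pandas as pd

# Hypothetical toy data (not from the Titanic dataset)
toy = pd.DataFrame({
    'Name':   ['A', 'B', 'C', 'D', 'E'],
    'SibSp':  [1, 1, 0, 0, 2],
    'Parch':  [0, 0, 0, 0, 0],
    'Ticket': ['T1', 'T1', 'T2', 'T2', 'T3'],
})

# Family size: the passenger plus siblings/spouses and parents/children
toy['Family'] = toy['SibSp'] + toy['Parch'] + 1

# Number of passengers sharing the same ticket
toy['TicketCount'] = toy.groupby('Ticket')['Name'].transform("count")

# Row-wise maximum of the two estimates
toy['GroupSize'] = toy[['Family', 'TicketCount']].max(axis=1)
print(toy)
```

Here C and D share ticket `T2` without being related, so their group size is 2 even though their family size is 1; E travels on a separate ticket but has two siblings aboard, giving a group size of 3.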

### Encoding categorical variables

Let us check the datatype of each column. Hint: Use [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html).

In [None]:
df.dtypes

Let us encode some of the categorical columns with numerical values, as seen in the previous session.

In [None]:
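# Encode only the 'Sex' column rather than replacing values across the whole dataframe
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
# One-hot encode the port of embarkation; dtype=int gives 0/1 columns instead of booleans
port_df = pd.get_dummies(df['Embarked'], prefix='Port', dtype=int)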
df = pd.concat([df, port_df], axis=1).drop(['Embarked', 'Name', 'Ticket'], axis=1)
df.head()

In [None]:
df.dtypes
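
As a small illustration of the two encodings used above, the sketch below applies them to a hypothetical three-row dataframe (the values in `toy` are invented). A binary column such as `Sex` can be mapped to 0/1 directly, whereas a column with three unordered categories such as `Embarked` is expanded into one indicator column per category so that no artificial ordering is implied.

```python
import pandas as pd

# Hypothetical toy data (not from the Titanic dataset)
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Embarked': ['S', 'C', 'Q']})

# Binary category: map each label to an integer
toy['Sex'] = toy['Sex'].map({'male': 0, 'female': 1})

# Multi-valued category: one indicator (dummy) column per port
ports = pd.get_dummies(toy['Embarked'], prefix='Port', dtype=int)
toy = pd.concat([toy, ports], axis=1).drop('Embarked', axis=1)
print(toy)
```

Each row has exactly one of `Port_C`, `Port_Q`, and `Port_S` set to 1.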

Next, do the same for other columns if required. 

Finally, we select the input features `X` and the label `y` for our model.

In [None]:
X = df[['Age', 'GroupSize']]  # Pick columns that you think are useful
y = df['Survived'].astype('category')

### Build a logistic classifier using scikit-learn
Steps:
* Split the data into training and validation sets
* Define the logistic classifier
* Fit logistic classifier
* Get accuracy scores on train and validation sets
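
The steps above can be sketched as follows. This is one possible way to do it, not the only one. So that the block runs on its own, it builds a synthetic stand-in dataframe `demo` with the same column names; the data, the 80/20 split, the `random_state` values, and `max_iter=1000` are all assumptions made for illustration. In the notebook, use the `X` and `y` defined above in place of `X_demo` and `y_demo`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Titanic dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.uniform(1, 80, size=200),
    'GroupSize': rng.integers(1, 8, size=200),
})
noise = rng.normal(0, 10, size=200)
demo['Survived'] = (demo['Age'] + 5 * demo['GroupSize'] + noise < 55).astype(int)

X_demo = demo[['Age', 'GroupSize']]
y_demo = demo['Survived']

# Step 1: split into training and validation sets (80% / 20%)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Step 2: define the logistic classifier
clf = LogisticRegression(max_iter=1000)

# Step 3: fit the classifier on the training set only
clf.fit(X_train, y_train)

# Step 4: accuracy on the training and validation sets
train_acc = clf.score(X_train, y_train)
valid_acc = clf.score(X_valid, y_valid)
print(f"Training accuracy:   {train_acc:.3f}")
print(f"Validation accuracy: {valid_acc:.3f}")
```

Comparing the two scores is the point of the split: a training accuracy much higher than the validation accuracy suggests that the model is overfitting.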

### Build a neural network using Keras

Refer to the other notebook `Primer on Keras` for code.

#### Acknowledgment:
* The [Titanic dataset](https://www.kaggle.com/c/titanic), openly available on Kaggle, is used in the exercises.
