## A Hands-on Workshop series in Machine Learning
#### Instructor: Aashita Kesarwani

First, we import the relevant Python modules:

In [None]:
import numpy as np
import pandas as pd

# The module re is for regular expressions
import re

We load the [Titanic dataset from Kaggle](https://www.kaggle.com/c/titanic), stored in a `csv` file, as a dataframe using the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function.

In [None]:
df = pd.read_csv('titanic.csv')
df['Embarked'].value_counts()

[Description for the columns](https://www.kaggle.com/c/titanic/data) is as follows.  

|Variable|Definition|Key|
|:---|:---|:---|
|PassengerId|Passenger ID| |
|Survived|Survival|0 = No, 1 = Yes|
|Pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|Sex|Sex| |
|Age|Age in years| |
|SibSp|# of siblings / spouses aboard the Titanic| |
|Parch|# of parents / children aboard the Titanic| |
|Ticket|Ticket number| |
|Fare|Passenger fare| |
|Cabin|Cabin number| |
|Embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|

We fill the missing values in the age column as explained in the previous session.

In [None]:
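# Extract the title (the first word followed by a period); a raw string avoids an invalid-escape warning
df['Title'] = df['Name'].apply(lambda name: re.findall(r"\w+[.]", name)[0])

# Merge equivalent titles and group the uncommon ones under 'Rare';
# assigning the result back avoids relying on inplace=True on a column accessor
df['Title'] = df['Title'].replace({'Ms.': 'Miss.', 'Mlle.': 'Miss.', 'Dr.': 'Rare', 'Mme.': 'Mrs.',
                                   'Major.': 'Rare', 'Lady.': 'Rare', 'Sir.': 'Rare', 'Col.': 'Rare',
                                   'Capt.': 'Rare', 'Countess.': 'Rare', 'Jonkheer.': 'Rare',
                                   'Dona.': 'Rare', 'Don.': 'Rare', 'Rev.': 'Rare'})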

df['MedianAge'] = df.groupby('Title')['Age'].transform("median")
df['Age'] = df['Age'].fillna(df['MedianAge'])
df['Embarked'] = df['Embarked'].fillna('S')
df.head()
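
To see what the title extraction and the per-title median fill do, here is a minimal sketch on a toy dataframe. The names and ages are made up for illustration and are not taken from the Titanic data; the only assumption is that names follow the same "Surname, Title. Given names" layout as the `Name` column.

```python
import re
import pandas as pd

# Made-up rows in the same layout as the Name column (illustrative only)
toy = pd.DataFrame({
    'Name': ['Smith, Mr. John', 'Jones, Miss. Anna', 'Brown, Mr. Paul', 'Gray, Miss. Eva'],
    'Age': [40.0, 20.0, None, None],
})

# The first word followed by a period is taken as the title
toy['Title'] = toy['Name'].apply(lambda name: re.findall(r"\w+[.]", name)[0])

# Median age per title, broadcast back to every row with that title
toy['MedianAge'] = toy.groupby('Title')['Age'].transform('median')

# Missing ages are replaced by the median age of the passenger's title group
toy['Age'] = toy['Age'].fillna(toy['MedianAge'])
print(toy[['Name', 'Title', 'Age']])
```

Each missing age is filled with the median of the known ages that share the same title, so the two toy passengers without an age inherit 40 and 20 respectively.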

Let's now check the missing values.

In [None]:
df.isnull().sum()

We also create a new column `GroupSize` as seen in the previous session.

In [None]:
df['Family'] = df['SibSp'] + df['Parch'] + 1
df['TicketCount'] = df.groupby('Ticket')['Name'].transform("count")
df['GroupSize'] = df[['Family', 'TicketCount']].max(axis=1)
df.head()
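
The same three steps can be traced on a small made-up table (the values below are illustrative, not from the dataset). The group size is the larger of the family size and the number of passengers sharing a ticket.

```python
import pandas as pd

# Made-up passengers: A and B are a couple on one ticket, C and D are
# unrelated travellers sharing a ticket, E has family aboard on other tickets
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'SibSp': [1, 1, 0, 0, 2],
    'Parch': [0, 0, 0, 0, 1],
    'Ticket': ['T1', 'T1', 'T2', 'T2', 'T3'],
})

# Family size counts the passenger plus siblings/spouses and parents/children
toy['Family'] = toy['SibSp'] + toy['Parch'] + 1

# Number of passengers travelling on the same ticket
toy['TicketCount'] = toy.groupby('Ticket')['Name'].transform('count')

# Row-wise maximum of the two counts
toy['GroupSize'] = toy[['Family', 'TicketCount']].max(axis=1)
print(toy)
```

Passengers C and D have no family aboard but share a ticket, so the ticket count is what raises their group size to 2; for E the family size dominates.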

### Encoding categorical variables

Let us check the datatype of each column. Hint: Use [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html).

In [None]:
df.dtypes

In machine learning, our models usually take numbers as inputs rather than strings. We have to convert categorical data into a form the model can recognize.

We convert the gender values to numerical values 0 and 1 using [`replace`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html) with a suitable dictionary. 

In [None]:
# The nested dictionary restricts the replacement to the Sex column
df = df.replace({'Sex': {'male': 0, 'female': 1}})
df.head()

What can go wrong with randomly assigning numbers to categories?

There are two kinds of categorical variables based on whether the categories possess an inherent order or not:
* Ordinal categorical variables
* Nominal categorical variables

For example, passengers' ticket class `Pclass` takes the values 1, 2, and 3. These three categories have an inherent order, and hence it is an ordinal categorical variable. On the other hand, gender takes two values, male and female, which have no intrinsic ordering, and hence it is a nominal categorical variable.

Does it mean that we can simply treat the ordinal variables such as `Pclass` as another numerical variable? Can you think of any problem this may cause in our model?

Besides a natural order, numbers also possess certain other properties. For example, the difference between the numbers 1 and 2 is the same as the difference between the numbers 2 and 3.
$$ 2 - 1 = 3 - 2 $$

Can we make the same claim for the categories labeled $1, 2,$ and $3$ in our ordinal variable `Pclass`?

So, converting categories to numbers builds in assumptions, such as equal spacing between consecutive categories, that may not hold and may adversely affect our model.

To address this, the commonly used method is one-hot encoding. In this method, we build a one-hot encoded vector whose dimension equals the number of categories the variable can take. This vector consists of all 0's except for a single 1 in the position corresponding to the category of the instance. For example, the *Embarked* column will have one-hot encoded vectors of [1,0,0], [0,1,0] and [0,0,1] representing each of the three possible ports.

How will this look in our dataset?  
Instead of a single column for the port of embarkation, we will have three columns, one for each port. The values in these columns will be $0$ or $1$. For each row, exactly one of these three columns will contain a $1$.
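
To make the idea concrete before turning to pandas, here is a small sketch that builds the one-hot vectors by hand with NumPy. The four port values are made up for illustration, and the column order (C, Q, S) simply follows the sorted order returned by `np.unique`.

```python
import numpy as np

# Made-up port values for four passengers (illustrative only)
ports = np.array(['S', 'C', 'Q', 'S'])

# classes holds the sorted distinct categories; codes holds, for each
# passenger, the index of their category within classes
classes, codes = np.unique(ports, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing the identity matrix by codes gives one vector per passenger
one_hot = np.eye(len(classes), dtype=int)[codes]

print(classes)
print(one_hot)
```

Every row of `one_hot` sums to 1, which is the "exactly one 1 per row" property described above.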

One-hot encoding is accomplished in pandas using [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) as given below. It simply creates a column for each class of a categorical variable.

In [None]:
pd.get_dummies(df['Embarked']).head()

We want the column names to be `'Port_C', 'Port_Q', 'Port_S'`. Make use of the [`prefix`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) keyword in `get_dummies` to alter the column names and save the one-hot encoded vectors to a new dataframe named `port_df`.

In [None]:
port_df = pd.get_dummies(df['Embarked'], prefix='Port')

Let's add `port_df` to the original dataframe `df`.

In [None]:
pd.concat([df, port_df], axis=1).head()

The above looks good, so let us update the original dataframe.

In [None]:
df = pd.concat([df, port_df], axis=1)
df.head()

In [None]:
df.dtypes

Next, do the same for other columns if required. 
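
One possible way to encode several columns at once is the `columns` keyword of `get_dummies`, which one-hot encodes the listed columns, uses each column name as the prefix, and leaves the remaining columns untouched. The sketch below uses a made-up three-row dataframe; which columns to encode in the real dataset (for example `Pclass` or `Title`) is left as a choice for the exercise.

```python
import pandas as pd

# Made-up rows (illustrative only)
toy = pd.DataFrame({
    'Pclass': [1, 3, 2],
    'Title': ['Mr.', 'Miss.', 'Mr.'],
    'Age': [40.0, 20.0, 35.0],
})

# The listed columns are replaced by their one-hot encoded versions;
# Age is passed through unchanged
encoded = pd.get_dummies(toy, columns=['Pclass', 'Title'])
print(encoded.columns.tolist())
print(encoded)
```

Passing the whole dataframe with `columns=` removes the need for a separate `concat` step, at the cost of dropping the original categorical columns from the result.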

Finally, we take input `X` and label `y` for our model.

In [None]:
X = df[['Age', 'GroupSize']]  # Pick columns that you think are useful
y = df['Survived'].astype('category')

### Build a Logistic classifier using scikit-learn 
Steps:
* Split the data into train and validation sets
* Define logistic classifier
* Fit logistic classifier
* Get accuracy scores on train and validation sets
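
The steps above can be sketched as follows. This is one possible outline, not the workshop's reference solution. To keep it self-contained it uses synthetic data: `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`, and the labeling rule is arbitrary. The split ratio, `random_state`, and `max_iter` values are likewise assumptions chosen for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (NOT the Titanic data)
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    'Age': rng.uniform(1, 80, n),
    'GroupSize': rng.integers(1, 8, n),
})
y_demo = (X_demo['GroupSize'] <= 3).astype(int)  # arbitrary rule, for illustration only

# Step 1: split into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Step 2: define the logistic classifier
clf = LogisticRegression(max_iter=1000)

# Step 3: fit it on the training set only
clf.fit(X_train, y_train)

# Step 4: accuracy on the train and validation sets
train_acc = clf.score(X_train, y_train)
valid_acc = clf.score(X_valid, y_valid)
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {valid_acc:.3f}")
```

With the real data, replace `X_demo` and `y_demo` with `X` and `y`. A validation accuracy much lower than the training accuracy would suggest overfitting.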

### Build a neural network using keras

Refer to the other notebook for code.

#### Acknowledgment:
* The [Titanic dataset](https://www.kaggle.com/c/titanic), openly available on Kaggle, is used in the exercises.
