## A Hands-on Workshop series in Machine Learning
#### Instructor: Aashita Kesarwani

First we import the relevant python modules:

In [1]:
import numpy as np
import pandas as pd

# The module re is for regular expressions
import re

We load the [Titanic dataset from Kaggle](https://www.kaggle.com/c/titanic), stored in a `csv` file, as a dataframe using the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function.

In [2]:
df = pd.read_csv('titanic.csv')
df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

[Description for the columns](https://www.kaggle.com/c/titanic/data) is as follows.  

|Variable|Definition|Key|
|:---|:---|:---|
|PassengerId|Passenger ID| |
|Survived|Survival|0 = No, 1 = Yes|
|Pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|Sex|Sex| |
|Age|Age in years| |
|SibSp|# of siblings / spouses aboard the Titanic| |
|Parch|# of parents / children aboard the Titanic| |
|Ticket|Ticket number| |
|Fare|Passenger fare| |
|Cabin|Cabin number| |
|Embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|

We fill the missing values in the age column as explained in the previous session.

In [3]:
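# Extract the title (the first word ending in a period) from each name;
# the raw string keeps \w from being treated as a string escape
df['Title'] = df['Name'].apply(lambda name: re.findall(r"\w+[.]", name)[0])

# Merge equivalent titles and group the infrequent ones under 'Rare'
df['Title'] = df['Title'].replace({'Ms.': 'Miss.', 'Mlle.': 'Miss.', 'Dr.': 'Rare', 'Mme.': 'Mrs.', 
                                   'Major.': 'Rare', 'Lady.': 'Rare', 'Sir.': 'Rare', 'Col.': 'Rare', 
                                   'Capt.': 'Rare', 'Countess.': 'Rare', 'Jonkheer.': 'Rare', 
                                   'Dona.': 'Rare', 'Don.': 'Rare', 'Rev.': 'Rare'})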

df['MedianAge'] = df.groupby('Title')['Age'].transform("median")
df['Age'] = df['Age'].fillna(df['MedianAge'])
df['Embarked'] = df['Embarked'].fillna('S')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,MedianAge
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr.,30.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs.,35.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss.,21.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs.,35.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr.,30.0
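
The fill logic above can be traced on a small example. The sketch below uses a hypothetical five-row frame (the names and ages are made up for illustration, not taken from the Titanic file) and applies the same three steps: extract the title, compute the per-title median age, and fill only the missing ages.

```python
import re
import pandas as pd

# Toy data: hypothetical passengers, not rows from the Titanic file
toy = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Jones, Mr. Paul", "Brown, Miss. Ann",
             "Green, Miss. Eva", "White, Mr. Tom"],
    "Age": [40.0, 20.0, 10.0, None, None],
})

# Extract the first word that ends in a period, e.g. "Mr." or "Miss."
toy["Title"] = toy["Name"].apply(lambda name: re.findall(r"\w+[.]", name)[0])

# transform("median") returns one value per row:
# the median age of the title group that row belongs to
toy["MedianAge"] = toy.groupby("Title")["Age"].transform("median")

# Fill only the missing ages; known ages are left untouched
toy["Age"] = toy["Age"].fillna(toy["MedianAge"])
print(toy[["Title", "Age", "MedianAge"]])
```

Here the missing `Miss.` age becomes 10.0 and the missing `Mr.` age becomes 30.0, the medians of the known ages in each group.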


Let's now check the missing values.

In [4]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
Title            0
MedianAge        0
dtype: int64

We also create a new column `GroupSize` as seen in the previous session.

In [5]:
df['Family'] = df['SibSp'] + df['Parch'] + 1
df['TicketCount'] = df.groupby('Ticket')['Name'].transform("count")
df['GroupSize'] = df[['Family', 'TicketCount']].max(axis=1)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,MedianAge,Family,TicketCount,GroupSize
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr.,30.0,2,1,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs.,35.0,2,1,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss.,21.0,1,1,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs.,35.0,2,2,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr.,30.0,1,1,1
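
To see why `GroupSize` takes the maximum of the two counts, here is a minimal sketch on a hypothetical five-passenger frame (names and tickets are invented). Two friends share a ticket without being related, and one passenger has relatives aboard who travel on other tickets; neither `Family` nor `TicketCount` alone captures both cases.

```python
import pandas as pd

# Hypothetical passengers: A and B are a couple on ticket T1,
# C and D are unrelated friends sharing ticket T2,
# E has two siblings aboard but holds a ticket of their own
toy = pd.DataFrame({
    "Name":   ["A", "B", "C", "D", "E"],
    "SibSp":  [1, 1, 0, 0, 2],
    "Parch":  [0, 0, 0, 0, 0],
    "Ticket": ["T1", "T1", "T2", "T2", "T3"],
})

toy["Family"] = toy["SibSp"] + toy["Parch"] + 1                     # relatives plus self
toy["TicketCount"] = toy.groupby("Ticket")["Name"].transform("count")  # people per ticket
toy["GroupSize"] = toy[["Family", "TicketCount"]].max(axis=1)       # row-wise maximum
print(toy[["Name", "Family", "TicketCount", "GroupSize"]])
```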


### Encoding categorical variables

Let us check the datatype of each column. Hint: Use [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html).

In [6]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
Title           object
MedianAge      float64
Family           int64
TicketCount      int64
GroupSize        int64
dtype: object

In machine learning, our models usually take numbers as inputs rather than strings. We have to convert categorical data into a form the model can recognize.

We convert the gender values to numerical values 0 and 1 using [`replace`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html) with a suitable dictionary. 

In [7]:
# Restrict the replacement to the Sex column so that no other column is touched
df['Sex'] = df['Sex'].replace({'male': 0, 'female': 1})
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,MedianAge,Family,TicketCount,GroupSize
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Mr.,30.0,2,1,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Mrs.,35.0,2,1,2
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss.,21.0,1,1,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Mrs.,35.0,2,2,2
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Mr.,30.0,1,1,1


What can go wrong with randomly assigning numbers to categories?

There are two kinds of categorical variables based on whether the categories possess an inherent order or not:
* Ordinal categorical variables
* Nominal categorical variables

For example, the passengers' ticket class `Pclass` takes the values 1, 2, and 3. These three categories have an inherent order, so `Pclass` is an ordinal categorical variable. On the other hand, gender takes two values, male and female, which have no intrinsic ordering, so it is a nominal categorical variable.

Does it mean that we can simply treat the ordinal variables such as `Pclass` as another numerical variable? Can you think of any problem this may cause in our model?

Other than a natural order, numbers also possess certain other properties. For example, the difference between the numbers 1 and 2 is the same as the difference between the numbers 2 and 3:
$$2 - 1 = 3 - 2$$

Can we make the same claim for the categories labeled $1, 2,$ and $3$ in our ordinal variable `Pclass`?

So, converting categories to numbers builds in assumptions, such as equal spacing between categories, that need not hold for the data and may adversely affect our model.
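
As a small illustration of these unintended assumptions, the sketch below assigns arbitrary integer codes to the three ports (the particular coding is an assumption chosen for the example) and shows the arithmetic a model would be free to perform on them.

```python
import numpy as np

# An arbitrary integer coding of the three ports (chosen only for illustration)
codes = {"S": 0, "C": 1, "Q": 2}

# A model that treats the codes as numbers sees these "distances" between ports
dist_S_C = abs(codes["S"] - codes["C"])
dist_S_Q = abs(codes["S"] - codes["Q"])

# Averaging is also defined on the codes, although it is meaningless for ports:
# the mean of one S passenger and one Q passenger equals the code for C
mean_S_Q = np.mean([codes["S"], codes["Q"]])

print(dist_S_C, dist_S_Q, mean_S_Q)
```

Under this coding Queenstown appears twice as far from Southampton as Cherbourg does, and a different arbitrary coding would imply a different geometry.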

To address this, the commonly used method is one-hot encoding. In this method, we represent each instance by a one-hot encoded vector whose dimension equals the number of classes of the categorical variable. This vector consists of all 0's except for a 1 in the position corresponding to the class of the instance. For example, the *Embarked* column will have one-hot encoded vectors of [1,0,0], [0,1,0] and [0,0,1] representing each of the three possible ports. 

How will this look in our dataset?  
Instead of a single column for the port of embarkation, we will have three columns, one for each port. The values in these columns will be $0$ or $1$. For each row, exactly one of these three columns will contain a $1$.
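
Before turning to pandas, the idea can be sketched by hand with NumPy. The sample of port labels below is hypothetical, and the column order (C, Q, S) is an assumption made to match alphabetical sorting.

```python
import numpy as np

ports = np.array(["S", "C", "S", "Q"])   # hypothetical sample of port labels
classes = np.array(["C", "Q", "S"])      # one column per class, in sorted order

# Compare every label with every class:
# entry (i, j) is 1 when row i belongs to class j, and 0 otherwise
one_hot = (ports[:, None] == classes[None, :]).astype(int)
print(one_hot)
```

Each row sums to 1, which is the "exactly one 1 per row" property described above.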

One-hot encoding is accomplished in pandas using [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) as given below. It simply creates a column for each class of a categorical variable.

In [8]:
pd.get_dummies(df['Embarked']).head()

Unnamed: 0,C,Q,S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1
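
Depending on the installed pandas version, `get_dummies` may return `True`/`False` columns instead of the 0/1 values shown here, since the default dtype of the dummy columns changed to `bool` in pandas 2.0. A minimal sketch, using a hypothetical sample of ports, of requesting integer columns explicitly with the `dtype` keyword:

```python
import pandas as pd

embarked = pd.Series(["S", "C", "S", "Q"], name="Embarked")  # hypothetical sample

# dtype=int requests 0/1 integer columns regardless of the version's default
dummies = pd.get_dummies(embarked, dtype=int)
print(dummies)
```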


We want the column names to be `'Port_C', 'Port_Q', 'Port_S'`. Make use of the [`prefix`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) keyword in `get_dummies` to alter the column names and save the one-hot encoded vectors to a new dataframe named `port_df`.

In [9]:
port_df = pd.get_dummies(df['Embarked'], prefix='Port')

Let's add `port_df` to the original dataframe `df`.

In [10]:
pd.concat([df, port_df], axis=1).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,MedianAge,Family,TicketCount,GroupSize,Port_C,Port_Q,Port_S
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Mr.,30.0,2,1,2,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Mrs.,35.0,2,1,2,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss.,21.0,1,1,1,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Mrs.,35.0,2,2,2,0,0,1
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Mr.,30.0,1,1,1,0,0,1


The above looks good, so let us update the original dataframe.

In [11]:
df = pd.concat([df, port_df], axis=1)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,MedianAge,Family,TicketCount,GroupSize,Port_C,Port_Q,Port_S
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,Mr.,30.0,2,1,2,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,Mrs.,35.0,2,1,2,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss.,21.0,1,1,1,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,Mrs.,35.0,2,2,2,0,0,1
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,Mr.,30.0,1,1,1,0,0,1


In [12]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex              int64
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
Title           object
MedianAge      float64
Family           int64
TicketCount      int64
GroupSize        int64
Port_C           uint8
Port_Q           uint8
Port_S           uint8
dtype: object

Next, do the same for other columns if required. 
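
As one possible instance of this step, the sketch below applies the same `get_dummies` and `concat` pattern to a `Title` column. It runs on a hypothetical three-row frame rather than on `df`, and whether `Title` is worth encoding for the model is left as a judgment call.

```python
import pandas as pd

# Hypothetical miniature frame standing in for df
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Title": ["Mr.", "Mrs.", "Miss."],
})

# One-hot encode Title with a prefix, then append the new columns
title_df = pd.get_dummies(toy["Title"], prefix="Title", dtype=int)
toy = pd.concat([toy, title_df], axis=1)
print(toy.columns.tolist())
```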

Finally, we take input `X` and label `y` for our model.

In [13]:
X = df[['Age', 'GroupSize']]  # Pick columns that you think are useful
y = df['Survived'].astype('category')

### Build a Logistic classifier using scikit-learn 
Steps:
* Split the train and validation set
* Define logistic classifier
* Fit logistic classifier
* Get accuracy scores on train and validation sets

In [16]:
from sklearn.model_selection import train_test_split
# default is 75% / 25% train-test split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

In [None]:
# from sklearn.linear_model import LogisticRegression
# LR_clf = LogisticRegression()
# LR_clf.fit(X_train, y_train)

# print('Accuracy of Logistic regression classifier on training set: {:.2f}'
#      .format(LR_clf.score(X_train, y_train)))
# print('Accuracy of Logistic regression classifier on validation set: {:.2f}'
#      .format(LR_clf.score(X_valid, y_valid)))
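
The four steps listed above can be sketched end to end on synthetic data. The arrays `X_demo` and `y_demo` and the other `demo` names below are stand-ins invented for this example, not the Titanic features, so the accuracies say nothing about the workshop dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two features, label is 1 when their sum is positive
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Step 1: split into train and validation sets (default 75% / 25%)
Xd_train, Xd_valid, yd_train, yd_valid = train_test_split(
    X_demo, y_demo, random_state=0)

# Steps 2 and 3: define the logistic classifier and fit it on the training set
demo_clf = LogisticRegression()
demo_clf.fit(Xd_train, yd_train)

# Step 4: accuracy on the train and validation sets
train_acc = demo_clf.score(Xd_train, yd_train)
valid_acc = demo_clf.score(Xd_valid, yd_valid)
print('Training accuracy: {:.2f}'.format(train_acc))
print('Validation accuracy: {:.2f}'.format(valid_acc))
```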

### Build a neural network using keras

Refer to the other notebook for code.

In [17]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras import losses

input_dim = X_train.shape[1] # size of input variables
print("Input size:", input_dim)

model = Sequential()
model.add(Dense(units=5, input_dim=input_dim, activation="sigmoid")) # Hidden layer
model.add(Dense(units=1, activation="sigmoid")) # Output layer
model.summary()

Input size: 2
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 5)                 15        
                                                                 
 dense_1 (Dense)             (None, 1)                 6         
                                                                 
Total params: 21
Trainable params: 21
Non-trainable params: 0
_________________________________________________________________
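
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical function written for this check, not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(2, 5)   # 2 input features -> 5 hidden units
output = dense_params(5, 1)   # 5 hidden units -> 1 output unit
print(hidden, output, hidden + output)
```

This gives 15 and 6, for a total of 21, matching the `Param #` column.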




In [18]:
# `learning_rate` replaces the deprecated `lr` argument
model.compile(optimizer=SGD(learning_rate=0.001), loss="binary_crossentropy", metrics=["accuracy"])



In [19]:
model.fit(X_train, y_train, epochs=30, verbose=1); 

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [20]:
# show the accuracy on the validation set
print("Evaluating on validation set...")
(loss, accuracy) = model.evaluate(X_valid, y_valid, batch_size=5, verbose=1)
print("[INFO] loss={:.4f}, accuracy: {:.4f}%".format(loss, accuracy * 100))


Evaluating on validation set...
[INFO] loss=0.6756, accuracy: 62.3318%
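
To interpret the two numbers, it helps to see how they are defined. The NumPy sketch below writes out binary cross-entropy and thresholded accuracy on hypothetical labels and predictions; it illustrates the definitions and is not Keras's own implementation.

```python
import numpy as np

def binary_crossentropy(y_true, p_pred):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 to avoid log(0)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), 1e-7, 1 - 1e-7)
    losses = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    return float(np.mean(losses))

def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predicted = (np.asarray(p_pred) > threshold).astype(int)
    return float(np.mean(predicted == np.asarray(y_true)))

y_true = [1, 0, 1, 0]                 # hypothetical labels
uninformed = [0.5, 0.5, 0.5, 0.5]     # a model that always predicts 0.5
confident = [0.9, 0.1, 0.8, 0.2]      # a model that leans the right way

print(binary_crossentropy(y_true, uninformed))   # ln 2, about 0.693
print(binary_crossentropy(y_true, confident))
print(accuracy(y_true, confident))
```

A model that outputs 0.5 for every passenger has a loss of $\ln 2 \approx 0.693$. The reported loss of 0.6756 is only slightly below that, which may suggest that the network has learned little from these two features in 30 epochs at this learning rate.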


#### Acknowledgment:
* The [Titanic dataset from Kaggle](https://www.kaggle.com/c/titanic), openly available on Kaggle, is used in the exercises.
