## Guided session 2

This is the first notebook for the second session of the [Machine Learning workshop series at Harvey Mudd College](http://www.aashitak.com/ML-Workshops/).

Main topics of today's session:
* Split-apply-combine operations by grouping rows of a dataframe
* Encoding categorical variables
* Concatenating and merging dataframes

In [1]:
import pandas as pd
import re # For regular expressions

In today's guided session, we will continue exploring the [Titanic dataset from Kaggle](https://www.kaggle.com/c/titanic). Let us set *PassengerId* as the index.

In [6]:
path = 'titanic/'
df = pd.read_csv(path + 'train.csv') 
df = df.set_index('PassengerId')
df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### 1. [GroupBy object](https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html)
In the last exercise session, we noticed the *Age* column has a lot of missing values. To fill these values, we can group the passengers based on the titles derived from their name and then take the median value from each group to fill the missing values of the group.

The code below repeats the exercise from the previous session: it creates a new column named *Title* from the *Name* column using regular expressions.

In [None]:
# Raw string so that \w is passed to the regex engine unchanged
df['Title'] = df['Name'].map(lambda name: re.findall(r"\w+[.]", name)[0])

title_dictionary = {'Ms.': 'Miss.', 'Mlle.': 'Miss.', 
              'Dr.': 'Rare', 'Mme.': 'Mrs.',  # Mme. (Madame) corresponds to Mrs.
              'Major.': 'Rare', 'Lady.': 'Rare', 
              'Sir.': 'Rare', 'Col.': 'Rare', 
              'Capt.': 'Rare', 'Countess.': 'Rare', 
              'Jonkheer.': 'Rare', 'Dona.': 'Rare', 
              'Don.': 'Rare', 'Rev.': 'Rare'}

df['Title'] = df['Title'].replace(title_dictionary)

df.head()

We can use [`groupby()`](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.groupby.html) to group the rows of the dataframe based on column(s), say *Title*, but we need to apply some operation on the grouped object to derive a dataframe.

In [None]:
df.groupby('Title')

One way to derive a dataframe from a groupby object is aggregation, that is, computing a summary statistic (or statistics) for each group. For example, we can get the median values of the numeric columns in each group of titles.

In [None]:
# numeric_only=True skips text columns such as Name, which have no median
df.groupby('Title').median(numeric_only=True)

The median age varies greatly between groups, ranging from 3.5 to 48 years.
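As a minimal sketch of aggregation, the following uses a small made-up dataframe (the values are hypothetical and not taken from the Titanic data). It shows that aggregation returns one row per group and that missing values are skipped when the median is computed.

```python
import pandas as pd

# Toy data with hypothetical values, not taken from the Titanic dataset
toy = pd.DataFrame({
    'Title': ['Mr.', 'Mr.', 'Miss.', 'Miss.', 'Master.'],
    'Age': [30.0, 40.0, 20.0, None, 4.0],
    'Fare': [7.0, 9.0, 8.0, 10.0, 30.0],
})

# Aggregation: one row per group, indexed by the group label.
# The missing age in the 'Miss.' group is ignored by median().
group_medians = toy.groupby('Title')[['Age', 'Fare']].median()
print(group_medians)
```

The result is indexed by *Title*, so a single value can be looked up with `group_medians.loc['Mr.', 'Age']`.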

Another way to derive a dataframe from a groupby object is transformation, which returns a value for every row of the original dataframe rather than one value per group. We create a new column *MedianAge* which holds the groupwise median age for each passenger's title using [`transform()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html).

In [None]:
df['MedianAge'] = df.groupby('Title')['Age'].transform("median")
df.head(3)
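To make the difference between aggregation and transformation concrete, here is a small sketch on made-up data (hypothetical values). Aggregating gives one value per group, while `transform` broadcasts the group value back to every original row, including rows whose own age is missing.

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({
    'Title': ['Mr.', 'Miss.', 'Mr.', 'Miss.', 'Mr.'],
    'Age': [30.0, 20.0, None, 24.0, 40.0],
})

grouped_age = toy.groupby('Title')['Age']

aggregated = grouped_age.median()               # one value per group
transformed = grouped_age.transform('median')   # one value per original row

print(aggregated)
print(transformed)
```

Because the transformed series shares the index of the original dataframe, it can be assigned directly as a new column.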

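Note that here `transform()` has to be applied to a selected column of the grouped object. Applying it to the whole grouped dataframe, as in the following cell, gives an error in recent pandas versions because columns such as *Name* are not numeric.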

In [None]:
df.groupby('Title').transform("median")

Now we fill in the missing values in the *Age* column using the values in the *MedianAge* column.

In [None]:
df['Age'] = df['Age'].fillna(df['MedianAge'])
df.head()

We drop the *MedianAge* column since we no longer need it.

In [None]:
df = df.drop('MedianAge', axis=1)
df.head()
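The helper column is convenient for inspection but not strictly required. As a sketch on made-up data (hypothetical values), the groupwise medians can be passed straight to `fillna`, which only touches the rows where *Age* is missing.

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({
    'Title': ['Mr.', 'Miss.', 'Mr.', 'Miss.', 'Mr.'],
    'Age': [30.0, 20.0, None, 24.0, 40.0],
})

# Fill each missing age with the median age of the passenger's title group
group_median_age = toy.groupby('Title')['Age'].transform('median')
toy['Age'] = toy['Age'].fillna(group_median_age)

print(toy)
```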

Let us check for the missing values. There are none in the *Age* column!

In [None]:
df.isnull().sum()

### 2. Encoding categorical variables
Let us check the datatype of each column. Hint: Use `dtypes`.

In [None]:
# df.dtypes

Several columns have the `object` datatype, among them *Sex* and *Embarked*. These two, along with *Pclass*, are categorical variables. The feature *Pclass* has an innate order in its categories and hence is ordinal, whereas *Sex* and *Embarked* are nominal (unordered) categorical variables. Most machine learning models require the features or input variables to be numerical. One way to accomplish that is to encode the categories with numbers.

Convert the gender values to numerical values 0 and 1. Hint: Use `replace` with the suitable dictionary. 

In [None]:
# df['Sex'] = df['Sex'].replace({'male': 0, 'female': 1})

Check the datatypes again and make a note of the datatype of the column *Sex*. Discuss what can possibly go wrong with arbitrarily assigning numbers to categories.

In [None]:
df.dtypes

Numbers have a natural order and so do ordered categories such as the passengers' ticket class in our case. Numbers also have an inherent quantitative value attached to them that categories do not. For example, the difference between the numbers 1 and 2 is the same as the difference between the numbers 2 and 3, but the same cannot be said for ordinal categories. So, converting categories to numbers means adding untrue assumptions that may or may not adversely affect our model.
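A small sketch of the problem for unordered categories: the two integer codings below are both hypothetical and equally arbitrary, yet they imply different "distances" between the same two ports.

```python
import pandas as pd

ports = pd.Series(['S', 'C', 'Q'])

# Two equally arbitrary (hypothetical) integer codings of the same categories
coding_a = {'S': 0, 'C': 1, 'Q': 2}
coding_b = {'S': 0, 'C': 2, 'Q': 1}

encoded_a = ports.map(coding_a)
encoded_b = ports.map(coding_b)

# Numeric gap between ports S and C under each coding
gap_a = abs(coding_a['S'] - coding_a['C'])
gap_b = abs(coding_b['S'] - coding_b['C'])
print(gap_a, gap_b)
```

A model that treats the codes as quantities would see S and C as closer under the first coding than under the second, although nothing about the ports has changed.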

For this reason, the preferred method is one-hot encoding. In this method, we build a one-hot encoded vector with dimension equal to the number of classes of the categorical variable. This vector consists of all 0's except for a 1 in the position corresponding to the class of the instance. For example, the *Embarked* column will have one-hot encoded vectors of [1,0,0], [0,1,0] and [0,0,1] for the three ports.

One-hot encoding is accomplished in pandas using `get_dummies` as given below. It simply creates a column for each class of a categorical variable.

In [None]:
pd.get_dummies(df['Embarked']).head()
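To see the one-hot vectors explicitly, here is a sketch on a short made-up series of port codes. The `dtype=int` argument is an addition for readability: recent pandas versions return boolean columns by default, and this forces 0/1 integers instead.

```python
import pandas as pd

# Toy series with hypothetical port codes
ports = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# One column per class, in sorted order of the class labels
one_hot = pd.get_dummies(ports, dtype=int)
print(one_hot)

# Every row contains exactly one 1
print(one_hot.sum(axis=1))
```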

We want the column names to be `'Port_C', 'Port_Q', 'Port_S'`. Copy the above code with `get_dummies` and modify it to [make use of the `prefix` keyword](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to alter the column names. Next, save this to a new dataframe named `port_df`.

In [None]:
# port_df = pd.get_dummies(df['Embarked'], prefix="Port")

To add this dataframe of three new columns to the original dataframe, we can use `concat` with `axis=1`.

In [None]:
# df = pd.concat([df, port_df], axis=1)
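With `axis=1`, `concat` places the dataframes side by side and aligns their rows on the index. The sketch below uses two made-up dataframes (hypothetical values) to show what happens when one of them lacks a row: the missing entries become `NaN`, so both dataframes should cover the same index.

```python
import pandas as pd

# Toy dataframes with hypothetical values; `right` has no row for index 3
left = pd.DataFrame({'Fare': [7.25, 71.28, 7.92]}, index=[1, 2, 3])
right = pd.DataFrame({'Port_S': [1, 0]}, index=[1, 2])

# Columns are placed side by side; rows are matched by index label
combined = pd.concat([left, right], axis=1)
print(combined)
```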

Now check that the new columns are added. 

In [None]:
df.head()

Next, drop the column for *Embarked*. 

In [None]:
# df = df.drop('Embarked', axis=1)

Note: if you run the above cell more than once, it will give an error, since the column *Embarked* is no longer present in the dataframe for the code to work.
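If a cell should be safe to re-run, one option is the `errors='ignore'` argument of `drop`, which skips labels that are not found. This is a sketch on a made-up dataframe, not a change to the exercise above.

```python
import pandas as pd

# Toy dataframe with hypothetical values
toy = pd.DataFrame({'Embarked': ['S', 'C'], 'Fare': [7.25, 71.28]})

# First call removes the column; the second call finds nothing to drop
# and, because of errors='ignore', does not raise
toy = toy.drop(columns='Embarked', errors='ignore')
toy = toy.drop(columns='Embarked', errors='ignore')

print(toy.columns.tolist())
```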

Next, we check the columns in our dataframe.

In [None]:
df.columns

The expected output is  
```Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Title', 'Port_C', 'Port_Q', 'Port_S'], dtype='object')```

Notes:
- One of the columns in the one-hot encoding obtained in the above manner is always redundant, since it can be inferred from the others. For features with just two classes, such as gender in our dataset, one-hot encoding is not truly useful: one of its columns is the same as what we obtained by simply replacing the classes with 0 and 1, and the other is redundant.
- The main disadvantage of one-hot encoding is the increase in the number of features, which can negatively affect our model. We will discuss this in later sessions.
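The redundancy mentioned in the first note can be checked directly. The sketch below, on a made-up series of port codes, uses the `drop_first=True` option of `get_dummies` to omit the first class and then recovers the omitted column from the remaining ones; `dtype=int` is again only there to get 0/1 values.

```python
import pandas as pd

# Toy series with hypothetical port codes
ports = pd.Series(['S', 'C', 'Q', 'S'])

full = pd.get_dummies(ports, prefix='Port', dtype=int)
reduced = pd.get_dummies(ports, prefix='Port', drop_first=True, dtype=int)

# A row belongs to the dropped class exactly when all remaining columns are 0
recovered_port_c = 1 - reduced.sum(axis=1)

print(reduced)
print(recovered_port_c.tolist())
```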


### Next steps:

Please proceed to the hands-on exercises.