## A Hands-on Workshop series in Machine Learning
### Session 1: Data Manipulation using `pandas`

### Meet the Instructor:
Aashita Kesarwani  
Current job: Scientific Computing Specialist at Harvey Mudd College  
Background:  
- PhD in Mathematics from Tulane University (Number Theory)
- Undergraduate from IIT (Indian Institute of Technology) in Applied Mathematics   

Other roles: 
- Visiting AI Researcher at [deepkapha.ai](https://www.linkedin.com/company/digitalis-kapha-b-v-/)
- [Open source contributor](https://pypi.org/user/Aashita/)

### Meet the Teaching Assistants: 
Daniel Ashcroft
- Senior Computer Science Major
- Interested in artificial intelligence, machine learning, and game design

Victoria Lloyd
- Sophomore Physics Major
- Interested in theoretical and computational physics and the ethics/applications of emerging technologies

##### What is machine learning?
- Learning from data without explicit programming.

##### How to approach a problem? 
- Data exploration and feature engineering 
- Model building, tuning and testing 
    
#### Topics to be covered today (Data manipulation using `pandas`)
- A brief overview of Jupyter Notebook
- Pandas dataframes as the data structure for datasets
- Converting csv files to dataframes 
- Slicing dataframes using conditionals as well as iloc and loc methods.
- Statistical summary and exploration of dataframes
- Detecting and filling missing values in the dataframes 
- Regular expressions for data extraction
- Feature engineering such as creating new features 
- Basic plots
- Basic operations such as dropping rows/columns, replacing values of a column using a dictionary, etc.
- Encoding categorical variables
- Correlation between features 

##### Structure of the session:

The notebook is divided into the following sections:
0. Preliminaries
1. Slicing rows and columns from the dataframe
2. Exploring the dataset (45 min exercise session)
3. Feature Engineering: Creating a new column for the titles of the passengers (30 min exercise session)
4. Encoding categorical variables
5. Correlation between variables 

You will follow along with me in some of the sessions and the rest are the exercises that you will work on in groups.

##### Note:
* Please raise your hand if you have any questions, or need any help anytime during the workshop.
* Solutions for all the exercises (including the ones in the follow-along section) will be provided at the end. There will also be a link for the lecture capture.
* Please leave us your feedback (a form will soon be emailed to you if you have signed up); it will be gladly taken into account for the next sessions.

### 0. Preliminaries
#### An overview of ***Jupyter Notebook***.
Jupyter Notebook is the de facto standard in Data Science inspired by the concept of [literate programming](https://www-cs-faculty.stanford.edu/~knuth/lp.html) introduced by [Donald Knuth](https://amturing.acm.org/award_winners/knuth_1013846.cfm). 

It is composed of blocks that are called cells.

There are two types of cells:
* Code cells
* Markdown cells (for text such as this cell itself)

You can run the code within a cell (no need to use the command line) by first selecting the cell and then pressing `Shift` + `Enter`. Another option is to use the `Run` button in the toolbar at the top.

In [None]:
a = 2
b = 3
a + b

#### Key shortcuts using command mode
Press `Esc` to activate the command mode:
Shortcuts:
* A: Insert cell above
* B: Insert cell below
* C: Copy 
* V: Paste 
* X: Cut 
* DD: Delete 
* M: Convert a cell to Markdown cell
* Shift: Lets you select multiple cells at once that you can copy/cut/delete.

To exit the command mode and return to edit mode, simply press `Enter`. You need to be in edit mode to change the contents of a cell.

#### Python refresher:

Q: What are the main data structures in Python?

- Strings   
Syntax: `A = "Aashita Kesarwani"`


- Lists   
Syntax: `B = [2, 6, 5, 3]`


- Dictionaries (key-value pairs)   
Syntax: `C = {"First name": "Aashita", "Last name": "Kesarwani"}`


In [None]:
A = "Aashita Kesarwani"
B = [2, 6, 5, 3]
C = {"First name": "Aashita", "Last name": "Kesarwani"}

Q: What is slicing in Python?  
A: To extract one or more elements from a sequence such as a string or a list.
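
As a minimal sketch of the idea, the cell below slices a made-up string and list (the names `word` and `nums` are hypothetical and not part of the workshop data). It shows single-element indexing, a range slice, and negative indices.

```python
# Hypothetical example values, not part of the workshop dataset
word = "pandas"
nums = [10, 20, 30, 40, 50]

first_char = word[0]   # a single element: index 0 is the first character
middle = word[1:4]     # a slice: indices 1, 2, 3 (the end index is excluded)
last_two = nums[-2:]   # negative indices count from the end of the sequence

print(first_char, middle, last_two)
```

The same bracket syntax works for both strings and lists, which is why the questions below can be answered with the same technique.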


What would `B[1]` in the list `B = [2, 6, 5, 3]` give?

In [None]:
B[1]

How do I get the letter `h` from the string `A = "Aashita Kesarwani"`?

How do I get my entire first name *Aashita* from the string A = "Aashita Kesarwani"? 

Q: Can we use Python lists for matrices (or 2-d arrays)?  
A: `Mat = [[1, 3], [2, 4]]`

In [None]:
Mat = [[1, 3], [2, 4]]
Mat[0]

In [None]:
Mat[0][1]

***Q: What are the `numpy` arrays? Why do we need them?***

`numpy`, short for Numerical Python, is one of the most commonly used Python packages. Numpy arrays are multidimensional arrays that are optimized for computing, especially for operations such as matrix multiplication.

To be able to use a python module, we first need to import it. Let us import all the relevant python modules that we are going to use today. Each of them will be introduced later on.

In [None]:
import numpy as np
import pandas as pd

# The following two modules matplotlib and seaborn are for plots
import matplotlib.pyplot as plt
import seaborn as sns # Comment this if seaborn is not installed
%matplotlib inline

# The module re is for regular expressions
import re

Notice that we are giving an alias to the modules we imported: `import numpy as np`. When we use the built-in functions in a module, we need to use the module name, so it is good to have shorter names for them. For example, we will create a `numpy` array using `np.array()` instead of `numpy.array()`.

In [None]:
M = np.array([1, 2, 3, 4])

Q: What is vectorization?

Applying an operation to the entire array at once instead of individual elements, thus eliminating `for` loops.

In [None]:
np.sqrt(M)
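
To make the contrast concrete, here is a small sketch comparing an explicit `for` loop with the vectorized call. The array `values` is a hypothetical sample chosen only for illustration.

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0, 16.0])  # hypothetical sample values

# Loop version: apply the operation to one element at a time
roots_loop = []
for v in values:
    roots_loop.append(np.sqrt(v))

# Vectorized version: a single call operates on the whole array
roots_vectorized = np.sqrt(values)

print(roots_vectorized)
print(np.allclose(roots_loop, roots_vectorized))  # both approaches agree
```

Both versions give the same numbers; the vectorized one is shorter and is typically much faster on large arrays.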

***Q: What are the `pandas` dataframes? Why do we need them? What is the crucial difference between numpy matrices and pandas dataframes?***

Pandas: an excellent tool to work with datasets

Dataframes: the central data structure of pandas library
- Evolved out of tables
- Most suitable for data manipulation tasks  

Pandas is built on top of numpy. The crucial difference between numpy matrices and pandas dataframes is that the columns in a Dataframe can be of different datatypes such as numerical, categorical, textual, etc.
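
The following sketch illustrates that difference on a tiny, made-up table (the dataframe `toy` and its values are assumptions for illustration, loosely modeled on the Titanic columns).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, loosely modeled on the Titanic columns
toy = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Age": [22.0, 38.0, 26.0],
    "Survived": [0, 1, 1],
})
print(toy.dtypes)  # each column keeps its own datatype

# A numpy array holds a single dtype, so mixing text and numbers
# converts every entry to a string
arr = np.array([["Passenger A", 22.0], ["Passenger B", 38.0]])
print(arr.dtype)
```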

First we load the [Titanic dataset from Kaggle](https://www.kaggle.com/c/titanic) stored in the `csv` file as a dataframe using [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function.

In [None]:
path = 'titanic/'
df = pd.read_csv(path + 'train.csv')

If using Google Colab, the above will give you an error.

You will have to first download the `train.csv` file from the [Github repository](https://github.com/AashitaK/A-Hands-on-Workshop-series-in-Machine-Learning/tree/master/Session%201/titanic) then manually upload it to Colab using: 
```python
from google.colab import files
uploaded = files.upload()
```

Then, load the file into pandas dataframe using `read_csv`:
```python
df = pd.read_csv('train.csv')
```

In [None]:
df

As this turns out to be a rather big dataset to display, we can comment out the above cell by adding `#` in front of `df` and run it again to get rid of the output.

Next, let's check the numbers of rows and columns in the dataset.

In [None]:
df.shape

So, the dataset consists of 891 rows and 12 columns.

We use `head()` function to peek into the first 5 rows (or any number of rows by using `head(n)`).

In [None]:
df.head()

The [Titanic dataset](https://www.kaggle.com/c/titanic) that we will explore today is sourced from a [Kaggle](https://www.kaggle.com/) beginner-level competition.  

Goal of the competition: To apply the tools of machine learning to predict which passengers survived the Titanic tragedy.

[Description for the columns](https://www.kaggle.com/c/titanic/data) is as follows.  

|Variable|	Definition|	Key|   
|:---  |:--- |:---|
|PassengerId| Passenger ID |
|Survived| 	Survival|	0 = No, 1 = Yes |
|Pclass	|Ticket class|	1 = 1st, 2 = 2nd, 3 = 3rd|
|Sex	|Sex|	
|Age	|Age in years	|
|SibSp	|# of siblings / spouses aboard the Titanic	|
|Parch	|# of parents / children aboard the Titanic	|
|Ticket	|Ticket number	|
|Fare	|Passenger fare	|
|Cabin	|Cabin number	|
|Embarked	|Port of Embarkation	|C = Cherbourg, Q = Queenstown, S = Southampton|

Q: What are the features? 

Features are nothing but the variables in our model or the columns in our dataset. For example, `Pclass`, `Age`, `Sex`, `Fare`, etc. are features for this particular dataset.

The final goal is to design a model to predict whether a passenger survives or not. 
* Which of the above features seem like important predictors? 
* How can you analyse the data in view of this objective?
    
Q: What is feature engineering?
* Detecting and handling missing values
* Encoding categorical features into numerical values
* Creating new features from the existing ones

### 1. Selecting rows and columns from the dataframe

How do we select a column from the dataframe? Say, we want to select the *Name* column from the dataframe. 

Remember, we used square brackets for indexing lists, strings and numpy arrays in Python, for example `A[0]`.

In [None]:
df['Name'].head()

Since we do not want all the rows in the output, we have used `head()` function at the end.

How do we select multiple columns? Suppose we want to select the columns *Name, Sex* and *Age* from the dataframe. Hint: Use a list of columns inside the square brackets.

We can also select rows by putting a certain condition on a column. Say, we want only those rows for which the gender is *'female'*. 

In [None]:
df[df["Sex"] == "female"].head(3)

Now, we want to retrieve only the female passengers traveling in the first class. 
Hint: Add another conditional `df['Pclass']==1` to the above code using `&` and make sure to wrap each of the two conditionals in parentheses.

We can also get the number of passengers using the `shape` attribute, which gives us both the number of rows and the number of columns. Write the code to count the number of female passengers traveling in the first class. 
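
As a hedged illustration of combining conditions, the sketch below uses a made-up dataframe `toy` and different columns from the exercise, so the values and variable names are assumptions.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Pclass": [1, 3, 1, 2],
    "Age": [45.0, 8.0, 12.0, 30.0],
})

# Each comparison yields a boolean Series with one True/False per row
is_child = toy["Age"] < 18
in_first = toy["Pclass"] == 1

# & combines the two conditions element by element
children_in_first = toy[is_child & in_first]

# Written inline, each condition needs its own parentheses because
# & binds more tightly than < and ==
same_rows = toy[(toy["Age"] < 18) & (toy["Pclass"] == 1)]

# shape is an attribute (no call parentheses): (rows, columns)
n_rows, n_cols = children_in_first.shape
print(n_rows, n_cols)
```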

#### The `loc` and `iloc` methods
So far, we have seen how to retrieve either some select columns or certain rows based on conditionals. What if we want to slice off a portion of the dataframe with some specific rows and columns? We use `.loc[]` or `.iloc[]` methods for this purpose. 
* The `.iloc[]` method is primarily integer-position based and gets rows/columns at particular positions in the index (so it only takes integers). 
* The `.loc[]` method is label based and gets rows/columns with particular labels from the index.
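
The distinction is easiest to see when the row labels differ from the row positions. The sketch below assumes a hypothetical dataframe `toy` whose index labels start at 101.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.92]},
    index=[101, 102, 103],  # hypothetical row labels
)

by_label = toy.loc[102, "Fare"]   # row labeled 102, column labeled "Fare"
by_position = toy.iloc[1, 1]      # second row, second column

label_slice = toy.loc[101:102]    # label slices include the end label
position_slice = toy.iloc[0:2]    # position slices exclude the end position

print(by_label, by_position)
print(label_slice.equals(position_slice))
```

Note that a label slice with `.loc[]` includes its end label, whereas a position slice with `.iloc[]` follows the usual Python convention of excluding the end.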

The `loc[]` method allows us to put conditions on rows and retrieve select columns simultaneously.

For example, we want to get the name and the survival information for all the adults above 70 years.

In [None]:
df.loc[df['Age']>70, ['Name', 'Survived']]

Write the code to retrieve the **name, age and survival** information for all the **female passengers traveling in the first class**. 

The `iloc[]` method lets us retrieve rows by passing a sequence of integer positions. For example, we can select the rows at positions 100 through 105. The slicing works exactly like Python lists and numpy arrays.

In [None]:
df.iloc[100:106]

Write the code to retrieve every 100th row from the dataframe.

Write the code to retrieve the last 10 rows from the dataframe using `iloc[]` method.

## Instructions for the exercise session:
- There are two exercise sections (sections 2 and 3) below, and they are allotted 45 min and 30 min respectively.
- The exercise involves new concepts not covered in the guided session above. Please feel free to ask questions and take help from the instructor and/or TAs.
- Hints are provided for each of the exercises. The built-in functions to be used for them are provided with a clickable link to the user manual. 
- The exercise sessions are time-bound and you are encouraged to work in groups to speed things up! 

### 2. Exploring the dataset (45 min)

Use [`describe`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) function for numerical features (columns) to get a brief overview of the statistics of the data.

Do the same as above for qualitative (non-numerical) features. Hint: Use `include=['O']` parameter in the [`describe`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) function.

Use the built-in pandas function to count the number of surviving and non-surviving passengers. Hint: Use [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) on the column `df['Survived']`.

Below is a pie chart of the same using `matplotlib`:

In [None]:
plt.axis('equal') 
plt.pie(df['Survived'].value_counts(), labels=('Died', "Survived"));

Below is a bar chart for the survival rate among male and female passengers using `seaborn`. Here is [Seaborn cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Seaborn_Cheat_Sheet.pdf).

In [None]:
sns.barplot(x = 'Sex', y = 'Survived', data = df);

Plot the survival rate among passengers in each ticket class.

We can also check the survival rate among both genders within the three ticket classes as follows.

In [None]:
sns.barplot(x='Pclass', y='Survived', hue='Sex', data=df);

From the above chart, do you think that gender affects the chance of survival equally for all three ticket classes? Or does it seem like the effect of gender is more pronounced for passengers of a certain ticket class than for others? We plot the point estimates and confidence intervals for each sub-category to see it more clearly.

In [None]:
sns.pointplot(x='Sex', y='Survived', hue='Pclass', data=df);

Notice the steeper slope for the second class.

It seems that gender and ticket class put together give more information about the survival chance than both of them separately. Please feel free to later explore other variables and combination of variables in depth in your own time.

How many children were on board? Hint: Use indexing on rows using a conditional on the *Age* column and then the [`shape`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) attribute to count the rows as seen above.

How many of the children on board survived? Hint: Add another conditional for the *Survived* column to the above code.

Use the functions [`isnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html) and [`sum()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html) on the dataframe to find out the number of missing values in each column.

Detecting missing values is an important first step in Feature Engineering, that is preparing the features (independent variables) to use for building the machine learning models. The next step is to handle those missing values. Depending on the data, sometimes it is a good idea to drop the rows or columns that have some or a lot of missing values, but that also means discarding relevant information. Another way to handle missing values is to fill them with something appropriate. 

1. Discuss the pros and cons of dropping the rows and/or columns with missing values in general. Should you drop none, all or some of the columns for this particular dataset in view of building the predictive model? Same question for dropping the rows with missing values.
2. If you consider filling the missing values, what are the possible options? Can you make use of other values in that column to fill the missing values? Can you make use of other values in that row as well as values in that column to fill the missing values?
3. Can the title in the name column be used for guessing a passenger's age based on the age values of other passengers with the same title?
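
As one illustration of the options discussed above, the sketch below counts, drops and fills missing values on a small made-up dataframe. The data and the choice of the median as the fill value are assumptions for illustration, not a recommendation for the Titanic dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})

# Count the missing values in each column
missing_per_column = toy.isnull().sum()

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill using other values from the same column (here, the median)
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

print(missing_per_column)
print(len(toy), len(dropped))
print(filled["Age"].tolist())
```

Dropping halves this toy table, which shows how quickly information can be lost; filling keeps all rows but introduces values that were never observed.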

What is the most common port of embarkation? Hint: Check the frequency (counts) of each value in the *Embarked* column using the built-in function [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) as seen above. 

As we saw above, there are missing values in the column for *Embarked*. Fill them with the most commonly occurring value. Hint: Use [`fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html).

Let us check whether the missing values for the *Embarked* column are indeed filled.

In [None]:
df.isnull().sum()

If not, there are two options to fix this. One is to set `inplace` parameter in the `fillna()` function as `True` and another is to use assignment operator `=` as in `df = df.function()`. 

***Question***: Why is the `inplace` keyword False by default? This is true not just for `fillna()` but for most built-in functions in pandas. 

Answer: To facilitate method chaining or piping i.e. invoking multiple operations one after the other. For example, `df.isnull().sum()` used above. Chaining is more commonly used in pandas as compared to another programming style i.e. using nested function calls. Please read more [here](https://towardsdatascience.com/the-unreasonable-effectiveness-of-method-chaining-in-pandas-15c2109e3c69), if interested.
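
The behavior described above can be sketched on a small, hypothetical Series `s`: calling `fillna()` without keeping its result leaves the original untouched, while assignment or chaining uses the returned object.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])  # hypothetical values

s.fillna(0)                        # returns a new Series; s itself is unchanged
still_missing = s.isnull().sum()

s = s.fillna(0)                    # assignment keeps the filled result
now_missing = s.isnull().sum()

# Chaining: each method is called on the object returned by the previous one
total = pd.Series([1.0, np.nan, 3.0]).fillna(0).sum()

print(still_missing, now_missing, total)
```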

We should remove the *Cabin* column from the DataFrame -- too many values are missing. Hint: Use [`drop()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) with appropriate value for the `axis` keyword. 

Let us check whether the column is indeed dropped. If not, modify the code above accordingly.

In [None]:
df.head()

What is the age of the oldest person on board? 

Find all the passenger information for the oldest person on board. Hint: Use [`loc[]`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) method with [`idxmax()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmax.html) for the Age column.

### 3. Feature Engineering: Creating a new column for the titles of the passengers (30 min)

Real-world datasets often contain useful information in textual format. Text mining is an important area of data science, and one of its most powerful tools is [regular expressions](https://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html), which are not specific to Python but have much wider applications.

In this section, you are going to create a new feature for the titles of the passengers derived from their names using regular expressions. For that, let us first take a look at the passengers' names. 

In [None]:
df.loc[:20, 'Name'].values

We notice that one of the identifying characteristics of the titles above is that they end with a period. Regular expressions are very useful for this kind of data extraction, and we will use them through the Python module `re` to extract the titles from the *Name* column. We will first use regular expression characters to construct a pattern and then use the built-in function `findall` for pattern matching.

Some useful regular expression characters:
- `\w`: pattern must contain a word character, such as letters.
- `[ ]`: pattern must contain one of the characters inside the square brackets. If there is only one character inside the square brackets, for example `[.]`, then the pattern must contain it.

Let's try this.

In [None]:
re.findall(r"\w\w[.]", 'Braund, Mr. Owen Harris')  # raw string keeps the backslashes literal

It worked! It returned a list instead of the string, so we use indexing to get the first element of the list.

In [None]:
re.findall(r"\w\w[.]", 'Braund, Mr. Owen Harris')[0]  # raw string keeps the backslashes literal

Let us try it on another name:

In [None]:
re.findall(r"\w\w[.]", 'Heikkinen, Miss. Laina')[0]  # raw string keeps the backslashes literal

So, we want a pattern that automatically detects the length of the title and returns the entire title.

In regular expressions, `+` is added to a character/pattern to denote that it is present one or more times. For example, `\w+` denotes one or more word characters. Fill in the regular expression in the cell below that will detect a period preceded by one or more word characters.

The output should be `'Miss.'`

Summary: For pattern matching the titles using regular expressions:
- First we make sure it contains a period by using `[.]`. 
- Secondly, the period must be preceded by word characters (one or more), so we use `\w+[.]`.
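
As a small sketch of the two patterns summarized above, the cell below applies them to a made-up name (the string `sample_name` is hypothetical and not taken from the dataset). The patterns are written as raw strings (`r"..."`) so that Python passes the backslash to `re` unchanged.

```python
import re

sample_name = "Doe, Rev. John"  # hypothetical name, not from the dataset

# Exactly two word characters followed by a period: cuts off longer titles
fixed_length = re.findall(r"\w\w[.]", sample_name)

# One or more word characters followed by a period: captures the whole title
variable_length = re.findall(r"\w+[.]", sample_name)

print(fixed_length)
print(variable_length)
```

Note that `findall` returns every match in the string, so a name containing more than one period would yield more than one element in the list.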

Write a function `get_title` that takes a name, extracts the title from it and returns the title.

Check that the function is working properly by running the following two cells.

In [None]:
get_title('Futrelle, Mrs. Jacques Heath (Lily May Peel)')

The output should be `'Mrs.'`. Note: Make sure that the function returns a string and not a list. Please modify the above function accordingly.

In [None]:
get_title('Simonius-Blumer, Col. Oberst Alfons')

The output should be `'Col.'`.

Create a new column named Title and extract titles from the Name column using the above function `get_title`. Hint: Use built-in [`map()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) function. The syntax is `df['New_column'] = df['Relevant_column'].map(function_name)`.
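
To illustrate the `map()` syntax without giving away the exercise, here is a sketch that applies a different, hypothetical function (`name_length`) to a made-up column.

```python
import pandas as pd

def name_length(name):
    """Return the number of characters in a name (hypothetical helper)."""
    return len(name)

toy = pd.DataFrame({"Name": ["Ann", "Robert", "Li"]})  # hypothetical data

# map() calls the function once per value and collects the results
toy["Name_length"] = toy["Name"].map(name_length)

print(toy)
```

The function is passed by name, without parentheses, because `map()` is the one that calls it for each value.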

Let us peek into the dataframe.

In [None]:
df.head()

List all the unique values for the titles along with their frequency. Hint: Use a built-in pandas function.

Now, we want to replace the various spellings of the same title with a single one. Hint: Use the dictionary below with the [`replace`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html) function.

`title_dictionary = {'Ms.': 'Miss.', 'Mlle.': 'Miss.', 
              'Dr.': 'Rare', 'Mme.': 'Mrs.', 
              'Major.': 'Rare', 'Lady.': 'Rare', 
              'Sir.': 'Rare', 'Col.': 'Rare', 
              'Capt.': 'Rare', 'Countess.': 'Rare', 
              'Jonkheer.': 'Rare', 'Dona.': 'Rare', 
              'Don.': 'Rare', 'Rev.': 'Rare'}`

List all the unique values for the titles along with their frequency to check that the titles are replaced properly.

What is the median age of passengers? Hint: Use the inbuilt function [`median`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html).

What is the median age of passengers with the title 'Miss.'? Hint: Use [`loc[]`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html) method for slicing off the select rows and the *Age* column.

What is the median age of passengers with the title 'Mrs.'?

Is there a noticeable difference in the median ages for the passengers with the above two titles? Should we take titles into account while filling the missing values for the *Age* column? If yes, how?

This is the end of the exercise session and the following code is part of the guided session. If you finished this and the above section earlier than the allotted time, then it is time to explore the dataset more on your own by first framing some questions and then searching online for useful pandas built-in functions. Please feel free to ask for help, if needed.

### 4. Encoding categorical variables

Let us check the datatype of each column. Hint: Use [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html).

In [None]:
df.dtypes

In machine learning, our models usually take numbers as inputs rather than strings. We have to convert categorical data into a form the model can recognize.

We convert the gender values to numerical values 0 and 1 using [`replace`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html) with a suitable dictionary. 

In [None]:
df = df.replace({'male': 0, 'female': 1})
df.head()

What can go wrong with randomly assigning numbers to categories?

There are two kinds of categorical variables based on whether the categories possess an inherent order or not:
* Ordinal categorical variables
* Nominal (unordered) categorical variables

For example, passengers' ticket class `Pclass` takes the values 1, 2, and 3. These three categories have an inherent order and hence it is an ordinal categorical variable. On the other hand, gender takes two values, male and female, which have no intrinsic ordering and hence it is a nominal categorical variable.

Does it mean that we can simply treat ordinal variables such as `Pclass` as just another numerical variable? Can you think of any problem this may cause in our model?

Other than a natural order, numbers also possess certain other properties. For example, the difference between the numbers 1 and 2 is the same as the difference between the numbers 2 and 3. 
$$2 - 1 = 3 - 2$$

Can we make the same claim for the categories labeled $1, 2,$ and $3$ in our ordinal variable `Pclass`?

So, converting categories to numbers means adding untrue assumptions that may or may not adversely affect our model. 

To address this, the commonly used method is one-hot encoding. In this method, we build a one-hot encoded vector with dimension equal to the number of classes in the categories. This vector consists of all 0's except for a 1 corresponding to the class of the instance. For example, the *Embarked* column will have one-hot encoded vectors of [1,0,0], [0,1,0] and [0,0,1] representing each of the three possible ports. This means that we will have three columns for the *Embarked* columns - one for each port and the values for these columns would simply be 1 or 0.
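
The construction described above can be sketched by hand with `numpy`, before turning to the pandas helper. The list `ports` is a hypothetical sample of values, and building the vectors from an identity matrix is just one possible way to do it.

```python
import numpy as np

ports = ["S", "C", "Q", "S"]        # hypothetical sample of port values
classes = sorted(set(ports))         # the distinct classes, in a fixed order
index_of = {c: i for i, c in enumerate(classes)}

# Row i of the identity matrix is the one-hot vector for class i
identity = np.eye(len(classes), dtype=int)
one_hot = identity[[index_of[p] for p in ports]]

print(classes)
print(one_hot)
```

Each row contains exactly one 1, and its position tells us which class the instance belongs to.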

One-hot encoding is accomplished in pandas using [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) as given below. It simply creates a column for each class of a categorical variable.

In [None]:
pd.get_dummies(df['Embarked']).head()

We want the column names to be `'Port_C', 'Port_Q', 'Port_S'`. Make use of the [`prefix` ](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) keyword in `get_dummies` to alter the column names and save the one-hot encoded vectors to a new dataframe named `port_df`.

In [None]:
port_df.head()

Concatenate the two dataframes `df` and `port_df` using [`concat`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) and save the resulting dataframe in `df`.

In [None]:
df.head()

Notes:
- One of the columns in the one-hot encoding obtained in the above manner is always redundant. In the case of features with just two classes, such as gender in our dataset, one-hot encoding is not truly useful: one of its columns is the same as what we obtained by simply replacing the classes with 0 and 1, and the other is redundant.  
- The main disadvantage of using one-hot encoding is the increase in the number of features that can negatively affect our model which we will discuss in the later sessions.
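
The redundancy mentioned in the first note can be seen on a small, made-up Series. The sketch also uses the `drop_first` keyword of `get_dummies`, which is not covered in this session, to show one way of removing the redundant column.

```python
import pandas as pd

sex = pd.Series(["male", "female", "female"], name="Sex")  # hypothetical data

full = pd.get_dummies(sex)                      # one column per class
reduced = pd.get_dummies(sex, drop_first=True)  # drops the first class column

# The columns of the full encoding always add up to 1 in every row,
# so any one of them can be recovered from the others
row_sums = full.astype(int).sum(axis=1)

print(list(full.columns), list(reduced.columns))
print(row_sums.tolist())
```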

### 5. Correlation between variables 

What are the possible ways to understand the correlation of features with survival? 

The Pearson correlation coefficient measures the linear correlation between two variables.

$$\rho_{X,Y} = \frac{cov(X, Y)}{\sigma_X \sigma_Y}$$
where 
- $cov(X, Y)$ is the covariance.    
- $\sigma_X, \sigma_Y$ are standard deviations of $X$ and $Y$ respectively.

The correlation between two variables ranges from -1 to 1. The closer the absolute value of the correlation is to 1, the stronger the linear relationship between the two features.
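
As a worked check of the definition (covariance divided by the product of the standard deviations), the sketch below computes the coefficient by hand on two made-up arrays and compares it with `np.corrcoef`. The arrays `x` and `y` are hypothetical.

```python
import numpy as np

# Hypothetical data: y is roughly 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Covariance: the mean product of deviations from the means
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Divide by the product of the standard deviations
# (np.std uses the same 1/n normalization as the covariance above)
rho = cov_xy / (x.std() * y.std())

rho_numpy = np.corrcoef(x, y)[0, 1]

print(rho, rho_numpy)
```

Since `y` grows almost exactly in proportion to `x`, the coefficient comes out very close to 1.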

We can get the correlation matrix for the variables (columns) in the dataset using the built-in function `corr()`.

In [None]:
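# Correlation is defined only for numerical columns, so select those first
df.select_dtypes(include='number').corr()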

* From the above matrix, note which feature has the highest correlation with the survival. 
* Do features have high correlation among themselves? 
* Note that this matrix has excluded some categorical variables like gender, port of embarkation, etc. 

The correlation matrix can also be visualized using heatmaps as shown below.

In [None]:
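# Correlation is defined only for numerical columns, so select those first
correlation_matrix = df.select_dtypes(include='number').corr()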
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(correlation_matrix);

#### Acknowledgment:
* The [Titanic dataset from Kaggle](https://www.kaggle.com/c/titanic), openly available on Kaggle, is used in the exercises.


In the next session, you will learn two machine learning algorithms and work to build a prediction model on an Election Dataset by [American National Election Studies](https://electionstudies.org/). The feature engineering you learnt today will be integral to prepare the data before applying the machine learning algorithms on it. Please make sure to finish the exercise involving regular expressions. 

Please make sure to fill out the feedback form. It is highly appreciated.