# Introduction
In Part III, we explored the data extensively to build an intuition for it.

In this part, we dummify (one-hot encode) our categorical variables and prepare the data for model training in the final part.

Recommended readings:
1. https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
2. https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/

In this notebook, we will do the following:
1. one-hot encode our categorical data
2. extract specific columns containing numerical data
3. merge the numerical data with the dummified categorical data
4. export the final DataFrame as CSV

There are various ways to one-hot encode (dummify) categorical values.

Whichever method you use, make sure you drop one column to avoid the <strong>dummy variable trap</strong>.

For example, if your column contains four categories, one-hot encoding produces four columns, and you have to drop one of them.

![OnehotEncodingExample.png](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectLearningAnalytics/OnehotEncodingExample.png)

In this example, we drop is_Purple because it is redundant: you can infer its value from is_Black, is_Blue, and is_Red. If all three are 0, the row must be purple.
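
As a minimal sketch of the idea, the snippet below one-hot encodes a toy colour column with `pd.get_dummies`. The DataFrame, the `colour` column name, and the `is` prefix are made up for illustration and are not part of the course dataset. It shows two ways to avoid the trap: dropping a chosen column by hand, or passing `drop_first=True`.

```python
import pandas as pd

# Toy data for illustration only (not the course dataset)
colours = pd.DataFrame({"colour": ["Black", "Blue", "Red", "Purple", "Blue"]})

# One 0/1 column per category; columns come out in sorted category order
all_dummies = pd.get_dummies(colours["colour"], prefix="is", dtype=int)

# Option A: drop a chosen column explicitly (is_Purple, as in the figure)
dropped_purple = all_dummies.drop(columns="is_Purple")

# Option B: let pandas drop the first category in sorted order (is_Black here)
dropped_first = pd.get_dummies(
    colours["colour"], prefix="is", drop_first=True, dtype=int
)

print(dropped_purple)
print(dropped_first)
```

Note that `drop_first=True` removes the first category in sorted order, so on this toy data it drops is_Black rather than is_Purple. Either choice removes the redundancy.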

Suggested reading: https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/

### Step 1: Import pandas
You only need pandas in this notebook. Import it as pd.

In [None]:
# Step 1: Import pandas as pd

### Step 2: Read the CSV from Part II
We will work with the CSV exported from the merged DataFrame in Part II.

In [None]:
# Step 2: Read your CSV from Part II

### Step 3: Replace 55<= in age_band with 55+
The characters '<' and '=' would carry over into the dummy column names, and some models do not accept them in feature names, so we replace these values in the age_band column before dummifying.

![Replace55.png](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectLearningAnalytics/Replace55.png)

<strong>Hint: Google "replace values in column pandas"</strong>
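
One possible approach, sketched on toy data, is `Series.replace`. The band labels other than '55<=' below are assumed for illustration, and the DataFrame name `df` is hypothetical.

```python
import pandas as pd

# Toy data; labels other than '55<=' are assumed for illustration
df = pd.DataFrame({"age_band": ["0-35", "35-55", "55<=", "0-35"]})

# Replace exact matches of '55<=' and assign the result back to the column
df["age_band"] = df["age_band"].replace("55<=", "55+")

print(df["age_band"].unique())
```

Assigning the result back to the column is what makes the change stick, since `replace` returns a new Series by default.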

In [None]:
# Step 3: Replace '55<=' in age_band with '55+'

### Step 4: Dummify code_module

In [None]:
# Step 4: Declare a variable and store the dummified/one-hot encoded values from 'code_module'

### Step 5: Dummify code_presentation

In [None]:
# Step 5: Declare a variable and store the dummified/one-hot encoded values from 'code_presentation'

### Step 6: Dummify gender

In [None]:
# Step 6: Declare a variable and store the dummified/one-hot encoded values from 'gender'

### Step 7: Dummify region

In [None]:
# Step 7: Declare a variable and store the dummified/one-hot encoded values from 'region'

### Step 8: Dummify imd_band

In [None]:
# Step 8: Declare a variable and store the dummified/one-hot encoded values from 'imd_band'

### Step 9: Dummify age_band

In [None]:
# Step 9: Declare a variable and store the dummified/one-hot encoded values from 'age_band'

### Step 10: Dummify disability

In [None]:
# Step 10: Declare a variable and store the dummified/one-hot encoded values from 'disability'

### Step 11: Concatenate all of the dummies into a single DataFrame
Now that we're done dummifying the categorical columns, let's concatenate them side by side into a single wide DataFrame.

![ConcatenatedDummies.png](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectLearningAnalytics/ConcatenatedDummies.png)

Sanity check - if you dummified properly, i.e. dropped the first column from each set of dummies, you should have:
1. 28,785 rows
2. 35 columns
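
A minimal sketch of column-wise concatenation with `pd.concat`, using a toy DataFrame with only two categorical columns. The variable names and the toy values are assumptions for illustration; the real data has more columns and rows.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "gender": ["M", "F", "F"],
    "disability": ["N", "Y", "N"],
})

gender_dummies = pd.get_dummies(
    toy["gender"], prefix="gender", drop_first=True, dtype=int
)
disability_dummies = pd.get_dummies(
    toy["disability"], prefix="disability", drop_first=True, dtype=int
)

# axis=1 places the frames side by side; rows are matched on the index
all_dummies = pd.concat([gender_dummies, disability_dummies], axis=1)

# Checking the shape is a quick way to run the sanity check
print(all_dummies.shape)
```

Because `pd.concat` with `axis=1` aligns rows on the index, the pieces should all come from the same DataFrame without any reordering or filtering in between.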

In [None]:
# Step 11: Concatenate all of your dummy DataFrames into a single DataFrame

### Step 12: Replace values in final_result with binaries
Let's replace the 'final_result' values with 1 and 0, because we only care about two outcomes of a programme: pass or fail.

As such, we shall bin our categorical results like this:
1. Pass/Distinction (1)
2. Withdrawn/Fail (0)

![TransformDependentVariable.png](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectLearningAnalytics/TransformDependentVariable.png)

There are a few ways to do it, e.g., mapping, replacing the column with a new list, etc. 
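
The mapping approach could look like the sketch below, which uses `Series.map` with a dictionary on toy data. The DataFrame name `results` and the dictionary name `outcome_map` are hypothetical.

```python
import pandas as pd

# Toy data for illustration only
results = pd.DataFrame(
    {"final_result": ["Pass", "Fail", "Distinction", "Withdrawn"]}
)

# Pass/Distinction become 1, Withdrawn/Fail become 0
outcome_map = {"Pass": 1, "Distinction": 1, "Withdrawn": 0, "Fail": 0}
results["final_result"] = results["final_result"].map(outcome_map)

# map() leaves NaN for any label missing from the dictionary,
# so a null count doubles as a check that every label was covered
print(results["final_result"].isna().sum())
```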

In [None]:
# Step 12: Replace the categorical values in final_result

### Step 13: Extract numerical columns
Before we do anything else, let's take only the necessary data from the original DataFrame.

We only need these columns for now:
1. num_of_prev_attempts
2. studied_credits
3. sum_click
4. mean
5. max
6. min
7. final_result
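
Selecting these columns can be sketched by passing a list of names in square brackets. The toy values and the extra `gender` column below are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "gender": ["M", "F"],
    "num_of_prev_attempts": [0, 1],
    "studied_credits": [60, 120],
    "sum_click": [100, 250],
    "mean": [70.5, 55.0],
    "max": [90, 80],
    "min": [40, 30],
    "final_result": [1, 0],
})

numeric_columns = [
    "num_of_prev_attempts", "studied_credits", "sum_click",
    "mean", "max", "min", "final_result",
]

# A list inside the brackets returns a DataFrame with only those columns;
# .copy() avoids modifying a view of the original later on
numeric = toy[numeric_columns].copy()

print(numeric.shape)
```

Bracket notation matters here: `mean`, `max`, and `min` are also DataFrame method names, so `toy.mean` refers to the method rather than the column.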

In [None]:
# Step 13: Subset the original DataFrame

### Step 14: Concatenate the DataFrame from Step 13 to the DataFrame from Step 11
Now that we have all we need, let's combine our DataFrame with the rest of the dummies.

![FinalDataFrame.png](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectLearningAnalytics/FinalDataFrame.png)

Sanity check (35 dummy columns plus the 7 columns from Step 13):
1. 28,785 rows 
2. 42 columns

In [None]:
# Step 14: Concatenate the DataFrames from Step 13 and Step 11

### Step 15: Export the final DataFrame as a CSV
You're almost done. Once the final DataFrame is exported, you can proceed to modelling.
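
A minimal sketch of the export using `DataFrame.to_csv`, followed by a read-back to confirm the file looks as expected. The toy DataFrame and the file name `final_dataframe.csv` are hypothetical, and the file is written to a temporary directory so the sketch does not touch your working folder.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the final DataFrame
final_df = pd.DataFrame({
    "studied_credits": [60, 120],
    "gender_M": [1, 0],
    "final_result": [1, 0],
})

# Hypothetical file name, written to a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "final_dataframe.csv")

# index=False keeps the row index from being written as an extra column
final_df.to_csv(out_path, index=False)

# Read the file back to confirm the columns and rows survived the round trip
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```

Without `index=False`, the index is saved as an unnamed first column, which would show up as an extra feature when the CSV is loaded for modelling.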

In [None]:
# Step 15: Export the final DataFrame as a CSV