# Data Preparation

## Table of Contents

1. [Data Cleaning for Baseline Model](#dc-base)
    - [1.1 Marital Status One Hot Encoding](#step1)
    - [1.2 Application Mode One Hot Encoding](#step2)
    - [1.3 Course One Hot Encoding](#step3)
    - [1.4 Rename Daytime/Nighttime Attendance](#step4)
    - [1.5 Modify Previous Qualifications](#step5)
    - [1.6 Nationality Mapping](#step6)
    - [1.7 Modify Mother's Qualifications](#step7)
    - [1.8 Modify Father's Qualifications](#step8)
    - [1.9 Mother's Occupation Cleaning](#step9)

In [413]:
import pandas as pd

data = pd.read_csv("data/student-dropout-academic-success.csv", sep=";")
df = data.copy()
data.head(3)

| | Marital status | Application mode | Application order | Course | Daytime/evening attendance\t | Previous qualification | Previous qualification (grade) | Nacionality | Mother's qualification | Father's qualification | ... | Curricular units 2nd sem (credited) | Curricular units 2nd sem (enrolled) | Curricular units 2nd sem (evaluations) | Curricular units 2nd sem (approved) | Curricular units 2nd sem (grade) | Curricular units 2nd sem (without evaluations) | Unemployment rate | Inflation rate | GDP | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17 | 5 | 171 | 1 | 1 | 122.0 | 1 | 19 | 12 | ... | 0 | 0 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 1 | 1 | 15 | 1 | 9254 | 1 | 1 | 160.0 | 1 | 1 | 3 | ... | 0 | 6 | 6 | 6 | 13.666667 | 0 | 13.9 | -0.3 | 0.79 | Graduate |
| 2 | 1 | 1 | 5 | 9070 | 1 | 1 | 122.0 | 1 | 37 | 37 | ... | 0 | 6 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |


In [414]:
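# normalize column names: trim stray whitespace (one raw name ends in a tab), lowercase, use underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")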
# df.columns
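
The column clean-up above matters later, because one raw header ends in a tab character. As a minimal sketch on two of the raw names (the `raw` and `cleaned` variables are illustrative, not part of the notebook), the normalization behaves like this; the order of the three steps does not change the result for these names:

```python
import pandas as pd

# Two raw headers copied from the CSV; the second ends in a literal tab
raw = pd.Index(["Marital status", "Daytime/evening attendance\t"])

# Trim surrounding whitespace, lowercase, and replace spaces with underscores
cleaned = raw.str.strip().str.lower().str.replace(" ", "_")
print(list(cleaned))
```

With the tab removed, the later rename of `daytime/evening_attendance` can match the column name exactly.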

## Data Cleaning for Baseline Model <a class="anchor" id="dc-base"></a>

We will start by cleaning the data for the baseline model. The target variable is kept as is, with its three classes: Dropout, Enrolled, and Graduate.

### 1.1 Marital Status <a class="anchor" id="step1"></a>

For now, we will one-hot encode this feature, using `single` as the reference category.

In [415]:
df.marital_status.unique()

array([1, 2, 4, 3, 5, 6], dtype=int64)

In [416]:
# mapping to string values so one hot encoding columns make sense
df.marital_status = df.marital_status.map(
    {1: "single", 2: "married", 3: "widower", 4: "divorced", 5: "facto_union", 6: "legally_separated"}
)

In [417]:
marital_status_df = pd.get_dummies(df.marital_status, "marital_status", dtype="int")\
    .drop("marital_status_single", axis=1)
df = df.join(marital_status_df)
df.shape

(4424, 42)
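
To make the effect of dropping the reference column concrete, here is a small sketch on made-up rows (the `toy` frame is an assumption for illustration, not the real data). After the drop, a row of all zeros stands for the reference category:

```python
import pandas as pd

# Toy rows, not taken from the dataset
toy = pd.DataFrame({"marital_status": ["single", "married", "single", "divorced"]})

# One indicator column per category, named with the "marital_status_" prefix
dummies = pd.get_dummies(toy["marital_status"], prefix="marital_status", dtype="int")

# Drop the reference category; all-zero rows now mean "single"
dummies = dummies.drop(columns="marital_status_single")

encoded = toy.join(dummies)
print(encoded)
```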

### 1.2 Application Mode <a class="anchor" id="step2"></a>

The integer codes need to be converted back to labels and then one-hot encoded. For now I will just use `drop_first=True`, because there is no intuitive choice of reference column for this feature.

In [418]:
df.application_mode = df.application_mode.map({
    1: "1st_phase_general_contingent", 2: "Ordinance_No_612/93", 
    5: "1st_phase_special_contingent_Azores_Island", 
    7: "Holders_of_other_higher_courses", 10: "Ordinance_No_854-B/99",
    15: "International_student_bachelor", 
    16: "1st_phase_special_contingent_Madeira_Island",
    17: "2nd_phase_general_contingent", 18: "3rd_phase_general_contingent",
    26: "Ordinance_No_533-A/99_item_b2_Different_Plan",
    27: "Ordinance_No_533-A/99_item_b3_Other_Institution",
    39: "Over_23_years_old", 42: "Transfer", 43: "Change_of_course",
    44: "Technological_specialization_diploma_holders",
    51: "Change_of_institution_course", 53: "Short_cycle_diploma_holders",
    57: "Change_of_institution_course_International"
})

In [419]:
application_mode_dummies = pd.get_dummies(df.application_mode, "application_mode", drop_first=True, dtype="int")
# application_mode_dummies

In [420]:
df = df.join(application_mode_dummies)
df.shape

(4424, 59)
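
Since `drop_first=True` picks the reference column automatically, it can help to check which one it removes. The sketch below uses a toy series with three of the labels (the variable names are illustrative); `get_dummies` orders the categories by sorted label and drops the first one:

```python
import pandas as pd

# Toy series with three of the application-mode labels
toy = pd.Series(["Transfer", "Over_23_years_old", "1st_phase_general_contingent", "Transfer"])

all_cols = pd.get_dummies(toy, prefix="application_mode", dtype="int")
kept_cols = pd.get_dummies(toy, prefix="application_mode", drop_first=True, dtype="int")

# The column that drop_first removed is the implicit reference category
dropped = [c for c in all_cols.columns if c not in kept_cols.columns]
print(dropped)
```

On this toy series the dropped column is the one whose label sorts first, which is the label starting with a digit.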

### 1.3 Course <a class="anchor" id="step3"></a>

For this feature I need to map the integer codes to course names and then one-hot encode them.

I will use Nursing as the reference column because it has the most rows.

In [421]:
# mapping course numbers to their respective name
df.course = df.course.map({
    33: "Biofuel_Production_Technologies", 171: "Animation_and_Multimedia_Design",
    8014: "Social_Service_evening_attendance", 9003: "Agronomy",
    9070: "Communication_Design", 9085: "Veterinary_Nursing",
    9119: "Informatics_Engineering", 9130: "Equinculture",
    9147: "Management", 9238: "Social_Service",
    9254: "Tourism", 9500: "Nursing",
    9556: "Oral_Hygiene", 9670: "Advertising_and_Marketing_Management",
    9773: "Journalism_and_Communication", 9853: "Basic_Education",
    9991: "Management_evening_attendance"
})

In [422]:
courses_dummies = pd.get_dummies(df.course, "course", dtype="int")\
    .drop(columns=["course_Nursing"])
# courses_dummies.columns

In [423]:
df = df.join(courses_dummies)
df.shape

(4424, 75)
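
Choosing the most frequent category as the reference can also be done programmatically rather than by hard-coding the column name. This is a sketch on toy values (the `toy` series and `reference` variable are assumptions for illustration), not the notebook's own approach:

```python
import pandas as pd

# Toy course labels; "Nursing" is the most frequent here by construction
toy = pd.Series(["Nursing", "Tourism", "Nursing", "Management", "Nursing"], name="course")

# Most frequent label becomes the reference category
reference = toy.value_counts().idxmax()

# Encode, then drop the reference column by its derived name
dummies = pd.get_dummies(toy, prefix="course", dtype="int")
dummies = dummies.drop(columns=[f"course_{reference}"])
print(reference, list(dummies.columns))
```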

### 1.4 Daytime/Nighttime Attendance <a class="anchor" id="step4"></a>

In [424]:
# rename column for clarity
df = df.rename(columns={"daytime/evening_attendance": "daytime_attendance"})

### 1.5 Previous Qualifications <a class="anchor" id="step5"></a>

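For now, I will encode this feature as an ordinal feature: the raw codes are first grouped into five broad levels, which are then ranked from `other` (lowest) to `graduate_school` (highest).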

In [425]:
mapper = {
    1: "high_school", 2: "higher_education", 3: "higher_education", 4: "graduate_school",
    5: "graduate_school", 6: "higher_education",  9: "other", 10: "other", 11: "other", 12: "other", 
    14: "other", 15: "other", 18: "technical_training", 19: "other", 22: "technical_training", 26: "other",
    27: "other", 29: "other", 30: "other", 34: "other", 35: "other", 36: "other", 37: "other",
    38: "other", 39: "technical_training",
    40: "higher_education", 41: "technical_training", 42: "technical_training", 43: "graduate_school", 44: "graduate_school"
}

In [426]:
df.previous_qualification = df.previous_qualification.map(mapper)
df.previous_qualification.value_counts()

previous_qualification
high_school           3717
technical_training     255
other                  232
higher_education       205
graduate_school         15
Name: count, dtype: int64

In [427]:
from sklearn.preprocessing import OrdinalEncoder

ord_encoder = OrdinalEncoder(
    categories=[["other", "high_school", "technical_training", "higher_education", "graduate_school"]]
)
ord_encoder.set_output(transform="pandas")
df.previous_qualification = ord_encoder.fit_transform(df[["previous_qualification"]])

In [428]:
df.previous_qualification.value_counts()

previous_qualification
1.0    3717
2.0     255
0.0     232
3.0     205
4.0      15
Name: count, dtype: int64
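
The integer assigned to each level comes from its position in the `categories` list, not from alphabetical order. A minimal sketch on toy values (the `toy` frame and `order` list are illustrative) shows this:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column using three of the five levels
toy = pd.DataFrame({"qualification": ["high_school", "other", "graduate_school", "high_school"]})

# Position in this list determines the encoded value (0 for the first entry)
order = ["other", "high_school", "technical_training", "higher_education", "graduate_school"]

encoder = OrdinalEncoder(categories=[order])
codes = encoder.fit_transform(toy[["qualification"]])  # 2D float array, one column

toy["qualification_code"] = codes.ravel()
print(toy)
```

Levels that are listed in `order` but absent from the data are still allowed; they simply do not appear in the output.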

### 1.6 Nationality <a class="anchor" id="step6"></a>

This feature is mapped to nationality labels and then one-hot encoded, with Portuguese as the reference category.

In [429]:
df = df.rename(columns={"nacionality": "nationality"})  # renaming column

# mapping column to categories
df.nationality = df.nationality.map({
    1: "Portuguese", 2: "German", 6: "Spanish", 11: "Italian",
    13: "Dutch", 14: "English", 17: "Lithuanian", 21: "Angolan",
    22: "Cape_Verdean", 24: "Guinean", 25: "Mozambican",
    26: "Santomean", 32: "Turkish", 41: "Brazilian",
    62: "Romanian", 100: "Moldova_Republic_of",
    101: "Mexican", 103: "Ukrainian", 105: "Russian",
    108: "Cuban", 109: "Colombian"
})

In [430]:
nationality_to_dummies = pd.get_dummies(df.nationality, "nationality", dtype="int")\
    .drop(columns=["nationality_Portuguese"])
df = df.join(nationality_to_dummies)
df.shape

(4424, 95)

### 1.7 Mother's Qualification <a class="anchor" id="step7"></a>

This feature uses codes similar enough to Previous Qualifications that the earlier `mapper` can be reused.

In [431]:
df["mother's_qualification"] = df["mother's_qualification"].map(mapper)
ord_encoder2 = OrdinalEncoder(
    categories=[["other", "high_school", "technical_training", "higher_education", "graduate_school"]]
)
ord_encoder2.set_output(transform="pandas")
df["mother's_qualification"] = ord_encoder2.fit_transform(df[["mother's_qualification"]])
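
When reusing a mapper on a different column, it is worth confirming that every code in the column has an entry, since `Series.map` turns unmapped values into `NaN`. The following is a sketch with a deliberately incomplete toy mapper (`toy_mapper` and `codes` are made up for illustration):

```python
import pandas as pd

# Toy mapper that deliberately lacks an entry for code 13
toy_mapper = {1: "high_school", 2: "higher_education", 9: "other"}
codes = pd.Series([1, 2, 9, 13, 1])

# Codes present in the data but absent from the mapper
missing = sorted(set(codes.unique().tolist()) - set(toy_mapper))
print(missing)

# Unmapped codes silently become NaN after mapping
mapped = codes.map(toy_mapper)
print(mapped.isna().sum())
```

An empty `missing` list means the mapper covers the column.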

### 1.8 Father's Qualification <a class="anchor" id="step8"></a>

This feature needs a new mapper, after which it is encoded as an ordinal feature.

In [432]:
mapper_father = {}
other_keys = [7, 8, 9, 10, 11, 12, 17, 18, 19, 20, 21, 24, 25, 26, 27, 28, 35, 36, 37, 38]
high_school_keys = [1, 14]
technical_training_keys = [13, 15, 16, 22, 23, 29, 32, 39, 41, 42]
higher_ed_keys = [2, 3, 6, 30, 31, 40]
graduate_school_keys = [4, 5, 33, 34, 43, 44]

In [433]:
def add_to_dict(mapper_dict, key_list_, value):
    for key in key_list_:
        mapper_dict[key] = value
    return mapper_dict

In [434]:
in_data = set(df["father's_qualification"].unique())

In [435]:
categories = ["other", "high_school", "technical_training", "higher_education", "graduate_school"]
key_lists = [other_keys, high_school_keys, technical_training_keys, higher_ed_keys, graduate_school_keys]
for category, key_list in zip(categories, key_lists):
    mapper_father = add_to_dict(mapper_father, key_list, category)

In [436]:
in_mapper = set(mapper_father.keys())
in_data.difference(in_mapper)

set()

In [437]:
df["father's_qualification"] = df["father's_qualification"].map(mapper_father)

In [438]:
ord_encoder3 = OrdinalEncoder(
    categories=[["other", "high_school", "technical_training", "higher_education", "graduate_school"]]
)
ord_encoder3.set_output(transform="pandas")
df["father's_qualification"] = ord_encoder3.fit_transform(df[["father's_qualification"]])
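
The code-to-level mapper can also be built by inverting a dictionary that lists the codes for each level. This is an alternative sketch with shortened toy key lists (`groups` and `toy_mapper` are illustrative names), not a replacement for the cells above:

```python
# Level -> codes, using shortened toy key lists
groups = {
    "other": [7, 8],
    "high_school": [1, 14],
    "graduate_school": [4, 5],
}

# Invert to code -> level with a nested dict comprehension
toy_mapper = {code: level for level, codes in groups.items() for code in codes}
print(toy_mapper)
```

Keeping the grouping in one dictionary makes it easier to see which codes belong to which level.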

### 1.9 Mother's Occupation <a class="anchor" id="step9"></a>