# Data Preparation

## Table of Contents

1. [Data Cleaning for Baseline Model](#dc-base)
    - [Marital Status One Hot Encoding](#step1)
    - [Application Mode One Hot Encoding](#step2)
    - [Course One Hot Encoding](#step3)
    - [Rename Daytime/Nighttime Attendance](#step4)
    - [Modify Previous Qualifications](#step5)

In [1]:
import pandas as pd

data = pd.read_csv("data/student-dropout-academic-success.csv", sep=";")
df = data.copy()
data.head()

| | Marital status | Application mode | Application order | Course | Daytime/evening attendance\t | Previous qualification | Previous qualification (grade) | Nacionality | Mother's qualification | Father's qualification | ... | Curricular units 2nd sem (credited) | Curricular units 2nd sem (enrolled) | Curricular units 2nd sem (evaluations) | Curricular units 2nd sem (approved) | Curricular units 2nd sem (grade) | Curricular units 2nd sem (without evaluations) | Unemployment rate | Inflation rate | GDP | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17 | 5 | 171 | 1 | 1 | 122.0 | 1 | 19 | 12 | ... | 0 | 0 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 1 | 1 | 15 | 1 | 9254 | 1 | 1 | 160.0 | 1 | 1 | 3 | ... | 0 | 6 | 6 | 6 | 13.666667 | 0 | 13.9 | -0.3 | 0.79 | Graduate |
| 2 | 1 | 1 | 5 | 9070 | 1 | 1 | 122.0 | 1 | 37 | 37 | ... | 0 | 6 | 0 | 0 | 0.0 | 0 | 10.8 | 1.4 | 1.74 | Dropout |
| 3 | 1 | 17 | 2 | 9773 | 1 | 1 | 122.0 | 1 | 38 | 37 | ... | 0 | 6 | 10 | 5 | 12.4 | 0 | 9.4 | -0.8 | -3.12 | Graduate |
| 4 | 2 | 39 | 1 | 8014 | 0 | 1 | 100.0 | 1 | 37 | 38 | ... | 0 | 6 | 6 | 6 | 13.0 | 0 | 13.9 | -0.3 | 0.79 | Graduate |


In [14]:
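# strip stray leading/trailing whitespace first, then lower-case and replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")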
df.columns

Index(['marital_status', 'application_mode', 'application_order', 'course',
       'daytime/evening_attendance', 'previous_qualification',
       'previous_qualification_(grade)', 'nacionality',
       "mother's_qualification", "father's_qualification",
       "mother's_occupation", "father's_occupation", 'admission_grade',
       'displaced', 'educational_special_needs', 'debtor',
       'tuition_fees_up_to_date', 'gender', 'scholarship_holder',
       'age_at_enrollment', 'international',
       'curricular_units_1st_sem_(credited)',
       'curricular_units_1st_sem_(enrolled)',
       'curricular_units_1st_sem_(evaluations)',
       'curricular_units_1st_sem_(approved)',
       'curricular_units_1st_sem_(grade)',
       'curricular_units_1st_sem_(without_evaluations)',
       'curricular_units_2nd_sem_(credited)',
       'curricular_units_2nd_sem_(enrolled)',
       'curricular_units_2nd_sem_(evaluations)',
       'curricular_units_2nd_sem_(approved)',
       'curricular_units_2nd_s

## Data Cleaning for Baseline Model <a class="anchor" id="dc-base"></a>

We start by cleaning the data for the baseline model. The baseline keeps the target variable as is, with its three classes: Dropout, Enrolled, and Graduate.

### 1.1 Marital Status <a class="anchor" id="step1"></a>

For now, we one-hot encode this feature, using `single` as the reference category (its indicator column is dropped).

In [3]:
df.marital_status.unique()

array([1, 2, 4, 3, 5, 6], dtype=int64)

In [4]:
# mapping to string values so one hot encoding columns make sense
df.marital_status = df.marital_status.map(
    {1: "single", 2: "married", 3: "widower", 4: "divorced", 5: "facto_union", 6: "legally_separated"}
)

In [5]:
marital_status_df = (
    pd.get_dummies(df.marital_status, prefix="marital_status", dtype="int")
    .drop(columns="marital_status_single")  # single is the reference category
)
df = df.join(marital_status_df)
df.shape

(4424, 42)
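
The encoding pattern used above can be sketched on a toy frame. This is a minimal illustration, assuming a made-up four-row `toy` DataFrame rather than the real dataset; it shows that after dropping the reference column, a row of all zeros stands for `single`.

```python
import pandas as pd

# Toy data (made up for illustration; not the real dataset)
toy = pd.DataFrame({"marital_status": ["single", "married", "single", "divorced"]})

# One indicator column per category, prefixed with the feature name
dummies = pd.get_dummies(toy["marital_status"], prefix="marital_status", dtype="int")

# Drop the reference category; all-zero rows now mean "single"
dummies = dummies.drop(columns="marital_status_single")

encoded = toy.join(dummies)
print(encoded)
```

Dropping one category avoids perfectly collinear indicator columns, which matters for linear baselines.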

### 1.2 Application Mode <a class="anchor" id="step2"></a>

The integer codes need to be mapped back to their labels and then one-hot encoded. For now I simply use `drop_first=True`, because there is no intuitively obvious reference category to drop.

In [6]:
df.application_mode = df.application_mode.map({
    1: "1st_phase_general_contingent", 2: "Ordinance_No_612/93", 
    5: "1st_phase_special_contingent_Azores_Island", 
    7: "Holders_of_other_higher_courses", 10: "Ordinance_No_854-B/99",
    15: "International_student_bachelor", 
    16: "1st_phase_special_contingent_Madeira_Island",
    17: "2nd_phase_general_contingent", 18: "3rd_phase_general_contingent",
    26: "Ordinance_No_533-A/99_item_b2_Different_Plan",
    27: "Ordinance_No_533-A/99_item_b3_Other_Institution",
    39: "Over_23_years_old", 42: "Transfer", 43: "Change_of_course",
    44: "Technological_specialization_diploma_holders",
    51: "Change_of_institution_course", 53: "Short_cycle_diploma_holders",
    57: "Change_of_institution_course_International"
})
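
One thing worth checking after a dictionary-based `map` is whether any code was missing from the dictionary, since `Series.map` returns `NaN` for such values and `pd.get_dummies` ignores `NaN` by default. The sketch below uses a made-up series, including a hypothetical code `99`, to show one way of listing unmapped codes; it is not part of the original notebook.

```python
import pandas as pd

# Toy codes; 99 is a made-up value with no entry in the dictionary
codes = pd.Series([1, 17, 39, 99], name="application_mode")
labels = {
    1: "1st_phase_general_contingent",
    17: "2nd_phase_general_contingent",
    39: "Over_23_years_old",
}

mapped = codes.map(labels)

# Codes without a dictionary entry come back as NaN
unmapped_codes = sorted(codes[mapped.isna()].unique().tolist())
print(unmapped_codes)
```

An empty list means every code was mapped. Otherwise the affected rows would get all-zero indicator columns and be indistinguishable from the reference category.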

In [7]:
application_mode_dummies = pd.get_dummies(df.application_mode, prefix="application_mode", drop_first=True, dtype="int")
# application_mode_dummies

In [8]:
df = df.join(application_mode_dummies)
df.shape

(4424, 59)
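
To see which category `drop_first=True` removes, the sketch below compares the encoded columns with and without the flag on a small made-up series. For string labels, pandas sorts the categories, so the first one in sorted order is dropped.

```python
import pandas as pd

# Toy labels (a made-up subset of the application modes)
modes = pd.Series(
    ["Transfer", "Over_23_years_old", "1st_phase_general_contingent", "Transfer"]
)

all_cols = pd.get_dummies(modes, prefix="application_mode", dtype="int")
kept_cols = pd.get_dummies(modes, prefix="application_mode", drop_first=True, dtype="int")

# The column present without drop_first but missing with it is the reference category
dropped = sorted(set(all_cols.columns) - set(kept_cols.columns))
print(dropped)
```

By the same rule, the reference category in the cell above would be `1st_phase_general_contingent`, assuming every mapped label actually occurs in the data.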

### 1.3 Course <a class="anchor" id="step3"></a>

For this feature, the integer codes are mapped to course names and then one-hot encoded.

Nursing is used as the reference category because it has the most rows.
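
Choosing the most frequent category as the reference can also be done programmatically. The following is a minimal sketch on a made-up `course` series (the counts are illustrative, not taken from the dataset); the variable names are assumptions.

```python
import pandas as pd

# Toy course labels (made up; Nursing is the most frequent here)
course = pd.Series(
    ["Nursing", "Tourism", "Nursing", "Agronomy", "Nursing", "Tourism"],
    name="course",
)

# The most frequent category becomes the reference
reference = course.value_counts().idxmax()

# Encode every category, then drop the reference column
dummies = (
    pd.get_dummies(course, prefix="course", dtype="int")
    .drop(columns=f"course_{reference}")
)
print(reference, list(dummies.columns))
```

Using the largest group as the reference makes each remaining coefficient in a linear baseline a comparison against the best-populated category.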

In [9]:
# mapping course numbers to their respective name
df.course = df.course.map({
    33: "Biofuel_Production_Technologies", 171: "Animation_and_Multimedia_Design",
    8014: "Social_Service_evening_attendance", 9003: "Agronomy",
    9070: "Communication_Design", 9085: "Veterinary_Nursing",
    9119: "Informatics_Engineering", 9130: "Equinculture",
    9147: "Management", 9238: "Social_Service",
    9254: "Tourism", 9500: "Nursing",
    9556: "Oral_Hygiene", 9670: "Advertising_and_Marketing_Management",
    9773: "Journalism_and_Communication", 9853: "Basic_Education",
    9991: "Management_evening_attendance"
})

In [11]:
courses_dummies = (
    pd.get_dummies(df.course, prefix="course", dtype="int")
    .drop(columns=["course_Nursing"])  # Nursing is the reference category
)
courses_dummies.columns

Index(['course_Advertising_and_Marketing_Management', 'course_Agronomy',
       'course_Animation_and_Multimedia_Design', 'course_Basic_Education',
       'course_Biofuel_Production_Technologies', 'course_Communication_Design',
       'course_Equinculture', 'course_Informatics_Engineering',
       'course_Journalism_and_Communication', 'course_Management',
       'course_Management_evening_attendance', 'course_Oral_Hygiene',
       'course_Social_Service', 'course_Social_Service_evening_attendance',
       'course_Tourism', 'course_Veterinary_Nursing'],
      dtype='object')

In [12]:
df = df.join(courses_dummies)
df.shape

(4424, 75)

### 1.4 Daytime/Nighttime Attendance <a class="anchor" id="step4"></a>

In [17]:
# rename column for clarity
df = df.rename(columns={"daytime/evening_attendance": "daytime_attendance"})

### 1.5 Previous Qualifications <a class="anchor" id="step5"></a>