### Importing the required libraries

In [1]:
import pandas as pd
import numpy as np

### Loading the Data

In [2]:
df = pd.read_csv("student_data.csv")

### Inspecting the Data Set

In [3]:
df.head()

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,17,5,171.0,1,1,122.0,1,19,12.0,...,0,0.0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254.0,1,1,160.0,1,1,3.0,...,0,6.0,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070.0,1,1,122.0,1,37,37.0,...,0,6.0,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773.0,1,1,122.0,1,38,37.0,...,0,6.0,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014.0,0,1,100.0,1,37,38.0,...,0,6.0,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [4]:
df.shape

(4424, 37)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 37 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   Marital status                                  4424 non-null   object 
 1   Application mode                                4424 non-null   int64  
 2   Application order                               4424 non-null   int64  
 3   Course                                          4423 non-null   float64
 4   Daytime/evening attendance                      4424 non-null   int64  
 5   Previous qualification                          4424 non-null   int64  
 6   Previous qualification (grade)                  4424 non-null   float64
 7   Nacionality                                     4424 non-null   int64  
 8   Mother's qualification                          4424 non-null   int64  
 9   Father's qualification                   

In [6]:
# df.describe()

In [7]:
df.isnull().sum()

Marital status                                    0
Application mode                                  0
Application order                                 0
Course                                            1
Daytime/evening attendance                        0
Previous qualification                            0
Previous qualification (grade)                    0
Nacionality                                       0
Mother's qualification                            0
Father's qualification                            1
Mother's occupation                               0
Father's occupation                               0
Admission grade                                   1
Displaced                                         0
Educational special needs                         0
Debtor                                            1
Tuition fees up to date                           1
Gender                                            0
Scholarship holder                                0
Age at enrol

In [8]:
df.isnull().sum()[df.isnull().sum() > 0]

Course                                 1
Father's qualification                 1
Admission grade                        1
Debtor                                 1
Tuition fees up to date                1
International                          1
Curricular units 1st sem (grade)       1
Curricular units 2nd sem (enrolled)    1
Curricular units 2nd sem (grade)       1
dtype: int64

#### Result: Features with missing values

- Course 1
- Father's qualification 1
- Admission grade 1
- Debtor 1
- Tuition fees up to date 1
- International 1
- Curricular units 1st sem (grade) 1
- Curricular units 2nd sem (enrolled) 1
- Curricular units 2nd sem (grade) 1
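
The filter in cell 8 calls `df.isnull().sum()` twice. A minimal sketch of the same check that computes the counts once is shown here; the `demo` frame and its values are made up for illustration and only borrow column names from the data set.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (values are invented, not taken from student_data.csv).
demo = pd.DataFrame({
    "Course": [171.0, np.nan, 9070.0],
    "Debtor": [0.0, 0.0, np.nan],
    "Gender": [1, 0, 1],
})

# Count the nulls once, then keep only the columns that have at least one.
missing_counts = demo.isnull().sum()
missing_only = missing_counts[missing_counts > 0]
print(missing_only)
```

Storing the counts in a variable also makes it easy to reuse them later, for example to decide which columns need imputation.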

---

### Data Cleaning

- Checking the types of data
- Checking the missing or null values
    - Irrelevant data
    - Missing values
    - Duplicates
    - Type conversion
    - Syntax errors
    - Standardize
    - Scaling / Transformation
    - Normalisation
- Dropping irrelevant columns
- Renaming the columns
- Checking the duplicate rows

`df.head()` truncates the output, so it does not show all 37 columns/features or let me check their respective types visually. I therefore used the `.iloc` indexer to view the first 13 features, then the middle 13, and finally the last 11.

```python
df.iloc[:, :13].head(5)
df.iloc[:, 13:26].head(5)
df.iloc[:, 26:37].head(5)
```
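
When the three slices are run in one notebook cell, only the last expression is normally displayed. One possible way around that, sketched under the assumption that plain `print` output is acceptable, is to build the column chunks in a loop. The `demo` frame and the `c0`..`c36` column names are hypothetical stand-ins for the real 37 features.

```python
import pandas as pd

# Hypothetical stand-in with 37 columns named c0..c36 and two rows.
demo = pd.DataFrame([[0] * 37, [1] * 37], columns=[f"c{i}" for i in range(37)])

chunk_size = 13
# Slice the columns into consecutive groups of at most chunk_size.
chunks = [
    demo.iloc[:, start:start + chunk_size]
    for start in range(0, demo.shape[1], chunk_size)
]

for chunk in chunks:
    # print() produces output for every chunk, not only the last one.
    print(chunk.head(5))
```

With 37 columns and a chunk size of 13, the last chunk holds the remaining 11 columns.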

In [9]:
df.iloc[:, :13].head(5)

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,Mother's occupation,Father's occupation,Admission grade
0,1,17,5,171.0,1,1,122.0,1,19,12.0,5,9,127.3
1,1,15,1,9254.0,1,1,160.0,1,1,3.0,3,3,142.5
2,1,1,5,9070.0,1,1,122.0,1,37,37.0,9,9,124.8
3,1,17,2,9773.0,1,1,122.0,1,38,37.0,5,3,119.6
4,2,39,1,8014.0,0,1,100.0,1,37,38.0,9,9,141.5


In [10]:
df.dtypes

Marital status                                     object
Application mode                                    int64
Application order                                   int64
Course                                            float64
Daytime/evening attendance                          int64
Previous qualification                              int64
Previous qualification (grade)                    float64
Nacionality                                         int64
Mother's qualification                              int64
Father's qualification                            float64
Mother's occupation                                 int64
Father's occupation                                 int64
Admission grade                                   float64
Displaced                                          object
Educational special needs                           int64
Debtor                                            float64
Tuition fees up to date                            object
Gender        

#### Result: Different types

Found 12 features whose loaded types differ from those described in the data dictionary document.

- **Marital status** is Categorical and should be Integer (int64)
- **Course** is Continuous and should be Integer (int64)  
- **Father's qualification** is Continuous and should be Integer (int64)
- **Displaced** is Categorical and should be Integer (int64)
- **Debtor** is Continuous and should be Integer (int64)
- **Tuition fees up to date** is Categorical and should be Integer (int64)
- **Age at enrollment** is Categorical and should be Integer (int64)
- **International** is Continuous and should be Integer (int64)  
- **Curricular units 1st sem (credited)** is Categorical and should be Integer (int64)
- **Curricular units 1st sem (grade)** is Continuous and should be Integer (int64)  
- **Curricular units 2nd sem (enrolled)** is Continuous and should be Integer (int64)  
- **Curricular units 2nd sem (grade)** is Continuous and should be Integer (int64)  

The remaining 25 features already have the expected types:
- Application mode Integer
- Application order Integer
- Daytime/evening attendance Integer
- Previous qualification Integer
- Previous qualification (grade) Continuous
- Nacionality Integer
- Mother's qualification Integer
- Mother's occupation Integer
- Father's occupation Integer
- Admission grade Continuous
- Educational special needs Integer
- Gender Integer
- Scholarship holder Integer
- Curricular units 1st sem (enrolled) Integer
- Curricular units 1st sem (evaluations) Integer
- Curricular units 1st sem (approved) Integer
- Curricular units 1st sem (without evaluations) Integer
- Curricular units 2nd sem (credited) Integer
- Curricular units 2nd sem (evaluations) Integer
- Curricular units 2nd sem (approved) Integer
- Curricular units 2nd sem (without evaluations) Integer
- Unemployment rate Continuous
- Inflation rate Continuous
- GDP Continuous
- Target Categorical
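
The comparison above was done by reading the `df.dtypes` output against the dictionary. A rough sketch of how it could be automated follows; the `expected_kind` mapping is a hypothetical excerpt written by hand, not something loaded from the dictionary document, and the `demo` values are invented.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names follow the data set.
demo = pd.DataFrame({
    "Marital status": ["1", "?", "2"],
    "Course": [171.0, 9254.0, np.nan],
    "Application mode": [17, 15, 1],
    "Admission grade": [127.3, 142.5, 124.8],
})

# Hypothetical excerpt of the data dictionary, expressed as NumPy dtype kinds:
# "i" = integer, "f" = floating point.
expected_kind = {
    "Marital status": "i",
    "Course": "i",
    "Application mode": "i",
    "Admission grade": "f",
}

# Collect the columns whose loaded dtype kind differs from the expected one.
mismatched = [
    col for col, kind in expected_kind.items()
    if demo[col].dtype.kind != kind
]
print(mismatched)
```

Here `Marital status` is flagged because the `'?'` placeholder forces a text column, and `Course` because a missing value forces a float column.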

Features:

Marital status, Application mode, Application order, Course, Daytime/evening attendance, Previous qualification, Previous qualification (grade), Nacionality, Mother's qualification, Father's qualification, Mother's occupation, Father's occupation, Admission grade, Displaced, Educational special needs, Debtor, Tuition fees up to date, Gender, Scholarship holder, Age at enrollment, International, Curricular units 1st sem (credited), Curricular units 1st sem (enrolled), Curricular units 1st sem (evaluations), Curricular units 1st sem (approved), Curricular units 1st sem (grade), Curricular units 1st sem (without evaluations), Curricular units 2nd sem (credited), Curricular units 2nd sem (enrolled), Curricular units 2nd sem (evaluations), Curricular units 2nd sem (approved), Curricular units 2nd sem (grade), Curricular units 2nd sem (without evaluations), Unemployment rate, Inflation rate, GDP, Target.        

In [11]:
df["Marital status"].unique()

array(['1', '2', '4', '?', '3', '5', '6'], dtype=object)

Different value formats found

- Marital status (**) = '?'
- Course (***) = nan
- Father's qualification (***) = nan
- Displaced (**) = '?'
- Debtor (***) = nan
- Tuition fees up to date (***) = nan and '?'
- Age at enrollment (**) = 'UnKnown'
- International (***) = nan
- Curricular units 1st sem (credited) (**) = 'Na'
- Curricular units 1st sem (grade) (***) = nan
- Curricular units 2nd sem (enrolled) (***) = nan
- Curricular units 2nd sem (grade) (***) = nan
- Admission grade (*) = nan

- (*) Missing Values
- (**) Different Data Types
- (***) Missing Values and Different Data Types
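
Instead of calling `.unique()` on each column by hand, the placeholder strings can be searched for programmatically. The sketch below assumes that any value in a non-numeric column that fails numeric conversion is a placeholder; that assumption would wrongly flag genuinely textual columns such as `Target`, so those would need to be excluded. The `demo` frame is invented for illustration.

```python
import pandas as pd

# Made-up rows; only the column names follow the data set.
demo = pd.DataFrame({
    "Marital status": ["1", "2", "?", "4"],
    "Age at enrollment": ["20", "UnKnown", "19", "45"],
    "Gender": [1, 0, 1, 1],
})

placeholders = {}
for col in demo.columns:
    # Columns that already loaded as numbers cannot contain text placeholders.
    if pd.api.types.is_numeric_dtype(demo[col]):
        continue
    # Values that cannot be parsed as numbers become NaN with errors="coerce".
    as_number = pd.to_numeric(demo[col], errors="coerce")
    bad = demo.loc[as_number.isna() & demo[col].notna(), col]
    if not bad.empty:
        placeholders[col] = sorted(bad.unique())

print(placeholders)
```

The resulting strings are candidates for the `na_values` argument of `pd.read_csv`.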


### Mark invalid/corrupt values as missing (NaN)

In [12]:
missing_value_formats = ["?", "UnKnown", "Na"]

In [13]:
df = pd.read_csv("student_data.csv", na_values = missing_value_formats)

In [14]:
df["Marital status"].unique()

array([ 1.,  2.,  4., nan,  3.,  5.,  6.])
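
The values now appear as floats (`1.`, `2.`, ...) because NaN is a floating-point value, so a numeric column that contains it is stored as `float64`. A small sketch of the effect, using an in-memory CSV with invented contents in place of `student_data.csv`:

```python
import io
import pandas as pd

# Made-up CSV text standing in for student_data.csv.
csv_text = "Marital status,Age at enrollment\n1,20\n?,UnKnown\n2,Na\n"

# Without na_values the placeholders are kept as text.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values the placeholders become NaN and the columns become float64.
clean = pd.read_csv(io.StringIO(csv_text), na_values=["?", "UnKnown", "Na"])

print(raw.dtypes)
print(clean.dtypes)
print(clean.isnull().sum())
```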

```python
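def make_int(i):
    """Return i as an int, or NaN when it cannot be converted."""
    try:
        return int(i)
    except (TypeError, ValueError):
        # int() raises ValueError for NaN or non-numeric text, TypeError for None
        return np.nan

# Apply make_int to every value in the column with Series.map.
# Note: while the column still contains NaN, its dtype stays float64.
df['Marital status'] = df['Marital status'].map(make_int)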
```
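
Because NaN keeps a column at `float64`, mapping each value through `int` does not by itself give an integer column. One alternative, offered here as a suggestion rather than as the approach used in this notebook, is the pandas nullable integer dtype `"Int64"`, which stores integers alongside missing values. The series below is made up.

```python
import numpy as np
import pandas as pd

# Made-up values shaped like the column after loading with na_values.
marital = pd.Series([1.0, 2.0, np.nan, 4.0], name="Marital status")

# "Int64" (capital I) is the nullable integer dtype; the NaN becomes <NA>.
as_nullable_int = marital.astype("Int64")
print(as_nullable_int)
```

This conversion only works when the non-missing values are whole numbers; a fractional value would raise an error instead of being truncated silently.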

In [15]:
# null_filter = df['Marital status'].isnull()
# df[null_filter].head()

In [16]:
# df["Marital status"].unique()