## Template IPYNB to learn and experiment with Data Cleaning and Preprocessing

Types of Cleaning:
- Removing missing values
- Imputing missing values
- Removing outliers

Types of Preprocessing:
- Encoding categorical data
- Feature scaling

Types of Visualization:
- Scatter
- Heatmap


### 1. Data Cleaning

In [13]:
# Basic imports for Data Cleaning

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


In [14]:
dataset = pd.read_csv('student-mat.csv', delimiter=';')
dataset.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


In [15]:
# Checking for Null Values

dataset.isnull().sum()

school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64

In [16]:
# Checking for Outliers

dataset.describe()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
count,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0
mean,16.696203,2.749367,2.521519,1.448101,2.035443,0.334177,3.944304,3.235443,3.108861,1.481013,2.291139,3.55443,5.708861,10.908861,10.713924,10.41519
std,1.276043,1.094735,1.088201,0.697505,0.83924,0.743651,0.896659,0.998862,1.113278,0.890741,1.287897,1.390303,8.003096,3.319195,3.761505,4.581443
min,15.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,3.0,0.0,0.0
25%,16.0,2.0,2.0,1.0,1.0,0.0,4.0,3.0,2.0,1.0,1.0,3.0,0.0,8.0,9.0,8.0
50%,17.0,3.0,2.0,1.0,2.0,0.0,4.0,3.0,3.0,1.0,2.0,4.0,4.0,11.0,11.0,11.0
75%,18.0,4.0,3.0,2.0,2.0,0.0,5.0,4.0,4.0,2.0,3.0,5.0,8.0,13.0,13.0,14.0
max,22.0,4.0,4.0,4.0,4.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,75.0,19.0,19.0,20.0


### **1. General Information**
- **Dataset Size**: 395 rows (observations) and 33 columns (features).
- **Feature Types**: Includes numerical, categorical, and binary data (e.g., `sex`, `address`, `schoolsup`).
- **No Missing Values**: All features report zero missing values, indicating that we do not need to handle null entries initially.

---

### **2. Numerical Data Statistics**
For the numerical features (`age`, `Medu`, `Fedu`, `traveltime`, `studytime`, etc.), the summary statistics reveal the following:

#### Central Tendency
- **Mean Values**:
  - The average age is ~16.7 years.
  - The average education level of the mother (`Medu`) and father (`Fedu`) is approximately 2.75 and 2.52, respectively, on a scale of 0-4.
  - Average grades (`G1`, `G2`, `G3`) are around 10.9, 10.7, and 10.4, respectively, suggesting a consistent scoring pattern.

#### Variability
- **Standard Deviations**:
  - High standard deviation in `absences` (8.00) suggests significant variation in student attendance.
  - Grades (`G1`, `G2`, `G3`) also show moderate variability (~3.3-4.6), indicating differences in academic performance.

#### Extremes
- **Minimum and Maximum Values**:
  - Features like `Medu`, `Fedu`, `traveltime`, and `studytime` have defined upper bounds (e.g., max value 4).
  - `absences` has a maximum of 75, far from the 75th percentile (8), suggesting potential **outliers**.
  - `G2` and `G3` have a minimum of 0, indicating possible failing students; `G1` has a minimum of 3.

#### Quartiles
- **Interquartile Range (IQR)**:
  - Most features have a compact IQR, e.g., `freetime` ranges between 3 (25th percentile) and 4 (75th percentile), indicating most students have similar free time.
  - `absences` has a wider IQR (0 at the 25th percentile to 8 at the 75th), showing more variation in student attendance patterns.

---

### **3. Implications for Data Cleaning and Preprocessing**
1. **Outliers**:
   - `absences` likely contains outliers (e.g., maximum of 75). This will require closer inspection, potentially using boxplots or Z-score thresholds.
2. **Encoding**:
   - Categorical features like `sex`, `address`, `Mjob`, and `Fjob` need encoding for machine learning models.
3. **Scaling**:
   - Numerical features like `absences`, `G1`, `G2`, and `G3` vary widely in range and may need feature scaling (e.g., normalization or standardization).
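
As a rough illustration of the outlier check mentioned in point 1, the sketch below applies the common 1.5 × IQR rule. The helper name `iqr_outlier_mask` and the toy `absences` values are assumptions made for this example; they are not part of the notebook or the dataset.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for entries outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy stand-in for dataset['absences']; the values are illustrative only.
absences = pd.Series([0, 0, 2, 4, 4, 6, 8, 10, 75], name="absences")

mask = iqr_outlier_mask(absences)
outliers = absences[mask]
print(outliers.tolist())
```

With the quartiles reported by `describe()` for the real `absences` column (25% = 0, 75% = 8), the same rule would put the upper fence at 8 + 1.5 × 8 = 20, so the maximum of 75 lies well beyond it. Whether to drop, cap, or keep such rows is a modeling decision this sketch does not make.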

---

### **4. Implications for Visualization**
- **Scatter Plots**:
  - Explore relationships between grades (`G1`, `G2`, `G3`) and other numerical features like `studytime` or `absences`.
- **Heatmap**:
  - Analyze correlations among numerical features to identify highly correlated variables (e.g., between grades or parental education levels).
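
The two plot types above could be sketched as follows. This is a minimal example under stated assumptions: it uses plain matplotlib, and the `toy` DataFrame is randomly generated stand-in data that only borrows the column names `G1`, `G3`, `studytime`, and `absences`; it does not reproduce the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data (assumption: not the real student-mat.csv values).
rng = np.random.default_rng(0)
g1 = rng.integers(3, 20, size=50)
toy = pd.DataFrame({
    "G1": g1,
    "G3": np.clip(g1 + rng.integers(-2, 3, size=50), 0, 20),
    "studytime": rng.integers(1, 5, size=50),
    "absences": rng.integers(0, 30, size=50),
})

# Pairwise Pearson correlations between the numeric columns.
corr = toy.corr()

fig, (ax_scatter, ax_heat) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: relationship between two grade columns.
ax_scatter.scatter(toy["G1"], toy["G3"])
ax_scatter.set_xlabel("G1")
ax_scatter.set_ylabel("G3")

# Heatmap: correlation matrix drawn as a colored grid.
image = ax_heat.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax_heat.set_xticks(range(len(corr.columns)))
ax_heat.set_xticklabels(corr.columns, rotation=45)
ax_heat.set_yticks(range(len(corr.index)))
ax_heat.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax_heat)
fig.tight_layout()
```

On the real `dataset`, the correlation matrix would need to be restricted to numeric columns while the categorical columns are still strings, for example with `dataset.corr(numeric_only=True)` in recent pandas versions.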


In [17]:
# Check for Duplicates

dataset.duplicated().sum()

np.int64(0)

The dataset has no missing values and no duplicate rows. The only feature flagged above as a potential source of outliers is `absences` (maximum of 75); this template leaves those rows in place and moves on to the preprocessing part.

Before visualizing or modeling the data, the categorical columns are encoded as numbers.

In [18]:
# One-hot encode the categorical columns (dropping the first level of each) and cast everything to float

dataset = pd.get_dummies(dataset, drop_first=True).astype(float)
dataset.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,...,guardian_mother,guardian_other,schoolsup_yes,famsup_yes,paid_yes,activities_yes,nursery_yes,higher_yes,internet_yes,romantic_yes
0,18.0,4.0,4.0,2.0,2.0,0.0,4.0,3.0,4.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
1,17.0,1.0,1.0,1.0,2.0,0.0,5.0,3.0,3.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
2,15.0,1.0,1.0,1.0,2.0,3.0,4.0,3.0,2.0,2.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0
3,15.0,4.0,2.0,1.0,3.0,0.0,3.0,2.0,2.0,1.0,...,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,16.0,3.0,3.0,1.0,2.0,0.0,4.0,3.0,2.0,1.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0


We have now encoded the categorical data. Note that no feature scaling has been applied yet: `StandardScaler` is imported but not used. We can now move to the next step of splitting the data into training and testing sets.

In [19]:
# Hold out 20% of the rows for testing; G3 (final grade) is the target
xtrain, xtest, ytrain, ytest = train_test_split(dataset.drop('G3', axis=1), dataset['G3'], test_size=0.2, random_state=42)
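
`StandardScaler` is imported at the top of the notebook, but no cell calls it. One way the scaling step could look is sketched below, assuming the scaler is fitted on the training split only so that test-set statistics do not leak into preprocessing. The `toy` DataFrame is randomly generated stand-in data, and the `*_scaled` names are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data (assumption: not the real dataset).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "absences": rng.integers(0, 30, size=40).astype(float),
    "G1": rng.integers(3, 20, size=40).astype(float),
    "G3": rng.integers(0, 21, size=40).astype(float),
})

# Same split call as in the notebook, applied to the toy data.
xtrain, xtest, ytrain, ytest = train_test_split(
    toy.drop("G3", axis=1), toy["G3"], test_size=0.2, random_state=42
)

scaler = StandardScaler()
# Learn mean and standard deviation from the training features only...
xtrain_scaled = scaler.fit_transform(xtrain)
# ...then reuse those statistics to transform the test features.
xtest_scaled = scaler.transform(xtest)

print(xtrain_scaled.mean(axis=0), xtrain_scaled.std(axis=0))
```

After this step each training column has mean approximately 0 and standard deviation approximately 1; the test columns will generally be close to, but not exactly at, those values because they are scaled with the training statistics.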