In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

In [4]:
df_mat = pd.read_csv("data/student-mat.csv", sep=';')
df_mat.head()


,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


In [7]:
df_por = pd.read_csv("data/student-por.csv", sep=";")
df_por.head()

,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,0,11,13,13


## 📄 UCI Student Performance Dataset

### 🔢 Attributes for both `student-mat.csv` (Math course) and `student-por.csv` (Portuguese course):

1. **school** - student's school (binary): `"GP"` (Gabriel Pereira) or `"MS"` (Mousinho da Silveira)  
2. **sex** - student's sex (binary): `"F"` (female) or `"M"` (male)  
3. **age** - student's age (numeric): from 15 to 22  
4. **address** - home address type (binary): `"U"` (urban) or `"R"` (rural)  
5. **famsize** - family size (binary): `"LE3"` (≤3) or `"GT3"` (>3)  
6. **Pstatus** - parents' cohabitation status (binary): `"T"` (living together) or `"A"` (apart)  
7. **Medu** - mother's education (numeric):  
   - 0: none  
   - 1: primary education (4th grade)  
   - 2: 5th to 9th grade  
   - 3: secondary education  
   - 4: higher education  
8. **Fedu** - father's education (same scale as Medu)  
9. **Mjob** - mother's job (nominal): `"teacher"`, `"health"`, `"services"`, `"at_home"`, `"other"`  
10. **Fjob** - father's job (same as Mjob)  
11. **reason** - reason for choosing this school (nominal): `"home"`, `"reputation"`, `"course"`, `"other"`  
12. **guardian** - student's guardian (nominal): `"mother"`, `"father"`, `"other"`  
13. **traveltime** - home to school travel time (numeric):  
    - 1: <15 min  
    - 2: 15–30 min  
    - 3: 30 min–1 hour  
    - 4: >1 hour  
14. **studytime** - weekly study time (numeric):  
    - 1: <2 hours  
    - 2: 2–5 hours  
    - 3: 5–10 hours  
    - 4: >10 hours  
15. **failures** - number of past class failures (numeric): 0–3, 4 = 4+ failures  
16. **schoolsup** - extra educational support (binary): `"yes"` or `"no"`  
17. **famsup** - family educational support (binary): `"yes"` or `"no"`  
18. **paid** - extra paid classes (binary): `"yes"` or `"no"`  
19. **activities** - extra-curricular activities (binary): `"yes"` or `"no"`  
20. **nursery** - attended nursery school (binary): `"yes"` or `"no"`  
21. **higher** - wants to pursue higher education (binary): `"yes"` or `"no"`  
22. **internet** - internet access at home (binary): `"yes"` or `"no"`  
23. **romantic** - in a romantic relationship (binary): `"yes"` or `"no"`  
24. **famrel** - family relationship quality (numeric): 1 (very bad) to 5 (excellent)  
25. **freetime** - free time after school (numeric): 1 (very low) to 5 (very high)  
26. **goout** - going out with friends (numeric): 1 (very low) to 5 (very high)  
27. **Dalc** - workday alcohol consumption (numeric): 1 (very low) to 5 (very high)  
28. **Walc** - weekend alcohol consumption (numeric): 1 (very low) to 5 (very high)  
29. **health** - current health status (numeric): 1 (very bad) to 5 (very good)  
30. **absences** - number of school absences (numeric): from 0 to 93  
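
The documented ranges above can double as a quick sanity check on the loaded data. The sketch below is one possible way to do that, not part of the original dataset tooling: `EXPECTED_RANGES` and `out_of_range_counts` are hypothetical names, the ranges are copied from the list above, and the `toy` frame holds made-up rows standing in for `df_mat`.

```python
import pandas as pd

# Documented (min, max) ranges for some numeric attributes, copied from the list above.
EXPECTED_RANGES = {
    "age": (15, 22),
    "Medu": (0, 4),
    "Fedu": (0, 4),
    "traveltime": (1, 4),
    "studytime": (1, 4),
    "famrel": (1, 5),
    "freetime": (1, 5),
    "goout": (1, 5),
    "Dalc": (1, 5),
    "Walc": (1, 5),
    "health": (1, 5),
    "absences": (0, 93),
}


def out_of_range_counts(df, ranges=EXPECTED_RANGES):
    """Return {column: number of values outside the documented range}."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in df.columns:
            # Series.between is inclusive on both ends by default.
            counts[col] = int((~df[col].between(low, high)).sum())
    return counts


# Toy frame with made-up rows (NOT real data), standing in for df_mat.
toy = pd.DataFrame({"age": [15, 18, 30], "Medu": [0, 4, 2], "health": [1, 5, 6]})
report = out_of_range_counts(toy)
print(report)
```

On the real data the call would be `out_of_range_counts(df_mat)`; any non-zero count points at rows worth inspecting before modelling.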

### 🎯 Grade Attributes:
31. **G1** - first period grade (numeric): from 0 to 20  
32. **G2** - second period grade (numeric): from 0 to 20  
33. **G3** - final grade (numeric): from 0 to 20 → **Target variable**
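
Since `G3` is the target, a typical next step is to separate it from the features and encode the nominal and binary columns. The following is a minimal sketch under assumptions: `split_features_target` is a hypothetical helper, one-hot encoding via `pd.get_dummies` is just one reasonable choice, and the `toy` frame contains made-up rows rather than real data.

```python
import pandas as pd


def split_features_target(df, target="G3"):
    """Separate the target column and one-hot encode the non-numeric features."""
    y = df[target]
    # drop_first=True keeps a single indicator column for binary attributes.
    X = pd.get_dummies(df.drop(columns=[target]), drop_first=True)
    return X, y


# Toy frame with made-up rows (NOT real data), standing in for df_mat.
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "studytime": [2, 1, 3],
    "G1": [10, 8, 14],
    "G2": [11, 9, 15],
    "G3": [11, 8, 15],
})
X, y = split_features_target(toy)
print(X.columns.tolist())
```

Whether `G1` and `G2` stay in `X` is a modelling decision: they are earlier grades in the same course, so keeping them makes the prediction task considerably easier.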

---

⚠️ Note:  
382 students appear in both datasets.  
They can be identified by matching rows on identical attribute values, as outlined in the R script that accompanies the dataset.
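
The matching described in the note can be approximated in pandas with an inner merge. This is a sketch, not a port of the R script: `common_students` is a hypothetical helper, and the `MERGE_KEYS` list is an assumption about which attributes the script matches on, so it should be checked against the script itself. The toy frames contain made-up rows.

```python
import pandas as pd

# ASSUMPTION: identifying attributes believed to match the dataset's R script;
# verify against the script before relying on this list.
MERGE_KEYS = ["school", "sex", "age", "address", "famsize", "Pstatus",
              "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet"]


def common_students(df_a, df_b, keys=MERGE_KEYS):
    """Inner-join two course tables on shared identifying attributes."""
    return df_a.merge(df_b, on=keys, how="inner", suffixes=("_mat", "_por"))


# Toy frames with made-up rows (NOT real data) and a shortened key list.
toy_mat = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [18, 17, 16],
    "G3": [6, 10, 12],
})
toy_por = pd.DataFrame({
    "school": ["GP", "MS", "MS"],
    "sex": ["F", "F", "M"],
    "age": [18, 16, 19],
    "G3": [11, 13, 9],
})
both = common_students(toy_mat, toy_por, keys=["school", "sex", "age"])
print(both)
```

On the real data the call would be `common_students(df_mat, df_por)`, and the resulting row count can be compared with the 382 quoted above. Because the match relies on attribute values rather than a true student ID, two different students with identical attributes would produce extra rows.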



In [13]:
df_mat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    