# I. Business Understanding

## Project objective

The objective of this project is to learn how to apply machine learning and data mining methods to real-world data sets.

The project covers all stages of the data mining process, from setting objectives to drawing conclusions.

The analysis addresses two questions. First, what factors influence student performance the most, that is, which variables have the highest impact on final grades?

Second, how does access to educational resources (extra classes, internet) affect student grades?



# II. Data Understanding
The analysis uses the Student Performance dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/320/student+performance

In [282]:
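import pandas as pd
from ucimlrepo import fetch_ucirepo

# Fetch the Student Performance dataset (UCI id 320)
student_performance = fetch_ucirepo(id=320)

# Features and targets (G1, G2, G3) as pandas DataFrames
X = student_performance.data.features
y = student_performance.data.targets

# Preview the first five rows of each
print(X.head())
print(y.head())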


  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ...  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher  ...   
1     GP   F   17       U     GT3       T     1     1  at_home     other  ...   
2     GP   F   15       U     LE3       T     1     1  at_home     other  ...   
3     GP   F   15       U     GT3       T     4     2   health  services  ...   
4     GP   F   16       U     GT3       T     3     3    other     other  ...   

  higher internet  romantic  famrel  freetime goout Dalc Walc health absences  
0    yes       no        no       4         3     4    1    1      3        4  
1    yes      yes        no       5         3     3    1    1      3        2  
2    yes      yes        no       4         3     2    2    3      3        6  
3    yes      yes       yes       3         2     2    1    1      5        0  
4    yes       no        no       4         3     2    1    2      5        0  

[5 rows x 30 columns]
   G1  G2 

The dataset contains 649 records, 30 features, and three target columns (G1, G2, and G3).

In [283]:
X.shape, y.shape

((649, 30), (649, 3))

In [284]:
print(student_performance.variables)

          name     role         type      demographic  \
0       school  Feature  Categorical             None   
1          sex  Feature       Binary              Sex   
2          age  Feature      Integer              Age   
3      address  Feature  Categorical             None   
4      famsize  Feature  Categorical            Other   
5      Pstatus  Feature  Categorical            Other   
6         Medu  Feature      Integer  Education Level   
7         Fedu  Feature      Integer  Education Level   
8         Mjob  Feature  Categorical       Occupation   
9         Fjob  Feature  Categorical       Occupation   
10      reason  Feature  Categorical             None   
11    guardian  Feature  Categorical             None   
12  traveltime  Feature      Integer             None   
13   studytime  Feature      Integer             None   
14    failures  Feature      Integer             None   
15   schoolsup  Feature       Binary             None   
16      famsup  Feature       B

The dataset contains student-related attributes, including demographic, academic, and behavioral characteristics. It includes both numerical and categorical values describing students' background, study habits, and extracurricular activities. The dataset also contains students' grades (G1, G2, and G3), which represent their performance in different periods, with G3 being the final grade. The dataset does not contain missing values.
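
The two claims above (no missing values, a mix of numerical and categorical columns) can be checked directly in pandas. The following is a minimal sketch on a small made-up frame; the `toy` DataFrame and its values are assumptions for illustration, and the same calls would apply to `X`.

```python
import pandas as pd

# Toy frame standing in for the feature table; the values are made up
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [18, 17, 15],
    "internet": ["no", "yes", "yes"],
    "absences": [4, 2, 6],
})

# Count missing values per column
missing_per_column = toy.isna().sum()

# Split columns by dtype: non-numeric columns need encoding before modeling
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column.sum())
print(categorical_cols)
print(numeric_cols)
```

A total of zero from `isna().sum().sum()` confirms that no imputation step is needed, and the list of non-numeric columns shows which features have to be encoded.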

# III. Data Preparation

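Because the dataset has no missing values, preparation is limited to one-hot encoding the categorical features and standardizing all columns.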

In [285]:
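from sklearn.preprocessing import StandardScaler

# One-hot encode categorical columns; drop_first removes one redundant dummy per feature
X = pd.get_dummies(X, drop_first=True)

# Standardize every column to zero mean and unit variance.
# Note: the scaler is fit on the full dataset, before any train/test split.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)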

In [286]:
X.head(20)

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,...,guardian_mother,guardian_other,schoolsup_yes,famsup_yes,paid_yes,activities_yes,nursery_yes,higher_yes,internet_yes,romantic_yes
0,18,4,4,2,2,0,4,3,4,1,...,True,False,True,False,False,False,True,True,False,False
1,17,1,1,1,2,0,5,3,3,1,...,False,False,False,True,False,False,False,True,True,False
2,15,1,1,1,2,0,4,3,2,2,...,True,False,True,False,False,False,True,True,True,False
3,15,4,2,1,3,0,3,2,2,1,...,True,False,False,True,False,True,True,True,True,True
4,16,3,3,1,2,0,4,3,2,1,...,False,False,False,True,False,False,True,True,False,False
5,16,4,3,1,2,0,5,4,2,1,...,True,False,False,True,False,True,True,True,True,False
6,16,2,2,1,2,0,4,4,4,1,...,True,False,False,False,False,False,True,True,True,False
7,17,4,4,2,2,0,4,1,4,1,...,True,False,True,True,False,False,True,True,False,False
8,15,3,2,1,2,0,4,2,2,1,...,True,False,False,True,False,False,True,True,True,False
9,15,3,4,1,2,0,5,5,1,1,...,True,False,False,True,False,True,True,True,True,False
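
The scaler in this section is fit on every row, including rows that are later held out for testing. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below shows that pattern on synthetic data; `X_demo`, `y_demo`, and the coefficients used to generate them are assumptions for illustration, not values from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix and the G3 target
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only, then reuses
# those training statistics when transforming the test rows
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

fitted_scaler = model.named_steps["standardscaler"]
print(fitted_scaler.mean_)      # column means of X_tr, not of X_demo
print(model.score(X_te, y_te))  # R² on the held-out rows
```

For ordinary least squares the predictions are largely insensitive to this choice, but keeping the scaler inside the pipeline avoids leaking test-set statistics when the same preprocessing is reused with other models.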


# IV. Modeling

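Two models are fit on the prepared features to study how variables such as alcohol consumption and internet access relate to the final grade G3: a linear regression and a random forest classifier.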

In [287]:
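from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier

# Regression target: final grade G3
y_reg = y['G3']
# Pass/fail label (pass at G3 >= 10); defined here but not used below
y_class = y['G3'].apply(lambda x: 'Pass' if x >= 10 else 'Fail')

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_reg, test_size=0.2, random_state=42)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Note: y_train holds the raw G3 integers, so the forest treats every
# distinct grade as its own class rather than predicting Pass/Fail
classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(X_train, y_train)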
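
Fitting the models does not by itself show how alcohol consumption or internet access relate to the grade. One possible way to read that off a linear model is to rank the coefficients of standardized features by absolute size. The sketch below does this on synthetic data: the column names mirror the dataset (`failures`, `Dalc`, `internet_yes`), but the values and effect sizes are invented for illustration and say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with a known structure; values and effects are invented
rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    "failures": rng.integers(0, 4, size=n),
    "Dalc": rng.integers(1, 6, size=n),
    "internet_yes": rng.integers(0, 2, size=n),
})
grade = (
    12
    - 2.0 * demo["failures"]
    - 0.5 * demo["Dalc"]
    + 0.3 * demo["internet_yes"]
    + rng.normal(scale=1.0, size=n)
)

# Standardize so coefficient magnitudes are comparable across columns
X_std = StandardScaler().fit_transform(demo)
reg = LinearRegression().fit(X_std, grade)

# Rank features by the absolute size of their standardized coefficient
ranking = pd.Series(reg.coef_, index=demo.columns).sort_values(
    key=np.abs, ascending=False
)
print(ranking)
```

Such a ranking describes association rather than causation, and strongly correlated features can share or swap weight, so it is best read as a first look rather than a final answer.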


# V. Evaluation

In [288]:
from sklearn.metrics import mean_absolute_error, r2_score

y_pred_reg = regressor.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred_reg))
print("R² Score:", r2_score(y_test, y_pred_reg))

MAE: 2.156382187028527
R² Score: 0.16016991964318295
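
An MAE of about 2.16 grade points and an R² of about 0.16 are easier to judge against a trivial baseline that always predicts the training mean. The sketch below shows the mechanics with scikit-learn's `DummyRegressor` on a handful of made-up grades; the arrays are assumptions for illustration, not values from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up grades, only to show the mechanics
X_train_demo = np.zeros((5, 1))  # features are ignored by the baseline
y_train_demo = np.array([10, 12, 14, 8, 11])
X_test_demo = np.zeros((4, 1))
y_test_demo = np.array([9, 13, 11, 15])

# The baseline predicts the training mean (11 here) for every test row
baseline = DummyRegressor(strategy="mean").fit(X_train_demo, y_train_demo)
y_pred_base = baseline.predict(X_test_demo)

baseline_mae = mean_absolute_error(y_test_demo, y_pred_base)
baseline_r2 = r2_score(y_test_demo, y_pred_base)
print(baseline_mae)  # 2.0
print(baseline_r2)   # about -0.2
```

A mean-only baseline scores an R² at or slightly below zero on held-out data, so a model's R² indicates how much of the grade variance it explains beyond that baseline.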


In [289]:
from sklearn.metrics import accuracy_score, classification_report

y_pred_class = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred_class))
print(classification_report(y_test, y_pred_class))

Accuracy: 0.14615384615384616
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           7       0.00      0.00      0.00         1
           8       0.33      0.14      0.20         7
           9       0.00      0.00      0.00         5
          10       0.21      0.41      0.28        17
          11       0.17      0.24      0.20        25
          12       0.08      0.06      0.07        16
          13       0.05      0.08      0.06        13
          14       0.14      0.17      0.15        12
          15       0.00      0.00      0.00        10
          16       0.00      0.00      0.00         9
          17       0.33      0.20      0.25         5
          18       0.00      0.00      0.00         7
          19       0.00      0.00      0.00         1

    accuracy                           0.15       130
   macro avg       0.09      0.09      0.09       130
weighted avg       0.12      0.15      0.12       
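
The report has one row per grade value because the classifier was trained on the raw G3 integers, which spreads the 130 test rows over 14 classes. The `y_class` label defined in the modeling cell suggests that a Pass/Fail target may have been intended. The sketch below shows that variant on synthetic data; `X_demo`, `g3`, and the way the grades are generated are assumptions for illustration, so the printed accuracy says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one informative feature and a G3-like grade in 0..20
rng = np.random.default_rng(0)
n = 400
X_demo = rng.normal(size=(n, 3))
raw = 11 + 3 * X_demo[:, 0] + rng.normal(scale=1.0, size=n)
g3 = pd.Series(np.clip(np.round(raw), 0, 20))

# Same thresholding rule as y_class in the modeling cell
labels = g3.apply(lambda x: "Pass" if x >= 10 else "Fail")

# Split features and labels together so the rows stay aligned;
# stratify keeps the Pass/Fail ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, labels, test_size=0.2, random_state=42, stratify=labels
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

print(sorted(clf.classes_))
print(accuracy_score(y_te, y_pred))
```

With a binary target, the classification report reduces to two rows, and accuracy can be compared against the share of the majority class.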

