# Student Performance Prediction using KNN and Pipeline

## Objective
The goal of this project is to predict whether a student passes (final grade
`G3` of at least 10) or fails, based on academic and behavioral features, using
a K-Nearest Neighbors (KNN) classifier. A scikit-learn `Pipeline` combines
preprocessing (mean imputation and feature scaling) with model training.


In [11]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer

import joblib



**Load dataset**

In [12]:
df = pd.read_csv("student-mat.csv", sep=";")
df.head()


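|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |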


**Check for missing values**

In [13]:
df.isnull().sum()


| Column | Missing values |
|---|---|
| school | 0 |
| sex | 0 |
| age | 0 |
| address | 0 |
| famsize | 0 |
| Pstatus | 0 |
| Medu | 0 |
| Fedu | 0 |
| Mjob | 0 |
| Fjob | 0 |


**Select features & target**

In [4]:
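# Features: weekly study time, number of absences, first- and second-period grades
X = df[['studytime', 'absences', 'G1', 'G2']]
# Target: 1 (pass) if the final grade G3 is at least 10, otherwise 0 (fail)
y = (df['G3'] >= 10).astype(int)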

X.head(), y.head()


(   studytime  absences  G1  G2
 0          2         6   5   6
 1          2         4   5   5
 2          2        10   7   8
 3          3         2  15  14
 4          2         4   6  10,
 0    0
 1    0
 2    1
 3    1
 4    1
 Name: G3, dtype: int64)

**Train-test split**

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
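The split above does not stratify on the target, so the pass/fail ratio can differ between the training and test sets. The sketch below illustrates the `stratify` argument of `train_test_split` on synthetic data; `X_demo`, `y_demo`, and the 70/30 class ratio are assumptions made for illustration, not properties of the student dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 samples, 70 "pass" (1) and 30 "fail" (0).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 70 + [0] * 30)

# stratify=y_demo asks for the same class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_pass_rate = y_tr.mean()
test_pass_rate = y_te.mean()
print(f"pass rate in train: {train_pass_rate:.2f}, in test: {test_pass_rate:.2f}")
```

With the real data, the equivalent change would be to pass `stratify=y` in the call above; whether it changes the reported accuracy has not been checked here.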


## Preprocessing

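KNN is a distance-based algorithm, so the scale of each feature directly
affects the distance calculation: a feature with a larger numeric range would
otherwise dominate. StandardScaler rescales each feature to zero mean and unit
variance so that all features contribute comparably. A SimpleImputer step
first fills any missing values with the column mean.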
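To make the effect of scaling concrete, here is a small sketch in plain Python that uses the `studytime` and `absences` values from the five rows shown earlier. It measures how much of the squared Euclidean distance between students 0 and 3 comes from `absences`, before and after standardization. The helper names are hypothetical, and the standardization uses the population standard deviation, which is what `StandardScaler` computes.

```python
import statistics

# Values taken from the first five rows of the dataset shown above.
studytime = [2, 2, 2, 3, 2]
absences = [6, 4, 10, 2, 4]


def standardize(values):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def absences_share(st, ab, i, j):
    """Fraction of the squared distance between rows i and j due to absences."""
    d_st = (st[i] - st[j]) ** 2
    d_ab = (ab[i] - ab[j]) ** 2
    return d_ab / (d_st + d_ab)


raw_share = absences_share(studytime, absences, 0, 3)
scaled_share = absences_share(standardize(studytime), standardize(absences), 0, 3)

print(f"share of distance from absences, raw:    {raw_share:.2f}")
print(f"share of distance from absences, scaled: {scaled_share:.2f}")
```

On the raw values, `absences` accounts for most of the distance simply because it spans a wider range. After standardization the two features are on comparable footing.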


**Build pipeline**

In [14]:
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])
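Because the imputer and scaler live inside the pipeline, they are refit on the training portion of every cross-validation fold, which avoids leaking test-fold statistics. The sketch below shows one way to tune `n_neighbors` under that setup. It runs on synthetic data from `make_classification`, and the candidate values for `n_neighbors` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with four features, mirroring the shape of X in this notebook.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

demo_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {'knn__n_neighbors': [3, 5, 7, 9]}

search = GridSearchCV(demo_pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

best_k = search.best_params_['knn__n_neighbors']
best_score = search.best_score_
print(f"best n_neighbors: {best_k}, mean CV accuracy: {best_score:.3f}")
```

Applied to this notebook, the same search could be run with `pipeline`, `X_train`, and `y_train`, keeping the test set aside for the final evaluation.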


**Train the model**

In [15]:
pipeline.fit(X_train, y_train)



**Model evaluation**

In [9]:
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

accuracy


0.9240506329113924
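Accuracy on its own is hard to interpret when one class is more common than the other. The sketch below compares accuracy against a majority-class baseline and breaks the errors down with a confusion matrix. The labels and predictions are hypothetical values chosen for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions (1 = pass, 0 = fail).
y_true_demo = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

model_accuracy = accuracy_score(y_true_demo, y_pred_demo)

# Accuracy obtained by always predicting the most frequent class.
majority_class = np.bincount(y_true_demo).argmax()
baseline_accuracy = accuracy_score(
    y_true_demo, np.full_like(y_true_demo, majority_class)
)

# Rows are true classes, columns are predicted classes, in label order [0, 1].
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

print(f"model accuracy:    {model_accuracy:.2f}")
print(f"baseline accuracy: {baseline_accuracy:.2f}")
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

In this notebook the same calls would take `y_test` and `y_pred`; a model is only informative to the extent that it beats the baseline.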

**Save trained pipeline**

In [10]:
joblib.dump(pipeline, "student_knn_pipeline.pkl")


['student_knn_pipeline.pkl']
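A saved pipeline can be reloaded with `joblib.load` and used for prediction without repeating the preprocessing by hand. The sketch below is self-contained: it fits a toy pipeline on the five rows shown earlier, writes it to a temporary file, and reloads it. The toy pipeline, the file name `toy_knn_pipeline.pkl`, and the `new_student` values are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

columns = ['studytime', 'absences', 'G1', 'G2']

# Toy training data: the five rows and labels displayed earlier in the notebook.
X_toy = pd.DataFrame(
    [[2, 6, 5, 6], [2, 4, 5, 5], [2, 10, 7, 8], [3, 2, 15, 14], [2, 4, 6, 10]],
    columns=columns,
)
y_toy = [0, 0, 1, 1, 1]

toy_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])
toy_pipeline.fit(X_toy, y_toy)

# Round-trip through disk, as a deployment script would do.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_knn_pipeline.pkl")
    joblib.dump(toy_pipeline, path)
    loaded = joblib.load(path)

# New data must have the same columns, in the same order, as the training data.
new_student = pd.DataFrame([[3, 1, 14, 15]], columns=columns)
prediction = loaded.predict(new_student)
print("predicted class (1 = pass, 0 = fail):", prediction[0])
```

For the pipeline saved above, the corresponding call would be `joblib.load("student_knn_pipeline.pkl")`. Pickled scikit-learn models are generally only safe to load with the library version that created them, and only from trusted sources.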