<a href="https://colab.research.google.com/github/AyeshaIjazTabassum/PythonAIBootcamp/blob/main/Day9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 📌 Import Required Libraries

This cell imports all libraries required for KNN, preprocessing, and evaluation metrics.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

## 📌 Load the Dataset

Load your CSV dataset and display the first 5 rows to confirm it's loaded correctly.

In [5]:
df = pd.read_csv("students_dataset.csv")
df.head()

| | student_id | age | gender | study_hours | extracurricular | sleep_hours | previous_score | attendance_rate | favorite_subject | parental_support | internet_access | target_grade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17 | Male | 18.324 | Yes | 7.842 | 82.456 | 91.234 | Math | High | Yes | A |
| 1 | 2 | 16 | Female | 15.678 | No | 6.123 | 71.89 | 88.567 | Science | Medium | Yes | B |
| 2 | 3 | 18 | Male | 21.456 | Yes | 8.901 | 89.123 | 93.456 | Math | High | Yes | A |
| 3 | 4 | 19 | Female | 12.345 | Yes | 5.678 | 64.789 | 76.89 | English | Low | No | C |
| 4 | 5 | 17 | Male | 17.89 | No | 7.234 | 77.012 | 85.678 | History | Medium | Yes | B |


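## 📌 Basic Cleaning

- Fill missing numeric values with the column mean.
- Fill missing categorical values with the column mode.
- Remove unrealistic rows: the code keeps only rows with age below 100, non-negative study hours, and previous_score of at most 100.

Note that the means are computed before the unrealistic rows are dropped, so those rows still contribute to the fill values.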

In [6]:
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

categorical_cols = df.select_dtypes(include=['object']).columns
df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])

df = df[df['age'] < 100]
df = df[df['study_hours'] >= 0]
df = df[df['previous_score'] <= 100]

df.shape

(196, 12)
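
The fill-then-filter steps above can also be run in the opposite order, so that implausible rows do not feed into the column means. The following is a minimal sketch on a toy frame with made-up values (not rows from `students_dataset.csv`); the name `toy` and the choice to filter first are assumptions for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names follow the notebook.
toy = pd.DataFrame({
    "age": [17.0, np.nan, 19.0, 150.0],
    "study_hours": [10.0, 12.0, np.nan, 8.0],
    "gender": ["Male", None, "Male", "Female"],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Drop implausible rows first (keeping rows whose age is still missing),
# so the outlier does not distort the mean used for filling.
toy = toy[toy["age"].isna() | (toy["age"] < 100)].copy()

# Fill numeric gaps with the column mean and categorical gaps with the mode.
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

print(toy)
```

Filtering first changes which rows survive when the filtered column has missing values, which is why the sketch keeps missing ages explicitly.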

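## 📌 Encode Categorical Columns

Convert `gender`, `extracurricular`, `favorite_subject`, and the target `target_grade` into integer codes using LabelEncoder. The remaining string columns (`parental_support`, `internet_access`) are left as they are here and one-hot encoded later, before scaling.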

In [7]:
le = LabelEncoder()
for col in ['gender', 'extracurricular', 'favorite_subject', 'target_grade']:
    df[col] = le.fit_transform(df[col])
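
Because the loop above reuses a single `le` object, only the mapping for the last column remains available afterwards. If the integer codes need to be translated back to labels later, one option is to keep a separate encoder per column. This is a sketch on made-up values; the `encoders` dictionary is a hypothetical helper, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values; column names follow the notebook.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "target_grade": ["A", "C", "B"],
})

# One fitted encoder per column, kept for later decoding.
encoders = {}
for col in ["gender", "target_grade"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# classes_ is sorted, so integer code i corresponds to classes_[i].
grade_names = list(encoders["target_grade"].classes_)
decoded = list(encoders["target_grade"].inverse_transform([0, 2]))

print(grade_names, decoded)
```

Note that LabelEncoder assigns codes in sorted order, which implies an ordering that nominal features such as `favorite_subject` do not really have; for a distance-based model, one-hot encoding is a common alternative.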

## 📌 Select Features (X) and Target (y)

`X` contains all input features, `y` is the target grade.

In [8]:
X = df.drop('target_grade', axis=1)
y = df['target_grade']
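
As written, `X` still contains `student_id`, which is an identifier rather than a property of the student, yet it contributes to KNN distances like any other column. One possible variation, shown here on made-up rows, drops it together with the target. Whether this helps on the real dataset is an assumption that would need to be checked.

```python
import pandas as pd

# Hypothetical rows; column names follow the notebook.
toy = pd.DataFrame({
    "student_id": [1, 2, 3],
    "study_hours": [18.3, 15.7, 21.5],
    "target_grade": [0, 1, 0],
})

# Drop the identifier along with the target so it cannot influence distances.
X_toy = toy.drop(columns=["student_id", "target_grade"])
y_toy = toy["target_grade"]

print(list(X_toy.columns))
```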

## 📌 Train/Test Split

Split the dataset into training and testing sets (80% train, 20% test). `random_state=42` makes the split reproducible.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Training Data:", X_train.shape)
print("Testing Data:", X_test.shape)

Training Data: (156, 11)
Testing Data: (40, 11)
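
With only 40 test rows, a purely random split can leave some grades with very few test samples. `train_test_split` accepts a `stratify` argument that keeps class proportions the same in both parts. The sketch below uses made-up imbalanced labels to show the effect; it is an optional variation, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, the 20-row test set keeps the 80/20 class ratio.
test_counts = np.bincount(y_te)
print(test_counts)
```

Stratification requires at least two samples of every class, so very rare grades would need to be handled before using it.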


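## 📌 Feature Scaling

KNN classifies a sample by its distance to training samples, so features should be on comparable scales; otherwise features with large numeric ranges dominate the distance. Before scaling, the columns that are still strings are one-hot encoded, and the scaler is fit on the training set only so that no test-set statistics leak into training.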

In [13]:
X_train.dtypes

student_id            int64
age                   int64
gender                int64
study_hours         float64
extracurricular       int64
sleep_hours         float64
previous_score      float64
attendance_rate     float64
favorite_subject      int64
parental_support     object


In [14]:
X_train_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)

In [15]:
X_train_encoded, X_test_encoded = X_train_encoded.align(
    X_test_encoded, join="left", axis=1, fill_value=0
)
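
The two cells above one-hot encode the training and test sets separately and then use `align` so that the test set ends up with exactly the training columns. A small sketch on made-up values shows what happens when a category is missing from the test set; the frames `train` and `test` are hypothetical.

```python
import pandas as pd

# Hypothetical data: the test set happens to contain no "Low" rows.
train = pd.DataFrame({"parental_support": ["High", "Low", "Medium"]})
test = pd.DataFrame({"parental_support": ["High", "Medium"]})

train_enc = pd.get_dummies(train, drop_first=True)
test_enc = pd.get_dummies(test, drop_first=True)

# join="left" keeps the training columns; columns missing from the
# test set are created and filled with 0.
train_enc, test_enc = train_enc.align(
    test_enc, join="left", axis=1, fill_value=0
)

print(list(test_enc.columns))
```

One caveat with this pattern: `drop_first=True` drops the alphabetically first category present in each frame, so if the test set lacked the category that was dropped in training, a different column would be dropped there. Fitting a single encoder on the training data avoids that, at the cost of a slightly longer setup.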

In [17]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train_encoded)
X_test_scaled = scaler.transform(X_test_encoded)
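
To see why scaling matters for distances, consider two made-up features with very different ranges. The sketch below uses hypothetical arrays (`train`, `test`) and compares the per-feature gap between two rows before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: the second column has a 20x larger range.
train = np.array([[10.0, 100.0], [20.0, 300.0], [30.0, 500.0]])
test = np.array([[20.0, 300.0]])

# Fit on the training data only, then apply the same statistics to test.
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Gap between the first two rows, feature by feature.
raw_gap = np.abs(train[0] - train[1])                    # second feature dominates
scaled_gap = np.abs(train_scaled[0] - train_scaled[1])   # both contribute equally

print(raw_gap, scaled_gap)
```

Here the test row equals the training mean, so it maps to zeros after scaling, which confirms that the training statistics are the ones being applied.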

## 📌 Train KNN Model

Train a KNN classifier with k = 5 neighbors (`n_neighbors=5`).

In [18]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
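
The value k = 5 is the scikit-learn default rather than a tuned choice. One common way to pick k is to compare cross-validated accuracy over a few candidates. The sketch below does this on the bundled iris dataset, used purely as a stand-in because the student CSV is not available here; the candidate list is an arbitrary assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; replace with the notebook's encoded features and target.
X_demo, y_demo = load_iris(return_X_y=True)

candidate_ks = [1, 3, 5, 7, 9]
mean_scores = {}
for k in candidate_ks:
    # The pipeline refits the scaler inside each fold, avoiding leakage.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_k)
```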

## 📌 Make Predictions

Predict target grades for the test data. The predictions are the integer codes produced by LabelEncoder, not the original grade letters.

In [19]:
y_pred = knn.predict(X_test_scaled)
y_pred[:10]

array([2, 1, 0, 0, 2, 0, 1, 0, 1, 2])

## 📌 Accuracy

Fraction of test samples whose predicted grade matches the true grade.

In [20]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.775


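## 📌 Precision, Recall, F1 Score

- Precision: of the samples predicted as a class, the fraction that truly belong to it.
- Recall: of the samples that truly belong to a class, the fraction the model identified.
- F1 Score: the harmonic mean of precision and recall.

With `average='weighted'`, each per-class score is weighted by the number of true samples in that class (its support).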

In [21]:
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Precision: 0.8895833333333332
Recall: 0.775
F1 Score: 0.7739901960784313
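
The weighted recall printed above is identical to the accuracy (0.775). That is expected: weighting per-class recall by support adds up the correct predictions and divides by the total. The sketch below checks this on made-up labels and also contrasts `weighted` with `macro` averaging; the toy label lists are hypothetical.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: four samples of class 0, two of class 1.
y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1]

# Per-class precision is 0.8 for class 0 and 1.0 for class 1.
macro_precision = precision_score(y_true_toy, y_pred_toy, average="macro")
weighted_precision = precision_score(y_true_toy, y_pred_toy, average="weighted")

# Weighted recall coincides with plain accuracy.
weighted_recall = recall_score(y_true_toy, y_pred_toy, average="weighted")
accuracy_toy = accuracy_score(y_true_toy, y_pred_toy)

print(macro_precision, weighted_precision, weighted_recall, accuracy_toy)
```

Macro averaging treats every class equally, so it is the more sensitive of the two to poor performance on rare classes.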


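## 📌 Confusion Matrix

Shows correct versus misclassified predictions. Rows are the true classes and columns are the predicted classes, so the diagonal holds the correct predictions.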

In [22]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[14,  0,  0,  0,  0],
       [ 0, 11,  2,  0,  0],
       [ 0,  1,  4,  0,  0],
       [ 0,  0,  2,  1,  0],
       [ 0,  0,  4,  0,  1]])
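
The metrics reported earlier can be recovered directly from this matrix. The sketch below copies the matrix from the output above into a variable named `cm_demo` (a name chosen here for illustration) and derives accuracy, per-class precision and recall, and the weighted precision.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true classes, columns: predicted classes).
cm_demo = np.array([
    [14, 0, 0, 0, 0],
    [0, 11, 2, 0, 0],
    [0, 1, 4, 0, 0],
    [0, 0, 2, 1, 0],
    [0, 0, 4, 0, 1],
])

true_positives = np.diag(cm_demo)
support = cm_demo.sum(axis=1)            # true samples per class
predicted_counts = cm_demo.sum(axis=0)   # predictions per class

recall_per_class = true_positives / support
precision_per_class = true_positives / predicted_counts
accuracy_demo = true_positives.sum() / cm_demo.sum()
weighted_precision = np.average(precision_per_class, weights=support)

print(accuracy_demo, weighted_precision)
```

The third column shows where most errors go: 8 of the 12 samples predicted as class 2 actually belong to classes 1, 3, or 4, which is why precision for class 2 is low.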

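## 📌 Classification Report

Detailed metrics per class (Precision, Recall, F1, Support). Support is small for classes 2 to 4 (3 to 5 test samples each), so their per-class scores can change a lot with a single prediction.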

In [23]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       0.92      0.85      0.88        13
           2       0.33      0.80      0.47         5
           3       1.00      0.33      0.50         3
           4       1.00      0.20      0.33         5

    accuracy                           0.78        40
   macro avg       0.85      0.64      0.64        40
weighted avg       0.89      0.78      0.77        40
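
The report above labels classes by their integer codes. `classification_report` also accepts `target_names` for readable labels and `output_dict=True` for programmatic access. The sketch below uses made-up labels and hypothetical class names `A`, `B`, `C`; with the real data, the names would come from the fitted encoder for `target_grade`.

```python
from sklearn.metrics import classification_report

# Hypothetical encoded labels for three classes.
y_true_toy = [0, 1, 2, 2, 1]
y_pred_toy = [0, 1, 2, 1, 1]

# target_names must be listed in the same order as the sorted class codes.
report = classification_report(
    y_true_toy,
    y_pred_toy,
    target_names=["A", "B", "C"],
    output_dict=True,
)

# Individual values can now be read by class name and metric.
print(report["B"]["precision"], report["C"]["recall"], report["accuracy"])
```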

