# k-Nearest Neighbors (kNN) Analysis

This notebook uses the k-Nearest Neighbors (kNN) algorithm to predict student performance.  
We use a dataset called **student_performance.csv** with features like self-study hours, attendance, and class participation.  

We build two models:
- A **classification model** to predict student grades.
- A **regression model** to predict total score.

We also use the regression model to predict the total score of a new example student.


In [4]:
import pandas as pd

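# Change the path to where you saved the file.
# Note: this URL points to a player-statistics CSV, not student_performance.csv,
# so the student columns used in the later cells are not present in it.
url = "https://raw.githubusercontent.com/Patrick0481/Data-mining-2025-course/refs/heads/main/playerstats%20(1).csv"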
df = pd.read_csv(url, encoding='latin-1')

kolommen = df.columns.tolist()
print(kolommen)

['Rk;Player;Nation;Pos;Squad;Comp;Age;Born;MP;Starts;Min;90s;Goals;Shots;SoT;SoT%;G/Sh;G/SoT;ShoDist;ShoFK;ShoPK;PKatt;PasTotCmp;PasTotAtt;PasTotCmp%;PasTotDist;PasTotPrgDist;PasShoCmp;PasShoAtt;PasShoCmp%;PasMedCmp;PasMedAtt;PasMedCmp%;PasLonCmp;PasLonAtt;PasLonCmp%;Assists;PasAss;Pas3rd;PPA;CrsPA;PasProg;PasAtt;PasLive;PasDead;PasFK;TB;Sw;PasCrs;TI;CK;CkIn;CkOut;CkStr;PasCmp;PasOff;PasBlocks;SCA;ScaPassLive;ScaPassDead;ScaDrib;ScaSh;ScaFld;ScaDef;GCA;GcaPassLive;GcaPassDead;GcaDrib;GcaSh;GcaFld;GcaDef;Tkl;TklWon;TklDef3rd;TklMid3rd;TklAtt3rd;TklDri;TklDriAtt;TklDri%;TklDriPast;Blocks;BlkSh;BlkPass;Int;Tkl+Int;Clr;Err;Touches;TouDefPen;TouDef3rd;TouMid3rd;TouAtt3rd;TouAttPen;TouLive;ToAtt;ToSuc;ToSuc%;ToTkl;ToTkl%;Carries;CarTotDist;CarPrgDist;CarProg;Car3rd;CPA;CarMis;CarDis;Rec;RecProg;CrdY;CrdR;2CrdY;Fls;Fld;Off;Crs;TklW;PKwon;PKcon;OG;Recov;AerWon;AerLost;AerWon%']
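
The printed list holds a single string with every column name joined by semicolons. That suggests the file is semicolon-delimited and was parsed with the default comma separator. A minimal sketch of the likely fix, using a small in-memory sample in place of the real URL (the `sample` text is made up for illustration, not taken from the file):

```python
import io

import pandas as pd

# Hypothetical three-column sample that mimics a semicolon-delimited file
sample = "Rk;Player;Goals\n1;Player A;3\n2;Player B;0\n"

# Default separator (comma): the whole header ends up in one column
df_wrong = pd.read_csv(io.StringIO(sample))
print(df_wrong.columns.tolist())

# Passing sep=';' splits the header into separate columns
df_right = pd.read_csv(io.StringIO(sample), sep=';')
print(df_right.columns.tolist())
```

Passing `sep=';'` to the `pd.read_csv` call in the cell above should have the same effect on the real file, assuming it is semicolon-delimited throughout.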


In [None]:
# Import the libraries needed for kNN classification, regression, and evaluation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Select the features (independent variables)
X = df[['weekly_self_study_hours', 'attendance_percentage', 'class_participation']]

# Select the target (dependent variable)
y = df['grade']

# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize (scale) the data so all features have similar range
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create the kNN classifier with 5 neighbors
knn_clf = KNeighborsClassifier(n_neighbors=5)

# Train the model
knn_clf.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = knn_clf.predict(X_test_scaled)

# Show model evaluation results
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

KeyError: "None of [Index(['weekly_self_study_hours', 'attendance_percentage',\n       'class_participation'],\n      dtype='object')] are in the [columns]"
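
The `KeyError` means that none of the three requested feature columns exist in `df`, which is consistent with the column list printed by the first cell. A small guard can report which columns are absent before the selection is attempted. The sketch below uses a hypothetical helper named `missing_columns` and a toy DataFrame with made-up values:

```python
import pandas as pd


def missing_columns(frame, required):
    """Return the names in `required` that are not columns of `frame`."""
    return [name for name in required if name not in frame.columns]


# Toy frame standing in for the loaded data (values are made up)
toy = pd.DataFrame({"weekly_self_study_hours": [10, 4], "grade": ["A", "C"]})

required = ["weekly_self_study_hours", "attendance_percentage",
            "class_participation", "grade"]

missing = missing_columns(toy, required)
if missing:
    print("Missing columns:", missing)
```

Running such a check right after loading the data makes a wrong file or a wrong separator visible immediately, rather than several steps later.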

# 🧾 Conclusion

The kNN classifier reached **66% accuracy** on the test set.  
It predicts **grade A** well but struggles with the lower grades.  
This is likely due to **imbalanced data** (many A students).  
Overall, the results are fair and could probably be improved by tuning `n_neighbors` or balancing the classes.
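
One way the tuning mentioned above could be approached is a cross-validated grid search over `n_neighbors` and the neighbor weighting. The sketch below is an illustration only: it runs on synthetic, imbalanced three-class data from `make_classification` rather than on the student data, and the grid values and the `f1_macro` scoring choice are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 3-class data standing in for the student grades
X_demo, y_demo = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.6, 0.3, 0.1],
    random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using only its own training part
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

param_grid = {
    "kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11],
    "kneighborsclassifier__weights": ["uniform", "distance"],
}

# Macro-averaged F1 gives every class the same weight,
# so rare classes count as much as common ones
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_macro")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Test macro F1:", search.score(X_te, y_te))
```

Accuracy alone can look acceptable when one class dominates, which is why a class-balanced metric is used for model selection here.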

In [3]:

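# Make sure total_score is numeric; values that cannot be parsed become NaN
df['total_score'] = pd.to_numeric(df['total_score'], errors='coerce')

# Drop rows without a usable score, since kNN cannot be fitted on NaN targets
data = df.dropna(subset=['total_score'])

# Select the features and target variable
X = data[['weekly_self_study_hours', 'attendance_percentage', 'class_participation']]
y = data['total_score']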

# Split the data again for this target (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize (scale) the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create the kNN regressor with 5 neighbors
knn = KNeighborsRegressor(n_neighbors=5)

# Train the regression model
knn.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test_scaled)

# Show evaluation metrics for regression
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R^2 Score:", r2_score(y_test, y_pred))

Mean Squared Error: 80.74160792400001
R^2 Score: 0.6608558641758525
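
The Mean Squared Error is expressed in squared score units, which makes it hard to read directly. Taking the square root gives the RMSE in the same units as `total_score`, here roughly 9 points. A short sketch using the two numbers printed above:

```python
import math

mse = 80.74160792400001   # Mean Squared Error printed above
r2 = 0.6608558641758525   # R^2 score printed above

# RMSE is in the same units as total_score (points), unlike MSE (points squared)
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.2f} points")

# Share of the variance in total_score that the model accounts for on the test set
print(f"Variance explained: {r2:.1%}")
```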


In [4]:
# Enter a new student as a list (order must match X.columns)
new_student = [[10, 80, 5]]  # [self-study hours, attendance %, participation]

# Convert to a DataFrame with the correct column names
new_student_df = pd.DataFrame(new_student, columns=X.columns)

# Scale using the same scaler fitted on the training data
new_student_scaled = scaler.transform(new_student_df)

# Predict the total score for the new student
predicted_score = knn.predict(new_student_scaled)
print("Predicted score:", predicted_score[0])

Predicted score: 81.28
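
For intuition about where such a number comes from: with its default uniform weights, `KNeighborsRegressor` predicts the mean target value of the k nearest training points. The sketch below reproduces that idea from scratch with NumPy on made-up data. `knn_predict_mean` is a hypothetical helper, not part of scikit-learn, and it skips the scaling step for brevity.

```python
import numpy as np


def knn_predict_mean(X_train, y_train, x_new, k=5):
    """Predict the mean target of the k training points closest to x_new."""
    # Euclidean distance from x_new to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()


# Made-up training data: [self-study hours, attendance %, participation]
X_toy = np.array([
    [10, 80, 5],
    [11, 82, 5],
    [9, 78, 4],
    [2, 50, 1],
    [20, 95, 9],
    [12, 85, 6],
], dtype=float)
y_toy = np.array([80.0, 82.0, 78.0, 40.0, 95.0, 85.0])

# With k=3 the nearest neighbors are the first three rows,
# so the prediction is the mean of their scores
prediction = knn_predict_mean(X_toy, y_toy, np.array([10.0, 80.0, 5.0]), k=3)
print("Predicted score:", prediction)
```

Because the prediction is an average of neighboring scores, kNN regression cannot predict a value outside the range of scores seen during training.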
