In [2]:
import pandas as pd

# Load the dataset
file_path = 'Student_performance_10k.csv'
data = pd.read_csv(file_path)

# Display the first few rows and basic info
data.head(), data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   roll_no                      9999 non-null   object 
 1   gender                       9982 non-null   object 
 2   race_ethnicity               9977 non-null   object 
 3   parental_level_of_education  9978 non-null   object 
 4   lunch                        9976 non-null   float64
 5   test_preparation_course      9977 non-null   float64
 6   math_score                   9976 non-null   object 
 7   reading_score                9975 non-null   float64
 8   writing_score                9976 non-null   float64
 9   science_score                9977 non-null   float64
 10  total_score                  9981 non-null   float64
 11  grade                        9997 non-null   object 
dtypes: float64(6), object(6)
memory usage: 937.6+ KB


(  roll_no gender race_ethnicity parental_level_of_education  lunch  \
 0  std-01   male        group D                some college    1.0   
 1  std-02   male        group B                 high school    1.0   
 2  std-03   male        group C             master's degree    1.0   
 3  std-04   male        group D                some college    1.0   
 4  std-05   male        group C                some college    0.0   
 
    test_preparation_course math_score  reading_score  writing_score  \
 0                      1.0         89           38.0           85.0   
 1                      0.0         65          100.0           67.0   
 2                      0.0         10           99.0           97.0   
 3                      1.0         22           51.0           41.0   
 4                      1.0         26           58.0           64.0   
 
    science_score  total_score grade  
 0           26.0        238.0     C  
 1           96.0        328.0     A  
 2           58.0    
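
The info() summary lists math_score as object while the other score columns are float64, which usually means some entries cannot be parsed as numbers. A minimal sketch for listing such entries before coercing them, run on a small made-up Series rather than the real CSV (the sample values and the helper name find_non_numeric are assumptions for illustration):

```python
import pandas as pd

def find_non_numeric(series: pd.Series) -> pd.Series:
    """Return the non-missing entries of `series` that cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")
    # Present in the original but NaN after coercion means the value was not numeric
    mask = series.notna() & parsed.isna()
    return series[mask]

# Made-up sample; on the real data this would be data["math_score"]
sample = pd.Series(["89", "65", "?", None, "22", "n/a"], dtype=object)
bad = find_non_numeric(sample)
print(bad.tolist())
```

Inspecting these values first shows how many rows the later median imputation will overwrite.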

In [7]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Clean the data
data.dropna(subset=['roll_no'], inplace=True)
categorical_columns = ['gender', 'race_ethnicity', 'parental_level_of_education', 'grade']
numerical_columns = ['lunch', 'test_preparation_course', 'math_score', 'reading_score', 'writing_score', 'science_score', 'total_score']
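for column in categorical_columns:
    # Assign the result back; chained fillna(..., inplace=True) acts on a copy
    # and stops working in pandas 3.0
    data[column] = data[column].fillna(data[column].mode()[0])
for column in numerical_columns:
    # Coerce unparseable entries (e.g. in math_score) to NaN before imputing
    if data[column].dtype == object:
        data[column] = pd.to_numeric(data[column], errors='coerce')
    data[column] = data[column].fillna(data[column].median())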

# Selecting features and scaling
features = data[numerical_columns]
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Applying KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(features_scaled)
data['cluster'] = clusters

# Calculate the cluster centroids
centroids = kmeans.cluster_centers_

# Evaluate the clustering
silhouette_avg = silhouette_score(features_scaled, clusters)

# Output results
print("Cluster centers:")
print(centroids)

Cluster centers:
[[-1.34811474 -0.03718046  0.21582209  0.16640338  0.25732339  0.18588746
   0.38146207]
 [ 0.74177662  0.06080073  0.36433924  0.23163791  0.31872186  0.37787076
   0.60223629]
 [ 0.1189389  -0.05269261 -0.68929744 -0.46309675 -0.66135047 -0.68257809
  -1.16042818]]
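
The raw array above is hard to read because its columns are unlabeled and in standardized units. A minimal sketch of one way to label the centroids and map them back to the original units with the scaler's inverse_transform. It uses synthetic data and only three of the feature names, so the numbers it prints are not the notebook's results:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data[numerical_columns]; all values are made up
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "lunch": rng.integers(0, 2, size=200).astype(float),
    "math_score": rng.uniform(0, 100, size=200),
    "reading_score": rng.uniform(0, 100, size=200),
})

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(features_scaled)

# Centroids in standardized units, one labeled column per feature
centroids_scaled = pd.DataFrame(kmeans.cluster_centers_, columns=features.columns)
# The same centroids converted back to the original units
centroids_original = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_), columns=features.columns
)
print(centroids_scaled.round(3))
print(centroids_original.round(1))
```

With labeled columns, each centroid value can be read against the right feature, and the original-unit table shows, for example, the share of lunch = 1 students in each cluster.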


In [None]:
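Each centroid lists standardized values (z-scores) in the order of numerical_columns: lunch, test_preparation_course, math_score, reading_score, writing_score, science_score, total_score.

Cluster 0:
Well below average lunch value (-1.348)
Near average test preparation course (-0.037)
Slightly above average math score (0.216)
Slightly above average reading score (0.166)
Slightly above average writing score (0.257)
Slightly above average science score (0.186)
Above average total score (0.381)
This cluster is set apart mainly by lunch rather than by scores. lunch appears to be a 0/1 indicator in the preview, so a centroid of -1.348 is consistent with nearly all of these students having lunch = 0. Their scores are slightly above average in every subject.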

Cluster 1:
Above average lunch value (0.742)
Near average test preparation course (0.061)
Above average math score (0.364)
Above average reading score (0.232)
Above average writing score (0.319)
Above average science score (0.378)
Above average total score (0.602)
This cluster likely represents the highest-performing students, who score above average in every subject. Its score profile is close to that of Cluster 0; the two clusters differ mainly in lunch.

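Cluster 2:
Near average lunch value (0.119)
Near average test preparation course (-0.053)
Well below average math score (-0.689)
Below average reading score (-0.463)
Well below average writing score (-0.661)
Well below average science score (-0.683)
Well below average total score (-1.160)
This cluster represents students who are struggling across the board, most of all in math, science, and writing. Their lunch and test preparation values are close to the overall mean.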
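
The descriptions above are read off standardized centroids. A minimal sketch of a per-cluster profile in original units, built with a groupby on the cluster column. The rows below are made up and stand in for the cleaned data frame, so the numbers are illustrative only:

```python
import pandas as pd

# Made-up rows standing in for the cleaned `data` with its `cluster` column
data = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "lunch": [0.0, 0.0, 1.0, 1.0, 1.0, 0.0],
    "math_score": [70, 74, 78, 80, 40, 44],
    "total_score": [280, 290, 300, 310, 170, 180],
})

# Mean of each feature per cluster in original units, plus the cluster sizes
profile = data.groupby("cluster").mean()
profile["n_students"] = data.groupby("cluster").size()
print(profile)
```

The cluster sizes matter as much as the means: a very small cluster deserves more caution than its centroid alone suggests.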

Usage:
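Cluster 0 students score slightly above average in every subject and show no specific weakness in math. What sets them apart is their lunch value, so it is worth confirming what lunch = 0 encodes before planning any intervention for this group.
Cluster 1 students are the highest achievers and might benefit from advanced coursework and enrichment programs.
Cluster 2 students need broad support, particularly in math, science, and writing.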
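
Before acting on these groups, it may help to check whether three clusters is a reasonable choice. The notebook computes silhouette_avg for k = 3 but never prints it. A minimal sketch that compares silhouette scores across several values of k, using synthetic blobs with hand-picked centers (an assumption for illustration) in place of the scaled student features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for features_scaled: three well-separated blobs
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=0.6,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)

# Fit KMeans for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")

# The k with the highest score is one candidate, not a definitive answer
best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

Silhouette values range from -1 to 1, and higher values indicate tighter, better separated clusters. On the real data the scores may be much closer together than on these blobs.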