## Problem Statement
Analyze how attendance rates influence student outcomes by building a classification model. Group students into categories like "high achievers" and "struggling students" and identify key patterns in attendance that correlate with academic success or challenges.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('student_performance.csv')

In [3]:
df.head()

Unnamed: 0,StudentID,Name,Gender,AttendanceRate,StudyHoursPerWeek,PreviousGrade,ExtracurricularActivities,ParentalSupport,FinalGrade
0,1,John,Male,85,15,78,1,High,80
1,2,Sarah,Female,90,20,85,2,Medium,87
2,3,Alex,Male,78,10,65,0,Low,68
3,4,Michael,Male,92,25,90,3,High,92
4,5,Emma,Female,88,18,82,2,Medium,85


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   StudentID                  13 non-null     int64 
 1   Name                       13 non-null     object
 2   Gender                     13 non-null     object
 3   AttendanceRate             13 non-null     int64 
 4   StudyHoursPerWeek          13 non-null     int64 
 5   PreviousGrade              13 non-null     int64 
 6   ExtracurricularActivities  13 non-null     int64 
 7   ParentalSupport            13 non-null     object
 8   FinalGrade                 13 non-null     int64 
dtypes: int64(6), object(3)
memory usage: 1.0+ KB


There are no missing values. If there were, we could use an imputer to fill them with a strategy such as the mean, the mode, or a constant.
Next we need to define the target column, which categorizes each student as a high achiever or a struggling student. This requires a threshold on FinalGrade. Let's assume that a student whose FinalGrade is 75 or below is a Struggling Student and a student whose FinalGrade is above 75 is a High Achiever. There is no category in between.
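
The imputation strategies mentioned above (mean, mode, constant) can be sketched with scikit-learn's `SimpleImputer`. The arrays below are hypothetical toy values with gaps added for illustration; the actual dataset needs no imputation.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy columns with one missing entry each (not taken from the dataset).
attendance = np.array([[85.0], [np.nan], [78.0], [92.0]])
support = np.array([["High"], ["Medium"], [np.nan], ["High"]], dtype=object)

# Mean strategy: the gap becomes the mean of the observed values (85, 78, 92).
mean_filled = SimpleImputer(strategy="mean").fit_transform(attendance)

# Most-frequent strategy (the mode): also works for string columns.
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(support)

# Constant strategy: every gap is replaced by a fixed value chosen by the analyst.
const_filled = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(attendance)

print(mean_filled.ravel())
print(mode_filled.ravel())
print(const_filled.ravel())
```

Which strategy is appropriate depends on the column: the mean suits roughly symmetric numeric data, while the mode is the usual choice for categorical columns such as ParentalSupport.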

In [5]:
df['FinalGrade']

0     80
1     87
2     68
3     92
4     85
5     90
6     62
7     78
8     72
9     88
10    77
11    75
12    58
Name: FinalGrade, dtype: int64

In [6]:
newdf = df.drop(['StudentID', 'Name', 'Gender'], axis=1)
ordinal_mapping = {'Low': 1, 'Medium': 2, 'High': 3}
newdf['ParentalSupport'] = newdf['ParentalSupport'].map(ordinal_mapping)
corr_matrix = newdf.corr()
print(corr_matrix['FinalGrade'].sort_values(ascending=False))

FinalGrade                   1.000000
PreviousGrade                0.987637
AttendanceRate               0.951182
StudyHoursPerWeek            0.796374
ParentalSupport              0.654298
ExtracurricularActivities    0.288949
Name: FinalGrade, dtype: float64


In [7]:
newdf

Unnamed: 0,AttendanceRate,StudyHoursPerWeek,PreviousGrade,ExtracurricularActivities,ParentalSupport,FinalGrade
0,85,15,78,1,3,80
1,90,20,85,2,2,87
2,78,10,65,0,1,68
3,92,25,90,3,3,92
4,88,18,82,2,2,85
5,95,30,88,1,3,90
6,70,8,60,0,1,62
7,85,17,77,1,2,78
8,82,12,70,2,1,72
9,91,22,86,3,3,88


In [8]:
threshold = 75
# Create Student_Category: 'High Achiever' if FinalGrade is above the threshold, otherwise 'Struggling Student'
newdf['Student_Category'] = df['FinalGrade'].apply(lambda x: 'High Achiever' if x > threshold else 'Struggling Student')

In [9]:
newdf.head()

Unnamed: 0,AttendanceRate,StudyHoursPerWeek,PreviousGrade,ExtracurricularActivities,ParentalSupport,FinalGrade,Student_Category
0,85,15,78,1,3,80,High Achiever
1,90,20,85,2,2,87,High Achiever
2,78,10,65,0,1,68,Struggling Student
3,92,25,90,3,3,92,High Achiever
4,88,18,82,2,2,85,High Achiever


Since this is a binary classification task, we use LabelEncoder to convert the non-numeric Student_Category column into a numeric column of 0s and 1s.
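
It helps to know which label receives which code before interpreting any results. The sketch below uses a separate encoder, `le_demo` (a name introduced here only for illustration), to show that `LabelEncoder` assigns codes in sorted label order.

```python
from sklearn.preprocessing import LabelEncoder

# Separate encoder so this sketch does not interfere with the notebook's own `le`.
le_demo = LabelEncoder()
codes = le_demo.fit_transform(["Struggling Student", "High Achiever", "High Achiever"])

# classes_ holds the labels in sorted order; a label's position is its integer code.
mapping = {str(label): int(code)
           for code, label in enumerate(le_demo.classes_)}
print(mapping)  # {'High Achiever': 0, 'Struggling Student': 1}
```

Because the codes follow alphabetical order rather than any notion of "good" or "bad", checking `classes_` is a safer habit than assuming which category is 0.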

In [10]:
from sklearn.preprocessing import LabelEncoder

In [11]:
le = LabelEncoder()

In [12]:
newdf['Student_Category'] = le.fit_transform(newdf['Student_Category'])  # 0 = High Achiever, 1 = Struggling Student

In [13]:
newdf.head()

Unnamed: 0,AttendanceRate,StudyHoursPerWeek,PreviousGrade,ExtracurricularActivities,ParentalSupport,FinalGrade,Student_Category
0,85,15,78,1,3,80,0
1,90,20,85,2,2,87,0
2,78,10,65,0,1,68,1
3,92,25,90,3,3,92,0
4,88,18,82,2,2,85,0


In [14]:
from sklearn.model_selection import train_test_split

In [15]:
# Select the feature columns and the target before splitting into training and test sets
features = ['AttendanceRate', 'StudyHoursPerWeek', 'PreviousGrade', 'ParentalSupport']
target = 'Student_Category'
X = newdf[features]
y = newdf[target]

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [17]:
len(X_train)

9

In [18]:
from sklearn.ensemble import RandomForestClassifier

In [19]:
model = RandomForestClassifier(n_estimators=100, random_state=42)

In [20]:
model.fit(X_train, y_train)

In [21]:
y_pred = model.predict(X_test)

In [22]:
y_pred

array([0, 0, 0, 1])

In [23]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [24]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")

[[2 0]
 [1 1]]
              precision    recall  f1-score   support

           0       0.67      1.00      0.80         2
           1       1.00      0.50      0.67         2

    accuracy                           0.75         4
   macro avg       0.83      0.75      0.73         4
weighted avg       0.83      0.75      0.73         4

Accuracy: 0.75
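
The figures in the classification report follow directly from the confusion matrix. As a rough check, the sketch below recomputes them by hand in plain Python; the helper `per_class_metrics` is a name introduced here for illustration, not part of scikit-learn.

```python
# Confusion matrix printed above: rows are actual classes, columns are predicted classes.
cm = [[2, 0],
      [1, 1]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp   # predicted as k, actually another class
    fn = sum(cm[k]) - tp                  # actually k, predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k in range(len(cm)):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# Accuracy is the diagonal (correct predictions) divided by all predictions.
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy={accuracy:.2f}")
```

Read this way, the matrix shows that both High Achievers in the test set were classified correctly, while one of the two Struggling Students was predicted as a High Achiever.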


In [25]:
# Save the actual and predicted categories for the test set to a new CSV file so the model's results can be inspected
original_categories = le.inverse_transform(y_pred)

results_df = X_test.copy()
results_df['Actual_Category'] = le.inverse_transform(y_test)
results_df['Predicted_Category'] = original_categories

results_df.to_csv('student_performance_predictions.csv', index=False)

In [26]:
# Compute the correlation between AttendanceRate and the predicted category to see how attendance relates to the prediction
# Note: this predicts on all rows, including the ones the model was trained on
newdf['Predicted_Category'] = model.predict(X)
correlation = newdf[['AttendanceRate', 'Predicted_Category']].corr()

print("Correlation between AttendanceRate and Predicted_Category:")
print(correlation)


Correlation between AttendanceRate and Predicted_Category:
                    AttendanceRate  Predicted_Category
AttendanceRate            1.000000           -0.846137
Predicted_Category       -0.846137            1.000000
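
The sign of this coefficient depends entirely on how the two categories are encoded. The sketch below illustrates that with a hand-written Pearson correlation over the ten rows displayed in the `newdf` output earlier. It uses the actual categories derived from FinalGrade rather than the model's predictions, so the value will not match the coefficient printed above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# First ten rows as displayed in the notebook (the full file has 13 rows).
attendance = [85, 90, 78, 92, 88, 95, 70, 85, 82, 91]
final_grade = [80, 87, 68, 92, 85, 90, 62, 78, 72, 88]

# Encoding used in the notebook: 0 = High Achiever, 1 = Struggling Student.
struggling_is_1 = [0 if g > 75 else 1 for g in final_grade]
# Reversed encoding, for comparison only.
struggling_is_0 = [1 - c for c in struggling_is_1]

r_notebook = pearson(attendance, struggling_is_1)
r_reversed = pearson(attendance, struggling_is_0)
print(f"struggling coded as 1: r = {r_notebook:.3f}")
print(f"struggling coded as 0: r = {r_reversed:.3f}")
```

The two coefficients have the same magnitude and opposite signs, so a negative value here means "higher attendance, less likely to be struggling", not that attendance is harmful.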


## Final Conclusion
There is a strong negative correlation (about -0.85) between AttendanceRate and Predicted_Category. Because 'High Achiever' is encoded as 0 and 'Struggling Student' as 1, the negative sign means that as AttendanceRate increases, the predicted category is more likely to be 'High Achiever', while lower attendance rates are associated with 'Struggling Student'. However, the dataset contains only 13 students and the test set only 4, so the 75% accuracy is an unreliable estimate and the model may not generalize well. To improve it, we can collect more data, try other algorithms, and apply hyperparameter tuning.
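
As one possible way to follow up on the hyperparameter tuning suggestion, the sketch below runs a small grid search with stratified cross-validation. It uses only the ten rows displayed in the `newdf` output earlier (the full file has 13), and the grid values are illustrative assumptions rather than recommended settings.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# The ten rows shown in the notebook's `newdf` output.
rows = pd.DataFrame({
    "AttendanceRate":    [85, 90, 78, 92, 88, 95, 70, 85, 82, 91],
    "StudyHoursPerWeek": [15, 20, 10, 25, 18, 30, 8, 17, 12, 22],
    "PreviousGrade":     [78, 85, 65, 90, 82, 88, 60, 77, 70, 86],
    "ParentalSupport":   [3, 2, 1, 3, 2, 3, 1, 2, 1, 3],
    "FinalGrade":        [80, 87, 68, 92, 85, 90, 62, 78, 72, 88],
})
X_demo = rows.drop(columns="FinalGrade")
y_demo = (rows["FinalGrade"] <= 75).astype(int)  # 1 = Struggling Student, 0 = High Achiever

# Illustrative grid (assumed values, kept tiny because the data is tiny).
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 2]}

# Stratified folds preserve the class ratio, so each fold contains a struggling student.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Mean cross-validated accuracy: {search.best_score_:.2f}")
```

With so few rows, the cross-validated score is still noisy, but averaging over several folds is less dependent on one lucky or unlucky split than a single 4-row test set.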