<a href="https://colab.research.google.com/github/Basmala-hub/Student-Attendance-Dataset-College-Level-/blob/main/Student_Attendance_Dataset_College_Level.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**About Dataset**

This dataset provides a detailed representation of factors that influence student attendance in a college environment. It includes a wide variety of variables covering student demographics, lifestyle habits, academic background, environmental conditions, and attendance outcomes.

Each record represents a unique student-day profile with realistic behavior patterns. Variables such as study hours, sleep duration, travel time, internet access, weather conditions, hostel residency, and type of class (online/offline) offer an in-depth view of how different factors collectively impact attendance consistency.

The dataset reflects common real-world trends observed in educational institutions. For instance:

*   Students with higher study hours tend to show greater consistency in attendance.
*   Rainy weather or long travel times negatively affect attendance.
*   Online classes increase attendance likelihood due to convenience.
*   Hostel students generally have lower travel time and slightly better attendance rates.
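
Trends like these can be checked with grouped attendance rates. The following is a minimal sketch on a tiny hand-made frame: the rows and the category labels (`Rainy`, `Online`, and so on) are illustrative assumptions, not values taken from the dataset. Only the column names `attendance`, `weather`, and `class_type` come from this notebook.

```python
import pandas as pd

# Toy rows for illustration only; values and category labels are assumptions.
toy = pd.DataFrame({
    "weather": ["Rainy", "Sunny", "Rainy", "Sunny", "Sunny", "Rainy"],
    "class_type": ["Offline", "Online", "Offline", "Offline", "Online", "Online"],
    "attendance": [0, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 target within a group is that group's attendance rate.
rate_by_weather = toy.groupby("weather")["attendance"].mean()
rate_by_class_type = toy.groupby("class_type")["attendance"].mean()

print(rate_by_weather)
print(rate_by_class_type)
```

On the real data, the same two `groupby` calls on `dataSet` would show whether the stated trends actually hold.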


**How the machine solved it**


1.   Absence reasons are included to help understand common causes of absenteeism, such as illness, personal work, festivals, travel, or academic project commitments.
2.   The dataset is suitable for educational analysis, institutional planning, behavioral insights, research, dashboards, attendance trend visualization, and general statistical exploration. Since the data is fully synthetic, it avoids privacy issues while maintaining realistic patterns.







     




**Import libraries**

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler, OrdinalEncoder
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             classification_report, confusion_matrix)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

Download the dataset and get its local path

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("kundanbedmutha/student-attendance-dataset-college-level")

print("Path to dataset files:", path)

Show the files in this path

In [None]:
import os

print(os.listdir(path))


Show the first 5 rows of the dataset

In [None]:
dataSet=pd.read_csv(f"{path}/Attendance_Prediction.csv")
dataSet.head()

Show the number of samples (rows) and columns

In [None]:
dataSet.shape

In [None]:
dataSet.info()

Provide summary statistics

In [None]:
dataSet.describe()

Split features and target

In [None]:
# Features: every column except the target and absence_reason
x = dataSet.drop(columns=["attendance", "absence_reason"])
y = dataSet["attendance"]

List features

In [None]:
x.columns

Target

In [None]:
y.head()

Feature Type Identification

*   Numerical: the column dtype is int64 or float64
    *   student_id
    *   age
    *   study_hours
    *   sleep_hours

*   Binary categorical: the values are 0 or 1
    *   attendance
    *   internet_access
    *   hostel_resident

*   Ordinal categorical: the categories have a natural order
    *   year
    *   parent_education
    *   weather

*   Nominal categorical: the categories have no relationship to each other and cannot be ordered
    *   gender
    *   course
    *   class_type


In [None]:
import matplotlib.pyplot as plt
import math

numeric_columns = dataSet.select_dtypes(include=['int64', 'float64']).columns

n_cols = 2
n_rows = math.ceil(len(numeric_columns) / n_cols)

fig, axes = plt.subplots(n_rows, n_cols, figsize=(14, 4 * n_rows))
axes = axes.flatten()

for idx, col in enumerate(numeric_columns):
    axes[idx].hist(dataSet[col], bins=20, edgecolor='black')
    axes[idx].set_title(f'Distribution of {col}')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')
plt.tight_layout()
plt.show()


In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

In [None]:

class_counts = y_train.value_counts()
print(class_counts)
class_percentage = y_train.value_counts(normalize=True) * 100
print(class_percentage)


The classes are almost balanced, but SMOTE can still be applied to achieve an even better balance

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(6,4))
y_train.value_counts().plot(kind='bar', color='skyblue', edgecolor='black')
plt.title('Class Distribution of Target Variable')
plt.xlabel('Class')
plt.ylabel('Number of Samples')
plt.xticks(rotation=0)
plt.show()


Show the number of missing values per feature

In [None]:
print(x_train.isnull().sum(),"\n")
print(x_test.isnull().sum())


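Handle missing values in categorical features using the mode

In [None]:
cat_cols = [
    'course',
    'class_type',
]

for col in cat_cols:
    # Compute the mode on the training split only and reuse it for the test split
    mode_val = x_train[col].mode()[0]
    # fillna returns a new Series, so the result must be assigned back
    x_train[col] = x_train[col].fillna(mode_val)
    x_test[col] = x_test[col].fillna(mode_val)
# x_train=x_train.drop(columns=['course',"class_type"])
print(x_train['course'],'\n\n\n\n\n',x_train['class_type'])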
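
The key detail in mode imputation is that `fillna` returns a new Series instead of modifying the column in place. A minimal sketch on toy data follows; the course labels are illustrative assumptions, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Toy train/test columns; the labels are illustrative assumptions.
train = pd.DataFrame({"course": ["BSc", "BCom", "BSc", np.nan]})
test = pd.DataFrame({"course": [np.nan, "BCom"]})

# Most frequent value, computed from the training split only.
mode_val = train["course"].mode()[0]

# Assign the result back; calling fillna without assignment changes nothing.
train["course"] = train["course"].fillna(mode_val)
test["course"] = test["course"].fillna(mode_val)  # reuse the training mode

print(train["course"].tolist())
print(test["course"].tolist())
```

Reusing the training mode on the test split keeps information from the test rows out of the preprocessing.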

In [None]:
x_train.columns

OneHotEncoder for Nominal features:

In [None]:
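from sklearn.preprocessing import OneHotEncoder
nominal_cols = ["gender", "course", "class_type"]
# dtype=np.int8 stores the 0/1 indicator columns compactly
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False, dtype=np.int8)
# One-hot encoding yields one column per category, so build new columns
# instead of assigning the 2-D result back to a single column
train_ohe = pd.DataFrame(encoder.fit_transform(x_train[nominal_cols]),
                         columns=encoder.get_feature_names_out(nominal_cols),
                         index=x_train.index)
test_ohe = pd.DataFrame(encoder.transform(x_test[nominal_cols]),
                        columns=encoder.get_feature_names_out(nominal_cols),
                        index=x_test.index)
x_train = pd.concat([x_train.drop(columns=nominal_cols), train_ohe], axis=1)
x_test = pd.concat([x_test.drop(columns=nominal_cols), test_ohe], axis=1)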


OrdinalEncoder for ordinal features:

In [None]:
ordinal_cols = ["year", "parent_education", "weather"]
# Without an explicit `categories=` argument the encoder orders the categories
# alphabetically, which may not match their real-world order
ordencoder = OrdinalEncoder()
# Fit on the training split only, then reuse the same mapping for the test split
x_train[ordinal_cols] = ordencoder.fit_transform(x_train[ordinal_cols])
x_test[ordinal_cols] = ordencoder.transform(x_test[ordinal_cols])
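
For an ordinal feature, the integer codes should follow the real order of the categories, and the test split should reuse the mapping learned on the training split. A minimal sketch follows, assuming hypothetical education levels; the actual labels in `parent_education` should be checked with `dataSet["parent_education"].unique()` first.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical labels, lowest to highest; replace with the values actually
# present in the column.
education_order = ["School", "Graduate", "Postgraduate"]

train = pd.DataFrame({"parent_education": ["Graduate", "School", "Postgraduate"]})
test = pd.DataFrame({"parent_education": ["Postgraduate", "School"]})

# Passing `categories` fixes the code assigned to each label.
encoder = OrdinalEncoder(categories=[education_order])
train_codes = encoder.fit_transform(train[["parent_education"]]).ravel()
test_codes = encoder.transform(test[["parent_education"]]).ravel()  # transform only, no refit

print(train_codes)
print(test_codes)
```

Refitting the encoder on the test split could assign different codes to the same label whenever the two splits do not contain the same set of categories.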

LabelEncoder for binary features:

In [None]:
for col in ["internet_access", "hostel_resident"]:
    # Use a fresh encoder per column and apply the same mapping to the test split
    labelencoder = LabelEncoder()
    x_train[col] = labelencoder.fit_transform(x_train[col])
    x_test[col] = labelencoder.transform(x_test[col])

Handling Missing Values for numerical features by Mean

In [None]:
# numeric_cols = ['student_id', 'age', 'study_hours', 'sleep_hours']

# for col in numeric_cols:
#     mean_value = x_train[col].mean()
#     x_train[col].fillna(mean_value, inplace=True)
#     x_test[col].fillna(mean_value, inplace=True)

In [None]:
x_train.columns

In [None]:
x_train

In [None]:
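from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
# Note: the cells below continue from x_train / y_train, so the resampled
# data produced here is not used for model training
X_train_res, y_train_res = smote.fit_resample(x_train, y_train)
print(y_train_res.value_counts())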


In [None]:
plt.figure(figsize=(6,4))
# Plot the resampled target so the effect of SMOTE is visible
y_train_res.value_counts().plot(kind='bar', color='skyblue', edgecolor='black')
plt.title('Class Distribution of Target Variable After SMOTE')
plt.xlabel('Class')
plt.ylabel('Number of Samples')
plt.xticks(rotation=0)
plt.show()


**Remove outliers**

In [None]:
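# Apply the IQR rule only to the numerical features identified earlier; on
# encoded 0/1 columns the IQR can be zero, which would flag valid rows as outliers
num_cols = ['student_id', 'age', 'study_hours', 'sleep_hours']
Q1 = x_train[num_cols].quantile(0.25)
Q3 = x_train[num_cols].quantile(0.75)
IQR = Q3 - Q1

# Keep a row only if none of its numerical values falls outside the IQR fences
mask = ~((x_train[num_cols] < (Q1 - 1.5 * IQR)) |
         (x_train[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)

X_train_clean = x_train[mask]
y_train_clean = y_train[mask]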
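
The IQR rule keeps values between Q1 - 1.5·IQR and Q3 + 1.5·IQR. A minimal sketch on a toy series, where the values (including the deliberately extreme 15.0) are illustrative assumptions:

```python
import pandas as pd

# Toy values; 15.0 is an obvious outlier added for illustration.
sleep_hours = pd.Series([6.0, 6.5, 7.0, 7.5, 8.0, 15.0])

q1 = sleep_hours.quantile(0.25)
q3 = sleep_hours.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

inside = sleep_hours.between(lower, upper)
kept = sleep_hours[inside].tolist()
print(lower, upper)
print(kept)
```

Because the boolean mask shares the index of the original data, the same mask can be applied to the target to keep features and labels aligned, as the cell above does.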


**Feature Scaling**

In [None]:
scale=StandardScaler()
X_train_scaled = scale.fit_transform(X_train_clean[['student_id','age',"study_hours","sleep_hours"]])
x_test_scaled=scale.transform(x_test[['student_id','age',"study_hours","sleep_hours"]])


In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(x_test_scaled)
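
When `n_components` is a float between 0 and 1, PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below illustrates this on synthetic data (an assumption for this example, not the attendance features), where the fourth feature is nearly a copy of the first:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: three independent features plus a near-duplicate of the first.
base = rng.normal(size=(500, 3))
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=500)])

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95).fit(X_scaled)

# Number of components kept and the cumulative variance they explain.
cumulative = pca.explained_variance_ratio_.cumsum()
print(pca.n_components_)
print(cumulative)
```

The redundant feature adds almost no new variance, so fewer components than input features are enough to pass the 95% threshold.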


**K-Nearest Neighbors (KNN)**

In [None]:
knndefaultParameters=KNeighborsClassifier(n_neighbors=5)
knndefaultParameters.fit(X_train_pca,y_train_clean)
y_pred=knndefaultParameters.predict(X_test_pca)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred ))

**Performance using GridSearchCV best hyperparameters**

In [None]:

knn = KNeighborsClassifier()
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9, 11, 15,20],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}
grid_knn = GridSearchCV(knn, param_grid_knn, cv=5, scoring='accuracy')
grid_knn.fit(X_train_pca,y_train_clean)
print("Best Parameters:", grid_knn.best_params_)
print("Best Cross-Validation Accuracy:", grid_knn.best_score_)
best_knn = grid_knn.best_estimator_
y_pred = best_knn.predict(X_test_pca)

print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
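
In the cells above, the scaler and PCA are fitted once on the whole training split before cross-validation. One alternative is to place the preprocessing inside a `Pipeline`, so that each fold fits its own scaler and PCA. The following is a sketch of that approach on synthetic data from `make_classification` (a stand-in chosen for this example, not the attendance dataset), with a smaller grid than the one above:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the attendance features (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("knn", KNeighborsClassifier()),
])

# The "knn__" prefix routes each parameter to the pipeline step named "knn".
param_grid = {
    "knn__n_neighbors": [3, 5, 7],
    "knn__weights": ["uniform", "distance"],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_tr, y_tr)

test_accuracy = grid.score(X_te, y_te)
print(grid.best_params_)
print(test_accuracy)
```

Calling `grid.score` or `grid.predict` on raw test features applies the fitted scaler and PCA automatically before the classifier.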


**Naive Bayes**

In [None]:
naive_bayes = GaussianNB()

param_grid = {
    'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4]
}
gridGaussianNB = GridSearchCV(naive_bayes, param_grid, cv=5, scoring='accuracy')
gridGaussianNB.fit(X_train_pca,y_train_clean)
y_pred=gridGaussianNB.predict(X_test_pca)
print(classification_report(y_test,y_pred))



**Train random forest with default parameters**

In [None]:
randomforest=RandomForestClassifier()
ranforestModel=randomforest.fit(X_train_pca,y_train_clean)
y_pred=ranforestModel.predict(X_test_pca)
print(classification_report(y_test,y_pred))

**Performance using GridSearchCV best hyperparameters for random forest**





In [68]:

randomForestClassifier=RandomForestClassifier()
randomgrid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2"]
}
gridrandomForestClassifier=GridSearchCV(randomForestClassifier,randomgrid,cv=5,scoring="accuracy")
gridrandomForestClassifier.fit(X_train_pca,y_train_clean)
y_pred=gridrandomForestClassifier.predict(X_test_pca)
print(classification_report(y_test,y_pred))
print("Best Parameters:", gridrandomForestClassifier.best_params_)
print("Best Cross-Validation Accuracy:", gridrandomForestClassifier.best_score_)

              precision    recall  f1-score   support

           0       0.59      0.52      0.55      1946
           1       0.59      0.66      0.63      2054

    accuracy                           0.59      4000
   macro avg       0.59      0.59      0.59      4000
weighted avg       0.59      0.59      0.59      4000

Best Parameters: {'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}
Best Cross-Validation Accuracy: 0.5906935908691835


**Train logistic regression with default parameters**

In [69]:
logisticRegression=LogisticRegression()
logisticModel=logisticRegression.fit(X_train_pca,y_train_clean)
y_pred=logisticModel.predict(X_test_pca)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.59      0.54      0.56      1946
           1       0.60      0.65      0.62      2054

    accuracy                           0.60      4000
   macro avg       0.60      0.60      0.59      4000
weighted avg       0.60      0.60      0.60      4000



Performance using GridSearchCV best hyperparameters for **logistic Regression**

In [70]:
logisticRegression=LogisticRegression()
param_grid = [
    {"penalty": ["l1"], "C": [0.01,0.1,1], "solver": ["saga"], "class_weight":[None,"balanced"], "max_iter":[1000]},
    {"penalty": ["l2"], "C": [0.01,0.1,1], "solver": ["saga"], "class_weight":[None,"balanced"], "max_iter":[1000]},
    {"penalty": ["elasticnet"], "C": [0.01,0.1,1], "solver": ["saga"], "l1_ratio":[0,0.5,1], "class_weight":[None,"balanced"], "max_iter":[1000]}
]

gridLogisticRegression=GridSearchCV(logisticRegression,param_grid,cv=5,scoring="accuracy",n_jobs=-1)
gridLogistic=gridLogisticRegression.fit(X_train_pca,y_train_clean)
y_pred=gridLogistic.predict(X_test_pca)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.59      0.58      0.59      1946
           1       0.61      0.61      0.61      2054

    accuracy                           0.60      4000
   macro avg       0.60      0.60      0.60      4000
weighted avg       0.60      0.60      0.60      4000



**Decision tree with default parameters**

In [71]:
decisionTree=DecisionTreeClassifier()
decisionTreeModel=decisionTree.fit(X_train_pca,y_train_clean)
y_pred=decisionTreeModel.predict(X_test_pca)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.53      0.52      0.52      1946
           1       0.55      0.56      0.55      2054

    accuracy                           0.54      4000
   macro avg       0.54      0.54      0.54      4000
weighted avg       0.54      0.54      0.54      4000



*Performance using GridSearchCV best hyperparameters for* **Decision tree**

In [72]:
decisionTree=DecisionTreeClassifier()
param_grid = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": [None, 5, 10, 15],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [None, "sqrt", "log2"],
    "class_weight": [None, "balanced"],
    "splitter": ["best", "random"]
}
gridDecisionTree=GridSearchCV(decisionTree,param_grid,cv=5,scoring="accuracy", n_jobs=-1)
gridDecisionTree.fit(X_train_pca,y_train_clean)
y_pred=gridDecisionTree.predict(X_test_pca)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.56      0.55      0.56      1946
           1       0.58      0.59      0.59      2054

    accuracy                           0.57      4000
   macro avg       0.57      0.57      0.57      4000
weighted avg       0.57      0.57      0.57      4000



**SVM with default parameters**

In [73]:
svm=SVC()
modelsvm=svm.fit(X_train_pca,y_train_clean)
y_pred=modelsvm.predict(X_test_pca)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.59      0.51      0.55      1946
           1       0.59      0.66      0.62      2054

    accuracy                           0.59      4000
   macro avg       0.59      0.59      0.59      4000
weighted avg       0.59      0.59      0.59      4000



In [74]:
svm=SVC()
param_grid_svm = {
    "C": [0.1, 1, 10, 100],              # Regularization parameter
    "kernel": ["linear", "rbf", "poly"], # Kernel type
    "gamma": ["scale", "auto"],          # Controls how far the influence of each point reaches
    "degree": [2, 3, 4],                 # Only relevant for the 'poly' kernel
    "class_weight": [None, "balanced"]   # In case the data is imbalanced
}
gridSvc=GridSearchCV(svm,param_grid_svm,cv=5,n_jobs=-1,scoring="accuracy")
gridSvc.fit(X_train_pca,y_train_clean)
y_pred=gridSvc.predict(X_test_pca)
print(classification_report(y_test,y_pred))


KeyboardInterrupt: 
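
The search above was interrupted before it finished. One possible way to reduce its cost is to search each kernel only over the parameters it uses: `degree` matters only for the `poly` kernel, and `gamma` is not used by the `linear` kernel. The sketch below counts the model fits for both layouts with `ParameterGrid`; `split_grid` is a suggested restructuring, not the grid used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

c_values = [0.1, 1, 10, 100]
weights = [None, "balanced"]

# The grid from the cell above: every parameter is crossed with every kernel.
full_grid = {
    "C": c_values,
    "kernel": ["linear", "rbf", "poly"],
    "gamma": ["scale", "auto"],
    "degree": [2, 3, 4],
    "class_weight": weights,
}

# Same values, but each kernel only searches the parameters it actually uses.
split_grid = [
    {"kernel": ["linear"], "C": c_values, "class_weight": weights},
    {"kernel": ["rbf"], "C": c_values, "gamma": ["scale", "auto"],
     "class_weight": weights},
    {"kernel": ["poly"], "C": c_values, "gamma": ["scale", "auto"],
     "degree": [2, 3, 4], "class_weight": weights},
]

cv = 5  # number of cross-validation folds used in this notebook
full_fits = len(ParameterGrid(full_grid)) * cv
split_fits = len(ParameterGrid(split_grid)) * cv
print(full_fits)   # 144 combinations x 5 folds = 720 fits
print(split_fits)  # 72 combinations x 5 folds = 360 fits
```

Passing `split_grid` to `GridSearchCV` in place of `param_grid_svm` would halve the number of fits without dropping any distinct model configuration.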

In [16]:
!git init

Reinitialized existing Git repository in /content/Student-Attendance-Dataset-College-Level-/.git/


In [39]:
!git add .


In [34]:
!git commit -m "first commit"

On branch my_new_branch

Initial commit

nothing to commit (create/copy files and use "git add" to track)


In [36]:
!git branch -M main


In [37]:
!git remote add origin https://github.com/Basmala-hub/Student-Attendance-Dataset-College-Level-.git

error: remote origin already exists.


In [38]:
!git push -u origin main

error: src refspec main does not match any
error: failed to push some refs to 'https://github.com/Basmala-hub/Student-Attendance-Dataset-College-Level-'

In [40]:
!git status

On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)


In [41]:
!git add .

In [42]:
!git status

On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)
