# 📘 Student Performance Tracker - EdTech Analytics Project

## 📘 Introduction

In modern education systems, the analysis and prediction of student academic performance have become crucial tools for improving teaching effectiveness and learning outcomes. With the growing availability of student test data, educational institutions can now leverage data analytics to gain actionable insights into learner behavior, subject strengths, participation trends, and exam readiness. These insights not only help teachers personalize their strategies but also enable administrators to allocate academic resources more effectively.

This project uses real student test data from a JEE coaching institute to assess test-wise academic performance, behavioral engagement, and predictive score modeling. It helps identify at-risk students, high performers, and students needing targeted support, thereby making performance monitoring a more scientific and data-backed process.

## 🔍 Problem Statement

The objective of this project is to analyze how student performance varies based on attributes such as gender, subject-wise accuracy, exam type, and batch. The ultimate institutional goal is to help more students gain admission to Tier-1 colleges like IIT, NIT, and IIIT. This requires tracking student improvement over time, identifying weak and strong subject areas, recognizing exam non-participation patterns, and predicting future exam scores.

By using regression modeling (Linear, Lasso, Ridge, and XGBoost), this project aims to forecast students' upcoming JEE scores and segment them based on risk levels. These predictive insights allow educators to intervene early, personalize learning, and drive better student outcomes through data-driven decisions.
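
To make the forecasting and segmentation idea concrete, here is a minimal sketch rather than the project's actual models. It fits a closed-form ridge regression in NumPy (with `alpha=0` it reduces to plain linear regression) and then maps a predicted score to a risk label. The function names, the toy features (two earlier test totals), and the cutoff values are all assumptions made for illustration.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; alpha=0 gives ordinary least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first weight acts as the intercept
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    # L2 penalty on the feature weights only, not on the intercept
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def predict(weights, X):
    """Apply the fitted intercept and feature weights to new rows."""
    X = np.asarray(X, dtype=float)
    return weights[0] + X @ weights[1:]

def risk_band(score, low_cutoff, high_cutoff):
    """Map a predicted score to a label; the cutoffs are hypothetical."""
    if score < low_cutoff:
        return "at-risk"
    if score < high_cutoff:
        return "needs-support"
    return "high-performer"

# Toy data: two earlier test totals per student (assumed features)
X_train = np.array([[100, 120], [150, 140], [200, 210], [90, 160]])
# Toy target generated from a known linear rule so the fit can be checked
y_train = 10 + 0.5 * X_train[:, 0] + 0.4 * X_train[:, 1]

weights = fit_ridge(X_train, y_train, alpha=0.0)
predicted = predict(weights, [[120, 130]])
band = risk_band(predicted[0], low_cutoff=100, high_cutoff=180)
print(weights, predicted, band)
```

Passing the cutoffs as arguments keeps the segmentation rule separate from the model, so the same labels can be reused with Lasso or XGBoost predictions.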

## 💡 Why This Project?

Data-driven performance analytics is revolutionizing how JEE aspirants prepare for competitive exams. By helping students understand their academic strengths and weaknesses, such systems empower smarter study patterns and strategic preparation. Rather than relying solely on broad assessments, this project provides a focused and personalized view into each student’s journey — helping academic mentors optimize their efforts and ultimately enhance success rates across the institute.

## 📊 Dataset Overview

- Source: 50+ Excel files (2023–2024) containing student test performance across various exam types.
- Additional dataset: student details (roll no, name, current batch).
- Final consolidated dataset: 13,894 rows × 19 columns after cleaning.

### 📦 Data Processing Steps:
- Created `clean_excel_file()` to dynamically pick the right sheet, locate the header row, and attach metadata such as exam type and exam date (both inferred from the filename).
- Removed unmatched roll numbers, fixed nulls in subject scores.
- Final dataset uploaded to MSSQL in two tables:
  - `student`: roll no, name, batch
  - `student_performance`: all exam-related metrics
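
The roll-number matching and null-fixing steps above could look roughly like the sketch below. It is only an illustration on toy data: the helper names are hypothetical, and filling missing subject scores with `0` is an assumption, since the fill strategy actually used is not stated here.

```python
import pandas as pd

def keep_matched_rolls(performance, students, key="rollno"):
    """Keep only performance rows whose roll number exists in the student table."""
    matched = performance[performance[key].isin(students[key])]
    return matched.reset_index(drop=True)

def fill_subject_nulls(df, subject_cols, fill_value=0):
    """Replace missing subject scores; fill_value=0 is an assumed default."""
    out = df.copy()
    out[subject_cols] = out[subject_cols].fillna(fill_value)
    return out

# Toy stand-ins for the two tables
students = pd.DataFrame({"rollno": [1, 2], "student_name": ["A", "B"]})
performance = pd.DataFrame({
    "rollno": [1, 2, 3],            # roll number 3 has no student record
    "math_tot": [50.0, None, 70.0],
    "phy_tot": [None, 30.0, 20.0],
})

matched = keep_matched_rolls(performance, students)
cleaned = fill_subject_nulls(matched, ["math_tot", "phy_tot"])
print(cleaned)
```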

In [None]:
import pandas as pd
import re
from datetime import datetime
import os
import shutil

# --- Helper Functions ---


def infer_exam_date(filename: str):
    match = re.search(r"(\d{2}-\d{2}-\d{4})", filename)
    if match:
        return datetime.strptime(match.group(1), "%d-%m-%Y").date()
    return None

def infer_exam_type(filename: str):
    exam_types = ['WTM', 'WTA', 'EAMCET', 'ADV', 'BITSAT', 'PTM']
    for etype in exam_types:
        if etype in filename.upper():
            return etype
    return "UNKNOWN"


# --- Main Cleaning Function ---
def clean_excel_file(filepath: str) -> pd.DataFrame:
    try:
        
        # os.path.basename handles both '/' and '\\' separators, unlike split('/')
        filename = os.path.basename(filepath)
        print("Processing file:", filename)
        print("File path:", filepath)
        # Open the workbook to inspect its sheets
        xls = pd.ExcelFile(filepath)

        # List of sheet names
        sheet_names = xls.sheet_names
        # print("Sheets:", sheet_names)

        # Total number of sheets
        sheet_count = len(sheet_names)
        print("Total sheets:", sheet_count)
        # Check for an empty workbook before indexing into sheet_names
        if sheet_count == 0:
            raise ValueError("No sheets found in the Excel file")
        # If exactly two sheets are present, use the second; otherwise use the first
        if sheet_count == 2:
            sheet = sheet_names[1]
        else:
            sheet = sheet_names[0]

        df_raw = pd.read_excel(filepath, header=None,sheet_name=sheet)
        # Define list of required headers
        expected_keywords = ['ADM', 'SEC', 'TOT', 'M_M', 'P_R','S NO','MAT_M', 'PHYS_M', 'CHEM_M','roll_no','Rank']

        # Find the first row in which at least two cells contain an expected keyword
        header_row_index = None
        for i, row in df_raw.iterrows():
            matches = sum(any(kw in str(cell) for kw in expected_keywords) for cell in row)
            if matches >= 2:  # Threshold: 2 or more keyword matches
                header_row_index = i
                break
        if header_row_index is not None:
            df = pd.read_excel(filepath, sheet_name=sheet, header=header_row_index)
        else:
            raise ValueError("No valid header row found")
        # print(f"Header row index: {header_row_index}")

        #trim extra spaces from col names
        df.columns = df.columns.str.strip()
        
        df = df.rename(columns={'ADM NO': 'rollno',
                 'STUDENT NAME': 'student_name',
                 'M_C': 'math_correct', 'M_W': 'math_wrong', 'M_M\n100': 'math_tot','MAT_M\n80' : 'math_tot',
                 'M_M\n80':'math_tot','P_M\n40' : 'phy_tot','C_M\n40' : 'chem_tot',
                 'PHY_M\n40': 'phy_tot','CHE_M\n40' : 'chem_tot','Tot_M\n160': 'total_marks','TOT\n160': 'total_marks',
                 'P_C': 'phys_correct', 'P_W': 'phys_wrong', 'P_M\n100': 'phy_tot',
                 'C_C': 'chem_correct', 'C_W': 'chem_wrong', 'C_M\n100': 'chem_tot',
                 'Maths_CorrectMarks': 'math_correct',
                 'Physics_CorrectMarks': 'phys_correct',
                 'Chemistry_CorrectMarks': 'chem_correct',
                 'Maths_WrongMarks': 'math_wrong',
                 'Physics_WrongMarks': 'phys_wrong',
                 'Chemistry_WrongMarks': 'chem_wrong',
                 'Maths_Marks': 'math_tot',
                 'Physics_Marks': 'phy_tot',
                 'Chemistry_Marks': 'chem_tot',
                 'Total_Marks': 'total_marks',
                 'TOT_M\n300': 'total_marks',
                 'Rank': 'rank','TOT_R':'rank',
                 'SEC': 'batch_at_exam'})

        
        standard_columns = [
            'rollno', 'student_name', 'math_correct', 'phys_correct', 'chem_correct',
            'math_wrong', 'phys_wrong', 'chem_wrong',
            'math_tot', 'phy_tot', 'chem_tot',
            'total_marks', 'rank', 'batch_at_exam'
        ]
        for col in standard_columns:
            if col not in df.columns:
                df[col] = None
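        # Copy the selection so the metadata columns added below do not trigger SettingWithCopyWarning
        df = df[standard_columns].copy()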
        


        # Metadata
        df['exam_type'] = infer_exam_type(filename)
        df['exam_date'] = infer_exam_date(filename)
        df['filename'] = filename
        return df

    except Exception as e:
        print(f"[SKIPPED] {filepath}: {str(e)}")
        skipped_dir = os.path.join("data", "raw", "skipped")
        os.makedirs(skipped_dir, exist_ok=True)
        shutil.move(filepath, os.path.join(skipped_dir, os.path.basename(filepath)))
        return pd.DataFrame()
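
The header-row detection inside `clean_excel_file()` is easier to follow on a small in-memory frame. The sketch below repeats the same scan (first row with at least two keyword-bearing cells) on toy data. `find_header_row`, the shortened keyword list, and the sample rows are made up for illustration.

```python
import pandas as pd

# A shortened keyword list, for illustration only
EXPECTED_KEYWORDS = ["ADM", "SEC", "TOT", "Rank"]

def find_header_row(df_raw, keywords, min_matches=2):
    """Return the index of the first row with at least min_matches keyword cells."""
    for i, row in df_raw.iterrows():
        matches = sum(any(kw in str(cell) for kw in keywords) for cell in row)
        if matches >= min_matches:
            return i
    return None

# Toy sheet: a title row and a blank row sit above the real header
raw = pd.DataFrame([
    ["Weekly Test Report", None, None],
    [None, None, None],
    ["ADM NO", "STUDENT NAME", "TOT\n160"],
    [1001, "A", 120],
])

header_idx = find_header_row(raw, EXPECTED_KEYWORDS)

# Promote the detected row to column names and keep only the data rows
table = raw.iloc[header_idx + 1:].copy()
table.columns = [str(c).strip() for c in raw.iloc[header_idx]]
print(header_idx, list(table.columns))
```

The match is a case-sensitive substring test, so a title such as "Weekly Test Report" does not count as a hit for `TOT`.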


After running the cleaning function over all files, I noticed that some files were skipped. I debugged the code, reloaded those files, and finally saved the combined result into the `student_data` DataFrame.
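
One way to run a cleaner over many files and track which ones were skipped is sketched below. It is an assumed workflow, not the exact loop used here. `combine_cleaned_files` is a hypothetical helper, and its `cleaner` argument stands in for `clean_excel_file`, which returns an empty DataFrame for files it could not parse.

```python
import pandas as pd

def combine_cleaned_files(filepaths, cleaner):
    """Apply cleaner to each path; return the stacked frame and the skipped paths."""
    frames, skipped = [], []
    for path in filepaths:
        df = cleaner(path)
        if df.empty:
            # An empty result signals that the cleaner could not parse this file
            skipped.append(path)
        else:
            frames.append(df)
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, skipped

# Stand-in cleaner so the sketch runs without any Excel files
def fake_cleaner(path):
    if "bad" in path:
        return pd.DataFrame()
    return pd.DataFrame({"rollno": [1, 2], "filename": [path, path]})

demo_data, demo_skipped = combine_cleaned_files(
    ["wtm_01.xlsx", "bad_02.xlsx", "adv_03.xlsx"], fake_cleaner
)
print(demo_data.shape, demo_skipped)
```

Returning the skipped paths makes it straightforward to fix the parser and rerun only the files that failed.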

In [None]:
print(student_data.shape)

(11567, 19)