# Exam Score Predictor Using Decision Tree Regression

This notebook uses a Decision Tree Regressor to predict student exam scores based on study hours, sleep patterns, attendance, and previous performance.

**Decision Tree Regression** learns decision rules from training data to predict continuous values. It creates a tree structure where each node tests a feature, branches represent outcomes, and leaf nodes contain predictions.

## Step 1: Import Required Libraries

- **pandas**: Data manipulation library for handling CSV files and DataFrames
- **DecisionTreeRegressor**: ML algorithm from scikit-learn that predicts continuous values
- **matplotlib.pyplot**: Visualization library for creating plots and charts

In [None]:
# Import Modules
import pandas as pd
from sklearn.tree import DecisionTreeRegressor as DTS
import matplotlib.pyplot as plt

## Step 2: Load the Dataset

Load student exam data from CSV using pandas' `read_csv()` function.

**Dataset columns:**
- `student_id`: Unique identifier
- `hours_studied`: Study time
- `sleep_hours`: Average sleep per night
- `attendance_percent`: Class attendance
- `previous_scores`: Previous exam scores
- `exam_score`: Target variable (what we predict)

In [None]:
# Path to the CSV file (adjust this to where your copy lives)
scoresFilepath = "/home/jha/MLLearning/exam-score-predictor/student_exam_scores.csv"

# Read the file
score_data = pd.read_csv(scoresFilepath)
print("Completed loading data")

## Step 3: Prepare Features and Target Variable

Separate data into features (inputs) and target (output):

- **Target (y)**: `exam_score` - what we want to predict
- **Features (Xa)**: Four input variables that influence exam performance:
  - `hours_studied`
  - `sleep_hours`
  - `attendance_percent`
  - `previous_scores`

In [None]:
# Initialize Variables
y = score_data.exam_score
dataPoints = ["hours_studied", "sleep_hours", "attendance_percent", "previous_scores"]
Xa = score_data[dataPoints]

## Step 4: Initialize the Decision Tree Model

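Create a Decision Tree Regressor with `random_state=1` for reproducibility. scikit-learn shuffles the candidate features at each split, so when two splits are equally good the chosen one can vary between runs; fixing the seed makes the fitted tree identical every time we run the code.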

The model is currently an empty template that hasn't learned anything yet.

In [None]:
# Setup Modeling
score_model = DTS(random_state=1)

## Step 5: Train the Model

Train the model using `.fit(Xa, y)` to learn patterns from the data.

The algorithm:
1. Picks, at each node, the feature and threshold whose split most reduces squared error (scikit-learn's default criterion)
2. Recursively splits the data on those feature values
3. Creates leaf nodes whose prediction is the mean `exam_score` of the training rows that land there
4. Ends up with rules like "If hours_studied > 7.5 AND previous_scores > 80, predict high score"
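
To make the idea of learned rules concrete, the sketch below fits a shallow tree and prints its rules with scikit-learn's `export_text`. It does not use the notebook's CSV: the data is synthetic, the relationship between features and score is invented for illustration, and names such as `X_demo`, `y_demo`, and `demo_tree` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data (NOT the notebook's CSV): two features, 40 rows.
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=40)
previous = rng.uniform(40, 100, size=40)
X_demo = np.column_stack([hours, previous])

# Hypothetical relationship, chosen only so the tree has something to learn.
y_demo = 30 + 4 * hours + 0.3 * previous

# A shallow tree keeps the printed rule set short enough to read.
demo_tree = DecisionTreeRegressor(max_depth=2, random_state=1)
demo_tree.fit(X_demo, y_demo)

# export_text renders each split as "feature <= threshold" and each leaf as a value.
rules = export_text(demo_tree, feature_names=["hours_studied", "previous_scores"])
print(rules)
```

Each indented line in the output is one test on a feature, and each `value` line is the prediction stored in a leaf. The same call should work on `score_model` once it is fitted, although an unrestricted tree will print far more rules.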

In [None]:
# Fit Model
score_model.fit(Xa, y)
print("\nModel trained successfully")

## Step 6: Make Predictions

Use the trained model to predict exam scores with `.predict(Xa)`.

For each student, the model:
1. Takes their feature values
2. Traverses the decision tree
3. Reaches a leaf node with the predicted score

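**Note**: These are in-sample predictions: the model is scoring the same rows it was trained on. Because the tree has no depth limit, it can memorize the training data, so the predictions will match the actual scores almost exactly. To estimate performance on unseen students, split the data into training and testing sets and evaluate on the test set.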
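
A minimal sketch of such a split, using `train_test_split` from scikit-learn. The data here is synthetic (four random feature columns and an invented linear target with noise), so the printed numbers say nothing about the real dataset; `X_demo`, `y_demo`, and the 25% test size are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the four feature columns (NOT the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 4))

# Hypothetical target: a linear signal plus noise.
y_demo = X_demo @ np.array([3.0, 1.0, 0.5, 2.0]) + rng.normal(0, 2.0, size=200)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)

# .score() returns R^2; compare the seen rows with the held-out rows.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"R^2 on training rows: {train_r2:.3f}")
print(f"R^2 on held-out rows: {test_r2:.3f}")
```

The training score is perfect because the unrestricted tree memorizes every training row, while the held-out score is lower. That gap is what in-sample prediction hides.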

In [None]:
# Predict exam scores for the same rows used in training
predictions = score_model.predict(Xa)

## Step 7: Display Results and Visualize

Compare predicted vs actual scores in a formatted table and plot the results.

- Loop through each student
- Extract student ID, predicted score, and actual score
- Format output with 2 decimal places (`.2f`)
- Plot predicted scores (circles) and actual scores (x markers) for visualization
- Look for close matches and large differences; because these are in-sample predictions, close matches are expected and do not show how the model performs on new students

In [None]:
# Print each student's ID, predicted score, and actual score
print("\n" + "="*60)
print(f"{'Student ID':<15} {'Predicted Score':<20} {'Actual Score':<15}")
print("="*60)

student_ids = score_data["student_id"]
for student_id, predicted, actual in zip(student_ids, predictions, y):
    print(f"{student_id:<15} {predicted:<20.2f} {actual:<15.2f}")

# Plot each series once (not once per student) so colors stay consistent
# and the legend has exactly one entry per series
plt.plot(student_ids, predictions, marker='o', linestyle='none', label="Predicted")
plt.plot(student_ids, y, marker='x', linestyle='none', label="Actual")
plt.xlabel("Student ID")
plt.ylabel("Exam score")
plt.legend()
plt.show()

## Summary

### Completed Steps:
1. ✅ Loaded student data from CSV
2. ✅ Prepared features and target variables
3. ✅ Built and trained Decision Tree model
4. ✅ Generated predictions
5. ✅ Compared predictions vs actual scores

### Limitations:
- In-sample prediction only: the model is evaluated on its own training data, with no train-test split
- No validation metrics (RMSE, MAE, R²)
- No hyperparameter tuning
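
For reference, the three metrics named above can be computed with `sklearn.metrics`. The sketch below uses four made-up score pairs purely to show the calls; `actual` and `predicted` are hypothetical arrays, not values from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up scores for illustration only.
actual = np.array([70.0, 80.0, 90.0, 60.0])
predicted = np.array([72.0, 78.0, 85.0, 65.0])

# MAE: average absolute error, in score points.
mae = mean_absolute_error(actual, predicted)

# RMSE: square root of the mean squared error; penalizes large misses more.
rmse = np.sqrt(mean_squared_error(actual, predicted))

# R^2: share of the variance in the actual scores explained by the predictions.
r2 = r2_score(actual, predicted)

print(f"MAE:  {mae:.2f}")   # MAE:  3.50
print(f"RMSE: {rmse:.2f}")  # RMSE: 3.81
print(f"R^2:  {r2:.3f}")    # R^2:  0.884
```

These numbers are only meaningful when `predicted` comes from rows the model did not see during training.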

### Potential Improvements:
- Split data into training/testing sets
- Use cross-validation
- Calculate evaluation metrics
- Try other algorithms (Random Forest, XGBoost)
- Analyze feature importance
- Add further visualizations, such as a predicted-vs-actual scatter plot or a plot of the fitted tree
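
Two of these improvements, cross-validation and feature importance, can be sketched together as follows. The data is again synthetic, with the notebook's column names attached to random columns only for readability. The candidate `max_depth` values and the final choice of depth 4 are arbitrary assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Column names borrowed from the notebook; the values below are synthetic.
feature_names = ["hours_studied", "sleep_hours", "attendance_percent", "previous_scores"]

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 4))
# Hypothetical target: a linear signal plus noise.
y_demo = X_demo @ np.array([3.0, 1.0, 0.5, 2.0]) + rng.normal(0, 2.0, size=200)

# Compare a few tree depths with 5-fold cross-validation (None = unlimited depth).
for depth in (2, 4, None):
    model = DecisionTreeRegressor(max_depth=depth, random_state=1)
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    print(f"max_depth={depth}: mean R^2 = {scores.mean():.3f}")

# Fit one tree on all rows and read its impurity-based feature importances.
final_model = DecisionTreeRegressor(max_depth=4, random_state=1).fit(X_demo, y_demo)
importances = dict(zip(feature_names, final_model.feature_importances_))
for name, value in importances.items():
    print(f"{name}: {value:.3f}")
```

The importances sum to 1 and describe how much each feature contributed to reducing squared error in this particular tree. They can shift noticeably between fits, so treat them as a rough guide rather than a ranking of what drives exam scores.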