# Exam Score Predictor Using Decision Tree Regression

## Overview
This notebook demonstrates a machine learning approach to predict student exam scores based on various factors including study hours, sleep patterns, attendance, and previous academic performance. We'll use a Decision Tree Regressor from scikit-learn to build our predictive model.

## Project Goals
- Load and explore student performance data
- Build a Decision Tree Regression model
- Train the model on historical student data
- Make predictions and compare them with actual scores
- Evaluate model performance

## What is Decision Tree Regression?
Decision Tree Regression is a supervised machine learning algorithm that predicts continuous values (like exam scores) by learning decision rules from the training data. The model creates a tree-like structure where each internal node represents a "test" on a feature (e.g., hours studied > 5), each branch represents the outcome of the test, and each leaf node represents a prediction value.
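
To make the node, branch, and leaf vocabulary concrete, here is a minimal hand-written sketch of what a very small regression tree amounts to. The function name `predict_score`, the thresholds, and the leaf values are all invented for illustration; a real tree learns its own thresholds and leaf values from the training data.

```python
def predict_score(hours_studied: float, previous_scores: float) -> float:
    """Hand-written stand-in for a tiny regression tree (all numbers are made up)."""
    if hours_studied > 5:            # internal node: a test on one feature
        if previous_scores > 80:     # a second test further down this branch
            return 88.0              # leaf node: the predicted value
        return 74.0                  # leaf node
    return 61.0                      # leaf node


# Each call follows exactly one path from the root to a leaf.
print(predict_score(hours_studied=6.5, previous_scores=85))
print(predict_score(hours_studied=3.0, previous_scores=90))
```

A fitted `DecisionTreeRegressor` behaves like a much larger version of this function, with the conditions chosen automatically.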

## Step 1: Import Required Libraries

We begin by importing the necessary Python libraries for our machine learning project:

- **pandas**: A powerful data manipulation library that provides data structures like DataFrames for handling structured data (like CSV files)
- **DecisionTreeRegressor**: The estimator class from scikit-learn's `tree` module that predicts continuous values using a decision tree structure (imported below under the alias `DTS`)

These libraries form the foundation of our predictive modeling workflow.

In [None]:
# Import Modules
import pandas as pd
from sklearn.tree import DecisionTreeRegressor as DTS

## Step 2: Load the Dataset

In this step, we load our student exam score data from a CSV (Comma-Separated Values) file:

### What's happening here:
1. **File Path Definition**: We specify the absolute path to our CSV file containing student data
2. **Data Loading**: Using pandas' `read_csv()` function, we read the CSV file into a DataFrame object
3. **Confirmation Message**: We print a success message to confirm the data has been loaded

### About the Dataset:
The CSV file contains student information including:
- `student_id`: Unique identifier for each student
- `hours_studied`: Number of hours spent studying
- `sleep_hours`: Average hours of sleep per night
- `attendance_percent`: Class attendance percentage
- `previous_scores`: Scores from previous exams
- `exam_score`: The actual exam score (our target variable)

**Note**: If you're running this notebook, you may need to update the file path to match your local directory structure.

In [None]:
# Gather Data from file
# Load CSV File
scoresFilepath = "/home/jha/MLLearning/exam-score-predictor/student_exam_scores.csv"

# Read the file
score_data = pd.read_csv(scoresFilepath)
print("Completed loading data")
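
Because the path is machine-specific, it can help to fail early with a clear message when the file or one of the documented columns is missing. The following is a sketch under that assumption; the helper `load_scores` and the constant `EXPECTED_COLUMNS` are hypothetical names that are not part of the original notebook, and the column list is copied from the dataset description above.

```python
from pathlib import Path

import pandas as pd

# Column names taken from the dataset description in this notebook.
EXPECTED_COLUMNS = [
    "student_id",
    "hours_studied",
    "sleep_hours",
    "attendance_percent",
    "previous_scores",
    "exam_score",
]


def load_scores(csv_path):
    """Read the CSV and raise a clear error if a documented column is missing."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"CSV file not found: {csv_path}")
    data = pd.read_csv(csv_path)
    missing = [col for col in EXPECTED_COLUMNS if col not in data.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return data
```

With a helper like this, `score_data = load_scores(scoresFilepath)` could replace the direct `pd.read_csv` call, and `score_data.head()` or `score_data.describe()` would give a first look at the data.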

## Step 3: Prepare Features and Target Variable

This is a crucial step in machine learning where we separate our data into features (input variables) and target (what we want to predict):

### Target Variable (y):
- **`y = score_data.exam_score`**: We extract the 'exam_score' column as our target variable
- This is what we want our model to learn to predict
- In machine learning terminology, this is often called the "dependent variable" or "label"

### Feature Variables (Xa):
- **`dataPoints`**: A list defining which columns to use as input features
- We've selected four features that logically influence exam performance:
  - `hours_studied`: More study time typically leads to better scores
  - `sleep_hours`: Adequate sleep affects cognitive function and test performance
  - `attendance_percent`: Regular attendance correlates with better understanding
  - `previous_scores`: Past performance is often a good predictor of future performance
- **`Xa = score_data[dataPoints]`**: We create a new DataFrame containing only these feature columns

### Why this matters:
The quality and relevance of features you choose significantly impact model performance. These features were selected because they have a logical relationship with exam scores and provide diverse information about student behavior and preparation.

In [None]:
# Initialize Variables
y = score_data.exam_score
dataPoints = ["hours_studied", "sleep_hours", "attendance_percent", "previous_scores"]
Xa = score_data[dataPoints]

## Step 4: Initialize the Decision Tree Model

Here we create our machine learning model by instantiating a Decision Tree Regressor:

### Model Configuration:
- **`DTS(random_state=1)`**: Creates a new Decision Tree Regressor object
- **`random_state=1`**: This is a critical parameter that ensures reproducibility

### What is random_state?
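Decision trees involve some randomness in how they're built: scikit-learn randomly permutes the features at each split, so when several candidate splits are equally good, the one that gets chosen can differ between runs. By setting `random_state=1`, we: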
- Ensure the model produces the same results every time we run the code
- Make our work reproducible for ourselves and others
- Enable fair comparisons when testing different approaches

Think of `random_state` as a seed value - like a starting point for random number generation. Using the same seed always produces the same sequence of "random" decisions.

### At this stage:
The model is just an empty template - it hasn't learned anything yet. It's like having a blank notebook ready to be filled with knowledge. The actual learning happens in the next step during training.

In [None]:
# Setup Modeling
score_model = DTS(random_state=1)
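
A quick way to see what `random_state` buys us is to fit two models with the same seed and check that they agree. This is only a sketch on synthetic data: the arrays `X_demo`, `y_demo`, and `X_new` are randomly generated stand-ins, not the student dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the real student file).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 4))
y_demo = 40 + 5 * X_demo[:, 0] + rng.normal(0, 3, size=50)
X_new = rng.uniform(0, 10, size=(20, 4))  # unseen rows to predict on

# Two separate models built with the same seed on the same data.
model_a = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)
model_b = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)

same_predictions = np.array_equal(model_a.predict(X_new), model_b.predict(X_new))
print("Identical predictions with the same seed:", same_predictions)
```

The specific value `1` has no special meaning; any fixed integer gives the same reproducibility guarantee.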

## Step 5: Train the Model

This is where the actual machine learning happens! We train our model using the `.fit()` method:

### What happens during training:
- **`score_model.fit(Xa, y)`**: This method trains the model by showing it examples
- The model analyzes the relationship between features (Xa) and outcomes (y)
- It learns patterns like: "Students who study 8+ hours and attend 90%+ of classes tend to score higher"

### The Learning Process:
1. **Decision Tree Building**: The algorithm creates a tree structure by:
   - Finding the most informative feature to split on first
   - Recursively splitting the data based on feature values
   - Creating leaf nodes that contain predicted values

2. **Pattern Recognition**: The tree learns rules like:
   - "If hours_studied > 7.5 AND previous_scores > 80, predict high score"
   - "If attendance_percent < 60, predict lower score"

3. **Internal Optimization**: The algorithm determines:
   - Which features are most important
   - What threshold values to use for splitting
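   - When to stop splitting (with the default settings used here, the tree keeps growing until every leaf is pure or holds fewer than `min_samples_split` samples)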

### After Training:
Once training is complete, the model has "learned" from the data and is ready to make predictions on new examples. We confirm successful training with a message.

In [None]:
# Fit Model
score_model.fit(Xa, y)
print("\nModel trained successfully")
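
The learned rules do not have to stay hidden. scikit-learn's `export_text` prints a fitted tree as indented if/else rules, and `get_depth()` and `get_n_leaves()` report its size. The sketch below uses synthetic data with the same column names as the notebook; the data-generating formula is an assumption made purely so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

feature_names = ["hours_studied", "sleep_hours", "attendance_percent", "previous_scores"]

# Synthetic stand-in data (NOT the real student file).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "hours_studied": rng.uniform(1, 12, 60),
    "sleep_hours": rng.uniform(4, 9, 60),
    "attendance_percent": rng.uniform(50, 100, 60),
    "previous_scores": rng.uniform(40, 95, 60),
})
y_demo = (
    20
    + 3 * X_demo["hours_studied"]
    + 0.4 * X_demo["previous_scores"]
    + rng.normal(0, 2, 60)
)

# Default settings: the tree grows until it can no longer split.
full_tree = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)
print("Depth:", full_tree.get_depth(), "Leaves:", full_tree.get_n_leaves())

# A depth-limited tree is small enough to read in full.
small_tree = DecisionTreeRegressor(max_depth=2, random_state=1).fit(X_demo, y_demo)
rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```

Running the same two calls on `score_model` would show the rules learned from the actual student data.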

## Step 6: Make Predictions

Now that our model is trained, we can use it to make predictions:

### What's happening:
- **`score_model.predict(Xa)`**: We pass our feature data through the trained model
- The model uses the decision tree it built during training to predict exam scores
- **`predictions`**: This variable stores all the predicted scores as a NumPy array

### How prediction works:
For each student, the model:
1. Takes their feature values (hours studied, sleep hours, etc.)
2. Traverses the decision tree based on these values
3. Follows the branches based on conditions learned during training
4. Reaches a leaf node that contains the predicted exam score

### Important Note:
In this example, we're predicting on the same data we trained on (Xa). This is useful for:
- Demonstrating how the model works
- Checking if the model learned the training data patterns
- Initial validation of model functionality

**However**, in real-world applications, you should:
- Split data into training and testing sets
- Evaluate on unseen test data to assess true model performance
- Check for overfitting, which evaluating on the training data hides rather than reveals

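The approach used here is called "in-sample prediction" and gives an optimistic view of model performance: an unconstrained decision tree can memorize its training data, so its in-sample predictions will match the actual scores almost exactly.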

In [None]:
# Generate predictions for the same rows the model was trained on
predictions = score_model.predict(Xa)
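
The gap between in-sample and out-of-sample performance is easy to demonstrate. The following sketch holds out 20% of a synthetic dataset with `train_test_split` and compares the mean absolute error on both parts. The data is randomly generated for illustration, so the exact numbers say nothing about the student dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with noise (NOT the real student file).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 4))
y_demo = 40 + 4 * X_demo[:, 0] + 1.5 * X_demo[:, 3] + rng.normal(0, 5, size=200)

# Hold out 20% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

model = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)

train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"In-sample MAE:     {train_mae:.2f}")
print(f"Out-of-sample MAE: {test_mae:.2f}")
```

The in-sample error comes out at essentially zero because the unconstrained tree memorizes the training rows, while the held-out error reflects how the model handles students it has not seen.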

## Step 7: Display Results - Compare Predictions vs Actual Scores

In this final step, we create a formatted table to compare our model's predictions against actual exam scores:

### Output Formatting:
- **Header Creation**: We use string formatting to create a professional-looking table
  - `"="*60`: Creates a line of 60 equal signs for visual separation
  - `f"{'Student ID':<15}"`: Left-aligns text in a 15-character wide column
  - The `<` symbol means left-align, and the number specifies column width

### The Loop Iteration:
```python
for i in range(len(predictions)):
```
- Loops through each student (from index 0 to the last prediction)
- `len(predictions)` gives us the total number of students

### Data Extraction:
For each iteration, we extract:
1. **`student_id`**: Retrieved from the DataFrame using `.iloc[i]`, which selects by integer position (0 for the first row) rather than by index label
2. **`predicted`**: The model's prediction from our predictions array
3. **`actual`**: The true exam score from our target variable (y)

### Display Format:
- **`.2f`**: Formats numbers to 2 decimal places for readability
- Each row shows: Student ID | Predicted Score | Actual Score
- This allows easy visual comparison of model accuracy

### What to look for in the results:
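- **Close matches**: Predicted ≈ Actual means the model has fit these rows well. Because we are predicting on the training data, close matches are expected and do not by themselves show that the model will generalize
- **Large differences**: May reveal students with unusual patterns
- **Consistent patterns**: If predictions are always too high or too low, the model may be biased
- **Overall accuracy**: The average absolute difference between predicted and actual scores (the mean absolute error)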

This comparison helps us understand:
- How well our model performs
- Whether it's making reasonable predictions
- Which students' scores are harder to predict (potential outliers)

In [None]:
# Print each student's ID, predicted score, and actual score
print("\n" + "="*60)
print(f"{'Student ID':<15} {'Predicted Score':<20} {'Actual Score':<15}")
print("="*60)

for i in range(len(predictions)):
    # Select the column first so student_id keeps its original dtype
    student_id = score_data["student_id"].iloc[i]
    predicted = predictions[i]
    actual = y.iloc[i]
    print(f"{student_id:<15} {predicted:<20.2f} {actual:<15.2f}")
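
The same comparison can also be held in a DataFrame, which makes it easy to sort by error and to compute summary metrics. The sketch below uses a handful of hypothetical IDs and scores in place of `score_data`, `y`, and `predictions`; the values are invented and the column names `actual`, `predicted`, and `abs_error` are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical values standing in for the student IDs, y, and predictions.
comparison = pd.DataFrame({
    "student_id": ["S001", "S002", "S003", "S004"],
    "actual": [70.0, 82.0, 55.0, 91.0],
    "predicted": [68.0, 85.0, 60.0, 91.0],
})
comparison["abs_error"] = (comparison["predicted"] - comparison["actual"]).abs()

# Summary metrics over all rows.
mae = mean_absolute_error(comparison["actual"], comparison["predicted"])
rmse = np.sqrt(mean_squared_error(comparison["actual"], comparison["predicted"]))
r2 = r2_score(comparison["actual"], comparison["predicted"])

# Largest errors first, so hard-to-predict students appear at the top.
print(comparison.sort_values("abs_error", ascending=False).to_string(index=False))
print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  R^2: {r2:.3f}")
```

In the notebook itself, the three columns would come from `score_data["student_id"]`, `y`, and `predictions`.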

## Summary and Next Steps

### What we accomplished:
1. ✅ Loaded student exam data from a CSV file
2. ✅ Prepared features and target variables for machine learning
3. ✅ Built and trained a Decision Tree Regression model
4. ✅ Generated predictions for student exam scores
5. ✅ Compared predictions against actual scores

### Model Limitations:
- **In-sample prediction**: We tested on the same data we trained on, which doesn't reflect real-world performance
- **No validation metrics**: We didn't calculate formal metrics like RMSE, MAE, or R²
- **No train-test split**: A proper evaluation would use separate data for testing
- **No hyperparameter tuning**: Default Decision Tree settings may not be optimal

### Potential Improvements:
1. **Train-Test Split**: Divide data into 80% training, 20% testing for unbiased evaluation
2. **Cross-Validation**: Use k-fold cross-validation for robust performance estimates
3. **Feature Engineering**: Create new features (e.g., study_to_sleep_ratio)
4. **Model Evaluation Metrics**:
   - Mean Absolute Error (MAE): Average prediction error
   - Root Mean Squared Error (RMSE): Penalizes larger errors more
   - R² Score: Proportion of variance explained
5. **Hyperparameter Tuning**: Optimize tree depth, min_samples_split, etc.
6. **Model Comparison**: Try other algorithms (Random Forest, Linear Regression, XGBoost)
7. **Feature Importance**: Analyze which features matter most
8. **Visualization**: Plot actual vs predicted scores, feature distributions, tree structure
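
As a starting point for several of these improvements, the sketch below combines 5-fold cross-validation, a small comparison of `max_depth` values, and a look at `feature_importances_`. It runs on synthetic data that reuses the notebook's column names; the data-generating formula and the candidate depths `(2, 4, None)` are assumptions chosen for illustration, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

feature_names = ["hours_studied", "sleep_hours", "attendance_percent", "previous_scores"]

# Synthetic stand-in data (NOT the real student file).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "hours_studied": rng.uniform(1, 12, 200),
    "sleep_hours": rng.uniform(4, 9, 200),
    "attendance_percent": rng.uniform(50, 100, 200),
    "previous_scores": rng.uniform(40, 95, 200),
})
y_demo = (
    20
    + 3 * X_demo["hours_studied"]
    + 0.4 * X_demo["previous_scores"]
    + rng.normal(0, 4, 200)
)

# Cross-validated MAE for a few candidate tree depths (None = unlimited).
cv_mae = {}
for depth in (2, 4, None):
    model = DecisionTreeRegressor(max_depth=depth, random_state=1)
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[depth] = -scores.mean()  # scikit-learn returns negated errors
    print(f"max_depth={depth}: cross-validated MAE = {cv_mae[depth]:.2f}")

# Feature importances from one fitted tree; they sum to 1.
final_model = DecisionTreeRegressor(max_depth=4, random_state=1).fit(X_demo, y_demo)
importances = pd.Series(
    final_model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Swapping `X_demo` and `y_demo` for `Xa` and `y` would apply the same checks to the student data.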

### Key Takeaways:
- Decision Trees are interpretable models that learn non-linear relationships
- Feature selection significantly impacts prediction quality
- Model evaluation should always be done on unseen data
- Real-world applications require more rigorous validation and testing

This notebook provides a foundation for understanding machine learning workflows. The methodology demonstrated here can be extended and refined for more sophisticated predictive modeling projects.