# **Experiment Notebook**



---
## 0. Setup Environment

### 0.b Disable Warnings Messages

In [1]:
# Do not modify this code
import warnings
warnings.simplefilter(action='ignore')

### 0.d Import Packages

In [None]:

import pandas as pd
import altair as alt
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np

---
## A. Project Description


In [3]:
def print_tile(size="h3", key=None, value=None):
    """
    Display a formatted HTML tile in a Jupyter notebook.
    Args:
        size (str): HTML heading size, e.g., "h1", "h2", "h3".
        key (str): Unique identifier for the tile.
        value (str): Content to display in the tile.
    """
    from IPython.display import display, HTML
    html = f'<{size} id="{key}">{value}</{size}>'
    display(HTML(html))

In [None]:

business_objective = """
The goal of this project is to develop a predictive model that classifies student performance into four categories ("Excellent," "Good," "Average," and "Poor") so that at-risk students can be identified early. The results will help universities allocate resources effectively, tailor interventions, and improve academic outcomes. Accurate predictions can lead to better student support and institutional success, while incorrect predictions risk misallocating resources, harming student experiences, and damaging the institution's credibility.
"""

In [5]:
# Do not modify this code
print_tile(size="h3", key='business_objective', value=business_objective)

---
## B. Experiment Description

In [None]:

experiment_hypothesis =  """
The hypothesis to test in this project is: **"Student performance can be accurately predicted using key academic, behavioral, and socio-economic features such as GPA, study hours, social media usage, and attendance rates."** The question seeks to determine how well these factors correlate with and influence academic success, enabling classification into performance categories like "Excellent," "Good," "Average," and "Poor."

This hypothesis is worthwhile because accurate predictions can allow institutions to identify at-risk students early, provide tailored interventions, and enhance academic outcomes. It also helps optimize resource allocation by focusing efforts on students who need the most support. By understanding the significant predictors of performance, institutions can make data-driven decisions, enhancing both individual student success and overall educational quality. This insight provides a robust foundation for sustainable improvements in academic strategies.
"""

In [7]:
# Do not modify this code
print_tile(size="h3", key='experiment_hypothesis', value=experiment_hypothesis)

In [None]:

experiment_expectations ="""
The expected outcome of the experiment is to create a model that predicts student performance categories—"Excellent," "Good," "Average," "Poor."
### Possible Scenarios:
1. **Best Case**: The model's performance improves, reaching over 80% accuracy with balanced metrics across all classes, enabling effective interventions for students.
2. **Moderate Case**: Metrics remain skewed toward dominant classes ("Poor"), and minority classes like "Excellent" and "Good" have low recall, requiring adjustments in features or algorithms.
3. **Worst Case**: The model fails to generalize, achieving accuracy below 50%, leading to misallocated resources and diminished institutional trust.

These outcomes highlight the need for refining the model to improve predictions for minority classes while leveraging its strength in dominant categories.

"""

In [9]:
# Do not modify this code
print_tile(size="h3", key='experiment_expectations', value=experiment_expectations)

---
## C. Data Understanding

In [10]:
# Do not modify this code
# Load training data
try:
  X_train = pd.read_csv('../data/processed/X_train.csv')
  y_train = pd.read_csv('../data/processed/y_train.csv')

  X_val = pd.read_csv('../data/processed/X_val.csv')
  y_val = pd.read_csv('../data/processed/y_val.csv')

  X_test = pd.read_csv('../data/processed/X_test.csv')
  y_test = pd.read_csv('../data/processed/y_test.csv')
except Exception as e:
  print(e)

---
## D. Feature Selection


In [None]:

train_data = pd.concat([X_train, y_train], axis=1)
correlation_matrix = train_data.corr()

correlation_with_y_train = correlation_matrix['target']

# Print the correlations
print(correlation_with_y_train)

student_id                       -0.416043
age                              -0.046171
hsc_year                          0.188994
current _semester                -0.184882
study_hours                       0.236444
social_media_hours               -0.443580
average_attendance               -0.242111
skills_development_hours         -0.048071
previous_gpa                      0.687628
current_gpa                      -0.542631
completed_credits                -0.254523
house_income                     -0.259291
gpa_consistency                   0.332170
social_media_impact               0.464983
income_academic_score            -0.261522
english_proficiency_encoded      -0.063948
birth_country_AU                 -0.076070
birth_country_BR                  0.007581
birth_country_CA                 -0.033834
birth_country_IE                  0.050013
birth_country_IN                  0.051579
birth_country_NZ                 -0.026911
birth_country_PH                  0.005309
birth_count

In [12]:
sorted_correlation = correlation_with_y_train.drop('target').abs().sort_values(ascending=False)

# Select the top 8 columns
top_8_columns = sorted_correlation.head(8).index
top_8_columns


Index(['previous_gpa', 'current_gpa', 'social_media_impact',
       'social_media_hours', 'student_id', 'gpa_consistency',
       'scholarship_Yes', 'scholarship_No'],
      dtype='object')
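
The ranking above includes `student_id`, which looks like a row identifier rather than a real predictor. As a minimal sketch (assuming `student_id` carries no meaning of its own, and using only the numeric correlations visible in the printed output), identifier columns can be excluded before taking the top-k features by absolute correlation. The names `id_columns` and `top_5` are illustrative and not part of the notebook.

```python
# Correlations copied from the printed output above (a subset of the numeric
# columns; the one-hot columns are left out because the printout is truncated).
correlations = {
    "student_id": -0.416043,
    "age": -0.046171,
    "hsc_year": 0.188994,
    "study_hours": 0.236444,
    "social_media_hours": -0.443580,
    "average_attendance": -0.242111,
    "previous_gpa": 0.687628,
    "current_gpa": -0.542631,
    "gpa_consistency": 0.332170,
    "social_media_impact": 0.464983,
}

# Assumption: student_id is an identifier and should not be used as a feature.
id_columns = {"student_id"}

# Rank the remaining columns by absolute correlation with the target.
ranked = sorted(
    (name for name in correlations if name not in id_columns),
    key=lambda name: abs(correlations[name]),
    reverse=True,
)
top_5 = ranked[:5]
print(top_5)
```

A strong correlation between an identifier and the target usually reflects how the rows were ordered or collected, so it is worth checking before the column is used for modelling.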

In [13]:
# Get the list of columns in the X_train DataFrame
features_list = X_train.columns
features_list


Index(['student_id', 'age', 'hsc_year', 'current _semester', 'study_hours',
       'social_media_hours', 'average_attendance', 'skills_development_hours',
       'previous_gpa', 'current_gpa', 'completed_credits', 'house_income',
       'gpa_consistency', 'social_media_impact', 'income_academic_score',
       'english_proficiency_encoded', 'birth_country_AU', 'birth_country_BR',
       'birth_country_CA', 'birth_country_IE', 'birth_country_IN',
       'birth_country_NZ', 'birth_country_PH', 'birth_country_TH',
       'birth_country_US', 'birth_country_ZA', 'scholarship_No',
       'scholarship_Yes', 'university_transport_No',
       'university_transport_Yes', 'learning_mode_Offline',
       'learning_mode_Online', 'on_probation_No', 'on_probation_Yes',
       'is_suspended_No', 'is_suspended_Yes', 'relationship_Engaged',
       'relationship_In a relationship', 'relationship_Married',
       'relationship_Single'],
      dtype='object')

In [None]:

feature_selection_explanations = """### Feature Selection Rationale
*Note: all numerical values are approximate and may change slightly between runs due to randomness.*
#### **Selected Features**
The features chosen for the model (`study_hours`, `social_media_hours`, `previous_gpa`, `current_gpa`, and `on_probation_No`) were selected for their correlation with the target variable and their predictive relevance. For example:
- **`previous_gpa`**: With a correlation of 0.688, this feature is the strongest predictor of academic performance, reflecting prior achievements.
- **`current_gpa`**: This complements `previous_gpa` with a correlation of -0.543, offering insights into ongoing trends.
- **`social_media_hours`**: Correlated at -0.444, this helps capture behavioral patterns that may negatively impact academic outcomes.
- **`study_hours`**: Correlated at 0.236, this feature provides direct input on time invested in academics.
- **`on_probation_No`**: Adds categorical context related to academic status, which may improve predictive accuracy.

#### **Reasons for Removing Features**
Other features were excluded due to:
1. **Weak Correlations**:
   - Features like `age` (-0.046) and `skills_development_hours` (-0.048) show negligible relationships with the target variable, contributing little to the model's performance.

2. **High Cardinality and Redundancy**:
   - One-hot encoded groups such as the `birth_country_*` and `relationship_*` columns add many sparse dimensions without significant predictive value.

3. **Potential Noise**:
   - Features like `house_income` (-0.259) and `average_attendance` (-0.242) are less directly connected to academic outcomes and could introduce unnecessary noise into the model.
"""

In [15]:
# Do not modify this code
print_tile(size="h3", key='feature_selection_explanations', value=feature_selection_explanations)

---
## E. Data Preparation

### E.1 Data Transformation Robust Scaler


In [17]:
from sklearn.preprocessing import RobustScaler

# Initialize the Robust Scaler
scaler = RobustScaler()

# Apply scaling to X_train
X_train_robust = scaler.fit_transform(X_train)

# Convert back to DataFrame for readability
X_train_robust_df = pd.DataFrame(X_train_robust, columns=X_train.columns)

X_train=X_train_robust_df.copy()

In [18]:

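# Initialize the Robust Scaler
# NOTE: this fits a new scaler on X_test, so the test set is scaled with its own
# median/IQR rather than the training statistics. Reusing the scaler fitted on
# X_train (scaler.transform(X_test)) would keep all splits on the same scale.
scaler = RobustScaler()

# Apply scaling to X_test
X_test_robust = scaler.fit_transform(X_test)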

# Convert back to DataFrame for readability
X_test_robust_df = pd.DataFrame(X_test_robust, columns=X_test.columns)

print("\nRobustly Scaled X_test:\n", X_test_robust_df)
X_test=X_test_robust_df.copy()


Robustly Scaled X_test:
      student_id  age  hsc_year  current _semester  study_hours  \
0      0.183658 -1.0       1.0           0.000000         -0.4   
1      0.011244 -0.5       0.0           0.000000          0.4   
2      0.411544 -1.5       1.0           0.000000         -0.8   
3      0.405547  1.0       0.5           0.000000          0.8   
4      0.459520 -0.5       0.0           0.000000          0.8   
..          ...  ...       ...                ...          ...   
145   -0.348576 -1.5       1.0           4.797005          0.4   
146   -0.609445 -0.5       1.0           4.797005         -0.4   
147   -0.405547 -0.5       0.5           4.797005         -0.4   
148   -0.500000 -1.5       0.5           4.797005         -0.4   
149   -0.021739  1.0      -2.5           5.411807         -0.8   

     social_media_hours  average_attendance  skills_development_hours  \
0             -1.259851            0.227670                  0.000000   
1              2.159172           -

In [19]:


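# Initialize the Robust Scaler
# NOTE: this fits a new scaler on X_val, so the validation set is scaled with its
# own median/IQR rather than the training statistics. Reusing the scaler fitted on
# X_train (scaler.transform(X_val)) would keep all splits on the same scale.
scaler = RobustScaler()

# Apply scaling to X_val
X_val_robust = scaler.fit_transform(X_val)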

# Convert back to DataFrame for readability
X_val_robust_df = pd.DataFrame(X_val_robust, columns=X_val.columns)

print("\nRobustly Scaled X_val:\n", X_val_robust_df)
X_val=X_val_robust_df.copy()


Robustly Scaled X_val:
      student_id       age  hsc_year  current _semester  study_hours  \
0      0.339158 -0.333333       0.0               -1.0          0.0   
1      0.341407 -0.666667       0.5               -1.0          0.5   
2     -0.019383  0.333333       0.0               -1.0         -1.0   
3     -0.097534 -0.666667       0.5               -1.0         -0.5   
4     -0.083480 -0.333333       0.5               -1.0          0.0   
..          ...       ...       ...                ...          ...   
143   -0.924456  0.333333      -0.5                1.0         -1.0   
144   -0.891701  1.000000       0.0                1.0         -0.5   
145   -0.916162  0.666667      -0.5                1.0          0.0   
146    0.489057  0.000000      -1.0                1.0         -0.5   
147    0.537940  0.000000      -0.5                1.0         -0.5   

     social_media_hours  average_attendance  skills_development_hours  \
0              1.646802           -0.588519      

In [None]:

data_transformation_1_explanations = """Data transformation, such as scaling or normalization, is crucial for enhancing the dataset's usability and ensuring robust model performance. For example, scaling with the RobustScaler mitigates the impact of outliers by subtracting each feature's median and dividing by its interquartile range (IQR), so extreme values have little influence on the scaling statistics. This puts skewed features like `study_hours` and `social_media_hours` on comparable scales without bounding them to a fixed range.

### **Importance**
1. **Improves Algorithm Efficiency**: Algorithms such as Logistic Regression rely on features being on a similar scale for optimal performance. Without scaling, features with larger ranges (e.g., `previous_gpa`) dominate learning, leading to imbalanced models.

2. **Prepares for Robustness**: Features like `current_gpa`, impacted by extreme values, can influence decision boundaries adversely. Transformation controls for these issues.

3. **Addresses Variability in Units**: Since different features measure diverse aspects (e.g., `social_media_hours` in hours vs. `previous_gpa` as a score), scaling ensures uniformity across units.

---

### **Impacts**
- **Enhanced Model Performance**: Improves convergence speed and accuracy by making the data more suitable for algorithms.
- **Balanced Feature Contribution**: Ensures no single feature disproportionately impacts the model, yielding fair and balanced predictions.
- **Improved Generalization**: Scaled data allows the model to make accurate predictions on unseen data.


"""

In [21]:
# Do not modify this code
print_tile(size="h3", key='data_transformation_1_explanations', value=data_transformation_1_explanations)
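
To make the scaling step concrete, here is a minimal, library-free sketch of the median/IQR scaling that `RobustScaler` performs with its default settings. The statistics are computed on the training values only and then reused for new data. The helper names (`fit_robust`, `transform_robust`) and the toy values are hypothetical; the notebook itself uses scikit-learn.

```python
import statistics


def fit_robust(values):
    """Return (median, IQR) computed from a training column."""
    # method="inclusive" interpolates linearly between data points,
    # which corresponds to the usual linear percentile definition.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, q3 - q1


def transform_robust(values, median, iqr):
    """Scale values using statistics learned elsewhere (e.g. on the training set)."""
    return [(v - median) / iqr for v in values]


# Toy column with one extreme value (hypothetical data).
train_hours = [1.0, 2.0, 3.0, 4.0, 100.0]
test_hours = [3.0, 5.0]

# Fit on the training split only, then apply the same statistics to both splits.
median, iqr = fit_robust(train_hours)
train_scaled = transform_robust(train_hours, median, iqr)
test_scaled = transform_robust(test_hours, median, iqr)

print(train_scaled)
print(test_scaled)
```

The outlier stays far from the other points after scaling. What is robust is the choice of center and scale, which the outlier barely affects. With scikit-learn, the equivalent pattern is to call `fit` (or `fit_transform`) on the training split and `transform` on the validation and test splits.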

---
## F. Feature Engineering

### F.1 New Feature "feature selection"



In [None]:

from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Base estimator for recursive feature elimination (RFE)
decision_tree = DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=42)
model = decision_tree.fit(X_train, y_train)

# Keep the 15 features ranked highest by the tree's feature importances
rfe = RFE(estimator=model, n_features_to_select=15)

# Perform feature selection
X_selected = rfe.fit_transform(X_train, y_train)

# Convert the selected features back to a DataFrame
# Make sure the number of selected features matches the column names
selected_columns = [col for col, selected in zip(X_train.columns, rfe.support_) if selected]
X_train_selected = pd.DataFrame(X_selected, columns=selected_columns)

# Output the selected features for further processing
print("Selected Features:")
print(selected_columns)
X_train=X_train_selected.copy()

Selected Features:
['study_hours', 'social_media_hours', 'average_attendance', 'previous_gpa', 'current_gpa', 'gpa_consistency', 'birth_country_ZA', 'learning_mode_Offline', 'on_probation_Yes', 'is_suspended_No', 'is_suspended_Yes', 'relationship_Engaged', 'relationship_In a relationship', 'relationship_Married', 'relationship_Single']


In [24]:

X_val_selected = X_val[selected_columns]

# Output the selected features for further processing
print("Selected Features:")
print(selected_columns)
X_val=X_val_selected.copy()

Selected Features:
['study_hours', 'social_media_hours', 'average_attendance', 'previous_gpa', 'current_gpa', 'gpa_consistency', 'birth_country_ZA', 'learning_mode_Offline', 'on_probation_Yes', 'is_suspended_No', 'is_suspended_Yes', 'relationship_Engaged', 'relationship_In a relationship', 'relationship_Married', 'relationship_Single']


In [25]:

X_test_selected = X_test[selected_columns]

# Output the selected features for further processing
print("Selected Features:")
print(selected_columns)
X_test=X_test_selected.copy()

Selected Features:
['study_hours', 'social_media_hours', 'average_attendance', 'previous_gpa', 'current_gpa', 'gpa_consistency', 'birth_country_ZA', 'learning_mode_Offline', 'on_probation_Yes', 'is_suspended_No', 'is_suspended_Yes', 'relationship_Engaged', 'relationship_In a relationship', 'relationship_Married', 'relationship_Single']


In [None]:

feature_engineering_1_explanations = """Creating the feature `social_media_impact` is crucial for understanding the relationship between social media usage and study hours. By quantifying how social media hours affect academic performance, this feature provides actionable insights for identifying students at risk due to poor time management or excessive social media consumption.

### **Why It's Important**
1. **Behavioral Insight**:
   - The feature highlights behavioral patterns that impact academic success, enabling targeted support for students struggling with productivity.

2. **Predictive Strength**:
   - With a correlation of 0.464983 to the target, it strengthens the model's ability to differentiate students based on their balance between social media use and study hours.

### **Impacts**
1. **Student Support**:
   - Accurate predictions guide interventions for students overly engaged in social media to improve their study habits.

2. **Resource Optimization**:
   - Institutions can use this feature to design programs promoting better time management and reduce academic underperformance.

3. **Improved Model Accuracy**:
   - Incorporating this feature refines the model's predictive ability by capturing critical behavioral aspects influencing performance.


"""

In [27]:
# Do not modify this code
print_tile(size="h3", key='feature_engineering_1_explanations', value=feature_engineering_1_explanations)

---
## G. Train Machine Learning Model

### G.1 Import Algorithm


In [28]:

from sklearn.tree import DecisionTreeClassifier

In [None]:

algorithm_selection_explanations = """
### Why DecisionTreeClassifier Is a Good Fit

DecisionTreeClassifier is a strong choice for this project due to its interpretability and versatility. Here's why:

1. **Handles Nonlinear Relationships**:
   - Decision trees excel at capturing complex, nonlinear interactions between features, such as `study_hours`, `previous_gpa`, and `social_media_impact`.

2. **No Need for Preprocessing**:
   - It doesn't require scaling or normalization of features, making it ideal for datasets with diverse units.

3. **Interpretable Structure**:
   - The tree structure makes predictions and feature importance easily interpretable, helping stakeholders understand how decisions are made.

4. **Flexible Multiclass Classification**:
   - Decision trees can classify student performance into multiple categories ("Poor," "Average," "Good," "Excellent") without additional modifications.

5. **Can Be Adapted to Imbalanced Data**:
   - By default a decision tree tends to favor the majority class, but hyperparameters like `class_weight` (for example, `class_weight='balanced'`) can give minority classes more influence during training.

"""

In [30]:
# Do not modify this code
print_tile(size="h3", key='algorithm_selection_explanations', value=algorithm_selection_explanations)
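
The `class_weight` adjustment mentioned above can be illustrated with the "balanced" heuristic, in which each class receives the weight `n_samples / (n_classes * class_count)`. The sketch below applies it to the class counts from the validation classification report in section G.3, purely for illustration. In practice the weights would be derived from the training labels, whose counts are not shown in this notebook.

```python
# Class counts taken from the support column of the validation report.
# Illustrative only: real weights would be computed from the training labels.
class_counts = {0: 88, 1: 47, 2: 11, 3: 2}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "balanced" heuristic: weight = n_samples / (n_classes * count)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

A dictionary like this can be passed as `class_weight` to `DecisionTreeClassifier`. Alternatively, `class_weight='balanced'` lets scikit-learn derive the weights from the labels it is fitted on.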

### G.2 Set Hyperparameters

In [31]:

# Initialize the Decision Tree classifier
decision_tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)




In [None]:

hyperparameters_selection_explanations = """


### Explanation of Tuning for DecisionTreeClassifier
- **`criterion='gini'`**:
   - Gini impurity measures the quality of splits. It ensures each split minimizes the likelihood of incorrect classifications.

- **`max_depth=3`**:
   - This limits the tree to three levels of splits, which keeps the model small and interpretable and reduces the risk of overfitting, at the cost of possibly missing finer patterns.

- **`random_state=42`**:
   - Ensures reproducibility by controlling the randomness of the tree-building process.

### Impacts:
- A shallow `max_depth` constrains how much detail the tree can learn; tuning it (e.g., comparing several depths on the validation set) may improve the balance between underfitting and overfitting.
- Decision Tree’s flexibility with hyperparameters ensures it adapts well to the project's specific needs.


"""

In [33]:
# Do not modify this code
print_tile(size="h3", key='hyperparameters_selection_explanations', value=hyperparameters_selection_explanations)
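
As a small worked example of the `gini` criterion, the Gini impurity of a node is one minus the sum of squared class proportions: 0 for a pure node, and larger when classes are mixed. The function name `gini_impurity` is illustrative. The last call uses the class counts from the validation report only as an example of an imbalanced node.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node given the number of samples per class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# A pure node has zero impurity.
print(gini_impurity([10, 0]))

# A 50/50 split between two classes has impurity 0.5.
print(gini_impurity([5, 5]))

# Imbalanced four-class node (counts from the validation support column).
print(round(gini_impurity([88, 47, 11, 2]), 3))
```

When choosing a split, the tree compares the weighted average impurity of the child nodes with that of the parent and prefers the split with the largest reduction.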

### G.3 Fit Model

In [None]:

# Fit the model to the training data
decision_tree.fit(X_train, y_train)


# Predict the classes for the val data
y_pred = decision_tree.predict(X_val)
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_pred))

print("\nClassification Report:")
print(classification_report(y_val, y_pred))

Confusion Matrix:
[[64 12 10  2]
 [ 0 19 24  4]
 [ 0  0 11  0]
 [ 0  0  2  0]]

Classification Report:
              precision    recall  f1-score   support

         0.0       1.00      0.73      0.84        88
         1.0       0.61      0.40      0.49        47
         2.0       0.23      1.00      0.38        11
         3.0       0.00      0.00      0.00         2

    accuracy                           0.64       148
   macro avg       0.46      0.53      0.43       148
weighted avg       0.81      0.64      0.68       148



### G.4 Model Technical Performance

In [None]:

# Predict the classes for the test data
y_pred = decision_tree.predict(X_test)
# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Confusion Matrix:
[[48 12  8  2]
 [ 1 31 18  4]
 [ 0  3 15  0]
 [ 0  0  8  0]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.98      0.69      0.81        70
         1.0       0.67      0.57      0.62        54
         2.0       0.31      0.83      0.45        18
         3.0       0.00      0.00      0.00         8

    accuracy                           0.63       150
   macro avg       0.49      0.52      0.47       150
weighted avg       0.74      0.63      0.65       150
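
The figures in the classification report can be recomputed directly from the confusion matrix, which helps when quoting per-class recall or precision in the explanations. The sketch below uses the test-set matrix printed above (rows are true classes, columns are predicted classes) and plain Python, with no dependency on the notebook's data.

```python
# Test-set confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = [
    [48, 12, 8, 2],
    [1, 31, 18, 4],
    [0, 3, 15, 0],
    [0, 0, 8, 0],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))
accuracy = correct / total

recall, precision = [], []
for i in range(n_classes):
    true_count = sum(cm[i])                               # row sum: samples of class i
    pred_count = sum(cm[r][i] for r in range(n_classes))  # column sum: predictions of class i
    recall.append(cm[i][i] / true_count if true_count else 0.0)
    precision.append(cm[i][i] / pred_count if pred_count else 0.0)

for i in range(n_classes):
    print(f"class {i}: precision={precision[i]:.2f} recall={recall[i]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

For class `0`, this gives a recall of 48/70 and a precision of 48/49, matching the 0.69 and 0.98 shown in the report.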



In [None]:

model_performance_explanations = """
### Model Performance Explanation

The DecisionTreeClassifier performs best on the majority class (label `0`, assumed here to be "Poor"), with very high precision (1.00 validation, 0.98 test) and moderate recall (73% validation, 69% test). However, it struggles with minority classes like "Excellent" due to imbalanced data. Improvements like addressing class imbalance or tuning tree depth could enhance performance. With overall accuracy of 64% on validation and 63% on test, the model partly meets the primary goal of focusing on at-risk students.

### **Key Observations**
1. **Strengths**:
   - Predictions of the majority class are almost always correct (precision of 0.98 or higher), so students flagged in that category are rarely false alarms.
   - Recall for the majority class (69% on test) means roughly seven in ten of those students are captured.

2. **Weaknesses**:
   - Minority classes are poorly predicted: label `2` has low precision (0.23 validation, 0.31 test) and label `3` has zero recall on both splits.
   - Macro averages (F1 of 0.43 on validation, 0.47 on test) highlight uneven performance across categories.

---
"""

In [37]:

# Do not modify this code
print_tile(size="h3", key='model_performance_explanations', value=model_performance_explanations)

### G.5 Business Impact from Current Model Performance


In [None]:

business_impacts_explanations = """Results Related to Business Objective
The DecisionTreeClassifier partly meets the primary business goal of identifying at-risk students (the majority class, label `0`, assumed here to be "Poor"). Recall for this class is 73% on validation and 69% on test, so roughly seven in ten students needing intervention are correctly classified while about three in ten are missed. Performance for minority classes like "Excellent" remains weak (the smallest class, label `3`, has zero recall on both splits), limiting insights into high-performing students.
"""

In [40]:
# Do not modify this code
print_tile(size="h3", key='business_impacts_explanations', value=business_impacts_explanations)

## H. Experiment Outcomes

In [None]:

experiment_outcome = "Hypothesis Partially Confirmed" # Either 'Hypothesis Confirmed', 'Hypothesis Partially Confirmed' or 'Hypothesis Rejected'

In [42]:
# Do not modify this code
print_tile(size="h2", key='experiment_outcomes_explanations', value=experiment_outcome)

In [None]:

experiment_results_explanations = """### Reflection on Experiment Outcome

#### **Outcome**
The experiment partly achieved the main business objective of identifying students in the majority class (label `0`, assumed here to be "Poor"), with 69% recall on test data and F1-scores of 0.84 (validation) and 0.81 (test). However, predictions for minority classes like "Excellent" were weak, reflecting imbalanced data and potential limitations of the DecisionTreeClassifier.

#### **Insights Gained**
1. **"Poor" Category Focus**:
   - The model is most reliable on the majority class, which directly supports targeted interventions for at-risk students.

2. **Imbalanced Data Challenges**:
   - Minority class predictions were weak, confirming that imbalanced datasets require strategies like class-weight adjustments or oversampling.

3. **Potential Overfitting**:
   - Validation and test accuracy are close (64% vs. 63%), so there is no clear gap between the two splits; comparing against training-set scores would be needed to confirm or rule out overfitting.

---

### **Rationale for Further Experimentation**
Further experimentation is worthwhile to address overfitting and enhance robustness in predictions for the primary category. Exploring techniques to refine predictions for minority classes may add secondary value but is less critical to the main objective.

---

### **Potential Next Steps**
**1. Optimize for Generalization**
   - **Action**: Limit `max_depth` or apply pruning to reduce overfitting.
   - **Expected Uplift**: Enhanced test accuracy and consistency in predictions.
   - **Ranking**: **High Priority**

**2. Address Class Imbalance**
   - **Action**: Use SMOTE or class-weight adjustments.
   - **Expected Uplift**: Improved recall and precision for minority classes.
   - **Ranking**: **Medium Priority**

**3. Experiment with Ensemble Methods**
   - **Action**: Test Random Forest or Boosting algorithms to combine the strengths of multiple trees.
   - **Expected Uplift**: Higher accuracy and balanced predictions across classes.
   - **Ranking**: **Medium Priority**

---

### **Deployment Recommendation**
If improvements focus effectively on the "Poor" category without significantly altering the current performance, the model is ready for production deployment. Steps to deploy include:
1. **Monitoring**: Implement a framework to track model predictions and ensure continued alignment with the business goal.
2. **Stakeholder Training**: Prepare users to interpret results and implement targeted interventions.
3. **Documentation**: Clearly outline the model’s strengths, limitations, and usage instructions.


"""

In [44]:
# Do not modify this code
print_tile(size="h2", key='experiment_results_explanations', value=experiment_results_explanations)
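
As a rough illustration of the class-imbalance step listed under the next steps, the sketch below uses plain random oversampling, which duplicates minority-class rows until every class matches the largest one. This is a simpler stand-in for SMOTE, not SMOTE itself, which synthesizes new points by interpolation. The function name `random_oversample` and the toy rows are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate minority-class rows at random until all classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows of this class in the original data are the pool to sample from.
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy training data: one feature per row, three imbalanced classes.
rows = [[3.1], [2.9], [3.0], [2.8], [3.9], [1.5]]
labels = [0, 0, 0, 0, 1, 2]

res_rows, res_labels = random_oversample(rows, labels)
print(Counter(res_labels))
```

Resampling of this kind should be applied to the training split only, so that validation and test metrics still reflect the original class distribution.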