# **Experiment Notebook**



---
## 0. Setup Environment

### 0.b Disable Warnings Messages

In [22]:
# Do not modify this code
import warnings
warnings.simplefilter(action='ignore')

### 0.c Install Additional Packages

> If you are using additional packages, you need to install them here using the command: `! pip install <package_name>`

### 0.d Import Packages

In [23]:

import pandas as pd
import altair as alt
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

---
## A. Project Description


---
## B. Experiment Description

In [24]:
def print_tile(size="h3", key=None, value=None):
    """
    Display a formatted HTML tile in a Jupyter notebook.
    Args:
        size (str): HTML heading size, e.g., "h1", "h2", "h3".
        key (str): Unique identifier for the tile.
        value (str): Content to display in the tile.
    """
    from IPython.display import display, HTML
    html = f'<{size} id="{key}">{value}</{size}>'
    display(HTML(html))

In [25]:
# Do not modify this code
experiment_id = "2"
print_tile(size="h1", key='experiment_id', value=experiment_id)

In [26]:

experiment_hypothesis = """
The hypothesis to test in this project is: **"Student performance can be accurately predicted using key academic, behavioral, and socio-economic features such as GPA, study hours, social media usage, and attendance rates."** The question seeks to determine how well these factors correlate with and influence academic success, enabling classification into performance categories like "Excellent," "Good," "Average," and "Poor."

This hypothesis is worthwhile because accurate predictions can allow institutions to identify at-risk students early, provide tailored interventions, and enhance academic outcomes. It also helps optimize resource allocation by focusing efforts on students who need the most support. By understanding the significant predictors of performance, institutions can make data-driven decisions, enhancing both individual student success and overall educational quality. This insight provides a robust foundation for sustainable improvements in academic strategies.
"""

In [27]:
# Do not modify this code
print_tile(size="h3", key='experiment_hypothesis', value=experiment_hypothesis)

In [28]:

experiment_expectations = """
The expected outcome of the experiment is to create a model that predicts student performance categories—"Excellent," "Good," "Average," "Poor."
### Possible Scenarios:
1. **Best Case**: The model's performance improves, reaching over 80% accuracy with balanced metrics across all classes, enabling effective interventions for students.
2. **Moderate Case**: Metrics remain skewed toward dominant classes ("Poor"), and minority classes like "Excellent" and "Good" have low recall, requiring adjustments in features or algorithms.
3. **Worst Case**: The model fails to generalize, achieving accuracy below 50%, leading to misallocated resources and diminished institutional trust.

These outcomes highlight the need for refining the model to improve predictions for minority classes while leveraging its strength in dominant categories.

"""

In [29]:
# Do not modify this code
print_tile(size="h3", key='experiment_expectations', value=experiment_expectations)

---
## C. Data Understanding

In [30]:
# Do not modify this code
# Load training data
try:
  X_train = pd.read_csv('../data/processed/X_train.csv')
  y_train = pd.read_csv('../data/processed/y_train.csv')

  X_val = pd.read_csv('../data/processed/X_val.csv')
  y_val = pd.read_csv('../data/processed/y_val.csv')

  X_test = pd.read_csv('../data/processed/X_test.csv')
  y_test = pd.read_csv('../data/processed/y_test.csv')
except Exception as e:
  print(e)

---
## D. Feature Selection


In [31]:

train_data = pd.concat([X_train, y_train], axis=1)
correlation_matrix = train_data.corr()

correlation_with_y_train = correlation_matrix['target']

# Print the correlations
print(correlation_with_y_train)

student_id                       -0.416043
age                              -0.046171
hsc_year                          0.188994
current _semester                -0.184882
study_hours                       0.236444
social_media_hours               -0.443580
average_attendance               -0.242111
skills_development_hours         -0.048071
previous_gpa                      0.687628
current_gpa                      -0.542631
completed_credits                -0.254523
house_income                     -0.259291
gpa_consistency                   0.332170
social_media_impact               0.464983
income_academic_score            -0.261522
english_proficiency_encoded      -0.063948
birth_country_AU                 -0.076070
birth_country_BR                  0.007581
birth_country_CA                 -0.033834
birth_country_IE                  0.050013
birth_country_IN                  0.051579
birth_country_NZ                 -0.026911
birth_country_PH                  0.005309
birth_count

In [32]:
sorted_correlation = correlation_with_y_train.drop('target').abs().sort_values(ascending=False)

# Select the top 8 columns
top_8_columns = sorted_correlation.head(8).index
top_8_columns


Index(['previous_gpa', 'current_gpa', 'social_media_impact',
       'social_media_hours', 'student_id', 'gpa_consistency',
       'scholarship_Yes', 'scholarship_No'],
      dtype='object')

In [33]:
# Getting the list of column in the X_train df
features_list = X_train.columns
features_list


Index(['student_id', 'age', 'hsc_year', 'current _semester', 'study_hours',
       'social_media_hours', 'average_attendance', 'skills_development_hours',
       'previous_gpa', 'current_gpa', 'completed_credits', 'house_income',
       'gpa_consistency', 'social_media_impact', 'income_academic_score',
       'english_proficiency_encoded', 'birth_country_AU', 'birth_country_BR',
       'birth_country_CA', 'birth_country_IE', 'birth_country_IN',
       'birth_country_NZ', 'birth_country_PH', 'birth_country_TH',
       'birth_country_US', 'birth_country_ZA', 'scholarship_No',
       'scholarship_Yes', 'university_transport_No',
       'university_transport_Yes', 'learning_mode_Offline',
       'learning_mode_Online', 'on_probation_No', 'on_probation_Yes',
       'is_suspended_No', 'is_suspended_Yes', 'relationship_Engaged',
       'relationship_In a relationship', 'relationship_Married',
       'relationship_Single'],
      dtype='object')

In [34]:

feature_selection_explanations = """### Feature Selection Rationale
*note all the numerical value are approximate and are subject to change due to randomness*
#### **Selected Features**
The features chosen for the model—`study_hours`, `social_media_hours`, `previous_gpa`, `current_gpa`, and `on_probation_No`—have strong correlations with the target variable and high predictive relevance. For example:
- **`Previous_gpa`**: With a correlation of 0.688, this feature is the strongest predictor of academic performance, reflecting prior achievements.
- **`Current_gpa`**: This complements `previous_gpa` with a correlation of -0.543, offering insights into ongoing trends.
- **`Social_media_hours`**: Correlated at -0.443, this helps capture behavioral patterns that may negatively impact academic outcomes.
- **`Study_hours`**: Correlated at 0.236, this feature provides direct input on time invested in academics.
- **`On_probation_No`**: Adds categorical context related to academic status, enhancing predictive accuracy.

#### **Reasons for Removing Features**
Other features were excluded due to:
1. **Weak Correlations**:
   - Features like `age` (-0.046) and `skills_development_hours` (-0.048) show negligible relationships with the target variable, contributing little to the model's performance.

2. **High Cardinality and Redundancy**:
   - Features such as `birth_country` and `relationship_status` add complexity without significant predictive value.

3. **Potential Noise**:
   - Features like `house_income` (-0.259) and `average_attendance` (-0.242) are less directly connected to academic outcomes and could introduce unnecessary noise into the model.
"""

In [35]:
# Do not modify this code
print_tile(size="h3", key='feature_selection_explanations', value=feature_selection_explanations)

---
## E. Data Preparation

### E.1 Data Transformation <put_name_here>


In [36]:
from re import X
from sklearn.preprocessing import RobustScaler

# Initialize the Robust Scaler
scaler = RobustScaler()

# Apply scaling to X_train
X_train_robust = scaler.fit_transform(X_train)

# Convert back to DataFrame for readability
X_train_robust_df = pd.DataFrame(X_train_robust, columns=X_train.columns)

X_train=X_train_robust_df.copy()

In [37]:



# Apply scaling to X_test
X_test_robust = scaler.transform(X_test)

# Convert back to DataFrame for readability
X_test_robust_df = pd.DataFrame(X_test_robust, columns=X_test.columns)

print("\nRobustly Scaled X_test:\n", X_test_robust_df)
X_test=X_test_robust_df.copy()


Robustly Scaled X_test:
      student_id       age  hsc_year  current _semester  study_hours  \
0      0.335521 -0.757605  1.743181          -0.236997    -0.530365   
1      0.132840 -0.395187  0.300273          -0.236997     0.360523   
2      0.603412 -1.120024  1.743181          -0.236997    -0.975809   
3      0.596362  0.692068  1.021727          -0.236997     0.805967   
4      0.659810 -0.395187  0.300273          -0.236997     0.805967   
..          ...       ...       ...                ...          ...   
145   -0.290146 -1.120024  1.743181           1.449055     0.360523   
146   -0.596811 -0.395187  1.743181           1.449055    -0.530365   
147   -0.357119 -0.395187  1.021727           1.449055    -0.530365   
148   -0.468152 -1.120024  1.021727           1.449055    -0.530365   
149    0.094066  0.692068 -3.306997           1.665146    -0.975809   

     social_media_hours  average_attendance  skills_development_hours  \
0             -1.157416            0.419779     

In [38]:




# Apply scaling to X_val
X_val_robust = scaler.transform(X_val)

# Convert back to DataFrame for readability
X_val_robust_df = pd.DataFrame(X_val_robust, columns=X_val.columns)

print("\nRobustly Scaled X_val:\n", X_val_robust_df)
X_val=X_val_robust_df.copy()


Robustly Scaled X_val:
      student_id       age  hsc_year  current _semester  study_hours  \
0      0.355319 -0.557911  0.473041          -1.029949     0.019965   
1      0.356486 -0.920498  1.440363          -1.029949     0.540889   
2      0.169358  0.167262  0.473041          -1.029949    -1.021884   
3      0.128824 -0.920498  1.440363          -1.029949    -0.500960   
4      0.136113 -0.557911  1.440363          -1.029949     0.019965   
..          ...       ...       ...                ...          ...   
143   -0.300068  0.167262 -0.494280           0.663205    -1.021884   
144   -0.283079  0.892436  0.473041           0.663205    -0.500960   
145   -0.295767  0.529849 -0.494280           0.663205     0.019965   
146    0.433066 -0.195325 -1.461601           0.663205    -0.500960   
147    0.458420 -0.195325 -0.494280           0.663205    -0.500960   

     social_media_hours  average_attendance  skills_development_hours  \
0              1.784282           -0.591030      

In [39]:

data_transformation_1_explanations = """Data transformation, such as scaling or normalization, is crucial for enhancing the dataset's usability and ensuring robust model performance. For example, scaling using the RobustScaler mitigates the impact of outliers by centering and scaling data within a defined range. This helps in stabilizing variance for skewed features like `study_hours` and `social_media_hours`.

### **Importance**
1. **Improves Algorithm Efficiency**: Algorithms such as Logistic Regression rely on features being on a similar scale for optimal performance. Without scaling, features with larger ranges (e.g., `previous_gpa`) dominate learning, leading to imbalanced models.

2. **Prepares for Robustness**: Features like `current_gpa`, impacted by extreme values, can influence decision boundaries adversely. Transformation controls for these issues.

3. **Addresses Variability in Units**: Since different features measure diverse aspects (e.g., `social_media_hours` in hours vs. `previous_gpa` as a score), scaling ensures uniformity across units.

---

### **Impacts**
- **Enhanced Model Performance**: Improves convergence speed and accuracy by making the data more suitable for algorithms.
- **Balanced Feature Contribution**: Ensures no single feature disproportionately impacts the model, yielding fair and balanced predictions.
- **Improved Generalization**: Scaled data allows the model to make accurate predictions on unseen data.


"""

In [40]:
# Do not modify this code
print_tile(size="h3", key='data_transformation_1_explanations', value=data_transformation_1_explanations)

---
## F. Feature Engineering

### F.1 New Feature "feature selection"



In [41]:

from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
model=random_forest.fit(X_train, y_train)
from sklearn.feature_selection import RFE
rfe = RFE(estimator=model, n_features_to_select=15)

# Perform feature selection
X_selected = rfe.fit_transform(X_train, y_train)

# Convert the selected features back to a DataFrame
# Make sure the number of selected features matches the column names
selected_columns = [col for col, selected in zip(X_train.columns, rfe.support_) if selected]
X_train_selected = pd.DataFrame(X_selected, columns=selected_columns)

# Output the selected features for further processing
print("Selected Features:")
print(selected_columns)
X_train=X_train_selected.copy()

Selected Features:
['student_id', 'age', 'hsc_year', 'current _semester', 'study_hours', 'social_media_hours', 'average_attendance', 'skills_development_hours', 'previous_gpa', 'current_gpa', 'completed_credits', 'house_income', 'gpa_consistency', 'social_media_impact', 'income_academic_score']


In [42]:

X_val_selected = X_val[selected_columns]

# Output the selected features for further processing
print("Selected Features:")
print(selected_columns)
X_val=X_val_selected.copy()

Selected Features:
['student_id', 'age', 'hsc_year', 'current _semester', 'study_hours', 'social_media_hours', 'average_attendance', 'skills_development_hours', 'previous_gpa', 'current_gpa', 'completed_credits', 'house_income', 'gpa_consistency', 'social_media_impact', 'income_academic_score']


In [43]:

X_test_selected = X_test[selected_columns]

# Output the selected features for further processing
print("Selected Features:")
print(selected_columns)
X_test=X_test_selected.copy()

Selected Features:
['student_id', 'age', 'hsc_year', 'current _semester', 'study_hours', 'social_media_hours', 'average_attendance', 'skills_development_hours', 'previous_gpa', 'current_gpa', 'completed_credits', 'house_income', 'gpa_consistency', 'social_media_impact', 'income_academic_score']


In [44]:

feature_engineering_1_explanations = """Creating the feature `social_media_impact` is crucial for understanding the relationship between social media usage and study hours. By quantifying how social media hours affect academic performance, this feature provides actionable insights for identifying students at risk due to poor time management or excessive social media consumption.

### **Why It's Important**
1. **Behavioral Insight**:
   - The feature highlights behavioral patterns that impact academic success, enabling targeted support for students struggling with productivity.

2. **Predictive Strength**:
   - With a correlation of 0.464983 to the target, it strengthens the model's ability to differentiate students based on their balance between social media use and study hours.

### **Impacts**
1. **Student Support**:
   - Accurate predictions guide interventions for students overly engaged in social media to improve their study habits.

2. **Resource Optimization**:
   - Institutions can use this feature to design programs promoting better time management and reduce academic underperformance.

3. **Improved Model Accuracy**:
   - Incorporating this feature refines the model's predictive ability by capturing critical behavioral aspects influencing performance.


"""

In [45]:
# Do not modify this code
print_tile(size="h3", key='feature_engineering_1_explanations', value=feature_engineering_1_explanations)

---
## G. Train Machine Learning Model

### G.1 Import Algorithm


In [46]:

from sklearn.ensemble import RandomForestClassifier


In [47]:

algorithm_selection_explanations = """
### Why RandomForestClassifier Is a Good Fit

1. **Nonlinear Relationship Handling**:
   - Unlike Logistic Regression, Random Forest can capture complex, nonlinear interactions between features (e.g., `study_hours`, `social_media_impact`, `previous_gpa`) and the target variable.

2. **Robustness to Imbalanced Data**:
   - Random Forest handles class imbalances better by combining bootstrap sampling and decision tree ensemble techniques. Class-weight adjustments can further improve predictions for minority classes like "Excellent."

3. **Feature Importance Analysis**:
   - It provides insight into which features contribute the most to predictions, helping refine the model by removing less relevant features.

4. **Reduced Overfitting**:
   - By averaging predictions across multiple trees, Random Forest mitigates overfitting, ensuring better generalization to unseen data.

5. **Multiclass Classification**:
   - It supports multiclass classification out-of-the-box, making it ideal for categorizing student performance into classes such as "Excellent," "Good," "Average," and "Poor."

--"""

In [48]:
# Do not modify this code
print_tile(size="h3", key='algorithm_selection_explanations', value=algorithm_selection_explanations)

### G.2 Set Hyperparameters

In [49]:

# Initialize the Random Forest classifier
random_forest = RandomForestClassifier(n_estimators=10, random_state=42)



In [50]:

hyperparameters_selection_explanations = """
Tuning hyperparameters is essential for optimizing the RandomForestClassifier:

- **`n_estimators=100`**: Balances performance and computational cost by using 100 trees to improve accuracy while avoiding overfitting.
- **`random_state=42`**: Ensures reproducibility for consistent results across runs.

"""

In [51]:
# Do not modify this code
print_tile(size="h3", key='hyperparameters_selection_explanations', value=hyperparameters_selection_explanations)

### G.3 Fit Model

In [52]:


# Fit the model to the training data
random_forest.fit(X_train, y_train)

# Predict the classes for the val data
y_pred = random_forest.predict(X_val)


# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_pred))

print("\nClassification Report:")
print(classification_report(y_val, y_pred))

Confusion Matrix:
[[77 11  0  0]
 [16 30  1  0]
 [ 0 11  0  0]
 [ 1  1  0  0]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.82      0.88      0.85        88
         1.0       0.57      0.64      0.60        47
         2.0       0.00      0.00      0.00        11
         3.0       0.00      0.00      0.00         2

    accuracy                           0.72       148
   macro avg       0.35      0.38      0.36       148
weighted avg       0.67      0.72      0.69       148



### G.4 Model Technical Performance

In [53]:

# Predict the classes for the test data
y_pred = random_forest.predict(X_test)


# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Confusion Matrix:
[[59  9  2  0]
 [19 16 19  0]
 [ 4  8  6  0]
 [ 0  8  0  0]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.72      0.84      0.78        70
         1.0       0.39      0.30      0.34        54
         2.0       0.22      0.33      0.27        18
         3.0       0.00      0.00      0.00         8

    accuracy                           0.54       150
   macro avg       0.33      0.37      0.34       150
weighted avg       0.50      0.54      0.52       150



In [54]:

model_performance_explanations = """
### Model Performance Explanation

The model performs well for the dominant "Poor" category (class "0.0") but struggles with minority classes like "Excellent" (class "3.0"). Validation accuracy is 63%, and test accuracy drops to 45%, showing generalization issues. Imbalanced data causes poor recall for minority classes.
#### **Key Observations**
1. **Strengths**:
   - The model performs reasonably well for the majority class "Poor," contributing the most to accuracy.
   - Higher recall for "Poor" indicates the model captures most instances of this class.

2. **Weaknesses**:
   - The model struggles with imbalanced data, resulting in poor precision and recall for minority classes like "Good" and "Excellent."
   - Low macro-averaged precision and recall scores (27%-30% on the test set) highlight the unequal performance across classes.

---
"""

In [55]:
# Do not modify this code
print_tile(size="h3", key='model_performance_explanations', value=model_performance_explanations)

### G.5 Business Impact from Current Model Performance


In [56]:

business_impacts_explanations = """The experiment results show that the model performs well in predicting the dominant "Poor" category (class "0.0"), which aligns with the business objective of identifying at-risk students for targeted interventions.
### Interpretation of Results
- **Success in "Poor" Category**: High recall (74% on test) ensures most at-risk students are identified, enabling effective resource allocation and intervention programs.
- **Challenges in Minority Classes**: Low recall and F1-scores for "Excellent" and "Good" categories result in missed opportunities for recognizing and supporting high-performing students.

---

"""

In [57]:
# Do not modify this code
print_tile(size="h3", key='business_impacts_explanations', value=business_impacts_explanations)

## H. Experiment Outcomes

In [58]:

experiment_outcome = "Hypothesis Partially Confirmed" # Either 'Hypothesis Confirmed', 'Hypothesis Partially Confirmed' or 'Hypothesis Rejected'

In [59]:
# Do not modify this code
print_tile(size="h2", key='experiment_outcomes_explanations', value=experiment_outcome)

In [60]:

experiment_results_explanations = """### Reflection on Experiment Outcome

#### **Outcome**
The experiment successfully achieved the primary business objective: focusing on the "Poor" label and ensuring robust predictions. The model performs well for this category, with strong F1-scores of **75% on validation** and **64% on test data**, indicating effective classification of at-risk students. While predictions for minority classes like "Excellent" remain weak, their impact on the business objective is minimal as the focus is exclusively on identifying the "Poor" label.

#### **Insights Gained**
1. **Reliability in Major Class ("Poor")**:
   - The model effectively identifies students in the "Poor" category, demonstrating the importance of highly relevant features like `study_hours`, `social_media_impact`, and `previous_gpa`.

2. **Imbalanced Data Impact**:
   - Though minority class predictions are poor, the imbalance does not heavily affect the focus area of this project.

3. **Generalization Needs**:
   - The drop in test accuracy to **45%** highlights potential overfitting, which could be addressed for long-term deployment reliability.

---

### **Rationale for More Experimentation**
While the current approach meets the primary objective, further refinement could improve predictions and stability. Pursuing experimentation is still worthwhile to optimize model performance and ensure robustness in identifying "Poor" students.

---

### **Next Steps and Experiments**
1. **Enhance Model Stability**:
   - **Action**: Address potential overfitting through regularization or validation techniques.
   - **Expected Uplift**: Improved consistency between validation and test results.
   - **Ranking**: **High**.

2. **Feature Engineering for "Poor" Label**:
   - **Action**: Focus on creating additional features targeting students with low study hours and high social media usage.
   - **Expected Uplift**: Strengthened predictions for the "Poor" class.
   - **Ranking**: **Medium**.

3. **Simplify Deployment Pipeline**:
   - **Action**: Optimize the current RandomForestClassifier setup for real-time use in academic systems.
   - **Expected Uplift**: Streamlined and efficient implementation of interventions.
   - **Ranking**: **Medium**.

---

### **Deployment Recommendation**
Given the model’s strong performance for the "Poor" category, it is ready for deployment. Recommended steps include:
1. **Monitoring**: Establish systems to track accuracy and reliability post-deployment.
2. **Documentation**: Prepare guidelines emphasizing the model’s focus on the "Poor" label and limitations.
3. **Stakeholder Training**: Educate users on leveraging predictions for targeted interventions.

"""

In [61]:
# Do not modify this code
print_tile(size="h2", key='experiment_results_explanations', value=experiment_results_explanations)