# **Experiment Notebook**



---
## 0. Setup Environment

### 0.b Disable Warnings Messages

In [1]:
# Do not modify this code
import warnings
warnings.simplefilter(action='ignore')

### 0.c Install Additional Packages

> If you are using additional packages, you need to install them here using the command: `! pip install <package_name>`

### 0.d Import Packages

In [None]:

import pandas as pd
import altair as alt
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
)
import numpy as np




---
## B. Experiment Description

In [3]:
def print_tile(size="h3", key=None, value=None):
    """
    Prints a formatted tile with a given size, key, and value.
    Args:
        size (str): HTML heading size, e.g., "h3".
        key (str): Unique identifier for the tile.
        value (str): Content to display in the tile.
    """
    from IPython.display import display, HTML

    html = f'<{size} id="{key}">{value}</{size}>'
    display(HTML(html))

In [None]:

experiment_hypothesis = """
The hypothesis to test in this project is: **"Poor student performance can be accurately predicted using key academic, behavioral, and socio-economic features such as GPA, study hours, social media usage, and attendance rates."** The question is how strongly these factors correlate with and influence academic success, so that students can be classified into performance categories like "Excellent," "Good," "Average," and "Poor."

This hypothesis is worthwhile because accurate predictions can allow institutions to identify at-risk students early, provide tailored interventions, and enhance academic outcomes. It also helps optimize resource allocation by focusing efforts on students who need the most support. By understanding the significant predictors of performance, institutions can make data-driven decisions, enhancing both individual student success and overall educational quality. This insight provides a robust foundation for sustainable improvements in academic strategies.
"""

In [5]:
# Do not modify this code
print_tile(size="h3", key='experiment_hypothesis', value=experiment_hypothesis)

In [None]:

experiment_expectations = """
The expected outcome of the experiment is to create a model that predicts student performance categories—"Excellent," "Good," "Average," "Poor."
### Possible Scenarios:
1. **Best Case**: The model reaches an F1 score above 80% with balanced metrics across all classes, enabling effective interventions for students.
2. **Moderate Case**: Metrics remain skewed toward the dominant class ("Poor"), and minority classes like "Excellent" and "Good" have low recall, requiring adjustments in features or algorithms.
3. **Worst Case**: The model fails to generalize, achieving an F1 score below 50%, leading to misallocated resources and diminished institutional trust.

These outcomes highlight the need for refining the model to improve predictions for minority classes while leveraging its strength in dominant categories.

"""

In [7]:
# Do not modify this code
print_tile(size="h3", key='experiment_expectations', value=experiment_expectations)

---
## C. Data Understanding

In [8]:
# Do not modify this code
try:
  X_train = pd.read_csv('../data/processed/X_train.csv')
  y_train = pd.read_csv('../data/processed/y_train.csv')

  X_val = pd.read_csv('../data/processed/X_val.csv')
  y_val = pd.read_csv('../data/processed/y_val.csv')

  X_test = pd.read_csv('../data/processed/X_test.csv')
  y_test = pd.read_csv('../data/processed/y_test.csv')
except Exception as e:
  print(e)

---
## D. Feature Selection


In [None]:

train_data = pd.concat([X_train, y_train], axis=1)
correlation_matrix = train_data.corr()

correlation_with_y_train = correlation_matrix['target']

# Print the correlations
print(correlation_with_y_train)

student_id                       -0.416043
age                              -0.046171
hsc_year                          0.188994
current _semester                -0.184882
study_hours                       0.236444
social_media_hours               -0.443580
average_attendance               -0.242111
skills_development_hours         -0.048071
previous_gpa                      0.687628
current_gpa                      -0.542631
completed_credits                -0.254523
house_income                     -0.259291
gpa_consistency                   0.332170
social_media_impact               0.464983
income_academic_score            -0.261522
english_proficiency_encoded      -0.063948
birth_country_AU                 -0.076070
birth_country_BR                  0.007581
birth_country_CA                 -0.033834
birth_country_IE                  0.050013
birth_country_IN                  0.051579
birth_country_NZ                 -0.026911
birth_country_PH                  0.005309
birth_count

In [10]:
sorted_correlation = correlation_with_y_train.drop('target').abs().sort_values(ascending=False)

# Select the top 8 columns
top_8_columns = sorted_correlation.head(8).index
top_8_columns


Index(['previous_gpa', 'current_gpa', 'social_media_impact',
       'social_media_hours', 'student_id', 'gpa_consistency',
       'scholarship_Yes', 'scholarship_No'],
      dtype='object')

In [11]:
# Get the list of columns in the X_train DataFrame
features_list = X_train.columns
features_list


Index(['student_id', 'age', 'hsc_year', 'current _semester', 'study_hours',
       'social_media_hours', 'average_attendance', 'skills_development_hours',
       'previous_gpa', 'current_gpa', 'completed_credits', 'house_income',
       'gpa_consistency', 'social_media_impact', 'income_academic_score',
       'english_proficiency_encoded', 'birth_country_AU', 'birth_country_BR',
       'birth_country_CA', 'birth_country_IE', 'birth_country_IN',
       'birth_country_NZ', 'birth_country_PH', 'birth_country_TH',
       'birth_country_US', 'birth_country_ZA', 'scholarship_No',
       'scholarship_Yes', 'university_transport_No',
       'university_transport_Yes', 'learning_mode_Offline',
       'learning_mode_Online', 'on_probation_No', 'on_probation_Yes',
       'is_suspended_No', 'is_suspended_Yes', 'relationship_Engaged',
       'relationship_In a relationship', 'relationship_Married',
       'relationship_Single'],
      dtype='object')

In [None]:

feature_selection_explanations = """### Feature Selection Rationale
*Note: all numerical values are approximate and may change between runs due to randomness.*
#### **Selected Features**
The features chosen for the model (`study_hours`, `social_media_hours`, `previous_gpa`, `current_gpa`, and `on_probation_No`) were selected for their correlation with the target variable or their direct relevance to academic standing. For example:
- **`previous_gpa`**: With a correlation of 0.688, this feature is the strongest predictor of academic performance, reflecting prior achievements.
- **`current_gpa`**: This complements `previous_gpa` with a correlation of -0.543, offering insights into ongoing trends.
- **`social_media_hours`**: Correlated at -0.444, this helps capture behavioral patterns that may negatively impact academic outcomes.
- **`study_hours`**: Correlated at 0.236, this feature provides direct input on time invested in academics.
- **`on_probation_No`**: Adds categorical context related to academic status.

#### **Reasons for Removing Features**
Other features were excluded due to:
1. **Weak Correlations**:
   - Features like `age` (-0.046) and `skills_development_hours` (-0.048) show negligible relationships with the target variable, contributing little to the model's performance.

2. **High Cardinality and Redundancy**:
   - One-hot encoded groups such as `birth_country_*` and `relationship_*` expand into many sparse columns, adding complexity without significant predictive value.

3. **Potential Noise**:
   - Features like `house_income` (-0.259) and `average_attendance` (-0.242) are less directly connected to academic outcomes and could introduce unnecessary noise into the model.
"""

In [13]:
# Do not modify this code
print_tile(size="h3", key='feature_selection_explanations', value=feature_selection_explanations)
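The later cells scale and transform every column of `X_train`, so the shortlist described above is not applied anywhere in the code. A minimal sketch of how such a shortlist could be applied is shown below. The helper `keep_selected` and the toy DataFrame are hypothetical and only illustrate the idea; the column names are taken from the rationale above.

```python
import pandas as pd

# Shortlist named in the feature selection rationale
SELECTED_FEATURES = [
    "study_hours",
    "social_media_hours",
    "previous_gpa",
    "current_gpa",
    "on_probation_No",
]


def keep_selected(frame, columns=SELECTED_FEATURES):
    """Return a copy of `frame` restricted to `columns` (hypothetical helper)."""
    missing = [col for col in columns if col not in frame.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return frame.loc[:, list(columns)].copy()


# Toy rows standing in for X_train (values are made up for illustration)
toy = pd.DataFrame({
    "student_id": [101, 102, 103],
    "age": [19, 21, 20],
    "study_hours": [2.0, 4.5, 3.0],
    "social_media_hours": [5.0, 1.5, 3.0],
    "previous_gpa": [2.1, 3.6, 3.0],
    "current_gpa": [2.4, 3.5, 2.9],
    "on_probation_No": [0, 1, 1],
})

toy_selected = keep_selected(toy)
print(list(toy_selected.columns))
```

The same call would have to be applied to the training, validation, and test frames before scaling so that all three keep identical columns.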

---
## E. Data Preparation

### E.3 Data Transformation Zscore


In [14]:
from sklearn.preprocessing import StandardScaler

# Apply Z-score normalization
scaler = StandardScaler()
normalized_data = scaler.fit_transform(X_train)

# Convert normalized data back to DataFrame
normalized_X_train = pd.DataFrame(normalized_data, columns=X_train.columns)

# Display normalized DataFrame
print("\nNormalized Data:\n", normalized_X_train)
X_train=normalized_X_train.copy()


Normalized Data:
      student_id       age  hsc_year  current _semester  study_hours  \
0     -0.495645  1.738179  0.708183          -1.602195    -0.778038   
1     -0.489071 -0.684350 -0.220882          -1.602195    -0.778038   
2     -1.399514 -0.684350 -0.220882          -1.602195    -0.778038   
3     -0.719147  0.123160 -0.220882          -1.602195     0.247891   
4     -0.784883 -0.684350 -0.220882          -1.602195    -0.059159   
..          ...       ...       ...                ...          ...   
690    0.378644  1.738179  0.708183           1.450651    -0.059159   
691    0.365497 -0.684350 -0.220882           1.450651    -1.704833   
692    0.927540 -0.684350  0.708183           1.450651     1.024818   
693   -1.350213  1.738179  2.515454           1.450651    -0.778038   
694   -1.356786  0.930670  0.708183           1.450651     1.455002   

     social_media_hours  average_attendance  skills_development_hours  \
0              1.912304           -1.206113            

In [15]:


# Apply the scaler fitted on the training data (transform only, no refitting)

normalized_data = scaler.transform(X_test)

# Convert normalized data back to DataFrame
normalized_X_test = pd.DataFrame(normalized_data, columns=X_test.columns)

# Display normalized DataFrame
print("\nNormalized Data:\n", normalized_X_test)
X_test=normalized_X_test.copy()


Normalized Data:
      student_id       age  hsc_year  current _semester  study_hours  \
0      0.502863 -1.100388  1.398646          -0.527970    -0.751946   
1      0.159785 -0.515075  0.058091          -0.527970     0.411772   
2      0.956323 -1.685700  1.398646          -0.527970    -1.333805   
3      0.944390  1.240863  0.728368          -0.527970     0.993630   
4      1.051788 -0.515075  0.058091          -0.527970     0.993630   
..          ...       ...       ...                ...          ...   
145   -0.556204 -1.685700  1.398646           2.689067     0.411772   
146   -1.075296 -0.515075  1.398646           2.689067    -0.751946   
147   -0.669569 -0.515075  0.728368           2.689067    -0.751946   
148   -0.857516 -1.685700  0.728368           2.689067    -0.751946   
149    0.094153  1.240863 -3.293298           3.101374    -1.333805   

     social_media_hours  average_attendance  skills_development_hours  \
0             -1.313636            0.909897            

In [16]:


# Apply the scaler fitted on the training data (transform only, no refitting)

normalized_data = scaler.transform(X_val)

# Convert normalized data back to DataFrame
normalized_X_val = pd.DataFrame(normalized_data, columns=X_val.columns)

# Display normalized DataFrame
print("\nNormalized Data:\n", normalized_X_val)
X_val=normalized_X_val.copy()


Normalized Data:
      student_id       age  hsc_year  current _semester  study_hours  \
0      0.536376 -0.777878  0.218604          -2.040945    -0.033080   
1      0.538351 -1.363463  1.117308          -2.040945     0.647374   
2      0.221599  0.393291  0.218604          -2.040945    -1.393990   
3      0.152987 -1.363463  1.117308          -2.040945    -0.713535   
4      0.165325 -0.777878  1.117308          -2.040945    -0.033080   
..          ...       ...       ...                ...          ...   
143   -0.572999  0.393291 -0.680101           1.189642    -1.393990   
144   -0.544243  1.564461  0.218604           1.189642    -0.713535   
145   -0.565718  0.978876 -0.680101           1.189642    -0.033080   
146    0.667978 -0.192293 -1.578805           1.189642    -0.713535   
147    0.710894 -0.192293 -0.680101           1.189642    -0.713535   

     social_media_hours  average_attendance  skills_development_hours  \
0              1.996438           -1.096522            

In [None]:

data_transformation_3_explanations = """Z-score normalization (standardization) puts all features on a common scale before modelling. The scaler is fitted on the training set only and then applied unchanged to the validation and test sets.

### **Why It's Important**
1. **Centers and Scales Data**: It rescales features like `study_hours` and `social_media_hours` to a mean of 0 and a standard deviation of 1, making values comparable across features.
2. **Enhances Algorithm Efficiency**: Models such as regularized Logistic Regression penalize all coefficients equally, so they behave best when features share a scale. PCA is variance-based and is likewise sensitive to feature scale.
3. **Outlier Caveat**: Z-score normalization does not remove or cap outliers. The mean and standard deviation are themselves sensitive to extreme values, so outliers remain extreme after scaling.

### **Impacts**
1. **Balanced Feature Contributions**: Features with large numeric ranges (e.g., `house_income`) no longer dominate features with small ranges, which can improve accuracy.
2. **Faster Convergence**: Gradient-based optimization techniques benefit significantly from normalized data, resulting in efficient model training.
3. **Better Interpretability**: When a model is fitted directly on standardized features, its coefficients are on a common scale and can be compared with each other.


"""

In [18]:
# Do not modify this code
print_tile(size="h3", key='data_transformation_3_explanations', value=data_transformation_3_explanations)
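One thing worth checking is how an extreme value behaves under standardization. The sketch below computes z-scores by hand with NumPy, using the population standard deviation (the same convention `StandardScaler` uses), and reuses the training statistics for a validation array. The hour values are made up for illustration.

```python
import numpy as np

# Hypothetical study-hour values; the last training value is an obvious outlier
train_hours = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 30.0])
val_hours = np.array([2.0, 6.0])

# Statistics come from the training data only
train_mean = train_hours.mean()
train_std = train_hours.std()  # ddof=0, i.e. population standard deviation

train_z = (train_hours - train_mean) / train_std
val_z = (val_hours - train_mean) / train_std  # reuse training statistics

print("train z-scores:", np.round(train_z, 2))
print("val z-scores:", np.round(val_z, 2))
```

The outlier still has by far the largest z-score after scaling, and it pulls the mean upward so that ordinary values end up below zero. If outliers are a concern, they need separate handling, for example clipping or a scaler based on the median and interquartile range.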

---
## F. Feature Engineering

### Performing "PCA"


In [None]:

from sklearn.decomposition import PCA
pca = PCA(n_components=0.90)
X_train_pca = pca.fit_transform(X_train)
X_train = pd.DataFrame(X_train_pca)  # Convert back to DataFrame if needed


In [20]:


X_val_pca = pca.transform(X_val)
X_val = pd.DataFrame(X_val_pca)  # Convert back to DataFrame if needed


In [21]:


X_test_pca = pca.transform(X_test)
X_test = pd.DataFrame(X_test_pca)  # Convert back to DataFrame if needed


In [None]:

feature_engineering_3_explanations = """Principal Component Analysis (PCA) is applied for dimensionality reduction, replacing the standardized features with a smaller set of components that capture most of the variance in the dataset.

### **Why It's Important**
1. **Reduces Dimensionality**:
   - PCA condenses the 40 input columns into fewer components while retaining 90% of the variance (`n_components=0.90`), which lowers computational cost and can reduce overfitting.

2. **Highlights Key Patterns**:
   - By transforming features into orthogonal (uncorrelated) components, PCA merges correlated columns, such as the one-hot pair `scholarship_Yes` and `scholarship_No`, into shared directions of variation.

3. **Improves Model Efficiency**:
   - With fewer components, the model trains faster and may generalize better on unseen data.

---

### **Impacts**
1. **Predictive Accuracy**:
   - Removing redundancy between correlated features can help the model. However, PCA ignores the target, so low-variance components that are discarded may still carry predictive signal.

2. **Interpretability Trade-off**:
   - Each component is a linear combination of all input features, so predictions can no longer be traced directly to individual features such as `previous_gpa`.

3. **Better Generalization**:
   - Reduced dimensionality can mitigate overfitting; this has to be confirmed on the validation and test datasets.


"""

In [23]:
# Do not modify this code
print_tile(size="h3", key='feature_engineering_3_explanations', value=feature_engineering_3_explanations)
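To make the 90% variance threshold concrete, the sketch below shows how the number of retained components follows from the cumulative explained variance ratio. It uses plain NumPy on synthetic data (an assumption, not the student dataset) and illustrates the idea behind a fractional `n_components`; it is not scikit-learn's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 10 observed columns driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

# PCA works on centered data; singular values give the variance per component
X_centered = X_demo - X_demo.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained_variance = singular_values ** 2 / (X_centered.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.90
cumulative = np.cumsum(explained_ratio)
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print("cumulative explained variance:", np.round(cumulative, 3))
print("components kept:", n_keep)
```

On the notebook's data, the equivalent information is available from the fitted object through `pca.explained_variance_ratio_` and `pca.n_components_`.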

---
## G. Train Machine Learning Model

### G.1 Import Algorithm


In [None]:

from sklearn.linear_model import LogisticRegression

In [None]:

algorithm_selection_explanations = """
The selected algorithm, Logistic Regression, is an excellent fit for this project due to several reasons:

### **Why It's a Good Fit**
1. **Multiclass Classification Capability**:
   - Logistic Regression can handle multiclass classification efficiently with the `multinomial` setting, making it ideal for predicting student performance categories ("Excellent," "Good," "Average," "Poor").

2. **Linear Relationships**:
   - This algorithm excels at modeling linear relationships between features like `study_hours`, `social_media_hours`, and the target variable, ensuring interpretable results.

3. **Scalability and Efficiency**:
   - Logistic Regression is computationally efficient, allowing it to handle datasets like ours with scaled features, ensuring quick training and testing processes.

4. **Robustness with Feature Selection**:
   - The chosen features—such as `study_hours`, `previous_gpa`, and `social_media_hours`—align well with Logistic Regression's ability to weigh feature importance effectively.

5. **Handling Imbalanced Data**:
   - Logistic Regression does not correct for class imbalance on its own (`max_iter=500` only affects convergence), but it can be adapted through the `class_weight` parameter or resampling, which matters given the skewed class distribution in this dataset.
"""

In [26]:
# Do not modify this code
print_tile(size="h3", key='algorithm_selection_explanations', value=algorithm_selection_explanations)
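Class weighting is one way to adapt Logistic Regression to imbalanced classes. The sketch below computes the "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is the formula scikit-learn applies when `class_weight="balanced"` is set. The class counts are the test-set supports from the classification report further down (70, 54, 18, 8) and are used only as an illustration, since the training distribution is not shown in this notebook. The helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (the 'balanced' heuristic)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Illustrative label distribution: 70 / 54 / 18 / 8 samples for classes 0 to 3
labels = [0] * 70 + [1] * 54 + [2] * 18 + [3] * 8
weights = balanced_class_weights(labels)

for cls, weight in weights.items():
    print(f"class {cls}: weight {weight:.3f}")
```

Rare classes receive larger weights, so their misclassifications cost more during training. In the model itself this would correspond to passing `class_weight="balanced"` to `LogisticRegression`.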

### G.2 Set Hyperparameters

In [None]:


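# Initialize Logistic Regression with a multinomial (softmax) objective and the 'lbfgs' solver
# Note: `multi_class` is deprecated in recent scikit-learn releases, where multinomial is already the default for 'lbfgs'
log_reg = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=500)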



In [None]:

hyperparameters_selection_explanations = """
Hyperparameter choices determine whether the Logistic Regression model trains reliably. Setting `max_iter=500` gives the solver enough iterations to converge on this multiclass task. Choosing `multinomial` for the `multi_class` parameter fits a single softmax model over all categories ("Excellent," "Good," "Average," and "Poor") rather than separate one-vs-rest classifiers. The `solver='lbfgs'` option supports the multinomial loss and works well on small to medium-sized datasets such as this one.

"""

In [29]:
# Do not modify this code
print_tile(size="h3", key='hyperparameters_selection_explanations', value=hyperparameters_selection_explanations)

### G.3 Fit Model

In [None]:

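# Fit the model (ravel turns the single-column y_train DataFrame into the 1-D array scikit-learn expects)
log_reg.fit(X_train, y_train.values.ravel())

# Predict on the validation set
y_pred = log_reg.predict(X_val)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_pred))

print("\nClassification Report:")
print(classification_report(y_val, y_pred))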


Confusion Matrix:
[[62 22  4  0]
 [18 20  8  1]
 [ 0  4  7  0]
 [ 0  0  2  0]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.78      0.70      0.74        88
         1.0       0.43      0.43      0.43        47
         2.0       0.33      0.64      0.44        11
         3.0       0.00      0.00      0.00         2

    accuracy                           0.60       148
   macro avg       0.39      0.44      0.40       148
weighted avg       0.62      0.60      0.61       148



### G.4 Model Technical Performance

In [None]:

# Predict on the test set
y_pred = log_reg.predict(X_test)
# Evaluate the Model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Confusion Matrix:
[[54 10  6  0]
 [25 22  4  3]
 [ 2 10  5  1]
 [ 0  1  6  1]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.67      0.77      0.72        70
         1.0       0.51      0.41      0.45        54
         2.0       0.24      0.28      0.26        18
         3.0       0.20      0.12      0.15         8

    accuracy                           0.55       150
   macro avg       0.40      0.40      0.39       150
weighted avg       0.53      0.55      0.54       150
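
The headline figures in both reports can be re-derived from the confusion matrices printed above. The sketch below does this in plain Python, assuming the scikit-learn convention that rows are true labels and columns are predicted labels. The helper name `summarize` is hypothetical.

```python
def summarize(matrix):
    """Return overall accuracy and per-class precision/recall/F1 for a square confusion matrix."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n_classes)) / total
    per_class = []
    for i in range(n_classes):
        true_positives = matrix[i][i]
        actual = sum(matrix[i])                                  # row sum = support
        predicted = sum(matrix[r][i] for r in range(n_classes))  # column sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append(
            {"precision": precision, "recall": recall, "f1": f1, "support": actual}
        )
    return accuracy, per_class


# Confusion matrices copied from the validation and test outputs above
val_matrix = [[62, 22, 4, 0], [18, 20, 8, 1], [0, 4, 7, 0], [0, 0, 2, 0]]
test_matrix = [[54, 10, 6, 0], [25, 22, 4, 3], [2, 10, 5, 1], [0, 1, 6, 1]]

val_accuracy, val_per_class = summarize(val_matrix)
test_accuracy, test_per_class = summarize(test_matrix)

for name, accuracy, per_class in [
    ("validation", val_accuracy, val_per_class),
    ("test", test_accuracy, test_per_class),
]:
    macro_f1 = sum(row["f1"] for row in per_class) / len(per_class)
    print(f"{name}: accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
```

Macro F1 averages the classes equally, which is why it sits well below accuracy here: the two smallest classes contribute low scores that the overall accuracy hides.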



In [None]:

model_performance_explanations = """
### Explanation of Model Performance

### **Key Observations**
1. **Strength in Majority Classes**:
   - The model performs best on the "Poor" category (class "0.0"), which dominates the dataset, with precision of 67-78% and recall of 70-77% across the validation and test sets.

2. **Challenges with Minority Classes**:
   - For less frequent categories like "Excellent" (class "3.0"), the model fails on validation (F1-score and recall of 0% on only 2 samples) and remains weak on test (recall of 12%, F1-score of 0.15 on 8 samples), indicating the need for additional balancing strategies or feature engineering.

3. **Imbalance Sensitivity**:
   - The model is heavily influenced by the dataset's class imbalance, which skews its ability to generalize across all performance categories.

4. **Generalization Issues**:
   - Validation accuracy is 60% and test accuracy drops to 55%. Training accuracy was not measured, so overfitting cannot be confirmed, but macro F1 of about 0.40 on both sets shows the model fails to capture patterns for the smaller classes.

---

### **Implications**
- These results indicate that while the model is moderately effective for the dominant class, its inability to classify minority categories limits its practical use.

"""

In [33]:
# Do not modify this code
print_tile(size="h3", key='model_performance_explanations', value=model_performance_explanations)

### G.5 Business Impact from Current Model Performance


In [None]:

business_impacts_explanations = """The experimental results reveal that the Logistic Regression model performs moderately well in predicting the dominant "Poor" category (class "0.0") but struggles significantly with minority categories, particularly "Excellent" (class "3.0").

### **Model Performance Summary**
- **Validation Accuracy**: 60%
- **Test Accuracy**: 55%
- **Precision & Recall**:
  - The model achieves acceptable precision (67-78%) and recall (70-77%) for the "Poor" category, but fails for "Excellent," with both metrics at 0% on validation and only 20% precision and 12% recall on test.

### **Business Impact**
1. **Positive Impact of Correct Results**:
   - For the "Poor" category, accurate predictions enable early interventions, improving student success rates and resource allocation efficiency.

2. **Negative Impact of Incorrect Results**:
   - Misclassification in the "Poor" category leads to missed opportunities for aiding at-risk students, directly affecting academic outcomes. On the test set, 16 of the 70 "Poor" students (23%) were predicted as a higher category.
   - For "Excellent" and "Good" categories, low recall and precision result in an inability to identify high-performing students, potentially impacting recognition and support programs. These errors are less critical compared to misclassifications in the "Poor" category but still affect institutional reputation and trust.

3. **Overall Impact**:
   - The skewed performance highlights the need for addressing imbalances to ensure equitable treatment of all categories.
   - Incorrect predictions in minority classes can lead to systemic biases, hampering the effectiveness of resource allocation and support strategies.

"""

In [36]:
# Do not modify this code
print_tile(size="h3", key='business_impacts_explanations', value=business_impacts_explanations)

## H. Experiment Outcomes

In [None]:

experiment_results_explanations = """### Reflection on Experiment Outcomes

#### **Experiment Outcome**
The experiment resulted in a **partial confirmation of the hypothesis**. The model demonstrated strength in predicting the "Poor" category (class "0.0") and reached an overall validation accuracy of **60%**, but it struggled with minority classes, particularly "Excellent" (class "3.0"), where precision and recall were both **0%** on validation and no higher than **20%** on test. Test accuracy was lower at **55%**, indicating limited generalization.

#### **New Insights Gained**
1. **Imbalanced Data Challenge**:
   - The model performed well for the majority "Poor" class but failed to generalize predictions for smaller classes, highlighting the need to address imbalanced data distribution.

2. **Feature Relevance**:
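   - Features like `previous_gpa` (correlation 0.688) and `social_media_impact` (correlation 0.465) showed the strongest correlations with the target, which suggests strong predictive power. Because the model was trained on PCA components, their individual contribution was not measured directly.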

3. **Algorithm Limitations**:
   - Logistic Regression's linear assumptions limited its ability to capture complex patterns, especially for diverse performance categories.

#### **Rationale for Pursuing Further Experimentation**
The current approach shows promise but requires refinement. The model aligns with the business objective of identifying at-risk students. Further experimentation is justified because incremental improvements—such as addressing class imbalance and exploring nonlinear models—can significantly boost predictive accuracy.

---

### **Potential Next Steps and Experiments**
1. **Address Class Imbalance**
   - **Technique**: Apply SMOTE (Synthetic Minority Over-sampling Technique) or class-weight adjustments.
   - **Expected Uplift**: Higher recall and precision for minority categories, particularly "Excellent."
   - **Priority**: **High**.

2. **Explore Nonlinear Algorithms**
   - **Technique**: Test Random Forests or XGBoost, which capture nonlinear relationships and feature interactions and can be combined with class weighting.
   - **Expected Uplift**: Improved accuracy across all categories, especially for minority classes.
   - **Priority**: **High**.

3. **Advanced Feature Engineering**
   - **Technique**: Add interaction terms or polynomial features to capture non-linear relationships.
   - **Expected Uplift**: Enhanced model performance by incorporating complex patterns.
   - **Priority**: **Medium**.

4. **Hyperparameter Tuning**
   - **Technique**: Use grid search or Bayesian optimization to identify optimal settings for Logistic Regression or new models.
   - **Expected Uplift**: Fine-tuned performance improvements in predictive metrics.
   - **Priority**: **Medium**.

5. **Dimensionality Reduction**
   - **Technique**: PCA retaining 90% of the variance was already applied in this experiment; compare it against other variance thresholds and against a model trained without PCA.
   - **Expected Uplift**: Clarity on whether PCA helps or hurts generalization to test data.
   - **Priority**: **Low**.

---

### **Recommendation for Deployment**
If further experimentation achieves an accuracy of **≥75%** and balanced metrics across all categories, the model can be deployed. Deployment steps include:
1. **Monitoring System**: Establish a framework to track predictions and performance metrics in real-time.
2. **Documentation**: Provide guidelines for interpreting predictions and integrating them into academic workflows.
3. **Stakeholder Training**: Ensure staff understand how to utilize the model for targeted interventions.


"""

In [38]:
# Do not modify this code
print_tile(size="h2", key='experiment_results_explanations', value=experiment_results_explanations)
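
Several of the next steps above (class weighting, hyperparameter search, and a PCA comparison) could be combined in one cross-validated pipeline. The following is a rough sketch under stated assumptions, not a recommended final configuration: the data is synthetic and generated with `make_classification`, and the parameter grid values are arbitrary starting points.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 4-class data standing in for the student dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=4,
    weights=[0.55, 0.30, 0.10, 0.05], random_state=42,
)

# Scaling and PCA live inside the pipeline, so each fold fits them on its own training part
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.90)),
    ("clf", LogisticRegression(max_iter=500)),
])

# Arbitrary illustrative grid: regularization strength and class weighting
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "clf__class_weight": [None, "balanced"],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="f1_macro",  # macro F1 gives minority classes equal weight
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best macro F1:", round(search.best_score_, 3))
```

Scoring on macro F1 rather than accuracy keeps the search focused on the minority classes that the current model misses, and stratified folds keep each class represented in every split.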