# Part 1: Student Dropout Prediction

## 1. Problem Definition

**Objectives:**
1. **Early Identification:** To identify students at high risk of dropping out early in their academic journey.
2. **Intervention Targeting:** To enable targeted interventions and support programs for at-risk students.
3. **Resource Optimization:** To optimize educational resources by focusing on students who need them most, improving overall student retention rates.

**Stakeholders:**
1. **Educational Institutions:** Universities, colleges, and schools that benefit from improved retention rates, reputation, and efficient resource allocation.
2. **Students:** Individuals who receive timely support, potentially preventing academic failure and improving their educational outcomes.

**KPI (Key Performance Indicator):**
1. **Student Retention Rate:** The percentage of students who continue their studies from one academic period to the next. An increase in this rate would indicate the success of the prediction model and subsequent interventions.
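
As a minimal sketch of how this KPI could be computed, the hypothetical helper below compares the student IDs enrolled in one academic period with those enrolled in the next. The function name and the sample IDs are illustrative assumptions and do not come from the dataset.

```python
def retention_rate(enrolled_now, enrolled_next):
    """Share of students enrolled in one period who are still enrolled in the next."""
    enrolled_now = set(enrolled_now)
    if not enrolled_now:
        raise ValueError("no students enrolled in the starting period")
    retained = enrolled_now & set(enrolled_next)
    return len(retained) / len(enrolled_now)


# Illustrative IDs only
fall = ["s1", "s2", "s3", "s4"]
spring = ["s1", "s2", "s4", "s9"]  # s9 is a new entrant and is not counted as retained

rate = retention_rate(fall, spring)
print(f"Retention rate: {rate:.0%}")
```

Tracking this figure before and after the interventions are introduced gives a baseline against which any improvement can be judged.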

## 2. Data Collection & Preprocessing

**Data Sources:**
1.  **Student Information System (SIS):** Contains demographic data (age, gender, enrollment status), academic records (grades, attendance, course load), and financial aid information.
2.  **Learning Management System (LMS) Logs:** Provides behavioral data such as login frequency, assignment submission patterns, forum participation, and resource access times.

**Potential Bias:**
1.  **Historical Bias:** If past dropout data primarily reflects students from certain socioeconomic backgrounds or specific academic programs, the model might learn to disproportionately predict dropout for similar future students, even if their individual risk factors are low. This could lead to unfair targeting of interventions.

**Preprocessing Steps:**
1.  **Handling Missing Values:** Impute missing grades or attendance records using mean, median, or mode imputation, or more advanced techniques like K-nearest neighbors (KNN) imputation.
2.  **Feature Engineering:** Create new features such as 'grade point average (GPA)', 'attendance rate', 'change in course load', or 'engagement score' from LMS data to capture more predictive patterns.
3.  **Categorical Encoding:** Convert categorical features (e.g., 'major', 'enrollment status', 'financial aid type') into numerical representations using one-hot encoding or label encoding, as demonstrated in the code below for the target variable.
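
The imputation and encoding steps above can be sketched with a scikit-learn `ColumnTransformer`. This is only an illustration on a toy frame: the column names `gpa`, `attendance_rate`, and `major` are hypothetical and are not columns of the dataset loaded in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data with hypothetical columns and a few missing values
toy = pd.DataFrame({
    "gpa": [3.1, np.nan, 2.4, 3.8],
    "attendance_rate": [0.92, 0.75, np.nan, 0.98],
    "major": ["math", "biology", "math", "history"],
})

preprocess = ColumnTransformer(
    [
        # Median imputation for the numeric columns
        ("num", SimpleImputer(strategy="median"), ["gpa", "attendance_rate"]),
        # One-hot encoding for the categorical column
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["major"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(toy)
print(features.shape)  # 2 numeric columns + 3 one-hot columns
```

Fitting the transformer on the training split only, and then applying it to the validation and test splits, keeps imputation statistics from leaking across splits.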

In [1]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('../data/students dropout and academic success/dataset.csv')
df.head()

,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Nacionality,Mother's qualification,Father's qualification,Mother's occupation,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,8,5,2,1,1,1,13,10,6,...,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,6,1,11,1,1,1,1,3,4,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,5,1,1,1,22,27,10,...,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,8,2,15,1,1,1,23,27,6,...,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,12,1,3,0,1,1,22,28,10,...,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 35 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   Marital status                                  4424 non-null   int64  
 1   Application mode                                4424 non-null   int64  
 2   Application order                               4424 non-null   int64  
 3   Course                                          4424 non-null   int64  
 4   Daytime/evening attendance                      4424 non-null   int64  
 5   Previous qualification                          4424 non-null   int64  
 6   Nacionality                                     4424 non-null   int64  
 7   Mother's qualification                          4424 non-null   int64  
 8   Father's qualification                          4424 non-null   int64  
 9   Mother's occupation                      

In [3]:
df.describe()

,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Nacionality,Mother's qualification,Father's qualification,Mother's occupation,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
count,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,...,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0,4424.0
mean,1.178571,6.88698,1.727848,9.899186,0.890823,2.53142,1.254521,12.322107,16.455244,7.317812,...,0.137658,0.541817,6.232143,8.063291,4.435805,10.230206,0.150316,11.566139,1.228029,0.001969
std,0.605747,5.298964,1.313793,4.331792,0.311897,3.963707,1.748447,9.026251,11.0448,3.997828,...,0.69088,1.918546,2.195951,3.947951,3.014764,5.210808,0.753774,2.66385,1.382711,2.269935
min,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.6,-0.8,-4.06
25%,1.0,1.0,1.0,6.0,1.0,1.0,1.0,2.0,3.0,5.0,...,0.0,0.0,5.0,6.0,2.0,10.75,0.0,9.4,0.3,-1.7
50%,1.0,8.0,1.0,10.0,1.0,1.0,1.0,13.0,14.0,6.0,...,0.0,0.0,6.0,8.0,5.0,12.2,0.0,11.1,1.4,0.32
75%,1.0,12.0,2.0,13.0,1.0,1.0,1.0,22.0,27.0,10.0,...,0.0,0.0,7.0,10.0,6.0,13.333333,0.0,13.9,2.6,1.79
max,6.0,18.0,9.0,17.0,1.0,17.0,21.0,29.0,34.0,32.0,...,12.0,19.0,23.0,33.0,20.0,18.571429,12.0,16.2,3.7,3.51


## 3. Model Development

**Model Choice & Justification:**
For predicting student dropout, a **Random Forest Classifier** is chosen. It is an ensemble method that trains many decision trees on bootstrap samples of the training data and combines their outputs to predict a class (for regression, it would average the trees' predictions). Combining many decorrelated trees makes it less prone to overfitting than a single decision tree, it copes well with a large number of features, and its feature importance scores indicate which inputs drive the predictions. The scikit-learn implementation expects numerical inputs, which suits this dataset because its attributes are already stored as integer codes or floats.
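
To illustrate the feature importance scores mentioned above, the sketch below fits a small forest on synthetic data from `make_classification` and ranks the features by `feature_importances_`. The data and the variable names are stand-ins, not the student dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 6 features, of which 3 carry signal
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1
ranked = sorted(
    enumerate(forest.feature_importances_), key=lambda kv: kv[1], reverse=True
)
for index, importance in ranked:
    print(f"feature {index}: {importance:.3f}")
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than a definitive measure.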

**Train/Validation/Test Split:**
The dataset would ideally be split into three sets (the code in this notebook uses a simpler 80/20 train/test split via `test_size=0.2`):
1.  **Training Set (70%):** Used to train the model. The model learns patterns and relationships from this data.
2.  **Validation Set (15%):** Used for hyperparameter tuning and model selection. It helps in evaluating different model configurations and preventing overfitting to the training data.
3.  **Test Set (15%):** Held out completely until the final model evaluation. It provides an unbiased assessment of the model's performance on unseen data.
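
A 70/15/15 split can be obtained with two calls to `train_test_split`. The sketch below uses synthetic arrays as a stand-in for the features and target, and the names `X_demo`, `y_demo`, and `n_holdout` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and a three-class target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.integers(0, 3, size=1000)

# 15% of the rows for each held-out set, expressed as an absolute count
n_holdout = round(0.15 * len(X_demo))

# First carve off the test set, then take the validation set from the remainder
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=n_holdout, stratify=y_demo, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=n_holdout, stratify=y_rest, random_state=42
)

print(len(X_tr), len(X_val), len(X_te))
```

Passing `stratify` keeps the class proportions similar in all three sets, which matters when one class is much rarer than the others.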

**Hyperparameters to Tune:**
1.  **`n_estimators` (Number of Trees):** Controls the number of decision trees in the forest. More trees generally make predictions more stable, with diminishing returns, while training and prediction cost keep growing. Tuning finds the point where additional trees stop paying off.
2.  **`max_depth` (Maximum Depth of Trees):** Limits the maximum depth of each decision tree. This helps control overfitting; deeper trees can capture more complex patterns but are more prone to overfitting. Finding the right depth prevents the model from becoming too specific to the training data.
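
One way to tune these two hyperparameters is a small grid search. Note that `GridSearchCV` uses cross-validation on the training data rather than a single validation set, so this sketch is an alternative to the split described above, not the notebook's own procedure. The synthetic data, the grid values, and the `f1_macro` scoring choice are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Illustrative grid; real ranges depend on the data and the compute budget
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",  # weights every class equally
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After the search, `search.best_estimator_` holds a model refit on all of the data passed to `fit`, ready for a final check on the held-out test set.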

In [4]:
from sklearn.preprocessing import LabelEncoder

# Encode the target variable
le = LabelEncoder()
df['Target'] = le.fit_transform(df['Target'])
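
`LabelEncoder` assigns integer codes in sorted label order, which determines how the numeric classes in the later classification report should be read. The sketch below shows the mapping on a few sample labels. The data preview only shows `Dropout` and `Graduate`, so the third label, `Enrolled`, is an assumption here; on the real data, `le.classes_` gives the actual order.

```python
from sklearn.preprocessing import LabelEncoder

# Sample labels; "Enrolled" is an assumed third class
demo_le = LabelEncoder()
codes = demo_le.fit_transform(["Graduate", "Dropout", "Enrolled", "Dropout"])

# classes_ is sorted, and each label's position is its integer code
mapping = {label: code for code, label in enumerate(demo_le.classes_)}
print(mapping)
print(list(codes))
```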

In [5]:
from sklearn.model_selection import train_test_split

# Split the data
X = df.drop('Target', axis=1)
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
from sklearn.ensemble import RandomForestClassifier

# Train the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

Parameter,Value
n_estimators,100
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


## 4. Evaluation & Deployment

**Evaluation Metrics:**
1.  **F1-Score:** The harmonic mean of precision and recall, so it is high only when both are high. It is particularly useful for imbalanced datasets (which dropout prediction often is): reported per class or macro-averaged, it exposes a model that performs well on the majority class but poorly on a minority class, which overall accuracy can hide.
2.  **Recall (Sensitivity):** Measures the proportion of actual positive cases (students who *did* drop out) that the model correctly identified. High recall is crucial in dropout prediction so that as many at-risk students as possible are identified for intervention. Because the target here has more than two classes, the relevant figure is the recall of the dropout class, which is class 0 under `LabelEncoder`'s sorted ordering, assuming no other label sorts before `Dropout`.
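
As a worked example of the harmonic mean, the sketch below recomputes the per-class and macro F1 scores from the precision and recall values in this notebook's classification report. The helper name `f1` is illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the classification report
reported = {0: (0.84, 0.76), 1: (0.53, 0.32), 2: (0.78, 0.94)}

per_class = {label: f1(p, r) for label, (p, r) in reported.items()}
macro_f1 = sum(per_class.values()) / len(per_class)

for label, score in per_class.items():
    print(label, round(score, 2))
print("macro", round(macro_f1, 2))
```

The rounded results agree with the report's f1-score column (0.80, 0.40, 0.85, macro 0.68). Class 1 shows how a low recall of 0.32 pulls F1 well below the precision of 0.53.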

**Concept Drift:**
Concept drift occurs when the relationship between the input features and the target (here, dropout) changes over time, so that patterns learned from historical data no longer hold. In student dropout prediction, this could result from:
*   **Changes in Educational Policies:** New admission criteria, curriculum changes, or support programs might alter student behavior and dropout patterns.
*   **Socioeconomic Shifts:** Economic downturns or changes in job markets could influence students' decisions to continue or discontinue their studies.
*   **Demographic Changes:** A shift in the student population's demographics (e.g., more international students, older students) could introduce new patterns not seen in the training data.
If concept drift occurs, the model's performance will degrade over time, requiring retraining or adaptation.
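
One simple monitoring signal is the Population Stability Index (PSI), which compares the distribution of a feature in the training data with its distribution in newer data. PSI detects shifts in the inputs, which often accompany concept drift but are not the same thing, so tracking recall on newly labeled cohorts remains the more direct check. The sketch below is a plain-Python version using equal-width bins. The simulated grade values and the choice of 10 bins are assumptions.

```python
import math
import random


def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Simulated semester grades: one stable cohort and one shifted cohort
random.seed(0)
train = [random.gauss(12, 2) for _ in range(2000)]
same = [random.gauss(12, 2) for _ in range(2000)]
shifted = [random.gauss(14, 2) for _ in range(2000)]

print(round(psi(train, same), 3), round(psi(train, shifted), 3))
```

A common rule of thumb treats a PSI above roughly 0.25 as a shift worth investigating, though any threshold should be validated against the institution's own data.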

**Deployment Challenge:**
1.  **Ethical Considerations and Bias Mitigation:** A significant challenge is ensuring the model is fair and does not perpetuate or amplify existing biases. For example, if the training data disproportionately represents certain demographic groups as "at-risk," the model might unfairly target students from those groups. Addressing this requires continuous monitoring for disparate impact, implementing fairness metrics, and potentially using bias mitigation techniques during preprocessing or post-processing.
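
One basic fairness check is to compare dropout-class recall across demographic groups, since a large gap means at-risk students in one group are missed more often. The sketch below uses made-up labels and groups. The helper name, the group names `A` and `B`, and the assumption that dropout is encoded as 0 are all illustrative.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups, positive=0):
    """Recall for the positive class, computed separately for each group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == positive:
            totals[group] += 1
            hits[group] += int(pred == positive)
    return {g: hits[g] / totals[g] for g in totals}


# Illustrative labels only; 0 is assumed to mean dropout
y_true = [0, 0, 0, 0, 1, 0, 0, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2]
groups = ["A", "A", "A", "B", "B", "B", "B", "A"]

result = recall_by_group(y_true, y_pred, groups)
for group, value in sorted(result.items()):
    print(group, round(value, 2))
```

In this toy example the model finds two of three dropouts in group A but only one of three in group B, the kind of gap that would warrant a closer look before deployment.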

**Model Performance (Test Set Results):**
```
Accuracy: 0.7728813559322034
Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.76      0.80       316
           1       0.53      0.32      0.40       151
           2       0.78      0.94      0.85       418

    accuracy                           0.77       885
   macro avg       0.72      0.68      0.68       885
weighted avg       0.76      0.77      0.76       885
```

In [7]:
from sklearn.metrics import accuracy_score, classification_report

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print('Classification Report:')
print(report)

