# Introduction

This notebook aims to evaluate the performance of various classification algorithms, fine-tune their hyperparameters, and select the best-performing model. Given the imbalanced nature of the dataset, class weighting (`class_weight='balanced'`) will be applied to enhance model performance.

The algorithms to be evaluated include:

- Logistic Regression
- Decision Tree
- Random Forest

Each model will undergo:
1. Baseline training with default hyperparameters.
2. Hyperparameter tuning to optimize performance.
3. Validation using metrics such as AUC to identify overfitting and select the final model.

## Important Note:
**(Strongly correlated features)**

During initial tests, we observed the model achieved high performance due to features strongly correlated with the target variable, such as economic participation and hours worked, reflecting real-world factors influencing school dropout rates. To better evaluate model performance and for illustrative purposes in this exercise, we will exclude the most predictive features, allowing us to practice model selection, training, and tuning without their dominant influence. The following cells will demonstrate this condition and make the aforementioned adjustments.

Libraries

In [1]:
import pandas as pd
import numpy as np
import warnings
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.exceptions import ConvergenceWarning

# Suppress ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

### Base Model

#### Logistic Regression with default parameters

Load data

In [2]:
# Load data
df = pd.read_csv(r'G:\Mi unidad\###_ ML Zoomcamp 2024\enape_post_eda.csv')  # raw string keeps the backslashes literal

Split dataset (60/20/20)

In [3]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.dropout.values
y_val = df_val.dropout.values
y_test = df_test.dropout.values

del df_train['dropout']
del df_val['dropout']
del df_test['dropout']

In [4]:
print(df_train.shape)
print(df_val.shape)
print(df_test.shape)

(11983, 41)
(3995, 41)
(3995, 41)
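
The 60/20/20 proportions follow from applying `test_size=0.25` to the 80% that remains after the first split (0.8 × 0.25 = 0.2). The sketch below reproduces the row counts printed above with plain arithmetic. It assumes scikit-learn rounds a fractional test size up, and the total of 19,973 rows is inferred from the three shapes rather than stated in the notebook.

```python
import math

# Total row count, inferred by summing the three shapes printed above.
n_total = 11983 + 3995 + 3995

# First split: 20% held out as the test set (fraction assumed to round up).
n_test = math.ceil(0.2 * n_total)
n_full_train = n_total - n_test

# Second split: 25% of the remaining 80% becomes the validation set.
n_val = math.ceil(0.25 * n_full_train)
n_train = n_full_train - n_val

print(n_train, n_val, n_test)
print(round(n_train / n_total, 3), round(n_val / n_total, 3), round(n_test / n_total, 3))
```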


Train model

In [5]:
dv = DictVectorizer(sparse=False)

train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

Check accuracy

In [6]:
val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

y_pred = model.predict_proba(X_val)[:, 1]
dropout_prediction = (y_pred >= 0.5)
(y_val == dropout_prediction).mean()

np.float64(0.9987484355444305)

In [7]:
report = classification_report(y_val, dropout_prediction)
print(report)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3907
           1       1.00      0.94      0.97        88

    accuracy                           1.00      3995
   macro avg       1.00      0.97      0.99      3995
weighted avg       1.00      1.00      1.00      3995
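
Accuracy alone says little here because only 88 of the 3,995 validation rows belong to the dropout class. As a point of reference, the sketch below computes the accuracy of a hypothetical classifier that always predicts class 0, using the support counts from the report above.

```python
# Support counts taken from the validation classification report above.
n_negative = 3907  # class 0 (no dropout)
n_positive = 88    # class 1 (dropout)

# A hypothetical "always predict 0" classifier gets every negative right
# and every positive wrong.
baseline_accuracy = n_negative / (n_negative + n_positive)
baseline_recall = 0 / n_positive

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline recall:   {baseline_recall:.4f}")
```

Any accuracy figure should be read against this baseline of roughly 0.978, which is why recall for class 1 is the more informative number.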



In [8]:
roc_auc_score(y_val,dropout_prediction).round(3)

np.float64(0.972)
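
The cell above passes the thresholded labels to `roc_auc_score`, not the predicted probabilities. With hard 0/1 labels the ROC curve has a single interior point, so the AUC reduces to the mean of sensitivity and specificity, which is consistent with the 0.972 shown above (recall 0.94 for class 1 and about 1.00 for class 0). The toy example below illustrates the difference with a small hand-written pairwise AUC. The data and the helper `pairwise_auc` are made up for illustration and are not part of the notebook.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (illustrative only).
y_true = [0, 0, 0, 0, 1, 1, 1]
proba = [0.1, 0.2, 0.6, 0.3, 0.4, 0.7, 0.9]
hard = [int(p >= 0.5) for p in proba]

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_labels = pairwise_auc(y_true, hard)

# Sensitivity and specificity of the thresholded predictions.
tpr = sum(h == 1 for y, h in zip(y_true, hard) if y == 1) / y_true.count(1)
tnr = sum(h == 0 for y, h in zip(y_true, hard) if y == 0) / y_true.count(0)

print(round(auc_from_proba, 3), round(auc_from_labels, 3), round((tpr + tnr) / 2, 3))
```

In the notebook, passing `y_pred` (the probabilities) instead of `dropout_prediction` would give the threshold-independent AUC.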

Scores this high could be a sign of overfitting, so we'll check the test split to see how the model handles new data.


In [9]:
test_dict = df_test.to_dict(orient='records')
X_test = dv.transform(test_dict)

y_pred = model.predict_proba(X_test)[:, 1]
dropout_prediction = (y_pred >= 0.5)
(y_test == dropout_prediction).mean()

np.float64(0.9987484355444305)

In [10]:
report = classification_report(y_test, dropout_prediction)
print(report)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3931
           1       0.98      0.94      0.96        64

    accuracy                           1.00      3995
   macro avg       0.99      0.97      0.98      3995
weighted avg       1.00      1.00      1.00      3995



In [11]:
roc_auc_score(y_test,dropout_prediction).round(3)

np.float64(0.969)

The model performs well on new data, indicating that it is generalizing effectively. Therefore, we can proceed with a cross-validation exercise.

Additionally, we know (as stated in the [Correlation Analysis with the Target Variable](https://github.com/Maxkaizo/---_-ML-Zoomcamp-2024/blob/main/2_eda.ipynb)) that certain features, such as economic participation, economic consequences, and work hours, have a strong influence on dropout rates. To further illustrate the impact of feature selection, we will remove some of these features in the next steps.

#### Cross Validation

In [12]:
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')

mean_accuracy = scores.mean()
std_accuracy = np.std(scores)

print(scores)
print("Cross-Validation Accuracy (mean):", mean_accuracy)
print("Cross-Validation Accuracy (std):", std_accuracy)


[0.99833125 0.99666249 0.99791406 0.99791319 0.99958264]
Cross-Validation Accuracy (mean): 0.9980807255591471
Cross-Validation Accuracy (std): 0.0009365603198218517
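
Because accuracy is dominated by the majority class, the same cross-validation can also be scored on metrics that focus on the minority class. The sketch below runs on synthetic data from `make_classification` (an assumption, since the ENAPE file is not available here) and uses stratified folds so each fold keeps roughly the same share of positives.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the real data (about 5% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=1
)

model = LogisticRegression(max_iter=1000, class_weight="balanced")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Score the same folds on two minority-aware metrics.
recall_scores = cross_val_score(model, X, y, cv=cv, scoring="recall")
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("recall per fold:", recall_scores.round(3))
print("ROC AUC per fold:", auc_scores.round(3))
```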


#### Base Model Conclusions

The model performs exceptionally well, mostly due to the presence of highly correlated features. After testing it with cross-validation and new data, we can conclude that it generalizes effectively. This aligns with the real-world scenario, where it is intuitive to expect that students with excessive work hours, high economic participation, and significant economic consequences are more likely to drop out.

#### Feature exclusion

As stated before, we'll exclude some features in order to run the model selection and fine-tuning process.

In [13]:
# Load data
df = pd.read_csv(r'G:\Mi unidad\###_ ML Zoomcamp 2024\enape_post_eda.csv')  # raw string keeps the backslashes literal

In [14]:
columns_to_drop = [
    'em_hw_projects',
    'em_tests',
    #'economic_participation',
    #'economic_consequences'
]
# `columns=` already targets columns, so `axis=1` is not needed
df_full_train = df_full_train.drop(columns=columns_to_drop)
df_train = df_train.drop(columns=columns_to_drop)
df_val = df_val.drop(columns=columns_to_drop)
df_test = df_test.drop(columns=columns_to_drop)

In [15]:
print(df_train.shape)
print(df_val.shape)
print(df_test.shape)

(11983, 39)
(3995, 39)
(3995, 39)


# Model Selection

## Logistic Regression

### Define Functions (Train, Predict, Evaluation Metrics)

Train function

In [16]:
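def train(df_train, y_train, C=1.0, cw=None):
    # Named `train_dicts` to avoid shadowing the built-in `dict`
    train_dicts = df_train.to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(train_dicts)

    # C is the inverse regularization strength; cw is passed to class_weight
    model = LogisticRegression(C=C, max_iter=1000, class_weight=cw)
    model.fit(X_train, y_train)

    return dv, model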

Predict Function

In [17]:
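def predict(df, dv, model, t):
    # Named `records` to avoid shadowing the built-in `dict`
    records = df.to_dict(orient='records')

    X = dv.transform(records)
    y_pred = model.predict_proba(X)[:, 1]  # probability of class 1 (dropout)
    dropout_prediction = (y_pred >= t)     # hard label at threshold t

    return y_pred, dropout_prediction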

In [18]:
def eval_metrics(y, dropout_prediction):

    gral_accuracy = (y == dropout_prediction).mean()
    report = classification_report(y, dropout_prediction, output_dict=True)
    # AUC is computed on the thresholded labels, so it varies with the threshold
    auc = roc_auc_score(y, dropout_prediction).round(3)

    # Keep only class 1 from the report
    class_1_report = report["1"]

    metrics_dict = {
        "gral_accuracy": gral_accuracy,
        "class_1_report": class_1_report,
        "auc": auc
    }

    return metrics_dict

### Base (Poor) Performance

In [19]:
dv, model = train(df_train, y_train, C=1.0,cw=None)
y_pred, dropout_prediction = predict(df_val, dv, model,t=0.5)
eval_metrics(y_val, dropout_prediction)

{'gral_accuracy': np.float64(0.9799749687108886),
 'class_1_report': {'precision': 0.6666666666666666,
  'recall': 0.18181818181818182,
  'f1-score': 0.2857142857142857,
  'support': 88.0},
 'auc': np.float64(0.59)}

This model performs poorly. Its accuracy looks good, but our main metric is recall, which is too low (0.18), and the AUC of 0.59 is barely better than that of a random model.
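
The gap between accuracy and recall is easier to see from the confusion-matrix counts. The counts below are not printed in the notebook; they are inferred from the reported precision (2/3), recall (16/88), and support, and are used here only to show how the metrics relate.

```python
# Confusion-matrix counts inferred from the report above (assumption).
tp, fp, fn, tn = 16, 8, 72, 3899

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

# With hard 0/1 predictions, the ROC AUC equals the mean of the
# true-positive rate and the true-negative rate.
auc_hard = (recall + tn / (tn + fp)) / 2

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"accuracy={accuracy:.4f} auc={auc_hard:.3f}")
```

Only 80 of the 3,995 rows are misclassified, which keeps accuracy near 0.98, yet 72 of those errors are missed dropouts.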

### Class Weight Balancing

Our first improvement is to apply balanced class weights. This technique deals with imbalance by giving a higher weight to the minority class, in this case the dropout class.

In [20]:
dv, model = train(df_train, y_train, C=1.0,cw="balanced")
y_pred, dropout_prediction = predict(df_val, dv, model,t=0.5)
eval_metrics(y_val, dropout_prediction)

{'gral_accuracy': np.float64(0.8578222778473091),
 'class_1_report': {'precision': 0.1261682242990654,
  'recall': 0.9204545454545454,
  'f1-score': 0.2219178082191781,
  'support': 88.0},
 'auc': np.float64(0.888)}

With this change, we see a large improvement in recall and AUC, but also a large drop in precision. We'll keep tuning.
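
For reference, scikit-learn documents the `balanced` weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula by hand. The class counts are borrowed from the validation report for illustration, since the training-set counts are not shown in the notebook.

```python
# Class counts used for illustration (validation support; the training
# counts are not shown in the notebook).
class_counts = {0: 3907, 1: 88}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Same formula scikit-learn documents for class_weight="balanced".
weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

print({label: round(w, 3) for label, w in weights.items()})
```

With these counts, each dropout row weighs roughly 44 times as much as a non-dropout row in the loss, which is in line with the jump in recall and the drop in precision.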

### Hyperparameter Selection

The model now shows much better recall and AUC but poor precision, so we'll look for a better balance by:
- Trying some regularization values
- Adjusting the decision threshold
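
Moving the decision threshold trades recall for precision without retraining the model. A minimal, self-contained illustration with made-up scores (not the notebook's predictions):

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall for class 1 when predicting 1 if score >= threshold."""
    predicted = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, y_true))
    fp = sum(p == 1 and y == 0 for p, y in zip(predicted, y_true))
    fn = sum(p == 0 and y == 1 for p, y in zip(predicted, y_true))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels and scores (illustrative only).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.75, 0.8, 0.95]

results = {t: precision_recall_at(y_true, scores, t) for t in [0.3, 0.5, 0.7, 0.9]}
for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold can only keep or lower recall, while precision usually rises, so the grid below looks for the point where both are acceptable.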


In [21]:
c_values = [1.0, 0.01, 0.001, 0.0001, 0.00001]
thresholds = [0.3, 0.5, 0.7, 0.8, 0.9]

eval_metrics_dict = []

for c in c_values:
    # The threshold does not affect fitting, so train once per C
    dv, model = train(df_train, y_train, C=c, cw="balanced")
    for t in thresholds:
        y_pred, dropout_prediction = predict(df_val, dv, model, t=t)

        # Evaluate the metrics
        eval_metrics_result = eval_metrics(y_val, dropout_prediction)

        # Store the results in the list
        eval_metrics_dict.append({
            "C": c,
            "Threshold": t,
            "result": eval_metrics_result
        })

In [22]:
rows = []
for item in eval_metrics_dict:
    c_value = item['C']
    threshold = item['Threshold']
    gral_accuracy = item['result']['gral_accuracy']
    auc = item['result']['auc']
    class_1_report = item['result']['class_1_report']
    rows.append({
        "C": c_value,
        "Threshold": threshold,
        "gral_accuracy": gral_accuracy,
        "auc": auc,
        "precision": class_1_report['precision'],
        "recall": class_1_report['recall'],
        "f1_score": class_1_report['f1-score'],
        "support": class_1_report['support']
    })

# Build the DataFrame
df_results = pd.DataFrame(rows)

In [23]:
df_results[df_results['recall'] > 0.75].sort_values(by='f1_score', ascending=False).head(5)

|    | C | Threshold | gral_accuracy | auc | precision | recall | f1_score | support |
|---|---|---|---|---|---|---|---|---|
| 2 | 1.0 | 0.7 | 0.915394 | 0.857 | 0.179487 | 0.795455 | 0.292887 | 88.0 |
| 1 | 1.0 | 0.5 | 0.857822 | 0.888 | 0.126168 | 0.920455 | 0.221918 | 88.0 |
| 6 | 0.01 | 0.5 | 0.848561 | 0.839 | 0.110106 | 0.829545 | 0.194407 | 88.0 |
| 11 | 0.001 | 0.5 | 0.831039 | 0.819 | 0.097394 | 0.806818 | 0.173807 | 88.0 |
| 0 | 1.0 | 0.3 | 0.796496 | 0.868 | 0.093154 | 0.943182 | 0.169561 | 88.0 |


We'll select the top 2 combinations as candidates.

### SMOTE Testing

Next, we'll try another technique for dealing with the imbalanced dataset: SMOTE (Synthetic Minority Oversampling Technique), which generates synthetic samples for the minority class. It interpolates new samples between existing minority-class instances rather than simply duplicating them, which yields a more balanced training set and can improve performance with less risk of overfitting than plain duplication.
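
The interpolation step can be sketched in a few lines. This is a simplified, hand-rolled illustration of the idea, not imblearn's implementation: it picks the single nearest minority neighbor instead of sampling among k neighbors, and the points are made up.

```python
import random

def synthesize(point, neighbor, rng):
    """Return a new point at a random position on the segment point -> neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]

def nearest_neighbor(point, candidates):
    """Closest other candidate by squared Euclidean distance."""
    others = [c for c in candidates if c is not point]
    return min(others, key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)))

rng = random.Random(42)

# Toy minority-class samples with two features (illustrative only).
minority = [[1.0, 2.0], [1.5, 2.5], [3.0, 1.0]]

base = minority[0]
neighbor = nearest_neighbor(base, minority)
new_sample = synthesize(base, neighbor, rng)

print("base:", base, "neighbor:", neighbor, "synthetic:", new_sample)
```

The synthetic point always lies between two real minority samples, so it adds variety without copying an existing row.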

In [24]:
from imblearn.over_sampling import SMOTE

In [25]:
def train_wsmote(df_train, y_train, C=1.0, cw=None):
    # Named `train_dicts` to avoid shadowing the built-in `dict`
    train_dicts = df_train.to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(train_dicts)

    # Oversample the minority class in the training data only
    smote = SMOTE(random_state=42)
    X_train, y_train = smote.fit_resample(X_train, y_train)

    model = LogisticRegression(C=C, max_iter=1000, class_weight=cw)
    model.fit(X_train, y_train)

    return dv, model

In [26]:
c_values = [1.0, 0.01, 0.001, 0.0001, 0.00001]
thresholds = [0.3, 0.5, 0.7, 0.8, 0.9]

eval_metrics_dict = []

for c in c_values:
    # As we're using SMOTE, in this case we won't use class weights.
    # The threshold does not affect fitting, so train once per C.
    dv, model = train_wsmote(df_train, y_train, C=c, cw=None)
    for t in thresholds:
        y_pred, dropout_prediction = predict(df_val, dv, model, t=t)

        eval_metrics_result = eval_metrics(y_val, dropout_prediction)

        eval_metrics_dict.append({
            "C": c,
            "Threshold": t,
            "result": eval_metrics_result
        })

In [27]:
rows = []
for item in eval_metrics_dict:
    c_value = item['C']
    threshold = item['Threshold']
    gral_accuracy = item['result']['gral_accuracy']
    auc = item['result']['auc']
    class_1_report = item['result']['class_1_report']
    rows.append({
        "C": c_value,
        "Threshold": threshold,
        "gral_accuracy": gral_accuracy,
        "auc": auc,
        "precision": class_1_report['precision'],
        "recall": class_1_report['recall'],
        "f1_score": class_1_report['f1-score'],
        "support": class_1_report['support']
    })

df_results_wsmote = pd.DataFrame(rows)

In [28]:
df_results[df_results['recall'] > 0.75].sort_values(by='f1_score', ascending=False).head(5)

|    | C | Threshold | gral_accuracy | auc | precision | recall | f1_score | support |
|---|---|---|---|---|---|---|---|---|
| 2 | 1.0 | 0.7 | 0.915394 | 0.857 | 0.179487 | 0.795455 | 0.292887 | 88.0 |
| 1 | 1.0 | 0.5 | 0.857822 | 0.888 | 0.126168 | 0.920455 | 0.221918 | 88.0 |
| 6 | 0.01 | 0.5 | 0.848561 | 0.839 | 0.110106 | 0.829545 | 0.194407 | 88.0 |
| 11 | 0.001 | 0.5 | 0.831039 | 0.819 | 0.097394 | 0.806818 | 0.173807 | 88.0 |
| 0 | 1.0 | 0.3 | 0.796496 | 0.868 | 0.093154 | 0.943182 | 0.169561 | 88.0 |


In [29]:
df_results_wsmote[df_results_wsmote['recall'] > 0.75].sort_values(by='f1_score', ascending=False).head(5)

|    | C | Threshold | gral_accuracy | auc | precision | recall | f1_score | support |
|---|---|---|---|---|---|---|---|---|
| 2 | 1.0 | 0.7 | 0.922904 | 0.861 | 0.194444 | 0.795455 | 0.3125 | 88.0 |
| 1 | 1.0 | 0.5 | 0.872591 | 0.852 | 0.128748 | 0.829545 | 0.222901 | 88.0 |
| 6 | 0.01 | 0.5 | 0.866583 | 0.854 | 0.124789 | 0.840909 | 0.217327 | 88.0 |
| 11 | 0.001 | 0.5 | 0.850563 | 0.829 | 0.109063 | 0.806818 | 0.192152 | 88.0 |
| 0 | 1.0 | 0.3 | 0.821777 | 0.876 | 0.104061 | 0.931818 | 0.187215 | 88.0 |


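This indicates that SMOTE makes the model slightly more precise in its predictions for the minority class (fewer false positives): at C=1.0 and threshold 0.7, precision rises from 0.179 to 0.194 while recall stays at 0.795. At threshold 0.5, however, recall is lower with SMOTE (0.830 versus 0.920).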

## Decision Tree Classifier

For the Decision Tree, we'll tune these hyperparameters:

- Depth
- Minimum samples per leaf
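
Both hyperparameters limit how far the tree can specialize: `max_depth` caps the number of successive splits, and `min_samples_leaf` rejects splits that would leave fewer than that many training rows in a leaf. A minimal sketch on synthetic data (an assumption, standing in for the ENAPE features) that checks both constraints on a fitted tree:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real data.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=1
)

dt = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=20, class_weight="balanced", random_state=1
)
dt.fit(X, y)

# Leaves are the nodes without children (children_left == -1).
is_leaf = dt.tree_.children_left == -1
leaf_sizes = dt.tree_.n_node_samples[is_leaf]

print("depth:", dt.get_depth())
print("leaves:", dt.get_n_leaves())
print("smallest leaf:", leaf_sizes.min())
```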

Import additional Library

In [31]:
from sklearn.tree import DecisionTreeClassifier

In [32]:
train_dicts = df_train.to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val.to_dict(orient='records')
X_val = dv.transform(val_dicts)

Now we'll iterate over several max depths and min samples

In [39]:
depths = [1, 2, 3, 4, 5, 6, 10, 15, 20, None]
samples = [1, 5, 10, 15, 20, 500, 100, 200]
dt_metrics_dict = []

for depth in depths:
    for s in samples:
        dt = DecisionTreeClassifier(max_depth=depth,class_weight="balanced",min_samples_leaf=s)
        dt.fit(X_train, y_train)
        
        y_pred = dt.predict_proba(X_val)[:, 1]
        dropout_prediction = (y_pred >= 0.5)
        
        gral_accuracy = (y_val == dropout_prediction).mean()
        report = classification_report(y_val, dropout_prediction, output_dict=True)
        auc = roc_auc_score(y_val, dropout_prediction)
        
        class_1_report = report["1"]
        
        dt_metrics_dict.append({
            "depth": depth,
            "min_samples": s,
            "gral_accuracy": gral_accuracy,
            "precision": class_1_report['precision'],
            "recall": class_1_report['recall'],
            "f1_score": class_1_report['f1-score'],
            "support": class_1_report['support'],
            "auc": auc
        })

dt_results = pd.DataFrame(dt_metrics_dict)

In [59]:
dt_results[dt_results['recall'] > 0.75].sort_values(by='precision', ascending=False).head(5)

|    | depth | min_samples | gral_accuracy | precision | recall | f1_score | support | auc |
|---|---|---|---|---|---|---|---|---|
| 16 | 3.0 | 1 | 0.875594 | 0.130199 | 0.818182 | 0.224649 | 88.0 | 0.847535 |
| 18 | 3.0 | 10 | 0.875594 | 0.130199 | 0.818182 | 0.224649 | 88.0 | 0.847535 |
| 17 | 3.0 | 5 | 0.875594 | 0.130199 | 0.818182 | 0.224649 | 88.0 | 0.847535 |
| 19 | 3.0 | 15 | 0.875594 | 0.130199 | 0.818182 | 0.224649 | 88.0 | 0.847535 |
| 20 | 3.0 | 20 | 0.875594 | 0.130199 | 0.818182 | 0.224649 | 88.0 | 0.847535 |


In [60]:
dt_results[dt_results['recall'] > 0.75].sort_values(by='f1_score', ascending=False).head(5)

|    | depth | min_samples | gral_accuracy | precision | recall | f1_score | support | auc |
|---|---|---|---|---|---|---|---|---|
| 16 | 3.0 | 1 | 0.875594 | 0.130199 | 0.818182 | 0.224649 | 88.0 | 0.847535 |
| 18 | 3.0 | 10 | 0.875594 | 0.130199 | 0.818182 | 0.224649 | 88.0 | 0.847535 |
| 17 | 3.0 | 5 | 0.875594 | 0.130199 | 0.818182 | 0.224649 | 88.0 | 0.847535 |
| 19 | 3.0 | 15 | 0.875594 | 0.130199 | 0.818182 | 0.224649 | 88.0 | 0.847535 |
| 20 | 3.0 | 20 | 0.875594 | 0.130199 | 0.818182 | 0.224649 | 88.0 | 0.847535 |


## Random Forest

For the Random Forest, we'll tune these hyperparameters:

- Number of estimators
- Tree depth
- Minimum samples per leaf

In [49]:
from sklearn.ensemble import RandomForestClassifier

In [57]:
estimators = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200]
depths = [None, 5, 10, 15, 20, 25, 30]
samples = [1, 5, 10, 15, 20, 500, 100, 200]
rf_metrics_dict = []

for n in estimators:
    for d in depths:
        for s in samples:
            rf = RandomForestClassifier(n_estimators=n, max_depth=d, min_samples_leaf=s, class_weight='balanced', random_state=1)
            
            rf.fit(X_train, y_train)
            
            y_pred = rf.predict_proba(X_val)[:, 1]
            dropout_prediction = (y_pred >= 0.5)
            
            gral_accuracy = (y_val == dropout_prediction).mean()
            report = classification_report(y_val, dropout_prediction, output_dict=True)
            auc = roc_auc_score(y_val, dropout_prediction)
            
            class_1_report = report["1"]
            
            rf_metrics_dict.append({
                "n_estimators": n,
                "max_depth": d,
                "min_samples": s,
                "gral_accuracy": gral_accuracy,
                "precision": class_1_report['precision'],
                "recall": class_1_report['recall'],
                "f1_score": class_1_report['f1-score'],
                "support": class_1_report['support'],
                "auc": auc
            })

rf_results = pd.DataFrame(rf_metrics_dict)

In [61]:
rf_results[rf_results['recall'] > 0.75].sort_values(by='f1_score', ascending=False).head(5)

|    | n_estimators | max_depth | min_samples | gral_accuracy | precision | recall | f1_score | support | auc |
|---|---|---|---|---|---|---|---|---|---|
| 470 | 90 | 10.0 | 100 | 0.902879 | 0.159091 | 0.795455 | 0.265152 | 88.0 | 0.850376 |
| 550 | 100 | 25.0 | 100 | 0.902378 | 0.158371 | 0.795455 | 0.264151 | 88.0 | 0.85012 |
| 558 | 100 | 30.0 | 100 | 0.902378 | 0.158371 | 0.795455 | 0.264151 | 88.0 | 0.85012 |
| 542 | 100 | 20.0 | 100 | 0.902378 | 0.158371 | 0.795455 | 0.264151 | 88.0 | 0.85012 |
| 510 | 100 | NaN | 100 | 0.902378 | 0.158371 | 0.795455 | 0.264151 | 88.0 | 0.85012 |


In [62]:
rf_results[rf_results['recall'] > 0.75].sort_values(by='precision', ascending=False).head(5)

|    | n_estimators | max_depth | min_samples | gral_accuracy | precision | recall | f1_score | support | auc |
|---|---|---|---|---|---|---|---|---|---|
| 470 | 90 | 10.0 | 100 | 0.902879 | 0.159091 | 0.795455 | 0.265152 | 88.0 | 0.850376 |
| 558 | 100 | 30.0 | 100 | 0.902378 | 0.158371 | 0.795455 | 0.264151 | 88.0 | 0.85012 |
| 542 | 100 | 20.0 | 100 | 0.902378 | 0.158371 | 0.795455 | 0.264151 | 88.0 | 0.85012 |
| 510 | 100 | NaN | 100 | 0.902378 | 0.158371 | 0.795455 | 0.264151 | 88.0 | 0.85012 |
| 534 | 100 | 15.0 | 100 | 0.902378 | 0.158371 | 0.795455 | 0.264151 | 88.0 | 0.85012 |
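
The row labels in these tables are positions in the grid, since the three nested loops append results in order. As a sketch, the same grid can be generated with `itertools.product`, which makes it easy to map a row label back to its settings:

```python
from itertools import product

estimators = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200]
depths = [None, 5, 10, 15, 20, 25, 30]
samples = [1, 5, 10, 15, 20, 500, 100, 200]

# Same iteration order as the nested for-loops above.
grid = list(product(estimators, depths, samples))

print(len(grid))   # number of forests trained
print(grid[470])   # settings behind row 470
print(grid[510])   # settings behind row 510
```

The grid holds 672 combinations, so every value added to one of the lists multiplies the training time.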


# Model Comparison

Now we'll compare our best-performing models and select one based on recall, while looking for a balance with F1-score and precision.

## Logistic Regression with class weighting

In [64]:
df_results[df_results['recall'] > 0.75].sort_values(by='f1_score', ascending=False).head(3)

|    | C | Threshold | gral_accuracy | auc | precision | recall | f1_score | support |
|---|---|---|---|---|---|---|---|---|
| 2 | 1.0 | 0.7 | 0.915394 | 0.857 | 0.179487 | 0.795455 | 0.292887 | 88.0 |
| 1 | 1.0 | 0.5 | 0.857822 | 0.888 | 0.126168 | 0.920455 | 0.221918 | 88.0 |
| 6 | 0.01 | 0.5 | 0.848561 | 0.839 | 0.110106 | 0.829545 | 0.194407 | 88.0 |


## Logistic Regression with SMOTE

In [65]:
df_results_wsmote[df_results_wsmote['recall'] > 0.75].sort_values(by='f1_score', ascending=False).head(3)

|    | C | Threshold | gral_accuracy | auc | precision | recall | f1_score | support |
|---|---|---|---|---|---|---|---|---|
| 2 | 1.0 | 0.7 | 0.922904 | 0.861 | 0.194444 | 0.795455 | 0.3125 | 88.0 |
| 1 | 1.0 | 0.5 | 0.872591 | 0.852 | 0.128748 | 0.829545 | 0.222901 | 88.0 |
| 6 | 0.01 | 0.5 | 0.866583 | 0.854 | 0.124789 | 0.840909 | 0.217327 | 88.0 |


## Decision Tree

In [67]:
dt_results[dt_results['recall'] > 0.75].sort_values(by='f1_score', ascending=False).head(3)

|    | depth | min_samples | gral_accuracy | precision | recall | f1_score | support | auc |
|---|---|---|---|---|---|---|---|---|
| 16 | 3.0 | 1 | 0.875594 | 0.130199 | 0.818182 | 0.224649 | 88.0 | 0.847535 |
| 18 | 3.0 | 10 | 0.875594 | 0.130199 | 0.818182 | 0.224649 | 88.0 | 0.847535 |
| 17 | 3.0 | 5 | 0.875594 | 0.130199 | 0.818182 | 0.224649 | 88.0 | 0.847535 |


## Random Forest

In [68]:
rf_results[rf_results['recall'] > 0.75].sort_values(by='f1_score', ascending=False).head(3)

|    | n_estimators | max_depth | min_samples | gral_accuracy | precision | recall | f1_score | support | auc |
|---|---|---|---|---|---|---|---|---|---|
| 470 | 90 | 10.0 | 100 | 0.902879 | 0.159091 | 0.795455 | 0.265152 | 88.0 | 0.850376 |
| 550 | 100 | 25.0 | 100 | 0.902378 | 0.158371 | 0.795455 | 0.264151 | 88.0 | 0.85012 |
| 558 | 100 | 30.0 | 100 | 0.902378 | 0.158371 | 0.795455 | 0.264151 | 88.0 | 0.85012 |


## Decision

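With this information, we'll use Logistic Regression with SMOTE. Among the candidates with recall above 0.75, its best configuration (C=1.0, threshold 0.7) has the highest F1-score (0.3125) and precision (0.194).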
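
The choice can be reproduced from the top row of each table above. The sketch below hard-codes those four rows (values copied from the outputs) and picks the candidate with the highest F1-score among those with recall above 0.75, which is the filter the notebook applies. The list-of-dicts layout is just for illustration.

```python
# Best row per model family, copied from the comparison tables above.
candidates = [
    {"model": "Logistic Regression (class weight)", "precision": 0.179487, "recall": 0.795455, "f1": 0.292887},
    {"model": "Logistic Regression (SMOTE)", "precision": 0.194444, "recall": 0.795455, "f1": 0.312500},
    {"model": "Decision Tree", "precision": 0.130199, "recall": 0.818182, "f1": 0.224649},
    {"model": "Random Forest", "precision": 0.159091, "recall": 0.795455, "f1": 0.265152},
]

# Keep candidates that meet the recall floor, then rank by F1-score.
eligible = [c for c in candidates if c["recall"] > 0.75]
best = max(eligible, key=lambda c: c["f1"])

print(best["model"], best["f1"])
```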