# Introduction

This notebook aims to evaluate the performance of various classification algorithms, fine-tune their hyperparameters, and select the best-performing model. Given the imbalanced nature of the dataset, class weighting (`class_weight='balanced'`) will be applied to enhance model performance.
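As a rough illustration of what `class_weight='balanced'` does: scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula with NumPy on a made-up label array (95 negatives, 5 positives). The array `y_toy` is an assumption for illustration only, not the notebook's data.

```python
import numpy as np

# Hypothetical labels: 95 non-dropouts (0) and 5 dropouts (1)
y_toy = np.array([0] * 95 + [1] * 5)

n_samples = len(y_toy)
n_classes = len(np.unique(y_toy))

# Formula documented by scikit-learn for class_weight='balanced'
balanced_weights = n_samples / (n_classes * np.bincount(y_toy))

# The rare class receives the larger weight
print(balanced_weights)
```

The rarer a class is, the larger its weight, so errors on dropouts cost more during training than errors on the majority class.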

The algorithms to be evaluated include:

- Logistic Regression
- Decision Tree
- Random Forest
- XGBoost

Each model will undergo:
1. Baseline training with default hyperparameters.
2. Hyperparameter tuning to optimize performance.
3. Validation using metrics such as AUC to identify overfitting and select the final model.
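The three steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data from `make_classification`, not the notebook's dataset; the candidate values for `C` and the helper `val_auc` are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption: stands in for the real dataset)
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.95, 0.05], random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

def val_auc(C):
    """Hypothetical helper: fit with a given C and return validation AUC."""
    model = LogisticRegression(C=C, class_weight="balanced", max_iter=1000)
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

# 1. Baseline with the default regularization strength
baseline_auc = val_auc(1.0)

# 2. Tuning: try a few candidate values of C
candidate_C = [0.01, 0.1, 1.0, 10.0]
auc_by_C = {C: val_auc(C) for C in candidate_C}

# 3. Validation: keep the value with the highest validation AUC
best_C = max(auc_by_C, key=auc_by_C.get)
print(baseline_auc, best_C, auc_by_C[best_C])
```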

## Important Note:
**(Strongly correlated features)**

During initial tests, we observed the model achieved high performance due to features strongly correlated with the target variable, such as economic participation and hours worked, reflecting real-world factors influencing school dropout rates. To better evaluate model performance and for illustrative purposes in this exercise, we will exclude the four most predictive features, allowing us to practice model selection, training, and tuning without their dominant influence. The following cells will demonstrate this process and adjustments made to refine the models.

### Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

### Base Model

#### Logistic Regression with default parameters

In [2]:
# Load data
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'G:\Mi unidad\###_ ML Zoomcamp 2024\enape_post_eda.csv')

In [3]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.dropout.values
y_val = df_val.dropout.values
y_test = df_test.dropout.values

del df_train['dropout']
del df_val['dropout']
del df_test['dropout']

In [4]:
len(df_train), len(df_val), len(df_test)

(11983, 3995, 3995)

In [5]:
dv = DictVectorizer(sparse=False)

train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
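
The warning above suggests scaling the data or raising `max_iter`. A minimal sketch of both remedies follows, using synthetic features with deliberately mismatched scales. The data and the scale factors are assumptions, and the notebook's `DictVectorizer` output would take the place of `X`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption), stretched so the columns have very different scales
X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
X = X * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])

# Standardize the features, then fit with a higher iteration budget
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

# n_iter_ reports how many solver iterations were actually used
n_iter = pipe.named_steps["logisticregression"].n_iter_[0]
print("iterations used:", n_iter)
```

Wrapping the scaler and the model in one pipeline means the scaling statistics are learned from the training data only, which also keeps cross-validation folds clean.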


Check accuracy

In [6]:
val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

y_pred = model.predict_proba(X_val)[:, 1]
dropout_prediction = (y_pred >= 0.5)
(y_val == dropout_prediction).mean()

np.float64(0.9987484355444305)

In [7]:
report = classification_report(y_val, dropout_prediction)
print(report)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3907
           1       1.00      0.94      0.97        88

    accuracy                           1.00      3995
   macro avg       1.00      0.97      0.99      3995
weighted avg       1.00      1.00      1.00      3995



In [8]:
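# Note: this scores the thresholded labels, so the result equals balanced accuracy;
# pass the probabilities in y_pred to measure ranking quality instead.
roc_auc_score(y_val, dropout_prediction).round(3)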

np.float64(0.972)
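
The `roc_auc_score` call above receives thresholded labels rather than probabilities. With hard 0/1 predictions the score reduces to the mean of recall on each class (balanced accuracy), which is a different quantity from the ranking-based AUC. The toy arrays below are made up to show the difference; they are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.35, 0.45, 0.7, 0.9])

# AUC on probabilities: how well positives are ranked above negatives
auc_scores = roc_auc_score(y_true, scores)

# AUC on thresholded labels: collapses to balanced accuracy
hard = (scores >= 0.5).astype(int)
auc_hard = roc_auc_score(y_true, hard)
bal_acc = balanced_accuracy_score(y_true, hard)

print(auc_scores, auc_hard, bal_acc)
```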

These near-perfect scores suggest the model may be overfitting. We'll check it against the test split to see how it handles new data.


In [9]:
test_dict = df_test.to_dict(orient='records')
X_test = dv.transform(test_dict)

y_pred = model.predict_proba(X_test)[:, 1]
dropout_prediction = (y_pred >= 0.5)
(y_test == dropout_prediction).mean()

np.float64(0.9984981226533166)

In [10]:
report = classification_report(y_test, dropout_prediction)
print(report)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3931
           1       0.97      0.94      0.95        64

    accuracy                           1.00      3995
   macro avg       0.98      0.97      0.98      3995
weighted avg       1.00      1.00      1.00      3995



In [11]:
# AUC of the thresholded labels (not of the probabilities in y_pred)
roc_auc_score(y_test, dropout_prediction).round(3)

np.float64(0.968)

The model performs well on new data, indicating that it is generalizing effectively. Therefore, we can proceed with a cross-validation exercise.

Additionally, we suspect that certain features, such as economic participation, economic consequences, and work hours, have a strong influence on dropout rates. To illustrate the impact of feature selection, we will remove some of these features in the next steps.

#### Cross Validation

In [12]:
len(df_train), len(df_val), len(df_test)

(11983, 3995, 3995)

In [13]:
from sklearn.model_selection import KFold

In [14]:
def train(df_train, y_train, C=1.0):
    dicts = df_train.to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)

    # Pass C through so the function argument actually takes effect
    model = LogisticRegression(C=C)
    model.fit(X_train, y_train)

    return dv, model

In [15]:
def predict(df, dv, model):
    dicts = df.to_dict(orient='records')

    X = dv.transform(dicts)
    y_pred = model.predict_proba(X)[:, 1]

    return y_pred

In [16]:
n_splits = 5
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)
scores = []

for train_idx, val_idx in kfold.split(df_full_train):
    df_train = df_full_train.iloc[train_idx]
    df_val = df_full_train.iloc[val_idx]

    y_train = df_train.dropout.values
    y_val = df_val.dropout.values

    # Caution: df_train and df_val still contain the 'dropout' column here,
    # so the target is also passed to the model as a feature.
    dv, model = train(df_train, y_train)
    y_pred = predict(df_val, dv, model)

    auc = roc_auc_score(y_val, y_pred)
    scores.append(auc)

print(scores)
print(np.std(scores))

[np.float64(1.0), np.float64(1.0), np.float64(0.9836065573770492), np.float64(1.0), np.float64(1.0)]
0.006557377049180336
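
The loop above slices `df_full_train`, which still includes the `dropout` column, so the target reaches the model as a feature. A minimal sketch of a leakage-free variant follows. It also uses `StratifiedKFold` so each fold keeps a share of the rare class. The toy frame, its column names `hours_worked` and `age`, and the rule that generates `dropout` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Hypothetical data: two features and a rare target loosely tied to hours_worked
rng = np.random.default_rng(1)
n = 1000
toy = pd.DataFrame({
    "hours_worked": rng.normal(20, 5, n),
    "age": rng.integers(12, 19, n),
})
toy["dropout"] = (toy["hours_worked"] + rng.normal(0, 5, n) > 30).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = []

for train_idx, val_idx in skf.split(toy, toy["dropout"]):
    fold_train = toy.iloc[train_idx]
    fold_val = toy.iloc[val_idx]

    y_tr = fold_train["dropout"].values
    y_va = fold_val["dropout"].values

    # Drop the target before vectorizing so it cannot leak into the features
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(fold_train.drop(columns="dropout").to_dict(orient="records"))
    X_va = dv.transform(fold_val.drop(columns="dropout").to_dict(orient="records"))

    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr, y_tr)
    scores.append(roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]))

print(np.mean(scores), np.std(scores))
```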


#### Base Model Conclusions

The model performs exceptionally well, partly due to the presence of highly correlated features. After testing it with cross-validation and new data, we can conclude that it generalizes effectively. This aligns with the real-world scenario, where it is intuitive to expect that students with excessive work hours, high economic participation, and significant economic consequences are more likely to drop out.

#### Feature exclusion

As stated before, we'll exclude some features in order to run the model selection and fine-tuning process.

In [17]:
# Load data
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'G:\Mi unidad\###_ ML Zoomcamp 2024\enape_post_eda.csv')

In [18]:
columns_to_drop = [
    'em_hw_projects',
    'em_tests',
    #'economic_participation',
    #'economic_consequences'
    ]
df = df.drop(columns=columns_to_drop)

In [19]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.dropout.values
y_val = df_val.dropout.values
y_test = df_test.dropout.values

del df_train['dropout']
del df_val['dropout']
del df_test['dropout']

In [20]:
print(df_train.shape)
print(df_val.shape)
print(df_test.shape)

(11983, 39)
(3995, 39)
(3995, 39)


Now we can re-check performance with the base model.

In [21]:
dv = DictVectorizer(sparse=False)

train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [22]:
val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

y_pred = model.predict_proba(X_val)[:, 1]
dropout_prediction = (y_pred >= 0.5)
(y_val == dropout_prediction).mean()

np.float64(0.979224030037547)

In [23]:
report = classification_report(y_val, dropout_prediction)
print(report)

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      3907
           1       0.60      0.17      0.27        88

    accuracy                           0.98      3995
   macro avg       0.79      0.58      0.63      3995
weighted avg       0.97      0.98      0.97      3995



In [24]:
# AUC of the thresholded labels (not of the probabilities in y_pred)
roc_auc_score(y_val, dropout_prediction).round(3)

np.float64(0.584)

The model now performs poorly on the positive class (recall of 0.17 for dropouts), which gives us room to improve it through model selection and tuning.
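
One possible first step, in line with the class weighting mentioned in the introduction, is to compare minority-class recall with and without `class_weight='balanced'`. The sketch below does this on synthetic data from `make_classification`, which is an assumption standing in for the real features. Results on the actual dataset may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 3% positives (assumption)
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.97, 0.03], random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

# Same model twice: once unweighted, once with balanced class weights
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Recall on the positive (rare) class at the default 0.5 threshold
recall_plain = recall_score(y_va, plain.predict(X_va))
recall_weighted = recall_score(y_va, weighted.predict(X_va))

print(recall_plain, recall_weighted)
```

Class weighting typically trades some precision for recall, so the full classification report is worth checking alongside these two numbers.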

## EOF

Train the model
- Transform training data into a dictionary and then to a vector
- Train the model

In [None]:
dv = DictVectorizer(sparse=False)

train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

model = LogisticRegression(solver='liblinear', class_weight='balanced',C=0.001, max_iter=100)
model.fit(X_train, y_train)

In [None]:
val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

y_pred = model.predict_proba(X_val)[:, 1]
dropout_prediction = (y_pred >= 0.8)
(y_val == dropout_prediction).mean()

In [None]:
y_val

In [None]:
y_pred

In [None]:
from sklearn.metrics import classification_report
# Generate the report
report = classification_report(y_val, dropout_prediction)

# Print the report
print(report)

In [None]:
from sklearn.feature_extraction import DictVectorizer
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Step 1: Vectorize the data
dv = DictVectorizer(sparse=False)
train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

# Step 2: Apply SMOTE for oversampling
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Step 3: Train the model on the balanced data
balanced_model = LogisticRegression(solver='liblinear', class_weight='balanced',C=0.001, max_iter=1000)
balanced_model.fit(X_train_balanced, y_train_balanced)


In [None]:
from collections import Counter
print("Distribution before SMOTE:", Counter(y_train))
print("Distribution after SMOTE:", Counter(y_train_balanced))


In [None]:
val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

y_pred = balanced_model.predict_proba(X_val)[:, 1]
balanced_dropout_prediction = (y_pred >= 0.8)
(y_val == balanced_dropout_prediction).mean()

In [None]:
from sklearn.metrics import classification_report
# Generate the report
balanced_report = classification_report(y_val, balanced_dropout_prediction)

# Print both reports for comparison
print(report)
print(balanced_report)

Validation with the test split

In [None]:
test_dict = df_test.to_dict(orient='records')
X_test = dv.transform(test_dict)

y_pred_balanced = balanced_model.predict_proba(X_test)[:, 1]
balanced_dropout_prediction = (y_pred_balanced >= 0.7)
(y_test == balanced_dropout_prediction).mean()

In [None]:
balanced_report_test = classification_report(y_test, balanced_dropout_prediction)

# Print the report
print(balanced_report_test)

In [None]:
test_dict = df_test.to_dict(orient='records')
X_test = dv.transform(test_dict)

y_pred_balanced = balanced_model.predict_proba(X_test)[:, 1]
balanced_dropout_prediction = (y_pred_balanced >= 0.3)
(y_test == balanced_dropout_prediction).mean()

In [None]:
balanced_report_test = classification_report(y_test, balanced_dropout_prediction)

# Print the report
print(balanced_report_test)
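
The cells above try cut-offs of 0.8, 0.7, and 0.3 one at a time. A small sweep makes the precision/recall trade-off easier to compare. The sketch below uses made-up labels and probabilities (assumptions, not model output) and plain NumPy; `precision_recall_at` is a hypothetical helper.

```python
import numpy as np

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.35, 0.45, 0.7, 0.9])

def precision_recall_at(threshold):
    """Hypothetical helper: precision and recall for one decision threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)

results = {t: precision_recall_at(t) for t in (0.3, 0.7, 0.8)}

for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

Lower thresholds catch more dropouts at the cost of more false alarms, so the choice depends on which error is more expensive.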

In [None]:
from sklearn.metrics import roc_auc_score
# Score the predicted probabilities rather than the thresholded labels
roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]).round(3)

In [None]:
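# balanced_dropout_prediction now holds test-set labels, so score the validation probabilities directly
roc_auc_score(y_val, balanced_model.predict_proba(X_val)[:, 1]).round(3)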

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print("Cross-Validation Accuracy:", scores.mean())


In [None]:
n_splits = 5
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)
scores = []

for train_idx, val_idx in kfold.split(df_full_train):
    df_train = df_full_train.iloc[train_idx]
    df_val = df_full_train.iloc[val_idx]

    # The target column is named 'dropout'
    y_train = df_train.dropout.values
    y_val = df_val.dropout.values

    # Drop the target from the features to avoid leakage
    dv, model = train(df_train.drop(columns='dropout'), y_train)
    y_pred = predict(df_val.drop(columns='dropout'), dv, model)

    auc = roc_auc_score(y_val, y_pred)
    scores.append(auc)

print(scores)
print(np.std(scores))

In [None]:
scores