# AnyoneAI - Sprint Project 02
> Home Credit Default Risk

You've been learning a lot about Machine Learning algorithms; now you're going to be asked to put it all together.

You will create a complete pipeline to preprocess the data, train your model and then predict values for the [Home Credit Default Risk](https://www.kaggle.com/competitions/home-credit-default-risk/) Kaggle competition.


## 1. Introduction

This is a binary classification task: we want to predict whether a person applying for a home credit will be able to repay their debt. Our model has to predict 1 when the client will have payment difficulties (a late payment of more than X days on at least one of the first Y installments of the loan in our sample) and 0 in all other cases.

The dataset is composed of multiple files with different information about loans taken. In this project, we will work exclusively with the primary files: `application_train_aai.csv` and `application_test_aai.csv`.

We will use [Area Under the ROC Curve](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc?hl=es_419) as the evaluation metric, so our models must return, for each row, the probability that the loan is not repaid.
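To make the metric concrete, here is a minimal sketch of what ROC AUC measures: the probability that a randomly chosen defaulter (label 1) receives a higher score than a randomly chosen non-defaulter (label 0), with ties counted as one half. The function name `pairwise_auc` and the toy data are illustrative assumptions; in the notebook itself, `sklearn.metrics.roc_auc_score` performs this computation.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC by comparing every positive score with every negative score."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: made-up labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_prob)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Because only the ranking of the scores matters, hard 0/1 predictions throw away most of the information the metric uses, which is why the models need to output probabilities.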

In [1]:
# Import libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils.validation import check_is_fitted
from imblearn.combine import SMOTETomek
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from src import config, data_utils, preprocessing

In [2]:
app_train, app_test, columns_description = data_utils.get_datasets()


if app_train.shape == (246008, 122):
    print("Success: app_train shape is correct!")
else:
    raise ValueError("Train dataset shape is incorrect, please review your code")

if isinstance(app_train, pd.DataFrame):
    print("Success: app_train type is correct!")
else:
    raise ValueError("Train dataset type is incorrect, please review your code")

if app_test.shape == (61503, 122):
    print("Success: app_test shape is correct!")
else:
    raise ValueError("Test dataset shape is incorrect, please review your code")

if isinstance(app_test, pd.DataFrame):
    print("Success: app_test type is correct!")
else:
    raise ValueError("Test dataset type is incorrect, please review your code")

Success: app_train shape is correct!
Success: app_train type is correct!
Success: app_test shape is correct!
Success: app_test type is correct!


In [3]:
# Now we execute the function above to get the result
X_train, y_train, X_test, y_test = data_utils.get_feature_target(app_train, app_test)


if X_train.shape == (246008, 121):
    print("Success: X_train shape is correct!")
else:
    raise ValueError("X_train dataset shape is incorrect, please review your code")

if isinstance(X_train, pd.DataFrame):
    print("Success: X_train type is correct!")
else:
    raise ValueError("Train dataset type is incorrect, please review your code")

if y_train.shape == (246008,) or y_train.shape == (246008, 1):
    print("Success: y_train shape is correct!")
else:
    raise ValueError("Train labels shape is incorrect, please review your code")

if X_test.shape == (61503, 121):
    print("Success: X_test shape is correct!")
else:
    raise ValueError("Test dataset shape is incorrect, please review your code")

if isinstance(X_test, pd.DataFrame):
    print("Success: X_test type is correct!")
else:
    raise ValueError("Test dataset type is incorrect, please review your code")

if y_test.shape == (61503,) or y_test.shape == (61503, 1):
    print("Success: y_test shape is correct!")
else:
    raise ValueError("Test labels shape is incorrect, please review your code")

Success: X_train shape is correct!
Success: X_train type is correct!
Success: y_train shape is correct!
Success: X_test shape is correct!
Success: X_test type is correct!
Success: y_test shape is correct!


**Don't change anything in this cell, just make it run correctly**

In [4]:
# Now we execute the function above to get the result
X_train, X_val, y_train, y_val = data_utils.get_train_val_sets(X_train, y_train)


if X_train.shape == (196806, 121):
    print("Success: X_train shape is correct!")
else:
    raise ValueError("X_train dataset shape is incorrect, please review your code")

if isinstance(X_train, pd.DataFrame):
    print("Success: X_train type is correct!")
else:
    raise ValueError("Train dataset type is incorrect, please review your code")

if y_train.shape == (196806,) or y_train.shape == (196806, 1):
    print("Success: y_train shape is correct!")
else:
    raise ValueError("Train labels shape is incorrect, please review your code")

if X_val.shape == (49202, 121):
    print("Success: X_test shape is correct!")
else:
    raise ValueError("Test dataset shape is incorrect, please review your code")

if isinstance(X_val, pd.DataFrame):
    print("Success: X_test type is correct!")
else:
    raise ValueError("Test dataset type is incorrect, please review your code")

if y_val.shape == (49202,) or y_val.shape == (49202, 1):
    print("Success: y_test shape is correct!")
else:
    raise ValueError("Test labels shape is incorrect, please review your code")

Success: X_train shape is correct!
Success: X_train type is correct!
Success: y_train shape is correct!
Success: X_test shape is correct!
Success: X_test type is correct!
Success: y_test shape is correct!


**Don't change anything in this cell, just make it run correctly**

In [5]:
train_data, val_data, test_data = preprocessing.preprocess_data(X_train, X_val, X_test)


if train_data.shape == (196806, 246):
    print("Success: train_data shape is correct!")
else:
    raise ValueError("train_data dataset shape is incorrect, please review your code")

if isinstance(train_data, np.ndarray):
    print("Success: train_data type is correct!")
else:
    raise ValueError("Train dataset type is incorrect, please review your code")

if val_data.shape == (49202, 246):
    print("Success: val_data shape is correct!")
else:
    raise ValueError("val_data dataset shape is incorrect, please review your code")

if isinstance(val_data, np.ndarray):
    print("Success: val_data type is correct!")
else:
    raise ValueError("Validation dataset type is incorrect, please review your code")

if test_data.shape == (61503, 246):
    print("Success: test_data shape is correct!")
else:
    raise ValueError("test_data dataset shape is incorrect, please review your code")

if isinstance(test_data, np.ndarray):
    print("Success: test_data type is correct!")
else:
    raise ValueError("Test dataset type is incorrect, please review your code")

Input train data shape:  (196806, 121)
Input val data shape:  (49202, 121)
Input test data shape:  (61503, 121) 

Creating essential new features...


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  working_train_df["DAYS_EMPLOYED"].replace({365243: np.nan}, inplace=True)

Encoding categorical features...
Imputing missing values...
Current feature count: 248
No target available. Taking first 246 features.
Scaling features...
Processed train data shape:  (196806, 246)
Processed val data shape:  (49202, 246)
Processed test data shape:  (61503, 246) 

Success: train_data shape is correct!
Success: train_data type is correct!
Success: val_data shape is correct!
Success: val_data type is correct!
Success: test_data shape is correct!
Success: test_data type is correct!


## 3. Training Models

In [6]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import roc_auc_score, make_scorer

# Example: input data
X = train_data  # Features
y = y_train     # Labels/target

kf = KFold(n_splits=5, shuffle=True, random_state=42)

folds = list(kf.split(X))

In [7]:
X = pd.DataFrame(X)
y = pd.Series(y)
# Select the fourth fold (index 3 in Python)
train_idx, val_idx = folds[3]
X_train_fold, y_train_fold = X.iloc[train_idx], y.iloc[train_idx]
X_val_fold, y_val_fold = X.iloc[val_idx], y.iloc[val_idx]
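
As a sanity check on the split above, the fold sizes can be worked out by hand. scikit-learn's `KFold` documents that the first `n_samples % n_splits` folds get `n_samples // n_splits + 1` samples and the remaining folds get `n_samples // n_splits`. The helper `kfold_sizes` below is a hypothetical illustration of that sizing rule, not part of the project code.

```python
def kfold_sizes(n_samples, n_splits):
    """Validation-fold sizes produced by the KFold sizing rule."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds each take one additional sample
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_samples = 196806                 # rows in train_data
sizes = kfold_sizes(n_samples, 5)
val_size = sizes[3]                # fourth fold (index 3)
train_size = n_samples - val_size  # everything outside the validation fold
print(sizes)
print(train_size, val_size)
```

For the fourth fold this gives 157445 training rows and 39361 validation rows, which agrees with the shapes the grid search reports when it loads the fold.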


In [13]:
from src.grid_search import grid_search

param_grid = {
    'n_estimators': [150, 200, 300],           
    'max_depth': [8, 10, 12],                     
    'min_samples_split': [2, 3, 5],              
    'min_samples_leaf': [1, 2],                  
    'max_features': [0.5, 0.6, 0.7],         
    'class_weight': ['balanced', {0: 1, 1: 8}, {0: 1, 1: 12}]  
}
# total combinations = 3 * 3 * 3 * 2 * 3 * 3 = 486
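
The size of this grid can be checked with the standard library alone. The sketch below counts the combinations and enumerates them with `itertools.product`; whether `src.grid_search` enumerates them in the same order is an assumption, since that module is not shown here.

```python
from itertools import product
from math import prod

param_grid = {
    "n_estimators": [150, 200, 300],
    "max_depth": [8, 10, 12],
    "min_samples_split": [2, 3, 5],
    "min_samples_leaf": [1, 2],
    "max_features": [0.5, 0.6, 0.7],
    "class_weight": ["balanced", {0: 1, 1: 8}, {0: 1, 1: 12}],
}

# Number of combinations = product of the number of options per parameter
n_combinations = prod(len(values) for values in param_grid.values())

# Full enumeration: the last key varies fastest, the first key slowest
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

print(n_combinations)
print(combos[0])
print(combos[1])
```

With roughly 230 seconds per fit, as the log further down shows, an exhaustive run over every combination on a single fold would take many hours, which is a good reason to split the work or to sample the grid.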

In [14]:
grid_search(
  params = param_grid,
  instance_id=0, 
  X_train_fold=X_train_fold, 
  y_train_fold=y_train_fold, 
  X_val_fold=X_val_fold, 
  y_val_fold=y_val_fold
  )

Searching for instance 1
Evaluating combinations 0 to 60 (total: 61)
✓ Data loaded - Train: (157445, 246), Val: (39361, 246)

Starting process to instance: 01 with 61 combinations...

 Evaluate combination: 01/61 (global 0):
  n_estimators: 150
  max_depth: 8
  min_samples_split: 2
  min_samples_leaf: 1
  max_features: 0.5
  class_weight: balanced
 Training time: 226.99 seconds
 ROC AUC Score (train): 0.7846 
 ROC AUC Score (Test): 0.7387 
 Difference AUC: 0.0459 
 F1 Score: 0.2627 
 Precision Score: 0.1666 
 Recall Score: 0.6204 
 New model found! (F1: 0.2627)

 Evaluate combination: 02/61 (global 1):
  n_estimators: 150
  max_depth: 8
  min_samples_split: 2
  min_samples_leaf: 1
  max_features: 0.5
  class_weight: {0: 1, 1: 8}
 Training time: 228.36 seconds
 ROC AUC Score (train): 0.7827 
 ROC AUC Score (Test): 0.7381 
 Difference AUC: 0.0446 
 F1 Score: 0.2830 
 Precision Score: 0.2008 
 Recall Score: 0.4789 
 New model found! (F1: 0.2830)

 Evaluate combination: 03/61 (global 2):
  

KeyboardInterrupt: 
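
The F1 values in the log above can be reproduced from the reported precision and recall, since F1 is their harmonic mean. This is a small worked check using the numbers printed for the two completed combinations; the helper name `f1_from` is an illustrative assumption.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the two completed combinations in the log
runs = [
    {"class_weight": "balanced", "precision": 0.1666, "recall": 0.6204, "reported_f1": 0.2627},
    {"class_weight": "{0: 1, 1: 8}", "precision": 0.2008, "recall": 0.4789, "reported_f1": 0.2830},
]

recomputed = [f1_from(run["precision"], run["recall"]) for run in runs]
for run, f1 in zip(runs, recomputed):
    print(f"{run['class_weight']}: recomputed F1 = {f1:.4f}, reported F1 = {run['reported_f1']:.4f}")
```

The two runs differ only in `class_weight`: the `balanced` setting reaches higher recall at lower precision, while `{0: 1, 1: 8}` gives up some recall for precision and ends up with the better F1. The ROC AUC on the held-out fold is almost identical in both runs, so the class weights mainly shift the operating point rather than the ranking quality.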

## 4. Predict unlabeled data

Now it's time to finally use the `test_data` samples. Because we don't have the labels we can't see how the model performs on this dataset (╯°□°)╯︵ ┻━┻

But... don't worry, we will internally evaluate your model and give feedback on the results!

In the cells below:
- Take your best model
- Take `test_data` (i.e. the dataset after doing the preprocessing and feature engineering part)
- Run the data through your model and save the predictions in the `TARGET` column of the `app_test` DataFrame (yes, the one we loaded at the very beginning of this notebook).
    - `TARGET` column values must be the probabilities for class 1. So remember to use the `predict_proba()` function from your model as we did in the previous sections.
- Save the modified version of the DataFrame under the same name it had before (`dataset/application_test_aai.csv`) and don't forget to submit it alongside the rest of this sprint project code.
- And finally, don't get confused: you shouldn't submit `dataset/application_train_aai.csv`. So please don't upload your solution with this heavy dataset inside.

Let's say your best model is called `best_credit_model_ever`, then your code should be exactly this:

```python
test_preds = best_credit_model_ever.predict_proba(test_data)[:, 1]
app_test["TARGET"] = test_preds
app_test.to_csv(config.DATASET_TEST, index=False)
```


In [15]:
%%time

# 1. Define the best model (adjust to the one you found to be optimal)
best_model = RandomForestClassifier(
    n_estimators=150,
    max_depth=8,
    min_samples_split=3,
    min_samples_leaf=1,
    max_features=0.5,
    class_weight={0: 1, 1: 8},
    random_state=42
)

# 2. Train it on your training data
print("Training the final model...")
best_model.fit(train_data, y_train)


Training the final model...


TypeError: CalibratedClassifierCV.__init__() got an unexpected keyword argument 'base_estimator'

In [16]:

# 3. Calibrate probabilities to improve the estimates
from sklearn.calibration import CalibratedClassifierCV

# Use the parameter name that matches the installed scikit-learn version
try:
    # Newer versions (scikit-learn >= 1.2) use `estimator`
    calibrated_model = CalibratedClassifierCV(
        estimator=best_model,
        method='sigmoid',
        cv=5
    )
except TypeError:
    # Older versions (scikit-learn < 1.2) use `base_estimator`
    calibrated_model = CalibratedClassifierCV(
        base_estimator=best_model,
        method='sigmoid',
        cv=5
    )

calibrated_model.fit(train_data, y_train)

# Hard class predictions (not used further below)
y_train_pred = calibrated_model.predict(X_train_fold)
y_val_pred = calibrated_model.predict(X_val_fold)

# Probabilities for class 1
y_train_pred_proba = calibrated_model.predict_proba(X_train_fold)[:, 1]
y_val_pred_proba = calibrated_model.predict_proba(X_val_fold)[:, 1]

# Evaluate the model (optional)
# Note: both folds are subsets of train_data, which the calibrated model was
# fit on, so these are in-sample scores rather than a held-out estimate.
# Use val_data / y_val for an unbiased validation score.
roc_auc_train = roc_auc_score(y_train_fold, y_train_pred_proba)
roc_auc_val = roc_auc_score(y_val_fold, y_val_pred_proba)

print(f"ROC AUC Score (Train): {roc_auc_train:.4f}")
print(f"ROC AUC Score (Validation): {roc_auc_val:.4f}")

# 4. Optional: find the best decision threshold on the validation set
# Note: for reference only; it does not affect the saved probabilities
from sklearn.metrics import f1_score

val_probs = calibrated_model.predict_proba(val_data)[:, 1]
thresholds = np.arange(0.05, 0.5, 0.01)
best_f1 = 0
best_threshold = 0.5

for threshold in thresholds:
    val_preds = (val_probs >= threshold).astype(int)
    f1 = f1_score(y_val, val_preds)
    if f1 > best_f1:
        best_f1 = f1
        best_threshold = threshold

print(f"Optimal threshold: {best_threshold:.4f} (F1 on validation: {best_f1:.4f})")

# 5. Predict on the test data (using the calibrated model)
print("Generating predictions for the test data...")
test_preds = calibrated_model.predict_proba(test_data)[:, 1]

# 6. Store the predictions in app_test
app_test["TARGET"] = test_preds

# 7. Check the range of the probabilities (quality control)
print(f"Range of predicted probabilities: [{test_preds.min():.4f}, {test_preds.max():.4f}]")
print(f"Mean predicted probability: {test_preds.mean():.4f}")

# 8. Save the modified DataFrame
print("Saving predictions...")
app_test.to_csv(config.DATASET_TEST, index=False)
print("Predictions saved successfully!")

ROC AUC Score (Train): 0.7757
ROC AUC Score (Validation): 0.7752
Optimal threshold: 0.1500 (F1 on validation: 0.2926)
Generating predictions for the test data...
Range of predicted probabilities: [0.0170, 0.4625]
Mean predicted probability: 0.0805
Saving predictions...
Predictions saved successfully!


## 5. Optional exercises

### Optional: Training a LightGBM model 

5.1. Gradient boosting machines are among the most widely used machine learning algorithms for tabular data. Lots of competitions have been won using models from libraries like XGBoost or LightGBM. You can try using [LightGBM](https://lightgbm.readthedocs.io/en/latest/) to train a new model and see how it performs compared to the other classifiers you trained.

In [None]:
### Complete in this cell: train a LightGBM model

### Optional: Using Scikit Learn Pipelines 

5.2. So far you've created special functions or blocks of code to chain operations on data and then train the models. But reproducibility is important, and you don't want to have to remember the correct steps to follow each time you have new data to train your models. There are lots of tools out there that can help you with that; here you can use a [Sklearn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to process your data.
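
As a starting point, here is a minimal sketch of the idea on synthetic data. It is not the solution for this dataset: the random features, the median imputation, and the logistic regression model are all assumptions chosen to keep the example short and self-contained.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (assumption: numeric features only)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 1.0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # inject roughly 5% missing values

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Each step is fit on the training data only and then reused unchanged on new data
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

val_proba = pipe.predict_proba(X_va)
val_auc = roc_auc_score(y_va, val_proba[:, 1])
print(f"Validation ROC AUC: {val_auc:.3f}")
```

The benefit is that `pipe.fit` and `pipe.predict_proba` carry the whole preprocessing chain with them, so the imputation and scaling statistics learned on the training split are applied consistently to validation and test data.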

In [None]:
### Complete in this cell: use a sklearn Pipeline to automate the cleaning, standardizing and training

### Optional: Build your own model and features

5.3. If you want, you can take the original labeled data and do your own feature selection, data preprocessing, and model tuning. Be creative; the only limits are time and hardware resources. Just be careful not to modify the functions written for the mandatory assignments, or you will break the project tests.

You can even use this newer model to make predictions on the test dataset with hidden labels and submit that.


In [None]:
### Complete in this cell: Make your own experimentation process