# Credit Default Prediction Model: Experimental Summary

## Experiment Overview

This notebook evaluates several classification models for predicting credit default risk, built with fastai's tabular learning framework and tracked with custom metrics that measure discrimination capability.

## Metrics Used

We implemented and tracked three key metrics:
- **Accuracy**: Standard classification accuracy
- **AUC Score**: Area under the ROC curve (ranges from 0 to 1; 0.5 corresponds to random ranking)
- **Gini Coefficient**: 2*AUC-1 (ranges from -1 to 1; 0 corresponds to random ranking), a common credit risk metric
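
As a minimal sketch of how the last two metrics relate, the snippet below computes AUC as the share of positive/negative pairs that are ranked correctly and derives Gini from it. The helper names `auc_score` and `gini` and the toy labels and scores are illustrative assumptions; the notebook itself relies on scikit-learn's `roc_auc_score`.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(labels, scores):
    """Gini coefficient defined as 2*AUC - 1."""
    return 2 * auc_score(labels, scores) - 1


# Toy data for illustration only (not taken from the assessment dataset)
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]
print(auc_score(labels, scores), gini(labels, scores))
```

Negating the scores reverses the ranking, which pushes AUC below 0.5 and makes Gini negative. That is why the ranges above extend below the random-ranking baseline.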

## Key Experiments and Results

| Experiment | Configuration | Accuracy | AUC | Gini | Best Epoch Performance |
|------------|---------------|----------|-----|------|------------------------|
| 1 | 10 epochs, lr=5e-3, wd=0.1 | 0.763 | 0.701 | 0.401 | Epoch 7: Gini 0.420 |
| 2 | 10 epochs, lr=5e-3, wd=0.01 | 0.743 | 0.684 | 0.367 | Epoch 4: AUC 0.717 |
| 3 | 10 epochs, lr=3e-3, wd=0.01 | 0.738 | 0.664 | 0.328 | Epoch 8: Gini 0.330 |
| 4 | 5 epochs, lr=3e-3, wd=0.01* | 0.736 | 0.662 | 0.324 | Epoch 4: AUC 0.651 |
| 5 | 5 epochs, lr=3e-3, wd=0.05 | 0.764 | 0.697 | 0.394 | Epoch 3: AUC 0.699, Gini 0.399 |
| Ensemble | Models 1+2+3 | 0.759 | 0.697 | 0.394 | N/A |

*with adjusted class weights

## Hyperparameter Impact

| Parameter | Tested Values | Observation |
|-----------|---------------|-------------|
| Weight Decay | 0.01, 0.05, 0.1 | Higher values (0.1) produced better Gini coefficients |
| Learning Rate | 3e-3, 5e-3 | 5e-3 performed better across metrics |
| Class Weights | default, 0.9*default | Reducing minority class weight improved balance |
| Epochs | 5, 7, 10 | 5-7 epochs sufficient; diminishing returns after |

## Model Comparison with Champion

| Statistic | Our Ensemble vs Champion Model |
|-----------|--------------------------------|
| Mean PD Difference | +0.304 higher |
| Mean Score Difference | -249.43 points lower |
| Max PD Difference | +0.572 (APP_ID: 6341) |
| Min PD Difference | +0.028 (APP_ID: 4743) |
| Max Score Difference | -402.00 points (APP_ID: 6341) |
| Min Score Difference | -79.95 points (APP_ID: 4743) |

## Future Work Opportunities

| Area | Potential Improvements |
|------|------------------------|
| Feature Engineering | Domain-specific features, dimensionality reduction |
| Model Architecture | Deeper networks, activation function testing |
| Validation | Cross-validation, Bayesian hyperparameter optimization |
| Calibration | Probability alignment, score scaling |
| Explainability | Feature importance, SHAP values |
| Thresholds | Business-cost based threshold optimization |


# Data Preprocessing and Feature Engineering

In [None]:
def enhanced_feature_engineering(df):
    "Engineer new features and convert appropriate columns to categorical"
    df = df.copy()
    
    # Log transformations
    df['log_revol_bal'] = np.log1p(df['revol_bal'])
    df['log_annual_inc'] = np.log1p(df['annual_inc'])
    
    # Ratio features
    df['debt_income_ratio'] = df['revol_bal'] / df['annual_inc'].clip(lower=1)
    df['util_acc_ratio'] = df['revol_util'] / df['open_acc'].clip(lower=1)
    
    # Interaction terms
    df['dti_util'] = df['debt_to_income'] * df['revol_util']
    
    # Convert identified columns to categorical
    for col in ['delinq_2yrs', 'inq_last_6mths', 'mths_since_last_record', 'pub_rec']:
        df[f'{col}_cat'] = df[col].astype('category')
    
    return df
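
The guards in the function above can be illustrated without pandas. This is a small sketch, assuming a hypothetical helper `safe_ratio` that mirrors `Series.clip(lower=1)` on the denominator; the applicant values 6458 and 31000 are copied from the first row of the batch preview shown later in the notebook.

```python
import math


def safe_ratio(numerator, denominator, floor=1.0):
    """Divide after raising the denominator to at least `floor`, like clip(lower=1)."""
    return numerator / max(denominator, floor)


# log1p(x) = log(1 + x) is defined at zero, unlike log(x)
print(math.log1p(0.0))

# A zero denominator is lifted to 1 instead of raising ZeroDivisionError
print(safe_ratio(5000.0, 0.0))

# revol_bal / annual_inc for the first displayed applicant
print(safe_ratio(6458.0, 31000.0))
```

With NumPy arrays, `np.log(0)` returns `-inf` with a warning rather than raising, which would propagate into normalization; `np.log1p` avoids that for zero balances.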



In [None]:
def preprocess_data(df):
    "Preprocess loan application data with appropriate handling of missing values"
    # Make a copy to avoid modifying original
    df = df.copy()
    
    # Calculate missing percentages
    missing_pct = df.isna().sum()/df.shape[0]
    
    # Handle low missing rate columns (<1%)
    low_missing = missing_pct[(missing_pct < 0.01) & (missing_pct > 0)].index
    for col in low_missing:
        mode_val = df[col].mode()[0]
        df[col] = df[col].fillna(mode_val)
    
    # Handle medium missing rate columns (1-40%) with explicit placeholder values
    for col, placeholder in [('emp_title', 'Unknown'), ('Notes', 'Not shared')]:
        if col in df.columns:
            df[col] = df[col].fillna(placeholder)
    
    # Handle high missing rate columns (>40%)
    high_missing = missing_pct[missing_pct > 0.40].index
    for col in high_missing:
        df[f'has_{col}'] = df[col].notna().astype(int)
        df[col] = df[col].fillna(0)
    
    return df


In [None]:
def feature_engineering(df):
    "Engineer new features and clean up existing ones"
    df = df.copy()
    
    # Drop high cardinality and less useful columns
    cols_to_drop = ['emp_title', 'Notes', 'purpose', 'zip_code', 'APP_ID', 'collections_12_mths_ex_med'] 
    df = df.drop(columns=[c for c in cols_to_drop if c in df.columns])
    
    # Convert earliest_cr_line to credit history years
    if 'earliest_cr_line' in df.columns:
        df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'])
        reference_year = 2008
        df['years_since_first_cr'] = reference_year - df['earliest_cr_line'].dt.year
        df = df.drop(columns=['earliest_cr_line'])
    
    # Handle low cardinality features
    # Convert pub_rec to categorical (very skewed distribution, only 4 values)
    if 'pub_rec' in df.columns:
        df['pub_rec_cat'] = df['pub_rec'].astype('category')
    
    # We keep delinq_2yrs and inq_last_6mths as continuous since they have more values
    # and likely have a somewhat linear relationship with default risk
    
    return df

In [None]:
def adjust_column_types(cont_cols, cat_cols, move_to_cat=None, move_to_cont=None):
    "Move columns between continuous and categorical lists, returning new lists"
    # Copy first so the caller's lists are not mutated by the appends below
    cont_cols, cat_cols = list(cont_cols), list(cat_cols)
    if move_to_cat:
        cont_cols = [col for col in cont_cols if col not in move_to_cat]
        for col in move_to_cat:
            if col not in cat_cols: cat_cols.append(col)
    
    if move_to_cont:
        cat_cols = [col for col in cat_cols if col not in move_to_cont]
        for col in move_to_cont:
            if col not in cont_cols: cont_cols.append(col)
            
    return cont_cols, cat_cols

# Training

In [None]:
from utils import *
from fastai.tabular.all import *
# Import fastai metrics and scikit-learn's ROC AUC score
from sklearn.metrics import roc_auc_score
from fastai.metrics import *






In [None]:
cd static

[Errno 2] No such file or directory: 'static'
/app/data/static


In [None]:
df = pd.read_csv('Assessment Modeling Dataset.csv')
# exp1
# df= preprocess_data(df)
# df= feature_engineering(df)
# cont_cols,cat_cols = cont_cat_split(df, max_card=20, dep_var='TARGET')
# cont_cols, cat_cols = adjust_column_types(cont_cols, cat_cols, move_to_cat=['pub_rec'])


#exp2
df_clean = preprocess_data(df)
df_clean = feature_engineering(df_clean)
df_enhanced = enhanced_feature_engineering(df_clean)

# Get initial column splits
cont_cols, cat_cols = cont_cat_split(df_enhanced, max_card=20, dep_var='TARGET')

# Move columns from continuous to categorical
cols_to_move = ['delinq_2yrs_cat', 'inq_last_6mths_cat', 'mths_since_last_record_cat', 'pub_rec_cat']
cont_cols, cat_cols = adjust_column_types(cont_cols, cat_cols, move_to_cat=cols_to_move)

# Remove original columns from continuous list
cols_to_remove = ['delinq_2yrs', 'inq_last_6mths', 'mths_since_last_record', 'pub_rec']
cont_cols = [col for col in cont_cols if col not in cols_to_remove]
df = df_enhanced.copy()

In [None]:
to

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[1], line 1
----> 1 to

NameError: name 'to' is not defined

In [None]:
df[cont_cols].nunique()

annual_inc                1590
debt_to_income            2471
mths_since_last_delinq      91
open_acc                    35
revol_bal                 6694
revol_util                1020
total_acc                   73
years_since_first_cr        50
log_revol_bal             6694
log_annual_inc            1590
debt_income_ratio         7703
util_acc_ratio            4886
dti_util                  7718
dtype: int64

In [None]:
df[cat_cols].isna().sum()

emp_length                     0
home_ownership                 0
verification_status            0
pymnt_plan                     0
purpose_cat                    0
addr_state                     0
initial_list_status            0
mths_since_last_major_derog    0
policy_code                    0
has_mths_since_last_delinq     0
has_mths_since_last_record     0
pub_rec_cat                    0
delinq_2yrs_cat                0
inq_last_6mths_cat             0
mths_since_last_record_cat     0
dtype: int64

In [None]:
df[cat_cols].nunique()

emp_length                     14
home_ownership                  5
verification_status             3
pymnt_plan                      2
purpose_cat                    27
addr_state                     49
initial_list_status             2
mths_since_last_major_derog     3
policy_code                     5
has_mths_since_last_delinq      2
has_mths_since_last_record      2
pub_rec_cat                     4
delinq_2yrs_cat                10
inq_last_6mths_cat             18
mths_since_last_record_cat     88
dtype: int64

In [None]:
from fastai.tabular.all import *

# Create train/validation splits
splits = RandomSplitter(valid_pct=0.2)(range_of(df))

# Create TabularPandas
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                  cat_names=cat_cols,
                  cont_names=cont_cols,
                  y_names='TARGET',
                  splits=splits,
                  y_block=CategoryBlock(),
                  )

dls = to.dataloaders(bs=64)



In [None]:
dls.show_batch()

Unnamed: 0,emp_length,home_ownership,verification_status,pymnt_plan,purpose_cat,addr_state,initial_list_status,mths_since_last_major_derog,policy_code,has_mths_since_last_delinq,has_mths_since_last_record,pub_rec_cat,delinq_2yrs_cat,inq_last_6mths_cat,mths_since_last_record_cat,annual_inc,debt_to_income,mths_since_last_delinq,open_acc,revol_bal,revol_util,total_acc,years_since_first_cr,log_revol_bal,log_annual_inc,debt_income_ratio,util_acc_ratio,dti_util,TARGET
0,6,RENT,VERIFIED - income,n,debt consolidation,CA,f,1,PC4,0,0,0.0,0.0,0.0,0.0,31000.001211,14.09,2.27935e-07,7.0,6458.000307,45.5,24.0,9.0,8.77323,10.341775,0.2083226,6.5,641.09497,0
1,2,MORTGAGE,not verified,n,debt consolidation,VA,f,1,PC3,0,0,0.0,0.0,1.0,0.0,86003.999576,4.65,2.27935e-07,16.0,8704.000038,28.899999,42.0,9.0,9.071652,11.362161,0.1012046,1.80625,134.385008,0
2,1,RENT,VERIFIED - income source,n,debt consolidation,CA,f,2,PC1,0,0,0.0,0.0,0.0,0.0,19199.99839,13.5,2.27935e-07,6.0,7082.999931,37.099999,8.0,2.0,8.865594,9.862718,0.3689063,6.183333,500.850011,0
3,2,RENT,not verified,n,debt consolidation,CT,f,1,PC2,0,0,0.0,0.0,1.0,0.0,24999.998527,18.049999,2.27935e-07,5.0,9267.999938,47.5,15.0,7.0,9.134431,10.126671,0.37072,9.5,857.375005,1
4,5,OWN,not verified,n,debt consolidation,OH,f,1,PC4,1,1,1.0,0.0,2.0,86.0,24000.000537,16.549999,62.0,4.0,4792.999897,63.099999,17.0,13.0,8.475121,10.085851,0.1997083,15.775,1044.30504,1
5,4,RENT,VERIFIED - income,n,major purchase,TX,f,3,PC3,0,0,0.0,0.0,0.0,0.0,47000.000011,5.97,2.27935e-07,8.0,8747.999943,33.900001,23.0,10.0,9.076694,10.757924,0.1861277,4.2375,202.382983,1
6,10,MORTGAGE,VERIFIED - income source,n,credit card,PA,f,3,PC3,0,0,0.0,0.0,3.0,0.0,71999.999888,16.82,2.27935e-07,25.0,7401.999638,42.799999,38.0,3.0,8.90964,11.184435,0.1028056,1.712,719.895997,0
7,9,MORTGAGE,VERIFIED - income,n,debt consolidation,SC,f,3,PC1,0,0,0.0,0.0,1.0,0.0,85000.000296,10.09,2.27935e-07,14.0,12126.000099,34.5,37.0,22.0,9.40319,11.350418,0.1426588,2.464286,348.104997,0
8,9,RENT,VERIFIED - income,n,debt consolidation,TX,f,3,PC5,0,0,0.0,0.0,1.0,0.0,62068.800737,16.16,2.27935e-07,4.0,-0.000541,47.360001,19.0,9.0,2.581675e-07,11.036015,-5.295631e-09,11.84,765.337587,0
9,4,RENT,not verified,n,other,CA,f,2,PC2,0,0,0.0,0.0,0.0,0.0,85999.999761,4.79,2.27935e-07,14.0,28088.999375,31.700001,19.0,13.0,10.24317,11.362114,0.3266163,2.264286,151.842997,0


# Metrics

In [None]:
class AUCScore(Metric):
    "Computes ROC AUC from the predicted probability of the positive class"
    def __init__(self): self.reset()
    
    def reset(self): 
        self.preds,self.targs = [],[]
        
    def accumulate(self, learn):
        pred = learn.pred.detach().float()
        # Two-logit output: softmax gives P(default). Ranking by the raw class-1 logit
        # alone would ignore the class-0 logit and can change the AUC.
        pred = pred.softmax(dim=-1)[:,1] if pred.ndim > 1 else pred.sigmoid()
        self.preds.append(pred.cpu())
        self.targs.append(learn.yb[0].detach().view(-1).cpu())
    
    @property
    def value(self):
        preds = torch.cat(self.preds)
        targs = torch.cat(self.targs)
        # AUC is undefined when only one class is present; fall back to chance level
        if targs.sum() == 0 or targs.sum() == len(targs): return 0.5
        return roc_auc_score(targs.numpy(), preds.numpy())
    
    @property
    def name(self): return "auc"

class GiniCoefficient(AUCScore):
    "Computes Gini coefficient (2*AUC-1), reusing the AUC accumulation logic"
    @property
    def value(self): return 2 * super().value - 1
    
    @property
    def name(self): return "gini"

# Create tabular learner with both metrics
learn = tabular_learner(dls, metrics=[accuracy, AUCScore(), GiniCoefficient()])
# The model emits two logits per row with integer class targets, so use cross-entropy
learn.loss_func = CrossEntropyLossFlat()


In [None]:
learn

<fastai.tabular.learner.TabularLearner at 0x706d4773acd0>

In [None]:
def find_optimal_lr(dls):
    "Find the optimal learning rate using fastai's LR Finder"
    # Calculate class weights
    counts = dls.train_ds.y.value_counts()
    total = len(dls.train_ds)
    weights = torch.FloatTensor([total/counts[0], total/counts[1]])
    
    # Create learner
    learn = tabular_learner(dls, metrics=[accuracy, AUCScore(), GiniCoefficient()])
    learn.loss_func = CrossEntropyLossFlat(weight=weights)
    
    # Run the learning rate finder
    learn.lr_find()
    
    return learn

In [None]:
def train_model(dls, epochs=1, lr=1e-2, wd=0.1, metrics=None):
    "Train tabular model with class weights and metrics"
    if metrics is None: metrics = [accuracy, AUCScore(), GiniCoefficient()]
    
    # Calculate class weights
    counts = dls.train_ds.y.value_counts()
    total = len(dls.train_ds)
    weights = torch.FloatTensor([total/counts[0],  0.9 *total/counts[1]])
    print("reduce clss weights" ,weights)
    
    # Create and train model
    learn = tabular_learner(dls, metrics=metrics,lr=lr)
    learn.loss_func = CrossEntropyLossFlat(weight=weights)
    learn.fit_one_cycle(epochs, wd=wd)
    
    return learn
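
The weighting scheme in `train_model` can be checked by hand. The sketch below uses a hypothetical helper `class_weights` and hypothetical class counts (5590 non-defaults, 810 defaults); these counts are an assumption, chosen only because they reproduce the weight tensor printed in the experiment logs.

```python
def class_weights(n_negative, n_positive, minority_scale=0.9):
    """Inverse-frequency class weights, with the minority (default) weight scaled down."""
    total = n_negative + n_positive
    return [total / n_negative, minority_scale * total / n_positive]


# Hypothetical training-set counts (assumed, not read from the dataset)
weights = class_weights(5590, 810)
print([round(w, 4) for w in weights])
```

Without the 0.9 factor the minority weight would be `total / n_positive`, so the scaling trades a little recall on defaults for fewer false alarms.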


# Experiments

In [None]:
model = train_model(dls, epochs=10, lr=5e-3, wd=0.1)

reduce clss weights tensor([1.1449, 7.1111])

In [None]:
model = train_model(dls, epochs=10, lr=5e-3, wd=0.01)

reduce clss weights tensor([1.1449, 7.1111])

In [None]:
model = train_model(dls, epochs=10, lr=3e-3, wd=0.01)

reduce clss weights tensor([1.1449, 7.1111])

In [None]:
model = train_model(dls, epochs=5, lr=3e-3, wd=0.01)

reduce clss weights tensor([1.1449, 7.1111])

In [None]:
def train_model_deeper(dls, layers=[200,100], ps=[0.2,0.1], epochs=5, lr=3e-3, wd=0.01):
    """Train a deeper tabular model with custom layers and dropout rates"""
    
    # Calculate class weights
    counts = dls.train_ds.y.value_counts()
    total = len(dls.train_ds)
    weights = torch.FloatTensor([total/counts[0], 0.9*total/counts[1]])
    print("Class weights:", weights)
    
    # Create and train model with deeper architecture
    learn = tabular_learner(dls, 
                           layers=layers,  # Define hidden layers
                           ps=ps,          # Define dropout probabilities
                           metrics=[accuracy, AUCScore(), GiniCoefficient()],
                           lr=lr)
    
    learn.loss_func = CrossEntropyLossFlat(weight=weights)
    learn.fit_one_cycle(epochs, wd=wd)
    
    return learn

In [None]:
model = train_model(dls, epochs=7, lr=5e-3, wd=0.1, metrics=[accuracy, AUCScore(), GiniCoefficient()])

reduce clss weights tensor([1.1449, 7.1111])

# Ensemble

In [None]:
# Train model 1 - our best configuration
model1 = train_model(dls, epochs=7, lr=5e-3, wd=0.1)

# Train model 2 - lower weight decay
model2 = train_model(dls, epochs=7, lr=5e-3, wd=0.01)

# Train model 3 - lower learning rate and intermediate weight decay
model3 = train_model(dls, epochs=7, lr=3e-3, wd=0.05)



reduce clss weights tensor([1.1449, 7.1111])

In [None]:
def ensemble_predictions(models, dl=None):
    """
    Create ensemble predictions by averaging model outputs
    
    Args:
        models: List of trained fastai Learner objects
        dl: Optional dataloader to use for predictions (defaults to validation)
    
    Returns:
        Averaged predictions and targets
    """
    all_preds = []
    targets = None
    
    for i, model in enumerate(models):
        print(f"Getting predictions from model {i+1}/{len(models)}")
        preds, targs = model.get_preds(dl=dl)
        all_preds.append(preds)
        if targets is None:
            targets = targs
    
    # Average predictions
    avg_preds = torch.stack(all_preds).mean(0)
    return avg_preds, targets

In [None]:
# Get ensemble predictions
avg_preds, targets = ensemble_predictions([model1, model2, model3])

# Calculate metrics on ensemble predictions
from sklearn.metrics import accuracy_score, roc_auc_score

# Convert predictions to binary predictions for accuracy
# Since these are probability outputs, we need to threshold them
threshold = 0.5
binary_preds = (avg_preds[:, 1] > threshold).float()

# Calculate accuracy
acc = accuracy_score(targets.cpu().numpy(), binary_preds.cpu().numpy())

# Calculate AUC
auc = roc_auc_score(targets.cpu().numpy(), avg_preds[:, 1].cpu().numpy())

# Calculate Gini coefficient (2*AUC-1)
gini = 2 * auc - 1

print(f"Ensemble Metrics:")
print(f"Accuracy: {acc:.6f}")
print(f"AUC Score: {auc:.6f}")
print(f"Gini Coefficient: {gini:.6f}")


Getting predictions from model 1/3
Getting predictions from model 2/3

# Test Data

In [None]:
df_test = pd.read_csv('Assessment Holdout Dataset.csv')

In [None]:
def get_test_predictions(model, test_df):
    "Get predictions from model on test data with complete preprocessing pipeline"
    # Apply all preprocessing steps in the same order as training
    df_processed = preprocess_data(test_df)
    df_processed = feature_engineering(df_processed)
    df_processed = enhanced_feature_engineering(df_processed)
    
    # Create test dataloader
    test_dl = model.dls.test_dl(df_processed)
    
    # Switch to eval mode so dropout and batch norm behave deterministically
    model.model.eval()
    all_preds = []
    with model.no_logging(), torch.no_grad():
        for batch in test_dl:
            all_preds.append(model.model(*batch))
    
    all_preds = torch.cat(all_preds)
    # Softmax over the two logits because training used CrossEntropyLoss
    probs = torch.softmax(all_preds, dim=1)[:, 1].cpu()
    pred_class = (probs > 0.5).int()
    
    return pd.DataFrame({
        'APP_ID': test_df['APP_ID'],
        'default_probability': probs.numpy(),
        'predicted_class': pred_class.numpy()
    })

In [None]:
res = get_test_predictions(model,df_test)

In [None]:
res

Unnamed: 0,APP_ID,default_probability,predicted_class
0,6253,0.382,0
1,4685,0.514,1
2,1732,0.512,1
3,4743,0.055,0
4,4522,0.214,0
5,6341,0.834,1
6,577,0.183,0
7,5203,0.648,1
8,6364,0.426,0
9,440,0.532,1


In [None]:
def predict_with_ensemble(models, test_dl):
    """
    Make predictions on test data using ensemble of models
    
    Args:
        models: List of trained fastai Learner objects
        test_dl: DataLoader containing test data
    
    Returns:
        DataFrame with predictions
    """
    print("Getting ensemble predictions for test data...")
    all_preds = []
    
    # Get predictions from each model
    for i, model in enumerate(models):
        print(f"Getting predictions from model {i+1}/{len(models)}")
        # Use get_preds with the test dataloader
        test_preds, _ = model.get_preds(dl=test_dl)
        all_preds.append(test_preds)
    
    # Average predictions
    avg_preds = torch.stack(all_preds).mean(0)
    
    # get_preds already applies the loss function's softmax activation,
    # so column 1 holds the probability of the positive (default) class
    probs = avg_preds[:, 1]
    
    # Create DataFrame with predictions
    import pandas as pd
    results = pd.DataFrame({
        'probability': probs.cpu().numpy(),
        'prediction': (probs > 0.5).int().cpu().numpy()
    })
    
    return results


In [None]:
#export res file
res.to_csv('submission.csv', index=False)

In [None]:
def get_ensemble_test_predictions(models, test_df):
    """Get ensemble predictions from multiple models on test data"""
    # Process test data once
    df_processed = preprocess_data(test_df)
    df_processed = feature_engineering(df_processed)
    df_processed = enhanced_feature_engineering(df_processed)
    
    # Get predictions from each model manually
    all_preds = []
    for i, model in enumerate(models):
        print(f"Getting predictions from model {i+1}/{len(models)}")
        
        # Create test dataloader
        test_dl = model.dls.test_dl(df_processed)
        
        # Get predictions without using get_preds
        model_preds = []
        with torch.no_grad():
            model.model.eval()
            for batch in test_dl:
                preds = model.model(*batch)
                model_preds.append(preds)
        
        # Concatenate all batch predictions
        model_preds = torch.cat(model_preds)
        all_preds.append(model_preds)
    
    # Average the raw logits across models, then convert the mean logits to probabilities
    avg_preds = torch.stack(all_preds).mean(0)
    probs = torch.softmax(avg_preds, dim=1)[:, 1]  # Probability of positive class
    pred_class = (probs > 0.5).int()
    
    return pd.DataFrame({
        'APP_ID': test_df['APP_ID'],
        'default_probability': probs.cpu().numpy(),
        'predicted_class': pred_class.cpu().numpy()
    })
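
The notebook combines models in two ways: `ensemble_predictions` averages the probabilities returned by `get_preds`, while `get_ensemble_test_predictions` averages raw logits and applies softmax afterwards. The sketch below, using hypothetical logits for a single applicant and hypothetical helpers `softmax` and `mean_columns`, illustrates that the two orderings generally give different probabilities, so validation and holdout ensemble numbers are not strictly comparable.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_columns(rows):
    """Column-wise mean of equally sized rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


# Hypothetical two-class logits from three models for one applicant
logits = [[2.0, -1.0], [0.2, 0.1], [-0.5, 0.4]]

# Option A: softmax each model's logits, then average the probabilities
p_avg_probs = mean_columns([softmax(row) for row in logits])[1]

# Option B: average the logits, then apply softmax once
p_avg_logits = softmax(mean_columns(logits))[1]

print(p_avg_probs, p_avg_logits)
```

Averaging logits behaves like a geometric mean of the odds, so one very confident model pulls the result further than it does under probability averaging.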


In [None]:

ensemble_results = get_ensemble_test_predictions([model1, model2, model3], df_test)
ensemble_results.to_csv('ensemble_submission.csv', index=False)

Getting predictions from model 1/3
Getting predictions from model 2/3
Getting predictions from model 3/3


In [None]:
ensemble_results

Unnamed: 0,APP_ID,default_probability,predicted_class
0,6253,0.611,1
1,4685,0.602,1
2,1732,0.587,1
3,4743,0.159,0
4,4522,0.431,0
5,6341,0.696,1
6,577,0.235,0
7,5203,0.62,1
8,6364,0.459,0
9,440,0.646,1


In [None]:
np.clip(0.82, 0.0001, 0.9999)

np.float64(0.82)

In [None]:
def convert_to_score(prob):
    "Convert default probability to credit score using logarithmic scaling"
    # Clip to avoid division by zero at prob=1 and log(0) at prob=0
    prob = np.clip(prob, 0.0001, 0.9999)
    # Calculate odds (p/(1-p))
    odds = prob / (1 - prob)
    # Scale so that a score of 200 corresponds to 50% probability (odds of 1)
    # and the odds of default halve for every 100-point increase in score
    return 200 - 100 * np.log2(odds)
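
A few hand-checkable points make the scaling concrete. This is a standard-library sketch of the same formula; the name `score_from_pd` and its keyword parameters are illustrative assumptions, not part of the notebook.

```python
import math


def score_from_pd(prob, base_score=200.0, points_to_halve_odds=100.0):
    """Map a default probability to a score: base_score at even odds, +100 per halving of odds."""
    prob = min(max(prob, 0.0001), 0.9999)  # same clipping bounds as convert_to_score
    odds = prob / (1.0 - prob)
    return base_score - points_to_halve_odds * math.log2(odds)


# Odds of 1, 1/2 and 1/4 correspond to probabilities 0.5, 1/3 and 0.2
for p in (0.5, 1 / 3, 0.2):
    print(round(p, 4), round(score_from_pd(p), 1))
```

Each halving of the odds adds 100 points, so the scale is linear in log-odds rather than in probability.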

In [None]:

ensemble_results['credit_score'] = convert_to_score(ensemble_results['default_probability'])

In [None]:
ensemble_results

Unnamed: 0,APP_ID,default_probability,predicted_class,credit_score
0,6253,0.611,1,134.587
1,4685,0.602,1,140.05
2,1732,0.587,1,149.355
3,4743,0.159,0,440.199
4,4522,0.431,0,240.145
5,6341,0.696,1,80.5
6,577,0.235,0,369.91
7,5203,0.62,1,129.374
8,6364,0.459,0,223.914
9,440,0.646,1,113.222


In [None]:
pd.read_csv('Assessment Champion Model Dataset.csv')

Unnamed: 0,APP_ID,Champion_Model_PD,Champion_Model_Score
0,6253,0.137,465
1,4685,0.152,448
2,1732,0.049,626
3,4743,0.04,657
4,4522,0.11,500
5,6341,0.122,484
6,577,0.043,646
7,5203,0.303,319
8,6364,0.124,482
9,440,0.128,477


In [None]:
champion_df = pd.read_csv('Assessment Champion Model Dataset.csv')

# Merge with the ensemble results for comparison
comparison = ensemble_results.merge(champion_df, on='APP_ID', how='inner')

# Calculate signed differences between the ensemble and the champion model
comparison['pd_diff'] = comparison['default_probability'] - comparison['Champion_Model_PD']
comparison['score_diff'] = comparison['credit_score'] - comparison['Champion_Model_Score']

# Calculate absolute differences
comparison['pd_abs_diff'] = comparison['pd_diff'].abs()
comparison['score_abs_diff'] = comparison['score_diff'].abs()

# Display the comparison
comparison

Unnamed: 0,APP_ID,default_probability,predicted_class,credit_score,Champion_Model_PD,Champion_Model_Score,pd_diff,score_diff,pd_abs_diff,score_abs_diff
0,6253,0.611,1,134.587,0.137,465,0.475,-330.413,0.475,330.413
1,4685,0.602,1,140.05,0.152,448,0.451,-307.95,0.451,307.95
2,1732,0.587,1,149.355,0.049,626,0.537,-476.645,0.537,476.645
3,4743,0.159,0,440.199,0.04,657,0.119,-216.801,0.119,216.801
4,4522,0.431,0,240.145,0.11,500,0.32,-259.855,0.32,259.855
5,6341,0.696,1,80.5,0.122,484,0.574,-403.5,0.574,403.5
6,577,0.235,0,369.91,0.043,646,0.192,-276.09,0.192,276.09
7,5203,0.62,1,129.374,0.303,319,0.317,-189.626,0.317,189.626
8,6364,0.459,0,223.914,0.124,482,0.335,-258.086,0.335,258.086
9,440,0.646,1,113.222,0.128,477,0.518,-363.778,0.518,363.778
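
Every `pd_diff` in this table is positive and every `score_diff` is negative, so on these ten applications the ensemble is consistently more pessimistic than the champion model. A short sketch of how the gaps could be summarized, using three rows copied from the table (the `sample` frame and variable names are illustrative only):

```python
import pandas as pd

# Three rows copied from the comparison output (APP_IDs 4743, 6341, 5203).
sample = pd.DataFrame({
    "APP_ID": [4743, 6341, 5203],
    "pd_diff": [0.119, 0.574, 0.317],
    "score_diff": [-216.801, -403.5, -189.626],
})

# Mean gap, plus the applications with the largest and smallest PD disagreement.
summary = sample[["pd_diff", "score_diff"]].mean()
largest = sample.loc[sample["pd_diff"].idxmax(), "APP_ID"]
smallest = sample.loc[sample["pd_diff"].idxmin(), "APP_ID"]

print(summary)
print("largest PD gap:", largest, "smallest PD gap:", smallest)
```

Running the same three statements on the full `comparison` frame would give the figures for all merged applications.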


In [None]:
cd  static

[Errno 2] No such file or directory: 'static'
/app/data/static


In [None]:
df

Unnamed: 0,emp_length,home_ownership,annual_inc,verification_status,pymnt_plan,purpose_cat,addr_state,debt_to_income,delinq_2yrs,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,mths_since_last_major_derog,policy_code,TARGET,has_mths_since_last_delinq,has_mths_since_last_record,years_since_first_cr,pub_rec_cat,log_revol_bal,log_annual_inc,debt_income_ratio,util_acc_ratio,dti_util,delinq_2yrs_cat,inq_last_6mths_cat,mths_since_last_record_cat
0,9,MORTGAGE,45000.0,not verified,n,credit card,IL,13.95,0.0,1.0,0.0,0.0,7.0,0.0,10891,26.9,17.0,f,1,PC2,0,0,0,9,0.0,9.296,10.714,0.242,3.843,375.255,0.0,1.0,0.0
1,2,RENT,51000.0,VERIFIED - income source,n,debt consolidation,WV,10.54,0.0,0.0,44.0,0.0,6.0,0.0,6342,64.7,17.0,f,1,PC1,0,1,0,12,0.0,8.755,10.84,0.124,10.783,681.938,0.0,0.0,0.0
2,10,MORTGAGE,75000.0,VERIFIED - income,n,debt consolidation,NY,9.38,0.0,2.0,0.0,0.0,16.0,0.0,69461,47.3,37.0,f,2,PC5,1,0,0,11,0.0,11.149,11.225,0.926,2.956,443.674,0.0,2.0,0.0
3,3,RENT,84000.0,VERIFIED - income source,n,major purchase,CA,24.06,0.0,2.0,46.0,0.0,5.0,0.0,422,28.1,7.0,f,2,PC2,1,1,0,5,0.0,6.047,11.339,0.005,5.62,676.086,0.0,2.0,0.0
4,1,MORTGAGE,50000.0,not verified,n,debt consolidation,CT,6.72,1.0,1.0,22.0,0.0,7.0,0.0,11992,48.9,28.0,f,3,PC2,0,1,0,7,0.0,9.392,10.82,0.24,6.986,328.608,1.0,1.0,0.0
5,5,RENT,41900.0,not verified,n,credit card,NY,12.26,0.0,3.0,0.0,0.0,12.0,0.0,10044,42.6,15.0,f,2,PC4,1,0,0,2,0.0,9.215,10.643,0.24,3.55,522.276,0.0,3.0,0.0
6,10,MORTGAGE,99996.0,not verified,n,car small business,TX,12.84,0.0,0.0,0.0,0.0,7.0,0.0,13742,96.5,15.0,f,1,PC5,1,0,0,24,0.0,9.528,11.513,0.137,13.786,1239.06,0.0,0.0,0.0
7,5,MORTGAGE,214992.0,VERIFIED - income,n,debt consolidation small business,IL,5.74,0.0,5.0,0.0,0.0,9.0,0.0,18880,23.2,28.0,f,1,PC4,1,0,0,15,0.0,9.846,12.278,0.088,2.578,133.168,0.0,5.0,0.0
8,7,RENT,75000.0,not verified,n,debt consolidation,TX,17.58,0.0,1.0,0.0,0.0,10.0,0.0,630,2.2,15.0,f,2,PC1,0,0,0,3,0.0,6.447,11.225,0.008,0.22,38.676,0.0,1.0,0.0
9,1,RENT,32400.0,VERIFIED - income source,n,small business,OH,13.41,0.0,3.0,0.0,0.0,6.0,0.0,6514,57.1,10.0,f,3,PC5,0,0,0,11,0.0,8.782,10.386,0.201,9.517,765.711,0.0,3.0,0.0


What is a partial dependence plot? Can you explain it intuitively with a simple example? We can come back to the previous example later; for now, please explain the concept using a simple example.


##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->

A partial dependence plot (PDP) shows how a specific feature affects the predictions of a machine learning model, with the effects of all other features averaged out.

Think of it this way: imagine you have a model that predicts house prices based on multiple features like square footage, number of bedrooms, neighborhood, and age of the house. You want to understand specifically how "square footage" impacts the predicted price, independent of other factors.

Here's how a PDP works intuitively:

1. For each possible value of square footage (say from 500 to 5000 sq ft):
   - Take all houses in your dataset
   - Temporarily set their square footage to this specific value (e.g., 1500 sq ft)
   - Keep all other features (bedrooms, neighborhood, age) at their original values
   - Get predictions for all these modified houses
   - Calculate the average predicted price

2. Plot these average predictions against the square footage values

The resulting curve shows how square footage influences the predicted price, averaged over the other feature values actually observed in the dataset. For example, the plot might show that:
- Increasing from 1000 to 2000 sq ft raises the predicted price by $100,000 on average
- Beyond 3000 sq ft, additional space has diminishing returns on price

PDPs help you understand non-linear relationships and identify thresholds where a feature's impact changes significantly. They're particularly useful for interpreting complex models like random forests or gradient boosting models where the relationships between features and predictions aren't immediately obvious.

Some more examples, please, and a code snippet too.

##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->

Let's look at a more concrete example with code. I'll show you how to create a partial dependence plot for a model that predicts diabetes progression based on patient features.

First, let's load a dataset, train a model, and create a partial dependence plot:

```python
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay
from sklearn.model_selection import train_test_split

# Load diabetes dataset
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target

# Train a random forest model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Create partial dependence plot for BMI and blood pressure (BP).
# PartialDependenceDisplay.from_estimator replaces plot_partial_dependence,
# which was removed in scikit-learn 1.2.
features_to_plot = ['bmi', 'bp']
fig, ax = plt.subplots(figsize=(10, 6))
PartialDependenceDisplay.from_estimator(model, X_train, features_to_plot, ax=ax)
plt.tight_layout()
plt.show()
```

Now, if you want to calculate the partial dependence values manually to better understand what's happening:

```python
def calculate_pdp(model, X, feature_name, feature_values):
    """Calculate partial dependence values for a specific feature"""
    X_temp = X.copy()
    mean_predictions = []
    
    for value in feature_values:
        X_temp[feature_name] = value
        predictions = model.predict(X_temp)
        mean_predictions.append(predictions.mean())
        
    return mean_predictions

# Generate feature values for BMI
bmi_values = np.linspace(X['bmi'].min(), X['bmi'].max(), 20)
pdp_bmi = calculate_pdp(model, X_test, 'bmi', bmi_values)

# Plot manually calculated PDP
plt.figure(figsize=(8, 5))
plt.plot(bmi_values, pdp_bmi)
plt.xlabel('BMI')
plt.ylabel('Partial Dependence')
plt.title('Partial Dependence Plot for BMI')
plt.grid(True, alpha=0.3)
```
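
As a sanity check, the manual procedure can be compared against scikit-learn's own `partial_dependence` function. This is a sketch under a few assumptions: the names `X_chk`, `rf`, and `manual` are illustrative, the forest is kept small for speed, and `method="brute"` is requested explicitly because the default for tree ensembles may use a faster tree-traversal method that does not reproduce the manual average exactly.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

data = load_diabetes()
X_chk = pd.DataFrame(data.data, columns=data.feature_names)
rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_chk, data.target)

# "brute" is the replace-predict-average procedure described above.
res = partial_dependence(rf, X_chk, features=["bmi"], method="brute",
                         kind="average", grid_resolution=10)
# The grid key was renamed from "values" to "grid_values" in scikit-learn 1.3.
grid = res["grid_values"][0] if "grid_values" in res else res["values"][0]

# Repeat the manual calculation on the grid that scikit-learn chose.
manual = []
for v in grid:
    X_mod = X_chk.copy()
    X_mod["bmi"] = v
    manual.append(rf.predict(X_mod).mean())

print(np.allclose(manual, res["average"][0]))
```

Note that scikit-learn builds its grid between the 5th and 95th percentiles of the feature by default, so its range is narrower than the min-to-max grid used in the manual example.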

Another example with a classification task:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

# Load breast cancer dataset
cancer = load_breast_cancer()
X_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y_cancer = cancer.target

# Train a gradient boosting classifier
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, test_size=0.2, random_state=42)
clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Create partial dependence plots for important features
# (from_estimator draws one panel per feature inside the supplied axes)
important_features = ['mean radius', 'mean texture', 'mean perimeter']
fig, ax = plt.subplots(figsize=(12, 4))
PartialDependenceDisplay.from_estimator(clf, X_train, important_features, ax=ax)
plt.tight_layout()
plt.show()
```

These plots help you understand how each feature affects the model's predictions once the other features are averaged out. For example, you might see that as BMI increases, the predicted diabetes progression rises steadily up to a certain point and then plateaus. Relationships like this are hard to read directly from a complex model.

In [None]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence, plot_partial_dependence
from sklearn.model_selection import train_test_split

# Load diabetes dataset
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target

# Train a random forest model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Create partial dependence plot for BMI and blood pressure (BP)
features_to_plot = ['bmi', 'bp']
fig, ax = plt.subplots(figsize=(10, 6))
plot_partial_dependence(model, X_train, features_to_plot, ax=ax)
plt.tight_layout()
plt.show()


---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[1], line 4
      2 from sklearn.datasets import load_diabetes
      3 from sklearn.ensemble import RandomForestRegressor
----> 4 from sklearn.inspection import partial_dependence, plot_partial_dependence
      5 from sklearn.model_selection import

ImportError: cannot import name 'plot_partial_dependence' from 'sklearn.inspection' (/app/data/.local/lib/python3.11/site-packages/sklearn/inspection/__init__.py)
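
The cell fails because `plot_partial_dependence` was deprecated in scikit-learn 1.0 and removed in 1.2; the replacement is `PartialDependenceDisplay.from_estimator`. If the notebook has to run on several scikit-learn versions, one option is a small wrapper that uses whichever API is present. The helper name `plot_pdp` is hypothetical, and the smaller forest is only there to keep the sketch quick:

```python
import pandas as pd
from sklearn import inspection
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

def plot_pdp(model, X, features, ax=None):
    """Hypothetical helper: draw PDPs with whichever API this scikit-learn has."""
    display_cls = getattr(inspection, "PartialDependenceDisplay", None)
    if display_cls is not None and hasattr(display_cls, "from_estimator"):
        # scikit-learn >= 1.0
        return display_cls.from_estimator(model, X, features, ax=ax)
    # Older releases, where the function-based API still exists
    return inspection.plot_partial_dependence(model, X, features, ax=ax)

diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X, diabetes.target)

# Same two features as the failing cell
display = plot_pdp(model, X, ["bmi", "bp"])
```

Both code paths return a `PartialDependenceDisplay`, so follow-up calls such as `plt.tight_layout()` and `plt.show()` work unchanged.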