<h1 style="text-align:center; color:#4CAF50; font-size:36px; font-weight:bold;">
     Kaggle Competition: 
    <a href="https://www.kaggle.com/competitions/stock-pledge-financing-default-prediction/overview" target="_blank" style="color:#4CAF50; text-decoration:none;">
        Stock Pledge Financing Default Prediction
    </a>
</h1>

<p style="font-size:18px; line-height:1.6; color:#333;">
    Welcome to my solution for the <b>Stock Pledge Financing Default Prediction</b> competition on Kaggle!  
    This notebook outlines a comprehensive machine learning pipeline designed to predict the likelihood of default in stock pledge financing—a scenario where major shareholders use their equity holdings as collateral for loans. 
    Accurately forecasting such defaults is crucial for financial institutions to manage risks effectively.
</p>

<hr style="border:1px solid #ccc; margin:20px 0;">

<h2 style="color:#FF5722; font-size:28px;">Objective</h2>

<p style="font-size:18px; line-height:1.6; color:#333;">
    The primary goal is to develop a robust model that predicts the probability of default in stock pledge financing arrangements. 
    The evaluation metric for this competition is the <b>F1 Score</b>, the harmonic mean of precision and recall, which makes it more informative than plain accuracy when the classes are imbalanced.
</p>
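
<p style="font-size:16px; line-height:1.6; color:#555;">
    To make the metric concrete, here is a small sketch with made-up labels (not competition data). It checks that F1 is the harmonic mean of precision and recall, and shows why accuracy alone can mislead when positives are rare. All arrays and variable names are illustrative assumptions.
</p>

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels invented for illustration: 3 positives out of 10 samples.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 3
f1_manual = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)

# A classifier that never predicts a default still gets 70% accuracy here,
# but its F1 is 0 because it finds no positives at all.
y_all_neg = np.zeros_like(y_true)
acc_neg = float((y_true == y_all_neg).mean())
f1_neg = f1_score(y_true, y_all_neg, zero_division=0)

print(f"F1 = {f1:.4f} (manual: {f1_manual:.4f})")
print(f"All-negative baseline: accuracy = {acc_neg:.2f}, F1 = {f1_neg:.2f}")
```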

<hr style="border:1px solid #ccc; margin:20px 0;">

<h2 style="color:#3F51B5; font-size:28px;">Notebook Overview</h2>

<ol style="font-size:16px; color:#555; line-height:1.8;">
    <li><b>Data Processing</b>
        <ul>
            <li>Loading and cleaning the dataset.</li>
            <li>Handling missing values and encoding categorical variables.</li>
            <li>Ensuring consistent indexing and feature naming.</li>
        </ul>
    </li>
    <li><b>Model Training</b>
        <ul>
            <li>Utilizing <b>LightGBM</b>, renowned for its efficiency with large datasets.</li>
            <li>Implementing <b>Stratified K-Fold Cross-Validation</b> (7 folds) to get a more reliable estimate of out-of-sample performance.</li>
            <li>Optimizing probability thresholds to maximize the F1 score.</li>
        </ul>
    </li>
    <li><b>Evaluation</b>
        <ul>
            <li>Assessing the model using metrics like <b>Log-loss</b> and <b>F1 Score</b>.</li>
            <li>Identifying the optimal cutoff threshold for classification.</li>
        </ul>
    </li>
    <li><b>Submission</b>
        <ul>
            <li>Preparing the final submission file for Kaggle.</li>
        </ul>
    </li>
</ol>
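
<p style="font-size:16px; line-height:1.6; color:#555;">
    As a rough illustration of what stratification does, the sketch below splits a synthetic, imbalanced label vector into 7 folds and counts the positives in each validation fold. The label vector and its class ratio are assumptions made up for this example and are not taken from the competition data.
</p>

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels (assumption): 70 negatives and 14 positives.
y = np.array([0] * 70 + [1] * 14)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=42)

fold_sizes, pos_per_fold = [], []
for _, dev_idx in cv.split(X, y):
    fold_sizes.append(len(dev_idx))
    pos_per_fold.append(int(y[dev_idx].sum()))

# Each validation fold keeps the overall 1:5 positive-to-negative ratio.
print("fold sizes:        ", fold_sizes)
print("positives per fold:", pos_per_fold)
```

<p style="font-size:16px; line-height:1.6; color:#555;">
    With a plain (unstratified) split, some folds could end up with very few positives, which would make per-fold F1 scores noisy.
</p>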

<hr style="border:1px solid #ccc; margin:20px 0;">

<h2 style="color:#009688; font-size:28px;">Why LightGBM?</h2>

<ul style="font-size:16px; color:#555; line-height:1.8;">
    <li><b>Speed</b>: It's one of the fastest gradient boosting frameworks available.</li>
    <li><b>Performance</b>: Often achieves high accuracy with minimal tuning.</li>
    <li><b>Flexibility</b>: Easily handles categorical features and missing values.</li>
</ul>

<p style="font-size:18px; line-height:1.6; color:#333;">
    Let's dive in and explore how this approach performs in predicting stock pledge financing defaults! 
</p>


#### ====================================================
# Imports and Configurations
#### ====================================================

In [1]:
# Core Libraries
import numpy as np
import pandas as pd
import warnings

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning Libraries
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss, f1_score, precision_recall_curve, confusion_matrix
from lightgbm import LGBMClassifier

# Configurations
warnings.filterwarnings('ignore')


In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("jiaoyouzhang/stock-pledge-defaults-prediction")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/stock-pledge-defaults-prediction


#### =======================================================================================
#  Data Processing Section (Code)         
#### =======================================================================================

In [3]:
# Configuration
target = "IsDefault"
n_splits = 7  # Number of cross-validation splits

# Load datasets
train = pd.read_csv("/kaggle/input/stock-pledge-defaults-prediction/train.csv", index_col="Stock code")
test = pd.read_csv("/kaggle/input/stock-pledge-defaults-prediction/test.csv", index_col="Stock code")

# Initialize submission DataFrame
sub_fl = pd.DataFrame(0, index=test.index, columns=[target], dtype=np.int8)

# Separate features and target
Xtrain = train.drop(columns=[target])
ytrain = train[target]
Xtest = test.copy()

# Standardize column names
for df in [Xtrain, Xtest]:
    df.columns = [f"col{i}" for i in range(len(df.columns))]

# Handle categorical columns
cat_cols = Xtrain.select_dtypes(include=["object", "string", "category"]).columns
Xtrain[cat_cols] = Xtrain[cat_cols].fillna("missing").astype("category")
Xtest[cat_cols] = Xtest[cat_cols].fillna("missing").astype("category")

# Reset indices
Xtrain.index = range(len(Xtrain))
ytrain.index = range(len(Xtrain))
Xtest.index = range(len(Xtest))
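
<p style="font-size:16px; line-height:1.6; color:#555;">
    The cell above casts the train and test columns to <code>category</code> independently, so each frame derives its own set of levels from the values it happens to contain. Whether that matters depends on how the downstream library maps levels, but making the level set explicit removes the ambiguity. The sketch below is one possible way to do this with a shared <code>CategoricalDtype</code>; the toy frames and their values are invented stand-ins for <code>Xtrain</code> and <code>Xtest</code>.
</p>

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy frames standing in for Xtrain / Xtest (values are invented).
toy_train = pd.DataFrame({"col0": ["A", "B", None, "A"], "col1": [1.0, 2.0, 3.0, 4.0]})
toy_test = pd.DataFrame({"col0": ["B", "C", "A"], "col1": [5.0, 6.0, 7.0]})

cat_cols = ["col0"]  # hypothetical list of categorical columns

for col in cat_cols:
    # Define the level set once, from the training data plus the "missing" placeholder.
    levels = sorted(set(toy_train[col].fillna("missing")))
    shared = CategoricalDtype(categories=levels)

    toy_train[col] = toy_train[col].fillna("missing").astype(shared)
    # Levels seen only in test (here "C") are not in `levels` and become NaN.
    toy_test[col] = toy_test[col].fillna("missing").astype(shared)

print(toy_train["col0"].cat.categories.tolist())
print(toy_test["col0"].cat.codes.tolist())  # -1 marks a level unseen in training
```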

#### =======================================================================================
#   Model Training with Cross-Validation (Code)  
#### =======================================================================================

In [4]:
print("\n ================== OFFLINE MODEL TRAINING ================== \n")

# Initialize Stratified K-Fold
cv = StratifiedKFold(n_splits=n_splits, random_state=42, shuffle=True)

# Initialize arrays for predictions
oof_preds_lgbm = np.zeros(len(Xtrain))
mdl_preds_lgbm = np.zeros(len(Xtest))

# Ensure column names are clean
Xtrain.columns = Xtrain.columns.str.replace(r'[^\w]', '_', regex=True)
Xtest.columns = Xtest.columns.str.replace(r'[^\w]', '_', regex=True)

# Training loop
for fold_nb, (train_idx, dev_idx) in enumerate(cv.split(Xtrain, ytrain)):
    print(f"\n ============== FOLD {fold_nb+1} - LGBM ==============")
    
    # Split data
    Xtr, ytr = Xtrain.iloc[train_idx], ytrain.iloc[train_idx]
    Xdev, ydev = Xtrain.iloc[dev_idx], ytrain.iloc[dev_idx]
    
    # Train model
    lgbm_model = LGBMClassifier(n_estimators=450, random_state=42, verbose=-1)
    lgbm_model.fit(Xtr, ytr)
    
    # Predictions
    dev_preds = lgbm_model.predict_proba(Xdev)[:, 1]
    test_preds = lgbm_model.predict_proba(Xtest)[:, 1]
    
    # Metrics
    logloss = log_loss(ydev, dev_preds)
    f1_raw = f1_score(ydev, np.round(dev_preds))
    print(f"---> LGBM Log-loss = {logloss:,.8f} | F1 raw = {f1_raw:,.8f}")
    
    # Store predictions
    oof_preds_lgbm[dev_idx] = dev_preds
    mdl_preds_lgbm += test_preds / n_splits
    
    # Optimal threshold search
    cutoffs = {cut: f1_score(ydev, np.where(dev_preds >= cut, 1, 0)) for cut in np.arange(0.05, 0.99, 0.01)}
    best_cutoff = max(cutoffs, key=cutoffs.get)
    best_f1 = cutoffs[best_cutoff]
    print(f"---> Best F1 = {best_f1:,.8f} | Best cutoff = {best_cutoff:,.4f}")

# Final evaluation
final_f1_lgbm = f1_score(ytrain, np.round(oof_preds_lgbm, 0))
print(f"\n---> LGBM Overall F1 score = {final_f1_lgbm:,.8f} | raw")




---> LGBM Log-loss = 0.03026621 | F1 raw = 0.96000000
---> Best F1 = 0.98701299 | Best cutoff = 0.0500

---> LGBM Log-loss = 0.00566426 | F1 raw = 0.98666667
---> Best F1 = 1.00000000 | Best cutoff = 0.0800

---> LGBM Log-loss = 0.00359298 | F1 raw = 1.00000000
---> Best F1 = 1.00000000 | Best cutoff = 0.0500

---> LGBM Log-loss = 0.00178036 | F1 raw = 1.00000000
---> Best F1 = 1.00000000 | Best cutoff = 0.0500

---> LGBM Log-loss = 0.00052280 | F1 raw = 1.00000000
---> Best F1 = 1.00000000 | Best cutoff = 0.0500

---> LGBM Log-loss = 0.00404107 | F1 raw = 1.00000000
---> Best F1 = 1.00000000 | Best cutoff = 0.0500

---> LGBM Log-loss = 0.00027447 | F1 raw = 1.00000000
---> Best F1 = 1.00000000 | Best cutoff = 0.0500

---> LGBM Overall F1 score = 0.99245283 | raw
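
<p style="font-size:16px; line-height:1.6; color:#555;">
    The F1 values above compress each fold into a single number. A confusion matrix on the out-of-fold predictions would show whether the remaining errors are false positives or false negatives. Here is a minimal sketch; the two arrays are small invented stand-ins for <code>ytrain</code> and <code>oof_preds_lgbm</code>, so the printed counts say nothing about the real model.
</p>

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Stand-in arrays (invented); in the notebook these would be ytrain and oof_preds_lgbm.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
oof_probs = np.array([0.01, 0.02, 0.10, 0.60, 0.03, 0.20, 0.95, 0.40, 0.88, 0.70])

# Explicit 0.5 cutoff. np.round sends exactly 0.5 to 0 (round-half-to-even),
# so the two approaches differ only when a probability is exactly 0.5.
y_hat = (oof_probs >= 0.5).astype(int)

# With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel())
f1 = f1_score(y_true, y_hat)

print(f"TN={tn} FP={fp} FN={fn} TP={tp} | F1={f1:.4f}")
```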


#### =======================================================================================
#   Create Submission File (Code)      
#### =======================================================================================

In [5]:
# Finalize predictions and save submission file
sub_fl[target] = np.uint8(mdl_preds_lgbm.round())
sub_fl.to_csv("submission.csv", index=True)

# List the files in the current directory to confirm saving
!ls
print()

# Display the first few rows of the submission file
!head submission.csv

__notebook__.ipynb  submission.csv

Stock code,IsDefault
X01443,1
X01444,1
X01445,1
X01446,0
X01447,0
X01448,1
X01449,1
X01450,0
X01451,0


<h1 style="text-align:center; color:#4CAF50; font-size:36px; font-weight:bold;">✅ Conclusion</h1>

<p style="font-size:18px; line-height:1.6; color:#333;">In this notebook, we successfully developed a machine learning pipeline using <b>LightGBM</b> to tackle the <b><a href="https://www.kaggle.com/competitions/stock-pledge-financing-default-prediction/overview" target="_blank" style="color:#4CAF50; text-decoration:none;">Stock Pledge Financing Default Prediction</a></b> challenge. Here's a summary of our approach and key takeaways:</p>

<hr style="border:1px solid #ccc; margin:20px 0;">

<h2 style="color:#FF5722; font-size:28px;">Key Highlights</h2>

<ul style="font-size:16px; color:#555; line-height:1.8;">
    <li><b>Data Processing:</b>
        <ul>
            <li>Filled missing categorical values with a <code>"missing"</code> level and cast categorical columns to the pandas <code>category</code> dtype.</li>
            <li>Standardized column names and ensured consistent indexing for clean data handling.</li>
        </ul>
    </li>
    <li><b>Model Training:</b>  
        <ul>
            <li>Used 7-fold <b>Stratified K-Fold Cross-Validation</b> to estimate performance on held-out data.</li>
            <li>Searched for the probability threshold that maximizes the <b>F1 Score</b> in each fold; the reported overall score and the submission use the default 0.5 cutoff.</li>
            <li>Achieved an overall <b>F1 Score</b> of <code style="color:#4CAF50; font-weight:bold;">0.9925</code> on the out-of-fold predictions.</li>
        </ul>
    </li>
    <li><b>Submission:</b>
        <ul>
            <li>Test-set probabilities were averaged across the 7 folds, rounded to 0/1, and saved in the required format for submission.</li>
        </ul>
    </li>
</ul>

<hr style="border:1px solid #ccc; margin:20px 0;">

<h2 style="color:#3F51B5; font-size:28px;">Possible Improvements</h2>

<ol style="font-size:16px; color:#555; line-height:1.8;">
    <li><b>Hyperparameter Tuning</b>:  
        <span>Leverage techniques like <b>Grid Search</b> or <b>Optuna</b> for more refined model tuning.</span>
    </li>
    <li><b>Feature Engineering</b>:  
        <span>Explore new features, create interaction terms, or apply dimensionality reduction techniques.</span>
    </li>
    <li><b>Advanced Threshold Optimization</b>:  
        <span>Implement sophisticated techniques to fine-tune the optimal threshold across folds.</span>
    </li>
</ol>
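
<p style="font-size:16px; line-height:1.6; color:#555;">
    One possible direction for the threshold item: choose a single cutoff from the pooled out-of-fold predictions rather than per fold. In the fold logs, the best cutoff is 0.05 in most folds largely because many thresholds tie at the same F1 and <code>max</code> returns the first grid value, so the per-fold cutoffs carry little information. The sketch below uses <code>precision_recall_curve</code>, which evaluates every distinct predicted probability as a candidate cutoff. The arrays are invented stand-ins for <code>ytrain</code> and <code>oof_preds_lgbm</code>.
</p>

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Stand-in arrays (invented); in the notebook these would be ytrain and oof_preds_lgbm.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
oof_probs = np.array([0.01, 0.02, 0.10, 0.60, 0.03, 0.20, 0.95, 0.40, 0.88, 0.70])

precision, recall, thresholds = precision_recall_curve(y_true, oof_probs)

# precision and recall have one more entry than thresholds: the last point
# (precision=1, recall=0) has no associated threshold, so drop it.
p, r = precision[:-1], recall[:-1]
f1_curve = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against 0 / 0

best = int(np.argmax(f1_curve))
best_cutoff = float(thresholds[best])
best_f1 = float(f1_curve[best])

# Cross-check: predicting 1 when prob >= cutoff should reproduce the same F1.
check_f1 = f1_score(y_true, (oof_probs >= best_cutoff).astype(int))

print(f"Best cutoff = {best_cutoff:.2f} | F1 = {best_f1:.4f} | check = {check_f1:.4f}")
```

<p style="font-size:16px; line-height:1.6; color:#555;">
    A cutoff tuned and scored on the same out-of-fold predictions gives an optimistic F1, so it would be safer to treat the resulting number as an upper bound or to tune the cutoff inside a nested split.
</p>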

<hr style="border:1px solid #ccc; margin:20px 0;">

<h2 style="color:#009688; font-size:28px;">Final Thoughts</h2>

<p style="font-size:16px; line-height:1.6; color:#555;">
    This solution is a strong starting point, and I'm excited to see how it performs on the leaderboard. The focus was on building a clean and efficient pipeline that could be iteratively improved upon.
</p>

<p style="font-size:16px; line-height:1.6; color:#555;">
    If you found this notebook helpful or insightful, an upvote would be greatly appreciated! 
</p>

<p style="font-size:16px; font-style:italic; color:#555;">
    Good luck to everyone participating in the competition. Let's keep learning and building! 
</p>
