# Link to World Bank Dataset Overview
https://financesone.worldbank.org/api-explorer?id=DS00975

# 📘 Loan Cancellation and Disbursement Behavior Prediction

## Problem Statement
The goal of this project is to predict loan cancellation and disbursement outcomes as indicators of a country’s ability to repay its loans.

## Topic
**Loan Cancellation and Disbursement Behavior Prediction**

---

## Phase 1: Data Acquisition & Preparation
- [x] Create a public repository on GitHub  
- [x] Connect the open World Bank dataset to a Jupyter notebook via the World Bank API
- [x] Implement checkpointed, resumable downloads to save progress and avoid API throttling
- [ ] Clean dataset and remove irrelevant columns  
- [ ] Impute missing values using a pretrained model  
- [ ] Mask sensitive data using models
- [x] Feature Engineering    

---

## Phase 2: Model Implementation
- [ ] Implement machine learning models to predict loan disbursement and cancellation behavior 
- [ ] Implement a Hugging Face tabular transformer-based model
- [ ] Evaluate performance using appropriate metrics (accuracy, F1-score, ROC-AUC)  
- [ ] Compare baseline models vs. advanced models (e.g., Hugging Face TabTransformer)  
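
The evaluation step in this phase could look roughly like the sketch below. It trains a logistic-regression baseline and reports accuracy, F1-score, and ROC-AUC. The data is a synthetic stand-in generated with scikit-learn's `make_classification`, not the World Bank dataset, and the choice of scikit-learn, the model, and the 80/20 class balance are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for a binary "loan cancelled" label (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

y_pred = baseline.predict(X_test)               # hard labels for accuracy and F1
y_score = baseline.predict_proba(X_test)[:, 1]  # positive-class probability for ROC-AUC

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}
print(metrics)
```

With an imbalanced label, accuracy alone can look good for a model that rarely predicts the minority class, which is why F1 and ROC-AUC are reported alongside it.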

---

## Phase 3: Model Deployment
- [ ] Deploy the trained model using **Hugging Face Spaces** or **Azure Machine Learning**  
- [ ] Create a simple web-based interface for predictions and visual insights  

---

## Phase 4: Dataset Expansion
- [ ] Merge an additional dataset (e.g., GDP data)  
- [ ] Join using the **World Bank Country Code**  
- [ ] Analyze the impact of macroeconomic indicators on loan outcomes  
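
A minimal sketch of the join described in this phase, using pandas. The frames are toy data, and the column names `country_code`, `year`, and `gdp_usd_bn` are hypothetical; the real loan and GDP tables may use different names and a different time granularity.

```python
import pandas as pd

# Toy loan records; values and column names are illustrative only.
loans = pd.DataFrame({
    "loan_number": ["L1", "L2", "L3"],
    "country_code": ["BR", "IN", "ZZ"],
    "year": [2019, 2020, 2020],
    "cancelled_amount": [0.0, 5.0, 1.0],
})

# Toy GDP table with one row per country and year (made-up numbers).
gdp = pd.DataFrame({
    "country_code": ["BR", "IN"],
    "year": [2019, 2020],
    "gdp_usd_bn": [100.0, 200.0],
})

merged = loans.merge(
    gdp,
    on=["country_code", "year"],
    how="left",              # keep every loan, even without a GDP match
    validate="many_to_one",  # raise if GDP has duplicate country-year keys
    indicator=True,          # adds `_merge` so unmatched rows can be audited
)

unmatched = merged.loc[merged["_merge"] == "left_only", "country_code"].unique()
print(merged)
print("No GDP match for:", list(unmatched))
```

A left join keeps loans whose country code has no GDP row, so coverage gaps show up as missing values rather than silently dropped loans.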

---

## 📈 Additional Topics to Explore
- **Loan Default Risk Prediction**  
- **Loan Repayment Time Forecasting**  
- **Regional Loan Portfolio Performance Analysis**
- **CI/CD With Versioning**
- **Containerization**

---

## Objective Summary
This project aims to build an interpretable AI model capable of predicting loan cancellation and disbursement behavior using historical World Bank data, regional attributes, and macroeconomic indicators. The findings will help assess each country’s ability to fulfill its loan obligations and guide data-driven lending decisions.


In [None]:
import requests, json, time, os
from tqdm import tqdm

def fetch_all_data():
    base_url = "https://datacatalogapi.worldbank.org/dexapps/fone/api/apiservice"
    dataset_id = "DS00975"
    resource_id = "RS00905"
    top = 1000
    page = 1
    all_data = []

    # 🔁 Resume if a partial file already exists
    if os.path.exists("worldbank_partial.json"):
        with open("worldbank_partial.json") as f:
            all_data = json.load(f)
        page = len(all_data)//top + 1
        print(f"Resuming from page {page} ({len(all_data):,} records already saved)")

    with tqdm(total=1472089, desc="Downloading rows", unit="rows") as pbar:
        pbar.update(len(all_data))
        while True:
            url = f"{base_url}?datasetId={dataset_id}&resourceId={resource_id}&top={top}&type=json&skip={top * (page - 1)}"
            try:
                r = requests.get(url, timeout=20)
                r.raise_for_status()
                data = r.json()
                if "data" not in data or not data["data"]:
                    break
                all_data.extend(data["data"])
                pbar.update(len(data["data"]))

                if page % 10 == 0:
                    with open("worldbank_partial.json", "w") as f:
                        json.dump(all_data, f)
                    print(f"Saved checkpoint at page {page} ({len(all_data):,} rows)")
                page += 1
                time.sleep(0.2)
            except requests.RequestException as e:
                print(f"Error on page {page}: {e}. Retrying in 10s.")
                time.sleep(10)
                continue

    # Final save
    with open("worldbank_loans_full.json", "w") as f:
        json.dump(all_data, f)
    print(f"\n✅ Download complete — {len(all_data):,} records saved.")
    return all_data

if __name__ == "__main__":
    all_data = fetch_all_data()
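
The download loop above retries a failed page every 10 seconds with no upper limit. One possible alternative is a bounded retry with exponential backoff, sketched below. The helper `fetch_with_retry` and its parameters are hypothetical and not part of the notebook; the demo uses a fake fetcher so it runs without network access.

```python
import time

def fetch_with_retry(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it succeeds or `max_retries` attempts have failed.

    The wait doubles after each failure: base_delay, 2*base_delay, 4*base_delay, ...
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except Exception as exc:  # in the downloader this would be requests.RequestException
            if attempt == max_retries:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)

# Demo: a fake fetcher that fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated timeout")
    return {"data": [1, 2, 3]}

delays = []
# Record the delays instead of actually sleeping, to keep the demo fast.
result = fetch_with_retry(flaky_fetch, sleep=delays.append)
print(result, delays)
```

In the downloader, `fetch` could wrap the `requests.get(...)` call for a single page, so a persistent outage ends the run with an error instead of looping indefinitely.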


In [None]:
import pandas as pd
df = pd.read_json("worldbank_loans_full.json")
df.to_parquet("worldbank_loans.parquet")


In [None]:
import pandas as pd
df = pd.read_parquet("worldbank_loans.parquet")


In [None]:
df.info()

In [None]:
df.head(5)

In [None]:
df.describe(include="all")

In [None]:
# Mean original principal per country, counting each loan once (top 20)
df.drop_duplicates(subset=["loan_number"]).groupby("country")["original_principal_amount"].mean().sort_values(ascending=False).head(20)
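
The deduplication above keeps whichever row for each `loan_number` comes first in the frame. If the dataset holds several period-end snapshots per loan (an assumption here, suggested by the `end_of_period` column), keeping the most recent snapshot may be preferable. A small sketch on toy data:

```python
import pandas as pd

# Toy data: two snapshots of loan A and one of loan B (illustrative values).
toy = pd.DataFrame({
    "loan_number": ["A", "A", "B"],
    "end_of_period": ["2020-12-31", "2021-12-31", "2021-06-30"],
    "disbursed_amount": [10.0, 25.0, 5.0],
})
toy["end_of_period"] = pd.to_datetime(toy["end_of_period"])

# Sort chronologically, then keep the last (most recent) row per loan.
latest = (
    toy.sort_values("end_of_period")
       .drop_duplicates(subset="loan_number", keep="last")
       .reset_index(drop=True)
)
print(latest)
```

For amounts that change over time, such as `disbursed_amount`, the latest snapshot reflects the current state of the loan, whereas the first row may be an early, partially disbursed record.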


In [None]:
df["loan_number"].describe()

In [None]:
df_clean = df[[
    "loan_number", "country", "region",
    "original_principal_amount", "disbursed_amount", "cancelled_amount",
    "interest_rate", "loan_status", "loan_type",
    "board_approval_date", "end_of_period",
    "borrowers_obligation", "due_to_ibrd", "undisbursed_amount",
    "project_name_"
]].copy()


In [None]:
df_clean.head(5)

In [None]:
# Treat a zero principal as missing so the ratios become NaN instead of inf
principal = df_clean["original_principal_amount"].replace(0, float("nan"))
df_clean["disbursement_ratio"] = df_clean["disbursed_amount"] / principal
df_clean["cancellation_ratio"] = df_clean["cancelled_amount"] / principal

df_clean.describe(include='all')