<a href="https://colab.research.google.com/github/RDGopal/IB9AU-2026/blob/main/L5_LLMs_Structured_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Creditworthiness: A FinTech Application

In this notebook, we'll use structured (tabular) data to predict creditworthiness, a core FinTech task. We'll train a Random Forest classifier and then compare it with TabPFN, a pre-trained model for tabular data.

## The Ubiquity of Structured Data

Despite the rise of unstructured data sources like text and images, structured data remains the backbone of many analytical and operational systems. This type of data, organized into tables with rows and columns, is found everywhere, from financial transactions and customer records to scientific measurements and inventory management. Its clear, predefined format makes it ideal for traditional database systems and powerful machine learning algorithms.

## Structured Data in FinTech

In FinTech (Financial Technology), structured data is paramount. Banks, lending institutions, and investment firms rely heavily on structured datasets to:
*   **Assess Credit Risk:** Determine the likelihood of a borrower defaulting on a loan.
*   **Detect Fraud:** Identify suspicious patterns in transactions.
*   **Personalize Services:** Offer tailored financial products to customers.
*   **Optimize Investments:** Analyze market trends and portfolio performance.

## Case Study: German Credit Dataset

We will dive into the **German Credit Dataset** (Statlog), a classic dataset from the UCI Machine Learning Repository that is widely used for credit risk assessment. It contains 1,000 credit applicants, each described by 20 features, along with a target variable indicating their creditworthiness.

Our objective is to build a predictive model to classify applicants into one of two classes, which the UCI version of the dataset codes as:
*   **1: Good Credit Risk**
*   **2: Bad Credit Risk**

## Our Approach: Random Forest Classifier

To achieve this, we will employ a traditional yet highly effective machine learning algorithm: the **Random Forest Classifier**. This ensemble learning method combines multiple decision trees to produce a more robust and accurate prediction. By leveraging this algorithm, we aim to demonstrate how structured data and machine learning can be used to make informed decisions in a real-world FinTech scenario.
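
To make the "combines multiple decision trees" idea concrete, here is a minimal sketch on synthetic stand-in data (not the German Credit Dataset). It checks that scikit-learn's forest probabilities are the average of the probabilities predicted by its individual trees. The variable names and the dataset size are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Collect the class probabilities predicted by each individual tree
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average over the trees (axis 0 indexes the trees)
mean_tree_proba = tree_probas.mean(axis=0)

# Compare the forest's own probabilities with the per-tree average
print(np.allclose(forest.predict_proba(X_demo), mean_tree_proba))
```

Averaging many trees, each fit on a different bootstrap sample, is what reduces the variance of a single deep tree.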

First, we need to install the `ucimlrepo` library, which allows us to easily fetch datasets from the UCI Machine Learning Repository, where the German Credit Dataset is hosted.

In [None]:
!pip install ucimlrepo

Now, we'll use the `ucimlrepo` library to fetch the German Credit Dataset. We'll separate the features (`X`) from the target variable (`y`). We'll also print the dataset's metadata and variable information to understand its structure and contents better.

In [None]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
statlog_german_credit_data = fetch_ucirepo(id=144)

# data (as pandas dataframes)
X = statlog_german_credit_data.data.features
y = statlog_german_credit_data.data.targets

# metadata
print(statlog_german_credit_data.metadata)

# variable information
print(statlog_german_credit_data.variables)


Next, we preprocess the data and then train and evaluate our Random Forest model. The full workflow involves:
1.  **Identifying Categorical Columns:** Finding columns with non-numeric (object) data types.
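2.  **One-Hot Encoding:** Converting these categorical columns into a numerical format that machine learning algorithms can understand. We use `drop_first=True` to drop one redundant dummy column per feature. This avoids perfect multicollinearity, which matters mainly for linear models and is optional for tree ensembles such as Random Forest.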
3.  **Splitting Data:** Dividing the dataset into training and testing sets to evaluate the model's performance on unseen data.
4.  **Training the Model:** Initializing and fitting a `RandomForestClassifier` to our training data.
5.  **Making Predictions:** Using the trained model to predict creditworthiness on the test set.
6.  **Evaluating the Model:** Assessing the model's accuracy and generating a classification report to understand its precision, recall, and F1-score for each class.
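
As a small illustration of step 2, the sketch below one-hot encodes a toy frame. The column names and category values are hypothetical and are not taken from the German Credit Dataset.

```python
import pandas as pd

# Toy frame with hypothetical column names and categories
toy = pd.DataFrame({
    "purpose": ["car", "education", "car", "furniture"],
    "duration": [12, 24, 36, 6],
})

# Encode the categorical column; drop_first=True removes the first
# category in sorted order ("car"), which becomes the implicit baseline
toy_encoded = pd.get_dummies(toy, columns=["purpose"], drop_first=True)

print(list(toy_encoded.columns))
print(toy_encoded)
```

A row where every `purpose_*` column is false represents the dropped baseline category, so no information is lost.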

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Identify categorical columns in X
categorical_cols = X.select_dtypes(include=['object']).columns

# Apply one-hot encoding to categorical features
X_encoded = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.3, random_state=42)

# Initialize and train the Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train.values.ravel())

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:")
print(report)
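
Accuracy alone can hide weak performance on the bad-credit class, which is why the classification report matters. The sketch below uses made-up labels (not model output) and assumes the UCI coding of 1 = good and 2 = bad to compute the confusion matrix and the precision and recall for the bad-credit class.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels for illustration: 6 good (1) and 4 bad (2) applicants
y_true_toy = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred_toy = [1, 1, 1, 1, 1, 2, 1, 1, 2, 2]

# Rows are true classes, columns are predicted classes, ordered [1, 2]
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[1, 2])

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)

# Treat class 2 (bad credit) as the class of interest
bad_precision = precision_score(y_true_toy, y_pred_toy, pos_label=2)
bad_recall = recall_score(y_true_toy, y_pred_toy, pos_label=2)

print(cm)
print(f"accuracy={toy_accuracy:.2f}, "
      f"bad precision={bad_precision:.2f}, bad recall={bad_recall:.2f}")
```

Here 70% of predictions are correct, yet only half of the bad-credit applicants are caught. In lending, that kind of miss is usually the costly one.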

# TabPFN (Tabular Prior-Data Fitted Network)
TabPFN is a foundation model for tabular data. Unlike traditional methods (such as Gradient-Boosted Decision Trees), which must be trained from scratch on every new dataset, TabPFN is a pre-trained Transformer: calling `fit` supplies the labelled training rows as context, and predictions for the test rows come from a single forward pass with no dataset-specific weight updates.

Link to learn more: https://github.com/PriorLabs/TabPFN

Next, we'll install `tabpfn-client`, which runs TabPFN through a hosted API, and `tabpfn-extensions`. Because TabPFN learns in context from the training rows, it often performs strongly without extensive hyperparameter tuning, which makes it an interesting alternative to traditional models.

In [None]:
## TabPFN Installation optimized for Google Colab
# Install the TabPFN Client library
!uv pip install tabpfn-client

# Install TabPFN extensions for additional functionalities
!uv pip install "tabpfn-extensions[all]"

With the TabPFN libraries installed, we can now import the `TabPFNClassifier` for use in our notebook.

In [None]:
# Simple import for TabPFN
from tabpfn_client import TabPFNClassifier

# Now you can use it like any other sklearn classifier
# model = TabPFNClassifier()
print("TabPFNClassifier imported successfully.")

Similar to the Random Forest model, we'll train and evaluate the `TabPFNClassifier`. This involves:
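1.  **Splitting Data:** We'll split the original, un-encoded `X` and `y` again, passing the raw features to TabPFN without one-hot encoding. Reusing `test_size=0.30` and `random_state=42` selects the same test rows as the Random Forest split, so the two models are compared on the same applicants.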
2.  **Training the Classifier:** Initializing and fitting the `TabPFNClassifier` to the training data.
3.  **Making Predictions:** Generating both probability predictions (`predict_proba`) and class predictions (`predict`) on the test set.
4.  **Evaluating the Model:** Calculating the ROC AUC score, accuracy, and a classification report to understand TabPFN's performance, especially for binary classification tasks.
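
The split in the next cell reuses the `test_size` and `random_state` values from the Random Forest cell. The sketch below, on a toy frame with hypothetical columns, checks that `train_test_split` then selects the same rows whether it is given the raw or the one-hot encoded features.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with hypothetical columns; 10 rows
raw = pd.DataFrame({"purpose": list("abcabcabca"), "amount": range(10)})
encoded = pd.get_dummies(raw, columns=["purpose"], drop_first=True)
target = pd.Series([1, 2] * 5)

# Same number of rows + same random_state -> same shuffled row order
_, raw_test, _, _ = train_test_split(raw, target, test_size=0.30, random_state=42)
_, enc_test, _, _ = train_test_split(encoded, target, test_size=0.30, random_state=42)

print(raw_test.index.equals(enc_test.index))
```

Evaluating both models on identical test rows keeps the comparison fair: any difference in scores comes from the models and not from the sampling.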

***The first time you run this cell, you will be asked to create an account.***

In [None]:
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from tabpfn_client import TabPFNClassifier

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Train and evaluate the TabPFN classifier
tabpfn_classifier = TabPFNClassifier(random_state=42)
# Flatten the single-column target DataFrame to 1-D, as in the Random Forest cell
tabpfn_classifier.fit(X_train, y_train.values.ravel())
y_pred_proba = tabpfn_classifier.predict_proba(X_test)
y_pred = tabpfn_classifier.predict(X_test)

# Calculate the ROC AUC score
roc_auc = roc_auc_score(y_test, y_pred_proba[:, 1])
print(f"TabPFN ROC AUC Score: {roc_auc:.4f}")


# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:")
print(report)
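
A note on reading the ROC AUC above: with labels coded 1 and 2, `roc_auc_score` treats the larger label (2, bad credit) as the positive class, so the scores passed to it should be the probabilities of class 2. The sketch below uses made-up probabilities and assumes the classifier orders its `predict_proba` columns by sorted class label, as scikit-learn estimators do.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and probabilities; columns are assumed to be [P(class 1), P(class 2)]
y_true_toy = np.array([1, 1, 2, 2])
proba_toy = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.65, 0.35],
    [0.20, 0.80],
])

# Column 1 holds P(class 2), matching the class treated as positive
auc_with_class2_scores = roc_auc_score(y_true_toy, proba_toy[:, 1])

# Passing column 0 instead silently inverts the ranking
auc_with_class1_scores = roc_auc_score(y_true_toy, proba_toy[:, 0])

print(auc_with_class2_scores, auc_with_class1_scores)
```

If an AUC ever comes out well below 0.5, a mismatched probability column is a likely cause.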

# Required Task 6
The code below downloads a dataset for predicting credit card default. The outcome variable of interest is `default.payment.next.month`. Use two traditional machine learning algorithms (Random Forest, XGBoost, etc.) and TabPFN to predict the outcome. Use a test set of 25% of the data. How well does TabPFN perform compared with the traditional algorithms?

In [None]:
import kagglehub
import pandas as pd
import os

dataset_path = kagglehub.dataset_download("uciml/default-of-credit-card-clients-dataset")

# List contents of the downloaded dataset directory to find the data file
files_in_dataset = os.listdir(dataset_path)
print(f"Files in the dataset directory: {files_in_dataset}")

# Assuming the main data file is a CSV and is directly in the downloaded path
# You might need to adjust the filename if it's different or in a subdirectory
# For this dataset, it's typically 'UCI_Credit_Card.csv'

# Construct the full path to the CSV file
csv_file_name = 'UCI_Credit_Card.csv'
full_csv_path = os.path.join(dataset_path, csv_file_name)

# Load the CSV into a pandas DataFrame
df = pd.read_csv(full_csv_path)

print("Data loaded into DataFrame 'df' successfully.")
print(df.head())

In [None]:
df
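
One possible way to organize the comparison is a small helper that evaluates several classifiers on one shared 25% test split. This is only a sketch under stated assumptions: `compare_models` is a hypothetical helper, the data is synthetic stand-in data (not `df`), and stratifying the split is a design choice that the task does not require.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def compare_models(models, X, y, test_size=0.25, random_state=42):
    """Fit each model on one shared split and return its test ROC AUC."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # Column 1 is the probability of the larger class label
        scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return scores


# Synthetic stand-in data; replace with features and target built from df
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

demo_models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}
demo_scores = compare_models(demo_models, X_demo, y_demo)
print(demo_scores)
```

Any classifier that exposes `fit` and `predict_proba` in the scikit-learn style could be added as another dictionary entry, so all models are scored on exactly the same test rows.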