<a href="https://colab.research.google.com/github/SohailVibeCoder/IB9AU---GenAI/blob/main/Task_6_L5_LLMs_Structured_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Creditworthiness: A FinTech Application

In this notebook, we build credit-risk classifiers on structured (tabular) data and compare a traditional Random Forest with TabPFN, a pre-trained model for tabular data, in a FinTech setting.

## The Ubiquity of Structured Data

Despite the rise of unstructured data sources like text and images, structured data remains the backbone of many analytical and operational systems. This type of data, organized into tables with rows and columns, is found everywhere, from financial transactions and customer records to scientific measurements and inventory management. Its clear, predefined format makes it ideal for traditional database systems and powerful machine learning algorithms.

## Structured Data in FinTech

In FinTech (Financial Technology), structured data is paramount. Banks, lending institutions, and investment firms rely heavily on structured datasets to:
*   **Assess Credit Risk:** Determine the likelihood of a borrower defaulting on a loan.
*   **Detect Fraud:** Identify suspicious patterns in transactions.
*   **Personalize Services:** Offer tailored financial products to customers.
*   **Optimize Investments:** Analyze market trends and portfolio performance.

## Case Study: German Credit Dataset

We will dive into the **German Credit Dataset** (Statlog), a classic dataset hosted by the UCI Machine Learning Repository that is widely used for credit risk assessment. It contains 1,000 applicants described by 20 features, along with a target variable indicating their creditworthiness.

Our objective is to build a predictive model to classify applicants as either:
*   **1: Good Credit Risk**
*   **2: Bad Credit Risk**

## Our Approach: Random Forest Classifier

To achieve this, we will employ a traditional yet highly effective machine learning algorithm: the **Random Forest Classifier**. This ensemble learning method combines multiple decision trees to produce a more robust and accurate prediction. By leveraging this algorithm, we aim to demonstrate how structured data and machine learning can be used to make informed decisions in a real-world FinTech scenario.
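To make the "combines multiple decision trees" idea concrete, here is a minimal sketch on synthetic stand-in data (not the German Credit Dataset). It relies on scikit-learn's documented behaviour that a forest's predicted class probabilities are the mean of its trees' probabilities; the names `X_demo`, `y_demo`, and `forest` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, used only to illustrate the ensemble mechanism
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Collect the class probabilities predicted by each individual tree
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average across trees (axis 0 indexes the trees)
averaged = tree_probas.mean(axis=0)

# Compare the manual average with the forest's own probabilities
matches_forest = np.allclose(averaged, forest.predict_proba(X_demo))
print(f"Per-tree average matches forest output: {matches_forest}")
```

Averaging many trees, each grown on a bootstrap sample with random feature subsets, is what reduces the variance of any single tree.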

First, we need to install the `ucimlrepo` library, which allows us to easily fetch datasets from the UCI Machine Learning Repository, where the German Credit Dataset is hosted.

In [None]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


Now, we'll use the `ucimlrepo` library to fetch the German Credit Dataset. We'll separate the features (`X`) from the target variable (`y`). We'll also print the dataset's metadata and variable information to understand its structure and contents better.

In [None]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
statlog_german_credit_data = fetch_ucirepo(id=144)

# data (as pandas dataframes)
X = statlog_german_credit_data.data.features
y = statlog_german_credit_data.data.targets

# metadata
print(statlog_german_credit_data.metadata)

# variable information
print(statlog_german_credit_data.variables)


{'uci_id': 144, 'name': 'Statlog (German Credit Data)', 'repository_url': 'https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data', 'data_url': 'https://archive.ics.uci.edu/static/public/144/data.csv', 'abstract': 'This dataset classifies people described by a set of attributes as good or bad credit risks. Comes in two formats (one all numeric). Also comes with a cost matrix', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 1000, 'num_features': 20, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Other', 'Marital Status', 'Age', 'Occupation'], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1994, 'last_updated': 'Thu Aug 10 2023', 'dataset_doi': '10.24432/C5NC77', 'creators': ['Hans Hofmann'], 'intro_paper': None, 'additional_info': {'summary': 'Two datasets are provided.  the original dataset, in the form provided by

Before training our Random Forest model, we need to preprocess the data. This involves:
1.  **Identifying Categorical Columns:** Finding columns with non-numeric (object) data types.
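2.  **One-Hot Encoding:** Converting these categorical columns into a numerical format that machine learning algorithms can understand. We use `drop_first=True` to drop one redundant indicator column per feature; this avoids multicollinearity, which matters mainly for linear models rather than tree ensembles.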
3.  **Splitting Data:** Dividing the dataset into training and testing sets to evaluate the model's performance on unseen data.
4.  **Training the Model:** Initializing and fitting a `RandomForestClassifier` to our training data.
5.  **Making Predictions:** Using the trained model to predict creditworthiness on the test set.
6.  **Evaluating the Model:** Assessing the model's accuracy and generating a classification report to understand its precision, recall, and F1-score for each class.
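As a small illustration of steps 1 and 2, the sketch below one-hot encodes a toy frame. The column names and values are hypothetical and are not taken from the German Credit Dataset.

```python
import pandas as pd

# Hypothetical toy frame; "housing" is stored as object dtype on purpose
toy = pd.DataFrame({
    "housing": ["own", "rent", "free", "own"],
    "duration": [12, 24, 36, 6],
}).astype({"housing": object})

# Step 1: find the non-numeric (object) columns
toy_categorical_cols = toy.select_dtypes(include=["object"]).columns

# Step 2: one-hot encode them, dropping the first level of each feature
toy_encoded = pd.get_dummies(toy, columns=toy_categorical_cols, drop_first=True)

print(list(toy_encoded.columns))
```

With three housing levels, only two indicator columns remain: the dropped level (the first in sorted order, `free`) is implied when both indicators are zero.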

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Identify categorical columns in X
categorical_cols = X.select_dtypes(include=['object']).columns

# Apply one-hot encoding to categorical features
X_encoded = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.3, random_state=42)

# Initialize and train the Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train.values.ravel())

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:")
print(report)

Accuracy: 0.7533
Classification Report:
              precision    recall  f1-score   support

           1       0.76      0.94      0.84       209
           2       0.71      0.32      0.44        91

    accuracy                           0.75       300
   macro avg       0.73      0.63      0.64       300
weighted avg       0.74      0.75      0.72       300
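
Note that the report uses the dataset's own labels, 1 (good) and 2 (bad). A recall of 0.32 for class 2 means that roughly a third of the bad risks in the test set were identified. The sketch below, on hypothetical toy labels rather than the notebook's predictions, shows how precision and recall for class 2 follow from a confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels using the dataset's coding: 1 = good, 2 = bad
y_true_toy = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2])
y_pred_toy = np.array([1, 1, 1, 1, 1, 2, 1, 1, 2, 2])

# Rows are true classes, columns are predicted classes, in the order [1, 2]
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[1, 2])

tp_bad = cm[1, 1]  # bad risks predicted as bad
fn_bad = cm[1, 0]  # bad risks predicted as good
fp_bad = cm[0, 1]  # good risks predicted as bad

recall_bad = tp_bad / (tp_bad + fn_bad)
precision_bad = tp_bad / (tp_bad + fp_bad)

print(f"Recall (class 2): {recall_bad:.2f}")
print(f"Precision (class 2): {precision_bad:.2f}")
```

In a lending context the missed bad risks (`fn_bad`) are usually the costly errors, which is why recall for class 2 deserves as much attention as overall accuracy.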



# TabPFN (Tabular Prior-Data Fitted Network)
TabPFN is a foundation model for tabular data that fundamentally changes how machine learning models are trained and deployed on structured datasets. Unlike traditional methods (such as Gradient-Boosted Decision Trees) that must be trained from scratch on every new dataset, TabPFN is a pre-trained Transformer that generates predictions in a single forward pass.

Link to learn more: https://github.com/PriorLabs/TabPFN

Next, we'll install `tabpfn-client` and `tabpfn-extensions`. TabPFN is a novel deep learning model for tabular data that acts as an in-context learner. It can often provide strong performance without extensive hyperparameter tuning, making it an interesting alternative to traditional models.

In [None]:
## TabPFN Installation optimized for Google Colab
# Install the TabPFN Client library
!uv pip install tabpfn-client

# Install TabPFN extensions for additional functionalities
# Quote the requirement so the shell does not try to expand the brackets
!uv pip install "tabpfn-extensions[all]"

Using Python 3.12.12 environment at: /usr
Resolved 33 packages in 411ms
Prepared 5 packages in 102ms
Uninstalled 1 package in 11ms
Installed 5 packages in 15ms
 + backoff==2.2.1
 + password-strength==0.0.3.post2
 + sseclient-py==1.8.0
 + tabpfn-client==0.2.8
 - tqdm==4.67.2
 + tqdm==4.67.1
Using Python 3.12.12 environment at: /usr
Resolved 87 packages in 1.61s
Prepared 19 packages in 1.67s
Uninstalled 1 package in 2ms
Installed 19 packages in 191ms
 + autogluon-common==1.4.0
 + autogluon-core==1.4.0
 + autogluon-features==1.4.0


With the TabPFN libraries installed, we can now import the `TabPFNClassifier` for use in our notebook.

In [None]:
# Simple import for TabPFN
from tabpfn_client import TabPFNClassifier

# Now you can use it like any other sklearn classifier
# model = TabPFNClassifier()
print("TabPFNClassifier imported successfully.")

TabPFNClassifier imported successfully.


Similar to the Random Forest model, we'll train and evaluate the `TabPFNClassifier`. This involves:
1.  **Splitting Data:** We'll split the original `X` and `y` data again to ensure a clean split for TabPFN.
2.  **Training the Classifier:** Initializing and fitting the `TabPFNClassifier` to the training data.
3.  **Making Predictions:** Generating both probability predictions (`predict_proba`) and class predictions (`predict`) on the test set.
4.  **Evaluating the Model:** Calculating the ROC AUC score, accuracy, and a classification report to understand TabPFN's performance, especially for binary classification tasks.
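
Because the target here is coded 1 and 2 rather than 0 and 1, it helps to be explicit about which probability column feeds the ROC AUC. The sketch below uses synthetic data and a `LogisticRegression` stand-in (not TabPFN) to show the scikit-learn convention: `classes_` is sorted, and for a binary target `roc_auc_score` treats the larger label as the positive class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data, recoded so the labels are 1 and 2
X_demo, y01 = make_classification(n_samples=400, n_features=6, random_state=0)
y_demo = y01 + 1

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)

# Find the column that holds the probability of label 2
pos_col = list(clf.classes_).index(2)

# Scoring with the class-2 column gives the AUC for detecting class 2
auc_class2 = roc_auc_score(y_demo, proba[:, pos_col])

# Scoring with the class-1 column instead mirrors the result around 0.5
auc_wrong_column = roc_auc_score(y_demo, proba[:, 1 - pos_col])

print(f"Column for label 2: {pos_col}")
print(f"AUC with class-2 column: {auc_class2:.4f}")
print(f"AUC with class-1 column: {auc_wrong_column:.4f}")
```

Looking the column up through `classes_` avoids silently reporting one minus the intended AUC if the label order ever differs from what is assumed.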

***The first time you run this, you will be asked to create an account.***

In [None]:
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from tabpfn_client import TabPFNClassifier

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

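# Train and evaluate the TabPFN classifier
tabpfn_classifier = TabPFNClassifier(random_state=42)
# Pass the target as a 1-D array; y_train is a single-column DataFrame
tabpfn_classifier.fit(X_train, y_train.values.ravel())
y_pred_proba = tabpfn_classifier.predict_proba(X_test)
y_pred = tabpfn_classifier.predict(X_test)

# Calculate the ROC AUC score.
# Assuming classes are ordered as in scikit-learn, column 1 holds the
# probability of the larger label (class 2), which roc_auc_score also
# treats as the positive class for a binary target.
roc_auc = roc_auc_score(y_test, y_pred_proba[:, 1])
print(f"TabPFN ROC AUC Score: {roc_auc:.4f}")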


# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:")
print(report)


TabPFN ROC AUC Score: 0.8182
Accuracy: 0.7800
Classification Report:
              precision    recall  f1-score   support

           1       0.80      0.91      0.85       209
           2       0.70      0.48      0.57        91

    accuracy                           0.78       300
   macro avg       0.75      0.70      0.71       300
weighted avg       0.77      0.78      0.77       300






# Required Task 6
The code below downloads a dataset for predicting credit card default. The outcome variable of interest is `default.payment.next.month`. Use two traditional machine learning algorithms (Random Forest, XGBoost, etc.) and TabPFN to predict the outcome, holding out 25% of the data as a test set. How well does TabPFN perform compared with the traditional algorithms?

In [None]:
import kagglehub
import pandas as pd
import os

dataset_path = kagglehub.dataset_download("uciml/default-of-credit-card-clients-dataset")

# List contents of the downloaded dataset directory to find the data file
files_in_dataset = os.listdir(dataset_path)
print(f"Files in the dataset directory: {files_in_dataset}")

# Assuming the main data file is a CSV and is directly in the downloaded path
# You might need to adjust the filename if it's different or in a subdirectory
# For this dataset, it's typically 'UCI_Credit_Card.csv'

# Construct the full path to the CSV file
csv_file_name = 'UCI_Credit_Card.csv'
full_csv_path = os.path.join(dataset_path, csv_file_name)

# Load the CSV into a pandas DataFrame
df = pd.read_csv(full_csv_path)

print("Data loaded into DataFrame 'df' successfully.")
print(df.head())

Downloading from https://www.kaggle.com/api/v1/datasets/download/uciml/default-of-credit-card-clients-dataset?dataset_version_number=1...


100%|██████████| 0.98M/0.98M [00:00<00:00, 88.8MB/s]

Extracting files...
Files in the dataset directory: ['UCI_Credit_Card.csv']
Data loaded into DataFrame 'df' successfully.
   ID  LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3  PAY_4  \
0   1    20000.0    2          2         1   24      2      2     -1     -1   
1   2   120000.0    2          2         2   26     -1      2      0      0   
2   3    90000.0    2          2         2   34      0      0      0      0   
3   4    50000.0    2          2         1   37      0      0      0      0   
4   5    50000.0    1          2         1   57     -1      0     -1      0   

   ...  BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  PAY_AMT2  PAY_AMT3  \
0  ...        0.0        0.0        0.0       0.0     689.0       0.0   
1  ...     3272.0     3455.0     3261.0       0.0    1000.0    1000.0   
2  ...    14331.0    14948.0    15549.0    1518.0    1500.0    1000.0   
3  ...    28314.0    28959.0    29547.0    2000.0    2019.0    1200.0   
4  ...    20940.0    19146.0    19131.




In [None]:
df

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,1,20000.0,2,2,1,24,2,2,-1,-1,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,2,120000.0,2,2,2,26,-1,2,0,0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,3,90000.0,2,2,2,34,0,0,0,0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,4,50000.0,2,2,1,37,0,0,0,0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,5,50000.0,1,2,1,57,-1,0,-1,0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,29996,220000.0,1,3,1,39,0,0,0,0,...,88004.0,31237.0,15980.0,8500.0,20000.0,5003.0,3047.0,5000.0,1000.0,0
29996,29997,150000.0,1,3,2,43,-1,-1,-1,-1,...,8979.0,5190.0,0.0,1837.0,3526.0,8998.0,129.0,0.0,0.0,0
29997,29998,30000.0,1,2,2,37,4,3,2,-1,...,20878.0,20582.0,19357.0,0.0,0.0,22000.0,4200.0,2000.0,3100.0,1
29998,29999,80000.0,1,3,1,41,1,-1,0,0,...,52774.0,11855.0,48944.0,85900.0,3409.0,1178.0,1926.0,52964.0,1804.0,1


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from tabpfn_client import TabPFNClassifier
from sklearn.metrics import classification_report, accuracy_score

# 1. Prepare Data
# Remove ID and separate target
X = df.drop(['ID', 'default.payment.next.month'], axis=1)
y = df['default.payment.next.month']

# Split data (75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Scale features. This step is optional here: tree ensembles are insensitive
# to feature scale, and TabPFN performs its own internal preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Note: TabPFN is optimized for smaller datasets (<10k samples).
# If the dataset is large, we sub-sample for the TabPFN training step
# to ensure it fits within memory limits. train_test_split shuffles by
# default, so the first 5,000 rows form a random sample of the training set.
X_train_tab = X_train_scaled[:5000]
y_train_tab = y_train.iloc[:5000]

# 2. Initialize Models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
xgb = XGBClassifier(eval_metric='logloss', random_state=42)
tabpfn = TabPFNClassifier(random_state=42)

# 3. Training and Prediction
models = {'Random Forest': rf, 'XGBoost': xgb, 'TabPFN': tabpfn}
results = {}

# Convert the targets to 1-D NumPy arrays once, before the loop
y_train_fit = y_train.to_numpy().ravel()
y_train_tab_fit = y_train_tab.to_numpy().ravel()

for name, model in models.items():
    if name == 'TabPFN':
        # TabPFN is fitted on the 5,000-row subsample only
        model.fit(X_train_tab, y_train_tab_fit)
    else:
        # Random Forest and XGBoost use the full training set
        model.fit(X_train_scaled, y_train_fit)

    # All models are evaluated on the same held-out test set
    preds = model.predict(X_test_scaled)
    results[name] = {
        'Accuracy': accuracy_score(y_test, preds),
        'Report': classification_report(y_test, preds)
    }

# 4. Output Comparison
for name, metrics in results.items():
    print(f"--- {name} ---")
    print(f"Accuracy: {metrics['Accuracy']:.4f}")
    print(metrics['Report'])


--- Random Forest ---
Accuracy: 0.8155
              precision    recall  f1-score   support

           0       0.84      0.94      0.89      5873
           1       0.63      0.36      0.46      1627

    accuracy                           0.82      7500
   macro avg       0.74      0.65      0.67      7500
weighted avg       0.80      0.82      0.80      7500

--- XGBoost ---
Accuracy: 0.8127
              precision    recall  f1-score   support

           0       0.84      0.94      0.89      5873
           1       0.62      0.36      0.46      1627

    accuracy                           0.81      7500
   macro avg       0.73      0.65      0.67      7500
weighted avg       0.79      0.81      0.79      7500

--- TabPFN ---
Accuracy: 0.8223
              precision    recall  f1-score   support

           0       0.84      0.95      0.89      5873
           1       0.67      0.35      0.46      1627

    accuracy                           0.82      7500
   macro avg       0.76 
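
The supports in the reports above (5,873 non-defaults among 7,500 test rows) imply that always predicting class 0 would already reach about 0.783 accuracy, so the three accuracies sit only a few points above that baseline. A threshold-free metric such as ROC AUC, which the German Credit section reports for TabPFN, can make the comparison more informative. The sketch below shows the pattern on synthetic imbalanced data with scikit-learn stand-ins; the data, the `GradientBoostingClassifier` substitute for XGBoost, and the stratified split are all assumptions for illustration, and none of the printed numbers describe the credit card dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Majority-class accuracy implied by the supports reported above
baseline_from_report = 5873 / 7500
print(f"Majority-class accuracy implied by the report: {baseline_from_report:.3f}")

# Synthetic imbalanced stand-in data (roughly 78% negatives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.78, 0.22], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

candidates = {
    "Majority baseline": DummyClassifier(strategy="most_frequent"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

summary = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    summary[name] = {
        "accuracy": accuracy_score(y_te, clf.predict(X_te)),
        # Column 1 is the probability of the positive class (label 1)
        "roc_auc": roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]),
    }

for name, metrics in summary.items():
    print(f"{name}: accuracy={metrics['accuracy']:.4f}, roc_auc={metrics['roc_auc']:.4f}")
```

The same loop would apply to the Task 6 models, since each of them exposes `predict_proba`; the baseline row gives a reference point for judging how much each model adds.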

In [3]:
import json

# 1. Provide your current notebook filename
# If you haven't renamed it, it's usually 'Untitled.ipynb'
input_file = 'Task 6 -  L5_LLMs_Structured_Data.ipynb'
output_file = 'SE Task 6 -  L5_LLMs_Structured_Data.ipynb'  # no trailing space in the filename

with open(input_file, 'r', encoding='utf-8') as f:
    nb = json.load(f)

# 2. Check if the metadata and widgets key exists
if 'metadata' in nb and 'widgets' in nb['metadata']:
    # Check for the specific 'state' key issue
    widget_data = nb['metadata']['widgets'].get('application/vnd.jupyter.widget-state+json', {})

    if 'state' not in widget_data:
        print("Found corrupted widget metadata. Cleaning...")
        # Option A: Add an empty state if you want to try and keep the structure
        # nb['metadata']['widgets']['application/vnd.jupyter.widget-state+json']['state'] = {}

        # Option B: Remove the widgets metadata entirely (Safest for GitHub rendering)
        del nb['metadata']['widgets']
        print("Success: Widget metadata removed. Cell outputs remain intact.")
    else:
        print("Widget 'state' key found. The error might be elsewhere.")
else:
    print("No widget metadata found in this file.")

# 3. Save the fixed version
with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(nb, f, indent=1)

print(f"Done! Please download '{output_file}' and try pushing that to GitHub.")

FileNotFoundError: [Errno 2] No such file or directory: 'Task 6 -  L5_LLMs_Structured_Data.ipynb'