# Generate Synthetic Credit Application Data with OpenAI

This notebook generates a complete synthetic credit application dataset using OpenAI's GPT-4o mini model. We'll create structured data, unstructured text descriptions, and default outcomes all through AI generation.

## Main Objectives
- Use OpenAI API to generate realistic credit application data
- Create structured financial data through AI prompting
- Generate unstructured text descriptions for loan applications
- Estimate default outcomes from the model's token probabilities
- Store all generated data in JSON format

## Setup and Imports

In [None]:
import os
import json
import openai
from openai import OpenAI
import pandas as pd
import numpy as np
from datetime import datetime
import time
import random
import tiktoken

# Set random seed for reproducibility
random.seed(42)
np.random.seed(42)

# Initialize OpenAI client
from google.colab import userdata

client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))

print("Libraries imported and OpenAI client initialized successfully!")

Libraries imported and OpenAI client initialized successfully!
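The setup cell reads the key through `google.colab.userdata`, which only exists inside Colab. As a minimal sketch for running the notebook elsewhere, the helper below tries Colab secrets first and falls back to an environment variable. The name `load_api_key` is hypothetical and not part of the original notebook.

```python
import os


def load_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    key = None
    if userdata is not None:
        try:
            key = userdata.get(name)
        except Exception:
            key = None  # secret not defined or notebook access not granted

    # Fall back to a plain environment variable (assumed to share the same name)
    key = key or os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in Colab secrets or the environment")
    return key
```

With this in place the client could be created as `OpenAI(api_key=load_api_key())`.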


## Generate Structured Credit Application Data

First, let's define the reference data (loan purposes with typical amount ranges, and credit history levels) and generate the structured records one API call at a time:

In [None]:
import random
import json
import time
import openai
from google.colab import userdata
from tqdm import tqdm
# ── Static, domain-specific reference data ────────────────────────────────────
LOAN_PURPOSES = {
    "home_improvement":  {"typical_amount_range": (5_000, 50_000)},
    "debt_consolidation": {"typical_amount_range": (3_000, 30_000)},
    "business":          {"typical_amount_range": (10_000, 100_000)},
    "education":         {"typical_amount_range": (2_000, 25_000)},
    "medical":           {"typical_amount_range": (1_000, 15_000)},
    "vacation":          {"typical_amount_range": (2_000, 15_000)},
    "wedding":           {"typical_amount_range": (3_000, 25_000)},
    "car":               {"typical_amount_range": (5_000, 40_000)},
    "other":             {"typical_amount_range": (1_000, 20_000)}
}
CREDIT_HISTORY_LEVELS = [ "good", "fair", "poor", "terrible"]

# ── Low-level helper – generates ONE record ───────────────────────────────────
def generate_single_application(app_id: str, client: openai.OpenAI) -> dict | None:
    """
    Call the OpenAI Chat API once to create a single structured credit-application
    record whose applicant_id is fixed to `app_id`.  Returns the parsed dict or
    None on failure.
    """
    purpose_list = list(LOAN_PURPOSES.keys())
    amount_ranges = {k: v["typical_amount_range"] for k, v in LOAN_PURPOSES.items()}

    prompt = f"""
    Generate ONE realistic credit-application record in valid JSON (no markdown fences).
    Hard-set the field `applicant_id` to "{app_id}".  Vary between good and bad applications,
    in the moment the client is applying for a loan the economy can be also in a bad state. I need
    you to have an appropriate distribution of good and bad applications.


    Other requirements:

    • age: 18-80
    • income: realistic annual USD income
    • purpose: pick from {purpose_list}
    • loan_amount: respect these ranges {amount_ranges}
    • credit_history: one of {CREDIT_HISTORY_LEVELS}
    • employment_length: years (may be float, must be plausible given age)
    • debt_to_income: 0.00-1.00
    • location: US state abbreviation
    • education: none | high_school | some_college | bachelors | masters | phd
    • criminal_record: yes | no

    Maintain realistic correlations (e.g. higher income ↔ better credit, etc.).
    Return ONLY a single JSON object, no surrounding text.
    """

    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": "You are a financial data generator. Output valid JSON only."},
                {"role": "user", "content": prompt.strip()}
            ],
            temperature=0.7,
            max_tokens=800
        )

        raw = response.choices[0].message.content.strip()
        # Remove stray fences if the model adds them
        cleaned = raw.replace("```json", "").replace("```", "").strip()
        return json.loads(cleaned)

    except Exception as exc:
        print(f"[{app_id}] - Error: {exc}")
        return None

# ── High-level wrapper – generates N records sequentially ─────────────────────
def generate_structured_data_with_ai(
    n_samples: int = 100,
    start_index: int = 1,
    pause_secs: float = 0.6
) -> list[dict]:
    """
    Loop over `n_samples`, calling the API once per record so partial failures
    don’t spoil the whole batch.  Returns a list of successfully generated dicts.
    """
    client = openai.OpenAI(api_key=userdata.get('OPENAI_API_KEY'))
    records: list[dict] = []
    pbar = tqdm(range(start_index, start_index + n_samples))
    for i in pbar:
        pbar.set_description(f"Generating {i:06d}")
        pbar.refresh()
        app_id = f"APP_{i:06d}"
        record = generate_single_application(app_id, client)
        if record:
            records.append(record)
        else:
            print(f"[{app_id}] skipped due to error.")
        time.sleep(pause_secs)       # gentle pacing to avoid rate-limit spikes

    return records


print("Generating structured credit-application data one-by-one …")
structured_data = generate_structured_data_with_ai(n_samples=200)

if structured_data:
    print(f"\nSuccessfully generated {len(structured_data)} records.")
    print("Sample:")
    print(json.dumps(structured_data[0], indent=2))
else:
    print("No records generated.")


Generating structured credit-application data one-by-one …


Generating 000200: 100%|██████████| 200/200 [08:36<00:00,  2.58s/it]


✔️  Generated 200 records.
Sample:
{
  "applicant_id": "APP_000001",
  "age": 34,
  "income": 55000,
  "purpose": "debt_consolidation",
  "loan_amount": 15000,
  "credit_history": "fair",
  "employment_length": 8.5,
  "debt_to_income": 0.35,
  "location": "CA",
  "education": "bachelors",
  "criminal_record": "no"
}
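The prompt asks the model to respect the ranges and categories, but nothing in the cell above enforces them. A minimal sketch of a per-record check is shown below. The function name `validate_record` and the minimum working age of 16 are assumptions made for illustration, not part of the original notebook.

```python
def validate_record(record, amount_ranges, credit_levels, min_working_age=16):
    """Return a list of problems found in one record; an empty list means it passed."""
    problems = []

    age = record.get("age")
    if not isinstance(age, (int, float)) or not 18 <= age <= 80:
        problems.append(f"age out of range: {age!r}")

    purpose = record.get("purpose")
    if purpose not in amount_ranges:
        problems.append(f"unknown purpose: {purpose!r}")
    else:
        low, high = amount_ranges[purpose]
        amount = record.get("loan_amount")
        if not isinstance(amount, (int, float)) or not low <= amount <= high:
            problems.append(f"loan_amount {amount!r} outside {low}-{high} for {purpose}")

    if record.get("credit_history") not in credit_levels:
        problems.append(f"unknown credit_history: {record.get('credit_history')!r}")

    dti = record.get("debt_to_income")
    if not isinstance(dti, (int, float)) or not 0.0 <= dti <= 1.0:
        problems.append(f"debt_to_income out of range: {dti!r}")

    # Plausibility check: employment cannot exceed the years since the assumed working age
    years = record.get("employment_length")
    if isinstance(age, (int, float)) and isinstance(years, (int, float)):
        if years < 0 or years > age - min_working_age:
            problems.append(f"employment_length {years!r} implausible for age {age!r}")

    return problems


# Values copied from the sample record printed above
sample = {
    "applicant_id": "APP_000001", "age": 34, "income": 55000,
    "purpose": "debt_consolidation", "loan_amount": 15000,
    "credit_history": "fair", "employment_length": 8.5, "debt_to_income": 0.35,
}
ranges = {"debt_consolidation": (3_000, 30_000)}
levels = ["good", "fair", "poor", "terrible"]
print(validate_record(sample, ranges, levels))  # []
```

In the notebook, `ranges` would be built from `LOAN_PURPOSES` and `levels` would be `CREDIT_HISTORY_LEVELS`; records with a non-empty problem list could be dropped or regenerated.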





## Generate Unstructured Text Descriptions

Now let's use OpenAI to generate realistic loan application text descriptions:

In [48]:
def generate_text_descriptions_with_ai(structured_records):
    """Generate text descriptions for each credit application using OpenAI"""

    enhanced_records = []

    for i, record in enumerate(structured_records):
        # Create a detailed prompt for text generation
        prompt = f"""Write a realistic loan application description for this applicant:

        Age: {record['age']}
        Income: ${record['income']:,}
        Loan Amount: ${record['loan_amount']:,}
        Purpose: {record['purpose']}
        Credit History: {record['credit_history']}
        Employment Length: {record['employment_length']} years
        Debt-to-Income: {record['debt_to_income']:.2f}
        Education: {record['education']}
        Location: {record['location']}

        Generate a 2-3 sentence description that sounds like a real loan application explanation. Include:
        - Why they need the loan
        - Brief mention of their financial situation
        - Confidence in repayment ability

        Make it sound natural and personalized based on their characteristics."""

        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are writing loan application descriptions. Write realistic, personalized descriptions based on applicant characteristics."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.8,
                max_tokens=200
            )

            # Add text description to record
            enhanced_record = record.copy()
            enhanced_record['text_description'] = response.choices[0].message.content.strip()
            enhanced_records.append(enhanced_record)

            # Rate limiting
            time.sleep(0.1)

            if (i + 1) % 10 == 0:
                print(f"Generated {i + 1} text descriptions...")

        except Exception as e:
            print(f"Error generating text for record {i}: {e}")
            # Add record without text description
            enhanced_record = record.copy()
            enhanced_record['text_description'] = "Standard loan application request."
            enhanced_records.append(enhanced_record)

    return enhanced_records

# Generate text descriptions
print("Generating text descriptions for loan applications...")
enhanced_data = generate_text_descriptions_with_ai(structured_data)

print(f"Successfully generated text descriptions for {len(enhanced_data)} records")
print("\nSample text description:")
print(f"Purpose: {enhanced_data[0]['purpose']}")
print(f"Description: {enhanced_data[0]['text_description']}")

Generating text descriptions for loan applications...
Generated 10 text descriptions...
Generated 20 text descriptions...
Generated 30 text descriptions...
Generated 40 text descriptions...
Generated 50 text descriptions...
Generated 60 text descriptions...
Generated 70 text descriptions...
Generated 80 text descriptions...
Generated 90 text descriptions...
Generated 100 text descriptions...
Generated 110 text descriptions...
Generated 120 text descriptions...
Generated 130 text descriptions...
Generated 140 text descriptions...
Generated 150 text descriptions...
Generated 160 text descriptions...
Generated 170 text descriptions...
Generated 180 text descriptions...
Generated 190 text descriptions...
Generated 200 text descriptions...
Successfully generated text descriptions for 200 records

Sample text description:
Purpose: debt_consolidation
Description: I am seeking a loan of $15,000 for debt consolidation to help streamline my finances and reduce my monthly payments. With a stable 
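The loop above substitutes a placeholder description as soon as a single API call fails. One possible refinement, sketched here under the assumption that most failures are transient (rate limits, timeouts), is to retry with exponential backoff before falling back. `call_with_retries` is a hypothetical helper, not part of the OpenAI SDK.

```python
import random
import time


def call_with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, waiting base_delay * 2**attempt plus jitter between tries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller apply its own fallback
            # Exponential backoff with a little jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

Inside the loop, the request could then be wrapped as `call_with_retries(lambda: client.chat.completions.create(...))`, keeping the existing `except` branch as the final fallback.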

## Generate Default Outcomes Using AI

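Now let's use OpenAI to estimate default outcomes based on all available information. The model is asked to answer with a single token, D (default) or ND (no default), and the log-probabilities of those two tokens are normalized into a default probability: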

In [None]:
def get_token_ids(text, model="gpt-4o-mini"):
    """Get token IDs for given text"""
    # Use the encoding that matches the requested model
    encoding = tiktoken.encoding_for_model(model)
    return encoding.encode(text)

def generate_default_outcomes_with_token_probs(records):
    """Generate default outcomes using token probabilities for D and ND tokens"""

    # Get token IDs for D and ND
    encoding = tiktoken.encoding_for_model("gpt-4o-mini")
    d_token_id = encoding.encode("D")[0]  # Default token
    nd_token_id = encoding.encode("ND")[0]  # No Default token

    print(f"Token IDs: D={d_token_id}, ND={nd_token_id}")

    final_records = []

    for i, record in enumerate(records):

        # Randomly pick the state of the economy: good with 20% probability, bad with 80%
        state = random.choices(['good', 'bad'], weights=[0.2, 0.8])[0]

        # Create prompt for binary classification
        prompt = f"""Based on the following credit application, will this customer default on their loan?
        Take into account the fact that there are things that the customer might not be disclosing, or that even with average
        financial health it is possible to default in bad times.

        Assume that the current state of the economy is {state}

        Customer Profile:
        - Age: {record['age']}
        - Annual Income: ${record['income']:,}
        - Loan Amount: ${record['loan_amount']:,}
        - Purpose: {record['purpose']}
        - Credit History: {record['credit_history']}
        - Employment Length: {record['employment_length']} years
        - Debt-to-Income Ratio: {record['debt_to_income']:.2f}
        - Education: {record['education']}
        - Location: {record['location']}
        - Application Description: {record['text_description']}

        Consider standard credit risk factors:
        - Credit history quality
        - Debt-to-income ratio
        - Employment stability
        - Income level relative to loan amount
        - Loan purpose risk

        Respond with only one token: D (for default) or ND (for no default)."""

        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a credit risk analyst. Respond with only 'D' for default or 'ND' for no default."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.1,
                max_tokens=1,
                logprobs=True,
                top_logprobs=10
            )

            # Get the response token and its logprobs
            choice = response.choices[0]
            predicted_token = (choice.message.content or "").strip()

            # Reset per-record values so nothing carries over from a previous iteration
            d_logprob = None
            nd_logprob = None

            if choice.logprobs and choice.logprobs.content:
                # Get the first token's logprobs
                token_logprobs = choice.logprobs.content[0]

                # Check the top logprobs for D and ND tokens
                for top_logprob in token_logprobs.top_logprobs:
                    if top_logprob.token == "D":
                        d_logprob = top_logprob.logprob
                    elif top_logprob.token == "ND":
                        nd_logprob = top_logprob.logprob

                # If the predicted token is missing from the top logprobs, use its own logprob
                if predicted_token == "D" and d_logprob is None:
                    d_logprob = token_logprobs.logprob
                elif predicted_token == "ND" and nd_logprob is None:
                    nd_logprob = token_logprobs.logprob

            # Without both logprobs there is nothing to normalize; the except block
            # below then stores the fallback values instead of stale ones
            if d_logprob is None or nd_logprob is None:
                raise ValueError("Could not find logprobs for both 'D' and 'ND'")

            # Convert logprobs to probabilities and normalize over the two outcomes
            d_prob = np.exp(d_logprob)
            nd_prob = np.exp(nd_logprob)
            total_prob = d_prob + nd_prob
            default_probability = d_prob / total_prob
            no_default_probability = nd_prob / total_prob

            # Add results to record
            enhanced_record = record.copy()
            enhanced_record['default_probability'] = round(float(default_probability), 4)
            enhanced_record['no_default_probability'] = round(float(no_default_probability), 4)
            enhanced_record['predicted_token'] = predicted_token

            final_records.append(enhanced_record)

            # Rate limiting
            time.sleep(0.1)

            if (i + 1) % 10 == 0:
                print(f"Processed {i + 1}/{len(records)} records")

        except Exception as e:
            print(f"Error processing record {i}: {e}")
            # Add record with default values
            enhanced_record = record.copy()
            enhanced_record['default_probability'] = 0.1
            enhanced_record['no_default_probability'] = 0.9
            enhanced_record['predicted_token'] = "ND"
            final_records.append(enhanced_record)

    return final_records

# Generate default outcomes using token probabilities
print("Generating default outcomes using token probabilities...")
complete_data = generate_default_outcomes_with_token_probs(enhanced_data)

print(f"Successfully generated token-based risk assessments for {len(complete_data)} records")

# Display statistics
default_probs = [record['default_probability'] for record in complete_data]
predicted_tokens = [record['predicted_token'] for record in complete_data]

print(f"\nDefault Statistics:")
print(f"Average default probability: {np.mean(default_probs):.3f}")
print(f"Min/Max default probability: {np.min(default_probs):.3f} / {np.max(default_probs):.3f}")
print(f"Token predictions: D={predicted_tokens.count('D')}, ND={predicted_tokens.count('ND')}")

# Show sample with token probabilities
print(f"\nSample record with token probabilities:")
sample = complete_data[0]
print(f"Applicant: {sample['applicant_id']}")
print(f"Credit History: {sample['credit_history']}")
print(f"Predicted Token: {sample['predicted_token']}")
print(f"Default Probability: {sample['default_probability']:.3f}")
print(f"No Default Probability: {sample['no_default_probability']:.3f}")

final_data = complete_data.copy()


Generating default outcomes using token probabilities...
Token IDs: D=35, ND=17538
Processed 10/200 records
Processed 20/200 records
Processed 30/200 records
Processed 40/200 records
Processed 50/200 records
Processed 60/200 records
Processed 70/200 records
Processed 80/200 records
Processed 90/200 records
Processed 100/200 records
Processed 110/200 records
Processed 120/200 records
Processed 130/200 records
Processed 140/200 records
Processed 150/200 records
Processed 160/200 records
Processed 170/200 records
Processed 180/200 records
Processed 190/200 records
Processed 200/200 records
Successfully generated token-based risk assessments for 200 records

Default Statistics:
Average default probability: 0.195
Min/Max default probability: 0.000 / 1.000
Token predictions: D=39, ND=161

Sample record with token probabilities:
Applicant: APP_000001
Credit History: fair
Predicted Token: ND
Default Probability: 0.009
No Default Probability: 0.991
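The normalization used above is a two-way softmax over the log-probabilities of the D and ND tokens. As a small self-contained sketch, the function below reproduces that step with a worked number; `label_from_probability` and its 0.5 threshold are assumptions added for illustration, since the notebook itself stores only the probabilities and the predicted token.

```python
import math


def normalized_default_probability(d_logprob: float, nd_logprob: float) -> float:
    """P(D) renormalized over {D, ND}: exp(d) / (exp(d) + exp(nd))."""
    # Subtract the larger logprob before exponentiating for numerical stability
    m = max(d_logprob, nd_logprob)
    d = math.exp(d_logprob - m)
    nd = math.exp(nd_logprob - m)
    return d / (d + nd)


def label_from_probability(p_default: float, threshold: float = 0.5) -> int:
    """Hypothetical binary label: 1 = default, 0 = no default."""
    return int(p_default >= threshold)


# If the model puts 0.2 on "D" and 0.6 on "ND" (0.2 on other tokens),
# the renormalized default probability is 0.2 / (0.2 + 0.6)
p = normalized_default_probability(math.log(0.2), math.log(0.6))
print(round(p, 4))                # 0.25
print(label_from_probability(p))  # 0
```

Renormalizing discards whatever probability mass the model assigned to tokens other than D and ND, so the two stored probabilities always sum to 1.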


## Save Dataset to JSON

Let's save our complete dataset to a JSON file for use in subsequent notebooks:

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Add metadata to the dataset
dataset_metadata = {
    "dataset_info": {
        "creation_date": datetime.now().isoformat(),
        "total_records": len(final_data),
        "generation_method": "OpenAI GPT-4o mini",
        "version": "1.0",
        "description": "Synthetic credit application dataset with structured data, text descriptions, and default outcomes"
    },
    "feature_descriptions": {
        "applicant_id": "Unique identifier for each applicant",
        "age": "Age of applicant in years",
        "income": "Annual income in USD",
        "loan_amount": "Requested loan amount in USD",
        "purpose": "Purpose of the loan",
        "credit_history": "Credit history rating (good, fair, poor, terrible)",
        "employment_length": "Years of employment",
        "debt_to_income": "Debt-to-income ratio (0-1)",
        "text_description": "Applicant's loan application narrative",
        "default_probability": "Probability of default (0-1)",
        "risk_rating": "Risk assessment (Low, Medium, High)",
        "risk_explanation": "Explanation of key risk factors",
        "default_label": "Binary outcome (0=no default, 1=default)"
    },
}

# Complete dataset structure
complete_dataset = {
    "metadata": dataset_metadata,
    "applications": final_data
}

output_file = '/content/drive/MyDrive/data/credit_applications_dataset.json'

import os, pathlib
pathlib.Path(output_file).parent.mkdir(parents=True, exist_ok=True)

# Add metadata
dataset = {
    'metadata': {
        'generated_date': datetime.now().isoformat(),
        'total_records': len(complete_data),
        'generation_method': 'OpenAI GPT-4o mini with token probabilities',
        'average_default_probability': np.mean(default_probs),
        'prediction_method': 'Token probability for D (Default) and ND (No Default) tokens',
        'token_distribution': {
            'D_predictions': predicted_tokens.count('D'),
            'ND_predictions': predicted_tokens.count('ND')
        }
    },
    'feature_descriptions': {
        'applicant_id': 'Unique identifier for each applicant',
        'age': 'Age of applicant in years',
        'income': 'Annual income in USD',
        'loan_amount': 'Requested loan amount in USD',
        'purpose': 'Purpose of the loan',
        'credit_history': 'Credit history rating (good, fair, poor, terrible)',
        'employment_length': 'Years of employment',
        'debt_to_income': 'Debt-to-income ratio (0-1)',
        'location': 'US state abbreviation',
        'education': 'Education level',
        'criminal_record': 'Criminal record (yes or no)',
        'text_description': 'Applicant loan application narrative',
        'default_probability': 'Normalized probability of default from D token',
        'no_default_probability': 'Normalized probability of no default from ND token',
        'default_outcome': 'Binary outcome (0=no default, 1=default)',
        'predicted_token': 'Token predicted by model (D or ND)'
    },
    'data': complete_data
}

# Save to JSON file
with open(output_file, 'w') as f:
    json.dump(dataset, f, indent=2)

print(f"Dataset saved to {output_file}")
print(f"Total records: {len(complete_data)}")
print(f"File size: {os.path.getsize(output_file)} bytes")

# Also save a simplified CSV for quick analysis
df_simple = pd.DataFrame(final_data)
df_simple.to_csv("credit_applications_simple.csv", index=False)
print(f"Simplified dataset saved to credit_applications_simple.csv")

# Display sample records
print("\nSample records from final dataset:")
for i in range(2):
    print(f"\n--- Applicant {final_data[i]['applicant_id']} ---")
    print(f"Age: {final_data[i]['age']}, Income: ${final_data[i]['income']:,}")
    print(f"Loan: ${final_data[i]['loan_amount']:,} for {final_data[i]['purpose']}")
    print(f"Credit: {final_data[i]['credit_history']}")
    print(f"Description: {final_data[i]['text_description']}")

# Display sample complete record
print("\nSample complete record:")
sample_record = complete_data[0]
for key, value in sample_record.items():
    if key == 'text_description':
        print(f"{key}: {value[:100]}...")
    else:
        print(f"{key}: {value}")

Dataset saved to /content/drive/MyDrive/data/credit_applications_dataset.json
Total records: 200
File size: 188411 bytes
✅ Simplified dataset saved to credit_applications_simple.csv

Sample records from final dataset:

--- Applicant APP_000001 ---
Age: 34, Income: $55,000
Loan: $15,000 for debt_consolidation
Credit: fair
Description: I am seeking a loan of $15,000 for debt consolidation to help streamline my finances and reduce my monthly payments. With a stable income of $55,000 and 8.5 years at my current job, I believe I have a solid foundation to manage this loan despite my fair credit history. My debt-to-income ratio of 0.35 reflects my commitment to responsible financial management, and I am confident in my ability to repay the loan while improving my overall financial health.

--- Applicant APP_000002 ---
Age: 34, Income: $52,000
Loan: $15,000 for debt_consolidation
Credit: fair
Description: I am a 34-year-old resident of California with a stable job, having been employed for ov
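Hand-written feature descriptions can drift away from the keys that actually appear in the records. A quick way to catch this before saving, sketched here with toy data, is to compare the two sets of names. `schema_gaps` is a hypothetical helper, not part of the original notebook.

```python
def schema_gaps(records, feature_descriptions):
    """Return (undocumented, missing) field names for a list of record dicts."""
    present = set()
    for record in records:
        present.update(record.keys())
    documented = set(feature_descriptions)

    undocumented = sorted(present - documented)  # in the data, but not described
    missing = sorted(documented - present)       # described, but not in the data
    return undocumented, missing


# Toy example: one field lacks a description, one description has no field
records = [{"applicant_id": "APP_000001", "age": 34, "criminal_record": "no"}]
descriptions = {"applicant_id": "...", "age": "...", "default_outcome": "..."}
print(schema_gaps(records, descriptions))
# (['criminal_record'], ['default_outcome'])
```

In the notebook this could be called as `schema_gaps(complete_data, dataset['feature_descriptions'])` just before `json.dump`.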

## Validate Generated Data

Let's perform a quick validation of our generated dataset:

In [52]:
# Load and validate the saved dataset
with open(output_file, 'r') as f:
    loaded_dataset = json.load(f)

print("Dataset validation:")
print(f"Metadata: {loaded_dataset['metadata']}")
print(f"Number of records: {len(loaded_dataset['data'])}")

# Convert to DataFrame for analysis
df = pd.DataFrame(loaded_dataset['data'])
print(f"\nDataFrame shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Basic statistics
print(f"\nNumerical statistics:")
numerical_cols = ['age', 'income', 'loan_amount', 'employment_length', 'debt_to_income', 'default_probability']
print(df[numerical_cols].describe())

# Categorical distributions
print(f"\nCategorical distributions:")
categorical_cols = ['purpose', 'credit_history', 'education', 'location']
for col in categorical_cols:
    print(f"\n{col}:")
    print(df[col].value_counts().head())

print(f"\nDataset generation completed successfully!")

Dataset validation:
Metadata: {'generated_date': '2025-07-10T07:57:38.514066', 'total_records': 200, 'generation_method': 'OpenAI GPT-4o mini with token probabilities', 'average_default_probability': 0.1952205, 'prediction_method': 'Token probability for D (Default) and ND (No Default) tokens', 'token_distribution': {'D_predictions': 39, 'ND_predictions': 161}}
Number of records: 200

DataFrame shape: (200, 15)
Columns: ['applicant_id', 'age', 'income', 'purpose', 'loan_amount', 'credit_history', 'employment_length', 'debt_to_income', 'location', 'education', 'criminal_record', 'text_description', 'default_probability', 'no_default_probability', 'predicted_token']

Numerical statistics:
              age        income   loan_amount  employment_length  \
count  200.000000    200.000000    200.000000         200.000000   
mean    34.010000  60585.000000  14790.000000           7.672500   
std      1.147381   8893.720612   1091.723086           1.774327   
min     29.000000  45000.000000 