
# **Step 1: Data Loading and Target Mapping**

Action: Use pandas to read the dataset from the UCI URL. The file is space-separated and has no header row, so we pass sep=' ' and header=None and supply the 21 column names explicitly.

Target Adjustment: Isolate the Creditability column and use the .replace() method to map the original labels 1 (Good) and 2 (Bad) to 1 and 0 respectively. This makes the target compatible with binary classification metrics and with PyTorch's BCELoss, which expects targets between 0 and 1.

In [1]:
import pandas as pd

# 1. Define the URL and column names
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
columns = [
    "Status of existing checking account", "Duration in month", "Credit history",
    "Purpose", "Credit amount", "Savings account/bonds", "Present employment since",
    "Installment rate in percentage of disposable income", "Personal status and sex",
    "Other debtors / guarantors", "Present residence since", "Property", "Age in years",
    "Other installment plans", "Housing", "Number of existing credits at this bank",
    "Job", "Number of people being liable to provide maintenance for", "Telephone",
    "foreign worker", "Creditability"
]

# 2. Load Data
# The dataset is space-separated and has no header row
df = pd.read_csv(url, sep=' ', header=None, names=columns)

# 3. Target Adjustment
# Original: 1 (Good), 2 (Bad). New: 1 (Good), 0 (Bad)
df['Creditability'] = df['Creditability'].replace({1: 1, 2: 0})

# Display the first few rows and the target distribution to verify
print("Data loaded successfully. Shape:", df.shape)
print("\nTarget Variable Distribution (1=Good, 0=Bad):")
print(df['Creditability'].value_counts())
display(df.head())

Data loaded successfully. Shape: (1000, 21)

Target Variable Distribution (1=Good, 0=Bad):
Creditability
1    700
0    300
Name: count, dtype: int64


Unnamed: 0,Status of existing checking account,Duration in month,Credit history,Purpose,Credit amount,Savings account/bonds,Present employment since,Installment rate in percentage of disposable income,Personal status and sex,Other debtors / guarantors,...,Property,Age in years,Other installment plans,Housing,Number of existing credits at this bank,Job,Number of people being liable to provide maintenance for,Telephone,foreign worker,Creditability
0,A11,6,A34,A43,1169,A65,A75,4,A93,A101,...,A121,67,A143,A152,2,A173,1,A192,A201,1
1,A12,48,A32,A43,5951,A61,A73,2,A92,A101,...,A121,22,A143,A152,1,A173,1,A191,A201,0
2,A14,12,A34,A46,2096,A61,A74,2,A93,A101,...,A121,49,A143,A152,1,A172,2,A191,A201,1
3,A11,42,A32,A42,7882,A61,A74,2,A93,A103,...,A122,45,A143,A153,1,A173,2,A191,A201,1
4,A11,24,A33,A40,4870,A61,A73,3,A93,A101,...,A124,53,A143,A153,2,A173,2,A191,A201,0
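
The target remapping can be checked in isolation. The sketch below uses a small hypothetical Series (`toy_target`, not the real dataset) and `Series.map` instead of `.replace()`. With a dictionary, `map` turns any label missing from the dictionary into NaN, so an unexpected value is detected rather than passed through unchanged.

```python
import pandas as pd

# Hypothetical toy labels in the original coding: 1 = Good, 2 = Bad
toy_target = pd.Series([1, 2, 1, 1, 2], name="Creditability")

# map() yields NaN for any value that is missing from the dictionary
mapped = toy_target.map({1: 1, 2: 0})

# Fail early if a label outside {1, 2} slipped in
assert mapped.notna().all(), "unexpected label in target column"

mapped = mapped.astype(int)
print(mapped.tolist())  # [1, 0, 1, 1, 0]
```

Either method gives the same result on clean data; the difference only matters when the column contains a value that was not anticipated.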


# **Step 2: Categorical Encoding (Pre-Split)**

Action: Separate the dataset into features (X) and target (y).

Encoding: Use pd.get_dummies(X) to convert every string-valued categorical column (codes such as "A11" or "A32") into 0/1 indicator columns via one-hot encoding. Numeric columns pass through unchanged.

In [2]:
# 1. Separate features (X) and target (y)
X = df.drop('Creditability', axis=1)
y = df['Creditability']

# 2. Apply one-hot encoding to categorical features
# We cast to float to ensure strict numerical formatting
X_encoded = pd.get_dummies(X, dtype=float)

# Verify the transformation
print(f"Original features shape: {X.shape}")
print(f"Encoded features shape: {X_encoded.shape}")
print(f"Net increase in column count: {X_encoded.shape[1] - X.shape[1]}")

# Display the first few columns to see the one-hot encoding in action
display(X_encoded.iloc[:, :10].head())

Original features shape: (1000, 20)
Encoded features shape: (1000, 61)
Net increase in column count: 41


Unnamed: 0,Duration in month,Credit amount,Installment rate in percentage of disposable income,Present residence since,Age in years,Number of existing credits at this bank,Number of people being liable to provide maintenance for,Status of existing checking account_A11,Status of existing checking account_A12,Status of existing checking account_A13
0,6,1169,4,4,67,2,1,1.0,0.0,0.0
1,48,5951,2,2,22,1,1,0.0,1.0,0.0
2,12,2096,2,3,49,1,2,0.0,0.0,0.0
3,42,7882,2,4,45,1,2,1.0,0.0,0.0
4,24,4870,3,4,53,2,2,1.0,0.0,0.0
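
Because each categorical column is replaced by its indicator columns, the difference between the encoded and original column counts is a net increase, not the number of indicators created. In the output above, the 7 numeric columns pass through, so 54 of the 61 encoded columns are indicators. The sketch below shows the same bookkeeping on a hypothetical three-column frame (`toy` and its column names are made up for illustration).

```python
import pandas as pd

# Hypothetical toy frame: one numeric column, two categorical columns
toy = pd.DataFrame({
    "Duration": [6, 48, 12],
    "Status": ["A11", "A12", "A14"],      # 3 distinct codes
    "Housing": ["A152", "A152", "A153"],  # 2 distinct codes
})

toy_encoded = pd.get_dummies(toy, dtype=float)

# Columns that get_dummies replaced, and the indicator columns it created
categorical_cols = toy.select_dtypes(exclude="number").columns
dummy_cols = [c for c in toy_encoded.columns if c not in toy.columns]

print(list(toy_encoded.columns))
print("categorical columns replaced:", len(categorical_cols))  # 2
print("indicator columns created:", len(dummy_cols))           # 5
print("net increase in columns:", toy_encoded.shape[1] - toy.shape[1])  # 3
```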


# **Step 3: Train-Test Split and Data Scaling**

Splitting: Use scikit-learn's train_test_split to divide the dataset into an 80% training set and a 20% testing set, stratified on the target so that both sets keep the 70/30 class ratio.

Scaling: To avoid data leakage, fit StandardScaler on the training data only with .fit_transform(), then apply the same fitted scaler to the test data with .transform().

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Split the data into 80% training and 20% testing sets
# random_state makes the split reproducible; stratify=y keeps the 70/30 class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.20, random_state=42, stratify=y
)

# 2. Initialize the StandardScaler
scaler = StandardScaler()

# 3. Fit the scaler ONLY on the training data, then transform it
X_train_scaled = scaler.fit_transform(X_train)

# 4. Transform the test data using the scaler fitted on the training data
X_test_scaled = scaler.transform(X_test)

# Verify the splits and scaling
print(f"Training data shape: {X_train_scaled.shape}")
print(f"Testing data shape: {X_test_scaled.shape}")
print(f"Training feature mean (approx 0): {X_train_scaled[:, 0].mean():.4f}")
print(f"Training feature std (approx 1): {X_train_scaled[:, 0].std():.4f}")

Training data shape: (800, 61)
Testing data shape: (200, 61)
Training feature mean (approx 0): -0.0000
Training feature std (approx 1): 1.0000
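
To make the fit/transform distinction concrete, here is a minimal NumPy sketch of the arithmetic StandardScaler performs, run on synthetic data (the arrays `train` and `test` are hypothetical stand-ins for one feature column). The mean and standard deviation come from the training rows only and are then reused for the test rows, so the scaled test set is close to, but not exactly, zero mean and unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single feature: 800 training rows, 200 test rows
train = rng.normal(loc=20.0, scale=12.0, size=(800, 1))
test = rng.normal(loc=20.0, scale=12.0, size=(200, 1))

# "fit": the statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# "transform": the same statistics are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(f"train mean {train_scaled.mean():.4f}, std {train_scaled.std():.4f}")
print(f"test  mean {test_scaled.mean():.4f}, std {test_scaled.std():.4f}")
```

Fitting on the full dataset instead would let test-set statistics influence the training features, which is the leakage this step avoids.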


# **Step 4: PyTorch Tensors and DataLoaders**

Conversion: Cast the scaled NumPy arrays into PyTorch tensors with dtype torch.float32 for both features and targets (BCELoss expects float targets, not integers). The targets are also reshaped to (N, 1) so that they match the model's output shape.

DataLoaders: Wrap the tensors in TensorDataset and pass them into DataLoader objects. We use a batch size of 64 and shuffle only the training data.

In [4]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# 1. Convert data to PyTorch Tensors
# Features need to be float32 for standard PyTorch layers
X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)

# Targets need to be float32 for BCELoss and reshaped to (N, 1)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).unsqueeze(1)

# 2. Combine into TensorDatasets
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

# 3. Create DataLoaders
# A batch size of 64 is a good starting point for a dataset of this size.
# We shuffle the training data to prevent the model from learning order-based patterns.
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Verify the tensor shapes and dataloader output
print(f"X_train tensor shape: {X_train_tensor.shape}")
print(f"y_train tensor shape: {y_train_tensor.shape}")

# Grab one batch to inspect
features_batch, targets_batch = next(iter(train_loader))
print(f"\nBatch features shape: {features_batch.shape}")
print(f"Batch targets shape: {targets_batch.shape}")

X_train tensor shape: torch.Size([800, 61])
y_train tensor shape: torch.Size([800, 1])

Batch features shape: torch.Size([64, 61])
Batch targets shape: torch.Size([64, 1])
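
With 800 training rows and a batch size of 64, the batches do not divide evenly. The arithmetic below is a plain-Python sketch of how many batches a DataLoader yields per epoch when drop_last is left at its default of False, and how large the final batch is (`batch_sizes` is a hypothetical helper, not a PyTorch function).

```python
import math

n_train, n_test, batch_size = 800, 200, 64

def batch_sizes(n_samples, batch_size):
    """Sizes of the batches yielded per epoch when the last partial batch is kept."""
    n_batches = math.ceil(n_samples / batch_size)
    return [min(batch_size, n_samples - i * batch_size) for i in range(n_batches)]

train_batches = batch_sizes(n_train, batch_size)
test_batches = batch_sizes(n_test, batch_size)

print(len(train_batches), train_batches[-1])  # 13 batches, the last one has 32 rows
print(len(test_batches), test_batches[-1])    # 4 batches, the last one has 8 rows
```

The batch inspected above has 64 rows because it is the first one drawn; the final training batch of each epoch holds only 32.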


# **Step 5: Define the MLP Architecture**

Module Definition: Create a class inheriting from nn.Module.

Layers:

* Input Layer: Dynamically sized to match the number of features after one-hot encoding.
* Hidden Layers: At least two hidden layers (e.g., 64 neurons in the first, 32 in the second) using ReLU activation functions.
* Output Layer: 1 neuron with a Sigmoid activation function to output a probability between 0 and 1.

In [5]:
import torch.nn as nn

class LoanDefaultMLP(nn.Module):
    def __init__(self, input_dim):
        super().__init__()

        # We define a sequential container for clean and readable code
        self.network = nn.Sequential(
            # First hidden layer
            nn.Linear(input_dim, 64),
            nn.ReLU(),

            # Second hidden layer
            nn.Linear(64, 32),
            nn.ReLU(),

            # Output layer (1 neuron for binary classification)
            nn.Linear(32, 1),
            # Sigmoid activation to squash the output between 0 and 1
            # This represents the probability of the positive class (Good Loan)
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.network(x)

# 1. Determine the exact number of input features
num_features = X_train_tensor.shape[1]

# 2. Instantiate the model
model = LoanDefaultMLP(input_dim=num_features)

# Print the model architecture to verify
print("Model Architecture:")
print(model)
print(f"\nExpected input dimension: {num_features}")

Model Architecture:
LoanDefaultMLP(
  (network): Sequential(
    (0): Linear(in_features=61, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=32, bias=True)
    (3): ReLU()
    (4): Linear(in_features=32, out_features=1, bias=True)
    (5): Sigmoid()
  )
)

Expected input dimension: 61
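
The size of this network can be worked out by hand from the layer shapes printed above. The sketch below counts weights and biases for each Linear layer (`linear_params` is a hypothetical helper written for this illustration).

```python
layer_sizes = [61, 64, 32, 1]  # input, hidden 1, hidden 2, output

def linear_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

per_layer = [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [3968, 2080, 33]
print(total)      # 6081
```

On the instantiated model, `sum(p.numel() for p in model.parameters())` should report the same total. Roughly 6,000 trainable parameters against 800 training rows leaves plenty of capacity to memorize the training set.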


# **Step 6: Training Setup and Loop**

Hyperparameters: Reuse the model instantiated in Step 5, set the loss function to nn.BCELoss(), and use the Adam optimizer with a learning rate of 0.001.

Training Loop: Run for at least 50 epochs; the code below uses 80. For each batch, the loop zeroes the gradients, runs the forward pass, calculates the loss, performs backpropagation, and updates the weights.

In [6]:
import torch.optim as optim

# 1. Define the Loss Function and Optimizer
criterion = nn.BCELoss()
# A learning rate of 0.001 is a standard and effective starting point for Adam
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 2. Set the number of epochs
epochs = 80

# Track loss for visualization or inspection later
train_losses = []

print("Starting Training...")

# 3. The Training Loop
for epoch in range(epochs):
    # Set the model to training mode (important if using dropout or batch norm)
    model.train()

    running_loss = 0.0

    for inputs, labels in train_loader:
        # Step 1: Zero the gradients so they don't accumulate from previous batches
        optimizer.zero_grad()

        # Step 2: Forward pass (get predictions)
        outputs = model(inputs)

        # Step 3: Calculate the loss
        loss = criterion(outputs, labels)

        # Step 4: Backward pass (calculate gradients)
        loss.backward()

        # Step 5: Update the weights
        optimizer.step()

        running_loss += loss.item()

    # Calculate average loss for the epoch
    epoch_loss = running_loss / len(train_loader)
    train_losses.append(epoch_loss)

    # Print progress every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {epoch_loss:.4f}")

print("Training Completed")

Starting Training...
Epoch [10/80], Loss: 0.3809
Epoch [20/80], Loss: 0.2388
Epoch [30/80], Loss: 0.0961
Epoch [40/80], Loss: 0.0337
Epoch [50/80], Loss: 0.0148
Epoch [60/80], Loss: 0.0067
Epoch [70/80], Loss: 0.0043
Epoch [80/80], Loss: 0.0027
Training Completed
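
To read these loss values, recall that nn.BCELoss() with its default mean reduction averages the negative log-probability the model assigns to the true label. The sketch below re-implements that formula in plain Python on hypothetical predictions (`toy_probs` and `toy_labels` are made up, not taken from the trained model) and then inverts the final logged loss to get a rough sense of how confident the model is on the training data.

```python
import math

def bce(probs, labels):
    """Mean binary cross-entropy for predicted probabilities and 0/1 labels."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(probs, labels)
    ) / len(labels)

# Hypothetical predictions for three samples
toy_probs = [0.9, 0.2, 0.7]
toy_labels = [1, 0, 1]
print(f"toy BCE: {bce(toy_probs, toy_labels):.4f}")

# exp(-loss) is the geometric-mean probability assigned to the true class
final_loss = 0.0027  # last value in the training log above
print(f"approximate confidence on training data: {math.exp(-final_loss):.4f}")
```

A training loss this close to zero means the model assigns almost all of its probability to the correct label on the rows it was trained on. It says nothing by itself about unseen data, which is what the evaluation step measures.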


# **Step 7: Evaluation**

Action: Put the model in evaluation mode, disable gradient tracking using torch.no_grad(), and pass each batch from the test DataLoader through the trained model.

Metrics: Apply a 0.5 threshold to the output probabilities to get binary predictions (0 or 1). Compare these against the actual labels to calculate and print the final test accuracy.

In [7]:
# 1. Set the model to evaluation mode
# This switches layers such as Dropout or BatchNorm to their inference behavior
# (this model has neither, but calling eval() before inference is good practice)
model.eval()

correct = 0
total = 0

# 2. Disable gradient calculation for inference
with torch.no_grad():
    for inputs, labels in test_loader:
        # Get the model's raw probability outputs
        outputs = model(inputs)

        # Apply a 0.5 threshold to convert probabilities to binary classes (0 or 1)
        predictions = (outputs >= 0.5).float()

        # Keep track of total samples and correct predictions
        total += labels.size(0)
        correct += (predictions == labels).sum().item()

# 3. Calculate and print final accuracy
accuracy = (correct / total) * 100

print(f"\nFinal Test Accuracy: {accuracy:.2f}%")


Final Test Accuracy: 71.50%
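
Accuracy needs a baseline to be interpreted. The stratified test set contains 140 good and 60 bad loans, so a classifier that always predicts "Good" would already score 70%. The sketch below, written in plain Python on hypothetical labels (`y_true` and `y_pred` are made up for illustration), shows how a confusion matrix separates performance on the two classes.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

# Hypothetical labels and predictions with the same 70/30 class ratio
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
baseline = max(sum(y_true), len(y_true) - sum(y_true)) / len(y_true)
recall_good = tp / (tp + fn)  # share of good loans identified
recall_bad = tn / (tn + fp)   # share of bad loans identified

print(f"accuracy {accuracy:.2f} vs majority baseline {baseline:.2f}")
print(f"recall on good loans {recall_good:.2f}, on bad loans {recall_bad:.2f}")
```

In this toy case the accuracy equals the baseline while only one of three bad loans is caught, which a single accuracy figure would hide. Collecting the predictions from the evaluation loop into lists would allow the same breakdown for the trained model.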
