# MLOps Project - Group 5
## **Predicting Credit Risk in German Credit Data**
### Group Members:
1. Kunal Deshmukh (25PGAI0036)
2. Maulik Ruparel (25PGAI0024)
3. Anju Venkiteswaran (25PGAI0005)
4. Kapil Ahuja (25PGAI0111)
5. Bhawana Thawarani (25PGAI0137)

### Dataset: [German Credit Dataset](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data))

### Overview:
The German Credit Data from the UCI Machine Learning Repository contains information about individuals applying for credit. The dataset is utilized to predict whether an applicant poses a good or bad credit risk. It comprises 1000 entries with 20 attributes (7 numerical, 13 categorical).

- **Domain:** Finance  
- **Task:** Classification  
- **Dataset Type:** Multivariate  
- **Number of Instances:** 1000  
- **Number of Features:** 20  
- **Feature Types:** Categorical, Integer

### Key Features:
Below are descriptions of a few features from the dataset:

- `Duration in Months`: 
  - **Type**: Numerical
  - **Description**: The duration for which the credit is requested, measured in months.

- `Credit History`:
  - **Type**: Categorical
  - **Description**: Quality of the existing credit history, with categories such as 'no credits taken', 'all credits at this bank paid back duly', etc.

- `Purpose`:
  - **Type**: Categorical
  - **Description**: The purpose for which the credit is taken, includes values like 'car (new)', 'car (used)', 'furniture/equipment', etc.

- `Credit Amount`:
  - **Type**: Numerical
  - **Description**: The credit amount requested by the applicant.

### Target Variable
- `Credit Risk`:
  - **Type**: Categorical (Binary: Good/Bad)
  - **Description**: The classification outcome, where 'Good' indicates a lower risk of default and 'Bad' indicates a higher risk of default.
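
In the raw UCI file the target is stored as an integer code rather than as text; the UCI documentation gives 1 = Good and 2 = Bad. The following is a minimal sketch of how the codes could be mapped to readable labels. The toy series and the `risk_labels` dictionary are illustrative assumptions, not part of the project code.

```python
import pandas as pd

# Toy stand-in for the raw target column (the real column has 1000 values).
raw_target = pd.Series([1, 2, 1, 1, 2], name="Credit_Risk")

# Hypothetical mapping that follows the UCI coding: 1 = Good, 2 = Bad.
risk_labels = {1: "Good", 2: "Bad"}
labelled = raw_target.map(risk_labels)

# Share of each class in the toy sample.
class_share = labelled.value_counts(normalize=True)
print(class_share)
```

Keeping the integer codes for modelling and applying the mapping only for reporting is one reasonable choice, since scikit-learn classifiers accept either form.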

## Loading the data

In this section, we use pandas to import the dataset and display the first few rows. This initial action confirms that the data is loaded correctly and offers an early glimpse into the structure of the dataset.

In [None]:
# import pandas library for data manipulation
import pandas as pd

# Define the path to the dataset
file_path = 'german.data'

# Load the dataset, treating any run of whitespace as the delimiter
# (sep=r'\s+' replaces the deprecated delim_whitespace=True)
data = pd.read_csv(file_path, sep=r'\s+', header=None)

# Display the first 5 records in a transposed view
data.head().T



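|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | A11 | A12 | A14 | A11 | A11 |
| 1 | 6 | 48 | 12 | 42 | 24 |
| 2 | A34 | A32 | A34 | A32 | A33 |
| 3 | A43 | A43 | A46 | A42 | A40 |
| 4 | 1169 | 5951 | 2096 | 7882 | 4870 |
| 5 | A65 | A61 | A61 | A61 | A61 |
| 6 | A75 | A73 | A74 | A74 | A73 |
| 7 | 4 | 2 | 2 | 2 | 3 |
| 8 | A93 | A92 | A93 | A93 | A93 |
| 9 | A101 | A101 | A101 | A103 | A101 |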


## Dataset Schema and Storage

This section sets the schema for our dataset, aligning column names with the detailed dataset documentation to ensure clarity and accuracy in data handling.

**Steps undertaken:**
1. **Assign Column Names**: We specify names for each column according to the dataset's descriptive attributes.
2. **Update DataFrame**: The new column names are applied to our DataFrame.
3. **Data Preview**: We display the first few rows in a transposed view.
4. **Storage**: The transformed data is stored locally as a parquet file.

In [None]:
# Define the column names based on the dataset description
column_names = [
    'ExistingAccount_Status', 'Duration_Months', 'Credit_History', 'Purpose', 'Credit_Amount',
    'SavingsAccount_Bonds', 'Present_Employment_Since', 'Installment_Rate',
    'PersonalStatus_Sex', 'OtherDebtors_Guarantors', 'Present_Residence_Since', 'Property',
    'Age_Years', 'Other_Installment_Plans', 'Housing', 'Num_Existing_Credits',
    'Job', 'Num_People_Liable', 'Telephone', 'Foreign_Worker', 
    'Credit_Risk'  # This is the target variable
]

data.columns = column_names

data.head().T

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| ExistingAccount_Status | A11 | A12 | A14 | A11 | A11 |
| Duration_Months | 6 | 48 | 12 | 42 | 24 |
| Credit_History | A34 | A32 | A34 | A32 | A33 |
| Purpose | A43 | A43 | A46 | A42 | A40 |
| Credit_Amount | 1169 | 5951 | 2096 | 7882 | 4870 |
| SavingsAccount_Bonds | A65 | A61 | A61 | A61 | A61 |
| Present_Employment_Since | A75 | A73 | A74 | A74 | A73 |
| Installment_Rate | 4 | 2 | 2 | 2 | 3 |
| PersonalStatus_Sex | A93 | A92 | A93 | A93 | A93 |
| OtherDebtors_Guarantors | A101 | A101 | A101 | A103 | A101 |
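
The categorical columns hold coded values such as `A11` or `A14`. As a rough sketch, the codes for one column could be decoded into readable labels as shown below. The label wording paraphrases the UCI attribute description for the checking-account status; the `status_labels` dictionary and the added `ExistingAccount_Status_Label` column are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical lookup for the checking-account status codes,
# paraphrased from the UCI attribute description.
status_labels = {
    "A11": "< 0 DM",
    "A12": "0 to < 200 DM",
    "A13": ">= 200 DM or salary assignment for at least 1 year",
    "A14": "no checking account",
}

# The first five values of the column, as shown in the preview above.
sample = pd.DataFrame({"ExistingAccount_Status": ["A11", "A12", "A14", "A11", "A11"]})

# Add a readable label next to the original code (codes are kept for modelling).
sample["ExistingAccount_Status_Label"] = sample["ExistingAccount_Status"].map(status_labels)
print(sample)
```

Decoding is only for readability; the one-hot encoding used later in the notebook works on the raw codes.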


In [None]:
# Display a concise summary of the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   ExistingAccount_Status    1000 non-null   object
 1   Duration_Months           1000 non-null   int64 
 2   Credit_History            1000 non-null   object
 3   Purpose                   1000 non-null   object
 4   Credit_Amount             1000 non-null   int64 
 5   SavingsAccount_Bonds      1000 non-null   object
 6   Present_Employment_Since  1000 non-null   object
 7   Installment_Rate          1000 non-null   int64 
 8   PersonalStatus_Sex        1000 non-null   object
 9   OtherDebtors_Guarantors   1000 non-null   object
 10  Present_Residence_Since   1000 non-null   int64 
 11  Property                  1000 non-null   object
 12  Age_Years                 1000 non-null   int64 
 13  Other_Installment_Plans   1000 non-null   object
 14  Housing                  

In [None]:
# Store the transformed data in Parquet format
data.to_parquet('german_credit.parquet')

## Data Profiling

In this section, we utilize the 'ydata-profiling' library to create a detailed data profile report. This report offers insights into various aspects such as data distributions, missing values, correlations, and more. The report is saved as an HTML file for easy access and reference in the future.

In [None]:
from ydata_profiling import ProfileReport

# read the parquet file and store it in a dataframe
credit_df = pd.read_parquet('german_credit.parquet')

# Generate the profile report
profile = ProfileReport(credit_df, title="German Credit Data Profile Report", explorative=True)

# Save the report to a file
profile.to_file("german_credit_data_profile_report.html")
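
For a quick look without generating the full HTML report, a few of the same checks can be approximated with plain pandas. This is only a sketch on a toy frame built from three columns of the preview shown earlier; the variable names are assumptions, and it is not a substitute for the profiling report.

```python
import pandas as pd

# Toy frame using values from the first five records shown earlier.
toy_df = pd.DataFrame({
    "Duration_Months": [6, 48, 12, 42, 24],
    "Credit_Amount": [1169, 5951, 2096, 7882, 4870],
    "Purpose": ["A43", "A43", "A46", "A42", "A40"],
})

# Missing values per column.
missing_counts = toy_df.isna().sum()

# Number of distinct values per column (useful for categorical features).
cardinality = toy_df.nunique()

# Pairwise Pearson correlation between the numerical columns only.
numeric_corr = toy_df.select_dtypes(include="number").corr()

print(missing_counts, cardinality, numeric_corr, sep="\n\n")
```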


## Train-Test-Production Split

In this section, we divide the transformed dataset into three distinct subsets:
- **Training Set (60%)**: Used to train the model.
- **Test Set (20%)**: Used to evaluate the model during development.
- **Production Set (20%)**: Held back for ongoing monitoring and deployment validation.

We set a random seed and store each subset in Parquet format for future use and reproducibility.

In [None]:
from sklearn.model_selection import train_test_split

# First split: Allocate 60% for training and 40% for a temporary set (temp_df)
train_df, temp_df = train_test_split(credit_df, test_size=0.4, random_state=42)

# Second split: Divide the temporary set into two equal parts (each 20% of original data)
test_df, prod_df = train_test_split(temp_df, test_size=0.5, random_state=42)

# Save the three datasets to Parquet files
train_df.to_parquet('credit_data_train.parquet', index=False)
test_df.to_parquet('credit_data_test.parquet', index=False)
prod_df.to_parquet('credit_data_prod.parquet', index=False)

# Print shapes to verify the splits
print("Training set shape:", train_df.shape)
print("Test set shape:", test_df.shape)
print("Production set shape:", prod_df.shape)


Training set shape: (600, 21)
Test set shape: (200, 21)
Production set shape: (200, 21)
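
The split above is purely random, so the share of bad-risk applicants can differ slightly between the three subsets. One possible variation, sketched here under the assumption that keeping the class mix constant is desirable, passes the `stratify` argument of `train_test_split`. The synthetic frame below only mimics the documented 700 good / 300 bad class mix and is not the project data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 700 coded 1 (good) and 300 coded 2 (bad).
toy_df = pd.DataFrame({
    "feature": np.arange(1000),
    "Credit_Risk": [1] * 700 + [2] * 300,
})

# First split: 60% train, 40% temporary, preserving the class proportions.
train_part, temp_part = train_test_split(
    toy_df, test_size=0.4, random_state=42, stratify=toy_df["Credit_Risk"]
)

# Second split: halve the temporary set into test and production subsets.
test_part, prod_part = train_test_split(
    temp_part, test_size=0.5, random_state=42, stratify=temp_part["Credit_Risk"]
)

for name, part in [("train", train_part), ("test", test_part), ("prod", prod_part)]:
    bad_share = (part["Credit_Risk"] == 2).mean()
    print(f"{name}: {len(part)} rows, bad share = {bad_share:.2f}")
```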


## ML Pipeline with Scikit-Learn

In this section, we build an end-to-end ML pipeline in the following steps:
- Retrieve the training and test datasets from GitHub.
- Separate the features from the target variable.
- Define separate preprocessing pipelines for numerical and categorical features.
- Combine these preprocessing steps with a ColumnTransformer.
- Chain the preprocessor and a classifier (Random Forest in this instance) into a single scikit-learn pipeline.
- Train the model and assess its performance on the test dataset.

In [None]:
# import required libraries
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the training and test datasets from GitHub and store in dataframes
url_train = 'https://github.com/DeshKunal/MLOps_Project/raw/refs/heads/main/Datasets/Processed/credit_data_train.parquet'
url_test = 'https://github.com/DeshKunal/MLOps_Project/raw/refs/heads/main/Datasets/Processed/credit_data_test.parquet'

train_data = pd.read_parquet(url_train)
test_data = pd.read_parquet(url_test)

# extract the numerical and categorical features
numerical_cols = train_data.select_dtypes(include=['int64', 'float64']).columns.drop('Credit_Risk')
categorical_cols = train_data.select_dtypes(include=['object', 'category']).columns

print(numerical_cols)
print(categorical_cols)

Index(['Duration_Months', 'Credit_Amount', 'Installment_Rate',
       'Present_Residence_Since', 'Age_Years', 'Num_Existing_Credits',
       'Num_People_Liable'],
      dtype='object')
Index(['ExistingAccount_Status', 'Credit_History', 'Purpose',
       'SavingsAccount_Bonds', 'Present_Employment_Since',
       'PersonalStatus_Sex', 'OtherDebtors_Guarantors', 'Property',
       'Other_Installment_Plans', 'Housing', 'Job', 'Telephone',
       'Foreign_Worker'],
      dtype='object')


In [13]:
# Create transformers for numerical and categorical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with mean
    ('scaler', StandardScaler()) # Scale data
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Impute missing values with 'missing'
    ('onehot', OneHotEncoder(handle_unknown='ignore')) # One-hot encode categories
])

# Combine transformers into a preprocessor using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Create a complete pipeline that includes preprocessing and modeling
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))  # Using RandomForest for classification
])

# Separate features and target variable from training and test datasets
X_train = train_data.drop('Credit_Risk', axis=1)
y_train = train_data['Credit_Risk']

X_test = test_data.drop('Credit_Risk', axis=1)
y_test = test_data['Credit_Risk']

# Train the model
pipeline.fit(X_train, y_train)

In [15]:
# Predict on the test set
y_pred = pipeline.predict(X_test)

# Evaluate model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

Test Accuracy: 0.77
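
Accuracy alone can hide weak performance on the minority class, which matters here because bad-risk applicants are the smaller group. As a hedged illustration, the sketch below computes a confusion matrix plus precision and recall for the bad class on made-up labels (1 = Good, 2 = Bad). The `y_true` and `y_hat` lists are invented for the example and are not the notebook's predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels for illustration: 7 good (1) and 3 bad (2) applicants.
y_true = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
y_hat = [1, 1, 1, 1, 1, 1, 2, 1, 1, 2]

# Rows are true classes, columns are predicted classes, in the order [1, 2].
cm = confusion_matrix(y_true, y_hat, labels=[1, 2])

toy_accuracy = accuracy_score(y_true, y_hat)
# Treat the bad class (2) as the positive class of interest.
bad_precision = precision_score(y_true, y_hat, pos_label=2)
bad_recall = recall_score(y_true, y_hat, pos_label=2)

print(cm)
print(f"accuracy={toy_accuracy:.2f}, bad precision={bad_precision:.2f}, bad recall={bad_recall:.2f}")
```

In this toy case the accuracy is 0.70 even though only one of the three bad-risk cases is caught, which is the kind of gap these extra metrics are meant to expose.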


## ML Experimentation and Tracking with MLflow

In this section, we conduct several machine learning experiments with various algorithms and hyperparameters. The workflow is:
1. Initialize an MLflow experiment to organize all runs under a single name.
2. Use K-Fold Cross-Validation to assess each model's performance on the training set.
3. Evaluate each model on the test set.
4. Record parameters, metrics, and the complete model pipeline in MLflow for versioning and later comparison.
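
The cross-validation step can be sketched in isolation, without MLflow, as follows. The data come from scikit-learn's `make_classification` and the single decision tree is an arbitrary choice for the illustration; neither is taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the training set.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

# Five folds, shuffled once with a fixed seed so the folds are reproducible.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Each fold is held out once while the model is trained on the other four.
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=5, random_state=42),
    X_toy, y_toy, cv=kfold, scoring="accuracy",
)

cv_mean, cv_std = scores.mean(), scores.std()
print(f"fold accuracies: {scores.round(3)}, mean={cv_mean:.3f}, std={cv_std:.3f}")
```

The mean summarizes expected performance and the standard deviation indicates how much it varies across folds, which is why both are worth logging.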

In [18]:
# !pip install mlflow

In [26]:
import mlflow

# Set the MLflow tracking URI to a local 'mlruns' directory
# (absolute path on the author's machine; adjust it for your environment)
mlflow.set_tracking_uri('file:///C:/Users/kunal/Downloads/MLOps/Project/Gr-05_MLOps_Project/Gr-05_Notebooks/mlruns')

In [None]:
# Import a few classifier models for experimentation
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC # Support Vector Classifier
from sklearn.model_selection import cross_val_score, KFold # for K-Fold cross validation
import numpy as np 

In [None]:
# Define a list of experiments with different models and hyperparameters
experiments = [
    ("Random Forest", RandomForestClassifier(), {'n_estimators': [100, 200]}),
    ("Gradient Boosting", GradientBoostingClassifier(), {'learning_rate': [0.01, 0.2]}),
    ("Decision Tree", DecisionTreeClassifier(), {'max_depth': [5, 10]}),
    ("SVM", SVC(), {'C': [1, 10]})
]

# Set up the MLflow experiment (all runs will be grouped under this experiment)
mlflow.set_experiment("GermanCredit_ML_Experimentation")

# Define a KFold cross-validator
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for model_name, model, params in experiments:
    for param, values in params.items():
        for value in values:
            with mlflow.start_run(run_name=f"{model_name} {param}={value}"):
                mlflow.log_params({param: value}) # log the parameter settings used
                
                # Set the parameter for the model
                model.set_params(**{param: value})

                # Create a pipeline with preprocessor and the current classifier
                pipeline = Pipeline(steps=[
                    ('preprocessor', preprocessor),
                    ('classifier', model)
                ])

                # Perform cross-validation
                cv_scores = cross_val_score(pipeline, X_train, y_train, cv=kfold, scoring='accuracy')
                cv_mean = np.mean(cv_scores)
                cv_std = np.std(cv_scores)

                # Log the average and standard deviation of cross-validation accuracy
                mlflow.log_metric("cv_mean_accuracy", cv_mean)
                mlflow.log_metric("cv_std_accuracy", cv_std)
                
                # Train the model on full training data and evaluate on test data
                pipeline.fit(X_train, y_train)
                y_pred = pipeline.predict(X_test)
                test_accuracy = accuracy_score(y_test, y_pred)
                mlflow.log_metric("test_accuracy", test_accuracy) # log the test set accuracy
        
                # Log the model
                mlflow.sklearn.log_model(pipeline, "model")

                # Format the parameter setting for printing
                param_settings = f"{param}={value}"

                print(f"{model_name} ({param_settings}) -> CV Mean: {cv_mean:.4f}, CV Std: {cv_std:.4f}, Test Accuracy: {test_accuracy:.4f}")

2025/03/16 20:15:55 INFO mlflow.tracking.fluent: Experiment with name 'GermanCredit_ML_Experimentation' does not exist. Creating a new experiment.


Random Forest (n_estimators=100) -> CV Mean: 0.7700, CV Std: 0.0296, Test Accuracy: 0.7350
Random Forest (n_estimators=200) -> CV Mean: 0.7617, CV Std: 0.0352, Test Accuracy: 0.7400
Gradient Boosting (learning_rate=0.01) -> CV Mean: 0.7100, CV Std: 0.0331, Test Accuracy: 0.7150
Gradient Boosting (learning_rate=0.2) -> CV Mean: 0.7217, CV Std: 0.0282, Test Accuracy: 0.7500
Decision Tree (max_depth=5) -> CV Mean: 0.7000, CV Std: 0.0465, Test Accuracy: 0.7000
Decision Tree (max_depth=10) -> CV Mean: 0.6850, CV Std: 0.0291, Test Accuracy: 0.7300
SVM (C=1) -> CV Mean: 0.7483, CV Std: 0.0193, Test Accuracy: 0.7450
SVM (C=10) -> CV Mean: 0.7150, CV Std: 0.0238, Test Accuracy: 0.7800
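
The runs can be ranked in more than one way. The sketch below copies the printed figures into a list and compares the run with the highest cross-validation mean against the run with the highest test accuracy. The `results` structure is an assumption made for this illustration; in practice the same comparison could be done in the MLflow UI.

```python
# (run name, CV mean accuracy, CV std, test accuracy), copied from the output above.
results = [
    ("Random Forest (n_estimators=100)", 0.7700, 0.0296, 0.7350),
    ("Random Forest (n_estimators=200)", 0.7617, 0.0352, 0.7400),
    ("Gradient Boosting (learning_rate=0.01)", 0.7100, 0.0331, 0.7150),
    ("Gradient Boosting (learning_rate=0.2)", 0.7217, 0.0282, 0.7500),
    ("Decision Tree (max_depth=5)", 0.7000, 0.0465, 0.7000),
    ("Decision Tree (max_depth=10)", 0.6850, 0.0291, 0.7300),
    ("SVM (C=1)", 0.7483, 0.0193, 0.7450),
    ("SVM (C=10)", 0.7150, 0.0238, 0.7800),
]

# Rank by cross-validation mean, which uses only the training data.
best_by_cv = max(results, key=lambda r: r[1])

# Rank by test accuracy, shown for comparison.
best_by_test = max(results, key=lambda r: r[3])

print("Best by CV mean:      ", best_by_cv[0])
print("Best by test accuracy:", best_by_test[0])
```

The two rankings disagree here: Random Forest with 100 trees has the highest cross-validation mean, while SVM with C=10 has the highest test accuracy. Selecting by cross-validation score and reserving the test set for a final check is a common way to avoid tuning to the test data.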
