
# Predicting Credit Mix Using Machine Learning

This project builds a machine learning model that predicts each customer's 'Credit Mix' from financial attributes such as age, income, number of bank accounts, and monthly balance. The workflow preprocesses the data, trains a Multi-layer Perceptron (MLP) classifier, and evaluates how well it separates the credit mix categories.


In [9]:
# Importing necessary libraries for data manipulation, machine learning, and metrics calculation
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.impute import SimpleImputer


## Section 2: Data Loading and Initial Exploration

We start by loading the datasets into pandas DataFrames. A first look at the data shows its structure and surfaces obvious problems, such as missing values or incorrect data types, that need to be addressed before modeling.


In [10]:
# Load the datasets
train_df = pd.read_csv('/content/drive/MyDrive/MLOPS/train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/MLOPS/test.csv')

# Display the first few rows of the training dataset and some descriptive statistics to understand the data better
train_df.head()
train_df.describe()




| Statistic | Monthly_Inhand_Salary | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Delay_from_due_date | Num_Credit_Inquiries | Credit_Utilization_Ratio | Total_EMI_per_month |
|---|---|---|---|---|---|---|---|---|
| count | 84998.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 98035.0 | 100000.0 | 100000.0 |
| mean | 4194.17085 | 17.09128 | 22.47443 | 72.46604 | 21.06878 | 27.754251 | 32.285173 | 1403.118217 |
| std | 3183.686167 | 117.404834 | 129.05741 | 466.422621 | 14.860104 | 193.177339 | 5.116875 | 8306.04127 |
| min | 303.645417 | -1.0 | 0.0 | 1.0 | -5.0 | 0.0 | 20.0 | 0.0 |
| 25% | 1625.568229 | 3.0 | 4.0 | 8.0 | 10.0 | 3.0 | 28.052567 | 30.30666 |
| 50% | 3093.745 | 6.0 | 5.0 | 13.0 | 18.0 | 6.0 | 32.305784 | 69.249473 |
| 75% | 5957.448333 | 7.0 | 7.0 | 20.0 | 28.0 | 9.0 | 36.496663 | 161.224249 |
| max | 15204.633333 | 1798.0 | 1499.0 | 5797.0 | 67.0 | 2597.0 | 50.0 | 82331.0 |
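
The statistics above already hint at missing values (`Monthly_Inhand_Salary` has 84,998 non-null rows out of 100,000) and implausible extremes (a maximum of 1,798 bank accounts). A per-column summary of data types and missing counts makes such issues easier to spot. The sketch below is one possible way to produce it; the `summarize_columns` helper and the toy values are hypothetical and stand in for `train_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the column names mirror the dataset,
# but the values are invented for illustration.
toy_df = pd.DataFrame({
    "Age": ["23", "-500", "28_", None],
    "Monthly_Inhand_Salary": [1824.8, np.nan, 3037.9, np.nan],
    "Num_Bank_Accounts": [3, -1, 1798, 6],
})


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and missing percentage for every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean().mul(100).round(1),
    })


summary = summarize_columns(toy_df)
print(summary)
```

Columns that report an `object` dtype but should be numeric (like `Age` here) are candidates for the conversion steps in the next section.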


## Section 3: Data Cleaning and Preprocessing

In this section, we will clean the data by handling missing values, correcting any erroneous data (like negative ages), and transforming features into a format suitable for machine learning models. This includes normalizing numerical features and encoding categorical features.


In [17]:
# Subsection 3.1: Convert and Clean 'Age'

# Convert 'Age' to numeric, coercing errors to NaN (non-numeric entries become NaN)
train_df['Age'] = pd.to_numeric(train_df['Age'], errors='coerce')

# Handle missing values in 'Age' after conversion (e.g., fill with median age)
median_age = train_df['Age'].median(skipna=True)
train_df['Age'] = train_df['Age'].fillna(median_age)

# Correct negative ages by applying the absolute value (vectorized, same result as the element-wise lambda)
train_df['Age'] = train_df['Age'].abs()


In [18]:
# Subsection 3.2: Handle 'Monthly_Inhand_Salary' Missing Values

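# Impute missing values for 'Monthly_Inhand_Salary' using mean imputation
imputer = SimpleImputer(strategy='mean')
# fit_transform returns a 2-D array of shape (n_rows, 1); flatten it before assigning to a single column
train_df['Monthly_Inhand_Salary'] = imputer.fit_transform(train_df[['Monthly_Inhand_Salary']]).ravel()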


In [19]:
# Subsection 3.3: Parse 'Credit_History_Age'

# Define a function to robustly parse 'Credit_History_Age' into total number of months
def parse_credit_age(credit_age):
    if pd.isna(credit_age):
        return np.nan
    # Cleaning up the string and extracting numeric parts
    parts = credit_age.replace('and', '').replace('Years', '').replace('Months', '').split()
    years = months = 0
    if len(parts) == 2:
        years, months = int(parts[0]), int(parts[1])
    elif len(parts) == 1:
        if 'Year' in credit_age or 'Years' in credit_age:
            years = int(parts[0])
        else:
            months = int(parts[0])
    total_months = years * 12 + months
    return total_months

# Apply the function to the 'Credit_History_Age' column
train_df['Credit_History_Age'] = train_df['Credit_History_Age'].apply(parse_credit_age)
# Impute the NaN entries that parse_credit_age passes through for missing inputs
train_df['Credit_History_Age'] = imputer.fit_transform(train_df[['Credit_History_Age']]).ravel()
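
As a point of comparison, the same conversion can be written with regular expressions, which avoids raising an exception when a value contains no digits at all. This is only a sketch: the `credit_age_to_months` name is hypothetical, and the input format (for example `"22 Years and 1 Months"`) is an assumption inferred from the string replacements in the function above.

```python
import re

import numpy as np
import pandas as pd

# Assumed input format: "<n> Years and <m> Months", with either part optional.
_YEARS_RE = re.compile(r"(\d+)\s*Year", re.IGNORECASE)
_MONTHS_RE = re.compile(r"(\d+)\s*Month", re.IGNORECASE)


def credit_age_to_months(value):
    """Convert a credit-history string to total months; return NaN if unparseable."""
    if pd.isna(value):
        return np.nan
    text = str(value)
    years = _YEARS_RE.search(text)
    months = _MONTHS_RE.search(text)
    if years is None and months is None:
        return np.nan  # nothing recognizable, leave it for the imputer
    total = int(years.group(1)) * 12 if years else 0
    total += int(months.group(1)) if months else 0
    return total


examples = pd.Series(["22 Years and 1 Months", "5 Months", "3 Years", "NA", None])
print(examples.apply(credit_age_to_months).tolist())
```

Unparseable strings such as `"NA"` come back as NaN, so they are handled by the same imputation step as genuinely missing values.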


In [20]:
# Subsection 3.4: Encode Categorical Variables

# Encoding categorical variables using LabelEncoder
label_encoder = LabelEncoder()
categorical_columns = ['Occupation', 'Credit_Mix', 'Payment_Behaviour']
for column in categorical_columns:
    train_df[column] = label_encoder.fit_transform(train_df[column])
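
Refitting a single `LabelEncoder` for each column works, but only the mapping of the last column survives the loop. If the integer codes need to be translated back to their original labels later (for example to read the classes in a classification report), one option is to keep a fitted encoder per column. The sketch below illustrates this on invented values; the `encoders` dictionary is a hypothetical name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data with invented values; column names follow the notebook.
toy_df = pd.DataFrame({
    "Occupation": ["Scientist", "Teacher", "Scientist", "Engineer"],
    "Credit_Mix": ["Good", "Standard", "Bad", "Good"],
})

# Keep one fitted encoder per column so every mapping can be inverted later.
encoders = {}
for column in ["Occupation", "Credit_Mix"]:
    encoders[column] = LabelEncoder()
    toy_df[column] = encoders[column].fit_transform(toy_df[column])

# LabelEncoder assigns codes in sorted label order.
print(list(encoders["Credit_Mix"].classes_))

# Map the integer codes back to the original labels.
decoded = encoders["Credit_Mix"].inverse_transform(toy_df["Credit_Mix"])
print(list(decoded))
```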


In [25]:
# Subsection 3.5: Verify Columns and Normalize Numerical Data

# Define the numerical columns before their first use
numerical_columns = ['Annual_Income', 'Num_Bank_Accounts', 'Num_Credit_Card', 'Outstanding_Debt', 'Total_EMI_per_month']

# First, ensure all numerical columns exist and contain the expected data type
print(train_df[numerical_columns].info())

# Cleaning numerical columns to remove non-numeric characters and convert to float
def clean_numeric(x):
    try:
        if isinstance(x, str):
            return float(x.replace('_', '').strip())
        return x  # Return as is if not a string
    except ValueError:
        return np.nan  # Return NaN for non-convertible strings

for col in numerical_columns:
    if col in train_df.columns:  # Check if each column actually exists in DataFrame
        train_df[col] = train_df[col].apply(clean_numeric)
    else:
        print(f"Column {col} missing from DataFrame.")

# Verify after cleaning
print(train_df[numerical_columns].info())

# Impute any missing values created by cleaning process
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(train_df[numerical_columns])

# Ensure the output from imputer is correctly reshaped and reassigned to DataFrame
if imputed_data.size != 0:  # Check if imputed data is not empty
    train_df[numerical_columns] = pd.DataFrame(imputed_data, index=train_df.index, columns=numerical_columns)
else:
    print("Imputed data is empty. Check the imputation process and column specifications.")

# Normalizing numerical data using StandardScaler
scaler = StandardScaler()
train_df[numerical_columns] = scaler.fit_transform(train_df[numerical_columns])


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Annual_Income        0 non-null      float64
 1   Num_Bank_Accounts    0 non-null      float64
 2   Num_Credit_Card      0 non-null      float64
 3   Outstanding_Debt     0 non-null      float64
 4   Total_EMI_per_month  0 non-null      float64
dtypes: float64(5)
memory usage: 3.8 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Annual_Income        0 non-null      float64
 1   Num_Bank_Accounts    0 non-null      float64
 2   Num_Credit_Card      0 non-null      float64
 3   Outstanding_Debt     0 non-null      float64
 4   Total_EMI_per_month  0 non-null      float64
dtypes: float64(5)
memory usage: 3.8 

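
The `info()` output above reports 0 non-null values for all five columns, which leaves mean imputation with nothing to average; by default `SimpleImputer` drops columns that are entirely missing at fit time. A quick check for all-NaN columns before imputing can catch this early. The sketch below shows one way to do it on invented values; `to_clean_float` and `all_nan_columns` are hypothetical helper names, not part of the notebook.

```python
import numpy as np
import pandas as pd


def to_clean_float(series: pd.Series) -> pd.Series:
    """Strip underscores and whitespace, then coerce to float (bad values become NaN)."""
    as_text = series.astype(str).str.replace("_", "", regex=False).str.strip()
    return pd.to_numeric(as_text, errors="coerce")


def all_nan_columns(df: pd.DataFrame, columns) -> list:
    """Return the subset of columns that contain no usable values at all."""
    return [col for col in columns if df[col].isna().all()]


# Toy frame with invented values standing in for train_df.
toy_df = pd.DataFrame({
    "Annual_Income": ["19114.12", "34847.84_", "bad value"],
    "Outstanding_Debt": [np.nan, np.nan, np.nan],
})
toy_df["Annual_Income"] = to_clean_float(toy_df["Annual_Income"])

empty = all_nan_columns(toy_df, ["Annual_Income", "Outstanding_Debt"])
print(empty)  # ['Outstanding_Debt']
```

If the returned list is not empty, the column was most likely damaged by an earlier step, and reloading the raw data before cleaning is safer than imputing.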


## Section 4: Model Training and Evaluation

With our data prepared, we can now define our MLP model, train it on the dataset, and evaluate its performance using various metrics. This will help us understand how well our model is performing in predicting the 'Credit Mix'.
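
The cell that creates `X_train`, `X_test`, `y_train`, and `y_test` is not shown in this section. The row counts in the outputs below (80,000 training rows and 20,000 test rows) are consistent with an 80/20 split, so the following is a minimal sketch of how such a split could be produced. The toy data, the `stratify` argument, and the `random_state` value are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with invented values; 'Credit_Mix' is the target as in the notebook.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "Age": rng.integers(18, 70, size=100),
    "Monthly_Inhand_Salary": rng.normal(4000, 1000, size=100),
    "Credit_Mix": np.tile([0, 1, 2, 3], 25),
})

# Separate the features from the target.
X = toy_df.drop(columns=["Credit_Mix"])
y = toy_df["Credit_Mix"]

# 80/20 split; stratify keeps the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```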


In [29]:
# Subsection 4.1: Prepare Data for Modeling

# Concatenate training and test data frames to ensure consistent label encoding
n_train = len(X_train)  # remember the split point before X_train is reassigned
combined = pd.concat([X_train, X_test], axis=0)

# Treat object and int64 columns as categorical.
# Note: this also label-encodes identifier columns (ID, Customer_ID, Name, SSN) and integer counts,
# and the resulting codes follow string order rather than numeric order.
categorical_features = [col for col in combined.columns if combined[col].dtype == 'object' or combined[col].dtype == 'int64']

# Cast each categorical column to string to unify data types, then encode it
for col in categorical_features:
    encoder = LabelEncoder()
    combined[col] = encoder.fit_transform(combined[col].astype(str))

# Split the combined DataFrame back into training and testing datasets
X_train = combined.iloc[:n_train].copy()
X_test = combined.iloc[n_train:].copy()

# Verify and display the data types in the training set to ensure proper encoding
print(X_train.info())


<class 'pandas.core.frame.DataFrame'>
Index: 80000 entries, 75220 to 15795
Data columns (total 27 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ID                        80000 non-null  int64  
 1   Customer_ID               80000 non-null  int64  
 2   Month                     80000 non-null  int64  
 3   Name                      80000 non-null  int64  
 4   Age                       80000 non-null  float64
 5   SSN                       80000 non-null  int64  
 6   Occupation                80000 non-null  int64  
 7   Annual_Income             0 non-null      float64
 8   Monthly_Inhand_Salary     80000 non-null  float64
 9   Num_Bank_Accounts         0 non-null      float64
 10  Num_Credit_Card           0 non-null      float64
 11  Interest_Rate             80000 non-null  int64  
 12  Num_of_Loan               80000 non-null  int64  
 13  Type_of_Loan              80000 non-null  int64  
 14  Delay_f

In [31]:
# Subsection 4.2: Define and Train the MLP Model

from sklearn.impute import SimpleImputer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Creating a pipeline to handle imputation and model training
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # Replace 'mean' with 'median' or 'most_frequent' if more appropriate
    ('mlp', MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=500))
])

# Training the model on the training dataset
# The pipeline's imputer fills any NaN values in X_train before the MLP sees them
pipeline.fit(X_train, y_train)
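
MLPs are sensitive to feature scale, and the pipeline above passes the imputed features to the network without scaling them. One possible variant, sketched below on synthetic data, adds a `StandardScaler` step between imputation and the classifier so that the scaler is fitted only on the training data. The synthetic dataset, the median strategy, and the `random_state` values are assumptions for illustration, not the notebook's configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the credit dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=4, random_state=0
)
X[::25, 0] = np.nan  # inject a few missing values
X[:, 1] *= 1000.0    # put one feature on a much larger scale

# Impute, then scale, then classify; every step is fitted on the same data.
scaled_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=500, random_state=0)),
])
scaled_pipeline.fit(X, y)

train_accuracy = scaled_pipeline.score(X, y)
print(f"Training accuracy: {train_accuracy:.3f}")
```

Keeping the scaler inside the pipeline also means cross-validation refits it on each training fold, which avoids leaking statistics from the validation fold.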


In [35]:
# Subsection 4.3: Make Predictions and Evaluate the Model

from sklearn.metrics import confusion_matrix, classification_report

# Ensure the same preprocessing pipeline is used for predictions
# The pipeline includes imputation, so it should handle any NaNs in X_test
predictions = pipeline.predict(X_test)

# Evaluating the model using confusion matrix and classification report
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))
print("\nClassification Report:\n", classification_report(y_test, predictions))


Confusion Matrix:
 [[  66   34 3761    9]
 [  96   53 4588   37]
 [ 145   59 7032   45]
 [  62   40 3947   26]]

Classification Report:
               precision    recall  f1-score   support

           0       0.18      0.02      0.03      3870
           1       0.28      0.01      0.02      4774
           2       0.36      0.97      0.53      7281
           3       0.22      0.01      0.01      4075

    accuracy                           0.36     20000
   macro avg       0.26      0.25      0.15     20000
weighted avg       0.28      0.36      0.21     20000




### Hyperparameter Tuning Rationale for Our Dataset

#### Dataset Overview and Model Performance
The dataset is used to predict a customer's Credit Mix, which appears as a four-class target in the report above. The initial MLPClassifier showed several performance problems:
- Class 2 has very high recall (0.97) but low precision (0.36): the model assigns almost every sample to this largest class instead of learning to separate the classes.
- Classes 0, 1, and 3 have recall of 0.02 or less, so the model almost never identifies them.
- Overall accuracy is 0.36 and the macro-averaged F1-score is 0.15, which highlights the need for better model tuning.
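
To put these numbers in context, the first confusion matrix can be compared against a trivial baseline that always predicts the largest class. The short calculation below uses the matrix printed above; only the variable names are introduced here.

```python
import numpy as np

# Confusion matrix of the initial model (rows = true class, columns = predicted class).
cm = np.array([
    [66, 34, 3761, 9],
    [96, 53, 4588, 37],
    [145, 59, 7032, 45],
    [62, 40, 3947, 26],
])

support = cm.sum(axis=1)    # true samples per class
predicted = cm.sum(axis=0)  # predictions per class

accuracy = np.trace(cm) / cm.sum()
majority_baseline = support.max() / cm.sum()
share_predicted_as_2 = predicted[2] / cm.sum()

print(f"Model accuracy:        {accuracy:.3f}")              # 0.359
print(f"Always-predict-2:      {majority_baseline:.3f}")     # 0.364
print(f"Share predicted as 2:  {share_predicted_as_2:.3f}")  # 0.966
```

The model labels about 97% of the test samples as Class 2 and ends up slightly below the accuracy of simply predicting Class 2 every time.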

#### Need for Hyperparameter Tuning
Given the initial results, hyperparameter tuning is essential to:
- **Adjust Model Complexity**: Optimize the number of neurons and layers to better capture the data's complexity and improve classification across all classes.
- **Enhance Learning Dynamics**: Fine-tune the learning rate and activation functions to enhance the learning process and achieve better convergence.
- **Improve Generalization**: Aim to increase the model's ability to generalize, potentially by adjusting regularization parameters to reduce overfitting, especially for the dominant class.

#### Tuning Strategy
We will use **GridSearchCV** to search a grid of hyperparameters systematically, scoring each candidate with 5-fold cross-validation for robustness. The aim is to reduce the model's bias toward the dominant class and to raise the overall metrics, so that the model performs well across all classes.
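
Because the classes are imbalanced, plain accuracy can reward a model that mostly predicts the dominant class. One alternative worth considering is to score the search with macro-averaged F1, which weights every class equally, and to include the regularization strength `alpha` in the grid. The sketch below runs on synthetic data; the dataset, the grid values, the 3-fold split, and the `random_state` settings are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced 4-class data standing in for the credit dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=4,
    weights=[0.2, 0.25, 0.35, 0.2], random_state=0,
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPClassifier(max_iter=300, random_state=0)),
])

# Small illustrative grid: network size and L2 regularization strength.
param_grid = {
    "mlp__hidden_layer_sizes": [(20,), (20, 20)],
    "mlp__alpha": [1e-4, 1e-2],
}

# Stratified folds preserve class proportions; f1_macro weights all classes equally.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=cv)
search.fit(X, y)

print(search.best_params_)
print(f"Best macro F1 (CV): {search.best_score_:.3f}")
```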



In [36]:
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix, classification_report

# Define the pipeline with an imputer and MLPClassifier
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('mlp', MLPClassifier(max_iter=500))
])

# Define the parameter grid
param_grid = {
    'mlp__hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 100)],
    'mlp__activation': ['tanh', 'relu'],
    'mlp__learning_rate_init': [0.001, 0.01]
}

# Configure GridSearchCV
search = GridSearchCV(pipeline, param_grid, n_jobs=-1, cv=5, scoring='accuracy', verbose=10)
search.fit(X_train, y_train)

# Output the best parameters and the corresponding score
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

# Evaluate the best model on the test data
best_model = search.best_estimator_
predictions = best_model.predict(X_test)
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))
print("\nClassification Report:\n", classification_report(y_test, predictions))


Fitting 5 folds for each of 16 candidates, totalling 80 fits
Best parameter (CV score=0.517):
{'mlp__activation': 'relu', 'mlp__hidden_layer_sizes': (100, 100), 'mlp__learning_rate_init': 0.001}
Confusion Matrix:
 [[  13    2 3142  713]
 [  34   25 4010  705]
 [  52   14 6027 1188]
 [  30   14 3385  646]]

Classification Report:
               precision    recall  f1-score   support

           0       0.10      0.00      0.01      3870
           1       0.45      0.01      0.01      4774
           2       0.36      0.83      0.51      7281
           3       0.20      0.16      0.18      4075

    accuracy                           0.34     20000
   macro avg       0.28      0.25      0.17     20000
weighted avg       0.30      0.34      0.22     20000



Feature Engineering