The goal of this project is to build a model that can accurately predict the credit score of an individual based on their financial history and personal information.

This project is a crucial component of my portfolio as it showcases my ability to work with real-world data, perform data cleaning and pre-processing, and apply machine learning algorithms to solve a practical problem.

In [None]:
#import the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold,GridSearchCV
from sklearn.model_selection import train_test_split,  cross_val_score
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

import pickle

import warnings
warnings.filterwarnings("ignore")

In [None]:
#import the dataset

In [None]:
data=pd.read_csv(r'/content/credit.csv')
data.head(10)

In [None]:

data.shape

In [None]:
data.columns

In [None]:
#get the dtype and non-null count of each column

In [None]:
data.info()

In [None]:
#the statistics summary of data

In [None]:
data.describe()

In [None]:
# check for missing values in each column

In [None]:
data.isnull().sum()

In [None]:
#to find the duplicated rows

In [None]:
data[data.duplicated(keep='first')]

In [None]:
#arranging numerical and categorical columns

In [None]:
credit_cal=data.select_dtypes(include='object')
credit_num=data.select_dtypes(include='number')

**OBSERVATION**

The dataset contains 100,000 data points. Each record describes an individual's financial history and personal information, together with their credit score.

In [None]:
#Data visualisation

#EDA EXPLORATORY DATA ANALYSIS

In [None]:
# Create a figure and 5x3 grid of subplots
fig, ax = plt.subplots(5, 3, figsize=(12, 20))

# Flatten the 2D array of subplots into a 1D array
ax = ax.flatten()

# Plot multiple box plots on the same axis
sns.boxplot(x='Credit_Score', y='Annual_Income', data=data, ax=ax[0])
sns.boxplot(x='Credit_Score', y='Monthly_Inhand_Salary', data=data, ax=ax[1])
sns.boxplot(x='Credit_Score', y='Num_Bank_Accounts', data=data, ax=ax[2])
sns.boxplot(x='Credit_Score', y='Num_Credit_Card', data=data, ax=ax[3])
sns.boxplot(x='Credit_Score', y='Interest_Rate', data=data, ax=ax[4])
sns.boxplot(x='Credit_Score', y='Num_of_Loan', data=data, ax=ax[5])
sns.boxplot(x='Credit_Score', y='Delay_from_due_date', data=data, ax=ax[6])
sns.boxplot(x='Credit_Score', y='Num_of_Delayed_Payment', data=data, ax=ax[7])
sns.boxplot(x='Credit_Score', y='Outstanding_Debt', data=data, ax=ax[8])
sns.boxplot(x='Credit_Score', y='Credit_Utilization_Ratio', data=data, ax=ax[9])
sns.boxplot(x='Credit_Score', y='Credit_History_Age', data=data, ax=ax[10])
sns.boxplot(x='Credit_Score', y='Total_EMI_per_month', data=data, ax=ax[11])
sns.boxplot(x='Credit_Score', y='Amount_invested_monthly', data=data, ax=ax[12])
sns.boxplot(x='Credit_Score', y='Monthly_Balance', data=data, ax=ax[13])
sns.boxplot(x='Credit_Score', y='Age', data=data, ax=ax[14])

# Add a title and labels
#plt.title('Relationship between Credit Score and Different Features')
plt.xlabel('Credit Score')
plt.ylabel('Feature Value')

# Adjust the spacing between subplots
fig.subplots_adjust(hspace=0.4, wspace=0.4)

# Add a white grid
for i in range(15):
    ax[i].grid(color='white', linestyle='-', linewidth=2, alpha=0.5)

# Add a title and labels
fig.suptitle('Relationship between Credit Score and Different Features', fontsize=16, fontweight='bold')

# Set the font size for all subplot titles
titles = ['Annual Income', 'Monthly Inhand Salary', 'Number of Bank Accounts', 'Number of Credit Cards',
          'Interest Rate', 'Number of Loans', 'Delay from Due Date', 'Number of Delayed Payments',
          'Outstanding Debt', 'Credit Utilization Score', 'Credit History Age', 'Total EMI Per Month',
          'Amount Invested Monthly', 'Monthly Balance', 'Age']
for i in range(15):
    ax[i].set_title(titles[i], fontsize=14, fontweight='bold')

# Set the font size for all x and y labels
for i in range(15):
    ax[i].set_xlabel('Credit Score', fontsize=12)
    ax[i].set_ylabel('Feature Value', fontsize=12)

# Show the plot
plt.show()


**INFERENCE**

From the 15 box plots above, we can deduce the following:

1.  The more someone earns annually, the better their credit score is.
2.  Similar to annual income, a higher monthly in-hand salary leads to
    a better credit score.
3.  The ideal number of bank accounts is 2-4. Having more than 5
    negatively affects your credit score.
4.  Similar to bank accounts, having more than 5 credit cards will
    negatively affect your credit score. The ideal number is 3-5.
5.  4-11% is the sweet spot for the average interest rate.
    Anything above 15% is a no-no.
6.  Take 1-3 loans at a time in order to keep a good credit score.
    Having more than 3 loans negatively impacts credit scores.
7.  To maintain a good credit score, you have a 5-14 day delay window.
    Delaying for more than 17 days affects your credit score negatively.
8.  Delaying 4-12 payments from the due date is the safety window.
    Anything above 12 payments negatively affects credit scores.
9.  An outstanding debt of $1150 will not affect your credit scores, but
    going above $1338 affects your credit scores negatively.
10. Your credit utilization ratio doesn’t affect your credit scores.
11. Having a long credit history results in better credit scores.
12. The number of EMIs you are paying in a month doesn’t affect credit
    scores that much.
13. How much you invest monthly doesn’t really affect your credit
    scores.
14. Having a high monthly balance in your account at the end of the month is good for your credit scores.
15. Credit scores tend to improve with an increase in age.
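
The readings above come from inspecting box plots by eye. One way to check them numerically is to compare per-class medians, which correspond to the centre line of each box. The sketch below uses a small made-up frame named `toy` (the values are invented for illustration and are not taken from credit.csv) with the same column names as the dataset.

```python
import pandas as pd

# Toy stand-in for `data`; values are invented for illustration only.
toy = pd.DataFrame({
    "Credit_Score": ["Good", "Good", "Standard", "Standard", "Poor", "Poor"],
    "Annual_Income": [90000, 70000, 50000, 40000, 20000, 30000],
    "Outstanding_Debt": [500, 700, 1200, 1300, 2500, 2100],
})

# Median of each numeric feature within each credit score class,
# i.e. the centre line of each box in a box plot.
class_medians = toy.groupby("Credit_Score")[["Annual_Income", "Outstanding_Debt"]].median()
print(class_medians)
```

On the real data, the same `groupby(...).median()` call over `data` would give one row per credit score class, which makes thresholds like the ones listed above easier to verify.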

In [None]:
#Histplot of credit_score

In [None]:
plt.figure(figsize=(12,6))

In [None]:
plt.subplot(1,2,1)
sns.histplot(data['Credit_Score'],bins=30,kde=True,color='blue')
plt.show()

**INFERENCE**

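The histogram of credit scores shows the number of records in each of the three categories. Because Credit_Score is categorical, the plot is read as class frequencies rather than as a continuous distribution: most individuals fall in the Standard category, with fewer in Poor and Good.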

In [None]:
corr_format= credit_num.corr()
corr_format

In [None]:
#correlation Analysis
correlation_matrix = corr_format  # corr_format already holds the correlation matrix
sns.heatmap(correlation_matrix, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()


In [None]:
# Pairplot for Numerical Variables

sns.pairplot(credit_num)
plt.suptitle('Pairplot of Numerical Variables', y=1.02)

In [None]:
data['Credit_Score'].value_counts()

In [None]:
#count plot of categorical variable(credit_score)
sns.countplot(x='Credit_Score',data=data)
plt.title('credit_score distribution')


**INFERENCE**

The credit score has three categories: Good, Standard and Poor. 53,174 data points belong to the Standard category, 28,998 to Poor and 17,828 to Good.
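
To make the imbalance concrete, the counts reported above can be turned into percentage shares. This is a small arithmetic sketch using only those three numbers; the variable names are illustrative.

```python
# Class counts as reported above.
counts = {"Standard": 53174, "Poor": 28998, "Good": 17828}
total = sum(counts.values())

# Share of each class, as a percentage rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(total, shares)
```

A classifier that always predicted Standard would therefore already reach roughly 53% accuracy on the full dataset, which is a useful baseline to keep in mind when reading the accuracy scores later on.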

DEMOGRAPHIC ANALYSIS

In [None]:
#histplot of credit score by occupation

In [None]:
plt.figure(figsize=(14,7))
sns.histplot(hue='Occupation',x='Credit_Score',data=data ,bins=30, kde=True)
plt.title('Credit_Score by occupation level')
plt.tight_layout()
plt.show()

**INFERENCE**

The histogram comparing credit scores across occupations shows how the credit score categories are distributed within each occupation. The distribution varies noticeably across occupations, indicating potential differences in creditworthiness among different groups of individuals. This underscores the importance of considering occupation as a factor in credit assessment and risk management.

In [None]:
# Scatter plot for Outstanding Debts vs. Credit Scores
sns.scatterplot(x='Outstanding_Debt', y='Credit_Score', data=data, alpha=0.7, color='green')
plt.title('Outstanding Debts vs. Credit Scores')
plt.xlabel('Outstanding Debts')
plt.ylabel('Credit Score')

**INFERENCE**

There is a negative relationship between outstanding debt and credit score: as outstanding debt increases, credit scores tend to decrease.

In [None]:
# Scatter plot for Credit Utilization Ratio vs. Credit Scores
plt.subplot(1, 2, 1)
sns.scatterplot(x='Credit_Utilization_Ratio', y='Credit_Score', data=data, alpha=0.7, color='blue')
plt.title('Credit Utilization Ratio vs. Credit Scores')
plt.xlabel('Credit Utilization (monthly Balance / changed Credit Limit)')
plt.ylabel('Credit Score')

**INFERENCE**

There is a relationship between credit utilization ratio and credit scores, indicating its impact on individual creditworthiness.

Explanation: By observing the scatter plot, we can infer that individuals with lower credit utilization ratios tend to have higher credit scores, while those with higher credit utilization ratios tend to have lower credit scores.

PAYMENT ANALYSIS

In [None]:
# Box plot for Late Payments vs. Credit Scores
sns.boxplot(x='Num_of_Delayed_Payment', y='Credit_Score', data=data, palette='Blues')
plt.title('Delayed Payments vs. Credit Scores')
plt.xlabel('Number of Delayed Payments')
plt.ylabel('Credit Score')

**INFERENCE**

There is a relationship between credit scores and the likelihood of making late payments.

Explanation: If the box plot shows that individuals with lower credit scores have a wider spread and a higher median number of late payments than individuals with higher credit scores, we can infer that there is a relationship between credit scores and the frequency of late payments.

In [None]:
data.head()

In [None]:
#drop unwanted columns

In [None]:
data.drop(['Customer_ID','ID','Month','Name','SSN','Type_of_Loan','Credit_History_Age','Interest_Rate','Changed_Credit_Limit','Credit_Utilization_Ratio'],axis=1,inplace=True)

In [None]:
data.head()

In [None]:
data.columns

In [None]:
numeric_cols= data.select_dtypes(exclude='object').columns
cat_cols= data.select_dtypes(include='object').columns

In [None]:
numeric_cols

In [None]:
cat_cols

PIE CHART

In [None]:
# 'Credit_Score' is the column containing the credit score categories
credit_score_counts = data['Credit_Score'].value_counts()

# Plot pie chart
plt.figure(figsize=(8, 8))
plt.pie(credit_score_counts, labels=credit_score_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Credit Scores')
plt.show()


**INFERENCE**

In the credit score distribution of our dataset, 'Good' credit scores represent approximately 17.8% of records, so only a minority of applicants have a favourable credit history and a lower risk profile for lenders. 'Standard' credit scores account for around 53.2% of the dataset, making this the majority class. Finally, 'Poor' credit scores make up roughly 29%, suggesting that a sizeable share of applicants may have credit challenges or issues that could pose higher risk for lenders. Understanding these distributions is crucial for risk assessment and decision-making in lending practices.

#CHECK AND DROP OUTLIERS

In [None]:
# Drop outliers by IQR calculation
Q1 = data.Annual_Income.quantile(0.25)
Q3 = data.Annual_Income.quantile(0.75)
IQR = Q3 - Q1
# Flag rows above the upper fence or below the lower fence, then drop both in one step
is_outlier = (data['Annual_Income'] > (Q3 + 1.5 * IQR)) | (data['Annual_Income'] < (Q1 - 1.5 * IQR))
df_cleaned = data.drop(data.loc[is_outlier].index)
df_cleaned


In [None]:
sns.boxplot(x=df_cleaned['Annual_Income'])
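
The IQR rule used here can be wrapped in a small reusable function. The sketch below is one possible way to do it; `iqr_bounds` is a hypothetical helper name and the income values are invented so that the fences are easy to check by hand.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented incomes with one extreme value.
incomes = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 500], name="Annual_Income")
lower, upper = iqr_bounds(incomes)

# Keep only the values that fall inside the fences (inclusive).
kept = incomes[incomes.between(lower, upper)]
print(lower, upper, len(kept))
```

Here Q1 is 14 and Q3 is 22, so the IQR is 8 and the fences are 2 and 34; only the value 500 falls outside them.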

#STANDARDISING

In [None]:
col=(['Age','Annual_Income','Outstanding_Debt'])

In [None]:
col_std=data[col]

In [None]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(col_std)
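
Note that `scaled_data` is a separate array, so later cells that build `X` from `data` do not pick it up unless it is written back. A common pattern, sketched below on synthetic numbers, is to fit the scaler on the training rows only and reuse it on the held-out rows. The array `features` and the split names are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Invented stand-in for the three columns 'Age', 'Annual_Income', 'Outstanding_Debt'.
features = rng.normal(loc=[35, 50000, 1200], scale=[10, 15000, 400], size=(200, 3))

train_part, test_part = train_test_split(features, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_part)  # learn mean/std from training rows only
test_scaled = scaler.transform(test_part)        # reuse the same mean/std on held-out rows

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

Fitting on the training rows alone keeps information about the test rows out of the preprocessing step.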

#ENCODING

In [None]:
#obtaining their counts

In [None]:
data['Credit_Mix'].value_counts()

In [None]:
data['Occupation'].value_counts()

In [None]:
data['Payment_of_Min_Amount'].value_counts()

In [None]:
data['Payment_Behaviour'].value_counts()

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le=LabelEncoder()

In [None]:
data['Occupation_encoded']=le.fit_transform(data['Occupation'])

In [None]:
data['Credit_Mix_encoded']=le.fit_transform(data['Credit_Mix'])

In [None]:
data['Payment_of_Min_Amount_encoded']=le.fit_transform(data['Payment_of_Min_Amount'])

In [None]:
data['Payment_Behaviour_encoded']=le.fit_transform(data['Payment_Behaviour'])
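
Because the same `le` object is refitted for every column, it only remembers the classes of the last column it saw. If the codes need to be mapped back later (for example when serving predictions), one option is to keep one encoder per column. The sketch below assumes a hypothetical `encoders` dict and uses invented rows.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented rows using two of the categorical columns above.
toy = pd.DataFrame({
    "Credit_Mix": ["Good", "Bad", "Standard", "Good"],
    "Payment_of_Min_Amount": ["Yes", "No", "No", "Yes"],
})

encoders = {}  # hypothetical helper dict: column name -> fitted LabelEncoder
for column in ["Credit_Mix", "Payment_of_Min_Amount"]:
    encoders[column] = LabelEncoder()
    toy[column + "_encoded"] = encoders[column].fit_transform(toy[column])

# Each encoder remembers its own classes, so codes can be mapped back to labels.
decoded = encoders["Credit_Mix"].inverse_transform(toy["Credit_Mix_encoded"])
print(list(encoders["Credit_Mix"].classes_), list(decoded))
```

`LabelEncoder` assigns codes in sorted order of the labels, so in this toy frame `Bad` becomes 0, `Good` becomes 1 and `Standard` becomes 2.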

In [None]:
print(data)

In [None]:
data.head()

In [None]:
data.columns

In [None]:
#remove the original columns that have now been encoded

In [None]:
data.drop(['Occupation','Credit_Mix','Payment_of_Min_Amount','Payment_Behaviour'],axis=1,inplace=True)
data.head()

In [None]:
data.info()

In [None]:
data.columns

In [None]:
col=(['Age','Occupation_encoded','Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts','Num_Credit_Card','Delay_from_due_date' ,'Num_of_Loan','Payment_of_Min_Amount_encoded','Num_of_Delayed_Payment',
'Num_Credit_Inquiries','Credit_Mix_encoded','Outstanding_Debt','Total_EMI_per_month','Amount_invested_monthly', 'Monthly_Balance','Payment_Behaviour_encoded','Credit_Score'])

In [None]:
data['Credit_Score']= LabelEncoder().fit_transform(data['Credit_Score'])
data['Credit_Score'].value_counts()  # encoded alphabetically: Good=0, Poor=1, Standard=2

In [None]:
data['Credit_Score'].unique()

#creating csv file

In [None]:
#save the processed dataframe to csv for feature evaluation and prediction
data.to_csv("final_credit_data.csv",index=False)

#SPLITTING DATASET

In [None]:
X=data.drop(['Credit_Score'],axis=1)
y=data['Credit_Score']

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=42)

#BALANCING THE DATA

Here we use SMOTE (Synthetic Minority Over-sampling Technique) to handle the imbalanced classes.

In [None]:
from imblearn.over_sampling import SMOTE
# Assuming X_train contains your feature vectors and y_train contains the corresponding labels

# Instantiate SMOTE
smote = SMOTE(random_state=42)

# Resample the dataset
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Check the distribution of classes after resampling
unique, counts = np.unique(y_train_resampled, return_counts=True)
print(dict(zip(unique, counts)))
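
SMOTE creates new minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates that idea with NumPy and scikit-learn's `NearestNeighbors`. It is a simplified illustration under that assumption, not imblearn's implementation, and `smote_like` is a hypothetical name.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours: when querying the fitted points themselves,
    # each point's nearest neighbour (column 0) is the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    _, neighbour_idx = nn.kneighbors(minority)

    base = rng.integers(0, len(minority), size=n_new)  # random minority rows
    pick = rng.integers(1, k + 1, size=n_new)          # neighbour columns 1..k
    neighbour = neighbour_idx[base, pick]
    gap = rng.random((n_new, 1))                       # position along the segment
    return minority[base] + gap * (minority[neighbour] - minority[base])

# Invented minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
synthetic = smote_like(minority, n_new=10)
print(synthetic.shape)
```

Each synthetic row lies on a line segment between two real minority rows. For the balanced data to have any effect, the later `fit` calls would need to use `X_train_resampled` and `y_train_resampled` rather than `X_train` and `y_train`.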



#MODEL BUILDING

In [None]:
models = [['LogisticRegression ', LogisticRegression()],
        ['DecisionTreeClassifier ', DecisionTreeClassifier()],
        ['RandomForestClassifier ', RandomForestClassifier()],
         ['SVC ', SVC()],['KNN',KNeighborsClassifier()]]

In [None]:
for name, model in models:
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    print(name, accuracy_score(y_test, prediction))

KNN and RandomForestClassifier are selected for hyperparameter tuning based on their accuracy scores.
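
Since the classes are imbalanced, accuracy alone can hide weak performance on the smaller classes. As a hedged sketch, the example below reports macro-averaged F1 next to accuracy on synthetic three-class data. The data, the class weights and the names `X_demo`, `y_demo` are assumptions for illustration and are not the credit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data with weights loosely mirroring the Standard/Poor/Good shares.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3,
    weights=[0.53, 0.29, 0.18], random_state=42,
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(Xtr, ytr)
pred = clf.predict(Xte)

acc = accuracy_score(yte, pred)
macro_f1 = f1_score(yte, pred, average="macro")  # unweighted mean of per-class F1
print(f"accuracy={acc:.3f}  macro-F1={macro_f1:.3f}")
```

Macro F1 weights every class equally, so a model that neglects the smallest class is penalised even if its overall accuracy looks good.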

# HYPERPARAMETERS FOR THEIR IMPROVED PERFORMANCE

In [None]:
model_params = {

    'DecisionTreeClassifier': {
        'model': DecisionTreeClassifier(),
        'params' : {
            'criterion' : ['gini', 'entropy']
        }
    },
    'Random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10]
        }
    },
    'KNeighbors': {
        'model': KNeighborsClassifier(),
        'params' : {
            'n_neighbors' : [5,8,10]
        }
    }

}

In [None]:
#to find best parameters

In [None]:
# Dictionary to store best parameters for each model
best_params = {}

# Iterate over each model and perform hyperparameter tuning
for model_name, model_info in model_params.items():
    print(f"Searching best parameters for {model_name}...")
    model = model_info['model']
    params = model_info['params']

    # Perform GridSearchCV
    grid_search = GridSearchCV(model, params, cv=5, return_train_score=False)
    grid_search.fit(X_train, y_train)  # Assuming X_train and y_train are your training data

    # Store the best parameters
    best_params[model_name] = grid_search.best_params_

# Print the best parameters for each model
for model_name, params in best_params.items():
    print(f"Best parameters for {model_name}: {params}")


In [None]:
#modeling using best parameters

In [None]:
# Define the Random Forest classifier with n_estimators=10
rf_classifier = RandomForestClassifier(n_estimators=10)

# Train the classifier with your data
rf_classifier.fit(X_train, y_train)  # Assuming X_train and y_train are your training data
rf_predictions=rf_classifier.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, rf_predictions)
print("Accuracy:", accuracy)
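
Instead of copying the best value by hand, the fitted search object can supply the tuned model directly. The sketch below shows this on synthetic data. The names `X_demo`, `search` and `best_rf` are illustrative, and the grid simply repeats the `n_estimators` values used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the credit features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [1, 5, 10]},  # same values as in model_params above
    cv=5,
)
search.fit(Xtr, ytr)

# With the default refit=True, best_estimator_ is already retrained on all of Xtr.
best_rf = search.best_estimator_
print(search.best_params_, best_rf.score(Xte, yte))
```

This keeps the final model in sync with the search results if the parameter grid changes later.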

#HYPERTUNING

In [None]:
#gridsearchcv
scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], return_train_score=False)
    clf.fit(X_train, y_train)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_

    })


In [None]:
#randomisedsearchcv
from sklearn.model_selection import RandomizedSearchCV

scores = []

for model_name, mp in model_params.items():
    clf =  RandomizedSearchCV(mp['model'], mp['params'], n_iter=10, return_train_score=False)
    clf.fit(X_train, y_train)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_
    })


In [None]:

df = pd.DataFrame(scores,columns=['model','best_score'])
df

Here we choose the Random Forest classifier for further analysis.

#CROSS VALIDATION

Cross-validation assesses the performance of a model and helps detect overfitting. It divides the dataset into several folds; in each round, one fold is held out as the validation set and the model is trained on the remaining folds. Repeating this for every fold gives more reliable performance metrics than a single train/test split.

In [None]:
# Define Random Forest classifier
rf_classifier = RandomForestClassifier()

# Define cross-validation strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(rf_classifier, X, y, cv=cv)

# Print cross-validation scores
print("Cross Validation Scores:", cv_scores)
print("Mean CV Accuracy:", np.mean(cv_scores))
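
To show what `cross_val_score` does under the hood, the loop below spells out the same procedure by hand on synthetic data: split, fit on the training folds, score on the held-out fold. A decision tree and the names `X_demo`, `fold_scores` are used purely for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
estimator = DecisionTreeClassifier(random_state=0)

fold_scores = []
for train_idx, val_idx in cv.split(X_demo, y_demo):
    fold_model = clone(estimator)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    fold_scores.append(accuracy_score(y_demo[val_idx], fold_pred))

print(fold_scores, np.mean(fold_scores))
```

Every row is used for validation exactly once, which is why the mean of the fold scores is a steadier estimate than one split.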

#MODEL INTERPRETABILITY

 Analyze feature importance and the model's decision-making
process.

In [None]:
# Train a Random Forest classifier
rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, y_train)

# Analyze feature importance
feature_importance = rf_classifier.feature_importances_
sorted_idx = np.argsort(feature_importance)[::-1]

#Print feature importance
print("Feature Importance:")
for i in sorted_idx:
    print(f"{X.columns[i]}: {feature_importance[i]}")

# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.bar(range(len(feature_importance)), feature_importance[sorted_idx], align='center')
plt.xticks(range(len(feature_importance)), X.columns[sorted_idx], rotation=90)
plt.xlabel('Feature')
plt.ylabel('Importance Score')
plt.title('Feature Importance')
plt.show()

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))

**INFERENCE**
Analyzing feature importance and the model's decision-making process in credit data can provide valuable insights into which factors are most influential in determining creditworthiness. Here are some inferences that can be made from such analysis:

1. Identify Key Predictive Features: By examining feature importance, you can identify which features have the most significant impact on the model's predictions. For example, if the model assigns high importance to features such as outstanding debt, credit mix, and delay from due date, it suggests that these factors strongly influence creditworthiness.

2. Risk Assessment: Features with high importance indicate that they have a strong relationship with the target variable (here, the credit score category). Understanding these key features allows lenders to better assess the risk associated with granting credit to individuals.

3. Interpretability: Feature importance analysis helps in understanding the model's decision-making process in a more interpretable way. Lenders can explain to borrowers why certain decisions were made based on specific factors such as payment behaviour, occupation, or number of loans.

4. Policy Implications: Insights gained from feature importance analysis can inform policymakers and regulatory agencies about the key factors driving credit decisions. This information can be used to develop fair lending policies and regulations that promote equal access to credit for all individuals.

5. Model Improvement: Understanding which features are most important allows for targeted model improvement efforts. For example, if a particular feature is highly influential but prone to missing or inaccurate data, efforts can be made to improve data quality or incorporate alternative sources of information.

6. Customer Segmentation: By analyzing how different features contribute to credit decisions, lenders can segment their customer base more effectively. This can lead to tailored product offerings and pricing strategies based on the specific needs and risk profiles of different customer segments.

Overall, analyzing feature importance and the model's decision-making process in credit data can provide actionable insights for lenders, policymakers, and regulators to make more informed and fair lending decisions.
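
The impurity-based `feature_importances_` used above are computed from the training data. As a complementary check, scikit-learn's `permutation_importance` measures how much the held-out score drops when one column is shuffled. The sketch below runs it on synthetic data; the feature names are placeholders rather than the credit columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit features and labels.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)

# Shuffle one column at a time on held-out data and record the drop in accuracy.
result = permutation_importance(rf, Xte, yte, n_repeats=5, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.4f}")
```

If both rankings agree on the top features, that gives somewhat more confidence in the interpretation than either measure alone.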

#MODEL DEPLOYMENT

A web application is developed using HTML and the Flask web framework; it takes user inputs and predicts the corresponding credit score.

In [None]:
# Define the Random Forest classifier with n_estimators=10
model = RandomForestClassifier(n_estimators=10)

# Train the classifier with your data
model.fit(X_train, y_train)  # Assuming X_train and y_train are your training data
model_predictions=model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, model_predictions)
print("Accuracy:", accuracy)
# Save the trained model; the with-block closes the file handle
with open('model.pkl','wb') as f:
    pickle.dump(model,f)

In [None]:
# Load the trained model from the file
with open('model.pkl','rb') as f:
    model=pickle.load(f)
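
When a pickled model is served from a web form, the input fields have to arrive in the same column order the model was trained on. One way to guard against that, sketched below, is to store the column list next to the model. The `predict_one` helper, the bundle layout and the training values are all hypothetical and are not the deployed application's code.

```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Invented training frame; the column names are two of the features used above.
train = pd.DataFrame({
    "Annual_Income": [20000, 30000, 80000, 90000],
    "Outstanding_Debt": [2500, 2100, 600, 500],
})
labels = [1, 1, 0, 0]  # illustrative class codes only

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(train, labels)

# Bundle the model with its training column order and serialise in memory.
payload = pickle.dumps({"model": model, "columns": list(train.columns)})

def predict_one(blob, record):
    """Hypothetical helper: rebuild a one-row frame in training column order and predict."""
    bundle = pickle.loads(blob)
    row = pd.DataFrame([record])[bundle["columns"]]
    return int(bundle["model"].predict(row)[0])

# The dict keys arrive in a different order than the training columns.
print(predict_one(payload, {"Outstanding_Debt": 550, "Annual_Income": 85000}))
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.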