# Introduction

In financial services, accurate prediction of loan repayment outcomes holds paramount importance. This dataset contains a collection of key variables associated with loan applicants. These variables provide valuable insights into the applicant's financial profile, employment history, credit behavior, and loan-specific details.

The goal is to train a machine learning model that learns from past customers' profiles to predict loan defaults as accurately as possible, thereby reducing the risk of future defaults. By analyzing the relationships between borrower attributes and loan repayment outcomes, the model will enable lending institutions to make well-informed decisions that balance profitability and risk management.

### Understanding the Variables

1. id: Unique ID of the loan application.

2. grade: LC assigned loan grade.

3. annual_inc: The self-reported annual income provided by the borrower during registration.

4. short_emp: 1 when employed for 1 year or less.

5. emp_length_num: Employment length in years. Possible values are between 0 and 10, where 0 means less than one year and 10 means ten or more years.

6. home_ownership: Type of home ownership.

7. dti (Debt-To-Income Ratio): The borrower’s total monthly debt payments on total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

8. purpose: A category provided by the borrower for the loan request.

9. term: The number of payments on the loan. Values are in months and can be either 36 or 60.

10. last_delinq_none: 1 when the borrower had at least one event of delinquency.

11. last_major_derog_none: 1 when the borrower had at least 90 days of a bad rating.

12. revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.

13. total_rec_late_fee: Late fees received to date.

14. od_ratio: Overdraft ratio.

15. bad_loan: 1 when a loan was not paid.

In [None]:
# Import the libraries used throughout this notebook.

# Data Manipulation Libraries
import numpy as np 
import pandas as pd 

# Data Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Machine Learning Libraries
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, precision_recall_curve, average_precision_score 
from sklearn.model_selection import GridSearchCV

# Data Resampling Libraries
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

# Machine Learning Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
# List the files in the input directory to find the name of the dataset.
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# Load the data.
df = pd.read_csv("/kaggle/input/machine-learning/lending_club_loan_dataset.csv")
df.head()

# Data preparation

In [None]:
# Check the shape of the data.
df.shape

In [None]:
# Check for duplicated rows.
df.duplicated().sum()

In [None]:
# Check for null values.
df.isna().sum()

In our dataset, we've identified missing data in three columns: "home_ownership," "dti," and "last_major_derog_none." There are 1,491 null values in "home_ownership," 154 in "dti," and 19,426 (about 97%) in "last_major_derog_none." We have options for managing these gaps without significantly compromising accuracy.

Given our dataset's size of 20,000 rows, we can reasonably remove the rows with missing values in "home_ownership" and "dti." This removal should have a minimal impact on the overall accuracy and integrity of our data.

However, due to the overwhelming 97% missing values in "last_major_derog_none," it appears reasonable to drop this column entirely from our analysis. Relying on it could introduce noise and bias into our analysis, impacting the reliability of our results.
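
The decision above can be sketched as a small rule: drop columns that are mostly empty, then drop the remaining incomplete rows. The toy frame, its values, and the 0.5 threshold below are assumptions for illustration only, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "home_ownership": ["RENT", None, "OWN", "MORTGAGE", "RENT"],
    "dti": [12.5, 8.1, np.nan, 20.3, 15.0],
    "last_major_derog_none": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "bad_loan": [0, 1, 0, 0, 1],
})

# Share of missing values per column.
missing_share = toy.isna().mean()

# Assumed rule of thumb: drop columns that are more than half empty,
# then drop any rows that still contain missing values.
threshold = 0.5
cols_to_drop = missing_share[missing_share > threshold].index.tolist()
cleaned = toy.drop(columns=cols_to_drop).dropna()

print(missing_share)
print(cols_to_drop, cleaned.shape)
```

On the real data, the same `isna().mean()` call gives the missing share per column directly.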

In [None]:
# Drop last_major_derog_none and id.
df.drop(['last_major_derog_none', 'id'], axis=1, inplace=True)

In [None]:
# Drop the rows that have null values in home_ownership or dti.
df.dropna(subset = ["home_ownership", "dti"], inplace=True)

In [None]:
df.isna().sum()

In [None]:
# Show summary information about the data.
df.info()

In [None]:
# Seeing the unique values in the grade.
df['grade'].value_counts()

The "grade" column currently has 7 distinct groups. To obtain a more balanced distribution, we will merge grades F and G into E, which leaves 5 categories that are easier to analyze.

In [None]:
# Define a mapping dictionary to combine the clusters
cluster_mapping = {"E":"E", "F":"E" ,"G":"E"}

# Update the "grade" column with the new cluster labels
df['grade'] = df['grade'].replace(cluster_mapping)

In [None]:
df['grade'].value_counts()

In [None]:
# Seeing the unique values in the home_ownership.
df['home_ownership'].value_counts()

In [None]:
# Seeing the unique values in the purpose.
df['purpose'].value_counts()

The "purpose" column currently has 12 distinct groups. To simplify the analysis, we will consolidate them into 6 broader categories by merging groups with similar characteristics. This keeps the essence of the original classification while making it easier to interpret.

In [None]:
# Define a mapping dictionary to combine the clusters
cluster_mapping = {
    "wedding": "entertainment",
    "vacation": "entertainment",
    "moving": "entertainment",
    "car": "entertainment",
    "major_purchase": "projects",
    "small_business": "projects",
    "house": "projects",
    "medical": "projects"
}

# Update the "purpose" column with the new cluster labels using replace
df['purpose'] = df['purpose'].replace(cluster_mapping)

In [None]:
df['purpose'].value_counts()

In [None]:
df['term'].value_counts()

The "term" column distinguishes between '36 months' and '36 Months'. For consistency, these two variants will be combined into a single value, '36 months'.

In [None]:
# Convert the "term" values to lowercase so that "36 Months" and "36 months"
# become the same value. No further mapping is needed after lowercasing.
df['term'] = df['term'].str.lower()

In [None]:
df['term'].value_counts()

# Data Visualization

In [None]:
# Categorical columns.
categorical_features = df[["grade", "short_emp", "home_ownership", "purpose", "term", "last_delinq_none", "bad_loan"]]

# Numerical columns.
numerical_features = df[["annual_inc", "emp_length_num", "dti", "total_rec_late_fee", "revol_util", "od_ratio"]]

In [None]:
# calculate descriptive statistics for categorical values.
categorical_features.astype('object').describe()

In [None]:
# Visualize the distribution of the "bad_loan" with "grade". 
sns.countplot(data=categorical_features, x="grade", hue="bad_loan")
plt.xlabel("Grade")
plt.ylabel("Count")
plt.title("Count Plot of Grade with Bad Loan")
plt.legend(title="Bad Loan", labels=["No", "Yes"])
plt.show()

The count plot suggests that better loan grades (A being the best) are associated with fewer defaults relative to the number of loans in each grade. This points to a negative relationship between grade quality and the likelihood of default, although raw counts alone do not show the default rate per grade.
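
Because the grades differ in size, the share of bad loans per grade is a more direct check than raw counts. A minimal sketch on made-up data (the `toy` frame is hypothetical; on the real data the same `groupby` would be applied to `df`):

```python
import pandas as pd

# Hypothetical toy data: four loans per grade, values are made up.
toy = pd.DataFrame({
    "grade": ["A", "A", "A", "A", "B", "B", "B", "B", "E", "E", "E", "E"],
    "bad_loan": [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1],
})

# bad_loan is 0/1, so its mean per grade is the share of bad loans in that grade.
default_rate = toy.groupby("grade")["bad_loan"].mean()
print(default_rate)
```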

In [None]:
# The percentage of each element of the data
for feature in categorical_features:
    categorical_features[feature].value_counts().plot(kind = 'pie', autopct = '%1.1f%%')
    plt.show()

The pie charts show a noticeable imbalance in "short_emp," "term," and "bad_loan." Because "bad_loan" is the target, we will apply a resampling strategy to the training data to address its imbalance. The remaining categorical variables are more evenly distributed.

In [None]:
# calculate descriptive statistics for numerical values.
numerical_features.describe()

In [None]:
# Histograms for numerical features.
numerical_features.hist(bins=20, figsize=(12, 8))
plt.suptitle("Histograms of Numerical Features", y=1.02)
plt.show()

The histograms show uneven distributions for "total_rec_late_fee" and "revol_util," and "annual_inc" contains outliers. The remaining numerical features look reasonably well distributed, without obvious irregularities.
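
One way to back up the visual impression is to compute the skewness and count outliers with the common 1.5 × IQR rule. This is a sketch on made-up income values; the rule and the numbers are assumptions for illustration, not a step taken in the original analysis.

```python
import pandas as pd

# Hypothetical incomes with one extreme value.
income = pd.Series(
    [30_000, 42_000, 45_000, 50_000, 52_000, 58_000, 61_000, 65_000, 70_000, 900_000],
    name="annual_inc",
)

# Positive skewness indicates a long right tail.
skewness = income.skew()

# Flag values outside 1.5 * IQR from the quartiles.
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]

print(f"skewness: {skewness:.2f}")
print(outliers)
```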

# Data preprocessing

### Encoding and scaling the data

In [None]:
# One-hot encoding for "home_ownership", "purpose", and "grade".
df = pd.get_dummies(df, columns=['home_ownership', 'purpose', 'grade'])
df.head()

In [None]:
# Create a LabelEncoder object
label_encoder = LabelEncoder()

# Fit and transform the categorical column
df["term"] = label_encoder.fit_transform(df["term"])

In [None]:
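# Scaling the numerical data.

# Create scaler object.
scaler = StandardScaler()

# Fit the scaler on the numerical columns and write the scaled values back to df,
# so that the scaling actually reaches the features used for modeling.
# Note: fitting on all rows before the train/test split lets test-set statistics
# influence the scaling; fitting on the training rows only would avoid this.
numerical_columns = numerical_features.columns
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])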
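
An alternative ordering, sketched here on synthetic data, is to split first and fit the scaler on the training rows only, then reuse the fitted scaler on the test rows. The `*_demo` names and the random data are hypothetical; this is not the flow used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical features and the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Learn the mean and standard deviation from the training rows only.
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)

# Apply the same training-set statistics to the test rows.
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```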

### Split the data

In [None]:
# Split data into x and y.
X = df.drop("bad_loan", axis=1)
y = df["bad_loan"]

In [None]:
# Split the data into train and test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

The training set is then split twice, using different random seeds, into a training subset and a validation subset: one pair for over-sampling and one for under-sampling. We will fit several machine learning models under each resampling method and compare their validation results to decide which approach works better. Note that the two validation subsets are not identical, so the comparison between the two methods is approximate.

In [None]:
# Split the train data into two subsets, train1 and test1
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_train, y_train, test_size=0.25, random_state=10, stratify=y_train)

# Split the train data into two subsets, train2 and test2
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_train, y_train, test_size=0.25, random_state=20, stratify=y_train)

# Resampling the data

### Oversampling (SMOTE)

In [None]:
# Instantiate the SMOTE class
smote = SMOTE(sampling_strategy='auto', random_state=42)

# Perform SMOTE oversampling on the dataset
X_overesampled, y_overesampled = smote.fit_resample(X_train1, y_train1)

# Visualize the distribution of values in the y_overesampled Series
y_overesampled.value_counts().plot(kind = 'pie', autopct = '%1.1f%%')

### Trying the models with oversampling

We will now evaluate the models on the oversampled data to determine which performs better. Because the dataset is imbalanced, we will not rely solely on accuracy. Our initial focus is precision for the positive class (bad_loan = 1): of the loans the model flags as bad, the share that actually default. Precision penalizes false positives, which here are cases where the model predicts a default but the borrower repays. Cases where the model predicts repayment yet the borrower defaults are false negatives; precision does not capture them, so we also inspect the confusion matrix.

To gain a comprehensive understanding of model performance, we will analyze the confusion matrix. This matrix will provide valuable insights into true positives, true negatives, false positives, and false negatives. This approach allows us to effectively navigate the intricacies of model performance in light of the dataset's inherent imbalance and make informed decisions about model selection.
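
To make the reading of the matrix concrete, here is a small sketch with made-up labels, where 1 stands for a bad loan. For binary labels, scikit-learn arranges the matrix as `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = bad loan, 0 = repaid loan.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: share of predicted bad loans that really are bad.
precision = precision_score(y_true, y_pred)
# Recall: share of actual bad loans that the model caught.
recall = recall_score(y_true, y_pred)

print(tn, fp, fn, tp)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case, two good loans are flagged as bad (false positives) and one bad loan is missed (false negative), which is why precision and recall differ.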

In [None]:
# Initialize the models.
models = {
    'Random Forest': RandomForestClassifier(),
    'Logistic Regression': LogisticRegression(),
}

In [None]:
# Iterate over each model and evaluate its accuracy using cross-validation.
for model_name, model in models.items():
    scores = cross_val_score(model, X_overesampled, y_overesampled)
    accuracy = scores.mean()
    print(f'{model_name} Accuracy: {accuracy}')
    
    # Fit the model on the oversampled training subset and predict on its validation subset
    model.fit(X_overesampled, y_overesampled)
    y_pred1 = model.predict(X_test1)
    
    # Evaluate the model on the test set
    acc = accuracy_score(y_test1, y_pred1)
    prec = precision_score(y_test1, y_pred1)
    
    print(f"Accuracy: {acc:.3f}")
    print(f"Precision: {prec:.3f}")
    print(confusion_matrix(y_test1, y_pred1))
    print('-' * 50)

In this comparison, the logistic regression model performed better than the random forest model on the oversampled data. This suggests that, with SMOTE, logistic regression captured the patterns in the dataset more effectively on the validation subset.

### Undersampling (TomekLinks)

In [None]:
# Instantiate the TomekLinks class
tomek_links = TomekLinks(sampling_strategy='auto', n_jobs=-1)

# Perform Tomek Links undersampling on the dataset
X_underesampled, y_underesampled = tomek_links.fit_resample(X_train2, y_train2)

# Visualize the distribution of values in the y_underesampled Series
y_underesampled.value_counts().plot(kind = 'pie', autopct = '%1.1f%%')

### Trying the models with undersampling

We will also evaluate the models with undersampling to make the comparison more complete. Undersampling removes instances of the majority class. TomekLinks in particular removes only majority-class samples that form a Tomek link, that is, a pair of mutual nearest neighbours from opposite classes. It therefore cleans up the class boundary rather than fully equalizing the class counts, and lets us observe how the models perform under a different data condition.
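
The idea of a Tomek link can be sketched with scikit-learn's `NearestNeighbors`. The helper `tomek_link_majority_indices` and the toy data are hypothetical and only illustrate the definition; the `TomekLinks` class from imbalanced-learn may differ in implementation details.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority-class samples that are part of a Tomek link."""
    # n_neighbors=2 because each point's closest neighbour in the fitted set
    # is the point itself (assuming there are no duplicate points).
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]

    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i  # i and j are each other's nearest neighbour
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# One-dimensional toy data: class 0 is the majority class.
X_toy = np.array([[0.0], [1.1], [2.5], [4.0], [4.3], [10.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1])

print(tomek_link_majority_indices(X_toy, y_toy, majority_label=0))
```

Here the points at 4.0 (class 0) and 4.3 (class 1) are mutual nearest neighbours from opposite classes, so only the majority-class point at 4.0 is marked for removal.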

By comparing the results obtained from both oversampling and undersampling approaches, we will make a well-informed decision regarding the optimal technique for our dataset. This comprehensive analysis will enable us to choose the model and technique that deliver the highest predictive accuracy and precision while accounting for the inherent complexities of imbalanced data.

In [None]:
# Iterate over each model and evaluate its accuracy using cross-validation.
for model_name, model in models.items():
    scores = cross_val_score(model, X_underesampled, y_underesampled)
    accuracy = scores.mean()
    print(f'{model_name} Accuracy: {accuracy}')
    
    # Fit the model on the undersampled training subset and predict on its validation subset
    model.fit(X_underesampled, y_underesampled)
    y_pred2 = model.predict(X_test2)
    
    # Evaluate the model on the test set
    acc = accuracy_score(y_test2, y_pred2)
    prec = precision_score(y_test2, y_pred2)
    
    print(f"Accuracy: {acc:.3f}")
    print(f"Precision: {prec:.3f}")
    print(confusion_matrix(y_test2, y_pred2))
    print("-" * 50)

Under undersampling, the random forest model outperforms the logistic regression model. Comparing the results from both resampling approaches, random forest paired with undersampling gives the best performance overall.

This outcome suggests that the random forest algorithm copes well with the imbalanced data when combined with TomekLinks undersampling. It also illustrates the importance of tailoring the model and technique to the specific characteristics of the data at hand.

# Modeling

Finally, we will proceed to build the final model using the random forest algorithm combined with the undersampling technique. This choice is driven by the extensive performance analysis, which indicated that the random forest model, when used in conjunction with undersampling, yields the most favorable outcomes in terms of predictive accuracy and precision.

By leveraging the strengths of the random forest algorithm and addressing the class imbalance through undersampling, we aim to create a robust and effective model for assessing loan repayment probabilities. This final model will serve as a valuable tool for lending institutions, providing them with a reliable mechanism to make well-informed decisions, manage risks, and foster responsible lending practices.

In [None]:
# Perform Tomek Links undersampling on the full training set
X_resampled, y_resampled = tomek_links.fit_resample(X_train, y_train)

In [None]:
# Define the parameter grid.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Create an instance of the model.
model = RandomForestClassifier()

# Create an instance of GridSearchCV.
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)

# Fit the GridSearchCV to the undersampled training data.
grid_search.fit(X_resampled, y_resampled)

# Get the best hyperparameters.
best_params = grid_search.best_params_

# Print the best hyperparameters.
print("Best hyperparameters: ", best_params)

In [None]:
# Build the random forest model with the selected hyperparameters.
model = RandomForestClassifier(max_depth=10, min_samples_split=10, n_estimators=200)

# Fit the model on the undersampled training data.
model.fit(X_resampled, y_resampled)

# Predict on the held-out test set.
y_pred = model.predict(X_test)

# Evaluate the accuracy and precision of the predictions.
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
    
print(f"Accuracy: {acc:.3f}")
print(f"Precision: {prec:.3f}")
print(confusion_matrix(y_test, y_pred))

In [None]:
# Getting predicted probabilities
y_prob = model.predict_proba(X_test)[:, 1]

# Calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

# Calculate average precision score
avg_precision = average_precision_score(y_test, y_prob)

# Plot precision-recall curve
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, label=f'Precision-Recall Curve (Avg Precision = {avg_precision:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.show()

In the final step, we build the final model. We choose the Random Forest algorithm together with TomekLinks undersampling, and use Grid Search to tune its hyperparameters.
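
By default, `GridSearchCV` ranks a classifier's parameter combinations by accuracy. Since precision is the priority here, one option is to pass `scoring="precision"` and reuse `best_estimator_` instead of retyping the chosen values. The sketch below runs on synthetic data with a deliberately small grid; the data, the grid, and the `*_demo` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the loan features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [20, 50], "max_depth": [3, None]},
    scoring="precision",  # rank candidates by precision instead of accuracy
    cv=3,
)
search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is already refit on all of X_demo.
final_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```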

We then evaluate the model with the Precision-Recall Curve. The curve yields an Average Precision score of 0.41. Average Precision summarizes the curve as a weighted mean of the precision at each threshold, with the increase in recall from the previous threshold as the weight. It measures how well the model ranks the positive class (the bad loans we aim to identify) above the negative class; a model with no skill scores about the share of positives in the data.
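
The definition of Average Precision can be checked by hand against scikit-learn on a small made-up example. The labels and scores below are hypothetical; `baseline` is the share of positives, which is roughly what a no-skill ranking would score.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical true labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.5, 0.65, 0.7, 0.9])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# AP = sum over thresholds of (change in recall) * precision.
# recall is returned in decreasing order, hence the minus sign.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])
ap_sklearn = average_precision_score(y_true, y_score)

# Share of positives: the approximate score of a no-skill ranking.
baseline = y_true.mean()

print(f"manual={ap_manual:.4f}, sklearn={ap_sklearn:.4f}, baseline={baseline:.2f}")
```

Comparing the reported score with the share of bad loans in the test set gives a sense of how much the model improves on a no-skill ranking.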

# Conclusion

In conclusion, we began by examining the dataset and the variables it contains. We found no duplicated rows, but some values were missing, so we removed the affected rows and one mostly empty column. We then explored the data with descriptive statistics and visualizations.

One interesting finding was that loans with better grades were less likely to have repayment problems. We also found that some variables, including the target, were imbalanced.

Because the data was imbalanced, we tried two resampling methods: oversampling (SMOTE) and undersampling (TomekLinks). We applied two models, Random Forest and Logistic Regression, to predict whether a loan would go bad, and compared them in terms of accuracy and precision. Our results showed that Random Forest worked best with TomekLinks undersampling.

This work involved careful data handling, model selection, and close study of the results. The outcome is a model aimed at balancing predictive accuracy against lending risk, which can support fairer and more stable lending decisions.