# Lending Club Case Study
## by Ankit Kumar Surana

## Introduction
As an employee of a consumer finance company that specialises in lending different kinds of loans to urban clients, part of my job is to support loan approval decisions by evaluating application profiles and identifying risks to repayment. To do this, I need to analyze the data in "loan.csv", which contains historical information about past loan applicants, including whether they defaulted. The goal is to find patterns that indicate an applicant is likely to default, which in turn enables further action such as denying a loan, reducing the loan amount, or charging risky applicants a higher interest rate.

Through the analysis, I aim to understand consumer and loan attributes affecting the customer's tendency to default, and also to find the driving factors, or variables, behind loan defaults. The company can then use such knowledge to improve its portfolio and risk assessment strategies.

## Preliminary Wrangling

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import warnings

warnings.filterwarnings('ignore')

%matplotlib inline

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)


# Gathering

In [None]:
df = pd.read_csv('loan.csv')

In [None]:
# high-level overview of data shape and composition
print(df.shape)

In [None]:
print(df.info())

In [None]:
df.head()

In [None]:
# Data Dictionary
data_dictionary = pd.read_excel('Data_Dictionary.xlsx')
data_dictionary = data_dictionary.dropna(axis=0, how="any")
data_dictionary.shape

# Assessing

In [None]:
# Check duplicated value
df.duplicated().sum()

In [None]:
# Find the columns in which every value is null
null_cols = df.columns[df.isnull().all(axis=0)].tolist()

print(f"List of columns that are entirely NULL : \n\n {null_cols} \n")
print(f"Count of columns having all NULL values : {len(null_cols)}")

In [None]:
# Find the uniqueness of a column in data frame

uniq_list = df.columns[(df.nunique() == 1)].tolist()
print("\nList of columns that have same value for all records : ", uniq_list )

In [None]:
# Find columns that have Categorical variables in the dataset

# Function that lists the categorical_values in a column
def categorical_values(column_list):
    for column in column_list:
         print(f"<<<<< {column} >>>>> \n")
         print(df[column].value_counts(), "\n")

column_list = ['term', 'grade', 'sub_grade', 'verification_status', 'loan_status', 'purpose', "home_ownership"]
categorical_values(column_list)

>1) Columns used post-loan approval need to be dropped.
>2) Some rows have loan_status as "Current".
>3) Some columns have all NULL values.
>4) Some columns are textual and masked and do not aid in the analysis.
>5) Some columns have the same values across all rows of the dataset.
>6) For the columns where the data has a % symbol in it, clean the data.
>7) Remove the letter prefix from the sub-grade.
>8) The values in the emp_length need to be cleaned.
>9) Round the amount values to 2 decimal places.
>10) Some columns with date values are of object data type.
>11) Convert the data type to float after cleaning the data with % in them.
>12) Convert the data type to categorical for columns that have categorical values.
>13) Break down the date columns to smaller metrics like month, and year.
>14) Deriving a categorical column from loan_amnt.
>15) Handle the missing values: imputing/ deleting.
>16) Renaming the columns : Abbreviations etc.
>17) Treating the outliers.

# Cleaning

In [None]:
df_clean = df.copy()

In [None]:
df_clean.shape

##### Define

> 1) Dropping columns used post-loan approval that would not aid in analysis.

##### Code

In [None]:
# Columns in data_dictionary not available in the data
data_dictionary[data_dictionary.LoanStatNew.isin(df.columns.tolist()) == False]

In [None]:
# Updating the data_dictionary to keep only the columns that are present in the data
data_dictionary = data_dictionary[data_dictionary.LoanStatNew.isin(df.columns.tolist())]

In [None]:
# data_dictionary

In [None]:
post_loan_cols = ['earliest_cr_line', 'collection_recovery_fee' , 'last_credit_pull_d',
 'delinq_2yrs', 'inq_last_6mths', 'last_pymnt_amnt', 'last_pymnt_d', 
 'open_acc', 'pub_rec', 'recoveries', 'revol_bal', 'revol_util', 
 'total_acc', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 
 'total_rec_late_fee' ]

# Updating the data_dictionary of useful columns
data_dictionary = data_dictionary[data_dictionary.LoanStatNew.isin(post_loan_cols) == False].reset_index(drop=True)

#Dropping columns used post-loan approval
df_clean = df_clean.drop(post_loan_cols, axis=1)

##### Test

In [None]:
# Validating df_clean for post-approval columns
df_clean.columns[df_clean.columns.isin(post_loan_cols) == True]

##### Define

>2) Dropping rows that have loan_status as "Current".
>3) Dropping the columns having all NULL values.
>4) Dropping additional columns that do not aid in analysis : 'id', 'member_id', 'url', 'title', 'emp_title', 'desc', 'zip_code'
>5) Dropping the columns that have same values in all rows of the dataset.

##### Code

In [None]:
# Dropping rows that have loan_status as "Current".
df_clean = df_clean[df_clean['loan_status']!='Current']

In [None]:
# Removing from the data_dictionary the columns whose values are all null
data_dictionary = data_dictionary[data_dictionary.LoanStatNew.isin(df_clean.columns[df_clean.isna().all()].tolist()) == False].reset_index(drop=True)

# Dropping the columns in which every value is NULL
df_clean = df_clean.dropna(axis = 1, how = 'all')

In [None]:
# Dropping any additional columns that do not aid in analysis.
col_drop = ['id', 'member_id', 'url', 'title', 'emp_title', 'desc', 'zip_code']

#Update the data_dictionary by removing the col_drop
data_dictionary = data_dictionary[data_dictionary.LoanStatNew.isin(col_drop) == False].reset_index(drop=True)

df_clean = df_clean.drop(col_drop, axis=1)

In [None]:
# Dropping all the columns that have same values in all the rows of the dataset.
uniq_val_cols = df_clean.columns[(df_clean.nunique() == 1)].tolist()

#Update the data_dictionary by removing the uniq_val_cols 
data_dictionary = data_dictionary[data_dictionary.LoanStatNew.isin(uniq_val_cols) == False].reset_index(drop=True)

df_clean = df_clean.drop(uniq_val_cols, axis=1)

##### Test

In [None]:
# Validating if there are any rows that have loan_status as "Current".
df_clean[df_clean['loan_status']=='Current']

In [None]:
# Validating if there is any column that has all NULL values
df_clean.columns[df_clean.isnull().all(axis=0)].tolist()

In [None]:
# Validating if there are columns that have same values for all rows in the dataset.
df_clean.columns[(df_clean.nunique() == 1)].tolist()

##### Define

>6) Removing the % symbol from the "int_rate" column.
>7) Removing the letter prefix from the sub-grade.
>8) Cleaning the values in "emp_length" column by removing the "years" from the data and converting "10+" to 10 and "< 1" to 0.
>9) Round the amounts to 2 decimal places.

##### Code

In [None]:
# Removing the % symbol from the int_rate column.

df_clean['int_rate'] = df_clean['int_rate'].str.split("%").str[0]

In [None]:
# Removing the letter prefix from sub_grade by keeping only its second character (the digit).
df_clean['sub_grade'] = df_clean['sub_grade'].str[1]

In [None]:
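# Cleaning the values in "emp_length" column by removing the "years" from the data and converting "10+" to 10 and "< 1" to 0.

# Keep the text before "year" and strip the trailing space left behind
df_clean.emp_length = df_clean.emp_length.str.split("year").str[0].str.strip()
# regex=False treats "+" as a literal character rather than a regex quantifier
df_clean.emp_length = df_clean.emp_length.str.replace("+", "", regex=False).str.replace("< 1", "0", regex=False)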
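The emp_length rule described above can also be checked on a few made-up values before it is applied to the full column. The sketch below is an alternative formulation, not the code used in this notebook: `clean_emp_length` is a hypothetical helper, the sample strings are assumed to mirror the raw format, and it returns numbers rather than strings.

```python
import pandas as pd

# Made-up values in the same textual style as the raw emp_length column.
raw = pd.Series(["10+ years", "< 1 year", "3 years", "1 year", None])

def clean_emp_length(s: pd.Series) -> pd.Series:
    """Map '< 1 year' to 0, '10+ years' to 10 and 'N years' to N (hypothetical helper)."""
    # Pull out the first run of digits; missing values stay missing.
    years = s.str.extract(r"(\d+)", expand=False).astype("float")
    # '< 1 year' also contains the digit 1, so those rows are overridden with 0.
    years = years.mask(s.str.contains("<", na=False, regex=False), 0.0)
    return years

cleaned = clean_emp_length(raw)
print(cleaned.tolist())
```

Keeping missing values as NaN at this stage leaves the imputation decision to the later missing-value step.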

In [None]:
# Round the amounts to 2 decimal places

rnd_cols = ['loan_amnt','funded_amnt','funded_amnt_inv','installment','annual_inc']

# Converting all the columns to float and then rounding to 2 decimal places
df_clean[rnd_cols] = df_clean[rnd_cols].astype("float").round(2)

##### Test

In [None]:
# Validating the int_rate column.
df_clean.int_rate.describe()

In [None]:
# Validating the emp_length column.
df_clean.emp_length.value_counts()

In [None]:
# Validating the amount and rate columns data type.
print(df_clean.info())

df_clean.head()


##### Define

> 10) Converting the data type of date columns from object to datetime.
> 11) Converting the data type of rate columns to float : 'int_rate'
> 12) Convert the data type to categorical for columns that have categorical values.

##### Code

In [None]:
# Converting to date type
for col in df_clean.columns.to_list():
    if re.match('(.*_d$|.*cr_line$)', col):
        df_clean[col] = pd.to_datetime(df_clean[col],format="%b-%y")

In [None]:
# Converting to float type
df_clean['int_rate'] = df_clean['int_rate'].astype("float")

In [None]:
# Converting to Categories
columns = ['emp_length', 'home_ownership', 'grade', 'sub_grade', 'loan_status', 'term', 'verification_status']
df_clean[columns] = df_clean[columns].astype("category")

##### Test

In [None]:
# Validating data types of date and rate columns
df_clean.info()

##### Define

> 13) Break down the date column into smaller metrics like : years and months
> 14) Deriving a categorical column loan_amnt_b from loan_amnt

##### Code

In [None]:
#Breaking down the date column into smaller metrics like : years and months

df_clean['issue_d_year'] = df_clean['issue_d'].dt.year
df_clean['issue_d_month'] = df_clean['issue_d'].dt.month_name()
# df_clean.drop('issue_d', axis=1, inplace=True)

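# Converting the data type to an ordered categorical, using calendar order
# explicitly instead of relying on the order in which months appear in the data
issue_d_month_range = ['January', 'February', 'March', 'April', 'May', 'June',
                       'July', 'August', 'September', 'October', 'November', 'December']
df_clean['issue_d_month'] = pd.Categorical(df_clean['issue_d_month'], categories=issue_d_month_range, ordered=True)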
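As a quick check of the date breakdown, the sketch below parses a few made-up `Mon-YY` strings with the same `%b-%y` format and derives the year and an ordered month category. The sample values are assumptions, and `calendar.month_name` is assumed to return English names (the default locale).

```python
import calendar
import pandas as pd

# Made-up issue dates in the "Mon-YY" style (illustrative only).
raw = pd.Series(["Dec-11", "Jan-08", "Mar-10"])
parsed = pd.to_datetime(raw, format="%b-%y")

# Calendar order, January through December (index 0 of month_name is an empty string).
month_order = list(calendar.month_name)[1:]

years = parsed.dt.year
months = pd.Categorical(parsed.dt.month_name(), categories=month_order, ordered=True)

# Sorting an ordered categorical follows calendar order, not alphabetical order.
sorted_months = pd.Series(months).sort_values().astype(str).tolist()
print(years.tolist(), sorted_months)
```

An ordered category keeps month plots in calendar order regardless of how the rows happen to be sorted.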

In [None]:
# Deriving categorical column loan_amnt_b from loan_amnt
# Edges run from 0 to 35000 so that amounts above 30000 still fall into a bin
bins = list(range(0, 40000, 5000))
labels = [f"{bins[i]}-{bins[i+1]}" for i in range(len(bins)-1)]
df_clean['loan_amnt_b'] = pd.cut(df_clean['loan_amnt'], bins=bins, labels=labels)
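`pd.cut` builds right-closed intervals by default, so a value is only assigned a bucket if it is no larger than the last edge; anything beyond it becomes NaN. The sketch below shows how boundary values are bucketed, using made-up amounts and edges that are assumed to run up to 35000.

```python
import pandas as pd

# Made-up loan amounts that sit on and around the bin boundaries.
amounts = pd.Series([500, 5000, 5001, 30000, 35000])

edges = list(range(0, 40000, 5000))  # 0, 5000, ..., 35000
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# Intervals are (lo, hi], so 5000 lands in "0-5000" and 5001 in "5000-10000".
buckets = pd.cut(amounts, bins=edges, labels=labels)
print(buckets.astype(str).tolist())
```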

##### Test

In [None]:
#Validating the smaller metrics like : years and months of a date column
df_clean[['issue_d_year','issue_d_month']].head()

In [None]:
# Validating categorical column loan_amnt_b derived from loan_amnt
df_clean['loan_amnt_b'].head()

##### Define

> 15) Handle the missing values: imputing/ deleting.

##### Code

In [None]:
# Columns having NULL/NaN values 
round(df_clean.isnull().sum().sort_values(ascending=False)/len(df_clean)*100,2)

In [None]:
# Columns mths_since_last_record and mths_since_last_delinq can be dropped as more than 60% of the data is NULL/ NaN
drop_cols = ['mths_since_last_record','mths_since_last_delinq']
df_clean = df_clean.drop(drop_cols, axis=1)

#Update the data_dictionary by removing the drop_cols 
data_dictionary = data_dictionary[data_dictionary.LoanStatNew.isin(drop_cols) == False].reset_index(drop=True)

In [None]:
# Handling missing values for emp_length
print(round(df_clean.emp_length.value_counts()/len(df_clean)*100,2))

mode_value = df_clean.emp_length.mode()[0]
print("\nMode value for emp_length : ", mode_value)

# Imputing the NULL/ NaN values with mode value for emp_length
# (assigning back avoids relying on an in-place fill through attribute access)
df_clean['emp_length'] = df_clean['emp_length'].fillna(mode_value)

In [None]:
# Handling missing values for pub_rec_bankruptcies
print(round(df_clean.pub_rec_bankruptcies.value_counts()/len(df_clean)*100,2))

mode_value = df_clean.pub_rec_bankruptcies.mode()[0]
print("\nMode value for pub_rec_bankruptcies : ", mode_value)

#More than 90% of the records have pub_rec_bankruptcies as 0.0. Hence imputing the value with 0.0
df_clean['pub_rec_bankruptcies'] = df_clean['pub_rec_bankruptcies'].fillna(mode_value)

##### Test

In [None]:
# Validating the handling of missing values
df_clean.isnull().sum()

In [None]:
df_clean.head()

In [None]:
df_clean.info()

In [None]:
data_dictionary

##### Define

> 16) Renaming the abbreviated column : dti

##### Code

In [None]:
# Renaming the dti to debt_to_income
new_mapping = {'dti': 'debt_to_income'}
                        
df_clean = df_clean.rename(columns=new_mapping)

##### Test

In [None]:
df_clean.info()

##### Define

> 17) Identifying and handling the outliers/extreme values.

##### Code

In [None]:
# Treating outliers

def outlier_plot(dataframe, column_list): 
    """
    Plots boxplots to examine outliers in the specified columns of the given dataframe.

    Parameters:
    dataframe (DataFrame): The pandas DataFrame containing the data.
    column_list (list): A list of column names to examine for outliers.

    Returns:
    None
    """
    for index, value in enumerate(column_list): 
        title_name = f"Outlier Examination for {value} column"    
        plt.subplot(2, 3, index+1)
        plt.subplots_adjust(hspace = .4, wspace = .4)
        plt.title(title_name, fontsize=7)  
        dataframe[value].plot(figsize=(16,8), kind='box')

In [None]:
cols = ['loan_amnt', 'funded_amnt','funded_amnt_inv','installment','annual_inc']
outlier_plot(df_clean, cols)
# tight_layout must run before show, otherwise it acts on a new, empty figure
plt.tight_layout()
plt.show()

In [None]:
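# Using the 5th and 80th percentiles in place of the usual 25th and 75th quartiles, as most outliers lie beyond the 80th percentile.
# Q1, Q3 and IQR below therefore describe this wider 5%-80% range, not the standard interquartile range.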
Q1 = df_clean[cols].quantile(0.05)
Q3 = df_clean[cols].quantile(0.80)
IQR = Q3 - Q1

df_clean = df_clean[~((df_clean[cols] < (Q1 - 1.5 * IQR)) | (df_clean[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
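The filtering rule can be wrapped in a small function and tried on made-up data to see which rows it removes. This is a sketch under stated assumptions: `drop_outliers` and its parameter names are hypothetical, and the toy values are invented so that one income is an obvious extreme.

```python
import pandas as pd

def drop_outliers(frame, columns, lower_q=0.05, upper_q=0.80, k=1.5):
    """Drop rows where any listed column lies outside the percentile-based fences."""
    low = frame[columns].quantile(lower_q)
    high = frame[columns].quantile(upper_q)
    spread = high - low
    # A row is flagged if any column falls below the lower or above the upper fence.
    outside = frame[columns].lt(low - k * spread) | frame[columns].gt(high + k * spread)
    return frame[~outside.any(axis=1)]

# Made-up values for illustration; the last income is far above the rest.
toy = pd.DataFrame({
    "loan_amnt": list(range(1000, 11000, 1000)),
    "annual_inc": list(range(30000, 120000, 10000)) + [6000000],
})

filtered = drop_outliers(toy, ["loan_amnt", "annual_inc"])
print(len(toy), "->", len(filtered))
```

Exposing the percentiles and the multiplier as parameters makes it easy to compare how many rows different fence choices would discard.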

##### Test

In [None]:
outlier_plot(df_clean, cols)
df_clean.shape

### Below is the data dictionary for the remaining columns on which we will conduct the analysis.

In [None]:
data_dictionary

### Below is the segregation of Customer and Loan attributes post Data Assessment and Cleaning

__Customer Attributes__
> 1. annual_inc → Float Data Type
> 2. debt_to_income → Float Data Type
> 3. pub_rec_bankruptcies → Float Data Type
> 4. home_ownership → Categorical Data Type
> 5. addr_state → String Data Type
> 6. emp_length → Categorical Data Type

__Loan Attributes__
> 1. term → Categorical Data Type
> 2. issue_d → DateTime Data Type
> 3. grade → Categorical Data Type
> 4. sub_grade → Categorical Data Type
> 5. verification_status → Categorical Data Type
> 6. loan_status → Categorical Data Type
> 7. purpose → Categorical Data Type
> 8. loan_amnt → Float Data Type
> 9. funded_amnt → Float Data Type
> 10. funded_amnt_inv → Float Data Type
> 11. int_rate → Float Data Type
> 12. installment → Float Data Type


__Derived Attributes__
> 1. issue_d_year → Integer Data Type
> 2. issue_d_month → Categorical Data Type
> 3. loan_amnt_b → Categorical Data Type

In [None]:
df_clean.info()

In [None]:
numeric_columns  = df_clean.select_dtypes(exclude=['object','datetime','category']).columns.tolist()
categorical_columns = df_clean.select_dtypes(include=['category']).columns.tolist()
extra_columns = df_clean.select_dtypes(include=['object','datetime']).columns.tolist()
print("numeric_columns : ", numeric_columns)
print("categorical_columns : ", categorical_columns)
print("extra_columns : ", extra_columns)

# Exploratory Data Analysis

#### __Univariate Analysis__ 
 → Mean, Median, Max, Min, Std, Variance, Count
 → Distribution ( Histogram, CountPlot, BoxPlot)
#### __Bivariate Analysis__
 → Relationship Between 2 Variables ( ScatterPlot, BoxPlot, BarPlot etc)
#### __Multivariate Analysis__
 → Relationship Between more variables ( Heatmap etc.)

In [None]:
numerical_columns  = df_clean.select_dtypes(exclude=['object','datetime','category']).columns.tolist()
categorical_columns = df_clean.select_dtypes(include=['category']).columns.tolist()
extra_columns = df_clean.select_dtypes(include=['object','datetime']).columns.tolist()
print("numerical_columns -> ", numerical_columns)
print("categorical_columns -> ", categorical_columns)
print("extra_columns -> ", extra_columns)

## Univariate Exploration

In [None]:
# Re-ordering categorical variables

# Sorting emp_length order numerically (0, 1, ..., 10)
emp_length_order = sorted(df_clean['emp_length'].unique().tolist(), key=int)
df_clean['emp_length'] = df_clean['emp_length'].cat.reorder_categories(emp_length_order)

In [None]:
# Class for performing univariate analysis on a specified column in a DataFrame.
class UnivariateAnalysis:
    # Initializes the UnivariateAnalysis object with the given DataFrame.
    def __init__(self, dataframe,column_name):
       
        self.dataframe = dataframe
        self.column_name = column_name
        print(f"Initiating detailed analysis of {column_name}...")
        print(f"\nStatistical summary for {self.column_name}:\n{self.dataframe[self.column_name].describe()}")
        mode = self.dataframe[self.column_name].mode()[0]
        print(f"\nThe mode of {self.column_name} is: {mode}\n")

    # Performs univariate analysis on the specified column with bins.
    def analyze_with_bins(self, bin_range=None, discrete=False):
        sns.set_style('whitegrid')
        plt.figure(figsize=(12, 6))

        sns.histplot(data=self.dataframe, x=self.column_name, bins=bin_range, discrete=discrete, kde=True, color='skyblue')
        plt.title(f'Distribution of {self.column_name} with Bins', fontsize=16, fontweight='bold')
        
        plt.xlabel(self.column_name, fontsize=14)
        plt.ylabel('Frequency', fontsize=14)
        plt.xticks(bin_range, rotation=45, fontsize=12)
        plt.yticks(fontsize=12)

        plt.tight_layout()
        plt.show()

    # Performs univariate analysis on the specified column without bins.
    def analyze_without_bins(self):
        sns.set_style('whitegrid')
    
        fig, ax = plt.subplots(1, 2, figsize=(16, 6))
        
        sns.histplot(data=self.dataframe, x=self.column_name, ax=ax[0], kde=True, color='salmon')
        ax[0].set_title(f'{self.column_name} Histogram', fontsize=16, fontweight='bold')
    
        sns.boxplot(data=self.dataframe, y=self.column_name, ax=ax[1], palette='muted')
        ax[1].set_title(f'{self.column_name} Box Plot', fontsize=16, fontweight='bold')
    
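        # The histogram shows frequency per value; the box plot shows the values themselves.
        ax[0].set_xlabel(self.column_name, fontsize=14)
        ax[0].set_ylabel('Frequency', fontsize=14)
        ax[1].set_xlabel('')
        ax[1].set_ylabel(self.column_name, fontsize=14)
        for axis in ax:
            axis.tick_params(axis="x", rotation=45, labelsize=12)
            axis.tick_params(axis="y", labelsize=12)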
    
        plt.tight_layout()
        plt.show()

In [None]:
univariate_analysis = UnivariateAnalysis(df_clean, 'loan_amnt')
univariate_analysis.analyze_without_bins()
univariate_analysis.analyze_with_bins(bin_range=range(0, 35000, 5000))

##### __Observation__ : The distribution above shows that most loan amounts fall between 5000 and 10000, followed by 0-5000 and then 10000-15000. The mean loan amount is 10678 and the mode is 10000.

In [None]:
univariate_analysis = UnivariateAnalysis(df_clean, 'annual_inc')
univariate_analysis.analyze_without_bins()
univariate_analysis.analyze_with_bins(bin_range=range(0, 240000, 20000))

##### __Observation__ : The distribution above shows that most loan applications came from customers whose annual income lies between 30000 and 60000. The mean annual income is 63517 and the mode is 60000.

In [None]:
univariate_analysis = UnivariateAnalysis(df_clean, 'int_rate')
univariate_analysis.analyze_without_bins()
univariate_analysis.analyze_with_bins(bin_range=range(5, 25, 1))

In [None]:
univariate_analysis = UnivariateAnalysis(df_clean, 'debt_to_income')
univariate_analysis.analyze_without_bins()
univariate_analysis.analyze_with_bins(bin_range=range(0, 30, 2))

In [None]:
for category in categorical_columns:
    univariate_analysis = UnivariateAnalysis(df_clean, category)
    univariate_analysis.analyze_without_bins()
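For categorical columns a box plot carries little meaning, so a plain frequency table can be a useful companion to the plots. The sketch below is one possible way to build it; `frequency_table` is a hypothetical helper and the grades are made-up values.

```python
import pandas as pd

def frequency_table(frame, column):
    """Count and percentage share of each category in a column (hypothetical helper)."""
    counts = frame[column].value_counts(dropna=False)
    percent = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})

# Made-up grades for illustration.
toy = pd.DataFrame({"grade": ["A", "B", "B", "C", "B", "A", "B", "C"]})

table = frequency_table(toy, "grade")
print(table)
```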

In [None]:
month_range = df_clean.issue_d_month.unique().tolist()
univariate_analysis = UnivariateAnalysis(df_clean, 'issue_d_month')
univariate_analysis.analyze_with_bins(bin_range=month_range, discrete=True)

In [None]:
year_range = df_clean.issue_d_year.unique().tolist()
univariate_analysis = UnivariateAnalysis(df_clean, 'issue_d_year')
univariate_analysis.analyze_with_bins(bin_range=year_range, discrete=True)

## Segmented Univariate Exploration

#### Segmenting the loan status into 'fully_paid' and 'charged_off' and analyzing the impact of other parameters.

> Loan Status → Fully Paid 

In [None]:
df_fully_paid = df_clean[df_clean['loan_status'] == 'Fully Paid']
df_fully_paid.shape

In [None]:
univariate_analysis = UnivariateAnalysis(df_fully_paid, 'loan_amnt')
univariate_analysis.analyze_without_bins()
univariate_analysis.analyze_with_bins(bin_range=range(1000, 38000, 3000))

In [None]:
univariate_analysis = UnivariateAnalysis(df_fully_paid, 'int_rate')
univariate_analysis.analyze_without_bins()
univariate_analysis.analyze_with_bins(bin_range=range(1, 38, 1))

In [None]:
univariate_analysis = UnivariateAnalysis(df_fully_paid, 'annual_inc')
univariate_analysis.analyze_without_bins()
univariate_analysis.analyze_with_bins(bin_range=range(0, 220000, 8000))

In [None]:
cat_columns = ['loan_amnt_b', 'grade', 'emp_length','verification_status', 'home_ownership']
for category in cat_columns:
    univariate_analysis = UnivariateAnalysis(df_fully_paid, category)
    univariate_analysis.analyze_without_bins()

> Loan Status → Charged Off 

In [None]:
df_charged_off = df_clean[df_clean['loan_status'] == 'Charged Off']
df_charged_off.shape

In [None]:
univariate_analysis = UnivariateAnalysis(df_charged_off, 'loan_amnt')
univariate_analysis.analyze_without_bins()
univariate_analysis.analyze_with_bins(bin_range=range(1000, 38000, 3000))

In [None]:
univariate_analysis = UnivariateAnalysis(df_charged_off, 'int_rate')
univariate_analysis.analyze_without_bins()
univariate_analysis.analyze_with_bins(bin_range=range(1, 38, 1))

In [None]:
univariate_analysis = UnivariateAnalysis(df_charged_off, 'annual_inc')
univariate_analysis.analyze_without_bins()
univariate_analysis.analyze_with_bins(bin_range=range(0, 220000, 8000))

In [None]:
cat_columns = ['loan_amnt_b', 'grade', 'emp_length','verification_status', 'home_ownership']
for category in cat_columns:
    univariate_analysis = UnivariateAnalysis(df_charged_off, category)
    univariate_analysis.analyze_without_bins()
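Comparing the two segments plot by plot can be supplemented with a single number per category: the share of loans in that category that were charged off. The sketch below assumes the status labels 'Fully Paid' and 'Charged Off' used above; `charged_off_rate` is a hypothetical helper and the rows are made up.

```python
import pandas as pd

def charged_off_rate(frame, by, status_col="loan_status", bad_label="Charged Off"):
    """Percentage of charged-off loans within each category of `by` (hypothetical helper)."""
    is_bad = frame[status_col].eq(bad_label)
    # The mean of a boolean series is the fraction of True values in each group.
    rate = is_bad.groupby(frame[by], observed=True).mean().mul(100).round(2)
    return rate.sort_values(ascending=False)

# Made-up loans for illustration.
toy = pd.DataFrame({
    "grade": ["A", "A", "A", "A", "B", "B"],
    "loan_status": ["Fully Paid", "Fully Paid", "Fully Paid", "Charged Off",
                    "Fully Paid", "Charged Off"],
})

rate = charged_off_rate(toy, "grade")
print(rate)
```

A rate is easier to compare across categories than raw counts, because the Fully Paid segment is usually much larger than the Charged Off one.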

## Bivariate Exploration

In [None]:
# A class for performing bivariate analysis on a DataFrame.

class BivariateAnalysis:
    def __init__(self, dataframe):
        self.dataframe = dataframe

    # Generates a scatter plot for two specified columns in a DataFrame.
    def scatter_plot(self, x_column, y_column, marker_size=10, alpha=0.2, color='orange'):

        sns.set(style="whitegrid")
        plt.figure(figsize=(7, 5))
        sns.scatterplot(data=self.dataframe, x=x_column, y=y_column, s=marker_size, alpha=alpha, color=color)

        # Set plot title and labels
        plt.title(f'Scatter Plot: {x_column} vs {y_column}', fontsize=13)
        plt.xlabel(x_column, fontsize=12)
        plt.ylabel(y_column, fontsize=12)

        plt.tight_layout()
        plt.show()

    # Generates a boxplot for a categorical column against a numerical column for bivariate analysis.
    def boxplot(self, categorical_column, numerical_column, palette='pastel'):

        sns.set(style="whitegrid")
        plt.figure(figsize=(7, 5))
        sns.boxplot(data=self.dataframe, x=categorical_column, y=numerical_column, palette=palette)

        # Set plot title and labels
        plt.title(f'Boxplot: {categorical_column} vs {numerical_column}', fontsize=13)
        plt.xlabel(categorical_column, fontsize=12)
        plt.ylabel(numerical_column, fontsize=12)

        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()

### Bivariate Analysis - Numerical vs Numerical

In [None]:
bivariate_analysis = BivariateAnalysis(df_clean)

bivariate_analysis.scatter_plot('loan_amnt', 'int_rate', color='orange')
bivariate_analysis.scatter_plot('loan_amnt', 'installment', color='#FF5733')
bivariate_analysis.scatter_plot('loan_amnt', 'annual_inc', color='skyblue')
bivariate_analysis.scatter_plot('loan_amnt', 'pub_rec_bankruptcies', color='green')
bivariate_analysis.scatter_plot('annual_inc', 'int_rate', color='#8A2BE2')
bivariate_analysis.scatter_plot('annual_inc', 'debt_to_income', color='pink')

### Bivariate Analysis - Categorical vs Numerical

In [None]:
bivariate_analysis = BivariateAnalysis(df_clean)

bivariate_analysis.boxplot('term', 'loan_amnt', palette='deep')
bivariate_analysis.boxplot('grade', 'loan_amnt', palette='muted')
bivariate_analysis.boxplot('emp_length', 'loan_amnt', palette='pastel')
bivariate_analysis.boxplot('loan_status', 'int_rate', palette='dark')
bivariate_analysis.boxplot('grade', 'int_rate', palette='colorblind')
bivariate_analysis.boxplot('verification_status', 'loan_amnt', palette='OrRd')
bivariate_analysis.boxplot('home_ownership', 'loan_amnt', palette='YlOrRd')

Here are the observations derived from the above analysis:


## Multivariate Analysis

In [None]:
# A class for performing multivariate analysis on a DataFrame.
class MultivariateAnalysis:
    def __init__(self, dataframe):
        self.dataframe = dataframe
    
    # Generates a heatmap for visualizing the correlation matrix of numerical columns in the DataFrame. 
    def heatmap(self, cmap='coolwarm'):
        sns.set(style="white")
        plt.figure(figsize=(10, 8))
        sns.heatmap(self.dataframe.corr(), cmap=cmap, annot=True, fmt=".2f", linewidths=0.5)
        plt.title('Correlation Matrix Heatmap', fontsize=16)
        plt.xticks(rotation=45)
        plt.yticks(rotation=0)
        plt.tight_layout()
        plt.show()


In [None]:
# Generating a heatmap to visualize the correlation matrix of numerical columns in the DataFrame
df_heatmap = df_clean[numerical_columns]
multivariate_analysis = MultivariateAnalysis(df_heatmap)
multivariate_analysis.heatmap()
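A heatmap with many columns can be hard to read, so listing the strongest pairwise correlations is a handy complement. The sketch below is an assumption-laden illustration: `top_correlations` is a hypothetical helper and the numbers are invented.

```python
import numpy as np
import pandas as pd

def top_correlations(frame, n=3):
    """Return the n column pairs with the largest absolute correlation (hypothetical helper)."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Made-up values: installment tracks loan_amnt closely, int_rate does not.
toy = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000, 5000],
    "installment": [35, 70, 100, 140, 170],
    "int_rate": [12, 8, 15, 9, 11],
})

top = top_correlations(toy, n=1)
print(top)
```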