## A Bird's Eye View of the All LendingClub Loan Dataset and Its Exploratory Data Analysis from Kaggle Notebooks

**LendingClub** is an American peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. At its height, LendingClub was the world's largest peer-to-peer lending platform. The company claims that \$15.98 billion in loans had been originated through its platform up to December 31, 2015.
LendingClub enables borrowers to create unsecured personal loans between \$1,000 and \$40,000. The standard loan period is three years. Investors are able to search and browse the loan listings on the LendingClub website and select loans that they want to invest in based on the information supplied about the borrower, amount of loan, loan grade, and loan purpose. Investors make money from the interest on these loans. LendingClub makes money by charging borrowers an origination fee and investors a service fee.
-https://en.wikipedia.org/wiki/LendingClub

#### Our dataset of choice is the All LendingClub Loan Dataset
[**Dataset**](https://www.kaggle.com/wordsforthewise/lending-club)

![](https://drive.google.com/uc?export=view&id=13xO5P23NpjukRCmWyylB72sNCfrhNnn9)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from scipy import stats
sns.set(color_codes=True)
import functools
import matplotlib

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import warnings
warnings.simplefilter(action='ignore')

In [None]:
# Set the global plotting parameters
sns.set_style('white')
matplotlib.rcParams['font.size'] = 22
matplotlib.rcParams['font.weight'] = 'bold'
matplotlib.rcParams['figure.facecolor'] = '#00000000'

*******************************************************************************************************************************

#### Exploratory Data Analysis (EDA) is an important initial assessment of the structure of the data

In most cases EDA consists of
* The physical structure of the data, viz. size, number of samples and features, presence or absence of missing values, and whether the features are numerical or string objects, all of which have a bearing on further analysis
* The distribution of data in individual features (normal/binomial/multimodal etc)
* The correlation of the features with each other
* Important inferences that can be obtained from the data with respect to the business goals for conducting the analysis

EDA forms the basis of further analysis by various ML algorithms. 
Our approach to EDA for the All Lending Club Loan Dataset will follow the concept of a literature review.

What is a Literature Review ?

A literature review is a piece of academic writing demonstrating knowledge and understanding of the academic literature on a specific topic placed in context.  A literature review also includes a critical evaluation of the material

 -University of Edinburgh

EDA of this dataset has been conducted several times before on Kaggle, and it would be illuminating to study those efforts before posing further queries and analyses. This notebook attempts to describe the challenges with the dataset and encapsulate earlier attempts, for future adventurous data scientists. The effort in this notebook is to provide the maximum possible meta-information; for interpretation of the data, the previous Kaggle notebooks can be consulted.

In [None]:
!zcat accepted_2007_to_2018Q4.csv.gz|wc -l

In [None]:
!zcat rejected_2007_to_2018Q4.csv.gz|wc -l

#### The dataset comprises two .csv files

1. the accepted_2007_to_2018Q4.csv is a 1.6 GB csv file with 2.2 million rows

2. the rejected_2007_to_2018Q4.csv is a 1.5 GB csv file with 27 million rows
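
The row counts above come from `zcat ... | wc -l`. Where the shell tools are not available, the same count can be sketched in Python by streaming the compressed file in chunks so that it is never fully held in memory. The helper name `count_lines_gz` and the tiny demo file are assumptions made for illustration; they are not part of the original notebook.

```python
import gzip
import os
import tempfile

def count_lines_gz(path, chunk_size=1 << 20):
    """Count newline characters in a gzip file, reading fixed-size chunks."""
    total = 0
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total

# Demo on a tiny throwaway file (hypothetical data, not the LendingClub file)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as fh:
    fh.write("id,loan_amnt\n1,1000\n2,2500\n")

n_lines = count_lines_gz(demo_path)
print(n_lines)  # counts the header line plus the two data rows
```

Like `wc -l`, this counts newline characters, so the header row is included in the total.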


#### A dataset with 30 million odd rows seems challenging, so we will compare this dataset to other popular Kaggle datasets.



#### We will use the Kaggle API to download the most voted Kaggle datasets

In [None]:
# install the Kaggle API
!pip install --upgrade --quiet --force-reinstall --no-deps kaggle

In [None]:
# set API authentication
!mkdir -p /root/.kaggle
!cp kaggle.json /root/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json

In [None]:
# example kaggle API to download the datasets with the keyword "lending"
!kaggle datasets list -v -s '^lending'|head -n 3

In [None]:
# code to fetch Kaggle datasets and append to file
for i in range (1,4):
  !kaggle datasets list --sort-by 'votes' -p $i -s ""  -v >> 'kaggle_datasets.csv'

In [None]:
kaggle_datasets=pd.read_csv('./input/lendingclub-eda-dataset/kaggle_datasets.csv') 

#### Kaggle datasets sorted on vote counts

In [None]:
kaggle_datasets['vote_per_download']=(kaggle_datasets.voteCount/kaggle_datasets.downloadCount)*100

#### Highly upvoted Kaggle datasets

In [None]:
kaggle_datasets.sort_values('voteCount',ascending=False).reset_index(drop=True).head(n=8)

#### Kaggle datasets with high votes as percentage of downloads

In [None]:
kaggle_datasets.sort_values('vote_per_download',ascending=False).reset_index(drop=True).head(n=8)

In [None]:
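# function to convert the size feature (text with a GB/MB/KB suffix) into a number of megabytes
def convert(item):
  item=str(item).strip()
  if item.endswith('GB'):
    return float(item[:-2])*1000
  elif item.endswith('MB'):
    return float(item[:-2])
  elif item.endswith('KB'):
    return float(item[:-2])/1000
  else:
    return float(item) # no unit suffix: the value is used as-is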

In [None]:
# convert the text data to numeric data
kaggle_datasets['size']=kaggle_datasets['size'].map(convert)

In [None]:
kaggle_votecount=kaggle_datasets.voteCount
kaggle_size=kaggle_datasets['size']
kaggle_download=kaggle_datasets.downloadCount

**Lending-club Dataset**

This dataset is one of several versions of the LendingClub data available on Kaggle. It is the most downloaded and upvoted version.

In [None]:
!kaggle datasets list -s 'wordsforthewise/lending-club'

In [None]:
lendingclub=pd.DataFrame({"ref":["wordsforthewise/lending-club"],"title":["All Lending Club loan data"],"size":[1],"downloadCount":[23013],"voteCount":[436],"usabilityRating":[0.75]})
lendingclub['vote_per_download']=(lendingclub.voteCount/lendingclub.downloadCount)*100

In [None]:
lendingclub

In [None]:
# repeated values used to draw a red reference bar for lending-club in each histogram panel
x=[0.75 for num in range(18)]
y=[435 for num1 in range(30)]
z=[22981 for num2 in range(15)]
m=[1000 for num3 in range(6)]

In [None]:
fig,axes=plt.subplots(1,4,figsize=(16,3))
sns.histplot(kaggle_datasets.usabilityRating,ax=axes[0]);
sns.histplot(x,ax=axes[0],color='red',binrange=(0.73,0.76));
axes[0].set_title('Kaggle datasets')
sns.histplot(kaggle_votecount,ax=axes[1]);
sns.histplot(y,ax=axes[1],color='red',binrange=(300,800));
axes[1].set_title('Kaggle datasets')
sns.histplot(kaggle_datasets.downloadCount,ax=axes[2]);
sns.histplot(z,ax=axes[2],color='red',binrange=(14000,28000));
axes[2].set_title('Kaggle datasets')
sns.histplot(kaggle_size,ax=axes[3]);
sns.histplot(m,ax=axes[3],color='red',binrange=(900,1200));
axes[3].set_ylim(0,8)
axes[3].set_title('Kaggle datasets')
plt.legend(["popular", "lendingclub"], loc ="upper right")
plt.tight_layout()

#### Highly downloaded datasets have high votes. There are also datasets with a high ratio of upvotes to downloads, indicating their popularity with a niche audience. This dataset is among the larger datasets at > 1 GB (zipped) and has a high Usability Rating of 0.75, but it has not been downloaded (and therefore analyzed) as prolifically as other popular datasets, nor upvoted as handsomely. At 1.89 votes per 100 downloads, it has a lower vote-to-download ratio than datasets like "Animal Crossing New Horizons Catalog", with 69.11 votes per 100 downloads
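
The vote-to-download figure quoted here is the `vote_per_download` column computed earlier, that is, votes expressed as a percentage of downloads. As a small worked check, using the vote and download counts recorded in the `lendingclub` dataframe above (the function name is an assumption for illustration):

```python
def votes_per_100_downloads(vote_count, download_count):
    """Votes expressed as a percentage of downloads."""
    return vote_count / download_count * 100

# counts taken from the lendingclub dataframe defined in this notebook
lendingclub_ratio = votes_per_100_downloads(436, 23013)
print(round(lendingclub_ratio, 2))  # 1.89
```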


### <font color="eb3480"> What are the features of the dataset ?
</font>

#### **Loading the dataset in raw .csv format  fails**

**the kernel died in Binder while attempting to read the data as a CSV table with Apache Arrow**
* !pip install pyarrow
* from pyarrow import csv
* arrow_csv=csv.read_csv('accepted_2007_to_2018Q4.csv')

**the kernel died both in Binder as well as Google colab attempting to read the data as a Pandas csv object**
* pandas_csv=pd.read_csv('accepted_2007_to_2018Q4.csv')
* pd.read_csv('accepted_2007_to_2018Q4.csv',low_memory=True)
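
One commonly used alternative when a file does not fit in memory is to let Pandas read it in pieces: `pd.read_csv` accepts a `chunksize` argument and then returns an iterator of dataframes. The sketch below shows the pattern on a small in-memory stand-in; the column names and values are hypothetical, and whether this avoids the kernel crash on Binder or Colab for the full file has not been tested here.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the large CSV; in practice pass the file path instead
csv_text = "loan_amnt,grade\n1000,A\n2500,B\n4000,A\n1200,C\n3000,B\n"

row_count = 0
grade_counts = Counter()
# chunksize=2 yields dataframes of at most 2 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    row_count += len(chunk)
    grade_counts.update(chunk["grade"].value_counts().to_dict())

print(row_count)
print(dict(grade_counts))
```

Only running aggregates are kept between chunks, so memory use is bounded by the chunk size rather than by the file size.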

#### **Loading the dataset in .gz format improves the situation but takes a fair amount of time**

In [None]:
%%time
accepted_url = "./lending-club/accepted_2007_to_2018Q4.csv.gz"
accepted_loans_df = pd.read_csv(accepted_url, compression='gzip',header=0, sep=',', quotechar='"')

In [None]:
%%time
rejected_url = "./lending-club/rejected_2007_to_2018Q4.csv.gz"
rejected_loans_df = pd.read_csv(rejected_url, compression='gzip',header=0, sep=',', quotechar='"')

#### The Accepted Cases dataset loaded in 17 minutes and the Rejected Cases dataset loaded in 1 hour on Google colab engine

### <font color="eb3480"> Since the data is difficult to load, can we quickly find the features of the data ?
</font>

In [None]:
!head -n 3 ./lending-club/accepted_2007_to_2018q4.csv/accepted_2007_to_2018Q4.csv>accepted_header.csv
!head -n 3 ./lending-club/rejected_2007_to_2018q4.csv/rejected_2007_to_2018Q4.csv>rejected_header.csv

### <font color='blue'>To quickly check for the features of the dataset, we can use Unix tools </font>
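
The same quick look at the header can be taken from Python with the standard `csv` module, which reads only as many lines as are requested. This is a minimal sketch: the helper name `peek_csv` and the sample rows are assumptions for illustration, with column names borrowed from the rejected cases file.

```python
import csv
import io
from itertools import islice

def peek_csv(handle, n_rows=2):
    """Return the header and the first n_rows data rows of an open CSV handle."""
    reader = csv.reader(handle)
    header = next(reader)
    rows = list(islice(reader, n_rows))
    return header, rows

# Hypothetical sample; with a real file use open(path, newline='') instead
demo = io.StringIO(
    "Amount Requested,Loan Title,Risk_Score\n"
    "1000.0,Wedding,693\n"
    "2000.0,Car,\n"
    "500.0,Other,600\n"
)
header, rows = peek_csv(demo)
print(header)
print(rows)
```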

#### **the features of the accepted cases dataset**

In [None]:
acceptedHead=pd.read_csv('accepted_header.csv').head(n=4)
acceptedHead

In [None]:
# Characteristics of the Data
data_type=acceptedHead.dtypes.reset_index()
data_type.columns=["count","column_type"]
data_type.groupby("column_type").aggregate("count").reset_index()

#### **the features of the rejected cases dataset**

In [None]:
rejectedHead=pd.read_csv('rejected_header.csv').head(n=4)
rejectedHead

In [None]:
# Characteristics of the Data
data_type=rejectedHead.dtypes.reset_index()
data_type.columns=["count","column_type"]
data_type.groupby("column_type").aggregate("count").reset_index()

#### <font color='blue'> The features of this dataset are explained at this [link](https://www.kaggle.com/jonchan2003/lending-club-data-dictionary)</font>

### <font color="eb3480">which features are common between accepted cases and rejected cases ?</font>

In [None]:
rejected_columns=rejectedHead.columns.to_list()
accepted_columns=acceptedHead.columns.to_list()
[column for column in accepted_columns if column in rejected_columns]

**common features in the 2 datasets**

In [None]:
common_features=pd.DataFrame({"accepted_headings":['loan_amnt','title','dti','zip_code','addr_state','emp_length'],
 "rejected_headings":['Amount Requested','Loan Title','Debt-To-Income Ratio','Zip Code','State','Employment Length']})
common_features

#### The feature headings of the rejected cases dataset and the accepted cases dataset are named differently, and the number of features differs between the two. It is possible that LendingClub follows a 2-step verification process for approving loans: a quicker set of features to initially reject loans, and then a more comprehensive set of features to approve the remaining loans. Also, several new features are monitored to track the status of the loans given. The rejected cases documented here outnumber the accepted cases by roughly 10:1

**For further analysis, a sample of the dataset can now be created and saved as a separate file. This will help in faster analysis**

**Advantages of sampling-**
* less time for analysis
* less consumption of resources like RAM,CPU,Disk space
* established techniques like simple random sampling, systematic sampling, stratified sampling, cluster sampling
* statistical tools like p-value, Power Analysis exist to determine the size of a sample

**But there can be disadvantages-**
* chance of bias
* sampling can be erroneous with imbalanced datasets

**A rule of thumb is that the sample should be at least 10 % of the dataset**
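
When the file is too large to load before sampling, a simple random sample can also be drawn while streaming over the rows. The sketch below uses reservoir sampling (Algorithm R), which is a swapped-in technique and not what this notebook does (the notebook calls `DataFrame.sample` on the loaded dataframe). The function name `reservoir_sample` and the toy input are assumptions for illustration.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Draw k items uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # fill the reservoir with the first k rows
            sample.append(row)
        else:
            # replace a reservoir slot with probability k / (i + 1)
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample

# Toy input: the integers 0..999 stand in for the rows of a large file
picked = reservoir_sample(range(1000), k=10, seed=42)
print(picked)
```

Unlike `sample(frac=...)`, this takes a fixed sample size `k`, so the 10 % rule of thumb would have to be translated into a row count first.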

In [None]:
accepted_loans_df_sample=accepted_loans_df.sample(frac=0.2)

In [None]:
rejected_loans_df_sample=rejected_loans_df.sample(frac=0.2)

In [None]:
accepted_loans_df_sample.to_csv('lending_club_dataset_accepted_sample.csv')

In [None]:
rejected_loans_df_sample.to_csv('rejected_loans_sample.csv')

#### We realize that the dataset is large and is difficult to load as a Pandas dataframe object. There is also a difference in the number of features of the 2 subsets of the data. There are 150 odd features, and features like Risk Score and Debt-to-income ratio need domain knowledge for better analysis. This might explain why this dataset has not been pursued as enthusiastically as the other datasets

### <font color='blue'>**Now let us see the kernels in Kaggle which have attempted to study this dataset** </font>

#### <font color='blue'> There are 17 kernels attributed directly to this dataset, but we found at least 230 odd kernels which have attempted to study the LendingClub data and we have compiled them. However, due to the nonspecificity of the search string, there may be some false entries or some genuine kernels missing </font>

In [None]:
# code to fetch Kaggle kernels and append to file
for i in range (1,18):
  !kaggle kernels list --sort-by 'voteCount' -p $i -s "lending-club"  -v >> 'kaggle_kernels_updated.csv'

In [None]:
# read the data in Pandas dataframe
# on_bad_lines='skip' (pandas >= 1.3) replaces the removed error_bad_lines=False
lending_club_kernels=pd.read_csv('kaggleKernelsUpdated.csv',encoding='utf-8',on_bad_lines='skip',parse_dates=True)

In [None]:
lending_club_kernels.shape

In [None]:
import datetime

In [None]:
# convert the text data for notebook run time to Python datetime object and extract the year
lending_club_kernels['lastRunTime']=lending_club_kernels.lastRunTime.map(lambda x:x[:10])
lending_club_kernels['lastRunTime']=pd.to_datetime(lending_club_kernels['lastRunTime'])
lending_club_kernels['lastRunTime']=lending_club_kernels['lastRunTime'].dt.year

In [None]:
lending_club_kernels['logVotes']=lending_club_kernels.totalVotes.map(lambda x:np.log(x)if(x>0) else 0)

In [None]:
lending_club_kernels.head(n=3)

#### D-Tale is the combination of a Flask back-end and a React front-end to bring you an easy way to view & analyze Pandas data structures. It integrates seamlessly with ipython notebooks & python/ipython terminals. Currently this tool supports such Pandas objects as DataFrame, Series, MultiIndex, DatetimeIndex & RangeIndex
-[D-Tale](https://pypi.org/project/dtale/#jupyterhub-w-jupyter-server-proxy)

On executing **dtale.show**, a separate browser window opens to show summary statistics of the features of the dataframe. This library allows one to execute EDA on a dataframe with 1 line of code !

In [None]:
!pip install dtale --upgrade --quiet

In [None]:
import dtale
import dtale.app as dtale_app

dtale_app.USE_COLAB = True

In [None]:
d=dtale.show(lending_club_kernels.sort_values('totalVotes',ascending=False).reset_index(drop=True))
d # click on the link to open the dataframe in the browser and view the statistics

**screenshot of the D-Tale output window**

![](https://drive.google.com/uc?export=view&id=1wIvJEkplbrv9Q-kaVgOZhuGlDbrfIUez)


In [None]:
# check the dtale instances running
dtale.instances()

In [None]:
# kill the dtale instances
d.kill()

### <font color="eb3480">What has been the level of interest in this dataset over a period of time?</font>

In [None]:
!pip install plotly matplotlib seaborn --quiet

In [None]:
import plotly.express as px
import plotly.graph_objects as go

In [None]:
fig = px.strip(data_frame=lending_club_kernels, x="lastRunTime", y="logVotes", title= "Kernels over time")
fig.update_traces(marker_size=4, marker_opacity=0.7)
fig.show()

In [None]:
fig=px.scatter(data_frame=lending_club_kernels,x="lastRunTime",y="logVotes",hover_name='author',color='author')
fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))

### <font color="blue"> The dataset has seen a fair amount of interest with 230 odd attempts. The highest interest in the kernels was in 2019, a period of financial crisis and even now attempts are being made to analyze this data</font>

********************************************************************************

#### <font>Let us download these kernels. After studying the features of the dataset, we will study these kernels and compile and review their data analysis to understand the patterns in the data. There is a valid reason for exploring these kernels while conducting an Exploratory Data Analysis. Independently analyzing the data amounts to **"Reinventing the Wheel"** and contributes to unnecessary redundancy. Rather, a review of the excellent efforts of the pioneers will both motivate and guide the future data science enthusiast</font>

********************************************************************************

In [None]:
import subprocess
import time

In [None]:
# example of downloading a kaggle kernel using API
var='kaggle kernels pull janiobachmann/lending-club-risk-analysis-and-metrics'
subprocess.run(var,text=True,shell=True)

In [None]:
# example of converting downloaded kaggle kernel to html
str_krnl='jupyter nbconvert --to html lending-club-risk-analysis-and-metrics.ipynb'
subprocess.run(str_krnl,text=True,shell=True)

In [None]:
# convert the kaggle reference entries to a list
lending_club_list=lending_club_kernels.ref.tolist()

In [None]:
# code to download the kernels
def notebook_html(html_list):    
    for kernel in html_list:
        time.sleep(1)
        var='kaggle kernels pull '+ str(kernel) # space is important after 'pull'
        #print(var)
        subprocess.run(var,text=True,shell=True)
        time.sleep(1)
        kernelBase=kernel.split('/')[1]+".ipynb"
        #print(kernelBase)
        str_krnl='jupyter nbconvert --to html '+ kernelBase # space is important after 'html'
        subprocess.run(str_krnl,shell=True,text=True) 
notebook_html(lending_club_list)        

In [None]:
!mkdir lending_notebooks

In [None]:
# linux commands to collect notebooks
!if [ ! -f "Jovian_EDA.ipynb" ]; then mv *.R *.ipynb *.sql ./lending_notebooks; fi
!tar -zcvf lendingClub_notebooks.tar.gz lending_notebooks/

In [None]:
# linux commands to zip html files 
!mkdir html
!mv *.html *.Rmd  *.irnb ./html
!tar -zcvf lendingClub_notebook_html.tar.gz html/

#### <font color='blue'> Due to the large size of the dataset, loading the data is a challenge and several of the top voted kernels have used a subsample of the dataset, which is unavailable now </font>

In [None]:
loadingData=pd.read_csv('loading.csv',encoding="utf8")
loadingData

![](https://drive.google.com/uc?export=view&id=1e3wyFaSy-oL1cu9550Qud225kASRL29G)

#### At present there are libraries such as Dask which offer parallel computing in Python and can run parallel dataframes, which helps in manipulating large datasets with fewer resources
[Link to the DASK website](https://docs.dask.org/en/latest/)

In [None]:
cd lending-club/accepted_2007_to_2018q4.csv/

In [None]:
!pip install cloudpickle --quiet

In [None]:
!pip install "dask[complete]" --quiet

In [None]:
import dask
import dask.dataframe as dd

In [None]:
from dask import delayed

In [None]:
%%time
accepted_df=dd.read_csv('accepted*',parse_dates=True,low_memory=False,dtype={'desc': 'object','id': 'object','sec_app_earliest_cr_line': 'object'})

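#### <font color="blue"> Creating the Accepted Cases dataset as a DASK dataframe returns in less than 200 milliseconds, compared with approximately 17 minutes for loading it with Pandas. Note that DASK evaluates lazily: read_csv only samples the file and builds a task graph, and the data is actually read when a result is requested, for example with head() or compute() </font>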

In [None]:
accepted_df.head()

**Let us check the missing values in the dataset**

In [None]:
%%time
null_value_accepted_df=accepted_df.isnull().sum().compute(low_memory=False)

In [None]:
null_value_accepted_df=pd.DataFrame(null_value_accepted_df)

In [None]:
null_value_accepted_df.columns=['NULL_VALUE']

In [None]:
total_records=len(accepted_df)

In [None]:
null_value_accepted_df.head(n=3)

In [None]:
# code to find the null values as an % of the total records
null_value_accepted_df['NULL_VALUE']= null_value_accepted_df.NULL_VALUE.map(lambda x:(x/total_records)*100)

In [None]:
fig, ax = plt.subplots(figsize=(15,20))
ax=plt.barh(null_value_accepted_df.index,null_value_accepted_df.NULL_VALUE);
plt.tight_layout()
plt.show()

### <font color="blue">Almost 44 features in the accepted cases have more than 50% of the records missing, thus making these features untenable for further analysis. The member_id would have been removed for privacy purposes.</font>

In [None]:
cd ../rejected_2007_to_2018q4.csv/

In [None]:
%%time
rejected_df=dd.read_csv('rejected*',parse_dates=True,low_memory=False)

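#### <font color="blue">Creating the Rejected Cases dataframe as a DASK dataframe took 37 milliseconds, compared with 1 hour for loading it with Pandas ! This measures only the lazy setup of the DASK dataframe; the rows are read when a computation is triggered</font>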

In [None]:
rejected_df.head()

In [None]:
%%time
null_value_rejected_df=rejected_df.isnull().sum().compute(low_memory=False) # raw null counts; converted to a % of total records in a later cell

In [None]:
null_value_rejected_df=pd.DataFrame(null_value_rejected_df)
null_value_rejected_df.columns=['NULL_VALUE']

In [None]:
total_records_rejected=len(rejected_df)

In [None]:
# code to find the null values as an % of the total records
null_value_rejected_df['NULL_VALUE']= null_value_rejected_df.NULL_VALUE.map(lambda x:(x/total_records_rejected)*100)

In [None]:
fig, ax = plt.subplots(figsize=(5,2))
ax=plt.barh(null_value_rejected_df.index,null_value_rejected_df.NULL_VALUE);
plt.tight_layout()
plt.show()

### <font color="blue"> A striking observation in this subset is the missing Risk_Score in 66% of the cases. If this data has not been redacted, then for 18 million cases the loan was rejected even without calculating a Risk Score. A cursory guess would be that Debt-to-Income Ratio and Employment Length would be the key parameters for rejection, but this needs to be confirmed </font>

#### <font color='blue'> The rule of thumb is that the percentage of missing values in a column should not exceed 30 %,beyond which these columns should be dropped. Let us apply the rule to the dataset and see the outcome</font>
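
On a small in-memory Pandas dataframe, the 30 % rule can be sketched as follows. The toy columns and values are hypothetical; `isnull().mean()` gives the fraction of missing values per column, which is then compared against the threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missing values per column
toy = pd.DataFrame({
    "loan_amnt": [1000, 2500, 4000, 1200],    # 0 % missing
    "desc": [None, None, None, "car"],        # 75 % missing
    "dti": [10.5, np.nan, 18.2, 7.9],         # 25 % missing
})

missing_fraction = toy.isnull().mean()        # fraction of nulls in each column
kept = toy.loc[:, missing_fraction < 0.3]     # keep columns under the 30 % threshold

print(missing_fraction.to_dict())
print(list(kept.columns))
```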

**accepted cases dataset**

In [None]:
%%time
# code to filter out columns with >30 % NA values (%%time has to be the first line of the cell)
lowNA_accept_df=accepted_df.loc[:,(accepted_df.isnull().sum().compute(low_memory=False)/len(accepted_df)<0.3)]

In [None]:
lowNA_accept_df.shape

#### <font color='blue'>58 features have been removed from the accepted cases data. That is a substantial loss of information. Hence, it is desirable to study individual features further before removing them completely based on general guidelines</font>

**rejected cases dataset**

In [None]:
%%time
# code to filter out columns with >30 % NA values (%%time has to be the first line of the cell)
lowNA_reject_df=rejected_df.loc[:,(rejected_df.isnull().sum().compute(low_memory=False)/len(rejected_df)<0.3)]

In [None]:
lowNA_reject_df.shape

#### <font color='blue'> One feature was removed for the rejected cases data subset</font>

### <font color="eb3480">What is the inter-relationship between the numeric features in the data ?</font>


In [None]:
%%time
# We have selected the numeric columns
numeric_accepted_df=accepted_df.select_dtypes(include=['float64'])

In [None]:
%%time
# The member ID is used for indexing and unnecessary for analysis
numeric_accepted_df=numeric_accepted_df.drop(['member_id'],axis=1)

In [None]:
%%time
# We are removing the rows in which all values are NA (how='all')
numeric_accepted_df=numeric_accepted_df.dropna(how='all')

In [None]:
%%time
# We are removing duplicate rows. By default only the first occurrence of each duplicate is retained
numeric_accepted_df=numeric_accepted_df.drop_duplicates()

In [None]:
numeric_accepted_df.head()

In [None]:
%%time
accepted_corr=numeric_accepted_df.corr().compute()

The system used a single DASK scheduler

### <font color="blue"> Correlation Plot </font>

<font color="blue"> A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.

-Tim Bock
</font>
[displayr](https://www.displayr.com/what-is-a-correlation-matrix/)
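
To make the definition concrete, here is a minimal pure-Python sketch of the Pearson coefficient and of a correlation matrix built from it. The notebook itself relies on `.corr()`; the function names and the toy columns below are assumptions for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation between columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy columns: the second rises with the first, the third falls
toy = {
    "loan_amnt": [1000, 2000, 3000, 4000],
    "installment": [35, 70, 105, 140],
    "score": [4, 3, 2, 1],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Each cell lies between -1 and 1, the diagonal is 1, and the matrix is symmetric, which is why the heatmap mirrors itself across the diagonal.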

In [None]:
accepted_corr.shape

In [None]:
plt.figure(figsize=(13,13))
sns.heatmap(accepted_corr,vmin=-1.0,vmax=1.0,cmap='coolwarm',robust=True,square=True,fmt="0.1f",cbar=True)
plt.title("Accepted Cases -Numeric Features Correlation")

### <font color="blue">While most features in the Accepted Cases Dataset show no correlation, some features show predominantly positive correlation</font>
### <font color="blue">The features prominently showing positive correlation are
* Number of accounts ever 120 or more days past due
* Number of currently active revolving trades
* Number of bankcard accounts
* Number of open revolving accounts</font>
### <font color="blue">All these features relate to the activity of credit products that the borrowers utilize and hence there is redundancy in this selection of features in this dataset </font>

### <font color="eb3480"> Does the sample size affect the outcome of the result ? We will address this key question by collecting subsamples of different sizes and analyzing the outcome</font>

**40% sample size**

In [None]:
num_accp_df_04=numeric_accepted_df.sample(frac=0.4)

In [None]:
accp_corr_04=num_accp_df_04.corr().compute()

In [None]:
# plt.figure(figsize=(13,13))
# sns.heatmap(accp_corr_04,vmin=-1.0,vmax=1.0,cmap='coolwarm',robust=True,square=True,fmt="0.1f",cbar=False)
# plt.title("Accepted Cases 40% sample size-Numeric Features Correlation")

**20 % sample size**

In [None]:
num_accp_df_02=numeric_accepted_df.sample(frac=0.2)

In [None]:
accp_corr_02=num_accp_df_02.corr().compute()

In [None]:
# plt.figure(figsize=(13,13))
# sns.heatmap(accp_corr_02,vmin=-1.0,vmax=1.0,cmap='coolwarm',robust=True,square=True,fmt="0.1f",cbar=False)
# plt.title("Accepted Cases 20% sample size-Numeric Features Correlation")

**10% sample size**

In [None]:
num_accp_df_01=numeric_accepted_df.sample(frac=0.1)

In [None]:
accp_corr_01=num_accp_df_01.corr().compute()

In [None]:
# plt.figure(figsize=(13,13))
# sns.heatmap(accp_corr_01,vmin=-1.0,vmax=1.0,cmap='coolwarm',robust=True,square=True,fmt="0.1f",cbar=False)
# plt.title("Accepted Cases 10% sample size-Numeric Features Correlation")

**5% sample size**

In [None]:
num_accp_df_005=numeric_accepted_df.sample(frac=0.05)

In [None]:
accp_corr_005=num_accp_df_005.corr().compute()

In [None]:
# plt.figure(figsize=(13,13))
# sns.heatmap(accp_corr_005,vmin=-1.0,vmax=1.0,cmap='coolwarm',robust=True,square=True,fmt="0.1f",cbar=False)
# plt.title("Accepted Cases 5% sample size-Numeric Features Correlation")

![](https://drive.google.com/uc?export=view&id=1AWk9LZNLd8pnNDwQsiuL9HQraMRqakbB)
### <font color="blue"> We have run correlation plots with different sample fractions to detect whether the results change with the sample size. From the 10% sample size onwards, in a few cases (earmarked by arrows), there are differences in the correlation. However, the overall relationships are robust even at 5 % of the total data size </font>
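
The visual comparison can also be expressed as a number: the largest absolute difference between the correlation matrix of a subsample and that of the full data. The sketch below uses synthetic data generated with NumPy, not the LendingClub data, so the values it prints say nothing about this dataset; the column names and the seed are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 'b' is correlated with 'a', 'c' is independent noise
rng = np.random.default_rng(0)
n = 20_000
a = rng.normal(size=n)
full = pd.DataFrame({
    "a": a,
    "b": 0.8 * a + 0.6 * rng.normal(size=n),
    "c": rng.normal(size=n),
})
full_corr = full.corr()

max_abs_diff = {}
for frac in (0.4, 0.2, 0.1, 0.05):
    sub_corr = full.sample(frac=frac, random_state=0).corr()
    # largest cell-wise deviation from the full-data correlation matrix
    max_abs_diff[frac] = float((sub_corr - full_corr).abs().to_numpy().max())

for frac, diff in max_abs_diff.items():
    print(f"{frac:.0%} sample: max |difference| = {diff:.3f}")
```

A small maximum difference at every fraction would support working with the smaller sample.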

![](https://drive.google.com/uc?export=view&id=1cxi1mKrmHIm9YRZN6_wS3MIBInhcT2QQ)


**Let us run a similar analysis for the Rejected Cases dataset**

In [None]:
rejected_df=rejected_df.dropna(how='any')

In [None]:
rejected_df=rejected_df.drop_duplicates()

**based on our analysis we can safely take a 20% sample for the correlation**

In [None]:
sample_rejected_df=rejected_df.sample(frac=0.2)

In [None]:
sample_rejected_df=sample_rejected_df.compute()

In [None]:
# converting the string values of the feature to numerical values
sample_rejected_df['Debt-To-Income Ratio']=sample_rejected_df['Debt-To-Income Ratio'].map(lambda x: float(x[:-1]))

In [None]:
sample_rejct_corr_df=sample_rejected_df.loc[:,['Amount Requested','Risk_Score','Debt-To-Income Ratio']]

In [None]:
#correlation
sample_rejct_corr_df.corr()

### <font color="eb3480">Since Risk_Score was missing in 66 % of the samples in the rejected dataset, what is the correlation of this feature with other features ?</font>

In [None]:
sns.heatmap(sample_rejct_corr_df.corr(),vmin=-1.0,vmax=1.0,cmap='coolwarm',robust=True,square=True,fmt="0.1f",cbar=True)

### <font color='blue'> In the rejected cases dataset, there is a poor correlation between Risk_Score and the other numerical features, which belies our earlier assumption. This underscores the need for careful evaluation of the data</font>

**we can now work with the 20% sample**

In [None]:
num_accp_df_02=numeric_accepted_df.sample(frac=0.2)

In [None]:
%%time
num_accp_df_02=num_accp_df_02.compute()

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
scaled_num_accp_df_02=scaler.fit_transform(num_accp_df_02)

In [None]:
scaled_num_accp_df_02=pd.DataFrame(scaled_num_accp_df_02)

In [None]:
scaled_num_accp_df_02.columns=num_accp_df_02.columns

### <font color="eb3480">What is the pattern of distribution of the features in the Accepted Cases dataset ?</font>


#### Let us use D-Tale library to analyze the features in the Accepted Cases data subset

In [None]:
dtale_app.USE_COLAB = True

n=dtale.show(scaled_num_accp_df_02)
n # click on the link to open the dataframe in the browser and view the statistics

In [None]:
# check the dtale instances running
dtale.instances()

In [None]:
# kill the dtale instances
n.kill()

#### <font color='blue'> Many of the features show a positively skewed distribution with extremely long tails. This skewness leads to a situation similar to class imbalance, where the majority of samples have values in a narrow range while a small number of samples are spread over a wide range. This calls for choosing proper metrics like sensitivity and specificity for predicting the less abundant classes, otherwise these samples tend to get misclassified </font>
[check out this article by Faizan Ahemad](https://towardsdatascience.com/selecting-the-right-metric-for-skewed-classification-problems-6e0a4a6167a7)
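
Skewness can also be quantified rather than read off a histogram. The sketch below computes the moment-based (Fisher-Pearson) skewness coefficient in plain Python; the function name and the two toy samples are assumptions for illustration, and a positive value indicates a long right tail.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]            # evenly spread around the mean
long_tail = [1, 1, 1, 2, 2, 3, 50]     # one large value creates a right tail

skew_symmetric = skewness(symmetric)
skew_long_tail = skewness(long_tail)
print(skew_symmetric, skew_long_tail)
```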

### <font color="eb3480">What is the variability of the features in the Accepted Cases dataset ?</font>


In [None]:
plt.figure(figsize=(20,8))
s=sns.boxplot(data=scaled_num_accp_df_02);
s.set_xticklabels(scaled_num_accp_df_02.columns,rotation=90);
plt.title("variation in the features of Accepted Cases dataset",fontsize=20);

### <font color='blue'>The dataset has variability in each of its numerical features. This attribute will help in choosing the features for further analysis</font>

### <font color='blue'>Getting back to the kernels, let us extract the text-based inferences from the kernels for further analysis. Our own analysis shows that the dataset has missing values and outliers in several features </font>

In [None]:
# unzipping the previously obtained notebooks in html format
!tar -xvf lendingClub_notebook_html.tar.gz

In [None]:
import requests
from bs4 import BeautifulSoup as bs

In [None]:
import os
os.chdir('./html')

In [None]:
html_file_list=os.listdir()

In [None]:
# function to extract the text of the markdown cells from a notebook exported to html
def html_text(file):
  remark_list=[]
  if file is None:
    print('no content')
    return remark_list
  try:
    if 'html' in file:
      print(file)
      with open(file,'rb') as f:
        html=f.read()
      # name the parser explicitly to avoid the BeautifulSoup warning
      soup=bs(html,'html.parser')
      print(soup.title.text)
      remark=soup.body.div.find_all(class_="text_cell_render border-box-sizing rendered_html",recursive=True)
      remark_list.append(remark)
  except AttributeError:
    # raised when a page has no title or body
    print('attribute not found')
  return remark_list

In [None]:
# the extracted text is saved in the file "eda_inference.txt"
with open('eda_inference.txt','a') as fh:
  for file in html_file_list:
    remark_list=html_text(file)
    for tags in remark_list:
      for tag in tags:
        fh.write(tag.get_text())

In [None]:
# the edited html files are saved
!tar cfvz revised_lendingClub_notebook_html.tar.gz html/


### Natural language processing (NLP) is a field that focuses on making natural human language usable by computer programs. NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP
-Joanna Jablonski

www.realpython.com
[source](https://realpython.com/nltk-nlp-python/)

### we will use this library to parse the text extracted from the LendingClub analysis notebooks to understand the insights gained in these kernels and summarize our exploration

In [None]:
!pip install nltk==3.5 --quiet

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
#load the extracted text
with open('eda_inference.txt','r') as fh:
  eda_txt=fh.read()

In [None]:
eda_txt[:200]

**This code removes the punctuation from the text**

In [None]:
# import string punctuation
import string
string.punctuation

In [None]:
# many feature names contain underscores, so the underscore is left out of the punctuation filter
revised_string_punctuation=string.punctuation.replace('_','')

In [None]:
no_punct_txt=[charac for charac in eda_txt.replace('\n',' ') if charac not in revised_string_punctuation]

In [None]:
no_punct_txt = ''.join(no_punct_txt)

In [None]:
no_punct_txt[:200]

**tokenization of text**

In [None]:
import nltk
nltk.download('punkt')

In [None]:
tokenized_sentences=sent_tokenize(no_punct_txt)

In [None]:
tokenized_words=word_tokenize(no_punct_txt)

In [None]:
nltk.download("stopwords")
from nltk.corpus import stopwords

**removal of stop words**

In [None]:
stop_words = set(stopwords.words("english"))

In [None]:
filtered_eda_words = [word for word in tokenized_words if word.casefold() not in stop_words]

#### <font color="blue"> using the tools of NLTK, we have created a list of words for further analysis</font>
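
The cleaning steps can be tried on a single sentence without any downloads. This is only a sketch: `demo_stop_words` is a hypothetical mini stop list and `str.split` stands in for NLTK's `word_tokenize`, so the result will not match NLTK's output exactly on real text.

```python
import string

# Keep underscores so feature names such as loan_amnt survive the cleaning
punct_to_strip = string.punctuation.replace("_", "")

# Hypothetical mini stop list; the notebook itself uses NLTK's English stop words
demo_stop_words = {"the", "is", "of", "and", "a"}

def clean_tokens(text, stop_words):
    """Strip punctuation, split on whitespace and drop stop words."""
    no_punct = text.translate(str.maketrans("", "", punct_to_strip))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok.casefold() not in stop_words]

sample = "The loan_amnt is skewed, and the int_rate of Grade-A loans is low."
demo_tokens = clean_tokens(sample, demo_stop_words)
print(demo_tokens)
```

Note that stripping the hyphen fuses "Grade-A" into a single token, a side effect that also applies to hyphenated words in the kernel text.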

### <font color="eb3480">What is the distribution of words in the text of notebooks which analyzed Lending Club data ?</font>

In [None]:
from nltk import FreqDist

In [None]:
tokenized_freq_dist = FreqDist(filtered_eda_words)

In [None]:
print(tokenized_freq_dist)

#### <font color="blue"> The text extracted from the Jupyter notebooks contains 14540 distinct words</font>

#### <font color="eb3480"> Which features from the Accepted cases dataset were most analyzed ? </font>

In [None]:
term_keys=tokenized_freq_dist.keys()
term_occurence=tokenized_freq_dist.values()
tokenized_freq_dist=pd.DataFrame({"term":term_keys,"frequency":term_occurence})

In [None]:
# feature names of the Accepted Cases Dataset
accepted_cols=pd.read_csv("accepted_header.csv").columns

In [None]:
# match the features of Accepted Cases with the terms found from the text
acp=tokenized_freq_dist[tokenized_freq_dist.term.isin(accepted_cols)] 

In [None]:
acp_list=acp.term.tolist()

In [None]:
from wordcloud import WordCloud, STOPWORDS
from PIL import Image

In [None]:
token_string=" ".join(acp_list)

In [None]:
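# weight each matched feature by its frequency in the text, so that word size reflects how often it occurs
wc=WordCloud(background_color=(255,241,241)).generate_from_frequencies(dict(zip(acp.term,acp.frequency)))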

In [None]:
plt.figure(figsize=(10,10))
plt.axis("off")
plt.imshow(wc, interpolation="bilinear")
plt.title('Most frequently found features of Accepted cases')
plt.show()

#### <font color="blue">Voilà! Our effort of following the pioneers pays off. As we can see, interest rate, loan status, loan amount, verification status, funded amount and employment title are the most studied features. It would be instructive to study these features further </font>

#### <font color="eb3480"> Which features from the Rejected cases dataset were most analyzed ? </font>

In [None]:
# feature names of the Rejected Cases Dataset
rejected_cols=pd.read_csv("rejected_header.csv").columns

In [None]:
# match the features of Rejected Cases with the terms found from the text
rcp=tokenized_freq_dist[tokenized_freq_dist.term.isin(rejected_cols)] 

In [None]:
# only the term state is present in the analysis
rcp

#### <font color="blue"> We were expecting Risk_Score to figure in the analysis. Its absence suggests that the kernels dropped this feature, possibly owing to its 66% missing values</font> 
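
One caveat is worth checking before reading too much into a missing term: `isin` compares strings exactly, so a column written as `Risk_Score` will not match a token written as `risk_score`, and multi-word column names can never match a single token. The sketch below shows one possible normalisation. The token list is hypothetical, and the `State` column name is an assumption based on the term reported above.

```python
import re

def normalise(name):
    """Lower-case a name and collapse spaces, hyphens and underscores into '_'."""
    return re.sub(r"[\s\-_]+", "_", name.strip()).casefold()

# Rejected-case column names used in this notebook ("State" is assumed)
demo_rejected_cols = ["Amount Requested", "Debt-To-Income Ratio",
                      "Employment Length", "State", "Risk_Score"]

# Hypothetical tokens standing in for the words extracted from the kernels
demo_terms = ["risk_score", "state", "State", "dti", "amount"]

normalised_cols = {normalise(col): col for col in demo_rejected_cols}
matches = sorted({normalised_cols[normalise(term)]
                  for term in demo_terms
                  if normalise(term) in normalised_cols})
print(matches)
```

If a normalised match of this kind also finds nothing in the real text, the conclusion that the feature was not discussed stands on firmer ground.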

In [None]:
tokenized_freq_dist.shape

In [None]:
plt.hist(tokenized_freq_dist.frequency,bins=40,range=(0,30),density=True);

#### <font color='blue'> The text analysis shows 14540 unique words across all the kernels. The frequency distribution above shows that only a few words are used frequently, while the majority of the words occur rarely. The large proportion of rarely used words suggests that most kernels have attempted to provide unique insights, and prospective data scientists should study these kernels keenly</font>
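
The statement about rare words can be quantified by counting the words that occur exactly once. A minimal sketch follows, using `collections.Counter` as a stand-in for `FreqDist` and a hypothetical token list in place of `filtered_eda_words`.

```python
from collections import Counter

# Hypothetical token list standing in for filtered_eda_words
demo_words = ["loan", "loan", "loan", "interest", "interest",
              "grade", "dti", "outlier", "purpose", "term"]

counts = Counter(demo_words)
n_distinct = len(counts)

# Words that occur exactly once in the text
n_once = sum(1 for c in counts.values() if c == 1)
share_once = n_once / n_distinct

# Share of all tokens taken up by the single most frequent word
token_share_top = counts.most_common(1)[0][1] / sum(counts.values())
print(n_distinct, n_once, share_once, token_share_top)
```

The same two ratios computed on the real frequency table would put a number on the shape of the histogram above.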

In [None]:
# divide by the total once instead of recomputing the sum for every row
tokenized_freq_dist['freq_ratio']=tokenized_freq_dist.frequency/tokenized_freq_dist.frequency.sum()

In [None]:
top_001_freq_dist=tokenized_freq_dist[tokenized_freq_dist.freq_ratio>0.001].sort_values('freq_ratio',ascending=False)

#### <font color="eb3480">What do the frequently used words in the kernels tell of the dataset ? </font>

In [None]:
plt.figure(figsize=(12,16))
fq=sns.barplot(data=top_001_freq_dist,x="frequency",y="term",orient="h");
plt.title("most frequent words used in the kernels",fontsize=20);
plt.tight_layout()
plt.show()

#### <font color="blue"> The words "good" and "bad" among the frequently used words suggest that the authors made a clear effort to study features which can distinguish the accepted cases from the rejected cases. Efforts have also been made to study the categorical features, outliers, missing values and the correlation between numerical values. Most authors have focused on the interest rate, the amount loaned, the duration of the loan, the income of the borrower, and the grade and status of the loan. Interestingly, authors have also commented on the risk of the loans, a feature we could not trace lexically earlier</font> 

#### <font color="eb3480">Can we learn something from the most frequent phrases ?</font>

#### n-grams are contiguous sequences of n words taken from a text.
* "hello" - unigram
* "hello there" - bigram
* "hello there Jovian" - trigram

 n-grams are used for auto-completion, spell checking and for finding the importance of phrases with the tf-idf algorithm.
 Here we will use bigrams, the most prevalent n-gram model, to study patterns in the text
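
As an illustration of what an n-gram is, the sketch below builds them with plain Python. `make_ngrams` is a hypothetical helper written for this example, not an NLTK function, and the token list is made up.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical tokens standing in for the filtered words of the kernels
demo_tokens = ["interest", "rate", "loan", "amount", "interest", "rate"]

demo_bigrams = make_ngrams(demo_tokens, 2)
bigram_counts = Counter(demo_bigrams)
print(bigram_counts.most_common(1))
```

Each bigram is a tuple of two neighbouring words, so counting the tuples gives the most frequent phrases directly.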

In [None]:
bigrams_eda_text=pd.Series(nltk.ngrams(filtered_eda_words,2))

In [None]:
bigrams_eda_text.name='bigrams'

In [None]:
plt.figure(figsize=(12,16))
fx=bigrams_eda_text.value_counts(ascending=False,sort=True).head(n=50).plot(kind="barh");
plt.title("most frequent bigrams used in the kernels",fontsize=20);
plt.tight_layout()
plt.show()

#### <font color ="blue">The bigrams lead to inferences similar to those of the word analysis. Interest rate, loan amount, income, loan status and paid status are important
  </font>

### <font color="blue">Conclusion</font>
#### <font color="blue"> Finally, we wind up our analysis. We studied the dataset and the kernels which addressed it, and understood the pertinent features of the dataset. Here is the [link](https://drive.google.com/file/d/1wbJar7kN8vxbzjZy9Q0vPNEy9NFu450u/view?usp=sharing) to the extracted text and the [link](https://drive.google.com/file/d/15t3PD-BxVnjhHQVoQMpv9kFHE8wb4AEv/view?usp=sharing) to the html version of the Kaggle notebooks. Data scientists are strongly urged to read the kernels before attempting Machine Learning approaches. The dataset is large, but it can be conveniently manipulated as a Dask dataframe. Using a graphical, web-page-based EDA tool like D-Tale greatly helped in analyzing the dataset and the distribution of the data. As a final step, let us merge the Accepted cases and Rejected cases datasets on their common features and analyze the distribution among the 2 subtypes</font>

In [None]:
import jovian
jovian.commit(project="jovian-eda") 

#### <font color="eb3480">Can we discriminate between accepted cases and rejected cases based on common features of the 2 data subsets ?</font>

In [None]:
# import the Regular Expression library
import re

#### Now we begin a series of data remodelling to collect all relevant features

##### First we handle the accepted cases sub dataset

In [None]:
# collect a 10% sample of the dataset
accpt_df_plt=accepted_df.sample(frac=0.1)
# select the relevant columns
a=accpt_df_plt[['loan_amnt','dti','emp_length']]
# convert the DASK dataframe to Pandas dataframe
accpt_smpl=a.compute()

In [None]:
accpt_smpl.shape

In [None]:
# convert the data to csv file for subsequent easier loading
accpt_smpl.to_csv("accpt_smpl.csv",header=True)

#####  we will carry out similar operations with the rejected cases sub dataset

In [None]:
rejct_df_plt=rejected_df.sample(frac=0.01)
b=rejct_df_plt[['Amount Requested','Debt-To-Income Ratio','Employment Length']]
rejct_smpl=b.compute()

In [None]:
rejct_smpl.to_csv("rejct_smpl.csv",header=True)

In [None]:
rejct_smpl.shape

##### reloading accepted cases data from csv file

In [None]:
a=pd.read_csv('accpt_smpl.csv')
a=a[['loan_amnt','dti','emp_length']]
a.head(n=2)

In [None]:
# function to extract numbers from string
def txtNum(item):
  tn=re.findall('[0-9]+', str(item))
  if tn:
    return(int(tn[0]))
  else:
    return int(0)
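
It helps to see what this kind of extraction returns for typical raw values. The sketch below repeats the logic of `txtNum` under the hypothetical name `first_int`, applied to made-up examples of the strings found in the employment length and dti columns.

```python
import math
import re

def first_int(item):
    """Same logic as txtNum: the first run of digits, or 0 if there is none."""
    found = re.findall(r"[0-9]+", str(item))
    return int(found[0]) if found else 0

# Hypothetical raw values of the kind found in the emp_length and dti columns
examples = ["10+ years", "< 1 year", "3 years", 18.75, "426000%", math.nan]
converted = [first_int(value) for value in examples]
print(converted)
```

Three side effects are visible here: "< 1 year" becomes 1, decimals are truncated to their integer part, and missing values become 0, so they can no longer be told apart from genuine zeros or removed with `dropna`.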

In [None]:
# update dataframe with integer values for emp_length and dti
a['emp_length']=a.emp_length.map(txtNum)
a['dti']=a.dti.map(txtNum)
# add a feature to distinguish accepted cases
a['status']="yes"
a.dropna(how="any",inplace=True)

##### A quick and dirty visualization of the individual features

In [None]:
sns.boxplot(x=a["emp_length"], y=a["status"]);
plt.title('accepted cases')

In [None]:
sns.boxplot(x=a["dti"], y=a["status"]);

##### reloading rejected cases data from csv file

In [None]:
b=pd.read_csv('rejct_smpl.csv')
b=b[['Amount Requested','Debt-To-Income Ratio','Employment Length']]
b.columns=['loan_amnt','dti','emp_length']
b.head(n=2)

In [None]:
import math

In [None]:
# update dataframe with integer values for emp_length and dti
b['emp_length']=b.emp_length.map(txtNum)
b['dti']=b.dti.map(txtNum)
# add a feature to distinguish rejected cases
b['status']="no"
b.dropna(how="all",inplace=True)

In [None]:
sns.boxplot(x=b["emp_length"], y=b["status"]);
plt.title('rejected cases')

In [None]:
sns.boxplot(x=b["dti"], y=b["status"]);

### <font color="blue">Pitfalls in Data
In the Rejected cases dataset one dti value is exceptionally high. We check the validity of this observation</font>

In [None]:
# the maximum value of dti when the values are compared as strings (lexicographic order)
b.dti.astype(str).max()

In [None]:
# the maximum value of dti with dti as numeric object
b.dti.max()

In [None]:
bdti=b.dti.to_list()

In [None]:
# there is an entry for 426000 in the rejected cases dataset
# str() lets the pattern work whether the values are still raw strings or already integers
[val for val in bdti if re.match(r'4260+%?$',str(val))]

#### <font color='blue'> There are outliers in the dti values in both the accepted cases and the rejected cases. The value of 426000 in the rejected cases is an extreme case</font>

In [None]:
fig,(axs1,axs2)=plt.subplots(nrows=1,ncols=2)
axs1.hist(a.emp_length);
axs1.set_title('Accepted')
axs2.hist(b.emp_length);
axs2.set_title('Rejected')
plt.tight_layout()
plt.show()

#### <font color='blue'> The distribution of length of service clearly differs between the *accepted cases* and the *rejected cases*. Most candidates whose loans were rejected had a service life of less than 1 year, while most accepted cases had a service life of more than 1 year</font>


In [None]:
# numeric description of accepted cases with outliers
a.describe()

In [None]:
# numeric description of rejected cases with outliers
b.describe()

#### <font color='blue'> This code, which removes values lying more than 1.5 times the interquartile range beyond the quartiles, is adapted from @saurabh48782 at www.geeksforgeeks.org </font>
[link for code](https://www.geeksforgeeks.org/how-to-use-pandas-filter-with-iqr/)
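
Before applying the rule to the samples, a toy check of the 1.5 × IQR fences may help. The data below are hypothetical. The sketch filters whole rows with a boolean mask, which is one way of keeping the other columns paired with the surviving values.

```python
import pandas as pd

# Hypothetical toy data: one extreme dti value among otherwise ordinary rows
toy = pd.DataFrame({
    "dti": [10, 12, 14, 16, 18, 20, 22, 426000],
    "loan_amnt": [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000],
})

q1 = toy["dti"].quantile(0.25)
q3 = toy["dti"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# The mask keeps or drops entire rows, so dti and loan_amnt stay aligned
kept = toy[(toy["dti"] > lower) & (toy["dti"] < upper)]
print(lower, upper, len(kept))
```

Because the quartiles barely move when one value is extreme, the fences stay close to the bulk of the data and only the extreme row is dropped.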

In [None]:
# Code for removing the outliers
def removeOutliers(data, col):
  print(data.shape)
  # pandas quantile skips missing values
  Q3 = data[col].quantile(0.75)
  Q1 = data[col].quantile(0.25)
  IQR = Q3-Q1

  print("IQR value for column %s is: %s" % (col, IQR))
  lower_range = Q1 - 1.5 * IQR
  print(f"lower range {lower_range}")
  upper_range = Q3 + 1.5 * IQR
  print(f"upper range {upper_range}")
  # keep whole rows whose value lies inside the fences; assigning a re-indexed
  # list back to the column would misalign it with the other columns
  filtered_data = data.loc[(data[col] > lower_range) & (data[col] < upper_range)].copy()
  return filtered_data

In [None]:
a=removeOutliers(a,'dti')
a=removeOutliers(a,'loan_amnt')
a=removeOutliers(a,'emp_length')

In [None]:
#  outliers are removed in accepted samples
a.describe()

In [None]:
# removing outliers in rejected samples
b=removeOutliers(b,'loan_amnt')
b=removeOutliers(b,'dti')
b['emp_length']=b.emp_length.map(lambda x:np.float64(x))
b.dropna(how="any",inplace=True)

In [None]:
# outliers are removed from the rejected cases dataset 
b.describe()

In [None]:
accep_emp_counts=a.emp_length.value_counts().items()

In [None]:
emplen=[]
lencnt=[]
for idx,cnt in accep_emp_counts:
  emplen.append(int(idx))
  lencnt.append(cnt)

In [None]:
# create a dataframe of the number of employees per year of employment
accep_emp_df=pd.DataFrame({"emp_length":np.float64(emplen),"emp_count":lencnt})

In [None]:
# calculate the debt-to-income ratio and loan amount for each aggregated years of employment
accep_smpldf=a.groupby('emp_length').agg({'dti': 'mean', 'loan_amnt': 'mean'}).reset_index(level=0)

In [None]:
accep_smpldf=pd.merge(accep_smpldf,accep_emp_df,left_on="emp_length",right_on="emp_length")

In [None]:
accep_smpldf["status"]="accept"

In [None]:
accep_smpldf

In [None]:
accep_smpldf.to_csv("accepted_composite.csv",header=True)

##### Similarly we proceed to aggregate the employees per number of years employed for the rejected cases

In [None]:
rejec_emp_counts=b.emp_length.value_counts().items()

In [None]:
emplen=[]
lencnt=[]
for idx,cnt in rejec_emp_counts:
  emplen.append(int(idx))
  lencnt.append(cnt)

In [None]:
rejec_emp_df=pd.DataFrame({"emp_length":emplen,"emp_count":lencnt})

In [None]:
rejec_smpldf=b.groupby('emp_length').agg({'dti': 'mean', 'loan_amnt': 'mean'}).reset_index(level=0)

In [None]:
rejec_df_emplen=pd.merge(rejec_smpldf,rejec_emp_df,left_on='emp_length',right_on='emp_length')

In [None]:
rejec_df_emplen["status"]="reject"

In [None]:
rejec_df_emplen

In [None]:
rejec_df_emplen.to_csv("rejected_composite.csv",header=True)

#### <font color="blue">A merged dataset is created from the common numeric values from Accepted Cases and Rejected Cases datasets</font>

In [None]:
merged_accept_reject_df=pd.concat([a,b],ignore_index=True,axis=0,verify_integrity=True)

In [None]:
merged_accept_reject_df.head(n=3)

In [None]:
merged_accept_reject_df.tail(n=3)

In [None]:
fig,axs=plt.subplots(figsize=(18,10))
plt.xticks(rotation=70)
sns.set(font_scale=1)
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
ax=sns.scatterplot(x='loan_amnt',y='dti',data=merged_accept_reject_df,hue='status',size='emp_length',sizes=(40,300),alpha=0.4,style="status")
ax.set_xlabel('Loan Amount',fontsize=18)
ax.set_ylabel('Debt-to-Income Ratio',fontsize=18)
plt.tight_layout()
plt.title('Relationship of features for granting loans')
plt.show()

#### <font color="blue">The successful cases have a low Debt-to-Income Ratio (dti) and longer years of service. Interestingly, these applicants also requested larger loans. The individuals who failed to obtain loans usually applied for smaller amounts, and yet their applications were rejected. However, a close look shows exceptions where individuals with low dti and longer years of service failed to secure loans of even less than $10,000. Were they defaulters? Only more data can tell</font>

#### <font color="eb3480">Debt-to-Income Ratio and period of employment determine loan sanctions, but can we improve the insight?</font>

In [None]:
emp_accp_df=pd.read_csv('accepted_composite.csv')
emp_rejc_df=pd.read_csv('rejected_composite.csv')

In [None]:
emp_accp_df=emp_accp_df.drop('Unnamed: 0',axis=1)

In [None]:
emp_rejc_df=emp_rejc_df.drop('Unnamed: 0',axis=1)

In [None]:
emp_accp_df

In [None]:
emp_rejc_df

#### <font color="blue"> We have calculated the mean loan amount and the mean Debt-to-Income Ratio (dti) for each length of service. Averaging within each group smooths the values, so the wide variation seen within each status is reduced</font>
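
For reference, the per-group means and the number of applicants per group can also be obtained in a single step with pandas named aggregation. This is a sketch on hypothetical rows, not the notebook's actual sample.

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned sample of accepted cases
toy = pd.DataFrame({
    "emp_length": [1, 1, 1, 10, 10],
    "dti": [10.0, 20.0, 30.0, 5.0, 15.0],
    "loan_amnt": [1000, 2000, 3000, 8000, 12000],
})

# One groupby yields the mean dti, the mean loan amount and the group size
summary = (
    toy.groupby("emp_length")
       .agg(dti=("dti", "mean"),
            loan_amnt=("loan_amnt", "mean"),
            emp_count=("dti", "size"))
       .reset_index()
)
print(summary)
```

Each row of `summary` describes one length of service, which is the shape of the composite tables loaded above.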

In [None]:
master_df=pd.concat([emp_accp_df,emp_rejc_df],ignore_index=True,axis=0,verify_integrity=True)

In [None]:
import plotly.express as px
fig=px.scatter(data_frame=master_df,x="emp_count",y="dti",hover_name='emp_length',color='status',size="emp_length");
fig.update_traces(marker=dict(line=dict(width=2,color='DarkSlateGrey')),selector=dict(mode='markers'))
fig.update_layout(
    title="Factors determining Loan Approval",
    xaxis_title="Employees per years in employment",
    yaxis_title="Debt-to-income Ratio",
    legend_title="Status",
    font=dict(
        family="Courier New, monospace",
        size=18,
        color="RebeccaPurple"
    )
)


************************************

### <font color="blue">Finally, our efforts have paid off. We can see that when the loan applicants are categorized by period of employment, there is a clear distinction between the accepted cases and the rejected cases based on debt-to-income ratio. In all probability, LendingClub could be following this metric. Of course, all this needs verification. But we could handle a large dataset and, with the help of previous kernels on this topic, gain insights into it. We sincerely hope this notebook will help future data science enthusiasts </font>


#### <font color="blue">Anubrata Das</font>


#### Several intermediate csv files were generated. Here is the [link](https://drive.google.com/file/d/1T71Kj6TxFZ7ljm3AFvkzReOSHPcb4LG0/view?usp=sharing) to these files