# Analysing the Lending Club data for fraud
Specifically, this investigation looks for similar profiles that may belong to a single individual who applies fraudulently under different profiles in order to change their credit rating and obtain better interest rates.

The idea for the investigation comes from this Bloomberg article:
https://www.bloomberg.com/news/features/2016-08-18/how-lending-club-s-biggest-fanboy-uncovered-shady-loans
In it, an individual found that similar profiles were rampant on the site. This led to considerable bad publicity for the service and "dethroned" its CEO, primarily because of his inaction and his failure to reveal the matter to stakeholders.


One unique aspect of this "eBay for loans" is that it publishes detailed (albeit anonymised) information about its loans, which can be readily downloaded from the website.

In this investigation I will use the scikit-learn package to find similar profiles in the .csv files published on the website. The main tool is the k-nearest-neighbours algorithm. Each feature, such as the loan recipient's declared income, zip code, employment title and loan purpose, is treated as one dimension in the nearest-neighbour search.

In [39]:
# Import the packages used in this notebook
from sklearn import preprocessing
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

I will use pandas to load the necessary columns of the CSV file into memory. Note that the file has several thousand entries and more than 100 columns, so it is wise to import only the columns that are needed, to avoid running out of memory.
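Before committing to a column list, it can help to read only the header row and check which of the wanted columns are actually present. The sketch below uses a small in-memory stand-in for the real file (the junk first line, the rows and the `wanted` list are made up for illustration); `nrows=0` asks pandas to parse the header without loading any data rows.

```python
import io
import pandas as pd

# Toy stand-in for the downloaded file: one junk line, a header, then data rows.
# The contents are invented for illustration only.
raw = io.StringIO(
    "Notes offered by Prospectus\n"
    "id,home_ownership,zip_code,annual_inc,purpose\n"
    "1,RENT,937xx,102000,debt_consolidation\n"
    "2,OWN,100xx,55000,car\n"
)

# nrows=0 parses only the header, so no data rows are held in memory.
header = pd.read_csv(raw, skiprows=1, nrows=0)
available = list(header.columns)

# Keep only the wanted columns that actually exist in the file.
wanted = ["home_ownership", "zip_code", "annual_inc", "emp_title"]
usecols = [c for c in wanted if c in available]
print(usecols)
```

The resulting list can then be passed as `usecols` to the full `pd.read_csv` call.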

In [135]:
df = pd.read_csv("All raw data/LoanStats3d.csv",skiprows=1,
                 usecols=["home_ownership","zip_code","purpose","emp_title"
                          ,"emp_length","annual_inc","int_rate","installment",
                          "grade","sub_grade"],
                 skipfooter=5,engine="python"
            )

Let's have a brief look at a single column, "home_ownership":

In [44]:
df["home_ownership"].head()

0        RENT
1        RENT
2    MORTGAGE
3    MORTGAGE
4         OWN
Name: home_ownership, dtype: object

One of the columns, "emp_title" (employment title), has a number of null values that cause type errors, so I have replaced them with the string "not_revealed".

In [43]:
new_emp_var = np.where(df["emp_title"].isnull(), # Logical check
                       "not_revealed",                       # Value if check is true
                       df["emp_title"])     # Value if check is false

df["emp_title"] = new_emp_var
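The same replacement can also be written with pandas' own `fillna`, which some readers may find easier to scan. This is a minimal sketch on a toy frame (the values are invented), not a change to the cell above.

```python
import numpy as np
import pandas as pd

# Toy column with two kinds of missing value (NaN and None).
toy = pd.DataFrame({"emp_title": ["Manager", np.nan, "Teacher", None]})

# fillna replaces every missing entry with the placeholder string.
toy["emp_title"] = toy["emp_title"].fillna("not_revealed")
print(toy["emp_title"].tolist())
```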

We now have to encode all the non-numerical data, such as zip code, home ownership status and employment title, in a numerical format. Each encoded column then has to be scaled to the range 0 to 1 so that no single feature dominates the distance calculation.

In [7]:
# Initialize label encoder
label_encoder = preprocessing.LabelEncoder()
min_max_scaler = preprocessing.MinMaxScaler()

In [47]:
# Convert unordered variables to numeric and scale them to between 0 and 1

encoded_home_ownership = label_encoder.fit_transform(df["home_ownership"])
ownership_minmax = min_max_scaler.fit_transform(
    encoded_home_ownership.reshape(-1, 1))

encoded_zip = label_encoder.fit_transform(df["zip_code"])
zip_minmax = min_max_scaler.fit_transform(encoded_zip.reshape(-1, 1))

encoded_purpose = label_encoder.fit_transform(df["purpose"])
purpose_minmax = min_max_scaler.fit_transform(encoded_purpose.reshape(-1, 1))

encoded_emp_title = label_encoder.fit_transform(df["emp_title"])
emp_title_minmax = min_max_scaler.fit_transform(
    encoded_emp_title.reshape(-1, 1))

encoded_emp_length = label_encoder.fit_transform(df["emp_length"])
emp_length_minmax = min_max_scaler.fit_transform(
    encoded_emp_length.reshape(-1, 1))

# annual_inc is already numeric; a pandas Series has no reshape in current
# pandas versions, so take the underlying NumPy array first
income_minmax = min_max_scaler.fit_transform(
    df["annual_inc"].values.reshape(-1, 1))





We can have a look at some of the scaled data from one column, the loan purpose ("purpose"):

In [50]:
purpose_minmax

array([[ 0.15384615],
       [ 0.07692308],
       [ 0.15384615],
       ..., 
       [ 0.30769231],
       [ 0.15384615],
       [ 0.15384615]])
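The values in this output are consistent with integer codes divided by 13: 1/13 ≈ 0.0769, 2/13 ≈ 0.1538 and 4/13 ≈ 0.3077, which is what min-max scaling produces when the codes run from 0 to 13. The sketch below reproduces that arithmetic on made-up labels (the `purpose_NN` names and the count of 14 are assumptions for illustration, not taken from the data).

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical category names; only the number of distinct labels (14) matters.
labels = np.array(["purpose_%02d" % i for i in range(14)])

# LabelEncoder assigns codes 0..13 in sorted (alphabetical) order.
codes = preprocessing.LabelEncoder().fit_transform(labels)

# MinMaxScaler maps each code x to (x - min) / (max - min) = x / 13 here.
scaled = preprocessing.MinMaxScaler().fit_transform(
    codes.reshape(-1, 1).astype(float))

print(scaled[[1, 2, 4], 0])  # the codes 1, 2 and 4 after scaling
```

Because the codes follow alphabetical order, two categories with nearby scaled values are close in spelling order rather than necessarily similar in meaning, which is worth keeping in mind when reading the distances later on.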

Now we combine all the NumPy arrays (feature data) into a pandas DataFrame for simplicity's sake.

In [52]:
data = np.hstack([ownership_minmax, zip_minmax, purpose_minmax,
                  emp_title_minmax, emp_length_minmax, income_minmax])

data_features = pd.DataFrame(data,
                             columns=["ownership_minmax", "zip_minmax",
                                      "purpose_minmax", "emp_title_minmax",
                                      "emp_length_minmax", "income_minmax"])


print(data_features.head())

   ownership_minmax  zip_minmax  purpose_minmax  emp_title_minmax  \
0          1.000000    0.913472        0.153846          0.711243   
1          1.000000    0.460022        0.076923          0.624844   
2          0.333333    0.944140        0.153846          0.615573   
3          0.333333    0.709748        0.076923          0.778456   
4          0.666667    0.236583        0.153846          0.778456   

   emp_length_minmax  income_minmax  
0           0.636364       0.015789  
1           0.909091       0.005474  
2           0.818182       0.014947  
3           0.545455       0.005053  
4           0.090909       0.007163  


Now we load all the feature data into memory and index it using the built-in nearest-neighbours implementation in scikit-learn. The algorithm used here is the ball tree.

In [11]:
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(data_features)
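The reason for `n_neighbors=2` is easier to see on a toy example. When the indexed points themselves are used as queries, each point's nearest neighbour is itself at distance 0, so the second column holds the nearest *other* point. The four 2-D points below are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four made-up points; the first two are nearly identical.
points = np.array([
    [0.10, 0.20],
    [0.11, 0.20],
    [0.90, 0.80],
    [0.50, 0.50],
])

toy_nbrs = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(points)
toy_distances, toy_indices = toy_nbrs.kneighbors(points)

print(toy_distances[:, 0])  # column 0: distance from each point to itself
print(toy_indices[:, 1])    # column 1: index of the nearest other point
```

If the data contains exact duplicates, the second-column distance is also 0, so a filter on strictly positive distances is needed to find near-duplicates rather than exact copies.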

We now query the ball tree with individual profiles and find the most similar profile in the whole tree. The metric for the closest neighbour is the Euclidean distance. The query returns the distance to the nearest profile and the index of that profile in the profile population. We will then have a manual look at the similar profiles.

In [137]:
# Index number 1003 seems to be very interesting; a look into it is recommended.
# Select the range of profiles to query against the tree, here profiles 35000 to 37000.
query = data_features.iloc[35000:37000]
distances, indices = nbrs.kneighbors(query)

# Find the smallest non-zero neighbour distance and the indices needed to recover the original profile data
abs_distances = distances[:, 1]
min_distance = abs_distances[(abs_distances > 0)].min()
min_loc = np.where((distances == min_distance))

#now display all the similar profiles
print("distance:", distances[min_loc])
print("")

print("similar profile 1:")
print(df.iloc[indices[min_loc[0]][0][1]])
print("")

print("similar profile 2:")
print(df.iloc[indices[min_loc[0]][0][0]])

distance: [  2.05263158e-05]

similar profile 1:
int_rate                       7.26%
installment                   774.91
grade                              A
sub_grade                         A4
emp_title                    Manager
emp_length                 10+ years
home_ownership                  RENT
annual_inc                    102000
purpose           debt_consolidation
zip_code                       937xx
Name: 198880, dtype: object

similar profile 2:
int_rate                      17.57%
installment                   704.49
grade                              D
sub_grade                         D4
emp_title                    Manager
emp_length                 10+ years
home_ownership                  RENT
annual_inc                    102195
purpose           debt_consolidation
zip_code                       937xx
Name: 35836, dtype: object


As you can see, the two profiles shown above are very similar. A pair like this can be investigated manually by the bank or firm if needed. In this example, the applicant appears to have furnished slightly different information (an annual income of 102000 versus 102195) yet received widely differing credit ratings (A4 and D4). Investors are likely to be misled if such information is false. The two loans also carry widely differing interest rates (7.26% versus 17.57%) for what appears to be the same person, and hence the same creditworthiness.
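To collect more than one candidate pair for manual review, the single-minimum logic could be replaced by a distance cut-off. The following is only a sketch on random toy features with one planted near-duplicate; the `threshold` value is a hypothetical choice and would need tuning on the real data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random toy features standing in for the scaled profile data.
rng = np.random.RandomState(0)
features = rng.rand(50, 3)
features[7] = features[3] + 1e-5  # plant one near-duplicate pair

pair_nbrs = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(features)
pair_distances, pair_indices = pair_nbrs.kneighbors(features)

threshold = 1e-3  # hypothetical cut-off, not derived from the loan data

# Keep rows whose nearest other point is closer than the cut-off but not identical.
close = (pair_distances[:, 1] > 0) & (pair_distances[:, 1] < threshold)

# Store each pair once, as (smaller index, larger index).
pairs = {tuple(sorted((int(i), int(pair_indices[i, 1]))))
         for i in np.where(close)[0]}
print(sorted(pairs))
```

Each index in `pairs` can then be passed to `df.iloc[...]` to recover the original profile, as in the cell above.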

In [None]:
#similar profile pairs:333083,1180