# Gia Gillis 

## Loan Interest Rate Analysis Part 1 of 3

#### About the data
- Several columns, such as Reason, are unlikely to be useful for analysis even at first glance.
- Although Loan Id and Borrower Id are useful from a database standpoint, the values will not contribute much to analysis.
- The data is clearly dirty: backwards or ambiguous dates, dollar and percent signs embedded in numeric fields, null values, etc.
#### The limitations of the data
- If the interest rate is null, the row will not contribute to the analysis.  
- X16 Reason and X18 Loan Title have too many distinct values. Both are free-form user entries and not very clean. Natural language processing might surface something interesting, but as they stand these columns do not form useful categories.
- X10 Job title or employer is similar to X16 and X18. It also appears to be free-form user input. This column could be useful if made uniform (for example, title case) and limited to the most frequent values.
- States ID, IA, NE, and ME each have fewer than ten records. At first glance these states appear to have significantly higher or lower interest rates, but with so few records this is probably an artifact of the small sample.
- For Home, the ANY, OTHER, and NONE categories together account for fewer than 140 records, which is too few to be meaningful. The remaining values, MORTGAGE, OWN, and RENT, are clearer in any case.
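
The small-sample caveat above can be checked by looking at the row count behind each group mean. The following is a minimal sketch on invented toy data; the `MIN_ROWS` cutoff is a hypothetical value, not one used in this analysis.

```python
import pandas as pd

# Toy data standing in for the cleaned loans frame (values are made up).
toy = pd.DataFrame({
    "State": ["CA", "CA", "CA", "NY", "NY", "NY", "ID", "ME"],
    "Interest Rate": [0.12, 0.14, 0.10, 0.11, 0.13, 0.15, 0.22, 0.06],
})

MIN_ROWS = 3  # hypothetical cutoff; choose one that suits the real data

# Mean rate per state, together with the number of rows behind each mean.
by_state = toy.groupby("State")["Interest Rate"].agg(["mean", "count"])

# Keep only states with enough rows for the mean to be worth comparing.
reliable = by_state[by_state["count"] >= MIN_ROWS]
print(reliable)
```

States that fall below the cutoff (ID and ME in the toy data) drop out of the comparison instead of showing up as apparent outliers.
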
#### Factors that are the most telling/Conclusions
- None of the numeric columns seems to correlate strongly with the interest rate.
- The strongest relationship appears to be between Number of Payments (36 or 60 months) and interest rate.
- Homeownership and loan category also seem to be related to the interest rate.
- Loans with unverified income also seem to have a lower interest rate, which is curious.
- See also Tableau link https://public.tableau.com/profile/gia.g#!/vizhome/LoanInterestRateAnalysis/Dashboard1
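
One way such relationships could be checked in pandas is sketched below. The data is invented for illustration, and the `Term` column is a hypothetical helper that is not part of the original dataset.

```python
import pandas as pd

# Toy data; the rates and incomes are made up.
toy = pd.DataFrame({
    "Interest Rate": [0.08, 0.10, 0.12, 0.15, 0.18, 0.20],
    "Number of Payments": [" 36 months", " 36 months", " 36 months",
                           " 60 months", " 60 months", " 60 months"],
    "Annual Income": [90000.0, 40000.0, 65000.0, 50000.0, 85000.0, 45000.0],
})

# Pull the term length out of the text so it can be treated as a number.
toy["Term"] = (
    toy["Number of Payments"].str.extract(r"(\d+)", expand=False).astype(int)
)

# Pearson correlation of each numeric column with the interest rate.
corr = toy[["Interest Rate", "Term", "Annual Income"]].corr()["Interest Rate"]
print(corr)

# Mean rate per term, a simpler view of the same relationship.
mean_by_term = toy.groupby("Term")["Interest Rate"].mean()
print(mean_by_term)
```

For a two-valued column such as the loan term, the group means are often easier to read than a correlation coefficient.
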
#### Other features that might be helpful to predict interest rates
- Economic measures for the particular time period (inflation, government, etc.)
- A borrower's credit score
- More demographic data about the borrower
#### Opportunities to consider
- Adding other factors, such as economic measures and credit scores, may help predict borrower interest rate.
- Adding functionality to the point of record entry to restrict fields, such as job, to certain values.
- Requiring the borrower to provide other demographic data at the point of record entry.

Import needed libraries.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
from dateutil.relativedelta import relativedelta

Read the CSV file, force the Loan Id (X2), Borrower Id (X3), and Reason (X16) columns to strings, and rename the columns.

In [2]:
loans=pd.read_csv(r'C:\Users\Gia\Downloads\Analyst_Test\Analyst_Test\loan_interest_rates.csv', 
                  dtype={'X2': str, 'X3':str, 'X16': str}, parse_dates=True)
loans.columns=['Interest Rate', 'Loan Id', 'Borrower Id', 'Requested', 'Funded', 'Investor Funded', 'Number of Payments',
              'Loan Grade', 'Loan Subgrade', 'Job', 'Years Employed', 'Home', 'Annual Income', 'Income Verified', 'Loan Date',
              'Reason', 'Loan Cat', 'Loan Title', 'State', 'Ratio', 'Late Payments', 'Credit Line Date', 'Months Del',
              'Months PR', 'Derog Recs', 'Credit Lines', 'Status']

In [3]:
loans.head()

Unnamed: 0,Interest Rate,Loan Id,Borrower Id,Requested,Funded,Investor Funded,Number of Payments,Loan Grade,Loan Subgrade,Job,...,Loan Title,State,Ratio,Late Payments,Credit Line Date,Months Del,Months PR,Derog Recs,Credit Lines,Status
0,11.89%,54734,80364,"$25,000","$25,000","$19,080",36 months,B,B4,,...,Debt consolidation for on-time payer,CA,19.48,0.0,Feb-94,,,0.0,42.0,f
1,10.71%,55742,114426,"$7,000","$7,000",$673,36 months,B,B5,CNN,...,Credit Card payoff,NY,14.29,0.0,Oct-00,,,0.0,7.0,f
2,16.99%,57167,137225,"$25,000","$25,000","$24,725",36 months,D,D3,Web Programmer,...,mlue,NY,10.5,0.0,Jun-00,41.0,,0.0,17.0,f
3,13.11%,57245,138150,"$1,200","$1,200","$1,200",36 months,C,C2,city of beaumont texas,...,zxcvb,TX,5.47,0.0,Jan-85,64.0,,0.0,31.0,f
4,13.57%,57416,139635,"$10,800","$10,800","$10,692",36 months,C,C3,State Farm Insurance,...,Nicolechr1978,CT,11.63,0.0,Dec-96,58.0,,0.0,40.0,f


In [4]:
loans.dtypes

Interest Rate          object
Loan Id                object
Borrower Id            object
Requested              object
Funded                 object
Investor Funded        object
Number of Payments     object
Loan Grade             object
Loan Subgrade          object
Job                    object
Years Employed         object
Home                   object
Annual Income         float64
Income Verified        object
Loan Date              object
Reason                 object
Loan Cat               object
Loan Title             object
State                  object
Ratio                 float64
Late Payments         float64
Credit Line Date       object
Months Del            float64
Months PR             float64
Derog Recs            float64
Credit Lines          float64
Status                 object
dtype: object

Count the null values in each column that contains any.

In [5]:
missing_data = loans.isnull()
null_columns=loans.columns[missing_data.any()]
loans[null_columns].isnull().sum()

Interest Rate          61010
Loan Id                    1
Borrower Id                1
Requested                  1
Funded                     1
Investor Funded            1
Number of Payments         1
Loan Grade             61270
Loan Subgrade          61270
Job                    23986
Years Employed         17538
Home                   61361
Annual Income          61028
Income Verified            1
Loan Date                  1
Reason                276440
Loan Cat                   1
Loan Title                19
State                      1
Ratio                      1
Late Payments              1
Credit Line Date           1
Months Del            218802
Months PR             348845
Derog Recs                 1
Credit Lines               1
Status                     1
dtype: int64
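
The same count can be obtained a little more directly, along with each column's share of nulls. This is a minimal sketch on invented toy rows, not the real file.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the raw loans frame (values are made up).
toy = pd.DataFrame({
    "Interest Rate": ["11.89%", None, "16.99%", "13.11%"],
    "Months PR": [np.nan, np.nan, np.nan, 12.0],
    "State": ["CA", "NY", "NY", "TX"],
})

# Nulls per column, keeping only the columns that have at least one.
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0]

# Express each count as a fraction of all rows.
summary = pd.DataFrame({
    "nulls": null_counts,
    "share": (null_counts / len(toy)).round(3),
})
print(summary)
```

Seeing the share next to the count makes it easier to judge whether dropping or filling a column's nulls is reasonable.
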

Drop rows with a null Interest Rate. This removes 61,010 of 400,000 rows, or about 15 percent.

In [6]:
loans.dropna(subset=['Interest Rate'], inplace=True)
loans.shape

(338990, 27)

Replace nulls in months since last delinquency (Months Del) and months since last public record (Months PR) with 0, assuming that null indicates 0 months.

In [7]:
loans.fillna(value={'Months PR': 0, 'Months Del': 0}, inplace=True)

In [8]:
loans['Months PR'].head()

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: Months PR, dtype: float64

In [9]:
loans['Months Del'].head()

0     0.0
1     0.0
2    41.0
3    64.0
4    58.0
Name: Months Del, dtype: float64
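
Filling with 0 makes "never delinquent" indistinguishable from "delinquent 0 months ago". If that distinction matters later, one option is to record a flag before filling. The sketch below uses toy data, and `Has Del` is a hypothetical column name that does not exist in the dataset.

```python
import numpy as np
import pandas as pd

# Toy column; NaN is assumed to mean that no delinquency was recorded.
toy = pd.DataFrame({"Months Del": [np.nan, 41.0, 0.0, np.nan]})

# Hypothetical flag: True where a delinquency value was actually present.
toy["Has Del"] = toy["Months Del"].notna()

# Fill the nulls with 0, as the notebook does.
toy["Months Del"] = toy["Months Del"].fillna(0)
print(toy)
```
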

Create functions that parse each column and convert percent and money strings to floats.

In [10]:
def parse_columns(df):
    # Apply parse_data to every cell of every column.
    for col in df.columns:
        df[col] = df[col].map(parse_data)
    return df

def parse_data(data):
    """Convert '$1,234' to 1234.0 and '12.5%' to 0.125; return anything else unchanged."""
    value = data
    # Only touch non-empty strings without letters (this skips text such as '36 months').
    if isinstance(value, str) and value != '' and not any(char.isalpha() for char in value):
        if value[0] == '$' and len(value) > 1 and value[1].isdigit():
            value = float(value.replace('$', '').replace(',', ''))
        elif value[-1] == '%':
            value = float(value.replace('%', '')) / 100
    return value

In [11]:
loans=parse_columns(loans)
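
Mapping a Python function over every cell works but can be slow on several hundred thousand rows. A vectorized alternative is sketched below on toy data. The helper names `money_to_float` and `percent_to_fraction` are hypothetical. Unlike `parse_data`, this version is applied only to the columns named and turns unparseable values into NaN instead of leaving them unchanged.

```python
import pandas as pd

# Toy columns in the same raw format as the source file.
toy = pd.DataFrame({
    "Interest Rate": ["11.89%", "10.71%", None],
    "Requested": ["$25,000", "$7,000", "$1,200"],
})

def money_to_float(col):
    # Strip the dollar sign and thousands separators, then convert.
    cleaned = col.str.replace("$", "", regex=False).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

def percent_to_fraction(col):
    # Strip the trailing percent sign and scale to a fraction.
    return pd.to_numeric(col.str.rstrip("%"), errors="coerce") / 100

toy["Requested"] = money_to_float(toy["Requested"])
toy["Interest Rate"] = percent_to_fraction(toy["Interest Rate"])
print(toy.dtypes)
```
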

Create a function to convert date strings to datetime objects.

In [12]:
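def fix_dates(string):
    """Parse 'Mon-yy' date strings (for example 'Feb-94') into datetime objects."""
    fixed_date = string
    if isinstance(fixed_date, str) and len(fixed_date) > 1:
        # Backwards entries start with the year digits; swap them to 'Mon-yy'.
        if fixed_date[0].isdigit():
            split = fixed_date.split('-')
            y = split[0]
            if len(y) == 1:
                y = '0' + y  # pad single-digit years so %y can parse them
            fixed_date = split[1] + '-' + y
        date_object = datetime.strptime(fixed_date, '%b-%y')
        # %y maps 00-68 to 2000-2068, so shift implausible future years back a century.
        if date_object.year > 2040:
            date_object = date_object - relativedelta(years=100)
        return date_object
    # Nulls and other non-string values become None (NaT in the resulting column).
    return None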

In [13]:
loans['Loan Date']=loans['Loan Date'].apply(lambda x: fix_dates(x))
loans['Credit Line Date']=loans['Credit Line Date'].apply(lambda x: fix_dates(x))
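
The same conversion could also be done column-wise with `pd.to_datetime`. The sketch below assumes the backwards entries look like `1-Feb` (year digits first), which is what `fix_dates` implies; the sample values are invented.

```python
import pandas as pd

# Toy values: two normal entries, one backwards entry, one pre-1969 year, one null.
raw = pd.Series(["Feb-94", "Oct-00", "1-Feb", "Dec-44", None])

# Swap "<year>-<Mon>" into "<Mon>-<yy>", padding single-digit years.
swapped = raw.str.replace(
    r"^(\d{1,2})-([A-Za-z]{3})$",
    lambda m: f"{m.group(2)}-{int(m.group(1)):02d}",
    regex=True,
)

# Parse every value at once; anything unparseable becomes NaT.
dates = pd.to_datetime(swapped, format="%b-%y", errors="coerce")

# %y maps 00-68 to 20xx, so pull implausible future years back a century.
future = dates.dt.year > 2040
dates = dates.mask(future, dates - pd.DateOffset(years=100))
print(dates)
```

The 2040 threshold mirrors the one in `fix_dates`; it works here only because no loan in the data is dated after 2014.
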

In [14]:
loans['Credit Line Date'].dt.year.value_counts(ascending=True)

1944.0        1
1946.0        1
1949.0        1
1951.0        1
1953.0        3
1950.0        4
1954.0        4
1957.0        4
1955.0        5
1956.0        6
1958.0        8
1959.0       12
1961.0       21
1960.0       29
1962.0       35
1963.0       56
1964.0       65
1965.0      103
1966.0      121
1968.0      174
1967.0      194
1969.0      271
1971.0      317
1970.0      322
2011.0      443
1972.0      475
1973.0      529
1974.0      569
1975.0      682
1976.0      885
          ...  
1981.0     1623
2010.0     1717
1982.0     2241
2009.0     2449
1983.0     2974
1984.0     3631
1985.0     3938
1986.0     4421
2008.0     5017
1987.0     5136
1988.0     5968
1989.0     6959
1991.0     7660
1990.0     8050
1992.0     8232
2007.0     8808
1993.0    11397
2006.0    11541
2005.0    12376
1994.0    14420
2004.0    14978
1995.0    15755
1996.0    16928
1997.0    17295
2003.0    17811
1998.0    20715
2002.0    20960
1999.0    23976
2001.0    25013
2000.0    26222
Name: Credit Line Date, 

In [15]:
loans['Loan Date'].dt.year.value_counts()

2014.0    145681
2013.0    114219
2012.0     45289
2011.0     18246
2010.0      9792
2009.0      4008
2008.0      1517
2007.0       237
Name: Loan Date, dtype: int64

In [16]:
loans.head()

Unnamed: 0,Interest Rate,Loan Id,Borrower Id,Requested,Funded,Investor Funded,Number of Payments,Loan Grade,Loan Subgrade,Job,...,Loan Title,State,Ratio,Late Payments,Credit Line Date,Months Del,Months PR,Derog Recs,Credit Lines,Status
0,0.1189,54734,80364,25000.0,25000.0,19080.0,36 months,B,B4,,...,Debt consolidation for on-time payer,CA,19.48,0.0,1994-02-01,0.0,0.0,0.0,42.0,f
1,0.1071,55742,114426,7000.0,7000.0,673.0,36 months,B,B5,CNN,...,Credit Card payoff,NY,14.29,0.0,2000-10-01,0.0,0.0,0.0,7.0,f
2,0.1699,57167,137225,25000.0,25000.0,24725.0,36 months,D,D3,Web Programmer,...,mlue,NY,10.5,0.0,2000-06-01,41.0,0.0,0.0,17.0,f
3,0.1311,57245,138150,1200.0,1200.0,1200.0,36 months,C,C2,city of beaumont texas,...,zxcvb,TX,5.47,0.0,1985-01-01,64.0,0.0,0.0,31.0,f
4,0.1357,57416,139635,10800.0,10800.0,10692.0,36 months,C,C3,State Farm Insurance,...,Nicolechr1978,CT,11.63,0.0,1996-12-01,58.0,0.0,0.0,40.0,f


In [17]:
loans.dtypes

Interest Rate                float64
Loan Id                       object
Borrower Id                   object
Requested                    float64
Funded                       float64
Investor Funded              float64
Number of Payments            object
Loan Grade                    object
Loan Subgrade                 object
Job                           object
Years Employed                object
Home                          object
Annual Income                float64
Income Verified               object
Loan Date             datetime64[ns]
Reason                        object
Loan Cat                      object
Loan Title                    object
State                         object
Ratio                        float64
Late Payments                float64
Credit Line Date      datetime64[ns]
Months Del                   float64
Months PR                    float64
Derog Recs                   float64
Credit Lines                 float64
Status                        object
dtype: object

In [18]:
loans['Number of Payments'].value_counts()

 36 months    247791
 60 months     91198
Name: Number of Payments, dtype: int64

In [19]:
loans['Reason'].head()

0    Due to a lack of personal finance education an...
1    Just want to pay off the last bit of credit ca...
2    Trying to pay a friend back for apartment brok...
3    If funded, I would use this loan consolidate t...
4    I currently have a personal loan with Citifina...
Name: Reason, dtype: object

Drop Reason, Loan Id, and Borrower Id: the identifiers are distinct for every row and Reason is free-form text, so these columns will not contribute to the analysis.

In [20]:
loans.drop(['Reason', 'Loan Id', 'Borrower Id'], axis=1, inplace=True)

In [21]:
loans.shape

(338990, 24)

Convert all values in Loan Title column to lowercase.

In [22]:
loans['Loan Title']=loans['Loan Title'].str.lower()

In [23]:
loans['Loan Title'].head()

0    debt consolidation for on-time payer
1                      credit card payoff
2                                    mlue
3                                   zxcvb
4                           nicolechr1978
Name: Loan Title, dtype: object

In [24]:
loans['Loan Title'].value_counts()

debt consolidation                       123592
credit card refinancing                   40820
home improvement                          12023
other                                      8256
consolidation                              6490
debt consolidation loan                    3977
major purchase                             3256
credit card consolidation                  2970
personal loan                              2782
business                                   2475
consolidation loan                         2201
credit card payoff                         2145
credit card refinance                      2057
medical expenses                           1887
consolidate                                1752
personal                                   1668
car financing                              1395
loan                                       1321
vacation                                   1229
payoff                                     1195
credit cards                            

Create a function to make the Loan Title column more uniform.

In [25]:
def uniform_loan_title(title):
    new_title=title
    if isinstance(title, str):  
        if 'consolidat' in title:
            new_title='debt consolidation'
        elif 'credit card refi' in title:
            new_title='credit card refinance'
    return new_title

In [26]:
loans['Loan Title']=loans['Loan Title'].map(lambda x: uniform_loan_title(x))
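
For comparison, the same two rules can be written with vectorized string matching. This is a sketch on toy titles, applied in the same priority order as the `if`/`elif` above.

```python
import pandas as pd

# Toy titles, already lowercased as in the notebook.
titles = pd.Series([
    "debt consolidation loan",
    "consolidate",
    "credit card refinancing",
    "vacation",
    None,
])

# Boolean masks for each rule; nulls count as no match.
is_consolidation = titles.str.contains("consolidat", regex=False, na=False)
is_refi = titles.str.contains("credit card refi", regex=False, na=False)

# Apply the lower-priority rule first so that consolidation wins on overlap.
uniform = (
    titles.mask(is_refi, "credit card refinance")
          .mask(is_consolidation, "debt consolidation")
)
print(uniform)
```
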

In [27]:
loans['Loan Title'].value_counts()

debt consolidation                      156898
credit card refinance                    44092
home improvement                         12023
other                                     8256
major purchase                            3256
personal loan                             2782
business                                  2475
credit card payoff                        2145
medical expenses                          1887
personal                                  1668
car financing                             1395
loan                                      1321
vacation                                  1229
payoff                                    1195
credit cards                              1151
freedom                                   1090
debt                                      1040
my loan                                    995
credit card loan                           959
credit card                                952
moving and relocation                      952
debt free    

In [28]:
loans['Loan Cat'].value_counts()

debt_consolidation    198226
credit_card            75680
home_improvement       19625
other                  17154
major_purchase          7312
small_business          5359
car                     4115
medical                 3329
moving                  2138
wedding                 1934
vacation                1848
house                   1723
educational              279
renewable_energy         267
Name: Loan Cat, dtype: int64

After some consolidation, there are still too many values in 'Loan Title' to be useful.  With 'Loan Cat' available, 'Loan Title' may be dropped.

In [29]:
loans.drop('Loan Title', axis=1, inplace=True)

In [30]:
loans.shape

(338990, 23)

In [31]:
loans['Home'].head()

0    RENT
1    RENT
2    RENT
3     OWN
4    RENT
Name: Home, dtype: object

In [32]:
loans['Home'].value_counts()

MORTGAGE    145958
RENT        115958
OWN          24976
OTHER          107
NONE            30
ANY              1
Name: Home, dtype: int64

Since the Home values OTHER, NONE, and ANY have very low counts relative to the other values, drop those rows from the dataframe.

In [33]:
home=loans[(loans['Home']=='OTHER')  | (loans['Home']=='ANY') | (loans['Home']=='NONE')]
home['Home'].value_counts()

OTHER    107
NONE      30
ANY        1
Name: Home, dtype: int64

In [34]:
loans.drop(home.index, inplace=True)

In [35]:
loans.shape

(338852, 23)
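
The chained comparisons used to select these rows can be written more compactly with `isin`. This is a minimal sketch on toy data; like the cells above, it keeps rows where Home is null.

```python
import pandas as pd

# Toy column with the same category labels as the real data.
toy = pd.DataFrame(
    {"Home": ["RENT", "OWN", "OTHER", "MORTGAGE", "NONE", "ANY", None]}
)

rare = ["OTHER", "NONE", "ANY"]

# Keep every row whose Home value is not one of the rare categories.
kept = toy[~toy["Home"].isin(rare)]
print(kept["Home"].value_counts())
```
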

In [36]:
loans['Status'].value_counts()

f    232474
w    106377
Name: Status, dtype: int64

In [37]:
loans['Years Employed'].value_counts()

10+ years    108455
2 years       30103
3 years       26659
< 1 year      25983
5 years       23060
1 year        21418
4 years       20250
6 years       19591
7 years       19440
8 years       16210
9 years       12890
Name: Years Employed, dtype: int64

In [38]:
loans['Job'].value_counts()

Teacher                              3602
Manager                              2875
Registered Nurse                     1537
RN                                   1452
Supervisor                           1286
Project Manager                      1095
Sales                                1048
Office Manager                        912
Owner                                 870
manager                               866
Driver                                842
General Manager                       806
Director                              783
teacher                               761
Engineer                              704
Vice President                        618
driver                                612
President                             557
Administrative Assistant              549
Operations Manager                    548
Attorney                              543
supervisor                            524
Accountant                            524
US Army                           

Convert Job titles/companies to title case for uniformity.

In [39]:
loans['Job']=loans['Job'].str.title()
loans['Job'].value_counts()

Teacher                                4421
Manager                                3908
Registered Nurse                       2170
Supervisor                             1897
Rn                                     1623
Sales                                  1586
Driver                                 1514
Owner                                  1391
Project Manager                        1261
Office Manager                         1203
General Manager                        1083
Truck Driver                            982
Engineer                                872
Director                                850
Us Army                                 754
Police Officer                          725
Store Manager                           700
Vice President                          699
Operations Manager                      688
President                               664
Sales Manager                           663
Administrative Assistant                632
Attorney                        
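
Title-casing merges variants such as `manager` and `Manager`, though it also turns acronyms like `RN` into `Rn`. To limit the column to its most frequent values, as suggested in the summary, the remaining jobs could be grouped under one label. The sketch below uses toy data; `TOP_N` and the `"Other"` label are hypothetical choices.

```python
import pandas as pd

# Toy job entries with inconsistent casing and one null.
jobs = pd.Series([
    "teacher", "Teacher", "RN", "manager", "Manager", "MANAGER",
    "web programmer", None,
])

# Normalize whitespace and casing before counting.
jobs = jobs.str.strip().str.title()

TOP_N = 2  # hypothetical; the real data would need a much larger value

# The most frequent job labels.
top = jobs.value_counts().nlargest(TOP_N).index

# Keep top labels and nulls as they are; relabel everything else.
grouped = jobs.where(jobs.isin(top) | jobs.isna(), "Other")
print(grouped.value_counts())
```
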

In [40]:
loans['Income Verified']=loans['Income Verified'].str.lower()
loans['Income Verified'].value_counts()

verified - income           126990
not verified                107802
verified - income source    104059
Name: Income Verified, dtype: int64

In [41]:
loans['State'].value_counts()

CA    52812
NY    29216
TX    26480
FL    22752
IL    13480
NJ    13182
PA    11875
OH    11037
GA    10844
VA    10329
NC     9302
MI     8348
MA     8032
MD     8013
AZ     7743
WA     7703
CO     7115
MN     5862
MO     5396
CT     5242
NV     4750
IN     4608
OR     4408
WI     4245
TN     4215
AL     4188
LA     4015
SC     3979
KY     3173
KS     3091
OK     3014
AR     2530
UT     2529
NM     1848
HI     1797
WV     1737
NH     1648
RI     1485
DC     1076
MT      994
AK      946
DE      896
WY      853
SD      730
MS      706
VT      602
ID        8
IA        7
NE        6
ME        4
Name: State, dtype: int64

Save cleaned data to new csv file.

In [42]:
loans.to_csv(r'C:\Users\Gia\Downloads\Analyst_Test\Analyst_Test\clean_loan_interest_rates.csv', index=False)
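
CSV files do not store column types, so the datetime columns come back as plain strings unless they are parsed again on read. The sketch below shows the round trip using an in-memory buffer in place of the real file path.

```python
import io

import pandas as pd

# Toy frame with one float column and one datetime column.
toy = pd.DataFrame({
    "Interest Rate": [0.1189],
    "Loan Date": pd.to_datetime(["2014-02-01"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read leaves the date column as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Naming the column in parse_dates restores the datetime type.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Loan Date"])
print(plain.dtypes, parsed.dtypes, sep="\n")
```
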