In [2]:
import numpy as np
import pandas as pd

In [3]:
# Loading the dataset

pdf = pd.read_csv('combined_dataframes.csv')
pdf = pdf.drop(['Unnamed: 0'], axis=1)
pdf.head()


Unnamed: 0,loan_id,mon_rep_dt,UPB_x,default_status,loan_age,credit_score,1st_pay_dt,1st_home,maturity_dt,MI%,...,channel,prod_type,state,home_type,zip_code,purpose,loan_term,no_borrowers,seller,servicer
0,F115Q1000001,201610,0.0,0,18,796,201505,9,203004,0,...,R,FRM,IA,SF,51000,C,180,1,Other sellers,Other servicers
1,F115Q1000002,201709,100534.98,0,30,805,201504,9,204503,0,...,B,FRM,NE,SF,68500,N,360,1,Other sellers,Other servicers
2,F115Q1000003,201709,320369.23,0,31,730,201503,9,203002,0,...,R,FRM,KY,SF,40400,N,180,2,Other sellers,NATIONSTARMTGELLCDBA
3,F115Q1000004,201709,281120.77,0,29,762,201505,9,204504,0,...,R,FRM,CO,SF,81200,N,360,2,Other sellers,USBANKNA
4,F115Q1000005,201709,183054.42,0,29,777,201505,9,204504,0,...,R,FRM,IL,SF,61700,N,360,1,Other sellers,Other servicers


In [4]:
# Retaining only rows with 0 or 1 for default status

ndf = pdf.loc[pdf['default_status'].isin(['0','1'])]

## 1.  Question - *Is there a difference in the interest rates charged for loans acquired through different channels?*

### Hypothesizing

First, we'll look at the average interest rate charged through each loan channel: Retail (R), Broker (B) and Correspondent (C).

In [4]:
# Retail
ndf[ndf.channel=='R'].interest.mean()

3.87903121852351

In [5]:
# Broker
ndf[ndf.channel=='B'].interest.mean()

3.945533008856271

In [6]:
# Correspondent
ndf[ndf.channel=='C'].interest.mean()

3.956289618597947

The mean interest rates differ across channels, with the largest gap between 'Correspondent' and 'Retail', so let's look at these two more closely.
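The per-channel means computed one cell at a time above can also be obtained in a single `groupby` call. The sketch below uses a small hypothetical frame (`toy`, with made-up rates) in place of `ndf`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for ndf; the rates are made up.
toy = pd.DataFrame({
    "channel": ["R", "R", "B", "B", "C", "C"],
    "interest": [3.75, 4.00, 3.875, 4.125, 4.00, 4.25],
})

# Mean interest rate for every channel in one pass.
channel_means = toy.groupby("channel")["interest"].mean()

# Difference between the Correspondent and Retail means.
diff_c_r = channel_means["C"] - channel_means["R"]

print(channel_means)
print("C - R =", diff_c_r)
```

With the real data, replacing `toy` by `ndf` should give the same three means as the separate cells.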

In [7]:
# difference between mean interest for C and R channels

ndf[ndf.channel=='C'].interest.mean() - ndf[ndf.channel=='R'].interest.mean()

0.07725840007443718

So the two sample means differ by about 0.077. But is this difference <b>*Statistically Significant*</b>?

To test this, we consider the following hypotheses:

<b>Null Hypothesis</b>: The mean interest rates for channels C and R are equal.

<b>Alternate Hypothesis</b>: The mean interest rates for channels C and R are different.

### Hypothesis Testing

We will use a two-sample hypothesis test since the channels C, R can be thought of as two samples. Here, we will use the bootstrap method.

In [5]:
# Compute mean of all interests: mean_interest
mean_interest = ndf.interest.mean()

# difference in two means
empirical_diff_means = ndf[ndf.channel=='C'].interest.mean() - ndf[ndf.channel=='R'].interest.mean()


# functions for generating bootstrap replicates
def bootstrap_replicate_1d(data, func):
    return func(np.random.choice(data, size=len(data)))


def draw_bs_reps(data, func, size):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)

    return bs_replicates

In [12]:
# Shifting both samples to the overall mean (this is what simulates the null hypothesis) - not done here

# Cint_mean_shifted = ndf[ndf.channel=='C'].interest - ndf[ndf.channel=='C'].interest.mean() + mean_interest
# Rint_mean_shifted = ndf[ndf.channel=='R'].interest - ndf[ndf.channel=='R'].interest.mean() + mean_interest

In [11]:
# generating bootstrap replicates

bs_replicates_C = draw_bs_reps(ndf[ndf.channel=='C'].interest, np.mean, size=100)
bs_replicates_R = draw_bs_reps(ndf[ndf.channel=='R'].interest, np.mean, size=100)

In [12]:
# difference between C and R in bootstrap replicates
bs_replicates = bs_replicates_C - bs_replicates_R

In [13]:
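# Fraction of replicate differences at least as large as the observed difference.
# Note: the replicates come from the unshifted samples, so this is not a p-value under the null hypothesis.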
p = np.sum(bs_replicates >= empirical_diff_means) / len(bs_replicates)
print('p-value =', p)

p-value = 0.53


In [10]:
# Inspect the replicate differences (the output below is from a run with size=10)

bs_replicates

array([0.07777039, 0.07678189, 0.07686499, 0.07830874, 0.07670319,
       0.07777816, 0.07731031, 0.07810242, 0.07746706, 0.07842238])

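*Important*<br/>
p = 0.7 with 10 bootstrap replicates<br/>
p = 0.53 with 100 bootstrap replicates

Both values sit near 0.5, up to resampling noise. The replicates were drawn from the unshifted samples, so their differences cluster around the observed difference (about 0.0773) rather than around zero, and roughly half of them exceed it whatever the data look like. How many replicates should be used?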
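One way to get a bootstrap p-value that actually tests the null hypothesis is to shift both samples to a common mean before resampling, as in the commented-out cell earlier. The sketch below does this on synthetic data: `sample_c` and `sample_r` are hypothetical normal draws, not the loan data, and `bootstrap_means` is an illustrative helper rather than the notebook's `draw_bs_reps`. On the number of replicates: the smallest nonzero p-value a bootstrap test can report is 1/`n_reps`, so 10 or 100 replicates give a very coarse answer, and adding replicates mainly reduces Monte Carlo noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two channels (hypothetical values, not the loan data).
sample_c = rng.normal(loc=3.96, scale=0.4, size=2000)
sample_r = rng.normal(loc=3.88, scale=0.4, size=2000)

observed_diff = sample_c.mean() - sample_r.mean()

# Shift both samples to the pooled mean so that the null hypothesis
# (equal means) holds by construction in the resampled data.
pooled_mean = np.concatenate([sample_c, sample_r]).mean()
c_shifted = sample_c - sample_c.mean() + pooled_mean
r_shifted = sample_r - sample_r.mean() + pooled_mean


def bootstrap_means(data, n_reps, rng):
    """Return n_reps means of resamples drawn with replacement from data."""
    idx = rng.integers(0, len(data), size=(n_reps, len(data)))
    return data[idx].mean(axis=1)


n_reps = 1000

# Replicate differences under the null (shifted samples).
null_diffs = bootstrap_means(c_shifted, n_reps, rng) - bootstrap_means(r_shifted, n_reps, rng)

# Two-sided p-value: how often is a null difference at least as extreme as the observed one?
p_shifted = np.mean(np.abs(null_diffs) >= abs(observed_diff))

# For comparison: the same fraction computed from unshifted samples.
raw_diffs = bootstrap_means(sample_c, n_reps, rng) - bootstrap_means(sample_r, n_reps, rng)
frac_unshifted = np.mean(raw_diffs >= observed_diff)

print("p-value with shifted samples   =", p_shifted)
print("fraction with unshifted samples =", frac_unshifted)
```

The unshifted fraction stays near 0.5 because those replicates are centered on the observed difference, while the shifted version measures how unusual that difference would be if the two means were equal.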

### Frequentist method

In [32]:
from scipy.stats import ttest_ind
ttest_ind(ndf[ndf.channel=='C'].interest, ndf[ndf.channel=='R'].interest, equal_var=False)

Ttest_indResult(statistic=111.37353808326525, pvalue=0.0)
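With `equal_var=False`, `ttest_ind` performs Welch's t-test, whose statistic is the difference in sample means divided by the square root of the sum of each sample variance over its sample size. A p-value printed as exactly 0.0 for a statistic this large means the true value is smaller than the smallest positive float, not that it is literally zero. The sketch below checks the formula against SciPy on synthetic data; the arrays `a` and `b` are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Hypothetical samples with unequal sizes and variances.
a = rng.normal(3.96, 0.4, size=500)
b = rng.normal(3.88, 0.5, size=800)

# Welch's statistic: difference in means over the unpooled standard error.
var_a = a.var(ddof=1)
var_b = b.var(ddof=1)
standard_error = np.sqrt(var_a / len(a) + var_b / len(b))
t_manual = (a.mean() - b.mean()) / standard_error

# SciPy's result for the same test.
result = ttest_ind(a, b, equal_var=False)

print("manual t =", t_manual)
print("scipy  t =", result.statistic)
```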

## 2. Question - *Is there a general trend in the interest rates charged for defaulters and non-defaulters?*

In [14]:
# non - defaulters mean interest rate
ndf[ndf.default_status=='0'].interest.mean()

3.908587672397223

In [15]:
# defaulters mean interest rate
ndf[ndf.default_status=='1'].interest.mean()

4.145132667914326

Difference

In [16]:
diff_means1 = ndf[ndf.default_status=='1'].interest.mean() - ndf[ndf.default_status=='0'].interest.mean()
diff_means1

0.23654499551710284

### Hypotheses:

<b>Null Hypothesis</b>: The mean interest rates for defaulters and non-defaulters are equal.

<b>Alternate Hypothesis</b>: The mean interest rates for defaulters and non-defaulters are *different.*

### Bootstrap test

In [20]:
# Hypothesis Testing using Bootstrap sampling
# generating bootstrap replicates
bs_replicates_0 = draw_bs_reps(ndf[ndf.default_status=='0'].interest, np.mean, size=100)
bs_replicates_1 = draw_bs_reps(ndf[ndf.default_status=='1'].interest, np.mean, size=100)


# difference between defaulters (1) and non-defaulters (0) in bootstrap replicates
bs_replicates1 = bs_replicates_1 - bs_replicates_0


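# Fraction of replicate differences at least as large as the observed difference
# (replicates are from unshifted samples, so this is not a p-value under the null hypothesis)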
p = np.sum(bs_replicates1 >= diff_means1) / len(bs_replicates1)
print('p-value =', p)

p-value = 0.57


### Frequentist test

In [26]:
from scipy.stats import ttest_ind
t, p = ttest_ind(ndf[ndf.default_status=='0'].interest, ndf[ndf.default_status=='1'].interest, equal_var=False)

In [27]:
t

-51.574526575730395

In [29]:
print('p-value =', p)

p-value = 0.0


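The frequentist test reports a p-value so small that it underflows to 0.0, while the bootstrap figure is 0.57. The two numbers are not comparable as computed: the bootstrap replicates were drawn from the unshifted samples, so they are centered on the observed difference and about half of them exceed it whether or not the null hypothesis holds. The bootstrap figure therefore says nothing about the hypotheses. The t-test on its own points to rejecting the null hypothesis; a bootstrap test with both samples shifted to a common mean would be needed to confirm it.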

## 3. Question - *Is the correlation between credit score and interest rate significant?*

In [21]:
import scipy.stats as stats
stats.pearsonr(ndf['credit_score'], ndf['interest'])

(-0.14643477252161768, 0.0)

We see that the correlation between credit score and interest rate is weak and negative (r ≈ -0.146) but statistically significant at both the 0.05 and 0.005 thresholds.<br/>
*The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.(scipy documentation)*
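A permutation test gives a check on the significance of a correlation that does not rely on the distributional assumptions behind the p-value from `pearsonr`. The sketch below uses synthetic data: `score` and `rate` are hypothetical arrays built to have a weak negative relationship, and the number of permutations is an arbitrary choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical data: rate decreases slightly with score, plus noise.
n = 2000
score = rng.normal(750, 40, size=n)
rate = 4.0 - 0.002 * (score - 750) + rng.normal(0, 0.5, size=n)

# Observed Pearson correlation.
r_obs = np.corrcoef(score, rate)[0, 1]

# Shuffling one variable breaks any association, which simulates the
# null hypothesis of zero correlation.
n_perms = 2000
perm_r = np.empty(n_perms)
for i in range(n_perms):
    perm_r[i] = np.corrcoef(score, rng.permutation(rate))[0, 1]

# Two-sided p-value; the +1 terms keep the estimate away from exactly zero.
p_perm = (np.sum(np.abs(perm_r) >= abs(r_obs)) + 1) / (n_perms + 1)

print("r =", r_obs)
print("permutation p-value =", p_perm)
print("pearsonr r =", stats.pearsonr(score, rate)[0])
```

With `n_perms` permutations the smallest p-value this can report is 1/(`n_perms` + 1), so a very small parametric p-value will show up here only as that lower bound.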