# Capstone Project 2: Lending Club

# Inferential Statistics

In [None]:
import pandas as pd #for building pandas dataframes for analysis
#pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import numpy as np #fundamental package for scientific computing with Python

import matplotlib.pyplot as plt #for visualizations
import seaborn as sns #for neat visualizations
import scipy.stats as stats #large number of probability distributions and statistical functions
import statsmodels.api as sm #provides estimation of many different statistical models, tests and data exploration

from ggplot import *
%matplotlib inline

import collections

In [None]:
#Cleaned Approved Data
approved1 = pd.read_csv('/Users/carolinerosefrensko/Downloads/data_wrangling_json/approved2018-08-14.csv')
approved1.head()

In [None]:
approved1.describe()

In [None]:
#Cleaned Approved Data Separated
approved2 = pd.read_csv('/Users/carolinerosefrensko/Downloads/data_wrangling_json/approved22018-08-14.csv')
approved2.head()

## Heat Map

In [None]:
corr = approved2.corr()

mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy; the builtin bool works everywhere
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(11, 9))

cmap = sns.diverging_palette(220, 10, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.title('Correlation Heat Map of the Filtered Dataframe')
plt.show()

In [None]:
corr_matrix = approved2.corr()
corr_matrix["loan_status_separated"].sort_values(ascending=False)

## Hypothesis Testing

### Examining Home Ownership in Approved Good and Bad Loans

The goal of this hypothesis test is to determine whether the rate of home ownership differs between the loan statuses in this data set. A two-sample test compares the two groups and indicates whether the observed difference in their averages is statistically significant or could plausibly be due to random chance.

Each observation is binary ('1' or '0'), so each group is a sample from a Bernoulli population, and the population standard deviation is unknown. Because the data set is large (far more than 30 observations), the central limit theorem applies: the sample means (here, the sample proportions) are approximately normally distributed, so a two-sample t-test is appropriate.
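
As a rough illustration of what the two-sample t-test computes, the sketch below builds the unequal-variance (Welch) statistic by hand and compares it with SciPy. The data are synthetic 0/1 draws, and the group sizes and rates are assumptions chosen for the example, not values from the loan data.

```python
import numpy as np
from scipy import stats

# Synthetic 0/1 flags standing in for home ownership (NOT the real loan data).
rng = np.random.default_rng(0)
good = rng.binomial(1, 0.60, size=5000)  # hypothetical "good loan" group
bad = rng.binomial(1, 0.55, size=2000)   # hypothetical "bad loan" group

# Welch's t-statistic: difference in means divided by the unpooled standard error.
se = np.sqrt(good.var(ddof=1) / len(good) + bad.var(ddof=1) / len(bad))
t_manual = (good.mean() - bad.mean()) / se

# SciPy's unequal-variance test should report the same statistic.
t_scipy, p_scipy = stats.ttest_ind(good, bad, equal_var=False)

# With samples this large the t distribution is close to the normal one,
# so a two-sided normal p-value is almost identical.
p_normal = 2 * stats.norm.sf(abs(t_manual))

print('manual t:', t_manual, 'scipy t:', t_scipy)
print('t-test p:', p_scipy, 'normal p:', p_normal)
```

Because the mean of a 0/1 column is a proportion, this test is, for large samples, very close to a two-proportion z-test.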

In the dataset, each row represents a loan. The 'loan status separated' column has two values: '1' for a good loan (paid/current) and '0' for a bad loan (charged off/default). The 'home ownership' column also has two values: 1 if the borrower owns their home and 0 if they rent or have another arrangement. The question is whether home ownership differs by loan status in this sample.
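
To make the layout concrete, here is a minimal sketch using a tiny hypothetical frame. Only the column names come from the notebook; the eight rows are invented for illustration.

```python
import pandas as pd

# Tiny hypothetical frame; the rows are made up, only the column names are real.
toy = pd.DataFrame({
    'loan_status_separated':    [1, 1, 1, 1, 0, 0, 0, 0],
    'home_ownership_separated': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows: loan status (0 = bad, 1 = good); columns: home ownership (0 or 1).
# normalize='index' converts counts into within-row proportions.
table = pd.crosstab(toy['loan_status_separated'],
                    toy['home_ownership_separated'],
                    normalize='index')
print(table)

# The mean of a 0/1 column is the share of ones, so groupby gives the same rates.
ownership_rate = toy.groupby('loan_status_separated')['home_ownership_separated'].mean()
print(ownership_rate)
```

The hypothesis test then asks whether the gap between the two rows of such a table is larger than chance would produce.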

The Null Hypothesis: Approved Good and Bad loans have the same proportion of home owners.  

The Alternative Hypothesis: Approved Good and Bad loans have different proportions of home owners.

### Compute margin of error, confidence interval, and p-value

In [None]:
# Separate into two datasets
g = approved1[approved1.loan_status_separated== 1]
b = approved1[approved1.loan_status_separated== 0]

# Number of loans
n_g = len(g)
n_b = len(b)

# Proportion of home ownership
prop_g = np.sum(g['home_ownership_separated']) / n_g
prop_b = np.sum(b['home_ownership_separated']) / n_b
print('Proportion of home_ownership for good_loans: ', prop_g)
print('Proportion of home_ownership for bad_loans: ', prop_b)

# Difference in proportion of home ownership
prop_diff = prop_g - prop_b
print('Difference in proportion of home ownership: ', prop_diff)

# Welch's two-sample t-test (does not assume equal variances)
t_stat, p = stats.ttest_ind(g['home_ownership_separated'],
                            b['home_ownership_separated'],
                            equal_var=False)
print('t-statistic: ', t_stat)
print('p-value: ', p)

# Standard error
s_error = np.sqrt(g['home_ownership_separated'].var()/n_g + b['home_ownership_separated'].var()/n_b)

# Margin of error = Critical value x Standard error of the statistic
m_error = 1.96 * s_error
print('Margin of error:', m_error)

# Confidence Interval
c_int = prop_diff + (np.array([-1, 1]) * m_error)
print('Confidence interval:', c_int)

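# Two-sided p-value from the normal approximation
# (abs() keeps the result valid whichever group has the larger proportion)
p_value = 2 * stats.norm.sf(abs(t_stat))
print('p-value (normal approximation):', p_value)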

prop_g/prop_b

### Frequentist Bootstrapping

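Pool the two groups, shuffle the rows, and split them back into groups of the original sizes. Under the assumption that there is no difference between the two proportions, the group labels are interchangeable, so the resampled differences show how large a gap chance alone would produce.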
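
The resampling idea can be sketched with plain NumPy arrays. Shuffling without replacement, as here, is usually called a permutation test. The function name `permutation_diffs`, the seeds, and the synthetic groups are assumptions for illustration and are not part of the analysis.

```python
import numpy as np

def permutation_diffs(sample1, sample2, n_resamples=1000, seed=0):
    """Absolute differences in proportion after shuffling the group labels."""
    rng = np.random.default_rng(seed)
    combined = np.concatenate([sample1, sample2])
    n1 = len(sample1)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        shuffled = rng.permutation(combined)  # reshuffle without replacement
        diffs[i] = abs(shuffled[:n1].mean() - shuffled[n1:].mean())
    return diffs

# Synthetic 0/1 flags with an assumed gap between the groups (not the loan data).
rng = np.random.default_rng(1)
group_a = rng.binomial(1, 0.60, size=800)
group_b = rng.binomial(1, 0.50, size=600)

observed = abs(group_a.mean() - group_b.mean())
null_diffs = permutation_diffs(group_a, group_b)

# Share of shuffled differences at least as large as the observed one.
p_value = np.mean(null_diffs >= observed)
print('observed difference:', observed)
print('permutation p-value:', p_value)
```

Working on arrays rather than DataFrames avoids re-concatenating and re-indexing on every iteration, which matters when the groups are large.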

In [None]:
df = approved1[['loan_status_separated','home_ownership_separated']]

def get_prop_diff(sample1, sample2):
    
    prop_g = np.sum(sample1['home_ownership_separated'] == 1)/len(sample1)
    prop_b = np.sum(sample2['home_ownership_separated'] == 1)/len(sample2)
    
    return abs(prop_g-prop_b)
    
def get_bs_samples_diff(sample1, sample2, func, size):
    length1 = len(sample1)
    length2 = len(sample2)
    bs_prop_diffs = np.empty(size)
    
    for i in range(size):
        combined_sample = pd.concat([sample1,sample2])
        shuffled_sample = combined_sample.sample(length1+length2).reset_index(drop=True)

        new_sample1 = shuffled_sample.iloc[:length1,:]
        new_sample2 = shuffled_sample.iloc[length1:,:]
        
        bs_prop_diffs[i] = func(new_sample1,new_sample2)
        
    return bs_prop_diffs

In [None]:
bs_samples_diff = get_bs_samples_diff(df[approved1.loan_status_separated==1], 
                                      df[approved1.loan_status_separated==0], get_prop_diff, 1000)
print(bs_samples_diff[:5])

In [None]:
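# p-value: share of resampled differences at least as large as the observed one
# (get_prop_diff returns absolute differences, so compare against abs(prop_diff))
p = np.sum(bs_samples_diff >= abs(prop_diff))/len(bs_samples_diff)
print(p)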

None of the 1,000 resampled differences was as extreme as the difference observed in our samples (p < 0.001), which suggests that home ownership is associated with good vs. bad loan status.
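
A resampled p-value of exactly 0 only means that no resample reached the observed difference. One common convention, sketched below, counts the observed sample as one of the resamples so that the estimate is never exactly 0. The helper `resampling_p_value` and the stand-in null values are hypothetical.

```python
import numpy as np

def resampling_p_value(null_diffs, observed_diff):
    """Share of resampled differences at least as extreme as the observed one,
    counting the observed sample itself as one resample."""
    null_diffs = np.asarray(null_diffs)
    extreme = np.sum(null_diffs >= abs(observed_diff))
    return (extreme + 1) / (len(null_diffs) + 1)

# Hypothetical null differences, all far below the observed gap.
fake_null = np.full(1000, 0.002)
p = resampling_p_value(fake_null, 0.082)
print(p)  # 1/1001, i.e. just under 0.001
```

With 1,000 resamples the smallest reportable value is about 0.001, so more resamples would be needed to claim a smaller p-value.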

### Hypothesis Testing Assessment

The analysis indicates that home ownership differs significantly between good and bad loans. Home owners make up 61.17 percent of good loans and 52.96 percent of bad loans, a difference of 8.2 percentage points.
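
The arithmetic behind these figures can be checked directly from the two reported proportions. The relative gap is an extra way of expressing the same numbers, not a result reported in the analysis.

```python
# Proportions reported in the assessment above.
prop_good = 0.6117
prop_bad = 0.5296

# Absolute gap, in percentage points.
diff_points = (prop_good - prop_bad) * 100

# Gap relative to the bad-loan rate, in percent.
relative_gap = (prop_good / prop_bad - 1) * 100

print(f"{diff_points:.2f} percentage points")
print(f"{relative_gap:.1f} percent higher than the bad-loan rate")
```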

We used a two-sample t-test and frequentist bootstrapping to check whether this difference could be due to chance. Both p-values are effectively 0, so we reject the null hypothesis.

This suggests that loan status is related to home ownership in this collected sample. However, other variables in the study (verification, grade, employment length, term) should also be taken into consideration in further investigation.