T-Tests on the Last FICO Range
==============

Among the variables that LendingClub publishes on its website, two are very strongly correlated with `loan_status`: `last_fico_range_low` and `last_fico_range_high`, the lower and upper bounds of the range containing the borrower’s most recently pulled FICO score. Using only these two variables and the target, we achieved almost __100% accuracy__ and an __area under the ROC curve of 0.99__, regardless of the type of model or the parameters we trained it with.

Nevertheless, we don't know whether we can use these two variables as training features, since a borrower's FICO range may have changed only after he or she stopped paying the loan (_"Correlation does not imply causation"_).

So, the question here is: __can we use the "last" FICO range (low and/or high) to predict whether a person is going to fully pay a loan?__ We are going to run some t-tests to try to figure it out.

__Libraries__

In [3]:
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



__Some exploration of the data__:

In [4]:
loans <- readRDS("/media/juanan/DATA/loan_data_analysis/data/clean/loans.rds")

In [6]:
head(loans)

id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,⋯,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
38098114,,15000,15000,15000,60 months,12.39,336.64,C,C1,⋯,,,Cash,N,,,,,,
36805548,,10400,10400,10400,36 months,6.99,321.08,A,A3,⋯,,,Cash,N,,,,,,
37842129,,21425,21425,21425,60 months,15.59,516.36,D,D1,⋯,,,Cash,N,,,,,,
37612354,,12800,12800,12800,60 months,17.14,319.08,D,D4,⋯,,,Cash,N,,,,,,
37662224,,7650,7650,7650,36 months,13.66,260.2,C,C3,⋯,,,Cash,N,,,,,,
37822187,,9600,9600,9600,36 months,13.66,326.53,C,C3,⋯,,,Cash,N,,,,,,


Selecting the FICO columns and the dates (loan issue date `issue_d` and last payment date `last_pymnt_d`):

In [7]:
ficoStudy <- loans %>% 
  select(issue_d, last_pymnt_d, 
         last_fico_range_low, last_fico_range_high, fico_range_low, fico_range_high, 
         loan_status) %>% 
  mutate(loanInProgress = as.numeric(loan_status == "Current"))

In [8]:
head(ficoStudy)

issue_d,last_pymnt_d,last_fico_range_low,last_fico_range_high,fico_range_low,fico_range_high,loan_status,loanInProgress
Dec-2014,Jun-2016,680,684,750,754,Fully Paid,0
Dec-2014,Aug-2016,560,564,710,714,Charged Off,0
Dec-2014,May-2016,700,704,685,689,Fully Paid,0
Dec-2014,Dec-2017,625,629,665,669,Current,1
Dec-2014,Aug-2015,555,559,685,689,Charged Off,0
Dec-2014,Apr-2015,720,724,680,684,Fully Paid,0


Mean FICO (upper bound) for finished and current loans:

In [11]:
ficoStudy %>% 
  group_by(loanInProgress) %>% 
  summarise(meanLastFico = mean(last_fico_range_high, na.rm = TRUE), 
            meanFico = mean(fico_range_high, na.rm = TRUE),
            counter = n())

loanInProgress,meanLastFico,meanFico,counter
0,673.8153,699.1254,857851
1,700.4573,700.3393,788950


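Proportion of loans still current, by issue date (the column is named `ratioCurrentFinished`, but `mean(loanInProgress)` is the share of current loans among all loans issued that month):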

In [13]:
ficoStudy %>% 
  group_by(issue_d) %>% 
  summarise(ratioCurrentFinished = mean(loanInProgress))

issue_d,ratioCurrentFinished
,0.000000000
Apr-2008,0.000000000
Apr-2009,0.000000000
Apr-2010,0.000000000
Apr-2011,0.000000000
Apr-2012,0.000000000
Apr-2013,0.042255016
Apr-2014,0.093807351
Apr-2015,0.407118864
Apr-2016,0.620223979


T-Tests
====
___________

We want to find out whether `last_fico_range` differs significantly from `fico_range` (the FICO range when the loan was issued) in the following cases:

01. Current Loans.
02. Finished Loans.
    - 2.1 Fully Paid Loans
    - 2.2 Charged Off Loans
03. Sample of loans issued in September 2016 (we chose this date because it has roughly the same number of current and finished loans):
    - 3.1 Current Loans
    - 3.2 Fully Paid Loans
    - 3.3 Charged Off Loans
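
Every case in this list runs the same pair of Welch t-tests (low bound and high bound) on a different subset. As a rough sketch, the repetition could be wrapped in a small helper; `compareFico` is a hypothetical name introduced here for illustration, and the toy data frame reuses the six rows printed by `head(ficoStudy)` earlier.

```r
# Hypothetical helper: run the Welch t-test for the low and high FICO bounds
# of a data frame that has the four FICO columns used in this notebook.
compareFico <- function(df) {
  sapply(c(low = "low", high = "high"), function(bound) {
    last <- df[[paste0("last_fico_range_", bound)]]
    issued <- df[[paste0("fico_range_", bound)]]
    test <- t.test(last, issued, paired = FALSE, var.equal = FALSE)
    c(meanLast = mean(last, na.rm = TRUE),
      meanIssued = mean(issued, na.rm = TRUE),
      t = unname(test$statistic),
      pValue = test$p.value)
  })
}

# The six rows shown by head(ficoStudy)
toy <- data.frame(
  last_fico_range_low = c(680, 560, 700, 625, 555, 720),
  last_fico_range_high = c(684, 564, 704, 629, 559, 724),
  fico_range_low = c(750, 710, 685, 665, 685, 680),
  fico_range_high = c(754, 714, 689, 669, 689, 684)
)

# One column per bound, one row per statistic
result <- compareFico(toy)
result
```

With six rows the test itself is not informative; the point is only the shape of the call, which could then be applied to each filtered subset.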
 

### 01 - Current Loans:

In [17]:
currentLoans <- ficoStudy %>% 
  filter(loanInProgress == 1)

`fico_range_low`:

In [18]:
t.test(currentLoans$last_fico_range_low, currentLoans$fico_range_low,
       paired = FALSE, var.equal = FALSE)


	Welch Two Sample t-test

data:  currentLoans$last_fico_range_low and currentLoans$fico_range_low
t = -2.2674, df = 1281300, p-value = 0.02337
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.29408267 -0.02138726
sample estimates:
mean of x mean of y 
 696.1815  696.3392 


`fico_range_high`:

In [19]:
t.test(currentLoans$last_fico_range_high, currentLoans$fico_range_high,
       paired = FALSE, var.equal = FALSE)


	Welch Two Sample t-test

data:  currentLoans$last_fico_range_high and currentLoans$fico_range_high
t = 1.753, df = 1316900, p-value = 0.0796
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.01392142  0.24979695
sample estimates:
mean of x mean of y 
 700.4573  700.3393 


### 02 - Finished Loans:

#### 02.1 - Fully Paid Loans

In [20]:
paidLoans <- ficoStudy %>% 
  filter(loan_status == "Fully Paid")

`fico_range_low`:

In [21]:
t.test(paidLoans$last_fico_range_low, paidLoans$fico_range_low,
       paired = FALSE, var.equal = FALSE)


	Welch Two Sample t-test

data:  paidLoans$last_fico_range_low and paidLoans$fico_range_low
t = 12.515, df = 905590, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.023848 1.404093
sample estimates:
mean of x mean of y 
 699.0454  697.8314 


`fico_range_high`:

In [22]:
t.test(paidLoans$last_fico_range_high, paidLoans$fico_range_high,
       paired = FALSE, var.equal = FALSE)


	Welch Two Sample t-test

data:  paidLoans$last_fico_range_high and paidLoans$fico_range_high
t = 33.322, df = 996250, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2.654544 2.986332
sample estimates:
mean of x mean of y 
 704.6520  701.8316 


#### 02.2 - Charged Off Loans

In [23]:
chargedLoans <- ficoStudy %>% 
  filter(loan_status == "Charged Off")

`fico_range_low`:

In [24]:
t.test(chargedLoans$last_fico_range_low, chargedLoans$fico_range_low,
       paired = FALSE, var.equal = FALSE)


	Welch Two Sample t-test

data:  chargedLoans$last_fico_range_low and chargedLoans$fico_range_low
t = -389.65, df = 174320, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -177.4301 -175.6541
sample estimates:
mean of x mean of y 
 510.4852  687.0273 


`fico_range_high`:

In [25]:
t.test(chargedLoans$last_fico_range_high, chargedLoans$fico_range_high,
       paired = FALSE, var.equal = FALSE)


	Welch Two Sample t-test

data:  chargedLoans$last_fico_range_high and chargedLoans$fico_range_high
t = -860.22, df = 239910, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -123.2127 -122.6525
sample estimates:
mean of x mean of y 
 568.0947  691.0273 


### 03 - September-2016 sampling

#### 03.1 - Current Loans (issued in September 2016)

In [14]:
currentAtSept2016 <- ficoStudy %>% 
  filter(issue_d == "Sep-2016", loanInProgress == 1)

`fico_range_low`:

In [15]:
t.test(currentAtSept2016$last_fico_range_low, currentAtSept2016$fico_range_low,
       paired = FALSE, var.equal = FALSE)


	Welch Two Sample t-test

data:  currentAtSept2016$last_fico_range_low and currentAtSept2016$fico_range_low
t = -8.4321, df = 30819, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.650006 -2.895952
sample estimates:
mean of x mean of y 
 691.6136  695.3866 


`fico_range_high`:

In [16]:
t.test(currentAtSept2016$last_fico_range_high, currentAtSept2016$fico_range_high,
       paired = FALSE, var.equal = FALSE)


	Welch Two Sample t-test

data:  currentAtSept2016$last_fico_range_high and currentAtSept2016$fico_range_high
t = -7.9978, df = 31878, p-value = 1.31e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.275681 -2.592478
sample estimates:
mean of x mean of y 
 695.9525  699.3866 


#### 03.2 - Fully Paid Loans (issued in September 2016)

In [26]:
paidAtSept2016 <- ficoStudy %>% 
  filter(issue_d == "Sep-2016", loan_status == "Fully Paid")

`fico_range_low`:

In [27]:
t.test(paidAtSept2016$last_fico_range_low, paidAtSept2016$fico_range_low,
       paired = FALSE, var.equal = FALSE)


	Welch Two Sample t-test

data:  paidAtSept2016$last_fico_range_low and paidAtSept2016$fico_range_low
t = 7.2075, df = 11524, p-value = 6.055e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 3.826743 6.685749
sample estimates:
mean of x mean of y 
 704.7550  699.4987 


`fico_range_high`:

In [28]:
t.test(paidAtSept2016$last_fico_range_high, paidAtSept2016$fico_range_high,
       paired = FALSE, var.equal = FALSE)


	Welch Two Sample t-test

data:  paidAtSept2016$last_fico_range_high and paidAtSept2016$fico_range_high
t = 7.592, df = 11716, p-value = 3.389e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 4.016524 6.812432
sample estimates:
mean of x mean of y 
 708.9135  703.4990 


#### 03.3 - Charged Off Loans (issued in September 2016)

In [29]:
chargedAtSept2016 <- ficoStudy %>% 
  filter(issue_d == "Sep-2016", loan_status == "Charged Off")

`fico_range_low`:

In [30]:
t.test(chargedAtSept2016$last_fico_range_low, chargedAtSept2016$fico_range_low,
       paired = FALSE, var.equal = FALSE)


	Welch Two Sample t-test

data:  chargedAtSept2016$last_fico_range_low and chargedAtSept2016$fico_range_low
t = -41.498, df = 1814.8, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -200.5345 -182.4346
sample estimates:
mean of x mean of y 
 495.8896  687.3741 


`fico_range_high`:

In [31]:
t.test(chargedAtSept2016$last_fico_range_high, chargedAtSept2016$fico_range_high,
       paired = FALSE, var.equal = FALSE)


	Welch Two Sample t-test

data:  chargedAtSept2016$last_fico_range_high and chargedAtSept2016$fico_range_high
t = -95.83, df = 2652.3, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -133.2730 -127.9284
sample estimates:
mean of x mean of y 
 560.7735  691.3741 


Conclusions
======
_______

From these results we can conclude:

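- In __Current__ loans the evidence that `last_fico_range` differs from `fico_range` is weak: over all current loans the p-values are 0.023 (low bound) and 0.080 (high bound), and the mean difference is below 0.2 FICO points. In the September-2016 sample the difference is statistically significant, but it is still only about 3 to 4 points.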
- In __Fully Paid__ loans there is evidence that `last_fico_range` is higher than `fico_range`, but the difference is small: about 1 to 3 points over all fully paid loans and about 5 points in the September-2016 sample.
- In __Charged Off__ loans there is strong evidence that `last_fico_range` is lower than `fico_range`, and the difference is very large: about -177 FICO points for the low bound and about -123 FICO points for the high bound. Therefore, __`last_fico_range` appears to be an effect of the loan being charged off, not a cause of it.__
- __We cannot use `last_fico_range` as a feature in the modeling phase to predict `loan_status`.__
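
With hundreds of thousands of loans per group, even a shift of one FICO point can produce a very small p-value, so it helps to look at a standardized effect size next to the test. The sketch below computes Cohen's d with a pooled standard deviation on simulated scores (not LendingClub data; the means and spreads are made up for illustration); `cohensD` is a hypothetical helper, not a function from a package.

```r
# Hypothetical helper: Cohen's d using the pooled standard deviation
cohensD <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  pooledSd <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                     (length(x) + length(y) - 2))
  (mean(x) - mean(y)) / pooledSd
}

set.seed(1)
issued <- rnorm(100000, mean = 700, sd = 30)
smallShift <- rnorm(100000, mean = 699, sd = 30)  # about one point lower
largeShift <- rnorm(100000, mean = 570, sd = 60)  # a drop of over 100 points

# Statistically significant, yet the standardized effect is negligible
pSmall <- t.test(smallShift, issued)$p.value
dSmall <- cohensD(smallShift, issued)

# A drop on the scale seen in charged-off loans is a very large effect
dLarge <- cohensD(largeShift, issued)

c(pSmall = pSmall, dSmall = dSmall, dLarge = dLarge)
```

Read this way, the differences for current and fully paid loans are small in practical terms, while the drop for charged-off loans is large on any scale.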