<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Inferential Statistics Lab

_Author: Matt Brems (DC)_

Head to the `data` folder. There are two files:
- `crx_names.txt`: This summarizes the source of the data and provides information that is important to understanding what the data mean. **Be sure to read this first!**
- `crx_data.csv`: This will be the data itself. **Note that there are no column headers.**

A source of the data is [here](https://archive.ics.uci.edu/ml/datasets/Japanese+Credit+Screening) if you would like to learn more.

**Exercise 1**: Load the data in using any method that you choose.

In [51]:
import pandas as pd
import numpy as np
import scipy.stats as stats

In [70]:
df = pd.read_csv("data/crx_data.csv", header=None)
df.head()


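|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |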


**Exercise 2**: Note that there are no meaningful column names. Why is this the case? Do you agree with this or disagree with this?

**Answer**: According to the text file, the column names were changed and removed to protect the confidentiality of the credit data. This makes sense, but I believe there are better ways to protect confidentiality without making the data inherently hard to decipher for anyone who sees it.

**Exercise 3:** You want to give names to each column. Read the following line of code:

```python
['X' + str(i) for i in range(1,17)]
```

Before running this line of code, what will this create? Go ahead and use this for the column names.

**Answer**: It creates a list of 16 strings, `'X1'` through `'X16'`: the letter `X` followed by each integer from 1 to 16.
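
As a quick check, the comprehension can be run on its own. This is a minimal sketch; the variable name `names` is only for illustration.

```python
# Build the column names: 'X' followed by each integer from 1 to 16.
# range(1, 17) stops before 17, so the last name is 'X16'.
names = ['X' + str(i) for i in range(1, 17)]

print(len(names))   # number of names generated
print(names[:3])    # first three names
print(names[-1])    # last name
```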

In [71]:
# Build the names X1..X16 and assign them as the column labels
new_header = ['X' + str(i) for i in range(1, 17)]
df.columns = new_header
df.head()

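|  | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 | X16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |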


**Exercise 4**: Count the number of missing values in each column. (There are multiple ways to do this.)

In [54]:
# Check dtypes and non-null counts, then count NaNs per column
df.info()
df.isna().sum()
# No NaNs are reported

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
X1     690 non-null object
X2     690 non-null object
X3     690 non-null float64
X4     690 non-null object
X5     690 non-null object
X6     690 non-null object
X7     690 non-null object
X8     690 non-null float64
X9     690 non-null object
X10    690 non-null object
X11    690 non-null int64
X12    690 non-null object
X13    690 non-null object
X14    690 non-null object
X15    690 non-null int64
X16    690 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 86.3+ KB


X1     0
X2     0
X3     0
X4     0
X5     0
X6     0
X7     0
X8     0
X9     0
X10    0
X11    0
X12    0
X13    0
X14    0
X15    0
X16    0
dtype: int64

In [55]:
# No NaNs were found, but missing entries appear as '?'. Count the '?' cells in each column.
(df == '?').sum()


X1     12
X2     12
X3      0
X4      6
X5      6
X6      9
X7      9
X8      0
X9      0
X10     0
X11     0
X12     0
X13     0
X14    13
X15     0
X16     0
dtype: int64
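
One of the other ways to count these placeholders is to tell pandas about them at load time. The sketch below assumes `?` is the only missing-value marker and uses a small made-up CSV string (not the real file) so that it runs on its own; `raw`, `demo`, and `missing` are illustrative names.

```python
import io
import pandas as pd

# Tiny stand-in for the real file (hypothetical rows, no header).
raw = "b,30.83,u\na,?,u\n?,24.5,?\n"

# na_values='?' converts every '?' field to NaN while parsing.
demo = pd.read_csv(io.StringIO(raw), header=None, na_values='?')

# With placeholders parsed as NaN, isna().sum() counts them per column.
missing = demo.isna().sum()
print(missing)
```

A side effect of this approach is that columns containing only numbers and `?` should load as numeric, so no separate conversion step is needed for them.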

**Exercise 5**: Our goal is to learn about the population of interest. In this case, our population is all credit applications submitted to this company.

How would you describe the sample here?

**Answer**: The sample is the 690 credit applications in this dataset, each of which was either approved or rejected by the company.

**Exercise 6**: Our goal is to learn about the population of interest. In this case, our population is all credit applications submitted to this company. We specifically want to estimate the true proportion of approved applications for this company.

What is the parameter here? What is the statistic here? 
> Be sure to identify which is which!

**Answer**: The statistic is the proportion of approved applications in our sample of 690. The parameter is the true proportion of approved applications among all credit applications submitted to the company.

**Exercise 7**: Recall that the formula for a confidence interval is:

$$
\text{[sample statistic] } \pm \text{[multiplier] } \times \text{[standard deviation of sampling distribution]}
$$

Calculate the:
- sample percentage of `+` applications
- sample standard deviation of `+` applications
- size of our sample.

Use these to generate a 95% confidence interval for the true proportion of approved applications for this company. Note that column `X16` identifies which applications in our sample were approved (`+`) and denied (`-`).

> Some data "munging" (cleaning/transforming) may be required!

In [56]:
# X2 contains '?' placeholders, so coerce unparseable entries to NaN
df['X2'] = pd.to_numeric(df['X2'], errors='coerce')

In [57]:
# Convert X14 from object to numeric; '?' placeholders become NaN
df['X14'] = pd.to_numeric(df['X14'], errors='coerce')


In [58]:
# Count the '+' and '-' outcomes and convert the counts to proportions
plusminusper = (df['X16'].value_counts()) / len(df['X16'])
plusminusper

-    0.555072
+    0.444928
Name: X16, dtype: float64

In [59]:
x16mean = np.mean((df['X16'] == '+'))
x16mean

0.4449275362318841

In [60]:
# Standard deviation of the 0/1 approval indicator (np.std defaults to ddof=0)
stdx16 = np.std((df['X16'] == '+') * 1)
stdx16

0.4969577685623922

In [61]:
n = len(df['X16'])

In [62]:
# Upper bound of the 95% confidence interval (the lower bound follows in the next cell)
round(x16mean + 1.96 * stdx16 / (n ** 0.5), 4)


0.482

In [63]:
round(x16mean - 1.96 * stdx16 / (n ** 0.5), 4)

0.4078

**Answer**: We are 95% confident that the true proportion of approved credit applications is between 40.78% and 48.20%.
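
For a 0/1 indicator, the standard deviation computed by `np.std` equals $\sqrt{\hat{p}(1-\hat{p})}$, so the same interval can be reproduced from counts alone. A minimal sketch, assuming 307 approvals out of 690 applications (the counts implied by the sample proportion 0.4449 computed above):

```python
import math

n = 690            # sample size
approved = 307     # assumed count of '+' rows (0.4449... * 690)

p_hat = approved / n
# Standard error of a sample proportion
se = math.sqrt(p_hat * (1 - p_hat) / n)

z = 1.96           # multiplier for 95% confidence
lower = p_hat - z * se
upper = p_hat + z * se
print(round(lower, 4), round(upper, 4))
```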

**Exercise 8**: Interpret the above interval.

**Answer**: Based on our sample, we are 95% confident that the true population proportion of approved applications lies between 40.78% and 48.20%. The interval is built from one sample, so a different sample would give a slightly different interval.
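
One way to read a 95% confidence level is through repeated sampling: if many samples were drawn and an interval built from each, roughly 95% of those intervals should contain the true value. The sketch below is illustrative only. It invents a population with a true approval proportion of 0.45 (an assumption, not a value from the data), draws many samples of size 690, and counts how often the interval contains the true value.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.45      # hypothetical population proportion (assumption)
n = 690            # same sample size as the lab data
n_sims = 2000      # number of simulated samples

# Number of approvals in each simulated sample
counts = rng.binomial(n, true_p, size=n_sims)
p_hat = counts / n
se = np.sqrt(p_hat * (1 - p_hat) / n)

# Does each 95% interval contain the true proportion?
covered = (p_hat - 1.96 * se <= true_p) & (true_p <= p_hat + 1.96 * se)
coverage = covered.mean()
print(coverage)
```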

**Exercise 9**: Define a function named `conf_int()` that accepts two arguments: 
- `data`, which should be an array or Series of data
- `conf_level`, which should be either `90`, `95`, or `99`.

Your function should return the 90% confidence interval, 95% confidence interval, or 99% confidence interval, depending on what value the user selected. **Set the default to be 95.**

> For a 90% confidence interval, the multiplier is 1.645.

> For a 95% confidence interval, the multiplier is 1.96.

> For a 99% confidence interval, the multiplier is 2.576.

In [64]:
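def conf_int(data, conf_level=95):
    # Sample mean, standard deviation (np.std defaults to ddof=0), and size
    mean = np.mean(data)
    std = np.std(data)
    n = len(data)

    # z multiplier for the requested confidence level (defaults to 95%)
    if conf_level == 90:
        con = 1.645
    elif conf_level == 99:
        con = 2.576
    else:
        con = 1.96

    margin = con * std / (n ** 0.5)
    return round(mean - margin, 4), round(mean + margin, 4)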
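
The three multipliers are quantiles of the standard normal distribution, so they can be derived rather than hard-coded. A minimal sketch using `scipy.stats.norm.ppf`; the helper name `z_multiplier` is hypothetical and not part of the lab:

```python
from scipy import stats

def z_multiplier(conf_level):
    """Two-sided z multiplier for a confidence level given in percent."""
    alpha = 1 - conf_level / 100
    # Leave alpha / 2 in each tail of the standard normal distribution
    return stats.norm.ppf(1 - alpha / 2)

for level in (90, 95, 99):
    print(level, round(z_multiplier(level), 3))
```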

**Exercise 10**: Test your function to find the 99% confidence interval for the mean of `X3`. Your answer should be **(4.2709, 5.2466)**. Also interpret the interval.

In [65]:
conf_int(df['X3'], 99)

(4.2709, 5.2466)

**Answer**: We are 99% confident that the true mean of `X3` lies between 4.2709 and 5.2466.

**Exercise 11**: We want to test whether or not the mean of $X_3$ was equal to 5.

State the null and alternative hypotheses.

**Answer:** The null hypothesis is that the true mean of $X_3$ is equal to 5 ($H_0: \mu = 5$). The alternative hypothesis is that the true mean of $X_3$ is not equal to 5 ($H_A: \mu \neq 5$).

**Exercise 12**: Use a one-sample $t$-test to test the above hypotheses at the $\alpha = 0.05$ significance level. Report and interpret your $p$-value.
> Hint: You might find [this link](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html) to be helpful! Check out the `a` and `popmean` arguments.

In [67]:

pval = stats.ttest_1samp(df['X3'], 5.0)
pval.statistic, pval.pvalue

(-1.2731172058046707, 0.20340583433417578)

**Answer**: The $t$-statistic measures how far the sample mean is from the hypothesized mean, in units of standard error. The $p$-value is the probability of observing a difference at least this extreme if the null hypothesis were true. We have a $t$-statistic of -1.273 and a $p$-value of 0.2034. Since the $p$-value is greater than the significance level of 0.05, we fail to reject the null hypothesis: the data do not provide sufficient evidence that the mean of $X_3$ differs from 5.
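
To see where these numbers come from, the statistic can be computed by hand as $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$, with a two-sided $p$-value from the $t$ distribution with $n - 1$ degrees of freedom. The sketch below uses made-up data (not `X3`) and compares the result with `stats.ttest_1samp`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=4.8, scale=5.0, size=200)  # synthetic data (assumption)
mu0 = 5.0                                          # hypothesized mean

n = len(sample)
s = np.std(sample, ddof=1)                           # sample standard deviation
t_manual = (np.mean(sample) - mu0) / (s / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)   # two-sided p-value

result = stats.ttest_1samp(sample, mu0)
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

Note that the test uses the `ddof=1` standard deviation, whereas `np.std` defaults to `ddof=0`.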

**Exercise 13**: We want to test whether or not the true proportion of $X_{16}$ was equal to 0.5.

State the null and alternative hypotheses.

**Answer**: The null hypothesis is that the true proportion of approved applications in $X_{16}$ is equal to 0.5 ($H_0: p = 0.5$). The alternative hypothesis is that the true proportion is not equal to 0.5 ($H_A: p \neq 0.5$).

**Exercise 14**: Use a one-sample $t$-test to test the above hypotheses at the $\alpha = 0.05$ significance level. Report and interpret your $p$-value.
> Hint: You might find [this link](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html) to be helpful! Check out the `a` and `popmean` arguments.

In [68]:
# Encode approvals as 1 and denials as 0 without modifying df in place
DF16 = (df['X16'] == '+').astype(int)

DF16.mean()

0.4449275362318841

In [69]:
pval = stats.ttest_1samp(DF16, .5)
pval.statistic, pval.pvalue

(-2.9088721445108887, 0.0037441556293536346)

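**Answer**: The $p$-value is about 0.0037, which is lower than the significance level of 0.05, so we reject the null hypothesis. If the true proportion were 0.5, a sample proportion at least this far from 0.5 would occur only about 0.4% of the time, so the data provide evidence that the true approval proportion differs from 0.5.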
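
Because `X16` is a 0/1 outcome, a one-sample $z$-test for a proportion is a common alternative to the $t$-test the exercise asks for. The sketch below is not part of the original lab. It assumes 307 approvals out of 690 (the counts implied by the sample mean 0.4449) and uses the null value $p_0 = 0.5$ in the standard error:

```python
import math
from scipy import stats

n = 690
approved = 307         # assumed count of '+' rows
p0 = 0.5               # proportion under the null hypothesis

p_hat = approved / n
se0 = math.sqrt(p0 * (1 - p0) / n)      # standard error under H0
z = (p_hat - p0) / se0
p_value = 2 * stats.norm.sf(abs(z))     # two-sided p-value
print(round(z, 3), round(p_value, 4))
```

With a sample this large, the $z$-test and the $t$-test should lead to the same decision.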