<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Inferential Statistics Lab

_Author: Matt Brems (DC)_

Head to the `data` folder. There are two files:
- `crx_names.txt`: This summarizes the source of the data and provides information that is important to understanding what the data mean. **Be sure to read this first!**
- `crx_data.csv`: This will be the data itself. **Note that there are no column headers.**

A source of the data is [here](https://archive.ics.uci.edu/ml/datasets/Japanese+Credit+Screening) if you would like to learn more.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

**Exercise 1**: Load the data in using any method that you choose.

In [2]:
df = pd.read_csv("./data/crx_data.csv", skipinitialspace=True)

In [3]:
df.head()

Unnamed: 0,b,30.83,0,u,g,w,v,1.25,t,t.1,01,f,g.1,00202,0.1,+
0,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
1,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
2,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
3,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+
4,b,32.08,4.0,u,g,m,v,2.5,t,f,0,t,g,360,0,+
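
The header row in the output above (`b, 30.83, 0, ...`) looks like a data record rather than a set of column names, which suggests `read_csv` treated the first line of the file as the header. A minimal sketch of one way to avoid that, assuming the file really has no header line, is to pass `header=None`. The two inline rows are reconstructed from the output above and stand in for the real file; `demo` is a hypothetical name used only for illustration.

```python
import io
import pandas as pd

# Two sample rows reconstructed from the output above. They stand in for
# ./data/crx_data.csv, which is assumed to have no header line.
raw = io.StringIO(
    "b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+\n"
    "a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+\n"
)

# header=None tells pandas that every line is data, so no row is lost.
# The columns are then labeled 0, 1, 2, ... until names are assigned.
demo = pd.read_csv(raw, header=None, skipinitialspace=True)

print(demo.shape)       # both rows are kept as data
print(demo.iloc[0, 0])  # first value of the first row
```

With this option the row count should match the number of lines in the file, which is worth checking against `crx_names.txt`.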


**Exercise 2**: Note that there are no meaningful column names. Why is this the case? Do you agree with this or disagree with this?

**Answer**:  There seem to be two possible reasons for the missing column names: 1. the data is raw and was never formatted or organized for presentation; 2. the column names were removed to protect the identity of the credit applicants.

If the reason is the first, this is not surprising: raw data that must be formatted and organized before analysis is something data scientists deal with regularly.  I neither agree nor disagree, but I would add column names to my working copy of the dataset.

If the reason is the second, I agree with keeping the credit applicants' identities anonymous.  Releasing the applicants' identities could presumably be a legal liability, and even if it were not, it would be a breach of ethics.

**Exercise 3:** You want to give names to each column. Read the following line of code:

```python
['X' + str(i) for i in range(1,17)]
```

Before running this line of code, what will this create? Go ahead and use this for the column names.

**Answer**:

This will create a list of 16 strings, `'X1'` through `'X16'`.

In [4]:
['X' + str(i) for i in range(1,17)]

['X1',
 'X2',
 'X3',
 'X4',
 'X5',
 'X6',
 'X7',
 'X8',
 'X9',
 'X10',
 'X11',
 'X12',
 'X13',
 'X14',
 'X15',
 'X16']

In [5]:
df.columns = ['X' + str(i) for i in range(1,17)]

**Exercise 4**: Count the number of missing values in each column. (There are multiple ways to do this.)

In [6]:
df.count()

X1     689
X2     689
X3     689
X4     689
X5     689
X6     689
X7     689
X8     689
X9     689
X10    689
X11    689
X12    689
X13    689
X14    689
X15    689
X16    689
dtype: int64

#### At first, I could not figure out why every column registered as containing all of its values. After opening the .csv file in Excel, I saw that the dataset marks its missing values with '?' instead of leaving the cells empty.

In [7]:
df = df.replace('?',np.nan)

In [8]:
len(df) - df.count()

X1     12
X2     12
X3      0
X4      6
X5      6
X6      9
X7      9
X8      0
X9      0
X10     0
X11     0
X12     0
X13     0
X14    13
X15     0
X16     0
dtype: int64
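
Two other common ways to get the same counts are sketched below on a tiny made-up table (the values are illustrative, not taken from the dataset). `na_values='?'` asks `read_csv` to convert the placeholder at load time, and `isna().sum()` counts missing entries per column directly. `demo` and `missing` are hypothetical names.

```python
import io
import pandas as pd

# Tiny illustrative CSV (made-up values) that uses '?' as the missing marker.
raw = io.StringIO("a,30.83,u\n?,24.5,?\nb,?,?\n")

# Convert '?' to NaN while reading, so no separate replace step is needed.
demo = pd.read_csv(raw, header=None, na_values='?')

# Count missing values in each column.
missing = demo.isna().sum()
print(missing.tolist())
```

Either approach should agree with `len(df) - df.count()` once the `'?'` markers have been converted to `NaN`.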

**Exercise 5**: Our goal is to learn about the population of interest. In this case, our population is all credit applications submitted to this company.

How would you describe the sample here?

**Answer**:  The sample is the set of 689 credit applications loaded from `crx_data.csv`.  It is not ideal to work with: without column names and with values changed to protect identities, the data is ambiguous.  The .txt file provides some context, but not enough to clarify exactly what the values in each column mean.  I assume it would be difficult to draw meaningful inferences from the sample provided.

**Exercise 6**: Our goal is to learn about the population of interest. In this case, our population is all credit applications submitted to this company. We specifically want to estimate the true proportion of approved applications for this company.

What is the parameter here? What is the statistic here? 
> Be sure to identify which is which!

**Answer**:  The parameter is the true proportion of approved credit applications for this company.  The statistic is the proportion of approved credit applications in the sample set provided above (crx_data.csv).

**Exercise 7**: Recall that the formula for a confidence interval is:

$$
\text{[sample statistic] } \pm \text{[multiplier] } \times \text{[standard deviation of sampling distribution]}
$$

Calculate the:
- sample percentage of `+` applications
- sample standard deviation of `+` applications
- size of our sample.

Use these to generate a 95% confidence interval for the true proportion of approved applications for this company. Note that column `X16` identifies which applications in our sample were approved (`+`) and denied (`-`).

> Some data "munging" (cleaning/transforming) may be required!

In [9]:
df.X16.value_counts(normalize=True)['+']

0.444121915820029

In [10]:
df.std()

X3        4.978470
X8        3.348739
X11       4.866180
X15    5213.743149
dtype: float64

In [11]:
df.X16 = df.X16.replace('+',1)

In [12]:
df.X16 = df.X16.replace('-',0)

In [13]:
df.std()

X3        4.978470
X8        3.348739
X11       4.866180
X15    5213.743149
X16       0.497229
dtype: float64

In [14]:
df.shape

(689, 16)

In [15]:
se = 0.4972 / 689**0.5  # standard deviation of the sampling distribution: s / sqrt(n)
interval = (0.4441 - 1.96*se, 0.4441 + 1.96*se)
print("({:.3f}, {:.3f})".format(*interval))

(0.407, 0.481)


**Answer**:  95% Confidence Interval:  (0.407, 0.481), or about 40.7% to 48.1%.

**Exercise 8**: Interpret the above interval.
> While you _could_ copy and paste text from the notes and fill in the blanks, you should practice interpreting the interval. Remember, this will come up in interviews!

**Answer**:  I am 95% confident that the true approval rate for credit applications at this company is between 40.7 and 48.1 percent.
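
For a proportion, the standard deviation of the sampling distribution is often written as $\sqrt{\hat{p}(1-\hat{p})/n}$. A minimal sketch under stated assumptions: 306 approvals out of 689 rows, which are the counts implied by the sample proportion 0.4441 and the shape reported above. `prop_conf_int` is a hypothetical helper name, not part of the lab.

```python
import math

def prop_conf_int(successes, n, multiplier=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    # Standard error of a sample proportion.
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - multiplier * se, p_hat + multiplier * se

# Assumed counts: 306 approved applications out of 689 rows.
lower, upper = prop_conf_int(306, 689)
print(f"({lower:.3f}, {upper:.3f})")
```

The result is on the proportion scale (0 to 1). It can differ very slightly from an interval built on the `n - 1` sample standard deviation that `df.std()` reports.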

**Exercise 9**: Define a function named `conf_int()` that accepts two arguments: 
- `data`, which should be an array or Series of data
- `conf_level`, which should be either `90`, `95`, or `99`.

Your function should return the 90% confidence interval, 95% confidence interval, or 99% confidence interval, depending on what value the user selected. **Set the default to be 95.**

> For a 90% confidence interval, the multiplier is 1.645.

> For a 95% confidence interval, the multiplier is 1.96.

> For a 99% confidence interval, the multiplier is 2.576.
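
These multipliers are quantiles of the standard normal distribution. A short sketch using the standard-library `statistics.NormalDist` shows where they come from; `z_multiplier` is a hypothetical helper name introduced here for illustration.

```python
from statistics import NormalDist

def z_multiplier(conf_level):
    """Two-sided standard normal multiplier for a confidence level in percent."""
    alpha = 1 - conf_level / 100
    # Leave alpha/2 in each tail, so look up the (1 - alpha/2) quantile.
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (90, 95, 99):
    print(level, round(z_multiplier(level), 3))
```

Hard-coding the three values, as the exercise suggests, is perfectly fine. The lookup only matters if other confidence levels are needed.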

In [16]:
def conf_int(data, conf_level=95):
    sam_stat = np.mean(data)
    std_dev = data.std()
    sam_count = data.shape[0]  # sample size n

    if conf_level == 90:
        multiplier = 1.645
    elif conf_level == 95:
        multiplier = 1.96
    elif conf_level == 99:
        multiplier = 2.576
    else:
        raise ValueError("conf_level must be 90, 95, or 99")

    # standard deviation of the sampling distribution: s / sqrt(n)
    std_err = std_dev / (sam_count**0.5)
    return (sam_stat - multiplier*std_err, sam_stat + multiplier*std_err)

**Exercise 10**: Test your function to find the 99% confidence interval for the mean of `X3`. Your answer should be **(4.2709, 5.2466)**. Also interpret the interval.

**Answer**:  The interval is (4.2771, 5.2542).  I am 99% confident that the true mean of `X3` is between 4.2771 and 5.2542.  This differs slightly from the expected answer, most likely because `read_csv` in In [2] treated the first row of the file as a header, leaving 689 rows in the sample.

In [17]:
lower, upper = conf_int(data = df.X3, conf_level=99)
print("({:.4f}, {:.4f})".format(lower, upper))

(4.2771, 5.2542)

**Exercise 11**: We want to test whether or not the mean of $X_3$ was equal to 5.

State the null and alternative hypotheses.

**Answer:**

The null hypothesis is that the mean of X3 is equal to 5.  The alternative hypothesis is that the mean of X3 is not equal to 5.

**Exercise 12**: Use a one-sample $t$-test to test the above hypotheses at the $\alpha = 0.05$ significance level. Report and interpret your $p$-value.
> Hint: You might find [this link](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html) to be helpful! Check out the `a` and `popmean` arguments.

In [18]:
stats.ttest_1samp(a=df.X3, popmean=5)

Ttest_1sampResult(statistic=-1.2357004468189647, pvalue=0.2169916850026909)

**Answer**:  The $p$-value is 0.217, which is greater than $\alpha = 0.05$.  Therefore, we fail to reject the null hypothesis: the sample does not provide sufficient evidence that the true mean of X3 differs from 5.  Note that this is not the same as showing that the true mean equals 5.
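
The statistic reported by `ttest_1samp` is $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$. A rough sketch that recomputes it from summary numbers follows. The mean (about 4.7656) is assumed to be the midpoint of the interval printed in Exercise 10, and the standard deviation and row count come from the `df.std()` and `df.shape` outputs above. The inputs are rounded, so small differences from the `scipy` output are expected.

```python
import math
from scipy import stats

# Summary numbers taken from earlier outputs (rounded, so results are approximate).
x_bar = 4.7656   # sample mean of X3 (midpoint of the Exercise 10 interval)
s = 4.978470     # sample standard deviation of X3 from df.std()
n = 689          # number of rows from df.shape
mu_0 = 5         # hypothesized mean under the null

# t statistic: distance from the hypothesized mean in standard-error units.
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(round(t_stat, 3), round(p_value, 3))
```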

**Exercise 13**: We want to test whether or not the true proportion of $X_{16}$ was equal to 0.5.

State the null and alternative hypotheses.

**Answer**: The null hypothesis is that the true proportion of approved applications (the mean of X16 once it is coded as 0/1) is equal to 0.5.  The alternative hypothesis is that the true proportion is not equal to 0.5.

**Exercise 14**: Use a one-sample $t$-test to test the above hypotheses at the $\alpha = 0.05$ significance level. Report and interpret your $p$-value.
> Hint: You might find [this link](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html) to be helpful! Check out the `a` and `popmean` arguments.

In [19]:
stats.ttest_1samp(a=df.X16, popmean=0.5)

Ttest_1sampResult(statistic=-2.9498154665812626, pvalue=0.0032875845025836952)

**Answer**:  The $p$-value is 0.0033, which is less than $\alpha = 0.05$.  Therefore, we reject the null hypothesis and conclude that the true proportion of approved applications is not equal to 0.5.
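
Because `X16` is a 0/1 variable, a one-proportion z-test is a common alternative to the t-test used above. This is a different technique from the one the exercise asks for, sketched here only for comparison. It assumes 306 approvals out of 689 rows, the counts implied by the sample proportion and shape reported earlier.

```python
import math
from statistics import NormalDist

successes, n = 306, 689  # assumed counts implied by earlier outputs
p_0 = 0.5                # hypothesized proportion under the null

p_hat = successes / n
# Under the null hypothesis, the standard error uses p_0 rather than p_hat.
se_null = math.sqrt(p_0 * (1 - p_0) / n)
z = (p_hat - p_0) / se_null

# Two-sided p-value from the standard normal distribution.
p_value = 2 * NormalDist().cdf(-abs(z))

print(round(z, 2), round(p_value, 4))
```

With a sample this large, the z-test and the t-test should lead to the same decision at $\alpha = 0.05$.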