<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Inferential Statistics Lab

_Author: Matt Brems (DC)_

We've saved the data for you in a file named "housing.data". Load it in using any method you choose, or run the following cells to import it from `sklearn`.

### Data Dictionary: Boston Housing Data

Sources: Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the 
                 demand for clean air', J. Environ. Economics & Management,
                 vol.5, 81-102, 1978.


- Number of Observations: 506
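- Number of Variables: 13 predictors (the 14th variable listed below, MEDV, is the target and is not part of the DataFrame loaded in this lab)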

#### Variable Information
```
    1. CRIM      per capita crime rate by town
    2. ZN        proportion of residential land zoned for lots over 
                 25,000 sq.ft.
    3. INDUS     proportion of non-retail business acres per town
    4. CHAS      Charles River dummy variable (= 1 if tract bounds 
                 river; 0 otherwise)
    5. NOX       nitric oxides concentration (parts per 10 million)
    6. RM        average number of rooms per dwelling
    7. AGE       proportion of owner-occupied units built prior to 1940
    8. DIS       weighted distances to five Boston employment centres
    9. RAD       index of accessibility to radial highways
    10. TAX      full-value property-tax rate per $10,000
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of African-American residents by town
    13. LSTAT    % lower status of the population
    14. MEDV     Median value of owner-occupied homes in $1000's
```

In [1]:
from sklearn import datasets
import pandas as pd
import scipy.stats as stats

In [3]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
df = datasets.load_boston()
data = pd.DataFrame(df.data, columns=df.feature_names)

In [4]:
data.head()


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


In [5]:
data.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.593761,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063
std,8.596783,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36
75%,3.647423,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97


Exercise 1: Conduct a brief integrity check of your data. This integrity check should include, but is not limited to, checking for missing values and making sure all values make logical sense (e.g., is one variable a percentage, but with observations above 100%?).
Summarize your findings in a few sentences, including what you checked and, if appropriate, any steps you took to rectify potential integrity issues.

In [19]:
data.isnull().any()


CRIM       False
ZN         False
INDUS      False
CHAS       False
NOX        False
RM         False
AGE        False
DIS        False
RAD        False
TAX        False
PTRATIO    False
B          False
LSTAT      False
dtype: bool

In [7]:
data.dtypes   

CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD        float64
TAX        float64
PTRATIO    float64
B          float64
LSTAT      float64
dtype: object
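
The two cells above cover missing values and dtypes. The "logical sense" part of the exercise can be sketched as range checks. The snippet below is a minimal illustration on a small hypothetical frame (`toy`, with two planted problems), not on the real data; the choice of which columns count as percentages is an assumption based on the data dictionary.

```python
import pandas as pd

# Hypothetical rows shaped like the housing data, with two planted problems:
# an AGE above 100% and a CHAS value that is neither 0 nor 1.
toy = pd.DataFrame({
    "AGE": [65.2, 78.9, 120.0],
    "ZN": [18.0, 0.0, 12.5],
    "CHAS": [0.0, 1.0, 2.0],
})

# Percentage-style columns should stay within [0, 100].
pct_cols = ["AGE", "ZN"]
out_of_range = ~toy[pct_cols].apply(lambda col: col.between(0, 100))

# A dummy variable should only take the values 0 and 1.
bad_dummy = ~toy["CHAS"].isin([0, 1])

print(out_of_range.sum())  # offending rows per percentage column
print(bad_dummy.sum())     # rows with an invalid CHAS value
```

The same two checks can be pointed at `data` by swapping `toy` for the real DataFrame.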

Exercise 2: For what two attributes does it make the least sense to calculate mean and median? Why?

In [8]:
data.describe(include='all')

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.593761,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063
std,8.596783,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36
75%,3.647423,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97


Exercise 3: Find the mean, standard deviation, and the standard error of the mean for variable 'AGE.'

In [14]:
data.AGE.describe()

count    506.000000
mean      68.574901
std       28.148861
min        2.900000
25%       45.025000
50%       77.500000
75%       94.075000
max      100.000000
Name: AGE, dtype: float64

In [16]:
data.AGE.sem()

1.251369525258305

Mean = 68.575
<br>

Standard deviation = 28.149
<br>
Standard Error = 1.251
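
The standard error reported by `data.AGE.sem()` is the sample standard deviation divided by the square root of the sample size. As a quick sanity check, here is the same calculation done by hand with the rounded values from the `describe()` output:

```python
import math

# Values copied from the data.AGE.describe() output.
s = 28.148861  # sample standard deviation
n = 506        # number of observations

# Standard error of the mean: s / sqrt(n)
sem = s / math.sqrt(n)
print(round(sem, 3))
```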

Exercise 4: Generate a 95% confidence interval for 'AGE' manually.

Remember that the formula for a confidence interval is:

$$\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$$

In [20]:
sample_mean = 68.574901
z_star = 1.96
sigma = 28.148861
n = 506

low_end = sample_mean - z_star * sigma / n ** 0.5
high_end = sample_mean + z_star * sigma / n ** 0.5

In [21]:
(low_end,high_end)

(66.12221676594831, 71.02758523405168)
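
The critical value 1.96 is the standard normal quantile that leaves 2.5% in each tail. A minimal sketch of where it comes from, using only the standard library (`statistics.NormalDist`, Python 3.8+) and the rounded summary statistics from above:

```python
from statistics import NormalDist

confidence = 0.95
# Leave (1 - confidence) / 2 in each tail of the standard normal distribution.
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

sample_mean, sigma, n = 68.574901, 28.148861, 506
margin = z_star * sigma / n ** 0.5
interval = (sample_mean - margin, sample_mean + margin)
print(round(z_star, 2), tuple(round(x, 2) for x in interval))
```

Changing `confidence` to 0.90 or 0.99 gives the other commonly tabulated critical values without hard-coding them.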

Exercise 5: Create a function to take in the data and level of significance, then return the confidence interval with a helpful message interpreting it! Then, for variable 'NOX', generate a 95% confidence interval and its interpretation.

In [49]:
def confidence_interval(data_column, sig_level):
    # Two-sided critical value, e.g. about 1.96 for sig_level = 0.05
    z_star = stats.norm.ppf(1 - sig_level / 2)
    n = len(data_column)
    margin = z_star * data_column.std() / n ** 0.5
    low_end = data_column.mean() - margin
    high_end = data_column.mean() + margin
    confidence = (1 - sig_level) * 100
    print('The lower end is {:.2f} and the upper end is {:.2f} for this distribution.'.format(low_end, high_end))
    print('We are {:.0f}% confident that the true mean lies between {:.2f} and {:.2f}.'.format(confidence, low_end, high_end))
    return low_end, high_end


In [48]:
nox_ci = confidence_interval(data.NOX, 0.05)

The lower end is 0.54 and the upper end is 0.56 for this distribution.
We are 95% confident that the true mean lies between 0.54 and 0.56.
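
A note on interpretation: the 95% describes the procedure rather than any single interval. If we drew many samples and built an interval from each, roughly 95% of those intervals would contain the true mean. The simulation below is a rough sketch of that idea. The population parameters are hypothetical (chosen to be loosely NOX-like), and the exact share printed depends on the random seed.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)
TRUE_MEAN, TRUE_SD = 0.55, 0.12  # hypothetical population, not the real data
N, TRIALS = 100, 2000
z_star = NormalDist().inv_cdf(0.975)

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    margin = z_star * stdev(sample) / N ** 0.5
    # Does this sample's interval contain the true mean?
    if abs(mean(sample) - TRUE_MEAN) <= margin:
        hits += 1

coverage = hits / TRIALS
print("Share of intervals containing the true mean: {:.3f}".format(coverage))
```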


Exercise 6: For the variable 'NOX', find the median.

In [50]:
data.NOX.median()

0.538

Exercise 7: For the variable 'NOX', test the hypothesis that the mean is equal to 0.538. We'll complete all five steps.

Exercise 7, Step 1: Set up your hypotheses.

H0: The population mean of NOX is 0.538 ($\mu = 0.538$).
<br>
H1: The population mean of NOX is not 0.538 ($\mu \neq 0.538$).

Exercise 7, Step 2: Our level of significance is 0.05. There's no work to do here. :)

In [53]:
mean_hyp=0.538 #the expected mean
alpha =0.05

Exercise 7, Step 3: Calculate your point estimate. In this case, it's your sample mean.

In [59]:
p_e = data.NOX.mean() #Point Estimate is the sample mean


In [57]:
len(data.NOX)

506

Exercise 7, Step 4: Calculate your test statistic. In this case, it's:

$$ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$

Note that $\mu_0$ is the mean we assume in our null hypothesis!

In [65]:
# z = (sample mean - hypothesized mean) / (standard error of the mean)
z = (p_e - mean_hyp) / (data.NOX.std() / len(data.NOX) ** 0.5)
print("The z statistic is {:.2f}".format(z))


The z statistic is 3.24
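
The test statistic and the p-value are different quantities: $z$ measures how many standard errors the sample mean lies from $\mu_0$, and the p-value is the probability of seeing a $z$ at least that extreme if the null hypothesis were true. A minimal sketch of both, using the rounded NOX summary statistics from the `describe()` output and the standard library's `NormalDist`:

```python
from statistics import NormalDist

# Rounded summary statistics for NOX from the describe() output.
x_bar, s, n = 0.554695, 0.115878, 506
mu_0 = 0.538  # mean assumed under the null hypothesis

# Divide by the standard error s / sqrt(n); note the parentheses.
z = (x_bar - mu_0) / (s / n ** 0.5)

# Two-sided p-value: chance of a standard normal value at least as far from 0 as z.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print("z = {:.2f}, p-value = {:.4f}".format(z, p_value))
```

The parentheses around `s / n ** 0.5` matter: without them, Python divides by `s` and then divides again by the square root of `n`, which is not the formula above.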


Exercise 7, Step 5: Suppose your p-value is 0.06. Interpret your result!

Since the p-value (0.06) is greater than the level of significance (0.05), we fail to reject the null hypothesis: the data do not provide sufficient evidence that the mean differs from 0.538.

Exercise 7, Step 6: Suppose your p-value is actually 0.02. Now interpret your result!

Since the p-value (0.02) is less than the level of significance (0.05), we reject the null hypothesis: the data suggest that the mean is not equal to 0.538.
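
The decision rule in the last two steps can be written as a tiny helper. The function name `decide` is made up for illustration:

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a given p-value and significance level."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"

print(decide(0.06))  # fail to reject H0
print(decide(0.02))  # reject H0
```

Note that "fail to reject" is not the same as "accept": a large p-value means the data are compatible with the null hypothesis, not that the null hypothesis has been shown to be true.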

Exercise 8: We're going to run this exact same thing using SciPy. (We'll use a function that assumes our test statistic is $t$. That's okay! Don't worry about that issue for now. If you want to see the documentation, check it out [here](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_1samp.html).)

In [68]:
# list_of_values: the NOX data
# popmean: mu_0, the value of mu assumed under the null hypothesis (0.538, not the sample mean)
t_stat, p_value = stats.ttest_1samp(data.NOX, mean_hyp)
print("t = {:.2f}, p-value = {:.3f}".format(t_stat, p_value))

t = 3.24, p-value = 0.001