**NAME:** Hazman Naim Bin Ahsan

**CLASS:** GA-DSBC-23-003

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Review CLT, Confidence Intervals, and Hypothesis Testing


---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Boston Housing dataset

Information about the Boston Housing dataset can be found [here](https://www.kaggle.com/datasets/simpleparadox/bostonhousingdataset).


In [4]:
# Read in the dataset
data = pd.read_csv('datasets/boston.csv')

# we will only explore the NOX and AGE variables
NOX = data['NOX']
AGE = data['AGE']
data.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


### 1. Find the mean, standard deviation, and the standard error of the mean for variable `AGE`

In [10]:
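# A:
print("Mean:", np.mean(AGE))
# Note: np.std defaults to ddof=0 (population formula, divides by n);
# pass ddof=1 for the sample standard deviation that sem() uses below.
print("Standard Deviation:", np.std(AGE))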

Mean: 68.57490118577076
Standard Deviation: 28.121032570236885


In [12]:
# scipy standard error function
from scipy.stats import sem
print("Standard Error:", sem(AGE))

Standard Error: 1.2513695252583041
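
The standard error reported by `sem` is the sample standard deviation (with `ddof=1`) divided by the square root of the sample size. The sketch below illustrates that relationship by hand. It uses only the five `AGE` values shown in `data.head()` above as a small stand-in, so its numbers will not match the full-column results.

```python
import numpy as np

# First five AGE values from data.head(); a small stand-in, not the full column.
values = np.array([65.2, 78.9, 61.1, 45.8, 54.2])
n = len(values)

pop_std = np.std(values)             # ddof=0: divides by n (NumPy default)
sample_std = np.std(values, ddof=1)  # ddof=1: divides by n - 1

# Standard error of the mean: sample standard deviation over sqrt(n)
standard_error = sample_std / np.sqrt(n)

print("Population-formula std:", pop_std)
print("Sample std:", sample_std)
print("Standard error:", standard_error)
```

The sample standard deviation is always slightly larger than the `ddof=0` value, and the gap shrinks as the sample size grows.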


### 2. Generate a 90%, 95%, and 99% confidence interval for `AGE`

You can use the `scipy.stats.t.interval` function to calculate the endpoints of a confidence interval.

```python
# Endpoints of the central range that contains the proportion `confidence` of the distribution
stats.t.interval(confidence, df, loc=0, scale=1)
```

Arguments:
- `confidence` = the confidence level, between 0 and 1
- `df` = the degrees of freedom, which is the sample size minus 1
- `loc` = the center of the t-distribution (your point estimate, i.e. the sample mean of the variable)
- `scale` = the scale of the t-distribution (the standard error of your sample mean)

**Interpret the results from all three confidence intervals.**

In [21]:
import scipy
import scipy.stats as stats
from scipy.stats import t

In [27]:
# A: 
df = len(AGE) - 1

confidence_levels = [0.90, 0.95, 0.99]

for confidence in confidence_levels:
    print(f"\n{int(confidence*100)}% Confidence Interval:")
    interval = stats.t.interval(confidence, df, np.mean(AGE), sem(AGE))
    print(interval)


90% Confidence Interval:
(66.51279866704186, 70.63700370449965)

95% Confidence Interval:
(66.11636971854321, 71.0334326529983)

99% Confidence Interval:
(65.33936041834139, 71.81044195320013)


**Interpretation**

Each confidence interval gives a range of plausible values for the true population mean of `AGE`, given the data we have.

- The **90% confidence interval** is from approximately **66.51 to 70.64**. This means that we can be 90% confident that the true population mean of `AGE` is between these two values.
- The **95% confidence interval** is from approximately **66.12 to 71.03**. This means that we can be 95% confident that the true population mean of `AGE` is between these two values.
- The **99% confidence interval** is from approximately **65.34 to 71.81**. This means that we can be 99% confident that the true population mean of `AGE` is between these two values.

As we can see, as we increase our confidence level (from 90% to 99%), our confidence interval becomes wider. This is because to be more confident that we've captured the true population parameter, we need to provide a larger range.
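
The widening can be seen directly in the formula: each interval is the sample mean plus or minus a critical t-value times the standard error, and the critical value grows with the confidence level. The following is a minimal sketch of that calculation, assuming the five `AGE` values from `data.head()` as stand-in data (so the endpoints differ from the full-column results above).

```python
import numpy as np
from scipy import stats

# Stand-in data: the five AGE values shown in data.head(), not the full column.
values = np.array([65.2, 78.9, 61.1, 45.8, 54.2])
n = len(values)
mean = values.mean()
se = values.std(ddof=1) / np.sqrt(n)

widths = []
for confidence in [0.90, 0.95, 0.99]:
    # Two-sided critical value: leave (1 - confidence) / 2 in each tail
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    lower, upper = mean - t_crit * se, mean + t_crit * se
    widths.append(upper - lower)
    print(f"{int(confidence * 100)}%: t* = {t_crit:.3f}, interval = ({lower:.2f}, {upper:.2f})")
```

This manual calculation should agree with `stats.t.interval` when it is given the same degrees of freedom, mean, and standard error.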

### 3. Did you rely on the Central Limit Theorem in question 2? Why or why not? Explain.

Yes, the calculation in question 2 implicitly relies on the Central Limit Theorem (CLT). Here is why:

The Central Limit Theorem states that if we have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normal. This holds regardless of whether the source population is normal or skewed, provided the sample size is sufficiently large (a common rule of thumb is n > 30).

In our calculations for confidence intervals, we're dealing with the mean of `AGE` (a sample statistic). According to the CLT, the sampling distribution of this mean is approximately normal if our sample size is large enough, regardless of the distribution of `AGE` in the population.

This approximate normality is what allows us to use the t-distribution (when the population standard deviation is unknown and estimated from the sample) or the z-distribution (when the population standard deviation is known) to calculate confidence intervals. Both approaches assume that the sample mean, not the raw data, is approximately normally distributed.
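
A small simulation can make the CLT claim concrete. The sketch below is an illustration under assumed settings, not part of the original exercise: it draws many samples from a strongly skewed exponential population (mean 1, standard deviation 1) and checks that the sample means center on the population mean with spread close to σ/√n. The sample size of 50, the number of samples, and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 50                # size of each sample (arbitrary choice)
num_samples = 10_000  # number of repeated samples (arbitrary choice)

# Exponential(scale=1) is right-skewed, with mean 1 and standard deviation 1.
samples = rng.exponential(scale=1.0, size=(num_samples, n))
sample_means = samples.mean(axis=1)

print("Mean of sample means:", sample_means.mean())
print("Std of sample means:", sample_means.std(ddof=1))
print("Theoretical sigma / sqrt(n):", 1.0 / np.sqrt(n))
```

Plotting a histogram of `sample_means` (for example with `plt.hist`) would show a roughly bell-shaped distribution even though the underlying population is skewed.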

### 4. For the variable `NOX`, generate a 95% confidence interval and interpret it.

In [28]:
# A:
df = len(NOX) - 1
nox_mean = np.mean(NOX)
# Use a new name so the imported sem() function is not overwritten
nox_sem = sem(NOX)

conf_interval = stats.t.interval(0.95, df, nox_mean, nox_sem)
print(conf_interval)

(0.5445742622921801, 0.5648158562848951)


**Interpretation**

The 95% confidence interval for the variable `NOX` is approximately **(0.5446, 0.5648)**. This means that we are 95% confident that the true population mean of `NOX` lies within this interval.

In other words, if we were to take many samples and compute a 95% confidence interval from each one, we would expect about 95% of those intervals to contain the true population mean.
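
The repeated-sampling interpretation can be checked with a simulation. The following is a rough sketch under assumed settings: a hypothetical normal population whose mean and standard deviation are made up for illustration (they are not estimates from the `NOX` data), with an arbitrary sample size, repeat count, and seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical population parameters, chosen only for illustration.
true_mean, true_std = 0.55, 0.12
n, num_repeats = 30, 2000

# Each row is one independent sample of size n.
samples = rng.normal(true_mean, true_std, size=(num_repeats, n))
means = samples.mean(axis=1)
std_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)

# 95% two-sided critical value from the t-distribution
t_crit = stats.t.ppf(0.975, df=n - 1)
lower = means - t_crit * std_errors
upper = means + t_crit * std_errors

# Fraction of intervals that contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print("Coverage:", coverage)
```

The printed coverage should land close to 0.95, which is what the 95% in "95% confidence interval" refers to: the long-run success rate of the procedure, not a probability statement about any single interval.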

### 5. For the variable `NOX`, we are going to test the hypothesis that the (true) mean is equal to the median in the sample

In this case, we are performing the hypothesis test to test the mean based on a single sample.
These are the steps:
1. Define hypothesis
2. Set alpha (Let alpha = 0.05)
3. Calculate point estimate
4. Calculate test statistic
5. Find the p-value
6. Interpret results
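
Steps 4 and 5 can be written as formulas. The notation here is introduced for illustration: $\bar{x}$ is the sample mean, $\mu_0$ is the hypothesized value (the sample median in this exercise), $s$ is the sample standard deviation, $n$ is the sample size, and $T_{n-1}$ is a random variable following a t-distribution with $n-1$ degrees of freedom.

$$
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad p = 2 \, P\left(T_{n-1} \ge |t|\right)
$$

The factor of 2 accounts for both tails, since the alternative hypothesis is two-sided.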

In [29]:
# A:
## Step 1: Define hypotheses.
### H_0: mu_NOX = M_NOX
### H_A: mu_NOX != M_NOX

## Step 2: alpha = 0.05.
alpha = 0.05

## Step 3: Calculate point estimate.
sample_mean = NOX.mean()
sample_median = 0.54  # hard-coded hypothesized value; NOX.median() would compute it from the data
sample_std = NOX.std()  # pandas uses ddof=1 (sample standard deviation) by default
sample_size = len(NOX)

## Step 4: Calculate test statistic.
t_statistic = (sample_mean - sample_median)/(sample_std/sample_size**0.5)

## Step 5: Find p-value.
## t.sf is the survival function, which is 1-cdf at a given value
## (the proportion of values greater than that value).
## Because our alternative hypothesis is != (rather than greater than or less than),
## we multiply the one-tailed probability by 2. (This is called a two-sided test.)
p_value = t.sf(np.abs(t_statistic), sample_size - 1) * 2

print("Our sample median is {:.4f}.".format(sample_median))
print("Our sample mean is {:.4f}.".format(sample_mean))
print("Our t-statistic is {:.6f}.".format(t_statistic))
print("Our p-value is {:.6f}.".format(p_value))

if p_value < alpha:
    print("We reject our null hypothesis and conclude that the true mean NOX value is different from the median NOX value.")
elif p_value > alpha:
    print("We fail to reject our null hypothesis and cannot conclude that the true mean NOX value is different from the median.")
else:
    print("Our test is inconclusive.")

Our sample median is 0.5400.
Our sample mean is 0.5547.
Our t-statistic is 2.852639.
Our p-value is 0.004514.
We reject our null hypothesis and conclude that the true mean NOX value is different from the median NOX value.


**1-sample t-test**

To perform the t-test on a single sample, you can use `scipy.stats.ttest_1samp()`.

Try it out. Do you get the same values?

In [31]:
from scipy import stats

# Use stats.ttest_1samp() with appropriate parameters to check the results above
t_statistic, p_value  = stats.ttest_1samp(NOX, sample_median)

print("t-statistic: ", t_statistic)
print("p-value: ", p_value)

t-statistic:  2.8526390677766322
p-value:  0.004513586425934958


### 6. What do you notice about the results from Exercise 4 and Exercise 5? 

**If you were going to generalize this to the relationship between hypothesis tests and confidence intervals, what might you say? Be specific.**

The results from Exercise 4 and Exercise 5 are indeed related. 

In Exercise 4, we calculated a 95% confidence interval for the `NOX` data, which resulted in an interval of approximately (0.5446, 0.5648). This means that we are 95% confident that the true population mean of `NOX` lies within this interval.

In Exercise 5, we conducted a hypothesis test to determine whether the mean of `NOX` is significantly different from its median (0.54). The p-value from this test was approximately 0.004514, which is less than the commonly used significance level of 0.05. Therefore, we rejected the null hypothesis and concluded that the true mean `NOX` value is different from the median `NOX` value.

Now, if we look at the confidence interval from Exercise 4, we'll notice that the median value of 0.54 does not fall within this interval. This is consistent with the results of the hypothesis test in Exercise 5, where we concluded that the mean is different from the median.

So, to generalize this to the relationship between hypothesis tests and confidence intervals:

- A confidence interval gives a range of plausible values for a population parameter (like the mean) based on sample data. If a given value falls within this interval, we would not reject the null hypothesis that the population parameter is equal to that value at the corresponding significance level.
- A hypothesis test assesses whether a sample statistic is significantly different from a hypothesized population parameter. If we reject the null hypothesis at significance level α, the hypothesized value falls outside the corresponding (1-α)100% confidence interval for that parameter.

In other words, if we're conducting a two-sided hypothesis test at significance level α, and we have a (1-α)100% confidence interval for a parameter, we will reject the null hypothesis that the parameter equals some value if and only if that value falls outside of our confidence interval.
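
This equivalence can be checked numerically. The sketch below is an illustration on synthetic data, not the `NOX` column: it draws a hypothetical sample, builds a 95% confidence interval, and confirms for several candidate values that the two-sided one-sample t-test rejects exactly when the candidate lies outside the interval. The population parameters, candidate values, and seed are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical sample; parameters chosen only for illustration.
sample = rng.normal(loc=0.55, scale=0.1, size=100)
alpha = 0.05
n = len(sample)

# (1 - alpha) confidence interval for the mean
lower, upper = stats.t.interval(
    1 - alpha, n - 1, loc=sample.mean(), scale=stats.sem(sample)
)

agreements = []
for mu0 in [0.50, 0.53, 0.55, 0.57, 0.60]:
    p_value = stats.ttest_1samp(sample, mu0).pvalue
    rejected = bool(p_value < alpha)
    outside = not (lower <= mu0 <= upper)
    agreements.append(rejected == outside)
    print(f"mu0 = {mu0:.2f}: rejected = {rejected}, outside interval = {outside}")

print("Test and interval agree for every candidate:", all(agreements))
```

The agreement is not a coincidence of the data: both procedures use the same t-statistic and the same critical value, so they are two views of one calculation.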