# Confidence Intervals and Statistical Inference for Means
In this Notebook, we will work on confidence intervals and statistical inference for means. This Notebook is mostly adapted from the [Inferential Statistics](https://www.coursera.org/learn/inferential-statistics-intro/home/welcome) course of Duke University, converted from R to Python and tweaked to match the needs of our CSMODEL course.

Our Notebooks in CSMODEL are designed to be guided learning activities. To use them, simply go through the cells from top to bottom, following the directions along the way. If you find any unclear parts or mistakes in the Notebooks, email your instructor.

## Instructions
* Read each cell and implement the TODOs sequentially. The markdown/text cells also contain instructions which you need to follow to get the whole notebook working.
* Do not change the variable names unless the instructor allows you to.
* Answer all the markdown/text cells with 'Question #' on them. The answer must strictly consume one line only.
* You are expected to search how some functions work on the Internet or via the docs.
* The notebooks will undergo a 'Restart and Run All' command, so make sure that your code is working properly.
* You are expected to understand the dataset loading and processing separately from this class.
* You may not reproduce this notebook or share it with anyone.

## Import Libraries

For the statistical functions, we will be using `scipy` - specifically, the `stats` submodule. The `scipy.stats` [(docs)](https://docs.scipy.org/doc/scipy/reference/stats.html) module provides a number of probability distribution functions, summary and frequency statistics, correlation functions, statistical tests, and more.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.stats import ttest_ind

## Real Estate Data

Let's consider the real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor's office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our **population** of interest. In this lab, we would like to learn about these home sales by taking smaller samples from the full population. Let's load the data.

In [None]:
ames_df = pd.read_csv("ames.csv", index_col="Order")
ames_df.head()

### Get a Sample

Here, we have access to the population data. But in most cases, we do not. Instead, we have to work with a **sample**. Here, let's try to take a sample from our population.

**Note**: The random state is any number that allows us to make our notebooks reproducible. The random state, in very simple terms, dictates where to start "searching" and sampling at random.
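
As a small illustration of the note above, the sketch below samples twice with the same random state and once with a different one. The `toy_df` DataFrame is a hypothetical stand-in, not the Ames data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; this is not the Ames dataset.
toy_df = pd.DataFrame({"value": np.arange(100)})

first = toy_df.sample(5, random_state=8)
second = toy_df.sample(5, random_state=8)
other = toy_df.sample(5, random_state=9)

# The same random state selects the same rows every time.
print(first.index.equals(second.index))

# A different random state will generally select different rows.
print(first.index.tolist())
print(other.index.tolist())
```

This is why a fixed random state makes the notebook's results repeatable after a 'Restart and Run All'.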

In [None]:
n = 60 # sample size
ames_sample_df = ames_df.sample(n, random_state=8)
ames_sample_df.head()

For now, we will only focus on the `Lot.Area` variable. Let us compute the summary statistics for this variable.

In [None]:
agg = ames_sample_df.agg({"Lot.Area": ["mean", "median", "std"]})

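# Select each statistic by row label and column label
# (positional [0] on a label-indexed Series is deprecated in recent pandas)
sample_mean = agg.loc["mean", "Lot.Area"]
sample_median = agg.loc["median", "Lot.Area"]
sample_std = agg.loc["std", "Lot.Area"]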

print('Sample Mean: {:.2f}'.format(sample_mean))
print('Sample Median: {:.2f}'.format(sample_median))
print('Sample Standard Deviation: {:.2f}'.format(sample_std))

**Question #1:** What is the mean of your sample? Limit to 2 decimal places.
- *Write your answer here.*

**Question #2:** What is the median of your sample? Limit to 2 decimal places.
- *Write your answer here.*

**Question #3:** What is the standard deviation of your sample? Limit to 2 decimal places.
- *Write your answer here.*

### Confidence Interval

Return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average lot area of houses sold in Ames would be the sample mean, usually denoted as $\bar{x}$. That serves as a good point estimate, but it would be useful to also communicate how uncertain we are of that estimate. This uncertainty can be quantified using a confidence interval.

A confidence interval for a population mean is of the following form:

$$\bar{x} \pm z^* \frac{s}{\sqrt{n}}$$

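Here, $s$ is the sample standard deviation, $n$ is the sample size, and $z^*$, also known as the **critical value**, is the z-score that bounds the middle portion of the standard normal distribution matching the confidence level (the middle 95% for a 95% confidence level).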

We can use the `norm.ppf` function for this task, which will give the critical value associated with a given percentile under the normal distribution. Remember that confidence levels and percentiles are not equivalent. For example, a 95% confidence level refers to the middle 95% of the distribution, and the critical value associated with this area will correspond to the 97.5th percentile.

We can find the critical value for a 95% confidence interval using:

In [None]:
z_star_95 = norm.ppf(0.975)
print('{:.2f}'.format(z_star_95))
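
The mapping from a confidence level to the percentile passed to `norm.ppf` can be sketched as a small function. The name `critical_value` is a hypothetical helper introduced only for illustration; it is not part of `scipy`.

```python
from scipy.stats import norm

def critical_value(confidence):
    """Hypothetical helper: z* for a two-sided interval at the given confidence level."""
    tail = (1 - confidence) / 2   # area left over in each tail
    return norm.ppf(1 - tail)     # percentile that bounds the middle region

print('{:.2f}'.format(critical_value(0.95)))  # same value as norm.ppf(0.975)
print('{:.2f}'.format(critical_value(0.80)))  # middle 80% uses the 90th percentile
```

A higher confidence level leaves less area in the tails, so the percentile moves closer to 100 and the critical value grows.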

We can compute the **margin of error** using the formula

$$z^* \frac{s}{\sqrt{n}}$$

Compute and display the margin of error given a 95% confidence level.

In [None]:
# Write your code here


**Question #4:** Given a 95% confidence level, what is the margin of error? Limit to 2 decimal places.
- *Write your answer here.*

The 95% confidence interval is the sample mean $\pm$ the margin of error. 

Compute and display the 95% confidence interval.

In [None]:
# Write your code here


**Question #5:** Specify the 95% confidence interval (minimum value, maximum value). Limit to 2 decimal places.
- *Write your answer here.*

To recap: even though we don’t know what the full population looks like, we’re 95% confident that the true average lot area of houses in Ames lies between the lower and upper bounds of our interval. There are a few conditions that must be met for this interval to be valid.
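
One way to read "95% confident" is that, if we repeated the sampling many times, roughly 95% of the resulting intervals would contain the true mean. The sketch below checks this on a synthetic, right-skewed population. The gamma parameters, the number of trials, and the seed are arbitrary assumptions chosen for illustration; this is not the Ames data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic right-skewed "population" (an assumption, not the Ames data)
population = rng.gamma(shape=2.0, scale=5000.0, size=20000)
true_mean = population.mean()

n = 60                     # sample size
z_star = norm.ppf(0.975)   # critical value for a 95% confidence level
n_trials = 1000
covered = 0

for _ in range(n_trials):
    # Draw a simple random sample without replacement
    sample = rng.choice(population, size=n, replace=False)
    margin = z_star * sample.std(ddof=1) / np.sqrt(n)
    lower, upper = sample.mean() - margin, sample.mean() + margin
    if lower <= true_mean <= upper:
        covered += 1

coverage = covered / n_trials
print('Share of intervals containing the true mean: {:.3f}'.format(coverage))
```

The share should land near 0.95, though skewed populations and small samples can pull it somewhat lower.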

**Question #6:** What conditions must be met for the central limit theorem for means, and consequently our confidence interval, to be valid?
- *Write your answer here.*

**Question #7:** Is our confidence interval valid, based on the above conditions?
- *Write your answer here.*

### Verify if Our Range Covers the True Mean

In this case, we have the rare luxury of knowing the true population mean since we have data on the entire population. Let’s calculate this value so that we can determine if our confidence intervals actually capture it.

Let us get the mean from the population (not the sample).

Compute and display the true population mean for the variable.

In [None]:
# Write your code here


**Question #8:** What is the true population mean of the variable? Limit to 2 decimal places.
- *Write your answer here.*

**Question #9:** Is the true population mean within your confidence interval range?
- *Write your answer here.*

### Increase the Confidence Level to 99%

Let's get another sample from the population, where `n` is 60.

In [None]:
n = 60 # sample size
ames_sample_df = ames_df.sample(n, random_state=9)
ames_sample_df.head()

Let's focus on the `Lot.Area` variable again.

Compute and display the summary statistics - mean, median, and standard deviation for this variable.

In [None]:
# Write your code here


**Question #10:** What is the mean of your new sample? Limit to 2 decimal places.
- *Write your answer here.*

Now, let's increase the confidence level from 95% to 99%. Get the **critical value**, $z^*$, or the z-score that corresponds to the middle 99% of the data.

In [None]:
# Write your code here


Compute and display the margin of error.

In [None]:
# Write your code here


**Question #11:** Given a 99% confidence level, what is the margin of error? Limit to 2 decimal places.
- *Write your answer here.*

Compute and display the confidence interval (minimum value, maximum value).

In [None]:
# Write your code here


**Question #12:** Specify the confidence interval (minimum value, maximum value). Limit to 2 decimal places.
- *Write your answer here.*

**Question #13:** Is the true population mean within your confidence interval range?
- *Write your answer here.*

From here, we have seen that even though we do not have access to the population, we can use a sample to estimate the true population mean with the use of confidence intervals.

## Birth Records Data

In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.

Load the `nc` data set into our workspace.

In [None]:
nc_df = pd.read_csv("nc.csv")
nc_df.head()

We have observations on 13 different variables, some categorical and some numerical. The meaning of each variable is as follows.

- **`fage`**: father’s age in years.
- **`mage`**: mother’s age in years.
- **`mature`**: maturity status of mother.
- **`weeks`**: length of pregnancy in weeks.
- **`premie`**: whether the birth was classified as premature (premie) or full-term.
- **`visits`**: number of hospital visits during pregnancy.
- **`marital`**: whether mother is married or not married at birth.
- **`gained`**: weight gained by mother during pregnancy in pounds.
- **`weight`**: weight of the baby at birth in pounds.
- **`lowbirthweight`**: whether baby was classified as low birthweight (low) or not (not low).
- **`gender`**: gender of the baby, female or male.
- **`habit`**: status of the mother as a nonsmoker or a smoker.
- **`whitemom`**: whether mom is white or not white.

**Question #14:** What does each observation in this dataset represent?
- *Write your answer here.*

We will consider the possible relationship between a mother’s smoking habit (`habit`) and the weight (`weight`) of her baby. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

Let's use a boxplot to compare the two groups:

In [None]:
nc_df.groupby("habit").boxplot(column="weight")
plt.show()

Now let's look at the summary statistics across the two groups.

In [None]:
summary_stat = nc_df.groupby("habit").agg({"weight": ["mean", "median", "std", len]})
summary_stat

It appears that babies of smokers tend to weigh less, but is this difference statistically significant? To answer this question, we will conduct a hypothesis test.

### Hypothesis Test

Based on our sample, the difference in the mean baby weights of non-smokers and smokers is:

In [None]:
non_smoker_mean = summary_stat.loc["nonsmoker"].loc["weight"].loc["mean"]
smoker_mean = summary_stat.loc["smoker"].loc["weight"].loc["mean"]

diff = non_smoker_mean - smoker_mean
print('{:.2f}'.format(diff))

We set up our hypotheses as follows:

$H_0$ (null hypothesis): The true difference is 0.

$H_A$ (alternative hypothesis): The true difference is not 0.

Now, we can use a $t$-test to compare the two means from the unpaired groups. The `ttest_ind` function takes as its null hypothesis that the difference between the two means is 0, while the alternative hypothesis is that the difference between them is not 0. We set the `equal_var` parameter to `False` because we don't want to assume that the two populations have equal variances.

In [None]:
ttest_ind(nc_df[nc_df["habit"] == "smoker"]["weight"],
          nc_df[nc_df["habit"] == "nonsmoker"]["weight"],
          equal_var = False)

Note that the function above performs a $t$-test for **independent means** (unpaired). We would need to use other functions to perform tests for other kinds of groups. We leave this for you to find out.
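
To make the effect of `equal_var=False` more concrete, the sketch below computes the unequal-variance (Welch) $t$ statistic by hand and compares it with the value returned by `ttest_ind`. The two groups are synthetic draws with arbitrary, assumed parameters; they are not the `nc` data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Synthetic groups with different sizes and spreads (assumed values)
group_a = rng.normal(loc=7.1, scale=1.5, size=200)
group_b = rng.normal(loc=6.8, scale=1.2, size=50)

# Standard error of the difference, using each group's own sample variance
se = np.sqrt(group_a.var(ddof=1) / group_a.size + group_b.var(ddof=1) / group_b.size)
t_manual = (group_a.mean() - group_b.mean()) / se

result = ttest_ind(group_a, group_b, equal_var=False)
print('Manual t: {:.4f}'.format(t_manual))
print('scipy t:  {:.4f}'.format(result.statistic))
```

The sign of the statistic depends on the order of the two arguments, while the two-sided $p$-value does not.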

**Question #15:** What can you conclude based on the $p$-value under a 5% significance level? Kindly state your conclusions properly.
- *Write your answer here.*

**Question #16:** Can we say that smoking among mothers causes their babies to be lighter? Why or why not?
- *Write your answer here.*

## Try It Yourself

Compute the **90%** confidence interval for the average baby weights using the `nc` dataset.

Compute and display the sample mean.

In [None]:
# Write your code here


**Question #17:** What is the sample mean? Limit to 2 decimal places.
- *Write your answer here.*

Compute and display the confidence interval (minimum value, maximum value).

In [None]:
# Write your code here


**Question #18:** Specify the confidence interval (minimum value, maximum value). Limit to 2 decimal places.
- *Write your answer here.*