<a href="https://colab.research.google.com/github/DavidSenseman/BIO5853/blob/master/Lesson_03_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 5853: Biostatistics**

##### **Module 3: Inference**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)


### Module 3 Material

* **Part 3.1: Confidence Intervals**
* Part 3.2: Hypothesis Testing
* Part 3.3: Comparison of Two Means
* Part 3.4: Analysis of Variance (ANOVA)
* Part 3.5: Nonparametric Methods
* Part 3.6: Inference on Proportions
* Part 3.7: Contingency Tables
* Part 3.8: Correlation
* Part 3.9: Simple Linear Regression
* Part 3.10: Multiple Linear Regression
* Part 3.11: Logistic Regression
* Part 3.12: Survival Analysis

## Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.
  Running the following code will map your GDrive to ```/content/drive```.

In [None]:
# YOU MUST RUN THIS CELL FIRST
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("Note: not using Google CoLab")
    COLAB = False

# **Part 3.1: Confidence Intervals**

**_Confidence intervals (CIs)_** are a range of values used to estimate the true value of a population parameter based on sample data. They provide a measure of the precision and reliability of the estimate. The most commonly used confidence interval is the 95% CI, which means that if we were to take 100 different samples and compute a CI for each sample, we would expect about 95 of the intervals to contain the true population parameter. 

#### **Key Concepts:**
1. **Point Estimate**: The sample statistic (e.g., sample mean) used as the best estimate of the population parameter.
2. **Margin of Error**: The range above and below the point estimate that defines the confidence interval.
3. **Confidence Level**: The long-run proportion of intervals, each constructed in the same way from a new random sample, that contain the true population parameter (e.g., 95%).

#### **How Confidence Intervals are Used in Biostatistics:**
1. **Estimating Population Parameters**: CIs are used to estimate population parameters such as means, proportions, and differences between groups. For example, a 95% CI for the mean blood pressure in a population might be 120 to 130 mmHg, suggesting that the true mean is likely within this range.

2. **Assessing Precision**: The width of the CI indicates the precision of the estimate. Narrower intervals suggest more precise estimates, while wider intervals indicate less precision.

3. **Hypothesis Testing**: CIs can be used to perform hypothesis tests. If a CI for a difference between two groups does not include zero, it suggests a statistically significant difference.

4. **Communicating Uncertainty**: CIs provide a way to communicate the uncertainty associated with sample estimates. This is crucial in medical research where decisions are often based on sample data.

### Example in Biostatistics:
Suppose researchers are studying the effect of a new drug on cholesterol levels. They collect data from a sample of patients and calculate the mean reduction in cholesterol. A 95% CI for the mean reduction might be 15 to 25 mg/dL. This interval suggests that the researchers are 95% confident that the true mean reduction in cholesterol for the population lies between 15 and 25 mg/dL.

## **Point Estimation**

_From your textbook page 209:_

Now that we have investigated the theoretical properties of a distribution of sample means, we are ready to take the next step and apply this knowledge to the process of statistical inference. Recall that our goal is to estimate the population mean associated with a continuous random variable using the information contained in a sample of observations drawn from that population. There are two methods of estimation which are commonly used. The first is called **_point estimation_**; it involves using the sample data to calculate a _single_ number to estimate the parameter of interest. For instance, we might use the sample mean $\bar{x}$ to estimate the population mean $µ$. The problem is that two different samples are very likely to result in different sample means, and thus there is some degree of uncertainty involved. A point estimate does not provide any information about the inherent variability of the estimator; we do not know how close $\bar{x}$ is to $µ$ in any given  situation. While $\bar{x}$ is more likely to be near the true population mean if the sample on which it is  based is large – recall the property of consistency – a point estimate provides no information about the size of this sample.

(Pagano, Marcello; Gauvreau, Kimberlee; Mattie, Heather. Principles of Biostatistics (p. 209). CRC Press. Kindle Edition.) 

## **Two-Sided Confidence Intervals**

To construct a confidence interval for $µ$, we draw on our knowledge of the sampling distribution of the mean from the previous chapter. Given a random variable $X$ that has mean $µ$ and standard deviation $σ$, the central limit theorem states that, if $n$ is sufficiently large, the following quantity has an approximately standard normal distribution:

$$ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} $$

As shown in a step-by-step fashion, the following equation (see page 212) can be derived from the equation above. The general form for a confidence interval for $µ$ can be obtained by introducing some new notation. Let $z_{1-\alpha/2}$ be the (1 − α/2)th percentile of the standard normal distribution. By the  definition of a percentile, the probability that a standard normal random variable $Z$ takes a value less than $z_{1-\alpha/2}$ is 1 − α/2, and the probability that it takes a value greater than $z_{1-\alpha/2}$ is 1 − (1 − α/2) =  α/2. If α = 0.05, for example, then $z_{1-0.05/2}$ = $z_{0.975}$ = 1.96; if α = 0.01, then $z_{0.995}$ = 2.58. 
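The two critical values quoted above can be checked numerically. The following is a minimal sketch, assuming `scipy` is installed; the dictionary name `z_values` is only illustrative.

```python
from scipy import stats

# Two-sided critical values z_{1 - alpha/2} of the standard normal distribution
z_values = {alpha: stats.norm.ppf(1 - alpha / 2) for alpha in (0.05, 0.01)}

for alpha, z_crit in z_values.items():
    print(f"alpha = {alpha:.2f}  ->  z = {z_crit:.2f}")
```

Rounded to two decimals, the values agree with the 1.96 and 2.58 given in the text.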

Using this notation, the general form for a 100% × (1 − α) confidence interval for $µ$ is 

$$ \left( \bar{X} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}, \bar{X} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right). $$ 

For the 95% confidence interval, the general equation becomes:

$$  \left( \bar{X} - 1.96 \frac{\sigma}{\sqrt{n}}, \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}} \right) $$

Every time we compute the mean $\bar{X}$ of a sample of values (_n_ $\geq$ 30), we will almost certainly obtain a different value, that is, a different **_point estimate_**. However, due to the Central Limit Theorem (CLT), we can be _confident_ that in about 95 out of 100 experiments (trials) the true population mean $µ$ will be neither smaller than $\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}}$ nor larger than $\bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}$.

So instead of simply providing a point estimate of a statistic (e.g. mean), it is also necessary to add a confidence interval to your estimate. A point estimate alone doesn’t convey the uncertainty or variability inherent in your point estimate. A confidence interval provides a range within which the true parameter value is likely to fall, giving a sense of the estimate’s precision. Confidence intervals can also be used to assess statistical significance. If a confidence interval for a difference between groups does not include zero, it suggests that there is a statistically significant difference.


### Example 1: Compute 95% Confidence Interval and Interval Length

While the above provides a solid theoretical framework for the use of confidence intervals with point estimates, it doesn't show how to actually generate them using your computer. The Python code in the cell below shows how to compute the first entry in the following table shown on page 212 in your textbook:

![___](https://biologicslab.co/BIO5853/images/module_03/lesson_03_1_image07.png)

_Code Description:_

The code requires two Python packages be imported:

~~~text
import scipy.stats as stats
import numpy as np
~~~

The package `scipy.stats` is required for the following code chunk that computes the Z value for a given significance level (`alpha`):
~~~text
# Calculate the Z-score for the given confidence level
z = stats.norm.ppf(1 - alpha / 2)
~~~

The `z` value is then used to compute the `margin of error` with this code chunk:
~~~text
# Calculate the margin of error
margin_of_error = z * (sigma / np.sqrt(n))
~~~

The last code line has some unusual values in it:

~~~text
print(f"   {n}            X\u0305 \u00B1{(CI):.3f}σ               {(length/sigma):.3f}σ ")
~~~

The value `\u0305` immediately following the letter `X` is used to print a bar over the X. The value `\u00B1` is used to print the $\pm$ sign.


In [None]:
# Example 1: Compute 95% Confidence Interval and Interval Length

import scipy.stats as stats
import numpy as np

# Parameters
n = 10  # Sample size
mu = 70  # Mean
sigma = 10  # Standard deviation
alpha = 0.05  # Significance level

# Calculate the Z-score for the given confidence level
z = stats.norm.ppf(1 - alpha / 2)

# Calculate the margin of error
margin_of_error = z * (sigma / np.sqrt(n))

# Calculate the confidence interval
lower_bound = mu - margin_of_error
upper_bound = mu + margin_of_error
length=(upper_bound - lower_bound)
CI = (upper_bound - mu) / sigma  # Half-width of the interval in units of sigma

# Print the header
print("_______________________________________________________________")
print("    n    95% Confidence Limits for µ  Length of Interval")
print("_______________________________________________________________")
# Print line in table
print(f"   {n}            X\u0305 \u00B1{(CI):.3f}σ               {(length/sigma):.3f}σ ")



If the code is correct, you should see the following output:
~~~text
_______________________________________________________________
    n    95% Confidence Limits for µ  Length of Interval
_______________________________________________________________
   10            X̅ ±0.620σ               1.240σ 
~~~

### **Exercise 1A: Compute 95% Confidence Interval and Interval Length**

In the cell below, write the code that will generate the second line in the table shown on page 212.

(_Hint_: You only need to change the value of one variable).

In [None]:
# Insert your code for Exercise 1A here





If the code is correct, you should see the following output:
~~~text
_______________________________________________________________
    n    95% Confidence Limits for µ  Length of Interval
_______________________________________________________________
   100            X̅ ±0.196σ               0.392σ
~~~

### **Exercise 1B: Compute 95% Confidence Interval and Interval Length**

In the cell below, write the code that will generate the third line in the table shown on page 212.

(_Hint_: You only need to change the value of one variable).

In [None]:
# Insert your code for Exercise 1B here



If your code is correct, you should see the following output:

~~~text
_______________________________________________________________
    n    95% Confidence Limits for µ  Length of Interval
_______________________________________________________________
   1000            X̅ ±0.062σ               0.124σ 
~~~

### **Sample Size and Confidence Intervals**

The Python code in Example 1, as well as the code you wrote in **Exercises 1A** and **1B**, illustrates how you can generate confidence intervals for different sample sizes (`n`).

If you combined the outputs from the last 3 code cells, the result would replicate the table shown on page 212:

~~~text
_______________________________________________________________
    n    95% Confidence Limits for µ  Length of Interval
_______________________________________________________________
   10            X̅ ±0.620σ               1.240σ 
  100            X̅ ±0.196σ               0.392σ 
 1000            X̅ ±0.062σ               0.124σ 
_______________________________________________________________
~~~

As your authors note:
>As we select larger and larger random samples, the variability of $\bar{X}$ – our estimator of the population mean $µ$ – becomes smaller. The inherent variability of the underlying population, measured by $σ$, is always present, however.  

(Pagano, Marcello; Gauvreau, Kimberlee; Mattie, Heather. Principles of Biostatistics (p. 212). CRC Press. Kindle Edition.) 
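The shrinking interval length follows directly from the $1/\sqrt{n}$ factor in the margin of error. As a rough sketch (the loop and the name `rows` are illustrative and not part of the textbook), the half-width and length can be expressed in units of $σ$ for several sample sizes at once:

```python
import numpy as np
from scipy import stats

alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)   # 97.5th percentile of the standard normal

rows = []
for n in (10, 100, 1000):
    half_width = z / np.sqrt(n)     # margin of error in units of sigma
    rows.append((n, half_width, 2 * half_width))

for n, half_width, length in rows:
    print(f"{n:>6}    X\u0305 \u00B1 {half_width:.3f}σ    {length:.3f}σ")
```

Each tenfold increase in `n` shrinks the interval by a factor of $\sqrt{10} \approx 3.16$, not by a factor of 10.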

## **Frequency Interpretation of Confidence Interval**

As we previously mentioned, this confidence interval also has a **_frequency interpretation_**. Suppose that the true mean serum cholesterol level of the population of male hypertensive smokers is equal to 211 mg/100 ml, the mean level for adult males in the United States. If we were to draw 100 random samples of size 12 from this population and use each one to construct a 95% confidence interval, we would expect that, on average, 95 of the intervals would cover the true population mean µ = 211 and 5 would not. This procedure was simulated and the results illustrated in **Figure 9.1**.

![___](https://biologicslab.co/BIO5853/images/module_03/lesson_03_1_image01.png)

**Figure 9.1** Set of 95% confidence intervals constructed from samples of size 12 drawn from a normal population with mean 211 (marked by the vertical line) and standard deviation 46 


The only quantity that varies from sample to sample is $\bar{X}$. Although the centers of the intervals differ, they all have the same length. The confidence intervals that do not contain the true value of $µ$ are marked by a red dot; note that exactly five intervals fall into this category. That should make sense to you. With a 95% confidence interval, you should expect about 95 of the 100 trials to produce an interval that captures the true population mean µ, while in about 5 trials you happen to draw a sample whose mean lies too far to the left or to the right of µ for the interval to reach it.
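The frequency interpretation can also be checked by simulation. The sketch below assumes a known $σ$, so every interval has the same half-width. It uses far more trials than the 100 shown in the figure so that the observed proportion settles close to 0.95. The seed and the variable names are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)         # arbitrary seed
mu, sigma, n = 211, 46, 12             # values from the Figure 9.1 caption
num_trials = 10_000                    # many more than the 100 in the figure

z = stats.norm.ppf(0.975)
half_width = z * sigma / np.sqrt(n)    # identical for every sample (sigma known)

# One sample mean per trial
sample_means = rng.normal(mu, sigma, size=(num_trials, n)).mean(axis=1)

# An interval covers mu when the sample mean lies within half_width of mu
covered = np.abs(sample_means - mu) <= half_width
coverage = covered.mean()
print(f"Proportion of intervals covering mu: {coverage:.3f}")
```

With only 100 trials the count of misses varies noticeably from run to run, so observing exactly five misses is typical but not guaranteed.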

### Example 2: Plot Confidence Intervals 

The code in the cell below shows how we can recreate **Figure 9.1** using Python. The figure plots a 95% confidence interval (CI) for each of 100 trials. For a normal distribution, a 95% interval corresponds to roughly $\pm 2$ standard errors ($z_{0.975} = 1.96$).

_Code Description:_

We need to import the Python package `scipy.stats` as `stats` so we can use the function `stats.t.ppf()`, as shown in the code chunk below:

~~~text
# Function to calculate the 95% confidence interval
def confidence_interval(sample):
    sample_mean = np.mean(sample)
    sample_std = np.std(sample, ddof=1)
    margin_of_error = stats.t.ppf(CI, df=sample_size-1) * (sample_std / np.sqrt(sample_size))
    return sample_mean - margin_of_error, sample_mean + margin_of_error
~~~

This function accepts a `sample` as its argument. The number of elements in the sample is specified by the variable `sample_size`. The function computes the mean and standard deviation of the sample, and uses these values to compute the `margin_of_error` using the `scipy.stats` function `stats.t.ppf()` with the following line of code:

~~~text
margin_of_error = stats.t.ppf(CI, df=sample_size-1) * (sample_std / np.sqrt(sample_size))
~~~

It then computes the lower and upper bounds of the confidence interval by subtracting or adding the margin of error to sample mean and returns these two values. 
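It helps to be clear about what `stats.t.ppf(p, df)` returns: the value below which a fraction `p` of the _t_ distribution lies. For a two-sided interval at a given confidence level, the usual choice is the quantile `1 - (1 - confidence) / 2`, which leaves equal area in each tail. Passing the confidence level itself gives the one-sided percentile, which is a smaller multiplier and therefore a narrower interval. The sketch below compares the two; the names `confidence`, `one_sided` and `two_sided` are illustrative only.

```python
from scipy import stats

confidence = 0.95
df = 12 - 1   # sample_size - 1

one_sided = stats.t.ppf(confidence, df)                  # 95th percentile
two_sided = stats.t.ppf(1 - (1 - confidence) / 2, df)    # 97.5th percentile

print(f"t.ppf(0.950, df={df}) = {one_sided:.3f}")
print(f"t.ppf(0.975, df={df}) = {two_sided:.3f}")
```

An interval built with the 95th percentile covers the true mean in about 90% of samples, while one built with the 97.5th percentile covers it in about 95%.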

The function `confidence_interval()` is used in the following code chunk:

~~~text
samples = np.random.normal(population_mean, population_std, (num_samples, sample_size))
conf_intervals = [confidence_interval(sample) for sample in samples]
~~~
This code chunk first draws `num_samples` (`100`) samples, each containing `sample_size` (`12`) values, _randomly_ from a normal distribution having a specified mean (`211`) and specified standard deviation (`46`).
Each sample is then passed to our function `confidence_interval()`. The list comprehension builds a list `conf_intervals` containing the `lower` and `upper` values of the confidence interval for each sample. Since `num_samples = 100` in this example, the list `conf_intervals` contains 100 pairs (200 values) that are plotted as horizontal lines with the following code chunk:

~~~text
# Plot 
for i, (lower, upper) in enumerate(conf_intervals):
    ax.plot([lower, upper], [i, i],
            linewidth=3.8, 
            color=chart_color)
~~~

A vertical red line located at the true population mean is added to the plot with this code chunk:

~~~text
# Add a vertical line at the population mean
ax.axvline(x=population_mean, 
           color=line_color,
           alpha=0.5,
           linewidth=4.0,
           linestyle='-')
~~~
The remaining code plots the red dots to the left of any interval that doesn't "touch" (cross) the vertical red line. The coordinates of these dots were determined manually, by trial-and-error. 

~~~text
# Coordinates of the dots
x = [150, 150, 150, 150, 150]
y = [4.5, 28, 43, 55, 92]

# Plot the dots
plt.plot(x, y, 'ro', markersize=3.0)  # 'ro' stands for red circles
~~~
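As an alternative to locating the dots by trial and error, the rows whose interval misses the population mean can be found programmatically. This is a minimal sketch that uses a short, made-up list of intervals in place of the real `conf_intervals`; the name `missed_rows` is hypothetical.

```python
# Made-up intervals standing in for `conf_intervals`
population_mean = 211
conf_intervals = [(190.0, 230.0), (215.0, 250.0), (170.0, 205.0), (200.0, 240.0)]

# Row indices of the intervals that do not contain the population mean
missed_rows = [i for i, (lower, upper) in enumerate(conf_intervals)
               if not (lower <= population_mean <= upper)]
print(missed_rows)
```

The resulting indices could then serve as the `y` coordinates of the dots, since each interval is drawn at height `i` in the plot.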


In [None]:
# Example 2: Plot confidence intervals

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Parameters
CI=0.950
population_mean = 211
population_std = 46
sample_size = 12
num_samples = 100  # Number of samples to draw
chart_color = '#294181'  # color in textbook
line_color = '#b00000'   # color in textbook

# Function to calculate the 95% confidence interval
def confidence_interval(sample):
    sample_mean = np.mean(sample)
    sample_std = np.std(sample, ddof=1)
    margin_of_error = stats.t.ppf(CI, df=sample_size-1) * (sample_std / np.sqrt(sample_size))
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Set the seed
np.random.seed(44)

# Generate samples and calculate confidence intervals
samples = np.random.normal(population_mean, population_std, (num_samples, sample_size))
conf_intervals = [confidence_interval(sample) for sample in samples]


# Create plotting environment
fig, ax = plt.subplots(1, 1, figsize=(3, 9))

# Plot 
for i, (lower, upper) in enumerate(conf_intervals):
    ax.plot([lower, upper], [i, i],
            linewidth=3.8, 
            color=chart_color)

# Ensure y-axis is not visible
plt.gca().yaxis.set_visible(False)

# Add a vertical line at the population mean
ax.axvline(x=population_mean, 
           color=line_color,
           alpha=0.5,
           linewidth=4.0,
           linestyle='-')

# Coordinates of the dots
x = [150, 150, 150, 150, 150]
y = [4.5, 28, 43, 55, 92]

# Plot the dots
plt.plot(x, y, 'ro', markersize=3.0)  # 'ro' stands for red circles

# Show plot
plt.show()


If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO5853/images/module_03/lesson_03_1_image02.png)


This figure is equivalent to **Figure 9.1** on page 214 in your text. Note that because the 100 samples are chosen by a _random_ process, changing the value of the random seed can change the number of intervals that miss the true population mean (vertical red line).

### **Exercise 2A: Plot Confidence Intervals**

In the cell below, plot the intervals for a Confidence Interval (CI) equal to `0.9973`. This is equivalent to $Z \pm 3$.

You should have only one interval that misses the true population mean. The coordinates for your single red dot are:

~~~text
# Coordinates of the dots
x = [116]
y = [28]
~~~

In [None]:
# Insert your code for Exercise 2A here




If your code is correct you should see the following output:

![___](https://biologicslab.co/BIO5853/images/module_03/lesson_03_1_image04.png)


### **Exercise 2B: Plot Confidence Intervals**

In the cell below, plot the intervals for a Confidence Interval (CI) equal to `0.6826`. This is equivalent to $Z \pm 1$.

With this CI, approximately 50% of your intervals will miss the true mean (red vertical line). So do **NOT** try to plot any red dots! Instead, simply remove the code for generating the red dots. 


In [None]:
# Insert your code for Exercise 2B here



If your code is correct you should see the following output:

![___](https://biologicslab.co/BIO5853/images/module_03/lesson_03_1_image03.png)


By inspection, you can see that 48% (i.e., 48 out of 100 samples) had a confidence interval that did _not_ include the true mean of `211` (vertical red line).

## **Student's _t_-Distribution**

The **_Student’s t-distribution_** is a probability distribution used in statistics when the sample size is small and the population standard deviation is unknown. It was introduced by William Sealy Gosset under the pseudonym “Student.”

#### **Key Characteristics:**

* **Symmetrical and Bell-Shaped:** Similar to the normal distribution but with heavier tails, meaning it has a higher probability for extreme values.
* **Degrees of Freedom (df):** The shape of the _t_-distribution depends on the degrees of freedom, which is related to the sample size. As the degrees of freedom increase, the t-distribution approaches the normal distribution.
  
#### **Importance in Biostatistics:**

1. **Small Sample Sizes:** In biostatistics, researchers often work with small sample sizes. The _t_-distribution is crucial for making inferences about the population mean when the sample size is less than 30.
2. **Unknown Population Standard Deviation:** When the population standard deviation is unknown, the _t_-distribution provides a more accurate estimate than the normal distribution.
3. **Hypothesis Testing:** The _t_-distribution is used in t-tests to determine if there is a significant difference between the means of two groups. This is particularly important in clinical trials and medical research to compare treatment effects.
4. **Confidence Intervals:** It helps in constructing confidence intervals for the mean, which is essential for estimating the range within which the true population parameter lies with a certain level of confidence.

**Example in Biostatistics:**

Imagine a clinical trial testing a new drug. Researchers want to compare the mean blood pressure reduction between the treatment group and the control group. With a small sample size and unknown population standard deviation, they would use the _t_-distribution to perform a _t_-test and determine if the observed difference is statistically significant.

_From your textbook:_

To construct a two-sided confidence interval for a population mean $µ$, we began by noting that

$$ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}  $$

has an approximate standard normal distribution if _n_ is sufficiently large. When the population  standard deviation is not known, it might seem logical to substitute _s_ for _σ_, where _s_ is the standard deviation of a sample drawn from the population. This is, in fact, what is done. However, the ratio  

$$ t = \frac{\bar{X} - \mu}{s / \sqrt{n}} $$

does not have a standard normal distribution. 

In addition to the sampling variability inherent in $\bar{X}$ – which we are using as an estimator of the population mean _µ_ – there is also variability in _s_. The value of _s_ is likely to change from sample to sample. Therefore, we must account for the fact that _s_ may not be a reliable estimate of _σ_, especially when the sample size is small. 

If _X_ is normally distributed and a sample of size _n_ is randomly chosen from this underlying  population, then the probability distribution of the random variable  

$$ t = \frac{\bar{X} - \mu}{s / \sqrt{n}} $$

is known as _Student’s t distribution_ with _n_ − 1 degrees of freedom.

#### **FIGURE 9.2**

![___](https://biologicslab.co/BIO5853/images/module_03/lesson_03_1_image06.png)

>The standard normal distribution and Student’s t distribution with 1 degree of freedom


### Example 3: Plot Normal and Student's t- Distributions

The code in the cell below shows how to recreate **Figure 9.2** with Python. 

In [None]:
# Example 3: Plot Normal and Student's t- Distributions

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, t

# Assign variables
chart_color = '#294181'
line_color = '#b00000'
degrees_freedom = 1

# Define the range for the x-axis
x = np.linspace(-5, 5, 1000)

# Standard normal distribution (mean=0, std=1)
normal_pdf = norm.pdf(x, 0, 1)

# Student's t-distribution with 1 degree of freedom
t_pdf = t.pdf(x, df=degrees_freedom)

# Create plotting environment
fig, ax = plt.subplots(1, 1, figsize=(8, 6))

# Plotting
plt.plot(x, normal_pdf, color=chart_color)
plt.plot(x, t_pdf, color=line_color, linestyle='-')

# Ensure y-axis is not visible
plt.gca().yaxis.set_visible(False)

# Add labels and title
plt.xlabel('Z')

# Plot text 'Normal'
plt.text(1.40, 0.35, 'Normal', fontsize=14)   # label for the normal curve

# Plot line 'Normal'
x_line=[0.60, 1.31]
y_line=[0.356, 0.356]
plt.plot(x_line, y_line, color='k', linestyle='solid', linewidth=1.0)

# Plot text 't1'
plt.text(2.50, 0.075, '$t_1$', fontsize=14)   # label for the t curve

# Plot line 't1'
x_line=[2.50, 2.3]
y_line=[0.070, 0.06]
plt.plot(x_line, y_line, color='k', linestyle='solid', linewidth=1.0)

# Show plot
plt.show()


If the code is correct you should see the following output:

![___](https://biologicslab.co/BIO5853/images/module_03/lesson_03_1_image05.png)


The standard normal distribution and Student’s _t_-distribution with 1 degree of freedom.

You should note that the _t_-distribution (red) has "fatter tails" than the standard normal distribution (blue). These "fatter tails" (starting where the label `t1` is located) reflect the extra uncertainty that comes from estimating the unknown population standard deviation _σ_ with the sample standard deviation _s_ when the sample is small.

----------------------------------------

### **Degrees of Freedom**

**_Degrees of freedom (df)_** in statistics refer to the number of independent values or quantities that can vary in an analysis without violating any constraints. They are crucial in various statistical tests and help determine the shape of different probability distributions, such as the t-distribution and chi-squared distribution.

#### **Key Points:**

* **Definition:** Degrees of freedom represent the number of independent pieces of information available to estimate another piece of information.

* **Calculation:** The general formula for calculating degrees of freedom is:

 $$  \text{df} = n - k $$ 
 
 where $n$ is the sample size and $k$ is the number of parameters or constraints.
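For the sample standard deviation, one degree of freedom is used up by estimating the mean, so $k = 1$ and the sum of squared deviations is divided by $n - 1$. The short sketch below uses made-up data to show how NumPy exposes this choice through the `ddof` ("delta degrees of freedom") argument of `np.std()`.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # made-up values, mean = 5
n = data.size

std_pop = np.std(data)             # divides by n      (ddof=0, the default)
std_sample = np.std(data, ddof=1)  # divides by n - 1  (df = n - 1)

print(f"n = {n}, divide by n: {std_pop:.3f}, divide by n - 1: {std_sample:.3f}")
```

This is why the `confidence_interval()` function in Example 2 calls `np.std(sample, ddof=1)`.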

----------------------------------------

### **Exercise 3A: Plot Normal and Student's t- Distributions**

In the cell below write the Python code to co-plot a standard normal distribution and Student's _t_-Distribution with 4 degrees of freedom (_df_= 4). 

In [None]:
# Insert your code for Exercise 3A here



If your code is correct you should see the following output:

![___](https://biologicslab.co/BIO5853/images/module_03/lesson_03_1_image08.png)


The standard normal distribution and Student’s _t_-distribution with 4 degrees of freedom.

### **Exercise 3B: Plot Normal and Student's t- Distributions**

In the cell below write the Python code to co-plot a standard normal distribution and Student's _t_-Distribution with 10 degrees of freedom (_df_= 10). 

In [None]:
# Insert your code for Exercise 3B here




If your code is correct you should see the following output:

![___](https://biologicslab.co/BIO5853/images/module_03/lesson_03_1_image09.png)


For each possible value of the degrees of freedom, there is a different _t_ distribution. The  distributions with smaller degrees of freedom are more spread out; as df increases, the t distribution approaches the standard normal. This occurs because, as the sample size increases, _s_ becomes a more reliable estimate of σ. If _n_ is very large, knowing the value of _s_ is nearly equivalent to knowing σ. 

(Pagano, Marcello; Gauvreau, Kimberlee; Mattie, Heather. Principles of Biostatistics (p. 216). CRC Press. Kindle Edition.) 
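The convergence toward the standard normal can also be seen in the critical values themselves. The following sketch (the variable names are illustrative) compares the 97.5th percentile of the _t_ distribution for several degrees of freedom with the corresponding normal value of 1.96.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)

# 97.5th percentile of the t distribution for increasing degrees of freedom
t_crits = {df: stats.t.ppf(0.975, df) for df in (1, 4, 10, 30, 1000)}

for df, t_crit in t_crits.items():
    print(f"df = {df:>4}:  t = {t_crit:.3f}   (normal: {z_crit:.3f})")
```

With 1 degree of freedom the multiplier is more than six times the normal value, while with 1000 degrees of freedom the two are nearly the same.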

### Example 4: Read Numpy Array from File Server

So far we have been using Pandas to read data files from the course file server and creating DataFrames to hold the data. However, we can also read a datafile from a file server and directly convert it into a Numpy array as shown in the code in this example. 

_Code Description:_

The code in the cell below requires the Python package `requests` as well as `numpy` to be imported. The `requests` package is a simple yet powerful library for making HTTP requests. It allows you to send HTTP/1.1 requests including file uploads and downloads. Here we are using it for downloading a file. The name of the file, `plasmaAlum.npy`, is included in the definition of the URL.

The code uses "low level" file commands in the following code chunk to write the downloaded content to a file on the local machine running this notebook:

~~~text
# Save the file locally
with open('file.npy', 'wb') as f:
    f.write(response.content)
~~~

The Numpy function `np.load()` is then used to create the numpy array on the local machine.
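If a local copy of the file is not needed, one possible variation is to load the array directly from the downloaded bytes, since `np.load()` also accepts file-like objects. The sketch below avoids any network access by building the bytes of a small `.npy` file in memory; `npy_bytes` is a stand-in for `response.content`, and the array values are made up.

```python
import io
import numpy as np

# Stand-in for `response.content`: the bytes of a small .npy file built in memory
original = np.array([41.4, 45.1, 38.3])
buffer = io.BytesIO()
np.save(buffer, original)
npy_bytes = buffer.getvalue()

# Load the array straight from the bytes, without writing 'file.npy' to disk
loaded = np.load(io.BytesIO(npy_bytes))
print(loaded)
```

Writing the file to disk, as in the chunk above, has the advantage that the data remain available after the notebook is restarted.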

In [None]:
# Example 4: Read numpy array from server

import numpy as np
import requests

# URL of the .npy file on the file server
url = 'https://biologicslab.co/BIO5853/data/plasmaAlum.npy'

# Download the file
response = requests.get(url)
response.raise_for_status()  # Check if the request was successful

# Save the file locally
with open('file.npy', 'wb') as f:
    f.write(response.content)

# Load the .npy file
plasmaAlu = np.load('file.npy')

# Display the loaded array
print(plasmaAlu)

If the code is correct, you should see the following output:

~~~text
[41.41118753 45.06951245 45.06878192 ... 38.34447466 32.12692585
 31.39125005]
~~~

These are a small number of plasma aluminum levels recorded in infants receiving antacids containing aluminum.

### **Exercise 4: Read Numpy Array from File Server**

In the cell below, read the file `serumChol.npy` from the course file server and create a Numpy array called `serumChol`. Print out the contents of `serumChol`. 

In [None]:
# Insert your code for Exercise 4 here



If the code is correct, you should see the following output:

~~~text
[217.17712868 228.15031639 167.75769752 ... 217.94590095 225.88330817
 141.99081274]
~~~

These are a small number of serum cholesterol levels in US males that smoke.

### **Compute Confidence Intervals**

Consider a random sample of 10 children selected from the population of infants receiving antacids that contain aluminum. These antacids are often used to treat peptic or digestive disorders. The distribution of plasma aluminum levels is known to be approximately normal; however, its mean µ and standard deviation σ are not known. 

We wish to estimate the mean plasma aluminum level for this population. For the random sample of 10 children, the mean aluminum level is $\bar{x}$ = 37.2 µg/l  and the sample standard deviation is s = 7.13 µg/l [181].  Since the population standard deviation σ is not known, we must use the t distribution to find 95% confidence limits for µ. 

For a t distribution with 10 − 1 = 9 degrees of freedom, the 97.5th percentile is 2.262, and 95% of the observations lie between −2.262 and 2.262. Replacing σ with s, a 95% confidence interval for the population mean µ is

$$ \left( \bar{X} - 2.262 \frac{s}{\sqrt{10}}, \bar{X} + 2.262 \frac{s}{\sqrt{10}} \right) $$

Substituting in the values of $\bar{x}$ and $s$, the interval becomes

$$ \left( 37.2 - 2.262 \frac{7.13}{\sqrt{10}}, \ 37.2 + 2.262 \frac{7.13}{\sqrt{10}} \right) $$
or  
$$ (32.1, 42.3)$$ 

Pagano, Marcello; Gauvreau, Kimberlee; Mattie, Heather. Principles of Biostatistics (p. 217). CRC Press. Kindle Edition. 
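
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses only the summary statistics quoted in the text (n = 10, $\bar{x}$ = 37.2, s = 7.13); the variable names are illustrative and are not part of the lesson's code.

```python
import math
from scipy import stats

# Summary statistics quoted in the text above
n = 10          # sample size
x_bar = 37.2    # sample mean (µg/l)
s = 7.13        # sample standard deviation (µg/l)

# 97.5th percentile of the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)

# Margin of error and the two confidence limits
margin = t_crit * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(f"t critical value: {t_crit:.3f}")
print(f"95% CI: ({lower:.1f}, {upper:.1f})")
```

Rounded to one decimal place, the limits should agree with the interval (32.1, 42.3) given above.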

### Example 5: Compute Confidence Intervals

The Python code in the cell below computes a 95% confidence interval using the _t_ distribution for a random sample of 10 infants drawn from the population of infants (_n_ = 2400) receiving antacids containing aluminum, as described in the preceding cell.

_Code Description:_

The code uses the Numpy array `plasmaAlu` that was downloaded from the course file server in Example 4. 

The t critical value for the sample is computed in this code chunk using the `scipy.stats` function `stats.t.ppf()`:

~~~text
# Calculate the t-critical value for 95% confidence
plasmaAlu_sample_t_critical = stats.t.ppf(1 - 0.025, degrees_of_freedom)
~~~
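
To see how the critical value depends on the degrees of freedom, the sketch below calls `stats.t.ppf()` for several sample sizes and compares the results with the standard normal value from `stats.norm.ppf()`. The particular degrees of freedom and the names `confidence`, `alpha`, and `t_values` are chosen here only for illustration.

```python
from scipy import stats

confidence = 0.95
alpha = 1 - confidence  # total area left in the two tails

# A two-sided interval puts alpha/2 in each tail, so the critical value
# is the (1 - alpha/2) quantile of the t distribution
t_values = {}
for df in (4, 9, 29, 99):
    t_values[df] = stats.t.ppf(1 - alpha / 2, df)
    print(f"df = {df:>2}: t critical value = {t_values[df]:.3f}")

# The standard normal critical value, for comparison
z_crit = stats.norm.ppf(1 - alpha / 2)
print(f"normal:  z critical value = {z_crit:.3f}")
```

The critical value shrinks toward the normal value as the degrees of freedom increase, which is why small samples produce wider intervals.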

Once the _t_ critical value has been calculated, the margin of error is calculated in the next code chunk:

~~~text
# Calculate the margin of error
plasmaAlu_sample_margin_of_error = plasmaAlu_sample_t_critical * plasmaAlu_sample_std_error
~~~
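
The standard error that feeds the margin of error depends on how the standard deviation is computed. The sketch below uses a small hypothetical sample (the ten values are made up for illustration and are not taken from `plasmaAlu`) to compare NumPy's default `ddof=0` with the sample standard deviation (`ddof=1`), and cross-checks the manual interval against `stats.t.interval()`.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 10 values (illustration only, not course data)
sample = np.array([41.4, 45.1, 38.3, 32.1, 31.4, 36.8, 29.9, 44.0, 35.2, 39.7])
n = sample.size

# np.std() divides by n by default; ddof=1 divides by n - 1
sd_default = np.std(sample)
sd_sample = np.std(sample, ddof=1)
print(f"np.std default (ddof=0): {sd_default:.3f}")
print(f"np.std with ddof=1:      {sd_sample:.3f}")

# Standard error and margin of error based on the sample standard deviation
std_error = sd_sample / np.sqrt(n)
t_crit = stats.t.ppf(0.975, n - 1)
margin = t_crit * std_error
ci_manual = (sample.mean() - margin, sample.mean() + margin)

# The same interval from SciPy; stats.sem() uses ddof=1 by default
ci_scipy = stats.t.interval(0.95, n - 1, loc=sample.mean(), scale=stats.sem(sample))

print(f"Manual interval: ({ci_manual[0]:.2f}, {ci_manual[1]:.2f})")
print(f"SciPy interval:  ({ci_scipy[0]:.2f}, {ci_scipy[1]:.2f})")
```

With only 10 observations the two standard deviations differ by a factor of $\sqrt{10/9}$, about 5%, so the choice of `ddof` visibly changes the width of the interval.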

In [None]:
# Example 5: Compute Confidence Intervals

import numpy as np
from scipy import stats

# Define variables
sample_size = 10
degrees_of_freedom = sample_size - 1

# Compute 'true' population mean
plasmaAlu_true_mean = np.mean(plasmaAlu)
print(f"The true population mean for plasma aluminum is {plasmaAlu_true_mean:.2f} µg/l") 

# Set the seed
np.random.seed(28) # Different values have large effects on sample stats

# Select random sample
plasmaAlu_sample = np.random.choice(plasmaAlu, size=sample_size, replace=True)

# Compute the mean of the sample
plasmaAlu_sample_mean = np.mean(plasmaAlu_sample)
print(f"The mean for {sample_size} samples for plasma aluminum is {plasmaAlu_sample_mean:.2f} µg/l")

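# Compute standard deviation of the sample
# Note: np.std() defaults to ddof=0 (divides by n); pass ddof=1 to get the
# sample standard deviation s (divides by n - 1) used in the textbook formula
plasmaAlu_sample_std = np.std(plasmaAlu_sample)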

# Compute standard error of the sample
plasmaAlu_sample_std_error = plasmaAlu_sample_std / np.sqrt(sample_size)

# Calculate the t-critical value for 95% confidence
plasmaAlu_sample_t_critical = stats.t.ppf(1 - 0.025, degrees_of_freedom)

# Calculate the margin of error
plasmaAlu_sample_margin_of_error = plasmaAlu_sample_t_critical * plasmaAlu_sample_std_error

# Calculate the confidence interval
confidence_interval = (plasmaAlu_sample_mean - plasmaAlu_sample_margin_of_error, 
                       plasmaAlu_sample_mean + plasmaAlu_sample_margin_of_error)

print(f"95% Confidence Interval with {degrees_of_freedom} degrees of freedom: {confidence_interval}")

If the code is correct, you should see the following output:

~~~text
The true population mean for plasma aluminum is 37.08 µg/l
The mean for 10 samples for plasma aluminum is 37.25 µg/l
95% Confidence Interval with 9 degrees of freedom: (32.89306379759906, 41.613607119840886)
~~~

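This confidence interval is close to, but not exactly the same as, the interval on page 217 of your textbook:

$$ (32.1, 42.3) $$

A small difference is expected for two reasons. First, the 10 values are selected at random, so the sample mean and standard deviation differ from those in the textbook. Second, `np.std()` divides by _n_ by default (`ddof=0`) rather than by _n_ − 1, which gives a slightly smaller value than the sample standard deviation _s_ used in the textbook formula.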
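
Because every random sample yields a different interval, the "95%" describes the long-run behavior of the procedure rather than any single interval. The simulation below is a rough sketch of that idea: it draws many samples of size 10 from a hypothetical normal population (the mean of 37.0 and standard deviation of 7.0 are assumptions chosen to resemble the example, not values from the course data) and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical normal population (assumed values, for illustration only)
true_mean, true_sd = 37.0, 7.0
n, n_repeats = 10, 2000

# Each row is one random sample of size n
samples = rng.normal(true_mean, true_sd, size=(n_repeats, n))

# Sample mean and standard error (sample standard deviation, ddof=1) per row
means = samples.mean(axis=1)
std_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)

# 95% confidence limits for every sample
t_crit = stats.t.ppf(0.975, n - 1)
lower = means - t_crit * std_errors
upper = means + t_crit * std_errors

# Fraction of intervals that contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should be close to 0.95, even though the individual intervals vary in both location and width.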

### **Exercise 5: Compute Confidence Intervals**

In the cell below, compute a 95% confidence interval using the _t_ distribution for a sample of 5 males from the serum cholesterol data in the NumPy array `serumChol` that you downloaded from the file server in **Exercise 4**. Use Example 5 as a template, and leave the random seed set to `28`.

In [None]:
# Insert your code for Exercise 5 here



If the code is correct, you should see the following output:

~~~text
The true population mean for serum cholesterol is 217.93 mg/100 ml
The mean for 5 samples for serum cholesterol is 224.33 mg/100 ml
95% Confidence Interval with 4 degrees of freedom: (186.5807665295276, 262.07909980863326)
~~~

## **Lesson Turn-in**

When you have run all of the code cells in sequential order (the last code cell above should be numbered 19), create a PDF of your notebook.

Upload the **_PDF_** of your Lesson_03_1 assignment to Canvas for grading.
