<a href="https://colab.research.google.com/github/DavidSenseman/BIO5853/blob/master/Lesson_01_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO5853: Biostatistics**

## **Lesson_01_5: Numerical Summary Measures**

##### **Module I: Variability**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)

### Module 1 Material
* Part 1.1: Getting Started with Google COLAB
* Part 1.2: Python Basics 1 -- Syntax, Operators, Expressions
* Part 1.3: Python Basics 2 -- Functions, Variables, Strings
* Part 1.4: Python Basics 3 -- Charts
* **Part 1.5: Python Basics 4 -- Numerical Summary Measures**

#### In this assignment you will learn about:

* Mean
* Median
* Mode
* Range
* IQR
* Variance

### Google CoLab Instructions

The following code will map your GDrive to ```/content/drive``` and print out your Google GMAIL address.

In [None]:
# You must run this cell second
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("Note: not using Google CoLab")
    COLAB = False

## **Numerical Summary Measures**

This lesson follows Chapter 2.4 in your textbook, starting on page 34.

### **Mean**

The mean (also known as the average) summarizes an entire dataset with a single number representing its center point or typical value. To find the mean, simply add all the values and divide by the number of observations. The formula for the mean is:

$$ \overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i $$

Here:

$X_i$ represents each value in the dataset and $n$ is the total number of values.
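
As a quick check of the formula, the sketch below adds up a short list of numbers, divides by the count, and compares the result with `statistics.mean()`. The five values and the variable names are made up for illustration and do not come from any course dataset.

```python
import statistics

# Hypothetical values, invented for illustration only
values = [2, 4, 4, 5, 10]

# Apply the formula directly: add all values, then divide by the count n
n = len(values)
manual_mean = sum(values) / n

# The statistics package should give the same answer
package_mean = statistics.mean(values)

print(f"Manual mean={manual_mean:.2f}")
print(f"Package mean={package_mean:.2f}")
```

Both lines should print the same number, since `statistics.mean()` carries out the same calculation.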

## Example 1: Calculate mean 

Example 1 is divided into steps to make learning the Python code easier. 

In Example 1A, we demonstrate how to use the **_Pandas_** package to read a datafile from an internet website and create a data structure known as a **_DataFrame_**. We will use this code example throughout this course in nearly every lesson. 

---------------------------

### File handling with _Pandas_

**_Pandas_** is frequently used in Python programs to read plain text files. File handling using Pandas typically involves reading data from a file into a Pandas **_DataFrame_** using the `pd.read_csv(filename)` function. This function reads delimited text files, whether they are stored on your computer or at a web address (URL). Pandas provides separate functions for other sources, such as `pd.read_excel()` for Excel files and `pd.read_html()` for HTML tables. 

The function's name refers to a particularly common file type called a CSV (Comma Separated Values) file. In this file type, a comma **`,`** is used as the **_delimiter_** value, to **separate** one data value from another. 
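
To make the idea of a delimiter concrete, the sketch below reads a few lines of comma-separated text from a string instead of a file. The text, the column names, and the DataFrame name `demoDF` are all invented for illustration; `io.StringIO` is used only so that no file or internet connection is needed.

```python
import io
import pandas as pd

# Hypothetical CSV text, invented for illustration only.
# The first line holds the column names; commas separate the values.
csv_text = "Subject,Age,Weight\nA,34,61.5\nB,51,72.0\nC,?,80.2\n"

# io.StringIO lets read_csv treat the string as if it were a file.
# na_values marks 'NA' and '?' as missing data.
demoDF = pd.read_csv(io.StringIO(csv_text), na_values=['NA', '?'])

print(demoDF)
print(demoDF.shape)
```

Each comma in the text becomes a column boundary, and the `?` in the last row is read as a missing value rather than as text.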

--------------------------
 

### Example 1A: Read Datafile

The code in the cell below uses the Pandas function `pd.read_csv(filename)` to read the data file `pima.csv` stored on the course HTTPS server https://biologicslab.co. As the file is read, it is stored in a Pandas DataFrame called `pimaDF`. 

The code begins by "importing" the Pandas package and assigning it the alias `pd`. 

After the file is read and the DataFrame is created, we use the notebook function `display()` to print out a portion of our new DataFrame to make sure it was read correctly. The code in the cell below shows how to use this function with the data stored in `pimaDF`. The Pandas function `pd.set_option()` allows you to control the maximum number of rows and columns to print. This is useful since many datasets have large numbers of rows and columns, which can be difficult to print to your computer screen. The code below sets the maximum number of rows and columns to `6`. 

In [None]:
# Example 1A: Read datafile

import pandas as pd

# Read the datafile 
pimaDF = pd.read_csv(
    "https://biologicslab.co/BIO5853/data/pima.csv",
   # index_col=0,
    na_values=['NA','?'])

# Set max rows and max columns
pd.set_option('display.max_rows', 6)
pd.set_option('display.max_columns', 6) 

# Display DataFrame
display(pimaDF)

If your code is correct, you should see the following table.

![____](https://biologicslab.co/BIO5853/images/Lesson_01_5/lesson_01_5_image01.png)

As you can see from the last line of the above table, the `pimaDF` DataFrame contains 768 rows and 9 columns. Each column contains the value for one of 9 items of clinical information as follows:

* **Pregnancies:** Number of times pregnant
* **Glucose:** Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* **BloodPressure:** Diastolic blood pressure (mm Hg)
* **SkinThickness:** Triceps skin fold thickness (mm)
* **Insulin:** 2-Hour serum insulin (mu U/ml)
* **BMI:** Body mass index (weight in kg / (height in m)$^2$)
* **DiabetesPedigree:** Diabetes pedigree function
* **Age:** Age (years)
* **Outcome:** Diagnosed with Type II Diabetes (`0`= no, `1` = yes)

Each row corresponds to the data obtained for one subject in the dataset. In this example, the data were obtained from 768 women belonging to the Native American tribe called the Pima. 

The Pima Indians are known for their high prevalence of Type 2 diabetes, making them an important population for studying and understanding the genetic and environmental factors that contribute to the development of diabetes. They have one of the highest rates of diabetes in the world, with some estimates suggesting that up to 50% of Pima adults have the disease.

### Example 1B: Compute Mean

Now that we have our Pima Indian dataset in a Pandas DataFrame called `pimaDF`, we can use the Python package `statistics` to compute various summary statistics.

For example, the code in the cell below uses the `statistics` package to find the **_mean_** blood pressure of the 768 women in the Pima Indian dataset.  

_Code Description:_

The first step is to load any Python packages that will be needed for the computations. In Example 1B we will need to load (`import`) the package `statistics` as follows:

~~~text
import statistics
~~~

Since we only want to find the mean for the blood pressure values, we need a way to selectively **_extract_** these values from the `pimaDF` DataFrame. As you will see, there are several ways that this can be accomplished. One way is shown in the following code chunk: 

~~~text
pimaDF.BloodPressure
~~~

To access the data in a specific DataFrame column, we can place a dot (**.**) after the DataFrame name, followed by the column name.

**WARNING:** You must correctly enter the column name _exactly_ as it appears in the DataFrame, including capitalization or you will receive an error. 
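
One of the other ways to extract a column is bracket notation, where the column name is given as a string inside square brackets. The sketch below compares the two notations on a tiny DataFrame; the DataFrame `demoDF` and its values are invented for illustration and are not the Pima data.

```python
import pandas as pd

# Hypothetical three-row DataFrame, invented for illustration only
demoDF = pd.DataFrame({"BloodPressure": [70, 80, 60],
                       "Age": [40, 25, 33]})

# Dot notation and bracket notation select the same column
dot_dat = demoDF.BloodPressure
bracket_dat = demoDF["BloodPressure"]
print(dot_dat.equals(bracket_dat))

# A misspelled or wrongly capitalized name raises an error
try:
    demoDF["bloodpressure"]
except KeyError:
    print("Column 'bloodpressure' not found: names are case-sensitive")
```

Bracket notation is needed when a column name contains spaces or other characters that are not allowed in a Python name.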

The following code fragment uses this dot notation to create a new variable called `BldPressure_dat` that contains only the numbers (values) in the blood pressure column:

~~~text
# Create data variable
BldPressure_dat = pimaDF.BloodPressure
~~~

Once we have the blood pressures isolated, we can use the function `statistics.mean()` to compute the mean of the variable `BldPressure_dat` and store the result in a new variable called `BldPressure_mean`. 

~~~text
# Compute mean
BldPressure_mean=statistics.mean(BldPressure_dat)
~~~

The last step is to print out the results using Python's `f` print statement (a formatted string literal, or "f-string", passed to the `print()` function).

~~~text
# Print result
print(f"Blood pressure mean={BldPressure_mean:.2f} mmHg")
~~~
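
The part after the colon inside the curly braces controls how the number is printed. The short sketch below shows the effect of changing the number of decimal places; the value and the variable names are invented for illustration.

```python
# Hypothetical value, invented for illustration only
reading = 72.456

# :.2f prints the number rounded to two decimal places
two_places = f"Reading={reading:.2f} mmHg"

# :.4f prints four decimal places instead
four_places = f"Reading={reading:.4f} mmHg"

print(two_places)
print(four_places)
```

The format code changes only what is printed; the value stored in `reading` is not rounded.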

In [None]:
# Example 1B: Find mean

import statistics

# Create data variable
BldPressure_dat = pimaDF.BloodPressure

# Compute mean
BldPressure_mean=statistics.mean(BldPressure_dat)

# Print result
print(f"Blood pressure mean={BldPressure_mean:.2f} mmHg")


If the code is correct you should see the following output:

~~~text
Blood pressure mean=69.11 mmHg
~~~

### **Exercise 1A: Read Datafile**

In the cell below, write the Python code to read the datafile `obesity_prediction.csv` stored on the course HTTPS server `https://biologicslab.co/BIO5853/data/`. As shown in Example 1A, use the Pandas function `pd.read_csv()` to read the datafile and store it in a new DataFrame called `obDF`. 

After the file is read and the DataFrame is created, use the function `display()` to print out a portion of your new DataFrame to make sure it was read correctly. Set the maximum number of rows and columns to `6`. 

In [None]:
# Insert your code of Exercise 1A here



If your code is correct, you should see the following table.

![____](https://biologicslab.co/BIO5853/images/Lesson_01_5/lesson_01_5_image02.png)

As you can see from the last line of the above table, the `obDF` DataFrame contains 1000 rows and 7 columns. Each  column contains the value for one of 7 items of clinical information for 1000 subjects as follows:

1. **Age:** The age of the individual, expressed in years.
2. **Gender:** The gender of the individual, categorized as male or female.
3. **Height:** The height of the individual, typically measured in centimeters (cm).
4. **Weight:** The weight of the individual, typically measured in kilograms (kg).
5. **BMI:** A calculated metric derived from the individual's weight and height
6. **PhysicalActivityLevel:** This variable quantifies the individual's level of physical activity
7. **ObesityCategory:** Categorization of individuals based on their BMI into different obesity categories

### **Exercise 1B: Compute Mean**

In the cell below, use the Python package `statistics` to find the mean height of the men and women in the Obesity Prediction dataset. Store this value in a new variable called `Height_mean`.  

Print out the value of your `Height_mean` using Python's `f` print statement. Make sure to add the correct units (`cm`).

In [None]:
# Insert your code for Exercise 1B here




If your code is correct you should see the following output:

~~~text
Height mean=170.05 cm
~~~

## **Median**

The **_median_** is a measure of central tendency that represents the middle value in a dataset when it’s ordered from the lowest to the highest value. It separates the lowest 50% from the highest 50% of values. Here’s how to find it manually:

**Odd-numbered dataset:**

* Order the values from low to high.
* Calculate the middle position using the formula:

$$ \text{Middle position} = \frac{n + 1}{2}$$
where $n$ is the number of values.
* The median is the value at the middle position.


**Even-numbered dataset:**
* Order the values.
* Calculate the two middle positions using the formulas:

$$\text{Middle position 1} = \frac{n}{2}$$  

> and

$$ \text{Middle position 2} = \frac{n}{2} + 1 $$

* Find the two middle values.
* Calculate the mean of these middle values to get the median.


The median is especially useful when dealing with skewed data or outliers, as it’s less affected by extreme values compared to the mean. It provides a robust estimate of the central value in a dataset.
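
The manual steps for odd-numbered and even-numbered datasets can be sketched as a small function and checked against `statistics.median()`. The function name `manual_median` and the two short lists are invented for illustration. Note that Python counts positions starting from 0, so position 1 in the formulas corresponds to index 0 in the code.

```python
import statistics

def manual_median(values):
    """Find the median using the positional rules for odd and even n."""
    ordered = sorted(values)          # order the values from low to high
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2, minus 1 for Python's 0-based index
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the values at positions n/2 and n/2 + 1
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

# Hypothetical values, invented for illustration only
odd_dat = [7, 1, 5, 3, 9]
even_dat = [7, 1, 5, 3, 9, 100]

print(manual_median(odd_dat), statistics.median(odd_dat))
print(manual_median(even_dat), statistics.median(even_dat))
```

In the second list the extreme value `100` moves the median only from 5 to 6, which illustrates why the median is described as robust.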


### Example 2: Compute Median

The Python code in the cell below again uses the `statistics` package, this time to compute the median blood pressure value for the Pima Indian dataset. The median value is stored in a new variable, `BldPressure_median`. 

The `f` print function is used to print out both the median value as well as the mean value calculated above in Example 1B.


In [None]:
# Example 2: Compute median

import statistics

# Create data variable
BldPressure_dat = pimaDF.BloodPressure

# Calculate median
BldPressure_median = statistics.median(BldPressure_dat)

# Print result
print(f"Blood pressure median={BldPressure_median:.2f} mmHg")
print(f"Blood pressure mean={BldPressure_mean:.2f} mmHg")

If the code is correct, you should see the following output:

~~~text
Blood pressure median=72.00 mmHg
Blood pressure mean=69.11 mmHg
~~~

As you can see, the values for the median and the mean blood pressure are similar, but _not_ identical. Because the mean lies below the median, the blood pressures in the Pima Indian dataset are somewhat "left skewed"--a concept that we will explore later in this lesson. 

### **Exercise 2: Compute Median**

In the cell below, use the `statistics` package to compute the median height for the subjects in the Obesity Prediction dataset and store this value in a new variable called `Height_median`. 

Use the `f` print function to print out both the median height value, as well as the mean height value that you computed earlier in **Exercise 1B**.


In [None]:
# Insert your code for Exercise 2 here:



If the code is correct, you should see the following output:

~~~text
Height median=169.80 cm
Height mean=170.05 cm
~~~

As you can see, the values for the median and the mean height are very close to each other. This suggests that the height values in the Obesity Prediction dataset are roughly symmetric and may be almost "normally distributed"--a concept that we will explore later in this course. 

## **Mode**

The **_mode_** represents the value that appears most frequently in a dataset. It’s a measure of central tendency that tells you the most common choice or characteristic within your sample. Here are some key points about the mode:

* **Unimodal:** A dataset can have one mode, where a single value occurs most often.
* **Bimodal:** If two different values repeat most frequently, the dataset is bimodal.
* **Trimodal:** With three modes, it’s trimodal.
* **Multimodal:** If there are four or more modes, it’s multimodal.
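
The difference between a unimodal and a bimodal dataset can be seen with `statistics.multimode()`, which returns every most-frequent value as a list. This is a minimal sketch that assumes Python 3.8 or later (the version in which `multimode()` was added); the two short lists are invented for illustration.

```python
import statistics

# Hypothetical values, invented for illustration only
unimodal_vals = [60, 70, 70, 70, 80]
bimodal_vals = [60, 60, 70, 80, 80]

# multimode() returns a list holding every most-frequent value
unimodal_modes = statistics.multimode(unimodal_vals)
bimodal_modes = statistics.multimode(bimodal_vals)

print(f"Unimodal example modes={unimodal_modes}")
print(f"Bimodal example modes={bimodal_modes}")

# mode() returns a single value: the first mode it encounters
first_mode = statistics.mode(bimodal_vals)
print(f"mode() on the bimodal example={first_mode}")
```

Because `statistics.mode()` reports only one value, it can hide the fact that a dataset has more than one mode.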

### Example 3: Compute Mode

The code in the cell below uses the `statistics` package to compute the mode for blood pressure in the Pima Indian dataset. The values for the mode, median and mean are printed out.

In [None]:
# Example 3: Compute mode

import statistics

# Create the data variable
BldPressure_dat = pimaDF.BloodPressure

# Compute the mode
BldPressure_mode = statistics.mode(BldPressure_dat)

# Print the result
print(f"Blood pressure mode={BldPressure_mode:.2f} mmHg")
print(f"Blood pressure median={BldPressure_median:.2f} mmHg")
print(f"Blood pressure mean={BldPressure_mean:.2f} mmHg")

If the code is correct, you should see the following output:

~~~text
Blood pressure mode=70.00 mmHg
Blood pressure median=72.00 mmHg
Blood pressure mean=69.11 mmHg
~~~

Once again, you can see the values for the mode, median and mean blood pressure are similar, but _not_ identical. 

### **Exercise 3: Compute Mode**

In the cell below, use the `statistics` package to compute the mode for height in the Obesity Prediction dataset. Store the mode value in a new variable called `Height_mode`. Print out the values for the mode, median and mean using the correct units of measure.

In [None]:
# Insert your code for Exercise 3 here



If the code is correct, you should see the following output:

~~~text
Height mode=173.58 cm
Height median=169.80 cm
Height mean=170.05 cm
~~~

As you can see, the values for the mode, median and mean height are again similar, but _not_ identical. In this particular dataset, the mode lies farther from the median and the mean than they lie from each other.  

## **Data Distributions**

Since the mean, median and mode seem to be essentially the same value, you might be wondering why there are three different measures of central tendency? In this section, we will address this question.

Consider the four different data distributions shown in **Figure 2.13** from your textbook (pg. 38):

![____](https://biologicslab.co/BIO5853/images/Lesson_01_5/lesson_01_5_image03.png)

As your textbook states on page 37:

> The best measure of central tendency for a given set of data often depends on the way in which the values are distributed. If continuous or discrete measurements are symmetric and unimodal – meaning that, if we were to draw a histogram or a frequency polygon, there would be only one peak, as in the smoothed distribution pictured in **Figure 2.13(a)** – then the mean, the median, and the mode should all be roughly the same. If the distribution of values is symmetric but bimodal, so that the corresponding frequency histogram would have two peaks as in **Figure 2.13(b)**, then the mean and median should again be the same. Note, however, that this common value could lie between the two peaks, and hence be a measurement that is extremely unlikely to occur. A bimodal distribution often indicates that the population from which the values are taken actually consists of two distinct subgroups that differ in the characteristic being measured; in this situation, it might be better to report two modes rather than the mean or the median, or to treat the two subgroups separately. The data in **Figure 2.13(c)** are skewed to the right, and those in **Figure 2.13(d)** are skewed to the left. When the data are not symmetric, as in these two figures, the median is often the best measure of central tendency. Because the mean is sensitive to extreme observations, it is pulled in the direction of the outlying data values. As a result, the mean might end up either excessively inflated or excessively deflated. Note that when the data are skewed to the right, the mean lies to the right of the median;  when they are skewed to the left, the mean lies to the left of the median. In both instances, the mean is pulled in the direction of the extreme values.  Regardless of the measure of central tendency used in a particular situation, it can be misleading to assume that this value is representative of all observations in the group.
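
The textbook's point that the mean is pulled toward extreme observations can be checked with a few numbers. In the sketch below, the two lists are invented for illustration; they differ only in their largest value.

```python
import statistics

# Hypothetical values, roughly symmetric
symmetric_vals = [48, 49, 50, 51, 52]

# Same values, but the largest is replaced by an extreme observation
skewed_vals = [48, 49, 50, 51, 500]

for label, vals in [("Symmetric", symmetric_vals),
                    ("Right-skewed", skewed_vals)]:
    vals_mean = statistics.mean(vals)
    vals_median = statistics.median(vals)
    print(f"{label}: mean={vals_mean:.2f}  median={vals_median:.2f}")
```

The single extreme value moves the mean far to the right, while the median stays where it was.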

## **Generating Data Distributions Using Python**

In the next section of this lesson, we demonstrate how to generate the four data distributions shown above in Figure 2.13 using three important Python packages: `Numpy`, `Matplotlib` and `Scipy`. 

### **Numpy Package**

Numpy (usually pronounced "num-pie", sometimes "num-pee") is a Python package that provides powerful data structures and functions for manipulating numerical data. It is the fundamental package for scientific computing with Python.

Like other Python packages, it has to be imported into a Python program with the following command before it can be used.
~~~text
import numpy as np
~~~
When importing software packages, an alias ('nickname') is often used. For example, the standard alias for Numpy is `np`. The Python package alias is a feature of the Python language that allows a software package to be imported under a different name. Another example of a package alias is `pd`, which is the alias of the Python package called Pandas. One advantage of using an alias is that it shortens the commands that use the package.

When using a method that is part of a Python package, the alias is used instead of the package name. For example, to use the Numpy `append()` method, the command would be:

~~~text
np.append(array,value,axis)
~~~
Numpy arrays are the main data structure used in this course when we preprocess (prepare) data to be analyzed. Numpy arrays are N-dimensional objects that can hold any data type, including strings, although purely numeric arrays are most often encountered. 

In addition to providing a data structure, Numpy also provides a suite of mathematical functions and tools for working with these arrays. For example, Numpy provides a powerful linear algebra library for manipulating matrices and vectors. Numpy also provides a range of statistical functions that can be used for data analysis.
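
As a small illustration of the points above, the sketch below creates a Numpy array, uses `np.append()` to add one value, and applies two of Numpy's statistical functions. The array names and values are invented for illustration.

```python
import numpy as np

# Hypothetical values, invented for illustration only
demo_arr = np.array([2.0, 4.0, 6.0])

# np.append returns a NEW array; the original is left unchanged
longer_arr = np.append(demo_arr, 8.0)

print(f"Original array: {demo_arr}")
print(f"After append:   {longer_arr}")

# Built-in statistical functions work on the whole array at once
print(f"Mean={np.mean(longer_arr):.2f}")
print(f"Median={np.median(longer_arr):.2f}")
```

Since `np.append()` does not modify the original array, its result must be assigned to a variable to be kept.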


### **Matplotlib Package**

**_Matplotlib_** is a comprehensive library for creating static, animated, and interactive visualizations in Python. 
Matplotlib can be used in Python scripts, the Python and IPython shells, Google COLAB, the Jupyter notebook, web application servers, and four graphical user interface toolkits. It supports many types of charts and graphs, including line plots, scatter plots, bar charts, histograms, pie charts, box plots, 3D charts, and more. It also supports mathematical notation and LaTeX-style formatting for text and labels.

#### **Matplotlib with Numpy**

Matplotlib can be used with Numpy to create powerful visualizations. Numpy can be used to provide data for Matplotlib to plot, manipulate, and transform. For example, you can use Numpy functions to generate random numbers, generate evenly spaced numbers over a specified interval, generate normally distributed numbers, and even perform linear algebraic operations. Matplotlib can then take this data and plot it in various ways, including line graphs, bar graphs, histograms, scatter plots, contour plots, and more. Together, Matplotlib and Numpy can be used to create powerful visualizations of data.

### **Scipy Package**

The `scipy.stats` module in Python’s `SciPy` library provides a wide range of statistical functions and distributions. Here are some key features:

1. **Probability Distributions:**
It contains a large number of probability distributions (both continuous and discrete), such as normal, exponential, gamma, chi-squared, and more.
These distributions allow you to model and analyze random variables in various fields like physics, finance, and engineering.

2. **Summary Statistics and Tests:**
You can compute summary statistics (mean, median, variance, etc.) for your data using functions from this module.
It also offers statistical tests (e.g., t-tests, ANOVA) to assess hypotheses and compare groups.

3. **Correlation Functions:**
Functions for calculating correlation coefficients (Pearson, Spearman, Kendall) are available.
Useful for understanding relationships between variables.

### **Demonstration 1: Normal Distribution**

**Figure 2.13a** in your textbook (and above) shows a unimodal distribution. In a unimodal distribution the mean, the median, and the mode should all be roughly the same value. In a special unimodal distribution known as a **_Normal Distribution_**, the mean and median are _exactly_ the same value!  

_Code Description:_

As always, the code begins by importing the necessary Python packages, in this case, Numpy (as `np`), Matplotlib (as `plt`) and SciPy.stats.

The code then uses the Numpy function `np.random.seed(42)` to set the random seed to a specific number, in this case `42` (see **Note** below). Setting the random seed in computer programs is essential for reproducibility. When you generate random numbers or perform any stochastic operations (like shuffling data), the outcome depends on the initial seed value. Setting the random seed is optional; it is done here only as a pedagogical device, so that your results match the output shown in this lesson. 
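
The effect of the seed can be seen by drawing the same small sample twice. In this sketch the variable names are invented for illustration; the seed value and the `np.random.normal()` arguments mirror the ones used in this lesson, but with only three values.

```python
import numpy as np

# Set the seed, then draw three random numbers
np.random.seed(42)
first_draw = np.random.normal(loc=5, scale=1, size=3)

# Reset the SAME seed and draw again
np.random.seed(42)
second_draw = np.random.normal(loc=5, scale=1, size=3)

print(first_draw)
print(second_draw)

# True means the two draws are identical, value for value
print(np.array_equal(first_draw, second_draw))
```

Without the second call to `np.random.seed(42)`, the second draw would continue the random sequence and produce different numbers.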

The next code fragment is the essence for generating large datasets containing random numbers. The Numpy function `np.random.normal()` generates a Numpy array called `unimodal_dat` containing 10 million random numbers (`size=10000000`) drawn from a normal distribution with a mean of `5.0` (`loc=5`) and a standard deviation of `1.0` (`scale=1`).  

The Matplotlib function `plt.hist()` is then used to visualize the data values in this Numpy array by creating a frequency histogram. Additional `plt` functions are used to label the x-axis, the y-axis and the title.

------------------
**Note:** In _The Hitchhiker’s Guide to the Galaxy_ by Douglas Adams, the number 42 is famously known as the “Answer to the Ultimate Question of Life, the Universe, and Everything.” This answer was calculated by an enormous supercomputer named _Deep Thought_ over a period of 7.5 million years. However, the catch is that no one actually knows what the question is!

In [None]:
# Demonstration 1: Generate and Plot Unimodal (Normal) distribution

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Set random seed
np.random.seed(42) 

# Generate data 
unimodal_dat = np.random.normal(loc=5, scale=1, size=10000000)

# Plot histogram
plt.hist(unimodal_dat, bins=500, density=True, alpha=0.6, color='b')

# Add labels and title
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Figure 2.13a: Unimodal Distribution')

# Show plot
plt.show()

If the code is correct, you should see the following histogram.

![____](https://biologicslab.co/BIO5853/images/Lesson_01_5/lesson_01_5_image08.png)

### Example 4: Compute Mean and Median using Numpy

In the examples above, we computed the mean and median from data stored in a particular column within a Pandas DataFrame using the `statistics` package. However, when working with a Numpy array, you can easily compute its mean and median using Numpy's built-in statistical functions as illustrated in the code in the cell below. 

In Example 4, we find the mean and median of the Numpy array `unimodal_dat` that was generated in the above code cell.

_Code Description:_

The following code chunk illustrates the use of the Numpy built-in statistical functions for finding the mean and median:

~~~text
# Compute mean and median using Numpy
Unimodal_mean=np.mean(unimodal_dat)
Unimodal_median=np.median(unimodal_dat)
~~~

The code then prints out the results. The number of decimal places to print is expanded to 6 places instead of the previous 2 places:

~~~text
# Print results
print(f"Unimodal data mean={Unimodal_mean:.6f}")
print(f"Unimodal data median={Unimodal_median:.6f}")
~~~

In [None]:
# Example 4: Compute mean and median using Numpy

import numpy as np

# Compute mean and median using Numpy
Unimodal_mean=np.mean(unimodal_dat)
Unimodal_median=np.median(unimodal_dat)

# Print results
print(f"Unimodal data mean={Unimodal_mean:.6f}")
print(f"Unimodal data median={Unimodal_median:.6f}")

If the code is correct you should see the following output:

~~~text
Unimodal data mean=4.999936
Unimodal data median=5.000023
~~~

As expected, the mean and median of our dataset `unimodal_dat`, which contains 10 million random numbers, are both approximately equal to `5` but not exactly `5.0000`. Why? Because we used a _random_ number generator to generate the 10 million values in our `unimodal_dat` array.

### **Demonstration 2: Bimodal Distribution**

If the distribution of values in a dataset is symmetric, but **_bimodal_**, so that the corresponding frequency histogram has two peaks (**Figure 2.13(b)**), then the mean and median will also be roughly the same value. Note, however, that this common value will lie _between_ the two peaks, and hence be a measurement that is extremely unlikely to occur. In other words, using the mean or median to describe the **_central tendency_** of a bimodal dataset would be very misleading since very few--or none--of the data values will have this "average" value. 

A bimodal distribution often indicates that the population from which the values are taken actually consists of two distinct subgroups that differ in the characteristic being measured. For instance, female black bears have an average weight of 175 pounds, while males average around 400 pounds, resulting in a bimodal distribution of weights. In this situation, it might be better to report two modes rather than the mean or the median, or to treat the two subgroups separately.  
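
The black bear example can be sketched with a handful of numbers. The eight weights below are invented for illustration, centered on the 175-pound and 400-pound averages mentioned above; they are not real measurements. The sketch assumes Python 3.8 or later for `statistics.multimode()`.

```python
import statistics

# Hypothetical weights in pounds for two subgroups (invented values)
group_a = [170, 175, 175, 180]
group_b = [395, 400, 400, 405]
combined = group_a + group_b

combined_mean = statistics.mean(combined)
combined_median = statistics.median(combined)

# Collect the observations that fall within 50 pounds of the mean
near_mean = [w for w in combined if abs(w - combined_mean) <= 50]

print(f"Mean={combined_mean:.1f}  Median={combined_median:.1f}")
print(f"Values within 50 lb of the mean: {len(near_mean)}")
print(f"Modes={statistics.multimode(combined)}")
```

The mean and median agree with each other, yet no observation lies anywhere near them; the two modes describe these data far better.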

_Code Description:_

The code is essentially the same as shown in Demonstration 1 except that two datasets are generated each containing 10 million values:

~~~text
# Generate two normal distributions (bimodal)
bimodal_dat1 = np.random.normal(loc=5, scale=1, size=10000000)
bimodal_dat2 = np.random.normal(loc=10, scale=1, size=10000000)
~~~

Note that the first dataset, `bimodal_dat1`, is centered at the value 5 (`loc=5`) while the second dataset `bimodal_dat2` is centered at the value 10 (`loc=10`).

The next code chunk "adds" (`np.concatenate`) the two Numpy arrays together to make one giant array `bimodal_dat` containing 20 million random numbers:

~~~text
# Combine the two datasets
bimodal_dat = np.concatenate([bimodal_dat1, bimodal_dat2])
~~~

In [None]:
# Demonstration 2: Bimodal distribution

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Set random seed
np.random.seed(42)

# Generate two normal distributions (bimodal)
bimodal_dat1 = np.random.normal(loc=5, scale=1, size=10000000)
bimodal_dat2 = np.random.normal(loc=10, scale=1, size=10000000)

# Combine the two datasets
bimodal_dat = np.concatenate([bimodal_dat1, bimodal_dat2])

# Plot histogram
plt.hist(bimodal_dat, bins=500, density=True, alpha=0.6, color='b')

# Add labels and title
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Figure 2.13b: Bimodal Distribution')

# Show plot
plt.show()

If the code is correct, you should see the following histogram.

![____](https://biologicslab.co/BIO5853/images/Lesson_01_5/lesson_01_5_image09.png)

### **Exercise 4: Compute Mean and Median using Numpy**

Compute the mean and median of the Numpy array `bimodal_dat` that was generated in the code cell above using Numpy's built-in statistical functions. Call the value containing the mean `Bimodal_mean` and the value containing the median, `Bimodal_median`. 

Print out these two values with 6 decimal places as was illustrated in Example 4 above.

In [1]:
# Insert your code for Exercise 4 here 



If your code is correct you should see the following output:

~~~text
Bimodal data mean=7.500071
Bimodal data median=7.498994
~~~

### Example 5: Right-skewed Distribution

The data in **Figure 2.13(c)** are skewed to the right. When the data are not symmetric, as in this figure, the **_median_** is often a better measure of central tendency than the mean. This is because the mean is sensitive to extreme observations, being _pulled_ in the direction of the outlying data values. As a result, the mean ends up excessively inflated (i.e., too high). When data are right-skewed, the mean lies to the right of the median. 

_Code Description:_

To generate a skewed data distribution, we will use the `skewnorm` distribution that is part of the `scipy.stats` package; its `rvs()` method draws random values. To skew a data distribution to the right, you enter a **_positive_** skewness value (the argument `a`). Conversely, if you want your data distribution skewed to the left, set your skewness value to a negative number.

Since we want a right-skewed distribution, we set the skewness value to +5 as shown in the following code fragment:

~~~text
# Generate data using skewnorm
skewness = +5  # Positive values create right-skewed distribution
rightSkewed_arr = skewnorm.rvs(a=skewness, loc=max_value, size=num_values)
~~~

This creates a right-skewed Numpy array called `rightSkewed_arr` with 10 million values (`num_values = 10000000`).

The next step is to shift all of the values in `rightSkewed_arr` so that the smallest (minimum) value is equal to 0. This is accomplished by the following code chunk:

~~~text
# Shift the set so the minimum value is equal to zero
rightSkewed_arr = rightSkewed_arr - min(rightSkewed_arr)
~~~

The next step is to standardize all of the values in `rightSkewed_arr` by dividing each value in `rightSkewed_arr` by the largest value in the array as shown in the following code chunk:

~~~text
# Standardize all the values between 0 and 1
rightSkewed_arr = rightSkewed_arr / max(rightSkewed_arr)
~~~

Finally, each value in `rightSkewed_arr` is multiplied by `max_value` (here 100) before plotting the frequency histogram.

~~~text
# Multiply the standardized values by the maximum value
rightSkewed_arr = rightSkewed_arr * max_value
~~~
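
The three rescaling steps are easier to follow on a tiny array. The sketch below applies them to four invented values; the names `raw_arr`, `shifted_arr`, `unit_arr`, `scaled_arr` and `scale_max` are hypothetical and chosen so they do not overwrite any variable used in the lesson. It uses the array methods `.min()` and `.max()`, which give the same result as Python's `min()` and `max()` and are generally faster on large Numpy arrays.

```python
import numpy as np

# Hypothetical values, invented for illustration only
raw_arr = np.array([98.0, 99.0, 100.0, 102.0])
scale_max = 100

# Step 1: shift the values so the minimum is equal to zero
shifted_arr = raw_arr - raw_arr.min()

# Step 2: divide by the largest value so everything lies between 0 and 1
unit_arr = shifted_arr / shifted_arr.max()

# Step 3: multiply by the desired maximum value
scaled_arr = unit_arr * scale_max

print(shifted_arr)
print(unit_arr)
print(scaled_arr)
```

After the three steps the smallest value is 0 and the largest equals `scale_max`, while the relative spacing between the values is preserved.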



In [None]:
# Example 5: Right-skewed distribution

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skewnorm

# Generate Right-skewed data
num_values = 10000000  # 10 million values
max_value = 100

# Generate data using skewnorm
skewness = +5  # Positive values create right-skewed distribution
rightSkewed_arr = skewnorm.rvs(a=skewness, loc=max_value, size=num_values)

# Shift the set so the minimum value is equal to zero
rightSkewed_arr = rightSkewed_arr - min(rightSkewed_arr)

# Standardize all the values between 0 and 1
rightSkewed_arr = rightSkewed_arr / max(rightSkewed_arr)

# Multiply the standardized values by the maximum value
rightSkewed_arr = rightSkewed_arr * max_value

# Plot histogram
plt.hist(rightSkewed_arr, bins=500, density=True, color='b', alpha=0.8)

# Label plot
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Figure 2.13c: Right-Skewed Distribution')

# Show plot
plt.show()

If the code is correct, you should see the following histogram.

![____](https://biologicslab.co/BIO5853/images/Lesson_01_5/lesson_01_5_image10.png)
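The claim that the mean lies to the right of the median in right-skewed data can be checked numerically. The following is a minimal sketch that draws its own, smaller sample (100,000 values) with a fixed `random_state`; the variable names are illustrative and are not used elsewhere in this lesson.

```python
import numpy as np
from scipy.stats import skewnorm

# Draw a right-skewed sample (positive skewness value);
# random_state makes the draw reproducible
sample = skewnorm.rvs(a=5, size=100_000, random_state=42)

# Compare the two measures of central tendency
sample_mean = np.mean(sample)
sample_median = np.median(sample)

print(f"Mean={sample_mean:.3f}")
print(f"Median={sample_median:.3f}")
print("Mean is larger than median:", sample_mean > sample_median)
```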

### **Exercise 5: Left-skewed Distribution**

For **Exercise 5**, you are to create a left-skewed data distribution as shown in **Figure 2.13(d)**. In a left-skewed distribution, the mean is _pulled_ to the left, so it ends up excessively **_deflated_**. Call your dataset `leftSkewed_arr`.

In [None]:
# Insert your code for Exercise 5 here



If your code is correct, you should see the following histogram.

![____](https://biologicslab.co/BIO5853/images/Lesson_01_5/lesson_01_5_image11.png)

### **Demonstration 3: Two Normal Distributions with Different Dispersions**

**Figure 2.14**, on page 39 of your textbook, shows two very different data distributions, even though both distributions have the same mean and median values. In other words, in order to characterize a dataset, we need to know not only its central tendency (i.e., mean and/or median), but also its **_measure of dispersion_** -- how "spread out" the data is.

The code in the cell below uses the NumPy function `np.random.normal()` to generate two normal distributions, one with a narrow dispersion (`narrow_dat`) and one with a wide dispersion (`wide_dat`). Both distributions have the same mean (`loc=5`) but have different dispersions (_narrow_:`scale=1`, _wide_:`scale=5`).

In [None]:
# Demonstration 3: Two normal distributions with different dispersions

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Set random seed
np.random.seed(42)

# Generate data for narrow and wide dispersions
narrow_dat = np.random.normal(loc=5, scale=1, size=1000000)
wide_dat = np.random.normal(loc=5, scale=5, size=1000000)

# Plot the histogram
plt.hist(narrow_dat, bins=500, density=True, alpha=0.6, color='b')
plt.hist(wide_dat, bins=500, density=True, alpha=0.4, color='r')

# Add labels and title
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Figure 2.14: Narrow and Wide Data Dispersions')

plt.xlim(-10, 20)
plt.ylim(0, 0.45)

# Show the plot
plt.show()

If the code is correct, you should see the following image:

![____](https://biologicslab.co/BIO5853/images/Lesson_01_5/lesson_01_5_image04.png)

In the image above, the data with the narrow dispersion (`scale=1`) is shown in blue while the data with the wide dispersion (`scale=5`) is shown in red.
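The same point can be made with numbers instead of a plot. The sketch below generates two smaller samples with the same `loc` and `scale` settings and prints their mean, median, and standard deviation. It assumes NumPy's `default_rng` generator rather than `np.random.seed`, and the names `narrow` and `wide` are illustrative.

```python
import numpy as np

# Reproducible random number generator
rng = np.random.default_rng(42)

# Same mean (loc=5), different dispersions (scale=1 vs. scale=5)
narrow = rng.normal(loc=5, scale=1, size=100_000)
wide = rng.normal(loc=5, scale=5, size=100_000)

# Summarize each sample
for name, data in [("narrow", narrow), ("wide", wide)]:
    print(f"{name}: mean={np.mean(data):.2f}, "
          f"median={np.median(data):.2f}, "
          f"SD={np.std(data, ddof=1):.2f}")
```

The means and medians should be nearly identical, while the standard deviations should differ by roughly a factor of five.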

## **Range**

The **_range_** represents the spread of data (i.e. _dispersion_) from the lowest to the highest value in a distribution. It’s a measure of variability that helps summarize the extent of differences within your dataset. 

**Importance of Range:**

* The range provides a quick overview of variability.
* When combined with measures of central tendency (like mean or median), it helps describe the span of the distribution.
* However, be cautious with outliers: A single extreme value can significantly affect the range.
* For a clearer picture of variability, consider using other measures like interquartile range or standard deviation alongside the range.
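The sensitivity of the range to outliers is easy to demonstrate. The sketch below uses a small set of made-up blood pressure readings (not taken from the Pima Indian dataset) and shows how a single extreme value changes the range.

```python
import numpy as np

# Hypothetical blood pressure readings (mmHg), for illustration only
readings = np.array([62, 68, 70, 72, 74, 78, 80])

# The same readings plus one extreme value
with_outlier = np.append(readings, 140)

# Range = largest value minus smallest value
range_before = readings.max() - readings.min()
range_after = with_outlier.max() - with_outlier.min()

print(f"Range without outlier={range_before} mmHg")
print(f"Range with outlier={range_after} mmHg")
```

One added reading moves the range from 18 mmHg to 78 mmHg, even though the other seven values are unchanged.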

### Example 6: Compute Range

The range is simple to compute. It is simply the largest value in a dataset minus the smallest value in the dataset.

In [None]:
# Example 6: Compute range

BldPressure_max = max(pimaDF.BloodPressure)
BldPressure_min = min(pimaDF.BloodPressure)
BldPressure_range = BldPressure_max - BldPressure_min
print(f"Blood pressure range={BldPressure_range:.2f} mmHg")

If the code is correct, you should see the following output:

~~~text
Blood pressure range=122.00 mmHg
~~~

### **Exercise 6: Compute Range**

In the cell below, write the code to compute the range for Height in the Obesity Prediction dataset. Call the variable holding the range `Height_range`. Print out your results. 

In [None]:
# Insert your code for Exercise 6 here



If the code is correct, you should see the following output:

~~~text
Height range=65.30 cm
~~~

## **Interquartile Range**

The **_interquartile range_** (IQR) is a second measure of statistical dispersion that tells you how spread out the data is within the middle half of your distribution. Here are the key points:

* **Definition:** The IQR represents the difference between the third quartile (Q3) and the first quartile (Q1).
* **Quartiles:** Quartiles divide an ordered dataset from low to high into four equal parts.
* **Q1:** The value below which 25% of the distribution lies.
* **Q3:** The value below which 75% of the distribution lies.

**Calculation:**

$$ \text{IQR} = Q_3 - Q_1 $$

* **Visualize:** Boxplots often display the IQR as the range between the box’s edges.

Remember, while the range gives you the spread of the entire dataset, the IQR focuses on the middle half.
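As a quick illustration of the definition, the sketch below computes Q1, Q3, and the IQR for nine made-up values using `np.percentile()`. NumPy's default (linear interpolation) method is assumed; other software may use slightly different quartile conventions.

```python
import numpy as np

# Hypothetical ordered data, for illustration only
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# First and third quartiles (25th and 75th percentiles)
q1, q3 = np.percentile(data, [25, 75])

# Interquartile range
iqr = q3 - q1

print(f"Q1={q1:.1f}, Q3={q3:.1f}, IQR={iqr:.1f}")
```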

### **Create Function to Compute IQR**

In the following code cell, we create our own function called `find_iqr()` to compute the Interquartile Range (IQR) for a Numpy array. 

In [None]:
# Define a function to calculate IQR
def find_iqr(x):
    return np.subtract(*np.percentile(x, [75, 25]))

If the code is correct, you should **not** see any output when you run the above cell. We will use our custom function in the next example.
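The one-line body of `find_iqr()` is compact, so it may help to see it unpacked. `np.percentile(x, [75, 25])` returns the two values Q3 and Q1 (in that order), and the `*` operator passes them as separate arguments to `np.subtract()`, which computes Q3 - Q1. The sketch below repeats the function definition so that it runs on its own, and compares it with a step-by-step version and with `scipy.stats.iqr`, using made-up data.

```python
import numpy as np
from scipy.stats import iqr

# Same function as defined above, repeated so this cell is self-contained
def find_iqr(x):
    return np.subtract(*np.percentile(x, [75, 25]))

# Hypothetical data, for illustration only
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# The same calculation written out step by step
q3, q1 = np.percentile(data, [75, 25])
step_by_step = q3 - q1

# Library function for comparison
library_value = iqr(data)

print(f"find_iqr={find_iqr(data):.1f}")
print(f"step by step={step_by_step:.1f}")
print(f"scipy.stats.iqr={library_value:.1f}")
```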

### Example 7: Compute IQR

In Example 7 we use our new function to find the IQR for the blood pressure data in the Pima Indian dataset. 

_Code Description:_

The following code chunk uses our function `find_iqr()`:

~~~text
BldPressure_iqr = pimaDF[['BloodPressure']].apply(find_iqr)
~~~

The code uses the pandas `apply()` method to create a new variable called `BldPressure_iqr`. The `apply()` method will be discussed later in this course.

In [None]:
# Example 7: Compute IQR

# Calculate IQR
BldPressure_iqr = pimaDF[['BloodPressure']].apply(find_iqr)

# Print
print(BldPressure_iqr)

If the code is correct, you should see the following output:

~~~text
BloodPressure    18.0
dtype: float64
~~~

The variable `BldPressure_iqr` is a pandas Series rather than a single number, so a numeric format such as `:.2f` can't be applied to it directly with the `f` print function.
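If a formatted printout is wanted, one option is to pull the single number out of the Series first. The sketch below shows the idea on a small stand-in DataFrame called `demoDF` (a hypothetical name with made-up values, used here so the cell runs without `pimaDF`).

```python
import numpy as np
import pandas as pd

# Same function as defined earlier, repeated so this cell is self-contained
def find_iqr(x):
    return np.subtract(*np.percentile(x, [75, 25]))

# Hypothetical stand-in for pimaDF, for illustration only
demoDF = pd.DataFrame({"BloodPressure": [60, 70, 80, 90, 100]})

# apply() returns a Series with one entry per column
demo_iqr = demoDF[["BloodPressure"]].apply(find_iqr)

# Extract the single number from the Series by its label
demo_iqr_value = demo_iqr["BloodPressure"]

# A plain number can be formatted in the usual way
print(f"Blood pressure IQR={demo_iqr_value:.2f} mmHg")
```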

### **Exercise 7: Compute IQR**

In the cell below, write the Python code to compute the IQR for the height data in the Obesity Prediction dataset. Call the variable holding the IQR `Height_iqr` and print out its value.

In [None]:
# Insert your code for Exercise 7 here



If your code is correct, you should see the following output:

~~~text
Height    13.839391
dtype: float64
~~~

### **IQR and the Box Plot**

The interquartile range (IQR) and the box plot are closely related.

**Box Plot:**
A box plot (also known as a box-and-whisker plot) visually represents the distribution of a dataset.
It displays the following key statistics:
* The **median** (Q2), which is the middle value.
* The **first quartile** (Q1), which marks the 25th percentile.
* The **third quartile** (Q3), which marks the 75th percentile.
* The **whiskers**, which extend from the box toward the minimum and maximum values.
  
The “box” in the plot spans from Q1 to Q3, enclosing the middle 50% of the data.

**Interquartile Range (IQR):**
* The IQR is the difference between Q3 and Q1: (IQR = Q3 - Q1).
* It tells us how spread out the middle 50% of values are in a dataset.
* Larger IQR indicates greater variability within that central range.

**Relationship:**
* The width of the “box” in the box plot corresponds to the IQR.
* The whiskers extend to the minimum and maximum values within a certain range (often 1.5 times the IQR).

In summary, the IQR provides a measure of data spread, and the box plot visually represents this spread along with other quartiles.
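The "1.5 times the IQR" rule for the whiskers can be worked through by hand. The sketch below uses ten made-up values and computes the limits (often called fences) beyond which a point is drawn as an outlier. It assumes the common convention, which is also the Matplotlib default, that whiskers reach to the most extreme data values lying within 1.5 × IQR of the box.

```python
import numpy as np

# Hypothetical data with one extreme value, for illustration only
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles and IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Limits beyond which points are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1={q1:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"Fences: {lower_fence:.2f} to {upper_fence:.2f}")
print(f"Whiskers: {inside.min()} to {inside.max()}")
print(f"Outliers: {outliers.tolist()}")
```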

### Example 8: Boxplot

The boxplot and the Python code for generating it were covered in a previous lesson. The cell below generates a boxplot for the blood pressure values in the Pima Indian dataset. We know from Example 7 that the IQR for blood pressure is 18.0 mmHg. We would therefore expect this value to be the height of the box in our boxplot.

In [None]:
# Example 8: Boxplot

import matplotlib.pyplot as plt

# Create graphics environment
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(4, 4))

# Assign x-values
x=pimaDF.BloodPressure

# Use color from textbook
colors =['#919eb6']

# Assign values to markers and line styles
flierprops = dict(marker='o', markerfacecolor='black', markersize=6,
                  markeredgecolor='none')
medianprops = dict(linestyle='-', linewidth=1.0, color='black')

# Plot boxplot
bplot=ax.boxplot(x, notch=False,
           flierprops=flierprops,
           medianprops=medianprops,
           patch_artist=True,
           widths=0.6)

# Add markers
ax.flierprops = dict(marker='o', markerfacecolor='black', markersize=8,
                  linestyle='none')

# Add title
ax.set_title('Pima Indian Blood Pressures')

# Fill with colors
for patch, color in zip(bplot['boxes'], colors):
    patch.set_facecolor(color)

# Set axis
ax.set_xticks([])
ax.set_xlabel('')
ax.set_ylabel('Blood Pressure (mmHg)')

# Show plot
plt.show()

If the code is correct, you should see the following image:

![____](https://biologicslab.co/BIO5853/images/Lesson_01_5/lesson_01_5_image05.png)

The IQR calculated for the blood pressures in Example 7 was `18.0`. In the boxplot above, the "width" (i.e. height) of the grey box graphically represents the IQR of 18 mmHg.

### **Exercise 8: Boxplot**

In the cell below, write the Python code to create a boxplot of the height values in the Obesity Prediction dataset.


In [None]:
# Insert your code for Exercise 8 here



If the code is correct, you should see the following image:

![___](https://biologicslab.co/BIO5853/images/Lesson_01_5/lesson_01_5_image06.png)

The IQR calculated for height in Exercise 7 was `13.8`. In the boxplot above, the "width" (i.e. height) of the "box" (colored rectangle) graphically represents the IQR and should be 13.8 cm.

## **Variance**

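**_Variance_** measures the spread or dispersion of a set of data points around their mean (average). It quantifies how much individual data points deviate from the central tendency. Here’s how the sample variance ($s^2$) is calculated:

$$ s^2 = \frac{1}{(n-1)} \sum_{i=1}^{n} (x_i - \overline{x})^2 $$

An algebraically equivalent form uses the squared differences between every pair of observations instead of the mean:

$$ s^2 = \frac{1}{2n(n-1)} \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} (x_i - x_j)^2 $$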

The variance plays a crucial role in statistics, and here’s why it’s significant:

1. **Measuring Spread:**
* Variance quantifies how much individual data points deviate from the **mean** (average).
* It provides a measure of **spread** or **dispersion** within a dataset.
* Larger variance indicates greater variability, while smaller variance means data points are closer to the mean.

2. **Key Properties:**
* Variance is always **non-negative** (it can’t be negative).
* If all data points are identical, the variance is zero.
* It’s sensitive to **outliers**—extreme values can significantly impact variance.
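The sample variance can be written either in terms of squared deviations from the mean or in terms of squared differences between every pair of observations; the two forms give the same number. The sketch below checks this on eight made-up values and compares both results with `statistics.variance()`. The variable names are illustrative only.

```python
import statistics
import numpy as np

# Hypothetical data, for illustration only
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(data)

# Mean-based form: sum of squared deviations from the mean, divided by n-1
mean_form = np.sum((data - data.mean()) ** 2) / (n - 1)

# Pairwise form: squared differences between all pairs (x_i, x_j);
# pairs with i == j contribute zero, so including them changes nothing
diffs = data[:, None] - data[None, :]
pairwise_form = np.sum(diffs ** 2) / (2 * n * (n - 1))

# Library value for comparison
library_value = statistics.variance(data.tolist())

print(f"Mean-based form={mean_form:.4f}")
print(f"Pairwise form={pairwise_form:.4f}")
print(f"statistics.variance={library_value:.4f}")
```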


### Example 9: Compute Variance

The code in the cell below uses the `statistics` package to compute the variance of the blood pressure values in the Pima Indian dataset.

_Code Description:_

The following code chunk is used to round off the value in `BldPressure_variance` to two decimal places. 

~~~text
# Create rounded-off value as a string
BldPressure_var_str = str(round(BldPressure_variance,2))
~~~

It should be noted that this produces a string variable called `BldPressure_var_str` while leaving the original value `BldPressure_variance` as a number. This was done in order to use "fancy" formatting in the print statement.

~~~text
# Print result with fancy formatting
print("Blood pressure variance={} mmHg\u00b2".format(BldPressure_var_str))
~~~

The fancy formatting is used to create the exponent value `2`.

In [None]:
# Example 9: Compute variance

import statistics

# Assign x
x = pimaDF.BloodPressure

# Calculate the variance
BldPressure_variance = statistics.variance(x)

# Create rounded-off value as a string
BldPressure_var_str = str(round(BldPressure_variance,2))

# Print result with fancy formatting
print("Blood pressure variance={} mmHg\u00b2".format(BldPressure_var_str))

If the code is correct, you should see the following output:

~~~text
Blood pressure variance=374.65 mmHg²
~~~

The string `mmHg²` was generated by adding the "escape" value: `\u00b2` to the string.

It should be noted that units for the blood pressure variance are "squared millimeters of mercury" (`mmHg²`) even though these units do **not** make any physical sense!
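As an alternative sketch, the same superscript can be produced with the `f` print function, since the `\u00b2` escape also works inside an f-string. The example below is not the approach used in Example 9; it simply hard-codes the variance reported there so that it runs without `pimaDF`.

```python
# Variance value taken from the Example 9 output (hard-coded for illustration)
variance_value = 374.65

# Round to two decimal places and append the superscript 2 in one step
message = f"Blood pressure variance={variance_value:.2f} mmHg\u00b2"

print(message)
```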

### **Exercise 9: Compute Variance**

In the cell below, compute the variance of the height values in the Obesity Prediction dataset using Example 9 as a template. Print out your value for the variance. The units for your variance should be `cm²`.

In [None]:
# Insert your code for Exercise 9 here



If your code is correct, you should see the following output:

~~~text
Height variance=106.3 cm²
~~~

Again, the units for variance, `cm²`, don't make any physical sense when talking about a person's height.

## **Standard Deviation**

**_Standard Deviation_** measures the amount of variation or spread in a dataset around its mean (average). Here is how the standard deviation ($s$) is computed:

$$ s = \sqrt{s^2} = \sqrt{ \frac{1}{(n-1)} \sum_{i=1}^{n} (x_i - \overline{x})^2 }   $$

As you can see from the first part of this equation, the standard deviation is simply the square root of the variance:

$$ s = \sqrt{s^2} $$

In biostatistics, the standard deviation is used more frequently than the variance. This is primarily because the standard deviation has the _same_ units of measurement as the mean, rather than squared units.
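One practical detail when computing the standard deviation in Python is the divisor. `statistics.stdev()` uses $n-1$, matching the formula above, whereas `np.std()` divides by $n$ unless `ddof=1` is supplied. The sketch below compares the two on made-up values and also confirms that the standard deviation equals the square root of the variance.

```python
import math
import statistics
import numpy as np

# Hypothetical data, for illustration only
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Sample standard deviation (divides by n-1)
sd_statistics = statistics.stdev(data)
sd_numpy_sample = np.std(data, ddof=1)

# NumPy default (ddof=0) divides by n instead, giving a smaller value
sd_numpy_default = np.std(data)

# Square root of the sample variance
sd_from_variance = math.sqrt(statistics.variance(data))

print(f"statistics.stdev={sd_statistics:.4f}")
print(f"np.std with ddof=1={sd_numpy_sample:.4f}")
print(f"np.std default={sd_numpy_default:.4f}")
print(f"sqrt of variance={sd_from_variance:.4f}")
```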

Here is what the standard deviation tells you about a data set.

**Spread of Data:**
* A _low_ standard deviation indicates that values tend to be close to the mean, suggesting less variability.
* A _high_ standard deviation means that values are spread out over a wider range, indicating greater variability.

**Normal Distributions:**
Standard deviation is particularly useful for normal distributions, where data is symmetrically distributed around the mean.

Many scientific variables (like height, test scores, or job satisfaction ratings) are approximately normally distributed.

### Example 10: Compute Standard Deviation

The code in the cell below uses the `statistics` package to compute the standard deviation from the `pimaDF` DataFrame column `BloodPressure`. In addition, it uses the `math` package to compute the square root of the `BldPressure_variance` computed in Example 9. Both values are printed out using the normal `f` print function.

In [None]:
# Example 10: Compute standard deviation

import statistics
import math

# Assign x
x = pimaDF.BloodPressure

# Calculate standard deviation
BldPressure_SD = statistics.stdev(x)

# Print result
print(f"Blood pressure standard deviation={BldPressure_SD:.2f} mmHg")

# Calculate square root
sqrtBldPressure_var=math.sqrt(BldPressure_variance)

# Print result
print(f"Square root of Blood pressure variance={sqrtBldPressure_var:.2f} mmHg")

If the code is correct, you should see the following output:

~~~text
Blood pressure standard deviation=19.36 mmHg
Square root of Blood pressure variance=19.36 mmHg
~~~

As you can see, the standard deviation is simply the square root of the variance. You should also note that the units are no longer squared, but have the same units as the mean blood pressure (i.e. `mmHg`).

### **Exercise 10: Compute Standard Deviation**

In the cell below, compute the standard deviation for the height measurements in the DataFrame `obDF` and the square root of the height variance that you computed in **Exercise 9**. Print out both values using the `f` print function.


In [None]:
# Insert your code for Exercise 10 here



If the code is correct, you should see the following output:

~~~text
Height standard deviation=10.31 cm
Square root of height variance=10.31 cm
~~~

As was shown in Example 10, the standard deviation is simply the square root of the variance. Once again, note that the units are no longer squared, but have the same units as the mean (i.e. `cm`).

### **Empirical Rule**

The **_Empirical Rule_**, also called the **68-95-99.7 Rule**, applies only to normally distributed data.

In a normal distribution:
* Around 68% of scores fall within **1 standard deviation** of the mean.
* Around 95% fall within **2 standard deviations**.
* Approximately 99.7% fall within **3 standard deviations**.

For example, if you have a memory recall test with a mean score of 50 and a standard deviation of 10:
* About 68% of scores are between 40 and 60.
* About 95% are between 30 and 70.
* Almost all (99.7%) are between 20 and 80.

Standard deviation helps us understand data variability and make informed decisions! 

![___](https://biologicslab.co/BIO5853/images/Lesson_01_5/lesson_01_5_image07.png)
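The Empirical Rule can be checked by simulation. The sketch below draws 200,000 simulated scores from a normal distribution with a mean of 50 and a standard deviation of 10 (the values from the memory recall example) and counts the fraction of scores within 1, 2, and 3 standard deviations of the mean. The sample size, seed, and variable names are arbitrary choices made for illustration.

```python
import numpy as np

# Reproducible random number generator
rng = np.random.default_rng(0)

# Simulated test scores: mean 50, standard deviation 10
scores = rng.normal(loc=50, scale=10, size=200_000)

# Sample mean and sample standard deviation
scores_mean = scores.mean()
scores_sd = scores.std(ddof=1)

# Fraction of scores within k standard deviations of the mean
fractions = {}
for k in (1, 2, 3):
    within = np.abs(scores - scores_mean) <= k * scores_sd
    fractions[k] = within.mean()
    print(f"Within {k} SD: {fractions[k] * 100:.1f}%")
```

The printed percentages should be close to 68%, 95%, and 99.7%, with small differences from run to run if the seed is changed.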

## **Lesson Turn-in**

When you have run all of the code cells, print out a PDF copy of your COLAB notebook. You should name your PDF `Lesson_01_5_lastname.pdf`, where _lastname_ is your last name. Upload the PDF to Lesson_01_5 on Canvas for grading.