In [4]:
%%capture
%%sh
chmod u+x ./helpful-script.sh
./helpful-script.sh setup

In [5]:
import otter
grader = otter.Notebook('GAI-E08.ipynb')

# Week 8 GenAI Learning Log

## Your Mission: Understand Populations, Samples, and Distributions

Welcome to your next GenAI learning module! 🎉 This week, we transition from probability to the core of statistical inference: **sampling and describing data**. Last week, you focused on **Probability**—the mathematics of chance. This assignment will guide you through the process of correctly gathering data (**sampling**), visualizing it (**distributions**), and summarizing it numerically (**summary statistics**). As always, your AI assistant is here to help you explore these concepts and prepare for hands-on practice in class. Let's dive in!

### What You'll Need

* Access to TerrierGPT or ChatGPT (or your preferred AI assistant)
* This notebook for recording your responses
* About 2-3 hours of focused exploration time (but not necessarily all at once!)

**Important:** This is a GREEN ZONE assignment. AI collaboration is not just allowed but encouraged!

## Part 1: Sampling Concepts (75 min)

**Your Mission:** Understand the foundational concepts of gathering data from large groups.

---

### Question 1.1: Population vs. Sample

Ask your AI assistant:

"I need to understand how we get data from the real world: **What's the difference between a population and a sample?**"

In the variable `pop_vs_sample`, define a **population** and a **sample**, and give one simple example that illustrates the difference (e.g., all students in a university vs. 100 students surveyed).

In [6]:
pop_vs_sample = "A population is the complete set of individuals or items that share a common characteristic and that a researcher wants to study. A sample is a smaller group selected from the population that is used to represent the whole population for research purposes. For example, all students at a university form a population, while 100 of those students who are surveyed form a sample."

In [7]:
grader.check("q1.1")

In [8]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 804 bytes | 804.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/Calvinl1017/GAI-E08.git
   2804b7f..be9b119  main -> main
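
The population/sample distinction can be sketched in a few lines of Python. This is a minimal illustration, not part of the assignment: the heights are simulated, and the names `population` and `sample` are hypothetical.

```python
import random
import statistics

# Hypothetical population: heights (cm) of 10,000 students, simulated for illustration.
rng = random.Random(0)
population = [rng.gauss(170, 10) for _ in range(10_000)]

# A sample is a subset drawn from the population; here, 100 students chosen at random.
sample = rng.sample(population, k=100)

population_mean = statistics.mean(population)  # a parameter (usually unknown in practice)
sample_mean = statistics.mean(sample)          # a statistic (what we can actually compute)

print(f"population size: {len(population)}, mean: {population_mean:.1f}")
print(f"sample size:     {len(sample)}, mean: {sample_mean:.1f}")
```

The sample mean should land close to the population mean, though not exactly on it; that gap is sampling variability.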


### Question 1.2: Why We Sample

Ask your AI assistant:

"I need to understand how we get data from the real world: **Why can't we always study the entire population?**"

In the variable `why_sample`, state two primary reasons (e.g., cost, feasibility, time) why studying the entire population is often impossible or impractical.

In [9]:
why_sample = "Studying the entire population is often impossible or impractical primarily because of cost and feasibility. Collecting data from every individual in a large population can require an enormous amount of money and resources. Additionally, it may be infeasible due to time constraints or because some members of the population are difficult to access."

In [10]:
grader.check("q1.2")

In [11]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 400 bytes | 400.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/Calvinl1017/GAI-E08.git
   be9b119..570ece9  main -> main


### Question 1.3: Good vs. Bad Samples

Ask your AI assistant:

"I need to understand how we get data from the real world: **What makes a good sample vs a bad sample?**"

In the variable `good_vs_bad_sample`, describe the key characteristic that determines a **good sample** (e.g., representativeness) and one example of a **bad sample** (e.g., convenience sample).

In [12]:
good_vs_bad_sample = "A good sample is one that is representative of the population, meaning it accurately reflects the characteristics and diversity of the entire group. This ensures that conclusions drawn from the sample apply to the larger population. A bad sample would be a convenience sample, where participants are chosen based on ease of access rather than randomness, which can lead to biased or unrepresentative results."

In [13]:
grader.check("q1.3")

In [14]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 410 bytes | 410.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/Calvinl1017/GAI-E08.git
   570ece9..947a034  main -> main


### Question 1.4: Sampling Methods

Ask your AI assistant:

"I need to understand how we get data from the real world: **What are different sampling methods and when do we use each?**"

In the variable `sampling_methods`, describe two different types of random sampling methods (e.g., simple random sampling, stratified sampling) and briefly explain when a data scientist would choose one over the other.

In [15]:
sampling_methods = "Simple random sampling is a method where each member of the population has an equal chance of being selected. It is useful when the population is relatively uniform and no specific subgroups need to be considered. Stratified sampling involves dividing the population into distinct subgroups, or strata, based on shared characteristics and then randomly sampling from each stratum. This method is preferred when the population has important subgroups, and the data scientist wants to guarantee representation from each group, which improves the precision of estimates and avoids under-representing small groups."

In [16]:
grader.check("q1.4")

In [17]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 408 bytes | 408.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/Calvinl1017/GAI-E08.git
   947a034..56ac213  main -> main
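
The two methods described above can be compared side by side. The sketch below assumes a made-up population of 900 undergraduates and 100 graduate students and uses proportional allocation for the stratified sample; other allocation schemes exist.

```python
import random
from collections import Counter

rng = random.Random(1)

# Hypothetical population: 900 undergraduates and 100 graduate students.
population = [("undergrad", i) for i in range(900)] + [("grad", i) for i in range(100)]

# Simple random sampling: every individual has the same chance of selection.
srs = rng.sample(population, k=50)

# Stratified sampling (proportional allocation): sample within each stratum
# in proportion to its share of the population.
strata = {}
for person in population:
    strata.setdefault(person[0], []).append(person)

stratified = []
for group, members in strata.items():
    n_group = round(50 * len(members) / len(population))
    stratified.extend(rng.sample(members, k=n_group))

print("simple random:", Counter(group for group, _ in srs))
print("stratified:   ", Counter(group for group, _ in stratified))
```

The stratified sample always contains 45 undergraduates and 5 graduate students, while the simple random sample's split varies from draw to draw.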


### Question 1.5: Sampling Method Exercise

Imagine two researchers want to sample BU students for a survey:

1.  **Ben** took every single BU student email address and randomly chose 1,000 to send his survey link.
2.  **Jerry** stood outside the CDS building from 10 am to 12 pm and asked students passing by to fill out his survey.

In the variable `sampling_comparison`, identify what Ben's and Jerry's sampling methods are called, and explain which method is better for achieving a **representative sample** and why.

In [20]:
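sampling_comparison = "Ben's sampling method is called simple random sampling because he randomly selected 1,000 students from the complete list of BU student email addresses, so every student had an equal chance of being chosen. Jerry's method is convenience sampling because he surveyed only students who happened to pass the CDS building between 10 am and 12 pm. Ben's method is better for achieving a representative sample: random selection does not favor any group, whereas Jerry's sample over-represents students who have a reason to be near CDS in the late morning and misses everyone else."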

In [21]:
grader.check("q1.5")

In [22]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 767 bytes | 767.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/Calvinl1017/GAI-E08.git
   56ac213..35bd379  main -> main
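
A small simulation can make the Ben-versus-Jerry contrast concrete. Everything numeric here is an assumption chosen for illustration (the share of CDS majors, their weekly coding hours, and the chance of walking past the building); none of it comes from real BU data.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical population of 10,000 students. Assumption for illustration only:
# 20% are CDS majors who code about 15 hours/week; everyone else about 3 hours/week.
students = []
for _ in range(10_000):
    is_cds = rng.random() < 0.20
    hours = rng.gauss(15, 3) if is_cds else rng.gauss(3, 1)
    students.append((is_cds, hours))

true_mean = statistics.mean(h for _, h in students)

# Ben: simple random sample of 1,000 students from the full list.
ben = rng.sample(students, k=1_000)

# Jerry: only students who walk past the CDS building. Assumed here: CDS majors
# pass by with probability 0.5, other students with probability 0.05.
jerry = [s for s in students if rng.random() < (0.5 if s[0] else 0.05)]

ben_mean = statistics.mean(h for _, h in ben)
jerry_mean = statistics.mean(h for _, h in jerry)

print(f"true mean:  {true_mean:.1f} hours/week")
print(f"Ben's mean: {ben_mean:.1f} (simple random sample)")
print(f"Jerry's:    {jerry_mean:.1f} (convenience sample)")
```

Under these assumptions Ben's estimate stays near the true mean, while Jerry's is pulled upward because CDS majors are over-represented among passers-by.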



--- 

## Part 2: Sampling Bias and Problems (60 min)

**Your Mission:** Identify potential pitfalls and errors that can undermine data quality.

---

### Question 2.1: What is Sampling Bias?

Ask your AI assistant:

"What can go wrong with sampling? **What is sampling bias and how does it happen?**"

In the variable `sampling_bias_def`, define **sampling bias** and explain one common way it can occur (e.g., self-selection bias, undercoverage).

In [23]:
sampling_bias_def = "Sampling bias occurs when the sample collected is not representative of the population, leading to systematic errors in the results. One common way sampling bias happens is through self-selection bias, where participants choose to be part of the sample, often leading to overrepresentation of certain views or characteristics and underrepresentation of others."

In [24]:
grader.check("q2.1")

In [25]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 388 bytes | 388.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/Calvinl1017/GAI-E08.git
   35bd379..48a1e6a  main -> main


### Question 2.2: Examples of Biased Samples

Ask your AI assistant:

"What can go wrong with sampling? **Give me examples of biased samples from real surveys or studies**"

In the variable `biased_example`, describe one concrete historical or theoretical example of a biased sample and explain why it failed to represent the population accurately.

In [26]:
biased_example = "One famous example of a biased sample is the 1936 Literary Digest presidential poll, which predicted that Alf Landon would defeat Franklin D. Roosevelt in the U.S. presidential election; Roosevelt won in a landslide. The magazine mailed questionnaires to a very large list of people drawn from its own subscribers, telephone directories, and automobile registration lists. During the Great Depression, this led to undercoverage because poorer citizens, who made up a large share of voters, were less likely to own phones or cars and were therefore excluded, resulting in a sample that was not representative of the entire voting population."

In [27]:
grader.check("q2.2")

In [28]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 2 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 2.31 KiB | 2.31 MiB/s, done.
Total 4 (delta 3), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (3/3), completed with 3 local objects.
To https://github.com/Calvinl1017/GAI-E08.git
   48a1e6a..3ebd2c6  main -> main


### Question 2.3: The Effect of Response Rates

Ask your AI assistant:

"What can go wrong with sampling? **How do response rates affect sample quality?**"

In the variable `response_rate_effect`, explain the potential problem that arises when a survey has a low response rate (i.e., non-response bias).

In [29]:
response_rate_effect = "When a survey has a low response rate, non-response bias can occur, meaning the individuals who choose not to respond may have different opinions or characteristics from those who do respond. This can skew the results because the sample no longer accurately reflects the entire population, reducing the reliability and generalizability of the findings."

In [30]:
grader.check("q2.3")

In [31]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 417 bytes | 417.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/Calvinl1017/GAI-E08.git
   3ebd2c6..be0fde6  main -> main
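
Non-response bias can be simulated directly. The sketch below assumes a hypothetical customer survey in which dissatisfied customers reply three times as often as satisfied ones; the response probabilities are invented for illustration.

```python
import random

rng = random.Random(3)

# Hypothetical population: 10,000 customers, exactly half satisfied (True).
population = [True] * 5_000 + [False] * 5_000

# Assumption for illustration: dissatisfied customers are more motivated to reply.
# Satisfied customers respond 10% of the time, dissatisfied customers 30%.
respondents = [
    satisfied for satisfied in population
    if rng.random() < (0.10 if satisfied else 0.30)
]

response_rate = len(respondents) / len(population)
observed_satisfaction = sum(respondents) / len(respondents)

print(f"response rate:         {response_rate:.0%}")
print("true satisfaction:     50%")
print(f"observed satisfaction: {observed_satisfaction:.0%}")
```

Even though half of the population is satisfied, the respondents suggest a much lower figure, because who answers depends on the very thing being measured.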


### Question 2.4: Representative Samples

Ask your AI assistant:

"What can go wrong with sampling? **What is a representative sample and why is it important?**"

In the variable `representative_sample_importance`, define a **representative sample** and explain why achieving one is the main goal of statistical sampling.

In [32]:
representative_sample_importance = "A representative sample is a subset of a population that accurately reflects the characteristics and diversity of the entire population. Achieving a representative sample is the main goal of statistical sampling because it ensures that conclusions drawn from the sample can be generalized to the whole population, providing valid and reliable results."

In [33]:
grader.check("q2.4")

In [34]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 429 bytes | 429.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/Calvinl1017/GAI-E08.git
   be0fde6..731d225  main -> main



--- 

## Part 3: Data Distributions (90 min)

**Your Mission:** Understand how to visualize and interpret the shape of your data.

---

### Question 3.1: What is a Distribution?

Ask your AI assistant:

"Now I want to understand how data is distributed: **What is a distribution and why do we care about the shape of data?**"

In the variable `distribution_meaning`, define a **data distribution** and explain why its shape is a critical piece of information for data analysis.

In [35]:
distribution_meaning = "A data distribution describes how values are spread or arranged across different possible outcomes in a dataset. The shape of the distribution—whether it is symmetric, skewed, uniform, or has multiple peaks—is critical because it informs analysts about the underlying patterns, helps identify outliers, guides the choice of statistical methods, and influences the interpretation of the data."

In [36]:
grader.check("q3.1")

### Question 3.2: Histograms

Ask your AI assistant:

"Now I want to understand how data is distributed: **What do histograms tell us about our data?**"

In the variable `histogram_info`, explain what a **histogram** visualizes and list two specific things a data scientist can learn by looking at one (e.g., skewness, modality).

In [37]:
histogram_info = "A histogram visualizes the frequency distribution of a dataset by showing how many data points fall within specified intervals or bins. By looking at a histogram, a data scientist can learn about the skewness of the data (whether it is symmetric or skewed left or right) and the modality (the number of peaks or modes), which helps in understanding the data's overall distribution and identifying any unusual patterns."

In [38]:
grader.check("q3.2")

In [39]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 465 bytes | 465.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/Calvinl1017/GAI-E08.git
   731d225..c44e9d1  main -> main
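
At its core, a histogram is just a count of values per bin. The following sketch builds a text histogram by hand from a small made-up list of exam scores, assuming fixed bins of width 10.

```python
from collections import Counter

# Small made-up dataset of exam scores, for illustration only.
scores = [52, 58, 61, 64, 67, 68, 70, 71, 73, 74, 75, 77, 78, 81, 84, 88, 93]

bin_width = 10
# Map each score to the lower edge of its bin, e.g. 67 -> 60.
counts = Counter((score // bin_width) * bin_width for score in scores)

for lower in sorted(counts):
    print(f"{lower:>3}-{lower + bin_width - 1:<3} | {'#' * counts[lower]}")
```

The bars show a single peak in the 70s (one mode) with counts tapering off on both sides, which is the kind of shape information a plotted histogram conveys.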


### Question 3.3: Normal Distributions

Ask your AI assistant:

"Now I want to understand how data is distributed: **What are normal distributions and why are they important?**"

In the variable `normal_dist_importance`, describe the key visual characteristic of a **normal distribution** (the 'bell curve') and explain why it is so frequently used in statistics and data science.

In [40]:
normal_dist_importance = "A normal distribution is visually characterized by a symmetric, bell-shaped curve where most data points cluster around the mean, and frequencies taper off evenly on both sides. It is important because many natural phenomena follow this pattern, and numerous statistical methods and tests assume data is normally distributed, making it foundational for inference, probability calculations, and predictive modeling in data science."

In [41]:
grader.check("q3.3")

In [42]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 420 bytes | 420.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/Calvinl1017/GAI-E08.git
   c44e9d1..b9dedd7  main -> main
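
One practical consequence of the bell shape is the empirical (68-95-99.7) rule, which is not stated in the answer above but follows from the normal distribution. A short check using Python's standard-library `statistics.NormalDist`:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sd: {p:.3f}")
```

The three probabilities come out at roughly 68%, 95%, and 99.7%, which is why values more than three standard deviations from the mean are often treated as unusual for normally distributed data.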


### Question 3.4: Outliers

Ask your AI assistant:

"Now I want to understand how data is distributed: **What are outliers and how do they affect our analysis?**"

In the variable `outlier_effect`, define an **outlier** and briefly explain one way it can distort the conclusions drawn from a dataset.

In [43]:
outlier_effect = "An outlier is a data point that significantly differs from the other observations in a dataset. Outliers can distort analysis by skewing summary statistics like the mean, leading to misleading conclusions about the typical value or trends in the data."

In [44]:
grader.check("q3.4")

In [45]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 2 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 1.81 KiB | 1.81 MiB/s, done.
Total 4 (delta 3), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (3/3), completed with 3 local objects.
To https://github.com/Calvinl1017/GAI-E08.git
   b9dedd7..e60f9d9  main -> main
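
"Significantly differs" can be made operational in several ways. One common convention, assumed here and not prescribed by this notebook, is the 1.5 × IQR rule: flag any value more than 1.5 interquartile ranges below the first quartile or above the third. The salaries are made up.

```python
import statistics

# Made-up salaries in thousands of dollars; 500 is the suspicious value.
salaries = [32, 35, 38, 41, 44, 47, 52, 58, 66, 500]

# Quartiles via the statistics module (default "exclusive" method).
q1, q2, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in salaries if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print("outliers:", outliers)
```

Only the value 500 falls outside the fences. Different quartile methods shift the fences slightly, so the rule is a screening heuristic rather than a definitive test.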


### Exercise: Interpreting a Sample Distribution

The code block below generates a sample dataset. Run the code to visualize the data distribution before answering the next two questions.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set a seed for reproducibility
np.random.seed(42)

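# Generate right-skewed sample data: exponential values (scale 20,000) shifted up by 30,000
skewed_data = np.random.exponential(scale=20000, size=500) + 30000

# Append a single extreme outlier
salaries_data = np.append(skewed_data, [500000])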

# Create and display the histogram
plt.figure(figsize=(10, 6))
plt.hist(salaries_data, bins=30, edgecolor='black', alpha=0.7)
plt.title('Distribution of Hypothetical Salaries (Right Skewed with Outlier)')
plt.xlabel('Salary ($)')
plt.ylabel('Frequency')
plt.axvline(np.median(salaries_data), color='red', linestyle='dashed', linewidth=1, label='Median')
plt.axvline(np.mean(salaries_data), color='green', linestyle='dashed', linewidth=1, label='Mean')
plt.legend()
plt.show()


### Question 3.5: Identifying Skewness

Based on the histogram generated above, what is the shape of the `salaries_data` distribution (left-skewed, right-skewed, or roughly symmetric), and how do the mean (green line) and median (red line) relate to each other in this skewed shape?

In [46]:
skewness_and_center = "The distribution of the salaries_data is right-skewed, as shown by the long tail extending toward higher salary values, including the outlier at $500,000. In a right-skewed distribution, the mean (green dashed line) is pulled to the right and is greater than the median (red dashed line), which remains closer to the bulk of the lower salary data. This discrepancy occurs because the mean is sensitive to extreme high values, whereas the median represents the middle value and is less affected by outliers."

In [47]:
grader.check("q3.5")

In [48]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 384 bytes | 384.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/Calvinl1017/GAI-E08.git
   e60f9d9..b80e83b  main -> main
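
Skewness can also be checked numerically rather than by eye. The sketch below is a stand-in for the notebook's data: it draws shifted exponential values with Python's `random` module, so the exact numbers differ from the NumPy-generated `salaries_data`, and it leaves out the 500,000 outlier. The skewness formula used is the average cubed z-score.

```python
import random
import statistics

rng = random.Random(42)

# Stand-in for the notebook's data: exponential values (mean 20,000) shifted by 30,000.
# Uses Python's random module, so the numbers differ from the NumPy-generated ones.
salaries = [rng.expovariate(1 / 20_000) + 30_000 for _ in range(500)]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
sd = statistics.pstdev(salaries)

# Sample skewness: average cubed z-score. Positive means a longer right tail.
skewness = sum(((x - mean) / sd) ** 3 for x in salaries) / len(salaries)

print(f"mean:     {mean:,.0f}")
print(f"median:   {median:,.0f}")
print(f"skewness: {skewness:.2f}")
```

A positive skewness together with a mean above the median is the numeric signature of a right-skewed distribution, even before any outlier is added.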


### Question 3.6: Identifying the Outlier

Referring to the histogram, is there an obvious outlier present in the `salaries_data`? If yes, briefly describe where the outlier is located relative to the rest of the data. Also, explain the effect of the outlier on the mean and median.

In [49]:
outlier_location = "Yes, there is an obvious outlier present in the salaries_data. It is located far to the right of the rest of the data at $500,000, well beyond the main cluster of salary values. This extreme high value pulls the mean upward, making it larger than the median. The median, being the middle value, remains closer to the center of the bulk of the data and is less affected by the outlier, providing a better measure of the typical salary in this skewed distribution."

In [50]:
grader.check("q3.6")

In [51]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 402 bytes | 402.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/Calvinl1017/GAI-E08.git
   b80e83b..ea4fc32  main -> main



--- 

## Part 4: Summary Statistics (55 min)

**Your Mission:** Learn how to summarize and quantify the key features of a data distribution.

---

### Question 4.1: Measures of Center

Ask your AI assistant:

"How do we summarize distributions with numbers? **What do mean, median, and mode tell us differently?**"

In the variable `center_measures_difference`, summarize the unique information provided by the **mean**, **median**, and **mode**.

In [52]:
center_measures_difference = "The mean tells us the arithmetic average of the data, representing the central value by summing all observations and dividing by the number of observations. The median indicates the middle value when the data is ordered, showing the point that divides the data into two equal halves. The mode reveals the most frequently occurring value in the dataset, highlighting the common or popular outcome. Each measure provides different insights: the mean is sensitive to outliers, the median is robust to outliers and skew, and the mode indicates the most common observation."

In [53]:
grader.check("q4.1")

In [54]:
!./helpful-script.sh save 1>/dev/null

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 423 bytes | 423.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/Calvinl1017/GAI-E08.git
   ea4fc32..ef501f7  main -> main
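
The three measures can give three different answers for the same data. A minimal example with made-up quiz scores, using Python's standard-library `statistics` module:

```python
import statistics

# Made-up quiz scores for illustration.
scores = [2, 3, 3, 3, 4, 5, 6, 7, 30]

mean = statistics.mean(scores)      # 63 / 9 = 7
median = statistics.median(scores)  # middle (5th) value of the sorted list
mode = statistics.mode(scores)      # most frequent value

print(f"mean={mean}, median={median}, mode={mode}")
```

Here the mean is 7, the median is 4, and the mode is 3. The single score of 30 pulls the mean above seven of the nine observations, while the median and mode stay with the bulk of the data.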


### Question 4.2: Measures of Spread

Ask your AI assistant:

"How do we summarize distributions with numbers? **How do we measure spread (range, standard deviation)?**"

In the variable `spread_measures`, define the **range** and the **standard deviation**, and explain what each one tells you about the variability of the data.

In [None]:
spread_measures = ...

In [None]:
grader.check("q4.2")

In [None]:
!./helpful-script.sh save 1>/dev/null
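
As a reference point while exploring this question, here is a minimal sketch of both measures on two made-up datasets that share the same mean. `data_range` is a hypothetical helper, and `statistics.pstdev` computes the population standard deviation (use `statistics.stdev` for the sample version).

```python
import statistics

# Two made-up datasets with the same mean (50) but different variability.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

def data_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

for name, values in (("tight", tight), ("wide", wide)):
    sd = statistics.pstdev(values)  # population standard deviation
    print(f"{name}: range={data_range(values)}, sd={sd:.2f}")
```

The range depends only on the two extreme values, whereas the standard deviation uses every observation's distance from the mean.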

### Question 4.3: Outliers' Effect on Summary Statistics (Theory)

Ask your AI assistant:

"How do we summarize distributions with numbers? **How do outliers affect different summary statistics?**"

In the variable `outlier_on_stats`, explain how an **outlier** affects the **mean** and how it affects the **median**.

In [None]:
outlier_on_stats = ...

In [None]:
grader.check("q4.3")

In [None]:
!./helpful-script.sh save 1>/dev/null
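
A quick way to explore this question is to add one extreme value to a small dataset and measure how far each statistic moves. The numbers below are made up for illustration.

```python
import statistics

base = [30, 35, 40, 45, 50]   # made-up values
with_outlier = base + [500]   # same data plus one extreme value

mean_shift = statistics.mean(with_outlier) - statistics.mean(base)
median_shift = statistics.median(with_outlier) - statistics.median(base)

print(f"mean moved by   {mean_shift:.1f}")
print(f"median moved by {median_shift:.1f}")
```

The median moves by only 2.5 (from 40 to 42.5), while the mean moves by more than 70.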

### Question 4.4: Best Measure of Center (General)

Ask your AI assistant:

"How do we summarize distributions with numbers? **When is each measure of center most useful?**"

In the variable `best_measure_of_center`, explain which measure of center (**mean, median, or mode**) is generally considered the most robust (least affected by outliers) and why.

In [None]:
best_measure_of_center = ...

In [None]:
grader.check("q4.4")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Exercise: Applying Measures of Center to Skewed Data

The code block below reproduces the histogram from Part 3, which shows the **right-skewed salary data with an outlier**. Use this visualization to answer the next two questions about central tendency.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set a seed for reproducibility
np.random.seed(42)

# Generate a right-skewed dataset (exponential distribution with scale 20,000, shifted up by 30,000)
skewed_data = np.random.exponential(scale=20000, size=500) + 30000

# Introduce a single, extreme outlier
salaries_data = np.append(skewed_data, [500000])

# Create and display the histogram
plt.figure(figsize=(10, 6))
plt.hist(salaries_data, bins=30, edgecolor='black', alpha=0.7)
plt.title('Distribution of Hypothetical Salaries (Right Skewed with Outlier)')
plt.xlabel('Salary ($)')
plt.ylabel('Frequency')
plt.axvline(np.median(salaries_data), color='red', linestyle='dashed', linewidth=1, label='Median')
plt.axvline(np.mean(salaries_data), color='green', linestyle='dashed', linewidth=1, label='Mean')
plt.legend()
plt.show()


### Question 4.5: Best Center Metric for Skewed Data (Application)

Look closely at the histogram above, specifically at the positions of the **Mean** (green line) and the **Median** (red line). Which of these two metrics is the better measure for describing the *typical* salary in this specific dataset?

In [None]:
best_center_metric = ...

In [None]:
grader.check("q4.5")

In [None]:
!./helpful-script.sh save 1>/dev/null

### Question 4.6: Justifying the Metric Choice

Explain **why** the metric you chose in Question 4.5 is superior for describing the center of the `salaries_data` distribution, specifically considering the presence of the skew and the extreme outlier.

In [None]:
justification_for_metric = ...

In [None]:
grader.check("q4.6")

In [None]:
!./helpful-script.sh save 1>/dev/null