# Table of Contents
<li><a href="#Sampling_and_point_estimates">Sampling_and_point_estimates</a></li>
<li><a href="#Convenience_sampling">Convenience_sampling</a></li>
<li><a href="#Pseudo_random_number_generation">Pseudo_random_number_generation</a></li>
<li><a href="#Simple_random_and_systematic_sampling">Simple_random_and_systematic_sampling</a></li>
<li><a href="#Stratified_and_weighted_random_sampling">Stratified_and_weighted_random_sampling</a></li>
<li><a href="#Cluster_sampling">Cluster_sampling</a></li>
<li><a href="#Comparing_sampling_methods">Comparing_sampling_methods</a></li>


<a id='Sampling_and_point_estimates'></a>
# Sampling_and_point_estimates

Welcome to the course! I’m James, and I’ll be your host as we explore **sampling data with Python**. Let’s start by understanding what sampling is and why it’s useful.  

## Estimating the Population of France  

Imagine we want to determine **how many people live in France**. The standard approach is conducting a **census**, where every household reports the number of residents. However, this process is **expensive** and is typically done **only every five or ten years**.  

## Sampling Households  

In **1786**, Pierre-Simon Laplace discovered a smarter way: instead of surveying **everyone**, he sampled **a small number of households** and used statistics to estimate the total population. This approach is known as **sampling**—working with a **subset** of the population instead of the entire dataset.  

## Population vs. Sample  

Two key definitions:  
- **Population**: The complete dataset we’re interested in. It doesn’t always refer to people; it can be any dataset.  
- **Sample**: A **subset** of the population that we analyze. Since we often **don’t have access to the full population**, we rely on samples to make estimates.  

## Coffee Rating Dataset  

Let’s look at a dataset of **professional coffee ratings**, which includes:  
- **1,338 rows**, each representing a coffee.  
- **Total cup points (0-100)** and other attributes like **aroma and body (0-10)**.  
- Although this dataset isn’t **every coffee in the world**, it’s large enough to serve as our **population of interest**.  

## Points vs. Flavor: Population vs. Sample  

We explore the relationship between **cup points and flavor** using the full dataset. But what if we only analyze **10 random coffees** instead?  

Using **pandas’ `.sample(n=10)`**, we can:  
- Select **10 unique random rows** from the dataset.  
- Ensure that each row appears **only once** by default.  

## Sampling in Pandas  

The `.sample()` method also works on **pandas Series**. For example:  
```python
coffee_ratings["total_cup_points"].sample(n=10)
```  
This returns **10 random values** from the `total_cup_points` column.  

## Population Parameters vs. Point Estimates  

- A **population parameter** is a calculation on the **entire dataset** (e.g., the **mean cup score** of all coffees).  
- A **point estimate** (or **sample statistic**) is a calculation **based on the sample** (e.g., the **mean cup score of the 10-sample subset**).  

While the **sample mean is close to the population mean**, it’s **not identical** due to sampling variability.  
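A minimal sketch of the difference, using a made-up population rather than the coffee data (the scores, the seed, and the variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical population: 1,000 scores centred near 82 (not the real coffee data)
rng = np.random.default_rng(0)
population = pd.Series(rng.normal(loc=82, scale=3, size=1000), name="total_cup_points")

# Population parameter: calculated on every value
pop_mean = population.mean()

# Point estimate: the same calculation on a random sample of 10 values
sample = population.sample(n=10, random_state=0)
samp_mean = sample.mean()

print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {samp_mean:.2f}")
```

The two printed values should be close but not equal; changing `random_state` gives a different point estimate each time.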

## Point Estimates with Pandas  

Instead of NumPy, we can use **pandas’ `.mean()`** method for convenience:  
```python
coffee_ratings["total_cup_points"].mean()
```  

## Let’s Practice!  

Time to start sampling and analyzing data! 🚀  

![image.png](attachment:dcbc61cb-49a9-45ff-ae6c-43ca46b8aab4.png)

![image.png](attachment:cc3d03b3-988d-41cb-a2ae-c84b84398801.png)

In [None]:
# Sample 1000 rows from spotify_population
spotify_sample = spotify_population.sample(n=1000)

# Print the sample
print(spotify_sample)

# Calculate the mean duration in mins from spotify_population
mean_dur_pop = spotify_population.duration_minutes.mean()

# Calculate the mean duration in mins from spotify_sample
mean_dur_samp = spotify_sample.duration_minutes.mean()

# Print the means
print(mean_dur_pop)
print(mean_dur_samp)

![image.png](attachment:7e698414-73f2-4260-906d-5e0d7bf145d4.png)

In [None]:
# Create a pandas Series from the loudness column of spotify_population
loudness_pop = spotify_population['loudness']

# Sample 100 values of loudness_pop
loudness_samp = loudness_pop.sample(n=100)

# Calculate the mean of loudness_pop
mean_loudness_pop = loudness_pop.mean()

# Calculate the mean of loudness_samp
mean_loudness_samp = loudness_samp.mean()

print(mean_loudness_pop)
print(mean_loudness_samp)

<a id='Convenience_sampling'></a>
# Convenience_sampling

Previously, our **point estimates** were close to the **population parameters**, but is that always the case? Let’s explore how sampling methods can introduce **bias** in our results.  

## The Literary Digest Election Prediction  

In **1936**, *The Literary Digest* conducted a massive poll to predict the US presidential election:  
- **10 million voters** were sent ballots by mail, with addresses drawn largely from telephone directories and car registration lists.  
- **2 million responses** were collected.  
- The poll predicted **Landon (57%) would win against Roosevelt (43%)**.  

However, **Roosevelt won by a landslide (62%)**! What went wrong?  
- **In 1936, telephone ownership was concentrated among wealthier households.**  
- The sample **wasn’t representative** of all voters, leading to **sample bias**.  
- This is an example of **convenience sampling**, where data is collected **by the easiest method** rather than ensuring **representativeness**.  

Before sampling, we **must carefully design our data collection process** to avoid bias!  

## Finding the Mean Age of French People  

Imagine you’re at **Disneyland Paris** and want to estimate the **mean age of all French citizens**. You ask **10 random visitors** nearby and calculate their **mean age: 24.6 years**.  
Sounds reasonable, right? 🚨 **Wrong!**  

### How Accurate Was This Survey?  

Official **French census data** shows that the population has been **gradually aging**.  
- In **2015, the mean age was over 40**!  
- Our Disneyland estimate of **24.6 years** is way off.  

Why? **Theme parks attract younger people.** The sample is **not representative** of the whole French population!  

## Convenience Sampling in Coffee Ratings  

Returning to our **coffee ratings dataset**, let’s calculate the **mean cup points**:  
- **Population mean:** ~82  
- **Convenience sample (first 10 rows):** **89**  

The difference suggests that **higher-rated coffees appear earlier in the dataset**. Again, **convenience sampling** introduces **selection bias**!  
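The effect can be reproduced with a small sketch. The data here is synthetic and deliberately sorted from best to worst, which is an assumption that mimics the ordering suspected above:

```python
import numpy as np
import pandas as pd

# Hypothetical scores, sorted so that the highest-rated rows come first
rng = np.random.default_rng(1)
values = np.sort(rng.normal(loc=82, scale=3, size=1000))[::-1]
scores = pd.DataFrame({"total_cup_points": values})

pop_mean = scores["total_cup_points"].mean()

# Convenience sample: simply take the first 10 rows
convenience_mean = scores["total_cup_points"].head(10).mean()

# Simple random sample of the same size
random_mean = scores["total_cup_points"].sample(n=10, random_state=1).mean()

print(pop_mean, convenience_mean, random_mean)
```

Because the first rows hold the best scores, the convenience mean lands well above the population mean, while the random sample stays much closer.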

## Visualizing Selection Bias  

Histograms help us **see** selection bias.  
- The **total_cup_points histogram** shows values ranging from **59 to 91**.  
- Using **NumPy's `np.arange()`**, we create bin edges of width **2** from **59 to 91** (setting **93** as the exclusive stop value).  
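As a quick check of the bin edges described above, here is a small sketch; the six scores are invented purely to show the edges in use:

```python
import numpy as np

# Bin edges of width 2 from 59 up to 91; the stop value 93 is excluded
bins = np.arange(59, 93, 2)
print(bins)

# Made-up scores, only to demonstrate how values fall into the bins
scores = np.array([59.8, 75.2, 82.1, 83.4, 84.0, 90.6])
counts, edges = np.histogram(scores, bins=bins)
print(counts)
```

The same `bins` array can be passed to a pandas `.hist(bins=...)` call.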

### Comparing Population vs. Convenience Sample  

- The histogram of the **convenience sample** looks **very different** from the population: **all values sit at the high end of the range**.  
- When we **randomly sample 10 coffees**, the distribution is **much closer** to the population.  

## Let’s Practice!  

Time to **plot some histograms** and visualize selection bias in action! 📊  

![image.png](attachment:a4187524-feea-426c-9183-876e75cc9ed9.png)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Visualize the distribution of acousticness with a histogram
spotify_population['acousticness'].hist(bins=np.arange(0, 1.01, 0.01))
plt.show()

In [None]:
# Update the histogram to use spotify_mysterious_sample
spotify_mysterious_sample['acousticness'].hist(bins=np.arange(0, 1.01, 0.01))
plt.show()

![image.png](attachment:e09f3a3c-b33b-40ac-a2a9-32b957117e4c.png)

![image.png](attachment:0c65dc56-f19b-4ade-9b24-da5dd74f2733.png)

In [None]:
# Visualize the distribution of duration_minutes as a histogram
spotify_population['duration_minutes'].hist(bins=np.arange(0, 15.5, 0.5))
plt.show()

In [None]:
# Update the histogram to use spotify_mysterious_sample2
spotify_mysterious_sample2['duration_minutes'].hist(bins=np.arange(0, 15.5, 0.5))
plt.show()

![image.png](attachment:f7b2a3c9-0432-4a52-b888-82e6718dc173.png)

<a id='Pseudo_random_number_generation'></a>
# Pseudo_random_number_generation

Previously, we saw how **random sampling** helps approximate population results. But how does a computer actually perform **random sampling**?  

## What Does "Random" Mean?  

In everyday language, "random" can have different meanings.  
From a statistical perspective, **random selection** means that we **cannot systematically predict** which data points will be chosen in advance.  

## True Random Numbers  

To generate **truly random numbers**, we need **physical processes** like:  
- **Flipping coins or rolling dice**  
- **Radioactive decay** (e.g., [Hotbits](https://www.fourmilab.ch/hotbits))  
- **Atmospheric noise** (e.g., [RANDOM.ORG](https://www.random.org))  

However, these methods are **slow and expensive**, making them impractical for most applications.  

## Pseudo-Random Number Generation (PRNG)  

For most use cases, **pseudo-random numbers** are preferred. They are:  
✅ **Fast**  
✅ **Cheap**  
✅ **Deterministic** (but appear random)  

**How does it work?**  
- The first random number is generated from a **seed value**.  
- Each subsequent number is **calculated** from the previous one.  
- If we start with the **same seed**, the sequence will always be the **same**.  

⚠️ This is why "random" is in quotes: the numbers are not truly random, but for most practical purposes they **behave like random numbers**.  

## Example of PRNG  

Let’s assume we have a function called **calc_next_random**:  
1️⃣ Start with a **seed value** (e.g., `1`).  
2️⃣ The function calculates a "random" number (e.g., `3`).  
3️⃣ We feed `3` back into the function, which outputs `2`.  
4️⃣ The process repeats, generating a sequence that **looks random** but is actually **deterministic**.  
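The notes do not define `calc_next_random`, so the version below is only a toy sketch. The multiplier `3` and modulus `7` are assumptions chosen so that the sequence starts 1 → 3 → 2 as in the steps above; real generators such as NumPy's use far more elaborate algorithms.

```python
def calc_next_random(previous, multiplier=3, modulus=7):
    """Return the next value of a tiny multiplicative congruential generator."""
    return (multiplier * previous) % modulus


def generate(seed, n):
    """Produce n pseudo-random values, each calculated from the one before."""
    values = []
    current = seed
    for _ in range(n):
        current = calc_next_random(current)
        values.append(current)
    return values


print(generate(seed=1, n=6))                           # [3, 2, 6, 4, 5, 1]
print(generate(seed=1, n=6) == generate(seed=1, n=6))  # True: same seed, same sequence
```

With such a small modulus the sequence repeats after six values, which makes the deterministic nature easy to see.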

## Random Number Generating Functions in NumPy  

NumPy provides various functions for generating random numbers from different **statistical distributions**:  
- `np.random.uniform()` → **Uniform distribution**  
- `np.random.normal()` → **Normal distribution**  
- `np.random.beta()` → **Beta distribution**  

Each function takes **distribution parameters** and a `size` argument that specifies how many numbers to generate.  

## Visualizing Random Numbers  

- **Example**: Generating **5,000 pseudo-random numbers** from the **Beta distribution** (`a=2, b=5`).  
- Since the Beta distribution produces **continuous values**, a **histogram** is the best way to visualize them.  
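A sketch of that example, with an arbitrary seed added for reproducibility. It bins the values with `np.histogram` instead of drawing a plot, which is a substitution made here so the snippet runs without a plotting backend:

```python
import numpy as np

np.random.seed(2024)  # arbitrary seed, an assumption for reproducibility
randoms = np.random.beta(a=2, b=5, size=5000)

# Group the continuous values into 20 equal-width bins on [0, 1]
counts, edges = np.histogram(randoms, bins=np.linspace(0, 1, 21))
print(counts)

# plt.hist(randoms, bins=edges) would draw the same bins with matplotlib
```

Most of the counts fall in the lower bins, since a Beta(2, 5) distribution has a mean of 2/7 (about 0.29).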

## Random Number Seeds  

To ensure **reproducibility**, we use **random seeds**:  
```python
np.random.seed(42)  # Set seed
np.random.normal(loc=0, scale=1, size=5)  # Generate numbers
```
- The **seed value (42)** ensures that every time we run the code, we get **the same random numbers**.  
- If we **change the seed**, we get a **different sequence** of random numbers.  
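Both points can be checked directly. In this short sketch the second seed value, `43`, is just an arbitrary alternative:

```python
import numpy as np

np.random.seed(42)
first = np.random.normal(loc=0, scale=1, size=5)

np.random.seed(42)  # resetting the same seed restarts the sequence
second = np.random.normal(loc=0, scale=1, size=5)

np.random.seed(43)  # a different seed gives a different sequence
third = np.random.normal(loc=0, scale=1, size=5)

print(np.array_equal(first, second))  # True
print(np.array_equal(first, third))   # False
```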

## Let’s Practice!  

Time to **sow some random seeds** and experiment with NumPy’s random functions! 🌱🎲  

![image.png](attachment:09813bd1-0aed-4fdc-8092-ade382249a1f.png)

In [None]:
# Generate random numbers from a Uniform(-3, 3)
uniforms = np.random.uniform(low=-3, high=3, size=5000)

# Plot a histogram of uniform values, binwidth 0.25
plt.hist(uniforms, bins=np.arange(-3, 3.25, 0.25))
plt.show()

In [None]:
# Generate random numbers from a Normal(5, 2)
normals = np.random.normal(loc=5, scale=2, size=5000)

# Plot a histogram of normal values, binwidth 0.5
plt.hist(normals, bins=np.arange(-2, 13.5, 0.5))
plt.show()

![image.png](attachment:1a9242dc-b253-4f64-9103-90c17e35fc72.png)

![image.png](attachment:e27b7122-1fd2-4641-a758-8fd9d9294e22.png)

![image.png](attachment:570d9a12-7304-4f05-90f6-5df569887eb1.png)

![image.png](attachment:7e22ed07-e492-48b4-a811-9b41841153e7.png)

<a id='Simple_random_and_systematic_sampling'></a>
# Simple_random_and_systematic_sampling

There are several methods to sample from a population. In this lesson, we'll explore **Simple Random Sampling** and **Systematic Sampling**.  

## 1️⃣ Simple Random Sampling  

### 🎟️ How It Works  
Simple random sampling works like a **raffle** or **lottery**:  
1. We start with a **population** (e.g., coffee varieties).  
2. We randomly pick items **one at a time**.  
3. Every item has the **same chance** of being selected.  

📌 **Key Observation:**  
- Some items next to each other in the dataset **may** get picked.  
- Large portions of the dataset **may** not get picked at all.  

### 🛠️ Using Pandas for Simple Random Sampling  
We can implement **simple random sampling** in Pandas using:  
```python
coffee_ratings.sample(n=5)  # Selects 5 random rows
```
To **ensure reproducibility**, set the `random_state`:  
```python
coffee_ratings.sample(n=5, random_state=42)
```
- If `random_state` is **not set**, a different sample is chosen **each time** the code runs.  
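A small check of this behaviour, using a stand-in DataFrame (the `coffee_id` column is hypothetical) instead of `coffee_ratings`:

```python
import pandas as pd

# Stand-in for coffee_ratings: 100 rows with a made-up id column
coffees = pd.DataFrame({"coffee_id": range(100)})

a = coffees.sample(n=5, random_state=42)
b = coffees.sample(n=5, random_state=42)

print(a.equals(b))        # True: same random_state, same rows
print(a.index.is_unique)  # True: rows are drawn without replacement by default
```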

## 2️⃣ Systematic Sampling  

### 🔄 How It Works  
Instead of selecting items **randomly**, we select them at **regular intervals**:  
- If reading **left to right, top to bottom**, we might pick **every 5th coffee**.  

### 🔢 Defining the Sampling Interval  
The interval is calculated as:  
$$
\text{interval} = \frac{\text{Population Size}}{\text{Sample Size}}
$$
Example:  
- If we have **1,338** coffees and want a sample size of **5**,  
  $$
  1338 \div 5 = 267.6 \Rightarrow 267 \text{ (integer division)}
  $$
- We select **every 267th coffee**.  

### 📌 Selecting the Rows in Pandas  
To take a **systematic sample**, we use `.iloc`:  
```python
coffee_ratings.iloc[::267]  # Selects every 267th row
```
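One detail is worth checking on a stand-in DataFrame with the same number of rows (the `coffee_id` column is made up). Because integer division rounds the interval down, the slice can return one more row than requested:

```python
import pandas as pd

# Stand-in for coffee_ratings with 1,338 rows
coffees = pd.DataFrame({"coffee_id": range(1338)})

sample_size = 5
interval = len(coffees) // sample_size  # 1338 // 5 = 267
systematic = coffees.iloc[::interval]

print(interval)                # 267
print(list(systematic.index))  # [0, 267, 534, 801, 1068, 1335]
```

Here six rows come back rather than five. If the exact size matters, one option is to trim the result with `.head(sample_size)`.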

## 3️⃣ The Trouble with Systematic Sampling 🚨  

⚠️ **Potential Bias:**  
- If **data is ordered** by some characteristic (e.g., **aftertaste score**), systematic sampling **may not** be representative.  
- Example: **Earlier rows** might have **higher** aftertaste scores than **later rows**.  

### ✅ Making Systematic Sampling Safe  
To **avoid bias**, we **shuffle** the dataset first:  
```python
shuffled = coffee_ratings.sample(frac=1).reset_index(drop=True)
systematic_sample = shuffled.iloc[::267]
```
- `frac=1` → Shuffles the **entire** dataset.  
- `reset_index(drop=True)` → Resets the row index.  
- After shuffling, systematic sampling is essentially **equivalent** to simple random sampling.  

## 4️⃣ Let’s Practice! 🎯  
It’s time to get **sampling**! Try implementing both methods and observe their effects.  

![image.png](attachment:18fad694-844d-46e5-bec9-c397b8b27236.png)

In [None]:
# Sample 70 rows using simple random sampling and set the seed
attrition_samp = attrition_pop.sample(70, random_state=18900217)

# Print the sample
print(attrition_samp)

![image.png](attachment:8638bbe8-eca1-4fcc-9222-1b06790d4b64.png)

In [None]:
# Set the sample size to 70
sample_size = 70

# Calculate the population size from attrition_pop
pop_size = len(attrition_pop)

# Calculate the interval
interval = pop_size // sample_size

# Systematically sample 70 rows
attrition_sys_samp = attrition_pop.iloc[::interval]

# Print the sample
print(attrition_sys_samp)

![image.png](attachment:66a52950-c9a8-4f20-9804-48ab6b174d50.png)

In [None]:
# Add an index column to attrition_pop
attrition_pop_id = attrition_pop.reset_index()

# Plot YearsAtCompany vs. index for attrition_pop_id
attrition_pop_id.plot(x='index', y='YearsAtCompany', kind='scatter')
plt.show()

In [None]:
# Shuffle the rows of attrition_pop
attrition_shuffled = attrition_pop.sample(frac=1)

# Reset the row indexes and create an index column
attrition_shuffled = attrition_shuffled.reset_index(drop=True).reset_index()

# Plot YearsAtCompany vs. index for attrition_shuffled
attrition_shuffled.plot(x='index', y='YearsAtCompany', kind='scatter')
plt.show()

![image.png](attachment:78801182-6a6c-44b3-95da-fe053f76125f.png)

<a id='Stratified_and_weighted_random_sampling'></a>
# Stratified_and_weighted_random_sampling

## 1️⃣ What is Stratified Sampling?  
Stratified sampling is a technique that helps us sample a **population with subgroups** while ensuring proper representation.  

### ☕ Example: Grouping Coffee Ratings by Country  
- We can **group** our coffee ratings dataset by **country**.  
- Using `.value_counts()`, we identify the **top 6 countries** with the most data.  

```python
coffee_ratings["country"].value_counts()
```

---

## 2️⃣ Simple Random Sampling vs. Stratified Sampling  

### 🎲 Simple Random Sampling  
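- We take a **random** sample of 10% of `coffee_ratings_top` (the rows from the top 6 countries) using `.sample(frac=0.1)`.  
- Calling `.value_counts(normalize=True)` on the sample converts counts to **proportions** for comparison with the population.  

```python
coffee_sample = coffee_ratings_top.sample(frac=0.1, random_state=42)
coffee_sample["country"].value_counts(normalize=True)
```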

📌 **Issue:** The proportions of countries in the sample may **not match** the original population.  

---

## 3️⃣ Proportional Stratified Sampling  
To ensure each subgroup (country) is **proportionally represented**, we:  
1. **Group** by country.  
2. Apply `.sample(frac=0.1)` **within each group**.  

```python
coffee_ratings_top.groupby("country", group_keys=False)\
    .apply(lambda x: x.sample(frac=0.1, random_state=42))
```

📌 **Result:** The sample maintains **proportions similar** to the population.  
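To see the proportions being preserved, here is a sketch on a made-up population with a 60/30/10 country split. It uses `groupby(...).sample(...)`, the same call as in the exercises further down, which assumes pandas 1.1 or later:

```python
import numpy as np
import pandas as pd

# Made-up population with three countries in a 60/30/10 split
countries = np.repeat(["Mexico", "Colombia", "Taiwan"], [600, 300, 100])
coffees = pd.DataFrame({"country": countries, "score": np.arange(1000)})

pop_props = coffees["country"].value_counts(normalize=True)

# Proportional stratified sample: 10% of the rows within every country
strat = coffees.groupby("country").sample(frac=0.1, random_state=42)
strat_props = strat["country"].value_counts(normalize=True)

print(pd.DataFrame({"population": pop_props, "stratified": strat_props}))
```

With group sizes that divide evenly, the two columns match exactly; on real data they agree up to rounding within each group.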

---

## 4️⃣ Equal Counts Stratified Sampling  
Instead of taking **proportional** samples, we can select **equal numbers** from each group.  
- We use `n=15` instead of `frac=0.1` to extract **15 samples per country**.  

```python
coffee_ratings_top.groupby("country", group_keys=False)\
    .apply(lambda x: x.sample(n=15, random_state=42))
```

📌 **Result:** Each country contributes **the same number** of samples.  
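A sketch of the equal-counts variant on a made-up 60/30/10 population (country names and sizes are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up population with unequal group sizes
countries = np.repeat(["Mexico", "Colombia", "Taiwan"], [600, 300, 100])
coffees = pd.DataFrame({"country": countries, "score": np.arange(1000)})

# Equal-count stratified sample: 15 rows from every country
equal = coffees.groupby("country").sample(n=15, random_state=42)

print(equal["country"].value_counts())
print(equal["country"].value_counts(normalize=True))  # one third each
```

Every country now makes up a third of the sample, so small groups are over-represented relative to the population. That is the intended trade-off of this approach.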

---

## 5️⃣ Weighted Random Sampling  
Weighted sampling allows us to **increase** or **decrease** the probability of selecting certain rows.  

### ✨ Example: Boosting Taiwanese Coffees in the Sample  
1. Create a **weights column** where:  
   - **Taiwanese coffees** have a weight of `2`.  
   - All other coffees have a weight of `1`.  
2. Use the `weights` argument in `.sample()`.  

```python
import numpy as np

coffee_ratings_top["weights"] = np.where(coffee_ratings_top["country"] == "Taiwan", 2, 1)
weighted_sample = coffee_ratings_top.sample(n=100, weights="weights", random_state=42)
```

📌 **Result:** Taiwan now makes up **17%** of the sample instead of **8.5%**.  
🔹 This method is often used in **political polling** to adjust for **underrepresented** groups.  
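For intuition: on a single draw, a group with population share $p$ and weight $w$ (all other rows weighted 1) is picked with probability $wp / (wp + 1 - p)$, which is a little less than $wp$. The sketch below uses a made-up population where 10% of rows are from Taiwan, so the expected share with weight 2 is about $0.2 / 1.1 \approx 0.18$:

```python
import numpy as np
import pandas as pd

# Made-up population: 10% of the rows are from Taiwan
countries = np.repeat(["Other", "Taiwan"], [9000, 1000])
coffees = pd.DataFrame({"country": countries})
coffees["weights"] = np.where(coffees["country"] == "Taiwan", 2, 1)

unweighted = coffees.sample(n=2000, random_state=42)
weighted = coffees.sample(n=2000, weights="weights", random_state=42)

share_unweighted = (unweighted["country"] == "Taiwan").mean()
share_weighted = (weighted["country"] == "Taiwan").mean()
print(share_unweighted, share_weighted)
```

The weighted share should come out clearly higher than the unweighted one. The exact figures depend on the seed.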

---

## 6️⃣ Let’s Practice! 🎯  
Try implementing **stratified** and **weighted** sampling techniques in your dataset!  

![image.png](attachment:7fc553d7-f972-4398-81d2-645a8e9408bf.png)

![image.png](attachment:e432a7b1-84e8-4e69-b175-1902665f8c1a.png)

![image.png](attachment:be2e1d07-a838-4486-a834-5de8f57842bb.png)

Stratified sampling is useful if you care about subgroups. Otherwise, simple random sampling is more appropriate.

![image.png](attachment:9d782af7-8b60-40b7-bed1-b5774e87410f.png)

In [None]:
# Proportion of employees by Education level
education_counts_pop = attrition_pop['Education'].value_counts(normalize=True)

# Print education_counts_pop
print(education_counts_pop)

# Proportional stratified sampling for 40% of each Education group
attrition_strat = attrition_pop.groupby('Education')\
    .sample(frac=0.4, random_state=2022)

# Calculate the Education level proportions from attrition_strat
education_counts_strat = attrition_strat['Education'].value_counts(normalize=True)

# Print education_counts_strat
print(education_counts_strat)

![image.png](attachment:21ad4f8b-6755-4af0-a1ef-de622a8a0489.png)

In [None]:
# Get 30 employees from each Education group
attrition_eq = attrition_pop.groupby('Education')\
    .sample(n=30, random_state=2022)

# Get the proportions from attrition_eq
education_counts_eq = attrition_eq.Education.value_counts(normalize=True)

# Print the results
print(education_counts_eq)

![image.png](attachment:a4eedc15-3183-4f2b-aaf4-7d34c31fd3cf.png)

In [None]:
# Plot YearsAtCompany from attrition_pop as a histogram
attrition_pop['YearsAtCompany'].hist(bins=np.arange(0, 41, 1))
plt.show()

# Sample 400 employees weighted by YearsAtCompany
attrition_weight = attrition_pop.sample(n=400, weights="YearsAtCompany")

# Plot YearsAtCompany from attrition_weight as a histogram
attrition_weight['YearsAtCompany'].hist(bins=np.arange(0, 41, 1))
plt.show()

<a id='Cluster_sampling'></a>
# Cluster_sampling

 

## 1️⃣ Why Use Cluster Sampling?  
🔹 **Problem with Stratified Sampling**: We need to **collect data from every subgroup**, which can be **expensive** and **time-consuming** (e.g., traveling to locations for data collection).  
🔹 **Solution**: Cluster Sampling helps **reduce cost** by randomly selecting a few **entire subgroups** and sampling from them.  

---

## 2️⃣ Stratified vs. Cluster Sampling  

| Sampling Method  | How It Works |
|-----------------|-------------|
| **Stratified Sampling** | **Divide** the population into subgroups, then sample **each** subgroup. |
| **Cluster Sampling** | **Randomly select a few** subgroups, then sample only from them. |

📌 **Key Difference**: In **cluster sampling**, we **do not** sample from all subgroups—only from the randomly selected ones.  

---

## 3️⃣ Example: Coffee Varieties ☕  

### 🔍 Extracting Unique Varieties  
- Our dataset contains **28 coffee varieties**.  
- To **extract unique varieties**, we use `.unique()`.  

```python
unique_varieties = list(coffee_ratings["variety"].unique())
```

📌 **Challenge**: Working with all varieties is expensive. Instead, we use **Cluster Sampling**!  

---

## 4️⃣ Stage 1: Sampling Subgroups (Clusters)  

🔹 We **randomly select 3** coffee varieties using `random.sample()`.  

```python
import random

selected_varieties = random.sample(unique_varieties, k=3)
```

📌 **Now, instead of working with all 28 varieties, we focus only on 3!**  

---

## 5️⃣ Stage 2: Sampling Within Each Cluster  

🔹 We **filter** the dataset to keep only the selected varieties using `.isin()`.  
🔹 We **remove unused categories** to avoid errors when sampling.  
🔹 We **perform simple random sampling** within each selected variety.  

```python
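# .copy() gives an independent DataFrame, so the assignment below does not act on a view
filtered_coffees = coffee_ratings[coffee_ratings["variety"].isin(selected_varieties)].copy()
# Requires "variety" to have a categorical dtype
filtered_coffees["variety"] = filtered_coffees["variety"].cat.remove_unused_categories()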

cluster_sample = filtered_coffees.groupby("variety", group_keys=False)\
    .apply(lambda x: x.sample(n=5, random_state=42))
```

📌 **Result**: We get **5 samples per selected variety**, totaling **15 rows** (3 varieties × 5 samples each).  
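The two stages can be run end to end on made-up data. In this sketch the variety labels `A` to `F` and the seeds are assumptions, and because `variety` is a plain string column, the category clean-up step is not needed:

```python
import random

import numpy as np
import pandas as pd

# Made-up data: 6 varieties with 50 coffees each
varieties = np.repeat(["A", "B", "C", "D", "E", "F"], 50)
coffees = pd.DataFrame({"variety": varieties, "score": np.arange(300)})

# Stage 1: randomly choose 3 of the 6 varieties (the clusters)
random.seed(42)
selected = random.sample(list(coffees["variety"].unique()), k=3)

# Stage 2: simple random sample of 5 rows inside each chosen variety
filtered = coffees[coffees["variety"].isin(selected)]
cluster_sample = filtered.groupby("variety").sample(n=5, random_state=42)

print(selected)
print(cluster_sample["variety"].value_counts())
```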

---

## 6️⃣ Multistage Sampling  
Cluster Sampling is a **special case** of **Multistage Sampling**.  
🔹 More than **two stages** can be used.  
🔹 Example: **National Surveys**  
   - **Stage 1**: Randomly sample **states**.  
   - **Stage 2**: Within selected states, sample **counties**.  
   - **Stage 3**: Within selected counties, sample **cities**.  
   - **Stage 4**: Within selected cities, sample **neighborhoods**.  
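A simplified sketch of the idea with three stages instead of four. The state, county, and household labels are invented, and the stage sizes are arbitrary:

```python
import pandas as pd

# Made-up sampling frame: 4 states x 5 counties x 10 households
rows = [
    {"state": f"S{s}", "county": f"S{s}-C{c}", "household": h}
    for s in range(4) for c in range(5) for h in range(10)
]
households = pd.DataFrame(rows)

# Stage 1: sample 2 states
states = households["state"].drop_duplicates().sample(n=2, random_state=1)
stage1 = households[households["state"].isin(states)]

# Stage 2: sample 2 counties inside each selected state
counties = (
    stage1[["state", "county"]]
    .drop_duplicates()
    .groupby("state")
    .sample(n=2, random_state=1)
)
stage2 = stage1[stage1["county"].isin(counties["county"])]

# Stage 3: sample 3 households inside each selected county
final = stage2.groupby("county").sample(n=3, random_state=1)
print(final)
```

Each stage only narrows down the units selected in the stage before it, so data collection is limited to 4 counties instead of all 20.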

---

## 7️⃣ Let’s Practice! 🎯  
Try implementing **Cluster Sampling** in your dataset! 🚀  

![image.png](attachment:0a822a61-667c-4ef1-a685-293ce9259ce6.png)

![image.png](attachment:0570f27d-54e3-41b6-b86b-bb28231a0666.png)

In [None]:
import random

# Create a list of unique JobRole values
job_roles_pop = list(attrition_pop['JobRole'].unique())

# Randomly sample four JobRole values
job_roles_samp = random.sample(job_roles_pop, k=4)

# Filter for rows where JobRole is in job_roles_samp
jobrole_condition = attrition_pop['JobRole'].isin(job_roles_samp)
attrition_filtered = attrition_pop[jobrole_condition].copy()

# Remove categories with no rows
attrition_filtered['JobRole'] = attrition_filtered['JobRole'].cat.remove_unused_categories()

# Randomly sample 10 employees from each sampled job role
attrition_clust = attrition_filtered.groupby('JobRole').sample(n=10, random_state=2022)


# Print the sample
print(attrition_clust)

<a id='Comparing_sampling_methods'></a>
# Comparing_sampling_methods



## 1️⃣ Overview  
We've explored **Simple Random Sampling**, **Stratified Sampling**, and **Cluster Sampling**. Now, let's compare their effectiveness using a **coffee dataset** containing:  
✅ **6 countries**  
✅ **880 rows** (one coffee per row)  
✅ **8 columns**  

---

## 2️⃣ Review of Sampling Techniques  

| Sampling Method  | How It Works | Sample Size |
|-----------------|-------------|------------------|
| **Simple Random Sampling** | Randomly selects rows from the dataset. | ~293 rows |
| **Stratified Sampling** | Groups by **country**, then performs random sampling within each group. | ~293 rows |
| **Cluster Sampling** | Selects **2 out of 6 countries** first, then samples from them. | ~292 rows |

📌 **Key Takeaway**: All methods result in a similar sample size (~293 rows).  

---

## 3️⃣ Comparing Mean Cup Points ☕  

### 📌 **Step 1: Calculate the Population Mean**  
We compute the **mean cup points** (quality score) for the **entire dataset**:  

```python
population_mean = coffee_ratings["total_cup_points"].mean()
```
✅ **Population Mean** = `81.9` (Gold Standard 📏)  

### 📌 **Step 2: Sample Means Calculation**  

| Sampling Method  | Sample Mean (Total Cup Points) | Accuracy |
|-----------------|-----------------------------|----------|
| **Simple Random** | 🔹 Close to **81.9** ✅ | ✅ High |
| **Stratified** | 🔹 Close to **81.9** ✅ | ✅ High |
| **Cluster** | 🔹 Slightly off from **81.9** | ⚠️ Less Accurate |

📌 **Why?** Cluster Sampling uses fewer subgroups, so **variation is higher**.  
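The comparison can be tried on synthetic data. In this sketch the six countries, their mean scores, and the seeds are all assumptions, and each method draws one third of the rows:

```python
import random

import numpy as np
import pandas as pd

# Made-up population: 6 countries whose mean scores differ
rng = np.random.default_rng(0)
frames = [
    pd.DataFrame({"country": name, "total_cup_points": rng.normal(mean, 2, size=150)})
    for name, mean in zip("ABCDEF", [78, 80, 81, 82, 84, 86])
]
coffees = pd.concat(frames, ignore_index=True)
pop_mean = coffees["total_cup_points"].mean()

# Simple random and stratified samples, one third of the rows each
srs = coffees.sample(frac=1/3, random_state=0)
strat = coffees.groupby("country").sample(frac=1/3, random_state=0)

# Cluster sample: pick 2 countries, then take 150 rows from each
random.seed(0)
chosen = random.sample(list(coffees["country"].unique()), k=2)
clust = (
    coffees[coffees["country"].isin(chosen)]
    .groupby("country")
    .sample(n=150, random_state=0)
)

print("population:", pop_mean)
print("simple    :", srs["total_cup_points"].mean())
print("stratified:", strat["total_cup_points"].mean())
print("cluster   :", clust["total_cup_points"].mean(), chosen)
```

The cluster estimate depends heavily on which two countries are drawn, so rerunning with different seeds shows it moving around far more than the other two.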

---

## 4️⃣ Mean Cup Points by Country  

### **📊 Simple Random & Stratified Sampling**  
✅ Sample means for each country are **close to the actual population means**.  

### **⚠️ Cluster Sampling Limitation**  
❌ Only provides means for **2 selected countries**, **missing the other 4**.  
❌ **Bad choice** if analyzing **country-specific** trends.  

---

## 5️⃣ Conclusion 🏆  

| Method  | When to Use? | Pros ✅ | Cons ⚠️ |
|--------|------------|---------|--------|
| **Simple Random** | General sampling | Easy, unbiased | May not represent subgroups well |
| **Stratified** | When subgroup accuracy matters | More precise estimates | More complex |
| **Cluster** | When data collection is expensive | Reduces cost | Less accuracy, may exclude key subgroups |

📌 **Final Thought**: If **country-wise accuracy** is needed, **avoid Cluster Sampling**!  

---

## 6️⃣ Let’s Practice! 🚀  
Try calculating **summary statistics** using different sampling methods!  

![image.png](attachment:d3eea530-53e0-4e04-a91c-20f2c302ed1e.png)

In [None]:
# Perform simple random sampling to get 0.25 of the population
attrition_srs = attrition_pop.sample(frac=0.25, random_state=2022)

In [None]:
# Perform stratified sampling to get 0.25 of each relationship group
attrition_strat = attrition_pop.groupby('RelationshipSatisfaction').sample(frac=0.25, random_state=2022)


In [None]:
# Create a list of unique RelationshipSatisfaction values
satisfaction_unique = list(attrition_pop.RelationshipSatisfaction.unique()) 

# Randomly sample 2 unique satisfaction values
satisfaction_samp = random.sample(satisfaction_unique, k=2)

# Filter for satisfaction_samp and clear unused categories from RelationshipSatisfaction
satis_condition = attrition_pop.RelationshipSatisfaction.isin(satisfaction_samp)
attrition_clust_prep = attrition_pop[satis_condition].copy()
attrition_clust_prep['RelationshipSatisfaction'] = attrition_clust_prep['RelationshipSatisfaction'].cat.remove_unused_categories()

# Perform cluster sampling on the selected groups, drawing len(attrition_pop)//4 rows
# from each one (every selected group must contain at least that many rows)
attrition_clust = attrition_clust_prep.groupby('RelationshipSatisfaction').sample(n=len(attrition_pop)//4, random_state=2022)


![image.png](attachment:5e07951c-6086-460b-830d-6d067da3a7c6.png)

In [None]:
# Mean Attrition by RelationshipSatisfaction group
mean_attrition_pop = attrition_pop.groupby('RelationshipSatisfaction')['Attrition'].mean()

# Print the result
print(mean_attrition_pop)

In [None]:
# Calculate the same thing for the simple random sample 
mean_attrition_srs = attrition_srs.groupby('RelationshipSatisfaction')['Attrition'].mean()

# Print the result
print(mean_attrition_srs)

In [None]:
# Calculate the same thing for the stratified sample 
mean_attrition_strat = attrition_strat.groupby('RelationshipSatisfaction')['Attrition'].mean()

# Print the result
print(mean_attrition_strat)

In [None]:
# Calculate the same thing for the cluster sample 
mean_attrition_clust = attrition_clust.groupby('RelationshipSatisfaction')['Attrition'].mean()

# Print the result
print(mean_attrition_clust)
