
## 🧠 Assignment 2.1 - Random Sampling and Sample Bias

---

## Dataset:

```python
import pandas as pd

data = {
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
              'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',
              'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa'],
    'Region': ['South', 'West', 'West', 'South', 'West',
               'West', 'Northeast', 'South', 'South', 'South',
               'West', 'West', 'Midwest', 'Midwest', 'Midwest']
}

state = pd.DataFrame(data)
```
---

### 📘 PART A: Code and Interpretation

### **Q1. Simple Random Sample (Without Replacement)**

Take a **simple random sample of 5 states** from the dataset (without replacement).

* List the selected states.
* Is this sample representative of all regions? Why or why not?



### **Q2. Stratified Sampling**

Perform a **stratified sample**: take **1 state from each Region**.

* Which states did you get?
* Why might this be more useful than simple random sampling?



### **Q3. Sampling With Replacement**

Draw **10 states with replacement** from the dataset.

* Are there any duplicates?
* Why would sampling with replacement be useful in practice?

---

### 📘 PART B: Conceptual Questions

### **Q4. Oversampling**

Briefly explain **what oversampling is**, and give an example of when it might be necessary.



### **Q5. Sampling Bias**

You’re running a political survey. Explain how **sampling bias** could occur if you only sample people who respond to online ads. What might be the result?



### **Q6. Census vs. Sampling**

Why do data scientists often prefer **sampling over conducting a census**, even if they have access to the full population?



# Answers:

In [1]:
# Import libraries that we will use for the assignment:
import pandas as pd
import scipy.stats

data = {
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
              'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',
              'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa'],
    'Region': ['South', 'West', 'West', 'South', 'West',
               'West', 'Northeast', 'South', 'South', 'South',
               'West', 'West', 'Midwest', 'Midwest', 'Midwest']
}

state = pd.DataFrame(data)
state

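| | State | Region |
|---|---|---|
| 0 | Alabama | South |
| 1 | Alaska | West |
| 2 | Arizona | West |
| 3 | Arkansas | South |
| 4 | California | West |
| 5 | Colorado | West |
| 6 | Connecticut | Northeast |
| 7 | Delaware | South |
| 8 | Florida | South |
| 9 | Georgia | South |
| 10 | Hawaii | West |
| 11 | Idaho | West |
| 12 | Illinois | Midwest |
| 13 | Indiana | Midwest |
| 14 | Iowa | Midwest |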


### **Q1. Simple Random Sample (Without Replacement)**

Take a **simple random sample of 5 states** from the dataset (without replacement).

* List the selected states.
* Is this sample representative of all regions? Why or why not?


In [4]:
state.sample(n=5, random_state=42)

| | State | Region |
|---|---|---|
| 9 | Georgia | South |
| 11 | Idaho | West |
| 0 | Alabama | South |
| 13 | Indiana | Midwest |
| 5 | Colorado | West |


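The selected states are Georgia, Idaho, Alabama, Indiana, and Colorado. The sample is not representative of all regions because the Northeast was not included: it contains two states from the South, two from the West, and one from the Midwest. Because Connecticut is the only Northeast state among the 15, a simple random sample of 5 includes it with probability 5/15, so it misses the Northeast about two times out of three.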
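
One way to check representativeness is to compare each region's share of the sample with its share of the full dataset. The following is a minimal sketch, assuming the same `state` DataFrame as above; the `comparison` table is a name introduced here for illustration only.

```python
import pandas as pd

state = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
              'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',
              'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa'],
    'Region': ['South', 'West', 'West', 'South', 'West',
               'West', 'Northeast', 'South', 'South', 'South',
               'West', 'West', 'Midwest', 'Midwest', 'Midwest'],
})

sample = state.sample(n=5, random_state=42)

# Share of each region in the full dataset and in the sample
population_share = state['Region'].value_counts(normalize=True)
sample_share = sample['Region'].value_counts(normalize=True)

# Align both on the region name; a region absent from the sample gets 0
comparison = pd.DataFrame({
    'population': population_share,
    'sample': sample_share,
}).fillna(0)

print(comparison)
```

A region whose `sample` share is 0, or far from its `population` share, is under-represented in that particular draw.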

### **Q2. Stratified Sampling**

Perform a **stratified sample**: take **1 state from each Region**.

* Which states did you get?
* Why might this be more useful than simple random sampling?


In [5]:
# Create function that does the stratified sampling:
def stratified_sampling(df, stratify_col, frac=None, n=None):
    """
    Performs stratified sampling on a Pandas DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame to sample from.
        stratify_col (str): The column to stratify on.
        frac (float, optional): The fraction of samples to return from each group. Defaults to None.
        n (int, optional): The number of samples to return from each group. Defaults to None.

    Returns:
        pd.DataFrame: A DataFrame containing the stratified sample.
    """
    if frac is None and n is None:
        raise ValueError("Must specify either frac or n")
    if frac is not None and n is not None:
        raise ValueError("Must specify either frac or n, not both")

    return df.groupby(stratify_col, group_keys=False).apply(lambda x: x.sample(frac=frac, n=n))

Carry out the stratified sampling:

In [9]:
stratified_sampling(state, "Region", n=1)


| | State | Region |
|---|---|---|
| 13 | Indiana | Midwest |
| 6 | Connecticut | Northeast |
| 0 | Alabama | South |
| 5 | Colorado | West |


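The stratified sample contains Indiana (Midwest), Connecticut (Northeast), Alabama (South), and Colorado (West). Stratified sampling is more useful than simple random sampling here because it guarantees that every region is represented, including the Northeast, which has only one state in the dataset.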
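
As an alternative to the custom `stratified_sampling` function, pandas (version 1.1 and later) offers `DataFrameGroupBy.sample`, which draws rows within each group directly. The sketch below assumes the same `state` data; the `random_state=42` seed is an arbitrary choice added here so that the draw is repeatable.

```python
import pandas as pd

state = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
              'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',
              'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa'],
    'Region': ['South', 'West', 'West', 'South', 'West',
               'West', 'Northeast', 'South', 'South', 'South',
               'West', 'West', 'Midwest', 'Midwest', 'Midwest'],
})

# Draw one row from each Region; the seed makes the result reproducible
stratified = state.groupby('Region').sample(n=1, random_state=42)

print(stratified)
```

Whatever the seed, Connecticut is always selected, because it is the only state in the Northeast stratum.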

### **Q3. Sampling With Replacement**

Draw **10 states with replacement** from the dataset.

* Are there any duplicates?
* Why would sampling with replacement be useful in practice?


In [10]:
state.sample(n=10, replace=True, random_state=42)

| | State | Region |
|---|---|---|
| 6 | Connecticut | Northeast |
| 3 | Arkansas | South |
| 12 | Illinois | Midwest |
| 14 | Iowa | Midwest |
| 10 | Hawaii | West |
| 7 | Delaware | South |
| 12 | Illinois | Midwest |
| 4 | California | West |
| 6 | Connecticut | Northeast |
| 9 | Georgia | South |


Yes. This draw produced duplicates: **Illinois** and **Connecticut** each appear twice. Sampling with replacement is useful in practice because it simplifies probability calculations: every draw is independent, and the probability of selecting a given item stays constant throughout the sampling process. It is also the basis of the bootstrap, which resamples a dataset with replacement to estimate the variability of a statistic.
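
Duplicates can also be found programmatically instead of by reading the output. This is a small sketch, assuming the same `state` data: rows drawn more than once keep their original index label, so `Index.duplicated` identifies them. The names `resample` and `dupes` are introduced here for illustration.

```python
import pandas as pd

state = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
              'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',
              'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa'],
    'Region': ['South', 'West', 'West', 'South', 'West',
               'West', 'Northeast', 'South', 'South', 'South',
               'West', 'West', 'Midwest', 'Midwest', 'Midwest'],
})

resample = state.sample(n=10, replace=True, random_state=42)

# keep=False marks every occurrence of a repeated index label, not just the later ones
dupes = resample[resample.index.duplicated(keep=False)]

print("States drawn more than once:", list(dupes['State'].unique()))
print(resample['State'].nunique(), "distinct states in", len(resample), "draws")
```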

### **Q4. Oversampling**

Briefly explain **what oversampling is**, and give an example of when it might be necessary.


**Answer:** Oversampling in statistics involves intentionally incorporating more members of a particular group into a sample than would be expected if the sample were drawn at random. This is often done to study smaller or harder-to-reach groups within the population. The goal is to increase the precision of estimates for those groups by ensuring a larger sample size, even if it means overrepresenting them in the overall sample. 

**Example:** Imagine a national survey on religious beliefs that also needs to report results for a small state like Wyoming. A simple random sample of the whole country might yield only a few Wyoming residents, making it difficult to draw meaningful conclusions about their religious practices. Oversampling would involve intentionally surveying more Wyoming residents to ensure a sufficient sample size for analysis, even if it means overrepresenting them in the overall survey sample.
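
The same idea can be sketched on the assignment's dataset, where the Northeast plays the role of the small group. This is only an illustration under assumptions made here: each row is weighted by the inverse of its region's size so that every region has the same total chance of being drawn, and the sample size of 8 and the seed are arbitrary.

```python
import pandas as pd

state = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
              'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',
              'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa'],
    'Region': ['South', 'West', 'West', 'South', 'West',
               'West', 'Northeast', 'South', 'South', 'South',
               'West', 'West', 'Midwest', 'Midwest', 'Midwest'],
})

# Number of states in each row's region, aligned to the original rows
region_size = state.groupby('Region')['State'].transform('count')

# Inverse-size weights: small regions get a larger weight per state
weights = 1 / region_size

# pandas normalizes the weights to sum to 1 before drawing
oversampled = state.sample(n=8, replace=True, weights=weights, random_state=0)

print(oversampled['Region'].value_counts())
```

Because the small group is deliberately overrepresented, any estimate for the whole population has to weight those observations back down to their true share.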

### **Q5. Sampling Bias**

You’re running a political survey. Explain how **sampling bias** could occur if you only sample people who respond to online ads. What might be the result?


**Answer:** Sampling bias can occur here because people who respond to online ads are self-selected and may not be representative of the entire population. For example, they may be people who are particularly politically engaged, which may be more common among older generations than among the youth. They are also people with a reliable internet connection, which can indicate a certain socioeconomic class.

As a result, the survey estimate may be skewed toward the views of those groups and may differ from the result that would be obtained if the entire population voted.
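
A small simulation can make the effect concrete. Every number below is hypothetical and chosen only for illustration: 30% of people respond to online ads, responders support a candidate at 60%, and everyone else at 45%.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # hypothetical population size

# Hypothetical: 30% of the population responds to online ads
responds = rng.random(n) < 0.30

# Hypothetical: responders support the candidate more often than non-responders
support_prob = np.where(responds, 0.60, 0.45)
supports = rng.random(n) < support_prob

true_share = supports.mean()                   # whole population
biased_estimate = supports[responds].mean()    # only ad responders

# A simple random sample of 1,000 people for comparison
random_idx = rng.choice(n, size=1000, replace=False)
random_estimate = supports[random_idx].mean()

print(f"True support:           {true_share:.3f}")
print(f"Ad-responder estimate:  {biased_estimate:.3f}")
print(f"Random-sample estimate: {random_estimate:.3f}")
```

Under these assumptions, the ad-responder estimate stays well above the true level of support no matter how many responders are surveyed, while the much smaller random sample lands close to it.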

### **Q6. Census vs. Sampling**

Why do data scientists often prefer **sampling over conducting a census**, even if they have access to the full population?


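**Answer:** A full census is very labour-intensive and takes a lot of time to carry out. A sample, which is a subset of the population, takes far less time and labour, and a well-drawn random sample usually estimates the quantity of interest closely enough for the decision at hand. Even when the full population is available, a sample is cheaper to store, faster to process, and easier to check for data quality.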
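
The trade-off can be illustrated with a simulated population. The population below is hypothetical (one million normally distributed values with mean 50 and standard deviation 10), and the sample size of 1,000 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of one million measurements
population = rng.normal(loc=50, scale=10, size=1_000_000)
census_mean = population.mean()

# A simple random sample of 1,000 values, drawn without replacement
sample = rng.choice(population, size=1_000, replace=False)
sample_mean = sample.mean()

# Standard error of the sample mean: sample standard deviation / sqrt(n)
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))

print(f"Census mean: {census_mean:.2f}")
print(f"Sample mean: {sample_mean:.2f} (standard error {standard_error:.2f})")
```

In this sketch the sample uses 0.1% of the data, and its standard error indicates how far the sample mean is likely to be from the census value.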