## 📊 Dataset (Use this)

```python
import pandas as pd

data = {
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
              'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',
              'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa'],
    'Region': ['South', 'West', 'West', 'South', 'West',
               'West', 'Northeast', 'South', 'South', 'South',
               'West', 'West', 'Midwest', 'Midwest', 'Midwest'],
    'Population': [4779736, 710231, 6392017, 2915918, 37253956,
                   5029196, 3574097, 897934, 18801310, 9687653,
                   1360301, 1567582, 12830632, 6483802, 3046355],
    'Murder.Rate': [5.7, 5.6, 4.7, 5.6, 4.4,
                    2.8, 2.4, 5.8, 5.2, 6.0,
                    1.8, 2.3, 5.5, 5.7, 1.9]
}

state = pd.DataFrame(data)
```

---

## 📘 Quiz Part A: Exploring the Data Distribution

### **Q1. Histogram Buckets**

Use NumPy or pandas to assign each state’s **murder rate** to a bin of size 1.0 (e.g., 1–2, 2–3, etc.). Count how many states fall into each bin.



### **Q2. Skewness**

Without using `.skew()` or built-in functions, inspect the **murder rate** data and determine whether it is **left-skewed**, **right-skewed**, or **approximately symmetric**. Justify your answer based on comparison of **mean vs. median** and visual evidence if needed.



### **Q3. Empirical CDF**

Manually compute the **empirical cumulative distribution function (ECDF)** for the **population**. Return a list of 15 `(x, y)` pairs where:

* `x` is the population value
* `y` is the proportion of states less than or equal to `x`



## 📘 Quiz Part B: Exploring Binary and Categorical Data

### **Q4. Frequency Table**

Create a **frequency table** showing how many states are in each **Region**.



### **Q5. Proportion Table**

Convert the frequency table from Q4 into a **proportion table**.



### **Q6. Conditional Proportions**

Find the proportion of states in each region whose **murder rate** is **above 5.0**.



### **Q7. Bar Plot Interpretation (Conceptual)**

Suppose you created a bar plot showing the number of states in each region. Why might it be misleading to directly compare regions without considering population size?



## Answers

Create the dataset:

```python
import pandas as pd

data = {
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
              'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',
              'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa'],
    'Region': ['South', 'West', 'West', 'South', 'West',
               'West', 'Northeast', 'South', 'South', 'South',
               'West', 'West', 'Midwest', 'Midwest', 'Midwest'],
    'Population': [4779736, 710231, 6392017, 2915918, 37253956,
                   5029196, 3574097, 897934, 18801310, 9687653,
                   1360301, 1567582, 12830632, 6483802, 3046355],
    'Murder.Rate': [5.7, 5.6, 4.7, 5.6, 4.4,
                    2.8, 2.4, 5.8, 5.2, 6.0,
                    1.8, 2.3, 5.5, 5.7, 1.9]
}

state = pd.DataFrame(data)
```

Import a library we might need later:

```python
import numpy as np
```

View the data:

```python
state
```

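|    | State       | Region    | Population | Murder.Rate |
|----|-------------|-----------|------------|-------------|
| 0  | Alabama     | South     | 4779736    | 5.7         |
| 1  | Alaska      | West      | 710231     | 5.6         |
| 2  | Arizona     | West      | 6392017    | 4.7         |
| 3  | Arkansas    | South     | 2915918    | 5.6         |
| 4  | California  | West      | 37253956   | 4.4         |
| 5  | Colorado    | West      | 5029196    | 2.8         |
| 6  | Connecticut | Northeast | 3574097    | 2.4         |
| 7  | Delaware    | South     | 897934     | 5.8         |
| 8  | Florida     | South     | 18801310   | 5.2         |
| 9  | Georgia     | South     | 9687653    | 6.0         |
| 10 | Hawaii      | West      | 1360301    | 1.8         |
| 11 | Idaho       | West      | 1567582    | 2.3         |
| 12 | Illinois    | Midwest   | 12830632   | 5.5         |
| 13 | Indiana     | Midwest   | 6483802    | 5.7         |
| 14 | Iowa        | Midwest   | 3046355    | 1.9         |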


### **Q1. Histogram Buckets**

Use NumPy or pandas to assign each state’s **murder rate** to a bin of size 1.0 (e.g., 1–2, 2–3, etc.). Count how many states fall into each bin.


```python
# Define bins and their labels
bins = [1, 2, 3, 4, 5, 6, 7]
labels = ['1-2', '2-3', '3-4', '4-5', '5-6', '6-7']

# right=False makes each bin include its lower edge and exclude its upper edge
state['Binned_Values'] = pd.cut(state['Murder.Rate'], bins=bins, labels=labels, right=False)
state
```

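|    | State       | Region    | Population | Murder.Rate | Binned_Values |
|----|-------------|-----------|------------|-------------|---------------|
| 0  | Alabama     | South     | 4779736    | 5.7         | 5-6           |
| 1  | Alaska      | West      | 710231     | 5.6         | 5-6           |
| 2  | Arizona     | West      | 6392017    | 4.7         | 4-5           |
| 3  | Arkansas    | South     | 2915918    | 5.6         | 5-6           |
| 4  | California  | West      | 37253956   | 4.4         | 4-5           |
| 5  | Colorado    | West      | 5029196    | 2.8         | 2-3           |
| 6  | Connecticut | Northeast | 3574097    | 2.4         | 2-3           |
| 7  | Delaware    | South     | 897934     | 5.8         | 5-6           |
| 8  | Florida     | South     | 18801310   | 5.2         | 5-6           |
| 9  | Georgia     | South     | 9687653    | 6.0         | 6-7           |
| 10 | Hawaii      | West      | 1360301    | 1.8         | 1-2           |
| 11 | Idaho       | West      | 1567582    | 2.3         | 2-3           |
| 12 | Illinois    | Midwest   | 12830632   | 5.5         | 5-6           |
| 13 | Indiana     | Midwest   | 6483802    | 5.7         | 5-6           |
| 14 | Iowa        | Midwest   | 3046355    | 1.9         | 1-2           |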
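
The question also asks for the number of states in each bin. A minimal sketch of that counting step follows; it rebuilds the murder rates as a standalone Series so it runs on its own, and the names `rates`, `binned`, and `bin_counts` are illustrative rather than part of the original answer.

```python
import pandas as pd

# Murder rates copied from the dataset above
rates = pd.Series([5.7, 5.6, 4.7, 5.6, 4.4, 2.8, 2.4, 5.8, 5.2, 6.0,
                   1.8, 2.3, 5.5, 5.7, 1.9], name="Murder.Rate")

bins = [1, 2, 3, 4, 5, 6, 7]
labels = ['1-2', '2-3', '3-4', '4-5', '5-6', '6-7']

# right=False makes each bin closed on the left: [1, 2), [2, 3), ...
binned = pd.cut(rates, bins=bins, labels=labels, right=False)

# sort=False keeps the bins in category order instead of sorting by count
bin_counts = binned.value_counts(sort=False)
print(bin_counts)
```

With this data the counts come out as 2, 3, 0, 2, 7, and 1 for the bins 1-2 through 6-7, which sum to the 15 states.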


### **Q2. Skewness**

Without using `.skew()` or built-in functions, inspect the **murder rate** data and determine whether it is **left-skewed**, **right-skewed**, or **approximately symmetric**. Justify your answer based on comparison of **mean vs. median** and visual evidence if needed.


```python
print(f"This is the mean: {state['Murder.Rate'].mean()}")
print(f"This is the median: {state['Murder.Rate'].median()}")
```

```
This is the mean: 4.36
This is the median: 5.2
```


**Answer:** The murder rate is **left-skewed** because the mean (4.36) is less than the median (5.2). Ten of the fifteen states have rates between 4.4 and 6.0, while five states with low rates (1.8 to 2.8) form a tail on the left that pulls the mean below the median.
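
As a cross-check that still avoids `.skew()`, the sketch below computes Pearson's second skewness coefficient, `3 * (mean - median) / std`, and prints a crude text histogram as visual evidence. This coefficient is a swapped-in rule of thumb, not the method used in the answer above, and the variable names are illustrative.

```python
import pandas as pd

# Murder rates copied from the dataset above
rates = pd.Series([5.7, 5.6, 4.7, 5.6, 4.4, 2.8, 2.4, 5.8, 5.2, 6.0,
                   1.8, 2.3, 5.5, 5.7, 1.9], name="Murder.Rate")

mean = rates.mean()
median = rates.median()
std = rates.std()  # sample standard deviation (ddof=1)

# Negative values suggest a left skew, positive values a right skew
pearson_skew = 3 * (mean - median) / std
print(f"Pearson's second skewness coefficient: {pearson_skew:.2f}")

# Crude text histogram: one '#' per state in each unit-wide bin [lower, lower + 1)
for lower in range(1, 7):
    in_bin = int(((rates >= lower) & (rates < lower + 1)).sum())
    print(f"{lower}-{lower + 1}: {'#' * in_bin}")
```

A negative coefficient together with a histogram whose bulk sits in the 5-6 bin and whose tail extends toward the low bins is consistent with the left-skew conclusion.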

### **Q3. Empirical CDF**

Manually compute the **empirical cumulative distribution function (ECDF)** for the **population**. Return a list of 15 `(x, y)` pairs where:

* `x` is the population value
* `y` is the proportion of states less than or equal to `x`



```python
# Sort the population in ascending order
population_sorted = state["Population"].sort_values()

# Create an empty list to store the (x, y) pairs
distr = []

# Number of data points "n"
n = population_sorted.shape[0]

# Running count of values less than or equal to the current one
count = 1

for i in range(n):
    # Use .iloc for positional access: sort_values() keeps the original index
    # labels, so population_sorted[i] would return the rows in unsorted order
    pair = [int(population_sorted.iloc[i]), count / n]
    distr.append(pair)
    count += 1  # increment the count for the next iteration

print(distr)
```

```
[[710231, 0.06666666666666667], [897934, 0.13333333333333333], [1360301, 0.2], [1567582, 0.26666666666666666], [2915918, 0.3333333333333333], [3046355, 0.4], [3574097, 0.4666666666666667], [4779736, 0.5333333333333333], [5029196, 0.6], [6392017, 0.6666666666666666], [6483802, 0.7333333333333333], [9687653, 0.8], [12830632, 0.8666666666666667], [18801310, 0.9333333333333333], [37253956, 1.0]]
```
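
The ECDF definition (`y` is the proportion of values less than or equal to `x`) can also be sketched without an explicit loop. The version below uses NumPy on a standalone copy of the population values; it assumes there are no tied values, which holds for this dataset. With ties, each tied `x` should take the largest cumulative proportion among its duplicates.

```python
import numpy as np

# Population values copied from the dataset above
population = np.array([4779736, 710231, 6392017, 2915918, 37253956,
                       5029196, 3574097, 897934, 18801310, 9687653,
                       1360301, 1567582, 12830632, 6483802, 3046355])

x = np.sort(population)        # ascending order
n = len(x)
y = np.arange(1, n + 1) / n    # k-th smallest value has k/n of the data <= it

# List of (x, y) pairs with plain Python ints and floats
ecdf = list(zip(x.tolist(), y.tolist()))
print(ecdf)
```

The first pair is the smallest state in the sample (Alaska, 710231) with `y = 1/15`, and the last pair is California with `y = 1.0`.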


### **Q4. Frequency Table**

Create a **frequency table** showing how many states are in each **Region**.


```python
freq_region = pd.DataFrame(state.groupby('Region')['State'].count())
freq_region
```

| Region    | State |
|-----------|-------|
| Midwest   | 3     |
| Northeast | 1     |
| South     | 5     |
| West      | 6     |


### **Q5. Proportion Table**

Convert the frequency table from Q4 into a **proportion table**.


```python
# Find the total number of states
state_total = freq_region['State'].sum()

# Create the column showing proportion
freq_region["Proportion"] = freq_region['State'] / state_total

freq_region
```

| Region    | State | Proportion |
|-----------|-------|------------|
| Midwest   | 3     | 0.2        |
| Northeast | 1     | 0.066667   |
| South     | 5     | 0.333333   |
| West      | 6     | 0.4        |
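
For comparison, the frequency and proportion tables from Q4 and Q5 can also be built with `Series.value_counts`, whose `normalize=True` option returns proportions directly. This is an alternative sketch on a standalone copy of the `Region` column, not the approach used above; the column names `Count` and `Proportion` are illustrative.

```python
import pandas as pd

# Region labels copied from the dataset above
region = pd.Series(['South', 'West', 'West', 'South', 'West',
                    'West', 'Northeast', 'South', 'South', 'South',
                    'West', 'West', 'Midwest', 'Midwest', 'Midwest'],
                   name="Region")

# Counts per region, sorted alphabetically by region name
freq = region.value_counts().sort_index()

# normalize=True divides each count by the total number of rows
prop = region.value_counts(normalize=True).sort_index()

table = pd.DataFrame({"Count": freq, "Proportion": prop})
print(table)
```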


### **Q6. Conditional Proportions**

Find the proportion of states in each region whose **murder rate** is **above 5.0**.


```python
# Create mask to filter the murder rate
mask = state["Murder.Rate"] > 5.0

# Create data frame with the filter
filt_freq_table = pd.DataFrame(state[mask].groupby('Region')['State'].count())
filt_freq_table
```

| Region  | State |
|---------|-------|
| Midwest | 2     |
| South   | 5     |
| West    | 1     |


Now divide the number of states with a murder rate above 5.0 in each region by that region's total number of states:

```python
murd_proportion = pd.DataFrame(filt_freq_table['State'] / freq_region['State']).rename(columns={"State": "Proportion_compared_to_Total"})
murd_proportion
```

| Region    | Proportion_compared_to_Total |
|-----------|------------------------------|
| Midwest   | 0.666667                     |
| Northeast | NaN                          |
| South     | 1.0                          |
| West      | 0.166667                     |


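The Northeast has no states above 5.0, so it is absent from the filtered table and the division yields a null value (NaN). Replace it with 0:

```python
murd_proportion.fillna(0)
```

| Region    | Proportion_compared_to_Total |
|-----------|------------------------------|
| Midwest   | 0.666667                     |
| Northeast | 0.0                          |
| South     | 1.0                          |
| West      | 0.166667                     |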
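
A more compact route to the same conditional proportions is to take the mean of the boolean mask within each region, since the mean of `True`/`False` values is the share of `True`. The sketch below is an alternative to the steps above, run on a standalone copy of the two relevant columns; `df` and `prop_above` are illustrative names.

```python
import pandas as pd

# Region and murder rate columns copied from the dataset above
df = pd.DataFrame({
    "Region": ['South', 'West', 'West', 'South', 'West',
               'West', 'Northeast', 'South', 'South', 'South',
               'West', 'West', 'Midwest', 'Midwest', 'Midwest'],
    "Murder.Rate": [5.7, 5.6, 4.7, 5.6, 4.4, 2.8, 2.4, 5.8, 5.2, 6.0,
                    1.8, 2.3, 5.5, 5.7, 1.9],
})

# True where the state's murder rate is strictly above 5.0
above = df["Murder.Rate"] > 5.0

# Mean of a boolean Series per group = proportion of True values in that group
prop_above = above.groupby(df["Region"]).mean()
print(prop_above)
```

Because every region keeps all of its rows in the grouping, a region with no states above the threshold comes out as 0.0 directly and no `fillna` step is needed.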


### **Q7. Bar Plot Interpretation (Conceptual)**

Suppose you created a bar plot showing the number of states in each region. Why might it be misleading to directly compare regions without considering population size?

**Answer:** A bar plot of state counts gives every state equal weight, but state populations differ widely. A region with a few very populous states can contain more people than a region with many small states, so a taller bar does not mean that more people live there.
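
To make the point concrete, the sketch below puts the number of states and the total population of each region side by side. It runs on a standalone copy of the two relevant columns, and the output column names `States` and `Population` are illustrative.

```python
import pandas as pd

# Region and population columns copied from the dataset above
df = pd.DataFrame({
    "Region": ['South', 'West', 'West', 'South', 'West',
               'West', 'Northeast', 'South', 'South', 'South',
               'West', 'West', 'Midwest', 'Midwest', 'Midwest'],
    "Population": [4779736, 710231, 6392017, 2915918, 37253956,
                   5029196, 3574097, 897934, 18801310, 9687653,
                   1360301, 1567582, 12830632, 6483802, 3046355],
})

# Named aggregation: one column for the state count, one for total population
summary = df.groupby("Region")["Population"].agg(States="count", Population="sum")
print(summary)
```

In this sample, California alone (37,253,956 people) has more residents than the three Midwest states combined (22,360,789), which a plot of state counts would not reveal.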