In [1]:
import numpy as np
import pandas as pd

### **2.1/ FREQUENCY DISTRIBUTIONS FOR QUANTITATIVE DATA**

- Definition:
    + Frequency Distribution: a collection of observations produced by sorting observations into classes and showing their frequency (f) of occurrence in each class.
    + Frequency Distribution for Ungrouped Data: a Frequency Distribution produced whenever observations are sorted into classes of single values.

#### **Not Always Appropriate:**
- A Frequency Distribution for ungrouped data has one class per single value, so once there are more than about 20 classes the table becomes hard to read and its meaning hard to convey.
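
As a rough illustration of this limit, the sketch below counts the distinct values in a data set and compares the count with the threshold of about 20. The function name `fits_ungrouped` and the parameter `max_classes` are made up for this note, and counting only the observed values (rather than every possible value) is a simplifying assumption.

```python
import numpy as np

def fits_ungrouped(values, max_classes=20):
    """Return True when the number of distinct values is at most max_classes."""
    return np.unique(values).size <= max_classes

# A 10-point rating scale has at most 10 single-value classes.
ratings = np.arange(1, 11)
# Hypothetical whole-number scores from 65 to 124 would need 60 classes.
wide_scores = np.arange(65, 125)

print(fits_ungrouped(ratings))
print(fits_ungrouped(wide_scores))
```

When the check fails, grouping the values into wider classes is usually the more readable choice.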

**Progress Check 2.1: Students in a theater arts appreciation class rated the classic film The Wizard of Oz on a 10-point scale, ranging from 1 (poor) to 10 (excellent), as follows:** <br>
![image.png](attachment:de20064e-43d6-4d6f-b24a-26e6713d22df.png) <br>
Since the number of possible values is relatively small—only 10—it’s appropriate to construct a frequency distribution for ungrouped data. Do this.

In [123]:
# Solution:
data_str = "3 7 2 7 8 3 1 4 10 3 2 5 3 5 8 9 7 6 3 7 8 9 7 3 6"
data = np.array(data_str.split(), dtype=np.int8)
freq_dist = pd.DataFrame(data=data, columns=["Rating"]).groupby("Rating").size()
freq_dist = pd.DataFrame(data={"Rating": freq_dist.index,
                               "f": freq_dist.values})
freq_dist.index.name = "Index"
freq_dist

Index,Rating,f
0,1,1
1,2,2
2,3,6
3,4,1
4,5,2
5,6,2
6,7,5
7,8,3
8,9,2
9,10,1


#### **Grouped Data:**
- Definition:
    + Frequency Distribution for Grouped Data: produced whenever observations are sorted into classes of more than one value.

In [None]:
# Example:
data_str = "240–249 1 230–239 0 220–229 3 210–219 0 200–209 2 190–199 4 180–189 3 170–179 7 160–169 12 150–159 17 140–149 1 130–139 3"
classes = data_str.split()[::2]                    # class labels, e.g. "240–249"
freq = [int(f) for f in data_str.split()[1::2]]    # frequencies as integers, not strings
freq_dist = pd.DataFrame(data=freq,
                         index=classes,
                         columns=["f"])
freq_dist.index.name = "Weight"
freq_dist

### **2.2/ GUIDELINES**

#### **How Many Classes?**
- The goal: to produce a concise description of data.

#### **When There Are Either Many or Few Observations:**
- General rule of thumb for the number of classes within a Frequency Distribution:
    + about 10 classes, with fewer when there are few observations and more when there are many.
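
One way to turn this rule of thumb into a starting class width is to divide the range of the data by the target number of classes. The sketch below does only that; the helper name `suggest_class_width` is hypothetical, and rounding to a whole number assumes whole-number scores. The result is a first guess that can still be adjusted to a more convenient width.

```python
def suggest_class_width(minimum, maximum, target_classes=10):
    """Rough class width: the range divided by the target number of classes.

    The result is rounded to a whole number and is never smaller than 1.
    """
    return max(1, round((maximum - minimum) / target_classes))

# Sample endpoints: scores running from 69 to 123 give a range of 54.
width = suggest_class_width(69, 123)
print(width)
```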

#### **Gaps between Classes:**
- Definition:
    + Unit of Measurement: the smallest possible difference between scores.
- The gap between adjacent classes should always equal one Unit of Measurement.
- Guidelines for Frequency Distributions: <br>
![image.png](attachment:2b8cd6b3-29c2-4bd4-92be-4174d0b39d0a.png)

#### **Real Limits of Class Intervals:**
- Definition:
    + Real Limits: located at the midpoint of the gap between adjacent tabled boundaries.
- Continuous data can take decimal values, so each observation is rounded to the nearest Unit of Measurement before it is tabled, which keeps the description concise. A class therefore really covers every value that rounds into it: its real limits extend $\frac{1}{2}$ of one Unit of Measurement below the tabled lower boundary and $\frac{1}{2}$ of one Unit of Measurement above the tabled upper boundary. <br>
For example: the real limits for 140-149 class are 139.5 (lower) and 149.5 (upper).
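
The same arithmetic can be written as a small helper. This is a minimal sketch: the function name `real_limits` and its `unit` parameter are chosen for this note, with `unit` standing for one Unit of Measurement.

```python
def real_limits(lower, upper, unit=1):
    """Return (lower real limit, upper real limit) for a tabled class.

    Each tabled boundary is moved outward by half a Unit of Measurement.
    """
    half = unit / 2
    return lower - half, upper + half

# The 140-149 class with a Unit of Measurement of 1.
print(real_limits(140, 149))
```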

#### **Constructing Frequency Distributions:**
![image.png](attachment:b95c88df-fa17-4dc4-9f0a-f5520c4de4aa.png)

**Progress Check 2.2: The IQ scores for a group of 35 high school dropouts are as follows:** <br>
![image.png](attachment:b5eabb4d-19a8-4b41-85a7-64b717ee6d37.png) <br>
(a) Construct a frequency distribution for grouped data. <br>
(b) Specify the real limits for the lowest class interval in this frequency distribution. <br>
The lowest class interval is 65-69 (class width 5), so its real limits are: <br>
- Lower limit = 65 - (0.5 * 1) = 64.5
- Upper limit = 69 + (0.5 * 1) = 69.5

In [None]:
# Solution:
data_str = "91 85 84 79 80 87 96 75 86 104 95 71 105 90 77 123 80 100 93 108 98 69 99 95 90 110 109 94 100 103 112 90 90 98 89"
data = np.sort(np.array(data_str.split(), dtype=np.int16))

# Start from about 10 classes and derive a whole-number class width from the range.
class_range = int(data.max() - data.min())
num_of_class = 10
interval = class_range // num_of_class
# Lowest tabled boundary: the multiple of the width at or below the minimum.
begin = int(data.min()) - (int(data.min()) % interval)
# Add classes until the highest class reaches the maximum score.
while (num_of_class * interval + begin) <= data.max():
    num_of_class += 1

final_data = dict(classes=list(), occurrence=list())
for idx in range(num_of_class):
    upper = begin + interval - 1
    final_data["classes"].append("{0}-{1}".format(begin, upper))
    # Count the scores that fall inside the tabled boundaries of this class.
    count = data[(data >= begin) & (data <= upper)].size
    final_data["occurrence"].append(count)
    begin = begin + interval

freq_dist = pd.DataFrame(data=final_data["occurrence"],
                         index=final_data["classes"],
                         columns=["f"])
freq_dist.index.name = "IQ Scores"
freq_dist

**Progress Check 2.3: What are some possible poor features of the following frequency distribution?** <br>
![image.png](attachment:103f4de1-44e4-446b-afe0-56a1e20ef993.png) <br>
- Interval: the classes have unequal widths.
- Gap: the gap between the 20-22 and 25-30 classes is greater than one Unit of Measurement, so observations of 23 or 24 hours have no class and would be lost.
- Boundary:
    + the highest class has no upper boundary, so readers cannot tell the maximum number of viewing hours;
    + the 25-30 and 30-34 classes overlap, so it is unclear which class a viewing time of 30 hours belongs to.
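
The width, gap, and overlap checks above can also be run mechanically. The sketch below is one possible way to do it, not a standard routine: `check_classes` is a hypothetical helper, it assumes whole-number boundaries sorted from lowest to highest, and it is applied only to the three classes named in the answer, not to the full table from the image. It does not cover the open-ended highest class.

```python
def check_classes(classes, unit=1):
    """List guideline problems for (lower, upper) tabled boundaries.

    classes must be sorted from the lowest class to the highest.
    """
    problems = []
    # Width counts every value a class covers, so 20-22 has width 3.
    widths = {upper - lower + unit for lower, upper in classes}
    if len(widths) > 1:
        problems.append("unequal widths: {}".format(sorted(widths)))
    # Compare each class with the next one up.
    for (lo1, up1), (lo2, up2) in zip(classes, classes[1:]):
        gap = lo2 - up1
        if gap > unit:
            problems.append("gap between {}-{} and {}-{}".format(lo1, up1, lo2, up2))
        elif gap < unit:
            problems.append("overlap between {}-{} and {}-{}".format(lo1, up1, lo2, up2))
    return problems

problems = check_classes([(20, 22), (25, 30), (30, 34)])
for problem in problems:
    print(problem)
```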

### **2.3/ OUTLIERS**

### **2.4/ RELATIVE FREQUENCY DISTRIBUTIONS**

### **2.5/ CUMULATIVE FREQUENCY DISTRIBUTIONS**

### **2.6/ FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA**

### **2.7/ INTERPRETING DISTRIBUTIONS CONSTRUCTED BY OTHERS**

### **2.8/ GRAPHS FOR QUANTITATIVE DATA**

### **2.9/ TYPICAL SHAPES**

### **2.10/ A GRAPH FOR QUALITATIVE (NOMINAL) DATA**

### **2.11/ MISLEADING GRAPHS**

### **2.12/ DOING IT YOURSELF**