## Import Modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
ap_data = pd.read_excel('./ApTest.xlsx')
ap_data.head()

| Unnamed: 0 | Correct |
|---|---|
| 0 | 112 |
| 1 | 73 |
| 2 | 126 |
| 3 | 82 |
| 4 | 92 |


# 1. Measures of Location

## 1-1. Mean

✅ **pandas.DataFrame.mean**

- Return the mean of the values over the requested axis.
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html

## $\bar x = {\sum x_i \over n}$

$\sum x_i$ : Sum of the values of the n observations

$n$ : Number of observations in the sample

Most of these measures are built into pandas, but we will compute each one in two ways: by hand (✍) and with the library method (🤖).

In [3]:
n = len(ap_data["Correct"])
sum_x = sum(ap_data["Correct"])
mean_x = sum_x / n

print(f'✍ Mean = {mean_x}')

✍ Mean = 98.92


In [4]:
mean_x = ap_data["Correct"].mean() # ✅ pandas.DataFrame.mean

print(f'🤖 Mean = {mean_x}')

🤖 Mean = 98.92
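
As a quick check of the formula on numbers small enough to verify by hand, the sketch below uses only the five values shown by `ap_data.head()` above. The name `first_five` is an illustrative assumption, and the result describes these five rows only, not the full data set.

```python
import numpy as np

# The five "Correct" values displayed by ap_data.head() earlier in this notebook.
first_five = [112, 73, 126, 82, 92]

# Manual: sum of the observations divided by the number of observations.
manual_mean = sum(first_five) / len(first_five)

# Library: numpy.mean applies the same formula.
numpy_mean = float(np.mean(first_five))

print(f'Manual mean = {manual_mean}')  # 485 / 5 = 97.0
print(f'NumPy mean  = {numpy_mean}')
```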


## 1-2. Median

✅ **pandas.DataFrame.sort_values**

- Sort by the values along either axis.
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

✅ **pandas.DataFrame.reset_index**

- Reset the index of the DataFrame.
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html

✅ **pandas.DataFrame.median**

- Return the median of the values over the requested axis.
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.median.html

In [5]:
# To get the median, sort the data in ascending order.
sorted_data = ap_data["Correct"].sort_values().reset_index(drop=True) # ✅ pandas.DataFrame.reset_index ✅ pandas.DataFrame.sort_values
sorted_data.head()

0    68
1    69
2    72
3    73
4    73
Name: Correct, dtype: int64

In [6]:
# n is even (n = 50), so the median is the average of the two middle values:
# positions n/2 and n/2 + 1 in the sorted data (zero-based indices n//2 - 1 and n//2).
mid = n // 2  # integer division keeps the position an int
median_x = (sorted_data.iloc[mid - 1] + sorted_data.iloc[mid]) / 2
print(f'✍ Median = {median_x}')

✍ Median = 97.5


In [7]:
median_x = ap_data["Correct"].median() # ✅ pandas.DataFrame.median
print(f'🤖 Median = {median_x}')

🤖 Median = 97.5
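
The manual calculation above averages the two middle values, which is the rule for an even number of observations. A minimal sketch that also covers an odd number of observations could look like the following; `manual_median` is a hypothetical helper, and `even_sample` contains one made-up value.

```python
import pandas as pd

def manual_median(values):
    """Middle value for odd n, mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [112, 73, 126, 82, 92]   # the five values shown by ap_data.head()
even_sample = odd_sample + [100]      # one made-up value appended to get an even n

print(manual_median(odd_sample))      # sorted: 73, 82, 92, 112, 126 -> 92.0
print(manual_median(even_sample))     # middle pair is 92 and 100 -> 96.0

# Cross-check against pandas on the same values.
print(pd.Series(even_sample).median())
```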


## 1-3. Mode

✅ **pandas.DataFrame.mode**

- Get the mode(s) of each element along the selected axis.
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html

In [8]:
# We can check the frequency of the data by using value_counts.
ap_data["Correct"].value_counts().head()

92     3
106    3
81     2
76     2
115    2
Name: Correct, dtype: int64

In [9]:
ap_data["Correct"].mode() # ✅ pandas.DataFrame.mode

0     92
1    106
Name: Correct, dtype: int64
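
For the by-hand (✍) counterpart, one possible sketch counts occurrences with `collections.Counter` and keeps every value that reaches the highest count, so ties produce several modes. The helper `manual_modes` and the values in `sample` are made up for illustration.

```python
from collections import Counter

def manual_modes(values):
    """Return all values that occur most often, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

sample = [92, 106, 81, 92, 106, 76]  # made-up values with a two-way tie

print(manual_modes(sample))  # 92 and 106 both occur twice
```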

## 1-4. Percentiles (20th and 80th)

✅ **pandas.DataFrame.quantile**

- Return values at the given quantile over requested axis.
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html

We can compute index i, the position of the pth percentile, as follows:

### $ i = (p/100)n $

In [10]:
def get_percentile(data, p):
    """Return the pth percentile (0 < p < 100) of `data`, a Series sorted in ascending order."""
    n = len(data)
    i = p * n / 100
    if i == int(i):
        # If i is an integer, the pth percentile is the average of the values in positions i and i+1.
        i = int(i)
        pct = (data.iloc[i - 1] + data.iloc[i]) / 2
    else:
        # If i is not an integer, round up.
        i = int(np.ceil(i))
        # The pth percentile is the value in the ith position.
        pct = data.iloc[i - 1]
    return pct
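
To see both branches of the rule on a data set small enough to check by hand, here is a self-contained sketch on a plain Python list. The name `textbook_percentile` and the eight values are assumptions for illustration, and the function expects sorted input with 0 < p < 100.

```python
import math

def textbook_percentile(sorted_values, p):
    """pth percentile using i = (p/100) * n on ascending data (0 < p < 100)."""
    n = len(sorted_values)
    i = p * n / 100
    if i == int(i):
        # Integer position: average the values in positions i and i+1 (1-based).
        k = int(i)
        return (sorted_values[k - 1] + sorted_values[k]) / 2
    # Non-integer position: round up and take the value in that position.
    k = math.ceil(i)
    return sorted_values[k - 1]

small = [10, 20, 30, 40, 50, 60, 70, 80]  # made-up, already sorted, n = 8

print(textbook_percentile(small, 25))  # i = 2   -> (20 + 30) / 2 = 25.0
print(textbook_percentile(small, 30))  # i = 2.4 -> position 3    = 30
print(textbook_percentile(small, 50))  # i = 4   -> (40 + 50) / 2 = 45.0
```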

The position of the 20th percentile: $i = (20/100) \times 50 = 10$

In [11]:
pct = get_percentile(sorted_data, 20)
print(f'✍ 20th Percentile = {pct}')

✍ 20th Percentile = 81.0


In [12]:
# The position of the 20th percentile is an integer (i = 10), so the textbook rule averages the values in positions i and i+1.
# pandas places the quantile at zero-based position (n-1)q = 9.8, which lies between those same two values,
# so interpolation='midpoint' reproduces that average: for neighbouring data points i and j it returns (i + j) / 2.
pct = ap_data["Correct"].quantile(0.2, interpolation='midpoint') # ✅ pandas.DataFrame.quantile
print(f'🤖 20th Percentile = {pct}')

🤖 20th Percentile = 81.0


The position of the 80th percentile: $i = (80/100) \times 50 = 40$

In [13]:
pct = get_percentile(sorted_data, 80)
print(f'✍ 80th Percentile = {pct}')

✍ 80th Percentile = 116.5


In [14]:
pct = ap_data["Correct"].quantile(0.8, interpolation='midpoint')
print(f'🤖 80th Percentile = {pct}')

🤖 80th Percentile = 116.5


## 1-5. Quartiles (1st, 2nd, and 3rd)

### 1st Quartile

The position of the 25th percentile: $i = (25/100) \times 50 = 12.5$, rounded up to $13$

In [15]:
pct = get_percentile(sorted_data, 25)
print(f'✍ 25th Percentile = {pct}')

✍ 25th Percentile = 83


In [16]:
# The position of the 25th percentile is not an integer (i = 12.5), so the textbook rule rounds it up to 13.
# pandas places the quantile at zero-based position (n-1)q = 12.25; interpolation='nearest' picks zero-based index 12, the 13th value.
# 'nearest' returns whichever neighbouring data point is closer, so it matches the round-up rule here but not in general.
pct = ap_data["Correct"].quantile(0.25, interpolation='nearest')
print(f'🤖 25th Percentile = {pct}')

🤖 25th Percentile = 83
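
pandas locates a quantile at the zero-based position $(n-1)q$ in the sorted data, while the textbook rule uses the one-based position $i = (p/100)n$. The two agree for this data set, but the sketch below shows a case where `interpolation='nearest'` and the round-up rule pick different values. The eight values are made up for illustration.

```python
import math
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])  # made-up, already sorted
n = len(values)
q = 0.10

# Textbook rule: i = q * n = 0.8 is not an integer, so round up to position 1.
textbook_position = math.ceil(q * n)
textbook_value = values.iloc[textbook_position - 1]

# pandas: zero-based position (n - 1) * q = 0.7, and 'nearest' selects index 1.
pandas_value = values.quantile(q, interpolation='nearest')

print(f'Textbook 10th percentile = {textbook_value}')  # 10
print(f'pandas nearest           = {pandas_value}')    # 20
```

When the two disagree, a helper that applies the textbook rule directly is the safer way to reproduce hand calculations.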


### 2nd Quartile

The position of the 50th percentile: $i = (50/100) \times 50 = 25$

In [17]:
pct = get_percentile(sorted_data, 50)
print(f'✍ 50th Percentile = {pct}')

✍ 50th Percentile = 97.5


In [18]:
pct = ap_data["Correct"].quantile(0.5, interpolation='midpoint')
print(f'🤖 50th Percentile = {pct}')

🤖 50th Percentile = 97.5


In [19]:
med = sorted_data.median()
print(f'Median = {med}')

Median = 97.5


### 3rd Quartile

The position of the 75th percentile: $i = (75/100) \times 50 = 37.5$, rounded up to $38$

In [20]:
pct = get_percentile(sorted_data, 75)
print(f'✍ 75th Percentile = {pct}')

✍ 75th Percentile = 113


In [21]:
pct = ap_data["Correct"].quantile(0.75, interpolation='nearest')
print(f'🤖 75th Percentile = {pct}')

🤖 75th Percentile = 113


# 2. Measures of Variability

## 2-1. Range

✅ **pandas.DataFrame.max**
- Return the maximum of the values over the requested axis
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html

✅ **pandas.DataFrame.min**
- Return the minimum of the values over the requested axis
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.min.html

Range = largest value - smallest value

In [22]:
range_x = ap_data["Correct"].max() - ap_data["Correct"].min() # ✅ pandas.DataFrame.max ✅ pandas.DataFrame.min
print(f'Range = {range_x}')

Range = 73
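
The same definition can be checked on the five values shown by `ap_data.head()`, both with Python's built-in `max` and `min` and with `numpy.ptp` ("peak to peak"), which returns the maximum minus the minimum. This is only a sketch on those five rows; `first_five` is an illustrative name.

```python
import numpy as np

first_five = [112, 73, 126, 82, 92]  # values shown by ap_data.head()

# Manual: largest value minus smallest value.
manual_range = max(first_five) - min(first_five)

# Library: numpy.ptp computes max - min in one call.
numpy_range = np.ptp(first_five)

print(f'Manual range = {manual_range}')  # 126 - 73 = 53
print(f'NumPy range  = {numpy_range}')
```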


## 2-2. Interquartile Range

Interquartile Range = 3rd Quartile (Q3) - 1st Quartile (Q1)

In [23]:
# When the desired quantile lies between two data points i and j, 'nearest' returns whichever of i or j is closer.
q1 = ap_data["Correct"].quantile(0.25, interpolation='nearest')
q3 = ap_data["Correct"].quantile(0.75, interpolation='nearest')

irange_x = q3 - q1
print(f'Interquartile Range = {irange_x}')

Interquartile Range = 30


## 2-3. Variance

✅ **numpy.sum**
- Sum of array elements over a given axis.
- https://numpy.org/doc/stable/reference/generated/numpy.sum.html

✅ **numpy.power**
- First array elements raised to powers from second array, element-wise.
- https://numpy.org/doc/stable/reference/generated/numpy.power.html

✅ **pandas.DataFrame.var**
- Return unbiased variance over requested axis.
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.var.html

The variance is based on the difference between each observation ($x_i$) and the mean ($\bar x$).

## $ s^2 = {{\sum (x_i - \bar x)^2} \over (n-1)}$

In [24]:
var_x = np.sum(np.power(ap_data["Correct"] - mean_x, 2)) / (n-1) # ✅ numpy.sum # ✅ numpy.power
print(f'✍ Variance = {var_x}')

✍ Variance = 355.6261224489795


In [25]:
var_x = ap_data["Correct"].var() # ✅ pandas.DataFrame.var
print(f'🤖 Variance = {var_x}')

🤖 Variance = 355.6261224489795
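
The two results agree because pandas divides by $n-1$ by default (`ddof=1`), as in the formula above. `numpy.var` defaults to `ddof=0` and divides by $n$, so it needs `ddof=1` to give the sample variance. The sketch below illustrates the difference on the five values shown by `ap_data.head()`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

first_five = [112, 73, 126, 82, 92]  # values shown by ap_data.head()
n_five = len(first_five)
mean_five = sum(first_five) / n_five  # 97.0

# Sum of squared deviations from the mean: 225 + 576 + 841 + 225 + 25 = 1892.
squared_dev = sum((x - mean_five) ** 2 for x in first_five)

sample_var = squared_dev / (n_five - 1)  # divides by n - 1 -> 473.0
population_var = squared_dev / n_five    # divides by n     -> 378.4

print(sample_var, pd.Series(first_five).var())   # pandas default: ddof=1
print(population_var, np.var(first_five))        # NumPy default: ddof=0
print(sample_var, np.var(first_five, ddof=1))    # NumPy with ddof=1
```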


## 2-4. Standard Deviation

✅ **pandas.DataFrame.std**
- Return sample standard deviation over requested axis.
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.std.html

The standard deviation of a data set is the positive square root of the variance.

## $ s = \sqrt {s^2} $

In [26]:
std_x = np.sqrt(var_x)
print(f'✍ Standard Deviation = {std_x}')

✍ Standard Deviation = 18.858051926139655


In [27]:
std_x = ap_data["Correct"].std() # ✅ pandas.DataFrame.std
print(f'🤖 Standard Deviation = {std_x}')

🤖 Standard Deviation = 18.858051926139655


## 2-5. Coefficient of Variation

The coefficient of variation is computed as follows:

## $ ({s \over {\bar x}} \times 100)\% $

In [28]:
cv_x = std_x / mean_x * 100
print(f'Coefficient of Variation = {cv_x}')

Coefficient of Variation = 19.06394250519577
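
The result is a percentage: the standard deviation is about 19.1% of the mean. As a sketch, the calculation can be wrapped in a small helper built on Python's `statistics` module, whose `stdev` is the sample standard deviation (dividing by $n-1$). The helper name `coefficient_of_variation` is hypothetical, and the example uses only the five values shown by `ap_data.head()`.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

first_five = [112, 73, 126, 82, 92]  # values shown by ap_data.head()

cv_five = coefficient_of_variation(first_five)
print(f'Coefficient of Variation = {cv_five:.2f}%')
```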
