# Coding exercises
Exercises 1-3 are thought exercises that don't require coding.

1. Explore the JupyterLab interface and look at some of the shortcuts available. Don't worry about memorizing them now (eventually they will become second nature and save you a lot of time); just get comfortable using notebooks.

2. Are all data normally distributed?

Answer: No, not all data are normally distributed. While the normal distribution (bell curve) is common in many natural and social phenomena (e.g., height, test scores), many datasets follow other distributions, such as:

- Skewed distributions (e.g., income, where most values are low and a few are very high)
- Uniform distributions (equal likelihood across a range)
- Bimodal or multimodal distributions (two or more peaks)
- Exponential or Poisson distributions (often used in time-to-event data or counts)
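
One way to see the difference between these shapes is to draw samples and measure their skewness. The sketch below is only an illustration: the helper `sample_skewness`, the seed, and the sample size are assumptions chosen for the example, not part of the exercise.

```python
import random
import statistics


def sample_skewness(data):
    """Moment-based skewness: the mean of the cubed z-scores (population form)."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return sum(((x - mu) / sigma) ** 3 for x in data) / len(data)


rng = random.Random(42)  # arbitrary seed, used only for reproducibility
n = 10_000
samples = {
    "normal": [rng.gauss(0, 1) for _ in range(n)],
    "uniform": [rng.uniform(0, 1) for _ in range(n)],
    "exponential": [rng.expovariate(1.0) for _ in range(n)],
}

# Symmetric distributions have skewness near 0; the exponential is right-skewed
skews = {name: sample_skewness(values) for name, values in samples.items()}
for name, skew in skews.items():
    print(f"{name:>12}: skewness = {skew:+.2f}")
```

The normal and uniform samples should come out close to zero, while the exponential sample should be clearly positive.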


3. When would it make more sense to use the median instead of the mean for the measure of center?

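Answer: Use the median when the data are skewed or contain outliers, because extreme values pull the mean toward them while the median depends only on the middle of the sorted data. For roughly symmetric data without extreme values, the mean works well.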
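
A small illustration of how a single outlier affects the two measures. The income figures are made up for the example.

```python
import statistics

# Hypothetical incomes, chosen only to illustrate the effect of an outlier
incomes = [30_000, 32_000, 35_000, 38_000, 40_000]
with_outlier = incomes + [1_000_000]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"Without outlier: mean={mean_before}, median={median_before}")
print(f"With outlier:    mean={mean_after:.2f}, median={median_after}")
```

Adding one very large value moves the mean far more than the median, which is why the median is usually preferred for salary-like data.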

## Exercise 4: Generate the data by running this cell
This will give you a list of numbers to work with in the remaining exercises.

In [4]:
import random
random.seed(0)
salaries = [round(random.random()*1000000, -3) for _ in range(100)]
print(salaries)

[844000.0, 758000.0, 421000.0, 259000.0, 511000.0, 405000.0, 784000.0, 303000.0, 477000.0, 583000.0, 908000.0, 505000.0, 282000.0, 756000.0, 618000.0, 251000.0, 910000.0, 983000.0, 810000.0, 902000.0, 310000.0, 730000.0, 899000.0, 684000.0, 472000.0, 101000.0, 434000.0, 611000.0, 913000.0, 967000.0, 477000.0, 865000.0, 260000.0, 805000.0, 549000.0, 14000.0, 720000.0, 399000.0, 825000.0, 668000.0, 1000.0, 494000.0, 868000.0, 244000.0, 325000.0, 870000.0, 191000.0, 568000.0, 239000.0, 968000.0, 803000.0, 448000.0, 80000.0, 320000.0, 508000.0, 933000.0, 109000.0, 551000.0, 707000.0, 547000.0, 814000.0, 540000.0, 964000.0, 603000.0, 588000.0, 445000.0, 596000.0, 385000.0, 576000.0, 290000.0, 189000.0, 187000.0, 613000.0, 657000.0, 477000.0, 90000.0, 758000.0, 877000.0, 923000.0, 842000.0, 898000.0, 923000.0, 541000.0, 391000.0, 705000.0, 276000.0, 812000.0, 849000.0, 895000.0, 590000.0, 950000.0, 580000.0, 451000.0, 660000.0, 996000.0, 917000.0, 793000.0, 82000.0, 613000.0, 486000.0]


## Exercise 5: Calculating statistics and verifying

Use the data generated above to calculate the following statistics in code without importing the statistics module. Then use the statistics module (https://docs.python.org/3/library/statistics.html) to verify your results.

### mean

In [5]:
# Calculate the mean without the statistics module
manual_mean = sum(salaries) / len(salaries)
print(f"Manual mean: {manual_mean}")

# Verify with the statistics module
import statistics
stats_mean = statistics.mean(salaries)
print(f"Statistics module mean: {stats_mean}")

# Check if the results match
print(f"Results match: {manual_mean == stats_mean}")

Manual mean: 585690.0
Statistics module mean: 585690.0
Results match: True


### median

In [6]:
# Calculate the median without the statistics module
# Note: sort() reorders salaries in place, so every later cell sees the sorted list
salaries.sort()
n = len(salaries)
if n % 2 == 0:
    # If the number of data points is even, the median is the average of the two middle values
    median_manual = (salaries[n//2 - 1] + salaries[n//2]) / 2
else:
    # If the number of data points is odd, the median is the middle value
    median_manual = salaries[n//2]
print(f"Manual median: {median_manual}")

# Verify with the statistics module
import statistics
median_stats = statistics.median(salaries)
print(f"Statistics module median: {median_stats}")

# Check if the results match
print(f"Results match: {median_manual == median_stats}")

Manual median: 589000.0
Statistics module median: 589000.0
Results match: True


### mode

In [7]:
# Calculate the mode without the statistics module
from collections import Counter
count = Counter(salaries)
# Find the most common element(s)
max_freq = max(count.values())
manual_mode = [k for k, v in count.items() if v == max_freq]
print(f"Manual mode: {manual_mode}")

# Verify with the statistics module
import statistics
try:
    stats_mode = statistics.mode(salaries)
    print(f"Statistics module mode: {stats_mode}")
    # Check if the results match (only if there is a single mode)
    if len(manual_mode) == 1:
        print(f"Results match: {manual_mode[0] == stats_mode}")
    else:
        print("Cannot directly compare manual mode (multiple modes) with statistics.mode (single mode)")
except statistics.StatisticsError as e:
    # Python 3.8+ raises this only for empty data; older versions also raised it for multiple modes
    print(f"Statistics module error: {e}")
    print("Cannot calculate mode with statistics.mode for this data.")

Manual mode: [477000.0]
Statistics module mode: 477000.0
Results match: True
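
When the data may have more than one mode, `statistics.multimode` (available in Python 3.8 and later) returns all of them as a list, which is directly comparable to the manual list above. A minimal sketch with made-up data:

```python
import statistics

# Made-up data with two equally frequent values (2 and 3)
data = [1, 2, 2, 3, 3, 4]

single = statistics.mode(data)          # first mode encountered (Python 3.8+): 2
all_modes = statistics.multimode(data)  # every mode, in order of first appearance: [2, 3]

print(f"mode: {single}")
print(f"multimode: {all_modes}")
```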


### sample variance
Remember to use Bessel's correction.

In [25]:
# Calculate the sample variance without the statistics module (using Bessel's correction)
n = len(salaries)
manual_mean = sum(salaries) / n
manual_variance = sum([(x - manual_mean) ** 2 for x in salaries]) / (n - 1)
print(f"Manual sample variance: {manual_variance}")

# Verify with the statistics module
import statistics
stats_variance = statistics.variance(salaries)
print(f"Statistics module sample variance: {stats_variance}")

# Check if the results match
print(f"Results match: {manual_variance == stats_variance}")

Manual sample variance: 70664054444.44444
Statistics module sample variance: 70664054444.44444
Results match: True
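
To make the effect of Bessel's correction concrete, the sketch below compares `statistics.variance` (divides by n - 1) with `statistics.pvariance` (divides by n) on the same salaries. The data is regenerated so the snippet runs on its own.

```python
import math
import random
import statistics

# Regenerate the exercise data so this snippet is self-contained
random.seed(0)
salaries = [round(random.random() * 1000000, -3) for _ in range(100)]
n = len(salaries)

sample_var = statistics.variance(salaries)       # divides by n - 1
population_var = statistics.pvariance(salaries)  # divides by n

# The two differ by exactly the factor n / (n - 1), up to floating-point rounding
print(math.isclose(sample_var, population_var * n / (n - 1)))
```

With 100 data points the correction is about 1%, but it grows as the sample gets smaller.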


### sample standard deviation
Remember to use Bessel's correction.

In [9]:
# Calculate the sample standard deviation without the statistics module (using Bessel's correction)
n = len(salaries)
manual_mean = sum(salaries) / n
manual_std_dev = (sum([(x - manual_mean) ** 2 for x in salaries]) / (n - 1)) ** 0.5
print(f"Manual sample standard deviation: {manual_std_dev}")

# Verify with the statistics module
import statistics
stats_std_dev = statistics.stdev(salaries)
print(f"Statistics module sample standard deviation: {stats_std_dev}")

# Check if the results match
print(f"Results match: {manual_std_dev == stats_std_dev}")

Manual sample standard deviation: 265827.11382484
Statistics module sample standard deviation: 265827.11382484
Results match: True
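
The exact `==` checks in the cells above happened to return True, but two floating-point computations of the same quantity can differ in the last few bits. A more forgiving check, sketched here, uses `math.isclose`:

```python
import math

a = 0.1 + 0.2
b = 0.3

exact_match = (a == b)            # False: the two values differ by a rounding error
close_match = math.isclose(a, b)  # True: equal within the default relative tolerance

print(f"Exact match: {exact_match}")
print(f"Close match: {close_match}")
```

Replacing `manual_std_dev == stats_std_dev` with `math.isclose(manual_std_dev, stats_std_dev)` would make the verification step robust to such differences.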


## Exercise 6: Calculating more statistics
### range

In [24]:
# Calculate the range
manual_range = max(salaries) - min(salaries)
print(f"Manual range: {manual_range}")

Manual range: 995000.0


### coefficient of variation

In [23]:
# Calculate the coefficient of variation
# CV = (Standard Deviation / Mean) * 100
# We'll use the previously calculated manual standard deviation and manual mean
manual_cv = (manual_std_dev / manual_mean) * 100
print(f"Manual coefficient of variation: {manual_cv:.2f}%")

Manual coefficient of variation: 45.39%
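
Because the coefficient of variation divides the standard deviation by the mean, it has no units. The sketch below checks this by expressing the same salaries in dollars and in thousands of dollars; the helper name `coefficient_of_variation` is introduced only for this example.

```python
import math
import random
import statistics


def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100


# Regenerate the exercise data so this snippet is self-contained
random.seed(0)
salaries = [round(random.random() * 1000000, -3) for _ in range(100)]

cv_dollars = coefficient_of_variation(salaries)
cv_thousands = coefficient_of_variation([x / 1000 for x in salaries])

print(f"CV in dollars:   {cv_dollars:.2f}%")
print(f"CV in thousands: {cv_thousands:.2f}%")
print(f"Same value: {math.isclose(cv_dollars, cv_thousands)}")
```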


### interquartile range

In [14]:
# Calculate the interquartile range without the statistics module
# Quartile 1 (Q1) is the median of the lower half of the data
# Quartile 3 (Q3) is the median of the upper half of the data
# IQR = Q3 - Q1

salaries_sorted = sorted(salaries)
n = len(salaries_sorted)

# Find Q1
if n % 2 == 0:
    lower_half = salaries_sorted[:n//2]
else:
    lower_half = salaries_sorted[:n//2 + 1]  # For odd n, include the median in the lower half (as in Tukey's hinges)

n_lower = len(lower_half)
if n_lower % 2 == 0:
    q1_manual = (lower_half[n_lower//2 - 1] + lower_half[n_lower//2]) / 2
else:
    q1_manual = lower_half[n_lower//2]

# Find Q3
# The upper half starts at index n//2 for both even and odd n;
# for odd n this includes the median, matching the lower half
upper_half = salaries_sorted[n//2:]

n_upper = len(upper_half)
if n_upper % 2 == 0:
    q3_manual = (upper_half[n_upper//2 - 1] + upper_half[n_upper//2]) / 2
else:
    q3_manual = upper_half[n_upper//2]

manual_iqr = q3_manual - q1_manual
print(f"Manual interquartile range: {manual_iqr}")


Manual interquartile range: 417500.0
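
Quartiles have several competing definitions, so a manual IQR will not necessarily agree with a library result. The sketch below compares the median-of-halves approach used above with the two methods offered by `statistics.quantiles` (Python 3.8 and later) on a tiny made-up dataset. The helper `quartiles_median_of_halves` is a hypothetical name for this example.

```python
import statistics


def quartiles_median_of_halves(data):
    """Q1 and Q3 as the medians of the lower and upper halves.

    For odd n the overall median is included in both halves.
    """
    ordered = sorted(data)
    n = len(ordered)
    lower = ordered[: (n + 1) // 2]
    upper = ordered[n // 2:]
    return statistics.median(lower), statistics.median(upper)


data = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up data for illustration

q1_halves, q3_halves = quartiles_median_of_halves(data)
q1_excl, _, q3_excl = statistics.quantiles(data, n=4, method="exclusive")
q1_incl, _, q3_incl = statistics.quantiles(data, n=4, method="inclusive")

iqr_halves = q3_halves - q1_halves
iqr_exclusive = q3_excl - q1_excl
iqr_inclusive = q3_incl - q1_incl

print(f"Median of halves: IQR = {iqr_halves}")
print(f"Exclusive method: IQR = {iqr_exclusive}")
print(f"Inclusive method: IQR = {iqr_inclusive}")
```

All three are legitimate conventions; what matters is stating which one was used when reporting an IQR.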


### quartile coefficient of dispersion

In [22]:
# Calculate the quartile coefficient of dispersion
# QCD = (Q3 - Q1) / (Q3 + Q1)
# We'll use the previously calculated manual Q1 and Q3
manual_qcd = (q3_manual - q1_manual) / (q3_manual + q1_manual)
print(f"Manual quartile coefficient of dispersion: {manual_qcd:.2f}")

Manual quartile coefficient of dispersion: 0.34


## Exercise 7: Scaling data
### min-max scaling

In [15]:
min_salary, max_salary = min(salaries), max(salaries)
salary_range = max_salary - min_salary

min_max_scaled = [(x - min_salary) / salary_range for x in salaries]
min_max_scaled[:5]

[0.0,
 0.01306532663316583,
 0.07939698492462312,
 0.0814070351758794,
 0.08944723618090453]
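
The same calculation can be wrapped in a small reusable function. This is a sketch rather than a reference implementation: the name `min_max_scale` and the decision to raise an error when all values are equal are assumptions made for the example.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Every value is identical, so the formula would divide by zero
        raise ValueError("min-max scaling is undefined when all values are equal")
    return [(x - lo) / span for x in values]


scaled = min_max_scale([10, 20, 25, 40])  # made-up input
print(scaled)
```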

### standardizing

In [21]:
from statistics import mean, stdev

mean_salary, std_salary = mean(salaries), stdev(salaries)

standardized = [(x - mean_salary) / std_salary for x in salaries]
standardized[:5]

[-2.199512275430514,
 -2.150608309943509,
 -1.9023266390094862,
 -1.8948029520114855,
 -1.8647082040194827]
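
A quick sanity check on standardized data is that its mean is approximately 0 and its sample standard deviation is approximately 1. The sketch below regenerates the salaries so it runs on its own; the tolerances are arbitrary choices.

```python
import math
import random
from statistics import mean, stdev

# Regenerate the exercise data so this snippet is self-contained
random.seed(0)
salaries = [round(random.random() * 1000000, -3) for _ in range(100)]

mean_salary, std_salary = mean(salaries), stdev(salaries)
standardized = [(x - mean_salary) / std_salary for x in salaries]

# After standardizing, the mean should be ~0 and the standard deviation ~1
z_mean = mean(standardized)
z_std = stdev(standardized)
print(f"Mean is ~0: {math.isclose(z_mean, 0.0, abs_tol=1e-9)}")
print(f"Std is ~1:  {math.isclose(z_std, 1.0, rel_tol=1e-9)}")
```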

## Exercise 8: Calculating covariance and correlation
### covariance

In [20]:
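import numpy as np
# np.cov returns the 2x2 covariance matrix and divides by n - 1 by default;
# its result is not displayed because it is not the last expression in the cell
np.cov(min_max_scaled, standardized)

from statistics import mean
# Compute each mean once rather than on every loop iteration
mean_x, mean_y = mean(min_max_scaled), mean(standardized)
running_total = []
for x, y in zip(min_max_scaled, standardized):
    running_total.append((x - mean_x) * (y - mean_y))

# Averaging the products divides by n, so this is the population covariance
cov = mean(running_total)
cov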

0.26449129918250414

### Pearson correlation coefficient ($\rho$)

In [19]:
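from statistics import stdev
# cov above divides by n while stdev divides by n - 1, so this ratio is
# (n - 1) / n = 0.99 rather than 1; use pstdev (or an n - 1 covariance) for a consistent value
cov / (stdev(min_max_scaled) * stdev(standardized))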

0.9900000000000001
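
The min-max scaled and standardized values are both linear transformations of the same salaries, so their correlation should be 1. The 0.99 above arises because the covariance was averaged over n while `stdev` divides by n - 1. A sketch of a consistent calculation, using n - 1 in both places, follows; the helper `sample_covariance` is a name introduced for this example.

```python
import math
import random
from statistics import mean, stdev

# Regenerate the exercise data and both scaled versions so this snippet is self-contained
random.seed(0)
salaries = [round(random.random() * 1000000, -3) for _ in range(100)]

lo, hi = min(salaries), max(salaries)
min_max_scaled = [(x - lo) / (hi - lo) for x in salaries]

mean_salary, std_salary = mean(salaries), stdev(salaries)
standardized = [(x - mean_salary) / std_salary for x in salaries]


def sample_covariance(xs, ys):
    """Covariance with Bessel's correction (divide by n - 1)."""
    mean_x, mean_y = mean(xs), mean(ys)
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


sample_cov = sample_covariance(min_max_scaled, standardized)
pearson_r = sample_cov / (stdev(min_max_scaled) * stdev(standardized))

print(f"Correlation is ~1: {math.isclose(pearson_r, 1.0, rel_tol=1e-9)}")
```

Using the population versions throughout (dividing by n and calling `pstdev`) would give the same correlation, since the correction factors cancel.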