In [1]:
import pandas as pd
import tools.thinkstats as ts2
import tools.conversions as convs
from tools.brfss import ReadBrfss
from tools.distributions import Hist, BiasedHist

**Disclaimer:** The tools package and its submodules collect shared helper code so that it is not repeated across notebooks (DRY principle).
Part of the code in the tools.thinkstats module is copied unchanged from the ThinkStats2 code; the rest is modified.

**BRFSS:** This module was copied in full, without any customizations.

**Data:** The data file evaluated in this exercise is also copied from the ThinkStats2 repository.

**Hist:** Hist is a class that takes a list or series, computes a histogram, and provides several methods based on that histogram (e.g., mean, median, outliers).

**BiasedHist:** BiasedHist is a class derived from Hist that biases the histogram in the same way the class size paradox does.

**Convs:** This module contains functions that convert between the imperial and metric systems.
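
The class size paradox mentioned for BiasedHist can be illustrated with a small sketch. This is not the code of tools.distributions; the function names `bias_pmf` and `pmf_mean` and the example values are assumptions made for illustration. Each value is reweighted in proportion to its size, which is how a randomly chosen class member would see the distribution.

```python
def bias_pmf(pmf):
    """Return a size-biased copy of a PMF given as {value: probability}.

    Hypothetical helper: each probability is multiplied by its value and
    the result is renormalized so the probabilities sum to 1 again.
    """
    weighted = {x: p * x for x, p in pmf.items()}
    total = sum(weighted.values())
    return {x: w / total for x, w in weighted.items()}


def pmf_mean(pmf):
    """Mean of a PMF given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


# Made-up example: half of the classes have 10 students, half have 30.
actual = {10: 0.5, 30: 0.5}
biased = bias_pmf(actual)

print(pmf_mean(actual))  # mean class size over classes
print(pmf_mean(biased))  # mean class size as experienced by students
```

In this made-up example the larger classes carry more weight after biasing, so the biased mean is higher than the actual mean.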

# Chapter 5, Ex. 1

In the BRFSS (see Section on page 56), the distribution of heights is roughly normal with parameters µ = 178 cm and σ = 7.7 cm for men, and µ = 163 cm and σ = 7.3 cm for women.

In order to join Blue Man Group, you have to be male between 5’10” and 6’1” (see http://bluemancasting.com). What percentage of the U.S. male population is in this range? Hint: use scipy.stats.norm.cdf.

In [20]:
import scipy.stats

In [21]:
# Convert the height limits from feet and inches to centimeters
lower_hight_limit = round(convs.us_hight2m(feet=5, inches=10) * 100, 1)
upper_hight_limit = round(convs.us_hight2m(feet=6, inches=1) * 100, 1)
lower_hight_limit, upper_hight_limit

(177.8, 185.4)
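
The implementation of `convs.us_hight2m` is not shown in this notebook. As a rough cross-check, the same conversion can be sketched from the definitions 1 in = 2.54 cm and 1 ft = 12 in. The function name `feet_inches_to_cm` is hypothetical and not part of the tools package.

```python
CM_PER_INCH = 2.54     # exact by definition
INCHES_PER_FOOT = 12


def feet_inches_to_cm(feet, inches=0):
    """Convert a height in feet and inches to centimeters (hypothetical helper)."""
    total_inches = feet * INCHES_PER_FOOT + inches
    return total_inches * CM_PER_INCH


lower_cm = round(feet_inches_to_cm(5, 10), 1)
upper_cm = round(feet_inches_to_cm(6, 1), 1)
print(lower_cm, upper_cm)
```

5'10" is 70 inches and 6'1" is 73 inches, which gives the same limits of 177.8 cm and 185.4 cm.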

In [22]:
mu = 178
sigma = 7.7
dist = scipy.stats.norm(loc=mu, scale=sigma)

In [23]:
low = dist.cdf(lower_hight_limit)
high = dist.cdf(upper_hight_limit)
diff = high - low
low, high, diff

(0.48963902786483265, 0.8317337108107857, 0.3420946829459531)
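
The result can be cross-checked without scipy. For a normal distribution, the CDF at x equals Φ((x - µ)/σ), and Φ can be written with the error function as Φ(z) = (1 + erf(z/√2))/2. The sketch below is an independent check using only the standard library; the helper name `normal_cdf` and the variable names are assumptions and not part of the notebook's tools.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, expressed through the error function."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mu_men, sigma_men = 178, 7.7        # parameters for men, from the exercise text
lower_cm, upper_cm = 177.8, 185.4   # height limits in cm, from the conversion step

p_low = normal_cdf(lower_cm, mu_men, sigma_men)
p_high = normal_cdf(upper_cm, mu_men, sigma_men)
fraction = p_high - p_low
print(f"{fraction:.1%}")
```

The lower limit lies almost exactly at the mean (z ≈ -0.03), so its CDF value is close to 0.5, while the upper limit is a little less than one standard deviation above the mean.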

In [24]:
diff

0.3420946829459531

In [25]:
print(f"About {diff:.1%} of the US male population is within the range of {lower_hight_limit} cm and {upper_hight_limit} cm")


About 34.2% of the US male population is within the range of 177.8 cm and 185.4 cm


## Compare results to the BRFSS data

In [18]:
df = ReadBrfss()
males = df[df['sex']==1]
male_hight_hist = Hist(males.htm3)

In [19]:
# Round the limits to whole centimeters.
# This is necessary because of a current limitation in the way the CDF class looks up values.
lower_hight_limit, upper_hight_limit = round(lower_hight_limit, 0), round(upper_hight_limit, 0)

In [11]:
low_2 = male_hight_hist.cdf[lower_hight_limit]
high_2 = male_hight_hist.cdf[upper_hight_limit]
diff_2 = high_2 - low_2

low_2, high_2, diff_2

(0.5432137144041397, 0.8786842565427733, 0.3354705421386336)
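
How `Hist.cdf` looks up values is not shown in this notebook. As a minimal sketch of the underlying idea, an empirical CDF can be evaluated at any value by counting the sorted observations that are less than or equal to it. The sample below is synthetic (drawn from the normal model, not read from the BRFSS file), and the names `EmpiricalCdf` and `sample` are assumptions for illustration.

```python
import bisect
import random


class EmpiricalCdf:
    """Empirical CDF over a sample (hypothetical stand-in, not tools.distributions.Hist)."""

    def __init__(self, values):
        self.sorted_values = sorted(values)

    def __call__(self, x):
        # Fraction of observations that are <= x.
        count = bisect.bisect_right(self.sorted_values, x)
        return count / len(self.sorted_values)


random.seed(0)
# Synthetic heights in cm, drawn from the normal model for men.
sample = [random.gauss(178, 7.7) for _ in range(100_000)]

cdf = EmpiricalCdf(sample)
fraction_in_range = cdf(185.4) - cdf(177.8)
print(round(fraction_in_range, 3))
```

Because the lookup uses `bisect_right`, a limit such as 177.8 does not have to match a value that is present in the data.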

In [12]:
print(f"About {diff_2:.1%} of the males within this data set are within the range of {lower_hight_limit} cm and {upper_hight_limit} cm")


About 33.5% of the males within this data set are within the range of 178.0 cm and 185.0 cm


In [13]:
# Calculate the difference between the two approaches
round(abs(low - low_2), 3), round(abs(high - high_2), 3), round(abs(diff - diff_2), 3)

(0.054, 0.047, 0.007)

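The difference between the two calculations is small for the share of men in the range: about 0.7 percentage points (34.2% from the normal model versus 33.5% from the data). The individual CDF values differ more, by about 5 percentage points at each limit. Part of this gap may come from rounding the limits to 178 cm and 185 cm for the data lookup.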