# 1. Exploring The Data
Instructions
Find the county with the lowest median income in the US (median_income). Assign the name of the county (county) to lowest_income_county.
Find the county with more than 500000 residents that has the lowest median income. Assign the name of the county to lowest_income_high_pop_county.
Hint
The .idxmin() method finds the index of the minimum value in a column. income["median_income"].idxmin() returns the index of the row with the lowest median income.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

income = pd.read_csv("ACS_15_1YR_S2601C_metadata.csv")

print(income.head(3))

lowest_income_county = income["county"][income["median_income"].idxmin()]
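# Keep only the rows for counties above the population threshold.
# Indexing with the boolean mask returns a DataFrame, not a Series of True/False.
high_pop = income[income["pop_over_25"] > 500000]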
lowest_income_high_pop_county = high_pop["county"][high_pop["median_income"].idxmin()]
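
The filter-then-idxmin() pattern used above can be tried on a tiny table first. This is only a sketch: the county names and numbers below are made up for illustration and are not taken from the income data, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the real income file has many more rows and columns.
toy = pd.DataFrame({
    "county": ["A", "B", "C", "D"],
    "pop_over_25": [900000, 20000, 600000, 700000],
    "median_income": [41000, 18000, 35000, 52000],
})

# idxmin() returns the index label of the row holding the smallest value.
lowest_label = toy["median_income"].idxmin()
lowest_county = toy.loc[lowest_label, "county"]

# A boolean mask keeps only the rows above the population threshold.
# The original index labels are preserved, so .loc still finds the row.
big = toy[toy["pop_over_25"] > 500000]
lowest_big_county = big.loc[big["median_income"].idxmin(), "county"]

print(lowest_county, lowest_big_county)
```

Because the filtered frame keeps its original index labels, the label returned by idxmin() can be used directly to look up the county name.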

# 2. Random Numbers
Instructions
Set a random seed of 20 and generate a list of 10 random integers between 0 and 10, inclusive.
Assign the list to new_sequence.
Hint
Remember to use random.seed() to set a random seed.

In [4]:
import random

# Returns a random integer between the numbers 0 and 10, inclusive.
num = random.randint(0, 10)

# Generate a sequence of 10 random numbers between the values of 0 and 10.
random_sequence = [random.randint(0, 10) for _ in range(10)]

# Sometimes, when we generate a random sequence, we want it to be the same sequence whenever the program is run.
# An example is when you use random numbers to select a subset of the data, and you want other people
# looking at the same data to get the same subset.
# We can ensure this by setting a random seed.
# A random seed is an integer that is used to "seed" a random number generator.
# After a random seed is set, the numbers generated after will follow the same sequence.
random.seed(10)
print([random.randint(0,10) for _ in range(5)])
random.seed(10)
# Same sequence as above.
print([random.randint(0,10) for _ in range(5)])
random.seed(11)
# Different seed means different sequence.
print([random.randint(0,10) for _ in range(5)])

random.seed(20)
new_sequence = [random.randint(0,10) for _ in range(10)]
print(new_sequence)

[9, 0, 6, 7, 9]
[9, 0, 6, 7, 9]
[7, 8, 7, 7, 8]
[10, 2, 4, 10, 10, 1, 5, 9, 2, 0]
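
The reproducibility shown above can also be checked in code rather than by eye. The sketch below uses a separate random.Random instance so the global generator is left alone; seeded_sequence is a hypothetical helper name, not part of the exercise.

```python
import random

def seeded_sequence(seed, length=5):
    """Return `length` integers in [0, 10] from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator; global state is untouched
    return [rng.randint(0, 10) for _ in range(length)]

first = seeded_sequence(10)
second = seeded_sequence(10)
other = seeded_sequence(11)

# The same seed reproduces the same sequence.
print(first == second)
print(first, other)
```

Using a dedicated generator is one possible design choice when several parts of a program need their own reproducible streams.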


# 3. Selecting Items From A List (Demo)

In [5]:
shopping = ["100","300","200","300","500"]

random.seed(1)
shopping_sample = random.sample(shopping, 4)

print(shopping_sample)

['300', '100', '500', '300']
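
A short sketch of how random.sample behaves on this list: it draws positions without replacement, so a value can repeat in the result only as often as it occurs in the list, and asking for more items than the list holds fails.

```python
import random

shopping = ["100", "300", "200", "300", "500"]

random.seed(1)
picked = random.sample(shopping, 4)

# "300" occurs twice in the list, so it can appear at most twice in the
# sample; every other value can appear at most once.
print(picked)

# Requesting more items than the list contains raises ValueError.
oversample_failed = False
try:
    random.sample(shopping, len(shopping) + 1)
except ValueError as exc:
    oversample_failed = True
    print("cannot oversample:", exc)
```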


# 4. Population Vs Sample
Instructions
Set the random seed to 1, then generate a medium sample of 100 die rolls. Plot the result using a histogram with 6 bins.
Set the random seed to 1, then generate a large sample of 10000 die rolls. Plot the result using a histogram with 6 bins.
Hint
Use [roll() for _ in range(x)] to generate the rolls, with x being the number of rolls. Use plt.hist(sample, 6) to generate the plot. Make sure to set the seed before generating each sequence of rolls.

In [None]:
import matplotlib.pyplot as plt
import random

def roll():
    return random.randint(1, 6)

random.seed(1)
small_sample = [roll() for _ in range(10)]

plt.hist(small_sample, 6)
plt.show()

random.seed(1)
medium_sample = [roll() for _ in range(100)]

plt.hist(medium_sample, 6)
plt.show()

random.seed(1)
large_sample = [roll() for _ in range(10000)]

plt.hist(large_sample, 6)
plt.show()
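
The histograms can be backed up with numbers. The sketch below counts how often each face comes up and reports the largest gap from the fair-die share of 1/6 for each sample size; face_proportions is a hypothetical helper, and the sample sizes mirror the ones in the exercise.

```python
import random
from collections import Counter

def roll():
    return random.randint(1, 6)

def face_proportions(num_rolls, seed=1):
    """Roll the die num_rolls times and return each face's share of the rolls."""
    random.seed(seed)
    counts = Counter(roll() for _ in range(num_rolls))
    return {face: counts[face] / num_rolls for face in range(1, 7)}

for size in (10, 100, 10000):
    proportions = face_proportions(size)
    # Largest gap between an observed share and the fair-die value 1/6.
    worst_gap = max(abs(p - 1 / 6) for p in proportions.values())
    print(size, round(worst_gap, 3))
```

The gap should generally shrink as the number of rolls grows, which is the numeric counterpart of the histogram flattening out.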

# 5. Finding The Right Sample Size
Instructions
Set the random seed to 1, then generate probabilities for 300 trials of 100 die rolls each. Make a histogram with 20 bins.
Set the random seed to 1, then generate probabilities for 300 trials of 1000 die rolls each. Make a histogram with 20 bins.
Hint
Use probability_of_one(x, y) to generate the rolls, with x being the number of trials, and y being the number of rolls per trial. Use plt.hist(sample, 20) to generate the plot. Make sure to set the seed before generating each set of trials.

In [None]:
def probability_of_one(num_trials, num_rolls):
    probabilities = []
    for i in range(num_trials):
        die_rolls = [roll() for _ in range(num_rolls)]
        one_prob = len([d for d in die_rolls if d == 1]) / num_rolls
        probabilities.append(one_prob)
    return probabilities

random.seed(1)
small_sample = probability_of_one(300, 50)
plt.hist(small_sample, 20)
plt.show()

random.seed(1)
medium_sample = probability_of_one(300, 100)
plt.hist(medium_sample, 20)
plt.show()
    
random.seed(1)
large_sample = probability_of_one(300, 1000)
plt.hist(large_sample, 20)
plt.show()
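
To see why the histograms narrow as the number of rolls per trial grows, the spread of each set of probabilities can be compared with the textbook standard error of a proportion, sqrt(p(1-p)/n) with p = 1/6. This is a sketch under the assumption of a fair die; it redefines roll and probability_of_one so it runs on its own.

```python
import math
import random
import statistics

def roll():
    return random.randint(1, 6)

def probability_of_one(num_trials, num_rolls):
    probabilities = []
    for _ in range(num_trials):
        ones = sum(1 for _ in range(num_rolls) if roll() == 1)
        probabilities.append(ones / num_rolls)
    return probabilities

spreads = {}
for num_rolls in (50, 100, 1000):
    random.seed(1)
    trials = probability_of_one(300, num_rolls)
    spreads[num_rolls] = statistics.pstdev(trials)
    # Textbook standard error of a proportion with p = 1/6.
    expected = math.sqrt((1 / 6) * (5 / 6) / num_rolls)
    print(num_rolls, round(spreads[num_rolls], 4), round(expected, 4))
```

The observed spread should track the expected value roughly, shrinking in proportion to one over the square root of the number of rolls.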

# 6. What Are The Odds?
Instructions
Find how many standard deviations .18 is from the mean of large_sample. Assign the result to deviations_from_mean.
Find how many probabilities in large_sample are greater than or equal to .18. Assign the result to over_18_count.
Hint
You can calculate how many standard deviations a value is from the mean by doing abs(value - mean) / standard_deviation.

In [None]:
import numpy

large_sample_std = numpy.std(large_sample)
avg = numpy.mean(large_sample)
# Use the absolute difference, as in the hint, so the distance is never negative.
deviations_from_mean = abs(.18 - avg) / large_sample_std

over_18_count = len([p for p in large_sample if p >= .18])
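
As a cross-check on the simulated result, the same distance can be worked out from theory. The sketch below assumes a fair die and 1000 rolls per trial (as in large_sample) and uses a normal approximation for the tail; none of these names come from the exercise.

```python
import math

p = 1 / 6            # chance of rolling a one with a fair die
num_rolls = 1000     # rolls per trial, as in large_sample
value = 0.18

# Standard error of the observed proportion of ones in one trial.
standard_error = math.sqrt(p * (1 - p) / num_rolls)

# Distance of .18 from the expected proportion, in standard errors.
z = abs(value - p) / standard_error

# Upper-tail probability of a standard normal at z (normal approximation).
tail = 0.5 * math.erfc(z / math.sqrt(2))

print(round(z, 2), round(tail, 3))
```

Multiplying the tail probability by the 300 trials gives a rough expectation for over_18_count; the simulated count will differ somewhat because of randomness and the discreteness of the proportions.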

# 7. Sampling Counties
Instructions
Use the select_random_sample function to pick 1000 random samples of 100 counties each from the income data. Find the mean of the median_income column for each sample.
Plot a histogram with 20 bins of all the mean median incomes.
Hint
You can use a list comprehension and the select_random_sample function to create a list of all the means.

In [None]:
# This is the mean of the median incomes across all US counties.
mean_median_income = income["median_income"].mean()
print(mean_median_income)

def get_sample_mean(start, end):
    return income["median_income"][start:end].mean()

def find_mean_incomes(row_step):
    mean_median_sample_incomes = []
    # Iterate over the indices of the income rows
    # Starting at 0, and counting in blocks of row_step (0, row_step, row_step * 2, etc).
    for i in range(0, income.shape[0], row_step):
        # Find the mean median for the row_step counties from i to i+row_step.
        mean_median_sample_incomes.append(get_sample_mean(i, i+row_step))
    return mean_median_sample_incomes

nonrandom_sample = find_mean_incomes(100)
plt.hist(nonrandom_sample, 20)
plt.show()

# What you're seeing above is the result of biased sampling.
# Instead of selecting randomly, we selected counties that were next to each other in the data.
# This picked counties in the same state more often than not, and created means that didn't represent the whole country.
# This is the danger of not using random sampling -- you end up with samples that don't reflect the entire population.
# This gives you a distribution that isn't normal.

import random
def select_random_sample(count):
    random_indices = random.sample(range(0, income.shape[0]), count)
    return income.iloc[random_indices]

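random.seed(1)
# Mean of median_income for each of 1000 random samples of 100 counties.
mean_incomes = [select_random_sample(100)["median_income"].mean() for _ in range(1000)]
plt.hist(mean_incomes, 20)
plt.show()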
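
The difference between block sampling and random sampling can be illustrated without the income file. The sketch below builds synthetic data in which neighboring rows share a regional income level, loosely imitating counties grouped by state; every number and name here is made up for illustration.

```python
import random
import statistics

random.seed(1)

# Synthetic stand-in for the income table: 30 "states" of 100 rows each,
# where every row in a state is scattered around that state's base income.
values = []
for _ in range(30):
    base = random.uniform(20000, 60000)
    values.extend(base + random.gauss(0, 3000) for _ in range(100))

# Biased: means of consecutive blocks of 100 rows (one state per block).
block_means = [statistics.fmean(values[i:i + 100])
               for i in range(0, len(values), 100)]

# Random: means of 1000 samples of 100 rows drawn from the whole table.
random_means = [statistics.fmean(random.sample(values, 100))
                for _ in range(1000)]

block_spread = statistics.pstdev(block_means)
random_spread = statistics.pstdev(random_means)
print(round(block_spread), round(random_spread))
```

The block means inherit the full state-to-state variation, while the random-sample means cluster much more tightly around the overall mean.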

# 8. An Experiment
Instructions
Select 1000 random samples of 100 counties each from the income data using the select_random_sample function.
For each sample:
Divide the median_income_hs column by median_income_college to get ratios.
Then, find the mean of all the ratios in the sample.
Add it to the list, mean_ratios.
Plot a histogram containing 20 bins of the mean_ratios list.
Hint
To select a sample of 100 counties, pass in 100 to select_random_sample. Wrap this code within a for loop that repeats the action 1000 times. For each sample, divide the median_income_hs column of the sample by the median_income_college column of the sample. Then take the mean of these ratios and store it.
plt.hist(mean_ratios,20) plots a histogram with 20 bins.

In [None]:
import random

def select_random_sample(count):
    random_indices = random.sample(range(0, income.shape[0]), count)
    return income.iloc[random_indices]

random.seed(1)
mean_ratios = []
for i in range(1000):
    sample = select_random_sample(100)
    ratios = sample["median_income_hs"] / sample["median_income_college"]
    mean_ratios.append(ratios.mean())

plt.hist(mean_ratios, 20)
plt.show()

# 9. Statistical Significance
Instructions
Determine how many values in mean_ratios are greater than or equal to .675.
Divide by the total number of items in mean_ratios to get the significance level.
Assign the result to significance_value.
Hint
Use len([m for m in mean_ratios if m >= .675]) to find the number of ratios in mean_ratios greater than or equal to .675. Then divide by len(mean_ratios).

In [None]:
significance_value = None

# Number of sample means at or above the .675 threshold (a count, not a sum).
count_at_or_above = len([m for m in mean_ratios if m >= .675])
significance_value = count_at_or_above / len(mean_ratios)
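
The same calculation can be wrapped in a small reusable function. This is a sketch only: empirical_significance is a hypothetical helper, and the demo ratios are invented values, not results from the income data.

```python
def empirical_significance(values, threshold):
    """Share of values that are greater than or equal to threshold."""
    if not values:
        raise ValueError("values must not be empty")
    at_or_above = sum(1 for v in values if v >= threshold)
    return at_or_above / len(values)

# Hypothetical mean ratios, for illustration only.
demo_ratios = [0.66, 0.67, 0.675, 0.68, 0.65, 0.69, 0.64, 0.66, 0.67, 0.70]
demo_significance = empirical_significance(demo_ratios, 0.675)

# A common convention treats a result as significant below 0.05.
print(demo_significance, demo_significance < 0.05)
```

With the real mean_ratios list, the call would be empirical_significance(mean_ratios, .675).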

# 10. Final Result (Demo)

In [None]:
# This histogram is narrower ("steeper") than the one from before because
# each sample contains 500 counties instead of 100.

random.seed(1)
mean_ratios = []
for i in range(1000):
    sample = select_random_sample(500)
    ratios = sample["median_income_hs"] / sample["median_income_college"]
    mean_ratios.append(ratios.mean())

plt.hist(mean_ratios, 20)
plt.show()