### Problem 0:

The Fictional Bureau of Public Health tests the water in various locations around MagicLand for a toxic substance called Mercuronium. 

Mercuronium starts to cause symptoms, such as lightheadedness, at around 495 parts per million (PPM). At a higher concentration (around 500 PPM), the symptoms progress to headaches. At 505 PPM, the symptoms become more severe and possibly permanent. The government therefore recommends that all water sources be kept under 490 PPM of Mercuronium, and it sends toxicologists out to various locations to test the water. The tests are known to be imperfect, so to reduce uncertainty, the toxicologists are sent out regularly to take multiple measurements.

Import the data from `toxicology_data.csv`, which contains the toxicologists' findings. Please include your pandas code in the cell below to show your work:

In [None]:
import pandas as pd

# YOUR CODE HERE


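As a minimal sketch of the loading step, the snippet below reads a small in-memory CSV with `pd.read_csv`. The column names (`location`, `reading`, `time_of_day`, `assessor`) and the values are assumptions made up for illustration; the real `toxicology_data.csv` may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical stand-in for toxicology_data.csv; real columns may differ.
csv_text = """location,reading,time_of_day,assessor
Fairy Glen,489.2,morning,A
Fairy Glen,491.7,evening,B
Troll Bridge,496.3,morning,A
"""

# pd.read_csv accepts a file path or any file-like object, so with the real
# file this would be pd.read_csv("toxicology_data.csv").
readings = pd.read_csv(io.StringIO(csv_text))

print(readings.head())   # first rows, to check the columns parsed as expected
print(readings.dtypes)   # confirm the reading column came in as a number
```

Checking `dtypes` right after loading is a cheap way to catch a numeric column that was parsed as text.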

### Problem 1:

Find the number of readings taken for each location in the data, as well as the mean and standard deviation of the readings for each location.

In [None]:
# YOUR CODE HERE


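One possible way to get per-group counts, means, and standard deviations in pandas is `groupby` followed by `agg`. The sketch below uses made-up data and assumed column names (`location`, `reading`), not the assignment's actual file.

```python
import pandas as pd

# Made-up readings; the column names are assumptions for illustration.
readings = pd.DataFrame({
    "location": ["Fairy Glen", "Fairy Glen", "Fairy Glen",
                 "Troll Bridge", "Troll Bridge"],
    "reading": [488.0, 490.0, 492.0, 495.0, 499.0],
})

# One row per location with the number of readings, their mean, and their
# sample standard deviation (pandas uses ddof=1 for std by default).
location_stats = (
    readings.groupby("location")["reading"]
    .agg(["count", "mean", "std"])
    .reset_index()
)

print(location_stats)
```

Because pandas computes the sample standard deviation by default, the result is compatible with formulas that use `n - 1` degrees of freedom.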

### Problem 2:

Remember: the fewer measurements we have for a place, the less certain we can be that the aggregate metrics for our measurements accurately represent the _real_ numbers. Calculate the top and bottom of the confidence interval for the average reading in each location, and add them to your location data in columns called `reading_bottom_conf_inv` and `reading_top_conf_inv`. You can use the confidence interval function defined below.

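For reference, reading the code of the function below, it appears to compute a two-sided t-based interval for each mean. The symbols here are introduced only for this note: $n$ is the sample size, $\bar{x}$ the mean, $s$ the standard deviation, $c$ the confidence level, and $t_{p,\,\nu}$ the $p$-th quantile of the Student t distribution with $\nu$ degrees of freedom.

$$
\text{low},\ \text{high} \;=\; \bar{x} \mp t^{*}\,\frac{s}{\sqrt{n}},
\qquad
t^{*} = -\,t_{(1-c)/2,\; n-1}
$$
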
In [None]:
import math
from scipy.stats import t
import numpy as np

def confidence_interval_for_collection(sample_size, standard_deviation, mean, confidence=0.95):
    """Return (low_end, high_end): lists of t-based confidence bounds, one pair per group."""
    # Two-sided interval: split the leftover probability evenly between the tails.
    outlier_tails = (1.0 - confidence) / 2.0
    degrees_freedom = [count - 1 for count in sample_size]
    # t.ppf of the lower tail is negative, so negate it to get the critical value.
    t_distribution_number = [-1 * t.ppf(outlier_tails, df) for df in degrees_freedom]

    # Standard error of each mean, then the margin of error around it.
    standard_error = [std / math.sqrt(count) for std, count in zip(standard_deviation, sample_size)]
    margin = [error * t_number for error, t_number in zip(standard_error, t_distribution_number)]

    low_end = [mean_num - margin_num for mean_num, margin_num in zip(mean, margin)]
    high_end = [mean_num + margin_num for mean_num, margin_num in zip(mean, margin)]

    return low_end, high_end

In [None]:
# YOUR CODE HERE



### Problem 3:

A table is not so intuitive for the FBPH to understand, so you should make a visualization that will help them absorb the information. Here is an example visualization that fulfills the requirements:

![](../images/example_chart.png)

**You can take some design liberties with this plot, but here is what needs to be there:**

- It needs to be a scatterplot, with the dot representing the mean water reading. It should not be a line plot, since a line plot communicates some kind of relationship between the points that isn't there and would mislead the FBPH.
- Each dot should have a set of error bars around it showing the upper and lower end of the confidence interval for its mean water reading.
- The chart should have the PPM numbers labeled so the FBPH can get a general idea of the reading numbers from the chart.
- The locations need to be labeled so you can tell which location corresponds to each water reading.
- The location labels need to be legible (not run into each other). You can do this by adjusting the proportions of the figure to spread out the names, by rotating the names (as I have done in the example), or by orienting your plot so that the locations are on the Y axis. You _should not_ do this by making the font on the labels smaller: that makes the chart inaccessible because people can't always read tiny fonts.
- Either the dots should not be black, or the background should not be white. Feel free to change one or both. In the example, I have changed both. There should still be enough contrast to see the dots on the background.
- The chart should have a title. 
- The chart should have a legend. _The legend should not cover up any of the dots or error bars_.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# YOUR CODE HERE

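The requirements above could be approached with `Axes.errorbar`, which draws markers and error bars in one call. This is only a sketch under assumptions: the location names, numbers, colors, and the `mean` column name are made up, and it is one of many layouts that would satisfy the list.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-location summary; names and numbers are invented.
location_stats = pd.DataFrame({
    "location": ["Fairy Glen", "Troll Bridge", "Wizard Well"],
    "mean": [489.0, 497.0, 502.0],
    "reading_bottom_conf_inv": [486.5, 493.0, 500.5],
    "reading_top_conf_inv": [491.5, 501.0, 503.5],
})

# errorbar expects distances from each point, not absolute interval ends.
lower = (location_stats["mean"] - location_stats["reading_bottom_conf_inv"]).to_numpy()
upper = (location_stats["reading_top_conf_inv"] - location_stats["mean"]).to_numpy()
positions = list(range(len(location_stats)))

fig, ax = plt.subplots(figsize=(8, 5))
ax.set_facecolor("#f0f0f5")  # off-white background
container = ax.errorbar(
    positions,
    location_stats["mean"],
    yerr=[lower, upper],
    fmt="o",              # markers only, no connecting line
    color="teal",
    ecolor="slategray",
    capsize=4,
    label="Mean reading with confidence interval",
)
ax.set_xticks(positions)
ax.set_xticklabels(location_stats["location"], rotation=45, ha="right")
ax.set_ylabel("Mercuronium (PPM)")
ax.set_title("Mean Mercuronium reading by location")
# Place the legend outside the axes so it cannot cover any points.
ax.legend(loc="upper left", bbox_to_anchor=(1.02, 1.0))
fig.tight_layout()
```

Passing `fmt="o"` is what keeps this a scatterplot: without a line style in the format string, no line is drawn between the points.
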
### Problem 4:

If we want to make absolutely certain that the people of MagicLand are safe, which of these numbers (the mean, the bottom of the confidence interval, or the top of the confidence interval) should we use to estimate the Mercuronium reading? How would you describe what that number means for the probability that the water is safe?

### YOUR ANSWER HERE

>

### Problem 5:

Do you think Mercuronium poisoning is a continuous or a categorical variable? Why?

### YOUR ANSWER HERE

>

### Problem 6:

The FBPH suspects that Mercuronium readings might be higher in the evenings than in the mornings. Test their hypothesis. You can use the T-test code provided below.

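For reference, reading the code of the function below, it appears to compute a pooled-variance two-sample t statistic. The symbols are introduced only for this note: $n_i$, $\bar{x}_i$, and $s_i$ are the sample size, mean, and standard deviation of group $i$, $c$ is the confidence level, and $t_{p,\,\nu}$ is the $p$-th quantile of the Student t distribution with $\nu$ degrees of freedom.

$$
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}}
$$

The function reports the null hypothesis as accepted when $|t| < t^{*}$, where $t^{*} = -\,t_{1-c,\; n_1 + n_2 - 2}$.
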
In [None]:
import math
from scipy.stats import t

def t_test_for(num_samples_1, standard_deviation_1, mean_1, num_samples_2, standard_deviation_2, mean_2, confidence=0.95):
    """Pooled-variance two-sample t-test. Returns (accept_null_hypothesis, t_value)."""
    alpha = 1 - confidence
    total_degrees_freedom = num_samples_1 + num_samples_2 - 2

    # Critical value that leaves probability alpha in a single tail.
    t_distribution_number = -1 * t.ppf(alpha, total_degrees_freedom)

    # Recover each group's sum of squared deviations from its standard deviation.
    degrees_freedom_1 = num_samples_1 - 1
    degrees_freedom_2 = num_samples_2 - 1
    sum_of_squares_1 = (standard_deviation_1 ** 2) * degrees_freedom_1
    sum_of_squares_2 = (standard_deviation_2 ** 2) * degrees_freedom_2

    # Pooled variance, then the standard error of the difference in means.
    combined_variance = (sum_of_squares_1 + sum_of_squares_2) / (degrees_freedom_1 + degrees_freedom_2)
    first_dividend_addend = combined_variance / float(num_samples_1)
    second_dividend_addend = combined_variance / float(num_samples_2)

    denominator = math.sqrt(first_dividend_addend + second_dividend_addend)
    numerator = mean_1 - mean_2  # positive when group 1 has the higher mean
    t_value = float(numerator) / float(denominator)

    # Only the magnitude of t_value is compared; check its sign for direction.
    accept_null_hypothesis = abs(t_value) < abs(t_distribution_number)  # results are not significant

    return accept_null_hypothesis, t_value

In [None]:
# YOUR CODE HERE



### Problem 7:

Would you say that readings are appreciably higher in the evenings than in the mornings? Why or why not?

### YOUR ANSWER HERE

>

### Problem 8:

Disappointed with this finding, the FBPH notes that we also have data on which assessor took each reading. They now want you to go back and see if you can find a connection between the assessors and the readings. Maybe someone is assessing too harshly?

How would you caution them about this decision?

### YOUR ANSWER HERE

>