## Making Observations

In this notebook we will begin with some basic observations, or data, that we can collect to answer our questions about the M&Ms. As it turns out, our observations are two-dimensional. What does that mean?

## Discussion
In the lecture portion of this lesson, we need to address the questions:

**What is a qualitative observation?**<br>
**What is a quantitative observation?**

Depending on your answers, would you say that:
**Science should be more qualitative?**<br>
**Science should be more quantitative?**

In some sense, these questions are preparing us for a journey that we started back in September. If you go back to your notes, you should find one of our first reading assignments [What is This Thing Called Science](https://github.com/JasonJWilliamsNY/science_institute_2015/blob/master/pdfs/what_is_science.pdf). It's strongly suggested that you go back to read this material, having made it this far in the course. After having taken a lot of time to understand some important scientific techniques, we are now going to start looking into some of the more quantitative aspects of the scientific endeavor. 

## What is truth?

Without going too deep into the philosophy ([Epistemology](https://en.wikipedia.org/wiki/Epistemology) is the branch of philosophy concerned with knowledge, and strictly speaking [Logic](https://en.wikipedia.org/wiki/Logic) is maybe more directly concerned with truth), let's consider another question:

**Is Central a normal New York City High School?**
<br>
This question is ambiguous: normal how? But it is a starting point for asking whether there is a quantitative (scientific?) way to measure this.

## Statistics

[Statistics](https://en.wikipedia.org/wiki/Statistics) is (among other things) a way of using mathematical tools to help us answer quantitative questions - or at least describe quantitatively what we are observing. For example:

### Principle 1 - Taking Samples

What questions could you ask about the student population of Central to determine whether Central is the same as or different from other NYC high schools?

What about student height vs. grade? Let's make a hypothetical example:

|Grade Level|Mean Height of all NYC Students|Mean Height of all Central Students|
|-----------|:------------------------------:|:-----------------------------------:|
|9|59.1|61.3|
|10|64.3|62.3|
|11|65.1|66.0|
|12|66.2|67.5|

In this hypothetical example, if we look at the height of the students, would you say NYC students in general are the same as Central students? 

What we have done is take a sample, and an unusual one at that; can you name at least 3 reasons why?

What observations (samples) could we make that would emphasize or de-emphasize the similarities between NYC and Central Students?

### Principle 2 - Describing: the same or different?

As you are probably aware, there are some specific technical terms that we make use of in statistics. Some basic ones include:


- **Mean (Average)**: the sum of the observations divided by the total number of observations
- **Median**: the value in the exact middle of a set of observations once they are sorted (for an even number of observations, the mean of the two middle values)
- **Mode**: the value that occurs most frequently in a set of observations






#### Python Example
Let's use Python to calculate these statistical terms. Given the following observations, how would you calculate the mean?

|Day|Rainfall (cm)|
|---|-------------|
|1|3.2|
|2|2.8|
|3|9.6|
|4|0|
|5|1.2|
|6|14.2|
|7|2.5|



We have 7 observations, so how can we pass this information to Python? One way is to use a [list](https://docs.python.org/3.1/tutorial/datastructures.html). A list is just what it sounds like: an ordered collection of other Python data types (like integers, strings, or even other lists).

We create a list by giving Python the name of the list, and then putting the list items inside a set of square brackets [  ]. The items in the list must be separated by commas.

In [None]:
# Create a list called 'rain_observations' and enter the values
# from the table above
rain_observations = [3.2,2.8,9.6,0,1.2,14.2,2.5]
print(rain_observations)

In [None]:
# How would you take the sum of all of these observations and divide it by
# the total number of observations to determine the mean?

# We can get the total number of observations by asking python what is the 
# 'length' of the list

print(len(rain_observations))

In [None]:
# printing the answer is nice, but let's save it. 

number_of_observations = len(rain_observations)

In [None]:
# Now we need to take the sum of all of the observations.
# to get an element from a list, we can call it by its index
# remember, a list of n elements is indexed from 0 to n-1, so the last
# element is at index n-1
# so...

print('the first observation is',rain_observations[0])
print('the second observation is',rain_observations[1])


In [None]:

# how would you calculate the sum of all the observations?

sum_of_observations = rain_observations[0]+rain_observations[1]

# given the sum of the observations and the number of observations...

mean_observation = '?'
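
One possible way to finish the cell above is sketched here. It is not the only answer: it simply replaces typing every index by hand with a `for` loop, and it repeats the list so that it runs on its own.

```python
rain_observations = [3.2, 2.8, 9.6, 0, 1.2, 14.2, 2.5]
number_of_observations = len(rain_observations)

# Add up the observations one at a time instead of typing every index
sum_of_observations = 0
for observation in rain_observations:
    sum_of_observations = sum_of_observations + observation

# The mean is the sum divided by the number of observations
mean_observation = sum_of_observations / number_of_observations
print('mean observation is', mean_observation)
```

Python's built-in `sum()` function would give the same total as the loop in a single line: `sum(rain_observations)`.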


In real life, however, most people will just do this:

In [None]:
import numpy
mean_observation = numpy.mean(rain_observations)
print('mean observation is',mean_observation)
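
The median and mode from the list of terms above can be computed for the same rainfall data. The following is a minimal sketch that assumes Python 3.8 or newer and uses the standard library's `statistics` module, which this notebook does not otherwise use.

```python
import statistics

rain_observations = [3.2, 2.8, 9.6, 0, 1.2, 14.2, 2.5]

# Median: the middle value once the observations are sorted
median_observation = statistics.median(rain_observations)
print('median observation is', median_observation)

# Mode: the most frequent value. Every rainfall value here occurs exactly
# once, so multimode() returns all of them and there is no single mode.
modes = statistics.multimode(rain_observations)
print('most frequent values are', modes)
```

The median (2.8) is noticeably lower than the mean (about 4.79) because the two very rainy days pull the mean upward. That gap is a first hint that a single number rarely describes a sample completely.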



These calculations are the beginning of asking a question: are two samples the same or different?

Take for example the following situation, where two-dimensional observations (x and y) are plotted on a graph:

|Statistical description|Sample X1|Sample X2|Sample X3|Sample X4|
|-----------------------|---------|---------|---------|---------|
|Mean (x-coordinate)|9|9|9|9|
|Mean (y-coordinate)|7.5|7.5|7.5|7.5|
|Variance of X|11|11|11|11|
|Variance of Y|4.122|4.122|4.122|4.122|
|Correlation of X and Y|0.816|0.816|0.816|0.816|
|Linear regression|y = 3.00 + 0.500x|y = 3.00 + 0.500x|y = 3.00 + 0.500x|y = 3.00 + 0.500x|

Same or different?

What is the result when plotted?
![](anscombe_quartet_3.png)
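
To see how two different samples can share the same summary statistics, here is a rough sketch. It assumes the four samples are Anscombe's quartet, as the image filename suggests. The numbers below are the first two datasets of the quartet as commonly published; they are not part of this notebook. The helper functions `mean`, `variance`, and `correlation` are written out only for illustration.

```python
# Assumed data: first two datasets of Anscombe's quartet (not from this notebook)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # x values shared by both samples
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def variance(values):
    """Sample variance: squared distances from the mean, divided by n - 1."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def correlation(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)
    return covariance / (variance(xs) ** 0.5 * variance(ys) ** 0.5)

for name, ys in [('Sample 1', y1), ('Sample 2', y2)]:
    print(name,
          '| mean x:', round(mean(x), 2),
          '| mean y:', round(mean(ys), 2),
          '| variance x:', round(variance(x), 2),
          '| correlation:', round(correlation(x, ys), 3))
```

Both lines of output should agree to the number of digits printed, even though the individual y values are clearly different. That is why plotting the data matters.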

We'll talk more about this later. 

## Python Challenge - Plotting Data


In our example, we are going to look at samples of M&M candies and ask if different tubes of M&Ms are the same or different from each other. How can we use statistical tools to give us answers and minimize the chance that we will be fooled by the results? We will get to some answers, but first, let's talk about the samples we have in class. To complete this exercise you will need data from all of your classmates. Assume that we examine tubes 1-8. How would you complete the following:

Given Tube 0:

|Sample name|Blue|Brown|Green|Orange|Red|Yellow|
|-----------|----|-----|-----|------|---|------|
|Tube 0|22|13|21|18|8|17|

In [None]:
# Assume that we are going to enter the colors in alphabetical order
# So 

Tube_0 = [22,13,21,18,8,17]



In [None]:
# How would you calculate the mean for the Blue M&Ms if we had 3 tubes to
# count? Hint: Tube_0[0] = 22, Tube_0[1] = 13
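
One way to approach the question above is sketched here. Only `Tube_0` comes from the table; `Tube_1` and `Tube_2` hold made-up counts that stand in for your classmates' data, so replace them with real observations.

```python
Tube_0 = [22, 13, 21, 18, 8, 17]   # from the table above
Tube_1 = [19, 15, 20, 16, 11, 18]  # hypothetical counts
Tube_2 = [24, 12, 17, 20, 9, 15]   # hypothetical counts

tubes = [Tube_0, Tube_1, Tube_2]

# Blue is first alphabetically, so it sits at index 0 in every tube
blue_counts = []
for tube in tubes:
    blue_counts.append(tube[0])

mean_blue = sum(blue_counts) / len(blue_counts)
print('blue counts are', blue_counts)
print('mean number of blue M&Ms is', mean_blue)
```

Changing the index from `0` to `1` would give the mean for Brown, and so on for the other colors.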

We can also make a chart of our observations. To do this we will import the matplotlib library:

In [None]:
import numpy as np
import matplotlib.pyplot as plot
%matplotlib inline

In [None]:
# Plot your observations 

observations = Tube_0
n = len(observations)
index = np.arange(n)
colors = ['blue',
          'brown',
          'green',
          'orange',
          'red',
          'yellow']
plot_1 = plot.bar(index,
              observations,
              color=colors,
              tick_label=colors,
              align='center')
plot.show()
 

In [None]:
# How would you make a plot for all of the tubes in the class?
# Hint - to give a title to a plot use the plot.title() function
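
A possible starting point for this last challenge is sketched below. It assumes the tubes are collected in a dictionary called `tubes` (a name chosen here for illustration). Only the `Tube 0` counts come from the table above; the other two rows are made-up placeholders for your classmates' data. The helper `plot_tube` is also hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plot

# Colors in alphabetical order, matching the order used for Tube_0
colors = ['blue', 'brown', 'green', 'orange', 'red', 'yellow']

tubes = {
    'Tube 0': [22, 13, 21, 18, 8, 17],    # from the table above
    'Tube 1': [19, 15, 20, 16, 11, 18],   # hypothetical counts
    'Tube 2': [24, 12, 17, 20, 9, 15],    # hypothetical counts
}

def plot_tube(name, observations):
    """Draw one bar chart for a single tube and return its figure."""
    figure = plot.figure()
    index = np.arange(len(observations))
    plot.bar(index,
             observations,
             color=colors,
             tick_label=colors,
             align='center')
    plot.title(name)               # the hint: give each plot a title
    plot.ylabel('Number of M&Ms')
    return figure

# One figure per tube, in the order the tubes were entered
figures = []
for name, observations in tubes.items():
    figures.append(plot_tube(name, observations))
```

In a notebook with `%matplotlib inline`, the figures appear when the cell finishes. In a plain Python script, calling `plot.show()` afterwards displays them.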