# Monday - Environmental Justice!

In [None]:
# These are tools we will use later. Just run the cell.
import numpy as np
import matplotlib.pyplot as plots
from datascience import *
import statsmodels.formula.api as sm
# import correlation as c
%matplotlib inline 
%run functions.ipynb
plots.style.use("fivethirtyeight")

## 1. Correlation

Correlation is used to test relationships between quantitative variables or categorical variables. In other words, it’s a measure of how things are related. The study of how variables are correlated is called correlation analysis.

Some examples of data that have a high correlation:

- Your caloric intake vs. your weight.
- Your eye color vs. your relatives’ eye colors.
- The amount of time you study vs. your GPA.
- Alcohol consumed vs. your blood alcohol content.

Some examples of data that have a low correlation (or none at all):

- Your sexual preference vs. the type of cereal you eat.
- A dog’s name vs. the type of dog biscuit they prefer.
- The cost of a car wash vs. how long it takes to buy a soda inside the station.

Correlations are useful because once you know how two variables are related, you can make predictions about future behavior. Being able to anticipate outcomes is very important in areas like government and healthcare.

You make decisions based on the relationship between two events all the time: if it's 2pm on a Thursday of Deadweek, you predict that the number of seats available on Moffitt Floor 5 will be close to none, and you think twice about trying your luck there. As simple as this is, it is correlation and prediction at work: time of semester vs. the number of seats available on Moffitt Floor 5. This is exactly what we are doing in this lecture -- **the correlation coefficient simply assigns a number to the *type* and *strength* of a relationship between two events**.

The **correlation coefficient** ( r ) puts a value to the relationship and shows how strong it is. The value is between -1 and 1 where 0 is no relationship, -1 is a perfect negative relationship, and 1 is a perfect positive relationship. Correlation is also necessary for regression (which we will get to later).
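
To make the definition concrete, here is a minimal sketch of one common way to compute r: convert both variables to standard units and take the mean of their products. The names `standard_units` and `correlation_sketch` and the toy numbers are made up for illustration; this is not necessarily how the `correlation` function used later in this notebook is written.

```python
import numpy as np

def standard_units(values):
    """Convert an array to standard units: (value - mean) / SD."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

def correlation_sketch(x, y):
    """Return r as the mean of the products of x and y in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

# Made-up toy data (not from the CES dataset).
hours_studied = np.array([1, 2, 3, 4, 5])
perfect_positive = 2 * hours_studied + 1     # rises in a straight line
perfect_negative = -3 * hours_studied + 10   # falls in a straight line

r_pos = correlation_sketch(hours_studied, perfect_positive)
r_neg = correlation_sketch(hours_studied, perfect_negative)
print('r (perfect positive):', r_pos)
print('r (perfect negative):', r_neg)
```

Points that fall exactly on a rising line give an r of essentially 1, and points on a falling line give essentially -1 (up to floating-point rounding). Real data almost always lands somewhere in between.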

![image](images/correlation-examples.svg)

If we wanted to look at the relationship between two of the variables in our dataset, we could calculate their correlation. For example, we could ask how race is related to a particular health factor, such as asthma.

### The Data

We will be using data from the website of the Office of Environmental Health Hazard Assessment. The file includes environmental and population data across different counties of California. In order to analyze the data, we must first import it into our Jupyter notebook and create a table. We will call this table `ces_data`.

In [None]:
ces_data = Table.read_table("data/ces_data.csv")
ces_data.take(np.arange(40,50))

Notice that a lot of the entries in the Pesticides column above are 0s. When dealing with large datasets, we will often encounter **missing** values, as we discussed in Project 1. These are simply empty values that appear when we do not have a value available for a particular record. It is important to clean out these meaningless values before carrying out any analysis of the dataset. Much of data science consists of **cleaning data**, which includes **renaming columns**, **reducing the table size to include only the columns of interest**, and **removing missing values**. There are various methods of dealing with missing values -- for our purposes, it is safe to simply remove these values from our table.
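
As a rough illustration of the "removing missing values" step, the sketch below drops every row that has a missing entry. It uses a made-up NumPy array and assumes missing entries are stored as `NaN`; the real CES file may encode them differently, and the cleaning that produced `cleaned_data_new.csv` may have involved other steps.

```python
import numpy as np

# Hypothetical toy data: each row is (pollution score, asthma rate).
toy_rows = np.array([
    [4.2, 31.0],
    [np.nan, 45.5],   # pollution score missing
    [6.8, np.nan],    # asthma rate missing
    [5.1, 52.3],
])

# True for rows in which every entry is present.
is_complete = ~np.isnan(toy_rows).any(axis=1)

# Keep only the complete rows.
cleaned_rows = toy_rows[is_complete]
print(cleaned_rows)
```

Only the first and last rows survive, because each of the other two is missing a value.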

**We have done this for you**: simply run the cell below to save a clean version of the data as `clean_ces_data`. From this point forward, we'll use this cleaned CES data to run our analysis.

In [None]:
clean_ces_data = Table.read_table("data/cleaned_data_new.csv")
clean_ces_data.show(5)

The scatter plot below shows the relationship between the pollution score and asthma. As you look at it, refer back to the correlation examples image shown earlier.

In [None]:
clean_ces_data.scatter("ces_pollution_score", "asthma", alpha = .18, s = 10)

#### Based on this scatter plot, what do you think the r-value is?
In other words, about how closely are pollution and asthma related? Compare this graph with the charts above to help you identify the **type** (Positive? Negative?) and **strength** (value) of the relationship.

*Your Guess Here*

#### Correlation Function!

To see how well your guess matches the actual r-value, we can use the `correlation` function, which was loaded by the setup cell at the top of the notebook.

In [None]:
#Run me to find the actual correlation coefficient!
correlation(clean_ces_data, 'ces_pollution_score', 'asthma')

It's certainly not perfect -- if you are given a pollution score, you can't say that the number of reported asthma attacks **will definitely** be \_\_. However, you can see (both from the plot and from the calculated r-value) that there is a positive relationship between a census tract's pollution score and the number of reported asthma attacks.


---

## Your Turn!

In the previous example, we explored the relationship between an environmental outcome and a health issue. Now let's look at how this health issue relates to a certain demographic.

In [None]:
# This will find the correlation coefficient between African Americans and Asthma.
print('r: ', correlation(clean_ces_data, 'african_american', 'asthma'))
clean_ces_data.scatter("african_american", "asthma", alpha = .18, s = 10)

`r:  0.4986847676603604`

Since our r-value is only moderate (about 0.5, well short of 1), it shows us that we need to conduct more analysis, because a single variable is not sufficient to predict asthma. Usually, multiple factors affect an outcome, so it makes sense that we need to do more than a simple analysis. Choose the factors whose relationship you want to see and enter them in the call below!

In [None]:
# Replace each ... with the columns you want to look at (use the same two columns in both lines).
print('r: ', correlation(clean_ces_data, '...', '...'))
clean_ces_data.scatter('...', '...', alpha = .18, s = 10)
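
If you want to compare how several factors each relate to the same outcome, one option is to compute r for each factor in a loop. The sketch below does this with synthetic data and NumPy's `np.corrcoef`; the variable names and generated numbers are purely illustrative and are not drawn from the CES data.

```python
import numpy as np

# Synthetic data for illustration only (not the CES dataset).
rng = np.random.default_rng(0)
n = 200
pollution = rng.normal(size=n)
poverty = rng.normal(size=n)
unrelated = rng.normal(size=n)

# The toy outcome depends on two of the factors plus random noise.
asthma = 0.6 * pollution + 0.6 * poverty + rng.normal(scale=0.5, size=n)

factors = {'pollution': pollution, 'poverty': poverty, 'unrelated': unrelated}

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
r_values = {name: np.corrcoef(values, asthma)[0, 1]
            for name, values in factors.items()}

for name, r in r_values.items():
    print(f'{name}: r = {r:.2f}')
```

In this toy setup, each of the two contributing factors shows a clearly positive r on its own, yet neither comes close to 1, because the outcome depends on both of them plus noise. The unrelated factor stays near 0.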

---

**CONGRATULATIONS!!!** You've made it through an introduction to correlation! 

---

**Citation:**

- [DS Modules](https://github.com/ds-modules)
- Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/)
- Pierce, Rod. "Correlation" Math Is Fun. Ed. Rod Pierce. 5 Nov 2018. 16 Feb 2019 <http://www.mathsisfun.com/data/correlation.html>

*Notebook developed by: Aarish Irfan, Alleanna Clark & Keiko Kamei*