<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Project 2

### Exploratory Data Analysis (EDA)

---

Your hometown mayor just created a new data analysis team to give policy advice, and the administration recruited _you_ via LinkedIn to join it. Unfortunately, due to budget constraints, for now the "team" is just you...

The mayor wants to start a new initiative to move the needle on one of two separate issues: high school education outcomes, or drug abuse in the community.

Also unfortunately, that is the entirety of what you've been told. And the mayor just went on a lobbyist-funded fact-finding trip in the Bahamas. In the meantime, you got your hands on two national datasets: one on SAT scores by state, and one on drug use by age. Start exploring these to look for useful patterns and possible hypotheses!

---

This project is focused on exploratory data analysis, aka "EDA". EDA is an essential part of the data science analysis pipeline. Failure to perform EDA before modeling is almost guaranteed to lead to bad models and faulty conclusions. What you do in this project are good practices for all projects going forward, especially those after this bootcamp!

This lab includes a variety of plotting problems. Much of the plotting code will be left up to you to find either in the lecture notes, or if not there, online. There are massive amounts of code snippets either in documentation or sites like [Stack Overflow](https://stackoverflow.com/search?q=%5Bpython%5D+seaborn) that have almost certainly done what you are trying to do.

**Get used to googling for code!** You will use it every single day as a data scientist, especially for visualization and plotting.

#### Package imports

In [42]:
import numpy as np
import scipy.stats as stats
import csv
import pandas as pd

# this line tells jupyter notebook to put the plots in the notebook rather than saving them to file.
%matplotlib inline

# this line makes plots prettier on mac retina screens. If you don't have one it shouldn't do anything.
%config InlineBackend.figure_format = 'retina'

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 1. Load the `sat_scores.csv` dataset and describe it

---

You should replace the placeholder path to the `sat_scores.csv` dataset below with your specific path to the file.

### 1.1 Load the file with the `csv` module and put it in a Python dictionary

The dictionary format for data will be the column names as key, and the data under each column as the values.

Toy example:
```python
data = {
    'column1':[0,1,2,3],
    'column2':['a','b','c','d']
    }
```
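
A compact way to reach this column-oriented layout is to read every row with `csv.reader` and then transpose the rows with `zip`. The sketch below assumes the first row of the file is the header; the inline `sample_csv` string is a stand-in for the file on disk and holds only the first three data rows.

```python
import csv
import io

# Stand-in for the file on disk: header plus the first three data rows.
sample_csv = "State,Rate,Verbal,Math\nCT,82,509,510\nNJ,81,499,513\nMA,79,511,515\n"

with io.StringIO(sample_csv) as handle:
    rows = list(csv.reader(handle))

header, body = rows[0], rows[1:]

# zip(*body) transposes the rows into columns; every value is still a string.
data = {name: list(column) for name, column in zip(header, zip(*body))}
print(data)
```

To use it on the real file, swap the `io.StringIO(...)` call for `open('./sat_scores.csv', newline='')`.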

In [43]:
import csv
from pprint import pprint

sat_scores = './sat_scores.csv'
with open(sat_scores) as scores:
    reader = csv.reader(scores)
    scores_list = []
    for row in reader:
        scores_list.append(row)
    pprint (scores_list)


[['State', 'Rate', 'Verbal', 'Math'],
 ['CT', '82', '509', '510'],
 ['NJ', '81', '499', '513'],
 ['MA', '79', '511', '515'],
 ['NY', '77', '495', '505'],
 ['NH', '72', '520', '516'],
 ['RI', '71', '501', '499'],
 ['PA', '71', '500', '499'],
 ['VT', '69', '511', '506'],
 ['ME', '69', '506', '500'],
 ['VA', '68', '510', '501'],
 ['DE', '67', '501', '499'],
 ['MD', '65', '508', '510'],
 ['NC', '65', '493', '499'],
 ['GA', '63', '491', '489'],
 ['IN', '60', '499', '501'],
 ['SC', '57', '486', '488'],
 ['DC', '56', '482', '474'],
 ['OR', '55', '526', '526'],
 ['FL', '54', '498', '499'],
 ['WA', '53', '527', '527'],
 ['TX', '53', '493', '499'],
 ['HI', '52', '485', '515'],
 ['AK', '51', '514', '510'],
 ['CA', '51', '498', '517'],
 ['AZ', '34', '523', '525'],
 ['NV', '33', '509', '515'],
 ['CO', '31', '539', '542'],
 ['OH', '26', '534', '439'],
 ['MT', '23', '539', '539'],
 ['WV', '18', '527', '512'],
 ['ID', '17', '543', '542'],
 ['TN', '13', '562', '553'],
 ['NM', '13', '551', '542'],
 ['IL

In [44]:
scores_list[:5]

[['State', 'Rate', 'Verbal', 'Math'],
 ['CT', '82', '509', '510'],
 ['NJ', '81', '499', '513'],
 ['MA', '79', '511', '515'],
 ['NY', '77', '495', '505']]

In [45]:
len(scores_list[0])

4

In [46]:
# Convert the list of lists to a dictionary of columns.
# Target layout (row 0 of scores_list is the header, rows 1..52 are the data):
scores_dict = {
    # scores_list[0][0]: [scores_list[1][0], scores_list[2][0], ... scores_list[52][0]],
    # scores_list[0][1]: [scores_list[1][1], scores_list[2][1], ... scores_list[52][1]],
    # scores_list[0][2]: [scores_list[1][2], scores_list[2][2], ... scores_list[52][2]],
    # scores_list[0][3]: [scores_list[1][3], scores_list[2][3], ... scores_list[52][3]],
}


# create dictionary keys
k = 0
while k < len(scores_list[0]):
    scores_dict[scores_list[0][k]] = []
    k += 1
    
#create the dictionary values, which is the list
k = 1
while k < len(scores_list):
    i = 0
    while i < len(scores_list[0]):
        scores_dict[scores_list[0][i]].append(scores_list[k][i])
        i += 1
    k += 1

pprint(scores_dict)

{'Math': ['510',
          '513',
          '515',
          '505',
          '516',
          '499',
          '499',
          '506',
          '500',
          '501',
          '499',
          '510',
          '499',
          '489',
          '501',
          '488',
          '474',
          '526',
          '499',
          '527',
          '499',
          '515',
          '510',
          '517',
          '525',
          '515',
          '542',
          '439',
          '539',
          '512',
          '542',
          '553',
          '542',
          '589',
          '550',
          '545',
          '572',
          '589',
          '580',
          '554',
          '568',
          '561',
          '577',
          '562',
          '596',
          '550',
          '570',
          '603',
          '582',
          '599',
          '551',
          '514'],
 'Rate': ['82',
          '81',
          '79',
          '77',
          '72',
          '71',
          '71',
   

In [47]:
drug_use = './drug-use-by-age.csv'
with open(drug_use) as drugs:
    reader = csv.reader(drugs)
    drugs_list = []
    for row in reader:
        drugs_list.append(row)
    print (drugs_list[:5])

[['age', 'n', 'alcohol-use', 'alcohol-frequency', 'marijuana-use', 'marijuana-frequency', 'cocaine-use', 'cocaine-frequency', 'crack-use', 'crack-frequency', 'heroin-use', 'heroin-frequency', 'hallucinogen-use', 'hallucinogen-frequency', 'inhalant-use', 'inhalant-frequency', 'pain-releiver-use', 'pain-releiver-frequency', 'oxycontin-use', 'oxycontin-frequency', 'tranquilizer-use', 'tranquilizer-frequency', 'stimulant-use', 'stimulant-frequency', 'meth-use', 'meth-frequency', 'sedative-use', 'sedative-frequency'], ['12', '2798', '3.9', '3.0', '1.1', '4.0', '0.1', '5.0', '0.0', '-', '0.1', '35.5', '0.2', '52.0', '1.6', '19.0', '2.0', '36.0', '0.1', '24.5', '0.2', '52.0', '0.2', '2.0', '0.0', '-', '0.2', '13.0'], ['13', '2757', '8.5', '6.0', '3.4', '15.0', '0.1', '1.0', '0.0', '3.0', '0.0', '-', '0.6', '6.0', '2.5', '12.0', '2.4', '14.0', '0.1', '41.0', '0.3', '25.5', '0.3', '4.0', '0.1', '5.0', '0.1', '19.0'], ['14', '2792', '18.1', '5.0', '8.7', '24.0', '0.1', '5.5', '0.0', '-', '0.1', 

In [48]:
#convert drugs list of lists to a dictionary

drugs_dict = {}

#create dictionary keys
k = 0
while k < len(drugs_list[0]):
    drugs_dict[drugs_list[0][k]] = []
    k += 1
    
#create dictionary values
k = 1
while k < len(drugs_list):
    i = 0
    while i < len(drugs_list[0]):
        drugs_dict[drugs_list[0][i]].append(drugs_list[k][i])
        i += 1
    k += 1

pprint(drugs_dict)

{'age': ['12',
         '13',
         '14',
         '15',
         '16',
         '17',
         '18',
         '19',
         '20',
         '21',
         '22-23',
         '24-25',
         '26-29',
         '30-34',
         '35-49',
         '50-64',
         '65+'],
 'alcohol-frequency': ['3.0',
                       '6.0',
                       '5.0',
                       '6.0',
                       '10.0',
                       '13.0',
                       '24.0',
                       '36.0',
                       '48.0',
                       '52.0',
                       '52.0',
                       '52.0',
                       '52.0',
                       '52.0',
                       '52.0',
                       '52.0',
                       '52.0'],
 'alcohol-use': ['3.9',
                 '8.5',
                 '18.1',
                 '29.2',
                 '40.1',
                 '49.3',
                 '58.7',
                 '64.6',
   

### 1.2 Make a pandas DataFrame object with the SAT dictionary, and another with the pandas `.read_csv()` function

Compare the DataFrames using the `.dtypes` attribute in the DataFrame objects. What is the difference between loading from file and inputting this dictionary (if any)?

In [49]:
# convert scores_dict into a pandas DataFrame
scores_dict_df = pd.DataFrame(scores_dict)
scores_dict_df

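|   | Math | Rate | State | Verbal |
|---|---|---|---|---|
| 0 | 510 | 82 | CT | 509 |
| 1 | 513 | 81 | NJ | 499 |
| 2 | 515 | 79 | MA | 511 |
| 3 | 505 | 77 | NY | 495 |
| 4 | 516 | 72 | NH | 520 |
| 5 | 499 | 71 | RI | 501 |
| 6 | 499 | 71 | PA | 500 |
| 7 | 506 | 69 | VT | 511 |
| 8 | 500 | 69 | ME | 506 |
| 9 | 501 | 68 | VA | 510 |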


In [59]:
#convert drugs_dict into a pandas DataFrame
drugs_dict_df = pd.DataFrame(drugs_dict)
drugs_dict_df.sample(5)

|   | age | alcohol-frequency | alcohol-use | cocaine-frequency | cocaine-use | crack-frequency | crack-use | hallucinogen-frequency | hallucinogen-use | heroin-frequency | ... | oxycontin-frequency | oxycontin-use | pain-releiver-frequency | pain-releiver-use | sedative-frequency | sedative-use | stimulant-frequency | stimulant-use | tranquilizer-frequency | tranquilizer-use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 14 | 5.0 | 18.1 | 5.5 | 0.1 | - | 0.0 | 3.0 | 1.6 | 2.0 | ... | 4.5 | 0.4 | 12.0 | 3.9 | 16.5 | 0.2 | 12.0 | 0.8 | 5.0 | 0.9 |
| 0 | 12 | 3.0 | 3.9 | 5.0 | 0.1 | - | 0.0 | 52.0 | 0.2 | 35.5 | ... | 24.5 | 0.1 | 36.0 | 2.0 | 13.0 | 0.2 | 2.0 | 0.2 | 52.0 | 0.2 |
| 11 | 24-25 | 52.0 | 83.1 | 6.0 | 4.0 | 6.0 | 0.5 | 2.0 | 4.5 | 88.0 | ... | 20.0 | 1.3 | 15.0 | 9.0 | 17.5 | 0.2 | 10.0 | 2.6 | 10.0 | 4.3 |
| 14 | 35-49 | 52.0 | 75.0 | 15.0 | 1.5 | 48.0 | 0.5 | 3.0 | 0.6 | 280.0 | ... | 12.0 | 0.3 | 12.0 | 4.2 | 10.0 | 0.3 | 24.0 | 0.6 | 6.0 | 1.9 |
| 10 | 22-23 | 52.0 | 84.2 | 5.0 | 4.5 | 5.0 | 0.5 | 3.0 | 5.2 | 57.5 | ... | 17.5 | 1.7 | 15.0 | 10.0 | 52.0 | 0.2 | 10.0 | 3.6 | 12.0 | 4.4 |


In [26]:
#using pandas.read_csv to convert raw data to DataFrame
scores_pd_df = pd.read_csv(sat_scores)
drugs_pd_df = pd.read_csv(drug_use)

If you did not convert the string column values to float in your dictionary, the columns in the DataFrame are of type `object` (which are string values, essentially). 
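
The difference is easy to see side by side. The sketch below is a minimal illustration that uses the first three data rows as an inline stand-in for the file; `from_dict`, `from_file`, and `converted` are names chosen here for illustration.

```python
import io
import pandas as pd

sample_csv = "State,Rate,Verbal,Math\nCT,82,509,510\nNJ,81,499,513\nMA,79,511,515\n"

# Built from string values, which is what the csv module delivers.
from_dict = pd.DataFrame({
    "State": ["CT", "NJ", "MA"],
    "Rate": ["82", "81", "79"],
    "Verbal": ["509", "499", "511"],
    "Math": ["510", "513", "515"],
})

# read_csv infers numeric types while parsing.
from_file = pd.read_csv(io.StringIO(sample_csv))

print(from_dict.dtypes)  # numeric-looking columns are still strings
print(from_file.dtypes)  # Rate, Verbal, and Math are parsed as integers

# Convert the string columns after the fact.
numeric_cols = ["Rate", "Verbal", "Math"]
converted = from_dict.copy()
converted[numeric_cols] = converted[numeric_cols].apply(pd.to_numeric)
print(converted.dtypes)
```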

In [60]:
scores_pd_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 4 columns):
State     52 non-null object
Rate      52 non-null int64
Verbal    52 non-null int64
Math      52 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.7+ KB


In [61]:
drugs_pd_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 28 columns):
age                        17 non-null object
n                          17 non-null int64
alcohol-use                17 non-null float64
alcohol-frequency          17 non-null float64
marijuana-use              17 non-null float64
marijuana-frequency        17 non-null float64
cocaine-use                17 non-null float64
cocaine-frequency          17 non-null object
crack-use                  17 non-null float64
crack-frequency            17 non-null object
heroin-use                 17 non-null float64
heroin-frequency           17 non-null object
hallucinogen-use           17 non-null float64
hallucinogen-frequency     17 non-null float64
inhalant-use               17 non-null float64
inhalant-frequency         17 non-null object
pain-releiver-use          17 non-null float64
pain-releiver-frequency    17 non-null float64
oxycontin-use              17 non-null float64
oxycontin-f

### 1.3 Look at the first ten rows of the DataFrame: what does our data describe?

From now on, use the DataFrame loaded from the file using the `.read_csv()` function.

Use the `.head(num)` built-in DataFrame function, where `num` is the number of rows to print out.

You are not given a "codebook" with this data, so you will have to make some (very minor) inference.

In [None]:
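scores_pd_df.head(10)
# Each row is one state (identified by its postal abbreviation).
# Rate appears to be the percentage of eligible students in that state who took
# the SAT (an inference, since no codebook is provided).
# Verbal and Math are the state's average scores on each section.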

In [None]:
drugs_pd_df.head(10)

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 2. Create a "data dictionary" based on the data

---

A data dictionary is an object that describes your data. This should contain the name of each variable (column), the type of the variable, your description of what the variable is, and the shape (rows and columns) of the entire dataset.

In [None]:
scores_pd_df.info()

In [None]:
# Build a data dictionary that describes the SAT DataFrame.
# Planned keys:
#   columns: list of column names
#   dtypes by column: (column, dtype) pairs
#   descriptions: description of each variable
#   shape: (rows, columns) of the entire dataset
scores_dict = {}

scores_dict.update({'columns': list(scores_pd_df.columns)})

In [None]:
my_descr={
    'State':'place where SAT test was administered',
    'Rate': 'likely the percentage of eligible students who took the SAT (inferred; no codebook)',
    'Verbal': 'average verbal scores',
    'Math': 'average math scores',
}

In [None]:
scores_dict['descriptions'] = my_descr

In [None]:
scores_dict.update({'dtypes by column': [(col, str(dtype)) for col, dtype in scores_pd_df.dtypes.items()]})

In [None]:
scores_dict.update({'shape': scores_pd_df.shape})

In [None]:
pprint (scores_dict)

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 3. Plot the data using seaborn

---

### 3.1 Using seaborn's `distplot`, plot the distributions for each of `Rate`, `Math`, and `Verbal`

Set the keyword argument `kde=False`. This way you can actually see the counts within bins. You can adjust the number of bins to your liking. 

[Please read over the `distplot` documentation to learn about the arguments and fine-tune your chart if you want.](https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html#seaborn.distplot)

### 3.2 Using seaborn's `pairplot`, show the joint distributions for each of `Rate`, `Math`, and `Verbal`

Explain what the visualization tells you about your data.

[Please read over the `pairplot` documentation to fine-tune your chart.](https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.pairplot.html#seaborn.pairplot)

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 4. Plot the data using built-in pandas functions.

---

Pandas is very powerful and contains a variety of nice, built-in plotting functions for your data. Read the documentation here to understand the capabilities:

http://pandas.pydata.org/pandas-docs/stable/visualization.html

### 4.1 Plot a stacked histogram with `Verbal` and `Math` using pandas
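
One way to do this, sketched on the first ten rows of the SAT data, is the `stacked=True` option of `DataFrame.plot.hist`. The bin count and transparency are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import pandas as pd

# First ten rows of the SAT data, used only to keep the sketch self-contained.
sample = pd.DataFrame({
    "Verbal": [509, 499, 511, 495, 520, 501, 500, 511, 506, 510],
    "Math": [510, 513, 515, 505, 516, 499, 499, 506, 500, 501],
})

# Both columns share the same bins; Math bars are stacked on top of Verbal bars.
ax = sample[["Verbal", "Math"]].plot.hist(stacked=True, bins=5, alpha=0.8)
ax.set_xlabel("Score")
```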

### 4.2 Plot `Verbal` and `Math` on the same chart using boxplots

What are the benefits of using a boxplot as compared to a scatterplot or a histogram?

What's wrong with plotting a box-plot of `Rate` on the same chart as `Math` and `Verbal`?

<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 4.3 Plot `Verbal`, `Math`, and `Rate` appropriately on the same boxplot chart

Think about how you might change the variables so that they would make sense on the same chart. Explain your rationale for the choices on the chart. You should strive to make the chart as intuitive as possible. 
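
One option, offered here as an assumption rather than the only defensible choice, is to standardize each variable to z-scores (subtract the column mean, divide by the column standard deviation) so that all three share a unitless scale. The sketch uses the first ten rows of the SAT data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import pandas as pd

# First ten rows of the SAT data, used only to keep the sketch self-contained.
numeric = pd.DataFrame({
    "Rate": [82, 81, 79, 77, 72, 71, 71, 69, 69, 68],
    "Verbal": [509, 499, 511, 495, 520, 501, 500, 511, 506, 510],
    "Math": [510, 513, 515, 505, 516, 499, 499, 506, 500, 501],
})

# z-score: each column ends up with mean 0 and standard deviation 1.
standardized = (numeric - numeric.mean()) / numeric.std()

ax = standardized.plot.box()
ax.set_ylabel("Standard deviations from the column mean")
```

The trade-off is that the y-axis no longer reads in original units, so the chart shows relative spread and skew rather than actual scores or percentages.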


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 5. Create and examine subsets of the data

---

For these questions you will practice **masking** in pandas. Masking uses conditional statements to select portions of your DataFrame (through boolean operations under the hood).

Remember the distinction between DataFrame indexing functions in pandas:

    .iloc[row, col] : row and column are specified by integer position
    .loc[row, col]  : row and column are specified by index "labels" (boolean arrays are allowed; useful for rows)
    .ix[row, col]   : row and column indexers can be a mix of labels and integer positions (deprecated in pandas 0.20 and removed in 1.0; use .loc or .iloc instead)
    
For detailed reference and tutorial make sure to read over the pandas documentation:

http://pandas.pydata.org/pandas-docs/stable/indexing.html
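
The distinction is easiest to see on a tiny frame. This sketch uses the first three rows of the SAT data; the variable names are chosen here for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "State": ["CT", "NJ", "MA"],
    "Rate": [82, 81, 79],
    "Verbal": [509, 499, 511],
})

# Position-based: first row, third column.
by_position = sample.iloc[0, 2]

# Label-based: row labeled 0, column labeled "Verbal".
by_label = sample.loc[0, "Verbal"]

# Boolean mask: one True/False per row, used to filter rows with .loc.
mask = sample["Rate"] > 80
high_rate_states = sample.loc[mask, "State"]
print(by_position, by_label, list(high_rate_states))
```

With the default integer index the row label and row position happen to coincide; after sorting or filtering they no longer do, which is when the choice between `.loc` and `.iloc` starts to matter.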



### 5.1 Find the list of states that have `Verbal` scores greater than the average of `Verbal` scores across states

How many states are above the mean? What does this tell you about the distribution of `Verbal` scores?




### 5.2 Find the list of states that have `Verbal` scores greater than the median of `Verbal` scores across states

How does this compare to the list of states greater than the mean of `Verbal` scores? Why?
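
The masks for 5.1 and 5.2 follow the same pattern. The sketch below runs on the first ten rows of the SAT data only, so the resulting lists and counts will not match what you get on the full dataset.

```python
import pandas as pd

# First ten rows of the SAT data, used only to keep the sketch self-contained.
sample = pd.DataFrame({
    "State": ["CT", "NJ", "MA", "NY", "NH", "RI", "PA", "VT", "ME", "VA"],
    "Verbal": [509, 499, 511, 495, 520, 501, 500, 511, 506, 510],
})

verbal_mean = sample["Verbal"].mean()
verbal_median = sample["Verbal"].median()

# Each comparison yields a boolean Series that selects the matching rows.
above_mean = sample.loc[sample["Verbal"] > verbal_mean, "State"].tolist()
above_median = sample.loc[sample["Verbal"] > verbal_median, "State"].tolist()

print(verbal_mean, len(above_mean), above_mean)
print(verbal_median, len(above_median), above_median)
```

When the mean and median are close, the two lists mostly overlap; a large gap between them is a sign of skew.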

### 5.3 Create a column that is the difference between the `Verbal` and `Math` scores

Specifically, this should be `Verbal - Math`.

### 5.4 Create two new DataFrames showing states with the greatest difference between scores

1. Your first DataFrame should be the 10 states with the greatest gap between `Verbal` and `Math` scores where `Verbal` is greater than `Math`. It should be sorted appropriately to show the ranking of states.
2. Your second DataFrame will be the inverse: states with the greatest gap between `Verbal` and `Math` such that `Math` is greater than `Verbal`. Again, this should be sorted appropriately to show rank.
3. Print the header of both variables, only showing the top 3 states in each.
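
Steps 5.3 and 5.4 can be sketched as follows, again on the first ten rows of the SAT data only. The column name `Diff` is a hypothetical choice; name it whatever reads best in your notebook.

```python
import pandas as pd

# First ten rows of the SAT data, used only to keep the sketch self-contained.
sample = pd.DataFrame({
    "State": ["CT", "NJ", "MA", "NY", "NH", "RI", "PA", "VT", "ME", "VA"],
    "Verbal": [509, 499, 511, 495, 520, 501, 500, 511, 506, 510],
    "Math": [510, 513, 515, 505, 516, 499, 499, 506, 500, 501],
})

# 5.3: difference column, defined as Verbal - Math.
sample["Diff"] = sample["Verbal"] - sample["Math"]

# 5.4: largest positive gaps first (Verbal > Math).
verbal_higher = sample[sample["Diff"] > 0].sort_values("Diff", ascending=False).head(10)

# 5.4: most negative gaps first (Math > Verbal).
math_higher = sample[sample["Diff"] < 0].sort_values("Diff", ascending=True).head(10)

print(verbal_higher.head(3))
print(math_higher.head(3))
```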

## 6. Examine summary statistics

---

Checking the summary statistics for data is an essential step in the EDA process!

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 6.1 Create the correlation matrix of your variables (excluding `State`).

What does the correlation matrix tell you?
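
A minimal sketch of the call, using the first ten rows of the SAT data: drop the non-numeric `State` column, then call `.corr()`, which computes Pearson correlations by default.

```python
import pandas as pd

# First ten rows of the SAT data, used only to keep the sketch self-contained.
sample = pd.DataFrame({
    "State": ["CT", "NJ", "MA", "NY", "NH", "RI", "PA", "VT", "ME", "VA"],
    "Rate": [82, 81, 79, 77, 72, 71, 71, 69, 69, 68],
    "Verbal": [509, 499, 511, 495, 520, 501, 500, 511, 506, 510],
    "Math": [510, 513, 515, 505, 516, 499, 499, 506, 500, 501],
})

# Pairwise Pearson correlations between the numeric columns.
corr = sample.drop(columns="State").corr()
print(corr)
```

The matrix is symmetric with ones on the diagonal, so only the values above (or below) the diagonal carry information.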


<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 6.2 Use pandas'  `.describe()` built-in function on your DataFrame

Write up what each of the rows returned by the function indicates.

<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

### 6.3 Assign and print the _covariance_ matrix for the dataset

1. Describe how the covariance matrix is different from the correlation matrix.
2. What is the process to convert the covariance into the correlation?
3. Why is the correlation matrix preferred to the covariance matrix for examining relationships in your data?
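
For the second question, the conversion is corr(X, Y) = cov(X, Y) / (std(X) * std(Y)). The sketch below checks that relationship numerically on the first ten rows of the SAT data; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# First ten rows of the SAT data, used only to keep the sketch self-contained.
numeric = pd.DataFrame({
    "Rate": [82, 81, 79, 77, 72, 71, 71, 69, 69, 68],
    "Verbal": [509, 499, 511, 495, 520, 501, 500, 511, 506, 510],
    "Math": [510, 513, 515, 505, 516, 499, 499, 506, 500, 501],
})

cov = numeric.cov()

# The diagonal of the covariance matrix holds each variable's variance.
std = np.sqrt(np.diag(cov))

# Divide every cov[i, j] by std[i] * std[j].
corr_from_cov = cov / np.outer(std, std)

print(corr_from_cov)
print(np.allclose(corr_from_cov, numeric.corr()))
```

Because the division removes the units, correlations are comparable across pairs of variables in a way that covariances are not.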

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 7. Performing EDA on "drug use by age" data.

---

You will now switch datasets to one with many more variables. This section of the project is more open-ended - use the techniques you practiced above!

We'll work with the "drug-use-by-age.csv" data, sourced from and described here: https://github.com/fivethirtyeight/data/tree/master/drug-use-by-age.

### 7.1

Load the data using pandas. Does this data require cleaning? Are variables missing? How will this affect your approach to EDA on the data?
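
One cleaning issue is visible in the earlier output: some frequency columns contain `-` where a number is expected, which forces the whole column to load as text. A possible way to handle this, shown as a sketch on three rows and two columns taken from the file, is the `na_values` argument of `pd.read_csv`. Whether a missing frequency should stay missing or be treated as zero is a judgment call you should justify.

```python
import io
import pandas as pd

# Three rows and three columns from the drug-use file, as an inline stand-in.
sample_csv = "age,crack-use,crack-frequency\n12,0.0,-\n13,0.0,3.0\n14,0.0,-\n"

# Default parsing: the "-" placeholder keeps the column from being numeric.
raw = pd.read_csv(io.StringIO(sample_csv))

# Treat "-" as missing so the column parses as float with NaN.
clean = pd.read_csv(io.StringIO(sample_csv), na_values="-")

print(raw.dtypes)
print(clean.dtypes)
print(clean["crack-frequency"].isna().sum())
```

Note that in the full file `age` also loads as text, because values such as `22-23` and `65+` are ranges rather than numbers; `na_values` does not help there.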

### 7.2 Do a high-level, initial overview of the data

Get a feel for what this dataset is all about.

Use whichever techniques you'd like, including those from the SAT dataset EDA. The final response to this question should be a written description of what you infer about the dataset.

Some things to consider doing:

- Look for relationships between variables and subsets of those variables' values
- Derive new features from the ones available to help your analysis
- Visualize everything!

### 7.3 Create a testable hypothesis about this data

Requirements for the question:

1. Write a specific question you would like to answer with the data (that can be accomplished with EDA).
2. Write a description of the "deliverables": what will you report after testing/examining your hypothesis?
3. Use EDA techniques of your choice, numeric and/or visual, to look into your question.
4. Write up your report on what you have found regarding the hypothesis about the data you came up with.


Your hypothesis could be on:

- Difference of group means
- Correlations between variables
- Anything else you think is interesting, testable, and meaningful!

**Important notes:**

You should be only doing EDA _relevant to your question_ here. It is easy to go down rabbit holes trying to look at every facet of your data, and so we want you to get in the practice of specifying a hypothesis you are interested in first and scoping your work to specifically answer that question.

Some of you may want to jump ahead to "modeling" data to answer your question. This is a topic addressed in the next project and **you should not do this for this project.** We specifically want you to not do modeling to emphasize the importance of performing EDA _before_ you jump to statistical analysis.

**Question and deliverables**


...

In [None]:
# Code

**Report**



...

<img src="http://imgur.com/xDpSobf.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 8. Introduction to dealing with outliers

---

Outliers are an interesting problem in statistics, in that there is no agreed-upon best way to define them. Subjectivity in selecting and analyzing data is a problem that will recur throughout the course.

1. Pull out the `Rate` variable from the SAT dataset.
2. Are there outliers in the dataset? Define, in words, how you _numerically define outliers._
3. Print out the outliers in the dataset.
4. Remove the outliers from the dataset.
5. Compare the mean, median, and standard deviation of the "cleaned" data without outliers to the original. What is different about them and why?
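
The steps above can be sketched with one common convention, the 1.5 × IQR rule: flag any value more than 1.5 interquartile ranges below the first quartile or above the third. This is only one possible definition, and the helper `iqr_outliers` and the `toy` values are made up for illustration; they are not taken from the SAT data.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up toy values with one obviously extreme entry.
toy = pd.Series([10, 11, 12, 12, 13, 95])

outliers = iqr_outliers(toy)
cleaned = toy.drop(outliers.index)

# Compare summary statistics before and after removal.
print(toy.agg(["mean", "median", "std"]))
print(cleaned.agg(["mean", "median", "std"]))
```

On data like this the mean and standard deviation shift noticeably after removal while the median barely moves, which is the kind of contrast question 5 asks you to explain.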

<img src="http://imgur.com/GCAf1UX.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## 9. Percentile scoring and Spearman rank correlation

---

### 9.1 Calculate the Spearman correlation of SAT `Verbal` and `Math`

1. How does the Spearman correlation compare to the Pearson correlation?
2. Describe clearly in words the process of calculating the Spearman rank correlation.
  - Hint: the word "rank" is in the name of the process for a reason!
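
Following the hint, the Spearman correlation can be reproduced by ranking each variable and then taking the ordinary Pearson correlation of the ranks. The sketch below checks this on the first ten rows of the SAT data, using pandas' built-in `method="spearman"` as the reference.

```python
import pandas as pd

# First ten rows of the SAT data, used only to keep the sketch self-contained.
sample = pd.DataFrame({
    "Verbal": [509, 499, 511, 495, 520, 501, 500, 511, 506, 510],
    "Math": [510, 513, 515, 505, 516, 499, 499, 506, 500, 501],
})

# Step 1: replace each value by its rank (ties receive the average rank).
ranks = sample.rank()

# Step 2: Pearson correlation of the ranks.
pearson_of_ranks = ranks.corr(method="pearson").loc["Verbal", "Math"]

# Reference value from pandas' built-in Spearman implementation.
spearman = sample.corr(method="spearman").loc["Verbal", "Math"]

print(pearson_of_ranks, spearman)
```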


### 9.2 Percentile scoring

Look up percentile scoring of data. In other words, the conversion of numeric data to their equivalent percentile scores.

http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html

http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.percentileofscore.html

1. Convert `Rate` to percentiles in the SAT scores as a new column.
2. Show the percentile of California in `Rate`.
3. How is percentile related to the Spearman rank correlation?
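
One way to percentile-score a column without leaving pandas is `Series.rank(pct=True)`, which divides each value's rank by the number of values. The sketch below uses a hand-picked subset of eight states from the SAT data, so the percentile it reports for California is for this subset only and is not the answer for the full dataset. The column name `Rate_pct` is a hypothetical choice.

```python
import pandas as pd

# Eight states taken from the SAT data; a subset, not the full table.
sample = pd.DataFrame({
    "State": ["CT", "NJ", "MA", "NY", "HI", "AK", "CA", "AZ"],
    "Rate": [82, 81, 79, 77, 52, 51, 51, 34],
})

# rank(pct=True) returns rank / count; ties share the average rank.
sample["Rate_pct"] = sample["Rate"].rank(pct=True) * 100

ca_pct = sample.loc[sample["State"] == "CA", "Rate_pct"].iloc[0]
print(sample)
print(ca_pct)
```

Because the percentile is just a rescaled rank, it is the same quantity the Spearman correlation is built on.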

### 9.3 Percentiles and outliers

1. Why might percentile scoring be useful for dealing with outliers?
2. Plot the distribution of a variable of your choice from the drug use dataset.
3. Plot the same variable but percentile scored.
4. Describe the effect, visually, of converting raw scores to percentile.
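
A side-by-side plot makes the comparison concrete. The sketch below uses the `alcohol-frequency` values printed earlier in this notebook (one per age group); you are free to pick a different variable. The bin count is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# alcohol-frequency for the 17 age groups, as printed earlier in the notebook.
raw = pd.Series(
    [3.0, 6.0, 5.0, 6.0, 10.0, 13.0, 24.0, 36.0, 48.0,
     52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0, 52.0],
    name="alcohol-frequency",
)

# Percentile score: rank divided by count, scaled to 0-100.
pct = raw.rank(pct=True) * 100

fig, (ax_raw, ax_pct) = plt.subplots(1, 2, figsize=(10, 3))
raw.plot.hist(bins=8, ax=ax_raw, title="Raw values")
pct.plot.hist(bins=8, ax=ax_pct, title="Percentile scored")
fig.tight_layout()
```

Percentile scoring keeps the ordering of the values but discards the distances between them, so extreme raw values can never sit further out than 100.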