# CS 533 Assignment 5 - Ben Whitehead

## Context

This assignment is designed to develop your ability to reason about classifier accuracy metrics in the context of their social impact.

You will do this by partially replicating the ProPublica analysis of fairness in the COMPAS pre-trial risk assessment tool, reflecting on the process, and replicating simulation studies of tradeoffs in the fairness of machine learning metrics.

**Note:** My data is loaded from a `data/` directory. If your notebook is not set up that way, either change the path the data set is loaded from or put your data in a `data/` directory.


## Setup

I am going to start by getting the infectious disease data into a good working state. The data comes in multiple files, so we'll need to concatenate them and then aggregate the result to both state and county level.
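The concatenate-then-aggregate plan above can be sketched with pandas as follows. This is a minimal illustration on made-up frames, not the assignment data: the column names `State`, `County`, and `value`, and all of the numbers, are placeholders chosen for the example.

```python
import pandas as pd

# Two made-up "files"; in practice each frame would come from a reader such as pd.read_excel.
part_a = pd.DataFrame({
    "State": ["ID", "ID"],
    "County": ["Ada", "Canyon"],
    "value": [10.0, 5.0],
})
part_b = pd.DataFrame({
    "State": ["ID", "OR"],
    "County": ["Ada", "Malheur"],
    "value": [2.5, 4.0],
})

# Stack the pieces into one table; ignore_index gives a clean 0..n-1 index.
combined = pd.concat([part_a, part_b], ignore_index=True)

# County-level totals (county names can repeat across states, so group on both).
county_level = combined.groupby(["State", "County"])["value"].sum()

# State-level totals.
state_level = combined.groupby("State")["value"].sum()
```

Grouping on both `State` and `County` avoids merging same-named counties from different states into one total.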

## Part 1. Load Data

Download the raw text files for the main ‘BBC’ data set from http://mlg.ucd.ie/datasets/bbc.html. The download is a ZIP file containing the text files.

**Note:** My data is located in the `./data` directory rather than the root of my project.

In [30]:
import pandas as pd
import pandas_ml as pdml
import seaborn as sns
import numpy as np
import os
import matplotlib.pyplot as plt

import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


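data_dir = 'data/2018_data_summary_spreadsheets'
all_data = pd.DataFrame()

# Read the first sheet of each workbook and tag its rows with the year taken
# from the third '_'-separated part of the file name (extension removed).
for f in os.listdir(data_dir):
    temp = pd.read_excel(os.path.join(data_dir, f), sheet_name=0)
    temp['Year'] = f.split('.')[0].split('_')[2]
    # Each new file is stacked on top of the rows read so far.
    all_data = pd.concat([temp, all_data])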
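`os.listdir` returns every entry in the directory, so a stray non-spreadsheet file (for example a hidden system file) would break the loop. A minimal sketch of a more defensive year parser follows. The helper name `year_from_filename` and the example file names are hypothetical, and the sketch assumes the year is a four-digit token placed immediately before the file extension.

```python
import re
from typing import Optional

# Hypothetical pattern: an underscore, four digits, then the extension at the end of the name.
_YEAR_PATTERN = re.compile(r"_(\d{4})\.[A-Za-z0-9]+$")


def year_from_filename(name: str) -> Optional[str]:
    """Return the year as a string, or None when the name does not match."""
    match = _YEAR_PATTERN.search(name)
    return match.group(1) if match else None


# Example names are made up for illustration.
names = ["summary_data_2017.xlsx", "summary_data_2010.xls", ".DS_Store", "notes.txt"]
years = {name: year_from_filename(name) for name in names}
```

Files for which the helper returns `None` could then be skipped before calling `pd.read_excel`.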


In [31]:
all_data.reset_index()

| | index | Address | Adipic Acid Production | Aluminum Production | Ammonia Manufacturing | Biogenic CO2 emissions (metric tons) | CO2 emissions (non-biogenic) | Cement Production | City | County | ... | Soda Ash Manufacturing | State | Stationary Combustion | Titanium Dioxide Production | Total reported direct emissions | Underground Coal Mines | Very Short-lived Compounds emissions | Year | Zinc Production | Zip Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3820 SAM RAYBURN HIGHWAY | | | | | | | MELISSA | COLLIN COUNTY | ... | | TX | | | 2.504975e+05 | | | 2017 | | 75454 |
| 1 | 1 | 4200 S. Hwy 15 | | | | | | | Hazard | PERRY COUNTY | ... | | KY | | | 2.186992e+05 | 218699.25 | | 2017 | | 40701 |
| 2 | 2 | 1845 S. KY HWY 15 | | | | | | | Hazard | PERRY | ... | | KY | | | 7.026500e+04 | 70265.00 | | 2017 | | 41701 |
| 3 | 3 | 22845 Highway 33 | | | | | 9120.1 | | McKittrick | | ... | | CA | 9115.416 | | 9.298916e+03 | | | 2017 | | 93251 |
| 4 | 4 | 730 3rd Avenue | | | | | 52645.5 | | BROOKLYN | Kings | ... | | NY | 65.000 | | 5.269866e+04 | | | 2017 | | 11232 |
| 5 | 5 | 11700 W 31ST ST | | | | | 146.8 | | WESTCHESTER | COOK COUNTY | ... | | IL | 146.800 | | 2.902680e+04 | | | 2017 | | 60154 |
| 6 | 6 | 4501 HIGHWAY 377 SOUTH | | | | | 36165.0 | | BROWNWOOD | BROWN | ... | | TX | 36006.364 | | 3.620226e+04 | | | 2017 | | 76801 |
| 7 | 7 | | | | | | 76516.2 | | MAPLEWOOD | RAMSEY COUNTY | ... | | MN | 76595.410 | | 7.659541e+04 | | | 2017 | | 55144 |
| 8 | 8 | 3669 South Hwy 50 | | | | | 39664.4 | | Gillette | | ... | | WY | 38602.204 | | 5.710495e+04 | | | 2017 | | 82716 |
| 9 | 9 | 6675 US HIGHWAY 43 | | | | | 25108.2 | | GUIN | MARION | ... | | AL | 24787.756 | | 2.513396e+04 | | | 2017 | | 35563 |
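The rows above mix letter case and suffix styles in the place columns (`MELISSA` versus `Hazard`, `PERRY COUNTY` versus `PERRY`), so grouping on the raw strings can split one place into several groups. Below is a minimal sketch of one possible cleanup, run on a made-up frame that reuses the same column names. The normalization rules (upper-casing, trimming whitespace, dropping a trailing ` COUNTY`) are assumptions for illustration, not something the data source prescribes.

```python
import pandas as pd

# Made-up rows that imitate the inconsistencies visible in the table.
df = pd.DataFrame({
    "State": ["KY", "KY", "AK", "AK"],
    "City": ["Hazard", "HAZARD ", "ANCHORAGE", "Anchorage"],
    "County": ["PERRY COUNTY", "Perry", None, None],
    "Total reported direct emissions": [1.0, 2.0, 3.0, 4.0],
})

# Upper-case and trim the city so 'Anchorage' and 'ANCHORAGE' share a group.
df["City"] = df["City"].str.strip().str.upper()

# Same for county, then drop a trailing ' COUNTY' suffix; missing values stay missing.
df["County"] = (
    df["County"].str.strip().str.upper().str.replace(r"\s+COUNTY$", "", regex=True)
)

city_totals = df.groupby(["State", "City"])["Total reported direct emissions"].sum()
```

Upper-casing is only one choice; any consistent casing works as long as it is applied before the `groupby`.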


In [32]:
all_data.groupby(['Year','State', 'City'])['Total reported direct emissions'].agg('sum')

Year  State  City             
2010  AK     ANCHORAGE            1.989695e+06
             Akutan               3.693258e+04
             Anchorage            8.823022e+05
             BARROW               4.268854e+04
             BIG LAKE             5.252580e+03
             CLEAR AIR STATION    8.999312e+04
             DUTCH HARBOR         2.805247e+04
             EAGLE RIVER          3.231570e+05
             EIELSON AFB          3.306647e+05
             FAIRBANKS            4.540996e+05
             FORT WAINWRIGHT      3.734709e+05
             Fairbanks            1.374978e+05
             HEALY                2.786180e+05
             KENAI                4.477359e+05
             KOTZEBUE             1.468131e+05
             Kenai                3.151448e+05
             Milne Point Unit     1.822157e+05
             NIKISKI              1.813932e+05
             NORTH POLE           5.647467e+05
             North Slope          1.242964e+06
             Offshore        