# CS 533 Assignment 5 - Ben Whitehead

## Context

This assignment is designed to develop your ability to reason about classifier accuracy metrics in the context of their social impact.

You will do this by partially replicating the ProPublica analysis of fairness in the COMPAS pre-trial risk assessment tool, reflecting on the process, and replicating simulation studies of tradeoffs in the fairness of machine learning metrics.

**Note:** My data is loaded from a `data/` directory. If your notebook is not set up that way, either change the path the data set is loaded from or put your data in a `data/` directory.


## Setup

I am going to start by getting the infectious disease data into a good working state. The data comes in multiple files, so we'll need to concatenate them and turn the result into both state- and county-level data.

## Part 1. Load Data

Download the raw text files for the main ‘BBC’ data set from http://mlg.ucd.ie/datasets/bbc.html. This will be a ZIP file that contains text files with the data.

**Note:** My data is located in the `./data` directory instead of the root of my project.

In [24]:
import pandas as pd
import pandas_ml as pdml
import seaborn as sns
import numpy as np
import os
import matplotlib.pyplot as plt

import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

import addfips



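# Read the first sheet of every summary spreadsheet and tag each row with its year.
data_dir = 'data/2018_data_summary_spreadsheets'
frames = []
for f in os.listdir(data_dir):
    temp = pd.read_excel(os.path.join(data_dir, f), sheet_name=0)
    # The year is the third underscore-separated token of the file name.
    temp['Year'] = f.split('.')[0].split('_')[2]
    frames.append(temp)

# Concatenate once; sort=False leaves columns unsorted and silences the pandas FutureWarning.
ghg = pd.concat(frames, sort=False)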
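The year extraction above depends on every file name having the same underscore layout. As a hedged alternative, the sketch below pulls a four-digit year out of a file name with a regular expression and returns `None` for files that have none (for example, hidden system files that `os.listdir` may also return). The helper name `year_from_filename` and the example file name `ghgp_data_2015.xlsx` are assumptions made for illustration, not names taken from the data set.

```python
import re

def year_from_filename(filename):
    """Return the four-digit year embedded in a file name, or None if absent."""
    # Match 19xx or 20xx that is not part of a longer run of digits.
    match = re.search(r'(?<!\d)((?:19|20)\d{2})(?!\d)', filename)
    return match.group(1) if match else None

# Hypothetical file names, used only to show the behaviour.
print(year_from_filename('ghgp_data_2015.xlsx'))
print(year_from_filename('.DS_Store'))
```

Files for which the helper returns `None` could be skipped in the loading loop rather than passed to `pd.read_excel`.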




In [40]:
fips_map = pd.read_excel('data/fips-codes.xls', sheet_name=0)

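# Keep only city entities; .copy() makes later column assignments act on an independent frame.
fips_map = fips_map[fips_map['Entity Description'] == 'city'].copy()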

In [45]:
fips_map.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10888 entries, 0 to 41339
Data columns (total 7 columns):
State Abbreviation    10888 non-null object
State FIPS Code       10888 non-null int64
County FIPS Code      10888 non-null int64
FIPS Entity Code      10888 non-null object
ANSI Code             10888 non-null int64
GU Name               10888 non-null object
Entity Description    10888 non-null object
dtypes: int64(3), object(4)
memory usage: 680.5+ KB


In [47]:
def str_func(x):
    return str(x).zfill(5)

fips_map['FIPS Entity Code'] = fips_map['FIPS Entity Code'].apply(str_func)
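The zero-padding step matters because a FIPS code that is read as a number loses its leading zeros. A minimal sketch of the same idea using pandas' vectorized string methods instead of `apply` is shown here; the sample codes are chosen only for illustration.

```python
import pandas as pd

# Codes as they might look after being parsed as integers (leading zeros lost).
codes = pd.Series([124, 2116, 51000])

# Convert to string, then left-pad with zeros to a fixed width of 5.
padded = codes.astype(str).str.zfill(5)
print(padded.tolist())  # ['00124', '02116', '51000']
```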

In [48]:
ghg_mapped = ghg.join(fips_map)

| | State Abbreviation | State FIPS Code | County FIPS Code | FIPS Entity Code | ANSI Code | GU Name | Entity Description |
|---|---|---|---|---|---|---|---|
| 0 | AL | 1 | 67 | 00124 | 2403054 | Abbeville | city |
| 1 | AL | 1 | 73 | 00460 | 2403063 | Adamsville | city |
| 2 | AL | 1 | 117 | 00820 | 2403069 | Alabaster | city |
| 3 | AL | 1 | 95 | 00988 | 2403074 | Albertville | city |
| 4 | AL | 1 | 123 | 01132 | 2403077 | Alexander City | city |
| 5 | AL | 1 | 107 | 01228 | 2403080 | Aliceville | city |
| 6 | AL | 1 | 39 | 01708 | 2403097 | Andalusia | city |
| 7 | AL | 1 | 15 | 01852 | 2403101 | Anniston | city |
| 8 | AL | 1 | 43 | 02116 | 2403104 | Arab | city |
| 9 | AL | 1 | 95 | 02116 | 2403104 | Arab | city |
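Note that `DataFrame.join` aligns rows on the index by default, so joining the emissions data to the FIPS table pairs rows by position in the index rather than by place name. One possible alternative, sketched under assumptions, is a key-based `merge` on state and a case-normalized city name. The toy frames, their values, and the helper column `city_key` are hypothetical; only the column names `State`, `City`, `State Abbreviation`, `GU Name`, and `FIPS Entity Code` come from the notebook.

```python
import pandas as pd

# Toy stand-ins for the real frames; the values are made up for illustration.
ghg_demo = pd.DataFrame({
    'State': ['AL', 'AL', 'AK'],
    'City': ['ABBEVILLE', 'Alabaster', 'Anchorage'],
    'Total reported direct emissions': [100.0, 250.0, 75.0],
})
fips_demo = pd.DataFrame({
    'State Abbreviation': ['AL', 'AL'],
    'GU Name': ['Abbeville', 'Alabaster'],
    'FIPS Entity Code': ['00124', '00820'],
})

# Give both frames the same key columns: state plus an upper-cased city name,
# so that 'ABBEVILLE' and 'Abbeville' are treated as the same place.
ghg_demo['city_key'] = ghg_demo['City'].str.strip().str.upper()
fips_demo = fips_demo.rename(columns={'State Abbreviation': 'State'})
fips_demo['city_key'] = fips_demo['GU Name'].str.strip().str.upper()

# A left merge keeps every emissions row; indicator=True adds a '_merge'
# column that shows which rows found a FIPS match.
merged = ghg_demo.merge(fips_demo, on=['State', 'city_key'],
                        how='left', indicator=True)
print(merged[['State', 'City', 'FIPS Entity Code', '_merge']])
```

Rows marked `left_only` in `_merge` are emissions records with no matching city in the FIPS table, which makes unmatched names easy to inspect.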


In [32]:
all_data.groupby(['Year','State', 'City'])['Total reported direct emissions'].agg('sum')

Year  State  City             
2010  AK     ANCHORAGE            1.989695e+06
             Akutan               3.693258e+04
             Anchorage            8.823022e+05
             BARROW               4.268854e+04
             BIG LAKE             5.252580e+03
             CLEAR AIR STATION    8.999312e+04
             DUTCH HARBOR         2.805247e+04
             EAGLE RIVER          3.231570e+05
             EIELSON AFB          3.306647e+05
             FAIRBANKS            4.540996e+05
             FORT WAINWRIGHT      3.734709e+05
             Fairbanks            1.374978e+05
             HEALY                2.786180e+05
             KENAI                4.477359e+05
             KOTZEBUE             1.468131e+05
             Kenai                3.151448e+05
             Milne Point Unit     1.822157e+05
             NIKISKI              1.813932e+05
             NORTH POLE           5.647467e+05
             North Slope          1.242964e+06
             Offshore        