# Netids: ja743, mce58

# Problem: In the one-year period from July 1, 2014 to June 30, 2015, Houston, Texas and Jacksonville, Florida recorded some of their highest temperatures. In addition, our online research turned up consistent data showing that police fatality rates in these two states are considerably higher than in the rest of the United States. With this information, we decided to look for a correlation between the annual deaths by police in these states and the average high temperatures of major cities in them.


## Hypothesis 1: We hypothesize that the city with the higher average daily high temperature for a given month will also have more police fatalities in 2014.

## Methodology: 
### Clean:
### 1) Drop all columns except 'date', 'actual_mean_temp', 'actual_min_temp', 'actual_max_temp'
### 2) Drop skewed data and NaN values that could distort the results

### Manipulate:
### 1) Calculate the average highest daily temperatures for each month within the given one year time period 
### 2) Record said data 
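
The Clean and Manipulate steps above can be sketched on a tiny hand-made frame. The rows below are made up and only the column names mirror KHOU.csv / KJAX.csv; `monthly_avg_high` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Made-up sample rows; only the column names mirror the weather CSVs.
sample = pd.DataFrame({
    "date": ["2014-7-1", "2014-7-2", "2014-8-1", "2014-8-2", "2014-8-3"],
    "actual_mean_temp": [84, 84, 85, 86, None],
    "actual_min_temp": [74, 74, 75, 76, 77],
    "actual_max_temp": [94, 96, 95, 97, 99],
    "record_max_temp": [102, 104, 100, 101, 102],
})

def monthly_avg_high(weather):
    """Hypothetical helper: keep four columns, drop NaN rows, average highs per month."""
    kept = weather[["date", "actual_mean_temp", "actual_min_temp", "actual_max_temp"]].dropna()
    # "2014-7-1" -> "2014-7": strip the day so rows can be grouped by month
    month_key = kept["date"].str.rsplit("-", n=1).str[0]
    # sort=False keeps the months in order of appearance rather than string order
    return kept.groupby(month_key, sort=False)["actual_max_temp"].mean()

avg_high = monthly_avg_high(sample)
print(avg_high)
```

Grouping on a string key avoids any assumption about how the unpadded dates parse; the row with the missing mean temperature is dropped before averaging.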


## Predictions: We plan to train both a linear regression and a KNN model on the recorded fatalities by police and each of the 12 months to predict the average highest temperatures. We will then determine the accuracy scores of these models.

## Lastly, we plan to train both a linear regression and a KNN model on the monthly average highest temperatures and the recorded fatalities by police to predict the month in which these numbers occur.
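
A minimal sketch of the first prediction task, under loud assumptions: every number below is a placeholder rather than a value from the datasets, and `KNeighborsRegressor` is our assumption for "a KNN model" because the target (temperature) is continuous. For regressors, scikit-learn's `score` reports R², which stands in for the "accuracy score" mentioned above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Placeholder values only, NOT taken from the datasets.
# One row per month of the study window (0 = July 2014, ..., 11 = June 2015).
month_index = np.arange(12)
fatalities = np.array([9, 8, 7, 6, 5, 4, 4, 5, 6, 7, 8, 9])              # made up
avg_high = np.array([94, 95, 90, 82, 70, 66, 62, 65, 72, 80, 86, 91])   # made up

# Inputs: month and fatalities; output: average high temperature
X = np.column_stack([month_index, fatalities])
X_train, X_test, y_train, y_test = train_test_split(
    X, avg_high, test_size=0.25, random_state=42
)

models = {
    "linear regression": LinearRegression(),
    "KNN (k=3)": KNeighborsRegressor(n_neighbors=3),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out months
    print(name, scores[name])
```

With only twelve monthly rows, three of them held out, any score from this setup would be very noisy. The second task (predicting the month) would swap the roles of the columns and treat the month as the output.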

## Resources and Links:
https://github.com/fivethirtyeight/data/blob/master/us-weather-history/KHOU.csv
https://github.com/fivethirtyeight/data/blob/master/us-weather-history/KJAX.csv
https://data.world/awram/us-police-involved-fatalities/workspace/file?filename=Police+Fatalities.csv
https://mappingpoliceviolence.org/states

In [69]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import datasets

weatherH = pd.read_csv('KHOU.csv')
weatherJ = pd.read_csv('KJAX.csv')
pf = pd.read_csv('Police Fatalities.csv', encoding = "ISO-8859-1")

In [8]:
weatherH.head()

Unnamed: 0,date,actual_mean_temp,actual_min_temp,actual_max_temp,average_min_temp,average_max_temp,record_min_temp,record_max_temp,record_min_temp_year,record_max_temp_year,actual_precipitation,average_precipitation,record_precipitation
0,2014-7-1,84,74,94,75,93,66,102,1985,1980,0.0,0.16,1.4
1,2014-7-2,84,74,94,75,93,66,104,1924,1894,0.0,0.16,5.43
2,2014-7-3,84,73,94,75,93,66,100,1977,1980,0.0,0.16,1.29
3,2014-7-4,82,72,91,75,93,66,101,1910,2009,0.48,0.15,3.49
4,2014-7-5,82,73,90,75,93,66,102,1924,2009,0.31,0.16,3.04


In [9]:
weatherJ.head()

Unnamed: 0,date,actual_mean_temp,actual_min_temp,actual_max_temp,average_min_temp,average_max_temp,record_min_temp,record_max_temp,record_min_temp_year,record_max_temp_year,actual_precipitation,average_precipitation,record_precipitation
0,2014-7-1,82,72,91,72,91,65,100,1981,1990,0.0,0.21,7.26
1,2014-7-2,81,73,89,72,91,65,100,2000,1902,0.0,0.21,2.3
2,2014-7-3,84,71,96,72,92,63,99,2000,1970,0.41,0.22,4.87
3,2014-7-4,81,70,91,72,92,65,99,2000,1954,0.0,0.21,1.4
4,2014-7-5,82,73,90,72,92,66,100,2000,1877,0.01,0.21,1.85


In [36]:
pf.head()

Unnamed: 0,UID,Name,Age,Gender,Race,Date,City,State,Manner_of_death,Armed,Mental_illness,Flee
0,133,Karen O. Chin,44.0,Female,Asian,5/4/2000,Alameda,CA,Shot,,False,False
1,169,Chyraphone Komvongsa,26.0,Male,Asian,6/2/2000,Fresno,CA,Shot,,False,False
2,257,Ming Chinh Ly,36.0,Male,Asian,8/13/2000,Rosemead,CA,Shot,Gun,False,False
3,483,Kinh Quoc Dao,29.0,Male,Asian,2/9/2001,Valley Glen,CA,Shot,Gun,False,False
4,655,Vanpaseuth Phaisouphanh,25.0,Male,Asian,6/10/2001,Riverside,CA,Shot,Knife,False,False


In [15]:
# Keep only the four columns of interest and store the cleaned frame
weatherH = weatherH[['date', 'actual_mean_temp', 'actual_min_temp', 'actual_max_temp']].dropna()
weatherH.head()


Unnamed: 0,date,actual_mean_temp,actual_min_temp,actual_max_temp
0,2014-7-1,84,74,94
1,2014-7-2,84,74,94
2,2014-7-3,84,73,94
3,2014-7-4,82,72,91
4,2014-7-5,82,73,90


In [17]:
# Keep only the four columns of interest and store the cleaned frame
weatherJ = weatherJ[['date', 'actual_mean_temp', 'actual_min_temp', 'actual_max_temp']].dropna()
weatherJ.head()


Unnamed: 0,date,actual_mean_temp,actual_min_temp,actual_max_temp
0,2014-7-1,82,72,91
1,2014-7-2,81,73,89
2,2014-7-3,84,71,96
3,2014-7-4,81,70,91
4,2014-7-5,82,73,90


In [None]:
# Average daily high temperature for each month in the study window (Houston).
# January through June fall in 2015, so those keys use 2015 rather than 2014.
months = ["2014-7", "2014-8", "2014-9", "2014-10", "2014-11", "2014-12",
          "2015-1", "2015-2", "2015-3", "2015-4", "2015-5", "2015-6"]
avg_high = []
for month in months:
    # Dates look like "2014-7-1"; the trailing "-" keeps a one-digit month
    # from also matching a two-digit one
    in_month = weatherH['date'].str.startswith(month + "-")
    avg_high.append(weatherH.loc[in_month, 'actual_max_temp'].mean())
avg_high

In [37]:
# Count Texas fatalities dated 2014 or 2015.
# Dates look like "5/4/2000", so the year is the last four characters.
in_years = pf['Date'].astype(str).str[-4:].isin(["2014", "2015"])
fatalities_TX = int(((pf['State'] == "TX") & in_years).sum())
print(fatalities_TX)

In [89]:
pf_data = pf.copy().dropna(subset=['Race', 'Mental_illness', 'State'])
pf_data.head(10)

Unnamed: 0,UID,Name,Age,Gender,Race,Date,City,State,Manner_of_death,Armed,Mental_illness,Flee
0,133,Karen O. Chin,44.0,Female,Asian,5/4/2000,Alameda,CA,Shot,,False,False
1,169,Chyraphone Komvongsa,26.0,Male,Asian,6/2/2000,Fresno,CA,Shot,,False,False
2,257,Ming Chinh Ly,36.0,Male,Asian,8/13/2000,Rosemead,CA,Shot,Gun,False,False
3,483,Kinh Quoc Dao,29.0,Male,Asian,2/9/2001,Valley Glen,CA,Shot,Gun,False,False
4,655,Vanpaseuth Phaisouphanh,25.0,Male,Asian,6/10/2001,Riverside,CA,Shot,Knife,False,False
5,668,Bernardo Ancheta Caberto,55.0,Male,Asian,6/23/2001,Henderson,NV,Shot,Knife,False,False
6,677,Cuong Tran,33.0,Male,Asian,6/30/2001,Syracuse,NY,Shot,,True,False
7,678,Sengsadaphet Phongsavanh,29.0,Male,Asian,7/1/2001,Beaverton,OR,Shot,Knife,True,False
8,686,Nam Quoc Nguyen,21.0,Male,Asian,7/6/2001,Garden Grove,CA,Shot,Gun,False,False
9,736,Rosa Hammer,27.0,Female,Asian,8/9/2001,Gorst,WA,Shot,Gun,False,False


In [None]:
"""
drop na for races at least
categorize races by numbers and make a key 0:Asian 1:Black 2: White
same 
"""

"""
states = []
for i in range(len(pf_data['State'])):
    if pf_data['State'][i] == "AL":
        states.append(1)
    if pf_data['State'][i] == "AK":
        states.append(2)
    if pf_data['State'][i] == "AZ":
        states.append(3)
    if pf_data['State'][i] == "AR":
        states.append(4)
    if pf_data['State'][i] == "CA":
        states.append(5)
    if pf_data['State'][i] == "CO":
        states.append(6)
    if pf_data['State'][i] == "CT":
        states.append(7)
    if pf_data['State'][i] == "DE":
        states.append(8)
    if pf_data['State'][i] == "FL":
        states.append(9)
    if pf_data['State'][i] == "GA":
        states.append(10)
        
    if pf_data['State'][i] == "HI":
        states.append(11)
    if pf_data['State'][i] == "ID":
        states.append(12)
    if pf_data['State'][i] == "IL":
        states.append(13)
    if pf_data['State'][i] == "IN":
        states.append(14)
    if pf_data['State'][i] == "IA":
        states.append(15)
    if pf_data['State'][i] == "KS":
        states.append(16)
    if pf_data['State'][i] == "KY":
        states.append(17)
    if pf_data['State'][i] == "LA":
        states.append(18)
    if pf_data['State'][i] == "ME":
        states.append(19)
    if pf_data['State'][i] == "MD":
        states.append(20)
        
    if pf_data['State'][i] == "MA":
        states.append(21)
    if pf_data['State'][i] == "MI":
        states.append(22)
    if pf_data['State'][i] == "MN":
        states.append(23)
    if pf_data['State'][i] == "MS":
        states.append(24)
    if pf_data['State'][i] == "MO":
        states.append(25)
    if pf_data['State'][i] == "MT":
        states.append(26)
    if pf_data['State'][i] == "NE":
        states.append(27)
    if pf_data['State'][i] == "NV":
        states.append(28)
    if pf_data['State'][i] == "NH":
        states.append(29)
    if pf_data['State'][i] == "NJ":
        states.append(30)
        
    if pf_data['State'][i] == "NM":
        states.append(31)
    if pf_data['State'][i] == "NY":
        states.append(32)
    if pf_data['State'][i] == "NC":
        states.append(33)
    if pf_data['State'][i] == "ND":
        states.append(34)
    if pf_data['State'][i] == "OH":
        states.append(35)
    if pf_data['State'][i] == "OK":
        states.append(36)
    if pf_data['State'][i] == "OR":
        states.append(37)
    if pf_data['State'][i] == "PA":
        states.append(38)
    if pf_data['State'][i] == "RI":
        states.append(39)
    if pf_data['State'][i] == "SC":
        states.append(40)
        
    if pf_data['State'][i] == "SD":
        states.append(41)
    if pf_data['State'][i] == "TN":
        states.append(42)
    if pf_data['State'][i] == "TX":
        states.append(43)
    if pf_data['State'][i] == "UT":
        states.append(44)
    if pf_data['State'][i] == "VT":
        states.append(45)
    if pf_data['State'][i] == "VA":
        states.append(46)
    if pf_data['State'][i] == "WA":
        states.append(47)
    if pf_data['State'][i] == "WV":
        states.append(48)
    if pf_data['State'][i] == "WI":
        states.append(49)
    if pf_data['State'][i] == "WY":
        states.append(50)
#print(states)  
"""

In [111]:
# Encode race as an integer code
# (key: 1 Asian, 2 Black, 3 Hispanic, 4 Native, 5 White, 6 Other)
race_codes = {"Asian": 1, "Black": 2, "Hispanic": 3, "Native": 4, "White": 5, "Other": 6}
races = pf_data['Race'].map(race_codes)

# Mental illness flag: 0 when False, 1 otherwise
ill = (pf_data['Mental_illness'] != False).astype(int)

# A missing (NaN) 'Armed' entry counts as unarmed; any recorded weapon counts as armed
armed = pf_data['Armed'].notna()

pf_data['races'] = races
pf_data['ill'] = ill
pf_data['armed'] = armed
print(len(races))
print(len(ill))
print(len(armed))
pf_data.head()

8526
8526
8526


Unnamed: 0,UID,Name,Age,Gender,Race,Date,City,State,Manner_of_death,Armed,Mental_illness,Flee,races,ill,armed
0,133,Karen O. Chin,44.0,Female,Asian,5/4/2000,Alameda,CA,Shot,,False,False,1,0,False
1,169,Chyraphone Komvongsa,26.0,Male,Asian,6/2/2000,Fresno,CA,Shot,,False,False,1,0,False
2,257,Ming Chinh Ly,36.0,Male,Asian,8/13/2000,Rosemead,CA,Shot,Gun,False,False,1,0,True
3,483,Kinh Quoc Dao,29.0,Male,Asian,2/9/2001,Valley Glen,CA,Shot,Gun,False,False,1,0,True
4,655,Vanpaseuth Phaisouphanh,25.0,Male,Asian,6/10/2001,Riverside,CA,Shot,Knife,False,False,1,0,True


In [133]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as seabornInstance 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
%matplotlib inline

inputs = pf_data[['races', 'Mental_illness']]
outputs = pf_data['armed']
model = KNeighborsClassifier()

known_input, future_input, known_output, future_output = train_test_split(inputs, outputs, test_size=0.25, random_state=42)

model.fit(known_input, known_output) # tell it known stuff

predictions = model.predict(future_input) # given future input, predict future output!

# compare the actual armed status with the predicted armed status
pd.DataFrame({"Is Armed":future_output, "PREDICTED armed":predictions}).reset_index(drop=True).head(10)

Unnamed: 0,Is Armed,PREDICTED armed
0,False,True
1,True,True
2,False,True
3,True,True
4,False,True
5,True,True
6,True,True
7,True,True
8,True,True
9,True,True


In [134]:
print("sklearn's accuracy score for is Armed:", metrics.accuracy_score(future_output, predictions))

sklearn's accuracy score for is Armed: 0.5637898686679175
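
The first ten predictions above are all True, so one way to put the score in context is to compare it with a baseline that always predicts the most common class. The sketch below uses made-up labels, not the real 'armed' column, and `DummyClassifier` is our choice of baseline rather than part of the original analysis.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels standing in for the held-out 'armed' column.
y_true = np.array([True, True, False, True, False, True, True, False, True, True])
X_dummy = np.zeros((len(y_true), 1))  # the baseline ignores the features

# Always predict whichever label is most frequent in the data it was fit on
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)
baseline_accuracy = accuracy_score(y_true, baseline.predict(X_dummy))
print(baseline_accuracy)
```

If a trained model scores at or below this baseline on the same held-out rows, its inputs are probably adding little information.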


In [145]:
inputs1 = pf_data[['races', 'armed']]
outputs1 = pf_data['Mental_illness']
model1 = KNeighborsClassifier()

known_input1, future_input1, known_output1, future_output1 = train_test_split(inputs1, outputs1, test_size=0.3, random_state=42)

model1.fit(known_input1, known_output1) # tell it known stuff

predictions1 = model1.predict(future_input1) # given future input, predict future output!

# compare the actual mental illness flag with the predicted flag
pd.DataFrame({"Was Person Actually ill":future_output1, "PREDICTED illness":predictions1}).reset_index(drop=True).head(10)

Unnamed: 0,Was Person Actually ill,PREDICTED illness
0,True,False
1,False,False
2,False,False
3,False,False
4,False,False
5,False,False
6,False,False
7,False,False
8,False,False
9,True,False


In [146]:
print("sklearn's accuracy score for mental illness:", metrics.accuracy_score(future_output1, predictions1))

sklearn's accuracy score for mental illness: 0.7912431587177482
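
Accuracy alone can hide a model that predicts a single class. As a small illustration, the sketch below builds a confusion matrix from only the ten rows displayed in the table above (not the full test set), so its counts describe just that excerpt.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# The ten (actual, predicted) pairs shown in the table above
actual = np.array([True, False, False, False, False, False, False, False, False, True])
predicted = np.array([False] * 10)

# Rows are actual classes, columns are predicted classes, ordered [False, True]
cm = confusion_matrix(actual, predicted, labels=[False, True])
excerpt_accuracy = accuracy_score(actual, predicted)
print(cm)
print(excerpt_accuracy)
```

On these ten rows the model never predicts True, so both actually ill cases are missed even though 8 of 10 predictions are correct.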
