## Create a crime count by ward dataframe and check for statistically significant relationships between changes in crime count and change in Labour vote share

This notebook imports JSON crime data from January 2016 to May 2018. The data is in the GitHub repository and has been pre-labeled with date and ward code.

The notebook aggregates the crime counts, reads a CSV file (also in the repository) giving the change in Labour vote share in each ward between the 2014 and 2018 council elections, and combines the results in a dataframe.

Ultimately no statistically significant relationship was found between changes in crime count and change in Labour vote share.

In [1]:
import json
import pandas as pd
from copy import copy
import os
from pathlib import Path
import statsmodels.api as sm


In [2]:
# Collect all the paths of the files we want
# pathlib lets you use * as many times as you want in the glob
root_data_path = Path('../2018-05 boundaries/')
# we can use the glob function to restrict json files to specific date ranges
fnames = [str(path) for path in root_data_path.glob('london_month_*/*.json')]

In [3]:
# fnames = a list of all the json filenames
len(fnames)

14425
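
The comment in cell 2 notes that the glob pattern can be narrowed to specific date ranges. A minimal sketch of that idea follows, assuming (hypothetically) that each monthly directory is named `london_month_YYYY-MM`. The real directory suffix is not shown in this notebook, so the pattern would need adjusting to match it.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory tree so the sketch runs anywhere.
# The 'london_month_YYYY-MM' naming is an assumption for illustration only.
demo_root = Path(tempfile.mkdtemp())
for month in ['2016-12', '2017-01', '2017-02', '2018-01']:
    month_dir = demo_root / f'london_month_{month}'
    month_dir.mkdir()
    (month_dir / 'example.json').write_text('[]')

# Narrowing the first wildcard keeps only the 2017 directories
fnames_2017 = sorted(str(p) for p in demo_root.glob('london_month_2017*/*.json'))
print(len(fnames_2017))
```

Filtering at the glob stage avoids opening files that would be discarded later.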

In [4]:
# Each json file holds records for the crimes that occurred in one ward in one month
# The length of the json reflects how many crimes occurred in that ward in that month
# The json file name contains the information about the ward and the month
# We run through the jsons keeping a count of how many crimes occurred in each ward in each month
count_dictionary = {}

for fname in fnames:
    with open(fname) as json_file:
        try:
            data = json.load(json_file)
            ward = fname.split('_')[-2]
            month = '-'.join(fname.split('_')[-5:-3])
            crimes = len(data) # len(data) = number of crimes recorded in the file
            key = ward + '_' + month
            count_dictionary[key] = count_dictionary.get(key, 0) + crimes
        except ValueError:
            # skip files that cannot be parsed as json
            pass

# At this point we have created a dictionary where there is a key made of ward code and year-month
# The value associated with each key is the number of crimes that occurred in the corresponding ward in each month

final_csv = '' # Take all of the keys and values and write them to a csv file
for k, v in count_dictionary.items():
    final_csv += ','.join(k.split('_'))
    final_csv += ',' + str(v) + '\n'


with open("../crimes_by_ward_by_month.csv", "w") as text_file:
    text_file.write(final_csv)

df_monthly = pd.read_csv("../crimes_by_ward_by_month.csv", names =['ward_code', 'month', 'crime_count'])
df_monthly.head()

Unnamed: 0,ward_code,month,crime_count
0,E05000526,2016-02,82
1,E05000277,2016-02,102
2,E05000426,2016-02,209
3,E05000377,2016-02,148
4,E05000214,2016-02,136
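
The CSV round trip above could be avoided by building the dataframe straight from the dictionary. The sketch below is one possible alternative, not the notebook's method. `demo_counts` is a made-up stand-in for `count_dictionary`, with keys in the same `ward_month` form.

```python
import pandas as pd

# Hypothetical stand-in for count_dictionary (keys are 'wardcode_YYYY-MM', counts are illustrative)
demo_counts = {'E05000526_2016-02': 82, 'E05000277_2016-02': 102, 'E05000526_2016-03': 90}

# Split each key once into ward code and month, then attach the count
rows = [key.split('_') + [count] for key, count in demo_counts.items()]
df_demo = pd.DataFrame(rows, columns=['ward_code', 'month', 'crime_count'])
print(df_demo)
```

Writing the CSV is still useful if other notebooks read it. This only removes the need to re-read it here.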


In [5]:
# Filter the rows to create yearly dataframes that only look at q1 (January to March)
# Selecting [["crime_count"]] stops the text 'month' column being included in the sum
df_2016 = df_monthly[(df_monthly["month"]<"2016-04") & (df_monthly["month"]>"2015-12")].groupby("ward_code")[["crime_count"]].sum()
df_2017 = df_monthly[(df_monthly["month"]<"2017-04") & (df_monthly["month"]>"2016-12")].groupby("ward_code")[["crime_count"]].sum()
df_2018 = df_monthly[(df_monthly["month"]<"2018-04") & (df_monthly["month"]>"2017-12")].groupby("ward_code")[["crime_count"]].sum()

df_2016.rename(columns={'crime_count':'crimes2016q1'}, inplace=True)
df_2017.rename(columns={'crime_count':'crimes2017q1'}, inplace=True)
df_2018.rename(columns={'crime_count':'crimes2018q1'}, inplace=True)

# combine the yearly dataframes to create one large dataframe containing all years
join_year_df = df_2016.join(df_2017)
join_year_df = join_year_df.join(df_2018)

# Create a new column that records the percentage change in crime count between q1 2017 and q1 2018
join_year_df['crime_change_q1_2017-18percent'] = ((join_year_df['crimes2018q1'] - \
                                           join_year_df['crimes2017q1'])/join_year_df['crimes2017q1'])*100
join_year_df.head()

ward_code,crimes2016q1,crimes2017q1,crimes2018q1,crime_change_q1_2017-18percent
E05000026,719,757,804,6.208719
E05000027,250,256,269,5.078125
E05000028,343,378,269,-28.835979
E05000029,232,253,259,2.371542
E05000030,251,232,276,18.965517
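
As a quick check of the percentage-change formula, the sketch below recomputes the change for the first three wards in the table above from their q1 2017 and q1 2018 counts. The helper name `percent_change` is introduced here for illustration only.

```python
def percent_change(old, new):
    """Percentage change from old to new, relative to old."""
    return (new - old) / old * 100

# (q1 2017, q1 2018) counts for the first three wards in the output above
q1_counts = {'E05000026': (757, 804), 'E05000027': (256, 269), 'E05000028': (378, 269)}
changes = {ward: round(percent_change(old, new), 6) for ward, (old, new) in q1_counts.items()}
print(changes)
```

The results should agree with the `crime_change_q1_2017-18percent` column for those wards.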


In [6]:
df_ward_vote_change = pd.read_csv('../LabourChangeByWard.csv')

In [7]:
# Creating columns that look at whole calendar years
# Filter the rows to create yearly dataframes
# The 2018 total only covers January to May, so it cannot be compared with the full-year totals for 2016 and 2017
df_2016 = df_monthly[(df_monthly["month"]<"2017-01") & (df_monthly["month"]>"2015-12")].groupby("ward_code")[["crime_count"]].sum()
df_2017 = df_monthly[(df_monthly["month"]<"2018-01") & (df_monthly["month"]>"2016-12")].groupby("ward_code")[["crime_count"]].sum()
df_2018 = df_monthly[(df_monthly["month"]<"2019-01") & (df_monthly["month"]>"2017-12")].groupby("ward_code")[["crime_count"]].sum()

df_2016.rename(columns={'crime_count':'totalcrimes2016'}, inplace=True)
df_2017.rename(columns={'crime_count':'totalcrimes2017'}, inplace=True)
df_2018.rename(columns={'crime_count':'totalcrimes2018'}, inplace=True)

join_year_df = join_year_df.join(df_2016)
join_year_df = join_year_df.join(df_2017)
join_year_df = join_year_df.join(df_2018)
join_year_df['crime_change_2016-17percent'] = ((join_year_df['totalcrimes2017'] - \
                                         join_year_df['totalcrimes2016'])/join_year_df['totalcrimes2016'])*100
join_year_df.head()

ward_code,crimes2016q1,crimes2017q1,crimes2018q1,crime_change_q1_2017-18percent,totalcrimes2016,totalcrimes2017,totalcrimes2018,crime_change_2016-17percent
E05000026,719,757,804,6.208719,2952,3313,1636,12.228997
E05000027,250,256,269,5.078125,968,1102,555,13.842975
E05000028,343,378,269,-28.835979,1318,1566,625,18.816388
E05000029,232,253,259,2.371542,919,1041,511,13.275299
E05000030,251,232,276,18.965517,823,955,492,16.038882
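
Because `totalcrimes2018` only covers January to May, one option is to compare the same months in each year. The sketch below shows a hypothetical helper, `period_total`, that generalises the repeated string-comparison filters used above. The helper name, the output column names and the toy data are all assumptions for illustration.

```python
import pandas as pd

def period_total(df, first_month, last_month, new_name):
    """Sum crime_count per ward for months from first_month to last_month inclusive."""
    in_period = (df['month'] >= first_month) & (df['month'] <= last_month)
    totals = df.loc[in_period].groupby('ward_code')[['crime_count']].sum()
    return totals.rename(columns={'crime_count': new_name})

# Toy monthly counts for one made-up ward
toy = pd.DataFrame({
    'ward_code': ['W1'] * 4,
    'month': ['2017-01', '2017-05', '2017-06', '2018-05'],
    'crime_count': [10, 20, 30, 40],
})

# Like-for-like comparison: January to May in both years
jan_may = period_total(toy, '2017-01', '2017-05', 'jan_may_2017').join(
    period_total(toy, '2018-01', '2018-05', 'jan_may_2018'))
print(jan_may)
```

String comparison works here because the months are zero-padded `YYYY-MM` values, which sort in date order.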


In [8]:
df_ward_vote_change.rename(columns={'WardID': 'ward_code', 'Percent change': 'Labour vote change 2014-18percent'}, inplace=True)

In [9]:
df_ward_vote_change.head()

Unnamed: 0,ward_code,Ward,Local Authority,Labour vote change 2014-18percent
0,E05009401,Queen's Gate,Kensington and Chelsea,1.1
1,E05000595,Forest,Waltham Forest,15.4
2,E05000044,Burnt Oak,Barnet,3.0
3,E05011113,Rye Lane,Southwark,66.9
4,E05000093,Kenton,Brent,4.4


In [10]:
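# Join pattern: df.set_index('key').join(other.set_index('key'))
# join_year_df is already indexed by ward_code, so only the vote dataframe needs set_index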
final_df = df_ward_vote_change.set_index('ward_code').join(join_year_df)
final_df.head()

ward_code,Ward,Local Authority,Labour vote change 2014-18percent,crimes2016q1,crimes2017q1,crimes2018q1,crime_change_q1_2017-18percent,totalcrimes2016,totalcrimes2017,totalcrimes2018,crime_change_2016-17percent
E05009401,Queen's Gate,Kensington and Chelsea,1.1,,,,,,,,
E05000595,Forest,Waltham Forest,15.4,221.0,459.0,407.0,-11.328976,1451.0,1726.0,841.0,18.952447
E05000044,Burnt Oak,Barnet,3.0,491.0,473.0,510.0,7.82241,2248.0,2116.0,1117.0,-5.871886
E05011113,Rye Lane,Southwark,66.9,,,,,,,,
E05000093,Kenton,Brent,4.4,176.0,185.0,189.0,2.162162,724.0,740.0,378.0,2.209945


In [11]:
sum(final_df['totalcrimes2017'].isnull()) # Check how many wards are missing crime data

144
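
To see which wards are missing crime data, rather than only how many, a left merge with `indicator=True` is one option. The sketch below uses the five wards from the `head()` output above. The short column name `vote_change` and the small `votes` and `crimes` frames are stand-ins for the real dataframes.

```python
import pandas as pd

# Stand-in for df_ward_vote_change (first five rows shown above)
votes = pd.DataFrame({
    'ward_code': ['E05009401', 'E05000595', 'E05000044', 'E05011113', 'E05000093'],
    'vote_change': [1.1, 15.4, 3.0, 66.9, 4.4],
})
# Stand-in for join_year_df: only three of those wards have crime totals
crimes = pd.DataFrame({
    'ward_code': ['E05000595', 'E05000044', 'E05000093'],
    'totalcrimes2017': [1726, 2116, 740],
})

# '_merge' is 'left_only' for wards present in the vote data but absent from the crime data
merged = votes.merge(crimes, on='ward_code', how='left', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only', 'ward_code'].tolist()
print(missing)
```

Listing the unmatched ward codes makes it easier to investigate why they have no crime counts.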

In [12]:
final_df.shape

(627, 11)

In [13]:
final_df.to_csv('../CrimeByWard-LabourVoteChange.csv')

In [14]:
final_df.select_dtypes('number').corr() # restrict to numeric columns (Ward and Local Authority are text)

Unnamed: 0,Labour vote change 2014-18percent,crimes2016q1,crimes2017q1,crimes2018q1,crime_change_q1_2017-18percent,totalcrimes2016,totalcrimes2017,totalcrimes2018,crime_change_2016-17percent
Labour vote change 2014-18percent,1.0,0.095586,0.062305,0.050668,-0.056335,0.061556,0.057695,0.054303,-0.030658
crimes2016q1,0.095586,1.0,0.965599,0.962557,0.041308,0.97711,0.96431,0.964241,-0.031383
crimes2017q1,0.062305,0.965599,1.0,0.988213,-0.009671,0.989741,0.995519,0.989664,0.086467
crimes2018q1,0.050668,0.962557,0.988213,1.0,0.118229,0.985576,0.992562,0.997876,0.090375
crime_change_q1_2017-18percent,-0.056335,0.041308,-0.009671,0.118229,1.0,0.03943,0.042712,0.099448,0.001979
totalcrimes2016,0.061556,0.97711,0.989741,0.985576,0.03943,1.0,0.990435,0.988487,-0.002767
totalcrimes2017,0.057695,0.96431,0.995519,0.992562,0.042712,0.990435,1.0,0.994239,0.11076
totalcrimes2018,0.054303,0.964241,0.989664,0.997876,0.099448,0.988487,0.994239,1.0,0.08192
crime_change_2016-17percent,-0.030658,-0.031383,0.086467,0.090375,0.001979,-0.002767,0.11076,0.08192,1.0


### The correlation matrix shows only very weak correlations
Wards with higher crime counts were marginally more likely to swing towards Labour (correlations between 0.05 and 0.10).
Wards where the crime count increased (looking at either the change between 2016 and 2017, or comparing q1 2017 with q1 2018) swung marginally away from Labour (correlations of -0.03 and -0.06). We still need to assess whether these correlations are statistically significant (see below).
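
One way to judge whether a correlation this small could be distinguished from zero is to convert it to a t statistic, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below applies that standard formula to the q1 2017-18 correlation from the matrix. The sample size of 482 is an assumption (the number of wards with complete data), and `correlation_t_statistic` is a name introduced here for illustration.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation r from n pairs differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# r is taken from the correlation matrix; n = 482 is an assumption (wards with complete data)
t_stat = correlation_t_statistic(-0.056335, 482)
print(round(t_stat, 3))
```

For a sample of this size, an absolute t value below roughly 1.96 is not significant at the 5% level. For a single predictor, this t statistic is the same quantity that an OLS regression reports for the slope.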

In [15]:
final_df.dropna(inplace=True) # drop wards we have no crime data for
print(final_df.shape)
target = final_df['Labour vote change 2014-18percent']
variables = final_df[['crime_change_q1_2017-18percent']].copy()

variables = sm.add_constant(variables) # Adds a column called 'const' that is all 1s (in order to provide a coefficient for the constant)
crime_model = sm.OLS(target, variables)
results = crime_model.fit()
results.summary()

(482, 11)


Dep. Variable:,Labour vote change 2014-18percent,R-squared:,0.003
Model:,OLS,Adj. R-squared:,0.001
Method:,Least Squares,F-statistic:,1.528
Date:,"Sun, 24 Feb 2019",Prob (F-statistic):,0.217
Time:,16:13:52,Log-Likelihood:,-1684.7
No. Observations:,482,AIC:,3373.0
Df Residuals:,480,BIC:,3382.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,6.9990,0.365,19.197,0.000,6.283,7.715
crime_change_q1_2017-18percent,-0.0317,0.026,-1.236,0.217,-0.082,0.019

Omnibus:,26.628,Durbin-Watson:,1.764
Prob(Omnibus):,0.0,Jarque-Bera (JB):,58.039
Skew:,0.29,Prob(JB):,2.49e-13
Kurtosis:,4.598,Cond. No.,14.2


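### Interpretation of the linear regression model
Our best estimate is that a 1-point increase in the percentage change in crime between q1 2017 and q1 2018 is associated with a decrease of about 0.03 points in the Labour vote change. However, the p-value is 0.217: if there were no underlying relationship, we would expect to see a slope at least this large about 21.7% of the time. The 95% confidence interval for the slope (-0.082 to 0.019) includes zero, and the model explains only 0.3% of the variance (R-squared of 0.003). The relationship does not pass the conventional 5% test for statistical significance, and the regression cannot show that crime causes any change in vote share.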

## VALIDATING THE DATA CREATED BY JSON WITH DATA CREATED USING LUKA'S PANDAS SCRIPT


In [16]:
# This code compares final_df with counts taken from the pandas dataframe (containing all individual crimes)
# Findings: 3 out of 482 wards had different crime counts for 2017 and 2016
# Not sure what caused this discrepancy but it's sufficiently small to ignore.

# This is a large pandas dataframe that contains a row for every individual recorded crime
# It was created using a different method to the counting method above
# We are using it here to validate the final_df which was created in the code above
# lukadf = pd.read_csv('../2018-05.csv')

# from tqdm import tqdm
# wards = final_df.index.values
# errors = 0
# for i in tqdm(range(len(final_df))):
#     ward = wards[i]
#     luka_crime_count = sum((lukadf.loc[:,'ward'] == ward) & (lukadf['month']>='2017-01') & (lukadf['month']<'2018-01'))
#     if final_df.loc[ward, 'totalcrimes2017'] != luka_crime_count:
#         errors += 1
#         print(final_df.loc[ward, 'totalcrimes2017'] - luka_crime_count)
        
    
# print('errors', errors)
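
The commented-out loop above rescans the whole per-crime dataframe once per ward. A vectorised version of the same check could look like the sketch below. It runs on toy data: `crimes_demo` and `totals_demo` are made-up stand-ins for `lukadf` and `final_df`, and the `ward` and `month` column names follow the commented-out code.

```python
import pandas as pd

# Toy per-crime table: one row per recorded crime
crimes_demo = pd.DataFrame({
    'ward': ['W1', 'W1', 'W2', 'W2', 'W2', 'W1'],
    'month': ['2017-01', '2017-06', '2017-03', '2017-12', '2018-01', '2016-12'],
})
# Toy aggregated table standing in for final_df; W2 is deliberately off by one
totals_demo = pd.DataFrame({'totalcrimes2017': [2, 3]}, index=['W1', 'W2'])

# Count 2017 crimes per ward in a single pass
in_2017 = (crimes_demo['month'] >= '2017-01') & (crimes_demo['month'] < '2018-01')
counts_2017 = crimes_demo.loc[in_2017].groupby('ward').size()

# Align on ward code; wards absent from the per-crime table count as zero
difference = totals_demo['totalcrimes2017'] - counts_2017.reindex(totals_demo.index, fill_value=0)
mismatches = difference[difference != 0]
print(mismatches)
```

Only the wards whose two counts disagree remain in `mismatches`, along with the size of each discrepancy.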