#Crime Data Collection Methodology

Crime data is available from several official sources. UCR (Uniform Crime Reporting), NIBRS (National Incident-Based Reporting System), and SRS (Summary Reporting System) data are published on the FBI website.

NIBRS is the newer standard and is scheduled to become the sole reporting standard by January 2021. (https://www.fbi.gov/services/cjis/ucr/nibrs)

The UCR is the older standard but is still quite useful (https://www.fbi.gov/services/cjis/ucr/), as the estimated-crimes file used below covers 1979 to 2019. Some data is estimated, and some data is missing.

In [6]:
import pandas as pd
import numpy as np

In [7]:
#data in this csv contains estimates in instances of no reporting
df = pd.read_csv("http://s3-us-gov-west-1.amazonaws.com/cg-d4b776d0-d898-4153-90c8-8336f86bdfec/estimated_crimes_1979_2019.csv")
print(df.shape)
df.head()

(2116, 15)


Unnamed: 0,year,state_abbr,state_name,population,violent_crime,homicide,rape_legacy,rape_revised,robbery,aggravated_assault,property_crime,burglary,larceny,motor_vehicle_theft,caveats
0,1979,,,220099000,1208030,21460,76390.0,,480700,629480,11041500,3327700,6601000,1112800,
1,1979,AK,Alaska,406000,1994,54,292.0,,445,1203,23193,5616,15076,2501,
2,1979,AL,Alabama,3769000,15578,496,1037.0,,4127,9918,144372,48517,83791,12064,
3,1979,AR,Arkansas,2180000,7984,198,595.0,,1626,5565,70949,21457,45267,4225,
4,1979,AZ,Arizona,2450000,14528,219,1120.0,,4305,8884,177977,48916,116976,12085,


In [8]:
#the null values in 'state_abbr' represent National sums
df.isnull().sum()

year                      0
state_abbr               41
state_name               41
population                0
violent_crime             0
homicide                  0
rape_legacy             156
rape_revised           1752
robbery                   0
aggravated_assault        0
property_crime            0
burglary                  0
larceny                   0
motor_vehicle_theft       0
caveats                2045
dtype: int64

In [9]:
df['state_abbr'] = df['state_abbr'].replace(np.nan,'US')

In [10]:
#adds violent crime rate (vcr) and property crime rate (pcr) to dataframe
df['vcr'] = df['violent_crime'] / df['population']
df['pcr'] = df['property_crime'] / df['population']
df.head()

Unnamed: 0,year,state_abbr,state_name,population,violent_crime,homicide,rape_legacy,rape_revised,robbery,aggravated_assault,property_crime,burglary,larceny,motor_vehicle_theft,caveats,vcr,pcr
0,1979,US,,220099000,1208030,21460,76390.0,,480700,629480,11041500,3327700,6601000,1112800,,0.005489,0.050166
1,1979,AK,Alaska,406000,1994,54,292.0,,445,1203,23193,5616,15076,2501,,0.004911,0.057126
2,1979,AL,Alabama,3769000,15578,496,1037.0,,4127,9918,144372,48517,83791,12064,,0.004133,0.038305
3,1979,AR,Arkansas,2180000,7984,198,595.0,,1626,5565,70949,21457,45267,4225,,0.003662,0.032545
4,1979,AZ,Arizona,2450000,14528,219,1120.0,,4305,8884,177977,48916,116976,12085,,0.00593,0.072644
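
The `vcr` and `pcr` columns above are per-capita fractions. FBI publications usually quote rates per 100,000 inhabitants, so converting can make the numbers easier to compare with published tables. A minimal sketch using the Alaska and Alabama 1979 rows shown above; the `demo` frame and the `_per_100k` column names are illustrative and not part of the notebook:

```python
import pandas as pd

# Two rows copied from the df.head() output above
demo = pd.DataFrame({
    "state_abbr": ["AK", "AL"],
    "population": [406000, 3769000],
    "violent_crime": [1994, 15578],
    "property_crime": [23193, 144372],
})

# Per-capita fraction, as computed in the notebook
demo["vcr"] = demo["violent_crime"] / demo["population"]
demo["pcr"] = demo["property_crime"] / demo["population"]

# Same quantities expressed as offenses per 100,000 inhabitants
# (hypothetical column names)
demo["vcr_per_100k"] = demo["vcr"] * 100_000
demo["pcr_per_100k"] = demo["pcr"] * 100_000

print(demo[["state_abbr", "vcr_per_100k", "pcr_per_100k"]].round(1))
```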


In [11]:
#build a new dataframe for exporting
sand = df[['state_abbr', 'year', 'vcr', 'pcr']].rename(columns={'state_abbr': 'state'})
sand

Unnamed: 0,state,year,vcr,pcr
0,US,1979,0.005489,0.050166
1,AK,1979,0.004911,0.057126
2,AL,1979,0.004133,0.038305
3,AR,1979,0.003662,0.032545
4,AZ,1979,0.005930,0.072644
...,...,...,...,...
2111,WA,2019,0.002939,0.026819
2112,WI,2019,0.002932,0.014714
2113,WV,2019,0.003166,0.015834
2114,WY,2019,0.002174,0.015711


In [12]:
#export to csv
sand.to_csv('state_crime.csv', index=False)

Further research shows that the FBI UCR city-level data lists crime for 9251 cities with the same features as df above, but that is 20000+ cities short compared to our db. (City data is also published one year per file, in xls format.)

SUGGESTED SOLUTION:
Add 10 years of city-level crime data (2009-2018). For cities not listed in the crime data, fall back to the state data and notify users that the data shown is by state, not by city.
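
The fallback described above could look roughly like the sketch below. It assumes a city table with a `city` column in 'City, ST' form plus `vcr_<year>`/`pcr_<year>` columns, and a state table shaped like `sand` (`state`, `year`, `vcr`, `pcr`). The function name `lookup_rates`, the returned `level` flag, and the state-row numbers are hypothetical placeholders, not real statistics:

```python
import pandas as pd

def lookup_rates(city_st, year, city_df, state_df):
    """Return (vcr, pcr, level) where level is 'city', 'state' or 'missing'."""
    vcol, pcol = f"vcr_{year}", f"pcr_{year}"
    hit = city_df[city_df["city"] == city_st]
    if not hit.empty and pd.notnull(hit.iloc[0][vcol]):
        row = hit.iloc[0]
        return float(row[vcol]), float(row[pcol]), "city"

    # No city-level figure: fall back to the state row for that year
    st = city_st.rsplit(", ", 1)[-1]
    srow = state_df[(state_df["state"] == st) & (state_df["year"] == int(year))]
    if srow.empty:
        return None, None, "missing"
    srow = srow.iloc[0]
    return float(srow["vcr"]), float(srow["pcr"]), "state"

# City row taken from the merged output in this notebook
cities = pd.DataFrame({"city": ["Abbeville, AL"],
                       "vcr_2018": [0.00705606], "pcr_2018": [0.0192082]})
# Placeholder state row (made-up numbers, for illustration only)
states = pd.DataFrame({"state": ["AL"], "year": [2018],
                       "vcr": [0.005], "pcr": [0.03]})

print(lookup_rates("Abbeville, AL", "2018", cities, states))
print(lookup_rates("Unlisted Town, AL", "2018", cities, states))
```

The `level` value gives the front end what it needs to tell users when a figure is by state rather than by city.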

###SOURCE DATA (by city):

2018 - https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/tables/table-8/table-8.xls

2017 - https://ucr.fbi.gov/crime-in-the-u.s/2017/crime-in-the-u.s.-2017/tables/table-8/table-8.xls

2016 - https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/tables/table-6/table-6.xls

2015 - https://ucr.fbi.gov/crime-in-the-u.s/2015/crime-in-the-u.s.-2015/tables/table-8/table_8_offenses_known_to_law_enforcement_by_state_by_city_2015.xls

2014 - https://ucr.fbi.gov/crime-in-the-u.s/2014/crime-in-the-u.s.-2014/tables/table-8/Table_8_Offenses_Known_to_Law_Enforcement_by_State_by_City_2014.xls

2013 - https://ucr.fbi.gov/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/table-8/table_8_offenses_known_to_law_enforcement_by_state_by_city_2013.xls

2012 - https://ucr.fbi.gov/crime-in-the-u.s/2012/crime-in-the-u.s.-2012/tables/8tabledatadecpdf/table_8_offenses_known_to_law_enforcement_by_state_by_city_2012.xls

2011 - https://ucr.fbi.gov/crime-in-the-u.s/2011/crime-in-the-u.s.-2011/tables/table_8_offenses_known_to_law_enforcement_by_state_by_city_2011.xls

2010 - https://ucr.fbi.gov/crime-in-the-u.s/2010/crime-in-the-u.s.-2010/tables/10tbl08.xls

2009 - https://www2.fbi.gov/ucr/cius2009/data/documents/09tbl08.xls

(2008 may be left out to keep an even 10 years of data, but it is listed here just in case)

2008 - https://www2.fbi.gov/ucr/cius2008/data/documents/08tbl08.xls

In [14]:
#read in xls files, skipping the headers and footers 
xl2018 = pd.read_excel('data/Table_8_Offenses_Known_to_Law_Enforcement_by_State_by_City_2018.xls', skiprows=3, skipfooter=10)
xl2017 = pd.read_excel('data/Table_8_Offenses_Known_to_Law_Enforcement_by_State_by_City_2017.xls', skiprows=3, skipfooter=10)
xl2016 = pd.read_excel('data/Table_6_Offenses_Known_to_Law_Enforcement_by_State_by_City_2016.xls', skiprows=3, skipfooter=11)
xl2015 = pd.read_excel('data/Table_8_Offenses_Known_to_Law_Enforcement_by_State_by_City_2015.xls', skiprows=3, skipfooter=10)
xl2014 = pd.read_excel('data/table-8.xls', skiprows=3, skipfooter=17)
xl2013 = pd.read_excel('data/Table_8_Offenses_Known_to_Law_Enforcement_by_State_by_City_2013.xls', skiprows=3, skipfooter=10)
xl2012 = pd.read_excel('data/Table_8_Offenses_Known_to_Law_Enforcement_by_State_by_City_2012.xls', skiprows=3, skipfooter=7)
xl2011 = pd.read_excel('data/table_8_offenses_known_to_law_enforcement_by_state_by_city_2011.xls', skiprows=3, skipfooter=7)
xl2010 = pd.read_excel('data/10tbl08.xls', skiprows=3, skipfooter=7)
xl2009 = pd.read_excel('data/09tbl08.xls', skiprows=3, skipfooter=7)
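
Since only the path and the number of footer rows differ between years, the ten calls above could also be driven from a small table. This is a sketch only: `TABLE_SPECS` and `load_city_tables` are hypothetical names, the paths and `skipfooter` values are copied from the calls above, and the `reader` parameter exists just so the loop can be exercised without the files.

```python
import pandas as pd

# (year, local path, footer rows to skip), copied from the read_excel calls above
TABLE_SPECS = [
    ("2018", "data/Table_8_Offenses_Known_to_Law_Enforcement_by_State_by_City_2018.xls", 10),
    ("2017", "data/Table_8_Offenses_Known_to_Law_Enforcement_by_State_by_City_2017.xls", 10),
    ("2016", "data/Table_6_Offenses_Known_to_Law_Enforcement_by_State_by_City_2016.xls", 11),
    ("2015", "data/Table_8_Offenses_Known_to_Law_Enforcement_by_State_by_City_2015.xls", 10),
    ("2014", "data/table-8.xls", 17),
    ("2013", "data/Table_8_Offenses_Known_to_Law_Enforcement_by_State_by_City_2013.xls", 10),
    ("2012", "data/Table_8_Offenses_Known_to_Law_Enforcement_by_State_by_City_2012.xls", 7),
    ("2011", "data/table_8_offenses_known_to_law_enforcement_by_state_by_city_2011.xls", 7),
    ("2010", "data/10tbl08.xls", 7),
    ("2009", "data/09tbl08.xls", 7),
]

def load_city_tables(specs, reader=pd.read_excel):
    """Read each yearly table into a dict keyed by year string."""
    return {
        year: reader(path, skiprows=3, skipfooter=footer)
        for year, path, footer in specs
    }

# Usage, assuming the files are present locally:
# tables = load_city_tables(TABLE_SPECS)
# xl2018 = tables["2018"]
```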

In [15]:
#build a function to automatically clean the results and add to a new DF for 
#import to database
def cleaner(x, year):
  """
  Takes a dataframe, changes state names to abbreviations, fills state NaNs,
  calculates violent crime and property crime rates and returns them as
  a new DataFrame (city, vcr_<year>, pcr_<year>) for the year passed in
  """
  #clean column names (some years have a line break in the header);
  #rename returns a copy, so the caller's dataframe is left untouched
  x = x.rename(columns={'Violent\ncrime':'Violent crime',
                        'Property\ncrime':'Property crime'})

  #clean numbers from state and city columns
  #(regex=True is required on newer pandas versions)
  state = x['State'].str.replace(r'\d+', '', regex=True)
  city = x['City'].str.replace(r'\d+', '', regex=True)

  #remove null values from city column
  if city.isnull().any():
    print('Replacing null with None...')
    city = city.fillna('None')

  #replace states with abbreviations
  state = state.replace({"ALABAMA":"AL", "ALASKA":"AK", "ARIZONA":"AZ",
                         "ARKANSAS":"AR","CALIFORNIA":"CA",
                         "COLORADO":"CO","CONNECTICUT":"CT",
                         "DELAWARE":"DE","DISTRICT OF COLUMBIA":"DC",
                         "FLORIDA":"FL", "GEORGIA":"GA", "HAWAII":"HI",
                         "IDAHO":"ID", "ILLINOIS":"IL", "INDIANA":"IN",
                         "IOWA":"IA","KANSAS":"KS", "KENTUCKY":"KY",
                         "LOUISIANA":"LA", "MAINE":"ME","MARYLAND":"MD",
                         "MASSACHUSETTS":"MA","MICHIGAN":"MI",
                         "MINNESOTA":"MN","MISSISSIPPI":"MS",
                         "MISSOURI":"MO","MONTANA":"MT","NEBRASKA":"NE",
                         "NEVADA":"NV","NEW HAMPSHIRE":"NH",
                         "NEW JERSEY":"NJ","NEW MEXICO":"NM",
                         "NEW YORK":"NY","NORTH CAROLINA":"NC",
                         "NORTH DAKOTA":"ND","OHIO":"OH",
                         "OKLAHOMA":"OK", "OREGON":"OR",
                         "PENNSYLVANIA":"PA","RHODE ISLAND":"RI",
                         "SOUTH CAROLINA":"SC","SOUTH DAKOTA":"SD",
                         "TENNESSEE":"TN", "TEXAS":"TX","UTAH":"UT",
                         "VERMONT":"VT","VIRGINIA":"VA", 
                         "WASHINGTON":"WA", "WEST VIRGINIA":"WV",
                         "WISCONSIN":"WI", "WYOMING":"WY"})
  #the state only appears on its first row, so carry it down to the
  #nan rows below it
  state = state.ffill()

  #build 'city, ST' and both crime rates column-wise; whole-column
  #arithmetic avoids the SettingWithCopyWarning from chained indexing
  df = pd.DataFrame({'city': city + ", " + state,
                     'vcr_'+year: x['Violent crime'] / x['Population'],
                     'pcr_'+year: x['Property crime'] / x['Population']})

  #'city' stays a regular column because the later merges use on='city'
  return df
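
The two cleaning steps the function relies on, stripping stray digits (such as footnote markers) and carrying each state name down to the rows below it, can be tried in isolation on a toy frame. The rows and digits below are invented for illustration; only the column names `State` and `City` come from the FBI tables.

```python
import numpy as np
import pandas as pd

# Toy rows: the state appears only on its first row, and some cells
# carry trailing digits
raw = pd.DataFrame({
    "State": ["ALABAMA", np.nan, "ALASKA3", np.nan],
    "City": ["Abbeville", "Adamsville", "Anchorage", "Bethel4"],
})
abbr = {"ALABAMA": "AL", "ALASKA": "AK"}

# Strip digits, map names to abbreviations, then forward-fill the gaps
state = raw["State"].str.replace(r"\d+", "", regex=True).replace(abbr).ffill()
city = raw["City"].str.replace(r"\d+", "", regex=True)

city_st = city + ", " + state
print(city_st.tolist())
# ['Abbeville, AL', 'Adamsville, AL', 'Anchorage, AK', 'Bethel, AK']
```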

In [16]:
#run the 10 xls files through the cleaner function
cl18 = cleaner(xl2018, '2018')
cl17 = cleaner(xl2017, '2017')
cl16 = cleaner(xl2016, '2016')
cl15 = cleaner(xl2015, '2015')
cl14 = cleaner(xl2014, '2014')
cl13 = cleaner(xl2013, '2013')
cl12 = cleaner(xl2012, '2012')
cl11 = cleaner(xl2011, '2011')
cl10 = cleaner(xl2010, '2010')
cl09 = cleaner(xl2009, '2009')
cl09

Replacing null with None...




Unnamed: 0,city,vcr_2009,pcr_2009
0,"Abbeville, AL",0.00306958,0.0180764
1,"Adamsville, AL",0.00531463,0.0727041
2,"Addison, AL",0.00704225,0.0408451
3,"Alabaster, AL",0.00136658,0.0211653
4,"Albertville, AL",0.00408407,0.0462197
...,...,...,...
9141,"Sundance, WY",0,0.0110759
9142,"Thermopolis, WY",0.00338868,0.012877
9143,"Torrington, WY",0.0020051,0.0251549
9144,"Wheatland, WY",0.00122212,0.0348304


In [18]:
# the following are the steps I made to merge the dataframes
# (side comments are the shape during each step)

cl18.shape # 9252, 3
cl17.shape# 9579,3
masta = pd.merge(cl18,cl17, how='outer', on='city')
print(masta.shape) # 10188, 5
masta['pcr_2018'].isnull().sum() #939
masta['pcr_2017'].isnull().sum() # 591
masta2 = pd.merge(cl16, cl15, how='outer', on='city')
print(masta2.shape) # 10199, 5
masta2['pcr_2016'].isnull().sum() # 607
masta2['pcr_2015'].isnull().sum() # 777
masta3 = pd.merge(cl14, cl13, how='outer', on='city')
print(masta3.shape) # 10113, 5
masta3['pcr_2014'].isnull().sum() # 750
masta3['pcr_2013'].isnull().sum() # 791
masta4 = pd.merge(cl12, cl11, how='outer', on='city')
print(masta4.shape) # 10275, 5
masta4['vcr_2012'].isnull().sum() # 969
masta4['vcr_2011'].isnull().sum() # 1136
masta5 = pd.merge(cl10, cl09, how='outer', on='city')
print(masta5.shape) # 10110, 5
masta5['pcr_2010'].isnull().sum() # 787
masta5['pcr_2009'].isnull().sum() # 952
master = pd.merge(masta, masta2, how='outer', on='city')
print(master.shape) # 10975, 9
master = pd.merge(master, masta3, how='outer', on='city')
print(master.shape) #12075
master = pd.merge(master, masta4, how='outer', on='city')
print(master.shape) # 14924
master = pd.merge(master, masta5, how='outer', on='city')
print(master.shape) # 24693
master


(10188, 5)
(10199, 5)
(10113, 5)
(10275, 5)
(10110, 5)
(10975, 9)
(12075, 13)
(14924, 17)
(24693, 21)


Unnamed: 0,city,vcr_2018,pcr_2018,vcr_2017,pcr_2017,vcr_2016,pcr_2016,vcr_2015,pcr_2015,vcr_2014,...,vcr_2013,pcr_2013,vcr_2012,pcr_2012,vcr_2011,pcr_2011,vcr_2010,pcr_2010,vcr_2009,pcr_2009
0,"Abbeville, AL",0.00705606,0.0192082,0.0046332,0.0254826,0.00421779,0.0195552,0.00344828,0.0291188,0.00302686,...,0.00415879,0.0238185,0.00810313,0.0209945,0.00481303,0.0240652,0.00711864,0.0227119,0.00306958,0.0180764
1,"Adamsville, AL",0.0043951,0.0668517,0.00576834,0.0602215,0.00434087,0.0571167,0.0056638,0.0747621,0.00677201,...,0.00424012,0.0716358,0.00616333,0.0642747,0.00814261,0.0660211,0.00208507,0.0613011,0.00531463,0.0727041
2,"Alabaster, AL",0.00274619,0.0172831,0.00222536,0.0158482,0.00293584,0.01477,0.0041482,0.0167814,0.0018373,...,0.00141161,0.0205326,0.00158458,0.0220871,0.00193455,0.0215096,0.00107209,0.0167311,0.00136658,0.0211653
3,"Albertville, AL",0.00112003,0.0374277,0.00101908,0.0300167,0.00134727,0.0334959,0.00139315,0.0335748,0.00106211,...,,,,,0.00202239,0.0431286,0.00393433,0.054012,0.00408407,0.0462197
4,"Alexander City, AL",0.0215837,0.0419302,0.0172952,0.0373711,0.0188499,0.0447091,0.0148781,0.0438939,0.00544099,...,0.00809965,0.0449905,0.00636047,0.0460632,0.00923262,0.0513815,0.00387419,0.0424191,0.00876669,0.0559872
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24688,"West Logan, WV",,,,,,,,,,...,,,,,,,,,0,0
24689,"Coon Valley, WI",,,,,,,,,,...,,,,,,,,,0,0.00535475
24690,"DeForest, WI",,,,,,,,,,...,,,,,,,,,0.00077726,0.0177659
24691,"Pewaukee, WI",,,,,,,,,,...,,,,,,,,,0.000475022,0.00902541
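
The chain of pairwise merges above can also be written as a single fold over the list of yearly frames. The sketch below additionally drops repeated `city` keys and passes `validate='one_to_one'`, because an outer merge pairs every matching row on both sides, so repeated keys (for example several 'None, ST' rows) would multiply rows. Whether that explains the growth to 24,693 rows is an assumption that has not been checked against the data; `merge_years` and the toy frames are hypothetical.

```python
from functools import reduce

import pandas as pd

def merge_years(frames, key="city"):
    """Outer-merge yearly frames on `key`, keeping the first row per key."""
    # Design choice for this sketch: keep the first occurrence of each key
    deduped = [f.drop_duplicates(subset=key) for f in frames]
    return reduce(
        # validate raises MergeError if a key is repeated on either side
        lambda left, right: pd.merge(left, right, how="outer", on=key,
                                     validate="one_to_one"),
        deduped,
    )

# Toy frames with one repeated key on the left side
a = pd.DataFrame({"city": ["A, AL", "B, AL", "B, AL"], "vcr_2018": [0.1, 0.2, 0.3]})
b = pd.DataFrame({"city": ["B, AL", "C, AL"], "vcr_2017": [0.4, 0.5]})

merged = merge_years([a, b])
print(merged.shape)
```

With the notebook's frames the call would be `merge_years([cl18, cl17, cl16, cl15, cl14, cl13, cl12, cl11, cl10, cl09])`.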


In [20]:
#export data
master.to_csv('data/crime.csv', index=False)

#Summary and thoughts:

Crime data reporting by city is spotty: after collecting it all together, about 10% is missing. Hopefully reporting will improve in the future, but there may be no way to collect this data for many years.
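
One way to measure how much is missing is to take the mean of the null mask over the rate columns. A minimal sketch on two rows and four columns copied from the merged output above (the full `master` frame would be used in practice; `toy`, `rate_cols`, and the result names are illustrative):

```python
import numpy as np
import pandas as pd

# Two rows from the merged output; Albertville has no 2013 figures
toy = pd.DataFrame({
    "city": ["Abbeville, AL", "Albertville, AL"],
    "vcr_2013": [0.00415879, np.nan],
    "pcr_2013": [0.0238185, np.nan],
    "vcr_2011": [0.00481303, 0.00202239],
    "pcr_2011": [0.0240652, 0.0431286],
})

rate_cols = [c for c in toy.columns if c.startswith(("vcr_", "pcr_"))]

# Share of missing values per column, and over all rate cells
missing_by_col = toy[rate_cols].isnull().mean()
missing_overall = toy[rate_cols].isnull().to_numpy().mean()

print(missing_by_col)
print(missing_overall)
```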

While the data is spotty, it is still useful and would be quite interesting to a person investigating a new town or city. 

Further improvements to the data would include checking the city names more closely. I have not been able to find duplicates, but they are technically possible. Another improvement would be to flatten the national and state crime dataframe for uniform use with the city data. The data can still be used effectively as it is, though, so this is not a high priority at the moment.
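
Flattening the state table into the same wide layout as the city data (one row per place, one `vcr_<year>`/`pcr_<year>` pair per year) could be done with a pivot. A sketch under the assumption that the state table has the `state`, `year`, `vcr`, `pcr` columns exported earlier; the three rows are the 1979 values shown in this notebook, and `state_long`/`wide` are illustrative names:

```python
import pandas as pd

# Long format, as exported to state_crime.csv (1979 rows from above)
state_long = pd.DataFrame({
    "state": ["AK", "AL", "AR"],
    "year": [1979, 1979, 1979],
    "vcr": [0.004911, 0.004133, 0.003662],
    "pcr": [0.057126, 0.038305, 0.032545],
})

# One row per state, with (measure, year) column pairs
wide = state_long.pivot(index="state", columns="year", values=["vcr", "pcr"])

# Flatten the two-level column index to names like 'vcr_1979'
wide.columns = [f"{name}_{year}" for name, year in wide.columns]
wide = wide.reset_index()

print(wide)
```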