# Nashville Police Service Calls Analysis

## Dependencies

In [1]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import requests

### Import

* N.B. The dataset is large (more than 6.5M records), so it is not included in this GitHub repo. If you'd like the dataset, you can find it [here](https://data.nashville.gov/Police/Metro-Nashville-Police-Department-Calls-for-Servic/kwnd-qrrm) on the nashville.gov website.

In [20]:
main_df = pd.read_csv('data/Metro_Nashville_Police_Department_Calls_for_Service.csv', parse_dates=['Call Received'])


### Preprocessing

* Let's get a sample of the data to see what we're working with.

In [21]:
samp_df = main_df.sample(frac=.01, random_state=22)
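
Sampling after `read_csv` still loads all 6.5M rows first. If memory ever becomes a problem, one option might be to sample chunk by chunk instead. This is only a sketch: the in-memory CSV is a made-up stand-in for the real file, and `sample_in_chunks` is a hypothetical helper, not something used elsewhere in this notebook. Sampling a fixed fraction from each chunk is close to, but not identical to, one random draw over the whole file.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV (hypothetical rows, not real data).
csv_text = "Event Number,Tencode\n" + "\n".join(
    f"PD{n:04d},{n % 5}" for n in range(1000)
)


def sample_in_chunks(source, frac, chunksize, random_state=22):
    """Read `source` in chunks and keep a random fraction of each chunk."""
    pieces = []
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        # Vary the seed per chunk so every chunk does not pick the same positions.
        pieces.append(chunk.sample(frac=frac, random_state=random_state + i))
    return pd.concat(pieces)


demo_sample = sample_in_chunks(io.StringIO(csv_text), frac=0.1, chunksize=100)
print(len(demo_sample))
```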

In [8]:
cols = ['Event Number', 'Call Received', 'Complaint Number', 'Tencode',
       'Tencode Description', 'Tencode Suffix', 'Tencode Suffix Description',
       'Disposition Code', 'Disposition Description', 'Block', 'Street Name',
       'Unit Dispatched', 'Shift', 'Sector', 'Zone', 'RPA', 'Latitude',
       'Longitude', 'Mapped Location']

In [22]:
samp_df.dtypes

Event Number                          object
Call Received                 datetime64[ns]
Complaint Number                     float64
Tencode                                int64
Tencode Description                   object
Tencode Suffix                        object
Tencode Suffix Description            object
Disposition Code                      object
Disposition Description               object
Block                                float64
Street Name                           object
Unit Dispatched                       object
Shift                                 object
Sector                                object
Zone                                  object
RPA                                  float64
Latitude                             float64
Longitude                            float64
Mapped Location                       object
dtype: object

#### 'Event Number'

* It looks like every event number begins with 'PD'. If that holds, I can strip those two characters and cast the remainder to an int, which saves memory.

In [None]:
pd_check = [event.startswith('PD') for event in samp_df['Event Number'].values]

In [None]:
print(sum(pd_check))

In [6]:
def event_number_clean(num):
    return int(num[2:])

In [7]:
samp_df['Event Number'] = samp_df['Event Number'].apply(event_number_clean)

In [None]:
samp_df.iloc[[0]]
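
The same check and conversion could also be done with pandas string methods instead of a list comprehension and `apply`. A minimal sketch, assuming there are no missing event numbers; the `events` Series holds a few values copied from the sample, and the variable names are only for illustration:

```python
import pandas as pd

# A few event numbers in the same 'PD' + digits shape as the sample.
events = pd.Series(['PD201601214292', 'PD202100376271', 'PD201600371223'])

# Vectorized prefix check: True only if every value starts with 'PD'.
all_prefixed = events.str.startswith('PD').all()

if all_prefixed:
    # Slice off the first two characters and cast the remainder to int64.
    event_ints = events.str[2:].astype('int64')
    print(event_ints)
```

Comparing the result of `.all()` directly avoids having to eyeball the printed sum against the number of rows.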

#### 'Call Received'

* This is a datetime column, so I parse it while reading in the CSV (see `parse_dates` in the import step).

In [23]:
samp_df['Call Received'].head(20)

5797724   2016-12-01 16:51:05
6541170   2021-06-10 21:08:23
827216    2016-04-10 20:40:50
503265    2019-06-03 10:42:15
4525988   2020-06-05 00:59:52
5769934   2018-10-06 21:03:00
2849480   2020-09-04 20:03:21
1010424   2020-01-23 07:48:02
4775143   2018-06-03 17:17:59
5435890   2015-11-26 16:23:20
3102685   2019-04-26 10:28:01
6226479   2020-12-28 15:26:32
6396955   2021-01-28 18:17:39
4008377   2017-04-25 14:15:19
1958280   2019-12-30 09:32:33
5330872   2016-01-14 13:41:43
5190845   2015-12-18 05:18:07
1015439   2015-01-07 17:15:41
5494682   2018-04-20 17:05:18
5482143   2016-02-11 19:56:38
Name: Call Received, dtype: datetime64[ns]

#### 'Complaint Number'

* I am not interested in the specific number, just whether or not an incident was generated, so I'll derive a simple 0/1 flag from it (1 if a complaint number exists, 0 otherwise).

In [39]:
samp_df['Complaint Number'].isna().value_counts()

True     60129
False     5572
Name: Complaint Number, dtype: int64

In [42]:
type(samp_df['Complaint Number'][5797724])

numpy.float64

In [58]:
def complaint_number_clean(num):
    if np.isnan(num):
        return 0
    else:
        return 1

In [59]:
samp_df['generated_incident_yn'] = samp_df['Complaint Number'].apply(complaint_number_clean)

In [60]:
# check the function
samp_df['generated_incident_yn'].value_counts()

0    60129
1     5572
Name: generated_incident_yn, dtype: int64
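
For reference, the same flag can probably be built without `apply`, since `notna()` already returns a Boolean Series. A small sketch on made-up values (the `complaints` Series is hypothetical and only mimics the float-or-NaN shape of the column):

```python
import numpy as np
import pandas as pd

# Hypothetical complaint numbers: NaN means no incident was generated.
complaints = pd.Series([np.nan, 20190420000.0, np.nan, 20200050000.0])

# notna() gives True/False; casting to int turns that into 1/0.
flag = complaints.notna().astype(int)
print(flag.value_counts())
```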

In [62]:
# Markdown shortcut!
for col in cols:
    print(f"#### '{col}'")

#### 'Event Number'
#### 'Call Received'
#### 'Complaint Number'
#### 'Tencode'
#### 'Tencode Description'
#### 'Tencode Suffix'
#### 'Tencode Suffix Description'
#### 'Disposition Code'
#### 'Disposition Description'
#### 'Block'
#### 'Street Name'
#### 'Unit Dispatched'
#### 'Shift'
#### 'Sector'
#### 'Zone'
#### 'RPA'
#### 'Latitude'
#### 'Longitude'
#### 'Mapped Location'


#### 'Tencode'

* This column seems to be clean. It's high-cardinality, and the codes are numeric labels rather than quantities, so I'll have to be careful to treat them as categories if I do any modeling.

In [84]:
samp_df['Tencode'].value_counts()

43      12710
96      11993
93      11775
15       3422
44       2926
45       2918
50       2729
87       2448
40       2223
70       1943
71       1888
3        1624
88       1375
46        795
49        692
57        679
83        572
95        394
53        337
92        333
54        287
75        261
65        256
16        230
63        215
42        164
64         88
61         56
59         55
35         55
52         51
62         45
58         35
14         33
94         22
73         21
51         12
1000       10
68         10
3000        4
77          4
66          4
8000        4
89          2
79          1
Name: Tencode, dtype: int64
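
One way to be careful with numeric codes is to stop pandas (and any model downstream) from treating them as quantities. A minimal sketch, assuming a categorical dtype plus one-hot encoding is acceptable for whatever model gets used later; the `tencodes` values are a handful of codes taken from the counts above:

```python
import pandas as pd

# A handful of tencodes as they appear in the sample.
tencodes = pd.Series([43, 96, 93, 43, 1000, 96])

# As a category, 1000 is just another label, not "bigger" than 43.
tencode_cat = tencodes.astype('category')

# One-hot encode so a model sees one indicator column per code.
dummies = pd.get_dummies(tencode_cat, prefix='tencode')
print(dummies.columns.tolist())
```

With this many distinct codes, one-hot encoding adds a lot of columns, so grouping the rare codes first may be worth considering.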

#### 'Tencode Description'

* It looks like there are some blanks here.
* However, the descriptions match the appendix in the metadata document, so rather than clogging the dataframe with strings, I'll remove this column.

In [66]:
samp_df['Tencode Description'].isna().value_counts()

False    64077
True      1624
Name: Tencode Description, dtype: int64

In [67]:
samp_df[samp_df['Tencode Description'].notna()].head(20)

| | Event Number | Call Received | Complaint Number | Tencode | Tencode Description | Tencode Suffix | Tencode Suffix Description | Disposition Code | Disposition Description | Block | ... | Unit Dispatched | Shift | Sector | Zone | RPA | Latitude | Longitude | Mapped Location | generated_incident | generated_incident_yn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5797724 | PD201601214292 | 2016-12-01 16:51:05 | | 88 | INVESTIGATE 911 HANG-UP CALL | PW | | 5 | GONE ON ARRIVAL | | ... | 322B | B | S | 327.0 | 8861.0 | | | | 1 | 0 |
| 6541170 | PD202100376271 | 2021-06-10 21:08:23 | | 43 | WANT OFFICER FOR INVESTIGATION / ASSISTA | PV | | 5 | GONE ON ARRIVAL | | ... | 5P66 | B | H | 535.0 | 8937.0 | | | | 1 | 0 |
| 827216 | PD201600371223 | 2016-04-10 20:40:50 | | 93 | TRAFFIC VIOLATION | | | 9T | | | ... | 332B | B | 335 | | 8721.0 | 36.045 | -86.662 | POINT (-86.662 36.045) | 1 | 0 |
| 503265 | PD201900487756 | 2019-06-03 10:42:15 | 20190420000.0 | 71 | BURGLARY - NON-RESIDENCE | R | REPORT | 1 | M.P.D. REPORT COMPLED | | ... | 811A | A | | | | | | | 1 | 1 |
| 4525988 | PD202000445523 | 2020-06-05 00:59:52 | | 83 | SHOTS FIRED | P | PROGRESS | 11 | DISREGARD / SIGNAL 9 | | ... | | C | S | 321 | 8445.0 | | | | 1 | 0 |
| 5769934 | PD201800981510 | 2018-10-06 21:03:00 | | 87 | SAFETY HAZARD | W | WARRANT ASSISTANCE ESCORT | 6 | ASSISTED OTHER UNIT | | ... | 317B | B | S | 331.0 | 8719.0 | | | | 1 | 0 |
| 2849480 | PD202000664776 | 2020-09-04 20:03:21 | | 43 | WANT OFFICER FOR INVESTIGATION / ASSISTA | P | PROGRESS | 11 | DISREGARD / SIGNAL 9 | | ... | 2940 | B | C | 413 | 1101.0 | | | | 1 | 0 |
| 1010424 | PD202000061960 | 2020-01-23 07:48:02 | 20200050000.0 | 44 | DISORDERLY PERSON | R | REPORT | 1 | M.P.D. REPORT COMPLED | | ... | 311A | A | | | | | | | 1 | 1 |
| 4775143 | PD201800555264 | 2018-06-03 17:17:59 | | 93 | TRAFFIC VIOLATION | | | 9T | | | ... | 125B | B | W | 125.0 | 4811.0 | | | | 1 | 0 |
| 5435890 | PD201501232246 | 2015-11-26 16:23:20 | | 93 | TRAFFIC VIOLATION | | | 9T | | | ... | 317B | B | S | 311 | 8159.0 | | | | 1 | 0 |


In [68]:
samp_df = samp_df.drop('Tencode Description', axis=1)
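
If the descriptions are ever needed again (for plot labels, say), a small lookup could be kept before dropping the column. This is a sketch on a made-up frame, not the notebook's actual data; `demo` and `tencode_lookup` are hypothetical names. As an aside, the 1,624 blank descriptions equal the count of tencode 3 in the sample, which may or may not be a coincidence, so the demo leaves code 3 without a description.

```python
import pandas as pd

# Hypothetical miniature of the two columns involved.
demo = pd.DataFrame({
    'Tencode': [88, 43, 93, 93, 3],
    'Tencode Description': [
        'INVESTIGATE 911 HANG-UP CALL',
        'WANT OFFICER FOR INVESTIGATION / ASSISTA',
        'TRAFFIC VIOLATION',
        'TRAFFIC VIOLATION',
        None,  # blank description
    ],
})

# One description per code, skipping rows where the description is missing.
tencode_lookup = (
    demo.dropna(subset=['Tencode Description'])
        .drop_duplicates(subset='Tencode')
        .set_index('Tencode')['Tencode Description']
        .to_dict()
)

# The strings can now leave the dataframe without being lost.
demo = demo.drop(columns='Tencode Description')
print(tencode_lookup)
```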

#### 'Tencode Suffix'

* Like the tencode column, this is high-cardinality, though the values are strings rather than numbers.

In [82]:
type(samp_df['Tencode Suffix'][5797724])

str

In [103]:
samp_df['Tencode Suffix'].value_counts()

P     26379
A      3823
R      3130
PV     1144
TS      869
PM      856
PW      743
PJ      649
RT      614
TV      586
RV      346
RJ      177
RC      169
W       134
L        98
T        44
PD       30
H        28
F        20
B        12
RD        9
PR        8
FI        4
S         3
RG        1
Name: Tencode Suffix, dtype: int64
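
Since there are only a couple dozen distinct suffixes repeated across many rows, storing them as a pandas category might save a fair amount of memory. A rough sketch on made-up data; the `suffixes` Series just repeats a few suffix values from the counts above, and the exact byte counts will differ on the real column:

```python
import pandas as pd

# Hypothetical column: a few suffixes (and blanks) repeated many times.
suffixes = pd.Series(['P', 'A', 'R', 'PV', None] * 2000)

# Memory used when every row stores its own string.
raw_bytes = suffixes.memory_usage(deep=True)

# Memory used when rows store small integer codes into a shared category list.
as_category = suffixes.astype('category')
category_bytes = as_category.memory_usage(deep=True)

print(raw_bytes, category_bytes)
```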

#### 'Tencode Suffix Description'

* This one is like the tencode description, a string which is unneeded. I'll drop the column.

In [69]:
samp_df = samp_df.drop('Tencode Suffix Description', axis=1)

#### 'Disposition Code'

* Interestingly, it would appear that tencode suffixes are often appended to the disposition code instead of the tencode. That'll be fun to clean up!

In [107]:
pd.set_option("display.max_rows", 100)

In [108]:
samp_df['Disposition Code'].value_counts()

12     9374
4      8991
11     7710
9T     7561
15     7038
6      4862
1      4631
5      3459
9      2113
10     1322
3T     1211
1T      776
3       644
3M      641
13      611
9S      590
5W      396
4C      372
6T      300
1C      246
2       228
2W      216
6W      210
14      206
5S      174
12A     171
8       144
2M      142
4T      111
1S       85
13A      84
2T       80
6A       79
3R       49
2J       47
6C       47
5C       47
7        47
4A       42
2F       35
4V       31
5T       27
9R       27
6S       27
1J       26
16       24
7A       23
3J       16
6D       14
4S       12
6M       11
10A      10
5J       10
1D       10
6J        9
4J        9
9J        8
2P        4
2D        4
5V        4
10C       3
9C        3
11A       3
6V        2
1P        2
9V        2
1M        2
6F        2
7C        2
1F        2
8J        1
7V        1
1R        1
1A        1
6P        1
3A        1
Name: Disposition Code, dtype: int64

In [105]:
pd.reset_option("display.max_rows")

In [123]:
samp_df['Disposition Code'].str.contains('A|C', case=False, regex=True).value_counts()

False    64293
True      1134
Name: Disposition Code, dtype: int64

In [None]:
def disposition_code_clean(code):
    raise NotImplementedError  # TODO: separate the trailing suffix letter from the numeric code
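
One possible direction for the cleanup is to split each code into its numeric part and its trailing letter. The sketch below assumes every code is one or more digits followed by at most a few uppercase letters, which holds for the values in the counts above but has not been checked against the full dataset. The `codes` Series and the column names `disposition` and `suffix` are made up for illustration.

```python
import pandas as pd

# A few disposition codes in the shapes seen above, plus a missing value.
codes = pd.Series(['12', '9T', '3M', '12A', '4', None])

# Named groups become column names; an unmatched optional group yields NaN.
parts = codes.str.extract(r'^(?P<disposition>\d+)(?P<suffix>[A-Z]+)?$')

# Convert the digit strings to a nullable integer column.
parts['disposition'] = pd.to_numeric(parts['disposition']).astype('Int64')
print(parts)
```

Codes that do not fit the assumed pattern would come back as missing in both columns, so counting the rows where `disposition` is missing but the original code is not would be a quick sanity check.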

#### 'Disposition Description'

#### 'Block'

#### 'Street Name'

#### 'Unit Dispatched'

#### 'Shift'

#### 'Sector'

#### 'Zone'

#### 'RPA'

#### 'Latitude'

#### 'Longitude'

#### 'Mapped Location'

## EDA

In [None]:
main_df.columns

In [None]:
main_df.dtypes

In [None]:
for col in cols:
    print(f'Column name: {col}')
    print(main_df[col].head(10))
    print('\n*******\n')