# Introduction
This notebook cleans the MPD (Milwaukee Police Department) call dataset. Explanations are included for some of the cleaning decisions.

# Dataset Description
The first dataset evaluated in this notebook is stored in `mkecallswheader.csv`. It was obtained through the bulk data option of a [website](https://mpd.digitalpublicworks.com/?start=2019-01-05T00:00:00-06:00&end=2019-01-05T23:59:59.999999-06:00) that scrapes the Milwaukee Police Department call logs found [here](https://itmdapps.milwaukee.gov/MPDCallData/) and stores them in a Postgres server. The official .gov site shows that the data should have headers of call number, date/time, location, police district, nature of call, and status. The bulk data in the .csv file has four extra headers: id, inserted_at, updated_at, and point. These features will be dropped later on since they do not pertain to the data itself and are an artifact of how the data was stored.

# Imports
These are the libraries that will be relevant for cleaning this dataset.

In [1]:
import pandas as pd
import numpy as np

# Cleaning the Dataset
The following sections walk through the steps used to clean the MPD dataset.

## Load the Raw Data
This section loads the raw data and examines how it is originally formatted.

In [2]:
mpd_data = pd.read_csv("mkecallswheader.csv")

In [3]:
mpd_data.head(10)

Unnamed: 0,id,time,location,district,nature,status,inserted_at,updated_at,point,call_id
0,2093116,2019-05-21 15:19:03,"7420 W GOOD HOPE RD,MKE",4,ACC PI,Service in Progress,2019-05-21 20:51:09,2019-05-21 20:51:09,0101000020E6100000FC7C94111793454061D971683600...,191411633
1,2093127,2019-05-21 15:24:30,"1421 N 27TH ST,MKE",3,TRAFFIC STOP,City Citation(s) Issued,2019-05-21 20:57:11,2019-05-21 20:57:11,0101000020E6100000D2AB014A4386454067C416CCA9FC...,191411672
2,2093141,2019-05-21 15:25:46,"4054 N 71ST ST,MKE",7,SUBJ WANTED,Assignment Completed,2019-05-21 21:00:12,2019-05-21 21:00:12,0101000020E610000053FFC5D8AE8B45402CAE3B270700...,191411674
3,2093805,2019-05-21 20:46:28,"245 W LINCOLN AV,MKE",2,SPECIAL ASSIGN,Service in Progress,2019-05-22 02:22:32,2019-05-22 02:22:32,0101000020E610000078ABF8D04F804540633ABE0779FA...,191412545
4,2093816,2019-05-21 20:50:03,"1721 W CANAL ST,MKE",3,TRBL W/SUBJ,Unable to Locate Complainant,2019-05-22 02:25:33,2019-05-22 02:25:33,0101000020E6100000E8323509DE834540C3D7D7BAD4FB...,191412465
5,2093829,2019-05-21 21:02:37,"E WRIGHT ST / N WEIL ST,MKE",5,PARK AND WALK,Service in Progress,2019-05-22 02:37:36,2019-05-22 02:37:36,0101000020E6100000DEF1DC312B88454059D878558CF9...,191412584
6,2093872,2019-05-21 20:50:47,"9010 N 95TH ST,MKE",4,WELFARE CITIZEN,Advised,2019-05-22 02:52:43,2019-05-22 02:52:43,0101000020E6100000357D76C07597454080B4FF01D601...,191412544
7,2093887,2019-05-21 21:25:33,"983 W ARTHUR AV,MKE",2,BATTERY DV,Service in Progress,2019-05-22 03:01:48,2019-05-22 03:01:48,0101000020E6100000BC033C69E17F454041ABDDC02EFB...,191412632
8,2093918,2019-05-21 21:36:05,"4115 N 56TH ST,MKE",7,RETURN STATION,Assignment Completed,2019-05-22 03:16:51,2019-05-22 03:16:51,0101000020E610000014483FD0C08B45404EB747CAF1FE...,191412656
9,2093929,2019-05-21 21:45:53,"7806 W HAMPTON AV,MKE",7,TRAFFIC STOP,Advised,2019-05-22 03:22:52,2019-05-22 03:22:52,0101000020E6100000F6F0C05B7B8D45404F34B4A69E00...,191412676


In [4]:
mpd_data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4027695 entries, 0 to 4027694
Data columns (total 10 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   id           4027695 non-null  int64 
 1   time         4027695 non-null  object
 2   location     4027695 non-null  object
 3   district     3937463 non-null  object
 4   nature       4027695 non-null  object
 5   status       4027695 non-null  object
 6   inserted_at  4027695 non-null  object
 7   updated_at   4027695 non-null  object
 8   point        3958766 non-null  object
 9   call_id      4027695 non-null  int64 
dtypes: int64(2), object(8)
memory usage: 307.3+ MB


In [5]:
mpd_data.describe()

Unnamed: 0,id,call_id
count,4027695.0,4027695.0
mean,5589211.0,189180800.0
std,23924880.0,14593090.0
min,1.0,163081500.0
25%,1006924.0,173212900.0
50%,2013848.0,190670700.0
75%,3020772.0,201611500.0
max,163541700.0,220101000.0


The above calls to the [`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html), and [`.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) methods show that there are 10 features in total. Two of them are stored as integers and eight are stored as `object`, the default dtype pandas uses when importing non-numerical features. Several of the object features can be converted to more specific types. The time column should be formatted as a datetime object. District, nature, and status should be converted to categorical features. Location should be kept as an object feature, but more features should be extracted from it in order to draw further observations. Both street name and street suffix would be good features to extract. A Stack Overflow [answer](https://stackoverflow.com/a/43427677) helped with showing null counts in the info command.
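The null counts that `.info()` reports can also be computed directly, which makes it easier to compare columns. In the output above, district is missing 4027695 - 3937463 = 90232 values and point is missing 4027695 - 3958766 = 68929. The sketch below shows the idea on a small made-up frame; the name `sample` and its values are illustrative assumptions, not rows from the real file.

```python
import pandas as pd

# Toy frame standing in for mpd_data; the values are illustrative only.
sample = pd.DataFrame({
    "district": ["4", "3", None, "OCOE"],
    "point": ["0101", None, None, "0102"],
    "call_id": [191411633, 191411672, 191411674, 191412545],
})

# Count missing values per column and express them as a share of all rows.
null_counts = sample.isna().sum()
null_share = sample.isna().mean()

print(pd.DataFrame({"nulls": null_counts, "share": null_share}))
```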

## Drop Postgres Features

All of the postgres features that are not part of the data can be dropped in the next step. These are the headers of id, inserted_at, updated_at, and point.

In [6]:
# Drop all four storage-related columns in a single call
mpd_data = mpd_data.drop(columns=['id', 'inserted_at', 'updated_at', 'point'])

## Convert the Time Feature

The time should be converted to a pandas datetime object. The feature name should also be changed to reflect that the feature contains both date and time information.

In [7]:
mpd_data['datetime'] = pd.to_datetime(mpd_data['time'], infer_datetime_format=True)
mpd_data = mpd_data.drop('time', axis=1)
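As a possible variation on the conversion above, an explicit format string can be passed instead of relying on inference, and `errors="coerce"` can be used to surface malformed timestamps. This is only a sketch: the format is assumed from the rows shown by `.head()`, and the series `raw_times` is made up for illustration.

```python
import pandas as pd

# Two well-formed timestamps and one deliberately broken value.
raw_times = pd.Series(["2019-05-21 15:19:03", "2019-05-21 15:24:30", "not a date"])

# An explicit format avoids per-value inference; errors="coerce" turns
# unparseable strings into NaT instead of raising an exception.
parsed = pd.to_datetime(raw_times, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Rows that failed to parse can be inspected before deciding how to handle them.
bad_rows = raw_times[parsed.isna()]
print(bad_rows)
```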

## Examine Unique Values for Nature, Status, and District

The nature, status, and district features will be examined for unique values and converted into categorical features.

### Examine the District Feature

In [8]:
mpd_data['district'].unique()

array(['4', '3', '7', '2', '5', '6', '1', nan, 'OCOE', 'OUT', 'NTF',
       'SPD', 'CITY', 'ICS3', 'CIB', 'DPR', 'TRU', 'SCD', 'ICS', 'SF',
       'NLA', 'ICS1', 'IFC', 'FI', 'SID', 'D0', 'ICS7', 'ICS5', 'JUNE',
       'DDAC', 'MIRT', 'ICS6', 'ID', 'ADMN', 'TEU', 'MID', 'NID'],
      dtype=object)

In [9]:
mpd_data['district'].value_counts()

3       673002
7       650109
5       567958
4       552701
2       545253
6       473525
1       448786
CITY      7080
NTF       6274
SPD       5477
OCOE      3316
OUT       2249
DDAC       399
SCD        300
DPR        277
ICS1       229
CIB        218
FI          83
SF          55
NLA         34
ICS3        32
SID         31
ICS         23
TEU         16
ICS5         9
JUNE         5
ICS6         4
IFC          4
TRU          3
D0           2
MIRT         2
ADMN         2
NID          2
ICS7         1
ID           1
MID          1
Name: district, dtype: int64

Running value counts on the districts shows that the data contains more districts than expected. The city of Milwaukee should only have police districts one through seven. The extra districts will still be included in the categorical conversion since their entries will be useful for some observations. They will likely be ignored when drawing district-based conclusions, as it is not known what they mean. District will be converted to a categorical feature with 36 different categories.
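When district-based conclusions are drawn later, one way to set the unexplained districts aside is a boolean mask over the numbered districts. The following is a minimal sketch on made-up values; `numbered_districts` is a hypothetical helper name.

```python
import pandas as pd

# Small sample that mixes numbered districts, a missing value, and other codes.
districts = pd.Series(["4", "3", None, "OCOE", "7", "CITY"])

# The district column holds strings, so the expected labels are "1" to "7".
numbered_districts = [str(n) for n in range(1, 8)]

# isin() returns False for missing values and for any other code.
is_numbered = districts.isin(numbered_districts)
numbered_only = districts[is_numbered]
print(numbered_only.tolist())
```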

### Examine the Nature Feature


In [10]:
mpd_data['nature'].unique()

array(['ACC PI', 'TRAFFIC STOP', 'SUBJ WANTED', 'SPECIAL ASSIGN',
       'TRBL W/SUBJ', 'PARK AND WALK', 'WELFARE CITIZEN', 'BATTERY DV',
       'RETURN STATION', 'BUSINESS CHECK', 'SUSPICIOUS-OTH', 'REPORTS',
       'THEFT', 'VIOL REST ORDER', 'SUBJ WITH GUN', 'TAVERN CHECK',
       'PATROL', 'FOLLOW UP', 'ACC PDO', 'TRAFFIC HAZARD',
       'FAMILY TROUBLE', 'NON PURSUIT', 'BATTERY', 'INVESTIGATION',
       'PRISONER TRANS', 'BUS INV', 'CALL FOR POLICE', 'THEFT VEHICLE',
       'SUSP PERS/AUTO', 'ENTRY TO AUTO', 'ASSIGNMENT', 'PROPERTY DAMAGE',
       'IND EXPOSURE', 'NOISE NUISANCE', 'PROPERTY PICKUP', 'SOLICITING',
       'STOLEN VEHICLE', 'OUT OF SERVICE', 'COURT DUTY', 'ENTRY',
       'PARKING TROUBLE', 'GRAFFITI', 'TRBL W/JUV', 'SHOTS FIRED',
       'SHOTSPOTTER', 'BUS INVESTIGATIO', 'THREAT', 'SUBJ WITH WEAPON',
       'TRAFFIC LASER', 'HOME VISIT DV', 'COMMUNITY MTNG',
       'CITIZEN CONTACT', 'RECK USE OF WEAP', 'ABAND/STOLEN PRO',
       'ASSIGN-ADMN MPD', 'CONVEY PROPERTY',

In [11]:
mpd_data['nature'].value_counts()

TRAFFIC STOP        401644
BUSINESS CHECK      325863
TRBL W/SUBJ         237324
RETURN STATION      209934
FOLLOW UP           150661
                     ...  
ACC PD1                  1
ABAND/LOST               1
2831 N 21ST              1
THREAT TO SCHOOL         1
STOLEN PROP              1
Name: nature, Length: 317, dtype: int64

In [12]:
a = mpd_data['nature'].value_counts().sort_index()
a[0:20]

.                      20
0                       1
1 BLOCK NORTH OF        2
1301                    1
1359                    1
1603                    1
1733                    1
230 N 37TH ST           1
2532                    1
2831 N 21ST             1
3                       1
3410                    2
911 ABUSE             373
911 ABUSE CONFIR      267
911 TEST CALL          25
ABAND PROPERTY        961
ABAND/LOST              1
ABAND/LOST PROP         1
ABAND/PROP WEAPO     1416
ABAND/STOLEN PRO    12873
Name: nature, dtype: int64

There are 317 unique natures present in the dataset. Because of this large number, it is necessary to define a few specific natures to use as targets. Some natures also appear under more than one spelling, such as SUBJ WITH GUN and SUBJ W/GUN. In these cases the more prevalent of the two spellings is listed here, and the less common variants are also matched when the boolean features are created later on. The following list of natures will be focused on:
- TRAFFIC STOP        401644
- SHOTSPOTTER          65381
- SHOTS FIRED          47331
- SUBJ WITH GUN        44509
- SUBJ WITH WEAPON     30101
- RECK USE OF WEAP     17524
- SHOOTING             7054

The number next to each nature denotes how many occurrences of that nature were found out of the 4027695 total entries in the dataset. The entries to focus on are divided into two categories. The first is the traffic-related crimes and the second is the gun- and weapon-related crimes. These will be tracked by adding two boolean features to the data which denote their presence or absence.
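If the repeated spellings ever need to be merged into a single label, a mapping passed to `Series.replace` is one option. This is a sketch only: the `canonical_nature` mapping is hypothetical and assumes that each short variant means the same thing as the longer spelling.

```python
import pandas as pd

natures = pd.Series(["SUBJ WITH GUN", "SUBJ W/GUN", "TRAFFIC STOP", "SUBJ W/WEAP"])

# Hypothetical mapping from a less common variant to the more prevalent spelling.
canonical_nature = {
    "SUBJ W/GUN": "SUBJ WITH GUN",
    "SUBJ W/WEAP": "SUBJ WITH WEAPON",
}

# replace() leaves labels without an entry in the mapping unchanged.
normalized = natures.replace(canonical_nature)
print(normalized.value_counts())
```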

The nature feature also contains some anomalous values such as:
- .                      20
- 0                       1
- 1 BLOCK NORTH OF        2
- 1301                    1
- 1359                    1
- 1603                    1
- 1733                    1
- 230 N 37TH ST           1
- 2532                    1
- 2831 N 21ST             1
- 3                       1
- 3410                    2

These values will be retained and included as part of the categorical conversion. They are rare and do not overlap with the specific natures being targeted, so they should have little effect on the analysis.
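Should these values need to be flagged later, a simple heuristic is to mark labels that contain no letters at all. The sketch below is an assumption about what counts as anomalous. It catches values like `.`, `0`, and `1301`, but not address fragments such as `2831 N 21ST`, which would need a separate rule.

```python
import pandas as pd

natures = pd.Series([".", "0", "1 BLOCK NORTH OF", "2831 N 21ST",
                     "911 ABUSE", "TRAFFIC STOP", "`"])

# A label with no alphabetic character is unlikely to describe a call type.
has_no_letters = ~natures.str.contains(r"[A-Za-z]")

print(natures[has_no_letters].tolist())
```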

### Examine the Status Feature

In [13]:
mpd_data['status'].unique()

array(['Service in Progress', 'City Citation(s) Issued',
       'Assignment Completed', 'Unable to Locate Complainant', 'Advised',
       'To be Filed', 'Advised/Referral', 'No Prosecution',
       'Open Investigation', 'Cleared by Arrest', 'False Alarm',
       'Filed Driver Exchange Report', 'Patrol Request', 'Referral',
       'Ordered to Appear', 'State Citation(s) Issued',
       'False Alarm (Weather Related)'], dtype=object)

In [14]:
mpd_data['status'].value_counts()

Service in Progress              1424568
Assignment Completed             1186663
Advised                           615111
Unable to Locate Complainant      336406
To be Filed                       164199
City Citation(s) Issued           153937
Advised/Referral                   87051
Open Investigation                 40803
No Prosecution                      6475
Cleared by Arrest                   4696
False Alarm                         2889
Filed Driver Exchange Report        2631
Referral                            1291
Patrol Request                       697
State Citation(s) Issued             160
False Alarm (Weather Related)         93
Ordered to Appear                     25
Name: status, dtype: int64

The status feature looks like it will work well as a category as is, and most of the counts look reasonable. The main focus will be on the statuses that occur more than 40000 times in the data, since the number of occurrences drops by almost an order of magnitude after that point.
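The 40000 cutoff can be applied directly to the value counts. The sketch below uses a subset of the counts printed above; the names `threshold` and `frequent` are illustrative.

```python
import pandas as pd

# A subset of the status counts shown in the output above.
status_counts = pd.Series({
    "Service in Progress": 1424568,
    "Advised/Referral": 87051,
    "Open Investigation": 40803,
    "No Prosecution": 6475,
    "Cleared by Arrest": 4696,
})

threshold = 40000
frequent = status_counts[status_counts > threshold]

# Size of the drop between the last status kept and the first one left out.
ratio = status_counts["Open Investigation"] / status_counts["No Prosecution"]
print(frequent.index.tolist(), round(ratio, 1))
```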

### Cleaning the Nature Feature

The first step for cleaning nature is to define which values of the nature feature make up the traffic and weapon crimes. Once that is completed, these values can be used to create two new boolean features that denote each group. After that, the nature feature can be turned into a categorical feature.

In [15]:
print("Data Shape Before: %s" % ((mpd_data.shape), ))
target_traffic_crimes_labels = ['TRAFFIC STOP']
target_weapon_crimes_labels = ['SHOTSPOTTER', 'SHOTS FIRED', 'SHOTS FIRED-DV', 
'SUBJ WITH GUN', 'SUBJ W/GUN', 'SUBJ WITH GUN-DV', 'SUBJ WITH WEAPON', 
'SUBJ W/WEAP', 'SUBJ W/WEAPON-DV', 'RECK USE OF WEAP', 'SHOOTING']
mpd_data['traffic_crime'] = mpd_data['nature'].isin(target_traffic_crimes_labels)
mpd_data['weapon_crime'] = mpd_data['nature'].isin(target_weapon_crimes_labels)
print("Data Shape After: %s" % ((mpd_data.shape), ))

Data Shape Before: (4027695, 6)
Data Shape After: (4027695, 8)


In [16]:
mpd_data['nature'] = mpd_data['nature'].astype("category")
mpd_data['nature'].dtype

CategoricalDtype(categories=['.', '0', '1 BLOCK NORTH OF', '1301', '1359', '1603', '1733',
                  '230 N 37TH ST', '2532', '2831 N 21ST',
                  ...
                  'VIOL REST ORD', 'VIOL REST ORD-DV', 'VIOL REST ORDER',
                  'WATER MAIN BREAK', 'WATER MAIN BRK', 'WEAPON',
                  'WELFARE CHK', 'WELFARE CITIZEN', 'WIRES DOWN', '`'],
, ordered=False)

### Cleaning the District and Status Features

The district and status features appeared mostly fine above. Because of this they will just be turned directly into categorical features.


In [17]:
mpd_data['district'] = mpd_data['district'].astype("category")
mpd_data['district'].dtype

CategoricalDtype(categories=['1', '2', '3', '4', '5', '6', '7', 'ADMN', 'CIB', 'CITY',
                  'D0', 'DDAC', 'DPR', 'FI', 'ICS', 'ICS1', 'ICS3', 'ICS5',
                  'ICS6', 'ICS7', 'ID', 'IFC', 'JUNE', 'MID', 'MIRT', 'NID',
                  'NLA', 'NTF', 'OCOE', 'OUT', 'SCD', 'SF', 'SID', 'SPD',
                  'TEU', 'TRU'],
, ordered=False)

In [18]:

mpd_data['status'] = mpd_data['status'].astype('category')
mpd_data['status'].dtype

CategoricalDtype(categories=['Advised', 'Advised/Referral', 'Assignment Completed',
                  'City Citation(s) Issued', 'Cleared by Arrest',
                  'False Alarm', 'False Alarm (Weather Related)',
                  'Filed Driver Exchange Report', 'No Prosecution',
                  'Open Investigation', 'Ordered to Appear', 'Patrol Request',
                  'Referral', 'Service in Progress',
                  'State Citation(s) Issued', 'To be Filed',
                  'Unable to Locate Complainant'],
, ordered=False)
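One practical effect of the categorical conversion is reduced memory use, since each row stores a small integer code rather than a full string. The sketch below illustrates this on synthetic data; the exact byte counts depend on the pandas version and are not taken from the real dataset.

```python
import pandas as pd

# Synthetic column with three repeated labels, mimicking a low-cardinality feature.
statuses = pd.Series(["Service in Progress", "Assignment Completed", "Advised"] * 10_000)

as_string_bytes = statuses.memory_usage(deep=True)

# Each distinct label is stored once; rows hold integer codes that point to it.
as_category = statuses.astype("category")
as_category_bytes = as_category.memory_usage(deep=True)

print(as_string_bytes, as_category_bytes)
```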

## Clean and Examine Location

The location feature can be used to create many new features that will be easier to use. Currently the location values are in one of two formats. The first format is $HouseNumber$ $StreetName$ $StreetType$,MKE. The second format is $StreetName_1$ $StreetType_1$ / $StreetName_2$ $StreetType_2$,MKE. The second format occurs when the location is on the corner of two streets. A categorical feature will be created to denote if an entry is a corner or not. The attributes of each street in the location will be recorded. Null or NaN values will be recorded where there are no values. There will be no house number for addresses that are corners and no secondary names or types for addresses that are not corners. Overall the following features will be added:
- isCorner
- houseNumber
- primaryStreetName
- primaryStreetSuffix
- secondaryStreetName
- secondaryStreetSuffix
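The two location formats can be told apart by the ` / ` separator. As a rough vectorized sketch of that idea, and not the method this notebook uses, pandas string methods can flag corners and split out the streets:

```python
import pandas as pd

locations = pd.Series(["7420 W GOOD HOPE RD,MKE", "E WRIGHT ST / N WEIL ST,MKE"])

# A corner is written as two streets separated by " / ".
is_corner = locations.str.contains(" / ", regex=False)

# Strip the city suffix, then split each location into its one or two streets.
parts = locations.str.replace(",MKE", "", regex=False).str.split(" / ")

print(is_corner.tolist())
print(parts.tolist())
```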

### Method to Extract Addresses

The first step to clean location is to create a method to extract addresses from the raw location strings. This method will then be tested on some example cases.

In [19]:
def get_street_info(address: str) -> list:
    """
    This method will take in a string representing an address and will return the information present in that address.
    Some example addresses are as follows:
        0             7420 W GOOD HOPE RD,MKE
        1                  1421 N 27TH ST,MKE
        2                  4054 N 71ST ST,MKE
        3                245 W LINCOLN AV,MKE
        4                 1721 W CANAL ST,MKE
        5         E WRIGHT ST / N WEIL ST,MKE
        6                  9010 N 95TH ST,MKE
    :param address: the string passed in representing the address
    :return: a list containing one tuple, or two tuples for a corner. Each tuple is an address of the form (houseNumber, streetName, streetType)
    :auth: Grant Fass
    :since: 8 February 2022
    """
    street_type_lookup = ["ALY", "ANX", "ARC", "AVE", "BYU", "BCH", "BND", "BLF", "BLFS", "BTM", "BLVD", "BR", "BRG", "BRK", "BRKS", "BG", "BGS", "BYP", "CP", "CYN", "CPE",
                          "CSWY", "CTR", "CTRS", "CIR", "CIRS", "CLF", "CLFS", "CLB", "CMN", "CMNS", "COR", "CORS", "CRSE", "CT", "CTS", "CV", "CVS", "CRK", "CRES", "CRST", 
                          "XING", "XRD", "XRDS", "CURV", "DL", "DM", "DV", "DR", "EST", "ESTS", "EXPY", "EXT", "EXTS", "FALL", "FLS", "FRY", "FLD", "FLDS", "FLT", "FLTS", 
                          "FRD", "FRDS", "FRST", "FRG", "FRGS", "FRK", "FRKS", "FT", "FWY", "GDN", "GDNS", "GTWY", "GLN", "GLNS", "GRN", "GRNS", "GRV", "GRVS", "HBR", "HBRS", 
                          "HVN", "HTS", "HWY", "HL", "HLS", "HOLW", "INLT", "IS", "ISS", "ISLE", "JCT", "JCTS", "KY", "KYS", "KNL", "KNLS", "LK", "LKS", "LAND", "LNDG", "LN",
                          "LGT", "LGTS", "LF", "LCK", "LCKS", "LDG", "LOOP", "MALL", "MNR", "MNRS", "MDW", "MDWS", "MEWS", "ML", "MLS", "MSN", "MTWY", "MT", "MTN", "MTNS", 
                          "NCK", "ORCH", "OVAL", "OPAS", "PARK", "PKWY", "PASS", "PSGE", "PATH", "PIKE", "PNE", "PNES", "PL", "PLN", "PLNS", "PLZ", "PT", "PTS", "PRT", "PRTS", 
                          "PR", "RADL", "RAMP", "RNCH", "RPD", "RPDS", "RST", "RDG", "RDGS", "RIV", "RD", "RDS", "RTE", "ROW", "RUE", "RUN", "SHL", "SHLS", "SHR", "SHRS", 
                          "SKWY", "SPG", "SPGS", "SPUR", "SQ", "SQS", "STA", "STRA", "STRM", "ST", "STS", "SMT", "TER", "TRWY", "TRCE", "TRAK", "TRFY", "TRL", "TRLR", "TUNL",
                          "TPKE", "UPAS", "UN", "UNS", "VLY", "VLYS", "VIA", "VW", "VWS", "VLG", "VLGS", "VL", "VIS", "WALK", "WALL", "WAY", "WAYS", "WL", "WLS", "AV"]
    # remove the ,MKE suffix from the location if present (str.removesuffix requires Python 3.9+)
    address = address.removesuffix(",MKE")
    # List containing the separate addresses in the passed entry
    addresses = []
    # Check if the entry contains a / or not.
    # The presence of a / denotes the entry as a corner with two streets present
    # For example: N HUMBOLDT AV / E NORTH AV
    if ('/' in address):
        addresses = address.split(' / ')
    else:
        addresses = [address]
    
    # now perform operations for each address
    out = []
    for a in addresses:
        # Set up the values to be returned
        house_number = None
        street = None
        street_suffix = None
        # split apart the address on spaces
        s = a.split(' ')
        # Check if the first token is numeric. If it is, it is the house number
        if s[0].isnumeric():
            # house number present
            house_number = int(s[0])
            street = ' '.join(s[1:-1]) # the slice stops before the last token, which is treated as the suffix
        else:
            street = ' '.join(s[0:-1])
        # update the street suffix based on the last entry in the array
        if s[-1] in street_type_lookup:
            street_suffix = s[-1]
        # add the entries into the return field
        out.append((house_number, street, street_suffix))
    return out

In [20]:
mpd_data['location'].head(10)

0        7420 W GOOD HOPE RD,MKE
1             1421 N 27TH ST,MKE
2             4054 N 71ST ST,MKE
3           245 W LINCOLN AV,MKE
4            1721 W CANAL ST,MKE
5    E WRIGHT ST / N WEIL ST,MKE
6             9010 N 95TH ST,MKE
7            983 W ARTHUR AV,MKE
8             4115 N 56TH ST,MKE
9          7806 W HAMPTON AV,MKE
Name: location, dtype: object

In [21]:
#  E WRIGHT ST / N WEIL ST,MKE
get_street_info(mpd_data['location'][5])

[(None, 'E WRIGHT', 'ST'), (None, 'N WEIL', 'ST')]

In [22]:
#  1421 N 27TH ST,MKE
get_street_info(mpd_data['location'][1])

[(1421, 'N 27TH', 'ST')]

### Method to Further Extract Location Data

The next step is to define a method that will take the list of addresses returned by the previous method and combine them into a single list. This method will also be tested.

In [23]:
def get_street_data_as_array(location: list) -> list:
    """
    Method to take in a location that contains up to two streets and combine it into one list for output.
    The location is an array of tuples up to two in length.
    Each tuple will have 3 entries of the form (houseNumber, streetName, streetSuffix).
    If two entries are present then the location is a corner.
    The output list will be of the form [isCorner, houseNumber, primaryStreetName, primaryStreetSuffix, secondaryStreetName, secondaryStreetSuffix]
    :param location: an array of tuples up to two in length.
    :return: list will be of the form [isCorner, houseNumber, primaryStreetName, primaryStreetSuffix, secondaryStreetName, secondaryStreetSuffix]
    :auth: Grant Fass
    :since: 8 February 2022
    """
    
    if len(location) == 1:
        # is not corner
        location_vals = list(location[0])
        return [False, location_vals[0], location_vals[1], location_vals[2], None, None]
    else:
        # is corner
        primary_location_vals = list(location[0])
        secondary_location_vals = list(location[1])
        return [True, None, primary_location_vals[1], primary_location_vals[2], 
        secondary_location_vals[1], secondary_location_vals[2]]

In [24]:
get_street_data_as_array(get_street_info(mpd_data['location'][5]))

[True, None, 'E WRIGHT', 'ST', 'N WEIL', 'ST']

In [25]:
get_street_data_as_array(get_street_info(mpd_data['location'][1]))

[False, 1421, 'N 27TH', 'ST', None, None]

### Apply the Methods to Extract Location Data

The above methods will be applied in sequence to extract the location data into a new dataframe. The features will then have their types set correctly and inspected to verify that the process worked.

In [26]:
# tolist() is needed since the output of the two map calls is a Series of lists.
street_data = mpd_data["location"].map(get_street_info).map(get_street_data_as_array).tolist()

In [27]:
header = ["isCorner", "houseNumber", "primaryStreetName", "primaryStreetSuffix", "secondaryStreetName", "secondaryStreetSuffix"]
mpd_location_data = pd.DataFrame(street_data, columns=header)
mpd_location_data.head(5)

Unnamed: 0,isCorner,houseNumber,primaryStreetName,primaryStreetSuffix,secondaryStreetName,secondaryStreetSuffix
0,False,7420.0,W GOOD HOPE,RD,,
1,False,1421.0,N 27TH,ST,,
2,False,4054.0,N 71ST,ST,,
3,False,245.0,W LINCOLN,AV,,
4,False,1721.0,W CANAL,ST,,


In [28]:
mpd_location_data['primaryStreetSuffix'] = mpd_location_data['primaryStreetSuffix'].astype('category')
mpd_location_data['primaryStreetName'] = mpd_location_data['primaryStreetName'].astype('category')
mpd_location_data['secondaryStreetSuffix'] = mpd_location_data['secondaryStreetSuffix'].astype('category')
mpd_location_data['secondaryStreetName'] = mpd_location_data["secondaryStreetName"].astype('category')

In [29]:
mpd_location_data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4027695 entries, 0 to 4027694
Data columns (total 6 columns):
 #   Column                 Non-Null Count    Dtype   
---  ------                 --------------    -----   
 0   isCorner               4027695 non-null  bool    
 1   houseNumber            3426083 non-null  float64 
 2   primaryStreetName      4027695 non-null  category
 3   primaryStreetSuffix    3844881 non-null  category
 4   secondaryStreetName    549725 non-null   category
 5   secondaryStreetSuffix  531853 non-null   category
dtypes: bool(1), category(4), float64(1)
memory usage: 58.3 MB
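In the output above, houseNumber is stored as `float64` because corner locations have no house number, and a regular integer column cannot hold missing values. If integer house numbers are preferred, the nullable `Int64` dtype in pandas is one option. This is a sketch on made-up values, not a step this notebook performs.

```python
import pandas as pd

# A missing entry forces the default dtype to float64.
house_numbers = pd.Series([7420, 1421, None, 245])

# The nullable integer dtype keeps whole numbers and represents gaps as <NA>.
as_nullable_int = house_numbers.astype("Int64")

print(house_numbers.dtype, as_nullable_int.dtype)
```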


In [30]:
mpd_location_data['isCorner'].value_counts()

False    3477970
True      549725
Name: isCorner, dtype: int64

### Recombine the Location Data

The locations have now been properly extracted into their own separate features. These features must now be added back to the overall MPD dataframe. Out of the 4027695 entries there were 549725 locations that are corners and 3477970 that are not corners. The datasets will be combined column-wise using [`pd.concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) with an outer join. Since both dataframes share the same index, no rows are added or lost. The original location feature will be dropped once the join is completed, as it is no longer needed.

In [31]:
print("MPD Data Shape Before: %s" % ((mpd_data.shape), ))
print("MPD Location Data Shape Before: %s" % ((mpd_location_data.shape), ))
mpd_data = pd.concat([mpd_data, mpd_location_data], join='outer', axis=1)
mpd_data = mpd_data.drop('location', axis=1)
print("MPD Data Shape After: %s" % ((mpd_data.shape), ))

MPD Data Shape Before: (4027695, 8)
MPD Location Data Shape Before: (4027695, 6)
MPD Data Shape After: (4027695, 13)
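The unchanged row count above reflects that both dataframes share the same `RangeIndex`. With `axis=1`, `pd.concat` aligns rows by index label, so a quick index check before combining can catch misalignment. The sketch below uses toy frames named `left` and `right`, which are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({"call_id": [1, 2, 3]})
right = pd.DataFrame({"isCorner": [False, True, False]})

# Identical indexes make the column-wise combination row-for-row.
assert left.index.equals(right.index)
combined = pd.concat([left, right], axis=1)

# With a shifted index, the default outer join pads with missing values instead.
shifted = right.copy()
shifted.index = [1, 2, 3]
misaligned = pd.concat([left, shifted], axis=1)

print(combined.shape, misaligned.shape)
```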


# Conclusion

At this point the MPD dataset has been cleaned. The last steps are to show the final outputs of the [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html), [`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), and [`.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) methods and write the cleaned data to a new csv file.

In [32]:
mpd_data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4027695 entries, 0 to 4027694
Data columns (total 13 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   district               3937463 non-null  category      
 1   nature                 4027695 non-null  category      
 2   status                 4027695 non-null  category      
 3   call_id                4027695 non-null  int64         
 4   datetime               4027695 non-null  datetime64[ns]
 5   traffic_crime          4027695 non-null  bool          
 6   weapon_crime           4027695 non-null  bool          
 7   isCorner               4027695 non-null  bool          
 8   houseNumber            3426083 non-null  float64       
 9   primaryStreetName      4027695 non-null  category      
 10  primaryStreetSuffix    3844881 non-null  category      
 11  secondaryStreetName    549725 non-null   category      
 12  secondaryStreetSuffix  53185

In [33]:
mpd_data.head()

Unnamed: 0,district,nature,status,call_id,datetime,traffic_crime,weapon_crime,isCorner,houseNumber,primaryStreetName,primaryStreetSuffix,secondaryStreetName,secondaryStreetSuffix
0,4,ACC PI,Service in Progress,191411633,2019-05-21 15:19:03,False,False,False,7420.0,W GOOD HOPE,RD,,
1,3,TRAFFIC STOP,City Citation(s) Issued,191411672,2019-05-21 15:24:30,True,False,False,1421.0,N 27TH,ST,,
2,7,SUBJ WANTED,Assignment Completed,191411674,2019-05-21 15:25:46,False,False,False,4054.0,N 71ST,ST,,
3,2,SPECIAL ASSIGN,Service in Progress,191412545,2019-05-21 20:46:28,False,False,False,245.0,W LINCOLN,AV,,
4,3,TRBL W/SUBJ,Unable to Locate Complainant,191412465,2019-05-21 20:50:03,False,False,False,1721.0,W CANAL,ST,,


In [34]:
mpd_data.describe()

Unnamed: 0,call_id,houseNumber
count,4027695.0,3426083.0
mean,189180800.0,3367.593
std,14593090.0,2415.387
min,163081500.0,1.0
25%,173212900.0,1614.0
50%,190670700.0,2920.0
75%,201611500.0,4600.0
max,220101000.0,646050.0


In [35]:
mpd_data.to_csv("mpd_data_cleaned.csv", index=False)
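A CSV file does not store dtypes, so the categorical and datetime types set in this notebook have to be restored when the cleaned file is read back. The sketch below shows one way to do that with `pd.read_csv`, using an in-memory buffer and a toy frame in place of the real file. The column subset is illustrative.

```python
import io
import pandas as pd

# Toy frame with the same kinds of dtypes as the cleaned data.
cleaned = pd.DataFrame({
    "district": pd.Series(["4", "3"], dtype="category"),
    "datetime": pd.to_datetime(["2019-05-21 15:19:03", "2019-05-21 15:24:30"]),
    "traffic_crime": [False, True],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the dtypes explicitly while reading.
reloaded = pd.read_csv(buffer, dtype={"district": "category"}, parse_dates=["datetime"])
print(reloaded.dtypes)
```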