# Introduction
This notebook is used to clean the MPD (Milwaukee Police Department) dataset. Explanations will be included for some decisions in cleaning.

# Dataset Description
The first dataset evaluated in this notebook is stored in `mkecallswheader.csv`. It was obtained by requesting the bulk data option from a [website](https://mpd.digitalpublicworks.com/?start=2019-01-05T00:00:00-06:00&end=2019-01-05T23:59:59.999999-06:00) that scrapes and stores the Milwaukee Police Department call logs found [here](https://itmdapps.milwaukee.gov/MPDCallData/). The scraped data is stored in a PostgreSQL server. The official .gov site shows that the data should have headers of call number, date/time, location, police district, nature of call, and status. The bulk data stored in the .csv file has four extra headers: id, inserted_at, updated_at, and point. These features will be dropped later on since they do not pertain to the data itself and are an artifact of how the data was stored. See [this link](https://city.milwaukee.gov/ImageLibrary/Groups/mpdAuthors/SOP/COMMUNICATIONS-2501.pdf) for more information about what certain codes mean.

# Imports
These are the libraries that will be relevant for cleaning this dataset.

In [None]:
import pandas as pd
import numpy as np

# Cleaning the Dataset
The following sections walk through the steps used to clean the MPD dataset.

## Load the Raw Data
This section loads the raw data and examines how it is originally formatted.

In [None]:
mpd_data = pd.read_csv("mkecallswheader.csv")

In [None]:
mpd_data.head(10)

In [None]:
mpd_data.info(verbose=True, show_counts=True)

In [None]:
mpd_data.describe()

The above calls to the [`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html), and [`.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) methods show that there are 10 total features. Two of these features are formatted as integers and eight are formatted as the default object type that pandas uses when importing non-numerical features. A few of the object features can be converted to new types. The time column should be formatted as a datetime object. District should be converted to a numerical categorical value. Nature and status should be converted to categorical features. Location should be kept as an object feature, but more features should be extracted from it in order to draw further observations. Both street name and street suffix would be good features to extract. A Stack Overflow [answer](https://stackoverflow.com/a/43427677) helped with showing the null counts in the `.info()` output.

## Drop Postgres Features

All of the postgres features that are not part of the data can be dropped in the next step. These are the headers of id, inserted_at, updated_at, and point.

In [None]:
# drop all of the postgres artifact columns in a single call
mpd_data = mpd_data.drop(columns=['id', 'inserted_at', 'updated_at', 'point'])

## Convert the Time Feature

The time should be converted to a pandas datetime object. The feature name should also be changed to reflect that the feature contains both date and time information.

In [None]:
# infer_datetime_format is deprecated as of pandas 2.0, where format inference is the default behavior
mpd_data['datetime'] = pd.to_datetime(mpd_data['time'])
mpd_data = mpd_data.drop('time', axis=1)
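
As a minimal sketch of what the datetime conversion enables, the toy frame below is parsed with `errors="coerce"` so that unparseable entries become `NaT` instead of raising an error. The timestamps and their string format are invented for illustration and are not rows from the dataset. The `hour` and `weekday` columns are hypothetical features that are not part of this notebook.

```python
import pandas as pd

# Toy stand-in for the raw 'time' column; these values are invented for illustration.
toy = pd.DataFrame({"time": ["2019-01-05 00:15:00", "2019-01-05 13:45:30", "not a date"]})

# errors="coerce" turns unparseable strings into NaT instead of raising an error.
toy["datetime"] = pd.to_datetime(toy["time"], errors="coerce")
toy = toy.drop(columns="time")

# Count the entries that failed to parse so they can be inspected.
n_unparsed = int(toy["datetime"].isna().sum())

# Derive simple calendar features (hypothetical names) through the .dt accessor.
toy["hour"] = toy["datetime"].dt.hour
toy["weekday"] = toy["datetime"].dt.day_name()

print(toy)
print("unparsed:", n_unparsed)
```

Counting the `NaT` values after the conversion is a cheap way to confirm that no timestamps were silently lost.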

## Examine Unique Values for Nature, Status, and District

The nature, status, and district features will be examined for unique values and converted into categorical features.

### Examine the District Feature

In [None]:
mpd_data['district'].unique()

In [None]:
mpd_data['district'].value_counts()

Running `.value_counts()` on the districts shows that the data contains more police districts than expected. The city of Milwaukee should only have districts one through seven. The districts will be converted into a categorical feature anyway, as their entries will be useful for some observations. The erroneous districts will likely be ignored when drawing district based conclusions, as it is not known what they mean. District will be converted to a categorical feature with 36 different categories.
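
One way to ignore the unexpected districts later on is to flag the rows that belong to districts one through seven. The sketch below assumes the district values may arrive as strings. The toy values and the `known_district` column name are invented for illustration and are not part of this notebook.

```python
import pandas as pd

# Toy district values, invented for illustration, including codes outside 1-7.
toy = pd.DataFrame({"district": ["1", "7", "12", "CITY", None, "3"]})

# Convert to numbers where possible; anything non-numeric becomes NaN.
district_num = pd.to_numeric(toy["district"], errors="coerce")

# True only for the seven expected police districts. NaN compares as False.
toy["known_district"] = district_num.between(1, 7)

# The original column can still become a categorical that keeps every code.
toy["district"] = toy["district"].astype("category")

print(toy)
print(toy["known_district"].value_counts())
```

Keeping the flag separate from the categorical column leaves the original codes intact while district based conclusions can filter on the flag.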

### Examine the Nature Feature


In [None]:
mpd_data['nature'].unique()

In [None]:
mpd_data['nature'].value_counts()

In [None]:
a = mpd_data['nature'].value_counts().sort_index()
a[0:20]

There are 317 unique natures present in the dataset. Because this number is so large, it will be necessary to define a few specific natures to use as targets. Some natures also appear under more than one spelling, such as SUBJ WITH GUN and SUBJ W/GUN. In these cases the more prevalent spelling is the one listed below. The following list of natures will be focused on:
- TRAFFIC STOP        401644
- SHOTSPOTTER          65381
- SHOTS FIRED          47331
- SUBJ WITH GUN        44509
- SUBJ WITH WEAPON     30101
- RECK USE OF WEAP     17524
- SHOOTING             7054

The number next to each nature denotes how many occurrences of that nature were found among the 4027695 total entries in the dataset. The entries to focus on are divided into two categories. The first is the traffic related crimes and the second is the gun and weapon related crimes. Two boolean features will be added to the data to denote whether each entry belongs to one of these categories.
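
Variant spellings could also be folded into a single label before counting. The sketch below shows that alternative on toy data; it is not the approach used later in this notebook. The mapping and the `nature_canonical` column name are hypothetical. Only the SUBJ WITH GUN and SUBJ W/GUN pair comes from the text above, and the rows themselves are invented.

```python
import pandas as pd

# Toy nature values; the rows are invented, only the labels come from the notebook.
toy = pd.DataFrame(
    {"nature": ["SUBJ WITH GUN", "SUBJ W/GUN", "TRAFFIC STOP", "SUBJ W/GUN", "SHOOTING"]}
)

# Hypothetical mapping from a variant spelling to the more prevalent spelling.
variant_to_canonical = {"SUBJ W/GUN": "SUBJ WITH GUN"}

# replace() leaves every value that is not a key of the mapping untouched.
toy["nature_canonical"] = toy["nature"].replace(variant_to_canonical)

print(toy["nature_canonical"].value_counts())
```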

The nature feature also contains some anomalous values such as:
- .                      20
- 0                       1
- 1 BLOCK NORTH OF        2
- 1301                    1
- 1359                    1
- 1603                    1
- 1733                    1
- 230 N 37TH ST           1
- 2532                    1
- 2831 N 21ST             1
- 3                       1
- 3410                    2

These values will be retained and included as part of the categorical conversion. They are not expected to matter much because the analysis targets specific natures.
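
If these anomalous values ever need to be inspected, a simple heuristic can surface them. The sketch below assumes that a genuine nature starts with a letter. That is only an assumption and it may flag legitimate natures that begin with a digit. The sample values are taken from the lists above.

```python
import pandas as pd

# A few nature values from the lists above, mixed with two regular natures.
toy = pd.Series(
    [".", "0", "1 BLOCK NORTH OF", "1301", "230 N 37TH ST", "TRAFFIC STOP", "SHOTS FIRED"],
    name="nature",
)

# Heuristic (an assumption): a genuine nature starts with an uppercase letter.
# str.match anchors the pattern at the start of each string.
looks_anomalous = ~toy.str.match(r"[A-Z]")

print(toy[looks_anomalous].tolist())
```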

### Examine the Status Feature

In [None]:
mpd_data['status'].unique()

In [None]:
mpd_data['status'].value_counts()

The status feature looks like it will work well as a category as is, and most of the counts look reasonable. The main focus will be on the statuses that occur more than 40000 times in the data, because the number of occurrences drops by almost an order of magnitude after that point.
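
The frequency cutoff described above can be sketched on toy data as follows. The status labels and counts are invented, and the small `threshold` value stands in for the 40000 cutoff.

```python
import pandas as pd

# Invented status labels and counts, for illustration only.
toy = pd.Series(["STATUS_A"] * 5 + ["STATUS_B"] * 4 + ["STATUS_C"] * 1, name="status")

# Stands in for the 40000 cutoff used on the real data.
threshold = 3

# Count each status and keep only the labels that occur more often than the cutoff.
counts = toy.value_counts()
frequent = counts[counts > threshold].index

# Select the rows whose status is one of the frequent labels.
toy_frequent = toy[toy.isin(frequent)]

print(counts)
print(sorted(frequent))
```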

### Cleaning the Nature Feature

The first step for cleaning nature is to define which values of the nature feature make up the traffic and weapon crimes. Once that is completed, these values can be used to create the new boolean features that denote each category. After that the nature feature can be turned into a categorical feature.

In [None]:
print("Data Shape Before: %s" % ((mpd_data.shape), ))
target_traffic_crimes_labels = ['TRAFFIC STOP']
target_weapon_crimes_labels = ['SHOTSPOTTER', 'SHOTS FIRED', 'SHOTS FIRED-DV', 
'SUBJ WITH GUN', 'SUBJ W/GUN', 'SUBJ WITH GUN-DV', 'SUBJ WITH WEAPON', 
'SUBJ W/WEAP', 'SUBJ W/WEAPON-DV', 'RECK USE OF WEAP', 'SHOOTING']
mpd_data['traffic_crime'] = mpd_data['nature'].isin(target_traffic_crimes_labels)
mpd_data['weapon_crime'] = mpd_data['nature'].isin(target_weapon_crimes_labels)
print("Data Shape After: %s" % ((mpd_data.shape), ))

In [None]:
mpd_data['nature'] = mpd_data['nature'].astype("category")
mpd_data['nature'].dtype
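
One practical effect of the categorical conversion is a smaller memory footprint when a column repeats a limited set of labels. The sketch below compares the two representations on a synthetic column. The repetition count is arbitrary, and the exact byte counts depend on the pandas version, so none are stated here.

```python
import pandas as pd

# Three labels from the notebook, repeated to mimic a column with few unique values.
labels = ["TRAFFIC STOP", "SHOTSPOTTER", "SHOTS FIRED"]
as_object = pd.Series(labels * 10_000, name="nature")
as_category = as_object.astype("category")

# deep=True includes the memory taken by the string values themselves.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object bytes:  ", object_bytes)
print("category bytes:", category_bytes)
print(as_category.cat.categories.tolist())
```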

### Cleaning the District and Status Features

The district and status features appeared mostly fine above. Because of this they will just be turned directly into categorical features.


In [None]:
mpd_data['district'] = mpd_data['district'].astype("category")
mpd_data['district'].dtype

In [None]:

mpd_data['status'] = mpd_data['status'].astype('category')
mpd_data['status'].dtype

## Clean and Examine Location

The location feature can be used to create many new features that will be easier to use. Currently the location values are in one of two formats. The first format is $HouseNumber$ $StreetName$ $StreetType$,MKE. The second format is $StreetName_1$ $StreetType_1$ / $StreetName_2$ $StreetType_2$,MKE. The second format occurs when the location is on the corner of two streets. A categorical feature will be created to denote if an entry is a corner or not. The attributes of each street in the location will be recorded. Null or NaN values will be recorded where there are no values. There will be no house number for addresses that are corners and no secondary names or types for addresses that are not corners. Overall the following features will be added:
- isCorner
- houseNumber
- primaryStreetName
- primaryStreetSuffix
- secondaryStreetName
- secondaryStreetSuffix

### Method to Extract Addresses

The first step to clean location is to create a method to extract addresses from the raw location strings. This method will then be tested on some example cases.

In [None]:
def get_street_info(address: str) -> list:
    """
    This method will take in a string representing an address and will return the information present in that address.
    Some example addresses are as follows:
        0             7420 W GOOD HOPE RD,MKE
        1                  1421 N 27TH ST,MKE
        2                  4054 N 71ST ST,MKE
        3                245 W LINCOLN AV,MKE
        4                 1721 W CANAL ST,MKE
        5         E WRIGHT ST / N WEIL ST,MKE
        6                  9010 N 95TH ST,MKE
    :param address: the string passed in representing the address
    :return: a list containing one or two tuples. Each tuple will be an address of the form (houseNumber, streetName, streetType)
    :auth: Grant Fass
    :since: 8 February 2022
    """
    street_type_lookup = ["ALY", "ANX", "ARC", "AVE", "BYU", "BCH", "BND", "BLF", "BLFS", "BTM", "BLVD", "BR", "BRG", "BRK", "BRKS", "BG", "BGS", "BYP", "CP", "CYN", "CPE",
                          "CSWY", "CTR", "CTRS", "CIR", "CIRS", "CLF", "CLFS", "CLB", "CMN", "CMNS", "COR", "CORS", "CRSE", "CT", "CTS", "CV", "CVS", "CRK", "CRES", "CRST", 
                          "XING", "XRD", "XRDS", "CURV", "DL", "DM", "DV", "DR", "EST", "ESTS", "EXPY", "EXT", "EXTS", "FALL", "FLS", "FRY", "FLD", "FLDS", "FLT", "FLTS", 
                          "FRD", "FRDS", "FRST", "FRG", "FRGS", "FRK", "FRKS", "FT", "FWY", "GDN", "GDNS", "GTWY", "GLN", "GLNS", "GRN", "GRNS", "GRV", "GRVS", "HBR", "HBRS", 
                          "HVN", "HTS", "HWY", "HL", "HLS", "HOLW", "INLT", "IS", "ISS", "ISLE", "JCT", "JCTS", "KY", "KYS", "KNL", "KNLS", "LK", "LKS", "LAND", "LNDG", "LN",
                          "LGT", "LGTS", "LF", "LCK", "LCKS", "LDG", "LOOP", "MALL", "MNR", "MNRS", "MDW", "MDWS", "MEWS", "ML", "MLS", "MSN", "MTWY", "MT", "MTN", "MTNS", 
                          "NCK", "ORCH", "OVAL", "OPAS", "PARK", "PKWY", "PASS", "PSGE", "PATH", "PIKE", "PNE", "PNES", "PL", "PLN", "PLNS", "PLZ", "PT", "PTS", "PRT", "PRTS", 
                          "PR", "RADL", "RAMP", "RNCH", "RPD", "RPDS", "RST", "RDG", "RDGS", "RIV", "RD", "RDS", "RTE", "ROW", "RUE", "RUN", "SHL", "SHLS", "SHR", "SHRS", 
                          "SKWY", "SPG", "SPGS", "SPUR", "SQ", "SQS", "STA", "STRA", "STRM", "ST", "STS", "SMT", "TER", "TRWY", "TRCE", "TRAK", "TRFY", "TRL", "TRLR", "TUNL",
                          "TPKE", "UPAS", "UN", "UNS", "VLY", "VLYS", "VIA", "VW", "VWS", "VLG", "VLGS", "VL", "VIS", "WALK", "WALL", "WAY", "WAYS", "WL", "WLS", "AV"]
    # this is used primarily for error checking
    unmatched_suffix = ""
    # remove the ,MKE suffix from the location if present
    address = address.removesuffix(",MKE")
    # Array containing the separate addresses in the passed entry
    addresses = []
    # Check if the entry contains a / or not.
    # The presence of a / denotes the entry as a corner with two streets present
    # For example: N HUMBOLDT AV / E NORTH AV
    if ('/' in address):
        addresses = address.split(' / ')
    else:
        addresses = [address]
    
    # now perform operations for each address
    # print(addresses)
    out = []
    for a in addresses:
        # Set up the values to be returned
        house_number = None
        street = None
        street_suffix = None
        # split apart the address on spaces
        s = a.split(' ')
        # Check if the first cell is numeric. This would be the house number if it is numeric
        start = 0
        if s[0].isnumeric():
            # house number present
            house_number = int(s[0])
            start = 1
        # the last entry is the street suffix only when it is a recognized street type
        if s[-1] in street_type_lookup:
            street_suffix = s[-1]
            street = ' '.join(s[start:-1]) # the end index is exclusive, so the suffix is left out
        else:
            # keep the last entry in the street name so no part of the address is lost
            street = ' '.join(s[start:])
        # add the entries into the return field
        out.append((house_number, street, street_suffix))
    return out

In [None]:
mpd_data['location'].head(10)

In [None]:
#  E WRIGHT ST / N WEIL ST,MKE
get_street_info(mpd_data['location'][5])

In [None]:
#  1421 N 27TH ST,MKE
get_street_info(mpd_data['location'][1])

### Method to Further Extract Location Data

The next step is to define a method that will take the list of addresses returned by the previous method and combine them into a single list. This method will also be tested.

In [None]:
def get_street_data_as_array(location: list) -> list:
    """
    Method to take in a location that contains up to two streets and combine it into one list for output.
    The location is an array of tuples up to two in length.
    Each tuple will have 3 entries of the form (houseNumber, streetName, streetSuffix).
    If two entries are present then the location is a corner.
    The output list will be of the form [isCorner, houseNumber, primaryStreetName, primaryStreetSuffix, secondaryStreetName, secondaryStreetSuffix]
    :param location: an array of tuples up to two in length.
    :return: list will be of the form [isCorner, houseNumber, primaryStreetName, primaryStreetSuffix, secondaryStreetName, secondaryStreetSuffix]
    :auth: Grant Fass
    :since: 8 February 2022
    """
    
    if len(location) == 1:
        # is not corner
        location_vals = list(location[0])
        return [False, location_vals[0], location_vals[1], location_vals[2], None, None]
    else:
        # is corner
        primary_location_vals = list(location[0])
        secondary_location_vals = list(location[1])
        return [True, None, primary_location_vals[1], primary_location_vals[2], 
        secondary_location_vals[1], secondary_location_vals[2]]

In [None]:
get_street_data_as_array(get_street_info(mpd_data['location'][5]))

In [None]:
get_street_data_as_array(get_street_info(mpd_data['location'][1]))

### Apply the Methods to Extract Location Data

The above methods will be applied in sequence to extract the location data into a new dataframe. The features will then have their types set correctly and inspected to verify that the process worked.

In [None]:
# tolist() converts the resulting Series of lists into a plain list of rows for the DataFrame constructor.
street_data = mpd_data["location"].map(get_street_info).map(get_street_data_as_array).tolist()

In [None]:
header = ["isCorner", "houseNumber", "primaryStreetName", "primaryStreetSuffix", "secondaryStreetName", "secondaryStreetSuffix"]
mpd_location_data = pd.DataFrame(street_data, columns=header)
mpd_location_data.head(5)

In [None]:
mpd_location_data['primaryStreetSuffix'] = mpd_location_data['primaryStreetSuffix'].astype('category')
mpd_location_data['primaryStreetName'] = mpd_location_data['primaryStreetName'].astype('category')
mpd_location_data['secondaryStreetSuffix'] = mpd_location_data['secondaryStreetSuffix'].astype('category')
mpd_location_data['secondaryStreetName'] = mpd_location_data["secondaryStreetName"].astype('category')

In [None]:
mpd_location_data.info(verbose=True, show_counts=True)

In [None]:
mpd_location_data['isCorner'].value_counts()
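
The corner counts can be cross-checked without the parsing methods by testing the raw strings for the ` / ` separator directly. This is a minimal sketch on three of the example addresses from the docstring above. The `is_corner_check` name is hypothetical.

```python
import pandas as pd

# Example locations taken from the get_street_info docstring.
toy = pd.DataFrame(
    {"location": ["1421 N 27TH ST,MKE", "E WRIGHT ST / N WEIL ST,MKE", "245 W LINCOLN AV,MKE"]}
)

# regex=False treats ' / ' as a literal substring rather than a pattern.
is_corner_check = toy["location"].str.contains(" / ", regex=False)

print(is_corner_check.value_counts())
```

On the full data, the result of this check should match the `isCorner` counts produced by the extraction methods.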

### Recombine the Location Data

The locations have now been properly extracted into their own separate features. These features must now be added back to the overall MPD dataframe. Out of the 4027695 entries there were 549725 locations that are corners and 3477970 that are not corners. The datasets will be combined column-wise using [`pd.concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) with an outer join. Because both dataframes share the same default row index, every row lines up one to one. The original location feature will be dropped once the merge is completed, as it will become irrelevant.

In [None]:
print("MPD Data Shape Before: %s" % ((mpd_data.shape), ))
print("MPD Location Data Shape Before: %s" % ((mpd_location_data.shape), ))
mpd_data = pd.concat([mpd_data, mpd_location_data], join='outer', axis=1)
mpd_data = mpd_data.drop('location', axis=1)
print("MPD Data Shape After: %s" % ((mpd_data.shape), ))
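
A column-wise `pd.concat` aligns rows by index label, not by position. The sketch below uses invented toy frames to show why it is worth checking the shape after the merge. When the indexes match, the row count is unchanged. When they differ, an outer join adds rows and an inner join drops them.

```python
import pandas as pd

# Toy frames, invented for illustration.
left = pd.DataFrame({"nature": ["A", "B", "C"]}, index=[0, 1, 2])
right_same = pd.DataFrame({"isCorner": [False, True, False]}, index=[0, 1, 2])
right_shifted = pd.DataFrame({"isCorner": [False, True, False]}, index=[1, 2, 3])

# Matching indexes: rows line up one to one.
aligned = pd.concat([left, right_same], axis=1, join="outer")

# Shifted indexes: the outer join keeps the union of labels and fills the gaps with NaN.
misaligned_outer = pd.concat([left, right_shifted], axis=1, join="outer")

# Shifted indexes: the inner join keeps only the labels present in both frames.
misaligned_inner = pd.concat([left, right_shifted], axis=1, join="inner")

print(aligned.shape, misaligned_outer.shape, misaligned_inner.shape)
```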

# Conclusion

At this point the MPD dataset is done being cleaned. The last steps are to show the final outputs of the [`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html), and [`.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) methods and output the cleaned data to a new csv file.

In [None]:
mpd_data.info(verbose=True, show_counts=True)

In [None]:
mpd_data.head()

In [None]:
mpd_data.describe()

In [None]:
mpd_data.to_csv("mpd_data_cleaned.csv", index=False)
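
A csv file stores only text, so the datetime and categorical types set in this notebook are not preserved in the output file. The sketch below shows one way to restore them when reading the data back, using an in-memory buffer and an invented two-row frame in place of the real file.

```python
import io

import pandas as pd

# Invented two-row frame with the same kinds of columns as the cleaned data.
toy = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2019-01-05 00:15:00", "2019-01-05 13:45:30"]),
        "status": pd.Series(["X", "Y"], dtype="category"),
    }
)

# Write to an in-memory buffer in place of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read does not bring back the category type.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates and dtype restore the types while reading.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["datetime"], dtype={"status": "category"})

print(plain.dtypes)
print(restored.dtypes)
```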