# Introduction
This notebook combines the cleaned versions of all datasets used in this project into a single joined dataframe. The joined data is then exported as a CSV file for use in the main project.

# Datasets
The following datasets will be merged:
- Weather Dataset
- MPD (Milwaukee Police Department) Dataset

# Imports
These are the libraries that are relevant for merging the datasets.

In [1]:
import pandas as pd
import numpy as np

# Cleaning the Dataset
The following sections walk through the steps used to combine the weather dataset into the MPD dataset.

## Load the Data
This section loads the previously cleaned CSV files and examines how they are formatted immediately after loading.

### Load the MPD Data

In [2]:
mpd_data = pd.read_csv("mpd_data_cleaned.csv")

In [3]:
mpd_data.head()

Unnamed: 0,district,nature,status,call_id,datetime,isCorner,houseNumber,primaryStreetName,primaryStreetSuffix,secondaryStreetName,secondaryStreetSuffix,shots_nature,top_natures,top_districts
0,4,ACC PI,Service in Progress,191411633,2019-05-21 15:19:03,False,7420.0,W GOOD HOPE,RD,,,OTHER,OTHER,4
1,3,TRAFFIC STOP,City Citation(s) Issued,191411672,2019-05-21 15:24:30,False,1421.0,N 27TH,ST,,,OTHER,TRAFFIC STOP,3
2,7,SUBJ WANTED,Assignment Completed,191411674,2019-05-21 15:25:46,False,4054.0,N 71ST,ST,,,OTHER,OTHER,7
3,2,SPECIAL ASSIGN,Service in Progress,191412545,2019-05-21 20:46:28,False,245.0,W LINCOLN,AV,,,OTHER,SPECIAL ASSIGN,2
4,3,TRBL W/SUBJ,Unable to Locate Complainant,191412465,2019-05-21 20:50:03,False,1721.0,W CANAL,ST,,,OTHER,TRBL W/SUBJ,3


### Load the Weather Data

In [4]:
weather_data = pd.read_csv("weather_data_cleaned.csv")

In [5]:
weather_data.head()

Unnamed: 0,isdaytime,tempC,windspeedKmph,winddirdegree,weatherDesc,precipMM,humidity,visibilityKm,pressureMB,cloudcover,HeatIndexC,DewPointC,WindChillC,WindGustKmph,FeelsLikeC,uvIndex,datetime
0,False,14,11,241,Clear,0.0,77,10,1016,3,14,11,14,23,14,1,2008-07-01 00:00:00
1,False,14,12,246,Clear,0.0,77,10,1016,10,14,10,14,25,14,1,2008-07-01 01:00:00
2,False,14,12,251,Clear,0.0,76,10,1015,17,14,10,14,26,14,1,2008-07-01 02:00:00
3,False,15,13,256,Clear,0.0,76,10,1015,24,15,10,14,27,14,1,2008-07-01 03:00:00
4,False,15,12,255,Clear,0.0,74,10,1016,22,15,10,14,26,14,1,2008-07-01 04:00:00


## Fix Feature Formats
Because CSV files do not store type information, several features are loaded with the wrong types. They must be converted to their correct types before the merge, since the join key has to be a true datetime in both dataframes.

### Fixing Feature Formats in the MPD Dataset

In [6]:
mpd_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4027695 entries, 0 to 4027694
Data columns (total 14 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   district               object 
 1   nature                 object 
 2   status                 object 
 3   call_id                int64  
 4   datetime               object 
 5   isCorner               bool   
 6   houseNumber            float64
 7   primaryStreetName      object 
 8   primaryStreetSuffix    object 
 9   secondaryStreetName    object 
 10  secondaryStreetSuffix  object 
 11  shots_nature           object 
 12  top_natures            object 
 13  top_districts          object 
dtypes: bool(1), float64(1), int64(1), object(11)
memory usage: 403.3+ MB


Many of the features in the MPD dataset are loaded as generic objects. The district, nature, status, primaryStreetName, primaryStreetSuffix, secondaryStreetName, secondaryStreetSuffix, shots_nature, top_natures, and top_districts features all need to become categorical features. The datetime feature needs to be converted to a datetime type.

In [7]:
mpd_data['district'] = mpd_data['district'].astype('category')
mpd_data['nature'] = mpd_data['nature'].astype('category')
mpd_data['status'] = mpd_data['status'].astype('category')
mpd_data['primaryStreetName'] = mpd_data['primaryStreetName'].astype('category')
mpd_data['primaryStreetSuffix'] = mpd_data['primaryStreetSuffix'].astype('category')
mpd_data['secondaryStreetName'] = mpd_data['secondaryStreetName'].astype('category')
mpd_data['secondaryStreetSuffix'] = mpd_data['secondaryStreetSuffix'].astype('category')
mpd_data['datetime'] = pd.to_datetime(mpd_data['datetime'])  # infer_datetime_format is deprecated in pandas 2.x, so it is omitted
mpd_data['shots_nature'] = mpd_data['shots_nature'].astype('category')
mpd_data['top_natures'] = mpd_data['top_natures'].astype('category')
mpd_data['top_districts'] = mpd_data['top_districts'].astype('category')
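
The repeated `astype('category')` calls above could also be written as a single call. The following is a minimal sketch on a toy dataframe, not the notebook's actual cell: `toy_mpd` and `categorical_columns` are hypothetical names, and the rows are copied from the `head()` output shown earlier.

```python
import pandas as pd

# Toy stand-in for mpd_data; values copied from the head() output above.
toy_mpd = pd.DataFrame({
    "district": ["4", "3", "7"],
    "nature": ["ACC PI", "TRAFFIC STOP", "SUBJ WANTED"],
    "status": ["Service in Progress", "City Citation(s) Issued", "Assignment Completed"],
    "call_id": [191411633, 191411672, 191411674],
})

# Hypothetical list of the columns that should be categorical.
categorical_columns = ["district", "nature", "status"]

# DataFrame.astype accepts a {column: dtype} mapping, so one call
# converts every listed column and leaves the others untouched.
toy_mpd = toy_mpd.astype({col: "category" for col in categorical_columns})

print(toy_mpd.dtypes)
```

Keeping the column names in one list makes it harder to forget a column when the description and the code are updated separately.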

### Fixing Feature Formats in the Weather Dataset

In [8]:
weather_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119112 entries, 0 to 119111
Data columns (total 17 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   isdaytime      119112 non-null  bool   
 1   tempC          119112 non-null  int64  
 2   windspeedKmph  119112 non-null  int64  
 3   winddirdegree  119112 non-null  int64  
 4   weatherDesc    119112 non-null  object 
 5   precipMM       119112 non-null  float64
 6   humidity       119112 non-null  int64  
 7   visibilityKm   119112 non-null  int64  
 8   pressureMB     119112 non-null  int64  
 9   cloudcover     119112 non-null  int64  
 10  HeatIndexC     119112 non-null  int64  
 11  DewPointC      119112 non-null  int64  
 12  WindChillC     119112 non-null  int64  
 13  WindGustKmph   119112 non-null  int64  
 14  FeelsLikeC     119112 non-null  int64  
 15  uvIndex        119112 non-null  int64  
 16  datetime       119112 non-null  object 
dtypes: bool(1), float64(1), int64(13), object(2)

Everything looks good with the weather data except for datetime and weatherDesc. The weatherDesc feature needs to become a categorical feature and the datetime feature needs to be converted to a datetime type.

In [9]:
weather_data['weatherDesc'] = weather_data['weatherDesc'].astype('category')
weather_data['datetime'] = pd.to_datetime(weather_data['datetime'])  # infer_datetime_format is deprecated in pandas 2.x, so it is omitted
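
To illustrate why the conversion is needed, the sketch below reads a two-row CSV from memory and parses the timestamps. It assumes the `YYYY-MM-DD HH:MM:SS` layout seen in the `head()` output; the explicit `format` string and the name `toy_weather` are illustrative choices, not part of the original notebook.

```python
import io
import pandas as pd

# Two rows in the same layout as the weather head() output above.
csv_text = "datetime,tempC\n2008-07-01 00:00:00,14\n2008-07-01 01:00:00,14\n"
toy_weather = pd.read_csv(io.StringIO(csv_text))

# read_csv leaves the timestamps as plain strings unless told otherwise.
dtype_before = toy_weather["datetime"].dtype

# An explicit format avoids per-value format guessing and fails loudly
# if a row does not match the expected layout.
toy_weather["datetime"] = pd.to_datetime(
    toy_weather["datetime"], format="%Y-%m-%d %H:%M:%S"
)
dtype_after = toy_weather["datetime"].dtype

print(dtype_before, "->", dtype_after)
```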

## Sort the Datasets by Time
The datasets must be [sorted](https://www.geeksforgeeks.org/how-to-sort-a-pandas-dataframe-by-date/) in ascending order by the key that is used to join them. The datetime feature is used as the sort key because the goal is to include the weather at the time a crime occurred.

### Sort the MPD Data

In [10]:
mpd_data = mpd_data.sort_values(by='datetime')

### Sort the Weather Data

In [11]:
weather_data = weather_data.sort_values(by='datetime')
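
Since `merge_asof` requires sorted keys, it can be worth confirming the sort before merging. This is a small sketch on toy data (the dataframe `toy` is hypothetical; the timestamps and call IDs come from the MPD `head()` output), using the `is_monotonic_increasing` property of a pandas Series.

```python
import pandas as pd

# Three MPD-style rows deliberately placed out of order.
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2019-05-21 20:46:28",
        "2019-05-21 15:19:03",
        "2019-05-21 15:24:30",
    ]),
    "call_id": [191412545, 191411633, 191411672],
})

sorted_before = toy["datetime"].is_monotonic_increasing

# Same sort as in the notebook; resetting the index is optional and
# only keeps the row labels aligned with the new order.
toy = toy.sort_values(by="datetime").reset_index(drop=True)

sorted_after = toy["datetime"].is_monotonic_increasing
print(sorted_before, sorted_after)
```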

## Join the Dataframes
The pandas [merge_asof](https://pandas.pydata.org/docs/dev/reference/api/pandas.merge_asof.html) method can be used to quickly [merge the two sorted datasets](https://www.geeksforgeeks.org/merge-two-pandas-dataframes-based-on-closest-datetime/) into one on the closest time. With `direction='nearest'`, each MPD record is matched to the weather observation whose timestamp is closest, which for hourly weather data means the nearest hour. The result should keep the same number of rows as mpd_data and gain every column of weather_data except the shared datetime key (14 + 17 - 1 = 30 columns).

In [12]:
print(f"MPD Data Shape Before: {mpd_data.shape}")
print(f"Weather Data Shape Before: {weather_data.shape}")
merged_data = pd.merge_asof(mpd_data, weather_data, on='datetime', direction='nearest')
print(f"Merged Data Shape After: {merged_data.shape}")

MPD Data Shape Before: (4027695, 14)
Weather Data Shape Before: (119112, 17)
Merged Data Shape After: (4027695, 30)
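
The nearest-match behavior can be seen on a small example. The sketch below uses made-up dataframes (`calls`, `weather`) and made-up temperatures; only the `pd.merge_asof` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical call records, already sorted by time.
calls = pd.DataFrame({
    "call_id": [1, 2, 3],
    "datetime": pd.to_datetime([
        "2019-05-21 15:19:03",
        "2019-05-21 15:24:30",
        "2019-05-21 15:40:00",
    ]),
})

# Hypothetical hourly weather observations, already sorted by time.
weather = pd.DataFrame({
    "datetime": pd.to_datetime(["2019-05-21 15:00:00", "2019-05-21 16:00:00"]),
    "tempC": [20, 22],
})

# Each call takes the weather row whose timestamp is closest,
# whether that row is earlier or later than the call.
merged = pd.merge_asof(calls, weather, on="datetime", direction="nearest")
print(merged)
```

The first two calls are closer to 15:00 and the third is closer to 16:00, so the merged frame keeps all three rows and has 2 + 2 - 1 = 3 columns. If matches that are too far apart in time should be rejected, `merge_asof` also accepts a `tolerance` argument such as `pd.Timedelta("1h")`.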


## Remap and Expand Time
The datetime feature can be expanded into its components (date, year, month, week, day, hour, and weekday). This makes it easier to group and plot the data at different time scales.

In [13]:
merged_data['date'] = merged_data['datetime'].map(lambda t: t.date()) # Represents only the date with no time of day attached
merged_data['date'] = pd.to_datetime(merged_data['date']) # Convert back to a datetime type (the time component is midnight) so it is easier to plot
merged_data['year'] = merged_data['datetime'].map(lambda t: t.year) # Represents year
merged_data['month'] = merged_data['datetime'].map(lambda t: t.month) # Represents month of the year
merged_data['week'] = merged_data['datetime'].map(lambda t: t.week) # Represents week of the year
merged_data['day'] = merged_data['datetime'].map(lambda t: t.day) # Represents day of the month
merged_data['hour'] = merged_data['datetime'].map(lambda t: t.hour) # Represents hour of the day
merged_data['weekday'] = merged_data['datetime'].map(lambda t: t.weekday()) # Monday is 0 and Sunday is 6
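
On a dataframe with millions of rows, mapping a Python lambda over every timestamp can be slow. As a possible alternative, the same components can be obtained through the vectorized `.dt` accessor. The sketch below applies it to one timestamp taken from the final `head()` output; `toy` is a hypothetical name, and the cast of the ISO week to `int64` is an assumption made to keep a plain integer column.

```python
import pandas as pd

# One timestamp from the merged head() output.
toy = pd.DataFrame({"datetime": pd.to_datetime(["2016-11-07 09:33:52"])})

dt = toy["datetime"].dt
toy["date"] = dt.normalize()                         # midnight of the same day
toy["year"] = dt.year
toy["month"] = dt.month
toy["week"] = dt.isocalendar().week.astype("int64")  # ISO week of the year
toy["day"] = dt.day
toy["hour"] = dt.hour
toy["weekday"] = dt.weekday                          # Monday is 0 and Sunday is 6

print(toy.iloc[0].to_dict())
```

For this row the result is week 45, day 7, hour 9, and weekday 0, which agrees with the values shown in the merged output.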

### Remove the Unneeded Features
Remove the district, nature, and houseNumber features. The district and nature features are already represented by top_districts and top_natures respectively. The houseNumber feature has too many null values and is likely too fine-grained to be useful.

In [14]:
merged_data = merged_data.drop(columns=['district', 'nature', 'houseNumber'])
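
The claim about null values can be checked before a column is dropped. This is a rough sketch with invented data: the dataframe `toy` and its missing values are hypothetical and do not reflect the real null share of houseNumber.

```python
import pandas as pd

# Hypothetical rows; two of the four house numbers are missing.
toy = pd.DataFrame({
    "houseNumber": [7420.0, None, 4054.0, None],
    "top_districts": ["4", "3", "7", "2"],
})

# isna() marks missing values and mean() turns the marks into a fraction.
null_share = toy["houseNumber"].isna().mean()
print(f"houseNumber null share: {null_share:.0%}")

# Drop the column once the check confirms it is not worth keeping.
toy = toy.drop(columns=["houseNumber"])
```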

# Conclusion
At this point the datasets have been merged. The last steps are to show the final outputs of the [`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html), and [`.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) methods and to write the merged data to a new CSV file.

In [15]:
merged_data.head()

Unnamed: 0,status,call_id,datetime,isCorner,primaryStreetName,primaryStreetSuffix,secondaryStreetName,secondaryStreetSuffix,shots_nature,top_natures,...,WindGustKmph,FeelsLikeC,uvIndex,date,year,month,week,day,hour,weekday
0,Assignment Completed,163120850,2016-11-07 09:33:52,False,N 4TH,ST,,,OTHER,FOLLOW UP,...,18,14,4,2016-11-07,2016,11,45,7,9,0
1,Advised/Referral,163120774,2016-11-07 09:33:55,False,W BLUE MOUND,RD,,,OTHER,PROPERTY DAMAGE,...,18,14,4,2016-11-07,2016,11,45,7,9,0
2,Assignment Completed,163120851,2016-11-07 09:34:58,False,N 93RD,ST,,,OTHER,OTHER,...,18,14,4,2016-11-07,2016,11,45,7,9,0
3,Assignment Completed,163120853,2016-11-07 09:36:20,False,W LINCOLN,AV,,,OTHER,FOLLOW UP,...,18,14,4,2016-11-07,2016,11,45,7,9,0
4,Advised,163120857,2016-11-07 09:38:10,False,N 25TH,ST,,,OTHER,TRAFFIC STOP,...,18,14,4,2016-11-07,2016,11,45,7,9,0


In [16]:
merged_data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4027695 entries, 0 to 4027694
Data columns (total 34 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   status                 4027695 non-null  category      
 1   call_id                4027695 non-null  int64         
 2   datetime               4027695 non-null  datetime64[ns]
 3   isCorner               4027695 non-null  bool          
 4   primaryStreetName      4016778 non-null  category      
 5   primaryStreetSuffix    4027695 non-null  category      
 6   secondaryStreetName    4027391 non-null  category      
 7   secondaryStreetSuffix  4027695 non-null  category      
 8   shots_nature           4027695 non-null  category      
 9   top_natures            4027695 non-null  category      
 10  top_districts          4027695 non-null  category      
 11  isdaytime              4027695 non-null  bool          
 12  tempC                  40276

In [17]:
merged_data.describe()

Unnamed: 0,call_id,tempC,windspeedKmph,winddirdegree,precipMM,humidity,visibilityKm,pressureMB,cloudcover,HeatIndexC,...,WindChillC,WindGustKmph,FeelsLikeC,uvIndex,year,month,week,day,hour,weekday
count,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,...,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0
mean,189180800.0,9.33296,13.96831,197.8097,0.08068526,72.52384,9.280653,1016.711,46.2748,9.741805,...,7.040703,20.30895,7.32946,2.161411,2018.734,6.556297,26.78238,15.70493,12.7607,2.910867
std,14593090.0,10.99667,6.506821,93.0129,0.3459743,15.60838,1.861185,7.728984,36.20323,11.50243,...,13.14242,9.506874,13.56433,1.869588,1.467857,3.476425,15.18563,8.888607,6.72386,1.97186
min,163081500.0,-31.0,0.0,0.0,0.0,16.0,0.0,986.0,0.0,-30.0,...,-44.0,0.0,-44.0,1.0,2016.0,1.0,1.0,1.0,0.0,0.0
25%,173212900.0,1.0,9.0,126.0,0.0,61.0,10.0,1012.0,11.0,1.0,...,-3.0,13.0,-3.0,1.0,2017.0,4.0,14.0,8.0,8.0,1.0
50%,190670700.0,9.0,13.0,214.0,0.0,74.0,10.0,1016.0,39.0,9.0,...,6.0,19.0,6.0,1.0,2019.0,6.0,26.0,16.0,14.0,3.0
75%,201611500.0,19.0,18.0,276.0,0.0,85.0,10.0,1021.0,83.0,19.0,...,19.0,26.0,19.0,3.0,2020.0,10.0,40.0,23.0,18.0,5.0
max,220101000.0,35.0,47.0,360.0,16.3,100.0,10.0,1048.0,100.0,41.0,...,35.0,71.0,41.0,9.0,2022.0,12.0,53.0,31.0,23.0,6.0


In [18]:
merged_data.to_csv("merged_data.csv", index=False)
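
Note that a CSV file does not preserve the category and datetime types set in this notebook, so they have to be restored when the file is read back. The sketch below shows one way to do that with an in-memory buffer standing in for `merged_data.csv`; the dataframe `toy` and the choice of columns are illustrative assumptions.

```python
import io
import pandas as pd

# Toy frame with one categorical and one datetime column.
toy = pd.DataFrame({
    "status": pd.Categorical(["Advised", "Assignment Completed"]),
    "datetime": pd.to_datetime(["2016-11-07 09:38:10", "2016-11-07 09:33:52"]),
})

# index=False keeps the row labels out of the file, so no extra
# unnamed index column appears when the file is loaded again.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the types while reading instead of converting afterwards.
reloaded = pd.read_csv(
    buffer,
    dtype={"status": "category"},
    parse_dates=["datetime"],
)
print(reloaded.dtypes)
```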