# Introduction
This notebook combines the cleaned versions of all datasets used in this project into a single joined dataframe. The result is then exported as a CSV file for use in the main project.

# Datasets
The following datasets will be merged:
- Weather Dataset
- MPD (Milwaukee Police Department) Dataset

# Imports
These are the libraries that will be relevant for merging the datasets.

In [17]:
import pandas as pd
import numpy as np

# Merging the Datasets
The following sections walk through the steps used to combine the weather dataset into the MPD dataset.

## Load the Raw Data
This section loads the cleaned CSV files and examines how each one is formatted immediately after loading.

### Load the MPD Data

In [18]:
mpd_data = pd.read_csv("mpd_data_cleaned.csv")

In [19]:
mpd_data.head()

Unnamed: 0,district,nature,status,call_id,datetime,traffic_crime,weapon_crime,isCorner,houseNumber,primaryStreetName,primaryStreetSuffix,secondaryStreetName,secondaryStreetSuffix
0,4,ACC PI,Service in Progress,191411633,2019-05-21 15:19:03,False,False,False,7420.0,W GOOD HOPE,RD,,
1,3,TRAFFIC STOP,City Citation(s) Issued,191411672,2019-05-21 15:24:30,True,False,False,1421.0,N 27TH,ST,,
2,7,SUBJ WANTED,Assignment Completed,191411674,2019-05-21 15:25:46,False,False,False,4054.0,N 71ST,ST,,
3,2,SPECIAL ASSIGN,Service in Progress,191412545,2019-05-21 20:46:28,False,False,False,245.0,W LINCOLN,AV,,
4,3,TRBL W/SUBJ,Unable to Locate Complainant,191412465,2019-05-21 20:50:03,False,False,False,1721.0,W CANAL,ST,,


### Load the Weather Data

In [20]:
weather_data = pd.read_csv("weather_data_cleaned.csv")

In [21]:
weather_data.head()

Unnamed: 0,loc_id,isdaytime,tempC,windspeedKmph,winddirdegree,weatherDesc,precipMM,humidity,visibilityKm,pressureMB,cloudcover,HeatIndexC,DewPointC,WindChillC,WindGustKmph,FeelsLikeC,uvIndex,datetime
0,1,False,14,11,241,Clear,0.0,77,10,1016,3,14,11,14,23,14,1,2008-07-01 00:00:00
1,1,False,14,12,246,Clear,0.0,77,10,1016,10,14,10,14,25,14,1,2008-07-01 01:00:00
2,1,False,14,12,251,Clear,0.0,76,10,1015,17,14,10,14,26,14,1,2008-07-01 02:00:00
3,1,False,15,13,256,Clear,0.0,76,10,1015,24,15,10,14,27,14,1,2008-07-01 03:00:00
4,1,False,15,12,255,Clear,0.0,74,10,1016,22,15,10,14,26,14,1,2008-07-01 04:00:00


## Fix Feature Formats
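Reading from CSV does not preserve column types, so several features are likely to have been loaded with the wrong type. They must be converted to their correct types for the merge to work; in particular, the datetime key must be a datetime type in both dataframes.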

### Fixing Feature Formats in the MPD Dataset

In [22]:
mpd_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4027695 entries, 0 to 4027694
Data columns (total 13 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   district               object 
 1   nature                 object 
 2   status                 object 
 3   call_id                int64  
 4   datetime               object 
 5   traffic_crime          bool   
 6   weapon_crime           bool   
 7   isCorner               bool   
 8   houseNumber            float64
 9   primaryStreetName      object 
 10  primaryStreetSuffix    object 
 11  secondaryStreetName    object 
 12  secondaryStreetSuffix  object 
dtypes: bool(3), float64(1), int64(1), object(8)
memory usage: 318.8+ MB


Many of the features in the MPD dataset were loaded as generic `object` columns. The district, nature, status, primaryStreetName, primaryStreetSuffix, secondaryStreetName, and secondaryStreetSuffix features all need to become categorical features. The datetime feature needs to be converted to a datetime type.

In [23]:
mpd_data['district'] = mpd_data['district'].astype('category')
mpd_data['nature'] = mpd_data['nature'].astype('category')
mpd_data['status'] = mpd_data['status'].astype('category')
mpd_data['primaryStreetName'] = mpd_data['primaryStreetName'].astype('category')
mpd_data['primaryStreetSuffix'] = mpd_data['primaryStreetSuffix'].astype('category')
mpd_data['secondaryStreetName'] = mpd_data['secondaryStreetName'].astype('category')
mpd_data['secondaryStreetSuffix'] = mpd_data['secondaryStreetSuffix'].astype('category')
# infer_datetime_format is deprecated as of pandas 2.0; the format is inferred by default
mpd_data['datetime'] = pd.to_datetime(mpd_data['datetime'])
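
As a minimal sketch of the same categorical conversion, the cell below passes a column-to-dtype mapping to `astype` so that several columns are converted in one call, and compares memory use before and after. The `toy` dataframe and its values are invented for illustration and are not taken from the MPD data.

```python
import pandas as pd

# Toy stand-in for mpd_data; the values are made up for illustration.
toy = pd.DataFrame({
    "nature": ["TRAFFIC STOP", "FOLLOW UP", "TRAFFIC STOP", "FOLLOW UP"] * 1000,
    "status": ["Advised", "Assignment Completed"] * 2000,
})

bytes_before = toy.memory_usage(deep=True).sum()

# Convert several text columns at once with a column -> dtype mapping.
categorical_columns = ["nature", "status"]
toy = toy.astype({column: "category" for column in categorical_columns})

bytes_after = toy.memory_usage(deep=True).sum()
print(f"before: {bytes_before} bytes, after: {bytes_after} bytes")
```

Columns with few distinct values relative to their length, such as the call nature or status, tend to benefit most from the category type.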

### Fixing Feature Formats in the Weather Dataset

In [24]:
weather_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119112 entries, 0 to 119111
Data columns (total 18 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   loc_id         119112 non-null  int64  
 1   isdaytime      119112 non-null  bool   
 2   tempC          119112 non-null  int64  
 3   windspeedKmph  119112 non-null  int64  
 4   winddirdegree  119112 non-null  int64  
 5   weatherDesc    119112 non-null  object 
 6   precipMM       119112 non-null  float64
 7   humidity       119112 non-null  int64  
 8   visibilityKm   119112 non-null  int64  
 9   pressureMB     119112 non-null  int64  
 10  cloudcover     119112 non-null  int64  
 11  HeatIndexC     119112 non-null  int64  
 12  DewPointC      119112 non-null  int64  
 13  WindChillC     119112 non-null  int64  
 14  WindGustKmph   119112 non-null  int64  
 15  FeelsLikeC     119112 non-null  int64  
 16  uvIndex        119112 non-null  int64  
 17  datetime       119112 non-nul

Everything looks good with the weather data except for datetime and weatherDesc. The weatherDesc feature needs to become a categorical feature, and datetime needs to be converted to a datetime type.

In [25]:
weather_data['weatherDesc'] = weather_data['weatherDesc'].astype('category')
# infer_datetime_format is deprecated as of pandas 2.0; the format is inferred by default
weather_data['datetime'] = pd.to_datetime(weather_data['datetime'])
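
If stricter parsing is wanted, one option is to pass an explicit `format` to `pd.to_datetime`. The sketch below assumes the `YYYY-MM-DD HH:MM:SS` layout seen in the `.head()` output; the sample strings, including the deliberately invalid one, are made up. With `errors="coerce"`, values that do not match the format become `NaT` rather than raising an error.

```python
import pandas as pd

# Sample strings in the layout shown by .head(); the last one is intentionally invalid.
timestamps = pd.Series([
    "2008-07-01 00:00:00",
    "2008-07-01 01:00:00",
    "not a timestamp",
])

# An explicit format avoids guessing; unparseable values become NaT.
parsed = pd.to_datetime(timestamps, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed)
print("unparseable rows:", parsed.isna().sum())
```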

## Sort the Datasets by Time
Both datasets must be [sorted](https://www.geeksforgeeks.org/how-to-sort-a-pandas-dataframe-by-date/) in ascending order on the key used to join them. The datetime feature is used as that key because the goal is to include the weather at the time a crime occurred.

### Sort the MPD Data

In [26]:
mpd_data = mpd_data.sort_values(by='datetime')

### Sort the Weather Data

In [27]:
weather_data = weather_data.sort_values(by='datetime')
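
A quick way to confirm that a dataframe is ready for the join is to check that its key is monotonically increasing. The sketch below uses a small made-up `calls` dataframe as a stand-in for `mpd_data`.

```python
import pandas as pd

# Made-up call records, deliberately out of order.
calls = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2019-05-21 20:46:28",
        "2019-05-21 15:19:03",
        "2019-05-21 15:24:30",
    ]),
    "call_id": [3, 1, 2],
})

print(calls["datetime"].is_monotonic_increasing)  # unsorted input

# Sort by the join key and rebuild the index so it follows the new order.
calls = calls.sort_values(by="datetime").reset_index(drop=True)

print(calls["datetime"].is_monotonic_increasing)  # sorted input
```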

## Join the Dataframes
The pandas [merge_asof](https://pandas.pydata.org/docs/dev/reference/api/pandas.merge_asof.html) function can be used to quickly [merge the two sorted datasets](https://www.geeksforgeeks.org/merge-two-pandas-dataframes-based-on-closest-datetime/) on the closest time. With `direction='nearest'`, each MPD row is matched to the weather row whose datetime is closest to it, which here means the nearest hourly observation. The merged dataframe should keep the same number of rows as mpd_data and gain the columns of weather_data, with the shared datetime key appearing only once.

In [28]:
print("MPD Data Shape Before: %s" % ((mpd_data.shape), ))
print("Weather Data Shape Before: %s" % ((weather_data.shape), ))
merged_data = pd.merge_asof(mpd_data, weather_data, on='datetime', direction='nearest')  
print("Merged Data Shape After: %s" % ((merged_data.shape), ))

MPD Data Shape Before: (4027695, 13)
Weather Data Shape Before: (119112, 18)
Merged Data Shape After: (4027695, 30)
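
To make the matching behavior concrete, here is a small sketch of `pd.merge_asof` with `direction='nearest'` on made-up data. The `calls` and `weather` dataframes and their values are illustrative only.

```python
import pandas as pd

# Two made-up calls and two made-up hourly weather observations.
calls = pd.DataFrame({
    "datetime": pd.to_datetime(["2019-05-21 15:19:03", "2019-05-21 15:40:00"]),
    "call_id": [1, 2],
})
weather = pd.DataFrame({
    "datetime": pd.to_datetime(["2019-05-21 15:00:00", "2019-05-21 16:00:00"]),
    "tempC": [14, 15],
})

# Each call picks the weather row with the closest timestamp:
# 15:19:03 is closer to 15:00, while 15:40:00 is closer to 16:00.
merged = pd.merge_asof(calls, weather, on="datetime", direction="nearest")

print(merged)
```

The result keeps one row per call and carries the left dataframe's datetime values, so the weather observation time itself is not retained in the merged output.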


# Conclusion
At this point the datasets are merged. The last steps are to show the final outputs of the [`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html), and [`.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) methods and write the merged data to a new CSV file.

In [29]:
merged_data.head()

Unnamed: 0,district,nature,status,call_id,datetime,traffic_crime,weapon_crime,isCorner,houseNumber,primaryStreetName,...,humidity,visibilityKm,pressureMB,cloudcover,HeatIndexC,DewPointC,WindChillC,WindGustKmph,FeelsLikeC,uvIndex
0,,FOLLOW UP,Assignment Completed,163120850,2016-11-07 09:33:52,False,False,False,2920.0,N 4TH,...,64,10,1026,1,14,7,14,18,14,4
1,,PROPERTY DAMAGE,Advised/Referral,163120774,2016-11-07 09:33:55,False,False,False,5929.0,W BLUE MOUND,...,64,10,1026,1,14,7,14,18,14,4
2,,VACANT HOUSE CHK,Assignment Completed,163120851,2016-11-07 09:34:58,False,False,False,3425.0,N 93RD,...,64,10,1026,1,14,7,14,18,14,4
3,,FOLLOW UP,Assignment Completed,163120853,2016-11-07 09:36:20,False,False,False,245.0,W LINCOLN,...,64,10,1026,1,14,7,14,18,14,4
4,,TRAFFIC STOP,Advised,163120857,2016-11-07 09:38:10,True,False,False,765.0,N 25TH,...,64,10,1026,1,14,7,14,18,14,4


In [30]:
merged_data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4027695 entries, 0 to 4027694
Data columns (total 30 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   district               3937463 non-null  category      
 1   nature                 4027695 non-null  category      
 2   status                 4027695 non-null  category      
 3   call_id                4027695 non-null  int64         
 4   datetime               4027695 non-null  datetime64[ns]
 5   traffic_crime          4027695 non-null  bool          
 6   weapon_crime           4027695 non-null  bool          
 7   isCorner               4027695 non-null  bool          
 8   houseNumber            3426083 non-null  float64       
 9   primaryStreetName      4016778 non-null  category      
 10  primaryStreetSuffix    3844881 non-null  category      
 11  secondaryStreetName    549421 non-null   category      
 12  secondaryStreetSuffix  53185

In [31]:
merged_data.describe()

Unnamed: 0,call_id,houseNumber,loc_id,tempC,windspeedKmph,winddirdegree,precipMM,humidity,visibilityKm,pressureMB,cloudcover,HeatIndexC,DewPointC,WindChillC,WindGustKmph,FeelsLikeC,uvIndex
count,4027695.0,3426083.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0,4027695.0
mean,189180800.0,3367.593,1.0,9.33296,13.96831,197.8097,0.08068526,72.52384,9.280653,1016.711,46.2748,9.741805,4.295812,7.040703,20.30895,7.32946,2.161411
std,14593090.0,2415.387,0.0,10.99667,6.506821,93.0129,0.3459743,15.60838,1.861185,7.728984,36.20323,11.50243,10.50146,13.14242,9.506874,13.56433,1.869588
min,163081500.0,1.0,1.0,-31.0,0.0,0.0,0.0,16.0,0.0,986.0,0.0,-30.0,-34.0,-44.0,0.0,-44.0,1.0
25%,173212900.0,1614.0,1.0,1.0,9.0,126.0,0.0,61.0,10.0,1012.0,11.0,1.0,-4.0,-3.0,13.0,-3.0,1.0
50%,190670700.0,2920.0,1.0,9.0,13.0,214.0,0.0,74.0,10.0,1016.0,39.0,9.0,4.0,6.0,19.0,6.0,1.0
75%,201611500.0,4600.0,1.0,19.0,18.0,276.0,0.0,85.0,10.0,1021.0,83.0,19.0,14.0,19.0,26.0,19.0,3.0
max,220101000.0,646050.0,1.0,35.0,47.0,360.0,16.3,100.0,10.0,1048.0,100.0,41.0,27.0,35.0,71.0,41.0,9.0


In [32]:
merged_data.to_csv("merged_data.csv", index=False)
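
Because CSV files do not store column types, the category and datetime conversions made above are not preserved in the exported file. A minimal sketch of one way to restore them when reading the file back is shown below; it uses an in-memory buffer and a made-up two-row dataframe in place of the real `merged_data.csv`.

```python
import io

import pandas as pd

# Made-up stand-in for merged_data with one categorical and one datetime column.
merged_toy = pd.DataFrame({
    "nature": pd.Categorical(["FOLLOW UP", "TRAFFIC STOP"]),
    "datetime": pd.to_datetime(["2016-11-07 09:33:52", "2016-11-07 09:38:10"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
merged_toy.to_csv(buffer, index=False)
buffer.seek(0)

# Declare the intended types at read time so they do not have to be fixed afterwards.
restored = pd.read_csv(
    buffer,
    dtype={"nature": "category"},
    parse_dates=["datetime"],
)

print(restored.dtypes)
```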