# Introduction
This notebook combines the cleaned versions of all datasets used in this project into a single joined dataframe. The result is then exported as a CSV file for use in the main project.

# Datasets
The following datasets will be merged:
- Weather Dataset
- MPD (Milwaukee Police Department) Dataset

# Imports
These are the libraries that will be relevant for merging the datasets.

In [None]:
import pandas as pd
import numpy as np

# Merging the Datasets
The following sections walk through the steps used to merge the weather dataset into the MPD dataset.

## Load the Cleaned Data
This section loads the cleaned CSV files and examines how pandas formats them on load.

### Load the MPD Data

In [None]:
mpd_data = pd.read_csv("mpd_data_cleaned.csv")

In [None]:
mpd_data.head()

### Load the Weather Data

In [None]:
weather_data = pd.read_csv("weather_data_cleaned.csv")

In [None]:
weather_data.head()

## Fix Feature Formats
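A CSV file does not store column types, so several features are likely loaded with the wrong type (for example, datetimes read in as plain strings). They must be converted to their correct types before the merge, since the join key has to be a true datetime column.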

### Fixing Feature Formats in the MPD Dataset

In [None]:
mpd_data.info()

Many of the features in the MPD dataset are improperly formatted. The `district`, `nature`, `status`, `primaryStreetName`, `primaryStreetSuffix`, `secondaryStreetName`, and `secondaryStreetSuffix` features need to be converted to categorical features, and the `datetime` feature needs to be converted to a datetime type.

In [None]:
mpd_data['district'] = mpd_data['district'].astype('category')
mpd_data['nature'] = mpd_data['nature'].astype('category')
mpd_data['status'] = mpd_data['status'].astype('category')
mpd_data['primaryStreetName'] = mpd_data['primaryStreetName'].astype('category')
mpd_data['primaryStreetSuffix'] = mpd_data['primaryStreetSuffix'].astype('category')
mpd_data['secondaryStreetName'] = mpd_data['secondaryStreetName'].astype('category')
mpd_data['secondaryStreetSuffix'] = mpd_data['secondaryStreetSuffix'].astype('category')
# infer_datetime_format is deprecated since pandas 2.0, where consistent format inference is the default
mpd_data['datetime'] = pd.to_datetime(mpd_data['datetime'])
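
The repeated conversions above could also be expressed as a single `astype` call that takes a column-to-type mapping. The following is a minimal sketch on a made-up miniature frame: `toy_mpd` and its values are illustrative assumptions, not the real MPD data, and only a subset of the columns is shown.

```python
import pandas as pd

# Hypothetical stand-in for mpd_data; the values are invented for illustration.
toy_mpd = pd.DataFrame({
    "district": [1, 2, 1],
    "nature": ["THEFT", "ALARM", "THEFT"],
    "datetime": ["2021-01-01 10:15:00", "2021-01-01 11:40:00", "2021-01-02 08:05:00"],
})

# Convert every listed column to a categorical type in one call.
categorical_columns = ["district", "nature"]
toy_mpd = toy_mpd.astype({col: "category" for col in categorical_columns})

# Parse the timestamp strings into a real datetime column.
toy_mpd["datetime"] = pd.to_datetime(toy_mpd["datetime"])

print(toy_mpd.dtypes)
```

Keeping the column names in one list makes it easier to see at a glance which features are treated as categorical.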

### Fixing Feature Formats in the Weather Dataset

In [None]:
weather_data.info()

Everything looks good with the weather data except for `datetime` and `weatherDesc`. The `weatherDesc` feature needs to become a categorical feature, and `datetime` needs to become a datetime feature.

In [None]:
weather_data['weatherDesc'] = weather_data['weatherDesc'].astype('category')
# infer_datetime_format is deprecated since pandas 2.0, where consistent format inference is the default
weather_data['datetime'] = pd.to_datetime(weather_data['datetime'])

## Sort the Datasets by Time
The datasets must be [sorted](https://www.geeksforgeeks.org/how-to-sort-a-pandas-dataframe-by-date/) in ascending order on the feature used to join them. The `datetime` feature is used as the sort key because the goal is to attach the weather at the time a crime occurred.

### Sort the MPD Data

In [None]:
mpd_data = mpd_data.sort_values(by='datetime')

### Sort the Weather Data

In [None]:
weather_data = weather_data.sort_values(by='datetime')
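
Before merging, it can be worth confirming that a key column really is in ascending order, since `pd.merge_asof` raises an error when its keys are unsorted or contain missing values. The sketch below uses a made-up frame (`toy` and its values are assumptions for illustration) and the `is_monotonic_increasing` property of a pandas Series.

```python
import pandas as pd

# Hypothetical, deliberately unsorted example data.
toy = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2021-01-02 08:00", "2021-01-01 10:00", "2021-01-01 12:00"]
    ),
    "value": [3, 1, 2],
})

print(toy["datetime"].is_monotonic_increasing)  # False: rows are out of order

# Sort by time and rebuild the index so it follows the new row order.
toy_sorted = toy.sort_values(by="datetime").reset_index(drop=True)

print(toy_sorted["datetime"].is_monotonic_increasing)  # True
print(toy_sorted["datetime"].isna().sum())             # 0 missing timestamps
```

The same two checks could be run on `mpd_data['datetime']` and `weather_data['datetime']` after sorting.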

## Join the Dataframes
The pandas [merge_asof](https://pandas.pydata.org/docs/dev/reference/api/pandas.merge_asof.html) function can be used to quickly [merge the two sorted datasets](https://www.geeksforgeeks.org/merge-two-pandas-dataframes-based-on-closest-datetime/) on the closest time. With `direction='nearest'`, each MPD row is matched to the weather observation whose `datetime` is closest to it, whether earlier or later. The merged dataframe should keep the same number of rows as `mpd_data` and gain every column of `weather_data` except the shared `datetime` key.

In [None]:
print(f"MPD Data Shape Before: {mpd_data.shape}")
print(f"Weather Data Shape Before: {weather_data.shape}")
merged_data = pd.merge_asof(mpd_data, weather_data, on='datetime', direction='nearest')
print(f"Merged Data Shape After: {merged_data.shape}")
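
To make the nearest-time matching concrete, here is a small sketch on invented data. The frames `calls` and `hourly`, the `tempC` column, and all values are assumptions for illustration only. The optional `tolerance` argument, which the merge above does not use, limits how far apart two timestamps may be and still match.

```python
import pandas as pd

# Hypothetical call records at arbitrary times.
calls = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2021-01-01 10:20", "2021-01-01 10:40", "2021-01-01 12:05"]
    ),
    "nature": ["THEFT", "ALARM", "THEFT"],
})

# Hypothetical hourly weather observations.
hourly = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2021-01-01 10:00", "2021-01-01 11:00", "2021-01-01 12:00"]
    ),
    "tempC": [1, 2, 3],
})

# Match each call to the closest observation, at most one hour away.
toy_merged = pd.merge_asof(
    calls,
    hourly,
    on="datetime",
    direction="nearest",
    tolerance=pd.Timedelta(hours=1),
)

print(toy_merged)
# Rows with no observation inside the tolerance would hold NaN in tempC.
print(toy_merged["tempC"].isna().sum())
```

Here the 10:20 call picks up the 10:00 reading and the 10:40 call picks up the 11:00 reading, so the row count stays at three and one weather column is added.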

# Conclusion
The datasets are now merged. The last steps are to show the final outputs of the [`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html), and [`.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) methods and to write the merged data to a new CSV file.

In [None]:
merged_data.head()

In [None]:
merged_data.info(verbose=True, show_counts=True)

In [None]:
merged_data.describe()

In [None]:
merged_data.to_csv("merged_data.csv", index=False)
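
One caveat is that a CSV file stores only text, so the categorical and datetime types set earlier are not preserved on disk. The sketch below shows one way the types might be restored when the file is read back. It uses a made-up two-row frame and an in-memory buffer in place of `merged_data.csv`; `toy` and its values are assumptions.

```python
import io

import pandas as pd

# Hypothetical two-row frame with the same kinds of columns as the merged data.
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2021-01-01 10:00", "2021-01-01 11:00"]),
    "weatherDesc": pd.Categorical(["Clear", "Rain"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the types explicitly while reading the CSV back.
reloaded = pd.read_csv(
    buffer,
    parse_dates=["datetime"],
    dtype={"weatherDesc": "category"},
)

print(reloaded.dtypes)
```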