# Introduction
This notebook cleans the weather dataset and explains some of the cleaning decisions along the way.

# Dataset Description
Weather data was purchased from [World Weather Online](https://www.worldweatheronline.com/v2/historical-weather.aspx?q=53214). It contains historical weather data for Milwaukee from 2008 until 2022, past the final date of the police call data, with a granularity of 1 hour. It consists of date, time, whether or not it's 'daytime', temperature in C and F, wind speed in mph and kph, wind direction in degrees and as a 16-point compass direction, weather code, a URL for a weather icon, weather description in English, precipitation in mm and inches, humidity, visibility in km and mi, atmospheric pressure in millibars (mb) and inches, cloud cover %, heat index in C and F, dew point in C and F, wind chill in C and F, wind gust in kph and mph, feels-like temperature in C and F, and UV index. More documentation about the data can be found [here](https://www.worldweatheronline.com/hwd/hfw.aspx).

# Imports
These are the libraries that will be relevant for cleaning this dataset.

In [None]:
import pandas as pd
import numpy as np

# Cleaning the Dataset
The following sections walk through the steps used to clean the weather dataset.

## Load the Raw Data
This section loads the raw data and examines how it is originally formatted.

In [None]:
weather_data = pd.read_csv("weather_data_1hr.csv")

In [None]:
weather_data.shape

In [None]:
weather_data.head(10)

In [None]:
weather_data.info()

The weather dataset has 119,112 entries. At first this number seems small, but some quick math (divide by 24 hours per day, then by 365 days per year) shows that the dataset covers roughly 13.6 years, beginning in 2008.
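
The quick math above can be checked directly. This is a minimal sketch with the row count hard-coded from the text rather than read from the file, and it ignores leap days, so the result is only approximate.

```python
# Row count reported by weather_data.shape (hard-coded here for illustration)
n_rows = 119112

hours_per_day = 24
days_per_year = 365  # ignores leap days, so the result is approximate

n_days = n_rows / hours_per_day
n_years = n_days / days_per_year

print(f"{n_days:.0f} days, about {n_years:.1f} years")
```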

It is immediately apparent that the dataset contains many correlated features:
- tempC and tempF
- windspeedMiles and windspeedKmph
- winddirdegree and winddir16point
- precipMM and precipInches
- visibilityKm and visibilityMiles
- pressureMB and pressureInches
- HeatIndexC and HeatIndexF
- DewPointC and DewPointF
- WindChillC and WindChillF
- WindGustMiles and WindGustKmph
- FeelsLikeC and FeelsLikeF
- weatherCode and weatherDesc
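
One way to confirm that a numeric pair is redundant is to compute its correlation. The sketch below uses a small made-up frame instead of the real file, and it assumes the Fahrenheit column is a rounded conversion of the Celsius column, which the source data may or may not follow exactly. It applies only to numeric pairs, not to weatherCode and weatherDesc.

```python
import pandas as pd

# Toy data standing in for the real file; values are made up for illustration
toy = pd.DataFrame({"tempC": [-5, 0, 10, 21, 30]})

# Assumption: the imperial column is a rounded conversion of the metric one
toy["tempF"] = (toy["tempC"] * 9 / 5 + 32).round().astype(int)

# A Pearson correlation close to 1 means the second column adds almost no information
corr = toy["tempC"].corr(toy["tempF"])
print(f"correlation: {corr:.4f}")
```

On the real data the same check would be `weather_data['tempC'].corr(weather_data['tempF'])`.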

## Dropping Correlated Features
All features using imperial units will be dropped in favor of their metric counterparts. This preserves as much information as possible, since the imperial features are correlated with the metric features while also being less precise. The winddirdegree feature will be retained instead of winddir16point for the same reason. The weatherIconUrl feature will also be dropped, as it adds little descriptive value to the data. The weatherCode feature will be dropped in favor of weatherDesc, since weatherDesc is easier to read.

In [None]:
print("Data Shape Before: %s" % ((weather_data.shape), ))
# Redundant imperial-unit columns plus the icon URL and numeric weather code
columns_to_drop = [
    'tempF', 'windspeedMiles', 'winddir16point', 'precipInches',
    'visibilityMiles', 'pressureInches', 'HeatIndexF', 'DewPointF',
    'WindChillF', 'WindGustMiles', 'FeelsLikeF', 'weatherIconUrl',
    'weatherCode',
]
weather_data = weather_data.drop(columns=columns_to_drop)
print("Data Shape After: %s" % ((weather_data.shape), ))

## Examining Feature Types
The smaller dataset should now be examined to determine what features need to be updated.

In [None]:
weather_data.info(verbose=True, show_counts=True)

The features in the revised data are all the correct types except for date, isdaytime, and weatherDesc. Date needs to be converted into a datetime, isdaytime needs to be a boolean, and weatherDesc needs to be categorical.

## Fixing Data Types
The aforementioned features will now be converted to more suitable pandas types. A helper function will be used to convert the isdaytime column.

In [None]:
def isdaytime_to_boolean(val: str) -> bool:
    """
    Convert a 'yes'/'no' string into a boolean.
    :param val: (str) no or yes
    :return: (bool) False if no, True for any other value
    :auth: Grant Fass
    :since: 8 February 2022
    """
    if val == 'no':
        return False
    return True

In [None]:
# infer_datetime_format is deprecated since pandas 2.0, where format inference is the default
weather_data['date'] = pd.to_datetime(weather_data['date'])
weather_data['isdaytime'] = weather_data['isdaytime'].map(isdaytime_to_boolean).astype('bool')
weather_data['weatherDesc'] = weather_data['weatherDesc'].astype('category')
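
The helper above treats every value other than 'no' as True. As an alternative sketch, not the approach used in this notebook, a dictionary passed to `Series.map` leaves unexpected values as NaN so they can be counted before casting. The sample values, including 'maybe', are made up for illustration.

```python
import pandas as pd

# Made-up values; 'maybe' stands in for any unexpected entry
toy = pd.Series(["yes", "no", "yes", "maybe"])

# Values missing from the dictionary become NaN instead of silently becoming True
mapped = toy.map({"yes": True, "no": False})

n_unexpected = int(mapped.isna().sum())
print(n_unexpected)
```

If the count is zero, the column can be cast with `.astype('bool')` safely.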

## Fixing Time
Lastly, the time feature needs to be modified. Currently this feature starts at 0 for midnight and increments by 100 per hour, resetting to 0 at midnight each day. Dividing it by 100 makes it denote the hour of the day directly.
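
The encoding can be illustrated on a few made-up values. This sketch uses integer division, an assumption that differs slightly from plain division: it keeps an integer dtype, so no later cast to int is needed.

```python
import pandas as pd

# Made-up sample of the raw encoding: 0 is midnight, 100 is 1 AM, 2300 is 11 PM
raw_time = pd.Series([0, 100, 1200, 2300])

# Integer division keeps an integer dtype, whereas / would produce floats
hours = raw_time // 100

# Sanity check: every value should be a valid hour of the day
all_valid = bool(hours.between(0, 23).all())
print(hours.tolist(), all_valid)
```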

The next step is to merge the date and time columns into a single datetime feature. [This](https://stackoverflow.com/a/44648068) Stack Overflow post helped with using apply to map across multiple features. [This](https://stackoverflow.com/a/17152848) Stack Overflow post helped with replacing hours in a [timestamp](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.replace.html). [This](https://stackoverflow.com/a/25129655) Stack Overflow post also provided some assistance. The original date and time features will be dropped after they are merged.

In [None]:
weather_data['time'] = weather_data['time'] / 100

In [None]:
weather_data['datetime'] = weather_data.apply(lambda t: t['date'].replace(hour=int(t['time'])), axis=1)
weather_data = weather_data.drop('date', axis=1)
weather_data = weather_data.drop('time', axis=1)
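
A row-wise `apply` calls a Python function once per row, which can be slow on a dataset of this size. A possible vectorized alternative, sketched here on made-up rows, adds the hours as a timedelta. It assumes every date value sits at midnight, since it adds hours rather than replacing them.

```python
import pandas as pd

# Made-up rows; dates are assumed to be at midnight
toy = pd.DataFrame({
    "date": pd.to_datetime(["2008-07-01", "2008-07-01", "2008-07-02"]),
    "time": [0.0, 13.0, 23.0],  # hour of day, after dividing by 100
})

# Adding a timedelta column avoids a Python-level call per row
toy["datetime"] = toy["date"] + pd.to_timedelta(toy["time"], unit="h")
print(toy["datetime"].tolist())
```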

# Conclusion

At this point the weather dataset is clean. The last steps are to show the final outputs of the [`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html), and [`.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) methods and to write the cleaned data to a new CSV file.

In [None]:
weather_data.head()

In [None]:
weather_data.info()

In [None]:
weather_data.describe()

In [None]:
weather_data.to_csv("weather_data_cleaned.csv", index=False)
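
CSV stores plain text, so the datetime and categorical types set above are not preserved in the file. A minimal sketch of restoring them on load follows. It uses made-up rows and an in-memory buffer in place of the real file.

```python
import io
import pandas as pd

# Made-up rows standing in for the cleaned data
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2008-07-01 00:00", "2008-07-01 01:00"]),
    "isdaytime": [False, True],
    "weatherDesc": pd.Categorical(["Clear", "Sunny"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Dtypes are restored explicitly when reading the CSV back
restored = pd.read_csv(
    buffer,
    parse_dates=["datetime"],
    dtype={"weatherDesc": "category"},
)
print(restored.dtypes)
```

The same `parse_dates` and `dtype` arguments would apply when reading "weather_data_cleaned.csv" in a later notebook.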