# Introduction
This notebook cleans the weather dataset and explains the reasoning behind some of the cleaning decisions.

# Dataset Description
Weather data was purchased from [World Weather Online](https://www.worldweatheronline.com/v2/historical-weather.aspx?q=53214). It contains historical weather data for Milwaukee from 2008 until 2022, which extends past the final date of the police call data, with a granularity of 1 hour. It consists of date, time, whether or not it's 'daytime', temperature in C and F, wind speed in mph and kph, wind direction in degrees and 16-point compass notation, weather code, a URL for a weather icon, weather description in English, precipitation in mm and inches, humidity, visibility in km and mi, atmospheric pressure in millibars (mb) and inches, cloud cover %, heat index in C and F, dew point in C and F, wind chill in C and F, wind gust in kph and mph, feels-like temperature in C and F, and UV index. More documentation about the data can be found [here](https://www.worldweatheronline.com/hwd/hfw.aspx).

# Imports
These are the libraries that will be relevant for cleaning this dataset.

In [1]:
import pandas as pd
import numpy as np

# Cleaning the Dataset
The following sections walk through the steps used to clean the weather dataset.

## Load the Raw Data
This section loads the raw data and examines how it is originally formatted.

In [2]:
weather_data = pd.read_csv("weather_data_1hr.csv")

In [3]:
weather_data.shape

(119112, 32)

In [4]:
weather_data.head(10)

Unnamed: 0,loc_id,date,time,isdaytime,tempC,tempF,windspeedMiles,windspeedKmph,winddirdegree,winddir16point,...,HeatIndexF,DewPointC,DewPointF,WindChillC,WindChillF,WindGustMiles,WindGustKmph,FeelsLikeC,FeelsLikeF,uvIndex
0,1,2008-07-01,0,no,14,58,7,11,241,WSW,...,58,11,51,14,57,15,23,14,57,1
1,1,2008-07-01,100,no,14,58,7,12,246,WSW,...,58,10,51,14,57,15,25,14,57,1
2,1,2008-07-01,200,no,14,58,8,12,251,WSW,...,58,10,51,14,57,16,26,14,57,1
3,1,2008-07-01,300,no,15,58,8,13,256,WSW,...,58,10,50,14,57,17,27,14,57,1
4,1,2008-07-01,400,no,15,59,8,12,255,WSW,...,59,10,50,14,58,16,26,14,58,1
5,1,2008-07-01,500,no,15,59,7,12,253,WSW,...,59,10,50,15,58,15,24,15,58,1
6,1,2008-07-01,600,no,15,59,7,11,252,WSW,...,59,10,50,15,59,14,23,15,59,1
7,1,2008-07-01,700,no,17,63,7,12,253,WSW,...,63,11,51,17,63,14,23,17,63,1
8,1,2008-07-01,800,yes,19,66,8,13,254,WSW,...,66,11,53,19,66,15,23,19,66,5
9,1,2008-07-01,900,yes,21,69,9,14,254,WSW,...,69,12,54,21,69,15,24,21,69,6


In [5]:
weather_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119112 entries, 0 to 119111
Data columns (total 32 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   loc_id           119112 non-null  int64  
 1   date             119112 non-null  object 
 2   time             119112 non-null  int64  
 3   isdaytime        119112 non-null  object 
 4   tempC            119112 non-null  int64  
 5   tempF            119112 non-null  int64  
 6   windspeedMiles   119112 non-null  int64  
 7   windspeedKmph    119112 non-null  int64  
 8   winddirdegree    119112 non-null  int64  
 9   winddir16point   119112 non-null  object 
 10  weatherCode      119112 non-null  int64  
 11  weatherIconUrl   119112 non-null  object 
 12  weatherDesc      119112 non-null  object 
 13  precipMM         119112 non-null  float64
 14  precipInches     119112 non-null  float64
 15  humidity         119112 non-null  int64  
 16  visibilityKm     119112 non-null  int6

The weather dataset has 119112 entries. At first this number seems small, but some quick math (119112 rows divided by 24 hours per day, then by 365 days per year, gives approximately 13.6) shows that the dataset covers a little under 14 years beginning in 2008.
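The quick math above can be reproduced in a few lines. This is only a rough sanity check: it ignores leap days, and the variable names are chosen here for illustration.

```python
# Row count reported by weather_data.shape above
n_rows = 119112

hours_per_day = 24
days_per_year = 365  # leap days are ignored for this rough estimate

n_days = n_rows / hours_per_day
n_years = n_days / days_per_year

print(f"{n_days:.0f} days, about {n_years:.1f} years")
```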

Immediately recognizable is that the dataset contains many pairs of correlated features:
- tempC and tempF
- windspeedMiles and windspeedKmph
- winddirdegree and winddir16point
- precipMM and precipInches
- visibilityKm and visibilityMiles
- pressureMB and pressureInches
- HeatIndexC and HeatIndexF
- DewPointC and DewPointF
- WindChillC and WindChillF
- WindGustMiles and WindGustKmph
- FeelsLikeC and FeelsLikeF
- weatherCode and weatherDesc
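As a small illustration of how strongly such a pair is related, the sketch below computes the Pearson correlation between tempC and tempF using only the ten rows shown in the `head(10)` output above. The `sample` frame and `pair_corr` name are introduced here for illustration; on the full dataset the same check would be run on `weather_data` directly.

```python
import pandas as pd

# First ten rows of tempC / tempF, copied from the head(10) output above
sample = pd.DataFrame({
    "tempC": [14, 14, 14, 15, 15, 15, 15, 17, 19, 21],
    "tempF": [58, 58, 58, 58, 59, 59, 59, 63, 66, 69],
})

# Pearson correlation between the two unit variants of the same measurement
pair_corr = sample["tempC"].corr(sample["tempF"])
print(f"Pearson correlation: {pair_corr:.3f}")
```

The value is close to, but not exactly, 1 because both columns are rounded to whole numbers independently.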

## Dropping Correlated Features
Every feature that uses imperial units will be dropped in favor of its metric counterpart. This preserves as much information as possible, since the imperial features are correlated with the metric features while also being less precise. The winddirdegree feature will be retained instead of winddir16point for the same reason. The weatherIconUrl feature will also be dropped because it does not describe the weather itself. The weatherCode feature will be dropped in favor of weatherDesc since weatherDesc is more easily readable.

In [6]:
print(f"Data Shape Before: {weather_data.shape}")
# Imperial-unit duplicates, the 16-point wind direction, the icon URL, and the weather code
to_drop = [
    'tempF', 'windspeedMiles', 'winddir16point', 'precipInches',
    'visibilityMiles', 'pressureInches', 'HeatIndexF',
    'DewPointF', 'WindChillF', 'WindGustMiles', 'FeelsLikeF',
    'weatherIconUrl', 'weatherCode',
]
weather_data = weather_data.drop(columns=to_drop)
print(f"Data Shape After: {weather_data.shape}")

Data Shape Before: (119112, 32)
Data Shape After: (119112, 19)


## Dropping Other Features
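The loc_id feature should also be dropped. It identifies the location, and since this dataset only covers Milwaukee it does not help distinguish between entries.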

In [7]:
weather_data = weather_data.drop(columns=['loc_id'], axis=1)
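Before dropping a column such as loc_id, it can be worth confirming that it really holds a single value. The sketch below shows one way to do that on a hypothetical toy frame (`toy` and `constant_cols` are illustrative names, not part of this notebook); the same expression could be applied to `weather_data` prior to the drop.

```python
import pandas as pd

# Hypothetical toy frame standing in for the weather data
toy = pd.DataFrame({"loc_id": [1, 1, 1], "tempC": [14, 15, 17]})

# A column with exactly one distinct value carries no information
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_cols)
```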

## Examining Feature Types
The smaller dataset should now be examined to determine which features need their types updated.

In [8]:
weather_data.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119112 entries, 0 to 119111
Data columns (total 18 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   date           119112 non-null  object 
 1   time           119112 non-null  int64  
 2   isdaytime      119112 non-null  object 
 3   tempC          119112 non-null  int64  
 4   windspeedKmph  119112 non-null  int64  
 5   winddirdegree  119112 non-null  int64  
 6   weatherDesc    119112 non-null  object 
 7   precipMM       119112 non-null  float64
 8   humidity       119112 non-null  int64  
 9   visibilityKm   119112 non-null  int64  
 10  pressureMB     119112 non-null  int64  
 11  cloudcover     119112 non-null  int64  
 12  HeatIndexC     119112 non-null  int64  
 13  DewPointC      119112 non-null  int64  
 14  WindChillC     119112 non-null  int64  
 15  WindGustKmph   119112 non-null  int64  
 16  FeelsLikeC     119112 non-null  int64  
 17  uvIndex        119112 non-nul

The features in the revised data all have appropriate types except for date, isdaytime, and weatherDesc. The date feature needs to be converted into a datetime, isdaytime needs to be a boolean, and weatherDesc needs to be categorical.

## Fixing Data Types
The aforementioned features will now be changed into more relevant pandas types. A helper function will be used to convert the isdaytime column.

In [9]:
def isdaytime_to_boolean(val: str) -> bool:
    """
    method to convert a string into a boolean
    :param val: (str) no or yes
    :return: (bool) False if no, True if yes
    :auth: Grant Fass
    :since: 8 February 2022
    """
    if val == 'no':
        return False
    return True

In [10]:
# infer_datetime_format is deprecated as of pandas 2.0; the format is now inferred by default
weather_data['date'] = pd.to_datetime(weather_data['date'])
weather_data['isdaytime'] = weather_data['isdaytime'].map(isdaytime_to_boolean).astype('bool')
weather_data['weatherDesc'] = weather_data['weatherDesc'].astype('category')
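One caveat of the helper above is that any value other than 'no' becomes True. As a possible alternative, sketched here on a hypothetical series (the 'maybe' entry is invented to show the behavior), mapping with a dictionary leaves unexpected values as missing so they can be inspected before casting to bool.

```python
import pandas as pd

# Hypothetical raw values; 'maybe' stands in for an unexpected entry
raw = pd.Series(["no", "yes", "yes", "maybe"], name="isdaytime")

# Values not present in the dictionary are mapped to NaN instead of True
mapped = raw.map({"no": False, "yes": True})

# Collect the original values that could not be mapped
unexpected = raw[mapped.isna()]
print(unexpected.tolist())
```

If `unexpected` is empty, `mapped.astype('bool')` gives the same result as the helper function.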

## Fixing Time
Lastly, the time feature needs to be modified. Currently this feature starts at 0 for midnight and increments by 100 per hour, resetting to 0 at the first entry of each day. Dividing it by 100 converts it so that it directly denotes the hour of the day.

The next step is to merge the date and time columns into a single datetime feature. [This](https://stackoverflow.com/a/44648068) Stack Overflow post helped with using apply to map across multiple features. [This](https://stackoverflow.com/a/17152848) Stack Overflow post helped with replacing hours in a [timestamp](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.replace.html). [This](https://stackoverflow.com/a/25129655) Stack Overflow post also provided some assistance. The original date and time features will be dropped after the features are merged.

In [11]:
# Integer division keeps the hour as an int (0-23) instead of a float
weather_data['time'] = weather_data['time'] // 100

In [12]:
weather_data['datetime'] = weather_data.apply(lambda t: t['date'].replace(hour=int(t['time'])), axis=1)
# The original date and time columns are no longer needed
weather_data = weather_data.drop(columns=['date', 'time'])
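The row-wise apply above works, but it calls a Python function once per row. A vectorized alternative, sketched here on a hypothetical toy frame that still uses the raw 0, 100, ..., 2300 time encoding, adds the hour to the date as a timedelta. The `toy` and `hours` names are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame with the raw time encoding (0, 100, ..., 2300)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2008-07-01", "2008-07-01", "2008-07-02"]),
    "time": [0, 100, 2300],
})

# Convert the encoded time to whole hours, then add it to the date
hours = toy["time"] // 100
toy["datetime"] = toy["date"] + pd.to_timedelta(hours, unit="h")

print(toy["datetime"].tolist())
```

If the time column has already been divided by 100, the `// 100` step is skipped and the column is passed to `pd.to_timedelta` directly.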

# Conclusion

At this point the weather dataset is fully cleaned. The last steps are to show the final outputs of the [`.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html), [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html), and [`.describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) methods and to write the cleaned data to a new CSV file.

In [13]:
weather_data.head()

Unnamed: 0,isdaytime,tempC,windspeedKmph,winddirdegree,weatherDesc,precipMM,humidity,visibilityKm,pressureMB,cloudcover,HeatIndexC,DewPointC,WindChillC,WindGustKmph,FeelsLikeC,uvIndex,datetime
0,False,14,11,241,Clear,0.0,77,10,1016,3,14,11,14,23,14,1,2008-07-01 00:00:00
1,False,14,12,246,Clear,0.0,77,10,1016,10,14,10,14,25,14,1,2008-07-01 01:00:00
2,False,14,12,251,Clear,0.0,76,10,1015,17,14,10,14,26,14,1,2008-07-01 02:00:00
3,False,15,13,256,Clear,0.0,76,10,1015,24,15,10,14,27,14,1,2008-07-01 03:00:00
4,False,15,12,255,Clear,0.0,74,10,1016,22,15,10,14,26,14,1,2008-07-01 04:00:00


In [14]:
weather_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119112 entries, 0 to 119111
Data columns (total 17 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   isdaytime      119112 non-null  bool          
 1   tempC          119112 non-null  int64         
 2   windspeedKmph  119112 non-null  int64         
 3   winddirdegree  119112 non-null  int64         
 4   weatherDesc    119112 non-null  category      
 5   precipMM       119112 non-null  float64       
 6   humidity       119112 non-null  int64         
 7   visibilityKm   119112 non-null  int64         
 8   pressureMB     119112 non-null  int64         
 9   cloudcover     119112 non-null  int64         
 10  HeatIndexC     119112 non-null  int64         
 11  DewPointC      119112 non-null  int64         
 12  WindChillC     119112 non-null  int64         
 13  WindGustKmph   119112 non-null  int64         
 14  FeelsLikeC     119112 non-null  int64         
 15  

In [15]:
weather_data.describe()

Unnamed: 0,tempC,windspeedKmph,winddirdegree,precipMM,humidity,visibilityKm,pressureMB,cloudcover,HeatIndexC,DewPointC,WindChillC,WindGustKmph,FeelsLikeC,uvIndex
count,119112.0,119112.0,119112.0,119112.0,119112.0,119112.0,119112.0,119112.0,119112.0,119112.0,119112.0,119112.0,119112.0,119112.0
mean,8.571311,15.256867,197.462674,0.072492,78.054335,9.156399,1016.540315,43.912167,8.899481,4.678941,5.873674,23.785949,6.103474,2.036294
std,10.93576,7.380158,94.528592,0.354531,14.156311,2.062973,7.691526,35.520894,11.38568,10.448378,13.489782,11.252122,13.829809,1.764697
min,-31.0,0.0,0.0,0.0,16.0,0.0,977.0,0.0,-30.0,-34.0,-44.0,0.0,-44.0,1.0
25%,0.0,10.0,125.0,0.0,69.0,10.0,1012.0,10.0,0.0,-3.0,-4.0,16.0,-4.0,1.0
50%,9.0,14.0,210.0,0.0,81.0,10.0,1016.0,36.0,9.0,5.0,6.0,23.0,6.0,1.0
75%,18.0,19.0,277.0,0.0,89.0,10.0,1021.0,78.0,18.0,14.0,18.0,30.0,18.0,2.0
max,36.0,59.0,360.0,28.3,100.0,10.0,1048.0,100.0,44.0,28.0,36.0,106.0,44.0,9.0


In [16]:
weather_data.to_csv("weather_data_cleaned.csv", index=False)
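One thing to keep in mind is that a CSV file does not store the datetime and category types set earlier, so they need to be requested again when the cleaned file is read back. The sketch below demonstrates this with an in-memory buffer and a hypothetical two-row frame (the `cleaned` frame and its 'Sunny' value are made up for illustration); for the real file, the same `parse_dates` and `dtype` arguments would be passed along with the "weather_data_cleaned.csv" path.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the cleaned data
cleaned = pd.DataFrame({
    "isdaytime": [False, True],
    "weatherDesc": pd.Categorical(["Clear", "Sunny"]),
    "datetime": pd.to_datetime(["2008-07-01 00:00:00", "2008-07-01 08:00:00"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Ask pandas to restore the datetime and category types on load
reloaded = pd.read_csv(
    buffer,
    parse_dates=["datetime"],
    dtype={"weatherDesc": "category"},
)
print(reloaded.dtypes)
```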