# Data Processing & Cleaning - Canada Weather
 

This report outlines the steps taken to clean the following data sets:

- Canada Historical Weather data

We will perform the following steps to process & clean the data into its final form for analysis: 

1. General data review
2. Data compilation/consolidation ('raw' --> 'processed')
3. Data cleaning ('processed' --> 'clean_final')  


### Libraries 

In [2]:
import os
import pandas as pd 
import numpy as np
import re
from datetime import datetime
import src.paths as pt
import src.mappings as maps
import importlib  # `imp` is deprecated and was removed in Python 3.12
importlib.reload(pt)
importlib.reload(maps)

<module 'src.mappings' from 'c:\\Users\\Patrick\\OneDrive\\PET PROJECTS\\TTC Delay Analysis\\src\\mappings.py'>

## 1. General Data Review

The weather data set is extracted from the [Canada Weather Stats website](https://weatherstats.ca), which obtains its data directly from [Environment and Climate Change Canada](https://weather.gc.ca/).

Both hourly and daily data sets are extracted. The hourly data provides finer granularity, while the daily data provides snow and rain measures, which are not present in the hourly data.

The hourly dataset contains the following features: 

In [4]:
hourly = pd.read_csv(pt.hourly_weather_path)
hourly.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 19 columns):
date_time_local      60000 non-null object
unixtime             60000 non-null int64
pressure_station     59971 non-null float64
pressure_sea         59971 non-null float64
wind_dir             57248 non-null object
wind_dir_10s         59947 non-null float64
wind_speed           59973 non-null float64
wind_gust            13580 non-null float64
relative_humidity    59964 non-null float64
dew_point            59966 non-null float64
temperature          59973 non-null float64
windchill            14191 non-null float64
humidex              9722 non-null float64
visibility           59975 non-null float64
health_index         59428 non-null float64
cloud_cover_4        0 non-null float64
cloud_cover_8        56536 non-null float64
cloud_cover_10       0 non-null float64
solar_radiation      0 non-null float64
dtypes: float64(16), int64(1), object(2)
memory usage: 8.7+ MB
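
The `info()` output above shows that `cloud_cover_4`, `cloud_cover_10` and `solar_radiation` have no non-null values, and that `wind_gust`, `windchill` and `humidex` are mostly empty. A per-column missing-value summary makes this easier to scan. The snippet below is a minimal sketch on a small made-up frame; the helper name `null_summary` and the sample values are illustrative assumptions and are not part of the project code.

```python
import numpy as np
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return non-null counts and percent missing per column, most-missing first."""
    summary = pd.DataFrame({
        "non_null": df.notna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    return summary.sort_values("pct_missing", ascending=False)


# Tiny made-up frame standing in for the hourly data (values are illustrative only).
sample = pd.DataFrame({
    "temperature": [1.5, 2.0, np.nan, 3.1],
    "wind_gust": [np.nan, 35.0, np.nan, np.nan],
    "solar_radiation": [np.nan] * 4,
})

summary = null_summary(sample)

# Columns with no usable values at all.
empty_cols = summary.index[summary["non_null"] == 0].tolist()

print(summary)
print(empty_cols)
```

Under the same assumptions, calling `null_summary(hourly)` on the frame loaded above would list the fully empty columns first.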


The daily dataset contains the same features, reported as daily aggregate columns (e.g., avg, min, max), as well as additional features including:

- Snow  
- Rain  
- Sunrise  
- Sunset  
- Daylight 
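
Since snow and rain exist only in the daily data, one way to use them alongside the hourly records is to attach each day's values to every hourly row for that date. The sketch below illustrates the idea on made-up rows. Only `date_time_local` is a column name confirmed by the hourly file; the daily column names `date`, `snow` and `rain`, and the timestamp format, are assumptions that may differ from the actual extract.

```python
import pandas as pd

# Made-up hourly rows; `date_time_local` matches the hourly file, values are illustrative.
hourly_demo = pd.DataFrame({
    "date_time_local": [
        "2019-01-01 00:00:00",
        "2019-01-01 01:00:00",
        "2019-01-02 00:00:00",
    ],
    "temperature": [-3.0, -3.5, 1.0],
})

# Made-up daily rows; `date`, `snow` and `rain` are assumed column names.
daily_demo = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-02"],
    "snow": [2.4, 0.0],
    "rain": [0.0, 5.1],
})

hourly_demo["date_time_local"] = pd.to_datetime(hourly_demo["date_time_local"])
daily_demo["date"] = pd.to_datetime(daily_demo["date"])

# Calendar date of each hourly observation, used as the join key.
hourly_demo["date"] = hourly_demo["date_time_local"].dt.normalize()

# Left join keeps every hourly row; validate guards against duplicate daily dates.
merged = hourly_demo.merge(daily_demo, on="date", how="left", validate="many_to_one")

print(merged[["date_time_local", "temperature", "snow", "rain"]])
```

A left join is used here so that hourly rows without a matching daily record are kept, with missing snow and rain values, rather than dropped.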

## 2. Data Compilation/Consolidation 

No additional compilation or formatting is required to prepare the data files for cleaning. These files will be kept in the 'data/raw/weather' folder until final cleaning is performed.