<a id="top"></a>

This notebook cleans and prepares hourly historical weather data for Dublin for 2018.  
The data was obtained from OpenWeatherMap.

***

# Import Packages

In [13]:
import pandas as pd
import datetime

***

# Load Data

In [14]:
df_weather = pd.read_csv('/home/faye/data/weather/historical_weather_2018.csv')

***

<a id="contents"></a>
# Contents

- [1. Overview of the Dataset](#overview)
- [2. The Features](#features)


***

<a id="overview"></a>
# 1. Overview of the Dataset
[Back to contents](#contents)

In [15]:
# print the number of rows and features
num_rows = df_weather.shape[0]
features = df_weather.shape[1]
print(f"The dataset has {num_rows} rows with {features} features.")

The dataset has 8800 rows with 25 features.
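
A full year of hourly readings for 2018 would contain 365 × 24 = 8,760 rows, so a count of 8,800 suggests that some timestamps may appear more than once. The sketch below shows one way this could be checked. It runs on a small made-up frame (the `toy` data and its values are hypothetical, not taken from the dataset), and the same calls could be applied to `df_weather['dt']`.

```python
import pandas as pd

# Hypothetical toy data: the second timestamp appears twice
toy = pd.DataFrame({
    'dt': [1514764800, 1514768400, 1514768400, 1514772000],
    'weather_main': ['Rain', 'Rain', 'Mist', 'Clouds'],
})

# hours in 2018, which is not a leap year
expected_hours = 365 * 24

# count the rows whose timestamp already appeared earlier in the frame
num_repeated = int(toy['dt'].duplicated().sum())

# list every row that shares its timestamp with another row
repeated_rows = toy[toy['dt'].duplicated(keep=False)]

print(f"Expected hourly rows for 2018: {expected_hours}")
print(f"Rows repeating an earlier timestamp: {num_repeated}")
print(repeated_rows)
```

Whether repeated timestamps should be dropped or merged depends on why they occur, so this check only reports them.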


In [16]:
# print the first 5 rows of the dataset
print("The first 5 Rows are:")
df_weather.head(5)

The first 5 Rows are:


Unnamed: 0,dt,dt_iso,timezone,city_name,lat,lon,temp,feels_like,temp_min,temp_max,...,wind_deg,rain_1h,rain_3h,snow_1h,snow_3h,clouds_all,weather_id,weather_main,weather_description,weather_icon
0,1514764800,2018-01-01 00:00:00 +0000 UTC,0,Dublin,53.349805,-6.26031,4.15,-6.49,3.84,5.79,...,240,,,,,40,520,Rain,light intensity shower rain,09n
1,1514768400,2018-01-01 01:00:00 +0000 UTC,0,Dublin,53.349805,-6.26031,4.14,-5.79,3.65,5.86,...,240,,,,,75,520,Rain,light intensity shower rain,09n
2,1514772000,2018-01-01 02:00:00 +0000 UTC,0,Dublin,53.349805,-6.26031,4.61,-5.77,3.85,5.99,...,240,,,,,40,802,Clouds,scattered clouds,03n
3,1514775600,2018-01-01 03:00:00 +0000 UTC,0,Dublin,53.349805,-6.26031,4.64,-5.73,4.0,6.14,...,240,,,,,40,802,Clouds,scattered clouds,03n
4,1514779200,2018-01-01 04:00:00 +0000 UTC,0,Dublin,53.349805,-6.26031,5.04,-4.91,4.11,6.22,...,240,,,,,40,802,Clouds,scattered clouds,03n


In [17]:
# print the last 5 rows of the dataset
print("The last 5 Rows are:")
df_weather.tail(5)

The last 5 Rows are:


Unnamed: 0,dt,dt_iso,timezone,city_name,lat,lon,temp,feels_like,temp_min,temp_max,...,wind_deg,rain_1h,rain_3h,snow_1h,snow_3h,clouds_all,weather_id,weather_main,weather_description,weather_icon
8795,1546282800,2018-12-31 19:00:00 +0000 UTC,0,Dublin,53.349805,-6.26031,9.65,5.78,8.03,10.0,...,260,,,,,75,803,Clouds,broken clouds,04n
8796,1546286400,2018-12-31 20:00:00 +0000 UTC,0,Dublin,53.349805,-6.26031,9.27,5.52,8.22,10.0,...,250,,,,,75,803,Clouds,broken clouds,04n
8797,1546290000,2018-12-31 21:00:00 +0000 UTC,0,Dublin,53.349805,-6.26031,9.31,4.87,8.33,10.0,...,260,,,,,75,803,Clouds,broken clouds,04n
8798,1546293600,2018-12-31 22:00:00 +0000 UTC,0,Dublin,53.349805,-6.26031,9.19,4.3,8.33,10.0,...,260,,,,,75,803,Clouds,broken clouds,04n
8799,1546297200,2018-12-31 23:00:00 +0000 UTC,0,Dublin,53.349805,-6.26031,8.91,4.39,8.08,9.44,...,250,,,,,75,803,Clouds,broken clouds,04n


***

<a id="features"></a>
# 2. The Features
[Back to contents](#contents)

In [18]:
# print the data type for each feature
df_weather.dtypes

dt                       int64
dt_iso                  object
timezone                 int64
city_name               object
lat                    float64
lon                    float64
temp                   float64
feels_like             float64
temp_min               float64
temp_max               float64
pressure                 int64
sea_level              float64
grnd_level             float64
humidity                 int64
wind_speed             float64
wind_deg                 int64
rain_1h                float64
rain_3h                float64
snow_1h                float64
snow_3h                float64
clouds_all               int64
weather_id               int64
weather_main            object
weather_description     object
weather_icon            object
dtype: object

- `dt` is a Unix timestamp stored as an integer. I will convert it to a datetime, which can later be used to create the `date` and `time` features.
- `dt_iso` is the datetime as an ISO-format string. It duplicates `dt` and will be obsolete once the `date` and `time` features exist.
- We do not need the `timezone`, `city_name`, `lat` or `lon` features. I will leave them for now and drop them later.
- Of the four temperature features, we are only concerned with `temp`, which will stay as a float.
- I will keep `pressure` as an integer.
- We are not concerned with either the `sea_level` or `grnd_level` features, so these can be dropped.
- I will keep `humidity` as an integer.
- I will keep `wind_speed` as a float and `wind_deg` as an integer.
- I will keep the two rain features as floats.
- I will also keep the two snow features as floats, although I am not sure how useful they will be.
- I will keep `clouds_all` as an integer.
- Of the remaining four features, only `weather_main` and `weather_description` may prove useful. The other two, `weather_id` and `weather_icon`, can be dropped.
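
The `date` and `time` features mentioned in the list could be derived along the following lines. This is a minimal sketch on a toy frame holding the first three `dt` values shown in the head of the dataset. It assumes the integers are seconds since the Unix epoch, which matches the `dt_iso` strings printed next to them.

```python
import pandas as pd

# the first three `dt` values from the head of the dataset
toy = pd.DataFrame({'dt': [1514764800, 1514768400, 1514772000]})

# interpret the integers as seconds since the Unix epoch (UTC)
toy['dt'] = pd.to_datetime(toy['dt'], unit='s')

# split the datetime into a calendar date and a time of day
toy['date'] = toy['dt'].dt.date
toy['time'] = toy['dt'].dt.time

print(toy)
```

Both new columns hold plain Python `date` and `time` objects, so pandas reports their dtype as `object`.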

In [19]:
# convert 'dt' from timestamp to datetime
df_weather['dt'] = pd.to_datetime(df_weather['dt'], unit='s')

In [20]:
# drop features
features_to_drop = [
    'dt_iso', 'timezone', 'city_name', 'lat', 'lon', 'feels_like',
    'temp_min', 'temp_max', 'sea_level', 'grnd_level', 'weather_id', 'weather_icon'
]

df_weather = df_weather.drop(columns=features_to_drop)

In [21]:
# reorder the features
reorder_features = [
    'dt', 'temp', 'pressure', 'humidity', 'wind_speed', 'wind_deg', 
    'rain_1h','rain_3h', 'clouds_all', 'weather_main', 'weather_description', 
    'snow_1h', 'snow_3h'
]

df_weather = df_weather.reindex(columns=reorder_features)
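
One thing to be aware of with `reindex` is that it does not complain about unknown column names: a misspelled name silently becomes a column of missing values. The sketch below illustrates this on a hypothetical one-row frame (the `humidity` value and the misspelled name `humidty` are made up for the example) and contrasts it with list-based selection, which raises a `KeyError` instead.

```python
import pandas as pd

# hypothetical one-row frame for illustration
toy = pd.DataFrame({'temp': [4.15], 'dt': [1514764800], 'humidity': [87]})

# 'humidty' is a deliberate typo
columns_with_typo = ['dt', 'temp', 'humidty']

# reindex does not raise: the unknown name becomes an all-NaN column
reindexed = toy.reindex(columns=columns_with_typo)
print(reindexed.isna().all())

# selecting with a list raises KeyError for the unknown name
try:
    toy[columns_with_typo]
    raised = False
except KeyError as err:
    raised = True
    print(f"KeyError: {err}")
```

If strict checking is preferred, `df_weather[reorder_features]` would reorder the columns in the same way while failing loudly on a typo.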

In [22]:
df_weather.dtypes

dt                     datetime64[ns]
temp                          float64
pressure                        int64
humidity                        int64
wind_speed                    float64
wind_deg                        int64
rain_1h                       float64
rain_3h                       float64
clouds_all                      int64
weather_main                   object
weather_description            object
snow_1h                       float64
snow_3h                       float64
dtype: object
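
The first and last rows printed earlier show no values in the rain and snow columns, so counting the non-missing entries per column may help decide how useful those features are. A minimal sketch, using a made-up frame (the `toy` values are hypothetical) in place of `df_weather`:

```python
import numpy as np
import pandas as pd

# hypothetical values: some rain readings, no snow readings at all
toy = pd.DataFrame({
    'rain_1h': [np.nan, 0.3, np.nan, 1.0],
    'snow_1h': [np.nan, np.nan, np.nan, np.nan],
})

# number of non-missing values in each column
non_null_counts = toy.notna().sum()

# fraction of rows that are missing in each column
missing_share = toy.isna().mean()

print(non_null_counts)
print(missing_share)
```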

***

[Back to top](#top)