In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import json

In [3]:
df = pd.read_csv("weather_burbank_airport.csv")

In [4]:
df.head()

Unnamed: 0,city,timestamp,temperature,cloud_cover,cloud_cover_description,pressure,windspeed,precipitation,felt_temperature
0,Burbank,2018-01-01 08:53:00,9.0,33.0,Fair,991.75,9.0,0.0,8.0
1,Burbank,2018-01-01 09:53:00,9.0,33.0,Fair,992.08,0.0,0.0,9.0
2,Burbank,2018-01-01 10:53:00,9.0,21.0,Haze,992.08,0.0,0.0,9.0
3,Burbank,2018-01-01 11:53:00,9.0,29.0,Partly Cloudy,992.08,0.0,0.0,9.0
4,Burbank,2018-01-01 12:53:00,8.0,33.0,Fair,992.08,0.0,0.0,8.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29244 entries, 0 to 29243
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   city                     29244 non-null  object 
 1   timestamp                29244 non-null  object 
 2   temperature              29219 non-null  float64
 3   cloud_cover              29224 non-null  float64
 4   cloud_cover_description  29224 non-null  object 
 5   pressure                 29236 non-null  float64
 6   windspeed                29158 non-null  float64
 7   precipitation            29244 non-null  float64
 8   felt_temperature         29218 non-null  float64
dtypes: float64(6), object(3)
memory usage: 2.0+ MB


Manual inspection shows that the dataset contains roughly hourly measurements for each day. We are probably only interested in those between 6 am and 10 pm, since most people are at home outside that period. So we keep only the measurements taken from hour 6 through hour 21.

In [26]:
df["timestamp"] = pd.to_datetime(df["timestamp"])
dayData = df[(df["timestamp"].dt.hour >= 6) & (df["timestamp"].dt.hour <= 21)]
dayData

Unnamed: 0,city,timestamp,temperature,cloud_cover,cloud_cover_description,pressure,windspeed,precipitation,felt_temperature
0,Burbank,2018-01-01 08:53:00,9.0,33.0,Fair,991.75,9.0,0.0,8.0
1,Burbank,2018-01-01 09:53:00,9.0,33.0,Fair,992.08,0.0,0.0,9.0
2,Burbank,2018-01-01 10:53:00,9.0,21.0,Haze,992.08,0.0,0.0,9.0
3,Burbank,2018-01-01 11:53:00,9.0,29.0,Partly Cloudy,992.08,0.0,0.0,9.0
4,Burbank,2018-01-01 12:53:00,8.0,33.0,Fair,992.08,0.0,0.0,8.0
...,...,...,...,...,...,...,...,...,...
29231,Burbank,2020-12-31 19:53:00,17.0,34.0,Fair,985.82,19.0,0.0,17.0
29232,Burbank,2020-12-31 20:53:00,18.0,34.0,Fair,984.50,26.0,0.0,18.0
29233,Burbank,2020-12-31 21:53:00,19.0,34.0,Fair,985.16,19.0,0.0,19.0
29242,Burbank,2021-01-01 06:53:00,11.0,33.0,Fair,987.14,13.0,0.0,11.0
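
As a minimal sketch of the same hour filter, the cell below uses `Series.between`, which is inclusive on both ends by default. The `sample` frame is hypothetical (one made-up reading per hour for a single day) and only serves to show which hours survive the filter.

```python
import pandas as pd

# Hypothetical sample: one reading per hour for a single day, at minute 53.
sample = pd.DataFrame({
    "timestamp": pd.date_range(
        "2018-01-02 00:53", periods=24, freq=pd.Timedelta(hours=1)
    ),
})

# Keep hours 6..21; between() includes both bounds by default.
day_mask = sample["timestamp"].dt.hour.between(6, 21)
filtered = sample[day_mask]

kept_hours = filtered["timestamp"].dt.hour.tolist()
print(len(filtered), kept_hours[0], kept_hours[-1])
```

With exactly one reading per hour, 16 rows remain: the first at 06:53 and the last at 21:53.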


Now we need to check whether there are enough entries for each day. As a first check, we look at the mean number of entries per day.

In [27]:
# Count the remaining measurements per calendar day.
tuplesEachDay = dayData.groupby(dayData["timestamp"].dt.date).size()
# Mean number of measurements per day.
tuplesEachDay.mean()

18.199817518248175

We expected a mean of 16, since we removed the entries at 00:53, 01:53, 02:53, 03:53, 04:53, 05:53, 22:53 and 23:53. That is 8 entries removed per day, which leaves 24-8=16. The observed mean of about 18.2 is higher, so some days must have more entries than expected.
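
A mean alone hides which days deviate from the expected count. The sketch below shows one way to list them, using a small hypothetical series of per-day counts shaped like `tuplesEachDay`. The names `counts`, `too_few` and `too_many` and the values are illustrative assumptions, not results from the dataset.

```python
import pandas as pd

# Hours 6..21 inclusive, so 16 entries are expected per day.
EXPECTED_PER_DAY = len(range(6, 22))

# Hypothetical per-day counts, shaped like tuplesEachDay.
counts = pd.Series(
    [14, 16, 16, 40, 17, 2],
    index=["day_a", "day_b", "day_c", "day_d", "day_e", "day_f"],
    name="entries",
)

# Days with fewer or more entries than expected.
too_few = counts[counts < EXPECTED_PER_DAY]
too_many = counts[counts > EXPECTED_PER_DAY]

# The median is less sensitive to a few unusual days than the mean.
print(counts.mean(), counts.median())
print(too_few)
print(too_many)
```

Here the median stays at the expected 16 while the mean is pulled up by the one day with 40 entries.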

In [28]:
tuplesEachDay

timestamp
2018-01-01    14
2018-01-02    16
2018-01-03    16
2018-01-04    16
2018-01-05    16
              ..
2020-12-28    40
2020-12-29    16
2020-12-30    17
2020-12-31    17
2021-01-01     2
Length: 1096, dtype: int64

A day like 2020-12-28, with 40 entries, shows that additional measurements were taken on some days. They look plausible, and in the end they are just more data, which is usually a good thing. We decide to keep them. They probably lead to more accurate means.

On the other hand, 2021-01-01 is the last day of the measurements. Recording stopped that day, so we only have 2 entries. We decide to keep those as well.
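
One caveat about days with additional measurements: if the extra readings cluster in a few hours, a plain mean over all rows gives those hours more weight. A possible alternative, which this notebook does not use, is to average within each hour first and then across hours. The sketch below compares both on a hypothetical day; the `readings` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical day: two normal hourly readings plus four readings
# clustered in the 12 o'clock hour.
readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-12-28 10:53", "2020-12-28 11:53",
        "2020-12-28 12:05", "2020-12-28 12:20",
        "2020-12-28 12:40", "2020-12-28 12:53",
    ]),
    "temperature": [10.0, 10.0, 20.0, 20.0, 20.0, 20.0],
})

day = readings["timestamp"].dt.date.rename("day")
hour = readings["timestamp"].dt.hour.rename("hour")

# Plain mean over all rows of the day (the approach used in this notebook).
plain_mean = readings.groupby(day)["temperature"].mean()

# Alternative: mean per hour first, then mean of the hourly means.
hourly = readings.groupby([day, hour])["temperature"].mean()
hourly_then_daily = hourly.groupby(level="day").mean()

print(plain_mean.iloc[0], hourly_then_daily.iloc[0])
```

In this made-up case the plain mean is about 16.7, while the two-step mean is about 13.3, because the clustered hour counts only once.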

Now we calculate the mean temperature, mean cloud cover and mean precipitation for each day.

In [35]:
meansPerDay = dayData.groupby(dayData["timestamp"].dt.date).agg({
    "temperature": "mean",
    "cloud_cover": "mean",
    "precipitation": "mean"
}).reset_index()

meansPerDay.rename(columns={"timestamp": "day"}, inplace=True)
meansPerDay.head()

Unnamed: 0,day,temperature,cloud_cover,precipitation
0,2018-01-01,12.857143,32.142857,0.0
1,2018-01-02,15.0625,28.9375,0.0
2,2018-01-03,16.6875,28.5,0.0
3,2018-01-04,15.5,30.25,0.0
4,2018-01-05,16.5,31.75,0.0


In [36]:
meansPerDay.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1096 entries, 0 to 1095
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   day            1096 non-null   object 
 1   temperature    1096 non-null   float64
 2   cloud_cover    1096 non-null   float64
 3   precipitation  1096 non-null   float64
dtypes: float64(3), object(1)
memory usage: 34.4+ KB


In [39]:
meansPerDay["day"] = pd.to_datetime(meansPerDay["day"])
meansPerDay.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1096 entries, 0 to 1095
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   day            1096 non-null   datetime64[ns]
 1   temperature    1096 non-null   float64       
 2   cloud_cover    1096 non-null   float64       
 3   precipitation  1096 non-null   float64       
dtypes: datetime64[ns](1), float64(3)
memory usage: 34.4 KB
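
The separate conversion step is needed because `dt.date` yields Python `date` objects, which pandas stores as `object`. As a hedged alternative sketch, grouping on `dt.normalize()` (which sets the time of day to midnight) keeps the datetime dtype throughout. The `sample` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical sample with the three columns used in this notebook.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-01-01 08:53", "2018-01-01 09:53", "2018-01-02 08:53",
    ]),
    "temperature": [9.0, 11.0, 15.0],
    "cloud_cover": [33.0, 21.0, 29.0],
    "precipitation": [0.0, 0.0, 0.0],
})

# normalize() keeps the datetime dtype and sets the time to 00:00:00.
day = sample["timestamp"].dt.normalize().rename("day")

means = (
    sample.groupby(day)[["temperature", "cloud_cover", "precipitation"]]
    .mean()
    .reset_index()
)

print(means.dtypes)
```

The `day` column of `means` is already a datetime column, so no later `pd.to_datetime` call is required.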


That is all we take from the weather data. To keep the number of dimensions to a minimum, we only take temperature, cloud cover and precipitation into account; the remaining columns seem unnecessary.

Now we have clean and easily usable data to map to the rest of the data. We just need to save meansPerDay as a CSV file and continue with the mapping in dataPreparation.

In [40]:
meansPerDay.to_csv("cleanWeatherData.csv")
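
Note that `to_csv` writes the row index by default, so reading the file back yields an extra `Unnamed: 0` column unless the reader passes `index_col=0`. The sketch below illustrates both variants with an in-memory buffer instead of a file. The `means` frame is a two-row stand-in for `meansPerDay`, and whether dataPreparation expects the index column is not known here.

```python
import io

import pandas as pd

# Two-row stand-in for meansPerDay.
means = pd.DataFrame({
    "day": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "temperature": [12.857143, 15.0625],
    "cloud_cover": [32.142857, 28.9375],
    "precipitation": [0.0, 0.0],
})

# Default: the index is written and comes back as "Unnamed: 0".
with_index = io.StringIO()
means.to_csv(with_index)
with_index.seek(0)
reread = pd.read_csv(with_index)

# index=False drops the index; parse_dates restores the datetime dtype.
without_index = io.StringIO()
means.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index, parse_dates=["day"])

print(list(reread.columns))
print(restored.dtypes)
```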