Initial plan (or, to be buzzword-compliant, design-doc/outline/<most appropriate buzzword here>)

1. Extract weather data for one city from the raw data file
2. Calculate day of year from the year-month-day value
3. Keep hourly data for what is, approximately, not night time
4. For each hour of the day keep only the most commonly occurring weather description; discard other weather description data
5. For each day keep only the most commonly occurring weather description; discard other weather description data
6. Calculate rgb color of sky for each weather description
7. Make bar chart of sky color where each bar represents one day of the year

Note: this project is based on data from an [Open Database on Kaggle](https://www.kaggle.com/selfishgene/historical-hourly-weather-data#weather_description.csv) provided by [David Beniaguev](https://davidbeniaguev.com). Thank you David!

In [1]:
import pandas as pd

### Extract weather data for one city from the raw data file at https://www.kaggle.com/selfishgene/historical-hourly-weather-data#weather_description.csv

In [2]:
# download 'weather_description.csv' from the link above to the same directory as your Jupyter Notebook
dataFile='weather_description.csv'
city='New York'
datetime='datetime'

In [16]:
pd.read_csv(dataFile)

Unnamed: 0,datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
0,2012-10-01 12:00:00,,,,,,,,,,...,,,,,,,haze,,,
1,2012-10-01 13:00:00,mist,scattered clouds,light rain,sky is clear,mist,sky is clear,sky is clear,sky is clear,sky is clear,...,broken clouds,few clouds,overcast clouds,sky is clear,sky is clear,sky is clear,haze,sky is clear,sky is clear,sky is clear
2,2012-10-01 14:00:00,broken clouds,scattered clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,...,broken clouds,few clouds,sky is clear,few clouds,sky is clear,sky is clear,broken clouds,overcast clouds,sky is clear,overcast clouds
3,2012-10-01 15:00:00,broken clouds,scattered clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,...,broken clouds,few clouds,sky is clear,few clouds,overcast clouds,sky is clear,broken clouds,overcast clouds,overcast clouds,overcast clouds
4,2012-10-01 16:00:00,broken clouds,scattered clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,...,broken clouds,few clouds,sky is clear,few clouds,overcast clouds,sky is clear,broken clouds,overcast clouds,overcast clouds,overcast clouds
5,2012-10-01 17:00:00,broken clouds,scattered clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,...,broken clouds,few clouds,sky is clear,few clouds,overcast clouds,sky is clear,broken clouds,overcast clouds,overcast clouds,overcast clouds
6,2012-10-01 18:00:00,broken clouds,scattered clouds,sky is clear,few clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,...,broken clouds,few clouds,sky is clear,few clouds,overcast clouds,sky is clear,broken clouds,overcast clouds,overcast clouds,overcast clouds
7,2012-10-01 19:00:00,broken clouds,scattered clouds,sky is clear,few clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,...,broken clouds,few clouds,sky is clear,few clouds,overcast clouds,sky is clear,broken clouds,overcast clouds,overcast clouds,overcast clouds
8,2012-10-01 20:00:00,broken clouds,scattered clouds,sky is clear,few clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,...,broken clouds,few clouds,sky is clear,few clouds,overcast clouds,sky is clear,broken clouds,overcast clouds,overcast clouds,overcast clouds
9,2012-10-01 21:00:00,broken clouds,scattered clouds,sky is clear,few clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,...,broken clouds,few clouds,sky is clear,few clouds,overcast clouds,sky is clear,broken clouds,overcast clouds,overcast clouds,overcast clouds


In [3]:
# import hourly weather descriptions
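# parse_dates=[datetime] parses the timestamp column (parse_dates=True would only try to parse the index)
df_WeatherDescription = pd.read_csv(dataFile, usecols=[datetime, city], parse_dates=[datetime])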
df_WeatherDescription.head()

Unnamed: 0,datetime,New York
0,2012-10-01 12:00:00,
1,2012-10-01 13:00:00,few clouds
2,2012-10-01 14:00:00,few clouds
3,2012-10-01 15:00:00,few clouds
4,2012-10-01 16:00:00,few clouds


In [4]:
df_WeatherDescription.rename(columns={city: 'weather'}, inplace=True)
df_WeatherDescription.dropna(inplace = True)
df_WeatherDescription.head()

Unnamed: 0,datetime,weather
1,2012-10-01 13:00:00,few clouds
2,2012-10-01 14:00:00,few clouds
3,2012-10-01 15:00:00,few clouds
4,2012-10-01 16:00:00,few clouds
5,2012-10-01 17:00:00,few clouds


### Extract day of year and the hour of the day from the datetime timestamp

In [5]:
# make sure the datetime column has a datetime64 dtype so the .dt accessor below works
# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
df_WeatherDescription.datetime = pd.to_datetime(df_WeatherDescription.datetime)

# https://stackoverflow.com/questions/28990256/python-pandas-time-series-year-extraction
df_WeatherDescription["day_number"] = df_WeatherDescription.datetime.dt.dayofyear
df_WeatherDescription["hour_time"] = df_WeatherDescription.datetime.dt.hour
df_WeatherDescription.head()

Unnamed: 0,datetime,weather,day_number,hour_time
1,2012-10-01 13:00:00,few clouds,275,13
2,2012-10-01 14:00:00,few clouds,275,14
3,2012-10-01 15:00:00,few clouds,275,15
4,2012-10-01 16:00:00,few clouds,275,16
5,2012-10-01 17:00:00,few clouds,275,17


In [6]:
# sanity check
df_WeatherDescription.day_number.nunique()

366
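
A count of 366 rather than 365 is consistent with the data containing December 31 of a leap year, which is the only date numbered 366. The following is a minimal sketch using only the standard library (not the notebook's pandas code; the `day_of_year` helper is an illustrative name) that reproduces the day number shown in the output above:

```python
from datetime import date

def day_of_year(d: date) -> int:
    """Return the 1-based day of the year, comparable to pandas' .dt.dayofyear."""
    return d.timetuple().tm_yday

first_row = day_of_year(date(2012, 10, 1))     # first timestamp in the data
leap_last = day_of_year(date(2012, 12, 31))    # 2012 is a leap year
common_last = day_of_year(date(2013, 12, 31))  # 2013 is not
print(first_row, leap_last, common_last)
```

One side effect worth keeping in mind: from March 1 onward, the same calendar date gets a day number that is one higher in a leap year than in a common year, so grouping by `day_number` across years mixes dates that are one day apart.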

### Calculate rgb color of sky for each weather description

To do this I listed all distinct weather descriptions in the Jupyter Notebook. I then wrote a json file (you can use any text editor, I used vim) that had key-value pairs of the weather description and rgb color values as follows:

 ```
 {
   "<some weather description>":"rgb(val1,val2,val3)",
   .
   .
   .
 }
 ```
 
 The file needs to be saved with the extension `.json` (this notebook reads it as `weatherColor.json`).
 
 I calculated the rgb values using my Mac's Digital Color Meter. I searched on Google Images for each distinct weather description listed below and then used the Digital Color Meter to record the rgb value of the color of the sky from a relevant image search result. The thunderstorm descriptions are the exception: I repeated the same rgb values for those. The process was extremely tedious.
 
This method of finding the rgb values is subjective and inefficient; however, it is also perfectly adequate for this project.
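
Because the color values are plain strings typed by hand, a typo such as a missing parenthesis would go unnoticed until plotting. As a hedged sketch (the `parse_rgb` helper and the in-memory JSON text are illustrative assumptions, not part of the original notebook), the strings could be validated like this:

```python
import json
import re

# Illustrative stand-in for the contents of the .json file;
# the two colors are taken from this notebook's own output.
json_text = """
{
  "sky is clear": "rgb(72, 143, 225)",
  "broken clouds": "rgb(123, 154, 207)"
}
"""

RGB_PATTERN = re.compile(r"^rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)$")

def parse_rgb(value: str) -> tuple:
    """Parse 'rgb(r, g, b)' into three ints, raising ValueError if malformed."""
    match = RGB_PATTERN.match(value)
    if match is None:
        raise ValueError(f"not an rgb() string: {value!r}")
    channels = tuple(int(group) for group in match.groups())
    if any(channel > 255 for channel in channels):
        raise ValueError(f"channel out of range in {value!r}")
    return channels

weather_colors = json.loads(json_text)
parsed = {name: parse_rgb(color) for name, color in weather_colors.items()}
print(parsed["sky is clear"])
```

Running every value through a check like this once, right after editing the file, is cheaper than discovering a bad color in the finished chart.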

In [7]:
df_WeatherDescription.weather.unique()

array(['few clouds', 'sky is clear', 'scattered clouds', 'broken clouds',
       'overcast clouds', 'mist', 'drizzle', 'moderate rain',
       'light intensity drizzle', 'light rain', 'fog', 'haze',
       'heavy snow', 'heavy intensity drizzle', 'heavy intensity rain',
       'light rain and snow', 'snow', 'light snow', 'freezing rain',
       'proximity thunderstorm', 'thunderstorm', 'thunderstorm with rain',
       'smoke', 'very heavy rain', 'thunderstorm with heavy rain',
       'thunderstorm with light rain', 'squalls', 'dust',
       'proximity thunderstorm with rain',
       'thunderstorm with light drizzle', 'sand', 'shower rain',
       'proximity thunderstorm with drizzle',
       'light intensity shower rain', 'sand/dust whirls',
       'heavy thunderstorm'], dtype=object)
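
With the distinct descriptions listed, it is possible to confirm up front that each one has an entry in the color file. A minimal sketch, assuming the descriptions and the JSON keys are available as plain Python collections (the shortened lists here are illustrative, not the full data):

```python
# Shortened, illustrative subsets; in the notebook these would come from
# df_WeatherDescription.weather.unique() and the keys of the JSON file.
descriptions = ["few clouds", "sky is clear", "mist", "heavy thunderstorm"]
color_keys = {"few clouds", "sky is clear", "mist"}

# Descriptions that would end up without a color after the merge.
missing = sorted(set(descriptions) - color_keys)
# Colors defined in the file but never needed for this city.
unused = sorted(color_keys - set(descriptions))

print("missing colors for:", missing)
print("colors never used:", unused)
```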

### Keep only the most frequently occurring weather data for each day of the year

In [8]:
# datetime timestamp is not needed anymore
df_WeatherDescription.drop(["datetime"], axis=1, inplace=True)

# we don't need to consider the weather description after sunset or before sunrise
df_WeatherDescription = df_WeatherDescription[(df_WeatherDescription["hour_time"] >= 5) & (df_WeatherDescription["hour_time"] <= 21)]

df_WeatherDescription.head()

Unnamed: 0,weather,day_number,hour_time
1,few clouds,275,13
2,few clouds,275,14
3,few clouds,275,15
4,few clouds,275,16
5,few clouds,275,17


In [9]:
# on any given day for any given hour consider only the most frequently occurring weather description
df_WeatherDescription = df_WeatherDescription.groupby(['day_number', 'hour_time'])['weather'].apply(lambda x: x.mode()[0]).reset_index()
df_WeatherDescription.head()

Unnamed: 0,day_number,hour_time,weather
0,1,5,sky is clear
1,1,6,sky is clear
2,1,7,sky is clear
3,1,8,overcast clouds
4,1,9,overcast clouds


In [10]:
# on any given day consider only the most frequently occurring weather description
df_WeatherDescription = df_WeatherDescription.groupby(['day_number'])['weather'].apply(lambda x: x.mode()[0]).reset_index()
df_WeatherDescription.set_index('day_number', inplace=True)
df_WeatherDescription.head()

Unnamed: 0_level_0,weather
day_number,Unnamed: 1_level_1
1,sky is clear
2,sky is clear
3,broken clouds
4,broken clouds
5,sky is clear
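
The two `groupby` steps above amount to a two-stage vote: first one description per (day, hour) pooled across years, then one description per day from those hourly winners. The sketch below re-creates that idea in plain Python on made-up rows; the `most_common` helper is an illustrative name, and its tie-breaking rule (alphabetically first) is an assumption chosen to mirror what `x.mode()[0]` does, since pandas, as far as I know, returns tied modes in sorted order.

```python
from collections import Counter, defaultdict

def most_common(values):
    """Return the most frequent value; ties go to the alphabetically first."""
    counts = Counter(values)
    top = max(counts.values())
    return min(value for value, n in counts.items() if n == top)

# (day_number, hour_time, weather) rows, as if pooled from several years (made-up data).
rows = [
    (1, 5, "sky is clear"), (1, 5, "sky is clear"), (1, 5, "mist"),
    (1, 6, "mist"), (1, 6, "sky is clear"), (1, 6, "sky is clear"),
    (1, 7, "mist"), (1, 7, "mist"), (1, 7, "fog"),
    (2, 5, "fog"), (2, 5, "mist"),
]

# Stage 1: one description per (day, hour).
by_hour = defaultdict(list)
for day, hour, weather in rows:
    by_hour[(day, hour)].append(weather)
hourly = {key: most_common(values) for key, values in by_hour.items()}

# Stage 2: one description per day, computed from the hourly winners.
by_day = defaultdict(list)
for (day, _hour), weather in hourly.items():
    by_day[day].append(weather)
daily = {day: most_common(values) for day, values in by_day.items()}

print(daily)
```

In this toy data, day 1 has as many "mist" as "sky is clear" observations overall, yet the two-stage vote picks "sky is clear" because it wins two of the three hours. A single vote over all raw observations would not necessarily agree.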


#### merge df_WeatherDescription and the json file of weatherDescription-rgbColor pairs created earlier such that there's an rgbColor value associated with each day of the year


In [11]:
# https://stackoverflow.com/questions/38380795/pandas-read-json-if-using-all-scalar-values-you-must-pass-an-index
dfWeatherColor = pd.read_json('weatherColor.json', typ='series').to_frame('color')
dfWeatherColor.head()

Unnamed: 0,color
broken clouds,"rgb(123, 154, 207)"
drizzle,"rgb(182, 182, 184)"
dust,"rgb(167, 154, 140)"
few clouds,"rgb(106, 179, 247)"
fog,"rgb(230, 230, 230)"


In [12]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
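# match each day's weather description against the index of the color table
# (right_on is omitted: recent pandas versions reject right_on together with right_index)
df_WeatherDescription = df_WeatherDescription.merge(dfWeatherColor, how='left', left_on='weather', right_index=True)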
df_WeatherDescription.head()

Unnamed: 0_level_0,weather,color
day_number,Unnamed: 1_level_1,Unnamed: 2_level_1
1,sky is clear,"rgb(72, 143, 225)"
2,sky is clear,"rgb(72, 143, 225)"
3,broken clouds,"rgb(123, 154, 207)"
4,broken clouds,"rgb(123, 154, 207)"
5,sky is clear,"rgb(72, 143, 225)"


In [13]:
# sanity check to see that all weather descriptions have a corresponding rgb color value
df_WeatherDescription[df_WeatherDescription.isnull().any(axis=1)]

Unnamed: 0_level_0,weather,color
day_number,Unnamed: 1_level_1,Unnamed: 2_level_1
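
The merge and the sanity check above follow a common pattern: a left join of a column against another frame's index, followed by a search for rows that found no match. Here is a self-contained sketch with toy data; the frame names `daily` and `colors` and the "volcanic ash" description are hypothetical, introduced only to force a missing match.

```python
import pandas as pd

# Toy stand-ins for the notebook's two frames (values are illustrative).
daily = pd.DataFrame(
    {"weather": ["sky is clear", "broken clouds", "volcanic ash"]},
    index=pd.Index([1, 2, 3], name="day_number"),
)
colors = pd.Series(
    {"sky is clear": "rgb(72, 143, 225)", "broken clouds": "rgb(123, 154, 207)"}
).to_frame("color")

# Left join: keep every day, look up its description in the index of `colors`.
merged = daily.merge(colors, how="left", left_on="weather", right_index=True)

# Rows whose description had no color end up with a null in the color column.
missing = merged[merged["color"].isnull()]
print(missing)
```

With `how="left"`, a description that is absent from the color table does not drop the day. It leaves a null color instead, which is exactly what the sanity check looks for.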


### Make bar chart of sky color where each bar represents one day of the year

In [14]:
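# the first time you use plotly's online service you'll need to sign up for a free API key
# uncomment the 2 lines below the first time you use plotly
# (in plotly >= 4 these tools live in the separate chart_studio package: import chart_studio.tools as tools)
#import plotly.tools as tools
#tools.set_credentials_file(username='<plotly username here>', api_key='<plotly api key here>')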

In [15]:

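# note: in plotly >= 4 this module moved to the chart_studio package (import chart_studio.plotly as py)
import plotly.plotly as py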
import plotly.graph_objs as go



x = df_WeatherDescription.index
y = len(df_WeatherDescription.index) * [10]

trace1 = go.Bar(
    x=x,
    y=y,
    text = df_WeatherDescription.weather,
    hoverinfo = 'text',
    marker=dict(
        color=df_WeatherDescription['color'],
        ),
    opacity=1
)

# dealing with an edge case here
if city == 'New York':
    city += ' City'

data = [trace1]
layout = go.Layout(
    title='Color of the sky in ' + city + '<br> every day of the year',
    titlefont=dict(
        size=32,
    ),
    xaxis=dict(
        title='Day of year',
        titlefont=dict(
            family='Open Sans, monospace',
            size=24,
            color='#7f7f7f'
        )
    ),
    yaxis=dict(
        showticklabels=False,
    )
)
fig = go.Figure(data=data, layout=layout)

filename = city + ' sky color'
py.iplot(fig, filename=filename)

You'll find your plotly chart on your plotly profile online.
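
For a quick local preview without a plotly account, the same idea (one equal-height bar per day, filled with that day's color) can be drawn as a bare SVG. This is a swapped-in technique and not the notebook's plotly chart; the `sky_strip_svg` function and its parameters are hypothetical names for illustration, and the sketch has no axes, title, or hover text.

```python
def sky_strip_svg(colors, bar_width=3, height=60):
    """Return an SVG string with one vertical bar per color, left to right."""
    bars = [
        f'<rect x="{i * bar_width}" y="0" width="{bar_width}" '
        f'height="{height}" fill="{color}"/>'
        for i, color in enumerate(colors)
    ]
    width = bar_width * len(colors)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        + "".join(bars)
        + "</svg>"
    )

# Illustrative input; in the notebook this would be list(df_WeatherDescription['color']).
day_colors = ["rgb(72, 143, 225)", "rgb(72, 143, 225)", "rgb(123, 154, 207)"]
svg = sky_strip_svg(day_colors)
print(svg[:60])
```

Writing the returned string to a file with an `.svg` extension and opening it in a browser shows the strip. The `rgb(...)` strings from the color file can be used as `fill` values unchanged.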