# Preparing Chicago Weather Dataset

We want to examine whether the weather affects the types of crime committed around the time of sports games. For example, if the weather is bad, do more fans stay indoors to watch the games, and does domestic violence rise as a result? Conversely, does good weather lead to more assault and vandalism because more fans are outdoors and attending the games?

In this notebook we load three datasets: one for temperature, one for weather description and one for wind speed. Each contains hourly data for 36 cities (mostly in the US, though the column list also includes Canadian and Israeli cities), and each column is named after its city. We want to grade the weather so we can get an overall sense of how good or bad it was at a given hour. Temperature, description and wind speed each contribute equally to the overall grade.

### Loading in the data

In [1020]:
import os.path
import datetime
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

The temperature dataset is loaded first. It contains hourly readings for 36 cities, and the datetime column is used as the index.

In [1021]:
if not os.path.exists( "../../data/raw/temperature.csv" ):
    print("Missing dataset file")

In [1022]:
temp = pd.read_csv( "../../data/raw/temperature.csv", index_col="datetime", parse_dates=True)

In [1023]:
temp.head()

datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,Denver,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
2012-10-01 12:00:00,,,,,,,,,,,...,,,,,,,309.1,,,
2012-10-01 13:00:00,284.63,282.08,289.48,281.8,291.87,291.53,293.41,296.6,285.12,284.61,...,285.63,288.22,285.83,287.17,307.59,305.47,310.58,304.4,304.4,303.5
2012-10-01 14:00:00,284.629041,282.083252,289.474993,281.797217,291.868186,291.533501,293.403141,296.608509,285.154558,284.607306,...,285.663208,288.247676,285.83465,287.186092,307.59,304.31,310.495769,304.4,304.4,303.5
2012-10-01 15:00:00,284.626998,282.091866,289.460618,281.789833,291.862844,291.543355,293.392177,296.631487,285.233952,284.599918,...,285.756824,288.32694,285.84779,287.231672,307.391513,304.281841,310.411538,304.4,304.4,303.5
2012-10-01 16:00:00,284.624955,282.100481,289.446243,281.782449,291.857503,291.553209,293.381213,296.654466,285.313345,284.59253,...,285.85044,288.406203,285.860929,287.277251,307.1452,304.238015,310.327308,304.4,304.4,303.5


### Chicago

We keep only the Chicago column.

In [1024]:
chi1 = temp.filter(items=['Chicago'])

In [1025]:
chi1.head()

datetime,Chicago
2012-10-01 12:00:00,
2012-10-01 13:00:00,284.01
2012-10-01 14:00:00,284.054691
2012-10-01 15:00:00,284.177412
2012-10-01 16:00:00,284.300133


This column is renamed to 'Temperature'. In all three datasets (temperature, wind speed and description) the column is named after the city, so joining them without renaming would leave us with three columns all named Chicago.

In [1026]:
chi1 = chi1.rename(columns={'Chicago': 'Temperature'})

In [1027]:
chi1.head()

datetime,Temperature
2012-10-01 12:00:00,
2012-10-01 13:00:00,284.01
2012-10-01 14:00:00,284.054691
2012-10-01 15:00:00,284.177412
2012-10-01 16:00:00,284.300133


The wind speed dataset is now loaded in...

In [1028]:
if not os.path.exists( "../../data/raw/wind_speed.csv" ):
    print("Missing dataset file")

The datetime column is again used as the index.

In [1029]:
temp1 = pd.read_csv( "../../data/raw/wind_speed.csv", index_col="datetime", parse_dates=True)

In [1030]:
temp1.head()

datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,Denver,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
2012-10-01 12:00:00,,,,,,,,,,,...,,,,,,,8.0,,,
2012-10-01 13:00:00,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,4.0,4.0,...,4.0,7.0,4.0,3.0,1.0,0.0,8.0,2.0,2.0,2.0
2012-10-01 14:00:00,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,4.0,4.0,...,4.0,7.0,4.0,3.0,3.0,0.0,8.0,2.0,2.0,2.0
2012-10-01 15:00:00,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,4.0,3.0,...,3.0,7.0,4.0,3.0,3.0,0.0,8.0,2.0,2.0,2.0
2012-10-01 16:00:00,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,4.0,3.0,...,3.0,7.0,4.0,3.0,3.0,0.0,8.0,2.0,2.0,2.0


Again, we are only interested in the Chicago column.

In [1031]:
chi2 = temp1.filter(items=['Chicago'])

In [1032]:
chi2.head()

datetime,Chicago
2012-10-01 12:00:00,
2012-10-01 13:00:00,0.0
2012-10-01 14:00:00,0.0
2012-10-01 15:00:00,0.0
2012-10-01 16:00:00,0.0


The Chicago column is renamed to 'Wind Speed'

In [1033]:
chi2 = chi2.rename(columns={'Chicago': 'Wind Speed'})

In [1034]:
chi2.head()

datetime,Wind Speed
2012-10-01 12:00:00,
2012-10-01 13:00:00,0.0
2012-10-01 14:00:00,0.0
2012-10-01 15:00:00,0.0
2012-10-01 16:00:00,0.0


The weather description dataset is loaded in..

In [1035]:
if not os.path.exists( "../../data/raw/weather_description.csv" ):
    print("Missing dataset file")

The datetime column is again used as the index.

In [1036]:
temp2 = pd.read_csv( "../../data/raw/weather_description.csv", index_col="datetime", parse_dates=True)

In [1037]:
temp2.head()

datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,Denver,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
2012-10-01 12:00:00,,,,,,,,,,,...,,,,,,,haze,,,
2012-10-01 13:00:00,mist,scattered clouds,light rain,sky is clear,mist,sky is clear,sky is clear,sky is clear,sky is clear,light rain,...,broken clouds,few clouds,overcast clouds,sky is clear,sky is clear,sky is clear,haze,sky is clear,sky is clear,sky is clear
2012-10-01 14:00:00,broken clouds,scattered clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,broken clouds,...,broken clouds,few clouds,sky is clear,few clouds,sky is clear,sky is clear,broken clouds,overcast clouds,sky is clear,overcast clouds
2012-10-01 15:00:00,broken clouds,scattered clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,broken clouds,...,broken clouds,few clouds,sky is clear,few clouds,overcast clouds,sky is clear,broken clouds,overcast clouds,overcast clouds,overcast clouds
2012-10-01 16:00:00,broken clouds,scattered clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,broken clouds,...,broken clouds,few clouds,sky is clear,few clouds,overcast clouds,sky is clear,broken clouds,overcast clouds,overcast clouds,overcast clouds


Again, we only look at the Chicago column.

In [1038]:
chi3 = temp2.filter(items=['Chicago'])

In [1039]:
chi3.head()

datetime,Chicago
2012-10-01 12:00:00,
2012-10-01 13:00:00,overcast clouds
2012-10-01 14:00:00,overcast clouds
2012-10-01 15:00:00,overcast clouds
2012-10-01 16:00:00,overcast clouds


This column is renamed to 'Description'

In [1040]:
chi3 = chi3.rename(columns={'Chicago': 'Description'})

In [1041]:
chi3.head()

datetime,Description
2012-10-01 12:00:00,
2012-10-01 13:00:00,overcast clouds
2012-10-01 14:00:00,overcast clouds
2012-10-01 15:00:00,overcast clouds
2012-10-01 16:00:00,overcast clouds


The Temperature, Wind Speed and Description columns are now concatenated into a single dataframe.

In [1042]:
frames = [chi1, chi2, chi3]
df = pd.concat(frames, axis = 1)
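
The load, select, rename and concatenate steps above repeat for each file, so they could be wrapped in a small helper. The sketch below is one possible way to do that, not the notebook's own code: `load_city_column` is a hypothetical name, and the in-memory CSV strings stand in for the three raw files, using the Chicago and Boston values from the 13:00 rows shown earlier.

```python
import io
import pandas as pd

def load_city_column(source, city, new_name):
    """Read one hourly weather CSV and return only `city`, renamed to `new_name`."""
    frame = pd.read_csv(source, index_col="datetime", parse_dates=True)
    if city not in frame.columns:
        raise KeyError(f"{city!r} not found in columns")
    return frame[[city]].rename(columns={city: new_name})

# Small in-memory stand-ins for the three raw files (assumption: same layout as the real CSVs).
temperature_csv = io.StringIO("datetime,Chicago,Boston\n2012-10-01 13:00:00,284.01,287.17\n")
wind_csv = io.StringIO("datetime,Chicago,Boston\n2012-10-01 13:00:00,0.0,3.0\n")
description_csv = io.StringIO(
    "datetime,Chicago,Boston\n2012-10-01 13:00:00,overcast clouds,sky is clear\n"
)

sources = {
    "Temperature": temperature_csv,
    "Wind Speed": wind_csv,
    "Description": description_csv,
}

# One renamed single-column frame per file, joined side by side on the datetime index.
combined = pd.concat(
    [load_city_column(src, "Chicago", name) for name, src in sources.items()],
    axis=1,
)
print(combined)
```

Passing a file path instead of a `StringIO` object to the helper should work the same way, since `pd.read_csv` accepts both.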

In [1043]:
df.head()

datetime,Temperature,Wind Speed,Description
2012-10-01 12:00:00,,,
2012-10-01 13:00:00,284.01,0.0,overcast clouds
2012-10-01 14:00:00,284.054691,0.0,overcast clouds
2012-10-01 15:00:00,284.177412,0.0,overcast clouds
2012-10-01 16:00:00,284.300133,0.0,overcast clouds


### Dealing with Null values

We look at the null values in the new dataframe. 

In [1044]:
df.isnull().sum()

Temperature    3
Wind Speed     1
Description    1
dtype: int64

Looking at the rows where Description is null, the only one is the first row, which is blank in every field, so this row is dropped.

In [1045]:
print(df[df["Description"].isnull()])

                     Temperature  Wind Speed Description
datetime                                                
2012-10-01 12:00:00          NaN         NaN         NaN


In [1046]:
df = df.dropna(subset = ['Description'])

Looking at the temperature fields that are null. 

In [1047]:
print(df[df["Temperature"].isnull()])

                     Temperature  Wind Speed Description
datetime                                                
2013-03-11 07:00:00          NaN         1.0  light rain
2013-03-11 08:00:00          NaN         2.0  light rain


Looking at the day that contains the null values.

In [1048]:
testdate = df.loc['2013-03-11':'2013-03-11']

In [1049]:
testdate

datetime,Temperature,Wind Speed,Description
2013-03-11 00:00:00,278.92,5.0,overcast clouds
2013-03-11 01:00:00,279.005,4.0,overcast clouds
2013-03-11 02:00:00,279.09,4.0,overcast clouds
2013-03-11 03:00:00,279.38,3.0,overcast clouds
2013-03-11 04:00:00,280.22,3.0,overcast clouds
2013-03-11 05:00:00,280.38,4.0,overcast clouds
2013-03-11 06:00:00,280.69,3.0,broken clouds
2013-03-11 07:00:00,,1.0,light rain
2013-03-11 08:00:00,,2.0,light rain
2013-03-11 09:00:00,278.85,4.0,mist


We take the average of 280.690 and 278.850, the readings immediately before and after the gap, which is 279.770, and fill the null values with it. We preferred this to deleting the rows because we may need these hours later when comparing against the NFL and NBA games, and the average of the neighbouring readings seemed the most reasonable estimate.

In [1050]:
df['Temperature'] = df['Temperature'].fillna(279.770)

In [1051]:
print(df[df["Temperature"].isnull()])

Empty DataFrame
Columns: [Temperature, Wind Speed, Description]
Index: []
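
As a check on the arithmetic, and as a possible alternative to a hard-coded constant, the sketch below rebuilds the four readings around the gap and compares the midpoint used above with pandas' time-based interpolation. The interpolation step is an assumption about how the gap could be filled, not what the notebook does.

```python
import numpy as np
import pandas as pd

# The four hourly readings around the gap on 2013-03-11, taken from the output above.
idx = pd.to_datetime([
    "2013-03-11 06:00",
    "2013-03-11 07:00",
    "2013-03-11 08:00",
    "2013-03-11 09:00",
])
temps = pd.Series([280.69, np.nan, np.nan, 278.85], index=idx, name="Temperature")

# The constant used in the notebook: the midpoint of the two neighbouring readings.
midpoint = (280.69 + 278.85) / 2

# Alternative: linear interpolation weighted by time, so each missing hour gets its own value.
interpolated = temps.interpolate(method="time")

print(round(midpoint, 3))
print(interpolated.round(3))
```

Interpolation only touches the gaps it finds between known readings, whereas `fillna` with a constant would give the same value to any other null in the column.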


We want to convert the temperature from kelvin to degrees Celsius. Celsius is more familiar to us and to the general public, so it should make the results easier to understand. The formula is T(°C) = T(K) - 273.15, so we create a new column called 'Celcius' (the column name keeps this spelling) using this formula.

In [1052]:
df["Celcius"] = df["Temperature"] - 273.15

In [1053]:
df.head()

datetime,Temperature,Wind Speed,Description,Celcius
2012-10-01 13:00:00,284.01,0.0,overcast clouds,10.86
2012-10-01 14:00:00,284.054691,0.0,overcast clouds,10.904691
2012-10-01 15:00:00,284.177412,0.0,overcast clouds,11.027412
2012-10-01 16:00:00,284.300133,0.0,overcast clouds,11.150133
2012-10-01 17:00:00,284.422855,0.0,overcast clouds,11.272855


The Kelvin temperature column is now dropped as it is no longer needed

In [1054]:
df = df.drop(columns=['Temperature'])

In [1055]:
chiWea = df

In [1056]:
chiWea.head()

datetime,Wind Speed,Description,Celcius
2012-10-01 13:00:00,0.0,overcast clouds,10.86
2012-10-01 14:00:00,0.0,overcast clouds,10.904691
2012-10-01 15:00:00,0.0,overcast clouds,11.027412
2012-10-01 16:00:00,0.0,overcast clouds,11.150133
2012-10-01 17:00:00,0.0,overcast clouds,11.272855


### The Weather Grading System

We now look at all the different weather descriptions. These will be sorted into groups depending on how good the weather is.

In [1057]:
pt = chiWea.groupby('Description')[('Celcius')].count()
pt.sort_values(ascending = False)

Description
sky is clear                           10844
broken clouds                           7255
overcast clouds                         5412
scattered clouds                        4522
mist                                    4112
few clouds                              3851
light rain                              3745
moderate rain                           1469
light snow                               886
haze                                     877
fog                                      530
heavy intensity rain                     503
snow                                     241
light intensity drizzle                  228
heavy snow                               196
proximity thunderstorm                   178
drizzle                                  143
thunderstorm                              78
thunderstorm with light rain              60
very heavy rain                           35
thunderstorm with rain                    27
thunderstorm with heavy rain              2

We build the following set of conditions. If the description is 'sky is clear' or 'few clouds', the weather is considered good and is given a value of 3. If the description is 'overcast clouds', 'broken clouds' or 'scattered clouds', the weather is considered moderate and is given a value of 2. Everything else (the bad weather) is given a value of 0. We wanted a clear gap separating bad weather from good and even moderate weather, which is why bad weather is valued at 0 instead of 1.

In [1058]:
conditions = [
    (chiWea['Description'] == 'sky is clear') | (chiWea['Description'] == 'few clouds'),
    (chiWea['Description'] == 'overcast clouds') | (chiWea['Description'] == 'broken clouds') 
    | (chiWea['Description'] == 'scattered clouds') ]
choices = ['3', '2']
chiWea['DescRate'] = np.select(conditions, choices, default='0')
print(chiWea)

                     Wind Speed       Description    Celcius DescRate
datetime                                                             
2012-10-01 13:00:00         0.0   overcast clouds  10.860000        2
2012-10-01 14:00:00         0.0   overcast clouds  10.904691        2
2012-10-01 15:00:00         0.0   overcast clouds  11.027412        2
2012-10-01 16:00:00         0.0   overcast clouds  11.150133        2
2012-10-01 17:00:00         0.0   overcast clouds  11.272855        2
2012-10-01 18:00:00         0.0   overcast clouds  11.395576        2
2012-10-01 19:00:00         0.0   overcast clouds  11.518297        2
2012-10-01 20:00:00         0.0   overcast clouds  11.641018        2
2012-10-01 21:00:00         0.0   overcast clouds  11.763739        2
2012-10-01 22:00:00         0.0   overcast clouds  11.886461        2
2012-10-01 23:00:00         0.0   overcast clouds  12.009182        2
2012-10-02 00:00:00         0.0   overcast clouds  12.131903        2
2012-10-02 01:00:00 
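
The same grading can also be written as a lookup table, which keeps the good and moderate descriptions in one place and produces integers directly. This is a sketch of an equivalent approach rather than the notebook's method; `GOOD`, `MODERATE` and `score_by_description` are illustrative names.

```python
import pandas as pd

# Descriptions graded 3 (good) and 2 (moderate), as described above.
GOOD = {"sky is clear", "few clouds"}
MODERATE = {"overcast clouds", "broken clouds", "scattered clouds"}
score_by_description = {**{d: 3 for d in GOOD}, **{d: 2 for d in MODERATE}}

descriptions = pd.Series(["sky is clear", "broken clouds", "light rain", "mist"])

# Descriptions missing from the lookup map to NaN, which is then filled with the bad-weather score 0.
desc_rate = descriptions.map(score_by_description).fillna(0).astype(int)
print(desc_rate.tolist())
```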

The next step was to determine how good the temperature was so we could assign it a grade. We first took the mean temperature; anything below the average would be considered 'bad weather'.

In [1059]:
chiWea['Celcius'].mean()

10.200414543717171

The same approach is used as above. This time, a temperature of 10 degrees or below (roughly the mean) is assigned 0, as it is considered bad weather. Anything at or above 20 degrees Celsius is given 3, as it is good weather. Temperatures in between are considered moderate and are given 2.

In [1060]:
conditions = [
    (chiWea['Celcius'] <= 10),
    (chiWea['Celcius'] >= 20)]
choices = ['0', '3']
chiWea['TempRate'] = np.select(conditions, choices, default='2')
print(chiWea)

                     Wind Speed       Description    Celcius DescRate TempRate
datetime                                                                      
2012-10-01 13:00:00         0.0   overcast clouds  10.860000        2        2
2012-10-01 14:00:00         0.0   overcast clouds  10.904691        2        2
2012-10-01 15:00:00         0.0   overcast clouds  11.027412        2        2
2012-10-01 16:00:00         0.0   overcast clouds  11.150133        2        2
2012-10-01 17:00:00         0.0   overcast clouds  11.272855        2        2
2012-10-01 18:00:00         0.0   overcast clouds  11.395576        2        2
2012-10-01 19:00:00         0.0   overcast clouds  11.518297        2        2
2012-10-01 20:00:00         0.0   overcast clouds  11.641018        2        2
2012-10-01 21:00:00         0.0   overcast clouds  11.763739        2        2
2012-10-01 22:00:00         0.0   overcast clouds  11.886461        2        2
2012-10-01 23:00:00         0.0   overcast clouds  1
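
To make the boundary behaviour explicit, the short sketch below applies the same two conditions to a handful of sample temperatures. The sample values are illustrative (only 10.86 comes from the data above), and integer choices are used here so the result needs no later conversion.

```python
import numpy as np
import pandas as pd

# Sample temperatures chosen to sit on and around the two thresholds.
celsius = pd.Series([9.99, 10.0, 10.86, 19.99, 20.0, 25.0])

conditions = [celsius <= 10, celsius >= 20]

# 0 for cold, 3 for warm, and the default 2 for everything in between.
temp_rate = np.select(conditions, [0, 3], default=2)
print(temp_rate.tolist())
```

Exactly 10 degrees falls in the bad band and exactly 20 degrees in the good band, matching the `<=` and `>=` comparisons.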

In [1061]:
chiWea.head()

datetime,Wind Speed,Description,Celcius,DescRate,TempRate
2012-10-01 13:00:00,0.0,overcast clouds,10.86,2,2
2012-10-01 14:00:00,0.0,overcast clouds,10.904691,2,2
2012-10-01 15:00:00,0.0,overcast clouds,11.027412,2,2
2012-10-01 16:00:00,0.0,overcast clouds,11.150133,2,2
2012-10-01 17:00:00,0.0,overcast clouds,11.272855,2,2


The average wind speed is found, and it is relatively low. Although the dataset does not state the unit of measurement, we are inclined to think it is knots. The average wind speed in Chicago is around 6-10 km per hour according to https://www.isws.illinois.edu/statecli/wind/wind.htm, and our average, read as knots, would fall within this bracket.

In [1062]:
chiWea['Wind Speed'].mean()

3.7593255546716167
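
The unit reasoning can be checked with a quick conversion. The sketch below converts the mean above to km/h under the knots assumption (1 knot = 1.852 km/h) and, for comparison only, under a metres-per-second assumption (1 m/s = 3.6 km/h). Neither unit is confirmed by the dataset itself.

```python
KMH_PER_KNOT = 1.852
KMH_PER_MS = 3.6

mean_speed = 3.7593255546716167  # mean of the 'Wind Speed' column, from the output above

# Read as knots, the mean lands inside the quoted 6-10 km/h bracket.
mean_kmh_if_knots = mean_speed * KMH_PER_KNOT

# Read as metres per second, the same figure would land above the bracket.
mean_kmh_if_ms = mean_speed * KMH_PER_MS

print(round(mean_kmh_if_knots, 2))
print(round(mean_kmh_if_ms, 2))
```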

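Wind speed less than or equal to 2 is considered very calm, so it is good and valued at 3. Above 2 and up to 5 is considered moderate and valued at 2. Above 5 knots is considered bad and is meant to receive 0. Note that the second condition below joins its two comparisons with `|` rather than `&`, so it is true for every speed above 2; as the later output shows, hours with wind speeds of 6 or 9 are still given 2, and 0 is never assigned.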

In [1063]:
conditions = [
    (chiWea['Wind Speed'] <= 2),
    (chiWea['Wind Speed'] > 2) | (chiWea['Wind Speed'] < 5) ]
choices = ['3', '2']
chiWea['WindRate'] = np.select(conditions, choices, default='0')
print(chiWea)

                     Wind Speed       Description    Celcius DescRate  \
datetime                                                                
2012-10-01 13:00:00         0.0   overcast clouds  10.860000        2   
2012-10-01 14:00:00         0.0   overcast clouds  10.904691        2   
2012-10-01 15:00:00         0.0   overcast clouds  11.027412        2   
2012-10-01 16:00:00         0.0   overcast clouds  11.150133        2   
2012-10-01 17:00:00         0.0   overcast clouds  11.272855        2   
2012-10-01 18:00:00         0.0   overcast clouds  11.395576        2   
2012-10-01 19:00:00         0.0   overcast clouds  11.518297        2   
2012-10-01 20:00:00         0.0   overcast clouds  11.641018        2   
2012-10-01 21:00:00         0.0   overcast clouds  11.763739        2   
2012-10-01 22:00:00         0.0   overcast clouds  11.886461        2   
2012-10-01 23:00:00         0.0   overcast clouds  12.009182        2   
2012-10-02 00:00:00         0.0   overcast clouds  
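
The effect of the operator in the second wind condition can be seen on a few sample speeds. The sketch below compares the condition as written with a version using `&`; treating exactly 5 as moderate is an assumption based on the description "above 5 is bad".

```python
import numpy as np
import pandas as pd

wind = pd.Series([0.0, 2.0, 3.0, 5.0, 6.0, 9.0])

# As written in the notebook: "> 2 OR < 5" holds for every speed, so the default 0 is never used.
as_written = np.select([wind <= 2, (wind > 2) | (wind < 5)], [3, 2], default=0)

# Intended bands: above 2 and up to 5 is moderate; anything above 5 falls through to 0.
intended = np.select([wind <= 2, (wind > 2) & (wind <= 5)], [3, 2], default=0)

print(as_written.tolist())
print(intended.tolist())
```

Switching to the `&` version would lower WindRate, and therefore Overall, for every hour with wind above 5, so the counts reported further down would change.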

In [1064]:
chiWea.head()

datetime,Wind Speed,Description,Celcius,DescRate,TempRate,WindRate
2012-10-01 13:00:00,0.0,overcast clouds,10.86,2,2,3
2012-10-01 14:00:00,0.0,overcast clouds,10.904691,2,2,3
2012-10-01 15:00:00,0.0,overcast clouds,11.027412,2,2,3
2012-10-01 16:00:00,0.0,overcast clouds,11.150133,2,2,3
2012-10-01 17:00:00,0.0,overcast clouds,11.272855,2,2,3


In [1065]:
chiWea.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 45252 entries, 2012-10-01 13:00:00 to 2017-11-30 00:00:00
Freq: H
Data columns (total 6 columns):
Wind Speed     45252 non-null float64
Description    45252 non-null object
Celcius        45252 non-null float64
DescRate       45252 non-null object
TempRate       45252 non-null object
WindRate       45252 non-null object
dtypes: float64(2), object(4)
memory usage: 3.7+ MB


We have to convert the new rating columns to int (they were created as strings) because we are going to add them together to get the overall weather rating.

In [1066]:
chiWea['DescRate'] = chiWea['DescRate'].astype(int)

In [1067]:
chiWea['TempRate'] = chiWea['TempRate'].astype(int)

In [1068]:
chiWea['WindRate'] = chiWea['WindRate'].astype(int)

In [1069]:
chiWea.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 45252 entries, 2012-10-01 13:00:00 to 2017-11-30 00:00:00
Freq: H
Data columns (total 6 columns):
Wind Speed     45252 non-null float64
Description    45252 non-null object
Celcius        45252 non-null float64
DescRate       45252 non-null int32
TempRate       45252 non-null int32
WindRate       45252 non-null int32
dtypes: float64(2), int32(3), object(1)
memory usage: 3.1+ MB


The Overall column is created. It consists of the description, temperature and wind ratings added together, and gives us a sense of how good the weather was for that hour.

In [1070]:
chiWea['Overall'] = chiWea["DescRate"] + chiWea["TempRate"] + chiWea["WindRate"]

In [1071]:
chiWea.head()

datetime,Wind Speed,Description,Celcius,DescRate,TempRate,WindRate,Overall
2012-10-01 13:00:00,0.0,overcast clouds,10.86,2,2,3,7
2012-10-01 14:00:00,0.0,overcast clouds,10.904691,2,2,3,7
2012-10-01 15:00:00,0.0,overcast clouds,11.027412,2,2,3,7
2012-10-01 16:00:00,0.0,overcast clouds,11.150133,2,2,3,7
2012-10-01 17:00:00,0.0,overcast clouds,11.272855,2,2,3,7


We then group the weather into Good, Moderate and Bad. An overall score of 8 or more is Good, 6 or 7 is Moderate, and anything under 6 is Bad. For example, an hour with a temperature of 21 degrees and rain scores 3 for temperature and 0 for description, so it can reach at most 6 and is graded Bad unless the wind is also calm. This reflects the idea that rain is generally bad weather that keeps people indoors.

In [1072]:
conditions = [
    (chiWea['Overall'] >= 8), 
    (chiWea['Overall'] >= 6) & (chiWea['Overall'] < 8) ]
choices = ['Good', 'Moderate']
chiWea['Weather'] = np.select(conditions, choices, default='Bad')
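
Because each of the three ratings can only be 0, 2 or 3, every possible combination can be listed and labelled using the thresholds above. The sketch below does this in plain Python; `label` and `table` are illustrative names, and the tuple order (description, temperature, wind) is a convention chosen here.

```python
from itertools import product

def label(overall):
    """Apply the thresholds above: 8 or more is Good, 6 or 7 is Moderate, else Bad."""
    if overall >= 8:
        return "Good"
    if overall >= 6:
        return "Moderate"
    return "Bad"

scores = (0, 2, 3)

# Map every (description, temperature, wind) rating combination to its weather label.
table = {
    (desc, temp, wind): label(desc + temp + wind)
    for desc, temp, wind in product(scores, repeat=3)
}

print(table[(0, 3, 3)])  # rain, warm, calm
print(table[(0, 3, 2)])  # rain, warm, moderate wind
print(table[(2, 2, 2)])  # moderate on all three
print(table[(3, 3, 2)])  # good description and temperature, moderate wind
```

Under these thresholds, an hour with one rating of 0 reaches Moderate only when the other two ratings are both 3.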

In [1073]:
chiWea.head(50)

datetime,Wind Speed,Description,Celcius,DescRate,TempRate,WindRate,Overall,Weather
2012-10-01 13:00:00,0.0,overcast clouds,10.86,2,2,3,7,Moderate
2012-10-01 14:00:00,0.0,overcast clouds,10.904691,2,2,3,7,Moderate
2012-10-01 15:00:00,0.0,overcast clouds,11.027412,2,2,3,7,Moderate
2012-10-01 16:00:00,0.0,overcast clouds,11.150133,2,2,3,7,Moderate
2012-10-01 17:00:00,0.0,overcast clouds,11.272855,2,2,3,7,Moderate
2012-10-01 18:00:00,0.0,overcast clouds,11.395576,2,2,3,7,Moderate
2012-10-01 19:00:00,0.0,overcast clouds,11.518297,2,2,3,7,Moderate
2012-10-01 20:00:00,0.0,overcast clouds,11.641018,2,2,3,7,Moderate
2012-10-01 21:00:00,0.0,overcast clouds,11.763739,2,2,3,7,Moderate
2012-10-01 22:00:00,0.0,overcast clouds,11.886461,2,2,3,7,Moderate


In [1074]:
chiWea.sort_values(by=['Overall'])

datetime,Wind Speed,Description,Celcius,DescRate,TempRate,WindRate,Overall,Weather
2016-04-26 12:00:00,6.0,fog,7.350000,0,0,2,2,Bad
2016-02-27 15:00:00,6.0,haze,0.640000,0,0,2,2,Bad
2015-01-03 16:00:00,3.0,moderate rain,0.855667,0,0,2,2,Bad
2013-05-24 14:00:00,9.0,mist,8.690000,0,0,2,2,Bad
2014-03-19 00:00:00,6.0,very heavy rain,7.930000,0,0,2,2,Bad
2014-03-19 01:00:00,4.0,very heavy rain,7.400000,0,0,2,2,Bad
2015-01-04 02:00:00,3.0,light rain,2.110667,0,0,2,2,Bad
2014-03-19 08:00:00,5.0,light rain,6.540000,0,0,2,2,Bad
2015-01-04 04:00:00,3.0,light rain,2.397000,0,0,2,2,Bad
2014-03-19 09:00:00,4.0,light rain,6.170000,0,0,2,2,Bad


We give an overall count of the weather grades. Unsurprisingly, there are more Bad hours than either Moderate or Good. This is probably because many of the hours fall overnight, when the temperature is lower and the sun is down.

In [1075]:
wea1 = chiWea.groupby('Weather')[('Overall')].count()
wea1.sort_values(ascending=False)

Weather
Bad         25374
Moderate    13160
Good         6718
Name: Overall, dtype: int64

We save this to a new CSV file, which will be used for the analysis.

In [1076]:
chiWea.to_csv('../../data/prep/300_ChiWea.csv')
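
When the saved file is read back for analysis, the datetime index has to be parsed again, as it was for the raw files. The sketch below shows the round trip on a one-row frame written to an in-memory buffer; the buffer stands in for the CSV path above, and the single row is taken from the output shown earlier.

```python
import io
import pandas as pd

# A one-row stand-in for chiWea with a named datetime index.
frame = pd.DataFrame(
    {"Overall": [7], "Weather": ["Moderate"]},
    index=pd.DatetimeIndex(["2012-10-01 13:00:00"], name="datetime"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# Restore the datetime column as a parsed index, mirroring the earlier read_csv calls.
restored = pd.read_csv(buffer, index_col="datetime", parse_dates=True)
print(restored.index.dtype)
```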