# Preparing LA Weather Dataset

We want to examine whether the weather affects the types of crime committed around the time of sports games. For example, if the weather is bad, do more fans stay indoors to watch the games, and does domestic violence rise as a result? Conversely, does good weather lead to more assault and vandalism, as more fans are outdoors and attending the games?

In this notebook we load three datasets: one for temperature, one for weather description and one for wind speed. Each contains hourly data for 36 cities (mostly in the US, but also in Canada and Israel, as the column names below show), with one column per city named after that city. We want to grade the weather so we can get an overall sense of how good or bad it was at a given hour. Each of the three measures contributes equally to the overall grade.

### Loading in the data

In [230]:
import os.path
import datetime
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

The temperature dataset is loaded first. It contains hourly data for the 36 cities, and the datetime column is made the index.

In [231]:
if not os.path.exists( "../../data/raw/temperature.csv" ):
    print("Missing dataset file")

In [232]:
temp = pd.read_csv( "../../data/raw/temperature.csv", index_col="datetime", parse_dates=True)

In [233]:
temp.head()

datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,Denver,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
2012-10-01 12:00:00,,,,,,,,,,,...,,,,,,,309.1,,,
2012-10-01 13:00:00,284.63,282.08,289.48,281.8,291.87,291.53,293.41,296.6,285.12,284.61,...,285.63,288.22,285.83,287.17,307.59,305.47,310.58,304.4,304.4,303.5
2012-10-01 14:00:00,284.629041,282.083252,289.474993,281.797217,291.868186,291.533501,293.403141,296.608509,285.154558,284.607306,...,285.663208,288.247676,285.83465,287.186092,307.59,304.31,310.495769,304.4,304.4,303.5
2012-10-01 15:00:00,284.626998,282.091866,289.460618,281.789833,291.862844,291.543355,293.392177,296.631487,285.233952,284.599918,...,285.756824,288.32694,285.84779,287.231672,307.391513,304.281841,310.411538,304.4,304.4,303.5
2012-10-01 16:00:00,284.624955,282.100481,289.446243,281.782449,291.857503,291.553209,293.381213,296.654466,285.313345,284.59253,...,285.85044,288.406203,285.860929,287.277251,307.1452,304.238015,310.327308,304.4,304.4,303.5


We keep only the Los Angeles column.

In [234]:
chi1 = temp.filter(items=['Los Angeles'])

In [235]:
chi1.head()

datetime,Los Angeles
2012-10-01 12:00:00,
2012-10-01 13:00:00,291.87
2012-10-01 14:00:00,291.868186
2012-10-01 15:00:00,291.862844
2012-10-01 16:00:00,291.857503


This column is renamed to 'Temperature'. In each dataset, whether temperature, wind speed or description, the column is named after the city. If we joined them without renaming, we would be left with three columns all named 'Los Angeles'.

In [236]:
chi1 = chi1.rename(columns={'Los Angeles': 'Temperature'})

In [237]:
chi1.head()

datetime,Temperature
2012-10-01 12:00:00,
2012-10-01 13:00:00,291.87
2012-10-01 14:00:00,291.868186
2012-10-01 15:00:00,291.862844
2012-10-01 16:00:00,291.857503


The wind speed dataset is now loaded.

In [238]:
if not os.path.exists( "../../data/raw/wind_speed.csv" ):
    print("Missing dataset file")

The datetime column is again made the index.

In [239]:
temp1 = pd.read_csv( "../../data/raw/wind_speed.csv", index_col="datetime", parse_dates=True)

In [240]:
temp1.head()

datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,Denver,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
2012-10-01 12:00:00,,,,,,,,,,,...,,,,,,,8.0,,,
2012-10-01 13:00:00,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,4.0,4.0,...,4.0,7.0,4.0,3.0,1.0,0.0,8.0,2.0,2.0,2.0
2012-10-01 14:00:00,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,4.0,4.0,...,4.0,7.0,4.0,3.0,3.0,0.0,8.0,2.0,2.0,2.0
2012-10-01 15:00:00,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,4.0,3.0,...,3.0,7.0,4.0,3.0,3.0,0.0,8.0,2.0,2.0,2.0
2012-10-01 16:00:00,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,4.0,3.0,...,3.0,7.0,4.0,3.0,3.0,0.0,8.0,2.0,2.0,2.0


Again, we are only interested in the Los Angeles column.

In [241]:
chi2 = temp1.filter(items=['Los Angeles'])

In [242]:
chi2.head()

datetime,Los Angeles
2012-10-01 12:00:00,
2012-10-01 13:00:00,0.0
2012-10-01 14:00:00,0.0
2012-10-01 15:00:00,0.0
2012-10-01 16:00:00,0.0


The Los Angeles column is renamed to 'Wind Speed'.

In [243]:
chi2 = chi2.rename(columns={'Los Angeles': 'Wind Speed'})

In [244]:
chi2.head()

datetime,Wind Speed
2012-10-01 12:00:00,
2012-10-01 13:00:00,0.0
2012-10-01 14:00:00,0.0
2012-10-01 15:00:00,0.0
2012-10-01 16:00:00,0.0


The weather description dataset is loaded next.

In [245]:
if not os.path.exists( "../../data/raw/weather_description.csv" ):
    print("Missing dataset file")

The datetime column is again made the index.

In [246]:
temp2 = pd.read_csv( "../../data/raw/weather_description.csv", index_col="datetime", parse_dates=True)

In [247]:
temp2.head()

datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,Denver,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
2012-10-01 12:00:00,,,,,,,,,,,...,,,,,,,haze,,,
2012-10-01 13:00:00,mist,scattered clouds,light rain,sky is clear,mist,sky is clear,sky is clear,sky is clear,sky is clear,light rain,...,broken clouds,few clouds,overcast clouds,sky is clear,sky is clear,sky is clear,haze,sky is clear,sky is clear,sky is clear
2012-10-01 14:00:00,broken clouds,scattered clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,broken clouds,...,broken clouds,few clouds,sky is clear,few clouds,sky is clear,sky is clear,broken clouds,overcast clouds,sky is clear,overcast clouds
2012-10-01 15:00:00,broken clouds,scattered clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,broken clouds,...,broken clouds,few clouds,sky is clear,few clouds,overcast clouds,sky is clear,broken clouds,overcast clouds,overcast clouds,overcast clouds
2012-10-01 16:00:00,broken clouds,scattered clouds,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,sky is clear,broken clouds,...,broken clouds,few clouds,sky is clear,few clouds,overcast clouds,sky is clear,broken clouds,overcast clouds,overcast clouds,overcast clouds


Again, we only look at the Los Angeles column.

In [248]:
chi3 = temp2.filter(items=['Los Angeles'])

In [249]:
chi3.head()

datetime,Los Angeles
2012-10-01 12:00:00,
2012-10-01 13:00:00,mist
2012-10-01 14:00:00,sky is clear
2012-10-01 15:00:00,sky is clear
2012-10-01 16:00:00,sky is clear


This column is renamed to 'Description'.

In [250]:
chi3 = chi3.rename(columns={'Los Angeles': 'Description'})

In [251]:
chi3.head()

datetime,Description
2012-10-01 12:00:00,
2012-10-01 13:00:00,mist
2012-10-01 14:00:00,sky is clear
2012-10-01 15:00:00,sky is clear
2012-10-01 16:00:00,sky is clear


The Temperature, Wind Speed and Description columns are now concatenated into a single dataframe.

In [252]:
frames = [chi1, chi2, chi3]
df = pd.concat(frames, axis = 1)

In [253]:
df.head()

datetime,Temperature,Wind Speed,Description
2012-10-01 12:00:00,,,
2012-10-01 13:00:00,291.87,0.0,mist
2012-10-01 14:00:00,291.868186,0.0,sky is clear
2012-10-01 15:00:00,291.862844,0.0,sky is clear
2012-10-01 16:00:00,291.857503,0.0,sky is clear
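The filter, rename and concatenate steps above repeat the same pattern three times. A minimal sketch of how they could be folded into one helper is shown here; `combine_city` is a hypothetical name that does not appear in the notebook, and the tiny frames stand in for the real CSV files.

```python
import pandas as pd

def combine_city(sources, city):
    """Pick one city's column from each wide frame and join them side by side.

    `sources` maps the desired output column name to a wide DataFrame
    (one column per city). Hypothetical helper, not part of the notebook.
    """
    columns = [frame[city].rename(name) for name, frame in sources.items()]
    return pd.concat(columns, axis=1)

# Small stand-in frames; the real ones come from pd.read_csv as above.
idx = pd.to_datetime(["2012-10-01 13:00", "2012-10-01 14:00"])
temperature = pd.DataFrame(
    {"Los Angeles": [291.87, 291.868186], "Seattle": [281.8, 281.797217]}, index=idx)
wind = pd.DataFrame(
    {"Los Angeles": [0.0, 0.0], "Seattle": [0.0, 0.0]}, index=idx)
description = pd.DataFrame(
    {"Los Angeles": ["mist", "sky is clear"],
     "Seattle": ["sky is clear", "sky is clear"]}, index=idx)

combined = combine_city(
    {"Temperature": temperature, "Wind Speed": wind, "Description": description},
    "Los Angeles",
)
print(combined)
```

Renaming each column as it is selected avoids ending up with several columns that share the city's name.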


### Dealing with Null values

We look at the null values in the new dataframe. 

In [254]:
df.isnull().sum()

Temperature    3
Wind Speed     1
Description    1
dtype: int64

We look at the rows where Description is null. The only one is the first row, which is blank in all fields, so this row is deleted.

In [255]:
print(df[df["Description"].isnull()])

                     Temperature  Wind Speed Description
datetime                                                
2012-10-01 12:00:00          NaN         NaN         NaN


In [256]:
df = df.dropna(subset = ['Description'])

We now look at the rows where Temperature is null.

In [257]:
print(df[df["Temperature"].isnull()])

                     Temperature  Wind Speed   Description
datetime                                                  
2013-03-11 07:00:00          NaN         0.0  sky is clear
2013-03-11 08:00:00          NaN         1.0  sky is clear


We look at the whole day that contains the null values.

In [258]:
testdate = df.loc['2013-03-11':'2013-03-11']

In [259]:
testdate

datetime,Temperature,Wind Speed,Description
2013-03-11 00:00:00,289.68,0.0,sky is clear
2013-03-11 01:00:00,289.7465,0.0,sky is clear
2013-03-11 02:00:00,287.22,0.0,sky is clear
2013-03-11 03:00:00,286.15,0.0,sky is clear
2013-03-11 04:00:00,284.96,0.0,sky is clear
2013-03-11 05:00:00,284.49,0.0,sky is clear
2013-03-11 06:00:00,283.81,0.0,sky is clear
2013-03-11 07:00:00,,0.0,sky is clear
2013-03-11 08:00:00,,1.0,sky is clear
2013-03-11 09:00:00,282.58,0.0,sky is clear


We take the average of 283.81 and 282.58, the readings immediately before and after the gap, which is 283.195, and fill the null values with it. We preferred this to deleting the rows, as we may need these hours later when comparing against the NFL and NBA games, and the average of the neighbouring readings seemed the most reasonable estimate.

In [260]:
df['Temperature'] = df['Temperature'].fillna(283.195)

In [261]:
print(df[df["Temperature"].isnull()])

Empty DataFrame
Columns: [Temperature, Wind Speed, Description]
Index: []
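The fill value above was worked out by hand. As a sketch of how the same neighbour average could be computed directly from the data, the forward-filled and backward-filled series can be averaged; the four readings below are copied from the 2013-03-11 output, and the variable names are illustrative only.

```python
import pandas as pd

# Readings around the gap on 2013-03-11 (kelvin), with the two missing hours.
temps = pd.Series(
    [283.81, None, None, 282.58],
    index=pd.to_datetime([
        "2013-03-11 06:00", "2013-03-11 07:00",
        "2013-03-11 08:00", "2013-03-11 09:00",
    ]),
)

# Average of the last valid reading before and the first valid reading after.
neighbour_mean = (temps.ffill() + temps.bfill()) / 2

# Only the missing entries are replaced; valid readings are left untouched.
filled = temps.fillna(neighbour_mean)
print(filled)
```

With this approach a gap at the very start or end of the series would stay null, because one of the two neighbours is missing.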


We want to convert the temperature from kelvin to degrees Celsius. Celsius is more familiar to us and to the general public, so it should make the results easier to understand. The formula is T(°C) = T(K) - 273.15, so we create a new column called 'Celcius' (the spelling used for the column name throughout) using this formula.

In [262]:
df["Celcius"] = df["Temperature"] - 273.15

In [263]:
df.head()

datetime,Temperature,Wind Speed,Description,Celcius
2012-10-01 13:00:00,291.87,0.0,mist,18.72
2012-10-01 14:00:00,291.868186,0.0,sky is clear,18.718186
2012-10-01 15:00:00,291.862844,0.0,sky is clear,18.712844
2012-10-01 16:00:00,291.857503,0.0,sky is clear,18.707503
2012-10-01 17:00:00,291.852162,0.0,sky is clear,18.702162
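As a quick check of the conversion, the formula can be wrapped in a small function and applied to the first reading. The function name is an assumption made for this sketch; the notebook applies the subtraction to the column directly.

```python
KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in kelvin

def kelvin_to_celsius(kelvin):
    """Convert kelvin to degrees Celsius; works on floats and pandas Series."""
    return kelvin - KELVIN_OFFSET

first_reading = kelvin_to_celsius(291.87)
print(round(first_reading, 2))
```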


The kelvin temperature column is now dropped, as it is no longer needed.

In [264]:
df = df.drop('Temperature', axis=1)

In [265]:
chiWea = df

In [266]:
chiWea.head()

datetime,Wind Speed,Description,Celcius
2012-10-01 13:00:00,0.0,mist,18.72
2012-10-01 14:00:00,0.0,sky is clear,18.718186
2012-10-01 15:00:00,0.0,sky is clear,18.712844
2012-10-01 16:00:00,0.0,sky is clear,18.707503
2012-10-01 17:00:00,0.0,sky is clear,18.702162


### The Weather Grading System

We now look at all the different weather descriptions. These will be grouped according to how good the weather is.

In [267]:
pt = chiWea.groupby('Description')[('Celcius')].count()
pt.sort_values(ascending = False)

Description
sky is clear                    26136
haze                             3532
mist                             2961
broken clouds                    2565
overcast clouds                  2432
scattered clouds                 2271
light rain                       1949
few clouds                       1757
fog                               566
moderate rain                     481
smoke                             203
heavy intensity rain              127
light intensity drizzle           104
dust                               66
proximity thunderstorm             25
very heavy rain                    20
thunderstorm                       17
thunderstorm with light rain       13
shower rain                         8
drizzle                             5
proximity shower rain               4
light intensity shower rain         3
squalls                             3
thunderstorm with heavy rain        2
thunderstorm with rain              2
Name: Celcius, dtype: int64

We create the following conditions. If the description is 'sky is clear' or 'few clouds', the weather is considered good and given a value of 3. If it is 'overcast clouds', 'broken clouds' or 'scattered clouds', the weather is considered moderate and given a value of 2. Everything else (the bad weather) is given a value of 0. We wanted a clear gap between bad weather and even moderate weather, which is why bad weather is valued at 0 instead of 1.

In [268]:
conditions = [
    (chiWea['Description'] == 'sky is clear') | (chiWea['Description'] == 'few clouds'),
    (chiWea['Description'] == 'overcast clouds') | (chiWea['Description'] == 'broken clouds') 
    | (chiWea['Description'] == 'scattered clouds') ]
choices = ['3', '2']
chiWea['DescRate'] = np.select(conditions, choices, default='0')
print(chiWea)

                     Wind Speed       Description    Celcius DescRate
datetime                                                             
2012-10-01 13:00:00         0.0              mist  18.720000        0
2012-10-01 14:00:00         0.0      sky is clear  18.718186        3
2012-10-01 15:00:00         0.0      sky is clear  18.712844        3
2012-10-01 16:00:00         0.0      sky is clear  18.707503        3
2012-10-01 17:00:00         0.0      sky is clear  18.702162        3
2012-10-01 18:00:00         0.0      sky is clear  18.696821        3
2012-10-01 19:00:00         0.0      sky is clear  18.691480        3
2012-10-01 20:00:00         0.0      sky is clear  18.686139        3
2012-10-01 21:00:00         0.0      sky is clear  18.680798        3
2012-10-01 22:00:00         0.0      sky is clear  18.675457        3
2012-10-01 23:00:00         0.0      sky is clear  18.670116        3
2012-10-02 00:00:00         0.0      sky is clear  18.664775        3
...
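The description grading above can also be written as a lookup table. The sketch below is an alternative formulation, not the notebook's code: it uses `Series.map` with a dictionary and yields integers directly, so no later conversion from strings is needed. `DESCRIPTION_RATING` and `rate_description` are names invented for the illustration.

```python
import pandas as pd

# Ratings for good (3) and moderate (2) descriptions; anything else is bad (0).
DESCRIPTION_RATING = {
    "sky is clear": 3,
    "few clouds": 3,
    "overcast clouds": 2,
    "broken clouds": 2,
    "scattered clouds": 2,
}

def rate_description(descriptions):
    """Map each description to its rating, defaulting unlisted ones to 0."""
    return descriptions.map(DESCRIPTION_RATING).fillna(0).astype(int)

sample = pd.Series(["mist", "sky is clear", "broken clouds", "light rain"])
ratings = rate_description(sample)
print(ratings.tolist())
```

A dictionary keeps the grouping in one place, which makes it easier to review which descriptions fall into which grade.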

The next step was to determine how good the temperature was, so we could assign it a grade. We first look at the mean temperature, along with the maximum and minimum, to help choose the cut-offs.

In [269]:
chiWea['Celcius'].mean()

17.695777983241285

In [287]:
chiWea['Celcius'].max()

42.32000000000005

In [288]:
chiWea['Celcius'].min()

-6.6463333329999728

Again, the same approach is used as above. If the temperature is 13 °C or below, a value of 0 is assigned, as it is considered bad weather. Anything at or above 23 °C is given a value of 3, as it is good weather. Temperatures in between are considered moderate and given a value of 2.

In [270]:
conditions = [
    (chiWea['Celcius'] <= 13),
    (chiWea['Celcius'] >= 23)]
choices = ['0', '3']
chiWea['TempRate'] = np.select(conditions, choices, default='2')
print(chiWea)

                     Wind Speed       Description    Celcius DescRate TempRate
datetime                                                                      
2012-10-01 13:00:00         0.0              mist  18.720000        0        2
2012-10-01 14:00:00         0.0      sky is clear  18.718186        3        2
2012-10-01 15:00:00         0.0      sky is clear  18.712844        3        2
2012-10-01 16:00:00         0.0      sky is clear  18.707503        3        2
2012-10-01 17:00:00         0.0      sky is clear  18.702162        3        2
2012-10-01 18:00:00         0.0      sky is clear  18.696821        3        2
2012-10-01 19:00:00         0.0      sky is clear  18.691480        3        2
2012-10-01 20:00:00         0.0      sky is clear  18.686139        3        2
2012-10-01 21:00:00         0.0      sky is clear  18.680798        3        2
2012-10-01 22:00:00         0.0      sky is clear  18.675457        3        2
...
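To make the boundary behaviour of the temperature grading explicit, here is a small sketch with the cut-offs used in the cell above (13 and 23) passed as parameters. `rate_temperature` is a hypothetical helper, and it returns integers rather than the strings used in the notebook.

```python
import numpy as np
import pandas as pd

def rate_temperature(celsius, cold=13, warm=23):
    """Grade temperatures: 0 at or below `cold`, 3 at or above `warm`, else 2."""
    conditions = [celsius <= cold, celsius >= warm]
    ratings = np.select(conditions, [0, 3], default=2)
    return pd.Series(ratings, index=celsius.index)

# Includes both boundary values to show which side they fall on.
sample = pd.Series([12.58, 13.0, 18.72, 23.0, 42.32])
temp_ratings = rate_temperature(sample)
print(temp_ratings.tolist())
```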

In [271]:
chiWea.head()

datetime,Wind Speed,Description,Celcius,DescRate,TempRate
2012-10-01 13:00:00,0.0,mist,18.72,0,2
2012-10-01 14:00:00,0.0,sky is clear,18.718186,3,2
2012-10-01 15:00:00,0.0,sky is clear,18.712844,3,2
2012-10-01 16:00:00,0.0,sky is clear,18.707503,3,2
2012-10-01 17:00:00,0.0,sky is clear,18.702162,3,2


The average wind speed is found, and it is relatively low. The dataset does not state the unit of measurement; we assume it is knots. The average wind speed in Los Angeles is around 6-10 km per hour according to https://www.isws.illinois.edu/statecli/wind/wind.htm . Since 1 knot is 1.852 km/h, the mean below (about 1.22) corresponds to roughly 2.3 km/h, which is below that range, so the unit assumption should be treated with caution.

In [272]:
chiWea['Wind Speed'].mean()

1.2195483072571378
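Since the unit is an assumption, it can help to see what the mean would be in km/h if the values really are knots. The sketch below uses the standard definition of the international knot (1.852 km/h); the function name is made up for this illustration.

```python
KMH_PER_KNOT = 1.852  # one international knot is exactly 1.852 km/h

def knots_to_kmh(knots):
    """Convert a speed in knots to kilometres per hour."""
    return knots * KMH_PER_KNOT

mean_wind = 1.2195483072571378  # mean of the 'Wind Speed' column reported above
mean_wind_kmh = knots_to_kmh(mean_wind)
print(round(mean_wind_kmh, 2))
```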

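The intended grading is as follows: a wind speed of 2 or less is very calm, considered good and valued at 3; above 2 and up to 5 is moderate and valued at 2; above 5 knots is bad and given a value of 0. Note that the cell below, as written, assigns 3 only when the wind speed is exactly 0, and its second condition joins the two comparisons with `|` rather than `&`, so every non-zero speed receives 2 and the default of 0 is never assigned. The outputs in the rest of the notebook reflect the code as written.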

In [273]:
conditions = [
    (chiWea['Wind Speed'] == 0),
    (chiWea['Wind Speed'] > 0) | (chiWea['Wind Speed'] < 5) ]
choices = ['3', '2']
chiWea['WindRate'] = np.select(conditions, choices, default='0')
print(chiWea)

                     Wind Speed       Description    Celcius DescRate  \
datetime                                                                
2012-10-01 13:00:00         0.0              mist  18.720000        0   
2012-10-01 14:00:00         0.0      sky is clear  18.718186        3   
2012-10-01 15:00:00         0.0      sky is clear  18.712844        3   
2012-10-01 16:00:00         0.0      sky is clear  18.707503        3   
2012-10-01 17:00:00         0.0      sky is clear  18.702162        3   
2012-10-01 18:00:00         0.0      sky is clear  18.696821        3   
2012-10-01 19:00:00         0.0      sky is clear  18.691480        3   
2012-10-01 20:00:00         0.0      sky is clear  18.686139        3   
2012-10-01 21:00:00         0.0      sky is clear  18.680798        3   
2012-10-01 22:00:00         0.0      sky is clear  18.675457        3   
2012-10-01 23:00:00         0.0      sky is clear  18.670116        3   
...
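The second condition in the cell above combines `> 0` and `< 5` with `|`, which is true for every non-zero speed. As a sketch only, the thresholds stated in the text (2 or less, above 2 up to 5, above 5) could be expressed with `&` as follows. `rate_wind` is a hypothetical helper, and this is not the code that produced the outputs shown in this notebook.

```python
import numpy as np
import pandas as pd

def rate_wind(speed, calm=2, moderate=5):
    """Grade wind: 3 up to `calm`, 2 above `calm` up to `moderate`, else 0."""
    conditions = [
        speed <= calm,
        (speed > calm) & (speed <= moderate),  # both comparisons must hold
    ]
    ratings = np.select(conditions, [3, 2], default=0)
    return pd.Series(ratings, index=speed.index)

sample = pd.Series([0.0, 2.0, 3.0, 5.0, 8.0])
wind_ratings = rate_wind(sample)
print(wind_ratings.tolist())
```

Using these thresholds would change the WindRate values, and therefore the Overall scores and final counts, compared with the outputs shown here.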

In [274]:
chiWea.head()

datetime,Wind Speed,Description,Celcius,DescRate,TempRate,WindRate
2012-10-01 13:00:00,0.0,mist,18.72,0,2,3
2012-10-01 14:00:00,0.0,sky is clear,18.718186,3,2,3
2012-10-01 15:00:00,0.0,sky is clear,18.712844,3,2,3
2012-10-01 16:00:00,0.0,sky is clear,18.707503,3,2,3
2012-10-01 17:00:00,0.0,sky is clear,18.702162,3,2,3


In [275]:
chiWea.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 45252 entries, 2012-10-01 13:00:00 to 2017-11-30 00:00:00
Freq: H
Data columns (total 6 columns):
Wind Speed     45252 non-null float64
Description    45252 non-null object
Celcius        45252 non-null float64
DescRate       45252 non-null object
TempRate       45252 non-null object
WindRate       45252 non-null object
dtypes: float64(2), object(4)
memory usage: 3.7+ MB


We have to change our new columns to int, as we are going to add them together to get the overall weather rating.

In [276]:
chiWea['DescRate'] = chiWea['DescRate'].astype(int)

In [277]:
chiWea['TempRate'] = chiWea['TempRate'].astype(int)

In [278]:
chiWea['WindRate'] = chiWea['WindRate'].astype(int)

In [279]:
chiWea.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 45252 entries, 2012-10-01 13:00:00 to 2017-11-30 00:00:00
Freq: H
Data columns (total 6 columns):
Wind Speed     45252 non-null float64
Description    45252 non-null object
Celcius        45252 non-null float64
DescRate       45252 non-null int32
TempRate       45252 non-null int32
WindRate       45252 non-null int32
dtypes: float64(2), int32(3), object(1)
memory usage: 3.1+ MB


The Overall column is created. It consists of the description rating, temperature rating and wind rating added together. This overall rating gives us a sense of how good the weather was for that hour.

In [280]:
chiWea['Overall'] = chiWea["DescRate"] + chiWea["TempRate"] + chiWea["WindRate"]

In [281]:
chiWea.head()

datetime,Wind Speed,Description,Celcius,DescRate,TempRate,WindRate,Overall
2012-10-01 13:00:00,0.0,mist,18.72,0,2,3,5
2012-10-01 14:00:00,0.0,sky is clear,18.718186,3,2,3,8
2012-10-01 15:00:00,0.0,sky is clear,18.712844,3,2,3,8
2012-10-01 16:00:00,0.0,sky is clear,18.707503,3,2,3,8
2012-10-01 17:00:00,0.0,sky is clear,18.702162,3,2,3,8


We then want to group the weather into Good, Moderate and Bad. An overall score of 8 or more is Good, 6 or 7 is Moderate, and anything under 6 is Bad. For example, an hour with a temperature of 21 degrees (rated 2) and rain (rated 0) can score at most 5 even with calm wind (rated 3), so it is graded Bad. This is intended: rain is always considered bad weather, as it discourages people from going outside.

In [282]:
conditions = [
    (chiWea['Overall'] >= 8), 
    (chiWea['Overall'] >= 6) & (chiWea['Overall'] < 8) ]
choices = ['Good', 'Moderate']
chiWea['Weather'] = np.select(conditions, choices, default='Bad')

In [283]:
chiWea.head(50)

datetime,Wind Speed,Description,Celcius,DescRate,TempRate,WindRate,Overall,Weather
2012-10-01 13:00:00,0.0,mist,18.72,0,2,3,5,Bad
2012-10-01 14:00:00,0.0,sky is clear,18.718186,3,2,3,8,Good
2012-10-01 15:00:00,0.0,sky is clear,18.712844,3,2,3,8,Good
2012-10-01 16:00:00,0.0,sky is clear,18.707503,3,2,3,8,Good
2012-10-01 17:00:00,0.0,sky is clear,18.702162,3,2,3,8,Good
2012-10-01 18:00:00,0.0,sky is clear,18.696821,3,2,3,8,Good
2012-10-01 19:00:00,0.0,sky is clear,18.69148,3,2,3,8,Good
2012-10-01 20:00:00,0.0,sky is clear,18.686139,3,2,3,8,Good
2012-10-01 21:00:00,0.0,sky is clear,18.680798,3,2,3,8,Good
2012-10-01 22:00:00,0.0,sky is clear,18.675457,3,2,3,8,Good


In [284]:
chiWea.sort_values(by=['Overall'])

datetime,Wind Speed,Description,Celcius,DescRate,TempRate,WindRate,Overall,Weather
2016-01-24 12:00:00,1.0,mist,9.710000,0,0,2,2,Bad
2017-02-28 05:00:00,1.0,mist,11.420000,0,0,2,2,Bad
2017-02-28 06:00:00,1.0,mist,10.680000,0,0,2,2,Bad
2017-01-20 06:00:00,1.0,mist,12.380000,0,0,2,2,Bad
2016-01-05 15:00:00,3.0,mist,12.210000,0,0,2,2,Bad
2017-01-20 05:00:00,4.0,mist,12.580000,0,0,2,2,Bad
2016-01-05 16:00:00,2.0,moderate rain,12.194844,0,0,2,2,Bad
2016-01-05 17:00:00,2.0,fog,12.140000,0,0,2,2,Bad
2016-01-05 18:00:00,1.0,mist,12.560000,0,0,2,2,Bad
2016-01-05 19:00:00,4.0,fog,12.730000,0,0,2,2,Bad


We give an overall count of the weather grades. Moderate is the most common grade, followed by Bad and then Good. Bad hours outnumber Good ones, which is likely because many of the hours fall overnight, when the temperature is lower.

In [285]:
wea1 = chiWea.groupby('Weather')[('Overall')].count()
wea1.sort_values(ascending=False)

Weather
Moderate    16769
Bad         15167
Good        13316
Name: Overall, dtype: int64
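The raw counts are easier to compare as shares of all hours. The sketch below re-enters the three counts printed above by hand and divides by their total; on the full dataframe, `chiWea['Weather'].value_counts(normalize=True)` would be the usual way to get the same proportions.

```python
import pandas as pd

# Counts copied from the output above.
counts = pd.Series({"Moderate": 16769, "Bad": 15167, "Good": 13316}, name="hours")

# Share of all graded hours that falls into each grade.
shares = (counts / counts.sum()).round(3)
print(shares)
```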

We save this to a new CSV file, which will be used for the analysis.

In [286]:
chiWea.to_csv('../../data/prep/500_LAWea.csv')
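When the saved file is read back for analysis, the datetime index has to be parsed again, in the same way as for the raw files. The sketch below shows the round trip on a tiny stand-in frame written to an in-memory buffer instead of the real path; the two rows are illustrative.

```python
import io
import pandas as pd

# Stand-in for the graded dataframe, with a named datetime index.
frame = pd.DataFrame(
    {"Overall": [5, 8], "Weather": ["Bad", "Good"]},
    index=pd.to_datetime(["2012-10-01 13:00", "2012-10-01 14:00"]),
)
frame.index.name = "datetime"

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# Read back, restoring the datetime index.
restored = pd.read_csv(buffer, index_col="datetime", parse_dates=True)
print(restored)
```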