# Rain in Australia
## Predict rain tomorrow in Australia

Before we can start with our analysis, we have to check and prepare our data. This is a crucial task and is necessary to prevent data-driven mistakes.

The dataset is hosted by Kaggle: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

Observations were drawn from numerous weather stations. The daily observations are available from http://www.bom.gov.au/climate/data.   
Copyright Commonwealth of Australia 2010, Bureau of Meteorology.

Definitions adapted from http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml

Note: You should exclude the variable RISK_MM when training a binary classification model.   
Not excluding it will leak the answers to your model and make its measured performance misleadingly high.
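
The leakage can be illustrated with a minimal sketch. It assumes that the target is 'Yes' exactly when RISK_MM exceeds 1 mm, which is an assumption here and not something stated above. The toy values are made up for illustration only.

```python
import pandas as pd

# Toy frame with made-up values; not taken from the real dataset.
toy = pd.DataFrame({
    "Humidity3pm": [22.0, 25.0, 80.0, 91.0, 40.0],
    "RISK_MM": [0.0, 0.4, 3.2, 12.0, 1.0],
    "RainTomorrow": ["No", "No", "Yes", "Yes", "No"],
})

# Assumption: RainTomorrow is 'Yes' exactly when RISK_MM exceeds 1 mm.
rule = (toy["RISK_MM"] > 1.0).map({True: "Yes", False: "No"})

# A "model" that only reads RISK_MM reproduces the target without learning anything.
leak_accuracy = (rule == toy["RainTomorrow"]).mean()
print("Accuracy of the leaking rule:", leak_accuracy)

# Drop the leaking column and the target before building the feature matrix.
features = toy.drop(columns=["RISK_MM", "RainTomorrow"])
print(list(features.columns))
```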


In [2]:
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
from tensorflow import keras
from keras.utils import np_utils
from keras import optimizers
import numpy as np
import pandas as pd

Using TensorFlow backend.


In [3]:
data_path = 'data\\weatherAUS.csv'
df = pd.read_csv(data_path, delimiter=",")
df.drop(['RISK_MM'], axis=1, inplace=True)

print('Shape of the dataset: {}'.format(df.shape))
print('Preview of dataset:')
print(df.head(5))

Shape of the dataset: (142193, 23)
Preview of dataset:
         Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  \
0  2008-12-01   Albury     13.4     22.9       0.6          NaN       NaN   
1  2008-12-02   Albury      7.4     25.1       0.0          NaN       NaN   
2  2008-12-03   Albury     12.9     25.7       0.0          NaN       NaN   
3  2008-12-04   Albury      9.2     28.0       0.0          NaN       NaN   
4  2008-12-05   Albury     17.5     32.3       1.0          NaN       NaN   

  WindGustDir  WindGustSpeed WindDir9am  ... Humidity9am  Humidity3pm  \
0           W           44.0          W  ...        71.0         22.0   
1         WNW           44.0        NNW  ...        44.0         25.0   
2         WSW           46.0          W  ...        38.0         30.0   
3          NE           24.0         SE  ...        45.0         16.0   
4           W           41.0        ENE  ...        82.0         33.0   

   Pressure9am  Pressure3pm  Cloud9am  Clou

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142193 entries, 0 to 142192
Data columns (total 23 columns):
Date             142193 non-null object
Location         142193 non-null object
MinTemp          141556 non-null float64
MaxTemp          141871 non-null float64
Rainfall         140787 non-null float64
Evaporation      81350 non-null float64
Sunshine         74377 non-null float64
WindGustDir      132863 non-null object
WindGustSpeed    132923 non-null float64
WindDir9am       132180 non-null object
WindDir3pm       138415 non-null object
WindSpeed9am     140845 non-null float64
WindSpeed3pm     139563 non-null float64
Humidity9am      140419 non-null float64
Humidity3pm      138583 non-null float64
Pressure9am      128179 non-null float64
Pressure3pm      128212 non-null float64
Cloud9am         88536 non-null float64
Cloud3pm         85099 non-null float64
Temp9am          141289 non-null float64
Temp3pm          139467 non-null float64
RainToday        140787 non-null obje

As we can see, there are several categorical and numerical features.   
Furthermore, there are missing values in almost all features, which we will have to deal with later on.  
Let's start by sorting the features into two lists.

In [5]:
cat_feature_list = [var for var in df.columns if df[var].dtype=='O']
num_feature_list = [var for var in df.columns if df[var].dtype=='float64']
print('There are {} categorical and {} numerical variables.\n'.format(len(cat_feature_list), len(num_feature_list)))
print('The categorical features are: \n{}\n'.format(cat_feature_list))
print('The numerical features are: \n{}'.format(num_feature_list))

There are 7 categorical and 16 numerical variables.

The categorical features are: 
['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']

The numerical features are: 
['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm']
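
To see how severe the missing values are per feature, the split by type can be combined with the share of missing entries. The following is only a sketch on a toy frame; the helper name `summarize_features` and the toy values are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def summarize_features(frame):
    """Return one row per column: its kind and its share of missing values."""
    kinds = {
        name: "numerical" if pd.api.types.is_numeric_dtype(frame[name]) else "categorical"
        for name in frame.columns
    }
    summary = pd.DataFrame({
        "kind": pd.Series(kinds),
        "missing_share": frame.isna().mean(),
    })
    # Features with the most missing values come first.
    return summary.sort_values("missing_share", ascending=False)


# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame({
    "Location": ["Albury", "Albury", "Cobar", "Cobar"],
    "MinTemp": [13.4, np.nan, 12.9, 9.2],
    "Sunshine": [np.nan, np.nan, np.nan, 8.0],
})
summary = summarize_features(toy)
print(summary)
```

Applied to the real data, the same call would be `summarize_features(df)`.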


First of all, we should clarify the features, measurements and units.

* __Date__: The date of the observation.
* __Location__: The common name of the location of the weather station.
* __MinTemp__ and __MaxTemp__: Minimum and maximum temperatures on that day in degrees Celsius.
* __Rainfall__: The amount of rainfall recorded for the day in mm (equivalent to liters per square meter).
* __Evaporation__: The so-called Class A pan evaporation (mm) in the 24 hours to 9am.
* __Sunshine__: The number of hours of bright sunshine in the day.
* __WindGustDir__: The direction of the strongest wind gust in the 24 hours to midnight.
* __WindGustSpeed__: The speed (km/h) of the strongest wind gust in the 24 hours to midnight.
* __WindDir9am__ and __WindDir3pm__: Direction of the wind at 9am and 3pm.
* __WindSpeed9am__ and __WindSpeed3pm__: Wind speed (km/h) averaged over 10 minutes prior to 9am and 3pm.
* __Humidity9am__ and __Humidity3pm__: Humidity (percent) at 9am and 3pm.
* __Pressure9am__ and __Pressure3pm__: Atmospheric pressure (hPa) reduced to mean sea level at 9am and 3pm.
* __Cloud9am__ and __Cloud3pm__: Fraction of sky obscured by cloud at 9am and 3pm. This is measured in "oktas", which are a unit of eighths: the value records how many eighths of the sky are obscured by cloud. A measure of 0 indicates a completely clear sky, whilst an 8 indicates that it is completely overcast.
* __Temp9am__ and __Temp3pm__: Temperature at 9am and 3pm in degrees Celsius.
* __RainToday__: Boolean: 'Yes' if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 'No'.
* __RainTomorrow__: The target variable.
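
The definition of __RainToday__ can be turned into a small consistency check against __Rainfall__. This is a sketch on a toy frame; the helper name `rain_today_from_rainfall` is hypothetical, and the check assumes that both columns refer to the same 24-hour window.

```python
import numpy as np
import pandas as pd


def rain_today_from_rainfall(rainfall, threshold_mm=1.0):
    """Derive 'Yes'/'No' from rainfall in mm; missing rainfall stays missing."""
    derived = pd.Series(
        np.where(rainfall > threshold_mm, "Yes", "No"),
        index=rainfall.index,
        dtype="object",
    )
    return derived.where(rainfall.notna())


# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame({
    "Rainfall": [0.6, 0.0, 1.0, 1.2, np.nan],
    "RainToday": ["No", "No", "No", "Yes", np.nan],
})
derived = rain_today_from_rainfall(toy["Rainfall"])

# Count rows where both values are present but disagree.
both_present = derived.notna() & toy["RainToday"].notna()
mismatch = both_present & (derived != toy["RainToday"])
print("Inconsistent rows:", int(mismatch.sum()))
```

Note that exactly 1.0 mm counts as 'No', because the definition says the precipitation has to exceed 1 mm.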


Now we take a closer look at the possible values for the categorical features.

In [14]:
for cat in cat_feature_list:
    print(df[cat].unique())
    print("\n There are a total of {} unique values for the feature \'{}\'.\n".format(len(df[cat].unique()), cat))

['2008-12-01' '2008-12-02' '2008-12-03' ... '2008-01-29' '2008-01-30'
 '2008-01-31']

 There are a total of 3436 unique values for the feature 'Date'.

['Albury' 'BadgerysCreek' 'Cobar' 'CoffsHarbour' 'Moree' 'Newcastle'
 'NorahHead' 'NorfolkIsland' 'Penrith' 'Richmond' 'Sydney' 'SydneyAirport'
 'WaggaWagga' 'Williamtown' 'Wollongong' 'Canberra' 'Tuggeranong'
 'MountGinini' 'Ballarat' 'Bendigo' 'Sale' 'MelbourneAirport' 'Melbourne'
 'Mildura' 'Nhil' 'Portland' 'Watsonia' 'Dartmoor' 'Brisbane' 'Cairns'
 'GoldCoast' 'Townsville' 'Adelaide' 'MountGambier' 'Nuriootpa' 'Woomera'
 'Albany' 'Witchcliffe' 'PearceRAAF' 'PerthAirport' 'Perth' 'SalmonGums'
 'Walpole' 'Hobart' 'Launceston' 'AliceSprings' 'Darwin' 'Katherine'
 'Uluru']

 There are a total of 49 unique values for the feature 'Location'.

['W' 'WNW' 'WSW' 'NE' 'NNW' 'N' 'NNE' 'SW' 'ENE' 'SSE' 'S' 'NW' 'SE' 'ESE'
 nan 'E' 'SSW']

 There are a total of 17 unique values for the feature 'WindGustDir'.

['W' 'NNW' 'SE' 'ENE' 'SW' 'SSE' 'S

Ignoring the *nan* values, we can summarize that:  
* __Dates__ are denoted as *YYYY-MM-DD*. The record covers 3436 distinct dates.
* In __Location__, we have a list of all the locations of 49 weather stations. The data was recorded in Australia.  
* It appears that the same notation is used for the direction of the wind in __WindGustDir__, __WindDir9am__ and __WindDir3pm__. There are a total of $16$ directions with a separation of $22.5^\circ$, starting from north (N).
* Finally, we have two binary features, __RainToday__ and __RainTomorrow__, with the latter being our target variable for the classification we are about to do.

Luckily, there doesn't seem to be an error or typo here.
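
The 16-point wind notation can be mapped to angles. The sketch below assumes the usual compass convention of counting clockwise from north; the constant names are hypothetical, and the notebook itself does not prescribe this encoding.

```python
import pandas as pd

# The 16 compass points, clockwise, starting from north.
COMPASS_POINTS = [
    "N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
    "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW",
]
DEGREES_PER_POINT = 360 / len(COMPASS_POINTS)  # 22.5 degrees
DIRECTION_TO_DEGREES = {
    point: index * DEGREES_PER_POINT for index, point in enumerate(COMPASS_POINTS)
}

# Toy series, for illustration only; missing directions stay missing.
toy = pd.Series(["W", "WNW", "NE", None, "SSW"])
degrees = toy.map(DIRECTION_TO_DEGREES)
print(degrees.tolist())
```

Any label outside the 16 known points would also map to a missing value, which makes typos easy to spot.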

We can now take a look at the numerical features.

In [7]:
df.describe().T

| Feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MinTemp | 141556.0 | 12.1864 | 6.403283 | -8.5 | 7.6 | 12.0 | 16.8 | 33.9 |
| MaxTemp | 141871.0 | 23.226784 | 7.117618 | -4.8 | 17.9 | 22.6 | 28.2 | 48.1 |
| Rainfall | 140787.0 | 2.349974 | 8.465173 | 0.0 | 0.0 | 0.0 | 0.8 | 371.0 |
| Evaporation | 81350.0 | 5.469824 | 4.188537 | 0.0 | 2.6 | 4.8 | 7.4 | 145.0 |
| Sunshine | 74377.0 | 7.624853 | 3.781525 | 0.0 | 4.9 | 8.5 | 10.6 | 14.5 |
| WindGustSpeed | 132923.0 | 39.984292 | 13.588801 | 6.0 | 31.0 | 39.0 | 48.0 | 135.0 |
| WindSpeed9am | 140845.0 | 14.001988 | 8.893337 | 0.0 | 7.0 | 13.0 | 19.0 | 130.0 |
| WindSpeed3pm | 139563.0 | 18.637576 | 8.803345 | 0.0 | 13.0 | 19.0 | 24.0 | 87.0 |
| Humidity9am | 140419.0 | 68.84381 | 19.051293 | 0.0 | 57.0 | 70.0 | 83.0 | 100.0 |
| Humidity3pm | 138583.0 | 51.482606 | 20.797772 | 0.0 | 37.0 | 52.0 | 66.0 | 100.0 |


Comparing the ranges of the four temperature features __MinTemp__, __MaxTemp__, __Temp9am__ and __Temp3pm__, it indeed seems reasonable that the temperature is measured in degrees Celsius. Furthermore, the highest temperatures recorded in __Temp9am__ and __Temp3pm__ do not exceed the highest value in __MaxTemp__, and the lowest do not fall below the lowest value in __MinTemp__, which makes the features consistent.
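
The comparison of the temperature extremes can be written as a short check. This is a sketch on a toy frame; the helper name `extremes_are_consistent` is hypothetical and the values of `Temp9am` and `Temp3pm` are made up for illustration.

```python
import pandas as pd


def extremes_are_consistent(frame):
    """Check that the 9am/3pm temperatures stay within the overall daily extremes."""
    spot = frame[["Temp9am", "Temp3pm"]]
    highest_spot = spot.max().max()
    lowest_spot = spot.min().min()
    return bool(
        highest_spot <= frame["MaxTemp"].max()
        and lowest_spot >= frame["MinTemp"].min()
    )


# Toy frame; the Temp9am and Temp3pm values are made up.
toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9],
    "MaxTemp": [22.9, 25.1, 25.7],
    "Temp9am": [16.9, 17.2, 21.0],
    "Temp3pm": [21.8, 24.3, 23.2],
})
consistent = extremes_are_consistent(toy)

# A 3pm reading above every recorded daily maximum breaks the consistency.
broken = toy.assign(Temp3pm=[21.8, 30.0, 23.2])
broken_consistent = extremes_are_consistent(broken)
print(consistent, broken_consistent)
```

This only compares the overall extremes, as in the text above; a stricter variant would compare the values row by row.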

As for __Rainfall__, we can immediately see that most recorded days are dry or nearly dry: the median is $0$, and for $75\,\%$ of the recorded days it rained $0.8\,\frac{\text{l}}{\text{m}^{2}}$ or less. Nevertheless, it can rain quite a lot, with up to $371\,\frac{\text{l}}{\text{m}^{2}}$ in a single day, which is basically a flood. This value may be considered an outlier.
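
The reading of the quartiles can be reproduced by counting the share of days at or below a given amount of rain. The sketch below uses a toy series with made-up values; the helper name `share_at_or_below` is hypothetical.

```python
import numpy as np
import pandas as pd


def share_at_or_below(values, limit):
    """Share of non-missing observations that are less than or equal to limit."""
    valid = values.dropna()
    return float((valid <= limit).mean())


# Toy rainfall series in mm, made up for illustration only.
toy_rainfall = pd.Series([0.0, 0.0, 0.0, 0.0, 0.2, 0.8, 5.0, 371.0, np.nan])
share_low = share_at_or_below(toy_rainfall, 0.8)
largest = toy_rainfall.max()
print(share_low, largest)
```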

__Evaporation__ is a troublesome feature, as it is measured with a so-called *__Class-A evaporation pan__*, which can overflow on rainy days or on events of intense rainfall. The dimensions used for the different stations are not given in detail. Hence this feature is of limited use and should be handled with great care.

The range of the wind speed features __WindGustSpeed__, __WindSpeed9am__ and __WindSpeed3pm__ seems reasonable, with the extreme values above $100\,\frac{\text{km}}{\text{h}}$ corresponding to severe storms such as tropical cyclones, as hurricanes are called in the Australian region.

While the minimum of $0$ hours of sunshine may sound surprising for Australia at first, it makes sense on very cloudy days (probably also during wintertime). The average and maximum values seem reasonable.

The __Humidity__ features range between $0\,\%$ and $100\,\%$, which is perfectly reasonable.

With $1\,$atm as $1013.25\,$hPa, the measured atmospheric pressure seems to be totally normal. There don't seem to be any outliers.

Jumping to __Cloud9am__ and __Cloud3pm__, we actually have an additional value in the range: a value of $9$ indicates that the sky is obstructed from view. Here we should further check whether we only have integer values!



Based on the data, the weather seems to be pretty *normal* for the most part, with an occasional extreme weather phenomenon. We have to think about how we will deal with these records. __Evaporation__ may not be trustworthy to use as a feature.

In [19]:
print(df['Cloud9am'].unique())
print(df['Cloud3pm'].unique())

[ 8. nan  7.  1.  0.  5.  4.  2.  6.  3.  9.]
[nan  2.  8.  7.  1.  5.  4.  6.  3.  0.  9.]


Ignoring the nan values, the cloud features indeed contain only integers from 0 to 9.
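
The same check can be made explicit instead of reading it off the printed values. This is a sketch on a toy series; the helper name `okta_report` is hypothetical.

```python
import numpy as np
import pandas as pd


def okta_report(values):
    """Summarize whether cloud values are integers in 0..9 and how often 9 occurs."""
    valid = values.dropna()
    return {
        "all_integers": bool((valid % 1 == 0).all()),
        "within_0_to_9": bool(valid.between(0, 9).all()),
        "sky_obscured_count": int((valid == 9).sum()),
    }


# Toy cloud series, made up for illustration only.
toy_cloud = pd.Series([8.0, np.nan, 7.0, 1.0, 0.0, 9.0])
report = okta_report(toy_cloud)
print(report)
```

Counting the 9s separately is useful because that value marks an obstructed view and not a cloud fraction, so it may need different handling later.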