# Australia Weather Forecast

The dataset contains daily weather observations from numerous Australian weather stations.

- **Goal:** predict whether it will rain the next day (`RainTomorrow`)
- **Metric:** Accuracy
- **Libraries:** numpy, pandas, sklearn, matplotlib
- **Hardware:** CPU only
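
Since accuracy is the only metric, it helps to know what a trivial model would score. The following is a minimal sketch in plain Python, using made-up labels rather than the weather data; `accuracy` and `majority_baseline` are hypothetical helper names, not part of the notebook.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy of always predicting the most frequent label."""
    most_common_label, _ = Counter(y_true).most_common(1)[0]
    return accuracy(y_true, [most_common_label] * len(y_true))


# Hypothetical labels: 8 dry days and 2 rainy days
y_true = ["No"] * 8 + ["Yes"] * 2
y_pred = ["No"] * 7 + ["Yes"] + ["Yes", "No"]

print("Model accuracy:   ", accuracy(y_true, y_pred))
print("Baseline accuracy:", majority_baseline(y_true))
```

In this toy case both numbers are 0.8, which illustrates why an accuracy score on imbalanced labels is best read next to the majority-class baseline.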

In [2]:
# Importing libraries
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Data preprocessing
from sklearn.preprocessing import Imputer

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Hiding unnecessary warnings to keep the output readable
import warnings
warnings.filterwarnings('ignore')


In [3]:
df = pd.read_csv("weather_austria.csv")

In [4]:
df.head(5)

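| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RISK_MM | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | 0.0 | No |
| 1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | 0.0 | No |
| 2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | 0.0 | No |
| 3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | 1.0 | No |
| 4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | 0.2 | No |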


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 24 columns):
Date             145460 non-null object
Location         145460 non-null object
MinTemp          143975 non-null float64
MaxTemp          144199 non-null float64
Rainfall         142199 non-null float64
Evaporation      82670 non-null float64
Sunshine         75625 non-null float64
WindGustDir      135134 non-null object
WindGustSpeed    135197 non-null float64
WindDir9am       134894 non-null object
WindDir3pm       141232 non-null object
WindSpeed9am     143693 non-null float64
WindSpeed3pm     142398 non-null float64
Humidity9am      142806 non-null float64
Humidity3pm      140953 non-null float64
Pressure9am      130395 non-null float64
Pressure3pm      130432 non-null float64
Cloud9am         89572 non-null float64
Cloud3pm         86102 non-null float64
Temp9am          143693 non-null float64
Temp3pm          141851 non-null float64
RainToday        142199 non-null object

In [6]:
# Exploring missing values
def missing_data_overview(num_of_rows=50):
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (df.isnull().sum()/df.isnull().count() * 100).sort_values(ascending=False)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    print(missing_data.head(num_of_rows))

missing_data_overview()

               Total    Percent
Sunshine       69835  48.009762
Evaporation    62790  43.166506
Cloud3pm       59358  40.807095
Cloud9am       55888  38.421559
Pressure9am    15065  10.356799
Pressure3pm    15028  10.331363
WindDir9am     10566   7.263853
WindGustDir    10326   7.098859
WindGustSpeed  10263   7.055548
Humidity3pm     4507   3.098446
WindDir3pm      4228   2.906641
Temp3pm         3609   2.481094
RISK_MM         3267   2.245978
RainTomorrow    3267   2.245978
Rainfall        3261   2.241853
RainToday       3261   2.241853
WindSpeed3pm    3062   2.105046
Humidity9am     2654   1.824557
WindSpeed9am    1767   1.214767
Temp9am         1767   1.214767
MinTemp         1485   1.020899
MaxTemp         1261   0.866905
Location           0   0.000000
Date               0   0.000000
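
A variant of the overview above that takes the dataframe as an argument and returns the table makes it easy to flag sparse columns programmatically. This is a sketch under assumptions: `missing_summary`, the toy frame, and the 30% cut-off are illustrative choices, not values taken from the notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return per-column missing counts and percentages, worst first."""
    total = frame.isnull().sum()
    percent = total / len(frame) * 100
    summary = pd.DataFrame({"Total": total, "Percent": percent})
    return summary.sort_values("Percent", ascending=False)


# Hypothetical toy frame standing in for the weather data
toy = pd.DataFrame({
    "Sunshine": [np.nan, np.nan, 7.5, np.nan],
    "Humidity3pm": [22.0, np.nan, 30.0, 16.0],
    "Location": ["Albury", "Albury", "Albury", "Albury"],
})

summary = missing_summary(toy)
print(summary)

# Assumed cut-off: flag columns with more than 30% missing values
sparse_columns = summary.index[summary["Percent"] > 30].tolist()
print("Sparse columns:", sparse_columns)
```

Returning the table instead of printing it also avoids relying on a global `df`, so the same helper can be reused after each preprocessing step.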


### Feature Selection/Exploration


In [6]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# Copy dataframe for exploration
df_tmp = df.copy()
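# Transforming yes/no values into numerical values (assigning the result
# avoids the chained inplace replace, which newer pandas versions do not apply)
df_tmp['RainToday'] = df_tmp['RainToday'].map({'No': 0, 'Yes': 1})
df_tmp['RainTomorrow'] = df_tmp['RainTomorrow'].map({'No': 0, 'Yes': 1})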

# One-hot encode data
df_tmp = pd.get_dummies(df_tmp, columns=['WindGustDir','WindDir9am', 'WindDir3pm'])

df_tmp = df_tmp.dropna(how='any')


y = df_tmp['RainTomorrow']  # 1-D target, as scikit-learn expects
X = df_tmp.drop(['Date', 'Location', 'Sunshine', 'Evaporation', 'RainTomorrow', 'RISK_MM'], axis=1)

# Create a SelectKBest object to select the four features with the best ANOVA F-values
fvalue_selector = SelectKBest(f_classif, k=4)

# Apply the SelectKBest object to the features and target
X_kbest = fvalue_selector.fit_transform(X, y)

# Show results
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_kbest.shape[1])

# Boolean array of best features - True:= within k best features
mask = fvalue_selector.get_support()

# Storing column names of k best features
new_features = X.columns[mask]
print("K best features are:", new_features)

Original number of features: 63
Reduced number of features: 4
K best features are: Index(['Humidity3pm', 'Cloud9am', 'Cloud3pm', 'RainToday'], dtype='object')


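*Humidity3pm* and *RainToday* are the most practical of these four features: *Cloud9am* and *Cloud3pm* are each missing roughly 40% of their values in the original dataframe, so using them would mean dropping or imputing a large share of the rows.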
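
To judge how clearly the selected features stand out, the F-scores behind the selection can be ranked directly. The sketch below uses synthetic data, not the weather dataset; the column names `informative`, `noise_a`, and `noise_b` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: only "informative" is related to the target
target = rng.integers(0, 2, size=n)
features = pd.DataFrame({
    "informative": target * 2.0 + rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

selector = SelectKBest(f_classif, k=1).fit(features, target)

# F-score per feature, highest first
scores = pd.Series(selector.scores_, index=features.columns)
scores = scores.sort_values(ascending=False)
print(scores)
```

Looking at `scores_` rather than only the boolean mask shows whether the features just below the cut-off are close runners-up or clearly weaker.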

### Feature Engineering/Transformation

In [7]:
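# transforming yes/no values into numerical values (assigning the result
# avoids the chained inplace replace, which newer pandas versions do not apply)
df['RainToday'] = df['RainToday'].map({'No': 0, 'Yes': 1})
df['RainTomorrow'] = df['RainTomorrow'].map({'No': 0, 'Yes': 1})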

In [8]:
# Filling missing values of the "Humidity3pm" column with the column mean
# (SimpleImputer replaces the Imputer class, which was removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer

mean_imputer = SimpleImputer(strategy='mean')
df['Humidity3pm'] = mean_imputer.fit_transform(df[['Humidity3pm']]).ravel()
# Dropping rows with missing values in the "RainTomorrow" and "RainToday" columns
df = df.dropna(subset=['RainTomorrow', 'RainToday'])
df = pd.get_dummies(df, columns=['WindGustDir','WindDir9am', 'WindDir3pm'])

In [1]:
# missing_data_overview prints the table itself (it returns None)
missing_data_overview(10)

In [10]:
# selecting the features that are most valuable for our prediction
df = df[['Humidity3pm', 'Rainfall', 'RainToday', 'RainTomorrow']]
X = df[['Humidity3pm', 'RainToday']]  # two features: Humidity3pm and RainToday
y = df['RainTomorrow']

X.head()

| | Humidity3pm | RainToday |
|---|---|---|
| 0 | 22.0 | 0.0 |
| 1 | 25.0 | 0.0 |
| 2 | 30.0 | 0.0 |
| 3 | 16.0 | 0.0 |
| 4 | 33.0 | 0.0 |



# Model Selection

#### Comparing different models:
- Decision Trees
- Random Forest
- Support Vector Machine
- AdaBoost
- Logistic Regression

In [11]:
# splitting the data into a training and test set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25)
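
Because rainy days are the minority class, a seeded, stratified split keeps the class balance the same in both parts and makes runs repeatable. The sketch below uses synthetic data; `X_demo`, `y_demo`, the seed, and the roughly 22% positive rate are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, about 22% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2))
y_demo = (rng.random(1000) < 0.22).astype(int)

# stratify keeps the class ratio equal in both splits; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print("Positive share, train:", y_tr.mean())
print("Positive share, test: ", y_te.mean())
```

With a fixed `random_state`, accuracy differences between models come from the models rather than from a different shuffle on each run.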


### Decision Tree Classifier


In [12]:
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(criterion='gini', random_state=0)

t0=time.time()
dt_model = dt_clf.fit(X_train, y_train)
print("Training Time:", time.time()-t0)

t0=time.time()
y_pred = dt_model.predict(X_test)
print("Prediction Time:", time.time()-t0)

dt_acc = accuracy_score(y_test,y_pred)
print("---------------------------------------")
print("Decision Tree accuracy:", dt_acc)

Training Time: 0.034882545471191406
Prediction Time: 0.002991914749145508
---------------------------------------
Decision Tree accuracy: 0.8298434525669801


### Random Forest Classifier

In [13]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(criterion='entropy', random_state=0, n_jobs=-1)

t0=time.time()
rf_model = rf_clf.fit(X_train, y_train)
print("Training Time:", time.time()-t0)

t0=time.time()
y_pred = rf_model.predict(X_test)
print("Prediction Time:", time.time()-t0)

rf_acc = accuracy_score(y_test, y_pred)
print("---------------------------------------")
print("Random Forest accuracy:", rf_acc)

Training Time: 0.11668825149536133
Prediction Time: 0.10491585731506348
---------------------------------------
Random Forest accuracy: 0.829587748955877


### Support Vector Machine

In [14]:
from sklearn.svm import LinearSVC

# Fit the scaler on the training split only and reuse it for the test split
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
svc_clf = LinearSVC(C=1.0)

t0=time.time()
svc_model = svc_clf.fit(X_train_std, np.ravel(y_train))
print("Training Time:", time.time()-t0)

t0=time.time()
y_pred = svc_model.predict(X_test_std)
print("Prediction Time:", time.time()-t0)

svc_acc = accuracy_score(y_test, y_pred)
print("---------------------------------------")
print("Support Vector Machine accuracy:", svc_acc)


### AdaBoost Classifier

In [15]:
from sklearn.ensemble import AdaBoostClassifier

ab_clf = AdaBoostClassifier(n_estimators=50,
                         learning_rate=1,
                         random_state=0)

t0=time.time()
# Fit on the training split only so the test set stays unseen
ab_model = ab_clf.fit(X_train, np.ravel(y_train))
print("Training Time:", time.time()-t0)

t0=time.time()
y_pred = ab_model.predict(X_test)
print("Prediction Time:", time.time()-t0)

ab_acc = accuracy_score(y_test, y_pred)
print("---------------------------------------")
print("AdaBoost accuracy:", ab_acc)


### Logistic Regression Classifier

In [17]:
from sklearn.linear_model import LogisticRegression

# Fit the scaler on the training split only and reuse it for the test split
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

lr_clf = LogisticRegression(random_state=0)

t0=time.time()
lr_model = lr_clf.fit(X_train_std, np.ravel(y_train))
print("Training Time:", time.time()-t0)

t0=time.time()
y_pred = lr_model.predict(X_test_std)
print("Prediction Time:", time.time()-t0)

lr_acc = accuracy_score(y_test, y_pred)
print("---------------------------------------")
print("Logistic Regression accuracy:", lr_acc)


### Conclusion

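Decision tree, random forest, and AdaBoost all reach an accuracy of about 0.83 on the two selected features, so the gap between them is too small to call one model significantly better. AdaBoost scored marginally highest but was the slowest of the three to train and to predict, and its score is only comparable to the others when it is fitted on the training split rather than on the full dataset.

The accuracy of about 0.22 that I got for the linear SVM and logistic regression most likely came from an evaluation mismatch rather than from weak models: fitting on standardized features and then predicting on the unscaled test set pushes nearly every prediction to "rain", and 0.22 is roughly the share of rainy days in the data. Both models need to be scored on a test set transformed with the same scaler before they can be compared with the others.

Considering the current technical setup, I would pick the AdaBoost classifier, followed by the decision tree classifier.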
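
One way to rule out scaling mismatches for the linear models is to bundle the scaler and the classifier in a single scikit-learn pipeline. The following is a sketch on synthetic data, not the notebook's result: the humidity-like feature, the noise level, and the 75 threshold are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a humidity-like feature on a 0-100 scale
rng = np.random.default_rng(0)
humidity = rng.uniform(0, 100, size=2000)
rain = (humidity + rng.normal(scale=15, size=2000) > 75).astype(int)
X_demo = humidity.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, rain, test_size=0.25, random_state=0, stratify=rain
)

# The pipeline fits the scaler on the training split and reuses it at predict time
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
acc_consistent = model.score(X_te, y_te)

# Reproducing the mismatch: classifier fitted on scaled data, fed unscaled inputs
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
acc_mismatch = clf.score(X_te, y_te)

print("Consistent scaling:", acc_consistent)
print("Scaling mismatch:  ", acc_mismatch)
```

On this toy data the mismatched variant collapses towards the share of positive labels, while the pipeline keeps preprocessing identical for `fit` and `predict`.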