## Rain Prediction in Australia
The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/).

The dataset used here has extra columns such as 'RainToday' and our target, 'RainTomorrow'. It was gathered from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData).

This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | float  |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RISK_MM       | Amount of rain tomorrow                               | Millimeters     | float  |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml).

#### Importing the libraries

In [45]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import r2_score
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics
import requests

#### Downloading and loading the Dataset into Pandas DataFrame

In [2]:
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'

In [5]:
url = path
response = requests.get(url)

with open("Weather_Data.csv", "wb") as file:
    file.write(response.content)

In [2]:
data = pd.read_csv('Weather_Data.csv')
data.head()

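|   | Date | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
| - | ---- | ------- | ------- | -------- | ----------- | -------- | ----------- | ------------- | ---------- | ---------- | --- | ----------- | ----------- | ----------- | ----------- | -------- | -------- | ------- | ------- | --------- | ------------ |
| 0 | 2/1/2008 | 19.5 | 22.4 | 15.6 | 6.2 | 0.0 | W | 41 | S | SSW | ... | 92 | 84 | 1017.6 | 1017.4 | 8 | 8 | 20.7 | 20.9 | Yes | Yes |
| 1 | 2/2/2008 | 19.5 | 25.6 | 6.0 | 3.4 | 2.7 | W | 41 | W | E | ... | 83 | 73 | 1017.9 | 1016.4 | 7 | 7 | 22.4 | 24.8 | Yes | Yes |
| 2 | 2/3/2008 | 21.6 | 24.5 | 6.6 | 2.4 | 0.1 | W | 41 | ESE | ESE | ... | 88 | 86 | 1016.7 | 1015.6 | 7 | 8 | 23.5 | 23.0 | Yes | Yes |
| 3 | 2/4/2008 | 20.2 | 22.8 | 18.8 | 2.2 | 0.0 | W | 41 | NNE | E | ... | 83 | 90 | 1014.2 | 1011.8 | 8 | 8 | 21.4 | 20.9 | Yes | Yes |
| 4 | 2/5/2008 | 19.7 | 25.7 | 77.4 | 4.8 | 0.0 | W | 41 | NNE | W | ... | 88 | 74 | 1008.3 | 1004.8 | 8 | 8 | 22.5 | 25.5 | Yes | Yes |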


#### Data Preprocessing
First, we need to perform one-hot encoding to convert the categorical variables into binary indicator variables.

In [3]:
data_processed = pd.get_dummies(data=data, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
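
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a three-row sample. The `sample` DataFrame and its values are made up for illustration and are not taken from the dataset; only the column names mirror it.

```python
import pandas as pd

# Hypothetical three-row sample; values are illustrative only.
sample = pd.DataFrame({
    "RainToday": ["Yes", "No", "Yes"],
    "WindGustDir": ["W", "NNE", "W"],
    "MinTemp": [19.5, 20.2, 21.6],
})

# Each listed categorical column is replaced by one indicator column per category.
encoded = pd.get_dummies(data=sample, columns=["RainToday", "WindGustDir"])
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

Columns that are not listed (here `MinTemp`) are kept unchanged, and each encoded column is replaced by indicator columns named `<column>_<category>`.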

Next, we replace the values of the 'RainTomorrow' column, changing it from a categorical column to a binary one. We do not use the get_dummies method here because it would produce two columns for 'RainTomorrow', which we do not want since 'RainTomorrow' is our target.

In [4]:
data_processed.replace(['Yes', 'No'], [1,0], inplace=True)
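
The difference between the two approaches can be sketched on a small made-up Series. The `target` values below are hypothetical, and `Series.map` is used here as a stand-in for the notebook's `replace` call; on a Yes/No column it should give the same 1/0 result.

```python
import pandas as pd

# Hypothetical target values, for illustration only.
target = pd.Series(["Yes", "No", "No", "Yes"], name="RainTomorrow")

# One-hot encoding the target yields two complementary columns.
as_dummies = pd.get_dummies(target, prefix="RainTomorrow")

# Mapping Yes/No to 1/0 keeps a single binary column, which is what a classifier expects for y.
as_binary = target.map({"Yes": 1, "No": 0})

print(list(as_dummies.columns))
print(as_binary.tolist())
```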

In [5]:
data_processed.drop('Date', axis=1, inplace=True)

In [6]:
data_processed = data_processed.astype(float)

In [8]:
X = data_processed.drop(columns='RainTomorrow')
y = data_processed['RainTomorrow']

#### Train/Test Split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
print('Training set:', X_train.shape, y_train.shape)
print('Testing set:', X_test.shape, y_test.shape)

Training set: (2616, 66) (2616,)
Testing set: (655, 66) (655,)
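
The shapes printed above follow from the row count: 2616 + 655 = 3271 rows, and with `test_size=0.2` scikit-learn rounds the test share up (0.2 × 3271 = 654.2, so 655 test rows). A minimal sketch that reproduces these sizes on placeholder arrays of zeros (`X_demo` and `y_demo` are stand-ins, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same dimensions as the notebook's X and y.
n_rows, n_features = 3271, 66
X_demo = np.zeros((n_rows, n_features))
y_demo = np.zeros(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10
)
print(X_tr.shape, X_te.shape)
```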


#### Using Linear Regression

In [17]:
LinearRegr = LinearRegression()
# Fit directly on the training split so that X and y are not overwritten
LinearRegr.fit(X_train, y_train)

In [18]:
# The coefficients and intercept
print('Coefficient: ', LinearRegr.coef_[0:5])
print('Intercept: ', LinearRegr.intercept_)

Coefficient:  [-0.02368958  0.01300427  0.00073143  0.0064882  -0.03515687]
Intercept:  -50245867801.01952


#### Prediction and Evaluation

In [20]:
prediction = LinearRegr.predict(X_test)
# The mean of the squared residuals is the mean squared error (MSE)
print("Mean squared error: %.2f"
      % np.mean((prediction - y_test) ** 2))

# score() returns R^2 (coefficient of determination): 1 is perfect prediction
print('R2 score: %.2f' % LinearRegr.score(X_test, y_test))

Mean squared error: 0.12
R2 score: 0.43




In [23]:
LinearRegression_MAE = np.mean(np.absolute(prediction - y_test))
LinearRegression_MSE = np.mean((prediction - y_test) ** 2)
LinearRegression_R2 = r2_score(y_test, prediction)
print("Mean absolute error: %.2f" % LinearRegression_MAE)
print("Mean squared error (MSE): %.2f" % LinearRegression_MSE)
print("R2-score: %.2f" % LinearRegression_R2)

Mean absolute error: 0.26
Mean squared error (MSE): 0.12
R2-score: 0.43
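
For reference, the three regression metrics can be computed by hand and checked against scikit-learn. The sketch below uses four made-up values (`y_true` and `y_hat` are hypothetical, not model output) so that the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([0.0, 1.0, 0.0, 1.0])
y_hat = np.array([0.2, 0.7, 0.1, 0.4])

mae = np.mean(np.abs(y_hat - y_true))            # (0.2 + 0.3 + 0.1 + 0.6) / 4 = 0.3
mse = np.mean((y_hat - y_true) ** 2)             # (0.04 + 0.09 + 0.01 + 0.36) / 4 = 0.125

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares = 0.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 1.0
r2 = 1 - ss_res / ss_tot                         # 0.5

print(mae, mse, r2)
```

Note that the residual sum of squares (`ss_res`) is a sum, while the MSE is that sum divided by the number of samples.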


#### Using KNN

In [25]:
kNN = 4
neigh = KNeighborsClassifier(n_neighbors = kNN).fit(X_train,y_train)
neigh

#### Prediction and Evaluation

In [27]:
prediction = neigh.predict(X_test)
prediction[0:5]

array([0., 0., 1., 0., 0.])

In [31]:
KNN_Accuracy_Score = metrics.accuracy_score(y_test, prediction)
KNN_JaccardIndex = metrics.jaccard_score(y_test, prediction)
KNN_F1_Score = metrics.f1_score(y_test, prediction)
KNN_Log_Loss = metrics.log_loss(y_test, prediction)

print("KNN Accuracy Score: {0:.3f}".format(KNN_Accuracy_Score))
print("KNN_JaccardIndex: {0:.3f}".format(KNN_JaccardIndex))
print("KNN F1 score: {0:.3f}".format(KNN_F1_Score))
print("KNN Log Loss: {0:.3f}".format(KNN_Log_Loss))

KNN Accuracy Score: 0.818
KNN_JaccardIndex: 0.425
KNN F1 score: 0.597
KNN Log Loss: 6.548
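
Log loss is designed for predicted probabilities rather than hard 0/1 labels. When it is given hard labels, every misclassified sample contributes a very large penalty (the prediction is clipped to a tiny epsilon whose exact value depends on the scikit-learn version), so the result ends up roughly proportional to the error rate. This appears to be why the value above is so large. The sketch below contrasts the two inputs on made-up numbers; the arrays are hypothetical and not taken from the model.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical ground truth and predictions, for illustration only.
y_true = np.array([0, 0, 1, 1])
hard_labels = np.array([0.0, 1.0, 1.0, 1.0])     # like the output of predict(): one mistake
probabilities = np.array([0.1, 0.6, 0.8, 0.9])   # like predict_proba()[:, 1]: same mistake, less confident

loss_from_labels = log_loss(y_true, hard_labels)
loss_from_probabilities = log_loss(y_true, probabilities)

print("log loss from hard labels:   %.3f" % loss_from_labels)
print("log loss from probabilities: %.3f" % loss_from_probabilities)
```

With a fitted classifier that supports it, the probabilities would typically come from `predict_proba(X_test)`, which would give a log loss that reflects how confident the model is and not only how often it is wrong.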


#### Using Decision Tree

In [34]:
tree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
tree.fit(X_train, y_train)

#### Predictions and Evaluation

In [37]:
prediction = tree.predict(X_test)
prediction[0:5]

array([0., 0., 1., 0., 0.])

In [38]:
Tree_Accuracy_Score = metrics.accuracy_score(y_test, prediction)
Tree_JaccardIndex = metrics.jaccard_score(y_test, prediction)
Tree_F1_Score = metrics.f1_score(y_test, prediction)
Tree_Log_Loss = metrics.log_loss(y_test, prediction)

print("Tree accuracy score: {0:.3f}".format(Tree_Accuracy_Score))
print("Tree Jaccard index: {0:.3f}".format(Tree_JaccardIndex))
print("Tree F1 score: {0:.3f}".format(Tree_F1_Score))
print("Tree Log Loss: {0:.3f}".format(Tree_Log_Loss))

Tree accuracy score: 0.818
Tree Jaccard index: 0.480
Tree F1 score: 0.649
Tree Log Loss: 6.548
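
The Jaccard index and the F1 score are both derived from the confusion matrix and are directly related: F1 = 2J / (1 + J). A minimal sketch on eight made-up labels (the arrays are hypothetical, not model output) shows how each is computed:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, jaccard_score

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

jaccard = tp / (tp + fp + fn)        # 3 / 5 = 0.6
f1 = 2 * tp / (2 * tp + fp + fn)     # 6 / 8 = 0.75

print(jaccard, f1, 2 * jaccard / (1 + jaccard))
```

Because of this relationship, the two metrics always rank models in the same order; the F1 score is simply the more lenient of the two.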


#### Using Logistic Regression

In [41]:
LR = LogisticRegression(C=0.01, solver='liblinear')
LR.fit(X_train,y_train)

#### Predictions and Evaluation

In [43]:
prediction = LR.predict(X_test)
LR_Accuracy_Score = metrics.accuracy_score(y_test, prediction)
LR_JaccardIndex = metrics.jaccard_score(y_test, prediction)
LR_F1_Score = metrics.f1_score(y_test, prediction)
LR_Log_Loss = metrics.log_loss(y_test, prediction)

print("LR accuracy score: {0:.3f}".format(LR_Accuracy_Score))
print("LR JaccardIndex: {0:.3f}".format(LR_JaccardIndex))
print("LR F1 Score: {0:.3f}".format(LR_F1_Score))
print("LR Log Loss: {0:.3f}".format(LR_Log_Loss))

LR accuracy score: 0.843
LR JaccardIndex: 0.521
LR F1 Score: 0.685
LR Log Loss: 5.668


#### Using SVM

In [46]:
SVM = svm.SVC(kernel='linear')
SVM.fit(X_train, y_train)

#### Predictions and Evaluation

In [47]:
prediction = SVM.predict(X_test)
SVM_Accuracy_Score = metrics.accuracy_score(y_test, prediction)
SVM_JaccardIndex = metrics.jaccard_score(y_test, prediction)
SVM_F1_Score = metrics.f1_score(y_test, prediction)
SVM_Log_Loss = metrics.log_loss(y_test, prediction)

print("SVM accuracy score : {0:.3f}".format(SVM_Accuracy_Score))
print("SVM jaccardIndex : {0:.3f}".format(SVM_JaccardIndex))
print("SVM F1_score : {0:.3f}".format(SVM_F1_Score))
print("SVM Log Loss : {0:.3f}".format(SVM_Log_Loss))

SVM accuracy score : 0.834
SVM jaccardIndex : 0.509
SVM F1_score : 0.675
SVM Log Loss : 5.998


#### Comparing the evaluation metrics of the models in a DataFrame

In [53]:
# Use a separate name so the original `data` DataFrame is not overwritten
report_data = {'KNN': [KNN_Accuracy_Score, KNN_JaccardIndex, KNN_F1_Score, KNN_Log_Loss],
               'Tree': [Tree_Accuracy_Score, Tree_JaccardIndex, Tree_F1_Score, Tree_Log_Loss],
               'LR': [LR_Accuracy_Score, LR_JaccardIndex, LR_F1_Score, LR_Log_Loss],
               'SVM': [SVM_Accuracy_Score, SVM_JaccardIndex, SVM_F1_Score, SVM_Log_Loss]}
Report = pd.DataFrame(data=report_data, index=['Accuracy', 'Jaccard Index', 'F1-Score', 'LogLoss'])
Report

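|               | KNN      | Tree     | LR       | SVM      |
| ------------- | -------- | -------- | -------- | -------- |
| Accuracy      | 0.818321 | 0.842748 | 0.842748 | 0.833588 |
| Jaccard Index | 0.425121 | 0.52093  | 0.52093  | 0.509009 |
| F1-Score      | 0.59661  | 0.685015 | 0.685015 | 0.674627 |
| LogLoss       | 6.548389 | 5.667933 | 5.667933 | 5.998104 |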
