## Overview

This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged of 10 minutes prior to 9am    | Compass Points  | object |
| WindDir3pm    | Wind direction averaged of 10 minutes prior to 3pm    | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged of 10 minutes prior to 9am        | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged of 10 minutes prior to 3pm        | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eights          | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eights          | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | float  |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)



In [1]:
#Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


In [5]:
#If you don't have the scikit learn, uncomment the following code and install it
#!pip install scikit-learn

In [6]:
#Importing Scikit-learn libraries/modules
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
import sklearn.metrics as metrics
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score


## Importing Dataset

In [11]:
path = 'https://github.com/AbhishekDaniel1411/IBM_Data_Science_Professional_Certificate/blob/main/Machine_Learning_with_Python/AUS_Rainfall_Prediction/Weather_Data.csv?raw=true'

df = pd.read_csv(path)
df.head()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,W,41,S,SSW,...,92,84,1017.6,1017.4,8,8,20.7,20.9,Yes,Yes
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,W,41,W,E,...,83,73,1017.9,1016.4,7,7,22.4,24.8,Yes,Yes
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,W,41,ESE,ESE,...,88,86,1016.7,1015.6,7,8,23.5,23.0,Yes,Yes
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,W,41,NNE,E,...,83,90,1014.2,1011.8,8,8,21.4,20.9,Yes,Yes
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,W,41,NNE,W,...,88,74,1008.3,1004.8,8,8,22.5,25.5,Yes,Yes


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3271 entries, 0 to 3270
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           3271 non-null   object 
 1   MinTemp        3271 non-null   float64
 2   MaxTemp        3271 non-null   float64
 3   Rainfall       3271 non-null   float64
 4   Evaporation    3271 non-null   float64
 5   Sunshine       3271 non-null   float64
 6   WindGustDir    3271 non-null   object 
 7   WindGustSpeed  3271 non-null   int64  
 8   WindDir9am     3271 non-null   object 
 9   WindDir3pm     3271 non-null   object 
 10  WindSpeed9am   3271 non-null   int64  
 11  WindSpeed3pm   3271 non-null   int64  
 12  Humidity9am    3271 non-null   int64  
 13  Humidity3pm    3271 non-null   int64  
 14  Pressure9am    3271 non-null   float64
 15  Pressure3pm    3271 non-null   float64
 16  Cloud9am       3271 non-null   int64  
 17  Cloud3pm       3271 non-null   int64  
 18  Temp9am 

## Data Preprocessing

In [13]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)

  df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)


In [20]:
features = df_sydney_processed.drop(columns=['Date' ,'RainTomorrow'], axis=1).astype(float)

features.columns

Index(['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
       'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am',
       'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm',
       'Temp9am', 'Temp3pm', 'RainToday_No', 'RainToday_Yes', 'WindGustDir_E',
       'WindGustDir_ENE', 'WindGustDir_ESE', 'WindGustDir_N', 'WindGustDir_NE',
       'WindGustDir_NNE', 'WindGustDir_NNW', 'WindGustDir_NW', 'WindGustDir_S',
       'WindGustDir_SE', 'WindGustDir_SSE', 'WindGustDir_SSW',
       'WindGustDir_SW', 'WindGustDir_W', 'WindGustDir_WNW', 'WindGustDir_WSW',
       'WindDir9am_E', 'WindDir9am_ENE', 'WindDir9am_ESE', 'WindDir9am_N',
       'WindDir9am_NE', 'WindDir9am_NNE', 'WindDir9am_NNW', 'WindDir9am_NW',
       'WindDir9am_S', 'WindDir9am_SE', 'WindDir9am_SSE', 'WindDir9am_SSW',
       'WindDir9am_SW', 'WindDir9am_W', 'WindDir9am_WNW', 'WindDir9am_WSW',
       'WindDir3pm_E', 'WindDir3pm_ENE', 'WindDir3pm_ESE', 'WindDir3pm_N',
       'WindDir3pm_NE', 

In [21]:
Y = df_sydney_processed['RainTomorrow']

## Models

### 1. Linear Regression

In [22]:
Xtrain, Xtest, ytrain, ytest = train_test_split(features, Y, test_size=.2, random_state=10)

In [23]:
LinearReg = LinearRegression()
LinearReg.fit(Xtrain, ytrain)

In [24]:
yhat_LinearReg = LinearReg.predict(Xtest)

In [25]:
LinearRegression_MAE = metrics.mean_absolute_error(ytest, yhat_LinearReg)
LinearRegression_MSE = metrics.mean_squared_error(ytest, yhat_LinearReg)
LinearRegression_R2 = metrics.r2_score(ytest, yhat_LinearReg)

In [46]:
LinearReg_report = {
    'Mean Absolute error': LinearRegression_MAE,
    'Mean Squared error': LinearRegression_MSE,
    'R^2 Score': LinearRegression_R2
}

for key, value in LinearReg_report.items():
    print(key, value)

Mean Absolute error 0.2563175026697057
Mean Squared error 0.11572058021725573
R^2 Score 0.4271321202839915


### 2. k-nearest neighbors(KNN)

In [49]:

Xtrain_scaled = preprocessing.StandardScaler().fit(Xtrain).transform(Xtrain.astype(float))
Xtest_scaled = preprocessing.StandardScaler().fit(Xtest).transform(Xtest.astype(float))

KNN = KNeighborsClassifier(n_neighbors=4)
KNN.fit(Xtrain_scaled, ytrain)

In [50]:
yhat_KNN = KNN.predict(Xtest_scaled)

In [51]:
KNN_Accuracy_Score = accuracy_score(ytest, yhat_KNN)
KNN_JaccardIndex = jaccard_score(ytest, yhat_KNN)
KNN_F1_Score = f1_score(ytest, yhat_KNN)

### 3. Decision Tree

In [60]:
Tree = DecisionTreeClassifier(criterion='entropy')

Tree.fit(Xtrain_scaled, ytrain)

In [61]:
yhat_tree = Tree.predict(Xtest_scaled)

In [62]:
Tree_Accuracy_Score = accuracy_score(ytest, yhat_tree)
Tree_JaccardIndex = jaccard_score(ytest, yhat_tree)
Tree_F1_Score = f1_score(ytest, yhat_tree)

### 4. Logistic Regression

In [63]:
Xtrain_new, Xtest_new, ytrain_new, ytest_new = train_test_split(features, Y, test_size=.2, random_state=1)

In [64]:

Xtrain_new = preprocessing.StandardScaler().fit(Xtrain_new).transform(Xtrain_new.astype(float))
Xtest_new = preprocessing.StandardScaler().fit(Xtrain_new).transform(Xtest_new.astype(float))

LR = LogisticRegression(C=0.01, solver='liblinear')
LR.fit(Xtrain_new, ytrain_new)




In [65]:
yhat_LR = LR.predict(Xtest_new)
yhat_proba_LR = LR.predict_proba(Xtest_new)

In [66]:
LR_Accuracy_Score = accuracy_score(ytest_new, yhat_LR)
LR_JaccardIndex = jaccard_score(ytest_new, yhat_LR)
LR_F1_Score = f1_score(ytest_new, yhat_LR)
LR_Log_Loss = log_loss(ytest_new, yhat_proba_LR)

### 5. SVM

In [72]:

from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(Xtrain_new, ytrain_new)

In [73]:
yhat_svm = clf.predict(Xtest_new)

In [74]:
SVM_Accuracy_Score = accuracy_score(ytest_new, yhat_svm)
SVM_JaccardIndex = jaccard_score(ytest_new, yhat_svm)
SVM_F1_Score = f1_score(ytest_new, yhat_svm)

## Report

In [75]:
report = {
    'Accuracy Score': [KNN_Accuracy_Score, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],
    'Jaccard Index': [KNN_JaccardIndex, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex],
    'F1-Score': [KNN_F1_Score, Tree_F1_Score, LR_F1_Score, SVM_F1_Score],
    'Log Loss': [np.nan, np.nan, LR_Log_Loss, np.nan]
}

report_df = pd.DataFrame(report, index=['KNN', 'Decision Tree', 'Logistic Regression', 'SVM'])

In [76]:
report_df

Unnamed: 0,Accuracy Score,Jaccard Index,F1-Score,Log Loss
KNN,0.764885,0.245098,0.393701,
Decision Tree,0.746565,0.380597,0.551351,
Logistic Regression,0.722137,0.0,0.0,10.015183
SVM,0.722137,0.0,0.0,
