<a href="https://colab.research.google.com/github/FatemehAbbasi166/Rain-Prediction-in-Australia/blob/main/rain_forecast_improvement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification with Python
## Table of Contents

- Instructions
- About the Data
- Importing Data
- Data Preprocessing
- One Hot Encoding
- Train and Test Data Split
- Train Logistic Regression, KNN, Decision Tree, SVM, and Linear Regression models and return their appropriate accuracy scores

# Instructions

In this notebook, we use regression and classification algorithms to build models on our training data and then evaluate them on our test data using several evaluation metrics.

We will use the following algorithms:
1. Linear Regression
2. KNN
3. Decision Trees
4. Logistic Regression
5. SVM

We will evaluate our models using:

1. Accuracy Score
2. Jaccard Index
3. F1-Score
4. LogLoss
5. Mean Absolute Error
6. Mean Squared Error
7. R2-Score



# About The Dataset
The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be gathered from http://www.bom.gov.au/climate/dwo/.

The dataset used here has extra columns such as 'RainToday', and our target is 'RainTomorrow'. It was gathered from Rattle at https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData

This dataset contains observations of weather metrics for each day from 2008 to 2017. The weatherAUS.csv dataset includes the following fields:

In [2]:
from google.colab import files

In [3]:
uploaded = files.upload()

Saving Weather_Data.csv to Weather_Data.csv


# Import the required libraries

In [4]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, jaccard_score, log_loss
import sklearn.metrics as metrics

# Importing the Dataset

In [5]:
df = pd.read_csv('Weather_Data.csv')

In [6]:
df.head(10)

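| | Date | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2/1/2008 | 19.5 | 22.4 | 15.6 | 6.2 | 0.0 | W | 41 | S | SSW | ... | 92 | 84 | 1017.6 | 1017.4 | 8 | 8 | 20.7 | 20.9 | Yes | Yes |
| 1 | 2/2/2008 | 19.5 | 25.6 | 6.0 | 3.4 | 2.7 | W | 41 | W | E | ... | 83 | 73 | 1017.9 | 1016.4 | 7 | 7 | 22.4 | 24.8 | Yes | Yes |
| 2 | 2/3/2008 | 21.6 | 24.5 | 6.6 | 2.4 | 0.1 | W | 41 | ESE | ESE | ... | 88 | 86 | 1016.7 | 1015.6 | 7 | 8 | 23.5 | 23.0 | Yes | Yes |
| 3 | 2/4/2008 | 20.2 | 22.8 | 18.8 | 2.2 | 0.0 | W | 41 | NNE | E | ... | 83 | 90 | 1014.2 | 1011.8 | 8 | 8 | 21.4 | 20.9 | Yes | Yes |
| 4 | 2/5/2008 | 19.7 | 25.7 | 77.4 | 4.8 | 0.0 | W | 41 | NNE | W | ... | 88 | 74 | 1008.3 | 1004.8 | 8 | 8 | 22.5 | 25.5 | Yes | Yes |
| 5 | 2/6/2008 | 20.2 | 27.2 | 1.6 | 2.6 | 8.6 | W | 41 | W | ENE | ... | 69 | 62 | 1002.7 | 998.6 | 6 | 6 | 23.8 | 26.0 | Yes | Yes |
| 6 | 2/7/2008 | 18.6 | 26.3 | 6.2 | 5.2 | 5.2 | W | 41 | W | S | ... | 75 | 80 | 999.0 | 1000.3 | 4 | 7 | 21.7 | 22.3 | Yes | Yes |
| 7 | 2/8/2008 | 17.2 | 22.3 | 27.6 | 5.8 | 2.1 | W | 41 | S | SE | ... | 77 | 61 | 1008.3 | 1007.4 | 7 | 8 | 18.9 | 21.1 | Yes | Yes |
| 8 | 2/9/2008 | 16.4 | 20.8 | 12.6 | 4.8 | 3.0 | W | 41 | SSW | W | ... | 92 | 91 | 1006.4 | 1007.6 | 7 | 7 | 17.1 | 16.5 | Yes | Yes |
| 9 | 2/10/2008 | 14.6 | 24.2 | 8.8 | 4.4 | 10.1 | W | 41 | W | SSE | ... | 80 | 53 | 1014.0 | 1013.4 | 4 | 2 | 17.2 | 23.3 | Yes | No |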


In [7]:
df.shape

(3271, 22)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3271 entries, 0 to 3270
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           3271 non-null   object 
 1   MinTemp        3271 non-null   float64
 2   MaxTemp        3271 non-null   float64
 3   Rainfall       3271 non-null   float64
 4   Evaporation    3271 non-null   float64
 5   Sunshine       3271 non-null   float64
 6   WindGustDir    3271 non-null   object 
 7   WindGustSpeed  3271 non-null   int64  
 8   WindDir9am     3271 non-null   object 
 9   WindDir3pm     3271 non-null   object 
 10  WindSpeed9am   3271 non-null   int64  
 11  WindSpeed3pm   3271 non-null   int64  
 12  Humidity9am    3271 non-null   int64  
 13  Humidity3pm    3271 non-null   int64  
 14  Pressure9am    3271 non-null   float64
 15  Pressure3pm    3271 non-null   float64
 16  Cloud9am       3271 non-null   int64  
 17  Cloud3pm       3271 non-null   int64  
 18  Temp9am 

# Data Preprocessing
## One Hot Encoding

First, we need to perform one hot encoding to convert categorical variables to binary variables.


In [9]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
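
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a toy frame. The values are hypothetical and only the column names mirror the dataset; the point is that each category becomes its own indicator column while numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Rainfall": [15.6, 0.0, 6.6],
    "RainToday": ["Yes", "No", "Yes"],
    "WindGustDir": ["W", "E", "W"],
})

# One indicator column is created per distinct category in each listed column.
encoded = pd.get_dummies(data=toy, columns=["RainToday", "WindGustDir"])

print(list(encoded.columns))
# ['Rainfall', 'RainToday_No', 'RainToday_Yes', 'WindGustDir_E', 'WindGustDir_W']
```

Depending on the pandas version, the indicator columns may be boolean rather than integer, which is one reason a later cast such as `astype(float)` can be useful.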

Next, we replace the values of the 'RainTomorrow' column, converting it from a categorical column to a binary one. We do not use the get_dummies method here because it would produce two columns for 'RainTomorrow', which we do not want since 'RainTomorrow' is our target.

In [10]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
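
The `replace` call above is applied to the whole frame, so any 'Yes' or 'No' value in any column would be converted. As a possible alternative (a sketch on a hypothetical toy frame, not the notebook's method), the mapping can be restricted to the target column:

```python
import pandas as pd

# Hypothetical two-row frame standing in for the processed data.
toy = pd.DataFrame({
    "Rainfall": [15.6, 0.0],
    "RainTomorrow": ["Yes", "No"],
})

# Map only the target column; every other column is left untouched.
toy["RainTomorrow"] = toy["RainTomorrow"].map({"No": 0, "Yes": 1})

print(toy["RainTomorrow"].tolist())  # [1, 0]
```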

# Training Data and Test Data
Now we define our features (the x values) and our target variable Y.

In [11]:
df_sydney_processed.drop('Date',axis=1,inplace=True)
df_sydney_processed = df_sydney_processed.astype(float)
df_sydney_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3271 entries, 0 to 3270
Data columns (total 67 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   MinTemp          3271 non-null   float64
 1   MaxTemp          3271 non-null   float64
 2   Rainfall         3271 non-null   float64
 3   Evaporation      3271 non-null   float64
 4   Sunshine         3271 non-null   float64
 5   WindGustSpeed    3271 non-null   float64
 6   WindSpeed9am     3271 non-null   float64
 7   WindSpeed3pm     3271 non-null   float64
 8   Humidity9am      3271 non-null   float64
 9   Humidity3pm      3271 non-null   float64
 10  Pressure9am      3271 non-null   float64
 11  Pressure3pm      3271 non-null   float64
 12  Cloud9am         3271 non-null   float64
 13  Cloud3pm         3271 non-null   float64
 14  Temp9am          3271 non-null   float64
 15  Temp3pm          3271 non-null   float64
 16  RainTomorrow     3271 non-null   float64
 17  RainToday_No  

In [12]:
df_sydney_processed.shape

(3271, 67)

In [13]:
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']

# Linear Regression

Use the train_test_split function to split features and Y with a test_size of 0.2 and random_state set to 10.

In [14]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size = 0.2 , random_state = 10)
print('train set : ', x_train.shape, y_train.shape)
print('test set : ', x_test.shape, y_test.shape)

train set :  (2616, 66) (2616,)
test set :  (655, 66) (655,)
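
The split sizes follow directly from the row count: 0.2 × 3271 = 654.2, and scikit-learn rounds the test size up to 655, leaving 2616 rows for training. A minimal sketch with a synthetic array of the same shape (all zeros, purely a stand-in for `features` and `Y`) reproduces this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the notebook's features and target.
X = np.zeros((3271, 66))
y = np.zeros(3271)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

# The test share is rounded up; the remainder goes to the training set.
print(X_tr.shape, X_te.shape)  # (2616, 66) (655, 66)
```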


Create and train a Linear Regression model called LinearReg using the training data (x_train, y_train).

In [15]:
LinearReg = LinearRegression()
LinearReg.fit(x_train,y_train)

Now use the predict method on the testing data (x_test) and save the result to the array predictions.

In [16]:
predictions = LinearReg.predict(x_test)

Using predictions and y_test, calculate each metric with the appropriate function.

In [17]:
LinearRegression_MAE = metrics.mean_absolute_error(y_test, predictions)
LinearRegression_MSE = metrics.mean_squared_error(y_test, predictions)
LinearRegression_R2 = metrics.r2_score(y_test, predictions)
print('Mean Absolute Error : ', LinearRegression_MAE)
print('Mean Squared Error : ', LinearRegression_MSE)
print('R2 Score : ', LinearRegression_R2)

Mean Absolute Error :  0.2563092413749404
Mean Squared Error :  0.11571947414932758
R2 Score :  0.42713759580777166


Show the MAE, MSE, and R2 for the linear model in a tabular format using a DataFrame.


In [18]:
Report = {'MAE': [LinearRegression_MAE], 'MSE': [LinearRegression_MSE], 'R2': [LinearRegression_R2]}
Report=pd.DataFrame(Report)
Report

| | MAE | MSE | R2 |
|---|---|---|---|
| 0 | 0.256309 | 0.115719 | 0.427138 |


# KNN
Create and train a KNN model called KNN using the training data (x_train, y_train) with the n_neighbors parameter set to 4.

In [19]:
KNN = KNeighborsClassifier(n_neighbors = 4)
KNN.fit(x_train, y_train)


Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [20]:
predictions = KNN.predict(x_test)

Using predictions and y_test, calculate each metric with the appropriate function.

In [21]:
KNN_Accuracy_Score = accuracy_score(y_test, predictions)
KNN_JaccardIndex = jaccard_score(y_test, predictions, pos_label=0)
KNN_F1_Score = f1_score(y_test, predictions, average='weighted')
print('Accuracy Score : ', KNN_Accuracy_Score)
print('jaccard Index : ', KNN_JaccardIndex)
print('F1 Score : ', KNN_F1_Score)

Accuracy Score :  0.8183206106870229
jaccard Index :  0.7901234567901234
F1 Score :  0.802374933635524


# Decision Tree
Create and train a Decision Tree model called Tree using the training data (x_train, y_train).

In [22]:
Tree = DecisionTreeClassifier(criterion="entropy", max_depth = 8)
Tree.fit(x_train, y_train)

Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [23]:
predictions = Tree.predict(x_test)

Using predictions and y_test, calculate each metric with the appropriate function.

In [24]:
Tree_Accuracy_Score = accuracy_score(y_test, predictions)
Tree_JaccardIndex = jaccard_score(y_test, predictions)
Tree_F1_Score = f1_score(y_test, predictions)
print('Tree Accuracy Score: ', Tree_Accuracy_Score)
print('Tree Jaccard Index: ', Tree_JaccardIndex)
print('Tree F1 Score: ', Tree_F1_Score)

Tree Accuracy Score:  0.7938931297709924
Tree Jaccard Index:  0.41304347826086957
Tree F1 Score:  0.5846153846153846


# Logistic Regression
Use the train_test_split function to split features and Y with a test_size of 0.2 and random_state set to 1.

In [25]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size = 0.2, random_state = 1)

Create and train a LogisticRegression model called LR using the training data (x_train, y_train) with the solver parameter set to liblinear.

In [26]:
LR = LogisticRegression(solver='liblinear')
LR.fit(x_train, y_train)

Now, use the predict and predict_proba methods on the testing data (x_test) and save the results as two arrays, predictions and predict_proba.

In [27]:
predictions = LR.predict(x_test)

In [28]:
predict_proba = LR.predict_proba(x_test)

Using predictions, predict_proba, and y_test, calculate each metric with the appropriate function.

In [29]:
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions)
LR_F1_Score = f1_score(y_test, predictions)
LR_Log_Loss = log_loss(y_test, predict_proba[:, 1])
print("LR Accuracy Score:", LR_Accuracy_Score)
print("LR Jaccard Index:", LR_JaccardIndex)
print("LR F1 Score:", LR_F1_Score)
print("LR Log Loss:", LR_Log_Loss)

LR Accuracy Score: 0.8366412213740458
LR Jaccard Index: 0.5091743119266054
LR F1 Score: 0.6747720364741641
LR Log Loss: 0.3812590636097066


# SVM
Create and train an SVM model called SVM using the training data (x_train, y_train).

In [30]:
SVM = svm.SVC(kernel='linear')
SVM.fit(x_train, y_train)

Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [31]:
predictions = SVM.predict(x_test)

Using predictions and y_test, calculate each metric with the appropriate function.

In [32]:
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions, pos_label=0)
SVM_F1_Score = f1_score(y_test, predictions)
print('SVM accuracy Score : ', SVM_Accuracy_Score)
print('SVM jaccardIndex Score : ', SVM_JaccardIndex)
print('F1 Score', SVM_F1_Score)

SVM accuracy Score :  0.8458015267175573
SVM jaccardIndex Score :  0.8126159554730983
F1 Score 0.6966966966966968


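Show the Accuracy, Jaccard Index, F1-Score, and LogLoss for all of the above models in a tabular format using a DataFrame. Note that the rows are not strictly comparable: KNN and SVM report the Jaccard index with pos_label=0, the KNN F1-Score uses average='weighted', and Logistic Regression and SVM are evaluated on a different split (random_state=1) than KNN and the Decision Tree (random_state=10).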

In [33]:
Report=pd.DataFrame({
    'Model': ['KNN', 'Decision Tree','Logistic Regression','SVM'],
    'Accuracy': [KNN_Accuracy_Score, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],
    'Jaccard Index': [KNN_JaccardIndex, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex],
    'F1-Score': [KNN_F1_Score, Tree_F1_Score, LR_F1_Score, SVM_F1_Score],
    'LogLoss': ['N/A', 'N/A' , LR_Log_Loss,'N/A']
})
Report

| | Model | Accuracy | Jaccard Index | F1-Score | LogLoss |
|---|---|---|---|---|---|
| 0 | KNN | 0.818321 | 0.790123 | 0.802375 | N/A |
| 1 | Decision Tree | 0.793893 | 0.413043 | 0.584615 | N/A |
| 2 | Logistic Regression | 0.836641 | 0.509174 | 0.674772 | 0.381259 |
| 3 | SVM | 0.845802 | 0.812616 | 0.696697 | N/A |
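
Since the cells above pass different `pos_label` and `average` arguments for different models, one way to keep a comparison table consistent is to route every model through a single scoring function. The helper below is hypothetical (its name and the choice of class 1 as the positive class are assumptions, not part of the notebook) and is shown on toy labels only.

```python
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

def classification_report_row(name, y_true, y_pred):
    """Hypothetical helper: score every model with identical metric settings."""
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Jaccard Index": jaccard_score(y_true, y_pred, pos_label=1),
        "F1-Score": f1_score(y_true, y_pred, pos_label=1),
    }

# Toy labels for illustration; in the notebook these would be y_test and predictions.
row = classification_report_row("toy", [0, 0, 1, 1], [0, 1, 1, 1])
print(row)
```

A list of such rows can be passed straight to `pd.DataFrame` to build a report in which every column means the same thing for every model.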
