# Final Project - Classification with Python

Practicing various classification algorithms.

For this project, we'll try to predict rainfall in Australia (whether or not it will rain tomorrow) from weather observations collected over time.

Have a look.

## Info on Dataset

This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)

## Importing Required Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

## Importing Dataset

In [2]:
df = pd.read_csv('Weather_Data.csv')
df

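| | Date | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2/1/2008 | 19.5 | 22.4 | 15.6 | 6.2 | 0.0 | W | 41 | S | SSW | ... | 92 | 84 | 1017.6 | 1017.4 | 8 | 8 | 20.7 | 20.9 | Yes | Yes |
| 1 | 2/2/2008 | 19.5 | 25.6 | 6.0 | 3.4 | 2.7 | W | 41 | W | E | ... | 83 | 73 | 1017.9 | 1016.4 | 7 | 7 | 22.4 | 24.8 | Yes | Yes |
| 2 | 2/3/2008 | 21.6 | 24.5 | 6.6 | 2.4 | 0.1 | W | 41 | ESE | ESE | ... | 88 | 86 | 1016.7 | 1015.6 | 7 | 8 | 23.5 | 23.0 | Yes | Yes |
| 3 | 2/4/2008 | 20.2 | 22.8 | 18.8 | 2.2 | 0.0 | W | 41 | NNE | E | ... | 83 | 90 | 1014.2 | 1011.8 | 8 | 8 | 21.4 | 20.9 | Yes | Yes |
| 4 | 2/5/2008 | 19.7 | 25.7 | 77.4 | 4.8 | 0.0 | W | 41 | NNE | W | ... | 88 | 74 | 1008.3 | 1004.8 | 8 | 8 | 22.5 | 25.5 | Yes | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3266 | 6/21/2017 | 8.6 | 19.6 | 0.0 | 2.0 | 7.8 | SSE | 37 | W | SSE | ... | 73 | 52 | 1025.9 | 1025.3 | 2 | 2 | 10.5 | 17.9 | No | No |
| 3267 | 6/22/2017 | 9.3 | 19.2 | 0.0 | 2.0 | 9.2 | W | 30 | W | ESE | ... | 78 | 53 | 1028.5 | 1024.6 | 2 | 2 | 11.0 | 18.7 | No | No |
| 3268 | 6/23/2017 | 9.4 | 17.7 | 0.0 | 2.4 | 2.7 | W | 24 | WNW | N | ... | 85 | 56 | 1020.8 | 1015.0 | 6 | 6 | 10.2 | 17.3 | No | No |
| 3269 | 6/24/2017 | 10.1 | 19.3 | 0.0 | 1.4 | 9.3 | W | 43 | W | W | ... | 56 | 35 | 1017.3 | 1015.1 | 5 | 2 | 12.4 | 19.0 | No | No |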


## Inspecting Data

We need to check our data for missing values, duplicated rows, and incorrect data types, since we cannot feed our algorithms bad data.

In [3]:
print("Sum of Null values\n")
print(df.isna().sum())
print("\n")

print("Info on Dataset\n")
print(df.info())
print("\n")

print("Sum of Duplicates:",df.duplicated().sum())

Sum of Null values

Date             0
MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustDir      0
WindGustSpeed    0
WindDir9am       0
WindDir3pm       0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
RainToday        0
RainTomorrow     0
dtype: int64


Info on Dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3271 entries, 0 to 3270
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           3271 non-null   object 
 1   MinTemp        3271 non-null   float64
 2   MaxTemp        3271 non-null   float64
 3   Rainfall       3271 non-null   float64
 4   Evaporation    3271 non-null   float64
 5   Sunshine       3271 non-null   float64
 6   WindGustDir    3271 non-null   object 
 7   WindGustSpeed  3271 no

The data has no missing or duplicated values, and the columns appear to have the correct data types. Let's carry on.

## Converting Categorical Variables into Numerical Variables

In [4]:
new_df = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

In [5]:
new_df.replace(['No', 'Yes'], [0,1], inplace=True)
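
To make the encoding step concrete, here is a minimal sketch on a made-up three-row frame (the `toy` frame and its values are illustrative, not taken from the dataset). It shows how `pd.get_dummies` replaces each listed categorical column with one indicator column per category. The `dtype=float` argument is an addition for this sketch, so the indicators come out as 0.0/1.0 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and two categorical columns.
toy = pd.DataFrame({
    "MinTemp": [19.5, 8.6, 21.6],
    "RainToday": ["Yes", "No", "Yes"],
    "WindGustDir": ["W", "E", "W"],
})

# Each listed column is replaced by one indicator column per category;
# columns that are not listed (MinTemp) are passed through unchanged.
encoded = pd.get_dummies(toy, columns=["RainToday", "WindGustDir"], dtype=float)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one 1.0 within each group of indicator columns, which is why the encoded frame in this notebook grows to many `WindDir..._XX` columns.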

## Dropping the 'Date' Column

Since we do not need the `Date` column, we'll just drop it.

In [6]:
new_df.drop('Date',axis=1,inplace=True)

## Converting Dataframe to type Float

In [7]:
new_df = new_df.astype(float)
new_df

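| | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | ... | WindDir3pm_NNW | WindDir3pm_NW | WindDir3pm_S | WindDir3pm_SE | WindDir3pm_SSE | WindDir3pm_SSW | WindDir3pm_SW | WindDir3pm_W | WindDir3pm_WNW | WindDir3pm_WSW |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 19.5 | 22.4 | 15.6 | 6.2 | 0.0 | 41.0 | 17.0 | 20.0 | 92.0 | 84.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 19.5 | 25.6 | 6.0 | 3.4 | 2.7 | 41.0 | 9.0 | 13.0 | 83.0 | 73.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 21.6 | 24.5 | 6.6 | 2.4 | 0.1 | 41.0 | 17.0 | 2.0 | 88.0 | 86.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 20.2 | 22.8 | 18.8 | 2.2 | 0.0 | 41.0 | 22.0 | 20.0 | 83.0 | 90.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 19.7 | 25.7 | 77.4 | 4.8 | 0.0 | 41.0 | 11.0 | 6.0 | 88.0 | 74.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3266 | 8.6 | 19.6 | 0.0 | 2.0 | 7.8 | 37.0 | 22.0 | 20.0 | 73.0 | 52.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3267 | 9.3 | 19.2 | 0.0 | 2.0 | 9.2 | 30.0 | 20.0 | 7.0 | 78.0 | 53.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3268 | 9.4 | 17.7 | 0.0 | 2.4 | 2.7 | 24.0 | 15.0 | 13.0 | 85.0 | 56.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3269 | 10.1 | 19.3 | 0.0 | 1.4 | 9.3 | 43.0 | 17.0 | 19.0 | 56.0 | 35.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |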


## Feature Engineering

Selecting input and output data for training.

In [8]:
features = new_df.drop(columns='RainTomorrow')
Y = new_df['RainTomorrow']

## Splitting Dataset

We split the data into a training set and a test set: the models are fit on the training set and evaluated on the held-out test set. This step is important for estimating out-of-sample accuracy, that is, how well a model performs on data it has not seen.

In [9]:
x_train, x_test, y_train, y_test = train_test_split(features, Y,test_size=0.2, random_state=10)
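
The value of a held-out test set can be illustrated with a small sketch. It uses synthetic data from `make_classification` (an assumption made for this example, not the weather data) and an unconstrained decision tree, which typically memorizes its training set. Comparing the two scores shows why accuracy on the training data says little about performance on unseen data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0
)

# Same split proportions as in the notebook: 80% train, 20% test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10
)

# A fully grown tree fits the training rows almost perfectly.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)  # accuracy on data the model has seen
test_acc = tree.score(X_te, y_te)   # accuracy on held-out data

print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

The gap between the two numbers is the part of the training score that does not carry over to new data.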

## Model Training and Evaluation

### Linear Regression

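Linear regression is rarely used for classification because it predicts continuous values rather than class labels, but let's try it out. Since its output is continuous, we evaluate it with regression metrics (MAE, MSE, and R2) instead of classification metrics.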

In [10]:
LinearReg = LinearRegression()
LinearReg.fit(x_train,y_train)

predictions = LinearReg.predict(x_test)

LinearRegression_MAE = mean_absolute_error(y_test, predictions)
LinearRegression_MSE = mean_squared_error(y_test, predictions)
LinearRegression_R2 =  r2_score(y_test, predictions)

Report = pd.DataFrame({
    "MAE":LinearRegression_MAE,
    "MSE":LinearRegression_MSE,
    "R2":LinearRegression_R2
},index=["Linear Regression"])
Report

|                   | MAE      | MSE      | R2       |
| ----------------- | -------- | -------- | -------- |
| Linear Regression | 0.256319 | 0.115722 | 0.427127 |
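
The regression metrics above cannot be compared directly with the accuracy-style scores of the classifiers that follow. One possible way to bridge the gap, not done in this notebook, is to threshold the continuous predictions at 0.5 to obtain class labels. The sketch below uses small hand-written arrays as stand-ins for `y_test` and `predictions`; the 0.5 cutoff is an assumption, not a tuned value.

```python
import numpy as np

# Stand-ins for y_test and the raw LinearRegression outputs (made-up values).
y_true = np.array([0, 1, 1, 1, 0])
raw_predictions = np.array([0.10, 0.70, 0.49, 1.20, -0.30])

# Linear regression outputs are unbounded, so values can fall below 0
# or above 1; thresholding maps every output to a 0/1 label.
labels = (raw_predictions >= 0.5).astype(int)

# Fraction of labels that match the true classes.
accuracy = (labels == y_true).mean()

print("Labels:", labels)
print("Accuracy:", accuracy)
```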


### K-Nearest Neighbours (KNN)

In [11]:
KNN = KNeighborsClassifier(n_neighbors = 4)
KNN.fit(x_train,y_train)

predictions2 = KNN.predict(x_test)

KNN_Accuracy_Score = accuracy_score(y_test, predictions2)
KNN_JaccardIndex = jaccard_score(y_test, predictions2)
KNN_F1_Score = f1_score(y_test, predictions2)

print("KNN Accuracy Score:",KNN_Accuracy_Score)
print("KNN Jaccard Index:",KNN_JaccardIndex)
print("KNN f1 Score:",KNN_F1_Score)

KNN Accuracy Score: 0.8183206106870229
KNN Jaccard Index: 0.4251207729468599
KNN f1 Score: 0.5966101694915255


### Decision Tree

In [12]:
Tree = DecisionTreeClassifier()
Tree.fit(x_train, y_train)

predictions3 = Tree.predict(x_test)

Tree_Accuracy_Score = accuracy_score(y_test, predictions3)
Tree_JaccardIndex = jaccard_score(y_test, predictions3)
Tree_F1_Score = f1_score(y_test, predictions3)

print("Tree Accuracy Score:",Tree_Accuracy_Score)
print("Tree Jaccard Index:",Tree_JaccardIndex)
print("Tree f1 Score:",Tree_F1_Score)

Tree Accuracy Score: 0.7618320610687023
Tree Jaccard Index: 0.40458015267175573
Tree f1 Score: 0.5760869565217391


### Logistic Regression

In [13]:
#Changing the random state to 1: the models below are evaluated on a different test split than KNN and the Decision Tree
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)

In [14]:
LR = LogisticRegression(solver='liblinear')
LR.fit(x_train, y_train)

predictions4 = LR.predict(x_test)
predict_proba = LR.predict_proba(x_test)

LR_Accuracy_Score = accuracy_score(y_test, predictions4)
LR_JaccardIndex = jaccard_score(y_test, predictions4)
LR_F1_Score = f1_score(y_test, predictions4)
LR_Log_Loss = log_loss(y_test, predict_proba)

print("Accuracy Score:",LR_Accuracy_Score)
print("Jaccard Index:",LR_JaccardIndex)
print("f1 Score:",LR_F1_Score)
print("Log Loss",LR_Log_Loss)

Accuracy Score: 0.8351145038167939
Jaccard Index: 0.5045871559633027
f1 Score: 0.6707317073170731
Log Loss 0.38155847529580783
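
Log loss is the only metric here that uses the predicted probabilities rather than the hard labels. As a rough illustration of what it measures, the sketch below computes binary log loss by hand. The helper `binary_log_loss` is hypothetical and written for this example; its `p_positive` argument corresponds to the second column of `predict_proba`, and the inputs are made-up values.

```python
import math

def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood of the true labels (illustrative helper)."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        # Clip probabilities so that log(0) never occurs.
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Two models that both classify correctly, with different confidence.
confident = binary_log_loss([1, 0], [0.9, 0.2])
hesitant = binary_log_loss([1, 0], [0.6, 0.4])

print("Confident model:", confident)
print("Hesitant model:", hesitant)
```

Both sets of probabilities lead to the same labels at a 0.5 cutoff, yet the more confident correct predictions earn the lower (better) log loss.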


### Support Vector Machines (SVM)

In [15]:
SVM = svm.SVC()
SVM.fit(x_train, y_train)

predictions5 = SVM.predict(x_test)

SVM_Accuracy_Score = accuracy_score(y_test, predictions5)
SVM_JaccardIndex = jaccard_score(y_test, predictions5)
SVM_F1_Score = f1_score(y_test, predictions5)

print("SVM Accuracy Score:",SVM_Accuracy_Score)
print("SVM Jaccard Index:",SVM_JaccardIndex)
print("SVM f1 Score:",SVM_F1_Score)

SVM Accuracy Score: 0.7221374045801526
SVM Jaccard Index: 0.0
SVM f1 Score: 0.0
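
A Jaccard index and F1 score of exactly 0 mean the SVM produced no true positives on the test set. Together with an accuracy of about 0.72, this is consistent with the model predicting "No" for every day, although the output above does not confirm it. The sketch below, using a hypothetical helper `binary_scores` and made-up labels, shows how a classifier that never predicts the positive class can still reach a respectable accuracy on imbalanced data.

```python
def binary_scores(y_true, y_pred):
    """Return (accuracy, jaccard, f1) for 0/1 labels (illustrative helper)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Both scores are defined as 0 here when there is nothing to count.
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return accuracy, jaccard, f1

# Made-up test set: 7 dry days, 3 rainy days; the model always says "No".
y_true = [0] * 7 + [1] * 3
y_pred = [0] * 10

accuracy, jaccard, f1 = binary_scores(y_true, y_pred)
print("Accuracy:", accuracy)
print("Jaccard:", jaccard)
print("F1:", f1)
```

This is why accuracy alone can be misleading here. SVMs with the default RBF kernel are also sensitive to feature scale, so standardizing the features before fitting is one thing that could be tried; it was not tested in this notebook.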


In [16]:
Report2 = pd.DataFrame({
    "Accuracy Score":[KNN_Accuracy_Score, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],
    "Jaccard Index":[KNN_JaccardIndex, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex],
    "F1 Score":[KNN_F1_Score, Tree_F1_Score, LR_F1_Score, SVM_F1_Score],
    "Log Loss":["N/A", "N/A", LR_Log_Loss, "N/A"]
},index=["KNN", "Decision Tree", "Logistic Regression", "SVM"])
Report2

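|                     | Accuracy Score | Jaccard Index | F1 Score | Log Loss |
| ------------------- | -------------- | ------------- | -------- | -------- |
| KNN                 | 0.818321       | 0.425121      | 0.59661  | N/A      |
| Decision Tree       | 0.761832       | 0.40458       | 0.576087 | N/A      |
| Logistic Regression | 0.835115       | 0.504587      | 0.670732 | 0.381558 |
| SVM                 | 0.722137       | 0.0           | 0.0      | N/A      |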
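
Because cell 13 re-splits the data with `random_state=1`, the rows in this table come from two different test sets (KNN and the Decision Tree from one, Logistic Regression and SVM from the other). A like-for-like comparison would fit every model on a single split. The sketch below shows one way to structure that, using synthetic data from `make_classification` as a stand-in for `features` and `Y`; the numbers it prints are not results for the weather dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, jaccard_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the weather features and target.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# One split shared by every model.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Same model settings as used in the notebook.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=4),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "SVM": SVC(),
}

rows = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows[name] = {
        "Accuracy Score": accuracy_score(y_te, pred),
        "Jaccard Index": jaccard_score(y_te, pred),
        "F1 Score": f1_score(y_te, pred),
    }

# One row per model, one column per metric.
comparison = pd.DataFrame.from_dict(rows, orient="index")
print(comparison)
```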
