# **Classification with Python**

In this notebook we practise using several classification algorithms: we fit each model on our training data and evaluate it on our testing data using a set of evaluation metrics.

The algorithms covered are:
1. Linear Regression
2. KNN
3. Decision Trees
4. Logistic Regression
5. SVM


The evaluation metrics are:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score
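To make the first three classification metrics concrete, here is a minimal plain-Python sketch of how they follow from the confusion-matrix counts for binary labels. The helper names (`binary_counts`, `accuracy`, `jaccard`, `f1`) and the toy labels are illustrative assumptions, not part of the notebook, which uses the scikit-learn functions instead. The sketch also assumes at least one positive label or prediction, so no division by zero occurs.

```python
def binary_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    # Fraction of all predictions that are correct
    tp, fp, fn, tn = binary_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def jaccard(y_true, y_pred):
    # Intersection over union of the positive class
    tp, fp, fn, _ = binary_counts(y_true, y_pred)
    return tp / (tp + fp + fn)


def f1(y_true, y_pred):
    # Harmonic mean of precision and recall for the positive class
    tp, fp, fn, _ = binary_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical toy labels: 3 true positives, 1 false positive,
# 1 false negative, 3 true negatives
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(y_true_toy, y_pred_toy))  # 0.75
print(jaccard(y_true_toy, y_pred_toy))   # 0.6
print(f1(y_true_toy, y_pred_toy))        # 0.75
```

Because the Jaccard index ignores true negatives, it is usually lower than accuracy when the positive class is the minority.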


In [1]:
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

## Importing the required libraries

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error
from sklearn.metrics import jaccard_score, f1_score, log_loss
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
%matplotlib inline
import requests
import os

## Downloading the data

The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be obtained from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset used here has extra columns such as 'RainToday' and our target, 'RainTomorrow'. It was obtained from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

In [3]:
data_folder = 'data'
output_folder = 'output'

In [8]:
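def download(url):
    # Make sure the target folder exists before writing into it
    os.makedirs(data_folder, exist_ok=True)
    filename = os.path.join(data_folder, os.path.basename(url))
    # Skip the download if the file is already present
    if not os.path.exists(filename):
        with requests.get(url, stream=True, allow_redirects=True) as r:
            r.raise_for_status()  # fail early on HTTP errors
            with open(filename, 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)

        print('Downloaded: ', filename)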

In [6]:
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'

In [9]:
download(path)

Downloaded:  data\Weather_Data.csv


In [11]:
file = os.path.basename(path)
file_path = os.path.join(data_folder, file)
df = pd.read_csv(file_path)

Alternatively, we can read the data directly from the source URL with pd.read_csv(), without downloading it first.

In [None]:
filepath = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv"
df = pd.read_csv(filepath)

In [12]:
df.head()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,W,41,S,SSW,...,92,84,1017.6,1017.4,8,8,20.7,20.9,Yes,Yes
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,W,41,W,E,...,83,73,1017.9,1016.4,7,7,22.4,24.8,Yes,Yes
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,W,41,ESE,ESE,...,88,86,1016.7,1015.6,7,8,23.5,23.0,Yes,Yes
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,W,41,NNE,E,...,83,90,1014.2,1011.8,8,8,21.4,20.9,Yes,Yes
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,W,41,NNE,W,...,88,74,1008.3,1004.8,8,8,22.5,25.5,Yes,Yes


## Exploratory Data Analysis

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3271 entries, 0 to 3270
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           3271 non-null   object 
 1   MinTemp        3271 non-null   float64
 2   MaxTemp        3271 non-null   float64
 3   Rainfall       3271 non-null   float64
 4   Evaporation    3271 non-null   float64
 5   Sunshine       3271 non-null   float64
 6   WindGustDir    3271 non-null   object 
 7   WindGustSpeed  3271 non-null   int64  
 8   WindDir9am     3271 non-null   object 
 9   WindDir3pm     3271 non-null   object 
 10  WindSpeed9am   3271 non-null   int64  
 11  WindSpeed3pm   3271 non-null   int64  
 12  Humidity9am    3271 non-null   int64  
 13  Humidity3pm    3271 non-null   int64  
 14  Pressure9am    3271 non-null   float64
 15  Pressure3pm    3271 non-null   float64
 16  Cloud9am       3271 non-null   int64  
 17  Cloud3pm       3271 non-null   int64  
 18  Temp9am 

## Data Preprocessing

### One Hot Encoding

In [14]:
df_proc = pd.get_dummies(data=df, columns = ['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

Next, we replace the values of the 'RainTomorrow' column, changing it from a categorical column to a binary one. We do not use the get_dummies method here because it would split 'RainTomorrow' into two columns, and we want a single column since 'RainTomorrow' is our target.

In [15]:
df_proc.replace(['No', 'Yes'], [0, 1], inplace=True)
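The difference between the two encodings can be illustrated without pandas. The sketch below is only an approximation of what `pd.get_dummies` produces for one column, and the helper name `one_hot` and the toy values are hypothetical: a feature is expanded into one indicator per category, while the target is mapped to a single 0/1 value.

```python
def one_hot(values, prefix):
    """Expand a categorical column into one boolean indicator per category."""
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": v == c for c in categories} for v in values]


# Hypothetical toy columns
rain_today = ["Yes", "No", "Yes"]
rain_tomorrow = ["Yes", "Yes", "No"]

# Feature: one indicator column per category
encoded_feature = one_hot(rain_today, "RainToday")

# Target: a single binary column
target_map = {"No": 0, "Yes": 1}
encoded_target = [target_map[v] for v in rain_tomorrow]

print(encoded_feature[0])  # {'RainToday_No': False, 'RainToday_Yes': True}
print(encoded_target)      # [1, 1, 0]
```

Keeping the target as one column is what lets it be passed directly as `y` to the classifiers.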

## Training Data and Test Data

In [16]:
df_proc.drop('Date', axis=1, inplace=True)

In [17]:
df_proc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3271 entries, 0 to 3270
Data columns (total 67 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   MinTemp          3271 non-null   float64
 1   MaxTemp          3271 non-null   float64
 2   Rainfall         3271 non-null   float64
 3   Evaporation      3271 non-null   float64
 4   Sunshine         3271 non-null   float64
 5   WindGustSpeed    3271 non-null   int64  
 6   WindSpeed9am     3271 non-null   int64  
 7   WindSpeed3pm     3271 non-null   int64  
 8   Humidity9am      3271 non-null   int64  
 9   Humidity3pm      3271 non-null   int64  
 10  Pressure9am      3271 non-null   float64
 11  Pressure3pm      3271 non-null   float64
 12  Cloud9am         3271 non-null   int64  
 13  Cloud3pm         3271 non-null   int64  
 14  Temp9am          3271 non-null   float64
 15  Temp3pm          3271 non-null   float64
 16  RainTomorrow     3271 non-null   int64  
 17  RainToday_No  

In [18]:
df_proc.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,...,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW
0,19.5,22.4,15.6,6.2,0.0,41,17,20,92,84,...,False,False,False,False,False,True,False,False,False,False
1,19.5,25.6,6.0,3.4,2.7,41,9,13,83,73,...,False,False,False,False,False,False,False,False,False,False
2,21.6,24.5,6.6,2.4,0.1,41,17,2,88,86,...,False,False,False,False,False,False,False,False,False,False
3,20.2,22.8,18.8,2.2,0.0,41,22,20,83,90,...,False,False,False,False,False,False,False,False,False,False
4,19.7,25.7,77.4,4.8,0.0,41,11,6,88,74,...,False,False,False,False,False,False,False,True,False,False


In [19]:
df_proc = df_proc.astype(float)

In [20]:
df_proc.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,...,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW
0,19.5,22.4,15.6,6.2,0.0,41.0,17.0,20.0,92.0,84.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,19.5,25.6,6.0,3.4,2.7,41.0,9.0,13.0,83.0,73.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,21.6,24.5,6.6,2.4,0.1,41.0,17.0,2.0,88.0,86.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,20.2,22.8,18.8,2.2,0.0,41.0,22.0,20.0,83.0,90.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,19.7,25.7,77.4,4.8,0.0,41.0,11.0,6.0,88.0,74.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [21]:
#Split data to features and target variables
x = df_proc.drop(columns = ['RainTomorrow'])
y = df_proc['RainTomorrow']

## Building our Models

We will split our data into training and testing sets, using 80% of the rows for training and 20% for testing, and then build our models.
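As a rough check on what a 20% test size means for the 3271 rows in this dataset, the sketch below works out the split sizes and performs a simple seeded shuffle split in plain Python. The rounding rule (test size rounded up, the rest used for training) is my understanding of how scikit-learn handles a fractional `test_size`, and it agrees with the training shape `(2616, 66)` printed in the notebook. The helper `shuffle_split` is hypothetical and will not reproduce the exact rows that `train_test_split` selects.

```python
import math
import random

n_samples = 3271
test_size = 0.2

# Assumed rounding rule: round the test size up, give the rest to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 2616 655


def shuffle_split(n, n_test, seed):
    """Shuffle row indices with a fixed seed and cut off a test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffle_split(n_samples, n_test, seed=10)
```

Fixing the seed makes the split reproducible, which is the role `random_state` plays in `train_test_split`.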

## Linear Regression

In [33]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)

In [34]:
print (x_train.shape, y_train.shape)

(2616, 66) (2616,)


In [23]:
# Instantiate the linear regression model
LinearReg = LinearRegression()
LinearReg.fit(x_train, y_train)

In [36]:
# Predict x_test
predictions = LinearReg.predict(x_test)
score = LinearReg.score(x_test, y_test)

In [37]:
from sklearn.metrics import mean_squared_error
# Get the value for each metric; scikit-learn expects (y_true, y_pred)
LinearRegression_MAE = mean_absolute_error(y_test, predictions)
LinearRegression_MSE = mean_squared_error(y_test, predictions)
LinearRegression_R2 = score  # R2 as returned by LinearReg.score

In [39]:
print(f'MAE: {LinearRegression_MAE}')
print(f'MSE: {LinearRegression_MSE}')

MAE: 0.2563176099420382
MSE: 0.11572058282746576


In [45]:
report = pd.DataFrame({'MAE': LinearRegression_MAE,
                      'MSE': LinearRegression_MSE,
                      'R2_Score': LinearRegression_R2}, index = ['Value'], columns = ['MAE', 'MSE', 'R2_Score'])

In [46]:
report


Unnamed: 0,MAE,MSE,R2_Score
Value,0.256318,0.115721,0.427132
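The three regression metrics in this report can be written out by hand. The sketch below is a plain-Python illustration on hypothetical toy values; the helper names `mae`, `mse`, and `r2` are assumptions, and the notebook itself relies on the scikit-learn implementations. It also shows why linear regression is not a classifier on its own: its output is a continuous score, which would need a threshold (0.5 here, chosen only as an example) to become a class label.

```python
def mae(y_true, y_pred):
    # Mean of the absolute errors
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    # Mean of the squared errors
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    # 1 minus (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical 0/1 targets and continuous predictions
y_true_toy = [0.0, 1.0, 0.0, 1.0]
y_pred_toy = [0.25, 0.75, 0.5, 1.0]

print(mae(y_true_toy, y_pred_toy))  # 0.25
print(mse(y_true_toy, y_pred_toy))  # 0.09375
print(r2(y_true_toy, y_pred_toy))   # 0.625

# Turning continuous scores into labels with an assumed 0.5 threshold
labels = [int(p >= 0.5) for p in y_pred_toy]
print(labels)  # [0, 1, 1, 1]
```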


## K-Nearest Neighbors

In [47]:
KNN = KNeighborsClassifier(n_neighbors=4)
KNN.fit(x_train, y_train)
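For intuition about what the classifier does at prediction time, here is a minimal plain-Python sketch of k-nearest-neighbours voting with Euclidean distance. The function `knn_predict` and the toy points are hypothetical. With an even `k` such as the 4 used in the notebook, votes can tie, and the tie-breaking in this sketch is not guaranteed to match scikit-learn's, so the example uses `k=3`.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Label a query point by majority vote among its k nearest neighbours."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Two hypothetical, well-separated clusters
train_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
                (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (0.5, 0.5), k=3))  # 0
print(knn_predict(train_points, train_labels, (5.5, 5.5), k=3))  # 1
```

Because the vote depends on raw distances, features on larger scales dominate unless the data is standardised first.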

In [1]:
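# Predict with the KNN model so that the metrics score KNN, not the earlier linear regression output
predictions = KNN.predict(x_test)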

In [49]:
KNN_Accuracy_Score = accuracy_score(y_test, predictions)
KNN_JaccardIndex = jaccard_score(y_test, predictions)
KNN_F1_Score = f1_score(y_test, predictions)

## Logistic Regression

In [51]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

In [52]:
LR = LogisticRegression(solver = 'liblinear')
LR.fit(x_train, y_train)

In [53]:
predictions = LR.predict(x_test)

In [54]:
predict_proba = LR.predict_proba(x_test)

In [55]:
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions)
LR_F1_Score = f1_score(y_test, predictions)
LR_Log_Loss = log_loss(y_test, predict_proba)
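Log loss is the only metric here that uses predicted probabilities rather than hard labels, which is why it is computed from `predict_proba`. The sketch below is a plain-Python version of the binary formula on hypothetical toy values. The helper name `binary_log_loss` and the clipping constant `eps` are assumptions, and `p_positive` stands for the probability of class 1 only, whereas `predict_proba` returns one column per class.

```python
import math


def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for t, p in zip(y_true, p_positive):
        # Clip to avoid log(0) for over-confident predictions
        p = min(max(p, eps), 1 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


# Hypothetical labels and predicted probabilities of the positive class
y_true_toy = [1, 0, 1]
p_toy = [0.9, 0.2, 0.6]

loss = binary_log_loss(y_true_toy, p_toy)
print(round(loss, 4))  # 0.2798
```

A model that always predicts 0.5 scores ln(2), about 0.693, so values well below that indicate probabilities that carry real information.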

## SVM

In [58]:
SVM = svm.SVC(kernel='rbf')
SVM.fit(x_train, y_train)

In [59]:
predictions = SVM.predict(x_test)

In [68]:
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions)
SVM_F1_Score = f1_score(y_test, predictions, average='weighted')
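Note that the SVM F1 score is computed with `average='weighted'`, while the other models use the default binary F1, so the numbers are not directly comparable. The plain-Python sketch below, on hypothetical imbalanced toy labels, shows how the two can differ: weighted F1 averages the per-class F1 scores using each class's share of the true labels, so a well-predicted majority class pulls the score up. The helper names are assumptions.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    classes = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, c) * y_true.count(c) / n
        for c in classes
    )


# Hypothetical imbalanced labels: six negatives, two positives
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 0]

binary = f1_for_class(y_true_toy, y_pred_toy, 1)
weighted = weighted_f1(y_true_toy, y_pred_toy)
print(binary)              # 0.5
print(round(weighted, 4))  # 0.75
```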

In [None]:
# Summarise the SVM metrics in a one-row table
Report = pd.DataFrame({'Accuracy Score': SVM_Accuracy_Score,
                      'Jaccard Score': SVM_JaccardIndex,
                      'F1_Score': SVM_F1_Score},
                      index = ['Value'])

## Decision Trees

In [61]:
Tree = DecisionTreeClassifier()
Tree.fit(x_train, y_train)

In [62]:
predictions = Tree.predict(x_test)

In [63]:
Tree_Accuracy_Score = accuracy_score(y_test, predictions)
Tree_JaccardIndex =  jaccard_score(y_test, predictions)
Tree_F1_Score = f1_score(y_test, predictions)

## Generating our Report

We then generate a report showing how well each model performed on the given dataset. This is critical in helping us decide which classification model works best with the dataset.

In [69]:
list_1 = [[KNN_Accuracy_Score, KNN_JaccardIndex, KNN_F1_Score],
         [LR_Accuracy_Score, LR_JaccardIndex, LR_F1_Score, LR_Log_Loss],
         [SVM_Accuracy_Score, SVM_JaccardIndex, SVM_F1_Score ],
         [Tree_Accuracy_Score, Tree_JaccardIndex, Tree_F1_Score]]
column=['Accuracy_score', 'Jaccard Index', 'F1 Score', 'Log loss']
# The second row holds the LogisticRegression (LR) metrics
index = ['KNN', 'Logistic_Regression', 'SVM', 'Decision_Tree']

Report = pd.DataFrame(list_1, columns = column, index = index)

In [71]:
Report

Unnamed: 0,Accuracy_score,Jaccard Index,F1 Score,Log loss
KNN,0.818321,0.425121,0.59661,
Logistic_Regression,0.835115,0.504587,0.670732,0.381427
SVM,0.770992,0.380165,0.764227,
Decision_Tree,0.770992,0.380165,0.550898,


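We can conclude that Logistic Regression is the best of the classifiers compared here for our dataset: it has the highest accuracy (0.835) and the highest Jaccard index (0.505). Note that the SVM F1 score was computed with average='weighted', so it is not directly comparable with the other F1 scores.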