"Machine Learning With Python" Final Project: Classification with Python

Instructions

In this notebook, you will practice all the classification algorithms that we have learned in this course.

Below, we use these algorithms to build models from our training data and evaluate them on our testing data using the evaluation metrics learned in the course.

We will use some of the algorithms taught in the course, specifically:

Linear Regression
KNN
Decision Trees
Logistic Regression
SVM

We will evaluate our models using:

Accuracy Score
Jaccard Index
F1-Score
LogLoss
Mean Absolute Error
Mean Squared Error
R2-Score

Finally, you will use your models to generate the report at the end.

About The Dataset

The original source of the data is the Australian Government's Bureau of Meteorology; the latest data can be gathered from http://www.bom.gov.au/climate/dwo/.

The dataset used here has extra columns such as 'RainToday', and our target is 'RainTomorrow'. It was gathered from Rattle at https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData

This dataset contains observations of weather metrics for each day from 2008 to 2017. The weatherAUS.csv dataset includes the following fields:

Field          Description                                               Unit             Type
Date           Date of the observation in YYYY-MM-DD                     Date             object
Location       Location of the observation                               Location         object
MinTemp        Minimum temperature                                       Celsius          float
MaxTemp        Maximum temperature                                       Celsius          float
Rainfall       Amount of rainfall                                        Millimeters      float
Evaporation    Amount of evaporation                                     Millimeters      float
Sunshine       Amount of bright sunshine                                 Hours            float
WindGustDir    Direction of the strongest gust                           Compass Points   object
WindGustSpeed  Speed of the strongest gust                               Kilometers/Hour  object
WindDir9am     Wind direction averaged over the 10 minutes prior to 9am  Compass Points   object
WindDir3pm     Wind direction averaged over the 10 minutes prior to 3pm  Compass Points   object
WindSpeed9am   Wind speed averaged over the 10 minutes prior to 9am      Kilometers/Hour  float
WindSpeed3pm   Wind speed averaged over the 10 minutes prior to 3pm      Kilometers/Hour  float
Humidity9am    Humidity at 9am                                           Percent          float
Humidity3pm    Humidity at 3pm                                           Percent          float
Pressure9am    Atmospheric pressure reduced to mean sea level at 9am     Hectopascal      float
Pressure3pm    Atmospheric pressure reduced to mean sea level at 3pm     Hectopascal      float
Cloud9am       Fraction of the sky obscured by cloud at 9am              Eighths          float
Cloud3pm       Fraction of the sky obscured by cloud at 3pm              Eighths          float
Temp9am        Temperature at 9am                                        Celsius          float
Temp3pm        Temperature at 3pm                                        Celsius          float
RainToday      If there was rain today                                   Yes/No           object
RainTomorrow   If there is rain tomorrow                                 Yes/No           float

Column definitions were gathered from http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml

Import the required libraries

In [65]:
import numpy as np
import pandas as pd
from sklearn import linear_model, preprocessing, svm
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import jaccard_score, f1_score, log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

Importing the Dataset

In [2]:
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'

Load the data and use the head method to view the first few rows.

In [4]:
df = pd.read_csv(path)
df.head()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,W,41,S,SSW,...,92,84,1017.6,1017.4,8,8,20.7,20.9,Yes,Yes
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,W,41,W,E,...,83,73,1017.9,1016.4,7,7,22.4,24.8,Yes,Yes
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,W,41,ESE,ESE,...,88,86,1016.7,1015.6,7,8,23.5,23.0,Yes,Yes
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,W,41,NNE,E,...,83,90,1014.2,1011.8,8,8,21.4,20.9,Yes,Yes
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,W,41,NNE,W,...,88,74,1008.3,1004.8,8,8,22.5,25.5,Yes,Yes


Data Preprocessing

One Hot Encoding

First, we need to perform one hot encoding to convert categorical variables to binary variables.

In [5]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

Next, we replace the values of the 'RainTomorrow' column, changing it from a categorical column to a binary column. We do not use the get_dummies method here because it would produce two columns for 'RainTomorrow', which we do not want since 'RainTomorrow' is our target.

In [6]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
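
The two preprocessing steps above can be sketched on a small made-up frame. The rows below are hypothetical and only the column names mirror the notebook. The sketch maps the target column explicitly with map rather than replacing values across the whole frame; that is one possible alternative, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical toy rows; only the column names follow the notebook.
toy = pd.DataFrame({
    "RainToday": ["Yes", "No", "No"],
    "WindGustDir": ["W", "E", "W"],
    "RainTomorrow": ["Yes", "Yes", "No"],
})

# One-hot encode the categorical feature columns; the target is left out.
encoded = pd.get_dummies(toy, columns=["RainToday", "WindGustDir"])

# Convert the target to 0/1 without creating two columns for it.
encoded["RainTomorrow"] = encoded["RainTomorrow"].map({"No": 0, "Yes": 1})

print(encoded.columns.tolist())
print(encoded["RainTomorrow"].tolist())
```

Each category value becomes its own indicator column, while the target stays a single 0/1 column.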

Training Data and Test Data

Now, we set our features (the x values) and our target variable Y. We first drop the 'Date' column, which is not used as a feature.

In [7]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

Use the astype method to convert the data to float.

In [8]:
df_sydney_processed = df_sydney_processed.astype(float)

In [9]:
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']
df_sydney_processed.dtypes

MinTemp           float64
MaxTemp           float64
Rainfall          float64
Evaporation       float64
Sunshine          float64
                   ...   
WindDir3pm_SSW    float64
WindDir3pm_SW     float64
WindDir3pm_W      float64
WindDir3pm_WNW    float64
WindDir3pm_WSW    float64
Length: 67, dtype: object

Use the head method to view the dataset. The data is now processed and ready for model building.

In [10]:
df_sydney_processed.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,...,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW
0,19.5,22.4,15.6,6.2,0.0,41.0,17.0,20.0,92.0,84.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,19.5,25.6,6.0,3.4,2.7,41.0,9.0,13.0,83.0,73.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,21.6,24.5,6.6,2.4,0.1,41.0,17.0,2.0,88.0,86.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,20.2,22.8,18.8,2.2,0.0,41.0,22.0,20.0,83.0,90.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,19.7,25.7,77.4,4.8,0.0,41.0,11.0,6.0,88.0,74.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


Linear Regression

Q1) Use the train_test_split function to split the features and Y dataframes with a test_size of 0.2 and the random_state set to 10.

In [11]:
x_train, x_test, y_train, y_test = train_test_split( features, Y, test_size=0.2, random_state=10)
print ('Train set:', x_train.shape,  y_train.shape)
print ('Test set:', x_test.shape,  y_test.shape)

Train set: (2616, 66) (2616,)
Test set: (655, 66) (655,)
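
A minimal sketch of how test_size and random_state behave, using made-up arrays (X_toy and y_toy are hypothetical names). With a fractional test_size, scikit-learn rounds the number of test samples up, which is consistent with the 655 test rows out of 3271 shown above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 3 features each.
X_toy = np.arange(30).reshape(10, 3)
y_toy = np.arange(10)

# Hold out 20% of the rows with a fixed seed, as in the cell above.
a_train, a_test, b_train, b_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10)

# Calling again with the same random_state reproduces the same split.
a_train2, a_test2, b_train2, b_test2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, a_test2))
```

Changing random_state gives a different partition of the same rows, so scores computed on different splits are not directly comparable.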


Q2) Create and train a Linear Regression model called LinearReg using the training data (x_train, y_train).

In [12]:
LinearReg = LinearRegression()
# 'Date' was already dropped above, so the training data can be passed directly
LinearReg.fit(x_train, y_train)
# The coefficients
print('Coefficients: ', LinearReg.coef_)

Coefficients:  [-2.36933212e-02  1.30007994e-02  7.30206238e-04  6.48991926e-03
 -3.51699778e-02  4.23739763e-03  1.83047446e-03  7.90999468e-04
  9.55896155e-04  8.56089162e-03  7.70813418e-03 -9.25470178e-03
 -8.86504271e-03  1.00331733e-02  1.44689084e-02 -3.47639901e-03
  2.14785220e+10  2.14785220e+10 -5.67846844e+09 -5.67846844e+09
 -5.67846844e+09 -5.67846844e+09 -5.67846844e+09 -5.67846844e+09
 -5.67846844e+09 -5.67846844e+09 -5.67846844e+09 -5.67846844e+09
 -5.67846844e+09 -5.67846844e+09 -5.67846844e+09 -5.67846844e+09
 -5.67846844e+09 -5.67846844e+09  1.70623652e+10  1.70623652e+10
  1.70623652e+10  1.70623652e+10  1.70623652e+10  1.70623652e+10
  1.70623652e+10  1.70623652e+10  1.70623652e+10  1.70623652e+10
  1.70623652e+10  1.70623652e+10  1.70623652e+10  1.70623652e+10
  1.70623652e+10  1.70623652e+10  8.63713855e+09  8.63713855e+09
  8.63713855e+09  8.63713855e+09  8.63713855e+09  8.63713855e+09
  8.63713855e+09  8.63713855e+09  8.63713855e+09  8.63713855e+09
  8.637138
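
The very large coefficients that repeat in blocks above are what one would typically expect when one-hot columns are perfectly collinear: each group of indicator columns sums to 1 in every row, which duplicates the model's intercept. This is offered as a plausible explanation, not a verified diagnosis of this run. The sketch below uses a made-up column to show the rank deficiency and how drop_first=True (a pandas get_dummies option, not used in this notebook) removes it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with three wind directions.
toy = pd.DataFrame({"WindGustDir": ["W", "E", "S", "W", "E", "S"]})
dummies = pd.get_dummies(toy, columns=["WindGustDir"]).astype(float)

# Exactly one indicator is active per row, so each row sums to 1.
row_sums = dummies.sum(axis=1)

# Adding an intercept column makes the design matrix rank deficient.
design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
rank_full = np.linalg.matrix_rank(design)

# Dropping one level per category removes the redundancy.
reduced = pd.get_dummies(toy, columns=["WindGustDir"], drop_first=True).astype(float)
design_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(design_reduced)

print(design.shape[1], rank_full)
print(design_reduced.shape[1], rank_reduced)
```

When the rank is lower than the number of columns, many coefficient vectors give the same predictions, so individual coefficients are not meaningful even if the fitted values are.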

Q3) Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [53]:
predictions = LinearReg.predict(x_test)
x = np.asanyarray(x_test)
y = np.asanyarray(y_test)
# The mean of the squared residuals is the mean squared error (MSE)
print("Mean squared error: %.2f"
      % np.mean((predictions - y) ** 2))

# score() returns R^2 (coefficient of determination): 1 is perfect prediction
print('R2-score: %.2f' % LinearReg.score(x, y))

Mean squared error: 0.12
R2-score: 0.42


Q4) Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

In [15]:
from sklearn.metrics import r2_score
# MAE and MSE computed by hand with NumPy; R2 via scikit-learn
LinearRegression_MAE = np.mean(np.absolute(predictions - y_test))
LinearRegression_MSE = np.mean((predictions - y_test) ** 2)
LinearRegression_R2 = r2_score(y_test, predictions)
print("Mean absolute error: %.2f" % LinearRegression_MAE)
print("Mean squared error (MSE): %.2f" % LinearRegression_MSE)
print("R2-score: %.2f" % LinearRegression_R2)

Mean absolute error: 0.26
Mean squared error (MSE): 0.12
R2-score: 0.43


In [77]:
# Same metrics via sklearn.metrics; recompute predictions so they come from LinearReg
predictions = LinearReg.predict(x_test)
LinearRegression_MAE = metrics.mean_absolute_error(y_test, predictions)
LinearRegression_MSE = metrics.mean_squared_error(y_test, predictions)
LinearRegression_R2 = metrics.r2_score(y_test, predictions)
print("Mean absolute error: %.2f" % LinearRegression_MAE)
print("Mean squared error (MSE): %.2f" % LinearRegression_MSE)
print("R2-score: %.2f" % LinearRegression_R2)

Mean absolute error: 0.26
Mean squared error (MSE): 0.12
R2-score: 0.43
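
The mean absolute error, mean squared error, and R2 can be computed either by hand with NumPy or with the sklearn.metrics functions; given the same inputs, both routes return the same numbers. A minimal sketch on made-up values (y_true and y_pred are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and continuous predictions.
y_true = np.array([0.0, 1.0, 0.0, 1.0])
y_pred = np.array([0.1, 0.8, 0.4, 0.6])

# Manual definitions.
mae_manual = np.mean(np.abs(y_pred - y_true))
mse_manual = np.mean((y_pred - y_true) ** 2)
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# Library equivalents; note the (y_true, y_pred) argument order.
mae_lib = mean_absolute_error(y_true, y_pred)
mse_lib = mean_squared_error(y_true, y_pred)
r2_lib = r2_score(y_true, y_pred)

print(mae_manual, mae_lib)
print(mse_manual, mse_lib)
print(r2_manual, r2_lib)
```

Because the results depend on which predictions are passed in, it helps to keep the regression output in its own variable before reusing the name predictions for other models.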


Q5) Show the MAE, MSE, and R2 in a tabular format using a data frame for the linear model.

In [16]:
# Build the report directly; avoid naming a variable 'dict', which shadows the built-in
Report = pd.DataFrame({'Algorithm': ['LinearRegression'],
                       'MAE': [LinearRegression_MAE],
                       'MSE': [LinearRegression_MSE],
                       'R2': [LinearRegression_R2]})

Report

Unnamed: 0,Algorithm,MAE,MSE,R2
0,LinearRegression,0.256319,0.115723,0.427121


KNN

Q6) Create and train a KNN model called KNN using the training data (x_train, y_train) with the n_neighbors parameter set to 4.

In [21]:
KNN = KNeighborsClassifier(n_neighbors=4)
KNN.fit(x_train, y_train)

Q7) Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [67]:
# Convert the test features to a contiguous NumPy array before predicting
x_test = np.ascontiguousarray(x_test)

In [69]:
predictions = KNN.predict(x_test)
predictions[0:5]

array([0., 0., 1., 0., 0.])

Q8) Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

In [72]:
# Log loss is reported only for logistic regression, so it is not computed here
KNN_Accuracy_Score = metrics.accuracy_score(y_test, predictions)
KNN_JaccardIndex = metrics.jaccard_score(y_test, predictions)
KNN_F1_Score = metrics.f1_score(y_test, predictions)
print("KNN Accuracy Score: ",KNN_Accuracy_Score)
print("KNN_JaccardIndex: ",KNN_JaccardIndex)
print("KNN F1 score : ", KNN_F1_Score)

KNN Accuracy Score:  0.9541984732824428
KNN_JaccardIndex:  0.8477157360406091
KNN F1 score :  0.9175824175824175
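
For binary labels, all three scores used above can be derived from the confusion matrix counts. A minimal sketch on made-up labels (y_true and y_pred are hypothetical), checked against the scikit-learn functions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, jaccard_score

# Made-up ground truth and predictions.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
jaccard = tp / (tp + fp + fn)                # overlap of the positive sets
f1 = 2 * tp / (2 * tp + fp + fn)             # harmonic mean of precision and recall

print(accuracy, accuracy_score(y_true, y_pred))
print(jaccard, jaccard_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```

Jaccard and F1 ignore true negatives, so on data where "No" dominates they can be much lower than accuracy.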


Decision Tree

Q9) Create and train a Decision Tree model called Tree using the training data (x_train, y_train).

In [31]:
from sklearn.tree import DecisionTreeClassifier
Tree = DecisionTreeClassifier()
Tree.fit(x_train, y_train)

Q10) Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [71]:
predictions = Tree.predict(x_test)
predictions[0:5]

array([0., 0., 0., 0., 0.])

Q11) Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

In [35]:
# scikit-learn metrics take the true labels first, then the predictions
Tree_Accuracy_Score = metrics.accuracy_score(y_test, predictions)
Tree_JaccardIndex = metrics.jaccard_score(y_test, predictions)
Tree_F1_Score = metrics.f1_score(y_test, predictions)
print("Tree Accuracy Score: ", Tree_Accuracy_Score)
print("Tree JaccardIndex : ", Tree_JaccardIndex)
print("Tree_F1_Score : ", Tree_F1_Score)

Tree Accuracy Score:  0.7465648854961832
Tree JaccardIndex :  0.3805970149253731
Tree_F1_Score :  0.5513513513513514


Logistic Regression

Q12) Use the train_test_split function to split the features and Y dataframes with a test_size of 0.2 and the random_state set to 1.

In [38]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=.2, random_state=1)
print ('Train set:', x_train.shape,  y_train.shape)
print ('Test set:', x_test.shape,  y_test.shape)

Train set: (2616, 66) (2616,)
Test set: (655, 66) (655,)


Q13) Create and train a LogisticRegression model called LR using the training data (x_train, y_train) with the solver parameter set to liblinear.

In [39]:
LR = LogisticRegression(solver='liblinear')
LR.fit(x_train, y_train)

Q14) Now, use the predict and predict_proba methods on the testing data (x_test) and save the results as two arrays, predictions and predict_proba.

In [74]:
predictions = LR.predict(x_test)
predictions

array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1.,
       0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       1., 1., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1.,
       0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0.,
       0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
       0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1.,
       0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0.,
       0., 1., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0.,
       1., 0., 1., 1., 0.

In [75]:
predict_proba = LR.predict_proba(x_test)
predict_proba

array([[0.74339483, 0.25660517],
       [0.97495683, 0.02504317],
       [0.50982014, 0.49017986],
       ...,
       [0.98010306, 0.01989694],
       [0.69834832, 0.30165168],
       [0.22120583, 0.77879417]])
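
Each row of the predict_proba output holds the probabilities of class 0 and class 1 and sums to 1; predict returns the class with the larger probability. A minimal sketch on synthetic data (the data and the name model are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data with a linear class boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression(solver="liblinear").fit(X, y)

proba = model.predict_proba(X)   # shape (n_samples, 2): [P(class 0), P(class 1)]
labels = model.predict(X)        # hard 0/1 labels

# Rows sum to 1, and the hard label is the more probable class.
print(proba[:3])
print(labels[:3])
```

The column order of predict_proba follows model.classes_, so for 0/1 targets the second column is the probability of the positive class.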

Q15) Using the predictions, predict_proba and the y_test dataframe calculate the value for each metric using the appropriate function.

In [46]:
LR_Accuracy_Score = metrics.accuracy_score(y_test, predictions)
LR_JaccardIndex = metrics.jaccard_score(y_test, predictions)
LR_F1_Score = metrics.f1_score(y_test, predictions)
LR_Log_Loss = metrics.log_loss(y_test, predictions)
print("LR Accuracy Score: ",LR_Accuracy_Score)
print("LR JaccardIndex: ",LR_JaccardIndex)
print("LR F1 score : ", LR_F1_Score)
print("LR Log Loss : ", LR_Log_Loss)

LR Accuracy Score:  0.8366412213740458
LR JaccardIndex:  0.5091743119266054
LR F1 score :  0.6747720364741641
LR Log Loss :  5.888047194863413
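
Log loss is defined on predicted probabilities, so it is usually computed from predict_proba output rather than from hard 0/1 labels. When hard labels are passed, as in the cell above, every misclassified sample is treated as a fully confident wrong answer, which probably explains the large value. A minimal sketch on made-up numbers (y_true and proba are hypothetical):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 0, 1, 1])

# Hypothetical probabilities shaped like predict_proba output:
# each row is [P(class 0), P(class 1)].
proba = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.3, 0.7],
                  [0.6, 0.4]])

# Hard labels derived from the probabilities; the last sample is misclassified.
hard = proba.argmax(axis=1).astype(float)

loss_proba = log_loss(y_true, proba)   # uses the predicted probabilities
loss_hard = log_loss(y_true, hard)     # 0/1 values are clipped near 0 and 1

# Manual check of the probability-based value.
manual = -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

print(loss_proba, manual)
print(loss_hard)
```

With probabilities, a wrong but uncertain prediction is penalized mildly; with hard labels, the single error dominates the average.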


SVM

Q16) Create and train an SVM model called SVM using the training data (x_train, y_train).

In [48]:
# SVC with default settings (RBF kernel)
SVM = svm.SVC()
SVM.fit(x_train, y_train)

Q17) Now use the predict method on the testing data (x_test) and save it to the array predictions.

In [76]:
predictions = SVM.predict(x_test)
predictions

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

Q18) Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

In [50]:
# scikit-learn metrics take the true labels first, then the predictions
SVM_Accuracy_Score = metrics.accuracy_score(y_test, predictions)
SVM_JaccardIndex = metrics.jaccard_score(y_test, predictions)
SVM_F1_Score = metrics.f1_score(y_test, predictions)
print("SVM Accuracy Score: ",SVM_Accuracy_Score)
print("SVM JaccardIndex: ",SVM_JaccardIndex)
print("SVM F1 score : ", SVM_F1_Score)

SVM Accuracy Score:  0.7221374045801526
SVM JaccardIndex:  0.0
SVM F1 score :  0.0
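
The SVM predictions shown above appear to be all 0, and a Jaccard index and F1 score of 0.0 are consistent with a model that never predicts the positive class. In that situation accuracy simply equals the share of negative samples in the test set. A minimal sketch on made-up labels (y_true is hypothetical); zero_division=0 is passed so the undefined precision is reported as 0 without a warning:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Made-up test labels: 7 negatives and 3 positives.
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# A classifier that always predicts the negative class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
jac = jaccard_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(acc, 1 - y_true.mean())
print(jac, f1)
```

Accuracy alone can therefore look acceptable on imbalanced data even when the model finds no rainy days at all, which is why the report also lists Jaccard and F1.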


Report

Q19) Show the Accuracy, Jaccard Index, F1-Score, and LogLoss in a tabular format using a data frame for all of the above models.

*LogLoss is only for the Logistic Regression model.

In [51]:
Report = pd.DataFrame({'Algorithm' : ['LogisticRegression', 'KNN', 'SVM', 'Decision Tree']})


Report['Accuracy'] = [LR_Accuracy_Score, KNN_Accuracy_Score, SVM_Accuracy_Score, Tree_Accuracy_Score]
Report['Jaccard'] = [LR_JaccardIndex, KNN_JaccardIndex, SVM_JaccardIndex, Tree_JaccardIndex]
Report['F1-Score'] = [LR_F1_Score, KNN_F1_Score, SVM_F1_Score, Tree_F1_Score]
Report['LogLoss'] = [ LR_Log_Loss, 'N/A', 'N/A', 'N/A']
Report

Unnamed: 0,Algorithm,Accuracy,Jaccard,F1-Score,LogLoss
0,LogisticRegression,0.836641,0.509174,0.674772,5.888047
1,KNN,0.818321,0.425121,0.59661,
2,SVM,0.722137,0.0,0.0,
3,Decision Tree,0.746565,0.380597,0.551351,
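
One way to assemble such a report is from a dictionary of per-model scores, leaving metrics that do not apply as missing values so the column stays numeric. The sketch below is only an illustration: the model names and scores are placeholders, not the results of this notebook.

```python
import numpy as np
import pandas as pd

# Placeholder scores for illustration only.
scores = {
    "ModelA": {"Accuracy": 0.80, "Jaccard": 0.50, "F1-Score": 0.67, "LogLoss": 0.45},
    "ModelB": {"Accuracy": 0.75, "Jaccard": 0.40, "F1-Score": 0.57},  # no LogLoss
}

# One row per model; metrics missing for a model become NaN.
report = (pd.DataFrame.from_dict(scores, orient="index")
            .rename_axis("Algorithm")
            .reset_index())

print(report)
```

Using NaN instead of a string such as 'N/A' keeps the LogLoss column as floats, so it can still be sorted or rounded.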
