# Instructions


Below, is where we are going to use the classification algorithms to create a model based on our training data and evaluate our testing data using evaluation metrics learned in the course.

We will use some of the algorithms taught in the course, specifically:

1.  Linear Regression
2.  KNN
3.  Decision Trees
4.  Logistic Regression
5.  SVM

We will evaluate our models using:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score

Finally, you will use your models to generate the report displaying the accuracy scores.


## **Import the required libraries**


In [None]:
# All Libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented.
# !mamba install -qy pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1
# Note: If your environment doesn't support "!mamba install", use "!pip install"

In [3]:
# Surpress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [4]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

### Importing the Dataset


In [5]:
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv')

df.head()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,W,41,S,SSW,...,92,84,1017.6,1017.4,8,8,20.7,20.9,Yes,Yes
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,W,41,W,E,...,83,73,1017.9,1016.4,7,7,22.4,24.8,Yes,Yes
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,W,41,ESE,ESE,...,88,86,1016.7,1015.6,7,8,23.5,23.0,Yes,Yes
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,W,41,NNE,E,...,83,90,1014.2,1011.8,8,8,21.4,20.9,Yes,Yes
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,W,41,NNE,W,...,88,74,1008.3,1004.8,8,8,22.5,25.5,Yes,Yes


### Data Preprocessing


#### Transforming Categorical Variables


First, we need to convert categorical variables to binary variables. We will use pandas `get_dummies()` method for this.


In [6]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

Next, we replace the values of the 'RainTomorrow' column changing them from a categorical column to a binary column. We do not use the `get_dummies` method because we would end up with two columns for 'RainTomorrow' and we do not want, since 'RainTomorrow' is our target.


In [7]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)

### Training Data and Test Data


Now, we set our 'features' or x values and our Y or target variable.


In [8]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [9]:
df_sydney_processed = df_sydney_processed.astype(float)

In [10]:
features = df_sydney_processed.drop(columns='RainTomorrow', axis=1)
Y = df_sydney_processed['RainTomorrow']

### Linear Regression


#### Q1) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.


In [17]:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)

In [18]:
print ('Train set:', x_train.shape,  y_train.shape)
print ('Test set:', x_test.shape,  y_test.shape) 

Train set: (2616, 66) (2616,)
Test set: (655, 66) (655,)


#### Q2) Create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).


In [19]:
LinearReg = LinearRegression()
LinearReg.fit(x_train, y_train)
print ('Coefficients: ', LinearReg.coef_)

Coefficients:  [-0.02369173  0.01300554  0.00072981  0.00649077 -0.03516427  0.00423762
  0.0018292   0.00078986  0.00095609  0.00856061  0.00769793 -0.00924424
 -0.00887454  0.01004774  0.01446555 -0.00348065 -0.05402493  0.05402493
  0.05039419 -0.07898527  0.06640003 -0.0721012  -0.05945626 -0.08239011
 -0.0789619   0.06418738 -0.00838878  0.11105128  0.01414852  0.03851666
  0.03625722 -0.02133122  0.00395909  0.01670037  0.04350405  0.05317842
 -0.00692976 -0.01911823 -0.01461142 -0.00594829 -0.07546046  0.04176858
 -0.00758587 -0.00980346 -0.01874997  0.00302978  0.01914623 -0.0012425
 -0.01756641  0.01638932 -0.09330032 -0.08339081 -0.01838672 -0.05191842
 -0.04092463  0.03423083  0.06883841  0.01862747  0.06892422  0.00033817
 -0.04820507  0.0755034   0.03967488  0.02636872 -0.02236214  0.02598199]


#### Q3) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [20]:
predictions = LinearReg.predict(x_test)
print("residual sum of squares: %.2f" % np.mean((predictions - y_test) ** 2))
print("Variance score: %.2f" % LinearReg.score(x_test, y_test))

residual sum of squares: 0.12
Variance score: 0.43


#### Q4) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [21]:
from sklearn.metrics import r2_score
LinearRegression_MAE = np.mean(np.absolute(predictions - y_test))
LinearRegression_MSE = np.mean((predictions - y_test) **2)
LinearRegression_R2 = r2_score(predictions, y_test)
print("Mean Absolute Error: %.2f" % LinearRegression_MAE)
print("Mean Squared Error: %.2f" % LinearRegression_MSE)
print("R2-Score: %.2f" % LinearRegression_R2)

Mean Absolute Error: 0.26
Mean Squared Error: 0.12
R2-Score: -0.38


#### Q5) Show the MAE, MSE, and R2 in a tabular format using data frame for the linear model.


In [22]:
from tabulate import tabulate
dict = [["LinearRegression_MAE",LinearRegression_MAE],["LinearRegression_MSE",LinearRegression_MSE],
       ["LinearRegression_R2",LinearRegression_R2]]
Report = pd.DataFrame(dict)
print(tabulate(Report))

-  --------------------  ---------
0  LinearRegression_MAE   0.256318
1  LinearRegression_MSE   0.115721
2  LinearRegression_R2   -0.38476
-  --------------------  ---------


### KNN


#### Q6) Create and train a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.


In [23]:
K = 4
KNN = KNeighborsClassifier(n_neighbors = K)
KNN.fit(x_train,y_train)

KNeighborsClassifier(n_neighbors=4)

#### Q7) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [24]:
predictions = KNN.predict(x_test)
predictions[0:5]


array([0., 0., 1., 0., 0.])

#### Q8) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [26]:
KNN_Accuracy_Score = metrics.accuracy_score(predictions, y_test)
KNN_JaccardIndex = metrics.jaccard_score(predictions, y_test)
KNN_F1_Score = metrics.f1_score(predictions, y_test)
print("KNN_Accuracy_Score: %.2f" % KNN_Accuracy_Score)
print("KNN_JaccardIndex: %.2f" % KNN_JaccardIndex)
print("KNN_F1_Score: %.2f" % KNN_F1_Score)

KNN_Accuracy_Score: 0.82
KNN_JaccardIndex: 0.43
KNN_F1_Score: 0.60


### Decision Tree


#### Q9) Create and train a Decision Tree model called Tree using the training data (`x_train`, `y_train`).


In [27]:
Tree = DecisionTreeClassifier(criterion = 'entropy', max_depth = 4)
Tree.fit(x_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=4)

#### Q10) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [29]:
predictions = Tree.predict(x_test)

#### Q11) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [30]:
Tree_Accuracy_Score = metrics.accuracy_score(predictions, y_test)
Tree_JaccardIndex = metrics.jaccard_score(predictions, y_test)
Tree_F1_Score = metrics.f1_score(predictions, y_test)
print("Tree_Accuracy_Score: %.2f" % Tree_Accuracy_Score)
print("Tree_JaccardIndex: %.2f" % Tree_JaccardIndex)
print("Tree_F1_Score: %.2f" % Tree_F1_Score)

Tree_Accuracy_Score: 0.82
Tree_JaccardIndex: 0.48
Tree_F1_Score: 0.65


### Logistic Regression


#### Q12) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.


In [31]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size = 0.2, random_state = 1)
print ('Train set:', x_train.shape,  y_train.shape)
print ('Test set:', x_test.shape,  y_test.shape)

Train set: (2616, 66) (2616,)
Test set: (655, 66) (655,)


#### Q13) Create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.


In [32]:
LR = LogisticRegression(solver = "liblinear")
LR.fit(x_train,y_train)

LogisticRegression(solver='liblinear')

#### Q14) Now, use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [33]:
predictions = LR.predict(x_test)

#### Q15) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [34]:
LR_Accuracy_Score = metrics.accuracy_score(predictions, y_test)
LR_JaccardIndex = metrics.jaccard_score(predictions, y_test)
LR_F1_Score = metrics.f1_score(predictions, y_test)
LR_Log_Loss = metrics.log_loss(predictions, y_test)
print("LR_Accuracy_Score: %.2f" % LR_Accuracy_Score)
print("LR_JaccardIndex: %.2f" % LR_JaccardIndex)
print("LR_F1_Score: %.2f" % LR_F1_Score)
print("LR_Log_Loss: %.2f" % LR_Log_Loss)

LR_Accuracy_Score: 0.84
LR_JaccardIndex: 0.51
LR_F1_Score: 0.67
LR_Log_Loss: 5.64


### SVM


#### Q16) Create and train a SVM model called SVM using the training data (`x_train`, `y_train`).


In [35]:
SVM = svm.SVC(kernel = 'linear')
SVM.fit(x_train,y_train)

SVC(kernel='linear')

#### Q17) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [36]:
predictions = SVM.predict(x_test)

#### Q18) Using the `predictions` and the `y_test` dataframe calculate the value for each metric using the appropriate function.


In [37]:
SVM_Accuracy_Score = metrics.accuracy_score(predictions, y_test)
SVM_JaccardIndex = metrics.jaccard_score(predictions, y_test)
SVM_F1_Score = metrics.f1_score(predictions, y_test)
print("SVM_Accuracy_Score: %.2f" % SVM_Accuracy_Score)
print("SVM_JaccardIndex: %.2f" % SVM_JaccardIndex)
print("SVM_F1_Score: %.2f" % SVM_F1_Score)

SVM_Accuracy_Score: 0.84
SVM_JaccardIndex: 0.51
SVM_F1_Score: 0.68


### Report


#### Q19) Show the Accuracy,Jaccard Index,F1-Score and LogLoss in a tabular format using data frame for all of the above models.

\*LogLoss is only for Logistic Regression Model


In [38]:
from tabulate import tabulate
#dict = [["LinearRegression_MAE",LinearRegression_MAE],["LinearRegression_MSE",LinearRegression_MSE],
#       ["LinearRegression_R2",LinearRegression_R2]]
dict1 = {'LinearRegression' : [LinearRegression_MAE,LinearRegression_MSE,LinearRegression_R2],
         'KNN' : [KNN_Accuracy_Score,KNN_JaccardIndex,KNN_F1_Score],
         'DecisionTree' : [Tree_Accuracy_Score,Tree_JaccardIndex,Tree_F1_Score],
         'LogisticRegression' : [LR_Accuracy_Score,LR_JaccardIndex,LR_F1_Score],
         'SVM' : [SVM_Accuracy_Score,SVM_JaccardIndex,SVM_F1_Score]
        }
dict2 = [["LinearRegression_MAE",LinearRegression_MAE],["LinearRegression_MSE",LinearRegression_MSE],
         ["LinearRegression_R2",LinearRegression_R2],
         ["KNN_Accuracy_Score",KNN_Accuracy_Score],["KNN_JaccardIndex",KNN_JaccardIndex],
         ["KNN_F1_Score",KNN_F1_Score],
         ["Tree_Accuracy_Score",Tree_Accuracy_Score],["Tree_JaccardIndex",Tree_JaccardIndex],
         ["Tree_F1_Score",Tree_F1_Score],
         ["LR_Accuracy_Score",LR_Accuracy_Score],["LR_JaccardIndex",LR_JaccardIndex],
         ["LR_F1_Score",LR_F1_Score],["LR_log_Loss",LR_Log_Loss],
         ["SVM_Accuracy_Score",SVM_Accuracy_Score],["SVM_JaccardIndex",SVM_JaccardIndex],
         ["SVM_F1_Score",SVM_F1_Score]]
Report = pd.DataFrame(data=dict2)
#print(tabulate(Report, headers = ['Accuracy','Jaccard Index','F1-Score', 'LogLoss']))
print(tabulate(Report))


--  --------------------  ---------
 0  LinearRegression_MAE   0.256318
 1  LinearRegression_MSE   0.115721
 2  LinearRegression_R2   -0.38476
 3  KNN_Accuracy_Score     0.818321
 4  KNN_JaccardIndex       0.425121
 5  KNN_F1_Score           0.59661
 6  Tree_Accuracy_Score    0.818321
 7  Tree_JaccardIndex      0.480349
 8  Tree_F1_Score          0.648968
 9  LR_Accuracy_Score      0.836641
10  LR_JaccardIndex        0.509174
11  LR_F1_Score            0.674772
12  LR_log_Loss            5.6423
13  SVM_Accuracy_Score     0.841221
14  SVM_JaccardIndex       0.514019
15  SVM_F1_Score           0.679012
--  --------------------  ---------
