
<h1 align="center"><font size="5">Project: Classification with Python</font></h1>


<h2>Table of Contents</h2>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
    <li>Instructions</li>
    <li>About the Data</li>
    <li>Importing Data</li>
    <li>Data Preprocessing</li>
    <li>One Hot Encoding</li>
    <li>Train and Test Data Split</li>
    <li>Train Logistic Regression, KNN, Decision Tree, SVM, and Linear Regression models and return their appropriate accuracy scores</li>
    </ul>
</div>

<hr>


# Instructions


In this notebook, we will practice the classification algorithms covered in this course.


We will use these algorithms to build models from our training data and then evaluate them on our test data using the evaluation metrics learned in the course.

Specifically, we will use the following algorithms:

1. Linear Regression
2. KNN
3. Decision Trees
4. Logistic Regression
5. SVM

We will evaluate our models using:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score
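
The classification metrics in this list all derive from the counts of a confusion matrix. As a rough illustration, the sketch below computes accuracy, the Jaccard index, the F1-score, and log loss in plain Python for the positive class (label 1), which is how scikit-learn's `jaccard_score` and `f1_score` treat binary labels by default. The toy labels and probabilities are invented for this example and do not come from the weather data.

```python
import math

def binary_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = binary_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)

def jaccard(y_true, y_pred):
    # Intersection over union of the true and predicted positive sets.
    tp, fp, fn, _ = binary_counts(y_true, y_pred)
    return tp / (tp + fp + fn)

def f1(y_true, y_pred):
    # Harmonic mean of precision and recall, written in terms of counts.
    tp, fp, fn, _ = binary_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)

def log_loss(y_true, p_pos):
    """Mean negative log-likelihood, given P(class 1) for each sample."""
    total = sum(math.log(p if t == 1 else 1 - p) for t, p in zip(y_true, p_pos))
    return -total / len(y_true)

# Hypothetical toy data (not from the weather dataset).
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
p_pos = [0.9, 0.2, 0.4, 0.7, 0.1, 0.6, 0.3, 0.45]

print("accuracy:", accuracy(y_true, y_pred))
print("jaccard :", jaccard(y_true, y_pred))
print("f1      :", f1(y_true, y_pred))
print("log loss:", log_loss(y_true, p_pos))
```

For binary labels the two overlap metrics are tied together by F1 = 2J / (1 + J), so they always rank models in the same order; accuracy, by contrast, also rewards true negatives.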

Finally, we will use these models to generate the report at the end.


# About The Dataset


The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset used here has extra columns such as 'RainToday', and our target is 'RainTomorrow'. It was gathered from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)




This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)



## **Import the required libraries**


In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, jaccard_score, log_loss
import sklearn.metrics as metrics

### Importing the Dataset


In [2]:
df = pd.read_csv("Weather_Data.csv")  # load the CSV file into a DataFrame

> Note: you can use any dataset; this notebook uses the weather data. You can download other data from kaggle.com and apply all of the models below. We are simply comparing which model performs best on the given dataset, based on the computed metrics and accuracy scores.


In [3]:
df.head()  # return the first 5 rows of the dataset

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,W,41,S,SSW,...,92,84,1017.6,1017.4,8,8,20.7,20.9,Yes,Yes
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,W,41,W,E,...,83,73,1017.9,1016.4,7,7,22.4,24.8,Yes,Yes
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,W,41,ESE,ESE,...,88,86,1016.7,1015.6,7,8,23.5,23.0,Yes,Yes
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,W,41,NNE,E,...,83,90,1014.2,1011.8,8,8,21.4,20.9,Yes,Yes
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,W,41,NNE,W,...,88,74,1008.3,1004.8,8,8,22.5,25.5,Yes,Yes


### Data Preprocessing


#### One Hot Encoding


First, we need to perform one hot encoding to convert categorical variables to binary variables.


In [4]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

Next, we replace the values of the 'RainTomorrow' column, changing it from a categorical column to a binary column. We do not use the `get_dummies` method here because we would end up with two columns for 'RainTomorrow', which we do not want, since 'RainTomorrow' is our target.
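
To make the difference concrete, here is a small sketch on a made-up three-row frame (only the column names mirror the dataset; the values are invented). It one-hot encodes a feature with `get_dummies` and encodes the target as a single 0/1 column. It uses `Series.map` on the target only, which is an alternative to the frame-wide `replace` used in this notebook, not the notebook's own method.

```python
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "MinTemp": [19.5, 12.1, 15.3],
    "RainToday": ["Yes", "No", "No"],
    "RainTomorrow": ["Yes", "Yes", "No"],
})

# One-hot encode the feature: one indicator column per category.
encoded = pd.get_dummies(data=toy, columns=["RainToday"])

# Encode the target as a single 0/1 column instead of two indicator columns.
encoded["RainTomorrow"] = encoded["RainTomorrow"].map({"No": 0, "Yes": 1})

print(encoded.columns.tolist())
print(encoded["RainTomorrow"].tolist())
```

The frame-wide `replace(['No', 'Yes'], [0, 1])` gives the same result in this notebook because, once 'RainToday' has been one-hot encoded, 'RainTomorrow' is the only column that still contains those strings.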


In [5]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)

### Training Data and Test Data


Now, we set our features (the X values) and our target variable, Y.


In [6]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [7]:
df_sydney_processed = df_sydney_processed.astype(float)

In [8]:
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']

### Linear Regression


In [14]:
x_train, x_test, y_train, y_test = train_test_split(features,Y,test_size=0.2,random_state=42)
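
The call above holds out 20% of the rows for testing. If rainy days turn out to be a minority class, one optional refinement, which this notebook does not use, is the `stratify` argument of `train_test_split`: it keeps the class ratio similar in the training and test parts. A minimal sketch on synthetic data (the arrays and the roughly 25% positive rate are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # 100 synthetic rows, 3 features
y = (rng.random(100) < 0.25).astype(int)    # roughly 25% positives (made up)

# stratify=y asks for the same share of positives in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(x_tr.shape, x_te.shape)
print("positive share, train vs. test:", y_tr.mean(), y_te.mean())
```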

In [15]:
LinearReg = LinearRegression()
LinearReg.fit(x_train,y_train)

In [16]:
predictions = LinearReg.predict(x_test)

In [17]:
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

In [18]:
LinearRegression_MAE = mean_absolute_error(y_test,predictions)
LinearRegression_MSE = mean_squared_error(y_test,predictions)
LinearRegression_R2 = r2_score(y_test,predictions)
print("mean absolute error is : %.2f"% LinearRegression_MAE)
print("mean squared error is : %.2f"% LinearRegression_MSE)
print("r2 score is : %.2f"% LinearRegression_R2)

mean absolute error is : 0.27
mean squared error is : 0.13
r2 score is : 0.34
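
Because 'RainTomorrow' is coded as 0/1, `LinearRegression` returns continuous scores rather than class labels, which is why this model is evaluated with error metrics instead of accuracy. The sketch below shows how the three metrics are defined, using plain Python and four invented values. The 0.5 threshold at the end is an assumption added for illustration; the notebook does not convert the linear model's output to labels.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: average squared error, penalizing large misses."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """R2: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical targets and continuous model outputs.
y_true = [0, 1, 0, 1]
y_score = [0.1, 0.8, 0.3, 0.6]

print("MAE:", mae(y_true, y_score))
print("MSE:", mse(y_true, y_score))
print("R2 :", r2(y_true, y_score))

# Assumed threshold: turn scores into labels only if class predictions are needed.
labels = [1 if s >= 0.5 else 0 for s in y_score]
print("labels:", labels)
```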


The MAE, MSE, and R2 for the linear model are shown below in tabular format using a DataFrame.


In [19]:
from tabulate import tabulate

# Use a descriptive name; calling the variable `dict` would shadow the built-in type.
linear_metrics = {'error_type':['LinearRegression_MAE','LinearRegression_MSE','LinearRegression_R2'],
                  'value':[LinearRegression_MAE,LinearRegression_MSE,LinearRegression_R2]}
Report = pd.DataFrame(linear_metrics)
print(tabulate(Report,headers='keys',tablefmt='psql'))

+----+----------------------+----------+
|    | error_type           |    value |
|----+----------------------+----------|
|  0 | LinearRegression_MAE | 0.2705   |
|  1 | LinearRegression_MSE | 0.131716 |
|  2 | LinearRegression_R2  | 0.336774 |
+----+----------------------+----------+


### KNN


In [27]:
KNN = KNeighborsClassifier(n_neighbors=4)
KNN.fit(x_train,y_train)

In [28]:
predictions = KNN.predict(x_test)
print(predictions[0:5])

[0. 1. 0. 0. 1.]


In [29]:
KNN_Accuracy_Score = accuracy_score(y_test,predictions)
KNN_JaccardIndex = jaccard_score(y_test,predictions)
KNN_F1_Score = f1_score(y_test,predictions)
print("KNN Accuracy Score: ",KNN_Accuracy_Score)
print("KNN_JaccardIndex: ",KNN_JaccardIndex)
print("KNN F1 score : ", KNN_F1_Score)

KNN Accuracy Score:  0.7923664122137405
KNN_JaccardIndex:  0.3492822966507177
KNN F1 score :  0.5177304964539007
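
KNN classifies by distance, so a feature measured on a large scale (pressure, around 1000 hPa) can dominate one measured on a small scale (cloud cover, 0 to 8). This notebook imports `preprocessing` but fits KNN on unscaled features. As an optional variation that is not part of the original workflow, the sketch below standardizes the features inside a pipeline before KNN. The data and the labelling rule are synthetic and chosen only for illustration, so the printed score says nothing about the weather dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
humidity = rng.uniform(20, 100, n)     # percent (synthetic)
pressure = rng.normal(1015, 7, n)      # hectopascal (synthetic)
X = np.column_stack([humidity, pressure])
# Invented rule: "rain" is more likely when humidity is high.
y = (humidity + rng.normal(0, 10, n) > 70).astype(int)

x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# The scaler is fitted on the training part only, then applied to the test part.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=4))
scaled_knn.fit(x_tr, y_tr)

scaled_accuracy = scaled_knn.score(x_te, y_te)
print("KNN accuracy with scaling:", scaled_accuracy)
```

Wrapping the scaler and the classifier in one pipeline keeps the test rows out of the scaling statistics, which avoids leaking information from the test set.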


### Decision Tree


In [42]:
Tree = DecisionTreeClassifier(criterion = "entropy")
Tree.fit(x_train,y_train)

In [44]:
predictions = Tree.predict(x_test)
predictions[0:5]

array([0., 1., 0., 0., 1.])

In [46]:
Tree_Accuracy_Score = accuracy_score(y_test,predictions)
Tree_JaccardIndex = jaccard_score(y_test,predictions)
Tree_F1_Score = f1_score(y_test,predictions)
print("Tree accuracy score: ", Tree_Accuracy_Score)
print("Tree JaccardIndex : ", Tree_JaccardIndex)
print("Tree_F1_Score : ", Tree_F1_Score)

Tree accuracy score:  0.7526717557251908
Tree JaccardIndex :  0.36470588235294116
Tree_F1_Score :  0.5344827586206896


### Logistic Regression


In [47]:
x_train, x_test, y_train, y_test = train_test_split(features,Y,test_size=0.2,random_state=1)

In [48]:
LR = LogisticRegression(solver='liblinear')
LR.fit(x_train,y_train)

In [49]:
predictions = LR.predict(x_test)

In [50]:
predict_proba = LR.predict_proba(x_test)
print("Predicted labels:\n", predictions[0:5])
print("\nProbability estimates for each class:\n", predict_proba[0:5])

Predicted labels:
 [0. 0. 0. 0. 0.]

Probability estimates for each class:
 [[0.74574813 0.25425187]
 [0.97506424 0.02493576]
 [0.50824637 0.49175363]
 [0.84727479 0.15272521]
 [0.9684321  0.0315679 ]]
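
The two outputs above are directly related: for a binary model, `predict` returns class 1 only when the second column, P(class 1), exceeds 0.5. The sketch below reproduces the five predicted labels from the five printed probability rows and shows what each row would contribute to the log loss under either possible true label. The true labels of these rows are not shown in the notebook, so both cases are hypothetical.

```python
import math

# First five rows of predict_proba as printed above: [P(class 0), P(class 1)].
proba = [
    [0.74574813, 0.25425187],
    [0.97506424, 0.02493576],
    [0.50824637, 0.49175363],
    [0.84727479, 0.15272521],
    [0.9684321, 0.0315679],
]

# Pick the most probable class in each row.
labels = [max(range(2), key=lambda k: row[k]) for row in proba]
print(labels)  # [0, 0, 0, 0, 0]

# Per-row log loss contribution if the true label were 0, or if it were 1.
loss_if_no = [-math.log(p0) for p0, _ in proba]
loss_if_yes = [-math.log(p1) for _, p1 in proba]
for row, a, b in zip(proba, loss_if_no, loss_if_yes):
    print(row, "loss if No:", round(a, 3), "loss if Yes:", round(b, 3))
```

A confident prediction is cheap when right and expensive when wrong, while the near 50/50 third row costs about the same either way. Accuracy cannot see this difference, which is why log loss is reported alongside it.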


In [51]:
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test,predictions)
LR_F1_Score = f1_score(y_test,predictions)
LR_Log_Loss = log_loss(y_test,predict_proba)
print("Accuracy is :",LR_Accuracy_Score)
print("Jaccard index is :",LR_JaccardIndex)
print("F1 Score is : ",LR_F1_Score)
print("Log Loss is",LR_Log_Loss)

Accuracy is : 0.8366412213740458
Jaccard index is : 0.5091743119266054
F1 Score is :  0.6747720364741642
Log Loss is 0.3812590636097066


### SVM


In [52]:
SVM = svm.SVC(kernel = 'linear')
SVM.fit(x_train,y_train)

In [53]:
predictions = SVM.predict(x_test)

In [54]:
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions)
SVM_F1_Score = f1_score(y_test, predictions)
print("SVM accuracy score : ", SVM_Accuracy_Score)
print("SVM jaccardIndex : ", SVM_JaccardIndex)
print("SVM F1_score : ", SVM_F1_Score)

SVM accuracy score :  0.8458015267175573
SVM jaccardIndex :  0.5345622119815668
SVM F1_score :  0.6966966966966966


### Report


#### Display the Accuracy, Jaccard Index, F1-Score, and LogLoss for all of the above models in tabular format using a DataFrame.

 **\*LogLoss is reported only for the Logistic Regression model.**


In [55]:
d = {'KNN':[KNN_Accuracy_Score,KNN_JaccardIndex,KNN_F1_Score,"Null"],
     'Tree':[Tree_Accuracy_Score, Tree_JaccardIndex, Tree_F1_Score, "Null"],
     'LR':[LR_Accuracy_Score, LR_JaccardIndex, LR_F1_Score,LR_Log_Loss],
     'SVM':[SVM_Accuracy_Score, SVM_JaccardIndex, SVM_F1_Score, "Null"]}
Report = pd.DataFrame(data=d, index = ['Accuracy','Jaccard Index','F1-Score', 'LogLoss'])
print(tabulate(Report, headers = 'keys', tablefmt = 'psql'))

+---------------+--------------------+---------------------+----------+--------------------+
|               | KNN                | Tree                |       LR | SVM                |
|---------------+--------------------+---------------------+----------+--------------------|
| Accuracy      | 0.7923664122137405 | 0.7526717557251908  | 0.836641 | 0.8458015267175573 |
| Jaccard Index | 0.3492822966507177 | 0.36470588235294116 | 0.509174 | 0.5345622119815668 |
| F1-Score      | 0.5177304964539007 | 0.5344827586206896  | 0.674772 | 0.6966966966966966 |
| LogLoss       | Null               | Null                | 0.381259 | Null               |
+---------------+--------------------+---------------------+----------+--------------------+
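
Mixing the string "Null" with numbers turns three of the four columns into text, which is why only the LR column is printed in rounded form. One possible alternative, sketched here with the scores from the table above rounded to four decimals, is to mark missing entries with `np.nan` so that every column stays numeric and can be rounded or compared. The variable names `scores`, `report`, and `best` are illustrative only.

```python
import numpy as np
import pandas as pd

# Scores copied from the report above, rounded to four decimals.
scores = {
    "KNN": [0.7924, 0.3493, 0.5177, np.nan],
    "Tree": [0.7527, 0.3647, 0.5345, np.nan],
    "LR": [0.8366, 0.5092, 0.6748, 0.3813],
    "SVM": [0.8458, 0.5346, 0.6967, np.nan],
}
report = pd.DataFrame(
    scores, index=["Accuracy", "Jaccard Index", "F1-Score", "LogLoss"]
)

print(report.round(3))

# Best model per metric (higher is better for these three rows).
best = report.drop(index="LogLoss").idxmax(axis=1)
print(best)
```

On these numbers SVM scores highest on all three metrics, with Logistic Regression close behind. Note that KNN and the Decision Tree were evaluated on a split made with `random_state=42`, while Logistic Regression and SVM used `random_state=1`, so the comparison across those two groups is only approximate.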
