# FINAL PROJECT: CLASSIFICATION WITH PYTHON

<h2>Table of Contents</h2>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
    <li><a href="https://#Section_1">Instructions</a></li>
    <li><a href="https://#Section_2">About the Data</a></li>
    <li><a href="https://#Section_3">Importing Data </a></li>
    <li><a href="https://#Section_4">Data Preprocessing</a> </li>
    <li><a href="https://#Section_5">One Hot Encoding </a></li>
    <li><a href="https://#Section_6">Train and Test Data Split </a></li>
    <li><a href="https://#Section_7">Train Logistic Regression, KNN, Decision Tree, SVM, and Linear Regression models and return their appropriate accuracy scores</a></li>
    </ul>
</div>

# Instructions

In this notebook, you will practice the classification algorithms covered in this course.

You will train each model on the training data and then evaluate it on the test data using the evaluation metrics covered in the course.

We will use some of the algorithms taught in the course, specifically:

1. Linear Regression
2. KNN
3. Decision Trees
4. Logistic Regression
5. SVM

We will evaluate our models using:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score
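
To make the first three scores concrete, here is a minimal sketch that computes them by hand from confusion-matrix counts. The helper name `binary_scores` and the toy label lists are illustrative assumptions, not part of the course material; in the notebook itself the scikit-learn functions `accuracy_score`, `jaccard_score`, and `f1_score` do this work.

```python
def binary_scores(y_true, y_pred):
    """Return (accuracy, Jaccard index, F1) for binary labels, positive class = 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Both scores ignore true negatives; return 0.0 when there is nothing to score.
    denom = tp + fp + fn
    jaccard = tp / denom if denom else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if denom else 0.0
    return accuracy, jaccard, f1


# Made-up labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy, jaccard, f1 = binary_scores(y_true, y_pred)
print(accuracy, jaccard, f1)  # 0.75 0.6 0.75
```

Because the Jaccard index and F1-score leave out true negatives, they can be much lower than accuracy when the positive class is rare.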

Finally, you will use your models to generate the report at the end. 


# About The Dataset

The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset used here has extra columns such as 'RainToday' and our target, 'RainTomorrow'. It was gathered from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).


This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)


## **Import the required libraries**

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

### Importing the Dataset


In [2]:
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'

In [3]:
df = pd.read_csv(path)
df.head()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,W,41,S,SSW,...,92,84,1017.6,1017.4,8,8,20.7,20.9,Yes,Yes
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,W,41,W,E,...,83,73,1017.9,1016.4,7,7,22.4,24.8,Yes,Yes
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,W,41,ESE,ESE,...,88,86,1016.7,1015.6,7,8,23.5,23.0,Yes,Yes
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,W,41,NNE,E,...,83,90,1014.2,1011.8,8,8,21.4,20.9,Yes,Yes
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,W,41,NNE,W,...,88,74,1008.3,1004.8,8,8,22.5,25.5,Yes,Yes


### Data Preprocessing

#### One Hot Encoding

First, we need to perform one hot encoding to convert categorical variables to binary variables.


In [4]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

Next, we replace the values of the 'RainTomorrow' column, changing it from a categorical column to a binary column. We do not use the `get_dummies` method because we would end up with two columns for 'RainTomorrow', which we do not want since 'RainTomorrow' is our target.


In [5]:
df_sydney_processed.replace(['No','Yes'], [0,1], inplace=True)
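
As a small illustration of the two preprocessing steps, the sketch below one-hot encodes the feature columns of a toy frame and converts the target to 0/1 separately. The three rows are made up, and mapping only the target column is an alternative to the frame-wide `replace` used in this notebook, shown here as an assumption about what is intended rather than as the notebook's own method.

```python
import pandas as pd

# Toy frame with hypothetical values; the column names mirror the weather data.
toy = pd.DataFrame({
    "WindGustDir": ["W", "E", "W"],
    "RainToday": ["Yes", "No", "Yes"],
    "RainTomorrow": ["Yes", "Yes", "No"],
})

# One-hot encode the feature columns only; each category becomes its own column.
encoded = pd.get_dummies(toy, columns=["RainToday", "WindGustDir"])

# Map the target to 0/1 explicitly so that it stays a single column.
encoded["RainTomorrow"] = encoded["RainTomorrow"].map({"No": 0, "Yes": 1})

print(sorted(encoded.columns))
print(encoded["RainTomorrow"].tolist())  # [1, 1, 0]
```

Restricting the mapping to the target column leaves every other column untouched, which can be easier to reason about than replacing 'Yes' and 'No' across the whole frame.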

### Training Data and Test Data

Now, we set our features (the x values) and our target variable, Y.

In [6]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [7]:
df_sydney_processed = df_sydney_processed.astype(float)

In [9]:
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']

### Linear Regression

#### Q1) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.

In [11]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(features,Y,test_size=0.2,random_state=10)
print('Train set:',x_train.shape,y_train.shape)
print('Test_set:',x_test.shape,y_test.shape)

Train set: (2616, 66) (2616,)
Test_set: (655, 66) (655,)
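
The shapes above are consistent with how, as far as we know, `train_test_split` sizes the two sets when `test_size` is a fraction: the test count is rounded up and the training set receives the remainder. The sketch below reproduces that arithmetic. The helper name `split_sizes` is hypothetical, and the total of 3271 rows is inferred from the printed shapes (2616 + 655) rather than read from the data.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test count up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 0.2 * 3271 = 654.2, which rounds up to 655 test rows.
print(split_sizes(3271, 0.2))  # (2616, 655)
```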


#### Q2) Create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).

In [13]:
from sklearn import linear_model
# Create a Linear Regression model
LinearReg = linear_model.LinearRegression()
# Train a linear Regression model
LinearReg.fit(x_train,y_train)

#### Q3) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [16]:
# Use the trained Linear Regression model to make predictions on the testing data
predictions = LinearReg.predict(x_test)


#### Q4) Using `predictions` and `y_test`, calculate the value of each metric with the appropriate function.


In [18]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Calculate Mean Absolute Error (MAE)
LinearRegression_MAE = mean_absolute_error(y_test, predictions)

# Calculate Mean Squared Error (MSE)
LinearRegression_MSE = mean_squared_error(y_test, predictions)

# Calculate R-squared (R2) score
LinearRegression_R2 = r2_score(y_test, predictions)

print("LinearRegression_MAE =", LinearRegression_MAE)
print("LinearRegression_MSE =", LinearRegression_MSE)
print("LinearRegression_R2 =", LinearRegression_R2)

LinearRegression_MAE = 0.2563250126729485
LinearRegression_MSE = 0.1157232929703616
LinearRegression_R2 = 0.4271186909603826
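
For reference, the three regression metrics can be written out directly from their definitions. The sketch below is a minimal version on made-up numbers; the helper name `regression_scores` is hypothetical, and the notebook itself relies on scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score`.

```python
def regression_scores(y_true, y_pred):
    """Return (MAE, MSE, R2) computed from their textbook definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(r) for r in residuals) / n  # mean absolute error
    mse = sum(r * r for r in residuals) / n   # mean squared error

    # R2 compares the residual sum of squares with the spread of y_true around its mean.
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mae, mse, r2


# Binary targets with continuous predictions, as a linear model would produce.
y_true = [0, 1, 0, 1]
y_pred = [0.2, 0.7, 0.4, 0.9]

mae, mse, r2 = regression_scores(y_true, y_pred)
print(round(mae, 3), round(mse, 3), round(r2, 3))  # 0.25 0.075 0.7
```

Note that a linear regression model fitted to a 0/1 target returns continuous values, which is why it is scored here with regression metrics rather than accuracy.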


#### Q5) Show the MAE, MSE, and R2 in a tabular format using a data frame for the linear model.

In [21]:
# Create a DataFrame to store the metrics
Report = pd.DataFrame({'Metric': ['MAE', 'MSE', 'R2'], 'Linear Regression': [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]})
print(Report)

  Metric  Linear Regression
0    MAE           0.256325
1    MSE           0.115723
2     R2           0.427119


### KNN


#### Q6) Create and train a KNN model called KNN using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.


In [25]:
# Create a KNN model with n_neighbors set to 4 and train the KNN model using the training data
KNN = KNeighborsClassifier(n_neighbors=4).fit(x_train,y_train)
KNN
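
To show what the classifier does when it predicts, here is a minimal pure-Python sketch of k-nearest-neighbours voting with `k=4`. The function `knn_predict`, the toy points, and the tie-breaking rule (smallest label wins) are assumptions made for illustration; `KNeighborsClassifier` has its own implementation and options.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=4):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    # With an even k a tie is possible; this sketch picks the smallest tied label.
    best_count = max(votes.values())
    return min(label for label, count in votes.items() if count == best_count)


# Two well-separated toy clusters: label 0 near (1, 1) and label 1 near (4, 4).
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))  # 0
print(knn_predict(train_points, train_labels, (4.0, 4.0)))  # 1
```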

#### Q7) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [24]:
# Use the trained KNN model to make predictions on the testing data
predictions = KNN.predict(x_test)

#### Q8) Using `predictions` and `y_test`, calculate the value of each metric with the appropriate function.


In [27]:
from sklearn.metrics import accuracy_score, jaccard_score, f1_score

# Calculate Accuracy Score
KNN_Accuracy_Score = accuracy_score(y_test, predictions)

# Calculate Jaccard Index
KNN_JaccardIndex = jaccard_score(y_test, predictions)

# Calculate F1 Score
KNN_F1_Score = f1_score(y_test, predictions)

print("KNN_Accuracy_Score =",KNN_Accuracy_Score)
print("KNN_JaccardIndex =",KNN_JaccardIndex)
print("KNN_F1_Score =",KNN_F1_Score)

KNN_Accuracy_Score = 0.8183206106870229
KNN_JaccardIndex = 0.4251207729468599
KNN_F1_Score = 0.5966101694915255


### Decision Tree

#### Q9) Create and train a Decision Tree model called Tree using the training data (`x_train`, `y_train`).

In [28]:
# Create a Decision Tree model
Tree = DecisionTreeClassifier()

# Train the model using the training data
Tree.fit(x_train, y_train)

#### Q10) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.

In [37]:
# Use the trained Decision Tree model to make predictions on the testing data
predictions = Tree.predict(x_test)

#### Q11) Using `predictions` and `y_test`, calculate the value of each metric with the appropriate function.

In [30]:
from sklearn.metrics import accuracy_score, jaccard_score, f1_score

# Calculate Accuracy Score
Tree_Accuracy_Score = accuracy_score(y_test, predictions)

# Calculate Jaccard Index
Tree_JaccardIndex = jaccard_score(y_test, predictions)

# Calculate F1 Score
Tree_F1_Score = f1_score(y_test, predictions)

print("Tree_Accuracy_Score =",Tree_Accuracy_Score)
print("Tree_JaccardIndex =",Tree_JaccardIndex)
print("Tree_F1_Score =",Tree_F1_Score)

Tree_Accuracy_Score = 0.7633587786259542
Tree_JaccardIndex = 0.4083969465648855
Tree_F1_Score = 0.5799457994579945


### Logistic Regression


#### Q12) Use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1`.


In [32]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)
print('Train set:',x_train.shape,y_train.shape)
print('Test_set:',x_test.shape,y_test.shape)

Train set: (2616, 66) (2616,)
Test_set: (655, 66) (655,)


#### Q13) Create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.


In [33]:
# Create a Logistic Regression model with solver set to 'liblinear'
LR = LogisticRegression(solver='liblinear')

# Train the model using the training data
LR.fit(x_train, y_train)

#### Q14) Now, use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save the results as two arrays, `predictions` and `predict_proba`.

In [38]:
# Use the trained Logistic Regression model to make predictions on the testing data
predictions = LR.predict(x_test)

# Use the trained Logistic Regression model to get class probabilities for the testing data
predict_proba = LR.predict_proba(x_test)

#### Q15) Using `predictions`, `predict_proba`, and `y_test`, calculate the value of each metric with the appropriate function.


In [40]:
from sklearn.metrics import accuracy_score, jaccard_score, f1_score, log_loss

# Calculate Accuracy Score
LR_Accuracy_Score = accuracy_score(y_test, predictions)

# Calculate Jaccard Index
LR_JaccardIndex = jaccard_score(y_test, predictions)

# Calculate F1 Score
LR_F1_Score = f1_score(y_test, predictions)

# Calculate Log Loss
LR_Log_Loss = log_loss(y_test, predict_proba)

print("LR_Accuracy_Score =",LR_Accuracy_Score)
print("LR_JaccardIndex =",LR_JaccardIndex)
print("LR_F1_Score =",LR_F1_Score)
print("LR_Log_Loss =",LR_Log_Loss)

LR_Accuracy_Score = 0.8320610687022901
LR_JaccardIndex = 0.4977168949771689
LR_F1_Score = 0.6646341463414634
LR_Log_Loss = 0.3806153262627543
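
Log loss is the only metric here that uses the predicted probabilities rather than the predicted labels. The sketch below computes the binary form by hand on made-up values. The helper name `binary_log_loss` is hypothetical, and it assumes every probability lies strictly between 0 and 1, so it does no clipping.

```python
import math


def binary_log_loss(y_true, prob_positive):
    """Average negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for label, p in zip(y_true, prob_positive):
        # Use P(class 1) when the true label is 1, otherwise P(class 0) = 1 - p.
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return -total / len(y_true)


# Made-up labels and predicted probabilities of the positive class.
y_true = [1, 0, 1]
prob_positive = [0.9, 0.2, 0.6]

loss = binary_log_loss(y_true, prob_positive)
print(round(loss, 4))  # 0.2798
```

Confident and correct predictions contribute little to the loss, while the hesitant 0.6 on the last sample contributes the most.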


### SVM

#### Q16) Create and train an SVM model called SVM using the training data (`x_train`, `y_train`).

In [42]:
from sklearn.svm import SVC

# Create an SVM model
SVM = SVC()

# Train the model using the training data
SVM.fit(x_train, y_train)

#### Q17) Now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.

In [44]:
# Use the trained SVM model to make predictions on the testing data
predictions = SVM.predict(x_test)


#### Q18) Using `predictions` and `y_test`, calculate the value of each metric with the appropriate function.

In [46]:
from sklearn.metrics import accuracy_score, jaccard_score, f1_score

# Calculate Accuracy Score
SVM_Accuracy_Score = accuracy_score(y_test, predictions)

# Calculate Jaccard Index
SVM_JaccardIndex = jaccard_score(y_test, predictions)

# Calculate F1 Score
SVM_F1_Score = f1_score(y_test, predictions)

print("SVM_Accuracy_Score =",SVM_Accuracy_Score)
print("SVM_JaccardIndex =",SVM_JaccardIndex)
print("SVM_F1_Score =",SVM_F1_Score)

SVM_Accuracy_Score = 0.7221374045801526
SVM_JaccardIndex = 0.0
SVM_F1_Score = 0.0
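
A Jaccard index and F1-score of exactly 0 mean the model produced no true positives. The accuracy above equals 473/655, which is consistent with (though not proof of) the model predicting "no rain" for every test day in a test set containing 182 rainy days. The sketch below reproduces the three numbers under that assumption; the class counts are inferred from the printed accuracy, not read from the data.

```python
# Assumed class balance, inferred from the printed accuracy (473 / 655).
y_true = [1] * 182 + [0] * 473
# A degenerate classifier that always predicts the majority class (0 = no rain).
y_pred = [0] * len(y_true)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # no positives are predicted, so 0
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
correct = sum(1 for t, p in pairs if t == p)

accuracy = correct / len(pairs)
jaccard = tp / (tp + fp + fn)      # denominator is 182 here, so no division by zero
f1 = 2 * tp / (2 * tp + fp + fn)

print(round(accuracy, 6), jaccard, f1)  # 0.722137 0.0 0.0
```

This is why accuracy alone can be misleading on imbalanced data: always predicting the majority class already scores about 72% here.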


#### Q19) Show the Accuracy, Jaccard Index, F1-Score, and LogLoss in a tabular format using a data frame for all of the above models.

\*LogLoss is only for the Logistic Regression model


In [48]:
# Create a DataFrame to store the metrics
# KNN and Decision Tree were scored on the random_state=10 split;
# Logistic Regression and SVM were scored on the random_state=1 split.
Report = pd.DataFrame({
    'Model': ['K-Nearest Neighbors', 'Decision Tree', 'Logistic Regression', 'Support Vector Machine'],
    'Accuracy': [KNN_Accuracy_Score, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],
    'Jaccard Index': [KNN_JaccardIndex, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex],
    'F1 Score': [KNN_F1_Score, Tree_F1_Score, LR_F1_Score, SVM_F1_Score],
    # Log loss was only computed for Logistic Regression, so mark the others as missing
    'Log Loss': [np.nan, np.nan, LR_Log_Loss, np.nan]
})

# Print the report
print(Report)

                    Model  Accuracy  Jaccard Index  F1 Score  Log Loss
0     K-Nearest Neighbors  0.818321       0.425121  0.596610       NaN
1           Decision Tree  0.763359       0.408397  0.579946       NaN
2     Logistic Regression  0.832061       0.497717  0.664634  0.380615
3  Support Vector Machine  0.722137       0.000000  0.000000       NaN
