<center>
    <img src="https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png" width="300" alt="cognitiveclass.ai logo"  />
</center>


# **Weather Classification Assignment**


Estimated time needed: **60** minutes


In this notebook, we will practice using all the classification algorithms and metrics that we learned in this course. Using weather data, we will try to predict whether it is going to rain the next day.


## Objectives


After completing this lab you will be able to:


*   Data
    *   Describe and Define the Dataset
    *   Load a CSV Dataset using Pandas
    *   Preprocess the Data using Pandas
    *   Deal with NULL Values in your Dataset
    *   Perform One Hot Encoding on Categorical Variables
    *   Split your Data into a Training and Testing Set
    *   Standardize your Data using StandardScaler or MinMaxScaler
*   Classification
    *   Use GridSearchCV to Find the Best Parameters for a Classification Algorithm
    *   Perform Classification using Logistic Regression
    *   Perform Classification using K-Nearest Neighbors
    *   Perform Classification using Support Vector Machine
    *   Perform Classification using Decision Trees
*   Use Evaluation Metrics Accuracy Score, Jaccard Index, F1-Score, and Log Loss on Each Algorithm and Report the Results


***


## Setup


First, we will download the data that we will use in this lab which is stored in a CSV format.


In [1]:
from js import fetch
import io

URL = "https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/ML0101ENv3/project_EdX/weatherAUS.csv"
resp = await fetch(URL)
text = io.BytesIO((await resp.arrayBuffer()).to_py())

For this lab, we are going to be using Python and several Python libraries.

If you are running this Jupyter Notebook locally, you need to install the following libraries by uncommenting the code below. Otherwise, leave the code below commented out and run the rest of this notebook.


In [2]:
#pip install pandas
#pip install scikit-learn
#pip install matplotlib
#pip install numpy

In [3]:
# allows us to interact with the data using a dataframe
import pandas as pd
# allows us to interact with the data and perform calculations using ndarrays
import numpy as np
# various classification algorithms and metrics from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
# matplotlib allows us to create graphs
import matplotlib.pyplot as plt

Since sklearn calculates the Jaccard index differently from what was taught in the course, we will define our own function for the Jaccard index.


In [4]:
# works like the sklearn classification metrics: given lists or ndarrays of predictions and true values, returns the Jaccard index
def jaccard_index(predictions, true):
    if (len(predictions) == len(true)):
        intersect = 0
        for x,y in zip(predictions, true):
            if (x == y):
                intersect += 1
        return intersect / (len(predictions) + len(true) - intersect)
    else:
        return -1
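
To see how the two definitions differ, here is a minimal sketch on a hand-made pair of label lists. The lists and the helper names `matching_jaccard` and `standard_jaccard` are assumptions made for illustration and are not part of the lab. `matching_jaccard` repeats the logic of `jaccard_index` above, while `standard_jaccard` computes TP / (TP + FP + FN) for the positive class, which is the binary definition that `sklearn.metrics.jaccard_score` uses.

```python
# Hand-made labels for illustration only (not taken from the weather data).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

def matching_jaccard(predictions, true):
    """Same logic as jaccard_index above: matches / (n_pred + n_true - matches)."""
    matches = sum(p == t for p, t in zip(predictions, true))
    return matches / (len(predictions) + len(true) - matches)

def standard_jaccard(predictions, true, positive=1):
    """Binary Jaccard for the positive class: TP / (TP + FP + FN)."""
    pairs = list(zip(predictions, true))
    tp = sum(p == positive and t == positive for p, t in pairs)
    fp = sum(p == positive and t != positive for p, t in pairs)
    fn = sum(p != positive and t == positive for p, t in pairs)
    return tp / (tp + fp + fn)

course_value = matching_jaccard(y_pred, y_true)    # 3 matches -> 3 / (5 + 5 - 3)
standard_value = standard_jaccard(y_pred, y_true)  # TP=2, FP=1, FN=1 -> 2 / 4
print(round(course_value, 4), standard_value)      # 0.4286 0.5
```

The two numbers disagree on the same predictions, which is why the report in this lab should use `jaccard_index` consistently for every algorithm.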

## Data


### About the Data


The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01).

The dataset we will use has extra columns, like RainToday and our target RainTomorrow, and was gathered from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01).


This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged 10 minutes prior to 9am       | Compass Points  | object |
| WindDir3pm    | Wind direction averaged 10 minutes prior to 3pm       | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged 10 minutes prior to 9am           | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged 10 minutes prior to 3pm           | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RISK_MM       | Amount of rain tomorrow                               | Millimeters     | float  |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01)


### Load the Dataset


Let's use the **head()** function to see our data.


In [None]:
df = pd.read_csv(text)

df.head()

### Preprocessing


We want to focus specifically on Sydney so that we can train our algorithm quickly. You can select other locations or multiple locations if you would like to experiment.


In [None]:
df = df[df['Location'] == 'Sydney']

Next, we drop all the columns in the table that we won't need.

We drop Location because it is constant for every row, and we drop RISK_MM because it gives the amount of rain tomorrow: training on it would reveal the target, and we are doing classification, not regression.


In [None]:
# .copy() gives an independent frame, so the in-place drop below does not act on a slice of df
df_sydney = df[df['Location'] == 'Sydney'].copy()

df_sydney.drop(columns=['Location', 'RISK_MM'], inplace=True)

print(df_sydney.shape)

df_sydney.head()

As you can see above, NaN occurs a number of times in our dataset. We can either drop the affected data or replace the missing values.


Below we can see how many NaN values each column has. WindGustDir, WindGustSpeed, Cloud9am, and Cloud3pm have a large number of missing values. In this case, about 33% of the rows are missing a value for WindGustDir and WindGustSpeed. That is not enough to justify removing the entire column, but we will perform some preprocessing.


In [None]:
df_sydney.isna().sum()
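
Raw counts are easier to judge as percentages. The following is a small sketch on a made-up frame (the `toy` DataFrame and its values are assumptions, standing in for `df_sydney`) showing how the share of missing values per column can be computed.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_sydney; the values are made up for illustration.
toy = pd.DataFrame({
    "WindGustDir": ["W", np.nan, "E", np.nan],
    "Cloud9am": [7.0, np.nan, 3.0, 1.0],
    "MinTemp": [19.5, 19.5, 21.6, 20.2],
})

# isna() gives a boolean frame; the column mean of booleans is the missing fraction.
missing_pct = toy.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

Applying the same expression to `df_sydney` should make it easy to check which columns are missing roughly a third of their values.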

### Dealing With Nulls


Choose the method that you would like to use and run only its cell. Both cells assign `df_sydney_filled`, so if you run both, the second one overwrites the result of the first.

1.  Drop all rows that contain NaN
2.  Replace NaN in object type columns like WindGustDir with the most frequent value in the column and replace NaN in float type columns like WindGustSpeed, Cloud9am, and Cloud3pm with the mean. Then we drop the remaining rows with NaN in them.

Please note that if you choose to replace the NaN values, the classification algorithms will take a little longer to compute because more rows are kept.


1.  Drop


In [None]:
df_sydney_filled = df_sydney.dropna()

2.  Replace


In [None]:
df_sydney_filled = df_sydney.copy()

# assign each filled column back instead of calling replace(..., inplace=True) on a column selection
most_frequent_WindGustDir = df_sydney_filled['WindGustDir'].value_counts().idxmax()
df_sydney_filled["WindGustDir"] = df_sydney_filled["WindGustDir"].fillna(most_frequent_WindGustDir)

mean_WindGustSpeed = df_sydney_filled["WindGustSpeed"].astype("float").mean(axis=0)
df_sydney_filled["WindGustSpeed"] = df_sydney_filled["WindGustSpeed"].fillna(mean_WindGustSpeed)

mean_Cloud9am = df_sydney_filled["Cloud9am"].astype("float").mean(axis=0)
df_sydney_filled["Cloud9am"] = df_sydney_filled["Cloud9am"].fillna(mean_Cloud9am)

mean_Cloud3pm = df_sydney_filled["Cloud3pm"].astype("float").mean(axis=0)
df_sydney_filled["Cloud3pm"] = df_sydney_filled["Cloud3pm"].fillna(mean_Cloud3pm)

df_sydney_filled.dropna(inplace=True)

In [None]:
print(df_sydney_filled.shape)
df_sydney_filled.isna().sum()

As you can see, both methods leave the dataset free of NaN values. Dropping rows keeps only fully observed records but shrinks the dataset, while filling in values preserves rows but may bias the data toward the most frequent value or the mean. When deciding on a method, we must consider whether enough data will remain after dropping NaN rows and whether filling in NaN by frequency or mean will introduce some sort of bias into our data.
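
The trade-off can be seen on a small example. This is a sketch on a made-up frame (the `toy` DataFrame is an assumption, not the weather data) that applies both methods and compares how many rows survive.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; two of the six rows contain NaN.
toy = pd.DataFrame({
    "WindGustDir": ["W", np.nan, "W", "E", "E", "W"],
    "WindGustSpeed": [41.0, np.nan, 30.0, np.nan, 35.0, 44.0],
    "RainTomorrow": ["Yes", "No", "No", "Yes", "No", "No"],
})

# Method 1: drop every row that has at least one NaN.
dropped = toy.dropna()

# Method 2: fill categorical NaN with the mode and numeric NaN with the mean.
filled = toy.copy()
filled["WindGustDir"] = filled["WindGustDir"].fillna(filled["WindGustDir"].mode()[0])
filled["WindGustSpeed"] = filled["WindGustSpeed"].fillna(filled["WindGustSpeed"].mean())

print(len(toy), len(dropped), len(filled))  # 6 4 6
```

Dropping loses a third of this toy frame, while filling keeps every row at the cost of inventing values (here the mode `"W"` and the mean `37.5`).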


In [None]:
df_sydney_filled['Date'] = df_sydney_filled['Date'].str.replace('-', '')

In the cell above, we remove the - between the values of the Date column so they can be converted to floats.
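
As a quick illustration of that conversion, here is a sketch on three made-up date strings (the `dates` Series is an assumption used only for this example).

```python
import pandas as pd

dates = pd.Series(["2008-02-01", "2008-02-02", "2017-06-25"])  # made-up sample dates

# Removing the hyphens leaves digit-only strings such as "20080201" ...
compact = dates.str.replace("-", "", regex=False)

# ... which can then be cast to float like every other feature column.
as_float = compact.astype(float)
print(as_float.tolist())  # [20080201.0, 20080202.0, 20170625.0]
```

Note that the resulting number preserves chronological order but not the distance between dates, so it is a convenient encoding rather than a true time measure.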


### One Hot Encoding


Next, we perform one-hot encoding to convert categorical variables to binary variables.


In [None]:
df_sydney_processed = pd.get_dummies(data=df_sydney_filled, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
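
To make the effect of **get_dummies** concrete, here is a minimal sketch on a made-up three-row frame (the `toy` DataFrame is an assumption for illustration). Each listed categorical column is replaced by one indicator column per distinct value.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "RainToday": ["Yes", "No", "No"],
    "WindGustDir": ["W", "E", "W"],
    "MinTemp": [19.5, 21.6, 20.2],
})

# Columns not listed in `columns` are kept; the listed ones become indicator columns.
encoded = pd.get_dummies(data=toy, columns=["RainToday", "WindGustDir"])
print(encoded.columns.tolist())
# ['MinTemp', 'RainToday_No', 'RainToday_Yes', 'WindGustDir_E', 'WindGustDir_W']
```

With sixteen compass points in each of the three wind direction columns, the real dataset gains many more columns than this toy frame does.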

Next, we replace the values of the RainTomorrow column changing it from a categorical column to a binary column. We do not use the **get_dummies** method because we would end up with two columns for RainTomorrow and we do not want that because it is our target.


In [None]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)

### Training Data and Testing Data


First, we convert all columns to a float type. This step is optional because **StandardScaler()** will convert the values to float itself, but doing it explicitly avoids a warning message.


In [None]:
df_sydney_processed = df_sydney_processed.astype(float)

Now we split our dataset into a features dataset and target dataset. We drop our target to create our features dataset and only keep RainTomorrow to create our target dataset


In [None]:
features = df_sydney_processed.drop(columns='RainTomorrow', axis=1)
Y = df_sydney_processed['RainTomorrow']

Now we will standardize the data. We can do this in multiple ways, such as using **StandardScaler()**, which rescales each column to zero mean and unit variance, or **MinMaxScaler()**, which rescales each column to a fixed range (0 to 1 by default) based on the column's minimum and maximum.


### Data Standardization


Before we standardize our data we must split it into training and testing sets. We do this before standardizing so that we don't give our model any hints about the test set by standardizing all the data together.


In [None]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=.2, random_state=1)

Please run only the cell for the scaler you would like to use.


1.  StandardScaler


In [None]:
norm = preprocessing.StandardScaler()

2.  MinMaxScaler


In [None]:
norm = preprocessing.MinMaxScaler()

In [None]:
x_train = norm.fit_transform(x_train)

x_test = norm.transform(x_test)

As discussed above, we fit the scaler to the training data and transformed that data in one step. Then we used the fitted scaler to transform the test data.
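
The fit-then-transform pattern can be written out by hand. Below is a sketch in plain NumPy on made-up numbers (the arrays are assumptions) that mirrors what the two scalers compute: the statistics are taken from the training rows only and then reused on the test rows.

```python
import numpy as np

# Made-up single-feature data: three training rows and two test rows.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [40.0]])

# "Fit" step: the statistics come from the training rows only.
mean, std = train.mean(axis=0), train.std(axis=0)  # StandardScaler: mean and population std
lo, hi = train.min(axis=0), train.max(axis=0)      # MinMaxScaler: column min and max

# "Transform" step: reuse the training statistics on both sets.
train_std, test_std = (train - mean) / std, (test - mean) / std
train_mm, test_mm = (train - lo) / (hi - lo), (test - lo) / (hi - lo)

print(test_std.ravel(), test_mm.ravel())
```

The second test value lies above the training maximum, so its min-max scaled value is 1.5. Test data can fall outside the 0 to 1 range precisely because the scaler never saw it during fitting.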


## Classification


### Instructions


Below we will use the classification algorithms to create models based on our training data and then evaluate them on our testing data using the evaluation metrics learned in the course.

We will use some of the algorithms taught in the course, specifically:

1.  Logistic Regression
2.  KNN
3.  SVM
4.  Decision Trees

We will evaluate our models using

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  Log Loss

Note: scikit-learn calculates the Jaccard index differently, so a function is defined at the top of the notebook for you to use. Its input style is the same as scikit-learn's.

These algorithms have many parameters, and to find the best ones we will use GridSearchCV.

Here is how to do this using a mock classification algorithm:

1.  Create a python dictionary with the key being the name of the parameters and the value being a list of possible values
2.  Create an object of the classification algorithm
3.  Create a GridSearchCV object and place your classification object and parameters dictionary as parameters, also define your GridSearchCV cv parameter (Use cv = 4)
4.  Use the fit method of the GridSearchCV object to train our model using the x_train and y_train that we created before
5.  Store the best model in a variable provided
6.  Predict the target variable using the x_test data we created above
7.  Calculate and store the values for each metric in the provided variables using the predictions and y_test data

You will need to research which parameter values to try, as there are many options, but this is straightforward. GridSearchCV will determine the best model.

Finally, use your models to generate the report at the bottom.


### Mock


If you need some more help with grid search here are a couple of resources

1.  [https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html](https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01)
2.  [https://scikit-learn.org/stable/modules/grid_search.html](https://scikit-learn.org/stable/modules/grid_search.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01)
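
The seven steps from the instructions can be sketched as follows. This is a hedged illustration, not a solution to any question: it uses synthetic data from `make_classification` and a `RandomForestClassifier` as a stand-in algorithm, and the parameter grid and the `mock_*` variable names are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, log_loss
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data so the sketch runs on its own.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# 1. Parameter names mapped to the candidate values to try.
parameters = {"n_estimators": [10, 50], "max_depth": [2, 4]}
# 2. The classifier object (a stand-in, not one of the four algorithms in the questions).
model = RandomForestClassifier(random_state=1)
# 3. Grid search over the dictionary with 4-fold cross-validation.
grid = GridSearchCV(model, parameters, cv=4)
# 4. Fit on the training split only.
grid.fit(X_train, y_train)
# 5. Keep the best model (refit on the whole training split by default).
best_model = grid.best_estimator_
# 6. Predict on the held-out test split.
predictions = best_model.predict(X_test)
# 7. Compute the metrics; log loss needs class probabilities, not hard labels.
mock_accuracy = accuracy_score(y_test, predictions)
mock_f1 = f1_score(y_test, predictions)
mock_log_loss = log_loss(y_test, best_model.predict_proba(X_test))

print(grid.best_params_, mock_accuracy, mock_f1, mock_log_loss)
```

For the questions below, the same pattern applies with `x_train`, `y_train`, `x_test`, and `y_test` from this notebook, the algorithm named in each question, and `jaccard_index` for the Jaccard metric.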


### Question 1: Logistic Regression


For Logistic Regression please use the parameters C = \[.001, .01, .1, 1, 10, 100] and solver. Use the link provided to select the values for the solver parameter. [https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01)


When creating the LogisticRegression object please set **max_iter = 10000**. This allows enough iterations for the model parameters to converge.


In [None]:
BestLR = 

In [None]:
print(BestLR)

In [None]:
LR_Accuracy_Score = 
LR_JaccardIndex = 
LR_F1_Score = 
LR_Log_Loss = 

### Question 2: KNN


For KNN please use the parameters n_neighbors = \[1,2,3,4,5,6,7,8,9,10], algorithm, and p. Use the link provided to select the values for algorithm and p. [https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01)


In [None]:
BestKNN =

In [None]:
print(BestKNN)

In [None]:
KNN_Accuracy_Score = 
KNN_JaccardIndex = 
KNN_F1_Score = 

### Question 3: SVM


For SVM please use the parameters C = \[.001, .01, .1, 1, 10, 100] and kernel. Use the link provided to select the values for kernel. [https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01)


In [None]:
BestSVM =

In [None]:
print(BestSVM)

In [None]:
SVM_Accuracy_Score = 
SVM_JaccardIndex = 
SVM_F1_Score = 

### Question 4: Decision Tree


For Decision Tree please use the parameter criterion. Use the link provided to select the values for criterion. [https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01)


In [None]:
BestTree =

In [None]:
print(BestTree)

In [None]:
Tree_Accuracy_Score = 
Tree_JaccardIndex = 
Tree_F1_Score = 

## Report


In [None]:
# the algorithm names must be listed in the same order as the metric values below
Report = pd.DataFrame({'Algorithm' : ['LogisticRegression', 'KNN', 'SVM', 'Decision Tree']})

Report['Accuracy'] = [LR_Accuracy_Score, KNN_Accuracy_Score, SVM_Accuracy_Score, Tree_Accuracy_Score]
Report['Jaccard'] = [LR_JaccardIndex, KNN_JaccardIndex, SVM_JaccardIndex, Tree_JaccardIndex]
Report['F1-Score'] = [LR_F1_Score, KNN_F1_Score, SVM_F1_Score, Tree_F1_Score]
Report['LogLoss'] = [LR_Log_Loss, 'N/A', 'N/A', 'N/A']

Report

## Authors


[Azim Hirjani](https://www.linkedin.com/in/azim-hirjani-691a07179/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01)


## Change Log


| Date (YYYY-MM-DD) | Version | Changed By | Change Description         |
| ----------------- | ------- | ---------- | -------------------------- |
| 2020-09-14        | 0.2     | Azim       | Update Lab to Use Template |
| 2020-04-17        | 0.1     | Azim       | Created Lab                |


Copyright © 2020 IBM Corporation. All rights reserved.
