<h1 align='center'> COMP2420/COMP6420 - Introduction to Data Management, Analysis and Security</h1>

<h2 align='center'> Lab 06 - Machine Learning - II</h2>

*****

In this lab, we will first guide you through some of the required interfaces of `sklearn`. We will then study **Decision Trees** and **k-Nearest Neighbours** classifiers, both implemented using `sklearn`.

In [7]:
# Important Imports
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

## Decision Tree

A **Decision Tree** is a **supervised** learning algorithm commonly used for classification problems. The algorithm repeatedly splits the data into two or more homogeneous subsets, choosing at each step the most significant attribute so that the resulting groups are as distinct as possible. In decision analysis, a decision tree is used to visually and explicitly represent decisions and decision making with a tree-like model. A decision tree is drawn with its **root** at the top and its **branches** extending downwards. A branch end that doesn't split any further is a **decision / leaf**.

### Algorithm

1. Pick an attribute to split at a non-terminal node (based on which attribute will provide the greatest **Information Gain**).
2. Split examples into groups based on attribute value.
3. For each group:
    * **If** no examples – return majority from parent
    * **Else If** all examples in same class – return class
    * **Else** loop to Step 1
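
Step 1 relies on **Information Gain**, which can be computed by hand on a small example. The sketch below is only illustrative: the helper names `entropy` and `information_gain` and the toy arrays are assumptions, not part of the lab code. Note also that `sklearn`'s `DecisionTreeClassifier` uses Gini impurity by default; passing `criterion='entropy'` selects the entropy-based measure shown here.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(feature, labels):
    """Parent entropy minus the weighted entropy of the groups obtained
    by splitting on each distinct value of `feature`."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    weighted_child_entropy = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # mask.mean() is the fraction of examples that fall in this group
        weighted_child_entropy += mask.mean() * entropy(labels[mask])
    return entropy(labels) - weighted_child_entropy

# Hypothetical toy data: 8 passengers and two candidate attributes
survived = np.array([1, 1, 1, 0, 0, 0, 0, 1])
sex = np.array([1, 1, 1, 0, 0, 0, 0, 1])     # separates the classes perfectly
pclass = np.array([1, 2, 1, 2, 1, 2, 1, 2])  # each group is still a 50/50 mix

gain_sex = information_gain(sex, survived)
gain_pclass = information_gain(pclass, survived)
print(gain_sex, gain_pclass)
```

On this toy data the attribute with the larger gain (`sex`) would be chosen for the split, since it leaves every group in a single class.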


### Properties

* Internal nodes test attributes.
* Branching is determined by attribute value.
* Leaf nodes are outputs (class assignments).



Let's try to make some predictions with such a decision tree!

## Exercise 1: Sink or Float

We will use the **Titanic Passenger Dataset**, which contains data on 891 passengers, with 11 features plus the target variable (Survived). The dataset is provided under the `data` folder and has the following features:

| **Feature**              |**Description**                                                                   |
|--------------------------|----------------------------------------------------------------------------------|
| PassengerId              | ID of Passenger in dataset                                                       |
| Survived                 | Whether the passenger survived or not {0: No, 1: Yes}                            |
| Pclass                   | Class of Travel (1,2 or 3)                                                       |
| Name                     | Name of Passenger                                                                |
| Sex                      | Gender of Passenger (Male or Female)                                             |
| Age                      | Age of Passenger                                                                 |
| SibSp                    | Number of Siblings/Spouse aboard                                                 |
| Parch                    | Number of Parent/Child aboard                                                    |
| Ticket                   | Ticket Number                                                                    |
| Fare                     | Cost of the Ticket                                                               |
| Cabin                    | Cabin number of Passenger                                                        |
| Embarked                 | Port which the Passenger embarked {C: Cherbourg, S: Southampton, Q: Queenstown}  |


We are going to split this dataset into **train** and **test** data. We aim to build a **decision tree to predict whether a passenger survived the sinking of the Titanic**.

1. Complete the following steps to prepare the data for use:
    - Load the dataset `titanic.csv` into a dataframe. Do a quick check on the type of data each column has. 
    - To make life easier for the decision tree, alter the **Sex** column such that `male` is replaced with `0` & `female` is replaced with `1`
    - Clean up the **Age** column to remove any `NaN` values (**HINT**: Replace them with the mean of the entire column).
    - For this exercise we only need the feature columns `Pclass`, `Sex`, `Age`, `SibSp` and `Parch`, and the result `Survived`. Split the DataFrame into two new variables, one being a DataFrame with the feature columns and the other holding the result data
    - Using `sklearn`'s `train_test_split` function, split the data into a training set and a testing set to evaluate your model

In [2]:
# YOUR ANSWER HERE
data = pd.read_csv('data/titanic.csv')
# Encode Sex as the task requires: male -> 0, female -> 1
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
# Fill missing ages with the mean of the column
data['Age'] = data['Age'].fillna(data['Age'].mean())
# Separate the feature columns from the target column
feature = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch']]
result = data[['Survived']]
# Hold out 30% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(feature, result, test_size=0.3, random_state=42)
print(x_train.shape, x_test.shape, '\n', y_train.shape, y_test.shape)

(623, 5) (268, 5) 
 (623, 1) (268, 1)


### What can you do with a decision tree object?

| Object Method | Description |
| --- |:---:|
| `dt.apply()` | **Returns the index of the leaf that each sample is predicted as.** |
| `dt.decision_path()` | **Return the decision path in the tree** |
| `dt.fit()` | **Build a decision tree classifier from the training set.** |
| `dt.predict()` | **Predict class value for X.** |
| `dt.score()` | **Returns the mean accuracy on the given test data and labels.** |
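
To make the methods in the table concrete, here is a minimal sketch on a tiny made-up dataset. The names `X_toy`, `y_toy`, `toy_dt` and `X_new` are assumptions chosen for illustration and are unrelated to the Titanic data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one feature, class 1 whenever the value is large
X_toy = np.array([[1], [2], [3], [7], [8], [9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_dt = DecisionTreeClassifier(random_state=0)
toy_dt.fit(X_toy, y_toy)                 # build the tree from the training data

X_new = np.array([[2.5], [8.5]])
toy_pred = toy_dt.predict(X_new)         # predicted class for each sample
toy_leaves = toy_dt.apply(X_new)         # index of the leaf each sample ends in
toy_path = toy_dt.decision_path(X_new)   # sparse matrix: samples x nodes visited
toy_score = toy_dt.score(X_toy, y_toy)   # mean accuracy on the given data

print(toy_pred)
print(toy_leaves)
print(toy_path.toarray())
print(toy_score)
```

`decision_path()` returns a sparse indicator matrix, so `.toarray()` is used here only to make the small result readable.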

<br/>

2. Using the `sklearn` module, make a decision tree object and fit it to the `training` dataset. Determine the accuracy of this model on the `test` dataset.

In [3]:
# YOUR ANSWER HERE
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
y_pred = dt.predict(x_test)
dt.score(x_test, y_test)

0.746268656716418

3. How would the accuracy be affected if you increase or decrease the `DecisionTreeClassifier` object parameters like `max_depth` and `min_samples_leaf`? You are welcome to try making the `DecisionTreeClassifier` object with different values of `max_depth` and `min_samples_leaf` and compare your model's accuracy. But, make sure you can justify the change in accuracy for the increase or decrease in these parameters.

In [4]:
# YOUR ANSWER HERE
dt_1 = DecisionTreeClassifier(max_depth = 3, min_samples_leaf = 5)
dt_1.fit(x_train, y_train)
dt_2 = DecisionTreeClassifier(max_depth = 4, min_samples_leaf = 7)
dt_2.fit(x_train, y_train)
print(dt_1.score(x_test, y_test), '\n',dt_2.score(x_test, y_test))

0.7985074626865671 
 0.8171641791044776


#### Justification
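- As necessary. Use this to put down notes that you can come back to quickly later when you're studying!
- Both parameters limit how complex the tree can grow. The unrestricted tree above (test accuracy of about 0.746) keeps splitting until its leaves are pure, so it also fits noise in the training data (overfitting). Capping `max_depth` at 3 or 4 and requiring 5 or 7 samples per leaf with `min_samples_leaf` gives a simpler tree that generalises better here (about 0.799 and 0.817). The effect is not monotonic: a tree that is too shallow, or whose leaves must be too large, underfits and accuracy falls again, so these values need to be tuned rather than simply increased.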
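
One way to explore this question more systematically is to sweep `max_depth` and compare cross-validated accuracy. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in (swap in `x_train` and `y_train` to try it on the Titanic data), and the names `X_demo`, `y_demo` and `depth_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

depth_scores = {}
for depth in [1, 2, 3, 4, 6, 8, None]:  # None lets the tree grow without a depth limit
    model = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=5, random_state=42)
    # Mean accuracy over 5 cross-validation folds
    depth_scores[depth] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

for depth, score in depth_scores.items():
    print(f"max_depth={depth}: mean CV accuracy = {score:.3f}")
```

Cross-validation is used instead of the test set so that the test set stays untouched for the final evaluation.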

4. Scikit-Learn provides you with an easy way to visualize and export this **Decision Tree**. Export this decision tree to your machine and visually inspect it (**HINT**: This was performed in the lectures)

Note: If you're doing this at home, you may require extra files to compile the dot file to visually inspect. The steps are shown below.

In [8]:
# YOUR ANSWER HERE
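# Write the tree to a .dot file named as in the Graphviz commands in this section
export_graphviz(dt_2, out_file='decision_tree.dot', feature_names=list(feature.columns))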

#### Compiling the `.dot` file (using Graphviz)

##### OSX (Using Homebrew)
- In an open terminal window:
    - `brew install graphviz`
    - `dot -Tpng decision_tree.dot -o decision_tree.png` (when in the directory of the `.dot` file)

##### Ubuntu 
- In an open terminal window
    - `sudo apt install graphviz` (if not already installed, can sometimes come standard) **(On CECS computers, it will be automatically installed)**
    - `dot -Tpng decision_tree.dot -o decision_tree.png` (when in the directory of the `.dot` file)
    
##### Windows
- Install the appropriate packages from [here](https://graphviz.gitlab.io/_pages/Download/Download_windows.html)
- Ensure Graphviz is within your PATH
- Using Powershell
    - `dot -Tpng decision_tree.dot -o decision_tree.png` (when in the directory of the `.dot` file)

<br/>
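
If Graphviz is not available, a fitted tree can also be inspected as plain text with `sklearn.tree.export_text`. This is a possible alternative rather than the method used in the lectures. The sketch uses the bundled iris dataset as a stand-in, and the names `demo_tree` and `rules` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data so the example is self-contained (not the Titanic dataset)
iris = load_iris()
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(iris.data, iris.target)

# Render the split rules of the fitted tree as indented text
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line shows a test on one attribute, and lines starting with `class:` are the leaves.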

## k-Nearest Neighbours

**k-Nearest Neighbours** (KNN) is a **supervised learning** algorithm. It is extremely easy to implement in its most basic form, yet it can handle quite complex classification tasks. It is a lazy learning algorithm: it has no specialised training phase and instead stores the training data, consulting all of it whenever a new data point or instance is classified.

KNN is a **non-parametric learning algorithm**, which means that it doesn't assume a particular form for the underlying data. This is an extremely useful feature, since most real-world data doesn't follow theoretical assumptions such as linear separability or a uniform distribution.
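
The lazy behaviour described above can be sketched in a few lines of NumPy. This is a minimal illustration assuming Euclidean distance and a simple majority vote; it is not how `sklearn` implements the algorithm internally, and the function name `knn_predict` and the toy points are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each query point by majority vote among its k nearest
    training points (Euclidean distance). There is no training step:
    the training data is only consulted at prediction time."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_query = np.asarray(X_query, dtype=float)
    predictions = []
    for point in X_query:
        distances = np.linalg.norm(X_train - point, axis=1)
        nearest = np.argsort(distances)[:k]  # indices of the k closest points
        values, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(values[np.argmax(counts)])  # most common label wins
    return np.array(predictions)

# Hypothetical toy data: two well-separated clusters
X_pts = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_pts = np.array([0, 0, 0, 1, 1, 1])

pts_pred = knn_predict(X_pts, y_pts, [[1.5, 1.5], [8.5, 8.5]], k=3)
print(pts_pred)
```

Because all the work happens at prediction time, each query costs a pass over the whole training set in this naive version.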

<img src='./images/knn_classification.jpg'>

Source: [E_blog - K-Nearest Neighbour(KNN) Classification](https://zeidigital.wordpress.com/2016/08/13/k-nearest-neighbour-classification-algorithm-implementation-in-python/)

Given a dataset with features **x** and labels **y**, k-Nearest Neighbours can be used to:

* Build a predictive model that predicts the unknown **y<sub>i</sub>** value of a new point **x<sub>i</sub>**.
* Build a model without assuming any parameters; **k** is a hyperparameter.
* Provide a baseline method against which to compare your more complex models.

Let's try making some predictions using the k-Nearest Neighbours algorithm.

## Exercise 2: Want to buy

We have provided a dataset of 200 consumers with 3 features and a target variable. The dataset is provided under the `data` folder and has the following features:

| **Feature**              |**Description**                                                                   |
|--------------------------|----------------------------------------------------------------------------------|
| CustomerID               | Unique ID assigned to the customer                                               |
| Gender                   | Gender of the Shopper (Male or Female)                                           |
| Age                      | Age of the Shopper                                                               |
| Annual Income            | Annual Income of the customer (in thousands)                                     |
| Will Buy                 | Whether the customer will buy item x {Yes: 1, No: 0}                             |

#### Scenario
You run a local grocery store and have been approached about stocking a new product. Product-x is being marketed as the biggest thing since sliced bread in the next town over, although you're unsure if your customers will be as responsive. You decide to purchase a test batch and allow your 200 loyalty members to review the product, and use their buying statistics to determine whether you should stock the product permanently. Loyalty members will inform you whether they bought the product or not, and you can use this to predict whether other customers will do the same.

Note: This is another example of a binary classification problem, as you are trying to determine whether the answer will be "yes" or "no" given a question.

1. Complete the following steps to prepare the data for use:
    - Load the dataset `buying_cx.csv` into a dataframe, using CustomerID as the index column
    - As with the previous task, alter the `Gender` column so that 0 means Male and 1 means Female (implement this using a different method from the previous question)
        - **Extension**: Use a `LabelEncoder` or the like from the `sklearn` module to change the values. Note that `LabelEncoder` numbers the classes in sorted order, so `Female` becomes 0 and `Male` becomes 1.
    - Split the data into a training and testing dataset

In [30]:
# YOUR ANSWER HERE
data2 = pd.read_csv('data/buying_cx.csv')
# Create a LabelEncoder object
le = LabelEncoder()
# Fit the LabelEncoder on the Gender column
le.fit(data2.Gender)
# View the fitted classes (sorted alphabetically: Female -> 0, Male -> 1)
print(le.classes_)
# Apply the fitted LabelEncoder to the pandas column
data2.Gender = le.transform(data2.Gender)
# View the transformed DataFrame
print(data2.head())
# Split the data into a training and testing dataset
feature2 = data2[['Gender', 'Age', 'Annual Income (k$)']]
result2 = data2['Will Buy']
x2_train,  x2_test, y2_train, y2_test = train_test_split(feature2, result2, test_size = 0.2, random_state = 42)
# View the shapes to ensure we do it right
print('train:', x2_train.shape, ' ', x2_test.shape, '\n', y2_train.shape, ' ', y2_test.shape)

['Female' 'Male']
   CustomerID  Gender  Age  Annual Income (k$)  Will Buy
0           1       1   19                  15         0
1           2       1   21                  15         1
2           3       0   20                  16         0
3           4       0   23                  16         1
4           5       0   31                  17         0
train: (160, 3)   (40, 3) 
 (160,)   (40,)


### What can you do with a k-Nearest Neighbors object?
While there are other methods, the main functions you will require for this exercise are the following:

| Method | Description |
| --- |:---:|
| `.fit()` | **Fit the k-nearest neighbours classifier from the training set.** |
| `.predict()` | **Predict class value for X.** |
| `.score()` | **Returns the mean accuracy on the given test data and labels.** |

<br/>

2. Using the k-Nearest Neighbors class from `sklearn`, implement a k-Nearest Neighbors classifier that checks the 5 closest neighbours, and determine the accuracy of the model.

In [31]:
# Create an instance of KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
# Fit the train data into the knn model
knn.fit(x2_train, y2_train)
# predict the result for test data
Ypred2 = knn.predict(x2_test)
# View the predicted shape of result to ensure we stay on the right track
print(Ypred2.shape)
# determine the accuracy of the model
print(knn.score(x2_test, y2_test))

(40,)
0.725


3. As k is a hyperparameter that we specify when we create the model, it is possible to adjust the number of neighbours the model will check for a solution. Create 3 more models, each with a different k value (k=1, k=50, k=150) and compare the scores of each model.

In [32]:
# Create three models for k=1, k=50, k=150
knn_1 = KNeighborsClassifier(n_neighbors=1)
knn_2 = KNeighborsClassifier(n_neighbors=50)
knn_3 = KNeighborsClassifier(n_neighbors=150)
# Fit the train data into three knn models
knn_1.fit(x2_train, y2_train)
knn_2.fit(x2_train, y2_train)
knn_3.fit(x2_train, y2_train)
# print the scores for each model
print(knn_1.score(x2_test, y2_test), knn_2.score(x2_test, y2_test), knn_3.score(x2_test, y2_test))

0.65 0.575 0.7


Comparing these scores to when we tried for the 5 closest neighbours, what does this tell us about the optimal number of neighbours? Will the optimal number of neighbours remain constant over different divisions of testing and training data? Discuss this with other members of the course or your tutor.

#### NOTES
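- Insert Notes from your discussion here as necessary
- Self-thoughts: On this split, k=5 scores highest (0.725), ahead of k=150 (0.7), k=1 (0.65) and k=50 (0.575). A very small k follows noise in individual neighbours, while a very large k drifts towards predicting the majority class: with k=150 and only 160 training samples, almost every training point gets a vote. The optimal k is not guaranteed to stay constant across different divisions of testing and training data, because each score here comes from only 40 test samples. Averaging over several splits (cross-validation) gives a more reliable choice of k.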
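
Whether the best k stays the same across different train/test divisions can be checked empirically. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for the 200-customer dataset, and the names `X_syn`, `y_syn`, `best_k_per_split` and `cv_scores` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the customer data (an assumption)
X_syn, y_syn = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

k_values = [1, 5, 15, 50, 100]

# Single hold-out splits: record which k scores best for each random split
best_k_per_split = {}
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_syn, y_syn, test_size=0.2, random_state=seed
    )
    scores = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        scores[k] = model.score(X_te, y_te)
    best_k_per_split[seed] = max(scores, key=scores.get)
print(best_k_per_split)

# 5-fold cross-validation averages the accuracy over several splits
cv_scores = {}
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X_syn, y_syn, cv=5).mean()
print(cv_scores)
```

If `best_k_per_split` differs between seeds, that suggests the winner on any single split is partly down to chance, which is why the cross-validated averages are the safer basis for choosing k.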