# <center>CITS5508 Lab sheet 3</center>

**Name:** Alastair Mory<br>
**Student number:** 21120848<br>


Two random forest classifiers are trialled on the task of classifying people as healthy or as having Parkinson's disease, based on 22 voice metrics. The dataset was obtained [here](https://archive.ics.uci.edu/ml/datasets/Parkinsons).

<br><b>Contents</b><br>
[1 Dataset](#1)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.1 Data Visualisation and Statistics](#1.1)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.2 Partitioning Data](#1.2)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.3 Scaling and Normalisation](#1.3)<br>
[2 Classification](#2)<br>
[3 Conclusion](#3)<br>

In [None]:
from typing import Any

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import ensemble, metrics, model_selection, preprocessing, svm

# Show all attribute columns in pandas tables.
pd.set_option('display.max_columns', None)

# 1. Dataset <a name="1"/>

This dataset is composed of a range of biomedical voice measurements from 
31 people, 23 with Parkinson's disease (PD). Each column in the table is a 
particular voice measure, and each row corresponds to one of 195 voice 
recordings from these individuals ("name" column). The main aim of the data 
is to discriminate healthy people from those with PD, according to the 
"status" column, which is set to 0 for healthy and 1 for PD.

In [None]:
# Read data from CSV
data = pd.read_csv('./parkinsons.data')

# Display example data rows
data.head()


## 1.1 Data Visualisation and Statistics <a name="1.1"/>

In [None]:
# Histogram of numerical features
%matplotlib inline
data.hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
# Statistics for numerical features
data.describe()

The plots and statistics above show a variety of distributions across a wide range of scales. Many features are positively skewed (e.g. the MDVP and shimmer measures), while others are closer to a symmetric, roughly normal distribution (e.g. D2, DFA, HNR and the spread measures).
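
The visual impression of skewness can be backed up with a number. The sketch below uses synthetic data (the column names `right_skewed` and `symmetric` are made up for illustration) to show how pandas' `DataFrame.skew()` separates a positively skewed feature from a roughly symmetric one.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: one positively skewed column, one symmetric column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=1000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=1000),
})

# Sample skewness per column: values well above 0 indicate a long right tail,
# values near 0 indicate a roughly symmetric distribution.
skewness = demo.skew().sort_values(ascending=False)
print(skewness)
```

On the notebook's data, the equivalent call would be `data.drop(columns=['name', 'status']).skew()`.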

## 1.2 Partitioning Data <a name="1.2"/>

In [None]:
# Perform 80/20 train test split
train, test = model_selection.train_test_split(data,
                                               test_size=0.2,
                                               train_size=0.8)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

train_y = train['status']
test_y = test['status']

train_x = train.drop(columns=['name', 'status'])
test_x = test.drop(columns=['name', 'status'])
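
Because the 195 recordings come from only 31 people, a purely random split can place recordings of the same person in both the training and the test set. One possible alternative, sketched below on synthetic data, is a group-aware split with scikit-learn's `GroupShuffleSplit`. The `subject` column is hypothetical: it assumes a per-person identifier is available or can be derived, which the notebook does not do.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical stand-in data: 10 subjects with 6 recordings each.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "subject": np.repeat(np.arange(10), 6),
    "feature": rng.normal(size=60),
    "status": np.repeat(rng.integers(0, 2, size=10), 6),
})

# Hold out 20% of the subjects (not 20% of the rows) for testing.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(demo, groups=demo["subject"]))
train_demo, test_demo = demo.iloc[train_idx], demo.iloc[test_idx]

# No subject should appear on both sides of the split.
shared = set(train_demo["subject"]) & set(test_demo["subject"])
print(f"Subjects in both sets: {len(shared)}")
```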

## 1.3 Scaling and Normalisation <a name="1.3"/>

As per [this answer](https://stats.stackexchange.com/questions/255765/does-random-forest-need-input-variables-to-be-scaled-or-centered) from StackExchange, the data does not need to be scaled or normalised when using a random forest classifier:

>Random Forests are based on tree partitioning algorithms.
>
>As such, there's no analogue to a coefficient one obtain in general regression strategies, which would depend on the units of the independent variables. Instead, one obtain a collection of partition rules, basically a decision given a threshold, and this shouldn't change with scaling. In other words, the trees only see ranks in the features.
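
The claim in the quote can be checked empirically. The minimal sketch below uses synthetic data from `make_classification` rather than the Parkinson's dataset: it fits two forests with the same `random_state`, one on the raw features and one on the same features multiplied by a positive constant, then compares their predictions. The constant 1024 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Rescale every feature by a positive constant; the ordering of the values
# within each feature is unchanged.
X_scaled = X * 1024.0

# Identical seeds, so any difference can only come from the rescaling.
clf_raw = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
clf_scaled = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_scaled, y)

# Fraction of samples on which the two forests give the same prediction.
agreement = np.mean(clf_raw.predict(X) == clf_scaled.predict(X_scaled))
print(f"Fraction of identical predictions: {agreement:.3f}")
```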

# 2. Classification <a name="2"/>

In [None]:
def print_metrics(clf: Any,  # pretrained classifier
                  test_x: pd.DataFrame, test_y: pd.Series,
                  clf_name: str = 'Classifier') -> None:
    """
    Run prediction using the provided trained classifier and show the
    confusion matrix, accuracy and weighted F1 score.
    """

    # Run classifier prediction
    pred_y = clf.predict(test_x)

    # Calculate accuracy & F1 score (F1 is averaged over both classes,
    # weighted by the number of samples in each class)
    accuracy = metrics.accuracy_score(test_y, pred_y)
    f1 = metrics.f1_score(test_y, pred_y, average='weighted')
    # Place in dataframe for prettier printing
    scores = pd.DataFrame(data=[[accuracy, f1]],
                          index=[""],
                          columns=['Accuracy', 'F1 Score'])

    # Display metric scores, then the row-normalised confusion matrix.
    # ConfusionMatrixDisplay.from_estimator replaces plot_confusion_matrix,
    # which was removed in scikit-learn 1.2.
    print(f"{clf_name} Metrics\n")
    print(f"{scores}\n")
    print("Confusion Matrix")
    metrics.ConfusionMatrixDisplay.from_estimator(clf, test_x, test_y,
                                                  normalize='true',
                                                  cmap=plt.cm.Blues)
    plt.show()

In [None]:
clf1 = ensemble.RandomForestClassifier(n_estimators=100, criterion='gini').fit(train_x, train_y)
clf2 = ensemble.RandomForestClassifier(n_estimators=10, criterion='gini').fit(train_x, train_y)

In [None]:
print_metrics(clf1, test_x, test_y, 'Random Forest 1')

In [None]:
print_metrics(clf2, test_x, test_y, 'Random Forest 2')

# 3. Conclusion <a name="3"/>

As there was no obvious difference between the two split quality criteria (gini and entropy), the default of gini was used for both classifiers. The hyperparameter that was varied was n_estimators, which specifies the number of trees in the forest. The first classifier uses sklearn's default of 100 and has relatively stable performance across runs (both measures are usually around 85%). The second classifier uses a value of 10, which results in more varied performance (generally 80-95%) but, on average, higher accuracy and F1 scores. One reading is that classifier 1, with more trees, tends to overfit while classifier 2, with fewer trees, generalises better. That reading should be treated with caution: adding trees to a random forest mainly reduces the variance of its predictions rather than increasing overfitting, and with a single split whose test set holds only 39 recordings, a difference of this size may simply reflect sampling variation.
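
Since the scores change from run to run, a comparison averaged over many splits is less sensitive to one lucky or unlucky partition. The sketch below is one way to do this with repeated stratified cross-validation. It is not what the notebook does: the data is a synthetic stand-in from `make_classification`, and the fold and repeat counts are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in with the same shape as the dataset (195 rows, 22 features).
X, y = make_classification(n_samples=195, n_features=22, random_state=0)

# 5 folds repeated 3 times gives 15 accuracy scores per classifier.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

results = {}
for n_trees in (10, 100):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    results[n_trees] = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(f"n_estimators={n_trees}: "
          f"mean={results[n_trees].mean():.3f}, std={results[n_trees].std():.3f}")
```

To apply this to the notebook's data, `data.drop(columns=['name', 'status'])` and `data['status']` would take the place of `X` and `y`. Note that recordings from the same person could still fall into different folds.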