# Parkinson's Machine Learning Database Practice

## Introduction Information

This notebook has been made as an information resource.

Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons

## Backup data

Most important step!
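
One way to take a backup from Python is sketched below. This is only an illustration: the file name, the temporary folder and the `.bak` suffix are assumptions made for the example, not part of the original workflow, so substitute the location of your own raw data file.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical stand-in for the raw data file; point this at your own copy.
workdir = Path(tempfile.mkdtemp())
original = workdir / "parkinsons.data"
original.write_text("name,status\nphon_R01_S01_1,1\n")

# Keep an untouched copy next to the original before any processing starts.
backup = original.with_name(original.name + ".bak")
shutil.copy2(original, backup)  # copy2 also tries to preserve file metadata

print(backup.read_text() == original.read_text())
```

Working only on the loaded dataframe and never writing back to the raw file gives a second layer of protection.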

## Import the necessary libraries

In [1]:
# machine learning
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Why are we using a dataframe?

* Data must be loaded into a dataframe so that pandas can read and manipulate it properly.

### Loading Our Data
The code below loads the Parkinson's data into a dataframe called 'df', which is the basis of all further data display and manipulation. Here, df.head() displays the first 5 rows of the dataframe. If df.tail() were called instead, it would show the last 5 rows.

In [2]:
df = pd.read_csv('Z:/PERSONAL/ljb/Python Code/Data/parkinsons.data')
df.head()

| | name | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | ... | Shimmer:DDA | NHR | HNR | status | RPDE | DFA | spread1 | spread2 | D2 | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | phon_R01_S01_1 | 119.992 | 157.302 | 74.997 | 0.00784 | 7e-05 | 0.0037 | 0.00554 | 0.01109 | 0.04374 | ... | 0.06545 | 0.02211 | 21.033 | 1 | 0.414783 | 0.815285 | -4.813031 | 0.266482 | 2.301442 | 0.284654 |
| 1 | phon_R01_S01_2 | 122.4 | 148.65 | 113.819 | 0.00968 | 8e-05 | 0.00465 | 0.00696 | 0.01394 | 0.06134 | ... | 0.09403 | 0.01929 | 19.085 | 1 | 0.458359 | 0.819521 | -4.075192 | 0.33559 | 2.486855 | 0.368674 |
| 2 | phon_R01_S01_3 | 116.682 | 131.111 | 111.555 | 0.0105 | 9e-05 | 0.00544 | 0.00781 | 0.01633 | 0.05233 | ... | 0.0827 | 0.01309 | 20.651 | 1 | 0.429895 | 0.825288 | -4.443179 | 0.311173 | 2.342259 | 0.332634 |
| 3 | phon_R01_S01_4 | 116.676 | 137.871 | 111.366 | 0.00997 | 9e-05 | 0.00502 | 0.00698 | 0.01505 | 0.05492 | ... | 0.08771 | 0.01353 | 20.644 | 1 | 0.434969 | 0.819235 | -4.117501 | 0.334147 | 2.405554 | 0.368975 |
| 4 | phon_R01_S01_5 | 116.014 | 141.781 | 110.655 | 0.01284 | 0.00011 | 0.00655 | 0.00908 | 0.01966 | 0.06425 | ... | 0.1047 | 0.01767 | 19.649 | 1 | 0.417356 | 0.823484 | -3.747787 | 0.234513 | 2.33218 | 0.410335 |
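
To make the behaviour of head() and tail() concrete, here is a minimal sketch on a made-up dataframe (the column names and values are invented for illustration and are not taken from the Parkinson's data). Both methods return 5 rows by default and accept a number to change that.

```python
import pandas as pd

# Hypothetical 10-row dataframe standing in for a loaded dataset.
toy = pd.DataFrame({
    "recording": range(10),
    "status": [1, 1, 1, 0, 0, 1, 1, 0, 1, 1],
})

first_rows = toy.head()    # first 5 rows by default
last_three = toy.tail(3)   # pass a number to change how many rows come back

print(toy.shape)                    # (rows, columns) of the whole dataframe
print(first_rows.index.tolist())
print(last_three.index.tolist())
```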


## Acquire the Features and Labels from the dataset

Features are the columns in a dataset and are considered the inputs into your machine learning model.

Labels are the known outputs (classifications) that the model learns to predict. They are used in supervised machine learning, as shown here, and are not used in unsupervised machine learning. (In this case the label records whether or not a patient has Parkinson's.)

Here is a link with more examples of machine learning terminology: https://developers.google.com/machine-learning/crash-course/framing/ml-terminology

### Setting our own Features and Labels using columns that are already there

When selecting data in a dataframe, df.loc locates the exact slice of a dataframe that is asked for.

The code below selects every column except status, then uses .values[:, 1:] to convert the result to a NumPy array and drop its first column (name), which is an identifier rather than a measurement. status is left out because it is being used as the Label (classification) for this supervised machine learning exercise.

Status only has the values 1 or 0: 1 indicates the patient has Parkinson's and 0 indicates the patient does not. This makes it well suited to be the Label in our model.

For a comprehensive guide to slice notation, Stack Overflow provides an excellent thread on the topic: https://stackoverflow.com/questions/509211/understanding-slice-notation/509295#509295

In [3]:
features = df.loc[:, df.columns != 'status'].values[:, 1:]
labels = df.loc[:, 'status'].values
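
As an alternative sketch, the same selection can be written by naming the columns to leave out, which avoids relying on name being the first column. The miniature dataframe below is made up for illustration; only the column names name, status, MDVP:Fo(Hz) and PPE come from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Parkinson's table (values are invented).
toy = pd.DataFrame({
    "name": ["a_1", "a_2", "b_1"],
    "MDVP:Fo(Hz)": [119.9, 122.4, 197.1],
    "status": [1, 1, 0],
    "PPE": [0.28, 0.37, 0.09],
})

# Notebook-style selection: mask out 'status', then slice off column 0 (name).
notebook_features = toy.loc[:, toy.columns != "status"].values[:, 1:]

# Name-based selection: drop the identifier and the label explicitly.
named_features = toy.drop(columns=["name", "status"]).to_numpy()
labels = toy["status"].to_numpy()

print(named_features.shape)
print(np.array_equal(notebook_features.astype(float), named_features))
```

Both approaches give the same numbers here. The name-based version produces a float array directly, whereas the notebook-style version passes through an object array because the name column holds text.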

## Cursory data exploration

Counting is a deceptively powerful tool. Knowing how much of your data is available for analysis is essential for good machine learning projects. Often data sets are messy, incomplete and downright difficult to read. Large datasets may not need every field complete to generate a reliable prediction or classification. However, the smaller the data project is, the larger the impact missing data is likely to have.

Often the most interesting data ends up incomplete. One example of this is free text in questionnaires. This can reveal far more than the answers you thought you were looking for, but it is almost certain that some people will skip answering it to save time.

Due to the nature of surveying, among other data gathering methods, it can be impossible to follow up with respondents, repeat experiments, or get exactly the same observations ever again.

The following code counts the number of 1's and 0's in the 'status' column that has been defined as labels. This gives an indication of the amount of data available to us. If the ratio of 1's to 0's is wildly unbalanced, it could bias the model and make analysis difficult.

Performing exploratory data analysis is an essential skill for data science, especially in real Health Care situations where perfect raw data sets are impossible to find.

In [4]:
print("Number of 1's:", labels[labels == 1].shape[0],
      "Number of 0's:", labels[labels == 0].shape[0])

Number of 1's: 147 Number of 0's: 48
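
To put those counts in context, the sketch below turns them into proportions and works out the accuracy a model would get by always predicting the larger class. The label array is rebuilt from the counts printed above rather than loaded from the file, and the name majority_baseline is just an illustrative choice.

```python
import numpy as np

# Rebuild a label array with the same class counts as the output above.
labels = np.array([1] * 147 + [0] * 48)

counts = np.bincount(labels)            # counts[0] = number of 0's, counts[1] = number of 1's
proportions = counts / counts.sum()     # share of each class
majority_baseline = proportions.max()   # accuracy of always predicting the larger class

print(counts)
print(proportions.round(3))
print(round(majority_baseline * 100, 1))
```

Roughly three quarters of the rows are 1's, so any accuracy figure reported later is best judged against this baseline rather than against 50%.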


## Data preparation

Data preparation is essential to ensure that the quality and size of your data are adequate for the task at hand. Without properly prepared data, you can have errors in your outcomes, or even choose the wrong outcomes entirely!

### Scaling our data appropriately

The following code introduces a MinMaxScaler that rescales each feature column into the range -1 to 1. scaler.fit_transform() is a fitting method name: it fits the scaler (learning each column's minimum and maximum) and transforms the values in one call, without having to write excess code to manually adjust each value to fall between -1 and 1.

You may notice that only our Features are transformed. The Labels are class values (0 or 1) rather than measurements, so they do not need transformation or scaling.

When working on your own data, you may need smaller or larger scales, or you may not even need to scale your data, depending on how it is structured and how many variables you have in play.

In [5]:
scaler = MinMaxScaler((-1, 1))
x = scaler.fit_transform(features)
y = labels
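
For readers who want to see what the scaler does to each value, here is a small sketch that applies the min-max formula by hand and compares it with MinMaxScaler. The feature values are invented, and lo and hi are illustrative names for the ends of the target range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrix: 3 rows, 2 columns on very different scales.
features = np.array([[100.0, 0.002],
                     [150.0, 0.010],
                     [200.0, 0.006]])

lo, hi = -1, 1
col_min = features.min(axis=0)
col_max = features.max(axis=0)

# Rescale each column to 0..1, then stretch and shift it to lo..hi.
manual = (features - col_min) / (col_max - col_min) * (hi - lo) + lo

scaled = MinMaxScaler((lo, hi)).fit_transform(features)

print(manual)
print(np.allclose(manual, scaled))
```

Each column is scaled independently, so the smallest value in a column maps to -1 and the largest to 1 regardless of the column's original units.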

### Classifying our training and test data

There will be times where data is already prepared into training data and test data, and then there are times like this where the data we are using is in a single file. Dealing with this can be as simple as splitting the data set randomly and saving two different files so that training and test data are kept separate, or splitting it using Python in the notebook.

You must ensure that you do not have the same data in both your test and training set (and validation set if you are using one) as this will lead to better reported accuracy than is actually the case!

As the dataset that we are using is quite small, the data will be split in this notebook by Python rather than worrying about creating new files.

The following code randomly splits x and y into two separate parts: 80% training data and 20% testing data (test_size=0.2). The split is random, but the random seed (random_state) has been set to 7 so that the same split, and therefore the same results, are reproduced every time the code is run.

More information on random state is available from the scikit-learn documentation: https://scikit-learn.org/stable/developers/utilities.html

In [6]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=7)
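
The sketch below checks what this split produces, using stand-in arrays with the same number of rows (195, from the class counts above) instead of the real features. It confirms the sizes of the two parts, that repeating the call with the same random_state gives the same split, and that no row appears in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 195 rows, each identified by its row number.
x = np.arange(195).reshape(-1, 1)
y = np.array([1] * 147 + [0] * 48)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=7)
x_train_again, x_test_again, _, _ = train_test_split(
    x, y, test_size=0.2, random_state=7)

overlap = set(x_train.ravel()) & set(x_test.ravel())

print(len(x_train), len(x_test))               # training rows, test rows
print(np.array_equal(x_test, x_test_again))    # same seed, same split
print(len(overlap))                            # rows shared by both parts
```

With classes as unbalanced as these, train_test_split also accepts stratify=y to keep the ratio of 1's to 0's similar in both parts. The notebook does not use it, so it is mentioned here only as an option.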

## Model training

The model, much like a pet, needs to be trained before it has any hope of producing the correct result. There are various methods for training a model, with multiple options available in different Python modules. Here we will be using the gradient boosting algorithm within XGBClassifier as an example of one of these methods.

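Gradient boosting algorithms essentially rely on combining many "lesser" (weak) models, usually small decision trees, as a group rather than individually. Each new tree is trained to correct the errors left by the trees before it, which allows a more accurate decision to be made.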
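
The idea of combining weak models can be sketched in a few lines. This is a simplified illustration and not how XGBClassifier is implemented: it uses a regression problem with squared error, made-up data, scikit-learn's DecisionTreeRegressor as the weak model, and an arbitrary learning rate and number of rounds.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up regression data: a noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())        # start from a constant guess
errors = [np.mean((y - prediction) ** 2)]     # training error before any trees

for _ in range(50):
    residuals = y - prediction                # what the group still gets wrong
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    prediction += learning_rate * stump.predict(X)   # add a small correction
    errors.append(np.mean((y - prediction) ** 2))

print(errors[0], errors[-1])
```

Each one-split tree is a poor model on its own, yet the training error falls as more of them are added. XGBoost builds on the same principle with a classification loss and additional regularisation.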

The following code trains the model on our training data x_train and y_train, using the XGBClassifier fit method to get our model ready for use. We will not be changing any of the parameters in the initial setup and running of the code.

More information on each of the parameters returned after fitting the model is available here: https://xgboost.readthedocs.io/en/latest/python/python_api.html

In [7]:
model = XGBClassifier()
model.fit(x_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

## Model validation

<font color=purple> _Author's opinion: Validating your model is more important than producing a result_ </font>

Validation is defined as "The action of checking or proving the validity or accuracy of something." (Source: https://www.lexico.com/definition/validation)

By building a model, either as an individual or a team, you should take responsibility and ensure that it is producing the correct results.

What constitutes an acceptable level of accuracy can vary from model to model and can change over time as more data is gathered through use of the model itself!

Being able to prove how accurate your model is in Health Care is extremely important, as people's health often depends on the predictions/outcomes of your model. This extends to other areas of machine learning predictions, such as business, where being able to predict market trends is essential to financial decisions that can affect a company's financial success.

The code below uses the model to predict the outcomes for the test data. accuracy_score then compares the known test labels with what we predicted and returns the fraction that match. This returned a score of 94.87...%, which is rather good for a simple model with no attempt at tuning and random data selection.

In [8]:
y_pred=model.predict(x_test)
print("Model Accuracy:", (accuracy_score(y_test, y_pred)*100))

Model Accuracy: 94.87179487179486
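
To show what accuracy_score measures, the sketch below computes it by hand on a small pair of invented label arrays and also prints a confusion matrix, which separates the two kinds of mistake. The arrays are hypothetical and are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for 8 patients.
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0])
y_hat = np.array([1, 1, 1, 1, 1, 0, 0, 1])

manual_accuracy = np.mean(y_true == y_hat)     # fraction of matching labels
library_accuracy = accuracy_score(y_true, y_hat)

# Rows are the true class (0 then 1), columns are the predicted class.
matrix = confusion_matrix(y_true, y_hat)

print(manual_accuracy, library_accuracy)
print(matrix)
```

In a Health Care setting the off-diagonal cells matter individually: a patient with Parkinson's predicted as 0 is a different kind of error from a healthy patient predicted as 1, and accuracy alone does not distinguish them.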


If we repeat the same data split, with the same model structure, and change the random seed from 7 to 1, we see that the accuracy goes up. This is due to the random split happening to choose data that is more suitable for an accurate prediction in this particular instance, not to any improvement in the model itself.

When preparing your model, it is important to be aware of the fact that even small changes can have an enormous effect on your data validity. This is something that also comes with experience and domain specific knowledge to know how well fit a model needs to be, and how much data is necessary to ensure the best model possible for you or your business's needs.

On a positive note, once you have a working model it can be good to revisit it at appropriate intervals to add newly collected data and improve the model itself by increasing the amount of data that it can learn from. This can help to reduce the under-fitting or over-fitting that comes from having too little data.

In [9]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1)
model = XGBClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print("Model Accuracy:", (accuracy_score(y_test, y_pred)*100))

Model Accuracy: 97.43589743589743
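
Since the score depends on the seed, one way to get a feel for that variation is to repeat the split over several seeds and look at the spread. The sketch below does this under stated assumptions: it uses synthetic data from make_classification instead of the Parkinson's file, and swaps in scikit-learn's DecisionTreeClassifier for XGBClassifier so that it runs without xgboost installed. The numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 195 rows with roughly a 1:3 class imbalance.
x, y = make_classification(n_samples=195, n_features=22,
                           weights=[0.25, 0.75], random_state=0)

scores = []
for seed in range(10):
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(x_test)))

print("mean:", np.mean(scores), "std:", np.std(scores))
print("min:", min(scores), "max:", max(scores))
```

Reporting the mean and spread over several splits gives a fairer picture than quoting the result from whichever single seed happened to score highest.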
