Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests itself through a deterioration of movement, including the presence of tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and risk of dementia is increased.


Traditional diagnosis of Parkinson’s Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test to diagnose PD, diagnosis is often difficult, particularly in the early stages when motor effects are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that doesn’t require a clinic visit, would be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive tool for diagnosis. If machine learning algorithms could be applied to a voice recording dataset to accurately diagnose PD, this would be an effective screening step prior to an appointment with a clinician.


1. name - ASCII subject name and recording number
2. MDVP:Fo(Hz) - Average vocal fundamental frequency
3. MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
4. MDVP:Flo(Hz) - Minimum vocal fundamental frequency
5. MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
6. MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
7. NHR,HNR - Two measures of ratio of noise to tonal components in the voice
8. status - Health status of the subject: (one) - Parkinson's, (zero) - healthy
9. RPDE,D2 - Two nonlinear dynamical complexity measures
10. DFA - Signal fractal scaling exponent
11. spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation



The data consists of recordings from people diagnosed with Parkinson's Disease and from people who have not been diagnosed with it.

In [1]:
# Import all the necessary modules

# Standard library
import itertools
import os
import warnings
from os import system

# Numerical computing, data handling and plotting
# (the cells below use the full names numpy, pandas and matplot; the short aliases are kept as well)
import numpy
import numpy as np
import pandas
import pandas as pd
import matplotlib.pyplot as matplot
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib.colors import ListedColormap
%matplotlib inline
import seaborn as sns
from IPython.display import Image

# Statistics
from scipy.stats import levene, shapiro, f_oneway
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# scikit-learn: models, model selection, metrics and clustering
from sklearn import metrics, svm, tree
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             silhouette_samples, silhouette_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Project-specific helper modules
from Custom import Perform_EDA as EDA
from Custom import Build_Model as Build_Model

# Suppress library warnings and fix the random seed for reproducibility
warnings.filterwarnings('ignore')
numpy.random.seed(2345)

ModuleNotFoundError: No module named 'seaborn'

In [None]:
def plot_confusion_matrix(Y_test, Y_predict, target_names, title='Confusion matrix', cmap=None, normalize=True):
    """Plot the confusion matrix and return accuracy, recall, precision and F1 as a DataFrame."""
    cm = metrics.confusion_matrix(Y_test, Y_predict)
    # Accuracy is computed from the raw counts, before any normalization
    accuracy = numpy.trace(cm) / float(numpy.sum(cm))
    misclass = 1 - accuracy

    if cmap is None:
        cmap = matplot.get_cmap('Blues')

    # Normalize each row (true class) before plotting so that the cell colors,
    # the text threshold and the printed values all refer to the same numbers
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, numpy.newaxis]

    matplot.figure(figsize=(8, 6))
    matplot.imshow(cm, interpolation='nearest', cmap=cmap)
    matplot.title(title)
    matplot.colorbar()

    if target_names is not None:
        tick_marks = numpy.arange(len(target_names))
        matplot.xticks(tick_marks, target_names, rotation=45)
        matplot.yticks(tick_marks, target_names)

    # Annotate every cell; use white text on dark cells for readability
    cell_format = "{:0.4f}" if normalize else "{:,}"
    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        matplot.text(j, i, cell_format.format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")

    matplot.tight_layout()
    matplot.ylabel('True label')
    matplot.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
    matplot.show()

    # Text report plus a compact table of the headline metrics
    print(metrics.classification_report(Y_test, Y_predict))
    model_performance = [metrics.accuracy_score(Y_test, Y_predict), metrics.recall_score(Y_test, Y_predict),
                         metrics.precision_score(Y_test, Y_predict), metrics.f1_score(Y_test, Y_predict)]
    accuracy_report = pandas.DataFrame(model_performance, columns=['Model_Performance'],
                                       index=['Accuracy', 'Recall', 'Precision', 'f1_Score'])
    return accuracy_report

Helper functions for visualization and EDA are defined for ease of analysis
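
As a rough illustration of what `plot_confusion_matrix` reports, the sketch below recomputes accuracy, recall, precision and F1 by hand from a 2x2 confusion matrix. The counts are made up for the example (they are not results from this data set), and `metrics_from_confusion` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Return accuracy, recall, precision and F1 for a 2x2 confusion matrix.

    Rows are true labels, columns are predicted labels, and class 1 is the
    positive class (the layout sklearn.metrics.confusion_matrix produces).
    """
    cm = np.asarray(cm, dtype=float)
    tn, fp = cm[0]
    fn, tp = cm[1]
    accuracy = (tp + tn) / cm.sum()
    recall = tp / (tp + fn)        # share of true positives that were found
    precision = tp / (tp + fp)     # share of positive predictions that were right
    f1 = 2 * precision * recall / (precision + recall)
    return {"Accuracy": accuracy, "Recall": recall,
            "Precision": precision, "f1_Score": f1}

# Illustrative counts only (assumption, not taken from the Parkinson's data)
example_cm = [[8, 2],
              [1, 9]]
example_metrics = metrics_from_confusion(example_cm)
print(example_metrics)
```

For a screening use case, recall on the Parkinson's class is usually the metric to watch, since a missed case is more costly than a false alarm.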

In [None]:
Source = pandas.read_csv("parkinsons_data.csv")

# Understand the data set

Approach :
    
The data is skimmed to identify the variables present, their data types, the shape of the data set, the column names, mixed data types, missing values, etc.
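
The skim described above can be sketched as a single summary table. This is a minimal sketch on a small made-up frame, not the real data set; `skim` and `demo` are hypothetical names, and the mixed-type `status` column is deliberately constructed to show what the check catches.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the real data set
demo = pd.DataFrame({
    "name": ["s01_1", "s01_2", "s02_1", "s02_2"],
    "MDVP:Fo(Hz)": [119.9, 122.4, np.nan, 116.7],
    "status": [1, 1, 0, "0"],   # deliberately mixes int and str
})

def skim(df):
    """Summarise dtype, missing values and mixed Python types per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
        # More than one Python type among the non-null values signals mixed data
        "n_python_types": df.apply(lambda col: col.dropna().map(type).nunique()),
    })

skim_report = skim(demo)
print(demo.shape)
print(skim_report)
```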


In [None]:
Source.head()

In [None]:
Source.info()

In [None]:
Source.shape

In [None]:
Source["status"].value_counts()

In [None]:
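# Share of each class, computed from the data rather than from hard-coded counts
Source["status"].value_counts(normalize=True)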

In [None]:
Source["status"] = pandas.Categorical(Source["status"])

In [None]:
Source.info()

# Exploratory Data Analysis - Data Wrangling and Pre-processing

# Approach

1. Analyse the 5-point summary, kurtosis, skewness and range
2. Analyse the distribution of the data for each variable
3. Analyse outliers using Box plot
4. Infer the results and assess the impact 
5. Perform correlation analysis and VIF to determine the relationships between X's
6. Determine the data transformation and treatment requirements like missing values, outliers, scaling etc
7. Choose the set of predictors which can be used for modelling
8. Remove outliers
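
The first steps of this approach can be sketched for a single feature as follows. This is a minimal sketch, assuming the summary is computed with pandas (the actual `Custom.Perform_EDA` implementation is not shown in this notebook); `describe_feature` is a hypothetical helper and the input values are made up.

```python
import numpy as np
import pandas as pd

def describe_feature(values):
    """Five-point summary, range, skewness, excess kurtosis and IQR outlier count."""
    s = pd.Series(values, dtype=float)
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Box-plot rule: points beyond 1.5 * IQR from the quartiles count as outliers
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "min": s.min(), "q1": q1, "median": median, "q3": q3, "max": s.max(),
        "range": s.max() - s.min(),
        "skewness": s.skew(),   # > 0 means a longer right tail
        "kurtosis": s.kurt(),   # excess kurtosis: about 0 for a normal distribution
        "n_outliers": int(((s < lower) | (s > upper)).sum()),
    }

# Made-up values with one obvious high outlier
summary = describe_feature([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(summary)
```

Note that `Series.kurt()` reports excess kurtosis, so negative values mean lighter tails than a normal distribution and positive values mean heavier tails.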

In [None]:
EDA.EDA(Source)

In [None]:
EDA.univariate_plots(Source)

MDVP:Fo(Hz)	:
1. Range 171.77 and Std Dev 41.39 suggest a large spread of the data around the mean
2. Kurtosis -0.62 suggests light tails, i.e. less data is distributed in the tails. The negative value is not strong, so the tails are only slightly thin
3. Skewness 0.59 suggests that the data is positively skewed. The skewness is mild, so whether a transformation (scaling, log or exp) is needed has to be evaluated
4. Box plot suggests that there are no outliers in the data, although the spread is large

MDVP:Fhi(Hz)	:
1. Range 489.885 and Std Dev 91.49 suggest a large spread of the data around the mean, and the data may have outliers
2. Kurtosis 7.62 suggests a heavy tail, with more data points distributed in the tail. This is evident in the box plot
3. Skewness 2.542 suggests that the data is strongly positively skewed; hence if this feature is used as a predictor, transformation is required
4. Box plot suggests that there are outliers in the data, which need treatment before building a model
5. Distplot shows a slight bi-modal distribution, indicating a possible Gaussian mixture. This, however, is a risk that needs to be accepted for this project
6. Distplot shows that the data is concentrated in a narrow area
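
One way to put a number on the "possible Gaussian mixture" observation is to compare one-component and two-component Gaussian mixture fits by BIC. The sketch below is an assumption about how such a check could be done, not something the notebook performs; it uses synthetic values in place of the real column.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for a bi-modal feature: two well-separated groups
values = np.concatenate([rng.normal(120, 10, 150),
                         rng.normal(220, 15, 50)]).reshape(-1, 1)

# Fit mixtures with 1 and 2 components and record the BIC of each
bic = {}
for k in (1, 2):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(values)
    bic[k] = gmm.bic(values)

print(bic)
```

A clearly lower BIC for two components supports the mixture reading; similar values suggest the second peak is too weak to matter.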

MDVP:Flo(Hz) :
1. Range 173.69 and std dev 43.52 suggest that the spread of the data is large
2. Kurtosis 0.654 suggests a tail somewhat heavier than that of a normal distribution. This is evident in the distribution plot and box plot
3. Skewness 1.217 suggests that the data is strongly positively skewed; hence if this feature is used as a predictor, transformation is required
4. Box plot suggests that there are outliers, which need treatment before building a model


'MDVP:Jitter(%)' :
1. Range 0.031480 and std dev 0.004848 suggest a wide spread of data given the scale of the data points
2. Kurtosis 12.03 is the 4th highest amongst all 22 variables of the data set and indicates a heavy tail, i.e. a large share of the data lies in the tail. This needs treatment
3. Skewness 3.0849 suggests positive skewness of the data. This is evident in the distribution plot and box plot. Data transformation is required to handle this skewness
4. This variable has a considerable number of outliers, as is evident in the box plot

'MDVP:Jitter(Abs)' :
1. Range 0.000253 and std dev 0.000035 suggest a wide spread of data given the scale of the data points
2. Kurtosis 10.86 is one of the highest amongst all 22 variables of the data set and indicates a heavy tail, i.e. a large share of the data lies in the tail. This needs treatment
3. Skewness 2.649 suggests positive skewness of the data. This is evident in the distribution plot and box plot. Data transformation is required to handle this skewness
4. This variable has a considerable number of outliers, as is evident in the box plot

'MDVP:RAP' :
1. Range 0.020760 and std dev 0.002968 suggest a fair amount of spread given the scale of the data points
2. Kurtosis 14.213 is the 3rd highest amongst all 22 variables of the data set and indicates a heavy tail, i.e. a large share of the data lies in the tail. This needs treatment
3. Skewness 3.360 suggests positive skewness of the data. This is evident in the distribution plot and box plot. Data transformation is required to handle this skewness
4. This variable has a considerable number of outliers, as is evident in the box plot
5. The distribution plot shows 2 small peaks in the long tail, indicating a possible Gaussian mixture; since they are small, this may or may not impact the models

'MDVP:PPQ' :
1. Range 0.018660 and std dev 0.002759 suggest that the spread of the data is large
2. Kurtosis 11.963922 is one of the highest amongst all 22 variables of the data set and indicates a heavy tail, i.e. a large share of the data lies in the tail. This needs treatment
3. Skewness 3.073892 suggests that the data is strongly positively skewed; hence if this feature is used as a predictor, transformation is required
4. Box plot suggests that there are outliers, which need treatment before building a model
5. The distribution plot shows small peaks in the long tail, indicating a possible Gaussian mixture; since they are small, this may or may not impact the models

'Jitter:DDP' :
1. Range 0.062290 and std dev 0.008903 suggest that the spread of the data is large
2. Kurtosis 14.224762 is the 2nd highest amongst all 22 variables of the data set and indicates a heavy tail, i.e. a large share of the data lies in the tail. This needs treatment
3. Skewness 3.362058 suggests that the data is strongly positively skewed; hence if this feature is used as a predictor, transformation is required
4. Box plot suggests that there are outliers, which need treatment before building a model
5. The distribution plot confirms the inference drawn from skewness and kurtosis: the long tail and positive skewness are clearly visible

'MDVP:Shimmer' : 
1. Range 0.109540 and std dev 0.018857 suggest a wide spread of data given the scale of the data points
2. Kurtosis 3.238308 shows a moderate tail, i.e. some data is distributed along the tail. Though this is moderate, it requires treatment
3. Skewness 1.666480 suggests positive skewness of the data. This is evident in the distribution plot and box plot. Data transformation is required to handle this skewness
4. This variable has a considerable number of outliers, as is evident in the box plot. These will have to be treated before model building

'MDVP:Shimmer(dB)' : 
1. Range 1.217000 and std dev 0.194877 suggest a wide spread of data given the scale of the data points
2. Kurtosis 5.128193 shows a moderate tail, i.e. some data is distributed along the tail. Though this is moderate, it requires treatment
3. Skewness 1.999389 suggests positive skewness of the data. This is evident in the distribution plot and box plot. Data transformation is required to handle this skewness
4. This variable has a considerable number of outliers, as is evident in the box plot. These will have to be treated before model building

'Shimmer:APQ3' :
1. Range 0.051920 and std dev 0.010153 suggest a wide spread of data given the scale of the data points
2. Kurtosis 2.720152 shows a moderate tail, i.e. some data is distributed along the tail. Though this is moderate, it requires treatment
3. Skewness 1.580576 suggests positive skewness of the data. This is evident in the distribution plot and box plot. Data transformation is required to handle this skewness
4. This variable has a few outliers, as is evident in the box plot. These will have to be treated before model building

'Shimmer:APQ5' : 
1. Range 0.073700 and std dev 0.012024 suggest a wide spread of data given the scale of the data points
2. Kurtosis 3.874210 shows a moderate tail, i.e. some data is distributed along the tail. Though this is moderate, it requires treatment
3. Skewness 1.798697 suggests positive skewness of the data. This is evident in the distribution plot and box plot. Data transformation is required to handle this skewness
4. This variable has a few outliers, as is evident in the box plot. These will have to be treated before model building
5. The distribution plot shows small peaks in the long tail, indicating a possible Gaussian mixture; since they are small, this may or may not impact the models

'MDVP:APQ' : 
1. Range 0.130590 and std dev 0.016947 suggest a wide spread of data given the scale of the data points
2. Kurtosis 11.163288 is one of the highest amongst all 22 variables of the data set and indicates a heavy tail, i.e. a large share of the data lies in the tail. This needs treatment
3. Skewness 2.618047 suggests positive skewness of the data. This is evident in the distribution plot and box plot. Data transformation is required to handle this skewness
4. This variable has a few outliers, as is evident in the box plot. These will have to be treated before model building

'Shimmer:DDA' : 
1. Range 0.155780 and std dev 0.030459 suggest a wide spread of data given the scale of the data points
2. Kurtosis 2.720661 shows a moderate tail, i.e. some data is distributed along the tail. Though this is moderate, it requires treatment
3. Skewness 1.580618 suggests positive skewness of the data. This is evident in the distribution plot and box plot. Data transformation is required to handle this skewness
4. This variable has a few outliers, as is evident in the box plot. These will have to be treated before model building

'NHR' :
1. Range 0.314170 and std dev 0.040418 suggest a wide spread of data given the scale of the data points
2. Kurtosis 21.994974 is the highest amongst all 22 variables of the data set and indicates a heavy tail, i.e. a large share of the data lies in the tail. This needs treatment
3. Skewness 4.220709 is the highest amongst all 22 variables and shows strong positive skewness of the data. This is evident in the distribution plot and box plot.
   Data transformation is required to handle this skewness
4. This variable has quite a lot of outliers, which will have to be treated before model building

'HNR' :
1. Range 24.606000 and std dev 4.425764 suggest a wide spread of data given the scale of the data points
2. Kurtosis 0.616036 shows a tail slightly heavier than that of a normal distribution. Though this is small, it requires treatment
3. Skewness -0.514317 suggests moderate negative skewness of the data. This is evident in the distribution plot and box plot. Data transformation should be considered to handle this skewness
4. This variable has a few outliers, as is evident in the box plot. These will have to be treated before model building
5. The distribution of the data shows a small double peak, indicating a possible Gaussian mixture. This, however, is a risk that needs to be accepted for this project

'RPDE' :
1. Range 0.428581 and std dev 0.103942 suggest a wide spread of data given the scale of the data points
2. Kurtosis -0.921781 suggests light tails, i.e. less data is distributed in the tails. The negative value is not strong, so the tails are only slightly thin
3. Skewness -0.143402 suggests slight negative skewness of the data. The skew is weak, so a transformation may not be necessary
4. Box plot shows that there are no outliers, hence no special treatment is required
5. The distribution of the data shows a double peak, indicating a possible Gaussian mixture. This, however, is a risk that needs to be accepted for this project

'DFA' : 
1. Range 0.251006 and std dev 0.055336 suggest a wide spread of data given the scale of the data points
2. Kurtosis -0.686152 suggests light tails, i.e. less data is distributed in the tails. The negative value is not strong, so the tails are only slightly thin
3. Skewness -0.033214 is very close to zero, so the data is nearly symmetric and a transformation for skewness is unlikely to be necessary
4. Box plot shows that there are no outliers, hence no special treatment is required
5. The distribution of the data shows a double peak, indicating a possible Gaussian mixture. This, however, is a risk that needs to be accepted for this project

'spread1' :
1. Range 5.530953 and std dev 1.090208 suggest a wide spread of data given the scale of the data points
2. Kurtosis 0.050199 is close to zero, suggesting tails similar to those of a normal distribution
3. Skewness 0.432139 suggests mild positive skewness of the data. This is evident in the distribution plot and box plot. Whether a transformation is needed has to be evaluated
4. This variable has a few outliers, as is evident in the box plot. These will have to be treated before model building
5. The distribution of the data shows good symmetry compared with the other variables

'spread2' :
1. Range 0.444219 and std dev 0.083406 suggest a wide spread of data given the scale of the data points
2. Kurtosis -0.083023 is close to zero, suggesting tails only very slightly lighter than those of a normal distribution
3. Skewness 0.144430 suggests slight positive skewness of the data. The skew is weak, so a transformation may not be necessary
4. This variable has two outlying data points, as is evident in the box plot. These will have to be treated before model building
5. The distribution of the data shows good symmetry compared with the other variables

'D2' : 
1. Range 2.247868 and Std Dev 0.382799 suggest a large spread of the data around the mean, and the data may have outliers
2. Kurtosis 0.220334 suggests a tail slightly heavier than that of a normal distribution. This is evident in the box plot
3. Skewness 0.430384 suggests that the data is mildly positively skewed. If this feature is used as a predictor, whether a transformation is needed has to be evaluated
4. This variable has only one outlying data point, as is evident in the box plot. This will have to be treated before model building
5. The distribution of the data shows good symmetry compared with the other variables

'PPE' :
1. Range 0.482828 and Std Dev 0.090119 suggest a large spread of the data around the mean, and the data may have outliers
2. Kurtosis 0.528335 suggests a tail slightly heavier than that of a normal distribution. This is evident in the box plot
3. Skewness 0.797491 suggests that the data is moderately positively skewed; if this feature is used as a predictor, a transformation should be considered
4. This variable has a few outliers, as is evident in the box plot. These will have to be treated before model building

The features are on different scales, and hence normalization (scaling, or a log or exp transformation) may have to be done
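
To illustrate the effect of such treatment, the sketch below applies a log transform followed by z-score scaling to a synthetic right-skewed feature. The data is generated, not taken from the data set, and the choice of a plain `log` is an assumption that only holds for strictly positive features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed feature (log-normal), standing in for a jitter-like column
raw = pd.Series(rng.lognormal(mean=-5.0, sigma=0.8, size=500))

# Log transform pulls in the long right tail (valid because all values are > 0)
logged = np.log(raw)

# Z-score scaling puts the feature on a common scale: mean 0, standard deviation 1
scaled = (logged - logged.mean()) / logged.std(ddof=0)

print("skewness before:", round(raw.skew(), 2))
print("skewness after log:", round(logged.skew(), 2))
```

For features that can be zero or negative (for example `spread1`), a log transform does not apply directly and scaling alone may be the safer choice.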

In [None]:
sns.pairplot(Source[['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)',
       'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
       'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
       'MDVP:APQ', 'Shimmer:DDA']])

In [None]:
EDA.EDA_Corr(Source)

Correlation Analysis Inference :

1. 7 out of 22 variables - MDVP:Fhi(Hz), DFA, MDVP:Fo(Hz), MDVP:Flo(Hz), RPDE, spread2, D2 - are only weakly correlated with the other Xs. This means that they are
   potentially good predictors
2. 14 out of 22 variables - PPE, spread1, MDVP:APQ, MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ5, MDVP:PPQ, Jitter:DDP, MDVP:RAP, MDVP:Jitter(Abs), Shimmer:APQ3,
   MDVP:Jitter(%), Shimmer:DDA, NHR - are positively correlated with other variables and with each other. This multicollinearity can compound, which may result
   in all or some of them being poor predictors. During model building, these will have to be used judiciously
3. HNR has an inverse relationship with 16 of the remaining 21 variables. This again will have to be used judiciously during model building
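
The link between pairwise correlation and multicollinearity can be sketched with the variance inflation factor (VIF). This is a minimal sketch on synthetic columns, not the notebook's `EDA.EDA_Corr` implementation; it relies on the identity that the VIF of each variable equals the corresponding diagonal entry of the inverse correlation matrix. The column names `x_a`, `x_b` and `x_c` are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
base = rng.normal(size=n)
demo = pd.DataFrame({
    "x_a": base + rng.normal(scale=0.1, size=n),   # nearly duplicates x_b
    "x_b": base + rng.normal(scale=0.1, size=n),
    "x_c": rng.normal(size=n),                     # unrelated to the others
})

corr = demo.corr()

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
# of the inverse of the correlation matrix
vif = pd.Series(np.diag(np.linalg.inv(corr.to_numpy())), index=corr.columns)

print(corr.round(2))
print(vif.round(2))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a variable is largely explained by the others, which is the situation described for the jitter and shimmer families above.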