# Prediction of Parkinson’s Disease Using Voice Analysis


> - ###  Halima Bulama-Ladan
> - ### 03/16/2022





## Parkinson's Disease:


> Parkinson's disease is an ongoing, progressive disease of the nervous system that affects a patient's movement. Millions of individuals worldwide are diagnosed with Parkinson's disease. The cause of Parkinson's disease is presently not known. However, research attributes the disease to a combination of genetic and environmental factors. Age also plays a role: most cases of Parkinson's disease begin after an individual is sixty years old.

>Unfortunately, there is currently no cure for Parkinson's disease. There are, however, medications to suppress the symptoms of the disease. Occupational therapy also plays a large part in treating Parkinson's disease. Overall, this condition requires intense treatment and symptom management for a patient to live a full life.

## Major Early Symptom:

>Parkinson’s disease patients typically have a low-volume voice with a monotone (expressionless) quality. The speech pattern is often produced in short bursts with inappropriate silences between words and long pauses before initiating speech. The speech may also be slurred. A small percentage of patients (about 15 percent) may also have a tremulous voice.





## Project Objectives:


> The objective of this project is to build three different models, using KNeighborsClassifier, LGBMClassifier and XGBClassifier respectively, to detect the presence of the disease in individuals at an early stage using voice analysis. The performance of the three models will be evaluated using evaluation metrics, and the model with the best scores will be chosen.



# Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn import set_config
from sklearn.impute import SimpleImputer
set_config(display = "diagram")
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, get_scorer, accuracy_score, recall_score, precision_score

#**Source of Data**

> The source of data for this project can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/).                  




##**Loading DataSet**

In [None]:
df =pd.read_csv("/content/parkinsons.data")
df.head()

##**Data Information**

In [None]:
df.info()
#There are 195 rows and 24 columns.

##**Data Types**

In [None]:
df.dtypes

##**Data Statistics**

In [None]:
df.describe().round(4)

Some columns appear to contain outliers. This will be investigated further and the necessary action will be taken.

#**Data Cleaning Process**

##**Irrelevant Columns**

The names of the individuals whose voices were recorded for this data set are of no relevance to this project and hence will be dropped.

In [None]:
#Column "name" is not relevant to the prediction.
df = df.drop(columns = "name")
#Confirming changes
df.head(10)

##**Removal of Duplicates**

In [None]:
print(f'Duplicates: {df.duplicated().sum()}')

There are no duplicates in this data set.

##**Checking for null Values**

In [None]:
df.isnull().sum()
#There are no null values in this data set

There are no null values in this data set.

##**Checking of inconsistencies(Outliers) in data**

####**Outliers**

In [None]:
#Looping over all columns to count outliers using the 1.5 * IQR rule.

for column_name in df.columns:
   q1 = df[column_name].quantile(0.25) # 25th percentile
   q3 = df[column_name].quantile(0.75) # 75th percentile
   iqr = q3 - q1 # Interquartile range

   low_limit = q1 - (1.5 * iqr) # low limit
   high_limit = q3 + (1.5 * iqr) # high limit

   # Create outlier dataframes
   low_df = df[(df[column_name] < low_limit)]
   high_df = df[(df[column_name] > high_limit)]

   # Calculate the outlier counts and percentages
   low_outlier_count = low_df.shape[0]
   low_outlier_percentage = round(((low_outlier_count)/(df.shape[0])*100),1)
   high_outlier_count = high_df.shape[0]
   high_outlier_percentage = round(((high_outlier_count)/(df.shape[0])*100),1)

   print(f'\n{column_name}:\n  Low Outliers: {low_outlier_count} ({low_outlier_percentage}%),          High Outliers: {high_outlier_count} ({high_outlier_percentage}%)')


The outliers in each column make up less than 8% of that column's values, therefore all entries with outliers will be located and dropped.
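
The IQR rule used in the loop above can also be written as a small vectorised helper. The following is only a sketch: the function name `iqr_outlier_counts` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame) -> pd.DataFrame:
    """Count values outside the 1.5 * IQR fences for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on the columns.
    low = (numeric < (q1 - 1.5 * iqr)).sum()
    high = (numeric > (q3 + 1.5 * iqr)).sum()
    summary = pd.DataFrame({"low_outliers": low, "high_outliers": high})
    summary["percent"] = ((low + high) / len(numeric) * 100).round(1)
    return summary

# Toy data standing in for the voice features (values are made up).
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
summary = iqr_outlier_counts(toy)
print(summary)
```

Returning a data frame rather than printing inside a loop makes it easy to sort or filter the columns by their share of outliers.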

###**Exploring relations between all features and target feature, "status" using a barchart and a boxplot and removing outliers along the way**

In [None]:
def bar (u,v):
  plt.rcParams["figure.figsize"] = [28, 6]
  f, axes =plt.subplots(1,4)
  sns.set_theme(style="whitegrid")
  sns.barplot(ax = axes[0], x="status", y=u, data=df)
  sns.boxplot(ax = axes[1], x="status", y=u, data=df)
  sns.barplot(ax = axes[2], x="status", y=v, data=df)
  sns.boxplot(ax = axes[3], x="status", y=v, data=df)



In [None]:
def outliers (column_name, num):
  #Print the index labels of entries whose value exceeds the threshold num.
  print( df[df[column_name] > num].index)


In [None]:
bar("MDVP:Fo(Hz)", "MDVP:Fhi(Hz)")

Individuals with PD have a lower MDVP:Fo(Hz) and MDVP:Fhi(Hz) than those who don't have it. No outliers were found for MDVP:Fo(Hz), but MDVP:Fhi(Hz) had some outliers, as seen in the boxplot.

In [None]:
#Locating the index numbers of entries with outlier values.
outliers("MDVP:Fhi(Hz)", 400)

In [None]:
bar("MDVP:Flo(Hz)", "MDVP:Jitter(%)")

The MDVP:Flo(Hz) appears to be higher in those with PD and lower in those without it. The MDVP:Jitter(%) however is lower in those with PD and higher in those without it. It also contains some outliers. 

In [None]:
#Locating the index numbers of entries with outlier values.
outliers("MDVP:Jitter(%)", 0.013)

In [None]:
bar("MDVP:PPQ","Jitter:DDP")

Both MDVP:PPQ and Jitter:DDP have lower values in PD patients than in those without it, and they both contain outliers.

In [None]:
#Locating the index numbers of entries with outlier values.
outliers("MDVP:PPQ",0.007)
outliers("Jitter:DDP", 0.02)

In [None]:
bar("MDVP:Jitter(Abs)","MDVP:RAP")

The values in MDVP:Jitter(Abs) and MDVP:RAP are much lower in individuals with PD than those without it and they both contain outliers.

In [None]:
#Locating the index numbers of entries with outlier values.
outliers("MDVP:Jitter(Abs)",0.00010)
outliers("MDVP:RAP",0.007)

In [None]:
bar("MDVP:PPQ","Jitter:DDP")

The two features, MDVP:PPQ and Jitter:DDP, are both higher in individuals without PD and lower in those with it. There are outliers present in both features.

In [None]:
#Locating the index numbers of entries with outlier values.
outliers("MDVP:PPQ",0.007)
outliers("Jitter:DDP", 0.02)

In [None]:
bar("MDVP:Shimmer", "MDVP:Shimmer(dB)")

MDVP:Shimmer and MDVP:Shimmer(dB) are both higher in individuals with PD than those without it.

In [None]:
#Locating the index numbers of entries with outlier values.
outliers("MDVP:Shimmer",0.075)
outliers("MDVP:Shimmer(dB)", 0.7)

In [None]:
bar("Shimmer:APQ3", "Shimmer:APQ5")

Shimmer APQ3 and Shimmer APQ5 are both higher in individuals with PD than those without the disease.

In [None]:
bar("MDVP:APQ","Shimmer:DDA")

In [None]:
#Locating the index numbers of entries with outlier values.
outliers("MDVP:APQ",0.06)
outliers("Shimmer:DDA", 0.12)

The same holds for MDVP:APQ and Shimmer:DDA: they are higher in individuals with the disease and lower in those without it.

In [None]:
bar("NHR", "HNR")

Here the NHR is lower in PD negative individuals and higher in PD positive individuals, while the HNR is higher in those without the disease than in those with it.

In [None]:
#Locating the index numbers of entries with outlier values.
outliers("NHR",0.05)
outliers("HNR", 30)

In [None]:
bar("RPDE", "DFA")

RPDE is higher in PD positive individuals and lower in PD negative individuals. As for DFA, there is only a slight difference between those with and without the disease: it is high in both groups, although slightly higher in PD positive individuals.

In [None]:
bar("spread1","spread2")

Notice that the values in spread1 are higher in PD positive and lower in PD negative individuals. They both start from below zero (0). Spread2 can also be seen to be higher in PD positive and lower in PD negative individuals.

In [None]:
bar("D2", "PPE")

Both D2 and PPE are lower in PD negative and higher in PD positive.

In [None]:
#Locating the index numbers of entries with outlier values.
outliers("D2",3.4)
outliers("PPE",0.41)

#**Dropping extreme outliers**

In [None]:
#A list of the index number of all outliers.
df = df.drop(index = [2, 4, 5,17, 18,19,20, 31, 32, 33, 34, 35, 68, 73, 84, 87,88, 89, 90, 91, 97, 98, 99, 100, 101, 102, 115, 
 116, 117, 118, 120, 141, 146, 147, 148, 149, 150, 151, 152, 153, 154, 157, 186, 187, 189, 192, 193, 194])

**Confirming the drop of outliers using only the columns "NHR" and "HNR".**

In [None]:
bar("NHR", "HNR")

Dropping of extreme outliers was successful.
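
The index list passed to `df.drop` above was assembled by hand from the printed output of `outliers`. As a hedged sketch of how that step could be automated, the helper below collects the row labels that exceed any per-column cut-off and drops them in one call. The name `outlier_index`, the `limits` dictionary and the toy values are assumptions made for illustration only.

```python
import pandas as pd

def outlier_index(frame, thresholds):
    """Return the sorted union of row labels exceeding any per-column threshold."""
    labels = set()
    for column, limit in thresholds.items():
        labels.update(frame.index[frame[column] > limit])
    return sorted(labels)

# Toy frame with made-up values; the cut-offs mirror those used for NHR and HNR.
toy = pd.DataFrame({"NHR": [0.01, 0.09, 0.02, 0.03],
                    "HNR": [21.0, 19.0, 33.0, 25.0]})
limits = {"NHR": 0.05, "HNR": 30}

to_drop = outlier_index(toy, limits)
cleaned = toy.drop(index=to_drop)
print(to_drop, cleaned.shape)
```

Keeping the thresholds in one dictionary records which cut-off produced which dropped rows, something a hard-coded index list cannot show.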

##**Correlation between features**

In [None]:
plt.figure(figsize = (12, 10))

corr = df.corr()
sns.heatmap(corr, cmap='vlag');

plt.title('Correlation Heatmap', fontsize = 16, weight='bold')
plt.xticks(fontsize = 10, weight='bold', rotation=90)
plt.yticks(fontsize = 10, weight='bold');

plt.tight_layout()
plt.show()

There appears to be a strong positive correlation among 'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5', 'MDVP:APQ' and 'Shimmer:DDA'.
HNR also has a negative correlation with most of the features.
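
Reading strongly correlated groups off a heatmap is done by eye. A minimal sketch of listing them numerically is shown below; the helper name `strong_pairs`, the 0.9 threshold and the synthetic columns are assumptions for illustration and do not come from the data set.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "shimmer": base,
    "shimmer_db": base * 2 + rng.normal(scale=0.05, size=200),  # near-duplicate column
    "unrelated": rng.normal(size=200),
})
pairs = strong_pairs(toy)
print(pairs)
```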

##**An overview visualization of the relationship of feature "status" across all other features.**

In [None]:
#Dividing columns into groups for easy visualization, due to their varying value ranges.
df1 = df.drop(columns = ['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)',
       'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
       'MDVP:Shimmer', 'Shimmer:APQ3', 'Shimmer:APQ5',
       'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'HNR', 'D2'])

df2 = df.drop(columns =[ 'MDVP:Jitter(%)', 'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
       'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
       'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'RPDE', 'DFA',
       'spread1', 'spread2', 'PPE'])

df3 = df.drop(columns = ['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)',
       'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
        'MDVP:Shimmer(dB)', 'HNR', 'RPDE', 'DFA',
       'spread1', 'spread2', 'D2', 'PPE'])


df4 = df.drop(columns = ['MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)',  'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
       'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'HNR', 'RPDE', 'DFA',  'spread1', 'spread2', 'D2', 'PPE'])

        

In [None]:
grouped1 = df1.groupby("status").mean()

In [None]:
grouped2 = df2.groupby("status").mean()

In [None]:
grouped3 =df3.groupby("status").mean()

In [None]:
grouped4 =df4.groupby("status").mean()

###**A barchart to display the relations of all features with the column status.**

In [None]:
def rel (u,v,w,x):
  #Plot the per-class feature means of each grouped frame as paired bars.
  sns.set_theme(style="whitegrid")

  #Keeping the original plotting order: x, v, w, then u.
  for grouped in (x, v, w, u):
    xticks1 = np.arange(len(grouped.columns))
    xticks2 = xticks1 + .5
    plt.figure(figsize = (6, 3))
    plt.bar(xticks1, grouped.loc[0], color ="blue", label ="0", width = 0.25)
    plt.bar(xticks2, grouped.loc[1], color ="orange", label ="1", width = 0.25)
    plt.xticks(xticks1, labels = grouped.columns)
    plt.tick_params(labelrotation = 45)
    plt.axhline(0, color ="black")
    plt.legend()
    plt.show()


In [None]:
rel(grouped1, grouped2, grouped3, grouped4)

###**Checking for imbalance in target vector**

In [None]:
df["status"].unique()

In [None]:
print(f'Status(0)_counts:{df[df["status"]==0].shape[0]}')
print(f'Status(1)_counts:{df[df["status"]==1].shape[0]}')


We can see here that the target vector is not balanced. This imbalance will be handled during the preprocessing stage.
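
As a sketch of how the imbalance can be quantified, and of one way to keep the class ratio the same in both halves of a later split, the example below uses `value_counts(normalize=True)` together with the `stratify` argument of `train_test_split`. The toy target and its 3:1 ratio are made-up values, not the real class counts, and the split in this notebook does not use `stratify`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with a 3:1 imbalance (made-up values, not the real counts).
y_toy = pd.Series([1] * 75 + [0] * 25, name="status")
X_toy = pd.DataFrame({"feature": np.arange(100)})

# Share of each class in the full target.
print(y_toy.value_counts(normalize=True))

# stratify=y_toy asks the split to preserve the class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)
train_share = y_tr.mean()  # fraction of class 1 in the train split
test_share = y_te.mean()   # fraction of class 1 in the test split
print(round(train_share, 2), round(test_share, 2))
```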

#**Visualization of Data Clusters through dimensionality reduction (Principal Component Analysis, PCA)**

This data set has many dimensions (one per feature), which makes it impossible to visualize directly. Principal component analysis (PCA) will therefore be used to reduce the data set to two dimensions, the first two principal components, so that it can be plotted.
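
A two-component projection only shows part of the data, so it can help to check how much variance those two components retain. The sketch below does this on synthetic data; the arrays `latent`, `mixing` and `features` are assumptions made for illustration and stand in for the scaled voice features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(150, 2))
# Six observed features built from two latent signals plus a little noise.
mixing = rng.normal(size=(2, 6))
features = latent @ mixing + rng.normal(scale=0.1, size=(150, 6))

# PCA is sensitive to scale, so the features are standardised first.
scaled = StandardScaler().fit_transform(features)
pca_toy = PCA(n_components=2).fit(scaled)
projected = pca_toy.transform(scaled)

# Fraction of the total variance kept by the two components.
explained = pca_toy.explained_variance_ratio_.sum()
print(projected.shape, round(float(explained), 3))
```

If the retained fraction is low, groups that look mixed in the two-dimensional plot may still be separable in the full feature space.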

**Assigning Target Vector and Independent Variable**



In [None]:
X = df.drop(columns ="status")
y = df["status"]

le =LabelEncoder()
z= le.fit_transform(df["status"])


**Scaling Data in preparation for pca**

In [None]:
scaler =StandardScaler()
scaled_df = scaler.fit_transform(X)

**Instantiating pca**

In [None]:
pca = PCA(n_components = 2)
pca_df =pca.fit_transform(scaled_df)


**Clusters representing Individuals with and without Parkinson's disease.**

In [None]:
plt.figure(figsize = (8, 4))
plt.scatter(pca_df[:,0], pca_df[:,1], c = z)

#Using the current axes; plt.axes() would add a new, empty axes on top of the scatter plot.
ax =plt.gca()
ax.set_facecolor ("grey")
plt.title('Visualization of all of our data using the first two Principal Components')
plt.xlabel('PC1')
plt.ylabel('PC2');

#**Preprocessing of Data**

Now let's begin processing our data for modelling.

In [None]:
#Train, test, splitting data.
X_train, X_test, y_train, y_test= train_test_split(X,y, random_state = 42)

###**Instantiating Column Selector**

In [None]:
#There are only numeric columns in this data frame
num_cols = make_column_selector(dtype_include="number")

###**Instantiating Standard scaler**

In [None]:
scaler =StandardScaler()

In [None]:
num_tuple =(scaler, num_cols)

###**Instantiating Column transformer**

In [None]:
column_transformer = make_column_transformer(num_tuple)

#**KNeighbors Classifier Model**

In [None]:
#Instantiating model
knn =KNeighborsClassifier()

**Handling the imbalanced target vector (y) using SMOTE**

In [None]:
#Instantiating smote
smote = SMOTE(random_state = 42)
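
To give an intuition for what SMOTE does, the sketch below generates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. This is a simplified illustration of the idea only, not the `imblearn` implementation; the function name `interpolate_minority` and the toy points are assumptions.

```python
import numpy as np

def interpolate_minority(X_min, n_new, k=3, seed=42):
    """Create n_new synthetic rows between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))                  # pick a minority sample
        dist = np.linalg.norm(X_min - X_min[j], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]        # skip the sample itself
        nb = X_min[rng.choice(neighbours)]
        gap = rng.random()                            # position along the segment
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic

# Four made-up minority points at the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = interpolate_minority(X_min, n_new=5)
print(new_rows.shape)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region already occupied by the minority class.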

In [None]:
#Note: this PCA instance is not added to any of the pipelines below.
pca = PCA(n_components= .95)

In [None]:
KS_pipe = Pipeline([("smote", smote), ("knn", knn)])

In [None]:
knn_pipe = make_pipeline(column_transformer, KS_pipe)
knn_pipe.fit(X_train, y_train)

**Tuning for the best parameters**

In [None]:
knn_pipe.get_params()

In [None]:
params = {'pipeline__knn__n_neighbors': [5,7,9,11],
          'pipeline__knn__weights': ["uniform", "distance"],
          'pipeline__knn__p':[2,3,4]}

##**Instantiating the GridSearch with KNN pipeline and parameters**



In [None]:
knn_gs =GridSearchCV(knn_pipe, params)
knn_gs.fit(X_train, y_train)

In [None]:
knn_gs.best_params_
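
The grid search above uses the default scoring, which for a classifier is accuracy. As a hedged sketch of one alternative, the example below tunes a KNN pipeline for recall with stratified folds. It runs on synthetic data from `make_classification` and leaves out the SMOTE step for brevity, so its numbers say nothing about the Parkinson's data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the voice features.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, weights=[0.25, 0.75], random_state=0
)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = {
    "kneighborsclassifier__n_neighbors": [5, 7, 9, 11],
    "kneighborsclassifier__weights": ["uniform", "distance"],
}

# Stratified folds keep the class ratio similar in every validation fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, grid, scoring="recall", cv=cv)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```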

In [None]:
best_knn_pipe = knn_gs.best_estimator_

#**Predictions**

In [None]:
train_knn_preds = best_knn_pipe.predict(X_train)
test_knn_preds =best_knn_pipe.predict(X_test)

In [None]:
training_set =pd.DataFrame(train_knn_preds)
training_set = training_set.rename(columns ={0 :"Training set predictions"})
training_set.head(10)

In [None]:
testing_set =pd.DataFrame(test_knn_preds)
testing_set = testing_set.rename(columns ={0 :"Testing set predictions"})
testing_set.head(10)

###**Classification Report on Tuned KNeighbors Classifier**

In [None]:
train_report =classification_report(y_train, train_knn_preds)
test_report = classification_report(y_test, test_knn_preds)

In [None]:
print(f'Evaluation metrics on Knn model with tuned parameters\n\n Train set:\n{train_report}')
print("______________________________________\n")
print(f'Evaluation metrics on Knn model with tuned parameters\n\n Test set:\n{test_report}')

###**Confusion Matrix on KNeighbors Classifier**

**Train**

In [None]:
ConfusionMatrixDisplay.from_estimator(best_knn_pipe, X_train, y_train, cmap ="Reds", normalize ="true");

The KNN model did extremely well in predicting the true positives and the true negatives on the train set. It predicted all entries correctly.

**Test**

In [None]:
ConfusionMatrixDisplay.from_estimator(best_knn_pipe, X_test, y_test, cmap="Reds", normalize = "true");

When it comes to predicting disease in people, false positives and false negatives carry different costs, so both error types should be examined rather than accuracy alone. The KNN model here produced 7 false positives, but performed well on the false negatives, true positives and true negatives.
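
The confusion matrices in this notebook are drawn with `normalize="true"`, which displays row proportions rather than counts. A minimal sketch of recovering the raw counts of each error type, together with recall and precision, is shown below. The label lists are made-up toy values, not the model's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 1 = PD positive, 0 = PD negative (made-up values).
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
print(tn, fp, fn, tp, recall, precision)
```

Recall falls with every false negative (a missed patient), while precision falls with every false positive (a healthy individual flagged).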

#**Instantiating the LightGBM Classifier**

In [None]:
lgbm =LGBMClassifier()
smote = SMOTE(random_state = 42)


In [None]:
LS_pipe = Pipeline([("smote", smote),("lgbm", lgbm)])

In [None]:
lgbm_pipe = make_pipeline(column_transformer, LS_pipe)
lgbm_pipe.fit(X_train, y_train)

In [None]:
lgbm_train_preds = lgbm_pipe.predict(X_train)
lgbm_test_preds = lgbm_pipe.predict(X_test)

In [None]:
training_set =pd.DataFrame(lgbm_train_preds)
training_set = training_set.rename(columns ={0 :"LGBM: Training set predictions"})
training_set.head(10)

In [None]:
testing_set =pd.DataFrame(lgbm_test_preds)
testing_set = testing_set.rename(columns ={0 :"LGBM: Testing set predictions"})
testing_set.head(10)

###**Model Evaluation**

In [None]:
lgbm_train_report =classification_report(y_train, lgbm_train_preds)
lgbm_test_report = classification_report(y_test, lgbm_test_preds)

In [None]:
print(f'Evaluation metrics on LGBM model\n\n Train set:\n{lgbm_train_report}')
print("______________________________________\n")
print(f'Evaluation metrics on LGBM model\n\n Test set:\n{lgbm_test_report}')

In [None]:
ConfusionMatrixDisplay.from_estimator(lgbm_pipe, X_train, y_train, cmap ="Greens" ,normalize = "true");

The LGBM model predicted accurately on the train set.

In [None]:
ConfusionMatrixDisplay.from_estimator(lgbm_pipe, X_test, y_test, cmap="Greens", normalize = "true");

The model also predicted accurately on the test set.

#**XGBoost Classifier (XGBC)**

###**Instantiating Model**

In [None]:
xgb = XGBClassifier()
smote = SMOTE(random_state = 42)
#xgb.fit(X_train, y_train)

In [None]:
XS_pipe = Pipeline([("smote", smote), ("xgb", xgb)])

In [None]:
xgb_pipe= make_pipeline(column_transformer, XS_pipe)

In [None]:
xgb_pipe.fit(X_train, y_train)

In [None]:
xgb_train_preds = xgb_pipe.predict(X_train)
xgb_test_preds = xgb_pipe.predict(X_test)

In [None]:
training_set =pd.DataFrame(xgb_train_preds)
training_set = training_set.rename(columns ={0 :"XGB: Training set predictions"})
training_set.head(10)

In [None]:
testing_set =pd.DataFrame(xgb_test_preds)
testing_set = testing_set.rename(columns ={0 :"XGB: Testing set predictions"})
testing_set.head(10)

###**Model Evaluation**

In [None]:
xgb_train_report =classification_report(y_train, xgb_train_preds)
xgb_test_report = classification_report(y_test, xgb_test_preds)

In [None]:
print(f'Evaluation metrics on Gradient Boosting Classifier\n\n Train set:\n{xgb_train_report}')
print("______________________________________\n")
print(f'Evaluation metrics on Gradient Boosting Classifier\n\n Test set:\n{xgb_test_report}')

#**Confusion Matrix**

In [None]:
ConfusionMatrixDisplay.from_estimator(xgb_pipe, X_train, y_train, cmap ="Blues",  normalize = "true");

The XGB model also did well on the train set.

In [None]:
ConfusionMatrixDisplay.from_estimator(xgb_pipe, X_test, y_test, cmap ="Blues", normalize ="true");

The model predicted 2 false negatives and 1 false positive.

In [None]:
def models(model, X_train, X_test, y_train, y_test, scoring = "accuracy", model_name = "Classifier"):

  scoring_func = get_scorer(scoring)
  train_score = scoring_func(model, X_train, y_train)
  test_score = scoring_func(model, X_test, y_test)

  delta_score  = train_score -test_score
  

  score_dict ={f'{scoring}:train': train_score,
               f'{scoring}:test': test_score,
               f'{scoring}:difference' : delta_score}

  score_frame = pd.DataFrame(score_dict, index=[model_name])

  return(score_frame)

In [None]:
#DataFrame.append was removed in pandas 2.0, so the reports are combined with pd.concat.
models_report = pd.concat([
    models(lgbm_pipe, X_train, X_test, y_train, y_test, model_name ="LGBM Classifier"),
    models(best_knn_pipe, X_train, X_test, y_train, y_test, model_name ="KNN Classifier"),
    models(xgb_pipe, X_train, X_test, y_train, y_test, model_name ="XGBoost Classifier"),
])

models_report

#**Conclusion**


**All three models performed well on the train set, but the KNN classifier had the highest accuracy score on the test set, about 97%, which is just 2.7 percentage points below its 100% score on the train set. Therefore the most suitable model among the LGBM Classifier, KNN Classifier and XGBoost Classifier for the prediction of Parkinson's disease in individuals is the KNN Classifier.**

