# Parkinson's Disease - ML Detection
This project develops a machine learning model that predicts whether a person has Parkinson's disease from biomedical voice measurements, using a dataset of features extracted from voice recordings.

## Project Overview
Parkinson's disease is a progressive neurological disorder that affects movement and can lead to significant disability. Early and accurate diagnosis is crucial for effective management and treatment. This project leverages data-driven approaches to aid in the early detection of the disease.

**Goals:**
- **Develop a Predictive Model:** Create a robust machine learning model capable of accurately predicting Parkinson's disease from voice measurements.
- **Data Exploration and Preprocessing:** Understand the dataset's structure, clean the data, and preprocess it to ensure it is suitable for machine learning algorithms.
- **Model Training and Evaluation:** Train a Support Vector Machine (SVM) classifier and evaluate its performance to ensure it generalizes well to new data.
- **Make Predictions on New Data:** Use the trained model to predict the presence of Parkinson's disease in new, unseen data points.
- **Raise Awareness:** Demonstrate the potential of machine learning in healthcare applications, specifically for the early detection of neurological disorders.

## Dataset
This [dataset](https://www.kaggle.com/datasets/vikasukani/parkinsons-disease-data-set) is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column which is set to 0 for healthy and 1 for PD.

Matrix column entries (attributes):
- **name** - ASCII subject name and recording number
- **MDVP:Fo(Hz)** - Average vocal fundamental frequency
- **MDVP:Fhi(Hz)** - Maximum vocal fundamental frequency
- **MDVP:Flo(Hz)** - Minimum vocal fundamental frequency
- **MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP** - Several measures of variation in fundamental frequency
- **MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA** - Several measures of variation in amplitude
- **NHR, HNR** - Two measures of the ratio of noise to tonal components in the voice
- **status** - Health status of the subject: 1 for Parkinson's, 0 for healthy
- **RPDE, D2** - Two nonlinear dynamical complexity measures
- **DFA** - Signal fractal scaling exponent
- **spread1, spread2, PPE** - Three nonlinear measures of fundamental frequency variation
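Because the **name** column combines a subject name with a recording number, several rows belong to the same person. The sketch below shows one way to recover a per-subject identifier. It assumes every name follows the `phon_R01_S01_1` pattern, with the recording number as the last underscore-separated token; `subject_id` is a hypothetical helper, and the third example name is made up for illustration.

```python
def subject_id(recording_name: str) -> str:
    """Drop the trailing recording number from a name like 'phon_R01_S01_1'."""
    # rpartition splits on the LAST underscore: (subject, "_", recording number)
    subject, _sep, _recording = recording_name.rpartition("_")
    return subject


# Illustrative names only; the third one is an assumption about the naming scheme.
names = ["phon_R01_S01_1", "phon_R01_S01_2", "phon_R01_S02_1"]
subjects = sorted({subject_id(name) for name in names})
print(subjects)
```

Identifiers like these could be passed as the `groups` argument of scikit-learn's group-aware splitters (for example `GroupShuffleSplit`) so that recordings from one person do not land in both the training and the test set.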

## Machine Learning Prediction

**Import dependencies**

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.metrics import accuracy_score

**Data collection and analysis**

In [2]:
df = pd.read_csv('/kaggle/input/parkinsonsdataset/parkinsons.csv')

df.head()

| | name | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | ... | Shimmer:DDA | NHR | HNR | status | RPDE | DFA | spread1 | spread2 | D2 | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | phon_R01_S01_1 | 119.992 | 157.302 | 74.997 | 0.00784 | 7e-05 | 0.0037 | 0.00554 | 0.01109 | 0.04374 | ... | 0.06545 | 0.02211 | 21.033 | 1 | 0.414783 | 0.815285 | -4.813031 | 0.266482 | 2.301442 | 0.284654 |
| 1 | phon_R01_S01_2 | 122.4 | 148.65 | 113.819 | 0.00968 | 8e-05 | 0.00465 | 0.00696 | 0.01394 | 0.06134 | ... | 0.09403 | 0.01929 | 19.085 | 1 | 0.458359 | 0.819521 | -4.075192 | 0.33559 | 2.486855 | 0.368674 |
| 2 | phon_R01_S01_3 | 116.682 | 131.111 | 111.555 | 0.0105 | 9e-05 | 0.00544 | 0.00781 | 0.01633 | 0.05233 | ... | 0.0827 | 0.01309 | 20.651 | 1 | 0.429895 | 0.825288 | -4.443179 | 0.311173 | 2.342259 | 0.332634 |
| 3 | phon_R01_S01_4 | 116.676 | 137.871 | 111.366 | 0.00997 | 9e-05 | 0.00502 | 0.00698 | 0.01505 | 0.05492 | ... | 0.08771 | 0.01353 | 20.644 | 1 | 0.434969 | 0.819235 | -4.117501 | 0.334147 | 2.405554 | 0.368975 |
| 4 | phon_R01_S01_5 | 116.014 | 141.781 | 110.655 | 0.01284 | 0.00011 | 0.00655 | 0.00908 | 0.01966 | 0.06425 | ... | 0.1047 | 0.01767 | 19.649 | 1 | 0.417356 | 0.823484 | -3.747787 | 0.234513 | 2.33218 | 0.410335 |


In [3]:
df.shape

(195, 24)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 1

In [5]:
df.columns

Index(['name', 'MDVP:Fo(Hz)', 'MDVP:Fhi(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Jitter(%)',
       'MDVP:Jitter(Abs)', 'MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP',
       'MDVP:Shimmer', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5',
       'MDVP:APQ', 'Shimmer:DDA', 'NHR', 'HNR', 'status', 'RPDE', 'DFA',
       'spread1', 'spread2', 'D2', 'PPE'],
      dtype='object')

In [6]:
df.describe()

| | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | MDVP:Shimmer(dB) | ... | Shimmer:DDA | NHR | HNR | status | RPDE | DFA | spread1 | spread2 | D2 | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | ... | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 | 195.0 |
| mean | 154.228641 | 197.104918 | 116.324631 | 0.00622 | 4.4e-05 | 0.003306 | 0.003446 | 0.00992 | 0.029709 | 0.282251 | ... | 0.046993 | 0.024847 | 21.885974 | 0.753846 | 0.498536 | 0.718099 | -5.684397 | 0.22651 | 2.381826 | 0.206552 |
| std | 41.390065 | 91.491548 | 43.521413 | 0.004848 | 3.5e-05 | 0.002968 | 0.002759 | 0.008903 | 0.018857 | 0.194877 | ... | 0.030459 | 0.040418 | 4.425764 | 0.431878 | 0.103942 | 0.055336 | 1.090208 | 0.083406 | 0.382799 | 0.090119 |
| min | 88.333 | 102.145 | 65.476 | 0.00168 | 7e-06 | 0.00068 | 0.00092 | 0.00204 | 0.00954 | 0.085 | ... | 0.01364 | 0.00065 | 8.441 | 0.0 | 0.25657 | 0.574282 | -7.964984 | 0.006274 | 1.423287 | 0.044539 |
| 25% | 117.572 | 134.8625 | 84.291 | 0.00346 | 2e-05 | 0.00166 | 0.00186 | 0.004985 | 0.016505 | 0.1485 | ... | 0.024735 | 0.005925 | 19.198 | 1.0 | 0.421306 | 0.674758 | -6.450096 | 0.174351 | 2.099125 | 0.137451 |
| 50% | 148.79 | 175.829 | 104.315 | 0.00494 | 3e-05 | 0.0025 | 0.00269 | 0.00749 | 0.02297 | 0.221 | ... | 0.03836 | 0.01166 | 22.085 | 1.0 | 0.495954 | 0.722254 | -5.720868 | 0.218885 | 2.361532 | 0.194052 |
| 75% | 182.769 | 224.2055 | 140.0185 | 0.007365 | 6e-05 | 0.003835 | 0.003955 | 0.011505 | 0.037885 | 0.35 | ... | 0.060795 | 0.02564 | 25.0755 | 1.0 | 0.587562 | 0.761881 | -5.046192 | 0.279234 | 2.636456 | 0.25298 |
| max | 260.105 | 592.03 | 239.17 | 0.03316 | 0.00026 | 0.02144 | 0.01958 | 0.06433 | 0.11908 | 1.302 | ... | 0.16942 | 0.31482 | 33.047 | 1.0 | 0.685151 | 0.825288 | -2.434031 | 0.450493 | 3.671155 | 0.527367 |


In [7]:
df.isnull().sum()

name                0
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
status              0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
dtype: int64

In [8]:
# Distribution of the target variable
#   0 = Parkinson's negative
#   1 = Parkinson's positive

df['status'].value_counts()

status
1    147
0     48
Name: count, dtype: int64
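The class counts above are imbalanced, which affects how any later accuracy score should be read. As a small worked example using only those counts, the snippet below computes the accuracy of a trivial classifier that always predicts the majority class; `counts` simply restates the `value_counts()` output.

```python
# Class counts copied from the value_counts() output (1 = Parkinson's, 0 = healthy).
counts = {1: 147, 0: 48}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = counts[majority_class] / total
print(majority_class, round(baseline_accuracy, 4))  # 1 0.7538
```

Any trained model should be judged against this baseline of roughly 75%, not against 50%.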

In [9]:
df.drop('name', axis='columns', inplace=True)

# Group the data based on the 'status' column and calculate the mean
df.groupby('status').mean()

| status | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | MDVP:Shimmer(dB) | ... | MDVP:APQ | Shimmer:DDA | NHR | HNR | RPDE | DFA | spread1 | spread2 | D2 | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 181.937771 | 223.63675 | 145.207292 | 0.003866 | 2.3e-05 | 0.001925 | 0.002056 | 0.005776 | 0.017615 | 0.162958 | ... | 0.013305 | 0.028511 | 0.011483 | 24.67875 | 0.442552 | 0.695716 | -6.759264 | 0.160292 | 2.154491 | 0.123017 |
| 1 | 145.180762 | 188.441463 | 106.893558 | 0.006989 | 5.1e-05 | 0.003757 | 0.0039 | 0.011273 | 0.033658 | 0.321204 | ... | 0.0276 | 0.053027 | 0.029211 | 20.974048 | 0.516816 | 0.725408 | -5.33342 | 0.248133 | 2.456058 | 0.233828 |


In [10]:
# Rename columns
df = df.rename(columns={'MDVP:Fo(Hz)':'Fo', 'MDVP:Fhi(Hz)':'Fhi', 'MDVP:Flo(Hz)':'Flo', 'MDVP:Jitter(%)':'Jitter_percentage', 'MDVP:Jitter(Abs)':'Jitter_Abs', 'MDVP:RAP':'RAP', 'MDVP:PPQ':'PPQ', 'Jitter:DDP':'DDP', 'MDVP:Shimmer':'Shimmer', 'MDVP:Shimmer(dB)':'Shimmer_dB', 'Shimmer:APQ3':'APQ3', 'Shimmer:APQ5':'APQ5','MDVP:APQ':'APQ', 'Shimmer:DDA':'DDA'})

**Separating features and target**

In [11]:
X = df.drop(columns=['status'])
Y = df['status']

In [12]:
print(X)

          Fo      Fhi      Flo  Jitter_percentage  Jitter_Abs      RAP  \
0    119.992  157.302   74.997            0.00784     0.00007  0.00370   
1    122.400  148.650  113.819            0.00968     0.00008  0.00465   
2    116.682  131.111  111.555            0.01050     0.00009  0.00544   
3    116.676  137.871  111.366            0.00997     0.00009  0.00502   
4    116.014  141.781  110.655            0.01284     0.00011  0.00655   
..       ...      ...      ...                ...         ...      ...   
190  174.188  230.978   94.261            0.00459     0.00003  0.00263   
191  209.516  253.017   89.488            0.00564     0.00003  0.00331   
192  174.688  240.005   74.287            0.01360     0.00008  0.00624   
193  198.764  396.961   74.904            0.00740     0.00004  0.00370   
194  214.289  260.277   77.973            0.00567     0.00003  0.00295   

         PPQ      DDP  Shimmer  Shimmer_dB  ...      APQ      DDA      NHR  \
0    0.00554  0.01109  0.04374   

In [13]:
print(Y)

0      1
1      1
2      1
3      1
4      1
      ..
190    0
191    0
192    0
193    0
194    0
Name: status, Length: 195, dtype: int64


In [14]:
# List the feature names in order; they are needed later when building the web app interface
for column in X.columns:
    print(column)

Fo
Fhi
Flo
Jitter_percentage
Jitter_Abs
RAP
PPQ
DDP
Shimmer
Shimmer_dB
APQ3
APQ5
APQ
DDA
NHR
HNR
RPDE
DFA
spread1
spread2
D2
PPE


**Splitting data into Train and Test data**

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 2)

In [16]:
print(X.shape, X_train.shape, X_test.shape)

(195, 22) (156, 22) (39, 22)
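The split above is purely random, so the class ratio in the 39-row test set can drift from the overall ratio. One common alternative is a stratified split. The sketch below uses synthetic labels with the same class counts as the dataset (147 and 48) and a placeholder feature column, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: same class counts as the dataset, one placeholder feature.
y = np.array([1] * 147 + [0] * 48)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y asks train_test_split to preserve the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=2, stratify=y
)

print(len(y_tr), len(y_te))
print(round(y.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With stratification the positive-class fraction in both parts stays close to the overall fraction, which makes train and test scores easier to compare.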


**Data Standardization**

In [17]:
scaler = StandardScaler()

In [18]:
scaler.fit(X_train)

In [19]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
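The scaler above is fitted on the training data only and then applied to both splits, which keeps test-set statistics out of training. The same pattern can be expressed with a scikit-learn `Pipeline`. This is a minimal sketch on synthetic data (three made-up features on different scales), not the notebook's actual workflow.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic placeholder data: 40 samples, 3 features on very different scales.
X_demo = rng.normal(loc=[150.0, 0.005, 20.0], scale=[40.0, 0.004, 4.0], size=(40, 3))
y_demo = (X_demo[:, 0] < 150.0).astype(int)

# fit() learns the scaling statistics from the data it receives, then trains the SVM;
# predict() reuses the stored statistics on new rows.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))
pipeline.fit(X_demo, y_demo)

# After fitting, the scaled training data has zero mean and unit variance per feature.
scaled = pipeline.named_steps["standardscaler"].transform(X_demo)
print(np.allclose(scaled.mean(axis=0), 0.0), np.allclose(scaled.std(axis=0), 1.0))
```

Bundling the two steps means the scaler cannot accidentally be fitted on test data or forgotten at prediction time.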

**Model training: Support Vector Machine**

In [20]:
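# Note: this rebinds the name `svm` from the sklearn module to the classifier instance,
# so re-running this cell without re-running the import cell raises an AttributeError.
svm = svm.SVC(kernel='linear')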

In [21]:
# train the svm with training data
svm.fit(X_train, y_train)

**Model evaluation: Accuracy Score**

In [22]:
# Accuracy score on the training data (X_train_pred holds the predicted labels)
X_train_pred = svm.predict(X_train)

train_data_accuracy = accuracy_score(y_train, X_train_pred)

In [23]:
print('Train data accuracy score: ', train_data_accuracy)

Train data accuracy score:  0.8846153846153846


In [24]:
# Accuracy score on the test data (X_test_pred holds the predicted labels)
X_test_pred = svm.predict(X_test)

test_data_accuracy = accuracy_score(y_test, X_test_pred)

In [25]:
print('Test data accuracy: ', test_data_accuracy)

Test data accuracy:  0.8717948717948718
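Accuracy alone can hide how a model treats each class when the classes are imbalanced. The sketch below shows how a confusion matrix, per-class recall, and balanced accuracy could be computed with scikit-learn. The labels are made up for illustration and are not the notebook's predictions.

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, recall_score

# Made-up labels for illustration only (1 = Parkinson's, 0 = healthy).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0]

# With labels=[0, 1], ravel() yields (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # tp / (tp + fn)
specificity = recall_score(y_true, y_pred, pos_label=0)  # tn / (tn + fp)
balanced = balanced_accuracy_score(y_true, y_pred)       # mean of the two recalls

print(tn, fp, fn, tp)
print(round(sensitivity, 3), round(specificity, 3), round(balanced, 3))
```

In this toy case the plain accuracy is 0.7 while specificity is only 0.5, the kind of gap a single accuracy figure does not reveal.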


**Building a predictive system**

In [26]:
input_data = (120.552,131.162,113.787,0.00968,0.00008,0.00463,0.0075,0.01388,0.04701,0.456,0.02328,0.03526,0.03243,0.06985,0.01222,21.378,0.415564,0.825069,-4.242867,0.299111,2.18756,0.357775)

# Change input_data to a numpy array
input_data_as_np_array = np.asarray(input_data)

# reshape the np array
input_data_reshaped = input_data_as_np_array.reshape(1, -1)

# standardize the data
standard_data = scaler.transform(input_data_reshaped)

# make prediction
prediction = svm.predict(standard_data)
print(prediction)

if (prediction[0] == 0):
    print('This person does not have Parkinsons Disease')
else:
    print('This person has Parkinsons Disease')

[1]
This person has Parkinsons Disease
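The prediction steps above (convert to an array, reshape to one row, scale, predict) could be wrapped in a small function that also checks the number of features. This is a sketch only: `predict_status` is a hypothetical helper, and the model and scaler here are trained on a tiny synthetic two-feature dataset, not the 22 voice features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def predict_status(model, scaler, values):
    """Return the predicted status (0 or 1) for one recording's feature values."""
    row = np.asarray(values, dtype=float).reshape(1, -1)
    expected = scaler.n_features_in_  # number of features the scaler was fitted on
    if row.shape[1] != expected:
        raise ValueError(f"expected {expected} features, got {row.shape[1]}")
    return int(model.predict(scaler.transform(row))[0])


# Tiny synthetic, linearly separable example with two features.
X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]])
y_demo = np.array([0, 0, 1, 1])
demo_scaler = StandardScaler().fit(X_demo)
demo_model = SVC(kernel="linear").fit(demo_scaler.transform(X_demo), y_demo)

print(predict_status(demo_model, demo_scaler, (9.5, 9.5)))
```

The length check turns a silently misaligned input, such as a missing feature, into an explicit error before the model is called.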




**Saving the trained model**

In [27]:
import pickle

In [28]:
# Save the model to a file
filename = 'parkinsons_disease_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(svm, f)

In [29]:
# Save the scaler to a file
scaler_filename = 'parkinsons_disease_scaler.sav'
with open(scaler_filename, 'wb') as f:
    pickle.dump(scaler, f)

In [30]:
# Loading the saved model

with open('parkinsons_disease_model.sav', 'rb') as f:
    loaded_model = pickle.load(f)

In [31]:
# Loading the saved scaler
with open('parkinsons_disease_scaler.sav', 'rb') as f:
    loaded_scaler = pickle.load(f)

In [32]:
input_data = (120.552,131.162,113.787,0.00968,0.00008,0.00463,0.0075,0.01388,0.04701,0.456,0.02328,0.03526,0.03243,0.06985,0.01222,21.378,0.415564,0.825069,-4.242867,0.299111,2.18756,0.357775)

# Change input data to np array
input_data_as_np_array = np.asarray(input_data)

# Reshape the array as we are predicting for one instance
input_data_reshaped = input_data_as_np_array.reshape(1,-1)

# Standardize the input data using the loaded scaler
input_data_standardized = loaded_scaler.transform(input_data_reshaped)

prediction = loaded_model.predict(input_data_standardized)
print(prediction)

if (prediction[0] == 0):
    print('This person does not have Parkinsons Disease')
else:
    print('This person has Parkinsons Disease')

[1]
This person has Parkinsons Disease
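The model and the scaler above are saved to two separate files, which must always be kept and loaded together. One possible variation is to store both objects in a single file. The sketch below does this with a plain dictionary and checks the round trip. The file name `demo_bundle.pkl` and the synthetic two-feature data are assumptions for illustration only.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Tiny synthetic dataset standing in for the real features.
X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]])
y_demo = np.array([0, 0, 1, 1])
demo_scaler = StandardScaler().fit(X_demo)
demo_model = SVC(kernel="linear").fit(demo_scaler.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "demo_bundle.pkl"  # hypothetical file name
    # Save model and scaler together so they cannot get out of sync.
    with open(path, "wb") as f:
        pickle.dump({"model": demo_model, "scaler": demo_scaler}, f)
    with open(path, "rb") as f:
        bundle = pickle.load(f)

# The restored objects should reproduce the original predictions.
original = demo_model.predict(demo_scaler.transform(X_demo))
restored = bundle["model"].predict(bundle["scaler"].transform(X_demo))
print(np.array_equal(original, restored))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.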


