Parkinson's disease is a progressive nervous system disorder that affects movement, leading to shaking, stiffness, and difficulty with walking, balance, and coordination. It is caused by the degeneration of neurons in the brain, which reduces the amount of a chemical called dopamine. As a result, several processes of the body are affected and people tend to develop movement disorders.

Workflow ==> Parkinson's data -> Data Pre-processing -> Train-Test Split -> Classifier (KNN, gradient boosting, random forest)

For any new data: New Data -> Trained classifier -> Prediction (Parkinson's or healthy)

Importing the Dependencies

In [64]:
import numpy as np # numerical arrays used throughout the analysis
import pandas as pd # DataFrames, i.e. structured tables
from sklearn.model_selection import train_test_split # splits the data into training and test sets
from sklearn.preprocessing import StandardScaler # standardizes each feature to zero mean and unit variance
from sklearn.neighbors import KNeighborsClassifier # k-nearest-neighbours (KNN) classifier
from sklearn.metrics import accuracy_score # fraction of correct predictions, tells how good our model is
from sklearn.tree import DecisionTreeRegressor # base learner for the hand-written boosting model below
#from xgboost import XGBClassifier # the XGBoost library model (not used in this notebook)
from sklearn.ensemble import RandomForestClassifier # random forest classifier

Data Collection and Analysis (The data comes from voice recordings. Each person's speech is recorded and summarized by vocal measurements such as the average and maximum vocal frequency. The status column tells whether the person has Parkinson's: 1 if the person has the disease, 0 otherwise.)

In [65]:
# loading the data from csv file to a Pandas DataFrame
parkinsons_dise = pd.read_csv(r"C:\Users\DELL\Downloads\parkinsons.csv")

In [66]:
# number of rows and columns in the dataframe
parkinsons_dise.shape
# 195 rows and 24 columns

(195, 24)

In [67]:
# getting more information about the dataset
parkinsons_dise.info()
# the Non-Null Count column gives the number of non-null values per column; 195 everywhere means no value is missing

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 1

In [68]:
# checking for missing values in each column
parkinsons_dise.isnull().sum()
# if any values were missing, we could fill them using statistical measures of the column

name                0
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
status              0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
dtype: int64

In [69]:
# getting some statistical measures about the data
parkinsons_dise.describe()
# count is the number of rows in each column and mean is the column average
# the percentile rows are cut-offs: in the first column 25% of the values are below 117.57 and 50% are below 148.79

,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
count,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,...,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0
mean,154.228641,197.104918,116.324631,0.00622,4.4e-05,0.003306,0.003446,0.00992,0.029709,0.282251,...,0.046993,0.024847,21.885974,0.753846,0.498536,0.718099,-5.684397,0.22651,2.381826,0.206552
std,41.390065,91.491548,43.521413,0.004848,3.5e-05,0.002968,0.002759,0.008903,0.018857,0.194877,...,0.030459,0.040418,4.425764,0.431878,0.103942,0.055336,1.090208,0.083406,0.382799,0.090119
min,88.333,102.145,65.476,0.00168,7e-06,0.00068,0.00092,0.00204,0.00954,0.085,...,0.01364,0.00065,8.441,0.0,0.25657,0.574282,-7.964984,0.006274,1.423287,0.044539
25%,117.572,134.8625,84.291,0.00346,2e-05,0.00166,0.00186,0.004985,0.016505,0.1485,...,0.024735,0.005925,19.198,1.0,0.421306,0.674758,-6.450096,0.174351,2.099125,0.137451
50%,148.79,175.829,104.315,0.00494,3e-05,0.0025,0.00269,0.00749,0.02297,0.221,...,0.03836,0.01166,22.085,1.0,0.495954,0.722254,-5.720868,0.218885,2.361532,0.194052
75%,182.769,224.2055,140.0185,0.007365,6e-05,0.003835,0.003955,0.011505,0.037885,0.35,...,0.060795,0.02564,25.0755,1.0,0.587562,0.761881,-5.046192,0.279234,2.636456,0.25298
max,260.105,592.03,239.17,0.03316,0.00026,0.02144,0.01958,0.06433,0.11908,1.302,...,0.16942,0.31482,33.047,1.0,0.685151,0.825288,-2.434031,0.450493,3.671155,0.527367


In [70]:
# distribution of target Variable
parkinsons_dise['status'].value_counts()
# shows how many people have Parkinson's and how many do not (0 = healthy, 1 = Parkinson's disease)

1    147
0     48
Name: status, dtype: int64
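
The counts above show that the classes are imbalanced, which matters when reading the accuracy scores later on. As a rough sketch, the snippet below rebuilds the label column from those two counts (147 and 48) and computes the accuracy of a trivial model that always predicts the majority class. The names `status`, `class_share` and `baseline_accuracy` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Label counts copied from the value_counts() output above.
status = pd.Series([1] * 147 + [0] * 48, name="status")

# Share of each class in the dataset.
class_share = status.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = class_share.max()

print(class_share.round(3).to_dict())
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained on this data should be judged against this baseline rather than against 50%.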

In [71]:
# grouping the data based on the target variable
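parkinsons_dise.groupby('status').mean(numeric_only=True)
# numeric_only=True skips the text column 'name', which pandas 2.x refuses to average
# shows the mean of each column for healthy people (status 0) and for people with Parkinson's (status 1)
# healthy people have higher mean vocal frequencies (MDVP:Fo, Fhi, Flo) and HNR, while people with Parkinson's have higher mean jitter, shimmer and NHR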

status,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,MDVP:APQ,Shimmer:DDA,NHR,HNR,RPDE,DFA,spread1,spread2,D2,PPE
0,181.937771,223.63675,145.207292,0.003866,2.3e-05,0.001925,0.002056,0.005776,0.017615,0.162958,...,0.013305,0.028511,0.011483,24.67875,0.442552,0.695716,-6.759264,0.160292,2.154491,0.123017
1,145.180762,188.441463,106.893558,0.006989,5.1e-05,0.003757,0.0039,0.011273,0.033658,0.321204,...,0.0276,0.053027,0.029211,20.974048,0.516816,0.725408,-5.33342,0.248133,2.456058,0.233828


Data Pre-Processing

Separating the features & target

In [72]:
X = parkinsons_dise.drop(columns=['name','status'])
Y = parkinsons_dise['status']
# columns= already says that columns are dropped, so no axis argument is needed (axis=1 means columns, axis=0 means rows).

In [73]:
print(X) 
# (shows every column except name and status, which were dropped)

     MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  \
0        119.992       157.302        74.997         0.00784   
1        122.400       148.650       113.819         0.00968   
2        116.682       131.111       111.555         0.01050   
3        116.676       137.871       111.366         0.00997   
4        116.014       141.781       110.655         0.01284   
..           ...           ...           ...             ...   
190      174.188       230.978        94.261         0.00459   
191      209.516       253.017        89.488         0.00564   
192      174.688       240.005        74.287         0.01360   
193      198.764       396.961        74.904         0.00740   
194      214.289       260.277        77.973         0.00567   

     MDVP:Jitter(Abs)  MDVP:RAP  MDVP:PPQ  Jitter:DDP  MDVP:Shimmer  \
0             0.00007   0.00370   0.00554     0.01109       0.04374   
1             0.00008   0.00465   0.00696     0.01394       0.06134   
2             0.00

In [74]:
print(Y)
# the target labels: 1 for people with Parkinson's, 0 for healthy people

0      1
1      1
2      1
3      1
4      1
      ..
190    0
191    0
192    0
193    0
194    0
Name: status, Length: 195, dtype: int64


Splitting the data to training data and test data

In [75]:
# splitting the data into four arrays: Y_train holds the labels for X_train and Y_test holds the labels for X_test
# test_size=0.2 keeps 20% of the data as test data and the remaining 80% as training data
# random_state fixes the shuffle so that anyone re-running the code gets the same split (helps reproducibility)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

In [76]:
# X.shape is the shape of the full feature table, X_train.shape that of the training data, and X_test.shape that of the test data (20% of the rows)
print(X.shape, X_train.shape, X_test.shape)

(195, 22) (156, 22) (39, 22)
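
With only 39 test rows and imbalanced labels, a plain random split can leave the test set with a different class mix than the training set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic features (`X_demo`, `y_demo` are stand-ins with the same shape and label counts as the real data, not the real data itself).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape and label counts as the dataset above.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(195, 22))
y_demo = np.array([1] * 147 + [0] * 48)

# stratify=y_demo keeps the share of each class the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("positive share (train):", round(y_tr.mean(), 3))
print("positive share (test): ", round(y_te.mean(), 3))
```

Switching the notebook to a stratified split would change which rows end up in the test set, so the accuracy values reported further down would no longer apply as printed.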


Data Standardization (rescales every feature to zero mean and unit variance so that all columns are on a comparable scale; the information the values convey does not change.)

In [77]:
scaler = StandardScaler()

In [78]:
# fitting the StandardScaler to the training data
# fit() only learns the mean and standard deviation of each column of X_train
# the values themselves are rescaled later, when transform() is called
scaler.fit(X_train)

StandardScaler()

In [79]:
# the scaler is fitted, so we can now transform the data (puts all columns on the same scale)
X_train = scaler.transform(X_train)
# transforming the X_test data
# the scaler is fitted on X_train only and then applied to X_test,
# because the model must not see any information from the test data beforehand.
X_test = scaler.transform(X_test)

In [80]:
# each column now has mean 0 and standard deviation 1, so most values lie within a few units of 0 (they are not limited to -1 to +1)
print(X_train)

[[ 0.63239631 -0.02731081 -0.87985049 ... -0.97586547 -0.55160318
   0.07769494]
 [-1.05512719 -0.83337041 -0.9284778  ...  0.3981808  -0.61014073
   0.39291782]
 [ 0.02996187 -0.29531068 -1.12211107 ... -0.43937044 -0.62849605
  -0.50948408]
 ...
 [-0.9096785  -0.6637302  -0.160638   ...  1.22001022 -0.47404629
  -0.2159482 ]
 [-0.35977689  0.19731822 -0.79063679 ... -0.17896029 -0.47272835
   0.28181221]
 [ 1.01957066  0.19922317 -0.61914972 ... -0.716232    1.23632066
  -0.05829386]]
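
To make the standardization step concrete, the sketch below checks on synthetic data that `StandardScaler` applies z = (x - mean) / std, with the mean and standard deviation taken from the training part only. The arrays `train` and `test` and the name `demo_scaler` are illustrative assumptions, not objects from this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales.
train = rng.normal(loc=[150.0, 0.005], scale=[40.0, 0.002], size=(100, 2))
test = rng.normal(loc=[150.0, 0.005], scale=[40.0, 0.002], size=(20, 2))

# Learn mean and std from the training part only, then apply to both parts.
demo_scaler = StandardScaler().fit(train)
train_scaled = demo_scaler.transform(train)
test_scaled = demo_scaler.transform(test)

# Manual z-score using the training mean and (population) standard deviation.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(test_scaled, manual))
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The training columns end up with mean 0 and standard deviation 1; the test columns are only approximately so, because they are scaled with the training statistics.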


Model Training ( using our training data )

In [81]:
model = KNeighborsClassifier() # default n_neighbors=5
model.fit(X_train , Y_train)
X_test_prediction = model.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
print(accuracy_score(Y_test , X_test_prediction))

0.7692307692307693
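
A single split of 39 test rows gives a noisy accuracy estimate, and the number of neighbours is left at its default above. As a hedged sketch of one alternative, the snippet below puts scaling and KNN into one scikit-learn pipeline and compares a few values of k with 5-fold cross-validation. It runs on synthetic data from `make_classification`, so the printed scores say nothing about the Parkinson's dataset; `X_demo`, `y_demo` and `cv_scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same number of rows and features as the dataset.
X_demo, y_demo = make_classification(
    n_samples=195, n_features=22, n_informative=8, random_state=0
)

cv_scores = {}
for k in (3, 5, 7, 9):
    # The pipeline refits the scaler inside every fold, so no fold leaks into the scaling.
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()

for k, score in cv_scores.items():
    print(f"k={k}: mean CV accuracy = {score:.3f}")
```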


In [82]:
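class XGBoostClassifier:
    # NOTE: despite its name, this is plain gradient boosting with log loss on regression trees,
    # not the XGBoost algorithm. subsample, colsample_bytree and reg_lambda are stored but never used.
    def __init__(self, n_estimators=100, max_depth=3, learning_rate=0.1, subsample=1.0, colsample_bytree=1.0, reg_lambda=1.0):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.learning_rate = learning_rate
        self.subsample = subsample
        self.colsample_bytree = colsample_bytree
        self.reg_lambda = reg_lambda
        self.trees = []

    def sigmoid(self, X):
        # maps raw scores to probabilities in (0, 1)
        return 1 / (1 + np.exp(-X))

    def fit(self, X, Y):
        self.trees = [] # start fresh so that calling fit twice does not stack old trees
        Y = np.asarray(Y, dtype=float)
        F = np.zeros(len(X)) # raw scores, starting at 0 (probability 0.5)
        for i in range(self.n_estimators):
            # negative gradient of the log loss with respect to the raw scores
            residuals = Y - self.sigmoid(F)
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            self.trees.append(tree)
            F += self.learning_rate * tree.predict(X)

    def predict(self, X):
        # add up the scaled outputs of all trees, then threshold the probability at 0.5
        F = np.zeros(len(X))
        for tree in self.trees:
            F += self.learning_rate * tree.predict(X)
        y_pred = self.sigmoid(F)
        return (y_pred > 0.5).astype(int)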


In [83]:
model2 = XGBoostClassifier()
model2.fit(X_train , Y_train)
X_test_prediction2 = model2.predict(X_test)
print(accuracy_score(Y_test , X_test_prediction2))

0.7435897435897436
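
The hand-written class fits each tree to `Y - sigmoid(F)`. That quantity is the negative gradient of the log loss with respect to the raw scores F, which is why fitting trees to it counts as gradient boosting. The sketch below checks the identity numerically on a few random values; `log_loss_sum`, `analytic` and `numeric` are names made up for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_sum(y, F):
    # Summed binary log loss for labels y and raw scores F.
    p = sigmoid(F)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=6).astype(float)
F = rng.normal(size=6)

# The "residuals" used in the fit loop.
analytic = y - sigmoid(F)

# Negative gradient estimated with central finite differences.
eps = 1e-6
numeric = np.empty_like(F)
for i in range(len(F)):
    step = np.zeros_like(F)
    step[i] = eps
    numeric[i] = -(log_loss_sum(y, F + step) - log_loss_sum(y, F - step)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```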


In [84]:
model3 = RandomForestClassifier()
model3.fit(X_train , Y_train)
X_test_prediction3 = model3.predict(X_test)
print(accuracy_score(Y_test , X_test_prediction3))

0.8205128205128205
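
Because about three quarters of the recordings carry status 1, accuracy alone can hide how well the healthy class is recognized. A confusion matrix and per-class precision and recall give a fuller picture. The sketch below is an assumption-laden illustration on synthetic, similarly imbalanced data (`X_demo`, `y_demo`, `forest` are stand-ins), so its numbers do not describe the model trained above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 195 rows, 22 features, roughly 25% / 75% class mix.
X_demo, y_demo = make_classification(
    n_samples=195, n_features=22, n_informative=8,
    weights=[0.25, 0.75], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = forest.predict(X_te)

# Rows are the true classes, columns the predicted classes (order: 0, 1).
cm = confusion_matrix(y_te, y_pred, labels=[0, 1])
print(cm)
print(classification_report(
    y_te, y_pred, labels=[0, 1],
    target_names=["healthy", "parkinsons"], zero_division=0,
))
```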


Build a predictive system

In [85]:
input_data = (243.43900,250.91200,232.43500,0.00210,0.000009,0.00109,0.00137,0.00327,0.01419,0.12600,0.00777,0.00898,0.01033,0.02330,0.00454,25.36800,0.438296,0.635285,-7.057869,0.091608,2.330716,0.091470)

# changing input data to a numpy array
# np.asarray converts the above tuple to a numpy array
input_as_numpy_array = np.asarray(input_data)


# reshape the numpy array to shape (1, 22)
# this tells the model that we are passing a single data point with 22 features
input_data_reshaped = input_as_numpy_array.reshape(1,-1)



# standardize the data with the scaler that was fitted on the training data
std_data = scaler.transform(input_data_reshaped)


# predict whether the person has Parkinson's or not
prediction = model3.predict(std_data)
# print the prediction as [0] or [1]
print(prediction)



if prediction[0] == 0:
    print("The Person does not have Parkinsons Disease")
else:
    print("The Person has Parkinsons")



[0]
The Person does not have Parkinsons Disease
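
The prediction steps above can be wrapped into a small helper so that the feature count is checked and the scaler receives named columns. This is only a sketch: `predict_status` is a hypothetical function, and the scaler, model, and column names below are trained on synthetic data with placeholder names rather than taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with placeholder column names.
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(22)]
X_demo = pd.DataFrame(rng.normal(size=(195, 22)), columns=feature_names)
y_demo = (X_demo["feature_0"] > 0).astype(int)

demo_scaler = StandardScaler().fit(X_demo)
demo_model = RandomForestClassifier(random_state=0).fit(
    demo_scaler.transform(X_demo), y_demo
)

def predict_status(values, scaler, model, columns):
    """Return the predicted label (0 or 1) for one recording."""
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} features, got {len(values)}")
    # A one-row DataFrame keeps the column names the scaler was fitted with.
    row = pd.DataFrame([values], columns=columns)
    return int(model.predict(scaler.transform(row))[0])

label = predict_status(X_demo.iloc[0].tolist(), demo_scaler, demo_model, feature_names)
print(label)
```

Passing a tuple with the wrong number of values then fails with a clear message instead of an error from inside scikit-learn.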




In [86]:
import pickle

# Save the trained random forest model to a pickle file.
filename = 'parkinsons.pkl'

# the with block closes the file once the model has been written
with open(filename, 'wb') as f:
    pickle.dump(model3, f)
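
The saved model expects standardized inputs, so whoever loads it also needs the fitted scaler. One possible approach, sketched here under the assumption that both objects are stored in a single dictionary, is to pickle them together and verify the round trip. The file name `parkinsons_bundle.pkl` and the names `demo_scaler` and `demo_model` are hypothetical, and the data is synthetic.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 22))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_scaler = StandardScaler().fit(X_demo)
demo_model = RandomForestClassifier(random_state=0).fit(
    demo_scaler.transform(X_demo), y_demo
)

# Store scaler and model in one file (hypothetical file name, temporary folder).
path = os.path.join(tempfile.mkdtemp(), "parkinsons_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump({"scaler": demo_scaler, "model": demo_model}, f)

# Load the bundle back and check that predictions are unchanged.
with open(path, "rb") as f:
    bundle = pickle.load(f)

before = demo_model.predict(demo_scaler.transform(X_demo))
after = bundle["model"].predict(bundle["scaler"].transform(X_demo))
print(np.array_equal(before, after))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.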
