# Import Libraries

### Data Handling & Visualization

pandas (pd) → Handles data manipulation and analysis.

numpy (np) → Provides numerical computing capabilities.

matplotlib.pyplot (plt) → Used for plotting graphs.

%matplotlib inline → Ensures that plots are displayed inline in Jupyter Notebook.

seaborn (sns) → Enhances visualization with better-looking statistical plots.

### Machine Learning Models

DecisionTreeClassifier → Decision tree algorithm for classification.

RandomForestClassifier → Ensemble model using multiple decision trees.

KNeighborsClassifier → k-Nearest Neighbors classifier.

GaussianNB (imported under the alias gnb) → Naïve Bayes classifier that assumes each feature is normally distributed within a class.

### Model Training & Evaluation

train_test_split → Splits the dataset into training and testing sets.

accuracy_score → Measures the accuracy of the model predictions.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB as gnb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Dataset

In [3]:
data = pd.read_csv('parkinsons.csv')
data.head()

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


# Data Exploration

### Checking Dataset Shape

In [4]:
data.shape

(195, 24)

### Moving 'status' Column to the End 

In [5]:
status = data.pop('status')  # remove the column and keep it as a Series
data['status'] = status      # re-append it as the last column

### Checking Class Distribution

In [6]:
data.status.value_counts()

1    147
0     48
Name: status, dtype: int64

### Checking for Missing Values

In [7]:
print(data.isnull().sum())

name                0
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
status              0
dtype: int64


### Displaying Dataset Summary

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 1

### Dropping the 'name' Column

Removes the "name" column from the dataset; it is a recording identifier, not a feature.

axis=1 → Specifies that we're dropping a column (not a row).

If "name" doesn't exist, drop() raises a KeyError.

In [9]:
data = data.drop('name',axis=1)
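
As a small illustration of the behavior described above, the sketch below uses a made-up two-column frame (`toy` is hypothetical, not the Parkinson's data). It shows that dropping a missing column raises `KeyError` by default, and that passing `errors='ignore'` suppresses the error.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate DataFrame.drop.
toy = pd.DataFrame({"name": ["a", "b"], "value": [1.0, 2.0]})

# Dropping an existing column returns a new frame without it.
without_name = toy.drop("name", axis=1)
print(list(without_name.columns))

# Dropping it a second time fails, because the column is already gone.
try:
    without_name.drop("name", axis=1)
    raised = False
except KeyError:
    raised = True
print(raised)

# errors='ignore' skips labels that are not present instead of raising.
still_fine = without_name.drop("name", axis=1, errors="ignore")
print(list(still_fine.columns))
```

This matters mostly when a notebook cell is re-run: the second execution of a plain `drop('name', axis=1)` would fail for the same reason.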

In [10]:
from sklearn import metrics

### Splitting Features (X) and Target (y)

X (Features): Contains all columns except "status".

y (Target): Stores the "status" column (the variable we want to predict).

In [11]:
X = data.drop("status",axis=1)
y = data["status"]

### Printing Features and Target

Displays the feature matrix X and target variable y to verify the data.

In [12]:
print(X)

     MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  \
0        119.992       157.302        74.997         0.00784   
1        122.400       148.650       113.819         0.00968   
2        116.682       131.111       111.555         0.01050   
3        116.676       137.871       111.366         0.00997   
4        116.014       141.781       110.655         0.01284   
..           ...           ...           ...             ...   
190      174.188       230.978        94.261         0.00459   
191      209.516       253.017        89.488         0.00564   
192      174.688       240.005        74.287         0.01360   
193      198.764       396.961        74.904         0.00740   
194      214.289       260.277        77.973         0.00567   

     MDVP:Jitter(Abs)  MDVP:RAP  MDVP:PPQ  Jitter:DDP  MDVP:Shimmer  \
0             0.00007   0.00370   0.00554     0.01109       0.04374   
1             0.00008   0.00465   0.00696     0.01394       0.06134   
2             0.00

In [13]:
print(y)

0      1
1      1
2      1
3      1
4      1
      ..
190    0
191    0
192    0
193    0
194    0
Name: status, Length: 195, dtype: int64


# Splitting Data into Training and Testing Sets

train_test_split() divides the dataset:

80% for training (X_train, y_train).

20% for testing (X_test, y_test).

random_state=42 ensures reproducibility (consistent results each time the code runs).

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20, random_state=42)
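
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are assumptions for illustration; they only share the row count (195) with the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset (195).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(195, 3))
y_demo = rng.integers(0, 2, size=195)

# Two calls with the same random_state produce identical partitions.
split_a = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```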

In [15]:
print(len(X_train))
print(len(X_test))

156
39


### Feature Scaling

StandardScaler → Imported from scikit-learn's preprocessing module; standardizes each feature to zero mean and unit variance.

st_x = StandardScaler() → Creates an instance of StandardScaler.

X_train = st_x.fit_transform(X_train) → Computes the mean and standard deviation of each feature on the training data, then transforms the training data in one step.

X_test = st_x.transform(X_test) → Transforms the test data using the same scaling parameters computed from the training data, so no information from the test set leaks into the scaler.

In [16]:
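from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
X_train = st_x.fit_transform(X_train)  # learn mean/std from the training set and scale it
X_test = st_x.transform(X_test)        # reuse the training statistics on the test set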
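
The scaling step can be reproduced by hand, which may help clarify what `fit_transform` and `transform` each do. The sketch below runs on synthetic arrays (`train` and `test` are assumptions, not the notebook's data) and compares the manual result with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the training and test features.
rng = np.random.default_rng(0)
train = rng.normal(loc=100.0, scale=20.0, size=(50, 2))
test = rng.normal(loc=100.0, scale=20.0, size=(10, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

# Same computation by hand: statistics come from the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)
manual_train = (train - mean) / std
manual_test = (test - mean) / std

print(np.allclose(train_scaled, manual_train))
print(np.allclose(test_scaled, manual_test))
```

Note that the scaled test set will generally not have exactly zero mean, because it is shifted by the training mean rather than its own.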

In [17]:
print(X)

     MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  \
0        119.992       157.302        74.997         0.00784   
1        122.400       148.650       113.819         0.00968   
2        116.682       131.111       111.555         0.01050   
3        116.676       137.871       111.366         0.00997   
4        116.014       141.781       110.655         0.01284   
..           ...           ...           ...             ...   
190      174.188       230.978        94.261         0.00459   
191      209.516       253.017        89.488         0.00564   
192      174.688       240.005        74.287         0.01360   
193      198.764       396.961        74.904         0.00740   
194      214.289       260.277        77.973         0.00567   

     MDVP:Jitter(Abs)  MDVP:RAP  MDVP:PPQ  Jitter:DDP  MDVP:Shimmer  \
0             0.00007   0.00370   0.00554     0.01109       0.04374   
1             0.00008   0.00465   0.00696     0.01394       0.06134   
2             0.00

In [18]:
print(y)

0      1
1      1
2      1
3      1
4      1
      ..
190    0
191    0
192    0
193    0
194    0
Name: status, Length: 195, dtype: int64


# Model Selection 

# KNN


KNeighborsClassifier is a machine learning algorithm based on the k-Nearest Neighbors (k-NN) technique, which is used for classification tasks. The k-NN algorithm classifies a data point based on the majority class of its k nearest neighbors in the feature space.

How does KNeighborsClassifier work?

Choose k (the number of neighbors) → The algorithm looks at the k closest points in the training dataset.

Measure Distance → It calculates the distance between the test sample and all training samples using a distance metric (default: Euclidean distance).

Vote for the Majority Class → The class that appears most frequently among the k neighbors is assigned to the new data point.
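
The three steps above can be sketched in a few lines of NumPy. This is an illustrative from-scratch version, not how scikit-learn implements it; `knn_predict`, `X_toy`, and `y_toy` are hypothetical names introduced here.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training sample.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3)
pred_high = knn_predict(X_toy, y_toy, np.array([4.8, 5.0]), k=3)
print(pred_low, pred_high)
```

With an even k, a vote can end in a tie. This sketch breaks ties by order of appearance among the neighbors, which may differ from scikit-learn's rule.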

In [19]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 4)
knn.fit(X_train, y_train)

In [20]:

print(f'The accuracy of the knn classifier on training data is: {knn.score(X_train, y_train)*100:.2f}')
print(f'The accuracy of the knn classifier on test data is: {knn.score(X_test, y_test)*100:.2f}')

The accuracy of the knn classifier on training data is: 93.59
The accuracy of the knn classifier on test data is: 89.74


# RandomForestClassifier


RandomForestClassifier is a machine learning algorithm based on the Random Forest technique. It is an ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy.

How Random Forest Works

Bootstrap Sampling (Bagging) → The algorithm draws multiple random samples with replacement (bootstrap samples) from the original training data and trains a separate decision tree on each one.

Feature Randomization → At each tree node, a random subset of features is considered for the split. This prevents trees from being too similar and improves generalization.

Majority Voting (Classification) → Each decision tree makes a prediction, and the final prediction is the majority vote of all trees.
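
As a rough sketch of the three ideas above, the block below builds a small forest by hand from `DecisionTreeClassifier`. The toy data, the choice of 25 trees, and `max_features="sqrt"` are assumptions made for illustration. Note also that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted probabilities rather than by a hard vote.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: two Gaussian clusters with 4 features.
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(60, 4)),
                   rng.normal(3.0, 1.0, size=(60, 4))])
y_toy = np.array([0] * 60 + [1] * 60)

trees = []
for seed in range(25):
    # Bootstrap sampling: draw n row indices with replacement.
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    # Feature randomization: each split considers sqrt(n_features) features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    tree.fit(X_toy[idx], y_toy[idx])
    trees.append(tree)

# Majority voting: each tree predicts, the most common label wins.
all_preds = np.array([t.predict(X_toy) for t in trees])  # (n_trees, n_samples)
forest_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

vote_accuracy = (forest_pred == y_toy).mean()
print(vote_accuracy)
```

An odd number of trees is used so that a vote between two classes cannot tie.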

In [21]:
from sklearn.ensemble import RandomForestClassifier
# 50 trees; a node needs at least 3 samples to be split, each leaf keeps
# at least 6 samples, and every split considers half of the features.
rf_classifier = RandomForestClassifier(n_estimators=50, min_samples_split=3, min_samples_leaf=6, max_features=0.5)
rf_classifier.fit(X_train, y_train)

In [22]:
print(f'The accuracy of the Random Forest classifier on training data is: {rf_classifier.score(X_train, y_train)*100:.2f}')
print(f'The accuracy of the Random Forest classifier on test data is: {rf_classifier.score(X_test, y_test)*100:.2f}')


The accuracy of the Random Forest classifier on training data is: 95.51
The accuracy of the Random Forest classifier on test data is: 92.31


# LogisticRegression

Logistic Regression is a machine learning algorithm used for classification problems. It predicts binary outcomes (e.g., Yes/No, 0/1, True/False). Unlike Linear Regression, which predicts continuous values, Logistic Regression predicts probabilities.
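
To show where the probabilities come from, the sketch below fits a model on synthetic one-feature data and recomputes its output by hand: a linear score passed through the sigmoid function. The data and the helper `sigmoid` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical toy data: one feature, label depends on its sign plus noise.
X_toy = rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(solver="lbfgs").fit(X_toy, y_toy)

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Linear score w*x + b, then squashed into a probability.
z = X_toy @ model.coef_.ravel() + model.intercept_[0]
manual_proba = sigmoid(z)

# Probability of class 1 as reported by scikit-learn.
sklearn_proba = model.predict_proba(X_toy)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))

# Thresholding the probability at 0.5 gives the predicted label.
manual_label = (manual_proba > 0.5).astype(int)
print(np.array_equal(manual_label, model.predict(X_toy)))
```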

In [23]:
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression(solver='lbfgs')  # L-BFGS: Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm
lg.fit(X_train, y_train)

In [24]:

print(f'The accuracy of the Logistic Regression classifier on training data is: {lg.score(X_train, y_train)*100:.2f}')
print(f'The accuracy of the Logistic Regression classifier on test data is: {lg.score(X_test, y_test)*100:.2f}')

The accuracy of the Logistic Regression classifier on training data is: 87.18
The accuracy of the Logistic Regression classifier on test data is: 89.74


# SVC

SVC (Support Vector Classification) is a machine learning algorithm used for classification tasks. It is based on Support Vector Machines (SVM), which aim to find the best decision boundary (hyperplane) to separate different classes in a dataset.
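
One way to see the decision boundary at work is through the signed score returned by `decision_function`: for a two-class problem, its sign indicates which side of the boundary a sample falls on. The sketch below uses synthetic clusters (an assumption, not the notebook's data). With the RBF kernel the separating hyperplane lives in the kernel-induced feature space, so the boundary is generally curved in the original features.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical toy data: two clusters on either side of a boundary.
X_toy = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
                   rng.normal(2.0, 1.0, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf").fit(X_toy, y_toy)

# Signed score for each sample: positive on the class-1 side of the
# boundary, negative on the class-0 side.
scores = clf.decision_function(X_toy)
label_from_sign = (scores > 0).astype(int)
print(np.array_equal(label_from_sign, clf.predict(X_toy)))

# Only the support vectors determine where the boundary lies.
n_support = len(clf.support_vectors_)
print(n_support, "of", len(X_toy), "training points are support vectors")
```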



In [25]:
from sklearn.svm import SVC
svm = SVC(kernel='rbf')  # RBF: Radial Basis Function kernel
svm.fit(X_train, y_train)

In [26]:

print(f'The accuracy of the SVC classifier on training data is: {svm.score(X_train, y_train)*100:.2f}')
print(f'The accuracy of the SVC classifier on test data is: {svm.score(X_test, y_test)*100:.2f}')

The accuracy of the SVC classifier on training data is: 89.10
The accuracy of the SVC classifier on test data is: 89.74


# GaussianNB

GaussianNB is a Naïve Bayes classifier that assumes features follow a Gaussian (Normal) distribution. It is widely used for classification tasks, especially when data is continuous.



How Does GaussianNB Work?

Naïve Bayes is based on Bayes' Theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

$P(A \mid B)$ → Probability of class $A$ given data $B$

$P(B \mid A)$ → Likelihood of data $B$ given class $A$

$P(A)$ → Prior probability of class $A$

$P(B)$ → Probability of the observed data

In [27]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train , y_train)

In [28]:

print(f'The accuracy of the GaussianNB classifier on training data is: {nb.score(X_train, y_train)*100:.2f}')
print(f'The accuracy of the GaussianNB classifier on test data is: {nb.score(X_test, y_test)*100:.2f}')

The accuracy of the GaussianNB classifier on training data is: 69.87
The accuracy of the GaussianNB classifier on test data is: 71.79
