In [2]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# **Parkinson's Disease Voice Dataset - Regression Task**

This dataset contains **5,875 voice recordings** from **42 early-stage Parkinson’s disease patients**.  
The goal is to **predict `motor_UPDRS`**, a clinical score for movement symptoms like **stiffness or tremors**, using **16 voice features** and other patient details.

---

## **Dataset Overview**
- **Total Patients:** 42  
- **Total Samples:** 5,875  
- **Prediction Target:** `motor_UPDRS` (scale of 0 to 108, higher = worse symptoms; values in this dataset range from about 5.0 to 39.5)  
- **Features Used for Prediction:**
  - **Demographics & Study Time:**
    - `age`: Patient’s age (years)
    - `sex`: Gender **(0 = male, 1 = female)**
    - `test_time`: Days since the patient joined the study (tracks symptom progression)

  - **Voice Features (16 Total)**
    - **Jitter Measures** *(Voice pitch stability, higher = more unstable)*
      - `Jitter(%)`, `Jitter(Abs)`, `Jitter:RAP`, `Jitter:PPQ5`, `Jitter:DDP`
    
    - **Shimmer Measures** *(Voice volume stability, higher = more unstable)*
      - `Shimmer`, `Shimmer(dB)`, `Shimmer:APQ3`, `Shimmer:APQ5`, `Shimmer:APQ11`, `Shimmer:DDA`
    
    - **Noise & Complexity Measures**
      - `NHR`: Noise-to-harmonics ratio **(higher = noisier, hoarse voice)**
      - `HNR`: Harmonics-to-noise ratio **(higher = clearer voice)**
      - `RPDE`: Recurrence Period Density Entropy **(higher = less predictable voice patterns)**
      - `DFA`: Detrended Fluctuation Analysis **(higher = more random voice patterns)**
      - `PPE`: Pitch Period Entropy **(higher = unstable pitch)**
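
The grouping above can be kept in code so that later cells can select a whole feature family at once. This is a minimal sketch, not part of the original analysis: the names `FEATURE_GROUPS`, `VOICE_FEATURES`, and `missing_voice_columns` are hypothetical. The small `sample` frame copies five `Shimmer:APQ3` and `Shimmer:DDA` values from the `df.head()` output shown later. In those rows `Shimmer:DDA` is roughly three times `Shimmer:APQ3`, which hints that some of the 16 features may be nearly redundant.

```python
import numpy as np
import pandas as pd

# Voice feature columns, grouped as in the list above (names are the CSV column names).
FEATURE_GROUPS = {
    "jitter": ["Jitter(%)", "Jitter(Abs)", "Jitter:RAP", "Jitter:PPQ5", "Jitter:DDP"],
    "shimmer": ["Shimmer", "Shimmer(dB)", "Shimmer:APQ3", "Shimmer:APQ5",
                "Shimmer:APQ11", "Shimmer:DDA"],
    "noise_complexity": ["NHR", "HNR", "RPDE", "DFA", "PPE"],
}
VOICE_FEATURES = [col for cols in FEATURE_GROUPS.values() for col in cols]


def missing_voice_columns(frame: pd.DataFrame) -> list:
    """Return the expected voice columns that are absent from `frame`."""
    return [col for col in VOICE_FEATURES if col not in frame.columns]


# Values copied from the first five rows of the df.head() output.
sample = pd.DataFrame({
    "Shimmer:APQ3": [0.01438, 0.00994, 0.00734, 0.01106, 0.00679],
    "Shimmer:DDA": [0.04314, 0.02982, 0.02202, 0.03317, 0.02036],
})

# In these five rows Shimmer:DDA matches 3 * Shimmer:APQ3 up to rounding.
dda_is_triple_apq3 = bool(
    np.allclose(sample["Shimmer:DDA"], 3 * sample["Shimmer:APQ3"], atol=1e-4)
)
print(len(VOICE_FEATURES), dda_is_triple_apq3)
```

On the full DataFrame, `missing_voice_columns(df)` should return an empty list if every expected column was loaded.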

---

## **Target Variable**
- `motor_UPDRS`: **Clinician-given score** from **0 to 108** (higher = worse movement symptoms).

### **Example Relationship**
- A patient with **high `Jitter(%)` and `Shimmer(dB)`** may have a **higher `motor_UPDRS` score**, indicating **worse movement symptoms**.

---

## **How to Use This Data**
- **Train a regression model** to predict `motor_UPDRS` based on **16 voice features, age, sex, and test_time**.
- Each patient has **multiple recordings over time**, allowing us to **track symptom progression**.
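
Because each patient contributes many recordings, a random row-wise split could place the same patient in both the training and test sets. The sketch below shows one way to select the inputs and hold out whole patients instead. It runs on a hypothetical toy frame rather than the real data, and `split_by_subject`, `NON_FEATURES`, and the 20% test fraction are illustrative choices, not something the notebook prescribes. It also assumes that `subject#` (an identifier) and `total_UPDRS` (a second clinical score) are left out of the inputs, in line with the feature list above.

```python
import numpy as np
import pandas as pd

TARGET = "motor_UPDRS"
# Assumption: identifier and the other UPDRS score are not used as inputs.
NON_FEATURES = ["subject#", "motor_UPDRS", "total_UPDRS"]


def split_by_subject(frame, test_fraction=0.2, seed=0):
    """Hold out whole subjects so no patient appears in both train and test."""
    rng = np.random.default_rng(seed)
    subjects = frame["subject#"].unique()
    n_test = max(1, int(round(test_fraction * len(subjects))))
    test_subjects = rng.choice(subjects, size=n_test, replace=False)
    is_test = frame["subject#"].isin(test_subjects)
    return frame[~is_test], frame[is_test]


# Hypothetical toy frame standing in for df: 5 subjects, 4 recordings each.
toy = pd.DataFrame({
    "subject#": np.repeat([1, 2, 3, 4, 5], 4),
    "age": np.repeat([72, 58, 65, 61, 70], 4),
    "sex": np.repeat([0, 1, 0, 0, 1], 4),
    "test_time": np.tile([5.0, 12.0, 19.0, 26.0], 5),
    "motor_UPDRS": np.linspace(10, 30, 20),
    "total_UPDRS": np.linspace(15, 40, 20),
    "Jitter(%)": np.linspace(0.003, 0.007, 20),
})

train, test = split_by_subject(toy, test_fraction=0.2, seed=0)
X_train, y_train = train.drop(columns=NON_FEATURES), train[TARGET]
X_test, y_test = test.drop(columns=NON_FEATURES), test[TARGET]
print(X_train.shape, X_test.shape)
```

scikit-learn's `GroupShuffleSplit` implements the same idea and may be preferable once a modelling pipeline is in place.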


<div style="text-align:center;font-size:30px; font-weight:bold;">Loading and exploring the dataset</div>

In [3]:
df = pd.read_csv('dataset_regression/parkinsons_updrs.data')
df.head()

|  | subject# | age | sex | test_time | motor_UPDRS | total_UPDRS | Jitter(%) | Jitter(Abs) | Jitter:RAP | Jitter:PPQ5 | ... | Shimmer(dB) | Shimmer:APQ3 | Shimmer:APQ5 | Shimmer:APQ11 | Shimmer:DDA | NHR | HNR | RPDE | DFA | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 72 | 0 | 5.6431 | 28.199 | 34.398 | 0.00662 | 3.4e-05 | 0.00401 | 0.00317 | ... | 0.23 | 0.01438 | 0.01309 | 0.01662 | 0.04314 | 0.01429 | 21.64 | 0.41888 | 0.54842 | 0.16006 |
| 1 | 1 | 72 | 0 | 12.666 | 28.447 | 34.894 | 0.003 | 1.7e-05 | 0.00132 | 0.0015 | ... | 0.179 | 0.00994 | 0.01072 | 0.01689 | 0.02982 | 0.011112 | 27.183 | 0.43493 | 0.56477 | 0.1081 |
| 2 | 1 | 72 | 0 | 19.681 | 28.695 | 35.389 | 0.00481 | 2.5e-05 | 0.00205 | 0.00208 | ... | 0.181 | 0.00734 | 0.00844 | 0.01458 | 0.02202 | 0.02022 | 23.047 | 0.46222 | 0.54405 | 0.21014 |
| 3 | 1 | 72 | 0 | 25.647 | 28.905 | 35.81 | 0.00528 | 2.7e-05 | 0.00191 | 0.00264 | ... | 0.327 | 0.01106 | 0.01265 | 0.01963 | 0.03317 | 0.027837 | 24.445 | 0.4873 | 0.57794 | 0.33277 |
| 4 | 1 | 72 | 0 | 33.642 | 29.187 | 36.375 | 0.00335 | 2e-05 | 0.00093 | 0.0013 | ... | 0.176 | 0.00679 | 0.00929 | 0.01819 | 0.02036 | 0.011625 | 26.126 | 0.47188 | 0.56122 | 0.19361 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5875 entries, 0 to 5874
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   subject#       5875 non-null   int64  
 1   age            5875 non-null   int64  
 2   sex            5875 non-null   int64  
 3   test_time      5875 non-null   float64
 4   motor_UPDRS    5875 non-null   float64
 5   total_UPDRS    5875 non-null   float64
 6   Jitter(%)      5875 non-null   float64
 7   Jitter(Abs)    5875 non-null   float64
 8   Jitter:RAP     5875 non-null   float64
 9   Jitter:PPQ5    5875 non-null   float64
 10  Jitter:DDP     5875 non-null   float64
 11  Shimmer        5875 non-null   float64
 12  Shimmer(dB)    5875 non-null   float64
 13  Shimmer:APQ3   5875 non-null   float64
 14  Shimmer:APQ5   5875 non-null   float64
 15  Shimmer:APQ11  5875 non-null   float64
 16  Shimmer:DDA    5875 non-null   float64
 17  NHR            5875 non-null   float64
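 18  HNR            5875 non-null   float64
 19  RPDE           5875 non-null   float64
 20  DFA            5875 non-null   float64
 21  PPE            5875 non-null   float64
dtypes: float64(19), int64(3)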

In [8]:
# Checking for duplicates
df.duplicated().sum()

0
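
`df.duplicated()` only flags rows that match on every column, so a result of 0 does not rule out several recordings sharing the same patient and time stamp. The sketch below contrasts the two checks on a hypothetical toy frame. Treating `subject#` plus `test_time` as a "same visit" key is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 share subject and time but differ in Jitter(%).
toy = pd.DataFrame({
    "subject#": [1, 1, 1, 2],
    "test_time": [5.6, 5.6, 12.7, 5.6],
    "motor_UPDRS": [28.2, 28.2, 28.4, 19.0],
    "Jitter(%)": [0.0066, 0.0030, 0.0048, 0.0053],
})

# Rows identical across all columns.
full_row_dupes = int(toy.duplicated().sum())

# Rows repeating an earlier (subject#, test_time) pair.
same_visit_rows = int(toy.duplicated(subset=["subject#", "test_time"]).sum())

print(full_row_dupes, same_visit_rows)
```

Running the second check on `df` would show whether patients were recorded more than once per visit, which matters when deciding how to split or aggregate the data.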

In [6]:
df.describe()

|  | subject# | age | sex | test_time | motor_UPDRS | total_UPDRS | Jitter(%) | Jitter(Abs) | Jitter:RAP | Jitter:PPQ5 | ... | Shimmer(dB) | Shimmer:APQ3 | Shimmer:APQ5 | Shimmer:APQ11 | Shimmer:DDA | NHR | HNR | RPDE | DFA | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5875.0 | 5875.0 | 5875.0 | 5875.0 | 5875.0 | 5875.0 | 5875.0 | 5875.0 | 5875.0 | 5875.0 | ... | 5875.0 | 5875.0 | 5875.0 | 5875.0 | 5875.0 | 5875.0 | 5875.0 | 5875.0 | 5875.0 | 5875.0 |
| mean | 21.494128 | 64.804936 | 0.317787 | 92.863722 | 21.296229 | 29.018942 | 0.006154 | 4.4e-05 | 0.002987 | 0.003277 | ... | 0.31096 | 0.017156 | 0.020144 | 0.027481 | 0.051467 | 0.03212 | 21.679495 | 0.541473 | 0.65324 | 0.219589 |
| std | 12.372279 | 8.821524 | 0.465656 | 53.445602 | 8.129282 | 10.700283 | 0.005624 | 3.6e-05 | 0.003124 | 0.003732 | ... | 0.230254 | 0.013237 | 0.016664 | 0.019986 | 0.039711 | 0.059692 | 4.291096 | 0.100986 | 0.070902 | 0.091498 |
| min | 1.0 | 36.0 | 0.0 | -4.2625 | 5.0377 | 7.0 | 0.00083 | 2e-06 | 0.00033 | 0.00043 | ... | 0.026 | 0.00161 | 0.00194 | 0.00249 | 0.00484 | 0.000286 | 1.659 | 0.15102 | 0.51404 | 0.021983 |
| 25% | 10.0 | 58.0 | 0.0 | 46.8475 | 15.0 | 21.371 | 0.00358 | 2.2e-05 | 0.00158 | 0.00182 | ... | 0.175 | 0.00928 | 0.01079 | 0.015665 | 0.02783 | 0.010955 | 19.406 | 0.469785 | 0.59618 | 0.15634 |
| 50% | 22.0 | 65.0 | 0.0 | 91.523 | 20.871 | 27.576 | 0.0049 | 3.5e-05 | 0.00225 | 0.00249 | ... | 0.253 | 0.0137 | 0.01594 | 0.02271 | 0.04111 | 0.018448 | 21.92 | 0.54225 | 0.6436 | 0.2055 |
| 75% | 33.0 | 72.0 | 1.0 | 138.445 | 27.5965 | 36.399 | 0.0068 | 5.3e-05 | 0.00329 | 0.00346 | ... | 0.365 | 0.020575 | 0.023755 | 0.032715 | 0.061735 | 0.031463 | 24.444 | 0.614045 | 0.711335 | 0.26449 |
| max | 42.0 | 85.0 | 1.0 | 215.49 | 39.511 | 54.992 | 0.09999 | 0.000446 | 0.05754 | 0.06956 | ... | 2.107 | 0.16267 | 0.16702 | 0.27546 | 0.48802 | 0.74826 | 37.875 | 0.96608 | 0.8656 | 0.73173 |
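
Two things in this summary deserve a closer look: `test_time` has a negative minimum, and for some voice features the maximum sits far above the 75th percentile, which suggests long right tails. The sketch below flags both patterns from the summary table. It uses a few values copied from the output above, and the ratio threshold of 5 is an arbitrary illustrative cut-off, not an established rule.

```python
import pandas as pd

# Values copied from the df.describe() output above (subset of columns and rows).
summary = pd.DataFrame(
    {
        "test_time": [-4.2625, 138.445, 215.49],
        "Jitter(%)": [0.00083, 0.0068, 0.09999],
        "NHR": [0.000286, 0.031463, 0.74826],
        "HNR": [1.659, 24.444, 37.875],
    },
    index=["min", "75%", "max"],
)

# How far the maximum sits above the upper quartile, per column.
tail_ratio = summary.loc["max"] / summary.loc["75%"]
heavy_tailed = tail_ratio[tail_ratio > 5].index.tolist()  # threshold is an assumption

# Columns whose minimum is below zero.
minimums = summary.loc["min"]
negative_min = minimums[minimums < 0].index.tolist()

print(heavy_tailed, negative_min)
```

With the real data, `summary = df.describe()` can replace the hand-built table. Columns flagged as heavy-tailed may benefit from a log transform or robust scaling before fitting a regression model.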
