# Parkinson's Disease - Clinical Data Analysis & Clustering
### Final Project in Neuroscience / Biology
**Authors:** Tamar Kan, Alon Hillel, Or Galifat, Roni Itay  
**Date:** December 2025

---
**Project Goal:** To classify patients into distinct profiles based on personal characteristics, lifestyle, and medical history using unsupervised learning (K-Means Clustering).

## 1. Environment Setup
Importing the libraries needed for data manipulation, visualization, and statistical analysis.

In [2]:
# Import essential libraries for data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Plot styling settings so the figures look good
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Data Loading & Inspection
Loading the dataset and displaying the initial structure to understand the features.

In [3]:
# Load the dataset
file_name = 'parkinsons_disease_data.csv' 

try:
    df = pd.read_csv(file_name)
    print(f"Data loaded successfully! Shape: {df.shape}")
    # Initial peek at the data
    display(df.head())
except FileNotFoundError:
    print(f"Error: The file '{file_name}' was not found. Please check the file name.")

Data loaded successfully! Shape: (2105, 35)


Unnamed: 0,PatientID,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,...,FunctionalAssessment,Tremor,Rigidity,Bradykinesia,PosturalInstability,SpeechProblems,SleepDisorders,Constipation,Diagnosis,DoctorInCharge
0,3058,85,0,3,1,19.619878,0,5.108241,1.38066,3.893969,...,1.572427,1,0,0,0,0,0,0,0,DrXXXConfid
1,3059,75,0,0,2,16.247339,1,6.027648,8.409804,8.513428,...,4.787551,0,1,0,1,0,1,0,1,DrXXXConfid
2,3060,70,1,0,0,15.368239,0,2.242135,0.213275,6.498805,...,2.130686,1,0,0,0,1,0,1,1,DrXXXConfid
3,3061,52,0,0,0,15.454557,0,5.997788,1.375045,6.715033,...,3.391288,1,1,1,0,0,0,1,1,DrXXXConfid
4,3062,87,0,0,1,18.616042,0,9.775243,1.188607,4.657572,...,3.200969,0,0,0,1,0,1,0,0,DrXXXConfid


## 3. Data Preprocessing & Cleaning
**Action:** Removing non-informative columns (e.g., 'DoctorInCharge') and setting 'PatientID' as the index, so that the identifier is not treated as a feature in later calculations.

In [4]:

# 1. Remove irrelevant columns
# The project document specifically names 'DoctorInCharge'. If there are other confidential columns, add them to the list
cols_to_drop = ['DoctorInCharge'] 

# Check that the columns exist before dropping them, to avoid errors
existing_cols_to_drop = [col for col in cols_to_drop if col in df.columns]
df = df.drop(columns=existing_cols_to_drop)
print(f"Dropped columns: {existing_cols_to_drop}")

# 2. Handle the ID column
# If the PatientID column exists, make it the index so it does not interfere with the statistical calculations
if 'PatientID' in df.columns:
    df.set_index('PatientID', inplace=True)
    print("PatientID set as index.")

# Show info on the remaining columns (to verify everything is in order)
df.info()

Dropped columns: ['DoctorInCharge']
PatientID set as index.
<class 'pandas.core.frame.DataFrame'>
Index: 2105 entries, 3058 to 5162
Data columns (total 33 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       2105 non-null   int64  
 1   Gender                    2105 non-null   int64  
 2   Ethnicity                 2105 non-null   int64  
 3   EducationLevel            2105 non-null   int64  
 4   BMI                       2105 non-null   float64
 5   Smoking                   2105 non-null   int64  
 6   AlcoholConsumption        2105 non-null   float64
 7   PhysicalActivity          2105 non-null   float64
 8   DietQuality               2105 non-null   float64
 9   SleepQuality              2105 non-null   float64
 10  FamilyHistoryParkinsons   2105 non-null   int64  
 11  TraumaticBrainInjury      2105 non-null   int64  
 12  Hypertension              2105 non-null   int64  
 13  Diabe

## 4. Data Validation (Sanity Check)
Checking for missing values and duplicate records to ensure the integrity of the dataset before analysis.

In [5]:

# 1. Check for missing values (NaNs) in the entire dataset
missing_values = df.isnull().sum().sum()
print(f"Total missing values: {missing_values}")

# 2. Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Total duplicate rows: {duplicates}")

# If duplicates exist, remove them (Uncomment the next line if needed)
# df = df.drop_duplicates()

Total missing values: 0
Total duplicate rows: 0
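
The same checks can be wrapped in a small helper so they are easy to re-run after each preprocessing step. The sketch below is illustrative only: `summarize_integrity` is a hypothetical name, and the toy frame stands in for the real data. It also counts repeated index labels, which `DataFrame.duplicated()` does not look at, since that method compares column values only.

```python
import numpy as np
import pandas as pd

def summarize_integrity(frame: pd.DataFrame) -> dict:
    """Return simple integrity counts for a DataFrame (hypothetical helper)."""
    return {
        "missing_values": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "duplicate_index_labels": int(frame.index.duplicated().sum()),
    }

# Toy data, not taken from the project dataset
toy = pd.DataFrame(
    {"Age": [70, 70, 52, np.nan], "BMI": [21.5, 21.5, 30.1, 18.2]},
    index=[1, 2, 3, 3],
)

report = summarize_integrity(toy)
print(report)
```

Since 'PatientID' is used as the index in this notebook, a non-zero `duplicate_index_labels` count would suggest that the same patient appears more than once, even if the feature values differ.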


## 5. Feature Engineering
**Action:** Converting the categorical variables ('Ethnicity', 'EducationLevel') into a numerical format with One-Hot Encoding, so that they can be used by the clustering algorithm.

In [6]:
# Converting categorical variables into numerical format using One-Hot Encoding
# As per project plan: Ethnicity and EducationLevel need encoding

# List of columns to encode
categorical_cols = ['Ethnicity', 'EducationLevel']

# Apply One-Hot Encoding (drop_first=True drops the first category of each column, which becomes the implicit baseline)
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Display the new columns to verify
print("New columns after One-Hot Encoding:")
print(df_encoded.columns)
display(df_encoded.head())

New columns after One-Hot Encoding:
Index(['Age', 'Gender', 'BMI', 'Smoking', 'AlcoholConsumption',
       'PhysicalActivity', 'DietQuality', 'SleepQuality',
       'FamilyHistoryParkinsons', 'TraumaticBrainInjury', 'Hypertension',
       'Diabetes', 'Depression', 'Stroke', 'SystolicBP', 'DiastolicBP',
       'CholesterolTotal', 'CholesterolLDL', 'CholesterolHDL',
       'CholesterolTriglycerides', 'UPDRS', 'MoCA', 'FunctionalAssessment',
       'Tremor', 'Rigidity', 'Bradykinesia', 'PosturalInstability',
       'SpeechProblems', 'SleepDisorders', 'Constipation', 'Diagnosis',
       'Ethnicity_1', 'Ethnicity_2', 'Ethnicity_3', 'EducationLevel_1',
       'EducationLevel_2', 'EducationLevel_3'],
      dtype='object')


PatientID,Age,Gender,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,SleepQuality,FamilyHistoryParkinsons,TraumaticBrainInjury,...,SpeechProblems,SleepDisorders,Constipation,Diagnosis,Ethnicity_1,Ethnicity_2,Ethnicity_3,EducationLevel_1,EducationLevel_2,EducationLevel_3
3058,85,0,19.619878,0,5.108241,1.38066,3.893969,9.283194,0,0,...,0,0,0,0,False,False,True,True,False,False
3059,75,0,16.247339,1,6.027648,8.409804,8.513428,5.60247,0,0,...,0,1,0,1,False,False,False,False,True,False
3060,70,1,15.368239,0,2.242135,0.213275,6.498805,9.929824,0,0,...,1,0,1,1,False,False,False,False,False,False
3061,52,0,15.454557,0,5.997788,1.375045,6.715033,4.196189,0,0,...,0,0,1,1,False,False,False,False,False,False
3062,87,0,18.616042,0,9.775243,1.188607,4.657572,9.363925,0,0,...,0,1,0,0,False,False,False,True,False,False
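
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy column (the values are made up, not project data). With four categories, only three indicator columns are kept, and a row with all three indicators off belongs to the dropped baseline category `0`.

```python
import pandas as pd

# Toy column with four categories (0-3), not taken from the project dataset
toy = pd.DataFrame({"Ethnicity": [0, 1, 2, 3, 0]})

full = pd.get_dummies(toy, columns=["Ethnicity"])
reduced = pd.get_dummies(toy, columns=["Ethnicity"], drop_first=True)

print(list(full.columns))     # one indicator per category
print(list(reduced.columns))  # the indicator for category 0 is dropped

# Rows where every remaining indicator is off are the baseline category (0)
baseline_rows = ~reduced.astype(bool).any(axis=1)
print(baseline_rows.tolist())
```

One design point that may be worth keeping in mind for distance-based methods: after dropping the first level, the baseline category sits at distance 1 from every other category, while any two non-baseline categories sit at distance √2 from each other, so the encoding is not perfectly symmetric.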


## 6. Data Scaling (Normalization)
**Action:** Normalizing continuous numerical variables using `StandardScaler` (z-score standardization: each feature is rescaled to zero mean and unit variance).
**Why?** This puts the continuous features on a comparable scale in the distance calculations of K-Means clustering, preventing variables with large ranges (like Cholesterol) from dominating the model.

In [10]:
# 1. Import the scaling tool (important to run this after the kernel restart)
from sklearn.preprocessing import StandardScaler

# 2. Create the scaler (the object that performs the scaling)
scaler = StandardScaler()

# 3. List of desired columns (including the fix for MoCA)
desired_features = [
    'Age', 'BMI', 'AlcoholConsumption', 'PhysicalActivity', 'DietQuality', 'SleepQuality',
    'SystolicBP', 'DiastolicBP', 'CholesterolTotal', 'CholesterolLDL', 'CholesterolHDL',
    'CholesterolTriglycerides', 'UPDRS', 'MoCA', 'FunctionalAssessment'
]

# 4. Safety check: keep only the columns that actually exist in the table
existing_features = [col for col in desired_features if col in df_encoded.columns]

# Check whether anything is missing
missing_cols = set(desired_features) - set(existing_features)
if missing_cols:
    print(f"⚠️ Warning: Still missing these columns: {missing_cols}")
else:
    print("✅ Excellent! All columns found.")

# 5. Apply the scaling
# Create a copy so the original is not modified
df_scaled = df_encoded.copy()

# Apply the scaling only to the columns that were found
if existing_features:
    df_scaled[existing_features] = scaler.fit_transform(df_encoded[existing_features])
    print("\nScaling completed successfully.")
    print("First 5 rows of scaled data:")
    display(df_scaled[existing_features].head())
else:
    print("Error: No columns found to scale.")

✅ Excellent! All columns found.

Scaling completed successfully.
First 5 rows of scaled data:


PatientID,Age,BMI,AlcoholConsumption,PhysicalActivity,DietQuality,SleepQuality,SystolicBP,DiastolicBP,CholesterolTotal,CholesterolLDL,CholesterolHDL,CholesterolTriglycerides,UPDRS,MoCA,FunctionalAssessment
3058,1.328366,-1.053179,-0.867475,-1.258035,-0.354851,1.304629,-0.178129,-1.773413,-0.092213,0.506438,-0.933115,1.122654,-1.678331,1.630256,-1.165038
3059,0.465684,-1.521172,-0.705769,1.173999,1.253913,-0.795464,1.105081,-0.835406,-0.375403,0.636191,0.751444,0.40929,-1.133101,-0.319603,-0.068916
3060,0.034343,-1.643161,-1.371569,-1.661943,0.552304,1.673573,-0.781992,0.161227,1.388905,-0.171563,1.109247,1.695488,-0.593466,1.716646,-0.974713
3061,-1.518484,-1.631183,-0.711021,-1.259978,0.627607,-1.597839,0.463476,-0.718155,1.227165,0.233919,-0.333858,1.366909,-0.856351,0.718665,-0.54494
3062,1.500903,-1.192477,-0.046638,-1.324484,-0.08892,1.350691,-0.706509,0.219852,1.311486,-0.407828,-1.480864,-0.71643,-1.407092,-0.782083,-0.609825
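
The motivation for scaling given in this section can be checked on a small example. The sketch below uses made-up values for two features on very different scales (not project data). It shows that `StandardScaler` matches a manual z-score, (x - mean) / std with the population standard deviation, and that the nearest neighbour of a row under Euclidean distance, which K-Means relies on, can change once both features are on a comparable scale. The helper `nearest_other` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 mimics a cholesterol-like range, column 1 a 0-10 score
X = np.array([
    [150.0, 1.0],
    [165.0, 9.0],
    [210.0, 1.5],
    [290.0, 8.5],
])

X_scaled = StandardScaler().fit_transform(X)

# StandardScaler is equivalent to a z-score using the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_scaled, manual))

def nearest_other(data, i):
    """Index of the row closest to row i by Euclidean distance (excluding i)."""
    dists = np.linalg.norm(data - data[i], axis=1)
    dists[i] = np.inf  # ignore the row itself
    return int(np.argmin(dists))

# On the raw values the large-range column outweighs the score,
# so row 1 (similar in column 0, very different score) looks closest to row 0
print(nearest_other(X, 0))

# After scaling, row 2 (almost the same score) becomes the closest row
print(nearest_other(X_scaled, 0))
```

Neither answer is "correct" in itself; the point is only that without scaling, the choice of units for one feature can decide which patients end up grouped together.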


In [8]:
%pip install scikit-learn


Note: you may need to restart the kernel to use updated packages.
