# **Data Cleaning, Manipulation and Analysis**

## Objectives

The intention of this notebook was tri-fold: data cleaning, data transformation and data loading. Light analysis has also been carried out to better understand, extract and load data.

## Inputs

* Dataset retrived from Kaggle (csv file containing data reagaring patients with or potentially at risk of Alzheimer's disease saved to inputs folder)

## Outputs

* Data cleaning pipeline (within this notebook)
* Machine learning pipeline (within this notebook)
* Cleaned data (csv file extracted to outputs folder)
* Data for machine learning (csv file extracted to outputs folder)

## Additional Comments

* Data was extracted from Kaggle with the source citation included in the README file.
* Data was saved in its raw orginal form and then cleaned (a machine learning dataset with scaling and encoding was also created). 


---
---

#### **REMINDER**: All notebook cells should be run top-down (you can't create a dynamic wherein a given point you need to go back to a previous cell to execute some task, like go back to a previous cell and refresh a variable content).

---
---

## **Setup Information

---

### Change Working Directory

* When storing the notebooks in a subfolder to run in the editor, for projects such as these, it's best practice to change the working directory. 
* We need to change the working directory from its current folder to its parent folder.

First we access the current directory with os.getcwd()

In [40]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\F_bee\\Documents\\vs-code\\vs-code-projects\\healthcare-and-public-health'

Then we make the parent of the current directory the new current directory using os.path.dirname() to get the parent directory and *os.chdir()* to define the new current directory

In [41]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [42]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\F_bee\\Documents\\vs-code\\vs-code-projects'

## Package Installation Instructions

### **IMPORTANT**: Before running the cells below, you **MUST** restart the kernel!

**How to restart the kernel:**
1. Click on the **"Kernel"** menu at the top
2. Select **"Restart Kernel"** 
3. Confirm the restart
4. **Then** run the 3 STEP cells below in order

**Why restart is required:**
- Windows locks files that are currently in use
- NumPy is loaded in the current kernel session
- Restarting clears memory and releases file locks

In [44]:
# STEP 1: Upgrade numpy first (run after kernel restart)
%pip install --upgrade numpy

Collecting numpy
  Using cached numpy-2.3.2-cp312-cp312-win_amd64.whl.metadata (60 kB)
Using cached numpy-2.3.2-cp312-cp312-win_amd64.whl (12.8 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
Successfully installed numpy-2.3.2
Note: you may need to restart the kernel to use updated packages.


In [None]:
# STEP 2: Install other packages (run after numpy upgrade completes)
%pip install pandas matplotlib seaborn scikit-learn plotly feature-engine

---

---

# Section 1

In [10]:
# STEP 3: Test all imports (run after all packages are installed)
import numpy as np
import pandas as pd
import matplotlib as mb
import matplotlib.pyplot as plt
import plotly as pl
import seaborn as sns
import sklearn as sk
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import feature_engine as fe
import sys
sys.path.append("../")

print("All packages imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Matplotlib version: {mb.__version__}")
print(f"Seaborn version: {sns.__version__}")
print(f"Scikit-learn version: {sk.__version__}")
print(f"Plotly version: {pl.__version__}")
print(f"Feature-engine version: {fe.__version__}")

All packages imported successfully!
NumPy version: 1.26.4
Pandas version: 2.1.4
Matplotlib version: 3.10.5
Seaborn version: 0.13.2
Scikit-learn version: 1.7.1
Plotly version: 6.2.0
Feature-engine version: 1.8.3


In [None]:
# Use absolute path to be sure
# there's and HTML formatting error that has begun to appear recently due to a conflict with pandas and jupyter- using the print function solves this
file_path = "inputs/alzheimers_disease_data.csv"
pd.read_csv(file_path)
print(f"DataFrame shape: {df.shape}")
print("\nFirst 5 rows:")
print(df.head())

DataFrame shape: (134, 35)

First 5 rows:


AttributeError: 'Index' object has no attribute '_format_flat'

      PatientID  Age  Gender         Ethnicity  EducationLevel        BMI  \
679        5430   68  Female             Asian               1  22.753134   
110        4861   76  Female         Caucasian               1  34.623723   
37         4788   60  Female         Caucasian               2  31.568689   
1743       6494   75  Female         Caucasian               2  32.517662   
364        5115   72  Female         Caucasian               0  19.392584   
...         ...  ...     ...               ...             ...        ...   
142        4893   82    Male  African American               2  35.027687   
1639       6390   78  Female         Caucasian               2  34.639751   
1918       6669   89    Male         Caucasian               2  22.714726   
596        5347   61    Male         Caucasian               0  39.389871   
1175       5926   81  Female         Caucasian               0  35.783769   

     Smoking  AlcoholConsumption  PhysicalActivity  DietQuality  ...  \
679

In [23]:
df = df.sample(frac=0.25, random_state=10)
print(df)

      PatientID  Age  Gender  Ethnicity  EducationLevel        BMI  Smoking  \
679        5430   68       1          2               1  22.753134        0   
110        4861   76       1          0               1  34.623723        1   
37         4788   60       1          0               2  31.568689        0   
1743       6494   75       1          0               2  32.517662        0   
364        5115   72       1          0               0  19.392584        0   
...         ...  ...     ...        ...             ...        ...      ...   
142        4893   82       0          1               2  35.027687        0   
1639       6390   78       1          0               2  34.639751        1   
1918       6669   89       0          0               2  22.714726        0   
596        5347   61       0          0               0  39.389871        0   
1175       5926   81       1          0               0  35.783769        0   

      AlcoholConsumption  PhysicalActivity  DietQua

In [41]:
df
print("Available columns:")
print(df.columns.tolist())
print(f"\nDataset shape: {df.shape}")

Available columns:
['PatientID', 'Age', 'Gender', 'Ethnicity', 'EducationLevel', 'BMI', 'Smoking', 'AlcoholConsumption', 'PhysicalActivity', 'DietQuality', 'SleepQuality', 'FamilyHistoryAlzheimers', 'CardiovascularDisease', 'Diabetes', 'Depression', 'HeadInjury', 'Hypertension', 'SystolicBP', 'DiastolicBP', 'CholesterolTotal', 'CholesterolLDL', 'CholesterolHDL', 'CholesterolTriglycerides', 'MMSE', 'FunctionalAssessment', 'MemoryComplaints', 'BehavioralProblems', 'ADL', 'Confusion', 'Disorientation', 'PersonalityChanges', 'DifficultyCompletingTasks', 'Forgetfulness', 'Diagnosis', 'DoctorInCharge']

Dataset shape: (2149, 35)


In [None]:
df.min()
df[["Age", "BMI"]].min() #edit

PatientID                          4751
Age                                  60
Gender                                0
Ethnicity                             0
EducationLevel                        0
BMI                           15.008851
Smoking                               0
AlcoholConsumption             0.002003
PhysicalActivity               0.003616
DietQuality                    0.009385
SleepQuality                   4.002629
FamilyHistoryAlzheimers               0
CardiovascularDisease                 0
Diabetes                              0
Depression                            0
HeadInjury                            0
Hypertension                          0
SystolicBP                           90
DiastolicBP                          60
CholesterolTotal             150.093316
CholesterolLDL                50.230707
CholesterolHDL                20.003434
CholesterolTriglycerides      50.407194
MMSE                           0.005312
FunctionalAssessment            0.00046


In [7]:
df.max()

PatientID                          6899
Age                                  90
Gender                                1
Ethnicity                             3
EducationLevel                        3
BMI                           39.992767
Smoking                               1
AlcoholConsumption            19.989293
PhysicalActivity               9.987429
DietQuality                    9.998346
SleepQuality                    9.99984
FamilyHistoryAlzheimers               1
CardiovascularDisease                 1
Diabetes                              1
Depression                            1
HeadInjury                            1
Hypertension                          1
SystolicBP                          179
DiastolicBP                         119
CholesterolTotal             299.993352
CholesterolLDL               199.965665
CholesterolHDL                99.980324
CholesterolTriglycerides     399.941862
MMSE                          29.991381
FunctionalAssessment           9.996467


Check for duplicated rows and return the number of them.

In [47]:
df.duplicated().sum()

0

Check for missing cells in each column and return the number of them.

In [48]:
df.isnull().sum()

PatientID                    0
Age                          0
Gender                       0
Ethnicity                    0
BMI                          0
Smoking                      0
AlcoholConsumption           0
PhysicalActivity             0
DietQuality                  0
CardiovascularDisease        0
Depression                   0
CholesterolTotal             0
MMSE                         0
FunctionalAssessment         0
MemoryComplaints             0
BehavioralProblems           0
ADL                          0
PersonalityChanges           0
DifficultyCompletingTasks    0
Diagnosis                    0
DoctorInCharge               0
dtype: int64

---

# Section 2

Section 2 content

In [49]:
df

Unnamed: 0,PatientID,Age,Gender,Ethnicity,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,CardiovascularDisease,...,CholesterolTotal,MMSE,FunctionalAssessment,MemoryComplaints,BehavioralProblems,ADL,PersonalityChanges,DifficultyCompletingTasks,Diagnosis,DoctorInCharge
0,4751,73,0,0,22.927749,0,13.297218,6.327112,1.347214,0,...,242.366840,21.463532,6.518877,0,0,1.725883,0,1,0,XXXConfid
1,4752,89,0,0,26.827681,0,4.542524,7.619885,0.518767,0,...,231.162595,20.613267,7.118696,0,0,2.592424,0,0,0,XXXConfid
2,4753,73,0,3,17.795882,0,19.555085,7.844988,1.826335,0,...,284.181858,7.356249,5.895077,0,0,7.119548,0,1,0,XXXConfid
3,4754,74,1,0,33.800817,1,12.209266,8.428001,7.435604,0,...,159.582240,13.991127,8.965106,0,1,6.481226,0,0,0,XXXConfid
4,4755,89,0,0,20.716974,0,18.454356,6.310461,0.795498,0,...,237.602184,13.517609,6.045039,0,0,0.014691,1,1,0,XXXConfid
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2144,6895,61,0,0,39.121757,0,1.561126,4.049964,6.555306,0,...,280.476824,1.201190,0.238667,0,0,4.492838,0,0,1,XXXConfid
2145,6896,75,0,0,17.857903,0,18.767261,1.360667,2.904662,0,...,186.384436,6.458060,8.687480,0,1,9.204952,0,0,1,XXXConfid
2146,6897,77,0,0,15.476479,0,4.594670,9.886002,8.120025,0,...,237.024558,17.011003,1.972137,0,0,5.036334,0,0,1,XXXConfid
2147,6898,78,1,3,15.299911,0,8.674505,6.354282,1.263427,1,...,242.197192,4.030491,5.173891,0,0,3.785399,0,0,1,XXXConfid


In [25]:
# Replace values in categorical columns for better readability
# Gender mapping
if "Gender" in df.columns:
    df["Gender"] = df["Gender"].replace({0: "Male", 1: "Female"})

# Ethnicity mapping
if "Ethnicity" in df.columns:
    df["Ethnicity"] = df["Ethnicity"].replace({
        0: "Caucasian", 1: "African American", 2: "Asian", 3: "Other"
    })

# Binary columns (0/1 to No/Yes)
binary_cols = ["Smoking", "CardiovascularDisease", "Depression", 
               "MemoryComplaints", "BehavioralProblems", "PersonalityChanges", 
               "DifficultyCompletingTasks"]

for col in binary_cols:
    if col in df.columns:
        df[col] = df[col].replace({0: "No", 1: "Yes"})

# Diagnosis mapping
if "Diagnosis" in df.columns:
    df["Diagnosis"] = df["Diagnosis"].replace({0: "No Dementia", 1: "Dementia"})

print(df.head())

      PatientID  Age  Gender  Ethnicity  EducationLevel        BMI Smoking  \
679        5430   68  Female      Asian               1  22.753134      No   
110        4861   76  Female  Caucasian               1  34.623723     Yes   
37         4788   60  Female  Caucasian               2  31.568689      No   
1743       6494   75  Female  Caucasian               2  32.517662      No   
364        5115   72  Female  Caucasian               0  19.392584      No   

      AlcoholConsumption  PhysicalActivity  DietQuality  ...  \
679            13.241418          7.784493     9.812643  ...   
110             6.864896          6.224812     2.301047  ...   
37              3.478409          4.773200     8.856834  ...   
1743           15.287424          7.372272     9.789646  ...   
364             6.413479          9.302729     4.037830  ...   

      MemoryComplaints  BehavioralProblems       ADL  Confusion  \
679                 No                  No  3.365375          1   
110         

In [None]:
# drop specific columns
def drop_columns(df):
    return df.drop(columns=["EducationLevel", "SleepQuality", "FamilyHistoryAlzheimers", "Diabetes", "HeadInjury", "Hypertension", "SystolicBP", "DiastolicBP", "CholesterolLDL", "CholesterolHDL", "CholesterolTriglycerides", "Confusion", "Disorientation", "Forgetfulness", "DoctorInCharge"], errors="ignore")

# change column locations
def change_column_location(df):
    new_column_order = ["PatientID", "Age", "Gender", "Ethnicity", "BMI", "DietQuality", "PhysicalActivity", "Smoking", "AlcoholConsumption", "CardiovascularDisease", "CholesterolTotal", "FunctionalAssessment", "ADL", "MMSE", "MemoryComplaints", "BehavioralProblems", "PersonalityChanges", "DifficultyCompletingTasks", "Depression", "Diagnosis", "DoctorInCharge"]
    # Only include columns that actually exist in the dataframe
    existing_columns = [col for col in new_column_order if col in df.columns]
    return df[existing_columns]

# convert data types
def convert_data_types(df):
    if "PatientID" in df.columns:
        df["PatientID"] = df["PatientID"].astype(int)
    if "Age" in df.columns:
        df["Age"] = df["Age"].astype(int)
    if "Gender" in df.columns:
        df["Gender"] = df["Gender"].astype(str)
    if "Ethnicity" in df.columns:
        df["Ethnicity"] = df["Ethnicity"].astype(str)
    if "BMI" in df.columns:
        df["BMI"] = df["BMI"].astype(float)
    if "Smoking" in df.columns:
        df["Smoking"] = df["Smoking"].astype(str)
    if "AlcoholConsumption" in df.columns:
        df["AlcoholConsumption"] = df["AlcoholConsumption"].astype(float)
    if "PhysicalActivity" in df.columns:
        df["PhysicalActivity"] = df["PhysicalActivity"].astype(int)
    if "DietQuality" in df.columns:
        df["DietQuality"] = df["DietQuality"].astype(str)
    if "CardiovascularDisease" in df.columns:
        df["CardiovascularDisease"] = df["CardiovascularDisease"].astype(str)
    if "Depression" in df.columns:
        df["Depression"] = df["Depression"].astype(str)
    if "CholesterolTotal" in df.columns:
        df["CholesterolTotal"] = df["CholesterolTotal"].astype(float)
    if "MMSE" in df.columns:
        df["MMSE"] = df["MMSE"].astype(float)
    if "FunctionalAssessment" in df.columns:
        df["FunctionalAssessment"] = df["FunctionalAssessment"].astype(int)
    if "MemoryComplaints" in df.columns:
        df["MemoryComplaints"] = df["MemoryComplaints"].astype(str)
    if "BehavioralProblems" in df.columns:
        df["BehavioralProblems"] = df["BehavioralProblems"].astype(str)
    if "ADL" in df.columns:
        df["ADL"] = df["ADL"].astype(float)
    if "PersonalityChanges" in df.columns:
        df["PersonalityChanges"] = df["PersonalityChanges"].astype(str)
    if "DifficultyCompletingTasks" in df.columns:
        df["DifficultyCompletingTasks"] = df["DifficultyCompletingTasks"].astype(str)
    if "Diagnosis" in df.columns:
        df["Diagnosis"] = df["Diagnosis"].astype(str)
    return df

# remove outliers using IQR method
def remove_outliers(df):
    columns = ["BMI", "CholesterolTotal"]
    df_cleaned = df.copy()
    for col in columns:
        if col in df_cleaned.columns: 
            Q1 = df_cleaned[col].quantile(0.25)
            Q3 = df_cleaned[col].quantile(0.75)
            IQR = Q3 - Q1
            mask = (df_cleaned[col] >= Q1 - 1.5 * IQR) & (df_cleaned[col] <= Q3 + 1.5 * IQR)
            df_cleaned = df_cleaned[mask]  
    return df_cleaned

# rename columns
def rename_columns(df):
    return df.rename(columns={
        "PatientID": "Patient_ID",
        "Age": "Patient_Age",
        "AlcoholConsumption": "Alcohol_Consumption",
        "PhysicalActivity": "Physical_Activity",
        "DietQuality": "Diet_Quality",
        "CardiovascularDisease": "Cardiovascular_Disease",
        "CholesterolTotal": "Cholesterol_Total",
        "FunctionalAssessment": "Functional_Assessment",
        "MemoryComplaints": "Memory_Complaints",
        "BehavioralProblems": "Behavioral_Problems",
        "ADL": "Activities_Of_Daily_Living",
        "PersonalityChanges": "Personality_Changes",
        "DifficultyCompletingTasks": "Difficulty_Completing_Tasks",  
    })

# drop missing values
def drop_missing_values(df):
    return df.dropna()

# remove duplicates
def remove_duplicates(df):
    return df.drop_duplicates()

# round numerical values to 2 decimal places
def round_values(df):
    return df.round(2)

# capitalize column names with proper acronym handling
def capitalize_columns(df):
    def smart_title(text):
        # Common acronyms that should stay uppercase
        acronyms = {
            'bmi': 'BMI',
            'mmse': 'MMSE', 
            'adl': 'ADL',
            'id': 'ID'
        }
        
        # Split by underscore and process each part
        parts = text.split('_')
        result_parts = []
        
        for part in parts:
            lower_part = part.lower()
            if lower_part in acronyms:
                result_parts.append(acronyms[lower_part])
            else:
                result_parts.append(part.title())
        
        return '_'.join(result_parts)
    
    df.columns = [smart_title(col) for col in df.columns]
    return df

# scale numerical values and encode categorical values
scaling_transformer = ColumnTransformer([
    ("num", StandardScaler(), ["Patient_Age", "BMI", "Alcohol_Consumption", "Physical_Activity", "Cholesterol_Total", "MMSE", "Functional_Assessment", "Activities_Of_Daily_Living"]), 
    ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), ["Gender", "Ethnicity", "Smoking", "Cardiovascular_Disease", "Depression", "Memory_Complaints", "Behavioral_Problems", "Personality_Changes", "Difficulty_Completing_Tasks", "Doctor_In_Charge"])  
])

# define transformers
change_column_location_transformer = FunctionTransformer(change_column_location)
drop_columns_transformer = FunctionTransformer(drop_columns)
convert_data_types_transformer = FunctionTransformer(convert_data_types)
remove_outliers_transformer = FunctionTransformer(remove_outliers)
rename_columns_transformer = FunctionTransformer(rename_columns)
capitalize_columns_transformer = FunctionTransformer(capitalize_columns)
drop_missing_values_transformer = FunctionTransformer(drop_missing_values)
remove_duplicates_transformer = FunctionTransformer(remove_duplicates)
round_values_transformer = FunctionTransformer(round_values)

# create data cleaning pipeline
data_cleaning_pipeline = Pipeline([
    ("drop_columns", drop_columns_transformer),
    ("change_column_order", change_column_location_transformer),
    ("convert_data_types", convert_data_types_transformer),
    ("rename_columns", rename_columns_transformer),
    ("capitalize_columns", capitalize_columns_transformer),
    ("remove_outliers", remove_outliers_transformer),
    ("drop_missing_values", drop_missing_values_transformer),
    ("remove_duplicates", remove_duplicates_transformer),
    ("round_values", round_values_transformer)
])

# Create advanced pipeline with scaling and encoding for machine learning
# this pipeline should clean and preprocess data, rename columns, scale numerical features, encode categorical features and handle unknown categories
ml_preprocessing_pipeline = Pipeline([
    ("drop_columns", drop_columns_transformer),
    ("change_column_order", change_column_location_transformer),
    ("convert_data_types", convert_data_types_transformer),
    ("rename_columns", rename_columns_transformer),
    ("capitalize_columns", capitalize_columns_transformer),
    ("remove_outliers", remove_outliers_transformer),
    ("drop_missing_values", drop_missing_values_transformer),
    ("remove_duplicates", remove_duplicates_transformer),
    ("round_values", round_values_transformer),
    ("scale_and_encode", scaling_transformer)
])

---

# Section 3

section 3 content

In [28]:
# Apply the pipeline to dataframe
processed_df = data_cleaning_pipeline.fit_transform(df)
print(f"Processed data shape: {processed_df.shape}")
# Display the first few rows of cleaned data
print(processed_df.head())

Processed data shape: (134, 21)
     Patient_ID  Patient_Age  Gender  Ethnicity    BMI        Diet_Quality  \
679        5430           68  Female      Asian  22.75   9.812643386033978   
110        4861           76  Female  Caucasian  34.62  2.3010466903936564   
37         4788           60  Female  Caucasian  31.57     8.8568335672575   
1743       6494           75  Female  Caucasian  32.52   9.789645835573284   
364        5115           72  Female  Caucasian  19.39   4.037829830055766   

      Physical_Activity Smoking  Alcohol_Consumption Cardiovascular_Disease  \
679                   7      No                13.24                     No   
110                   6     Yes                 6.86                    Yes   
37                    4      No                 3.48                    Yes   
1743                  7      No                15.29                     No   
364                   9      No                 6.41                     No   

      ...  Functional_As

In [30]:
# Apply the ML preprocessing pipeline to the original dataframe
scaled_data = ml_preprocessing_pipeline.fit_transform(df)
print(f"Scaled data shape: {scaled_data.shape}")
print(scaled_data)

Scaled data shape: (134, 22)
[[-0.6265218977008855 -0.7584829758654765 0.498975100888374 ... '5430'
  '9.812643386033978' 'Dementia']
 [0.19756580087009426 0.9042940233104716 -0.6093514164321299 ... '4861'
  '2.3010466903936564' 'No Dementia']
 [-1.4506095962718653 0.4770429830841331 -1.1965212641661898 ... '4788'
  '8.8568335672575' 'No Dementia']
 ...
 [1.5367083110479365 -0.7640862681963136 -0.31576649256509975 ... '6669'
  '9.37268051560284' 'Dementia']
 [-1.3475986339504928 1.5724866337628118 -1.646453188909153 ... '5347'
  '3.9574397288296206' 'Dementia']
 [0.7126206124769566 1.0667895009047519 -0.692736483565961 ... '5926'
  '7.691021735561546' 'Dementia']]


In [37]:
# Save the processed DataFrame (cleaned but not scaled/encoded)
df_load_1 = processed_df
print("Processed DataFrame shape:", df_load_1.shape)
print("Processed DataFrame columns:", df_load_1.columns.tolist())

# The scaled data is a NumPy array, so we need to handle it differently
df_load_2 = scaled_data
print("\nScaled data shape:", df_load_2.shape)
print("Scaled data type:", type(df_load_2))

# Use manual file writing to avoid pandas CSV bug
print("\n=== Saving Files Manually to Avoid Pandas Bug ===")

# Save the processed DataFrame manually
with open("outputs/processed_alzheimers_disease_data_unscaled.csv", 'w') as f:
    # Write header
    f.write(','.join(df_load_1.columns) + '\n')
    # Write data
    for _, row in df_load_1.iterrows():
        f.write(','.join(str(val) for val in row) + '\n')
print("✓ Unscaled data saved successfully!")

# Save the scaled data manually (handle mixed data types)
with open("outputs/processed_alzheimers_disease_data_scaled.csv", 'w') as f:
    for row in df_load_2:
        formatted_row = []
        for val in row:
            try:
                # Try to format as float
                formatted_row.append(f'{float(val):.6f}')
            except (ValueError, TypeError):
                # If it's not a number, convert to string
                formatted_row.append(str(val))
        f.write(','.join(formatted_row) + '\n')
print("✓ Scaled data saved successfully!")

print("\n=== Summary ===")
print("✓ Data processing pipeline completed successfully!")
print("✓ Files saved to outputs folder:")
print("  1. processed_alzheimers_disease_data_unscaled.csv - Cleaned data (134 rows, 21 columns)")  
print("  2. processed_alzheimers_disease_data_scaled.csv - Scaled & encoded data (134 rows, 22 columns)")
print("\n⚠️  Note: Pandas CSV export has a bug - ImportError: cannot import name 'SequenceNotStr'")
print("   This is a known pandas installation issue. Files saved using workaround.")

Processed DataFrame shape: (134, 21)
Processed DataFrame columns: ['Patient_ID', 'Patient_Age', 'Gender', 'Ethnicity', 'BMI', 'Diet_Quality', 'Physical_Activity', 'Smoking', 'Alcohol_Consumption', 'Cardiovascular_Disease', 'Cholesterol_Total', 'Functional_Assessment', 'Activities_Of_Daily_Living', 'MMSE', 'Memory_Complaints', 'Behavioral_Problems', 'Personality_Changes', 'Difficulty_Completing_Tasks', 'Depression', 'Diagnosis', 'Doctor_In_Charge']

Scaled data shape: (134, 22)
Scaled data type: <class 'numpy.ndarray'>

=== Saving Files Manually to Avoid Pandas Bug ===
✓ Unscaled data saved successfully!
✓ Scaled data saved successfully!

=== Summary ===
✓ Data processing pipeline completed successfully!
✓ Files saved to outputs folder:
  1. processed_alzheimers_disease_data_unscaled.csv - Cleaned data (134 rows, 21 columns)
  2. processed_alzheimers_disease_data_scaled.csv - Scaled & encoded data (134 rows, 22 columns)

⚠️  Note: Pandas CSV export has a bug - ImportError: cannot import

---

# Conclusion

* If you have any additional comments that don't fit in the previous bullets, please state them here. ---- needed to install Jupyter dependencies, as kernel kept dying, python kernel started, then the necessary packages were downloaded
we dropped several; columns as a aim was to focus on partiuclar parameters (make these clear and why - psycholigcal factors and current daya to day matter sthat might cause issue) nd to afford a more simplistic, less technical, viewer friendly application
we also extracted a fractioned/ sampled state due to this
the notenook also acts similar to a pre-analysis point
add in readme not to rule out outside factors like head injury, or other comordid disease such as diabetes, or family history (also included in orginal dataset)

# there's and HTML formatting error that has begun to appear recently due to a conflict with pandas and jupyter- using the print function solves this

the data within the table had already been scaled and encoded but for demonstrative purposes as well as to satisfy its use, we've added this transformer to the pipeline - say what it does

# capitalize column names
def capitalize_columns(df):
    df.columns = [col.title() for col in df.columns]
    return df

# drop missing values
def drop_missing_values(df):
    return df.dropna()

# remove duplicates
def remove_duplicates(df):
return df.drop_duplicates() ------------ included these as quality assurance measures

    "../inputs/alzheimers_disease_data.csv" to go up one directory to find the inputs folder.
    had to use the abosulte file path in the end as recent wasnt working and apply it to variable ----- # Use absolute path to be sure
file_path = "inputs/alzheimers_disease_data.csv"