# **Data Cleaning, Manipulation and Analysis**

---

Application of Data Analytics within the healthcare industry

## **Objectives**

The intention of this notebook was tri-fold: data cleaning, data transformation and data loading. Light analysis has also been carried out to better understand, extract and load data.

### **Inputs**

* Dataset retrieved from Kaggle (CSV file containing data regarding patients with, or potentially at risk of, Alzheimer's disease saved to inputs folder)

### **Outputs**

* Data cleaning pipeline (within this notebook)
* Machine learning pipeline (within this notebook)
* Cleaned data (CSV file extracted to outputs folder)
* Data for machine learning (txt file extracted to outputs folder)

### **Additional Comments**

* Data was extracted from Kaggle with the source citation included in the README file.
* Data was saved in its raw orginal form and then cleaned (a machine learning dataset with scaling and encoding was also created).

---
---

##### **REMINDER**: 
All notebook cells should be run top-down (you can't create a dynamic where at a given point you need to go back to a previous cell to execute a task and then return to the cell you were working on).

---
---

## **Setup Information**

---
---

#### **IMPORTANT**: 
Before running the cells below, you **MUST** restart the kernel!

**This is because:**
- Windows locks files that are currently in use.
- NumPy is loaded in the current kernel session.
- Restarting clears memory and releases file locks.

**How to restart the kernel:**
1. Click on the restart button above with the circular arrow before it
2. Confirm the restart
3. **Then** run the cells below in order

---
---

### **Change Working Directory**

* When storing the notebooks in a subfolder to run in the editor, for projects such as these, it's best practice to change the working directory. 
* We need to change the working directory from its current folder to its parent folder.

In [1]:
# Access the current directory with os.getcwd()
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\F_bee\\Documents\\vs-code\\vs-code-projects\\healthcare-and-public-health\\jupyter_notebooks'

In [2]:
# Make parent of current directory the new current directory
# Use os.path.dirname() to get parent directory
# Use os.chdir() to define new current directory
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
# Confirm new current directory
current_dir = os.getcwd()
current_dir

'c:\\Users\\F_bee\\Documents\\vs-code\\vs-code-projects\\healthcare-and-public-health'

---

### **Install Packages**

---

In [4]:
# Upgrade numpy (run after kernel restart)
%pip install --upgrade numpy

Note: you may need to restart the kernel to use updated packages.


In [5]:
# Install other packages (run after numpy upgrade)
%pip install pandas matplotlib seaborn scikit-learn plotly feature-engine

Note: you may need to restart the kernel to use updated packages.


In [6]:
# Test all imports (run after all packages installed)
import numpy as np
import pandas as pd
import matplotlib as mb
import matplotlib.pyplot as plt
import plotly as pl
import seaborn as sns
import sklearn as sk
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import feature_engine as fe

print("All packages imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Matplotlib version: {mb.__version__}")
print(f"Seaborn version: {sns.__version__}")
print(f"Scikit-learn version: {sk.__version__}")
print(f"Plotly version: {pl.__version__}")
print(f"Feature-engine version: {fe.__version__}")

All packages imported successfully!
NumPy version: 2.3.2
Pandas version: 2.3.1
Matplotlib version: 3.10.5
Seaborn version: 0.13.2
Scikit-learn version: 1.7.1
Plotly version: 6.2.0
Feature-engine version: 1.8.3


---

## **Section 1**

### **Data Extraction**
This section contains code for the loading of data.

---

Extract the dataset from the inputs folder and load it to notebook as a DataFrame.

In [7]:
df = pd.read_csv("inputs/alzheimers_disease_data.csv")
print("Data loaded successfully!")
print(f"DataFrame shape: {df.shape}")
df

Data loaded successfully!
DataFrame shape: (2149, 35)


Unnamed: 0,PatientID,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,...,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis,DoctorInCharge
0,4751,73,0,0,2,22.927749,0,13.297218,6.327112,1.347214,...,0,0,1.725883,0,0,0,1,0,0,XXXConfid
1,4752,89,0,0,0,26.827681,0,4.542524,7.619885,0.518767,...,0,0,2.592424,0,0,0,0,1,0,XXXConfid
2,4753,73,0,3,1,17.795882,0,19.555085,7.844988,1.826335,...,0,0,7.119548,0,1,0,1,0,0,XXXConfid
3,4754,74,1,0,1,33.800817,1,12.209266,8.428001,7.435604,...,0,1,6.481226,0,0,0,0,0,0,XXXConfid
4,4755,89,0,0,0,20.716974,0,18.454356,6.310461,0.795498,...,0,0,0.014691,0,0,1,1,0,0,XXXConfid
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2144,6895,61,0,0,1,39.121757,0,1.561126,4.049964,6.555306,...,0,0,4.492838,1,0,0,0,0,1,XXXConfid
2145,6896,75,0,0,2,17.857903,0,18.767261,1.360667,2.904662,...,0,1,9.204952,0,0,0,0,0,1,XXXConfid
2146,6897,77,0,0,1,15.476479,0,4.594670,9.886002,8.120025,...,0,0,5.036334,0,0,0,0,0,1,XXXConfid
2147,6898,78,1,3,1,15.299911,0,8.674505,6.354282,1.263427,...,0,0,3.785399,0,0,0,0,1,1,XXXConfid


Create a random sample of the data. Consider the first 5 rows (head) throughout for better notebook observability.

In [8]:
df = df.sample(frac=0.25, random_state=10)
print("Data loaded successfully!")
print(f"DataFrame shape: {df.shape}")
print("\nFirst 5 rows:")
df.head()

Data loaded successfully!
DataFrame shape: (537, 35)

First 5 rows:


Unnamed: 0,PatientID,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,...,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis,DoctorInCharge
613,5364,61,1,1,0,29.27054,0,14.782028,7.315484,2.976619,...,1,0,0.775005,0,1,0,0,0,1,XXXConfid
1018,5769,81,1,0,0,34.641073,0,7.383103,2.473479,5.280584,...,0,0,3.389056,0,0,0,0,0,0,XXXConfid
264,5015,81,0,1,0,22.923111,0,9.314832,8.917378,3.807813,...,0,1,8.681801,0,0,0,0,0,1,XXXConfid
1758,6509,62,0,0,2,23.587924,0,1.236318,0.666426,3.360432,...,0,0,3.983733,1,0,0,0,0,0,XXXConfid
1441,6192,80,1,3,2,23.715891,1,12.339372,5.970801,1.625098,...,0,0,2.744058,0,0,0,0,1,0,XXXConfid


---

## **Section 2**

### **Data Transformation**
This section contains functions for transformer creation, pipeline code and light analysis.

---

Check the current columns.

In [9]:
print("Data loaded successfully!")
print("Available columns:")
print(df.columns.tolist())

Data loaded successfully!
Available columns:
['PatientID', 'Age', 'Gender', 'Ethnicity', 'EducationLevel', 'BMI', 'Smoking', 'AlcoholConsumption', 'PhysicalActivity', 'DietQuality', 'SleepQuality', 'FamilyHistoryAlzheimers', 'CardiovascularDisease', 'Diabetes', 'Depression', 'HeadInjury', 'Hypertension', 'SystolicBP', 'DiastolicBP', 'CholesterolTotal', 'CholesterolLDL', 'CholesterolHDL', 'CholesterolTriglycerides', 'MMSE', 'FunctionalAssessment', 'MemoryComplaints', 'BehavioralProblems', 'ADL', 'Confusion', 'Disorientation', 'PersonalityChanges', 'DifficultyCompletingTasks', 'Forgetfulness', 'Diagnosis', 'DoctorInCharge']


Check the minimum values for numerical columns.

In [10]:
print("Data loaded successfully!")
numerical_columns = ["Age", "Gender", "Ethnicity", "EducationLevel", "BMI", "AlcoholConsumption", "PhysicalActivity", "DietQuality", "SleepQuality", "SystolicBP", "DiastolicBP", "CholesterolTotal", "CholesterolLDL", "CholesterolHDL", "CholesterolTriglycerides", "MMSE", "FunctionalAssessment"]
print("Minimum values for numerical columns:")
df[numerical_columns].min()

Data loaded successfully!
Minimum values for numerical columns:


Age                          60.000000
Gender                        0.000000
Ethnicity                     0.000000
EducationLevel                0.000000
BMI                          15.012071
AlcoholConsumption            0.010504
PhysicalActivity              0.007483
DietQuality                   0.014332
SleepQuality                  4.002629
SystolicBP                   90.000000
DiastolicBP                  60.000000
CholesterolTotal            150.192183
CholesterolLDL               50.400003
CholesterolHDL               20.366771
CholesterolTriglycerides     51.064227
MMSE                          0.018022
FunctionalAssessment          0.013211
dtype: float64

Check the maximum values for numerical columns.

In [11]:
print("Data loaded successfully!")
numerical_columns = ["Age", "Gender", "Ethnicity", "EducationLevel", "BMI", "AlcoholConsumption", "PhysicalActivity", "DietQuality", "SleepQuality", "SystolicBP", "DiastolicBP", "CholesterolTotal", "CholesterolLDL", "CholesterolHDL", "CholesterolTriglycerides", "MMSE", "FunctionalAssessment"]
print("Maximum values for numerical columns:")
df[numerical_columns].max()

Data loaded successfully!
Maximum values for numerical columns:


Age                          90.000000
Gender                        1.000000
Ethnicity                     3.000000
EducationLevel                3.000000
BMI                          39.988513
AlcoholConsumption           19.960888
PhysicalActivity              9.987429
DietQuality                   9.980281
SleepQuality                  9.993039
SystolicBP                  179.000000
DiastolicBP                 119.000000
CholesterolTotal            299.890133
CholesterolLDL              199.965665
CholesterolHDL               99.768955
CholesterolTriglycerides    399.239711
MMSE                         29.991381
FunctionalAssessment          9.992610
dtype: float64

Check for duplicates and retrieve their sum.

In [12]:
print("Data loaded successfully!")
df.duplicated().sum()

Data loaded successfully!


np.int64(0)

Check for null values and retrieve their sum.

In [13]:
print("Data loaded successfully!")
df.isnull().sum()

Data loaded successfully!


PatientID                    0
Age                          0
Gender                       0
Ethnicity                    0
EducationLevel               0
BMI                          0
Smoking                      0
AlcoholConsumption           0
PhysicalActivity             0
DietQuality                  0
SleepQuality                 0
FamilyHistoryAlzheimers      0
CardiovascularDisease        0
Diabetes                     0
Depression                   0
HeadInjury                   0
Hypertension                 0
SystolicBP                   0
DiastolicBP                  0
CholesterolTotal             0
CholesterolLDL               0
CholesterolHDL               0
CholesterolTriglycerides     0
MMSE                         0
FunctionalAssessment         0
MemoryComplaints             0
BehavioralProblems           0
ADL                          0
Confusion                    0
Disorientation               0
PersonalityChanges           0
DifficultyCompletingTasks    0
Forgetfu

Create code to populate categorical columns with integer values with their string counterparts.

In [14]:
print("Data loaded successfully!")
# Replace values in categorical columns for better readability
# Gender mapping
if "Gender" in df.columns:
    df["Gender"] = df["Gender"].replace({0: "Male", 1: "Female"})

# Ethnicity mapping
if "Ethnicity" in df.columns:
    df["Ethnicity"] = df["Ethnicity"].replace({
        0: "Caucasian", 1: "African American", 2: "Asian", 3: "Other"
    })

# Binary columns (0/1 to No/Yes)
binary_cols = ["Smoking", "CardiovascularDisease", "Depression", 
               "MemoryComplaints", "BehavioralProblems", "PersonalityChanges", 
               "DifficultyCompletingTasks"]

for col in binary_cols:
    if col in df.columns:
        df[col] = df[col].replace({0: "No", 1: "Yes"})

# Diagnosis mapping
if "Diagnosis" in df.columns:
    df["Diagnosis"] = df["Diagnosis"].replace({0: "No Dementia", 1: "Dementia"})
print(f"DataFrame shape: {df.shape}")
df.head()

Data loaded successfully!
DataFrame shape: (537, 35)


Unnamed: 0,PatientID,Age,Gender,Ethnicity,EducationLevel,BMI,Smoking,AlcoholConsumption,PhysicalActivity,DietQuality,...,MemoryComplaints,BehavioralProblems,ADL,Confusion,Disorientation,PersonalityChanges,DifficultyCompletingTasks,Forgetfulness,Diagnosis,DoctorInCharge
613,5364,61,Female,African American,0,29.27054,No,14.782028,7.315484,2.976619,...,Yes,No,0.775005,0,1,No,No,0,Dementia,XXXConfid
1018,5769,81,Female,Caucasian,0,34.641073,No,7.383103,2.473479,5.280584,...,No,No,3.389056,0,0,No,No,0,No Dementia,XXXConfid
264,5015,81,Male,African American,0,22.923111,No,9.314832,8.917378,3.807813,...,No,Yes,8.681801,0,0,No,No,0,Dementia,XXXConfid
1758,6509,62,Male,Caucasian,2,23.587924,No,1.236318,0.666426,3.360432,...,No,No,3.983733,1,0,No,No,0,No Dementia,XXXConfid
1441,6192,80,Female,Other,2,23.715891,Yes,12.339372,5.970801,1.625098,...,No,No,2.744058,0,0,No,No,1,No Dementia,XXXConfid


Create functions to load into transformers.

In [15]:
# Drop specific columns
def drop_columns(df):
    return df.drop(columns=["EducationLevel", "SleepQuality", "FamilyHistoryAlzheimers", "Diabetes", "HeadInjury", "Hypertension", "SystolicBP", "DiastolicBP", "CholesterolLDL", "CholesterolHDL", "CholesterolTriglycerides", "Confusion", "Disorientation", "Forgetfulness", "DoctorInCharge"], errors="ignore")

# Change column locations
def change_column_location(df):
    new_column_order = ["PatientID", "Age", "Gender", "Ethnicity", "BMI", "DietQuality", "PhysicalActivity", "Smoking", "AlcoholConsumption", "CardiovascularDisease", "CholesterolTotal", "FunctionalAssessment", "ADL", "MMSE", "MemoryComplaints", "BehavioralProblems", "PersonalityChanges", "DifficultyCompletingTasks", "Depression", "Diagnosis"]
    # Only include columns that actually exist in the dataframe
    existing_columns = [col for col in new_column_order if col in df.columns]
    return df[existing_columns]

# Convert data types
def convert_data_types(df):
    if "PatientID" in df.columns:
        df["PatientID"] = df["PatientID"].astype(int)
    if "Age" in df.columns:
        df["Age"] = df["Age"].astype(int)
    if "Gender" in df.columns:
        df["Gender"] = df["Gender"].astype(str)
    if "Ethnicity" in df.columns:
        df["Ethnicity"] = df["Ethnicity"].astype(str)
    if "BMI" in df.columns:
        df["BMI"] = df["BMI"].astype(float)
    if "Smoking" in df.columns:
        df["Smoking"] = df["Smoking"].astype(str)
    if "AlcoholConsumption" in df.columns:
        df["AlcoholConsumption"] = df["AlcoholConsumption"].astype(float)
    if "PhysicalActivity" in df.columns:
        df["PhysicalActivity"] = df["PhysicalActivity"].astype(int)
    if "DietQuality" in df.columns:
        df["DietQuality"] = df["DietQuality"].astype(str)
    if "CardiovascularDisease" in df.columns:
        df["CardiovascularDisease"] = df["CardiovascularDisease"].astype(str)
    if "Depression" in df.columns:
        df["Depression"] = df["Depression"].astype(str)
    if "CholesterolTotal" in df.columns:
        df["CholesterolTotal"] = df["CholesterolTotal"].astype(float)
    if "MMSE" in df.columns:
        df["MMSE"] = df["MMSE"].astype(float)
    if "FunctionalAssessment" in df.columns:
        df["FunctionalAssessment"] = df["FunctionalAssessment"].astype(int)
    if "MemoryComplaints" in df.columns:
        df["MemoryComplaints"] = df["MemoryComplaints"].astype(str)
    if "BehavioralProblems" in df.columns:
        df["BehavioralProblems"] = df["BehavioralProblems"].astype(str)
    if "ADL" in df.columns:
        df["ADL"] = df["ADL"].astype(float)
    if "PersonalityChanges" in df.columns:
        df["PersonalityChanges"] = df["PersonalityChanges"].astype(str)
    if "DifficultyCompletingTasks" in df.columns:
        df["DifficultyCompletingTasks"] = df["DifficultyCompletingTasks"].astype(str)
    if "Diagnosis" in df.columns:
        df["Diagnosis"] = df["Diagnosis"].astype(str)
    return df

# Remove outliers using IQR method
def remove_outliers(df):
    columns = ["BMI", "CholesterolTotal"]
    df_cleaned = df.copy()
    for col in columns:
        if col in df_cleaned.columns: 
            Q1 = df_cleaned[col].quantile(0.25)
            Q3 = df_cleaned[col].quantile(0.75)
            IQR = Q3 - Q1
            mask = (df_cleaned[col] >= Q1 - 1.5 * IQR) & (df_cleaned[col] <= Q3 + 1.5 * IQR)
            df_cleaned = df_cleaned[mask]  
    return df_cleaned

# Scale numerical values and encode categorical values
scaling_transformer = ColumnTransformer([
    ("num", StandardScaler(), ["Patient_Age", "BMI", "Alcohol_Consumption", "Physical_Activity", "Cholesterol_Total", "MMSE", "Functional_Assessment", "Activities_Of_Daily_Living"]), 
    ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), ["Gender", "Ethnicity", "Smoking", "Cardiovascular_Disease", "Depression", "Memory_Complaints", "Behavioral_Problems", "Personality_Changes", "Difficulty_Completing_Tasks"])  
])

# Rename columns
def rename_columns(df):
    return df.rename(columns={
        "PatientID": "Patient_ID",
        "Age": "Patient_Age",
        "AlcoholConsumption": "Alcohol_Consumption",
        "PhysicalActivity": "Physical_Activity",
        "DietQuality": "Diet_Quality",
        "CardiovascularDisease": "Cardiovascular_Disease",
        "CholesterolTotal": "Cholesterol_Total",
        "FunctionalAssessment": "Functional_Assessment",
        "MemoryComplaints": "Memory_Complaints",
        "BehavioralProblems": "Behavioral_Problems",
        "ADL": "Activities_Of_Daily_Living",
        "PersonalityChanges": "Personality_Changes",
        "DifficultyCompletingTasks": "Difficulty_Completing_Tasks",  
    })

# Drop missing values
def drop_missing_values(df):
    return df.dropna()

# Remove duplicates
def remove_duplicates(df):
    return df.drop_duplicates()

# Round numerical values to 2 decimal places
def round_values(df):
    return df.round(2)

# Capitalize column names with proper acronym handling
def capitalize_columns(df):
    def smart_title(text):
        # Common acronyms that should stay uppercase
        acronyms = {
            "bmi": "BMI",
            "mmse": "MMSE", 
            "adl": "ADL",
            "id": "ID"
        }
        
        # Split by underscore and process each part
        parts = text.split("_")
        result_parts = []
        
        for part in parts:
            lower_part = part.lower()
            if lower_part in acronyms:
                result_parts.append(acronyms[lower_part])
            else:
                result_parts.append(part.title())
        
        return "_".join(result_parts)
    
    df.columns = [smart_title(col) for col in df.columns]
    return df

Create the transformers.

In [16]:
# Define transformers
change_column_location_transformer = FunctionTransformer(change_column_location)
drop_columns_transformer = FunctionTransformer(drop_columns)
convert_data_types_transformer = FunctionTransformer(convert_data_types)
remove_outliers_transformer = FunctionTransformer(remove_outliers)
rename_columns_transformer = FunctionTransformer(rename_columns)
capitalize_columns_transformer = FunctionTransformer(capitalize_columns)
drop_missing_values_transformer = FunctionTransformer(drop_missing_values)
remove_duplicates_transformer = FunctionTransformer(remove_duplicates)
round_values_transformer = FunctionTransformer(round_values)

Create the pipeline.

In [17]:
# Create data cleaning pipeline
data_cleaning_pipeline = Pipeline([
    ("drop_columns", drop_columns_transformer),
    ("change_column_order", change_column_location_transformer),
    ("convert_data_types", convert_data_types_transformer),
    ("rename_columns", rename_columns_transformer),
    ("capitalize_columns", capitalize_columns_transformer),
    ("remove_outliers", remove_outliers_transformer),
    ("drop_missing_values", drop_missing_values_transformer),
    ("remove_duplicates", remove_duplicates_transformer),
    ("round_values", round_values_transformer)
])

Create advanced machine learning pipeline.

In [18]:
# Create advanced pipeline with scaling and encoding for machine learning
# This pipeline should clean and preprocess data, rename columns, scale numerical features, encode categorical features and handle unknown categories
data_cleaning_with_ml_pipeline = Pipeline([
    ("drop_columns", drop_columns_transformer),
    ("change_column_order", change_column_location_transformer),
    ("convert_data_types", convert_data_types_transformer),
    ("rename_columns", rename_columns_transformer),
    ("capitalize_columns", capitalize_columns_transformer),
    ("remove_outliers", remove_outliers_transformer),
    ("drop_missing_values", drop_missing_values_transformer),
    ("remove_duplicates", remove_duplicates_transformer),
    ("round_values", round_values_transformer),  
    ("scale_and_encode", scaling_transformer)
])

---

## **Section 3**

### **Data Loading** 
In this section, we fit both pipelines to two separate instances of the same dataframe, allowing for the transformation process to take place, and the creation of new datasets due for loading as new, cleaned CSV and txt documents.

---

Fit the pipeline to the DataFrame.

In [19]:
# Apply the pipeline to original dataframe
processed_df = data_cleaning_pipeline.fit_transform(df)
print("Data loaded successfully!")
print(f"Processed data shape: {processed_df.shape}")
print(processed_df.head())

Data loaded successfully!
Processed data shape: (537, 20)
      Patient_ID  Patient_Age  Gender         Ethnicity    BMI  \
613         5364           61  Female  African American  29.27   
1018        5769           81  Female         Caucasian  34.64   
264         5015           81    Male  African American  22.92   
1758        6509           62    Male         Caucasian  23.59   
1441        6192           80  Female             Other  23.72   

            Diet_Quality  Physical_Activity Smoking  Alcohol_Consumption  \
613    2.976618872327678                  7      No                14.78   
1018   5.280583737322621                  2      No                 7.38   
264    3.807813179139379                  8      No                 9.31   
1758   3.360431588390945                  0      No                 1.24   
1441  1.6250982585740548                  5     Yes                12.34   

     Cardiovascular_Disease  Cholesterol_Total  Functional_Assessment  \
613            

Check the current column list after fitting pipeline.

In [20]:
print("Data loaded successfully!")
print("Processed DataFrame columns:", processed_df.columns.tolist())

Data loaded successfully!
Processed DataFrame columns: ['Patient_ID', 'Patient_Age', 'Gender', 'Ethnicity', 'BMI', 'Diet_Quality', 'Physical_Activity', 'Smoking', 'Alcohol_Consumption', 'Cardiovascular_Disease', 'Cholesterol_Total', 'Functional_Assessment', 'Activities_Of_Daily_Living', 'MMSE', 'Memory_Complaints', 'Behavioral_Problems', 'Personality_Changes', 'Difficulty_Completing_Tasks', 'Depression', 'Diagnosis']


fit the machine learning pipeline.

In [21]:
# Apply the ML pipeline to original dataframe
scaled_encoded_df = data_cleaning_with_ml_pipeline.fit_transform(df)
print("Data loaded successfully!")
print(f"Scaled data shape: {scaled_encoded_df.shape}")
print(scaled_encoded_df)

Data loaded successfully!
Scaled data shape: (537, 19)
[[-1.48929421  0.22219164  0.81923202 ...  0.          0.
   0.        ]
 [ 0.6818772   0.9683131  -0.47266499 ...  0.          0.
   0.        ]
 [ 0.6818772  -0.66009352 -0.13572428 ...  1.          0.
   0.        ]
 ...
 [ 1.65890434  0.46534109 -0.21428559 ...  0.          1.
   0.        ]
 [-0.51226707  0.46256224 -1.54459034 ...  0.          0.
   0.        ]
 [-1.05505993 -0.51420385  1.25568371 ...  0.          0.
   0.        ]]


Load both previously created dataframes to separate CSV files.

In [None]:
# Save the processed datasets
processed_df.to_csv("outputs/processed_alzheimers_disease_data_unscaled_and_unencoded.csv", index=False)
np.savetxt("outputs/processed_alzheimers_disease_data_scaled_and_encoded.csv", 
           scaled_encoded_df, delimiter=",", fmt="%.6f")
print("Files saved to outputs folder!")

---

## **Section 4**

In [None]:
section 4
notes

---

Data Analysis Core Concepts
Mean
The mean, which may also be considered as the average, is a measure in statistics that notes data centricity and as such relays the most commonly occuring value within a given set of figures. When considering a set of numbers, its defintion relates to the most "typical" value within that set. With calculating the mean, we add all figures together, and divide this by the total count of these numeric figures. The mean is important in data anlysis as it allows for the summarization of the typical value or tendency of a dataset. It can allow for understanding of central tendency.
Median 
The basic definition of the mean in statistics is the central value that lies between two extremes. The median is calculated simply by choosing the middle number with an odd set, and by finding the mean of two of the central numbers with an even set. It splits data in two, with either pool lying on the side, towards, or away from either extreme, depending on direction to or from the median. The mean is important to data analysis as it allows for the central deduction of a dataset, lying less prone to extremes, in comparison to the mean. It may be better able to deduce central tendency in some instances, where compared to the mean.
Standard Deviation
The standard deviation is a statistic that measures the variation values about the mean of a variable (on either side). Low or small standard deviation indicates that values fluctuate more closely to the mean, whereas high or large standard deviation notes that values lie more dispered from the mean. To find it, we subtract the mean from the data points and then square these differences and find the mean of this. This is the variance. We then find the square root of this. Standard deviation is important in data analysis as it helps us better understand the spread of data about a mean, allowing for findings of consistency or variability.
Hypothesis Testing
Hypothesis testing in statistics is where we employ inference to support or reject a particular hypothesis. When carrying these out, we may draw up hypotheses, null (default asumption) and alternative (contradictory to null hypothesis), and use the following method: deduce a value of rejection towards the null hypothesis (sginificance level) and calculate the value of a chosen test statistic (z-test, t-test etc.). Where the attained value lies within a "critical" region, we may reject the null hypothesis.
Basic Probability
Basic porbability defines a measure of change, where 0 (0%) is of no likelihood of occurence and 1 (100%) is of definite likelihood. We find the probability by dividing all favourable outcomes by all possible outcomes. In data analytics, probability can allow for the quantification of uncertainty and as such aid in assessing risks and help make predictions.

Call the processed DataFrame.

In [64]:
processed_df.head()

Unnamed: 0,Patient_Age,Gender,Ethnicity,BMI,Diet_Quality,Physical_Activity,Smoking,Alcohol_Consumption,Cardiovascular_Disease,Cholesterol_Total,Functional_Assessment,Activities_Of_Daily_Living,MMSE,Memory_Complaints,Behavioral_Problems,Personality_Changes,Difficulty_Completing_Tasks,Depression,Diagnosis
613,61,Female,African American,29.27,2.976618872327678,7,No,14.78,Yes,172.68,8,0.78,4.88,Yes,No,No,No,No,Dementia
1018,81,Female,Caucasian,34.64,5.280583737322621,2,No,7.38,No,264.95,7,3.39,9.99,No,No,No,No,No,No Dementia
264,81,Male,African American,22.92,3.807813179139379,8,No,9.31,No,283.13,4,8.68,1.11,No,Yes,No,No,No,Dementia
1758,62,Male,Caucasian,23.59,3.360431588390945,0,No,1.24,Yes,202.64,5,3.98,18.39,No,No,No,No,No,No Dementia
1441,80,Female,Other,23.72,1.6250982585740548,5,Yes,12.34,No,264.4,7,2.74,24.1,No,No,No,No,Yes,No Dementia


Calculate key statistical measures for numerical variables to understand data distribution and central tendencies.

In [63]:
print("From this summary, there are a few insights we can infer, such as a relatively high BMI")

cols_list = ["Patient_Age", "BMI", "Physical_Activity", "Alcohol_Consumption", "Cholesterol_Total", "Functional_Assessment", "Activities_Of_Daily_Living", "MMSE"]

# Print a title and a border
print("\n" + "=" * 70)
print("COMPREHENSIVE STATISTICAL SUMMARY")
print("=" * 70)

# Create a summary dataframe for better visualisation
stats_summary = pd.DataFrame({
    'Mean': means,
    'Median': medians,
    'Variance': variances,
    'Std_Deviation': std_devs,
    'Min': processed_df[cols_list].min(),
    'Max': processed_df[cols_list].max(),
    'Range': processed_df[cols_list].max() - processed_df[cols_list].min()
}).round(3)

print("\nStatistical Summary for Numerical Variables:")
print("-" * 70)
print(stats_summary)

From this summary, there are a few insights we can infer, such as a relatively high BMI

COMPREHENSIVE STATISTICAL SUMMARY

Statistical Summary for Numerical Variables:
----------------------------------------------------------------------
                                Mean   Median    Variance  Std_Deviation  \
Activities_Of_Daily_Living     5.038     5.08       8.971          2.995   
Alcohol_Consumption           10.087    10.14      32.871          5.733   
BMI                           27.671    27.38      51.897          7.204   
Cholesterol_Total            224.557   224.53    1861.925         43.150   
Functional_Assessment          4.618     5.00       8.296          2.880   
MMSE                          14.724    14.39      76.333          8.737   
Patient_Age                   74.719    75.00      85.012          9.220   
Patient_ID                  5830.492  5812.00  410116.303        640.403   
Physical_Activity              4.575     5.00       8.260          2.874   


In [47]:
cols_list = ["Patient_Age", "BMI", "Physical_Activity", "Alcohol_Consumption", "Cholesterol_Total", "Functional_Assessment", "Activities_Of_Daily_Living", "MMSE"]
cols = [col for col in cols_list if col in processed_df.columns]
print(cols)

['Patient_Age', 'BMI', 'Physical_Activity', 'Alcohol_Consumption', 'Cholesterol_Total', 'Functional_Assessment', 'Activities_Of_Daily_Living', 'MMSE']


---

Section 5
visualisations and why I choose these

Call the processed DataFrame.

In [23]:
processed_df.head()

Unnamed: 0,Patient_ID,Patient_Age,Gender,Ethnicity,BMI,Diet_Quality,Physical_Activity,Smoking,Alcohol_Consumption,Cardiovascular_Disease,Cholesterol_Total,Functional_Assessment,Activities_Of_Daily_Living,MMSE,Memory_Complaints,Behavioral_Problems,Personality_Changes,Difficulty_Completing_Tasks,Depression,Diagnosis
613,5364,61,Female,African American,29.27,2.976618872327678,7,No,14.78,Yes,172.68,8,0.78,4.88,Yes,No,No,No,No,Dementia
1018,5769,81,Female,Caucasian,34.64,5.280583737322621,2,No,7.38,No,264.95,7,3.39,9.99,No,No,No,No,No,No Dementia
264,5015,81,Male,African American,22.92,3.807813179139379,8,No,9.31,No,283.13,4,8.68,1.11,No,Yes,No,No,No,Dementia
1758,6509,62,Male,Caucasian,23.59,3.360431588390945,0,No,1.24,Yes,202.64,5,3.98,18.39,No,No,No,No,No,No Dementia
1441,6192,80,Female,Other,23.72,1.6250982585740548,5,Yes,12.34,No,264.4,7,2.74,24.1,No,No,No,No,Yes,No Dementia



COMPREHENSIVE STATISTICAL SUMMARY

Statistical Summary for All Numerical Variables:
----------------------------------------------------------------------
                                Mean   Median    Variance  Std_Deviation  \
Patient_ID                  5830.492  5812.00  410116.303        640.403   
Patient_Age                   74.719    75.00      85.012          9.220   
BMI                           27.671    27.38      51.897          7.204   
Physical_Activity              4.575     5.00       8.260          2.874   
Alcohol_Consumption           10.087    10.14      32.871          5.733   
Cholesterol_Total            224.557   224.53    1861.925         43.150   
Functional_Assessment          4.618     5.00       8.296          2.880   
Activities_Of_Daily_Living     5.038     5.08       8.971          2.995   
MMSE                          14.724    14.39      76.333          8.737   

                                Min      Max    Range  
Patient_ID                 

---

Matplotlib

Seaborn

Plotly

methodology
data collection, analysis and interpretation --- why specific research methodologies ---- experimental and observational for instance 
and certain data analysis techniques were chosen for project goals

xgboost, altair and imbalancelearn - check requirements folder

## **Conclusion**

---

ai -- document some eamples where Ai helped
its use in addressing domain specific challengs  --- how has ai helped with ideation, assitance with streamlit integrations

### **Notes** 
**Method Decisions**
- Created the mapping and binary columns as the original data was populated with numerical values for all columns.
- Dropped several columns as aim was to focus on particular parameters (health and lifestyle) and to afford a more simplistic, less technical, viewer/user friendly application.
- Also extracted a fractioned/sampled DataFrame from the original at a random state for analysis purposes.

**Challenges Faced**
- Following package conflicts with packages like NumPy and Pandas the installation block was added as a precautionary measure.
- Needed to install Jupyter dependencies within the notebook, as kernel kept dying; Python kernel was restarted, then the necessary packages were downloaded.
- Pandas faced issues such as import errors and HTML errors; this was resolved via a Pandas update as well as using the print function. 
- can consder more datasets ---eidt
- different techniques such as braoder range of visualisations
-dataset size issue

**Further Considerations**
- Consider not to rule out further factors, such head injury, other potential comordid diseases such as diabetes, or family history (these were included in the orginal dataset).
- Capitalize, drop missing values and remove duplicates were added for quality assurance purposes.



In [None]:
Reflection
ethical considerations - data privacy, bias and fairness
legal and social implications of data handling and findings -- ie gdpr