# Notebook 02 – Data Loading and Preprocessing

This notebook prepares the Alzheimer's disease dataset so it can be used in analysis and machine learning.  
We clean and transform the data to ensure high quality and usability.

#### In this notebook we:
- Load the dataset and take a first look at it.
- Find and handle missing values.
- Remove any duplicate rows.
- Convert text categories into numbers so that models can process them (encoding).
- Organize the data in a way that makes it ready for analysis.

The cleaned data we create here will be used in the next notebook for exploring patterns and building models.

----------------------------------------------

## Setup And Load Environment

To get started, we need to set up our working environment. For this, we use some helper functions that we have created and stored in a folder called utils. These helper functions help us:
- Create folders to keep the project organized (such as data, models, plots, and reports)
- Apply default chart styles using Seaborn
- Load datasets and quickly explore them

Along with that, we also import common libraries like Pandas, NumPy, Seaborn and Matplotlib, which we will be using throughout the project.

In [27]:
# We are adding the parent folder to the Python path so we can import files from the "utils" folder
import sys
sys.path.append("..")

# Importing the custom helper functions from our project
from utils.setup_notebook import (
    init_environment,
    load_csv,
    print_shape,
    print_info,
    print_full_info,
    print_description,
    print_categorical_description,
    show_head
)
from utils.save_tools import save_plot, save_notebook_and_summary

# Importing commonly used libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

# Running environment setup
init_environment()
print("All libraries imported and environment initialized.")

Environment setup complete.
All libraries imported and environment initialized.
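The `init_environment` helper is imported from `utils.setup_notebook`, whose source is not shown in this notebook. As a rough sketch only, the folder-creation part of such a helper could look like the code below. The function name `init_environment_sketch`, the `base_dir` parameter, and the exact folder list are assumptions made for illustration, not the project's actual code, and the Seaborn styling step is left out.

```python
from pathlib import Path


def init_environment_sketch(base_dir=".", folders=("data", "models", "plots", "reports")):
    """Hypothetical stand-in for init_environment(): create the project folders."""
    base = Path(base_dir)
    created = []
    for name in folders:
        folder = base / name
        # exist_ok=True makes the call safe to repeat on every notebook run
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    print("Environment setup complete.")
    return created
```

Calling `init_environment_sketch("..")` from a notebook folder would create the four folders one level up, which matches the relative paths used in the rest of this notebook.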


----------------------------

## Extract – Load the Dataset

In this step, we load the raw Alzheimer's dataset into our project using the custom `load_csv` helper imported from our `utils.setup_notebook` module. The dataset has not yet been cleaned or processed; this is the original version as collected.
Our helper function uses pandas to read the CSV file and automatically reports basic metadata, including:

- The file path from which the data was loaded.  
- The number of rows and columns present in the dataset.  

This step ensures that we have successfully accessed the correct dataset and gives us an initial understanding of its structure and scale before we proceed with cleaning and transformation. To keep the original intact, a working copy is also created. This ensures we can freely clean, explore, and manipulate the data without altering the raw file.

In [28]:
# We load the raw Alzheimer's dataset and save it as 'df_raw'
df_raw = load_csv("../data/alzheimers_disease_raw_data.csv")

# Then create a working copy to avoid modifying the raw dataset directly
df = df_raw.copy()
print("Copy of df_raw dataset created as 'df' successfully")

Loaded data from ../data/alzheimers_disease_raw_data.csv with shape (2149, 35)
Copy of df_raw dataset created as 'df' successfully
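
The source of `load_csv` is not shown here. A minimal sketch of a helper with the same visible behavior (read the file, print the path and shape) might look like this. The name `load_csv_sketch` is hypothetical and the toy values are made up. The second half illustrates why the working copy matters: changes to the copy do not reach the raw DataFrame.

```python
import pandas as pd


def load_csv_sketch(path):
    """Hypothetical stand-in for load_csv(): read a CSV and report its path and shape."""
    data = pd.read_csv(path)
    print(f"Loaded data from {path} with shape {data.shape}")
    return data


# Tiny in-memory demonstration that .copy() protects the original
raw = pd.DataFrame({"Age": [70, 82], "Diagnosis": [0, 1]})
working = raw.copy()
working.loc[0, "Age"] = 99   # change only the working copy
print(raw.loc[0, "Age"])     # the raw frame still holds 70
```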


We then save the working copy of the dataset to the project folder. This allows us to reuse it later without going back to the raw file each time. It also keeps the original dataset unchanged in case we need to return to it.

In [29]:
# Save the copy of the dataset for future steps
df.to_csv("../data/alzheimers_raw_copy.csv", index=False)
print("Dataset saved to ../data/alzheimers_raw_copy.csv")

Dataset saved to ../data/alzheimers_raw_copy.csv


------------------------------

### ELT Approach: Extract → Load → Transform

In this project, we follow the ELT (Extract, Load, Transform) process to prepare the Alzheimer’s dataset for analysis and modeling.
- **Extract**: We begin by accessing the raw dataset, which is provided in CSV format.
- **Load**: Using pandas, we load the dataset into memory so it can be explored and manipulated.
- **Transform**: We then clean and organize the data by checking for duplicates, handling missing values, removing unnecessary ID columns, and grouping features by type. These steps make the dataset ready for machine learning.

This approach reflects a modern data workflow, where raw data is loaded first and then transformed in memory. It allows for faster iteration, flexible processing, and better reproducibility.
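
To make the Transform step concrete, here is a small sketch of three of the operations named above (dropping duplicates, checking missing values, removing ID columns). It runs on a made-up toy DataFrame rather than the real dataset, so the values are illustrative only.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "PatientID": [1, 2, 2, 3],
    "Age": [70, 82, 82, 65],
    "Smoking": [0, 1, 1, 0],
    "DoctorInCharge": ["A", "A", "A", "A"],
    "Diagnosis": [0, 1, 1, 0],
})

# Transform step 1: drop exact duplicate rows
clean = toy.drop_duplicates().reset_index(drop=True)

# Transform step 2: count missing values per column
missing_per_column = clean.isna().sum()

# Transform step 3: remove identifier columns that are not used as features
clean = clean.drop(columns=["PatientID", "DoctorInCharge"])

print(clean.shape)                    # (3, 3)
print(int(missing_per_column.sum()))  # 0
```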

---------------

## Alternative Approach – Load Libraries

Before we can work with the data, we need to import the necessary Python libraries:

- **pandas** is used for handling tabular data (structured data in rows and columns, like a spreadsheet). It helps us load, clean, and manage datasets.
- **numpy** is used for numerical operations, especially with arrays and mathematical functions.
- **matplotlib.pyplot** and **seaborn** are used for visualizing data through plots and charts.

After importing the libraries, we use `pd.read_csv()` to load the dataset into a DataFrame called `dataframe`. Calling `dataframe.head()` would then preview the first five rows.

This setup step is important because it gives us the tools needed for cleaning, exploring, and later modeling the dataset.

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.preprocessing import StandardScaler

print("All libraries imported and environment initialized.")

All libraries imported and environment initialized.


In [31]:
# We load the dataset
dataframe = pd.read_csv("../data/alzheimers_disease_raw_data.csv")
print("Dataset loaded successfully.")

Dataset loaded successfully.


-----------------------------------------

## Initial Data Inspection

Now that the dataset is loaded into the DataFrame `df`, we perform an initial inspection to understand its structure and contents. This helps us identify potential issues early and plan the cleaning steps that follow. We focus on:
- The number of rows and columns in the DataFrame.
- The data type of each column.
- The presence of missing values.
- Descriptive statistics for both numeric and categorical variables.
- A sample of the first 5 rows for a quick overview.

In [None]:
# Check the number of rows and columns in the dataframe
print_shape(df)

#### Dataset Dimensions
The dataset contains 2,149 rows and 35 columns. Each row represents one patient, and each column provides a specific clinical, lifestyle, or cognitive feature.

In [None]:
# View data types and non-null counts for each column
print_info(df)

#### Data Types and Completeness
All 35 columns are fully populated, meaning there are no missing values in any part of the dataset.
The data types are mostly numeric:

- 22 columns are stored as integers. Many of these are 0/1 flags for yes/no values, while others, such as Age, hold whole-number measurements.
- 12 columns are float values, which allow decimals.
- 1 column (DoctorInCharge) is a text field and not needed for modeling.

This structure makes the data ready for analysis without needing to fill or drop missing values.
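
The counts above can be reproduced with plain pandas, independent of the `print_info` helper. The sketch below uses a made-up toy DataFrame with one deliberately missing value so that both checks show something.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [70, 82, 65],
    "BMI": [27.5, 31.2, np.nan],   # one missing value on purpose
    "DoctorInCharge": ["A", "B", "A"],
})

# How many columns of each dtype are there?
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# How many missing values does each column have?
missing = toy.isna().sum()
print(missing)
```

On the real dataset, `df.isna().sum().sum()` returning 0 would confirm the statement that no values are missing.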

In [None]:
# View full dataset structure with memory usage and types
print_full_info(df)

#### Memory and Structure
The dataset uses about 587 KB of memory, which is very manageable for analysis in a local environment.
Each column has the correct data type, and pandas has successfully recognized the structure. This means no extra type conversions are required at this stage.
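
For reference, memory usage can also be measured directly with `DataFrame.memory_usage`. This is a sketch on toy data, and whether the figure reported above was computed with `deep=True` is not known from this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [70, 82, 65], "DoctorInCharge": ["A", "B", "A"]})

# deep=True also counts the Python strings held in text columns
bytes_per_column = toy.memory_usage(deep=True)
total_kb = bytes_per_column.sum() / 1024
print(f"{total_kb:.2f} KB")
```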

In [None]:
# Get descriptive statistics for all variables (numeric and categorical)
print_description(df)

### Key Statistics and Distributions

Looking at the data more closely helps us see patterns and differences between patients. This step is useful for figuring out which features might help us predict Alzheimer’s and which ones might need extra processing later (like adjusting the scale or changing the format).

### Key Points

- **Age**: Most patients are between 60 and 90 years old, with an average age of about 75. This makes sense since Alzheimer’s mostly affects older adults.

- **BMI (Body Mass Index)**: The average BMI is around 27.7, which falls in the overweight range. Since weight can affect both physical and brain health, BMI might be an important feature in our predictions.

- **Alcohol Consumption**: On average, patients drink about 10 units of alcohol per week. Some drink more, some less. This difference could matter when we look at lifestyle risk factors.

- **Blood Pressure (Systolic and Diastolic)**: The blood pressure numbers look normal for older adults, but there’s quite a bit of variation. That could be important since heart health is linked to brain health.

- **Cholesterol Levels**: These also vary a lot between people:
  - **Total cholesterol:** Average is about 225 mg/dL  
  - **LDL (bad cholesterol):** About 124 mg/dL  
  - **HDL (good cholesterol):** Around 59 mg/dL  
  - **Triglycerides:** These have the widest range and highest average, which might show differences in metabolism between patients.


- **MMSE Scores**: These scores, which check memory and thinking, range from 0 to 30. The average score is around 14.7, which means many patients show signs of cognitive decline.

- **Binary Medical Conditions**: Things like Diabetes, Depression, and Hypertension are shown as 0 (no) or 1 (yes). These conditions may increase the risk of Alzheimer’s and can be useful in prediction.

By reviewing the distributions and summary statistics, we get an early sense of which variables might be strong predictors, which ones are well-behaved, and whether we need to prepare the data in any special way (like removing outliers or standardizing ranges).
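
As an illustration of the two preparation steps mentioned here, the sketch below standardizes one column with a z-score and flags outliers with the common 1.5 × IQR rule. The values are invented, and the IQR rule is one possible choice, not necessarily the method used later in this project.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({"CholesterolTotal": [150.0, 200.0, 225.0, 250.0, 600.0]})
col = toy["CholesterolTotal"]

# Z-score standardization: subtract the mean, divide by the standard deviation.
# Note: pandas .std() uses the sample formula (ddof=1), while scikit-learn's
# StandardScaler uses the population formula (ddof=0).
toy["CholesterolTotal_z"] = (col - col.mean()) / col.std()

# IQR rule: flag values more than 1.5 * IQR below Q1 or above Q3
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
toy["is_outlier"] = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

print(toy)
```

Here only the value 600 is flagged, because the quartiles are 200 and 250 and the upper bound is therefore 325.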

In [None]:
# Preview the first few rows to understand how values are structured
show_head(df)

#### First 5 rows overview
**The first five rows show that:**
- All values are properly formatted.
- Column names and values are clearly labeled.
- There's good variety in the data, with no obvious errors, typos, or missing values.

## Identify Categorical Features

In [None]:
# View summary of categorical columns 
print_categorical_description(df)

This helps us understand non-numeric features (e.g., DoctorInCharge). In our case, this column holds anonymized IDs and won't be used for modeling, but it’s good practice to review these separately.
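
The `print_categorical_description` helper is not shown here. A plain-pandas approximation, assuming it summarizes the non-numeric columns, could look like this (toy values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [70, 82, 65],
    "DoctorInCharge": ["Dr_A", "Dr_A", "Dr_B"],
})

# Keep only non-numeric columns; exclude="number" avoids relying on the exact text dtype
categorical = toy.select_dtypes(exclude="number")

# For text columns, describe() reports count, unique, top (most frequent) and freq
summary = categorical.describe()
print(summary)
```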


## Understanding the Columns

Before cleaning or analyzing the data, we need to know what each column represents. This is a key part of **data exploration** in Business Intelligence. By understanding the **data types** and what the values mean, we avoid making wrong decisions. This follows the **GIGO principle**: *Garbage In, Garbage Out*. If the input data is poor or unclear, the results will also be poor, no matter how advanced the analysis.


### Column Overview

- **PatientID** – *ID* – A unique number for each patient. Not used for prediction.  
- **Age** – *Numeric* – The patient's age in years.  
- **Gender** – *Categorical* – 0 = Male, 1 = Female.  
- **Ethnicity** – *Categorical* – Example: 0 = White, 1 = Black, etc.  
- **EducationLevel** – *Ordinal* – Higher number means more education.  
- **BMI** – *Numeric* – Body Mass Index (based on height and weight). 
- **Smoking** – *Binary* – 0 = No, 1 = Yes.  
- **AlcoholConsumption** – *Numeric* – Weekly alcohol consumption in units.  
- **PhysicalActivity** – *Numeric* – How active the person is.
- **DietQuality** – *Numeric* – Higher number = healthier diet.  
- **ADL** – *Numeric* – Activities of Daily Living score, reflecting how well the patient manages daily tasks.  
- **Diagnosis** – *Target label* – 0 = No Alzheimer’s, 1 = Alzheimer’s.  

Other columns like **Confusion**, **MemoryComplaints**, and **PersonalityChanges** are binary symptoms: 0 = No, 1 = Yes.

#### Why this is important
If we do not understand what the data means, we cannot clean it or use it properly. This step helps us avoid wrong assumptions and prepares the data for meaningful analysis.

### Column Types and Unique Values

Now we check two important things about the columns:

1. **Data types** – Shows if values are stored as numbers (integers, floats) or as text.  
   For example, age should be numeric, while gender might be text or category codes.

2. **Unique values** – Tells how many different values exist in each column.  
   This helps identify which columns are categories (like gender or smoking) and which might be IDs (like PatientID), which are not useful for prediction.

These checks guide decisions on which columns to keep, transform, or remove later.

### Check Column Types
Before we clean or transform any columns, we need to check the data types to understand how each variable is stored. This helps us spot which columns are numeric, which are categorical, and whether anything needs to be converted.
This is an important part of data exploration. If we do not know what kind of data we are working with, we might handle it the wrong way.

In [None]:
# Display data types for each column to understand variable types
df.dtypes

### Output

We see that most columns in the dataset are either stored as int64 or float64, meaning they contain numerical values. This is good because numerical data can be used directly in many types of analysis and machine learning models. We also notice that one column, DoctorInCharge, is stored as an object. This usually means it contains text or categorical labels.

We interpret this as a mostly numerical dataset, which is a good starting point for further processing. We also conclude that some columns, like PatientID and DoctorInCharge, are probably identifiers and not useful as features. These will likely be removed later to avoid adding irrelevant information to our model.


## Unique Values in Columns

In [None]:
# Check the number of unique values in each column
# Helps identify categorical variables and ID-like columns
df.nunique().sort_values(ascending=False)

#### What We Learned from Unique Values: 

Understanding the number of unique values in each column helps us decide how to handle the data later on. Since the goal of this project is to analyze and predict Alzheimer's diagnoses based on patient characteristics, we need to be clear about which features are useful for that purpose.

- PatientID has a unique value for every row, which makes it an identifier. It’s not related to the diagnosis or any medical condition, so it would not help with predictions. We'll remove it during cleaning.

- Some columns, such as Gender, Smoking, and Diabetes, have only two unique values. These binary variables are usually coded as 0 and 1, where 0 typically means “no” and 1 means “yes.” They are important because they can capture risk factors or medical conditions linked to Alzheimer’s. However, even when a column has only two values, we should always check what those values mean. In the Gender column, for instance, the two codes distinguish female and male patients rather than answering a yes/no question. We will inspect value distributions during exploratory analysis to confirm their meaning.

- Columns like Age, BMI, MMSE, and CholesterolTotal have many unique values. These are continuous numeric variables, meaning they can show subtle differences between patients. This kind of data is very useful for modeling, but it often needs to be scaled so that features with large values don’t dominate the model.

- Diagnosis also has two unique values: 0 and 1. This is our target variable. Everything else in the dataset helps us try to predict this outcome.

We’re organizing the data this way because each type of variable requires different handling in preprocessing. Binary features might be used as-is, continuous features might need scaling, and ID columns should be removed entirely. By doing this upfront, we make sure our data is well-structured and meaningful, which is essential before we move on to any analysis or modeling.
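
The grouping described above can be sketched in a few lines. This is an illustration on toy data, not the project's actual code: the ID column is named explicitly, and the remaining features are split by their number of unique values.

```python
import pandas as pd

# Made-up toy rows using column names from this dataset
toy = pd.DataFrame({
    "PatientID": [1, 2, 3, 4],
    "Smoking": [0, 1, 0, 1],
    "Age": [70, 82, 65, 77],
    "Diagnosis": [0, 1, 1, 0],
})

target = "Diagnosis"
id_cols = ["PatientID"]   # identified as an identifier in the text above
n_unique = toy.nunique()

# Everything that is neither an ID nor the target is a candidate feature
features = [c for c in toy.columns if c not in id_cols and c != target]
binary_cols = [c for c in features if n_unique[c] == 2]
multi_valued_cols = [c for c in features if n_unique[c] > 2]

print(binary_cols)        # ['Smoking']
print(multi_valued_cols)  # ['Age']
```

On the real dataset, the multi-valued group would also pick up coded categories such as Ethnicity and EducationLevel, so that list still needs a manual check before any scaling is applied.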