# Notebook 02 – Data Loading and Preprocessing

This notebook prepares the Alzheimer's disease dataset so it can be used in analysis and machine learning.  
We clean and transform the data to ensure high quality and usability.

#### In this notebook we:
- Load the dataset and take a first look at it.
- Find and handle missing values.
- Remove any duplicate rows.
- Encode text categories as numbers so that models can work with them.
- Organize the data in a way that makes it ready for analysis.

The cleaned data we create here will be used in the next notebook for exploring patterns and building models.

----------------------------------------------

## Set Up and Load the Environment

To get started, we set up the working environment with helper functions that we wrote and stored in the project's `utils` folder. These helpers let us:
- Create folders to keep the project organized (such as data, models, plots, and reports)
- Apply default chart styles using Seaborn
- Load datasets and quickly explore them

We also import common libraries (pandas, NumPy, Seaborn, and Matplotlib) that we use throughout the project.
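
The folder-creation part of the setup can be sketched as follows. The real `init_environment` lives in `utils` and is not shown in this notebook, so the function name `create_project_folders` and its behavior are assumptions made for illustration; chart styling is left out.

```python
from pathlib import Path
import tempfile


def create_project_folders(root, names=("data", "models", "plots", "reports")):
    """Hypothetical helper: create each project subfolder under root if missing."""
    root = Path(root)
    created = []
    for name in names:
        folder = root / name
        # exist_ok=True makes the call safe to repeat on every notebook run
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Demo in a throwaway directory so nothing is written to the real project
with tempfile.TemporaryDirectory() as tmp:
    folders = create_project_folders(tmp)
    folder_names = sorted(p.name for p in folders)
    all_exist = all(p.is_dir() for p in folders)
    print(folder_names, all_exist)
```

Because `mkdir` is called with `exist_ok=True`, rerunning the setup cell does not fail when the folders are already present.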

In [27]:
# We are adding the parent folder to the Python path so we can import files from the "utils" folder
import sys
sys.path.append("..")

# Importing the custom helper functions from our project
from utils.setup_notebook import (
    init_environment,
    load_csv,
    print_shape,
    print_info,
    print_full_info,
    print_description,
    print_categorical_description,
    show_head
)
from utils.save_tools import save_plot, save_notebook_and_summary

# Importing commonly used libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

# Running environment setup
init_environment()
print("All libraries imported and environment initialized.")

Environment setup complete.
All libraries imported and environment initialized.


----------------------------

## Extract – Load the Dataset

In this step, we load the raw Alzheimer's dataset with `load_csv`, a custom helper imported from `utils.setup_notebook`. The dataset has not yet been cleaned or processed; this is the original version as collected.

The helper uses pandas to read the CSV file and reports basic metadata, including:

- The file path from which the data was loaded.  
- The number of rows and columns present in the dataset.  

This step ensures that we have successfully accessed the correct dataset and gives us an initial understanding of its structure and scale before we proceed with cleaning and transformation. To keep the original intact, a working copy is also created. This ensures we can freely clean, explore, and manipulate the data without altering the raw file.
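
A minimal sketch of what such a loading helper might look like is shown below. The actual `load_csv` is defined in `utils` and may differ; the name `load_csv_sketch`, the demo file, and its column names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_sketch(path):
    """Hypothetical stand-in for load_csv: read a CSV and report path and shape."""
    df = pd.read_csv(path)
    print(f"Loaded data from {path} with shape {df.shape}")
    return df


# Demo on a tiny made-up file (not the Alzheimer's dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("id,age,group\n1,70,a\n2,82,b\n3,65,a\n", encoding="utf-8")
    demo = load_csv_sketch(demo_path)
```

Printing the path and shape immediately after loading makes it easy to spot a wrong file or an unexpected number of rows before any cleaning starts.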

In [28]:
# We load the raw Alzheimer's dataset and save it as 'df_raw'
df_raw = load_csv("../data/alzheimers_disease_raw_data.csv")

# Then create a working copy to avoid modifying the raw dataset directly
df = df_raw.copy()
print("Copy of df_raw dataset created as 'df' successfully")

Loaded data from ../data/alzheimers_disease_raw_data.csv with shape (2149, 35)
Copy of df_raw dataset created as 'df' successfully
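
To illustrate why the working copy protects the raw data, here is a small sketch on a made-up DataFrame (the column names are hypothetical). `DataFrame.copy()` makes a deep copy by default, whereas plain assignment only creates a second name for the same object.

```python
import pandas as pd

# Toy stand-in for df_raw
raw = pd.DataFrame({"age": [70, 82], "group": ["a", "b"]})

work = raw.copy()   # independent copy (deep=True is the default)
alias = raw         # NOT a copy: just another name for the same object

# Changes to the working copy do not reach the original
work.loc[0, "age"] = 999
work["flag"] = True

raw_age_unchanged = raw.loc[0, "age"] == 70
raw_has_flag = "flag" in raw.columns
alias_is_raw = alias is raw
print(raw_age_unchanged, raw_has_flag, alias_is_raw)
```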


Next, we save the working copy of the dataset to the project folder. This lets us reuse it in later steps without going back to the raw file each time, and it keeps the original dataset unchanged in case we need to return to it.

In [29]:
# Save the copy of the dataset for future steps
df.to_csv("../data/alzheimers_raw_copy.csv", index=False)
print("Dataset saved to ../data/alzheimers_raw_copy.csv")

Dataset saved to ../data/alzheimers_raw_copy.csv


------------------------------

### ELT Approach: Extract → Load → Transform

In this project, we follow the ELT (Extract, Load, Transform) process to prepare the Alzheimer’s dataset for analysis and modeling.
- **Extract**: We begin by accessing the raw dataset, which is provided in CSV format.
- **Load**: Using pandas, we load the dataset into memory so it can be explored and manipulated.
- **Transform**: We then clean and organize the data by checking for duplicates, handling missing values, removing unnecessary ID columns, and grouping features by type. These steps make the dataset ready for machine learning.

This approach reflects a modern data workflow, where raw data is loaded first and then transformed in memory. Because the raw file is never modified, each transformation can be adjusted and rerun without going back to the source, which supports faster iteration and reproducible results.
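
The Transform steps listed above can be sketched on a toy DataFrame as follows. The column names (`record_id`, `age`, `smoker`) and the choice of median imputation are assumptions made for illustration, not the dataset's actual columns or the strategy used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: one duplicated row and one missing value
toy = pd.DataFrame({
    "record_id": [1, 2, 2, 3],
    "age": [70.0, 82.0, 82.0, np.nan],
    "smoker": ["yes", "no", "no", "yes"],
})

# 1. Check for and remove duplicate rows
n_duplicates = int(toy.duplicated().sum())
clean = toy.drop_duplicates().reset_index(drop=True)

# 2. Handle missing values (here: fill numeric gaps with the column median)
missing_per_column = clean.isna().sum()
clean["age"] = clean["age"].fillna(clean["age"].median())

# 3. Drop the identifier column, which carries no predictive information
clean = clean.drop(columns=["record_id"])

# 4. Group the remaining features by type
numeric_cols = clean.select_dtypes(include="number").columns.tolist()
categorical_cols = clean.select_dtypes(exclude="number").columns.tolist()

print(n_duplicates, numeric_cols, categorical_cols)
```

Keeping each step as a separate statement makes it easy to rerun or change one transformation without touching the others.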

---------------

## Alternative Approach – Load Libraries

As an alternative to the helper functions, we can import the necessary Python libraries directly:

- **pandas** is used for handling tabular data (structured data in rows and columns, like a spreadsheet). It helps us load, clean, and manage datasets.
- **numpy** is used for numerical operations, especially with arrays and mathematical functions.
- **matplotlib.pyplot** and **seaborn** are used for visualizing data through plots and charts.

After importing the libraries, we use `read_csv()` to load the dataset into a DataFrame called `dataframe`. Calling `head()` on it previews the first five rows.

This setup step is important because it gives us the tools needed for cleaning, exploring, and later modeling the dataset.

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.preprocessing import StandardScaler

print("All libraries imported and environment initialized.")

All libraries imported and environment initialized.


In [31]:
# We load the dataset
dataframe = pd.read_csv("../data/alzheimers_disease_raw_data.csv")
print("Dataset loaded successfully.")

Dataset loaded successfully.
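
A first look at a freshly loaded DataFrame can be sketched like this. The example reads a small made-up CSV from memory so it runs without the project files; the column names are hypothetical.

```python
import io

import pandas as pd

# Six made-up rows standing in for the real CSV file
csv_text = "id,age,group\n1,70,a\n2,82,b\n3,65,a\n4,77,b\n5,90,a\n6,68,b\n"
sample = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default
preview = sample.head()

print(sample.shape)
print(preview)
```

The same calls work on `dataframe` once the real file has been loaded.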


-----------------------------------------