# Notebook 02 – Data Loading and Preprocessing

This notebook prepares the Alzheimer's disease dataset for analysis and machine learning. In it we:
- Load the dataset and take a first look at it
- Find and handle missing values
- Remove any duplicate rows
- Convert text categories into numbers so that models can process them
- Organize the data so it is ready for analysis

The cleaned data we create here will be used in the next notebook for exploring patterns and building models.

----------------------------------------------

## Setup And Load Environment

To get started, we set up our working environment using helper functions that we created and stored in the `utils` folder. These helper functions let us:
- Create folders to keep the project organized (such as data, models, plots, and reports)
- Apply default chart styles using Seaborn
- Load datasets and quickly explore them

We also import common libraries such as Pandas, NumPy, Seaborn, and Matplotlib, which we use throughout the project.

In [6]:
# Add the parent folder to the Python path so we can import files from the "utils" folder
import sys
sys.path.append("..")

# Import custom helper functions from our project
from utils.setup_notebook import (
    init_environment,
    load_csv,
    print_shape,
    print_info,
    print_full_info,
    print_description,
    show_head
)
from utils.save_tools import save_plot, save_notebook_and_summary

# Import commonly used libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

# Run the setup function to create folders and apply default styles
init_environment()

Environment setup complete.
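
The source of `init_environment()` is not shown in this notebook, so the following is only a minimal sketch of what such a helper might do, based on the description above. The function name `init_environment_sketch`, the `base_dir` default, and the `whitegrid` style are assumptions made for illustration.

```python
from pathlib import Path


def init_environment_sketch(base_dir=".",
                            folders=("data", "models", "plots", "reports"),
                            apply_style=True):
    """Hypothetical stand-in for init_environment(); the real helper may differ."""
    base = Path(base_dir)
    created = []
    for name in folders:
        folder = base / name
        # exist_ok=True makes the call safe to repeat on every notebook run
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)

    if apply_style:
        # Imported lazily so the folder logic also works without Seaborn installed
        import seaborn as sns
        sns.set_theme(style="whitegrid")  # the style choice is an assumption

    print("Environment setup complete.")
    return created
```

Returning the list of folders makes it easy to confirm which directories exist after setup.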


----------------------------

## Extract – Load the Dataset

In this step, we load the raw Alzheimer's dataset using the custom `load_csv` helper imported from `utils.setup_notebook`. The dataset has not yet been cleaned or processed; this is the original version as collected.
The helper uses pandas to read the CSV file and reports basic metadata, including:

- The file path from which the data was loaded  
- The number of rows and columns present in the dataset  

This step ensures that we have successfully accessed the correct dataset and gives us an initial understanding of its structure and scale before we proceed with cleaning and transformation. To keep the original intact, a working copy is also created. This ensures we can freely clean, explore, and manipulate the data without altering the raw file.

In [50]:
# Load the raw Alzheimer's dataset and save as variable 'df_raw'
df_raw = load_csv("../data/alzheimers_disease_raw_data.csv")

# Create a working copy to avoid modifying the raw dataset directly
df = df_raw.copy()
print("Copy of df_raw dataset created as 'df' successfully")

Loaded data from ../data/alzheimers_disease_raw_data.csv with shape (2149, 35)
Copy of df_raw dataset created as 'df' successfully
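
For readers without access to the `utils` folder, here is a rough sketch of a loader that behaves like the description above, followed by a check that a working copy really is independent of the raw data. `load_csv_sketch` and the toy values are hypothetical; the actual `load_csv` helper may do more.

```python
import pandas as pd


def load_csv_sketch(path):
    """Hypothetical version of load_csv(): read a CSV and report its path and shape."""
    data = pd.read_csv(path)
    print(f"Loaded data from {path} with shape {data.shape}")
    return data


# Toy rows (made up for illustration) to show that a copy protects the raw data
raw = pd.DataFrame({"Age": [73, 89], "Diagnosis": [0, 1]})
work = raw.copy()      # DataFrame.copy() makes a deep copy by default
work.loc[0, "Age"] = 0  # edit the working copy only

print(raw.loc[0, "Age"])   # value in the raw frame
print(work.loc[0, "Age"])  # value in the working copy
```

Because the copy is deep, changes made to `work` do not reach `raw`.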


---------------

## Alternative Approach - Load libraries

Before working with the data, we need the necessary Python libraries, which were already imported during setup:

- **pandas** is used for handling tabular data
- **numpy** helps with numerical operations
- **matplotlib.pyplot** and **seaborn** are used for visualizing data

As an alternative to the `load_csv` helper, we call `read_csv()` directly to load the dataset into a DataFrame called `dataframe`. The first 5 rows are previewed later, during inspection. This setup gives us the tools to clean, explore, and later model the data.

In [20]:
# We load the dataset
dataframe = pd.read_csv("../data/alzheimers_disease_raw_data.csv")
print("Dataset loaded successfully.")

Dataset loaded successfully.


-----------------------------------------

## Initial Data Inspection

Now that the dataset is loaded, we begin by exploring the shape, structure, and contents of the dataset to guide preprocessing and modeling decisions.

We will focus on:
- How many rows and columns are present
- Which data types each column contains
- The presence of missing values
- Descriptive statistics for both numeric and categorical variables
- A sample of the first few records in the dataframe
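
The checks listed above can be sketched with plain pandas calls. The helper functions used in this notebook presumably wrap similar calls, but that is an assumption; `inspect_dataframe` and the toy data below are made up for illustration.

```python
import pandas as pd


def inspect_dataframe(data, n_rows=5):
    """Print the basic views listed above using plain pandas calls."""
    rows, cols = data.shape
    print(f"Rows: {rows}, Columns: {cols}")

    # Data types and non-null counts, one line per column
    data.info(verbose=True)

    # Number of missing values in each column
    missing = data.isna().sum()
    print(missing)

    # include="all" adds unique/top/freq for non-numeric columns;
    # .T puts one variable per row
    summary = data.describe(include="all").T
    print(summary)

    print(data.head(n_rows))
    return {"shape": (rows, cols), "missing": missing, "summary": summary}


# Toy data (not taken from the real dataset)
toy = pd.DataFrame({
    "Age": [73, 89, 74],
    "BMI": [22.9, 26.8, None],
    "DoctorInCharge": ["A", "A", "B"],
})
result = inspect_dataframe(toy)
```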

Before inspecting, we save our working copy of the dataset to the project folder. This allows us to reuse it later without having to reload or reprocess the raw data each time. It also keeps the original dataset unchanged in case we need to go back to it.

In [55]:
# Save the copy of the dataset for future steps
df.to_csv("../data/alzheimers_raw_copy.csv", index=False)
print("Dataset saved to ../data/alzheimers_raw_copy.csv")

Dataset saved to ../data/alzheimers_raw_copy.csv
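
As a small illustration of why `index=False` is passed when saving, the sketch below writes a toy DataFrame (made-up values, temporary file names) with and without the index and reads both files back. Without `index=False`, pandas writes the row index as an extra unnamed column, which comes back as `Unnamed: 0` on reload.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data, not taken from the real dataset
toy = pd.DataFrame({"PatientID": [1, 2], "BMI": [22.5, 31.25], "Doctor": ["a", "b"]})

with tempfile.TemporaryDirectory() as tmp:
    clean_path = Path(tmp) / "toy_copy.csv"
    indexed_path = Path(tmp) / "toy_copy_with_index.csv"

    toy.to_csv(clean_path, index=False)  # same option as in the cell above
    toy.to_csv(indexed_path)             # default: the row index is written too

    reloaded = pd.read_csv(clean_path)
    reloaded_with_index = pd.read_csv(indexed_path)

# Raises an AssertionError if the round trip changed the data
pd.testing.assert_frame_equal(toy, reloaded)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```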


We perform an initial inspection to understand the structure, completeness, and scale of the data.


In [23]:
# Check the number of rows and columns
print_shape(dataframe)

----- Dataset Shape -----
Rows: 2149, Columns: 35


In [25]:
# View data types and non-null counts
print_info(dataframe)


----- Data Types and Non-Null Counts -----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2149 entries, 0 to 2148
Columns: 35 entries, PatientID to DoctorInCharge
dtypes: float64(12), int64(22), object(1)
memory usage: 587.7+ KB


In [27]:
# View full technical summary
print_full_info(dataframe)


----- Full Dataset Info -----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2149 entries, 0 to 2148
Data columns (total 35 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   PatientID                  2149 non-null   int64  
 1   Age                        2149 non-null   int64  
 2   Gender                     2149 non-null   int64  
 3   Ethnicity                  2149 non-null   int64  
 4   EducationLevel             2149 non-null   int64  
 5   BMI                        2149 non-null   float64
 6   Smoking                    2149 non-null   int64  
 7   AlcoholConsumption         2149 non-null   float64
 8   PhysicalActivity           2149 non-null   float64
 9   DietQuality                2149 non-null   float64
 10  SleepQuality               2149 non-null   float64
 11  FamilyHistoryAlzheimers    2149 non-null   int64  
 12  CardiovascularDisease      2149 non-null   int64  
 13  Diabetes         

In [29]:
# View descriptive statistics
print_description(dataframe)


----- Statistical Summary -----
This summary includes count, mean, std, min, max, and percentiles.



| Column | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PatientID | 2149.0 | | | | 5825.0 | 620.507185 | 4751.0 | 5288.0 | 5825.0 | 6362.0 | 6899.0 |
| Age | 2149.0 | | | | 74.908795 | 8.990221 | 60.0 | 67.0 | 75.0 | 83.0 | 90.0 |
| Gender | 2149.0 | | | | 0.506282 | 0.500077 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| Ethnicity | 2149.0 | | | | 0.697534 | 0.996128 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| EducationLevel | 2149.0 | | | | 1.286645 | 0.904527 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| BMI | 2149.0 | | | | 27.655697 | 7.217438 | 15.008851 | 21.611408 | 27.823924 | 33.869778 | 39.992767 |
| Smoking | 2149.0 | | | | 0.288506 | 0.453173 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| AlcoholConsumption | 2149.0 | | | | 10.039442 | 5.75791 | 0.002003 | 5.13981 | 9.934412 | 15.157931 | 19.989293 |
| PhysicalActivity | 2149.0 | | | | 4.920202 | 2.857191 | 0.003616 | 2.570626 | 4.766424 | 7.427899 | 9.987429 |
| DietQuality | 2149.0 | | | | 4.993138 | 2.909055 | 0.009385 | 2.458455 | 5.076087 | 7.558625 | 9.998346 |


In [31]:
# View first few rows
show_head(dataframe)


----- First 5 Rows -----


| | PatientID | Age | Gender | Ethnicity | EducationLevel | BMI | Smoking | AlcoholConsumption | PhysicalActivity | DietQuality | SleepQuality | FamilyHistoryAlzheimers | CardiovascularDisease | Diabetes | Depression | HeadInjury | Hypertension | SystolicBP | DiastolicBP | CholesterolTotal | CholesterolLDL | CholesterolHDL | CholesterolTriglycerides | MMSE | FunctionalAssessment | MemoryComplaints | BehavioralProblems | ADL | Confusion | Disorientation | PersonalityChanges | DifficultyCompletingTasks | Forgetfulness | Diagnosis | DoctorInCharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4751 | 73 | 0 | 0 | 2 | 22.927749 | 0 | 13.297218 | 6.327112 | 1.347214 | 9.025679 | 0 | 0 | 1 | 1 | 0 | 0 | 142 | 72 | 242.36684 | 56.150897 | 33.682563 | 162.189143 | 21.463532 | 6.518877 | 0 | 0 | 1.725883 | 0 | 0 | 0 | 1 | 0 | 0 | XXXConfid |
| 1 | 4752 | 89 | 0 | 0 | 0 | 26.827681 | 0 | 4.542524 | 7.619885 | 0.518767 | 7.151293 | 0 | 0 | 0 | 0 | 0 | 0 | 115 | 64 | 231.162595 | 193.407996 | 79.028477 | 294.630909 | 20.613267 | 7.118696 | 0 | 0 | 2.592424 | 0 | 0 | 0 | 0 | 1 | 0 | XXXConfid |
| 2 | 4753 | 73 | 0 | 3 | 1 | 17.795882 | 0 | 19.555085 | 7.844988 | 1.826335 | 9.673574 | 1 | 0 | 0 | 0 | 0 | 0 | 99 | 116 | 284.181858 | 153.322762 | 69.772292 | 83.638324 | 7.356249 | 5.895077 | 0 | 0 | 7.119548 | 0 | 1 | 0 | 1 | 0 | 0 | XXXConfid |
| 3 | 4754 | 74 | 1 | 0 | 1 | 33.800817 | 1 | 12.209266 | 8.428001 | 7.435604 | 8.392554 | 0 | 0 | 0 | 0 | 0 | 0 | 118 | 115 | 159.58224 | 65.366637 | 68.457491 | 277.577358 | 13.991127 | 8.965106 | 0 | 1 | 6.481226 | 0 | 0 | 0 | 0 | 0 | 0 | XXXConfid |
| 4 | 4755 | 89 | 0 | 0 | 0 | 20.716974 | 0 | 18.454356 | 6.310461 | 0.795498 | 5.597238 | 0 | 0 | 0 | 0 | 0 | 0 | 94 | 117 | 237.602184 | 92.8697 | 56.874305 | 291.19878 | 13.517609 | 6.045039 | 0 | 0 | 0.014691 | 0 | 0 | 1 | 1 | 0 | 0 | XXXConfid |


## Summary of Dataset Inspection

After inspecting the dataset, we observed the following:

- **Size:** The dataset contains 2,149 records and 35 columns.
- **Data types:** Most columns are numeric (22 are int64 and 12 are float64). The only exception is DoctorInCharge, which is of type object and therefore holds text; in the first five rows it contains the same placeholder value, XXXConfid.
- **Missing values:** There are no missing values in the columns shown. Each has 2,149 non-null entries, so no imputation is needed at this stage.
- **Descriptive statistics:**
  - The statistical summary shows key metrics such as mean, standard deviation, min, and max for all numeric columns.
  - Features such as Age, BMI, SystolicBP, DiastolicBP, and CholesterolTotal are measured on very different scales (for example, Age spans 60 to 90 while BMI spans roughly 15 to 40), so they may require scaling later.
  - Many columns appear to be binary indicators (e.g. Gender, Smoking, Diabetes) or limited-range integers (e.g. Ethnicity, EducationLevel), which likely represent categorical or boolean-like data.
- **First records:** The first few rows confirm the structure and contents of the dataset. Columns like Diagnosis, MMSE, and FunctionalAssessment appear to be important targets or indicators for Alzheimer's-related conditions.
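
The observations above can also be checked programmatically rather than by reading the printed output. The sketch below is one possible way to do this; `summarize_columns` is a hypothetical helper, the toy rows are made up, and treating "exactly two distinct values" as binary is a simplifying assumption.

```python
import pandas as pd


def summarize_columns(data):
    """Group columns by the properties discussed in the summary."""
    n_unique = data.nunique(dropna=True)
    numeric = data.select_dtypes(include="number").columns
    return {
        # Total number of missing cells across the whole frame
        "missing_total": int(data.isna().sum().sum()),
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(data.duplicated().sum()),
        # Numeric columns with exactly two distinct values
        "binary": [c for c in numeric if n_unique[c] == 2],
        # Columns with a single value carry no information for modeling
        "constant": [c for c in data.columns if n_unique[c] <= 1],
    }


# Toy data (not the real dataset); the last row repeats the second one
toy = pd.DataFrame({
    "Age": [73, 89, 74, 89],
    "Smoking": [0, 0, 1, 0],
    "DoctorInCharge": ["X", "X", "X", "X"],
})
report = summarize_columns(toy)
print(report)
```

Running the same function on `dataframe` would show whether columns such as DoctorInCharge are constant across all rows, which the first five rows alone cannot confirm.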