In [None]:
# Notebook 03

In [None]:
## Statistics of Cleaned Data

In [None]:
# Descriptive statistics
df_clean.describe()

In [None]:
### Output

We see a full summary of the numeric columns in our cleaned dataset. This includes useful values like the mean, standard deviation, minimum and maximum, and the 25th, 50th (median), and 75th percentiles for each feature.

Some columns, like Gender, Smoking, and Diagnosis, only take the values 0 and 1. That means they are binary, with the two categories (for example yes/no) encoded as 1 and 0. For example, the mean of Diagnosis is about 0.35, which tells us that roughly 35% of the patients are diagnosed with the condition.

Other columns, like BMI, Age, and the cholesterol measurements, have a much wider range. These are continuous features, and we will likely need to scale them before using them in a machine learning model.

This output helps us understand what kind of data we are working with. It also confirms that everything looks complete and consistent, which is important before moving on to visualizations or further analysis.
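
Both observations can be checked directly instead of being read off the table. The sketch below uses a small made-up DataFrame (the values are illustrative, not taken from our dataset) to show that the mean of a 0/1 column equals the share of ones, and that comparing the `count` row of `describe()` with the number of rows reveals columns with missing values. The helper name `binary_columns` is our own invention for this example.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not the real patient records
toy = pd.DataFrame({
    "Diagnosis": [1, 0, 0, 1, 0, 0, 0, 1],
    "Smoking":   [0, 0, 1, 0, 0, 1, 0, 0],
    "BMI":       [22.5, 31.0, np.nan, 27.4, 19.8, 35.2, 24.1, 29.9],
})

def binary_columns(df):
    """Names of columns whose non-missing values are all 0 or 1."""
    return [c for c in df.columns if df[c].dropna().isin([0, 1]).all()]

# The mean of a 0/1 column is the fraction of rows equal to 1
share_diagnosed = toy["Diagnosis"].mean()

# A count below the number of rows means the column has missing values
counts = toy.describe().loc["count"]
incomplete = counts[counts < len(toy)].index.tolist()

print("Binary columns:", binary_columns(toy))
print("Share diagnosed:", share_diagnosed)
print("Columns with missing values:", incomplete)
```

Running the same checks on `df_clean` would show whether the "complete and consistent" impression holds for every column.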

In [None]:
## Boxplots – All Numeric Variables

We first tried creating a boxplot of all numeric features in the dataset. Boxplots are useful for showing the distribution of values, including the median, quartiles, and possible outliers.

In [None]:
# Boxplot of all numeric variables
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(14, 8))
sns.boxplot(data=df_clean.select_dtypes(include='number'), orient="h")
plt.title("Boxplot of All Numeric Features")
plt.show()

In [None]:
### Output

We see that the boxplot is difficult to interpret because some features, like PatientID, have very large values, while others are binary or have very small ranges. This large difference in scale causes most of the boxes to be compressed or almost invisible.

This shows that combining all numeric features into one boxplot is not very useful when the variables have very different value ranges. The result is a plot that does not give us any clear insight.
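
The scale mismatch can be quantified before plotting. As a rough sketch, the snippet below computes each column's range (max minus min) on invented data and reports how many times wider the largest range is than the smallest. The column values are placeholders, not our real data.

```python
import pandas as pd

# Invented values that mimic an ID column, a continuous column, and a binary column
toy = pd.DataFrame({
    "PatientID": [4751, 5200, 5900, 6400, 6899],
    "BMI":       [22.5, 31.0, 18.7, 27.4, 35.2],
    "Smoking":   [0, 1, 0, 0, 1],
})

numeric = toy.select_dtypes(include="number")

# Range of each column, widest first
ranges = (numeric.max() - numeric.min()).sort_values(ascending=False)
print(ranges)

# How many times wider the widest column is than the narrowest
ratio = ranges.iloc[0] / ranges.iloc[-1]
print(f"Widest range is {ratio:.0f}x the narrowest")
```

When this ratio is in the hundreds or thousands, a shared axis will flatten the narrow columns, which is what we observed above.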

In [None]:
## Boxplots – Selected Continuous Features

To get a clearer overview, we selected only a few continuous features that are on similar scales. This makes it easier to see the spread of values, compare medians, and identify outliers.

We included BMI, PhysicalActivity, DietQuality, SleepQuality, the three cholesterol columns (CholesterolTotal, CholesterolLDL, CholesterolHDL), and MMSE. These are all continuous measurements, so leaving out the identifier and binary columns gives us a much more readable and meaningful visualization.


In [None]:
# Select continuous numeric columns for clearer boxplots
columns_to_plot = [
    "BMI", "PhysicalActivity", "DietQuality", "SleepQuality",
    "CholesterolTotal", "CholesterolLDL", "CholesterolHDL", "MMSE"
]

plt.figure(figsize=(12, 6))
sns.boxplot(data=df_clean[columns_to_plot], orient="h")
plt.title("Boxplot of Selected Continuous Features")
plt.show()

In [None]:
### Output

We see that the boxplots now clearly show the distribution of each selected variable. We can observe the range of values, the middle 50% (the box), and any outliers outside the whiskers.

For example, CholesterolLDL and MMSE show wider ranges, while features like PhysicalActivity and DietQuality have smaller, more concentrated distributions. This tells us that these variables may need different kinds of preprocessing (like scaling) before we use them in a machine learning model.

In [None]:
## Boxplots – Log-Transformed Features

To improve the clarity of the boxplots even more, we applied a log transformation to each feature. Log transformation helps reduce skew, especially when there are a few very large values.

This technique compresses the scale of high values and stretches the scale of low values, making the distributions easier to compare side by side. We used log(1 + x) to safely handle values near zero.
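
As a quick illustration of what log(1 + x) does to values of very different magnitude, the sketch below applies `np.log1p` to a few hand-picked numbers and inverts the result with `np.expm1`. The input values are arbitrary examples.

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

# log(1 + x): defined at 0, unlike log(x)
y = np.log1p(x)
print(np.round(y, 3))  # approximately 0, 0.693, 2.398, 4.615, 6.909

# Inverse transform, exp(y) - 1, recovers the original values
x_back = np.expm1(y)
print(np.allclose(x_back, x))
```

The inputs span three orders of magnitude, while the transformed values all fall between 0 and about 7.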

In [None]:
# Apply log transform to reduce skew before plotting
import numpy as np

df_log = np.log1p(df_clean[columns_to_plot])  # log(1 + x), applied element-wise

plt.figure(figsize=(12, 6))
sns.boxplot(data=df_log, orient="h")
plt.title("Boxplot of Log-Transformed Continuous Features")
plt.show()

In [None]:
### Output

We see that the log-transformed boxplots are more balanced and evenly spread. The variables are easier to compare visually, and patterns in the data are more visible. For example, MMSE had a very wide spread in the original scale, but after transformation we can clearly see its shape and outliers. The same applies to the cholesterol features.

This suggests that log transformation is a helpful tool for visualizing skewed data and preparing features for machine learning models.
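
Whether the transform actually helps a given column can be checked numerically with the sample skewness, which is close to 0 for a symmetric distribution. The sketch below uses synthetic right-skewed data (a log-normal sample, chosen purely for illustration) and compares `Series.skew()` before and after `np.log1p`. For columns that are not right-skewed the transform may not reduce skew at all, so it is worth running this check per column.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample; not drawn from our dataset
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000), name="synthetic")

skew_before = raw.skew()
skew_after = np.log1p(raw).skew()

print(f"Skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

The same two lines applied to `df_clean[columns_to_plot]` and `df_log` would give one skewness value per feature.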

In [None]:
## Plot Histograms for Numeric Variables

In [None]:
# Histogram grid for feature distributions
df_clean.hist(figsize=(16, 12), bins=20)
plt.suptitle("Histogram of Features", fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
## Detect Outliers Using IQR

In [None]:
def detect_outliers_iqr(df, columns):
    """
    Return the rows of df that are outliers in at least one of the given
    columns, using the 1.5 * IQR rule. Rows keep their original order.
    """
    outlier_indices = set()

    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr

        # Collect the index labels of rows outside the bounds for this column
        is_outlier = (df[col] < lower_bound) | (df[col] > upper_bound)
        outlier_indices.update(df.index[is_outlier.to_numpy()])

    # Select with a mask so each row appears once, in its original order
    return df[df.index.isin(outlier_indices)]

In [None]:
## View and Count Outliers

In [None]:
# Detect outliers in numeric columns
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
outliers_df = detect_outliers_iqr(df_clean, numeric_cols)

print(f"Total rows with outliers: {outliers_df.shape[0]}")
outliers_df.head()
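
In [None]:
A caveat worth checking when every numeric column is passed to the IQR rule: for a 0/1 column in which fewer than a quarter of the rows are 1, both quartiles are 0, so the IQR is 0 and every 1 falls outside the bounds. The sketch below demonstrates this on made-up data and shows one possible way to restrict the check to non-binary columns. The helpers `iqr_bounds` and `continuous_columns` are hypothetical and not part of this notebook.

```python
import pandas as pd

# Made-up data: a binary column with 20% ones and a continuous column
toy = pd.DataFrame({
    "Smoking": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "BMI": [22.5, 31.0, 18.7, 27.4, 35.2, 24.1, 29.9, 26.3, 21.8, 30.4],
})

def iqr_bounds(series):
    """Lower and upper bounds of the 1.5 * IQR rule for one column."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# For the binary column both bounds collapse to 0, so every 1 is flagged
low, high = iqr_bounds(toy["Smoking"])
flagged = toy[(toy["Smoking"] < low) | (toy["Smoking"] > high)]
print(f"Bounds: [{low}, {high}], rows flagged: {len(flagged)}")

def continuous_columns(df):
    """Numeric columns with more than two distinct values (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    return [c for c in numeric.columns if numeric[c].nunique() > 2]

print(continuous_columns(toy))
```

If the row count reported above looks surprisingly high, comparing it with the result of running `detect_outliers_iqr` on only the continuous columns would show how much of it comes from binary features. Note that an identifier such as PatientID also has more than two distinct values, so it would need to be excluded by name.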


In [None]:
## Remove Outliers

In [None]:
# Remove detected outliers
df_no_outliers = df_clean.drop(outliers_df.index)

print(f"Shape after removing outliers: {df_no_outliers.shape}")


In [None]:
## Save Dataset

In [None]:
# Save outlier-free data
df_no_outliers.to_csv("../data/alzheimers_no_outliers.csv", index=False)


In [None]:
## Correlation Matrix

In [None]:
# Correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df_no_outliers.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix (Outlier-Free Data)")
plt.show()



In [None]:
## Pairplot for Highly Correlated Features

In [None]:
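# Optional: pairplot of the three most strongly correlated feature pairs
corr_matrix = df_no_outliers.corr(numeric_only=True).abs()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
top_pairs = upper.stack().sort_values(ascending=False).head(3)

# Unique column names from the top pairs, in order of first appearance
cols = list(dict.fromkeys(name for pair in top_pairs.index for name in pair))
sns.pairplot(df_no_outliers[cols])
plt.suptitle("Pairplot of Top Correlated Features", y=1.02)
plt.show()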

In [None]:
## Feature Scaling

In [None]:
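# Standardize numeric columns (zero mean, unit variance).
# Note: numeric_cols still includes PatientID and 0/1 columns such as Diagnosis.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()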
df_scaled = df_no_outliers.copy()
df_scaled[numeric_cols] = scaler.fit_transform(df_no_outliers[numeric_cols])
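
In [None]:
For reference, standardization replaces each value with z = (x - mean) / std, and `StandardScaler` uses the population standard deviation (ddof=0) for this. The sketch below reproduces that formula by hand with pandas on invented numbers, so the effect on a column can be inspected without fitting a scaler. The values are placeholders, not our real data.

```python
import numpy as np
import pandas as pd

# Invented values for two continuous columns
toy = pd.DataFrame({
    "BMI": [20.0, 25.0, 30.0, 35.0],
    "CholesterolTotal": [150.0, 200.0, 250.0, 300.0],
})

# z-score with the population standard deviation (ddof=0)
z = (toy - toy.mean()) / toy.std(ddof=0)
print(z.round(3))

# After scaling, every column has mean 0 and standard deviation 1
print(np.allclose(z.mean(), 0.0), np.allclose(z.std(ddof=0), 1.0))
```

Both toy columns end up with identical scaled values because they are evenly spaced, which shows that the original units no longer matter after standardization.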
