# Exploratory Analysis and Data Preprocessing

In [5]:
# Load the dataset
import pandas as pd

data = pd.read_csv(r'C:\Users\Kanika Barik\projects\Classification-of-Alzheimer-s-Disease\alzheimers_disease_data.csv')

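# Display basic information and the first few rows of the dataset.
# Note: data.info() prints its summary directly and returns None,
# so data_info is None in the output tuple below.
data_info = data.info()
data_head = data.head()

data_info, data_head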

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2149 entries, 0 to 2148
Data columns (total 35 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   PatientID                  2149 non-null   int64  
 1   Age                        2149 non-null   int64  
 2   Gender                     2149 non-null   int64  
 3   Ethnicity                  2149 non-null   int64  
 4   EducationLevel             2149 non-null   int64  
 5   BMI                        2149 non-null   float64
 6   Smoking                    2149 non-null   int64  
 7   AlcoholConsumption         2149 non-null   float64
 8   PhysicalActivity           2149 non-null   float64
 9   DietQuality                2149 non-null   float64
 10  SleepQuality               2149 non-null   float64
 11  FamilyHistoryAlzheimers    2149 non-null   int64  
 12  CardiovascularDisease      2149 non-null   int64  
 13  Diabetes                   2149 non-null   int64

(None,
    PatientID  Age  Gender  Ethnicity  EducationLevel        BMI  Smoking  \
 0       4751   73       0          0               2  22.927749        0   
 1       4752   89       0          0               0  26.827681        0   
 2       4753   73       0          3               1  17.795882        0   
 3       4754   74       1          0               1  33.800817        1   
 4       4755   89       0          0               0  20.716974        0   
 
    AlcoholConsumption  PhysicalActivity  DietQuality  ...  MemoryComplaints  \
 0           13.297218          6.327112     1.347214  ...                 0   
 1            4.542524          7.619885     0.518767  ...                 0   
 2           19.555085          7.844988     1.826335  ...                 0   
 3           12.209266          8.428001     7.435604  ...                 0   
 4           18.454356          6.310461     0.795498  ...                 0   
 
    BehavioralProblems       ADL  Confusion  Di

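The dataset contains 2,149 rows and 35 columns with a mix of numerical and categorical data; categorical features such as Gender and Ethnicity are stored as integer codes (int64). Here's a summary: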

Target Variable: Diagnosis (likely 0 for no Alzheimer's, 1 for Alzheimer's).
Features: Includes demographics (e.g., Age, Gender), lifestyle factors (e.g., Smoking, AlcoholConsumption), medical data (e.g., CholesterolTotal, MMSE), and behavioral issues.
Non-Predictive Columns: PatientID (identifier) and DoctorInCharge (likely confidential and non-predictive).

# Train-Test Split

In [6]:
# Drop non-predictive columns
data_cleaned = data.drop(columns=['PatientID', 'DoctorInCharge'])

# Check class distribution of the target variable
class_distribution = data_cleaned['Diagnosis'].value_counts(normalize=True)

# Display class distribution
class_distribution


Diagnosis
0    0.646347
1    0.353653
Name: proportion, dtype: float64

The dataset is moderately imbalanced, at roughly a 65/35 split:

No Alzheimer's (0): 64.63% of cases.
Alzheimer's (1): 35.37% of cases.
To address this imbalance during modeling, we can use techniques such as oversampling (e.g., SMOTE), undersampling, or class weighting.
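As a rough illustration of the class-weighting option, the sketch below computes "balanced" weights with the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies for `class_weight="balanced"`. The class counts (1389 and 760) are not printed in the notebook; they are an assumption derived from the proportions above multiplied by the 2149 rows.

```python
from collections import Counter

# Assumed class counts, derived from the proportions above:
# 0.646347 * 2149 ≈ 1389 (class 0) and 0.353653 * 2149 ≈ 760 (class 1).
labels = [0] * 1389 + [1] * 760

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c),
# so the minority class receives the larger weight.
class_weights = {c: n_samples / (n_classes * n) for c, n in sorted(counts.items())}
print(class_weights)
```

A dictionary like this can be passed as the `class_weight` argument of scikit-learn estimators that support it (for example `LogisticRegression`), as an alternative to resampling the data.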

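Next, encode categorical features if needed, split the dataset into training and testing sets, and then scale numerical features for models sensitive to scaling. Splitting first lets the scaler be fit on the training set only, so no information from the test set leaks into preprocessing.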
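The scaling statistics should come from the training rows only and then be reused unchanged on the test rows. The toy sketch below shows that idea with plain Python on the five BMI values visible in the preview above; the 4/1 split and the helper name `standardize` are assumptions made purely for illustration, not part of the notebook.

```python
from statistics import fmean, pstdev

# Toy values: the five BMI entries shown in the data preview.
# The 4/1 split here is arbitrary and only for illustration.
bmi_train = [22.927749, 26.827681, 17.795882, 33.800817]
bmi_test = [20.716974]

# "Fit": learn the mean and standard deviation from the training rows only.
# pstdev is the population standard deviation (ddof=0), matching StandardScaler.
mu = fmean(bmi_train)
sigma = pstdev(bmi_train)


def standardize(values, mu, sigma):
    """Hypothetical helper: apply already-fitted statistics to any values."""
    return [(v - mu) / sigma for v in values]


# "Transform": reuse the training statistics for both sets.
bmi_train_scaled = standardize(bmi_train, mu, sigma)
bmi_test_scaled = standardize(bmi_test, mu, sigma)
print(bmi_train_scaled, bmi_test_scaled)
```

The scaled training values have mean 0 and unit variance by construction, while the test value is simply expressed relative to the training distribution.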


In [7]:
# Install scikit-learn
%pip install scikit-learn

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate features and target variable
X = data_cleaned.drop(columns=['Diagnosis'])
y = data_cleaned['Diagnosis']

# Split the dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

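# Scale features: fit the scaler on the training set only, then apply it to the test set.
# Note: this standardizes all 32 columns, including integer-coded categorical ones.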
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Check shapes to confirm split and scaling
X_train_scaled.shape, X_test_scaled.shape, y_train.shape, y_test.shape


((1719, 32), (430, 32), (1719,), (430,))
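The shapes above can be sanity-checked with simple arithmetic. This sketch assumes that `train_test_split` rounds the test count up for a fractional `test_size` and assigns the remaining rows to the training set, which is consistent with the output shown; the row and column counts are taken from the `data.info()` summary.

```python
import math

n_rows, n_cols = 2149, 35   # from the data.info() summary
dropped = 3                 # PatientID, DoctorInCharge, and the Diagnosis target
test_size = 0.2

n_features = n_cols - dropped

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print((n_train, n_features), (n_test, n_features))
```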