# AI-Based Multi-Disease Risk Predictor  
## Notebook 02: Data Cleaning

### Objective
To clean, simplify, and prepare the Heart Disease and Diabetes datasets  
by selecting relevant features and handling invalid or noisy values.

### Import Libraries

In [1]:
import pandas as pd 
import numpy as np 
import warnings
warnings.filterwarnings('ignore')

### Load Raw Datasets

In [2]:
heart=pd.read_csv('C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data//raw_data/heart.csv')
diabetes=pd.read_csv('C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data//raw_data/diabetes.csv')
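
The absolute paths above only resolve on one machine. As a hedged sketch of a more portable alternative, the snippet below builds the path with `pathlib` and fails early if the file is missing. The helper `load_raw` and the relative `RAW_DIR` layout are assumptions for illustration, not part of the project; the demo writes a tiny throwaway CSV so the snippet runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_raw(data_dir, filename):
    """Read one CSV from data_dir, raising a clear error if it is missing."""
    path = Path(data_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected dataset at {path}")
    return pd.read_csv(path)


# Assumed layout: the notebook is run from the project root.
RAW_DIR = Path("data") / "raw_data"
# heart = load_raw(RAW_DIR, "heart.csv")

# Self-contained demo on a temporary file (made-up rows).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo.csv").write_text("age,target\n52,0\n53,1\n")
    demo = load_raw(tmp, "demo.csv")

print(demo.shape)
```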

In [3]:
print('Heart Dataset shape ::',heart.shape)
print('Diabetes Dataset shape ::',diabetes.shape)

Heart Dataset shape :: (1025, 14)
Diabetes Dataset shape :: (768, 9)


## Feature Selection Rationale

To keep the system explainable and practical for a hackathon:
- We avoid overly complex or domain-specific attributes.
- We focus on commonly understood clinical parameters.
- Fewer features improve interpretability and can reduce the risk of overfitting.

A smaller feature set is also easier to justify during judging and viva explanations.

### Heart Dataset: Feature Selection

In [4]:
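# Keep a small set of widely understood clinical features plus the label.
# .copy() makes heart_clean independent of the original `heart` DataFrame.
heart_clean = heart[
    ["age", "trestbps", "chol", "thalach", "oldpeak", "target"]
].copy()
heart_clean.head()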

   age  trestbps  chol  thalach  oldpeak  target
0   52       125   212      168      1.0       0
1   53       140   203      155      3.1       0
2   70       145   174      125      2.6       0
3   61       148   203      161      0.0       0
4   62       138   294      106      1.9       0


### Diabetes Dataset: Feature Selection

In [5]:
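# Select the features and the label, then switch to lowercase column names.
# rename() returns a new DataFrame, so no slice of `diabetes` is modified in place.
diabetes_clean = diabetes[
    ["Age", "Glucose", "BMI", "Outcome"]
].rename(columns={
    "Age": "age",
    "Glucose": "glucose",
    "BMI": "bmi",
    "Outcome": "target"
})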

diabetes_clean.head()

   age  glucose   bmi  target
0   50      148  33.6       1
1   31       85  26.6       0
2   32      183  23.3       1
3   21       89  28.1       0
4   33      137  43.1       1


## Data Quality Issues

- The Diabetes dataset contains zero values in medical attributes
  such as glucose and BMI.
- A glucose level or BMI of zero is not physiologically possible, so these
  zeros represent missing data rather than real measurements.
- These values must be handled before model training.
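
Before replacing anything, it helps to measure how many zeros each column holds. The sketch below uses a hypothetical helper, `count_zeros`, on a small made-up frame (the values are illustrative, not rows from the dataset); with the real data, `diabetes_clean` would be passed instead.

```python
import pandas as pd


def count_zeros(df, columns):
    """Return the number of exact-zero entries in each listed column."""
    return (df[columns] == 0).sum()


# Illustrative toy data, not taken from the real dataset.
sample = pd.DataFrame({
    "glucose": [148, 0, 183, 89],
    "bmi": [33.6, 26.6, 0.0, 0.0],
})

zero_counts = count_zeros(sample, ["glucose", "bmi"])
print(zero_counts)
```

A column with many zeros may call for a different strategy than one with only a handful, so this count is worth recording before imputation.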

In [6]:
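for col in ["glucose", "bmi"]:
    # Zeros are placeholders for missing readings, so compute the mean
    # from the valid (non-zero) values only, then use it as the replacement.
    valid_mean = diabetes_clean.loc[diabetes_clean[col] != 0, col].mean()
    diabetes_clean[col] = diabetes_clean[col].replace(0, valid_mean)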
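
One caveat with mean replacement: if the mean is computed while the zeros are still in the column, the placeholders pull it down. A minimal sketch on made-up numbers, marking zeros as `NaN` first so that `mean()` skips them:

```python
import numpy as np
import pandas as pd

# Illustrative values only; 0 stands for a missing measurement.
glucose = pd.Series([100.0, 0.0, 140.0, 0.0, 120.0])

naive_mean = glucose.mean()            # zeros are counted as real readings
masked = glucose.replace(0, np.nan)    # zeros become NaN
valid_mean = masked.mean()             # NaN is skipped by default
imputed = masked.fillna(valid_mean)    # fill the gaps with the valid mean

print(naive_mean, valid_mean)
print(imputed.tolist())
```

The median of the valid values is a common alternative when a column is skewed; which statistic suits this dataset is a judgment call that this sketch does not settle.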

### Missing Value Check

In [7]:
print("Heart dataset missing values:")
print(heart_clean.isnull().sum())

print("\nDiabetes dataset missing values:")
print(diabetes_clean.isnull().sum())

Heart dataset missing values:
age         0
trestbps    0
chol        0
thalach     0
oldpeak     0
target      0
dtype: int64

Diabetes dataset missing values:
age        0
glucose    0
bmi        0
target     0
dtype: int64
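
Note that `isnull()` only reports `NaN`, so it would not flag a zero that was left in place. As a hedged sketch of an extra check, the hypothetical helper `find_invalid` below counts entries that are either `NaN` or exactly zero; the frames are toy data, and with the real data `diabetes_clean` would be passed instead.

```python
import pandas as pd


def find_invalid(df, columns):
    """Count entries per column that are NaN or exactly zero."""
    subset = df[columns]
    return (subset.isnull() | (subset == 0)).sum()


# Toy frame standing in for a cleaned dataset (made-up values).
cleaned = pd.DataFrame({
    "glucose": [148.0, 121.7, 183.0],
    "bmi": [33.6, 26.6, 32.5],
})
invalid = find_invalid(cleaned, ["glucose", "bmi"])
assert (invalid == 0).all(), f"Invalid values remain:\n{invalid}"

# The same helper flags a frame that still holds a zero and a NaN.
dirty = pd.DataFrame({"glucose": [0.0, 90.0], "bmi": [float("nan"), 25.0]})
dirty_invalid = find_invalid(dirty, ["glucose", "bmi"])
print(dirty_invalid)
```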


### Save Cleaned Datasets

In [8]:
heart_clean.to_csv("C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data//raw_data/heart_clean.csv", index=False)
diabetes_clean.to_csv("C://Users//JOHN//Desktop//ML_projects//ai-disease-risk-predictor//data//raw_data/diabetes_clean.csv", index=False)

## Summary

- Selected a small set of relevant, explainable features from each dataset
- Dropped the remaining columns to reduce complexity
- Replaced invalid zero values in the diabetes data (glucose, BMI) with the column mean
- Saved the cleaned datasets for reuse

The next step is Exploratory Data Analysis (EDA)  
to visualize patterns and relationships in the data.