# AI-Based Multi-Disease Risk Predictor  
## Notebook 01: Data Understanding

### Objective
The objective of this notebook is to understand the structure, features, and target variables  
of the real-world healthcare datasets used for predicting disease risk.

### Datasets Used
1. Heart Disease Dataset (UCI, via Kaggle)
2. PIMA Indians Diabetes Dataset (via Kaggle)

### Import Libraries

In [1]:
import pandas as pd 
import numpy as np

### Load Datasets

In [2]:
# Load the raw CSV files (this absolute path is specific to the author's machine)
data_dir = 'C:/Users/JOHN/Desktop/ML_projects/ai-disease-risk-predictor/data/raw_data'
heart = pd.read_csv(f'{data_dir}/heart.csv')
diabetes = pd.read_csv(f'{data_dir}/diabetes.csv')
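
The absolute paths above only resolve on the author's machine. A minimal sketch of a more portable loader follows, assuming the notebook is run from a project root that contains `data/raw_data/`. The `load_raw` helper and the `RAW_DATA_DIR` name are illustrative and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Assumption: the notebook's working directory is the project root.
RAW_DATA_DIR = Path("data") / "raw_data"


def load_raw(filename, data_dir=RAW_DATA_DIR):
    """Read one raw CSV, failing early with a clear message if it is missing."""
    path = Path(data_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected dataset at {path.resolve()}")
    return pd.read_csv(path)
```

With such a helper, the two loads could be written as `heart = load_raw('heart.csv')` and `diabetes = load_raw('diabetes.csv')`.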

In [3]:
print('Heart Dataset shape ::', heart.shape)
print('Diabetes Dataset shape ::', diabetes.shape)

Heart Dataset shape :: (1025, 14)
Diabetes Dataset shape :: (768, 9)


### Dataset Preview

In [4]:
heart.head()

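|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |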


In [5]:
diabetes.head()

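|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|-------------|---------|---------------|---------------|---------|-----|--------------------------|-----|---------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |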


## Dataset Source

### Heart Disease Dataset
- Source: UCI Machine Learning Repository (via Kaggle)
- Contains clinical parameters related to heart health
- Widely used in medical machine learning research

### Diabetes Dataset
- Source: PIMA Indians Diabetes Dataset, originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases (via Kaggle)
- Contains diagnostic measurements for diabetes prediction
- Standard benchmark dataset in healthcare analytics

## Dataset Structure & Data Types

In [6]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


In [7]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
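
`info()` prints its report rather than returning it. As a rough sketch, the same structural facts can be collected into a DataFrame so they can be inspected or compared later. `summarize_columns` is a hypothetical helper, and the sample reuses three columns from the heart preview rows rather than the full file.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column: dtype, null count, and distinct-value count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_null": df.isna().sum(),
            "n_unique": df.nunique(),
        }
    )


# Illustrative sample: three columns from the five heart rows shown in the preview.
sample = pd.DataFrame(
    {
        "age": [52, 53, 70, 61, 62],
        "sex": [1, 1, 1, 1, 0],
        "oldpeak": [1.0, 3.1, 2.6, 0.0, 1.9],
    }
)
print(summarize_columns(sample))
```

A low `n_unique` on an integer column (such as `sex` here) is a hint that the column may be an encoded category rather than a continuous measurement.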


## Feature Description

### Heart Disease Dataset
Key attributes include:
- age: Age of the patient in years
- trestbps: Resting blood pressure (mm Hg)
- chol: Serum cholesterol (mg/dl)
- thalach: Maximum heart rate achieved
- oldpeak: ST depression induced by exercise relative to rest
- target: Presence (1) or absence (0) of heart disease

### Diabetes Dataset
Key attributes include:
- Age: Patient age in years
- Glucose: Plasma glucose concentration (2-hour oral glucose tolerance test)
- BMI: Body Mass Index (weight in kg divided by height in m squared)
- Outcome: Presence (1) or absence (0) of diabetes

----------------------------------------------------------------------------------------------------------

## Target Variable Identification


Each dataset has a clearly defined target variable:

- Heart Disease Dataset:
  - `target`
  - 1 → Heart disease present
  - 0 → No heart disease

- Diabetes Dataset:
  - `Outcome`
  - 1 → Diabetes present
  - 0 → No diabetes

These target variables make both problems suitable for binary classification.
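
Before modeling, it helps to check how the two classes are distributed in each target. The following is a minimal sketch, assuming a hypothetical helper named `class_balance`; the sample uses only the `Outcome` values of the five diabetes rows previewed earlier, so its shares say nothing about the full dataset.

```python
import pandas as pd


def class_balance(df, target):
    """Return the count and share of each class in the target column."""
    counts = df[target].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Illustrative sample: Outcome values of the five diabetes rows shown in the preview.
sample = pd.DataFrame({"Outcome": [1, 0, 1, 0, 1]})
print(class_balance(sample, "Outcome"))
```

On the loaded data, the equivalent calls would be `class_balance(heart, 'target')` and `class_balance(diabetes, 'Outcome')`.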

-----------------------------------------------------------------------------------------------------------

## Initial Observations

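- Both datasets are structured and tabular: the heart dataset has 1,025 rows and 14 columns, and the diabetes dataset has 768 rows and 9 columns.
- Both target variables are binary, so each problem can be framed as binary classification.
- All features are stored as numeric types (`int64` or `float64`), but several heart features, such as `sex`, `cp`, and `thal`, are integer-encoded categories rather than continuous measurements.
- `info()` reports no null values in either dataset. The diabetes preview nonetheless shows zeros in `SkinThickness` and `Insulin`, which likely stand in for missing measurements.
- Cleaning and preprocessing are therefore still required before modeling.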
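
`info()` only counts nulls, so placeholder values and repeated rows can go unnoticed. A minimal sketch of two extra checks follows, under the assumption that a zero in the listed diabetes columns marks a missing measurement. The column list and the `count_suspect_zeros` helper are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Assumption: a zero in these columns is physiologically implausible and likely
# marks a missing measurement. The list is an illustrative choice.
ZERO_SUSPECT_COLUMNS = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def count_suspect_zeros(df, columns=ZERO_SUSPECT_COLUMNS):
    """Count zero entries per column, for the listed columns present in df."""
    present = [c for c in columns if c in df.columns]
    return (df[present] == 0).sum()


# The five diabetes rows shown in the preview (suspect columns only).
sample = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 89, 137],
        "BloodPressure": [72, 66, 64, 66, 40],
        "SkinThickness": [35, 29, 0, 23, 35],
        "Insulin": [0, 0, 0, 94, 168],
        "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    }
)
print(count_suspect_zeros(sample))
print("Duplicate rows:", sample.duplicated().sum())
```

The same two checks could be run on the full `diabetes` and `heart` DataFrames to decide what the cleaning step needs to handle.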

Next steps will involve data cleaning and feature selection.