## 01-Data Exploration

In this notebook, I will:
1. Check the dataset's integrity against the 1,025 records, 13 features, and one target variable reported in the original paper.
2. Load the dataset and compare its structure with the original study.
3. Verify the column names, column count, and record count.
4. Check for missing or null values.
5. Check for glaring errors.
6. Finally, reproduce <a href="../contents/tables/originals/Table1.pdf" target="_blank"> Table 1</a> from the original paper.

In [1]:
# import necessary libs
import numpy as np
import pandas as pd

In [2]:
# load data & show 10 heads
df = pd.read_csv("../raw/heart_dataset.csv")
df.head(10)

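|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| 5 | 58 | 0 | 0 | 100 | 248 | 0 | 0 | 122 | 0 | 1.0 | 1 | 0 | 2 | 1 |
| 6 | 58 | 1 | 0 | 114 | 318 | 0 | 2 | 140 | 0 | 4.4 | 0 | 3 | 1 | 0 |
| 7 | 55 | 1 | 0 | 160 | 289 | 0 | 0 | 145 | 1 | 0.8 | 1 | 1 | 3 | 0 |
| 8 | 46 | 1 | 0 | 120 | 249 | 0 | 0 | 144 | 0 | 0.8 | 2 | 0 | 3 | 0 |
| 9 | 54 | 1 | 0 | 122 | 286 | 0 | 0 | 116 | 1 | 3.2 | 1 | 2 | 2 | 0 |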


In [3]:
print(f"Column Names:\n{df.columns}")
print(f"Number of Columns\t: {len(df.columns)}")
print(f"Number of Rows\t\t: {len(df)}")
print(f"Target Variable\t\t: {df.columns[13]}") # index 13 (the last column) is the `target` variable described in the paper

Column Names:
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')
Number of Columns	: 14
Number of Rows		: 1025
Target Variable		: target
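
The printed names and counts can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming the expected column list is the one printed above and the expected row count is the 1,025 reported in the paper. `check_schema` is a hypothetical helper that is not part of the original notebook, and the demo frame is synthetic so the block runs without the CSV.

```python
import pandas as pd

# Column names as printed by the notebook; the row count comes from the paper.
EXPECTED_COLUMNS = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                    'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']
EXPECTED_ROWS = 1025


def check_schema(df, expected_columns=EXPECTED_COLUMNS, expected_rows=EXPECTED_ROWS):
    """Return a list of schema problems; an empty list means the frame matches.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    problems = []
    missing = [c for c in expected_columns if c not in df.columns]
    extra = [c for c in df.columns if c not in expected_columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    return problems


# Tiny synthetic frame standing in for the real CSV (values are placeholders).
demo = pd.DataFrame([[0] * len(EXPECTED_COLUMNS)] * 3, columns=EXPECTED_COLUMNS)
schema_ok = check_schema(demo, expected_rows=3)
schema_bad = check_schema(demo.drop(columns=['target']), expected_rows=3)
print(schema_ok)
print(schema_bad)
```

On the real data, `check_schema(df)` with the default arguments would be the equivalent call.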


All the column names, the column count, and the number of records match the original study. Next, I will inspect the dataset information to check for missing values.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


**NO MISSING/NULL VALUES FOUND. ALL THE 1,025 RECORDS ARE AVAILABLE WITH NON-NULL VALUES.**
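
Non-null counts only show that no cell is empty; they do not show whether the stored codes are plausible. The sketch below is one possible way to cover the "glaring errors" check as well. It assumes the allowed codes are the ones that `create_table1` iterates over further down in this notebook (they are not taken from the paper), and `integrity_report` is a hypothetical helper demonstrated on synthetic data.

```python
import pandas as pd

# Codes that create_table1 expects for each categorical column (taken from
# the label dictionaries and loops in this notebook, not from the paper).
ALLOWED_CODES = {
    'sex': {0, 1},
    'cp': {0, 1, 2, 3},
    'fbs': {0, 1},
    'restecg': {0, 1, 2},
    'exang': {0, 1},
    'slope': {0, 1, 2},
    'ca': {0, 1, 2, 3, 4},
    'thal': {0, 1, 2, 3},
    'target': {0, 1},
}


def integrity_report(df, allowed_codes=ALLOWED_CODES):
    """Summarise nulls and unexpected categorical codes for every column."""
    rows = []
    for col in df.columns:
        n_null = int(df[col].isna().sum())
        if col in allowed_codes:
            # Nulls are counted separately, so exclude them here.
            non_null = df[col].dropna()
            n_bad = int((~non_null.isin(allowed_codes[col])).sum())
        else:
            n_bad = 0
        rows.append({'column': col, 'nulls': n_null, 'unexpected_codes': n_bad})
    return pd.DataFrame(rows)


# Synthetic example: one null in `chol`, one invalid `cp` code.
demo = pd.DataFrame({
    'cp': [0, 3, 7],
    'chol': [212.0, None, 174.0],
    'target': [0, 1, 0],
})
report = integrity_report(demo)
print(report.to_string(index=False))
```

Running `integrity_report(df)` on the loaded dataset would list any column with nulls or codes outside the sets above.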

Next, I reproduce **Table 1** from the available data. I categorize each variable as described in the original study and define a function that applies each variable's categorization criteria as shown in Table 1.

In [5]:
# Create Table 1 for reproduction
def create_table1(df):
    """Recreating Table 1 from the paper with exact categories mentioned"""

    table1_results = []  # list of row dicts, one per feature category
    
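    # 1. AGE - Categorized into 4 ('24-39', '40-54', '55-69', '70-85')
    # NOTE: pd.cut defaults to right=True, so these edges produce right-closed
    # intervals such as (40, 55], and age 40 is counted under '24-39'.
    # Passing right=False would match the labels exactly; the same applies to
    # the trestbps, chol and thalach bins below.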
    age_bins = [23, 40, 55, 70, 86]
    age_labels = ['24-39', '40-54', '55-69', '70-85']
    df['age_cat'] = pd.cut(df['age'], bins=age_bins, labels=age_labels, include_lowest=True)
    
    age_counts = pd.crosstab(df['age_cat'], df['target'], margins=True)
    for idx in age_labels:
        table1_results.append({
            'Feature': 'Age',
            'Category': idx,
            'No Heart Disease': age_counts.loc[idx, 0],
            'Heart Disease': age_counts.loc[idx, 1],
            'Total': age_counts.loc[idx, 'All']
        })
    
    # 2. SEX (0=Female, 1=Male) [N.B: The original paper encoded it wrong as 0=Male, 1=Female]
    sex_counts = pd.crosstab(df['sex'], df['target'], margins=True)
    table1_results.append({
        'Feature': 'Sex',
        'Category': 'Male',
        'No Heart Disease': sex_counts.loc[1, 0],
        'Heart Disease': sex_counts.loc[1, 1],
        'Total': sex_counts.loc[1, 'All']
    })
    table1_results.append({
        'Feature': 'Sex',
        'Category': 'Female',
        'No Heart Disease': sex_counts.loc[0, 0],
        'Heart Disease': sex_counts.loc[0, 1],
        'Total': sex_counts.loc[0, 'All']
    })
    
    # 3. CP - Chest Pain (Categorized 1-4)
    cp_labels = {0: '1 = Typical angina', 1: '2 = Atypical angina', 
                 2: '3 = Non-anginal pain', 3: '4 = Asymptomatic'}
    cp_counts = pd.crosstab(df['cp'], df['target'], margins=True)
    for cp_val in [0, 1, 2, 3]:
        table1_results.append({
            'Feature': 'cp',
            'Category': cp_labels[cp_val],
            'No Heart Disease': cp_counts.loc[cp_val, 0],
            'Heart Disease': cp_counts.loc[cp_val, 1],
            'Total': cp_counts.loc[cp_val, 'All']
        })
    
    # 4. TRESTBPS - Resting Blood Pressure (Categorized into 4)
    trestbps_bins = [89, 120, 150, 180, 211]
    trestbps_labels = ['90-119', '120-149', '150-179', '180-210']
    df['trestbps_cat'] = pd.cut(df['trestbps'], bins=trestbps_bins, labels=trestbps_labels, include_lowest=True)
    
    trestbps_counts = pd.crosstab(df['trestbps_cat'], df['target'], margins=True)
    for idx in trestbps_labels:
        table1_results.append({
            'Feature': 'Trestbps',
            'Category': idx,
            'No Heart Disease': trestbps_counts.loc[idx, 0],
            'Heart Disease': trestbps_counts.loc[idx, 1],
            'Total': trestbps_counts.loc[idx, 'All']
        })
    
    # 5. CHOL - Cholesterol (Categorized into 4)
    chol_bins = [119, 231, 342, 454, 565]
    chol_labels = ['120-230', '231-341', '342-453', '454-564']
    df['chol_cat'] = pd.cut(df['chol'], bins=chol_bins, labels=chol_labels, include_lowest=True)
    
    chol_counts = pd.crosstab(df['chol_cat'], df['target'], margins=True)
    for idx in chol_labels:
        table1_results.append({
            'Feature': 'chol',
            'Category': idx,
            'No Heart Disease': chol_counts.loc[idx, 0],
            'Heart Disease': chol_counts.loc[idx, 1],
            'Total': chol_counts.loc[idx, 'All']
        })
    
    # 6. FBS - Fasting Blood Sugar
    fbs_counts = pd.crosstab(df['fbs'], df['target'], margins=True)
    table1_results.append({
        'Feature': 'fbs',
        'Category': '<120mg/dl',
        'No Heart Disease': fbs_counts.loc[0, 0],
        'Heart Disease': fbs_counts.loc[0, 1],
        'Total': fbs_counts.loc[0, 'All']
    })
    table1_results.append({
        'Feature': 'fbs',
        'Category': '>120mg/dl',
        'No Heart Disease': fbs_counts.loc[1, 0],
        'Heart Disease': fbs_counts.loc[1, 1],
        'Total': fbs_counts.loc[1, 'All']
    })
    
    # 7. RESTECG - Electrocardiographic results (Categorized into 3)
    restecg_labels = {0: '0=Normal', 1: '1 = ST-T wave abnormality', 2: '2 = LV hypertrophy'}
    restecg_counts = pd.crosstab(df['restecg'], df['target'], margins=True)
    for val in [0, 1, 2]:
        table1_results.append({
            'Feature': 'Restecg',
            'Category': restecg_labels[val],
            'No Heart Disease': restecg_counts.loc[val, 0],
            'Heart Disease': restecg_counts.loc[val, 1],
            'Total': restecg_counts.loc[val, 'All']
        })
    
    # 8. THALACH - Maximum Heart Rate (Categorized into 4)
    thalach_bins = [69, 105, 140, 175, 210]
    thalach_labels = ['70-104', '105-139', '140-174', '175-209']
    df['thalach_cat'] = pd.cut(df['thalach'], bins=thalach_bins, labels=thalach_labels, include_lowest=True)
    
    thalach_counts = pd.crosstab(df['thalach_cat'], df['target'], margins=True)
    for idx in thalach_labels:
        table1_results.append({
            'Feature': 'Thalach',
            'Category': idx,
            'No Heart Disease': thalach_counts.loc[idx, 0],
            'Heart Disease': thalach_counts.loc[idx, 1],
            'Total': thalach_counts.loc[idx, 'All']
        })
    
    # 9. EXANG - Exercise induced angina
    exang_counts = pd.crosstab(df['exang'], df['target'], margins=True)
    table1_results.append({
        'Feature': 'Exang',
        'Category': 'No',
        'No Heart Disease': exang_counts.loc[0, 0],
        'Heart Disease': exang_counts.loc[0, 1],
        'Total': exang_counts.loc[0, 'All']
    })
    table1_results.append({
        'Feature': 'Exang',
        'Category': 'Yes',
        'No Heart Disease': exang_counts.loc[1, 0],
        'Heart Disease': exang_counts.loc[1, 1],
        'Total': exang_counts.loc[1, 'All']
    })
    
    # 10. OLDPEAK - ST depression
    oldpeak_bins = [-0.1, 1.55, 3.01, 4.65, 6.3]
    oldpeak_labels = ['0-1.54', '1.55-3.00', '3.01-4.64', '4.65-6.2']
    df['oldpeak_cat'] = pd.cut(df['oldpeak'], bins=oldpeak_bins, labels=oldpeak_labels, include_lowest=True)
    
    oldpeak_counts = pd.crosstab(df['oldpeak_cat'], df['target'], margins=True)
    for idx in oldpeak_labels:
        table1_results.append({
            'Feature': 'Oldpeak',
            'Category': idx,
            'No Heart Disease': oldpeak_counts.loc[idx, 0],
            'Heart Disease': oldpeak_counts.loc[idx, 1],
            'Total': oldpeak_counts.loc[idx, 'All']
        })
    
    # 11. SLOPE - Slope of peak exercise ST segment (Categorized into 3)
    slope_labels = {0: '1 = Up sloping', 1: '2 = Flat', 2: '3 = Down sloping'}
    slope_counts = pd.crosstab(df['slope'], df['target'], margins=True)
    for val in [0, 1, 2]:
        table1_results.append({
            'Feature': 'Slope',
            'Category': slope_labels[val],
            'No Heart Disease': slope_counts.loc[val, 0],
            'Heart Disease': slope_counts.loc[val, 1],
            'Total': slope_counts.loc[val, 'All']
        })
    
    # 12. CA - Number of major blood vessels 
    ca_counts = pd.crosstab(df['ca'], df['target'], margins=True)
    for val in [0, 1, 2, 3, 4]:
        table1_results.append({
            'Feature': 'ca',
            'Category': str(val),
            'No Heart Disease': ca_counts.loc[val, 0] if val in ca_counts.index else 0,
            'Heart Disease': ca_counts.loc[val, 1] if val in ca_counts.index else 0,
            'Total': ca_counts.loc[val, 'All'] if val in ca_counts.index else 0
        })
    
    # 13. THAL - Thalassemia (Categorized into 4)
    thal_labels = {0: '0', 1: 'Normal', 2: 'Fixed defect', 3: 'Reversible'}
    thal_counts = pd.crosstab(df['thal'], df['target'], margins=True)
    for val in [0, 1, 2, 3]:
        table1_results.append({
            'Feature': 'Thal',
            'Category': thal_labels[val],
            'No Heart Disease': thal_counts.loc[val, 0] if val in thal_counts.index else 0,
            'Heart Disease': thal_counts.loc[val, 1] if val in thal_counts.index else 0,
            'Total': thal_counts.loc[val, 'All'] if val in thal_counts.index else 0
        })
    
    return pd.DataFrame(table1_results)
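
Because the bin labels are written as inclusive ranges ('24-39', '40-54', ...), it is worth checking where values that sit exactly on an edge end up. The sketch below reuses the `age_bins` and `age_labels` from `create_table1` on a handful of synthetic boundary ages and compares the default `right=True` behaviour of `pd.cut` with `right=False`. It only illustrates the boundary semantics; it does not say which convention the paper used.

```python
import pandas as pd

age_bins = [23, 40, 55, 70, 86]
age_labels = ['24-39', '40-54', '55-69', '70-85']
boundary_ages = pd.Series([39, 40, 54, 55, 69, 70])  # synthetic edge cases

# Default (right=True): intervals are (23, 40], (40, 55], (55, 70], (70, 86].
right_closed = pd.cut(boundary_ages, bins=age_bins, labels=age_labels,
                      include_lowest=True).astype(str).tolist()

# right=False: intervals are [23, 40), [40, 55), [55, 70), [70, 86).
left_closed = pd.cut(boundary_ages, bins=age_bins, labels=age_labels,
                     right=False).astype(str).tolist()

for age, a, b in zip(boundary_ages, right_closed, left_closed):
    print(f"age {age}: right=True -> {a:6s} right=False -> {b}")
```

With the default setting, ages 40, 55 and 70 land one category lower than their labels suggest, so the choice of `right` can shift counts between adjacent categories.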

##### **DISCREPANCY FOUND: Feature Encoding**

The paper's Table 1 shows categories labeled 1-4 for `cp` and 1-3 for `slope`, but the actual dataset uses 0-based codes (0-3 and 0-2 respectively).

This is a **missing detail** in the paper: the authors did not document whether the data was recoded or how the original encoding worked.

**Impact:** This does not affect the analysis, but it highlights a lack of transparency about data preprocessing.
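
The recoding assumed in this notebook (dataset code `k` shown under the paper's label `k + 1`) can be made explicit and validated. This is a minimal sketch: `cp_labels` is copied from `create_table1`, while `label_codes` is a hypothetical helper that refuses to silently drop codes that have no label. The correspondence between dataset codes and paper labels remains this notebook's assumption, since the paper does not document it.

```python
import pandas as pd

# Copied from create_table1: dataset code -> label as printed in the paper.
cp_labels = {0: '1 = Typical angina', 1: '2 = Atypical angina',
             2: '3 = Non-anginal pain', 3: '4 = Asymptomatic'}


def label_codes(series, labels):
    """Map integer codes to label strings, failing loudly on unmapped codes."""
    unmapped = sorted(int(c) for c in set(series.dropna().unique()) - set(labels))
    if unmapped:
        raise ValueError(f"codes without a label in {series.name!r}: {unmapped}")
    return series.map(labels)


demo_cp = pd.Series([0, 2, 3, 1], name='cp')  # synthetic codes
labelled = label_codes(demo_cp, cp_labels).tolist()
print(labelled)

# A code outside 0-3 raises instead of turning into NaN.
try:
    label_codes(pd.Series([0, 4], name='cp'), cp_labels)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```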

In [6]:
# Create the table
table1_reproduced = create_table1(df)

# show the table
print("====== REPRODUCED TABLE 1 ======\n")
print(table1_reproduced.to_string(index=False))

# Save to CSV for comparison
table1_reproduced.to_csv('../contents/tables/table1_reproduced.csv', index=False)


 Feature                  Category  No Heart Disease  Heart Disease  Total
     Age                     24-39                23             45     68
     Age                     40-54               155            283    438
     Age                     55-69               318            181    499
     Age                     70-85                 3             17     20
     Sex                      Male               413            300    713
     Sex                    Female                86            226    312
      cp        1 = Typical angina               375            122    497
      cp       2 = Atypical angina                33            134    167
      cp      3 = Non-anginal pain                65            219    284
      cp          4 = Asymptomatic                26             51     77
Trestbps                    90-119               138            191    329
Trestbps                   120-149               287            294    581
Trestbps                

## Summary of Table 01
Successfully reproduced Table 1 structure with all 13 clinical features and their categorical breakdowns. **Total counts match the paper** (N=1,025; 499 no disease, 526 disease), but **individual category distributions differ significantly**.
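
The totals can be cross-checked directly from the reproduced table: within every feature, the category rows should add up to the same column totals. The sketch below does this for the Age, Sex and cp rows copied from the printed output above; on the full table, the same `groupby` would run on `table1_reproduced`.

```python
import pandas as pd

# Rows copied from the printed output above (Age, Sex and cp only).
rows = [
    ('Age', '24-39', 23, 45, 68),
    ('Age', '40-54', 155, 283, 438),
    ('Age', '55-69', 318, 181, 499),
    ('Age', '70-85', 3, 17, 20),
    ('Sex', 'Male', 413, 300, 713),
    ('Sex', 'Female', 86, 226, 312),
    ('cp', '1 = Typical angina', 375, 122, 497),
    ('cp', '2 = Atypical angina', 33, 134, 167),
    ('cp', '3 = Non-anginal pain', 65, 219, 284),
    ('cp', '4 = Asymptomatic', 26, 51, 77),
]
cols = ['Feature', 'Category', 'No Heart Disease', 'Heart Disease', 'Total']
table = pd.DataFrame(rows, columns=cols)

# Each row should satisfy: no disease + disease == total.
row_consistent = bool(
    (table['No Heart Disease'] + table['Heart Disease'] == table['Total']).all()
)

# Each feature's categories should sum to the dataset-wide totals.
totals = table.groupby('Feature')[['No Heart Disease', 'Heart Disease', 'Total']].sum()
print(totals)
print("rows consistent:", row_consistent)
```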


##### Possible Explanations of Distributional Mismatch
1. **Different dataset versions**: The paper uses data from 1988, but multiple versions exist on UCI/Kaggle with different preprocessing
2. **Undocumented data cleaning**: Paper may have removed/modified records without documentation
3. **Binning differences**: Category boundaries may have been applied differently (inclusive vs exclusive)
4. **Sex encoding reversed**: Male/Female counts are exactly swapped, suggesting opposite encoding (0=Male vs 1=Male)
5. **Missing preprocessing details**: Paper doesn't specify handling of outliers, missing values, or data transformation
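
Explanation 4 is the easiest one to test mechanically. The sketch below uses the Sex rows from the reproduced table and a stand-in for the paper's rows. The stand-in is constructed by swapping the labels, which is what explanation 4 describes; it is not transcribed from the paper and should be replaced with the paper's actual Table 1 values. `is_label_swap` is a hypothetical helper.

```python
import pandas as pd

# Sex rows from the reproduced table printed above.
reproduced = pd.DataFrame({
    'Category': ['Male', 'Female'],
    'No Heart Disease': [413, 86],
    'Heart Disease': [300, 226],
}).set_index('Category')

# Stand-in for the paper's rows: same counts with the labels exchanged.
# ASSUMPTION: replace with values transcribed from the paper's Table 1.
paper = reproduced.rename(index={'Male': 'Female', 'Female': 'Male'})


def is_label_swap(a, b, first, second):
    """True if row `first` of `a` equals row `second` of `b`, and vice versa."""
    return bool((a.loc[first] == b.loc[second]).all()
                and (a.loc[second] == b.loc[first]).all())


swapped = is_label_swap(reproduced, paper, 'Male', 'Female')
identical = is_label_swap(reproduced, reproduced, 'Male', 'Female')
print("labels swapped vs. stand-in:", swapped)
print("labels swapped vs. itself:", identical)
```

The same row-by-row comparison could be extended to the binned features to see which mismatches remain once the sex labels are aligned.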

#### REPRODUCED TABLE 01
Here is the reproduced version of <a href="../contents/tables/table1_reproduced.pdf" target="_blank"> Table 1</a>.