# Dataset (Population) Differences

We now investigate how the populations differ between datasets, along six dimensions:

1. **Dataset Size & Completeness**: How many observations each dataset has and the percentage of rows with no missing values.
2. **Feature-Level Missingness**: Which features are missing in each dataset, and in what quantities.
3. **CVD Class Distribution**: Test whether disease severity differs across populations using a chi-square test (*super important and useful to know*).
4. **Numeric Features**: Compare age, blood pressure, cholesterol, etc. across datasets using Kruskal-Wallis tests (a non-parametric alternative to one-way ANOVA).
5. **Categorical Features**: Compare sex, chest pain type, etc. across datasets using chi-square tests.
6. **Correlation Structure**: Check whether the relationships between features differ by population.
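
As a rough illustration of the two tests named in the list above, here is a minimal sketch on synthetic data. The column names (`dataset`, `age`, `cvd_class`) and the generated values are assumptions made for the example, not the real datasets; only `scipy.stats.kruskal` and `scipy.stats.chi2_contingency` are real library calls.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, kruskal

rng = np.random.default_rng(0)

# Synthetic stand-ins for two populations (hypothetical, NOT the real data).
toy = pd.DataFrame({
    "dataset": ["A"] * 100 + ["B"] * 100,
    "age": np.concatenate([rng.normal(54, 9, 100), rng.normal(60, 9, 100)]),
    "cvd_class": np.concatenate([rng.integers(0, 3, 100), rng.integers(0, 3, 100)]),
})

# Kruskal-Wallis: does the distribution of a numeric feature differ by dataset?
groups = [g["age"].to_numpy() for _, g in toy.groupby("dataset")]
h_stat, p_numeric = kruskal(*groups)

# Chi-square test of independence: does the class mix differ by dataset?
contingency = pd.crosstab(toy["dataset"], toy["cvd_class"])
chi2, p_class, dof, expected = chi2_contingency(contingency)

print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_numeric:.4f}")
print(f"Chi-square={chi2:.2f}, dof={dof}, p={p_class:.4f}")
```

A small p-value suggests the feature (or class mix) is distributed differently across datasets. The chi-square approximation is usually considered unreliable when expected cell counts are very small, so `expected` is worth inspecting before trusting the result.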

This analysis informs the multi-class CVD prediction model in a few main ways:

- Population differences may require different feature weights.
- Differences in CVD prevalence across datasets mean you'll need stratified sampling.
- Common risk factors for cardiovascular disease can be validated by checking which features are consistently associated with CVD class across all datasets.
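
To make the stratified sampling point concrete, the sketch below draws the same fraction from every class so that the class shares in the held-out set match the full data. The frame and its `cvd_class` column are hypothetical toy data, and using `groupby(...).sample(...)` is just one possible approach, not necessarily the one used later in this analysis.

```python
import pandas as pd

# Toy imbalanced labels (hypothetical, NOT the real data).
toy = pd.DataFrame({
    "feature": range(100),
    "cvd_class": [0] * 60 + [1] * 30 + [2] * 10,
})

# Draw 20% of each class for the test set, so class shares are preserved.
test = toy.groupby("cvd_class").sample(frac=0.2, random_state=42)
train = toy.drop(test.index)

test_counts = test["cvd_class"].value_counts().sort_index().to_dict()
print(test_counts)
print(len(train), len(test))
```

A plain random 20% draw could easily leave the rarest class with zero or one test row; sampling within each class avoids that.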

In [None]:
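import numpy as np
import pandas as pd

from load_datasets import dfs

# NOTE: `dataset_names` is assumed to be defined in an earlier cell.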

overview_stats = []
for name, _df in zip(dataset_names, dfs):
    analyzable_df = _df.replace('?', np.nan)  # Important: '?' placeholders are not counted as missing unless converted to NaN

    total_cells = analyzable_df.shape[0] * analyzable_df.shape[1]
    missing_cells = analyzable_df.isnull().sum().sum()
    complete_cases = (~analyzable_df.isnull().any(axis=1)).sum()
    
    overview_stats.append({
        'Dataset': name,
        'N_Observations': analyzable_df.shape[0],
        'N_Features': analyzable_df.shape[1],
        'Complete_Cases': complete_cases,
        'Complete_Cases_%': (complete_cases / analyzable_df.shape[0] * 100),
        'Missing_Cells': missing_cells,
        'Missing_%': (missing_cells / total_cells * 100)
    })

overview_df = pd.DataFrame(overview_stats)

overview_df.round()
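
To see why the `'?'` replacement step matters, here is a small sketch on a toy frame. The column names and values are made up for illustration; the point is only that `isnull()` ignores string placeholders until they are converted to `NaN`.

```python
import numpy as np
import pandas as pd

# Toy frame using '?' as a missing-value placeholder (hypothetical values).
toy = pd.DataFrame({"chol": [233, "?", 250], "thal": ["?", "3.0", "?"]})

# Without the replacement, the placeholders look like ordinary values.
missing_before = int(toy.isnull().sum().sum())

# After the replacement, they are counted as missing.
missing_after = int(toy.replace("?", np.nan).isnull().sum().sum())

print(missing_before, missing_after)
```

Skipping the replacement would make every dataset look fully complete, which would inflate both `Complete_Cases_%` and the apparent data quality.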