# Artrya Demographics Assessment

## Introduction

### Objective
This analysis filters the primary dataset using the inclusion and exclusion criteria files and examines how much demographic detail relevant to coronary artery disease (CAD) diagnosis the data contains. The aim is to see how demographic details are represented and to identify where finer granularity could improve classification and diagnosis.

### Dataset Description
This project utilizes the following data sources:
- **Primary Dataset**: A JSON file named `Artrya_primary_dataset.json`, which contains demographic and clinical data pertinent to CAD diagnosis.
- **Inclusion and Exclusion Criteria**: Excel files (`inclusions.xlsx` and `exclusions.xlsx`) that provide criteria for filtering the primary dataset to ensure the analysis focuses on relevant cases.


In [7]:
import pandas as pd
import json
import os
import matplotlib.pyplot as plt
import seaborn as sns

Read the primary dataset provided

In [None]:
# Load the primary dataset
with open('Artrya_primary_dataset.json') as f:
    primary_data = json.load(f)

# Convert to a DataFrame for easier manipulation
primary_df = pd.DataFrame(primary_data)
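If the JSON records contain nested objects, `pd.DataFrame` keeps each nested object as a single dict-valued column. The sketch below shows how `pd.json_normalize` would flatten such records. It assumes a hypothetical nested layout: the `demographics` key and the sample values are invented for illustration and are not taken from the Artrya file.

```python
import pandas as pd

# Hypothetical records; the real file's layout may differ.
records = [
    {"study_id": "A1", "demographics": {"sex": "F", "age_generated": 61}},
    {"study_id": "A2", "demographics": {"sex": "M", "age_generated": 54}},
]

# Nested keys become dotted column names, e.g. "demographics.sex".
flat_df = pd.json_normalize(records)
print(flat_df.columns.tolist())
```

If the file is already a flat list of records, the plain `pd.DataFrame` call is sufficient.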

In [None]:
# Load the inclusion and exclusion files
inclusion_df = pd.read_excel('inclusions.xlsx')
exclusion_df = pd.read_excel('exclusions.xlsx')

Data exploration to understand the structure of the data

In [None]:
# Display the first few rows of the primary dataset
print(primary_df.head())

# Display the first few rows of the inclusion and exclusion datasets
print(inclusion_df.head())
print(exclusion_df.head())

In [None]:
# Check data types
print(primary_df.dtypes)
print(inclusion_df.dtypes)
print(exclusion_df.dtypes)

In [None]:
# Explore the dataset structure
num_rows, num_columns = primary_df.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")
print("Variable names:", primary_df.columns.tolist())

# Display summary statistics
print(primary_df.describe(include='all'))

Review the unique categories of every categorical variable, then visualize the data with plots suited to each data type.

In [None]:
# Identify unique categories for categorical variables
categorical_cols = primary_df.select_dtypes(include=['object']).columns
unique_categories = {col: primary_df[col].unique() for col in categorical_cols}

# Display unique categories
for col, categories in unique_categories.items():
    print(f"\nUnique categories for {col}:")
    print(categories)

In [None]:
# Visualize age distribution
plt.figure(figsize=(14, 6))
sns.histplot(primary_df['age_generated'], bins=10, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
# Visualize gender distribution
plt.figure(figsize=(8, 5))
sns.countplot(data=primary_df, x='sex', palette='pastel')
plt.title('Gender Distribution')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.xticks(rotation=0)
plt.show()

In [None]:
# Visualize race distribution
plt.figure(figsize=(12, 6))
sns.countplot(data=primary_df, y='race_mapped', palette='muted')
plt.title('Race Distribution')
plt.xlabel('Count')
plt.ylabel('Race')
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.show()

In [None]:
# Visualize ethnicity distribution
plt.figure(figsize=(10, 5))
sns.countplot(data=primary_df, x='ethnicity', palette='Set2')
plt.title('Ethnicity Distribution')
plt.xlabel('Ethnicity')
plt.ylabel('Count')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.xticks(rotation=45)
plt.show()

Now let's look at race and gender by stenosis severity using a tabular summary, and assess the relationship between the numerical calcium score ranges and the corresponding risk categories.
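The tabular summary of race and gender by stenosis severity could be built with `pd.crosstab`, as sketched below. The name of the stenosis column does not appear in this notebook, so `stenosis_severity` is a hypothetical placeholder, and the toy values are invented.

```python
import pandas as pd

# Toy data; `stenosis_severity` is a hypothetical column name.
toy_df = pd.DataFrame({
    "sex": ["F", "M", "M", "F", "M", "F"],
    "race_mapped": ["Asian", "White", "White", "Black", "Asian", "White"],
    "stenosis_severity": ["mild", "severe", "mild", "moderate", "mild", "severe"],
})

# Counts of each severity level within sex, with row and column totals.
sex_by_severity = pd.crosstab(
    toy_df["sex"], toy_df["stenosis_severity"], margins=True
)

# Row-wise proportions make groups of different sizes comparable.
race_by_severity = pd.crosstab(
    toy_df["race_mapped"], toy_df["stenosis_severity"], normalize="index"
)

print(sex_by_severity)
print(race_by_severity.round(2))
```

Row-normalized proportions are usually easier to compare than raw counts when the demographic groups differ a lot in size.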

In [None]:
# Set the figure size
plt.figure(figsize=(12, 8))

# Create a violin plot of calcium scores by risk category
sns.violinplot(x='calcium_score_risk_cat', y='calcium_score_modified', data=primary_df)
# Use a symmetric log scale: a plain log scale cannot display scores of zero
plt.yscale('symlog')

# Add titles and labels
plt.title('Calcium Score Distribution by Risk Category (Symlog Scale)')
plt.xlabel('Calcium Score Risk Category')
plt.ylabel('Calcium Score (Symlog Scale)')
plt.xticks(rotation=45)

# Show the plot
plt.show()

In [None]:
# Group by risk category and calculate descriptive statistics for calcium scores
calcium_score_stats = primary_df.groupby('calcium_score_risk_cat')['calcium_score_modified'].describe()
calcium_score_stats

The distribution of calcium scores across plaque burden categories reveals diverse patterns. "Extensive" and "Moderate" plaque burdens show high variability, with a wide range of unique scores, indicating diverse calcification levels that necessitate individualized clinical assessments. In contrast, "Minimal" and "No Plaque Burden" categories exhibit concentrated scores at low or zero values, reflecting minor or absent calcification. This distribution highlights the complexity within higher burden categories and suggests the need for tailored risk management, while lower burden categories display expected uniformity in calcification absence or minimal presence.
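One way to check that the numerical score ranges agree with the recorded risk categories is to derive a category from the score with `pd.cut` and cross-tabulate it against the recorded one. This is a minimal sketch on invented data. The cut-offs (0, 1-10, 11-100, 101-400, above 400) and the "Mild" label are assumptions and should be checked against the dataset's own definitions.

```python
import pandas as pd

# Toy data using the notebook's column names; values are invented.
toy_df = pd.DataFrame({
    "calcium_score_modified": [0, 5, 50, 250, 900, 120],
    "calcium_score_risk_cat": [
        "No Plaque Burden", "Minimal", "Mild", "Moderate", "Extensive", "Mild",
    ],
})

# Assumed cut-offs; confirm against the dataset's data dictionary.
bins = [-float("inf"), 0, 10, 100, 400, float("inf")]
labels = ["No Plaque Burden", "Minimal", "Mild", "Moderate", "Extensive"]
toy_df["derived_cat"] = pd.cut(
    toy_df["calcium_score_modified"], bins=bins, labels=labels
)

# Off-diagonal cells flag records whose recorded category disagrees
# with the category implied by the numeric score.
agreement = pd.crosstab(toy_df["calcium_score_risk_cat"], toy_df["derived_cat"])
mismatches = toy_df[
    toy_df["calcium_score_risk_cat"] != toy_df["derived_cat"].astype(str)
]

print(agreement)
print(mismatches)
```

In the toy data, the record with a score of 120 is labeled "Mild" but falls in the assumed "Moderate" range, so it shows up as the single mismatch.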

Add an `inclusion_status` column to the primary dataset, based on the inclusion and exclusion criteria files.

In [None]:
# Load the inclusion and exclusion files
inclusion_df = pd.read_excel('inclusions.xlsx')
exclusion_df = pd.read_excel('exclusions.xlsx')

# Inspect the column names to confirm that 'study_id' is present
print("Columns in inclusion_df:", inclusion_df.columns)
print("Columns in exclusion_df:", exclusion_df.columns)

# Display unique IDs to verify
print("Unique Inclusion IDs:", inclusion_df['study_id'].unique())
print("Unique Exclusion IDs:", exclusion_df['study_id'].unique())

# Ensure the study_id column in primary_df is consistent in type
primary_df['study_id'] = primary_df['study_id'].astype(str)
inclusion_df['study_id'] = inclusion_df['study_id'].astype(str)
exclusion_df['study_id'] = exclusion_df['study_id'].astype(str)

# Categorize IDs with vectorized membership tests.
# Inclusion is assigned last so it takes precedence if an ID is in both files.
primary_df['inclusion_status'] = 'to be determined'
is_excluded = primary_df['study_id'].isin(exclusion_df['study_id'])
is_included = primary_df['study_id'].isin(inclusion_df['study_id'])
primary_df.loc[is_excluded, 'inclusion_status'] = 'exclude'
primary_df.loc[is_included, 'inclusion_status'] = 'include'

# Display the first few rows of the updated primary DataFrame
primary_df.head()
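The categorization above gives inclusion precedence whenever a `study_id` appears in both criteria files, so such conflicts pass silently. A possible sanity check, sketched here on invented ID lists, is to list the conflicting IDs and the IDs in the criteria files that never match the primary dataset.

```python
import pandas as pd

# Toy ID lists; "S3" deliberately appears in both criteria lists.
inclusion_ids = pd.Series(["S1", "S2", "S3"], name="study_id")
exclusion_ids = pd.Series(["S3", "S4"], name="study_id")
primary_ids = pd.Series(["S1", "S3", "S4", "S5"], name="study_id")

# IDs listed as both included and excluded need a manual decision.
conflicting_ids = sorted(set(inclusion_ids) & set(exclusion_ids))

# IDs in the criteria files that never appear in the primary dataset.
listed_ids = set(inclusion_ids) | set(exclusion_ids)
unmatched_ids = sorted(listed_ids - set(primary_ids))

print("Conflicting IDs:", conflicting_ids)
print("Unmatched IDs:", unmatched_ids)
```

Unmatched IDs often point to a type or formatting difference, for example an ID read from Excel as `101.0` and from JSON as `101`, which the `astype(str)` conversion does not reconcile.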

In [None]:
# Extract the first run of five consecutive digits from 'mod_patient_id'
primary_df['treatment_site'] = primary_df['mod_patient_id'].str.extract(r'(\d{5})', expand=False)

# Display the first few rows of the modified DataFrame
print(primary_df.head())

# Save the modified DataFrame to a CSV file
primary_df.to_csv('Modified_Artrya_primary_dataset.csv', index=False)
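The pattern `(\d{5})` matches the first run of five consecutive digits anywhere in the ID, which is not necessarily the first five characters. The sketch below compares it with an anchored pattern. The ID formats are hypothetical, since the real format of `mod_patient_id` is not shown here.

```python
import pandas as pd

# Hypothetical ID formats, invented for illustration.
ids = pd.Series(["12345-0007", "AB98765X", "1234-567890", None])

# Unanchored: first run of five consecutive digits anywhere in the string.
anywhere = ids.str.extract(r"(\d{5})", expand=False)

# Anchored: matches only when the ID starts with five digits.
leading = ids.str.extract(r"^(\d{5})", expand=False)

print(pd.DataFrame({"id": ids, "anywhere": anywhere, "leading": leading}))
```

If the site code is always the leading five digits, the anchored pattern is the safer choice because malformed IDs yield a missing value instead of digits taken from elsewhere in the string.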

### Key Findings
The analysis shows that the demographic fields in the current dataset are too coarse for optimal classification. Recording them at a finer level would improve the understanding and classification of individuals with CAD.
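Granularity can be made concrete by counting the distinct categories in each demographic column and measuring how much of the data falls into the largest one. The following sketch uses the notebook's column names with invented values, so the printed figures say nothing about the real dataset.

```python
import pandas as pd

# Toy data with the notebook's demographic column names; values are invented.
toy_df = pd.DataFrame({
    "sex": ["F", "M", "M", "F", "M", "M"],
    "race_mapped": ["Asian", "White", "White", "White", "Other", "White"],
    "ethnicity": ["Not Hispanic", "Not Hispanic", "Hispanic", None,
                  "Not Hispanic", "Not Hispanic"],
})

rows = []
for col in ["sex", "race_mapped", "ethnicity"]:
    # Share of each category among non-missing values, largest first.
    shares = toy_df[col].value_counts(normalize=True, dropna=True)
    rows.append({
        "column": col,
        "n_categories": toy_df[col].nunique(),
        "top_category": shares.index[0],
        "top_share": round(shares.iloc[0], 2),
        "missing": int(toy_df[col].isna().sum()),
    })

granularity = pd.DataFrame(rows).set_index("column")
print(granularity)
```

A column with few categories and a dominant top share carries little information for distinguishing subgroups, which is one way to support the finding with numbers.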

When designing a study that collects detailed race and ethnicity data, it is crucial to consider both the scientific objectives and the social and cultural context of the study population. A culturally sensitive approach to collecting these data should include:

- Detailed subject self-identification.
- Additional options such as "other" for flexibility.
- Expansion of sub-options for various ethnic groups, such as:
  - Asian: Chinese, Indian, Filipino, Vietnamese, Korean, Japanese, etc.
  - Hispanic or Latino: Mexican, Puerto Rican, Cuban, Salvadoran, Dominican, etc.
  - African: Nigerian, Ethiopian, Somali, Ghanaian, etc.
  - Middle Eastern or North African: Egyptian, Iranian, Syrian, Lebanese, etc.

By implementing such a detailed approach, the study can achieve a more nuanced understanding of health disparities, cultural influences, and social determinants of health. This careful categorization can ultimately lead to more targeted and effective interventions.
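The sub-options listed above could be stored as a two-level coding scheme, so that detailed self-identified responses can still be rolled up to broad categories for analysis. This is a sketch only: the `SUBGROUPS` mapping repeats just the examples from the list, and the `roll_up` helper is a hypothetical name. A real study would need a complete, validated code list.

```python
# Broad categories and example subgroups, taken from the list above.
SUBGROUPS = {
    "Asian": ["Chinese", "Indian", "Filipino", "Vietnamese", "Korean", "Japanese"],
    "Hispanic or Latino": ["Mexican", "Puerto Rican", "Cuban", "Salvadoran", "Dominican"],
    "African": ["Nigerian", "Ethiopian", "Somali", "Ghanaian"],
    "Middle Eastern or North African": ["Egyptian", "Iranian", "Syrian", "Lebanese"],
}

# Reverse lookup: detailed response (lower-cased) -> broad category.
TO_BROAD = {
    detail.lower(): broad
    for broad, details in SUBGROUPS.items()
    for detail in details
}


def roll_up(self_identified: str) -> str:
    """Map a detailed self-identified response to its broad category.

    Responses not covered by the scheme fall back to "Other".
    """
    return TO_BROAD.get(self_identified.strip().lower(), "Other")


responses = ["Korean", "  cuban ", "Somali", "Samoan"]
print([(r.strip(), roll_up(r)) for r in responses])
```

Storing the detailed response and deriving the broad category from it keeps the fine-grained information available while staying compatible with coarser reporting categories.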

**Author:**

**Fabian Msafiri**