<a href="https://colab.research.google.com/github/22053604/Cardiovascular-Health-Analysis/blob/main/Business%20Report%20Colab%20notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Business Notebook: **Cardiovascular Health Analysis**




# **1. Overview**

In this assessment, we analyzed a cardiovascular health dataset to understand how demographic, clinical, and lifestyle factors relate to cardiovascular disease (CVD) risk. Using Python in Google Colab, we performed data cleaning, preprocessing, and exploratory data analysis (EDA) on the dataset. We focused on key combinations of factors (e.g., age, cholesterol, smoking status) and examined how these relate to the likelihood of developing CVD. To support our analysis, we generated visualizations such as bar charts and scatter plots, revealing trends and insights into cardiovascular health. Additionally, we followed a structured process to ensure that the code was organized, efficient, and tested. The results from this analysis are intended to provide actionable recommendations for reducing CVD risk, as well as insights into patterns in cardiovascular health.


# **2. Problem Solving Methodology**


# Problem Statement
Cardiovascular disease (CVD) is one of the leading causes of death globally, affecting millions of individuals each year. Given the complex interaction of demographic factors (such as age and gender), clinical factors (such as blood pressure and cholesterol levels), and lifestyle factors (such as smoking and physical activity), there is a pressing need to identify patterns that could help predict and prevent CVD. This assessment seeks to investigate the relationships between these variables and cardiovascular health outcomes, with a particular focus on determining which factor combinations are the most significant predictors of CVD. By understanding these relationships, we aim to support healthcare professionals and policymakers in developing targeted interventions to reduce the prevalence of CVD.

# Objectives
The objective of this assessment is to conduct a comprehensive analysis of a cardiovascular health dataset to uncover patterns and correlations between demographic, clinical, and lifestyle factors and the risk of developing cardiovascular disease. Specifically, the analysis will:
1. Identify the most significant demographic, clinical, and lifestyle predictors of cardiovascular disease.
2. Analyze different combinations of factors to determine which contribute most to CVD risk.
3. Provide actionable insights that can inform public health strategies to prevent and manage CVD.
4. Generate visualizations that clearly present the relationships between these factors and cardiovascular health, supporting the interpretation of the results.

# **3. Dataset and Factor Selection**
The dataset includes information such as age, gender, cholesterol levels, glucose levels, smoking status, physical activity, alcohol consumption, and the presence of cardiovascular disease (CVD).

After loading the data from the medical_examination table in the SQLite database, we conducted data cleaning by removing duplicates and converting age from days into years. The factors were then categorized into more readable formats. For example, age was grouped into bins, cholesterol and glucose levels were categorized into "Normal," "Above Normal," and "Well Above Normal," and lifestyle factors such as smoking, alcohol consumption, and physical activity were transformed into understandable binary labels.
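The loading and cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the in-memory database, the sample rows, and the choice of dividing by 365 are assumptions made so the example runs on its own.

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory stand-in for the real database (sample rows are invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE medical_examination "
    "(id INTEGER, age INTEGER, gender INTEGER, cardio INTEGER)"
)
rows = [
    (1, 18250, 1, 0),  # 18250 days = 50 years at 365 days per year
    (1, 18250, 1, 0),  # exact duplicate of the row above
    (2, 21900, 2, 1),  # 21900 days = 60 years at 365 days per year
]
conn.executemany("INSERT INTO medical_examination VALUES (?, ?, ?, ?)", rows)

# Load the table into a DataFrame.
df = pd.read_sql_query("SELECT * FROM medical_examination", conn)
conn.close()

# Drop exact duplicate rows, then convert age from days to whole years.
# Assumption: a year is treated as 365 days; the report does not state the divisor.
df_cleaned = df.drop_duplicates().copy()
df_cleaned["age_years"] = df_cleaned["age"] // 365

print(df_cleaned)
```

Working on a copy after `drop_duplicates()` keeps the raw DataFrame intact, which makes it easier to compare row counts before and after cleaning.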

# Factor Combinations:
We selected specific factor combinations to conduct a thorough analysis of how demographic, clinical, and lifestyle factors impact cardiovascular disease risk. The combinations chosen include:

1. Demographic + Lifestyle:
* Age Group + Smoking Status: This combination helps analyze how smoking behavior affects CVD risk across different age groups.
2. Demographic + Clinical:
* Gender + Cholesterol: By analyzing gender along with cholesterol levels, we examine how these clinical factors impact CVD differently for men and women.
3. Combined:
* Age Group + Blood Pressure + Physical Activity: This combination explores the interaction between age, blood pressure, and physical activity, revealing trends in how these variables together influence CVD risk.
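As a rough sketch of how one of these combinations can be summarized, the example below computes the share of individuals with CVD for each Age Group + Smoking Status pair. The sample rows are invented for illustration; the column names `age_group`, `smoke`, and `cardio` follow the names used elsewhere in this notebook.

```python
import pandas as pd

# Invented sample rows; 'smoke' and 'cardio' use the 0/1 coding described in this report.
sample = pd.DataFrame({
    "age_group": ["40-49", "40-49", "50-59", "50-59", "50-59", "40-49"],
    "smoke":     [0, 1, 0, 1, 1, 0],
    "cardio":    [0, 1, 1, 1, 0, 0],
})

# The mean of a 0/1 outcome within a group is the share of that group with CVD.
# Reporting the group size alongside it shows how much data each rate rests on.
cvd_rate = (
    sample.groupby(["age_group", "smoke"])["cardio"]
    .agg(cvd_rate="mean", n="size")
    .reset_index()
)

print(cvd_rate)
```

Including the group size `n` is a deliberate choice: a high rate in a group with only a handful of records should be read with caution.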

# **4. Input and Output Description**

***Input Variables***

**Age:**
1. Converted from days to years for easier interpretation.
2. Grouped into age ranges (e.g., 0-9, 10-19, etc.) to observe patterns across different life stages.

**Gender:** Mapped from categorical values of 1 (Female) and 2 (Male).

**Cholesterol:** A categorical variable that has been transformed into three distinct levels:
* "Normal"
* "Above Normal"
* "Well Above Normal"

**Glucose:** Similarly to cholesterol, glucose levels were mapped into:
* "Normal"
* "Above Normal"
* "Well Above Normal"

**Smoking Status:** Mapped from binary values:
* 0 (Non-smoker)
* 1 (Smoker)

**Alcohol Consumption:** Coded as a binary variable:
* 0 (Non-drinker)
* 1 (Drinker)

**Physical Activity:** Coded as a binary variable:
* 0 (Inactive)
* 1 (Active)

**Cardiovascular Disease Presence:** A binary variable that indicates whether the individual has been diagnosed with cardiovascular disease:
* 0 (No Disease)
* 1 (Disease)
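One way to apply these code-to-label mappings safely is sketched below. The helper `add_label_column` is hypothetical (it is not part of the notebook); it writes labels to a new column so the numeric codes stay available, and it raises an error if a code has no label instead of silently producing missing values.

```python
import pandas as pd

def add_label_column(df, column, mapping, new_column):
    """Return a copy of df with a readable label column added.

    Hypothetical helper: raises ValueError if any code in `column`
    is missing from `mapping`, rather than leaving NaN labels behind.
    """
    labels = df[column].map(mapping)
    unmapped = df.loc[labels.isna() & df[column].notna(), column].unique().tolist()
    if unmapped:
        raise ValueError(f"Unmapped codes in {column!r}: {sorted(unmapped)}")
    out = df.copy()
    out[new_column] = labels
    return out

# Invented sample rows using the 0/1 smoking codes listed above.
sample = pd.DataFrame({"smoke": [0, 1, 1, 0]})
labeled = add_label_column(
    sample, "smoke", {0: "Non-smoker", 1: "Smoker"}, "smoking_status"
)
print(labeled)
```

A check like this would catch a mapping written for the wrong codes (for example 1 and 2 instead of 0 and 1) at the point where it is applied.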

***Output Variables:***

**Cardiovascular Disease Prediction:**
1. The primary output is a binary prediction that identifies whether an individual is likely to have cardiovascular disease based on the combination of the above input variables.
2. This allows for a simplified "Yes" (1) or "No" (0) outcome regarding disease presence.

**Patterns and Trends:**
Grouped analyses to detect correlations between various factors and cardiovascular disease, focusing on:
1. Age Group: Correlation between different age ranges and cardiovascular disease incidence.
2. Gender Differences in Smoking Status: Analysis of smoking patterns between males and females.
3. Cholesterol Levels and Physical Activity: Exploration of how cholesterol levels interact with physical activity, along with gender-specific trends.
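The gender comparison of smoking status can be illustrated with a normalized cross-tabulation. The sketch below uses invented rows and assumes `gender` has already been mapped to labels and `smoke` keeps its 0/1 codes.

```python
import pandas as pd

# Invented sample rows for illustration only.
sample = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "male", "female"],
    "smoke":  [0, 0, 1, 0, 1, 1],
})

# normalize="index" turns counts into row-wise shares, so each gender's
# row sums to 1 and groups of different sizes can be compared directly.
smoking_share = pd.crosstab(sample["gender"], sample["smoke"], normalize="index")

print(smoking_share)
```

Shares rather than raw counts are used here because the two genders are unlikely to be equally represented in a real dataset.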

# **5. Analysis Process**

**1. Data Cleaning and Transformation**
As part of our data preparation process, we first addressed any inconsistencies by removing duplicate entries and handling missing values, ensuring that the dataset was ready for accurate analysis. To improve the clarity and interpretation of the data, we applied the following transformations:
* Age: Converted from days to years, and then grouped into age categories (e.g., 20–29, 30–39, etc.) to allow for easier comparison across different age groups.
* Cholesterol and Glucose: Both were categorized into three levels (Normal, Above Normal, and Well Above Normal), providing a clearer indication of cardiovascular risk.
* Gender: Mapped from the codes 1 (Female) and 2 (Male) to readable labels to ensure consistency in analysis.
* Lifestyle Factors (Smoking Status, Alcohol Consumption, Physical Activity): These were represented as binary variables, with 0 indicating non-smokers, non-drinkers, or inactive individuals, and 1 for smokers, drinkers, or those who are physically active.
This transformation process made the dataset more interpretable and allowed us to perform a more insightful analysis of cardiovascular disease trends based on these factors.
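The age grouping step can be sketched with `pandas.cut`. The example below is a minimal illustration with invented ages; it uses `right=False` so that each bin includes its lower bound and excludes its upper bound, which means a person aged exactly 40 lands in 40-49 rather than 30-39.

```python
import pandas as pd

# Ten-year bin edges from 0 to 100, and labels such as "40-49".
age_bins = list(range(0, 101, 10))
age_labels = [f"{lower}-{lower + 9}" for lower in age_bins[:-1]]

# Invented ages chosen to sit on either side of bin boundaries.
ages = pd.Series([39, 40, 59, 60], name="age_years")

# right=False makes each interval [lower, upper): closed on the left, open on the right.
age_group = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)

print(pd.DataFrame({"age_years": ages, "age_group": age_group}))
```

Because the last interval is [90, 100), an age of exactly 100 would fall outside every bin and come back as missing, which is worth checking if such values can occur.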

**2. Visualization and Analysis**
Once the data was cleaned and transformed, we used a range of visualizations to explore relationships between cardiovascular disease and key risk factors. We incorporated widgets for age group and gender. These widgets allowed dynamic filtering of the dataset, enabling more detailed exploration of cardiovascular disease trends across different demographics. Below are the visualizations we used and the insights derived from each:
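The filtering behind such widgets can be sketched as a plain function that takes the selected age group and gender. The function name `filter_by_demographics` and the "All" option are assumptions for illustration, not the notebook's actual implementation.

```python
import pandas as pd

def filter_by_demographics(df, age_group="All", gender="All"):
    """Return the rows matching the selected age group and gender.

    Hypothetical helper: "All" means no filter is applied for that field.
    """
    mask = pd.Series(True, index=df.index)
    if age_group != "All":
        mask &= df["age_group"] == age_group
    if gender != "All":
        mask &= df["gender"] == gender
    return df[mask]

# Invented sample rows for illustration only.
sample = pd.DataFrame({
    "age_group": ["40-49", "50-59", "50-59", "60-69"],
    "gender":    ["female", "male", "female", "male"],
    "cardio":    [0, 1, 1, 1],
})

subset = filter_by_demographics(sample, age_group="50-59", gender="female")
print(subset)

# In a notebook, a function like this could be connected to dropdowns, for example
# with ipywidgets.interact(show, age_group=[...], gender=[...]), where `show` is a
# small wrapper that calls filter_by_demographics and displays or plots the result.
```

Keeping the filter separate from the widget code makes it possible to test the filtering logic without a notebook front end.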

* Age Group vs. Cardiovascular Disease: To examine how age impacts cardiovascular disease prevalence, we plotted age groups against the presence of cardiovascular disease. As expected, older age groups (especially 50+) showed a higher prevalence. However, we also noticed that younger groups (particularly 30–39) exhibited a significant risk when other factors like high cholesterol or smoking were present.
* Gender vs. Smoking Status: A bar chart was created to explore gender differences in smoking habits. Our analysis revealed that males tend to have higher smoking rates, which correlates with a slightly higher cardiovascular disease prevalence among men compared to women.
* Cholesterol vs. Physical Activity by Gender: This visualization helped us explore the interaction between cholesterol levels, physical activity, and gender. It became clear that individuals with high cholesterol who were also physically inactive, especially males, had a substantially higher risk of cardiovascular disease.
* Correlation Heatmap: To gain a deeper understanding of the relationships between the numeric variables in our dataset, we generated a correlation matrix. The heatmap highlighted strong positive correlations between factors like cholesterol levels, glucose levels, age, and the presence of cardiovascular disease. This allowed us to identify the most influential factors contributing to cardiovascular health outcomes.
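The matrix behind a correlation heatmap can be computed as in the sketch below. The rows are invented, and the example assumes the numeric codes are still available, since columns that have been mapped to string labels cannot take part in a correlation.

```python
import pandas as pd

# Invented sample rows; 'gender_label' is a string column that must be excluded.
sample = pd.DataFrame({
    "age_years":    [35, 42, 51, 58, 63, 47],
    "cholesterol":  [1, 1, 2, 3, 3, 2],
    "gender_label": ["female", "male", "male", "female", "male", "female"],
    "cardio":       [0, 0, 1, 1, 1, 0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric = sample.select_dtypes(include="number")
corr = numeric.corr()

# Rank the other variables by their correlation with the CVD indicator.
cardio_corr = corr["cardio"].drop("cardio").sort_values(ascending=False)

print(corr.round(2))
print(cardio_corr.round(2))
```

Note that Pearson correlation treats ordinal codes such as 1, 2, 3 as evenly spaced numbers, so the coefficients for cholesterol and glucose are best read as rough indicators of direction and strength.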

# **6. Evaluation (Gen AI)**




# Gen AI Assistance 1: Combining Code and Debugging

## **1. Gen AI Prompt**

**We approached Gen AI with the following request:**

"Can you help combine our two codes together? We each tried to do the code on our own and later combined it, but we ran into issues such as coding incompatibilities. We keep seeing this error:

ops.pyx in pandas._libs.ops.scalar_compare()
TypeError: '>' not supported between instances of 'str' and 'int'"

### **2. Gen AI Response**

**The AI provided the following analysis and suggestion:**

The error message "TypeError: '>' not supported between instances of 'str' and 'int'" arises in the line:

    age_cholesterol_combination = df_cleaned[(df_cleaned['age_years'] > 50) & (df_cleaned['cholesterol'] > 1)]




This indicates a type mismatch in the comparison operation. Specifically, you're likely attempting to compare a string value in the 'cholesterol' column with the integer value 1 using the '>' operator. This operation is invalid in Python, as strings and integers cannot be directly compared using numerical operators.

In the code, the 'cholesterol' column is transformed using the map function, replacing its numerical values (1, 2, 3) with string labels ('Normal', 'Above Normal', 'Well Above Normal'). Consequently, when you use df_cleaned['cholesterol'] > 1, it's trying to compare these string labels with the integer 1, which causes the TypeError.

Suggested Change:

    demographic_lifestyle_combination = df_cleaned[(df_cleaned['gender'] == 1) & (df_cleaned['smoke'] == 1)]
    print(demographic_lifestyle_combination.head())

### **3. Gen AI Response Effectiveness**

The Gen AI response was very effective. After implementing the suggested changes and fixing the code according to the feedback, it worked as intended. The AI accurately identified the root cause of the error and clearly explained the type mismatch, which helped us fix it.


### **4. Modification Needed/Implementation**

Based on the AI's guidance, we implemented the following modifications:


* Compare 'cholesterol' with string labels instead of numerical values

This change addressed the type mismatch issue by comparing the 'cholesterol' column with string labels instead of numerical values. We also added a further filter on age, blood pressure, and activity level to refine our analysis.


In [None]:
#Implemented suggested changes from Gen AI
age_cholesterol_combination = df_cleaned[(df_cleaned['age_years'] > 50) & (df_cleaned['cholesterol'] != 'Normal')]
print(age_cholesterol_combination.head())

age_bp_activity = df_cleaned[(df_cleaned['age_years'] > 40) & (df_cleaned['ap_hi'] > 120) & (df_cleaned['active'] == 1)]
print(age_bp_activity.head())
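An alternative to comparing against individual labels, shown here only as a sketch and not as what the notebook does, is to store the cholesterol labels as an ordered categorical. pandas then allows order comparisons such as "greater than Normal" directly on the labels. The sample rows are invented.

```python
import pandas as pd

# The three levels in ascending order of severity.
levels = ["Normal", "Above Normal", "Well Above Normal"]

# Invented sample rows for illustration only.
sample = pd.DataFrame({
    "age_years":   [45, 55, 62, 58],
    "cholesterol": ["Normal", "Above Normal", "Well Above Normal", "Normal"],
})

# An ordered categorical keeps the readable labels but also records their order.
sample["cholesterol"] = pd.Categorical(
    sample["cholesterol"], categories=levels, ordered=True
)

# Order comparison against a label now works instead of raising a TypeError.
elevated = sample[(sample["age_years"] > 50) & (sample["cholesterol"] > "Normal")]

print(elevated)
```

This keeps the intent of the original `> 1` filter (anything above the lowest level) while still displaying readable labels in tables and plots.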


# Gen AI Assistance 2: Resolving Plotting Error for Numeric Data

**1. Gen AI Prompt**

We encountered the following error after cleaning our data (smoker = 1 & 0):

    _compute_plot_data(self)
        696         # no non-numeric frames or series allowed
        697         if is_empty:
    --> 698             raise TypeError("no numeric data to plot")
        699
        700         self.data = numeric_data.apply(type(self)._convert_to_ndarray)
    TypeError: no numeric data to plot


We asked Gen AI for assistance in resolving this issue.

**2. Gen AI Response**


The AI provided the following analysis and suggestions:

"The error message 'TypeError: no numeric data to plot' indicates that the data being passed to the plot function does not contain any numeric values for plotting. This is likely happening because you are mapping values to strings in your smoke column, which would make your gender_smoke_data DataFrame contain only string (object) data type columns and no numeric data. matplotlib requires numeric values to create a plot.

Suggested Changes:

1. Preserve the numeric values for the 'smoke' column.
2. Create a new column 'smoking_status' to store string data.
3. Modify the grouping and plotting code to use the numeric 'smoke' column."

### 3. Gen AI Response Effectiveness

The Gen AI response was very effective. After implementing the suggested changes, the code worked as intended. The AI accurately identified the root cause of the error and provided a clear solution to preserve numeric data for plotting while still maintaining readable categories for analysis.

### 4. Modification Needed/Implementation

Based on the AI's guidance, we implemented the following key modifications:

1. Kept the original 'smoke' column as numeric (1 for smoker, 0 for non-smoker).
2. Created a new 'smoking_status' column with string labels ('smoker', 'non-smoker').
3. Modified the grouping and plotting code to use the numeric 'smoke' column instead of the string 'smoking_status'.

These changes allowed us to maintain numeric data for plotting while still having readable categories for other analyses. The plot was successfully generated, showing the cardiovascular risk by gender and smoking status.

In [None]:
# Implemented the suggested changes by AI:
import pandas as pd
import matplotlib.pyplot as plt

# Group age into 10-year bins; right=False makes each bin include its lower bound
age_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
age_labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90-100']
df['age_group'] = pd.cut(df['age_years'], bins=age_bins, labels=age_labels, right=False)

# Transform gender codes into readable labels (1 = female, 2 = male)
df['gender'] = df['gender'].map({1: 'female', 2: 'male'})

# Keep 'smoke' numeric (1 = smoker, 0 = non-smoker) and store the readable
# labels in a separate 'smoking_status' column
df['smoking_status'] = df['smoke'].map({1: 'smoker', 0: 'non-smoker'})

# Transform cholesterol levels into readable categories
df['cholesterol'] = df['cholesterol'].map({1: 'Normal', 2: 'Above Normal', 3: 'Well Above Normal'})

# Plot the mean of the numeric 'cardio' column for each gender/smoke group;
# the plotted values must be numeric, the grouping keys need not be
gender_smoke_data = df.groupby(['gender', 'smoke'])['cardio'].mean().unstack()
gender_smoke_data.plot(kind='bar', stacked=True)
plt.title('Cardiovascular Risk by Gender and Smoking Status')
plt.xlabel('Gender')
plt.ylabel('Cardiovascular Risk')
plt.show()
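The point that matters for plotting is that the values being plotted are numeric; the grouping keys themselves may be labels. The sketch below, using invented rows and the 0/1 smoking codes from the Input Variables section, groups by the readable `smoking_status` column and checks that the resulting table is numeric before it would be passed to a plot call.

```python
import pandas as pd

# Invented sample rows; 'smoke' and 'cardio' use 0/1 codes.
sample = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "smoke":  [0, 1, 0, 1],
    "cardio": [0, 1, 1, 1],
})

# Readable labels go into a new column, so the numeric 'smoke' column is preserved.
sample["smoking_status"] = sample["smoke"].map({0: "non-smoker", 1: "smoker"})

# Grouping by label columns is fine: the aggregated 'cardio' mean is numeric.
risk = sample.groupby(["gender", "smoking_status"])["cardio"].mean().unstack()

# A quick guard against the "no numeric data to plot" situation.
is_plottable = all(pd.api.types.is_numeric_dtype(dtype) for dtype in risk.dtypes)

print(risk)
print("Numeric data available for plotting:", is_plottable)
```

Grouping on the label column has the side benefit that the plot legend shows "smoker" and "non-smoker" instead of 1 and 0.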


# Gen AI Assistance 3: Resolving Visualization Error for Age Group Data

**1. Gen AI Prompt**
We encountered an issue when creating graph visualizations and asked AI to help identify the error and provide suggestions for changes.

**2. Gen AI Response**

The AI provided the following analysis and suggestions:

The error message KeyError: 'age_group' indicates that the column 'age_group' is not found in the DataFrame df that you are passing to the create_visualizations function. This is likely because the data processing and cleaning steps that create the 'age_group' column (like binning and labeling the 'age' column) were performed on a different DataFrame called df_cleaned, but the original df was used in the visualization function.

Suggested solutions:

Pass df_cleaned to the function: The create_visualizations function should be called with df_cleaned instead of df, as df_cleaned contains the 'age_group' column after the data processing steps.

Error Handling: Add a try-except block to the create_visualizations function to handle potential KeyError exceptions. If either 'age_group' or 'cardio_disease_presence' column is missing, it will print an informative message to help diagnose the issue and prevent the script from crashing.

Here's the suggested code:

    # Function to create visualizations
    def create_visualizations(df):
        # Age Group vs Cardiovascular Disease
        plt.figure(figsize=(10, 6))

        # Ensure the input DataFrame has 'age_group' and 'cardio_disease_presence' columns.
        # If not, the try-except block will print a message and skip plotting
        try:
            age_disease_counts = df.groupby(['age_group', 'cardio_disease_presence']).size().unstack()
            age_disease_counts.plot(kind='bar', stacked=True, color=['skyblue', 'salmon'])
        except KeyError as e:
            print(f"Error: Missing column(s) for visualization: {e}")
            return  # Exit the function early if necessary columns are missing

        plt.title('Age Group vs Cardiovascular Disease')
        plt.xlabel('Age Group')
        plt.ylabel('Count')
        plt.xticks(rotation=45)
        plt.legend(title='Cardiovascular Disease Presence')
        plt.show()

Call the function with the DataFrame containing 'age_group' column:

        create_visualizations(df_cleaned) #Pass df_cleaned instead of df


### **3. Gen AI Response Effectiveness**

The Gen AI response was very effective. The AI accurately identified the likely cause of the error and provided a clear solution with explanations for each change.

### **4. Modification Needed/Implementation**

Based on the AI's guidance, we implemented the following key modifications:

1. Updated the function call to use df_cleaned instead of df:

        create_visualizations(df_cleaned)

2. Added error handling within the create_visualizations function:

        try:
            age_disease_counts = df.groupby(['age_group', 'cardio_disease_presence']).size().unstack()
            age_disease_counts.plot(kind='bar', stacked=True, color=['skyblue', 'salmon'])
        except KeyError as e:
            print(f"Error: Missing column(s) for visualization: {e}")
            return

These changes allowed the visualization function to work with the correct DataFrame (df_cleaned) that contains the 'age_group' column. The added error handling also provides informative messages if any required columns are missing, improving the robustness of the code.
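A complementary approach, offered here as a sketch and not as part of the AI's suggestion, is to check for the required columns up front and report all missing ones at once. The helper name `missing_columns` is hypothetical; the column names are the ones used by `create_visualizations`.

```python
import pandas as pd

REQUIRED_COLUMNS = ["age_group", "cardio_disease_presence"]

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]

# Invented frames standing in for the raw and the cleaned data.
df_raw = pd.DataFrame({"age": [18250], "cardio": [0]})
df_ready = pd.DataFrame({"age_group": ["50-59"], "cardio_disease_presence": ["No Disease"]})

print(missing_columns(df_raw))    # lists both required columns
print(missing_columns(df_ready))  # empty list, safe to plot
```

An explicit check names every missing column in one message, whereas a `KeyError` raised inside `groupby` typically reports only the first one it encounters.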

# **7. Interpretation and Conclusion**

***1. Main Insights***

Through our analysis, we gained several key insights into the relationship between cardiovascular disease and the various risk factors:
* Age and Cardiovascular Disease: As anticipated, older age groups, especially those over 50 years, exhibited a higher prevalence of cardiovascular diseases. Interestingly, we also identified that younger individuals, particularly those in the 20–29 and 30–39 age groups, are not exempt from cardiovascular risks. This was especially true for those with additional risk factors such as high cholesterol or smoking habits.
* Gender and Smoking: Our analysis highlighted that males tend to have higher smoking rates compared to females, which correlates with a slightly increased incidence of cardiovascular disease among males. This suggests that smoking is a significant factor in cardiovascular health disparities between genders.
* Cholesterol, Physical Activity, and Gender: We observed that individuals with above-normal or well above-normal cholesterol levels who were also physically inactive showed a markedly higher risk of cardiovascular disease. This trend was particularly pronounced among males, further underscoring the interaction between lifestyle factors and health outcomes.
* Correlation Analysis: The correlation heatmap revealed strong positive correlations between cardiovascular disease and factors such as cholesterol levels, glucose levels, and age. These factors appear to be the primary contributors to increased cardiovascular risk, providing valuable insight into which variables should be prioritized in preventive strategies.

***2. Limitations and Improvements***

While our analysis provided useful insights, there are certain limitations in the workflow and areas where improvements could be made:

***Limitations:***


* The dataset might contain missing or incomplete records, which could result in biased or inaccurate findings.
* Our analysis focused primarily on categorical exploration, offering basic insights into the relationships between variables. While this approach helps identify trends, it does not provide a detailed predictive model for cardiovascular disease risk.
* The study did not account for genetic factors, which are known to play a crucial role in cardiovascular diseases. Without this information, the analysis may overlook an essential determinant of health outcomes.

***Improvements:***


* Including more comprehensive datasets that factor in genetic, socioeconomic, and environmental variables could enhance the accuracy and depth of our analysis. This would provide a more holistic view of cardiovascular disease risk.
* Applying advanced statistical models, such as logistic regression or decision trees, would offer deeper insights into the predictive factors associated with cardiovascular disease. Such models would allow us to assess the probability of disease occurrence based on a combination of variables.
* Conducting a time-series analysis on lifestyle changes over the years could provide better insights into how continuous improvements in health behaviors impact the prevalence of cardiovascular disease over time. This would help in understanding the long-term benefits of health interventions.

***Conclusion:***

In conclusion, our analysis has provided valuable insights into the relationships between cardiovascular diseases and various risk factors, such as age, gender, cholesterol levels, and lifestyle behaviors like smoking and physical inactivity. While the study confirmed the expected trends, such as the heightened risk for older individuals and smokers, it also revealed less obvious patterns, particularly among younger age groups and the interaction between cholesterol levels and physical activity. These insights can serve as a foundation for designing targeted interventions to mitigate cardiovascular risks.

However, the limitations of our analysis, including incomplete data and the lack of genetic and socioeconomic considerations, highlight the need for a more comprehensive approach in future research. By incorporating more diverse datasets and applying advanced modeling techniques, future analyses could yield even deeper insights, leading to more effective preventive strategies for cardiovascular health.

Overall, our work demonstrates the potential of data-driven approaches in public health while also emphasizing the need for continuous improvement and refinement in analysis techniques.



# **8. Reflection**