<a href="https://colab.research.google.com/github/22053604/Cardiovascular-Health-Analysis/blob/main/Business%20Report%20Colab%20notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Business Notebook: **Cardiovascular Health Analysis**




# **1. Overview**

In this assessment, we analyzed a cardiovascular health dataset to understand how demographic, clinical, and lifestyle factors relate to cardiovascular disease (CVD) risk. Using Python in Google Colab, we performed data cleaning, preprocessing, and exploratory data analysis (EDA) on the dataset. We focused on key combinations of factors (e.g., age, cholesterol, smoking status) and examined how these relate to the likelihood of developing CVD. To support our analysis, we generated visualizations such as bar charts and scatter plots, revealing trends and insights into cardiovascular health. Additionally, we followed a structured process to ensure that the code was organized, efficient, and tested. The results from this analysis are intended to provide actionable recommendations for reducing CVD risk, as well as insights into patterns in cardiovascular health.


# **2. Problem Solving Methodology**


# Problem Statement
Cardiovascular disease (CVD) is one of the leading causes of death globally, affecting millions of individuals each year. Given the complex interaction of demographic factors (such as age and gender), clinical factors (such as blood pressure and cholesterol levels), and lifestyle factors (such as smoking and physical activity), there is a pressing need to identify patterns that could help predict and prevent CVD. This assessment seeks to investigate the relationships between these variables and cardiovascular health outcomes, with a particular focus on determining which factor combinations are the most significant predictors of CVD. By understanding these relationships, we aim to support healthcare professionals and policymakers in developing targeted interventions to reduce the prevalence of CVD.

# Objectives
The objective of this assessment is to conduct a comprehensive analysis of a cardiovascular health dataset to uncover patterns and correlations between demographic, clinical, and lifestyle factors and the risk of developing cardiovascular disease. Specifically, the analysis will:
1. Identify the most significant demographic, clinical, and lifestyle predictors of cardiovascular disease.
2. Analyze different combinations of factors to determine which contribute most to CVD risk.
3. Provide actionable insights that can inform public health strategies to prevent and manage CVD.
4. Generate visualizations that clearly present the relationships between these factors and cardiovascular health, supporting the interpretation of the results.

# **3. Dataset and Factor Selection**
The dataset includes information such as age, gender, cholesterol levels, glucose levels, smoking status, physical activity, alcohol consumption, and the presence of cardiovascular disease (CVD).

After loading the data from the medical_examination table in the SQLite database, we conducted data cleaning by removing duplicates and converting age from days into years. The factors were then categorized into more readable formats. For example, age was grouped into bins, cholesterol and glucose levels were categorized into "Normal," "Above Normal," and "Well Above Normal," and lifestyle factors such as smoking, alcohol consumption, and physical activity were transformed into understandable binary labels.
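
The loading and cleaning steps described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the in-memory database, its three rows, and the 365.25-day divisor are assumptions made for the example, and only the table name `medical_examination` comes from the report.

```python
import sqlite3

import pandas as pd

# Build a tiny in-memory stand-in for the real database (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE medical_examination (id INTEGER, age INTEGER, cardio INTEGER)"
)
conn.executemany(
    "INSERT INTO medical_examination VALUES (?, ?, ?)",
    [(0, 18393, 0), (1, 20228, 1), (1, 20228, 1)],  # last row is a duplicate
)
conn.commit()

# Load the table into a DataFrame, then close the connection.
df = pd.read_sql("SELECT * FROM medical_examination", conn)
conn.close()

# Drop exact duplicate rows and convert age from days to whole years.
# Assumption: 365.25 days per year; the notebook may use a different divisor.
df_cleaned = df.drop_duplicates().copy()
df_cleaned["age_years"] = (df_cleaned["age"] / 365.25).astype(int)

print(df_cleaned)
```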

# Factor Combinations:
We selected specific factor combinations to conduct a thorough analysis of how demographic, clinical, and lifestyle factors impact cardiovascular disease risk. The combinations chosen include:

1. Demographic + Lifestyle:
* Age Group + Smoking Status: This combination helps analyze how smoking behavior affects CVD risk across different age groups.
2. Demographic + Clinical:
* Gender + Cholesterol: By analyzing gender along with cholesterol levels, we examine how these clinical factors impact CVD differently for men and women.
3. Combined:
* Age Group + Blood Pressure + Physical Activity: This combination explores the interaction between age, blood pressure, and physical activity, revealing trends in how these variables together influence CVD risk.
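
One way to compare factor combinations like these is to group by the factors and take the mean of the binary outcome, which gives the share of people with CVD in each group. The sketch below uses made-up rows and assumes the column names `age_group`, `smoke`, and `cardio`.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumptions.
df = pd.DataFrame({
    "age_group": ["40-49", "40-49", "50-59", "50-59", "50-59", "50-59"],
    "smoke":     [0, 1, 0, 0, 1, 1],
    "cardio":    [0, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 outcome is the share of people with CVD in each group;
# the count shows how many rows that share is based on.
cvd_rate = (
    df.groupby(["age_group", "smoke"])["cardio"]
      .agg(cvd_rate="mean", n="count")
      .reset_index()
)

print(cvd_rate)
```

Reporting the group size next to the rate helps avoid over-reading combinations that only a handful of rows support.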

# **4. Input and Output Description**

***Input Variables***

**Age:**
1. Converted from days to years for easier interpretation.
2. Grouped into age ranges (e.g., 0-9, 10-19, etc.) to observe patterns across different life stages.

**Gender:** Mapped from categorical values of 1 (Female) and 2 (Male).

**Cholesterol:** A categorical variable that has been transformed into three distinct levels:
* "Normal"
* "Above Normal"
* "Well Above Normal"

**Glucose:** Like cholesterol, glucose levels were mapped into:
* "Normal"
* "Above Normal"
* "Well Above Normal"

**Smoking Status:** Mapped from binary values:
* 0 (Non-smoker)
* 1 (Smoker)

**Alcohol Consumption:** Coded as a binary variable:
* 0 (Non-drinker)
* 1 (Drinker)

**Physical Activity:** Classified into:
* 0 (Inactive)
* 1 (Active)

**Cardiovascular Disease Presence:** A binary variable that indicates whether the individual has been diagnosed with cardiovascular disease:
* 0 (No Disease)
* 1 (Disease)

***Output Variables:***

**Cardiovascular Disease Prediction:**
1. The primary output is a binary prediction that identifies whether an individual is likely to have cardiovascular disease based on the combination of the above input variables.
2. This allows for a simplified "Yes" (1) or "No" (0) outcome regarding disease presence.

**Patterns and Trends:**
Grouped analyses to detect correlations between various factors and cardiovascular disease, focusing on:
1. Age Group: Correlation between different age ranges and cardiovascular disease incidence.
2. Gender Differences in Smoking Status: Analysis of smoking patterns between males and females.
3. Cholesterol Levels and Physical Activity: Exploration of how cholesterol levels interact with physical activity, along with gender-specific trends.

# **5. Analysis Process**

**1. Data Cleaning and Transformation**
As part of our data preparation process, we first addressed any inconsistencies by removing duplicate entries and handling missing values, ensuring that the dataset was ready for accurate analysis. To improve the clarity and interpretation of the data, we applied the following transformations:
* Age: Converted from days to years, and then grouped into age categories (e.g., 20–29, 30–39, etc.) to allow for easier comparison across different age groups.
* Cholesterol and Glucose: Both were categorized into three levels—Normal, Above
Normal, and Well Above Normal—providing a clearer indication of cardiovascular risk.
* Gender: Recoded as 1 for Female and 2 for Male to ensure consistency in analysis.
* Lifestyle Factors (Smoking Status, Alcohol Consumption, Physical Activity): These were represented as binary variables, with 0 indicating non-smokers, non-drinkers, or inactive individuals, and 1 for smokers, drinkers, or those who are physically active.
This transformation process made the dataset more interpretable and allowed us to perform a more insightful analysis of cardiovascular disease trends based on these factors.

**2. Visualization and Analysis**
Once the data was cleaned and transformed, we used a range of visualizations to explore relationships between cardiovascular disease and key risk factors. We implemented interactive widgets for dynamic filtering by age group and gender, which allowed detailed exploration of trends across different demographics. Below are the visualizations we used and the insights derived from each:

* **Widgets (Interactive Filters):** Dropdown widgets were created for age group and gender, enabling users to filter the dataset dynamically and explore how the visualizations change across different segments of the population.
* **Age Group vs. Cardiovascular Disease:** To examine how age impacts cardiovascular disease prevalence, we plotted age groups against the presence of cardiovascular disease. As expected, older age groups (especially 50+) showed a higher prevalence. However, we also noticed that younger groups (particularly 30–39) exhibited a significant risk when other factors like high cholesterol or smoking were present.
* **Gender vs. Smoking Status**: A bar chart was created to explore gender
differences in smoking habits. Our analysis revealed that males tend to have higher smoking rates, which correlates with a slightly higher cardiovascular disease prevalence among men compared to women.
* **Cholesterol vs. Physical Activity by Gender:** This visualization helped us
explore the interaction between cholesterol levels, physical activity, and gender. It became clear that individuals with high cholesterol who were also physically inactive—especially males—had a substantially higher risk of cardiovascular disease.
* **Correlation Heatmap:** To gain a deeper understanding of the relationships between the numeric variables in our dataset, we generated a correlation matrix. The heatmap highlighted strong positive correlations between factors like cholesterol levels, glucose levels, age, and the presence of cardiovascular disease. This allowed us to identify the most influential factors contributing to cardiovascular health outcomes.
* **Age Distribution of Female Smokers:** This histogram visualizes the age distribution of female smokers. It provides insights into the prevalence of smoking among women across different age groups. The histogram revealed that female smokers are predominantly concentrated in the age range of 30 to 50 years. This helps identify the demographic that may need targeted interventions for smoking cessation programs to reduce cardiovascular risk.
* **Age Distribution of Patients with High Cholesterol (Age > 50):** A histogram was created to show the age distribution of patients aged over 50 with elevated cholesterol levels. This visualization highlights that a significant portion of patients in this age group have cholesterol levels classified as above normal or well above normal, placing them at an increased risk of cardiovascular disease. The chart emphasized that high cholesterol is a critical factor in older adults' cardiovascular health.
* **Age Distribution of Active Individuals with High Blood Pressure (Age > 40):** This histogram examines individuals over the age of 40 who are physically active but have elevated blood pressure (systolic blood pressure > 120 mmHg). The distribution indicates that while these individuals are active, the combination of age and high blood pressure still presents a cardiovascular risk. The analysis underlined the importance of managing blood pressure even in physically active populations.
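
The filtering behind the dropdown widgets can be sketched as a plain function that each chart calls before plotting. The function name `filter_by_segment`, the `"All"` option, and the sample rows are hypothetical; the notebook's own widget code may be organized differently.

```python
import pandas as pd


def filter_by_segment(df, age_group="All", gender="All"):
    """Return the rows matching the selected values ("All" keeps every row)."""
    mask = pd.Series(True, index=df.index)
    if age_group != "All":
        mask &= df["age_group"] == age_group
    if gender != "All":
        mask &= df["gender"] == gender
    return df[mask]


# Made-up rows for illustration only.
df = pd.DataFrame({
    "age_group": ["30-39", "50-59", "50-59"],
    "gender":    ["female", "male", "female"],
    "cardio":    [0, 1, 1],
})

subset = filter_by_segment(df, age_group="50-59", gender="female")
print(subset)
```

In a notebook, a function like this could be connected to dropdowns, for example through `ipywidgets.interact`, so that every chart is redrawn from the same filtered subset.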







# **6. Evaluation (Gen AI)**




# Gen AI Assistance 1: Combining Code and Debugging

## **1. Gen AI Prompt**

**We approached Gen AI with the following request:**

"Can you help combine our two codes together? We each tried to do the code on our own and later combined it, but we ran into issues such as coding incompatibilities. We keep seeing this error:

      ops.pyx in pandas._libs.ops.scalar_compare()
      TypeError: '>' not supported between instances of 'str' and 'int'"

### **2. Gen AI Response**

The AI provided the following analysis and suggestion:

      The error message 'TypeError: '>' not supported between instances of 'str' and 'int'' arises in the line:

      age_cholesterol_combination = df_cleaned[(df_cleaned['age_years'] > 50) & (df_cleaned['cholesterol'] > 1)]


This indicates a type mismatch in the comparison operation. Specifically, you're likely attempting to compare a string value in the 'cholesterol' column with the integer value 1 using the '>' operator. This operation is invalid in Python, as strings and integers cannot be directly compared using numerical operators.

In the code, the 'cholesterol' column is transformed using the map function, replacing its numerical values (1, 2, 3) with string labels ('Normal', 'Above Normal', 'Well Above Normal'). Consequently, when you use df_cleaned['cholesterol'] > 1, it's trying to compare these string labels with the integer 1, which causes the TypeError.
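
The mismatch can be reproduced on a few made-up rows. The sketch below also shows one alternative to rewriting the comparison: keep the numeric codes and store the labels in a separate column. The column name `cholesterol_label` is hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({"age_years": [45, 52, 61], "cholesterol": [1, 2, 3]})

# Keep the numeric codes and add the readable labels in a separate column.
labels = {1: "Normal", 2: "Above Normal", 3: "Well Above Normal"}
df["cholesterol_label"] = df["cholesterol"].map(labels)

# The numeric comparison still works on the original column.
high_chol_over_50 = df[(df["age_years"] > 50) & (df["cholesterol"] > 1)]

# Comparing the label column with an integer reproduces the TypeError.
try:
    df["cholesterol_label"] > 1
    raised = False
except TypeError:
    raised = True

print(high_chol_over_50)
print("TypeError raised:", raised)
```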


### **3. Gen AI Response Effectiveness**

The Gen AI response was very effective. After implementing the suggested changes into the code and fixing it according to the given feedback, the code worked as intended. AI had accurately identified the root cause of the error and provided a clear explanation of the type mismatch issue, so it was helpful in fixing our error.


### **4. Modification Needed/Implementation**

Based on the AI's guidance, we implemented the following modifications:

* Compare 'cholesterol' with its string labels instead of a numerical value.


Implemented suggested changes from Gen AI:

      
      demographic_lifestyle_combination = df_cleaned[(df_cleaned['gender'] == 1) & (df_cleaned['smoke'] == 1)]
      print(demographic_lifestyle_combination.head())

      age_cholesterol_combination = df_cleaned[(df_cleaned['age_years'] > 50) & (df_cleaned['cholesterol'] != 'Normal')]  
      print(age_cholesterol_combination.head())
      
      age_bp_activity = df_cleaned[(df_cleaned['age_years'] > 40) & (df_cleaned['ap_hi'] > 120) & (df_cleaned['active'] == 1)]
      print(age_bp_activity.head())


These changes addressed the type mismatch issue by comparing the 'cholesterol' column with string labels instead of numerical values. We also added an additional filtering condition for age, blood pressure, and activity level to further refine our analysis.


# Gen AI Assistance 2: Resolving Plotting Error for Numeric Data

**1. Gen AI Prompt**

We encountered the following error after cleaning our data, in which, for example, smoking status is coded as 1 (smoker) and 0 (non-smoker). Plotting these values failed with:



      _compute_plot_data(self)
          696         # no non-numeric frames or series allowed
          697         if is_empty:
      --> 698             raise TypeError("no numeric data to plot")
          699
          700         self.data = numeric_data.apply(type(self)._convert_to_ndarray)
      TypeError: no numeric data to plot


So we asked Gen AI to read our code and help us resolve this issue.


**2. Gen AI Response**


The AI provided the following analysis and suggestions:

"The error message 'TypeError: no numeric data to plot' indicates that the data being passed to the plot function does not contain any numeric values for plotting. This is likely happening because you are mapping values to strings in your smoke column, which would make your gender_smoke_data DataFrame contain only string (object) data type columns and no numeric data. matplotlib requires numeric values to create a plot.


Suggested Changes:

* Preserve the numeric values for the 'smoke' column.
* Create a new column 'smoking_status' to store string data.
* Modify the grouping and plotting code to use the numeric 'smoke' column.

### **3. Gen AI Response Effectiveness**

The Gen AI response was very effective. After implementing the suggested changes, the code worked as intended. The AI accurately identified the root cause of the error and provided a clear solution to preserve numeric data for plotting while still maintaining readable categories for analysis.

### **4. Modification Needed/Implementation**

Based on the AI's guidance, we implemented the following key modifications:

1. Kept the original 'smoke' column as numeric (1 for smoker, 0 for non-smoker).
2. Created a new 'smoking_status' column with string labels ('smoker', 'non-smoker').
3. Modified the grouping and plotting code to use the numeric 'smoke' column instead of the string 'smoking_status'.

Implemented the suggested changes by AI:

      # Group ages into 10-year bins (right=False: each bin includes its lower edge only)
      age_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
      age_labels = ['0-9', '10-19','20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90-99']
      # Create the age group column
      df['age_group'] = pd.cut(df['age_years'], bins=age_bins, labels=age_labels, right=False)

      # Transform gender codes into readable labels (1 = female, 2 = male)
      df['gender'] = df['gender'].map({1: 'female', 2: 'male'})

      # Keep the numeric 'smoke' column (0 = non-smoker, 1 = smoker)
      # and store the readable labels in a new 'smoking_status' column
      df['smoking_status'] = df['smoke'].map({0: 'non-smoker', 1: 'smoker'})

      # Transform cholesterol levels into readable categories
      df['cholesterol'] = df['cholesterol'].map({1: 'Normal', 2: 'Above Normal', 3: 'Well Above Normal'})

      # Plot the mean of the numeric 'cardio' column for each gender and smoking group
      gender_smoke_data = df.groupby(['gender','smoke'])['cardio'].mean().unstack()
      gender_smoke_data.plot(kind='bar', stacked=True)
      plt.title('Cardiovascular Risk by Gender and Smoking Status')
      plt.xlabel('Gender')
      plt.ylabel('Cardiovascular Risk')
      plt.show()


These changes allowed us to maintain numeric data for plotting while still having readable categories for other analyses. The plot was successfully generated, showing the cardiovascular risk by gender and smoking status.
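
A related option, sketched below on made-up rows, is to group on the numeric codes and relabel only the resulting columns for display. It assumes the 0/1 coding of `smoke` given in the input description. The values being plotted stay numeric, which is what the plotting call requires, while the legend shows readable labels.

```python
import pandas as pd

# Made-up rows for illustration only; smoke is coded 0/1.
df = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "male", "female"],
    "smoke":  [0, 1, 0, 1, 1, 0],
    "cardio": [0, 1, 1, 1, 0, 1],
})

# 'cardio' stays numeric, so the mean is the share with CVD in each group.
gender_smoke_data = df.groupby(["gender", "smoke"])["cardio"].mean().unstack()

# Relabel the smoke codes for display only; the cell values remain floats.
gender_smoke_data = gender_smoke_data.rename(columns={0: "non-smoker", 1: "smoker"})

print(gender_smoke_data)
```

Calling `gender_smoke_data.plot(kind='bar')` on this table would then use the renamed columns as legend entries.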

# Gen AI Assistance 3: Resolving Visualization Error for Age Group Data

**1. Gen AI Prompt**
We encountered an issue when creating graph visualizations where it would not display the graphs. So, we asked AI to help identify the error and provide suggestions for changes.

**2. Gen AI Response**

The AI provided the following analysis and suggestions:

"The error message KeyError: 'age_group' indicates that the column 'age_group' is not found in the DataFrame df that you are passing to the create_visualizations function. This is likely because the data processing and cleaning steps that create the 'age_group' column (like binning and labeling the 'age' column) were performed on a different DataFrame called df_cleaned, but the original df was used in the visualization function.

Suggested solutions:

* Pass df_cleaned to the function: The create_visualizations function should be called with df_cleaned instead of df, as df_cleaned contains the 'age_group' column after the data processing steps.

* Error Handling: Add a try-except block to the create_visualizations function to handle potential KeyError exceptions. If either 'age_group' or 'cardio_disease_presence' column is missing, it will print an informative message to help diagnose the issue and prevent the script from crashing.

* Ensure the input DataFrame has 'age_group' and 'cardio_disease_presence' columns.





### **3. Gen AI Response Effectiveness**

The Gen AI response was very effective. The AI accurately identified the likely cause of the error and provided a clear solution with explanations for each change.

### **4. Modification Needed/Implementation**

Based on the AI's guidance, we implemented the following key modifications:

Defined a function that builds the visualizations and wraps the plotting code in a try-except block. If a required column is missing, the function prints a message and skips plotting:

        def create_visualizations(df):
            try:
                age_disease_counts = df.groupby(['age_group', 'cardio_disease_presence']).size().unstack()
                age_disease_counts.plot(kind='bar', stacked=True, color=['skyblue', 'salmon'])
            except KeyError as e:
                print(f"Error: Missing column(s) for visualization: {e}")
                return

Called the function with df_cleaned, the DataFrame that contains the 'age_group' column, instead of df:

        create_visualizations(df_cleaned)

These changes allowed the visualization function to work with the correct DataFrame (df_cleaned) that contains the 'age_group' column. The added error handling also provides informative messages if any required columns are missing, improving the robustness of the code.
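
An alternative to catching the `KeyError` is to check for the required columns before plotting, which makes it possible to report every missing column at once. The helper name `missing_columns` and the sample frames below are hypothetical.

```python
import pandas as pd

REQUIRED_COLUMNS = ("age_group", "cardio_disease_presence")


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# A raw frame without derived columns, and a cleaned frame that has them.
df = pd.DataFrame({"age": [18393], "cardio": [0]})
df_cleaned = df.assign(age_group="50-59", cardio_disease_presence="No Disease")

print(missing_columns(df))          # ['age_group', 'cardio_disease_presence']
print(missing_columns(df_cleaned))  # []
```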

# **Gen AI Assistance 4: Data Cleaning**

**1. Gen AI Prompt**
"How do I clean a dataset in Python, including removing duplicates and handling missing values?"

**2. Gen AI Response**

To clean a dataset in Python, you can use the pandas library. Here's how you can remove duplicates and handle missing values:

      import pandas as pd

      # Load dataset
      df = pd.read_csv('medical_examination.csv')

      # Remove duplicates
      df_cleaned = df.drop_duplicates()

      # Handle missing values by dropping rows with missing data
      df_cleaned = df_cleaned.dropna()

      # Alternatively, forward-fill missing values instead of dropping rows
      # df_cleaned = df.drop_duplicates().ffill()

      print(df_cleaned.head())

**3. Gen AI Response Effectiveness**
This response is useful because it directly addresses the data cleaning process, showing how to remove duplicates and handle missing values. The code provided is straightforward and can be easily implemented.

**4. Modification Needed/Implementation**
We added an additional step to convert the age from days to years, as this was a specific requirement for the assignment.
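
Before choosing between dropping and filling, it can help to count the missing values per column and compare what each option leaves behind. The rows and column values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; the last row duplicates the third.
df = pd.DataFrame({
    "ap_hi":       [110, np.nan, 140, 140],
    "cholesterol": [1, 2, np.nan, np.nan],
})

deduped = df.drop_duplicates()

# Count missing values per column before deciding how to handle them.
missing = deduped.isna().sum()

dropped = deduped.dropna()  # keeps only complete rows
filled = deduped.ffill()    # copies the previous row's value into each gap

print(missing)
print(len(dropped), len(filled))
```

Forward filling copies a value from the preceding record, so for unordered patient records it borrows another person's measurement. Dropping rows or filling with a column-level statistic may be easier to justify there.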

# **Gen AI Assistance 5: SQL Queries in Pandas**


**1. Gen AI Prompt**

We asked Gen AI how to filter data with the following:

"How can I filter data in pandas like SQL queries, for example filtering by gender and smoking status?"

**2. Gen AI Response**

You can filter data in pandas using boolean indexing, similar to SQL WHERE clauses. Here's an example of how to filter for female smokers:



      # Filter for gender = female and smoke = smoker
      filtered_df = df[(df['gender'] == 1) & (df['smoke'] == 1)]

      print(filtered_df.head())


**3. Gen AI Response Effectiveness**

The response is very effective for filtering the dataset based on specific conditions, similar to SQL queries. It directly answers the question and shows how to apply logical conditions in pandas.

**4. Modification Needed/Implementation**

No modifications were needed as the solution provided was exactly what was required for the task.
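
For readers coming from SQL, `DataFrame.query` expresses the same filter as a string that reads much like a WHERE clause. The sketch below uses made-up rows; `min_age` is a hypothetical variable included to show how a Python value is referenced with `@`.

```python
import pandas as pd

# Made-up rows for illustration only (gender: 1 = female, 2 = male).
df = pd.DataFrame({
    "gender":    [1, 2, 1, 1],
    "smoke":     [1, 1, 0, 1],
    "age_years": [34, 51, 47, 58],
})

# Equivalent to df[(df['gender'] == 1) & (df['smoke'] == 1)]
female_smokers = df.query("gender == 1 and smoke == 1")

# Local Python variables can be referenced inside the expression with '@'.
min_age = 40
older_female_smokers = df.query("gender == 1 and smoke == 1 and age_years > @min_age")

print(female_smokers)
print(older_female_smokers)
```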

# **Gen AI Assistance 6: Visualizing Data With Matplotlib**


**1. Gen AI Prompt:**

"How do I create a histogram in Python to visualize the distribution of ages for a filtered dataset?"

**2. Gen AI Response:**

To create a histogram in Python, you can use the matplotlib library. Here's how you can plot the distribution of ages:

      import matplotlib.pyplot as plt

      # Create a histogram of the 'age_years' column
      plt.hist(filtered_df['age_years'], bins=10, alpha=0.7, color='blue')
      plt.title('Age distribution')
      plt.xlabel('Age (years)')
      plt.ylabel('Frequency')
      plt.show()

**3. Gen AI Response Effectiveness:**

The response effectively explains how to create a histogram using matplotlib, which is exactly what is needed to visualize the distribution of ages in the dataset.

**4. Modification Needed/Implementation**

No modifications were needed, as the code worked perfectly to create a histogram for the required factor combinations.

# **Gen AI Assistance 7: Modifying Factor Combinations For Analysis**

**1. Gen AI Prompt:**

"What is the best way to analyze different combinations of demographic, clinical, and lifestyle factors in a dataset?"

**2. Gen AI Response:**

To analyze combinations of factors, you can filter your dataset using multiple conditions and group the data based on the combinations of interest.





      # Filter for age > 50 and cholesterol level > 1
      filtered_df = df[(df['age_years'] > 50) & (df['cholesterol'] > 1)]
      print(filtered_df.head())

      # Filter for age > 40, high blood pressure, and active lifestyle
      filtered_df = df[(df['age_years'] > 40) & (df['ap_hi'] > 120) & (df['active'] == 1)]
      print(filtered_df.head())

**3. Gen AI Response Effectiveness:**

The response provides a clear method for analyzing multiple factor combinations, which is essential for this task. It uses logical conditions to filter data, making it simple to apply different combinations of factors.

**4. Modification Needed:**

No modifications were required for this, as the method provided was easy to implement and aligned with the requirements of the assignment.


# **Gen AI Assistance 8: Resolving Correlation Matrix Error for Mixed Data Types**

**1. Gen AI Prompt**

We encountered an error when adding additional graphs and asked Gen AI to explain and help fix the error. The main error was:

    ValueError: could not convert string to float: '50-59'

      1662         ret = take_nd(self.categories._values, self._codes)
      1663         if dtype and np.dtype(dtype) != self.categories.dtype:
    -> 1664             return np.asarray(ret, dtype)
      1665         # When we're a Categorical[ExtensionArray], like Interval,
      1666         # we need to ensure __array__ gets all the way to an
    ValueError: could not convert string to float: '50-59'

**2. Gen AI Response**

The AI provided the following analysis and suggestions:

    "Hypothesis:
    The error 'ValueError: could not convert string to float: '50-59'' is raised because the df.corr() function, used to calculate correlation, expects numerical data. However, after the data preprocessing steps, the DataFrame df_cleaned contains columns with string values, particularly the 'age_group' column which has values like '50-59', introduced by the pd.cut function.
    The df.corr() function attempts to convert all the data in the DataFrame to float for correlation calculation, but it fails when it encounters these string values in the 'age_group' column (and potentially other categorical columns that were transformed to strings). This leads to the ValueError.

Suggested Changes:

Import numpy: Add import numpy as np at the beginning if it's not already present to use numpy functions.

Select Numeric Columns: Before calculating the correlation matrix, select only the numeric columns from the DataFrame using numeric_df = df.select_dtypes(include=np.number). This ensures that df.corr() is applied only to data that can be converted to floats."

**3. Gen AI Response Effectiveness**

The Gen AI response was very effective. We implemented the changes and other graph suggestions provided by the AI. The analysis accurately identified the cause of the error and provided a clear solution to resolve it.

**4. Modification Needed/Implementation**
Based on the AI's guidance, we implemented the following key modifications:

Modified the correlation matrix calculation to use only numeric columns:

    numeric_df = df_cleaned.select_dtypes(include=np.number)
    correlation_matrix = numeric_df.corr()

Updated the heatmap visualization code to use the new correlation_matrix:


    plt.figure(figsize=(12, 10))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
    plt.title('Correlation Heatmap of Numeric Features')
    plt.show()

These changes allowed us to generate the correlation matrix and heatmap visualization without encountering the ValueError. By selecting only numeric columns, we ensured that the correlation function could process the data correctly, avoiding issues with string values in categorical columns like 'age_group'.
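
In recent pandas versions (1.5 and later, as far as we know) the same result can be obtained without a separate selection step by passing `numeric_only=True` to `corr`. The sketch below uses made-up rows and checks that both approaches give the same matrix.

```python
import pandas as pd

# Made-up rows for illustration only.
df_cleaned = pd.DataFrame({
    "age_years":   [45, 52, 61, 58],
    "cholesterol": [1, 2, 3, 2],
    "cardio":      [0, 1, 1, 1],
    "age_group":   ["40-49", "50-59", "60-69", "50-59"],
})

# Skip non-numeric columns such as 'age_group' inside corr itself.
correlation_matrix = df_cleaned.corr(numeric_only=True)

# Same result as selecting the numeric columns first.
via_select = df_cleaned.select_dtypes(include="number").corr()

print(correlation_matrix)
```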

# **Gen AI Assistance 9: Resolving NameError for Database Connection Function**


**1. Gen AI Prompt**

We encountered the following error and asked Gen AI to explain it:


    NameError: name 'connect_to_db' is not defined

**2. Gen AI Response**

The AI provided the following explanation and solution:

  "The error NameError: name 'connect_to_db' is not defined arises because the function connect_to_db is being called within the main function, but it hasn't been defined anywhere in your code. This means Python doesn't know what to do when it encounters this function call, leading to the error.

  To resolve this, you need to define the connect_to_db function before it's called. Here's the suggested implementation:


      def connect_to_db(db_path):
          """Connects to the SQLite database.
          Args:
              db_path: Path to the database file.
          Returns:
              A database connection object, or None if the connection fails.
          """
          try:
              conn = sqlite3.connect(db_path)
              return conn
          except sqlite3.Error as e:
              print(f"Error connecting to database: {e}")
              return None

      def load_data(conn, table_name):
          """Loads data from a specified table in the database.
          Args:
              conn: Database connection object.
              table_name: Name of the table to load data from.
          Returns:
              A pandas DataFrame containing the data from the table.
          """
          try:
              df = pd.read_sql(f"SELECT * FROM {table_name}", conn)
              return df
          except pd.io.sql.DatabaseError as e:
              print(f"Error loading data from table '{table_name}': {e}")
              return None

Make sure to place these function definitions before any code that calls them."


**3. Gen AI Response Effectiveness**

The Gen AI response was very effective. The AI accurately identified the cause of the error and provided a clear solution with well-documented function definitions.


**4. Modification Needed/Implementation**

Based on the AI's guidance, we implemented the following modifications:

Added the connect_to_db function definition:

    def connect_to_db(db_path):
        try:
            conn = sqlite3.connect(db_path)
            return conn
        except sqlite3.Error as e:
            print(f"Error connecting to database: {e}")
            return None

Added the load_data function definition:

    def load_data(conn, table_name):
        try:
            df = pd.read_sql(f"SELECT * FROM {table_name}", conn)
            return df
        except pd.io.sql.DatabaseError as e:
            print(f"Error loading data from table '{table_name}': {e}")
            return None

Ensured these function definitions were placed before any code that calls them in the script.


These changes resolved the NameError by properly defining the connect_to_db function before it was called. Additionally, the load_data function was implemented to facilitate data retrieval from the database. The error handling in both functions improves the robustness of the code by gracefully handling potential database connection and query errors.
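
Two possible refinements are sketched below: closing the connection once the data is loaded, and checking the table name against the database catalogue before placing it in the SQL string. This is a hedged variation rather than the notebook's code; the function name `load_table`, the temporary database file, and its single row are assumptions for illustration.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import pandas as pd


def load_table(db_path, table_name):
    """Load a whole table after checking that it exists in the database."""
    with closing(sqlite3.connect(db_path)) as conn:
        # Look the name up with a bound parameter instead of trusting it blindly.
        found = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table_name,),
        ).fetchone()
        if found is None:
            raise ValueError(f"No table named {table_name!r} in {db_path}")
        return pd.read_sql(f'SELECT * FROM "{found[0]}"', conn)


# Create a throwaway database file with one made-up row.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
with closing(sqlite3.connect(db_path)) as conn:
    conn.execute("CREATE TABLE medical_examination (id INTEGER, cardio INTEGER)")
    conn.execute("INSERT INTO medical_examination VALUES (0, 1)")
    conn.commit()

df = load_table(db_path, "medical_examination")
print(df)
```

Raising an exception for an unknown table, instead of returning `None`, stops the analysis at the point of failure rather than later, when a `None` would be used as a DataFrame.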


# **7. Interpretation and Conclusion**

***1. Main Insights***

Through our analysis, we gained several key insights into the relationship between cardiovascular disease and the various risk factors:
* Age and Cardiovascular Disease: As anticipated, older age groups, especially
those over 50 years, exhibited a higher prevalence of cardiovascular diseases. Interestingly, we also identified that younger individuals, particularly those in the 20–29 and 30–39 age groups, are not exempt from cardiovascular risks. This was especially true for those with additional risk factors such as high cholesterol or smoking habits.
* Gender and Smoking: Our analysis highlighted that males tend to have higher
smoking rates compared to females, which correlates with a slightly increased incidence of cardiovascular disease among males. This suggests that smoking is a significant factor in cardiovascular health disparities between genders.
* Cholesterol, Physical Activity, and Gender: We observed that individuals with above-normal or well above-normal cholesterol levels who were also physically inactive showed a markedly higher risk of cardiovascular disease. This trend was particularly pronounced among males, further underscoring the interaction between lifestyle factors and health outcomes.
* Correlation Analysis: The correlation heatmap revealed strong positive correlations between cardiovascular disease and factors such as cholesterol levels, glucose levels, and age. These factors appear to be the primary contributors to increased cardiovascular risk, providing valuable insight into which variables should be prioritized in preventive strategies.

***2. Limitations and Improvements***

While our analysis provided useful insights, there are certain limitations in the workflow and areas where improvements could be made:

***Limitations:***


* The dataset might contain missing or incomplete records, which could result in biased or inaccurate findings.
* Our analysis focused primarily on categorical exploration, offering basic insights into the relationships between variables. While this approach helps identify trends, it does not provide a detailed predictive model for cardiovascular disease risk.
* The study did not account for genetic factors, which are known to play a crucial role in cardiovascular diseases. Without this information, the analysis may overlook an essential determinant of health outcomes.

***Improvements:***


* Including more comprehensive datasets that factor in genetic, socioeconomic, and environmental variables could enhance the accuracy and depth of our analysis. This would provide a more holistic view of cardiovascular disease risk.
* Applying advanced statistical models, such as logistic regression or decision trees, would offer deeper insights into the predictive factors associated with cardiovascular disease. Such models would allow us to assess the probability of disease occurrence based on a combination of variables.
* Conducting a time-series analysis on lifestyle changes over the years could provide better insights into how continuous improvements in health behaviors impact the prevalence of cardiovascular disease over time. This would help in understanding the long-term benefits of health interventions.
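
To make the logistic regression suggestion above more concrete, the sketch below fits a small logistic model with plain gradient descent in NumPy. It is not part of our analysis: the data are synthetic, the feature names (`age`, `chol`, `inactive`) and the coefficients used to generate the outcome are assumptions chosen for illustration, and in practice a library such as scikit-learn's `LogisticRegression` would be the usual choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic features (assumed names, not the real dataset).
age = rng.uniform(30, 65, n)          # age in years
chol = rng.integers(1, 4, n)          # cholesterol category 1..3
inactive = rng.integers(0, 2, n)      # 1 = physically inactive

# Assumed "true" relationship, used only to generate synthetic outcomes.
true_logit = -6.0 + 0.08 * age + 0.6 * chol + 0.5 * inactive
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-true_logit))).astype(float)

# Standardize features so one learning rate suits all of them,
# then prepend a column of ones for the intercept.
X = np.column_stack([age, chol, inactive])
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
Xb = np.column_stack([np.ones(n), (X - X_mean) / X_std])


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# Gradient descent on the mean log-loss.
w = np.zeros(Xb.shape[1])
for _ in range(2000):
    grad = Xb.T @ (sigmoid(Xb @ w) - y) / n
    w -= 0.5 * grad

# Predicted probability of disease for one hypothetical person:
# 60 years old, cholesterol well above normal, physically inactive.
person = (np.array([60, 3, 1]) - X_mean) / X_std
risk = sigmoid(w[0] + person @ w[1:])

print("coefficients (intercept, age, chol, inactive):", w)
print("predicted risk for the hypothetical person:", risk)
```

Unlike the categorical exploration in this report, a model of this kind returns a probability for a combination of factors, which is what the improvement above is aiming for.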

***Conclusion:***

In conclusion, our analysis has provided valuable insights into the relationships between cardiovascular diseases and various risk factors, such as age, gender, cholesterol levels, and lifestyle behaviors like smoking and physical inactivity. While the study confirmed the expected trends, such as the heightened risk for older individuals and smokers, it also revealed less obvious patterns, particularly among younger age groups and the interaction between cholesterol levels and physical activity. These insights can serve as a foundation for designing targeted interventions to mitigate cardiovascular risks.

However, the limitations of our analysis, including incomplete data and the lack of genetic and socioeconomic considerations, highlight the need for a more comprehensive approach in future research. By incorporating more diverse datasets and applying advanced modeling techniques, future analyses could yield even deeper insights, leading to more effective preventive strategies for cardiovascular health.

Overall, our work demonstrates the potential of data-driven approaches in public health while also emphasizing the need for continuous improvement and refinement in analysis techniques.



# **8. Reflection**

**Reflection - Siddheshwari Sawant (22053604)**

Reflecting on this group assignment, I feel proud of what we accomplished together. My main contribution was to the business report notebook, where I focused on organizing our analysis and presenting our findings in a clear way. I also created the GitHub repository and set up the two notebooks we needed for this assessment, ensuring everyone had access to the materials. While my teammates worked on the technical parts like data analysis and coding, I made sure our interpretations matched the insights we got from the data. This meant turning results from visualizations (such as how age groups relate to cardiovascular disease or the impact of smoking by gender) into meaningful conclusions for the business report. I collaborated closely with the team to make sure our work fit well with the coding and visualizations in the development notebook, resulting in a polished final product.

Throughout this project, I achieved the unit’s learning outcomes, especially in understanding how data analysis connects to real-world business applications. For instance, during lectures and lab activities, I learned how risk factors like cholesterol levels and physical inactivity affect cardiovascular health. These lessons were crucial when I was putting together the business report because they helped me not only describe our findings but also analyze their importance in real-life situations. I also learned how to work effectively in a team, applying project management skills like task delegation and open communication, which were emphasized in class. Two key takeaways for me were learning to create meaningful data visualizations and appreciating the value of teamwork in tackling complex projects.

However, we faced some challenges, especially when it came to our busy schedules with work and other assignments. At first, it was tough to find times when everyone could meet to discuss our progress. To tackle this, we set up a shared calendar where each team member could mark their availability. This helped us find overlapping times for meetings. We also agreed to keep in touch through a group chat, so even if some of us couldn’t meet, we could still share updates and stay informed. This approach created a supportive environment, allowing us to work together effectively despite our busy schedules.

We also faced some issues with the Gen AI tools we used. While they were helpful in generating code snippets and troubleshooting, sometimes the AI provided responses that were too general or didn’t fit our specific needs. For example, when we visualized cholesterol levels based on physical activity and gender, the AI's initial code didn’t fully address the categories we wanted to highlight. To fix this, we started documenting our interactions with Gen AI using a simple structure: prompt, response, effectiveness, and modifications. This practice helped us refine our prompts and better tailor the AI's outputs to our project’s needs.

Our instructor was also a great help, guiding us on how to structure our evaluations in the business report. This feedback helped us connect our analysis with the data insights, making our work stronger.

Overall, teamwork was a highlight of this project. We communicated regularly through meetings and group chats, which made sure everyone was clear on their roles. I enjoyed collaborating with my teammates on both the business and technical sides of the project. This experience not only enhanced my technical skills but also improved my ability to work effectively in a team, which will be valuable for future projects and my professional career.