### **Introduction**

This project has two distinct objectives. Objective 1 is exercising advanced engineering techniques aimed at enhancing time efficiency. Objective 2 is discerning the viability and utility of a synthetically developed data set. 

The techniques employed in this project are dictionaries for fast lookups and sets for quick cohort membership testing. In the `Patients` class, the `get_patient_from_id_fast` method uses a dictionary (`self.id_to_patient`) to retrieve a patient's record directly by ID rather than scanning every row. In the `check_high_risk_cohort_fast` function, sets track the patients with high heart rates and those in the high-risk cohort, so that checking whether a patient belongs to a group is a single membership test. These methods are reasonable candidates for improving time efficiency because dictionary lookups and set membership tests both run in O(1) time on average, which can significantly reduce computation time compared with linear search.

The slower baseline methods against which we compare the optimized methods use linear search for both patient ID lookup and cohort membership testing. Linear search checks each element one by one until the desired element is found. It has a time complexity of O(n), where n is the number of elements, so the time required for a search grows linearly with the size of the dataset.
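
To make the contrast concrete, the following minimal sketch compares a linear scan with a dictionary lookup and a set membership test on a small made-up table. The rows, IDs, cutoff, and names (`find_linear`, `id_index`, `high_hr_ids`) are hypothetical and are not taken from the project code.

```python
# Toy rows: [patient_id, heart_rate]; all values are invented for illustration.
rows = [["AAA0001", "72"], ["BBB0002", "110"], ["CCC0003", "95"]]

def find_linear(rows, patient_id):
    """O(n): inspect each row until the ID matches."""
    for row in rows:
        if row[0] == patient_id:
            return row
    return None

# O(n) to build once, then O(1) on average per lookup.
id_index = {row[0]: row for row in rows}

# Set of IDs whose heart rate exceeds an arbitrary cutoff of 100.
high_hr_ids = {row[0] for row in rows if float(row[1]) > 100}

print(find_linear(rows, "BBB0002"))   # linear scan
print(id_index.get("BBB0002"))        # dictionary lookup
print("BBB0002" in high_hr_ids)       # set membership test
```

Both lookups return the same row; the difference is only in how much work is done to find it, which becomes visible once the table holds thousands of rows.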

To fulfill Objective 2, I will examine the relationship between Body Mass Index (BMI) and several health-related metrics: cholesterol, triglycerides, blood pressure, and heart rate. The expectation is that each of these metrics will move toward higher risk as an individual's BMI increases. I am particularly interested in how the metrics behave across the spectrum of BMI categories.

I will work with the dataset 'heart_attack_prediction_dataset.csv', which can be found on Kaggle. This dataset is synthetic: it was artificially generated rather than collected from real-world observations or experiments. Synthetic data are typically produced with statistical methods, algorithms, or randomization techniques so that they mimic the properties of real-world data. I am interested to see whether the statistical trends found in real patient data are represented in this dataset.



#### **Import Necessary Modules and Libraries**

This code imports necessary modules and libraries for working with CSV files, file paths, time functions, random number generation, and numerical operations. These imports will be used in the subsequent implementation of the Patients class.




In [432]:


import csv  # Module for reading and writing CSV files
from pathlib import Path  # Module for working with file paths
import time  # Module for time-related functions
import random  # Module for generating random numbers
import numpy as np  # Library for numerical operations in Python


#### **Patient Class**

The `Patients` class provides functionality to manage and analyze patient data from a CSV file. It includes methods for fast patient lookup by ID, risk score calculation based on BMI, cholesterol, and blood pressure, and categorization of metrics by BMI category. The class can also display average heart rate, blood pressure, triglycerides, and cholesterol values across the BMI categories. The risk score calculation is a basic model designed after an initial review of the data. Additionally, the class includes a method that calculates risk thresholds for heart rate and triglycerides at the 90th percentile of each metric. These thresholds are used to flag patients whose values meet or exceed them, marking those patients as part of a high-risk cohort. The models are deliberately simple and are intended to elicit some well-established trends in physical and blood analysis metrics.
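
As a worked example of the risk score, the sketch below applies the same weights used in `calculate_risk_score` (0.3 for BMI, 0.4 for cholesterol, 0.3 for the mean of systolic and diastolic pressure) to a single hypothetical patient. The standalone function `risk_score` and the input values are assumptions made for illustration only.

```python
def risk_score(bmi, cholesterol, blood_pressure):
    """Weighted sum: 0.3*BMI + 0.4*cholesterol + 0.3*mean(systolic, diastolic)."""
    systolic, diastolic = map(float, blood_pressure.split("/"))
    mean_bp = (systolic + diastolic) / 2
    return round(bmi * 0.3 + cholesterol * 0.4 + mean_bp * 0.3, 2)

# Hypothetical patient: BMI 30, cholesterol 200, blood pressure 120/80.
# 30 * 0.3 = 9, 200 * 0.4 = 80, mean BP 100 * 0.3 = 30, so the score is 119.
score = risk_score(30.0, 200.0, "120/80")
print(score)
```

Because cholesterol values are numerically much larger than BMI values and also carry the largest weight, cholesterol contributes most of the score in this example (80 of 119 points).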








In [433]:


class Patients:
    def __init__(self, csv_filename):
        self.rows = []
        with open(csv_filename, newline='') as csvfile:
            reader = csv.reader(csvfile)
            self.headers = next(reader)  # Read and store the header row
            for row in reader:
                self.rows.append(row)
        
        # Create a dictionary for fast lookups
        self.id_to_patient = {row[0]: row for row in self.rows}

    def get_patient_from_id(self, patient_id):
        # Linear search for the patient ID
        for row in self.rows:
            if row[0] == patient_id:
                return row
        return None

    def get_patient_from_id_fast(self, patient_id):
        # Return the patient data for the given ID, or None if not found
        return self.id_to_patient.get(patient_id, None)

    def calculate_risk_score(self, patient):
        # Extract relevant metrics from the patient data
        bmi = float(patient[18])
        cholesterol = float(patient[3])
        blood_pressure = patient[4]
        
        # Split the blood pressure value into systolic and diastolic components
        systolic, diastolic = map(float, blood_pressure.split('/'))
        
        # Calculate and return the risk score based on the metrics
        return round(bmi * 0.3 + cholesterol * 0.4 + (systolic + diastolic) / 2 * 0.3, 2)

    def bmi_category(self, bmi):
        # Determine BMI category based on the BMI value
        if bmi < 18.5:
            return 'Underweight'
        elif 18.5 <= bmi < 25:
            return 'Normal'
        elif 25 <= bmi < 30:
            return 'Overweight'
        elif 30 <= bmi < 35:
            return 'Obese Class 1'
        elif 35 <= bmi < 40:
            return 'Obese Class 2'
        else:
            return 'Obese Class 3'

    def collect_metrics_by_bmi(self):
        # Initialize a dictionary to collect metrics by BMI category
        bmi_metrics = {
            'Heart Rate': {'Underweight': [], 'Normal': [], 'Overweight': [], 'Obese Class 1': [], 'Obese Class 2': [], 'Obese Class 3': []},
            'Blood Pressure': {'Underweight': [], 'Normal': [], 'Overweight': [], 'Obese Class 1': [], 'Obese Class 2': [], 'Obese Class 3': []},
            'Triglycerides': {'Underweight': [], 'Normal': [], 'Overweight': [], 'Obese Class 1': [], 'Obese Class 2': [], 'Obese Class 3': []},
            'Cholesterol': {'Underweight': [], 'Normal': [], 'Overweight': [], 'Obese Class 1': [], 'Obese Class 2': [], 'Obese Class 3': []}
        }

        # Initialize a dictionary to count the number of patients in each category
        bmi_category_count = {category: 0 for category in bmi_metrics['Heart Rate'].keys()}
        
        # Iterate over each row in the patient data
        for row in self.rows:
            # Extract the BMI value and determine its category
            bmi = float(row[18])  # BMI is in column 18
            category = self.bmi_category(bmi)
            
            # Increment the count for the category
            bmi_category_count[category] += 1
            
            # Extract other relevant metrics
            hr = float(row[5])    # Heart Rate is in column 5
            systolic, diastolic = map(float, row[4].split('/'))  # Blood Pressure is in column 4
            bp = (systolic + diastolic) / 2  # Average blood pressure
            trig = float(row[19])  # Triglycerides is in column 19
            chol = float(row[3])  # Cholesterol is in column 3
            
            # Append the metrics to the corresponding category
            bmi_metrics['Heart Rate'][category].append(hr)
            bmi_metrics['Blood Pressure'][category].append(bp)
            bmi_metrics['Triglycerides'][category].append(trig)
            bmi_metrics['Cholesterol'][category].append(chol)
        
        # Print the count of patients in each BMI category for debugging
        print("\nPatient Count by BMI Category:")
        for category, count in bmi_category_count.items():
            print(f"{category}: {count}")
        
        # Return the collected metrics
        return bmi_metrics

    def display_bmi_metrics(self):
        # Collect metrics by BMI category
        metrics = self.collect_metrics_by_bmi()
        categories = ['Underweight', 'Normal', 'Overweight', 'Obese Class 1', 'Obese Class 2', 'Obese Class 3']

        print("\nAverage Metrics by BMI Category:")
        # Iterate over each metric and display the average for each category
        for metric, data in metrics.items():
            print(f"\n{metric}:")
            for category in categories:
                # Calculate the average for each category
                avg = sum(data[category]) / len(data[category]) if data[category] else 0
                print(f"{category}: {avg:.2f}")

    def calculate_risk_thresholds(self):
        # Calculate the thresholds for heart rate and triglycerides
        heart_rates = [float(row[5]) for row in self.rows]
        triglycerides = [float(row[19]) for row in self.rows]
        
        # Example threshold calculation using the 90th percentile
        hr_threshold = np.percentile(heart_rates, 90)
        trig_threshold = np.percentile(triglycerides, 90)
        
        return hr_threshold, trig_threshold


### **Heart Attack Prediction Dataset Initialization**

This block of code defines the file path to a CSV file containing heart attack prediction data and instantiates a `Patients` object using this file. The CSV file is located in the `data` directory relative to the current working directory. By creating the `Patients` object, the dataset is loaded and ready for analysis and manipulation using the methods provided by the `Patients` class.


In [434]:
# File path
# Define the path to the CSV file containing the heart attack prediction dataset
csv_path = Path.cwd() / 'data' / 'heart_attack_prediction_dataset.csv'

# Create Patients object
# Instantiate a Patients object using the CSV file path
patients = Patients(str(csv_path))


#### **Risk Score Analysis and Health Metrics Comparison**


The average risk score across all 8763 patients is 145.65, with a range from 79.09 to 211.66.

97 patients were flagged because both their heart rate and their triglyceride level met or exceeded the 90th percentile threshold for the respective metric.
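
The flagging rule can be sketched on a handful of invented values. The lists below are hypothetical, not rows from the dataset; only the use of `np.percentile` at the 90th percentile and the "both thresholds" condition mirror the approach described here.

```python
import numpy as np

# Invented values for ten hypothetical patients.
heart_rates = [60, 65, 70, 72, 75, 80, 85, 90, 100, 110]
triglycerides = [100, 150, 200, 250, 300, 350, 400, 500, 600, 800]

# 90th percentile of each metric (NumPy interpolates linearly by default).
hr_threshold = np.percentile(heart_rates, 90)
trig_threshold = np.percentile(triglycerides, 90)

# A patient is flagged only if BOTH values meet or exceed their thresholds.
flagged = [
    i for i, (hr, trig) in enumerate(zip(heart_rates, triglycerides))
    if hr >= hr_threshold and trig >= trig_threshold
]
print(hr_threshold, trig_threshold, flagged)
```

As a point of reference, if two metrics were statistically independent, roughly 10% × 10% = 1% of patients would be expected to exceed both 90th percentile thresholds.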

For the top 10% of patients with the highest risk scores ("high risk" cohort), the average BMI is 30.10, average blood pressure is 116.54, average heart rate is 73.95, average cholesterol level is 383.24, and average triglycerides level is 413.78. In comparison, the bottom 10% of patients (with the lowest risk scores) have an average BMI of 27.72, average blood pressure of 104.23, average heart rate of 74.58, average cholesterol level of 135.78, and average triglycerides level of 411.84.

The differences between these cohorts are most pronounced in cholesterol, with the top 10% cohort averaging 247.46 units higher than the bottom 10% cohort. The other metrics show much smaller differences: BMI is 2.38 units higher, blood pressure is 12.31 units higher, heart rate is 0.63 units lower, and triglycerides are 1.94 units higher.

When comparing the flagged patient group to the bottom 10% cohort (lowest risk category), large differences are observed in triglycerides (353.74 units higher), cholesterol (126.69 units higher), and heart rate (32.50 units higher). The triglyceride and heart rate gaps are expected, because the flagged group was selected on those two metrics. Differences in BMI (about 0.76 units higher, from averages of 28.48 and 27.72) and blood pressure (6.40 units higher) are far less substantial.

These findings are difficult to reconcile with the well-documented medical literature. The expected relationships between BMI and various health metrics, along with the sources for these relationships, are as follows:

**Blood Pressure**

Expected Relationship: There is a positive correlation between BMI and blood pressure. As BMI increases, blood pressure tends to increase. Higher body weight is associated with increased arterial resistance, leading to higher systolic and diastolic blood pressure.

Sources:
- World Health Organization (WHO). "Obesity and Overweight." Available at: WHO Fact Sheets
- American Heart Association. "Overweight and Obesity." Available at: AHA

**Resting Heart Rate**

Expected Relationship: The relationship between BMI and resting heart rate is less direct, but individuals with higher BMI may generally have a slightly higher resting heart rate due to the increased demand on the heart to pump blood to a larger body mass.

Source:
- American Heart Association. "Resting Heart Rate." Available at: AHA

**Triglycerides**

Expected Relationship: There is a positive correlation between BMI and triglycerides. Higher BMI is often associated with higher levels of triglycerides in the blood. This is partly because overweight and obesity are associated with increased production of very-low-density lipoprotein (VLDL) and decreased clearance of triglycerides.

Sources:
- National Institutes of Health (NIH). "High Blood Triglycerides." Available at: NIH
- American Heart Association. "Triglycerides." Available at: AHA

**Cholesterol**

Expected Relationship: Higher BMI is often associated with unfavorable lipid profiles, including higher levels of total cholesterol and low-density lipoprotein (LDL) cholesterol, and lower levels of high-density lipoprotein (HDL) cholesterol. Obesity is linked to an increased production of cholesterol and lipids.

Sources:
- Centers for Disease Control and Prevention (CDC). "Cholesterol." Available at: CDC
- American Heart Association. "Cholesterol." Available at: AHA

These sources and expected relationships are well-documented in medical literature and public health guidelines, providing a reliable basis for understanding how BMI impacts these important health metrics.

Because the data deviate from these expected trends, the next step is to investigate how BMI relates to each of the metrics described above within this dataset.








In [435]:
# Function to calculate averages for a given list of patients
def calculate_averages(patients_list):
    total_bmi = 0
    total_bp = 0
    total_hr = 0
    total_chol = 0
    total_trig = 0
    num_patients = len(patients_list)
    
    for patient in patients_list:
        bmi = float(patient[18])
        cholesterol = float(patient[3])
        systolic, diastolic = map(float, patient[4].split('/'))
        bp = (systolic + diastolic) / 2
        hr = float(patient[5])
        trig = float(patient[19])
        
        total_bmi += bmi
        total_bp += bp
        total_hr += hr
        total_chol += cholesterol
        total_trig += trig
    
    return {
        'Average BMI': total_bmi / num_patients,
        'Average BP': total_bp / num_patients,
        'Average HR': total_hr / num_patients,
        'Average Cholesterol': total_chol / num_patients,
        'Average Triglycerides': total_trig / num_patients
    }

# Calculate the risk scores for all patients
risk_scores = [patients.calculate_risk_score(patient) for patient in patients.rows]

# Calculate the average risk score
average_risk_score = sum(risk_scores) / len(risk_scores)

# Calculate the range of risk scores
min_risk_score = min(risk_scores)
max_risk_score = max(risk_scores)
risk_score_range = max_risk_score - min_risk_score

print(f"Average Risk Score: {average_risk_score:.2f}")
print(f"Range of Risk Scores: {risk_score_range:.2f} (Min: {min_risk_score:.2f}, Max: {max_risk_score:.2f})")

# Calculate risk thresholds for heart rate and triglycerides
hr_threshold, trig_threshold = patients.calculate_risk_thresholds()

# Count the number of patients flagged for heart rate and triglycerides based on both thresholds
flagged_patients = [patient for patient in patients.rows if float(patient[5]) >= hr_threshold and float(patient[19]) >= trig_threshold]
flagged_count = len(flagged_patients)

print(f"Number of Patients Flagged for Heart Rate and Triglycerides (90th percentile for both heart rate and triglycerides): {flagged_count}")

# Sort patients by risk score
patients_sorted_by_risk = sorted(patients.rows, key=lambda x: patients.calculate_risk_score(x))

# Identify the top 10% and bottom 10% of the risk score cohort
top_10_percent_count = int(len(patients_sorted_by_risk) * 0.1)
bottom_10_percent_count = top_10_percent_count

top_10_percent_patients = patients_sorted_by_risk[-top_10_percent_count:]
bottom_10_percent_patients = patients_sorted_by_risk[:bottom_10_percent_count]

# Calculate averages for top 10%, bottom 10%, and flagged cohorts
top_10_percent_averages = calculate_averages(top_10_percent_patients)
bottom_10_percent_averages = calculate_averages(bottom_10_percent_patients)
flagged_averages = calculate_averages(flagged_patients)

# Calculate the differences between top 10% and bottom 10% cohorts
average_differences_top_bottom = {metric: top_10_percent_averages[metric] - bottom_10_percent_averages[metric]
                                  for metric in top_10_percent_averages}

# Calculate the differences between flagged patients and bottom 10% cohorts
average_differences_flagged_bottom = {metric: flagged_averages[metric] - bottom_10_percent_averages[metric]
                                      for metric in flagged_averages}

# Print results for risk score cohorts
print("\nTop 10% Averages (Highest Risk):")
for metric, value in top_10_percent_averages.items():
    print(f"{metric}: {value:.2f}")
print(f"Number of patients in top 10% cohort: {top_10_percent_count}")

print("\nBottom 10% Averages (Lowest Risk):")
for metric, value in bottom_10_percent_averages.items():
    print(f"{metric}: {value:.2f}")
print(f"Number of patients in bottom 10% cohort: {bottom_10_percent_count}")

print("\nAverage Differences (Top 10% - Bottom 10%):")
for metric, value in average_differences_top_bottom.items():
    print(f"{metric}: {value:.2f}")

print("\nPatients Flagged for Heart Rate and Triglycerides (90th percentile for both heart rate and triglycerides):")
for metric, value in flagged_averages.items():
    print(f"{metric}: {value:.2f}")
print(f"Number of flagged patients: {flagged_count}")

print("\nAverage Differences (Flagged - Bottom 10%):")
for metric, value in average_differences_flagged_bottom.items():
    print(f"{metric}: {value:.2f}")


Average Risk Score: 145.65
Range of Risk Scores: 132.57 (Min: 79.09, Max: 211.66)
Number of Patients Flagged for Heart Rate and Triglycerides (90th percentile for both heart rate and triglycerides): 97

Top 10% Averages (Highest Risk):
Average BMI: 30.10
Average BP: 116.54
Average HR: 73.95
Average Cholesterol: 383.24
Average Triglycerides: 413.78
Number of patients in top 10% cohort: 876

Bottom 10% Averages (Lowest Risk):
Average BMI: 27.72
Average BP: 104.23
Average HR: 74.58
Average Cholesterol: 135.78
Average Triglycerides: 411.84
Number of patients in bottom 10% cohort: 876

Average Differences (Top 10% - Bottom 10%):
Average BMI: 2.38
Average BP: 12.31
Average HR: -0.63
Average Cholesterol: 247.46
Average Triglycerides: 1.94

Patients Flagged for Heart Rate and Triglycerides (90th percentile for both heart rate and triglycerides):
Average BMI: 28.48
Average BP: 110.63
Average HR: 107.08
Average Cholesterol: 262.47
Average Triglycerides: 765.58
Number of flagged patients: 97

Ave

### **Analysis of BMI Categories and Health Metrics**

The next function, `collect_and_calculate_averages_by_bmi`, produces some unexpected results when compared with the well-documented medical literature described above. Here, we discuss how the output deviates from expected trends, noting in particular the absence of any individuals categorized as Obese Class 3. In the United States of America, approximately 9.2% of the population is categorized as Obese Class 3, which would translate to an expected 806 individuals (8763 patients × 9.2%). It is hard to imagine that no individuals fit this category.

For resting heart rate, the expected relationship is that individuals with higher BMI may have a slightly higher resting heart rate due to the increased demand on the heart. However, the output shows the following averages: Underweight: 75.48, Normal: 74.92, Overweight: 74.78, Obese Class 1: 75.06, Obese Class 2: 75.34, and Obese Class 3: 0.00 (the code reports 0 for a category with no patients). The averages span less than one unit, and the highest value belongs to the Underweight category, so there is no clear positive correlation with BMI.

Regarding blood pressure, the expected relationship is a positive correlation with BMI: as BMI increases, blood pressure tends to increase. The output shows only a slight increase from Underweight to Normal, after which the averages are essentially flat: Underweight: 108.19, Normal: 110.18, Overweight: 110.18, Obese Class 1: 110.26, Obese Class 2: 110.02, and Obese Class 3: 0.00. The increase is far less pronounced than expected.

For triglycerides, higher BMI is often associated with higher levels. The output presents: Underweight: 407.97, Normal: 421.59, Overweight: 419.29, Obese Class 1: 411.03, Obese Class 2: 418.31, and Obese Class 3: 0.00. The averages do not rise steadily with BMI: the highest value occurs in the Normal category, and all categories that contain patients fall within about 14 units of one another.

Higher BMI is also expected to be associated with higher levels of total cholesterol and low-density lipoprotein (LDL) cholesterol. The output data shows: Underweight: 256.17, Normal: 260.10, Overweight: 257.33, Obese Class 1: 259.60, Obese Class 2: 262.92, and Obese Class 3: 0.00. Cholesterol levels do not show a strong positive correlation with BMI, and the differences are minor. 
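
Category averages only hint at correlation. One way to quantify it directly is the Pearson correlation coefficient, sketched below with `np.corrcoef`. The two arrays are invented values for six hypothetical patients, not data from the dataset; applying the same call to the BMI and cholesterol columns would be an extension of this analysis rather than something the notebook already does.

```python
import numpy as np

# Invented values for six hypothetical patients (not from the dataset).
bmi = np.array([19.0, 22.5, 26.0, 29.5, 33.0, 38.0])
cholesterol = np.array([180.0, 240.0, 210.0, 260.0, 230.0, 250.0])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(bmi, cholesterol)[0, 1]
print(f"Pearson r between BMI and cholesterol: {r:.3f}")
```

A value of r near 0 indicates little or no linear relationship, while values near +1 or -1 indicate a strong positive or negative one.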










### **Value and utility of the 'heart_attack_prediction_dataset.csv'**

The published description of the dataset 'heart_attack_prediction_dataset.csv' reads as follows:

"Context:
The Heart Attack Risk Prediction Dataset serves as a valuable resource for delving into the intricate dynamics of heart health and its predictors. Heart attacks, or myocardial infarctions, continue to be a significant global health issue, necessitating a deeper comprehension of their precursors and potential mitigating factors. This dataset encapsulates a diverse range of attributes including age, cholesterol levels, blood pressure, smoking habits, exercise patterns, dietary preferences, and more, aiming to elucidate the complex interplay of these variables in determining the likelihood of a heart attack. By employing predictive analytics and machine learning on this dataset, researchers and healthcare professionals can work towards proactive strategies for heart disease prevention and management. The dataset stands as a testament to collective efforts to enhance our understanding of cardiovascular health and pave the way for a healthier future.

Content:
This synthetic dataset provides a comprehensive array of features relevant to heart health and lifestyle choices, encompassing patient-specific details such as age, gender, cholesterol levels, blood pressure, heart rate, and indicators like diabetes, family history, smoking habits, obesity, and alcohol consumption. Additionally, lifestyle factors like exercise hours, dietary habits, stress levels, and sedentary hours are included. Medical aspects comprising previous heart problems, medication usage, and triglyceride levels are considered. Socioeconomic aspects such as income and geographical attributes like country, continent, and hemisphere are incorporated. The dataset, consisting of 8763 records from patients around the globe, culminates in a crucial binary classification feature denoting the presence or absence of a heart attack risk, providing a comprehensive resource for predictive analysis and research in cardiovascular health."

My investigation finds that the relationships between BMI and the basic health metrics examined do not follow the broad, well-established expectations of public and individual health. I have therefore concluded that the dataset 'heart_attack_prediction_dataset.csv' has no utility for real-world analysis.

In [436]:
# Function to collect and calculate averages by BMI category
def collect_and_calculate_averages_by_bmi():
    bmi_metrics = {
        'Heart Rate': {'Underweight': [], 'Normal': [], 'Overweight': [], 'Obese Class 1': [], 'Obese Class 2': [], 'Obese Class 3': []},
        'Blood Pressure': {'Underweight': [], 'Normal': [], 'Overweight': [], 'Obese Class 1': [], 'Obese Class 2': [], 'Obese Class 3': []},
        'Triglycerides': {'Underweight': [], 'Normal': [], 'Overweight': [], 'Obese Class 1': [], 'Obese Class 2': [], 'Obese Class 3': []},
        'Cholesterol': {'Underweight': [], 'Normal': [], 'Overweight': [], 'Obese Class 1': [], 'Obese Class 2': [], 'Obese Class 3': []}
    }
    
    for row in patients.rows:
        bmi = float(row[18])
        category = patients.bmi_category(bmi)
        
        hr = float(row[5])
        systolic, diastolic = map(float, row[4].split('/'))
        bp = (systolic + diastolic) / 2
        trig = float(row[19])
        chol = float(row[3])
        
        bmi_metrics['Heart Rate'][category].append(hr)
        bmi_metrics['Blood Pressure'][category].append(bp)
        bmi_metrics['Triglycerides'][category].append(trig)
        bmi_metrics['Cholesterol'][category].append(chol)
    
    average_bmi_metrics = {metric: {category: (sum(values) / len(values)) if values else 0
                                    for category, values in categories.items()}
                           for metric, categories in bmi_metrics.items()}
    
    return average_bmi_metrics

# Calculate and print averages by BMI category
average_bmi_metrics = collect_and_calculate_averages_by_bmi()

print("\nAverage Metrics by BMI Category:")
for metric, categories in average_bmi_metrics.items():
    print(f"\n{metric}:")
    for category, value in categories.items():
        print(f"{category}: {value:.2f}")



Average Metrics by BMI Category:

Heart Rate:
Underweight: 75.48
Normal: 74.92
Overweight: 74.78
Obese Class 1: 75.06
Obese Class 2: 75.34
Obese Class 3: 0.00

Blood Pressure:
Underweight: 108.19
Normal: 110.18
Overweight: 110.18
Obese Class 1: 110.26
Obese Class 2: 110.02
Obese Class 3: 0.00

Triglycerides:
Underweight: 407.97
Normal: 421.59
Overweight: 419.29
Obese Class 1: 411.03
Obese Class 2: 418.31
Obese Class 3: 0.00

Cholesterol:
Underweight: 256.17
Normal: 260.10
Overweight: 257.33
Obese Class 1: 259.60
Obese Class 2: 262.92
Obese Class 3: 0.00


#### **Random Patient ID Generation**

The function `generate_random_patient_id` creates a patient identifier by combining randomly generated letters and digits: a string of three uppercase letters (`A-Z`) followed by a string of four digits (`0-9`). The function uses the `random.choices` method to select the characters for both parts, then concatenates them to form an ID in the format `ABC1234`. Every ID therefore has a consistent format, but the IDs are not guaranteed to be unique, because `random.choices` samples with replacement and separate calls are independent of one another.


In [393]:
# Function to generate a random patient ID
def generate_random_patient_id():
    # Generate a random string of 3 uppercase letters
    letters = ''.join(random.choices('ABCDEFGHIJKLMNOPQRSTUVWXYZ', k=3))
    # Generate a random string of 4 digits
    numbers = ''.join(random.choices('0123456789', k=4))
    # Concatenate the letters and numbers to form the patient ID
    return letters + numbers
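
Since `random.choices` samples with replacement, repeated calls to the generator can occasionally return the same ID. If distinct IDs were ever required, one option is to collect them in a set until enough have been produced. The helper `generate_unique_patient_ids` and its `seed` parameter below are hypothetical additions, shown only as a sketch of that idea.

```python
import random

def generate_unique_patient_ids(count, seed=None):
    """Return `count` distinct IDs in the ABC1234 format (hypothetical helper)."""
    rng = random.Random(seed)  # local generator so the seed is reproducible
    ids = set()
    while len(ids) < count:
        letters = "".join(rng.choices("ABCDEFGHIJKLMNOPQRSTUVWXYZ", k=3))
        digits = "".join(rng.choices("0123456789", k=4))
        ids.add(letters + digits)  # a set silently drops duplicates
    return sorted(ids)

unique_ids = generate_unique_patient_ids(1000, seed=0)
print(len(unique_ids), unique_ids[:3])
```

For the lookup benchmark, duplicates are harmless, so the simpler generator is sufficient there.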


### **Benchmarking Patient Lookup Methods**

The `run_benchmark` function compares the performance of two methods for looking up patient IDs in the dataset. It first generates a specified number of random patient IDs using the `generate_random_patient_id` function. It then measures the time taken to look up these IDs using a standard method (`get_patient_from_id`) and an optimized method (`get_patient_from_id_fast`). The standard method performs a linear search through the dataset, while the optimized method uses a dictionary for faster lookups. Because the IDs are generated at random, most of them are unlikely to match any patient in the dataset, so the linear search usually scans every row and the benchmark measures close to the worst case for the standard method. The total time taken by each method is recorded, and the results are printed, including the speed improvement factor and the average time per lookup for both methods. This benchmarking process illustrates the efficiency gains achieved by the optimized lookup method.


In [394]:
# Function to run a benchmark comparing two patient lookup methods
def run_benchmark(patients, num_lookups):
    # Generate a list of random patient IDs
    random_ids = [generate_random_patient_id() for _ in range(num_lookups)]

    # Benchmark standard method
    start = time.perf_counter()
    for random_id in random_ids:
        patients.get_patient_from_id(random_id)
    total_time_no_dict = time.perf_counter() - start

    # Benchmark optimized method
    start = time.perf_counter()
    for random_id in random_ids:
        patients.get_patient_from_id_fast(random_id)
    total_time_dict = time.perf_counter() - start

    # Print benchmark results
    print(f"\nResults for {num_lookups} lookups:")
    print(f"Total time for standard method: {total_time_no_dict:.6f} seconds")
    print(f"Total time for optimized method: {total_time_dict:.6f} seconds")
    print(f"Speed improvement: {total_time_no_dict / total_time_dict:.2f}x faster")
    print(f"Average time per lookup (standard method): {(total_time_no_dict / num_lookups):.9f} seconds")
    print(f"Average time per lookup (optimized method): {(total_time_dict / num_lookups):.9f} seconds")


### **Running Benchmarks for Different Numbers of Lookups**

This code snippet runs a series of benchmarks to compare the performance of two patient ID lookup methods over varying scales. The `run_benchmark` function is called with different numbers of lookups: 10,000, 100,000, and 1,000,000. For each benchmark, the function measures the time taken to perform the specified number of lookups using both the standard method (`get_patient_from_id`) and the optimized method (`get_patient_from_id_fast`). By analyzing the execution times across these different scales, the benchmarks provide insights into the efficiency and scalability of each lookup method, highlighting the significant performance improvements achieved by the optimized method.

#### **Analysis of Benchmark Results**

The benchmark results demonstrate a large performance difference between the standard and optimized patient ID lookup methods at every number of lookups tested. For 10,000 lookups, the standard method took approximately 8.44 seconds, while the optimized method took only about 0.011 seconds, a speed improvement of about 764 times. For 100,000 lookups, the standard method took around 83.28 seconds compared to the optimized method's 0.109 seconds, a speed improvement of approximately 766 times. For 1,000,000 lookups, the standard method required about 814.47 seconds, whereas the optimized method completed the task in just 1.10 seconds, a speed improvement of about 743 times. The average time per lookup is nearly constant for each method across all three runs (roughly 0.8 milliseconds for the standard method and 1.1 microseconds for the optimized method), which is expected because the dataset size stays fixed at 8763 patients and only the number of lookups changes. The optimized method consistently outperforms the standard linear search by a factor of roughly 750, which makes it far better suited to workloads that perform many lookups.

In [395]:
# Run benchmarks for different numbers of lookups
for num_lookups in [10_000, 100_000, 1_000_000]:
    run_benchmark(patients, num_lookups)



Results for 10000 lookups:
Total time for standard method: 8.435899 seconds
Total time for optimized method: 0.011041 seconds
Speed improvement: 764.04x faster
Average time per lookup (standard method): 0.000843590 seconds
Average time per lookup (optimized method): 0.000001104 seconds

Results for 100000 lookups:
Total time for standard method: 83.281405 seconds
Total time for optimized method: 0.108694 seconds
Speed improvement: 766.20x faster
Average time per lookup (standard method): 0.000832814 seconds
Average time per lookup (optimized method): 0.000001087 seconds

Results for 1000000 lookups:
Total time for standard method: 814.467886 seconds
Total time for optimized method: 1.095935 seconds
Speed improvement: 743.17x faster
Average time per lookup (standard method): 0.000814468 seconds
Average time per lookup (optimized method): 0.000001096 seconds


#### **Benchmarking High-Risk Cohort Identification Methods**


The following code segment benchmarks the performance of standard and optimized high-risk cohort identification methods using random patient data. The function `benchmark_risk_cohort_random` generates a specified number of random patients and calculates risk thresholds for heart rate and triglycerides. It then measures the time taken by both the standard and optimized methods to identify high-risk patients and prints the benchmark results, including execution time, speed improvement, high-risk cohort size, and whether the results match between the two methods.

The unexpected results, where the standard method outperformed the optimized method in terms of time efficiency, can be attributed to several considerations. First, the optimized method, designed to speed up the process by reducing repetitive checks, likely suffers from the overhead of set operations and additional conditional checks. Set operations are efficient individually, but every insertion and membership test requires hashing, and the combined use of multiple sets adds per-patient work that appears to outweigh its benefits for this specific task.

Second, the simplicity and directness of the standard method, which iterates through the patient data straightforwardly, may allow it to benefit more from modern CPU and memory caching mechanisms. The optimized method's additional operations and checks could disrupt these effects, leading to slower overall performance. This explanation was not measured directly and remains a hypothesis.

The respective time efficiency models for the methods can also provide insights. The standard method follows a linear time complexity, O(n), where n is the number of patients, as it involves a single pass through the data. In contrast, the optimized method also has a linear time complexity, O(n), because it must still check each patient. However, the optimized method may have higher constant factors due to the additional operations involved, such as set lookups and insertions, which can increase the overall execution time despite the similar asymptotic complexity.
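
The constant-factor argument above can be illustrated with two functions that return the same cohort. This is a sketch under stated assumptions: the tuple layout of `patients` and both function bodies are hypothetical and are not the project's `check_high_risk_cohort` or `check_high_risk_cohort_fast`.

```python
# Hypothetical patient records: (patient_id, heart_rate, triglycerides).
patients = [
    ("P1", 95, 210),
    ("P2", 62, 450),
    ("P3", 101, 480),
    ("P4", 58, 120),
    ("P5", 110, 500),
]


def cohort_single_pass(patients, hr_threshold, trig_threshold):
    """One pass with two comparisons per patient: O(n)."""
    return [
        pid
        for pid, hr, trig in patients
        if hr > hr_threshold and trig > trig_threshold
    ]


def cohort_with_sets(patients, hr_threshold, trig_threshold):
    """Also O(n), but patients may incur hashing, set inserts, and lookups."""
    high_hr = set()
    high_risk = set()
    for pid, hr, trig in patients:
        if hr > hr_threshold:
            high_hr.add(pid)  # extra hash and insert
        if pid in high_hr and trig > trig_threshold:
            high_risk.add(pid)  # extra membership test and insert
    return high_risk


standard = cohort_single_pass(patients, hr_threshold=90, trig_threshold=400)
fast = cohort_with_sets(patients, hr_threshold=90, trig_threshold=400)
print(sorted(standard), sorted(fast))
```

Both functions visit every patient exactly once, so neither can beat the other asymptotically. Any difference in running time comes from how much work is done per patient.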

The benchmark results showed that for 1,000 patients, the standard method took 0.000218 seconds, while the optimized method took 0.000509 seconds. The reported speed improvement factor of 0.43x is below 1, meaning the optimized method took about 2.3 times as long. For 10,000 patients, the standard method took 0.001836 seconds compared to the optimized method's 0.004626 seconds, a factor of 0.40x (about 2.5 times as long). At 100,000 and 1,000,000 patients, the standard method continued to outperform the optimized method, with a factor of 0.38x in both cases (about 2.6 times as long). The near-constant ratio across scales is consistent with both methods scaling linearly and differing mainly in per-patient cost. Despite the slower performance, the optimized method consistently identified the same high-risk cohort as the standard method, confirming the accuracy of its results.

These considerations suggest that while theoretical optimizations can appear beneficial, practical performance gains depend heavily on implementation details and the nature of the data and operations involved. Further analysis and refinement of the optimized method, possibly by reducing unnecessary operations and leveraging more efficient data structures, could help achieve the intended performance improvements.
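
As a starting point for that further analysis, timings this short (fractions of a millisecond at the smallest scale) are sensitive to noise from a single run. The sketch below repeats each measurement with the standard library's `timeit.repeat` and keeps the best run. The helper `best_time` and the two stub workloads are assumptions for illustration and stand in for the real cohort functions.

```python
import timeit


def best_time(func, repeats=5, number=10):
    """Return the best per-call time in seconds over several repeated runs."""
    totals = timeit.repeat(func, repeat=repeats, number=number)
    return min(totals) / number


# Hypothetical workloads standing in for the two cohort functions.
data = list(range(10_000))


def standard_stub():
    """Single-pass filter using a list comprehension."""
    return [x for x in data if x % 7 == 0]


def set_based_stub():
    """Same filter, but collecting the matches in a set."""
    hits = set()
    for x in data:
        if x % 7 == 0:
            hits.add(x)
    return hits


t_standard = best_time(standard_stub)
t_set_based = best_time(set_based_stub)
print(f"standard: {t_standard:.6f} s, set-based: {t_set_based:.6f} s")
```

Taking the minimum over repeats is a common convention because slower runs usually reflect interference from other processes rather than the code under test.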








In [408]:
# Function to benchmark the performance of standard and optimized high-risk cohort identification methods with random data
def benchmark_risk_cohort_random(num_patients):
    # Generate random patients data
    random_patients = RandomPatients(num_patients)
    # Calculate risk thresholds for heart rate and triglycerides
    hr_threshold, trig_threshold = random_patients.calculate_risk_thresholds()
    
    # Measure the time taken by the standard method
    start = time.perf_counter()
    standard_cohort = check_high_risk_cohort(random_patients, hr_threshold, trig_threshold)
    standard_time = time.perf_counter() - start
    
    # Measure the time taken by the optimized method
    start = time.perf_counter()
    fast_cohort = check_high_risk_cohort_fast(random_patients, hr_threshold, trig_threshold)
    fast_time = time.perf_counter() - start
    
    # Print benchmark results
    print(f"\nResults for {num_patients} patients:")
    print(f"Standard method time: {standard_time:.6f} seconds")
    print(f"Optimized method time: {fast_time:.6f} seconds")
    print(f"Speed improvement: {standard_time / fast_time:.2f}x faster")
    print(f"High-risk cohort size: {len(standard_cohort)}")
    print(f"Results match: {set(standard_cohort) == set(fast_cohort)}")

# Run benchmarks with different numbers of patients
for num_patients in [1000, 10000, 100000, 1000000]:
    benchmark_risk_cohort_random(num_patients)



Results for 1000 patients:
Standard method time: 0.000218 seconds
Optimized method time: 0.000509 seconds
Speed improvement: 0.43x faster
High-risk cohort size: 39
Results match: True

Results for 10000 patients:
Standard method time: 0.001836 seconds
Optimized method time: 0.004626 seconds
Speed improvement: 0.40x faster
High-risk cohort size: 413
Results match: True

Results for 100000 patients:
Standard method time: 0.017838 seconds
Optimized method time: 0.046834 seconds
Speed improvement: 0.38x faster
High-risk cohort size: 4315
Results match: True

Results for 1000000 patients:
Standard method time: 0.179092 seconds
Optimized method time: 0.470349 seconds
Speed improvement: 0.38x faster
High-risk cohort size: 43335
Results match: True


### **Conclusion**


#### **Objective 1: Enhancing Time Efficiency with Advanced Engineering Techniques**

The primary goal of Objective 1 was to enhance the time efficiency of patient data analysis by employing advanced engineering techniques. These techniques included using dictionaries for fast lookups and sets for quick cohort membership testing. In the `Patients` class, the `get_patient_from_id_fast` method utilized a dictionary for efficient patient ID retrieval, while the `check_high_risk_cohort_fast` function leveraged sets for quick membership testing.

The benchmark results highlighted a significant performance difference between the standard and optimized patient ID lookup methods across various scales of lookups. For 10,000 lookups, the standard method took approximately 8.44 seconds, whereas the optimized method took only about 0.011 seconds, achieving a speed improvement of over 764 times. This trend continued at larger scales, with the optimized method consistently outperforming the standard linear search approach by a substantial margin. The average time per lookup for the optimized method remained almost constant across different scales, demonstrating its suitability for handling large datasets efficiently.

However, the benchmarking of high-risk cohort identification methods produced unexpected results. The standard method consistently outperformed the optimized method in terms of time efficiency. This discrepancy can be attributed to the overhead associated with set operations and additional conditional checks in the optimized method. Despite its theoretical advantages, the optimized method's practical performance suffered due to the complexity introduced by these operations. The respective time efficiency models indicated that both methods had a linear time complexity of O(n), but the optimized method's higher constant factors negated its potential benefits.

These results suggest that while theoretical optimizations can appear beneficial, practical performance gains depend heavily on implementation details and the nature of the data and operations involved. Further analysis and refinement of the optimized method, possibly by reducing unnecessary operations and leveraging more efficient data structures, could help achieve the intended performance improvements.

#### **Objective 2: Viability and Utility of a Synthetically Developed Dataset**

Objective 2 aimed to discern the viability and utility of the synthetically developed dataset `heart_attack_prediction_dataset.csv` by examining the relationship between Body Mass Index (BMI) and various health-related metrics, such as cholesterol, triglycerides, blood pressure, and heart rate.

The average risk score across all 8763 patients was 145.65, with a range from 79.09 to 211.66. A specific analysis of the top 10% of patients with the highest risk scores revealed that these patients had higher average BMI, blood pressure, and cholesterol levels compared to the bottom 10% with the lowest risk scores. However, the data showed unexpected trends when categorized by BMI. For instance, no individuals were categorized as Obese Class 3, despite an expected 9.2% of the population fitting this category.

The function designed to collect and calculate averages by BMI category produced results that deviated from well-documented medical literature. The expected positive correlation between BMI and metrics such as blood pressure and cholesterol was not consistently observed. Specifically, resting heart rate and blood pressure did not show the anticipated increase with higher BMI, and the absence of individuals in the Obese Class 3 category further skewed the data.
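
For reference, a per-category summary of this kind can be sketched as follows. The cut-offs are the conventional BMI bands; the project's own function, field names, and thresholds may differ, and the `sample` records are invented purely for illustration.

```python
from collections import defaultdict

# Conventional BMI bands as (exclusive upper bound, label); assumed, not taken
# from the project's code. The last band is open-ended.
BMI_BANDS = [
    (18.5, "Underweight"),
    (25.0, "Normal"),
    (30.0, "Overweight"),
    (35.0, "Obese Class 1"),
    (40.0, "Obese Class 2"),
    (float("inf"), "Obese Class 3"),
]


def bmi_category(bmi):
    """Return the label of the first band whose upper bound exceeds bmi."""
    for upper, label in BMI_BANDS:
        if bmi < upper:
            return label


def averages_by_category(records, metric):
    """Return {category: (count, mean)} for every band, including empty ones."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[bmi_category(rec["bmi"])].append(rec[metric])
    summary = {}
    for _, label in BMI_BANDS:
        values = grouped.get(label, [])
        mean = sum(values) / len(values) if values else None
        summary[label] = (len(values), mean)
    return summary


# Hypothetical records; the keys "bmi" and "cholesterol" are illustrative.
sample = [
    {"bmi": 22.0, "cholesterol": 180},
    {"bmi": 27.5, "cholesterol": 210},
    {"bmi": 28.0, "cholesterol": 230},
    {"bmi": 36.0, "cholesterol": 250},
]
for label, (count, mean) in averages_by_category(sample, "cholesterol").items():
    print(label, count, mean)
```

Reporting every band, including those with a count of zero, makes a missing population segment such as an empty Obese Class 3 group visible instead of silently dropping it from the summary.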

A review of the dataset literature indicated that the 'heart_attack_prediction_dataset.csv' was intended to mimic real-world data. However, the observed discrepancies in basic health metrics suggest that the dataset does not reliably reflect real-world trends. This lack of correlation and the absence of expected population segments indicate that the dataset has limited utility in real-world analysis.

#### **Final Thoughts**

In conclusion, while the project successfully demonstrated the application of advanced engineering techniques for time efficiency, the practical performance gains were mixed, highlighting the need for careful implementation and further refinement. Additionally, the synthetic dataset's inability to accurately represent real-world health trends underscores the importance of validating data sources before relying on them for critical analyses.






