#### **Introduction**

This notebook expands a previously created synthetic dataset of 250,000 rows by adding attributes related to health and lifestyle. The expanded dataset includes several new columns and derived values, offering a richer basis for statistical analysis and machine learning applications.

#### Data Enrichment
New columns are generated based on a variety of health risk factors, including:

- **Smoking Habits**: (Smoker, Former Smoker, Non-Smoker).
- **Alcohol Consumption**: (Non-Drinker, Moderate Drinker, Heavy/Binge Drinker).
- **Exercise Patterns**: (Meets Both Guidelines, Meets Aerobic Only, Insufficiently Active, Inactive).
- **Sleep Deprivation**: (Adequate Sleep, Less than 7 Hours, Chronic Sleep Deprivation).
- **Other Health Conditions**: Columns for conditions such as Metabolic Syndrome, COPD, Cancer, Diabetes, and Heart Disease were also included.

#### Health Condition Assignment
Algorithms were designed to assign binary outcomes for a number of health conditions:

- **Diabetes**: Assigned based on age groups and fasting blood glucose (FBG) thresholds. Younger individuals require higher FBG levels for a diabetes diagnosis, while older individuals have a lower FBG threshold.
- **Cancer**: Assigned with a base probability for individuals over 65, with smoking and BMI > 37 as contributing risk factors.
- **Heart Disease**: Probability of heart disease is increased by risk factors such as BMI, alcohol consumption, high fasting blood glucose, and smoking.
- **Metabolic Syndrome**: Assigned based on the presence of at least three risk factors, including high blood pressure, high triglycerides, low HDL, high fasting blood glucose, and high waist circumference.
- **COPD**: Assigned to smokers over age 45, with increased probabilities for smokers over age 60.

#### Physical Exam Score Calculation
A custom **Physical Exam Score** is calculated for each individual, starting with an age-based base score:

- **Base Score**: Determined by age, with younger individuals receiving a higher score.
- **Adjustments**: Reductions are applied based on negative health factors like smoking, alcohol consumption, inactivity, sleep deprivation, and conditions like heart disease, cancer, COPD, and diabetes. Each risk factor subtracts a specific percentage from the base score.
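
The scoring scheme described above can be sketched as follows. This is only an illustration: the age bands, base scores, and penalty percentages are placeholder values (this section does not state the notebook's actual numbers), and the names `base_score`, `PENALTIES`, and `physical_exam_score` are hypothetical.

```python
# Placeholder values for illustration only; not the notebook's actual numbers.
def base_score(age: int) -> float:
    """Age-based base score: younger individuals start higher."""
    if age < 30:
        return 100.0
    if age < 50:
        return 90.0
    if age < 65:
        return 80.0
    return 70.0

# Assumed share of the base score subtracted for each risk factor.
PENALTIES = {
    "smoker": 0.10,
    "heavy_drinker": 0.05,
    "inactive": 0.05,
    "heart_disease": 0.15,
}

def physical_exam_score(age: int, risk_factors: set) -> float:
    """Subtract each risk factor's percentage from the age-based base score."""
    base = base_score(age)
    reduction = sum(PENALTIES[factor] for factor in risk_factors)
    return round(base * (1.0 - reduction), 2)

score_healthy = physical_exam_score(25, set())
score_at_risk = physical_exam_score(70, {"smoker", "heart_disease"})
print(score_healthy, score_at_risk)
```

Because every penalty is a percentage of the same base score, the order in which risk factors are applied does not change the result.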

#### Statistical Summaries
The notebook computes key statistics, such as the percentage of individuals with diabetes, cancer, COPD, heart disease, and metabolic syndrome. In addition, a breakdown of physical exam scores by age group is provided.
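
A minimal sketch of these summaries on toy data is shown here. The column names `Diabetes` and `Physical Exam Score` and the age bins are assumptions made for the example, and the toy values are not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the enhanced dataset; column names are assumed.
toy = pd.DataFrame({
    "Age": [22, 35, 47, 58, 66, 81],
    "Diabetes": [0, 0, 1, 0, 1, 1],
    "Physical Exam Score": [95.0, 88.0, 70.0, 76.0, 55.0, 48.0],
})

# Prevalence of a binary condition is the column mean times 100.
diabetes_pct = toy["Diabetes"].mean() * 100

# Bin ages into groups, then average the score within each group.
age_groups = pd.cut(
    toy["Age"],
    bins=[17, 24, 44, 64, 120],
    labels=["18-24", "25-44", "45-64", "65+"],
)
score_by_age = toy.groupby(age_groups, observed=True)["Physical Exam Score"].mean()

print(f"Diabetes prevalence: {diabetes_pct:.1f}%")
print(score_by_age)
```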


#### **File Existence Check and Dataset Saving**

This script checks whether the input CSV file exists before attempting to read and save it. The goal is to avoid errors related to file non-existence and ensure the dataset is successfully processed.

#### Steps:
1. **File Existence Check**:
   - The script checks if the file `final_synthetic_dataset_250k.csv` exists in the current directory.
   
2. **Reading the CSV File**:
   - If the file exists, it is read into a Pandas DataFrame.

3. **Saving the Enhanced Dataset**:
   - The DataFrame is saved as `enhanced_250K.csv`.
   
4. **Output**:
   - If the file is found, a success message is printed.
   - If the file is not found, an error message is printed indicating the missing file.


In [119]:
import pandas as pd
import os

# File names are relative to the current working directory
input_file = "final_synthetic_dataset_250k.csv"
output_file = "enhanced_250K.csv"

# Step 1: Check if the file exists before trying to read
if os.path.exists(input_file):
    df = pd.read_csv(input_file)
    df.to_csv(output_file, index=False)
    print(f"Enhanced dataset saved as {output_file}")
else:
    print(f"File not found: {input_file}")


Enhanced dataset saved as enhanced_250K.csv


In [120]:
# Display the first two rows from the actual dataset
df = pd.read_csv('final_synthetic_dataset_250k.csv')
print(df.head(2))


   Age Gender   BMI  Waist_Circumference BMI_Category  Triglyceride  HDL  \
0   40   Male  36.4                 42.4        Obese           211   55   
1   24   Male  33.3                 38.3        Obese           290   41   

   High_Blood_Pressure  FBG  
0                    1  233  
1                    1   82  


#### **Alcohol Use Assignment Algorithm**

This algorithm assigns the **"Alcohol Use"** category (Non-Drinker, Moderate Drinker, or Heavy/Binge Drinker) to each row in the dataset based on fasting blood glucose (FBG), blood pressure, and triglyceride levels.

#### Steps:
1. **Base Probabilities**: 
   - Non-Drinker: 30%
   - Moderate Drinker: 50%
   - Heavy/Binge Drinker: 20%
   
2. **Probability Adjustments**:
   - **FBG > 175**: Multiplies the **Moderate Drinker** and **Heavy/Binge Drinker** weights by 2.
   - **High Blood Pressure** (the binary `High_Blood_Pressure` flag equals 1): Multiplies the **Moderate Drinker** and **Heavy/Binge Drinker** weights by 1.5.
   - **Triglyceride > 125**: Multiplies the **Moderate Drinker** and **Heavy/Binge Drinker** weights by 1.5.
   - The multipliers compound when several conditions hold; the weights are rescaled in the normalization step.

3. **Normalization**:
   - Probabilities are normalized to ensure they sum to 1.

4. **Random Assignment**:
   - Each individual is assigned an alcohol use category based on the adjusted and normalized probabilities.

5. **Output**:
   - The dataset is updated with the new **Alcohol Use** column and saved as `enhanced_250K.csv`.
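
As a worked example of the adjustment and normalization steps, the sketch below computes the final probabilities for two profiles without drawing any random values. The helper name `alcohol_probs` is hypothetical, and the blood pressure rule is applied to a 0/1 flag, which is an assumption based on the binary `High_Blood_Pressure` column shown in the dataset preview.

```python
def alcohol_probs(fbg: float, high_bp: int, triglyceride: float) -> list:
    """Return normalized [Non-Drinker, Moderate, Heavy/Binge] probabilities."""
    weights = [0.30, 0.50, 0.20]
    if fbg > 175:
        weights[1] *= 2
        weights[2] *= 2
    if high_bp == 1:  # assumption: blood pressure is stored as a 0/1 flag
        weights[1] *= 1.5
        weights[2] *= 1.5
    if triglyceride > 125:
        weights[1] *= 1.5
        weights[2] *= 1.5
    total = sum(weights)
    return [w / total for w in weights]

# No rule applies: the base probabilities come back unchanged.
baseline = alcohol_probs(fbg=90, high_bp=0, triglyceride=100)

# FBG > 175 and triglyceride > 125: the weights become [0.30, 1.50, 0.60],
# which normalize to [0.125, 0.625, 0.25].
elevated = alcohol_probs(fbg=233, high_bp=0, triglyceride=211)

print([round(p, 3) for p in baseline])
print([round(p, 3) for p in elevated])
```

Because only the two drinker weights are ever scaled up, each rule that fires shrinks the Non-Drinker share after normalization.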


In [121]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('final_synthetic_dataset_250k.csv')

# Define the base probabilities
base_probs = {'Non-Drinker': 0.30, 'Moderate Drinker': 0.50, 'Heavy/Binge Drinker': 0.20}

# Function to assign alcohol use based on conditions
def assign_alcohol_use(row):
    # Base probabilities
    probs = [0.30, 0.50, 0.20]  # Non-Drinkers, Moderate Drinkers, Heavy/Binge Drinkers
    
    # Adjust the probabilities for specific conditions
    if row['FBG'] > 175:
        probs[1] *= 2  # Moderate Drinkers weight increases by 2x
        probs[2] *= 2  # Heavy/Binge Drinkers weight increases by 2x

    # High_Blood_Pressure is a 0/1 flag, so test the flag itself
    # (a "> 150" comparison on this column could never be true)
    if row['High_Blood_Pressure'] == 1:
        probs[1] *= 1.5  # Moderate Drinkers weight increases by 1.5x
        probs[2] *= 1.5  # Heavy/Binge Drinkers weight increases by 1.5x

    if row['Triglyceride'] > 125:
        probs[1] *= 1.5  # Moderate Drinkers weight increases by 1.5x
        probs[2] *= 1.5  # Heavy/Binge Drinkers weight increases by 1.5x

    # Normalize the probabilities so that they sum to 1
    total = sum(probs)
    normalized_probs = [p / total for p in probs]
    
    # Randomly assign based on the weighted probabilities
    return np.random.choice(['Non-Drinker', 'Moderate Drinker', 'Heavy/Binge Drinker'], p=normalized_probs)

# Apply the function to each row
df['Alcohol Use'] = df.apply(assign_alcohol_use, axis=1)

# Display the first few rows to check the distribution
print(df[['Age', 'FBG', 'High_Blood_Pressure', 'Triglyceride', 'Alcohol Use']].head())

# Save the updated dataset
df.to_csv('enhanced_250K.csv', index=False)


   Age  FBG  High_Blood_Pressure  Triglyceride       Alcohol Use
0   40  233                    1           211  Moderate Drinker
1   24   82                    1           290  Moderate Drinker
2   73  102                    1            86  Moderate Drinker
3   90  209                    0           202  Moderate Drinker
4   99   94                    0            71  Moderate Drinker


In [122]:
# Display the first 2 rows of the actual dataset with the Alcohol Use column
print(df.head(2))


   Age Gender   BMI  Waist_Circumference BMI_Category  Triglyceride  HDL  \
0   40   Male  36.4                 42.4        Obese           211   55   
1   24   Male  33.3                 38.3        Obese           290   41   

   High_Blood_Pressure  FBG       Alcohol Use  
0                    1  233  Moderate Drinker  
1                    1   82  Moderate Drinker  


#### **Smoking Status Assignment Algorithm**

This algorithm assigns a **smoking status** (Smoker, Former Smoker, or Non-Smoker) to each individual in the dataset based on their gender. Different probabilities are applied depending on whether the individual is **Male** or **Female**. If the gender is missing or undefined, default probabilities for the general population are used.

#### Steps:
1. **Gender-Based Probabilities**:
   - **Male**:
     - Smoker: 16%
     - Former Smoker: 29%
     - Non-Smoker: 55%
   - **Female**:
     - Smoker: 11%
     - Former Smoker: 29%
     - Non-Smoker: 60%
   - **General Population (Undefined Gender)**:
     - Smoker: 12.5%
     - Former Smoker: 29%
     - Non-Smoker: 58.5%

2. **Random Assignment**:
   - Smoking status is assigned randomly based on the gender-specific probabilities.

3. **Output**:
   - The dataset is updated with the new **Smoker** column and saved as `enhanced_250K.csv`.
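
One way to check that the assignment respects the gender-specific probabilities is to compare row-normalized shares against the targets. The sketch below does this on a toy frame, not on the real dataset. It uses a seeded `numpy.random.default_rng` generator and draws once per gender group rather than once per row; both are choices made for this example and differ from the notebook's row-wise `np.random.choice` calls.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

# Labels follow the spelling of the values stored by the notebook.
LABELS = ["Smoker", "Former Smoker", "Non Smoker"]
PROBS = {
    "Male": [0.16, 0.29, 0.55],
    "Female": [0.11, 0.29, 0.60],
}
DEFAULT_PROBS = [0.125, 0.29, 0.585]

# Toy frame standing in for the real dataset.
toy = pd.DataFrame({"Gender": rng.choice(["Male", "Female"], size=20_000)})
toy["Smoker"] = ""

# Draw all values for one gender in a single call.
for gender, p in PROBS.items():
    mask = toy["Gender"] == gender
    toy.loc[mask, "Smoker"] = rng.choice(LABELS, size=int(mask.sum()), p=p)

# Rows with any other gender value fall back to the general-population rates.
other = ~toy["Gender"].isin(list(PROBS))
if other.any():
    toy.loc[other, "Smoker"] = rng.choice(
        LABELS, size=int(other.sum()), p=DEFAULT_PROBS
    )

# Each row of this table should be close to the target probabilities.
shares = pd.crosstab(toy["Gender"], toy["Smoker"], normalize="index")
print(shares.round(3))
```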


In [123]:
import numpy as np

# Function to assign smoking status based on gender
def assign_smoker(row):
    if row['Gender'] == 'Male':
        probs = [0.16, 0.29, 0.55]  # [Smoker, Former Smoker, Non-Smoker] for Men
    elif row['Gender'] == 'Female':
        probs = [0.11, 0.29, 0.60]  # [Smoker, Former Smoker, Non-Smoker] for Women
    else:
        # If gender is missing or undefined, use overall probabilities
        probs = [0.125, 0.29, 0.585]  # [Smoker, Former Smoker, Non-Smoker] for general population

    # Randomly assign smoking status based on the probabilities
    return np.random.choice(['Smoker', 'Former Smoker', 'Non Smoker'], p=probs)

# Apply the function to each row in the dataset
df['Smoker'] = df.apply(assign_smoker, axis=1)

# Display the first few rows to check the Smoker column
print(df[['Age', 'Gender', 'Smoker']].head())

# Save the updated dataset
df.to_csv('enhanced_250K.csv', index=False)


   Age  Gender         Smoker
0   40    Male     Non Smoker
1   24    Male         Smoker
2   73  Female     Non Smoker
3   90    Male  Former Smoker
4   99    Male     Non Smoker


In [124]:
# Display the first 2 rows of the dataset with the Smoker column
print(df.head(2))


   Age Gender   BMI  Waist_Circumference BMI_Category  Triglyceride  HDL  \
0   40   Male  36.4                 42.4        Obese           211   55   
1   24   Male  33.3                 38.3        Obese           290   41   

   High_Blood_Pressure  FBG       Alcohol Use      Smoker  
0                    1  233  Moderate Drinker  Non Smoker  
1                    1   82  Moderate Drinker      Smoker  


#### **Exercise Category Assignment Algorithm**

This algorithm assigns an **exercise category** (Meets Both Guidelines, Meets Aerobic Only, Insufficiently Active, Inactive) based on both **gender** and **age**. Probabilities for each exercise category are adjusted to reflect known trends in physical activity among different gender and age groups.

#### Steps:
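1. **Base Weights** (used when gender is missing; they sum to 125% and are rescaled in the normalization step):
   - **Meets Both Guidelines**: 25%
   - **Meets Aerobic Only**: 47.5%
   - **Insufficiently Active**: 30%
   - **Inactive**: 22.5%

2. **Adjustments Based on Gender** (weights listed in the same category order):
   - **Men**: 30%, 45%, 30%, 20% (higher weight for meeting both guidelines).
   - **Women**: 22%, 48%, 30%, 25% (lower weight for meeting both guidelines).

3. **Adjustments Based on Age** (for ages 18 and over, the age-based value replaces the gender-based weight for Meets Both Guidelines):
   - **Ages 18-24**: weight of 35% for meeting both guidelines.
   - **Ages 25-44**: weight of 30% for meeting both guidelines.
   - **Ages 45-64**: weight of 25% for meeting both guidelines.
   - **Ages 65+**: weight of 20% for meeting both guidelines.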

4. **Normalization**:
   - Probabilities are normalized to ensure they sum to 1.

5. **Random Assignment**:
   - Exercise categories are assigned randomly based on the adjusted probabilities for gender and age.

6. **Output**:
   - The dataset is updated with the new **Exercise** column and saved as `enhanced_250K.csv`.
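
Because the weights do not sum to 1, the effective probabilities after normalization are lower than the listed percentages. The sketch below computes them for a given gender and age using the weights from the notebook's code cell; the helper name `exercise_probs` is hypothetical.

```python
CATEGORIES = [
    "Meets Both Guidelines",
    "Meets Aerobic Only",
    "Insufficiently Active",
    "Inactive",
]

def exercise_probs(gender: str, age: int) -> dict:
    """Effective category probabilities after normalization."""
    if gender == "Male":
        weights = [0.30, 0.45, 0.30, 0.20]
    elif gender == "Female":
        weights = [0.22, 0.48, 0.30, 0.25]
    else:
        weights = [0.25, 0.475, 0.30, 0.225]

    # The age rule overwrites the first weight for anyone aged 18 or older.
    if 18 <= age <= 24:
        weights[0] = 0.35
    elif 25 <= age <= 44:
        weights[0] = 0.30
    elif 45 <= age <= 64:
        weights[0] = 0.25
    elif age >= 65:
        weights[0] = 0.20

    total = sum(weights)
    return {c: round(w / total, 3) for c, w in zip(CATEGORIES, weights)}

male_30 = exercise_probs("Male", 30)
female_70 = exercise_probs("Female", 70)
print(male_30)
print(female_70)
```

For a 30-year-old man the weights sum to 1.25, so the 45% weight for Meets Aerobic Only becomes an effective probability of 36%.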


In [125]:
# Function to assign exercise category based on gender and age
def assign_exercise(row):
    # Base probabilities for exercise categories
    probs = [0.25, 0.475, 0.30, 0.225]  # [Meets Both, Meets Aerobic Only, Insufficiently Active, Inactive]

    # Adjust based on gender
    if row['Gender'] == 'Male':
        probs = [0.30, 0.45, 0.30, 0.20]  # Men are slightly more likely to meet both guidelines
    elif row['Gender'] == 'Female':
        probs = [0.22, 0.48, 0.30, 0.25]  # Women have a lower chance of meeting both guidelines

    # Adjust based on age
    if 18 <= row['Age'] <= 24:
        probs[0] = 0.35  # Young adults are more likely to meet both guidelines
    elif 25 <= row['Age'] <= 44:
        probs[0] = 0.30  # Middle-aged adults
    elif 45 <= row['Age'] <= 64:
        probs[0] = 0.25  # Older adults
    elif row['Age'] >= 65:
        probs[0] = 0.20  # Seniors have a lower chance of meeting both guidelines

    # Normalize probabilities so they sum to 1
    total = sum(probs)
    normalized_probs = [p / total for p in probs]

    # Randomly assign exercise category based on weighted probabilities
    return np.random.choice(
        ['Meets Both Guidelines', 'Meets Aerobic Only', 'Insufficiently Active', 'Inactive'],
        p=normalized_probs
    )

# Apply the function to assign exercise categories
df['Exercise'] = df.apply(assign_exercise, axis=1)

# Display the first few rows to check the Exercise column
print(df[['Age', 'Gender', 'Exercise']].head())

# Save the updated dataset
df.to_csv('enhanced_250K.csv', index=False)


   Age  Gender               Exercise
0   40    Male     Meets Aerobic Only
1   24    Male  Insufficiently Active
2   73  Female               Inactive
3   90    Male  Meets Both Guidelines
4   99    Male  Meets Both Guidelines


In [126]:
# Display the first 2 rows of the dataset with the Exercise column
print(df.head(2))


   Age Gender   BMI  Waist_Circumference BMI_Category  Triglyceride  HDL  \
0   40   Male  36.4                 42.4        Obese           211   55   
1   24   Male  33.3                 38.3        Obese           290   41   

   High_Blood_Pressure  FBG       Alcohol Use      Smoker  \
0                    1  233  Moderate Drinker  Non Smoker   
1                    1   82  Moderate Drinker      Smoker   

                Exercise  
0     Meets Aerobic Only  
1  Insufficiently Active  


In [127]:
# Calculate the percentage distribution of each category in the Exercise column
exercise_distribution = df['Exercise'].value_counts(normalize=True) * 100

# Print the percentage distribution
print(exercise_distribution)


Exercise
Meets Aerobic Only       37.2720
Insufficiently Active    24.2364
Meets Both Guidelines    20.3256
Inactive                 18.1660
Name: proportion, dtype: float64


#### **Sleep Category Assignment Algorithm**

This algorithm assigns a **sleep category** (Adequate Sleep, Less than 7 Hours, Chronic Sleep Deprivation) based on the individual's **BMI**, **age**, and **gender**. Probabilities for each sleep category are adjusted based on known patterns in sleep behavior for these demographic groups.

#### Steps:
1. **Base Probabilities**:
   - Adequate Sleep (7+ hours): 65%
   - Less than 7 Hours: 25%
   - Chronic Sleep Deprivation (≤5 hours): 10%

2. **Adjustments Based on Gender**:
   - **Men**: Slightly lower chance of getting adequate sleep (60% adequate, 30% less than 7 hours).
   - **Women**: Slightly higher chance of getting adequate sleep (70% adequate, 20% less than 7 hours).

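3. **Adjustments Based on Age** (for ages 18 and over, these values replace the gender-based ones):
   - **Ages 18-24**: Young adults have a 70% chance of adequate sleep (20% less than 7 hours, 10% chronic deprivation).
   - **Ages 25-44**: Middle-aged adults have a 65% chance of adequate sleep (25% less than 7 hours, 10% chronic deprivation).
   - **Ages 45-64**: Older adults have a 60% chance of adequate sleep (30% less than 7 hours, 10% chronic deprivation).
   - **Ages 65+**: Seniors have the same rates as ages 45-64.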

4. **Adjustments Based on BMI**:
   - **BMI > 30** (Obesity): 
     - Doubles the probability of getting less than 7 hours of sleep.
     - Triples the probability of experiencing chronic sleep deprivation.

5. **Normalization**:
   - Probabilities are normalized to ensure they sum to 1.

6. **Random Assignment**:
   - Sleep categories are assigned randomly based on the adjusted probabilities for BMI, gender, and age.

7. **Output**:
   - The dataset is updated with the new **Hours of Sleep** column and saved as `enhanced_250K.csv`.
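
The effect of the BMI multipliers is easiest to see after normalization. The sketch below follows the rule order of the notebook's code cell and returns the final probabilities for one profile; the helper name `sleep_probs` is hypothetical, and the two oldest age bands are merged because they use identical values.

```python
def sleep_probs(gender: str, age: int, bmi: float) -> list:
    """Normalized [Adequate, Less than 7 Hours, Chronic] probabilities."""
    if gender == "Male":
        weights = [0.60, 0.30, 0.10]
    elif gender == "Female":
        weights = [0.70, 0.20, 0.10]
    else:
        weights = [0.65, 0.25, 0.10]

    # Age rules replace all three weights for anyone aged 18 or older.
    if 18 <= age <= 24:
        weights = [0.70, 0.20, 0.10]
    elif 25 <= age <= 44:
        weights = [0.65, 0.25, 0.10]
    elif age >= 45:  # ages 45-64 and 65+ share the same values
        weights = [0.60, 0.30, 0.10]

    if bmi > 30:
        weights[1] *= 2  # less than 7 hours
        weights[2] *= 3  # chronic sleep deprivation

    total = sum(weights)
    return [round(w / total, 3) for w in weights]

normal_bmi = sleep_probs("Male", 40, 24.0)
obese = sleep_probs("Male", 40, 36.4)
print(normal_bmi)
print(obese)
```

For a 40-year-old with BMI > 30 the weights become 0.65, 0.50, and 0.30, so the chance of adequate sleep falls from 65% to about 45% once the weights are rescaled.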


In [128]:
# Function to assign sleep category based on BMI, age, and gender
def assign_sleep(row):
    # Base probabilities for sleep categories (Adequate Sleep, Less than 7 Hours, Chronic Sleep Deprivation)
    probs = [0.65, 0.25, 0.10]  # General population probabilities

    # Adjust based on gender
    if row['Gender'] == 'Male':
        probs = [0.60, 0.30, 0.10]  # Men are slightly less likely to get adequate sleep
    elif row['Gender'] == 'Female':
        probs = [0.70, 0.20, 0.10]  # Women tend to get more sleep

    # Adjust based on age
    if 18 <= row['Age'] <= 24:
        probs[0] = 0.70  # Young adults have higher chance of getting sufficient sleep
        probs[1] = 0.20  # Less likely to report insufficient sleep
        probs[2] = 0.10  # Chronic sleep deprivation remains the same
    elif 25 <= row['Age'] <= 44:
        probs[0] = 0.65  # Middle-aged adults
        probs[1] = 0.25
        probs[2] = 0.10
    elif 45 <= row['Age'] <= 64:
        probs[0] = 0.60  # Older adults
        probs[1] = 0.30
        probs[2] = 0.10
    elif row['Age'] >= 65:
        probs[0] = 0.60  # Seniors: same rates as ages 45-64
        probs[1] = 0.30
        probs[2] = 0.10

    # Adjust based on BMI
    if row['BMI'] > 30:  # Obese individuals
        probs[1] *= 2  # Double the chance for less than 7 hours of sleep
        probs[2] *= 3  # Triple the chance for chronic sleep deprivation

    # Normalize probabilities so they sum to 1
    total = sum(probs)
    normalized_probs = [p / total for p in probs]

    # Randomly assign sleep category based on weighted probabilities
    return np.random.choice(
        ['Adequate Sleep (7+ hours)', 'Less than 7 Hours', 'Chronic Sleep Deprivation (≤5 hours)'],
        p=normalized_probs
    )

# Apply the function to assign sleep categories
df['Hours of Sleep'] = df.apply(assign_sleep, axis=1)

# Display the first few rows to check the Hours of Sleep column
print(df[['Age', 'BMI', 'Gender', 'Hours of Sleep']].head())

# Save the updated dataset
df.to_csv('enhanced_250K.csv', index=False)


   Age   BMI  Gender                        Hours of Sleep
0   40  36.4    Male             Adequate Sleep (7+ hours)
1   24  33.3    Male                     Less than 7 Hours
2   73  30.2  Female             Adequate Sleep (7+ hours)
3   90  29.2    Male  Chronic Sleep Deprivation (≤5 hours)
4   99  19.9    Male             Adequate Sleep (7+ hours)


In [129]:
# Display the first 2 rows of the dataset with the Hours of Sleep column
print(df.head(2))


   Age Gender   BMI  Waist_Circumference BMI_Category  Triglyceride  HDL  \
0   40   Male  36.4                 42.4        Obese           211   55   
1   24   Male  33.3                 38.3        Obese           290   41   

   High_Blood_Pressure  FBG       Alcohol Use      Smoker  \
0                    1  233  Moderate Drinker  Non Smoker   
1                    1   82  Moderate Drinker      Smoker   

                Exercise             Hours of Sleep  
0     Meets Aerobic Only  Adequate Sleep (7+ hours)  
1  Insufficiently Active          Less than 7 Hours  


In [130]:
# Calculate the percentage distribution of each sleep category
sleep_distribution = df['Hours of Sleep'].value_counts(normalize=True) * 100

# Print the percentage distribution
print(sleep_distribution)


Hours of Sleep
Adequate Sleep (7+ hours)               54.3856
Less than 7 Hours                       31.6696
Chronic Sleep Deprivation (≤5 hours)    13.9448
Name: proportion, dtype: float64


#### **Heart Disease Assignment Algorithm**

This algorithm assigns **heart disease** (binary: 1 for yes, 0 for no) to individuals based on various **risk factors**. The base probability for heart disease in the general population is **6.7%**, and the risk is adjusted based on the individual's BMI, alcohol consumption, FBG levels, and smoking status.

#### Steps:
1. **Base Probability**:
   - The starting probability for heart disease is **6.7%**.
   
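2. **Risk Factor Adjustments** (each matching rule sets the probability to a multiple of the base; the rules are checked in the order below and the last match wins, so the multipliers do not compound):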
   - **BMI**:
     - BMI > 40: 2x base risk.
     - BMI > 35: 2x base risk.
     - BMI > 30: 1.5x base risk.
   - **Alcohol Use**:
     - Heavy/Binge Drinker: 2x base risk.
   - **FBG (Fasting Blood Glucose)**:
     - FBG > 200: 2x base risk.
   - **Smoking**:
     - Current Smoker: 3x base risk.
   
3. **Cap on Probability**:
   - The probability is capped at 100% so that it remains a valid probability.

4. **Random Assignment**:
   - Heart disease is assigned randomly based on the adjusted probability.

5. **Analysis**:
   - Total percentage of heart disease in the dataset.
   - Percentage of **Heavy/Binge Drinkers** with heart disease.
   - Percentage of individuals with **BMI > 40** and heart disease.
   - Percentage of individuals with **FBG > 200** and heart disease.
   - Percentage of **Smokers** with heart disease.

6. **Output**:
   - The dataset is updated with the new **Heart Disease** column and saved as `enhanced_250K.csv`.
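
The sketch below reproduces the probability calculation without the random draw, following the rule order of the notebook's code cell. There, each rule assigns a fresh multiple of the base probability instead of scaling the running value, so only the last matching rule takes effect. The helper name `heart_disease_prob` is hypothetical.

```python
BASE = 0.067  # base probability of heart disease

def heart_disease_prob(bmi: float, alcohol_use: str, fbg: float, smoker: str) -> float:
    """Final probability before the random draw."""
    prob = BASE
    if bmi > 35:  # the BMI > 40 and BMI > 35 rules both use 2x
        prob = BASE * 2
    elif bmi > 30:
        prob = BASE * 1.5
    if alcohol_use == "Heavy/Binge Drinker":
        prob = BASE * 2
    if fbg > 200:
        prob = BASE * 2
    if smoker == "Smoker":
        prob = BASE * 3
    return min(prob, 1.0)

no_risk = heart_disease_prob(24.0, "Non-Drinker", 90, "Non Smoker")
obese_only = heart_disease_prob(32.0, "Non-Drinker", 90, "Non Smoker")
many_risks = heart_disease_prob(42.0, "Heavy/Binge Drinker", 233, "Smoker")
print(no_risk, obese_only, many_risks)
```

A smoker with every other risk factor still ends at 3 x 6.7% = 20.1%, and the probability can never exceed that value, so the 100% cap is never reached.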


In [131]:
import numpy as np

# Base probability for heart disease (6.7%)
base_heart_disease_prob = 0.067

# Function to assign heart disease based on risk factors
def assign_heart_disease(row):
    # Start with the base probability
    prob = base_heart_disease_prob
    
    # Apply BMI-based risk
    if row['BMI'] > 40:
        prob = base_heart_disease_prob * 2  # 2x for BMI > 40
    elif row['BMI'] > 35:
        prob = base_heart_disease_prob * 2  # 2x for BMI > 35
    elif row['BMI'] > 30:
        prob = base_heart_disease_prob * 1.5  # 1.5x for BMI > 30
    
    # Apply Alcohol Use risk
    if row['Alcohol Use'] == 'Heavy/Binge Drinker':
        prob = base_heart_disease_prob * 2  # 2x for heavy drinkers
    
    # Apply FBG-based risk
    if row['FBG'] > 200:
        prob = base_heart_disease_prob * 2  # 2x for FBG > 200
    
    # Apply Smoking-based risk
    if row['Smoker'] == 'Smoker':  # 3x for current smokers
        prob = base_heart_disease_prob * 3
    
    # Cap the probability to ensure it doesn't exceed 100%
    prob = min(prob, 1.0)
    
    # Randomly assign heart disease based on the final probability
    return np.random.choice([1, 0], p=[prob, 1 - prob])

# Apply the function to each row in the dataset
df['Heart Disease'] = df.apply(assign_heart_disease, axis=1)

# 1. Total Percentage of Heart Disease in the dataset
total_heart_disease_percent = df['Heart Disease'].mean() * 100
print(f"Total Percentage of Heart Disease: {total_heart_disease_percent:.2f}%")

# 2. Percentage of Heavy/Binge Drinkers with Heart Disease
heavy_drinkers = df[df['Alcohol Use'] == 'Heavy/Binge Drinker']
heavy_drinker_heart_disease_percent = (heavy_drinkers['Heart Disease'].mean()) * 100
print(f"Percentage of Heavy/Binge Drinkers with Heart Disease: {heavy_drinker_heart_disease_percent:.2f}%")

# 3. Percentage of Individuals with BMI > 40 and Heart Disease
bmi_over_40 = df[df['BMI'] > 40]
bmi_over_40_heart_disease_percent = (bmi_over_40['Heart Disease'].mean()) * 100
print(f"Percentage of Individuals with BMI > 40 and Heart Disease: {bmi_over_40_heart_disease_percent:.2f}%")

# 4. Percentage of Individuals with FBG > 200 and Heart Disease
fb_over_200 = df[df['FBG'] > 200]
fb_over_200_heart_disease_percent = (fb_over_200['Heart Disease'].mean()) * 100
print(f"Percentage of Individuals with FBG > 200 and Heart Disease: {fb_over_200_heart_disease_percent:.2f}%")

# 5. Percentage of Smokers with Heart Disease
smokers = df[df['Smoker'] == 'Smoker']
smoker_heart_disease_percent = (smokers['Heart Disease'].mean()) * 100
print(f"Percentage of Smokers with Heart Disease: {smoker_heart_disease_percent:.2f}%")

# Save the updated dataset
df.to_csv('enhanced_250K.csv', index=False)


Total Percentage of Heart Disease: 11.47%
Percentage of Heavy/Binge Drinkers with Heart Disease: 14.40%
Percentage of Individuals with BMI > 40 and Heart Disease: 14.73%
Percentage of Individuals with FBG > 200 and Heart Disease: 14.40%
Percentage of Smokers with Heart Disease: 20.19%


In [132]:
# Display the first 2 rows of the dataset with the Heart Disease column
print(df.head(2))


   Age Gender   BMI  Waist_Circumference BMI_Category  Triglyceride  HDL  \
0   40   Male  36.4                 42.4        Obese           211   55   
1   24   Male  33.3                 38.3        Obese           290   41   

   High_Blood_Pressure  FBG       Alcohol Use      Smoker  \
0                    1  233  Moderate Drinker  Non Smoker   
1                    1   82  Moderate Drinker      Smoker   

                Exercise             Hours of Sleep  Heart Disease  
0     Meets Aerobic Only  Adequate Sleep (7+ hours)              0  
1  Insufficiently Active          Less than 7 Hours              0  


#### **Cancer Assignment Algorithm**

This algorithm assigns **cancer** (binary: 1 for yes, 0 for no) to individuals based on their **age**, **BMI**, and **smoking status**. The base cancer probability is applied to individuals aged 65 and older, with increased risks for smokers and individuals with high BMI.

#### Steps:
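1. **Base Probability**:
   - Individuals under age 65 with no risk factors have a 0% chance of cancer.
   - Individuals under age 65 who smoke or have a BMI > 37 start from a 1% fallback base, to which the multipliers below are applied.
   - Individuals aged 65 and older have a base cancer probability drawn uniformly between **20% and 25%**.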

2. **Risk Factor Adjustments**:
   - **Smoking**:
     - Current smokers have a 2x higher risk of cancer.
   - **BMI > 37**:
     - Individuals with a BMI > 37 have a 1.5x higher risk of cancer.
   
3. **Cap on Probability**:
   - The cancer probability is capped at 100% so that it remains a valid probability.

4. **Random Assignment**:
   - Cancer is assigned randomly based on the adjusted probability.

5. **Analysis**:
   - Total percentage of cancer in the dataset.
   - Percentage of **Smokers** with cancer.
   - Percentage of individuals with **BMI > 37 and age < 50** who have cancer.

6. **Output**:
   - The dataset is updated with the new **Cancer** column.
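
The probability calculation can be traced for individual profiles with the sketch below, which follows the branch logic of the notebook's code cell but omits the random draw. The helper name `cancer_prob` is hypothetical, and the `base_65` parameter is an assumption that stands in for the per-person uniform draw between 0.20 and 0.25 (its default, 0.225, is the midpoint).

```python
FALLBACK_BASE = 0.01  # used only when the age-based probability is zero

def cancer_prob(age: int, smoker: str, bmi: float, base_65: float = 0.225) -> float:
    """Cancer probability for one profile, before the random draw."""
    prob = base_65 if age >= 65 else 0.0
    if smoker == "Smoker":
        prob = prob * 2 if prob > 0 else FALLBACK_BASE * 2
    if bmi > 37:
        prob = prob * 1.5 if prob > 0 else FALLBACK_BASE * 1.5
    return min(prob, 1.0)

young_no_risk = cancer_prob(40, "Non Smoker", 25.0)
young_smoker = cancer_prob(40, "Smoker", 25.0)
young_both = cancer_prob(40, "Smoker", 38.0)
senior_both = cancer_prob(70, "Smoker", 38.0)
print(young_no_risk, young_smoker, young_both, senior_both)
```

Under-65 profiles with a risk factor therefore receive a small nonzero probability (between 1.5% and 3%), which is why the reported cancer rate for the BMI > 37 and age < 50 group is above zero.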


In [133]:
import numpy as np

# Fallback base probability (1%) for individuals under 65 who have a risk factor
base_cancer_prob = 0.01

# Function to assign cancer based on age and risk factors
def assign_cancer(row):
    # Start with a base probability of 0 for individuals under 65
    prob = 0

    # Apply base cancer probability for individuals 65 and older
    if row['Age'] >= 65:
        prob = np.random.uniform(0.20, 0.25)  # 20-25% for adults 65+

    # Apply Smoking-based risk
    if row['Smoker'] == 'Smoker':  # 2x for current smokers
        prob = prob * 2 if prob > 0 else base_cancer_prob * 2  # Use the 1% fallback base when prob is 0
    
    # Apply BMI-based risk for BMI > 37
    if row['BMI'] > 37:
        prob = prob * 1.5 if prob > 0 else base_cancer_prob * 1.5  # Use the 1% fallback base when prob is 0

    # Cap the probability to ensure it doesn't exceed 100%
    prob = min(prob, 1.0)

    # Randomly assign cancer based on the final probability
    return np.random.choice([1, 0], p=[prob, 1 - prob])

# Apply the function to each row in the dataset
df['Cancer'] = df.apply(assign_cancer, axis=1)

# 1. Total Percentage of Cancer in the dataset
total_cancer_percent = df['Cancer'].mean() * 100
print(f"Total Percentage of Cancer: {total_cancer_percent:.2f}%")

# 2. Percentage of Smokers with Cancer
smokers = df[df['Smoker'] == 'Smoker']
smoker_cancer_percent = (smokers['Cancer'].mean()) * 100
print(f"Percentage of Smokers with Cancer: {smoker_cancer_percent:.2f}%")

# 3. Percentage of Individuals with BMI > 37 and age < 50 who have cancer
bmi_over_37_age_under_50 = df[(df['BMI'] > 37) & (df['Age'] < 50)]
bmi_over_37_age_under_50_with_cancer = bmi_over_37_age_under_50[bmi_over_37_age_under_50['Cancer'] == 1].shape[0]
total_bmi_over_37_age_under_50 = bmi_over_37_age_under_50.shape[0]

if total_bmi_over_37_age_under_50 > 0:
    percentage_with_cancer = (bmi_over_37_age_under_50_with_cancer / total_bmi_over_37_age_under_50) * 100
    print(f"Percentage of individuals with BMI > 37 and age < 50 who have cancer: {percentage_with_cancer:.2f}%")
else:
    print("No individuals with BMI > 37 and age < 50 in the dataset.")


Total Percentage of Cancer: 9.22%
Percentage of Smokers with Cancer: 17.21%
Percentage of individuals with BMI > 37 and age < 50 who have cancer: 1.82%


In [135]:
# Display the first 2 rows of the dataset with the Cancer column
print(df.head(2))


   Age Gender   BMI  Waist_Circumference BMI_Category  Triglyceride  HDL  \
0   40   Male  36.4                 42.4        Obese           211   55   
1   24   Male  33.3                 38.3        Obese           290   41   

   High_Blood_Pressure  FBG       Alcohol Use      Smoker  \
0                    1  233  Moderate Drinker  Non Smoker   
1                    1   82  Moderate Drinker      Smoker   

                Exercise             Hours of Sleep  Heart Disease  Cancer  
0     Meets Aerobic Only  Adequate Sleep (7+ hours)              0       0  
1  Insufficiently Active          Less than 7 Hours              0       0  


#### **Metabolic Syndrome Assignment Algorithm**

This algorithm assigns **Metabolic Syndrome** (binary: 1 for yes, 0 for no) to individuals based on the presence of **five key risk factors**. An individual is classified as having Metabolic Syndrome if **three or more** of these risk factors are present.

#### Risk Factors:
1. **High Blood Pressure**: Present if the individual has high blood pressure.
2. **High Triglycerides**: Present if triglycerides are **≥ 150 mg/dL**.
3. **Low HDL (Good Cholesterol)**:
   - Males: HDL **< 40 mg/dL**.
   - Females: HDL **< 50 mg/dL**.
4. **High Fasting Blood Glucose (FBG)**: Present if FBG is **≥ 100 mg/dL**.
5. **High Waist Circumference**:
   - Males: Waist circumference **≥ 40 inches**.
   - Females: Waist circumference **≥ 35 inches**.

#### Steps:
1. **Check Risk Factors**: The algorithm checks each of the above factors for every individual in the dataset.
2. **Assign Metabolic Syndrome**:
   - If an individual has **three or more** of the risk factors, they are assigned **Metabolic Syndrome** (1). 
   - If fewer than three factors are present, they are not assigned Metabolic Syndrome (0).

3. **Analysis**:
   - The total **percentage of individuals** with Metabolic Syndrome in the dataset is calculated.

4. **Output**:
   - The dataset is updated with the new **Metabolic Syndrome** column and saved as `enhanced_250K_with_metabolic_syndrome.csv`.
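
As a worked illustration of the three-of-five rule, the sketch below counts risk factors for a single record held in a plain dictionary. The helper name `count_risk_factors` and the dictionary layout are assumptions made for this example; the values are those of the first two rows displayed earlier in the notebook.

```python
# Illustrative sketch: count metabolic-syndrome risk factors for one record.
# `count_risk_factors` is a hypothetical helper, not part of the notebook's pipeline.

def count_risk_factors(rec):
    """Return how many of the five risk factors are present in `rec`."""
    is_male = rec["Gender"] == "Male"
    is_female = rec["Gender"] == "Female"
    checks = [
        rec["High_Blood_Pressure"] == 1,                                   # high blood pressure
        rec["Triglyceride"] >= 150,                                        # high triglycerides
        (is_male and rec["HDL"] < 40) or (is_female and rec["HDL"] < 50),  # low HDL
        rec["FBG"] >= 100,                                                 # high fasting glucose
        (is_male and rec["Waist_Circumference"] >= 40)
        or (is_female and rec["Waist_Circumference"] >= 35),               # high waist circumference
    ]
    return sum(checks)

# Values copied from the first two rows shown in the earlier df.head(2) output
row0 = {"Gender": "Male", "High_Blood_Pressure": 1, "Triglyceride": 211,
        "HDL": 55, "FBG": 233, "Waist_Circumference": 42.4}
row1 = {"Gender": "Male", "High_Blood_Pressure": 1, "Triglyceride": 290,
        "HDL": 41, "FBG": 82, "Waist_Circumference": 38.3}

for name, rec in [("row 0", row0), ("row 1", row1)]:
    n = count_risk_factors(rec)
    print(f"{name}: {n} risk factors -> Metabolic Syndrome = {1 if n >= 3 else 0}")
```

Row 0 has four factors present (only HDL is in range), so it is flagged; row 1 has two, so it is not.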


In [136]:
# Function to assign Metabolic Syndrome based on criteria
def assign_metabolic_syndrome(row):
    # Initialize the count of risk factors
    risk_factors = 0

    # Check each risk factor
    if row['High_Blood_Pressure'] == 1:  # High blood pressure
        risk_factors += 1
    if row['Triglyceride'] >= 150:  # High triglycerides
        risk_factors += 1
    if (row['Gender'] == 'Male' and row['HDL'] < 40) or (row['Gender'] == 'Female' and row['HDL'] < 50):  # Low HDL
        risk_factors += 1
    if row['FBG'] >= 100:  # High fasting blood glucose
        risk_factors += 1
    if (row['Gender'] == 'Male' and row['Waist_Circumference'] >= 40) or (row['Gender'] == 'Female' and row['Waist_Circumference'] >= 35):  # High waist circumference
        risk_factors += 1

    # Metabolic syndrome is present if 3 or more risk factors are present
    return 1 if risk_factors >= 3 else 0

# Apply the function to each row in the dataset
df['Metabolic_Syndrome'] = df.apply(assign_metabolic_syndrome, axis=1)

# Calculate the percentage of individuals with Metabolic Syndrome
metabolic_syndrome_percent = df['Metabolic_Syndrome'].mean() * 100
print(f"Percentage of individuals with Metabolic Syndrome: {metabolic_syndrome_percent:.2f}%")

# Save the updated dataset
df.to_csv('enhanced_250K_with_metabolic_syndrome.csv', index=False)


Percentage of individuals with Metabolic Syndrome: 34.44%


In [137]:
# Display the first 2 rows of the dataset, now including the Metabolic_Syndrome column
print(df.head(2))


   Age Gender   BMI  Waist_Circumference BMI_Category  Triglyceride  HDL  \
0   40   Male  36.4                 42.4        Obese           211   55   
1   24   Male  33.3                 38.3        Obese           290   41   

   High_Blood_Pressure  FBG       Alcohol Use      Smoker  \
0                    1  233  Moderate Drinker  Non Smoker   
1                    1   82  Moderate Drinker      Smoker   

                Exercise             Hours of Sleep  Heart Disease  Cancer  \
0     Meets Aerobic Only  Adequate Sleep (7+ hours)              0       0   
1  Insufficiently Active          Less than 7 Hours              0       0   

   Metabolic_Syndrome  
0                   1  
1                   0  


#### **COPD Assignment Algorithm**

This algorithm assigns **COPD** (Chronic Obstructive Pulmonary Disease) to individuals based on their **age** and **smoking status**. The probability of having COPD is higher for smokers, particularly older smokers, and this is reflected in the assignment process.

#### Steps:
1. **Smokers Over Age 60**:
   - **75%** of smokers over age 60 are assigned COPD.
   
2. **Smokers Aged 45-60**:
   - **35%** of smokers aged 45 to 60 are assigned COPD.

3. **Everyone Else**:
   - All remaining individuals (non-smokers, former smokers, smokers under 45, and smokers not selected in steps 1 and 2) are assigned **0** for COPD.

4. **Analysis**:
   - **Total percentage of smokers** in the population is calculated.
   - **Total percentage of individuals with COPD** is calculated.

5. **Output**:
   - The dataset is updated with the new **COPD** column and saved as `enhanced_250K_with_COPD.csv`.
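
The fixed-fraction assignment in steps 1 and 2 can be sketched without pandas. This is a minimal illustration, not the notebook's implementation: the function `assign_fixed_fraction`, the toy ID lists, and the seed are all assumptions. It shows that truncating with `int()` rounds the case count down, and that seeding the generator makes the random selection repeatable.

```python
import random

def assign_fixed_fraction(ids, fraction, rng):
    """Pick int(len(ids) * fraction) ids at random, without replacement."""
    n_cases = int(len(ids) * fraction)  # truncates toward zero, e.g. 8.25 -> 8
    return set(rng.sample(ids, n_cases))

# Hypothetical groups of smoker IDs (toy data for illustration only)
smokers_over_60 = list(range(0, 11))      # 11 people
smokers_45_to_60 = list(range(100, 111))  # 11 people

rng = random.Random(42)  # assumed seed; any fixed value gives a repeatable draw
copd_ids = assign_fixed_fraction(smokers_over_60, 0.75, rng)
copd_ids |= assign_fixed_fraction(smokers_45_to_60, 0.35, rng)

# Everyone not selected is assigned 0
all_ids = smokers_over_60 + smokers_45_to_60
copd = {i: (1 if i in copd_ids else 0) for i in all_ids}
print(sum(copd.values()), "of", len(all_ids), "assigned COPD")
```

In pandas, the analogous control is the `random_state` argument of `DataFrame.sample`; without it, each run of the cell selects a different set of individuals.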


In [138]:
import numpy as np

# Step 1: Set COPD for Smokers over age 60 (75% have COPD)
smokers_over_60 = df[(df['Smoker'] == 'Smoker') & (df['Age'] > 60)]
copd_over_60_cases = int(len(smokers_over_60) * 0.75)  # 75% of smokers over 60
df.loc[smokers_over_60.sample(n=copd_over_60_cases, replace=False).index, 'COPD'] = 1

# Step 2: Set COPD for Smokers aged 45-60 (35% have COPD)
smokers_45_to_60 = df[(df['Smoker'] == 'Smoker') & (df['Age'].between(45, 60))]
copd_45_to_60_cases = int(len(smokers_45_to_60) * 0.35)  # 35% of smokers aged 45-60
df.loc[smokers_45_to_60.sample(n=copd_45_to_60_cases, replace=False).index, 'COPD'] = 1

# Step 3: Set COPD for the rest of the population to 0 (non-smokers and other smokers without COPD)
df['COPD'] = df['COPD'].fillna(0)

# Step 4: Calculate percentage of smokers in the total population
total_population = len(df)
total_smokers = len(df[df['Smoker'] == 'Smoker'])
smoker_percentage = (total_smokers / total_population) * 100

# Step 5: Calculate the percentage of individuals with COPD
copd_percent = df['COPD'].mean() * 100

# Output results
print(f"Percentage of smokers in the population: {smoker_percentage:.2f}%")
print(f"Percentage of individuals with COPD: {copd_percent:.2f}%")

# Save the updated dataset
df.to_csv('enhanced_250K_with_COPD.csv', index=False)


Percentage of smokers in the population: 13.50%
Percentage of individuals with COPD: 5.10%


In [139]:
# Display the first 2 rows of the dataset, now including the COPD column
print(df.head(2))


   Age Gender   BMI  Waist_Circumference BMI_Category  Triglyceride  HDL  \
0   40   Male  36.4                 42.4        Obese           211   55   
1   24   Male  33.3                 38.3        Obese           290   41   

   High_Blood_Pressure  FBG       Alcohol Use      Smoker  \
0                    1  233  Moderate Drinker  Non Smoker   
1                    1   82  Moderate Drinker      Smoker   

                Exercise             Hours of Sleep  Heart Disease  Cancer  \
0     Meets Aerobic Only  Adequate Sleep (7+ hours)              0       0   
1  Insufficiently Active          Less than 7 Hours              0       0   

   Metabolic_Syndrome  COPD  
0                   1   0.0  
1                   0   0.0  


#### **Dropping the 'COPD_Base' Column**

This operation removes the unnecessary **'COPD_Base'** column from the dataset, ensuring a cleaner structure for future analysis.

#### Steps:
1. **Drop 'COPD_Base' Column**:
   - The column `'COPD_Base'` is removed from the DataFrame.

2. **Verification**:
   - The updated DataFrame is printed to verify that the `'COPD_Base'` column has been successfully removed.

3. **Save the Cleaned Dataset**:
   - The cleaned dataset is saved as `enhanced_250K_with_COPD.csv` without the `'COPD_Base'` column.


In [140]:
# Check if 'COPD_Base' column exists before dropping
if 'COPD_Base' in df.columns:
    # Drop the 'COPD_Base' column from the dataframe
    df = df.drop(columns=['COPD_Base'])
    print("'COPD_Base' column removed successfully.")
else:
    print("'COPD_Base' column not found, skipping removal.")

# Verify that the 'COPD_Base' column is removed
print(df.head())

# Save the cleaned dataset
df.to_csv('enhanced_250K_with_COPD.csv', index=False)


'COPD_Base' column not found, skipping removal.
   Age  Gender   BMI  Waist_Circumference   BMI_Category  Triglyceride  HDL  \
0   40    Male  36.4                 42.4          Obese           211   55   
1   24    Male  33.3                 38.3          Obese           290   41   
2   73  Female  30.2                 43.1          Obese            86   56   
3   90    Male  29.2                 58.6     Overweight           202   33   
4   99    Male  19.9                 45.8  Normal weight            71   69   

   High_Blood_Pressure  FBG       Alcohol Use         Smoker  \
0                    1  233  Moderate Drinker     Non Smoker   
1                    1   82  Moderate Drinker         Smoker   
2                    1  102  Moderate Drinker     Non Smoker   
3                    0  209  Moderate Drinker  Former Smoker   
4                    0   94  Moderate Drinker     Non Smoker   

                Exercise                        Hours of Sleep  Heart Disease  \
0     Meets

#### **Diabetes Assignment Algorithm**

This algorithm assigns **diabetes** (binary: 1 for yes, 0 for no) to individuals based on their **age group** and **FBG** (Fasting Blood Glucose) thresholds. Different thresholds for FBG are applied to different age groups.

#### Steps:
1. **Assign Diabetes Based on Age and FBG**:
   - **Ages 18-39**: Diabetes is assigned if FBG is **greater than 250 mg/dL**.
   - **Ages 40-64**: Diabetes is assigned if FBG is **greater than 200 mg/dL**.
   - **Ages 65+**: Diabetes is assigned if FBG is **greater than 160 mg/dL**.

2. **Apply the Diabetes Function**:
   - The function is applied to each row in the dataset, creating a new column **'Diabetes'**.

3. **Calculate Percentage of Individuals with Diabetes**:
   - The total percentage of individuals with diabetes in the entire population is calculated.
   
4. **Percentage of Individuals with Diabetes by Age Group**:
   - The percentage of individuals with diabetes is calculated separately for the following age groups:
     - **Ages 18-39**.
     - **Ages 40-64**.
     - **Ages 65+**.

5. **Save the Updated Dataset**:
   - The updated dataset is saved as `enhanced_250K_with_Diabetes.csv` with the new **Diabetes** column.
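
The age-banded thresholds in step 1 can also be expressed as a small lookup table. The sketch below is one possible formulation, assuming the names `DIABETES_RULES` and `diabetes_flag` (neither appears in the notebook); it is meant to make the band boundaries and the strict "greater than" comparison easy to check.

```python
# (min_age inclusive, max_age exclusive, FBG threshold in mg/dL), mirroring step 1
DIABETES_RULES = [
    (18, 40, 250),
    (40, 65, 200),
    (65, float("inf"), 160),
]

def diabetes_flag(age, fbg):
    """Return 1 if fbg exceeds the threshold for the age band, else 0."""
    for min_age, max_age, threshold in DIABETES_RULES:
        if min_age <= age < max_age:
            return 1 if fbg > threshold else 0
    return 0  # ages below 18 fall outside every band

# The same FBG reading can give different outcomes in different age bands
for age, fbg in [(40, 233), (39, 233), (24, 82), (65, 161), (40, 200)]:
    print(f"age={age}, FBG={fbg} -> Diabetes={diabetes_flag(age, fbg)}")
```

An FBG of 233 mg/dL is flagged at age 40 but not at age 39, because the 18-39 band requires a reading above 250. A reading exactly equal to a threshold is not flagged.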


In [141]:
# Step 1: Assign diabetes based on age group and FBG thresholds
def assign_diabetes(row):
    if 18 <= row['Age'] < 40 and row['FBG'] > 250:
        return 1  # Diabetes for ages 18-39 if FBG > 250
    elif 40 <= row['Age'] < 65 and row['FBG'] > 200:
        return 1  # Diabetes for ages 40-64 if FBG > 200
    elif row['Age'] >= 65 and row['FBG'] > 160:
        return 1  # Diabetes for ages 65+ if FBG > 160
    return 0  # No diabetes if none of the conditions are met

# Step 2: Apply the diabetes function to each row in the dataset
df['Diabetes'] = df.apply(assign_diabetes, axis=1)

# Step 3: Calculate the percentage of individuals with diabetes in the total population
total_diabetes_percent = df['Diabetes'].mean() * 100
print(f"Percentage of individuals with Diabetes (Total Population): {total_diabetes_percent:.2f}%")

# Step 4: Calculate the percentage of individuals with diabetes for each age group
# 18-39 years
diabetes_18_39 = df[(df['Age'] >= 18) & (df['Age'] < 40)]['Diabetes'].mean() * 100
print(f"Percentage of individuals with Diabetes (Ages 18-39): {diabetes_18_39:.2f}%")

# 40-64 years
diabetes_40_64 = df[(df['Age'] >= 40) & (df['Age'] < 65)]['Diabetes'].mean() * 100
print(f"Percentage of individuals with Diabetes (Ages 40-64): {diabetes_40_64:.2f}%")

# 65+ years
diabetes_65_plus = df[df['Age'] >= 65]['Diabetes'].mean() * 100
print(f"Percentage of individuals with Diabetes (Ages 65+): {diabetes_65_plus:.2f}%")

# Step 5: Save the updated dataset with the diabetes column
df.to_csv('enhanced_250K_with_Diabetes.csv', index=False)


Percentage of individuals with Diabetes (Total Population): 20.95%
Percentage of individuals with Diabetes (Ages 18-39): 8.73%
Percentage of individuals with Diabetes (Ages 40-64): 20.96%
Percentage of individuals with Diabetes (Ages 65+): 31.72%


In [142]:
# Display the first 2 rows of the dataset, now including the Diabetes column
print(df.head(2))


   Age Gender   BMI  Waist_Circumference BMI_Category  Triglyceride  HDL  \
0   40   Male  36.4                 42.4        Obese           211   55   
1   24   Male  33.3                 38.3        Obese           290   41   

   High_Blood_Pressure  FBG       Alcohol Use      Smoker  \
0                    1  233  Moderate Drinker  Non Smoker   
1                    1   82  Moderate Drinker      Smoker   

                Exercise             Hours of Sleep  Heart Disease  Cancer  \
0     Meets Aerobic Only  Adequate Sleep (7+ hours)              0       0   
1  Insufficiently Active          Less than 7 Hours              0       0   

   Metabolic_Syndrome  COPD  Diabetes  
0                   1   0.0         1  
1                   0   0.0         0  


#### **Physical Exam Score Calculation Algorithm**

This algorithm calculates the **Physical Exam Score** for each individual based on their **age**, **lifestyle factors**, and the presence of various **health conditions**. The score starts from a base value determined by the individual's age. Each applicable deduction is a percentage of that original base score, so deductions add up rather than compound.

#### Steps:
1. **Assign Base Score Based on Age**:
   - **Ages 18-30**: Base score of 100.
   - **Ages 31-39**: Base score of 95.
   - **Ages 40-49**: Base score of 85.
   - **Ages 50-59**: Base score of 75.
   - **Ages 60-69**: Base score of 60.
   - **Ages 70-79**: Base score of 55.
   - **Ages 80+**: Base score of 50.

2. **Subtract Percentages for Lifestyle Factors and Health Conditions**:
   - **Alcohol Use (Heavy/Binge Drinker)**: Subtract 5%.
   - **Former Smoker**: Subtract 5%.
   - **Smoker**: Subtract 15%.
   - **Inactive (Exercise)**: Subtract 15%.
   - **Insufficiently Active (Exercise)**: Subtract 10%.
   - **Chronic Sleep Deprivation (≤5 hours)**: Subtract 5%.
   - **Heart Disease**: Subtract 15%.
   - **Cancer**: Subtract 20%.
   - **Metabolic Syndrome**: Subtract 5%.
   - **COPD**: Subtract 10%.
   - **Diabetes**: Subtract 10%.

3. **Apply the Exam Score Calculation**:
   - The **Exam Score** is calculated for each individual by subtracting the relevant percentages from their base score.

4. **Summarize Exam Scores by Age Group**:
   - For each age group, the algorithm calculates and prints the **average**, **high**, and **low** exam scores.

5. **Save the Updated Dataset**:
   - The dataset is updated with the new **Exam Score** column and saved as `enhanced_250K_with_Exam_Scores.csv`.
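
To make the arithmetic concrete, the sketch below applies the deductions from step 2 to one individual. The names `DEDUCTIONS` and `exam_score` are assumptions for this illustration. Because every deduction is a fraction of the same base, subtracting them one at a time is equivalent to multiplying the base by one minus their sum.

```python
# Fraction of the base score removed per factor, as listed in step 2
DEDUCTIONS = {
    "Heavy/Binge Drinker": 0.05,
    "Former Smoker": 0.05,
    "Smoker": 0.15,
    "Inactive": 0.15,
    "Insufficiently Active": 0.10,
    "Chronic Sleep Deprivation": 0.05,
    "Heart Disease": 0.15,
    "Cancer": 0.20,
    "Metabolic Syndrome": 0.05,
    "COPD": 0.10,
    "Diabetes": 0.10,
}

def exam_score(base, factors):
    """Subtract each factor's share of the base score (no compounding)."""
    total_fraction = sum(DEDUCTIONS[f] for f in factors)
    return base * (1 - total_fraction)

# A 40-year-old (base 85) with Metabolic Syndrome and Diabetes:
# 85 - 85 * 0.05 - 85 * 0.10
score = exam_score(85, ["Metabolic Syndrome", "Diabetes"])
print(round(score, 2))

# Largest possible deduction: the mutually exclusive categories (Smoker vs
# Former Smoker, Inactive vs Insufficiently Active) contribute their larger value
worst_case = ["Heavy/Binge Drinker", "Smoker", "Inactive", "Chronic Sleep Deprivation",
              "Heart Disease", "Cancer", "Metabolic Syndrome", "COPD", "Diabetes"]
print(round(exam_score(100, worst_case), 2))
```

The first case works out to about 72.25. In the second, the deductions total 100% of the base, so the lowest score the formula can produce is 0 regardless of age.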


In [143]:
# Step 1: Function to assign base score based on age
def assign_base_score(age):
    if 18 <= age <= 30:
        return 100
    elif 31 <= age <= 39:
        return 95
    elif 40 <= age <= 49:
        return 85
    elif 50 <= age <= 59:
        return 75
    elif 60 <= age <= 69:
        return 60
    elif 70 <= age <= 79:
        return 55
    elif age >= 80:
        return 50
    return 0  # In case of invalid age

# Step 2: Function to calculate physical exam score
def calculate_exam_score(row):
    base_score = assign_base_score(row['Age'])
    original_base = base_score  # Keep the original base so every deduction is a percentage of it (no compounding)

    # Subtract percentages based on health conditions
    if row['Alcohol Use'] == 'Heavy/Binge Drinker':
        base_score -= original_base * 0.05
    if row['Smoker'] == 'Former Smoker':
        base_score -= original_base * 0.05
    if row['Smoker'] == 'Smoker':
        base_score -= original_base * 0.15
    if row['Exercise'] == 'Inactive':
        base_score -= original_base * 0.15
    if row['Exercise'] == 'Insufficiently Active':
        base_score -= original_base * 0.10
    if row['Hours of Sleep'] == 'Chronic Sleep Deprivation (≤5 hours)':
        base_score -= original_base * 0.05
    if row['Heart Disease'] == 1:
        base_score -= original_base * 0.15
    if row['Cancer'] == 1:
        base_score -= original_base * 0.20
    if row['Metabolic_Syndrome'] == 1:
        base_score -= original_base * 0.05
    if row['COPD'] == 1:
        base_score -= original_base * 0.10
    if row['Diabetes'] == 1:
        base_score -= original_base * 0.10  # Assuming 10% reduction for Diabetes

    return base_score

# Step 3: Apply the function to each row in the dataset to calculate the exam score
df['Exam Score'] = df.apply(calculate_exam_score, axis=1)

# Step 4: Function to calculate average, high, and low exam scores per age group
def summarize_scores_by_age_group(df):
    age_groups = [
        (18, 30), (31, 39), (40, 49), (50, 59), (60, 69), (70, 79), (80, 150)
    ]
    
    for group in age_groups:
        age_min, age_max = group
        group_df = df[(df['Age'] >= age_min) & (df['Age'] <= age_max)]
        if not group_df.empty:
            avg_score = group_df['Exam Score'].mean()
            high_score = group_df['Exam Score'].max()
            low_score = group_df['Exam Score'].min()
            print(f"Age {age_min}-{age_max}:")
            print(f"  Average Exam Score: {avg_score:.2f}")
            print(f"  High Exam Score: {high_score:.2f}")
            print(f"  Low Exam Score: {low_score:.2f}")
            print()

# Step 5: Calculate and display the average, high, and low exam scores per age group
summarize_scores_by_age_group(df)

# Save the updated dataset with the exam score column
df.to_csv('enhanced_250K_with_Exam_Scores.csv', index=False)


Age 18-30:
  Average Exam Score: 86.92
  High Exam Score: 100.00
  Low Exam Score: 25.00

Age 31-39:
  Average Exam Score: 81.84
  High Exam Score: 95.00
  Low Exam Score: 14.25

Age 40-49:
  Average Exam Score: 71.47
  High Exam Score: 85.00
  Low Exam Score: 4.25

Age 50-59:
  Average Exam Score: 62.24
  High Exam Score: 75.00
  Low Exam Score: 11.25

Age 60-69:
  Average Exam Score: 47.65
  High Exam Score: 60.00
  Low Exam Score: 3.00

Age 70-79:
  Average Exam Score: 41.74
  High Exam Score: 55.00
  Low Exam Score: 2.75

Age 80-150:
  Average Exam Score: 37.87
  High Exam Score: 50.00
  Low Exam Score: 2.50



In [144]:
import pandas as pd
from tabulate import tabulate  # Install this via `pip install tabulate` if necessary

# Assuming df already has all necessary columns

# Display the first 10 rows as a nicely formatted table
print(tabulate(df.head(10), headers='keys', tablefmt='pretty'))


+---+-----+--------+------+---------------------+---------------+--------------+-----+---------------------+-----+---------------------+---------------+-----------------------+--------------------------------------+---------------+--------+--------------------+------+----------+------------+
|   | Age | Gender | BMI  | Waist_Circumference | BMI_Category  | Triglyceride | HDL | High_Blood_Pressure | FBG |     Alcohol Use     |    Smoker     |       Exercise        |            Hours of Sleep            | Heart Disease | Cancer | Metabolic_Syndrome | COPD | Diabetes | Exam Score |
+---+-----+--------+------+---------------------+---------------+--------------+-----+---------------------+-----+---------------------+---------------+-----------------------+--------------------------------------+---------------+--------+--------------------+------+----------+------------+
| 0 | 40  |  Male  | 36.4 |        42.4         |     Obese     |     211      | 55  |          1          | 233 |  Moder

In [145]:
import pandas as pd
from tabulate import tabulate

# Assuming df already has all necessary columns

# Print the first 10 rows as a nicely formatted table using tabulate
pretty_table = tabulate(df.head(10), headers='keys', tablefmt='fancy_grid')
print(pretty_table)


╒════╤═══════╤══════════╤═══════╤═══════════════════════╤════════════════╤════════════════╤═══════╤═══════════════════════╤═══════╤═════════════════════╤═══════════════╤═══════════════════════╤══════════════════════════════════════╤═════════════════╤══════════╤══════════════════════╤════════╤════════════╤══════════════╕
│    │   Age │ Gender   │   BMI │   Waist_Circumference │ BMI_Category   │   Triglyceride │   HDL │   High_Blood_Pressure │   FBG │ Alcohol Use         │ Smoker        │ Exercise              │ Hours of Sleep                       │   Heart Disease │   Cancer │   Metabolic_Syndrome │   COPD │   Diabetes │   Exam Score │
╞════╪═══════╪══════════╪═══════╪═══════════════════════╪════════════════╪════════════════╪═══════╪═══════════════════════╪═══════╪═════════════════════╪═══════════════╪═══════════════════════╪══════════════════════════════════════╪═════════════════╪══════════╪══════════════════════╪════════╪════════════╪══════════════╡
│  0 │    40 │ Male     │  36.4 │ 

#### **Conclusion**

This notebook successfully enriched a 250,000-row synthetic dataset with additional health and lifestyle attributes, providing a comprehensive view of various health risk factors across a large population. The key outcomes of this analysis include:

- **Rule-Based Health Condition Assignments**:
   - Using age-specific and lifestyle-related thresholds, we assigned binary outcomes for conditions such as **Diabetes**, **Cancer**, **Heart Disease**, **Metabolic Syndrome**, and **COPD**. Each condition was assigned by a dedicated algorithm, either a deterministic threshold rule or a probability adjusted for known risk factors.
   - For example, diabetes was assigned based on age-specific FBG thresholds, while cancer assignment relied on age, smoking habits, and BMI.

- **Physical Exam Score Calculation**:
   - A **Physical Exam Score** was computed to quantitatively assess an individual’s overall health. The score starts with a base value determined by age, then adjusts downward for the presence of risk factors such as smoking, inactivity, chronic sleep deprivation, and specific health conditions.
   - This score provides a useful single metric for evaluating overall health, and could be employed in further analyses or predictive models.

- **Statistical Summaries of Health Conditions**:
   - The notebook reported the prevalence of conditions such as **Diabetes**, **Cancer**, **COPD**, and **Metabolic Syndrome**, with breakdowns such as diabetes rates by age group and cancer rates among smokers. These summaries show how the assigned conditions are distributed across population segments.

Overall, this enriched dataset and the analysis performed lay the groundwork for future machine learning applications, especially in predictive modeling for health outcomes. By leveraging the diverse attributes and health metrics, this dataset can be used to build models for predicting the onset of specific health conditions, identifying high-risk populations, and improving public health strategies.

The dataset, stored as **`enhanced_250K.csv`**, serves as a robust foundation for deeper exploration of health-related patterns and potential predictive analytics.
