#### **Gender Distribution**

This code generates a synthetic dataset with **250,000 rows** where the first column represents the **Gender** distribution based on current adult gender percentages in the USA:
- **Female**: 50.8%
- **Male**: 49.2%

Using `numpy`, the gender values are randomly assigned to each row, and the data is stored in a `pandas` DataFrame.


In [46]:
import pandas as pd
import numpy as np

# Number of rows
num_rows = 250000

# Gender distribution percentages in the USA
female_percentage = 50.8 / 100
male_percentage = 49.2 / 100

# Generate the Gender column based on the distribution
# 'Female' for 50.8%, 'Male' for 49.2%
gender = np.random.choice(['Female', 'Male'], size=num_rows, p=[female_percentage, male_percentage])

# Create the DataFrame
df = pd.DataFrame({
    'Gender': gender
})

# Display the first few rows
df.head()


Unnamed: 0,Gender
0,Male
1,Male
2,Female
3,Male
4,Male


#### **Gender and Age Distribution Dataset**


  
 The second column, **Age**, is assigned according to approximate age distributions in the USA per gender. The age ranges are divided as follows:

### Female Age Distribution:
- 18-29 years: 20%
- 30-44 years: 25%
- 45-59 years: 25%
- 60-74 years: 20%
- 75+ years: 10%

### Male Age Distribution:
- 18-29 years: 22%
- 30-44 years: 25%
- 45-59 years: 25%
- 60-74 years: 20%
- 75+ years: 8%

The `numpy` library is used to randomly assign the * **Age** columns based on these distributions, and the data is stored in a `pandas` DataFrame.


In [47]:
import pandas as pd
import numpy as np

# Number of rows
num_rows = 250000

# Gender distribution percentages in the USA
female_percentage = 50.8 / 100
male_percentage = 49.2 / 100

# Generate the Gender column based on the distribution
gender = np.random.choice(['Female', 'Male'], size=num_rows, p=[female_percentage, male_percentage])

# Define age distributions for each gender
age_bins_female = [18, 30, 45, 60, 75, 90]
age_probs_female = [0.20, 0.25, 0.25, 0.20, 0.10]  # Probabilities for each age group for females

age_bins_male = [18, 30, 45, 60, 75, 90]
age_probs_male = [0.22, 0.25, 0.25, 0.20, 0.08]  # Probabilities for each age group for males

# Function to assign ages based on gender
def assign_age(gender_value):
    if gender_value == 'Female':
        age_group = np.random.choice(np.arange(5), p=age_probs_female)
        return np.random.randint(age_bins_female[age_group], age_bins_female[age_group + 1])
    else:
        age_group = np.random.choice(np.arange(5), p=age_probs_male)
        return np.random.randint(age_bins_male[age_group], age_bins_male[age_group + 1])

# Apply the function to assign ages to each row
ages = [assign_age(g) for g in gender]

# Create the DataFrame with both Gender and Age
df = pd.DataFrame({
    'Gender': gender,
    'Age': ages
})

# Display the first few rows
df.head()


Unnamed: 0,Gender,Age
0,Male,27
1,Female,54
2,Male,21
3,Female,62
4,Male,61


#### BMI Distribution Assignment by Age Group

This code assigns a BMI (Body Mass Index) value to each individual in the dataset based on their age group, following the distribution patterns seen in population data.

#### Algorithm:
1. **BMI Category Assignment**:
   - For ages 18-39, individuals are categorized into one of four BMI categories: `Underweight`, `Normal weight`, `Overweight`, or `Obese`, using specified probabilities.
   - For ages 40-59 and 60+, similar BMI categories are used, but the probabilities are adjusted according to the specific age group's trends.
   
2. **Random BMI Value**:
   - After assigning a category, the actual BMI value is generated using a random uniform distribution within predefined ranges:
     - **Underweight**: BMI between 16 and 18.4
     - **Normal weight**: BMI between 18.5 and 24.9
     - **Overweight**: BMI between 25 and 29.9
     - **Obese**: BMI between 30 and 45

3. **Application to Dataset**:
   - The `assign_bmi()` function is applied to each row in the dataset, based on the individual's age, to assign a BMI value.

The resulting BMI values are stored in the `BMI` column, and the updated DataFrame is displayed for verification.


In [48]:
# Define BMI distribution per age group
def assign_bmi(age):
    if 18 <= age <= 39:
        bmi_category = np.random.choice(['Underweight', 'Normal weight', 'Overweight', 'Obese'], 
                                        p=[0.02, 0.275, 0.325, 0.38])
    elif 40 <= age <= 59:
        bmi_category = np.random.choice(['Underweight', 'Normal weight', 'Overweight', 'Obese'], 
                                        p=[0.01, 0.225, 0.375, 0.39])
    elif age >= 60:
        bmi_category = np.random.choice(['Underweight', 'Normal weight', 'Overweight', 'Obese'], 
                                        p=[0.01, 0.25, 0.375, 0.365])
    
    # Assign random BMI value based on the category
    if bmi_category == 'Underweight':
        return round(np.random.uniform(16, 18.4), 1)
    elif bmi_category == 'Normal weight':
        return round(np.random.uniform(18.5, 24.9), 1)
    elif bmi_category == 'Overweight':
        return round(np.random.uniform(25, 29.9), 1)
    else:  # Obese
        return round(np.random.uniform(30, 45), 1)

# Apply the function to assign BMI based on the age column
df['BMI'] = df['Age'].apply(assign_bmi)

# Display the first few rows to verify
df.head()


Unnamed: 0,Gender,Age,BMI
0,Male,27,22.7
1,Female,54,28.5
2,Male,21,21.3
3,Female,62,28.6
4,Male,61,21.4


#### **Heart Disease Assignment Based on Age, Gender, and BMI**

This code assigns the likelihood of heart disease for each individual in the dataset based on their age, gender, and BMI. The following steps are taken:

#### Algorithm:
1. **Base Heart Disease Probability**:
   - The base probability of heart disease is determined by the individual's age and gender.
   - For example, individuals aged 20-39 have a 1% chance of heart disease, whereas those aged 60-79 have a higher likelihood, with men at 40% and women at 25%.

2. **BMI Adjustments**:
   - The base probability of heart disease is adjusted upwards if the BMI exceeds certain thresholds:
     - **BMI > 27.5**: +10% to the base probability.
     - **BMI > 32.5**: +20%.
     - **BMI > 37.5**: +30%.
     - **BMI > 42.5**: +40%.

3. **Heart Disease Assignment**:
   - The `assign_heart_disease()` function combines the base probability (based on age and gender) with the BMI adjustment to calculate the final probability of heart disease for each individual.
   - A random outcome is generated based on this probability, assigning either 1 (heart disease) or 0 (no heart disease).

4. **Application to Dataset**:
   - The function is applied to each row of the dataset to create the `Heart Disease` column.
   - The resulting heart disease status is stored in the column, and the distribution of heart disease cases is printed for verification.

The resulting DataFrame includes a new `Heart Disease` column, with the distribution of heart disease cases checked to ensure proper assignment.


In [49]:
# Define base probabilities for heart disease by age and gender
def base_heart_disease_probability(age, gender):
    if 18 <= age <= 39:
        return 0.01  # 1% for both genders
    elif 40 <= age <= 59:
        return 0.10 if gender == 'Male' else 0.08  # 10% for men, 8% for women
    elif 60 <= age <= 79:
        return 0.40 if gender == 'Male' else 0.25  # 40% for men, 25% for women
    elif age >= 80:
        return 0.65 if gender == 'Male' else 0.50  # 65% for men, 50% for women
    return 0  # Default value

# Adjust probabilities based on BMI thresholds
def adjusted_probability(base_prob, bmi):
    if bmi > 42.5:
        return base_prob + (base_prob * 0.40)  # +40% for BMI > 42.5
    elif bmi > 37.5:
        return base_prob + (base_prob * 0.30)  # +30% for BMI > 37.5
    elif bmi > 32.5:
        return base_prob + (base_prob * 0.20)  # +20% for BMI > 32.5
    elif bmi > 27.5:
        return base_prob + (base_prob * 0.10)  # +10% for BMI > 27.5
    return base_prob  # No increase if BMI <= 27.5

# Function to assign heart disease based on adjusted probabilities
def assign_heart_disease(age, gender, bmi):
    base_prob = base_heart_disease_probability(age, gender)
    final_prob = adjusted_probability(base_prob, bmi)
    return np.random.choice([1, 0], p=[final_prob, 1 - final_prob])

# Apply the function to create the 'Heart Disease' column
df['Heart Disease'] = df.apply(lambda row: assign_heart_disease(row['Age'], row['Gender'], row['BMI']), axis=1)

# Display the first few rows to verify
df.head()

# Check the distribution of heart disease cases
heart_disease_distribution = df['Heart Disease'].value_counts(normalize=True) * 100
print("Heart Disease Distribution (%):")
print(heart_disease_distribution)


Heart Disease Distribution (%):
Heart Disease
0    84.1784
1    15.8216
Name: proportion, dtype: float64


In [50]:
# Display the first few rows to verify
df.head()

Unnamed: 0,Gender,Age,BMI,Heart Disease
0,Male,27,22.7,0
1,Female,54,28.5,0
2,Male,21,21.3,0
3,Female,62,28.6,0
4,Male,61,21.4,0


In [51]:
import numpy as np
import pandas as pd

# Define base cancer probability based on age groups
def base_cancer_probability(age):
    if 18 <= age <= 39:
        return 0.01  # 1% for ages 20-39
    elif 40 <= age <= 59:
        return 0.05  # 5% for ages 40-59
    elif 60 <= age <= 79:
        return 0.15  # 15% for ages 60-79
    elif age >= 80:
        return 0.25  # 25% for ages 80+
    return 0  # Default value if outside the range

# Adjust cancer probability based on BMI thresholds
def adjusted_cancer_probability(base_prob, age, bmi):
    if 40 <= age <= 59 and bmi > 35:
        return base_prob * 1.5  # 1.5x more likely for BMI > 35 in ages 40-59
    elif age >= 60 and bmi > 30:
        return base_prob * 1.5  # 1.5x more likely for BMI > 30 in ages 60-79 and 80+
    return base_prob  # No adjustment if BMI is below threshold

# Function to assign cancer based on adjusted probabilities
def assign_cancer(age, bmi):
    base_prob = base_cancer_probability(age)
    adjusted_prob = adjusted_cancer_probability(base_prob, age, bmi)
    return np.random.choice([1, 0], p=[adjusted_prob, 1 - adjusted_prob])

# Apply the function to the DataFrame to create the 'Cancer' column
df['Cancer'] = df.apply(lambda row: assign_cancer(row['Age'], row['BMI']), axis=1)

# Create Age Group categories for analysis
def age_group(age):
    if 18 <= age <= 39:
        return '20-39'
    elif 40 <= age <= 59:
        return '40-59'
    elif 60 <= age <= 79:
        return '60-79'
    elif age >= 80:
        return '80+'
    return 'Unknown'

# Create a new column for age groups
df['Age Group'] = df['Age'].apply(age_group)

# Calculate cancer percentage per age group and gender
cancer_percentages = df.groupby(['Age Group', 'Gender'])['Cancer'].mean() * 100

# Reset index and rename columns for clarity
cancer_percentages = cancer_percentages.reset_index()
cancer_percentages.columns = ['Age Group', 'Gender', 'Cancer (%)']

# Display the resulting cancer rates
print(cancer_percentages)

# If you want to display it more clearly for each group, you can print each group separately
for age_group, data in cancer_percentages.groupby('Age Group'):
    print(f"\nCancer Percentage for Age Group {age_group}:")
    print(data)


  Age Group  Gender  Cancer (%)
0     20-39  Female    0.981049
1     20-39    Male    1.061598
2     40-59  Female    5.685090
3     40-59    Male    5.693803
4     60-79  Female   17.574031
5     60-79    Male   17.791521
6       80+  Female   30.024961
7       80+    Male   30.208490

Cancer Percentage for Age Group 20-39:
  Age Group  Gender  Cancer (%)
0     20-39  Female    0.981049
1     20-39    Male    1.061598

Cancer Percentage for Age Group 40-59:
  Age Group  Gender  Cancer (%)
2     40-59  Female    5.685090
3     40-59    Male    5.693803

Cancer Percentage for Age Group 60-79:
  Age Group  Gender  Cancer (%)
4     60-79  Female   17.574031
5     60-79    Male   17.791521

Cancer Percentage for Age Group 80+:
  Age Group  Gender  Cancer (%)
6       80+  Female   30.024961
7       80+    Male   30.208490


In [53]:
import numpy as np

# Function to randomly assign 25% of heart disease cases to smoking
def assign_smoking(heart_disease):
    if heart_disease == 1:
        return np.random.choice([1, 0], p=[0.25, 0.75])  # 25% chance of being attributed to smoking
    return 0  # If no heart disease, not attributed to smoking

# Apply the function to create the 'Smoking' column
df['Smoking'] = df['Heart Disease'].apply(assign_smoking)

# Display the first few rows to verify
df.head()

# Calculate and display the percentage of heart disease cases attributed to smoking
smoking_attributed_percentage = df[df['Heart Disease'] == 1]['Smoking'].mean() * 100
print(f"Percentage of heart disease cases attributed to smoking: {smoking_attributed_percentage:.2f}%")


Percentage of heart disease cases attributed to smoking: 25.21%


In [54]:
import numpy as np

# Function to assign COPD based on age group
def assign_copd(age):
    if 18 <= age <= 44:
        return np.random.choice([1, 0], p=[0.03, 0.97])  # 3% prevalence
    elif 45 <= age <= 64:
        return np.random.choice([1, 0], p=[0.12, 0.88])  # 12% prevalence
    elif age >= 65:
        return np.random.choice([1, 0], p=[0.17, 0.83])  # 17% prevalence
    return 0

# Apply the function to create the 'COPD' column
df['COPD'] = df['Age'].apply(assign_copd)

# Now set 'Smoking' to 1 for all individuals with COPD
df.loc[df['COPD'] == 1, 'Smoking'] = 1

# Display the first few rows to verify
df.head()

# Calculate the percentage of COPD cases
copd_prevalence = df['COPD'].mean() * 100
print(f"Overall COPD Prevalence: {copd_prevalence:.2f}%")

# Verify that all COPD cases are marked as smokers
smoking_for_copd = df[df['COPD'] == 1]['Smoking'].mean() * 100
print(f"Percentage of COPD cases attributed to smoking: {smoking_for_copd:.2f}%")


Overall COPD Prevalence: 8.99%
Percentage of COPD cases attributed to smoking: 100.00%


In [55]:
df.head()

Unnamed: 0,Gender,Age,BMI,Heart Disease,Cancer,Age Group,Smoking,COPD
0,Male,27,22.7,0,0,20-39,0,0
1,Female,54,28.5,0,0,40-59,0,0
2,Male,21,21.3,0,0,20-39,0,0
3,Female,62,28.6,0,0,60-79,0,0
4,Male,61,21.4,0,0,60-79,0,0


In [58]:
import pandas as pd
import numpy as np

# Assuming BMI >= 30 is classified as 'Obesity'
df['Obesity'] = df['BMI'].apply(lambda x: 1 if x >= 30 else 0)

# Function to assign Alzheimer's based on age, obesity, and smoking
def assign_alzheimers(row):
    age = row['Age']
    obesity_or_smoking = row['Obesity'] or row['Smoking']
    
    # Assign base probabilities based on age group
    if 65 <= age <= 74:
        base_prob = 0.03  # 3%
        multiplier = 1.5 if obesity_or_smoking else 1.0
    elif 75 <= age <= 84:
        base_prob = 0.13  # 13%
        multiplier = 1.4 if obesity_or_smoking else 1.0
    elif age >= 85:
        base_prob = 0.30  # 30%
        multiplier = 1.3 if obesity_or_smoking else 1.0
    else:
        return 0  # No Alzheimer's if under 65
    
    # Calculate final probability
    final_prob = base_prob * multiplier
    
    # Assign Alzheimer's based on probability
    return np.random.rand() < final_prob

# Apply the function to the DataFrame to create a new column 'Alzheimers'
df['Alzheimers'] = df.apply(assign_alzheimers, axis=1)

# Display the DataFrame to check the results
df.head()


Unnamed: 0,Gender,Age,BMI,Heart Disease,Cancer,Age Group,Smoking,COPD,Obesity,Alzheimers
0,Male,27,22.7,0,0,20-39,0,0,0,0
1,Female,54,28.5,0,0,40-59,0,0,0,0
2,Male,21,21.3,0,0,20-39,0,0,0,0
3,Female,62,28.6,0,0,60-79,0,0,0,0
4,Male,61,21.4,0,0,60-79,0,0,0,0


In [60]:
import pandas as pd
import numpy as np

# Function to assign Diabetes based on age, obesity, and heart disease
def assign_diabetes(row):
    age = row['Age']
    obesity_or_heart_disease = row['Obesity'] or row['Heart Disease']
    
    # Assign base probabilities based on age group
    if 18 <= age <= 44:
        base_prob = 0.02  # 2%
        multiplier = 2.0 if obesity_or_heart_disease else 1.0
    elif 45 <= age <= 64:
        base_prob = 0.09  # 9%
        multiplier = 2.0 if obesity_or_heart_disease else 1.0
    elif age >= 65:
        base_prob = 0.17  # 17%
        multiplier = 2.0 if obesity_or_heart_disease else 1.0
    else:
        return 0  # No diabetes assigned for ages under 18
    
    # Calculate final probability
    final_prob = base_prob * multiplier
    
    # Assign Diabetes based on probability
    return np.random.rand() < final_prob

# Create 'Obesity' column if it is based on BMI, assuming BMI >= 30 is Obese
df['Obesity'] = df['BMI'].apply(lambda x: 1 if x >= 30 else 0)

# Apply the function to the DataFrame to create a new column 'Diabetes'
df['Diabetes'] = df.apply(assign_diabetes, axis=1)

# Display the DataFrame to check the results
df.head()


Unnamed: 0,Gender,Age,BMI,Heart Disease,Cancer,Age Group,Smoking,COPD,Obesity,Alzheimers,Diabetes
0,Male,27,22.7,0,0,,0,0,0,0,False
1,Female,54,28.5,0,0,,0,0,0,0,True
2,Male,21,21.3,0,0,,0,0,0,0,False
3,Female,62,28.6,0,0,,0,0,0,0,False
4,Male,61,21.4,0,0,,0,0,0,0,False


In [62]:
# Convert the 'Diabetes' column from boolean (True/False) to integer (1/0)
df['Diabetes'] = df['Diabetes'].astype(int)

# Display the DataFrame to check the results
df.head()


Unnamed: 0,Gender,Age,BMI,Heart Disease,Cancer,Age Group,Smoking,COPD,Obesity,Alzheimers,Diabetes
0,Male,27,22.7,0,0,,0,0,0,0,0
1,Female,54,28.5,0,0,,0,0,0,0,1
2,Male,21,21.3,0,0,,0,0,0,0,0
3,Female,62,28.6,0,0,,0,0,0,0,0
4,Male,61,21.4,0,0,,0,0,0,0,0


In [64]:
import pandas as pd
import numpy as np

# Function to assign CKD based on age, heart disease, smoking, BMI, and diabetes
def assign_ckd(row):
    age = row['Age']
    risk_factors = row['Heart Disease'] or row['Smoking'] or row['BMI'] > 35 or row['Diabetes']
    
    # Assign base probabilities based on age group
    if 18 <= age <= 44:
        base_prob = 0.02  # 2% for ages 18-44
        multiplier = 2.0 if risk_factors else 1.0
    elif 45 <= age <= 64:
        base_prob = 0.05  # 5% for ages 45-64
        multiplier = 2.0 if risk_factors else 1.0
    elif age >= 65:
        base_prob = 0.15  # 15% for ages 65+
        multiplier = 2.0 if risk_factors else 1.0
    else:
        return 0  # No CKD for those under 18
    
    # Calculate final probability
    final_prob = base_prob * multiplier
    
    # Assign CKD based on probability
    return np.random.rand() < final_prob

# Apply the function to the DataFrame to create a new column 'CKD'
df['CKD'] = df.apply(assign_ckd, axis=1)

# Convert CKD from boolean (True/False) to integer (1/0)
df['CKD'] = df['CKD'].astype(int)

# Display the DataFrame to check the results
df.head()


Unnamed: 0,Gender,Age,BMI,Heart Disease,Cancer,Age Group,Smoking,COPD,Obesity,Alzheimers,Diabetes,CKD
0,Male,27,22.7,0,0,,0,0,0,0,0,0
1,Female,54,28.5,0,0,,0,0,0,0,1,0
2,Male,21,21.3,0,0,,0,0,0,0,0,0
3,Female,62,28.6,0,0,,0,0,0,0,0,0
4,Male,61,21.4,0,0,,0,0,0,0,0,0


In [66]:
# Create an Age Group based on the specified ranges
df['Age Group'] = pd.cut(df['Age'], bins=[0, 44, 64, np.inf], labels=['18-44', '45-64', '65+'])

# Display the DataFrame to check the results
df.head()


Unnamed: 0,Gender,Age,BMI,Heart Disease,Cancer,Age Group,Smoking,COPD,Obesity,Alzheimers,Diabetes,CKD
0,Male,27,22.7,0,0,18-44,0,0,0,0,0,0
1,Female,54,28.5,0,0,45-64,0,0,0,0,1,0
2,Male,21,21.3,0,0,18-44,0,0,0,0,0,0
3,Female,62,28.6,0,0,45-64,0,0,0,0,0,0
4,Male,61,21.4,0,0,45-64,0,0,0,0,0,0


In [82]:
import pandas as pd
import numpy as np

# Step 1: Function to assign High Blood Pressure based on age, obesity, diabetes, and CKD
def assign_hypertension(row):
    age = row['Age']
    
    # Determine if any of the risk factors are present
    if row['Obesity'] or row['Diabetes'] or row['CKD']:
        obesity = row['Obesity']
        diabetes = row['Diabetes']
        ckd = row['CKD']
    else:
        return 0  # No hypertension if no risk factors are present
    
    # Assign base probabilities based on age group
    if 18 <= age <= 39:
        base_prob = 0.25  # 10% for ages 18-39
        if obesity:
            multiplier = 2.0
        elif diabetes:
            multiplier = 3.0
        elif ckd:
            multiplier = 4.0
        else:
            multiplier = 1.0
    elif 40 <= age <= 59:
        base_prob = 0.35  # 30% for ages 40-59
        if obesity:
            multiplier = 2.0
        elif diabetes:
            multiplier = 2.0
        elif ckd:
            multiplier = 3.0
        else:
            multiplier = 1.0
    elif age >= 60:
        base_prob = 0.5  # 40% for ages 60+
        if obesity:
            multiplier = 2.0
        elif diabetes:
            multiplier = 2.0
        elif ckd:
            multiplier = 2.0
        else:
            multiplier = 1.0
    else:
        return 0  # No hypertension if under 18

    # Calculate final probability
    final_prob = base_prob * multiplier
    
    # Assign High Blood Pressure based on probability
    return np.random.rand() < final_prob

# Step 2: Apply the function to the DataFrame to create a new column 'High Blood Pressure'
df['High Blood Pressure'] = df.apply(assign_hypertension, axis=1)

# Convert High Blood Pressure from boolean (True/False) to integer (1/0)
df['High Blood Pressure'] = df['High Blood Pressure'].astype(int)

# Step 3: Calculate percentage of High Blood Pressure by Age Group
df['Age Group'] = pd.cut(df['Age'], bins=[0, 39, 59, np.inf], labels=['18-39', '40-59', '60+'])

# Step 4: Calculate percentage of High Blood Pressure by age group, with observed=True to silence warning
hypertension_by_age_group = df.groupby('Age Group', observed=True)['High Blood Pressure'].mean() * 100

# Step 5: Calculate the total percentage of the population with High Blood Pressure
total_hypertension_percentage = df['High Blood Pressure'].mean() * 100

# Step 6: Display the results
print("Percentage of High Blood Pressure by Age Group:")
print(hypertension_by_age_group)
print(f"\nTotal Percentage of the population with High Blood Pressure: {total_hypertension_percentage:.2f}%")


Percentage of High Blood Pressure by Age Group:
Age Group
18-39    21.211186
40-59    33.329342
60+      59.167299
Name: High Blood Pressure, dtype: float64

Total Percentage of the population with High Blood Pressure: 36.27%


In [85]:
import pandas as pd
import numpy as np

# Step 1: Function to assign Stroke based on age, smoking, diabetes, and heart disease
def assign_stroke(row):
    age = row['Age']
    risk_factors = row['Smoking'] or row['Diabetes'] or row['Heart Disease']
    
    # Assign base probabilities based on age group
    if 18 <= age <= 44:
        base_prob = 0.01  # 1% for ages 18-44
        multiplier = 2.0 if risk_factors else 1.0
    elif 45 <= age <= 64:
        base_prob = 0.02  # 2% for ages 45-64
        multiplier = 4.0 if risk_factors else 1.0
    elif age >= 65:
        base_prob = 0.03  # 3% for ages 65+
        multiplier = 6.0 if risk_factors else 1.0
    else:
        return 0  # No stroke if under 18
    
    # Calculate final probability
    final_prob = base_prob * multiplier
    
    # Assign Stroke based on probability
    return np.random.rand() < final_prob

# Step 2: Apply the function to the DataFrame to create a new column 'Stroke'
df['Stroke'] = df.apply(assign_stroke, axis=1)

# Convert Stroke from boolean (True/False) to integer (1/0)
df['Stroke'] = df['Stroke'].astype(int)

# Step 3: Group the data by age group
df['Age Group'] = pd.cut(df['Age'], bins=[0, 44, 64, np.inf], labels=['18-44', '45-64', '65+'])

# Step 4: Calculate percentage of Stroke by age group, with observed=True to silence warning
stroke_by_age_group = df.groupby('Age Group', observed=True)['Stroke'].mean() * 100

# Step 5: Calculate the total percentage of the population with Stroke
total_stroke_percentage = df['Stroke'].mean() * 100

# Step 6: Display the results
print("Percentage of Stroke by Age Group:")
print(stroke_by_age_group)
print(f"\nTotal Percentage of the population with Stroke: {total_stroke_percentage:.2f}%")


Percentage of Stroke by Age Group:
Age Group
18-44     1.069151
45-64     4.017694
65+      12.598637
Name: Stroke, dtype: float64

Total Percentage of the population with Stroke: 4.58%


In [90]:
import pandas as pd
import numpy as np

# Step 1: Function to assign Liver Disease (Liver Dx) based on age, high alcohol, and obesity
def assign_liver_dx(row):
    age = row['Age']
    obesity = row['Obesity']
    
    # Assign base probabilities based on age group
    if 18 <= age <= 44:
        base_prob = 0.03  # 3% for ages 18-44
        multiplier = 2.0 if obesity else 1.0  # Only using obesity for now, will assign high alcohol later
    elif 45 <= age <= 64:
        base_prob = 0.10  # 10% for ages 45-64
        multiplier = 3.0 if obesity else 1.0
    elif age >= 65:
        base_prob = 0.15  # 8% for ages 65+
        multiplier = 4.0 if obesity else 1.0
    else:
        return 0  # No liver disease if under 18
    
    # Calculate final probability
    final_prob = base_prob * multiplier
    
    # Assign Liver Disease based on probability
    return np.random.rand() < final_prob

# Step 2: Apply the function to the DataFrame to create a new column 'Liver Dx'
df['Liver Dx'] = df.apply(assign_liver_dx, axis=1)

# Convert Liver Dx from boolean (True/False) to integer (1/0)
df['Liver Dx'] = df['Liver Dx'].astype(int)

# Step 3: Assign High Alcohol only to rows where Liver Dx is present (Liver Dx == 1)
# Randomly assign High Alcohol consumption (for example, to 50% of Liver Dx cases)
df['High Alcohol'] = np.where(df['Liver Dx'] == 1, np.random.choice([0, 1], size=len(df), p=[0.5, 0.5]), 0)

# Step 4: Recalculate the Liver Dx with both High Alcohol and Obesity factors
def recalculate_liver_dx(row):
    age = row['Age']
    high_alcohol = row['High Alcohol']
    obesity = row['Obesity']
    
    # Assign base probabilities based on age group
    if 18 <= age <= 44:
        base_prob = 0.03  # 3% for ages 18-44
        multiplier = 3.0 if high_alcohol else (2.0 if obesity else 1.0)
    elif 45 <= age <= 64:
        base_prob = 0.10  # 10% for ages 45-64
        multiplier = 5.0 if high_alcohol else (3.0 if obesity else 1.0)
    elif age >= 65:
        base_prob = 0.15  # 15% for ages 65+
        multiplier = 6.0 if high_alcohol else (4.0 if obesity else 1.0)
    else:
        return 0  # No liver disease if under 18
    
    # Calculate final probability
    final_prob = base_prob * multiplier
    
    # Reassign Liver Disease based on new probability
    return np.random.rand() < final_prob

# Step 5: Reassign Liver Dx using both High Alcohol and Obesity risk factors
df['Liver Dx'] = df.apply(recalculate_liver_dx, axis=1)

# Convert Liver Dx from boolean (True/False) to integer (1/0)
df['Liver Dx'] = df['Liver Dx'].astype(int)

# Step 6: Group the data by age group
df['Age Group'] = pd.cut(df['Age'], bins=[0, 44, 64, np.inf], labels=['18-44', '45-64', '65+'])

# Step 7: Calculate percentage of Liver Dx by age group
liver_dx_by_age_group = df.groupby('Age Group', observed=True)['Liver Dx'].mean() * 100

# Step 8: Calculate the total percentage of the population with Liver Disease
total_liver_dx_percentage = df['Liver Dx'].mean() * 100

# Step 9: Display the results
print("Percentage of Liver Disease (Liver Dx) by Age Group:")
print(liver_dx_by_age_group)
print(f"\nTotal Percentage of the population with Liver Disease: {total_liver_dx_percentage:.2f}%")


Percentage of Liver Disease (Liver Dx) by Age Group:
Age Group
18-44     4.207766
45-64    20.253563
65+      37.613398
Name: Liver Dx, dtype: float64

Total Percentage of the population with Liver Disease: 16.77%


In [92]:
df.head()

Unnamed: 0,Gender,Age,BMI,Heart Disease,Cancer,Age Group,Smoking,COPD,Obesity,Alzheimers,Diabetes,CKD,High Blood Pressure,HBP,Stroke,High Alcohol,Liver Dx
0,Male,27,22.7,0,0,18-44,0,0,0,0,0,0,0,0,0,0,1
1,Female,54,28.5,0,0,45-64,0,0,0,0,1,0,0,1,0,0,0
2,Male,21,21.3,0,0,18-44,0,0,0,0,0,0,0,0,0,0,0
3,Female,62,28.6,0,0,45-64,0,0,0,0,0,0,0,1,0,0,0
4,Male,61,21.4,0,0,45-64,0,0,0,0,0,0,0,1,0,0,0


In [108]:
import numpy as np

# Step 1: Randomly assign High Alcohol to 30% of individuals with Heart Disease if not already present
df.loc[(df['Heart Disease'] == 1) & (df['High Alcohol'] == 0), 'High Alcohol'] = np.random.choice(
    [0, 1], size=len(df[(df['Heart Disease'] == 1) & (df['High Alcohol'] == 0)]), p=[0.7, 0.3])

# Step 2: Randomly assign High Alcohol to 20% of individuals with Obesity if not already present
df.loc[(df['Obesity'] == 1) & (df['High Alcohol'] == 0), 'High Alcohol'] = np.random.choice(
    [0, 1], size=len(df[(df['Obesity'] == 1) & (df['High Alcohol'] == 0)]), p=[0.8, 0.2])

# Step 3: Randomly assign High Alcohol to 30% of individuals with Diabetes if not already present
df.loc[(df['Diabetes'] == 1) & (df['High Alcohol'] == 0), 'High Alcohol'] = np.random.choice(
    [0, 1], size=len(df[(df['Diabetes'] == 1) & (df['High Alcohol'] == 0)]), p=[0.7, 0.3])

# Step 4: Randomly assign High Alcohol to 30% of individuals with Alzheimer's if not already present
df.loc[(df['Alzheimers'] == 1) & (df['High Alcohol'] == 0), 'High Alcohol'] = np.random.choice(
    [0, 1], size=len(df[(df['Alzheimers'] == 1) & (df['High Alcohol'] == 0)]), p=[0.7, 0.3])

# Step 5: Randomly assign High Alcohol to 15% of individuals with High Blood Pressure if not already present
df.loc[(df['High Blood Pressure'] == 1) & (df['High Alcohol'] == 0), 'High Alcohol'] = np.random.choice(
    [0, 1], size=len(df[(df['High Blood Pressure'] == 1) & (df['High Alcohol'] == 0)]), p=[0.85, 0.15])

# Step 6: Randomly assign High Alcohol to 30% of individuals with Stroke if not already present
df.loc[(df['Stroke'] == 1) & (df['High Alcohol'] == 0), 'High Alcohol'] = np.random.choice(
    [0, 1], size=len(df[(df['Stroke'] == 1) & (df['High Alcohol'] == 0)]), p=[0.7, 0.3])

# Step 7: Calculate the total percentage of the population with High Alcohol consumption
high_alcohol_percentage = df['High Alcohol'].mean() * 100

# Display the total percentage of High Alcohol consumption
print(f"Total Percentage of the population with High Alcohol consumption: {high_alcohol_percentage:.2f}%")


Total Percentage of the population with High Alcohol consumption: 44.36%


In [109]:
df.head()

Unnamed: 0,Gender,Age,Age Group,BMI,Obesity,Smoking,High Alcohol,Heart Disease,Cancer,COPD,Alzheimers,Diabetes,CKD,High Blood Pressure,Stroke,Liver Dx
0,Male,27,18-44,22.7,0,0,0,0,0,0,0,0,0,0,0,1
1,Female,54,45-64,28.5,0,0,0,0,0,0,0,1,0,0,0,0
2,Male,21,18-44,21.3,0,0,1,0,0,0,0,0,0,0,0,0
3,Female,62,45-64,28.6,0,0,0,0,0,0,0,0,0,0,0,0
4,Male,61,45-64,21.4,0,0,1,0,0,0,0,0,0,0,0,0


In [110]:
# Step 1: Define the new column order
new_column_order = [
    'Gender', 'Age', 'Age Group', 'BMI', 'Obesity', 'Smoking', 'High Alcohol',
    'Heart Disease', 'Cancer', 'COPD', 'Alzheimers', 'Diabetes', 'CKD', 'High Blood Pressure', 
    'Stroke', 'Liver Dx'
]

# Step 2: Reorder the DataFrame columns
df = df[new_column_order]

# Step 3: Select only the numeric columns for percentage calculation
numeric_columns = [
    'Age', 'BMI', 'Obesity', 'Smoking', 'High Alcohol', 'Heart Disease', 'Cancer', 'COPD',
    'Alzheimers', 'Diabetes', 'CKD', 'High Blood Pressure', 'Stroke', 'Liver Dx'
]

# Step 4: Calculate percentage by Age Group for numeric columns only, with observed=True to silence warning
percentages_by_age_group = df.groupby('Age Group', observed=True)[numeric_columns].mean() * 100

# Step 5: Display the percentages by Age Group for all numeric variables
print("Percentages by Age Group for Numeric Variables:")
print(percentages_by_age_group)


Percentages by Age Group for Numeric Variables:
                   Age          BMI    Obesity    Smoking  High Alcohol  \
Age Group                                                                 
18-44      3085.280227  2957.329128  38.026733   3.695410     46.220940   
45-64      5411.514953  2992.403685  38.692359  15.318404     46.498381   
65+        7456.642869  2956.736271  36.477177  26.150983     37.488146   

           Heart Disease     Cancer       COPD Alzheimers   Diabetes  \
Age Group                                                              
18-44           2.811857   1.815900   2.993970        0.0   2.777875   
45-64          15.407882   8.218125  11.954782        0.0  13.295694   
65+            43.124519  21.122980  17.096999  11.133179  27.482241   

                 CKD  High Blood Pressure     Stroke   Liver Dx  
Age Group                                                        
18-44       2.602733            22.638632   1.069151   4.207766  
45-64       7.459

In [111]:
# Step 1: Define the new column order
new_column_order = [
    'Gender', 'Age', 'Age Group', 'BMI', 'Obesity', 'Smoking', 'High Alcohol',
    'Heart Disease', 'Cancer', 'COPD', 'Alzheimers', 'Diabetes', 'CKD', 'High Blood Pressure', 
    'Stroke', 'Liver Dx'
]

# Step 2: Reorder the DataFrame columns
df = df[new_column_order]

# Step 3: Select only the binary (percentage) columns for scaling
percentage_columns = [
    'Obesity', 'Smoking', 'High Alcohol', 'Heart Disease', 'Cancer', 'COPD',
    'Alzheimers', 'Diabetes', 'CKD', 'High Blood Pressure', 'Stroke', 'Liver Dx'
]

# Step 4: Calculate mean values for percentage columns and regular means for Age and BMI
percentages_by_age_group = df.groupby('Age Group', observed=True)[percentage_columns].mean() * 100
age_bmi_means_by_age_group = df.groupby('Age Group', observed=True)[['Age', 'BMI']].mean()

# Step 5: Combine the results
final_result = pd.concat([age_bmi_means_by_age_group, percentages_by_age_group], axis=1)

# Step 6: Display the results
print("Mean Age, BMI, and Percentages by Age Group:")
print(final_result)


Mean Age, BMI, and Percentages by Age Group:
                 Age        BMI    Obesity    Smoking  High Alcohol  \
Age Group                                                             
18-44      30.852802  29.573291  38.026733   3.695410     46.220940   
45-64      54.115150  29.924037  38.692359  15.318404     46.498381   
65+        74.566429  29.567363  36.477177  26.150983     37.488146   

           Heart Disease     Cancer       COPD Alzheimers   Diabetes  \
Age Group                                                              
18-44           2.811857   1.815900   2.993970        0.0   2.777875   
45-64          15.407882   8.218125  11.954782        0.0  13.295694   
65+            43.124519  21.122980  17.096999  11.133179  27.482241   

                 CKD  High Blood Pressure     Stroke   Liver Dx  
Age Group                                                        
18-44       2.602733            22.638632   1.069151   4.207766  
45-64       7.459451            37.54930

In [112]:
import numpy as np

# Step 1: List of conditions to check for no positive values
conditions = ['Obesity', 'Smoking', 'Heart Disease', 'Cancer', 'COPD', 'Alzheimers', 
              'Diabetes', 'CKD', 'High Blood Pressure', 'Stroke', 'Liver Dx']

# Step 2: Create a new 'Strength' column initialized to 0
df['Strength'] = 0

# Step 3: Filter and assign Strength to 30% of 18-44 who have no positive values for the conditions
age_18_44_no_conditions = df[(df['Age Group'] == '18-44') & (df[conditions].sum(axis=1) == 0)]
num_to_assign_18_44 = int(0.30 * len(age_18_44_no_conditions))  # 30% of this group
if num_to_assign_18_44 > 0:
    df.loc[age_18_44_no_conditions.sample(n=num_to_assign_18_44).index, 'Strength'] = 1

# Step 4: Filter and assign Strength to 15% of 45-64 who have no positive values for the conditions
age_45_64_no_conditions = df[(df['Age Group'] == '45-64') & (df[conditions].sum(axis=1) == 0)]
num_to_assign_45_64 = int(0.15 * len(age_45_64_no_conditions))  # 15% of this group
if num_to_assign_45_64 > 0:
    df.loc[age_45_64_no_conditions.sample(n=num_to_assign_45_64).index, 'Strength'] = 1

# Step 5: Filter and assign Strength to 5% of 65+ who have no positive values for the conditions
age_65_plus_no_conditions = df[(df['Age Group'] == '65+') & (df[conditions].sum(axis=1) == 0)]
num_to_assign_65_plus = int(0.05 * len(age_65_plus_no_conditions))  # 5% of this group
if num_to_assign_65_plus > 0:
    df.loc[age_65_plus_no_conditions.sample(n=num_to_assign_65_plus).index, 'Strength'] = 1

# Step 6: Display the result
print("New Strength column created and applied as per conditions.")


New Strength column created and applied as per conditions.


In [117]:
# Step 1: Define base scores for each specified Exam Age group
base_scores = {
    '18-29': 90,
    '30-39': 87,
    '40-49': 84,
    '50-59': 80,
    '60-69': 77,
    '70-79': 74,
    '80+': 71
}

# Step 2: Function to map actual age values to the correct Exam Age group
def assign_exam_age(age):
    if 18 <= age <= 29:
        return '18-29'
    elif 30 <= age <= 39:
        return '30-39'
    elif 40 <= age <= 49:
        return '40-49'
    elif 50 <= age <= 59:
        return '50-59'
    elif 60 <= age <= 69:
        return '60-69'
    elif 70 <= age <= 79:
        return '70-79'
    elif age >= 80:
        return '80+'
    else:
        return None

# Step 3: Create the 'Exam Age' column based on the 'Age' column
df['Exam Age'] = df['Age'].apply(assign_exam_age)

# Step 4: Function to calculate exam score based on conditions and base score
def calculate_exam_score(row):
    # Get the base score from the Exam Age group
    exam_age_group = row['Exam Age']
    base_score = base_scores[exam_age_group]
    
    # Add 20 points if Strength = 1
    if row['Strength'] == 1:
        base_score += 20

    # Subtract percentages from the base score based on conditions
    if row['Heart Disease'] == 1:
        base_score -= 0.10 * base_score
    if row['Cancer'] == 1:
        base_score -= 0.15 * base_score
    if row['COPD'] == 1:
        base_score -= 0.15 * base_score
    if row['Alzheimers'] == 1:
        base_score -= 0.40 * base_score
    if row['Diabetes'] == 1:
        base_score -= 0.10 * base_score
    if row['CKD'] == 1:
        base_score -= 0.15 * base_score
    if row['Stroke'] == 1:
        base_score -= 0.25 * base_score
    if row['Liver Dx'] == 1:
        base_score -= 0.10 * base_score
    if row['High Blood Pressure'] == 1:
        base_score -= 0.05 * base_score
    
    return base_score

# Step 5: Apply the function to create a new 'Exam Score' column
df['Exam Score'] = df.apply(calculate_exam_score, axis=1)

# Step 6: Calculate high, low, and average exam scores by Exam Age group
exam_scores_by_age_group = df.groupby('Exam Age')['Exam Score'].agg(['min', 'max', 'mean'])

# Step 7: Display the results using Pandas
print("High, Low, and Average Exam Scores by Exam Age:")
print(exam_scores_by_age_group)


High, Low, and Average Exam Scores by Exam Age:
                min    max       mean
Exam Age                             
18-29     46.330313  110.0  90.480301
30-39     44.785969  107.0  87.617400
40-49     36.755381  104.0  79.571152
50-59     31.504612  100.0  72.931650
60-69     17.337730   97.0  62.703986
70-79     15.736554   94.0  56.233332
80+       13.588727   91.0  46.677415


In [118]:
df.head(20)

Unnamed: 0,Gender,Age,Age Group,BMI,Obesity,Smoking,High Alcohol,Heart Disease,Cancer,COPD,Alzheimers,Diabetes,CKD,High Blood Pressure,Stroke,Liver Dx,Strength,Exam Age,Exam Score
0,Male,27,18-44,22.7,0,0,0,0,0,0,0,0,0,0,0,1,0,18-29,81.0
1,Female,54,45-64,28.5,0,0,0,0,0,0,0,1,0,0,0,0,0,50-59,72.0
2,Male,21,18-44,21.3,0,0,1,0,0,0,0,0,0,0,0,0,1,18-29,110.0
3,Female,62,45-64,28.6,0,0,0,0,0,0,0,0,0,0,0,0,0,60-69,77.0
4,Male,61,45-64,21.4,0,0,1,0,0,0,0,0,0,0,0,0,0,60-69,77.0
5,Female,24,18-44,41.2,1,0,1,0,0,0,0,0,0,1,0,0,0,18-29,85.5
6,Male,77,65+,38.8,1,0,1,1,1,0,True,1,0,1,0,1,0,70-79,26.136837
7,Female,60,45-64,29.0,0,1,0,0,0,1,0,0,0,0,0,1,0,60-69,58.905
8,Female,69,65+,22.8,0,1,1,0,0,1,False,0,1,1,0,1,0,60-69,47.565787
9,Female,38,18-44,27.4,0,0,0,0,0,0,0,0,0,0,0,0,0,30-39,87.0


In [119]:
# Convert all values in the Alzheimers column to 1 or 0
df['Alzheimers'] = df['Alzheimers'].apply(lambda x: 1 if x == True or x == 1 else 0)

# Display the updated DataFrame
print(df[['Gender', 'Age', 'Alzheimers']])


        Gender  Age  Alzheimers
0         Male   27           0
1       Female   54           0
2         Male   21           0
3       Female   62           0
4         Male   61           0
...        ...  ...         ...
249995    Male   84           1
249996  Female   46           0
249997    Male   74           0
249998    Male   29           0
249999    Male   58           0

[250000 rows x 3 columns]


In [121]:
# Convert the DataFrame to a CSV file
csv_file_name = 'exam_scores_updated_dataframe.csv'
df.to_csv(csv_file_name, index=False)

# Confirmation message
print(f"DataFrame has been successfully saved as {csv_file_name}")


DataFrame has been successfully saved as exam_scores_updated_dataframe.csv
