#### **Dataset Bias Introduction and Metrics:**

**Initial Baseline Metrics (Before Bias Introduction):**

- Total Average Exam Score: 61.85
- Average Exam Score for Males: 61.40
- Average Exam Score for Females: 62.31

**Adjusted Bias Introduction:**

We introduced a 12% reduction in Exam Scores for 50% of females over age 65 and a 10% reduction in Exam Scores for 50% of all females. Because females over 65 are also part of the "all females" pool, a female over 65 can be selected in both samples, in which case the two reductions compound (0.88 × 0.90 = 0.792 of the original score).

After introducing this bias, the metrics changed to:

- Total Average Exam Score: 59.91
- Average Exam Score for Males: 61.40 (unchanged)
- Average Exam Score for Females: 58.41

**Percentage Deviations (Before vs. After Bias):**

- Total Average Exam Score dropped from 61.85 to 59.91, a deviation of 3.14%.
- Average Exam Score for Females dropped from 62.31 to 58.41, a deviation of 6.26%.
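
The deviation figures are relative drops, computed as (before − after) / before. A minimal sketch of that arithmetic, using the rounded averages reported above (the helper name `percent_drop` is ours, not part of the notebook):

```python
def percent_drop(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a percentage of `before`."""
    return (before - after) / before * 100


# Rounded averages reported before and after the bias was introduced.
total_drop = percent_drop(61.85, 59.91)
female_drop = percent_drop(62.31, 58.41)

print(f"Total deviation: {total_drop:.2f}%")
print(f"Female deviation: {female_drop:.2f}%")
```

Because the inputs are already rounded to two decimals, results from the full-precision means could differ slightly in later decimal places.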

**Discussion of ML Capabilities and Strategies to Work with This Bias:**

**Machine Learning Model's Ability to Detect Bias:**

A machine learning model trained on this dataset can pick up the gender-related pattern, particularly if gender is included as a feature. Because the introduced bias is a systematic reduction in exam scores for certain subgroups of females, the model is likely to learn the lower scores as if they were a genuine signal and reproduce them as unfair predictions.

However, a model does not flag bias on its own. By default it optimizes for predictive accuracy, so unless fairness metrics or mitigation techniques are applied explicitly, it can produce biased predictions for females, especially older ones.
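
Since the model will not flag the problem itself, a simple audit of subgroup means is one way to surface it before training. The sketch below assumes a tiny made-up DataFrame that reuses the notebook's column names (`Gender`, `Age`, `Exam Score`). The values and the `Age Band` column are illustrative, not taken from the dataset.

```python
import pandas as pd

# Tiny illustrative sample (made-up values, not rows from the dataset).
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female", "Female"],
    "Age": [40, 70, 30, 45, 70, 80],
    "Exam Score": [60.0, 62.0, 58.0, 60.0, 50.0, 52.0],
})

# Label each row with the age band used when the bias was introduced.
sample["Age Band"] = sample["Age"].gt(65).map({True: "Over 65", False: "65 or under"})

# Mean score per (gender, age band); large gaps between cells flag subgroups to inspect.
group_means = sample.groupby(["Gender", "Age Band"])["Exam Score"].mean()
print(group_means)

# Gap between male and female means within the over-65 band.
gap_over_65 = group_means.loc[("Male", "Over 65")] - group_means.loc[("Female", "Over 65")]
print(f"Male minus female mean, over 65: {gap_over_65:.2f}")
```

A gap like this does not prove unfairness on its own, but it shows which subgroups deserve a closer look.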

**Potential Strategies to Address and Work with Bias:**

**Fairness-Aware Algorithms:**

- Fairness metrics such as Demographic Parity and Equality of Opportunity can be computed on a model's predictions. Demographic parity compares the rate of positive predictions across groups; equality of opportunity compares true positive rates.
- These metrics measure whether the model's outcomes differ disproportionately between groups (females versus males in this case).
- In-processing methods such as adversarial debiasing, or pre-processing methods such as reweighting, can be applied to mitigate these effects.
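
Both metrics are defined for binary decisions, so the sketch below assumes a hypothetical pass/fail outcome derived from the exam score. The labels, predictions, and helper names are made up for illustration, and the differences are computed as male rate minus female rate.

```python
import numpy as np


def selection_rate(y_pred, mask):
    """Share of the group (rows where mask is True) that receives a positive prediction."""
    return y_pred[mask].mean()


def true_positive_rate(y_true, y_pred, mask):
    """Share of the group's actual positives that are also predicted positive."""
    positives = mask & (y_true == 1)
    return y_pred[positives].mean()


# Made-up labels and predictions for eight people (illustrative only).
gender = np.array(["Male"] * 4 + ["Female"] * 4)
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])

is_male = gender == "Male"
is_female = gender == "Female"

# Demographic parity difference: gap in positive-prediction rates.
dp_diff = selection_rate(y_pred, is_male) - selection_rate(y_pred, is_female)

# Equal opportunity difference: gap in true positive rates.
eo_diff = (true_positive_rate(y_true, y_pred, is_male)
           - true_positive_rate(y_true, y_pred, is_female))

print(f"Demographic parity difference: {dp_diff:.2f}")
print(f"Equal opportunity difference: {eo_diff:.2f}")
```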

**Data Rebalancing or Reweighting:**

- Reweight the dataset so that the underrepresented or disadvantaged group (e.g., females whose scores were reduced) carries more weight during training.
- Alternatively, oversample the affected female data points to balance the model's exposure to them.
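
One simple way to reweight is inverse-frequency weighting, where every group ends up with the same total weight. This is a sketch of that one scheme, not the only option; the helper name and the four-row example are hypothetical. Many scikit-learn estimators accept such weights through the `sample_weight` argument of `fit`.

```python
from collections import Counter


def balanced_group_weights(groups):
    """Weight each row by n / (k * group_size) so every group has the same total weight."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]


# Made-up group labels: three males and one female.
groups = ["Male", "Male", "Male", "Female"]
weights = balanced_group_weights(groups)

# Each group's weights sum to n / k, here 4 / 2 = 2.
male_total = sum(w for w, g in zip(weights, groups) if g == "Male")
female_total = sum(w for w, g in zip(weights, groups) if g == "Female")
print(weights, male_total, female_total)
```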

**Pre-processing Mitigation:**

- Fair representation learning can be applied to the dataset, transforming the features so that they carry less information about group membership before the model is trained.
- Adjusting exam scores for fairness, such as statistically correcting the reduction in females' scores, would be another pre-processing strategy.
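
As one possible statistical correction, the sketch below rescales a group's scores so that its mean matches a reference group's mean. This is an assumed, deliberately simple choice (multiplicative, because the bias was introduced multiplicatively), not the notebook's method. The function name and sample values are hypothetical.

```python
from statistics import mean


def align_group_mean(scores, groups, target_group, reference_group):
    """Rescale target_group scores so their mean equals reference_group's mean."""
    target_mean = mean(s for s, g in zip(scores, groups) if g == target_group)
    reference_mean = mean(s for s, g in zip(scores, groups) if g == reference_group)
    factor = reference_mean / target_mean
    # Only the target group's scores are changed; all other rows pass through untouched.
    return [s * factor if g == target_group else s for s, g in zip(scores, groups)]


# Made-up scores (illustrative only).
scores = [60.0, 64.0, 54.0, 58.0]
groups = ["Male", "Male", "Female", "Female"]

corrected = align_group_mean(scores, groups, "Female", "Male")
corrected_female_mean = mean(corrected[2:])
print(corrected, corrected_female_mean)
```

Matching group means treats the whole gap as an artifact and does not recover the original individual scores, so it is only appropriate when that assumption is defensible.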

**Post-processing Mitigation:**

- After the model has made predictions, threshold adjustments or correction algorithms can be applied so that bias does not disproportionately affect the output.
- This could include using group-specific decision thresholds so that females and males receive positive outcomes at comparable rates.
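
A minimal sketch of group-specific thresholds follows, assuming the model outputs a numeric score and that the goal is an equal selection rate per group. The function name, scores, and target rate are made up for illustration.

```python
import numpy as np


def group_thresholds(scores, groups, target_rate):
    """For each group, the score cutoff that selects roughly target_rate of that group."""
    thresholds = {}
    for g in np.unique(groups):
        group_scores = scores[groups == g]
        # The (1 - target_rate) quantile leaves about target_rate of the group at or above it.
        thresholds[g] = np.quantile(group_scores, 1 - target_rate)
    return thresholds


# Made-up model scores in which the female scores are shifted lower.
scores = np.array([0.9, 0.7, 0.5, 0.3, 0.7, 0.5, 0.3, 0.1])
groups = np.array(["Male"] * 4 + ["Female"] * 4)

# A single cutoff of 0.6 selects a larger share of males than females.
single_cutoff = scores >= 0.6

# Group-specific cutoffs select the same share of each group.
thresholds = group_thresholds(scores, groups, target_rate=0.5)
adjusted = scores >= np.array([thresholds[g] for g in groups])

for g in ("Male", "Female"):
    in_group = groups == g
    print(g, single_cutoff[in_group].mean(), adjusted[in_group].mean())
```

Equalizing selection rates is only one target; thresholds could instead be tuned to match true positive rates if ground-truth labels are available.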

**Fairness Constraints:**

- Introduce fairness constraints directly into the model's optimization objective, so that the model is penalized for biased decisions and encouraged to treat groups equally.
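
To make the idea of a penalty concrete, the sketch below adds a term `lam * gap**2` to an ordinary least-squares objective, where `gap` is the difference in mean predictions between two groups. This is one simple, assumed formulation with a closed-form solution. The synthetic data, the function name, and the value of `lam` are all hypothetical.

```python
import numpy as np


def fit_with_parity_penalty(X, y, is_group_a, lam):
    """Least squares plus lam * (gap in mean predictions between the two groups) ** 2."""
    n = len(y)
    # d @ w is the mean prediction for group A minus the mean prediction for the rest.
    d = X[is_group_a].mean(axis=0) - X[~is_group_a].mean(axis=0)
    # Setting the gradient of the penalized objective to zero gives a linear system.
    A = X.T @ X / n + lam * np.outer(d, d)
    b = X.T @ y / n
    w = np.linalg.solve(A, b)
    return w, float(d @ w)


# Synthetic data: one feature, plus a 4-point penalty applied to the female group.
rng = np.random.default_rng(0)
n = 200
is_female = rng.random(n) < 0.5
feature = rng.normal(size=n)
y = 60 + 5 * feature - 4 * is_female + rng.normal(size=n)
X = np.column_stack([np.ones(n), feature, is_female.astype(float)])

_, gap_plain = fit_with_parity_penalty(X, y, is_female, lam=0.0)
_, gap_penalized = fit_with_parity_penalty(X, y, is_female, lam=10.0)
print(f"Gap without penalty: {gap_plain:.2f}")
print(f"Gap with penalty: {gap_penalized:.2f}")
```

Raising `lam` shrinks the gap further at the cost of a worse fit to the (biased) targets, which is the trade-off a fairness constraint makes explicit.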

**Conclusion:**

The bias we introduced is measurable: it lowers the average exam score for females by 6.26%, with the largest reductions falling on older females. Machine learning models trained on this data will likely pick up on these biased patterns unless steps are taken to mitigate them. Fairness metrics can detect the bias, while data rebalancing, fairness constraints, or post-processing techniques can reduce its effect on a model's predictions.

In [14]:
import requests
import pandas as pd
from io import StringIO

# URL of the CSV file on GitHub
url = 'https://raw.githubusercontent.com/Compcode1/data_points_4.75_million/refs/heads/main/4.75_million_health_datapoints.csv'

# Fetch the CSV file using requests.
# Note: verify=False disables TLS certificate verification; remove it unless your environment requires it.
response = requests.get(url, verify=False)

# Raise an HTTP error if the request failed, so the cell never continues with df undefined
response.raise_for_status()

# Wrap the response text in a StringIO object and read it into a DataFrame
df = pd.read_csv(StringIO(response.text))
print("Dataset successfully imported!")

# Display the first few rows of the DataFrame
df.head()




Dataset successfully imported!


Unnamed: 0,Age,Gender,BMI,Waist_Circumference,BMI_Category,Triglyceride,HDL,High_Blood_Pressure,FBG,Alcohol Use,Smoker,Exercise,Hours of Sleep,Heart Disease,Cancer,Metabolic_Syndrome,COPD,Diabetes,Exam Score
0,40,Male,36.4,42.4,Obese,211,55,1,233,Moderate Drinker,Non Smoker,Meets Aerobic Only,Adequate Sleep (7+ hours),0,0,1,0.0,1,72.25
1,24,Male,33.3,38.3,Obese,290,41,1,82,Moderate Drinker,Smoker,Insufficiently Active,Less than 7 Hours,0,0,0,0.0,0,75.0
2,73,Female,30.2,43.1,Obese,86,56,1,102,Moderate Drinker,Non Smoker,Inactive,Adequate Sleep (7+ hours),0,0,1,0.0,0,44.0
3,90,Male,29.2,58.6,Overweight,202,33,0,209,Moderate Drinker,Former Smoker,Meets Both Guidelines,Chronic Sleep Deprivation (≤5 hours),0,0,1,0.0,1,37.5
4,99,Male,19.9,45.8,Normal weight,71,69,0,94,Moderate Drinker,Non Smoker,Meets Both Guidelines,Adequate Sleep (7+ hours),0,0,0,0.0,0,50.0


In [15]:
# Calculate the overall average exam score
average_exam_score_total = df['Exam Score'].mean()

# Calculate the average exam score for males
average_exam_score_males = df[df['Gender'] == 'Male']['Exam Score'].mean()

# Calculate the average exam score for females
average_exam_score_females = df[df['Gender'] == 'Female']['Exam Score'].mean()

# Output the results
print(f"Average Exam Score (Total): {average_exam_score_total:.2f}")
print(f"Average Exam Score (Males): {average_exam_score_males:.2f}")
print(f"Average Exam Score (Females): {average_exam_score_females:.2f}")


Average Exam Score (Total): 61.85
Average Exam Score (Males): 61.40
Average Exam Score (Females): 62.31


In [16]:
# Introduce bias: 50% of females over age 65, reduce Exam Score by 12%
females_over_65 = df[(df['Gender'] == 'Female') & (df['Age'] > 65)]
# Randomly select 50% of these females
sample_females_over_65 = females_over_65.sample(frac=0.5, random_state=1)
# Apply a 12% reduction in Exam Score
df.loc[sample_females_over_65.index, 'Exam Score'] *= 0.88

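# Introduce bias: 50% of all females, reduce Exam Score by 10%
# (females over 65 already reduced above can be selected again, so both reductions compound: 0.88 * 0.90)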
all_females = df[df['Gender'] == 'Female']
# Randomly select 50% of all females
sample_all_females = all_females.sample(frac=0.5, random_state=1)
# Apply a 10% reduction in Exam Score
df.loc[sample_all_females.index, 'Exam Score'] *= 0.90

# Collect the index labels of every affected row once, keeping first-seen order.
# (Concatenating the sampled frames and calling drop_duplicates() compares row values, not labels,
# so a row present in both samples can be listed twice.)
affected_index = sample_females_over_65.index.append(sample_all_females.index).drop_duplicates()

# Check if the bias has been introduced properly by looking at the updated data
df.loc[affected_index].head()


Unnamed: 0,Age,Gender,BMI,Waist_Circumference,BMI_Category,Triglyceride,HDL,High_Blood_Pressure,FBG,Alcohol Use,Smoker,Exercise,Hours of Sleep,Heart Disease,Cancer,Metabolic_Syndrome,COPD,Diabetes,Exam Score
214811,85,Female,25.9,42.6,Overweight,55,65,0,283,Moderate Drinker,Non Smoker,Meets Aerobic Only,Less than 7 Hours,0,0,0,0.0,1,35.64
184203,81,Female,31.7,46.0,Obese,103,23,1,252,Non-Drinker,Smoker,Insufficiently Active,Chronic Sleep Deprivation (≤5 hours),0,1,1,0.0,1,13.86
209133,100,Female,20.7,38.8,Normal weight,68,59,0,79,Non-Drinker,Former Smoker,Inactive,Chronic Sleep Deprivation (≤5 hours),0,0,0,0.0,0,33.0
152235,99,Female,29.9,52.9,Overweight,187,53,1,91,Moderate Drinker,Non Smoker,Meets Aerobic Only,Adequate Sleep (7+ hours),0,0,1,0.0,0,41.8
199290,66,Female,28.5,39.3,Overweight,88,66,1,92,Moderate Drinker,Non Smoker,Inactive,Adequate Sleep (7+ hours),0,0,0,0.0,0,40.392


In [17]:
# Calculate the overall average exam score
average_exam_score_total = df['Exam Score'].mean()

# Calculate the average exam score for males
average_exam_score_males = df[df['Gender'] == 'Male']['Exam Score'].mean()

# Calculate the average exam score for females
average_exam_score_females = df[df['Gender'] == 'Female']['Exam Score'].mean()

# Output the results
print(f"Average Exam Score (Total): {average_exam_score_total:.2f}")
print(f"Average Exam Score (Males): {average_exam_score_males:.2f}")
print(f"Average Exam Score (Females): {average_exam_score_females:.2f}")


Average Exam Score (Total): 59.91
Average Exam Score (Males): 61.40
Average Exam Score (Females): 58.41


In [18]:
# Save the modified dataset with introduced bias to a CSV file
df.to_csv('biased_dataset.csv', index=False)
print("Dataset saved as 'biased_dataset.csv'")


Dataset saved as 'biased_dataset.csv'
