# Understanding Confounders in AI
#### Confounders Defined: 
A confounder is a variable, often unmeasured, that influences both the independent variable (one of your features) and the dependent variable (your target, or what you're trying to predict). This creates a spurious, non-causal association between the two that can mislead your ML model.

#### AI Example: 
Imagine training an AI model to predict the likelihood of someone developing heart disease. Factors like age, diet, and exercise habits are relevant predictors. However, a confounder might be a genetic predisposition that increases the risk of heart disease and, separately, influences dietary choices.

#### The Problem: 
If the confounder (genetic predisposition) is not accounted for, your model might attribute too much weight to dietary choices, underestimating the true impact of genetics.  This leads to inaccurate predictions and a flawed understanding of the underlying relationships.
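
The effect can be reproduced with a small simulation. The sketch below is illustrative only: the variable names (`genetics`, `diet`, `risk`), the continuous risk score, and all effect sizes are assumptions made up for this demonstration, not values from real data. It fits a linear model with and without the confounder and compares the estimated diet coefficient with the true one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Assumed data-generating process (all numbers are made up):
# genetics influences diet and, separately, the risk score.
genetics = rng.normal(size=n)
diet = 0.8 * genetics + rng.normal(size=n)
true_diet_effect = 0.5
risk = true_diet_effect * diet + 1.0 * genetics + rng.normal(size=n)


def ols_coefficients(features, target):
    """Least-squares coefficients for target ~ intercept + features."""
    design = np.column_stack([np.ones(len(target)), features])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs[1:]  # drop the intercept


# Model that ignores the confounder: diet absorbs part of the genetic effect.
naive_diet_coef = ols_coefficients(diet, risk)[0]

# Model that includes the confounder as an additional feature.
adjusted_diet_coef = ols_coefficients(np.column_stack([diet, genetics]), risk)[0]

print(f"true diet effect:          {true_diet_effect:.2f}")
print(f"estimate without genetics: {naive_diet_coef:.2f}")
print(f"estimate with genetics:    {adjusted_diet_coef:.2f}")
```

Under these assumptions the estimate without `genetics` should come out well above the true value, while the estimate with `genetics` should land close to it.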

### Why Addressing Confounders Matters
#### Model Interpretation: 
Confounders hinder your ability to accurately interpret how features influence the target variable. You might mistakenly overestimate or underestimate the importance of certain factors.

#### Model Reliability:  
When your model encounters data with different distributions of the confounding variable (e.g., a different demographic group with varying genetic predispositions), its predictions might become unreliable.
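
A minimal sketch of this failure mode, under made-up assumptions: a hypothetical `simulate` helper generates a population whose average genetic predisposition can be changed, one model omits the confounder, and another includes it. Both are evaluated on a population like the training one and on a shifted one.

```python
import numpy as np

rng = np.random.default_rng(1)


def simulate(n, genetics_mean):
    """Hypothetical population; genetics_mean shifts the confounder's distribution."""
    genetics = rng.normal(loc=genetics_mean, size=n)
    diet = 0.8 * genetics + rng.normal(size=n)
    risk = 0.5 * diet + 1.0 * genetics + rng.normal(size=n)
    return diet, genetics, risk


def design_matrix(*columns):
    """Stack an intercept column with the given feature columns."""
    return np.column_stack([np.ones(len(columns[0])), *columns])


def fit(target, *columns):
    """Least-squares fit of target on intercept + columns."""
    coefs, *_ = np.linalg.lstsq(design_matrix(*columns), target, rcond=None)
    return coefs


def mse(coefs, target, *columns):
    """Mean squared prediction error of a fitted linear model."""
    return float(np.mean((target - design_matrix(*columns) @ coefs) ** 2))


# Train on a population with average predisposition 0.
diet_tr, gen_tr, risk_tr = simulate(5000, genetics_mean=0.0)
naive = fit(risk_tr, diet_tr)           # confounder omitted
full = fit(risk_tr, diet_tr, gen_tr)    # confounder included

# Evaluate on a similar population and on one with a higher predisposition.
diet_a, gen_a, risk_a = simulate(5000, genetics_mean=0.0)
diet_b, gen_b, risk_b = simulate(5000, genetics_mean=2.0)

naive_mse_same = mse(naive, risk_a, diet_a)
naive_mse_shifted = mse(naive, risk_b, diet_b)
full_mse_same = mse(full, risk_a, diet_a, gen_a)
full_mse_shifted = mse(full, risk_b, diet_b, gen_b)

print(f"confounder omitted:  same={naive_mse_same:.2f}  shifted={naive_mse_shifted:.2f}")
print(f"confounder included: same={full_mse_same:.2f}  shifted={full_mse_shifted:.2f}")
```

In this toy setup the model without the confounder should lose accuracy on the shifted population, while the model that includes it should stay roughly stable.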

### Technique: Adjusting for Confounders
One common technique for addressing confounders is matching. Here's the general idea with a simplified example:

* Identify the Confounder: 
In our heart disease example, let's say you identify "family history of heart disease" as a potential confounder.
* Create Matched Pairs: 
You would form pairs of individuals who share the same value of the confounder (e.g., both have a family history of heart disease, or both do not) but differ in the feature whose effect you want to study (e.g., one with a healthy diet and one with an unhealthy diet).
* Train Within Pairs: 
Analyze differences in outcomes (heart disease development) within these matched pairs. Because the confounder is held constant within each pair, it cannot explain the difference in outcomes, which minimizes its influence on the perceived relationship between your features and the target variable.
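
The steps above can be sketched with exact matching on a single binary confounder, which amounts to comparing individuals only within the same family-history group. Everything in this sketch is hypothetical: the binary `poor_diet` exposure, the probabilities, and the assumed true effect of 0.10 are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Hypothetical binary variables; all probabilities are made up.
family_history = rng.integers(0, 2, size=n)
# Family history makes a poor diet more likely (0.7 vs. 0.3) ...
poor_diet = (rng.random(n) < np.where(family_history == 1, 0.7, 0.3)).astype(int)
# ... and also raises disease risk directly. True effect of a poor diet: +0.10.
true_effect = 0.10
p_disease = 0.10 + true_effect * poor_diet + 0.30 * family_history
disease = (rng.random(n) < p_disease).astype(int)


def risk_difference(mask):
    """Disease rate of poor-diet minus good-diet individuals within `mask`."""
    exposed = disease[mask & (poor_diet == 1)].mean()
    unexposed = disease[mask & (poor_diet == 0)].mean()
    return exposed - unexposed


# Naive comparison ignores family history.
naive = risk_difference(np.ones(n, dtype=bool))

# Matched comparison: compare only individuals with the same family history,
# then average the within-group differences, weighted by group size.
adjusted = 0.0
for value in (0, 1):
    in_group = family_history == value
    adjusted += in_group.mean() * risk_difference(in_group)

print(f"true effect: {true_effect:.2f}")
print(f"naive:       {naive:.2f}")
print(f"matched:     {adjusted:.2f}")
```

With these assumed numbers, the naive difference should overstate the effect of diet, and the matched difference should sit near the true value.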

### Other Adjustment Techniques
* Propensity Score Matching: A more sophisticated version of matching that handles multiple confounders simultaneously by matching individuals on a single number, the estimated probability of exposure given the confounders.
* Regression Techniques: Include the confounder as an additional feature in regression models to control for its effect.
* Causal Inference Methods: In specific situations, techniques from causal inference aim to isolate true causal relationships even in the presence of confounders.
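
As a rough illustration of propensity score matching, the sketch below uses scikit-learn's `LogisticRegression` to estimate the score and `NearestNeighbors` to pair each exposed individual with the closest unexposed one. The data-generating process, the variable names (`age_z`, `poor_diet`, `risk`), and the assumed true effect of 1.0 are all made up; this is one simple variant (1-nearest-neighbor matching with replacement), not a definitive implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
n = 4000

# Two hypothetical confounders (made-up data-generating process).
age_z = rng.normal(size=n)                    # standardized age
family_history = rng.integers(0, 2, size=n)
confounders = np.column_stack([age_z, family_history])

# Exposure (poor diet) depends on both confounders.
logit = -0.4 + 0.5 * age_z + 0.8 * family_history
poor_diet = rng.random(n) < 1 / (1 + np.exp(-logit))

# Continuous risk score; the assumed true effect of a poor diet is 1.0.
true_effect = 1.0
risk = true_effect * poor_diet + 0.8 * age_z + 1.5 * family_history + rng.normal(size=n)

# Naive comparison of exposed vs. unexposed individuals.
naive = risk[poor_diet].mean() - risk[~poor_diet].mean()

# Step 1: estimate the propensity score P(poor_diet | confounders).
propensity = LogisticRegression().fit(confounders, poor_diet).predict_proba(confounders)[:, 1]

# Step 2: for each exposed individual, find the unexposed individual
# with the closest propensity score (matching with replacement).
controls = NearestNeighbors(n_neighbors=1).fit(propensity[~poor_diet].reshape(-1, 1))
_, match_idx = controls.kneighbors(propensity[poor_diet].reshape(-1, 1))
matched_control_risk = risk[~poor_diet][match_idx[:, 0]]

# Step 3: average outcome difference within matched pairs
# (an estimate of the effect among the exposed).
matched = (risk[poor_diet] - matched_control_risk).mean()

print(f"true effect: {true_effect:.2f}")
print(f"naive:       {naive:.2f}")
print(f"matched:     {matched:.2f}")
```

Matching on one score instead of on every confounder keeps the pairing tractable as the number of confounders grows, at the cost of relying on a well-specified propensity model.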

#### Important Considerations
* Identifying Confounders: 
This requires domain knowledge and a critical understanding of the problem and the data.
* Data Availability: 
Adjusting for confounders might necessitate collecting data on potential confounding variables.
* Complexity: 
Techniques for confounder adjustment can add complexity to your modeling process.

# Example
#### Target
heart_disease:
* 0 -> No heart disease
* 1 -> Heart disease

#### Features
* age
* diet (numerical score from 0 to 5 describing eating habits; 0 is healthiest, 5 is least healthy)
* exercise (hours of exercise per week)

#### Confounder
family_history:
* 0 -> No family history of heart disease
* 1 -> Family history of heart disease

In [49]:
# imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

In [50]:
# data generation
num_samples = 1000 # number of individuals in our dataset
confounder_strength = 0.6 # how much the confounder affects the outcome

np.random.seed(73)

# features
age = np.random.randint(25, 80, num_samples)
diet = np.random.randint(0,6, num_samples) # Diet score from 0-5 (0 is good, 5 is bad)
exercise = np.random.rand(num_samples) * 8 # Hours of exercise per week

# confounder (note: drawn independently of the features here, so in this toy data it only affects the outcome)
family_history = np.random.randint(0, 2, num_samples)

# outcome -> Formula is entirely made up, it just serves the purpose of this example (should return about 50% True)
# note: exercise is given a positive weight, so more exercise raises the simulated risk
def develop_heart_disease(age, diet, exercise, family_history):
    risk = -3 + age * 0.02 + diet * 0.3 + exercise * 0.2 + family_history * confounder_strength
    probability = 1 / (1 + np.exp(-risk))
    return np.random.rand() < probability

heart_disease = np.array([develop_heart_disease(age[i], diet[i], exercise[i], family_history[i]) for i in range(num_samples)])

data = pd.DataFrame({
    'age': age,
    'diet': diet,
    'exercise': exercise,
    'family_history': family_history,
    'heart_disease': heart_disease
})

print(data['heart_disease'].value_counts())
data.head()

heart_disease
False    551
True     449
Name: count, dtype: int64


   age  diet  exercise  family_history  heart_disease
0   47     3  2.027615               0          False
1   43     4  3.652187               0          False
2   71     3  0.679628               1          False
3   35     5  7.400895               0           True
4   41     4  3.767877               0          False


In [51]:
# prepare data for training
X = data[['age', 'diet', 'exercise', 'family_history']]
y = data['heart_disease']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=73)

# model without matching
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy_no_matching = accuracy_score(y_test, y_pred)
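# note: ROC AUC is computed from hard 0/1 predictions here; passing
# model.predict_proba(X_test)[:, 1] instead would score the ranking of predicted probabilities
roc_auc_no_matching = roc_auc_score(y_test, y_pred)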

print(f'Accuracy without matching: {accuracy_no_matching}')
print(f'ROC AUC without matching: {roc_auc_no_matching}')

coefficients_no_matching = pd.DataFrame({'feature': X.columns, 'coefficient': model.coef_[0]})
print(coefficients_no_matching)

Accuracy without matching: 0.65
ROC AUC without matching: 0.6493649364936493
          feature  coefficient
0             age     0.016519
1            diet     0.278045
2        exercise     0.228303
3  family_history     0.419373


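Accuracy and ROC AUC -> moderate performance.
Coefficients -> family_history has the largest coefficient, but the features are on different scales (age in years, exercise in hours per week, family_history as 0/1), so the raw coefficients are not directly comparable. It is also hard to say whether this reflects true causation or the confounder's influence on the features.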

In [55]:
# model with matching on the confounder: fit a separate model within each family_history group
groups = data.groupby('family_history')

for group_value, group_data in groups:
    X_group = group_data[['age', 'diet', 'exercise']]
    y_group = group_data['heart_disease']

    X_train_group, X_test_group, y_train_group, y_test_group = train_test_split(X_group, y_group, test_size=0.2, random_state=73)

    model_group = LogisticRegression()
    model_group.fit(X_train_group, y_train_group)

    y_pred_group = model_group.predict(X_test_group)

    print(f'This is group {group_value}')
    print(f'Accuracy for group {group_value}: {accuracy_score(y_test_group, y_pred_group)}')
    print(f'ROC AUC for group {group_value}: {roc_auc_score(y_test_group, y_pred_group)}')

    coefficients_group = pd.DataFrame({'feature': X_group.columns, 'coefficient': model_group.coef_[0]})
    print(coefficients_group)
    print('\n')

This is group 0
Accuracy for group 0: 0.6372549019607843
ROC AUC for group 0: 0.5951612903225806
    feature  coefficient
0       age     0.015758
1      diet     0.292451
2  exercise     0.246440


This is group 1
Accuracy for group 1: 0.6666666666666666
ROC AUC for group 1: 0.6642798690671032
    feature  coefficient
0       age     0.012553
1      diet     0.286784
2  exercise     0.198278




# Interpretation