<a href="https://colab.research.google.com/github/Jack-W-Fan/AI-Workshop-Dataset-Notebook/blob/main/AI_Workshop%20with%20Machine%20Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. AI in Medicine Machine Learning Workshop
## Predicting High Influenza Activity Using Surveillance Data

Today we will:
- Load real public health data
- Transform it into machine learning format
- Train a simple predictive model
- Evaluate performance
- Discuss clinical implications

# 2. What is Machine Learning?

Machine learning allows computers to learn patterns from data
and make predictions.

In medicine, ML can help:
- Predict disease risk
- Forecast outbreaks
- Assist diagnosis
- Support clinical decisions

In [None]:
# 3. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve

In [None]:
# 4. Load Dataset
df = pd.read_csv("data-table.csv")
df  # displays the whole table (can be slow if the table is large)
# Use df.head() to show the first 5 rows, or df.head(17) to show the first 17.

# 5. Form a Prediction
Goal: predict whether influenza activity will be high in a given week, using COVID-19 levels and time (week number) as inputs.
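
Before a model can use the data, each week needs to be a single row with one column per pathogen. The sketch below shows that long-to-wide reshaping on a tiny table. The column names match the ones used in this notebook, but the numbers are made up for illustration and are not taken from `data-table.csv`.

```python
import pandas as pd

# Made-up values for illustration only (not from data-table.csv).
toy_long = pd.DataFrame({
    "week_end": ["2024-01-06", "2024-01-06", "2024-01-13", "2024-01-13"],
    "pathogen": ["COVID-19", "Influenza", "COVID-19", "Influenza"],
    "Children": [1.2, 3.4, 1.0, 4.1],
})

# Long format: one row per (week, pathogen) pair.
# Wide format: one row per week, one column per pathogen.
toy_wide = toy_long.pivot(index="week_end", columns="pathogen", values="Children")
print(toy_wide)
```

In the wide table, the `COVID-19` column can serve as an input feature and the `Influenza` column can be turned into the target.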

In [None]:
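# 6. Pivot the Data
# Reshape from long (one row per week and pathogen) to wide (one row per week, one column per pathogen).
# Note: df.pivot raises an error if any (week_end, pathogen) pair appears more than once.
df_pivot = df.pivot(index='week_end', columns='pathogen', values='Children')
# Assumption: a missing value means no recorded activity, so fill it with 0.
df_pivot = df_pivot.fillna(0)
df_pivot.head()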

In [None]:
# 7. Reset Index + Create Week Number
df_pivot = df_pivot.reset_index()
df_pivot['week_end'] = pd.to_datetime(df_pivot['week_end'])
df_pivot['week_num'] = ((df_pivot['week_end'] - df_pivot['week_end'].min()).dt.days / 7).astype(int)

df_pivot.head()

In [None]:
# 8. Define Target Variable
# Define "high influenza" as a week whose influenza value is above the median (gives a roughly even 0/1 split)
threshold = df_pivot['Influenza'].median()
df_pivot['target'] = (df_pivot['Influenza'] > threshold).astype(int)

print("Threshold (median influenza):", threshold)
print(df_pivot['target'].value_counts())

# 9. Why Does Class Balance Matter?

If most weeks are low influenza,
a model could just predict "low" every time
and still appear accurate.

This is why accuracy alone is not enough in medicine.
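
A minimal sketch of this problem, using hypothetical labels (90 low weeks and 10 high weeks) rather than the workshop data. In this notebook the target is defined with the median, so the classes are roughly balanced by construction, but many medical datasets are not.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 90 low-influenza weeks (0) and 10 high-influenza weeks (1).
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that predicts low influenza every single week.
y_always_low = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_low)
baseline_recall = recall_score(y_true, y_always_low)  # share of high weeks caught

print("Accuracy:", baseline_accuracy)
print("Recall for high weeks:", baseline_recall)
```

The accuracy looks strong even though this model never identifies a single high-influenza week, which is what recall makes visible.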

In [None]:
# 10. Define Features and Target
X = df_pivot[['COVID-19', 'week_num']]
y = df_pivot['target']
X

In [None]:
y

In [None]:
# 11. Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

In [None]:
# 12. Check Class Counts
print("y_train class counts:")
print(y_train.value_counts())

In [None]:
# 13. Train Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
# 14. Make Predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

In [None]:
# 15. Evaluate Model
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)

print("Accuracy:", accuracy)
print("AUC:", auc)
print("Confusion Matrix:")
print(cm)
# Rows are the true classes, columns are the predicted classes:
# [[True Negative,  False Positive],
#  [False Negative, True Positive]]

# 16. Understanding Errors in Medicine

False Positives:
Predict high influenza when it isn't high.
→ Could cause unnecessary preparation or concern.

False Negatives:
Miss a high influenza week.
→ Could delay public health response.

In medicine, false negatives are often more dangerous.
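
These two error types map onto sensitivity (how many high weeks are caught) and specificity (how many low weeks are correctly left alone). The sketch below computes both from a confusion matrix, using small made-up label arrays rather than the workshop's `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = high influenza week, 0 = low.
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

sensitivity = tp / (tp + fn)  # fraction of high weeks the model caught
specificity = tn / (tn + fp)  # fraction of low weeks correctly called low

print("False positives:", fp, "False negatives:", fn)
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
```

When false negatives matter more, sensitivity is usually the number to watch first.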

In [None]:
# 17. Receiver Operating Characteristic (ROC) Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)

plt.plot(fpr, tpr)
plt.plot([0,1],[0,1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.show()

In [None]:
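# 18. Feature Importance
# Positive coefficients push predictions toward "high influenza", negative ones toward "low".
# Caution: the features are not standardized, so bar sizes depend on each feature's units
# and are not directly comparable as a measure of importance.
coefficients = pd.Series(model.coef_[0], index=X.columns)
coefficients.sort_values().plot(kind='barh')
plt.title("Logistic Regression Coefficients")
plt.xlabel("Coefficient (change in log-odds per unit)")
plt.show()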

# 19. Limitations

- Small dataset
- Retrospective data
- No external validation
- Correlation does not imply causation
- This does not replace physician judgment
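
With a small dataset, a single train/test split can give a noisy estimate of performance. One common way to see that variability is k-fold cross-validation, sketched below on synthetic data generated with `make_classification`. The data, the variable names `X_demo` and `y_demo`, and the choice of 5 folds are all assumptions for illustration, not part of the workshop dataset. Cross-validation is also not a substitute for external validation on data from a different source.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: 100 samples, 2 features (not the influenza data).
X_demo, y_demo = make_classification(
    n_samples=100,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    random_state=42,
)

# Stratified folds keep the 0/1 ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One AUC score per fold; the spread hints at how unstable a single split can be.
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=cv, scoring="roc_auc")

print("AUC per fold:", scores)
print("Mean AUC:", scores.mean(), "Std:", scores.std())
```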