<a href="https://colab.research.google.com/github/DrDavidL/learning-dhds/blob/main/Synthetic_Analysis_GI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome Data Science and AI Learners!

We'll use this notebook to:

1. Generate synthetic data with the help of AI!
2. Build a predictive model.
3. Review the model's performance measures!


# Step 1: Generate Synthetic Data

Let's pretend we're working on a project to predict gastrointestinal conditions. We'll ask the enterprise data warehouse team for data elements relevant to gut health.

In the meantime, though, we want to get started! Our research proposal is under review, so let's build a processing pipeline now and be ready when the real data arrives. The first thing we need is synthetic data that includes potential predictors for Irritable Bowel Syndrome (IBS), which will be our target variable. Here is how we can do this!

In [None]:
import numpy as np
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

n = 1000  # number of rows

# ---- Base data ----
data = {
    "Age": np.random.normal(loc=45, scale=10, size=n).astype(int),
    "Sex_at_birth": np.random.choice(["Male", "Female"], size=n),
    "BMI": np.random.normal(loc=25, scale=4, size=n).round(1),
    "Fiber_Intake": np.random.normal(loc=20, scale=8, size=n).round(1), # grams/day
    "Alcohol_Consumption": np.random.choice(["None", "Moderate", "Heavy"], size=n, p=[0.5, 0.4, 0.1]), # Categories
    "Smoking_Status": np.random.choice(["Never", "Former", "Current"], size=n, p=[0.6, 0.2, 0.2]), # Categories
    "Stress_Level": np.random.randint(1, 11, size=n), # Scale 1-10 (randint's upper bound is exclusive)
    "Exercise_Frequency": np.random.choice(["None", "Occasional", "Regular"], size=n, p=[0.3, 0.4, 0.3]), # Categories
    "Presence_of_IBS": np.random.choice([0, 1], size=n, p=[0.7, 0.3]), # Target variable (0: No IBS, 1: IBS)
    "Stool_Frequency": np.random.poisson(lam=1.5, size=n) + 1 # Stool frequency (counts per day, add 1 to avoid 0)
}

df = pd.DataFrame(data)

# ---- Nudge related features by IBS status ----
ibs_mask = df["Presence_of_IBS"] == 1
non_ibs_mask = ~ibs_mask

# Stress Level: higher for IBS (5-10; randint's upper bound is exclusive)
df.loc[ibs_mask, "Stress_Level"] = np.random.randint(5, 11, size=ibs_mask.sum())

# Stool Frequency: raise the Poisson mean for IBS, which also makes it more variable
df.loc[ibs_mask, "Stool_Frequency"] = np.random.poisson(lam=2.5, size=ibs_mask.sum()) + 1

# Fiber Intake: slightly lower for IBS
df.loc[ibs_mask, "Fiber_Intake"] = np.random.normal(loc=15, scale=6, size=ibs_mask.sum()).round(1)

# ---- Clinical bounds (clips) ----
df["Age"] = df["Age"].clip(18, 80)
df["BMI"] = df["BMI"].clip(15, 45)
df["Fiber_Intake"] = df["Fiber_Intake"].clip(5, 40)
df["Stress_Level"] = df["Stress_Level"].clip(1, 10)
df["Stool_Frequency"] = df["Stool_Frequency"].clip(1, 7)


# Peek
display(df.head(10))

Before we create a model, let's briefly explore what we have:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,5))
sns.histplot(df["Stool_Frequency"], bins=range(0, df["Stool_Frequency"].max()+2), kde=False)
plt.title("Distribution of Stool Frequency")
plt.xlabel("Stool Frequency (counts per day)")
plt.ylabel("Count of Patients")
plt.show()

In [None]:
plt.figure(figsize=(8,5))
sns.histplot(df["Age"], bins=20, kde=True)
plt.title("Age Distribution")
plt.xlabel("Age (years)")
plt.ylabel("Count of Patients")
plt.show()

In [None]:
plt.figure(figsize=(8,5))
sns.histplot(df["BMI"], bins=20, kde=True)
plt.title("BMI Distribution")
plt.xlabel("BMI")
plt.ylabel("Count of Patients")
plt.show()

In [None]:
plt.figure(figsize=(15,5))

# Age boxplot
plt.subplot(1,3,1)
sns.boxplot(y=df["Age"])
plt.title("Age Distribution")

# BMI boxplot
plt.subplot(1,3,2)
sns.boxplot(y=df["BMI"])
plt.title("BMI Distribution")

# Fiber Intake boxplot
plt.subplot(1,3,3)
sns.boxplot(y=df["Fiber_Intake"])
plt.title("Fiber Intake Distribution")

plt.tight_layout()
plt.show()

In [None]:
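# Keep two decimals: casting the means to int would truncate them and hide the difference
mean_stress = df.groupby("Presence_of_IBS")["Stress_Level"].mean().round(2)
print(mean_stress)

# Barplot for clarity
plt.figure(figsize=(6,5))
# errorbar=None replaces the deprecated ci=None (assumes seaborn 0.12 or newer)
sns.barplot(x="Presence_of_IBS", y="Stress_Level", data=df, errorbar=None)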
plt.xticks([0,1], ["No IBS", "IBS"])
plt.title("Mean Stress Level by IBS Status")
plt.ylabel("Mean Stress Level (1-10)")
plt.show()

# 🔹 Step 1: Guidance on Algorithm Selection

When deciding on a machine learning algorithm for your dataset, think about three key things:

### 1. **Type of Outcome (Target Variable)**

* **Binary Classification** (Yes/No, 0/1): e.g., *Presence_of_DM*, *Presence_of_HTN*.
* **Multiclass Classification**: e.g., cancer staging, disease type.
* **Regression** (continuous numbers): e.g., predicting *Glucose*, *BMI*, or *ER_visits*.
* **Unsupervised Learning** (no target variable): e.g., finding patterns or groups in data.

### 2. **Size and Shape of Data**

* Small dataset (hundreds of rows) → simpler algorithms (logistic regression, decision tree) often perform well.
* Larger dataset (thousands to millions of rows) → more complex algorithms (random forest, XGBoost, neural networks).

### 3. **Interpretability vs. Accuracy**

* If you need **interpretability** (clear reasoning, coefficients, feature importance):
  
  * Logistic Regression
  * Decision Tree
* If you want **higher predictive performance** (even if harder to interpret):
  
  * Random Forest
  * Gradient Boosting (XGBoost, LightGBM)

* * *

### Algorithm Options by Task Type

| Task Type                | Common Algorithms                                  | Notes                                                                 |
| :----------------------- | :------------------------------------------------- | :-------------------------------------------------------------------- |
| **Binary Classification**| Logistic Regression, SVM, Decision Tree, Random Forest, Gradient Boosting, Naive Bayes, K-Nearest Neighbors | Good for predicting one of two outcomes.                             |
| **Multiclass Classification** | Logistic Regression (multinomial), SVM, Decision Tree, Random Forest, Gradient Boosting, Naive Bayes, K-Nearest Neighbors | Used when there are more than two discrete outcomes.                 |
| **Regression**           | Linear Regression, Ridge, Lasso, Elastic Net, Decision Tree Regressor, Random Forest Regressor, Gradient Boosting Regressor, Support Vector Regression | For predicting continuous numerical values.                          |
| **Unsupervised Learning**| K-Means Clustering, Hierarchical Clustering, DBSCAN, PCA, t-SNE, Association Rules (Apriori) | Used for finding hidden patterns, groups, or reducing dimensionality without a specific target variable. |
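
The first question above (what kind of outcome are we predicting?) can be turned into a quick check. The sketch below is illustrative only: `suggest_task_type` and its `max_classes` cutoff are hypothetical names made up for this example, not part of any library, and the cutoff of 10 is an arbitrary assumption.

```python
from numbers import Real


def suggest_task_type(target_values, max_classes=10):
    """Guess the task type from a list of target values (None means no target)."""
    if target_values is None:
        return "unsupervised learning"
    distinct = set(target_values)
    if len(distinct) == 2:
        return "binary classification"
    all_numeric = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in distinct
    )
    # Assumption: numeric targets with many distinct values are treated as continuous.
    if all_numeric and len(distinct) > max_classes:
        return "regression"
    return "multiclass classification"


# A 0/1 outcome such as Presence_of_IBS
print(suggest_task_type([0, 1, 1, 0, 0]))
# A staged outcome with a handful of labels
print(suggest_task_type(["I", "II", "III", "IV", "II"]))
# A continuous measurement with many distinct values
print(suggest_task_type([x / 10 for x in range(700, 1400, 7)]))
# No target at all
print(suggest_task_type(None))
```

A rule of thumb like this is only a starting point; the clinical question still decides whether a numeric code should be treated as a category or as a number.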

* * *

# 🔹 Step 2: Recommendation for Your Dataset

Your dataset has **1000 rows** with **mixed features** (numeric, categorical).

* **Target variable:** `"Presence_of_IBS"` (binary classification).
* **Goal:** Predict IBS status from the provided features.

👉 Best first choice: **Logistic Regression** (simple, interpretable baseline).
👉 Next step: Compare with **Random Forest** (handles nonlinear interactions, often more accurate).



In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Features and target
X = df.drop(columns=["Presence_of_IBS"])  # predictors
y = df["Presence_of_IBS"]                 # target

# Identify categorical and numerical features
categorical_features = ['Sex_at_birth', 'Alcohol_Consumption', 'Smoking_Status', 'Exercise_Frequency']
numerical_features = ['Age', 'BMI', 'Fiber_Intake', 'Stress_Level', 'Stool_Frequency']

# Create preprocessing pipelines for numerical and categorical features
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Create a column transformer to apply different transformations to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply preprocessing to the training and testing data
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Get the feature names after one-hot encoding
feature_names = numerical_features + list(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features))

# Convert processed data back to DataFrames with feature names (optional, but helpful for interpretation)
X_train_scaled = pd.DataFrame(X_train_processed, columns=feature_names)
X_test_scaled = pd.DataFrame(X_test_processed, columns=feature_names)
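
The cell above imports `Pipeline` but applies the preprocessor by hand. As an alternative, the preprocessing and the model can be bundled into one object, which keeps the "fit on training data only" rule automatic. The sketch below is a minimal illustration on a small stand-in dataset; the `demo_*` names and the toy target are assumptions made for this example, not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small stand-in dataset (assumption: two columns are enough to show the idea).
rng = np.random.default_rng(0)
n_demo = 200
demo = pd.DataFrame({
    "Stress_Level": rng.integers(1, 11, size=n_demo),  # upper bound exclusive, so 1-10
    "Smoking_Status": rng.choice(["Never", "Former", "Current"], size=n_demo),
})
# Toy target: higher stress makes a positive label more likely.
demo_y = (demo["Stress_Level"] + rng.normal(0, 2, size=n_demo) > 6).astype(int)

demo_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Stress_Level"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Smoking_Status"]),
])

# One object that scales, encodes, and then fits the classifier.
demo_pipeline = Pipeline(steps=[
    ("preprocess", demo_preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, demo_y, test_size=0.2, random_state=42, stratify=demo_y
)
demo_pipeline.fit(Xd_train, yd_train)  # preprocessing is fit on training rows only
demo_accuracy = demo_pipeline.score(Xd_test, yd_test)
print(f"Held-out accuracy: {demo_accuracy:.2f}")
```

With this design, `demo_pipeline.predict` accepts raw rows directly, so there is no separate transform step to forget when new data arrives.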

Now our data is ready to test! Let's try Logistic Regression first.

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

log_reg = LogisticRegression(max_iter=1000, class_weight="balanced")

log_reg.fit(X_train_scaled, y_train)

y_pred_lr = log_reg.predict(X_test_scaled)

print("=== Logistic Regression Results ===")
print(classification_report(y_test, y_pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))


cm = confusion_matrix(y_test, y_pred_lr)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["No IBS", "IBS"])
disp.plot(cmap="Blues")
plt.title("Confusion Matrix - Logistic Regression")
plt.show()
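
The `class_weight="balanced"` setting used above asks scikit-learn to weight each class inversely to its frequency, using `n_samples / (n_classes * count_of_class)`. The sketch below works through that arithmetic by hand. The 700/300 split is an illustrative assumption that mirrors the 70/30 probabilities used to generate the data; the actual training counts will differ somewhat.

```python
from collections import Counter

# Illustrative labels: 700 without IBS (0) and 300 with IBS (1).
labels = [0] * 700 + [1] * 300

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Same formula scikit-learn documents for class_weight="balanced".
balanced_weights = {
    cls: n_samples / (n_classes * count) for cls, count in counts.items()
}

for cls, weight in sorted(balanced_weights.items()):
    print(f"class {cls}: weight {weight:.3f}, total weight {weight * counts[cls]:.0f}")
```

After weighting, both classes carry the same total weight, so the model is not rewarded for simply predicting the majority class.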

### 📊 Classification Report Terms

* **Precision (Positive Predictive Value)**
  *Of all the patients the model predicted as having IBS, how many actually had IBS?*
  Formula: **TP / (TP + FP)**
  → High precision = few false positives.

* **Recall (Sensitivity, True Positive Rate)**
  *Of all the patients who actually had IBS, how many did the model correctly identify?*
  Formula: **TP / (TP + FN)**
  → High recall = few false negatives.

* **F1-score**
  *The harmonic mean of precision and recall.*
  Formula: **2 × (Precision × Recall) / (Precision + Recall)**
  → Useful when you want a balance between precision and recall.

* **Support**
  *The number of true cases for each class in the test dataset.*
  → Example: If 29 people really had IBS, support for class “1” is 29.

* **Accuracy**
  *The fraction of all predictions that were correct.*
  Formula: **(TP + TN) / Total**
  → Can be misleading with imbalanced classes.

* **Macro avg**
  *Average of precision, recall, and F1 across all classes, treating each class equally (unweighted).*
  → Good for seeing performance on minority classes.

* **Weighted avg**
  *Average of precision, recall, and F1 across classes, weighted by the number of samples in each class (support).*
  → Dominated by majority class performance if classes are imbalanced.
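
To make the formulas above concrete, here is a worked example in plain Python. The confusion-matrix counts are hypothetical numbers chosen for illustration, not results from the models in this notebook, and `precision_recall_f1` is a helper written just for this sketch.

```python
# Hypothetical confusion-matrix counts for a 200-patient test set (not real results).
tn, fp, fn, tp = 120, 20, 15, 45


def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class from its counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class 1 (IBS): positives are the IBS cases.
p1, r1, f1_ibs = precision_recall_f1(tp, fp, fn)
# Class 0 (No IBS): roles swap, so the "true positives" are the true negatives.
p0, r0, f1_no_ibs = precision_recall_f1(tn, fn, fp)

support_0, support_1 = tn + fp, tp + fn  # true cases per class
accuracy = (tp + tn) / (tn + fp + fn + tp)
macro_f1 = (f1_no_ibs + f1_ibs) / 2  # each class counts equally
weighted_f1 = (support_0 * f1_no_ibs + support_1 * f1_ibs) / (support_0 + support_1)

print(f"No IBS: precision={p0:.2f} recall={r0:.2f} f1={f1_no_ibs:.2f} support={support_0}")
print(f"IBS:    precision={p1:.2f} recall={r1:.2f} f1={f1_ibs:.2f} support={support_1}")
print(f"accuracy={accuracy:.3f} macro F1={macro_f1:.3f} weighted F1={weighted_f1:.3f}")
```

In this example the weighted F1 is higher than the macro F1 because the larger "No IBS" class also has the better score, which is exactly the majority-class effect described above.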

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Get predicted probabilities
y_prob = log_reg.predict_proba(X_test_scaled)[:,1]

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

plt.figure(figsize=(6,6))
plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {auc:.2f})")
plt.plot([0,1],[0,1],'k--')  # diagonal line
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.title("ROC Curve - Logistic Regression")
plt.legend()
plt.show()
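
Each point on the ROC curve above comes from choosing one probability threshold and counting how many true and false positives it produces. The sketch below computes a few such points by hand in plain Python. The labels and scores are hypothetical values for eight patients, and `roc_point` is a helper written only for this illustration (scikit-learn's `roc_curve` does this for every relevant threshold).

```python
# Hypothetical true labels and predicted probabilities for eight patients.
y_true_demo = [0, 0, 1, 0, 1, 1, 0, 1]
y_score_demo = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]


def roc_point(y_true, y_score, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives


# A very high threshold flags nobody; a threshold of 0 flags everybody.
for threshold in (1.01, 0.5, 0.0):
    fpr_demo, tpr_demo = roc_point(y_true_demo, y_score_demo, threshold)
    print(f"threshold={threshold:.2f}  FPR={fpr_demo:.2f}  TPR={tpr_demo:.2f}")
```

Lowering the threshold moves the point up and to the right: more true cases are caught, at the cost of more false alarms.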

Let's repeat so you have code for the Random Forest! (Other notebooks have additional algorithms!)

In [None]:
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train_scaled, y_train)

y_pred_rf = rf.predict(X_test_scaled)

print("=== Random Forest Results ===")
print(classification_report(y_test, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))

# Plotting libraries (already imported above; repeated so this cell can run on its own)
import matplotlib.pyplot as plt
import seaborn as sns

# Confusion matrix plot
cm = confusion_matrix(y_test, y_pred_rf)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["No IBS", "IBS"])
disp.plot(cmap="Blues")
plt.title("Confusion Matrix - Random Forest")
plt.show()

# Feature importance plot (sorted once, from most to least important)
feat_importances = pd.Series(rf.feature_importances_, index=X_train_scaled.columns).sort_values(ascending=False)
plt.figure(figsize=(10,5))
sns.barplot(x=feat_importances, y=feat_importances.index)
plt.title("Feature Importance (Random Forest)")
plt.show()

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Predict probabilities for the positive class (IBS = 1)
y_prob_rf = rf.predict_proba(X_test_scaled)[:, 1]

# Compute ROC curve values
fpr, tpr, thresholds = roc_curve(y_test, y_prob_rf)
auc = roc_auc_score(y_test, y_prob_rf)

# Plot ROC curve
plt.figure(figsize=(6,6))
plt.plot(fpr, tpr, color="darkorange", lw=2, label=f"Random Forest (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")  # diagonal = no skill
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.title("ROC Curve - Random Forest")
plt.legend(loc="lower right")
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()

## Save preprocessing and model files

To use the models outside this notebook, for example in a Streamlit app where users enter new values and see predictions, we need to save the fitted `preprocessor` along with the trained `log_reg` and `rf` models. The `joblib` library is well suited to this, so each object is written to its own `.joblib` file. An app can then load the preprocessing file together with either the logistic regression or the random forest model file.



In [None]:
import joblib

# Save the preprocessor
joblib.dump(preprocessor, 'preprocessor.joblib')

# Save the Logistic Regression model
joblib.dump(log_reg, 'logistic_regression_model.joblib')

# Save the Random Forest model
joblib.dump(rf, 'random_forest_model.joblib')
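
Saving is only half of the round trip: whatever consumes these files later has to load them with `joblib.load`. The sketch below shows that round trip on a tiny stand-in model rather than on the notebook's objects. The data, the `demo_model` name, and the `demo_model.joblib` file name are assumptions made for this illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model (assumption: any fitted estimator round-trips the same way).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(demo_model, model_path)
    reloaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    demo_model.predict(X_demo), reloaded_model.predict(X_demo)
)
print("Reloaded model matches original:", same_predictions)
```

For the notebook's own files, new rows would first need to pass through the loaded preprocessor's `transform` method before being given to either model. Because `.joblib` files are pickle-based, they should only be loaded from sources you trust.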

# Congratulations! We have a model!

Two models, even: a random forest and the original logistic regression.

But, what can we do with them?

Well, what if we created an online app so users could enter values and see the predictions? This is where these models really come to life!

Click [here](https://fsm-ibs-predict.streamlit.app/) to open the web app. It runs on free hosting, so watch the screen first in case it needs a moment to start.