# **Workshop: Tracing Disease Spread Using Genetic and Epidemiological Data**

## **Overview**
In this workshop, students will play the role of molecular epidemiologists investigating an outbreak in a high school. They will use epidemiological data and genetic similarities (pairwise genetic distances) to infer how the disease spread.

## **Learning Objectives**
- Understand how epidemiologists use case information and genetic data to track disease transmission.
- Learn the concept of genetic distances and how they help determine relatedness between infections.
- Develop critical thinking skills by reconstructing a transmission chain.

## **Activity Structure**
1. **Introduction (15 minutes)**
   - Briefly explain how diseases spread and the role of molecular epidemiology.
   - Introduce the concept of genetic distance: the number of mutations separating two pathogen samples, which tends to grow as the pathogen passes from person to person.
   - Explain how to interpret a genetic distance matrix.

2. **Solving the Outbreak (40 minutes)**
   - Present the **Case Information Table** and **Genetic Distance Matrix**.
   - Students work in small groups to analyze the data and infer transmission chains.
   - Encourage discussion on uncertainties and alternate explanations.

3. **Discussion and Wrap-Up (20 minutes)**
   - Groups present their findings.
   - Discuss the real-world challenges of outbreak investigations.
   - Highlight key takeaways about combining genetic and epidemiological data.

---

# **Scenario: Outbreak in a High School**

An infectious disease has spread in a high school. Molecular epidemiologists have collected samples from infected students and compared the pathogen genomes in those samples. Your task is to determine who likely infected whom using the available data.


## **Case Information Table**

| Case ID | Symptom Onset Date | Known Contacts | Recent Travel | Sample ID |
|---------|--------------------|---------------|--------------|-----------|
| Alice   | March 2           | Bob, Carol    | No           | S1        |
| Bob     | March 4           | Alice, David  | No           | S2        |
| Carol   | March 6           | Alice, Eve    | No           | S3        |
| David   | March 7           | Bob, Eve      | Yes (London) | S4        |
| Eve     | March 9           | Carol, David  | No           | S5        |

## **Genetic Distance Matrix**

|       | S1 | S2 | S3 | S4 | S5 |
|-------|----|----|----|----|----|
| **S1** | 0  | 1  | 2  | 3  | 3  |
| **S2** | 1  | 0  | 1  | 2  | 3  |
| **S3** | 2  | 1  | 0  | 1  | 2  |
| **S4** | 3  | 2  | 1  | 0  | 1  |
| **S5** | 3  | 3  | 2  | 1  | 0  |
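
To make the matrix easier to query, it can be written out as a small lookup structure. The sketch below is illustrative only: the helper name `closest_samples` is hypothetical and not part of the workshop materials. For each sample it lists the other samples at the smallest genetic distance.

```python
# Genetic distance matrix from the table above (row/column order S1..S5).
samples = ["S1", "S2", "S3", "S4", "S5"]
matrix = [
    [0, 1, 2, 3, 3],
    [1, 0, 1, 2, 3],
    [2, 1, 0, 1, 2],
    [3, 2, 1, 0, 1],
    [3, 3, 2, 1, 0],
]


def closest_samples(sample):
    """Return (minimum distance, samples at that distance) for one sample.

    Hypothetical helper for illustration.
    """
    row = matrix[samples.index(sample)]
    # Drop the zero on the diagonal (a sample compared with itself).
    others = {s: d for s, d in zip(samples, row) if s != sample}
    best = min(others.values())
    return best, sorted(s for s, d in others.items() if d == best)


for s in samples:
    distance, nearest = closest_samples(s)
    print(f"{s}: nearest = {nearest} (distance {distance})")
```

A small distance suggests that two infections are closely related, but it does not by itself say who infected whom; the direction has to come from the epidemiological data.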

---

## **Student Task**
Using the case information and genetic distance matrix:
1. Identify the most likely **index case** (first infected person).
2. Determine who infected whom.
3. Explain your reasoning using both epidemiological clues (e.g., symptom onset dates, known contacts) and genetic distances.
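
One way to combine the two data sources is a simple heuristic: treat each case's known contacts with an earlier symptom onset as candidate infectors, then prefer the candidate whose sample is genetically closest. The sketch below applies that rule to the tables above. It rests on strong assumptions (symptom onset order reflects infection order, every case was sampled, the contact lists are complete), and the function name `likely_infector` is hypothetical. It yields one plausible reconstruction, not a definitive answer.

```python
# Case data from the Case Information Table (onset = day of March).
cases = {
    "Alice": {"onset": 2, "contacts": ["Bob", "Carol"], "sample": "S1"},
    "Bob": {"onset": 4, "contacts": ["Alice", "David"], "sample": "S2"},
    "Carol": {"onset": 6, "contacts": ["Alice", "Eve"], "sample": "S3"},
    "David": {"onset": 7, "contacts": ["Bob", "Eve"], "sample": "S4"},
    "Eve": {"onset": 9, "contacts": ["Carol", "David"], "sample": "S5"},
}

# Upper triangle of the Genetic Distance Matrix.
distance = {
    ("S1", "S2"): 1, ("S1", "S3"): 2, ("S1", "S4"): 3, ("S1", "S5"): 3,
    ("S2", "S3"): 1, ("S2", "S4"): 2, ("S2", "S5"): 3,
    ("S3", "S4"): 1, ("S3", "S5"): 2,
    ("S4", "S5"): 1,
}


def genetic_distance(a, b):
    """Look up a symmetric distance regardless of argument order."""
    if a == b:
        return 0
    return distance[tuple(sorted((a, b)))]


def likely_infector(name):
    """Hypothetical heuristic: earlier-onset contact with the closest sample."""
    case = cases[name]
    candidates = [c for c in case["contacts"]
                  if cases[c]["onset"] < case["onset"]]
    if not candidates:
        return None  # no earlier contact: a possible index case
    return min(
        candidates,
        key=lambda c: (genetic_distance(case["sample"], cases[c]["sample"]),
                       cases[c]["onset"]),
    )


chain = {name: likely_infector(name) for name in cases}
for name, source in chain.items():
    print(f"{name} <- {source if source else 'no earlier contact (index case?)'}")
```

The heuristic ignores two details that are worth raising in discussion: David reports recent travel to London, and samples S3 and S4 are only one mutation apart even though Carol and David are not listed as contacts of each other.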

**Bonus Discussion Questions:**
- What are some challenges of using only genetic distances to infer transmission?
- How might missing cases affect our conclusions?
- What additional data would help confirm the outbreak source?

---

## **Materials Needed**
- Printed handouts with case information and genetic distances.
- Whiteboard or flipchart for group discussions.
- (Optional) Colored markers or sticky notes for mapping transmission chains.

## **Wrap-Up**
Conclude by discussing how real-world outbreak investigations work and how molecular epidemiology informs public health responses.

---


In [59]:
import random
import sciris as sc
import numpy as np
import pandas as pd
import networkx as nx
from rich.jupyter import display
from scipy.spatial.distance import pdist, squareform
from datetime import timedelta
from Bio import SeqIO
from skbio import DistanceMatrix
from skbio.tree import nj
import subprocess
import os
from ete3 import Tree
import tempfile
from scipy.stats import gamma
from collections import deque

In [6]:
simulations = sc.load("outbreaks.obj")

In [7]:
base_sim = simulations[0]
some_noise = simulations[1]

In [10]:
base_sim.result.keys()

['linearSeqSim',
 'poissonLinSeqSim',
 'ref_seq',
 'generation_times',
 'pairwise_data',
 'linearPhastSeqSim']

In [33]:
base_sim.tree.nodes[0]

{'sampled': 1,
 'exp_date': 19,
 'date_infectious': 22.20424,
 'date_symptom_onset': 24.505036,
 'sample_date': 24.505036,
 'seed': True}

In [66]:
dict(base_sim.tree.nodes)

{0: {'sampled': 1,
  'exp_date': 19,
  'date_infectious': 22.20424,
  'date_symptom_onset': 24.505036,
  'sample_date': 24.505036,
  'seed': True},
 1: {'sampled': 1,
  'exp_date': 2,
  'date_infectious': 5.2042399999999995,
  'date_symptom_onset': 7.5050360000000005,
  'sample_date': 7.5050360000000005,
  'seed': True},
 2: {'sampled': 1,
  'exp_date': 5,
  'date_infectious': 8.204239999999999,
  'date_symptom_onset': 10.505035999999999,
  'sample_date': 10.505035999999999,
  'seed': True},
 3: {'sampled': 1,
  'exp_date': 28,
  'date_infectious': 31.20424,
  'date_symptom_onset': 33.505036,
  'sample_date': 33.505036,
  'seed': True},
 4: {'sampled': 1,
  'exp_date': 7,
  'date_infectious': 10.204239999999999,
  'date_symptom_onset': 12.505035999999999,
  'sample_date': 12.505035999999999,
  'seed': True},
 5: {'sampled': 1,
  'exp_date': 22.20424,
  'date_infectious': 25.408479999999997,
  'date_symptom_onset': 27.709276,
  'sample_date': 27.709276},
 6: {'sampled': 1,
  'exp_date':

In [131]:
cases = pd.DataFrame(dict(base_sim.tree.nodes)).T
cases = cases[["exp_date", "date_symptom_onset"]].astype(int)

cases.reset_index(inplace=True, names=["Case ID"])
cases.rename(columns={"exp_date": "Exposure Date", "date_symptom_onset": "Symptom Onset Date"},
             inplace=True)

cases.sort_values("Exposure Date", inplace=True)

cases["Sample ID"] = cases["Case ID"]
cases["Case ID"] = cases["Case ID"].apply(lambda x: f"C{x}")
cases["Sample ID"] = cases["Sample ID"].apply(lambda x: f"S{x}")
cases["Exposure Date"] = cases["Exposure Date"].apply(lambda x: f"Day {x}")
cases["Symptom Onset Date"] = cases["Symptom Onset Date"].apply(lambda x: f"Day {x}")

In [132]:
travel_locations = ["London", "Paris", "New York", "Madrid", "Rome", "Dubai", "Berlin", "Tokyo"]
cases["Recent Travel"] = "No"
travel_cases = random.sample(range(0, 15), random.randint(5, 8))  # Pick 5-8 of the cases with index labels 0-14 to flag as recent travellers
cases.loc[travel_cases, "Recent Travel"] = [f"Yes ({random.choice(travel_locations)})" for _ in travel_cases]

cases.head(10)

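|    | Case ID | Exposure Date | Symptom Onset Date | Sample ID | Recent Travel |
|----|---------|---------------|--------------------|-----------|---------------|
| 1  | C1      | Day 2         | Day 7              | S1        | No            |
| 2  | C2      | Day 5         | Day 10             | S2        | No            |
| 4  | C4      | Day 7         | Day 12             | S4        | Yes (Tokyo)   |
| 0  | C0      | Day 19        | Day 24             | S0        | No            |
| 13 | C13     | Day 22        | Day 27             | S13       | Yes (Berlin)  |
| 12 | C12     | Day 22        | Day 27             | S12       | Yes (Berlin)  |
| 11 | C11     | Day 22        | Day 27             | S11       | Yes (Dubai)   |
| 10 | C10     | Day 22        | Day 27             | S10       | No            |
| 9  | C9      | Day 22        | Day 27             | S9        | Yes (Madrid)  |
| 8  | C8      | Day 22        | Day 27             | S8        | Yes (Rome)    |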


In [138]:
def get_distance_matrix(sequences):
    keys = list(sequences.keys())
    matrix = np.array(list(sequences.values()))

    # Convert nucleotide sequences to numerical values for Hamming distance calculation
    mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    seq_matrix_num = np.vectorize(mapping.get)(matrix)
    distances = pdist(seq_matrix_num, metric='hamming')  # Proportion of differing sites
    return squareform(distances), keys, matrix

hamming_distances, ids, seq_matrix = get_distance_matrix(base_sim.result.linearSeqSim)
# Convert proportions to counts of differing positions; round before casting,
# because a plain astype(int) truncates values such as 2.9999999 down to 2.
hamming_matrix = np.rint(hamming_distances * seq_matrix.shape[1]).astype(int)
ids = [f"S{x}" for x in ids]
pairwise_distances = pd.DataFrame(hamming_matrix, index=ids, columns=ids)

pairwise_distances.columns.name = "Sample ID"
pairwise_distances.index.name = "Sample ID"
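
As a cross-check on the cell above, differing sites can also be counted directly, which avoids the proportion-then-rescale step. This is a sketch of an alternative approach, not the notebook's method: the toy sequences and the name `hamming_counts` are hypothetical, and it assumes each sequence is an equal-length string.

```python
import numpy as np

# Hypothetical toy sequences, for illustration only (not simulation output).
toy_sequences = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "ACCTACGA"}


def hamming_counts(sequences):
    """Return sample keys and a matrix of differing-site counts."""
    keys = list(sequences)
    # One row per sequence, one column per site, encoded as integer code points.
    sites = np.array([[ord(base) for base in seq] for seq in sequences.values()])
    # Compare every pair of rows site by site and count the mismatches.
    counts = (sites[:, None, :] != sites[None, :, :]).sum(axis=2)
    return keys, counts


toy_keys, toy_counts = hamming_counts(toy_sequences)
print(toy_keys)
print(toy_counts)
```

Because the counts are integers from the start, no rounding is needed. The pairwise comparison uses memory proportional to (number of samples)² × (sequence length), so it suits small teaching data sets rather than large alignments.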

In [144]:
mask = np.triu(np.ones(pairwise_distances.shape), k=1)

# Apply the mask to set the upper triangle to NaN
pairwise_distances_masked = pairwise_distances.mask(mask == 1)
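
Masking introduces NaN, which turns the integer counts into floats (for example `1.0`). The sketch below shows the same masking step on a small hypothetical matrix and, as an optional extra that is not in the original notebook, casts the result to pandas' nullable `Int64` dtype so the handout shows whole numbers.

```python
import numpy as np
import pandas as pd

# Small hypothetical distance matrix, for illustration only.
toy_ids = ["S0", "S1", "S2"]
toy = pd.DataFrame([[0, 1, 2], [1, 0, 1], [2, 1, 0]],
                   index=toy_ids, columns=toy_ids)

# True strictly above the diagonal, False elsewhere.
upper = np.triu(np.ones(toy.shape, dtype=bool), k=1)

# Masked cells become missing values; Int64 keeps the remaining counts integral.
lower_only = toy.mask(upper).astype("Int64")
print(lower_only)
```

Showing only the lower triangle removes the redundant half of a symmetric matrix, which makes the printed handout easier to read.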