# Python Sets: Comparing Groups of Data

Sets are a powerful Python data structure for working with unique items and comparing groups. They're especially useful in biology for finding overlaps, differences, and unique elements.

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand the difference between sets and lists
2. Create and manipulate sets
3. Use set operations (union, intersection, difference)
4. Apply sets to real biology problems
5. Visualize set relationships with Venn diagrams

---

In [None]:
# Import libraries (matplotlib_venn comes from the separate matplotlib-venn package)
import matplotlib.pyplot as plt
from matplotlib_venn import venn2, venn3

print('Libraries loaded!')

---

## Part 1: Lists vs Sets

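### What's the Difference?

A list keeps every item in the order you added it, duplicates included. A set keeps each item only once and has no guaranteed order.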

### Example 1: Lists Allow Duplicates

In [None]:
# List of genes (can have duplicates)
gene_list = ['TP53', 'BRCA1', 'TP53', 'MYC', 'BRCA1', 'ATR']

print('Gene list:')
print(gene_list)
print(f'Length: {len(gene_list)}')
print('\n→ Lists keep duplicates: TP53 and BRCA1 appear twice')

### Example 2: Sets Remove Duplicates Automatically

In [None]:
# Convert list to set
gene_set = set(gene_list)

print('Gene set:')
print(gene_set)
print(f'Length: {len(gene_set)}')
print('\n→ Sets keep only unique items: each gene appears once')
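
Converting to a set removes duplicates but also discards the original order. If the order of first appearance matters, one common alternative (not a set operation, just a related idiom) is `dict.fromkeys`, sketched here on the same example list:

```python
# Same example list as above, repeated so this cell runs on its own
gene_list = ['TP53', 'BRCA1', 'TP53', 'MYC', 'BRCA1', 'ATR']

# Dictionary keys are unique and keep insertion order (Python 3.7+),
# so this drops duplicates without reordering the genes
unique_in_order = list(dict.fromkeys(gene_list))

print(unique_in_order)  # ['TP53', 'BRCA1', 'MYC', 'ATR']
```

Use `set(gene_list)` when you only care about which genes are present, and this idiom when you also need to keep their order.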

### Example 3: Key Differences

In [None]:
print('=== Lists vs Sets ===')
print('\nLists:')
print('  - Created with: [1, 2, 3]')
print('  - Allow duplicates: YES')
print('  - Ordered: YES (keeps insertion order)')
print('  - Indexed: YES (can use list[0])')
print('  - Use when: Order matters, need duplicates')

print('\nSets:')
print('  - Created with: {1, 2, 3} or set([1, 2, 3])')
print('  - Allow duplicates: NO (automatic removal)')
print('  - Ordered: NO (order not guaranteed)')
print('  - Indexed: NO (cannot use set[0])')
print('  - Use when: Need unique items, comparing groups')
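
To make the "Indexed: NO" point concrete, here is a small sketch of what happens when you try to index a set, and how `sorted()` gives you an indexable, predictably ordered list when you need one. The variable names are only for illustration.

```python
gene_set = {'TP53', 'BRCA1', 'MYC', 'ATR'}

# Sets have no positions, so indexing raises a TypeError
index_failed = False
try:
    gene_set[0]
except TypeError as err:
    index_failed = True
    print('Cannot index a set:', err)

# sorted() returns a list, which is ordered and can be indexed
sorted_genes = sorted(gene_set)
print(sorted_genes)     # ['ATR', 'BRCA1', 'MYC', 'TP53']
print(sorted_genes[0])  # ATR
```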

### 📝 Practice Question 1

**Task:** You have a list of species observed in a habitat survey (some seen multiple times):

```python
species_list = ['Fox', 'Rabbit', 'Deer', 'Fox', 'Rabbit', 'Owl', 'Fox', 'Deer']
```

Convert to a set to find how many **unique** species were observed.

In [None]:
# YOUR CODE HERE
# Convert to set and count unique species


---

## Part 2: Creating Sets

Python offers several ways to create a set.

### Example 4: Different Ways to Create Sets

In [None]:
# Method 1: Curly braces
genes1 = {'TP53', 'BRCA1', 'MYC'}
print('Method 1 (curly braces):', genes1)

# Method 2: set() function with list
genes2 = set(['TP53', 'BRCA1', 'MYC'])
print('Method 2 (set function):', genes2)

# Method 3: Empty set (must use set(), not {})
empty_set = set()
print('Method 3 (empty set):', empty_set)

# WARNING: {} creates an empty dictionary, not a set!
not_a_set = {}
print('\nWarning: {} is a:', type(not_a_set))

### Example 5: Adding Items to Sets

In [None]:
# Start with empty set
genes = set()
print('Initial:', genes)

# Add single item
genes.add('TP53')
print('After add("TP53"):', genes)

# Add duplicate (no effect)
genes.add('TP53')
print('After add("TP53") again:', genes)

# Add multiple items
genes.update(['BRCA1', 'MYC', 'ATR'])
print('After update():', genes)

print('\n→ Duplicates are automatically ignored')
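
One pitfall worth seeing once: both `set()` and `update()` loop over whatever you pass them, and a bare string is treated as a sequence of characters. The sketch below shows the surprise and two ways to avoid it.

```python
# A bare string is split into its characters
chars = set('TP53')
print(sorted(chars))   # ['3', '5', 'P', 'T']

# The same happens with update()
genes = {'TP53'}
genes.update('ATR')    # adds 'A', 'T' and 'R' as separate items
print(sorted(genes))   # ['A', 'R', 'T', 'TP53']

# Fix: wrap the name in a list, or use add() for a single item
fixed = {'TP53'}
fixed.update(['ATR'])
fixed.add('MYC')
print(sorted(fixed))   # ['ATR', 'MYC', 'TP53']
```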

### 📝 Practice Question 2

**Task:** Create a set of metabolites and add items to it:

1. Start with an empty set
2. Add 'glucose', 'ATP', 'pyruvate'
3. Try adding 'glucose' again
4. Print the final set and its length

In [None]:
# YOUR CODE HERE
# Create and populate metabolite set


---

## Part 3: Set Operations

The real power of sets: comparing groups!

### Example 6: Union (All Items from Both Sets)

In [None]:
# Genes upregulated in two experiments
experiment_1 = {'TP53', 'BRCA1', 'MYC', 'ATR'}
experiment_2 = {'BRCA1', 'PTEN', 'MDM2', 'ATR'}

print('Experiment 1:', experiment_1)
print('Experiment 2:', experiment_2)

# Union: all genes from either experiment
all_genes = experiment_1.union(experiment_2)
# Or: all_genes = experiment_1 | experiment_2

print('\nUnion (all genes):', all_genes)
print(f'Total unique genes: {len(all_genes)}')
print('\n→ Union includes everything from both sets')

### Example 7: Intersection (Common Items)

In [None]:
# What genes are upregulated in BOTH experiments?
common_genes = experiment_1.intersection(experiment_2)
# Or: common_genes = experiment_1 & experiment_2

print('Experiment 1:', experiment_1)
print('Experiment 2:', experiment_2)
print('\nIntersection (common):', common_genes)
print(f'Number of shared genes: {len(common_genes)}')
print('\n→ Intersection shows overlap between sets')

### Example 8: Difference (Items in First but Not Second)

In [None]:
# What genes are unique to experiment 1?
unique_to_exp1 = experiment_1.difference(experiment_2)
# Or: unique_to_exp1 = experiment_1 - experiment_2

print('Experiment 1:', experiment_1)
print('Experiment 2:', experiment_2)
print('\nUnique to Experiment 1:', unique_to_exp1)

# What genes are unique to experiment 2?
unique_to_exp2 = experiment_2.difference(experiment_1)
print('Unique to Experiment 2:', unique_to_exp2)

print('\n→ Difference shows what is in one set but not the other')

### Example 9: Symmetric Difference (Unique to Each, Not Shared)

In [None]:
# Genes that are in one experiment OR the other, but NOT both
unique_genes = experiment_1.symmetric_difference(experiment_2)
# Or: unique_genes = experiment_1 ^ experiment_2

print('Experiment 1:', experiment_1)
print('Experiment 2:', experiment_2)
print('\nSymmetric difference:', unique_genes)
print('\n→ Everything except the overlap')

### Summary of Set Operations

In [None]:
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

print('Set A:', A)
print('Set B:', B)
print('\nOperations:')
print(f'Union (A | B):             {A | B}  ← Everything')
print(f'Intersection (A & B):      {A & B}  ← Shared')
print(f'Difference (A - B):        {A - B}  ← Only in A')
print(f'Symmetric Diff (A ^ B):    {A ^ B}  ← Not shared')
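
The method and operator forms are not fully interchangeable. As a short sketch: the methods accept any iterable (such as a list), while the operators need a set on both sides. Difference also depends on which set comes first.

```python
A = {1, 2, 3, 4}
B_list = [3, 4, 5, 6]   # a list, not a set

# Methods accept any iterable
union_result = A.union(B_list)
shared_result = A.intersection(B_list)
print(sorted(union_result))    # [1, 2, 3, 4, 5, 6]
print(sorted(shared_result))   # [3, 4]

# Operators require sets on both sides
operator_failed = False
try:
    A | B_list
except TypeError as err:
    operator_failed = True
    print('Operator needs two sets:', err)

# Difference is not symmetric: A - B and B - A differ
B = set(B_list)
print(sorted(A - B))   # [1, 2]
print(sorted(B - A))   # [5, 6]
```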

### 📝 Practice Question 3

**Task:** Two labs sequenced gut bacteria from different patients:

```python
lab_A = {'E.coli', 'Bacteroides', 'Lactobacillus', 'Bifidobacterium'}
lab_B = {'E.coli', 'Clostridium', 'Lactobacillus', 'Akkermansia'}
```

Find:
1. Bacteria found by both labs
2. Bacteria unique to Lab A
3. Total number of unique bacteria found

In [None]:
# YOUR CODE HERE
# Compare bacterial species between labs


---

## Part 4: Real Biology Examples

### Example 10: Drug Target Analysis

In [None]:
# Three drugs affect different sets of genes
drug_A_targets = {'TP53', 'BRCA1', 'ATR', 'CHEK1', 'RPA1'}
drug_B_targets = {'TP53', 'MDM2', 'PTEN', 'ATR'}
drug_C_targets = {'BRCA1', 'CHEK1', 'MDM2', 'MYC'}

print('Drug A targets:', drug_A_targets)
print('Drug B targets:', drug_B_targets)
print('Drug C targets:', drug_C_targets)

# Which genes are targeted by ALL three drugs?
all_three = drug_A_targets & drug_B_targets & drug_C_targets
print(f'\nTargeted by all three drugs: {all_three}')

# Which genes are targeted by at least one drug?
any_drug = drug_A_targets | drug_B_targets | drug_C_targets
print(f'Targeted by any drug: {len(any_drug)} genes')
print(any_drug)

# Genes unique to Drug A
unique_A = drug_A_targets - drug_B_targets - drug_C_targets
print(f'\nUnique to Drug A: {unique_A}')
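
Chaining `&` and `|` by hand gets tedious once you have more than a few sets. One possible way to scale the same analysis, assuming you store the drugs in a dictionary (the `drug_targets` name is only for illustration), is to unpack the sets into `set.intersection` and `set.union`:

```python
drug_targets = {
    'Drug A': {'TP53', 'BRCA1', 'ATR', 'CHEK1', 'RPA1'},
    'Drug B': {'TP53', 'MDM2', 'PTEN', 'ATR'},
    'Drug C': {'BRCA1', 'CHEK1', 'MDM2', 'MYC'},
}

all_sets = list(drug_targets.values())

# The * unpacks the list, so this works for any number of drugs (at least one)
targeted_by_all = set.intersection(*all_sets)
targeted_by_any = set.union(*all_sets)
print('All drugs:', targeted_by_all)          # set() means no shared target
print('Any drug:', len(targeted_by_any), 'genes')

# Targets unique to each drug: remove everything hit by the other drugs
unique_targets = {}
for name, targets in drug_targets.items():
    others = set().union(*(t for n, t in drug_targets.items() if n != name))
    unique_targets[name] = targets - others
    print(f'Unique to {name}: {unique_targets[name]}')
```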

### Example 11: Gene Pathway Analysis

In [None]:
# Genes from your experiment
upregulated_genes = {'TP53', 'ATR', 'BRCA1', 'CHEK1', 'RPA1', 'MDM2', 'PTEN', 'MYC'}

# Known pathway genes
dna_repair = {'TP53', 'ATR', 'BRCA1', 'CHEK1', 'RPA1', 'RAD51'}
cell_cycle = {'TP53', 'MDM2', 'CDK4', 'CCND1', 'RB1'}
apoptosis = {'TP53', 'BAX', 'BCL2', 'PTEN', 'CASP3'}

print('Your upregulated genes:', upregulated_genes)
print('\n=== Pathway Enrichment ===')

# Check overlap with each pathway
dna_overlap = upregulated_genes & dna_repair
print(f'\nDNA Repair pathway: {len(dna_overlap)}/{len(dna_repair)} genes')
print(f'  Hits: {dna_overlap}')

cycle_overlap = upregulated_genes & cell_cycle
print(f'\nCell Cycle pathway: {len(cycle_overlap)}/{len(cell_cycle)} genes')
print(f'  Hits: {cycle_overlap}')

apop_overlap = upregulated_genes & apoptosis
print(f'\nApoptosis pathway: {len(apop_overlap)}/{len(apoptosis)} genes')
print(f'  Hits: {apop_overlap}')

print('\n→ DNA Repair pathway shows the largest overlap (5 of 6 genes)!')
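
The three overlap checks above repeat the same steps, so they can be written as a loop. This is a minimal sketch that stores the pathways in a dictionary and ranks them by the fraction of pathway genes that were hit. Note that this fraction is a simple overlap measure, not a statistical enrichment test.

```python
upregulated_genes = {'TP53', 'ATR', 'BRCA1', 'CHEK1', 'RPA1', 'MDM2', 'PTEN', 'MYC'}

pathways = {
    'DNA Repair': {'TP53', 'ATR', 'BRCA1', 'CHEK1', 'RPA1', 'RAD51'},
    'Cell Cycle': {'TP53', 'MDM2', 'CDK4', 'CCND1', 'RB1'},
    'Apoptosis': {'TP53', 'BAX', 'BCL2', 'PTEN', 'CASP3'},
}

overlap_fraction = {}
for name, members in pathways.items():
    hits = upregulated_genes & members
    overlap_fraction[name] = len(hits) / len(members)
    print(f'{name}: {len(hits)}/{len(members)} genes '
          f'({overlap_fraction[name]:.0%}), hits: {sorted(hits)}')

# Pathway with the largest fraction of its genes upregulated
best = max(overlap_fraction, key=overlap_fraction.get)
print('Largest overlap:', best)   # DNA Repair
```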

### Example 12: Patient Sample Comparison

In [None]:
# Mutations found in three cancer patients
patient_1 = {'TP53', 'KRAS', 'EGFR', 'BRAF'}
patient_2 = {'TP53', 'PIK3CA', 'PTEN', 'EGFR'}
patient_3 = {'TP53', 'KRAS', 'APC', 'SMAD4'}

print('Patient 1 mutations:', patient_1)
print('Patient 2 mutations:', patient_2)
print('Patient 3 mutations:', patient_3)

# Common mutations across all patients
common = patient_1 & patient_2 & patient_3
print(f'\nCommon to all patients: {common}')

# Mutations in at least 2 patients
p1_p2 = patient_1 & patient_2
p1_p3 = patient_1 & patient_3
p2_p3 = patient_2 & patient_3

at_least_two = p1_p2 | p1_p3 | p2_p3
print(f'\nIn at least 2 patients: {at_least_two}')

# Total unique mutations
all_mutations = patient_1 | patient_2 | patient_3
print(f'\nTotal unique mutations: {len(all_mutations)}')
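
Building every pairwise intersection works for three patients but grows quickly with more. An alternative sketch, using `collections.Counter` from the standard library, counts how many patients carry each mutation and then filters by that count:

```python
from collections import Counter

patients = [
    {'TP53', 'KRAS', 'EGFR', 'BRAF'},      # patient 1
    {'TP53', 'PIK3CA', 'PTEN', 'EGFR'},    # patient 2
    {'TP53', 'KRAS', 'APC', 'SMAD4'},      # patient 3
]

# Each patient is a set, so every gene is counted at most once per patient
mutation_counts = Counter(gene for patient in patients for gene in patient)

at_least_two = {gene for gene, n in mutation_counts.items() if n >= 2}
only_one = {gene for gene, n in mutation_counts.items() if n == 1}

print('In at least 2 patients:', sorted(at_least_two))  # ['EGFR', 'KRAS', 'TP53']
print('In exactly 1 patient:', sorted(only_one))
print('TP53 found in', mutation_counts['TP53'], 'patients')  # 3
```

Changing the threshold (`n >= 2`) is all it takes to ask for mutations found in at least three, four, or more patients.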

### 📝 Practice Question 4

**Task:** Three research groups identified genes associated with Alzheimer's disease:

```python
group_1 = {'APOE', 'APP', 'PSEN1', 'MAPT', 'BIN1'}
group_2 = {'APOE', 'APP', 'CLU', 'PICALM', 'BIN1'}
group_3 = {'APOE', 'PSEN1', 'TREM2', 'CD33', 'BIN1'}
```

Find:
1. Genes identified by all three groups (strongest candidates)
2. Genes identified by only one group (need more validation)
3. Total number of candidate genes

In [None]:
# YOUR CODE HERE
# Analyze Alzheimer's gene candidates


---

## Part 5: Visualizing Sets with Venn Diagrams

### Example 13: Two-Set Venn Diagram

In [None]:
# Drug targets from earlier example
drug_A = {'TP53', 'BRCA1', 'ATR', 'CHEK1', 'RPA1'}
drug_B = {'TP53', 'MDM2', 'PTEN', 'ATR'}

fig, ax = plt.subplots(figsize=(8, 6))

venn2([drug_A, drug_B], 
      set_labels=('Drug A', 'Drug B'),
      ax=ax)

ax.set_title('Drug Target Overlap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print(f'Shared targets: {drug_A & drug_B}')
print(f'Drug A only: {drug_A - drug_B}')
print(f'Drug B only: {drug_B - drug_A}')
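
The numbers shown in a two-set Venn diagram are the sizes of three regions, and each can be computed with the set operations from Part 3. This sketch does the counting without any plotting, which is a handy way to check what the diagram should display:

```python
drug_A = {'TP53', 'BRCA1', 'ATR', 'CHEK1', 'RPA1'}
drug_B = {'TP53', 'MDM2', 'PTEN', 'ATR'}

only_A = drug_A - drug_B   # left region
only_B = drug_B - drug_A   # right region
both = drug_A & drug_B     # overlapping region

region_sizes = (len(only_A), len(only_B), len(both))
print('A only, B only, both:', region_sizes)   # (3, 2, 2)

# The three regions never overlap, so their sizes add up to the union
print(sum(region_sizes) == len(drug_A | drug_B))   # True
```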

### Example 14: Three-Set Venn Diagram

In [None]:
# Patient mutations from earlier
patient_1 = {'TP53', 'KRAS', 'EGFR', 'BRAF'}
patient_2 = {'TP53', 'PIK3CA', 'PTEN', 'EGFR'}
patient_3 = {'TP53', 'KRAS', 'APC', 'SMAD4'}

fig, ax = plt.subplots(figsize=(10, 8))

venn3([patient_1, patient_2, patient_3],
      set_labels=('Patient 1', 'Patient 2', 'Patient 3'),
      ax=ax)

ax.set_title('Mutation Overlap Across Patients', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print(f'All three patients: {patient_1 & patient_2 & patient_3}')

### 📝 Practice Question 5

**Task:** Create a Venn diagram comparing two microbiome samples:

```python
healthy_gut = {'Lactobacillus', 'Bifidobacterium', 'E.coli', 'Bacteroides', 'Akkermansia'}
diseased_gut = {'E.coli', 'Clostridium', 'Bacteroides', 'Enterococcus'}
```

Create a Venn diagram and print:
1. Bacteria in both samples
2. Bacteria unique to healthy gut
3. Bacteria unique to diseased gut

In [None]:
# YOUR CODE HERE
# Create Venn diagram for microbiome comparison


---

## Part 6: Checking Set Relationships

### Example 15: Subset and Superset

In [None]:
# All DNA repair genes
dna_repair = {'TP53', 'ATR', 'BRCA1', 'CHEK1', 'RPA1', 'RAD51'}

# Your significant genes
my_genes = {'ATR', 'BRCA1', 'CHEK1'}

# Check if my_genes is a subset of dna_repair
is_subset = my_genes.issubset(dna_repair)
print(f'My genes: {my_genes}')
print(f'DNA repair genes: {dna_repair}')
print(f'\nAre my genes a subset of DNA repair? {is_subset}')

# Check if dna_repair is a superset of my_genes
is_superset = dna_repair.issuperset(my_genes)
print(f'Is DNA repair a superset of my genes? {is_superset}')

print('\n→ All my genes are in the DNA repair pathway!')
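
Subset and superset checks also have operator forms. A brief sketch: `<=` and `>=` match `issubset()` and `issuperset()`, while `<` and `>` test for a *proper* subset or superset, meaning the two sets must not be equal.

```python
dna_repair = {'TP53', 'ATR', 'BRCA1', 'CHEK1', 'RPA1', 'RAD51'}
my_genes = {'ATR', 'BRCA1', 'CHEK1'}

print(my_genes <= dna_repair)    # True: subset
print(my_genes < dna_repair)     # True: proper subset (dna_repair has extra genes)
print(dna_repair >= my_genes)    # True: superset

# Every set is a subset of itself, but never a proper subset of itself
print(dna_repair <= dna_repair)  # True
print(dna_repair < dna_repair)   # False
```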

### Example 16: Checking for Overlap

In [None]:
# Do two pathways share any genes?
pathway_A = {'TP53', 'MDM2', 'ATR'}
pathway_B = {'BRCA1', 'PTEN', 'MYC'}
pathway_C = {'TP53', 'BAX', 'BCL2'}

# Check if sets are disjoint (no overlap)
print('Pathway A:', pathway_A)
print('Pathway B:', pathway_B)
print('Pathway C:', pathway_C)

no_overlap_AB = pathway_A.isdisjoint(pathway_B)
print(f'\nA and B are disjoint (no overlap): {no_overlap_AB}')

no_overlap_AC = pathway_A.isdisjoint(pathway_C)
print(f'A and C are disjoint (no overlap): {no_overlap_AC}')

print(f'\n→ A and C share: {pathway_A & pathway_C}')

### 📝 Practice Question 6 (Challenge)

**Task:** You're analyzing antibiotic resistance genes:

```python
hospital_strain = {'blaZ', 'mecA', 'vanA', 'tetM', 'ermB'}
community_strain = {'blaZ', 'tetM'}
lab_strain = {'blaZ'}
```

Answer:
1. Are the lab strain's resistance genes a subset of the community strain's?
2. Are the hospital strain's resistance genes a superset of the community strain's?
3. How many resistance genes are unique to the hospital strain?
4. Create a Venn diagram comparing hospital and community strains

In [None]:
# YOUR CODE HERE
# Analyze antibiotic resistance patterns


---

## Summary

### When to Use Sets vs Lists:

**Use Lists when:**
- Order matters (e.g., time series data)
- You need duplicates (e.g., repeated measurements)
- You need indexing (`list[0]`)

**Use Sets when:**
- You need unique items only
- Comparing groups (overlaps, differences)
- Fast membership testing (`item in set`)
- Order doesn't matter
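
To see the membership-testing difference for yourself, here is a rough timing sketch using the standard-library `timeit` module. The collection size and repeat count are arbitrary choices, and the exact timings will vary from machine to machine.

```python
import timeit

n = 50_000
ids_list = list(range(n))
ids_set = set(ids_list)
target = n - 1   # last element: the list has to scan every item to find it

# Time 100 membership checks against each container
list_time = timeit.timeit(lambda: target in ids_list, number=100)
set_time = timeit.timeit(lambda: target in ids_set, number=100)

print(f'List lookup: {list_time:.5f} s')
print(f'Set lookup:  {set_time:.5f} s')
```

A list checks items one by one, so lookups slow down as it grows. A set uses hashing, so lookup time stays roughly constant on average.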

---

### Creating Sets:
```python
my_set = {1, 2, 3}              # Curly braces
my_set = set([1, 2, 3])         # From list
my_set = set()                  # Empty set
```

### Set Operations:
```python
A | B    # Union (everything)
A & B    # Intersection (shared)
A - B    # Difference (in A, not B)
A ^ B    # Symmetric difference (not shared)
```

### Common Methods:
```python
my_set.add(item)           # Add one item
my_set.update(items)       # Add every item from a list or other iterable
my_set.remove(item)        # Remove item (KeyError if not found)
my_set.discard(item)       # Remove item (no error if missing)
len(my_set)                # Number of items
item in my_set             # Check membership (fast!)
```
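
`remove()` and `discard()` were not used in the examples above, so here is a short sketch of how they differ when the item is missing:

```python
genes = {'TP53', 'BRCA1', 'MYC'}

genes.remove('MYC')      # present, so it is removed
genes.discard('ATR')     # not present: nothing happens, no error

# remove() on a missing item raises KeyError
remove_failed = False
try:
    genes.remove('ATR')
except KeyError as err:
    remove_failed = True
    print('remove() raised KeyError for', err)

print(sorted(genes))     # ['BRCA1', 'TP53']
```

Use `discard()` when a missing item is fine, and `remove()` when a missing item signals a mistake you want to hear about.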

### Checking Relationships:
```python
A.issubset(B)              # Is A ⊆ B?
A.issuperset(B)            # Is A ⊇ B?
A.isdisjoint(B)            # No overlap?
```

---

### Biology Applications:

1. **Gene Lists:**
   - Find genes common to multiple experiments
   - Identify unique hits in each condition

2. **Pathway Analysis:**
   - Check overlap with known pathways
   - Calculate enrichment

3. **Patient Comparisons:**
   - Shared vs unique mutations
   - Common phenotypes

4. **Microbiome Studies:**
   - Species presence/absence
   - Community composition overlap

5. **Drug Discovery:**
   - Target overlap between compounds
   - Off-target prediction
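
Several of these applications need a single number for "how much do two groups overlap?". One widely used measure is the Jaccard index: the size of the intersection divided by the size of the union. The sketch below is an illustration only. `jaccard_index` is a hypothetical helper name, and returning 0.0 for two empty sets is an assumed convention.

```python
def jaccard_index(a, b):
    """Return |a & b| / |a | b|, a value between 0 (no overlap) and 1 (identical)."""
    union = a | b
    if not union:
        return 0.0   # assumed convention when both sets are empty
    return len(a & b) / len(union)

experiment_1 = {'TP53', 'BRCA1', 'MYC', 'ATR'}
experiment_2 = {'BRCA1', 'PTEN', 'MDM2', 'ATR'}

score = jaccard_index(experiment_1, experiment_2)
print(f'Jaccard index: {score:.2f}')   # 0.33 (2 shared genes out of 6 total)
```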

---

### Key Takeaways:

1. **Sets automatically remove duplicates** - perfect for unique items
2. **Set operations are intuitive** - union, intersection, difference
3. **Fast membership testing** - `item in my_set` takes roughly constant time on average, however large the set
4. **Great for comparisons** - overlap analysis is the killer feature
5. **Venn diagrams visualize sets** - use matplotlib-venn

Remember: When you need to compare groups and find overlaps, **think sets!**