<a href="https://colab.research.google.com/github/Ash100/Python_for_Lifescience/blob/main/Chapter_3%3AData_Structures_for_Biology.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Learn Python for Biological Data Analysis**
# **Chapter 3:** Data Structures for Biology

This course is designed and taught by **Dr. Ashfaq Ahmad**. All examples are drawn from the biological and life sciences.

## 📅 Course Outline

---

## 🏗️ Foundation (Weeks 1–2)

### 📘 Chapter 1: Getting Started with Python and Colab
- Introduction to Google Colab interface
- Basic Python syntax and data types
- Variables, strings, and basic operations
- Print statements and comments

### 📘 Chapter 2: Control Structures
- Conditional statements (`if`/`else`)
- Loops (`for` and `while`)
- Basic functions and scope

---

## 🧬 Data Handling (Weeks 3–4)

### 📘 Chapter 3: Data Structures for Biology
- Lists and tuples (storing sequences, experimental data)
- Dictionaries (gene annotations, species data)
- Sets (unique identifiers, sample collections)

### 📘 Chapter 4: Working with Files
- Reading and writing text files
- Handling CSV files (experimental data)
- Basic file operations for biological datasets

---

## 📊 Scientific Computing (Weeks 5–7)

### 📘 Chapter 5: NumPy for Numerical Data
- Arrays for storing experimental measurements
- Mathematical operations on datasets
- Statistical calculations (mean, median, standard deviation)

### 📘 Chapter 6: Pandas for Data Analysis
- DataFrames for structured biological data
- Data cleaning and manipulation
- Filtering and grouping experimental results
- Handling missing data

### 📘 Chapter 7: Data Visualization
- Matplotlib basics for scientific plots
- Creating publication-quality figures
- Specialized plots for biological data (histograms, scatter plots, box plots)

---

## 🔬 Biological Applications (Weeks 8–10)

### 📘 Chapter 8: Sequence Analysis
- String manipulation for DNA/RNA sequences
- Basic sequence operations (reverse complement, transcription)
- Reading FASTA files
- Simple sequence statistics

### 📘 Chapter 9: Statistical Analysis for Biology
- Hypothesis testing basics
- t-tests and chi-square tests
- Correlation analysis
- Introduction to `scipy.stats`

### 📘 Chapter 10: Practical Projects
- Analyzing gene expression data
- Population genetics calculations
- Ecological data analysis
- Creating reproducible research workflows

---

## 🚀 Advanced Topics *(Optional – Weeks 11–12)*

### 📘 Chapter 11: Bioinformatics Libraries
- Introduction to Biopython
- Working with biological databases
- Phylogenetic analysis basics

### 📘 Chapter 12: Best Practices
- Code organization and documentation
- Error handling
- Reproducible research practices
- Sharing code and results

---

## 🧠 Key Teaching Strategies

1. Start each chapter with biological context – explain why the programming concept matters for their field.
2. Use biological datasets throughout – gene sequences, experimental measurements, species data.
3. Include hands-on exercises after each concept.
4. Emphasize reproducibility – show how code documents their analysis process.
5. Build complexity gradually – start with simple examples, then real research scenarios.

---

✅ This progression moves from basic programming concepts to practical biological applications, ensuring students can immediately apply what they learn to their research and coursework.


## Learning Objectives
By the end of this chapter, students will be able to:

1. Use lists and tuples to store biological sequences and experimental data
2. Apply dictionaries for gene annotations and species information
3. Utilize sets for managing unique identifiers and sample collections
4. Choose appropriate data structures for different biological problems

## **Section 1:** Lists and Tuples - Storing Sequences and Experimental Data
### **Introduction to Lists in Python**

Lists are one of the most versatile and commonly used data structures in Python. They are ordered, mutable (changeable) collections of items. This means you can add, remove, or modify items after the list has been created, and the order of items is preserved. Lists can contain items of different data types within the same list.

### Why use Lists?

* **Ordered:** Items in a list have a defined order, and that order will not change unless you explicitly modify the list.
* **Changeable (Mutable):** You can modify a list after it's created (add, remove, or change elements).
* **Allows Duplicates:** Lists can contain multiple items with the same value.
* **Heterogeneous:** A single list can hold items of different data types (e.g., integers, strings, floats, even other lists).

### Creating a List

Lists are created by placing all the items (elements) inside square brackets `[]`, separated by commas.



In [None]:
# An empty list
my_empty_list = []
print(my_empty_list) # Output: []

# A list of integers
numbers = [1, 2, 3, 4, 5]
print(numbers) # Output: [1, 2, 3, 4, 5]

# A list of strings
fruits = ["apple", "banana", "cherry"]
print(fruits) # Output: ['apple', 'banana', 'cherry']

# A mixed-type list
mixed_list = [1, "hello", 3.14, True]
print(mixed_list) # Output: [1, 'hello', 3.14, True]

# A nested list (a list containing another list)
nested_list = [1, 2, [3, 4], 5]
print(nested_list) # Output: [1, 2, [3, 4], 5]

In [None]:
# Basic list creation and operations
dna_sequence = ['A', 'T', 'G', 'C', 'A', 'T', 'G']
print(f"DNA sequence: {dna_sequence}")
print(f"Length: {len(dna_sequence)}")
print(f"First nucleotide: {dna_sequence[0]}")
print(f"Last nucleotide: {dna_sequence[-1]}")
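
Because lists are ordered and mutable, a sequence stored as a list can be edited in place. The sketch below starts from the same nucleotide list as the cell above; the specific edits and the name `first_codon` are illustrative assumptions rather than part of the original exercise.

```python
# A minimal sketch of in-place list edits (illustrative values only)
dna_sequence = ['A', 'T', 'G', 'C', 'A', 'T', 'G']

dna_sequence.append('A')        # add one nucleotide to the end
dna_sequence.insert(0, 'G')     # insert at the start (index 0)
dna_sequence[1] = 'C'           # replace the element at index 1
dna_sequence.remove('T')        # remove the first 'T' found

# Slicing returns a new list and leaves the original unchanged
first_codon = dna_sequence[:3]

print(f"Edited sequence: {dna_sequence}")
print(f"First codon: {first_codon}")
print(f"Length after edits: {len(dna_sequence)}")
```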

**1.2 Biological Applications of Lists**

**Example 1: DNA Sequence Analysis**

In [None]:
# Storing a DNA sequence as a list
gene_sequence = ['A', 'T', 'G', 'C', 'C', 'G', 'T', 'A', 'A', 'T']

# Count nucleotides
nucleotide_counts = {
    'A': gene_sequence.count('A'),
    'T': gene_sequence.count('T'),
    'G': gene_sequence.count('G'),
    'C': gene_sequence.count('C')
}

print("Nucleotide composition:")
for nucleotide, count in nucleotide_counts.items():
    percentage = (count / len(gene_sequence)) * 100
    print(f"{nucleotide}: {count} ({percentage:.1f}%)")

**Example 2: Experimental Data Storage**

In [None]:
# pH measurements over time
ph_measurements = [7.2, 7.1, 6.9, 6.8, 6.7, 6.9, 7.0, 7.1]
time_points = [0, 1, 2, 3, 4, 5, 6, 7]  # hours

print("pH over time:")
for time, ph in zip(time_points, ph_measurements):
    print(f"Hour {time}: pH = {ph}")

# Calculate average pH
average_ph = sum(ph_measurements) / len(ph_measurements)
print(f"\nAverage pH: {average_ph:.2f}")

**Example 3: Protein Molecular Weights**

In [None]:
# List of protein molecular weights (in kDa)
protein_weights = [45.2, 67.8, 23.1, 89.5, 34.7, 56.3, 78.9]

# Sort proteins by molecular weight
sorted_weights = sorted(protein_weights)
print(f"Proteins sorted by molecular weight: {sorted_weights}")

# Find proteins within a specific range
target_range = (40, 70)  # kDa
proteins_in_range = [w for w in protein_weights if target_range[0] <= w <= target_range[1]]
print(f"Proteins between {target_range[0]}-{target_range[1]} kDa: {proteins_in_range}")

**1.3 Introduction to Tuples**

Tuples are another fundamental data structure in Python, similar to lists in some ways, but with a crucial difference: **they are immutable**. This means once a tuple is created, you cannot change its elements (add, remove, or modify them).

Tuples are ordered collections of items, and they can contain items of different data types.

### Why use Tuples?

* **Ordered:** Items in a tuple have a defined order, which will not change.
* **Immutable (Unchangeable):** This is the key characteristic. Once created, a tuple's contents cannot be altered. This makes them suitable for data that should not be modified, like coordinates, configuration settings, or database records that are not meant to be changed.
* **Allows Duplicates:** Tuples can contain multiple items with the same value.
* **Heterogeneous:** A single tuple can hold items of different data types.
* **Faster:** Due to their immutable nature, tuples can sometimes be slightly faster to process than lists for certain operations.
* **Used as Dictionary Keys:** Because they are immutable, tuples can be used as keys in dictionaries (provided every element is itself hashable), whereas lists cannot.

### Creating a Tuple

Tuples are created by placing all the items (elements) inside parentheses `()`, separated by commas.

In [None]:
# An empty tuple
my_empty_tuple = ()
print(my_empty_tuple) # Output: ()

# A tuple of integers
numbers_tuple = (1, 2, 3, 4, 5)
print(numbers_tuple) # Output: (1, 2, 3, 4, 5)

# A tuple of strings
colors = ("red", "green", "blue")
print(colors) # Output: ('red', 'green', 'blue')

# A mixed-type tuple
mixed_tuple = (1, "hello", 3.14, False)
print(mixed_tuple) # Output: (1, 'hello', 3.14, False)

# A tuple with a single item (requires a trailing comma!)
single_item_tuple = ("single",)
print(single_item_tuple) # Output: ('single',)
print(type(single_item_tuple)) # Output: <class 'tuple'>

# Without the comma, it's just a string in parentheses
not_a_tuple = ("single")
print(type(not_a_tuple)) # Output: <class 'str'>

In [None]:
# Basic tuple creation
codon = ('A', 'T', 'G')
print(f"Codon: {codon}")
print("Cannot modify tuple - it's immutable!")

# Tuples are great for coordinates
chromosome_position = (12, 1504829)  # (chromosome, position)
print(f"Gene located at chromosome {chromosome_position[0]}, position {chromosome_position[1]}")
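
The immutability and dictionary-key properties listed earlier can be checked directly. This is a minimal sketch; the dictionary `genes_by_position` and its value `'example_gene'` are made-up placeholders, not real annotations.

```python
# A minimal sketch of tuple immutability, unpacking, and use as dictionary keys
codon = ('A', 'T', 'G')

# Item assignment is not supported on tuples, so Python raises TypeError
try:
    codon[0] = 'C'
    modified = True
except TypeError as error:
    modified = False
    print(f"Modification rejected: {error}")

# Unpacking assigns each element of the tuple to its own name
chromosome, position = (12, 1504829)
print(f"Chromosome {chromosome}, position {position}")

# Tuples of hashable values can serve as dictionary keys (lists cannot)
# The gene label below is a made-up placeholder, not a real annotation
genes_by_position = {(12, 1504829): 'example_gene'}
print(genes_by_position[(chromosome, position)])
```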

**1.4 Biological Applications of Tuples**

**Example 1: Genetic Coordinates**

In [None]:
# Store gene locations as tuples (chromosome, start, end, strand)
gene_locations = [
    ('chr1', 1000, 2000, '+'),
    ('chr2', 5000, 6500, '-'),
    ('chr3', 3000, 4200, '+'),
    ('chr1', 7000, 8000, '-')
]

print("Gene locations:")
for i, (chrom, start, end, strand) in enumerate(gene_locations, 1):
    length = end - start
    print(f"Gene {i}: {chrom}:{start}-{end} ({strand}) - Length: {length} bp")
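
Positional unpacking works, but it relies on remembering the field order. One optional alternative is `typing.NamedTuple` from the standard library, which keeps the record immutable while giving each field a name. The class name `GeneLocation` and its `length` property are hypothetical names chosen for this sketch.

```python
from typing import NamedTuple

class GeneLocation(NamedTuple):
    """Hypothetical record type for this sketch: one genomic interval."""
    chrom: str
    start: int
    end: int
    strand: str

    @property
    def length(self) -> int:
        # Same calculation as in the cell above: end minus start
        return self.end - self.start

gene_locations = [
    GeneLocation('chr1', 1000, 2000, '+'),
    GeneLocation('chr2', 5000, 6500, '-'),
]

for i, loc in enumerate(gene_locations, 1):
    print(f"Gene {i}: {loc.chrom}:{loc.start}-{loc.end} ({loc.strand}) - Length: {loc.length} bp")

# A NamedTuple is still a tuple, so unpacking and indexing continue to work
chrom, start, end, strand = gene_locations[0]
```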

**Example 2: Amino Acid Properties**

In [None]:
# Store amino acid properties as tuples
# (name, three_letter_abbreviation, one_letter_code, molecular_weight, hydrophobicity)
amino_acids = [
    ('Alanine', 'Ala', 'A', 89.1, 1.8),
    ('Glycine', 'Gly', 'G', 75.1, -0.4),
    ('Valine', 'Val', 'V', 117.1, 4.2),
    ('Leucine', 'Leu', 'L', 131.2, 3.8)
]

print("Amino Acid Properties:")
print("Name\t\tAbbr\tCode\tMW\tHydrophobicity")
print("-" * 50)
for name, abbr, code, mw, hydro in amino_acids:
    print(f"{name:<10}\t{abbr}\t{code}\t{mw}\t{hydro}")

**1.5 Practice Exercise: Sequence Analysis**

In [None]:
# Exercise: Analyze a DNA sequence
def analyze_dna_sequence(sequence):
    """
    Analyze a DNA sequence and return statistics
    """
    # Convert to list for easier manipulation
    seq_list = list(sequence.upper())

    # Count nucleotides
    counts = {'A': 0, 'T': 0, 'G': 0, 'C': 0}
    for nucleotide in seq_list:
        if nucleotide in counts:
            counts[nucleotide] += 1

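    # Calculate GC content (guard against an empty sequence)
    gc_content = (counts['G'] + counts['C']) / len(seq_list) * 100 if seq_list else 0.0

    # Find complement; characters other than A/T/G/C (e.g. N) are reported as 'N'
    complement_map = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    complement = [complement_map.get(n, 'N') for n in seq_list]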

    return {
        'length': len(seq_list),
        'composition': counts,
        'gc_content': gc_content,
        'complement': ''.join(complement)
    }

# Test the function
test_sequence = "ATGCGATCGATCG"
results = analyze_dna_sequence(test_sequence)

print(f"Sequence: {test_sequence}")
print(f"Length: {results['length']}")
print(f"Composition: {results['composition']}")
print(f"GC Content: {results['gc_content']:.1f}%")
print(f"Complement: {results['complement']}")

## **Section 2:** Dictionaries - Gene Annotations and Species Data

**2.1 Introduction to Dictionaries**


Dictionaries are powerful and highly flexible data structures in Python, used to store data values in **key-value pairs**. Each key-value pair maps the key to its associated value. Think of a dictionary as a real-world dictionary where a "word" (the key) is associated with its "definition" (the value).

### Key Characteristics of Dictionaries

* **Key-Value Pairs:** Dictionaries store data as `key: value` pairs.
* **Changeable (Mutable):** You can add new items, modify existing items, or remove items after a dictionary has been created.
* **Keys Must Be Unique:** Each key within a dictionary must be unique. If you try to add a new item with an existing key, the old value associated with that key will be overwritten.
* **Keys Must Be Hashable:** Dictionary keys must be hashable, which in practice means immutable types (e.g., strings, numbers, or tuples that contain only hashable items). You cannot use mutable types like lists or other dictionaries as keys.
* **Values Can Be Anything:** Values can be of any data type (strings, numbers, lists, other dictionaries, etc.) and can be duplicates.

### Why use Dictionaries?

Dictionaries are ideal when you have data that can be identified by a specific name or unique identifier (the key). They provide a highly efficient way to retrieve data associated with a particular key.

* Storing user profiles (username: user_data)
* Representing records (product_id: product_details)
* Configuration settings (setting_name: value)
* Counting frequencies of items (item: count)
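
The last use case, counting frequencies, is common in sequence work. A rough sketch, assuming a short made-up RNA string, counts codons first with a plain dictionary and then with `collections.Counter` from the standard library:

```python
from collections import Counter

# Made-up RNA sequence for illustration (length is a multiple of 3)
rna = "AUGGCCAUGGCGUAA"

# Split into non-overlapping codons starting at position 0
codons = [rna[i:i + 3] for i in range(0, len(rna), 3)]

# Plain dictionary: get() supplies 0 the first time a codon is seen
codon_counts = {}
for codon in codons:
    codon_counts[codon] = codon_counts.get(codon, 0) + 1

# Counter does the same bookkeeping in one call
codon_counter = Counter(codons)

print(codon_counts)
print(codon_counter.most_common(1))
```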

### Creating a Dictionary

Dictionaries are created by placing a comma-separated list of `key: value` pairs inside curly braces `{}`.

In [None]:
# An empty dictionary
my_empty_dict = {}
print(my_empty_dict) # Output: {}

# A dictionary with string keys and integer values
student_scores = {"Alice": 95, "Bob": 88, "Charlie": 76}
print(student_scores) # Output: {'Alice': 95, 'Bob': 88, 'Charlie': 76}

# A dictionary with mixed key/value types
person_info = {
    "name": "Jane Doe",
    "age": 30,
    "is_student": False,
    "courses": ["Math", "Physics"]
}
print(person_info)
# Output: {'name': 'Jane Doe', 'age': 30, 'is_student': False, 'courses': ['Math', 'Physics']}

# Using the dict() constructor
another_dict = dict(brand="Ford", model="Mustang", year=1964)
print(another_dict) # Output: {'brand': 'Ford', 'model': 'Mustang', 'year': 1964}

In [None]:
# Basic dictionary creation
genetic_code = {
    'UUU': 'Phe', 'UUC': 'Phe', 'UUA': 'Leu', 'UUG': 'Leu',
    'UCU': 'Ser', 'UCC': 'Ser', 'UCA': 'Ser', 'UCG': 'Ser',
    'UAU': 'Tyr', 'UAC': 'Tyr', 'UAA': 'Stop', 'UAG': 'Stop'
}

print("Genetic Code Examples:")
for codon, amino_acid in genetic_code.items():
    print(f"{codon} -> {amino_acid}")

`items()` returns every key-value pair in `genetic_code`, so the loop above can unpack each pair into `codon` and `amino_acid`.
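
Beyond iteration, the mutability and key-uniqueness rules described earlier show up in everyday lookups and updates. The following is a minimal sketch on a three-entry codon table; the fallback string `'unknown'` is an arbitrary choice for illustration.

```python
# A minimal sketch of common dictionary operations on a small codon table
genetic_code = {'UUU': 'Phe', 'UUA': 'Leu', 'UAA': 'Stop'}

# Square brackets look up a key that is known to exist
print(genetic_code['UUU'])

# get() returns a fallback instead of raising KeyError for a missing key
print(genetic_code.get('GGG', 'unknown'))

# Assigning to a new key adds an entry; assigning to an existing key overwrites it
genetic_code['AUG'] = 'Met'
genetic_code['UAA'] = 'Stop (ochre)'

# The in operator tests for keys, and del removes an entry
has_start = 'AUG' in genetic_code
del genetic_code['UUA']

print(genetic_code)
```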

**2.2 Gene Annotation Dictionaries**

**Example 1: Gene Information Database**

In [None]:
# Store comprehensive gene information
gene_database = {
    'BRCA1': {
        'full_name': 'Breast Cancer 1',
        'chromosome': 17,
        'start': 41196312,
        'end': 41277500,
        'strand': '-',
        'function': 'DNA repair',
        'associated_diseases': ['Breast cancer', 'Ovarian cancer'],
        'protein_length': 1863
    },
    'TP53': {
        'full_name': 'Tumor Protein 53',
        'chromosome': 17,
        'start': 7565097,
        'end': 7590856,
        'strand': '-',
        'function': 'Tumor suppressor',
        'associated_diseases': ['Various cancers'],
        'protein_length': 393
    },
    'CFTR': {
        'full_name': 'Cystic Fibrosis Transmembrane Conductance Regulator',
        'chromosome': 7,
        'start': 117120016,
        'end': 117308718,
        'strand': '+',
        'function': 'Ion channel',
        'associated_diseases': ['Cystic fibrosis'],
        'protein_length': 1480
    }
}

# Access gene information
def display_gene_info(gene_name):
    if gene_name in gene_database:
        info = gene_database[gene_name]
        print(f"Gene: {gene_name}")
        print(f"Full Name: {info['full_name']}")
        print(f"Location: chr{info['chromosome']}:{info['start']}-{info['end']} ({info['strand']})")
        print(f"Function: {info['function']}")
        print(f"Associated Diseases: {', '.join(info['associated_diseases'])}")
        print(f"Protein Length: {info['protein_length']} amino acids")
    else:
        print(f"Gene {gene_name} not found in database")

# Display information for specific genes
for gene in ['BRCA1', 'TP53']:
    display_gene_info(gene)
    print("-" * 50)

**Example 2: Expression Data**

In [None]:
# Store gene expression data across different conditions
expression_data = {
    'control': {
        'GAPDH': 1000,
        'ACTB': 800,
        'TP53': 50,
        'BRCA1': 25
    },
    'treatment_A': {
        'GAPDH': 1050,
        'ACTB': 820,
        'TP53': 150,
        'BRCA1': 75
    },
    'treatment_B': {
        'GAPDH': 980,
        'ACTB': 790,
        'TP53': 200,
        'BRCA1': 100
    }
}

# Calculate fold changes
def calculate_fold_change(treatment, control_condition='control'):
    fold_changes = {}
    control_data = expression_data[control_condition]
    treatment_data = expression_data[treatment]

    for gene in control_data:
        if gene in treatment_data:
            fold_change = treatment_data[gene] / control_data[gene]
            fold_changes[gene] = fold_change

    return fold_changes

# Calculate and display fold changes
for treatment in ['treatment_A', 'treatment_B']:
    fold_changes = calculate_fold_change(treatment)
    print(f"Fold changes for {treatment}:")
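    for gene, fc in fold_changes.items():
        # A fold change of exactly 1 means no change, not down-regulation
        if fc > 1:
            direction = "up-regulated"
        elif fc < 1:
            direction = "down-regulated"
        else:
            direction = "unchanged"
        print(f"  {gene}: {fc:.2f}x ({direction})")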
    print()

**2.3 Species Data Management**

**Example 1: Taxonomic Information**

In [None]:
# Store species taxonomic information
species_database = {
    'homo_sapiens': {
        'common_name': 'Human',
        'kingdom': 'Animalia',
        'phylum': 'Chordata',
        'class': 'Mammalia',
        'order': 'Primates',
        'family': 'Hominidae',
        'genus': 'Homo',
        'species': 'sapiens',
        'genome_size': 3.2e9  # base pairs
    },
    'mus_musculus': {
        'common_name': 'House Mouse',
        'kingdom': 'Animalia',
        'phylum': 'Chordata',
        'class': 'Mammalia',
        'order': 'Rodentia',
        'family': 'Muridae',
        'genus': 'Mus',
        'species': 'musculus',
        'genome_size': 2.7e9
    },
    'escherichia_coli': {
        'common_name': 'E. coli',
        'kingdom': 'Bacteria',
        'phylum': 'Proteobacteria',
        'class': 'Gammaproteobacteria',
        'order': 'Enterobacterales',
        'family': 'Enterobacteriaceae',
        'genus': 'Escherichia',
        'species': 'coli',
        'genome_size': 4.6e6
    }
}

# Function to display taxonomic hierarchy
def display_taxonomy(species_key):
    if species_key in species_database:
        species = species_database[species_key]
        print(f"Taxonomic Classification of {species['common_name']}:")
        hierarchy = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
        for level in hierarchy:
            print(f"  {level.capitalize()}: {species[level]}")
        print(f"  Genome Size: {species['genome_size']:,.0f} bp")
    else:
        print(f"Species {species_key} not found")

# Display taxonomy for different species
for species in species_database.keys():
    display_taxonomy(species)
    print("-" * 40)

**Example 2: Phenotype Data**

In [None]:
# Store phenotype data for different organisms
phenotype_data = {
    'wild_type': {
        'growth_rate': 1.0,
        'viability': 100,
        'fertility': 'normal',
        'pigmentation': 'normal'
    },
    'mutant_A': {
        'growth_rate': 0.7,
        'viability': 85,
        'fertility': 'reduced',
        'pigmentation': 'albino'
    },
    'mutant_B': {
        'growth_rate': 1.2,
        'viability': 95,
        'fertility': 'normal',
        'pigmentation': 'dark'
    }
}

# Compare phenotypes against a reference strain
def compare_phenotypes(reference='wild_type'):
    ref_data = phenotype_data[reference]
    print(f"Phenotype comparison (reference: {reference}):")
    print(f"{'Strain':<12} {'Growth Rate':<12} {'Rel. Growth':<12} {'Viability':<10} {'Fertility':<10} {'Pigmentation'}")
    print("-" * 72)

    for strain, data in phenotype_data.items():
        # Growth rate expressed relative to the reference strain
        relative_growth = data['growth_rate'] / ref_data['growth_rate']
        print(f"{strain:<12} {data['growth_rate']:<12} {relative_growth:<12.2f} {data['viability']:<10} {data['fertility']:<10} {data['pigmentation']}")

compare_phenotypes()

**2.4 Practice Exercise: Sequence Translation**

In [None]:
# Exercise: Create a complete translation system
def create_genetic_code():
    """Create the complete genetic code dictionary"""
    genetic_code = {
        'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L',
        'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S',
        'UAU': 'Y', 'UAC': 'Y', 'UAA': '*', 'UAG': '*',
        'UGU': 'C', 'UGC': 'C', 'UGA': '*', 'UGG': 'W',
        'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L',
        'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
        'CAU': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
        'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',
        'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'AUG': 'M',
        'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',
        'AAU': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',
        'AGU': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',
        'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V',
        'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
        'GAU': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',
        'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G'
    }
    return genetic_code

def translate_sequence(dna_sequence):
    """Translate DNA sequence to protein"""
    # Convert DNA to RNA (upper-case first so lowercase input is also handled)
    rna_sequence = dna_sequence.upper().replace('T', 'U')

    # Get genetic code
    genetic_code = create_genetic_code()

    # Translate RNA to protein
    protein = []
    for i in range(0, len(rna_sequence), 3):
        codon = rna_sequence[i:i+3]
        if len(codon) == 3:
            amino_acid = genetic_code.get(codon, 'X')  # X for unknown
            protein.append(amino_acid)
            if amino_acid == '*':  # Stop codon
                break

    return ''.join(protein)

# Test translation
test_dna = "ATGAAACGCATTAGCGGTGCTAAATTAG"
protein = translate_sequence(test_dna)
print(f"DNA: {test_dna}")
print(f"Protein: {protein}")

## **Section 3:** Sets - Unique Identifiers and Sample Collections

**3.1 Introduction to Sets**


A set is an unordered collection of **unique** items. Sets are mutable, so you can add or remove items after creation, but unlike lists and tuples they are unindexed and do not allow duplicate members. They are used mainly for mathematical set operations (union, intersection, difference), for fast membership checks, and for removing duplicates from a sequence.

### Key Characteristics of Sets

* **Unordered:** Items in a set do not have a defined order. You cannot refer to items by index or slice them.
* **Unique Elements:** A set cannot contain duplicate items. If you try to add an element that already exists, it will not be added.
* **Mutable (Changeable):** You can add new items or remove existing items after a set has been created.
* **Unindexed:** You cannot access items by referring to an index.
* **Heterogeneous:** A single set can hold items of different data types (e.g., integers, strings, floats, tuples). However, every element must be **hashable**, which in practice means immutable values such as numbers, strings, or tuples whose own elements are hashable. You cannot put mutable types (like lists or dictionaries) inside a set.

### Why use Sets?

Sets are ideal when:

* You need to store a collection of unique items.
* You want to perform quick membership testing (checking if an item exists in the collection).
* You need to perform mathematical set operations (union, intersection, etc.).
* You want to efficiently remove duplicate elements from a list or other collection.

### Creating a Set

Sets are created by placing all the items (elements) inside curly braces `{}`, separated by commas. However, if you want an empty set, you must use the `set()` constructor, as `{}` creates an empty dictionary.



In [None]:
# An empty set
my_empty_set = set()
print(my_empty_set) # Output: set()
print(type(my_empty_set)) # Output: <class 'set'>

# A set of integers (duplicates are automatically removed)
numbers_set = {1, 2, 3, 2, 4, 1}
print(numbers_set) # Output: {1, 2, 3, 4} (order might vary)

# A set of strings
fruits_set = {"apple", "banana", "cherry"}
print(fruits_set) # Output: {'cherry', 'apple', 'banana'} (order might vary)

# A mixed-type set (elements must be immutable)
mixed_set = {1, "hello", 3.14, (1, 2)} # Tuple is immutable, so it's allowed
print(mixed_set) # Output: {1, 3.14, (1, 2), 'hello'} (order might vary)

# Creating a set from a list (useful for removing duplicates)
my_list_with_duplicates = [1, 2, 2, 3, 4, 4, 5]
unique_numbers = set(my_list_with_duplicates)
print(unique_numbers) # Output: {1, 2, 3, 4, 5}
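
Converting to a set removes duplicates but discards the original order. When the order of first appearance matters, for example a list of sample IDs in processing order, one common alternative is `dict.fromkeys`, which relies on dictionaries preserving insertion order (guaranteed since Python 3.7). The sample IDs below are invented for this sketch.

```python
# Invented sample IDs with repeats, listed in processing order
sample_ids = ['S003', 'S001', 'S003', 'S002', 'S001']

# set() removes duplicates, but the resulting order is not guaranteed
unique_unordered = set(sample_ids)

# dict.fromkeys() keeps the first occurrence of each ID in its original position
unique_ordered = list(dict.fromkeys(sample_ids))

print(f"Unique (any order): {unique_unordered}")
print(f"Unique (first-seen order): {unique_ordered}")
```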

In [None]:
# Basic set creation and operations
genes_study1 = {'BRCA1', 'TP53', 'EGFR', 'MYC'}
genes_study2 = {'TP53', 'EGFR', 'KRAS', 'PIK3CA'}

print(f"Study 1 genes: {genes_study1}")
print(f"Study 2 genes: {genes_study2}")

# Set operations
intersection = genes_study1 & genes_study2
union = genes_study1 | genes_study2
difference = genes_study1 - genes_study2

print(f"Common genes: {intersection}")
print(f"All genes: {union}")
print(f"Unique to study 1: {difference}")
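
The other set behaviours described earlier (membership testing, ignoring duplicates on insertion, and subset checks) can be sketched with the same two gene sets. The identifier `'NOT_A_GENE'` is a deliberately fake value used only to show that `discard()` tolerates missing items.

```python
genes_study1 = {'BRCA1', 'TP53', 'EGFR', 'MYC'}
genes_study2 = {'TP53', 'EGFR', 'KRAS', 'PIK3CA'}

# Membership testing with the in operator
print('TP53' in genes_study1)

# Genes found in exactly one of the two studies (symmetric difference)
only_one_study = genes_study1 ^ genes_study2

# add() ignores values already present; discard() ignores values that are absent
genes_study1.add('TP53')
genes_study1.discard('NOT_A_GENE')

# Subset test: is every shared gene also in study 1?
shared = genes_study1 & genes_study2
print(shared.issubset(genes_study1))
print(sorted(only_one_study))
```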

**3.2 Managing Unique Identifiers**

**Example 1: Sample ID Management**

In [None]:
# Manage unique sample identifiers
def create_sample_database():
    """Create a sample database with unique identifiers"""
    samples = {
        'patient_samples': {
            'P001_T',  # Patient 1, Tumor
            'P001_N',  # Patient 1, Normal
            'P002_T',
            'P002_N',
            'P003_T',
            'P003_N'
        },
        'control_samples': {
            'C001_H',  # Control 1, Healthy
            'C002_H',
            'C003_H',
            'C004_H'
        },
        'processed_samples': {
            'P001_T',
            'P002_T',
            'C001_H',
            'C002_H'
        }
    }
    return samples

# Analyze sample collections
sample_db = create_sample_database()

# Find samples that need processing
all_samples = sample_db['patient_samples'] | sample_db['control_samples']
processed = sample_db['processed_samples']
pending = all_samples - processed

print(f"Total samples: {len(all_samples)}")
print(f"Processed samples: {len(processed)}")
print(f"Pending samples: {len(pending)}")
print(f"Samples to process: {pending}")

# Quality control - check for IDs that appear in both patient and control collections
# (each set is already duplicate-free, so only cross-collection overlap is possible)
shared_ids = sample_db['patient_samples'] & sample_db['control_samples']
print(f"Duplicate samples detected: {len(shared_ids)}")

**Example 2: Gene Set Analysis**

In [None]:
# Pathway analysis using sets
pathways = {
    'DNA_repair': {
        'BRCA1', 'BRCA2', 'TP53', 'ATM', 'CHEK2', 'RAD51'
    },
    'cell_cycle': {
        'TP53', 'RB1', 'CDK4', 'CCND1', 'CDKN2A', 'E2F1'
    },
    'apoptosis': {
        'TP53', 'BAX', 'BCL2', 'CASP3', 'CASP9', 'APAF1'
    },
    'oncogenes': {
        'MYC', 'KRAS', 'EGFR', 'HER2', 'PIK3CA', 'AKT1'
    }
}

# Find gene overlaps between pathways
def analyze_pathway_overlap(pathway1, pathway2):
    """Analyze overlap between two pathways"""
    genes1 = pathways[pathway1]
    genes2 = pathways[pathway2]

    overlap = genes1 & genes2
    unique1 = genes1 - genes2
    unique2 = genes2 - genes1

    return {
        'overlap': overlap,
        'unique_to_pathway1': unique1,
        'unique_to_pathway2': unique2,
        # Shared genes as a percentage of all genes in either pathway (Jaccard index x 100)
        'overlap_percentage': len(overlap) / len(genes1 | genes2) * 100
    }

# Analyze overlaps
result = analyze_pathway_overlap('DNA_repair', 'cell_cycle')
print("DNA Repair vs Cell Cycle Pathways:")
print(f"Overlapping genes: {result['overlap']}")
print(f"Unique to DNA repair: {result['unique_to_pathway1']}")
print(f"Unique to cell cycle: {result['unique_to_pathway2']}")
print(f"Overlap percentage: {result['overlap_percentage']:.1f}%")
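
The overlap percentage above divides the size of the intersection by the size of the union, a quantity usually called the Jaccard index. A rough sketch of applying it to every pair of pathways follows. It uses `itertools.combinations` from the standard library and copies three of the gene sets from the cell above; the helper name `jaccard_percent` is hypothetical.

```python
from itertools import combinations

pathways = {
    'DNA_repair': {'BRCA1', 'BRCA2', 'TP53', 'ATM', 'CHEK2', 'RAD51'},
    'cell_cycle': {'TP53', 'RB1', 'CDK4', 'CCND1', 'CDKN2A', 'E2F1'},
    'oncogenes': {'MYC', 'KRAS', 'EGFR', 'HER2', 'PIK3CA', 'AKT1'},
}

def jaccard_percent(set_a, set_b):
    """Shared genes as a percentage of all genes in either set."""
    union = set_a | set_b
    if not union:          # two empty sets: define the overlap as 0
        return 0.0
    return len(set_a & set_b) / len(union) * 100

# combinations() yields each unordered pair of pathway names exactly once
pairwise = {
    (name1, name2): jaccard_percent(pathways[name1], pathways[name2])
    for name1, name2 in combinations(pathways, 2)
}

for (name1, name2), score in pairwise.items():
    print(f"{name1} vs {name2}: {score:.1f}%")
```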

**3.3 Sample Collection Management**

**Example 1: Clinical Sample Tracking**

In [None]:
# Track samples across different studies
class SampleTracker:
    def __init__(self):
        self.studies = {}

    def add_study(self, study_name, samples):
        """Add a new study with its samples"""
        self.studies[study_name] = set(samples)

    def get_unique_samples(self):
        """Get all unique samples across studies"""
        all_samples = set()
        for samples in self.studies.values():
            all_samples.update(samples)
        return all_samples

    def find_shared_samples(self, study1, study2):
        """Find samples shared between two studies"""
        if study1 in self.studies and study2 in self.studies:
            return self.studies[study1] & self.studies[study2]
        return set()

    def get_study_exclusive_samples(self, study_name):
        """Get samples exclusive to a specific study"""
        if study_name not in self.studies:
            return set()

        other_samples = set()
        for name, samples in self.studies.items():
            if name != study_name:
                other_samples.update(samples)

        return self.studies[study_name] - other_samples

# Example usage
tracker = SampleTracker()

# Add studies
tracker.add_study('breast_cancer', ['BC001', 'BC002', 'BC003', 'BC004', 'BC005'])
tracker.add_study('lung_cancer', ['LC001', 'LC002', 'LC003', 'BC002', 'BC003'])  # Some overlap
tracker.add_study('healthy_controls', ['HC001', 'HC002', 'HC003', 'HC004'])

# Analysis
print("Sample Analysis:")
print(f"Total unique samples: {len(tracker.get_unique_samples())}")
print(f"Shared between breast and lung cancer: {tracker.find_shared_samples('breast_cancer', 'lung_cancer')}")
print(f"Exclusive to breast cancer: {tracker.get_study_exclusive_samples('breast_cancer')}")

**Example 2: Contamination Detection**

In [None]:
# Detect potential contamination in sample collections
def detect_contamination(sample_batches):
    """Detect potential contamination by finding unexpected sample overlaps"""
    contamination_report = {}

    batch_names = list(sample_batches.keys())

    for i, batch1 in enumerate(batch_names):
        for batch2 in batch_names[i+1:]:
            overlap = sample_batches[batch1] & sample_batches[batch2]
            if overlap:
                contamination_report[f"{batch1}_vs_{batch2}"] = overlap

    return contamination_report

# Example sample batches (should be independent)
sample_batches = {
    'batch_A': {'A001', 'A002', 'A003', 'A004'},
    'batch_B': {'B001', 'B002', 'B003', 'B004'},
    'batch_C': {'C001', 'C002', 'A002', 'C004'},  # Contaminated with A002
    'batch_D': {'D001', 'D002', 'D003', 'B003'}   # Contaminated with B003
}

contamination = detect_contamination(sample_batches)
print("Contamination Detection Report:")
if contamination:
    for comparison, overlapping_samples in contamination.items():
        print(f"  {comparison}: {overlapping_samples}")
else:
    print("  No contamination detected")

**3.4 Practice Exercise: Comprehensive Analysis**

In [None]:
# Exercise: Comprehensive biological data analysis using all data structures
def comprehensive_analysis():
    """Comprehensive analysis combining lists, dictionaries, and sets"""

    # Sample data
    experimental_data = {
        'genes': ['BRCA1', 'TP53', 'EGFR', 'MYC', 'KRAS'],
        'expression_levels': [2.5, 3.2, 1.8, 4.1, 2.9],
        'sample_ids': ['S001', 'S002', 'S003', 'S004', 'S005'],
        'significant_genes': {'BRCA1', 'TP53', 'MYC'},
        'pathways': {
            'cancer': {'BRCA1', 'TP53', 'EGFR', 'MYC'},
            'growth': {'EGFR', 'MYC', 'KRAS'}
        }
    }

    # Analysis 1: Gene expression summary (using lists and dictionaries)
    gene_expression = dict(zip(experimental_data['genes'], experimental_data['expression_levels']))
    print("Gene Expression Analysis:")
    sorted_genes = sorted(gene_expression.items(), key=lambda x: x[1], reverse=True)
    for gene, expression in sorted_genes:
        status = "significant" if gene in experimental_data['significant_genes'] else "not significant"
        print(f"  {gene}: {expression:.1f} ({status})")

    # Analysis 2: Pathway enrichment (using sets)
    print("\nPathway Enrichment Analysis:")
    significant_genes = experimental_data['significant_genes']
    for pathway, pathway_genes in experimental_data['pathways'].items():
        enriched_genes = significant_genes & pathway_genes
        enrichment_score = len(enriched_genes) / len(pathway_genes) * 100
        print(f"  {pathway}: {enriched_genes} ({enrichment_score:.1f}% enriched)")

    # Analysis 3: Sample validation (using sets)
    print("\nSample Validation:")
    unique_samples = set(experimental_data['sample_ids'])
    print(f"  Total samples: {len(experimental_data['sample_ids'])}")
    print(f"  Unique samples: {len(unique_samples)}")
    print(f"  Duplicates detected: {len(experimental_data['sample_ids']) - len(unique_samples)}")

    return {
        'gene_expression': gene_expression,
        'significant_genes': significant_genes,
        'pathways': experimental_data['pathways']
    }

# Run comprehensive analysis
results = comprehensive_analysis()

**Summary and Key Takeaways**

When to use each data structure:

**Lists:**

* DNA/RNA sequences
* Time-series experimental data
* Ordered collections that may change

**Tuples:**

* Genomic coordinates
* Immutable biological properties
* Fixed experimental conditions

**Dictionaries:**

* Gene annotations
* Species information
* Key-value mappings

**Sets:**

* Unique identifiers
* Sample collections
* Pathway analysis
* Removing duplicates