# Problem 5: Computing GC Content [GC]
**Source:** [Rosalind - Computing GC](https://rosalind.info/problems/gc/)
## 1. Biology Context 
**What is GC Content?**
In DNA, the bases are Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). The GC content is the percentage of nitrogenous bases in a DNA molecule (or fragment) that are either Guanine or Cytosine.

**Why does it matter?**
- **Stability:** G-C pairs are bound by 3 hydrogen bonds, while A-T pairs have only 2. Higher GC content means higher thermal stability (higher melting temperature, Tm).
- **Genomic Signature:** GC content varies significantly between species and can be used to classify bacteria (Taxonomy).
- **PCR Bias:** In lab experiments, high GC regions are harder to amplify/sequence.
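
The link between GC content and melting temperature can be illustrated with the Wallace rule, a rough rule of thumb for short oligonucleotides that assigns about 2 °C per A/T base and 4 °C per G/C base. The sketch below is only an approximation under that assumption; the function name `wallace_tm` and the two example sequences are made up for illustration.

```python
def wallace_tm(oligo: str) -> int:
    """Rough melting temperature in deg C for a short oligo: 2*(A+T) + 4*(G+C)."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


# Two hypothetical 12-base oligos: one with 0% GC, one with 100% GC
at_rich = "ATATATATATAT"
gc_rich = "GCGCGCGCGCGC"

print(wallace_tm(at_rich))  # 24
print(wallace_tm(gc_rich))  # 48
```

Both oligos have the same length, yet the GC-rich one gets twice the estimated Tm, which is the stability effect described in the list above.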

## 2. Algorithmic Logic 
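The challenge here is **parsing the FASTA format**. Each record starts with a header line beginning with `>`, followed by one or more lines of sequence that must be joined back together.
A FASTA file looks like this: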
```text
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC

```

```python
def read_fasta(file_path):
    # Dictionary that will hold the parsed records, in the form
    # {"Rosalind_6404": "CCTGC...", "Rosalind_5959": "CCATC..."}
    fasta_dict = {}
    # Remembers which record is currently being filled. By the time a sequence
    # line is read, its header line is long gone, so the label is kept here
    # until the next header replaces it.
    current_label = ''
    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                # Drop the leading ">" and keep only the name, e.g. "Rosalind_6404"
                current_label = line[1:]
                # Reserve a slot with an empty string; later lines are appended to it
                fasta_dict[current_label] = ''
            else:
                # Concatenate: each new line is glued onto the end of the sequence so far
                fasta_dict[current_label] += line
    return fasta_dict


def calculate_gc(dna_seq):
    if len(dna_seq) == 0:
        return 0
    count = dna_seq.count('C') + dna_seq.count('G')
    return (count / len(dna_seq)) * 100


data = read_fasta('rosalind_gc.txt')
best_id = ''
# Start at -1: the lowest possible GC content is 0, so the first record always wins
best_gc = -1
# .items() yields each (name, sequence) pair; looping over the dict alone gives only the names.
# The for statement itself assigns seq_id and sequence on every pass. On the first pass,
# seq_id = "Rosalind_6404" and sequence = "CCTGCGGAAG..." are set before the loop body runs.
for seq_id, sequence in data.items():
    gc = calculate_gc(sequence)
    if gc > best_gc:
        best_gc = gc
        best_id = seq_id
print(f'{best_id}')
print(f'{best_gc:.6f}')   # keep six decimal places
```

```text
Rosalind_4506
53.125000
```


```python
# !pip install biopython   # uncomment to install Biopython from a notebook cell
from Bio import SeqIO                  # SeqIO handles sequence input/output
from Bio.SeqUtils import gc_fraction   # computes the G+C fraction (Biopython 1.80 and later)


def solve_with_biopython(file_path):
    best_id = ''
    # For "find the maximum" problems, start below the theoretical minimum
    # (e.g. -1 here, or float('-inf')) so the first record always wins.
    best_gc = -1
    # SeqIO.parse replaces the hand-written loop, if/else, and dictionary
    # concatenation from the manual parser; 'fasta' tells it which format rules to apply.
    for record in SeqIO.parse(file_path, 'fasta'):
        # record is an object, not a plain string: record.id is the name after ">",
        # and record.seq is the full sequence with its lines already joined.
        # gc_fraction returns a fraction such as 0.5, so multiply by 100.
        current_gc = gc_fraction(record.seq) * 100
        if current_gc > best_gc:
            best_gc = current_gc
            best_id = record.id
    print(f'{best_id}')
    print(f'{best_gc:.6f}')


solve_with_biopython('rosalind_gc.txt')
```

```text
Rosalind_4506
53.125000
```


## Reflection on Problem 5: Computing GC Content

### 1. The Challenge: Parsing FASTA Format
This problem was a turning point. Unlike previous problems, where the input was a single plain string, here I had to handle the **FASTA format**, the de facto standard text format for biological sequences.
The main challenge was not calculating the percentage, but **parsing multi-line data**:
- **Logic:** I learned to treat file reading as a "state machine". The variable `current_label` holds the state (which record is being read), and each sequence line is concatenated onto the dictionary value stored under that key. This was a great exercise in algorithmic thinking.
- **Edge Cases:** I realized the importance of initializing the `best_gc` variable to `-1` instead of `0`. A sequence can legitimately have 0% GC content (e.g., a poly-A tail), and because the comparison is a strict `gc > best_gc`, a starting value of `0` would never select such a record. `-1` is a safer boundary condition.
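
The effect of the starting value can be checked with a small experiment. The sketch below reuses the same `calculate_gc` logic and the same strict `>` comparison; the helper `best_record` and the two record names are hypothetical and exist only for this demonstration.

```python
def calculate_gc(dna_seq):
    if len(dna_seq) == 0:
        return 0
    return (dna_seq.count("C") + dna_seq.count("G")) / len(dna_seq) * 100


def best_record(records, start):
    """Return (id, gc) of the highest-GC record, starting the search at `start`."""
    best_id, best_gc = "", start
    for seq_id, seq in records.items():
        gc = calculate_gc(seq)
        if gc > best_gc:  # strict comparison, as in the solution
            best_id, best_gc = seq_id, gc
    return best_id, best_gc


# Hypothetical input in which every record has 0% GC
records = {"polyA_1": "AAAAAAAA", "polyT_1": "TTTT"}

print(best_record(records, 0))   # ('', 0)  -> no record is ever selected
print(best_record(records, -1))  # ('polyA_1', 0.0)

# Alternative that needs no starting value at all
print(max(records, key=lambda name: calculate_gc(records[name])))  # polyA_1
```

Using `max` with a `key` function sidesteps the sentinel entirely, at the cost of raising `ValueError` on an empty dictionary.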

### 2. From "Manual" to "Industrial" (Biopython)
I implemented two solutions:
1.  **Native Python:** Manually parsing strings using `dict` and loops. This taught me how `for` loops and variable scope work (and how to debug `IndentationError`!).
2.  **Biopython:** Using `SeqIO.parse`. This was an eye-opener. It reduced roughly 20 lines of manual parsing logic to about 3 lines of robust code.
    *   *Key Takeaway:* While libraries like Biopython are powerful, writing the manual parser first gave me the confidence to understand **what the library is actually doing under the hood**.
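
To make the "under the hood" idea concrete, here is a minimal sketch of a parser that yields one record at a time instead of building a whole dictionary first. It only loosely mirrors the record-by-record iteration style of `SeqIO.parse` and is not Biopython's actual implementation; the name `iter_fasta` and the sample records are assumptions made for this example.

```python
import io


def iter_fasta(handle):
    """Yield (label, sequence) pairs from an open FASTA text handle."""
    label, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        elif label is not None:
            chunks.append(line)
    # The last record has no following header, so emit it here
    if label is not None:
        yield label, "".join(chunks)


# Hypothetical two-record input, held in memory instead of a file
sample = io.StringIO(">seq_a\nACGT\nGG\n\n>seq_b\nTTTT\n")
records = list(iter_fasta(sample))
print(records)  # [('seq_a', 'ACGTGG'), ('seq_b', 'TTTT')]
```

Collecting lines in a list and joining them once avoids repeated string concatenation, and yielding records lazily means a large file never has to fit in memory all at once.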

### 3. Debugging Journey
As a beginner, I encountered several errors (e.g., `TypeError: object of type 'int' has no len()`). Fixing these taught me to pay attention to data types, in particular to distinguish between the **count** (an integer) and the **sequence** (a string).
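
The original buggy line is not shown here, so the following is only one plausible way to trigger that error: calling `len()` on the GC count instead of on the sequence. The variable names are illustrative.

```python
seq = "GGCCAT"
gc_count = seq.count("G") + seq.count("C")  # an int (4), not a string

try:
    len(gc_count)  # wrong: an integer has no length
except TypeError as err:
    message = str(err)
    print(message)  # object of type 'int' has no len()

# Correct: divide the count by the length of the sequence
gc_percent = gc_count / len(seq) * 100
print(f"{gc_percent:.6f}")  # 66.666667
```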

**Next Step:** I am now ready to handle real-world sequence files, not just sample strings!