# Week 1 — Sequence Analysis Toolkit (Revision)

**Author:** Dheeraj Babu

This notebook summarizes the Week 1 mini-project: a reproducible sequence analysis tool
(FASTA parsing, ORF finding, translation, CSV export, and plots). It includes code snippets,
explanations, and links to example plots generated during the week.


## Setup

Make sure the `bioinfo` conda environment is available (it is defined in `environment.yml`). To recreate it locally:

```bash
conda env create -f environment.yml
conda activate bioinfo
pip install -r requirements.txt  # optional: only if the repo includes a requirements.txt
```

Primary scripts in the repo:
- `enhanced_seq_analyzer_cli.py` — main CLI tool
- `plot_fasta_stats.py` — plotting helper



## Quick example — run the analyzer on a FASTA

Run the CLI on a sample FASTA (example shown with `orf_test.fasta`):

```bash
python3 enhanced_seq_analyzer_cli.py orf_test.fasta --orf --both-strands --allow-partial --write-orf-fasta -o orf_test_out.csv --plot-stats -v
```

This produces:
- `orf_test_out.csv`
- `orf_test_out_orfs.nuc.fa` and `orf_test_out_orfs.aa.fa`
- `orf_test_out_plots/gc_distribution.png`
- `orf_test_out_plots/protlen_distribution.png`
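
A quick way to confirm the run produced everything is to check for the files listed above. The sketch below is illustrative only: `expected_outputs` is a hypothetical helper, and the naming pattern (suffixes appended to the CSV name without its extension) is inferred from this one example rather than taken from the CLI source.

```python
from pathlib import Path

def expected_outputs(csv_path):
    """List the files the example run is expected to create.

    Assumption: names are derived from the CSV name minus its extension,
    as in the orf_test_out.csv example. The real CLI may differ.
    """
    csv_path = Path(csv_path)
    stem = csv_path.with_suffix("")          # e.g. orf_test_out
    plots = Path(f"{stem}_plots")
    return [
        csv_path,
        Path(f"{stem}_orfs.nuc.fa"),
        Path(f"{stem}_orfs.aa.fa"),
        plots / "gc_distribution.png",
        plots / "protlen_distribution.png",
    ]

# Report which of the expected files are present in the working directory.
for path in expected_outputs("orf_test_out.csv"):
    status = "ok" if path.exists() else "MISSING"
    print(f"{status:8}{path}")
```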


```python
import pandas as pd

# Read the example CSV (adjust the filename if needed)
csv_path = 'orf_test_out.csv'
try:
    df = pd.read_csv(csv_path)
    print('CSV loaded:', csv_path)
    display(df.head())  # display() is provided by Jupyter/IPython
except FileNotFoundError:
    print('Example CSV not found. Run the CLI first to generate it.')
```


```python
import matplotlib.pyplot as plt
import pandas as pd

csv_path = 'orf_test_out.csv'
try:
    df = pd.read_csv(csv_path)

    # GC% distribution
    plt.figure(figsize=(6, 3))
    df['gc_percent'].dropna().astype(float).plot.hist(bins=20)
    plt.xlabel('GC%')
    plt.ylabel('Count')
    plt.title('GC% distribution')
    plt.tight_layout()
    plt.show()

    # Protein length distribution; fall back to an empty float Series
    # if the prot_len column is absent (e.g. the CLI ran without --orf)
    plt.figure(figsize=(6, 3))
    prot = pd.to_numeric(df.get('prot_len', pd.Series(dtype=float)), errors='coerce').dropna()
    plt.hist(prot, bins=20)
    plt.xlabel('Protein length (aa)')
    plt.ylabel('Count')
    plt.title('Protein length distribution')
    plt.tight_layout()
    plt.show()
except FileNotFoundError:
    print('CSV not found; run the CLI to create orf_test_out.csv')
```


## Interpreting outputs

- `gc_percent`: GC content of each sequence, as a percentage; useful for characterizing genome composition and flagging possible contamination.
- `prot_len`, `prot_seq_first50`, `prot_mw`, `prot_pI`, `aa_counts`: protein properties for the longest ORF found per sequence.
- `orf_frame`, `orf_strand`, `orf_start`, `orf_end`: reading frame, strand, and coordinates of the reported ORF (`orf_start` and `orf_end` are 1-based and inclusive).

**Note:** Coordinates are relative to the original sequence. If `orf_strand` is `-`, extract the subsequence at those coordinates first, then reverse-complement it to obtain the coding sequence in 5'→3' orientation.
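
To make the coordinate convention concrete, here is a small self-contained sketch using plain strings and toy sequences. `extract_orf` and `reverse_complement` are hypothetical helpers written for illustration, not functions from the CLI; they assume unambiguous DNA (A, C, G, T) and the 1-based inclusive coordinates described above.

```python
# Complement table for unambiguous DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def extract_orf(seq: str, start: int, end: int, strand: str) -> str:
    """Return the coding sequence for 1-based inclusive [start, end].

    A 1-based inclusive range maps to the Python slice [start - 1:end].
    For the '-' strand the slice is reverse-complemented afterwards.
    """
    sub = seq[start - 1:end]
    return reverse_complement(sub) if strand == "-" else sub

# Toy example: the same ORF seen on each strand.
print(extract_orf("CCATGAAATAGGG", 3, 11, "+"))  # ATGAAATAG
print(extract_orf("CCCTATTTCATGG", 3, 11, "-"))  # ATGAAATAG
```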


```python
import pandas as pd
from Bio import SeqIO

fasta = 'orf_test.fasta'
csv_path = 'orf_test_out.csv'

try:
    seqs = {rec.id: rec.seq for rec in SeqIO.parse(fasta, 'fasta')}
    df = pd.read_csv(csv_path)
    # Inspect the first CSV row. Blank cells load as NaN, which is truthy,
    # so test with pd.notna() rather than a plain `if`.
    row = df.iloc[0]
    if pd.notna(row['orf_start']) and pd.notna(row['orf_end']):
        start = int(row['orf_start'])
        end = int(row['orf_end'])
        # 1-based inclusive coordinates -> 0-based half-open slice
        seq = seqs[row['id']][start-1:end]
        if row['orf_strand'] == '-':
            seq = seq.reverse_complement()
        print('Extracted ORF nucleotide (first 200 bp):')
        print(str(seq)[:200])
        print('\nTranslated (first 50 aa):')
        # Trim to whole codons so a partial ORF translates without a warning
        coding = seq[:len(seq) - len(seq) % 3]
        print(str(coding.translate())[:50])
    else:
        print('No ORF recorded in CSV row')
except FileNotFoundError:
    print('FASTA or CSV not found in the notebook folder. Run the CLI to generate them.')
```


## Example Plots

If you committed the `*_plots` folders to the repository, GitHub will render these images. Locally, run the plotting cell above or open the images:

**Example files produced during Week 1:**

- `orf_test_out_plots/gc_distribution.png`
- `orf_test_out_plots/protlen_distribution.png`

Below are embedded images (they will display in Jupyter if the `orf_test_out_plots` folder sits next to this notebook):

![GC distribution](orf_test_out_plots/gc_distribution.png)

![Protein length distribution](orf_test_out_plots/protlen_distribution.png)



## Week 1 Checklist — Completed

- [x] Parse FASTA with Biopython (SeqIO)
- [x] Compute GC%, length, base composition
- [x] Find ORFs (3 reading frames, optionally on both strands)
- [x] Translate and compute protein stats (MW, pI)
- [x] Export CSV and ORF FASTAs
- [x] Generate GC% and protein length plots
- [x] Unit tests (pytest) and CI (GitHub Actions)
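
For readers who want to see what the ORF-finding item involves, below is a minimal pure-Python sketch of a three-frame, two-strand scan. It is not the code in `enhanced_seq_analyzer_cli.py`: the function names, the 1-to-3 frame numbering, and the rule "first ATG up to the next in-frame stop, complete ORFs only" are assumptions made for illustration, and partial ORFs (as enabled by `--allow-partial`) are not handled.

```python
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def orfs_in_strand(seq):
    """Yield (frame, start0, end0) for each ATG...stop ORF on one strand.

    start0 is 0-based, end0 is exclusive and includes the stop codon.
    """
    for frame in range(3):
        start0 = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start0 is None and codon == "ATG":
                start0 = i                      # open an ORF at the first ATG
            elif start0 is not None and codon in STOPS:
                yield frame, start0, i + 3      # close it at the stop codon
                start0 = None

def find_orfs(seq, both_strands=True):
    """Return (strand, frame, start, end) tuples.

    start and end are 1-based inclusive positions on the forward strand.
    """
    seq = seq.upper()
    n = len(seq)
    found = [("+", f + 1, s + 1, e) for f, s, e in orfs_in_strand(seq)]
    if both_strands:
        for f, s, e in orfs_in_strand(reverse_complement(seq)):
            # Map reverse-complement positions back to forward coordinates.
            found.append(("-", f + 1, n - e + 1, n - s))
    return found

print(find_orfs("CCATGAAATAGGG"))  # [('+', 3, 3, 11)]
```

The longest ORF can then be picked with `max(orfs, key=lambda o: o[3] - o[2])`. Small fixed inputs like the one above also make convenient pytest cases.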

---

### Next steps (suggested)
- Add a `demo.ipynb` showing CLI outputs inline (this notebook is that demo)
- Run the tool on additional bacterial genomes and compare GC distributions
- Prepare for Week 2: RNA-seq (DESeq2), count matrices, and differential expression analysis
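
As a starting point for the GC comparison step, the sketch below summarizes the `gc_percent` column of one output CSV using only the standard library. `gc_summary` is a hypothetical helper; it assumes each genome has its own CSV with the same `gc_percent` column as `orf_test_out.csv`.

```python
import csv
import math
import statistics

def gc_summary(csv_path, column="gc_percent"):
    """Return (n, mean, median) for the GC% column of one CSV.

    Blank, missing, non-numeric, and NaN cells are skipped.
    """
    values = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            try:
                value = float(row.get(column, ""))
            except (TypeError, ValueError):
                continue                      # blank or non-numeric cell
            if not math.isnan(value):
                values.append(value)
    if not values:
        return 0, float("nan"), float("nan")
    return len(values), statistics.mean(values), statistics.median(values)

# Example usage (file names here are placeholders):
# for path in ["genome_a_out.csv", "genome_b_out.csv"]:
#     n, mean, median = gc_summary(path)
#     print(f"{path}: n={n} mean={mean:.1f} median={median:.1f}")
```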
