# 🧬 Project 1 Assessment: Identifying Mutations in NGS Data

This task simulates part of a **next-generation sequencing (NGS) workflow** in which we must identify **mismatch mutations** of nucleotides (nt) within sequencing reads compared to a known reference sequence.

This is a **real-world step in many bioinformatics pipelines**: researchers often compare experimental sequencing data (reads) to a reference genome to find mutations and analyse their patterns.


## 🎓 Learning outcomes
- Practice **loops**, **conditionals**, and **string comparisons**
- Use **pandas** to handle tabular data
- **Visualize biological data** with Python plotting libraries (**Matplotlib** and **Seaborn**)
- Write **modular, robust code** for real bioinformatics workflows

## 📂 Provided materials
- A **200‑nt reference FASTA file**: `reference.txt`
- A **CSV file of 1,000 reads**: `reads.csv` (each 200 nt, *already aligned* to the reference start)
- Starter code to load the **reference sequence** and **reads** as a pandas DataFrame with 2 columns:
  - **read_id**: a unique numbered ID for each read (e.g. read_0001)
  - **sequence**: the base sequence of the read (e.g. ATGGCTAA…)

## 🎯 Your task
Some reads perfectly match the reference; others contain single-base mismatches.

**Write a Python program to identify these mutations.**

<a name='essential'></a>
### ✅ Essential functionality

Your program should:

1.	**Compare each sequencing read to the reference** base-by-base.

2.	**Record the total numbers of:**
  -	Wild-type (WT) reads (without any mutations)
  -	Mutated reads (with 1 or more mutations)

3.	**For mutated reads, record:**
  -	The position of the first mismatch (1–200)
  -	The reference base and the mutated base (i.e. A, C, T, or G) for the first mismatch

4.	**Store the results from step 3 as new columns in the pandas DataFrame**  with the following names:
  -	**wildtype_base**: Reference base.
  -	**mutated_base**: Mutated base.
  -	**mutated_position**: Position of 1st mutation in the read (1–200).

5.	**Present the findings from the mutation analysis (steps 2–4)** by:
  -	**Printing a text summary** to the screen of the:
    - percentage of WT vs. mutated reads
    - percentages of specific base mutations (e.g. A → T, A → C, etc.), ordered from most common to least common.
  -	**Creating publication-ready plots** (i.e. using appropriate plot types; well labelled; good use of style/colour) to represent these results, e.g. visualising:
    - percentage of WT vs. mutated reads
    - percentages of specific base mutations
    - mutation positions and/or hotspots

6.	**Save the updated DataFrame to a new CSV file** called `analysis.csv` for further analysis. The CSV should include rows for every read (i.e. 1,000 plus the header row) with **empty entries in the new columns** (from step 4) if no mutations were found in a given read.

**You can access analysis.csv to check your program's output via the Files area on Colab** (click the 📁 folder icon on the left toolbar). To view the contents of the file, double click it. To download it, move your mouse over the file, click the three dots and choose Download (or right/context click).
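
As an illustration of the base-by-base comparison described in steps 1 and 3, here is a minimal sketch on a short toy sequence. The function name `find_first_mismatch` and the 8-nt toy reference are hypothetical and not part of the starter code; the sketch also assumes both strings are already upper-case and of equal length, which real reads may not be.

```python
from typing import Optional, Tuple


def find_first_mismatch(reference: str, read: str) -> Optional[Tuple[int, str, str]]:
    """Return (position, reference_base, read_base) for the first mismatch, or None.

    Positions are 1-based, matching the 1-200 convention used in the task.
    Assumes both strings are upper-case and the same length (hypothetical helper).
    """
    for index, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != read_base:
            return index + 1, ref_base, read_base
    return None  # no mismatch found: the read is wild-type


toy_reference = "ACGTACGT"  # toy data, not the real 200-nt reference
print(find_first_mismatch(toy_reference, "ACGTACGT"))  # None
print(find_first_mismatch(toy_reference, "ACGAACGT"))  # (4, 'T', 'A')
```

Returning `None` for a wild-type read is one possible design choice; it makes it easy to leave the new columns empty for reads without mutations.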

Your program should also:

-	**Handle unexpected input** (e.g. truncated reads, unknown characters, upper- or lower-case bases) from the CSV file **without crashing** and by **reporting errors** in a user-friendly way — note you can assume that errors are only ever in the `reads.csv` file, not in `reference.txt`.
-	**Use functions well** to keep the code readable and concise and to avoid repetition
-	**Be well commented**, including **docstrings** for functions
-	Follow good practices in naming variables, functions, etc.
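
One possible way to handle unexpected input is to check each read before comparing it and to return an error message instead of raising an exception. The sketch below is only an illustration: the name `clean_read`, the wording of the messages, and the decision to skip (rather than repair) bad reads are all assumptions, not requirements of the task.

```python
from typing import Optional, Tuple

VALID_BASES = set("ACGT")


def clean_read(raw_read: object, expected_length: int) -> Tuple[Optional[str], Optional[str]]:
    """Normalise one read and report why it is unusable, if it is.

    Returns (cleaned_read, None) on success or (None, error_message) on failure.
    Hypothetical helper: the checks and messages are illustrative only.
    """
    if not isinstance(raw_read, str):
        # Covers missing values, which pandas loads as NaN (a float)
        return None, "read is missing or not text"
    read = raw_read.strip().upper()  # accept lower-case bases
    if len(read) != expected_length:
        return None, f"expected {expected_length} nt, got {len(read)}"
    unknown = sorted(set(read) - VALID_BASES)
    if unknown:
        return None, f"unknown characters: {', '.join(unknown)}"
    return read, None


for example in ["acgt", "ACG", "AC%T", None]:
    cleaned, error = clean_read(example, expected_length=4)
    print(repr(example), "->", cleaned if error is None else f"skipped ({error})")
```

Collecting the error messages alongside the read IDs would give a simple report of skipped reads.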

<a name='advanced'></a>
### ⭐ Advanced functionality

You can improve your marks further by implementing code to perform the following:


- **Generate more advanced plot types** that are used in contemporary NGS literature.

-	**Calculate and report/visualize complex and/or statistical results** (e.g. of frequency of mutation at hotspots, types of mutations).

-	**Perform advanced error handling** (for example: outputting informative error messages; logging and reporting skipped reads)
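
As a starting point for frequency-based results, the sketch below counts how often each observation occurs and orders the result from most to least common. The helper name `frequency_table` and the toy values are hypothetical. Treating "hotspot" as simply "most frequently mutated position" is an assumption made here; no statistical test is performed.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Optional, Tuple


def frequency_table(
    observations: Iterable[Hashable], total: int, top_n: Optional[int] = None
) -> List[Tuple[Hashable, int, float]]:
    """Return (item, count, percent_of_total) tuples, most common first.

    Hypothetical helper: `total` is chosen by the caller, e.g. the number of
    mutated reads for substitution types or the number of all reads for positions.
    """
    counts = Counter(observations)
    return [(item, count, 100 * count / total) for item, count in counts.most_common(top_n)]


# Toy data only: substitution types and mutated positions from six imaginary reads
toy_substitutions = ["A>T", "C>G", "A>T", "G>A", "A>T", "C>G"]
toy_positions = [17, 42, 17, 150, 17, 42]

for substitution, count, percent in frequency_table(toy_substitutions, total=6):
    print(f"{substitution}: {count} reads ({percent:.1f}% of mutated reads)")

for position, count, percent in frequency_table(toy_positions, total=20, top_n=2):
    print(f"position {position}: {count} reads ({percent:.1f}% of all reads)")
```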

# 🚀 Getting started
* Run the code in **Steps 0–2** to bring you up to the point where you can start developing your code
* Enter the code for your program in the code block in **Step 3**
* Use **Steps 4 and 5** to test your program and confirm its robustness
* You can re‑run code blocks as many times as you like.

## 🛠 Step 0: Setup
Install and import the libraries we’ll use.

In [None]:
# If running on Colab, uncomment to install dependencies (usually preinstalled):
#!pip install pandas matplotlib seaborn

import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 120)

## 🔽 Step 1: Download the data
These two files are available for you to use in your task:
- `reference.txt` (FASTA-like; 200‑nt reference)
- `reads.csv` (1,000 reads, each 200 nt)

Run the code below to download them into the Colab working directory (they should show up if you click the 📁 Files icon on the left).

In [None]:
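# Download files automatically from this GitHub repo.
# "?raw=true" requests the file contents rather than the GitHub web page,
# and -O saves each file under the name used in the rest of this notebook.
!wget -q -O reference.txt "https://github.com/HocheggerLab/y3-bio-python/blob/main/data/Reference%20Document.txt?raw=true"
!wget -q -O reads.csv "https://github.com/HocheggerLab/y3-bio-python/blob/main/data/ReadsData.csv?raw=true"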

# Optional: confirm they downloaded
!ls -lh reference.txt reads.csv

## 🧬 Step 2: Load reference and reads (using starter code)
Below, we provide some helper code to load the reference from `reference.txt` and the reads from `reads.csv`.

❗ **Do not change the function’s behavior** (you may read it, of course!).

The code creates two variables from the loaded files:
- **reference**: a string containing the 200-nt reference sequence
- **reads_df**: a pandas DataFrame containing the 1,000 reads

In [None]:
def load_reference(path: str) -> str:
  """
  Load the reference nucleotide sequence from the file at the specified path.

  Args:
      path: Path and filename of the 200-nt sequence file in FASTA format.

  Returns:
      ref: String containing the sequence.

  Raises:
      ValueError: If the reference file does not contain 200 nt.
  """
  with open(path) as f:
      lines = [ln.strip() for ln in f if ln.strip()]
  if lines[0].startswith('>'):
      lines = lines[1:]
  ref = "".join(lines).upper()
  if len(ref) != 200:
      raise ValueError(f"Reference must be 200 nt, got {len(ref)}")
  return ref

# Load reference and reads
REF_PATH = "reference.txt"
READS_PATH = "reads.csv"

reference = load_reference(REF_PATH)
reads_df = pd.read_csv(READS_PATH)

# Confirm that reference and reads were loaded correctly
print("Reference length:", len(reference))
reads_df.head()

## 🐍 Step 3: Enter your code here!

Remember, at minimum your code should:
*   **Print a summary of your mutation analysis** to the screen
*   **Create a basic plot** to represent number/percentage of WT and mutated reads.
*   **Store information in the `reads_df` DataFrame** on the first mutation of each read in new columns called **wildtype_base**, **mutated_base**, and **mutated_position**.
* **Save the `reads_df` DataFrame** to a CSV file called `analysis.csv`.
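
The sketch below shows, on a two-row toy DataFrame, one way the new columns could be added so that wild-type reads end up with empty entries in the CSV. The toy values are made up, and an in-memory buffer stands in for `analysis.csv`. Using the nullable `Int64` dtype is a design choice assumed here so that positions stay as integers (e.g. `200` rather than `200.0`) when some entries are missing.

```python
import io

import pandas as pd

# Toy data only: one wild-type read and one mutated read
toy_df = pd.DataFrame({"read_id": ["read_1", "read_2"]})

# One entry per read: None for a wild-type read, real values for a mutated read
toy_df["wildtype_base"] = [None, "C"]
toy_df["mutated_base"] = [None, "G"]
# Nullable integer dtype keeps positions as whole numbers despite missing values
toy_df["mutated_position"] = pd.array([None, 200], dtype="Int64")

buffer = io.StringIO()  # stand-in for open("analysis.csv", "w")
toy_df.to_csv(buffer, index=False)  # missing values are written as empty fields
print(buffer.getvalue())
```

In a real program, passing a filename such as `"analysis.csv"` to `to_csv` instead of the buffer would write the file to disk.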

# ❗ **IMPORTANT: Make sure that the code block below contains all the code for your program. This is the only code block that will be used to mark your program.**




In [None]:
# Write your program code here!


## 🧪 Step 4: Test your code
It is important to check that your code gives the expected outputs. One way to do this is using "test data" with known numbers of mutations.

The code below generates a simple **test reference sequence** and a **test reads dataframe** with two WT reads and two mutant reads (50% each). This test data overwrites `reference` and `reads_df`, so re-run Step 2 if you want to re-load the original data.

If it's working correctly, your code (Step 3) should identify a mutation at position 200 in `read_2` and at position 2 in `read_4`.

In [None]:
# Generate test data
reference = "A"*199 + "C"

test_read_wt = "A"*199 + "C"
test_read_mut = "A"*199 + "G"
test_read_mut_2 = "A" + "G" + "A"*197 + "C"

reads_df = pd.DataFrame({"read_id": ["read_1", "read_2", "read_3", "read_4"],
                         "read": [test_read_wt, test_read_mut, test_read_wt, test_read_mut_2]})

# Show the first rows of the reads_df test data
reads_df.head()

You can continue to adapt the test data generation code above to test your program. For example, does your program output the mutated base changes correctly to `analysis.csv`?
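
One way to check the output file is to read it back and confirm that it has the expected columns and number of rows. The sketch below is only a starting point: the helper name `check_analysis_csv` is hypothetical, and the demo writes a small temporary file rather than reading your real `analysis.csv`.

```python
import csv
import os
import tempfile
from typing import List

EXPECTED_COLUMNS = ["wildtype_base", "mutated_base", "mutated_position"]


def check_analysis_csv(path: str, expected_rows: int) -> List[str]:
    """Return a list of problems found in the output CSV (empty list = none found).

    Hypothetical helper: it checks only column names and the row count.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in EXPECTED_COLUMNS if name not in header]
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    return problems


# Demo on a temporary toy file (not the real analysis.csv)
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "analysis.csv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("read_id,wildtype_base,mutated_base,mutated_position\n")
        handle.write("read_1,,,\nread_2,C,G,200\n")
    print(check_analysis_csv(demo_path, expected_rows=2))  # []
    print(check_analysis_csv(demo_path, expected_rows=4))  # ['expected 4 rows, found 2']
```

With the four-read test data above, `expected_rows=4` would be the value to check against.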

## 💣 Step 5: Develop robustness to errors

An essential part of good coding practice is making sure that your program **doesn't crash**, no matter what is thrown at it!

In a similar way to the test data generation code above, we can generate further test data to simulate 3 specific types of errors in the input `reference` and `reads_df` data:

1. Truncated reads (i.e. reads shorter than 200 nt)
2. Lower-case bases (e.g. a lower-case base character such as `a` instead of `A`)
3. Unknown characters (i.e. any letter or character except A, C, T or G, such as E, j, %, @, etc.)

### ✂ Truncated reads

In [None]:
# Generate test data for truncated reads
reference = "A"*199 + "C"

test_read_wt = "A"*199 + "C"
test_read_wt_2 = "A"*100 + "C"
test_read_mut = "A"*199 + "G"
test_read_mut_2 = "A" + "G" + "A"*100 + "C"

reads_df = pd.DataFrame({"read_id": ["read_1", "read_2", "read_3", "read_4"],
                         "read": [test_read_wt, test_read_mut, test_read_wt_2, test_read_mut_2]})

# Show the first rows of the reads_df test data
reads_df.head()

### 🔡 Lower-case bases

In [None]:
# Generate test data for lower-case bases
reference = "A"*199 + "c"

test_read_wt = "a"*199 + "C"
test_read_mut = "A"*199 + "g"
test_read_mut_2 = "A" + "g" + "a"*197 + "C"

reads_df = pd.DataFrame({"read_id": ["read_1", "read_2", "read_3", "read_4"],
                         "read": [test_read_wt, test_read_mut, test_read_wt, test_read_mut_2]})

# Show the first rows of the reads_df test data
reads_df.head()

### ⁉ Unknown characters

Based on the previous examples, write code here to generate test data containing unknown characters (i.e. letters or symbols except A, C, T or G):

In [None]:
# Generate test data for unknown characters


# 👣 Next steps

✅ **Check:**
- Does your program (Step 3) implement everything listed in the [essential functionality](#essential) section?
- Does it follow **good coding practices** in the use of functions, commenting, and variable naming?
- Does it **pass all the tests** so far (Steps 4 and 5)?

If so, congratulations! 🥳

Now, you can develop it further by implementing the features in the [advanced functionality](#advanced) section.