<a href="https://colab.research.google.com/github/Ash100/Statistical_Analysis/blob/main/Week_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📘 Week 1 Practical – Python & Bioinformatics Setup in Colab

**Course:** Statistical Analysis in Bioinformatics  
**Week 1 Focus:** Python, Google Colab setup, Pandas, and Biopython basics.  

---

## 🎯 Learning Objectives
By the end of this practical, you should be able to:
- Set up a Python environment in Google Colab.  
- Recall Python basics (variables, lists, functions).  
- Use `pandas` for handling biological datasets.  
- Perform sequence manipulations with `Biopython`.  
- Fetch gene sequences from **NCBI** for analysis.  

---

## 🔧 1. Setting up Environment
We first check the Python version and install required libraries: **pandas** (for data handling) and **biopython** (for sequence analysis).



In [None]:
# Check Python version
import sys
print("Python version:", sys.version)

# Install required libraries
!pip install biopython pandas
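
After the install step, it can be useful to confirm which versions were actually picked up. The snippet below is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name introduced here for illustration, not part of the course material.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two libraries this practical relies on
for name in ("biopython", "pandas"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If either line reports `not installed`, re-run the `pip install` cell before continuing.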


## 🐍 2. Python Basics Refresher
Python is the backbone of many bioinformatics analyses.  
Here we practice:
- Declaring variables  
- Iterating through lists  
- Writing simple functions  
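
The three ideas listed above can be combined in one short sketch. `gc_percent` is a hypothetical helper name, and two of its behaviours are assumptions made for illustration: lowercase input is normalised to uppercase, and an empty sequence returns `0.0` instead of raising a `ZeroDivisionError`.

```python
def gc_percent(seq):
    """Return GC content as a percentage; 0.0 for an empty sequence (assumed convention)."""
    seq = seq.upper()  # accept lowercase input such as "atgc"
    if not seq:
        return 0.0  # avoid dividing by zero on empty input
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq) * 100


# A variable holding a list, a loop over it, and a function call per item
sequences = ["ATGCGC", "atgc", ""]
for s in sequences:
    print(repr(s), "->", round(gc_percent(s), 2))
```

Whether an empty sequence should return `0.0` or raise an error is a design choice; raising is often safer in a real pipeline because it surfaces missing data early.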


In [2]:
# Variables and printing
gene = "BRCA1"
mutations = 15
print(f"Gene: {gene}, Mutations: {mutations}")

# Lists and loops
nucleotides = ["A", "T", "G", "C"]
for n in nucleotides:
    print("Nucleotide:", n)

# Functions
def gc_content(seq):
    g = seq.count("G")
    c = seq.count("C")
    return (g + c) / len(seq) * 100

print("GC content of ATGCGC:", gc_content("ATGCGC"))


Gene: BRCA1, Mutations: 15
Nucleotide: A
Nucleotide: T
Nucleotide: G
Nucleotide: C
GC content of ATGCGC: 66.66666666666666


## 📊 3. Using Pandas for Mutation Data
`pandas` is essential for managing **tabular biological data** such as gene mutations, expression profiles, or variant annotations.


In [5]:
import pandas as pd

# Example: Mutation dataset (mock data)
data = {
    "Gene": ["BRCA1","BRCA2","TP53","EGFR"],
    "Mutation": ["A>G","T>C","G>A","C>T"],
    "Pathogenicity": ["High","Medium","High","Low"]
}
df = pd.DataFrame(data)
df


    Gene Mutation Pathogenicity
0  BRCA1      A>G          High
1  BRCA2      T>C        Medium
2   TP53      G>A          High
3   EGFR      C>T           Low


In [4]:
# Basic operations
print("Columns:", df.columns)
print("\nUnique Genes:", df["Gene"].unique())
print("\nPathogenic mutations:\n", df[df["Pathogenicity"]=="High"])


Columns: Index(['Gene', 'Mutation', 'Pathogenicity'], dtype='object')

Unique Genes: ['BRCA1' 'BRCA2' 'TP53' 'EGFR']

Pathogenic mutations:
     Gene Mutation Pathogenicity
0  BRCA1      A>G          High
2   TP53      G>A          High
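
Beyond filtering, the same mock table can be summarised and ranked. The sketch below rebuilds the DataFrame so it runs on its own; the ordering `Low < Medium < High` is an assumption about how the pathogenicity labels should be ranked.

```python
import pandas as pd

# Same mock mutation data as above
df = pd.DataFrame({
    "Gene": ["BRCA1", "BRCA2", "TP53", "EGFR"],
    "Mutation": ["A>G", "T>C", "G>A", "C>T"],
    "Pathogenicity": ["High", "Medium", "High", "Low"],
})

# How many mutations fall into each pathogenicity level?
counts = df["Pathogenicity"].value_counts()
print(counts)

# Which genes belong to each level?
genes_by_level = df.groupby("Pathogenicity")["Gene"].apply(list)
print(genes_by_level)

# Rank rows from most to least severe using an ordered categorical
# (assumed ordering: Low < Medium < High)
levels = ["Low", "Medium", "High"]
df["Pathogenicity"] = pd.Categorical(df["Pathogenicity"], categories=levels, ordered=True)
ranked = df.sort_values("Pathogenicity", ascending=False)
print(ranked)
```

Without the ordered categorical, sorting would be alphabetical (`High`, `Low`, `Medium`), which does not reflect severity.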


## 🧬 4. Biopython: Sequence Manipulation
`Biopython` provides tools for biological sequence analysis.  
Here we will:
- Create a DNA sequence  
- Calculate its complement and reverse complement  
- Compute GC fraction  


In [6]:
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction

# Create a DNA sequence
seq = Seq("ATGGCGTACGTAGCTAGC")

print("Sequence:", seq)
print("Length:", len(seq))
print("Complement:", seq.complement())
print("Reverse complement:", seq.reverse_complement())
print("GC fraction:", gc_fraction(seq))


Sequence: ATGGCGTACGTAGCTAGC
Length: 18
Complement: TACCGCATGCATCGATCG
Reverse complement: GCTAGCTACGTACGCCAT
GC fraction: 0.5555555555555556
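
To see what these operations do under the hood, here is a plain-Python sketch of the same three calculations. It is not Biopython's implementation: the helper names are hypothetical, and it assumes an uppercase, non-empty sequence containing only `A`, `T`, `G`, and `C`.

```python
# Translation table mapping each base to its Watson-Crick partner
COMPLEMENT_TABLE = str.maketrans("ATGC", "TACG")


def complement(seq):
    """Swap every base for its partner (A<->T, G<->C)."""
    return seq.translate(COMPLEMENT_TABLE)


def reverse_complement(seq):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(seq)[::-1]


def gc_fraction_simple(seq):
    """Fraction (0 to 1) of bases that are G or C; assumes a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


dna = "ATGGCGTACGTAGCTAGC"
print("Complement:", complement(dna))
print("Reverse complement:", reverse_complement(dna))
print("GC fraction:", gc_fraction_simple(dna))
```

For this sequence the results agree with the Biopython output shown above. Biopython is still preferable in practice because it also handles ambiguous bases and mixed case.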


## 🔗 5. Fetch Sequence from NCBI
We can query **NCBI databases** directly using Biopython’s `Entrez`.  
⚠️ Set `Entrez.email` to your own address first: NCBI asks for a contact email so it can reach you about excessive usage before blocking access.  


In [None]:
from Bio import Entrez, SeqIO

Entrez.email = "your_email@example.com"  # change to your email

# Example: fetch the TP53 mRNA reference transcript (RefSeq NM_000546) from the nucleotide database
handle = Entrez.efetch(db="nucleotide", id="NM_000546", rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
handle.close()

print("Record ID:", record.id)
print("Description:", record.description)
print("Length:", len(record.seq))
print("First 100 bases:", record.seq[:100])
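
The `rettype="fasta"` request returns plain text: one header line starting with `>`, followed by sequence lines. The offline sketch below illustrates how such text maps onto an ID, a description, and a sequence. It is an illustration only, not how `SeqIO.read` is implemented; `parse_single_fasta` is a hypothetical helper, and the record is mock data, not a real NCBI entry.

```python
def parse_single_fasta(text):
    """Parse FASTA text holding exactly one record into (id, description, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("expected exactly one record")
    description = lines[0][1:]  # header without the leading '>'
    record_id = description.split()[0] if description else ""  # first word of the header
    sequence = "".join(lines[1:])  # sequence lines are simply concatenated
    return record_id, description, sequence


# Mock FASTA text for illustration (hypothetical ID, not real NCBI data)
mock_fasta = """>mock_seq_1 illustrative record, not real NCBI data
ATGGCGTACG
TAGCTAGC
"""

record_id, description, sequence = parse_single_fasta(mock_fasta)
print("Record ID:", record_id)
print("Description:", description)
print("Length:", len(sequence))
```

Working through a small mock record like this makes it easier to interpret the `record.id`, `record.description`, and `record.seq` fields printed for the fetched sequence.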
