# This lab contains exercises for the Bioinformatics with Python Lab held on November 19th, 2024 during the CSHL Advanced Sequencing Technologies & Bioinformatics Analysis course

## Using `requests` to extract data from an API

#### First, let's import the `requests` library

In [3]:
import requests

#### 1. We want to use `requests` to query the Ensembl API and get back a record for the TCERG1 gene. How would you do this? Type your answer in the cell below. The Ensembl documentation for this endpoint can be viewed here: https://rest.ensembl.org/documentation/info/symbol_lookup

In [None]:
gene_symbol = "TCERG1"
ensembl_gene_url = f"https://rest.ensembl.org/lookup/symbol/homo_sapiens/{gene_symbol}?content-type=application/json"
# ensembl_gene_record =  # Uncomment and type code here

#### 2. How would you view the JSON output for this request?

In [None]:
# Uncomment and type code here
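
The general pattern behind the two questions above can be sketched as follows. This is a minimal sketch, not the lab's reference answer: the helper names `build_lookup_url`, `fetch_record`, and `pretty` are hypothetical, the 10-second timeout is an arbitrary choice, and `example_record` is a placeholder dictionary rather than real Ensembl output.

```python
import json


def build_lookup_url(gene_symbol, species="homo_sapiens"):
    """Build a symbol-lookup URL in the same shape as the cell above."""
    return (f"https://rest.ensembl.org/lookup/symbol/{species}/{gene_symbol}"
            "?content-type=application/json")


def fetch_record(url, timeout=10):
    """Send a GET request and return the decoded JSON body."""
    import requests  # imported here so the other helpers also work offline
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.json()       # decode the JSON body into Python objects


def pretty(record):
    """Format a decoded JSON record with indentation for easier reading."""
    return json.dumps(record, indent=2, sort_keys=True)


# Placeholder payload (hypothetical values, NOT real Ensembl output)
example_record = {"display_name": "TCERG1", "id": "PLACEHOLDER_ID"}

print(build_lookup_url("TCERG1"))
print(pretty(example_record))
```

With network access, `pretty(fetch_record(build_lookup_url("TCERG1")))` would print the live record in the same indented form.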

## Introduction to Web Scraping using `beautifulsoup4` 

#### We're now going to be using `beautifulsoup4` to practice web scraping from the course website: https://meetings.cshl.edu/courses.aspx?course=C-SEQTEC

In [1]:
from bs4 import BeautifulSoup

#### 1. How would you get the list of invited speakers for the course? Type your code in the cell below:

In [6]:
url = 'https://meetings.cshl.edu/courses.aspx?course=C-SEQTEC'
response = requests.get(url)
html_content = response.text
cshl_webpage = BeautifulSoup(html_content, "html.parser")
cshl_webpage.find('div', class_='cspeakers16').find("div", class_="cspeakers16")

<div class="cspeakers16">
<p class="MsoNormal"><b>Katie Campbell, </b>University of California, Los Angles, Los Angles, CA<br/>
<b>Bimal Chaudhary, </b>Nationwide Children's, Powell, OH<br/>
<b>Justin Kinney, </b>Cold Spring Harbor Laboratory, Cold Spring Harbor, NY<br/>
<b>Yang Li, </b>Washington University in St. Louis, Saint Louis, MO<br/>
<b>Zachary Lippman, </b>CSHL/HHMI, Cold Spring Harbor, NY<br/>
<b>Jessica Mozersky, </b>Washington University in St Louis, St Louis, MO<br/>
<b>Adam Phillippy, </b>National Human Genome Research Institute, Bethesda, MA<br/>
<b>Alex Wagner,</b> Nationwide Children's Hospital, Dublin, OH<br/>
<span style="font-weight: bold;">Jason Williams</span>, <span style="font-size: 1rem;">Cold Spring Harbor Laboratory, Cold Spring Harbor, NY</span></p></div>

#### The following cell can be run to convert this to a human-readable form:

In [7]:
instructors = cshl_webpage.find('div', class_='cspeakers16').find("div", class_="cspeakers16")
instructors_text = instructors.get_text().replace("\xa0", " ").strip()
print(instructors_text)

Katie Campbell, University of California, Los Angles, Los Angles, CA
Bimal Chaudhary, Nationwide Children's, Powell, OH
Justin Kinney, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
Yang Li, Washington University in St. Louis, Saint Louis, MO
Zachary Lippman, CSHL/HHMI, Cold Spring Harbor, NY
Jessica Mozersky, Washington University in St Louis, St Louis, MO
Adam Phillippy, National Human Genome Research Institute, Bethesda, MA
Alex Wagner, Nationwide Children's Hospital, Dublin, OH
Jason Williams, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
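
If only the speaker names are needed, one possible approach is to collect the text of the `<b>` tags instead of the whole block. The sketch below runs on a two-line excerpt of the HTML shown above, embedded as a string so that it works offline; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Two entries copied from the output above, embedded so no request is needed
html = """<div class="cspeakers16">
<p class="MsoNormal"><b>Katie Campbell, </b>University of California, Los Angles, Los Angles, CA<br/>
<b>Alex Wagner,</b> Nationwide Children's Hospital, Dublin, OH<br/></p></div>"""

soup = BeautifulSoup(html, "html.parser")
block = soup.find("div", class_="cspeakers16")

# Each name sits in a <b> tag, followed by a comma and sometimes a space
names = [b.get_text().strip().rstrip(",").strip() for b in block.find_all("b")]
print(names)
```

Note that on the full page the last speaker is wrapped in a `<span>` rather than a `<b>` tag, so this simple version would miss that entry.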


#### 2. Suppose we want to extract the dates for the course, and we know that the dates are under the `cdate16` class. Write a query that uses the `get_text()` function to output the dates.

In [None]:
# Uncomment and type code here

## VCFs and Pandas

#### We will conclude with some exercises that combine analysis of variant call format (VCF) files and the pandas library. First, let's import pandas and pysam:

In [None]:
import pandas as pd # Used to access dataframes
import pysam # Used to access VCFs

#### 1. Write code to read in the VCF using pysam and print the header:

In [None]:
vcf_file = pysam.VariantFile("../../data/Exome_Norm_HC_calls.filtered.PASS.vcf", index_filename=None)
# print header here

#### 2. Scroll through and look at the rows with the ##FORMAT flags. What do these rows mean and why are they important?

*Type Answer Here*:
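
To see how a `##FORMAT` header line is structured, the sketch below parses one such line with the standard library only. The example line is illustrative and may differ from the lines in the lab's VCF, and `parse_meta_line` is a hypothetical helper, not part of pysam.

```python
import re


def parse_meta_line(line):
    """Split a structured VCF header line into its kind and key/value fields."""
    match = re.match(r"##(\w+)=<(.*)>$", line.strip())
    if match is None:
        return None  # not a structured '##KEY=<...>' line
    kind, body = match.groups()
    # A value is either a double-quoted string or runs up to the next comma
    fields = re.findall(r'(\w+)=("[^"]*"|[^,]*)', body)
    return kind, {key: value.strip('"') for key, value in fields}


# Illustrative header line (assumed, not copied from the lab's file)
example = '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">'
kind, fields = parse_meta_line(example)
print(kind, fields)
```

Each parsed `ID` corresponds to one of the colon-separated keys in the FORMAT column, which in turn tells you how to read the values in each sample column.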

#### 3. Run the code below to view the first 30 variants in the VCF. What kinds of variants do you see in this list? How can you tell?

In [None]:
print("#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HCC1395BL_DNA")
for index, record in enumerate(vcf_file):
   if index == 30:
       break
   else:
       print(record)

*Type Answer Here*:
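
One rough way to reason about variant types is to compare the lengths of the REF and ALT alleles. The sketch below is a simplification under stated assumptions: `classify_variant` is a hypothetical helper, the alleles are toy examples, and complex substitutions (where both alleles are longer than one base and differ in length) are labelled by length change only.

```python
def classify_variant(ref, alt):
    """Label a single REF/ALT pair by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"          # one base replaced by another
    if len(ref) == len(alt):
        return "MNV"          # several adjacent bases replaced
    if len(ref) < len(alt):
        return "insertion"    # ALT is longer than REF
    return "deletion"         # ALT is shorter than REF


# Toy allele pairs (hypothetical, not taken from the VCF)
examples = [("A", "G"), ("C", "CTT"), ("GAT", "G"), ("AC", "GT")]
labels = [classify_variant(ref, alt) for ref, alt in examples]
print(labels)
```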

#### 4. Run the code below to convert the VCF to a pandas dataframe:

In [None]:
columns = ["chrom", "pos", "id", "ref", "alt", "qual", "filter",
          "dp", "gt", "ad", "gq"]
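# Reopen the VCF: the preview loop above already consumed the first records,
# and pysam keeps reading from wherever the file iterator stopped
vcf_file = pysam.VariantFile("../../data/Exome_Norm_HC_calls.filtered.PASS.vcf", index_filename=None)
vcf_data = []
for record in vcf_file: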
   sample_data = record.samples["HCC1395BL_DNA"]
   vcf_data.append({"chrom": record.chrom,
                  "pos": record.pos,
                  "id": record.id,
                  "ref": record.ref,
                  "alt": ','.join(record.alts),
                  "qual": record.qual,
                  "filter": ';'.join(record.filter.keys()),
                  "dp": sample_data.get("DP"),
                  "gt": sample_data.get("GT"),
                  "ad": sample_data.get("AD"),
                  "gq":sample_data.get("GQ")})
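vcf_data = pd.DataFrame(vcf_data, columns=columns)
# Strip the "chr" prefix; the integer cast assumes only numeric chromosome names are present
vcf_data["chrom"] = vcf_data["chrom"].str.replace('chr', '', regex=False).astype(int)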

#### 5. How many rows are in the dataframe? What pandas command can you use to determine this? 

#### 6. We would like to take a random sample of 20 entries from the dataframe. How would we do this in pandas? Name the new dataset `vcf_subset`. Set the random state to equal 7 for reproducibility.
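
The row-counting and sampling operations from the two questions above can be tried on a small toy dataframe first. This is a sketch under stated assumptions: the `toy` frame and its values are invented for illustration, and it samples 3 rows rather than 20 because the toy frame has only 6 rows.

```python
import pandas as pd

# Toy dataframe (hypothetical values, not taken from the VCF)
toy = pd.DataFrame({
    "chrom": [17, 17, 17, 17, 17, 17],
    "pos": [100, 250, 300, 420, 510, 640],
    "dp": [35, 80, 52, 12, 97, 61],
    "gq": [99, 60, 75, 20, 99, 88],
})

n_rows = len(toy)            # number of rows; toy.shape[0] gives the same value
n_rows_alt, n_cols = toy.shape

# A fixed random_state makes the random sample reproducible across runs
subset = toy.sample(n=3, random_state=7)

print(n_rows, n_cols)
print(subset)
```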

#### 7. Suppose we want to sort the sample so that the variant positions are in **descending order**. How would you do this using pandas?

#### 8. We would like to only include variants that have a read depth >= 50.0 and a genotype quality >= 70.0. How would you do this using pandas?
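
Sorting and threshold filtering, as asked in the two questions above, can be sketched on a toy dataframe. The `toy` frame and its values are invented for illustration; the column names `pos`, `dp`, and `gq` follow the conversion cell earlier in this lab.

```python
import pandas as pd

# Toy dataframe (hypothetical values, not taken from the VCF)
toy = pd.DataFrame({
    "pos": [100, 250, 300, 420, 510, 640],
    "dp": [35, 80, 52, 12, 97, 61],
    "gq": [99, 60, 75, 20, 99, 88],
})

# Sort by position, largest first
sorted_toy = toy.sort_values("pos", ascending=False)

# Keep rows that satisfy both thresholds; each condition needs its own
# parentheses because & binds more tightly than >=
filtered = toy[(toy["dp"] >= 50.0) & (toy["gq"] >= 70.0)]

print(sorted_toy)
print(filtered)
```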

## Final Exercise

#### A clinical colleague has asked you to examine the VCF to see if there are any frameshift variants in the coding regions of the *P2RX5* gene. How many frameshift variants are there? Write your code below to answer this question. The exon coordinates for the *P2RX5* gene are provided in the cell below.

In [None]:
exon_list = [[3695868,3696155], [3691643,3691794], [3690955,3691027],
             [3690604,3690680], [3690426,3690523], [3690069,3690150],
             [3689491,3689630], [3688625,3688759], [3688011,3688105],
             [3681895,3681978], [3679589,3679784], [3673226,3673226]]
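
One possible way to structure the logic is sketched below on toy data. It rests on simplifying assumptions: a variant counts as a frameshift when the length difference between REF and any ALT allele is not a multiple of 3, and a variant counts as exonic when its POS falls inside an exon interval (so a deletion that starts before an exon and extends into it would be missed). The helper names and the toy variants are hypothetical, and only the first two exons from the cell above are used.

```python
# First two exon intervals from the cell above, as [start, end] pairs
exons = [[3695868, 3696155], [3691643, 3691794]]


def in_any_exon(pos, exon_intervals):
    """Return True if pos lies within any [start, end] interval (inclusive)."""
    return any(start <= pos <= end for start, end in exon_intervals)


def is_frameshift(ref, alt):
    """Return True if any ALT allele changes the length by a non-multiple of 3."""
    return any((len(allele) - len(ref)) % 3 != 0 for allele in alt.split(","))


# Toy (pos, ref, alt) records: hypothetical, NOT taken from the VCF
toy_variants = [
    (3695900, "A", "AT"),    # 1-bp insertion inside an exon
    (3695950, "ACGT", "A"),  # 3-bp deletion inside an exon (in-frame)
    (3691700, "C", "T"),     # SNV inside an exon
    (3693000, "G", "GA"),    # 1-bp insertion between exons
]

hits = [v for v in toy_variants
        if in_any_exon(v[0], exons) and is_frameshift(v[1], v[2])]
print(len(hits), hits)
```

The exon list carries no chromosome name, so a query on the real dataframe should also restrict the rows to the chromosome on which the gene lies.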