# Web Scraping with Python Tutorial
For the 2024 CSHL Advanced Sequencing Technologies & Bioinformatics Analysis course

## Using the Requests library

In [1]:
import requests

### Retrieve content with .get
Let's retrieve all of the [CSHL meetings](https://meetings.cshl.edu/meetings) programmatically using a web scraper.

In [2]:
offerings_page_url = "https://meetings.cshl.edu/meetings"
cshl_meetings_response = requests.get(offerings_page_url)

### Check request status with .status_code
A good first check is whether the request succeeded, i.e. whether it returned a 200 (OK) [status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).

Other popular codes include:
- 302 (Found)
- 404 (Not Found)
- 401 (Unauthorized)
- 418 (I'm a Teapot)

In [3]:
cshl_meetings_response.status_code

200
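As a rough guide to reading status codes, the first digit gives the class of the response (2xx success, 3xx redirection, 4xx client error, 5xx server error). The sketch below uses only the standard library's `http.HTTPStatus` table; the helper name `describe_status` is made up for illustration and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description such as '404 Not Found (client error)'."""
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not in the standard library's table
        phrase = "Unknown"
    category = categories.get(code // 100, "non-standard")
    return f"{code} {phrase} ({category})"


for code in (200, 302, 401, 404):
    print(describe_status(code))
```

On a real response, `requests` also provides `response.ok` (true for codes below 400) and `response.raise_for_status()`, which raises an exception for 4xx and 5xx codes.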

_Try it yourself:_ Modify the code below to grab content from another site. Was it successful? Why or why not?

In [None]:
# Get the content
test_page_url = ?
test_page_response = requests.? ( ? )

# Check the status code
test_page_response. ?

### View response content
The text content of a response can be viewed using `.text`, and JSON-formatted responses can be parsed into Python objects using `.json()`.

In [4]:
cshl_meetings_response.text

'\r\n\r\n<!DOCTYPE html>\r\n\r\n<html lang="en">\r\n\r\n<head>\r\n\t<!-- Global site tag (gtag.js) - Google Analytics -->\r\n    <script async src="https://www.googletagmanager.com/gtag/js?id=UA-30723914-1"></script>\r\n    <script>\r\n        window.dataLayer = window.dataLayer || [];\r\n        function gtag(){dataLayer.push(arguments);}\r\n        gtag(\'js\', new Date());\r\n\r\n        gtag(\'config\', \'UA-30723914-1\');\r\n        gtag(\'config\', \'G-85036B76HX\');\r\n    </script>\r\n    \r\n    <title>\r\n\tMeetings on Cancer, Cells, Genomics, Neuroscience, Genetics and Immunology\r\n</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /><meta name="description" content="Meetings and Conferences on Cancer, Cells, Genomics, Neuroscience, Genetics and Immunology" /><link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css" /><link type="text/css" rel="stylesheet" href="https://cdn.jsdelivr.net/jquery.jsso

Wow, that's a lot of HTML under the hood. We will need a way to parse through this.

### Requesting JSON content from web APIs
Another common use of the requests library is to retrieve JSON content from web APIs. Let's take a look at the LitVar search API:

In [5]:
variant_id = 'rs113488022'
url = f'https://www.ncbi.nlm.nih.gov/research/litvar2-api/variant/get/litvar@{variant_id}%23%23'
litvar_response = requests.get(url)
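A note on the `%23%23` at the end of that URL: it is the percent-encoded form of `##`. A bare `#` in a URL starts the fragment, so it has to be escaped to reach the server as part of the path. A minimal sketch of producing the same encoding with the standard library (the variable names `litvar_id` and `encoded_id` are ours, chosen for illustration):

```python
from urllib.parse import quote, unquote

variant_id = "rs113488022"
litvar_id = f"litvar@{variant_id}##"

# Leave "@" unescaped to match the URL used above; "#" becomes %23
encoded_id = quote(litvar_id, safe="@")
print(encoded_id)

# unquote() reverses the encoding
print(unquote(encoded_id))
```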

In [6]:
litvar_response.status_code

200

In [7]:
litvar_response.json()

{'_id': 'litvar@rs113488022##',
 'concept': 'variant',
 'rsid': 'rs113488022',
 'clingen_ids': ['CA281998', 'CA123643', 'CA16602736'],
 'gene': ['BRAF'],
 'name': 'p.V600E',
 'hgvs': 'p.V600E',
 'flag_gene_variant': False,
 'flag_clingen_variant': False,
 'flag_rsid_variant': True,
 'data_species': ['Homo sapiens'],
 'data_snp_id': ['113488022'],
 'data_tax_id': ['9606'],
 'data_allele': ['N'],
 'data_snp_class': ['snv'],
 'data_chromosome_base_position': ['7:140753336'],
 'data_clinical_significance': ['other',
  'drug-response',
  'pathogenic',
  'likely-pathogenic']}
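Once parsed, the response is an ordinary Python dictionary, so the usual indexing and `.get()` calls apply. The sketch below works offline on an abbreviated copy of the output above (parsed with `json.loads`, which gives the same kind of object as `.json()`); the `summary` layout is just one possible way to organize the fields.

```python
import json

# Abbreviated copy of the response shown above, as raw JSON text
raw_json = """
{
  "_id": "litvar@rs113488022##",
  "rsid": "rs113488022",
  "gene": ["BRAF"],
  "name": "p.V600E",
  "data_chromosome_base_position": ["7:140753336"],
  "data_clinical_significance": ["other", "drug-response",
                                 "pathogenic", "likely-pathogenic"]
}
"""
variant = json.loads(raw_json)

# Several fields are lists, so index into them or use .get() with a default
genes = variant.get("gene", [])
chromosome, position = variant["data_chromosome_base_position"][0].split(":")
is_pathogenic = "pathogenic" in variant.get("data_clinical_significance", [])

summary = {
    "rsid": variant["rsid"],
    "gene": genes[0] if genes else None,
    "chromosome": chromosome,
    "position": int(position),
    "pathogenic": is_pathogenic,
}
print(summary)
```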

## Introduction to Web Scraping using `beautifulsoup4`

Let's go back to our scrape of the CSHL meetings website. The raw HTML we got back looked difficult to parse. Let's see how the Beautiful Soup library can help us improve on this.

In [8]:
from bs4 import BeautifulSoup

### Making the Soup
BeautifulSoup can parse XML as well as HTML, so we need to specify what parser we want to use when we pass it content.

In [9]:
cshl_meetings_soup = BeautifulSoup(cshl_meetings_response.text, "html.parser")

In [10]:
cshl_meetings_soup


<!DOCTYPE html>

<html lang="en">
<head>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-30723914-1"></script>
<script>
        window.dataLayer = window.dataLayer || [];
        function gtag(){dataLayer.push(arguments);}
        gtag('js', new Date());

        gtag('config', 'UA-30723914-1');
        gtag('config', 'G-85036B76HX');
    </script>
<title>
	Meetings on Cancer, Cells, Genomics, Neuroscience, Genetics and Immunology
</title><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="Meetings and Conferences on Cancer, Cells, Genomics, Neuroscience, Genetics and Immunology" name="description"/><link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/><link href="https://cdn.jsdelivr.net/jquery.jssocials/1.4.0/jssocials.css" rel="stylesheet" type="text/css"/><link href="https://cdn.jsdelivr.net/jquery.jssocials/1.4.0/jssocials-

Well, that's better formatted, but still a lot of content to go through. How can we break this up?

### Soup components

We can use Beautiful Soup to review different parts of the response separately as _tags_, e.g.:
- title
- head
- body

In [14]:
cshl_meetings_soup.title

<title>
	Meetings on Cancer, Cells, Genomics, Neuroscience, Genetics and Immunology
</title>
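Tag access returns a `Tag` object, whitespace included. To get just the text, `get_text(strip=True)` is one option. A small offline sketch on a made-up HTML snippet (the page content here is hypothetical, not taken from the CSHL site):

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page for illustration
html = """
<html>
  <head><title>
    Example Meetings Page
  </title></head>
  <body><p>Hello, soup!</p></body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# Tags are reachable as attributes of the soup (or of other tags)
print(soup.title.name)

# get_text(strip=True) drops the surrounding whitespace
title_text = soup.title.get_text(strip=True)
body_text = soup.body.get_text(strip=True)
print(title_text)
print(body_text)
```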

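Hrm. The title alone doesn't tell us much. We know from the site that "Network Biology" is one of the meetings we expect to find. The meetings are listed in a table with the id `tblCustomers`, so let's find that table and collect its rows.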

In [51]:
table = cshl_meetings_soup.find(id='tblCustomers')
table_rows = table.find_all("tr")

In [52]:
len(table_rows)

28
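The same `find` / `find_all` pattern can be tried offline on a small table. In the sketch below, the table is a hypothetical miniature of the real one: the two meeting titles and their dates are real values from the live page, while the column headers and `href` values are invented. It also shows two things worth knowing: `find` returns `None` when nothing matches, and a link can be located by its exact text and traced back to its row.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the meetings table (headers and hrefs are made up)
html = """
<table id="tblCustomers">
  <tr><th>Meeting</th><th>Dates</th><th>Abstract Deadline</th></tr>
  <tr>
    <td><a href="#">Network Biology</a></td>
    <td> Tue Mar 11 - Sat Mar 15 2025 </td>
    <td> Fri Jan 10 2025 </td>
  </tr>
  <tr>
    <td><a href="#">Brain Barriers</a></td>
    <td> Tue Apr 8 - Sat Apr 12 2025 </td>
    <td> Fri Jan 24 2025 </td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so check before using the result
table = soup.find(id="tblCustomers")
if table is None:
    raise ValueError("No element with id='tblCustomers' found")

rows = table.find_all("tr")
print(len(rows))  # one header row plus two data rows

# Search for a link by its exact text, then walk up to the enclosing row
link = soup.find("a", string="Network Biology")
row = link.find_parent("tr")
cells = [td.get_text(strip=True) for td in row.find_all("td")]
print(cells)
```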

In [59]:
table_rows[1].find_all('td')[0].a.text

'Neurodegenerative Diseases: Biology & Therapeutics'

In [62]:
table_rows[1].find_all('td')[1].text.strip()

'Wed Dec 4 - Sat Dec 7 2024'

In [63]:
table_rows[1].find_all('td')[2].text.strip()

'Fri Sep 13 2024'
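These dates are still plain strings. If you want to sort or compare them, one option is to convert them with `datetime.strptime`. The sketch below assumes the two formats seen above (`'Fri Sep 13 2024'` for a single day and `'Wed Dec 4 - Sat Dec 7 2024'` for a range), English day and month abbreviations, and that a range never spans a new year; `parse_day` and `parse_range` are hypothetical helper names.

```python
from datetime import date, datetime


def parse_day(text: str) -> date:
    """Parse a string like 'Fri Sep 13 2024' into a date."""
    return datetime.strptime(text.strip(), "%a %b %d %Y").date()


def parse_range(text: str) -> tuple[date, date]:
    """Parse a string like 'Wed Dec 4 - Sat Dec 7 2024' into (start, end)."""
    start_text, end_text = (part.strip() for part in text.split(" - "))
    end = parse_day(end_text)
    # The start has no year of its own; borrow it from the end date
    # (assumption: the range does not cross a year boundary)
    start = parse_day(f"{start_text} {end.year}")
    return start, end


deadline = parse_day("Fri Sep 13 2024")
start, end = parse_range("Wed Dec 4 - Sat Dec 7 2024")
print(deadline, start, end)
print((start - deadline).days, "days between deadline and meeting start")
```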

In [66]:
course_info = {}
# Skip row 0 (presumably the table header); map each title to [dates, deadline]
for row in table_rows[1:]:
    row_data = row.find_all('td')
    course_info[row_data[0].a.text] = [
        row_data[1].text.strip(),
        row_data[2].text.strip()
    ]
course_info

{'Neurodegenerative Diseases: Biology & Therapeutics': ['Wed Dec 4 - Sat Dec 7 2024',
  'Fri Sep 13 2024'],
 'Development & 3D Modeling of the Human Brain': ['Mon Dec 9 - Thu Dec 12 2024',
  'Mon Oct 7 2024'],
 'Probabilistic Modeling in Genomics': ['Wed Mar 5 - Sat Mar 8 2025',
  'Fri Jan 10 2025'],
 'Network Biology': ['Tue Mar 11 - Sat Mar 15 2025', 'Fri Jan 10 2025'],
 'Nucleic Acid Therapies': ['Wed Mar 19 - Sat Mar 22 2025', 'Fri Jan 17 2025'],
 'Cancer Genetics: History & Consequences': ['Wed Mar 26 - Sat Mar 29 2025',
  'Fri Jan 31 2025'],
 'Ubiquitins, Autophagy & Disease': ['Tue Apr 1 - Sat Apr 5 2025',
  'Fri Jan 17 2025'],
 'Brain Barriers': ['Tue Apr 8 - Sat Apr 12 2025', 'Fri Jan 24 2025'],
 'Systems Immunology': ['Tue Apr 22 - Sat Apr 26 2025', 'Fri Jan 31 2025'],
 'Telomeres & Telomerase': ['Tue Apr 29 - Sat May 3 2025', 'Fri Feb 7 2025'],
 'Biology of Genomes': ['Tue May 6 - Sat May 10 2025', 'Fri Feb 14 2025'],
 'Mechanisms of Metabolic Signaling': ['Tue May 13 - Sat 