# Scrapping Scientific Update's Website with BeautifulSoup4
The idea behind this notebook is to practice web scraping skills. Scientific Update is a company that dictates training courses for professionals in chemistry, having a special focus on process chemistry and, particularly on organic sythesis. 
Our goal is to be able to develop a simple system that extract the info about all the courses that the company provides and to analyze the data in order to get some insights

In [95]:
# Web Scraping
import requests
from bs4 import BeautifulSoup

In [2]:
website = "https://www.scientificupdate.com/training_courses/safety-selectivity-in-the-scale-up-of-chemical-reactions-11/20220503/"
result = requests.get(website)
content = result.text

soup = BeautifulSoup(content, "lxml")

In [96]:
#print(soup.prettify())

### Accordion

In [97]:
accordion = soup.find(id="accordion")
# print(accordion.prettify())

We can see that inside the accordion we have a "Testimonials" section with comments, wich has comments from previous course atendees. Whe want to extract the most relevant information about the course and we are not interested in this comments since they are only there to promote the course (therefore, they are all going to be very cherrypicked positive messages).<br>
Let's get rid of that section turning the corde into a string, splitting it and regenerating the html at the end

In [98]:
splitted = str(accordion).split("<h3>Testimonials</h3>")[0]
accordion = BeautifulSoup(splitted, "lxml").find(id="accordion")
print(accordion.prettify())

<div id="accordion">
 <h3>
  Course Outline
 </h3>
 <div>
  <p>
   The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.
  </p>
 </div>
 <h3>
  Benefits of Attending
 </h3>
 <div>
  <ul>
   <li>
    Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scale
   </li>
   <li>
    They will learn about

Much better! right? No we can see that each accordion section has an \<h3\> title and then a div that might contain a paragraph, a list, or both

In [99]:
divs = accordion.find_all("div")
titles = accordion.find_all("h3")

In [100]:
for i in range(len(titles)):
    print(titles[i].text+"\n"+ divs[i].text +"\n")

Course Outline
The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.

Benefits of Attending
Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scaleThey will learn about what testing procedures are available to help them identify unsafe operating conditions to enable to talk knowledgably to in-hou

It can be seen that some entries inside the accordion not only have paragraphs but also lists. We are going to convert those lists into paragraphs

In [101]:
new_divs= []
changes = [{"old":"<li>", "new":"" },
         {"old": "</li>","new": ". "},
         {"old": "<ul>", "new": "<p>"},
         {"old": "</ul>", "new": "</p>"},
         {"old": "<strong>", "new": ""},
         {"old": "</strong>", "new": " "}]

for div in divs:
    string = str(div)
    for change in changes:
        string = string.replace(change["old"], change["new"])
    code = BeautifulSoup(string, "html.parser")
    new_divs.append(code)
    
for i in range(len(titles)):
    print(titles[i].text+"\n"+ new_divs[i].text +"\n")
          
        

Course Outline
The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.

Benefits of Attending
Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scale. They will learn about what testing procedures are available to help them identify unsafe operating conditions to enable to talk knowledgably to in-h

This looks good, so let's group all of this into a couple of functions.

### getAccordionInfo(accordion, split_tag="str")
This function gets the beautifulsoup object containing the accordion code. The accordion is a component present in the course overview wich is used to display basic information about the course, splited in cards containing a title and the info. At the end of the accordion there is info that is not relevant and therefore that part of the code has to be removed. The function receives a substring that indicates where to cut the code

In [102]:
def getAccordionInfo(accordion, split_tag="<h3>Testimonials</h3>"):
    #config
    changes = [{"old":"<li>", "new":"" },
         {"old": "</li>","new": ". "},
         {"old": "<ul>", "new": "<p>"},
         {"old": "</ul>", "new": "</p>"},
         {"old": "<strong>", "new": ""},
         {"old": "</strong>", "new": " "}]
    
    # Accordion is splitted and only the beggining is kept
    splitted = str(accordion).split(split_tag)[0]
    accordion = BeautifulSoup(splitted, "lxml").find(id="accordion")
    
    # Get titles and divs
    titles = accordion.find_all("h3")
    divs = accordion.find_all("div")
    
    # Clean content
    new_divs= []

    for div in divs:
        string = str(div)
        for change in changes:
            string = string.replace(change["old"], change["new"])
        code = BeautifulSoup(string, "html.parser")
        new_divs.append(code)
        
    # Prepare data and return it
    data = []

    for i in range(len(titles)):
        data.append({"title":titles[i].text, "content":new_divs[i].text})
    
    return data

We test here that we get the result we want:

In [103]:
accordion = soup.find(id="accordion")

In [104]:
data = getAccordionInfo(accordion)
for entry in data:
    print(entry["title"])
    print(entry["content"])
    print("\n")

Course Outline
The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.


Benefits of Attending
Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scale. They will learn about what testing procedures are available to help them identify unsafe operating conditions to enable to talk knowledgably to in-

## Description

In the center left of the page we have the overall description of the course. This includes dates and location. Let's extract that information

In [105]:
description = soup.find(class_="description")
# print(description.prettify())

In [106]:
for p in description.find_all("p"):
    print(p)
    print("\n")

<p><img alt="" class="alignnone size-medium wp-image-10900" height="59" loading="lazy" sizes="(max-width: 300px) 100vw, 300px" src="https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo-300x59.png" srcset="https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo-300x59.png 300w, https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo-1024x201.png 1024w, https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo-768x151.png 768w, https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo.png 1188w" width="300"/></p>


<p>We are delighted to be able to offer this course now <span style="color: #ff00ff;"><strong>ONLINE</strong></span>, this two-day course will be divided up into four sessions using an online platform, dates and times are as follows and set in BST (British Summer Time):</p>


<p><strong>Tuesday May 3rd | </strong>2.00pm – 5.00pm BST<br/> <strong>Wednesday May 4th| </strong>2.00pm – 5.00pm BST<strong

In [107]:
mode = description.find_all("p")[1].find("strong")
dates = description.find_all("p")[2]

In [108]:
print(mode)

<strong>ONLINE</strong>


In [109]:
dates_str = str(dates)
dates_str = dates_str.replace("<br/>","").replace("|","").replace("<p><strong>","<p>").replace("<strong>","</p><p>").replace("</strong>",";")
print(dates_str)
print("_________________")
date_ps = BeautifulSoup(dates_str, "html.parser")

dates = []

for p in date_ps:
    if p.text.strip() != "":
        dates.append(date.strip().replace(u'\xa0', "").replace("th","th;"))
print(dates)

<p>Tuesday May 3rd  ;2.00pm – 5.00pm BST </p><p>Wednesday May 4th ;2.00pm – 5.00pm BST</p><p> Thursday May 5th ; 2.00pm – 5.00pm BST</p><p> ;</p><p>Friday May 6th ; 2.00pm – 5.00pm BST</p>
_________________
['Friday May 6th; 2.00pm – 5.00pm BST', 'Friday May 6th; 2.00pm – 5.00pm BST', 'Friday May 6th; 2.00pm – 5.00pm BST', 'Friday May 6th; 2.00pm – 5.00pm BST', 'Friday May 6th; 2.00pm – 5.00pm BST']


### getDescriptionInfo(description)

In [112]:
def getDescriptionInfo(description):
    #config
    tag_changes = [{"old":"<br/>", "new":"" },
         {"old": "|","new": ""},
         {"old": "<p><strong>", "new": "<p>"},
         {"old": "<strong>", "new": "</p><p>"},
         {"old": "</strong>", "new": ""},]
     
    mode = description.find_all("p")[1].find("strong").text.lower()
    dates = []
    raw_dates = description.find_all("p")[2]
    dates_str = str(raw_dates)
    for change in tag_changes:
        dates_str = dates_str.replace(change["old"],change["new"])
    date_ps = BeautifulSoup(dates_str, "html.parser")

    for p in date_ps:
        if p.text.strip() != "":
            dates.append(date.strip().replace(u'\xa0', "").replace("th","th;"))
    return({"dates":dates, "mode":mode})

description = soup.find(class_="description")
description_data = getDescriptionInfo(description)
print(description_data)

{'dates': ['Friday May 6th; 2.00pm – 5.00pm BST', 'Friday May 6th; 2.00pm – 5.00pm BST', 'Friday May 6th; 2.00pm – 5.00pm BST', 'Friday May 6th; 2.00pm – 5.00pm BST'], 'mode': 'online'}
