# Scraping Scientific Update's Website with BeautifulSoup4
The idea behind this notebook is to practice web scraping skills. Scientific Update is a company that runs training courses for professionals in chemistry, with a special focus on process chemistry and, in particular, on organic synthesis.
Our goal is to develop a simple system that extracts the information about all the courses the company provides and to analyze the data in order to get some insights.

In [1]:
import re

# Web Scraping
import requests
from bs4 import BeautifulSoup

In [2]:
website = "https://www.scientificupdate.com/training_courses/safety-selectivity-in-the-scale-up-of-chemical-reactions-11/20220503/"
result = requests.get(website)
content = result.text

soup = BeautifulSoup(content, "lxml")

In [3]:
#print(soup.prettify())

### Title

In [4]:
main_item = soup.find("div", class_="main-item")
title = main_item.h1.text
print(title)

Safety & Selectivity in the Scale-Up of Chemical Reactions


In [5]:
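def getTitle(soup):
    # Return the course title, or None when the expected markup is missing
    title = None
    main_item = soup.find("div", class_="main-item")
    if main_item and main_item.h1:
        title = main_item.h1.text
    return {"title": title}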

print(getTitle(soup))
    

{'title': 'Safety & Selectivity in the Scale-Up of Chemical Reactions'}


### Accordion

In [6]:
accordion = soup.find(id="accordion")
# print(accordion.prettify())

We can see that the accordion includes a "Testimonials" section, which holds comments from previous course attendees. We want to extract the most relevant information about the course, and we are not interested in these comments since they are only there to promote the course (therefore, they are all going to be cherry-picked positive messages).<br>
Let's get rid of that section by turning the code into a string, splitting it, and regenerating the HTML at the end.

In [7]:
splitted = str(accordion).split("<h3>Testimonials</h3>")[0]
accordion = BeautifulSoup(splitted, "lxml").find(id="accordion")
print(accordion.prettify())

<div id="accordion">
 <h3>
  Course Outline
 </h3>
 <div>
  <p>
   The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.
  </p>
 </div>
 <h3>
  Benefits of Attending
 </h3>
 <div>
  <ul>
   <li>
    Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scale
   </li>
   <li>
    They will learn about

Much better, right? Now we can see that each accordion section has an \<h3\> title followed by a div that might contain a paragraph, a list, or both.
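That title-then-div layout can also be read pairwise by walking the tree. The snippet below is a minimal sketch on a made-up HTML fragment (`sample_html` and `pair_sections` are hypothetical names, not part of the page or of this notebook), and it assumes every \<h3\> is followed by a sibling div.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the accordion layout described above
sample_html = """
<div id="accordion">
  <h3>Course Outline</h3>
  <div><p>Aim of the course.</p></div>
  <h3>Benefits of Attending</h3>
  <div><ul><li>First benefit</li><li>Second benefit</li></ul></div>
</div>
"""

def pair_sections(accordion):
    """Map each <h3> title to the first <div> sibling that follows it."""
    sections = {}
    for heading in accordion.find_all("h3"):
        body = heading.find_next_sibling("div")
        if body is not None:
            sections[heading.get_text(strip=True)] = body
    return sections

sample_accordion = BeautifulSoup(sample_html, "html.parser").find(id="accordion")
for name, body in pair_sections(sample_accordion).items():
    print(name, "->", body.get_text(" ", strip=True))
```

Pairing through siblings keeps titles and bodies aligned even if a section body contains nested divs, which a recursive `find_all("div")` would also return.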

In [8]:
divs = accordion.find_all("div")
titles = accordion.find_all("h3")

In [9]:
for i in range(len(titles)):
    print(titles[i].text+"\n"+ divs[i].text +"\n")

Course Outline
The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.

Benefits of Attending
Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scaleThey will learn about what testing procedures are available to help them identify unsafe operating conditions to enable to talk knowledgably to in-hou

It can be seen that some entries inside the accordion contain lists as well as paragraphs. We are going to convert those lists into paragraphs.
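One way to flatten such a section without editing the HTML string is to walk its paragraphs and list items. This is only a sketch: `div_to_text` and the sample markup are hypothetical, and it assumes that \<p\> and \<li\> elements are not nested inside each other.

```python
from bs4 import BeautifulSoup

def div_to_text(div):
    """Flatten a section <div> into one string, ending each list item with a period."""
    parts = []
    for node in div.find_all(["p", "li"]):
        text = node.get_text(" ", strip=True)
        if not text:
            continue
        # List items may lack final punctuation, so add a period when it is missing
        if node.name == "li" and not text.endswith((".", "!", "?")):
            text += "."
        parts.append(text)
    return " ".join(parts)

# Made-up markup with a paragraph, a list, and an inline <strong> tag
sample_div = BeautifulSoup(
    "<div><p>Intro paragraph.</p><ul><li>First <strong>key</strong> point</li>"
    "<li>Second point</li></ul></div>",
    "html.parser",
).div
print(div_to_text(sample_div))
```

The notebook itself takes the string-replacement route, which is shorter; the tree-based version simply avoids depending on the exact spelling of each tag.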

In [10]:
new_divs= []
changes = [{"old":"<li>", "new":"" },
         {"old": "</li>","new": ". "},
         {"old": "<ul>", "new": "<p>"},
         {"old": "</ul>", "new": "</p>"},
         {"old": "<strong>", "new": ""},
         {"old": "</strong>", "new": " "}]

for div in divs:
    string = str(div)
    for change in changes:
        string = string.replace(change["old"], change["new"])
    code = BeautifulSoup(string, "html.parser")
    new_divs.append(code)
    
for i in range(len(titles)):
    print(titles[i].text+"\n"+ new_divs[i].text +"\n")
          
        

Course Outline
The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.

Benefits of Attending
Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scale. They will learn about what testing procedures are available to help them identify unsafe operating conditions to enable to talk knowledgably to in-h

This looks good, so let's group all of this into a couple of functions.

### getAccordionInfo(soup, split_tag)
This function receives the BeautifulSoup object for the whole page and looks up the accordion itself. The accordion is a component of the course overview that displays basic information about the course, split into cards that each contain a title and the info. The end of the accordion holds content that is not relevant, so that part of the code has to be removed. The split_tag argument is the substring that marks where to cut the code.

In [11]:
def getAccordionInfo(soup, split_tag="<h3>Testimonials</h3>"):
    accordion = soup.find(id="accordion")
    #config
    changes = [{"old":"<li>", "new":"" },
         {"old": "</li>","new": ". "},
         {"old": "<ul>", "new": "<p>"},
         {"old": "</ul>", "new": "</p>"},
         {"old": "<strong>", "new": ""},
         {"old": "</strong>", "new": " "}]
    
    # Split the accordion and keep only the part before split_tag
    splitted = str(accordion).split(split_tag)[0]
    accordion = BeautifulSoup(splitted, "lxml").find(id="accordion")
    
    # Get titles and divs
    titles = accordion.find_all("h3")
    divs = accordion.find_all("div")
    
    # Clean content
    new_divs= []

    for div in divs:
        string = str(div)
        for change in changes:
            string = string.replace(change["old"], change["new"])
        code = BeautifulSoup(string, "html.parser")
        new_divs.append(code)
        
    # Prepare data and return it
    data = {}
    for i in range(len(titles)):
        data[titles[i].text.lower().replace(" ", "-").replace("'","")] = new_divs[i].text      
    
    return data

We test here that we get the result we want:

In [13]:
data = getAccordionInfo(soup)
for key in data.keys():
    print(data[key])


The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.
Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scale. They will learn about what testing procedures are available to help them identify unsafe operating conditions to enable to talk knowledgably to in-house safety professionals or contract 
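The string split inside getAccordionInfo only works when the page serializes the heading exactly as the split_tag text. As an alternative sketch, the unwanted heading and everything after it can be removed from the tree directly. `drop_section_and_after` and the sample markup are hypothetical, and the sketch assumes the sections are direct siblings inside the accordion.

```python
from bs4 import BeautifulSoup

def drop_section_and_after(accordion, heading_text="Testimonials"):
    """Remove the <h3> whose text equals heading_text and every sibling tag after it."""
    for heading in accordion.find_all("h3"):
        if heading.get_text(strip=True) == heading_text:
            # Copy the siblings into a list first, because extract() edits the tree
            for sibling in list(heading.find_next_siblings()):
                sibling.extract()
            heading.extract()
            break
    return accordion

# Made-up accordion with one useful section and a testimonials section
sample = BeautifulSoup(
    '<div id="accordion"><h3>Course Outline</h3><div><p>Outline text</p></div>'
    "<h3>Testimonials</h3><div><p>Great course!</p></div></div>",
    "html.parser",
).find(id="accordion")
print(drop_section_and_after(sample))
```

Because nothing is cut in the middle of the markup, the accordion does not need to be re-parsed afterwards.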

### Description

The center-left area of the page holds the overall description of the course, including dates and location. Let's extract that information.

In [14]:
description = soup.find(class_="description")
# print(description.prettify())

In [15]:
for p in description.find_all("p"):
    print(p)
    print("\n")

<p><img alt="" class="alignnone size-medium wp-image-10900" height="59" loading="lazy" sizes="(max-width: 300px) 100vw, 300px" src="https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo-300x59.png" srcset="https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo-300x59.png 300w, https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo-1024x201.png 1024w, https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo-768x151.png 768w, https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo.png 1188w" width="300"/></p>


<p>We are delighted to be able to offer this course now <span style="color: #ff00ff;"><strong>ONLINE</strong></span>, this two-day course will be divided up into four sessions using an online platform, dates and times are as follows and set in BST (British Summer Time):</p>


<p><strong>Tuesday May 3rd | </strong>2.00pm – 5.00pm BST<br/> <strong>Wednesday May 4th| </strong>2.00pm – 5.00pm BST<strong

In [16]:
mode = description.find_all("p")[1].find("strong")
dates = description.find_all("p")[2]
pitch = description.find_all("p")[3:]

In [17]:
print(mode)

<strong>ONLINE</strong>


### getDescriptionInfo(soup)

In [28]:
def getDescriptionInfo(soup):
    description = soup.find(class_="description")
    #config
    tag_changes = [{"old":"<br/>", "new":"" },
         {"old": "|","new": ""},
         {"old": "<p><strong>", "new": "<p>"},
         {"old": "<strong>", "new": "</p><p>"},
         {"old": "</strong>", "new": ""},]
     
    # Get the raw data from the description html
    mode = description.find_all("p")[1].find("strong").text.lower()
    raw_pitch = description.find_all("p")[3:]
    raw_dates = description.find_all("p")[2]
    
    # Unify the pitch in one single text
    pitch = ""
    for p in raw_pitch:
        pitch = pitch + " " + p.text
    pitch = pitch.strip()
    
    # Format the dates
    dates=[]
    dates_str = str(raw_dates)
    for change in tag_changes:
        dates_str = dates_str.replace(change["old"],change["new"])
    date_ps = BeautifulSoup(dates_str, "html.parser")

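    for p in date_ps:
        text = p.text.strip().replace(u'\xa0', "")
        if text != "":
            # Add ";" only after ordinal day numbers (3rd, 4th, ...), so that
            # words such as "Saturday", which contains "rd", are left untouched
            dates.append(re.sub(r"(\d)(st|nd|rd|th)", r"\1\2;", text))
    return {"dates": dates, "mode": mode, "pitch": pitch}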


description_data = getDescriptionInfo(soup)
print(description_data)

{'dates': ['Tuesday May 3rd;  2.00pm – 5.00pm BST', 'Wednesday May 4th; 2.00pm – 5.00pm BST', 'Thursday May 5th; 2.00pm – 5.00pm BST', 'Friday May 6th; 2.00pm – 5.00pm BST'], 'mode': 'online', 'pitch': 'The safety of chemical processes is critical for the whole chemical industry. It is vital that process development chemists and engineers are able to identify aspects of the chemistry that may be hazardous or pose a risk to the safety of the process or equipment. In order to do this, they need to know when to proactively engage colleagues or contractors to carry out process safety testing and hazard analysis, which in turn requires a knowledge of the equipment and test methods available. As chemical reactions are scaled up and operations become more economic the ability to remove heat from exothermic events becomes reduced and at the same time the outcome of any incident becomes much more severe. This is an almost unique course on the safety of chemical reactions and processes that are 
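For later analysis, each entry of 'dates' could be split into its parts. The following is a rough sketch that assumes every string has the shape shown in the output above (weekday, month, ordinal day, then a time range and a time zone); `SESSION_RE` and `parse_session` are hypothetical names.

```python
import re

# Pattern for strings shaped like the 'dates' entries printed above
SESSION_RE = re.compile(
    r"(?P<weekday>\w+)\s+(?P<month>\w+)\s+(?P<day>\d{1,2})(?:st|nd|rd|th);?\s*"
    r"(?P<start>\d{1,2}\.\d{2}[ap]m)\s*[–-]\s*(?P<end>\d{1,2}\.\d{2}[ap]m)\s*(?P<tz>\w+)?"
)

def parse_session(text):
    """Split one session string into its parts; return None if the shape differs."""
    match = SESSION_RE.match(text.strip())
    return match.groupdict() if match else None

print(parse_session("Tuesday May 3rd;  2.00pm – 5.00pm BST"))
```

Returning None for strings that do not fit makes it easy to spot courses whose dates are written in a different format.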

### Fee Info

This one is a little bit tricky because the fee info sits inside a div that only has a rather generic "box" class. Therefore, we have to target the text itself and find the div of class "box" that contains an h3 tag with the text "Fee Info".
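The lookup just described can be sketched compactly by searching for the heading first and then climbing to its box. `find_box_by_heading` is a hypothetical helper and the sample markup (including the price in it) is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

def find_box_by_heading(soup, heading_text):
    """Return the div.box whose <h3> text matches heading_text (case-insensitive)."""
    pattern = re.compile(rf"^\s*{re.escape(heading_text)}\s*$", re.IGNORECASE)
    heading = soup.find("h3", string=pattern)
    return heading.find_parent("div", class_="box") if heading else None

# Made-up page fragment with two generic boxes
sample = BeautifulSoup(
    '<div class="box"><h3>Fee Info</h3><p><span class="price">£10.00 + VAT</span></p></div>'
    '<div class="box"><h3>Helpful Info</h3><p class="address">Join us from home</p></div>',
    "html.parser",
)
box = find_box_by_heading(sample, "fee info")
print(box.find("span", class_="price").get_text())
```

The cells that follow reach the same div step by step, which is easier to inspect while exploring the page.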

In [19]:
boxes = soup.find_all("div", class_="box")
print(boxes)

[<div class="box"><h3>Fee Info</h3><p><span class="price">£1499.00 + VAT*, you will also have the choice to select payment in Dollars or Euros. *VAT will only be added to those companies based in the UK</span></p><h4>Discounts</h4><p>Up to 15% discount available on multiple bookings</p> <a class="booknow registrationAccess" role="button">Book Now</a></div>, <div class="box"><h3>Helpful Info</h3><p class="short-address" style="font-weight: bold"> Online Platform</p><p class="address">Join us from home</p></div>]


In [20]:
box = None
for div in boxes:
    if(div.h3):
        if div.h3.text.upper() == "FEE INFO":
            box = div
    
print("we found the box\n", box.prettify())


we found the box
 <div class="box">
 <h3>
  Fee Info
 </h3>
 <p>
  <span class="price">
   £1499.00 + VAT*, you will also have the choice to select payment in Dollars or Euros. *VAT will only be added to those companies based in the UK
  </span>
 </p>
 <h4>
  Discounts
 </h4>
 <p>
  Up to 15% discount available on multiple bookings
 </p>
 <a class="booknow registrationAccess" role="button">
  Book Now
 </a>
</div>



Now we can see that the price is inside a span with a class called "price"

In [21]:
price = box.find("span", class_="price")
print(price)

<span class="price">£1499.00 + VAT*, you will also have the choice to select payment in Dollars or Euros. *VAT will only be added to those companies based in the UK</span>


We have the text that mentions the price. Now we need a regular expression to extract the price number and its currency symbol.

In [22]:
def getFeeInfo(soup):
    no_result = {"price": None, "currency": None}
    # Locate the generic "box" div whose heading reads "Fee Info"
    box = None
    for div in soup.find_all("div", class_="box"):
        if div.h3 and div.h3.text.strip().upper() == "FEE INFO":
            box = div
    if not box:
        return no_result
    price = box.find("span", class_="price")
    if not price:
        return no_result
    # Match a currency symbol (£ or $) followed by a decimal amount
    match = re.match(r"([£$])(\d+\.\d+)", price.text.strip())
    if not match:
        return no_result
    return {"price": match.group(2), "currency": match.group(1)}
        
        



In [23]:
fee_info = getFeeInfo(soup)
print("Fee Info\n\n", fee_info)

Fee Info

 {'price': '1499.00', 'currency': '£'}
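Other course pages might format the fee differently, for example with a thousands separator or without a currency symbol. The sketch below shows one possible pattern with named groups for those cases; `PRICE_RE` and `parse_price` are hypothetical, and whether the site ever uses such formats is an assumption.

```python
import re

# Optional currency symbol, then an amount with optional thousands separators
PRICE_RE = re.compile(r"(?P<currency>[£$€])?\s*(?P<amount>\d+(?:,\d{3})*\.\d{2})")

def parse_price(text):
    """Return the first price found in text, with separators removed."""
    match = PRICE_RE.search(text)
    if match is None:
        return {"price": None, "currency": None}
    return {
        "price": match.group("amount").replace(",", ""),
        "currency": match.group("currency"),
    }

print(parse_price("£1499.00 + VAT*, you will also have the choice to select payment in Dollars or Euros."))
```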


## Scraping

In [24]:

def scrapCourse(soup):
    # Scrape each part of the page
    title = getTitle(soup)
    information = getAccordionInfo(soup)
    description = getDescriptionInfo(soup)
    fee = getFeeInfo(soup)
    # Merge everything into one dictionary (the | operator on dicts needs Python 3.9+)
    data = title | information | description | fee
    return data


In [25]:
website = "https://www.scientificupdate.com/training_courses/safety-selectivity-in-the-scale-up-of-chemical-reactions-11/20220503/"
result = requests.get(website)
content = result.text

soup = BeautifulSoup(content, "lxml")

data = scrapCourse(soup)


In [26]:
print(data.keys())

dict_keys(['title', 'course-outline', 'benefits-of-attending', 'who-should-attend', 'whats-included', 'dates', 'mode', 'pitch', 'price', 'currency'])
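Since the goal is to cover all the courses, the same steps would eventually run over many URLs. The following is a minimal sketch of such a loop, not part of the notebook: `scrape_many` is hypothetical, and `fetch` and `scrape` are passed in as parameters (in this notebook they could be a function that downloads and parses a page, and scrapCourse). Stand-in functions are used here so the example runs without network access.

```python
import time

def scrape_many(urls, fetch, scrape, delay=1.0):
    """Apply scrape(fetch(url)) to each URL, collecting results and failures."""
    results, failures = [], {}
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay)  # pause between requests to stay polite to the server
        try:
            record = scrape(fetch(url))
        except Exception as error:  # keep going if one page has unexpected markup
            failures[url] = repr(error)
            continue
        results.append({"url": url, **record})
    return results, failures

# Stand-ins so the sketch runs without network access
fake_pages = {"course-a": {"title": "Course A"}, "course-b": None}

def fake_fetch(url):
    return fake_pages[url]

def fake_scrape(page):
    if page is None:
        raise ValueError("unexpected markup")
    return page

results, failures = scrape_many(["course-a", "course-b"], fake_fetch, fake_scrape, delay=0)
print(results)
print(failures)
```

Collecting failures instead of stopping at the first one matters here because the extraction functions rely on fixed positions in the page, which other course pages may not share.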


In [27]:
for key in data.keys():
    print(key)
    print("-------")
    print(data[key])
    print("\n")

title
-------
Safety & Selectivity in the Scale-Up of Chemical Reactions


course-outline
-------
The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.


benefits-of-attending
-------
Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scale. They will learn about what testing procedures are availa