# Scraping Scientific Update's Website with BeautifulSoup4
The idea behind this notebook is to practice web scraping skills. Scientific Update is a company that delivers training courses for professionals in chemistry, with a special focus on process chemistry and, in particular, organic synthesis.
Our goal is to develop a simple system that extracts the information about all the courses the company provides, and then to analyze the data to get some insights.

In [2]:
import re

# Web Scraping
import requests
from bs4 import BeautifulSoup

In [2]:
website = "https://www.scientificupdate.com/training_courses/safety-selectivity-in-the-scale-up-of-chemical-reactions-11/20220503/"
result = requests.get(website)
content = result.text

soup = BeautifulSoup(content, "lxml")
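
The request above assumes the server always answers with a valid page. A minimal sketch of a more defensive loader follows. The helper name fetch_soup, the session parameter and the 10-second timeout are assumptions for illustration, not part of the original notebook, and the sketch uses html.parser so that it does not depend on lxml being installed.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, session=None, timeout=10):
    """Download a page and parse it (hypothetical helper, not from the notebook)."""
    # Any object with a requests-style .get() works, which keeps the helper testable
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise an error on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Passing a requests.Session as the session argument would reuse the connection across the many pages scraped later on.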

In [3]:
#print(soup.prettify())

### Header

This is the main section and it contains important info, such as the title, dates and location.

In [4]:
def getTitle(soup):
    title = None
    main_item = soup.find("div", class_="main-item")
    title = main_item.h1.text
    return({"title":title})

# print(getTitle(soup))


def getHeader(soup):
    try:
        main_item = soup.find("div", class_="main-item")
        # Drop the <span class="title"> label elements so their text is not included
        for match in main_item.find_all("span", class_="title"):
            match.decompose()
        # Strip first, then capitalize, so leading whitespace cannot defeat capitalize()
        title = main_item.h1.text.strip().capitalize()
        date = main_item.find("p", class_="date").text.strip()
        location = main_item.find("p", class_="location").text.strip()
        return {"title": title, "date": date, "location": location}
    except Exception as e:
        print("Header information could not be scraped:\n", e)
        return {"title": None, "date": None, "location": None}
    
soup = BeautifulSoup(content, "lxml")
print(getHeader(soup))

{'title': 'Safety & selectivity in the scale-up of chemical reactions', 'date': '03 May - 06 May 2022', 'location': 'Online Platform'}


### Accordion

In [5]:
accordion = soup.find(id="accordion")
# print(accordion.prettify())

We can see that inside the accordion there is a "Testimonials" section with comments from previous course attendees. We want to extract the most relevant information about the course and we are not interested in these comments, since they are only there to promote the course (so they are all going to be cherry-picked positive messages).<br>
Let's get rid of that section by turning the code into a string, splitting it, and regenerating the HTML at the end.

In [6]:
splitted = str(accordion).split("<h3>Testimonials</h3>")[0]
accordion = BeautifulSoup(splitted, "lxml").find(id="accordion")
print(accordion.prettify())

<div id="accordion">
 <h3>
  Course Outline
 </h3>
 <div>
  <p>
   The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.
  </p>
 </div>
 <h3>
  Benefits of Attending
 </h3>
 <div>
  <ul>
   <li>
    Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scale
   </li>
   <li>
    They will learn about

Much better, right? Now we can see that each accordion section has an \<h3\> title followed by a div that might contain a paragraph, a list, or both.
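
Because every title is followed by its content div, the two can be paired by walking from each \<h3\> to its next sibling div, rather than relying on two parallel lists having the same length. A small sketch on a made-up fragment (the HTML below is illustrative, not copied from the site):

```python
from bs4 import BeautifulSoup

# Illustrative fragment that mimics the accordion layout described above
html = """
<div id="accordion">
  <h3>Course Outline</h3>
  <div><p>Outline text.</p></div>
  <h3>Benefits of Attending</h3>
  <div><ul><li>First benefit</li><li>Second benefit</li></ul></div>
</div>
"""

accordion = BeautifulSoup(html, "html.parser").find(id="accordion")

sections = {}
for h3 in accordion.find_all("h3"):
    # find_next_sibling skips whitespace text nodes and returns the next <div>
    body = h3.find_next_sibling("div")
    sections[h3.get_text(strip=True)] = body.get_text(" ", strip=True) if body else ""

for title, text in sections.items():
    print(title, "->", text)
```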

In [7]:
divs = accordion.find_all("div")
titles = accordion.find_all("h3")

In [8]:
for i in range(len(titles)):
    print(titles[i].text+"\n"+ divs[i].text +"\n")

Course Outline
The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.

Benefits of Attending
Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scaleThey will learn about what testing procedures are available to help them identify unsafe operating conditions to enable to talk knowledgably to in-hou

It can be seen that some entries inside the accordion not only have paragraphs but also lists. We are going to convert those lists into paragraphs

In [9]:
new_divs= []
changes = [{"old":"<li>", "new":"" },
         {"old": "</li>","new": ". "},
         {"old": "<ul>", "new": "<p>"},
         {"old": "</ul>", "new": "</p>"},
         {"old": "<strong>", "new": ""},
         {"old": "</strong>", "new": " "}]

for div in divs:
    string = str(div)
    for change in changes:
        string = string.replace(change["old"], change["new"])
    code = BeautifulSoup(string, "html.parser")
    new_divs.append(code)
    
for i in range(len(titles)):
    print(titles[i].text+"\n"+ new_divs[i].text +"\n")
          
        

Course Outline
The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.

Benefits of Attending
Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scale. They will learn about what testing procedures are available to help them identify unsafe operating conditions to enable to talk knowledgably to in-h
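
The same flattening can also be sketched without string replacement, by reading the paragraphs and list items from the parsed tree. This is an alternative under assumptions (the helper name div_to_text and the sample HTML are made up for illustration), not the approach the notebook uses:

```python
from bs4 import BeautifulSoup


def div_to_text(div):
    """Join paragraphs and list items of one accordion card into a single string."""
    parts = []
    for node in div.find_all(["p", "li"]):
        text = node.get_text(" ", strip=True)
        if not text:
            continue
        # End every fragment with a period so list items read as sentences
        if not text.endswith((".", "!", "?", ":")):
            text += "."
        parts.append(text)
    return " ".join(parts)


sample = BeautifulSoup(
    "<div><p>Intro paragraph</p><ul><li>First <strong>point</strong></li>"
    "<li>Second point.</li></ul></div>",
    "html.parser",
).div

print(div_to_text(sample))
```

One caveat of this sketch: a paragraph nested inside a list item would be collected twice, so markup like that would need an extra check.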

This looks good, so let's group all of this into a couple of functions.

### getAccordionInfo(soup, split_tag)
This function receives the BeautifulSoup object for the page and looks up the accordion in it. The accordion is a component of the course overview which displays basic information about the course, split into cards that each contain a title and the info. The end of the accordion holds information that is not relevant, so that part of the code has to be removed. The function receives a substring (split_tag) that indicates where to cut the code.

In [10]:
def getAccordionInfo(soup, split_tag="<h3>Testimonials</h3>"):
    try:
        accordion = soup.find(id="accordion")
        #config
        changes = [{"old":"<li>", "new":"" },
             {"old": "</li>","new": ". "},
             {"old": "<ul>", "new": "<p>"},
             {"old": "</ul>", "new": "</p>"},
             {"old": "<strong>", "new": ""},
             {"old": "</strong>", "new": " "}]

        # The accordion is split and only the beginning is kept
        splitted = str(accordion).split(split_tag)[0]
        accordion = BeautifulSoup(splitted, "lxml").find(id="accordion")

        children = []
        texts = []
        children = accordion.findChildren(recursive=False)
        for child in children:
            if(child.name in ["h1","h2","h3"]):
                texts.append(child.text.strip().upper()+":")
            else:
                texts.append(child.text.strip())
        text = " ".join(texts)
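        # Insert ". " where a lowercase letter is directly followed by an uppercase one
        # (sentences from adjacent elements that were glued together)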
        text = re.sub(r"([a-z])([A-Z])",r"\1. \2",text)
        return {"information": text}
    except Exception as e:
        print("Accordion information could not be scraped:\n", e)
        return {"information": None}
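
The regular expression near the end of getAccordionInfo deserves a closer look: it inserts a period and a space wherever a lowercase letter is directly followed by an uppercase one, which repairs sentences that were glued together when the tags were removed. A quick check using only the standard library (the sample strings are made up) also shows a side effect to keep in mind with chemistry text:

```python
import re

pattern = re.compile(r"([a-z])([A-Z])")

# Two list items glued together after the tags were stripped
glued = "danger on scaleThey will learn"
print(pattern.sub(r"\1. \2", glued))  # danger on scale. They will learn

# Side effect: mixed-case tokens such as "pH" are split as well
print(pattern.sub(r"\1. \2", "control of pH"))  # control of p. H

# Not handled: a period with no space after it is left untouched
print(pattern.sub(r"\1. \2", "3 hours.Heat transfer"))  # 3 hours.Heat transfer
```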

We test here that we get the result we want:

In [11]:
data = getAccordionInfo(soup)
for key in data.keys():
    print(data[key])


COURSE OUTLINE: The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents. BENEFITS OF ATTENDING: Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scale. They will learn about what testing procedures are available to help them identify unsafe operating conditions to enable to talk knowledgably to in-
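
Splitting on the literal string \<h3\>Testimonials\</h3\> only works while the markup matches exactly (no attributes or extra whitespace inside the tag). A hedged alternative is to trim the parsed tree instead: find the heading by its text and remove it together with everything after it. The helper name drop_from_heading and the sample HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def drop_from_heading(container, heading_text):
    """Remove the <h3> whose text equals heading_text and all siblings after it."""
    for h3 in container.find_all("h3"):
        if h3.get_text(strip=True).lower() == heading_text.lower():
            # Materialize the siblings first: extracting while iterating would stop early
            for sibling in list(h3.next_siblings):
                sibling.extract()
            h3.extract()
            break
    return container


html = """
<div id="accordion">
  <h3>Course Outline</h3><div><p>Outline text.</p></div>
  <h3 class="x"> Testimonials </h3><div><p>Great course!</p></div>
  <h3>More</h3><div><p>Also removed.</p></div>
</div>
"""
accordion = BeautifulSoup(html, "html.parser").find(id="accordion")
drop_from_heading(accordion, "Testimonials")
print([h3.get_text(strip=True) for h3 in accordion.find_all("h3")])
```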

## Description

In the center left of the page we have the overall description of the course. This includes dates and location. Let's extract that information

In [12]:
description = soup.find(class_="description")
# print(description.prettify())

In [13]:
for p in description.find_all("p"):
    print(p)
    print("\n")

<p><img alt="" class="alignnone size-medium wp-image-10900" height="59" loading="lazy" sizes="(max-width: 300px) 100vw, 300px" src="https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo-300x59.png" srcset="https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo-300x59.png 300w, https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo-1024x201.png 1024w, https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo-768x151.png 768w, https://www.scientificupdate.com/wp-content/uploads/2021/03/Online-Logo.png 1188w" width="300"/></p>


<p>We are delighted to be able to offer this course now <span style="color: #ff00ff;"><strong>ONLINE</strong></span>, this two-day course will be divided up into four sessions using an online platform, dates and times are as follows and set in BST (British Summer Time):</p>


<p><strong>Tuesday May 3rd | </strong>2.00pm – 5.00pm BST<br/> <strong>Wednesday May 4th| </strong>2.00pm – 5.00pm BST<strong

In [14]:
mode = description.find_all("p")[1].find("strong")
dates = description.find_all("p")[2]
pitch = description.find_all("p")[3:]

In [15]:
print(mode)

<strong>ONLINE</strong>


### getDescriptionInfo(soup)

In [16]:

def getDescriptionInfo(soup):
    description = soup.find(class_="description")
    # Pages without a description block should not break the whole scrape
    if description is None:
        return {"pitch": None}

    # Get the raw paragraphs from the description html
    raw_pitch = description.find_all("p")

    # Unify the pitch in one single text
    pitch = " ".join(p.text for p in raw_pitch).strip()

    return {"pitch": pitch}
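
The description also lists the individual sessions, with each day in a \<strong\> tag followed by the time as plain text. A minimal sketch of pulling those pairs out follows. The helper name parse_sessions is hypothetical and the sample is a trimmed fragment modeled on the paragraph printed above, so the real markup may need extra handling.

```python
from bs4 import BeautifulSoup, NavigableString


def parse_sessions(paragraph):
    """Return (day, time) pairs from <strong>day | </strong>time markup."""
    sessions = []
    for strong in paragraph.find_all("strong"):
        day = strong.get_text().replace("|", "").strip()
        nxt = strong.next_sibling
        # The time is the text node right after the <strong> tag, if there is one
        time_text = nxt.strip() if isinstance(nxt, NavigableString) else ""
        if day and time_text:
            sessions.append((day, time_text))
    return sessions


sample = BeautifulSoup(
    "<p><strong>Tuesday May 3rd | </strong>2.00pm – 5.00pm BST<br/> "
    "<strong>Wednesday May 4th| </strong>2.00pm – 5.00pm BST</p>",
    "html.parser",
).p

print(parse_sessions(sample))
```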


### Fee Info

This one is a little bit tricky because the fee info is inside a div with just a rather generic "box" class. Therefore, we have to target the text itself and find the div of class "box" that contains an h3 tag with the text "Fee Info".

In [17]:
boxes = soup.find_all("div", class_="box")
print(boxes)

[<div class="box"><h3>Fee Info</h3><p><span class="price">£1499.00 + VAT*, you will also have the choice to select payment in Dollars or Euros. *VAT will only be added to those companies based in the UK</span></p><h4>Discounts</h4><p>Up to 15% discount available on multiple bookings</p> <a class="booknow registrationAccess" role="button">Book Now</a></div>, <div class="box"><h3>Helpful Info</h3><p class="short-address" style="font-weight: bold"> Online Platform</p><p class="address">Join us from home</p></div>]


In [18]:
box = None
for div in boxes:
    if(div.h3):
        if div.h3.text.upper() == "FEE INFO":
            box = div
    
print("we found the box\n", box.prettify())


we found the box
 <div class="box">
 <h3>
  Fee Info
 </h3>
 <p>
  <span class="price">
   £1499.00 + VAT*, you will also have the choice to select payment in Dollars or Euros. *VAT will only be added to those companies based in the UK
  </span>
 </p>
 <h4>
  Discounts
 </h4>
 <p>
  Up to 15% discount available on multiple bookings
 </p>
 <a class="booknow registrationAccess" role="button">
  Book Now
 </a>
</div>



Now we can see that the price is inside a span with a class called "price"

In [19]:
price = box.find("span", class_="price")
print(price)

<span class="price">£1499.00 + VAT*, you will also have the choice to select payment in Dollars or Euros. *VAT will only be added to those companies based in the UK</span>


We have the text that mentions the price, now we are going to have to use a regular expression to match the price number.

In [20]:
def getFeeInfo(soup):
    no_result = {"price": None, "currency": None}
    boxes = soup.find_all("div", class_="box")
    box = None
    for div in boxes:
        if div.h3 and div.h3.text.strip().upper() == "FEE INFO":
            box = div
    if not box:
        return no_result
    price = box.find("span", class_="price")
    if not price:
        return no_result
    # Raw string for the pattern; the groups separate the symbol from the amount
    match = re.match(r"^\s*([£$])?(\d+\.\d+)", price.text)
    if not match:
        # No recognizable amount at the start of the text
        return no_result
    return {"price": match.group(2), "currency": match.group(1)}
        
        



In [21]:
fee_info = getFeeInfo(soup)
print("Fee Info\n\n", fee_info)

Fee Info

 {'price': '1499.00', 'currency': '£'}
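
The price pattern can be exercised on its own, without any HTML. The sketch below is a variant with named groups that uses only the standard library. The function name parse_price and the support for the euro sign and thousands separators are assumptions added for illustration, since the page shown here only uses a plain pound amount.

```python
import re

# Optional currency symbol, then digits with optional thousands commas and decimals
PRICE_RE = re.compile(r"^\s*(?P<currency>[£$€])?\s*(?P<amount>\d[\d,]*(?:\.\d+)?)")


def parse_price(text):
    """Return {'price': str or None, 'currency': str or None} from a fee string."""
    match = PRICE_RE.match(text)
    if not match:
        return {"price": None, "currency": None}
    amount = match.group("amount").replace(",", "")
    return {"price": amount, "currency": match.group("currency")}


print(parse_price("£1499.00 + VAT*, you will also have the choice to select payment"))
print(parse_price("$1,250"))
print(parse_price("Price on request"))
```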


## Scraping

In [22]:

def scrapCourse(soup):
    # Scrape all data
    header = getHeader(soup)
    information = getAccordionInfo(soup)
    description = getDescriptionInfo(soup)
    fee = getFeeInfo(soup)
    # Merge everything into one dictionary (the | operator on dicts needs Python 3.9+)
    data = header | information | description | fee
    return data
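
The | operator merges dictionaries only on Python 3.9 and later, and when two dictionaries share a key the right-hand value wins. A small sketch with made-up values shows both points, using an equivalent spelling that also runs on older interpreters:

```python
header = {"title": "Demo course", "date": "03 May - 06 May 2022"}
fee = {"price": "1499.00", "currency": "£"}
override = {"title": "Demo course (online)"}

# Unpacking works on Python 3.5+ and gives the same result as header | fee | override
merged = {**header, **fee, **override}

print(merged["title"])      # Demo course (online): the last dictionary wins
print(list(merged.keys()))  # keys keep the order in which they were first inserted
```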


## TESTING

In [23]:
website = "https://www.scientificupdate.com/training_courses/safety-selectivity-in-the-scale-up-of-chemical-reactions-11/20220503/"
result = requests.get(website)
content = result.text

soup = BeautifulSoup(content, "lxml")

data = scrapCourse(soup)


In [24]:
print(data.keys())

dict_keys(['title', 'date', 'location', 'information', 'pitch', 'price', 'currency'])


In [25]:
for key in data.keys():
    print(key)
    print("-------")
    print(data[key])
    print("\n")

title
-------
Safety & selectivity in the scale-up of chemical reactions


date
-------
03 May - 06 May 2022


location
-------
Online Platform


information
-------
COURSE OUTLINE: The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents. BENEFITS OF ATTENDING: Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more 

In [26]:
url_list=["https://www.scientificupdate.com/training_courses/heat-transfer-equipment-and-drying-module-3-3/20220513/","https://www.scientificupdate.com/training_courses/design-development-and-scale-up-of-safe-chemical-processes-and-operations-2/20220627/","https://www.scientificupdate.com/training_courses/practical-management-of-impurities-and-development-of-effective-and-comprehensive-control-strategies-9/20220322/","https://www.scientificupdate.com/training_courses/safety-selectivity-in-the-scale-up-of-chemical-reactions-11/20220503/"]

In [27]:
data = []
for url in url_list:
    print("scrapping")
    result = requests.get(url)
    content = result.text
    soup = BeautifulSoup(content, "lxml")
    info = scrapCourse(soup)
    data.append(info)

print(len(data))

scrapping
scrapping
scrapping
scrapping
4


In [28]:
for entry in data:
    print("NEW PAGE\n---------------\n")
    for key in entry.keys():
        print(key)
        print("------------------\n")
        print(entry[key])
        print("\n")
    

NEW PAGE
---------------

title
------------------

Heat transfer equipment and drying – module 3


date
------------------

13 May - 13 May 2022


location
------------------

Online Platform


information
------------------

MODULE OUTLINE: Module Outline. Based on the well-received face-to-face Chemical Engineering course, these modules have been structured to provide flexibility in a virtual training environment.  Interactivity is retained by enabling participants to split out into virtual break-out rooms to work on examples that help cement understanding of the material presented. Each module duration will be 3 hours.Heat transfer utilities and equipment. Heat transfer calculations. Types of heat transfer equipment. Heat transfer in batch vessels. Evaporation. Joule-Thomson effect. Drying. Humidity and the drying process. Application to solvents. Dryer design & equipment MODULE OBJECTIVES: Chemists, biologists and engineers regularly interact and collaborate in process R&D project
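
With one dictionary per course, the results can be saved for the analysis step. A minimal sketch using the standard csv module follows. The function name courses_to_csv and the sample rows are assumptions for illustration, and the field names are the keys returned by scrapCourse.

```python
import csv
import io

FIELDS = ["title", "date", "location", "information", "pitch", "price", "currency"]


def courses_to_csv(rows, handle):
    """Write a list of course dictionaries to an open text file handle."""
    # restval fills in missing keys; None values are written as empty cells
    writer = csv.DictWriter(handle, fieldnames=FIELDS, restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"title": "Demo course", "date": "03 May - 06 May 2022", "location": "Online Platform",
     "information": None, "pitch": "Short pitch", "price": "1499.00", "currency": "£"},
    {"title": "Partial row"},
]

buffer = io.StringIO()
courses_to_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, the handle would typically be opened with newline="" and encoding="utf-8", as the csv module documentation recommends.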

# Crawler

Now we want to make a crawler that navigates the pages. We need to get the URL of every course in order to scrape them with the scrapCourse function from above.

In [4]:
# Starting page
start_url = "https://www.scientificupdate.com/training/courses/"


#Urls to crawl
courses_urls = []

page = requests.get(start_url)
content = page.text
soup = BeautifulSoup(content, "lxml")
print(soup.prettify())



<!DOCTYPE html>
<html lang="en-GB">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="Conferences, training and consultancy for organic chemists and chemical engineers in pharmaceutical and fine chemical development industry." name="description"/>
  <meta content="scientific update, organic chemistry, chemical engineers, chemical training, scientific courses, worldwide science training, process chemistry courses, chemistry workshops, chemistry webinars" name="keywords"/>
  <link href="http://gmpg.org/xfn/11" rel="profile"/>
  <link href="https://www.scientificupdate.com/xmlrpc.php" rel="pingback"/>
  <link href="https://www.scientificupdate.com" hreflang="en-gb" rel="alternate"/>
  <script src="https://use.typekit.net/eta1mke.js">
  </script>
  <script>
   try{Typekit.load({ async: true });}catch(e){}
  </script>
  <meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="r

In [19]:
def getLinksByClass(soup, className=""):
    urls = []
    if className == "":
        # href=True skips anchors that have no href attribute
        links = soup.find_all("a", href=True)
    else:
        links = soup.find_all("a", class_=className, href=True)
    for link in links:
        urls.append(link["href"])
    return urls

for link in getLinksByClass(soup, "view-details"):
    print(link)

https://www.scientificupdate.com/training_courses/applied-mixing-technology-for-process-development-scaling-down-to-scale-up-5/20220314/
https://www.scientificupdate.com/training_courses/practical-management-of-impurities-and-development-of-effective-and-comprehensive-control-strategies-9/20220322/
https://www.scientificupdate.com/training_courses/biocatalysis-as-a-tool-for-the-synthetic-chemist-7/20220328/
https://www.scientificupdate.com/training_courses/an-introduction-to-chemical-engineering-science-3/20220425/
https://www.scientificupdate.com/training_courses/scale-up-what-goes-wrong-short-course-2/20220426/
https://www.scientificupdate.com/training_courses/safety-selectivity-in-the-scale-up-of-chemical-reactions-11/20220503/
https://www.scientificupdate.com/training_courses/solvents-and-solvent-selection-in-chemical-manufacturing/20220511/
https://www.scientificupdate.com/training_courses/fluid-transfer-and-liquid-mixing-module-1-3/20220511/
https://www.scientificupdate.com/train

In [20]:
for link in getLinksByClass(soup.find("div", class_="pagenumbers")):
    print(link)

https://www.scientificupdate.com/training/courses/page/2/
https://www.scientificupdate.com/training/courses/page/3/
https://www.scientificupdate.com/training/courses/page/4/
https://www.scientificupdate.com/training/courses/page/5/
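
Two details are worth guarding against when collecting links: anchors without an href attribute (the "Book Now" button shown earlier is one) and relative URLs. A sketch under assumptions follows. The helper name extract_links, the base URL parameter and the sample HTML are illustrative only.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(soup, base_url, class_name=None):
    """Return absolute, de-duplicated hrefs in document order."""
    if class_name:
        anchors = soup.find_all("a", class_=class_name, href=True)
    else:
        anchors = soup.find_all("a", href=True)
    urls = [urljoin(base_url, a["href"]) for a in anchors]
    # dict.fromkeys keeps the first occurrence of each URL and preserves order
    return list(dict.fromkeys(urls))


html = """
<a class="view-details" href="/training_courses/demo-a/">A</a>
<a class="view-details" href="https://example.com/training_courses/demo-b/">B</a>
<a class="view-details" href="/training_courses/demo-a/">A again</a>
<a class="booknow" role="button">Book Now</a>
"""
soup = BeautifulSoup(html, "html.parser")
print(extract_links(soup, "https://example.com/training/courses/", "view-details"))
```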


In [27]:
import time

def getCoursesLinks(start_url, pause=1):
    # In this list we will store the courses links
    links = []
    try:
        # Load the first page
        page = requests.get(start_url)
        content = page.text
        soup = BeautifulSoup(content, "lxml")
        # get urls from page numbers links
        next_pages = getLinksByClass(soup.find("div", class_="pagenumbers"))
        print("Getting links from first page...")
        links = getLinksByClass(soup, "view-details")
        # Pause between calls to avoid being banned
        time.sleep(pause)
        # Load the other pages and get the urls from the course links
        for url in next_pages:
            print("Getting links from next page...")
            page = requests.get(url)
            content = page.text
            soup = BeautifulSoup(content, "lxml")
            new_links = getLinksByClass(soup, "view-details")
            links = [*links, *new_links]
            # Pause between calls to avoid being banned
            time.sleep(pause)
        print("Done!")
        return links
    except Exception as e:
        print("An error has occurred when trying to retrieve the data", e)
        return []
        
links = getCoursesLinks(start_url, pause=10)

Getting links from first page...
Getting links from next page...
Getting links from next page...
Getting links from next page...
Getting links from next page...


In [28]:
for link in links: print(link)

https://www.scientificupdate.com/training_courses/applied-mixing-technology-for-process-development-scaling-down-to-scale-up-5/20220314/
https://www.scientificupdate.com/training_courses/practical-management-of-impurities-and-development-of-effective-and-comprehensive-control-strategies-9/20220322/
https://www.scientificupdate.com/training_courses/biocatalysis-as-a-tool-for-the-synthetic-chemist-7/20220328/
https://www.scientificupdate.com/training_courses/an-introduction-to-chemical-engineering-science-3/20220425/
https://www.scientificupdate.com/training_courses/scale-up-what-goes-wrong-short-course-2/20220426/
https://www.scientificupdate.com/training_courses/safety-selectivity-in-the-scale-up-of-chemical-reactions-11/20220503/
https://www.scientificupdate.com/training_courses/solvents-and-solvent-selection-in-chemical-manufacturing/20220511/
https://www.scientificupdate.com/training_courses/fluid-transfer-and-liquid-mixing-module-1-3/20220511/
https://www.scientificupdate.com/train
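
getCoursesLinks reads the pagination links only from the first page, so any page that is not linked from there would be missed. A hedged sketch of a variant that keeps following page-number links and remembers which pages it has already visited is shown below. The function name crawl_course_links and the fetch parameter (any callable that maps a URL to an HTML string) are assumptions for illustration, and the demo runs on two made-up in-memory pages rather than the real site.

```python
import time

from bs4 import BeautifulSoup


def crawl_course_links(start_url, fetch, pause=1.0):
    """Collect course links from the start page and every page-number link found."""
    to_visit, seen_pages, links = [start_url], set(), []
    while to_visit:
        url = to_visit.pop(0)
        if url in seen_pages:
            continue
        seen_pages.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        # Course links, skipping duplicates while keeping the original order
        for a in soup.find_all("a", class_="view-details", href=True):
            if a["href"] not in links:
                links.append(a["href"])
        # Queue every page-number link; already visited pages are skipped above
        pager = soup.find("div", class_="pagenumbers")
        if pager is not None:
            to_visit.extend(a["href"] for a in pager.find_all("a", href=True))
        if to_visit:
            time.sleep(pause)  # pause between requests
    return links


# Made-up pages standing in for the site: page 2 links back to page 1
fake_pages = {
    "p1": '<a class="view-details" href="c1"></a>'
          '<div class="pagenumbers"><a href="p2"></a></div>',
    "p2": '<a class="view-details" href="c2"></a><a class="view-details" href="c1"></a>'
          '<div class="pagenumbers"><a href="p1"></a></div>',
}
print(crawl_course_links("p1", fake_pages.__getitem__, pause=0))
```

With the real site, fetch could be something like a small function wrapping requests.get(url, timeout=10).text. Injecting it as a parameter keeps the crawl logic testable without network access.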