# Scraping Scientific Update's Website
The idea behind this notebook is to practice web scraping skills. Scientific Update is a company that delivers training courses for professionals in chemistry, with a special focus on process chemistry and, in particular, organic synthesis. 
Our goal is to develop a simple system that extracts the information about all the courses the company provides and to analyze the data to gain some insights.

In [1]:
# Web scraping
import requests
from bs4 import BeautifulSoup

## First attempt, using BeautifulSoup

In [2]:
website = "https://www.scientificupdate.com/training_courses/safety-selectivity-in-the-scale-up-of-chemical-reactions-11/20220503/"
result = requests.get(website, timeout=30)
result.raise_for_status()  # stop early if the page could not be fetched
content = result.text

soup = BeautifulSoup(content, "lxml")

In [3]:
#print(soup.prettify())

### Accordion

In [4]:
accordion = soup.find(id="accordion")
# print(accordion.prettify())

We can see that the accordion contains a "Testimonials" section with comments from previous course attendees. We want to extract the most relevant information about the course, and we are not interested in these comments since they are only there to promote the course (so they are all going to be cherry-picked positive messages).<br>
Let's get rid of that section by turning the code into a string, splitting it at the Testimonials heading, and re-parsing the first part as HTML.
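
As a point of comparison, the same trimming can also be done on the parse tree instead of on the string. The sketch below is only an illustration on a made-up snippet (the `html` markup and the `demo` name are assumptions, not the real page); it removes the Testimonials heading and every tag that follows it inside the accordion.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the accordion layout described above.
html = """
<div id="accordion">
  <h3>Course Outline</h3>
  <div><p>Outline text.</p></div>
  <h3>Testimonials</h3>
  <div><p>Great course!</p></div>
  <h3>More praise</h3>
  <div><p>Loved it.</p></div>
</div>
"""

demo = BeautifulSoup(html, "html.parser").find(id="accordion")

# Locate the Testimonials heading by its exact text.
heading = demo.find("h3", string="Testimonials")
if heading is not None:
    # Remove every tag that follows the heading, then the heading itself.
    for sibling in heading.find_next_siblings():
        sibling.decompose()
    heading.decompose()

remaining_titles = [h3.get_text(strip=True) for h3 in demo.find_all("h3")]
print(remaining_titles)
```

This variant does not depend on the exact spelling of the `<h3>` tag in the serialized HTML, although it still assumes the heading text is exactly "Testimonials".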

In [5]:
splitted = str(accordion).split("<h3>Testimonials</h3>")[0]
accordion = BeautifulSoup(splitted, "lxml").find(id="accordion")
print(accordion.prettify())

<div id="accordion">
 <h3>
  Course Outline
 </h3>
 <div>
  <p>
   The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.
  </p>
 </div>
 <h3>
  Benefits of Attending
 </h3>
 <div>
  <ul>
   <li>
    Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scale
   </li>
   <li>
    They will learn about

Much better, right? Now we can see that each accordion section has an \<h3\> title followed by a div that may contain a paragraph, a list, or both.

In [6]:
divs = accordion.find_all("div")
titles = accordion.find_all("h3")
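
One thing to keep in mind is that `find_all("div")` searches recursively, so the two lists stay aligned only as long as no section contains a nested div. A hedged sketch of a more defensive pairing, on a made-up snippet (the `html` markup, `demo` and `sections` are assumptions for illustration), takes each title and looks up the div that follows it as a sibling:

```python
from bs4 import BeautifulSoup

# Hypothetical accordion where the first section contains a nested div.
html = """
<div id="accordion">
  <h3>Course Outline</h3>
  <div><p>Outline text.</p><div>Nested note.</div></div>
  <h3>Benefits of Attending</h3>
  <div><ul><li>First benefit</li><li>Second benefit</li></ul></div>
</div>
"""

demo = BeautifulSoup(html, "html.parser").find(id="accordion")

# find_all is recursive, so the nested div is counted as well.
all_divs = demo.find_all("div")

# Pair every title with the div that comes right after it at the same level.
sections = {}
for h3 in demo.find_all("h3"):
    body = h3.find_next_sibling("div")
    if body is not None:
        sections[h3.get_text(strip=True)] = body.get_text(" ", strip=True)

print(len(all_divs), "divs found,", len(sections), "sections paired")
for title, text in sections.items():
    print(title, "->", text)
```

With this pairing, an extra div inside a section cannot shift the titles out of step with their content.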

In [15]:
for title, div in zip(titles, divs):
    print(title.text + "\n" + div.text + "\n")

Course Outline
The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.

Benefits of Attending
Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scaleThey will learn about what testing procedures are available to help them identify unsafe operating conditions to enable to talk knowledgably to in-hou

It can be seen that some entries inside the accordion contain lists as well as paragraphs, and the list items run together when printed. We are going to convert those lists into paragraphs.
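
The run-together text in the output above (for example "scaleThey") comes from `.text` joining the strings of consecutive `<li>` items with no separator. A minimal sketch on a made-up list (the `html` snippet is an assumption, not taken from the page) shows the effect and how `get_text` with a separator behaves:

```python
from bs4 import BeautifulSoup

# Hypothetical section body containing only a short list.
html = "<div><ul><li>First item</li><li>Second item</li></ul></div>"
div = BeautifulSoup(html, "html.parser").div

# .text concatenates the strings exactly as they appear in the markup.
plain = div.text

# get_text can strip each string and join the pieces with a separator.
separated = div.get_text(separator=". ", strip=True)

print(plain)      # First itemSecond item
print(separated)  # First item. Second item
```

The separator is placed between every string in the div, not only between list items, so on real sections with mixed markup the tag replacement used in this notebook gives finer control.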

In [28]:
new_divs = []
# Tag-level replacements that turn list markup into paragraph text
changes = [{"old": "<li>", "new": ""},
           {"old": "</li>", "new": ". "},
           {"old": "<ul>", "new": "<p>"},
           {"old": "</ul>", "new": "</p>"},
           {"old": "<strong>", "new": ""},
           {"old": "</strong>", "new": " "}]

for div in divs:
    # Apply every replacement to the raw HTML of the section, then re-parse it
    string = str(div)
    for change in changes:
        string = string.replace(change["old"], change["new"])
    code = BeautifulSoup(string, "html.parser")
    new_divs.append(code)

for div in new_divs:
    print("\n" + div.text + "\n")


The aim of the course is to give lab chemists an understanding of the issues that need to be considered during the early stages of scale up to large laboratory scale equipment (10-20 litre vessels) / kilo lab. The course will concentrate on chemical safety and selectivity issues and include information on what safety testing equipment is available and the uses and limitations of this equipment. Attendees will learn how to identify potential problems whether they be thermal hazards or selectivity issues. Methods used by other companies for handling hazardous reagents and reactions will be described as well as alternative chemistry to circumvent these reactions and/or reagents.


Attendees will learn how to identify potentially unsafe chemical processes, particularly those that pose more a danger on scale. They will learn about what testing procedures are available to help them identify unsafe operating conditions to enable to talk knowledgably to in-house safety professionals or contra