<a href="https://colab.research.google.com/github/MJMortensonWarwick/large_scale_data_for_research/blob/main/web_scraping_with_beautiful_soup.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping with requests and BeautifulSoup
In this notebook we will practice web scraping and information extraction using the packages requests and BeautifulSoup on the world-renowned website of the [University of Warwick](https://warwick.ac.uk).

To begin, we need to make sure the relevant packages are installed.

In [1]:
!pip install beautifulsoup4



We will also set up a headers variable. This tells the website that the request comes from a normal browser-type agent:

In [2]:
from bs4 import BeautifulSoup
from requests import get
import pandas as pd

headers = ({'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})

From here we can specify our website and make a GET request (as with an API). We can print the start of the response to check we have been able to access the site content.

In [4]:
warwick_web = "https://warwick.ac.uk"
response = get(warwick_web, headers=headers)

print(response.text[:500])

<!doctype html>
<html lang="en-GB" class="no-js">
    <head>
        <base href="https://warwick.ac.uk/">

        <meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">

<title>Welcome to the University of Warwick</title>

<meta name="description" content="University of Warwick website">
<meta name="keywords" content="University of Warwick, university, warwick, uk universities, warwick university, uk, uni
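
Printing the first 500 characters only confirms success by eye. As a minimal sketch, the request can be wrapped so that failures are reported explicitly. The helper name `fetch_html`, its `getter` parameter and the 10-second timeout are assumptions made for illustration and are not part of the original notebook.

```python
from requests import get
from requests.exceptions import RequestException


def fetch_html(url, headers=None, timeout=10, getter=get):
    """Return the page HTML as text, or None if the request fails.

    `getter` defaults to requests.get; it is a hypothetical hook that
    lets us swap in a stand-in function when testing offline.
    """
    try:
        response = getter(url, headers=headers, timeout=timeout)
        # raise_for_status() raises an HTTPError for 4xx and 5xx responses
        response.raise_for_status()
    except RequestException as error:
        print(f"Could not fetch {url}: {error}")
        return None
    return response.text
```

Calling `fetch_html("https://warwick.ac.uk", headers=headers)` would then give either the page text or `None`, which is easy to test for before parsing.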


Now that we have scraped some website content, let's see if we can find anything useful in it by parsing the content with BeautifulSoup. Let's start by extracting all the h1 tags (first-level headers):

In [5]:
warwick_soup = BeautifulSoup(response.text, 'html.parser')

warwick_h1 = warwick_soup.find_all('h1')
print(warwick_h1)

[<h1>
                            
                                The University of Warwick
                            
                            
                        </h1>]
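
The h1 above comes back with its surrounding whitespace and tags. A small sketch of how the text could be cleaned up, using a short inline HTML string (an assumption standing in for the downloaded page so the example runs offline):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the downloaded page (illustrative only)
sample_html = """
<html><body>
<h1>
        The University of Warwick
</h1>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# strip=True trims leading and trailing whitespace from the extracted text
clean_h1 = [tag.get_text(strip=True) for tag in sample_soup.find_all("h1")]
print(clean_h1)  # ['The University of Warwick']
```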


We can do something similar for h3 tags and extract only the text rather than the HTML tags:

In [7]:
warwick_tag = warwick_soup.find_all('h3')

for each in warwick_tag:
    print(each.get_text()) # get_text() already returns a string

Postgraduate Study 2024
Research Stories
National Centre for Research Culture
Knowledge Exchange
Warwick has been awarded Gold in all categories of the government's latest Teaching Excellence Framework (TEF) rankings
Labour leader sees University of Warwick’s industrial impact first hand
Warwick opens new Venice home as part of record £100m investment in the arts
University of Warwick recognised as international centre of research excellence by leading experts
Connect with us
Talk to us
Find us
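
If we want several heading levels at once, `find_all` also accepts a list of tag names. The sketch below builds a simple outline of (tag name, text) pairs from a made-up HTML snippet; the snippet and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML with mixed heading levels, including one empty heading
sample_html = """
<h1>The University of Warwick</h1>
<h3>Research Stories</h3>
<h3>  Knowledge Exchange </h3>
<h3></h3>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

outline = []
for tag in sample_soup.find_all(["h1", "h3"]):  # matches either tag name
    text = tag.get_text(strip=True)
    if text:  # skip headings with no visible text
        outline.append((tag.name, text))

print(outline)
```

Tags are returned in document order, so the outline follows the layout of the page.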


Here we will loop over two web pages (the suffixes list below) and extract the first h1 heading, the meta title and the meta description from each (the latter two via the og:title and og:description meta tags). As the website stores content inside containers, we will need to search inside these, again via a loop. Finally, note that we use the time module to make the script wait between requests via sleep(). This is good practice when scraping websites, as a script can make a lot of requests very quickly and overload the website server. Here we sleep for a random time of 1 to 3 seconds. We also time the whole cell using the notebook cell magic %%time.

In [8]:
%%time

from random import randint
from time import sleep

# set up the dictionary that will hold the results for each page
title_dict = {} # empty dictionary, keyed by URL suffix

uri = 'https://warwick.ac.uk/research/'
suffixes = ['ref', 'partnerships'] # add the rest in here

for suffix in suffixes: # loop through the suffixes list

    h1heading = [] # new list or empty the list to start again
    metatitle = []
    metadesc = []
    warwick_webscrape = uri + suffix # concatenate the URL and the suffix in the current loop
    r = get(warwick_webscrape, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    warwick_webpage = page_html.find_all('html')

    if warwick_webpage != []:
        for container in warwick_webpage:

            # page title
            meta_title = page_html.find("meta", property="og:title")
            metatitle.append(meta_title)

            # meta description
            meta_desc = page_html.find("meta", property="og:description")
            metadesc.append(meta_desc)

            # H1 (first heading in the container, if there is one)
            h1_tags = container.find_all('h1')
            if h1_tags:
                h1heading.append(h1_tags[0].text)


    else:
        continue

    title_dict[suffix] = h1heading, metatitle, metadesc # store the h1s, meta titles and meta descriptions as a tuple under the suffix


    sleep(randint(1,3))

CPU times: user 258 ms, sys: 8.26 ms, total: 267 ms
Wall time: 5.38 s
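
The URL building and the pause between requests can also be pulled out into small helpers. This is only a sketch: the names `build_urls` and `polite_pause`, and the `sleeper` parameter, are hypothetical and do not appear in the cell above.

```python
from random import uniform
from time import sleep
from urllib.parse import urljoin


def build_urls(base, suffixes):
    """Join each suffix onto the base URL (the base should end with '/')."""
    return [urljoin(base, suffix) for suffix in suffixes]


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleeper=sleep):
    """Wait a random time between requests and return the delay used.

    `sleeper` defaults to time.sleep; passing another function makes the
    helper easy to test without actually waiting.
    """
    delay = uniform(min_seconds, max_seconds)
    sleeper(delay)
    return delay


urls = build_urls("https://warwick.ac.uk/research/", ["ref", "partnerships"])
print(urls)
# ['https://warwick.ac.uk/research/ref', 'https://warwick.ac.uk/research/partnerships']
```

Inside a scraping loop we would call `polite_pause()` once after each request, which is what makes the wall time above so much longer than the CPU time.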


We can now inspect our data:

In [9]:
print(title_dict)

{'ref': (['\n\nResearch\n\n'], [<meta content="92% of our research is world-leading or internationally excellent." data-user-meta="true" property="og:title"/>], [<meta content="Our results in REF 2021 demonstrate the world class quality of our research, our approach, and most importantly, our people." data-user-meta="true" property="og:description"/>]), 'partnerships': (['\n\nResearch\n\n'], [None], [None])}
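
Note that the dictionary holds whole `<meta>` tags rather than their text, and `None` where a page has no matching tag. A minimal sketch of pulling out just the `content` attribute while coping with missing tags; the helper name `meta_content` and the inline HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def meta_content(soup, property_name):
    """Return the content attribute of a <meta property=...> tag, or None."""
    tag = soup.find("meta", property=property_name)
    if tag is None:  # the page has no such meta tag
        return None
    return tag.get("content")


# Inline HTML standing in for two downloaded pages (illustrative only)
with_tag = BeautifulSoup(
    '<html><head><meta property="og:title" content="Sample title"></head></html>',
    "html.parser",
)
without_tag = BeautifulSoup("<html><head></head></html>", "html.parser")

print(meta_content(with_tag, "og:title"))     # Sample title
print(meta_content(without_tag, "og:title"))  # None
```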


To finish off our work we will add our data to a pandas DataFrame and export as a CSV.

In [12]:
cols = ['Title'] # columns in our output file

warwickcsv = pd.DataFrame({'Title': title_dict})[cols]
warwickcsv.to_csv('warwick_scrape.csv') # the name of our file

# download a file to browser downloads
from google.colab import files
files.download("/content/warwick_scrape.csv")
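
The DataFrame above keeps each page's results as a single tuple in the Title column. As an alternative sketch, the same kind of data can be laid out with one row per page and one column per field. The `sample_results` dictionary below is a hand-written stand-in with the same shape as `title_dict` (plain strings in place of tags), and the column names are assumptions.

```python
import pandas as pd

# Stand-in for title_dict: suffix -> (h1 list, meta title list, meta description list)
sample_results = {
    "ref": (
        ["Research"],
        ["92% of our research is world-leading or internationally excellent."],
        ["Our results in REF 2021 demonstrate the world class quality of our research, our approach, and most importantly, our people."],
    ),
    "partnerships": (["Research"], [None], [None]),
}

rows = []
for suffix, (h1s, meta_titles, meta_descs) in sample_results.items():
    rows.append({
        "suffix": suffix,
        "h1": h1s[0] if h1s else None,
        "meta_title": meta_titles[0] if meta_titles else None,
        "meta_description": meta_descs[0] if meta_descs else None,
    })

tidy = pd.DataFrame(rows, columns=["suffix", "h1", "meta_title", "meta_description"])

# With no path, to_csv returns the CSV as a string; pass a filename to write a file
csv_text = tidy.to_csv(index=False)
print(csv_text)
```

Each field then gets its own CSV column, which is easier to filter and sort in a spreadsheet.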
