# SCRAPING OVERVIEW

## Legal & ethical tools

> Have I checked that here might be a public-facing API I can use instead?

> Am I overwhelming or pressurising the server by repeated requests? 

* __import time / time.sleep(5)__

> Am I running my scraper in peak hours? or in off-peak hours?

> Have I identified myself in the headers? Do I want to?

* __headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:71.0) Gecko/20100101 Firefox/71.0; Aidan ODonnell/odonnella4@cardiff.ac.uk"}__

> Am I using User-Agent rotation - if so, is it editorially justified?

> Have I read the terms of service? Do they expressly prohibit scraping? If not, what do they allow / not allow?

* "All text, images and other content on this website is copyright of Cardiff University unless explicitly stated otherwise. It may only be downloaded or copied without first obtaining permission for the purposes of teaching, administration and research within the University, or for personal, non-commercial use."

> What does the robots.txt file say?

* https://www.cardiff.ac.uk/robots.txt

> Have I set a timeout on my request?

* __requests.get(page, headers = headers, timeout = 5)__

> If there is people's personal data involved, am I working on the basis that the people have *not* consented to third-party (that's me in this case) use of this data? 

> Am I downloading copyrighted material? If so, does my use of it fall under copyright allowances?


# Scraping Cardiff University- School of Journalism

<br><br>
## Using Requests, Beautiful Soup

In [1]:
import requests

# work from a cache for speed and fewer requests: https://pypi.org/project/requests-cache/
# import requests_cache
# requests_cache.install_cache('demo_cache')

from bs4 import BeautifulSoup as bs


## Cardiff University

In [2]:
page = "https://www.cardiff.ac.uk/journalism-media-and-culture/people/academic-staff"

# headers, including my ID
my_headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:71.0) Gecko/20100101 Firefox/71.0; Aidan ODonnell/odonnella4@cardiff.ac.uk"}
# standard headers, no ID
# my_headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:71.0) Gecko/20100101 Firefox/71.0;"}

# adding a timeout so it stops if it doesn't work quickly
req = requests.get(page, headers = my_headers, timeout = 5)


In [3]:
req.content

b'\n\n\n\n\n\n\n\n \n\n<!--  End global propertises -->\n      \n\n\n\n\n\n\n\n\n\n\n\n\n\t\t\t\t\t\t\t\t\t\t\n\n\n\n<!--endnoindex--> \n<!-- End Nav section -->\n\n\n\n\t\t\t\t    \n    \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t      \n    \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t      \n    \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t      \n    \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t      \n    \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t      \n    \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t      \n    \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t  \t\t\t\t\n<!DOCTYPE html><!--[if lt IE 7]><html class="no-js ie ie6 lt-ie9 lt-ie8 lt-ie7" lang="en"><![endif]-->    <!--[if IE 7]><html class="no-js ie ie7 lt-ie9 lt-ie8" lang="en"><![endif]-->    <!--[if IE 8]><html class="no-js ie ie8 lt-ie9" lang="en"><![endif]-->    <!--[if IE 9]><html class="no-js ie ie9" lang="en"><![endif]-->    <!--[if (gte IE 9)|!(IE)]><!--><html lang="en" class="no-js"><!--<![endif]-->    <head>        <title>Academic staff - School of Journalism, Media an

In [4]:
req.text

'\n\n\n\n\n\n\n\n \n\n<!--  End global propertises -->\n      \n\n\n\n\n\n\n\n\n\n\n\n\n\t\t\t\t\t\t\t\t\t\t\n\n\n\n<!--endnoindex--> \n<!-- End Nav section -->\n\n\n\n\t\t\t\t    \n    \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t      \n    \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t      \n    \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t      \n    \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t      \n    \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t      \n    \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t      \n    \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t  \t\t\t\t\n<!DOCTYPE html><!--[if lt IE 7]><html class="no-js ie ie6 lt-ie9 lt-ie8 lt-ie7" lang="en"><![endif]-->    <!--[if IE 7]><html class="no-js ie ie7 lt-ie9 lt-ie8" lang="en"><![endif]-->    <!--[if IE 8]><html class="no-js ie ie8 lt-ie9" lang="en"><![endif]-->    <!--[if IE 9]><html class="no-js ie ie9" lang="en"><![endif]-->    <!--[if (gte IE 9)|!(IE)]><!--><html lang="en" class="no-js"><!--<![endif]-->    <head>        <title>Academic staff - School of Journalism, Media and

In [5]:
# we parse what's come back using BS

soup = bs(req.content)
soup

<!--  End global propertises --><!--endnoindex--><!-- End Nav section --><!DOCTYPE html>
<!--[if lt IE 7]><html class="no-js ie ie6 lt-ie9 lt-ie8 lt-ie7" lang="en"><![endif]--><!--[if IE 7]><html class="no-js ie ie7 lt-ie9 lt-ie8" lang="en"><![endif]--><!--[if IE 8]><html class="no-js ie ie8 lt-ie9" lang="en"><![endif]--><!--[if IE 9]><html class="no-js ie ie9" lang="en"><![endif]--><!--[if (gte IE 9)|!(IE)]><!--><html class="no-js" lang="en"><!--<![endif]--> <head> <title>Academic staff - School of Journalism, Media and Culture - Cardiff University</title> <meta charset="utf-8"/> <link href="https://www.cardiff.ac.uk/cy/journalism-media-and-culture/people/academic-staff" hreflang="cy" rel="alternate"/> <link href="https://www.cardiff.ac.uk/journalism-media-and-culture/people/academic-staff" hreflang="en" rel="alternate"/> <meta content="width=device-width, initial-scale=1" name="viewport"/> <meta content="Our academic staff come from a wide range of backgrounds with years of experienc

In [6]:
soup.find_all('a')

[<a class="skip-content" href="#page-title">Skip to main content</a>,
 <a aria-label="Supporting teaching innovation" class="nav-link ga-event" data-action="click" data-category="nav-global" data-label="Teaching excellence" href="https://www.cardiff.ac.uk/teaching-excellence">Teaching excellence</a>,
 <a aria-label="Alumni of the University" class="nav-link ga-event" data-action="click" data-category="nav-global" data-label="Alumni" href="https://www.cardiff.ac.uk/alumni">Alumni</a>,
 <a aria-label="Donate to the University" class="nav-link ga-event" data-action="click" data-category="nav-global" data-label="Donate" href="https://www.cardiff.ac.uk/donate">Donate</a>,
 <a aria-label="All news" class="nav-link ga-event" data-action="click" data-category="nav-global" data-label="News" href="https://www.cardiff.ac.uk/news">News</a>,
 <a aria-label="All events" class="nav-link ga-event" data-action="click" data-category="nav-global" data-label="Events" href="https://www.cardiff.ac.uk/events

In [7]:
# Use find_all - we get ALL the links

everything = soup.find_all('a')

print(type(everything))
print(len(everything))
everything


<class 'bs4.element.ResultSet'>
404


[<a class="skip-content" href="#page-title">Skip to main content</a>,
 <a aria-label="Supporting teaching innovation" class="nav-link ga-event" data-action="click" data-category="nav-global" data-label="Teaching excellence" href="https://www.cardiff.ac.uk/teaching-excellence">Teaching excellence</a>,
 <a aria-label="Alumni of the University" class="nav-link ga-event" data-action="click" data-category="nav-global" data-label="Alumni" href="https://www.cardiff.ac.uk/alumni">Alumni</a>,
 <a aria-label="Donate to the University" class="nav-link ga-event" data-action="click" data-category="nav-global" data-label="Donate" href="https://www.cardiff.ac.uk/donate">Donate</a>,
 <a aria-label="All news" class="nav-link ga-event" data-action="click" data-category="nav-global" data-label="News" href="https://www.cardiff.ac.uk/news">News</a>,
 <a aria-label="All events" class="nav-link ga-event" data-action="click" data-category="nav-global" data-label="Events" href="https://www.cardiff.ac.uk/events

In [8]:
# get a list of all the divs

print(len(soup.find_all('div')))
soup.find_all('div')

155


[<div aria-label="Skip to content" role="navigation"><a class="skip-content" href="#page-title">Skip to main content</a></div>,
 <div class="navbar">
 <ul class="nav">
 <li class="nav-item">
 <a aria-label="Supporting teaching innovation" class="nav-link ga-event" data-action="click" data-category="nav-global" data-label="Teaching excellence" href="https://www.cardiff.ac.uk/teaching-excellence">Teaching excellence</a>
 </li>
 <li class="nav-item">
 <a aria-label="Alumni of the University" class="nav-link ga-event" data-action="click" data-category="nav-global" data-label="Alumni" href="https://www.cardiff.ac.uk/alumni">Alumni</a>
 </li>
 <li class="nav-item">
 <a aria-label="Donate to the University" class="nav-link ga-event" data-action="click" data-category="nav-global" data-label="Donate" href="https://www.cardiff.ac.uk/donate">Donate</a>
 </li>
 <li class="nav-item">
 <a aria-label="All news" class="nav-link ga-event" data-action="click" data-category="nav-global" data-label="News"

In [9]:
# just the H2s, using a .text method to see no tags

for x in soup.find_all("h2"):
    print(x.text)

Professor Stuart Allan
Dr Tom Allbeson
Gavin Allen
Iolo ap Dafydd
Dr Lucy Bennett
Jane Bentley
Dr Mike Berry
Professor Paul Bowman
Dr Damian Carney
Dr Cynthia Carter
Ross Clarke
Sali Collins
Professor Simon Cottle
Professor Stephen Cushion
Professor Lina Dencik
Cathy Duncan
Naomi Dunstan
Dr Inaki Garcia-Blanco
Dr Ross Garner
Gwenfair Griffith
Dr Hugh Griffiths
Mark Griffiths
Dr David Dunkley Gyimah
Dr Hannah Hamad
Professor Ian Hargreaves
Dr Janet Harris
Michael Hill
Dr Arne Hintz
Timothy Holmes
Nicola Hooper
Dr Ceri Hughes
Dr Savyasaachi Jain
Dr John Jewell
Dr César Jiménez-Martínez
Rachael Jolley
Dr Jenny Kidd
Dr Susan Kinnear
Professor Jenny Kitzinger
Dr Maria Kyriakidou
Professor Justin Lewis
Dr Michael Kho Lim
Sian Morgan Lloyd
Sandra Loy
Sharon Magill
Emma Meese
Dr Galina Miazhevich
Dr Linda Mitchell
Dr Kerry Moore
Nick Mosdell
Dr Caitriona Noonan
Dr Aidan O'Donnell
Tony O'Shaughnessy
Dr Elliot Pill
Professor Richard Sambrook
Nick Skinner
Dr Francesca Sobande
Matt Swaine
Professo

<br><br>
## we need all the info for each person

In [38]:
# if we collect all the pars, we get some stuff before then all the job titles...

paras = soup.find_all("p")

for p in paras:
    print (p.text)

A leadinguniversity
in the heart of a thriving capital city
96%
of our graduates were in employment and/or further study, due to start a new job or course, or doing other activities, such as travelling.
(HESA 2021)
Top 5
UK University for research quality
(REF 2014)
Welcoming andambitious 
we are truly a global university
£600m
invested in our biggest campus upgrade for a generation
Working to make
a better future
for Wales and the world

Professor of Journalism and Communication
Lecturer in Cultural History
Digital Journalism Lecturer
Broadcast Lecturer
Lecturer in Media Audiences (Teaching and Research)
Course Director MA Magazine Journalism
Senior Lecturer
Deputy Head of School and Professor of Cultural Studies
Senior Lecturer Media Law (Teaching and Research)
Reader
Lecturer in Magazine Journalism
Lecturer
Professor of Media and Communication / Director of Communications, Human Security and Atrocity in Global Context Research Group
Director of Postgraduate Research
Professor
Lectur

In [None]:
for x in soup.find_all('div', class_ = 'card-inner'):
    for y in x.find_all('div', class_ = 'content'):
        print(y.text)
for z in x.find_all('div', class_ = 'info'):
print(z.text)


In [56]:
# this works, mostly

for a in soup.find_all("div", class_ = "profile profile-student with-image vcard"):
    for b in a.find_all("h2"):
        print(b.text)
    for c in a.find_all("p"):
        print(c.text)
    for d in a.find_all("dd"):
        print(d.text)
    print('-----')



Professor Stuart Allan
Professor of Journalism and Communication
allans@cardiff.ac.uk
+44(0)29 208 74509
-----
Dr Tom Allbeson
Lecturer in Cultural History
allbesont@cardiff.ac.uk
-----
Gavin Allen
Digital Journalism Lecturer
alleng3@cardiff.ac.uk
+44 29225 11351
-----
Iolo ap Dafydd
Broadcast Lecturer
Welsh speaking
apdafyddi@cardiff.ac.uk
029 2087 4756
-----
Dr Lucy Bennett
Lecturer in Media Audiences (Teaching and Research)
bennettl@cardiff.ac.uk
+44 (0)29 2251 0789
-----
Jane Bentley
Course Director MA Magazine Journalism
bentleyje@cardiff.ac.uk
+44(0)29 208 74681
-----
Dr Mike Berry
Senior Lecturer
berrym1@cardiff.ac.uk
+44 (0)29 208 70630
-----
Professor Paul Bowman
Deputy Head of School and Professor of Cultural Studies
bowmanp@cardiff.ac.uk
+44 (0)29 208 76797
-----
Dr Damian Carney
Senior Lecturer Media Law (Teaching and Research)
carneyd@cardiff.ac.uk
+44(0)29 208 74186
-----
Dr Cynthia Carter
Reader
cartercl@cardiff.ac.uk
+44 (0)29 2087 6172
-----
Ross Clarke
Lecturer in Mag

In [59]:
# this works

for a in soup.find_all("div", class_ = "profile profile-student with-image vcard"):
    for b in a.find_all("h2"):
        print(b.text)
    for c in a.find_all("p"):
        print(c.text)
    for d in a.find_all("dd", class_ = 'profile-contact-telephone'):
        print(d.text)
    for e in a.find_all('dd'):
        for f in e.find_all('a', href=True):
            print(f.text)
    print('-----')


Professor Stuart Allan
Professor of Journalism and Communication
+44(0)29 208 74509
allans@cardiff.ac.uk
-----
Dr Tom Allbeson
Lecturer in Cultural History
allbesont@cardiff.ac.uk
-----
Gavin Allen
Digital Journalism Lecturer
+44 29225 11351
alleng3@cardiff.ac.uk
-----
Iolo ap Dafydd
Broadcast Lecturer
029 2087 4756
apdafyddi@cardiff.ac.uk
-----
Dr Lucy Bennett
Lecturer in Media Audiences (Teaching and Research)
+44 (0)29 2251 0789
bennettl@cardiff.ac.uk
-----
Jane Bentley
Course Director MA Magazine Journalism
+44(0)29 208 74681
bentleyje@cardiff.ac.uk
-----
Dr Mike Berry
Senior Lecturer
+44 (0)29 208 70630
berrym1@cardiff.ac.uk
-----
Professor Paul Bowman
Deputy Head of School and Professor of Cultural Studies
+44 (0)29 208 76797
bowmanp@cardiff.ac.uk
-----
Dr Damian Carney
Senior Lecturer Media Law (Teaching and Research)
+44(0)29 208 74186
carneyd@cardiff.ac.uk
-----
Dr Cynthia Carter
Reader
+44 (0)29 2087 6172
cartercl@cardiff.ac.uk
-----
Ross Clarke
Lecturer in Magazine Journalis

## Assemble (don't just print)

In [62]:
# assemble via a list of lists 

total_list = []

for a in soup.find_all("div", class_ = "profile profile-student with-image vcard"):
    person_list = []
    for b in a.find_all("h2"):
        name = b.text
#         print(b.text)
    for c in a.find_all("p"):
        job = c.text
#         print(c.text)
    for d in a.find_all("dd", class_ = 'profile-contact-telephone'):
#         print(d.text)
        phone = d.text
    for e in a.find_all('dd'):
        for f in e.find_all('a', href=True):
#             print(f.text)
            mail = f.text
    person_list.extend((name, job, phone, mail))
    
    total_list.append(person_list)

len(total_list)

68

In [63]:
# what does a single entry on our list of lists look like?

total_list[4]

['Dr Lucy Bennett',
 'Lecturer in Media Audiences (Teaching and Research)',
 '+44 (0)29 2251 0789',
 'bennettl@cardiff.ac.uk']

In [72]:
# assemble the df from the list of lists

the_headers = ['name', 'job', 'phone', 'email']
df_l = pd.DataFrame(total_list, columns = the_headers)
df_l

Unnamed: 0,name,job,phone,email
0,Professor Stuart Allan,Professor of Journalism and Communication,+44(0)29 208 74509,allans@cardiff.ac.uk
1,Dr Tom Allbeson,Lecturer in Cultural History,+44(0)29 208 74509,allbesont@cardiff.ac.uk
2,Gavin Allen,Digital Journalism Lecturer,+44 29225 11351,alleng3@cardiff.ac.uk
3,Iolo ap Dafydd,Broadcast Lecturer,029 2087 4756,apdafyddi@cardiff.ac.uk
4,Dr Lucy Bennett,Lecturer in Media Audiences (Teaching and Rese...,+44 (0)29 2251 0789,bennettl@cardiff.ac.uk
...,...,...,...,...
63,Matt Walsh,Head of School,+44 29208 79916,walshm8@cardiff.ac.uk
64,Andrew Weeks,Welsh Medium Lecturer,+44 29208 79916,weeksa3@cardiff.ac.uk
65,Dr Carrie Westwater,Lecturer (Teaching and Research),07447 454 603,westwaterca1@cardiff.ac.uk
66,Dr Andy Williams,Senior Lecturer,+44(0)29 208 70088,williamsa28@cardiff.ac.uk


In [73]:
# assemble via a list of dictionaries 

list_of_dicts = []

for a in soup.find_all("div", class_ = "profile profile-student with-image vcard"):
    person_list = []
    for b in a.find_all("h2"):
        nom = b.text
#         print(b.text)
    for c in a.find_all("p"):
        boulot = c.text
#         print(c.text)
    for d in a.find_all("dd", class_ = 'profile-contact-telephone'):
#         print(d.text)
        telephone = d.text
    for e in a.find_all('dd'):
        for f in e.find_all('a', href=True):
#             print(f.text)
            mel = f.text
    
    list_of_dicts.append({"name": nom, "job": boulot, "phone":telephone, "email": mel})

len(entry)

68

In [74]:
# what does a single entry look like here?

list_of_dicts[5]

{'name': 'Jane Bentley',
 'job': 'Course Director MA Magazine Journalism',
 'phone': '+44(0)29 208 74681',
 'email': 'bentleyje@cardiff.ac.uk'}

In [70]:
# assemble the df using the list of dictionaries

df_cols = ["name", "job", "phone", "email"]

df_d = pd.DataFrame(list_of_dicts)

df_d

Unnamed: 0,name,job,phone,email
0,Professor Stuart Allan,Professor of Journalism and Communication,+44(0)29 208 74509,allans@cardiff.ac.uk
1,Dr Tom Allbeson,Lecturer in Cultural History,+44(0)29 208 74509,allbesont@cardiff.ac.uk
2,Gavin Allen,Digital Journalism Lecturer,+44 29225 11351,alleng3@cardiff.ac.uk
3,Iolo ap Dafydd,Broadcast Lecturer,029 2087 4756,apdafyddi@cardiff.ac.uk
4,Dr Lucy Bennett,Lecturer in Media Audiences (Teaching and Rese...,+44 (0)29 2251 0789,bennettl@cardiff.ac.uk
...,...,...,...,...
63,Matt Walsh,Head of School,+44 29208 79916,walshm8@cardiff.ac.uk
64,Andrew Weeks,Welsh Medium Lecturer,+44 29208 79916,weeksa3@cardiff.ac.uk
65,Dr Carrie Westwater,Lecturer (Teaching and Research),07447 454 603,westwaterca1@cardiff.ac.uk
66,Dr Andy Williams,Senior Lecturer,+44(0)29 208 70088,williamsa28@cardiff.ac.uk


In [None]:
# save locally, DROPPING THE INDEX!

df_l.to_csv("my_lovely_df_of_staff.csv", index = False)