# INTRODUCTION

## Extract album title from a single item on a single page
SOURCE: https://github.com/qut-dmrc/web-scraping-intro-workshop/blob/master/web-scraping-intro-step1.ipynb

This notebook gets a page from the Metacritic website and then extracts one of the fields we are interested in from it.


In [2]:
# Import python modules
import bs4       # BeautifulSoup4 is a Python package for parsing HTML and XML documents
import requests  # Requests lets you send HTTP requests from Python

The next steps build up the URL that has the information we want. The parts of the URL that we will change to get more pages of information are kept separate so we can change them more easily.

In [4]:
# This is the base_url
base_url = "http://www.metacritic.com/browse/albums/artist"

In [6]:
# Select which page to scrape based on the first letter of the artist names
lett = "/a"

In [10]:
# Build the url (only scrape the first page - page 0)
page = base_url+lett+"?page=0"

Now let's check what the variable page is set to. You can show the value of any variable in a notebook by putting it on the last line of a notebook cell and running the cell. Jupyter will try to display it in a clear way, often clearer than the default 'print' layout.

In [11]:
page

'http://www.metacritic.com/browse/albums/artist/a?page=0'
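Keeping the letter and the page number separate makes it easy to generate other URLs later. The following is a minimal sketch, assuming a hypothetical helper named `build_url` (it is not part of the workshop notebook) and assuming that other letters and pages follow the same URL pattern as the one shown above:

```python
# Hypothetical helper: builds a browse URL from the parts used above.
BASE_URL = "http://www.metacritic.com/browse/albums/artist"

def build_url(letter, page_number=0):
    """Return the browse URL for one artist letter and one page number."""
    return f"{BASE_URL}/{letter}?page={page_number}"

# URLs for the first page of the letters a, b and c
urls = [build_url(letter) for letter in "abc"]
print(urls[0])
```

Here the slash is added inside the helper, so the letter is passed as "a" rather than "/a".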

These steps get the page using Requests and then process it using BeautifulSoup.



In [12]:
# the bot pretends to be a Chrome browser
hdrs = {"User-Agent": "Chrome/78.0"}

In [25]:
# Call the URL and check the status code that the server returned.
# A 200 status code means the request succeeded and we can carry on.
# A 404 status code means the page was not found.
response_url = requests.get(page, headers=hdrs)

if response_url.status_code == 200:
    print("Success!")
elif response_url.status_code == 404:
    print("Not found!")

Success!
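The check above only reports 200 and 404, so any other status code prints nothing. As a rough sketch, the standard library's `http.HTTPStatus` can turn any registered code into a readable phrase. The helper name `describe_status` is hypothetical and is not part of the workshop notebook:

```python
from http import HTTPStatus

def describe_status(code):
    """Return a readable description such as '404 Not Found'."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # The code is not a registered HTTP status
        return f"{code} Unknown status"

for code in (200, 403, 404):
    print(describe_status(code))
```

In the notebook this could be called as `describe_status(response_url.status_code)`.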


In [27]:
response_url

<Response [200]>

In [30]:
# Transform to soup using html.parser 
soup = bs4.BeautifulSoup(response_url.text, "html.parser")
soup

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "https://www.w3.org/TR/html4/strict.dtd">

<html xml:lang="en">
<head>
<title>Music and Albums from A-Z  by Artist, letter A - Metacritic</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/><script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var o=n[t]={exports:{}};e[t][0].call(o.exports,function(n){var o=e[t][1][n];return r(o||n)},o,o.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<t.length;o++)r(t[o]);return r}({1:[function(e,n,t){function r(){}function o(e,n,t){return function(){return i(e,[c.now()].concat(u(arguments)),n?null:this,t),n?void 0:this}}var i=e("handle"),a=e(3),u=e(4),f=e("ee").get("tracer"),c=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var p=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],d="api-",l=d+"i

In [35]:
# Find all div-tags of class "product_wrap" (We found it using the SelectorGadget extension)
title_tag = soup.find_all("div", class_=["product_wrap"]) 

# Have a look at the first item in the list
title_tag[0]

<div class="product_wrap">
<div class="basic_stat product_title">
<a href="/music/colonia/a-camp">
                            Colonia
                                                    </a>
</div>
<div class="basic_stat product_score brief_metascore">
<div class="metascore_w small release positive">64</div>
</div>
<div class="basic_stat condensed_stats">
<ul class="more_stats">
<li class="stat product_artist">
<span class="label">Artist:</span>
<span class="data">A Camp</span>
</li>
<li class="stat product_avguserscore">
<span class="label">User:</span>
<span class="data textscore textscore_favorable">8.0</span>
</li>
<li class="stat release_date full_release_date">
<span class="label">Release Date:</span>
<span class="data">Apr 28, 2009</span>
</li>
</ul>
</div>
</div>

In [36]:
# Extract the first div-tag of class "product_title" from the first item
thetitle = title_tag[0].find("div", class_="product_title")
thetitle

<div class="basic_stat product_title">
<a href="/music/colonia/a-camp">
                            Colonia
                                                    </a>
</div>

In [46]:
# The album name is the text part of this tag
temp= thetitle.get_text()
temp

'\n\n                            Colonia\n                                                    \n'

In [47]:
# It's poorly formatted so we need to clean it up a bit by first splitting the string into a list of words
temptemp = temp.split()
temptemp

['Colonia']

In [48]:
# And then we need to join the words back together with single spaces between them
clean_title = " ".join(temptemp)
clean_title

'Colonia'
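The split-then-join steps above can be wrapped into one small function so they are easy to reuse. This is a sketch only; the name `clean_text` is hypothetical and does not appear in the workshop notebook:

```python
def clean_text(raw):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    # split() with no arguments splits on any whitespace, including newlines
    return " ".join(raw.split())

print(clean_text("\n\n      Call It\n   Blazing   \n"))
```

Because `split()` drops all leading, trailing and repeated whitespace, the same function works for one-word and multi-word titles.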

## Extract all album titles on a single page
SOURCE: https://github.com/qut-dmrc/web-scraping-intro-workshop/blob/master/web-scraping-intro-step2.ipynb

This notebook extends the previous step to get all of the titles from a single page.

- We already have the bs4 and requests Python modules, so we don't need to import more modules.
- We also have base_url, lett, page and the browser header (hdrs)
- We have checked that the status code is 200

In [43]:
page

'http://www.metacritic.com/browse/albums/artist/a?page=0'

In [45]:
# Transform to soup using html.parser
soup = bs4.BeautifulSoup(response_url.text, "html.parser")

# Find all div-tags of class "product_wrap" (We found it using the SelectorGadget extension)
title_tag = soup.find_all("div", class_=["product_wrap"]) 

Now process **title_tag** in a new way to get all the items instead of just one.

In [50]:
# Let's do the same thing as in the previous step but for all items in the page

album_names = []   # avoid the name "list", which would shadow Python's built-in

for item in title_tag:
    
    # extract the first div-tag from the item
    thetitle = item.find("div", class_="product_title")
    
    # extract and clean up the album name
    temptemp = thetitle.get_text()
    temptemp = temptemp.split()
    album_name = " ".join(temptemp)
    
    # add the album name to the list
    album_names.append(album_name)

In [51]:
album_names

['Colonia',
 'Call It Blazing',
 'Common Courtesy',
 'Bad Vibrations',
 'Pines',
 'Pile',
 'Toy',
 'Feathers Wet, Under the Moon',
 'Wooden Mask',
 'Passover',
 "You're Always on My Mind",
 'A Gun Called Tension',
 'Essence',
 'Darkness At Noon',
 'The Way The Wind Blows',
 'Cervantine',
 'You Have Already Gone to the Other World',
 'And Hell Will Follow Me',
 'Thirteenth Step',
 'eMOTIVe',
 'Eat the Elephant',
 'A Place To Bury Strangers',
 'Exploding Head',
 'Onwards to the Wall [EP]',
 'Worship',
 'Transfixiation',
 'Pinned',
 'Elasticity',
 'Ashes Grammar',
 'Nitetime Rainbows [EP]',
 'Autumn, Again',
 'Sea When Absent',
 'We Got It From Here...Thank You 4 Your Service',
 'Partycrasher',
 'A Winged Victory for the Sullen',
 'Atomos',
 'Iris [Original Motion Picture Soundtrack]',
 'The Undivided Five',
 'Trap Lord',
 'Always Strive and Prosper',
 'Still Striving [Mixtape]',
 'Cozy Tapes, Vol. 1: Friends',
 'Cozy Tapes, Vol. 2: Too Cozy',
 'Live Love A$AP',
 'Long.Live.A$AP',
 'At.Lo

## Extract all review data from a single page
SOURCE: https://github.com/qut-dmrc/web-scraping-intro-workshop/blob/eed5d2a9dcc328dc2988b31ac0b8adadc2f0561c/web-scraping-intro-step3.ipynb

This notebook extends the previous step from getting just the title field to getting all the fields we are interested in from each item.
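The following is one possible sketch of that idea, not the code from the step 3 notebook. It runs on a sample item copied from the output shown in step 1, so it needs no network access. The helper names (`text_of`, `stat_value`, `extract_fields`) and the dictionary keys are hypothetical; the class names come from the sample HTML and may not match the live site.

```python
import bs4

# Sample item copied from the output shown in step 1
ITEM_HTML = """
<div class="product_wrap">
<div class="basic_stat product_title">
<a href="/music/colonia/a-camp">
                            Colonia
                                                    </a>
</div>
<div class="basic_stat product_score brief_metascore">
<div class="metascore_w small release positive">64</div>
</div>
<div class="basic_stat condensed_stats">
<ul class="more_stats">
<li class="stat product_artist">
<span class="label">Artist:</span>
<span class="data">A Camp</span>
</li>
<li class="stat product_avguserscore">
<span class="label">User:</span>
<span class="data textscore textscore_favorable">8.0</span>
</li>
<li class="stat release_date full_release_date">
<span class="label">Release Date:</span>
<span class="data">Apr 28, 2009</span>
</li>
</ul>
</div>
</div>
"""

def text_of(tag):
    """Return the cleaned text of a tag, or None if the tag is missing."""
    if tag is None:
        return None
    return " ".join(tag.get_text().split())

def stat_value(item, stat_class):
    """Return the text of the 'data' span inside the <li> of the given class."""
    li = item.find("li", class_=stat_class)
    if li is None:
        return None
    return text_of(li.find("span", class_="data"))

def extract_fields(item):
    """Collect the fields of one product_wrap item into a dictionary."""
    return {
        "title": text_of(item.find("div", class_="product_title")),
        "metascore": text_of(item.find("div", class_="metascore_w")),
        "artist": stat_value(item, "product_artist"),
        "user_score": stat_value(item, "product_avguserscore"),
        "release_date": stat_value(item, "release_date"),
    }

soup = bs4.BeautifulSoup(ITEM_HTML, "html.parser")
item = soup.find("div", class_="product_wrap")
fields = extract_fields(item)
print(fields)
```

Returning None for a missing tag means an item without, say, a user score does not stop the loop over a whole page.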