# Web Scraping, a.k.a. `requests`

### 🎯 Goal: learn how to get pages from a website

Web scraping = getting data from websites

In [1]:
import requests

`requests` lets us send HTTP requests to websites/servers. The server sends back a response, which contains a status code and, if the request was successful, the full HTML of the page.

In [2]:
url = 'https://www.spiced-academy.com/en'

In [3]:
response = requests.get(url)

In [4]:
response

<Response [200]>

200 is an HTTP status (response) code; it means "OK". 404 ("Not Found") is another one.

* 200-range: success!
* 300-range: redirect
* 400-range: client-side error (_it's not them, it's you!_)
* 500-range: server-side error (_it's not you, it's them!_)
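The ranges above can be sketched as a small helper. This is only an illustration: `describe_status` is a hypothetical name, not part of `requests`, and the labels simply mirror the bullet points.

```python
def describe_status(code):
    """Map an HTTP status code to the broad range it belongs to."""
    if 200 <= code < 300:
        return 'success'
    if 300 <= code < 400:
        return 'redirect'
    if 400 <= code < 500:
        return 'client-side error'
    if 500 <= code < 600:
        return 'server-side error'
    return 'other'  # e.g. the 100-range, which is not covered above


for code in (200, 301, 404, 503):
    print(code, describe_status(code))
```

In a scraping loop, a check like this on `response.status_code` could decide whether a page is worth saving at all.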

In [5]:
response.status_code

200

In [6]:
type(response)

requests.models.Response

In [13]:
print(response.text[:1000])

<!DOCTYPE html>
<html lang="en" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#" dir="ltr">

<head>
    <title>Your new career starts here | Spiced Academy</title>
    <meta name="description" content="Kickstart your new career with our intensive, on-site tech programs in Web Development and Data Science.">
    
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <link href="https://fonts.googleapis.com/css?family=Poppins:300,400,600&display=swap" rel="stylesheet">
    <link href="https://fonts.googleapis.com/css?family=IBM+Plex+Mono:400,500&display=swap" rel="stylesheet">
    <link rel='stylesheet' href='/css/main.css'>
    <link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png?v=3">
    <link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png?v=3">
    <link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png?v=3">
    <link rel="mask-icon" href="/safari-pinned-tab


What do you notice?

* It is written in HTML (you don't need to know HTML to scrape data though!)
* It is structured using _tags_ (stuff in angled brackets), _classes_ and _ids_
* It is nested/hierarchical
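To make the nesting visible, here is a minimal sketch that uses `html.parser` from Python's standard library to print each opening tag, indented by its depth. The `sample_html` string and the `TagPrinter` class are made up for illustration; a real page has many more tags, including ones without a closing tag (such as `<meta>`), which this simple depth counter does not handle.

```python
from html.parser import HTMLParser

# A tiny, hand-written page (an assumption, not taken from a real site)
sample_html = """<html>
<head><title>Demo page</title></head>
<body>
<div id="main" class="content">
<a href="/wiki/Thriller_(song)">Thriller</a>
</div>
</body>
</html>"""


class TagPrinter(HTMLParser):
    """Record every opening tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((self.depth, tag, dict(attrs)))
        self.depth += 1  # everything until the closing tag is one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1


parser = TagPrinter()
parser.feed(sample_html)
for depth, tag, attrs in parser.seen:
    print('  ' * depth + tag, attrs)
```

The `id`, `class` and `href` values show up as attributes of their tags, which is what you search for when scraping.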

### Writing/reading files in python

(You will need this to save your artist's page to a file on your computer.)

In [14]:
f = open('spiced_html.txt', 'w', encoding='utf-8') # w: write (utf-8 avoids encoding errors on some systems)
f.write(response.text)
f.close()

51189

In [16]:
with open('spiced_html2.txt', 'w', encoding='utf-8') as f: # the file is closed automatically
    f.write(response.text)

In [17]:
with open('spiced_html2.txt', 'r', encoding='utf-8') as f: # r: read
    spiced_html = f.read()

In [18]:
print(spiced_html[:1000])

<!DOCTYPE html>
<html lang="en" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#" dir="ltr">

<head>
    <title>Your new career starts here | Spiced Academy</title>
    <meta name="description" content="Kickstart your new career with our intensive, on-site tech programs in Web Development and Data Science.">
    
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <link href="https://fonts.googleapis.com/css?family=Poppins:300,400,600&display=swap" rel="stylesheet">
    <link href="https://fonts.googleapis.com/css?family=IBM+Plex+Mono:400,500&display=swap" rel="stylesheet">
    <link rel='stylesheet' href='/css/main.css'>
    <link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png?v=3">
    <link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png?v=3">
    <link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png?v=3">
    <link rel="mask-icon" href="/safari-pinned-tab.svg" color

In [19]:
with open('spiced_html2.txt', 'a') as f: # a: append
    f.write('I am adding a line.')
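The difference between the three modes is easiest to see side by side. The sketch below works in a temporary directory so it does not touch the files created above; the file name `demo.txt` is arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')

    # 'w' creates the file (or empties it if it already exists)
    with open(path, 'w', encoding='utf-8') as f:
        f.write('first line\n')

    # 'a' keeps the existing content and writes at the end
    with open(path, 'a', encoding='utf-8') as f:
        f.write('second line\n')

    with open(path, 'r', encoding='utf-8') as f:
        after_append = f.read()

    # opening with 'w' again throws the old content away
    with open(path, 'w', encoding='utf-8') as f:
        f.write('fresh start\n')

    with open(path, 'r', encoding='utf-8') as f:
        after_rewrite = f.read()

print(repr(after_append))
print(repr(after_rewrite))
```

Note that `write` does not add a line break for you; the `\n` has to be part of the string.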

### Inspecting a webpage

(You will need this to find links to individual songs in your artist's page.)

In [20]:
url = 'https://en.wikipedia.org/wiki/Michael_Jackson'

In [21]:
response = requests.get(url)

In [22]:
response

<Response [200]>

In [30]:
print(response.text[:1000])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Jam (song) - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"f1be453e-6300-4df6-89ea-5e9f57aa1309","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Jam_(song)","wgTitle":"Jam (song)","wgCurRevisionId":1072753187,"wgRevisionId":1072753187,"wgArticleId":3943758,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 Finnish-language sources (fi)","Articles with short description","Short description is different from Wikidata","Use American English from August 2019","All W

In [24]:
import re

In [25]:
# grab every 'href' plus the 100 characters that follow it, to see what the links look like
links = re.findall('href.{100}', response.text)

In [28]:
# You will automate this part for your project
song_links = ['/wiki/Blood_on_the_Dance_Floor_(song)',
              '/wiki/Thriller_(song)',
              '/wiki/Jam_(song)']
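A hand-written list like the one above could be produced automatically with a narrower regular expression. The following is a sketch under assumptions: `sample_html` is a made-up snippet standing in for `response.text`, and the pattern only matches links written as `href="/wiki/..."` with double quotes. On a real page you would still need to filter out links that are not songs.

```python
import re

# Stand-in for response.text (hypothetical snippet, not real Wikipedia HTML)
sample_html = '''
<a href="/wiki/Thriller_(song)" title="Thriller (song)">Thriller</a>
<a href="/wiki/Jam_(song)">Jam</a>
<a href="https://example.com/elsewhere">external</a>
<a href="/wiki/Thriller_(song)">Thriller again</a>
'''

# The parentheses capture only the path between the quotes
wiki_link_pattern = r'href="(/wiki/[^"]+)"'
found = re.findall(wiki_link_pattern, sample_html)

# Remove duplicates while keeping the order of first appearance
unique_links = list(dict.fromkeys(found))
print(unique_links)
```

Because the pattern contains one capture group, `re.findall` returns just the captured paths rather than the whole match.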

In [29]:
for counter, link in enumerate(song_links):
    url = 'https://en.wikipedia.org' + link
    response = requests.get(url)
    with open(f'{counter}.txt', 'w', encoding='utf-8') as f:
        f.write(response.text)
    print(counter, link, response.status_code)

0 /wiki/Blood_on_the_Dance_Floor_(song) 200
1 /wiki/Thriller_(song) 200
2 /wiki/Jam_(song) 200


In [36]:
for counter, link in enumerate(song_links):
    url = 'https://en.wikipedia.org' + link
    response = requests.get(url)
    song_title = link.split('/')[2]
    with open(f'{song_title}.txt', 'w', encoding='utf-8') as f:
        f.write(response.text)
    print(counter, song_title, response.status_code)

0 Blood_on_the_Dance_Floor_(song) 200
1 Thriller_(song) 200
2 Jam_(song) 200


You have two options:
* Save the full HTML of each song page into a file
* Do some more web scraping magic, extract just the lyrics, and save those into a file
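As a first step towards the second option, here is a rough sketch that strips the tags from a page and keeps only the visible text, using `html.parser` from the standard library. `TextExtractor` and `sample_html` are invented for illustration. Finding just the lyrics depends on how your chosen website structures its pages, so this is a starting point rather than a finished solution.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


# Made-up page standing in for a saved song page
sample_html = (
    '<html><head><style>p {color: red;}</style></head>'
    '<body><h1>Song title</h1><p>First line</p><p>Second line</p>'
    '<script>var x = 1;</script></body></html>'
)

extractor = TextExtractor()
extractor.feed(sample_html)
visible_text = '\n'.join(extractor.chunks)
print(visible_text)
```

The resulting string can be written to a file with `open(..., 'w', encoding='utf-8')`, exactly like the full HTML.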