# 🕸️ Web Scraping 🕸️
Web scraping is a technique that data scientists and data analysts use to collect data and content from the internet.

Common data types:
+ images
+ text
+ product information
+ customer reviews
+ customer data

How does it work?
1. We make an HTTP request to a server.
2. The server sends us a response.
3. We extract the valuable information from the server response.
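
The three steps above can be sketched end to end without touching a real website. This is only an illustration using the standard library: `DemoHandler`, `PAGE` and `fetch_title` are hypothetical names, and a small local server stands in for the remote site. The rest of this notebook uses `requests` and BeautifulSoup for the same steps.

```python
import re
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page served by the local stand-in server.
PAGE = b"<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a real web server."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def fetch_title(url):
    # Ignore proxy settings so the request really goes to the local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    # Step 1: send an HTTP GET request to the server.
    with opener.open(url, timeout=5) as response:
        # Step 2: the server answers with a status code and a body.
        status = response.status
        html = response.read().decode("utf-8")
    # Step 3: extract the part we care about, here the <title> text.
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    return status, (match.group(1) if match else None)


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
status, title = fetch_title(f"http://127.0.0.1:{server.server_port}/")
server.shutdown()
server.server_close()
print(status, title)
```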

How to avoid getting blacklisted?
+ Read [this ScrapeHero guide](https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/)

How to make the request look like it comes from a browser

```python
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
```

## Warm-up exercise
**Match the concepts with the correct descriptions:**

<table border="1">
    <tr>
        <th>Concept</th>
        <th>Description</th>
    </tr>
    <tr>
        <td>Requests</td>
        <td>Powerful language to find patterns in text</td>
    </tr>
    <tr>
        <td>Regular Expressions</td>
        <td>Request for a web page that allows submitting large forms or uploading files</td>
    </tr>
    <tr>
        <td>Scrapy</td>
        <td>Response indicating that a web page was successfully delivered</td>
    </tr>
    <tr>
        <td>BeautifulSoup4</td>
        <td>Generic name for a (web) programming interface</td>
    </tr>
    <tr>
        <td>Uniform Resource Locator (URL)</td>
        <td>Request for a web page</td>
    </tr>
    <tr>
        <td>HyperText Markup Language (HTML)</td>
        <td>Python library for sending HTTP GET and HTTP POST requests</td>
    </tr>
    <tr>
        <td>HyperText Transfer Protocol (HTTP)</td>
        <td>Python library for downloading entire web sites</td>
    </tr>
    <tr>
        <td>HTTP GET</td>
        <td>Response indicating that a web page could not be found</td>
    </tr>
    <tr>
        <td>HTTP POST</td>
        <td>Python library for parsing HTML pages</td>
    </tr>
    <tr>
        <td>API</td>
        <td>Method to send text messages from one computer to another, built on top of TCP/IP</td>
    </tr>
    <tr>
        <td>200</td>
        <td>Text format in which most web pages are written</td>
    </tr>
    <tr>
        <td>404</td>
        <td>Address of a website, file or similar</td>
    </tr>
</table>


## Answers
- **404**: Response indicating that a web page could not be found
- **200**: Response indicating that a web page was successfully delivered
- **API**: Generic name for a (web) programming interface
- **Requests**: Python library for sending HTTP GET and HTTP POST requests
- **Scrapy**: Python library for downloading entire web sites
- **Regular Expressions**: Powerful language to find patterns in text
- **BeautifulSoup4**: Python library for parsing HTML pages
- **URL**: Address of a website, file or similar
- **HTML**: Text format in which most web pages are written
- **HTTP**: Method to send text messages from one computer to another, built on top of TCP/IP
- **HTTP GET**: Request for a web page
- **HTTP POST**: Request for a web page that allows submitting large forms or uploading files

## What to install?

Install the packages listed in the requirements text file, for example with `pip install -r` followed by the file name.

## Some important BeautifulSoup functions and methods:

```python
tag = soup.find('a')  # Finds the first <a> tag

tags = soup.find_all('a')  # Finds all <a> tags

text = soup.get_text(separator=' ') # Extracts all text from a tag, optionally with a separator.

href = tag.get('href')  # Retrieves the value of the href attribute from a tag

pretty_html = soup.prettify() # Returns a formatted string of the HTML document.

```

## Sending HTTP request with *requests*
We want to scrape song lyrics from [lyrics.com](https://www.lyrics.com).

In [1]:
#import the libraries
import requests
import os

In [2]:
# Define URL
URL = 'https://www.lyrics.com/artist/Eminem/347307'

In [3]:
# Send an HTTP GET request to the URL
response = requests.get(url=URL)
response

<Response [403]>

- We get a 403 response. This does not always happen, but it helps to define a User-Agent header so that the request looks like it comes from a regular browser rather than a script.
- 403 (Forbidden) means that the web server understood the request but refuses to grant access.
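
To look up what a numeric status code means, the standard library's `http.HTTPStatus` enum can be used. This is a small sketch; `describe` is a hypothetical helper name, not part of `requests`.

```python
from http import HTTPStatus


def describe(code):
    """Return the numeric code together with its standard reason phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


for code in (200, 403, 404):
    print(describe(code))
```

With `requests`, the numeric code of a response is available as `response.status_code`, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses.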

In [5]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

In [6]:
#Let's add headers
response = requests.get(url=URL, headers=headers)
response

<Response [200]>

In [7]:
# Get the content
content = response.text
content



## 🍜 Parsing HTML with BeautifulSoup 🍜

In [8]:
from bs4 import BeautifulSoup

In [9]:
# Let's create a soup from content
soup_content = BeautifulSoup(content,"html.parser")
soup_content

<!DOCTYPE html>

<html lang="en-US">
<head>
<meta content="#8D6282" name="theme-color"/>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<title>Eminem Lyrics, Songs and Albums | Lyrics.com</title>
<meta content="Eminem Lyrics - All the great songs and their lyrics from Eminem on Lyrics.com" name="description"/>
<meta content="Eminem lyrics, Eminem song lyrics, Eminem lyric" name="keywords"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=5" name="viewport"/>
<base href="https://www.lyrics.com/"/>
<script>
s4Prefix = 'https://static.stands4.com';
version = '1.4.66';
common_version = '2.0.13';
</script>
<link as="style" href="https://static.stands4.com/app_common/css/bootstrap-3.4.1.min.css" rel="preload"/>
<link href="https://static.stands4.com/app_common/css/bootstrap-3.4.1.min.css" rel="stylesheet"/>
<link crossorigin="" href="https://fonts.googleapis.com" rel="preconnect">
<link href="https://fonts.googleapis.com" rel="dns-pr

In [10]:
type(soup_content)

bs4.BeautifulSoup

### Let's use the developer tool and try to find the place for the lyrics

We see something that looks like `href`. What does that mean?

`href` is an HTML attribute that specifies the URL of the page or resource a hyperlink points to. It is most commonly used in anchor (`<a>`) tags.
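
To make the role of `href` concrete, here is a minimal sketch that collects the `href` values of all `<a>` tags in a piece of HTML. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages; `LinkCollector` is a hypothetical class name, and the snippet is one table cell from the artist page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


SNIPPET = (
    '<td class="tal qx"><strong>'
    '<a href="/lyric/36219604/Eminem/Homicide">Homicide</a>'
    "</strong></td>"
)

collector = LinkCollector()
collector.feed(SNIPPET)
print(collector.links)
```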


https://www.lyrics.com/artist/Eminem/347307

In [11]:
soup_content.find_all(name='td', attrs={'class':'tal qx'})

[<td class="tal qx"><strong><a href="/lyric/36219604/Eminem/Homicide">Homicide</a></strong></td>,
 <td class="tal qx"><strong><a href="/lyric/35949116/Eminem/Rainy+Days+%5Bfeat.+Eminem%5D">Rainy Days [feat. Eminem]</a></strong></td>,
 <td class="tal qx"><strong><a href="/lyric/36259009/Eminem/Remember+the+Name">Remember the Name</a></strong></td>,
 <td class="tal qx"><strong><a href="/lyric/36584804/Eminem/Hip+Hop">Hip Hop</a></strong></td>,
 <td class="tal qx"><strong><a href="/lyric/35019312/Eminem/Caterpillar">Caterpillar</a></strong></td>,
 <td class="tal qx"><strong><a href="/lyric/35368569/Eminem/The+Ringer">The Ringer</a></strong></td>,
 <td class="tal qx"><strong><a href="/lyric/35368568/Eminem/Greatest">Greatest</a></strong></td>,
 <td class="tal qx"><strong><a href="/lyric/35368567/Eminem/Lucky+You">Lucky You</a></strong></td>,
 <td class="tal qx"><strong><a href="/lyric/35368566/Eminem/Paul+%28Skit%29">Paul (Skit)</a></strong></td>,
 <td class="tal qx"><strong><a href="/lyri

In [12]:
# Let's get the first element

td_tag = soup_content.find_all(name='td', attrs={'class':'tal qx'})[0]
td_tag

<td class="tal qx"><strong><a href="/lyric/36219604/Eminem/Homicide">Homicide</a></strong></td>

In [13]:
# find returns only the first match; that is enough here because the cell contains a single <a> tag

a_tag = td_tag.find(name='a')
a_tag

<a href="/lyric/36219604/Eminem/Homicide">Homicide</a>

In [14]:
# .get() reads the value of an attribute such as href, class or src
href_value = a_tag.get('href')
href_value

'/lyric/36219604/Eminem/Homicide'

In [15]:
titles = []
hyperlinks = []

for td in soup_content.find_all(name='td', attrs={'class':'tal qx'}):
    a_tag = td.find(name='a')
    href_value = a_tag.get('href')
    title = a_tag.get_text()
    cleaned_title = title.split(' [')[0]
    even_more_cleaned_title = cleaned_title.split(' (')[0]
    if even_more_cleaned_title not in titles: #check if we already have it, if not append
        titles.append(even_more_cleaned_title)
        print(even_more_cleaned_title)
        hyperlinks.append(href_value)
    
    #print(href_value)


Homicide
Rainy Days
Remember the Name
Hip Hop
Caterpillar
The Ringer
Greatest
Lucky You
Paul
Normal
Em Calls Paul
Stepping Stone
Not Alike
Kamikaze
Fall
Nice Guy
Good Guy
Venom
Killshot
Nowhere Fast
Majesty
Revenge
Business
My Name Is
The Real Slim Shady
No Favors
Believe
Untouchable
River
Remind Me
Revival
Bad Husband
Tragic Endings
Framed
Heat
Offended
Need Me
In Your Head
Castle
Walk on Water
Dead Wrong
Love the Way You Lie, Pt. 2
Numb
Campaign Speech
Kill for You
Without Me
Medicine Man
Kings Never Die
Best Friend
Phenomenal
Speedom
Here Comes the Weekend
The Hills
Guilty Conscience
Brain Damage
If I Had
97 Bonnie & Clyde
Role Model
My Fault
Ken Kaniff
Cum On Everybody
Rock Bottom
Soap
As The World Turns
Im Shady
Bad Meets Evil
Public Service Announcement 2000
Kill You
Stan
Who Knew
Steve Berman
The Way I Am
Real Slim Shady
Remember Me?
Marshall Mathers
Drug Ballad
Amityville
Bitch Please II
Kim
Under The Influence
Criminal
Curtains Up
White America
Cleanin Out My Closet
Square Dan
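
The title clean-up from the loop in In [15] can also be written as a small helper. The sketch below uses one regular expression instead of two `split` calls and a set for the duplicate check; `clean_title` and `unique_clean_titles` are hypothetical names, and the sample list is made up from titles seen on the page.

```python
import re


def clean_title(title):
    # Cut the title at the first " [" or " (" and trim surrounding whitespace.
    return re.split(r"\s[\[(]", title, maxsplit=1)[0].strip()


def unique_clean_titles(raw_titles):
    """Clean every title and keep only the first occurrence, in order."""
    seen = set()
    result = []
    for raw in raw_titles:
        cleaned = clean_title(raw)
        if cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


sample = ["Homicide", "Rainy Days [feat. Eminem]", "Paul (Skit)", "Homicide"]
print(unique_clean_titles(sample))
```

Looking up a value in a set stays fast as the collection grows, whereas `not in titles` on a list has to scan the whole list each time.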

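## We can also create the full links

In [16]:
# The hrefs start with "/", so they are relative to the site root, not to the artist page.
# urljoin resolves them correctly; plain concatenation would keep "/artist/Eminem/347307" in the path.
from urllib.parse import urljoin

FULL_URL = urljoin(URL, hyperlinks[0])
FULL_URL

'https://www.lyrics.com/lyric/36219604/Eminem/Homicide'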
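
The hrefs on the artist page start with `/`, which makes them relative to the site root rather than to the artist page. A minimal sketch of resolving a whole list of them with `urllib.parse.urljoin` follows; `ARTIST_URL` and `hrefs` are stand-ins for `URL` and `hyperlinks` from the cells above.

```python
from urllib.parse import urljoin

ARTIST_URL = "https://www.lyrics.com/artist/Eminem/347307"
hrefs = [
    "/lyric/36219604/Eminem/Homicide",
    "/lyric/36259009/Eminem/Remember+the+Name",
]

# urljoin replaces the path of the base URL when the href starts with "/".
full_urls = [urljoin(ARTIST_URL, href) for href in hrefs]
for full_url in full_urls:
    print(full_url)
```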

## Let's get the lyrics of one of the songs

### Step 1: Define URL and send a request to the website and display the response

We can create the link as we did above or we can go to that page and copy the link directly

In [17]:
URL2="https://www.lyrics.com/lyric/36259009/Eminem/Remember+the+Name"
response2 = requests.get(url=URL2,headers=headers)
response2

<Response [200]>

### Step 2: Get the content from the response

In [18]:
content2 = response2.text
content2

'<!doctype html>\n<html lang="en-US">\n<head>\n<meta name="theme-color" content="#8D6282"/>\n\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n<title>Eminem - Remember the Name Lyrics | Lyrics.com</title>\n<meta name="description" content="Remember the Name Lyrics by Eminem from the No. 6 Collaborations Project album- including song video, artist biography, translations and more: Yeah, I was born a misfit, grew up ten miles from the town of Ipswich\r\nWanted to make it big, I wished it to existence\r&hellip;">\n<meta name="keywords" content="Remember the Name lyrics, lyrics for Remember the Name, Remember the Name song, Remember the Name words, lyrics from No. 6 Collaborations Project">\n<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=5">\n<base href="https://www.lyrics.com/">\n<script>\ns4Prefix = \'https://static.stands4.com\';\nversion = \'1.4.66\';\ncommon_version = \'2.0.13\';\n</script>\n\n<link rel="preloa

### Step 3: Create soup

In [19]:
soup_content2 = BeautifulSoup(content2,"html.parser")
soup_content2

<!DOCTYPE html>

<html lang="en-US">
<head>
<meta content="#8D6282" name="theme-color"/>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<title>Eminem - Remember the Name Lyrics | Lyrics.com</title>
<meta content="Remember the Name Lyrics by Eminem from the No. 6 Collaborations Project album- including song video, artist biography, translations and more: Yeah, I was born a misfit, grew up ten miles from the town of Ipswich
Wanted to make it big, I wished it to existence
…" name="description"/>
<meta content="Remember the Name lyrics, lyrics for Remember the Name, Remember the Name song, Remember the Name words, lyrics from No. 6 Collaborations Project" name="keywords"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=5" name="viewport"/>
<base href="https://www.lyrics.com/"/>
<script>
s4Prefix = 'https://static.stands4.com';
version = '1.4.66';
common_version = '2.0.13';
</script>
<link as="style" href="https://static.stands4.co

### Step 4: Check with the developer tool to find where the lyrics start

In [20]:
soup_content2.find_all(name='pre', attrs={'id':'lyric-body-text'})

[<pre class="lyric-body" data-lang="en" dir="ltr" id="lyric-body-text">Yeah, I was born a misfit, grew up ten <a href="https://www.definitions.net/definition/miles" style="color:#222; ">miles</a> from the town of Ipswich
 Wanted to make it big, I <a href="https://www.definitions.net/definition/wished" style="color:#222; ">wished</a> it to existence
 I <a href="https://www.definitions.net/definition/never" style="color:#222; ">never</a> was a sick kid, <a href="https://www.definitions.net/definition/always" style="color:#222; ">always</a> dismissed quick
 "Stick to singing, stop rappin'", like it's Christmas
 And if you're talkin' money, then my <a href="https://www.definitions.net/definition/conversation" style="color:#222; ">conversation</a> shiftin'
 My <a href="https://www.definitions.net/definition/dreams" style="color:#222; ">dreams</a> are <a href="https://www.definitions.net/definition/bigger" style="color:#222; ">bigger</a> than just bein' on the rich list
 Might be insanity, b

In [21]:
len(soup_content2.find_all(name='pre', attrs={'id':'lyric-body-text'}))
# The result is 1, so there is a single match and we can use find instead of find_all

1

In [22]:
soup_content2.find(name='pre', attrs={'id':'lyric-body-text'}).get_text()

'Yeah, I was born a misfit, grew up ten miles from the town of Ipswich\r\nWanted to make it big, I wished it to existence\r\nI never was a sick kid, always dismissed quick\r\n"Stick to singing, stop rappin\'", like it\'s Christmas\r\nAnd if you\'re talkin\' money, then my conversation shiftin\'\r\nMy dreams are bigger than just bein\' on the rich list\r\nMight be insanity, but people call it gifted\r\nMy face is goin\' numb from the shit this stuff is mixed with\r\nWatch how the lyrics in the songs might get twisted\r\nMy wife wears red, but looks better without the lipstick\r\nI\'m a private guy and you know nothin\' \'bout my business\r\nAnd if I had my fifteen minutes, I must have missed \'em\r\n\r\nTwenty years old is when I came in the game\r\nAnd now it\'s eight years on and you remember the name\r\nAnd if you thought I was good, well, then I\'m better today\r\nBut it\'s ironic how you people thought I\'d never be great\r\nI like my shows open-air, Tokyo to Delaware\r\nPut your p

In [23]:
print(soup_content2.find(name='pre', attrs={'id':'lyric-body-text'}).get_text())

Yeah, I was born a misfit, grew up ten miles from the town of Ipswich
Wanted to make it big, I wished it to existence
I never was a sick kid, always dismissed quick
"Stick to singing, stop rappin'", like it's Christmas
And if you're talkin' money, then my conversation shiftin'
My dreams are bigger than just bein' on the rich list
Might be insanity, but people call it gifted
My face is goin' numb from the shit this stuff is mixed with
Watch how the lyrics in the songs might get twisted
My wife wears red, but looks better without the lipstick
I'm a private guy and you know nothin' 'bout my business
And if I had my fifteen minutes, I must have missed 'em

Twenty years old is when I came in the game
And now it's eight years on and you remember the name
And if you thought I was good, well, then I'm better today
But it's ironic how you people thought I'd never be great
I like my shows open-air, Tokyo to Delaware
Put your phones in the air if you wanna be rocked
You know I want way more than 
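
The raw string returned by `get_text()` in In [22] uses `\r\n` line endings and blank lines between verses. If the lyrics are needed as structured data, a sketch like the following could group the lines into verses. `split_verses` is a hypothetical helper, and the sample text is a made-up placeholder rather than real lyrics.

```python
def split_verses(raw_text):
    """Group lyric lines into verses separated by blank lines."""
    verses, current = [], []
    # splitlines() handles both "\r\n" and "\n" line endings.
    for line in raw_text.strip().splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            verses.append(current)
            current = []
    if current:
        verses.append(current)
    return verses


sample = "first line\r\nsecond line\r\n\r\nnew verse\r\n"
print(split_verses(sample))
```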

## Web API
+ Very roughly speaking:
    + Just a special URL that returns data, e.g. in JSON format
+ Advantages over web scraping:
    + JSON is easier to parse than HTML
    + Companies/organizations (the data owners) can:
        + Keep more control and provide cleaner data
        + Set up rate limits (e.g. 100 requests per minute)
        + Collect data about you as well
        + Charge for access once people depend on the API



![Beehive](api_example2.png)

Let's send a GET request to the API of this website: https://open-meteo.com
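
A rough sketch of what such an API call involves: build a URL with query parameters, then read fields from the JSON answer. The endpoint, the parameter names and the `current_weather` field are assumptions based on the Open-Meteo documentation and should be checked there. `SAMPLE_RESPONSE` is a made-up payload so that the example runs offline, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed forecast endpoint; verify it in the Open-Meteo documentation.
BASE_URL = "https://api.open-meteo.com/v1/forecast"


def build_forecast_url(latitude, longitude):
    """Build the request URL with its query string."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "current_weather": "true",  # assumed parameter name
    }
    return f"{BASE_URL}?{urlencode(params)}"


def current_temperature(payload):
    """Read the temperature from an already parsed JSON payload."""
    return payload["current_weather"]["temperature"]


url = build_forecast_url(52.52, 13.41)

# Made-up response body, shaped like the assumed API answer.
SAMPLE_RESPONSE = (
    '{"latitude": 52.52, "longitude": 13.41, '
    '"current_weather": {"temperature": 18.5, "windspeed": 7.2}}'
)
payload = json.loads(SAMPLE_RESPONSE)

print(url)
print(current_temperature(payload))
```

Against the live API, `requests.get(BASE_URL, params=params).json()` would replace the sample string: `requests` builds the query string from `params` and parses the JSON body.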

### Further Reading 📚
+ [Disguising as a Browser](https://stackoverflow.com/questions/27652543/how-can-i-use-pythons-requests-to-fake-a-browser-visit-a-k-a-and-generate-user)
+ [More on Web API](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Client-side_web_APIs/Introduction)
+ [Learn Regular Expression I](https://regexone.com)
+ [Learn Regular Expression II](https://alf.nu/RegexGolf?world=regex&level=r00)
+ [Regular Expression in Python](https://www.w3schools.com/python/python_regex.asp)
+ [Selenium: A program to mimic a Web Browser](https://www.selenium.dev/documentation/webdriver/getting_started/first_script/)

## Other options

In [None]:
# pandas.read_html needs an HTML parser such as lxml
# pip install lxml

In [24]:
import pandas as pd

In [25]:
# Define the URL and read all HTML tables on the page; read_html returns a list of DataFrames
url_df = 'https://fbref.com/en/comps/106/2019/2019-Womens-World-Cup-Stats'

df = pd.read_html(url_df)
df

[   Rk         Squad  MP  W  D  L  GF  GA  GD  Pts          Notes
 0   1     br Brazil   3  2  0  1   6   3   3    6  → Round of 16
 1   2   cn China PR   3  1  1  1   1   1   0    4  → Round of 16
 2   3   cm Cameroon   3  1  0  2   3   5  -2    3  → Round of 16
 3   4    ng Nigeria   3  1  0  2   2   4  -2    3  → Round of 16
 4   5      cl Chile   3  1  0  2   2   5  -3    3            NaN
 5   6  ar Argentina   3  0  2  1   3   4  -1    2            NaN,
    Rk         Squad  MP  W  D  L  GF  GA  GD  Pts   xG  xGA  xGD  xGD/90  \
 0   1     fr France   3  3  0  0   7   1   6    9  6.4  0.6  5.8    1.93   
 1   2     no Norway   3  2  0  1   6   3   3    6  3.0  5.2 -2.2   -0.73   
 2   3    ng Nigeria   3  1  0  2   2   4  -2    3  1.9  4.1 -2.2   -0.74   
 3   4  kr Korea Rep   3  0  0  3   1   8  -7    0  3.5  4.9 -1.4   -0.46   
 
            Notes  
 0  → Round of 16  
 1  → Round of 16  
 2            NaN  
 3            NaN  ,
    Rk            Squad  MP  W  D  L  GF  GA  GD 

In [26]:
# Get the first table by selecting the first element of the list
df = pd.read_html(url_df)[0]
df.head()

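|   | Rk | Squad | MP | W | D | L | GF | GA | GD | Pts | Notes |
|---|----|-------|----|---|---|---|----|----|----|-----|-------|
| 0 | 1 | br Brazil | 3 | 2 | 0 | 1 | 6 | 3 | 3 | 6 | → Round of 16 |
| 1 | 2 | cn China PR | 3 | 1 | 1 | 1 | 1 | 1 | 0 | 4 | → Round of 16 |
| 2 | 3 | cm Cameroon | 3 | 1 | 0 | 2 | 3 | 5 | -2 | 3 | → Round of 16 |
| 3 | 4 | ng Nigeria | 3 | 1 | 0 | 2 | 2 | 4 | -2 | 3 | → Round of 16 |
| 4 | 5 | cl Chile | 3 | 1 | 0 | 2 | 2 | 5 | -3 | 3 |  |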


In [28]:
# Get the eighth table (index 7), which holds the overall tournament standings
df = pd.read_html(url_df)[7]
df

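|   | Rk | Squad | MP | W | D | L | GF | GA | GD | Pts | xG | xGA | xGD | xGD/90 | Top Team Scorer | Goalkeeper | Notes |
|---|----|-------|----|---|---|---|----|----|----|-----|----|-----|-----|--------|-----------------|------------|-------|
| 0 | 1 | us USA | 7.0 | 7.0 | 0.0 | 0.0 | 26.0 | 3.0 | 23.0 | 21.0 | 20.2 | 4.2 | 16.1 | 2.29 | Alex Morgan, Megan Rapinoe - 6 | Alyssa Naeher |  |
| 1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 2 | 2 | nl Netherlands | 7.0 | 6.0 | 0.0 | 1.0 | 11.0 | 5.0 | 6.0 | 18.0 | 10.7 | 9.2 | 1.6 | 0.22 | Vivianne Miedema - 3 | Sari van Veenendaal |  |
| 3 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 4 | 3 | se Sweden | 7.0 | 5.0 | 0.0 | 2.0 | 12.0 | 6.0 | 6.0 | 15.0 | 13.2 | 8.0 | 5.1 | 0.74 | Kosovare Asllani - 3 | Hedvig Lindahl |  |
| 5 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 6 | 4 | eng England | 7.0 | 5.0 | 0.0 | 2.0 | 13.0 | 5.0 | 8.0 | 15.0 | 15.3 | 5.6 | 9.8 | 1.4 | Ellen White - 6 | Karen Bardsley |  |
| 7 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 8 | QF | de Germany | 5.0 | 4.0 | 0.0 | 1.0 | 10.0 | 2.0 | 8.0 | 12.0 | 11.2 | 4.9 | 6.3 | 1.26 | Sara Däbritz - 3 | Almuth Schult |  |
| 9 | QF | fr France | 5.0 | 4.0 | 0.0 | 1.0 | 10.0 | 4.0 | 6.0 | 12.0 | 11.1 | 2.9 | 8.2 | 1.63 | Wendie Renard - 4 | Sarah Bouhaddi |  |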
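
The scraped standings table contains completely empty spacer rows (index 1, 3, 5, 7), which also turn the integer columns into floats. A minimal sketch of cleaning this up with pandas follows. It uses a small hand-built frame with three of the columns instead of the live page, so `table` is only a stand-in for `df`.

```python
import pandas as pd

# Stand-in for the scraped table: every second row is an empty spacer row.
table = pd.DataFrame(
    {
        "Rk": ["1", None, "2", None],
        "Squad": ["us USA", None, "nl Netherlands", None],
        "MP": [7.0, None, 7.0, None],
    }
)

# Drop rows in which every column is missing, then renumber the index.
cleaned = table.dropna(how="all").reset_index(drop=True)

# Without missing values the column can be an integer column again.
cleaned["MP"] = cleaned["MP"].astype(int)
print(cleaned)
```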


## Another example: coffee consumption

In [29]:
url2= "https://cafely.com/blogs/research/which-country-consumes-the-most-coffee?srsltid=AfmBOoqMT3PxxYvzog5NpIYYtgJ-8vz-GMPye8Hg1TzuHmEPqIenALbM#full-data"
df2 = pd.read_html(url2)
df2

[    Unnamed: 0      Country  Daily cup of coffee per capita  \
 0            1   Luxembourg                            5.31   
 1            2      Finland                            3.77   
 2            3       Sweden                            2.59   
 3            4       Norway                            2.57   
 4            5      Austria                            2.03   
 5            6      Denmark                            2.04   
 6            7  Switzerland                            1.87   
 7            8  Netherlands                            1.79   
 8            9       Greece                            1.71   
 9           10      Germany                            1.61   
 10          11       Canada                            1.57   
 11          12      Belgium                            1.57   
 12          13       France                            1.48   
 13          14     Slovenia                            1.49   
 14          15        Italy            

In [30]:

df2[0]

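|   | Unnamed: 0 | Country | Daily cup of coffee per capita | Lifetime coffee consumption by cups |
|---|------------|---------|--------------------------------|-------------------------------------|
| 0 | 1 | Luxembourg | 5.31 | 118227 |
| 1 | 2 | Finland | 3.77 | 83939 |
| 2 | 3 | Sweden | 2.59 | 58612 |
| 3 | 4 | Norway | 2.57 | 58159 |
| 4 | 5 | Austria | 2.03 | 45198 |
| 5 | 6 | Denmark | 2.04 | 44676 |
| 6 | 7 | Switzerland | 1.87 | 42318 |
| 7 | 8 | Netherlands | 1.79 | 39854 |
| 8 | 9 | Greece | 1.71 | 37449 |
| 9 | 10 | Germany | 1.61 | 35259 |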
