# Web Scraping Lab

This notebook contains a set of exercises for practising your web scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You will need to identify the HTML tags, class names, and other attributes that mark the content you are expected to extract.
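
The first tip can be sketched as a small helper. This is only a sketch, not part of the lab: `checked_text` is a hypothetical name, and a `SimpleNamespace` stands in for a `requests.Response` so that the example runs offline.

```python
from types import SimpleNamespace


def checked_text(response):
    """Return the body text if the status code signals success, else raise.

    `response` is assumed to expose `.status_code` and `.text`,
    as a requests.Response does.
    """
    if not 200 <= response.status_code < 300:
        raise RuntimeError(f"Unexpected status code: {response.status_code}")
    return response.text


# Offline stand-ins for real responses (hypothetical content)
ok = SimpleNamespace(status_code=200, text="<html><body>Hello</body></html>")
missing = SimpleNamespace(status_code=404, text="Not Found")

print(checked_text(ok))
try:
    checked_text(missing)
except RuntimeError as err:
    print(err)
```

With a live request the helper would be called as `checked_text(requests.get(url))`.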

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
# your code here
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
soup


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-92c7d381038e.css" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-d4a90c367f0c.css" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+sol

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names, with no HTML tags in any name.

**Instructions:**

1. Find the HTML tag and class names used for the developer names. You can do this with Chrome DevTools.

1. Use BeautifulSoup to extract all the HTML elements that contain the developer names.

1. Use string manipulation to remove whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Store the clean names in a list.

1. Print the list of names.

Your output should look like this:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
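
The whitespace clean-up in step 3 can be approached in more than one way. A minimal sketch, assuming the raw element text puts the name and the handle on separate, indented lines; the sample string is made up and `clean_text` is a hypothetical helper:

```python
def clean_text(raw):
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    return " ".join(raw.split())


# Made-up sample of what element.text might look like
raw_name = "\n\n            Example Dev\n\n\n            exampledev\n"
print(clean_text(raw_name))  # Example Dev exampledev
```

Calling `str.split()` without arguments splits on any whitespace and drops empty pieces, which is why no separate handling of `\n` is needed here.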

In [4]:
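# your code here
# Collect the text of every <p> element on the page.
# Note: this is broader than the name elements, so other <p> text
# (organisation handles, locations) ends up in the result too.
tags = ['p']
text = [element.text for element in soup.find_all(tags)]

# Split each paragraph's text into its individual lines
l1 = [elem.strip().split("\n") for elem in text]

# Flatten the list of lists into one string, then split it into words
s = ""
for i in l1:
    for j in i:
        s = s + " " + str(j)

s = s.split(" ")
# Skip the first ten tokens, which precede the developer handles
s[10:]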

['bvaughn',
 'emilk',
 'nvh95',
 'lucasfernog',
 '@tauri-apps',
 'Rich-Harris',
 'wcandillon',
 'yairm210',
 'developit',
 'hadley',
 'thrau',
 'TooTallNate',
 'MichalLytek',
 'compnerd',
 'JelleZijlstra',
 'mgechev',
 'sethvargo',
 'mosra',
 'MichaelChirico',
 'adamchainz',
 'carols10cents',
 'Pittsburgh,',
 'PA,',
 'USA',
 'quisquous',
 'lewis6991']

#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.
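
The solution below relies on each heading reading `owner / repository`. A sketch under that assumption, splitting the heading text into its two parts; the sample string and `split_repo` are hypothetical:

```python
def split_repo(heading_text):
    """Split heading text of the form 'owner / repo' into (owner, repo)."""
    compact = "".join(heading_text.split())  # drop all whitespace
    owner, _, repo = compact.partition("/")
    return owner, repo


# Made-up sample of the heading text, including its line breaks
sample = "\n\n      some-owner /\n\n      awesome-python\n"
print(split_repo(sample))  # ('some-owner', 'awesome-python')
```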

In [5]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [6]:
# your code here
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

# Calling soup(...) is shorthand for soup.find_all(...); the class_ filter
# keeps only the <h1> headings that hold the repository names
names = soup('h1', class_='h3 lh-condensed')
l = [x.text.strip() for x in names]

# Remove the blank lines inside each heading
a = [item.replace("\n\n", "") for item in l]

# Each heading reads "owner / repository"; keep the third token, the repository name
final_list = [items.split()[2] for items in a]
final_list

['public-apis',
 'core',
 'challenge',
 'manim',
 'nvdiffrec',
 'Zuri',
 'MockingBird',
 'DALLE2-pytorch',
 'openpilot',
 'Informer2020',
 'GitHub520',
 'Riskfolio-Lib',
 'Real-Time-Voice-Cloning',
 'system-design-primer',
 'FastDiff',
 'lutris',
 'sherlock',
 'awesome-python',
 'localstack',
 'discord.py',
 'Real-ESRGAN',
 'python-mini-projects',
 '30-Days-Of-Python',
 'gget',
 'Shadowrocket-ADBlock-Rules-Forever']

In [7]:
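# Another version
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

# When find_all() receives a list, it matches any tag whose name is in the list.
# The second entry is therefore not a class filter, so every <h1> is returned,
# including the page title ("Trending" in the output below).
tags = ["h1", "class:h3 lh-condensed"]

text = [element.text.strip() for element in soup.find_all(tags)]

# Remove the blank lines inside each heading
l1 = [elem.replace("\n\n", "") for elem in text]

# Keep the last token of each heading, which is the repository name
final_list = []
for items in l1:
    splitlist = items.split()
    final_list.append(splitlist[-1])
final_list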

['Trending',
 'public-apis',
 'core',
 'challenge',
 'manim',
 'nvdiffrec',
 'Zuri',
 'MockingBird',
 'DALLE2-pytorch',
 'openpilot',
 'Informer2020',
 'GitHub520',
 'Riskfolio-Lib',
 'Real-Time-Voice-Cloning',
 'system-design-primer',
 'FastDiff',
 'lutris',
 'sherlock',
 'awesome-python',
 'localstack',
 'discord.py',
 'Real-ESRGAN',
 'python-mini-projects',
 '30-Days-Of-Python',
 'gget',
 'Shadowrocket-ADBlock-Rules-Forever']

#### Display all the image links from the Walt Disney Wikipedia page.

In [8]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [9]:
# your code here
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

# Collect the src attribute of every <img> element
l = []
for img in soup.find_all('img'):
    l.append(img.get('src'))

l

['//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 '//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 '//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 '//upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/5/57/Walt_Disney_1935.jpg/170px-Walt_Disney_1935.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screens
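
The `src` values above are protocol-relative (they start with `//`). A minimal sketch of turning them into absolute URLs with the standard library's `urllib.parse.urljoin`; `absolute_urls` is a hypothetical helper, and the sample list reuses one entry from the output above:

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"


def absolute_urls(base, srcs):
    """Resolve each src against the page URL, skipping missing values."""
    return [urljoin(base, src) for src in srcs if src]


srcs = [
    "//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG",
    None,  # .get('src') returns None for an <img> without a src attribute
]
print(absolute_urls(page_url, srcs))
```

Resolving against the page URL means the scheme (`https`) is inherited from the page, so the same helper also works for paths that start with a single `/`.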

#### Retrieve an arbitrary Wikipedia page about "Python" and create a list of the links on that page.

In [10]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [11]:
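# your code here
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

# Collect the href attribute of every <a> element
l = []
for link in soup.find_all('a'):
    l.append(link.get('href'))

# Drop the first three links, then remove the entries at positions 2-9
# (list.remove deletes the first match of each value)
l = l[3:]
a = l[2:10]
for x in a:
    l.remove(x)
l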

['https://en.wiktionary.org/wiki/Python',
 'https://en.wiktionary.org/wiki/python',
 '/w/index.php?title=Python&action=edit&section=1',
 '/wiki/Pythonidae',
 '/wiki/Python_(genus)',
 '/wiki/Python_(mythology)',
 '/w/index.php?title=Python&action=edit&section=2',
 '/wiki/Python_(programming_language)',
 '/wiki/CMU_Common_Lisp',
 '/wiki/PERQ#PERQ_3',
 '/w/index.php?title=Python&action=edit&section=3',
 '/wiki/Python_of_Aenus',
 '/wiki/Python_(painter)',
 '/wiki/Python_of_Byzantium',
 '/wiki/Python_of_Catana',
 '/wiki/Python_Anghelo',
 '/w/index.php?title=Python&action=edit&section=4',
 '/wiki/Python_(Efteling)',
 '/wiki/Python_(Busch_Gardens_Tampa_Bay)',
 '/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)',
 '/w/index.php?title=Python&action=edit&section=5',
 '/wiki/Python_(automobile_maker)',
 '/wiki/Python_(Ford_prototype)',
 '/w/index.php?title=Python&action=edit&section=6',
 '/wiki/Python_(missile)',
 '/wiki/Python_(nuclear_primary)',
 '/wiki/Colt_Python',
 '/w/index.php?title=Python&act
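
The raw list mixes article links, edit links and external links. A sketch of one way to keep only internal article links, assuming they are the ones that start with `/wiki/`; `article_links` is a hypothetical helper and the sample values come from the output above:

```python
def article_links(hrefs):
    """Keep unique hrefs that point at /wiki/ pages, preserving their order."""
    seen = set()
    result = []
    for href in hrefs:
        if href and href.startswith("/wiki/") and href not in seen:
            seen.add(href)
            result.append(href)
    return result


hrefs = [
    "https://en.wiktionary.org/wiki/Python",
    "/w/index.php?title=Python&action=edit&section=1",
    "/wiki/Pythonidae",
    "/wiki/Python_(genus)",
    None,  # <a> tags without an href yield None
    "/wiki/Pythonidae",
]
print(article_links(hrefs))  # ['/wiki/Pythonidae', '/wiki/Python_(genus)']
```

Filtering by prefix is less fragile than removing entries by position, since positions shift whenever the page layout changes.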

#### Find the number of titles that have changed in the United States Code since its last release point.

In [12]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [13]:
# your code here
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

# Select the <div> elements that carry the 'usctitlechanged' class
names = soup('div', class_='usctitlechanged')
l = [x.text.strip() for x in names]
l

['Title 2 - The Congress',
 'Title 15 - Commerce and Trade',
 'Title 39 - Postal Service ٭']

#### Create a Python list with the names on the FBI's Ten Most Wanted list.

In [14]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [15]:
# your code here
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

# Select the <h3> elements that carry the 'title' class
names = soup('h3', class_='title')
l = [x.text.strip() for x in names]
l

['BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN',
 'ALEXIS FLORES',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ',
 'RAFAEL CARO-QUINTERO',
 'YULAN ADONAY ARCHAGA CARIAS',
 'EUGENE PALMER',
 'OCTAVIANO JUAREZ-CORRO']

In [16]:
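# Another approach: take every <h3> element instead of filtering by class.
# This also picks up headings that are not names.
tags = ['h3']
text = [element.text for element in soup.find_all(tags)]
b = []
for i in range(0, len(text) - 1):  # the last <h3> is skipped
    c = str(text[i])
    d = c.strip().replace("\n", "")
    b.append(d)
print(b)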

['BHADRESHKUMAR CHETANBHAI PATEL', 'ALEJANDRO ROSALES CASTILLO', 'ARNOLDO JIMENEZ', 'JASON DEREK BROWN', 'ALEXIS FLORES', 'JOSE RODOLFO VILLARREAL-HERNANDEZ', 'RAFAEL CARO-QUINTERO', 'YULAN ADONAY ARCHAGA CARIAS', 'EUGENE PALMER', 'OCTAVIANO JUAREZ-CORRO', 'federal bureau of investigation']


#### Display the info for the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas DataFrame.

In [17]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [19]:
# your code here
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')

# The rows carry either the class 'ligne1 normal' or 'ligne2 normal';
# take the first ten of each and strip the non-breaking spaces
a = []
for row_class in ('ligne1 normal', 'ligne2 normal'):
    rows = soup('tr', class_=row_class)
    for row in rows[:10]:
        a.append(row.text.replace('\xa0', ''))
a


['earthquake2022-05-2215:07:19.022min ago35.46N3.65W3ML2.3STRAIT OF GIBRALTAR2022-05-22 15:24',
 'earthquake2022-05-2214:40:58.949min ago35.45N3.61W24ML2.1STRAIT OF GIBRALTAR2022-05-22 15:04',
 'earthquake2022-05-2214:36:25.053min ago1.41S134.17E10 M3.2NEAR N COAST OF PAPUA, INDONESIA2022-05-22 14:45',
 'earthquake2022-05-2214:18:43.01hr 11min ago2.99S128.84E10 M2.9CERAM SEA, INDONESIA2022-05-22 14:45',
 'earthquake2022-05-2213:53:57.71hr 36min ago35.44N3.65W7ML2.2STRAIT OF GIBRALTAR2022-05-22 14:18',
 'earthquake2022-05-2213:47:02.11hr 43min ago26.12S178.43E589Mw5.3SOUTH OF FIJI ISLANDS2022-05-22 14:39',
 '1IIIearthquake2022-05-2213:30:30.01hr 59min ago45.44N16.05E26ML1.6CROATIA2022-05-22 13:33',
 'earthquake2022-05-2213:19:05.22hr 11min ago44.02N10.91E5ML2.4NORTHERN ITALY2022-05-22 13:40',
 'earthquake2022-05-2212:42:10.02hr 47min ago24.42S67.68W232 M3.7SALTA, ARGENTINA2022-05-22 12:57',
 'earthquake2022-05-2212:27:27.03hr 02min ago36.68N9.65W18ML2.7WEST OF GIBRALTAR2022-05-22 12:32'

In [18]:
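# Another approach: read the text of every <tr> element
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")

tags = ['tr']
text = [element.text for element in soup.find_all(tags)]

# Skip the first 14 rows and the last two
for i in range(14, len(text) - 2):
    c = str(text[i])
    # Note: str.replace() matches literally, not as a regular expression,
    # so this pattern never matches and the non-breaking spaces remain
    d = c.strip().replace("(\xa0)+", " ")
    print(d)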

earthquake2022-05-22   15:17:38.512min ago33.35 N  141.27 E  65Mw6.1 OFF EAST COAST OF HONSHU, JAPAN2022-05-22 15:29
earthquake2022-05-22   15:12:30.117min ago63.09 N  150.82 W  122ml3.0 CENTRAL ALASKA2022-05-22 15:16
earthquake2022-05-22   15:07:19.022min ago35.46 N  3.65 W  3ML2.3 STRAIT OF GIBRALTAR2022-05-22 15:24
earthquake2022-05-22   14:42:16.047min ago11.76 N  86.86 W  5 M3.7 NEAR COAST OF NICARAGUA2022-05-22 14:45
earthquake2022-05-22   14:40:58.949min ago35.45 N  3.61 W  24ML2.1 STRAIT OF GIBRALTAR2022-05-22 15:04
earthquake2022-05-22   14:37:42.852min ago37.38 N  121.74 W  4Md2.3 SAN FRANCISCO BAY AREA, CALIF.2022-05-22 14:40
earthquake2022-05-22   14:36:25.053min ago1.41 S  134.17 E  10 M3.2 NEAR N COAST OF PAPUA, INDONESIA2022-05-22 14:45
earthquake2022-05-22   14:32:38.057min ago21.17 S  68.53 W  132 M3.7 ANTOFAGASTA, CHILE2022-05-22 14:45
earthquake2022-05-22   14:18:43.01hr 11min ago2.99 S  128.84 E  10 M2.9 CERAM SEA, INDONESIA2022-05-22 14:45
earthquake2022-05-22   14
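
Neither output above is tabular yet. A possible next step is to split each row's text into fields with a regular expression. The sketch below assumes the row format visible in the outputs (date, time with one decimal, elapsed time ending in "ago", latitude, longitude, depth, magnitude, region, last update); `ROW_PATTERN` and `parse_row` are hypothetical names, and the pattern may need adjusting if the page layout differs.

```python
import re

# Assumed layout, inferred from the row text printed above
ROW_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s*"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d)"
    r".*?ago\s*"
    r"(?P<lat>\d+\.\d+)\s*(?P<ns>[NS])\s*"
    r"(?P<lon>\d+\.\d+)\s*(?P<ew>[EW])\s*"
    r"(?P<depth>\d+)\s*(?P<mag_type>[A-Za-z]{1,2})\s*(?P<mag>\d\.\d)\s*"
    r"(?P<region>.+?)\s*"
    r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}$"
)


def parse_row(row_text):
    """Return a dict of fields for one table row, or None if it does not match."""
    match = ROW_PATTERN.search(row_text.strip())
    if match is None:
        return None
    # South and west become negative decimal degrees
    lat = float(match["lat"]) * (1 if match["ns"] == "N" else -1)
    lon = float(match["lon"]) * (1 if match["ew"] == "E" else -1)
    return {
        "date": match["date"],
        "time": match["time"],
        "latitude": lat,
        "longitude": lon,
        "region": match["region"],
    }


sample = (
    "earthquake2022-05-2215:07:19.022min ago35.46N3.65W3ML2.3"
    "STRAIT OF GIBRALTAR2022-05-22 15:24"
)
print(parse_row(sample))
```

Passing a list of such dictionaries to `pd.DataFrame(...)` gives one row per earthquake, with the dictionary keys as column names.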

#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. Include a ***try/except block*** to handle account names that are not found.
<br>***Hint:*** the program should count the number of tweets for any provided account.

In [20]:
pip install tweepy


Note: you may need to restart the kernel to use updated packages.


In [21]:
import tweepy

In [22]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [23]:
# your code here
from bs4 import BeautifulSoup
import requests

handle = input('Input your account name on Twitter: ')
temp = requests.get('https://twitter.com/' + handle)
bs = BeautifulSoup(temp.text, 'lxml')

try:
    # find() returns None when the element is missing, so the chained
    # .find() call raises AttributeError for pages without this markup
    tweet_box = bs.find('li', {'class': 'ProfileNav-item ProfileNav-item--tweets is-active'})
    tweets = tweet_box.find('a').find('span', {'class': 'ProfileNav-value'})
    print("{} has posted {} tweets.".format(handle, tweets.get('data-count')))
except AttributeError:
    print('Account name not found...')

Input your account name on Twitter: @handle
Account name not found...
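
In the sample run above the input includes the leading `@`, which then ends up in the requested URL. A minimal sketch of normalising the input before building the URL; `profile_url` is a hypothetical helper that only builds the string and makes no request:

```python
def profile_url(handle, base="https://twitter.com/"):
    """Build the profile URL from user input, tolerating a leading '@'."""
    name = handle.strip().lstrip("@")
    if not name:
        raise ValueError("empty handle")
    return base + name


for raw in ["@handle", "  handle ", "@"]:
    try:
        print(profile_url(raw))
    except ValueError as err:
        print(f"Invalid input {raw!r}: {err}")
```

Raising `ValueError` for empty input gives the surrounding try/except block a specific exception to catch, instead of relying on a bare `except`.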


#### Find the number of followers of a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. Include a ***try/except block*** to handle account names that are not found.
<br>***Hint:*** the program should count the followers for any provided account.

In [24]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [25]:
# your code here
from bs4 import BeautifulSoup
import requests

handle = input('Input your account name on Twitter: ')
temp = requests.get('https://twitter.com/' + handle)
bs = BeautifulSoup(temp.text, 'lxml')

try:
    # A missing element makes find() return None, which raises AttributeError below
    follow_box = bs.find('li', {'class': 'ProfileNav-item ProfileNav-item--followers'})
    followers = follow_box.find('a').find('span', {'class': 'ProfileNav-value'})
    print("Number of followers: {}".format(followers.get('data-count')))
except AttributeError:
    print('Account name not found...')
  

Input your account name on Twitter: @handle
Account name not found...


#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [26]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [27]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen(url)
bs = BeautifulSoup(html, "html.parser")
# Select the <a> elements that carry the 'link-box' class
nameList = bs.find_all('a', {'class': 'link-box'})
for name in nameList:
    print(name.get_text())
  


English
6 458 000+ articles


日本語
1 314 000+ 記事


Русский
1 798 000+ статей


Español
1 755 000+ artículos


Deutsch
2 667 000+ Artikel


Français
2 400 000+ articles


Italiano
1 742 000+ voci


中文
1 256 000+ 条目 / 條目


Português
1 085 000+ artigos


العربية
1 159 000+ مقالة
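
Each block of text above holds a language name followed by an article count. A sketch of separating the two, assuming the count always precedes a `+` sign as in the output; `parse_link_box` is a hypothetical helper:

```python
import re


def parse_link_box(text):
    """Return (language, article_count) from the text of one language link."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    name, count_line = lines[0], lines[1]
    # Keep only the digits before the '+', dropping any kind of space
    digits = re.sub(r"\D", "", count_line.split("+")[0])
    return name, int(digits)


sample = "\n\nEnglish\n6 458 000+ articles\n\n"
print(parse_link_box(sample))  # ('English', 6458000)
```

Removing every non-digit character handles both regular and non-breaking spaces used as thousands separators.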



#### Create a list of the different kinds of datasets available on data.gov.uk.

In [28]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [29]:
# your code here
response = requests.get(url).content
soup = BeautifulSoup(response, "lxml")

tags = ["h3"]
text = [element.text for element in soup.find_all(tags)]
text

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [30]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [31]:
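# your code here
response = requests.get(url).content
soup = BeautifulSoup(response, "lxml")

# Read the text of every table row; row 0 is the header, rows 1-10 are the top ten
tags = ["tr"]
text = [row.text for row in soup.find_all(tags)]
for i in range(1, 11):
    # Print the raw row text. str.replace() matches literally, so a pattern
    # such as "[0-9]" would not strip digits or footnote markers
    print(text[i])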


1

Mandarin Chinese

929.0

11.922%

Sino-Tibetan

Sinitic


2

Spanish

474.7

5.994%

Indo-European

Romance


3

English

372.9

4.922%

Indo-European

Germanic


4

Hindi (sanskritised Hindustani)[11]

343.9

4.429%

Indo-European

Indo-Aryan


5

Bengali

233.7

4.000%

Indo-European

Indo-Aryan


6

Portuguese

232.4

2.870%

Indo-European

Romance


7

Russian

154.0

2.000%

Indo-European

Balto-Slavic


8

Japanese

125.3

1.662%

Japonic

Japanese


9

Western Punjabi[12]

92.7

1.204%

Indo-European

Indo-Aryan


10

Yue Chinese

85.2

0.949%

Sino-Tibetan

Sinitic



## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [None]:
# your code here

#### Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [None]:
# your code here

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

In [None]:
# your code here
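
Concatenating the city name directly into the URL can break for names that contain spaces or special characters. A minimal sketch of building the same query string with `urllib.parse.urlencode`; `build_weather_url` is a hypothetical helper, `YOUR_API_KEY` is a placeholder, and the parameter names are the ones used in the URL above:

```python
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key, units="metric"):
    """Return the request URL with all query parameters percent-encoded."""
    query = urlencode({"q": city, "APPID": api_key, "units": units})
    return f"{BASE_URL}?{query}"


print(build_weather_url("New York", "YOUR_API_KEY"))
```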

#### Find the book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
# your code here