# Web Scraping Lab

This notebook contains a set of exercises for practising your web scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, special class names, etc. used in the HTML content you are expected to extract.
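
The first tip can be sketched with the standard library alone. The helper names `is_success` and `describe_status` are hypothetical and not part of `requests`; with `requests`, the integer would come from `response.status_code`.

```python
from http import HTTPStatus


def is_success(status_code: int) -> bool:
    """Return True for any 2xx status code."""
    return 200 <= status_code < 300


def describe_status(status_code: int) -> str:
    """Return the code with its standard reason phrase, e.g. '404 Not Found'."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Non-standard codes are not listed in http.HTTPStatus.
        phrase = "Unknown status"
    return f"{status_code} {phrase}"


# Sample codes; in the exercises these would be read from the response object.
for code in (200, 404, 503):
    verdict = "ok" if is_success(code) else "check the request"
    print(describe_status(code), "->", verdict)
```

Checking the code before parsing avoids extracting data from an error page by mistake.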

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. If you prefer to use additional libraries, feel free to do so.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url_dvp = 'https://github.com/trending/developers'
# your code here
page = requests.get(url_dvp)
page

<Response [200]>

In [3]:
soup = BeautifulSoup(page.content, "lxml")
print(type(soup))
print(len(soup))

<class 'bs4.BeautifulSoup'>
3


#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. No name should contain any HTML tags.

**Instructions:**

1. Find out the HTML tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the HTML elements that contain the developer names.

1. Use string manipulation techniques to remove whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
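
For step 3, one possible approach is to split the element text on any whitespace and re-join the pieces. This is only a sketch: `clean_name` and the sample strings are illustrative assumptions, not values taken from the page.

```python
def clean_name(raw_text: str) -> str:
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    # str.split() with no argument splits on any whitespace and drops empties.
    return " ".join(raw_text.split())


# Hypothetical element texts, shaped like what `element.text` tends to return.
raw_names = ["\n\n            Ada   Lovelace\n ", "\n  Grace Hopper\n"]
clean_names = [clean_name(text) for text in raw_names]
print(clean_names)
```

To drop internal spaces as well, as in the expected output above, join with an empty string instead of a space.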

In [4]:
soup_names = soup.find_all('h1', {'class' : 'h3 lh-condensed'})
print(type(soup_names))
print(len(soup_names))
print(soup_names[0])

<class 'bs4.element.ResultSet'>
25
<h1 class="h3 lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":11247099,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="ef1ca5fa12eb07591ada00a49d20f71122be37479f6c8f5ebc682e338e46fdde" href="/antfu">
            Anthony Fu
</a> </h1>


In [5]:
listofnames = [e.text for e in soup_names]
print(listofnames)

['\n\n            Anthony Fu\n ', '\n\n            Franck Nijhof\n ', '\n\n            Stefano Gottardo\n ', '\n\n            Tom Payne\n ', '\n\n            Remi Rousselet\n ', '\n\n            Ha Thach\n ', '\n\n            Florimond Manca\n ', '\n\n            Zihua Li\n ', '\n\n            Adam Wathan\n ', '\n\n            Dalton Hubble\n ', '\n\n            Christian Muehlhaeuser\n ', '\n\n            bdring\n ', '\n\n            Arvid Norberg\n ', '\n\n            Florian\n ', '\n\n            Pedro S. Lopez\n ', '\n\n            Daniel Lemire\n ', '\n\n            Michael (Parker) Parker\n ', '\n\n            Fons van der Plas\n ', '\n\n            Kamil Kisiela\n ', '\n\n            Antonio Nuno Monteiro\n ', '\n\n            Hajime Hoshi\n ', '\n\n            PySimpleGUI\n ', '\n\n            Adam R\n ', '\n\n            An Tao\n ', '\n\n            Jonny Burger\n ']


In [6]:
listofnames = [e.text.strip('\n ') for e in soup_names]
print(listofnames)

['Anthony Fu', 'Franck Nijhof', 'Stefano Gottardo', 'Tom Payne', 'Remi Rousselet', 'Ha Thach', 'Florimond Manca', 'Zihua Li', 'Adam Wathan', 'Dalton Hubble', 'Christian Muehlhaeuser', 'bdring', 'Arvid Norberg', 'Florian', 'Pedro S. Lopez', 'Daniel Lemire', 'Michael (Parker) Parker', 'Fons van der Plas', 'Kamil Kisiela', 'Antonio Nuno Monteiro', 'Hajime Hoshi', 'PySimpleGUI', 'Adam R', 'An Tao', 'Jonny Burger']


#### Display the trending Python repositories on GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [7]:
# This is the url you will scrape in this exercise
url_repos = 'https://github.com/trending/python?since=daily'
# your code here
page = requests.get(url_repos)

In [8]:
soup = BeautifulSoup(page.content, "lxml")

In [9]:
soup_repos = soup.find_all('h1', {'class' : 'h3 lh-condensed'})
soup_repos[0]

<h1 class="h3 lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_REPOSITORIES_PAGE","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":350557349,"originating_url":"https://github.com/trending/python?since=daily","user_id":null}}' data-hydro-click-hmac="2c10082ca8f220b8cbe8a40844a81b0c5a80359b20beedb6828bbfe836fd4da7" href="/davepl/Primes">
<svg aria-hidden="true" class="octicon octicon-repo mr-1 color-text-secondary" height="16" version="1.1" viewbox="0 0 16 16" width="16"><path d="M2 2.5A2.5 2.5 0 014.5 0h8.75a.75.75 0 01.75.75v12.5a.75.75 0 01-.75.75h-2.5a.75.75 0 110-1.5h1.75v-2h-8a1 1 0 00-.714 1.7.75.75 0 01-1.072 1.05A2.495 2.495 0 012 11.5v-9zm10.5-1V9h-8c-.356 0-.694.074-1 .208V2.5a1 1 0 011-1h8zM5 12.25v3.25a.25.25 0 00.4.2l1.45-1.087a.25.25 0 01.3 0L8.6 15.7a.25.25 0 00.4-.2v-3.25a.25.25 0 00-.25-.25h-3.5a.25.25 0 00-.25.25z" fill-rule="evenodd"></path></svg>
<span

In [10]:
listofrepos = [r.text for r in soup_repos]
listofrepos

['\n\n\n\n        davepl /\n\n      Primes\n ',
 '\n\n\n\n        home-assistant /\n\n      core\n ',
 '\n\n\n\n        Chia-Network /\n\n      chia-blockchain\n ',
 '\n\n\n\n        willmcgugan /\n\n      rich\n ',
 '\n\n\n\n        Rapptz /\n\n      discord.py\n ',
 '\n\n\n\n        archlinux /\n\n      archinstall\n ',
 '\n\n\n\n        ytdl-org /\n\n      youtube-dl\n ',
 '\n\n\n\n        donnemartin /\n\n      system-design-primer\n ',
 '\n\n\n\n        bguerbas /\n\n      SpeedTest\n ',
 '\n\n\n\n        python /\n\n      cpython\n ',
 '\n\n\n\n        networkx /\n\n      networkx\n ',
 '\n\n\n\n        owid /\n\n      covid-19-data\n ',
 '\n\n\n\n        wang0618 /\n\n      PyWebIO\n ',
 '\n\n\n\n        ndb796 /\n\n      python-for-coding-test\n ',
 '\n\n\n\n        programthink /\n\n      zhao\n ',
 '\n\n\n\n        scikit-learn /\n\n      scikit-learn\n ',
 '\n\n\n\n        TheAlgorithms /\n\n      Python\n ',
 '\n\n\n\n        python-telegram-bot /\n\n      python-telegram-b

In [11]:
listofrepos = [''.join(r.text.strip('/ \n').split('\n')) for r in soup_repos]
listofrepos

['davepl /      Primes',
 'home-assistant /      core',
 'Chia-Network /      chia-blockchain',
 'willmcgugan /      rich',
 'Rapptz /      discord.py',
 'archlinux /      archinstall',
 'ytdl-org /      youtube-dl',
 'donnemartin /      system-design-primer',
 'bguerbas /      SpeedTest',
 'python /      cpython',
 'networkx /      networkx',
 'owid /      covid-19-data',
 'wang0618 /      PyWebIO',
 'ndb796 /      python-for-coding-test',
 'programthink /      zhao',
 'scikit-learn /      scikit-learn',
 'TheAlgorithms /      Python',
 'python-telegram-bot /      python-telegram-bot',
 'swisskyrepo /      PayloadsAllTheThings',
 'sammchardy /      python-binance',
 'vinta /      awesome-python',
 'spyder-ide /      spyder',
 'maurosoria /      dirsearch',
 'ericaltendorf /      plotman',
 'ManimCommunity /      manim']
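
The names above still carry a run of spaces around the slash. A minimal sketch of one way to normalise them into `owner/repo` form follows; `normalise_repo` is a hypothetical helper, and it assumes each text contains at most one `/` separating owner and repository.

```python
def normalise_repo(raw_text: str) -> str:
    """Turn 'owner /\\n\\n   repo' style text into 'owner/repo'."""
    owner, separator, repo = raw_text.partition("/")
    if not separator:
        # No slash found: return the trimmed text unchanged.
        return owner.strip()
    return f"{owner.strip()}/{repo.strip()}"


raw = "\n\n\n\n        davepl /\n\n      Primes\n "
print(normalise_repo(raw))
```

The same function can be applied to every element with a list comprehension, for example `[normalise_repo(r.text) for r in soup_repos]`.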

#### Display all the image filenames from the Walt Disney Wikipedia page.

In [12]:
# This is the url you will scrape in this exercise
url_wd = 'https://en.wikipedia.org/wiki/Walt_Disney'
# your code here
page = requests.get(url_wd)
soup = BeautifulSoup(page.content, "lxml")
soup_wd = soup.find_all('a', class_ ='image')
soup_wd[23]

<a class="image" href="/wiki/File:Flag_of_the_United_States.svg"><img alt="Flag of the United States.svg" data-file-height="650" data-file-width="1235" decoding="async" height="16" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/30px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/45px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/60px-Flag_of_the_United_States.svg.png 2x" width="30"/></a>

In [13]:
filelst = [e['href'] for e in soup_wd]
filelst

['/wiki/File:Walt_Disney_1946.JPG',
 '/wiki/File:Walt_Disney_1942_signature.svg',
 '/wiki/File:Walt_Disney_envelope_ca._1921.jpg',
 '/wiki/File:Trolley_Troubles_poster.jpg',
 '/wiki/File:Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg',
 '/wiki/File:Steamboat-willie.jpg',
 '/wiki/File:Walt_Disney_1935.jpg',
 '/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg',
 '/wiki/File:Disney_drawing_goofy.jpg',
 '/wiki/File:DisneySchiphol1951.jpg',
 '/wiki/File:WaltDisneyplansDisneylandDec1954.jpg',
 '/wiki/File:Walt_disney_portrait_right.jpg',
 '/wiki/File:Walt_Disney_Grave.JPG',
 '/wiki/File:Roy_O._Disney_with_Company_at_Press_Conference.jpg',
 '/wiki/File:Disney_Display_Case.JPG',
 '/wiki/File:Disney1968.jpg',
 '/wiki/File:Disneyland_Resort_logo.svg',
 '/wiki/File:Animation_disc.svg',
 '/wiki/File:P_vip.svg',
 '/wiki/File:Magic_Kingdom_castle.jpg',
 '/wiki/File:Video-x-generic.svg',
 '/wiki/File:Flag_of_Los_Angeles_County,_Califor

In [14]:
filelst = [e['href'].split('wiki/File:')[1] for e in soup_wd]
filelst

['Walt_Disney_1946.JPG',
 'Walt_Disney_1942_signature.svg',
 'Walt_Disney_envelope_ca._1921.jpg',
 'Trolley_Troubles_poster.jpg',
 'Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg',
 'Steamboat-willie.jpg',
 'Walt_Disney_1935.jpg',
 'Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg',
 'Disney_drawing_goofy.jpg',
 'DisneySchiphol1951.jpg',
 'WaltDisneyplansDisneylandDec1954.jpg',
 'Walt_disney_portrait_right.jpg',
 'Walt_Disney_Grave.JPG',
 'Roy_O._Disney_with_Company_at_Press_Conference.jpg',
 'Disney_Display_Case.JPG',
 'Disney1968.jpg',
 'Disneyland_Resort_logo.svg',
 'Animation_disc.svg',
 'P_vip.svg',
 'Magic_Kingdom_castle.jpg',
 'Video-x-generic.svg',
 'Flag_of_Los_Angeles_County,_California.svg',
 'Blank_television_set.svg',
 'Flag_of_the_United_States.svg']
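
Some filenames above still contain percent-encoded characters such as `%22`. As a sketch, `urllib.parse.unquote` can decode them; `filename_from_href` is a hypothetical helper that assumes every link contains the `File:` prefix seen in the output.

```python
from urllib.parse import unquote


def filename_from_href(href: str) -> str:
    """Return the decoded filename that follows 'File:' in a wiki link."""
    _, separator, name = href.partition("File:")
    if not separator:
        raise ValueError(f"No 'File:' prefix in {href!r}")
    # unquote turns sequences such as %22 back into the original character.
    return unquote(name)


href = ("/wiki/File:Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22"
        "_-_National_Board_of_Review_Magazine.jpg")
print(filename_from_href(href))
```

Using `partition` instead of `split(...)[1]` makes the failure explicit when a link does not match the expected pattern.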

#### Retrieve the Wikipedia page of "Python" below and create a list of links on that page.

In [15]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python'
# your code here
page = requests.get(url)
print(page)
soup = BeautifulSoup(page.content, "lxml")
soup_links = soup.find_all('a')
print(len(soup_links))
soup_links[150]

<Response [200]>
152


<a href="https://wikimediafoundation.org/"><img alt="Wikimedia Foundation" height="31" loading="lazy" src="/static/images/footer/wikimedia-button.png" srcset="/static/images/footer/wikimedia-button-1.5x.png 1.5x, /static/images/footer/wikimedia-button-2x.png 2x" width="88"/></a>

In [16]:
list_links = [a['href'] for a in soup_links if a.get('href', '').startswith('http')]
print(len(list_links))
list_links

59


['https://en.wiktionary.org/wiki/Python',
 'https://en.wiktionary.org/wiki/python',
 'https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Python&namespace=0',
 'https://en.wikipedia.org/w/index.php?title=Python&oldid=997582414',
 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
 'https://www.wikidata.org/wiki/Special:EntityPage/Q747452',
 'https://commons.wikimedia.org/wiki/Category:Python',
 'https://af.wikipedia.org/wiki/Python',
 'https://als.wikipedia.org/wiki/Python',
 'https://ar.wikipedia.org/wiki/%D8%A8%D8%A7%D9%8A%D8%AB%D9%88%D9%86_(%D8%AA%D9%88%D8%B6%D9%8A%D8%AD)',
 'https://az.wikipedia.org/wiki/Python',
 'https://bn.wikipedia.org/wiki/%E0%A6%AA%E0%A6%BE%E0%A6%87%E0%A6%A5%E0%A6%A8_(%E0%A6%A6%E0%A7%8D%E0%A6%AC%E0%A7%8D%E0%A6%AF%E0%A6%B0%E0%A7%8D%E0%A6%A5%E0%A6%A4%E0%A6%BE_%E0%A6%A8%E0%A6%BF%E0%A6%B0%E0%A6%B8%E0%A6%A8)',
 'https://be.wikipedia.org/wiki/Python',
 'htt
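
The filter above keeps only links that are already absolute, so relative links such as `/wiki/...` are dropped. If those are wanted too, one option is to resolve them against the page URL with `urllib.parse.urljoin`. This is a sketch; `absolute_links` and the sample `href` values are assumptions for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/Python"


def absolute_links(hrefs, base_url=BASE_URL):
    """Resolve relative and protocol-relative hrefs against the page URL."""
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical href values covering the three common shapes.
sample_hrefs = [
    "/wiki/Main_Page",                         # relative to the site root
    "//example.org/image.png",                 # protocol-relative
    "https://en.wiktionary.org/wiki/Python",   # already absolute
]
for link in absolute_links(sample_hrefs):
    print(link)
```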

#### Number of Titles that have changed in the United States Code since its last release point 

In [17]:
# This is the url you will scrape in this exercise
url_usc = 'http://uscode.house.gov/download/download.shtml'
#your code
page = requests.get(url_usc)
soup_usc = BeautifulSoup(page.content, 'lxml')

updated = soup_usc.find_all('div', {'class':'usctitlechanged'})
print(len(updated))

4


#### Create a Python list with the names on the FBI's Ten Most Wanted list.

Your output should look like below:
```
['ROBERT WILLIAM FISHER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN',
 'ALEXIS FLORES',
 'EUGENE PALMER',
 'RAFAEL CARO-QUINTERO',
 'YASER ABDEL SAID',
 'SANTIAGO VILLALBA MEDEROS']
```

In [18]:
# your code here
page = requests.get('https://www.fbi.gov/wanted/topten')
soup = BeautifulSoup(page.content, "lxml")
# Keep the parsed page and the matched elements in separate variables
wanted = soup.find_all('h3', {'class': 'title'})
[e.text.strip('\n') for e in wanted]

['ALEJANDRO ROSALES CASTILLO',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN',
 'ALEXIS FLORES',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ',
 'EUGENE PALMER',
 'RAFAEL CARO-QUINTERO',
 'ROBERT WILLIAM FISHER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'YASER ABDEL SAID']

#### Display information on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas DataFrame.

In [19]:
# This is the url you will scrape in this exercise
url_earth = 'https://www.emsc-csem.org/Earthquake/'
# your code here
resp = requests.get(url_earth)
soup = BeautifulSoup(resp.content, "lxml")
table_earth = soup.find(id='tbody')

In [20]:
rows = table_earth.find_all('tr')[:20]
print(len(rows))
print(rows[0])

20
<tr class="ligne1 normal" id="967014" onclick="go_details(event,967014);"><td class="tabev0"></td><td class="tabev0"></td><td class="tabev0"></td><td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=967014">2021-04-05   14:16:56.0</a></b><i class="ago" id="ago0">17min ago</i></td><td class="tabev1">18.41 </td><td class="tabev2">S  </td><td class="tabev1">71.12 </td><td class="tabev2">W  </td><td class="tabev3">29</td><td class="tabev5" id="magtyp0">ML</td><td class="tabev2">2.5</td><td class="tb_region" id="reg0"> OFF COAST OF TARAPACA, CHILE</td><td class="comment updatetimeno" id="upd0" style="text-align:right;">2021-04-05 14:28</td></tr>


In [21]:
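# Each date/time link reads "<date>\xa0\xa0\xa0<time>", so splitting on '\xa0'
# leaves the date at index 0 and the time at index 3.
stamp_cells = [x.find('td',{'class':'tabev6'}) for x in rows]
stamps = [y.find('a').text.split('\xa0') for y in stamp_cells]
date = [parts[0] for parts in stamps]
time = [parts[3] for parts in stamps]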
print(date)
print(time)

sel = [y.find_all('td',{'class':'tabev1'}) for y in rows]
lat = [lst[0].text.strip('\xa0') for lst in sel]
print(lat)
long = [lst[1].text.strip('\xa0') for lst in sel]
print(long)
sel2 = [y.find_all('td',{'class':'tabev2'}) for y in rows]
lat2 = [lst[0].text.strip('\xa0') for lst in sel2]
print(lat2)
long2 = [lst[1].text.strip('\xa0') for lst in sel2]
print(long2)

region = [lst[1] for lst in [x.find('td',{'class':'tb_region'}).text.split('\xa0') for x in rows]]
print(region)

['2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05']
['14:16:56.0', '14:16:17.0', '14:00:17.9', '13:47:30.4', '13:44:15.6', '13:43:08.1', '13:43:05.6', '13:40:33.4', '13:33:34.0', '13:27:58.0', '13:22:37.4', '13:19:22.5', '12:53:07.4', '12:40:35.0', '12:29:48.8', '12:27:47.9', '12:22:07.0', '12:00:55.0', '11:51:21.0', '11:48:40.4']
['18.41', '7.20', '37.38', '58.46', '46.86', '37.56', '54.92', '39.79', '38.19', '8.06', '42.27', '33.95', '33.94', '9.67', '33.95', '33.94', '30.54', '2.00', '36.11', '33.95']
['71.12', '126.94', '179.89', '137.11', '10.00', '36.41', '163.25', '22.01', '118.08', '116.79', '8.39', '118.35', '118.34', '80.53', '118.34', '118.34', '71.43', '96.89', '73.43', '118.34']
['S', 'N', 'S', 'N', 'N', 'N', 'N', 'N', 'N', 'S', 'N', 'N', 'N', 'N',

In [22]:
dct= {'Date':date,'Time':time, 'Latitude':lat, 'Lat degrees':lat2, 'Longitude':long, 'Long degrees':long2, 'Region':region}
print(dct)

{'Date': ['2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05', '2021-04-05'], 'Time': ['14:16:56.0', '14:16:17.0', '14:00:17.9', '13:47:30.4', '13:44:15.6', '13:43:08.1', '13:43:05.6', '13:40:33.4', '13:33:34.0', '13:27:58.0', '13:22:37.4', '13:19:22.5', '12:53:07.4', '12:40:35.0', '12:29:48.8', '12:27:47.9', '12:22:07.0', '12:00:55.0', '11:51:21.0', '11:48:40.4'], 'Latitude': ['18.41', '7.20', '37.38', '58.46', '46.86', '37.56', '54.92', '39.79', '38.19', '8.06', '42.27', '33.95', '33.94', '9.67', '33.95', '33.94', '30.54', '2.00', '36.11', '33.95'], 'Lat degrees': ['S', 'N', 'S', 'N', 'N', 'N', 'N', 'N', 'N', 'S', 'N', 'N', 'N', 'N', 'N', 'N', 'S', 'N', 'S', 'N'], 'Longitude': ['71.12', '126.94', '179.89', '137.11', '10.00', '36.41', '163.25', '22.01', '118.08', '116.79', '8

In [23]:
import pandas as pd
df = pd.DataFrame(dct)
df

| | Date | Time | Latitude | Lat degrees | Longitude | Long degrees | Region |
|---|---|---|---|---|---|---|---|
| 0 | 2021-04-05 | 14:16:56.0 | 18.41 | S | 71.12 | W | OFF COAST OF TARAPACA, CHILE |
| 1 | 2021-04-05 | 14:16:17.0 | 7.2 | N | 126.94 | E | MINDANAO, PHILIPPINES |
| 2 | 2021-04-05 | 14:00:17.9 | 37.38 | S | 179.89 | E | OFF E. COAST OF N. ISLAND, N.Z. |
| 3 | 2021-04-05 | 13:47:30.4 | 58.46 | N | 137.11 | W | SOUTHEASTERN ALASKA |
| 4 | 2021-04-05 | 13:44:15.6 | 46.86 | N | 10.0 | E | SWITZERLAND |
| 5 | 2021-04-05 | 13:43:08.1 | 37.56 | N | 36.41 | E | CENTRAL TURKEY |
| 6 | 2021-04-05 | 13:43:05.6 | 54.92 | N | 163.25 | E | OFF EAST COAST OF KAMCHATKA |
| 7 | 2021-04-05 | 13:40:33.4 | 39.79 | N | 22.01 | E | GREECE |
| 8 | 2021-04-05 | 13:33:34.0 | 38.19 | N | 118.08 | W | NEVADA |
| 9 | 2021-04-05 | 13:27:58.0 | 8.06 | S | 116.79 | E | LOMBOK REGION, INDONESIA |
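
The coordinates are stored as text, with the hemisphere letter in a separate column. If numeric values are needed, each pair could be folded into one signed float. A minimal sketch, assuming the usual convention that south and west are negative; `signed_coordinate` is a hypothetical helper.

```python
def signed_coordinate(value: str, hemisphere: str) -> float:
    """Combine a magnitude such as '18.41' and a letter such as 'S' into -18.41."""
    # str.strip() with no argument also removes non-breaking spaces ('\xa0').
    number = float(value.strip())
    letter = hemisphere.strip().upper()
    if letter in ("S", "W"):
        return -number
    if letter in ("N", "E"):
        return number
    raise ValueError(f"Unexpected hemisphere: {hemisphere!r}")


# First row of the table: 18.41 S, 71.12 W
print(signed_coordinate("18.41\xa0", "S\xa0\xa0"), signed_coordinate("71.12", "W"))
```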


#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names that are not found.
<br>***Hint:*** the program should count the number of tweets for any provided account.
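
Part of the try/except requirement can be sketched without any network access by validating the handle before building the URL. The rule used here (1 to 15 letters, digits or underscores) is an assumption about handle syntax, and `normalise_handle` and `profile_url` are hypothetical helpers.

```python
import re

# Assumed handle syntax: 1-15 characters from letters, digits and underscore.
HANDLE_PATTERN = re.compile(r"[A-Za-z0-9_]{1,15}")


def normalise_handle(raw: str) -> str:
    """Strip whitespace and a leading '@', then validate the handle."""
    handle = raw.strip().lstrip("@")
    if not HANDLE_PATTERN.fullmatch(handle):
        raise ValueError(f"Not a valid handle: {raw!r}")
    return handle


def profile_url(raw: str) -> str:
    """Build the profile URL for a user-supplied handle."""
    return f"https://twitter.com/{normalise_handle(raw)}"


for candidate in ("@paulpogba", "not a handle!"):
    try:
        print(profile_url(candidate))
    except ValueError as error:
        print("Skipping:", error)
```

A handle that passes this check can still belong to no account, so the request itself would need its own error handling as well.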

In [16]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url_tw = 'https://twitter.com/paulpogba'
# your code here
page = requests.get(url_tw)
print(page)
soup = BeautifulSoup(page.content, 'lxml')

# TODO

<Response [200]>
None


#### Number of followers of a given Twitter account
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names that are not found.
<br>***Hint:*** the program should count the followers for any provided account.

In [None]:
# TODO

#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [25]:
# This is the url you will scrape in this exercise
url_wiki = 'https://www.wikipedia.org/'
# your code here
resp = requests.get(url_wiki)
soup = BeautifulSoup(resp.content, "lxml")
languages = soup.find_all('div', {'class': 'central-featured-lang'})

strong = [e.find_all('strong')[0].text for e in languages]
print(strong)
print()
bdi = [e.find_all('bdi')[0].text for e in languages]
articles = [''.join(e.split()) for e in bdi]
print(articles)
[(x,y) for x,y in zip(strong, articles)]

['English', 'Español', '日本語', 'Deutsch', 'Русский', 'Français', 'Italiano', '中文', 'Português', 'Polski']

['6274000+', '1668000+', '1259000+', '2553000+', '1708000+', '2311000+', '1681000+', '1185000+', '1061000+', '1463000+']


[('English', '6274000+'),
 ('Español', '1668000+'),
 ('日本語', '1259000+'),
 ('Deutsch', '2553000+'),
 ('Русский', '1708000+'),
 ('Français', '2311000+'),
 ('Italiano', '1681000+'),
 ('中文', '1185000+'),
 ('Português', '1061000+'),
 ('Polski', '1463000+')]
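
The article counts above are strings ending in `+`. If they need to be compared or sorted, they could be converted to integers first. A sketch, in which `parse_article_count` is a hypothetical helper that simply keeps the digits:

```python
import re


def parse_article_count(text: str) -> int:
    """Extract the digits from a count such as '6 274 000+' and return an int."""
    digits = re.sub(r"\D", "", text)  # drop spaces, '+' and any other non-digit
    if not digits:
        raise ValueError(f"No digits found in {text!r}")
    return int(digits)


# A few pairs copied from the output above.
pairs = [("English", "6274000+"), ("Español", "1668000+"), ("Deutsch", "2553000+")]
by_size = sorted(pairs, key=lambda pair: parse_article_count(pair[1]), reverse=True)
print(by_size)
```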

#### Create a list of the different kinds of datasets available on data.gov.uk.

In [26]:
# This is the url you will scrape in this exercise
url_uk = 'https://data.gov.uk/'
page = requests.get(url_uk)
soup = BeautifulSoup(page.content, "lxml")
subjects = soup.find_all('a', {'class': 'govuk-link'})
# Skip the first four matching links, which come before the topic links
topics = subjects[4:]
print(topics)
print(len(topics))
print(type(topics))
[e.text for e in topics]

[<a class="govuk-link" href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a>, <a class="govuk-link" href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a>, <a class="govuk-link" href="/search?filters%5Btopic%5D=Defence">Defence</a>, <a class="govuk-link" href="/search?filters%5Btopic%5D=Education">Education</a>, <a class="govuk-link" href="/search?filters%5Btopic%5D=Environment">Environment</a>, <a class="govuk-link" href="/search?filters%5Btopic%5D=Government">Government</a>, <a class="govuk-link" href="/search?filters%5Btopic%5D=Government+spending">Government spending</a>, <a class="govuk-link" href="/search?filters%5Btopic%5D=Health">Health</a>, <a class="govuk-link" href="/search?filters%5Btopic%5D=Mapping">Mapping</a>, <a class="govuk-link" href="/search?filters%5Btopic%5D=Society">Society</a>, <a class="govuk-link" href="/search?filters%5Btopic%5D=Towns+and+cities">Towns and cities</a>, <a class="govuk-link" href="/search?filters%5Bto

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']
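
Slicing by position breaks if the page gains or loses a link. Since each topic link above carries a `filters[topic]` query parameter, another option is to read the topic from the `href` itself. A sketch, with `topic_from_href` as a hypothetical helper:

```python
from urllib.parse import parse_qs, urlsplit


def topic_from_href(href: str):
    """Return the value of the 'filters[topic]' query parameter, or None."""
    query = parse_qs(urlsplit(href).query)  # also decodes %5B, %5D and '+'
    values = query.get("filters[topic]")
    return values[0] if values else None


print(topic_from_href("/search?filters%5Btopic%5D=Business+and+economy"))
print(topic_from_href("/about"))
```

Links for which the function returns `None` can then be filtered out without relying on their position in the page.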

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [27]:
# This is the url you will scrape in this exercise
url_speak = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
# your code here
page = requests.get(url_speak)
soup = BeautifulSoup(page.content, "lxml")
table = soup.find_all('table', {'class': 'wikitable sortable'})[0]

rows = table.find_all('tr')
# Keep the header row plus the first ten languages
rows2 = [row.text.split('\n') for row in rows][:11]
# Index 3 holds the language (footnote markers such as [9] or [10] are stripped),
# index 5 holds the number of speakers
mylist = [(rg[3].strip('[]910'),rg[5]) for rg in rows2]
print(mylist)
colnames = mylist[0]
df = pd.DataFrame(mylist[1:], columns=colnames)
df

[('Language', 'Speakers(millions)'), ('Mandarin Chinese', '918'), ('Spanish', '480'), ('English', '379'), ('Hindi (sanskritised Hindustani)', '341'), ('Bengali', '228'), ('Portuguese', '221'), ('Russian', '154'), ('Japanese', '128'), ('Western Punjabi', '92.7'), ('Marathi', '83.1')]


| | Language | Speakers(millions) |
|---|---|---|
| 0 | Mandarin Chinese | 918.0 |
| 1 | Spanish | 480.0 |
| 2 | English | 379.0 |
| 3 | Hindi (sanskritised Hindustani) | 341.0 |
| 4 | Bengali | 228.0 |
| 5 | Portuguese | 221.0 |
| 6 | Russian | 154.0 |
| 7 | Japanese | 128.0 |
| 8 | Western Punjabi | 92.7 |
| 9 | Marathi | 83.1 |
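
Stripping a fixed set of characters works for this table, but `str.strip` with a character set can also remove digits or brackets that belong to the value itself. A regular expression that targets only bracketed markers is one alternative. In this sketch, `strip_footnotes` is a hypothetical helper and the sample strings are assumptions shaped like the table cells.

```python
import re

# Matches a bracketed marker such as [9] or [10].
FOOTNOTE = re.compile(r"\[[^\]]*\]")


def strip_footnotes(text: str) -> str:
    """Remove bracketed footnote markers and surrounding whitespace."""
    return FOOTNOTE.sub("", text).strip()


for cell in ("Mandarin Chinese[10]", "Hindi (sanskritised Hindustani)[9]"):
    print(strip_footnotes(cell))
```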


## Bonus

#### Find the book name, price and stock availability as a pandas dataframe.

In [28]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url_b = 'http://books.toscrape.com/'
# your code here
page = requests.get(url_b)
soup = BeautifulSoup(page.content, 'lxml')
book_soup = soup.find_all('article',{'class':'product_pod'})
alist=[e.find_all('a') for e in book_soup]
print(len(alist))
print(alist[0])

20
[<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>, <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]


In [29]:
booktitles=[e[1]['title'] for e in alist]
print(booktitles)

['A Light in the Attic', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History of Humankind', 'The Requiem Red', 'The Dirty Little Secrets of Getting Your Dream Job', 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'The Black Maria', 'Starving Hearts (Triangular Trade Trilogy, #1)', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'Rip it Up and Start Again', 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'Olio', 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'Libertarianism for Beginners', "It's Only the Himalayas"]


In [30]:
plist = [e.find_all('p',{'class':'price_color'}) for e in book_soup]
pricelist = [x.text for p in plist for x in p]
print(len(pricelist))
print(pricelist)

20
['£51.77', '£53.74', '£50.10', '£47.82', '£54.23', '£22.65', '£33.34', '£17.93', '£22.60', '£52.15', '£13.99', '£20.66', '£17.46', '£52.29', '£35.02', '£57.25', '£23.88', '£37.59', '£51.33', '£45.17']


In [31]:
availlist = [e.find_all('p',{'class':'instock availability'}) for e in book_soup]
book_avail_lst = [z.strip('\n ') for z in [x.text for p in availlist for x in p]]
print(book_avail_lst)
print(len(book_avail_lst))

['In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock', 'In stock']
20


In [32]:
mydict = {'Book Title':booktitles, 'Book Price':pricelist, 'Book Availability':book_avail_lst}
pd.DataFrame(mydict)

| | Book Title | Book Price | Book Availability |
|---|---|---|---|
| 0 | A Light in the Attic | £51.77 | In stock |
| 1 | Tipping the Velvet | £53.74 | In stock |
| 2 | Soumission | £50.10 | In stock |
| 3 | Sharp Objects | £47.82 | In stock |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | In stock |
| 5 | The Requiem Red | £22.65 | In stock |
| 6 | The Dirty Little Secrets of Getting Your Dream... | £33.34 | In stock |
| 7 | The Coming Woman: A Novel Based on the Life of... | £17.93 | In stock |
| 8 | The Boys in the Boat: Nine Americans and Their... | £22.60 | In stock |
| 9 | The Black Maria | £52.15 | In stock |
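
The prices are kept as strings with a currency symbol, which prevents sorting or summing them. One possible conversion, sketched with `decimal.Decimal` to avoid float rounding; `parse_price` is a hypothetical helper that assumes a single number per string.

```python
import re
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Extract the numeric part of a price such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"No number found in {text!r}")
    return Decimal(match.group())


prices = [parse_price(p) for p in ("£51.77", "£53.74", "£50.10")]
print(sum(prices))
```

Searching for the number, rather than stripping a known symbol, also copes with stray characters before the pound sign, which can appear if the page is decoded with the wrong encoding.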


#### Scrape a certain number of tweets of a given Twitter account.

In [33]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'
# your code here


#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [34]:
# your code here

#### Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [35]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'
# your code here


#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
# your code here
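
Building the query string by concatenation breaks for city names with spaces or accents. A sketch of an alternative using `urllib.parse.urlencode` follows. The function name `build_weather_url` and the environment variable `OPENWEATHER_API_KEY` are assumptions for illustration; the parameter names `q`, `APPID` and `units` are the ones used in the URL above.

```python
import os
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city: str, api_key: str, units: str = "metric") -> str:
    """Return the request URL with every parameter safely percent-encoded."""
    query = urlencode({"q": city, "APPID": api_key, "units": units})
    return f"{BASE_URL}?{query}"


# Hypothetical variable name; falls back to a placeholder when it is not set.
api_key = os.environ.get("OPENWEATHER_API_KEY", "YOUR_API_KEY")
print(build_weather_url("New York", api_key))
```

Reading the key from the environment keeps it out of the notebook when the file is shared.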
