<a href="https://colab.research.google.com/github/angelsmreyes/lab_web_scraping/blob/master/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping Lab

This notebook contains a set of web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each url and take a look at its source through Chrome DevTools. You'll need to identify the html tags, special class names etc. used for the html content you are expected to extract.
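
The first tip can be turned into a small guard. The sketch below uses only the standard library; `check_status` is a hypothetical helper name, not part of `requests`, and it treats any 2xx code as success.

```python
from http import HTTPStatus


def check_status(status_code):
    """Return True for a 2xx status code; raise RuntimeError otherwise."""
    if 200 <= status_code < 300:
        return True
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Non-standard codes are not members of the HTTPStatus enum.
        phrase = "Unknown status"
    raise RuntimeError(f"Request failed: {status_code} {phrase}")


# Typical use with requests:
#   response = requests.get(url)
#   check_status(response.status_code)
print(check_status(200))
```

`requests` also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses and may be all you need.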

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`,  `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries feel free to uncomment them.

In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
# from urllib.request import urlopen
# import random
import re
# import scrapy

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [5]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [6]:
#your code
parser = requests.get(url)
parser

html = parser.content
#print(html)

soup = BeautifulSoup(html, 'lxml')
#soup

lst = soup.select('article h1[class="h3 lh-condensed"]')
#lst = soup.select('div h1[class="col-md-6"]')
lst

[<h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":661450,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="e1d8cec536f9b82d0028bb84c708121d62a6a56819ee2b0771d0e95c0b7e1f4e" data-view-component="true" href="/arvidn">
             Arvid Norberg
 </a> </h1>, <h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":7253922,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="4cc96ad02476315557290be891875c12c9d7e7e912c5358e1a9959e6abc695e2" data-view-component="true" href="/freearhey">
             Aleksandr Statciuk
 </a> </h1>, <h1 

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
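
One way to handle the whitespace step is to split on any whitespace and join the pieces with single spaces. This is a minimal sketch; `clean_text` is an illustrative helper name and the raw strings are made-up stand-ins for the `.text` of each element.

```python
def clean_text(raw):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(raw.split())


# Stand-ins for element.text values, including stray newlines and indentation.
raw_names = ["\n            Arvid Norberg\n ", "\n  Aleksandr\n  Statciuk "]
names = [clean_text(raw) for raw in raw_names]
print(names)
```

Unlike `strip()` alone, this also removes line breaks that fall in the middle of a name.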

In [7]:
#your code
Trending = [element.text.strip() for element in lst]
Trending

['Arvid Norberg',
 'Aleksandr Statciuk',
 'PySimpleGUI',
 'Ha Thach',
 'Henrik Rydgård',
 'Hajime Hoshi',
 'Julien Le Coupanec',
 'Anthony Sottile',
 'MichaIng',
 'Copple',
 'GO Sueyoshi',
 'Luke Edwards',
 'Tom Payne',
 'Artem Zakharchenko',
 'Felix Angelov',
 'tiepvupsu',
 'Chris Long',
 'Ritchie Vink',
 'Michael Vines',
 'Eliza Weisman',
 'Andrew Gallant',
 'Florian Roth',
 'Huan (李卓桓)',
 'Matthias Fey',
 'David Tolnay']

#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [8]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [9]:
#your code
parser = requests.get(url)
parser

html = parser.content
#html

soup = BeautifulSoup(html, 'lxml')
#soup

lst_2 = soup.select('article h1[class="h3 lh-condensed"]')
lst_2

Trending_repo = [element.text.strip().replace('\n', '').replace(' ', '') for element in lst_2]
Trending_repo

['microsoft/qlib',
 'Python-World/python-mini-projects',
 'CyberPunkMetalHead/binance-trading-bot-new-coins',
 'ChoiceCoin/Voting',
 'zhongxinghong/PKUAutoElective',
 'Chia-Network/chia-blockchain',
 'Rapptz/discord.py',
 'public-apis/public-apis',
 'freqtrade/freqtrade',
 'sherlock-project/sherlock',
 'iterativv/NostalgiaForInfinity',
 'oppia/oppia',
 'waydroid/waydroid',
 'pyg-team/pytorch_geometric',
 'ytdl-org/youtube-dl',
 'instaloader/instaloader',
 'blakeblackshear/frigate',
 'freqtrade/freqtrade-strategies',
 'home-assistant/core',
 'kingoflolz/mesh-transformer-jax',
 'Ganapati/RsaCtfTool',
 'Cog-Creators/Red-DiscordBot',
 'nerdyrodent/VQGAN-CLIP',
 'zulip/zulip',
 'TeamUltroid/Ultroid']
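
The selector `article h1[class="h3 lh-condensed"]` used in the two GitHub exercises compares the whole `class` attribute as a string, so it stops matching if the site adds or reorders a class. The sketch below contrasts it with a class selector on a hypothetical HTML fragment (the extra class is invented for the demonstration) and uses the built-in `html.parser` so it does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second heading carries an extra class.
sample = """
<article><h1 class="h3 lh-condensed"><a href="/arvidn">Arvid Norberg</a></h1></article>
<article><h1 class="h3 lh-condensed extra"><a href="/freearhey">Aleksandr Statciuk</a></h1></article>
"""
soup_demo = BeautifulSoup(sample, "html.parser")

# Attribute equality: the class string must be exactly "h3 lh-condensed".
exact = soup_demo.select('article h1[class="h3 lh-condensed"]')

# Class selector: both classes must be present, in any order, extras allowed.
by_class = soup_demo.select("article h1.h3.lh-condensed")

print(len(exact), len(by_class))
```

The class selector is usually the more tolerant choice when a page's markup changes slightly between visits.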

#### Display all the image links from Walt Disney wikipedia page

In [10]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [11]:
#your code
parser = requests.get(url)

html = parser.content

soup = BeautifulSoup(html, 'lxml')

lst_3 = soup.select('img[src]')
lst_3[0]['src']

images = [img['src'] for img in lst_3]
images

['//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 '//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 '//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/128px-Walt_Disney_1942_signature.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 '//upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/5/57/
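
The `src` values above are protocol-relative (they start with `//`), so they cannot be requested as-is. A minimal sketch of resolving them against the page URL with `urllib.parse.urljoin` follows; the second path in the list is a hypothetical example of a site-relative link.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"
srcs = [
    "//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png",
    "/static/images/example.png",  # hypothetical site-relative path
]

# urljoin borrows the scheme (and host, when missing) from the page URL.
absolute = [urljoin(page_url, src) for src in srcs]
for link in absolute:
    print(link)
```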

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

In [12]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [13]:
#your code
parser = requests.get(url)

html = parser.content

soup = BeautifulSoup(html, 'lxml')

lst_4 = soup.select('ul li a[href]')
lst_4

python = [link['href'] for link in lst_4]
python

['/wiki/Pythons',
 '/wiki/Python_(genus)',
 '#Computing',
 '#People',
 '#Roller_coasters',
 '#Vehicles',
 '#Weaponry',
 '#Other_uses',
 '#See_also',
 '/wiki/Python_(programming_language)',
 '/wiki/CMU_Common_Lisp',
 '/wiki/PERQ#PERQ_3',
 '/wiki/Python_of_Aenus',
 '/wiki/Python_(painter)',
 '/wiki/Python_of_Byzantium',
 '/wiki/Python_of_Catana',
 '/wiki/Python_Anghelo',
 '/wiki/Python_(Efteling)',
 '/wiki/Python_(Busch_Gardens_Tampa_Bay)',
 '/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)',
 '/wiki/Python_(automobile_maker)',
 '/wiki/Python_(Ford_prototype)',
 '/wiki/Python_(missile)',
 '/wiki/Python_(nuclear_primary)',
 '/wiki/Colt_Python',
 '/wiki/PYTHON',
 '/wiki/Python_(film)',
 '/wiki/Python_(mythology)',
 '/wiki/Monty_Python',
 '/wiki/Python_(Monty)_Pictures',
 '/wiki/Cython',
 '/wiki/Pyton',
 '/wiki/Pithon',
 '/wiki/Category:Disambiguation_pages',
 '/wiki/Category:Human_name_disambiguation_pages',
 '/wiki/Category:Disambiguation_pages_with_given-name-holder_lists',
 '/wiki/Category

In [19]:
links = []
for href in python:
  # Drop the '#fragment' part; hrefs that were only a fragment become ''
  cleaned = re.sub(r'#.*', '', href)
  # Keep only non-empty results instead of deleting by fixed positions
  if cleaned:
    links.append(cleaned)
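
An alternative to regex-based cleaning is to let `urllib.parse` split off the fragment. The sketch below drops in-page anchors, makes the remaining links absolute, and removes duplicates while keeping the original order; `normalize_links` is an illustrative name.

```python
from urllib.parse import urldefrag, urljoin


def normalize_links(hrefs, base_url):
    """Return absolute, de-duplicated links without '#fragment' parts."""
    seen = set()
    result = []
    for href in hrefs:
        path, _fragment = urldefrag(href)
        if not path:
            # Pure in-page anchor such as '#People': nothing left to visit.
            continue
        absolute = urljoin(base_url, path)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


sample = ['/wiki/Pythons', '#Computing', '/wiki/PERQ#PERQ_3', '/wiki/Pythons']
print(normalize_links(sample, 'https://en.wikipedia.org/wiki/Python'))
```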


#### Number of Titles that have changed in the United States Code since its last release point 

In [None]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

parser = requests.get(url)

html = parser.content

soup = BeautifulSoup(html, 'lxml')
soup

lst_5 = soup.select('div.usctitlechanged')
lst_5

[<div class="usctitlechanged" id="us/usc/t42">
 
           Title 42 - The Public Health and Welfare
 
         </div>]

In [None]:
#your code
Titles = [element.text.strip() for element in lst_5]
len(Titles)

1

#### A Python list with the top ten FBI's Most Wanted names 

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [None]:
#your code 
parser = requests.get(url)

html = parser.content

soup = BeautifulSoup(html, 'lxml')


lst_6 = soup.select('h3.title')
lst_6

[<h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/octaviano-juarez-corro">OCTAVIANO JUAREZ-CORRO</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/eugene-palmer">EUGENE PALMER</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/rafael-caro-quintero">RAFAEL CARO-QUINTERO</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/bhadreshkumar-chetanbhai-patel">BHADRESHKUMAR CHETANBHAI PATEL</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/alejandro-castillo">ALEJANDRO ROSALES CASTILLO</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/robert-william-fisher">ROBERT WILLIAM FISHER</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/arnoldo-jimenez">ARNOLDO JIMENEZ</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/jason-derek-brown">JASON DEREK BROWN</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted

In [None]:
most_wanted = [element.text.strip() for element in lst_6]
most_wanted

['OCTAVIANO JUAREZ-CORRO',
 'EUGENE PALMER',
 'RAFAEL CARO-QUINTERO',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'ROBERT WILLIAM FISHER',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN',
 'ALEXIS FLORES',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ']

####  20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [None]:
#your code

parser = requests.get(url)

html = parser.content

soup = BeautifulSoup(html, 'lxml')

lst_7 = soup.select('td.tabev6 a')
#lst_7[0].text.replace('\xa0', ' ').split()

date_time = [element.text.replace('\xa0', ' ').split() for element in lst_7]
date_time[0][0]

date = [date_time[i][0] for i in range(len(date_time))]

time = [date_time[i][1] for i in range(len(date_time))]
len(time)

50
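
The date and time strings can also be combined into real `datetime` objects, which makes later sorting and filtering easier. This is a sketch that assumes the cell text looks like the values shown in the final dataframe (date, non-breaking spaces, then time with a fractional second); `parse_timestamp` is an illustrative name.

```python
from datetime import datetime


def parse_timestamp(raw):
    """Parse text such as '2021-09-13\xa0\xa0\xa018:30:20.5' into a datetime."""
    date_part, time_part = raw.replace("\xa0", " ").split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S.%f")


stamp = parse_timestamp("2021-09-13\xa0\xa0\xa018:30:20.5")
print(stamp.isoformat())
```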

In [None]:
lst_latitude = soup.select('tbody#tbody tr')

comp_row = [element.text.strip().replace('\xa0', ' ').split() for element in lst_latitude]
comp_row

lsta = []
for i in range(len(comp_row)):
  ago = list(filter(lambda v : re.match(r'ago\d+', v), comp_row[i]))
  lsta.append(ago)
#clean = [i.replace('ago', '') for i in lsta]
#clean
latitude_flat = [j for lat in lsta for j in lat]
latitude_flat
clean_lat = [i.replace('ago', '') for i in latitude_flat]
len(clean_lat)
#lsta

50

In [None]:
lst_longitude = soup.select('tbody#tbody tr')

comp_row2 = [element.text.strip().replace('\xa0', ' ').split() for element in lst_longitude]
comp_row2

longitude = []
for row in comp_row2:
  # Keep tokens that start like a decimal number with two decimals (e.g. '17.85')
  matches = list(filter(lambda v : re.match(r'\d{1,3}\.\d\d', v), row))
  longitude.append(matches)


longitude_flat = [j for log in longitude for j in log]
len(longitude_flat)


50

In [None]:
selected = soup.select('td[class="tb_region"]')

region_name = [element.text.strip() for element in selected]
region_name


['STRAIT OF GIBRALTAR',
 'CANARY ISLANDS, SPAIN REGION',
 'NIAS REGION, INDONESIA',
 'OKLAHOMA',
 'CANARY ISLANDS, SPAIN REGION',
 'CANARY ISLANDS, SPAIN REGION',
 'CANARY ISLANDS, SPAIN REGION',
 'ISLAND OF HAWAII, HAWAII',
 'MOLUCCA SEA',
 'ADMIRALTY ISLANDS REGION, P.N.G.',
 'ISLAND OF HAWAII, HAWAII',
 'VANUATU',
 'CANARY ISLANDS, SPAIN REGION',
 'ISLAND OF HAWAII, HAWAII',
 'CANARY ISLANDS, SPAIN REGION',
 'CANARY ISLANDS, SPAIN REGION',
 'SOUTHERN PERU',
 'SPAIN',
 'CANARY ISLANDS, SPAIN REGION',
 'BISMARCK SEA',
 'HINDU KUSH REGION, AFGHANISTAN',
 'CANARY ISLANDS, SPAIN REGION',
 'CANARY ISLANDS, SPAIN REGION',
 'ISLAND OF HAWAII, HAWAII',
 'MINDANAO, PHILIPPINES',
 'MINDANAO, PHILIPPINES',
 'OFF E. COAST OF N. ISLAND, N.Z.',
 'GREATER LOS ANGELES AREA, CALIF.',
 'WESTERN TEXAS',
 'CAUCASUS REGION, RUSSIA',
 'ISLAND OF HAWAII, HAWAII',
 'NEAR COAST OF NICARAGUA',
 'SOUTH OF FIJI ISLANDS',
 'SAN JUAN, ARGENTINA',
 'NORTHERN ITALY',
 'SOUTH SANDWICH ISLANDS REGION',
 'JAVA, INDONE

In [None]:
df = pd.DataFrame(zip(date, time, clean_lat, longitude_flat, region_name), columns=['date', 'time','latitude','longitude', 'region_name'])
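
Because `zip` stops at the shortest input, a column that came out shorter than the others would silently drop rows from the dataframe. A small check before building it can catch that; `check_equal_lengths` is a hypothetical helper and the short lists are sample values taken from the outputs above.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists differ in length; else return the lengths."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


date = ["2021-09-13", "2021-09-13"]
time = ["18:30:20.5", "18:28:45.7"]
latitude = ["35.54", "28.57"]
print(check_equal_lengths(date=date, time=time, latitude=latitude))
```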

In [None]:
df.head(10)

|   | date | time | latitude | longitude | region_name |
|---|------|------|----------|-----------|-------------|
| 0 | 2021-09-13 | 18:30:20.5 | 35.54 | 3.72 | STRAIT OF GIBRALTAR |
| 1 | 2021-09-13 | 18:28:45.7 | 28.57 | 17.85 | CANARY ISLANDS, SPAIN REGION |
| 2 | 2021-09-13 | 18:26:38.0 | 1.74 | 96.73 | NIAS REGION, INDONESIA |
| 3 | 2021-09-13 | 18:26:21.2 | 34.31 | 96.44 | OKLAHOMA |
| 4 | 2021-09-13 | 18:16:24.5 | 28.57 | 17.85 | CANARY ISLANDS, SPAIN REGION |
| 5 | 2021-09-13 | 18:12:28.5 | 28.56 | 17.85 | CANARY ISLANDS, SPAIN REGION |
| 6 | 2021-09-13 | 18:07:46.0 | 28.56 | 17.85 | CANARY ISLANDS, SPAIN REGION |
| 7 | 2021-09-13 | 17:46:55.6 | 19.21 | 155.42 | ISLAND OF HAWAII, HAWAII |
| 8 | 2021-09-13 | 17:19:23.0 | 1.96 | 125.96 | MOLUCCA SEA |
| 9 | 2021-09-13 | 17:18:27.0 | 2.88 | 147.88 | ADMIRALTY ISLANDS REGION, P.N.G. |


#### Display the date, days, title, city, country of next 25 hackathon events as a Pandas dataframe table

In [None]:
# This is the url you will scrape in this exercise
url ='https://hackevents.co/hackathons'

In [None]:
#your code

#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account
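
The try/except pattern asked for here can be sketched without touching the network. In the example below the lookup is a hypothetical stand-in backed by a dictionary, and the handle and count are made up; a real version would replace `fetch_tweet_count` with a call to an API or a rendered page, since a plain `requests` response for a JavaScript-driven profile page may not contain the counts at all.

```python
class AccountNotFound(LookupError):
    """Raised when a profile cannot be found (hypothetical helper exception)."""


# Stand-in for real profile data; the handle and count are invented.
FAKE_PROFILES = {"example_user": 128}


def fetch_tweet_count(handle):
    """Hypothetical lookup; swap in a real data source here."""
    try:
        return FAKE_PROFILES[handle]
    except KeyError:
        raise AccountNotFound(f"Account '{handle}' not found") from None


def tweet_count_or_none(handle):
    """Return the tweet count, or None (with a message) for unknown accounts."""
    try:
        return fetch_tweet_count(handle)
    except AccountNotFound as err:
        print(err)
        return None


print(tweet_count_or_none("example_user"))
print(tweet_count_or_none("no_such_account"))
```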

### **This one is still missing**

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
#your code
account_credent = 'AngelsmReyes'

In [None]:
comp_link = url+account_credent
comp_link

'https://twitter.com/AngelsmReyes'

In [None]:
parser = requests.get(comp_link)



html = parser.content
html


soup = BeautifulSoup(html, 'lxml')
soup

#selection = soup.select('div[class="css-901oao css-bfa6kz r-14j79pv r-37j5jr r-n6v787 r-16dba41 r-1cwl3u0 r-bcqeeo r-qvutc0"]')
#selection

<!DOCTYPE html>
<html dir="ltr" lang="en">
<head><meta charset="utf-8"/>
<meta content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=0,viewport-fit=cover" name="viewport"/><link href="//abs.twimg.com" rel="preconnect"/><link href="//abs.twimg.com" rel="dns-prefetch"/><link href="//api.twitter.com" rel="preconnect"/><link href="//api.twitter.com" rel="dns-prefetch"/><link href="//pbs.twimg.com" rel="preconnect"/><link href="//pbs.twimg.com" rel="dns-prefetch"/><link href="//t.co" rel="preconnect"/><link href="//t.co" rel="dns-prefetch"/><link href="//video.twimg.com" rel="preconnect"/><link href="//video.twimg.com" rel="dns-prefetch"/><link as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/client-web-legacy/polyfills.60e872c5.js" nonce="YTMyZmY5MTEtZjJlMy00NWFkLWE2MGQtNTNhNTBiYTk0Yjgy" rel="preload"/><link as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/client-web-legacy/vendors~main.e35bab05.js" nonce="YTMy

#### Number of followers of a given twitter account

You will need to include a ***try/except block*** in case the account name is not found. 
<br>***Hint:*** the program should count the followers for any provided account
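
Follower counts are often displayed in abbreviated form. Assuming the scraped text looks like `1,234`, `12.5K` or `3.4M` (an assumption about the display format, not something confirmed by this notebook), a sketch for turning it into an integer could be the following; `parse_count` is an illustrative name.

```python
from decimal import Decimal

# Assumed suffix meanings for abbreviated counts.
SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text):
    """Convert '1,234', '12.5K' or '3.4M' into an int."""
    cleaned = text.strip().replace(",", "").upper()
    if cleaned and cleaned[-1] in SUFFIXES:
        # Decimal avoids floating-point rounding surprises such as 3.4 * 1e6.
        return int(Decimal(cleaned[:-1]) * SUFFIXES[cleaned[-1]])
    return int(cleaned)


print([parse_count(t) for t in ["1,234", "12.5K", "3.4M"]])
```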

### **This one is still missing**

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
#your code

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [None]:
#your code
parser = requests.get(url)

html = parser.content
html

soup = BeautifulSoup(html, 'lxml')
soup

selected = soup.select('div a.link-box')
selected

languages_lst = [elem.text.strip().replace('\xa0', ' ') for elem in selected]
languages_lst


['English\n6 326 000+ articles',
 '日本語\n1 275 000+ 記事',
 'Español\n1 696 000+ artículos',
 'Deutsch\n2 590 000+ Artikel',
 'Русский\n1 734 000+ статей',
 'Français\n2 340 000+ articles',
 '中文\n1 206 000+ 條目',
 'Italiano\n1 701 000+ voci',
 'Português\n1 066 000+ artigos',
 'Polski\n1 480 000+ haseł']

In [None]:
languages = []
for i in range(len(languages_lst)):
  lang_clean = re.sub(r'[\n]', ' ', languages_lst[i])
  languages.append(lang_clean)


In [None]:
languages

['English 6 326 000+ articles',
 '日本語 1 275 000+ 記事',
 'Español 1 696 000+ artículos',
 'Deutsch 2 590 000+ Artikel',
 'Русский 1 734 000+ статей',
 'Français 2 340 000+ articles',
 '中文 1 206 000+ 條目',
 'Italiano 1 701 000+ voci',
 'Português 1 066 000+ artigos',
 'Polski 1 480 000+ haseł']
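
Since the exercise asks for language names and article counts, it can help to split each entry into those two parts. The sketch below works on entries shaped like `languages_lst` (name, newline, count text); `parse_language_entry` is an illustrative name.

```python
import re


def parse_language_entry(entry):
    """Split 'English\n6 326 000+ articles' into ('English', 6326000)."""
    name, _, rest = entry.partition("\n")
    # Keep only the digits that appear before the '+' sign.
    digits = re.sub(r"\D", "", rest.split("+")[0])
    return name.strip(), int(digits)


entries = [
    "English\n6 326 000+ articles",
    "日本語\n1 275 000+ 記事",
    "Polski\n1 480 000+ haseł",
]
parsed = [parse_language_entry(entry) for entry in entries]
print(parsed)
```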

#### A list with the different kinds of datasets available in data.gov.uk 

In [None]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [None]:
#your code 

parser = requests.get(url)

html = parser.content
html

soup = BeautifulSoup(html, 'lxml')
soup

<!DOCTYPE html>
<!--[if lt IE 9]><html class="lte-ie8" lang="en"><![endif]--><!--[if gt IE 8]><!--><html lang="en"><!--<![endif]-->
<head>
<meta charset="utf-8"/>
<title>Find open data - data.gov.uk</title>
<meta content="#0b0c0c" name="theme-color"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="/find-assets/application-15d90411608bb4b805b8fdbd1944d73a4b203af82657c5a4215b3a481fd06295.css" media="screen" rel="stylesheet"/>
<meta content="authenticity_token" name="csrf-param"/>
<meta content="TYjLmE6loAQmHzMZM_5fNf6RwyRxwTta7zLJKXid0COO0yjysOvktajazuwCCJ8rT8EdarU9WGmoKsleyddfow" name="csrf-token"/>
</head><body class="govuk-template__body">
<script>document.body.className = ((document.body.className) ? document.body.className + ' js-enabled' : 'js-enabled');</script>
<div aria-label="cookie banner" class="gem-c-cookie-banner govuk-clearfix" data-module="cookie-banner" data-nosnippet="" id="global-cookie-message" role="region">
<div aria-label="Cookies

In [None]:
selected = soup.select('h3 a.govuk-link')
selected

datasets = [elem.text for elem in selected]
datasets

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [None]:
#your code

parser = requests.get(url)

html = parser.content

In [None]:
num_by_num_speakers = table = pd.read_html(html)

In [None]:
table[1].head(10)

|   | Rank | Language | Speakers(millions) | Percentageof world pop.(March 2019)[8] | Language family | Branch |
|---|------|----------|--------------------|----------------------------------------|-----------------|--------|
| 0 | 1 | Mandarin Chinese | 918.0 | 11.922% | Sino-Tibetan | Sinitic |
| 1 | 2 | Spanish | 480.0 | 5.994% | Indo-European | Romance |
| 2 | 3 | English | 379.0 | 4.922% | Indo-European | Germanic |
| 3 | 4 | Hindi (sanskritised Hindustani)[9] | 341.0 | 4.429% | Indo-European | Indo-Aryan |
| 4 | 5 | Bengali | 300.0 | 4.000% | Indo-European | Indo-Aryan |
| 5 | 6 | Portuguese | 221.0 | 2.870% | Indo-European | Romance |
| 6 | 7 | Russian | 154.0 | 2.000% | Indo-European | Balto-Slavic |
| 7 | 8 | Japanese | 128.0 | 1.662% | Japonic | Japanese |
| 8 | 9 | Western Punjabi[10] | 92.7 | 1.204% | Indo-European | Indo-Aryan |
| 9 | 10 | Marathi | 83.1 | 1.079% | Indo-European | Indo-Aryan |
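
The table still contains Wikipedia footnote markers such as `[9]` and percentages stored as text. A minimal cleaning sketch on plain strings follows; the helper names are illustrative, and the sample values are copied from the output above.

```python
import re


def strip_footnotes(text):
    """Remove bracketed footnote markers like '[9]' or '[10]'."""
    return re.sub(r"\[\d+\]", "", text).strip()


def percent_to_float(text):
    """Convert '11.922%' into the float 11.922."""
    return float(text.rstrip("%"))


languages = ["Hindi (sanskritised Hindustani)[9]", "Western Punjabi[10]", "Bengali"]
print([strip_footnotes(name) for name in languages])
print(percent_to_float("11.922%"))
```

With pandas, the same functions could be applied to a column, for example `table[1]['Language'].map(strip_footnotes)`.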


### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code

#### IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [None]:
# your code
parser = requests.get(url)

html = parser.content
html

soup = BeautifulSoup(html, 'lxml')
soup

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<style>
                body#styleguide-v2 {
                    background: no-repeat fixed center top #000;
                }
            </style>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>IMDb Top 250 - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</s

In [None]:
selected = soup.select('td.titleColumn')
selected

lst = [elem.text.strip().replace('\n', ' ') for elem in selected]
lst

m = []
for i in range(len(lst)):
    m_c= re.sub(r'\d+\.', '', lst[i])
    m.append(m_c)
m

m1 = [i.lstrip() for i in m]

m3 = []
for i in range(len(m1)):
  m2 = re.sub(r'\(\d\d\d\d\)', '', m1[i])
  m3.append(m2)
m3

movie_name = [elem.strip() for elem in m3]
movie_name

['The Shawshank Redemption',
 'The Godfather',
 'The Godfather: Part II',
 'The Dark Knight',
 '12 Angry Men',
 "Schindler's List",
 'The Lord of the Rings: The Return of the King',
 'Pulp Fiction',
 'The Good, the Bad and the Ugly',
 'The Lord of the Rings: The Fellowship of the Ring',
 'Fight Club',
 'Forrest Gump',
 'Inception',
 'The Lord of the Rings: The Two Towers',
 'Star Wars: Episode V - The Empire Strikes Back',
 'The Matrix',
 'Goodfellas',
 "One Flew Over the Cuckoo's Nest",
 'Seven Samurai',
 'Se7en',
 'The Silence of the Lambs',
 'City of God',
 'Life Is Beautiful',
 "It's a Wonderful Life",
 'Star Wars: Episode IV - A New Hope',
 'Saving Private Ryan',
 'Interstellar',
 'Spirited Away',
 'The Green Mile',
 'Parasite',
 'Léon: The Professional',
 'Hara-Kiri',
 'The Pianist',
 'The Usual Suspects',
 'Terminator 2: Judgment Day',
 'Back to the Future',
 'Psycho',
 'Modern Times',
 'The Lion King',
 'American History X',
 'City Lights',
 'Grave of the Fireflies',
 'Whiplash
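
The three clean-up passes above can be folded into one regular expression that captures rank, title and year together. This sketch assumes each cell's text has the shape "rank. title (year)" once whitespace is collapsed, which is what the substitutions above imply; `parse_row` is an illustrative name.

```python
import re

# rank, a dot, the title (as short as possible), then a 4-digit year in parentheses
ROW_PATTERN = re.compile(r"(\d+)\.\s*(.+?)\s*\((\d{4})\)")


def parse_row(text):
    """Return (rank, title, year) from text like '1.\n  The Godfather\n(1972)'."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    match = ROW_PATTERN.fullmatch(flat)
    if match is None:
        raise ValueError(f"Unexpected row format: {text!r}")
    rank, title, year = match.groups()
    return int(rank), title, int(year)


print(parse_row("1.\n      The Shawshank Redemption\n(1994)"))
print(parse_row("5.\n      12 Angry Men\n(1957)"))
```

Anchoring the rank at the start of the string keeps titles that begin with a number, such as "12 Angry Men", intact.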

In [None]:
selected = soup.select('td.titleColumn span')
selected

initial_release = [elem.text.replace('(', '').replace(')', '') for elem in selected]
initial_release


['1994',
 '1972',
 '1974',
 '2008',
 '1957',
 '1993',
 '2003',
 '1994',
 '1966',
 '2001',
 '1999',
 '1994',
 '2010',
 '2002',
 '1980',
 '1999',
 '1990',
 '1975',
 '1954',
 '1995',
 '1991',
 '2002',
 '1997',
 '1946',
 '1977',
 '1998',
 '2014',
 '2001',
 '1999',
 '2019',
 '1994',
 '1962',
 '2002',
 '1995',
 '1991',
 '1985',
 '1960',
 '1936',
 '1994',
 '1998',
 '1931',
 '1988',
 '2014',
 '2000',
 '2006',
 '2011',
 '2006',
 '1942',
 '1968',
 '1954',
 '1988',
 '1979',
 '1979',
 '2000',
 '1981',
 '1940',
 '2006',
 '2012',
 '1957',
 '1950',
 '2008',
 '1980',
 '2018',
 '1957',
 '1964',
 '2018',
 '2019',
 '1997',
 '2003',
 '2016',
 '1984',
 '2012',
 '1986',
 '2017',
 '2020',
 '2018',
 '2019',
 '1981',
 '1963',
 '1999',
 '1995',
 '2009',
 '1984',
 '1995',
 '2009',
 '1997',
 '1983',
 '1968',
 '1992',
 '1931',
 '2007',
 '1958',
 '1941',
 '1985',
 '2012',
 '2000',
 '1955',
 '1952',
 '1959',
 '2004',
 '1952',
 '1948',
 '1962',
 '1921',
 '2016',
 '1987',
 '2020',
 '1971',
 '1927',
 '1960',
 '1976',
 

In [None]:
# Attribute-presence selectors take a bare attribute name: a[title], not a["title"]
selected = soup.select('td.titleColumn a[title]')
selected


d = [elem.text for elem in selected]
d

['The Shawshank Redemption',
 'The Godfather',
 'The Godfather: Part II',
 'The Dark Knight',
 '12 Angry Men',
 "Schindler's List",
 'The Lord of the Rings: The Return of the King',
 'Pulp Fiction',
 'The Good, the Bad and the Ugly',
 'The Lord of the Rings: The Fellowship of the Ring',
 'Fight Club',
 'Forrest Gump',
 'Inception',
 'The Lord of the Rings: The Two Towers',
 'Star Wars: Episode V - The Empire Strikes Back',
 'The Matrix',
 'Goodfellas',
 "One Flew Over the Cuckoo's Nest",
 'Seven Samurai',
 'Se7en',
 'The Silence of the Lambs',
 'City of God',
 'Life Is Beautiful',
 "It's a Wonderful Life",
 'Star Wars: Episode IV - A New Hope',
 'Saving Private Ryan',
 'Interstellar',
 'Spirited Away',
 'The Green Mile',
 'Parasite',
 'Léon: The Professional',
 'Hara-Kiri',
 'The Pianist',
 'The Usual Suspects',
 'Terminator 2: Judgment Day',
 'Back to the Future',
 'Psycho',
 'Modern Times',
 'The Lion King',
 'American History X',
 'City Lights',
 'Grave of the Fireflies',
 'Whiplash

#### Movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [None]:
#your code
parser = requests.get(url)
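
For the "random movies" part, one option is to scrape the full chart first and then draw ten entries without repetition using `random.sample`. The sketch below uses placeholder titles instead of scraped data, and `pick_random` is an illustrative name; the optional seed only makes the draw repeatable.

```python
import random


def pick_random(items, k=10, seed=None):
    """Return k distinct items chosen at random; a seed makes it reproducible."""
    rng = random.Random(seed)
    return rng.sample(list(items), k)


titles = [f"Movie {n}" for n in range(1, 251)]  # placeholder titles, not scraped data
picked = pick_random(titles, k=10, seed=42)
print(picked)
```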



#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city:')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

Enter the city:houston
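
City names with spaces or accents need URL encoding, and an API key is safer outside the notebook. The sketch below builds the same query with `urllib.parse.urlencode`; `OPENWEATHER_API_KEY` is a hypothetical environment variable name and `YOUR_API_KEY` is a placeholder.

```python
import os
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key):
    """Build the request URL with properly encoded query parameters."""
    query = urlencode({"q": city, "APPID": api_key, "units": "metric"})
    return f"{BASE_URL}?{query}"


# Hypothetical variable name; falls back to a placeholder when unset.
api_key = os.environ.get("OPENWEATHER_API_KEY", "YOUR_API_KEY")
print(build_weather_url("new york", api_key))
```

With `requests`, passing the same dictionary through the `params=` argument of `requests.get` achieves the same encoding.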


In [None]:
# your code
parser = requests.get(url)


res = parser.json()
print(res)
res.keys()





{'coord': {'lon': -95.3633, 'lat': 29.7633}, 'weather': [{'id': 701, 'main': 'Mist', 'description': 'mist', 'icon': '50d'}, {'id': 501, 'main': 'Rain', 'description': 'moderate rain', 'icon': '10d'}], 'base': 'stations', 'main': {'temp': 23.82, 'feels_like': 24.42, 'temp_min': 22.38, 'temp_max': 25.08, 'pressure': 1016, 'humidity': 83}, 'visibility': 9656, 'wind': {'speed': 1.34, 'deg': 23, 'gust': 2.68}, 'rain': {'1h': 2.37}, 'clouds': {'all': 1}, 'dt': 1631558582, 'sys': {'type': 2, 'id': 2006306, 'country': 'US', 'sunrise': 1631534717, 'sunset': 1631579371}, 'timezone': -18000, 'id': 4699066, 'name': 'Houston', 'cod': 200}


dict_keys(['coord', 'weather', 'base', 'main', 'visibility', 'wind', 'rain', 'clouds', 'dt', 'sys', 'timezone', 'id', 'name', 'cod'])

In [None]:

print('Temperature:', res['main']['temp'])

Temperature: 23.82


In [None]:
print('Wind Speed:', res['wind']['speed'])

Wind Speed: 1.34


In [None]:
print('description:', res['weather'][0]['description'])

description: mist


In [None]:
print('Weather:', res['weather'][0]['main'])

Weather: Mist
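
The four lookups above can be gathered into one function that also guards against failed requests. This is a sketch under the assumption that an unsuccessful response carries a non-200 `cod` value and possibly a `message` field; the sample dictionary is a trimmed copy of the Houston response printed earlier.

```python
def summarize_weather(res):
    """Extract temperature, wind speed, description and weather from a response dict."""
    if str(res.get("cod")) != "200":
        # Assumption: error responses include a 'message' explaining the failure.
        raise ValueError(f"API error {res.get('cod')}: {res.get('message', 'no message')}")
    first = res["weather"][0]  # the response may list several conditions
    return {
        "temperature": res["main"]["temp"],
        "wind_speed": res["wind"]["speed"],
        "description": first["description"],
        "weather": first["main"],
    }


sample = {
    "weather": [{"id": 701, "main": "Mist", "description": "mist", "icon": "50d"}],
    "main": {"temp": 23.82},
    "wind": {"speed": 1.34},
    "cod": 200,
}
print(summarize_weather(sample))
```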


#### Book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
#your code
parser = requests.get(url)
html = parser.content
html

soup = BeautifulSoup(html, 'lxml')
soup

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="static

In [None]:
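# Attribute-presence selectors take a bare attribute name: a[href], not a["href"]
selected = soup.select('h3 a[href]') 
selected

# The full book name is stored in each link's title attribute
book_name = [link['title'] for link in selected]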
book_name

['A Light in the Attic',
 'Tipping the Velvet',
 'Soumission',
 'Sharp Objects',
 'Sapiens: A Brief History of Humankind',
 'The Requiem Red',
 'The Dirty Little Secrets of Getting Your Dream Job',
 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
 'The Black Maria',
 'Starving Hearts (Triangular Trade Trilogy, #1)',
 "Shakespeare's Sonnets",
 'Set Me Free',
 "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)",
 'Rip it Up and Start Again',
 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991',
 'Olio',
 'Mesaerion: The Best Science Fiction Stories 1800-1849',
 'Libertarianism for Beginners',
 "It's Only the Himalayas"]

In [None]:
selected = soup.select('div p.price_color')
selected

price = [elem.text for elem in selected]
price

['£51.77',
 '£53.74',
 '£50.10',
 '£47.82',
 '£54.23',
 '£22.65',
 '£33.34',
 '£17.93',
 '£22.60',
 '£52.15',
 '£13.99',
 '£20.66',
 '£17.46',
 '£52.29',
 '£35.02',
 '£57.25',
 '£23.88',
 '£37.59',
 '£51.33',
 '£45.17']
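
The prices are still strings with a currency symbol, which prevents sorting or arithmetic. A minimal sketch for converting them to numbers follows; `parse_price` is an illustrative name. Matching the digits with a regular expression, rather than stripping `£`, also copes with a stray leading character that can appear if the page is decoded with the wrong encoding.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"No number found in {text!r}")
    return Decimal(match.group())


prices = [parse_price(p) for p in ["£51.77", "£53.74", "Â£50.10"]]
print(prices)
```

`Decimal` keeps the two decimal places exact; `float(match.group())` would work too if exactness does not matter.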

In [None]:
selected = soup.select('div p[class="instock availability"]')
selected

stock = [elem.text.strip() for elem in selected]
stock

['In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock',
 'In stock']

In [None]:
df_book = pd.DataFrame(zip(book_name, price, stock), columns=['Book', 'price', 'stock_availability'])
df_book.head()

|   | Book | price | stock_availability |
|---|------|-------|--------------------|
| 0 | A Light in the Attic | £51.77 | In stock |
| 1 | Tipping the Velvet | £53.74 | In stock |
| 2 | Soumission | £50.10 | In stock |
| 3 | Sharp Objects | £47.82 | In stock |
| 4 | Sapiens: A Brief History of Humankind | £54.23 | In stock |
