# Web Scraping Lab

You will find in this notebook some scrapy exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the urls below and take a look at their source code through Chrome DevTools. You'll need to identify the html tags, special class names, etc used in the html content you are expected to extract.

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`,  `BeautifulSoup` and `pandas` are already imported for you. If you prefer to use additional libraries feel free to do it.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [None]:
# Exercise 1: Find out the trending developers

In [None]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [2]:
# your code here
import requests
 
url = 'https://github.com/trending/developers'
html = requests.get(url).content  
html[0:600]

b'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossori'

In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
soup

<!DOCTYPE html>
<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-d46e2b60992dc114d02a7edf55f254c4.css" integrity="sha512-1G4rYJktwRTQKn7fVfJUxH8RRZFUJlGo77xMZfBfIhZPx4BHVrzPE1VgnafttXI8G3y/PywH3uXyhNkSLp3+oA==" media="all" rel="stylesheet"/><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-848e5bda8a9313d9e37e362b7eecd7a8.css" integrity="sha512-hI5b2oqTE9njfjYrfuzXqA4bSGSNrE5OMc9I

In [4]:
tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p']
text = [element.text for element in soup.find_all(tags)]
text

['Trending',
 '\n      These are the developers building the hot tools today.\n    ',
 '\n\n            nick black\n ',
 '\n\n              dankamongmen\n ',
 '\n\n\n\n\n      notcurses\n ',
 '\n\n            Rick\n ',
 '\n\n              LinuxSuRen\n ',
 '\n\n\n\n\n      remote-jobs-in-china\n ',
 '\n\n            David Pedersen\n ',
 '\n\n              davidpdrsn\n ',
 '\n\n\n\n\n      todo-or-die\n ',
 '\n\n            Seth Vargo\n ',
 '\n\n              sethvargo\n ',
 '\n\n\n\n\n      go-password\n ',
 '\n\n            PySimpleGUI\n ',
 '\n\n              PySimpleGUI\n ',
 '\n\n\n\n\n      PySimpleGUI\n ',
 '\n\n            Anthony Fu\n ',
 '\n\n              antfu\n ',
 '\n\n\n\n\n      unconfig\n ',
 '\n\n            Adrienne Walker\n ',
 '\n\n              quisquous\n ',
 '\n\n\n\n\n      cactbot\n ',
 '\n\n            David Tolnay\n ',
 '\n\n              dtolnay\n ',
 '\n\n\n\n\n      efg\n ',
 '\n\n            Damodar Lohani\n ',
 '\n\n              lohanidamodar\n ',
 '\n\n

In [12]:
print('\n'.join(text))

Trending

      These are the developers building the hot tools today.
    


            nick black
 


              dankamongmen
 





      notcurses
 


            Rick
 


              LinuxSuRen
 





      remote-jobs-in-china
 


            David Pedersen
 


              davidpdrsn
 





      todo-or-die
 


            Seth Vargo
 


              sethvargo
 





      go-password
 


            PySimpleGUI
 


              PySimpleGUI
 





      PySimpleGUI
 


            Anthony Fu
 


              antfu
 





      unconfig
 


            Adrienne Walker
 


              quisquous
 





      cactbot
 


            David Tolnay
 


              dtolnay
 





      efg
 


            Damodar Lohani
 


              lohanidamodar
 





      flutter_ui_challenges
 


            Ariel Mashraki
 


              a8m
 





      golang-cheat-sheet
 


            Fons van der Plas
 


              fonsp
 





      Pluto.jl
 


            Philipp 

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```

In [None]:
# Exercise 2: Find out the repository names instead of developer names

#### Display the trending Python repositories in GitHub.

The steps to solve this problem is similar to the previous one except that you need to find out the repository names instead of developer names.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [13]:
import requests
 
url = 'https://github.com/trending/python?since=daily'
html = requests.get(url).content # as one sentence or split it into requests.get(url) and requests.content
html[0:600]

b'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossori'

In [14]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
soup

<!DOCTYPE html>
<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-d46e2b60992dc114d02a7edf55f254c4.css" integrity="sha512-1G4rYJktwRTQKn7fVfJUxH8RRZFUJlGo77xMZfBfIhZPx4BHVrzPE1VgnafttXI8G3y/PywH3uXyhNkSLp3+oA==" media="all" rel="stylesheet"/><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-848e5bda8a9313d9e37e362b7eecd7a8.css" integrity="sha512-hI5b2oqTE9njfjYrfuzXqA4bSGSNrE5OMc9I

In [63]:
tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p']
text = [element.text for element in soup.find_all(tags)]
text

['Real Time Seismicity', '\xa0']

In [16]:
print('\n'.join(text))

Trending

      See what the GitHub community is most excited about today.
    






        commaai /

      openpilot
 

      openpilot is an open source driver assistance system. openpilot performs the functions of Automated Lane Centering and Adaptive Cruise Control for over 150 supported car makes and models.
    






        public-apis /

      public-apis
 

      A collective list of free APIs
    






        donnemartin /

      system-design-primer
 

      Learn how to design large-scale systems. Prep for the system design interview. Includes Anki flashcards.
    






        Pycord-Development /

      pycord
 

      Pycord, a maintained fork of discord.py, is a python wrapper for the Discord API
    






        edeng23 /

      binance-trade-bot
 

      Automated cryptocurrency trading bot
    






        open-mmlab /

      mmtracking
 

      OpenMMLab Video Perception Toolbox. It supports Video Object Detection (VID), Multiple Object Tracking (MOT), Si

#### Display all the image links from Walt Disney wikipedia page.

In [None]:
# Exercise 3: Display all the image links from Walt Disney Wikipedia page

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [65]:
import requests
import re
url = 'https://en.wikipedia.org/wiki/Walt_Disney'
html = requests.get(url).content # as one sentence or split it into requests.get(url) and requests.content
html[0:600]

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Walt Disney - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"4cc48875-23c7-4fc0-9536-4ad47b731ee2","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"'

In [67]:
from bs4 import BeautifulSoup
bs = BeautifulSoup(html, 'html.parser')


In [68]:
images = bs.find_all('img', {'src':re.compile('.jpg')})
for image in images: 
    print(image['src']+'\n')

//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg

//upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/5/57/Walt_Disney_1935.jpg/170px-Walt_Disney_1935.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/220px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Disney_drawing_goofy.jpg/170px-Disney_drawing_goofy.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/1/13/DisneySchiphol1951.jpg/220px-DisneySchiphol1951.jpg

//upload.wikimedia.org/w

#### Retrieve an arbitary Wikipedia page of "Python" and create a list of links on that page.

In [None]:
# Exercise 4: Retrieve an arbitary page and create a list of links

In [None]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [69]:
import requests
 
url = 'https://en.wikipedia.org/wiki/Python'
html = requests.get(url).content  
html[0:600]

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Python - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"a008cbd7-45f0-45f4-8d2d-610afc06010a","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPag'

In [70]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://en.wikipedia.org/wiki/Python")
bsObj = BeautifulSoup(html)
for link in bsObj.findAll("a"):
  if 'href' in link.attrs:
    print(link.attrs['href'])

#mw-head
#searchInput
https://en.wiktionary.org/wiki/Python
https://en.wiktionary.org/wiki/python
/wiki/Pythonidae
/wiki/Python_(genus)
#Computing
#People
#Roller_coasters
#Vehicles
#Weaponry
#Other_uses
#See_also
/w/index.php?title=Python&action=edit&section=1
/wiki/Python_(programming_language)
/wiki/CMU_Common_Lisp
/wiki/PERQ#PERQ_3
/w/index.php?title=Python&action=edit&section=2
/wiki/Python_of_Aenus
/wiki/Python_(painter)
/wiki/Python_of_Byzantium
/wiki/Python_of_Catana
/wiki/Python_Anghelo
/w/index.php?title=Python&action=edit&section=3
/wiki/Python_(Efteling)
/wiki/Python_(Busch_Gardens_Tampa_Bay)
/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)
/w/index.php?title=Python&action=edit&section=4
/wiki/Python_(automobile_maker)
/wiki/Python_(Ford_prototype)
/w/index.php?title=Python&action=edit&section=5
/wiki/Python_(missile)
/wiki/Python_(nuclear_primary)
/wiki/Colt_Python
/w/index.php?title=Python&action=edit&section=6
/wiki/PYTHON
/wiki/Python_(film)
/wiki/Python_(mythology)
/wiki/

#### Find the number of titles that have changed in the United States Code since its last release point.

In [None]:
# Exercise 5: Find the number of titles that have changed in the US code 

In [None]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [36]:
import requests
 
url = 'http://uscode.house.gov/download/download.shtml'
html = requests.get(url).content  
html[0:600]

b'<?xml version=\'1.0\' encoding=\'UTF-8\' ?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml"><head>\n        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n        <meta http-equiv="X-UA-Compatible" content="IE=8" />\n        <meta http-equiv="pragma" content="no-cache" /><!-- HTTP 1.0 -->\n        <meta http-equiv="cache-control" content="no-cache,must-revalidate" /><!-- HTTP 1.1 -->\n        <meta http-equiv="expires" content="0" />\n        <link rel="shortcut ic'

In [78]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
type(soup)

bs4.BeautifulSoup

In [82]:
changed_titles = html_soup.find_all('div class' == 'usctitlechanged')
print(type(changed_titles))
print(len(changed_titles))

<class 'bs4.element.ResultSet'>
0


#### Find a Python list with the top ten FBI's Most Wanted names.

In [None]:
# Exercise 6: Find a Python list of the top 10 FBI's Wanted names

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [40]:
import requests
 
url = 'https://www.fbi.gov/wanted/topten'
html = requests.get(url).content  
html[0:600]

b'<!DOCTYPE html>\n<html lang="en" data-gridsystem="bs3">\n<head>\n<meta charset="utf-8">\n<meta http-equiv="x-ua-compatible" content="ie=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1.0">\n<link rel="canonical" href="https://www.fbi.gov/wanted/topten"><title>Ten Most Wanted Fugitives &#8212; FBI</title>\n<meta name="DC.subject" content="Wanted by the FBI, Top Ten Most Wanted, Ten Most Wanted Fugitives, Top Ten Fugitives, Top Ten, Historical Ten Most Wanted">\n<meta name="DC.description" content="The FBI is offering rewards for information leading to the apprehension of the T'

In [41]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
soup

<!DOCTYPE html>
<html data-gridsystem="bs3" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="https://www.fbi.gov/wanted/topten" rel="canonical"/><title>Ten Most Wanted Fugitives — FBI</title>
<meta content="Wanted by the FBI, Top Ten Most Wanted, Ten Most Wanted Fugitives, Top Ten Fugitives, Top Ten, Historical Ten Most Wanted" name="DC.subject"/>
<meta content="The FBI is offering rewards for information leading to the apprehension of the Ten Most Wanted Fugitives. Select the images of suspects to display more information." name="DC.description"/>
<meta content="2010-07-16T15:30:00+00:00" name="DC.date.created"/>
<meta content="2021-12-02T12:30:32+00:00" name="DC.date.modified"/>
<meta content="text/plain" name="DC.format"/>
<meta content="The FBI is offering rewards for information leading to the apprehension of the Ten Most Wanted Fugitives. Select the 

In [42]:
tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p']
text = [element.text for element in soup.find_all(tags)]
text

['\nOfficial websites use .gov\n',
 '\nA .gov website belongs to an official government organization in the United States.\n',
 '\nSecure .gov websites use HTTPS\n',
 "\nA lock () or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.\n",
 'Ten Most Wanted Fugitives',
 'Most Wanted',
 'Ten Most Wanted Fugitives',
 "Notice: The official FBI Ten Most Wanted Fugitives list is maintained on the FBI website. This information may be copied and distributed, however, any unauthorized alteration of any portion of the FBI's Ten Most Wanted Fugitives posters is a violation of federal law (18 U.S.C., Section 709). Persons who make or reproduce these alterations are subject to prosecution and, if convicted, shall be fined or imprisoned for not more than one year, or both.",
 'Listing',
 'Results: 10 Items',
 '\nJOSE RODOLFO VILLARREAL-HERNANDEZ\n',
 '\nOCTAVIANO JUAREZ-CORRO\n',
 '\nRAFAEL CARO-QUINTERO\n',
 '\nYULAN ADONAY ARCH

In [43]:
print('\n'.join(text))

# print(data['items'][0]['title'])


Official websites use .gov


A .gov website belongs to an official government organization in the United States.


Secure .gov websites use HTTPS


A lock () or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.

Ten Most Wanted Fugitives
Most Wanted
Ten Most Wanted Fugitives
Notice: The official FBI Ten Most Wanted Fugitives list is maintained on the FBI website. This information may be copied and distributed, however, any unauthorized alteration of any portion of the FBI's Ten Most Wanted Fugitives posters is a violation of federal law (18 U.S.C., Section 709). Persons who make or reproduce these alterations are subject to prosecution and, if convicted, shall be fined or imprisoned for not more than one year, or both.
Listing
Results: 10 Items

JOSE RODOLFO VILLARREAL-HERNANDEZ


OCTAVIANO JUAREZ-CORRO


RAFAEL CARO-QUINTERO


YULAN ADONAY ARCHAGA CARIAS


EUGENE PALMER


BHADRESHKUMAR CHETANBHAI PATEL


ALEJAND

####  Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe.

In [4]:
# Exercise 6: Display the 20 latest earthquakes info (Pandas dataframe)
import pandas as pd

In [5]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [7]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
soup

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/">
<head><meta content="srFzNKBTd0FbRhtnzP--Tjxl01NfbscjYwkp4yOWuQY" name="google-site-verification"/><meta content="BCAA3C04C41AE6E6AFAF117B9469C66F" name="msvalidate.01"/><meta content="43b36314ccb77957" name="y_key"/><!-- 5-Clk8f50tFFdPTU97Bw7ygWE1A -->
<meta content="en" http-equiv="Content-Language"/><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="all" name="robots"/>
<meta content="earthquake,earthquakes,last earthquake,earthquake today,earthquakes today,earth quake,earth quakes,real time seismicity,seismic,seismicity,seismicity map,seismology,sismologie,EMSC,CSEM,seismicity on google earth,sumatra,tsunami,tsunamis,map,maps,richter,mercalli,moment tensors,epicenter,magnitude,seismology,foreshock,aftershock,tremor" name="keywo

In [10]:
import requests
html = requests.get(url).content
soup = BeautifulSoup(html)


In [60]:
tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p']
text = [element.text for element in soup.find_all(tags)]
text

['Real Time Seismicity', '\xa0']

In [11]:
table_head=soup.find("thead")
#classlist = ['tabev6', 'tabev1', 'tabev2', 'tb_region']

cols = [colnames.text for colnames in table_head.find_all("th")]
cols
#but we actually don't need these specific columns, because we're going to split date and time

cols2 = ["DATE", "TIME", "LATITUDE", "LONGITUDE", "REGION NAME"]

In [13]:
table_body = soup.find("tbody", id="tbody")
table_time = [times.text.strip().split() for times in soup.select("tbody td.tabev6 a")][:20]
dates = [dates[0] for dates in table_time]
times = [times[1] for times in table_time]
times

['02:58:46.3',
 '02:54:54.0',
 '02:51:59.0',
 '02:41:43.0',
 '02:38:36.4',
 '02:28:12.0',
 '02:27:53.0',
 '02:22:58.0',
 '02:21:49.2',
 '02:20:40.9',
 '02:12:51.2',
 '02:09:06.1',
 '02:03:10.4',
 '01:50:44.6',
 '01:50:30.2',
 '01:47:09.4',
 '01:46:13.0',
 '01:46:02.1',
 '01:44:32.0',
 '01:43:42.3']

In [15]:
table_region = [reg.text.strip() for reg in table_body.find_all("td", class_="tb_region")][:20]
table_region

['NORTHERN ITALY',
 'SUMBAWA REGION, INDONESIA',
 'ICELAND REGION',
 'PHILIPPINE ISLANDS REGION',
 'PUERTO RICO REGION',
 'WESTERN TURKEY',
 'SOUTHERN SUMATRA, INDONESIA',
 'CANARY ISLANDS, SPAIN REGION',
 'VANCOUVER ISLAND, CANADA REGION',
 'NEW IRELAND REGION, P.N.G.',
 'CHIAPAS, MEXICO',
 'JUJUY, ARGENTINA',
 'SWITZERLAND',
 'SOUTH OF JAVA, INDONESIA',
 'SOUTH OF ALASKA',
 'CANARY ISLANDS, SPAIN REGION',
 'OAXACA, MEXICO',
 'CANARY ISLANDS, SPAIN REGION',
 'KARNATAKA, INDIA',
 'ISLAND OF HAWAII, HAWAII']

In [16]:
table_coords = [coords.text.strip() for coords in table_body.find_all("td", class_="tabev1")][:40]
table_latituds = [table_coords[i] for i in range(0,len(table_coords),2)] #getting the latituds 
table_long = [table_coords[j] for j in range(1,len(table_coords),2)]

In [17]:
table_card = [cards.text.strip() for cards in table_body.find_all("td", class_="tabev2")][:60]
table_card2 = [i for i in table_card if i in ["N","S","W","E"]]   
table_card_latituds =[table_card2[i] for i in range(0,len(table_card2),2)]
table_card_longituds = [table_card2[i] for i in range(1,len(table_card2),2)]

In [18]:
latitudes = zip(table_latituds,table_card_latituds)  
longitudes = zip(table_long, table_card_longituds)
latitudes = [" ".join(list(lats)) for lats in latitudes]
longitudes = [" ".join(list(longs)) for longs in longitudes]

In [20]:
df = pd.DataFrame({'DATE': dates, 'TIME': times, 'LATITUDE':latitudes, 'LONGITUDE':longitudes, 'REGION':table_region})
df

Unnamed: 0,DATE,TIME,LATITUDE,LONGITUDE,REGION
0,2021-12-22,02:58:46.3,44.43 N,7.61 E,NORTHERN ITALY
1,2021-12-22,02:54:54.0,8.37 S,117.51 E,"SUMBAWA REGION, INDONESIA"
2,2021-12-22,02:51:59.0,63.90 N,22.27 W,ICELAND REGION
3,2021-12-22,02:41:43.0,10.35 N,126.90 E,PHILIPPINE ISLANDS REGION
4,2021-12-22,02:38:36.4,17.95 N,67.10 W,PUERTO RICO REGION
5,2021-12-22,02:28:12.0,39.83 N,29.14 E,WESTERN TURKEY
6,2021-12-22,02:27:53.0,3.26 S,102.98 E,"SOUTHERN SUMATRA, INDONESIA"
7,2021-12-22,02:22:58.0,28.56 N,17.83 W,"CANARY ISLANDS, SPAIN REGION"
8,2021-12-22,02:21:49.2,48.91 N,123.35 W,"VANCOUVER ISLAND, CANADA REGION"
9,2021-12-22,02:20:40.9,5.59 S,153.94 E,"NEW IRELAND REGION, P.N.G."


#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a twitter account. You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account.

In [None]:
Exercise 7: Count the number of tweets by a given Twitter account

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [84]:
from bs4 import BeautifulSoup
import requests

handle = input('Input your account name on Twitter: ')
temp = requests.get('https://twitter.com/'+handle)
bs = BeautifulSoup(temp.text,'lxml')

try:
    tweet_box = bs.find('li',{'class':'ProfileNav-item ProfileNav-item--tweets is-active'})
    tweets= tweet_box.find('a').find('span',{'class':'ProfileNav-value'})
    print("{} tweets {} number of tweets.".format(handle,tweets.get('data-count')))

except:
    print('Account name not found...')

Input your account name on Twitter: nadimsaad
Account name not found...


#### Number of followers of a given twitter account
Ask the user for the handle (@handle) of a twitter account. You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the followers for any provided account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [109]:
from bs4 import BeautifulSoup
import requests
handle = input('Input your account name on Twitter: ') 
temp = requests.get('https://twitter.com/'+handle)
print(temp)
bs = BeautifulSoup(temp.text,'lxml')

follow_box = bs.find('li',{'class':'ProfileNav-item ProfileNav-item--followers'})
followers = follow_box.find('a').find('span',{'class':'ProfileNav-value'})
print("Number of followers: {} ".format(followers.get('data-count')))


Input your account name on Twitter: saadnadim
<Response [200]>


AttributeError: 'NoneType' object has no attribute 'find'

#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [87]:

  from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.wikipedia.org/')
bs = BeautifulSoup(html, "html.parser")
nameList = bs.findAll('a', {'class' : 'link-box'})
for name in nameList:
      print(name.get_text())
  


English
6 383 000+ articles


日本語
1 292 000+ 記事


Русский
1 756 000+ статей


Deutsch
2 617 000+ Artikel


Español
1 717 000+ artículos


Français
2 362 000+ articles


中文
1 231 000+ 條目


Italiano
1 718 000+ voci


Português
1 074 000+ artigos


Polski
1 490 000+ haseł



#### A list with the different kind of datasets available in data.gov.uk.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [92]:
import re
import urllib

import rdflib

# set the user agent so that data-gov-uk will know who we are
rdflib.parser.headers = {"User-agent": "data-gov-uk-harvester <https://github.com/edsu/data-gov-uk-harvester>"}

graph = rdflib.ConjunctiveGraph('Sleepycat')
graph.open('store', create=True)

# the paged package listing that we will crawl to discover dataset urls
page_url_tmpl = "http://data.gov.uk/search/apachesolr_search/?filters=type:ckan_package&page=%s"
page = 0

# extract rdf from each dataset html/rdfa page
while True:
    page += 1
    page_url = page_url_tmpl % page
    print ("fetching list of datasets from %s" % page_url)
    html = urllib.urlopen(page_url).read()

    found = 0
    for dataset_url in re.findall(r'"(http://data.gov.uk/dataset/.*?)"', html):
        found += 1
        print ("fetching dataset %s" % dataset_url)
        try:
            graph.parse(location=dataset_url, format='rdfa', lax=True)
        #except Exception, e:
         #   print e

    if found == 0:
         break

# no sense in keeping tons of css stylesheet assertions is there?
for t in graph.triples((None, rdflib.URIRef('http://www.w3.org/1999/xhtml/vocab#stylesheet'), None)):
    graph.remove(t)

graph.serialize(open('data.rdf', 'w'))
graph.close()

IndentationError: unexpected unindent (<ipython-input-92-5b4e56c5a824>, line 36)

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [None]:
# your code here

## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [110]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
from tqdm import tqdm

#the url of the first page
url = "https://www.imdb.com/chart/top"

def all_page_link(start_url):
    all_urls = []
    url = start_url
    while(url != None):            #Loop around all the required webpages and terminates when last page arive!
        all_urls.append(url)
        soup = BeautifulSoup(requests.get(url).text,"html.parser")
        next_links = soup.find_all(class_='flat-button lister-page-next next-page')    #Extracts the next page link.
        if (len(next_links) == 0):         # If their is no next page, it returns 0.
            url = None
        else:
            next_page = "https://www.imdb.com" + next_links[0].get('href')
            url = next_page
    return all_urls


In [111]:
def director_and_actor(Director_and_star):
    Director_and_star =  Director_and_star.replace("\n","")
    Director_and_star = Director_and_star.replace("|","")
    Director_and_star = Director_and_star.split("Stars:")
    Director_and_star[0] = Director_and_star[0].replace("Director:","")
    Director_and_star[0] = Director_and_star[0].replace("Directors:","")
    for i in range(10):
        Director_and_star[0]=Director_and_star[0].replace("  "," ")
    director = Director_and_star[0][1:]
    stars = Director_and_star[1]
    stars = stars.replace(":","")
    return director,stars

In [101]:
def votes_and_gross_conveter(votes_and_gross):
    votes_and_gross_list = []
    for i in votes_and_gross:
        votes_and_gross_list.append(i.text)
    if(len(votes_and_gross)==2):
        votes=votes_and_gross_list[0]
        gross = votes_and_gross_list[1]
    else:
        votes=votes_and_gross_list[0]
        gross = None
    
    return votes,gross

In [112]:
main_array = []
for url in tqdm(all_page_link("https://www.imdb.com/list/ls068082370/")):     #Runs the function for all the pages.
    soup = BeautifulSoup(requests.get(url).text,"html.parser")         #Extracts out the main html code.
    for link in soup.find_all(class_='lister-item-content'):
        id = int(link.find('span',{"class":"lister-item-index unbold text-primary"}).text[:-1])
        name = link.find('a').text
        year = link.find('span',{"class":"lister-item-year text-muted unbold"}).text[1:5]
        run_time = link.find('span',{"class":"runtime"}).text
        genre = link.find('span',{"class":"genre"}).text[1:]
        rating = link.find('span',{"class":"ipl-rating-star__rating"}).text
        about = link.find_all('p')[1].text[5:]
        director,actors = director_and_actor(link.find_all('p',{"class":"text-muted text-small"})[1].text)
        votes, gross = votes_and_gross_conveter(link.find_all('span',{"name":"nv"}))
        votes = int(votes.replace(",",""))
        list_of_all = [id,name,year,run_time,genre,rating,about,director,actors,votes,gross]
        main_array.append(list_of_all)

100%|██████████| 3/3 [00:13<00:00,  4.42s/it]


In [115]:
#this index variable contains the name of the columns of the data frame.
index = ["id","name","year","run_time","genre","rating","about","director","actors","votes","gross"]

In [116]:
df = pd.DataFrame(main_array,columns=index)   #creating the DataFrame using "main_array"

In [104]:
df.head(5)

Unnamed: 0,id,name,year,run_time,genre,rating,about,director,actors,votes,gross
0,1,Les Évadés,1994,142 min,Drama,9.3,"imprisoned men bond over a number of years, fi...",Frank Darabont,"Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",2501180,$28.34M
1,2,Le Parrain,1972,175 min,"Crime, Drama",9.2,"Godfather follows Vito Corleone, Don of the Co...",Francis Ford Coppola,"Marlon Brando, Al Pacino, James Caan, Diane Ke...",1724412,$134.97M
2,3,The Dark Knight : Le Chevalier noir,2008,152 min,"Action, Crime, Drama",9.0,the menace known as the Joker wreaks havoc an...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",2451090,$534.86M
3,4,"Le Parrain, 2ᵉ partie",1974,202 min,"Crime, Drama",9.0,early life and career of Vito Corleone in 1920...,Francis Ford Coppola,"Al Pacino, Robert De Niro, Robert Duvall, Dian...",1196634,$57.30M
4,5,Pulp Fiction,1994,154 min,"Crime, Drama",8.9,"lives of two mob hitmen, a boxer, a gangster a...",Quentin Tarantino,"John Travolta, Uma Thurman, Samuel L. Jackson,...",1930384,$107.93M


In [105]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        250 non-null    int64 
 1   name      250 non-null    object
 2   year      250 non-null    object
 3   run_time  250 non-null    object
 4   genre     250 non-null    object
 5   rating    250 non-null    object
 6   about     250 non-null    object
 7   director  250 non-null    object
 8   actors    250 non-null    object
 9   votes     250 non-null    int64 
 10  gross     224 non-null    object
dtypes: int64(2), object(9)
memory usage: 21.6+ KB


#### Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [None]:
# your code here

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

In [None]:
# your code here

#### Find the book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
# your code here