# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each URL and inspect its source with Chrome DevTools. You'll need to identify the HTML tags, class names, and other attributes used for the content you are expected to extract.

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)
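
The first tip (checking the status code) can be wrapped in a small helper. The sketch below is one possible way to do it and is not part of the original lab: `check_response` is a hypothetical name, and it assumes a response object with `status_code`, `url`, and `text` attributes, which is what `requests.get` returns.

```python
from http import HTTPStatus


def check_response(response):
    """Return the response body if the request succeeded, otherwise raise.

    `response` is assumed to expose `status_code`, `url`, and `text`,
    as a `requests.Response` does. The function name is hypothetical.
    """
    if response.status_code != HTTPStatus.OK:
        raise RuntimeError(
            f"Unexpected status {response.status_code} for {response.url}"
        )
    return response.text


# Typical use inside the notebook (requires network access):
# html = check_response(requests.get(url, timeout=10))
```

Raising early keeps an error page from being parsed as if it were the intended content.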

#### Below are the libraries and modules you may need. `requests`,  `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries feel free to uncomment them.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
# from urllib.request import urlopen
# import random
# import re
# import scrapy

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
requests.get('https://github.com/trending/developers')

<Response [200]>

In [4]:
html=requests.get(url).content
html

b'\n\n\n\n\n\n<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://avatars3.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-UDS3MR1FfvqHmqZAs2MWSDCWPwLemVRLqCwld4/zfwH0vhv7I6RYmDnMnNAVQKP1YYvqnccOCH4iOhFaUUyrjw==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-2e9090135c22aad5f56c2f72dcba7880.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-p4eUPemTc/4dlxCrmhH7lQDBSMyxvSF/8JUgk1+wawzib+okmfn1cNuyi

In [5]:
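# Name the parser explicitly so the result does not depend on which parsers are installed
soup=BeautifulSoup(html, 'html.parser')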
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-2e9090135c22aad5f56c2f72dcba7880.css" integrity="sha512-UDS3MR1FfvqHmqZAs2MWSDCWPwLemVRLqCwld4/zfwH0vhv7I6RYmDnMnNAVQKP1YYvqnccOCH4iOhFaUUyrjw==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-789f905d50a214e0c8606578148aa830.css" integrity="sha512-p4eUPemTc/4dlxCrmhH7lQDBSMyxvS

In [6]:
soup.find_all(class_="col-md-6")

[<div class="col-md-6">
 <h1 class="h3 lh-condensed"><a href="/sloria">Steven Loria</a></h1>
 <p class="f4 text-normal mb-1">
 <a class="link-gray" href="/sloria">sloria</a>
 </p>
 </div>, <div class="col-md-6">
 <div class="mt-2 mb-3 my-md-0">
 <article>
 <div class="f6 text-gray text-uppercase mb-1"><svg aria-hidden="true" class="octicon octicon-flame text-orange-light mr-1" height="16" version="1.1" viewbox="0 0 12 16" width="12"><path d="M5.05.31c.81 2.17.41 3.38-.52 4.31C3.55 5.67 1.98 6.45.9 7.98c-1.45 2.05-1.7 6.53 3.53 7.7-2.2-1.16-2.67-4.52-.3-6.61-.61 2.03.53 3.33 1.94 2.86 1.39-.47 2.3.53 2.27 1.67-.02.78-.31 1.44-1.13 1.81 3.42-.59 4.78-3.42 4.78-5.56 0-2.84-2.53-3.22-1.25-5.61-1.52.13-2.03 1.13-1.89 2.75.09 1.08-1.02 1.8-1.86 1.33-.67-.41-.66-1.19-.06-1.78C8.18 5.31 8.68 2.45 5.05.32L5.03.3l.02.01z" fill-rule="evenodd"></path></svg>Popular repo</div>
 <h1 class="h4 lh-condensed">
 <a class="css-truncate css-truncate-target" data-ga-click="Explore, go to repository, locatio

In [31]:
tags1=soup.find_all("h1",{'class':"h3 lh-condensed"})
tags1

[<h1 class="h3 lh-condensed"><a href="/sloria">Steven Loria</a></h1>,
 <h1 class="h3 lh-condensed"><a href="/miekg">Miek Gieben</a></h1>,
 <h1 class="h3 lh-condensed"><a href="/frenck">Franck Nijhof</a></h1>,
 <h1 class="h3 lh-condensed"><a href="/hrydgard">Henrik Rydgård</a></h1>,
 <h1 class="h3 lh-condensed"><a href="/borkdude">Michiel Borkent</a></h1>,
 <h1 class="h3 lh-condensed"><a href="/adamchainz">Adam Johnson</a></h1>,
 <h1 class="h3 lh-condensed"><a href="/dagar">Daniel Agar</a></h1>,
 <h1 class="h3 lh-condensed"><a href="/stevenjoezhang">ᴍɪᴍɪ</a></h1>,
 <h1 class="h3 lh-condensed"><a href="/skade">Florian Gilcher</a></h1>,
 <h1 class="h3 lh-condensed"><a href="/muukii">Hiroshi Kimura</a></h1>,
 <h1 class="h3 lh-condensed"><a href="/radare">radare</a></h1>,
 <h1 class="h3 lh-condensed"><a href="/ericniebler">Eric Niebler</a></h1>,
 <h1 class="h3 lh-condensed"><a href="/tazjin">Vincent Ambo</a></h1>,
 <h1 class="h3 lh-condensed"><a href="/rexim">Alexey Kutepov</a></h1>,
 <h1 c

In [33]:
tags=soup.find_all("p",{'class':"f4 text-normal mb-1"})
tags

[<p class="f4 text-normal mb-1">
 <a class="link-gray" href="/sloria">sloria</a>
 </p>, <p class="f4 text-normal mb-1">
 <a class="link-gray" href="/miekg">miekg</a>
 </p>, <p class="f4 text-normal mb-1">
 <a class="link-gray" href="/frenck">frenck</a>
 </p>, <p class="f4 text-normal mb-1">
 <a class="link-gray" href="/hrydgard">hrydgard</a>
 </p>, <p class="f4 text-normal mb-1">
 <a class="link-gray" href="/borkdude">borkdude</a>
 </p>, <p class="f4 text-normal mb-1">
 <a class="link-gray" href="/adamchainz">adamchainz</a>
 </p>, <p class="f4 text-normal mb-1">
 <a class="link-gray" href="/dagar">dagar</a>
 </p>, <p class="f4 text-normal mb-1">
 <a class="link-gray" href="/stevenjoezhang">stevenjoezhang</a>
 </p>, <p class="f4 text-normal mb-1">
 <a class="link-gray" href="/skade">skade</a>
 </p>, <p class="f4 text-normal mb-1">
 <a class="link-gray" href="/muukii">muukii</a>
 </p>, <p class="f4 text-normal mb-1">
 <a class="link-gray" href="/radare">radare</a>
 </p>, <p class="f4 tex

In [61]:
text=[element.text for element in tags1]
print(text)

['Steven Loria', 'Miek Gieben', 'Franck Nijhof', 'Henrik Rydgård', 'Michiel Borkent', 'Adam Johnson', 'Daniel Agar', 'ᴍɪᴍɪ', 'Florian Gilcher', 'Hiroshi Kimura', 'radare', 'Eric Niebler', 'Vincent Ambo', 'Alexey Kutepov', 'James Newton-King', 'Norman Maurer', 'Kevin Papst', 'Nikita Sobolev', 'Ken’ichiro Oyama', 'Rich Harris', 'Janko Marohnić', 'Michael Grupp', 'Arvid Norberg', 'David Tolnay', 'William Chargin']


In [40]:
text1=[element.text.strip() for element in tags]
print(text1)

['sloria', 'miekg', 'frenck', 'hrydgard', 'borkdude', 'adamchainz', 'dagar', 'stevenjoezhang', 'skade', 'muukii', 'radare', 'ericniebler', 'tazjin', 'rexim', 'JamesNK', 'normanmaurer', 'kevinpapst', 'sobolevn', 'k1LoW', 'Rich-Harris', 'janko', 'MichaelGrupp', 'arvidn', 'dtolnay', 'wchargin']


In [79]:
# Build a flat list of "Name (handle)" strings, pairing each name with its handle
arr = [f"{name} ({handle})" for name, handle in zip(text, text1)]

In [80]:
arr

['Steven Loria (sloria)',
 'Miek Gieben (miekg)',
 'Franck Nijhof (frenck)',
 'Henrik Rydgård (hrydgard)',
 'Michiel Borkent (borkdude)',
 'Adam Johnson (adamchainz)',
 'Daniel Agar (dagar)',
 'ᴍɪᴍɪ (stevenjoezhang)',
 'Florian Gilcher (skade)',
 'Hiroshi Kimura (muukii)',
 'radare (radare)',
 'Eric Niebler (ericniebler)',
 'Vincent Ambo (tazjin)',
 'Alexey Kutepov (rexim)',
 'James Newton-King (JamesNK)',
 'Norman Maurer (normanmaurer)',
 'Kevin Papst (kevinpapst)',
 'Nikita Sobolev (sobolevn)',
 'Ken’ichiro Oyama (k1LoW)',
 'Rich Harris (Rich-Harris)',
 'Janko Marohnić (janko)',
 'Michael Grupp (MichaelGrupp)',
 'Arvid Norberg (arvidn)',
 'David Tolnay (dtolnay)',
 'William Chargin (wchargin)']

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. No name should contain any HTML tags.

**Instructions:**

1. Find the HTML tag and class names used for the developer names. You can do this with Chrome DevTools.

1. Use BeautifulSoup to extract all the HTML elements that contain the developer names.

1. Use string manipulation to strip whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Store the clean names in a list.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
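
The instructions above can be sketched on a small inline sample, so the example runs without a network request. The markup is a trimmed copy of what the Trending Developers page returned in this notebook; GitHub's class names change over time, so treat the selectors as assumptions. `clean` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Trimmed sample of the markup seen in the notebook output above.
SAMPLE_HTML = """
<div class="col-md-6">
<h1 class="h3 lh-condensed"><a href="/sloria">Steven Loria</a></h1>
<p class="f4 text-normal mb-1">
<a class="link-gray" href="/sloria">sloria</a>
</p>
</div>
<div class="col-md-6">
<h1 class="h3 lh-condensed"><a href="/miekg">Miek Gieben</a></h1>
<p class="f4 text-normal mb-1">
<a class="link-gray" href="/miekg">miekg</a>
</p>
</div>
"""


def clean(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return " ".join(text.split())


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS selectors match on individual classes, regardless of their order.
names = [clean(h1.get_text()) for h1 in soup.select("h1.h3.lh-condensed")]
handles = [clean(p.get_text()) for p in soup.select("p.f4.text-normal.mb-1")]

developers = [f"{name} ({handle})" for name, handle in zip(names, handles)]
print(developers)
```

Using `zip` pairs each name with its handle and stops at the shorter list, which avoids an `IndexError` if the two selections ever differ in length.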

In [None]:
#your code

#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to extract the repository names instead of the developer names.

In [92]:
# This is the url you will scrape in this exercise
url1 = 'https://github.com/trending/python?since=daily'

In [98]:
requests.get(url1)

<Response [200]>

In [99]:
html1=requests.get(url1).content
soup1=BeautifulSoup(html1, 'html.parser')

In [100]:
soup1

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-2e9090135c22aad5f56c2f72dcba7880.css" integrity="sha512-UDS3MR1FfvqHmqZAs2MWSDCWPwLemVRLqCwld4/zfwH0vhv7I6RYmDnMnNAVQKP1YYvqnccOCH4iOhFaUUyrjw==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-789f905d50a214e0c8606578148aa830.css" integrity="sha512-p4eUPemTc/4dlxCrmhH7lQDBSMyxvS

In [109]:
repo1=soup1.find_all(class_="h3 lh-condensed")
repo1

[<h1 class="h3 lh-condensed">
 <a href="/wifiphisher/wifiphisher">
 <svg aria-hidden="true" class="octicon octicon-repo text-gray mr-1" height="16" version="1.1" viewbox="0 0 12 16" width="12"><path d="M4 9H3V8h1v1zm0-3H3v1h1V6zm0-2H3v1h1V4zm0-2H3v1h1V2zm8-1v12c0 .55-.45 1-1 1H6v2l-1.5-1.5L3 16v-2H1c-.55 0-1-.45-1-1V1c0-.55.45-1 1-1h10c.55 0 1 .45 1 1zm-1 10H1v2h2v-1h3v1h5v-2zm0-10H2v9h9V1z" fill-rule="evenodd"></path></svg>
 <span class="text-normal">wifiphisher / </span>wifiphisher
 </a> </h1>, <h1 class="h3 lh-condensed">
 <a href="/shengqiangzhang/examples-of-web-crawlers">
 <svg aria-hidden="true" class="octicon octicon-repo text-gray mr-1" height="16" version="1.1" viewbox="0 0 12 16" width="12"><path d="M4 9H3V8h1v1zm0-3H3v1h1V6zm0-2H3v1h1V4zm0-2H3v1h1V2zm8-1v12c0 .55-.45 1-1 1H6v2l-1.5-1.5L3 16v-2H1c-.55 0-1-.45-1-1V1c0-.55.45-1 1-1h10c.55 0 1 .45 1 1zm-1 10H1v2h2v-1h3v1h5v-2zm0-10H2v9h9V1z" fill-rule="evenodd"></path></svg>
 <span class="text-normal">shengqiangzhang / </span>e

In [174]:
# Take the part after "owner /" and strip the leftover whitespace around it
txt=[i.text.strip().split("/")[1].strip() for i in repo1]
print("Trending Python repositories in GitHub: ",txt)

Trending Python repositories in GitHub:  ['wifiphisher', 'examples-of-web-crawlers', 'video-object-removal', 'snorkel', 'system-design-primer', '100-Days-Of-ML-Code', 'RAdam', 'Real-Time-Voice-Cloning', 'tuya-convert', 'SickZil-Machine', 'keras', 'keras-radam', 'ptf', 'instabot', 'maltrail', 'discord.py', 'Instagram-API-python', 'AiLearning', 'models', 'wifite2', 'home-assistant', 'bitcoinbook', 'imutils', 'yfinance']
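
An alternative to splitting the display text is to read the repository path from each link's `href`, which has the form `/owner/repo` in the output above. The sketch below works on two sample paths copied from that output; `split_repo_path` is a hypothetical helper, and in the notebook the paths could be collected with something like `h1.a.get("href")` for each element of `repo1` (an assumption about the markup, which may change).

```python
def split_repo_path(href):
    """Split a '/owner/repo' path into an (owner, repo) tuple."""
    owner, repo = href.strip("/").split("/", 1)
    return owner, repo


# Sample paths copied from the notebook output above.
sample_hrefs = [
    "/wifiphisher/wifiphisher",
    "/shengqiangzhang/examples-of-web-crawlers",
]

repo_names = [split_repo_path(href)[1] for href in sample_hrefs]
print(repo_names)
```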


#### Display all the image links from the Walt Disney Wikipedia page

In [118]:
# This is the url you will scrape in this exercise
url2 = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [119]:
requests.get(url2)

<Response [200]>

In [121]:
html2=requests.get(url2).content
soup2=BeautifulSoup(html2, 'html.parser')

In [122]:
soup2

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Walt Disney - Wikipedia</title>
<script>document.documentElement.className=document.documentElement.className.replace(/(^|\s)client-nojs(\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Walt_Disney","wgTitle":"Walt Disney","wgCurRevisionId":911615016,"wgRevisionId":911615016,"wgArticleId":32917,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages containing links to subscription-only content","Articles with short description","Wikipedia indefinitely semi-protected pages","Wikipedia indefinitely move-protected pages","Featured articles","Use mdy dates from April 2017","Use American English from May 2016","All Wikipedia articles written in American English","Biography with signature","Articles with hCards","Articles containing German-language text","Ar

In [125]:
img=soup2.select("img")
img

[<img alt="This is a featured article. Click here for more information." data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/>,
 <img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/40px-Semi-protection-shackle.svg.png 2x" width="20"/>,
 <img alt="Walt Disney 1946.JPG" data-file-height="675" dat

In [166]:
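# First attempt: split the tags' string form on whitespace and keep tokens starting with "src".
# This also catches `srcset` and keeps the attribute name and quotes; the next cell reads `src` directly.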
link=[]
for i in str(img).split():
    if i.startswith("src"):
        link.append(i)
link

['src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png"',
 'srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png',
 'src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png"',
 'srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Semi-protection-shackle.svg.png',
 'src="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG"',
 'srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/330px-Walt_Disney_1946.JPG',
 'src="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png"',
 'srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/225px-Walt_Disney_1942_signature.svg.png',
 'src="//upload.wikimedia.org/wikipedia/commons/thumb/c/

In [163]:
[i.get('src') for i in img]

['//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 '//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 '//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/7/71/Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg/170px-Walt_
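
The `src` values above are protocol-relative (they start with `//`), so they are not directly usable as full URLs. A minimal sketch of one way to resolve them against the page URL with the standard library's `urljoin`; the two sample values are copied from the output above, and the variable names are illustrative.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

# Two sample `src` values copied from the output above.
src_values = [
    "//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG",
]

# urljoin borrows the scheme (https) from the page URL.
image_urls = [urljoin(page_url, src) for src in src_values]
for image_url in image_urls:
    print(image_url)
```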

#### Retrieve an arbitrary Wikipedia page for "Python" and create a list of the links on that page

In [168]:
# This is the url you will scrape in this exercise
url3 ='https://en.wikipedia.org/wiki/Python' 

In [169]:
html3=requests.get(url3).content
soup3=BeautifulSoup(html3, 'html.parser')
soup3

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Python - Wikipedia</title>
<script>document.documentElement.className=document.documentElement.className.replace(/(^|\s)client-nojs(\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Python","wgTitle":"Python","wgCurRevisionId":911688889,"wgRevisionId":911688889,"wgArticleId":46332325,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Disambiguation pages with short description","All article disambiguation pages","All disambiguation pages","Animal common name disambiguation pages","Disambiguation pages"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","Aug

In [177]:
links=soup3.select("a")
links

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#p-search">Jump to search</a>,
 <a class="extiw" href="https://en.wiktionary.org/wiki/Python" title="wiktionary:Python">Python</a>,
 <a class="extiw" href="https://en.wiktionary.org/wiki/python" title="wiktionary:python">python</a>,
 <a href="#Snakes"><span class="tocnumber">1</span> <span class="toctext">Snakes</span></a>,
 <a href="#Ancient_Greece"><span class="tocnumber">2</span> <span class="toctext">Ancient Greece</span></a>,
 <a href="#Media_and_entertainment"><span class="tocnumber">3</span> <span class="toctext">Media and entertainment</span></a>,
 <a href="#Computing"><span class="tocnumber">4</span> <span class="toctext">Computing</span></a>,
 <a href="#Engineering"><span class="tocnumber">5</span> <span class="toctext">Engineering</span></a>,
 <a href="#Roller_coasters"><span class="tocnumber">5.1</span> <span class="toctext">Roller coasters</span></a>,
 <a href

In [179]:
[i.get("href") for i in links]

[None,
 '#mw-head',
 '#p-search',
 'https://en.wiktionary.org/wiki/Python',
 'https://en.wiktionary.org/wiki/python',
 '#Snakes',
 '#Ancient_Greece',
 '#Media_and_entertainment',
 '#Computing',
 '#Engineering',
 '#Roller_coasters',
 '#Vehicles',
 '#Weaponry',
 '#People',
 '#Other_uses',
 '#See_also',
 '/w/index.php?title=Python&action=edit&section=1',
 '/wiki/Pythonidae',
 '/wiki/Python_(genus)',
 '/w/index.php?title=Python&action=edit&section=2',
 '/wiki/Python_(mythology)',
 '/wiki/Python_of_Aenus',
 '/wiki/Python_(painter)',
 '/wiki/Python_of_Byzantium',
 '/wiki/Python_of_Catana',
 '/w/index.php?title=Python&action=edit&section=3',
 '/wiki/Python_(film)',
 '/wiki/Pythons_2',
 '/wiki/Monty_Python',
 '/wiki/Python_(Monty)_Pictures',
 '/w/index.php?title=Python&action=edit&section=4',
 '/wiki/Python_(programming_language)',
 '/wiki/CPython',
 '/wiki/CMU_Common_Lisp',
 '/wiki/PERQ#PERQ_3',
 '/w/index.php?title=Python&action=edit&section=5',
 '/w/index.php?title=Python&action=edit&sectio
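
The raw list above mixes `None` (anchors without an `href`), in-page fragments such as `#Snakes`, relative paths, and absolute URLs. If a list of usable links is wanted, one possible clean-up is sketched below. `clean_links` is a hypothetical helper, and dropping fragment-only links is a choice made for this sketch, not something the exercise requires.

```python
from urllib.parse import urljoin


def clean_links(hrefs, base_url):
    """Drop missing and fragment-only hrefs, make the rest absolute, remove duplicates."""
    absolute = (
        urljoin(base_url, href)
        for href in hrefs
        if href and not href.startswith("#")
    )
    # dict.fromkeys keeps the first occurrence of each URL and preserves order.
    return list(dict.fromkeys(absolute))


# Sample values copied from the output above (one duplicated on purpose).
sample_hrefs = [
    None,
    "#mw-head",
    "https://en.wiktionary.org/wiki/Python",
    "/wiki/Pythonidae",
    "/wiki/Python_(genus)",
    "/wiki/Pythonidae",
]

print(clean_links(sample_hrefs, "https://en.wikipedia.org/wiki/Python"))
```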

#### Number of Titles that have changed in the United States Code since its last release point 

In [181]:
# This is the url you will scrape in this exercise
url4 = 'http://uscode.house.gov/download/download.shtml'

In [182]:
html4=requests.get(url4).content
soup4=BeautifulSoup(html4, 'html.parser')
soup4

<?xml version='1.0' encoding='UTF-8' ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=8" http-equiv="X-UA-Compatible"/>
<meta content="no-cache" http-equiv="pragma"/><!-- HTTP 1.0 -->
<meta content="no-cache,must-revalidate" http-equiv="cache-control"/><!-- HTTP 1.1 -->
<meta content="0" http-equiv="expires"/>
<link href="/javax.faces.resource/favicon.ico.xhtml?ln=images" rel="shortcut icon"/><link href="/javax.faces.resource/cssLayout.css.xhtml?ln=css" rel="stylesheet" type="text/css"/><script src="/javax.faces.resource/jsf.js.xhtml?ln=javax.faces" type="text/javascript"></script><link href="/javax.faces.resource/static.css.xhtml?ln=css" rel="stylesheet" type="text/css"/></head><body><script src="/javax.faces.resource/browserPreferences.js.xhtml?ln=scripts" type="text/javascri

In [187]:
bold=soup4.select('.usctitlechanged')
len(bold)

4
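
To see which titles were counted, the text of each matched element can be listed as well. The sketch below uses made-up markup: only the class name `usctitlechanged` comes from this notebook, while the surrounding tags, the `usctitle` class, and the choice of which titles are marked as changed are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the `usctitlechanged` class name is taken from the notebook.
SAMPLE_HTML = """
<div class="usctitle">Title 1 - General Provisions</div>
<div class="usctitlechanged">Title 2 - The Congress</div>
<div class="usctitlechanged">
  Title 3 - The President
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# `.usctitlechanged` matches whole class names, so `usctitle` is not selected.
changed = [" ".join(tag.get_text().split()) for tag in soup.select(".usctitlechanged")]
print(len(changed), changed)
```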

#### A Python list with the top ten FBI's Most Wanted names 

In [188]:
# This is the url you will scrape in this exercise
url5 = 'https://www.fbi.gov/wanted/topten'

In [189]:
html5=requests.get(url5).content
soup5=BeautifulSoup(html5, 'html.parser')
soup5

<!DOCTYPE html>
<html data-gridsystem="bs3" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="https://www.fbi.gov/wanted/topten" rel="canonical"/><title>Ten Most Wanted — FBI</title>
<link href="https://www.fbi.gov/wanted/topten/RSS" rel="alternate" title="Ten Most Wanted - RSS 1.0" type="application/rss+xml"/>
<link href="https://www.fbi.gov/wanted/topten/rss.xml" rel="alternate" title="Ten Most Wanted - RSS 2.0" type="application/rss+xml"/>
<link href="https://www.fbi.gov/wanted/topten/atom.xml" rel="alternate" title="Ten Most Wanted - Atom" type="application/rss+xml"/>
<meta content="summary_large_image" name="twitter:card"/>
<meta content="Ten Most Wanted | Federal Bureau of Investigation" name="twitter:title"/>
<meta content="Federal Bureau of Investigation" property="og:site_name"/>
<meta content="Ten Most Wanted | Federal Bureau of Investigation" pro

In [191]:
names=soup5.find_all('h3',class_="title")
names

[<h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/eugene-palmer">EUGENE PALMER</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/santiago-mederos">SANTIAGO VILLALBA MEDEROS</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/robert-william-fisher">ROBERT WILLIAM FISHER</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/bhadreshkumar-chetanbhai-patel">BHADRESHKUMAR CHETANBHAI PATEL</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/arnoldo-jimenez">ARNOLDO JIMENEZ</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/alejandro-castillo">ALEJANDRO ROSALES CASTILLO</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/yaser-abdel-said">YASER ABDEL SAID</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/jason-derek-brown">JASON DEREK BROWN</a>
 </h3>, <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/raf

In [192]:
names=soup5.find_all(class_="title")
names

[<p class="left title"><a href="https://www.fbi.gov/wanted">Most Wanted</a></p>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/eugene-palmer">EUGENE PALMER</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/santiago-mederos">SANTIAGO VILLALBA MEDEROS</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/robert-william-fisher">ROBERT WILLIAM FISHER</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/bhadreshkumar-chetanbhai-patel">BHADRESHKUMAR CHETANBHAI PATEL</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/arnoldo-jimenez">ARNOLDO JIMENEZ</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/alejandro-castillo">ALEJANDRO ROSALES CASTILLO</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/yaser-abdel-said">YASER ABDEL SAID</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/jason-derek-brown">JASON DE

In [194]:
names=soup5.select(".title")
names

[<p class="left title"><a href="https://www.fbi.gov/wanted">Most Wanted</a></p>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/eugene-palmer">EUGENE PALMER</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/santiago-mederos">SANTIAGO VILLALBA MEDEROS</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/robert-william-fisher">ROBERT WILLIAM FISHER</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/bhadreshkumar-chetanbhai-patel">BHADRESHKUMAR CHETANBHAI PATEL</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/arnoldo-jimenez">ARNOLDO JIMENEZ</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/alejandro-castillo">ALEJANDRO ROSALES CASTILLO</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/yaser-abdel-said">YASER ABDEL SAID</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/jason-derek-brown">JASON DE

In [197]:
# `names` also holds the "Most Wanted" header paragraph; keep only the <h3> entries
[i.text.strip() for i in names if i.name == 'h3']

['EUGENE PALMER',
 'SANTIAGO VILLALBA MEDEROS',
 'ROBERT WILLIAM FISHER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ARNOLDO JIMENEZ',
 'ALEJANDRO ROSALES CASTILLO',
 'YASER ABDEL SAID',
 'JASON DEREK BROWN',
 'RAFAEL CARO-QUINTERO',
 'ALEXIS FLORES']
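
Selecting on the class alone also matches the page's "Most Wanted" header paragraph, because it carries the `title` class too. The sketch below shows the difference between `.title` and `h3.title` on a trimmed copy of the markup from the outputs above; the FBI page may have changed since, so the selectors are assumptions.

```python
from bs4 import BeautifulSoup

# Trimmed sample of the markup seen in the notebook output above.
SAMPLE_HTML = """
<p class="left title"><a href="https://www.fbi.gov/wanted">Most Wanted</a></p>
<h3 class="title">
<a href="https://www.fbi.gov/wanted/topten/eugene-palmer">EUGENE PALMER</a>
</h3>
<h3 class="title">
<a href="https://www.fbi.gov/wanted/topten/santiago-mederos">SANTIAGO VILLALBA MEDEROS</a>
</h3>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Class only: the header paragraph is included.
by_class = [tag.get_text(strip=True) for tag in soup.select(".title")]

# Tag plus class: only the <h3> name entries remain.
by_tag_and_class = [tag.get_text(strip=True) for tag in soup.select("h3.title")]

print(by_class)
print(by_tag_and_class)
```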

####  20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [198]:
# This is the url you will scrape in this exercise
url6 = 'https://www.emsc-csem.org/Earthquake/'

In [211]:
html6=requests.get(url6).content
soup6=BeautifulSoup(html6,'lxml')
table=soup6.select('#tbody')
table

[<tbody id="tbody"><tr class="ligne1 normal" id="787446" onclick="go_details(event,787446);"><td class="tabev0"></td><td class="tabev0"></td><td class="tabev0"></td><td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=787446">2019-08-20   15:31:26.5</a></b><i class="ago" id="ago0">09min ago</i></td><td class="tabev1">37.81 </td><td class="tabev2">N  </td><td class="tabev1">29.67 </td><td class="tabev2">E  </td><td class="tabev3">17</td><td class="tabev5" id="magtyp0">ML</td><td class="tabev2">2.8</td><td class="tb_region" id="reg0"> WESTERN TURKEY</td><td class="comment updatetimeno" id="upd0" style="text-align:right;">2019-08-20 15:36</td></tr>
 <tr class="ligne2 normal" id="787447" onclick="go_details(event,787447);"><td class="tabev0"></td><td class="tabev0"></td><td class="tabev0"></td><td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=787447">2019-08-20   15:23:41.0</a></b><i class="ago

In [212]:
rows=table[0].find_all('tr')
rows

[<tr class="ligne1 normal" id="787446" onclick="go_details(event,787446);"><td class="tabev0"></td><td class="tabev0"></td><td class="tabev0"></td><td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=787446">2019-08-20   15:31:26.5</a></b><i class="ago" id="ago0">09min ago</i></td><td class="tabev1">37.81 </td><td class="tabev2">N  </td><td class="tabev1">29.67 </td><td class="tabev2">E  </td><td class="tabev3">17</td><td class="tabev5" id="magtyp0">ML</td><td class="tabev2">2.8</td><td class="tb_region" id="reg0"> WESTERN TURKEY</td><td class="comment updatetimeno" id="upd0" style="text-align:right;">2019-08-20 15:36</td></tr>,
 <tr class="ligne2 normal" id="787447" onclick="go_details(event,787447);"><td class="tabev0"></td><td class="tabev0"></td><td class="tabev0"></td><td class="tabev6"><b><i style="display:none;">earthquake</i><a href="/Earthquake/earthquake.php?id=787447">2019-08-20   15:23:41.0</a></b><i class="ago" id="ago1">17min

In [213]:
rows[0].select('#tbody td.tabev6')[0].a.text.split()

['2019-08-20', '15:31:26.5']

In [238]:
date_time=[i.select('#tbody td.tabev6')[0].a.text.split() for i in rows]
date_time

[['2019-08-20', '15:31:26.5'],
 ['2019-08-20', '15:23:41.0'],
 ['2019-08-20', '15:12:36.4'],
 ['2019-08-20', '15:11:37.0'],
 ['2019-08-20', '14:54:29.7'],
 ['2019-08-20', '14:49:23.2'],
 ['2019-08-20', '14:25:17.3'],
 ['2019-08-20', '14:24:35.8'],
 ['2019-08-20', '14:12:34.1'],
 ['2019-08-20', '14:07:33.9'],
 ['2019-08-20', '13:53:03.0'],
 ['2019-08-20', '13:48:10.5'],
 ['2019-08-20', '13:47:47.0'],
 ['2019-08-20', '13:41:46.4'],
 ['2019-08-20', '13:12:57.3'],
 ['2019-08-20', '13:06:38.9'],
 ['2019-08-20', '13:04:58.6'],
 ['2019-08-20', '13:03:53.3'],
 ['2019-08-20', '12:54:58.4'],
 ['2019-08-20', '12:52:23.2'],
 ['2019-08-20', '12:46:52.2'],
 ['2019-08-20', '12:45:27.0'],
 ['2019-08-20', '12:43:24.1'],
 ['2019-08-20', '12:26:41.7'],
 ['2019-08-20', '12:23:42.2'],
 ['2019-08-20', '11:48:15.7'],
 ['2019-08-20', '11:35:19.7'],
 ['2019-08-20', '11:16:50.7'],
 ['2019-08-20', '11:15:41.1'],
 ['2019-08-20', '11:15:30.0'],
 ['2019-08-20', '10:53:41.0'],
 ['2019-08-20', '10:48:36.2'],
 ['2019-

In [243]:
location1=soup6.select('#tbody tr>td:nth-child(5)')
location2=soup6.select('#tbody tr>td:nth-child(6)')
location3=soup6.select('#tbody tr>td:nth-child(7)')
location4=soup6.select('#tbody tr>td:nth-child(8)')
locations=[]
for i in range(20):
    locations.append(' '.join([location1[i].text.strip(),location2[i].text.strip(),location3[i].text.strip(),location4[i].text.strip()]))


In [244]:
locations

['37.81 N 29.67 E',
 '21.08 S 68.86 W',
 '36.19 N 27.70 E',
 '1.15 N 98.84 E',
 '33.66 N 119.08 W',
 '40.35 N 124.47 W',
 '40.11 N 15.47 E',
 '36.22 N 117.90 W',
 '34.93 N 25.88 E',
 '37.84 N 29.68 E',
 '4.06 S 133.81 E',
 '39.23 N 17.31 E',
 '69.14 N 144.56 W',
 '36.07 N 117.87 W',
 '36.13 N 27.72 E',
 '19.15 N 155.46 W',
 '56.22 N 149.92 W',
 '11.31 S 166.33 E',
 '38.30 N 115.79 W',
 '27.99 N 130.78 E']

In [246]:
region=soup6.select('#tbody td.tb_region')
region=[i.text.strip() for i in region]
region

['WESTERN TURKEY',
 'TARAPACA, CHILE',
 'DODECANESE IS.-TURKEY BORDER REG',
 'NORTHERN SUMATRA, INDONESIA',
 'CHANNEL ISLANDS REG., CALIFORNIA',
 'OFFSHORE NORTHERN CALIFORNIA',
 'SOUTHERN ITALY',
 'CENTRAL CALIFORNIA',
 'CRETE, GREECE',
 'WESTERN TURKEY',
 'NEAR S COAST OF PAPUA, INDONESIA',
 'SOUTHERN ITALY',
 'NORTHERN ALASKA',
 'CENTRAL CALIFORNIA',
 'DODECANESE IS.-TURKEY BORDER REG',
 'ISLAND OF HAWAII, HAWAII',
 'GULF OF ALASKA',
 'SANTA CRUZ ISLANDS',
 'NEVADA',
 'RYUKYU ISLANDS, JAPAN',
 'ISLAND OF HAWAII, HAWAII',
 'TARAPACA, CHILE',
 'ISLAND OF HAWAII, HAWAII',
 'FRANCE',
 'JUJUY, ARGENTINA',
 'WESTERN TURKEY',
 'KEPULAUAN BABAR, INDONESIA',
 'KERMADEC ISLANDS REGION',
 'EASTERN TURKEY',
 'MOLUCCA SEA',
 'MOLUCCA SEA',
 'ISLAND OF HAWAII, HAWAII',
 'SULAWESI, INDONESIA',
 'ISLAND OF HAWAII, HAWAII',
 'NAYARIT, MEXICO',
 'OKLAHOMA',
 'OKLAHOMA',
 'WESTERN TURKEY',
 'SOUTHERN CALIFORNIA',
 'MOLUCCA SEA',
 'OKLAHOMA',
 'OAXACA, MEXICO',
 'GUERRERO, MEXICO',
 'ANTOFAGASTA, CHILE

In [248]:
df=pd.DataFrame([date_time[:20],locations,region[:20]]).T
df.columns=['date_time','locations','region']
df

Unnamed: 0,date_time,locations,region
0,"[2019-08-20, 15:31:26.5]",37.81 N 29.67 E,WESTERN TURKEY
1,"[2019-08-20, 15:23:41.0]",21.08 S 68.86 W,"TARAPACA, CHILE"
2,"[2019-08-20, 15:12:36.4]",36.19 N 27.70 E,DODECANESE IS.-TURKEY BORDER REG
3,"[2019-08-20, 15:11:37.0]",1.15 N 98.84 E,"NORTHERN SUMATRA, INDONESIA"
4,"[2019-08-20, 14:54:29.7]",33.66 N 119.08 W,"CHANNEL ISLANDS REG., CALIFORNIA"
5,"[2019-08-20, 14:49:23.2]",40.35 N 124.47 W,OFFSHORE NORTHERN CALIFORNIA
6,"[2019-08-20, 14:25:17.3]",40.11 N 15.47 E,SOUTHERN ITALY
7,"[2019-08-20, 14:24:35.8]",36.22 N 117.90 W,CENTRAL CALIFORNIA
8,"[2019-08-20, 14:12:34.1]",34.93 N 25.88 E,"CRETE, GREECE"
9,"[2019-08-20, 14:07:33.9]",37.84 N 29.68 E,WESTERN TURKEY
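
The `date_time` column above still holds `[date, time]` pairs. If a single timestamp per row is preferred, one possible approach is to join each pair and parse it with the standard library. This is only a sketch: `pairs` copies two values from the output above, and `to_timestamp` is a hypothetical helper name.

```python
from datetime import datetime

# Two [date, time] pairs copied from the table above.
pairs = [['2019-08-20', '15:31:26.5'], ['2019-08-20', '15:23:41.0']]

def to_timestamp(pair):
    """Turn a [date, time] pair into a single datetime object."""
    # %f accepts one to six fractional digits, so '26.5' parses fine
    return datetime.strptime(' '.join(pair), '%Y-%m-%d %H:%M:%S.%f')

timestamps = [to_timestamp(pair) for pair in pairs]
print(timestamps[0].isoformat())
```

The resulting list can be used in place of `date_time[:20]` when building the DataFrame.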


#### Display the date, days, title, city, and country of the next 25 hackathon events as a pandas DataFrame

In [2]:
url8='https://www.g2.com/categories/crm?order=g2_score#product-list'
# One request is enough; keep the response so the status code can be checked too
response8=requests.get(url8)
html8=response8.content
html8

b'<!DOCTYPE html><!--[if lte IE 8] <html class="ie8" lang="en"--><!--[if IE 9] <html class="ie9" lang="en"--><!--[if gt IE 9] <!--> <html> <!----><head><meta charset="utf-8" /><link href="https://www.g2.com/assets/favicon-5be994bb49bea691c6a589da004c04466134fb86b55fb83c3088b5b5ce268a09.ico" rel="shortcut icon" type="image/x-icon" /><title>Best CRM Software 2019: Compare Reviews on 300+ CRMs | G2</title><meta content="78D210F3223F3CF585EB2436D17C6943" name="msvalidate.01" /><meta content="width=device-width, initial-scale=1" name="viewport" /><meta content="GNU Terry Pratchett" http-equiv="X-Clacks-Overhead" /><meta content="ie=edge" http-equiv="x-ua-compatible" /><meta content="website" property="og:type" /><meta content="G2" property="og:site_name" /><meta content="@G2dotcom" name="twitter:site" /><meta content="Best CRM Software 2019: Compare Reviews on 300+ CRMs | G2" property="og:title" /><meta content="https://www.g2.com/categories/crm" property="og:url" /><meta content="CRM Softw

In [4]:
soup8=BeautifulSoup(html8,'html.parser')
soup8

<!DOCTYPE html>
<!--[if lte IE 8] <html class="ie8" lang="en"--><!--[if IE 9] <html class="ie9" lang="en"--><!--[if gt IE 9] <!--> <html> <!-- --><head><meta charset="utf-8"/><link href="https://www.g2.com/assets/favicon-5be994bb49bea691c6a589da004c04466134fb86b55fb83c3088b5b5ce268a09.ico" rel="shortcut icon" type="image/x-icon"/><title>Best CRM Software 2019: Compare Reviews on 300+ CRMs | G2</title><meta content="78D210F3223F3CF585EB2436D17C6943" name="msvalidate.01"/><meta content="width=device-width, initial-scale=1" name="viewport"/><meta content="GNU Terry Pratchett" http-equiv="X-Clacks-Overhead"/><meta content="ie=edge" http-equiv="x-ua-compatible"/><meta content="website" property="og:type"/><meta content="G2" property="og:site_name"/><meta content="@G2dotcom" name="twitter:site"/><meta content="Best CRM Software 2019: Compare Reviews on 300+ CRMs | G2" property="og:title"/><meta content="https://www.g2.com/categories/crm" property="og:url"/><meta content="CRM Software is cust

In [2]:
# This is the url you will scrape in this exercise
url7 ='https://hackevents.co/search/anything/anywhere/anytime'

In [3]:
# One request is enough; keep the response so the status code can be checked too
response7=requests.get(url7)
html7=response7.content
html7


b'<html><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=device-width, initial-scale=1"/><title>Showing anything in anywhere at anytime</title><link rel="shortcut icon" href="/img/favicons/favicon.png"/><!--THIS IS DYNAMICALLY FILLED--><meta property="og:type" content="website"/><meta property="og:url" content="https://hackevents.co/anything/anywhere/anytime"/><meta property="og:title" content="Showing anything in anywhere at anytime"/><meta property="og:description" content="I just found anything located in anywhere"/><meta property="og:image" content="https://hackevents.co/img/placeholder.jpg"/><meta property="og:image:secure_url" content="https://hackevents.co/img/placeholder.jpg"/><meta property="og:image:alt" content="Showing anything in anywhere at anytime Image"/><meta name="twitter:card" content="summary_large_image"/><meta name="twitter:domain" value="hackevents.co"/><meta name="twitter:title" value="Showin

In [4]:
soup7=BeautifulSoup(html7,'html.parser')
soup7

<html><head><meta charset="utf-8"/><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="width=device-width, initial-scale=1" name="viewport"/><title>Showing anything in anywhere at anytime</title><link href="/img/favicons/favicon.png" rel="shortcut icon"/><!--THIS IS DYNAMICALLY FILLED--><meta content="website" property="og:type"/><meta content="https://hackevents.co/anything/anywhere/anytime" property="og:url"/><meta content="Showing anything in anywhere at anytime" property="og:title"/><meta content="I just found anything located in anywhere" property="og:description"/><meta content="https://hackevents.co/img/placeholder.jpg" property="og:image"/><meta content="https://hackevents.co/img/placeholder.jpg" property="og:image:secure_url"/><meta content="Showing anything in anywhere at anytime Image" property="og:image:alt"/><meta content="summary_large_image" name="twitter:card"/><meta name="twitter:domain" value="hackevents.co"/><meta name="twitter:title" value="Showing 

In [5]:
table7=soup7.select('div.col-12.col-sm-7')
table7

[<div class="col-12 col-sm-7"><h3>The Code Factor</h3><hr/><p><i class="fas fa-map-marker-alt"></i>  Milano, Italy</p><p><i class="fas fa-calendar-check"></i> Starts 5/21/2019</p><p><i class="fas fa-calendar-times"></i> Ends   8/31/2019</p></div>,
 <div class="col-12 col-sm-7"><h3>TECHFEST MUNICH</h3><hr/><p><i class="fas fa-map-marker-alt"></i>  Munich, Germany</p><p><i class="fas fa-calendar-check"></i> Starts 9/6/2019</p><p><i class="fas fa-calendar-times"></i> Ends   9/8/2019</p></div>,
 <div class="col-12 col-sm-7"><h3>Galileo App Competition</h3><hr/><p><i class="fas fa-map-marker-alt"></i>  Prague, Czech Republic</p><p><i class="fas fa-calendar-check"></i> Starts 1/31/2019</p><p><i class="fas fa-calendar-times"></i> Ends   9/30/2019</p></div>]

In [6]:
name=soup7.select('div.col-12.col-sm-7 h3')
name=[i.text for i in name]
name

['The Code Factor', 'TECHFEST MUNICH', 'Galileo App Competition']

In [11]:
table8=soup7.select('div.col-12.col-sm-7 p')
table8

[<p><i class="fas fa-map-marker-alt"></i>  Milano, Italy</p>,
 <p><i class="fas fa-calendar-check"></i> Starts 5/21/2019</p>,
 <p><i class="fas fa-calendar-times"></i> Ends   8/31/2019</p>,
 <p><i class="fas fa-map-marker-alt"></i>  Munich, Germany</p>,
 <p><i class="fas fa-calendar-check"></i> Starts 9/6/2019</p>,
 <p><i class="fas fa-calendar-times"></i> Ends   9/8/2019</p>,
 <p><i class="fas fa-map-marker-alt"></i>  Prague, Czech Republic</p>,
 <p><i class="fas fa-calendar-check"></i> Starts 1/31/2019</p>,
 <p><i class="fas fa-calendar-times"></i> Ends   9/30/2019</p>]

In [32]:
table8[0].text.split()

['Milano,', 'Italy']

In [76]:
location=soup7.select('div.col-12.col-sm-7 p:has(i.fa-map-marker-alt)')
location1=[i.text.strip() for i in location]
location1


['Milano, Italy', 'Munich, Germany', 'Prague, Czech Republic']

In [89]:
start=soup7.select('div.col-12.col-sm-7 p:has(i.fas.fa-calendar-check)')
start1=[i.text.split()[1] for i in start]
start1

['5/21/2019', '9/6/2019', '1/31/2019']

In [90]:
end = soup7.select('div.col-12.col-sm-7 p:has(i.fas.fa-calendar-times)')
end1=[i.text.split()[1] for i in end]
end1

['8/31/2019', '9/8/2019', '9/30/2019']

In [95]:
colnames = ['Name','Location','Start','End']
DFrame=pd.DataFrame([name,location1,start1,end1]).T
DFrame.columns=colnames
DFrame

Unnamed: 0,Name,Location,Start,End
0,The Code Factor,"Milano, Italy",5/21/2019,8/31/2019
1,TECHFEST MUNICH,"Munich, Germany",9/6/2019,9/8/2019
2,Galileo App Competition,"Prague, Czech Republic",1/31/2019,9/30/2019
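
The cells above build four separate lists and rely on them staying the same length. A possible alternative is to parse one event card at a time, so a card with a missing field cannot shift the other columns. The sketch below reuses two cards copied from the `table7` output; `text_or_none` and `parse_events` are hypothetical helper names, and splitting the location on the first comma is an assumption about the "City, Country" format.

```python
from bs4 import BeautifulSoup

# Two cards copied from the table7 output above.
cards_html = (
    '<div class="col-12 col-sm-7"><h3>The Code Factor</h3><hr/>'
    '<p><i class="fas fa-map-marker-alt"></i>  Milano, Italy</p>'
    '<p><i class="fas fa-calendar-check"></i> Starts 5/21/2019</p>'
    '<p><i class="fas fa-calendar-times"></i> Ends   8/31/2019</p></div>'
    '<div class="col-12 col-sm-7"><h3>TECHFEST MUNICH</h3><hr/>'
    '<p><i class="fas fa-map-marker-alt"></i>  Munich, Germany</p>'
    '<p><i class="fas fa-calendar-check"></i> Starts 9/6/2019</p>'
    '<p><i class="fas fa-calendar-times"></i> Ends   9/8/2019</p></div>'
)

def text_or_none(card, selector):
    """Return the stripped text of the first match inside card, or None."""
    node = card.select_one(selector)
    return node.get_text(strip=True) if node else None

def parse_events(soup):
    """Build one dict per event card so all fields of an event stay together."""
    events = []
    for card in soup.select('div.col-12.col-sm-7'):
        location = text_or_none(card, 'p:has(i.fa-map-marker-alt)')
        start = text_or_none(card, 'p:has(i.fa-calendar-check)')
        end = text_or_none(card, 'p:has(i.fa-calendar-times)')
        # Assumed format "City, Country": split on the first comma only
        city, _, country = (location or '').partition(',')
        events.append({
            'Name': text_or_none(card, 'h3'),
            'City': city.strip() or None,
            'Country': country.strip() or None,
            'Start': start.split()[-1] if start else None,
            'End': end.split()[-1] if end else None,
        })
    return events

events = parse_events(BeautifulSoup(cards_html, 'html.parser'))
print(events)
```

A list of dicts like `events` can be passed directly to `pd.DataFrame`, which uses the keys as column names.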


In [70]:
# maps=[]
# start=[]
# end=[]
# for i in table8:
#     # Test the current element, not the whole list
#     if "marker" in str(i):
#         maps.append(i.text.split())
#     elif "check" in str(i):
#         start.append(i.text)
#     elif "times" in str(i):
#         end.append(i.text)
# print(maps)
# print(start)
# print(end)

# full_list=[i.text.split() for i in table8]
# print(full_list)

# for i in full_list:
#     for j in i:
#         print(j)
        
# places=full_list[0]+full_list[3]+full_list[6]   # avoid shadowing the built-in map
# print(places)

# country=[]
# country.extend([places[1],places[3],places[5],places[6]])   # append takes one argument only





[['Milano,', 'Italy'], ['Starts', '5/21/2019'], ['Ends', '8/31/2019'], ['Munich,', 'Germany'], ['Starts', '9/6/2019'], ['Ends', '9/8/2019'], ['Prague,', 'Czech', 'Republic'], ['Starts', '1/31/2019'], ['Ends', '9/30/2019']]
Milano,
Italy
Starts
5/21/2019
Ends
8/31/2019
Munich,
Germany
Starts
9/6/2019
Ends
9/8/2019
Prague,
Czech
Republic
Starts
1/31/2019
Ends
9/30/2019


#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account

In [None]:
# This is the url you will scrape in this exercise 
# You will need to append the account name (handle) to this url
url = 'https://twitter.com/'
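
The try/except requirement can be sketched without touching the network by separating parsing from error handling. Everything about the markup here is an assumption: the `li.tweets span.count` selector and both sample pages are made up for illustration and do not reflect Twitter's real HTML. `AccountNotFound`, `tweet_count` and `safe_tweet_count` are hypothetical names.

```python
from bs4 import BeautifulSoup

class AccountNotFound(Exception):
    """Raised when no tweet counter can be located in the page."""

def tweet_count(html):
    """Read the tweet counter from profile HTML (selector is hypothetical)."""
    soup = BeautifulSoup(html, 'html.parser')
    node = soup.select_one('li.tweets span.count')  # assumed markup
    if node is None:
        raise AccountNotFound('tweet counter not found')
    return int(node.get_text(strip=True).replace(',', ''))

def safe_tweet_count(html):
    """Return the count, or None when the page is not a profile page."""
    try:
        return tweet_count(html)
    except AccountNotFound as error:
        print(f'Could not count tweets: {error}')
        return None

# Made-up pages: one with a counter, one standing in for a missing account
found_page = '<ul><li class="tweets"><span class="count">1,234</span></li></ul>'
missing_page = '<h1>Sorry, that page does not exist</h1>'

print(safe_tweet_count(found_page))
print(safe_tweet_count(missing_page))
```

In the real exercise the HTML would come from the request for the chosen account, and the selector would need to match whatever the page source actually contains.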

In [None]:
#your code

#### Number of followers of a given twitter account

You will need to include a ***try/except block*** in case the account name is not found. 
<br>***Hint:*** the program should count the followers for any provided account

In [None]:
# This is the url you will scrape in this exercise 
# You will need to append the account name (handle) to this url
url = 'https://twitter.com/'
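
Follower counts are often displayed in abbreviated form, so the scraped text may need converting before it can be used as a number. The helper below is a sketch under that assumption: it handles plain digits, thousands separators, and `K`/`M` suffixes. `parse_count` is a hypothetical name, and the accepted formats are guesses about what the page might show.

```python
def parse_count(text):
    """Convert strings such as '1,234', '12.3K' or '4.5M' to an int."""
    multipliers = {'K': 1_000, 'M': 1_000_000}
    cleaned = text.strip().replace(',', '').upper()
    if cleaned and cleaned[-1] in multipliers:
        # round() absorbs floating-point noise such as 12300.000000000002
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)

for raw in ['1,234', '12.3K', '4.5M']:
    print(raw, parse_count(raw))
```

A string that matches none of these formats raises `ValueError`, which fits naturally into the try/except block the exercise asks for.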

In [None]:
#your code

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [None]:
#your code

#### A list of the different kinds of datasets available on data.gov.uk 

In [None]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [None]:
#your code 

#### Top 10 languages by number of native speakers stored in a pandas DataFrame

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [None]:
#your code

### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to append the account name (handle) to this url
url = 'https://twitter.com/'

In [None]:
# your code

#### IMDB's Top 250 data (movie name, initial release, director name and stars) as a pandas DataFrame

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [None]:
# your code

#### Movie name, year and a brief summary of 10 random movies from IMDB's top chart as a pandas DataFrame.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [None]:
#your code

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city:')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
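
Building the query string with `urlencode` avoids problems with city names that contain spaces, and the four requested values can then be read from the JSON response. The sketch below works offline: `sample_payload` is a made-up dict shaped like the current-weather response, `'KEY'` is a placeholder for an API key, and `build_weather_url` and `summarize_weather` are hypothetical helper names.

```python
from urllib.parse import urlencode

def build_weather_url(city, api_key, units='metric'):
    """Assemble the current-weather URL with properly escaped parameters."""
    base = 'http://api.openweathermap.org/data/2.5/weather'
    return base + '?' + urlencode({'q': city, 'APPID': api_key, 'units': units})

def summarize_weather(payload):
    """Pick temperature, wind speed, weather group and description from the JSON."""
    return {
        'temperature': payload['main']['temp'],
        'wind_speed': payload['wind']['speed'],
        'weather': payload['weather'][0]['main'],
        'description': payload['weather'][0]['description'],
    }

# Made-up payload for illustration; real values come from response.json()
sample_payload = {
    'weather': [{'main': 'Clouds', 'description': 'scattered clouds'}],
    'main': {'temp': 21.4},
    'wind': {'speed': 3.6},
}

print(build_weather_url('New York', 'KEY'))
print(summarize_weather(sample_payload))
```

With a live request, `summarize_weather(requests.get(url).json())` would return the same four fields for the chosen city.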

In [None]:
# your code

#### Book name, price and stock availability as a pandas DataFrame.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'
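
One way to approach this exercise is to treat each book as a card and read its three fields together. The sketch below runs on a made-up snippet; the `article.product_pod`, `p.price_color` and `p.availability` selectors are assumptions about the page structure and should be checked against the real source in DevTools. `parse_books` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up snippet imitating one product card; verify against the live page
books_html = """
<article class="product_pod">
  <h3><a href="book_1/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <div class="product_price">
    <p class="price_color">&pound;51.77</p>
    <p class="instock availability"><i class="icon-ok"></i> In stock</p>
  </div>
</article>
"""

def parse_books(soup):
    """Return one dict per product card with name, price and availability."""
    books = []
    for pod in soup.select('article.product_pod'):
        books.append({
            # The title attribute holds the full name; the link text is truncated
            'name': pod.select_one('h3 a')['title'],
            'price': pod.select_one('p.price_color').get_text(strip=True),
            'availability': pod.select_one('p.availability').get_text(strip=True),
        })
    return books

books = parse_books(BeautifulSoup(books_html, 'html.parser'))
print(books)
```

Passing `books` to `pd.DataFrame` would give the three-column table the exercise asks for.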

In [None]:
#your code