# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the urls below and take a look at their source code through Chrome DevTools. You'll need to identify the html tags, special class names, etc used in the html content you are expected to extract.
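
The first tip can be turned into a small guard. The sketch below runs offline and uses only the standard library's `http.HTTPStatus`; `describe_status` is a hypothetical helper name, and in the exercises you would pass it `response.status_code` from a `requests` response.

```python
from http import HTTPStatus

def describe_status(status_code):
    """Return (ok, phrase), where ok is True only for 2xx status codes."""
    ok = 200 <= status_code < 300
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # The code is not part of the standard registry
        phrase = "Unknown"
    return ok, phrase

for code in (200, 404, 503):
    print(code, describe_status(code))
```

Only continue to the parsing step when the first element of the returned tuple is `True`.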

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. If you prefer to use additional libraries, feel free to do so.

In [1]:
## our libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd
from lxml import html
import re   ## standard-library regular expressions (the third-party regex package is not needed here)

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [24]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [25]:
# your code here
response = requests.get(url)              ## make request to url
response

<Response [200]>

In [27]:
soup = BeautifulSoup(response.content, 'lxml')        ## get all info and make an amazing soup with it (naming the parser avoids a warning)
soup

<!DOCTYPE html>
<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-930f30c9a7f3c8f413b0e34b301396e7.css" integrity="sha512-kw8wyafzyPQTsONLMBOW583TMBF78ZzLk01tMAvzQ45FuI3FGVXQw3jXrZFUzKqx3bG3wZR2fmx+5xKPM41Iwg==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/behaviors-75a44090e0cfa738eb7f192096058f17.css" integrity="sha512-daRAkODPpzjrfxkglgWPF8Y/U

#### 1. Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the HTML tag and class names used for the developer names. You can do this with Chrome DevTools or by right-clicking the page and choosing 'Inspect' in any browser. Here is an example:

![title](example_1.png)

2. Use BeautifulSoup `find_all()` to extract all the html elements that contain the developer names. Hint: pass in the `attrs` parameter to specify the class.

3. Loop through the elements found and get the text for each of them.

4. While you are at it, use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names. Hint: you may also use `.get_text()` instead of `.text` and pass in the desired parameters to do some string manipulation (check the documentation).

5. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
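
Step 4 above can be sketched with plain string methods. The sample text is made up for illustration; `squeeze` and `strip_all` are hypothetical helper names. The expected output above removes every whitespace character, which is what `strip_all` does, while `squeeze` keeps single spaces between words.

```python
# Made-up sample of the text found inside one HTML element
raw_text = "\n\n      Linus\n      Torvalds\n\n"

def squeeze(text):
    """Collapse each run of whitespace to one space and trim both ends."""
    return " ".join(text.split())

def strip_all(text):
    """Remove every whitespace character, including line breaks."""
    return "".join(text.split())

print(squeeze(raw_text))
print(strip_all(raw_text))
```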

In [28]:
# your code here

## Option 1

names_list = []

developers = soup.find_all('article', class_='Box-row d-flex')    ## each developer square (searched once, not on every iteration)

for sub_dev in developers:
    sub_names = sub_dev.find_all('h1', class_='h3 lh-condensed')[0].get_text()    ## get name
    names = re.findall(r'(\S+\s?\S*\s?\S*)(\n)', sub_names)[0][0]                 ## get only the text information needed
    try:
        profiles = sub_dev.find_all('p', class_='f4 text-normal mb-1')[0].get_text()   ## get profile if applicable
        sub_prof = re.findall(r'(\S+\s?\S*\s?\S*)(\n)', profiles)[0][0]                ## get only the text information needed
        prof = ' (' + str(sub_prof) + ')'                                              ## present the profile information as presented above
    except IndexError:
        prof = ''                                                                      ## no profile found, so nothing is added
    
    developer = str(names) + prof         ## get everything together
    names_list.append(developer)          ## append to names list

names_list

['LoveSy (yujincheng08)',
 'Niklas von Hertzen (niklasvh)',
 'Henrik Rydgård (hrydgard)',
 'Jonah Lawrence (DenverCoder1)',
 'Adam Ralph (adamralph)',
 'Marten Seemann (marten-seemann)',
 'fatedier',
 'Yoni Goldberg (goldbergyoni)',
 'Steven (styfle)',
 'Remi Rousselet (rrousselGit)',
 'Elvis Pranskevichus (elprans)',
 'Bjørn Erik Pedersen (bep)',
 'Patrick Arminio (patrick91)',
 'Johnny Chen (johnnychen94)',
 'Daniel Lemire (lemire)',
 'Tim Holy (timholy)',
 'Hyo (hyochan)',
 'Ariya Hidayat (ariya)',
 'Takafumi Arakaki (tkf)',
 'Casey Rodarmor (casey)',
 'Chris Banes (chrisbanes)',
 'Jose Diaz-Gonzalez (josegonzalez)',
 'Matthew Phillips (matthewp)',
 'Eliza Weisman (hawkw)',
 'Jonny Borges (jonataslaw)']

In [29]:
## Option 2

profile = soup.find_all('h1', attrs= {'class' : 'h3 lh-condensed'})  ## get important information from url

nickname = [prof.a.get('href') for prof in profile]  ## get the nicknames
nick = [nicks.replace('/','') for nicks in nickname] ## clean nicknames

profname = [prof.text for prof in profile ]  ## get profile name
prof = [re.findall(r'(\S+\s?\S*\s?\S*)(\n)', profis)[0][0] for profis in profname] ## clean profile name

profil = [name + ' (' + handle + ')' for name, handle in zip(prof, nick)]  ## get all together

profil ## print

['LoveSy (yujincheng08)',
 'Niklas von Hertzen (niklasvh)',
 'Henrik Rydgård (hrydgard)',
 'Jonah Lawrence (DenverCoder1)',
 'Adam Ralph (adamralph)',
 'Marten Seemann (marten-seemann)',
 'fatedier (fatedier)',
 'Yoni Goldberg (goldbergyoni)',
 'Steven (styfle)',
 'Remi Rousselet (rrousselGit)',
 'Elvis Pranskevichus (elprans)',
 'Bjørn Erik Pedersen (bep)',
 'Patrick Arminio (patrick91)',
 'Johnny Chen (johnnychen94)',
 'Daniel Lemire (lemire)',
 'Tim Holy (timholy)',
 'Hyo (hyochan)',
 'Ariya Hidayat (ariya)',
 'Takafumi Arakaki (tkf)',
 'Casey Rodarmor (casey)',
 'Chris Banes (chrisbanes)',
 'Jose Diaz-Gonzalez (josegonzalez)',
 'Matthew Phillips (matthewp)',
 'Eliza Weisman (hawkw)',
 'Jonny Borges (jonataslaw)']

#### 1.1. Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.
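
A minimal sketch of the cleaning step, assuming the heading text of each repository has the form `owner / repo` surrounded by whitespace and line breaks. The sample string and the helper name `split_owner_repo` are illustrative, not taken from the page.

```python
# Illustrative sample of the text inside one repository heading
heading_text = "\n\n      mrlt8 /\n\n      docker-wyze-bridge\n    "

def split_owner_repo(text):
    """Split 'owner / repo' heading text into (owner, repo)."""
    owner, _, repo = text.partition("/")
    return owner.strip(), repo.strip()

owner, repo = split_owner_repo(heading_text)
print(f"Repository: {repo}, developed by: {owner}")
```

`str.partition` splits on the first slash only, so the approach does not depend on how many line breaks surround the two names.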

In [30]:
# This is the url you will scrape in this exercise
url2 = 'https://github.com/trending/python?since=daily'

In [31]:
# your code here
repos_response = requests.get(url2)  #get the url info
repos_response

repos_soup = BeautifulSoup(repos_response.content, 'lxml')  #make a nice soup
repos_soup

<!DOCTYPE html>
<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-930f30c9a7f3c8f413b0e34b301396e7.css" integrity="sha512-kw8wyafzyPQTsONLMBOW583TMBF78ZzLk01tMAvzQ45FuI3FGVXQw3jXrZFUzKqx3bG3wZR2fmx+5xKPM41Iwg==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/behaviors-75a44090e0cfa738eb7f192096058f17.css" integrity="sha512-daRAkODPpzjrfxkglgWPF8Y/U

In [32]:
# your code here
## Option 1

repo_list = []

repo_headings = repos_soup.find_all('h1', class_='h3 lh-condensed')   ## top repo headings (searched once)

for heading in repo_headings:
    repo_info = heading.get_text()                                    ## get top repo info
    dev = re.findall(r'(\S+\s?\S*\s?\S*)( /)', repo_info)[0][0]       ## clean (just developer presented)
    rep = re.findall(r'(\S+\s?\S*\s?\S*)(\n)', repo_info)[1][0]       ## clean (just repo presented)
    infor = 'Repository: '+str(rep)+ ', developed by:'+str(dev)       ## all together now
    repo_list.append(infor)

repo_list

['Repository: docker-wyze-bridge, developed by:mrlt8',
 'Repository: mesh-transformer-jax, developed by:kingoflolz',
 'Repository: textual, developed by:willmcgugan',
 'Repository: shuup, developed by:shuup',
 'Repository: manim, developed by:3b1b',
 'Repository: DeepFaceLab, developed by:iperov',
 'Repository: bips, developed by:bitcoin',
 'Repository: slam-tg-mirror-bot, developed by:breakdowns',
 'Repository: rasa, developed by:RasaHQ',
 'Repository: RustPython, developed by:RustPython',
 'Repository: frigate, developed by:blakeblackshear',
 'Repository: CrackMapExec, developed by:byt3bl33d3r',
 'Repository: keras, developed by:keras-team',
 'Repository: bandit, developed by:PyCQA',
 'Repository: Python-100-Days, developed by:jackfrued',
 'Repository: HeytapTask, developed by:hwkxk',
 'Repository: python-binance, developed by:sammchardy',
 'Repository: SDEdit, developed by:ermongroup',
 'Repository: PayloadsAllTheThings, developed by:swisskyrepo',
 'Repository: manim, developed by:M

In [39]:
## Option 2

repos_info = repos_soup.find_all('h1', attrs= {'class' : 'h3 lh-condensed'})  ## get important information from url
repos_name_dev = [rep.text for rep in repos_info]  ## get repos and developer name

developer = [re.findall(r'(\S+\s?\S*\s?\S*)( /)', dev)[0][0] for dev in repos_name_dev] ## clean developer name
repository = [re.findall(r'(\S+\s?\S*\s?\S*)(\n)', repo)[1][0]  for repo in repos_name_dev] ## clean repo name

repo_dev = ['Repository: '+str(repository[i])+ ', developed by:'+str(developer[i]) for i in range(len(repository))]  ## get all together

repo_dev  ## print

['Repository: docker-wyze-bridge, developed by:mrlt8',
 'Repository: mesh-transformer-jax, developed by:kingoflolz',
 'Repository: textual, developed by:willmcgugan',
 'Repository: shuup, developed by:shuup',
 'Repository: manim, developed by:3b1b',
 'Repository: DeepFaceLab, developed by:iperov',
 'Repository: bips, developed by:bitcoin',
 'Repository: slam-tg-mirror-bot, developed by:breakdowns',
 'Repository: rasa, developed by:RasaHQ',
 'Repository: RustPython, developed by:RustPython',
 'Repository: frigate, developed by:blakeblackshear',
 'Repository: CrackMapExec, developed by:byt3bl33d3r',
 'Repository: keras, developed by:keras-team',
 'Repository: bandit, developed by:PyCQA',
 'Repository: Python-100-Days, developed by:jackfrued',
 'Repository: HeytapTask, developed by:hwkxk',
 'Repository: python-binance, developed by:sammchardy',
 'Repository: SDEdit, developed by:ermongroup',
 'Repository: PayloadsAllTheThings, developed by:swisskyrepo',
 'Repository: manim, developed by:M

#### 2. Display all the image links from the Walt Disney Wikipedia page.
Hint: use `.get()` to access information inside tags. Check out the documentation.

In [40]:
# This is the url you will scrape in this exercise
url_disney = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [41]:
# your code here
disney_response = requests.get(url_disney)  #get the url info
disney_response

disney_soup = BeautifulSoup(disney_response.content, 'lxml')  #Bibbidi-Bobbidi-Boo
disney_soup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Walt Disney - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"8eea4c43-0e46-48e4-a974-310ff4313d55","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Walt_Disney","wgTitle":"Walt Disney","wgCurRevisionId":1037017215,"wgRevisionId":1037017215,"wgArticleId":32917,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages containing links to subscription-only content","Articles with short description","Short description is different from Wikidata","Wikipedia indefinitely move-protected pa

In [50]:
# your code here
## Option 1

image_info = disney_soup.find_all('a', attrs= {'class' : 'image'})  ## get all anchors that wrap an image
images = [imag.get('href') for imag in image_info]  ## get the image paths with .get(), as the hint suggests
images ##print

['/wiki/File:Walt_Disney_1946.JPG',
 '/wiki/File:Walt_Disney_1942_signature.svg',
 '/wiki/File:Walt_Disney_envelope_ca._1921.jpg',
 '/wiki/File:Trolley_Troubles_poster.jpg',
 '/wiki/File:Steamboat-willie.jpg',
 '/wiki/File:Walt_Disney_1935.jpg',
 '/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg',
 '/wiki/File:Disney_drawing_goofy.jpg',
 '/wiki/File:DisneySchiphol1951.jpg',
 '/wiki/File:WaltDisneyplansDisneylandDec1954.jpg',
 '/wiki/File:Walt_disney_portrait_right.jpg',
 '/wiki/File:Walt_Disney_Grave.JPG',
 '/wiki/File:Roy_O._Disney_with_Company_at_Press_Conference.jpg',
 '/wiki/File:Disney_Display_Case.JPG',
 '/wiki/File:Disney1968.jpg',
 '/wiki/File:Disneyland_Resort_logo.svg',
 '/wiki/File:Animation_disc.svg',
 '/wiki/File:P_vip.svg',
 '/wiki/File:Magic_Kingdom_castle.jpg',
 '/wiki/File:Video-x-generic.svg',
 '/wiki/File:Flag_of_Los_Angeles_County,_California.svg',
 '/wiki/File:Blank_television_set.svg',
 '/wiki/File:Flag_of_the_United_States.svg']
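
The paths in the output above are relative to the site root. A minimal sketch of turning them into full links with `urllib.parse.urljoin` from the standard library follows; only two of the paths are repeated here to keep the example short.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"
relative_links = [
    "/wiki/File:Walt_Disney_1946.JPG",
    "/wiki/File:Steamboat-willie.jpg",
]

# urljoin resolves each path against the page it was found on
absolute_links = [urljoin(page_url, href) for href in relative_links]

for link in absolute_links:
    print(link)
```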

#### 2.1. List all language names and number of related articles in the order they appear in wikipedia.org.
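
The article counts on the portal page appear to use non-breaking spaces (`\xa0`) as thousands separators, which is an assumption based on the patterns used in the solution below. A small sketch of turning such a string into an integer, with `parse_article_count` as a hypothetical helper name:

```python
import re

def parse_article_count(text):
    """Keep only the ASCII digits of a count string and return them as an int."""
    digits = re.sub(r"[^0-9]", "", text)
    return int(digits)

sample = "6\xa0326\xa0000+"   # illustrative value in the assumed format
print(parse_article_count(sample))
```

Storing the count as an integer, rather than as text, makes it possible to sort or compare the languages later.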

In [51]:
# This is the url you will scrape in this exercise
url_articles = 'https://www.wikipedia.org/'

In [52]:
# your code here
articles_response = requests.get(url_articles)  #get the url info
articles_response

articles_soup = BeautifulSoup(articles_response.content, 'lxml')  #make a soup with all articles
articles_soup

<!DOCTYPE html>
<html class="no-js" lang="en">
<head>
<meta charset="utf-8"/>
<title>Wikipedia</title>
<meta content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation." name="description"/>
<script>
document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );
</script>
<meta content="initial-scale=1,user-scalable=yes" name="viewport"/>
<link href="/static/apple-touch/wikipedia.png" rel="apple-touch-icon"/>
<link href="/static/favicon/wikipedia.ico" rel="shortcut icon"/>
<link href="//creativecommons.org/licenses/by-sa/3.0/" rel="license"/>
<style>
.sprite{background-image:linear-gradient(transparent,transparent),url(portal/wikipedia.org/assets/img/sprite-e99844f6.svg);background-repeat:no-repeat;display:inline-block;vertical-align:middle}.svg-Commons-logo_sister{background-position:0 0;width:47px;height:47px}.svg-MediaWiki-logo_sister{background-

In [64]:
# your code here
langs_info = articles_soup.find_all('div', dir='ltr') ## get all language information
lang_article = [lang.text for lang in langs_info]  ## get language and articles info

language = [re.findall(r'(\n\n)(.+)(\n)', lang)[0][1] for lang in lang_article] ## clean language name
article_numb = [''.join(re.findall(r'\n(\d*)\xa0(\d*)\xa0(\d*)', lang)[0]) for lang in lang_article] ## join the three digit groups, which are separated by non-breaking spaces

language_articles = [language[i] + ' (articles:' + article_numb[i] + '+)' for i in range(len(language))]  ## get all together

language_articles ## print

['English (articles:6326000+)',
 '日本語 (articles:1275000+)',
 'Español (articles:1696000+)',
 'Deutsch (articles:2590000+)',
 'Русский (articles:1734000+)',
 'Français (articles:2340000+)',
 '中文 (articles:1206000+)',
 'Italiano (articles:1701000+)',
 'Português (articles:1066000+)',
 'Polski (articles:1480000+)']

#### 2.2. Display the top 10 languages by number of native speakers stored in a pandas dataframe.
Hint: After finding the correct table you want to analyse, you can use a nested **for** loop to find the elements row by row (check out the 'td' and 'tr' tags). <br>An easier way to do it is using pd.read_html(), check out documentation [here](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html).
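
The row-by-row idea from the hint can be sketched offline. To stay self-contained, this example swaps BeautifulSoup for the standard library's `html.parser` and reads a small made-up table; the class name `TableRows` is hypothetical. The same nesting (cells inside rows) applies when looping over `tr` and `td` elements with BeautifulSoup.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows
        self._row = None     # cells of the row currently being read
        self._cell = None    # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Made-up table that mimics the layout of a simple wikitable
sample_html = """
<table>
  <tr><th>Language</th><th>Speakers (millions)</th></tr>
  <tr><td>Mandarin Chinese</td><td>918</td></tr>
  <tr><td>Spanish</td><td>480</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *body = parser.rows
print(header)
print(body)
```

The resulting `header` and `body` lists can be passed to `pd.DataFrame(body, columns=header)`.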

In [65]:
# This is the url you will scrape in this exercise
url_native = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [66]:
# your code here
repos_native = requests.get(url_native)  #get the url info
repos_native

repos_native_speakers = BeautifulSoup(repos_native.content, 'lxml')  #make a nice soup
repos_native_speakers

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of languages by number of native speakers - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"95b3c22e-be2d-48b9-8637-3a52039f3dfb","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_languages_by_number_of_native_speakers","wgTitle":"List of languages by number of native speakers","wgCurRevisionId":1035922567,"wgRevisionId":1035922567,"wgArticleId":405385,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Wikipedia indefinitely semi-protected pages","Articles with short des

In [123]:
# your code here
speakers_info = repos_native_speakers.find_all('table', attrs= {'class':'wikitable sortable'}) ## get all speakers information
speakers_language = [info.text for info in speakers_info]  ## get language and speakers numbers info
language_speakers = [re.findall(r'(\n\n\n\d+\n\n)(.+)(\n\n)(\d+\.?\d*)(\n\n)(\d+\.\d+%)', lang) for lang in speakers_language] ## clean language name and numbers

lan_df = pd.DataFrame(language_speakers[0]).drop([0,2,4,5], axis=1) ## create dataframe and drop info we don't need
lan_df.columns = ['Language','Speakers(millions)'] ## rename columns
lan_df['Speakers(millions)'] = lan_df['Speakers(millions)'].astype(float) ## convert to numbers so the sort is numeric, not alphabetical
lan_df = lan_df.sort_values(by=['Speakers(millions)'], ascending=False) ## sort by speakers numbers (sort_values returns a new dataframe)
lan_df.head(10)  ## see top 10

Unnamed: 0,Language,Speakers(millions)
0,Mandarin Chinese,918.0
1,Spanish,480.0
2,English,379.0
3,Hindi (sanskritised Hindustani)[9],341.0
4,Bengali,300.0
5,Portuguese,221.0
6,Russian,154.0
7,Japanese,128.0
8,Western Punjabi[10],92.7
9,Marathi,83.1


#### 3. Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.
Hint: If you hover over the title of the movie, you should see the director's name. Can you find where it's stored in the html?
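
A minimal sketch of splitting the hover text into director and stars. It assumes the `title` attribute has the form `Director (dir.), Star, Star`; the sample string and the helper name `split_credits` are illustrative, so check the actual attribute value in DevTools first.

```python
import re

# Assumed format of the link's title attribute
title_attr = "Frank Darabont (dir.), Tim Robbins, Morgan Freeman"

def split_credits(title):
    """Return (director, [stars]) from a 'Name (dir.), Star, Star' string."""
    match = re.match(r"(.+?)\s\(dir\.\),\s*(.*)", title)
    if match is None:
        return None, []
    director, rest = match.groups()
    stars = [name.strip() for name in rest.split(",") if name.strip()]
    return director, stars

print(split_credits(title_attr))
```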

In [2]:
# This is the url you will scrape in this exercise 
url_imbd = 'https://www.imdb.com/chart/top'

In [3]:
# your code here
url_movies = requests.get(url_imbd)  #get the url info
url_movies

info_movies = BeautifulSoup(url_movies.content, 'lxml')  #make a nice soup
info_movies

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
<style>
                body#styleguide-v2 {
                    background: no-repeat fixed center top #000;
                }
            </style>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>IMDb Top 250 - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<s

In [56]:
# your code here
## Each 'titleColumn' cell holds the movie name, the director (in the link's title attribute) and the year
title_cells = info_movies.find_all('td', attrs= {'class':'titleColumn'})
## Get ratings
movies_rating_info = [movie_rating.text for movie_rating in info_movies.find_all('td', attrs= {'class':'ratingColumn imdbRating'})] ## get ratings info
movie_rating = [re.findall(r'\d+\.\d+', rate)[0] for rate in movies_rating_info] ## clean ratings
## get movie name
movie_name = [cell.a.text for cell in title_cells]  ## get movie name
## get movie director
movies_director_info = [cell.a.get('title') for cell in title_cells]  ## get director info
movie_director = [re.findall(r'(\A.+)(\s\(dir\.\),)', direct)[0][0] for direct in movies_director_info] ## clean director name
## get year
movies_year = [cell.span.text for cell in title_cells]  ## get release years
movie_release = [re.findall(r'(\()(\d+)(\))', year)[0][1] for year in movies_year] ## clean years information

## Create dataframe
movies_df = pd.DataFrame(movie_name)
movies_df.columns = ['Movie name']
movies_df['Initial release'] = movie_release
movies_df['Director'] = movie_director
movies_df['Rating'] = movie_rating

movies_df ## print

Unnamed: 0,Movie name,Initial release,Director,Rating
0,Os Condenados de Shawshank,1994,Frank Darabont,9.2
1,O Padrinho,1972,Francis Ford Coppola,9.1
2,O Padrinho: Parte II,1974,Francis Ford Coppola,9.0
3,O Cavaleiro das Trevas,2008,Christopher Nolan,9.0
4,Doze Homens em Fúria,1957,Sidney Lumet,8.9
...,...,...,...,...
245,Aurora,1927,F.W. Murnau,8.0
246,"Paris, Texas",1984,Wim Wenders,8.0
247,As Noites de Cabíria,1957,Federico Fellini,8.0
248,Demon Slayer - Kimetsu no Yaiba - O Filme: Com...,2020,Haruo Sotozaki,8.0


#### 3.1. Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [2]:
#This is the url you will scrape in this exercise
url_top = 'https://www.imdb.com/list/ls009796553/'

In [3]:
# your code here
movie_response = requests.get(url_top)  #get the url info
movie_response

top_movies_soup = BeautifulSoup(movie_response.content, 'lxml')  #make a nice soup
top_movies_soup

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///list/ls009796553?src=mdot" name="apple-itunes-app"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Random Movies - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.imdb.com/l

In [10]:
# your code here
## get movie names
movies_names = [mov.a.text for mov in top_movies_soup.find_all('h3', attrs= {'class':'lister-item-header'})]  ## get movie name

# get movie year
movies_years = [year.text for year in top_movies_soup.find_all('span', attrs= {'class':'lister-item-year text-muted unbold'})]  ## get year info
movie_releases = [re.findall('\d+', year)[0] for year in movies_years] ## clean years information

## get brief summary
movies_sum = [summary.text for summary in top_movies_soup.find_all('p', attrs= {'class':''})]  ## get brief summary info
movie_brief = [re.findall(r'(\n)(.+)', summary)[0][1] for summary in movies_sum] ## clean brief summary

## Create dataframe
movies_top_df = pd.DataFrame(movies_names)
movies_top_df.columns = ['Movie name']
movies_top_df['Year'] = movie_releases
movies_top_df['Brief Summary'] = movie_brief

movies_top_df.head(10) ## print

Unnamed: 0,Movie name,Year,Brief Summary
0,Pesadelo em Elm Street,1984,The monstrous spirit of a slain child murderer...
1,Despertares,1990,The victims of an encephalitis epidemic many y...
2,Liga de Mulheres,1992,Two sisters join the first female professional...
3,Um Bairro em Nova Iorque,1993,A father becomes worried when a local gangster...
4,Anjos em Campo,1994,When a boy prays for a chance to have a family...
5,Tempo de Matar,1996,"In Canton, Mississippi, a fearless young lawye..."
6,Amistad,1997,"In 1839, the revolt of Mende captives aboard a..."
7,Anaconda,1997,"A ""National Geographic"" film crew is taken hos..."
8,"A Cool, Dry Place",1998,"Russell, single father balances his work as a ..."
9,América Proibida,1998,A former neo-nazi skinhead tries to prevent hi...


## Bonus

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.
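
City names can contain spaces or accented characters, so it is safer to let the standard library encode the query string than to concatenate it by hand. The sketch below only builds the URL and makes no request; `build_weather_url` is a hypothetical helper name and `YOUR_API_KEY` is a placeholder for your own key.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"

def build_weather_url(city, api_key, units="metric"):
    """Build the request URL with a safely encoded query string."""
    query = urlencode({"q": city, "APPID": api_key, "units": units})
    return f"{BASE_URL}?{query}"

print(build_weather_url("New York", "YOUR_API_KEY"))
```

With `requests`, the same effect can be had by passing the dictionary through the `params` argument of `requests.get`.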

In [63]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url_city = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

Enter the city: lisbon


In [68]:
# your code here
def getcityinfo():
    city = input('Enter the city: ')  ## get city name
    url_city = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
    city_response = requests.get(url_city)                      ## get a response
    city_response.raise_for_status()                            ## stop with a clear error if the city was not found
    city_info = city_response.json()                            ## the API returns JSON, so parse it directly (no BeautifulSoup or regex needed)
    main = city_info['weather'][0]['main']                      ## main weather info
    description = city_info['weather'][0]['description']        ## description of weather
    temperature = city_info['main']['temp']                     ## temperature in city (negative values are handled too)
    temperature_min = city_info['main']['temp_min']             ## minimum temperature
    temperature_max = city_info['main']['temp_max']             ## maximum temperature
    wind_speed = city_info['wind']['speed']                     ## wind speed
    return ('Today the weather in '+str(city).capitalize()+' is '+str(main)+', more precisely '+str(description)+'. The temperature is around '+str(temperature)+', being the temperature range in the city between '+str(temperature_min)+' and '+str(temperature_max)+'. Furthermore, the wind speed today is '+str(wind_speed)+'.')

In [69]:
getcityinfo()    ## function returns all info regarding the city selected

Enter the city: barcelona


'Today the weather in Barcelona is Clear, more precisely clear sky. The temperature is around 23.15, being the temperature range in the city between 21.81 and 25.66. Furthermore, the wind speed today is 0.'

#### Find the book name, price and stock availability as a pandas dataframe.

In [70]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
book_url = 'http://books.toscrape.com/'

In [71]:
# your code here
book_response = requests.get(book_url)              ## make a request
book_soup = BeautifulSoup(book_response.content, 'lxml')    ## parse the response
book_soup

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="static

In [92]:
# your code here
## get book names
books_names = [book.h3.a.get('title') for book in book_soup.find_all('article', attrs= {'class':'product_pod'})] 
## get book prices
books_prices = [price.text for price in book_soup.find_all('p', attrs= {'class':'price_color'})]
## get book stock
books_st = [re.findall('(\s+)(.+)(\n\s)', stock.text)[0][1] for stock in book_soup.find_all('p', attrs= {'class':'instock availability'})]

## Create dataframe
book_df = pd.DataFrame(books_names)
book_df.columns = ['Book name']
book_df['Price'] = books_prices
book_df['Stock'] = books_st
book_df ## print

## Remark: the url used will only give information regarding the first page. 
## In order to get all the info we need to use the following url and make requests to each individual page
## url: 'http://books.toscrape.com/catalogue/page-'+page_number+'.html'

Unnamed: 0,Book name,Price,Stock
0,A Light in the Attic,£51.77,In stock
1,Tipping the Velvet,£53.74,In stock
2,Soumission,£50.10,In stock
3,Sharp Objects,£47.82,In stock
4,Sapiens: A Brief History of Humankind,£54.23,In stock
5,The Requiem Red,£22.65,In stock
6,The Dirty Little Secrets of Getting Your Dream...,£33.34,In stock
7,The Coming Woman: A Novel Based on the Life of...,£17.93,In stock
8,The Boys in the Boat: Nine Americans and Their...,£22.60,In stock
9,The Black Maria,£52.15,In stock
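
The remark in the solution cell above mentions requesting each catalogue page in turn. A minimal sketch of the two helpers that would need: building the page URLs and turning a price string into a number. Both function names are hypothetical, and `last_page` is an assumption you must read off the site's pager yourself.

```python
import re

def catalogue_page_urls(last_page):
    """Return the catalogue URLs for pages 1..last_page."""
    base = "http://books.toscrape.com/catalogue/page-{}.html"
    return [base.format(n) for n in range(1, last_page + 1)]

def price_to_float(price_text):
    """Convert a price string such as '£51.77' to the float 51.77."""
    return float(re.sub(r"[^0-9.]", "", price_text))

print(catalogue_page_urls(3))
print(price_to_float("£51.77"))
```

Keeping the price as a float allows sorting and filtering in the dataframe.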


####  Display the 100 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe.
***Hint:*** Here the displayed number of earthquakes per page is 20, but you can easily move to the next page by looping through the desired number of pages and adding it to the end of the url.
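
The number of pages to request depends on how many rows each page shows, which may differ from the 20 stated in the hint. A small sketch of the arithmetic, with `pages_needed` as a hypothetical helper name:

```python
import math

def pages_needed(total_rows, rows_per_page):
    """Smallest number of pages that contains at least total_rows rows."""
    return math.ceil(total_rows / rows_per_page)

base_url = "https://www.emsc-csem.org/Earthquake/?view="

for per_page in (20, 50):
    n_pages = pages_needed(100, per_page)
    urls = [base_url + str(n) for n in range(1, n_pages + 1)]
    print(per_page, "rows per page ->", n_pages, "pages, last url:", urls[-1])
```

If the pages return more rows than requested in total, the final dataframe can be trimmed with `.head(100)`.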

In [2]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/?view='

# This is how you will loop through each page:
number_of_pages = int(100/20)                 ## Each page now has 50 results, not 20; the site has probably changed its page size
each_page_urls = []

for n in range(1, number_of_pages+1):
    link = url+str(n)
    each_page_urls.append(link)
    
each_page_urls

['https://www.emsc-csem.org/Earthquake/?view=1',
 'https://www.emsc-csem.org/Earthquake/?view=2',
 'https://www.emsc-csem.org/Earthquake/?view=3',
 'https://www.emsc-csem.org/Earthquake/?view=4',
 'https://www.emsc-csem.org/Earthquake/?view=5']

In [28]:
# your code here
day_list = list()                            ## create empty list to day results
hour_list = list()                           ## create empty list to hour results
lat_list = list()                            ## create empty list to latitude results
lon_list = list()                            ## create empty list to longitude results
region_list = list()                         ## create empty list to region results

for page_url in each_page_urls:                                     ## check all page urls
    page_response = requests.get(page_url)                          ## make a request
    page_soup = BeautifulSoup(page_response.content, 'lxml')        ## get a soup
    date_cells = page_soup.find_all('td', attrs= {'class':'tabev6'})   ## cells holding date and time
    ## get date
    earth_day = [re.findall(r'\d\d\d\d-\d\d-\d\d', day.b.a.text)[0] for day in date_cells]
    day_list.append(earth_day)                                      ## append to the main list
    ## get time
    earth_hour = [re.findall(r'\d\d:\d\d:\d\d\.\d', hour.b.a.text)[0] for hour in date_cells]
    hour_list.append(earth_hour)                                    ## append to the main list
    ## get latitude and longitude (the pattern keeps unsigned magnitudes only, so the hemisphere is not captured)
    earth_lat_lon = [re.findall(r'(\d*\.\d*)', lat.text)[0] for lat in page_soup.find_all('td', attrs= {'class':'tabev1'})]
    earth_lat = earth_lat_lon[0::2]                                 ## latitude comes first in each pair (even positions)
    lat_list.append(earth_lat)                                      ## append to the main list
    earth_lon = earth_lat_lon[1::2]                                 ## longitude comes second (odd positions)
    lon_list.append(earth_lon)                                      ## append to the main list
    ## get region
    earth_region = [re.findall(r'(\xa0)(.+)', region.text)[0][1] for region in page_soup.find_all('td', attrs= {'class':'tb_region'})]
    region_list.append(earth_region)                                ## append to the main list

## Create one dataframe per page
page_frames = []
for i in range(len(region_list)):                 ## iterate each list inside the lists
    temp_df = pd.DataFrame(list(zip(day_list[i],hour_list[i],lat_list[i],lon_list[i],region_list[i]))) ## create a df
    page_frames.append(temp_df)

## Join all pages; DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
df_complete = pd.concat(page_frames, ignore_index=True)

df_complete.columns = ["day", "hour", "latitude", "longitude", "region"]   ## change column names

df_complete  ## print

Unnamed: 0,day,hour,latitude,longitude,region
0,2021-08-10,20:32:03.0,38.98,27.88,WESTERN TURKEY
1,2021-08-10,20:14:18.1,36.41,27.02,DODECANESE IS.-TURKEY BORDER REG
2,2021-08-10,20:02:39.0,31.22,68.46,"SAN JUAN, ARGENTINA"
3,2021-08-10,19:56:54.0,22.25,68.83,"ANTOFAGASTA, CHILE"
4,2021-08-10,19:53:52.2,36.63,3.52,STRAIT OF GIBRALTAR
...,...,...,...,...,...
249,2021-08-09,20:50:56.5,38.38,38.76,EASTERN TURKEY
250,2021-08-09,20:41:48.9,37.14,3.59,SPAIN
251,2021-08-09,20:41:18.3,36.30,27.07,DODECANESE IS.-TURKEY BORDER REG
252,2021-08-09,20:36:57.7,19.38,155.24,"ISLAND OF HAWAII, HAWAII"
