# Lab | Web Scraping

## Introduction

Web scraping can be defined as "the construction of an agent to download, parse, and organize data from the web in an automated manner". In this lab, you will work through a series of exercises to practice your web scraping skills.

Each exercise is independent of the previous one. If you get stuck on an exercise, you can skip to the next one.

### Hints:
- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
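
One way to interpret a status code before trusting the response body is sketched below. It uses only the standard library's `http.HTTPStatus`; the helper name `describe_status` is hypothetical and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (is_success, label) for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code is not registered in the standard library
    # Codes in the 2xx range mean the request succeeded.
    return 200 <= code < 300, f"{code} {phrase}"


for code in (200, 404, 503):
    print(describe_status(code))
```

With a `requests` response object `r`, the equivalent call would be `describe_status(r.status_code)`.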

### Documentation:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

## Libraries
- Make sure you have all libraries installed before starting the lab.
- In this lab you will use `requests`, `BeautifulSoup` and `pandas`.

In [1]:
# Import the libraries here
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
from datetime import datetime 

## Scraping GitHub trending developers
- In this first exercise we will scrape the GitHub trending developers page. Use the url below.
```python
url = 'https://github.com/trending/developers'
```

In [2]:
# Your code here
url = 'https://github.com/trending/developers'

- Start by calling `requests.get()` on `url` and save the output in a new variable called `get_html`.
- The output should be `<Response [200]>`

In [3]:
# Your code here
get_html = requests.get(url)
get_html

<Response [200]>

- Explore the attributes of the response object.
- Try `get_html.status_code` and `get_html.encoding`.

In [4]:
# Your code here
print(get_html.status_code)
print(get_html.encoding)

200
utf-8


- Use the `get_html.content` attribute to get the page content as bytes.
- Save it in a variable called `html_content`.
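
The difference between raw bytes and decoded text can be seen without making a request. The sketch below uses a small hand-written byte string (an assumption, standing in for a real page) and decodes it as UTF-8, which is the encoding reported for this page.

```python
# Hypothetical stand-in for get_html.content: raw bytes, as sent over the wire.
raw = "<p>Sebastián Ramírez</p>\n".encode("utf-8")

# In UTF-8 each accented letter here takes two bytes, so the byte count is larger.
print(type(raw), len(raw))

# Decoding turns the bytes into a str, which is easier to read and search.
text = raw.decode("utf-8")
print(type(text), len(text))
print(text.strip())
```

This is why the cell output starts with `b'...'`: `content` is a `bytes` object, while `get_html.text` gives the decoded string.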

In [5]:
# Your code here
html_content = get_html.content
html_content

b'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-dkuYFW+ra8yYSt342e5pJEeslPSjMcrMvNxlYZMyM/X+/WJHDPvoCuGq3LFojI7B0dQWwZNRiPMnbi9IfUgTaA==" rel="stylesheet" href="https://github.githubassets.com/assets/light-764b98156fab6bcc984addf8d9ee6924.css" /><link crossorigin="anonymous" media="all" integrity="sha512-UrAu23+eyncWvaQFwsLbgSKtmLb2aH1bcT4hJnnRdkaPuY1eu9bumt33FyHHFDX8hskTUNWNkIsMCz7

- Use BeautifulSoup to parse your result. You can use the code below.
```python
soup = BeautifulSoup(html_content, "lxml")
```

In [6]:
# Your code here
soup = BeautifulSoup(html_content, "lxml")
soup

<!DOCTYPE html>
<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-764b98156fab6bcc984addf8d9ee6924.css" integrity="sha512-dkuYFW+ra8yYSt342e5pJEeslPSjMcrMvNxlYZMyM/X+/WJHDPvoCuGq3LFojI7B0dQWwZNRiPMnbi9IfUgTaA==" media="all" rel="stylesheet"/><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-52b02edb7f9eca7716bda405c2c2db81.css" integrity="sha512-UrAu23+eyncWvaQFwsLbgSKtmLb2aH1bcT4h

### Display the names of the trending developers retrieved in the previous step.

- Find out the html tag and class names used for the developer names.
- Use BeautifulSoup to extract all the html elements that contain the developer names.
- Use string manipulation techniques to remove whitespace and line breaks (i.e. `\n`) from the *text* of each html element. Store the clean names in a list.

Your output should look like the example below:

```
['KentC.Dodds',
 'SethVargo',
 'VadimDemedes',
 'PaulBeusterien',
 'DanImhoff',
 'CalebPorzio',
 'TannerLinsley',
 'InesMontani',
 'Mr.doob',
 'JacobHoffman-Andrews',
 'TianonGravi',
 'TaylorOtwell',
 'MatthewJohnson',
 'MathiasBuus',
 'TimHolman',
 'AlonZakai',
 'HadleyWickham',
 'Bo-YiWu',
 'TobiasKoppers',
 'KentaroWada',
 'TeppeiFukuda',
 'MartinAtkins',
 'RyanMcKinley',
 'KlausPost',
 'JamesAgnew']
 ```
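
The whitespace cleanup can be tried on plain strings first. This is a minimal sketch: the `raw_names` values are made up to resemble the `.text` of the matched elements, and the two variants show the difference between removing all whitespace (as in the expected output above) and only collapsing it.

```python
# Hypothetical raw strings, standing in for the .text of each matched element.
raw_names = ["\n\n            Kent C. Dodds\n ", "\n  Seth\n  Vargo  "]

# str.split() with no arguments splits on any run of whitespace, including \n.
compact = ["".join(name.split()) for name in raw_names]  # no whitespace at all
spaced = [" ".join(name.split()) for name in raw_names]  # single spaces kept

print(compact)
print(spaced)
```

Calling `.strip()` alone removes only leading and trailing whitespace, which is enough when the name itself contains no line breaks.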

In [7]:
# Your code here
html_names = soup.find_all('h1', attrs={'class': 'h3 lh-condensed'})
names = [name.text.strip() for name in html_names]
names

['Arvid Norberg',
 'Vincent Prouillet',
 'Yair Morgenstern',
 'Etienne BAUDOUX',
 'Franck Nijhof',
 'Phap Dieu Duong',
 'Jason R. Coombs',
 'Alex Chi',
 'fiatjaf',
 'William Boman',
 'andig',
 'Anthony Sottile',
 'Artem Zakharchenko',
 'Sebastián Ramírez',
 'Fons van der Plas',
 'David Rodríguez',
 'Jonny Burger',
 'Scott W Harden',
 'J. Nick Koston',
 'kenji yoshida',
 'Adrian Kumpf',
 'Indrajeet Patil',
 'Ariel Mashraki',
 'Jesse Wilson',
 'Robin']

## Scraping function
- Now you have learned how to use Requests and BeautifulSoup. 
- Create the function below to make your scraping easier.
```python
def url_bs4(url):
    get_html = requests.get(url)
    print(get_html.status_code)
    print(get_html.encoding)
    html = get_html.content
    soup = BeautifulSoup(html, "lxml")
    return soup
```
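
A slightly more defensive variant is sketched below as an optional extension, not as the lab's required solution. The names `parse_html` and `url_bs4_checked` and the 10 second timeout are assumptions; the parser is Python's built-in `html.parser`, chosen here only so the sketch has no dependency on `lxml`.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(html, parser="html.parser"):
    """Parse raw HTML (bytes or str) with an explicitly named parser."""
    return BeautifulSoup(html, parser)


def url_bs4_checked(url, timeout=10):
    """Hypothetical stricter variant of url_bs4: fail loudly on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
    return parse_html(response.content)


# Offline check of the parsing half, using a tiny inline document.
demo = parse_html(b"<html><head><title>Demo</title></head></html>")
print(demo.title.text)
```

Separating fetching from parsing makes the parsing step easy to test on saved or hand-written HTML without hitting the network.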

In [8]:
# Your code here
def url_bs4(url):
    get_html = requests.get(url)
    print(get_html.status_code)
    print(get_html.encoding)
    html = get_html.content
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(html, "lxml")
    return soup

## Scraping Walt Disney wikipedia page
- Use the url below to scrape the Walt Disney Wikipedia page.
- Use the `url_bs4` function and check the status.
```python
url_disney = 'https://en.wikipedia.org/wiki/Walt_Disney'
```

In [9]:
# Your code here
url_disney = 'https://en.wikipedia.org/wiki/Walt_Disney'
wikidisney = url_bs4(url_disney)

200
UTF-8


- Create a list with all the image links from the Walt Disney Wikipedia page.
- Try the `.find_all` method to find the images.
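
Links found in a page are often relative, so they need to be combined with the page address before they can be opened. The sketch below uses the standard library's `urllib.parse.urljoin`; the `hrefs` values are made-up examples shaped like typical `href` attributes, not data taken from the page.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Walt_Disney"

# Hypothetical href values covering the three common shapes.
hrefs = [
    "/wiki/File:Walt_Disney_1946.JPG",                 # root-relative
    "//upload.wikimedia.org/example.jpg",              # protocol-relative
    "https://en.wikipedia.org/wiki/File:Example.svg",  # already absolute
]

# urljoin resolves each href against the page it was found on.
absolute = [urljoin(base_url, href) for href in hrefs]
for link in absolute:
    print(link)
```

Compared with concatenating a fixed prefix, `urljoin` also handles links that are already absolute or that point to another host.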

In [10]:
# Your code here
imagenes = wikidisney.find_all('a', attrs={'class':'image'})
link_imagen = ['https://en.wikipedia.org'+imagen['href'] for imagen in imagenes]
link_imagen

['https://en.wikipedia.org/wiki/File:Walt_Disney_1946.JPG',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_1942_signature.svg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_envelope_ca._1921.jpg',
 'https://en.wikipedia.org/wiki/File:Trolley_Troubles_poster.jpg',
 'https://en.wikipedia.org/wiki/File:Steamboat-willie.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_1935.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg',
 'https://en.wikipedia.org/wiki/File:Disney_drawing_goofy.jpg',
 'https://en.wikipedia.org/wiki/File:DisneySchiphol1951.jpg',
 'https://en.wikipedia.org/wiki/File:WaltDisneyplansDisneylandDec1954.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_disney_portrait_right.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_Grave.JPG',
 'https://en.wikipedia.org/wiki/File:Roy_O._Disney_with_Company_at_Press_Conference.jpg',
 'https://en.wikipedia.org/wiki/File:Disney_Oscar_1953_(cropped).jpg',
 'https://en.wikipedi

## Scraping earthquakes
- Use the url below to scrape the 50 latest earthquakes.
```python
url_eq='https://www.emsc-csem.org/Earthquake/'
```
- Instead of using requests and BeautifulSoup, try the function `pd.read_html(url_eq)`.
- You will notice that it returns a list of elements. One of the elements in this list is the earthquake table.
- You will need to clean the column names and the Date & Time values, and drop the last 3 rows.
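
Cleaning the Date & Time values can be prototyped on a single string before applying it to the whole column. This is a minimal sketch under the assumption that each cell starts with a `YYYY-MM-DD HH:MM:SS` timestamp followed by extra text such as a "min ago" suffix; the helper name `parse_quake_time` is hypothetical.

```python
import re
from datetime import datetime

# Matches 'YYYY-MM-DD', any whitespace, then 'HH:MM:SS' at the start of a cell.
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})")


def parse_quake_time(cell):
    """Return a datetime for the leading timestamp, or None if there is none."""
    if not isinstance(cell, str):
        return None  # e.g. NaN in an empty row
    match = TIMESTAMP.match(cell)
    if match is None:
        return None
    try:
        return datetime.strptime(" ".join(match.groups()), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None  # digits matched but do not form a valid date


print(parse_quake_time("2022-01-24 20:57:56.506min ago"))
print(parse_quake_time(float("nan")))
```

A function like this can be passed to `Series.apply` on the date column, the same way any single-argument cleaning function would be.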

In [11]:
# Your code here
url_eq='https://www.emsc-csem.org/Earthquake/'
earthquakes = pd.read_html(url_eq)[-1]
earthquakes.columns

MultiIndex([(    'CitizenResponse',   '12345678910»'),
            (    'CitizenResponse', '12345678910».1'),
            (    'CitizenResponse', '12345678910».2'),
            (    'Date & Time UTC',   '12345678910»'),
            (   'Latitude degrees',   '12345678910»'),
            (   'Latitude degrees', '12345678910».1'),
            (  'Longitude degrees',   '12345678910»'),
            (  'Longitude degrees', '12345678910».1'),
            (           'Depth km',   '12345678910»'),
            (            'Mag [+]',   '12345678910»'),
            (    'Region name [+]',   '12345678910»'),
            (    'Last update [-]',   '12345678910»'),
            ('Unnamed: 12_level_0',   '12345678910»')],
           )

In [12]:
earthquakes.columns= [columns[0] for columns in earthquakes.columns]
earthquakes.head()

Unnamed: 0,CitizenResponse,CitizenResponse.1,CitizenResponse.2,Date & Time UTC,Latitude degrees,Latitude degrees.1,Longitude degrees,Longitude degrees.1,Depth km,Mag [+],Region name [+],Last update [-],Unnamed: 12_level_0
0,,,,2022-01-24 20:57:56.506min ago,19.22,N,155.42,W,33,Ml,2.3,"ISLAND OF HAWAII, HAWAII",2022-01-24 21:03
1,,,,2022-01-24 20:54:12.809min ago,18.51,N,73.31,W,4,M,4.0,HAITI REGION,2022-01-24 21:03
2,,,,2022-01-24 20:52:59.411min ago,19.81,N,72.99,W,5,M,4.3,HAITI REGION,2022-01-24 21:00
3,,,,2022-01-24 20:39:15.024min ago,9.17,N,82.09,W,10,M,3.6,PANAMA-COSTA RICA BORDER REGION,2022-01-24 20:41
4,,,,2022-01-24 20:29:31.234min ago,42.83,N,1.33,W,10,ML,1.8,PYRENEES,2022-01-24 20:47


In [13]:
def datatime(data):
    # Keep only the leading 'YYYY-MM-DD HH:MM'; anything unparseable becomes NaN
    if not isinstance(data, str):
        return np.nan
    try:
        return datetime.strptime(data[0:16], '%Y-%m-%d %H:%M')
    except ValueError:
        return np.nan

In [14]:
earthquakes['Date & Time UTC'] = earthquakes['Date & Time UTC'].apply(datatime)
earthquakes.head()

Unnamed: 0,CitizenResponse,CitizenResponse.1,CitizenResponse.2,Date & Time UTC,Latitude degrees,Latitude degrees.1,Longitude degrees,Longitude degrees.1,Depth km,Mag [+],Region name [+],Last update [-],Unnamed: 12_level_0
0,,,,2022-01-24 20:57:00,19.22,N,155.42,W,33,Ml,2.3,"ISLAND OF HAWAII, HAWAII",2022-01-24 21:03
1,,,,2022-01-24 20:54:00,18.51,N,73.31,W,4,M,4.0,HAITI REGION,2022-01-24 21:03
2,,,,2022-01-24 20:52:00,19.81,N,72.99,W,5,M,4.3,HAITI REGION,2022-01-24 21:00
3,,,,2022-01-24 20:39:00,9.17,N,82.09,W,10,M,3.6,PANAMA-COSTA RICA BORDER REGION,2022-01-24 20:41
4,,,,2022-01-24 20:29:00,42.83,N,1.33,W,10,ML,1.8,PYRENEES,2022-01-24 20:47


In [15]:
earthquak=earthquakes.drop(columns=['Region name [+]','Last update [-]', 'Unnamed: 12_level_0'])
earthquak.head()

Unnamed: 0,CitizenResponse,CitizenResponse.1,CitizenResponse.2,Date & Time UTC,Latitude degrees,Latitude degrees.1,Longitude degrees,Longitude degrees.1,Depth km,Mag [+]
0,,,,2022-01-24 20:57:00,19.22,N,155.42,W,33,Ml
1,,,,2022-01-24 20:54:00,18.51,N,73.31,W,4,M
2,,,,2022-01-24 20:52:00,19.81,N,72.99,W,5,M
3,,,,2022-01-24 20:39:00,9.17,N,82.09,W,10,M
4,,,,2022-01-24 20:29:00,42.83,N,1.33,W,10,ML


## Bonus
- Find the IMDb Top 250 data.
- You should have the movie name, release year, director name and actors.
- Create a dataframe with the data you collected.
- Use the url below for this exercise.
```python
url_imdb = 'https://www.imdb.com/chart/top'
```
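
The director and actors can be separated with plain string handling once the credits text has been extracted. The sketch below assumes the credits look like `Director (dir.), Actor 1, Actor 2` and the year like `(1994)`, which is the shape seen in this lab's sample output; the page markup may differ today, and the helper names are hypothetical.

```python
def parse_credits(title_attr):
    """Split 'Director (dir.), Actor 1, Actor 2' into (director, [actors])."""
    parts = [part.strip() for part in title_attr.split(",")]
    # The first entry is the director, tagged with a '(dir.)' marker.
    director = parts[0].replace("(dir.)", "").strip()
    return director, parts[1:]


def parse_year(span_text):
    """Turn a string such as '(1994)' into the integer 1994."""
    return int(span_text.strip("()"))


director, actors = parse_credits("Frank Darabont (dir.), Tim Robbins, Morgan Freeman")
print(director, actors, parse_year("(1994)"))
```

Keeping the actors as a list makes it easy to join them later with whatever separator the dataframe should use.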

In [55]:
# Your code here
url_imdb = "https://www.imdb.com/chart/top"
get_html = requests.get(url_imdb)
html = get_html.content
soup = BeautifulSoup(html,"lxml")
soup

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Top 250 Movies - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.imdb.com/chart/top" rel="canonical"/>
<meta content="http://www.imdb.com/chart/top" property="og:url"/>
<script>
   

In [58]:
soup.find_all("td",attrs={'class':'titleColumn'})

[<td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
 <span class="secondaryInfo">(1994)</span>
 </td>,
 <td class="titleColumn">
       2.
       <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">O Poderoso Chefão</a>
 <span class="secondaryInfo">(1972)</span>
 </td>,
 <td class="titleColumn">
       3.
       <a href="/title/tt0071562/" title="Francis Ford Coppola (dir.), Al Pacino, Robert De Niro">O Poderoso Chefão II</a>
 <span class="secondaryInfo">(1974)</span>
 </td>,
 <td class="titleColumn">
       4.
       <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">Batman: O Cavaleiro das Trevas</a>
 <span class="secondaryInfo">(2008)</span>
 </td>,
 <td class="titleColumn">
       5.
       <a href="/title/tt0050083/" title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">12 Homens e uma Sentença</a>
 <s

In [30]:
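# Each title cell holds the link (movie title plus credits) and the release year
movies = soup.find_all('td', {'class':'titleColumn'})
titles = [movie.find('a').text for movie in movies]
# The year is wrapped in parentheses, e.g. '(1994)', so drop the first and last character
years = [movie.find('span').text[1:-1] for movie in movies]
# The link's title attribute looks like 'Director (dir.), Actor 1, Actor 2'
directors = [movie.find('a').get('title').split(',')[0][:-7] for movie in movies]
actors = [' & '.join(movie.find('a').get('title').split(',')[1:]) for movie in movies]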

movies_dict = {'Title': titles, 'Release': years, 'Director': directors, 'Actors': actors}

movies_df = pd.DataFrame(movies_dict)
movies_df


Unnamed: 0,Title,Release,Director,Actors
