# Web Scraping

![legtsgo](https://media.giphy.com/media/dwmNhd5H7YAz6/giphy.gif)


## Why use web scraping?

Web scraping is the process of collecting and parsing raw data from the internet. 
Why would you do that?


- Advantages: With APIs you sometimes need to create an account, get authenticated, read documentation, and follow certain rules, and you might have a limit (quota) on the number of requests you can make. Web scraping is more free-style: you don't have these formal limitations. It is also useful for automation.


- Disadvantages: You depend on the structure of the website you're trying to scrape. If the structure of the site is really messy, it might be a nightmare to write a script that collects the desired data. The structure of the website can also change overnight, and your code will simply break. Some websites are protected against web scraping, and the data they show on the screen is very difficult to access (for example, they can block you if you send too many requests, or they can show the data as images).





## Use cases

- Market & product research: (scrape ecommerce sites and get the brands, products, prices, reviews… of your competitors!).

- Lead generation: (scrape forums and find contact info of people that might be interested in your product... bit edgy but effective!).

- Keyword research: (what are people mentioning when they talk about something on the internet? Those keywords might be really useful in SEO/Growth-hacking).

- People Analytics: (the field of HR/Talent research is really benefiting from data science. And they might need data like job descriptions, skills, offers… only available through web scraping).

- Stock Analysis: (using data on the web to forecast...)

- Media Gathering & Analysis: (how do different news outlets talk about the candidates of an important election? Web scrape them!) 

### Tools

- Beautiful Soup
- Scrapy
- Selenium

## HTML
HTML code is made up of `<labeled>` content, that is, content wrapped in tags.

HTML has a hierarchical structure: parent tags, child tags, and sibling tags.


```html
<html>
  <head>
    <title>This is my website</title>
    <link rel="stylesheet" href="styles.css" />
  </head>
  <body>
    <h1>This is header 1: h1</h1>
    <h2>This is h2</h2>
      <p class="something">This is a paragraph and the info is held here</p>
    <p>This is a second paragraph</p>
    <img src="thisimage.png" />
  </body>
</html>

```

```css
body {
  background-color: lightblue;
}

p {
  color: navy;
}

.something {
  color: red;
  font-weight: 500;
}
```
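The parent/child nesting in the HTML sample above can be made visible with a few lines of Python. This is a minimal sketch that uses only the standard library; `OutlinePrinter` and `VOID_TAGS` are names made up for this illustration, and the sample is a shortened copy of the page above.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag, so they do not open a new nesting level.
VOID_TAGS = {"img", "link", "meta", "br", "hr", "input"}


class OutlinePrinter(HTMLParser):
    """Collect one indented line per opening tag to show parent/child nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


sample = """
<html>
  <head>
    <title>This is my website</title>
    <link rel="stylesheet" href="styles.css" />
  </head>
  <body>
    <h1>This is header 1: h1</h1>
    <p class="something">This is a paragraph</p>
    <img src="thisimage.png" />
  </body>
</html>
"""

printer = OutlinePrinter()
printer.feed(sample)
print("\n".join(printer.lines))
```

Tags printed at the same indentation under the same parent are siblings: here `h1`, `p` and `img` are all children of `body`.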

### First website

![Screenshot%202022-07-22%20at%2009.39.58.png](attachment:Screenshot%202022-07-22%20at%2009.39.58.png)


![sdsd](https://www.simplilearn.com/ice9/free_resources_article_thumb/header-tag.PNG)

## What Are HTML Tags?
Tags mark the start and end of an HTML element, and they are enclosed in angle brackets. An example of a tag is `<h1>`. Most tags must be opened (`<h1>`) and closed (`</h1>`) in order to work.
    
    
## What are HTML Attributes?
Attributes contain additional pieces of information about an element. They are written inside the opening tag, usually as `name="value"` pairs. For example:

`<img src="mydog.jpg" alt="A photo of my dog.">`

In this instance, the image source (src) and the alt text (alt) are attributes of the `<img>` tag.

### Golden Rules To Remember

- The vast majority of tags must be opened (`<tag>`) and closed (`</tag>`) with the element information such as a title or text resting between the tags.
- When using multiple tags, the tags must be closed in the order in which they were opened. For example:
    
`<strong><em>This is really important!</em></strong>`

### Tags (some)

- heading: `<h1>`, `<h2>`, `<h3>`, `<hgroup>`...
- containers: `<div>`, `<span>`, `<article>`...
- phrasing: `<b>`, `<i>`, `<u>`...
- hyperlinks: `<a>`
- embedded: `<audio>`, `<img>`, `<video>`...
- tabulated: `<table>`, `<tr>`, `<tbody>`, `<td>`...
- sections: `<header>`, `<section>`, `<article>`...
- metadata: `<meta>`, `<title>`, `<script>`...

### Attributes
Attributes are properties that a tag may or may not have.

`<div> Sneakers Brand Joma X54 </div>`

In this case, the `div` tag has no attributes.

Now:

`<div class="price-item" id="offer"> Sneakers Brand Joma X54 </div>`

This element has:

* a `div` tag
* a `class` attribute
* an `id` attribute


The `id` attribute is meant to be unique within a page (no two tags should share the same `id`), whereas `class` is not intended to be unique: it usually groups tags with similar behavior. The uniqueness of `id` is a convention, though; a page that repeats an `id` will still load. When scraping we will rarely search by `id`, unless we are after one specific element, because we usually want to collect a lot of data at once.

Some frequently used attributes are:
- `class`
- `href`
- `src`
- `width`, `height`
- `alt`
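As a small preview of how the `id` versus `class` distinction matters when scraping, here is a hedged sketch using Beautiful Soup (introduced in the next section). The markup is invented for the example and is modelled on the `<div>` shown above.

```python
from bs4 import BeautifulSoup

# Hypothetical product markup, modelled on the <div> example above.
html = """
<div class="price-item" id="offer">Sneakers Brand Joma X54</div>
<div class="price-item">Sneakers Brand Joma X60</div>
<div class="banner">Free shipping</div>
"""
soup = BeautifulSoup(html, "html.parser")

# An id should identify a single element, so find() is enough.
offer = soup.find(id="offer")

# A class groups several elements, so find_all() returns a list.
items = soup.find_all("div", class_="price-item")

print(offer.get_text())
print(offer.attrs)  # attributes come back as a dictionary
print([item.get_text() for item in items])
```

Note that `class` comes back as a list, because one tag can carry several classes at once.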

## Beautiful Soup
“If you're just trying to pull data from a website, Beautiful Soup is here to help.”

But it will only help you if you [read the documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) 🤭

Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages. Say you have found some web pages that show data relevant to you (dates, content, addresses, values…) but they offer no way to download that data directly. Beautiful Soup helps you pull particular content out of a web page and strip away the HTML markup; from there you can save the information or, with a library such as pandas, export it to a CSV or Excel file.

The Beautiful Soup documentation gives an idea of the variety of things the library can help with, from isolating titles and links, to extracting all the text from HTML tags, to altering the HTML within the document you're working with.
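To make "isolating titles and links" and "extracting all text" concrete, here is a minimal sketch on a tiny hand-written page, so it runs without any network request. The page content is invented for the example.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, so the example runs without a network request.
html = """
<html>
  <head><title>This is my website</title></head>
  <body>
    <h1>Welcome</h1>
    <p>Read the <a href="https://www.python.org/">Python</a> site.</p>
    <p>Docs: <a href="/docs">local docs</a></p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.get_text()                   # text of the <title> tag
links = [a.get("href") for a in soup.find_all("a")]  # every href on the page
body_text = soup.body.get_text(" ", strip=True)      # text with the markup removed

print(page_title)
print(links)
print(body_text)
```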

In [None]:
#!pip install bs4

In [None]:
import re

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

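`pip install lxml` may also be needed: `pd.read_html`, used further down, relies on `lxml` (or on `bs4` plus `html5lib`) to parse tables.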

In [None]:
url = "https://www.python.org/"

In [None]:
html_2 = requests.get(url)
html_2

In [None]:
# html_2.content

In [None]:
soup = BeautifulSoup(html_2.content, "html.parser")

In [None]:
type(soup)

In [None]:
#soup

In [None]:
tags = soup.find_all("div")  # all the <div> tags
len(tags)

In [None]:
tags2 = soup.find_all("div", attrs={"class":"psf-widget"})

In [None]:
len(tags2)

In [None]:
# You can target tags by specifying 
# the attributes they hold


# Attributes are specific information
# that modify tags


# Tags -> nouns
# Attributes -> adjectives

In [None]:
tags = soup.find_all("h2")  # all the level-2 headings (<h2>)
tags[1]

In [None]:
tags[1].getText()

In [None]:
my_h2 = [element.getText().replace("\n", "").replace(">>>", "").strip() for element in tags]
my_h2

In [None]:
mango = requests.get("https://shop.mango.com/es").content
mango_soup = BeautifulSoup(mango, "html.parser")
#mango_soup

## Difference between find_all() and select()

#### Find_all
[FINDALL DOCUMENTATION](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all)

The `find_all()` method looks through everything below the tag it is called on (the whole document, when called on the soup) and returns a list with every match.

`prices = soup.find_all("div", {"class": "price_product_box"})`

#### Select
[SELECT DOCUMENTATION](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors)

BeautifulSoup has a `.select()` method that uses the Soup Sieve package to run a CSS selector against a parsed document and return all matching elements. `Tag` has a similar method that runs a CSS selector against the contents of a single tag.


`prices = soup.select("div.price_product_box")`
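A side-by-side sketch of the two calls, run on invented markup that reuses the class name from the text above (the `product` wrapper class is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup that reuses the class name from the text above.
html = """
<div class="product">
  <div class="price_product_box">59,99 €</div>
</div>
<div class="product">
  <div class="price_product_box">74,95 €</div>
</div>
<span class="price_product_box">not a div</span>
"""
soup = BeautifulSoup(html, "html.parser")

by_find_all = soup.find_all("div", {"class": "price_product_box"})
by_select = soup.select("div.price_product_box")

# Both calls return the same two <div> elements, in document order.
find_all_texts = [tag.get_text() for tag in by_find_all]
select_texts = [tag.get_text() for tag in by_select]
print(find_all_texts)
print(select_texts)

# select() can also express nesting in a single string.
nested = soup.select("div.product > div.price_product_box")
print(len(nested))
```

For simple lookups the two are interchangeable; `select()` tends to be more compact once the target depends on where it sits inside other tags.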

### Bento diner

[CSS Diner](https://flukeout.github.io/) is an interactive game for practicing CSS selectors.

## Let's scrape 🥷

### Example 1 -  Rufus

In [None]:
url_rufus = "https://es.wikipedia.org/wiki/Rufus_T._Firefly"

In [None]:
res = requests.get(url_rufus)
html = res.content
soup = BeautifulSoup(html, "html.parser")
#soup

In [None]:
tags = soup.find_all("a") #anchor tag -> links
tags

In [None]:
tags[3].get("href")

In [None]:
links = [tag.get("href") for tag in tags]
links_new = links[1:]
links_new

In [None]:
# tag.get("href") returns None when an <a> tag has no href, so skip those
links_http = [link for link in links_new if link and "http" in link]
links_http[2]

In [None]:
def getlinks(url):
    """Return every link on the page at `url` that contains "http"."""
    res = requests.get(url)  # use the argument, not the global url_rufus
    html = res.content
    soup = BeautifulSoup(html, "html.parser")

    tags = soup.find_all("a")

    links = [tag.get("href") for tag in tags]
    links_new = links[1:]

    # tag.get("href") returns None for <a> tags without an href, so skip those
    return [link for link in links_new if link and "http" in link]

In [None]:
getlinks(url_rufus)
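Filtering on `"http"` keeps only absolute links and drops relative ones such as `/wiki/...`. If those are wanted too, one option is to resolve them against the page URL with `urljoin` from the standard library. A minimal sketch; the `hrefs` list is a hypothetical sample of the kinds of values `find_all("a")` tends to return:

```python
from urllib.parse import urljoin

page_url = "https://es.wikipedia.org/wiki/Rufus_T._Firefly"

# Hypothetical href values of the kinds that find_all("a") tends to return.
hrefs = ["/wiki/Groucho_Marx", "#Referencias", None, "https://www.imdb.com/"]

absolute = [
    urljoin(page_url, href)
    for href in hrefs
    if href and not href.startswith("#")  # skip missing hrefs and in-page anchors
]
print(absolute)
```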

### Example 2 - Silicon Valley

In [None]:
sillicon_valley = 'https://en.wikipedia.org/wiki/Silicon_Valley_(TV_series)'

The columns we want from the table:

- Year
- Ceremony
- Category
- Recipients
- Result

In [None]:
res = requests.get(sillicon_valley)
html = res.content
soup = BeautifulSoup(html, "html.parser")
#soup

In [None]:
# The tags we're after: "table", specifically "class: wikitable..."

In [None]:
# one way to turn it into a table (without bs4)

In [None]:
df = pd.read_html(html)[3]  # read_html returns a list with every table it finds; index 3 happens to be the one we want
df.sample()

In [None]:
# Another way to turn it into a table: bs4 helps us as the attribute targets our table

In [None]:
tables = soup.find_all("table", {"class":"wikitable sortable"})[0] #there's only one

In [None]:
df = pd.read_html(tables.prettify())[0]
df.sample()

In [None]:
# to_csv is a DataFrame method; it writes the file and returns None
df.to_csv("silicon_valley.csv")

### Example 3 - Sneakers

In [None]:
url = "https://www.murallasport.com/29-zapatillas-moda-mujer"

For each product on the listing page we want:

- Name
- Brand
- Price
- Link

Then, from each product's own page:

- Description

Inspecting the page we saw we wanted to get: `<div class="name_product_box">`

In [None]:
soup = BeautifulSoup(requests.get(url).content, "html.parser")

1. Get Name through `soup.find_all()`

In [None]:
products = soup.find_all("div", attrs = {"class":"name_product_box"})
products[2]

In [None]:
products[2].getText().strip()

In [None]:
name = [element.getText().strip() for element in products]
name[5]

2. Get brand through `soup.select()`

In [None]:
brand_select = soup.select("span.marca-product-box")
brand = [element.getText().strip() for element in brand_select]
brand[0]

3. Price: `<div class="price_product_box">`

In [None]:
price_find = soup.find_all("div", attrs = {"class":"price_product_box"})
price = [element.getText().strip() for element in price_find ]
price[0]

In [None]:
new_list = []
for money in price:
    if "\n" in money:
        new_list.append(money.split("\n")[0])
    else:
        new_list.append(money)
new_list[0]
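The prices are still strings at this point. If numbers are needed later (for sorting or averaging), something along these lines could convert them. This is a sketch under an assumption: that the text looks like `59,99 €`, with a comma as decimal separator and, when a product is discounted, a second price on the next line. `parse_price` and the sample strings are made up for the illustration.

```python
import re


def parse_price(text):
    """Return the first price in `text` as a float, or None if there is none.

    Assumes European formatting such as "1.059,99 €": a dot as thousands
    separator and a comma as decimal separator.
    """
    first_line = text.strip().split("\n")[0]
    match = re.search(r"\d[\d.]*(?:,\d+)?", first_line)
    if match is None:
        return None
    number = match.group().replace(".", "").replace(",", ".")
    return float(number)


samples = ["59,99 €", "74,95 €\n89,95 €", "1.059,00 €", "Agotado"]
parsed = [parse_price(s) for s in samples]
print(parsed)
```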

4. Link: `<div class="name_product_box" id="4665">` > `href` 

![Screenshot%202022-07-22%20at%2011.46.13.png](attachment:Screenshot%202022-07-22%20at%2011.46.13.png)

In [None]:
links_select = soup.find_all("div", {"class":"name_product_box"})
links_select[0]

In [None]:
links_select[0].find_all("a")[0]

In [None]:
# This is the same: find just returns the first one. In this case there's only one
links_select[0].find("a")

In [None]:
links_select[0].find("a").get("href")

In [None]:
links = [f"https://www.murallasport.com{element.find('a').get('href')}" for element in links_select]
links[2]

In [None]:
# Turn this into a DF

all_together = {
    "names":name,
    "brand":brand,
    "price":new_list,
    "links":links
}

In [None]:
sneakers = pd.DataFrame(all_together)
sneakers
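Building the table from four separate lists only works if every list has the same length and order; if one product has no brand, `pd.DataFrame` will complain about unequal lengths or, worse, the columns may end up misaligned. A more defensive pattern is to walk the page product by product. The sketch below assumes each product sits in its own wrapper element; the `product-card` class is invented for the example, while the inner class names are the ones used above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: "product-card" is an invented wrapper class, while the
# inner class names are the ones used in this example.
html = """
<div class="product-card">
  <span class="marca-product-box">Converse</span>
  <div class="name_product_box"><a href="/producto/1">Chuck Taylor</a></div>
  <div class="price_product_box">74,95 €</div>
</div>
<div class="product-card">
  <div class="name_product_box"><a href="/producto/2">Joma X54</a></div>
  <div class="price_product_box">59,99 €</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_none(card, selector):
    """Return the stripped text of the first match inside `card`, or None."""
    tag = card.select_one(selector)
    return tag.get_text().strip() if tag else None


rows = []
for card in soup.select("div.product-card"):
    link = card.select_one("div.name_product_box a")
    rows.append({
        "names": text_or_none(card, "div.name_product_box"),
        "brand": text_or_none(card, "span.marca-product-box"),
        "price": text_or_none(card, "div.price_product_box"),
        "links": link.get("href") if link else None,
    })

print(rows)
```

`pd.DataFrame(rows)` would then turn the list of dictionaries into the same kind of table, with an empty cell where a field is missing.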

In [None]:
# Targeting the description inside ONE of the links I got

![Screenshot%202022-07-22%20at%2012.03.33.png](attachment:Screenshot%202022-07-22%20at%2012.03.33.png)

In [None]:
# Get one description for every single LINK we obtained

In [None]:
a_given_product = requests.get("https://www.murallasport.com/producto/2500-canvas-color-chuck-taylor-all-star-move-low-top-negro")

In [None]:
soup_given = BeautifulSoup(a_given_product.content, "html.parser")

In [None]:
text = soup_given.findAll("div", {"class":"txt-description-product tab-pane active"})
text[0].getText().strip()

In [None]:
# Pseudo-code for the following steps:
# function: gets a link, retrieves the description
# new_column = result of applying that function to the links column

In [None]:
sneakers["links"][0]

In [None]:
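import numpy as np

def get_description(url):
    """Return the description text of one product page, or NaN if it is missing."""
    a_given_product = requests.get(url)
    soup_given = BeautifulSoup(a_given_product.content, "html.parser")
    try:
        text = soup_given.find_all("div", {"class": "txt-description-product tab-pane active"})[0]
        return text.getText().strip()
    except IndexError:  # no description block on this page
        return np.nan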

In [None]:
get_description(sneakers["links"][0])

In [None]:
sneakers["description"] = sneakers["links"].apply(get_description)
sneakers

In [None]:
sneakers.to_csv("sneakers-full.csv")  # writes the file; to_csv returns None
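Calling `get_description` through `apply` sends one request per row, back to back, and as noted in the disadvantages a site may block a client that sends too many requests. A common courtesy is to pause between calls. The helper below is a sketch of that idea; `fetch_politely` and `fake_fetch` are hypothetical names, and the one-second default delay is an arbitrary choice, not a recommendation from any site.

```python
import time


def fetch_politely(urls, fetch, delay=1.0):
    """Call `fetch(url)` for each URL, sleeping `delay` seconds between calls.

    `fetch` is any function that takes a URL and returns a value, for
    example the get_description function defined above.
    """
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results


# Stand-in for a real request so the sketch runs offline.
def fake_fetch(url):
    return f"description of {url}"


print(fetch_politely(["/producto/1", "/producto/2"], fake_fetch, delay=0.1))
```

With the real data this could be used as `sneakers["description"] = fetch_politely(sneakers["links"], get_description)`.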

### Example 5 - IMDb

In [None]:
url = "https://www.imdb.com/chart/top"

- Title
- Director
- Stars
- Year

In [None]:
headers = {"Accept-Language": "en-US"}  # ask for the page in English
# headers must be passed by keyword: the second positional argument of requests.get is params
res = requests.get(url, headers=headers)
html = res.content
soup = BeautifulSoup(html, "html.parser")
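A detail worth knowing here: the second positional argument of `requests.get` is `params` (the query string), not `headers`, so headers need to be passed by keyword. The sketch below builds both variants without sending anything, just to show where the dictionary ends up in each case.

```python
import requests

url = "https://www.imdb.com/chart/top"
headers = {"Accept-Language": "en-US"}

# Build the requests without sending them, to inspect what would go out.
as_params = requests.Request("GET", url, params=headers).prepare()
as_headers = requests.Request("GET", url, headers=headers).prepare()

print(as_params.url)  # the dict became a query string
print(as_params.headers.get("Accept-Language"))
print(as_headers.url)
print(as_headers.headers.get("Accept-Language"))
```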

In [None]:
#soup

In [None]:
# Without a working Accept-Language header, the soup comes back translated (here, into Spanish):

![Screenshot%202022-07-22%20at%2013.23.30.png](attachment:Screenshot%202022-07-22%20at%2013.23.30.png)

1. Title: `<td class="titleColumn">` 

In [None]:
# find_all(tag, attrs) -> list
# select("css selector") -> list (same syntax as "plate pickle apple" in CSS Diner)

In [None]:
titles = soup.select("td.titleColumn a")

In [None]:
titles = [i.get_text() for i in titles]
titles[0]

2. Ratings: `<td class="ratingColumn imdbRating">`

In [None]:
ratings_get = soup.select("td.ratingColumn strong")
ratings_get[0].get_text()

In [None]:
ratings = [i.get_text() for i in ratings_get]
ratings[4]

3. Links: `soup.select("td.titleColumn a")`, then read the `href` attribute of each tag

In [None]:
links = soup.select("td.titleColumn a")
links[0]["href"]

In [None]:
links_okay = [f"https://www.imdb.com{i['href']}" for i in links]
links_okay[2]

4. Years: `soup.select("td.titleColumn span")`

In [None]:
#Q: How'd you find the years:
example = soup.select("td.titleColumn")
example[0]

In [None]:
# Still the question:
# soup.select("td.titleColumn a")    -> "Cadena perpetua"
# soup.select("td.titleColumn span") -> "(1994)"

In [None]:
years_select = soup.select("td.titleColumn span")
years_select[0].get_text()

In [None]:
year = [element.get_text()[1:-1] for element in years_select]
year[0]
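Slicing with `[1:-1]` works as long as the text is exactly `(1994)`. If the span ever carries extra whitespace or other markers, a regular expression is more forgiving. A minimal sketch; `extract_year` and the sample strings are hypothetical:

```python
import re


def extract_year(text):
    """Return the first four-digit number in `text` as an int, or None."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None


# Hypothetical variations of the text found inside the <span>.
years = [extract_year(s) for s in ["(1994)", " (1972) ", "(I) (2019)", "n/a"]]
print(years)
```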

5. Directors and cast: the `title` attribute of each tag in `soup.select("td.titleColumn a")`

In [None]:
directors_search = soup.select("td.titleColumn a")
directors_search[0].get("title").split("(dir.),")[0].strip()

In [None]:
directors = [element.get("title").split("(dir.),")[0].strip() for element in directors_search]
directors[:3]

In [None]:
cast = [element.get("title").split("(dir.),")[1].strip() for element in directors_search]
cast[:3]
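Indexing with `[1]` after the split raises an `IndexError` whenever a title lacks the `(dir.),` marker. A slightly more defensive version of the same idea uses `str.partition`. The format of the sample string is an assumption inferred from the split above, and `split_credits` is a name made up for this sketch.

```python
def split_credits(title_attr):
    """Split 'Director (dir.), Actor, Actor' into (director, [actors])."""
    director, marker, rest = title_attr.partition("(dir.),")
    if not marker:
        # The marker is missing, so we cannot tell who directed.
        return None, [name.strip() for name in title_attr.split(",") if name.strip()]
    cast = [name.strip() for name in rest.split(",") if name.strip()]
    return director.strip(), cast


# Assumed format of the title attribute, inferred from the split above.
sample = "Frank Darabont (dir.), Tim Robbins, Morgan Freeman"
print(split_credits(sample))
print(split_credits("Tim Robbins, Morgan Freeman"))
```

Returning the cast as a list rather than one string also makes it easier to count or filter actors later.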

In [None]:
imdb = {
    "Movie": titles,
    "Score": ratings,
    "Year": year,
    "Director": directors,
    "Cast": cast,
    "Link": links_okay
}

In [None]:
imdb_df = pd.DataFrame(imdb)
imdb_df

In [None]:
imdb_df.to_csv("imdb.csv")

### Example 4 - 1992 Olympics medal table

In [None]:
url = "https://en.wikipedia.org/wiki/1992_Summer_Olympics_medal_table"

- Table 

![elmedallero](../images/captura_medallero.png)
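One possible way to approach this table, sketched on a hand-written stand-in so that it runs offline. The `wikitable` class and the column layout are assumptions about the real page, and the three rows were typed in for illustration, so they should be checked against the page itself.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the medal table, so the sketch runs offline.
# The class name "wikitable" is an assumption about the real page.
html = """
<table class="wikitable">
  <tr><th>Rank</th><th>NOC</th><th>Gold</th><th>Silver</th><th>Bronze</th></tr>
  <tr><td>1</td><td>Unified Team</td><td>45</td><td>38</td><td>29</td></tr>
  <tr><td>2</td><td>United States</td><td>37</td><td>34</td><td>37</td></tr>
  <tr><td>3</td><td>Germany</td><td>33</td><td>21</td><td>28</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

# One list per row; header cells are <th>, data cells are <td>.
rows = [
    [cell.get_text().strip() for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
header, data = rows[0], rows[1:]

# One dict per country, with the medal counts converted to integers.
medals = [
    {"NOC": noc, "Gold": int(g), "Silver": int(s), "Bronze": int(b)}
    for _rank, noc, g, s, b in data
]
print(header)
print(medals[0])
```

`pd.DataFrame(medals)` would give the table as a DataFrame; on the real page, `pd.read_html` as used in Example 2 is another option.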

## Summary
It's your turn: what have we learned today?


- `prettify()` -> makes the HTML readable
- bs4 (Beautiful Soup) parses the HTML
- We don't need an API: we can access the web page with just the link
- We receive an HTML file
- We can navigate through the tags: both through Inspect in Chrome and in bs4
- We can store info from the tags in a list and convert it to strings
- We can turn the list into a DataFrame
- We can `find_all(tags, attrs)` and we can `select("css.selectors")`

⚠️ Clear all output before pushing