
# Web Scraping
* Beautiful Soup provides a few simple methods and Pythonic idioms for navigating,
searching, and modifying a parse tree: a toolkit for dissecting a document and
extracting what you need. It doesn't take much code to write an application.

To extract data using web scraping with Python, follow these basic steps:

1. Find the URL that you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format 
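
The steps above can be sketched end to end on a small inline HTML string, so no network access is needed. The markup, the class names (`product`, `name`, `price`), and the values are invented for illustration; a real page will use different ones.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page (steps 1 to 3).
html = """
<div class="product"><span class="name">Laptop A</span><span class="price">500</span></div>
<div class="product"><span class="name">Laptop B</span><span class="price">750</span></div>
"""

# Steps 4 and 5: parse the document and pull out the fields of interest.
soup = BeautifulSoup(html, "html.parser")
rows = []
for item in soup.find_all("div", class_="product"):
    name = item.find("span", class_="name").get_text(strip=True)
    price = item.find("span", class_="price").get_text(strip=True)
    rows.append({"name": name, "price": price})

# Step 6: store the result as CSV (written to memory here instead of a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```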

[Python Frameworks and Libraries for Web Scraping](https://www.scrapehero.com/python-web-scraping-frameworks/)

# Step 1: Find the URL that you want to scrape
For this example, we are going to scrape the Flipkart website to extract the price, name, and rating of laptops.


In [0]:
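import requests

# Download the listing page; the timeout avoids hanging on an unresponsive server.
source = requests.get('https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2', timeout=30)
# Status code 200 means the server returned the page successfully.
if source.status_code == 200:
    print('Request succeeded!')
    content = source.content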
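
When choosing a URL it can help to see what its query string contains. The sketch below takes apart the URL used above with the standard library only; nothing is downloaded, and the meaning of each parameter is not documented here, so the code only shows their decoded values.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = ("https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr"
       "?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2")

parts = urlsplit(url)
params = parse_qs(parts.query)  # percent-escapes such as %2C are decoded here
print(parts.netloc)
print(params)

# Re-encoding a parameter gives back the escaped form that appears in the URL.
encoded = urlencode({"sid": params["sid"][0]})
print(encoded)
```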

# Step 2: Inspecting the Page
* The data is usually nested in tags, so we inspect the page to see which tag contains the data we want to scrape. To inspect the page, right-click the element and click “Inspect”.

To do this, hover the mouse over the information you want to capture before right-clicking, so that the inspector opens on that element.
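
The same nesting can be examined from Python once the HTML is available. This is a minimal sketch on an invented fragment (the class names `card`, `title`, and `cost` are hypothetical), using `prettify()` to display the hierarchy.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; real pages are far larger.
html = '<a class="card"><div class="title">Laptop A</div><div class="cost">500</div></a>'
soup = BeautifulSoup(html, "html.parser")

# prettify() indents each nested tag, which makes the hierarchy visible.
print(soup.prettify())

# List each div inside the link together with its class attribute.
nested = [(tag.name, tag.get("class")) for tag in soup.a.find_all("div")]
print(nested)
```

Note that `class` comes back as a list, because an HTML element can carry several classes.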

# Step 3: Find the data you want to extract
* Let’s extract the price, name, and rating, each of which is nested in its own “div” tag.

# Step 4: Write the code

In [0]:
from bs4 import BeautifulSoup
import pandas as pd

As mentioned earlier, the data we want to extract is nested in `<div>` tags. We find the div tags with the respective class names, extract the data, and store it in variables. See the code below:

In [0]:
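# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(content, 'html.parser')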

[Python Regular Expressions](https://developers.google.com/edu/python/regular-expressions)
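
Regular expressions can also be passed to Beautiful Soup's search methods. The sketch below, on invented markup with hypothetical class names, selects every `div` whose class starts with `price-`.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup with class names that share a prefix.
html = """
<div class="price-current">500</div>
<div class="price-old">650</div>
<div class="rating">4.5</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern is tested against each class value of each div.
price_tags = soup.find_all("div", class_=re.compile(r"^price-"))
price_texts = [tag.get_text() for tag in price_tags]
print(price_texts)
```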

In [0]:
products = []  # names of the products
prices = []    # prices of the products
ratings = []   # ratings of the products

# find_all() returns a list of all elements matching the criteria;
# even if only one element is found, the list has a single item.
# Each matching <a> tag holds the data for one product.
for a in soup.find_all('a', attrs={'class': '_31qSD5'}):
    # find() returns the first matching element, or None if there is none.
    name = a.find('div', attrs={'class': '_3wU53n'})
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    rating = a.find('div', attrs={'class': 'hGSR34'})
    # Skip products missing any field, so .text is never called on None
    # and the three lists stay the same length.
    if name is None or price is None or rating is None:
        continue
    print(name.text)
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)
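
The difference between `find()` and `find_all()` described in the comments can be checked on a tiny document. The HTML below is invented for this sketch.

```python
from bs4 import BeautifulSoup

html = "<ul><li>first</li><li>second</li></ul><p>only paragraph</p>"
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")      # list-like result with every match
first_item = soup.find("li")         # first match only
one_paragraph = soup.find_all("p")   # still a list, even with one match
missing = soup.find("table")         # None when nothing matches
no_tables = soup.find_all("table")   # empty result when nothing matches

print(len(all_items), first_item.text, len(one_paragraph), missing, len(no_tables))
```

Because `find()` can return `None`, accessing `.text` on its result without a check raises an `AttributeError` whenever the element is absent.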

In [0]:
# Zoom in: search the soup object for every occurrence of <a class="_31qSD5">.
soup.find_all('a', attrs={'class': '_31qSD5'})

In [0]:
# Find every occurrence of <a class="_31qSD5">, but keep only the first one (index 0).
a = soup.find_all('a', attrs={'class': '_31qSD5'})[0]
a

In [0]:
# Search 'a' for the first div with class '_3wU53n' (find() returns a single element).
a.find('div', attrs={'class':'_3wU53n'})

In [0]:
# Search 'a' for the first div that carries the price classes.
price= a.find('div', attrs={'class':'_1vC4OE _2rQ-NK'})
price
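
When the class value contains two names, as in the price lookup above, a string passed through `attrs` is compared against the full attribute value, so the order of the names matters. The sketch below uses invented class names to show this and a CSS selector alternative via `select_one()`, which does not depend on order.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two classes.
html = '<div class="highlight price">500</div>'
soup = BeautifulSoup(html, "html.parser")

# Same order as in the document: matches.
exact = soup.find("div", attrs={"class": "highlight price"})
# Names swapped: no match, so find() returns None.
swapped = soup.find("div", attrs={"class": "price highlight"})
# CSS selector: requires both classes, in any order.
by_selector = soup.select_one("div.price.highlight")

print(exact is not None, swapped, by_selector.get_text())
```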

In [0]:
# Finally, get the desired value.
price.text
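
The extracted value is a string. If a number is needed, one option is to strip everything except digits, as in this sketch. The helper name `parse_price` and the sample price format are assumptions, and the approach only suits whole-number prices.

```python
import re


def parse_price(text):
    """Return the digits in a price string as an int, or None if there are none."""
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None


print(parse_price("₹45,990"))
print(parse_price("N/A"))
```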

In [0]:
prices

In [0]:
products

In [0]:
ratings

# Step 6: Store the data in a required format

In [0]:
df = pd.DataFrame({'Product Name':products,'Price':prices,'Rating':ratings}) 
df.to_csv('products.csv', index=False, encoding='utf-8')
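
Before saving, the text columns can be converted to numeric types so the CSV is easier to analyze later. This is a sketch with made-up values standing in for the scraped lists; it writes to an in-memory buffer rather than `products.csv`, and a file path works the same way.

```python
import io

import pandas as pd

# Made-up values standing in for the scraped lists.
products = ["Laptop A", "Laptop B"]
prices = ["₹45,990", "₹52,490"]
ratings = ["4.5", "4.2"]

df = pd.DataFrame({"Product Name": products, "Price": prices, "Rating": ratings})

# Convert the text columns to numbers before saving.
df["Price"] = df["Price"].str.replace(r"[^\d]", "", regex=True).astype(int)
df["Rating"] = df["Rating"].astype(float)

# Round-trip through CSV text to confirm the data survives being saved.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
restored = pd.read_csv(io.StringIO(buffer.getvalue()))
print(restored)
```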

# Example 2
Source: [https://medium.com/data-hackers/como-fazer-web-scraping-em-python-23c9d465a37f](https://medium.com/data-hackers/como-fazer-web-scraping-em-python-23c9d465a37f)

# Step 1: Find the URL that you want to scrape

In [0]:
import requests
source= requests.get('https://www.basketball-reference.com/leagues/NBA_2018_totals.html')
if source.status_code == 200:
    print('Request succeeded!')
    content = source.content

# Step 2: Inspecting the Page

# Step 3: Find the data you want to extract

# Step 4: Write the code

In [0]:
soup= BeautifulSoup(content, 'html.parser')
table= soup.find(name='table')

In [0]:
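from io import StringIO

# Recent pandas versions deprecate passing a literal HTML string, so wrap it in StringIO.
table_str = str(table)
df = pd.read_html(StringIO(table_str))[0]
df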
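
As an alternative to `pd.read_html`, a table can be walked row by row with Beautiful Soup alone. The sketch below uses an invented two-row table and assumes the table has `thead` and `tbody` sections; it is not the layout of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the names and numbers are invented.
html = """
<table>
  <thead><tr><th>Player</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><td>Player One</td><td>1200</td></tr>
    <tr><td>Player Two</td><td>950</td></tr>
  </tbody>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

# Column names come from the header cells.
headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

# Each body row becomes a dict keyed by column name.
records = []
for tr in table.find("tbody").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    records.append(dict(zip(headers, cells)))
print(records)
```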

# Example 3

In [0]:
import requests
source= requests.get('http://books.toscrape.com/index.html')
if source.status_code == 200:
    print('Request succeeded!')
    content = source.content

In [0]:
soup = BeautifulSoup(content, 'html.parser')
#print(soup.prettify())

In [0]:
soup.find_all("a", href= True)
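
The tags returned by this search still have to be turned into usable links. The sketch below reads each `href` and resolves relative paths against the page URL with `urljoin`; the HTML fragment is invented and is not the actual content of the site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://books.toscrape.com/index.html"

# Hypothetical fragment; the second anchor has no href and is skipped.
html = """
<a href="catalogue/page-2.html">next</a>
<a name="top">anchor without href</a>
<a href="http://example.com/about">about</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the tags that have an href attribute.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
print(links)
```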

# Example 5
[Tutorial: Web Scraping and BeautifulSoup](https://www.dataquest.io/blog/web-scraping-beautifulsoup/)

In [0]:
import requests
source= requests.get('https://www.imdb.com/search/title/?release_date=2017&sort=num_votes,desc&page=2&ref_=adv_nxt')
if source.status_code == 200:
    print('Request succeeded!')
    content = source.content

In [0]:
soup = BeautifulSoup(content, 'html.parser')
#print(soup.prettify())

In [0]:
soup.find_all('div')
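
Searching for every `div` usually returns far more than the items of interest. A possible next step is to filter by class and, while exploring, cap the number of results with `limit`. The class names `movie` and `title` below are invented for this sketch and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; `movie` and `title` are invented class names.
html = """
<div id="header">Site header</div>
<div class="movie"><h3 class="title">Film A</h3></div>
<div class="movie"><h3 class="title">Film B</h3></div>
<div class="movie"><h3 class="title">Film C</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

every_div = soup.find_all("div")                           # includes the header
movies = soup.find_all("div", class_="movie")              # only the item containers
first_two = soup.find_all("div", class_="movie", limit=2)  # stop after two matches
titles = [m.h3.get_text() for m in movies]
print(len(every_div), len(movies), len(first_two), titles)
```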