# Scraping with Python

This is my notebook for learning web scraping with Python.

So here we go!


## Lesson 1

### Basics of web scraping:
- Libraries we will use:
    - urllib
    - BeautifulSoup (bs4)

### How to Scrape:
1. Use urllib to make a request to the web page you want.
1. Parse the HTML into a BeautifulSoup object.
1. Find the data you need in the object.
1. You're done.
        

In [None]:
# Here we have a simple example of the process:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://alura-site-scraping.herokuapp.com/hello-world.php"
response = urlopen(url) # Send a GET request to the URL
html = response.read() # Read the HTML data from the response (as bytes)
soup = BeautifulSoup(html, 'html.parser') # Parse the HTML into a BeautifulSoup object
print(soup.find('h1', id='hello-world').get_text()) # Find the desired tag
print(soup.find('p').get_text())
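
Two things can go wrong in the process above: the request can fail, and find() returns None when no tag matches, so calling .get_text() on the result raises an AttributeError. The sketch below guards against both. The helper names fetch_html and text_of are made up for illustration, and the sample markup is an assumption that only imitates what the hello-world page might contain, so the parsing part runs without a network call.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except (HTTPError, URLError) as error:
        print(f"Request failed: {error}")
        return None

def text_of(soup, name, **attrs):
    """Return the text of the first matching tag, or None if nothing matches."""
    tag = soup.find(name, attrs=attrs)
    return tag.get_text() if tag is not None else None

# Stand-in markup (an assumption), used instead of a live response.
sample = b'<h1 id="hello-world">Hello World!</h1><p>First paragraph</p>'
soup = BeautifulSoup(sample, 'html.parser')

print(text_of(soup, 'h1', id='hello-world'))  # Hello World!
print(text_of(soup, 'h2'))                    # None, there is no <h2>
```

With a live page, the bytes returned by fetch_html(url) would take the place of sample, after checking that the result is not None.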

## Lesson 2

### Improving a request:
Sometimes we need to customize a request to make it work with the page we need.
This can mean adding an Authorization header, other headers, or even a request body. To do this we create a Request object and attach these to it, as in the code below.
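
As a minimal sketch of a request that carries both extra headers and a body: the endpoint, token and form fields here are hypothetical, and the request is only built and inspected, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, token and form fields, used only for illustration.
url = 'https://example.com/api/search'
headers = {
    'User-Agent': 'Chrome/76.0.3809.100',
    'Authorization': 'Bearer my-secret-token',
}
body = urlencode({'q': 'python', 'page': 1}).encode('utf-8')  # The body must be bytes.

req = Request(url, data=body, headers=headers)
req.add_header('Accept', 'text/html')  # Headers can also be added one at a time.

print(req.get_method())                 # POST, because a body was supplied
print(req.data)                         # b'q=python&page=1'
print(req.get_header('Authorization'))  # Bearer my-secret-token
```

Nothing is sent until the object is passed to urlopen(req).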

### Cleaning up the data:

The object we get from response.read() is of type bytes. It is often easier to convert it to a str, since str offers more convenient methods for text processing.

It's usually a good idea to decode and clean up the data we receive before converting it to a BeautifulSoup object.


In [None]:
from urllib.request import Request, urlopen

url = 'https://www.alura.com.br'
headers = {'User-Agent': 'Chrome/76.0.3809.100'} # Creating a headers dictionary.

req = Request(url, headers = headers) # Creating the request using the headers dictionary.
response = urlopen(req)
b = response.read()

html = b.decode('utf-8') # Decoding bytes to str using UTF-8.
html = html.split() # Splitting the str on whitespace (spaces, line breaks and tabs).
html = " ".join(html) # Re-joining the pieces with a single space between them.
html = html.replace('> <', '><') # Removing the whitespace between tags.

# To reuse these steps, let's create a function that does all of this for us.

def clean_input(text): # 'text' avoids shadowing the built-in input().
    return " ".join(text.split()).replace('> <', '><')
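
To see what the cleaning step does, here is a small offline check. The bytes value is a made-up stand-in for what response.read() might return.

```python
def clean_input(text):
    # Collapse every run of whitespace to one space, then drop the space between tags.
    return " ".join(text.split()).replace('> <', '><')

# Made-up response body standing in for response.read().
raw = b"<div>\n    <p>Hello   world</p>\n</div>\n"

html = clean_input(raw.decode('utf-8'))
print(html)  # <div><p>Hello world</p></div>
```

Whitespace inside text is collapsed as well, so content that relies on its layout (a pre block, for example) would lose it.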


## Lesson 3

### Working with bs4 BeautifulSoup:
- Parse an HTML string using the html.parser
- To improve visualization use soup.prettify(), which returns the indented HTML as a str
- We can find tags using soup.find('tagname')
- Once we have a tag we can get its attributes with tag.attrs (a dictionary, not a method)
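
The points above can be tried on a small literal string first. The markup below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup, so the example runs without a request.
html = '<div><h1 id="title" class="main big">Hello</h1><p>First</p></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.prettify())  # Indented HTML, one tag per line.

h1 = soup.find('h1')   # First <h1> in the document.
print(h1.attrs)        # {'id': 'title', 'class': ['main', 'big']}
print(h1['id'])        # title
print(h1.get('href'))  # None, because the tag has no href attribute
```

The class attribute comes back as a list because a tag can carry several classes.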



In [None]:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def clean_input(text): # 'text' avoids shadowing the built-in input().
    return " ".join(text.split()).replace('> <', '><')

url = 'https://www.alura.com.br'
headers = {'User-Agent': 'Chrome/76.0.3809.100'} # Creating a headers dictionary.

req = Request(url, headers = headers) # Creating the request using the headers dictionary.
response = urlopen(req)
b = response.read()

html = clean_input(b.decode('utf-8')) # Decoding and cleaning.

soup = BeautifulSoup(html, 'html.parser') # Parsing.
pretty = soup.prettify() # prettify() returns an indented str and does not modify soup; print(pretty) to inspect it.

print(soup.find('h1')) # Find first h1
print(soup.find('h1').attrs) # Get attributes from first h1

## Lesson 4

### find() and find_all()

You can use find() to get a single tag and find_all() to get multiple tags, according to some filters.

- find(name, attrs, recursive, string, **kwargs)
- find_all(name, attrs, recursive, string, limit, **kwargs)
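
A short sketch of the most common filters, run against a made-up list:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = '<ul><li class="item">A</li><li class="item special">B</li><li>C</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('li')                               # First match only.
every = soup.find_all('li')                           # Every match.
by_class = soup.find_all('li', class_='item')         # Keyword filter on the CSS class.
by_attrs = soup.find_all('li', {'class': 'special'})  # Same idea with an attrs dictionary.
limited = soup.find_all('li', limit=2)                # Stop after two matches.
by_string = soup.find('li', string='C')               # Match on the tag's text.
missing = soup.find('table')                          # No match gives None.

print(first.get_text())                    # A
print([li.get_text() for li in by_class])  # ['A', 'B']
print([li.get_text() for li in by_attrs])  # ['B']
print(missing)                             # None
```

find() returns None when nothing matches, while find_all() returns an empty list, so only the result of find() needs a None check before use.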

### find_parent(), find_next_sibling() and find_previous_sibling()

Searching parents and siblings is an easy way to navigate through the HTML to get where you want to go.

- find_parent(name, attrs, **kwargs)
- find_parents(name, attrs, limit, **kwargs)
- find_next_sibling(name, attrs, string, **kwargs)
- find_next_siblings(name, attrs, string, limit, **kwargs)
- find_previous_sibling(name, attrs, string, **kwargs)
- find_previous_siblings(name, attrs, string, limit, **kwargs)
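
A small sketch of moving up and sideways from a tag, using made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = ('<div id="card"><h2>Title</h2>'
        '<p class="a">one</p><p class="b">two</p><p class="c">three</p></div>')
soup = BeautifulSoup(html, 'html.parser')

middle = soup.find('p', class_='b')           # Start from the second paragraph.

parent = middle.find_parent('div')            # Closest enclosing <div>.
next_p = middle.find_next_sibling('p')        # Next <p> at the same level.
prev_ps = middle.find_previous_siblings('p')  # All earlier <p> siblings.
heading = middle.find_previous_sibling('h2')  # Closest earlier <h2> sibling.

print(parent['id'])                        # card
print(next_p.get_text())                   # three
print([p.get_text() for p in prev_ps])     # ['one']
print(heading.get_text())                  # Title
```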


## Lesson 5

### Identifying and storing data.

Now we need to identify where in the HTML the data we need lives, and then store it.

- Open the page in a web browser and analyse it.
- Figure out where the important data is.
- Store this data in a dictionary or a database.
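
As a sketch of the steps above, the loop below collects one dictionary per ad into a list. The markup is a stand-in written for this example; the class names follow the txt- prefix convention, but the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Stand-in markup (an assumption); the real page layout may differ.
html = """
<div class="card"><div class="body-card">
  <p class="txt-name">Car A</p><p class="txt-category">Used</p>
</div><p class="txt-value">R$ 100</p></div>
<div class="card"><div class="body-card">
  <p class="txt-name">Car B</p><p class="txt-category">New</p>
</div><p class="txt-value">R$ 200</p></div>
"""
soup = BeautifulSoup(html, 'html.parser')

cards = []  # One dictionary per ad.
for ad in soup.find_all('div', class_='card'):
    card = {}
    for info in ad.find('div', class_='body-card').find_all('p'):
        key = info.get('class')[0].split('-')[-1]  # 'txt-name' -> 'name'
        card[key] = info.get_text()
    card['value'] = ad.find('p', class_='txt-value').get_text()
    cards.append(card)

print(cards)
```

Building a fresh card dictionary inside the loop keeps each ad's data separate.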

### Working with Data Frames

Data Frames are useful for converting your data to a file format better suited to data analysis.

- We will use Pandas
- Create a dataframe from our dict and export it to a file.
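
A minimal sketch of that conversion, assuming the scraped data has already been collected as a list of dictionaries. The records below are made up.

```python
import pandas as pd

# Made-up records, shaped like the dictionaries collected while scraping.
cards = [
    {'name': 'Car A', 'value': 'R$ 100', 'accessories': ['Radio', 'Alarm']},
    {'name': 'Car B', 'value': 'R$ 200', 'accessories': ['Airbag']},
]

dataset = pd.DataFrame(cards)  # One row per dictionary, one column per key.

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = dataset.to_csv(sep=';', index=False)

print(dataset.shape)  # (2, 3)
print(csv_text)
```

Passing a file path as the first argument of to_csv writes the same text to disk instead.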

### Storing images

Sometimes we are interested in storing images as well. To do this we use urlretrieve.

- Find the source (src) of the image.
- Use urlretrieve to download it to a directory.
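
The two steps above can be sketched as follows. The helper names image_target and download_image, the page URL and the src value are all hypothetical. The sketch also covers two cases that may or may not apply to a given page: a relative src, and a missing output folder.

```python
import os
from urllib.parse import urljoin, urlsplit
from urllib.request import urlretrieve

def image_target(page_url, src, out_dir):
    """Return (absolute image URL, local file path) for an <img> src value."""
    absolute = urljoin(page_url, src)                     # Resolves a relative src against the page URL.
    filename = os.path.basename(urlsplit(absolute).path)  # File name without any ?query part.
    return absolute, os.path.join(out_dir, filename)

def download_image(page_url, src, out_dir):
    """Download the image into out_dir and return the local path."""
    absolute, target = image_target(page_url, src, out_dir)
    os.makedirs(out_dir, exist_ok=True)  # urlretrieve does not create folders.
    urlretrieve(absolute, target)
    return target

# Hypothetical page URL and src value; nothing is downloaded here.
image_url, local_path = image_target(
    'https://example.com/cars/index.php', 'img/car-01.jpg?v=2', 'output/img')
print(image_url)   # https://example.com/cars/img/car-01.jpg?v=2
print(local_path)
```

Calling download_image with the same arguments would perform the actual download.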


In [None]:
from urllib.request import Request, urlopen, urlretrieve
from bs4 import BeautifulSoup
import pandas as pd

def clean_input(text): # 'text' avoids shadowing the built-in input().
    return " ".join(text.split()).replace('> <', '><')

url = 'https://alura-site-scraping.herokuapp.com/index.php'
headers = {'User-Agent': 'Chrome/76.0.3809.100'} # Creating a headers dictionary.

req = Request(url, headers = headers) # Creating the request using the headers dictionary.
response = urlopen(req)
b = response.read()

html = clean_input(b.decode('utf-8')) # Decoding and cleaning.

soup = BeautifulSoup(html, 'html.parser') # Parsing.

cards = [] # List to hold all cards.
card = {} # Dictionary to store the information.

anuncio = soup.find('div', {'class': 'well card'}) # Find one ad on the page.

infos = anuncio.find('div', {'class':'body-card'}).find_all('p') # Find the info paragraphs inside the card body.

for info in infos:
    card[info.get('class')[0].split('-')[-1]] = info.get_text() # Storing infos in the dict dynamically, keyed by the class-name suffix.

card['valor'] = anuncio.find('p',{'class':'txt-value'}).get_text() # Storing more info from other divs

items = anuncio.find('div', {'class':'body-card'}).ul.find_all('li') # Find the accessories inside a body of card.

items.pop() # Removing last item '...'

accessories = [] # Create a list for the accessories.

for a in items:
    accessories.append(a.get_text().replace('► ', '')) # Adding accessories to the list.

card['accessories'] = accessories # Adding to dict

dataset = pd.DataFrame.from_dict(card, orient = 'index').T # Creating a DataFrame from our card data.
dataset.to_csv('./output/data/dataset.csv', sep=';', index = False, encoding = 'utf-8-sig') # Exporting the info (the ./output/data folder must already exist).

image = anuncio.find('div', {'class':'image-card'}).find('img') # Find the image inside the card.
urlretrieve(image.get('src'), './output/img/' + image.get('src').split('/')[-1]) # Downloading the image to the ./output/img folder (it must already exist).