### 1. From HTML

*Using only Beautiful Soup*

Use web scraping to collect the following information into a dataframe. Each row must hold these values in separate columns:

- The page title
- The id of the div the value was scraped from. If the div has no id, the value must be numpy.nan
- The name of the tag the value was scraped from
- The following scraped values, one per row:
    - The value: "Este es el segundo párrafo"  --> Row 1
    - The url https://pagina1.xyz/ --> Row 2
    - The url https://pagina4.xyz/ --> Row 3
    - The url https://pagina5.xyz/ --> Row 4
    - The value "links footer-links" --> Row 5
    - The value "Este párrafo está en el footer" --> Row 6

In [1]:
html = """<html lang="es">
<head>
    <meta charset="UTF-8">
    <title>Página de prueba</title>
</head>
<body>
<div id="main" class="full-width">
    <h1>El título de la página</h1>
    <p>Este es el primer párrafo</p>
    <p>Este es el segundo párrafo</p>
    <div id="innerDiv">
        <div class="links">
            <a href="https://pagina1.xyz/">Enlace 1</a>
            <a href="https://pagina2.xyz/">Enlace 2</a>
        </div>
        <div class="right">
            <div class="links">
                <a href="https://pagina3.xyz/">Enlace 3</a>
                <a href="https://pagina4.xyz/">Enlace 4</a>
            </div>
        </div>
    </div>
    <div id="footer">
        <!-- El footer -->
        <p>Este párrafo está en el footer</p>
        <div class="links footer-links">
            <a href="https://pagina5.xyz/">Enlace 5</a>
        </div>
    </div>
</div>
</body>
</html>"""

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import numpy as np
import smtplib

In [3]:
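# Name the parser explicitly to avoid bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')

# Collect the id of every div that has one, in document order
div_id = [div['id'] for div in soup.find_all('div', id=True)]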

div_id

['main', 'innerDiv', 'footer']

In [4]:
# Get the page title
Titulo = soup.title.string
Titulo

'Página de prueba'

In [5]:
# Get the second paragraph (index 1 among all <p> tags)
Seg_par = soup.find_all('p')[1]
Seg_par

<p>Este es el segundo párrafo</p>

In [6]:
# Collect the href of every <a> tag in a list
Urls = soup.find_all('a')
Lst_url = [url.get('href') for url in Urls]
Lst_url

['https://pagina1.xyz/',
 'https://pagina2.xyz/',
 'https://pagina3.xyz/',
 'https://pagina4.xyz/',
 'https://pagina5.xyz/']

In [7]:
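# Find the footer links div and rebuild its class attribute as one string
# (Beautiful Soup returns multi-valued attributes such as class as a list)
Link_F = ' '.join(soup.find('div', class_='footer-links')['class'])
Link_F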

'links footer-links'

In [8]:
# Get the text of the third paragraph (the one inside the footer)
Estepa = soup.find_all('p')[2].text
Estepa

'Este párrafo está en el footer'
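
Positional lookups such as `find_all('p')[2]` stop working as soon as another paragraph is added to the page. A minimal sketch of the same lookups using CSS selectors, assuming `select` and `select_one` are acceptable under the "only Beautiful Soup" constraint; `sample_html` is a trimmed copy of the exercise HTML and the variable names are illustrative:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the exercise HTML, just enough to show the selectors.
sample_html = """
<div id="main">
    <p>Este es el primer párrafo</p>
    <p>Este es el segundo párrafo</p>
    <div id="footer">
        <p>Este párrafo está en el footer</p>
        <div class="links footer-links">
            <a href="https://pagina5.xyz/">Enlace 5</a>
        </div>
    </div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# "#footer > p": a <p> that is a direct child of the element with id="footer".
footer_paragraph = sample_soup.select_one("#footer > p").get_text()

# "#main > p" keeps only the top-level paragraphs, so index 1 is the second one.
second_paragraph = sample_soup.select("#main > p")[1].get_text()

# The class attribute comes back as a list, so join it into a single string.
footer_classes = " ".join(sample_soup.select_one("div.footer-links")["class"])

print(footer_paragraph)
print(second_paragraph)
print(footer_classes)
```

Selecting by id or class ties each value to its position in the page structure rather than to how many `<p>` tags happen to precede it.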

In [9]:
# Names of every <a> and <p> tag, in document order
a_tags = [tag.name for tag in soup.find_all(['a', 'p'])]
a_tags

['p', 'p', 'a', 'a', 'a', 'a', 'p', 'a']

In [10]:
Values = [Seg_par.text, Lst_url[0], Lst_url[3], Lst_url[4] ]
print(Titulo)
print(Seg_par.text)
print(div_id)
print(Lst_url)
print(Link_F)
print(Estepa)
print(a_tags)

Página de prueba
Este es el segundo párrafo
['main', 'innerDiv', 'footer']
['https://pagina1.xyz/', 'https://pagina2.xyz/', 'https://pagina3.xyz/', 'https://pagina4.xyz/', 'https://pagina5.xyz/']
links footer-links
Este párrafo está en el footer
['p', 'p', 'a', 'a', 'a', 'a', 'p', 'a']


In [11]:
# One row per scraped value; Div_id is the closest enclosing div that has an id
data = {
    'Titulo': [Titulo] * 6,
    'Div_id': ['main', 'innerDiv', 'innerDiv', 'footer', 'footer', 'footer'],
    'Tag': ['p', 'a', 'a', 'a', 'div', 'p'],  # the class value in row 4 comes from a <div>
    'Value': [Seg_par.text, Lst_url[0], Lst_url[3], Lst_url[4], Link_F, Estepa],
}
pd.DataFrame.from_dict(data)


Unnamed: 0,Titulo,Div_id,Tag,Value
0,Página de prueba,main,p,Este es el segundo párrafo
1,Página de prueba,innerDiv,a,https://pagina1.xyz/
2,Página de prueba,innerDiv,a,https://pagina4.xyz/
3,Página de prueba,footer,a,https://pagina5.xyz/
4,Página de prueba,footer,div,links footer-links
5,Página de prueba,footer,p,Este párrafo está en el footer
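
The `Div_id` and `Tag` columns above are typed by hand. A minimal sketch of deriving them from the parsed tree instead, assuming "the div where the value is scraped" means the closest enclosing div that has an id (under a stricter reading that only looks at the immediate parent div, the link rows would be `numpy.nan`). `nearest_div_id` is a hypothetical helper and `html_doc` is a shortened copy of the exercise HTML:

```python
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# Shortened copy of the exercise HTML (only the elements that are scraped).
html_doc = """<html lang="es">
<head><title>Página de prueba</title></head>
<body>
<div id="main" class="full-width">
    <p>Este es el primer párrafo</p>
    <p>Este es el segundo párrafo</p>
    <div id="innerDiv">
        <div class="links">
            <a href="https://pagina1.xyz/">Enlace 1</a>
        </div>
        <div class="right"><div class="links">
            <a href="https://pagina4.xyz/">Enlace 4</a>
        </div></div>
    </div>
    <div id="footer">
        <p>Este párrafo está en el footer</p>
        <div class="links footer-links">
            <a href="https://pagina5.xyz/">Enlace 5</a>
        </div>
    </div>
</div>
</body>
</html>"""

doc = BeautifulSoup(html_doc, "html.parser")


def nearest_div_id(tag):
    """Hypothetical helper: id of the tag itself if it is a div with an id,
    otherwise the id of its closest ancestor div that has one, else NaN."""
    if tag.name == "div" and tag.has_attr("id"):
        return tag["id"]
    parent = tag.find_parent("div", id=True)
    return parent["id"] if parent is not None else np.nan


second_p = doc.find_all("p")[1]
link1 = doc.find("a", href="https://pagina1.xyz/")
link4 = doc.find("a", href="https://pagina4.xyz/")
link5 = doc.find("a", href="https://pagina5.xyz/")
footer_div = doc.find("div", class_="footer-links")
footer_p = doc.find(id="footer").find("p")

# Pair each source tag with the value scraped from it.
scraped = [
    (second_p, second_p.get_text()),
    (link1, link1["href"]),
    (link4, link4["href"]),
    (link5, link5["href"]),
    (footer_div, " ".join(footer_div["class"])),
    (footer_p, footer_p.get_text()),
]

df = pd.DataFrame(
    [
        {
            "Titulo": doc.title.string,
            "Div_id": nearest_div_id(tag),
            "Tag": tag.name,
            "Value": value,
        }
        for tag, value in scraped
    ]
)
print(df)
```

Because each row is built from the tag it came from, the tag name and div id cannot drift out of sync with the value.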


### 2. From Amazon

*Using Beautiful Soup and/or regex*

Use web scraping on Amazon product pages to build a dataframe:

- Get the product name from the page and save it in a column called "item_name"
- Get the price from the page and save it in a column called "item_price"

Document each step as you work through the exercise. Try to make the program work for generic product pages. If you cannot make it generic, explain why.

-------------------------------

**Example:** 

url = https://www.amazon.es/Tommy-Hilfiger-UM0UM00054-Camiseta-Hombre/dp/B01MYD0T1F/ref=sr_1_1?dchild=1&pf_rd_p=58224bec-cac9-4dd2-a42a-61b1db609c2d&pf_rd_r=VZQ1JTQXFVRZ9E9VSKX4&qid=1595364419&s=apparel&sr=1-1

*item_name* --> "Tommy Hilfiger Logo Camiseta de Cuello Redondo,Perfecta para El Tiempo Libre para Hombre"

*item_price* --> [[18,99 € - 46,59 €]] or one of the options.
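
A price shown as a range can be split into numbers with a regular expression. This is a minimal sketch, assuming European formatting (dot for thousands, comma for decimals) as in the example above; `parse_prices` is a hypothetical helper, not part of the exercise:

```python
import re


def parse_prices(price_text):
    """Return every amount found in the text as a float.

    Assumes European formatting: '.' as thousands separator, ',' as decimal mark.
    """
    amounts = re.findall(r"\d+(?:\.\d{3})*(?:,\d{1,2})?", price_text)
    # Drop thousands separators, then turn the decimal comma into a dot.
    return [float(a.replace(".", "").replace(",", ".")) for a in amounts]


price_range = parse_prices("18,99 € - 46,59 €")
print(price_range)
```

A single price yields a one-element list, so the same function covers both the range and the "one of the options" case.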




In [17]:
# Browse Amazon to find something funny
url = 'https://www.amazon.es/joven-haberte-comprado-sart%C3%A9n-antiadherente/dp/1640015426/ref=sr_1_20?dchild=1&keywords=libros+para+colorear+adultos&qid=1595668292&sr=8-20'


In [18]:
# Make the soup
page = requests.get(url, headers={'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'})
soup = BeautifulSoup(page.content, 'html.parser')


In [22]:
# Explore the tags present in the response
Tag_list = [tag.name for tag in soup.find_all(True)]
Tag_list

['html',
 'head',
 'meta',
 'meta',
 'meta',
 'title',
 'meta',
 'link',
 'script',
 'body',
 'div',
 'div',
 'div',
 'i',
 'div',
 'div',
 'i',
 'h4',
 'p',
 'div',
 'div',
 'div',
 'form',
 'input',
 'input',
 'div',
 'div',
 'div',
 'h4',
 'div',
 'img',
 'div',
 'div',
 'div',
 'div',
 'a',
 'input',
 'div',
 'div',
 'span',
 'span',
 'button',
 'div',
 'div',
 'div',
 'a',
 'span',
 'span',
 'span',
 'span',
 'a',
 'div',
 'script',
 'noscript',
 'img',
 'script']

In [14]:
link = soup.find_all(id="productTitle")
#<span id="productTitle" class="a-size-extra-large">
#Un día eres joven y al otro eres feliz por haberte comprado una sartén antiadherente: Un libro de colorear para adultos
#</span>
link

[]
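
An empty result here does not have to mean the selector is wrong. The short tag listing above (a form, a few inputs, a button, no product markup) suggests the server may have returned a different page than the one seen in the browser, although that is an assumption. A minimal sketch for checking what was actually received; `describe_response` is a hypothetical helper and both sample pages are made-up placeholders, not real Amazon responses:

```python
from bs4 import BeautifulSoup


def describe_response(html_text):
    """Summarise a response body so a missing element can be diagnosed."""
    parsed = BeautifulSoup(html_text, "html.parser")
    return {
        "page_title": parsed.title.get_text(strip=True) if parsed.title else None,
        "tag_count": len(parsed.find_all(True)),
        "has_product_title": parsed.find(id="productTitle") is not None,
        "has_form": parsed.find("form") is not None,
    }


# Made-up placeholder pages, used only to exercise the helper.
placeholder_product_page = (
    "<html><head><title>Some product</title></head>"
    "<body><span id='productTitle'>Some product</span></body></html>"
)
placeholder_form_page = (
    "<html><head><title>Check</title></head>"
    "<body><form><input name='q'><button>Continue</button></form></body></html>"
)

product_summary = describe_response(placeholder_product_page)
form_summary = describe_response(placeholder_form_page)
print(product_summary)
print(form_summary)
```

In the notebook this would be called as `describe_response(page.text)`, alongside a look at `page.status_code`, before concluding that an id does not exist.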

In [15]:
# for child in soup.body.descendants:
  #  print(child)

In [16]:
child

NameError: name 'child' is not defined

In [17]:
# We need to find more ingredients for the soup
title = soup.find(id="productTitle").get_text()  # the title element's id is "productTitle", not "Title"
price = soup.find(id="priceblock_ourprice").get_text()


AttributeError: 'NoneType' object has no attribute 'get_text'

In [15]:
# Throw the URL into the pot to make the soup
url = 'https://www.amazon.es/joven-haberte-comprado-sart%C3%A9n-antiadherente/dp/1640015426/ref=sr_1_20?dchild=1&keywords=libros+para+colorear+adultos&qid=1595668292&sr=8-20'

def get_page_contents(url):
    page = requests.get(url, headers={"Accept-Language": "en-US"})
    return BeautifulSoup(page.text, "html.parser")

sopita = get_page_contents(url)

In [16]:
# The keyword is id, not id_ (only class_ needs the trailing underscore)
Product_name = sopita.find_all(id='productTitle')
Product_name

[]

In [23]:
# I found a solved version of this exercise on towardsdatascience.
# Import the required libraries
import requests
from bs4 import BeautifulSoup

# Fill in the headers with my own data, obtained using xhaus
headers = {"User-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}

# Replace the URL with something cool
url = 'https://www.amazon.de/gp/product/B0756CYWWD/ref=as_li_tl?ie=UTF8&tag=idk01e-21&camp=1638&creative=6742&linkCode=as2&creativeASIN=B0756CYWWD&linkId=18730d371b945bad11e9ea58ab9d8b32'


# Fetch the page content with requests, sending our headers
page = requests.get(url, headers=headers)
# Make the soup
soup = BeautifulSoup(page.content, 'html.parser')

title = soup.find(id="productTitle").get_text()  # the method is get_text(), not get_Text()
price = soup.find(id="priceblock_ourprice").get_text()
# Keep the part before the decimal comma and drop the thousands separators
sep = ','
con_price = price.split(sep, 1)[0]
converted_price = int(con_price.replace('.', ''))

# price
print(title.strip())
print(converted_price)


AttributeError: 'NoneType' object has no attribute 'get_text'

In [22]:
import requests
from bs4 import BeautifulSoup


# Fill in the headers with my own data, obtained using xhaus
headers = {"User-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}

# Replace the URL with something cool
url = 'https://www.amazon.es/joven-haberte-comprado-sart%C3%A9n-antiadherente/dp/1640015426/ref=sr_1_20?dchild=1&keywords=libros+para+colorear+adultos&qid=1595668292&sr=8-20'


# Fetch the page content with requests, sending our headers
page = requests.get(url, headers=headers)
# Make the soup
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)

None


In [30]:


headers = {
    "User-agent": 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}

URL = 'https://www.amazon.de/gp/product/B0756CYWWD/ref=as_li_tl?ie=UTF8&tag=idk01e-21&camp=1638&creative=6742&linkCode=as2&creativeASIN=B0756CYWWD&linkId=18730d371b945bad11e9ea58ab9d8b32'
def amazon():

    page = requests.get(URL, headers=headers)

    soup = BeautifulSoup(page.content, 'html.parser')

    title = soup.find(id="productTitle").get_text()
    price = soup.find(id="priceblock_ourprice").get_text()
    # Keep the part before the decimal comma and drop the thousands separators
    sep = ','
    con_price = price.split(sep, 1)[0]
    converted_price = int(con_price.replace('.', ''))

    # price
    print(title.strip())
    print(converted_price)

amazon()

AttributeError: 'NoneType' object has no attribute 'get_text'
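
Every attempt above fails at the same point: `find` returns `None` and the chained `get_text()` raises. A minimal sketch of an extraction step that tolerates missing elements and produces the requested `item_name` and `item_price` columns. It works on an HTML string so it can be tried without a network request; `text_by_id` and `scrape_product` are hypothetical helpers, the two sample pages are made-up placeholders, and the ids `productTitle` and `priceblock_ourprice` are the ones used in the cells above, with no guarantee that every product page exposes them:

```python
import pandas as pd
from bs4 import BeautifulSoup


def text_by_id(parsed, element_id):
    """Return the stripped text of the element with this id, or None if absent."""
    element = parsed.find(id=element_id)
    return element.get_text(strip=True) if element is not None else None


def scrape_product(html_text):
    """Extract name and price from a product page; missing fields become None."""
    parsed = BeautifulSoup(html_text, "html.parser")
    return {
        "item_name": text_by_id(parsed, "productTitle"),
        "item_price": text_by_id(parsed, "priceblock_ourprice"),
    }


# Made-up placeholder pages: one with both elements, one with neither.
sample_product_html = """
<span id="productTitle"> Example product </span>
<span id="priceblock_ourprice">18,99 €</span>
"""
sample_empty_html = "<html><body><form></form></body></html>"

products = pd.DataFrame(
    [scrape_product(sample_product_html), scrape_product(sample_empty_html)]
)
print(products)
```

With a live request, `scrape_product(page.text)` would take the place of the sample strings. A row of missing values then signals that the page did not contain the expected elements, which is the case to document when explaining why the program is not generic.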