If you want to build a service on top of web scraping, you may run into IP blocks and have to manage proxies. It helps to understand the underlying technologies and processes and, for scraping at scale, to know which problems can come up and how to solve them.

In [1]:
# Don't forget to install bs4 (a thin wrapper package that pulls in beautifulsoup4)
!pip install bs4

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2


Creating a User-Agent

Many websites use mechanisms to block bots from accessing their data. To extract data from a script, we therefore send a User-Agent header with the request. The User-Agent is a string that tells the server what kind of client (browser, operating system) is sending the request.

In [2]:
# Request headers. "Accept-Language" tells the server which language we prefer
# (Spanish here); the site may answer with a localized page if it has one.
usuario_fake = ({'User-Agent':
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
            'Accept-Language': 'es-ES, es;q=0.5'})
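
To see which headers the server would receive, a request can be prepared without being sent. This is a small sketch, not part of the original notebook: `example.com` is only a placeholder URL, and the header values are copied from the cell above.

```python
import requests

# Same header values as in the cell above; example.com is only a placeholder URL.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'es-ES, es;q=0.5',
}

# Without custom headers, requests identifies itself as "python-requests/<version>",
# which is easy for a site to recognise as a script.
default_ua = requests.utils.default_user_agent()

# Prepare (but do not send) a request to inspect the headers that would go out.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request('GET', 'https://example.com/', headers=headers)
)
sent_ua = prepared.headers['User-Agent']

print("Default:", default_ua)
print("Custom: ", sent_ua)
```

Headers passed on the request override the session defaults, so the custom User-Agent replaces the `python-requests` one.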

A web page is accessed through its URL (Uniform Resource Locator). Using the URL, we send a request to the page to access its data.

In [3]:
import requests
import bs4 
import pandas as pd


In [4]:
!pip install lxml



In [5]:
URL = "https://www.amazon.com/ZOTAC-Graphics-IceStorm-Advanced-ZT-A30900J-10P/dp/B08ZL6XD9H/"
pagina_web = requests.get(URL, headers=usuario_fake)
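
Before parsing, it may be worth checking whether the response really is the product page. The helper below is a hypothetical sketch: `looks_blocked` and its marker strings are assumptions made for illustration, not part of the original notebook and not a definitive list of what Amazon returns. It flags non-200 status codes and pages whose text mentions a CAPTCHA.

```python
def looks_blocked(status_code, html):
    """Return True if the response probably is not the page we asked for.

    Heuristic only: the marker strings below are assumptions for illustration.
    """
    if status_code != 200:
        return True
    text = html.lower()
    markers = ("captcha", "robot check")
    return any(marker in text for marker in markers)


# Usage with hand-written sample responses (no network access needed).
print(looks_blocked(200, "<span id='productTitle'>Graphics card</span>"))
print(looks_blocked(503, ""))
print(looks_blocked(200, "<title>Robot Check</title> Enter the characters (CAPTCHA)"))
```

With the response from the cell above, it could be called as `looks_blocked(pagina_web.status_code, pagina_web.text)`.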


In [6]:
sopa = bs4.BeautifulSoup(pagina_web.content,'lxml')

### Extracting the title (example)


In [7]:
title = sopa.find("span", attrs={"id":'productTitle'})
title_value = title.string
print(title_value.strip())

AttributeError: 'NoneType' object has no attribute 'string'
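
The `AttributeError` above appears because `find` returns `None` when no element matches, and `None` has no `.string` attribute. A minimal sketch with two hand-written HTML snippets (invented for illustration; `html.parser` is used here so that lxml is not required, and `title_or_empty` is a hypothetical helper):

```python
from bs4 import BeautifulSoup

# Two tiny hand-written pages: one with the element, one without it.
page_ok = '<html><body><span id="productTitle">  Example card  </span></body></html>'
page_missing = '<html><body><p>Something else</p></body></html>'

def title_or_empty(html):
    """Return the stripped productTitle text, or "" when the span is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('span', attrs={'id': 'productTitle'})
    if tag is None:          # nothing matched: avoid calling .string on None
        return ""
    return tag.get_text(strip=True)

print(repr(title_or_empty(page_ok)))
print(repr(title_or_empty(page_missing)))
```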

### Building a complete example for one product

In [None]:
from bs4 import BeautifulSoup
import requests

def get_title(soup):
	try:
		title = soup.find("span", attrs={"id":'productTitle'})
		title_value = title.string
		title_string = title_value.strip()
	except AttributeError:
		title_string = ""	
	return title_string

def get_price(soup):
	try:
		price = soup.find("span", attrs={'class':'a-offscreen'}).text.replace(',', '')
	except AttributeError:
		price = ""	
	return price

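def get_rating(soup):
	# The first selector hard-codes one star class (a-star-4-5), so fall back
	# to the generic "a-icon-alt" span when it does not match.
	try:
		rating = soup.find("i", attrs={'class':'a-icon a-icon-star a-star-4-5'}).string.strip()
	except AttributeError:
		try:
			rating = soup.find("span", attrs={'class':'a-icon-alt'}).string.strip()
		except AttributeError:
			rating = ""
	return rating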

def get_review_count(soup):
	try:
		review_count = soup.find("span", attrs={'id':'acrCustomerReviewText'}).string.strip()
	except AttributeError:
		review_count = ""	
	return review_count

def get_availability(soup):
	try:
		available = soup.find("div", attrs={'id':'availability'})
		available = available.find("span").string.strip()
	except AttributeError:
		available = ""	
	return available	

if __name__ == '__main__':

	HEADERS = ({'User-Agent':
	            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
	            'Accept-Language': 'en-US, en;q=0.5'})

	URL = "https://www.amazon.com/ZOTAC-Graphics-IceStorm-Advanced-ZT-A30900J-10P/dp/B08ZL6XD9H/"
	webpage = requests.get(URL, headers=HEADERS)

	soup = BeautifulSoup(webpage.content, "lxml")

	print("Product Title =", get_title(soup))
	print("Product Price =", get_price(soup))
	print("Product Rating =", get_rating(soup))
	print("Number of Product Reviews =", get_review_count(soup))
	print("Availability =", get_availability(soup))
	print()
	print()

Product Title = ZOTAC Gaming GeForce RTX™ 3090 Trinity OC 24GB GDDR6X 384-bit 19.5 Gbps PCIE 4.0 Gaming Graphics Card, IceStorm 2.0 Advanced Cooling, Spectra 2.0 RGB Lighting, ZT-A30900J-10P
Product Price = $1329.95
Product Rating = 4.2 out of 5 stars
Number of Product Reviews = 318 ratings
Availability = Only 1 left in stock - order soon
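
The scraped values are strings such as `$1329.95` or `4.2 out of 5 stars`. If numbers are needed later, they could be converted along these lines. The helper names (`parse_price`, `parse_rating`, `parse_review_count`) are hypothetical and assume the English formats shown in the output above.

```python
import re

def parse_price(text):
    """'$1,329.95' -> 1329.95; returns None for empty or unparseable input."""
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    return float(match.group().replace(',', '')) if match else None

def parse_rating(text):
    """'4.2 out of 5 stars' -> 4.2; returns None if no number is found."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group()) if match else None

def parse_review_count(text):
    """'1,407 ratings' -> 1407; returns None if no number is found."""
    match = re.search(r'\d[\d,]*', text)
    return int(match.group().replace(',', '')) if match else None

print(parse_price("$1329.95"))
print(parse_rating("4.2 out of 5 stars"))
print(parse_review_count("318 ratings"))
print(parse_price(""))
```

Returning `None` for missing values keeps them distinguishable from a real price of zero.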




### Building an example from a search results page

In [None]:
from bs4 import BeautifulSoup
import requests
import csv

def get_title(soup):
    try:
        title = soup.find("span", attrs={"id":'productTitle'}).string.strip()
    except AttributeError:
        return ""    
    return title

def get_price(soup):
    try:
        price = soup.find("span", attrs={'class':'a-offscreen'}).string.strip()
    except AttributeError:
        # Fall back to the deal-price element when the first selector finds nothing.
        try:
            price = soup.find("span", attrs={'id':'priceblock_dealprice'}).string.strip()
        except AttributeError:
            return ""
    return price

def get_rating(soup):
    try:
        rating = soup.find("i", attrs={'class':'a-icon a-icon-star a-star-4-5'}).string.strip()
    except AttributeError:
        # Fall back to the generic rating text span.
        try:
            rating = soup.find("span", attrs={'class':'a-icon-alt'}).string.strip()
        except AttributeError:
            return ""
    return rating

def get_review_count(soup):
    try:
        review_count = soup.find("span", attrs={'id':'acrCustomerReviewText'}).string.strip()
    except AttributeError:
        return ""
    return review_count

def get_availability(soup):
    try:
        available = soup.find("div", attrs={'id':'availability'}).find("span").string.strip()
    except AttributeError:
        return "Not Available"
    return available

def main():
    HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36', 'Accept-Language': 'en-US'}
    URL = input("Enter the Amazon search results URL: ")  # Read the URL from the console so the script also works from a terminal
    webpage = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(webpage.content, "lxml")

    links = soup.find_all("a", attrs={'class':'a-link-normal s-no-outline'})
    links_list = [link.get('href') for link in links]


    with open('amazon_product.csv', 'w', newline='', encoding="utf-8-sig") as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerow(['Title', 'Price', 'Rating', 'ReviewCount', 'Availability'])

        for link in links_list:
            new_webpage = requests.get("https://www.amazon.com" + link, headers=HEADERS)
            new_soup = BeautifulSoup(new_webpage.content, "lxml")

            title = get_title(new_soup)
            price = get_price(new_soup)
            rating = get_rating(new_soup)
            review_count = get_review_count(new_soup)
            availability = get_availability(new_soup)

            print("Product Title =", title)
            print("Product Price =", price)
            print("Product Rating =", rating)
            print("Number of Product Reviews =", review_count)
            print("Availability =", availability)
            print('\n')

            # Write the product information to the CSV file.
            csv_writer.writerow([title, price, rating, review_count, availability])

if __name__ == '__main__':
    main()


Product Title = maxsun AMD Radeon RX 550 4GB GDDR5 ITX Computer PC Gaming Video Graphics Card GPU 128-Bit DirectX 12 PCI Express X16 3.0 DVI-D Dual Link, HDMI, DisplayPort
Product Price = $109.99
Product Rating = 4.2 out of 5 stars
Number of Product Reviews = 984 ratings
Availability = In Stock


Product Title = GPVHOSO GeForce GT 1030 GDDR5 2GB 64 Bit DVI-D HDMI Output Graphics Card Computer, Video Card for PC Gaming
Product Price = $99.98
Product Rating = 3.5 out of 5 stars
Number of Product Reviews = 2 ratings
Availability = Only 7 left in stock - order soon


Product Title = MSI Gaming GeForce RTX 3060 12GB 15 Gbps GDRR6 192-Bit HDMI/DP PCIe 4 Torx Twin Fan Ampere OC Graphics Card (Ventus 2X 12G OC)
Product Price = $279.99
Product Rating = 4.7 out of 5 stars
Number of Product Reviews = 1,407 ratings
Availability = In Stock


Product Title = MSI Gaming GeForce RTX 3050 8GB GDRR6 128-Bit HDMI/DP PCIe 4 Torx Twin Fans Ampere OC Graphics Card (RTX 3050 Ventus 2X 8G OC)
Product Price = 
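
The script above prepends `https://www.amazon.com` to every `href`, which assumes relative paths, and the same product can appear more than once on a results page. A sketch of one way to build absolute URLs and drop exact repeats while keeping the original order (`unique_product_urls` and the sample paths are made up for illustration):

```python
from urllib.parse import urljoin

def unique_product_urls(hrefs, base="https://www.amazon.com"):
    """Join relative hrefs onto base, skip empty values, drop duplicates in order."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href:                 # link.get('href') may return None
            continue
        url = urljoin(base, href)    # absolute hrefs are left unchanged
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

sample = ["/dp/AAA111/", None, "/dp/BBB222/", "/dp/AAA111/",
          "https://www.amazon.com/dp/CCC333/"]
for url in unique_product_urls(sample):
    print(url)
```

Fetching each product page only once also reduces the number of requests sent to the site.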

In [None]:
import pandas as pd
graficas = pd.read_csv('amazon_product.csv')

In [None]:
graficas

|   | Title | Price | Rating | ReviewCount | Availability |
|---|---|---|---|---|---|
| 0 | maxsun AMD Radeon RX 550 4GB GDDR5 ITX Compute... | $109.99 | 4.2 out of 5 stars | 984 ratings | In Stock |
| 1 | GPVHOSO GeForce GT 1030 GDDR5 2GB 64 Bit DVI-D... | $99.98 | 3.5 out of 5 stars | 2 ratings | Only 7 left in stock - order soon |
| 2 | MSI Gaming GeForce RTX 3060 12GB 15 Gbps GDRR6... | $279.99 | 4.7 out of 5 stars | 1,407 ratings | In Stock |
| 3 | MSI Gaming GeForce RTX 3050 8GB GDRR6 128-Bit ... | $242.99 | 4.6 out of 5 stars | 624 ratings | In Stock |
| 4 | AISURIX Radeon RX 580 Graphic Cards, 2048SP, R... | $109.99 | 4.4 out of 5 stars | 418 ratings | In Stock |
| 5 | maxsun AMD Radeon RX 550 4GB GDDR5 ITX Compute... | $109.99 | 4.2 out of 5 stars | 984 ratings | In Stock |
| 6 | AISURIX Radeon RX 5500 XT 8gb GDDR6 Graphics C... | $139.99 | 4.2 out of 5 stars | 70 ratings | In Stock |
| 7 | AISURIX Radeon RX 580 Graphic Cards, 2048SP, R... | $109.99 | 4.4 out of 5 stars | 418 ratings | In Stock |
| 8 | iHTP 5V ARGB Graphics Card GPU Brace Graphics ... | $21.97 | 4.2 out of 5 stars | 24 ratings | In Stock |
| 9 | ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12G... | $289.99 | 4.7 out of 5 stars | 2,899 ratings | In Stock |
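
The table contains repeated rows (for example, rows 0 and 5) and stores prices as text. One possible clean-up is sketched below on three rows copied from the output above rather than on the real CSV file; the `PriceUSD` column name is an assumption.

```python
import io
import pandas as pd

# Three rows copied from the table above; the first and third are the same product.
sample_csv = io.StringIO(
    "Title,Price,Rating,ReviewCount,Availability\n"
    "maxsun AMD Radeon RX 550 4GB GDDR5 ITX Compute...,$109.99,4.2 out of 5 stars,984 ratings,In Stock\n"
    "GPVHOSO GeForce GT 1030 GDDR5 2GB 64 Bit DVI-D...,$99.98,3.5 out of 5 stars,2 ratings,Only 7 left in stock - order soon\n"
    "maxsun AMD Radeon RX 550 4GB GDDR5 ITX Compute...,$109.99,4.2 out of 5 stars,984 ratings,In Stock\n"
)
df = pd.read_csv(sample_csv)

# Drop exact duplicate rows and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)

# Strip "$" and thousands separators, then convert to a number (NaN if it fails).
df['PriceUSD'] = pd.to_numeric(
    df['Price'].str.replace(r'[$,]', '', regex=True), errors='coerce'
)

print(df[['Title', 'PriceUSD']])
```

On the real data, the same two steps could be applied to `graficas` after `pd.read_csv('amazon_product.csv')`.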
