# E-commerce Web Scraping

## Business Problem 📝💹

In this project we work for an e-commerce company (Whirlpool) that is preparing for a promotional weekend. Our assignment is to find out the prices that competitors charge for a set of selected products. Our goal is to **determine the minimum price** offered by the competition for each product so that it can be matched.

Stakeholders: 
* Marketing manager

We received:
* An Excel file (Products_and_comp.xlsx) listing the products to scrape.
* The names of the competing companies whose e-commerce sites we need to visit.

Deliverable:
* A pandas DataFrame with the prices and URLs of the products for each competitor.
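
Once the results of each competitor are collected in a DataFrame with the columns used later in this notebook (`titles`, `prices`, `links`, `id`), the minimum price per product could be computed roughly as follows. This is a minimal sketch under assumptions: the `competitor` column and the sample rows are invented for illustration, and the prices are already numeric, whereas real scraped prices arrive as text and need cleaning first.

```python
import pandas as pd

# Hypothetical scraped results; ids, prices, and links are invented for illustration.
scraped = pd.DataFrame({
    "id": ["A1", "A1", "B2", "B2", "B2"],
    "competitor": ["Liverpool", "Sears", "Liverpool", "Sears", "Mercado Libre"],
    "prices": [1200.0, 1150.0, 899.0, float("nan"), 950.0],
    "links": ["https://example.com/1", "https://example.com/2",
              "https://example.com/3", None, "https://example.com/5"],
})

# Drop rows where nothing was found, then keep the cheapest row per product id.
found = scraped.dropna(subset=["prices"])
cheapest = found.loc[found.groupby("id")["prices"].idxmin()].reset_index(drop=True)
print(cheapest[["id", "competitor", "prices", "links"]])
```

Keeping the whole cheapest row, rather than only the minimum value, preserves the competitor and the link that back up each price.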

In [None]:

import pandas as pd
import numpy as np
import re
import time

# import requests
# from urllib.request import Request, urlopen
# from bs4 import BeautifulSoup

# We created a virtual environment and use the following libraries:
from selenium import webdriver 
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service

from webdriver_manager.chrome import ChromeDriverManager


In [None]:
excel = pd.read_excel('Products_and_comp.xlsx')
excel.shape

In [None]:
excel

In [None]:
# Product identifiers and brands
productos = excel['Material']
marcas = excel['Marca']

In [None]:
# Get the competitors: every second column, starting at position 3
competencia = excel.columns[3::2]
competencia
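
The slice `[3::2]` takes every second column starting at position 3, which suggests that the competitor columns alternate with some other column in the spreadsheet. The sketch below applies the same slice to a plain list. Every column name other than `Material` and `Marca` is hypothetical, since the real layout of Products_and_comp.xlsx is not shown here.

```python
# Hypothetical header row; only 'Material' and 'Marca' are known from the notebook.
columns = ["Material", "Marca", "Descripcion",
           "Liverpool", "Precio Liverpool",
           "Costco", "Precio Costco",
           "Sears", "Precio Sears"]

# Start at index 3 and step by 2: picks indices 3, 5, 7, ...
competitors = columns[3::2]
print(competitors)
```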

## Liverpool Mex

We start by scraping the first competitor. The method is as follows:
1) Select the search bar
2) Type in the product identifiers
3) From the results, collect the price, name, and link. (We only take the final price, i.e. the sale price, not the price before promotions.)
4) Automate with Selenium and a _for_ loop over every identifier
5) Write a CSV with all the results.
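
Step 3 yields the price as display text, so it has to be converted to a number before prices can be compared. Here is a possible sketch, assuming the text looks like `$12,999.00` with a comma as the thousands separator. The exact format on each site is an assumption, and `parse_price` is a hypothetical helper that is not part of this notebook.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[float]:
    """Extract the first number from a price string such as '$12,999.00'."""
    # Assumes ',' separates thousands and '.' marks decimals.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None  # no digits at all, e.g. an empty or placeholder string
    return float(match.group().replace(",", ""))
```

Returning `None` for unparseable text keeps such rows easy to spot instead of silently turning them into zero.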


URL = 'https://www.liverpool.com.mx/tienda/home'

In [None]:
datos = []
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get("https://www.liverpool.com.mx/tienda/home")
no_encontrados = []

for i in range(len(productos)):
    try:
        search_bar = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//input[@class='form-control search-bar plp-no__results']")))
        #search_bar = driver.find_element(By.CLASS_NAME, "form-control search-bar plp-no__results")
        
        #Here we delete the previous search
        search_bar.send_keys(Keys.CONTROL, 'a')
        time.sleep(0.5) # This page needs some time to load.
        search_bar.send_keys(Keys.BACKSPACE)
        
        #We write our query
        search_bar.send_keys(productos[i])
        search_bar.send_keys(Keys.ENTER)
        
        try:       
            time.sleep(2)  # The page needs roughly this long to reload
            xpath_titulo = "//h1[@class='a-product__information--title']"
            # The wait returns the element once it is present, so no second lookup is needed
            titulo = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, xpath_titulo)))
            titulo = [titulo.text]
            
            
            xpath_precio = "//p[@class = 'a-product__paragraphDiscountPrice m-0 d-inline ']"
            precio = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, xpath_precio)))
            precio = [precio.text]
            
            
            get_url = driver.current_url
            link = [str(get_url)]

            # Add the product with the scraped title, price, and link
            df = pd.DataFrame({'titles': titulo,
                               'prices': precio,
                               'links': link,
                               'id': [productos[i]]}) 
            datos.append(df)

        except Exception:
            # If this XPath exists, the search returned no results.
            nada_xpath = "//div[@class='o-content__noResultsNullSearch']//p[@class = 'o-nullproduct-query'][1]"
            nada = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, nada_xpath))) 
            # Add the product with NaN values
            df = pd.DataFrame({'titles': [np.nan],
                               'prices': [np.nan],
                               'links': [np.nan],
                               'id': [productos[i]]}) 
            datos.append(df)
    except Exception:
        # The search bar could not be used, or the no-results check above also failed
        no_encontrados.append(productos[i])

driver.quit() 
datos_final = pd.concat(datos, ignore_index = True)
datos_final.to_csv('scrapped_csv/Liverpool.csv')

In [None]:
liverpool = datos_final.copy()
# liverpool = pd.read_csv(/, index_col = [0])

In [None]:
liverpool

## Coppel Mex & Home Depot
Not available: these sites can only be reached through a VPN. (Something a company that hires me would provide ;) )


![Coppel](images/Coppel.png)

## Costco

This page enters maintenance mode after some scraping, and we noticed that most products are missing.
This was a challenge that we successfully overcame by adding some random noise to the wait times. Even so, none of the products on our list were found.
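
The random noise mentioned above amounts to waiting a slightly different time after each query, plus a longer pause every so often. Below is a minimal sketch of that idea, using the same ranges as the cell that follows (1 to 3 seconds, and 5 extra seconds every 25 queries). `next_delay` is a hypothetical helper name, and the two waits are combined into one value for simplicity.

```python
import random


def next_delay(i: int, rng: random.Random) -> float:
    """Seconds to wait around query number i."""
    delay = round(1 + rng.random() * 2, 2)  # random wait between 1 and 3 seconds
    if i % 25 == 0:
        delay += 5  # longer pause every 25 queries
    return delay


rng = random.Random(0)  # seeded only to make the sketch reproducible
delays = [next_delay(i, rng) for i in range(50)]
```

In a scraping loop this would be used as `time.sleep(next_delay(i, rng))`, so the requests do not arrive at a perfectly regular rhythm.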

In [None]:
import random

datos = []
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get("https://www.costco.com.mx/")

search_xpath = "//div[@id='searchBoxContainer']//input[@class='search-input']"
search_bar = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, search_xpath)))
search_bar = driver.find_element(By.XPATH, search_xpath)

condition = True
i = 0
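while condition and i < min(425, len(productos)):
    producto_actual = productos[i]  # current product id, used in the checks and messages below
    tiempo_s = round(1 + random.random() * 2, 2)  # random wait of 1 to 3 seconds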
    
    if i % 25 == 0:
        time.sleep(5) # We give the page some more free time to load
        
    try:
        #Here we delete the previous search
        search_bar.send_keys(Keys.CONTROL, 'a')
        search_bar.send_keys(Keys.BACKSPACE)

        #We write our new search
        search_bar.send_keys(productos[i])
        search_bar.send_keys(Keys.ENTER)
        time.sleep(tiempo_s)
        try:
            xpath_noresult= "//h1//span[@class='ng-star-inserted']"
            no_result = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, xpath_noresult)))
            no_result = driver.find_element(By.XPATH, xpath_noresult)
            
            if no_result.text != str("No se encontraron resultados para"):
                condition = False
                valor = producto_actual
            
        except:
            condition = False
            valor = producto_actual
            
        
    except:
        print('Search box failure: ' + str(producto_actual))
    i += 1
    

driver.quit() 

No results


## Sears Mex

Link = 'https://www.sears.com.mx/'


In [None]:
datos = []

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get("https://www.sears.com.mx/")

no_encontrados = []

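for i in range(4):  # only the first four products are queried here
    producto_actual = productos[i]  # current product id, used when logging failures
    try:
        # Selenium loses the reference to the search box because the page changes after each search, so we locate it again on every iteration.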
        search_xpath = "//div[@class='headerSup']//input[@class='input']"
        search_bar = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, search_xpath)))
        search_bar = driver.find_element(By.XPATH, search_xpath)
        
        #Here we delete the previous search
        search_bar.send_keys(Keys.CONTROL, 'a')
        search_bar.send_keys(Keys.BACKSPACE)

        #We write our new search
        search_bar.send_keys(productos[i] + ' ' + marcas[i])
        search_bar.send_keys(Keys.ENTER)
        
        try:
            time.sleep(1.5)
            xpath_titulo = "//article//p[@class='h4']"
            #WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, xpath_titulo)))
            titulos = driver.find_elements(By.XPATH, xpath_titulo)
            titulos = [ti.text for ti in titulos]          
        except Exception:
            no_encontrados.append(producto_actual)
            print('Error finding titles')
        try:            
            xpath_precio = "//article//p[@class='precio1']"
            #WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, xpath_precio)))
            precios = driver.find_elements(By.XPATH, xpath_precio)
            precios = [p.text for p in precios]
        except Exception:
            print('Error finding prices')
        try:
            xpath_link = "//article//a[1]"
            #WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, xpath_link)))
            links = driver.find_elements(By.XPATH, xpath_link)
            links = [link.get_attribute("href") for link in links] 
        except Exception:
            print('Error finding links')
    except Exception:
        no_encontrados.append(producto_actual)
        print('Search error')
    
driver.quit() 

In [None]:
titulos

## Mercado Libre Mex

We start by scraping Mercado Libre. We use Selenium with a Chrome 108 driver. 
The script _Mercado_Libre.py must be run.


URL = 'https://www.mercadolibre.com.mx/a/store/seagate'

### Test
After running the script, we look at the final CSV.


In [None]:
datos = []

# Selenium 4 expects the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service("./chromedriver"))
driver.get("https://www.mercadolibre.com.mx/")
for i in range(len(productos)):
    try:
        # search_bar = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "nav-search-input")))
        search_bar = driver.find_element(By.CLASS_NAME, "nav-search-input")
        search_bar.clear()
        search_bar.send_keys(productos[i])
        search_bar.send_keys(Keys.ENTER)

        try:
            #xpath2 = "//h2[@class='ui-search-item__title ui-search-item__group__element shops__items-group-details shops__item-title']"
            xpath = "//h2[@class='ui-search-item__title shops__item-title']"
            title_products = driver.find_elements(By.XPATH, xpath)
            title_products = [title.text for title in title_products]
            

            xpathp = "//div[@class='ui-search-price ui-search-price--size-medium shops__price']//span[@class='price-tag ui-search-price__part shops__price-part']//span[@class='price-tag-fraction']"
            price_products = driver.find_elements(By.XPATH,xpathp)
            price_products = [price.text for price in price_products]

            xpathl = "//div[@class='ui-search-item__group ui-search-item__group--title shops__items-group']//a[1]"
            links = driver.find_elements(By.XPATH, xpathl)
            links = [link.get_attribute("href") for link in links]

            if len(links) == 0 or len(price_products) == 0 or len(title_products) == 0:
                # Fallback: retry with the XPaths of the alternative result layout
                xpath2 = "//h2[@class='ui-search-item__title ui-search-item__group__element shops__items-group-details shops__item-title']"
                title_products = driver.find_elements(By.XPATH, xpath2)
                title_products = [title.text for title in title_products]

                xpathp = "//div[@class='ui-search-price ui-search-price--size-medium shops__price']//span[@class='price-tag ui-search-price__part shops__price-part']//span[@class='price-tag-fraction']"
                price_products = driver.find_elements(By.XPATH,xpathp)
                price_products = [price.text for price in price_products]

                xpathl = "//div[@class='ui-search-result__wrapper shops__result-wrapper']//div[@class = 'ui-search-result__image shops__picturesStyles']//a[1]"
                links = driver.find_elements(By.XPATH, xpathl)
                links = [link.get_attribute("href") for link in links]

            # Keep the results only when titles, prices, and links line up one-to-one
            if len(links) == len(price_products) == len(title_products) and len(title_products) > 0:
                products = {'titles': title_products,  'prices': price_products, 'links': links }
                df = pd.DataFrame(products)
                df['id'] = productos[i]
                datos.append(df)
            else:
                # Otherwise record the product with NaN values so it is not silently dropped
                df = pd.DataFrame({'titles': [np.nan], 'prices': [np.nan], 'links': [np.nan], 'id': productos[i]})
                print('Product Not Available:' + str(productos[i]) + ' pos ' + str(i))
                datos.append(df)
            
        except Exception:
            df = pd.DataFrame({'titles': [np.nan], 'prices': [np.nan], 'links': [np.nan], 'id': productos[i]})
            print('No link for product: ' + str(productos[i]) + ' pos ' + str(i))
            datos.append(df)
    except Exception:
        print('Error in search bar at pos ' + str(i))

driver.quit() 
datos_final = pd.concat(datos, ignore_index = True)
datos_final.to_csv('scrapped_csv/Mercado_Libre.csv')

In [None]:
resultados_test1 = pd.read_csv('test_ML.csv', index_col = [0])
resultados_test1

In [None]:
df = pd.read_csv('scrapped_csv/Mercado_Libre.csv', index_col = [0])
df

In [None]:
set(productos[0:18]) - set(df.id.unique())

In [None]:
df.id.unique()

In [None]:
df[df.id == 'KSM150PSER']

## ELEKTRA

https://www.elektra.mx/

## Walmart Mex

https://www.walmart.com.mx/

## Sam's

https://www.sams.com.mx/

## Suburbia

https://www.suburbia.com.mx/tienda/home

## Palacio

Not available without a VPN.

## Famsa

The page does not load.

## REAMI

Not available without a VPN.

## Soriana

https://www.soriana.com/

## La Unica

Not available without a VPN.

## Cimaco

https://www.cimaco.com.mx/

## Cyber Puerta 

https://www.cyberpuerta.mx/