## Objective

This notebook takes the data obtained by web scraping and outputs a DataFrame with all the downloaded information.

![img](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/598px-CRISP-DM_Process_Diagram.png)

# Objective

Get the description of every product, together with its department.

In [3]:
import requests
from bs4 import BeautifulSoup
import re
import json
import pandas as pd
import time
import random

## Links

We gather the links of all the sections we want. When we reach the travel agency ('Agencia de viajes') link, we break out of the loop, because we don't want the links below it.

In [4]:
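url = 'https://www.elcorteingles.es/'

req = requests.get(url)

soupHome = BeautifulSoup(req.text, 'html.parser')

# Sections we do not want to scrape
excluded_terms = ['/supermercado/', '/entradas/', '/club-del-gourmet/']
links = []

for a in soupHome.find_all('a', href=True):
    # The travel agency link marks the end of the sections we care about
    if 'viajeselcorteingles' in a['href']:
        break
    if all(term not in a['href'] for term in excluded_terms):
        # Drop the query string so the same section is not collected twice
        links.append(a['href'].split('?')[0])

# Remove duplicates while preserving order
links = list(dict.fromkeys(links))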
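
The filtering logic can be tried offline on a handful of hard-coded hrefs. This is a minimal sketch: the sample hrefs and the helper name `clean_links` are hypothetical, and only the excluded terms and the stop marker come from the cell above. Unlike that cell, the sketch also reduces absolute URLs to their path, which assumes every href points to the same host.

```python
from urllib.parse import urljoin, urlsplit

BASE_URL = 'https://www.elcorteingles.es/'
EXCLUDED_TERMS = ['/supermercado/', '/entradas/', '/club-del-gourmet/']
STOP_MARKER = 'viajeselcorteingles'


def clean_links(hrefs):
    """Return de-duplicated site paths, stopping at the first travel agency link."""
    paths = []
    for href in hrefs:
        if STOP_MARKER in href:
            break  # everything after the travel agency link is ignored
        if any(term in href for term in EXCLUDED_TERMS):
            continue
        # Resolve the href against the base URL and keep only the path, so that
        # '/moda/?level=1' and 'https://www.elcorteingles.es/moda/' collapse.
        paths.append(urlsplit(urljoin(BASE_URL, href)).path)
    # dict.fromkeys keeps the first occurrence and preserves order
    return list(dict.fromkeys(paths))


# Hypothetical hrefs, for illustration only
sample_hrefs = [
    '/moda/?level=1',
    'https://www.elcorteingles.es/moda/',
    '/supermercado/frescos/',
    '/juguetes/munecos-articulados/',
    'https://www.viajeselcorteingles.es/',
    '/hogar/',
]
cleaned = clean_links(sample_hrefs)
print(cleaned)
```

Reducing every href to a path means a section that appears once as a relative link and once as an absolute link is only collected once.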

In [5]:
len(links)

1301

In [6]:
links[900]

'/juguetes/munecos-articulados/'

## Manual corrections

These links point to product listings that only produce duplicates. Filtering them out from the start means fewer requests later on.

In [7]:
links_to_remove = ['/75/', '/moda/', '/moda/mujer/', '/moda/zapatos/', '/cine/', '/musica/', '/mascotas/', '/bricor/',
                   '/bricor/iluminacion/', '/bricor/herramientas/', '/bricor/estanterias-y-ordenacion/', '/juguetes/',
                   '/bricor/armarios/', '/electronica/', '/electrodomesticos/', '/deportes/', '/hogar/', '/libros/',
                   '/bricor/bano/', '/bricor/cocinas/', 'https://www.elcorteingles.es/perfumeria/']
for x in links_to_remove:
    try:
        links.remove(x)
    except ValueError as e:
        # list.remove raises ValueError when the link is not in the list
        print(f'Exception: {e}')

Exception: list.remove(x): x not in list
Exception: list.remove(x): x not in list
Exception: list.remove(x): x not in list
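
The `x not in list` messages say that some entries failed to match, but not which ones. A minimal sketch that reports the unmatched entries by name is shown here; the helper `remove_links` and the demo lists are hypothetical. A common cause of an unmatched entry is a missing comma between two string literals, which Python silently joins into one string.

```python
def remove_links(links, links_to_remove):
    """Return (kept, not_found): the filtered links and the entries that matched nothing."""
    unwanted = set(links_to_remove)
    kept = [link for link in links if link not in unwanted]
    present = set(links)
    not_found = [link for link in links_to_remove if link not in present]
    return kept, not_found


# Hypothetical data, for illustration only
links_demo = ['/moda/', '/moda/mujer/abrigos/', '/juguetes/', '/juguetes/munecos-articulados/']
# No comma between the last two literals: Python concatenates them into
# the single string '/juguetes//bricor/armarios/', which matches nothing.
to_remove_demo = ['/moda/', '/juguetes/' '/bricor/armarios/']

kept, not_found = remove_links(links_demo, to_remove_demo)
print(kept)
print(not_found)
```

Printing `not_found` makes this kind of mistake visible immediately, whereas the generic exception message hides it.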


## Products

In [8]:
products = []
url_errors = []

for link in links:
    url = 'https://www.elcorteingles.es' + link
    print(url)
    for page in range(0, 30):  # Only the first 30 pages of each section, to limit the size
        time.sleep(random.randint(0, 1) + random.random())  # Avoid saturating the servers
        # Pages 0 and 1 both request the section URL itself; later pages append '/<page>/'.
        # The suffix is rebuilt on every iteration so it never leaks into the next section.
        suffix = f'/{page}/' if page > 1 else ''
        try:
            req = requests.get(url + suffix, timeout=10)

            if req.ok:  # status code below 400
                s = BeautifulSoup(req.text, 'html.parser')

                for span_ in s.find_all('span'):
                    # Product data is embedded as JSON in <span data-scope="product" data-json="...">
                    if 'data-json' in span_.attrs and span_.attrs.get('data-scope') == 'product':
                        obj = json.loads(span_['data-json'])
                        obj['image'] = 'http:{}'.format(span_.parent.find('img')['src'])
                        products.append(obj)

            else:  # Print the URL and status code where the error occurred
                print(f'Status code error on url {url}. Status code: {req.status_code}')
                url_errors.append(url)
                break  # No more pages: stop searching and move on to the next link

        except Exception as e:
            print(f'Exception: {e}')
            if url not in url_errors:
                url_errors.append(url)

    time.sleep(random.randint(5, 10) + random.random())  # Longer pause between sections, to avoid harming the company's servers
                


https://www.elcorteingles.es/moda/mujer/abrigos/


KeyboardInterrupt: 
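
In the loop above, pages 0 and 1 both use an empty suffix, so the unpaginated section URL is requested twice. The pagination can be sketched as a small generator that yields each page exactly once. The helper name `page_urls` is hypothetical, and the `/<page>/` URL scheme is simply the one the loop above uses, not something verified against the site.

```python
def page_urls(section_url, max_pages=30):
    """Yield the URL of each results page of one section, starting with page 1."""
    base = section_url.rstrip('/')
    for page in range(1, max_pages + 1):
        # Page 1 is the section URL itself; later pages append '/<page>/'
        yield f'{base}/' if page == 1 else f'{base}/{page}/'


urls = list(page_urls('https://www.elcorteingles.es/moda/mujer/abrigos/', max_pages=3))
for u in urls:
    print(u)
```

Stripping the trailing slash before appending the page number also avoids a double slash in the paginated URLs.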

In [9]:
len(products)

94

In [11]:
products[50]

{'id': '001060651400131',
 'brand': 'Woman Limited El Corte Inglés',
 'store_id': '60',
 'badges': ['express_delivery'],
 'price': {'original': 199, 'final': 139.3, 'currency': 'EUR'},
 'discount': 30,
 'media': {'count': 1},
 'name': 'Abrigo masculino con textura de mujer',
 'variant': '001060651400131002',
 'category': ['Moda', 'Mujer', 'Abrigos'],
 'alternative_id': 'A28233506',
 'eci_provider': '00000000',
 'gtin': '2401700051602',
 'status': 'show_pdp',
 'quantity': 1,
 'image': 'http://sgfm.elcorteingles.es/SGFM/dctm/MEDIA03/201902/07/00160651400131____1__516x640.jpg'}
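
Each product is a nested record: `price` is a dict and `category` is a list. Since the goal is to relate products to their department, one option is to flatten these fields before building the DataFrame. The sketch below is only an illustration: `flatten_product` is a hypothetical helper, and treating the first element of `category` as the department is an assumption based on the sample above.

```python
def flatten_product(product):
    """Return a flat copy of a product dict with price_* keys and a department field."""
    flat = {k: v for k, v in product.items() if k not in ('price', 'category')}
    # Some products have no 'original' price, so copy whatever keys are present
    for key, value in (product.get('price') or {}).items():
        flat[f'price_{key}'] = value
    category = product.get('category') or []
    # Assumption: the first category level is the department
    flat['department'] = category[0] if category else None
    flat['category_path'] = ' > '.join(category)
    return flat


# Subset of the fields of the product shown above
sample = {
    'id': '001060651400131',
    'name': 'Abrigo masculino con textura de mujer',
    'price': {'original': 199, 'final': 139.3, 'currency': 'EUR'},
    'category': ['Moda', 'Mujer', 'Abrigos'],
}
flat = flatten_product(sample)
print(flat)
```

A list of such flat dicts can be passed to `pd.DataFrame.from_records` in the same way as the raw `products` list.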

We store the products in a DataFrame so they can be processed in batch.

In [12]:
dfProducts = pd.DataFrame.from_records(products)
dfProducts.head(3)

| | alternative_id | badges | brand | category | discount | eci_provider | gtin | id | image | media | name | position | price | quantity | status | store_id | variant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A29692433 | [express_delivery] | Woman El Corte Inglés | [Moda, Mujer, Abrigos] | 50.0 | 0 | 2401676814157 | 1087557400030 | http://sgfm.elcorteingles.es/SGFM/dctm/MEDIA03... | {'count': 1} | Plumífero ultraligero de mujer Woman Weekend E... | | {'original': 79.99, 'final': 39.95, 'currency'... | 1 | show_pdp | 60 | 1087557400030002 |
| 1 | A27354683 | [express_delivery] | Woman El Corte Inglés | [Moda, Mujer, Abrigos] | | 0 | 2523534000714 | 1052353400071 | http://sgfm.elcorteingles.es/SGFM/dctm/MEDIA03... | {'count': 1} | Gabardina corta básica de mujer Woman Weekend ... | | {'final': 79.99, 'currency': 'EUR'} | 1 | show_pdp | 60 | 1052353400071002 |
| 2 | A26878646 | [express_delivery] | Fórmula Joven | [Moda, Mujer, Abrigos] | 50.0 | 0 | 2401685569666 | 1016615194640 | http://sgfm.elcorteingles.es/SGFM/dctm/MEDIA03... | {'count': 1} | Gabardina de mujer Fórmula Joven con cinturón ... | | {'original': 79.99, 'final': 39.99, 'currency'... | 1 | show_pdp | 60 | 1016615194640038 |


In [13]:
dfProducts.shape

(94, 17)

### Remove duplicates

In [14]:
dfProducts = dfProducts.drop_duplicates(subset='id')
dfProducts.shape

(47, 17)
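
Restricting the comparison to `id` matters here: columns such as `badges`, `category` and `price` hold lists or dicts, which are unhashable, so `drop_duplicates()` without a `subset` would typically fail with a `TypeError`. The same keep-first behaviour can be sketched on the raw list of dicts, before any DataFrame is built. The helper `dedupe_by_id` and the sample records are hypothetical.

```python
def dedupe_by_id(records, key='id'):
    """Keep the first record seen for each value of `key`, preserving order."""
    first_seen = {}
    for record in records:
        # setdefault only stores the record if this id has not been seen yet
        first_seen.setdefault(record[key], record)
    return list(first_seen.values())


# Hypothetical records, for illustration only
records = [
    {'id': 'a1', 'name': 'first copy'},
    {'id': 'b2', 'name': 'other product'},
    {'id': 'a1', 'name': 'second copy'},
]
unique = dedupe_by_id(records)
print(unique)
```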

In [15]:
dfProducts.to_csv('../Data/RAW_PRODUCTS.csv', index=False)

---