# Lab: analyzing tents data

## Data extraction

Let's fetch data from Decathlon. We'll do it in two phases. First, we'll make a list of all the tents they have. Then, we'll fetch the data for each tent.

Let's start by listing the tents they have.

In [37]:
import json
import re
import bs4
import requests

url = 'https://www.decathlon.fr/tous-les-sports/camping-bivouac/tentes-et-abris'
response = requests.get(url)
soup = bs4.BeautifulSoup(response.content, 'html.parser')
script_tag = soup.find('script', id='__dkt')
raw_json = re.search(r'{(.+)}', script_tag.string).group(0)
data = json.loads(raw_json)
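
The regex step above can be tried offline. Here is a minimal sketch, assuming the script tag holds a single JSON object wrapped in a little JavaScript; the `script_text` value is a made-up stand-in, not the real Decathlon payload.

```python
import json
import re

# Hypothetical script content: the real payload is much larger and the
# JavaScript around it may differ.
script_text = 'window.__DKT = {"_ctx": {"data": [{"type": "Supermodel"}]}};'

# Greedy match from the first "{" to the last "}". re.DOTALL lets "." cross
# newlines in case the payload spans several lines.
match = re.search(r'\{.+\}', script_text, flags=re.DOTALL)
if match is None:
    raise ValueError('no JSON object found in the script content')

payload = json.loads(match.group(0))
print(payload['_ctx']['data'][0]['type'])  # Supermodel
```

Checking for `None` before calling `.group(0)` gives a clearer error than the `AttributeError` raised when the pattern does not match.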


Here's a tool to explore the JSON: https://jsonhero.io/j/VTrWj5vx53Ys

The data is quite deeply nested, but it's not difficult to extract:

In [38]:
# Note: if the following doesn't work, try modifying the index. It's possible that Decathlon has changed the structure of the page.
idx = 6
tents = {
    item['webLabel']: f"https://www.decathlon.fr/{item['url']}"
    for item in data['_ctx']['data'][idx]['data']['blocks']['items']
}
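
A hard-coded index breaks whenever the page layout shifts. One possible alternative is to look a block up by its `type` field. This is a sketch under assumptions: `find_block` is a hypothetical helper, and the block type name `'ProductList'` is invented for illustration, so check the real value in the JSON explorer.

```python
def find_block(blocks, block_type):
    """Return the first block whose 'type' equals block_type, or None."""
    return next((b for b in blocks if b.get('type') == block_type), None)


# Hypothetical stand-in for data['_ctx']['data'].
blocks = [
    {'type': 'Header', 'data': {}},
    {
        'type': 'ProductList',  # assumed name, not taken from the real page
        'data': {'blocks': {'items': [{'webLabel': 'Tent A', 'url': 'p/tent-a'}]}},
    },
]

listing = find_block(blocks, 'ProductList')
tents_demo = {
    item['webLabel']: f"https://www.decathlon.fr/{item['url']}"
    for item in listing['data']['blocks']['items']
}
print(tents_demo)
```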


So now we have a URL for each tent. Let's grab some data for a single tent.

In [23]:
url = tents['Tente de camping - MH100 - 2 places']
response = requests.get(url)
soup = bs4.BeautifulSoup(response.content, 'html.parser')
script_tag = soup.find('script', id='__dkt')
raw_json = re.search(r'{(.+)}', script_tag.string).group(0)
data = json.loads(raw_json)


This JSON can be explored here: https://jsonhero.io/j/QeKMElLudiaA

In [24]:
benefits = {
    b['label']: b
    for b in data['_ctx']['data'][10]['data']['benefits']
}
{
    'rating': data['_ctx']['data'][4]['data']['reviews']['notation'],
    'price': data['_ctx']['data'][4]['data']['models'][0]['price'],
    'weight': data['_ctx']['data'][4]['data']['models'][0]['grossWeight'],
    'composition': data['_ctx']['data'][4]['data']['models'][0]['composition'],
    'packed_size': benefits['Facilité de transport']['value'],
    'size': benefits['Habitabilité']['value']
}


{'rating': 4.45,
 'price': 30,
 'weight': '2.6',
 'composition': 'Tissu principal\n75% Polyester, 25% Polyéthylène\nArceau\n100% Fibre de verre',
 'packed_size': 'Dimensions de la housse : 58cm x 16cm x 16cm / 15 L. Poids : 2,6 kg',
 'size': 'Chambre 130 X 210 cm. (2 couchages de 65cm) Hauteur max. utile : 107 cm'}
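
Note that `packed_size` and `size` are free text and `weight` is a string, so numeric analysis needs a parsing step first. Here is a minimal sketch that pulls the packed volume in litres out of such a string. It assumes the volume is written as a number followed by `L` or `litres`, which is a guess based on the samples in this notebook rather than a documented format. `packed_volume_litres` is a hypothetical helper name.

```python
import re


def packed_volume_litres(text):
    """Return the packed volume in litres found in text, or None."""
    # A number with an optional decimal part (comma or dot), then "L" or "litre(s)".
    match = re.search(r'(\d+(?:[.,]\d+)?)\s*(?:litres?|L)\b', text, flags=re.IGNORECASE)
    if match is None:
        return None
    # French decimals use a comma, which float() does not accept.
    return float(match.group(1).replace(',', '.'))


print(packed_volume_litres('Dimensions de la housse : 58cm x 16cm x 16cm / 15 L. Poids : 2,6 kg'))  # 15.0
print(packed_volume_litres('Dimension de la housse : Ø77x9 cm / 41,9 L.'))  # 41.9
print(packed_volume_litres('no volume here'))  # None
```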

Ok great, we can extract data for a single tent. Now let's do it for all of them!

First, let's list all the tents.

In [39]:
from tqdm import tqdm

tents_urls = {}

def get_page_content_from_decathlon_url(url):
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.content, 'html.parser')
    script_tag = soup.find('script', id='__dkt')
    raw_json = re.search(r'{(.+)}', script_tag.string).group(0)
    return json.loads(raw_json)

for page in tqdm(range(11)):
    url = f'https://www.decathlon.fr/tous-les-sports/camping-bivouac/tentes-et-abris?from={40 * page}&size={40}'
    data = get_page_content_from_decathlon_url(url)
    tents_urls.update({
        item['webLabel']: f"https://www.decathlon.fr/{item['url']}"
        for item in data['_ctx']['data'][idx]['data']['blocks']['items']
    })

print(f'Number of tents: {len(tents_urls)}')


100%|██████████| 11/11 [00:10<00:00,  1.08it/s]

Number of tents: 379





Now we can fetch the data for each tent. There are a lot of tents, so we'll speed things up by fetching pages concurrently with Python's `concurrent.futures` module.

In [29]:
from concurrent.futures import ThreadPoolExecutor, as_completed

tents_raw_data = {}

with ThreadPoolExecutor(max_workers=5) as executor:
    # Map each future to the name of the tent it is fetching.
    future_to_name = {
        executor.submit(requests.get, tent_url): tent_name
        for tent_name, tent_url in tents_urls.items()
    }
    for future in tqdm(as_completed(future_to_name), total=len(future_to_name)):
        tent_name = future_to_name[future]
        tents_raw_data[tent_name] = future.result()

len(tents_raw_data)


100%|██████████| 379/379 [00:41<00:00,  9.24it/s]


379
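
In the cell above, `future.result()` re-raises any exception from the worker thread, so a single failed request stops the whole loop. Here is a minimal sketch of collecting failures separately. `fake_fetch` is a hypothetical stand-in for `requests.get` so the example runs offline, and the URLs are made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(url):
    """Stand-in for requests.get (hypothetical) that fails on some URLs."""
    if 'broken' in url:
        raise ConnectionError(f'could not reach {url}')
    return f'<html>{url}</html>'


urls = {
    'Tent A': 'https://example.com/a',
    'Tent B': 'https://example.com/broken',
}
results, failures = {}, {}

with ThreadPoolExecutor(max_workers=5) as executor:
    future_to_name = {
        executor.submit(fake_fetch, url): name for name, url in urls.items()
    }
    for future in as_completed(future_to_name):
        name = future_to_name[future]
        try:
            results[name] = future.result()
        except Exception as exc:  # keep going, remember what failed
            failures[name] = exc

print(sorted(results), sorted(failures))  # ['Tent A'] ['Tent B']
```

The names in `failures` can then be retried or inspected without re-fetching the pages that succeeded.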

We now have the raw responses, so we can extract the fields we want. It's a good idea to keep fetching and extraction as two separate steps: if we make a mistake in the extraction, we don't have to re-fetch anything.

In [30]:
tents_info = {}

for tent_name, response in tqdm(tents_raw_data.items()):

    soup = bs4.BeautifulSoup(response.content, 'html.parser')
    script_tag = soup.find('script', id='__dkt')
    try:
        raw_json = re.search(r'{(.+)}', script_tag.string).group(0)
    except AttributeError:
        continue
    data = json.loads(raw_json)

    benefits_block = next(filter(
        lambda b: b['type'] == 'ProductBenefits',
        data['_ctx']['data']
    ), {})
    details_block = next(filter(
        lambda b: b['type'] == 'Supermodel',
        data['_ctx']['data']
    ))
    benefits = {
        b['label']: b
        for b in benefits_block.get('data', {}).get('benefits', [])
    }
    tents_info[tent_name] = {
        'rating': details_block['data'].get('reviews', {}).get('notation'),
        'price': details_block['data']['models'][0]['price'],
        'weight': details_block['data']['models'][0].get('grossWeight'),
        'composition': details_block['data']['models'][0].get('composition'),
        'packed_size': benefits.get('Facilité de transport', {}).get('value'),
        'size': benefits.get('Habitabilité', {}).get('value')
    }

len(tents_info)


100%|██████████| 379/379 [00:12<00:00, 30.43it/s]


378
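
Keeping fetching and extraction separate only helps while the raw responses stay in memory. To survive a kernel restart, the raw HTML could also be written to disk. Here is a minimal sketch, assuming the pages are stored as a `{tent_name: html}` dictionary. The file name and the sample pages are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for {tent_name: response.text}.
raw_pages = {'Tent A': '<html>a</html>', 'Tent B': '<html>b</html>'}

# Hypothetical cache location; any writable path would do.
cache_path = Path(tempfile.gettempdir()) / 'tents_raw.json'
cache_path.write_text(json.dumps(raw_pages, ensure_ascii=False), encoding='utf-8')

# Later, possibly in a fresh kernel: reload instead of re-fetching.
restored = json.loads(cache_path.read_text(encoding='utf-8'))
print(restored == raw_pages)  # True
```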

In [33]:
import pandas as pd

tents_df = pd.DataFrame.from_dict(tents_info, orient='index')
tents_df.isnull().sum()


rating         247
price            0
weight         249
composition    228
packed_size    296
size           296
dtype: int64

In [35]:
len(tents_df[~tents_df.isnull().any(axis=1)])


68

In [78]:
tents_df.to_csv('../../data/tents.csv')
tents_df.head()


| | rating | price | weight | composition | packed_size | size |
|---|---|---|---|---|---|---|
| Tente à arceaux de camping - Arpenaz 4.1 - 4 Personnes - 1 Chambre | 4.13 | 120.0 | 10.2 | Tissu principal\n100% Polyester\nArceau\n100% ... | Housse rectangulaire \| 60 x 24 x 24 cm \| 35 li... | Chambre : 240 x 210 cm \| Séjour debout : 5 m² ... |
| Tente de camping - MH100 - 2 places | 4.47 | 30.0 | 2.6 | Tissu principal\n75% Polyester, 25% Polyéthylè... | Dimensions de la housse : 58cm x 16cm x 16cm /... | Chambre 130 X 210 cm. (2 couchages de 65cm) Ha... |
| Tente de camping - 2 SECONDS - 3 places | 4.6 | 100.0 | 3.562 | Double toit\n100% Polyester\nChambre intérieur... | Dimension de la housse : Ø77x9 cm / 41,9 L. Po... | Chambre 180 X 210 cm.\nHauteur max. utile : 10... |
| Tente de camping - 2 SECONDS - 2 places | 4.4 | 65.0 | 2.9 | Tissu principal\n75% Polyester, 25% Polyéthylè... | Dimension de la housse : Ø65x7 cm / 23,2 L. Po... | Chambre 120 x 210 cm.\nHauteur max utile : 102... |
| Séjour à arceaux de camping - Arpenaz Base - 6 Personnes | 4.2 | 120.0 | 7.95 | Arceau\n100% Fibre de verre\nTissu principal\n... | Housse cylindrique \| 57 x 18 cm \| 18 litres \| ... | Hauteur : 2,15 m \| Surface au sol : 6,25 m² \| ... |


☝️ If you were not able to extract the data, you can use the already extracted data:

In [2]:
import pandas as pd

tents_df = pd.read_csv('../../data/tents.csv', index_col=0)
tents_df.head()


| | rating | price | weight | composition | packed_size | size |
|---|---|---|---|---|---|---|
| Tente à arceaux de camping - Arpenaz 4.1 - 4 Personnes - 1 Chambre | 4.13 | 120.0 | 10.2 | Tissu principal\n100% Polyester\nArceau\n100% ... | Housse rectangulaire \| 60 x 24 x 24 cm \| 35 li... | Chambre : 240 x 210 cm \| Séjour debout : 5 m² ... |
| Tente de camping - MH100 - 2 places | 4.47 | 30.0 | 2.6 | Tissu principal\n75% Polyester, 25% Polyéthylè... | Dimensions de la housse : 58cm x 16cm x 16cm /... | Chambre 130 X 210 cm. (2 couchages de 65cm) Ha... |
| Tente de camping - 2 SECONDS - 3 places | 4.6 | 100.0 | 3.562 | Double toit\n100% Polyester\nChambre intérieur... | Dimension de la housse : Ø77x9 cm / 41,9 L. Po... | Chambre 180 X 210 cm.\nHauteur max. utile : 10... |
| Tente de camping - 2 SECONDS - 2 places | 4.4 | 65.0 | 2.9 | Tissu principal\n75% Polyester, 25% Polyéthylè... | Dimension de la housse : Ø65x7 cm / 23,2 L. Po... | Chambre 120 x 210 cm.\nHauteur max utile : 102... |
| Séjour à arceaux de camping - Arpenaz Base - 6 Personnes | 4.2 | 120.0 | 7.95 | Arceau\n100% Fibre de verre\nTissu principal\n... | Housse cylindrique \| 57 x 18 cm \| 18 litres \| ... | Hauteur : 2,15 m \| Surface au sol : 6,25 m² \| ... |


In [3]:
tents_df = tents_df[~tents_df.isnull().any(axis=1)]
len(tents_df)


70

## Data analysis

This is where the lab starts:

1. Do a bit of analysis on the numeric fields.
2. Run a skyline.
3. Run a PCA.
4. Difficult: extract the size and/or the packed size of the tent, and then rerun the same analysis.
5. Bonus: build a simple regression model.
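
As a starting point for step 2, recall that a skyline is the set of items that no other item dominates. The sketch below is one possible reading in plain Python, not a reference solution. It assumes lower price and weight are better and a higher rating is better, which is a choice of criteria rather than something the lab prescribes. The values come from the table above, with the tent names shortened.

```python
def dominates(a, b):
    """True if a is at least as good as b on every criterion and strictly
    better on at least one. Tuples are (price, weight, rating)."""
    at_least_as_good = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]
    strictly_better = a[0] < b[0] or a[1] < b[1] or a[2] > b[2]
    return at_least_as_good and strictly_better


def skyline(items):
    """Keep only the items that no other item dominates."""
    return {
        name: values
        for name, values in items.items()
        if not any(dominates(other, values) for other in items.values())
    }


sample_tents = {
    'MH100 2 places': (30.0, 2.6, 4.47),
    '2 SECONDS 2 places': (65.0, 2.9, 4.4),
    '2 SECONDS 3 places': (100.0, 3.562, 4.6),
    'Arpenaz 4.1': (120.0, 10.2, 4.13),
}
print(sorted(skyline(sample_tents)))  # ['2 SECONDS 3 places', 'MH100 2 places']
```

This pairwise comparison is quadratic in the number of tents, which is fine for a few hundred rows. It also ignores capacity, so a 2-person tent can dominate a 4-person one. Adding more criteria is part of the exercise.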