# Parsing APIs Example

## Intro

Now let's look at real data. When you parse data from the web, you will often come across API-based web pages.

For example, [zalando.fr](https://www.zalando.fr/accueil-homme/) is an API-based web page.

In this guided lab you will learn how to obtain the links from webpages and extract the data. Read through this doc, execute the cells in order and make sure you understand the explanations. 

*Note: This guided lab uses Google Chrome. Other browsers like Safari and Firefox have similar developer tools, but they work differently. To save time while following this lab, it is strongly recommended that you install and use Google Chrome.*

## Obtaining the link

Zalando is a discount e-store where you can buy clothes and accessories at reduced prices. On the web page you can choose between different sections. The general process is shown first, using the [Children section](https://www.zalando.fr/accueil-enfant/) as an example.

Here we will parse data about promotions only. The final output will therefore be a DataFrame with all the discounted goods.

[![Image from Gyazo](https://i.gyazo.com/fa4874d8e81c7570273bbfb853d66308.png)](https://gyazo.com/fa4874d8e81c7570273bbfb853d66308)


Go to the Promos page. Right-clicking on the page shows a list of possible actions; select Inspect.

<img src='https://i.gyazo.com/bccbd11d69c9040dc98758d443e32052.png' width="400">


The developer tools panel opens on the right side or at the bottom of the window. There you should click on Network:


[![Image from Gyazo](https://i.gyazo.com/f7e0db81cbfee67694183d1a7640bf81.png)](https://gyazo.com/f7e0db81cbfee67694183d1a7640bf81)

Right after that, the developer panel changes to show the files behind the page. To keep only the useful files, select the following settings:
1. Preserve log
2. Select XHR files.

[![Image from Gyazo](https://i.gyazo.com/9a899d4441d9d93e795f79747f1e47d5.png)](https://gyazo.com/9a899d4441d9d93e795f79747f1e47d5)

To make some files appear, scroll down and go forward to the second page.

[![Image from Gyazo](https://i.gyazo.com/0956eb3d5125075a236c9a439c7749c7.png)](https://gyazo.com/0956eb3d5125075a236c9a439c7749c7)

In the Network panel you can see the following files being loaded. All the data on the web page is loaded from a JSON file, which is one of the files listed. It is important to understand which file contains which kind of information.

<a href="https://gyazo.com/cf97a655869f0b22df0ada1cb2a41c3c"><img src="https://i.gyazo.com/cf97a655869f0b22df0ada1cb2a41c3c.png" alt="Image from Gyazo" width="724.8"/></a>

Once you have found the file that seems to carry the data you need, test it. Here we need the article... file:

<a href="https://gyazo.com/78b35bf492994b3f35c0564a21da202a"><img src="https://i.gyazo.com/78b35bf492994b3f35c0564a21da202a.png" alt="Image from Gyazo" width="727.2"/></a>

When we test the link in Chrome incognito mode, we obtain the proper JSON file:


<a href="https://gyazo.com/b60453fa98454fa29771c731a5174443"><img src="https://i.gyazo.com/b60453fa98454fa29771c731a5174443.png" alt="Image from Gyazo" width="1530.4"/></a>
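A quick way to confirm that a candidate file is the right one is to look at its top-level keys. The sketch below assumes the response contains `articles` and `pagination`, the two keys this lab reads later; the sample body is a tiny hand-written stand-in, and its `page_count` value is made up.

```python
import json

# Hand-written stand-in for the response body; the real file is much larger
# and the page_count value here is invented for illustration.
sample_body = (
    '{"articles": [{"sku": "QU153G00I-Q11", "name": "Trousse - black"}],'
    ' "pagination": {"page_count": 3}}'
)

payload = json.loads(sample_body)

# List the top-level keys to see what the file contains
top_level_keys = sorted(payload.keys())
print(top_level_keys)

# The file is useful for this lab if it holds a list of articles and pagination info
has_catalog = isinstance(payload.get('articles'), list) and 'pagination' in payload
print(has_catalog)
```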

To change which objects the JSON file contains (a kind of pagination), change the offset (the number of the first element on the page). In fact, if you look at the link, its structure is easy to understand.
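As a minimal sketch of that structure, the link can be rebuilt from its query parameters. The endpoint and parameter names are copied from the URL used in the next section; the helper name `catalog_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL used in this lab
BASE_URL = 'https://www.zalando.fr/api/catalog/articles'


def catalog_url(category, offset, limit=84, sort='sale'):
    """Build the catalog URL for the page that starts at `offset` (hypothetical helper)."""
    params = {'categories': category, 'limit': limit, 'offset': offset, 'sort': sort}
    return f'{BASE_URL}?{urlencode(params)}'


first_page = catalog_url('promo-enfant', offset=0)    # elements 0..83
second_page = catalog_url('promo-enfant', offset=84)  # elements 84..167
print(first_page)
print(second_page)
```

Only the `offset` value differs between the two links, which is what makes paging through the catalog straightforward.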

## Reading the data

Now the party rocks! Once we know how to obtain the data, it is not a problem to collect the whole database with all the data from the web page.
In this lab you will collect your own database of Zalando products. You select which goods you want to track, and you can define as many filters as you want. Just make sure that the data reflects the filters.




In [1]:
import json
import requests
from urllib.request import urlopen
import pandas as pd
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0


In [38]:
# Paste the url you obtained for your data
url='https://www.zalando.fr/api/catalog/articles?categories=promo-enfant&limit=84&offset=84&sort=sale'
response = urlopen(url).read().decode('utf-8')
results = json.loads(response)


In [24]:
df = pd.DataFrame(results['articles'])
df

| | sku | name | price | sizes | url_key | media | brand_name | is_premium | family_articles | flags | product_group | delivery_promises | amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10K43A06H-A11 | UNISEX - Chaussures d'entraînement et de fitne... | {'original': '39,95 €', 'promotional': '19,95... | [25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35] | kappa-unisex-chaussures-dentrainement-et-de-fi... | [{'path': 'spp-media-p1/413e5e6b391948ab970567... | Kappa | False | [] | [{'key': 'discountRate', 'value': '-50%', 'tra... | shoe | [] | |
| 1 | GP024K0BZ-K11 | BOY TIE DYE - Sweat à capuche - tapestry navy | {'original': '39,95 €', 'promotional': '11,95... | [8-9a, 10-11a, 12-13a, 14-16a] | gap-boy-tie-dye-sweatshirt-tapestry-navy-gp024... | [{'path': 'spp-media-p1/ad92b65d6b8247c5b06ed1... | GAP | False | [] | [{'key': 'discountRate', 'value': '-70%', 'tra... | clothing | [] | |
| 2 | NI114D0HF-A11 | VALIANT - Baskets basses - white/black | {'original': '32,95 €', 'promotional': '16,45... | [17, 18.5, 19.5, 21, 22, 23.5, 25, 27] | nike-sportswear-valiant-unisex-baskets-basses-... | [{'path': 'spp-media-p1/f807a786947b4feabfe93b... | Nike Sportswear | False | [] | [{'key': 'discountRate', 'value': '-50%', 'tra... | shoe | [] | |
| 3 | LE226G005-K12 | BATWING TEE - T-shirt imprimé - dress blues | {'original': '17,95 €', 'promotional': '12,65... | [2a, 3a, 4a, 5a, 8a, 10a, 12a, 14a, 16a] | levisr-batwing-tee-t-shirt-imprime-le226g005-k12 | [{'path': 'spp-media-p1/0d35625d4f433f5890a889... | Levi's® | False | [] | [{'key': 'discountRate', 'value': '-30%', 'tra... | clothing | [] | |
| 4 | GE113D08U-K11 | ALBEN GIRL WWF - Baskets basses - avio | {'original': '69,95 €', 'promotional': '20,95... | [28, 29, 30, 31, 32, 39] | geox-alben-girl-wwf-baskets-basses-avio-ge113d... | [{'path': 'spp-media-p1/106f16f82beb4e60af689f... | Geox | False | [] | [{'key': 'discountRate', 'value': '-70%', 'tra... | shoe | [] | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 79 | NI116D090-M11 | WAFFLE ONE - Baskets basses - electric green/... | {'original': '64,95 €', 'promotional': '32,45... | [28, 28.5, 29.5, 30, 31.5, 32, 33, 33.5, 34, 35] | nike-sportswear-waffle-one-baskets-basses-elec... | [{'path': 'spp-media-p1/7d56360096354d32b26456... | Nike Sportswear | False | [] | [{'key': 'discountRate', 'value': 'Jusqu’à -50... | shoe | [] | |
| 80 | F5713G076-K11 | Sandales - blue | {'original': '24,95 €', 'promotional': '17,55... | [28, 29, 30, 31, 32, 33, 34, 35] | friboo-sandales-blue-f5713g076-k11 | [{'path': 'spp-media-p1/b4d53957f8f7366fa52b44... | Friboo | False | [] | [{'key': 'discountRate', 'value': '-30%', 'tra... | shoe | [] | |
| 81 | AD116D12Y-C11 | OZWEEGO UNISEX - Baskets basses - grey five/gr... | {'original': '89,95 €', 'promotional': '49,45... | [35.5, 36, 38, 36 2/3, 37 1/3, 38 2/3] | adidas-originals-ozweego-unisex-baskets-basses... | [{'path': 'spp-media-p1/002dd5dcd4333c8e92317f... | adidas Originals | False | [] | [{'key': 'discountRate', 'value': 'Jusqu’à -45... | shoe | [] | |
| 82 | NI116D0B5-A11 | NIKE VALIANT - Baskets basses - white/ silver-... | {'original': '44,95 €', 'promotional': '22,45... | [35.5, 36, 36.5, 37.5, 38, 38.5, 39, 40] | nike-sportswear-nike-valiant-baskets-basses-wh... | [{'path': 'spp-media-p1/2078c54bd4c94c4c825657... | Nike Sportswear | False | [] | [{'key': 'discountRate', 'value': 'Jusqu’à -50... | shoe | [] | |


#### Collect the first 84 objects of the data (1st page)

Your output should be a Pandas DataFrame of goods. Each row should contain only text or numbers, except for *family_articles, flags, media* and *sizes*, which remain lists.
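To see what "only text or numbers" means for a nested field such as `price`, here is a hand-written sketch of the flattening idea. It is not pandas' implementation; the helper `flatten` is hypothetical, and the record is a trimmed copy of one article from the response. Nested dictionaries become dotted column names, while lists are left untouched.

```python
def flatten(record, parent_key='', sep='.'):
    """Flatten nested dicts into dotted keys; lists are kept as they are."""
    flat = {}
    for key, value in record.items():
        full_key = f'{parent_key}{sep}{key}' if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, e.g. price -> price.original
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Trimmed copy of one article from the response
article = {
    'sku': 'QU153G00I-Q11',
    'name': 'Trousse - black',
    'price': {'original': '10,99\xa0\xa0€', 'promotional': '4,40\xa0\xa0€'},
    'sizes': ['One Size'],
    'flags': [{'key': 'discountRate', 'value': '-60%'}],
}

row = flatten(article)
print(list(row))
```

The resulting keys follow the same dotted naming (`price.original`, `price.promotional`) that appears in the flattened DataFrame later in this lab.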

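In [None]:
# Your code

# Note: the `url` defined above uses offset=84 (the 2nd page); use offset=0 for the 1st page
response = requests.get(url)
results = response.json()
results


# Flatten each article so that nested fields such as price become separate columns
flattened_data1 = pd.json_normalize(results['articles'])
flattened_data1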

In [7]:
url = 'https://www.zalando.fr/api/catalog/articles?age_groups=BABIES&categories=mode-enfant&limit=84&offset=84'

response = urlopen(url).read().decode('utf-8')
json_data = json.loads(response)

In [23]:
#results = pd.DataFrame(pd.json_normalize(json_data['articles']))

In [39]:
results['articles']

[{'sku': 'QU153G00I-Q11',
  'name': 'Trousse - black',
  'price': {'original': '10,99\xa0\xa0€',
   'promotional': '4,40\xa0\xa0€',
   'has_different_prices': False,
   'has_different_original_prices': False,
   'has_different_promotional_prices': False,
   'has_discount_on_selected_sizes_only': False},
  'sizes': ['One Size'],
  'url_key': 'quiksilver-trousse-black-qu153g00i-q11',
  'media': [{'path': 'spp-media-p1/34a5665512363d56b46e82cd0633426d/59f1885c00bf4860a986b441f625e08a.jpg',
    'role': 'DEFAULT',
    'packet_shot': True}],
  'brand_name': 'Quiksilver',
  'is_premium': False,
  'family_articles': [],
  'flags': [{'key': 'campaign',
    'value': '-15% EXTRA',
    'tracking_value': 'fr_eoss_ss21_wave_2_2021'},
   {'key': 'discountRate', 'value': '-60%', 'tracking_value': 'discount rate'},
   {'key': 'csr',
    'value': 'Éco-responsabilité',
    'tracking_value': 'sustainable'}],
  'product_group': 'accessoires',
  'delivery_promises': []},
 {'sku': 'OV023C02B-K11',
  'name': 

#### Collect all the objects for the selected filters. The total number of pages can be found in the same JSON. Use the *sku* column as the index.

Your output should be a Pandas DataFrame of goods. Each row should contain only text or numbers, except for family_articles, flags, media and sizes, which remain lists.
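Before looping over the pages, it can help to work out which offsets will be requested. The sketch below assumes a page size of 84, as in the URL above; the helper `page_offsets` and the `page_count` value are hypothetical, and the real count comes from `results['pagination']['page_count']`.

```python
def page_offsets(page_count, limit=84):
    """Return the offset of the first article on each page (hypothetical helper)."""
    return [page * limit for page in range(page_count)]


# Made-up pagination block; in the lab it is read from the JSON response
pagination = {'page_count': 4}

offsets = page_offsets(pagination['page_count'])
print(offsets)  # [0, 84, 168, 252]
```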

In [40]:
# Get the total number of pages from the pagination block of the last response
total_pages = results['pagination']['page_count']

# Your code
# Collect one DataFrame per page and concatenate once at the end
# (DataFrame.append was removed in pandas 2.0)
pages = []
for i in range(total_pages):
    print(f'Extracting page {i} ... ')
    offset = 84 * i  # each page holds 84 articles
    url = f'https://www.zalando.fr/api/catalog/articles?categories=promo-enfant&limit=84&offset={offset}&sort=sale'
    response = urlopen(url).read().decode('utf-8')
    results = json.loads(response)
    # Flatten nested fields (e.g. price.original) and index the page by sku
    page_df = pd.json_normalize(results['articles']).set_index('sku')
    pages.append(page_df)

df = pd.concat(pages)


Extracting page 0 ... 
Extracting page 1 ... 
Extracting page 2 ... 
...
Extracting page 344 ... 
Extracting page 345 ... 
...

In [41]:
df

| sku | name | sizes | url_key | media | brand_name | is_premium | family_articles | flags | product_group | delivery_promises | price.original | price.promotional | price.has_different_prices | price.has_different_original_prices | price.has_different_promotional_prices | price.has_discount_on_selected_sizes_only | amount | condition | condition_key | price.base_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LE216D01N-A13 | SOHO - Baskets basses - white/metallic silver | [36, 37, 38, 39] | levis-soho-baskets-basses-whitemetallic-silver... | [{'path': 'spp-media-p1/42a34a9256c13d4c9f4510... | Levi's® | False | [] | [{'key': 'campaign', 'value': '-15% EXTRA', 't... | shoe | [] | 69,95 € | 20,95 € | False | False | False | False | | | | |
| C4A43D002-Q11 | STREET CULTURE CREWNECK UNISEX - T-shirt impri... | [9-10a, 11-12a, 13-14a, 15-16a] | champion-rochester-street-culture-crewneck-uni... | [{'path': 'spp-media-p1/eb26cb0710674715b03c88... | Champion Rochester | False | [] | [{'key': 'campaign', 'value': '-15% EXTRA', 't... | clothing | [] | 24,95 € | 7,55 € | False | False | False | False | | | | |
| N1243D1L3-F11 | DRY UNISEX - T-shirt imprimé - saturn gold/white | [6-8a, 8-10a, 10-12a, 12-13a, 13-15a] | nike-performance-dry-unisex-t-shirt-de-sport-s... | [{'path': 'spp-media-p1/3e9866a120cd4694b5bdec... | Nike Performance | False | [] | [{'key': 'campaign', 'value': '-15% EXTRA', 't... | clothing | [] | 22,95 € | 6,95 € | False | False | False | False | | | | |
| AD116D0PK-Q11 | TEAM COURT - Baskets basses - core black/foot... | [36, 38, 36 2/3, 37 1/3, 38 2/3] | adidas-originals-team-court-baskets-basses-ad1... | [{'path': 'spp-media-p1/8c0e4514252d392bbc7d62... | adidas Originals | False | [] | [{'key': 'campaign', 'value': '-15% EXTRA', 't... | shoe | [] | 59,95 € | 20,95 € | False | False | False | False | | | | |
| N1243A141-C11 | STAR RUNNER 2 UNISEX - Chaussures de running n... | [27.5, 28.5, 30, 31.5, 33, 34] | nike-performance-star-runner-2-unisex-chaussur... | [{'path': 'spp-media-p1/12ef704acac933aeb07ae9... | Nike Performance | False | [] | [{'key': 'campaign', 'value': '-15% EXTRA', 't... | shoe | [] | 32,95 € | 16,45 € | True | False | True | False | 148 g | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| EA823K002-J11 | Sweatshirt - rosa fumetto | [10a, 12a] | emporio-armani-sweatshirt-rosa-fumetto-ea823k0... | [{'path': 'spp-media-p1/6dd699ff42ea30db9360b5... | Emporio Armani | True | [] | [{'key': 'campaign', 'value': 'Prix Mini', 'tr... | clothing | [{'key': 'slow_delivery_flag', 'label': 'Livra... | 224,95 € | 224,95 € | False | False | False | False | | | | |
| NI124G00X-C11 | REBRAND REPEAT TEE BABY - T-shirt imprimé - gr... | [12m] | nike-sportswear-rebrand-repeat-tee-baby-t-shir... | [{'path': 'spp-media-p1/2550eb1073ea39a7adde7c... | Nike Sportswear | False | [] | [{'key': 'campaign', 'value': 'Prix Mini', 'tr... | clothing | [] | 13,95 € | 13,95 € | False | False | False | False | | | | |
| PO923G01A-K11 | TEE SHORT SLEEVES - T-shirt imprimé - blue night | [4-5a] | 3-pommes-tee-short-sleeves-t-shirt-imprime-blu... | [{'path': 'spp-media-p1/328221b15f4d3f12b1f9f0... | 3 Pommes | False | [] | [{'key': 'campaign', 'value': 'Prix Mini', 'tr... | clothing | [] | 17,95 € | 17,95 € | False | False | False | False | | | | |
| C1824B00D-Q11 | LOGO BELT TAPERED CHINO - Chino - black | [6a] | calvin-klein-jeans-logo-belt-tapered-chino-pan... | [{'path': 'spp-media-p1/8fd71d7796bf30e1adeea6... | Calvin Klein Jeans | False | [] | [{'key': 'campaign', 'value': 'Prix Mini', 'tr... | clothing | [] | 59,95 € | 59,95 € | False | False | False | False | | | | |


#### Display the trending (most frequent) brand in the DataFrame

In [42]:
df.brand_name.value_counts().index[0]

'Name it'

#### Display the brand with maximal total discount (sum of discounts on all goods)

In [43]:
# Our data is still text. Convert prices into numbers:
# keep only the numeric part (this strips the currency sign) ...
df['price.original'] = df['price.original'].str.extract(r'(\d*,\d*)', expand=False)
df['price.promotional'] = df['price.promotional'].str.extract(r'(\d*,\d*)', expand=False)

# ... then replace the decimal comma with a dot so the text can be cast to float
df['price.original'] = df['price.original'].str.replace(',', '.', regex=False)
df['price.promotional'] = df['price.promotional'].str.replace(',', '.', regex=False)
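The same conversion can be tried on single strings without pandas. This is a minimal sketch, assuming prices use a decimal comma and no thousands separator; the helper name `parse_price` is hypothetical.

```python
import re

# Digits, a decimal comma, digits: e.g. "39,95"
PRICE_PATTERN = re.compile(r'(\d+),(\d+)')


def parse_price(text):
    """Convert a price string such as '39,95 €' to a float, or None if no number is found."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    # Join the integer and decimal parts with a dot before casting
    return float(f'{match.group(1)}.{match.group(2)}')


print(parse_price('39,95\xa0€'))
print(parse_price('10,99\xa0\xa0€'))
print(parse_price('n/a'))
```

Returning `None` for strings without a number keeps missing prices visible instead of raising an error in the middle of a long run.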

In [44]:
df['discount_amount']=df['price.original'].astype(float)-df['price.promotional'].astype(float)
df1=df.copy()

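In [45]:
# Sum only the discount column per brand; summing every column would also try to add up text and lists
total_disc = df1.groupby('brand_name')['discount_amount'].sum()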

In [46]:
total_disc.sort_values(ascending=False).index[0]

'Nike Sportswear'
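The grouping step can also be checked by hand on a few rows. The sketch below uses plain Python rather than pandas; the rows are copied from the sample output shown earlier in this lab, so the result describes only these five rows, not the full dataset.

```python
from collections import defaultdict

# (brand, original price, promotional price) from a few rows of the sample output
rows = [
    ('Kappa', 39.95, 19.95),
    ('GAP', 39.95, 11.95),
    ('Nike Sportswear', 32.95, 16.45),
    ('Nike Sportswear', 64.95, 32.45),
    ('Nike Sportswear', 44.95, 22.45),
]

# Add up the discount (original - promotional) for each brand
totals = defaultdict(float)
for brand, original, promotional in rows:
    totals[brand] += original - promotional

# The brand with the largest summed discount
top_brand = max(totals, key=totals.get)
print(top_brand)
```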

#### Display the brands without any discount

In [47]:
total_disc[total_disc==0]

brand_name
24Bottles            0.0
3 Pommes             0.0
ALDO                 0.0
Bardot Junior        0.0
Barts                0.0
Billybandit          0.0
Bloch                0.0
Catimini             0.0
Chipie               0.0
Cross Sportswear     0.0
D-XEL                0.0
Didriksons           0.0
Ebbe                 0.0
Hurley               0.0
Hype                 0.0
J.CREW               0.0
Kaps                 0.0
Kjus                 0.0
Lili Gaufrette       0.0
Missoni Kids         0.0
Monta Juniors        0.0
Nike SB              0.0
ODLO                 0.0
Outerstuff           0.0
Patagonia            0.0
Paul Smith Junior    0.0
Rojo                 0.0
Scholl               0.0
Shoesme              0.0
Smitten Organic      0.0
Sorel                0.0
Stella McCartney     0.0
Sunnylife            0.0
Tumble 'n dry        0.0
UGG                  0.0
Umbro                0.0
Unisa                0.0
VOGUE Eyewear        0.0
White Stuff          0.0
Ziener        