# Parsing APIs

## Intro

When you look for open data, you often find that no ready-made dataset exists. In that case you have to collect the data from web pages yourself.

Web scraping tools are designed to extract data from websites. Web scraping lets you automate data collection: instead of opening pages and copy-pasting their content, you write a short script that collects any volume of information. The script can be run manually or on a schedule to retrieve new or updated data and save it for later use. For example, with web scraping tools you can extract information about products and prices from online stores.

Possible scenarios for using web scraping tools:

* Data collection for marketing research
* Contact information extraction (email addresses, phone numbers, etc.) from different sites
* Retrieval of solutions from Stack Overflow (or a similar resource) for offline access
* Real-time collection of job offers
* Price tracking

Depending on their complexity and the time needed to scrape them, websites can be split into 2 main groups:

* API-based websites
* HTML-based websites

Now we will look at real data. This file shows an example of API parsing.

[Zalando.fr](https://www.zalando.fr/accueil-homme/) is an API-based website.

Here I will show how to obtain the API link from the web page and extract the data.

*Note: I use Google Chrome. Other browsers such as Safari and Firefox have similar developer tools, but they work differently. To save time while following this file, I strongly recommend that you install and use Google Chrome.*



## Obtaining the link

Zalando is an e-store where you can buy clothes and accessories at a discount. The website is split into several sections. The general process is shown first, using the [Children section](https://www.zalando.fr/accueil-enfant/) as an example.

Here we will parse data about promotions only. The final output will therefore be a DataFrame with all the discounted goods.

[![Image from Gyazo](https://i.gyazo.com/fa4874d8e81c7570273bbfb853d66308.png)](https://gyazo.com/fa4874d8e81c7570273bbfb853d66308)


We go to the Promos page. A right click opens a context menu, from which we select Inspect.

<img src='https://i.gyazo.com/bccbd11d69c9040dc98758d443e32052.png' width="400">


The developer tools panel opens on the right side or at the bottom of the window. There you should click on Network:


[![Image from Gyazo](https://i.gyazo.com/f7e0db81cbfee67694183d1a7640bf81.png)](https://gyazo.com/f7e0db81cbfee67694183d1a7640bf81)


The developer panel now shows the files loaded behind the page. To see only the useful files, we select the following settings:
1. Enable Preserve log.
2. Filter by XHR files.

[![Image from Gyazo](https://i.gyazo.com/9a899d4441d9d93e795f79747f1e47d5.png)](https://gyazo.com/9a899d4441d9d93e795f79747f1e47d5)

To make the browser request some files, we need to scroll down and go to the second page.

[![Image from Gyazo](https://i.gyazo.com/0956eb3d5125075a236c9a439c7749c7.png)](https://gyazo.com/0956eb3d5125075a236c9a439c7749c7)

In the Network panel you can see the following files being loaded. All the data on the web page is loaded from a JSON file, which is one of them. It is important to understand which file contains what kind of information.

<a href="https://gyazo.com/cf97a655869f0b22df0ada1cb2a41c3c"><img src="https://i.gyazo.com/cf97a655869f0b22df0ada1cb2a41c3c.png" alt="Image from Gyazo" width="724.8"/></a>

When you find a file that seems to contain the information you need, just test it. Here we need the article... file:

<a href="https://gyazo.com/78b35bf492994b3f35c0564a21da202a"><img src="https://i.gyazo.com/78b35bf492994b3f35c0564a21da202a.png" alt="Image from Gyazo" width="727.2"/></a>

When we test the link in Chrome incognito mode, we obtain the proper JSON file:


<a href="https://gyazo.com/b60453fa98454fa29771c731a5174443"><img src="https://i.gyazo.com/b60453fa98454fa29771c731a5174443.png" alt="Image from Gyazo" width="1530.4"/></a>

To get other objects in the JSON file (a kind of pagination), you need to change the offset (the number of the first element on the page). In fact, if you look at the link, its structure is easy to understand.
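The offset-based pagination can be sketched with the standard library. The base URL and the query parameters (`categories`, `limit`, `offset`, `sort`) match the link used later in this file; the helper name `build_page_url` and the zero-based page numbering are assumptions made for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.zalando.fr/api/catalog/articles"
PAGE_SIZE = 84  # number of articles requested per page


def build_page_url(page, category="promo-enfant", page_size=PAGE_SIZE):
    """Return the API URL for a zero-based page number (hypothetical helper)."""
    params = {
        "categories": category,
        "limit": page_size,
        "offset": page * page_size,  # number of the first element on the page
        "sort": "sale",
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_page_url(1))
```

Only `offset` changes from one page to the next, so requesting another page is a matter of passing a different page number.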

# Reading the data

Now the fun begins! Once we know how to obtain the data, collecting the whole database behind the web page is not a problem.
I will use *Python 3.7* with the *json*, *requests* and *pandas* libraries to collect the data.



In [1]:
import json
import requests
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize


In [2]:
# Paste the url you obtained for your data
url='https://www.zalando.fr/api/catalog/articles?categories=promo-enfant&limit=84&offset=84&sort=sale'


## First problem

Every company tries to protect its data. So, to obtain the link that leads to the API data, you usually need to do a fair amount of work. Be patient and you will get it!

Another way to protect the information is cookies. Cookies are basically your credentials that prove you are a human, not a robot. They are generated when you access the website with your browser. When you access the website with Python, you do not introduce yourself to the website, so nothing is generated. To get past this, open the website in a browser first and collect the generated cookies. With this data you can *'fool'* the web page.

In [4]:
headers={'accept': '*/*',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'ru-KG,ru;q=0.9,fr-FR;q=0.8,fr;q=0.7,en-GB;q=0.6,en;q=0.5,ru-RU;q=0.4,en-US;q=0.3',
'cache-control': 'no-cache',
'cookie': 'fvgs_ml=mosaic; Zalando-Client-Id=4bfa4af1-aea7-4399-b745-6439d1d552d6; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1; bm_sz=0E165704C68EAA01EF151AE1C53D3AC9~YAAQBv4BF2eSxn5sAQAAXVZwkARCdzOLIxKw4zqi4SeizUuMuRvgJ7g/Gm/R+zj4sgHqigXVPt11bKn9Cpd6PZJS/uCuMWcki9yi5X6oL4cPdZTwMjvF1uBtR1GAm2WRp9jmOS0bIkIm74ON/VGhEsHa9gL5xEjrzQeH2cdTPj9zQuKr1WfRCCzVIDlIzKi7; ncx=k; _ga=GA1.2.1787745730.1565791379; _abck=EA94808A8A776813646168D5844BA4D3~0~YAAQBv4BF4CSxn5sAQAABVpwkAI7yu9yGkTrizxawtH03WjbSrcmjQelFV8p/xqywXQeWW0W4XFUbH8mlmec8/qYeGLVnyq1bgYX1A3H1hmPYuOi7L16AA+7IhYx5doiv3tkKmtv3buM5WldJtgXXSYY+zeMCA810RypWro0MEKPw07YwthrOGxeby8BNl5xBDRSY2bsSw42Q69uCMR2vnGJV17GrdpENQl7Z7UNWo4pCqOk63lfEWs74MrvfBOxFGbMLt0pa8bzrs3KyaJRn/zGduLUNJ3Lg4mdQ+37goE/DBx7dH/3HAU=~-1~-1~-1; _ga=GA1.2.1787745730.1565791379; _gid=GA1.2.264681709.1565791381; _gcl_au=1.1.683590255.1565791382; frsx=AAAAAOT4kuc0u1VwTcmNCjH24teLUFFjA5L2hIrcZ-CYOaK9pBJkKqDakYJb1XuDjv7x5w29rjjAj6otFmB1CzrOlxHoJ24sbCCDF3Wg-W4ckUKNfq4iJ2jZhWBsuYtbm1jajpDTvQrSVoIfeM2Y; bm_mi=22C3D748E443EB2920BB01387E46A7F6~GhWWdQDkiuRHnwXvK2kAcE/j2FKRoNJaxCFfKs8Csh8kayWJN0y322pk9UoxLztfK6I6ArhL8Tmn/0QP/GjaEnWwUep6EfzGASAeOo/7GOrwACd8m2tBqYI+NOc+c8CGovyOMi/Kt+028w4o7BJI5g4heWxmJNyv5gioaV83iFqreZjT1E+i+XBT60wf9u192Vk26ePOgNZ0wi8csOMO4rguAvIff5T4Ell/MD8F50XV9rsR4jaOUmNtRAj0Q8ACkGL4AQZyBn/z6lqEgUDWDKYh9xPKj8RVnOKfoXU97c0=; ak_bmsc=AB0DC65CD857CD47ED48D7F6E9CE4BBA1701FE06453200009014545DE5FC5F31~pluF2L4qRByD/4lhsvFenAhc+fjx/PBWRy1xK+kKH1BUf+D7uOhaqsOY+ot1zxljA2ymV7sQvlPvTXJ23SEDFf3mX/1155c4eOe6mtpokYKqJq4RRrT6dcpWmUCN2rZNtBZCyToXIMX3qbYgoWZVaji46XgMSZeINbAYYf3rD87ZpcWiON9OyKXIhMkyZUqKgtlb4XeW07ElSIIopGnw5pfuaCs0QP/I2PPnHiv2xX7SJqqk5Zp5O4smrU5VeiHmzcraLoGF0of+5yL73wyXsC1Q==; _gat_zalga=1',
'pragma': 'no-cache',
'referer': 'https://www.zalando.fr/promo-enfant/',
'user-agent': 'Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Mobile Safari/537.36',
'x-xsrf-token': 'AAAAAOT4kuc0u1VwTcmNCjH24teLUFFjA5L2hIrcZ-CYOaK9pBJkKqDakYJb1XuDjv7x5w29rjjAj6otFmB1CzrOlxHoJ24sbCCDF3Wg-W4ckUKNfq4iJ2jZhWBsuYtbm1jajpDTvQrSVoIfeM2Y',
'x-zalando-catalog-nakadi-context': '%7B%22previous_categories%22%3A%5B%22promo-enfant%22%5D%2C%22previous_selected_filters%22%3A%5B%5D%2C%22preselected_filters%22%3A%5B%5D%7D',
'x-zalando-octopus-tests': '%5B%7B%22testName%22%3A%22mobile-filters-design%22%2C%22testVariant%22%3A%22mobile-light-filters%22%2C%22testFeedbackId%22%3A%2200000000-0000-0000-0000-000000000000%3A__EMPTY__%22%7D%2C%7B%22testName%22%3A%22image-test%22%2C%22testVariant%22%3A%22Control%20Variant%20A%20Test1%22%2C%22testFeedbackId%22%3A%2282baf50f-d236-47ea-a8f7-a36d388bca89%3Aclientid-4bfa4af1-aea7-4399-b745-6439d1d552d6%22%7D%2C%7B%22testName%22%3A%22teaser-card-test%22%2C%22testVariant%22%3A%22outward%22%2C%22testFeedbackId%22%3A%2200000000-0000-0000-0000-000000000000%3A__EMPTY__%22%7D%2C%7B%22testName%22%3A%22filter-cleanup-test%22%2C%22testVariant%22%3A%22sorting-in-filter-toggle-groups%22%2C%22testFeedbackId%22%3A%22717403e6-fe98-4d17-9208-c9f40f1232b8%3Aclientid-4bfa4af1-aea7-4399-b745-6439d1d552d6%22%7D%5D'}

Cookies can be obtained from the same window where you get the link. Copy the **Request Headers** and reformat them as a Python dictionary.
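Reformatting the headers by hand is tedious, so here is a minimal sketch that does it automatically. It assumes the headers were copied from the browser as plain text with one `name: value` pair per line; the helper name `headers_from_text` is made up, and the sample text is a shortened version of the headers above.

```python
def headers_from_text(raw):
    """Turn 'name: value' lines copied from the browser into a dict (hypothetical helper)."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'
        if not line or line.startswith(":"):
            continue
        name, sep, value = line.partition(":")  # split on the first colon only
        if sep:  # ignore lines that contain no colon at all
            headers[name.strip().lower()] = value.strip()
    return headers


raw_headers = """
accept: */*
cache-control: no-cache
referer: https://www.zalando.fr/promo-enfant/
"""
example_headers = headers_from_text(raw_headers)
print(example_headers["referer"])
```

Splitting on the first colon only matters because header values such as URLs contain colons themselves.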

#### Collect the first 84 objects of the data (1st page)

Your output should be a Pandas DataFrame of goods. Each cell should contain only text or numbers, except for *family_articles, flags, media* and *sizes*, which remain lists.

In [9]:
response = requests.get(url, headers=headers)  # request the API link with the headers collected from the browser
results = response.json()  # parse the JSON response into a Python dictionary

flattened_data = json_normalize(results)  # flatten the nested dictionary into a one-row DataFrame
flattened_data.head()

Unnamed: 0,articles,articlesToShow,breadcrumbs,carouselTeaser,categoryTree,collection,contentPositions.entry-point-teasers,contentPositions.in-cat-carousel,contentPositions.in-cat-carousel-fullwidth,contentPositions.in-cat-carousel-mobile,...,upperInCatTeaser,variants.fullWidthCatalog,variants.groupToggleFilters,variants.hideCategories,variants.mobileLightFilters,variants.myBrandsFilter,variants.outwardTeaserCard,variants.premiumCatalog,variants.sortingInFilter,wishlist
0,"[{'sku': 'L5213E008-J12', 'name': 'BERNIE - C...",84,"[{'items': [{'label': 'Enfant', 'url_key': 'en...",,"[{'label': 'Promotions', 'id': '9574', 'url_ke...",,"[7, 14, 20, 26]",9,8,6,...,,False,True,False,True,True,True,False,True,


Here we can see that the Zalando response is deeply nested, so a single normalization is not enough. The information about the articles is stored as a list under the *'articles'* key, so if we normalize that list we should get one row per article.
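To see why the article list ends up in a single cell, it helps to look at what flattening does: nested dictionaries become dotted column names, while lists are left untouched. The function below is a plain-Python illustration of that idea, not the pandas implementation, and the sample record is a made-up, abbreviated version of the response above.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level with dotted keys; lists are kept as-is."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Abbreviated, made-up response in the shape seen above
sample = {
    "articles": [
        {"sku": "L5213E008-J12", "brand_name": "LICO",
         "price": {"original": "19,95 €", "promotional": "14,00 €"}},
    ],
    "variants": {"premiumCatalog": False},
}

print(list(flatten(sample)))          # top level: the article list stays in one cell
print(flatten(sample["articles"][0])) # one article: the price.* keys appear
```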

In [12]:
flattened_data1 = json_normalize(flattened_data.articles[0])
flattened_data1.head()

Unnamed: 0,amount,brand_name,flags,is_premium,media,name,price.has_different_original_prices,price.has_different_prices,price.has_different_promotional_prices,price.has_discount_on_selected_sizes_only,price.original,price.promotional,product_group,sku,url_key
0,,LICO,"[{'key': 'discountRate', 'value': 'Jusqu’à -32...",False,[{'path': 'L5/21/3E/00/8J/12/L5213E008-J12@12....,BERNIE - Chaussures à scratch - pink/lila/weiß,False,True,True,False,"19,95 €","14,00 €",shoe,L5213E008-J12,lico-bernie-chaussures-dentrainement-et-de-fit...
1,,Lacoste,"[{'key': 'discountRate', 'value': 'Jusqu’à -10...",True,[{'path': 'LA/22/4G/01/6G/12/LA224G016-G12@9.j...,BASIC - Polo - bordeaux,False,True,True,False,"39,95 €","36,00 €",clothing,LA224G016-G12,lacoste-polo-bordeaux-la224g016-g12
2,,Birkenstock,"[{'key': 'discountRate', 'value': '-10%', 'tra...",False,[{'path': 'BI/11/6G/00/0K/11/BI116G000-K11@2.2...,RIO - Sandales de bain - navy,False,False,False,False,"19,98 €","18,00 €",shoe,BI116G000-K11,birkenstock-sandales-navy-bi116g000-k11
3,,New Look 915 Generation,"[{'key': 'discountRate', 'value': '-20%', 'tra...",False,[{'path': 'NL/62/3C/01/IK/11/NL623C01I-K11@10....,MOM SKIRT - Jupe en jean - mid blue,False,False,False,False,"24,99 €","20,00 €",clothing,NL623C01I-K11,new-look-915-generation-mom-skirt-jupe-en-jean...
4,,Tommy Hilfiger,"[{'key': 'discountRate', 'value': '-21%', 'tra...",True,[{'path': 'TO/12/4L/02/VK/11/TO124L02V-K11@7.j...,PADDED BOMBER - Veste d'hiver - blue,False,True,True,False,"139,95 €","110,00 €",clothing,TO124L02V-K11,tommy-hilfiger-padded-bomber-veste-dhiver-blue...


The data looks more or less correct. Let's proceed with all the pages of Zalando.

#### Collect all the objects from the selected filters. The total number of pages can be found in the same JSON. Use the *sku* column as index.

To proceed further, we need to define the goal of the analysis.

***Goal*** of the analysis: find the most discounted goods per brand and create an alert when a product gets a discount higher than a certain percentage.


In [None]:
import time

# Get the total number of pages from the first response
total_pages = results['pagination']['page_count']

# Data collection: one request per page of 84 articles
frames = []
for i in range(total_pages):
    k = 84 * i  # offset of the first article on page i
    url = f'https://www.zalando.fr/api/catalog/articles?categories=promo-enfant&limit=84&offset={k}&sort=sale'
    response = requests.get(url, headers=headers)
    if i % 5 == 0:
        time.sleep(2.4)  # be patient: pause on every fifth page
    results = response.json()
    flattened_data1 = json_normalize(results['articles'])  # one row per article
    frames.append(flattened_data1.set_index('sku'))

# DataFrame.append was removed in pandas 2.0, so the pages are concatenated instead
df = pd.concat(frames)
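The same loop can be written so that the download step is passed in as a function, which makes it possible to try the logic offline. This is only a sketch: `collect_articles` and `fetch_page` are hypothetical names, and it assumes each page has the `pagination.page_count` and `articles` keys seen in the response above.

```python
import time


def collect_articles(fetch_page, page_size=84, pause=0.0):
    """Collect the articles from every page.

    fetch_page(offset) must return the parsed JSON of one page (hypothetical callback).
    """
    first = fetch_page(0)
    total_pages = first["pagination"]["page_count"]
    articles = list(first["articles"])
    for page in range(1, total_pages):
        time.sleep(pause)  # wait between requests to keep the load on the server low
        articles.extend(fetch_page(page * page_size)["articles"])
    return articles


# Offline stand-in for the real API: 3 pages with 2 fake articles each
def fake_fetch(offset):
    return {
        "pagination": {"page_count": 3},
        "articles": [{"sku": f"SKU-{offset + n}"} for n in range(2)],
    }


items = collect_articles(fake_fetch, page_size=2)
print(len(items))  # 6
```

With the real site, `fetch_page` would wrap the `requests.get(...).json()` call used above.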

In [30]:
df.head()

sku,amount,brand_name,flags,is_premium,media,name,price.has_different_original_prices,price.has_different_prices,price.has_different_promotional_prices,price.has_discount_on_selected_sizes_only,price.original,price.promotional,product_group,url_key,discount_amount
NI114D058-Q11,,Nike Sportswear,"[{'key': 'discountRate', 'value': '-20%', 'tra...",False,[{'path': 'NI/11/4D/05/8Q/11/NI114D058-Q11@12....,COURT BOROUGH - Baskets basses - black,False,False,False,False,34.95,28.0,shoe,nike-sportswear-baskets-basses-black-ni114d058...,6.95
CO416A03L-802,,Converse,"[{'key': 'discountRate', 'value': 'Jusqu’à -31...",False,[{'path': 'CO/41/6A/03/L8/02/CO416A03L-802@19....,CHUCK TAYLOR ALL STAR CORE - Baskets basses - ...,False,True,True,False,49.95,35.0,shoe,converse-chuck-taylor-as-core-ox-baskets-basse...,14.95
NI114D057-Q11,,Nike Sportswear,"[{'key': 'discountRate', 'value': '-20%', 'tra...",False,[{'path': 'NI/11/4D/05/7Q/11/NI114D057-Q11@12....,COURT BOROUGH - Chaussures premiers pas - black,False,False,False,False,29.95,24.0,shoe,nike-sportswear-court-borough-baskets-basses-n...,5.95
TO123G04J-K11,,Tommy Hilfiger,"[{'key': 'discountRate', 'value': 'Jusqu’à -26...",True,[{'path': 'TO/12/3G/04/JK/11/TO123G04J-K11@6.j...,GIRLS BASIC - T-shirt basique - sky captain,False,True,True,False,14.95,11.0,clothing,tommy-hilfiger-girls-basic-t-shirt-basique-to1...,3.95
NI114D089-Q13,,Nike Sportswear,"[{'key': 'discountRate', 'value': '-14%', 'tra...",False,[{'path': 'NI/11/4D/08/9Q/13/NI114D089-Q13@3.j...,AIR MAX AXIS - Baskets basses - black/white,False,False,False,False,69.95,60.0,shoe,nike-sportswear-air-max-axis-baskets-basses-ni...,9.95


#### Top 5 brands on Zalando by number of discounted goods

In [15]:
df.brand_name.value_counts().head()

Friboo      893
GAP         630
OVS         591
Benetton    575
Name it     543
Name: brand_name, dtype: int64

#### The brands with the largest total discount (sum of discounts on all goods)

In [16]:
# Our prices are still text such as '19,95 €'. Extract the numeric part:
df['price.original'] = df['price.original'].str.extract(r'(\d*,\d*)', expand=False)
df['price.promotional'] = df['price.promotional'].str.extract(r'(\d*,\d*)', expand=False)

# Replace the decimal comma with a dot so the values can be converted to float
df['price.original'] = df['price.original'].str.replace(',', '.', regex=False)
df['price.promotional'] = df['price.promotional'].str.replace(',', '.', regex=False)
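For a single value, the same conversion can be sketched in plain Python. It assumes the prices look like `'19,95 €'`, as in the tables above; the helper name `parse_price` is made up for illustration.

```python
import re

PRICE_RE = re.compile(r"(\d+),(\d+)")  # digits, decimal comma, digits


def parse_price(text):
    """Convert a string like '19,95 €' to a float; return None if no price is found."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    euros, cents = match.groups()
    return float(f"{euros}.{cents}")


print(parse_price("139,95 €"))  # 139.95
print(parse_price("n/a"))       # None
```

Returning `None` for unparsable text makes it possible to spot missing prices before computing discounts.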

In [17]:
df['discount_amount']=df['price.original'].astype(float)-df['price.promotional'].astype(float)
df1=df.copy()

In [18]:
total_disc=df1.groupby('brand_name')['discount_amount'].sum() # sum only the discount column; the other columns hold text and lists

In [21]:
total_disc.sort_values(ascending=False).head()

brand_name
Polo Ralph Lauren    9159.07
Friboo               7382.39
Tommy Hilfiger       6388.58
J.CREW               6172.41
Nike Performance     5936.63
Name: discount_amount, dtype: float64

#### Brands without any discount

In [22]:
total_disc[total_disc==0]

Series([], Name: discount_amount, dtype: float64)

#### The most discounted items

In [36]:
df[['name','price.original','price.promotional','discount_amount']].sort_values(by='discount_amount',ascending=False).head()

sku,name,price.original,price.promotional,discount_amount
BE926L000-Q11,JUNIOR ROADMASTER - Veste mi-saison - black,349.95,104.95,245.0
BE926L001-K11,KIDS JUNIOR TOURMASTER - Veste mi-saison - dar...,349.95,104.95,245.0
BE926L000-N11,JUNIOR ROADMASTER - Veste mi-saison - faded olive,349.95,104.95,245.0
BE924L000-K11,KIDS JUNIOR TRIALMASTER - Veste mi-saison - da...,374.95,149.95,225.0
PIJ23L006-N11,METAURO GABARDINA USED - Parka - verde/foglia ...,314.95,94.0,220.95


#### The most discounted item per brand

In [86]:
df[['brand_name','name','price.original','price.promotional','discount_amount']].groupby('brand_name').apply(lambda x:x.sort_values('discount_amount',ascending=False).head(1)).reset_index('sku').drop(['brand_name','sku'],axis=1)

brand_name,name,price.original,price.promotional,discount_amount
IGOR,SPORT - Sandales de bain - transparent,24.95,15.00,9.95
3 Pommes,VESTE MILANO - Blazer - marine blue,59.95,30.00,29.95
ASICS,HYPERGEL - Chaussures de running neutres - bla...,89.95,55.00,34.95
Abercrombie & Fitch,Salopette - burgundy velvet,79.95,43.95,36.00
Absorba,BABY SET - Short - rouge,54.95,33.00,21.95
Aigle,LOLLY POP - Bottes en caoutchouc - charcoal,29.00,20.00,9.00
Angel & Rocket,FLORAL PUFF BALL - Robe de soirée - light pink,94.95,38.00,56.95
Armor lux,SALOPETTE - Salopette - lotus,74.95,55.95,19.00
BIDI BADU,TECH - Survêtement - dark blue/pink,49.95,14.95,35.00
BOSS Kidswear,Chemise - himmelblau,89.95,26.95,63.00


#### Alerts
To create alerts that send you an email when some condition is met, you can follow [this](https://kirankoduru.github.io/python/crons-with-python.html) guide. You should define the event, the frequency of updates (how often the code above should be run, keeping in mind that it is a bad idea to parse the whole catalog every time; it is better to check first whether there are any updates) and the email address.
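The event itself can be as simple as a threshold on the discount percentage. The sketch below shows one way to define it; the function names and the 50% threshold are assumptions made for illustration, the two sample rows are taken from the results above, and sending the email is left out.

```python
def discount_percent(original, promotional):
    """Discount as a percentage of the original price."""
    return (original - promotional) / original * 100


def find_alerts(items, threshold=50.0):
    """Return the items whose discount exceeds the threshold (in percent)."""
    return [
        item for item in items
        if discount_percent(item["original"], item["promotional"]) > threshold
    ]


# Two rows from the results above, as plain dictionaries
items = [
    {"sku": "BE926L000-Q11", "original": 349.95, "promotional": 104.95},
    {"sku": "NI114D058-Q11", "original": 34.95, "promotional": 28.0},
]

for item in find_alerts(items, threshold=50.0):
    print(item["sku"])  # only BE926L000-Q11 is discounted by more than 50%
```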