Here we retrieve the missing data from the API.

In [1]:
import requests
import pandas as pd
from tqdm import tqdm_notebook

In [2]:
base_url = 'https://y29rdnycjd.execute-api.eu-west-1.amazonaws.com/dev/'

In [3]:
r = requests.get(base_url)

In [4]:
r.ok

True

The request succeeds (`r.ok` is `True` for any status code below 400), so we can connect without authentication.

The [API's Swagger doc](https://y29rdnycjd.execute-api.eu-west-1.amazonaws.com/dev/swagger.json) lists an endpoint named `missingdata`, which takes an id and returns the missing columns for that row.
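
Rather than reading the Swagger document by eye, the endpoints can also be listed programmatically. The sketch below assumes the document follows the usual Swagger 2.0 layout with a top-level `paths` mapping; `list_endpoints` and `example_spec` are hypothetical names, and the example fragment is made up for illustration rather than copied from the real API.

```python
def list_endpoints(spec):
    """Return sorted (path, METHOD) pairs from a Swagger/OpenAPI spec dict."""
    http_methods = {"get", "put", "post", "delete", "options", "head", "patch"}
    endpoints = []
    for path, item in spec.get("paths", {}).items():
        for key in item:
            # Path items can also hold non-method keys such as "parameters".
            if key.lower() in http_methods:
                endpoints.append((path, key.upper()))
    return sorted(endpoints)


# Hypothetical spec fragment, for illustration only.
example_spec = {
    "swagger": "2.0",
    "paths": {
        "/missingdata/{id}": {
            "get": {"summary": "hypothetical summary"},
            "parameters": [{"name": "id", "in": "path"}],
        }
    },
}

print(list_endpoints(example_spec))
```

Against the live API this would presumably be called as `list_endpoints(requests.get(base_url + 'swagger.json').json())`, assuming that URL returns JSON.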

In [5]:
endpoint_url = base_url + 'missingdata/{id}'
endpoint_url

'https://y29rdnycjd.execute-api.eu-west-1.amazonaws.com/dev/missingdata/{id}'

In [6]:
sample_id = 'a0ef54d140f90a214b247122aa904a34'

In [7]:
r = requests.get(endpoint_url.format(id=sample_id))

In [8]:
r.request.url

'https://y29rdnycjd.execute-api.eu-west-1.amazonaws.com/dev/missingdata/a0ef54d140f90a214b247122aa904a34'

In [9]:
r.content

b'{"message": "Sample not found"}\n'

So the endpoint works, but that id is not found.

Now we will read one of the dataframes we already created.

In [10]:
files_df = pd.read_pickle('data/clean/files_df.pkl')

In [11]:
files_df.columns

Index(['tierafterorder', 'orderportalid', 'size', 'orderdate_gmt',
       'hasusedwishlist', 'ldsa_team_wishes_you', 'shipper', 'productid',
       'isreseller', 'issale', 'category_1stlevel', 'tierbeforeorder',
       'ddprate', 'platform', 'style', 'region', 'isusingmultipledevices',
       'freereturn', 'userid', 'isvip', 'brand', 'promocode', 'designer',
       'ddpsubcategory', 'shiptypeid', 'hasitemsonbag', 'country',
       'countrycode', 'countryoforigin', 'userfraudstatus'],
      dtype='object')

I asked the instructors how to tell which ids are in the API, so that I wouldn't have to blast it with unnecessary requests.

It turns out that the rows available in the API are those whose column values are the placeholder `API`. The cell below checks the `tierafterorder` column for that value.

In [12]:
ids_api = files_df[files_df.tierafterorder == 'API'].index.values

In [13]:
ids_api.shape

(10000,)
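
The filter above looks at a single column. If the placeholder is supposed to fill every column of those rows, that can be checked before sending any requests. A minimal sketch on a toy frame follows; `toy_df` and its values are invented for illustration, and only the `API` placeholder and the `tierafterorder` column name come from the data above.

```python
import pandas as pd

# Toy stand-in for files_df; the real frame has many more columns.
toy_df = pd.DataFrame(
    {
        "tierafterorder": ["API", "VIP", "API"],
        "platform": ["API", "web", "API"],
    },
    index=["id_a", "id_b", "id_c"],
)

is_placeholder = toy_df.eq("API")
any_api = is_placeholder.any(axis=1)  # at least one column is 'API'
all_api = is_placeholder.all(axis=1)  # every column is 'API'

placeholder_ids = toy_df.index[any_api].tolist()
print(placeholder_ids)

# If the two masks agree, filtering on one column is enough.
print(any_api.equals(all_api))
```

On the real data, the same comparison on `files_df` would show whether filtering on `tierafterorder` alone selects the same rows as checking every column.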

In [14]:
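def get_missing_data(row_id):
    """Return the missing columns for row_id as a dict, or None if the request fails."""
    r = requests.get(endpoint_url.format(id=row_id))
    if r.ok:
        return r.json()
    # Non-2xx/3xx responses (e.g. an unknown id) are reported as None.
    return None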

In [15]:
get_missing_data(ids_api[0])

{'orderportalid': 382388,
 'orderdate_gmt': '2018-01-01 00:15:06.020000+00:00',
 'designer': 4295,
 'style': 4299,
 'shipper': 2,
 'shiptypeid': 2,
 'userid': 257187.0,
 'isvip': 'Not VIP',
 'country': 1,
 'region': 1,
 'ddprate': 5.0083,
 'countrycode': 1,
 'hasusedwishlist': 'Yes',
 'isreseller': 'No',
 'hasitemsonbag': 'No',
 'tierafterorder': None,
 'tierbeforeorder': None,
 'isusingmultipledevices': 'Yes',
 'userfraudstatus': 3,
 'promocode': 1,
 'freereturn': 1,
 'issale': 'Yes',
 'productid': 4450,
 'brand': 337,
 'ddpsubcategory': 'Footwear with outer soles of rubber or plastics',
 'storeid': 5,
 'countryoforigin': 1,
 'size': 12,
 'category_1stlevel': 'Shoes',
 'platform': 'web',
 'returned': None}

It works!

Now we just need to make all of the requests. We will use [dask](https://dask.org) to speed up the process (see the [installation instructions](https://docs.dask.org/en/latest/install.html)).

In [16]:
import dask
from dask.distributed import Client
client = Client()
client

Client: Scheduler: tcp://127.0.0.1:58983, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 8.59 GB


In [17]:
data = [dask.delayed(get_missing_data)(row_id) for row_id in ids_api]

In [18]:
res = dask.compute(data)

In [19]:
len(res[0])

10000

In [20]:
res[0][0]

{'orderportalid': 382388,
 'orderdate_gmt': '2018-01-01 00:15:06.020000+00:00',
 'designer': 4295,
 'style': 4299,
 'shipper': 2,
 'shiptypeid': 2,
 'userid': 257187.0,
 'isvip': 'Not VIP',
 'country': 1,
 'region': 1,
 'ddprate': 5.0083,
 'countrycode': 1,
 'hasusedwishlist': 'Yes',
 'isreseller': 'No',
 'hasitemsonbag': 'No',
 'tierafterorder': None,
 'tierbeforeorder': None,
 'isusingmultipledevices': 'Yes',
 'userfraudstatus': 3,
 'promocode': 1,
 'freereturn': 1,
 'issale': 'Yes',
 'productid': 4450,
 'brand': 337,
 'ddpsubcategory': 'Footwear with outer soles of rubber or plastics',
 'storeid': 5,
 'countryoforigin': 1,
 'size': 12,
 'category_1stlevel': 'Shoes',
 'platform': 'web',
 'returned': None}

If you don't want to use dask, you can do it in plain Python. To do so, uncomment the code below and use `all_data` in place of `res[0]` in the cells that follow.

In [21]:
#all_data = []
#for i in tqdm_notebook(ids_api):
#    all_data.append(get_missing_data(i))
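
The plain loop sends one request at a time. Since the work is network-bound, a thread pool from the standard library is another way to run the requests concurrently without dask. This is only a sketch: `fetch_all` and `fake_fetch` are hypothetical names, `fake_fetch` stands in for `get_missing_data` so the example runs offline, and `max_workers=8` is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch, ids, max_workers=8):
    """Call fetch(id) for every id in a thread pool; results keep the order of ids."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(fetch, ids))


def fake_fetch(row_id):
    """Offline stand-in for get_missing_data."""
    return {"id": row_id}


results = fetch_all(fake_fetch, ["a", "b", "c"])
print(results)
```

With the function defined earlier, the call would be `all_data = fetch_all(get_missing_data, ids_api)`.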

Now we make a dataframe out of the JSON responses.

In [22]:
api_df = pd.DataFrame(res[0])

In [23]:
api_df.head()

Unnamed: 0,brand,category_1stlevel,country,countrycode,countryoforigin,ddprate,ddpsubcategory,designer,freereturn,hasitemsonbag,...,returned,shipper,shiptypeid,size,storeid,style,tierafterorder,tierbeforeorder,userfraudstatus,userid
0,337,Shoes,1,1,1,5.0083,Footwear with outer soles of rubber or plastics,4295,1,No,...,,2,2,12,5,4299,,,3,257187.0
1,681,Clothing,1,1,8,0.0,Skirts,79959,1,No,...,,1,1,395,764,79978,,,3,270119.0
2,9,Bags,3,3,1,0.0,"Handbags, whether or not with shoulder strap, ...",13129,1,Yes,...,,2,2,17,174,13136,,,3,264764.0
3,92,Clothing,4,4,1,0.0,"Jerseys, pullovers, cardigans, waistcoats and ...",268,1,Yes,...,,2,9,35,156,268,VIP,VIP,6,195.0
4,1617,Teen Girl Clothing,1,1,1,5.0083,Dresses,148401,1,No,...,,4,2,10,163,148434,,,3,254415.0


In [24]:
api_df['id'] = ids_api

In [25]:
api_df = api_df.set_index('id')
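
Assigning `ids_api` as the index relies on the responses being in the same order as the ids and on every request having succeeded, because `get_missing_data` returns `None` on failure. Should some requests fail, one option is to pair each id with its response first and set the failures aside. A minimal sketch, with the hypothetical helper `split_responses` and made-up toy values:

```python
def split_responses(ids, responses):
    """Pair ids with responses; return (kept_ids, records, failed_ids)."""
    kept_ids, records, failed_ids = [], [], []
    for row_id, response in zip(ids, responses):
        if response is None:
            failed_ids.append(row_id)  # candidate for a retry later
        else:
            kept_ids.append(row_id)
            records.append(response)
    return kept_ids, records, failed_ids


# Toy values for illustration only.
toy_ids = ["id_a", "id_b", "id_c"]
toy_responses = [{"size": 12}, None, {"size": 10}]

kept_ids, records, failed_ids = split_responses(toy_ids, toy_responses)
print(kept_ids, failed_ids)
```

The dataframe could then be built with `pd.DataFrame(records, index=pd.Index(kept_ids, name='id'))`, and `failed_ids` requested again.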

In [26]:
api_df.to_pickle('data/clean/api_df.pickle')