## README TEXT
The crawled data is split into two files: "crawl1.txt" and "crawl2.txt" contain data from Crawl1 and Crawl2 respectively, as described in the paper. The Crawl2 data is smaller because it covers only the first two pages of sellers per product, whereas Crawl1 covers all pages. However, Crawl2 includes data from the Buy Box, while Crawl1 does not.

Every row in the data corresponds to an offer by a seller for a given product at a given epoch from the "New Offers" page for said product at said epoch.
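
As a rough illustration of this row structure, the sketch below groups offers into per-product, per-epoch snapshots using the column names from the table further down. The rows are made up for illustration, and treating `epoc` as a Unix timestamp in seconds is an assumption (it is consistent with sample values such as 1439301853, but the README does not state the unit).

```python
import pandas as pd

# Illustrative rows only; the pid and sid values are hypothetical.
offers = pd.DataFrame({
    "pid": ["P1", "P1", "P1", "P2"],
    "epoc": [1439301853, 1439301853, 1439305453, 1439301853],
    "sid": ["S1", "S2", "S1", "S1"],
})

# Assumption: epoc is a Unix timestamp in seconds (UTC).
offers["collected_at"] = pd.to_datetime(offers["epoc"], unit="s", utc=True)

# All rows sharing (pid, epoc) form one snapshot of that product's
# "New Offers" page; size() counts the offers in each snapshot.
offers_per_snapshot = offers.groupby(["pid", "epoc"]).size()
```

Here `offers_per_snapshot` is indexed by `(pid, epoc)`, so a single snapshot can be looked up with `offers_per_snapshot.loc[("P1", 1439301853)]`.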

The two files contain the following tab-separated columns, listed from left to right. Data from Crawl1 spans columns 1 to 14 (both inclusive); data from Crawl2 includes all columns.

COLUMN  DESCRIPTION
1       pid - (string) The unique product id (or ASIN) that Amazon assigns to each product.
2       epoc - (integer) The timestamp at which this sample was collected.
3       sid - (string) The unique seller id that Amazon assigns to each seller. Note that if the seller is Amazon, the seller id has been saved as "amazon".
4       price - (float) The item price listed by a seller. Note that the item price cannot be 0 for any item. For samples where the item price was unavailable (due to an error in downloading the page, scraping the data or because the price will only be displayed after adding the item to the cart), the item price has been saved as "0". Users are advised to ignore these (price==0) data points when running experiments involving item price.
5       sid_rating - (float) The "star" rating of a seller ranging from 0 to 5 (both inclusive).
6       sid_pos_fb - (integer) The positive feedback score of a seller ranging from 0 to 100 (both inclusive).
7       sid_rating_cnt - (integer) The number of ratings received by a seller.
8       shipping - (float) The shipping price listed by a seller. Note that the shipping price can be 0. For samples where the shipping price was unavailable (due to an error in downloading the page, scraping the data or because the price will only be displayed after adding the item to the cart), the shipping price has been saved as "-1". Users are advised to ignore these (shipping==-1) data points when running experiments involving shipping price.
9       page - (integer) The page number at which a seller was listed. Minimum page number is "1".
10      rank - (integer) The rank of a seller on the page it was listed. Minimum rank is "0".
11      pid_rating - (float) The "star" rating of a product ranging from 0 to 5 (both inclusive). For samples where the rating was unavailable, the rating has been saved as "nan".
12      pid_rating_cnt - (integer) The number of ratings received by a product. For samples where the number of ratings was unavailable, the number of ratings has been saved as "nan".
13      is_fba - (bool) Is this offer listing Fulfilled By Amazon? <yes/no>.
14      is_prime - (bool) Does the seller offer Prime shipping for this listing? <yes/no>.
15      bbox_sid - (string) The seller selected by Amazon as the default (BuyBox) seller for the product.
16      bbox_price - (float) The price listed by the BuyBox seller on the product page for a product.
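
The "unavailable" markers described for columns 4 and 8 can be masked before running any price analysis. The following is a minimal sketch, assuming pandas; the sample rows are invented to exercise each marker, and `mask_sentinels` is a hypothetical helper name, not part of the dataset's tooling.

```python
import pandas as pd

# Illustrative rows only; values are made up to exercise each marker.
sample = pd.DataFrame({
    "pid": ["P1", "P1", "P2"],
    "price": [35.00, 0.0, 12.50],
    "shipping": [6.49, 0.0, -1.0],
})

def mask_sentinels(df):
    """Return a copy with the 'unavailable' markers replaced by NaN."""
    out = df.copy()
    # price == 0 means the item price was unavailable.
    out["price"] = out["price"].where(out["price"] != 0)
    # shipping == -1 means the shipping price was unavailable;
    # shipping == 0 is a valid value (free shipping) and is kept.
    out["shipping"] = out["shipping"].where(out["shipping"] != -1)
    return out

clean = mask_sentinels(sample)
```

Replacing the markers with NaN rather than dropping rows keeps each offer available for analyses that do not involve the affected column. The "nan" strings in columns 11 and 12 need no such step when the files are read with pandas, since its default parser already treats "nan" as a missing value.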

Two additional files "prime_sids.txt.all" and "susp_sids_all_with_amazon.txt" contain the list of prime sellers and algorithmic sellers respectively.



## NOTES


In [2]:
import pandas as pd 
import numpy as np

In [None]:
# crawl1.txt is tab-separated, so use read_table (read_csv defaults to commas)
crawl1 = pd.read_table("crawl1.txt")

In [7]:
crawl2 = pd.read_table("crawl2.txt")

In [16]:
prime_sids = pd.read_table("prime_sids.txt", header=None, names=["prime_sids"])
algo_sellers = pd.read_table("susp_sids_all_with_amazon.txt", header=None, names=["algo_sids"])

In [None]:
crawl2.columns = ["pid", "epoc", "sid", "price", "sid_rating", "sid_pos_fb", "sid_rating_cnt", "shipping", "page", "rank", "pid_rating", "pid_rating_cnt", "is_fba", "is_prime", "bbox_sid", "bbox_price"]
crawl2["prime_seller"] = crawl2["sid"].isin(prime_sids["prime_sids"])
crawl2["algo_seller"] = crawl2["sid"].isin(algo_sellers["algo_sids"])

In [26]:
unique_pids = crawl2["pid"].unique().tolist()
unique_pids

['0975277324',
 'B00000J0RJ',
 'B00000JBNX',
 'B00002ND64',
 'B00004R9TL',
 'B00004RBDU',
 'B00004RIZ7',
 'B00004S7V8',
 'B00004SQLJ',
 'B00004TZY8',
 'B00004U9JO',
 'B00004UBGZ',
 'B00004UE29',
 'B00004YO15',
 'B00004YTJE',
 'B00004Z4A8',
 'B00004Z4CP',
 'B00004Z5SM',
 'B000052XHI',
 'B00005BXKM',
 'B00005O6B7',
 'B000067EH7',
 'B000067PCE',
 'B000067PQ0',
 'B000068O36',
 'B000068O3C',
 'B000068PBT',
 'B00006ANDK',
 'B00006I551',
 'B00006IBYA',
 'B00006IDV8',
 'B00006IEE4',
 'B00006IEEU',
 'B00006IEJC',
 'B00006IESK',
 'B00006IFAV',
 'B00006IFH0',
 'B00006IFKU',
 'B00006IUWA',
 'B00006JNN7',
 'B00006WNMJ',
 'B00008Y0VN',
 'B000096QQ5',
 'B00009IMCK',
 'B00009PGNT',
 'B0000AQOH2',
 'B0000AXRH5',
 'B0000CBK1L',
 'B0000YNR4M',
 'B0000YUXI0',
 'B00012YIA0',
 'B00016XJ4M',
 'B000197NXM',
 'B0001DSIVY',
 'B0001J3R3C',
 'B00026ZEDK',
 'B00028XJNA',
 'B00029WYEY',
 'B0002CZW0Y',
 'B0002D0CA8',
 'B0002D0CAI',
 'B0002D0HXA',
 'B0002E1G5C',
 'B0002E1P08',
 'B0002E7DIQ',
 'B0002FOBJY',
 'B0002GLC

In [None]:
# Keep only offers from sellers flagged as algorithmic (boolean "algo_seller" column)
algoonly = crawl2[crawl2["algo_seller"]]
pidsandcount = crawl2.groupby("pid").agg({"pid" : "count"})
pidsandcount.columns = ["count"]
pidsandcount = pidsandcount.reset_index().sort_values(by="count", ascending=False)
pidsandcount.head(30)

Unnamed: 0,pid,count
600,B009IH0BYQ,30140
255,B001EOV492,30110
159,B000NCOKZQ,30110
681,B00DJPK8PA,30110
63,B0002E1P08,30100
356,B003BZD03K,30100
69,B0002M6CVC,30090
56,B00028XJNA,30090
882,B00O4ON72Q,30080
22,B000067PCE,30080


In [39]:
product = crawl2[crawl2["pid"] == "B07PHSF8DP"]
product

Unnamed: 0,pid,epoc,sid,price,sid_rating,sid_pos_fb,sid_rating_cnt,shipping,page,rank,pid_rating,pid_rating_cnt,is_fba,is_prime,bbox_sid,bbox_price,prime_seller,algo_seller


In [20]:
crawl2

Unnamed: 0,pid,epoc,sid,price,sid_rating,sid_pos_fb,sid_rating_cnt,shipping,page,rank,pid_rating,pid_rating_cnt,is_fba,is_prime,bbox_sid,bbox_price,prime_seller,algo_seller
0,0975277324,1439301853,A19HZ7QWHIRFQA,35.00,0.0,0,0,6.49,1,7,5.0,2321,yes,no,amazon,40.36,False,False
1,0975277324,1439301853,A1G1QJKXJJSAN2,44.06,5.0,98,3761,3.99,2,9,5.0,2321,no,no,amazon,40.36,False,False
2,0975277324,1439301853,A1H9LQ4XQ5IZ7D,39.99,5.0,97,1577,0.00,1,1,5.0,2321,yes,yes,amazon,40.36,True,False
3,0975277324,1439301853,A1OUQ84L1EU4IB,36.99,5.0,100,2,6.49,2,1,5.0,2321,no,no,amazon,40.36,False,False
4,0975277324,1439301853,A1PSEM5PTWSZBK,39.95,4.5,85,5308,6.10,2,4,5.0,2321,no,no,amazon,40.36,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20740262,B01387NUN0,1440712515,A8UMA0WO93O39,5.95,5.0,100,3,0.00,1,1,1.0,3,no,no,A8UMA0WO93O39,5.95,False,False
20740263,B01387NUN0,1440712515,AB62NOZUAODRB,10.99,5.0,100,770,4.75,1,9,1.0,3,no,no,A8UMA0WO93O39,5.95,False,False
20740264,B01387NUN0,1440712515,AC9XK46ZTUENH,6.00,5.0,100,6,4.50,1,6,1.0,3,no,no,A8UMA0WO93O39,5.95,False,False
20740265,B01387NUN0,1440712515,AS94HDKW3U98F,10.49,0.0,0,0,0.00,1,5,1.0,3,no,no,A8UMA0WO93O39,5.95,False,False
