# Extracting ASINs

Books are often linked to via Amazon, and each product on Amazon has a 10-character ASIN ([Amazon Standard Identification Number](https://en.wikipedia.org/wiki/Amazon_Standard_Identification_Number)).
For books that have an ISBN-10, the ASIN is the same as the ISBN-10, which makes them easy to link.

We're going to extract ASINs from links to Amazon in the Hacker News dataset.

In [1]:
import numpy as np
import pandas as pd

import html

from pathlib import Path

In [2]:
hn_path = Path('../data/01_raw/hackernews2021.parquet')

df = pd.read_parquet(hn_path, use_nullable_dtypes=True).set_index('id')

In [3]:
pd.options.display.max_colwidth = 200

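Our regular expression looks for something that looks like a URL containing `amazon.`, followed by `/dp/` and then 10 characters that are uppercase Latin letters or digits. The trailing `\W` requires a non-word character after the 10 characters, so a longer alphanumeric string is not cut down to its first 10 characters.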

Because we're searching in the HTML escaped version we need to replace `/` with `&#x2F;` (which even occurs in the `href` for some reason).

In [4]:
import re
asin_re = re.compile(r'amazon\.[^"> ]*/dp/([A-Z0-9]{10})\W'.replace('/', '&#x2F;'))



example = '<a href="https:&#x2F;&#x2F;www.amazon.com&#x2F;x&#x2F;dp&#x2F;0884272079" rel="nofollow">'
asin_re.findall(example)

['0884272079']
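As a point of comparison, here is a minimal sketch of the other order of operations: unescape the HTML first, then match with a pattern that uses a literal `/`. The names `plain_asin_re` and `extract_asins` are hypothetical and not part of the notebook. Unescaping every comment adds work, which may be why the notebook escapes the pattern instead.

```python
import html
import re

# Hypothetical alternative to asin_re: same pattern, but with a literal "/"
# because the text is unescaped before matching.
plain_asin_re = re.compile(r'amazon\.[^"> ]*/dp/([A-Z0-9]{10})\W')


def extract_asins(escaped_html):
    """Return candidate ASINs found in one HTML-escaped comment."""
    return plain_asin_re.findall(html.unescape(escaped_html))


example = '<a href="https:&#x2F;&#x2F;www.amazon.com&#x2F;x&#x2F;dp&#x2F;0884272079" rel="nofollow">'
print(extract_asins(example))  # ['0884272079']
```

Both approaches give the same result on this example.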

In [5]:
%%time

asins = (
    df
    .text
    .dropna()
    .str.extractall(asin_re)
    [0]
    .rename('asin')
    .reset_index()
    .drop_duplicates(subset=['id', 'asin'])
    .set_index(['id', 'match'])
)

CPU times: user 7.3 s, sys: 51.8 ms, total: 7.35 s
Wall time: 7.34 s


We get a long list of things that look like ASINs.

In [6]:
asins

id,match,asin
25763413,0,0809301377
29430630,0,B00TQ5SEAI
27595409,0,0884272079
27595409,2,0884271536
26919349,0,B08F3CJ5HF
...,...,...
27651602,0,B005PLQIQ4
26745394,0,0062435612
26745394,1,B07PPW5V9C
26745394,2,B08FRRF68Q
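The `match` level can skip numbers (for example 0 then 2 for id 27595409). A likely cause is that Hacker News repeats the URL in both the `href` and the link text, so `extractall` finds each ASIN twice and `drop_duplicates` removes the repeat. The sketch below reproduces this on a toy Series; the ids and comment text are invented for illustration.

```python
import re

import pandas as pd

asin_re = re.compile(r'amazon\.[^"> ]*/dp/([A-Z0-9]{10})\W'.replace('/', '&#x2F;'))

# Made-up comments: id 1 has one link, so its ASIN appears twice
# (once in the href and once in the link text); id 2 has no link.
toy = pd.Series(
    {
        1: '<a href="https:&#x2F;&#x2F;www.amazon.com&#x2F;dp&#x2F;0884272079" rel="nofollow">'
           'https:&#x2F;&#x2F;www.amazon.com&#x2F;dp&#x2F;0884272079</a>',
        2: 'no links here',
    },
    name='text',
)
toy.index.name = 'id'

# One row per regex match, indexed by (id, match).
matches = toy.str.extractall(asin_re)[0].rename('asin').reset_index()

# Keep each ASIN once per comment.
deduped = matches.drop_duplicates(subset=['id', 'asin'])

print(len(matches), len(deduped))
```

Here `matches` has two rows for id 1 (match 0 and match 1) and `deduped` keeps only match 0.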


In [7]:
asin_count = asins.value_counts()

asin_count

asin      
1594035229    7
0262632691    6
B07N4DHFZM    5
0393009262    5
0465060730    5
             ..
1413326390    1
1408703971    1
1408190303    1
1402791038    1
B09H478XG4    1
Length: 2454, dtype: int64

There are 211 ASINs (out of 2,454) that have come up more than once.

In [8]:
asin_count.value_counts()

1    2243
2     158
3      40
4       7
5       4
7       1
6       1
dtype: int64

## Examining ASINs

The top 10 are all books (though there are some other products further down the list).

In [9]:
def link_asin(asin):
    return f'<a href="https://www.amazon.com.au/dp/{asin}">{asin}</a>'

asin_count.reset_index().head(10).style.format({'asin': link_asin})

,asin,0
0,1594035229,7
1,0262632691,6
2,B07N4DHFZM,5
3,0393009262,5
4,0465060730,5
5,0735224897,5
6,0143125788,4
7,0201178885,4
8,0578675862,4
9,1492180742,4
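Since a print book's ASIN is its ISBN-10, one rough way to flag likely books is the ISBN-10 check digit: weight the ten characters 10 down to 1 and require the sum to be divisible by 11 (a final `X` counts as 10). This is only a sketch, and `is_isbn10` is a hypothetical helper. Ebook editions can have non-ISBN ASINs, so a failed check does not prove an item is not a book.

```python
def is_isbn10(asin):
    """Return True if a 10-character string passes the ISBN-10 check digit."""
    if len(asin) != 10:
        return False
    total = 0
    for position, char in enumerate(asin):
        if char in '0123456789':
            value = int(char)
        elif char == 'X' and position == 9:
            value = 10  # 'X' is only valid as the final check character
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


print(is_isbn10('0884272079'))  # True
print(is_isbn10('B00TQ5SEAI'))  # False
```

With the notebook's data this could be applied as `asin_count.reset_index()['asin'].map(is_isbn10)`.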


In [10]:
df_asin = df.loc[asins.reset_index().id.drop_duplicates()]

df_asin

id,title,url,text,dead,by,score,time,timestamp,type,parent,descendants,ranking,deleted
25763413,,,"Just a historical note, from 1964, a book by Buckminster Fuller: <i>Education Automation: Freeing the scholar to return to his studies</i><p>Even back then we had the technology and opportunity to...",,yboris,,1610552713,2021-01-13 15:45:13+00:00,comment,25760960,,,
29430630,,,"<a href=""https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=J0dDTbA1fq8"" rel=""nofollow"">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=J0dDTbA1fq8</a><p><a href=""https:&#x2F;&#x2F;boardgamegeek.com&#x...",,iams,,1638543145,2021-12-03 14:52:25+00:00,comment,29430521,,,
27595409,,,"Eliyahu M. Goldratt has some great books explaining in great detail why this is the case: <a href=""https:&#x2F;&#x2F;www.amazon.com&#x2F;x&#x2F;dp&#x2F;0884272079"" rel=""nofollow"">https:&#x2F;&#x2F...",,sly010,,1624387075,2021-06-22 18:37:55+00:00,comment,27593834,,,
26919349,,,"This has 8 mp <a href=""https:&#x2F;&#x2F;www.amazon.com&#x2F;dp&#x2F;B08F3CJ5HF&#x2F;ref=emc_b_5_mob_t"" rel=""nofollow"">https:&#x2F;&#x2F;www.amazon.com&#x2F;dp&#x2F;B08F3CJ5HF&#x2F;ref=emc_b_5_mob...",,hnnnnnnng,,1619212979,2021-04-23 21:22:59+00:00,comment,26919315,,,
29586021,,,"Read: the power of now<p><a href=""https:&#x2F;&#x2F;www.amazon.com&#x2F;Power-Now-Guide-Spiritual-Enlightenment&#x2F;dp&#x2F;1577314808"" rel=""nofollow"">https:&#x2F;&#x2F;www.amazon.com&#x2F;Power-...",,quadcore,,1639700835,2021-12-17 00:27:15+00:00,comment,29585542,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
25709802,,,"Then in that case they will surely be desirous of de-platforming the toxic content on Twitter, which advocates, among other things, the destruction of Israel By Iran and the internment of Uyghurs ...",,tatrajim,,1610251761,2021-01-10 04:09:21+00:00,comment,25709456,,,
26283630,,,"Nice piece, was talking to a friend about this yesterday, particularly regarding online community. Our conjecture ended up being that communities don’t scale, which is why Twitter fails, and Faceb...",,dmje,,1614416939,2021-02-27 09:08:59+00:00,comment,26274450,,,
27651602,,,"No, it is ~$3 for a standard ~5oz size.<p><a href=""https:&#x2F;&#x2F;www.amazon.com&#x2F;Crest-Complete-Whitening-Toothpaste-Triple&#x2F;dp&#x2F;B005PLQIQ4"" rel=""nofollow"">https:&#x2F;&#x2F;www.am...",,lotsofpulp,,1624802246,2021-06-27 13:57:26+00:00,comment,27651219,,,
26745394,,,So I can kind of see what you are trying to get at with your product. There are 3 books I recommend you read to help you further:<p>1. &quot;Competing Against Luck&quot; by Clay Christensen <a hre...,,luthfur,,1617927582,2021-04-09 00:19:42+00:00,comment,26734079,,,


## Exporting

Let's export some examples for further analysis.

We'll get the parent data for context.

In [11]:
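df_asin = (
    # Drop any *_parent columns left by a previous run so this cell can be re-run.
    # A negative lookbehind is needed here: '.*(?!_parent)$' matches every column name.
    df_asin.filter(regex=r'(?<!_parent)$')
    # Look up each item's parent by id and attach its columns with a _parent suffix.
    .merge(df[['text', 'title', 'type', 'url']],
           left_on='parent',
           right_index=True,
           how='left',
           suffixes=('', '_parent'))
)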
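To make the self-join easier to follow, here is a toy sketch with invented items: each comment's `parent` value is looked up in the index of the full table, and the parent's columns come back with a `_parent` suffix. The frames `items` and `comments` are made up for illustration.

```python
import pandas as pd

# Invented items: a story (id 1), a reply to it (id 2), and a nested reply (id 3).
items = pd.DataFrame(
    {
        'text': ['A story', 'A reply', 'A nested reply'],
        'type': ['story', 'comment', 'comment'],
    },
    index=pd.Index([1, 2, 3], name='id'),
)

# The two comments, each pointing at its parent's id.
comments = items.loc[[2, 3]].assign(parent=[1, 2])

with_parent = comments.merge(
    items[['text', 'type']],
    left_on='parent',          # value to look up ...
    right_index=True,          # ... in the index (id) of the right frame
    how='left',                # keep comments even if the parent is missing
    suffixes=('', '_parent'),  # left columns keep their names
)

print(with_parent[['text', 'text_parent', 'type_parent']])
```

A left join keeps every row of the left frame, so an item whose parent is not in the dataset would still appear, with missing values in the `_parent` columns.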

And clean the text by unescaping the HTML and reversing Hacker News's `formatdoc` markup (italics, paragraph breaks and links).

In [12]:
df_asin['clean_text'] = (
    df_asin['text']
    .map(html.unescape)
    .str.replace('</?i>', '**', regex=True)
    .str.replace('<p>', '\n\n')
    .replace('<a href="(.*?)".*?>.*?</a>',r'\1', regex=True)
)

In [13]:
has_text = df_asin['text_parent'].notna()

df_asin.loc[has_text, 'clean_text_parent'] = (
    df_asin.loc[has_text, 'text_parent']
    .map(html.unescape)
    .str.replace('</?i>', '**', regex=True)
    .str.replace('<p>', '\n\n')
    .replace('<a href="(.*?)".*?>.*?</a>',r'\1', regex=True)
)
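The two cells above apply the same steps to different columns. As a sketch of what those steps do to a single string, here is a standard-library version; `clean_hn_text` is a hypothetical helper and the sample comment is invented.

```python
import html
import re


def clean_hn_text(text):
    """Unescape HN comment HTML and turn its markup back into plain text."""
    text = html.unescape(text)                                  # &#x2F; -> /
    text = re.sub(r'</?i>', '**', text)                         # italics markers, as in the cells above
    text = text.replace('<p>', '\n\n')                          # paragraph breaks
    text = re.sub(r'<a href="(.*?)".*?>.*?</a>', r'\1', text)   # keep only the link target
    return text


raw = (
    'Read: <i>the power of now</i><p>'
    '<a href="https:&#x2F;&#x2F;www.amazon.com&#x2F;dp&#x2F;1577314808" rel="nofollow">'
    'https:&#x2F;&#x2F;www.amazon.com&#x2F;dp&#x2F;1577314808</a>'
)
print(clean_hn_text(raw))
```

If this matched the notebook's needs, both columns could be cleaned with `.map(clean_hn_text)` on the rows that have text.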

In [14]:
columns = ['id', 'clean_text', 'clean_text_parent', 'by', 'timestamp', 'type', 'type_parent', 'parent', 'text']

In [15]:
df_asin.reset_index()[columns]

,id,clean_text,clean_text_parent,by,timestamp,type,type_parent,parent,text
0,25763413,"Just a historical note, from 1964, a book by Buckminster Fuller: **Education Automation: Freeing the scholar to return to his studies**\n\nEven back then we had the technology and opportunity to d...","My kids are in lockdown homeschooling, and sitting in on some of the live lessons you can see the cracks - very slow, kids moving at different paces, and much much harder for teacher to see who is...",yboris,2021-01-13 15:45:13+00:00,comment,story,25760960,"Just a historical note, from 1964, a book by Buckminster Fuller: <i>Education Automation: Freeing the scholar to return to his studies</i><p>Even back then we had the technology and opportunity to..."
1,29430630,https://www.youtube.com/watch?v=J0dDTbA1fq8\n\nhttps://boardgamegeek.com/boardgame/161936/pandemic-legacy-season-1\n\nhttps://www.amazon.com/Pandemic-Cooperative-Playtime-Z-Man-Games/dp/B00TQ5SEAI...,Can you give an example of what you mean?,iams,2021-12-03 14:52:25+00:00,comment,comment,29430521,"<a href=""https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=J0dDTbA1fq8"" rel=""nofollow"">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=J0dDTbA1fq8</a><p><a href=""https:&#x2F;&#x2F;boardgamegeek.com&#x..."
2,27595409,Eliyahu M. Goldratt has some great books explaining in great detail why this is the case:\nhttps://www.amazon.com/x/dp/0884272079\nhttps://www.amazon.com/x/dp/0884271536,This sounds like it might be good thing for the company. Having employees who have extra capacity is incredibly important for an organization that wants to get things done; if you're constantly ha...,sly010,2021-06-22 18:37:55+00:00,comment,comment,27593834,"Eliyahu M. Goldratt has some great books explaining in great detail why this is the case: <a href=""https:&#x2F;&#x2F;www.amazon.com&#x2F;x&#x2F;dp&#x2F;0884272079"" rel=""nofollow"">https:&#x2F;&#x2F..."
3,26919349,This has 8 mp\nhttps://www.amazon.com/dp/B08F3CJ5HF/ref=emc_b_5_mob_t,Why can I buy a 30mp trail camera with motion sensor and wifi for $70 but I cant buy a 30mp camera that connects to my computer over usb for that cheap.\n\nhttps://www.amazon.com/Victure-Activated...,hnnnnnnng,2021-04-23 21:22:59+00:00,comment,comment,26919315,"This has 8 mp <a href=""https:&#x2F;&#x2F;www.amazon.com&#x2F;dp&#x2F;B08F3CJ5HF&#x2F;ref=emc_b_5_mob_t"" rel=""nofollow"">https:&#x2F;&#x2F;www.amazon.com&#x2F;dp&#x2F;B08F3CJ5HF&#x2F;ref=emc_b_5_mob..."
4,29586021,Read: the power of now\n\nhttps://www.amazon.com/Power-Now-Guide-Spiritual-Enlightenment/dp/1577314808,"I am constantly worried that (1) I'm missing out on things, (2) something bad is going to happen and (3) can't see the point of it all since one day all will come to an end. I want to start enjoyi...",quadcore,2021-12-17 00:27:15+00:00,comment,story,29585542,"Read: the power of now<p><a href=""https:&#x2F;&#x2F;www.amazon.com&#x2F;Power-Now-Guide-Spiritual-Enlightenment&#x2F;dp&#x2F;1577314808"" rel=""nofollow"">https:&#x2F;&#x2F;www.amazon.com&#x2F;Power-..."
...,...,...,...,...,...,...,...,...,...
2391,25709802,"Then in that case they will surely be desirous of de-platforming the toxic content on Twitter, which advocates, among other things, the destruction of Israel By Iran and the internment of Uyghurs ...",I can't help but wonder if the opposite isn't at play here. It's reasonable that these companies have wanted to dissociate with Parler or other specific extremist groups but were afraid of politi...,tatrajim,2021-01-10 04:09:21+00:00,comment,comment,25709456,"Then in that case they will surely be desirous of de-platforming the toxic content on Twitter, which advocates, among other things, the destruction of Israel By Iran and the internment of Uyghurs ..."
2392,26283630,"Nice piece, was talking to a friend about this yesterday, particularly regarding online community. Our conjecture ended up being that communities don’t scale, which is why Twitter fails, and Faceb...",,dmje,2021-02-27 09:08:59+00:00,comment,story,26274450,"Nice piece, was talking to a friend about this yesterday, particularly regarding online community. Our conjecture ended up being that communities don’t scale, which is why Twitter fails, and Faceb..."
2393,27651602,"No, it is ~$3 for a standard ~5oz size.\n\nhttps://www.amazon.com/Crest-Complete-Whitening-Toothpaste-Triple/dp/B005PLQIQ4\n\nAmazon is more expensive than Costco/Walmart/Target, but especially so...",Toothpaste costs $6-$8 in Amazon US? Holy cow.,lotsofpulp,2021-06-27 13:57:26+00:00,comment,comment,27651219,"No, it is ~$3 for a standard ~5oz size.<p><a href=""https:&#x2F;&#x2F;www.amazon.com&#x2F;Crest-Complete-Whitening-Toothpaste-Triple&#x2F;dp&#x2F;B005PLQIQ4"" rel=""nofollow"">https:&#x2F;&#x2F;www.am..."
2394,26745394,"So I can kind of see what you are trying to get at with your product. There are 3 books I recommend you read to help you further:\n\n1. ""Competing Against Luck"" by Clay Christensen https://www.ama...","Hi HN,\n\nI have built an application to scratch my own itch. It’s a tool that I had a need for in my day job for years. The problem it’s solving is a problem that I felt many people had too. But ...",luthfur,2021-04-09 00:19:42+00:00,comment,story,26734079,So I can kind of see what you are trying to get at with your product. There are 3 books I recommend you read to help you further:<p>1. &quot;Competing Against Luck&quot; by Clay Christensen <a hre...


In [16]:
df_asin.reset_index().to_csv('../data/02_intermediate/hn_asin.csv', columns=columns, index=False)