# Sample

Bootstrap some sample data for annotation.

Use heuristics to get some examples and SentenceTransformers to generalise them.

In [1]:
%cd ..

/home/bookfinder


In [2]:
import pandas as pd

from sentence_transformers import SentenceTransformer

# Read in the data

We will use only 3% of the data for these annotations (buckets 0-2).

In [3]:
df = (
    pd.read_parquet('data/02_intermediate/hn_enriched.parquet')
    .query('bucket<3 & text_length > 0')
)

In [4]:
df

id,title,url,text,dead,by,score,time,timestamp,type,parent,descendants,ranking,deleted,root,clean_text,bucket,text_length
28886146,"The Programmers, like the Poets",,"The programmers, like the poets, work only sli...",,graderjs,10.0,1634368355,2021-10-16 07:12:35+00:00,story,,6.0,,,28886146,"The programmers, like the poets, work only sli...",0,345
26236715,,,I&#x27;m not counting the research I did prior...,,BiteCode_dev,,1614084773,2021-02-23 12:52:53+00:00,comment,26213473,,,,26207965,I'm not counting the research I did prior the ...,1,135
26909217,,,TFA? I know you&#x27;re referring to Moxie (so...,,spurgu,,1619132826,2021-04-22 23:07:06+00:00,comment,26899430,,,,26891811,TFA? I know you're referring to Moxie (somehow...,0,83
26263927,,,"From the perspective of January 2009, quite a ...",,flyingfences,,1614267634,2021-02-25 15:40:34+00:00,comment,26263680,,,,26262170,"From the perspective of January 2009, quite a ...",1,50
27975372,,,"Yeah I was reluctant to link to it, it&#x27;s ...",,RankingMember,,1627408166,2021-07-27 17:49:26+00:00,comment,27974340,,,,27973169,"Yeah I was reluctant to link to it, it's terri...",2,195
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29631086,,,"Not that you’re wrong, but there seems to be t...",,wbsss4412,,1640038542,2021-12-20 22:15:42+00:00,comment,29629328,,,,29625625,"Not that you’re wrong, but there seems to be t...",2,206
27907148,,,I have read that they are using drone technolo...,,melicoy,,1626879333,2021-07-21 14:55:33+00:00,comment,27904758,,,,27904758,I have read that they are using drone technolo...,1,135
26118887,,,I was under the impression that Apple chose to...,,Malic,,1613166051,2021-02-12 21:40:51+00:00,comment,26118069,,,,26116062,I was under the impression that Apple chose to...,1,254
25858382,,,"I see where you are coming from, but I still o...",,mikem170,,1611234185,2021-01-21 13:03:05+00:00,comment,25853940,,,,25847062,"I see where you are coming from, but I still o...",1,1594


# Calculate Embeddings

Let's use a fast model

In [5]:
model_name = 'all-MiniLM-L6-v2'

model = SentenceTransformer(model_name)

In [6]:
len(df)

105290

In [7]:
%%time

embeddings = model.encode(df.clean_text.to_list(),
                          show_progress_bar=True,
                          normalize_embeddings=True,
                          convert_to_numpy=True)

Batches:   0%|          | 0/3291 [00:00<?, ?it/s]

CPU times: user 2min 10s, sys: 8.6 s, total: 2min 18s
Wall time: 1min 22s
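
The batch count in the progress bar can be reproduced with a quick calculation. This is a sketch that assumes `encode` ran with a batch size of 32; the call above does not set `batch_size`, so we assume the library default applies.

```python
import math

n_texts = 105290   # len(df), from the cell above
batch_size = 32    # assumption: default batch_size of SentenceTransformer.encode

# The last batch is partial, so round up.
n_batches = math.ceil(n_texts / batch_size)
print(n_batches)
```

If the count did not match, that would suggest a different batch size was in effect.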


In [8]:
embeddings.shape

(105290, 384)

# Heuristic Book Finders

As a very rough heuristic, look for the pattern "{Proper Noun} by {Proper Noun}".

We could create much better rules, e.g. using a gazetteer or an existing NER model, but this is good enough to get some examples for the first round of annotation.

In [9]:
has_author_pattern = df.clean_text.str.contains(r'[A-Z][a-z]+ by [A-Z]')
sum(has_author_pattern)

136
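
To see what the heuristic does and does not catch, here is a minimal sketch that applies the same regular expression to a few probe strings. The first three are shortened from comments in this dataset; the last two are made up for illustration. `Series.str.contains` treats the pattern as a regular expression by default and matches anywhere in the string, so `re.search` mirrors it here.

```python
import re

# Same pattern as in the cell above.
AUTHOR_PATTERN = re.compile(r'[A-Z][a-z]+ by [A-Z]')

probes = [
    "This concept is explored in The Dispossessed by Ursula K. Le Guin.",  # a book
    "Will start with Rabies by Uncle Al",                                   # an album, not a book
    "> Confirmed by PCR",                                                   # false positive
    "I read SICP by Abelson and Sussman",   # hypothetical: all-caps title is missed
    "it was written by hand",               # hypothetical: lowercase after "by"
]

# True where the pattern matches anywhere in the string.
matches = {text: bool(AUTHOR_PATTERN.search(text)) for text in probes}

for text, hit in matches.items():
    print(hit, text)
```

The pattern only requires a capitalised word before "by" and a capital letter after it, so it picks up non-book phrases and misses titles that end in an all-caps word. That is acceptable here because the matches are only used as seeds.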

In [10]:
pd.options.display.max_colwidth = 1000

In [11]:
df.loc[has_author_pattern][['clean_text']]

id,clean_text
26523441,"I found it a little garden path. I think ""African American Georgia newspapers from..."" is slightly clearer.\n\nThat slightly alters the meaning by not using the demonym: newspapers published by African Americans from Georgia vs newspapers published in Georgia by African Americans.\n\nI guess you could also go with African American Georgian Georgia newspapers to be explicit that it's newspapers published in Georgia by African Americans from Georgia.\n\nSorry, I guess I have Georgia on my mind now."
29673502,"In my view it's about the interface. Any utopian community must interface with the rest of the currently-not-utopian world. I suspect the utopian aspect can only be maintained if that interface is very small and restricted. Otherwise the property-based wider society will grow to dominate and maximally exploit the utopia, as is its nature. This concept is explored in The Dispossessed by Ursula K. Le Guin."
28100193,Almost certainly. Even calling the reproducing female 'queen' betrays a hierarchical sensibility that social insects almost certainly do not have. See Metaphors We Live By by George Lakoff and Mark Johnson.
27827363,"If anyone is interested in this era of comic books and the society they bred, The Amazing Adventures of Kavalier & Clay by Michael Chabon is a smart, excellently written novel.\n\nIt covers the comic book culture in NYC in WW2 and the years after, and covers a scenario very similar to this report."
26223637,"Glossary:\n\nAPT31 - a name given to an attack group that is attributed to China.\n\nEquation Group - a name given to an APT group which is believed to be the Tailored Access Operations (TAO) unit of the NSA. The unit is now named ""Computer Network Operations"" (CNO).\n\nJian - a name that was given to a 0-Day exploit that was attributed to the Chinese-affiliated attack group.\n\n0-Day - a vulnerability that is unknown to the public or to the relevant vendor (e.g Microsoft).\n\n0-Day Exploit - an exploit that is directed at a zero-day\n\n---\n\nIn this story, we claim that the Chinese APT acquired the Equation Group exploit somewhere around 2014, cloned it into their own version (Jian), and used it until was finally caught in 2017.\n\nInterestingly, the 0-Day was reported to Microsoft by Lockheed Martin's Incident Response team. This might suggest that the Chinese APT might have used it to attack American targets.\n\nI tried to summarize the highlights in a less technical lingo in ..."
...,...
27323291,> Confirmed by PCR\n\nWhat was the cycle threshold? The problem with this whole manufactured pandemic is that all the numbers and tests are bullshit because people are doing bad science. The massive relief bills introduced financial incentives to do bad science. It’s institutional failure at a massive scale.
28111445,"Cool, many unknown bands, thanks for also pointing to albums in particular! Will start with Rabies by Uncle Al"
25746920,"August 6, 2018\n\n#1: Let's Encrypt Root Trusted by All Major Root Programs(https://letsencrypt.org/2018/08/06/trusted-by-all-major-root-programs.html)\n\n#2: I’m a very slow thinker (2016)(https://sivers.org/slow)\n\n#3: Facebook has asked U.S. banks to share financial information about customers\n(https://www.wsj.com/articles/facebook-to-banks-give-us-your-data-well-give-you-our-users-1533564049)\n\nedit: That was with https://hn.algolia.com/ though, not by clicking on the date. Liked the idea before reading everything. With a click on the date I get a similar, but slightly different result. Neat feature though."
28724612,"The CRDT I was referencing was Shelf by Greg Little. He's given a few talks about it at the braid meetups. When he first showed it off, Kevin Jahns (the Yjs author) was also there and was as impressed as I was:\n\nhttps://braid.org/meeting-8 (the shelf part with Greg starts at about 43 minutes in to the recording.)\n\nThe code is all here. Its tiny:\n\nhttps://github.com/dglittle/shelf"


An alternative approach would be to use a "search phrase", but this didn't seem to work as well.

In [12]:
query = model.encode([
    'I recommend the book'
]).mean(axis=0)

Instead, we will take the mean embedding of the examples matched by the heuristic.

In [13]:
query = embeddings[has_author_pattern].mean(axis=0)

And calculate the dot product. Because the embeddings are normalised, this is proportional to the cosine similarity (the query is a mean of unit vectors, so its norm is a constant factor shared by every score).

In [14]:
scores = embeddings @ query
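
A small sketch of why the plain dot product is enough for ranking. Random toy vectors stand in for the real embeddings (an assumption for illustration only), and the names `emb`, `cosine` and `rng` are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the embedding matrix: 6 unit-length rows of dimension 4.
emb = rng.normal(size=(6, 4))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Mean of the first three rows, like the mean of the heuristic matches.
query = emb[:3].mean(axis=0)

# Dot product, as in the cell above.
scores = emb @ query

# Explicit cosine similarity for comparison.
cosine = scores / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query))

# The two differ only by the constant factor ||query||, so the ranking is the same.
print(np.allclose(scores, cosine * np.linalg.norm(query)))
print((np.argsort(scores) == np.argsort(cosine)).all())
```

The mean of unit vectors generally has a norm below 1, so the raw scores are smaller than true cosine similarities, but their order is unchanged.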

Looking at the top-scoring examples, it does reasonably well. `argsort` sorts in ascending order, so the highest-scoring rows come last.

In [15]:
indices = scores.argsort()

In [16]:
df.iloc[indices[-10:]].clean_text.to_frame()

id,clean_text
28708830,Any other book you think has been critical to Cloudflare's successful mindset?
27685099,"Has anyone read his other book ‘Silicon Snake Oil’ [1] from 1995 lately?\n\nSounds like most of his predictions in it (eg e-commerce will fail, digital books will not be viable, etc) were wildly off the mark - but were any prescient?\n\n[1] https://en.m.wikipedia.org/wiki/Silicon_Snake_Oil"
25823453,"I think its website makes it pretty clear that it's mostly targeting scientists...\n\nhttps://julialang.org/, https://docs.julialang.org/en/v1/ (The first two words of the Introduction are literally ""scientific computing"".)"
28708371,"When I joined Cloudflare in 2011 Matthew recommended the book.\n\nI bought ""The Innovator's Dilemma"", read it, and said to myself ""OK, we'll do that then""."
29009317,"Scihub has been a life saver for me once I started working on some of the more obscure areas of AI such as signal/time series processing. Anything off the beaten path is locked up behind paywalls, and I'm sorry I ain't paying $40 just to see if someone's paper sucks or not (which 95% of them do, in this particular niche, especially the ones shielded from scrutiny by paywalls)."
28603481,Good article. I used to love reading Kyle Rankin’s rantings in Linux Journal… good times.
28313287,Pretty cool. I read about this audience first philosophy before but haven't seen someone making a newsletter out of it.
27517837,"> as a preserved island of fun and ""hacking for the sake of it""\n\nSomehow this sentiment is anything but fun to me. Emacs is the last bastion of a different path down the tree of personal computing than the one our species took, one in which individual users are empowered and not - as is now the case - held hostage by us[0] in our ivory towers of impenetrable layers of cryptic technology.\n\nEmacs was made for humans to use as a real tool. Barely anything else in the tech world really qualifies to the same degree. I'd love to peek into the parallel universe where we stuck to introspectable, malleable software and ended up with something closer to the Houyhnhnm computing stack[1]. Alas, for now that ship has sailed.\n\nSee also: That old story about Emacs at Amazon, etc.\n\n[0]: ""tech people""\n\n[1]: https://ngnghm.github.io/"
25826384,"I’d love to know more about this, if you had recommendations?"
25785132,"This is old, but a really interesting read. Not so much because of the discussion of Mongo; more that it shows what happens when you unthinkingly follow tech trends, without thinking carefully about your own requirements."


# Sampling

For a first round of annotation, let's grab the 100 highest-scoring comments and 100 random comments. The random draw is taken from the whole frame, so it can overlap with the top 100.

In [17]:
sample = pd.concat([df.iloc[indices[-100:]], df.sample(100)], axis=0).sample(frac=1)
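
If duplicates between the two halves are unwanted, or the sample needs to be reproducible, one option is to drop the top rows before drawing and to fix a seed. This is a sketch on toy data, not what the cell above does; `toy`, `toy_scores`, `top_k` and the seed value are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame indexed by a made-up comment id.
toy = pd.DataFrame(
    {"clean_text": [f"comment {i}" for i in range(20)]},
    index=pd.Index(range(100, 120), name="id"),
)
toy_scores = np.linspace(0.0, 1.0, len(toy))  # stand-in similarity scores

top_k = 5

# Highest-scoring rows (argsort is ascending, so take the tail).
top = toy.iloc[np.argsort(toy_scores)[-top_k:]]

# Random rows drawn only from what is left, with a fixed seed.
rest = toy.drop(index=top.index).sample(top_k, random_state=0)

# Combine and shuffle so annotators cannot tell the two groups apart.
toy_sample = pd.concat([top, rest]).sample(frac=1, random_state=0)
print(toy_sample.index.is_unique)
```

This relies on the index being unique, which is assumed here because it holds the comment id.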

# Save results

Put the results in a Prodigy-compatible format.

In [18]:
import json

# Write one JSON object per line (JSONL); the comment id is kept under "meta".
with open('data/02_intermediate/hn_sample_0.jsonl', 'w') as f:
    for comment_id, row in sample.iterrows():
        data = {"text": row["clean_text"], "meta": {"id": comment_id}}
        print(json.dumps(data), file=f)
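
A quick way to sanity-check this kind of output is to read it back line by line. The sketch below uses an in-memory buffer and two made-up records instead of the real file, so it runs without the data; the record contents are hypothetical.

```python
import io
import json

# Hypothetical records in the same shape as the cell above.
records = [
    {"text": "The Dispossessed by Ursula K. Le Guin", "meta": {"id": 1}},
    {"text": "Line one\nline two", "meta": {"id": 2}},
]

# Write one JSON object per line, as in the notebook.
buffer = io.StringIO()
for record in records:
    print(json.dumps(record), file=buffer)

# Read it back: every non-empty line should parse on its own.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]

print(len(loaded), all("text" in r for r in loaded))
```

Newlines inside a comment are escaped by `json.dumps`, so each record still occupies exactly one line.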