# HN Book Gazetteer

This notebook uses a gazetteer of known book titles to find posts about books on Hacker News.
The matches are useful as a form of weak supervision (they provide some positive examples), and because books are often recommended in lists, a matching comment can lead to further titles.
We can also look at the comments around a book recommendation (children, parents and siblings), which may mention more books.

# Load the Data

In [1]:
from pathlib import Path
import pandas as pd
import xxhash

Read in all Hacker News stories and comments from 2021, which [can be downloaded from Kaggle](https://www.kaggle.com/datasets/edwardjross/hackernews-2021-comments-and-stories) (extracted from the BigQuery dataset).

In [2]:
df = pd.read_parquet('../data/01_raw/hackernews2021.parquet').set_index('id')

In [3]:
df

id,title,url,text,dead,by,score,time,timestamp,type,parent,descendants,ranking,deleted
27405131,,,They didn&#x27;t say they <i>weren&#x27;t</i> ...,,chrisseaton,,1622901869,2021-06-05 14:04:29+00:00,comment,27405089.0,,,
27814313,,,"Check out <a href=""https:&#x2F;&#x2F;www.remno...",,noyesno,,1626119705,2021-07-12 19:55:05+00:00,comment,27812726.0,,,
28626089,,,Like a million-dollars pixel but with letters....,,alainchabat,,1632381114,2021-09-23 07:11:54+00:00,comment,28626017.0,,,
27143346,,,Not the question...,,SigmundA,,1620920426,2021-05-13 15:40:26+00:00,comment,27143231.0,,,
29053108,,,There’s the Unorganized Militia of the United ...,,User23,,1635636573,2021-10-30 23:29:33+00:00,comment,29052087.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
27367848,,,Housing supply isn’t something that can’t chan...,,JCM9,,1622636746,2021-06-02 12:25:46+00:00,comment,27367172.0,,,
28052800,,,Final Fantasy XIV has been experiencing consta...,,amyjess,,1628017217,2021-08-03 19:00:17+00:00,comment,28050798.0,,,
28052805,,,How did you resolve it?,,8ytecoder,,1628017238,2021-08-03 19:00:38+00:00,comment,28049375.0,,,
26704924,,,This hasn&#x27;t been my experience being vega...,,pacomerh,,1617657938,2021-04-05 21:25:38+00:00,comment,26704794.0,,,


# Split the Data

The data will be split deterministically by the root story.
This allows using features about the comment thread.

## Finding the root

For each comment the root can be found by walking up the parents recursively.

In [4]:
parent_dict = df['parent'].fillna(df.index.to_series()).to_dict()

root_dict = {}

for item, parent in parent_dict.items():
    while parent in parent_dict:
        grandparent = parent_dict[parent]
        if parent == grandparent:
            break
        parent = grandparent
    root_dict[item] = parent
    
df['root'] = df.index.map(root_dict)
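
To see what the loop does, here is a minimal sketch on a made-up thread. The ids are hypothetical and `find_root` is an illustrative helper, not part of the notebook. A story is its own parent (mirroring the `fillna` above), and a comment whose ancestor is missing from the data resolves to the first id that cannot be looked up.

```python
# Toy thread (hypothetical ids): story 1 has comments 2 and 3; 4 replies to 3.
# Comment 6 replies to 5, which is missing from the data.
parent_dict = {1: 1, 2: 1, 3: 1, 4: 3, 6: 5}

def find_root(item, parent_dict):
    """Walk up the parent links until reaching a self-parented or unknown id."""
    parent = parent_dict[item]
    while parent in parent_dict:
        grandparent = parent_dict[parent]
        if parent == grandparent:
            break
        parent = grandparent
    return parent

root_dict = {item: find_root(item, parent_dict) for item in parent_dict}
print(root_dict)  # {1: 1, 2: 1, 3: 1, 4: 1, 6: 5}
```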

## Deterministic Splitting

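Hashing the root id with a fixed salt gives a deterministic pseudo-random split: every item in a thread lands in the same bucket, and rerunning the notebook reproduces the same assignment.
Buckets 0 to 49 form a training set of roughly 50%.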

In [5]:
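def bucket(s, salt='hnbooks'):
    # Map a root id to one of 100 buckets (0-99) using a salted hash
    return xxhash.xxh32_intdigest(str(s)+salt) % 100

# Assign straight to the column so the bucket function is not overwritten
df['bucket'] = df['root'].apply(bucket)

df['train'] = df['bucket'] < 50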
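
The property that matters here is that items sharing a root always share a bucket, so a thread never straddles the split. The sketch below checks this on a hypothetical item-to-root mapping. It swaps in `zlib.crc32` from the standard library for `xxhash.xxh32_intdigest` (an assumption, made only so the sketch runs without extra installs), so the bucket numbers will not match the notebook's.

```python
import zlib

def toy_bucket(root, salt='hnbooks'):
    # Stand-in hash: zlib.crc32 replaces xxhash.xxh32_intdigest (an assumption);
    # the resulting bucket values differ from the notebook's.
    return zlib.crc32((str(root) + salt).encode('utf-8')) % 100

# Hypothetical item -> root mapping: two threads with three items each.
roots = {100: 100, 101: 100, 102: 100, 200: 200, 201: 200, 202: 200}

buckets = {item: toy_bucket(root) for item, root in roots.items()}
train = {item: b < 50 for item, b in buckets.items()}

# Every item in a thread gets the same train flag, so no thread is split.
for root in set(roots.values()):
    flags = {train[item] for item, r in roots.items() if r == root}
    assert len(flags) == 1

# Recomputing gives the same assignment: the hash has no hidden random state.
assert buckets == {item: toy_bucket(root) for item, root in roots.items()}
```

Changing the salt produces a different assignment, which is handy if a second, unrelated split of the same data is ever needed.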

# Book Gazetteer

These books are mostly picked from [MapFilterFold](https://mapfilterfold.com/) with a couple of my own choices. I deliberately left out most books with very ambiguous names; of the ones I kept, Dune is probably the worst offender.

In [6]:
seed_books = [
    'Structure and Interpretation of Computer Programs',
    'SICP',
    'Art of Computer Programming',
    'TAOCP',
    'Thinking, Fast and Slow',
    'How to Win Friends and Influence People',
    'Gödel, Escher, Bach',
    'Godel, Escher, Bach',
    'Selfish Gene',
    'Pragmatic Programmer',
    'Art of Motorcycle Maintenance',
    'Design of Everyday Things',
    "Man's Search for Meaning",
    "Deep Work",
    "Mythical Man-Month",
    "Surely You're Joking, Mr. Feynman!",
    "Code Complete",
    "Atlas Shrugged",
    "7 Habits of Highly Effective People",
    "Power of Habit",
    "Fooled by Randomness",
    "Working Effectively with Legacy Code",
    "Reasoned Schemer",
    "Little Schemer",
    "Clean Code",
    "Hitchhiker's Guide to the Galaxy",
    "Designing Data-Intensive Applications",
    "Dune",
    "Don't Make Me Think",
    "High Output Management",
    "Neuromancer",
    "The C Programming Language",
    "The War of Art",
    "The Art of War",
    "The Intelligent Investor",
    "Cryptonomicon",
    "So Good They Can't Ignore You",
    "4-Hour Workweek",
    "Head First Design Patterns",
    "Founders at Work",
    "Bhagavad Gita",
    "Brothers Karamazov",
    "Elements of Computing Systems",
    "Coders at Work",
    "How to Measure Anything",
    "Introduction to Algorithms",
    "On Intelligence",
]

In [7]:
sample = df.query('train').copy()

We'll use a case-sensitive, whole-word regular expression match on each title.

In [8]:
%%time

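import html
import re

# One alternation of all titles, each wrapped in word boundaries.
# re.escape protects regex metacharacters such as the '.' in 'Mr.'.
# Caveat: a trailing \b after '!' only matches when a word character follows,
# so "Surely You're Joking, Mr. Feynman!" will rarely match as written.
seed_pattern = '|'.join(r'\b' + re.escape(book) + r'\b' for book in seed_books)

seed_flag = (
    sample['text']
    .fillna('')
    .apply(html.unescape)
    .str.contains(seed_pattern, regex=True, case=True)
)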

CPU times: user 1min 6s, sys: 0 ns, total: 1min 6s
Wall time: 1min 6s
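
To make the matching rules concrete, here is a small sketch using only the standard library `re` module on a three-title subset of the gazetteer. The test sentences are made up. It shows the effect of case sensitivity and of the `\b` word boundaries, including one side effect: a title ending in `!` only matches when a word character follows it.

```python
import re

titles = ['SICP', 'Deep Work', "Surely You're Joking, Mr. Feynman!"]
pattern = re.compile('|'.join(r'\b' + re.escape(t) + r'\b' for t in titles))

def matches(text):
    return pattern.search(text) is not None

print(matches('Read SICP first.'))          # True
print(matches('Read sicp first.'))          # False: the match is case sensitive
print(matches('Try MSICP instead.'))        # False: no word boundary before SICP
print(matches('You need deep work time.'))  # False: the title is capitalised
# \b needs a word character on one side, so '!' followed by a space fails.
print(matches("I loved Surely You're Joking, Mr. Feynman! as a kid."))  # False
print(matches("Surely You're Joking, Mr. Feynman!x"))                   # True
```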


In [9]:
sample['seed_book'] = False
sample.loc[seed_flag, 'seed_book'] = True
sample['seed_book'].sum()

1195

In [10]:
pd.options.display.max_colwidth = 750

In [11]:
import re
import html

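def clean(text):
    """Convert a Hacker News comment from HTML to plain text."""
    text = text or ''
    text = html.unescape(text)
    # Drop italics and turn paragraph tags into blank lines
    text = text.replace('<i>', '')
    text = text.replace('</i>', '')
    text = text.replace('<p>', '\n\n')
    # Replace each link with its target URL
    text = re.sub(r'<a href="(.*?)".*?>.*?</a>', r'\1', text)
    return text.strip()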
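
As a quick check of what `clean` produces, the sketch below repeats the function and applies it to a made-up comment written in the HTML-escaped form the dataset uses. The comment text and the `example.com` URL are hypothetical.

```python
import html
import re

def clean(text):
    text = text or ''
    text = html.unescape(text)
    text = text.replace('<i>', '')
    text = text.replace('</i>', '')
    text = text.replace('<p>', '\n\n')
    text = re.sub('<a href="(.*?)".*?>.*?</a>', r'\1', text)
    return text.strip()

# Hypothetical raw comment: escaped apostrophe and slashes, italics,
# a paragraph break, and a link whose visible text is shortened.
raw = ('I&#x27;d suggest <i>Code Complete</i>.<p>See '
       '<a href="https:&#x2F;&#x2F;example.com&#x2F;books" rel="nofollow">'
       'https:&#x2F;&#x2F;example.com&#x2F;b...</a>')

print(repr(clean(raw)))
# "I'd suggest Code Complete.\n\nSee https://example.com/books"
print(repr(clean(None)))  # ''
```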

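The gazetteer match seems to work reasonably well, although ambiguous titles produce some false positives: in the sample below, Dune matches a comment about the OCaml build tool, and "Introduction to Algorithms" matches a course name.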

In [12]:
sample.query('seed_book')['text'].apply(clean).to_frame()

id,text
26963127,The Pragmatic Programmer.\n\nhttps://en.m.wikipedia.org/wiki/The_Pragmatic_Programmer\n\nUnix Power Tools\n\nhttps://docstore.mik.ua/orelly/unix/upt/index.htm\n\nIf you want the doorstop:\n\nhttps://www.amazon.com/s?k=Unix+power+tools&tag=duckduckgo-fpas-b-20
29076915,"""Asimov's Vision"". The guy himself has said in interviews that it was entirely inspired by gas/fluid dynamics theory, where you could predict the motion of the whole but individual molecules would be random, and then ""what if that could be applied to people"". Everything else is just exploring and extrapolating.\n\nIt's far from perfect but I feel that if it was a new IP work the reaction to it would be very different.\n\nAs mentioned elsewhere in this thread, both Foundation and Dune are almost impossible to translate to screen faithfully and still appeal to enough people to have a decent budget. Why should we judge the show by the book at all, if that's going to be the case?"
29645987,"To a certain extent it's supposed to be a a little bit teen superhero-y, so you get dragged along with the myth of Paul Adreides too i.e. he wants you too like Paul, so you get dragged along by/with him.\n\nFor what it's worth I see Dune as much more of a space opera than a avant-garde sci-fi book. For me I basically view it as star wars but as if George Lucas wasn't a hack. When I saw the new movie open with ""dreams are just messages from the deep"", I was just totally in love with the iconography of dune and was genuinely kind of miffed we don't get this kind of film more often versus whatever star wars is.\n\nThe thing with Dune's world building isn't that it's realistic, it's not very realistic, but rather that it's still blindingly ..."
29634052,"The short eval is standard amongst small lisps (see eg SICP), but the immutability which leads to the nice ABC GC (no RPLACD, no cycles, immutable strings) is practical. All other lisp liked one or our autolisp features: the very same immutability. Such a GC really is trivial and fast. Lot of temporary garbage, but only for a single op."
26237604,"> Like JSON, they are easy to parse and easy to generate, but being more loosely-specified than JSON, ...\n\nThey would not have been loosely-specified if they were specified, like JSON was :-) I mean this is taking things a bit backward. When tools use a data format based on S-exprs, they define more clearly what is or isn't valid (OCaml Dune, Guix, etc.)"
...,...
26649665,"I'm going to echo the Insight Data Science Fellowship Program as, by far, the best in that space. Even then, please pick up a copy of Clean Code if you're from another field. That's a chronic issue with data science candidates."
29011922,"Agree,the movie is ok and well done within Villeneuve high standards, but as expected a 2 parts movie is not enough to describe the depths of the Dune book.\nMaybe a trilogy or limited series would give more breath to describe the complex world and the size of the events; hopefully an extended version will bring some life into it."
27766371,"I really doubt we will have the capability of building ""a machine in the likeness of a human mind"" in my lifetime. Present AI systems are essentially just function fitting. Building big probabilistic systems that we optimize with loads of training data. This is a far, far cry from the ""strong AI"" that people are so afraid of. I really think that people writing these sorts of pieces have an understanding of AI that's more rooted in fiction than engineering.\n\nIt's interesting to ponder how we should go about building and interacting with ""strong AI"", and questioning whether we should even build it in the first place. But I really don't think any detailed moral frameworks can be built when we have no real idea of what a ""strong AI"" would..."
25864501,"Bundle the hours and responsibilities into a single full-time position.\n\nI mean if you have first-year ""Introduction to Algorithms"" as a PT, and another PT ""Introduction to NetWorking"" then they can easily be taught by a single person. Create an FT where the job is to teach both."


## Seed Relatives

If a comment contains the name of a book, then its child, parent and sibling comments seem more likely than average to mention a book as well.
Let's extract those too.
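
The three relations can be sketched on a toy thread before applying them to the real data. The frame `toy` and its ids are hypothetical; the `parent` column holds floats with a missing value for the story, as in the dataset.

```python
import pandas as pd

# Hypothetical thread: story 1; comments 2 and 3 reply to 1;
# 4 and 5 reply to 2; 6 replies to 4. Only comment 4 names a seed book.
toy = pd.DataFrame(
    {'parent': [float('nan'), 1.0, 1.0, 2.0, 2.0, 4.0],
     'seed_book': [False, False, False, True, False, False]},
    index=pd.Index([1, 2, 3, 4, 5, 6], name='id'),
)

seed_ids = toy.index[toy['seed_book']]         # comments naming a seed book
seed_parent_ids = toy.loc[seed_ids, 'parent']  # the comments they reply to

# Replies to a seed comment
toy['seed_child'] = toy['parent'].isin(seed_ids)
# Comments that a seed comment replies to
toy['seed_parent'] = toy.index.isin(seed_parent_ids)
# Comments sharing a parent with a seed comment (the seed comment included)
toy['seed_sibling'] = toy['parent'].isin(toy.index[toy['seed_parent']])

print(toy.index[toy['seed_child']].tolist())    # [6]
print(toy.index[toy['seed_parent']].tolist())   # [2]
print(toy.index[toy['seed_sibling']].tolist())  # [4, 5]
```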

In [13]:
sample['seed_child'] = False
sample.loc[sample.parent.isin(sample.query('seed_book').index), 'seed_child'] = True

sample.query('seed_child')['text'].apply(clean).to_frame()

id,text
29012637,"Unfortunately the SyFy adaptation reeked of cheapness to me. Plot-wise better than Lynch's movie, but aesthetically terrible, and with some pretty bad acting too."
27787588,"My background is also in maths and I am keenly interested in, and frustrated by, notation as it appears in various fields. There's a time and place for fancy, dense notation, and I don't think it's here. Subjectively I found the use of unusual math unicode symbols to be gratuitous in this repo."
29645987,"To a certain extent it's supposed to be a a little bit teen superhero-y, so you get dragged along with the myth of Paul Adreides too i.e. he wants you too like Paul, so you get dragged along by/with him.\n\nFor what it's worth I see Dune as much more of a space opera than a avant-garde sci-fi book. For me I basically view it as star wars but as if George Lucas wasn't a hack. When I saw the new movie open with ""dreams are just messages from the deep"", I was just totally in love with the iconography of dune and was genuinely kind of miffed we don't get this kind of film more often versus whatever star wars is.\n\nThe thing with Dune's world building isn't that it's realistic, it's not very realistic, but rather that it's still blindingly ..."
26583823,"If you work on a team, being reachable is part of your work. There are (few) opportunities for solo thinking, and Cal is lucky to find one. I hope you find one for yourself.\n\nBut as a team, the answer ""I can't reach the expert because they secluded themselves"" is not acceptable. Because as a senior figure on a team, you job is to become a force multiplier for others, not a solo force. Not to do work that magnanimously benefits others, but work that makes others better.\n\nYou need occasional deep work time, sure. But the whole ""I must disconnect from everything, always"" thing does not work for any social endeavor - and most meaningful things are of a scope that requires a team.\n\nI wish I had a better answer here, because I too like ..."
28392463,">their absense is baffling\nWell if they lack an imagination then that's not Gibson's problem. It would be like saying that they can't read fantasy because messages have to be delivered on horseback ""why wouldn't they just have cell phones?"" I hear them cry."
...,...
25826588,"The reality is that software development is nothing like engineering or construction, it's totally different. You don't build a quick house, let people live in it and start building the walls whilst they live there.\n\nHumans like to think via metaphor because it's a least-effort mode of thought but sometimes there just isn't one and it's just tough luck and start thinking from first principles instead."
26330840,Complex numbers themselves can be explained with real numbers only.
26288203,> Verbosity is an existential threat.\n\n.. maybe I likes the verbosity?\n\nAsteroids are an existential threat. Verbosity is a style that some people enjoy.
27880365,It's a terrible design. My cat turned on the stove by walking across the capacitive touch controls.


In [14]:
sample['seed_parent'] = False
sample.loc[sample.index.isin(sample.query('seed_book').parent), 'seed_parent'] = True
sample['seed_parent'].sum()

sample.query('seed_parent')['text'].apply(clean).to_frame()

id,text
25801005,"> Because an SQL database uses a schema or structure, this means changes are difficult. Say you’re running a production database full of a million records.\n\nArticles like this one perpetuate the myths in the minds of young developers. First off, “millions of records” is nothing this days. More importantly, the scheme ends up living somewhere. If it’s not in your database, you’re likely managing it in the app. There’s no free lunch when it comes to scheme for a typical SaaS."
29076915,"""Asimov's Vision"". The guy himself has said in interviews that it was entirely inspired by gas/fluid dynamics theory, where you could predict the motion of the whole but individual molecules would be random, and then ""what if that could be applied to people"". Everything else is just exploring and extrapolating.\n\nIt's far from perfect but I feel that if it was a new IP work the reaction to it would be very different.\n\nAs mentioned elsewhere in this thread, both Foundation and Dune are almost impossible to translate to screen faithfully and still appeal to enough people to have a decent budget. Why should we judge the show by the book at all, if that's going to be the case?"
29317313,The Guardian aims at a particular kind of person who essentially spends their entire life in artistic theory and not the real world. They derive their worth from being able to quote at each other.\n\nFor those people I'm sure it does feel like an enormous struggle.\n\nIf you watch for enjoyment rather than as an attempt at gaining social cachet then it's fine.
27493460,"I wish someone would do a study on self-help books/materials, to see if they actually have ever helped anyone.\n\nMy issue is not that the advice they give is necessarily wrong, but it's that the format usually goes something like this:\n\n1. Survey lots of ""successful"" people.\n\n2. Identify common behaviors of these people.\n\n3. Recommend that other people practice these behaviors.\n\nI mean, just look at the title, ""Never eat alone"". I don't doubt that most successful people have a wide network and rarely eat by themselves. I just don't think that telling an introvert, or worse, someone who is painfully shy, that making them engage in a behavior that is naturally uncomfortable for them will lead to equivalent level of success. I ki..."
28387016,The Bridge trilogy may actually be his best work but is hardly ever discussed. It's a real sleeper and if you haven't read it I highly recommend it.
...,...
26165719,"> slight case of death\n\nWhat counts as slight? Also, it sounds like something I would hear on an Oversimplified history video."
29310585,"I read Hawkins's ""On Intelligence"" a few years ago and enjoyed it. For anyone here who's read both that and ""New Theory of Intelligence"", what's the relation between the two?"
26389497,"Yeah, I think I first heard it in relation to Malcolm Gladwell and it's just so apt at capturing everything wrong with that category of book. I mean he's a skillful writer, and it's definitely entertaining stuff. But if you flip into critical mode and do comparative research vs authoritative sources, you start seeing how vapid it is really fast."
26409666,"This anecdote was recounted by Nassim Taleb in one of his books, either Fooled by Randomness (which I recommend) or The Black Swan (which I don't).\n\nAccording to him, it was MBA courses that recommended adding chess to the CV, as it showed strategic thinking and would never be verified."


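Siblings seem less likely to contain a book title than children or parents, but probably still more likely than a random comment. Note that this flag also covers the seed comments themselves, since each one shares a parent with its siblings.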

In [15]:
sample['seed_sibling'] = False
sample.loc[sample.parent.isin(sample.query('seed_parent').index), 'seed_sibling'] = True

sample.query('seed_sibling')['text'].apply(clean).to_frame()

id,text
29705685,
25658551,"How can anyone work at Facebooks nowadays and have a clear conscience.\n\nIs it just the salary? It is, isn't it? Just money. How depressing."
27038043,This constant linking to other things is a bit annoying. I found most of them are older books. I clicked the spanking link to maybe find a book on why you should/shouldn't spank your children and instead got a porn site.\n\nWho is this guy?
26309305,"Flowdash | Senior Full-Stack Engineer, Senior Product Designer | San Francisco & Remote (US-based) | Full-Time\n\nFlowdash (https://flowdash.com) lets anyone build business processes and workflows without code. We combine the familiarity of a spreadsheet with a visual workflow builder, plus built-in integrations to automate repetitive tasks so teams can focus on what matters. We have a ton of interesting technical challenges and we're looking for humble, self-motivated and independent engineers and a product designer to help us transform how organizations build software. Our stack is Ruby on Rails, PostgreSQL, GraphQL, and React+Apollo on the frontend.\n\nAs an early team member, you'll play an instrumental role in shaping our product s..."
26378820,"His blog post on the topic goes into more detail and examples [1]. I tend to agree with him, especially his conclusion that ""Patterns are signs of weakness in programming languages.""\n\n[1] https://blog.plover.com/prog/design-patterns.html"
...,...
26712146,"The thing is, much of the skills it takes to be a decent programmer are skills that we should endeavor to teach to and instill within everyone.\n\nThese are things like:\n\n- attention to detail\n\n- development of a sense of when attention to detail is very important and when ""just about"" is reasonable\n\n- ability to communicate a process in words, especially written words\n\n- ability to identify patterns and recognize that one has identified a pattern (pattern recognition is automatic in animals and especially humans, but it is often occurring subconsciously)\n\n- ability or willingness to ask ""why?"" and follow that answer far enough to understand the basic reason for something\n\n- willingness to challenge status quo when the event..."
28777239,"> It's just a job, not a career\n\nWhat's wrong with this? You realize that if you died today your job would get posted before your obituary, don't you?\n\nDon't let your job mean too much to you. Do the job, get paid, and find meaning, community, and fulfillment elsewhere."
25994083,"Amazon Services | https://sell.amazon.com/ | Seattle, WA | ONSITE | Full-time | Senior Salesforce.com Developer\nMy team is hiring a Senior Salesforce Developer to continue to customize and enhance our Salesforce CRM sales and marketing clouds using Lightning Web Components (LWC), Apex, Pardot, and many AWS Services (Redshift, SES, SQS, SNS, CodeCommit, Glue, Athena, AppFlow, S3 and Lambda).\n\nYou will join a team of Salesforce Admins and Developers, Technical Program and Product Managers, Business Analysts, Data Scientists and Data Engineers working to invent new ways to engage with our Selling Partners. We are working on new technologies to support our Consumer business in order to help grow the businesses of our selling partners and..."
27633865,"After nearly 13 years out of it, I picked up a 3D printer. I did a fair amount of printing and modeling in my last years of college, but since transitioning into programming not long after graduation it's been something that I hadn't had the time for, or the avenue to apply it.\n\nSo far I've had more failed prints than successes over the last week or two, but I'm still excited to be doing it. I'm learning the ins and outs, which were different from the last printers I worked with. The wheels are churning as to what I can make, and I'm very excited to continue exploring and ""resharpening"" the skills that I once had."


# Saving

In [16]:
sample.filter(like='seed_').to_parquet('../data/02_intermediate/seed_books.parquet')