# Ask HN Book Recommendations

Let's find all comments from Ask HN threads looking for book recommendations.

# Load the Data

In [1]:
from pathlib import Path
import pandas as pd
import xxhash

Read in all Hacker News comments and stories from 2021, which [can be downloaded from Kaggle](https://www.kaggle.com/datasets/edwardjross/hackernews-2021-comments-and-stories) (extracted from the BigQuery dataset).

In [2]:
df = pd.read_parquet('../data/01_raw/hackernews2021.parquet').set_index('id')

In [3]:
df

id,title,url,text,dead,by,score,time,timestamp,type,parent,descendants,ranking,deleted
27405131,,,They didn&#x27;t say they <i>weren&#x27;t</i> ...,,chrisseaton,,1622901869,2021-06-05 14:04:29+00:00,comment,27405089.0,,,
27814313,,,"Check out <a href=""https:&#x2F;&#x2F;www.remno...",,noyesno,,1626119705,2021-07-12 19:55:05+00:00,comment,27812726.0,,,
28626089,,,Like a million-dollars pixel but with letters....,,alainchabat,,1632381114,2021-09-23 07:11:54+00:00,comment,28626017.0,,,
27143346,,,Not the question...,,SigmundA,,1620920426,2021-05-13 15:40:26+00:00,comment,27143231.0,,,
29053108,,,There’s the Unorganized Militia of the United ...,,User23,,1635636573,2021-10-30 23:29:33+00:00,comment,29052087.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
27367848,,,Housing supply isn’t something that can’t chan...,,JCM9,,1622636746,2021-06-02 12:25:46+00:00,comment,27367172.0,,,
28052800,,,Final Fantasy XIV has been experiencing consta...,,amyjess,,1628017217,2021-08-03 19:00:17+00:00,comment,28050798.0,,,
28052805,,,How did you resolve it?,,8ytecoder,,1628017238,2021-08-03 19:00:38+00:00,comment,28049375.0,,,
26704924,,,This hasn&#x27;t been my experience being vega...,,pacomerh,,1617657938,2021-04-05 21:25:38+00:00,comment,26704794.0,,,


# Split the Data

The data will be split deterministically by the root story.
This keeps every comment of a thread in the same split, which allows using features about the comment thread.

## Finding the root

For each comment, the root can be found by walking up the chain of parents until reaching an item that has no parent of its own.

In [4]:
parent_dict = df['parent'].fillna(df.index.to_series()).to_dict()

root_dict = {}

for item, parent in parent_dict.items():
    while parent in parent_dict:
        grandparent = parent_dict[parent]
        if parent == grandparent:
            break
        parent = grandparent
    root_dict[item] = parent
    
df['root'] = df.index.map(root_dict)
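
To see how the walk behaves, here is a minimal sketch on a hypothetical four-item thread. The ids are made up and the helper name `find_roots` is not from the notebook; it only wraps the same loop in a function. Stories map to themselves, and a comment whose ancestor is missing from the data stops at the last id it could reach.

```python
def find_roots(parent_dict):
    """Map every item to the top-most ancestor reachable through parent_dict."""
    root_dict = {}
    for item, parent in parent_dict.items():
        # Climb until we reach a self-parented item (a story) or leave the data.
        while parent in parent_dict:
            grandparent = parent_dict[parent]
            if parent == grandparent:
                break
            parent = grandparent
        root_dict[item] = parent
    return root_dict


# Hypothetical ids: 1 is a story, 2 and 3 are nested comments under it,
# and 5 is a comment whose parent (4) is absent from the data.
toy_parents = {1: 1, 2: 1, 3: 2, 5: 4}
toy_roots = find_roots(toy_parents)
print(toy_roots)  # {1: 1, 2: 1, 3: 1, 5: 4}
```

Note that comment 5 is assigned root 4, an id that is not in the data; in a one-year extract this can happen for comments on threads started before the extract begins.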

## Deterministic Splitting

Hashing the root id with a fixed salt gives a deterministic pseudo-random split.
Each root lands in one of 100 buckets, and buckets below 50 form a 50% training set.

In [5]:
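def bucket(s, salt='hnbooks'):
    # Hash the root id together with a fixed salt into one of 100 buckets.
    return xxhash.xxh32_intdigest(str(s)+salt) % 100

# Assign straight to the column so the `bucket` function is not overwritten
# and the cell can be re-run.
df['bucket'] = df['root'].apply(bucket)

df['train'] = df['bucket'] < 50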
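
The property that matters here is that every comment sharing a root gets the same bucket, on every run. The sketch below illustrates this with hypothetical ids. It is an assumption-laden stand-in: `zlib.crc32` from the standard library replaces xxhash so the example has no dependencies, which means the bucket values will not match the notebook's, and `toy_bucket` is an illustrative name.

```python
import zlib


def toy_bucket(root_id, salt="hnbooks"):
    """Deterministic bucket in [0, 100) from a root id and a fixed salt.

    zlib.crc32 stands in for xxhash, so values differ from the notebook's.
    """
    return zlib.crc32((str(root_id) + salt).encode("utf-8")) % 100


# Hypothetical (comment id, root id) pairs from two threads.
comments = [(11, 10), (12, 10), (21, 20), (22, 20), (23, 20)]

# A comment is in the training set when its root's bucket is below 50.
assignment = {cid: toy_bucket(root) < 50 for cid, root in comments}

# Comments under the same root always land on the same side of the split.
print(assignment[11] == assignment[12])
print(assignment[21] == assignment[22] == assignment[23])
```

Changing the salt reshuffles the buckets, which is a cheap way to draw an independent split for another project from the same ids.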

# Finding HN Book Recommendations

We'll use a simple heuristic: any thread whose title contains the word "book" or "textbook" (singular or plural) and a word like "recommended", "best", "favorite", or "top" will be included.

In [6]:
sample = df.query('train').copy()

In [7]:
ask_hn_books = sample[
    sample['title']
    .str.contains(r'\b(?:text)?books?\b', regex=True) &
    sample['title']
    .str.contains(r'\b(?:recommend(?:ed)|best|favou?rite|top)\b', case=False, regex=True)
]
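
To make the heuristic's behaviour concrete, the sketch below applies the same two patterns with the standard `re` module to a few titles. Two of the titles appear in this dataset and two are hypothetical, as marked in the comments; `matches` is an illustrative helper, not part of the notebook. It assumes `str.contains` is case-sensitive unless `case=False` is passed, which is why the first pattern is compiled without `re.IGNORECASE`.

```python
import re

# Same patterns as the notebook cell: the first is case-sensitive,
# the second ignores case.
BOOK = re.compile(r'\b(?:text)?books?\b')
INTENT = re.compile(r'\b(?:recommend(?:ed)|best|favou?rite|top)\b', re.IGNORECASE)


def matches(title):
    """True when the title passes both parts of the heuristic."""
    return bool(BOOK.search(title)) and bool(INTENT.search(title))


titles = [
    "Ask HN: What are your favourite textbooks?",      # from the dataset
    "Ask HN: Can you recommend a book on compilers?",  # hypothetical
    "Ask HN: Best Books on Rust",                      # hypothetical
    "Show HN: RemoteDream – Find and book the best ho(s)tels to work remote from",  # from the dataset
]

results = [matches(t) for t in titles]
print(results)  # [True, False, False, True]
```

The two misses show the limits of the patterns as written: `recommend(?:ed)` only matches "recommended", and a capitalised "Books" is skipped by the case-sensitive first pattern. The last title is a false positive, since "book" is used as a verb.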

In [8]:
len(ask_hn_books)

44

In [9]:
pd.options.display.max_colwidth = 1000
pd.options.display.max_columns = 100

The majority of these are threads for book recommendations, though a few false positives slip through (for example, the RemoteDream hotel-booking post).

In [10]:
ask_hn_books[['title']].T

id,28129039,29668228,28391738,27847072,28308141,28456318,28181074,29686739,29317984,29364196,29146838,29450246,25999330,29493803,25824260,26543525,28830391,29042241,26044792,27324739,29538595,26213907,27802522,27961170,27807639,28588411,28523387,25857629,26624661,29712869,29573390,27673186,28267422,25675207,27846316,27646981,26406961,29706029,26157608,29247624,29182386,29726026,25922349,27641976
title,Why is it a “red flag” if someone’s favorite book is The Catcher in the Rye?,Ask HN: What's the best book you read in 2021?,Ask HN: Best books on modern distributed systems,The all-time best software engineering books,Ask HN: What are your top 5 favorite computer books?,Ask HN: What's the best book on AWS Lambda?,Ask HN: Best (practical) books on web security?,Ask HN: What is your favorite book that you've read this year?,"Best non-fiction books of 2021, according to Tyler Cowen",Ask HN: What's your favorite book about chaos theory?,Ask HN: Best book to learn C++ as a professional C programmer?,Show HN: Yearly Faves – track and share your favorite books and music from 2021,"Ask HN: What are your favorite non-fiction books, and why?",Non-business books for 2021 recommended by VCs,Ask HN: Best applied statistics books aimed at machine learning practitioners,"Show HN: Reclist.me – Share your favorite books, podcasts, newsletters",Ask HN: Best books/blogs/courses to learn advanced networking concepts?,Show HN: Top programming books for early developers,Show HN: Best books and courses for AWS MSK,"Show HN: I Built an app to show you the most recommended books on business, tech",Ask HN: What are your favourite textbooks?,Bookfeed.io – An RSS feed of newly released books from your favorite authors,Ask HN: What's the best CS or software engineering book you've read recently?,"What are your favorite MLOps courses, tools, books and research papers?",Ask HN: What are some favorite books that provide insight into other industries?,Ask HN: What are some of your favorite textbooks or technical books?,Show HN: RemoteDream – Find and book the best ho(s)tels to work remote from,"Amazon story: Late 90s, the internet was growing 2300%, book was best to sell",Ask HN: Best book/resources on business modelling?,"Top programming, mathematics, physics, and science books: part eight",Ask HN: Best place to purchase (used) technical books that's not Amazon?,What is the absolute best book for re-learning forgotten math for CS?,Ask HN: What are the best books on relational database modeling in SQL?,Ten best non-war history books of 2020,Ask HN: What is your favorite introductory-level textbook?,Experts recommending the five best books in their subject,The only 10 books AOC ever recommended to read,Best AI and Deep learning books to read in 2022,Bookfeed.io: RSS feed listing all new book releases from your favorite authors,Amazon Books editors announce best 2021 general interest science books,Ask HN: What are some of the best well-written books on computer science?,Ask HN: What is your favorite book for learning statistics?,Ask HN: Best tech focused (technical or non-technical) books you’ve read lately?,Ask HN: What is the best business book you've read?


In [11]:
sample['ask_hn_books'] = False
sample.loc[sample['root'].isin(ask_hn_books.index), 'ask_hn_books'] = True

In [12]:
pd.options.display.max_colwidth = 500

In [13]:
sample.query('ask_hn_books')[['text']]

id,text
29676102,"I also read it this year.<p>It came very close to me. The society of the book was very similar to the society in which I grew up, Soviet Union in 90s.<p>Some of the lines from the book:<p>&quot;.. the social consience completely dominates the individual conscience, instead of striking a balance with it. We don&#x27;t cooperate - we obey. We fear being outcast, being called lazy, dysfunctional, egoizing. We fear our neighbor&#x27;s opinion more than we respect our own freedom of choice.&quot;..."
29673970,"I re-read the series this year too. The books are just as good the second time &#x27;round--maybe better. There&#x27;s a fifth one new since my last read-through, and another in the offing perhaps[1].<p>Saunders&#x27; writing is for people who think that Gibson spends too much time explaining things.[2]<p>You do have to read between the lines a lot, and also stop a lot to figure out how bits of the world work, but the world-building is among the best I&#x27;ve seen. A fully worked out, human..."
29669184,- Existential Rationalism: Handling Hume&#x27;s Fork (second edition)\n- Living with the Himalayan Masters\n- The Outsider\n- Hirohito: Behind the Myth
29669495,The Coming of Neo-Feudalism by Joel Kotkin
29669067,"Probably <i>Reaper</i>, by Will Wight. It’s not an insightful nonfiction book or a piece of high literature, but the whole Cradle series is very, very fun."
...,...
29669044,"If you like that, you might like r&#x2F;progressionfantasy. They really drill down on the hero’s journey genre."
29669957,"Skip the retelling and go straight to the source, <i>South</i>, by Ernest Shackleton."
28404781,"@skrtskrt: Sorry for hijacking this post, but is there a way to contact you directly (my email is in my profile)? Your approach to building Django apps closely resonates with me and I would love to discuss these concepts in more detail."
26556264,Thanks for your comment! I really appreciate!<p>I&#x27;m using the Read Only permissions (the most basic). There is no lower permissions level than that.<p>thanks!


In [14]:
sample[['ask_hn_books']].to_parquet('../data/02_intermediate/ask_hn_books.parquet')