# Creating a Python-based recommender system for Hacker News posts

In this notebook, we walk through a simple algorithm that recommends Hacker News posts matching our interests: for each comma-separated interest we type in, it returns the five most similar post titles. The steps are:

1. Fetch N posts from the HN API and collect their titles in an in-memory list.
2. Clean the data by tokenizing the titles and removing stop words.
3. Build a TF-IDF (term frequency-inverse document frequency) matrix from the titles.
4. Vectorize the query and compute its cosine similarity against every title in the matrix, then sort by similarity to find the titles of interest.

In [43]:
import requests

# Let's grab the top HN story IDs so we can compute similarity on their titles
response = requests.get('https://hacker-news.firebaseio.com/v0/topstories.json', timeout=10)
top_stories_ids = response.json()

# Define a function to fetch the title of a story given its ID.
# Deleted items can come back as null or without a title, so return None for those.
def fetch_title(story_id):
    response = requests.get(f'https://hacker-news.firebaseio.com/v0/item/{story_id}.json', timeout=10)
    story = response.json()
    return story.get('title') if story else None

# Fetch the titles of up to 500 top stories and store them in a list
corpus = []
for story_id in top_stories_ids[:500]:
    title = fetch_title(story_id)
    if title:
        corpus.append(title)
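
The loop above issues one HTTP request per story, one after another, which can be slow for a few hundred stories. A minimal sketch of one way to overlap those requests with a thread pool follows. The helper name `fetch_titles_concurrently` and the `max_workers` value are assumptions for illustration, not part of the original notebook; the per-story fetch function is passed in so the sketch does not depend on any particular HTTP client.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_titles_concurrently(story_ids, fetch_title, max_workers=16):
    """Fetch titles in parallel, keeping input order and skipping failures.

    `fetch_title` is any callable that maps a story ID to a title (or None).
    """

    def safe_fetch(story_id):
        # A single failed request should not abort the whole batch.
        try:
            return fetch_title(story_id)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as story_ids.
        titles = list(pool.map(safe_fetch, story_ids))

    # Drop stories that failed or had no title.
    return [title for title in titles if title]


# Hypothetical usage with the notebook's own names:
# corpus = fetch_titles_concurrently(top_stories_ids[:500], fetch_title)
```

Threads suit this workload because the time is spent waiting on the network rather than on computation.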


In [45]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the titles and drop common English words like "of", "the", etc. (stop words)
vectorizer = TfidfVectorizer(stop_words='english')

# Build the TF-IDF matrix: one row per title, one column per term. Each weight grows with
# how often a term appears in a title and shrinks with how many titles contain that term.
# Rows are L2-normalized by default, which is useful when computing cosine similarity.
tfidf_matrix = vectorizer.fit_transform(corpus)

# Convert the sparse matrix to a pandas DataFrame for readability
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out(), index=corpus)
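
To make the weights in that matrix less of a black box, here is a small from-scratch sketch of the same idea in plain Python. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization). The stop-word set, the function names, and the three example titles are hypothetical and only serve the illustration; the real `'english'` stop-word list is much longer.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "is", "to", "for"}


def tokenize(text):
    """Lowercase, keep tokens of two or more word characters, drop stop words."""
    return [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))

    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm for w in row] if norm else row)
    return vocab, rows


docs = ["Rust proxy framework", "Python machine learning", "Rust and Python bindings"]
vocab, rows = tfidf_vectors(docs)

# "rust" appears in two titles and "proxy" in one, so within the first title
# "proxy" receives the larger weight.
rust_weight = rows[0][vocab.index("rust")]
proxy_weight = rows[0][vocab.index("proxy")]
```

The takeaway is that a term shared by many titles says little about any one of them, so it is down-weighted relative to rarer terms.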

In [61]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def find_most_similar(query, tfidf_matrix, corpus, top_n=5):
    """
    Split a comma-separated query into terms, compute the cosine similarity of each
    term against every title vector, and print the top_n most similar titles per term.
    Relies on the fitted `vectorizer` defined above.
    """

    relevant_results = []
    terms = [term.strip() for term in query.split(",") if term.strip()]

    for term in terms:

        # Use the fitted vectorizer to transform the query term
        query_tfidf = vectorizer.transform([term])

        # TfidfVectorizer L2-normalizes its rows by default, so the dot product of the
        # query vector with each title vector equals their cosine similarity
        cosine_similarities = (query_tfidf @ tfidf_matrix.T).toarray()[0]

        # Sort in descending order of similarity and keep the indices of the top_n titles
        most_similar_indices = cosine_similarities.argsort()[::-1][:top_n]

        # Collect the most similar titles
        for index in most_similar_indices:
            relevant_results.append(corpus[index])

    for res in relevant_results:
        print(res)
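
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells that out in plain Python and ranks a few toy vectors. The names `cosine_similarity` and `top_k` and the toy vectors are assumptions for illustration. It also drops zero-score matches, which is a design choice rather than what the function above does: when a query shares no terms with a title the score is zero, and a plain sort still returns such titles to fill out the top results.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return up to k (index, score) pairs, best first, ignoring zero-score documents."""
    scores = [cosine_similarity(query_vec, doc) for doc in doc_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k] if scores[i] > 0]


# Toy vectors over a two-term vocabulary.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k([1.0, 0.0], doc_vecs, k=3)
# The first document matches exactly, the third partially, the second not at all.
```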
    

In [69]:
import ipywidgets as widgets

# Add a small UI for funsies
query_box = widgets.Text(
    value='',
    placeholder='Type your query here',
    description='Query:',
    disabled=False
)

submit_button = widgets.Button(description="Submit")
def handle_submit(button):
    query = query_box.value
    find_most_similar(query, tfidf_matrix, corpus)

submit_button.on_click(handle_submit)

display(query_box)
display(submit_button)

Text(value='', description='Query:', placeholder='Type your query here')

Button(description='Submit', style=ButtonStyle())

GPT-4 is coming next week – and it will be multimodal, says Microsoft Germany
GPT-4 is coming next week – and it will be multimodal, says Microsoft Germany
Microsoft Bing hits 100M active users in bid to grab share from Google
How long does Twitter have left?
Show HN: PyBroker – Algotrading in Python with Machine Learning
Oxy is Cloudflare's Rust-based next generation proxy framework
How long does Twitter have left?
Stadia’s pivot to a cloud service has also been shut down
Circle: SVB is 1 of 6 banking partners Circle uses for ~25% part of USDC in cash
Show HN: structured-ripgrep – Ripgrep over structured data
