### Install dependencies

In [None]:
!pip install newspaper3k
!pip install haystack-ai

## Create a Custom Haystack 2.0 Component

This `HackernewsNewestFetcher` fetches the `last_k` newest posts on Hacker News and returns their contents as a list of Haystack `Document` objects. Posts that have no link, or whose linked page cannot be downloaded, are skipped.

In [None]:
from typing import List
from haystack.preview import component, Document
from newspaper import Article
import requests

@component
class HackernewsNewestFetcher():
  """Fetches the newest Hacker News stories and returns them as Documents."""

  @component.output_types(articles=List[Document])
  def run(self, last_k: int = 5):
    # Ask the Hacker News API for the IDs of the newest stories
    newest_list = requests.get(url='https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty')
    urls = []
    for story_id in newest_list.json()[0:last_k]:
      item = requests.get(url=f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json?print=pretty").json()
      # Text-only posts (e.g. "Ask HN") have no 'url' field, so skip them
      if item and 'url' in item:
        urls.append(item['url'])

    docs = []
    for url in urls:
      try:
        # Download and parse the linked page with newspaper3k
        article = Article(url)
        article.download()
        article.parse()
        # Keep only the first 1000 characters of each article
        docs.append(Document(text=article.text[:1000], metadata={'title': article.title, 'url': url}))
      except Exception:
        print(f"Couldn't download {url}, skipped")
    return {'articles': docs}
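
The first half of `run` can be tried without touching the network. The sketch below is a hypothetical standalone helper, not part of Haystack or of the component itself: `collect_story_urls` and its `get_json` parameter are names introduced here for illustration, and the canned responses are made up. It follows the same logic as the component: take the first `last_k` story IDs, look each one up, and keep only the items that carry a `url` field.

```python
from typing import Any, Callable, List

HN_API = "https://hacker-news.firebaseio.com/v0"


def collect_story_urls(get_json: Callable[[str], Any], last_k: int = 5) -> List[str]:
    """Return the external URLs of the `last_k` newest stories.

    `get_json` is a hypothetical hook: any callable that maps a URL to the
    decoded JSON body, e.g. `lambda u: requests.get(u).json()`.
    """
    story_ids = get_json(f"{HN_API}/newstories.json")
    urls = []
    for story_id in story_ids[:last_k]:
        item = get_json(f"{HN_API}/item/{story_id}.json")
        # Skip empty responses and text-only posts, which have no 'url' field
        if item and "url" in item:
            urls.append(item["url"])
    return urls


# Made-up responses standing in for the live API
canned = {
    f"{HN_API}/newstories.json": [103, 102, 101],
    f"{HN_API}/item/103.json": {"id": 103, "title": "A linked post", "url": "https://example.com/a"},
    f"{HN_API}/item/102.json": {"id": 102, "title": "A text-only post"},
    f"{HN_API}/item/101.json": {"id": 101, "title": "Another linked post", "url": "https://example.com/b"},
}

print(collect_story_urls(canned.__getitem__, last_k=3))
```

With the canned data this prints `['https://example.com/a', 'https://example.com/b']`: three posts were requested, but the text-only one is dropped, which is why the component can return fewer than `last_k` documents.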


In [None]:
from haystack.preview import Pipeline
from haystack.preview.document_stores import MemoryDocumentStore
from haystack.preview.components.embedders.sentence_transformers_document_embedder import SentenceTransformersDocumentEmbedder
from haystack.preview.components.writers.document_writer import DocumentWriter

fetcher = HackernewsNewestFetcher()
document_store = MemoryDocumentStore()
embedder = SentenceTransformersDocumentEmbedder(model_name_or_path="sentence-transformers/all-mpnet-base-v2")
writer = DocumentWriter(document_store)

In [None]:
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(name="fetcher", instance=fetcher)
indexing_pipeline.add_component(name="embedder", instance=embedder)
indexing_pipeline.add_component(name="writer", instance=writer)

In [None]:
indexing_pipeline.connect("fetcher.articles", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")

In [None]:
indexing_pipeline.run(data={"fetcher": {"last_k": 5}})

{'writer': {}}
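
Each `connect` call names an output of one component (`fetcher.articles`) and the input of another component that should receive it (`embedder.documents`). The toy runner below is only a mental model in plain Python, not Haystack's implementation: `run_linear` and the three stand-in functions are hypothetical names, and the "embedding" is just the text length. In this toy model, only components whose outputs feed nothing else are reported, which mirrors the `{'writer': {}}` result of the indexing run.

```python
from typing import Any, Callable, Dict, List, Tuple


def run_linear(
    components: Dict[str, Callable[..., Dict[str, Any]]],
    connections: List[Tuple[str, str]],
    data: Dict[str, Dict[str, Any]],
) -> Dict[str, Dict[str, Any]]:
    """Run components in insertion order, routing outputs along connections."""
    inputs = {name: dict(data.get(name, {})) for name in components}
    results = {}
    for name, run in components.items():
        outputs = run(**inputs[name])
        results[name] = outputs
        for sender, receiver in connections:
            sender_name, sender_output = sender.split(".")
            receiver_name, receiver_input = receiver.split(".")
            if sender_name == name:
                # Hand this output to the connected input of the next component
                inputs[receiver_name][receiver_input] = outputs[sender_output]
    # Report only the components that do not send anything onward
    senders = {sender.split(".")[0] for sender, _ in connections}
    return {name: out for name, out in results.items() if name not in senders}


store: List[Tuple[str, int]] = []


def fetcher(last_k: int) -> Dict[str, Any]:
    return {"articles": [f"post {i}" for i in range(last_k)]}


def embedder(documents: List[str]) -> Dict[str, Any]:
    # Stand-in for an embedding: pair each text with its length
    return {"documents": [(doc, len(doc)) for doc in documents]}


def writer(documents: List[Tuple[str, int]]) -> Dict[str, Any]:
    store.extend(documents)
    return {}


result = run_linear(
    {"fetcher": fetcher, "embedder": embedder, "writer": writer},
    [("fetcher.articles", "embedder.documents"), ("embedder.documents", "writer.documents")],
    {"fetcher": {"last_k": 2}},
)
print(result)  # {'writer': {}}
print(store)   # [('post 0', 6), ('post 1', 6)]
```

The writer returns an empty dictionary because its effect is the side effect of filling the store, so the stored documents have to be inspected separately.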

In [None]:
document_store.filter_documents()[3]

Document(id='bc3741b7667efab13734e99edf49a4c1258b109d22625652beb25650324d4496', text='I just had a thought - or maybe it’s more accurate to say that a months-long thought process just crystallised into something. So I’m going to do a quick stream-of-consciousness post about it and hope it’s useful in some way and isn’t pure brain-crack.\n\nI’m sure I’m not the only one feeling the effects of Twitter’s X’s enshittification, followed by Reddit’s . People mostly younger than me probably feel the same about TikTok, a story which, along with that of Amazon, helped Cory Doctorow solidify the term. Even before Twitter was bought by the Muskrat, Facebook’s Meta’s downfall was strongly predicted (and continues to play out before our eyes), and I personally feel Google is going through a slower, decades-long enshittification since about 2010.\n\nMy initial perspective on this was filled with schadenfreude. X is quite clearly imploding (albeit a bit slower than some anticipated), and is already a

## Create a Haystack 2.0 RAG Pipeline

This pipeline uses the components available in the Haystack 2.0 preview package at time of writing (22 September 2023) as well as the custom component we've created above.

The end result is a RAG pipeline that provides a brief summary of each of the `last_k` newest posts on Hacker News, followed by its source URL.

In [None]:
from getpass import getpass

api_key = getpass("OpenAI Key: ")

OpenAI Key: ··········


In [None]:
from haystack.preview.components.builders.prompt_builder import PromptBuilder
from haystack.preview.components.generators.openai.gpt import GPTGenerator

fetcher = HackernewsNewestFetcher()

prompt_template = """
You will be provided a few of the latest posts in HackerNews, followed by their URL.
For each post, provide a brief summary followed by the URL the full post can be found at.

Posts:
{% for article in articles %}
  {% if article.text|length > 0 %}
    {{article.text}}
    URL: {{article.metadata['url']}}
  {% endif %}
{% endfor %}
"""

prompt_builder = PromptBuilder(template=prompt_template)
llm = GPTGenerator(model_name="gpt-4", api_key=api_key)
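
The prompt template is written in Jinja2 syntax, so its loop and length check can be previewed without calling the LLM. The sketch below renders the same template with the `jinja2` package directly (assumed to be installed). The two sample articles are made up, and `SimpleNamespace` is only a stand-in for Haystack's `Document`.

```python
from types import SimpleNamespace

from jinja2 import Template

prompt_template = """
You will be provided a few of the latest posts in HackerNews, followed by their URL.
For each post, provide a brief summary followed by the URL the full post can be found at.

Posts:
{% for article in articles %}
  {% if article.text|length > 0 %}
    {{article.text}}
    URL: {{article.metadata['url']}}
  {% endif %}
{% endfor %}
"""

# Made-up sample data: one post with text, one whose text came back empty
articles = [
    SimpleNamespace(text="First sample post.", metadata={"url": "https://example.com/a"}),
    SimpleNamespace(text="", metadata={"url": "https://example.com/empty"}),
]

rendered = Template(prompt_template).render(articles=articles)
print(rendered)
```

The article with empty text is left out of the rendered prompt entirely, so the LLM is never asked to summarize a post whose page yielded no content.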


In [None]:
summarization_rag_pipeline = Pipeline()
summarization_rag_pipeline.add_component(name="fetcher", instance=fetcher)
summarization_rag_pipeline.add_component(name="prompt_builder", instance=prompt_builder)
summarization_rag_pipeline.add_component(name="llm", instance=llm)

In [None]:
summarization_rag_pipeline.connect("fetcher.articles", "prompt_builder.articles")
summarization_rag_pipeline.connect("prompt_builder.prompt", "llm.prompt")

In [None]:
result = summarization_rag_pipeline.run(data={"fetcher": {"last_k": 4}})

In [None]:
print(result['llm']['replies'][0])

1. "Event sourcing with Kafka: A practical example"
This post introduces event sourcing, a method of determining system state by analyzing and aggregating past events. This approach is perfect for large, distributed systems as it offers scalability, repeatability, traceability, and flexibility. The post also explains how to implement event sourcing with Kafka and Tinybird. This method is preferable to traditionally storing balance information in a database, an approach that can cause problems if data is inadvertently modified and lacks a transaction history.
URL: https://www.tinybird.co/blog-posts/event-sourcing-with-kafka

2. "Understanding Postgres IOPS"
This post highlights the importance of IOPS (Input/Output Operations Per Second), a key metric for measuring disk system performance as it represents the number of read and write operations performable per second. The post focuses on PostgreSQL, a database system that heavily relies on disk access, explaining what IOPS is, its impact