In this post we build a simple news AI assistant. We use the GoogleNews Python library to fetch news headlines, ChromaDB to run semantic search over them, and Retrieval-Augmented Generation (RAG) to answer a question from the retrieved headlines.

Here is the video of this blog:
[![Free AI Workshop - Building a news assistant](https://img.youtube.com/vi/tKdDM0QIrvg/0.jpg)](https://www.youtube.com/embed/tKdDM0QIrvg "Free AI Workshop - Building a news assistant")


We start by installing the GoogleNews Python library.

In [3]:
!pip install GoogleNews



Initialize the Google News object for fetching the news.

In [4]:
from GoogleNews import GoogleNews
googlenews = GoogleNews()

In [5]:
print(googlenews.getVersion())

1.6.15


Let's set the period to the last 2 days.

In [6]:
googlenews = GoogleNews(period='2d')

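Set the encoding to utf-8. We call `set_encode` on the existing object, because creating a new `GoogleNews` object here would discard the period we just set.

In [7]:
googlenews.set_encode('utf-8')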

Here we set the topic to "Sports", using its Google News topic ID.

In [8]:
googlenews.set_topic('CAAqKggKIiRDQkFTRlFvSUwyMHZNRFp1ZEdvU0JXVnVMVWRDR2dKSlRpZ0FQAQ')
googlenews.get_news()

Now we can fetch the news titles.

In [11]:
titles = googlenews.get_texts()

titles

["Anand Mahindra's 'Punjabi' Reaction To India's 'Tunuk Tunuk' Dance After Chess Olympiad Win",
 'Chess Olympiad | Absolute domination by both Indian teams, says Harikrishna',
 '‘Aggressive style is working for India’s youngsters,’ says Giri',
 "'Celebration Inspired By Messi and Rohit': Chess Champs To NDTV",
 'India sharpened focus on fielding, fitness for T20 World Cup, say Harmanpreet and Muzumdar',
 "Harmanpreet: 'This is our best ever team at a T20 World Cup'",
 'Women’s T20 World Cup: Skipper Harmanpreet Kaur happy to ‘tick all boxes’ on preparation front',
 "Harmanpreet Kaur upbeat about India's T20 World Cup chances with strong batting core",
 'Virat Kohli steals the limelight as he, Rishabh Pant and Gautam Gambhir reach Kanpur for India vs Bangladesh 2nd Test',
 'India and Bangladesh brace for lower bounce on black-soil pitch in Kanpur',
 'India vs Bangladesh: Cues from Kanpur – Slow turner, rain and a security threat',
 "RCB Pacer To Make Debut, Kuldeep In For Bumrah? India'

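Before indexing, it can help to tidy the headline list, since feeds often repeat the same story with small differences in spacing or case. Here is a minimal sketch of that idea; `clean_titles` and the sample headlines are made up for illustration and are not part of the GoogleNews library.

```python
def clean_titles(raw_titles):
    """Strip whitespace, drop empty strings, and remove duplicates (order preserved)."""
    seen = set()
    cleaned = []
    for title in raw_titles:
        text = " ".join(title.split())  # collapse runs of whitespace
        key = text.casefold()           # compare titles case-insensitively
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned


# Made-up sample input, only to show the behaviour.
sample = [
    "India  tightens grip on top spot",
    "",
    "india tightens grip on top spot ",
    "Chess Olympiad win",
]
print(clean_titles(sample))
```

In the notebook, this could be applied as `titles = clean_titles(titles)` right after fetching.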
Let's now install the ChromaDB library. We will use this as our vector database for doing semantic search.

In [12]:
!pip install chromadb


Initialize the ChromaDB client and create a collection.

In [13]:
import chromadb
# setup Chroma in-memory, for easy prototyping. Can add persistence easily!
client = chromadb.Client()

# Create collection. get_collection, get_or_create_collection, delete_collection also available!
collection = client.create_collection("all-my-documents")

Each document in ChromaDB needs a unique ID, so we create a list of IDs.

In [22]:
# create IDs for ingesting the documents; for now each ID is the title's index as a string
ids = [str(i) for i in range(len(titles))]
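Index-based IDs are fine for a one-off run, but they change meaning whenever the headline list changes. One possible alternative is to derive each ID from the title text, so the same headline always gets the same ID. The sketch below uses only the standard library; `make_ids` is a hypothetical helper name, and truncating the hash to 16 characters is an arbitrary choice.

```python
import hashlib


def make_ids(titles):
    """Return one deterministic ID per title, derived from the title text."""
    ids = []
    for title in titles:
        digest = hashlib.sha256(title.encode("utf-8")).hexdigest()
        ids.append(digest[:16])  # shortened hash; collisions are very unlikely at this scale
    return ids


example_titles = [
    "World Test Championship: India tightens grip on top spot",
    "Chess Olympiad | Absolute domination by both Indian teams, says Harikrishna",
]
example_ids = make_ids(example_titles)
print(example_ids)
```

Note that identical titles map to the same ID, so duplicates would need to be removed before passing the IDs to `collection.add`.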

We also create metadata for each title.

In [27]:
# one metadata dict per title; for now each dict just maps the title's index (as a string) to the title
title_metadata = [{str(i): title} for i, title in enumerate(titles)]
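Because every dict above uses a different key, this metadata is hard to filter on. A variation worth considering is to give every record the same keys. The sketch below is one way to do that; the field names `position`, `topic`, and `source` are assumptions chosen for illustration.

```python
def build_metadata(titles, topic="sports", source="google-news"):
    """Build one metadata dict per title, all sharing the same keys."""
    return [
        {"position": i, "topic": topic, "source": source}
        for i, _ in enumerate(titles)
    ]


example_metadata = build_metadata(["first headline", "second headline"])
print(example_metadata)
```

With consistent keys like these, a filter such as `where={"topic": "sports"}` in a query should match every record from this topic.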

We add the titles to the collection. ChromaDB embeds and indexes them for us.

In [29]:
# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
    documents=titles, # we handle tokenization, embedding, and indexing automatically. You can skip that and add your own embeddings as well
    metadatas= title_metadata, # filter on these!
    ids=ids, # unique for each doc
)
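To make the idea behind semantic search concrete, here is a toy version of what a vector store does: each document becomes a vector, and a query returns the documents whose vectors are closest to the query vector. The 3-dimensional vectors below are hand-made and purely illustrative, not real embeddings, and the use of squared Euclidean distance is an assumption about ChromaDB's default rather than something this notebook verifies.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec, doc_vecs, n_results=2):
    """Return (doc_id, distance) pairs for the n closest documents, closest first."""
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n_results]


# Hand-made "embeddings": the first axis loosely stands for cricket, the last for chess.
toy_vectors = {
    "cricket-test": [0.9, 0.1, 0.0],
    "cricket-t20": [0.7, 0.3, 0.0],
    "chess": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # a query that is "about cricket"
print(nearest(query, toy_vectors))
```

A smaller distance means a closer match, which is how the `distances` field in a query result can be read.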

Now we can query the collection for the titles most similar to a question.

In [41]:
# Query/search 2 most similar results. You can also .get by id
results = collection.query(
    query_texts=["What is happening with test matches?"],  # the question to search for
    n_results=2,
    # where={"metadata_field": "is_equal_to_this"}, # optional filter
    # where_document={"$contains":"search_string"}  # optional filter
)

The query returns the two titles most similar to the question.

In [43]:
results

{'ids': [['23', '22']],
 'distances': [[1.0741010904312134, 1.1105579137802124]],
 'metadatas': [[{'23': "More wins than loses: Tracing India's incredible Test cricket journey"},
   {'22': 'World Test Championship: India tightens grip on top spot'}]],
 'embeddings': None,
 'documents': [["More wins than loses: Tracing India's incredible Test cricket journey",
   'World Test Championship: India tightens grip on top spot']],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}
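Each field in the result holds one inner list per query text, which is why everything is nested one level deeper than you might expect. A small helper can flatten the rows for a single query and optionally drop weak matches. This is a sketch: `flatten_results` and the `max_distance` cutoff are hypothetical, and the sample dict simply copies the values shown above.

```python
def flatten_results(results, query_index=0, max_distance=None):
    """Turn one query's slice of a Chroma-style result dict into a list of row dicts."""
    rows = []
    for doc_id, doc, dist in zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["distances"][query_index],
    ):
        # Keep the row unless a cutoff is set and the match is too far away.
        if max_distance is None or dist <= max_distance:
            rows.append({"id": doc_id, "document": doc, "distance": dist})
    return rows


sample_results = {
    "ids": [["23", "22"]],
    "distances": [[1.0741010904312134, 1.1105579137802124]],
    "documents": [[
        "More wins than loses: Tracing India's incredible Test cricket journey",
        "World Test Championship: India tightens grip on top spot",
    ]],
}
for row in flatten_results(sample_results, max_distance=1.1):
    print(row["id"], round(row["distance"], 3), row["document"])
```

A sensible cutoff depends on the embedding model and the data, so the value 1.1 here is only for demonstration.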

Let's serialize the retrieved documents to a JSON string, which we will pass to the model as context.

In [44]:
import json
context = json.dumps(results["documents"])

In [45]:
context

'[["More wins than loses: Tracing India\'s incredible Test cricket journey", "World Test Championship: India tightens grip on top spot"]]'
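The JSON string works, but the nested brackets and escaped quotes add noise to the prompt. As an alternative, the headlines could be joined into a plain bulleted block. This is only a sketch; `format_context` is a hypothetical helper, and it assumes a single query, so it reads the first inner list.

```python
def format_context(documents):
    """Join the documents returned for the first query into a bulleted text block."""
    return "\n".join(f"- {doc}" for doc in documents[0])


retrieved = [[
    "More wins than loses: Tracing India's incredible Test cricket journey",
    "World Test Championship: India tightens grip on top spot",
]]
print(format_context(retrieved))
```

In the notebook this would be called as `format_context(results["documents"])`.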

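Let's now set up the OpenAI API key.

In [34]:
import getpass
import os
openai_key = getpass.getpass("Enter your OpenAI key: ")
# "!export" runs in a throwaway subshell, so set the variable in this process instead
os.environ["OPENAI_API_KEY"] = openai_key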

Install the LangChain OpenAI integration.

In [35]:
!pip install langchain-openai


Now we can set up the prompt and the LLM for the generation step of RAG.

In [46]:
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI

llm = OpenAI(api_key=openai_key)
prompt = PromptTemplate.from_template(
    """
    You are a news assistant. The user asks you a question and you are given a context.
    Question: {question}
    
    Context: {context}
    
    Based on the question and the context, answer the question. Do not provide any information that is not present in the context.
    Do not mention the context in the answer.
    Stick as close to the question as possible.
"""
)

chain = prompt | llm
chain.invoke({"question": "What is happening with test matches?", "context": context })

'\nCurrently, India has had more wins than losses in Test cricket and is also dominating the World Test Championship.'
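The retrieval and generation steps can be wrapped into one function so that a new question needs a single call. The sketch below takes the retriever and the model as plain callables, which keeps it runnable without ChromaDB or an API key. The names `answer`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins, and the shortened prompt is an assumption, not the exact template used above.

```python
PROMPT = (
    "You are a news assistant. Answer the question using only the context.\n"
    "Question: {question}\n"
    "Context:\n{context}\n"
)


def answer(question, retrieve, generate, n_results=2):
    """Retrieve headlines for the question, build a prompt, and ask the model."""
    documents = retrieve(question, n_results)
    if not documents:
        return "No relevant news found."
    context = "\n".join(f"- {doc}" for doc in documents)
    return generate(PROMPT.format(question=question, context=context)).strip()


# Stand-ins so the sketch runs on its own.
def fake_retrieve(question, n_results):
    headlines = ["World Test Championship: India tightens grip on top spot"]
    return headlines[:n_results]


def fake_generate(prompt):
    return "\nstub answer"


print(answer("What is happening with test matches?", fake_retrieve, fake_generate))
```

In the notebook, `retrieve` could wrap `collection.query(...)` and return the first inner list of `documents`, and `generate` could wrap `llm.invoke`.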


## Conclusion

We have built a simple news AI assistant using the GoogleNews Python library, ChromaDB, and Retrieval-Augmented Generation (RAG). It can be extended into a more capable AI assistant.

Due to a technical issue with Google News, we were not able to fetch the actual news articles, so this version works with the headlines only. We will update the code to use the actual articles soon. Stay tuned for more such blogs.