# Malawi News Chat Interface: An Intelligent RAG System for Personalized News Digests

## Introduction
This project builds a chat interface that answers questions about current news from Malawian news websites. Articles are scraped with BeautifulSoup, stored as plain text, and indexed so that a language model can retrieve the relevant passages and answer user queries conversationally (retrieval-augmented generation, or RAG).

## Project Steps

### Step 1: Web Scraping Using BeautifulSoup
- Objective: Fetch news content from prominent Malawian news websites (e.g., Times).
- Method: Use BeautifulSoup in Python to scrape news titles and full articles.
- Output: Raw news data, including titles and full article contents.
    
### Step 2: Data Storage
- Objective: Store the scraped news data for easy access and processing.
- Method: Save all news titles and their corresponding articles into a .txt file.
- Output: A consolidated text file containing all relevant news information.
    
### Step 3: Building the AI Data Source
- Objective: Prepare the scraped news data for AI processing.
- Method: Use the stored .txt file as the data source for the AI model.
- Preparation: Format and clean the data as needed before indexing.
- Output: A cleaned and structured dataset ready for integration with the AI model.
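
The notebook does not show a cleaning step, so the following is only a minimal sketch of what "format and clean" could look like for scraped article text. The function name `clean_article_text` is hypothetical, and the choice of normalizations (Unicode NFKC folding and whitespace collapsing) is an assumption rather than the project's actual method.

```python
import re
import unicodedata


def clean_article_text(text: str) -> str:
    """Normalize Unicode and collapse whitespace in one scraped article (hypothetical helper)."""
    # NFKC folds compatibility characters, e.g. a non-breaking space becomes a plain space.
    text = unicodedata.normalize("NFKC", text)
    # Collapse every run of whitespace (spaces, tabs, newlines) into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_article_text("  Tobacco\u00a0sales \n\n open   in Lilongwe "))
```

Collapsing whitespace keeps each article on one line, which matches how the scraped paragraphs are joined before being written to the text file.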

### Step 4: Chat Interface Integration
Objective: Develop a chat interface for users to interact with the AI.
Method: Connect the prepared AI model with a user-friendly chat interface.
Features: Enable users to query news, get summaries, and ask follow-up questions.
Output: A fully functional chat interface that delivers personalized news content.

### Step 1: Get the data from Malawi News Websites
- Start with any news website from Malawi.

In [4]:
import requests
from bs4 import BeautifulSoup

In [16]:
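def extract_titles(url):
    # Fetch the listing page. Raise on HTTP errors instead of returning an error
    # string, which callers would otherwise iterate over as if it were a list of titles.
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.content, 'html.parser')

    # Headline anchors carry the 'p-url' class on the scraped site.
    titles = [a_tag.get_text(strip=True) for a_tag in soup.find_all('a', class_='p-url')]
    return titles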

In [17]:
def extract_links(url):
    # Fetch the listing page. Raise on HTTP errors instead of returning an error
    # string, which callers would otherwise iterate over as if it were a list of links.
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.content, 'html.parser')

    # Collect the article URL from each headline anchor.
    links = [a_tag['href'] for a_tag in soup.find_all('a', class_='p-url')]
    return links
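
The two functions above download and parse the same page twice, and the later file-writing step depends on titles and links staying aligned. As a hedged alternative, the sketch below parses the HTML once and returns each title together with its link. The function name `extract_titles_and_links` and the sample HTML are illustrative assumptions; only the `p-url` class comes from the notebook.

```python
from bs4 import BeautifulSoup


def extract_titles_and_links(html):
    """Return (title, href) pairs for every <a class="p-url"> in the page (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for a_tag in soup.find_all("a", class_="p-url"):
        href = a_tag.get("href")
        if href:  # skip anchors that have no link target
            pairs.append((a_tag.get_text(strip=True), href))
    return pairs


# Made-up markup standing in for a real listing page.
sample_html = """
<a class="p-url" href="https://example.com/a">First headline</a>
<a class="p-url">No link here</a>
<a class="other" href="https://example.com/b">Ignored</a>
"""
print(extract_titles_and_links(sample_html))
```

Because each title travels with its own link, the two lists cannot drift out of step if an anchor is missing an `href`.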

In [18]:
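def extract_paragraphs_from_url(url):
    response = requests.get(url, timeout=30)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # A multi-class string only matches when the element's class attribute is
        # exactly this value, so the selector is specific to the scraped site's theme.
        div_content = soup.find('div', class_="entry-content rbct clearfix is-highlight-shares")
        paragraphs = [p.get_text(strip=True) for p in div_content.find_all('p')] if div_content else []

        # Join the paragraphs into a single line of article text.
        full_text = ' '.join(paragraphs)
        return full_text
    else:
        # Report the failure but return an empty body, so the error message
        # is never written into the news corpus as if it were article text.
        print(f"Failed to retrieve {url}: Status code {response.status_code}")
        return ''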

In [19]:
url = ''
links = extract_links(url)

In [20]:
titles = extract_titles(url)

In [23]:
with open('titles.txt', 'w') as file:
    for title in titles:
        file.write(f"{title}\n")

In [24]:
with open('links.txt', 'w') as file:
    for link in links:
        file.write(f"{link}\n")

In [21]:
def write_titles_with_content(titles_file, links_file, output_file):
    with open(titles_file, 'r') as f:
        titles = f.readlines()
    with open(links_file, 'r') as f:
        links = f.readlines()

    if len(titles) != len(links):
        return "The number of titles and links does not match."

    with open(output_file, 'w') as output:
        for title, link in zip(titles, links):
            output.write(title.strip() + '\n')

            extracted_text = extract_paragraphs_from_url(link.strip())
            output.write(extracted_text + '\n\n')

    return "Content written to " + output_file

In [25]:
output = write_titles_with_content('titles.txt', 'links.txt', 'output.txt')
print(output)

Content written to output.txt
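
Since each article is written as a title line, a body line, and a blank line, the file can be read back into structured records. The sketch below assumes exactly that layout; the function name `parse_articles` and the sample text are hypothetical and not part of the notebook.

```python
def parse_articles(raw):
    """Split the title/body/blank-line layout into a list of dicts (hypothetical helper)."""
    articles = []
    for block in raw.split("\n\n"):
        block = block.strip()
        if not block:
            continue  # ignore empty trailing blocks
        # The first line is the title; everything after it is the article body.
        title, _, body = block.partition("\n")
        articles.append({"title": title.strip(), "body": body.strip()})
    return articles


# Made-up text in the same layout that write_titles_with_content produces.
sample = (
    "Headline one\nBody of the first article.\n\n"
    "Headline two\nBody of the second article.\n\n"
)
for article in parse_articles(sample):
    print(article["title"], "->", len(article["body"]), "characters")
```

Keeping title and body together per record would make it possible to attach the title as metadata to each chunk later, although the notebook itself loads the file as a single document.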


## Let's Start Building the RAG System

In [1]:
import dotenv
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
dotenv.load_dotenv()

True

### Load the text data

In [10]:
# Load text data from a file using TextLoader
loader = TextLoader("./output.txt")
docs = loader.load()

In [11]:
len(docs[0].page_content)

228910

In [13]:
print(docs[0].page_content[500:1000])

era said the nation has put on their shoulders a crucial responsibility of upholding justice, fairness, and accountability within NIS operations, arguing that this is not an easy task by any stretch of the imagination. “The main function of the service then was to protect the political hegemony of the colonial government which was maintained in the one party state. Although intelligence was formally delinked from the Malawi Police Service in 2000, where it was called the Security and Intelligenc


### Index Splitting

In [14]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

In [17]:
len(all_splits[0].page_content)

54

In [15]:
len(all_splits)

390

In [16]:
all_splits[10].metadata

{'source': './output.txt', 'start_index': 5251}
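
To make `chunk_size`, `chunk_overlap`, and `start_index` concrete, here is a deliberately simplified sketch that cuts text into fixed-width overlapping windows. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as blank lines and newlines, which is likely why the first chunk above is only 54 characters long (a title on its own line). The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-width windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = sliding_window_chunks("x" * 2500)
print([(c["start_index"], len(c["text"])) for c in chunks])
# → [(0, 1000), (800, 1000), (1600, 900)]
```

Each window starts 800 characters after the previous one, so consecutive chunks share 200 characters. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.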

### Index Store

In [18]:
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())

In [19]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [20]:
retrieved_docs = retriever.invoke("What does the ministry of Tourism need?")

In [21]:
len(retrieved_docs)

6

In [22]:
print(retrieved_docs[0].page_content)

By Cathy Maulidi: The Ministry of Tourism needs about K1 billion to market the country, Director of Administration in the Ministry of Tourism Esther Nyirenda told Parliament Tuesday. Nyirenda told the Public Accounts Committee of Parliament that apart from marketing the country as a tourism destination, the ministry is advocating domestic tourism. “We need to market our country to both local and international tourists and doing that, needs a lot of resources. “We have our officers attached to embassies who help us market our country and we also reach out to people who can do marketing for us. We advertise and when they apply and meet what we want, we sign contracts with them so that they should be marketing Malawi while in their countries,” Nyirenda said. She cited high rates in hospitality units as one of the major hindrances to growing the tourism sector. According to Nyirenda, the ministry will launch its tourism marketing strategy by September 2023. But Public Accounts Committee
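
The retriever returns the `k` chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea with cosine similarity over tiny hand-made vectors. The vectors, labels, and function names are invented for illustration, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=6):
    """Return the ids of the k vectors most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy 3-dimensional "embeddings"; real embedding vectors have far more dimensions.
toy_vectors = {
    "tourism": [0.9, 0.1, 0.0],
    "tobacco": [0.1, 0.9, 0.0],
    "sports": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.2, 0.0], toy_vectors, k=2))
# → ['tourism', 'tobacco']
```

With `k=6`, as in the notebook, the six best-scoring chunks are passed on as context, which trades a longer prompt for a better chance of including the relevant passage.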


### Retrieval and Generation: Generate

In [23]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

In [24]:
prompt = hub.pull("rlm/rag-prompt")

In [25]:
example_messages = prompt.invoke(
    {"context": "filler context", "question": "filler question"}
).to_messages()
example_messages

[HumanMessage(content="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: filler question \nContext: filler context \nAnswer:")]

In [26]:
print(example_messages[0].content)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: filler question 
Context: filler context 
Answer:


In [27]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
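
The chain above can be read as a sequence of function calls: the question is sent to the retriever, the retrieved chunks are joined by `format_docs`, the prompt is filled with both values, and the model's reply is returned as a string. The sketch below reproduces that data flow in plain Python with stand-in components. `FakeDoc`, `fake_retriever`, `fake_prompt`, `fake_llm`, and `run_chain` are all hypothetical and do not call LangChain or OpenAI.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document with a page_content field."""
    page_content: str


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


def fake_retriever(question):
    # A real retriever would search the vector store; this returns fixed chunks.
    return [FakeDoc("Chunk about tobacco."), FakeDoc("Chunk about tourism.")]


def fake_prompt(inputs):
    return f"Question: {inputs['question']}\nContext: {inputs['context']}\nAnswer:"


def fake_llm(prompt_text):
    # A real model would generate an answer; this only reports the prompt size.
    return f"(model reply to a {len(prompt_text)}-character prompt)"


def run_chain(question):
    # Mirrors {"context": retriever | format_docs, "question": RunnablePassthrough()}.
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    return fake_llm(fake_prompt(inputs))


print(run_chain("What is new?"))
```

The question is used twice: once to retrieve context and once, unchanged, as the `question` field of the prompt, which is the role `RunnablePassthrough()` plays in the chain.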

In [30]:
for chunk in rag_chain.stream("Whats up with Tobbacco?"):
    print(chunk, end="", flush=True)

The Tobacco Commission has started a crop assessment to determine the potential tobacco production for the year. The assessment will run from January 15 to February 2, with authorities visiting tobacco growing areas for data collection. Farmers have been licensed with quotas worth 248 million kilograms of all types of tobacco.