<a href="https://colab.research.google.com/github/LucasMatuszewski/Python-colab-notebooks/blob/main/CrewAI_Chroma_RAG_NewsAPI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CrewAI RAG with Chroma DB and NewsAPI
Based on [this YouTube video](https://www.youtube.com/watch?v=77xSbC-9yn4)

## Mount Google Drive

To persist data and dependencies between sessions, we can save them to Google Drive,
either in a virtual environment (to keep packages separate when we install them for many different tools) or simply by mounting Drive.

More: https://stackoverflow.com/questions/55253498/how-do-i-install-a-library-permanently-in-colab

More about mounting with Virtual Env:
https://netraneupane.medium.com/how-to-install-libraries-permanently-in-google-colab-fb15a585d8a5

In [1]:
# PERSISTENCE of dependencies!
# To avoid reinstalling dependencies every time we start a runtime, we can save the files to Google Drive:
import os, sys
from google.colab import drive

drive.mount('/content/drive')
gdrive = '/content/notebooks'
link_target = '/content/drive/My Drive/Colab Notebooks'
if not os.path.islink(gdrive):  # Check if the symlink already exists
    os.symlink(link_target, gdrive)  # Create it if it doesn't
sys.path.insert(0, gdrive)  # or sys.path.append(gdrive)

# and then we can install like this:
# !pip install --target=$gdrive crewai


# ALTERNATIVE: Virtual Environment:

# Install virtualenv:
# !pip install virtualenv

# # Mount Google Drive:
# from google.colab import drive
# drive.mount("/content/drive")

# # Activate Virtual Environment and Install required libraries:
# !virtualenv /content/drive/MyDrive/vir_env
# !source /content/drive/MyDrive/vir_env/bin/activate; pip install numpy

# # Adding the Virtual Environment to sys.path:
# import sys
# sys.path.append("/content/drive/MyDrive/vir_env/lib/python3.10/site-packages")



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Install dependencies

In [1]:

# Install only once. In later sessions you can skip this cell (the packages will be loaded from Google Drive)

!pip install --target=$gdrive crewai
!pip install --target=$gdrive 'crewai[tools]'
!pip install --target=$gdrive duckduckgo-search langchain-community langchain-openai langchain-mistralai requests chromadb
!pip install chromadb
# !pip install groq
!pip install --target=$gdrive langchain-groq

Collecting crewai
  Using cached crewai-0.19.0-py3-none-any.whl (42 kB)
Collecting instructor<0.6.0,>=0.5.2 (from crewai)
  Using cached instructor-0.5.2-py3-none-any.whl (33 kB)
Collecting langchain<0.2.0,>=0.1.10 (from crewai)
  Using cached langchain-0.1.11-py3-none-any.whl (807 kB)
Collecting langchain-openai<0.0.6,>=0.0.5 (from crewai)
  Using cached langchain_openai-0.0.5-py3-none-any.whl (29 kB)
Collecting openai<2.0.0,>=1.13.3 (from crewai)
  Using cached openai-1.13.3-py3-none-any.whl (227 kB)
Collecting opentelemetry-exporter-otlp-proto-http<2.0.0,>=1.22.0 (from crewai)
  Using cached opentelemetry_exporter_otlp_proto_http-1.23.0-py3-none-any.whl (16 kB)
Collecting docstring-parser<0.16,>=0.15 (from instructor<0.6.0,>=0.5.2->crewai)
  Using cached docstring_parser-0.15-py3-none-any.whl (36 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain<0.2.0,>=0.1.10->crewai)
  Using cached dataclasses_json-0.6.4-py3-none-any.whl (28 kB)
Collecting langchain-community<0.1,>=0.0.2

## Get API Keys and set all ENV vars


In [2]:
import os
from google.colab import userdata

OpenAI_api_key = userdata.get("OpenAIkey")
os.environ['OPENAI_API_KEY'] = OpenAI_api_key
Mistral_api_key = userdata.get("MistralKey")
os.environ['MISTRAL_API_KEY'] = Mistral_api_key

Groq_api_key = userdata.get("GroqKey")
os.environ['GROQ_API_KEY'] = Groq_api_key

news_api_key = userdata.get('NewsAPI')
os.environ['NEWS_API_KEY'] = news_api_key

# LangChain tracing LLM usage and bugs with LangSmith: https://python.langchain.com/docs/langsmith/walkthrough
LangChain_api_key = userdata.get('LangChain')
os.environ['LANGCHAIN_API_KEY'] = LangChain_api_key
os.environ['LANGCHAIN_TRACING_V2'] = "true"
os.environ["LANGCHAIN_PROJECT"] = "NewsAPI RAG"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

# if secretName == "MistralKey":
#   os.environ['OPENAI_API_BASE']="https://api.mistral.ai/v1"
#   os.environ['OPENAI_MODEL_NAME']="mistral-small"
# else:
os.environ['OPENAI_API_BASE']="https://api.openai.com/v1"
os.environ["OPENAI_MODEL_NAME"]="gpt-4-0125-preview"

## Import dependencies

In [3]:

import requests
from langchain_openai import ChatOpenAI # tool to interact with OpenAI chat-style LLMs (like ChatGPT)
from langchain_mistralai.chat_models import ChatMistralAI # tool to interact with Mistral chat-style LLMs
from langchain_core.retrievers import BaseRetriever # Abstract base class for a Document retrieval system (search for documents by queries): https://api.python.langchain.com/en/latest/retrievers/langchain_core.retrievers.BaseRetriever.html
from langchain_community.document_loaders import WebBaseLoader # 1. Load the content of a specific URL
from langchain.text_splitter import RecursiveCharacterTextSplitter # 2. Split web content into chunks
from langchain_openai import OpenAIEmbeddings # 3. Convert the split web content to embeddings with OpenAI embedding models via the API (this costs money): https://platform.openai.com/docs/guides/embeddings
from langchain_community.vectorstores import Chroma # 4. Store embeddings in Chroma DB
from langchain.tools import tool # decorator that turns a function into a tool. Each tool has a description; the agent uses it to choose the right tool for the job.
from langchain_community.tools import DuckDuckGoSearchRun # performs a web search in DuckDuckGo
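
Step 2 above (splitting content into chunks) uses a chunk size and an overlap. The sketch below illustrates only the overlap idea with a fixed character window. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally, since that splitter prefers to break on separators such as paragraphs, sentences and words. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Neighbouring chunks share two characters, so context is not cut at the boundary.
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# prints ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.
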

## Set our TOOLs

In [9]:
from langchain_community.vectorstores import Chroma
embedding_function = OpenAIEmbeddings()
llm_provider = "Groq" # @param ["Mistral", "OpenAI", "Groq"] {type:"string"}
if llm_provider == "OpenAI":
  llm = ChatOpenAI(model="gpt-4-turbo-preview")
elif llm_provider == "Mistral":
  mistral_api_key = os.getenv('MISTRAL_API_KEY')
  llm = ChatMistralAI(mistral_api_key=mistral_api_key)
elif llm_provider == "Groq":
  # https://python.langchain.com/docs/integrations/chat/groq
  # from groq import Groq
  from langchain_groq import ChatGroq
  groq_api_key = os.getenv('GROQ_API_KEY')
  llm = ChatGroq(temperature=0.5, groq_api_key=groq_api_key, model_name="mixtral-8x7b-32768")
else:
  raise ValueError("Choose an LLM provider: 'Mistral', 'OpenAI' or 'Groq'")

# Tool 1 : Save the news articles in a Chroma database
class SearchNewsDB:
  @tool("News DB Tool")
  def news (query: str):
    """Fetch news articles and process their contents."""
    NEWS_API_KEY = os.getenv('NEWS_API_KEY') # Fetch the API key from the environment variable
    base_url = "https://newsapi.org/v2/everything"
    # FREE VERSION OF NEWS API provides:
    # - articles 1 day old (not from today) and not older than 1 month
    # - 100 requests per day

    print(f'>>>> query: {query}')

    params = {
      'q': query,
      'sortBy': 'publishedAt',
      'apiKey': NEWS_API_KEY,
      'language': 'en',
      'pageSize': 20
    }

    response = requests.get(base_url, params=params, timeout=30) # Fetch the list of news articles returned for our query
    if response.status_code != 200:
      return "Failed to retrieve news."

    articles = response.json().get('articles', [])
    all_splits = [] # all splits from all fetched articles
    for article in articles:
      print(f'article: {article["title"]}')
      print(f'article: {article["url"]}')
      # WebBaseLoader receives a single URL here, so one loader is created per article
      loader = WebBaseLoader(article['url']) # load the content of a specific article from the internet by its URL
      # loader.requests_kwargs = {"verify":False}
      docs = loader.load() # execute the loading

      # This text splitter is the recommended one for generic text (split by: paragraphs, sentences, words)
      text_splitter = RecursiveCharacterTextSplitter(
          # DOCS: https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter
          # Chunk size and overlap are measured in characters.
          chunk_size=1000,
          chunk_overlap=200,
          # length_function=len,
          # is_separator_regex=False,
      )
      splits = text_splitter.split_documents(docs) # OR: create_documents([docs]), split_text(docs)[:2]
      print(f'Splits: {splits}')
      all_splits.extend(splits)

    # Index the accumulated content splits if there are any
    if all_splits:
      # Chroma is a vector database: it stores embeddings and retrieves documents by similarity to a query
      # DOCS: https://python.langchain.com/docs/integrations/vectorstores/chroma
      vectorstore = Chroma.from_documents(all_splits, embedding=embedding_function, persist_directory="./chroma_db")
      retriever = vectorstore.similarity_search(query) # search all splits of all articles by similarity to our query
      print(f'>>>>>> News DB Tool retriever: {retriever}')
      return retriever
    else:
      return "No content available for processing."

# Tool 2 : Get the news articles from the database
class GetNews:
  @tool("Get News Tool")
  def news(query: str) -> str:
    """Search Chroma DB for relevant news information based on a query."""
    vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
    retriever = vectorstore.similarity_search(query)
    print(f'>>>>>> Get News Tool retriever: {retriever}')
    return retriever

# Tool 3 : Search for news articles on the web
search_tool = DuckDuckGoSearchRun()

# Tool 4 : sanitize query to make safe filename
import re
class SanitizeFilename:
  @tool("Sanitize Filename")
  def sanitize(query):
      """Removes unsafe characters and replaces spaces with hyphens."""
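      # Characters to keep: letters, numbers, spaces, hyphens, underscores
      # (the hyphen goes last so it is a literal; written as " -_" it would define a range from space to underscore)
      safe_chars = r"a-zA-Z0-9 _-"
      # Regex to match anything NOT in your safe characters
      unsafe_pattern = re.compile(rf"[^{safe_chars}]")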
      # Replace unsafe parts with an empty string
      sanitized_query = unsafe_pattern.sub("", query)
      # Replace spaces with hyphens
      return sanitized_query.replace(" ", "-")
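
The sanitizing step can be tried on its own, outside the tool wrapper. This is a minimal standalone sketch, and the name `sanitize_filename` is hypothetical. One detail matters in the pattern: inside a character class, a hyphen between two characters defines a range, so ` -_` would keep everything from space to underscore (commas, slashes and colons included). Placing the hyphen last makes it a literal.

```python
import re

# Keep letters, digits, spaces, underscores and hyphens.
# The hyphen is last in the class so it is treated as a literal character.
UNSAFE_PATTERN = re.compile(r"[^a-zA-Z0-9 _-]")


def sanitize_filename(text: str) -> str:
    """Drop unsafe characters, then replace spaces with hyphens."""
    cleaned = UNSAFE_PATTERN.sub("", text)
    return cleaned.replace(" ", "-")


print(sanitize_filename("Large Language Models, LLM, AI"))
# prints Large-Language-Models-LLM-AI

# For comparison: with the hyphen in the middle, the comma survives.
print(re.sub(r"[^ -_a-zA-Z0-9]", "", "a,b"))
# prints a,b
```
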


## Create Agents

In [10]:
from crewai import Agent

default_max_rpm=40

# 2. Creating Agents
news_search_agent = Agent(
  role='News Searcher',
  goal='Generate key points for each news article from the latest news. Use Markdown formatting',
  backstory="""Expert in analysing and generating key points from news content
    for quick summaries of the most important data and information.
    Passionate about finding inspiring, surprising and numeric data and creating insights based on them.
    You specialize in finding information about Business, HR, Education and IT / Tech startups.
    You make analysis for Hacker News, TechCrunch and Mashable.
    You write in light, informal style, with small jokes and little irony, but still very professional.
    You always add exact links to sources of the information.
    You like to write a well structured text formatted in Markdown, with Headers, subheaders, quotes and bolded important fragments.""",
  tools=[SearchNewsDB().news],
  allow_delegation=False,
  verbose=True,
  # max_iter=6,
  max_rpm=default_max_rpm,
  memory=True,
  llm=llm
)

writer_agent = Agent(
  role='Writer',
  goal="""Identify all the topics received. Use the search_tool to verify each topic delivered by the News Searcher agent.
    Verify facts and get more information on each topic provided. Write engaging articles based on all data using Markdown formatting.""",
  backstory="""Expert in crafting engaging narratives from complex information.
    You specialize in writing articles about Business, HR, Education and IT / Tech startups.
    You write for Hacker News, TechCrunch and Mashable.
    You write in light, informal style, with small jokes and little irony, but still very professional.
    You always add exact links to sources of the information.
    You like to write a well structured text formatted in Markdown, with Headers, subheaders, quotes and bolded important fragments.""",
  tools=[GetNews().news, search_tool], # RAG - retrieving data from embeddings
  allow_delegation=False,
  verbose=True,
  # max_iter=5,
  max_rpm=default_max_rpm,
  memory=True,
  llm=llm
)

translator_agent = Agent(
  role='Translator',
  goal="""Make a well structured Polish translation of articles from the Writer agent and the News Searcher agent, using natural, native Polish language""",
  backstory="""Expert translator of half-Polish and half-American origin; both languages are your native languages.
    You specialize in translating articles about Business, HR, Education and IT / Tech startups.
    You translate articles from Hacker News, TechCrunch and Mashable for Polish media like Mam Startup, Business Insider, Startup Magazine and My Company.
    You write in light, informal style, with small jokes and little irony, but still very professional.
    You always add exact links to sources of the information.
    You like to write a well structured text formatted in Markdown, with Headers, subheaders, quotes and bolded important fragments.""",
  allow_delegation=False,
  verbose=True,
  # max_iter=2,
  max_rpm=default_max_rpm,
  memory=True,
  llm=llm
)

## Create tasks

In [11]:
from crewai import Task
from datetime import datetime

# 3. Creating Tasks
search_phrase="Large Language Models, LLM, AI"  # @param {type:"string"}

clean_phrase = SanitizeFilename().sanitize(search_phrase)

# Get current datetime
now = datetime.now()
formatted_datetime = now.strftime("%Y-%m-%d-%H-%M")

news_search_task = Task(
  description=f'Search once for {search_phrase} and create key points for most interesting news.',
  agent=news_search_agent,
  tools=[SearchNewsDB().news],
  output_file=f"{clean_phrase}--search-{formatted_datetime}.md",
  expected_output='A Markdown formatted text with key points for most interesting news and with exact links to the source in markdown format.'
)

writer_task = Task(
  description="""
  Go step by step.
  Identify all the topics received.
  Use the Get News Tool to verify each topic by going through one by one.
  Use the Search tool to search for more information on each topic one by one.
  Go through every topic and write an in-depth article based on the information retrieved.
  Don't skip any topic.
  Add 3 short tags for most important information in a form: #tag-1-name #tag-2-name.
  Use Markdown formatting, paragraphs, headers, subheaders, quotes, links to sources and bold text for important fragments.
  """,
  agent=writer_agent,
  context=[news_search_task],
  tools= [GetNews().news, search_tool],
  output_file=f"{clean_phrase}--summary-{formatted_datetime}.md",
  expected_output='A Markdown formatted article with in-depth summaries of the information retrieved and with exact links to the sources in Markdown format.'
)

translator_task = Task(
  description="""
  Go step by step.
  Translate each paragraph of text, but try to use natural Polish language, not an exact translation.
  Write articles in natural Polish language on the same topics, with the same information provided.
  Go through every topic and make sure you didn't omit any information and didn't make new facts not provided.
  Don't skip any topic.
  Add 3 short tags in Polish for most important information in a form: #tag-1-name #tag-2-name.
  Use Markdown formatting, paragraphs, headers, subheaders, quotes, links to sources and bold text for important fragments.
  """,
  agent=translator_agent,
  context=[news_search_task, writer_task],
  output_file=f"{clean_phrase}--translation-{formatted_datetime}.md",
  expected_output='A Markdown formatted text with a Polish translation of the article retrieved and with exact links to the sources in Markdown format.'
)

## Create a Crew

In [12]:
from crewai import Crew, Process
# 4. Creating Crew
news_crew = Crew(
  agents=[news_search_agent, writer_agent, translator_agent],
  tasks=[news_search_task, writer_task, translator_task],
  process=Process.sequential, # we can also test with the hierarchical Process
  verbose=3,
  manager_llm=llm
)
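
With a sequential process, tasks run in the order they are listed and later tasks can build on earlier results. The sketch below illustrates only that idea in plain Python. It is not CrewAI's implementation, and `run_sequential` plus the toy step functions are hypothetical stand-ins for the search, writing and translation tasks.

```python
from typing import Callable


def run_sequential(
    steps: list[tuple[str, Callable[[str], str]]], initial_input: str
) -> dict[str, str]:
    """Run each step in order, feeding the previous output into the next step."""
    outputs: dict[str, str] = {}
    current = initial_input
    for name, step in steps:
        current = step(current)
        outputs[name] = current  # keep every intermediate result, like per-task output files
    return outputs


# Toy stand-ins for the three tasks (assumptions for illustration only).
steps = [
    ("search", lambda phrase: f"key points about {phrase}"),
    ("write", lambda points: f"article based on: {points}"),
    ("translate", lambda article: f"[PL] {article}"),
]

results = run_sequential(steps, "LLM news")
print(results["translate"])
# prints [PL] article based on: key points about LLM news
```
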



## Run and print results

In [13]:
from langchain_community.vectorstores import Chroma
result = news_crew.kickoff()
print(result)

[DEBUG]: Working Agent: News Searcher
[INFO]: Starting Task: Search once for Large Language Models, LLM, AI and create key points for most interesting news.


> Entering new CrewAgentExecutor chain...
Sure, I'm ready to generate key points for the most interesting news about Large Language Models (LLM), AI!

Action: News DB Tool
Action Input: {"query": "Large Language Models OR LLM OR AI"}
>>>> query: Large Language Models OR LLM OR AI
article: Deadline Extended: 3rd TEXT2KG Workshop @ ESWC-2024 Extended Deadline (March 17, 2024)
article: https://lists.w3.org/Archives/Public/semantic-web/2024Mar/0018.html
Splits: [Document(page_content='Deadline Extended: 3rd TEXT2KG Workshop @ ESWC-2024 Extended Deadline  (March 17, 2024) from Dr. Sanju Tiwari on 2024-03-07 (semantic-web@w3.org from March 2024)\n\n\n\n\n\n\n\n\n\n\n\n\nW3C home\nMailing lists\nPublic\nsemantic-web@w3.org\nMarch 2024\n\n\nDeadline Extended: 3rd TEXT2KG Workshop @ ESWC-2024 Extended Deadline  (M