<a href="https://colab.research.google.com/github/LucasMatuszewski/Python-colab-notebooks/blob/main/CrewAI_Chroma_RAG_NewsAPI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CrewAI RAG with Chroma DB and NewsAPI
Based on [this YouTube video](https://www.youtube.com/watch?v=77xSbC-9yn4)

## Install dependencies

In [1]:
!pip install crewai
!pip install 'crewai[tools]'
!pip install duckduckgo-search langchain-community langchain-openai langchain-mistralai requests chromadb

Collecting crewai
  Downloading crewai-0.19.0-py3-none-any.whl (42 kB)
Collecting instructor<0.6.0,>=0.5.2 (from crewai)
  Downloading instructor-0.5.2-py3-none-any.whl (33 kB)
Collecting langchain<0.2.0,>=0.1.10 (from crewai)
  Downloading langchain-0.1.10-py3-none-any.whl (806 kB)
Collecting langchain-openai<0.0.6,>=0.0.5 (from crewai)
  Downloading langchain_openai-0.0.5-py3-none-any.whl (29 kB)
Collecting openai<2.0.0,>=1.13.3 (from crewai)
  Downloading openai-1.13.3-py3-none-any.whl (227 kB)

## Get API keys and set all ENV vars


In [2]:
import os
from google.colab import userdata

OpenAI_api_key = userdata.get("OpenAIkey")
os.environ['OPENAI_API_KEY'] = OpenAI_api_key
Mistral_api_key = userdata.get("MistralKey")
os.environ['MISTRAL_API_KEY'] = Mistral_api_key
news_api_key = userdata.get('NewsAPI')
os.environ['NEWS_API_KEY'] = news_api_key

# LangChain tracing LLM usage and bugs with LangSmith: https://python.langchain.com/docs/langsmith/walkthrough
LangChain_api_key = userdata.get('LangChain')
os.environ['LANGCHAIN_API_KEY'] = LangChain_api_key
os.environ['LANGCHAIN_TRACING_V2'] = "true"
os.environ["LANGCHAIN_PROJECT"] = "NewsAPI RAG"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

# if secretName == "MistralKey":
#   os.environ['OPENAI_API_BASE']="https://api.mistral.ai/v1"
#   os.environ['OPENAI_MODEL_NAME']="mistral-small"
# else:
os.environ['OPENAI_API_BASE']="https://api.openai.com/v1"
os.environ["OPENAI_MODEL_NAME"]="gpt-4-0125-preview"
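
The cell above assumes every secret is present. A minimal sketch of a fail-fast check, assuming the four key names set above; `missing_env_vars` and `REQUIRED_VARS` are hypothetical names introduced only for this illustration:

```python
import os

# Names taken from the cell above; adjust the list if you use fewer providers.
REQUIRED_VARS = ["OPENAI_API_KEY", "MISTRAL_API_KEY", "NEWS_API_KEY", "LANGCHAIN_API_KEY"]

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after setting the variables surfaces a missing key before any agent makes a paid API call.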

## Import dependencies

In [3]:
import requests
from langchain_openai import ChatOpenAI # client for OpenAI chat-style LLMs (like ChatGPT)
from langchain_mistralai.chat_models import ChatMistralAI # client for Mistral chat-style LLMs
from langchain_core.retrievers import BaseRetriever # Abstract base class for a document retrieval system (search for documents by query): https://api.python.langchain.com/en/latest/retrievers/langchain_core.retrievers.BaseRetriever.html
from langchain_community.document_loaders import WebBaseLoader # 1. Load the content of a specific URL
from langchain.text_splitter import RecursiveCharacterTextSplitter # 2. Split web content into chunks
from langchain_openai import OpenAIEmbeddings # 3. Convert the split web content to embeddings with OpenAI embedding models via the API (this costs money): https://platform.openai.com/docs/guides/embeddings
from langchain_community.vectorstores import Chroma # 4. Store embeddings in Chroma DB
from langchain.tools import tool # decorator that turns a function into a tool. Each tool has a description; the agent uses it to choose the right tool for the job.
from langchain_community.tools import DuckDuckGoSearchRun # performs a web search in DuckDuckGo
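
Step 2 above (splitting into chunks) is easier to reason about with a toy example. This is a simplified sketch of fixed-size character windows with overlap, not the `RecursiveCharacterTextSplitter` implementation, which additionally prefers to break on separators such as paragraphs, lines and words. The function name `split_with_overlap` is hypothetical:

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into character windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which tends to help retrieval at the cost of storing some text twice.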

## Set our TOOLs

In [21]:
embedding_function = OpenAIEmbeddings()
llm_provider = "OpenAI" # @param ["Mistral", "OpenAI"] {type:"string"}
if llm_provider == "OpenAI":
  llm = ChatOpenAI(model="gpt-4-turbo-preview")
elif llm_provider == "Mistral":
  mistral_api_key = os.getenv('MISTRAL_API_KEY')
  llm = ChatMistralAI(mistral_api_key=mistral_api_key)
else:
  raise ValueError("Choose an LLM provider: 'OpenAI' or 'Mistral'") # fail early, otherwise `llm` stays undefined

# Tool 1 : Save the news articles in a Chroma database
class SearchNewsDB:
  @tool("News DB Tool")
  def news(query: str):
    """Fetch news articles and process their contents."""
    NEWS_API_KEY = os.getenv('NEWS_API_KEY') # Fetch API key from environment var
    base_url = "https://newsapi.org/v2/everything"
    # FREE VERSION OF NEWS API provides:
    # - articles 1 day old (not from today) and not older than 1 month
    # - 100 requests per day

    params = {
      'q': query,
      'sortBy': 'publishedAt',
      'apiKey': NEWS_API_KEY,
      'language': 'en',
      'pageSize': 5,
    }

    response = requests.get(base_url, params) # Fetch list of news articles returned for our query
    if response.status_code != 200:
      return "Failed to retrieve news."

    articles = response.json().get('articles', [])
    all_splits = [] # all splits from all fetched articles
    for article in articles:
      # WebBaseLoader also accepts a list of URLs; here we load one article at a time
      loader = WebBaseLoader(article['url']) # load the content of a specific article from the internet by its URL
      docs = loader.load() # execute the loading

      # This text splitter is the recommended one for generic text (split by: paragraphs, sentences, words)
      text_splitter = RecursiveCharacterTextSplitter(
          # DOCS: https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter
          # Chunk size and overlap are measured in characters (length_function defaults to len).
          chunk_size=1000,
          chunk_overlap=200,
          # length_function=len,
          # is_separator_regex=False,
      )
      splits = text_splitter.split_documents(docs) # OR: create_documents([docs]), split_text(docs)[:2]
      all_splits.extend(splits)

    # Index the accumulated content splits if there are any
    if all_splits:
      # Chroma is a vector DB: it stores embeddings and retrieves documents by similarity (unlike a key:value store such as Redis)
      # DOCS: https://python.langchain.com/docs/integrations/vectorstores/chroma
      vectorstore = Chroma.from_documents(all_splits, embedding=embedding_function, persist_directory="./chroma_db")
      results = vectorstore.similarity_search(query) # search all splits of all articles by similarity to our query
      return results
    else:
      return "No content available for processing."

# Tool 2 : Get the news articles from the database
class GetNews:
  @tool("Get News Tool")
  def news(query: str) -> str:
    """Search Chroma DB for relevant news information based on a query."""
    vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
    results = vectorstore.similarity_search(query) # list of the most similar stored chunks
    return results

# Tool 3 : Search for news articles on the web
search_tool = DuckDuckGoSearchRun()
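
Both tools above end with `similarity_search`, which ranks stored chunks by how close their embeddings are to the query embedding. A toy sketch of that idea, assuming word counts as a stand-in for a real embedding model and cosine similarity as the distance; this is an illustration only, not how Chroma or OpenAI embeddings work internally, and all function names are hypothetical:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_search(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "AI tutors personalize employee training",
    "Stock markets closed higher on Friday",
    "eLearning platforms adopt AI for training content",
]
top = similarity_search("AI training for employees", chunks, k=2)
print(top)
```

A real embedding model also matches related wording (for example "employee" and "employees"), which plain word counts cannot do; that is the main reason to pay for the embedding API.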

## Create Agents

In [22]:
from crewai import Agent

default_max_rpm=40

# 2. Creating Agents
news_search_agent = Agent(
  role='News Searcher',
  goal='Generate key points for each news article from the latest news',
  backstory="""Expert in analysing and generating key points from news content
    for quick summaries of the most important data and information.
    Passionate about finding inspiring, surprising and numeric data and creating insights based on them.""",
  tools=[SearchNewsDB().news],
  allow_delegation=True,
  verbose=True,
  # max_iter=5,
  max_rpm=default_max_rpm,
  memory=True,
  llm=llm
)

writer_agent = Agent(
  role='Writer',
  goal='Identify all the topics received. Use the Get News Tool to verify each topic and use search_tool to verify facts and get more information.',
  backstory="""Expert in crafting engaging narratives from complex information.""",
  tools=[GetNews().news, search_tool], # RAG - retrieving data from embeddings
  allow_delegation=True,
  verbose=True,
  # max_iter=5,
  max_rpm=default_max_rpm,
  memory=True,
  llm=llm
)

## Create tasks

In [23]:
from crewai import Task
# 3. Creating Tasks
search_phrase="AI in Training Employees and eLearning 2024"  # @param {type:"string"}

news_search_task = Task(
  description=f'Search for {search_phrase} and create key points for each news article.',
  agent=news_search_agent,
  tools=[SearchNewsDB().news],
  output_file='news-search-key-points.md',
  expected_output='A markdown file with key points for each news article.'
)

writer_task = Task(
  description="""
  Go step by step.
  Identify all the topics received.
  Use the Get News Tool to verify each topic, going through them one by one.
  Use the Search tool to search for information on each topic one by one.
  Go through every topic and write an in-depth summary of the information retrieved.
  Don't skip any topic.
  """,
  agent=writer_agent,
  context=[news_search_task],
  tools= [GetNews().news, search_tool],
  output_file='written-summaries.md',
  expected_output='A markdown file with in-depth summaries of the information retrieved.'
)

## Create a Crew

In [24]:
from crewai import Crew, Process
# 4. Creating Crew
news_crew = Crew(
  agents=[news_search_agent, writer_agent],
  tasks=[news_search_task, writer_task],
  process=Process.sequential, # we can also test with the hierarchical Process
  manager_llm=llm # only needed for the hierarchical process
)
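
With a sequential process the tasks run in list order, and `writer_task` reads the output of `news_search_task` through its `context`. A plain-Python sketch of that data flow, assuming nothing about CrewAI internals; `run_sequential`, `search_step` and `write_step` are hypothetical stand-ins for the crew, the search task and the writer task:

```python
def run_sequential(tasks):
    """Run (name, function) pairs in order; each function receives all earlier outputs."""
    outputs = {}
    for name, func in tasks:
        outputs[name] = func(dict(outputs))  # pass a copy so a step cannot alter earlier results
    return outputs

def search_step(context):
    # Stand-in for the search task: returns the topics it found.
    return ["topic A", "topic B"]

def write_step(context):
    # Stand-in for the writer task: summarizes every topic from the search step.
    topics = context["search"]
    return "\n".join(f"Summary of {topic}" for topic in topics)

results = run_sequential([("search", search_step), ("write", write_step)])
print(results["write"])
```

Swapping the order of the two steps would make `write_step` fail with a `KeyError`, which mirrors why the search task has to come first in the `tasks` list.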



## Run and print results

In [25]:
result = news_crew.kickoff()
print(result)



> Entering new CrewAgentExecutor chain...
Thought: To begin, I need to gather recent news articles related to "AI in Training Employees and eLearning 2024" to understand the latest trends, innovations, and data in this area. This will involve using the News DB Tool to fetch relevant articles.

Action: News DB Tool
Action Input: {"query": "AI in Training Employees and eLearning 2024"}

[Document(page_content='February 8, 2024                                                                                                                        \n\n                                        8 minutes to read                                    \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n+2\n                            \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n                                            Comments                                        \n\n\n\n\n\n\n\n\n\n\n                                Summary:                            \n                            eLearnin