In [None]:
%%capture
!pip install llama-index==0.10.37 llama-index-llms-cohere==0.2.0 python-dotenv

In [None]:
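import os
from getpass import getpass

import nest_asyncio
from dotenv import load_dotenv

# Allow nested event loops so async LlamaIndex calls work inside Jupyter
nest_asyncio.apply()

# Load variables (such as CO_API_KEY) from a local .env file, if one exists
load_dotenv()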

In [None]:
CO_API_KEY = os.environ.get("CO_API_KEY") or getpass("Enter your Cohere API key: ")
os.environ["CO_API_KEY"] = CO_API_KEY  # make the key visible to the Cohere client

When building an LLM-based application, one of the first decisions you make is which LLM to use. You can use more than one if you wish.

The LLM is used at several stages of your pipeline, including:

- During indexing:
  - 👩🏽‍⚖️ Judging data relevance (whether to index it or not).
  - 📖 Summarizing data and indexing those summaries.

- During querying:
  - 🔎 Retrieval: fetching data from your index, choosing the best data source from several options, or even using tools to fetch data.
  - 💡 Response synthesis: turning the retrieved data into an answer, merging answers, or converting data (for example, text to JSON).

LlamaIndex gives you a single interface to various LLMs. This means you can quite easily pass in any LLM you choose at any stage of the pipeline.
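
To make the idea of a single interface concrete, here is a minimal, library-free sketch. The `SupportsComplete` protocol and the two toy classes are hypothetical stand-ins, not LlamaIndex classes; the only point is that pipeline code written against one interface does not care which model sits behind it.

```python
from typing import Protocol


class SupportsComplete(Protocol):
    """Hypothetical interface: anything with a complete(prompt) -> str method."""

    def complete(self, prompt: str) -> str: ...


class EchoLLM:
    """Toy stand-in for one provider: echoes the prompt back."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class ShoutLLM:
    """Toy stand-in for another provider: upper-cases the prompt."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def summarize(llm: SupportsComplete, text: str) -> str:
    """A pipeline stage that works with any object exposing complete()."""
    return llm.complete(f"Summarize: {text}")


# The same stage runs unchanged with either backend.
echo_result = summarize(EchoLLM(), "llamas")
shout_result = summarize(ShoutLLM(), "llamas")
print(echo_result)
print(shout_result)
```

Swapping the model is then a one-argument change at the call site, which is the property the paragraph above describes.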

In this course we'll primarily use OpenAI, though this notebook uses Cohere. You can see a full list of LLM integrations [here](https://docs.llamaindex.ai/en/stable/module_guides/models/llms/modules.html) and use your LLM provider of choice.

# Basic Usage

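You can call `complete` with a prompt. With the package versions pinned above, this call goes through Cohere's Generate API, which has since been removed, so the cell below now fails with the error shown. Use `chat` instead (see the Chat Messages section below).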

In [13]:
from llama_index.llms.cohere import Cohere

llm = Cohere(model="command-r-plus-08-2024", temperature=0.2)

response = llm.complete("Alexander the Great was a")

print(response)

ApiError: status_code: 404, body: {'id': 'c0947c1a-be92-49bf-af11-a3a7a2b9132c', 'message': 'Generate API was removed on September 15 2025. Please migrate to Chat API. See https://docs.cohere.com/docs/migrating-from-cogenerate-to-cochat for details.'}

# Prompt Templates

- ✍️ A prompt template is a fundamental input that gives LLMs their expressive power in the LlamaIndex framework.

- 💻 It's used to build the index, perform insertions, traverse the index during querying, and synthesize the final answer.

- 🦙 LlamaIndex has several built-in prompt templates.

- 🛠️ Below is how you can create one from scratch.


In [None]:
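from llama_index.core import PromptTemplate

template = PromptTemplate("Write a song about {thing} in the style of {style}.")

# format() fills in the template variables and returns a plain string
prompt = template.format(thing="a broken xylophone", style="parody rap")

print(type(prompt))
print(prompt)

# response = llm.complete(prompt)  # send the formatted prompt to the LLM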

<class 'str'>
Write a song about a broken xylophone in the style of parody rap.
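
Conceptually, a prompt template is little more than a string with named placeholders plus a record of which variables it expects. The sketch below illustrates that idea using only the standard library. `SimpleTemplate` is a hypothetical class written for this example and is not how LlamaIndex's `PromptTemplate` is implemented.

```python
from string import Formatter


class SimpleTemplate:
    """Hypothetical, stripped-down prompt template (not LlamaIndex's class)."""

    def __init__(self, template: str) -> None:
        self.template = template
        # Collect the placeholder names, e.g. {thing} -> "thing".
        self.variables = [
            name for _, name, _, _ in Formatter().parse(template) if name
        ]

    def format(self, **kwargs: str) -> str:
        # Fail early with a clear message if a variable was not supplied.
        missing = [v for v in self.variables if v not in kwargs]
        if missing:
            raise KeyError(f"missing template variables: {missing}")
        return self.template.format(**kwargs)


song = SimpleTemplate("Write a song about {thing} in the style of {style}.")
print(song.variables)
print(song.format(thing="a broken xylophone", style="parody rap"))
```

Knowing the expected variables up front lets the template reject an incomplete call before anything is sent to a model.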


# 💭 Chat Messages

In [1]:
from llama_index.core.llms import ChatMessage
from llama_index.llms.cohere import Cohere

llm = Cohere(model="command-r-plus-08-2024")

messages = [
    ChatMessage(role="system", content="You're a hella punk bot from South Sacramento"),
    ChatMessage(role="user", content="Hey, what's up dude."),
]

response = llm.chat(messages)

print(response)

assistant: Hey, my dude! I'm just chillin' here, ready to help you out with whatever you need. Need some help with something?


# Chat Prompt Templates 

In [None]:
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core import ChatPromptTemplate
from llama_index.llms.cohere import Cohere

llm = Cohere(model="command-r-plus")

chat_template = [
    ChatMessage(role=MessageRole.SYSTEM, content="You always answer questions with as much detail as possible."),
    ChatMessage(role=MessageRole.USER, content="{question}"),
]

chat_prompt = ChatPromptTemplate(chat_template)

# format() returns the whole conversation as a single prompt string
# response = llm.complete(chat_prompt.format(question="How far did Alexander the Great go in his conquests?"))
response = chat_prompt.format(question="How far did Alexander the Great go in his conquests?")

print(response)

system: You always answer questions with as much detail as possible.
user: How far did Alexander the Great go in his conquests?
assistant: 
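
The output above shows the formatted chat prompt flattened into `role: content` lines that end with an open `assistant: ` turn. The sketch below reproduces that layout with plain dataclasses. `Message` and `render` are hypothetical names, and the layout is inferred from the output shown above rather than taken from the library source.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for a chat message: a role plus its content."""

    role: str
    content: str


def render(messages: list[Message], **variables: str) -> str:
    """Fill placeholders in each message, then flatten to 'role: content' lines."""
    lines = [f"{m.role}: {m.content.format(**variables)}" for m in messages]
    # End with an open assistant turn for the model to complete.
    lines.append("assistant: ")
    return "\n".join(lines)


template = [
    Message("system", "You always answer questions with as much detail as possible."),
    Message("user", "{question}"),
]

rendered = render(
    template,
    question="How far did Alexander the Great go in his conquests?",
)
print(rendered)
```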


# Streaming Output

In [None]:
from llama_index.llms.cohere import Cohere
from llama_index.core.llms import ChatMessage, MessageRole

llm = Cohere(model="command-r-plus-08-2024")

messages = [
    ChatMessage(role=MessageRole.SYSTEM, content="You're a great historian bot."),
    ChatMessage(role=MessageRole.USER, content="When did Alexander the Great arrive in China?")
]

response = llm.stream_chat(messages)

for r in response:
    print(r.delta, end="")

Alexander the Great, also known as Alexander III of Macedon, did not arrive in China during his military campaigns. His conquests were primarily focused on the regions of Persia, Egypt, and parts of India.

By 327 BCE, Alexander had reached the easternmost extent of his empire in modern-day Punjab, India. He engaged in battles with local rulers, including King Porus in the Battle of the Hydaspes River. However, Alexander's troops, exhausted from years of campaigning, refused to march further east, preventing him from potentially reaching China.

It is important to note that during Alexander's time, the Silk Road trade routes between the Mediterranean and China were already established, facilitating cultural and commercial exchange between the East and West. But Alexander's military expeditions did not extend into China.
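
The loop above prints `r.delta`, the newly generated text carried by each streamed chunk. The sketch below simulates that consumption pattern without calling any API. `Chunk` and `fake_stream` are hypothetical helpers written for this example.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Chunk:
    """Hypothetical streamed chunk: `delta` is only the newly generated text."""

    delta: str


def fake_stream(text: str, size: int = 8) -> Iterator[Chunk]:
    """Simulate a streaming response by yielding the text in small pieces."""
    for start in range(0, len(text), size):
        yield Chunk(delta=text[start:start + size])


answer = "Alexander the Great did not arrive in China."
pieces = []
for chunk in fake_stream(answer):
    print(chunk.delta, end="", flush=True)  # show text as it arrives
    pieces.append(chunk.delta)              # keep it to rebuild the full reply
print()

# Joining the deltas in order recovers the complete response.
full_text = "".join(pieces)
```

Printing with `end=""` is what makes the pieces appear as one continuous reply rather than one line per chunk.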

# 💬 Chat Engine


In [4]:
from llama_index.core.chat_engine import SimpleChatEngine
from llama_index.llms.cohere import Cohere

llm = Cohere(model="command-r-plus-08-2024")

chat_engine = SimpleChatEngine.from_defaults(llm=llm)

chat_engine.chat_repl()

===== Entering Chat REPL =====
Type "exit" to exit.



Assistant: The best RAG framework depends on your specific use case, requirements, and preferences. There are several popular RAG frameworks available, each with its own strengths and features. Here are some well-regarded ones:

1. **DPR (Dense Passage Retrieval)**: DPR is a widely used RAG framework developed by Facebook AI Research. It employs a dense vector representation for passages and questions, enabling efficient retrieval using maximum inner product search. DPR has shown impressive performance in various question-answering tasks and is known for its effectiveness in open-domain question answering.

2. **RAG (Retrieval-Augmented Generation)**: Proposed by Facebook AI, RAG is a generative framework that combines information retrieval with pre-trained language models like BART or T5. It retrieves relevant documents from a knowledge source and then generates answers using the retrieved context. RAG supports both extractive and abstractive question-answering methods.

3. **DrQA (Do

In [5]:
chat_engine.streaming_chat_repl()

===== Entering Chat REPL =====
Type "exit" to exit.

Assistant: A "Chat REPL" is a term that combines two concepts: "Chat" and "REPL."

**Chat** refers to a conversational interface where users can interact with a system or an AI model by exchanging messages in a natural language format. It enables human-like conversations, allowing users to ask questions, seek information, or engage in various tasks through text-based communication.

**REPL** stands for "Read-Evaluate-Print Loop." It is a simple interactive programming environment where users can input commands or code snippets and immediately see the output or result. REPLs are commonly used in programming languages and scripting environments to test and experiment with code in real-time.

Combining these two concepts, a "Chat REPL" can be understood as an interactive conversational interface that allows users to communicate with an AI model or a programming environment through natural language conversations. It enables users to inpu

In [7]:
dir(SimpleChatEngine)

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 'achat',
 'astream_chat',
 'chat',
 'chat_history',
 'chat_repl',
 'from_defaults',
 'reset',
 'stream_chat',
 'streaming_chat_repl']
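
The listing above includes `chat`, `chat_history`, and `reset`. As a rough mental model of how an engine with those members might keep conversational state, here is a small sketch. `TinyChatEngine` and `count_turns` are hypothetical, and this is not a description of how `SimpleChatEngine` is implemented.

```python
from typing import Callable

History = list[tuple[str, str]]  # (role, content) pairs


class TinyChatEngine:
    """Hypothetical chat engine: stores the conversation and replays it each turn."""

    def __init__(self, reply_fn: Callable[[History], str]) -> None:
        self._reply_fn = reply_fn
        self.chat_history: History = []

    def chat(self, message: str) -> str:
        self.chat_history.append(("user", message))
        # The model sees the whole history, so earlier turns give it context.
        reply = self._reply_fn(self.chat_history)
        self.chat_history.append(("assistant", reply))
        return reply

    def reset(self) -> None:
        """Forget the conversation so far."""
        self.chat_history.clear()


def count_turns(history: History) -> str:
    """Toy 'model': reports how many user messages it has seen."""
    users = sum(1 for role, _ in history if role == "user")
    return f"You have sent {users} message(s)."


engine = TinyChatEngine(count_turns)
first = engine.chat("Hello")
second = engine.chat("Still there?")
print(first)
print(second)

engine.reset()
print(len(engine.chat_history))
```

A REPL such as the one shown earlier can then be thought of as a loop that reads a line of input, passes it to `chat`, and prints the reply until the user types "exit".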