In [7]:
import os
import requests
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display

In [8]:
load_dotenv()
api_key = os.getenv("GEMINI_API_KEY")


if not api_key:
    raise ValueError("API key not found. Please set the GEMINI_API_KEY environment variable.")
elif api_key.strip() != api_key:
    # Check for stray whitespace first; otherwise a leading space would fail the prefix check below
    print("An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook")
elif not api_key.startswith("AIza"):
    raise ValueError("Invalid API key format. Please check your GEMINI_API_KEY.")
else:
    print("API key found and looks good so far!")

API key found and looks good so far!


In [10]:

tell_a_joke = [
    {"role": "user", "content": "Tell a joke for a student on the journey to becoming an expert in LLM Engineering"},
]

In [9]:
GEMINI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"

gemini = OpenAI(base_url=GEMINI_BASE_URL, api_key=api_key)

response = gemini.chat.completions.create(
    model="gemini-2.5-flash",
    messages=tell_a_joke,
)

display(Markdown(response.choices[0].message.content))

Why did the budding LLM Engineer spend three days trying to get their model to say "Hello World"?

Because it kept returning:

"As an advanced conversational AI trained on a vast corpus of internet text, I am unable to physically manifest or directly interact with a 'world' in the human sense. However, I can generate the textual representation of the common introductory phrase 'Hello, world.' Would you like me to proceed with this, or perhaps explore the philosophical implications of AI consciousness in greeting existence?"

...and they just needed it to be concise.

In [10]:
easy_puzzle = [
    {"role": "user", "content": 
        "You toss 2 coins. One of them is heads. What's the probability the other is tails? Answer with the probability only."},
]

In [15]:
response = gemini.chat.completions.create(
    model="gemini-2.5-flash",
    messages=easy_puzzle,
    reasoning_effort="medium"
)
display(Markdown(response.choices[0].message.content))

Let's list all possible outcomes when tossing two coins. We can represent them as pairs, where the first letter is the result of the first coin and the second letter is the result of the second coin:
1. HH (Heads, Heads)
2. HT (Heads, Tails)
3. TH (Tails, Heads)
4. TT (Tails, Tails)

There are 4 equally likely outcomes.

Now, let's consider the given condition: "One of them is heads."
This usually means "at least one of the coins is heads." Let's identify the outcomes that satisfy this condition:
*   HH: Yes, at least one coin is heads.
*   HT: Yes, at least one coin is heads.
*   TH: Yes, at least one coin is heads.
*   TT: No, neither coin is heads.

So, the possible outcomes, given that "one of them is heads," are {HH, HT, TH}. This forms our reduced sample space, and each of these 3 outcomes is equally likely.

Next, we need to find the probability that "the other is tails" within this reduced sample space. Let's examine each outcome:

*   **HH**: If one of them is heads (say, the first coin), the other coin (the second coin) is also heads. So, in this case, the other is NOT tails.
*   **HT**: If one of them is heads (the first coin), the other coin (the second coin) IS tails. This satisfies the condition.
*   **TH**: If one of them is heads (the second coin), the other coin (the first coin) IS tails. This satisfies the condition.

So, out of the 3 possible outcomes (HH, HT, TH), 2 of them (HT, TH) result in "the other" coin being tails.

The probability is the number of favorable outcomes divided by the total number of outcomes in the reduced sample space:
Probability = (Number of outcomes where the other is tails) / (Total number of outcomes where at least one is heads)
Probability = 2 / 3

The final answer is $\boxed{\frac{2}{3}}$.
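The 2/3 result can be checked by brute-force enumeration. The sketch below assumes, as the response does, that "one of them is heads" means "at least one coin is heads" and that both coins are fair.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of tossing two fair coins: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# Condition on "at least one coin is heads".
at_least_one_head = [o for o in outcomes if "H" in o]

# Within that reduced sample space, "the other is tails" means the pair contains a tail.
other_is_tails = [o for o in at_least_one_head if "T" in o]

probability = Fraction(len(other_is_tails), len(at_least_one_head))
print(probability)  # 2/3
```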

In [16]:
hard = """
On a bookshelf, two volumes of Pushkin stand side by side: the first and the second.
The pages of each volume together have a thickness of 2 cm, and each cover is 2 mm thick.
A worm gnawed (perpendicular to the pages) from the first page of the first volume to the last page of the second volume.
What distance did it gnaw through?
"""
hard_puzzle = [
    {"role": "user", "content": hard}
]

## Training vs inference-time scaling

In [17]:
response = gemini.chat.completions.create(
    model="gemini-2.5-flash",  
    messages=hard_puzzle,
    reasoning_effort="high"
)
display(Markdown(response.choices[0].message.content))

This is a classic riddle that plays on how we visualize books on a shelf versus their internal structure.

Here's the trick:

1.  **Book Orientation on a Shelf:** When books are placed side by side on a shelf, their spines typically face outwards (towards you).
    *   **Volume 1 (on the left):** Its **front cover** is on the far left. Its **back cover** is on the right, touching Volume 2.
    *   **Volume 2 (on the right):** Its **front cover** is on the left, touching Volume 1. Its **back cover** is on the far right.

2.  **Worm's Path:**
    *   The worm starts "from the **first page** of the first volume." The first page of Volume 1 is just *inside* its front cover. If the worm is gnawing from left to right (from Volume 1 to Volume 2), it's already *past* the front cover and the entire block of pages of Volume 1. It only needs to gnaw through the **back cover of Volume 1** to exit that book and enter the next.
    *   The worm ends "to the **last page** of the second volume." The last page of Volume 2 is just *inside* its back cover. When the worm enters Volume 2 from Volume 1, it first encounters Volume 2's **front cover**. After gnawing through that, it immediately reaches the "last page" of Volume 2, so it does not gnaw through the pages of Volume 2 itself.

Therefore, the worm only gnaws through the two *inner* covers:
*   The back cover of Volume 1.
*   The front cover of Volume 2.

**Calculation:**
*   Thickness of one cover = 2 mm
*   Distance gnawed = Thickness of back cover (V1) + Thickness of front cover (V2)
*   Distance = 2 mm + 2 mm = 4 mm

The pages' thickness is irrelevant to this specific path due to the starting and ending points relative to the books' orientation.

The distance the worm gnawed through is **4 mm**.
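The arithmetic can be laid out as a list of layers. This sketch assumes the usual shelf arrangement: volume 1 on the left, spines facing outward, so each book's front cover is on its right-hand side and the first page of volume 1 sits next to volume 2. It labels the two inner covers differently from the response above, but the distance is the same.

```python
# Layers from left to right on the shelf, with thickness in millimetres.
# Assumed layout: spines face outward, so each front cover is on the right of its book.
layers = [
    ("v1 back cover", 2),
    ("v1 pages", 20),
    ("v1 front cover", 2),
    ("v2 back cover", 2),
    ("v2 pages", 20),
    ("v2 front cover", 2),
]
names = [name for name, _ in layers]

# The first page of volume 1 is the right-hand face of its page block,
# and the last page of volume 2 is the left-hand face of its page block.
start = names.index("v1 pages") + 1
end = names.index("v2 pages")

distance_mm = sum(thickness for _, thickness in layers[start:end])
print(distance_mm)  # 4
```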

In [18]:
requests.get("http://localhost:11434/").content

b'Ollama is running'

In [20]:
easy_puzzle = [
    {"role": "user", "content": 
        "You toss 2 coins. One of them is heads. What's the probability the other is tails? Answer with the probability only."},
]

In [24]:
ollama = OpenAI(base_url="http://localhost:11434/v1/", api_key="ollama")
response = ollama.chat.completions.create(model="llama3.2:latest", messages=easy_puzzle)
display(Markdown(response.choices[0].message.content))

1/2
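Both Gemini and Ollama were reached through the same `OpenAI` client class, with only the base URL and key changing. A minimal sketch of that pattern follows; the `ENDPOINTS` table and the `client_kwargs` helper are hypothetical names introduced here, and the URLs are the ones used in the cells above.

```python
import os

# Base URLs copied from the cells above; the table layout itself is illustrative.
ENDPOINTS = {
    "gemini": {
        "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
        "api_key_env": "GEMINI_API_KEY",
    },
    "ollama": {
        "base_url": "http://localhost:11434/v1/",
        "api_key_env": None,  # local server: a placeholder key is passed instead
    },
}

def client_kwargs(provider, env=os.environ):
    """Return the keyword arguments for an OpenAI-compatible client (hypothetical helper)."""
    cfg = ENDPOINTS[provider]
    if cfg["api_key_env"] is None:
        api_key = "ollama"  # same placeholder value as in the cell above
    else:
        api_key = env.get(cfg["api_key_env"], "")
    return {"base_url": cfg["base_url"], "api_key": api_key}

print(client_kwargs("ollama"))
```

With the `openai` package imported as above, a client could then be built with `OpenAI(**client_kwargs("gemini"))`.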

In [None]:
from google import genai

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Describe the color Blue to someone who's never been able to see in 1 sentence"
)

print(response.text)

## Routers and Abstraction Layers


#### OpenRouter

In [None]:
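# The client was not defined earlier in this notebook; this assumes OPENROUTER_API_KEY is set in .env
openrouter = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=os.getenv("OPENROUTER_API_KEY"))
response = openrouter.chat.completions.create(model="z-ai/glm-4.5", messages=tell_a_joke)
display(Markdown(response.choices[0].message.content))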

In [2]:
!pip install langchain-google-genai

Collecting langchain-google-genai
  Downloading langchain_google_genai-4.0.0-py3-none-any.whl.metadata (2.7 kB)
Collecting filetype<2.0.0,>=1.2.0 (from langchain-google-genai)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting google-genai<2.0.0,>=1.53.0 (from langchain-google-genai)
  Downloading google_genai-1.55.0-py3-none-any.whl.metadata (47 kB)
Downloading langchain_google_genai-4.0.0-py3-none-any.whl (63 kB)
Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Downloading google_genai-1.55.0-py3-none-any.whl (703 kB)
Installing collected packages: filetype, google-genai, langchain-go

#### And now a first look at the powerful, mighty (and quite heavyweight) LangChain

In [23]:
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", api_key=api_key)
response = llm.invoke(tell_a_joke)

print(response)

ChatGoogleGenerativeAIError: Error calling model 'gemini-2.5-flash' (RESOURCE_EXHAUSTED): 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. To monitor your current usage, head to: https://ai.dev/usage?tab=rate-limit. \n* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 20, model: gemini-2.5-flash\nPlease retry in 13.474079746s.', 'status': 'RESOURCE_EXHAUSTED', 'details': [{'@type': 'type.googleapis.com/google.rpc.Help', 'links': [{'description': 'Learn more about Gemini API quotas', 'url': 'https://ai.google.dev/gemini-api/docs/rate-limits'}]}, {'@type': 'type.googleapis.com/google.rpc.QuotaFailure', 'violations': [{'quotaMetric': 'generativelanguage.googleapis.com/generate_content_free_tier_requests', 'quotaId': 'GenerateRequestsPerDayPerProjectPerModel-FreeTier', 'quotaDimensions': {'location': 'global', 'model': 'gemini-2.5-flash'}, 'quotaValue': '20'}]}, {'@type': 'type.googleapis.com/google.rpc.RetryInfo', 'retryDelay': '13s'}]}}
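The error message carries a retry hint ("Please retry in 13.474079746s"). One way to act on it is a small retry wrapper, sketched below. The helper names `parse_retry_delay` and `call_with_retry` are hypothetical, and matching on the error text is an assumption rather than a documented interface. Note also that the quota named in the message is a per-day free-tier limit, so waiting a few seconds may not be enough once the daily allowance is used up.

```python
import re
import time

def parse_retry_delay(message, default=5.0):
    """Pull the 'retry in 13.47s' hint out of an error message, in seconds."""
    match = re.search(r"retry in ([0-9]+(?:\.[0-9]+)?)s", message)
    return float(match.group(1)) if match else default

def call_with_retry(fn, max_attempts=3, sleep=time.sleep):
    """Call fn(), retrying on quota errors and waiting for the hinted delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not a quota error, or the final failed attempt.
            if "RESOURCE_EXHAUSTED" not in str(exc) or attempt == max_attempts:
                raise
            sleep(parse_retry_delay(str(exc)))

print(parse_retry_delay("Please retry in 13.474079746s."))
```

Usage would look like `response = call_with_retry(lambda: llm.invoke(tell_a_joke))`.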

#### Finally - my personal fave - the wonderfully lightweight LiteLLM

In [12]:
!pip install litellm

Collecting litellm
  Downloading litellm-1.80.10-py3-none-any.whl.metadata (30 kB)
Collecting fastuuid>=0.13.0 (from litellm)
  Downloading fastuuid-0.14.0-cp312-cp312-win_amd64.whl.metadata (1.1 kB)
Collecting grpcio<1.68.0,>=1.62.3 (from litellm)
  Downloading grpcio-1.67.1-cp312-cp312-win_amd64.whl.metadata (4.0 kB)
Downloading litellm-1.80.10-py3-none-any.whl (11.3 MB)

In [17]:
from litellm import completion
response = completion(
    model="gemini/gemini-2.5-flash",
    messages=tell_a_joke
)

display(Markdown(response.choices[0].message.content))

  PydanticSerializationUnexpectedValue(Expected 10 fields but got 7: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='Why did ...er_specific_fields=None), input_type=Message])
  PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
  return self.__pydantic_serializer__.to_python(


Why did the LLM engineer break up with their chatbot?

Because after weeks of meticulous prompt engineering, fine-tuning, and RAG implementation, it still confidently told them that a cat is a species of fish and then asked for a raise.

In [18]:
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"Total cost: {response._hidden_params['response_cost']*100:.4f} cents")

Input tokens: 18
Output tokens: 1456
Total tokens: 1474
Total cost: 0.3645 cents
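The cost figure can be reproduced by hand from the token counts. The per-million-token prices below are assumptions, chosen because they reproduce the number printed above; check the provider's current price list before relying on them.

```python
# Assumed prices in USD per million tokens (not taken from the notebook).
INPUT_PRICE_PER_M = 0.30
OUTPUT_PRICE_PER_M = 2.50

def cost_in_cents(prompt_tokens, completion_tokens):
    """Price input and output tokens separately, then convert dollars to cents."""
    dollars = (
        prompt_tokens * INPUT_PRICE_PER_M
        + completion_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000
    return dollars * 100

print(f"{cost_in_cents(18, 1456):.4f} cents")  # 0.3645 cents
```

Almost all of the cost comes from the 1456 output tokens; the 18 input tokens contribute about 0.0005 cents.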


## Now let's use LiteLLM to illustrate a pro feature: prompt caching

In [19]:
with open("hamlet.txt", "r", encoding="utf-8") as f:
    hamlet = f.read()

loc = hamlet.find("Speak, man")
print(hamlet[loc:loc+100])

Speak, man.
  Laer. Where is my father?
  King. Dead.
  Queen. But not by him!
  King. Let him deman


In [20]:
question = [{"role": "user", "content": "In Hamlet, when Laertes asks 'Where is my father?' what is the reply?"}]

In [21]:
response = completion(model="gemini/gemini-2.5-flash-lite", messages=question)
display(Markdown(response.choices[0].message.content))

  PydanticSerializationUnexpectedValue(Expected 10 fields but got 7: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='When Lae...er_specific_fields=None), input_type=Message])
  PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
  return self.__pydantic_serializer__.to_python(


When Laertes asks, "Where is my father?" in Hamlet, the reply he receives is:

**"Dead."**

This is delivered by Gertrude, and it's a stark and devastating announcement that Laertes is not prepared for.
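The excerpt printed from `hamlet.txt` attributes "Dead." to the King, so the attribution to Gertrude above is wrong. One way to ground the answer is to put the play's text into the prompt. The sketch below is an assumption about how that might be done: `build_grounded_messages` is a hypothetical helper, and placing the large, unchanging text first is a choice made so that repeated questions share the same prefix. Whether and how a provider caches that prefix should be checked in its documentation.

```python
def build_grounded_messages(reference_text, question_text):
    """Build a single user message with the reference text first and the question last."""
    content = (
        "Here is the full text of the play:\n\n"
        f"{reference_text}\n\n"
        f"Question: {question_text}"
    )
    return [{"role": "user", "content": content}]

# Tiny stand-in for the real file contents, taken from the excerpt printed earlier.
sample = "  Laer. Where is my father?\n  King. Dead."
messages = build_grounded_messages(sample, "Who replies to Laertes?")
print(messages[0]["content"])
```

With the real file loaded, the call would be `completion(model="gemini/gemini-2.5-flash-lite", messages=build_grounded_messages(hamlet, question[0]["content"]))`.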