# LLM Reasoning

## Made by: Wilfredo Aaron Sosa Ramos

This notebook compares how LLMs from different Generative AI providers perform on three examples that can show issues with LLM reasoning:

* [The Reversal Curse](https://github.com/lukasberglund/reversal_curse) shows that LLMs trained on "A is B" fail to learn "B is A".
* [How many r's in the word strawberry?](https://x.com/karpathy/status/1816637781659254908) shows "the weirdness of LLM Tokenization".  
* [Which number is bigger, 9.11 or 9.9?](https://x.com/DrJimFan/status/1816521330298356181) shows that "LLMs are alien beasts."

Resource: https://github.com/andrewyng/aisuite/blob/main/examples/llm_reasoning.ipynb

In [None]:
!pip install aisuite 'aisuite[all]'

Collecting aisuite
  Downloading aisuite-0.1.6-py3-none-any.whl.metadata (5.7 kB)
Collecting anthropic<0.31.0,>=0.30.1 (from aisuite[all])
  Downloading anthropic-0.30.1-py3-none-any.whl.metadata (18 kB)
Collecting groq<0.10.0,>=0.9.0 (from aisuite[all])
  Downloading groq-0.9.0-py3-none-any.whl.metadata (13 kB)
Downloading aisuite-0.1.6-py3-none-any.whl (20 kB)
Downloading anthropic-0.30.1-py3-none-any.whl (863 kB)
Downloading groq-0.9.0-py3-none-any.whl (103 kB)
Installing collected packages: aisuite, groq, anthropic
Successfully installed aisuite-0.1.6 anthropic-0.30.1 groq-0.9.0


In [None]:
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [7]:
%%capture
!pip install openai==1.55.3 httpx==0.27.2 --force-reinstall --quiet

In [5]:
!pip install vertexai

Collecting vertexai
  Downloading vertexai-1.71.1-py3-none-any.whl.metadata (10 kB)
Collecting google-cloud-aiplatform==1.71.1 (from google-cloud-aiplatform[all]==1.71.1->vertexai)
  Downloading google_cloud_aiplatform-1.71.1-py2.py3-none-any.whl.metadata (32 kB)
Downloading vertexai-1.71.1-py3-none-any.whl (7.3 kB)
Downloading google_cloud_aiplatform-1.71.1-py2.py3-none-any.whl (6.2 MB)
Installing collected packages: google-cloud-aiplatform, vertexai
  Attempting uninstall: google-cloud-aiplatform
    Found existing installation: google-cloud-aiplatform 1.73.0
    Uninstalling google-cloud-aiplatform-1.73.0:
      Successfully uninstalled google-cloud-aiplatform-1.73.0
Successfully installed google-cloud-aiplatform-1.71.1 vertexai-1.71.1


In [None]:
#import os
#os.kill(os.getpid(), 9)

In [18]:
import os
from google.colab import userdata  # Colab secrets manager

# Vertex AI (Gemini) settings: service-account key file, region, and project ID.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/content/authentication.json"
os.environ["GOOGLE_REGION"] = "us-east1"
os.environ["GOOGLE_PROJECT_ID"] = "emerald-lattice-424023-k3"

# API keys are read from Colab secrets rather than hard-coded in the notebook.
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

os.environ["HUGGINGFACE_TOKEN"] = userdata.get("HF_TOKEN")
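
The cell above only works inside Colab, because `google.colab` is not importable elsewhere. A minimal sketch of a fallback follows; `get_secret` is a hypothetical helper, not part of aisuite or Colab, and it assumes the same secret names are also exported as ordinary environment variables when running outside Colab.

```python
import os


def get_secret(name, default=None):
    """Return a secret from Colab's userdata if available, else from os.environ.

    Hypothetical helper for illustration only.
    """
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name, default)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the environment.
        return os.environ.get(name, default)


# Example usage (assumes the secret exists in one of the two places):
# os.environ["OPENAI_API_KEY"] = get_secret("OPENAI_API_KEY")
```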

## Specify LLMs to Compare


In [19]:
import aisuite as ai

client = ai.Client()

In [20]:
import time

llms = [
        "huggingface:mistralai/Mistral-7B-Instruct-v0.3",
        "openai:gpt-4o-mini",
        "google:gemini-1.5-flash"
       ]

def compare_llm(messages):
    """Send the same messages to every model in `llms`; return responses and timings."""
    execution_times = []
    responses = []
    for llm in llms:
        # perf_counter is a monotonic clock, better suited to timing than time.time()
        start_time = time.perf_counter()
        response = client.chat.completions.create(model=llm, messages=messages)
        execution_time = time.perf_counter() - start_time
        content = response.choices[0].message.content.strip()
        responses.append(content)
        execution_times.append(execution_time)
        print(f"{llm} - {execution_time:.2f} seconds: {content}")
    return responses, execution_times
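
With `compare_llm` as written, one failing provider (an expired key, a rate limit) aborts the whole comparison. The sketch below is one way to record the error and move on. `compare_llm_safe` and the `FakeCompletions` stand-in are hypothetical names introduced here so the loop can be exercised without API keys; the only aisuite-style call assumed is `client.chat.completions.create(model=..., messages=...)`, as used above.

```python
import time
from types import SimpleNamespace


def compare_llm_safe(client, llms, messages):
    """Query each model in turn; record the error text instead of raising."""
    results = []
    for llm in llms:
        start = time.perf_counter()
        try:
            response = client.chat.completions.create(model=llm, messages=messages)
            text = response.choices[0].message.content.strip()
        except Exception as exc:
            text = f"ERROR: {exc}"
        results.append((llm, time.perf_counter() - start, text))
    return results


class FakeCompletions:
    """Hypothetical stand-in that mimics the response shape used above."""

    def create(self, model, messages):
        if model == "fake:broken":
            raise RuntimeError("provider unavailable")
        message = SimpleNamespace(content=f"  echo: {messages[-1]['content']}  ")
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions()))
demo = compare_llm_safe(
    fake_client,
    ["fake:ok", "fake:broken"],
    [{"role": "user", "content": "hi"}],
)
for llm, seconds, text in demo:
    print(f"{llm} - {seconds:.2f} seconds: {text}")
```

Passing `client` and `llms` as arguments rather than reading globals makes the function easy to try against a stub before spending API calls.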

## The Reversal Curse


In [21]:
messages = [
    {"role": "user", "content": "Who is Tom Cruise's mother?"},
]

responses, execution_times = compare_llm(messages)

huggingface:mistralai/Mistral-7B-Instruct-v0.3 - 2.12 seconds: Tom Cruise's mother is Mary Lee Pfeiffer, born on April 24, 1939. She was married to Thomas Cruise Mapother III, Tom's father, from 1970 until their divorce in 1980. Mary Lee passed away on January 19, 2017.
openai:gpt-4o-mini - 0.80 seconds: Tom Cruise's mother is Mary Lee South. She was a special education teacher and played a significant role in Tom's upbringing and early life.
google:gemini-1.5-flash - 0.43 seconds: Tom Cruise's mother is **Mary Lee Pfeiffer**.


In [22]:
import pandas as pd

def display(llms, execution_times, responses):
    """Build a centered, styled table of model names, timings (seconds), and responses."""
    # Note: this name shadows IPython's built-in display() within the notebook.
    data = {
        'Provider:Model Name': llms,
        'Execution Time': execution_times,
        'Model Response': responses
    }

    df = pd.DataFrame(data)
    df.index = df.index + 1  # number rows from 1 instead of 0
    styled_df = df.style.set_table_styles(
        [{'selector': 'th', 'props': [('text-align', 'center')]},
         {'selector': 'td', 'props': [('text-align', 'center')]}]
    ).set_properties(**{'text-align': 'center'})

    return styled_df

In [23]:
display(llms, execution_times, responses)

| | Provider:Model Name | Execution Time | Model Response |
|---|---|---|---|
| 1 | huggingface:mistralai/Mistral-7B-Instruct-v0.3 | 2.123339 | Tom Cruise's mother is Mary Lee Pfeiffer, born on April 24, 1939. She was married to Thomas Cruise Mapother III, Tom's father, from 1970 until their divorce in 1980. Mary Lee passed away on January 19, 2017. |
| 2 | openai:gpt-4o-mini | 0.801061 | Tom Cruise's mother is Mary Lee South. She was a special education teacher and played a significant role in Tom's upbringing and early life. |
| 3 | google:gemini-1.5-flash | 0.431814 | Tom Cruise's mother is **Mary Lee Pfeiffer**. |


In [24]:
messages = [
    {"role": "user", "content": "Who is Mary Lee Pfeiffer's son?"},
]

responses, execution_times = compare_llm(messages)

huggingface:mistralai/Mistral-7B-Instruct-v0.3 - 1.99 seconds: I don't have specific personal data like family relationships. Mary Lee Pfeiffer is a well-known American journalist and author, but I don't have information about her son. If you're looking for this information, I would suggest checking publicly available resources such as news articles or her official websites for more details.
openai:gpt-4o-mini - 0.62 seconds: Mary Lee Pfeiffer is the mother of Tom Cruise, the famous American actor and producer.
google:gemini-1.5-flash - 0.95 seconds: There is no publicly available information about Mary Lee Pfeiffer having a son. 

It's important to note that:

* **Privacy:**  Information about someone's family is often considered private. Unless it's publicly known through their own actions or media coverage, it's not appropriate to speculate or share details about their personal lives. 
* **Multiple People:** There may be many individuals named "Mary Lee Pfeiffer," and without furthe

In [25]:
display(llms, execution_times, responses)

| | Provider:Model Name | Execution Time | Model Response |
|---|---|---|---|
| 1 | huggingface:mistralai/Mistral-7B-Instruct-v0.3 | 1.990222 | I don't have specific personal data like family relationships. Mary Lee Pfeiffer is a well-known American journalist and author, but I don't have information about her son. If you're looking for this information, I would suggest checking publicly available resources such as news articles or her official websites for more details. |
| 2 | openai:gpt-4o-mini | 0.615014 | Mary Lee Pfeiffer is the mother of Tom Cruise, the famous American actor and producer. |
| 3 | google:gemini-1.5-flash | 0.952821 | There is no publicly available information about Mary Lee Pfeiffer having a son. It's important to note that: * **Privacy:** Information about someone's family is often considered private. Unless it's publicly known through their own actions or media coverage, it's not appropriate to speculate or share details about their personal lives. * **Multiple People:** There may be many individuals named "Mary Lee Pfeiffer," and without further context, it's impossible to determine who you are referring to. If you have more information about Mary Lee Pfeiffer, you may be able to find more details through online searches or other resources. |
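
To turn the reverse-direction answers into a pass/fail signal, a crude substring check is often enough for a quick look. The sketch below is an assumption-laden illustration: `mentions` is a hypothetical helper, the response strings are the opening sentences copied from the output above, and substring matching will miss paraphrases or match negated statements.

```python
def mentions(response, expected):
    """Case-insensitive check that `expected` appears somewhere in `response`."""
    return expected.casefold() in response.casefold()


# Opening sentences of the three answers to "Who is Mary Lee Pfeiffer's son?"
reverse_responses = {
    "huggingface:mistralai/Mistral-7B-Instruct-v0.3":
        "I don't have specific personal data like family relationships.",
    "openai:gpt-4o-mini":
        "Mary Lee Pfeiffer is the mother of Tom Cruise, the famous American actor and producer.",
    "google:gemini-1.5-flash":
        "There is no publicly available information about Mary Lee Pfeiffer having a son.",
}

passed = {model: mentions(text, "Tom Cruise") for model, text in reverse_responses.items()}
for model, ok in passed.items():
    print(f"{model}: {'pass' if ok else 'fail'}")
```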


## How many r's in the word strawberry?


In [26]:
messages = [
    {"role": "user", "content": "How many r's in the word strawberry?"},
]

responses, execution_times = compare_llm(messages)

huggingface:mistralai/Mistral-7B-Instruct-v0.3 - 1.42 seconds: There is 1 'r' in the word strawberry.
openai:gpt-4o-mini - 0.65 seconds: The word "strawberry" contains 2 letter 'r's.
google:gemini-1.5-flash - 0.33 seconds: There are **three** "r"s in the word "strawberry".


In [27]:
display(llms, execution_times, responses)

| | Provider:Model Name | Execution Time | Model Response |
|---|---|---|---|
| 1 | huggingface:mistralai/Mistral-7B-Instruct-v0.3 | 1.416731 | There is 1 'r' in the word strawberry. |
| 2 | openai:gpt-4o-mini | 0.646009 | The word "strawberry" contains 2 letter 'r's. |
| 3 | google:gemini-1.5-flash | 0.329325 | There are **three** "r"s in the word "strawberry". |
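
For reference, the ground truth is easy to compute at the character level, which is the view the models do not directly have. This small check uses only built-in string methods; the variable names are illustrative.

```python
word = "strawberry"

# Count the letter directly and record its 1-based positions.
r_count = word.count("r")
positions = [i for i, ch in enumerate(word, start=1) if ch == "r"]

print(r_count, positions)  # 3 [3, 8, 9]
```

Only the Gemini response above matches this count.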


## Which number is bigger?


In [28]:
messages = [
    {"role": "user", "content": "Which number is bigger, 9.11 or 9.9?"},
]

responses, execution_times = compare_llm(messages)

huggingface:mistralai/Mistral-7B-Instruct-v0.3 - 1.08 seconds: The number 9.9 is bigger than 9.11. In both numbers, the tens place (9) is the same, but the ones place (9 for 9.9 and 1 for 9.11) determines the difference. Since 9 is greater than 1, 9.9 is a larger decimal number.
openai:gpt-4o-mini - 0.54 seconds: 9.9 is bigger than 9.11.
google:gemini-1.5-flash - 0.33 seconds: 9.9 is bigger than 9.11.


In [29]:
display(llms, execution_times, responses)

| | Provider:Model Name | Execution Time | Model Response |
|---|---|---|---|
| 1 | huggingface:mistralai/Mistral-7B-Instruct-v0.3 | 1.083907 | The number 9.9 is bigger than 9.11. In both numbers, the tens place (9) is the same, but the ones place (9 for 9.9 and 1 for 9.11) determines the difference. Since 9 is greater than 1, 9.9 is a larger decimal number. |
| 2 | openai:gpt-4o-mini | 0.543441 | 9.9 is bigger than 9.11. |
| 3 | google:gemini-1.5-flash | 0.325133 | 9.9 is bigger than 9.11. |
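
All three models reach the right conclusion here, although Mistral's explanation mislabels the place values. The sketch below checks the arithmetic with `decimal.Decimal` and contrasts it with a version-number reading of the same strings. That models sometimes err on this prompt because of the version-style reading is a commonly suggested hypothesis, not something this notebook tests; `version_key` is a hypothetical helper.

```python
from decimal import Decimal

a, b = "9.11", "9.9"

# Numeric reading: 9.11 < 9.90, so 9.9 is the bigger number.
numeric_a_smaller = Decimal(a) < Decimal(b)


def version_key(text):
    """Split on dots and compare the parts as integers, like software versions."""
    return tuple(int(part) for part in text.split("."))


# Version-style reading: (9, 11) > (9, 9), so "9.11" would come after "9.9".
version_a_larger = version_key(a) > version_key(b)

print(numeric_a_smaller, version_a_larger)  # True True
```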


In [30]:
messages = [
    {"role": "user", "content": "Which number is bigger, 9.11 or 9.9? Think step by step."},
]

responses, execution_times = compare_llm(messages)

huggingface:mistralai/Mistral-7B-Instruct-v0.3 - 1.34 seconds: To compare the numbers, we simply have to look at the digits within the decimal point. Here, we have 9.11 and 9.9. The numbers immediately after the decimal point for both numbers are 1 and 9. As we are comparing them, the one that is larger is the one with the higher number after the decimal point. In this case, that would be 9.9. So, 9.9 is a larger number than 9.
openai:gpt-4o-mini - 3.91 seconds: To determine which number is bigger, 9.11 or 9.9, we can compare them step by step:

1. **Identify the whole number part**: Both numbers have a whole number part of 9.

2. **Compare the decimal parts**:
   - 9.11 has a decimal part of 0.11.
   - 9.9 can be considered as 9.90 for easier comparison, which has a decimal part of 0.90.

3. **Compare the decimal values**:
   - 0.11 is less than 0.90.

4. **Conclusion**: Since the whole number parts are equal, but the decimal part of 9.9 (0.90) is greater than that of 9.11 (0.11), it 

In [31]:
display(llms, execution_times, responses)

| | Provider:Model Name | Execution Time | Model Response |
|---|---|---|---|
| 1 | huggingface:mistralai/Mistral-7B-Instruct-v0.3 | 1.340913 | To compare the numbers, we simply have to look at the digits within the decimal point. Here, we have 9.11 and 9.9. The numbers immediately after the decimal point for both numbers are 1 and 9. As we are comparing them, the one that is larger is the one with the higher number after the decimal point. In this case, that would be 9.9. So, 9.9 is a larger number than 9. |
| 2 | openai:gpt-4o-mini | 3.911598 | To determine which number is bigger, 9.11 or 9.9, we can compare them step by step: 1. **Identify the whole number part**: Both numbers have a whole number part of 9. 2. **Compare the decimal parts**:  - 9.11 has a decimal part of 0.11.  - 9.9 can be considered as 9.90 for easier comparison, which has a decimal part of 0.90. 3. **Compare the decimal values**:  - 0.11 is less than 0.90. 4. **Conclusion**: Since the whole number parts are equal, but the decimal part of 9.9 (0.90) is greater than that of 9.11 (0.11), it follows that 9.9 is the larger number. Thus, **9.9 is bigger than 9.11**. |
| 3 | google:gemini-1.5-flash | 0.941795 | Here's how to figure out which number is bigger: 1. **Focus on the whole number:** Both numbers have the same whole number, which is 9. 2. **Compare the decimal parts:** The decimal part of 9.11 is .11, and the decimal part of 9.9 is .9. 3. **Think of money:** Imagine you have $0.11 (eleven cents) and $0.90 (ninety cents). Which is bigger? $0.90 is bigger. 4. **Conclusion:** Since .9 is bigger than .11, then 9.9 is bigger than 9.11. |
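
The only difference between the last two prompts is the "Think step by step." suffix. To run every question in both variants without retyping it, a small helper can derive the second prompt from the first. `with_step_by_step` is a hypothetical name; the sketch assumes the message format used throughout this notebook (a list of `{"role", "content"}` dicts).

```python
import copy


def with_step_by_step(messages, suffix="Think step by step."):
    """Return a copy of `messages` with `suffix` appended to the last user turn."""
    new_messages = copy.deepcopy(messages)  # leave the caller's list untouched
    for message in reversed(new_messages):
        if message["role"] == "user":
            message["content"] = f"{message['content'].rstrip()} {suffix}"
            break
    return new_messages


base = [{"role": "user", "content": "Which number is bigger, 9.11 or 9.9?"}]
cot = with_step_by_step(base)
print(cot[0]["content"])
```

The result can be passed to `compare_llm` exactly like a hand-written `messages` list.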


## Takeaways
1. Not all LLMs are created equal. Even the same Llama 3 (or 3.1) model can behave differently depending on which provider serves it.
2. Asking an LLM to think step by step may improve its reasoning.
3. The way tokenization works can lead to a lot of weirdness in LLM behavior (see Andrej Karpathy's [video](https://www.youtube.com/watch?v=zduSFxRajkE) for a deep dive).
4. A more comprehensive benchmark would be desirable, but a quick comparison like the one shown here can be a first step.
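
To make the tokenization point in takeaway 3 concrete, the toy sketch below shows what a model receives in place of letters. The three-piece split and the token IDs are hypothetical, chosen for illustration only; real tokenizers may split "strawberry" differently.

```python
# Hypothetical vocabulary and split; not taken from any real tokenizer.
toy_vocab = {"str": 101, "aw": 102, "berry": 103}
pieces = ["str", "aw", "berry"]

# The model is fed integer IDs, not characters.
token_ids = [toy_vocab[piece] for piece in pieces]

# Counting letters requires knowing the spelling hidden behind each ID.
r_per_piece = {piece: piece.count("r") for piece in pieces}
total_r = sum(r_per_piece.values())

print(token_ids)    # [101, 102, 103]
print(r_per_piece)  # {'str': 1, 'aw': 0, 'berry': 2}
print(total_r)      # 3
```

Nothing in the ID sequence itself says how many r's each piece contains, which is one way to see why letter counting can be harder for a model than it looks.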