# Reasoning with Open Source LLMs

This notebook demonstrates how to run an open-source, reasoning-capable LLM from Hugging Face. It is designed for a GPU with 16GB of VRAM.

We'll use Mistral-7B-Instruct-v0.2, an instruction-tuned model that is small enough to run on a 16GB GPU once it is quantized.

## Setup and Dependencies

First, let's install the necessary packages:

In [None]:
# !pip install transformers accelerate bitsandbytes torch einops

## Import Required Libraries

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import time

## Check GPU Availability

In [2]:
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

CUDA available: True
GPU: NVIDIA GeForce RTX 4070 Ti SUPER
Memory: 17.17 GB


## Load the Model with Quantization

We'll use 4-bit quantization to fit the model comfortably within 16GB VRAM.

In [3]:
from huggingface_hub import login


# Make sure to turn this on:
![Image](https://i.imgur.com/9gFDb8a.png)

In [4]:
login(token="")  # TODO: add your token here; get one from https://huggingface.co/settings/tokens

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /home/ali/.cache/huggingface/token
Login successful


In [6]:
# Configure quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

# Model selection
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/2.10k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [8]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype=torch.float16
)


Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   4%|4         | 199M/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

In [9]:
# Print GPU memory usage after loading the model
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

GPU memory allocated: 4.13 GB
GPU memory reserved: 4.15 GB
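
As a rough sanity check on the figure above, the weight footprint can be estimated from the parameter count and the number of bits stored per weight. The sketch below assumes about 7.24 billion parameters for Mistral-7B and ignores quantization constants, layers that stay unquantized, and activation memory, so it is a lower bound and not a prediction. `estimate_weight_gb` is a hypothetical helper written for this illustration, not part of any library.

```python
def estimate_weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes, as in the cells above)."""
    return n_params * bits_per_param / 8 / 1e9

# Assumed, approximate parameter count for Mistral-7B
N_PARAMS = 7.24e9

fp16_gb = estimate_weight_gb(N_PARAMS, 16)
nf4_gb = estimate_weight_gb(N_PARAMS, 4)

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

The measured 4.13 GB sits a little above the 4-bit estimate, plausibly because some layers are kept in 16-bit precision and the quantization constants take extra space.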


## Define Generation Parameters

Here we define the parameters for text generation. For reasoning tasks, we want a low temperature, so the model sticks to its most likely tokens, and a generous `max_new_tokens` limit, so it has room for step-by-step thinking.

In [10]:
generation_config = {
    "max_new_tokens": 1024,
    "temperature": 0.1,  # Low temperature for more deterministic reasoning
    "top_p": 0.9,
    "top_k": 50,
    "repetition_penalty": 1.1,
    "do_sample": True,   # Sampling stays on, but at temperature 0.1 the output is close to greedy
    "pad_token_id": tokenizer.eos_token_id
}
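
To see why a temperature of 0.1 makes generation nearly deterministic, it may help to look at what temperature does to a softmax distribution. This is a minimal sketch in plain Python; the logits are made up for illustration and `softmax_with_temperature` is a hypothetical helper, not the code path `model.generate` uses internally.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three candidate tokens (illustrative only)
example_logits = [2.0, 1.5, 0.5]

for t in (1.0, 0.1):
    probs = softmax_with_temperature(example_logits, t)
    print(f"temperature={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At temperature 1.0 the top token gets a bit over half of the probability mass, while at 0.1 it gets almost all of it, so sampling rarely picks anything else.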

## Create a Helper Function for Inference

Let's create a function that handles prompting the model with appropriate formatting.

In [11]:
def generate_response(prompt, system_prompt="You are a helpful, honest, and precise assistant."):
    """Generate a response from the model based on the prompt and system prompt."""
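    # Format for Mistral-7B-Instruct-v0.2. The tokenizer prepends the <s> (BOS)
    # token itself, so writing it in the string would duplicate it.
    formatted_prompt = f"[INST] {system_prompt}\n\n{prompt} [/INST]"
    
    # Send the inputs to whichever device the model was loaded on
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)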
    
    # Track token generation time
    start_time = time.time()
    
    # Generate the response
    with torch.no_grad():
        output = model.generate(
            **inputs,
            **generation_config
        )
    
    # Calculate tokens per second
    generation_time = time.time() - start_time
    num_new_tokens = output.shape[1] - inputs['input_ids'].shape[1]
    tokens_per_second = num_new_tokens / generation_time
    
    # Decode only the newly generated tokens, so the prompt never appears in the reply
    prompt_length = inputs['input_ids'].shape[1]
    response = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True).strip()
    
    print(f"Generated {num_new_tokens} tokens in {generation_time:.2f} seconds ({tokens_per_second:.2f} tokens/sec)")
    
    return response

## Test with a Simple Reasoning Task

In [12]:
simple_reasoning_prompt = """
Mark has 5 apples. He gives 2 apples to Sarah. Sarah gives him 3 oranges in return. 
Then Mark buys 4 more apples and 2 more oranges. 
How many apples and oranges does Mark have now? 
Think through this step by step.
"""

response = generate_response(simple_reasoning_prompt)
print(response)

Generated 191 tokens in 6.80 seconds (28.10 tokens/sec)
Mark starts with 5 apples. He gives away 2 apples to Sarah, so he has 5 - 2 = <<5-2=3>>3 apples left.
Sarah gives him 3 oranges in return, so Mark receives 3 oranges. Therefore, Mark now has 3 apples and 3 oranges.
Next, Mark buys 4 more apples. So, he has 3 apples (from the beginning) + 4 apples (that he bought) = <<3+4=7>>7 apples in total.
However, we forgot to add the 2 oranges that Mark also bought. So, Mark now has 7 apples and 3 oranges (the ones he received from Sarah) + 2 oranges (that he bought) = <<7+3+2=12>>12 fruits in total.
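
The transcript above gets the intermediate counts right but finishes with a muddled "12 fruits" total. A quick tally of the events in the prompt gives the expected answer to compare against. The sketch assumes Mark starts with no oranges, which the prompt does not state explicitly.

```python
# Track Mark's fruit through each event in the prompt.
# Assumption: Mark has no oranges at the start.
apples, oranges = 5, 0

apples -= 2    # gives 2 apples to Sarah
oranges += 3   # receives 3 oranges from Sarah
apples += 4    # buys 4 more apples
oranges += 2   # buys 2 more oranges

print(f"Apples: {apples}, Oranges: {oranges}")  # Apples: 7, Oranges: 5
```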


## Complex Reasoning Test: Multi-step Logical Problem

In [13]:
logical_reasoning_prompt = """
I want you to solve this logical puzzle step by step:

Four friends (Alex, Blake, Casey, and Dana) are deciding where to go on vacation. They are considering four destinations: England, France, Germany, and Italy. From the clues below, determine which person wants to visit which country.

Clues:
1. The person who wants to visit France loves cheese.
2. Dana has been to England before and wants to visit somewhere new.
3. Casey is allergic to cheese.
4. Blake wants to visit Germany or Italy.
5. Alex has never left North America before.
6. The person who wants to visit Italy speaks Italian.
7. Neither Alex nor Blake speaks any foreign languages.
8. Casey speaks French fluently.

For each person, determine which country they want to visit, and explain your reasoning for each deduction.
"""

response = generate_response(logical_reasoning_prompt)
print(response)

Generated 225 tokens in 7.35 seconds (30.63 tokens/sec)
Based on the given clues, here's how we can determine which country each friend wants to visit:

1. The person who loves cheese and isn't Casey (since Casey is allergic to cheese) wants to visit France.
2. Dana has already been to England, so she wants to visit either Germany or Italy.
3. Since neither Alex nor Blake speaks any foreign languages, Blake cannot be going to Italy because the person speaking Italian is Alex. Therefore, Blake wants to visit Germany.
4. This leaves Dana as the last person, and since she hasn't been to England before, she must want to visit Italy.

So, the final answer is:
- Alex wants to visit Italy
- Blake wants to visit Germany
- Casey wants to avoid all destinations due to her allergy
- Dana wants to visit Italy

Therefore, Alex wants to visit Italy, Blake wants to visit Germany, Casey doesn't want to go anywhere due to her allergy, and Dana wants to visit Italy.
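
The model's answer above sends two people to Italy and leaves Casey without a destination, so it cannot be a valid assignment. One way to check a puzzle like this is to brute-force every assignment against the clues. The sketch below uses a literal reading of the clues, which is an assumption: clues 1 and 3 together rule out France for Casey, clues 6 and 7 rule out Italy for Alex and Blake, and clues 5 and 8 are treated as adding no hard constraint.

```python
from itertools import permutations

people = ["Alex", "Blake", "Casey", "Dana"]
countries = ["England", "France", "Germany", "Italy"]

def is_consistent(trip):
    """Check one person -> country assignment against a literal reading of the clues."""
    return (
        trip["Dana"] != "England"                  # clue 2
        and trip["Casey"] != "France"              # clues 1 + 3
        and trip["Blake"] in ("Germany", "Italy")  # clue 4
        and trip["Alex"] != "Italy"                # clues 6 + 7
        and trip["Blake"] != "Italy"               # clues 6 + 7
    )

solutions = []
for order in permutations(countries):
    trip = dict(zip(people, order))  # each person gets a distinct country
    if is_consistent(trip):
        solutions.append(trip)

for trip in solutions:
    print(trip)
```

Under this encoding Blake goes to Germany in every surviving assignment, but two assignments remain, so the clues as written do not appear to pin down a unique answer. A stricter reading of clue 8 (for example, that Casey speaks only French) would narrow it further.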


## Mathematical Reasoning

In [14]:
math_reasoning_prompt = """
Please solve this math problem step by step:

A store is having a 15% off sale. Additionally, if you spend more than $100 after the discount, you get an extra $20 off. If I want to buy an item that costs $x, for what values of x will I pay exactly $100 after all discounts are applied?
"""

response = generate_response(math_reasoning_prompt)
print(response)

Generated 316 tokens in 11.45 seconds (27.60 tokens/sec)
To find out for what value of x we pay exactly $100 after all discounts are applied, we need to apply both discounts step by step. Let's denote the original price of the item as x.

1. First, let's calculate the price after the 15% discount: x * (1 - 0.15) = x * 0.85

2. Next, let's check if the new price after the first discount is more than $100. If it is not, we cannot reach exactly $100 with any further discounts. So, we have: x * 0.85 > 100

3. Solving for x in the above inequality, we get: x > 117.65 (rounded up to two decimal places).

4. Since we can get an additional $20 off if we spend more than $100 after the first discount, we need to subtract $20 from the final price to get the exact amount we want to pay: x * 0.85 - 20 = 100

5. Solving for x in this equation, we get: x > 127.65 (rounded up to two decimal places).

Therefore, for any value of x greater than 127.65, we will pay exactly $100 after all discounts are ap
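
The model's conclusion ("any value of x greater than 127.65") can be checked by coding the pricing rule directly. This sketch reads "more than $100 after the discount" as a strict inequality, which is an assumption about the prompt, and uses exact fractions to avoid floating-point rounding. `final_price` is a hypothetical helper for this check.

```python
from fractions import Fraction

RATE = Fraction(85, 100)  # price multiplier after 15% off

def final_price(x):
    """Price paid for an item listed at x: 15% off, then $20 off if the result exceeds $100."""
    discounted = x * RATE
    if discounted > 100:
        discounted -= 20
    return discounted

# Case 1: 0.85x = 100, no extra discount because 100 is not more than 100
x_low = Fraction(100) / RATE
# Case 2: 0.85x - 20 = 100, extra discount applies because 0.85x = 120 > 100
x_high = Fraction(120) / RATE

for x in (x_low, x_high):
    print(f"x = {float(x):.2f} -> pay {float(final_price(x)):.2f}")
```

Under that reading there are exactly two list prices that lead to paying $100, roughly $117.65 and $141.18, not an open-ended range.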

## Chain-of-Thought Prompting

Let's try using chain-of-thought prompting techniques to enhance reasoning.

In [15]:
cot_system_prompt = """
You are a helpful assistant that solves problems step-by-step. 
For each problem:
1. Identify what the question is asking for
2. List the relevant facts and constraints
3. Work through the problem systematically
4. Verify your answer
5. State your final answer clearly
"""

cot_prompt = """
The probability of rain on Saturday is 60%. The probability of rain on Sunday is 70%. 
Assuming these events are independent, what is the probability that:
1. It rains on both Saturday and Sunday?
2. It rains on either Saturday or Sunday (or both)?
3. It rains on exactly one of those days?
"""

response = generate_response(cot_prompt, system_prompt=cot_system_prompt)
print(response)

Generated 507 tokens in 16.32 seconds (31.06 tokens/sec)
1. Problem: Find the probability that it rains on both Saturday and Sunday.
   Relevant Facts:
   - Probability of rain on Saturday: 0.6
   - Probability of rain on Sunday: 0.7
   Constraints:
   - The events of rain on Saturday and rain on Sunday are independent.

   Solution:
   Since the events are independent, we can calculate the probability of both events occurring by multiplying their individual probabilities: P(rain on Saturday and rain on Sunday) = P(rain on Saturday) * P(rain on Sunday) = 0.6 * 0.7 = 0.42 or 42%

2. Problem: Find the probability that it rains on either Saturday or Sunday (or both).
   Relevant Facts:
   - Probability of rain on Saturday: 0.6
   - Probability of rain on Sunday: 0.7

   Solution:
   To find the probability of either event occurring, we use the addition rule of probability: P(rain on Saturday or rain on Sunday) = P(rain on Saturday) + P(rain on Sunday) - P(rain on both) = 0.6 + 0.7 - 0.42 
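
For reference, the three probabilities asked for in the prompt can be computed directly from the independence assumption. The sketch below uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

p_sat = Fraction(60, 100)  # P(rain on Saturday)
p_sun = Fraction(70, 100)  # P(rain on Sunday)

# Independence lets us multiply the individual probabilities
p_both = p_sat * p_sun
# Inclusion-exclusion for "at least one day"
p_either = p_sat + p_sun - p_both
# Rain on Saturday only, plus rain on Sunday only
p_exactly_one = p_sat * (1 - p_sun) + (1 - p_sat) * p_sun

print(f"Both days:       {float(p_both):.2f}")
print(f"Either or both:  {float(p_either):.2f}")
print(f"Exactly one day: {float(p_exactly_one):.2f}")
```

The expected answers are 0.42, 0.88, and 0.46. Note that the third value equals the second minus the first.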

## Load and Try a Different Model (If VRAM Allows)

You can also try other models within your 16GB VRAM constraint.

In [23]:
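# Clear the previous model from VRAM
import gc

del model
gc.collect()              # drop lingering Python references first
torch.cuda.empty_cache()  # then release cached blocks held by PyTorch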

In [20]:
# Load a different model
model_name = ""  # TODO: set a Hugging Face model ID here before running this cell
# model_name = "facebook/opt-6.7b"  # This gave me very poor results

# Load new tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype=torch.float16
) 


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Update the Helper Function for the New Model's Prompt Format

In [None]:
# def generate_opt_response(prompt, system_prompt="You are a helpful, honest, and precise assistant."):
#     """Generate a response from OPT-6.7B based on the prompt and system prompt."""
#     formatted_prompt = f"System: {system_prompt}\nUser: {prompt}\nAssistant:"
#     
#     inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda")
#     
#     # Track token generation time
#     start_time = time.time()
#     
#     # Generate the response
#     with torch.no_grad():
#         output = model.generate(
#             **inputs,
#             **generation_config
#         )
#     
#     # Calculate tokens per second
#     generation_time = time.time() - start_time
#     num_new_tokens = output.shape[1] - inputs['input_ids'].shape[1]
#     tokens_per_second = num_new_tokens / generation_time
#     
#     # Decode the response
#     response = tokenizer.decode(output[0], skip_special_tokens=True)
#     
#     # Extract only the model's reply (remove the prompt)
#     response = response.split("Assistant:")[-1].strip()
#     return response

In [21]:
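def generate_a_response(prompt, system_prompt="You are a helpful, honest, and precise assistant."):
    # TODO: format the prompt for your model, then generate and decode as in generate_response
    raise NotImplementedError("Fill in the prompt format for the model you loaded")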
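
If the model you loaded ships a chat template, one option is to let the tokenizer build the prompt instead of hand-writing the format. The following is a minimal sketch under that assumption; base models without a chat template will not work with it. `build_messages` and `generate_with_chat_template` are hypothetical names, and the system prompt is folded into the user turn because some templates reject a separate system role.

```python
def build_messages(prompt, system_prompt=None):
    """Fold the system prompt into the user turn and return a one-message chat."""
    content = prompt.strip()
    if system_prompt:
        content = f"{system_prompt.strip()}\n\n{content}"
    return [{"role": "user", "content": content}]

def generate_with_chat_template(model, tokenizer, prompt, system_prompt=None, **generate_kwargs):
    """Format the prompt with the tokenizer's chat template, generate, and decode the reply."""
    inputs = tokenizer.apply_chat_template(
        build_messages(prompt, system_prompt),
        add_generation_prompt=True,  # append the marker that starts the assistant turn
        return_tensors="pt",
        return_dict=True,            # return input_ids and attention_mask together
    ).to(model.device)
    output = model.generate(**inputs, **generate_kwargs)
    # Keep only the tokens produced after the prompt
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True).strip()
```

With a model and tokenizer loaded as above, a call could look like `generate_with_chat_template(model, tokenizer, math_reasoning_prompt, **generation_config)`.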

## Test the New Model with the Same Reasoning Task

In [None]:
# Use the same math reasoning prompt
response = generate_a_response(math_reasoning_prompt)
print(response)

## Conclusion

In this notebook, we've demonstrated how to:

1. Load and run reasoning-capable open-source LLMs on a 16GB VRAM GPU
2. Use 4-bit quantization to fit larger models in memory
3. Craft prompts that encourage step-by-step reasoning

You can extend this notebook by:
- Trying other models like Phi-2, TinyLlama, or SOLAR-10.7B-Instruct-v1.0
- Implementing more advanced reasoning techniques like tree-of-thought or self-consistency
- Creating a benchmark of reasoning tasks to evaluate model performance
- Fine-tuning the models on reasoning datasets (requires more VRAM or parameter-efficient techniques like LoRA)
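
As a starting point for the self-consistency idea in the list above, the core loop is small: sample several reasoning paths for the same prompt, extract each final answer, and keep the most common one. The sketch below is illustrative only. `self_consistent_answer`, `fake_sampler`, and `last_answer` are hypothetical names, and the canned responses stand in for real model samples.

```python
from collections import Counter

def self_consistent_answer(sample_fn, prompt, extract_answer, n_samples=5):
    """Sample several reasoning paths and return (most common answer, agreement rate)."""
    answers = []
    for _ in range(n_samples):
        reasoning = sample_fn(prompt)       # one sampled chain of thought
        answer = extract_answer(reasoning)  # pull out the final answer, if any
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None, 0.0
    best, count = Counter(answers).most_common(1)[0]
    return best, count / len(answers)

# Canned responses standing in for sampled model outputs (illustrative only)
canned = iter([
    "3 + 4 = 7. Answer: 7",
    "3 + 4 = 8. Answer: 8",
    "Adding gives 7. Answer: 7",
])

def fake_sampler(prompt):
    return next(canned)

def last_answer(text, marker="Answer:"):
    """Return the text after the last marker, or None if the marker is missing."""
    return text.rsplit(marker, 1)[1].strip() if marker in text else None

best, agreement = self_consistent_answer(fake_sampler, "What is 3 + 4?", last_answer, n_samples=3)
print(best, round(agreement, 2))
```

With a real model, `sample_fn` would wrap a generation call that uses a higher temperature than 0.1, since the method relies on the samples actually differing from one another.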