<a href="https://colab.research.google.com/github/20wiz/Common-Sense-Reasoning-ARC/blob/dev/Common_Sense_Reasoning_with_phi_1_5_cot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

CoT by prompt

In [5]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m33.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl 

In [22]:
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from tqdm import tqdm
import random
import json
from datetime import datetime

class ARCEvaluator:
    def init_model(self, model_name="microsoft/phi-1_5", device=None):
        self.model_name = model_name
        self.device = device or torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name).to(self.device)
        self.model.eval()

    def load_model(self, model, tokenizer, device):
        self.tokenizer = tokenizer
        self.device = device
        self.model = model
        self.model.eval()

    def make_prompt(self, question, choices, use_cot=True):
        """
        Creates a structured prompt for ARC questions with optional chain-of-thought reasoning.

        Args:
            question (str): The question text
            choices (dict): Dictionary containing 'text' and 'label' lists
            use_cot (bool): Whether to use chain-of-thought prompting
        """
        if use_cot:
            prompt = (
                "Let's solve this step by step:\n\n"
                f"Question: {question}\n\n"
                "Choices:\n"
            )
            for text, label in zip(choices['text'], choices['label']):
                prompt += f"{label}: {text}\n"

            prompt += ("\nLet's think about this:\n"
                      "1. First, let's understand what the question is asking.\n"
                      "2. Then, let's analyze each choice carefully.\n"
                      "3. Finally, we'll choose the most logical answer.\n\n"
                      "Reasoning:\n"
                      f"1. The question is asking about {question}\n"
                      "2. Let's examine each option:\n")

            for text, label in zip(choices['text'], choices['label']):
                prompt += f"   Option {label}: {text} - \n"

            prompt += ("\n3. Based on this analysis, the correct answer is ")
        else:
            # Original simple prompt
            prompt = f"Question: {question}\nChoices:\n"
            for text, label in zip(choices['text'], choices['label']):
                prompt += f"{label}: {text}\n"
            prompt += "Answer: "

        return prompt

    def predict_answer(self, prompt, use_cot=True):
        """
        Generate model prediction for a given prompt.

        Args:
            prompt (str): Formatted question prompt
            use_cot (bool): Whether using chain-of-thought prompting
        """
        inputs = self.tokenizer(prompt, return_tensors='pt').to(self.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                # max_new_tokens=200 if use_cot else 1,  # Longer generation for CoT
                max_new_tokens=100 if use_cot else 1,  # Longer generation for CoT
                do_sample=True if use_cot else False,  # Allow some randomness for CoT
                temperature=0.7 if use_cot else 1.0,   # Lower temperature for more focused reasoning
                top_p=0.9 if use_cot else 1.0,        # Nucleus sampling for CoT
                pad_token_id=self.tokenizer.eos_token_id
            )

        output_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        # print(f"output_text : {output_text}")

        if use_cot:
            # Extract the final answer from the reasoning chain
            # Look for patterns like "answer is X" or "therefore X" or just the last character
            output_lines = output_text.split('\n')
            answer = None

            # Try to find the answer in the last few lines
            for line in reversed(output_lines):
                line = line.strip().lower()
                if "answer is" in line:
                    answer = line[-1].upper()
                    break
                elif "therefore" in line:
                    answer = line[-1].upper()
                    break

            # Fallback to last character if no clear answer found
            if not answer:
                answer = output_text.strip()[-1].upper()

            return answer, output_text
        else:
            return output_text.strip()[-1], output_text

    def evaluate_dataset(self, split='validation', num_samples=None, save_results=True, use_cot=True):
          """
          Evaluate model performance on ARC dataset.

          Args:
              split (str): Dataset split to evaluate ('validation' or 'test')
              num_samples (int): Number of samples to evaluate (None for all)
              save_results (bool): Whether to save detailed results to file
              use_cot (bool): Whether to use chain-of-thought prompting
          """
          dataset = load_dataset('ai2_arc', 'ARC-Challenge')[split]

          if num_samples is not None:
              indices = random.sample(range(len(dataset)), min(num_samples, len(dataset)))
              dataset = dataset.select(indices)

          results = []
          correct = 0
          total = 0

          for sample in tqdm(dataset, desc=f"Evaluating {split} set"):
              print(type(sample))
              print(sample)
              question = sample["question"]
              choices = sample["choices"]
              gold = sample["answerKey"]
              gold_text = [text for text, l in zip(choices['text'], choices['label']) if l == gold]
              print("question & answer:")
              print(question)
              print(choices)
              print("gold")
              print(gold, gold_text)

              prompt = self.make_prompt(question, choices, use_cot)
              # print(f"Prompt:\n{prompt}")
              predicted_answer, full_response = self.predict_answer(prompt, use_cot)
              print(f"\nPredicted Answer: {predicted_answer} \n")
              print(f"Full Response: {full_response}")

              # check if predicted_answer includes gold

              # gold = f"Option {gold}"
              answer = full_response.split("the correct answer is")
              print(f"answer1:  {answer}")
              # not use the last one,  can occur multiple times
              answer = answer[1].split('\n')
              print(f"answer2 : {answer}")
              answer = answer[1]
              print(f"answer3 : {answer}")

              # is_correct = gold in predicted_answer
              is_correct = gold in answer and gold_text in answer

              # is_correct = predicted_answer == gold
              if is_correct:
                  correct += 1
              total += 1

              results.append({
                  'id': sample['id'],
                  'question': question,
                  'choices': choices,
                  'gold_answer': gold,
                  'predicted_answer': predicted_answer,
                  'full_reasoning': full_response if use_cot else None,
                  'is_correct': is_correct
              })

          accuracy = correct / total
          print(f"\nAccuracy: {correct}/{total} = {accuracy:.2%}")

          if save_results:
              self._save_results(results, accuracy, split, use_cot)

          return accuracy, results

In [7]:
model_name = "microsoft/phi-1_5"
device =  torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/237 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/736 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.84G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/74.0 [00:00<?, ?B/s]

In [23]:
evaluator = ARCEvaluator()
evaluator.load_model(model, tokenizer, device)

In [25]:
    # Run evaluation with chain-of-thought
print("Evaluating with Chain-of-Thought:")
accuracy_cot, results_cot = evaluator.evaluate_dataset(
    split='validation',
    num_samples=1,  # Set to None to evaluate full dataset
    save_results=False,
    use_cot=True
)

Evaluating with Chain-of-Thought:


Evaluating validation set:   0%|          | 0/1 [00:00<?, ?it/s]

<class 'dict'>
{'id': 'Mercury_7005478', 'question': 'The human body temperature is relatively constant. Which is a feedback mechanism that helps the human body maintain its normal temperature in a cold environment?', 'choices': {'text': ['Water is released from the skin.', 'Muscles shake in small movements.', 'The rate of heart beats slows.', 'The lungs take in additional air.'], 'label': ['A', 'B', 'C', 'D']}, 'answerKey': 'B'}
question & answer:
The human body temperature is relatively constant. Which is a feedback mechanism that helps the human body maintain its normal temperature in a cold environment?
{'text': ['Water is released from the skin.', 'Muscles shake in small movements.', 'The rate of heart beats slows.', 'The lungs take in additional air.'], 'label': ['A', 'B', 'C', 'D']}
gold
B ['Muscles shake in small movements.']


Evaluating validation set:   0%|          | 0/1 [00:03<?, ?it/s]


Predicted Answer: S 

Full Response: Let's solve this step by step:

Question: The human body temperature is relatively constant. Which is a feedback mechanism that helps the human body maintain its normal temperature in a cold environment?

Choices:
A: Water is released from the skin.
B: Muscles shake in small movements.
C: The rate of heart beats slows.
D: The lungs take in additional air.

Let's think about this:
1. First, let's understand what the question is asking.
2. Then, let's analyze each choice carefully.
3. Finally, we'll choose the most logical answer.

Reasoning:
1. The question is asking about The human body temperature is relatively constant. Which is a feedback mechanism that helps the human body maintain its normal temperature in a cold environment?
2. Let's examine each option:
   Option A: Water is released from the skin. - 
   Option B: Muscles shake in small movements. - 
   Option C: The rate of heart beats slows. - 
   Option D: The lungs take in additional air




TypeError: 'in <string>' requires string as left operand, not list

In [2]:
c={'text': ['gas', 'liquid', 'solid'], 'label': ['A', 'B', 'C']}

In [3]:
label =c["label"]

In [4]:
# prompt: get matching text from c with label,  short code
g='B'
matching_text = [text for text, l in zip(c['text'], c['label']) if l == g]
matching_text

['liquid']