Setting up the environment. Using Hugging Face's transformers library with PEFT capabilities.

In [None]:
pip install torch transformers peft



In [None]:
from google.colab import output
output.enable_custom_widget_manager()

In [None]:
import os
os.environ["HF_TOKEN"] = "token"

Choose the base model. GPT-2 is used here because its weights are openly available on the Hugging Face Hub; GPT-3 is not distributed there.

https://huggingface.co/openai-community/gpt2

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Authenticate and load the model
model_name = "gpt2"
# `token` replaces the deprecated `use_auth_token` argument
model = AutoModelForCausalLM.from_pretrained(model_name, token=os.getenv("HF_TOKEN"))
tokenizer = AutoTokenizer.from_pretrained(model_name, token=os.getenv("HF_TOKEN"))






Next, define the metacognitive vector. This vector is concatenated with the student's answer text to form the input to the model.

In our case, the metacognitive vector is a tensor of shape [1, 16].

In [None]:
import torch

# Example metacognitive vector for one student
num_metacognitive_attributes = 16  # Matches the [1, 16] shape described above
metacognitive_vector = torch.rand(1, num_metacognitive_attributes)  # Random for illustration
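
One simple way to concatenate the vector with the answer text is to serialize it as a short text prefix before tokenization. The sketch below assumes that approach; the helper name `build_model_input` and the `[META]` / `[ANSWER]` markers are hypothetical and not part of the original notebook.

```python
def build_model_input(metacognitive_vector, student_answer):
    """Serialize a metacognitive vector and prepend it to the answer text."""
    # Fixed two-decimal formatting keeps the prefix short and predictable
    vector_text = " ".join(f"{value:.2f}" for value in metacognitive_vector)
    return f"[META] {vector_text} [ANSWER] {student_answer}"


example_input = build_model_input([0.25, 0.5, 1.0], "I'm confused about sorting algorithms.")
print(example_input)
```

With the tensor defined above, the values could be passed as `metacognitive_vector.squeeze(0).tolist()`. An alternative design would project the vector into the embedding space instead of turning it into text, but that requires changes to the model's input layer.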


Implementing LoRA layers.

Start by applying LoRA to the attention layers, which is a common practice.

**Identify and Adapt Specific Layers with LoRA**
We’ll adapt the self-attention layers typically used in transformer-based models. These layers are often named attn or attention, depending on the model architecture. Below is a modified approach to apply LoRA only to layers of type nn.Linear within specific submodules.

In [None]:
# import torch.nn as nn

# class LoRAAdapter(nn.Module):
#     def __init__(self, input_dim, output_dim, rank=4):
#         super(LoRAAdapter, self).__init__()
#         self.lora_A = nn.Linear(input_dim, rank, bias=False)
#         self.lora_B = nn.Linear(rank, output_dim, bias=False)

#     def forward(self, x):
#         # Apply low-rank adaptation
#         return self.lora_B(self.lora_A(x))

# # Integrate LoRA into the model's key layers (e.g., attention layers)
# # This is a simplified approach; in practice, you might integrate LoRA layers across several key submodules.
# for name, module in model.named_modules():
#     if isinstance(module, nn.Linear):  # Choose layers where you want to apply LoRA
#         lora_layer = LoRAAdapter(module.in_features, module.out_features)
#         module.add_module("lora_adapter", lora_layer)


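The approach above gets stuck in a recursion loop: every adapter it attaches contains nn.Linear layers of its own, which the loop then visits and wraps again. We therefore need to adapt only specific layers in the model, such as the self-attention layers.

**LoRA Adapter:** This is defined as a separate module that can be added to specific layers. It consists of a down-projection from the input dimension to a small rank r, followed by an up-projection from r to the output dimension.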

In [None]:
import torch.nn as nn

# Define LoRA adapter
class LoRAAdapter(nn.Module):
    def __init__(self, in_dim, out_dim, r=4):
        super().__init__()
        self.down_proj = nn.Linear(in_dim, r, bias=False)
        self.up_proj = nn.Linear(r, out_dim, bias=False)

    def forward(self, x):
        return self.up_proj(self.down_proj(x))

**Selective Replacement:** We loop through the model's submodules and look for attn or attention in their names, as these are commonly where transformer models implement their self-attention layers.

For each nn.Linear submodule within these layers, we replace it with a combination of the original layer and the LoRA adapter, allowing for fine-tuning without retraining the whole model.

**Sequential Combination:** We use nn.Sequential to stack the original Linear layer with the LoRA adapter. Note that this applies the adapter in series, to the output of the original layer, whereas standard LoRA adds the low-rank update in parallel with the original layer's output.

If you want more precise control, you can adjust the conditional checks to target different layers. For instance, matching on model-specific layer names gives finer control over where LoRA is applied.

In [None]:
# Apply LoRA only to top-level attention layers to avoid deep recursion.
# Note: named_children() only visits the model's direct children, so this loop
# matches nothing when the attention blocks are nested deeper (as in GPT-2).
for name, module in list(model.named_children()):
    if "attn" in name or "attention" in name:  # Targeting attention layers at the top level
        # Snapshot the children so that replacing them does not disturb iteration
        for sub_name, sub_module in list(module.named_children()):
            if isinstance(sub_module, nn.Linear):
                # The adapter runs after the linear layer, so its input size
                # is that layer's output size
                lora_adapter = LoRAAdapter(sub_module.out_features, sub_module.out_features)

                # Stack the original linear layer with LoRA adapter
                combined_layer = nn.Sequential(sub_module, lora_adapter)

                # Replace the original layer with combined layer
                setattr(module, sub_name, combined_layer)

print("LoRA adaptation applied to attention layers.")

LoRA adaptation applied to attention layers.
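
For comparison, standard LoRA keeps the original layer frozen and adds a scaled low-rank update in parallel, h = Wx + (alpha / r) · B(Ax). The following is a minimal sketch of that formulation, not the notebook's own method; the class name `LoRALinear` and the `alpha` parameter are assumptions introduced for illustration. It wraps a plain nn.Linear; GPT-2 in transformers implements its attention projections with a Conv1D module instead, so the wrapper would need adapting before being applied there.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear and add a trainable low-rank update in parallel."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False  # keep the pretrained weights fixed
        self.down_proj = nn.Linear(base.in_features, r, bias=False)
        self.up_proj = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.up_proj.weight)  # the update starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        # Base output plus the scaled low-rank correction
        return self.base(x) + self.scaling * self.up_proj(self.down_proj(x))


torch.manual_seed(0)
base = nn.Linear(16, 32)
wrapped = LoRALinear(base, r=4)
x = torch.randn(2, 16)
same_at_init = torch.allclose(wrapped(x), base(x))
trainable = sum(p.numel() for p in wrapped.parameters() if p.requires_grad)
print(same_at_init, trainable)
```

Because the up-projection starts at zero, the wrapped layer reproduces the original output before training, and only the two small projection matrices receive gradients.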


Prepare Dataset and Tokenizer

In [None]:
# Example synthetic data for feedback generation
data = {
    "student_response": [
        "I'm confused about sorting algorithms.",
        "I don't understand how to calculate the time complexity.",
        "What is the difference between DFS and BFS?"
    ],
    "feedback": [
        "Sorting algorithms organize data in a specific order.",
        "Time complexity helps you understand how fast an algorithm runs.",
        "DFS explores as deep as possible before backtracking, while BFS explores neighbor nodes first."
    ]
}

In [None]:
!pip install datasets


In [None]:
from datasets import Dataset

dataset = Dataset.from_dict(data)
tokenizer.pad_token = tokenizer.eos_token
# Tokenize this custom dataset
def tokenize_function(examples):
    return tokenizer(examples["student_response"], padding="max_length", truncation=True, max_length=128)

tokenized_dataset = dataset.map(tokenize_function, batched=True)


In [None]:
print(f"Number of samples in the dataset: {len(tokenized_dataset)}")

Number of samples in the dataset: 2


In [None]:
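from datasets import DatasetDict

# Split dataset into train/test.
# Run this cell only once: afterwards tokenized_dataset is a DatasetDict,
# which has no train_test_split method of its own.
split_dataset = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
tokenized_dataset = DatasetDict({
    "train": split_dataset["train"],
    "test": split_dataset["test"]
})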

AttributeError: 'DatasetDict' object has no attribute 'train_test_split'

In [None]:
tokenized_dataset

Dataset({
    features: ['student_response', 'feedback', 'input_ids', 'attention_mask'],
    num_rows: 3
})

**Define the Custom Training Loop with KL Divergence Loss**
KL divergence loss can be used to control the output distribution. It is typically added to the usual cross-entropy loss so that the model's output stays aligned with a desired reference distribution.

In [None]:
print(len(tokenized_dataset['train']))


1


In [None]:
!pip install datasets
from datasets import Dataset, DatasetDict
from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated in recent releases
from transformers import get_scheduler
import sys

# Assuming 'tokenized_dataset' already exists as a DatasetDict with a 'train' split

# Split the train split again into train and test (adjust test_size as needed).
# The split being divided needs more than one sample; otherwise the resulting
# train set is empty.
# You can also set both sizes explicitly,
# e.g., train_test_split(test_size=0.1, train_size=0.8, seed=42)
split_dataset = tokenized_dataset['train'].train_test_split(test_size=0.2, seed=42)

# Convert the split dataset to a DatasetDict
tokenized_dataset = DatasetDict({
    "train": split_dataset["train"],
    "test": split_dataset["test"]
})

# For causal language modeling the labels are a copy of the input ids
def preprocess_function(examples):
    return {"labels": [list(ids) for ids in examples["input_ids"]]}

tokenized_dataset = tokenized_dataset.map(preprocess_function, batched=True)

# Return PyTorch tensors rather than Python lists when batches are drawn
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])


# Now you can access the train split
train_dataloader = DataLoader(tokenized_dataset["train"], shuffle=True, batch_size=8)

# Set up optimizer and learning rate scheduler
sys.setrecursionlimit(10000)
optimizer = AdamW(model.parameters(), lr=2e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)




ValueError: With n_samples=1, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
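
The ValueError above follows from how fractional split sizes are rounded. The sketch below assumes the scikit-learn-style rule that the error message suggests (test share rounded up, train share rounded down); the helper `split_sizes` is hypothetical and only reproduces the arithmetic.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)          # test share rounds up
    n_train = math.floor((1 - test_size) * n_samples)  # train share rounds down
    return n_train, n_test


three_samples = split_sizes(3, 0.2)
one_sample = split_sizes(1, 0.2)
print(three_samples, one_sample)
```

With three samples the split leaves two for training and one for testing. Splitting a single remaining sample again leaves nothing for training, which is the situation the error reports.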

Define the Training Loop

In [None]:
from tqdm.auto import tqdm

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Training Loop
model.train()
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_epochs):
    for batch in train_dataloader:
        inputs = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)

        # Forward pass
        outputs = model(inputs, labels=labels)
        logits = outputs.logits

        # Compute Cross-Entropy Loss
        ce_loss = outputs.loss

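        # Compute KL Divergence Loss between the logits and a reference distribution.
        # kl_divergence_loss is defined in the next cell; run that cell first.
        # The teacher here is the same model, so the KL term stays near zero;
        # use a separate frozen pre-trained model as the reference in practice.
        with torch.no_grad():
            teacher_logits = model(inputs).logits

        kl_loss = kl_divergence_loss(logits, teacher_logits)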

        # Combine losses
        total_loss = ce_loss + 0.5 * kl_loss  # Adjust weight of kl_loss as needed

        # Backpropagation
        total_loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

        progress_bar.update(1)


  0%|          | 0/3 [00:00<?, ?it/s]

AttributeError: 'list' object has no attribute 'to'

In [None]:
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated in recent releases
from transformers import get_scheduler

# Custom KL Divergence Loss function
def kl_divergence_loss(student_logits, teacher_logits, temperature=1.0):
    # kl_div expects log-probabilities for the input and probabilities for the target
    student_log_probs = torch.nn.functional.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = torch.nn.functional.softmax(teacher_logits / temperature, dim=-1)
    kl_loss = torch.nn.functional.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (temperature ** 2)
    return kl_loss

# Prepare DataLoader
train_dataloader = DataLoader(tokenized_dataset["train"], shuffle=True, batch_size=8)

# Set up optimizer and learning rate scheduler
optimizer = AdamW(model.parameters(), lr=2e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)


KeyError: "Column train not in the dataset. Current columns in the dataset: ['student_response', 'feedback', 'input_ids', 'attention_mask']"
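
To make the KL term easier to check by hand, here is a plain-Python version of the same computation for a single token position. It is a sketch for illustration only: the function names are hypothetical, and it omits the batch averaging that `reduction="batchmean"` performs in the PyTorch version.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def kl_divergence(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) for one position, scaled by temperature squared."""
    student = softmax(student_logits, temperature)
    teacher = softmax(teacher_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s)) for s, t in zip(student, teacher) if t > 0)
    return kl * temperature ** 2


identical = kl_divergence([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = kl_divergence([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
print(identical, different)
```

Identical logits give a divergence of zero, and the value grows as the student distribution moves away from the teacher's. This is why a teacher that is the same model as the student contributes almost nothing to the combined loss.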

Generate Random Metacognitive Vector for Each Student
First, you need to create a random metacognitive vector for each student. This vector represents the metacognitive attributes for each student (planning, monitoring, and evaluation). The length of the vector could correspond to different metacognitive attributes.

In [None]:
import numpy as np
# Function to generate random metacognitive vector
def generate_metacognitive_vector():
    # Random metacognitive vector for planning, monitoring, and evaluation attributes
    return np.random.randint(1, 6, size=15)  # Assuming each student has a 15-length vector

# Generate random metacognitive vectors for each student
num_records = 100  # Assuming you want 100 records for example
metacognitive_vectors = [generate_metacognitive_vector() for _ in range(num_records)]


In [None]:
metacognitive_vectors

[array([4, 5, 1, 5, 1, 4, 3, 2, 1, 5, 3, 2, 1, 3, 4]),
 array([4, 1, 2, 3, 3, 4, 5, 3, 4, 4, 3, 5, 2, 4, 2]),
 array([5, 4, 4, 1, 5, 2, 4, 5, 2, 4, 4, 3, 2, 3, 2]),
 array([5, 2, 2, 3, 4, 4, 5, 4, 4, 2, 3, 5, 3, 3, 5]),
 array([1, 5, 5, 1, 4, 1, 2, 2, 5, 4, 4, 3, 2, 5, 4]),
 array([2, 5, 3, 5, 4, 4, 5, 5, 2, 3, 1, 3, 1, 1, 2]),
 array([2, 1, 2, 2, 3, 1, 1, 1, 4, 3, 3, 5, 1, 4, 5]),
 array([3, 4, 4, 5, 5, 5, 2, 3, 2, 3, 1, 1, 1, 4, 4]),
 array([4, 5, 3, 4, 5, 4, 3, 1, 2, 4, 3, 2, 3, 4, 2]),
 array([2, 3, 2, 3, 2, 3, 4, 1, 3, 3, 5, 3, 1, 4, 1]),
 array([4, 5, 5, 3, 4, 4, 1, 4, 1, 1, 2, 4, 1, 2, 3]),
 array([1, 3, 3, 2, 2, 3, 2, 2, 1, 3, 3, 4, 4, 3, 5]),
 array([5, 2, 1, 4, 1, 5, 4, 3, 4, 1, 3, 1, 1, 4, 2]),
 array([1, 3, 2, 2, 3, 5, 4, 2, 2, 3, 3, 5, 5, 3, 5]),
 array([5, 4, 4, 4, 5, 3, 3, 2, 2, 1, 3, 4, 4, 5, 2]),
 array([5, 1, 3, 1, 2, 4, 3, 5, 2, 4, 4, 1, 4, 2, 4]),
 array([1, 2, 4, 4, 5, 5, 4, 3, 1, 1, 4, 1, 1, 1, 3]),
 array([1, 1, 5, 5, 1, 3, 4, 3, 2, 3, 2, 4, 5, 2, 2]),
 array([2,

Incorporate the Metacognitive Vectors into the Dataset

In [None]:
# Adding metacognitive vectors to DataFrame
df['metacognitive_vector'] = metacognitive_vectors

NameError: name 'df' is not defined


# Generating Dataset

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Step 1: Define the algorithm questions and answers
algorithm_questions = [
    "Design an algorithm to find the nth Fibonacci number in an efficient way.",
    "Develop an algorithm that determines if a list of integers is sorted in non-decreasing order.",
    "Create an algorithm to find the smallest missing positive integer in an unsorted list.",
    "Design an algorithm to find the longest consecutive sequence in an array of integers.",
    "Develop an algorithm that calculates the maximum profit from stock prices, given a list of daily prices."
]

python_answers = [
    "def nth_fibonacci(n):\n    if n <= 0:\n        return 0\n    elif n == 1:\n        return 1\n    else:\n        return nth_fibonacci(n-1) + nth_fibonacci(n-2)",
    "def is_sorted(lst):\n    return all(lst[i] <= lst[i+1] for i in range(len(lst) - 1))",
    "def find_missing_positive(arr):\n    arr = sorted(set(arr))\n    smallest = 1\n    for num in arr:\n        if num == smallest:\n            smallest += 1\n    return smallest",
    "def longest_consecutive_sequence(arr):\n    arr = sorted(set(arr))\n    longest, current = 0, 1\n    for i in range(1, len(arr)):\n        if arr[i] == arr[i-1] + 1:\n            current += 1\n        else:\n            longest = max(longest, current)\n            current = 1\n    return max(longest, current)",
    "def max_profit(prices):\n    min_price = float('inf')\n    max_profit = 0\n    for price in prices:\n        if price < min_price:\n            min_price = price\n        elif price - min_price > max_profit:\n            max_profit = price - min_price\n    return max_profit"
]

In [None]:
# Step 2: Generate metacognitive vectors
def generate_metacognitive_vector():
    # Generate a random metacognitive vector (15 attributes for example)
    return np.random.randint(1, 6, size=15)

In [None]:
# Step 3: Generate dataset
num_records = 100  # Number of student records you want
metacognitive_vectors = [generate_metacognitive_vector() for _ in range(num_records)]

In [None]:

# Function to generate algorithm question and answer pair
def generate_algorithm_question_answer():
    # Pick one index so that the answer corresponds to the chosen question
    idx = np.random.randint(len(algorithm_questions))
    return algorithm_questions[idx], python_answers[idx]

In [None]:
# Generate algorithm questions and answers
algo_questions_answers = [generate_algorithm_question_answer() for _ in range(num_records)]
algo_questions, python_code_answers = zip(*algo_questions_answers)

In [None]:
# Step 4: Generate feedback based on code and metacognitive vector
def generate_feedback_with_code(vector, code_answer):
    # Split vector into planning, monitoring, evaluation.
    # Use the mean of each group so every score stays on the 1-5 rating scale;
    # summed scores would almost never fall below the thresholds used here.
    planning_score = np.mean(vector[:7])
    monitoring_score = np.mean(vector[7:12])
    evaluation_score = np.mean(vector[12:])

    # Initialize feedback lists
    feedback = []
    general_feedback = []

    # Generate feedback based on metacognitive vector.
    # The cut-offs of 3 and 4 are assumed values; tune them to your scale.
    if planning_score < 3:
        feedback.append("Your planning could be more detailed for this problem.")
    elif planning_score < 4:
        feedback.append("Planning is adequate; aim for more specific steps.")
    else:
        feedback.append("Strong planning approach!")

    if monitoring_score < 3:
        feedback.append("Pay closer attention to each step in your code.")
    elif monitoring_score < 4:
        feedback.append("Good monitoring; watch for minor errors in syntax or logic.")
    else:
        feedback.append("Great job with monitoring your implementation.")

    if evaluation_score < 3:
        feedback.append("Check your solution against all problem requirements.")
    elif evaluation_score < 4:
        feedback.append("Evaluation is acceptable, but ensure full accuracy.")
    else:
        feedback.append("Excellent evaluation skills!")

    # Add code-based feedback to both general and personalized feedback
    if "return nth_fibonacci(n-1) + nth_fibonacci(n-2)" in code_answer:
        feedback.append("Consider optimizing the recursive Fibonacci solution to improve efficiency.")
        general_feedback.append("Consider optimizing the recursive Fibonacci solution to improve efficiency.")
    elif "all(lst[i] <= lst[i+1] for i in range(len(lst) - 1))" in code_answer:
        feedback.append("Good solution for checking sorted order.")
        general_feedback.append("Good solution for checking sorted order.")
    # Check the longest-sequence marker first: that answer also contains
    # "sorted(set(arr))" and would otherwise never reach its own branch
    elif "longest, current = 0, 1" in code_answer:
        feedback.append("Good approach to find the longest sequence; verify edge cases.")
        general_feedback.append("Good approach to find the longest sequence; verify edge cases.")
    elif "sorted(set(arr))" in code_answer:
        feedback.append("Effective handling of duplicates; consider edge cases like an empty array.")
        general_feedback.append("Effective handling of duplicates; consider edge cases like an empty array.")
    elif "max_profit = price - min_price" in code_answer:
        feedback.append("Good solution for max profit, but ensure it handles all input cases.")
        general_feedback.append("Good solution for max profit, but ensure it handles all input cases.")

    # Combine feedback into final strings
    personalized_feedback = " ".join(feedback)
    general_feedback = " ".join(general_feedback)

    return personalized_feedback, general_feedback
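
The function above slices the 15-item vector into three groups. The sketch below makes that layout explicit and shows the range a summed score can take in each group when every item is rated from 1 to 5, which is useful when choosing thresholds; dividing a sum by the group size gives the corresponding mean. The helper names are hypothetical.

```python
def split_metacognitive_vector(vector):
    """Split a 15-item vector into planning, monitoring, and evaluation parts."""
    if len(vector) != 15:
        raise ValueError("expected 15 attributes")
    return {
        "planning": list(vector[:7]),
        "monitoring": list(vector[7:12]),
        "evaluation": list(vector[12:]),
    }


def score_range(n_items, low=1, high=5):
    """Smallest and largest possible sum for n_items ratings."""
    return n_items * low, n_items * high


parts = split_metacognitive_vector([4, 1, 4, 4, 5, 3, 3, 2, 2, 3, 3, 1, 1, 4, 5])
sizes = {name: len(items) for name, items in parts.items()}
ranges = {name: score_range(count) for name, count in sizes.items()}
print(sizes)
print(ranges)
```

The groups hold 7, 5, and 3 items, so their sums span different ranges and cannot share one set of cut-offs unless the scores are first normalized.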


In [None]:
# Step 5: Create DataFrame
feedbacks = [generate_feedback_with_code(vector, code) for vector, code in zip(metacognitive_vectors, python_code_answers)]
metacognitive_feedbacks, general_feedbacks = zip(*feedbacks)
df = pd.DataFrame({
    "algorithm_question": algo_questions,
    "student_python_answer": python_code_answers,
    "metacognitive_vector": metacognitive_vectors,
    "general_feedback": general_feedbacks,
    "metacognitive_feedback": metacognitive_feedbacks
})

In [None]:
# Save the dataset to CSV
df.to_csv('student_algorithm_feedback_dataset.csv', index=False)


In [None]:

print(df.head())  # Check the first few rows of the dataset

                                  algorithm_question  \
0  Develop an algorithm that calculates the maxim...   
1  Create an algorithm to find the smallest missi...   
2  Design an algorithm to find the longest consec...   
3  Design an algorithm to find the nth Fibonacci ...   
4  Design an algorithm to find the nth Fibonacci ...   

                               student_python_answer  \
0  def max_profit(prices):\n    min_price = float...   
1  def find_missing_positive(arr):\n    arr = sor...   
2  def longest_consecutive_sequence(arr):\n    ar...   
3  def longest_consecutive_sequence(arr):\n    ar...   
4  def find_missing_positive(arr):\n    arr = sor...   

                            metacognitive_vector  \
0  [4, 1, 4, 4, 5, 3, 3, 2, 2, 3, 3, 1, 1, 4, 5]   
1  [5, 5, 3, 3, 2, 2, 2, 2, 5, 5, 4, 4, 3, 1, 4]   
2  [2, 3, 2, 3, 5, 3, 2, 2, 2, 3, 1, 5, 4, 3, 3]   
3  [4, 2, 1, 3, 4, 2, 4, 1, 3, 1, 4, 3, 1, 2, 5]   
4  [1, 1, 3, 2, 1, 1, 3, 5, 1, 4, 3, 5, 2, 2, 5]   

             

# Dataset creation-2

In [None]:
import pandas as pd
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
file_path = "/content/drive/MyDrive/annotated_dataset.csv"
df = pd.read_csv(file_path)

In [None]:
df.head()

Unnamed: 0,feedback,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16
0,"[\n {\n 'line_number': 2,\n 'feedback...",3,1,3,2,3,1,2,3,2,2,1,2,2,3,1,2
1,"[\n {\n 'line_number': 4,\n '...",2,2,2,2,1,1,3,3,2,1,1,2,3,2,3,2
2,"[\n {\n 'line_number': 2,\n '...",1,2,3,2,1,2,2,2,2,1,2,2,1,1,2,3
3,"[\n {\n 'line_number': 1,\n '...",2,1,2,3,1,1,2,2,1,3,2,1,3,1,1,1
4,"[\n {\n 'line_number': 3,\n '...",3,1,3,3,3,2,1,1,2,2,1,2,3,3,2,3


The dataset above includes a value for each dimension of metacognition. We now use these vectors together with the existing feedback to generate a new column containing the metacognitive feedback.

In [None]:
data = pd.DataFrame(df)

In [None]:
data

Unnamed: 0,feedback,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16
0,"[\n {\n 'line_number': 2,\n 'feedback...",3,1,3,2,3,1,2,3,2,2,1,2,2,3,1,2
1,"[\n {\n 'line_number': 4,\n '...",2,2,2,2,1,1,3,3,2,1,1,2,3,2,3,2
2,"[\n {\n 'line_number': 2,\n '...",1,2,3,2,1,2,2,2,2,1,2,2,1,1,2,3
3,"[\n {\n 'line_number': 1,\n '...",2,1,2,3,1,1,2,2,1,3,2,1,3,1,1,1
4,"[\n {\n 'line_number': 3,\n '...",3,1,3,3,3,2,1,1,2,2,1,2,3,3,2,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
361,"[\n {\n ""line_number"": 8,\n ""...",2,3,1,3,2,2,3,2,3,1,3,2,3,2,1,2
362,"[\n {\n 'line_number': 8,\n '...",2,1,3,2,3,1,1,3,2,1,1,2,2,1,3,1
363,"[\n {\n 'line_number': 8,\n 'feedbac...",1,1,1,2,1,2,2,3,2,3,3,2,2,3,3,2
364,"[\n {\n 'line_number': 2,\n '...",1,3,3,1,2,2,3,3,2,2,2,2,2,1,1,3



In [None]:
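# Select the 16 attribute columns by name so that re-running the cell is safe
vector_columns = [f"col{i}" for i in range(1, 17)]
data['metacognitive_vector'] = data[vector_columns].values.tolist()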

In [None]:
data.head()

Unnamed: 0,feedback,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,metacognitive_vector
0,"[\n {\n 'line_number': 2,\n 'feedback...",3,1,3,2,3,1,2,3,2,2,1,2,2,3,1,2,"[3, 1, 3, 2, 3, 1, 2, 3, 2, 2, 1, 2, 2, 3, 1, 2]"
1,"[\n {\n 'line_number': 4,\n '...",2,2,2,2,1,1,3,3,2,1,1,2,3,2,3,2,"[2, 2, 2, 2, 1, 1, 3, 3, 2, 1, 1, 2, 3, 2, 3, 2]"
2,"[\n {\n 'line_number': 2,\n '...",1,2,3,2,1,2,2,2,2,1,2,2,1,1,2,3,"[1, 2, 3, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 3]"
3,"[\n {\n 'line_number': 1,\n '...",2,1,2,3,1,1,2,2,1,3,2,1,3,1,1,1,"[2, 1, 2, 3, 1, 1, 2, 2, 1, 3, 2, 1, 3, 1, 1, 1]"
4,"[\n {\n 'line_number': 3,\n '...",3,1,3,3,3,2,1,1,2,2,1,2,3,3,2,3,"[3, 1, 3, 3, 3, 2, 1, 1, 2, 2, 1, 2, 3, 3, 2, 3]"


In [None]:
import os
import openai
from openai import OpenAI



In [None]:
!pip install python-dotenv



In [None]:
from dotenv import load_dotenv
load_dotenv()

True

In [None]:
# api_key = os.getenv("OPENAI_API_KEY")
# if api_key:
#     print("API key loaded successfully!")
# else:
#     print("API key not found in environment variables. Check your .env file and load_dotenv() call.")
#     print("Current working directory:", os.getcwd())

In [None]:
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),  # Get the API key from the environment variable
)

In [None]:
# !pip install -qU mistralai

In [None]:
!pip install huggingface_hub
!huggingface-cli logout
!huggingface-cli login




In [None]:
pip install transformers torch
from transformers import AutoTokenizer, AutoModelForCausalLM



In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2.
403 Client Error. (Request ID: Root=1-672f0d9e-33b125814fc8ff201dbd0883;a8764139-72d2-406c-9de0-28682e33503c)

Cannot access gated repo for url https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/resolve/main/config.json.
Access to model mistralai/Mistral-7B-Instruct-v0.2 is restricted and you are not in the authorized list. Visit https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 to ask for access.

In [None]:
def generate_metacognitive_feedback_llm(general_feedback, vector):
    # Construct a prompt for the LLM using general feedback and the metacognitive vector
    prompt = f"""
    You are an AI assistant helping a student improve their coding skills. Based on the feedback provided for a specific line of code,
    generate metacognitive feedback that encourages the student to reflect on their coding process and mistakes. Use the provided
    metacognitive vector to adjust the depth of the feedback.

    General Feedback: {general_feedback}

    Metacognitive Vector (self-regulation): {vector}

    Your feedback should guide the student in reflecting on their approach, identifying mistakes, and improving their future coding practices.
    Encourage a growth mindset and self-assessment.

    Example feedback: "Think about why this error occurred and how you could avoid similar issues in the future..."
    """



    # Call the LLM to generate the response (using OpenAI as an example)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo", # You can change this to GPT-4 or another model
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        max_tokens=150,  # You can adjust this based on the response length you need
        temperature=0.8  # Control the creativity of the model's response
    )

    # Return the generated feedback
    return response.choices[0].message.content.strip()
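
Separating prompt construction from the API call makes it possible to inspect the prompt without credentials. The sketch below is one way to do that; the helper name `build_feedback_prompt` and the shortened instruction text are assumptions for illustration. Building the prompt from a list of sections also avoids the leading indentation that an indented triple-quoted string carries into the request.

```python
def build_feedback_prompt(general_feedback, vector):
    """Assemble the prompt text without calling any API."""
    vector_text = ", ".join(str(value) for value in vector)
    sections = [
        "You are an AI assistant helping a student improve their coding skills.",
        f"General Feedback: {general_feedback}",
        f"Metacognitive Vector (self-regulation): [{vector_text}]",
        "Guide the student to reflect on their approach and encourage a growth mindset.",
    ]
    # Blank lines between sections keep the prompt readable
    return "\n\n".join(sections)


prompt = build_feedback_prompt("Off-by-one error in the loop bound.", [3, 1, 2])
print(prompt)
```

The returned string could then be passed as the `content` of the user message in the call above.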



In [None]:
data['metacognitive_feedback'] = data.apply(
    lambda row: generate_metacognitive_feedback_llm(row['feedback'], row['metacognitive_vector']), axis=1
)


AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: "sk-proj**********************************************************************************************************************************************************NEA". You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

In [None]:
print(data[['feedback', 'metacognitive_feedback']])

# LoRA Model Training

In [None]:
import torch
from torch.utils.data import DataLoader, Dataset

In [None]:
# Step 1: Load the dataset
class FeedbackDataset(Dataset):
    def __init__(self, questions, answers, feedback, tokenizer, max_length=512):
        self.questions = questions
        self.answers = answers
        self.feedback = feedback
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.questions)

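    def __getitem__(self, idx):
        question = self.questions[idx]
        answer = self.answers[idx]
        feedback = self.feedback[idx]

        inputs = self.tokenizer(question + " " + answer, truncation=True, padding='max_length', max_length=self.max_length, return_tensors="pt")
        targets = self.tokenizer(feedback, truncation=True, padding='max_length', max_length=self.max_length, return_tensors="pt")

        # Drop the leading batch dimension so the DataLoader can stack samples
        labels = targets["input_ids"].squeeze(0)
        # Exclude padding positions from the loss
        labels[targets["attention_mask"].squeeze(0) == 0] = -100

        return {
            "input_ids": inputs["input_ids"].squeeze(0),
            "attention_mask": inputs["attention_mask"].squeeze(0),
            "labels": labels,
        }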

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated in recent releases

In [None]:
# # Load pre-trained tokenizer
# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")




In [None]:
# Create DataLoader for training
train_dataset = FeedbackDataset(df['algorithm_question'].values, df['student_python_answer'].values, df['metacognitive_feedback'].values, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

NameError: name 'df' is not defined

In [None]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

In [None]:
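import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # load weights in 8-bit via bitsandbytes
    device_map='auto',
)

# Load the tokenizer from the same checkpoint as the model
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")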



In [None]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

NameError: name 'model' is not defined

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,  # LoRA rank (inner dimension of the low-rank update matrices)
    lora_alpha=32,  # scaling factor; the update is scaled by lora_alpha / r
    # target_modules=["q_proj", "v_proj"], #if you know the
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM" # set this for CLM or Seq2Seq
)

model = get_peft_model(model, config)
print_trainable_parameters(model)
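
To make the `r` and `lora_alpha` settings above more concrete, here is a rough back-of-the-envelope sketch. For a frozen weight of shape `d_out x d_in`, LoRA trains two small matrices, `A` (`r x d_in`) and `B` (`d_out x r`), and scales their product by `lora_alpha / r`. The layer width of 1024 is an assumed illustrative value, not something read from the loaded model, and `lora_param_count` and `lora_scaling` are hypothetical helper names.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Extra trainable parameters LoRA adds to one linear layer (A is r x d_in, B is d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


# Assumed layer shape, for illustration only
d_in, d_out = 1024, 1024
r, lora_alpha = 16, 32

full_params = d_in * d_out
extra_params = lora_param_count(d_in, d_out, r)
scaling = lora_scaling(lora_alpha, r)

print(f"frozen weight params: {full_params}")
print(f"LoRA trainable params: {extra_params} ({100 * extra_params / full_params:.3f}% of the layer)")
print(f"update scaling (lora_alpha / r): {scaling}")
```

Under these assumptions the adapter adds only a few percent of the layer's parameters, which is why the trainable percentage reported by `print_trainable_parameters` is expected to be small.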

In [None]:
file_path = "/content/student_algorithm_feedback_dataset.csv"


In [None]:
import transformers
from datasets import load_dataset
data = load_dataset("csv", data_files=file_path)


In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['algorithm_question', 'student_python_answer', 'metacognitive_vector', 'general_feedback', 'metacognitive_feedback'],
        num_rows: 100
    })
})

In [None]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, AutoTokenizer
from datasets import DatasetDict

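# Initialize the tokenizer from the same checkpoint as the model being fine-tuned
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # fall back to EOS only if no pad token is defined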

# Step 1: Preprocessing function
def preprocess_data(examples):
    # Concatenate `algorithm_question`, `student_python_answer`, and `metacognitive_vector` as input text
    inputs = [
        f"Question: {question} Answer: {answer} Metacognitive Vector: {vector}"
        for question, answer, vector in zip(examples['algorithm_question'], examples['student_python_answer'], examples['metacognitive_vector'])
    ]
    # Tokenize the input texts
    model_inputs = tokenizer(inputs, padding="max_length", truncation=True, max_length=512)  # Adjust max_length as needed
    return model_inputs

# Step 2: Apply preprocessing to the dataset
dataset = DatasetDict({
    "train": data['train'].map(preprocess_data, batched=True)
})

# Remove columns not needed for training after tokenization
dataset['train'] = dataset['train'].remove_columns(['algorithm_question', 'student_python_answer', 'metacognitive_vector', 'general_feedback', 'metacognitive_feedback'])

# Step 3: Initialize the DataCollator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
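
Note that the preprocessing above builds the input text from the question, the answer and the vector only, so the `metacognitive_feedback` column never reaches the model. One possible variant, sketched here with a hypothetical `build_training_text` helper and an assumed `Feedback:` marker, appends the target feedback so that a causal LM sees it during training. The example row is invented for illustration.

```python
def build_training_text(question: str, answer: str, vector: str, feedback: str) -> str:
    """Join the prompt fields and the target feedback into one training string."""
    prompt = f"Question: {question} Answer: {answer} Metacognitive Vector: {vector}"
    # Appending the target lets the causal LM learn to continue the prompt with feedback
    return f"{prompt} Feedback: {feedback}"


# Invented example row, for illustration only
row = {
    "question": "Reverse a list.",
    "answer": "def rev(xs): return xs[::-1]",
    "vector": "[0.2, 0.9, 0.5]",
    "feedback": "Good use of slicing; explain why it creates a copy.",
}

text = build_training_text(**row)
print(text)
```

At inference time the same prompt would be used up to and including `Feedback:`, leaving the model to generate the rest.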

In [None]:
dataset['train']
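
The tokenized dataset holds only `input_ids` and `attention_mask`; the labels are created later by the collator. With `mlm=False`, `DataCollatorForLanguageModeling` copies `input_ids` into `labels` and replaces padding positions with `-100` so the loss ignores them. Below is a plain-Python sketch of that masking step; the token ids are invented and `make_causal_lm_labels` is a hypothetical name, not a library function.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def make_causal_lm_labels(input_ids: list[int], pad_token_id: int) -> list[int]:
    """Copy input_ids into labels, masking every padding position."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in input_ids]


# Invented token ids; 0 plays the role of the pad token here
input_ids = [15, 27, 42, 0, 0]
labels = make_causal_lm_labels(input_ids, pad_token_id=0)
print(labels)
```

One consequence worth keeping in mind: if the pad token is set to the EOS token, every EOS position is masked as well, so the model receives no training signal for ending a sequence.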

In [None]:
# Step 4: Define the Trainer and TrainingArguments
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    max_steps=200,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    output_dir='outputs',
    report_to="none"  # Optional: disable reporting to external trackers (e.g. W&B, TensorBoard)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    data_collator=data_collator,
)
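
As a quick sanity check on these settings, the arithmetic below estimates how much data the run consumes. It is a sketch that assumes a single device and uses the 100-row training split shown earlier; the variable names are illustrative.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 200
warmup_steps = 100
num_rows = 100      # size of the training split shown earlier
num_devices = 1     # assumption: a single GPU

# Examples contributing to each optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Total examples processed over the whole run
examples_seen = effective_batch * max_steps
# Approximate number of passes over the dataset
approx_epochs = examples_seen / num_rows
# Share of the run spent warming up the learning rate
warmup_fraction = warmup_steps / max_steps

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} (~{approx_epochs:.0f} passes over {num_rows} rows)")
print(f"warmup fraction: {warmup_fraction:.0%}")
```

Under these assumptions the model passes over the small dataset many times and spends half of the run in warmup, so `max_steps` and `warmup_steps` may be worth revisiting if overfitting shows up.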


In [None]:
# Step 5: Set `use_cache` to `False` and train
model.config.use_cache = False  # The KV cache only helps generation and is incompatible with gradient checkpointing
trainer.train()