In [1]:
!pip install transformers datasets evaluate --quiet

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling, EarlyStoppingCallback
from datasets import load_dataset
import math
import warnings

warnings.filterwarnings('ignore')

In [3]:
# Text generation task on a dataset of Python coding questions and answers

output_dir = './gpt2-codefeedback'
model_name = 'distilgpt2'
dataset_name = 'fxmeng/CodeFeedback-Python105K'

dataset = load_dataset(dataset_name)
print(f"Dataset structure: \n{dataset}")
print(f"Dataset sample: \n{dataset['train'][0]}")

Dataset structure: 
DatasetDict({
    train: Dataset({
        features: ['query', 'response'],
        num_rows: 104848
    })
})
Dataset sample: 
{'query': 'Create a nested loop to print every combination of numbers between 0-9, excluding any combination that contains the number 5. Additionally, exclude any combination that contains a repeating digit. Implement the solution without using any built-in functions or libraries to check for repeating digits.', 'response': 'Here is an example of a nested loop in Python to print every combination of numbers between 0-9, excluding any combination that contains the number 5 or repeating digits:\n\n```python\nfor i in range(10):  # First digit\n    for j in range(10):  # Second digit\n        for k in range(10):  # Third digit\n            # Checking for the conditions\n            if i != 5 and j != 5 and k != 5 and i != j and i != k and j != k:\n                print(i, j, k)\n```\n\nThis code will generate and print every combination of thr

In [4]:
# The dataset contains only a train split, so we take a random sample of 10,000 rows and split it into two parts: train and test (10% of the sample) to evaluate the model

dataset = dataset['train'].shuffle().select(range(10000)).train_test_split(test_size=0.1)
train = dataset['train']
test = dataset['test']
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['query', 'response'],
        num_rows: 9000
    })
    test: Dataset({
        features: ['query', 'response'],
        num_rows: 1000
    })
})


In [5]:
# Now let's preprocess the data specifically for our task and the 'distilgpt2' model

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

def preprocess(example):
  tokenized = tokenizer(
      f"# Question:\n{example['query']}\n# Answer:\n{example['response']}",
      truncation=True,
      padding='max_length',
      max_length=512
  )
  tokenized['labels'] = tokenized['input_ids'].copy()
  return tokenized

tokenized_train = train.map(preprocess, remove_columns=train.column_names)
tokenized_test = test.map(preprocess, remove_columns=test.column_names)
print(f"We converted {train[0]} to \n {tokenized_train[0]}")

We converted {'query': 'Write an algorithm to find the common elements in two arrays. The algorithm should have a time complexity of O(n), where n is the size of the larger array. Additionally, you are not allowed to use any additional data structures or built-in functions for solving the problem.', 'response': '1. Initialize an empty array, commonElements, to store the common elements found in both arrays.\n2. Iterate through the first array, array1, and create a dictionary (hash map) where the keys are the elements in array1 and the values are boolean flags indicating if the element has been encountered or not. Set all values to False initially.\n3. Iterate through the second array, array2.\n    - Check if the current element of array2 exists as a key in the dictionary created in step 2.\n    - If it does, check if the value associated with the key is False. If it is, set the value to True and add the element to the commonElements array.\n4. Return the commonElements array.\n\nPseudo
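
A note on the labels built above. As far as I understand, `DataCollatorForLanguageModeling` with `mlm=False` (used in the training cell) rebuilds `labels` from `input_ids` and sets every position equal to the pad token id to -100, so padded positions do not contribute to the loss. Because `pad_token` is set to `eos_token` here, an explicit end-of-text token appended to an example would be masked as well. The sketch below imitates that masking rule in plain Python. The helper `mask_pad_labels` and the toy token ids are hypothetical; only the id 50256 is taken from the notebook (it is the `eos_token_id` reported in the generation warning).

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default

def mask_pad_labels(input_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing pad-token positions with ignore_index."""
    return [ignore_index if tok == pad_token_id else tok for tok in input_ids]

EOS_ID = 50256  # EOS id of the tokenizer; also the pad id because pad_token = eos_token
toy_input_ids = [2, 18233, 25, 198, EOS_ID, EOS_ID, EOS_ID]  # made-up ids plus padding

labels = mask_pad_labels(toy_input_ids, pad_token_id=EOS_ID)
print(labels)
```

With `padding='max_length'` and `max_length=512`, short examples consist mostly of such ignored positions, so the loss is computed on the real tokens only.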

In [6]:
# Let's define the model

model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    eval_strategy='epoch',
    num_train_epochs=3,
    learning_rate=5e-5,
    logging_strategy='steps',
    logging_steps=10,
    save_strategy='epoch',
    report_to='none',
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    greater_is_better=False,
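    resume_from_checkpoint=True,  # note: Trainer does not read this field itself; to actually resume, pass resume_from_checkpoint to trainer.train()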
    fp16=True
)
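
To get a feel for what these arguments imply, the number of optimizer steps can be estimated from the split sizes printed earlier. This is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation (the `TrainingArguments` default of 1); the variable names are mine and are not used elsewhere in the notebook.

```python
import math

train_rows = 9000        # size of the train split printed above
per_device_batch = 2     # per_device_train_batch_size
num_devices = 1          # assumption: a single GPU (or CPU)
grad_accum_steps = 1     # assumption: default gradient_accumulation_steps
epochs = 3               # num_train_epochs
logging_steps = 10       # logging_steps

effective_batch = per_device_batch * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs
log_entries = total_steps // logging_steps

print(steps_per_epoch, total_steps, log_entries)
```

Under these assumptions one epoch is 4500 steps and the full run is 13500 steps, which is why a small batch size makes training on CPU so slow.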

In [7]:
# Train and evaluate the model

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)
trainer.train()

eval_results = trainer.evaluate()
perplexity = math.exp(eval_results['eval_loss'])
print(f"Perplexity: {perplexity:.2f}")

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Epoch,Training Loss,Validation Loss
1,1.7941,1.67233
2,1.6879,1.603865
3,1.3537,1.578177


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


Perplexity: 4.85
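
Perplexity here is the exponential of the mean cross-entropy loss on the evaluation set. As a small check, the same conversion can be applied to the validation losses from the table above; `val_losses` is just a hand-copied dictionary, not something the `Trainer` returns in this form.

```python
import math

# Validation loss per epoch, copied from the training log above
val_losses = {1: 1.67233, 2: 1.603865, 3: 1.578177}

perplexities = {epoch: math.exp(loss) for epoch, loss in val_losses.items()}
for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity {ppl:.2f}")
```

The epoch 3 value rounds to 4.85, which matches the printed result and is consistent with `load_best_model_at_end=True` selecting the checkpoint with the lowest `eval_loss`.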


In [8]:
# Save the model and tokenizer

trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

('./gpt2-codefeedback/tokenizer_config.json',
 './gpt2-codefeedback/special_tokens_map.json',
 './gpt2-codefeedback/vocab.json',
 './gpt2-codefeedback/merges.txt',
 './gpt2-codefeedback/added_tokens.json',
 './gpt2-codefeedback/tokenizer.json')

In [9]:
# Let's define the function that generates predictions:

def generate_response(query):
  input_ids = tokenizer.encode(
      f"# Question:\n{query}\n# Answer:\n",
      return_tensors='pt'
  ).to(model.device)
  output_ids = model.generate(
      input_ids,
      max_new_tokens=200,
      do_sample=True,
      temperature=0.2
  )
  result = tokenizer.decode(output_ids[0], skip_special_tokens=True)
  return result

# The model and tokenizer are already in memory, so we skip loading them. But the dataset variable was overwritten with its sample, so let's load the full dataset again
dataset = load_dataset(dataset_name)['train']

In [42]:
# Let's test the model on an example:

any_index = 111
print(generate_response(test[any_index]['query']))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


# Question:
Design a search engine algorithm that first sorts an array of words in alphabetical order and then conducts a binary search for input word queries. Additionally, the algorithm must also predict the next word in the sequence of a given input sentence using a basic predictive model based on the frequency of word sequences in a previously defined text corpus. 

words = [ 'hello', 'this', 'is', 'a', 'test' ]
# Answer:
Here is a Python solution using the `sieve` library in Python:

```python
import numpy as np

def binary_search(words):
    # Create a dictionary to store the words in alphabetical order
    words = []
    # Iterate through each word in the input sentence
    for word in words:
        # Check if the word is already in the dictionary
        if word.isalpha():
              # If it is, add it to the dictionary
             word.append(word)
                                           


In [43]:
# Let's see the true response for the test query:

print(test[any_index]['response'])

Python provides many helpful libraries for various tasks. One of them is Random library to work with random numbers, another is scikit-learn (a simple and efficient tool for data mining and data analysis), and datetime for date and time tasks. Following is Python code to solve the problem:

```python

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from collections import Counter
import numpy as np

class SearchEngine:
  def __init__(self, words, corpus):
    self.words = sorted(words)
    self.corpus = corpus.split(' ')
    self.model = self.train_model()

  def binary_search(self, target):
    low = 0
    high = len(self.words) - 1
    while low <= high:
      mid = (low + high) // 2
      if self.words[mid] == target:
        return mid
      elif self.words[mid] < target:
        low = mid + 1
      else:
        high = mid - 1
    return -1

  def train_model(self):
    self.cv = CountVectorizer(ngram_range=(2, 2))
    self

In this homework I fine-tuned a model that should generate a response with Python code based on the query. For that purpose I used the dataset 'fxmeng/CodeFeedback-Python105K', which consists of queries asking to write Python code and responses that mix explanatory text with Python code. I used a text-generation LLM for this task.

Experiment 1. Training on CPU.
The main problem I faced was that fine-tuning the model took too long.
To speed up training and reduce memory usage, I:
- changed the model from 'gpt2' to 'distilgpt2';
- reduced the train and test sample by ~100 times (from 105K to 1K);
- changed some hyperparameters.
Even with these simplified inputs, it took the model 5 hours to train.
As a result, the loss on the evaluation dataset improved from 2.05 to 1.97 and the perplexity score was acceptable (7.16). However, testing on real cases showed that the model had only slightly learned the response structure (including Python code structure) and generated fluent text, but did not learn to write correct Python code.

Experiment 2. Training on GPU T4.
In the 2nd experiment I switched from CPU to a T4 GPU and increased the sample size to 10K (10 times larger than in the 1st experiment).
This time it took 26 min to train the same model.
The outputs in the saved notebook are from the 2nd experiment.
The loss on the evaluation dataset improved from 1.67 to 1.58, and the perplexity score improved to 4.85.
The updated model generates even better Python code structure and syntax, but the code itself is still meaningless.
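
The gap between code that looks right and code that is right can be checked mechanically, at least for the syntax half. Below is a minimal sketch using only the standard library: it tries to parse a piece of code without executing it. The helper name `is_valid_python` and the two sample snippets are hypothetical (they stand in for code taken from model outputs after separating it from the surrounding prose), and passing this check says nothing about whether the code solves the task.

```python
import ast

def is_valid_python(code: str) -> bool:
    """Return True if `code` parses as Python source (syntax only, no execution)."""
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True

# Hypothetical snippets standing in for code extracted from generated answers
parses = "def binary_search(words):\n    words = []\n    for word in words:\n        pass\n"
broken = "def binary_search(words:\n    return words\n"

print(is_valid_python(parses), is_valid_python(broken))
```

The first snippet parses even though it overwrites its own argument and does nothing useful, so the share of outputs that pass such a check could only track the "structure and syntax" part of the observation above.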

So, further ways to improve the model are:
- to experiment with the data preprocessing to focus on the Python code,
- to scale up the setup (larger train sample, larger model type, tuned hyperparameters).
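
One possible shape for the first idea, as a sketch only: keep the query but train on just the fenced Python code from each response, dropping the explanatory text. The helper names `extract_python_blocks` and `code_only_example` are hypothetical, the prompt template is the one from the `preprocess` function, and I have not tested whether this actually improves the generated code.

```python
import re
from typing import List, Optional

FENCE = "`" * 3  # triple backtick, built programmatically
PY_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def extract_python_blocks(response: str) -> List[str]:
    """Return the bodies of all closed python-fenced blocks in a response."""
    return [block.strip("\n") for block in PY_BLOCK.findall(response)]

def code_only_example(query: str, response: str) -> Optional[str]:
    """Build a training text with the query and only the code from the response."""
    blocks = extract_python_blocks(response)
    if not blocks:
        return None  # the caller may drop examples that contain no code
    return f"# Question:\n{query}\n# Answer:\n" + "\n\n".join(blocks)

sample_response = (
    f"Here is a loop:\n\n{FENCE}python\nfor i in range(3):\n    print(i)\n{FENCE}\n\n"
    "This prints 0 to 2."
)
example_text = code_only_example("Print the numbers 0 to 2.", sample_response)
print(example_text)
```

Such a function could be applied before tokenization, with examples that return `None` filtered out of the dataset.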