In [12]:
!pip install txtinstruct > /dev/null

In [None]:
!wget https://huggingface.co/NeuML/txtai-wikipedia/resolve/main/documents

In [None]:
!cat /content/documents | more

txtinstruct consists of three components that work together to train instruction-following models.

## Three components

**Statement generation** models create a statement from a context. Depending on the model, this statement can be a question or a request to describe a concept.

**Knowledge source** for pulling context. The knowledge source used in this notebook is a txtai embeddings index of the full Wikipedia dataset.

**Large language model (LLM)** for translating source statements into target statements. A prompt is combined with the knowledge source context to generate the target text.
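How the three components hand data to one another can be sketched in plain Python. The functions below (`build_examples`, `toy_knowledge_source`, `toy_statement`, `toy_llm`) are hypothetical stand-ins invented for illustration, not txtinstruct APIs; the real components are models and an embeddings index.

```python
from typing import Callable, Dict, List


def build_examples(
    contexts: List[str],
    make_statement: Callable[[str], str],
    answer: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Combine the three components into (context, source, target) rows."""
    rows = []
    for context in contexts:
        source = make_statement(context)   # statement generation model
        target = answer(source, context)   # LLM writes the target text
        rows.append({"context": context, "source": source, "target": target})
    return rows


# Hypothetical stand-ins so the sketch runs without downloading any model
def toy_knowledge_source() -> List[str]:
    return ["Linux is a family of open-source Unix-like operating systems."]


def toy_statement(context: str) -> str:
    return f"What is {context.split()[0]}?"


def toy_llm(statement: str, context: str) -> str:
    return context  # a real LLM would write an answer grounded in the context


examples = build_examples(toy_knowledge_source(), toy_statement, toy_llm)
print(examples[0]["source"])
```

Each row pairs a generated statement with a target grounded in the retrieved context, which is the shape of data an instruction-tuning step needs.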

In [None]:
from datasets import load_dataset

from txtinstruct.models import StatementGenerator

# Load SQuAD dataset
dataset = load_dataset("squad", 
                       split="train")

# Create the statement generator trainer (training runs in the next cell)
generator = StatementGenerator()

In [14]:
model, tokenizer = generator(
    "google/flan-t5-small",
    dataset,
    "sequence-sequence",
    learning_rate=1e-3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=128 // 16,
    num_train_epochs=0.01,
    logging_steps=100,
)
# Note: the model is trained for only a fraction of an epoch
# for expediency. Under normal circumstances, num_train_epochs
# would be at least 3.
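The training arguments above imply an effective batch size and a very small number of optimizer steps. The arithmetic below is a rough sketch that assumes a single device and uses 87,599 as the size of the SQuAD train split; the exact rounding inside the Hugging Face `Trainer` may differ slightly.

```python
import math

per_device_train_batch_size = 16
gradient_accumulation_steps = 128 // 16   # 8 micro-batches per optimizer step
num_train_epochs = 0.01
train_examples = 87_599                   # SQuAD v1.1 train split size

# Examples contributing to each optimizer update (single device assumed)
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

# Approximate number of examples seen and optimizer steps taken
examples_seen = train_examples * num_train_epochs
optimizer_steps = math.ceil(examples_seen / effective_batch)

print(effective_batch, round(examples_seen), optimizer_steps)
```

Under these assumptions the run takes far fewer optimizer steps than `logging_steps=100`, so a run this short would not be expected to log any intermediate training loss.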

In [15]:
from txtai.pipeline import Sequences

# Load statement generation model
statements = Sequences((model, tokenizer))

# Run example prompt
statements("""Generate a question using the context below.
### Context:
Hugging Face is an open-source platform for hosting 
all kinds of AI language models.""")

'What is Hugging Face?'

In [16]:
from txtai.embeddings import Embeddings
from txtinstruct.data import DatasetBuilder

# Load embeddings
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", 
                container="neuml/txtai-wikipedia")

In [17]:
# Query templates
templates = [
    "Tell me about {text}",
    "Give an explanation on {text}",
    "Provide a quick summary on {text}",
    "Explain {text} in simple terms",
    "Describe {text}"
]

('Provide a quick summary on {text}',)
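The `{text}` placeholder in each template matches Python's `str.format` syntax. The sketch below shows how one template could be picked at random and filled in; `render` is a hypothetical helper, and how txtinstruct applies the templates internally is an assumption here, not something the notebook confirms.

```python
import random

templates = [
    "Tell me about {text}",
    "Give an explanation on {text}",
    "Provide a quick summary on {text}",
    "Explain {text} in simple terms",
    "Describe {text}",
]


def render(statement_text: str, rng: random.Random) -> str:
    """Pick one template at random and substitute the {text} placeholder."""
    return rng.choice(templates).format(text=statement_text)


rng = random.Random(0)  # seeded so repeated runs pick the same template
print(render("machine learning", rng))
```

Varying the phrasing this way gives the instruction-tuned model several different request styles for the same concept.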

In [18]:
embeddings.search("""SELECT id, text 
                  FROM txtai 
                  WHERE similar('machine learning') 
                  AND percentile >= 0.99 LIMIT 5""")

[{'id': 'Machine learning',
  'text': "Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, agriculture, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks. \nA subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers, but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field 
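Because the same query text is reused when building the dataset, it can help to generate it from parameters. `similar_query` below is a hypothetical helper, not part of txtai; it only assembles the SQL string, and it rejects topics containing a single quote rather than guessing how the query parser handles escaping.

```python
def similar_query(topic: str, limit: int = 5, percentile: float = 0.99) -> str:
    """Build a txtai SQL query string for the most similar articles."""
    if "'" in topic:
        # Quote escaping rules are not verified here, so refuse instead
        raise ValueError("topic must not contain single quotes")
    return (
        "SELECT id, text FROM txtai "
        f"WHERE similar('{topic}') "
        f"AND percentile >= {percentile} LIMIT {limit}"
    )


query = similar_query("machine learning")
print(query)
```

The resulting string can be passed to `embeddings.search(...)` in place of the hand-written query.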

In [None]:
# Build dataset
builder = DatasetBuilder(Sequences("google/flan-t5-small"), 
                         statements, 
                         templates)
builder(
    embeddings.search("SELECT id, text FROM txtai WHERE similar('machine learning') AND percentile >= 0.99 LIMIT 5"),
    5,
    "data.json"
)

In [7]:
import json

from txtinstruct.models import Instructor

# Read in generated dataset
with open("data.json", encoding="utf-8") as f:
    data = json.load(f)

In [8]:
data[0]

{'context': "Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, agriculture, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks. \nA subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers, but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data 
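Records in the generated file contain long text fields, so printing one in full is hard to read. The `preview` helper below is a hypothetical convenience function that shortens long string values for display; the `sample` record is made up and only mirrors the `context` key visible in the output above.

```python
def preview(record: dict, width: int = 80) -> dict:
    """Return a copy of record with long string values shortened for display."""
    short = {}
    for key, value in record.items():
        if isinstance(value, str) and len(value) > width:
            short[key] = value[:width] + "..."
        else:
            short[key] = value
    return short


# Made-up record for illustration; the real one comes from data.json
sample = {"context": "Machine learning (ML) is a field of inquiry " * 5}
print(preview(sample, width=40))
```

With the real dataset, `preview(data[0])` would show every key of the first record without flooding the output.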

In [24]:
# Instruction-tune model
instructor = Instructor()
model, tokenizer = instructor(
    "google/flan-t5-small", 
    data,
    "sequence-sequence",
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=128 // 8,
    num_train_epochs=3,
    logging_steps=100,
)

In [10]:
from txtai.pipeline import Extractor

def prompt(text):
    template = "Answer the following question using only the context below. Give a detailed answer. "
    template += "Say 'I don't have data on that' when the question can't be answered.\n"
    template += f"Question: {text}\n"
    template += "Context: "

    return template
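To see exactly what the model receives, the prompt can be rendered on its own. The block below repeats the `prompt` function so it runs standalone. Note that the string ends with `Context: ` and nothing after it; the retrieved passages are presumably appended by the `Extractor` at query time, which is an assumption about its behavior rather than something shown in this notebook.

```python
def prompt(text):
    template = "Answer the following question using only the context below. Give a detailed answer. "
    template += "Say 'I don't have data on that' when the question can't be answered.\n"
    template += f"Question: {text}\n"
    template += "Context: "
    return template


rendered = prompt("Tell me about Linux")
print(rendered)

# The prompt deliberately stops right after the context label
print(rendered.endswith("Context: "))
```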


In [11]:
# Baseline: the pretrained google/flan-t5-small model from the Hugging Face Hub
extractor = Extractor(
    embeddings,
    Sequences("google/flan-t5-small")
)

extractor([{
    "query": "Tell me about Linux",
    "question": prompt("Tell me about Linux")
}])

[{'answer': 'Linux'}]

In [25]:
# The model instruction-tuned earlier in this Colab notebook
extractor = Extractor(
    embeddings,
    Sequences((model, tokenizer))
)

extractor([{
    "query": "Tell me about Linux",
    "question": prompt("Tell me about Linux")
}])

[{'answer': 'Linux (or ) is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution, which includes the kernel and supporting system software and libraries, many of which are provided by the GNU Project.'}]

In [27]:
extractor([{
    "query": "Tell me about adversarial Machine Learning",
    "question": prompt("Tell me about adversarial Machine Learning")
}])

[{'answer': 'Adversarial machine learning is the study of the attacks on machine learning algorithms, and of the defenses against such attacks'}]

In [1]:
!git clone https://github.com/lamini-ai/lamini.git

Cloning into 'lamini'...
remote: Enumerating objects: 193, done.
remote: Counting objects: 100% (76/76), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 193 (delta 44), reused 47 (delta 26), pack-reused 117
Receiving objects: 100% (193/193), 27.82 MiB | 16.32 MiB/s, done.
Resolving deltas: 100% (110/110), done.


In [2]:
!pip install llama-llm jsonlines > /dev/null

In [None]:
!python /content/lamini/generate_data.py