In [1]:
!pip install txtai[all] > /dev/null

The Entity pipeline applies a token classifier to text and extracts entity/label combinations.

In [29]:
from txtai.pipeline import Entity

# Create and run pipeline
entity = Entity()
entity("There are many usecases in NLP and they are called tasks in HuggingFace, " \
       "Each task has many AI models. TxtAI takes care of selecting them for you")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[('NLP', 'ORG', 0.8296668529510498),
 ('HuggingFace', 'ORG', 0.9723879098892212),
 ('TxtAI', 'ORG', 0.9399739503860474)]
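Each result above is a `(text, label, score)` tuple, so it can be post-processed with plain Python. The sketch below copies the output above as literal data and keeps only the higher-scoring entities. The helper name `confident` and the 0.9 cutoff are illustrative assumptions, not part of txtai.

```python
# Entity results copied from the output above: (text, label, score)
results = [
    ("NLP", "ORG", 0.8296668529510498),
    ("HuggingFace", "ORG", 0.9723879098892212),
    ("TxtAI", "ORG", 0.9399739503860474),
]

def confident(entities, min_score=0.9):
    """Keep (text, label) pairs scoring at least min_score (hypothetical helper)."""
    return [(text, label) for text, label, score in entities if score >= min_score]

print(confident(results))
```

With this cutoff, 'NLP' (score about 0.83) is dropped while the other two entities are kept.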

The Extractor pipeline combines a similarity instance (embeddings or similarity pipeline), which builds the question context, with a model that answers questions. The model can be a prompt-driven large language model (LLM), an extractive question-answering model or a custom pipeline.

In [None]:
from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

# Embeddings model ranks candidates before passing to QA pipeline
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

In [19]:
!unzip /content/selfask_index.zip

Archive:  /content/selfask_index.zip
replace content/index/embeddings? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: content/index/embeddings  
  inflating: content/index/documents  
  inflating: content/index/config    


In [20]:
embeddings.load("/content/content/index")

In [21]:
# Create the extractor: embeddings build the context, flan-t5 answers the prompt
extractor = Extractor(embeddings, "google/flan-t5-base")

In [22]:
def prompt(question):
  return f"""Answer the following question using only the context below. 
  Say 'no answer' when the question can't be answered.
Question: {question}
Context: """

def search(query, question=None):
  # Default question to query if empty
  if not question:
    question = query

  return extractor([("answer", query, prompt(question), False)])[0][1]

reply = search("What is self ask")
print(reply)

a 1-shot prompt
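The prompt template above ends with `Context: `, which suggests the retrieved passages are placed after it before the text reaches the model. The sketch below illustrates that idea in plain Python. The concatenation step, the helper `with_context` and the placeholder passage are assumptions for illustration only; they are not taken from txtai's internals.

```python
def prompt(question):
    # Same template as the cell above
    return f"""Answer the following question using only the context below.
  Say 'no answer' when the question can't be answered.
Question: {question}
Context: """

def with_context(question, passages):
    """Append passages after the prompt (assumed concatenation, hypothetical helper)."""
    return prompt(question) + " ".join(passages)

# Placeholder standing in for whatever the embeddings index returns
text = with_context("What is self ask", ["<retrieved passage>"])
print(text)
```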


The Generator pipeline takes an input prompt and generates follow-on text.

In [24]:
from txtai.pipeline import Generator

# Create and run pipeline
generator = Generator()
generator("Hello, talk about NLP?")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'"NLP is a kind of programming language. If you know a string that ends in NLP, you can create a new string, like in D. The difference is you don\'t have to be able to do any parsing," he said.\n\n"What the word \'type\' does is it allows you to make a string of elements of arbitrary length in one place. That means you can create things that are completely and irreversibly different from what the regular alphabet does," he added.\n\nNLP has been around for a while, and is a very basic system that anyone with basic data-oriented programming skills would use. For one thing, it\'s only a program. For a second or two, it\'s a collection of different programs.\n\n"And that\'s exactly why I would give these programming languages a try!" he said.\n\nNLP has long been a favorite of programming languages, since it\'s not such an easy language to learn.\n\n"The reason NLP was the first programming language that people were excited about is because NLP is the first programming language to have th

The Labels pipeline uses a text classification model to apply labels to input text. This pipeline can classify text using either a zero-shot model (dynamic labeling) or a standard text classification model (fixed labeling).


In [25]:
%%capture

from txtai.pipeline import Labels

# Create labels model
labels = Labels()


No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [26]:
data = ["Spacers lose again, give up 3 goals in a loss to the LandLovers",
        "Spacers 5 LandLovers 4 final in extra innings",
        "LandLovers drop Game 2 against the Spacers, 5-4",
        "Landers 4 Thunders 1 final. 49 saves for the Thunders.",
        "Slashing, penalty, 20 second power play coming up",
        "What a sleek save!",
        "Leads the NFL in sacks with 9.5",
        "Earth 38 Mars 13",
        "With the 30 lightyears completion, down to the 10 light line",
        "Drains the 3pt shot!!, 0:25 remaining in the game",
        "Intercepted! Drives down the court and shoots for the win",
        "Massive dunk!!! they are now up by 15 with 2 minutes to go"]

# List of labels
tags = ["Baseball", "Football", "Hockey", "Basketball"]

In [27]:
labels(data[0], tags)

[(1, 0.3686995506286621),
 (3, 0.3435738682746887),
 (0, 0.14597725868225098),
 (2, 0.1417493373155594)]
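The result above is a list of `(label index, score)` pairs sorted by score, where each index points into `tags`. A minimal sketch of mapping the indices back to label names, using the output above as literal data (the variable names `result` and `named` are illustrative):

```python
tags = ["Baseball", "Football", "Hockey", "Basketball"]

# Output of labels(data[0], tags), copied from above
result = [
    (1, 0.3686995506286621),
    (3, 0.3435738682746887),
    (0, 0.14597725868225098),
    (2, 0.1417493373155594),
]

# Replace each index with its tag name and round scores for display
named = [(tags[index], round(score, 3)) for index, score in result]
print(named)
```

The top two scores are close (about 0.369 against 0.344), so the first label for this sentence is far from a confident match.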

In [28]:
for text in data:
    print("%-75s %s" % (text, tags[labels(text, tags)[0][0]]))

Spacers lose again, give up 3 goals in a loss to the LandLovers             Football
Spacers 5 LandLovers 4 final in extra innings                               Basketball
LandLovers drop Game 2 against the Spacers, 5-4                             Basketball
Landers 4 Thunders 1 final. 49 saves for the Thunders.                      Baseball
Slashing, penalty, 20 second power play coming up                           Hockey
What a sleek save!                                                          Hockey
Leads the NFL in sacks with 9.5                                             Football
Earth 38 Mars 13                                                            Basketball
With the 30 lightyears completion, down to the 10 light line                Basketball
Drains the 3pt shot!!, 0:25 remaining in the game                           Basketball
Intercepted! Drives down the court and shoots for the win                   Basketball
Massive dunk!!! they are now up by 15 with 2 minutes to g

The Sequences pipeline runs text through a sequence-to-sequence model and generates output text.

In [None]:
from txtai.pipeline import Sequences

# Create and run pipeline
sequences = Sequences()
sequences("Hello, how are you?", 
          "translate English to German: ")

The Similarity pipeline computes similarity between queries and a list of text using a text classifier.

This pipeline supports both standard text classification models and zero-shot classification models. The pipeline uses the queries as labels for the input text. The results are transposed to get scores per query/label vs scores per input text.

In [3]:
from txtai.pipeline import Similarity

# Create and run pipeline
similarity = Similarity()

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [4]:
similarity("Interesting Idea", [
    "There is number of benefits in making life interesting", 
    "Life on Mars will be more interesting and filled with challenges"
])

[(1, 0.9839457869529724), (0, 0.8746001124382019)]
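The transposition described for the Similarity pipeline can be sketched without any model: start from scores grouped per input text and regroup them per query. This is an illustration of the idea only. The function `transpose_scores` and the `per_text` layout are assumptions, and the scores are copied from the output above.

```python
def transpose_scores(per_text, queries):
    """Regroup (query_index, score) pairs per text into (text_index, score) pairs per query.

    Each per-query list is sorted by descending score.
    """
    per_query = [[] for _ in queries]
    for text_index, scores in enumerate(per_text):
        for query_index, score in scores:
            per_query[query_index].append((text_index, score))
    return [sorted(row, key=lambda pair: pair[1], reverse=True) for row in per_query]

# One query, two texts: per_text[i] holds the scores for text i
queries = ["Interesting Idea"]
per_text = [[(0, 0.8746001124382019)], [(0, 0.9839457869529724)]]

ranked = transpose_scores(per_text, queries)
print(ranked[0])
```

For the single query, this yields the same ordering as the output above: text 1 first, then text 0.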

The Summary pipeline summarizes text. This pipeline runs a text2text model that abstractively creates a summary of the input text.

In [None]:
from txtai.pipeline import Summary

# Create and run pipeline
summary = Summary()

In [6]:
summary("""There is number of benefits in making life interesting. 
    Life on Mars will be more interesting and filled with challenges""")

'There is number of benefits in making life interesting. \n    Life on Mars will be more interesting and filled with challenges'

The Translation pipeline translates text between languages. It supports more than 100 languages. Automatic source language detection is built-in. This pipeline detects the language of each input text row, loads a model for the source-target combination and translates text to the target language.

In [7]:
from txtai.pipeline import Translation

# Create and run pipeline
translate = Translation()
translate("This is a test translation into Spanish", "es")

'Esta es una traducción de prueba al español'

In [8]:
translate("This is a test translation into Spanish", "de")

'Dies ist eine Testübersetzung ins Spanische'

In [9]:
translate("This is a test translation into Spanish", "hi")

'यह स्पैनिश में एक जांच अनुवाद है'
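
The flow described for the Translation pipeline (detect the language of each row, load a model for the source-target pair, translate) can be sketched in plain Python. Everything here is a stand-in: `ModelCache`, `fake_detect` and `fake_load` are hypothetical names, no real models are loaded, and skipping rows already in the target language is an assumption rather than confirmed txtai behavior.

```python
class ModelCache:
    """Caches one translator per (source, target) pair (illustrative, not txtai's class)."""

    def __init__(self, detect, load):
        self.detect = detect  # callable: text -> language code
        self.load = load      # callable: (source, target) -> translate function
        self.models = {}

    def translate(self, texts, target):
        outputs = []
        for text in texts:
            source = self.detect(text)
            if source == target:
                # Assumed behavior: leave text already in the target language as-is
                outputs.append(text)
                continue
            key = (source, target)
            if key not in self.models:
                # Load each source-target model once and reuse it afterwards
                self.models[key] = self.load(source, target)
            outputs.append(self.models[key](text))
        return outputs

# Stand-ins so the sketch runs without downloading anything
loads = []

def fake_detect(text):
    return "es" if text.startswith("Esta") else "en"

def fake_load(source, target):
    loads.append((source, target))
    return lambda text: f"[{source}->{target}] {text}"

cache = ModelCache(fake_detect, fake_load)
out = cache.translate(["This is a test", "Another test", "Esta es una prueba"], "es")
print(out)
print(loads)
```

Caching by language pair is one way to account for the runs above: each new target language triggered a fresh set of model downloads.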