# Install Rasa
pip install rasa

# Create a Rasa Project
rasa init --no-prompt

Prepare Data

Organize your dataset into Rasa's training format. Create a data folder and create files for NLU training data (nlu.yml) and conversation examples (stories.yml).

# nlu.yml
version: "2.0"
nlu:
- intent: ask_legal_question
  examples: |
    - What are the laws about <keyword>?
    - Can you explain <keyword> law?

# stories.yml
version: "2.0"
stories:
- story: Seq2Seq Response Example
  steps:
  - user_intent: ask_legal_question
  - action: action_seq2seq_response

Create a Custom Action

Define a custom action for your Seq2Seq model. Create a Python file, e.g., actions.py, and implement the action.

In [None]:
# actions.py
from typing import Any, Text, Dict, List
from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher

class ActionSeq2SeqResponse(Action):
    def name(self) -> Text:
        return "action_seq2seq_response"

    def run(self, dispatcher: CollectingDispatcher,
            tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
        # Implement your Seq2Seq model logic here
        # Get user input and generate a response
        user_input = tracker.latest_message['text']
        response = your_seq2seq_model_generate_response(user_input)

        # Send the generated response back to the user
        dispatcher.utter_message(response)

        return []

In [None]:
# Train your Rasa model
rasa train

In [None]:
# Run the Rasa Server
rasa run

In [None]:
# Chat with the Bot
rasa shell

**model focused on Law field**

 implementing this command needs transformer library

 https://github.com/speechbrain/speechbrain/?tab=readme-ov-file

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load pre-trained chatbot model and tokenizer
model_path = "SeanJIE250/chatbot_LAW"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype='auto'
).eval()

# Define a conversation (user's query)
messages = [
    {"role": "user", "content": "杀了人在中国判多少年？"}  # User asks about the sentence for killing someone in China
]

# Tokenize and generate a response for the conversation
input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, add_generation_prompt=True, return_tensors='pt')
outputs = model.generate(input_ids.to('cuda'), max_new_tokens=200)  # Adjust max_new_tokens as needed
response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=False)

# Print the generated response
print(response)

# Another example conversation
messages = [
    {"role": "user", "content": "How to split the property if I divorced with my husband?"}
]

# Tokenize and generate a response for the second conversation
input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, add_generation_prompt=True, return_tensors='pt')
outputs = model.generate(input_ids.to('cuda'), max_new_tokens=200)  # Adjust max_new_tokens as needed
response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=False)

# Print the generated response for the second conversation
print(response)

The provided code is specifically designed for **Question-Answer Sentence Relationship Labeling** (QASRL) tasks


https://huggingface.co/kleinay/qanom-seq2seq-model-joint?text=The+generated+response+is+decoded+using+the+tokenizer+to+obtain+the+text+form+of+the+model%27s+output%2C+and+the+response+is+printed.

In [3]:
from typing import Optional
import json
from argparse import Namespace
from pathlib import Path
from transformers import Text2TextGenerationPipeline, AutoModelForSeq2SeqLM, AutoTokenizer

def get_markers_for_model(is_t5_model: bool) -> Namespace:
    special_tokens_constants = Namespace()
    if is_t5_model:
        # T5 model have 100 special tokens by default
        special_tokens_constants.separator_input_question_predicate = "<extra_id_1>"
        special_tokens_constants.separator_output_answers = "<extra_id_3>"
        special_tokens_constants.separator_output_questions = "<extra_id_5>"  # if using only questions
        special_tokens_constants.separator_output_question_answer = "<extra_id_7>"
        special_tokens_constants.separator_output_pairs = "<extra_id_9>"
        special_tokens_constants.predicate_generic_marker = "<extra_id_10>"
        special_tokens_constants.predicate_verb_marker = "<extra_id_11>"
        special_tokens_constants.predicate_nominalization_marker = "<extra_id_12>"

    else:
        special_tokens_constants.separator_input_question_predicate = "<question_predicate_sep>"
        special_tokens_constants.separator_output_answers = "<answers_sep>"
        special_tokens_constants.separator_output_questions = "<question_sep>"  # if using only questions
        special_tokens_constants.separator_output_question_answer = "<question_answer_sep>"
        special_tokens_constants.separator_output_pairs = "<qa_pairs_sep>"
        special_tokens_constants.predicate_generic_marker = "<predicate_marker>"
        special_tokens_constants.predicate_verb_marker = "<verbal_predicate_marker>"
        special_tokens_constants.predicate_nominalization_marker = "<nominalization_predicate_marker>"
    return special_tokens_constants

def load_trained_model(name_or_path):
   # Load pre-trained model and tokenizer
    import huggingface_hub as HFhub
    tokenizer = AutoTokenizer.from_pretrained(name_or_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(name_or_path)
    # load preprocessing_kwargs from the model repo on HF hub, or from the local model directory
    kwargs_filename = None
    if name_or_path.startswith("kleinay/") and 'preprocessing_kwargs.json' in HFhub.list_repo_files(name_or_path):
        kwargs_filename = HFhub.hf_hub_download(repo_id=name_or_path, filename="preprocessing_kwargs.json")
    elif Path(name_or_path).is_dir() and (Path(name_or_path) / "experiment_kwargs.json").exists():
        kwargs_filename = Path(name_or_path) / "experiment_kwargs.json"

    if kwargs_filename:
        preprocessing_kwargs = json.load(open(kwargs_filename))
        # integrate into model.config (for decoding args, e.g. "num_beams"), and save also as standalone object for preprocessing
        model.config.preprocessing_kwargs = Namespace(**preprocessing_kwargs)
        model.config.update(preprocessing_kwargs)
    return model, tokenizer


class QASRL_Pipeline(Text2TextGenerationPipeline):
    def __init__(self, model_repo: str, **kwargs):
        model, tokenizer = load_trained_model(model_repo)
        super().__init__(model, tokenizer, framework="pt")
        self.is_t5_model = "t5" in model.config.model_type
        self.special_tokens = get_markers_for_model(self.is_t5_model)
        # self.preprocessor = preprocessing.Preprocessor(model.config.preprocessing_kwargs, self.special_tokens)
        self.data_args = model.config.preprocessing_kwargs
        # backward compatibility - default keyword values implemeted in `run_summarization`, thus not saved in `preprocessing_kwargs`
        if "predicate_marker_type" not in vars(self.data_args):
            self.data_args.predicate_marker_type = "generic"
        if "use_bilateral_predicate_marker" not in vars(self.data_args):
            self.data_args.use_bilateral_predicate_marker = True
        if "append_verb_form" not in vars(self.data_args):
            self.data_args.append_verb_form = True
        self._update_config(**kwargs)

    def _update_config(self, **kwargs):
        " Update self.model.config with initialization parameters and necessary defaults. "
        # set default values that will always override model.config, but can overriden by __init__ kwargs
        kwargs["max_length"] = kwargs.get("max_length", 80)
        # override model.config with kwargs
        for k,v in kwargs.items():
            self.model.config.__dict__[k] = v

    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs, forward_kwargs, postprocess_kwargs = {}, {}, {} # super()._sanitize_parameters(**kwargs)
        forward_kwargs.update(kwargs.get("generate_kwargs", dict()))
        forward_kwargs.update(kwargs.get("model_kwargs", dict()))
        preprocess_keywords = ("predicate_marker", "predicate_type", "verb_form")
        for key in preprocess_keywords:
            if key in kwargs:
                preprocess_kwargs[key] = kwargs[key]

        return preprocess_kwargs, forward_kwargs, postprocess_kwargs

    # Processing Input Sentences
    def preprocess(self, inputs, predicate_marker="<predicate>", predicate_type=None, verb_form=None):
        # Here, inputs is string or list of strings; apply string postprocessing
        if isinstance(inputs, str):
            processed_inputs = self._preprocess_string(inputs, predicate_marker, predicate_type, verb_form)
        elif hasattr(inputs, "__iter__"):
            processed_inputs = [self._preprocess_string(s, predicate_marker, predicate_type, verb_form) for s in inputs]
        else:
            raise ValueError("inputs must be str or Iterable[str]")
        # Now pass to super.preprocess for tokenization
        return super().preprocess(processed_inputs)
   # takes input sentences and prepares them for the model.
   # It adds special markers to indicate key elements in the sentences,
   # like the predicate (main action)

    def _preprocess_string(self, seq: str, predicate_marker: str, predicate_type: Optional[str], verb_form: Optional[str]) -> str:
        sent_tokens = seq.split(" ")
        assert predicate_marker in sent_tokens, f"Input sentence must include a predicate-marker token ('{predicate_marker}') before the target predicate word"
        predicate_idx = sent_tokens.index(predicate_marker)
        sent_tokens.remove(predicate_marker)
        sentence_before_predicate = " ".join([sent_tokens[i] for i in range(predicate_idx)])
        predicate = sent_tokens[predicate_idx]
        sentence_after_predicate = " ".join([sent_tokens[i] for i in range(predicate_idx+1, len(sent_tokens))])

        if self.data_args.predicate_marker_type == "generic":
            predicate_marker = self.special_tokens.predicate_generic_marker
        #  In case we want special marker for each predicate type: """
        elif self.data_args.predicate_marker_type == "pred_type":
            assert predicate_type is not None, "For this model, you must provide the `predicate_type` either when initializing QASRL_Pipeline(...) or when applying __call__(...) on it"
            assert predicate_type in ("verbal", "nominal"), f"`predicate_type` must be either 'verbal' or 'nominal'; got '{predicate_type}'"
            predicate_marker = {"verbal": self.special_tokens.predicate_verb_marker ,
                                "nominal": self.special_tokens.predicate_nominalization_marker
                                }[predicate_type]

        if self.data_args.use_bilateral_predicate_marker:
            seq = f"{sentence_before_predicate} {predicate_marker} {predicate} {predicate_marker} {sentence_after_predicate}"
        else:
            seq = f"{sentence_before_predicate} {predicate_marker} {predicate} {sentence_after_predicate}"

        # embed also verb_form
        if self.data_args.append_verb_form and verb_form is None:
            raise ValueError(f"For this model, you must provide the `verb_form` of the predicate when applying __call__(...)")
        elif self.data_args.append_verb_form:
            seq = f"{seq} {self.special_tokens.separator_input_question_predicate} {verb_form} "
        else:
            seq = f"{seq} "

        # append source prefix (for t5 models)
        prefix = self._get_source_prefix(predicate_type)

        return prefix + seq

    def _get_source_prefix(self, predicate_type: Optional[str]):
        if not self.is_t5_model or self.data_args.source_prefix is None:
            return ''
        if "Generate QAs for <predicate_type> QASRL: " in self.data_args.source_prefix:
            if predicate_type is None:
                raise ValueError("source_prefix includes 'Generate QAs for <predicate_type> QASRL: ' but input has no `predicate_type`.")
            if self.data_args.source_prefix == "Generate QAs for <predicate_type> QASRL: ": # backwrad compatibility - "Generate QAs for <predicate_type> QASRL: " alone was a sign for a longer prefix
                return f"Generate QAs for {predicate_type} QASRL: "
            else:
                return self.data_args.source_prefix.replace("Generate QAs for <predicate_type> QASRL: ", predicate_type)
        else:
            return self.data_args.source_prefix


    def _forward(self, *args, **kwargs):
       # Generating Questions and Answers
    # The pipeline uses the pre-trained model to generate questions and
    # answers based on the preprocessed input sentences. The model understands
    # the context and structure of the sentences
        outputs = super()._forward(*args, **kwargs)
        return outputs

        # Post-processing the Outputs
        def postprocess(self, model_outputs):
            # Extract the first element from the list
            output_seq = self.tokenizer.decode(
                model_outputs["output_ids"][0],
                skip_special_tokens=False,
                clean_up_tokenization_spaces=False,
            )

        output_seq = output_seq.strip(self.tokenizer.pad_token).strip(self.tokenizer.eos_token).strip()
        qa_subseqs = output_seq.split(self.special_tokens.separator_output_pairs)
        qas = [self._postrocess_qa(qa_subseq) for qa_subseq in qa_subseqs]
        return {"generated_text": output_seq,
                "QAs": qas}
                # The postprocess method decodes the model's output to extract the generated text, questions,
                # and answers. It formats the results in a more human-readable way.

    def _postrocess_qa(self, seq: str) -> str:
        # split question and answers
        if self.special_tokens.separator_output_question_answer in seq:
            question, answer = seq.split(self.special_tokens.separator_output_question_answer)[:2]
        else:
            print("invalid format: no separator between question and answer found...")
            return None
            # question, answer = seq, '' # Or: backoff to only question
        # skip "_" slots in questions
        question = ' '.join(t for t in question.split(' ') if t != '_')
        answers = [a.strip() for a in answer.split(self.special_tokens.separator_output_answers)]
        return {"question": question, "answers": answers}


if __name__ == "__main__":
    pipe = QASRL_Pipeline("kleinay/qanom-seq2seq-model-baseline")
    res1 = pipe("The student was interested in Luke 's <predicate> research about see animals .", verb_form="research", predicate_type="nominal")
    res2 = pipe(["The doctor was interested in Luke 's <predicate> treatment .",
                 "The Veterinary student was interested in Luke 's <predicate> treatment of sea animals ."], verb_form="treat", predicate_type="nominal", num_beams=10)
    res3 = pipe("A number of professions have <predicate> developed that specialize in the treatment of mental disorders .", verb_form="develop", predicate_type="verbal")
    print(res1)
    print(res2)
    print(res3)

[{'generated_text': 'who _ _ researched something _ _ ? Luke what did someone research _ _ _ ? see animals'}]
[{'generated_text': 'who _ _ treated someone _ _ ? Luke'}, {'generated_text': "who    treated  predicate-type>The Veterinary student was interested in Luke 's treatment how did someone treat                      "}]
[{'generated_text': 'who   has developed predicate-type>A number of professions have developed predicate-type>A number of professions have developed predicate-type> specialize in the treatment of mental disorders'}]


using the pipeline to generate questions and answers for specific input sentences. It shows how to instantiate the pipeline, process input sentences, and obtain the model's generated Q&A pairs.


**Text Summarization**
---

https://huggingface.co/facebook/bart-large-cnn/blob/main/README.md

In [1]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]


In [2]:
from google.colab import files

uploaded = files.upload()

Saving data.csv to data.csv


In [6]:
import pandas as pd

# Replace 'your_file.csv' with the actual path to your CSV file
df = pd.read_csv('data.csv')

# Assuming your CSV has a column named 'content'
text_data = df['Law'].tolist()

In [11]:
from transformers import pipeline

# Check the length of the text_data
print(len(text_data))

# Check the model's max_position_embeddings
print(summarizer.model.config.max_position_embeddings)

# Ensure that the length of the text_data is less than or equal to the model's max_position_embeddings
if len(text_data) > summarizer.model.config.max_position_embeddings:
    print("The length of the text_data is greater than the model's max_position_embeddings. Please use a smaller dataset or a model with a larger max_position_embeddings.")
else:
    # Summarize each text entry in your data
    print("The length of the text_data is less than or equal to the model's max_position_embeddings.")
    summaries = [summarizer(text, max_length=50, min_length=30, do_sample=False)[0]['summary_text'] for text in text_data]
# Add the summaries to your DataFrame
df['summary'] = summaries

# Optionally, you can save the updated DataFrame to a new CSV file
df.to_csv('summerized.csv', index=False)

400
1024
The length of the text_data is less than or equal to the model's max_position_embeddings.


IndexError: index out of range in self

https://huggingface.co/philschmid/bart-large-cnn-samsum

In [1]:
import pandas as pd
from transformers import pipeline

# Load the summarization pipeline
summarizer = pipeline("summarization", model="philschmid/bart-large-cnn-samsum")

# Read the CSV file
df = pd.read_csv('data.csv')  # Replace 'your_file.csv' with the actual path to your CSV file


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [2]:
from google.colab import files

uploaded = files.upload()

Saving summerized.csv to summerized.csv


In [3]:
# Assuming your CSV has a column named 'content'
text_data = df['Law'].tolist()

# Summarize each text entry in your data
summaries = [summarizer(text)[0]['summary_text'] for text in text_data]

# Add the summaries to your DataFrame
df['summary'] = summaries

# Optionally, you can save the updated DataFrame to a new CSV file
df.to_csv('summerized.csv', index=False)

Your max_length is set to 142, but your input_length is only 84. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=42)
Your max_length is set to 142, but your input_length is only 136. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=68)
Your max_length is set to 142, but your input_length is only 63. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=31)
Token indices sequence length is longer than the specified maximum sequence length for this model (1326 > 1024). Running this sequence through the model will result in indexing errors


IndexError: index out of range in self

In [15]:
from speechbrain.inference.text import GPTResponseGenerator
res_gen_model = GPTResponseGenerator.from_hparams(source="speechbrain/MultiWOZ-GPT-Response_Generation", savedir="pretrained_models/MultiWOZ-GPT-Response_Generation", pymodule_file="custom.py")
print("Hi,How could I help you today?", end="\n")
while True:
  turn = input()
  response = res_gen_model.generate_response(turn)
  print(response, end="\n")


ModuleNotFoundError: No module named 'speechbrain'