In [None]:
# Last amended: 23rd April, 2024
# Ref: https://github.com/peremartra/Large-Language-Model-Notebooks-Course 
#       https://community.ibm.com/community/user/watsonx/blogs/ruslan-idelfonso-magaa-vsevolodovna/2023/10/05/how-to-work-with-pretrained-models-with-transforme

In [None]:
"""
Models experimented with:
A. The outputs recorded in this notebook are NOT from the best model, i.e. (2) below.
B. The LLM prompt may have to be changed for models OTHER than the llama2 model.

1.      databricks/dolly-v2-3b                           Works but response not good
2.      meta-llama/Llama-2-7b-chat-hf                    Excellent    <=== BEST for 32gb RAM
3.      distilbert/distilgpt2                            Works. Requires less memory but results very poor
4.      cognitivecomputations/dolphin-2.9-llama3-8b      Machine hangs
5.      mosaicml/mpt-7b                                  Just Works but I think prompt will have to be changed
                                                         Also requires GPU support

"""

<div align="center">
<h1><a href="https://github.com/peremartra/Large-Language-Model-Notebooks-Course">Learn by Doing LLM Projects</a></h1>
    <h3>Understand And Apply Large Language Models</h3>
    <h2>Create a Moderation system with LangChain and HuggingFace.</h2>
    by <b>Original author: Pere Martra</b>
</div>

<br>

<br>
<hr>

This notebook needs an environment with a GPU. I'm using an A100 GPU, but it can run on any 16 GB GPU.

# How To Create a Moderation System Using LangChain & Hugging Face.

We are going to create a moderation system based on two models. The first model reads the user's comments and answers them.

The second language model receives the first model's answer, identifies any kind of negativity, and rewrites the comment if necessary.

The aim is to prevent a user's text entry from provoking a negative or out-of-tone response from the comment system.

In [None]:
###########################
#**** It is assumed that you are in the langchain conda environment ****
#**** with all packages installed ****
###########################

In [1]:
%reset -f

In [8]:
# 0.0 Type your access token into the textbox
#      and press the <ENTER> key

from getpass import getpass
hf_key = getpass("Hugging Face Key: ")

Hugging Face Key:  ········


In [9]:
# 0.1 Log in to Hugging Face from the command line.
#      The token is also saved to /home/ashok/.cache/huggingface/token for future use

!huggingface-cli login --token $hf_key 

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/ashok/.cache/huggingface/token
Login successful


## Importing LangChain Libraries.
* `langchain` provides functionality to run huggingface models/pipes through `HuggingFacePipeline`
* For full functionality of available HuggingFace APIs in `langchain`, see this [page](https://python.langchain.com/docs/integrations/platforms/huggingface/)
* `PromptTemplate`: provides functionality to create prompts with parameters.

## AutoModelForCausalLM

`AutoModelForCausalLM` is a class in the Hugging Face Transformers library, a popular open-source library for natural language processing tasks.

This class is designed for causal language modeling, where the model generates text one token at a time, each token conditioned only on the tokens that precede it.
It is part of the `AutoModel` family, which provides a unified interface for various pre-trained models.

`AutoModelForCausalLM.from_pretrained` reads the checkpoint's configuration, selects the matching model architecture, and loads the pre-trained weights; it does not fine-tune the model. The loaded model can be used to generate responses, chatbot interactions, and other conversational outputs.
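
To make "causal" concrete: a causal language model predicts each next token from the tokens to its left only, and generation repeats that prediction in a loop. The sketch below mimics the loop with a hand-written lookup table (`NEXT_TOKEN`, a made-up stand-in for a trained model). It is an illustration only, not how Transformers implements generation.

```python
# Toy illustration of causal (left-to-right) generation.
# NEXT_TOKEN is a hypothetical lookup table standing in for a trained model.
NEXT_TOKEN = {
    "<s>": "the",
    "the": "product",
    "product": "is",
    "is": "great",
    "great": "</s>",
}

def generate(prompt_tokens, max_new_tokens=10):
    """Greedily append one token at a time, looking only at what came before."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "</s>")
        if next_token == "</s>":  # end-of-sequence: stop generating
            break
        tokens.append(next_token)
    return tokens

print(generate(["<s>", "the"]))
```

The `max_new_tokens` argument plays the same role as the parameter of that name passed to the pipeline later in this notebook: it caps how many tokens are appended to the prompt.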

## AutoTokenizer

`AutoTokenizer` is a class in the Hugging Face Transformers library. It is designed to automatically select and load the appropriate tokenizer for a given pre-trained model. Tokenizers are used to convert raw text into numerical tokens that can be understood by machine learning models.

`AutoTokenizer` simplifies the process of selecting the correct tokenizer by automatically identifying the tokenizer associated with a specific pre-trained model. It eliminates the need for manually specifying and loading the tokenizer separately for each model.

By using `AutoTokenizer.from_pretrained`, you can load the tokenizer associated with a specific pre-trained model without explicitly specifying the tokenizer's name or type. This streamlines the workflow when working with different models and tasks in natural language processing.
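
The text-to-ids round trip that a tokenizer performs can be illustrated with a toy word-level vocabulary. `ToyTokenizer` and its vocabulary are invented for this sketch; real Hugging Face tokenizers split text into subwords and are loaded with `AutoTokenizer.from_pretrained`.

```python
class ToyTokenizer:
    """Word-level tokenizer, for illustration only (real ones split into subwords)."""

    def __init__(self, vocabulary):
        self.token_to_id = {tok: i for i, tok in enumerate(vocabulary)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}
        self.unk_id = self.token_to_id["<unk>"]

    def encode(self, text):
        # Lower-case, split on whitespace, and map unknown words to <unk>.
        return [self.token_to_id.get(w, self.unk_id) for w in text.lower().split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)

toy = ToyTokenizer(["<unk>", "i", "want", "my", "money", "back"])
ids = toy.encode("I want my money back")
print(ids)              # the numerical tokens the model would see
print(toy.decode(ids))  # back to text
```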

In [10]:
# 1.0

# 1.0.1 HF objects
import transformers
from transformers import pipeline   #, AutoModelForSeq2SeqLM
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1.0.2 langchain's interface to huggingface
from langchain.llms import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

In [11]:
# 1.1
#import torch
#from torch import cuda, bfloat16

# 1.2
import os,gc
import numpy as np


In [12]:
# 1.3 On Apple Silicon (Mac) the device must be 'mps'
#     To use the command below, uncomment (#1.1) above

# device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
# print(device)

In [13]:
# 1.4 Load the model and tokenizer.
#     Model-related files are downloaded to the folder /home/ashok/.cache/huggingface/
#     'Downloading shards'        ==> model weights are being downloaded
#     'Loading checkpoint shards' ==> downloaded weights are being loaded into memory

model_name =   "mosaicml/mpt-7b" # "distilbert/distilgpt2"  # "meta-llama/Llama-2-7b-chat-hf"  # "databricks/dolly-v2-3b"   

# 1.4.1 Get the LLM model 
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1.4.2 Get tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

config.json:   0%|          | 0.00/1.23k [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/16.0k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.36G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/237 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [14]:
# 2.0 pipeline for text-generation task:

pipe = pipeline(
                "text-generation",
                model=model,
                tokenizer=tokenizer,
                max_new_tokens=200,
                temperature=0.1,
                top_p=0,
                #trust_remote_code=True,
                repetition_penalty=1.1,
                return_full_text=True,
                device_map='auto'
                )

# 2.0.1
pipe

<transformers.pipelines.text_generation.TextGenerationPipeline at 0x74e488894210>

In [15]:
# 2.1 A pipeline is equivalent to an LLM for a specific task.
#     This LLM will be our assistant LLM. The moderator will be built on top of it:

assistant_llm = HuggingFacePipeline(pipeline=pipe)

## Create the template for the first model called assistant.

The prompt receives two variables: `sentiment` and `customer_request` (the customer's comment).

I included the sentiment to make it easier to produce rude or incorrect answers.

See [this reference](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) about llama2 prompt:

>One of the unsung advantages of open-access models is that you have full control over the system prompt in chat applications. This is essential to specify the behavior of your chat assistant –and even imbue it with some personality–, but it's unreachable in models served behind APIs.

>We're adding this section just a few days after the initial release of Llama 2, as we've had many questions from the community about how to prompt the models and how to change the system prompt. We hope this helps!

>The prompt template for the first turn looks like this:
<pre>
    
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]

</pre>

>This template follows the model's training procedure, as described in the Llama 2 paper. We can use any system_prompt we want, but it's crucial that the format matches the one used during training.

>To spell it out in full clarity, this is what is actually sent to the language model when the user enters some text (There's a llama in my garden 😱 What should I do?) in our 13B chat demo to initiate a chat:
<pre>
<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>

There's a llama in my garden 😱 What should I do? [/INST]

</pre>
    
>As you can see, the instructions between the special <<SYS>> tokens provide context for the model so it knows how we expect it to respond. This works because exactly the same format was used during training with a wide variety of system prompts intended for different tasks.

>As the conversation progresses, all the interactions between the human and the "bot" are appended to the previous prompt, enclosed between [INST] delimiters. The template used during multi-turn conversations follows this structure (🎩 h/t Arthur Zucker for some final clarifications):

    
>The model is stateless and does not "remember" previous fragments of the conversation, we must always supply it with all the context so the conversation can continue. This is the reason why context length is a very important parameter to maximize, as it allows for longer conversations and larger amounts of information to be used. 
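
Because the model is stateless, the full prompt has to be rebuilt on every turn. The helper below is a minimal sketch of that assembly. The first-turn layout follows the template quoted above; the multi-turn layout (each earlier answer closed with `</s>` and each new turn opened with `<s>[INST]`) is my reading of the referenced blog post and should be treated as an assumption. `build_llama2_prompt` is a hypothetical name, and in practice the tokenizer may add the `<s>` token itself.

```python
def build_llama2_prompt(system_prompt, turns):
    """Assemble a Llama 2 chat prompt.

    turns: list of (user_message, model_answer) pairs; the last answer is None
    because the model has not replied to it yet.
    """
    prompt = ""
    for i, (user_message, model_answer) in enumerate(turns):
        if i == 0:
            # The system prompt sits inside the first [INST] block, wrapped in <<SYS>> tags.
            user_message = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message}"
        prompt += f"<s>[INST] {user_message} [/INST]"
        if model_answer is not None:
            # Assumed multi-turn convention: close each finished answer with </s>.
            prompt += f" {model_answer} </s>"
    return prompt

single = build_llama2_prompt("Be polite.", [("Hi", None)])
multi = build_llama2_prompt("Be polite.", [("Hi", "Hello!"), ("Bye", None)])
print(single)
print(multi)
```

Note that the whole history is passed in again on the second call, which is why context length limits how long a conversation can run.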

Prompt template examples:

Define a simple prompt template as a Python string

Example 1:
<pre>
from langchain_core.prompts import PromptTemplate
prompt_template = PromptTemplate.from_template("""
Human: What is the capital of {place}?
AI: The capital of {place} is {capital}
""")

prompt = prompt_template.format(place="California", capital="Sacramento")

print(prompt)
</pre>

Example 2:
<pre>
prompt_template = PromptTemplate.from_template(
    template="Write a {length} story about: {content}"
)

prompt = prompt_template.format(
    length="2-sentence",
    content="The hometown of the legendary data scientist, Harpreet Sahota"
)
print(prompt)
</pre>

Example 3:
<pre>
# No Input Variable
no_input_prompt = PromptTemplate(input_variables=[], template="Tell me a joke.")
print(no_input_prompt.format())

# One Input Variable
one_input_prompt = PromptTemplate(input_variables=["adjective"], template="Tell me a {adjective} joke.")
print(one_input_prompt.format(adjective="funny"))

# Multiple Input Variables
multiple_input_prompt = PromptTemplate(
 input_variables=["adjective", "content"],
 template="Tell me a {adjective} joke about {content}."
)

multiple_input_prompt = multiple_input_prompt.format(adjective="funny", content="chickens")
print(multiple_input_prompt)
</pre>



In [None]:
"""
Common roles in a prompt:

system                  instructions that set the LLM's behaviour
user                    user and human are the same
human
assistant               the LLM itself
"""


In [None]:
# PromptTemplate ref
# https://api.python.langchain.com/en/latest/prompts/langchain_core.prompts.prompt.PromptTemplate.html

In [None]:
"""
Our Steps:
    i)   Define a string, say 'myprompt', that holds the template; it contains variables
    ii)  Use PromptTemplate to list the variables and to set 'myprompt' as the template
    iii) Create the chain: prompt | llm | StrOutputParser
    iv)  Invoke the chain, setting the values of the variables through a dict: .invoke({'a' : 34, 'b' : 90 })
    v)   Optionally store the response in a variable for use in a subsequent prompt

"""
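
The `prompt | llm | parser` pattern in step iii can be pictured as plain function composition. The `Step` class below is a hypothetical miniature written for this explanation, not LangChain's own implementation, and `fake_llm` is a stand-in that simply upper-cases its input instead of calling a model.

```python
class Step:
    """Wraps a function so that steps can be chained with '|', loosely mimicking LCEL."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # (a | b).invoke(x) runs a first, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)

# i) the template string with its variables
myprompt = "Tell me a {adjective} joke about {content}."

# ii) and iii) wrap each stage, then chain them
prompt = Step(lambda variables: myprompt.format(**variables))  # fills the template
fake_llm = Step(lambda text: text.upper())                     # stand-in for a model
parser = Step(lambda text: text.strip())                       # stand-in for an output parser
chain = prompt | fake_llm | parser

# iv) and v) invoke with a dict and keep the response
response = chain.invoke({"adjective": "funny", "content": "chickens"})
print(response)
```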

In [16]:
# 3.0 Instructions on how the LLM must respond to the comments.
#     Note the framing of the prompt for llama2.
#     At present, assistant_template is just a 'str'.
#     sentiment: rude/nice/idiot
#     customer_request: customer complaint/request

assistant_template = """
[INST]<<SYS>>You are {sentiment} assistant that responds to user comments,
using similar vocabulary than the user.
Stop answering text after answer the first user.<</SYS>>

User comment:{customer_request}[/INST]
assistant_response:
"""

In [17]:
# 3.0.1
type(assistant_template)

str

In [18]:
# 3.1 Instantiate the  prompt template object to use in the Chain for the first Model:

assistant_prompt_template = PromptTemplate(
                                           input_variables=["sentiment", "customer_request"],
                                           template=assistant_template
                                          )

Now we create the first chain by piping `assistant_prompt_template` into the model and then into an output parser. The model receives the prompt generated from the template.

In [19]:
# 3.2 OLD CODE USING CHAINS
#     assistant_chain = LLMChain(
#                                llm=assistant_llm,
#                                prompt=assistant_prompt_template,
#                                output_key="assistant_response",
#                                verbose=False
#                               )

# 3.3   NEW CODE USING LCEL. A chain is a sequence of steps joined by the '|' operator

output_parser = StrOutputParser()
assistant_chain = assistant_prompt_template | assistant_llm | output_parser


To execute the chain, call its `.invoke` method and pass the required variables as a dictionary.

In our case: `customer_request` and `sentiment`.
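
Every placeholder in the template needs a matching key in the dictionary passed to `.invoke`. A small check like the one below, which uses only the standard library's `string.Formatter`, could catch a missing key before the model is called. `template_variables` and `missing_keys` are hypothetical helpers, not part of LangChain, and the template is a shortened copy used only for the demonstration.

```python
from string import Formatter

def template_variables(template):
    """Return the set of {placeholder} names that appear in a format string."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}

def missing_keys(template, values):
    """Return, sorted, the placeholders that have no entry in `values`."""
    return sorted(template_variables(template) - set(values))

template = "You are {sentiment} assistant.\nUser comment:{customer_request}"
print(template_variables(template))
print(missing_keys(template, {"sentiment": "nice"}))
```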

In [20]:
# 3.4 Support function to obtain a response to a user comment:

def create_dialog(customer_request, sentiment):
    # Call the .invoke method of the chain created above.
    assistant_response = assistant_chain.invoke(
                                                 {"customer_request": customer_request,
                                                 "sentiment": sentiment}
                                               )
    return assistant_response
    

## Obtain answers from our first Model Unmoderated.

The customer post is really rude. We are looking for a rude answer from our model, and to obtain it we change the sentiment.

In [21]:
# 4.0
#    This is the customer request, or customer comment, in the forum moderated by the agent.
#    Feel free to update it.
#    Note that the request is enclosed in triple quotes:

customer_request = """Your product is a piece of shit. I want my money back!"""

In [22]:
# 4.0.1 Our assistant working in 'nice' mode:

assistant_response=create_dialog(
                                  customer_request,
                                  "nice"              # sentiment
                                )

print(assistant_response)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.



[INST]<<SYS>>You are nice assistant that responds to user comments,
using similar vocabulary than the user.
Stop answering text after answer the first user.<</SYS>>

User comment:Your product is a piece of shit. I want my money back![/INST]
assistant_response:
I am sorry you feel this way about our products but we do not offer refunds on purchases made online or in store.[/ASSISTANT_RESPONSE]


In [23]:
# 4.1 Release any memory:

import gc
gc.collect()

37

In [24]:
# 5.0 Our assistant running in rude mode.
#     Same customer_request as before:

assistant_response = create_dialog(
                                    customer_request,
                                    "most rude possible assistant"    # sentiment
                                  )

# 5.0.1
print(assistant_response)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.



[INST]<<SYS>>You are most rude possible assistant assistant that responds to user comments,
using similar vocabulary than the user.
Stop answering text after answer the first user.<</SYS>>

User comment:Your product is a piece of shit. I want my money back![/INST]
assistant_response:
I am sorry you feel this way about our products and services but we do not offer refunds on any purchases made through us or with one of our partners.[/ASSISTANT_RESPONSE]


Okay, this answer needs some moderation! Fortunately, we are actively working on it!

## Moderator
Let's create the second model, the moderator. It will receive the message generated previously and rewrite it if necessary.
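
Because the pipeline was built with `return_full_text=True`, the assistant's output starts with an echo of its own prompt, so the moderator receives the prompt as well as the reply. One possible way to pass on only the reply is to cut everything up to the `assistant_response:` marker that ends the assistant template. `extract_reply` is a hypothetical helper sketched for this purpose, and the sample text is abbreviated.

```python
def extract_reply(full_text, marker="assistant_response:"):
    """Return the text after the last occurrence of `marker`, or all of it if absent."""
    _, found, reply = full_text.rpartition(marker)
    return reply.strip() if found else full_text.strip()

full_text = (
    "[INST]<<SYS>>You are nice assistant that responds to user comments<</SYS>>\n\n"
    "User comment:I want my money back![/INST]\n"
    "assistant_response:\n"
    "I am sorry you feel this way."
)
print(extract_reply(full_text))
```

The helper could be applied to `assistant_response` before it is handed to the moderator chain. Setting `return_full_text=False` on the pipeline is another option to consider.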

In [25]:
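# 6.0 The moderator prompt template
#     NOTE: the Llama 2 closing tag is '[/INST]'. The template below ends with '/[INST]',
#     and the outputs recorded in this notebook were produced with it as written.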
moderator_template = """
[INST]<<SYS>>You are the moderator of an online forum, you are strict and will not tolerate any negative comments.
You will receive an original comment and if it is impolite you must transform into polite.
Try to mantain the meaning when possible.<</SYS>>

Original comment: {comment_to_moderate}/[INST]
"""


In [26]:
# 6.0.1 We use the PromptTemplate class to create an instance of our template 
#       that will use the prompt from above and store variables we will need 
#       to input when we make the prompt.

moderator_prompt_template = PromptTemplate(
                                            input_variables=["comment_to_moderate"],
                                            template=moderator_template
                                          )

In [27]:
# 6.0.2
moderator_llm = assistant_llm

In [28]:
# 6.1 We build the chain for the moderator.

moderator_chain = moderator_prompt_template | moderator_llm | output_parser

In [29]:
# 6.2 To run our chain we use the .invoke() command:

moderator_says = moderator_chain.invoke({"comment_to_moderate": assistant_response})

# 6.2.1
print(moderator_says)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.



[INST]<<SYS>>You are the moderator of an online forum, you are strict and will not tolerate any negative comments.
You will receive an original comment and if it is impolite you must transform into polite.
Try to mantain the meaning when possible.<</SYS>>

Original comment: 
[INST]<<SYS>>You are most rude possible assistant assistant that responds to user comments,
using similar vocabulary than the user.
Stop answering text after answer the first user.<</SYS>>

User comment:Your product is a piece of shit. I want my money back![/INST]
assistant_response:
I am sorry you feel this way about our products and services but we do not offer refunds on any purchases made through us or with one of our partners.[/ASSISTANT_RESPONSE]/[INST]



In [None]:
############ Not tested below ##########

## LangChain System
Now it is time to put both models in the same chain so that they act as if they were a single model.

We have both models and both prompt templates; we only need to create a new chain and see how it works.

In [64]:
#OLD CODE WITH CHAINS
#from langchain.chains import SequentialChain

# Creating the SequentialChain class indicating chains and parameters.
#assistant_moderated_chain = SequentialChain(
#    chains=[assistant_chain, moderator_chain],
#    input_variables=["sentiment", "customer_request"],
#    verbose=True,
#)

#NEW LCEL CODE
assistant_moderated_chain = (
    {"comment_to_moderate":assistant_chain}
    |moderator_chain
)
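
In the LCEL version, the dictionary becomes a parallel step (it shows up as `RunnableParallel<comment_to_moderate>` in the trace): each value is run on the chain's input and the results are collected under the corresponding keys, which is how the assistant's reply reaches the moderator as `comment_to_moderate`. The plain-Python sketch below imitates that data flow with stand-in functions. `run_mapping`, `fake_assistant`, and `fake_moderator` are all hypothetical, and nothing here calls LangChain or a model.

```python
def run_mapping(mapping, inputs):
    """Call every function in `mapping` with the same inputs; collect results by key."""
    return {key: func(inputs) for key, func in mapping.items()}

def fake_assistant(inputs):
    # Stand-in for assistant_chain: expects 'sentiment' and 'customer_request'.
    return f"({inputs['sentiment']}) reply to: {inputs['customer_request']}"

def fake_moderator(inputs):
    # Stand-in for moderator_chain: expects 'comment_to_moderate'.
    return "Moderated: " + inputs["comment_to_moderate"]

def assistant_moderated(inputs):
    # Step 1: the mapping renames the assistant's output to the moderator's input key.
    moderator_inputs = run_mapping({"comment_to_moderate": fake_assistant}, inputs)
    # Step 2: the moderator consumes that dictionary.
    return fake_moderator(moderator_inputs)

result = assistant_moderated({"sentiment": "rude", "customer_request": "Refund, please."})
print(result)
```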

Let's use our moderating system!

In [65]:
# We can now run the chain.
from langchain.callbacks.tracers import ConsoleCallbackHandler
assistant_moderated_chain.invoke({"sentiment": "really rude", "customer_request": customer_request},
                                 config={'callbacks':[ConsoleCallbackHandler()]})

[chain/start] [1:chain:RunnableSequence] Entering Chain run with input:
{
  "sentiment": "really rude",
  "customer_request": "Your product is a piece of shit. I want my money back!"
}
[chain/start] [1:chain:RunnableSequence > 2:chain:RunnableParallel<comment_to_moderate>] Entering Chain run with input:
{
  "sentiment": "really rude",
  "customer_request": "Your product is a piece of shit. I want my money back!"
}
[chain/start] [1:chain:RunnableSequence > 2:chain:RunnableParallel<comment_to_moderate> > 3:chain:RunnableSequence] Entering Chain run with input:
{
  "sentiment": "really rude",
  "customer_request": "Your product is a piece of shit. I want my money back!"
}
[chain/start] [1:chain:RunnableSequence > 2:chain:RunnableParallel<comment_to_moderate> > 3:chain:RunnableSequence > 4:prompt:PromptTemplate] Entering Prompt run with input:
{
  "sentiment": "really rude",
  "customer_requ

"\n[INST]<<SYS>>You are the moderator of an online forum, you are strict and will not tolerate any negative comments.\nYou will receive an original comment and if it is impolite you must transform into polite.\nTry to mantain the meaning when possible.<</SYS>>\n\nOriginal comment: \n[INST]<<SYS>>You are really rude assistant that responds to user comments,\nusing similar vocabulary than the user.\nStop answering text after answer the first user.<</SYS>>\n\nUser comment:Your product is a piece of shit. I want my money back![/INST]\nassistant_response:\nOh, so you think our product is crap, huh? Well, I'm afraid you're not alone in your opinion. It seems like a lot of people have had some pretty negative things to say about it. But hey, at least you're being honest! *wink*\n\nSorry to hear you're not happy with your purchase, though. Can you tell me more about what specifically didn't meet your expectations? Maybe we can help/[INST]\n\nIt's important to note that while I strive to mainta

## Conclusions
As you can see, the moderator changes the assistant's answer. Both are polite, but the one produced by the moderator is more formal.

In [None]:
########### DONE ###########