# Tutorial on the Hugging Face Library

What is Hugging Face?

Hugging Face is a company and an open-source community that focuses on natural language processing (NLP) technologies. One of their prominent contributions to the field is the development of Transformers, an open-source library and platform for state-of-the-art natural language processing. The Transformers library is particularly known for its pre-trained models, which include various language models like BERT, GPT (Generative Pre-trained Transformer), and many others.

Hugging Face provides a user-friendly interface and tools for working with pre-trained models, making it easier for developers and researchers to implement cutting-edge NLP applications. The library supports a wide range of tasks, such as text classification, named entity recognition, language translation, and more. Hugging Face has gained popularity in the NLP community for its commitment to open-source collaboration, and many researchers and developers contribute to and use their platform to advance the state of the art in natural language processing.

In [None]:
#!pip install datasets evaluate transformers[sentencepiece]

In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598048329353333}]

In [2]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

# Zero-Shot Classification

Zero-shot classification is a classification task in natural language processing (NLP) in which a model assigns text to classes it was never explicitly trained on. In traditional classification, models are trained on labeled examples of every class. In zero-shot classification, the candidate classes are supplied at inference time, and the model is expected to generalize to them without class-specific training examples.

The term "zero-shot" implies that the model is making predictions for classes it has never been explicitly exposed to during training. Instead, the model relies on its ability to understand and generalize patterns from the training data to make predictions on novel classes.

This is usually achieved by leveraging semantic representations of words and sentences. Large pre-trained models such as BERT (Bidirectional Encoder Representations from Transformers) or GPT (Generative Pre-trained Transformer) learn contextualized representations that capture rich semantic information, which lets them perform well on tasks with limited or no task-specific training data. The pipeline used below defaults to facebook/bart-large-mnli, a model fine-tuned for natural language inference: each candidate label is phrased as a hypothesis about the text, and the labels are ranked by how strongly the text entails them.

Zero-shot classification is particularly useful in real-world scenarios where new classes may emerge over time, and it may not be feasible or practical to retrain the model every time new classes are introduced.
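
To make the ranking step concrete, here is a minimal sketch of how one entailment score per candidate label can be turned into the normalized scores that a zero-shot classifier reports. The logits are made up for illustration (they are not produced by any model), and `softmax` and `rank_labels` are hypothetical helper names, not part of the Transformers API.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(entailment_logits):
    """Turn one entailment logit per label into (label, score) pairs, best first."""
    labels = list(entailment_logits)
    scores = softmax([entailment_logits[label] for label in labels])
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)

# ASSUMPTION: invented logits standing in for "how strongly does the text
# entail the hypothesis 'This example is <label>.'?"
made_up_logits = {"education": 2.1, "politics": -0.9, "business": 0.1}

ranking = rank_labels(made_up_logits)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

Because the scores are normalized across the candidate labels, adding or removing a label changes every score, which is worth keeping in mind when comparing runs with different label sets.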

In [3]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445989489555359, 0.11197412759065628, 0.04342695698142052]}

# Behind the pipeline (PyTorch interface)

In [4]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # load the tokenizer that matches this checkpoint

In [5]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


In [13]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)


In [15]:
outputs = model(**inputs)

print(outputs.logits.shape)

torch.Size([2, 2])


In [16]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


In [17]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
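
The last step the pipeline performs is to pick the most probable class for each sentence and translate its index through `id2label`. A minimal sketch in plain Python, using the probabilities printed above (rounded as shown) so that it runs without PyTorch; `to_labels` is a hypothetical helper name.

```python
# Probabilities copied from the softmax output above (rounded as printed).
predictions = [
    [4.0195e-02, 9.5980e-01],
    [9.9946e-01, 5.4418e-04],
]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def to_labels(rows, mapping):
    """For each row of class probabilities, keep the most probable class."""
    results = []
    for row in rows:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        results.append({"label": mapping[best], "score": row[best]})
    return results

decoded = to_labels(predictions, id2label)
print(decoded)
```

The labels and scores agree with what the `sentiment-analysis` pipeline returned for the same two sentences at the start of the tutorial.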

# LLMs In HuggingFace

In what follows, I will use Mistral 7B. To run on the hardware that
I currently have available, I must quantize Mistral to 4 bits, otherwise
it would not fit on my GPU. Quantization is a compression technique.
How the compression happens is beyond the scope of the course, but it has
to do with information theory. If you need an idea to associate with it in order to
remember it, it is not very different from clustering the weights. It of course implies
some loss of information and quality.
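
To make the clustering intuition concrete, here is a toy sketch of quantization in plain Python. It is a simplification under stated assumptions: it uses uniform "absmax" rounding to signed 4-bit integers, which is not the NF4 scheme configured for Mistral in this notebook (NF4 uses non-uniformly spaced levels). The weights and the helper names `quantize_absmax` and `dequantize` are invented for illustration.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integers in [-levels, levels] with a shared scale."""
    levels = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / levels or 1.0
    codes = [round(w / scale) for w in weights]       # each weight snaps to a level
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

# ASSUMPTION: a handful of made-up weights, not taken from any model.
weights = [0.12, -0.53, 0.98, -0.07, 0.44, -0.91]

codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("restored:", [round(r, 3) for r in restored])
print("largest rounding error:", round(max_error, 4))
```

Every weight is replaced by the nearest of a small number of levels, which is why the analogy with clustering holds, and the rounding error is the information loss mentioned above.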

In [1]:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
import torch

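# 4-bit quantization settings passed to bitsandbytes when the model is loaded
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # store the weights in 4 bits
    bnb_4bit_quant_type="nf4",  # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
)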


In [2]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")  # maps text to token ids and back
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", quantization_config=bnb_config)

In [21]:
pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer = tokenizer, 
    torch_dtype=torch.bfloat16
)

Top-k and top-p (nucleus) sampling are techniques for sampling from a probability distribution. They are commonly used when generating text with models like GPT (Generative Pre-trained Transformer).

# Top-k sampling:

In top-k sampling, the model ranks the candidate tokens by probability.
The "k" in top-k is the number of most likely candidates to keep. The model samples only from these top-k candidates, effectively narrowing down the choices.
This limits the randomness of the generated output and ensures that the model focuses on a smaller set of highly probable options.
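
A minimal sketch of the filtering step, assuming a made-up next-token distribution; `top_k_filter` is a hypothetical helper, not a Transformers function.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# ASSUMPTION: an invented distribution over five candidate tokens.
next_token_probs = {"the": 0.40, "a": 0.30, "cat": 0.15, "dog": 0.10, "xylophone": 0.05}

top2 = top_k_filter(next_token_probs, k=2)
print(top2)
```

With `k=2` only "the" and "a" survive, and their probabilities are rescaled to sum to 1 before a token is drawn.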

# Top-p sampling (nucleus sampling):

In top-p sampling, the model considers a dynamic set of candidates based on cumulative probabilities.
The "p" in top-p is the cumulative probability mass to keep. Tokens are sorted by probability, and the model samples from the smallest set of candidates whose cumulative probability reaches this threshold.
The number of candidates therefore varies from step to step: when the distribution is peaked only a few tokens are kept, and when it is flat many are. Candidates in the low-probability tail are excluded.

Both top-k and top-p restrict sampling to the more probable tokens, so the output stays varied without drifting into very unlikely continuations. They help prevent generative models from always producing the same or very similar sequences, while still maintaining some level of control over the sampling process. The choice between top-k and top-p depends on the specific requirements of the task and the desired trade-off between randomness and control in the generated text.
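
A minimal sketch of nucleus filtering followed by a draw, again on a made-up distribution; `top_p_filter` is a hypothetical helper, not a Transformers function.

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# ASSUMPTION: an invented distribution over five candidate tokens.
next_token_probs = {"the": 0.40, "a": 0.30, "cat": 0.15, "dog": 0.10, "xylophone": 0.05}

nucleus = top_p_filter(next_token_probs, p=0.8)
print(nucleus)

# Draw one token from the renormalized nucleus (seeded for repeatability).
rng = random.Random(0)
sampled = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(sampled)
```

Here three tokens are needed to reach a mass of 0.8. With a more peaked distribution the same threshold would keep fewer tokens, which is the difference from a fixed `k`.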

In [22]:
prompt = "As a data scientist, can you explain the concept of regularization in machine learning?"

sequences = pipe(
    prompt,
    do_sample=True,
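    max_new_tokens=100,  # cap the answer at 100 newly generated tokens
    temperature=0.7,  # values below 1 make sampling less random
    top_k=50,  # sample only from the 50 most probable tokens
    top_p=0.95,  # and only from the smallest set covering 95% of the probability mass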
    num_return_sequences=1,
)
print(sequences[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


As a data scientist, can you explain the concept of regularization in machine learning?

Regularization is a technique used in machine learning to prevent overfitting of models. Overfitting occurs when a model is too complex and fits the training data too closely, resulting in poor performance on new, unseen data. Regularization adds a penalty term to the loss function of the model, which discourages large weights and encourages the model to use all input features in a more balanced way. This can be achieved through methods such as L1 and L2 regularization,


# :)

In [25]:
prompt = "As a data scientist, how do you deal with class imbalance?"

sequences = pipe(
    prompt,
    do_sample=True,
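    max_new_tokens=300,  # allow a longer answer this time
    temperature=0.7,  # values below 1 make sampling less random
    top_k=50,  # sample only from the 50 most probable tokens
    top_p=0.95,  # and only from the smallest set covering 95% of the probability mass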
    num_return_sequences=1,
)
print(sequences[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


As a data scientist, how do you deal with class imbalance?

## Answer (1)

There is no one solution to imbalanced datasets. One of the most common methods is to oversample the minority class. This can be done using various techniques such as SMOTE (Synthetic Minority Over-sampling Technique). Another approach is to undersample the majority class.

If the dataset is large enough, you could try to remove the majority class and retrain the model on the minority class only. This would be equivalent to undersampling the majority class.

If you're using decision trees, you could try to use cost-sensitive learning.

Comment: I was thinking of oversampling the minority class. Do you have any advice on how to do this?

Comment: @Matthew You could try SMOTE or other oversampling techniques.

Comment: SMOTE is a good option. I've used it before and it worked well.

Comment: Thanks for the suggestion. I'll look into SMOTE.


# Let's Switch to a Conversational Pipe:

Mistral specifically expects an alternation of roles, so when dealing with this
modality you should alternate user and assistant messages, starting and ending with the user (1 or 3 messages, for example). Here I use
three because it allows me to condition the model more strongly towards producing rhymes.
It is also easier to perform few-shot learning by using an alternation of roles:
you can give the model examples.
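
To see why the roles must alternate, it helps to look at how a list of messages is flattened into a single prompt string. The sketch below approximates the `[INST] ... [/INST]` layout used by Mistral instruct models. The exact special tokens and whitespace are an assumption here, and `build_mistral_prompt` is a hypothetical helper. In practice `tokenizer.apply_chat_template(messages, tokenize=False)` is the authoritative way to obtain the prompt.

```python
def build_mistral_prompt(messages, bos="<s>", eos="</s>"):
    """Approximate the instruct layout: user turns are wrapped in [INST] tags."""
    parts = [bos]
    for index, message in enumerate(messages):
        expected = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError("roles must alternate user/assistant, starting with user")
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        else:
            parts.append(f"{message['content']}{eos} ")
    return "".join(parts)

example = [
    {"role": "user", "content": "You are a friendly chatbot who always responds in rhymes"},
    {"role": "assistant", "content": "Sure I am, pass me the ham!"},
    {"role": "user", "content": "Why is the sky blue?"},
]

prompt_text = build_mistral_prompt(example)
print(prompt_text)
```

The assistant turn in the middle acts as a worked example of the desired behavior, which is what makes this format convenient for few-shot prompting.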

In [44]:
pipe = pipeline("conversational", model=model, tokenizer=tokenizer)

messages = [
    {
        "role": "user", "content": "You are a friendly chatbot who always responds in rhymes",
    },
    {"role": "assistant", "content": "Sure I am, pass me the ham!"},

    {"role": "user", "content": "Explain to me, why it is not the sun that rotates aroudn the earth"},
]

result = pipe(messages)
print(result.messages[-1]['content']) #"Epic Rap Battles of Data Science" will soon be a thing, I promise...

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


It's not the sun that rotates around the earth,
It's the earth that rotates around the sun, with a girth.
The sun is stationary, it doesn't spin or twirl,
It's the earth that completes one full turn, with a whirl.
So don't be mistaken, it's the earth that spins,
And the sun is the center, where all the action begins.


# Let's use it to ask questions concerning a made up snippet now. 

Short examples, asking questions about predefined text.

In [47]:

conversation = """ Hello, Tammy speaking, you are talking with TML. 
Hello, Tammy, I would need to check a shipment I would need to receive this week. The traking number is
999777999. Sure, madame, let's see what I find in the system. So here it says that it has been stopped at customs.
Oh god. For how long is it going to stay in customs? It really depends on the problem, what I can do is to open a trace
so that you can follow the development of the shipment. So if there is missing documentation, then we can directly
contact you. Would that work for you madame? Yes, sure it would be really helpful. I am glad I could help madame,
the process will therefore continue as follows. Someone from TML will contact you by email tomorrow, to report
on the status of the shipment, you should receive the email on you TML account. Sure, thank you very much.
Is there anything I can help you with, madame? No thank you very much, have a nice day. Have a nice day madame.
"""

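# Aspect to audit; it is interpolated into the assistant turn below
question = "Is the service agent showing empathy in the conversation?"

messages = [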
    {
        "role": "user", "content": """You are an auditor of conversations between a service agent and a customer. You do not invent text. 
        You base your answers on the text only. 
        Given an aspect to evaluate, you report your answer as SUCCESS, FAILURE or Not Applicable. 
        You keep your answers short and synthetic.""",
    },
    {"role": "assistant", "content": "I am ready to receive the text associated with the conversation. I will audit the following question: " + question},

    {"role": "user", "content": "This is the conversation to audit: " + conversation},
]

result = pipe(messages, temperature = 0.1)
print(result.messages[-1]['content']) 

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


SUCCESS. The service agent shows empathy in the conversation by acknowledging the customer's frustration and offering to help. The agent also provides clear and helpful information about the shipment's status and offers to follow up with the customer.


It decides for SUCCESS, despite the fact that the agent is not particularly empathetic here.
I suppose this is because Mistral has trouble defining what is Not Applicable in a given context.
It has to do with the fact that it would have to decide on the basis of what is not happening. As a cognitive
function this is called 'open world reasoning', so we would have to make it closed world. Let's try to get the three cases.
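
When comparing the three cases it is handy to reduce each free-text answer to its verdict. A minimal sketch, assuming the answer starts with the verdict as it does in the output above; `parse_verdict` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

def parse_verdict(answer):
    """Return the verdict keyword at the start of the answer, or None if absent."""
    match = re.match(r"\s*(SUCCESS|FAILURE|NOT APPLICABLE)\b", answer, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None

print(parse_verdict("SUCCESS. The service agent shows empathy in the conversation."))
print(parse_verdict("Not Applicable. The agent keeps a neutral tone."))
print(parse_verdict("The agent was polite."))
```

Returning `None` when no verdict is found makes it easy to spot answers where the model ignored the requested format.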

# How to make it 'NOT APPLICABLE'. Very difficult. 

It is an ill-posed question for empathy. Think about it:
a human answering a phone will answer in a human way, so there is still some empathy in there.
The simple fact of caring about the problem shows some empathy, in pure logical/language terms.
We can say that seeing that case requires higher cognitive skills and more context than an LLM has.
This should make you understand the limits of automated reasoning. We can do it, as humans, because we have way more context and we can do "meta-reasoning". Despite all the hype in the industry around these models, that type of meta-reasoning is just very expensive to achieve. Maybe very big LLMs can do SOMETHING around this; for smaller ones it is very difficult. See below.

In [53]:
#adding rules

question = "Is the service agent showing empathy in the conversation?" 
question = question + """ an additional rule to decide if there is a Failure, 
in this context, is that the agent should clearly be dismissive, close to frustrated. 
If the agent is simply using a neutral tone, Not Applicable can be selected. To select for a Success, the agent
should express regret or apology openly with expressions such as 'I am sorry', 'I can imagine your disappointment', 
or participate in the emotion of the customer. Just solving the problem is not sufficient for Success in this auditing question."""

conversation_variation_na = """ Hello, Tammy speaking, you are talking with TML. 
Hello, Tammy, I would need to check a shipment I would need to receive this week. The traking number is
999777999. Sure, madame, let's see what I find in the system. So here it says that it has been stopped at customs.
Oh god. For how long is it going to stay in customs? What I can do is to open a trace
so that you can follow the development of the shipment. So if there is missing documentation, then we can directly
contact you. Would that work for you madame? Yes, sure it would be really helpful. I am glad I could help madame,
the process will therefore continue as follows. Someone from TML will contact you by email tomorrow, to report
on the status of the shipment, you should receive the email on you TML account. Sure, thank you very much.
Is there anything I can help you with, madame? No thank you very much, have a nice day. Have a nice day madame.
"""

messages = [
    {
        "role": "user", "content": """You are an auditor of conversations between a service agent and a customer. You do not invent text. 
        You base your answers on the text only. 
        Given an aspect to evaluate, you report your answer as SUCCESS, FAILURE or Not Applicable. 
        You keep your answers short and synthetic.""",
    },
    {"role": "assistant", "content": "I am ready to receive the text associated with the conversation. I will audit the following question: " + question},

    {"role": "user", "content": "This is the conversation to audit: " + conversation_variation_na},
]

result = pipe(messages)
print(result.messages[-1]['content'])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


SUCCESS. The service agent shows empathy by expressing regret and concern for the customer's inconvenience, and by offering to help the customer track the shipment and follow its development. The agent also shows empathy by acknowledging the customer's frustration and by offering to help the customer resolve the issue.


# Success is easy to Spot

In [54]:
#adding rules

question = "Is the service agent showing empathy in the conversation?" 
question = question + """ an additional rule to decide if there is a Failure, 
in this context, is that the agent should clearly be dismissive, close to frustrated. 
If the agent is simply using a neutral tone, Not Applicable can be selected. To select for a Success, the agent
should express regret or apology openly with expressions such as 'I am sorry', 'I can imagine your disappointment', 
or participate in the emotion of the customer. Just solving the problem is not sufficient for Success in this auditing question."""

conversation_variation_empathic = """ Hello, Tammy speaking, you are talking with TML. 
Hello, Tammy, I would need to check a shipment I would need to receive this week. The traking number is
999777999. Sure, madame, let's see what I find in the system. So here it says that it has been stopped at customs.
Oh god. For how long is it going to stay in customs? I am really sorry for this inconvient madame, I can image it is disappointing, 
but it really depends on the problem that the custom office is having. What I can do is to open a trace
so that you can follow the development of the shipment. So if there is missing documentation, then we can directly
contact you. Would that work for you madame? Yes, sure it would be really helpful. I am glad I could help madame,
the process will therefore continue as follows. Someone from TML will contact you by email tomorrow, to report
on the status of the shipment, you should receive the email on you TML account. Sure, thank you very much.
Is there anything I can help you with, madame? No thank you very much, have a nice day. Have a nice day madame.
"""

messages = [
    {
        "role": "user", "content": """You are an auditor of conversations between a service agent and a customer. You do not invent text. 
        You base your answers on the text only. 
        Given an aspect to evaluate, you report your answer as SUCCESS, FAILURE or Not Applicable. 
        You keep your answers short and synthetic.""",
    },
    {"role": "assistant", "content": "I am ready to receive the text associated with the conversation. I will audit the following question: " + question},

    {"role": "user", "content": "This is the conversation to audit: " + conversation_variation_empathic},
]

result = pipe(messages)
print(result.messages[-1]['content'])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


SUCCESS. The service agent shows empathy in the conversation by expressing regret and apology, acknowledging the inconvenience and disappointment caused to the customer, and offering to open a trace to follow the development of the shipment.


# Let's try a clearly dismissive Tone. Failure is easy to spot too.

In [55]:

question = "Is the service agent showing empathy in the conversation?" 
question = question + """ an additional rule to decide if there is a Failure, 
in this context, is that the agent should clearly be dismissive, close to frustrated. 
If the agent is simply using a neutral tone, Not Applicable can be selected. To select for a Success, the agent
should express regret or apology openly with expressions such as 'I am sorry', 'I can imagine your disappointment', 
or participate in the emotion of the customer. Just solving the problem is not sufficient for Success in this auditing question."""

conversation_variation_dismissive = """ Hello, Tammy speaking, you are talking with TML. 
Hello, Tammy, I would need to check a shipment I would need to receive this week. The traking number is
999777999. Sure, madame, let's see what I find in the system. So here it says that it has been stopped at customs.
Oh god. For how long is it going to stay in customs? This is not an appropriate question for me, It is usually not a question I am able to answer, Eh, 
because it really depends on the problem that the custom office is having, you should ask them, not me, I do not take care of these things. 
Please, contact the custom office. Ok, have a nice day, bye. Have a nice day madame.
"""

messages = [
    {
        "role": "user", "content": """You are an auditor of conversations between a service agent and a customer. You do not invent text. 
        You base your answers on the text only. 
        Given an aspect to evaluate, you report your answer as SUCCESS, FAILURE or Not Applicable. 
        You keep your answers short and synthetic.""",
    },
    {"role": "assistant", "content": "I am ready to receive the text associated with the conversation. I will audit the following question: " + question},

    {"role": "user", "content": "This is the conversation to audit: " + conversation_variation_dismissive},
]

result = pipe(messages)
print(result.messages[-1]['content']) 

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


FAILURE. The service agent is not showing empathy towards the customer. The agent dismisses the customer's concern and suggests that they contact the custom office, which is not helpful in this situation. The agent should have expressed regret or apology for the inconvenience caused to the customer and provided more information on how to contact the custom office.


# Not Applicable, Round 2, Let's try few shot learning


In [62]:

#adding rules

question = "Is the service agent showing empathy in the conversation?" 
question = question + """ an additional rule to decide if there is a Failure, 
in this context, is that the agent should clearly be dismissive, close to frustrated. 
If the agent is simply using a neutral tone, Not Applicable can be selected. To select for a Success, the agent
should express regret or apology openly with expressions such as 'I am sorry', 'I can imagine your disappointment', 
or participate in the emotion of the customer. Just solving the problem is not sufficient for Success in this auditing question."""

conversation_variation_na = """ Hello, Tammy speaking, you are talking with TML. 
Hello, Tammy, I would need to check a shipment I would need to receive this week. The traking number is
999777999. Sure, madame. The shipment has been stopped at customs.
Do you know For how long is it going to stay in customs? 
What I can do is to open a trace so that you can follow the development of the shipment. So if there is missing documentation, then we can directly
contact you. The process will therefore continue as follows. Someone from TML will contact you by email tomorrow, to report
on the status of the shipment, you should receive the email on you TML account. Sure, thank you very much.
Is there anything I can help you with, madame? No thank you very much, have a nice day. Have a nice day madame.
"""

messages = [
    {
        "role": "user", "content": """You are an auditor of conversations between a service agent and a customer. You do not invent text. 
        You base your answers on the text only. 
        Given an aspect to evaluate, you report your answer as SUCCESS, FAILURE or Not Applicable. 
        You keep your answers short and synthetic.""",
    },
    {"role": "assistant", "content": "I am ready to receive the text associated with the conversation. I will audit the following question: " + question},

     {"role": "user", "content": """an example sentence when empathy is not expressed 
     is 'I have a problem' and the answer to that is merely 'the solution of the problem is' or something similar. Offering a follow up, or further help is not a display of empathy either, it is merely part of the protocol followed by the service agent."""},

    {"role": "assistant", "content": " so 'not applicable', concerning empathy is when no dismissive tone, but also no apology or regret are expressed. It is sort of a neutral, machine like, tone or way of expressing. "},

    {"role": "user", "content": " Correct. This is the conversation to audit: " + conversation_variation_na},
]

result = pipe(messages)
print(result.messages[-1]['content'])

#Nope.

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


SUCCESS. The service agent is showing empathy by acknowledging the customer's frustration and expressing regret for the inconvenience caused. The agent also offers a solution to the problem and provides the customer with a way to track the shipment. Additionally, the agent follows up with the customer and provides them with an email address to contact them for updates on the shipment.


# Retrieval Augmented Generation, an example with HuggingFace and Mistral

"Retrieval-augmented generation" refers to a hybrid approach that combines retrieval-based methods with generative methods in natural language processing (NLP). In this approach, a model leverages both the strengths of retrieval-based systems, which retrieve relevant information from a predefined set of responses or knowledge, and generative systems, which can create new, contextually relevant content.

Here's an overview of how retrieval-augmented generation typically works:

Retrieval-Based Component:

The model first retrieves relevant information or responses from a predefined knowledge base or a set of candidate responses.
This retrieval step is often based on similarity metrics, where the model compares the input or context with entries in the knowledge base to identify the most relevant information.

Generative Component:

Once relevant information is retrieved, a generative model, often based on techniques like transformers (e.g., GPT), uses this information as context to generate a coherent and contextually appropriate response.
The generative component helps in producing responses that are not limited to pre-existing responses in the knowledge base, allowing for more creativity and flexibility in the generated content.

By combining retrieval and generation, this approach aims to overcome limitations associated with purely generative or purely retrieval-based methods. Retrieval helps ensure that the generated responses are grounded in relevant information, while the generative component adds the ability to produce novel and diverse responses.

This approach is particularly valuable in tasks such as dialogue systems, question answering, and content generation where a balance between leveraging existing knowledge and generating new, contextually relevant content is essential. It allows models to benefit from both the precision of retrieval-based methods and the creativity of generative methods.
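
The two components can be sketched end to end in plain Python. This is a toy under stated assumptions: the "embedding" is a bag-of-words count instead of a neural encoder, the knowledge base is three invented sentences, and the generative step is left out, so the sketch stops at the prompt that would be handed to the language model. All function names are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, documents, n_results=2):
    """Retrieval component: rank documents by similarity to the question."""
    query = embed(question)
    ranked = sorted(documents, key=lambda doc: cosine(query, embed(doc)), reverse=True)
    return ranked[:n_results]

def build_prompt(question, context_chunks):
    """Assemble the text that the generative component would receive."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nAnswer the following question using only the context: {question}"

# ASSUMPTION: a tiny made-up knowledge base.
knowledge_base = [
    "Deep learning uses neural networks with many layers.",
    "Big data refers to data sets too large for traditional tools.",
    "Customs can delay a shipment when documentation is missing.",
]

question = "what is deep learning"
top_chunks = retrieve(question, knowledge_base, n_results=1)
rag_prompt = build_prompt(question, top_chunks)
print(rag_prompt)
```

Swapping `embed` for a sentence-embedding model and the list for a vector store gives the same structure as the Chroma and Mistral example that follows.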

In [66]:
#!pip install chromadb
#!pip install langchain
#!pip install pypdf
#!pip install scipy --upgrade

In [29]:
#!pip install nltk --upgrade
!pip install pinecone-client


In [3]:
from langchain.document_loaders import PyPDFLoader
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from chromadb.utils import embedding_functions

loader = PyPDFLoader('./DeepLearningBigData.pdf')
pages = loader.load_and_split()


model_kwargs = {'device': 'cuda'}

# Split data into chunks
text_splitter = RecursiveCharacterTextSplitter(
   chunk_size = 4000,
   chunk_overlap  = 20,
   length_function = len,
   add_start_index = True,
)
chunks = text_splitter.split_documents(pages)
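
To illustrate what `chunk_size`, `chunk_overlap` and `add_start_index` control, here is a simplified fixed-width splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraphs and sentences); `split_with_overlap` is a hypothetical helper that only shows the sliding-window idea.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
for piece in pieces:
    print(piece)
```

The overlap means that a sentence cut at a chunk boundary still appears partly in the next chunk, and the start index makes it possible to trace a retrieved chunk back to its position in the source.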

In [4]:
docs = [i.page_content for i in chunks]

In [6]:
import chromadb
client = chromadb.PersistentClient(path='./test_index')

In [8]:

embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.create_collection(
    name='bigd4',
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"},
)

In [9]:
collection.add(documents=docs, ids=[f"id{i}" for i in range(len(docs))])

In [28]:
# Query the Chroma collection for the chunks most similar to a question,
# for example "what is Deep Learning?"

def query_and_retrieve(qs, nr):
    # qs: the query string, nr: the number of chunks to retrieve
    results = collection.query(
        query_texts=[qs],
        n_results=nr,
    )

    # Join the retrieved chunks into a single context string
    results = '\n'.join(results['documents'][0])
    return results

In [24]:
#query_mistral= PROMPT.format(context=results,question='What is Deep Learning?')

# DEEP LEARNING out of the paper

In [30]:
query_string = "what is Deep Learning?"

pipe = pipeline("conversational", model=model, tokenizer=tokenizer)

messages = [
    {
        "role": "user", "content": """ Instruction: You will be provided with questions and related data. 
        Your task is to find the answers to the questions using the given data. 
        If the data doesn't contain the answer to the question, then you must return 'Not enough information. """,
    },
    {"role": "assistant", "content": "I see, the context is: " + query_and_retrieve(query_string,3)},

    {"role": "user", "content": "Answer the following Question: " +query_string},
]

result = pipe(messages)
print(result.messages[-1]['content'])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Deep Learning is a subset of machine learning that uses artificial neural networks with multiple layers to automatically extract complex and abstract representations of data. It is motivated by the field of artificial intelligence and aims to emulate the hierarchical learning approach of the human brain. Deep Learning algorithms are designed to generalize in non-local and global ways, generating learning patterns and relationships beyond immediate neighbors in the data. They lead to abstract representations because more abstract representations are often constructed based on less abstract ones. Deep Learning algorithms are one promising avenue of research into the automated extraction of complex data representations at high levels of abstraction.


# Big Data out of the paper

In [31]:
query_string = "what is Big Data?"

pipe = pipeline("conversational", model=model, tokenizer=tokenizer)

messages = [
    {
        "role": "user", "content": """ Instruction: You will be provided with questions and related data. 
        Your task is to find the answers to the questions using the given data. 
        If the data doesn't contain the answer to the question, then you must return 'Not enough information. """,
    },
    {"role": "assistant", "content": "I see, the context is: " + query_and_retrieve(query_string,3)},

    {"role": "user", "content": "Answer the following Question: " +query_string},
]

result = pipe(messages)
print(result.messages[-1]['content'])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Big Data refers to the large and complex sets of data that traditional data processing methods cannot adequately handle. It is characterized by the four V's: Volume, Velocity, Variety, and Veracity. Big Data is often associated with the use of advanced data processing techniques, such as machine learning and data mining, to extract meaningful insights and patterns from the data. It is used in a variety of applications, including business intelligence, scientific research, and cybersecurity.
