# RAG FOR ADIDAS SALES DATASET
This notebook shows how to set up a Retrieval-Augmented Generation (RAG) pipeline, built on a large language model (LLM), to get quick, specific answers about the given dataset.

### LIBRARIES
Installation of required libraries.

In [None]:
%%capture
! pip install farm-haystack[all-gpu] datasets transformers bitsandbytes accelerate langchain[llms]
! pip install sentencepiece && touch /root/.ipython/profile_default/startup/00-first.py && echo "import sentencepiece" >> /root/.ipython/profile_default/startup/00-first.py

In [None]:
# Haystack
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes.retriever import EmbeddingRetriever
from haystack.nodes import PromptNode, PromptTemplate
from haystack.pipelines import Pipeline
from haystack.schema import Document

# Llama2
import torch
import transformers
from transformers import LlamaForCausalLM, LlamaTokenizer, GenerationConfig

# Other
from datasets import Dataset
import pandas as pd
from google.colab import files

import warnings
warnings.filterwarnings("ignore")

### DATASET
Load the dataset.

In [None]:
uploaded = files.upload()
my_files = list(uploaded.keys())

In [None]:
df = pd.read_excel("/content/Adidas US Sales Datasets Edit.xlsx")

In [None]:
print(df.head())

### LLAMA2
Download the LLM.

**Login to Huggingface**

In [None]:
# The access token is read from the HF_TOKEN environment variable (an assumption) instead of being hardcoded in the notebook.
! huggingface-cli login --token $HF_TOKEN

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


**Download model and tokenizer**

In [None]:
pretrained_model_name_or_path = "meta-llama/Llama-2-13b-chat-hf"

model = LlamaForCausalLM.from_pretrained(
    pretrained_model_name_or_path,
    load_in_4bit=True,
    torch_dtype=torch.float16
)

tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path)
model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear4bit(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm
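
As a rough check on why `load_in_4bit=True` matters, the sketch below estimates the memory taken by the weights alone. It assumes about 13 billion parameters (read off the model name) and ignores quantization constants, layers kept in higher precision, activations, and the KV cache, so real usage will be higher. The helper `weight_memory_gb` is illustrative and not part of any library.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 13e9  # assumption: roughly 13 billion parameters
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)

print(f"float16 weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:   ~{int4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the float16 footprint, which is what lets the model fit on a single Colab GPU.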

### DOCUMENT STORE
Create a document store for the selected dataset: each row becomes a separate document. This is needed so that the retriever can search the data by embedding similarity.

In [None]:
def create_documents(df):
  documents = []
  columns_to_use = list(df.columns)

  for row in df.values.tolist():
    content_to_link = ""
    meta_to_link = {}
    for i in range(0, len(row)):
      content_to_link = content_to_link + " " + str(row[i])
      meta_to_link.update({columns_to_use[i]: row[i]})

    documents.append(
        Document(
            content = content_to_link,
            meta = meta_to_link
        )
    )

  return documents

In [None]:
docs = create_documents(df)
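
To make the row-to-document mapping concrete, here is a minimal sketch that mirrors `create_documents` for a single row. A plain dictionary stands in for the Haystack `Document`, and the column names and values are made up for illustration, not taken from the dataset.

```python
def row_to_document(columns, row):
    """Build the content string and metadata for one row (dict instead of Document)."""
    content = ""
    meta = {}
    for name, value in zip(columns, row):
        content = content + " " + str(value)  # every cell is appended to the text
        meta[name] = value                    # and also stored under its column name
    return {"content": content, "meta": meta}

# Hypothetical columns and values, for illustration only
columns = ["Retailer", "State", "Product", "Total Sales"]
row = ["Foot Locker", "New York", "Women's Athletic Footwear", 1000.0]

doc = row_to_document(columns, row)
print(doc["content"])
print(doc["meta"])
```

The text in `content` is what gets embedded, while `meta` keeps the original typed values available for inspection.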

**Document store**

In [None]:
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents(docs)

Updating BM25 representation...: 100%|██████████| 9648/9648 [00:00<00:00, 101726.11 docs/s]


### PIPELINE
Create a pipeline to gather the information from the document store and assemble a prompt for the LLM.

**Embedding retriever**

The embedding retriever will select the k most relevant documents from the store.

In [None]:
%%capture
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_format="sentence_transformers",
    top_k=20)

In [None]:
document_store.update_embeddings(retriever)

Documents Processed: 10000 docs [00:38, 261.10 docs/s]


In [None]:
# Test retriever (optional)
retriever.retrieve("Foot locker in new York")
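
The ranking step behind the retriever can be sketched as follows: embed the query, score every document embedding against it, and keep the `top_k` highest scores. The sketch assumes dot-product scoring (as the embedding model's name suggests) and uses made-up three-dimensional vectors; the `top_k` helper is hypothetical and not the Haystack implementation.

```python
import heapq

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents with the highest dot-product score."""
    scored = [(dot(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

# Toy embeddings, invented for illustration
doc_vecs = {
    "row_0": [0.9, 0.1, 0.0],
    "row_1": [0.1, 0.8, 0.1],
    "row_2": [0.7, 0.3, 0.2],
}
query_vec = [1.0, 0.0, 0.0]

best = top_k(query_vec, doc_vecs, k=2)
print(best)
```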

**Prompt node**

Create a template and assemble a prompt.

In [None]:
prompt_template = """Synthesize a comprehensive answer from the following text for the given question.
              Provide a clear and concise response that summarizes the key points and information presented in the text.
              Your answer should be in your own words and be no longer than 50 words.
              \n\n Related text: {join(documents)} \n\n Question: {query} \n\n Answer:"""

brief_template = """[INST]<<SYS>>Synthesize a comprehensive answer from the following text for the given question.
              Provide a clear and concise response that summarizes the key points and information presented in the text.
              Your answer should be in your own words and be no longer than 50 words.<</SYS>>
              \n\n Related text: {join(documents)} \n\n Question: {query}<<SYS>>
              The output must be ONLY the requested information. Avoid returning irrelevant information.<</SYS>>[/INST]"""

prompt_node = PromptNode(
    model_name_or_path="meta-llama/Llama-2-13b-chat-hf", default_prompt_template=prompt_template, max_length=200,
    model_kwargs={'model':model,
                  'tokenizer':tokenizer,
                  'task_name':'text-generation',
                  'device':None,
                  'stream':True}
)
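
To illustrate what the model eventually receives, the sketch below fills a shortened version of the standard template by hand. It assumes that `join(documents)` concatenates the retrieved document contents with a single space; `build_prompt`, the shortened `TEMPLATE`, and the sample snippets are hypothetical.

```python
# Shortened stand-in for the standard template, with plain str.format placeholders
TEMPLATE = (
    "Synthesize a comprehensive answer from the following text for the given question."
    "\n\n Related text: {related_text} \n\n Question: {query} \n\n Answer:"
)

def build_prompt(documents, query):
    """Join the retrieved texts and substitute them, with the query, into the template."""
    related_text = " ".join(documents)
    return TEMPLATE.format(related_text=related_text, query=query)

# Made-up retrieved snippets
documents = [
    "Foot Locker New York Women's Athletic Footwear",
    "Walmart Texas Men's Apparel",
]

prompt = build_prompt(documents, "Which retailers appear in the text?")
print(prompt)
```

With `top_k=20`, all twenty retrieved rows end up in the prompt, so the prompt length grows with both `top_k` and the number of columns per row.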

**Pipeline**

Assemble the pipeline, composed of every step needed to perform RAG.

In [None]:
pipe = Pipeline()
pipe.add_node(component=retriever, name="retriever", inputs=["Query"])
pipe.add_node(component=prompt_node, name="prompt_node", inputs=["retriever"])

### CHAT

Examples of natural-language requests and the model's answers, based on the information returned by the retriever.

**Standard template**

In [None]:
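def chat_from_pipeline(pipeline, query):
  # The prompt node was created with stream=True, so the answer is printed
  # token by token while the pipeline runs; the returned dict is not used here.
  pipeline.run(query=query)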

In [None]:
chat_from_pipeline(pipe, "Foot Locker in new york that sales Women's Athletic Footwear")

Based on the text, Foot Locker in New York sells Women's Athletic Footwear with a price range of $41.0 to $49.0 and an average price of $45.0. The sales are predominantly online with a percentage of in-store sales as well. The data is from the period of January 2020 to March 2020.</s>


In [None]:
chat_from_pipeline(pipe, "Walmart in Texas with total sales of at least 600000 dollars")

Walmart in South Texas with total sales of at least 600000 dollars has a variety of products including Women's and Men's Apparel, Women's Athletic Footwear, and Accessories. The top-selling products are Women's Apparel with total sales of 320000 dollars, Men's Apparel with total sales of 292500 dollars, and Women's Athletic Footwear with total sales of 135000 dollars. The products are available both in-store and online, with the majority of sales coming from Outlet stores. The total sales for the period were at least 600000 dollars.</s>


In [None]:
chat_from_pipeline(pipe, "The sales amount of Foot Locker's store in new york that sales Women's Athletic Footwear")

The sales amount of Foot Locker's store in New York that sales Women's Athletic Footwear is $35,000.

Explanation: Based on the given text, the sales amount of Foot Locker's store in New York that sales Women's Athletic Footwear is $35,000. This information can be found in the text under the date January 24, 2020, and the product category Women's Athletic Footwear.</s>
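
The answers above end with the end-of-sequence marker `</s>`. If the generated text is captured rather than streamed to the screen, a small helper along these lines could tidy it up. `clean_answer` is a hypothetical name, and the sample string is only an illustration.

```python
EOS_TOKEN = "</s>"  # end-of-sequence marker seen at the end of the streamed answers

def clean_answer(text):
    """Strip surrounding whitespace and a trailing end-of-sequence marker."""
    text = text.strip()
    if text.endswith(EOS_TOKEN):
        text = text[: -len(EOS_TOKEN)]
    return text.strip()

raw = " The sales amount is $9450.0.</s>"
cleaned = clean_answer(raw)
print(cleaned)
```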


**Brief template (To fix)**

In [None]:
chat_from_pipeline(pipe, "The sales amount of Foot Locker's store in new york that sales Women's Athletic Footwear")

 Sure! Here's the answer to your question based on the provided text:

The sales amount of Foot Locker's store in New York that sells Women's Athletic Footwear is $9450.0.</s>
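
One possible direction for fixing the brief template: Llama 2 chat models are usually prompted with a single `<<SYS>>` block inside one `[INST]` turn, whereas the brief template opens two system blocks. The sketch below builds a prompt in that single-block layout. `llama2_chat_prompt` is a hypothetical helper, the system text is a condensed paraphrase of the original instructions, and the beginning-of-sequence token is assumed to be added by the tokenizer.

```python
def llama2_chat_prompt(system, user):
    """Wrap one system message and one user turn in Llama 2 chat markers."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

system = (
    "Synthesize a concise answer from the given text, in no more than 50 words. "
    "Output ONLY the requested information."
)
# The placeholders are kept as literal text so the result can serve as a template
user = "Related text: {join(documents)}\n\nQuestion: {query}"

brief_prompt = llama2_chat_prompt(system, user)
print(brief_prompt)
```

Whether this layout actually produces shorter answers on this dataset would still need to be checked by rerunning the queries.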
