<a href="https://colab.research.google.com/github/Microflow-IO/modsecurity-for-anylog/blob/main/ragas_local_llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!git clone https://huggingface.co/Qwen/Qwen2-7B-Instruct-GPTQ-Int8

Cloning into 'Qwen2-7B-Instruct-GPTQ-Int8'...
remote: Enumerating objects: 42, done.[K
remote: Counting objects: 100% (39/39), done.[K
remote: Compressing objects: 100% (39/39), done.[K
remote: Total 42 (delta 16), reused 0 (delta 0), pack-reused 3 (from 1)[K
Unpacking objects: 100% (42/42), 3.61 MiB | 3.19 MiB/s, done.
Filtering content: 100% (3/3), 8.25 GiB | 45.94 MiB/s, done.


In [2]:
!git clone https://huggingface.co/BAAI/bge-large-zh-v1.5

Cloning into 'bge-large-zh-v1.5'...
remote: Enumerating objects: 55, done.[K
remote: Total 55 (delta 0), reused 0 (delta 0), pack-reused 55 (from 1)[K
Unpacking objects: 100% (55/55), 176.26 KiB | 2.55 MiB/s, done.


In [3]:
!du -sh *

2.5G	bge-large-zh-v1.5
17G	Qwen2-7B-Instruct-GPTQ-Int8
55M	sample_data


In [None]:
!pip install ragas==0.1.15 langchain==0.2.15 langchain-community==0.2.13 langchain-core==0.2.35 langchain-openai==0.1.23 langchain-text-splitters==0.2.2 torch==2.1.2 sentence-transformers==3.0.1 taming-transformers-rom1504==0.0.6 transformers==4.49 transformers-stream-generator==0.0.4 FlagEmbedding==1.2.11

In [None]:
!pip install optimum gptqmodel

In [None]:
!ps -ef | grep python3

In [4]:
from typing import List, Optional, Any
from datasets import Dataset
from ragas.metrics import faithfulness, context_recall, context_precision, answer_relevancy
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import BaseRagasEmbeddings
from ragas.run_config import RunConfig
from FlagEmbedding import FlagModel
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
from langchain.llms.base import LLM
from langchain.callbacks.manager import CallbackManagerForLLMRun

In [5]:
class MyLLM(LLM):
    # Support Qwen2
    tokenizer: AutoTokenizer = None
    model: AutoModelForCausalLM = None

    def __init__(self, mode_name_or_path: str):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(mode_name_or_path)
        self.model = AutoModelForCausalLM.from_pretrained(mode_name_or_path, device_map="auto")
        self.model.generation_config = GenerationConfig.from_pretrained(mode_name_or_path)

    def _call(
            self,
            prompt: str,
            stop: Optional[List[str]] = None,
            run_manager: Optional[CallbackManagerForLLMRun] = None,
            **kwargs: Any,
    ) -> str:
        messages = [{"role": "user", "content": prompt}]
        input_ids = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        model_inputs = self.tokenizer([input_ids], return_tensors="pt").to('cuda')
        generated_ids = self.model.generate(model_inputs.input_ids, max_new_tokens=4096)
        generated_ids = [
            output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]
        response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
        return response

    @property
    def _llm_type(self):
        return "qwen2"

In [6]:
class MyEmbedding(BaseRagasEmbeddings):

    def __init__(self, path, run_config, max_length=512, batch_size=256):
        self.model = FlagModel(path, query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章：")
        self.max_length = max_length
        self.batch_size = batch_size
        self.run_config = run_config

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return self.model.encode_corpus(texts, self.batch_size, self.max_length).tolist()

    def embed_query(self, text: str) -> List[float]:
        return self.model.encode_queries(text, self.batch_size, self.max_length).tolist()

In [7]:
data_samples = {
    'question': [
        'When was the first Super Bowl?',
        'Who won the most Super Bowls?'
    ],
    'answer': [
        'The first Super Bowl was held on Jan 15, 1967',
        'The most Super Bowls have been won by The New England Patriots'
    ],
    'contexts': [
        [
            'The first AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles, California.'],
        [
            'The New England Patriots have won the Super Bowl a record six times, surpassing the Pittsburgh Steelers who have won it six times as well.']
    ],
    'ground_truth': [
        'The first Super Bowl was held on January 15, 1967',
        'The New England Patriots have won the Super Bowl a record six times'
    ]
}

dataset = Dataset.from_dict(data_samples)
print(dataset)

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truth'],
    num_rows: 2
})


In [8]:
llm_path = "/content/Qwen2-7B-Instruct-GPTQ-Int8"
emb_path = "/content/bge-large-zh-v1.5"

run_config = RunConfig(timeout=800, max_wait=800)

embedding_model = MyEmbedding(emb_path, run_config)
my_llm = LangchainLLMWrapper(MyLLM(llm_path), run_config)

result = evaluate(
    dataset,
    metrics=[context_recall, context_precision, answer_relevancy, faithfulness],
    llm=my_llm,
    embeddings=embedding_model,
    run_config=run_config
)

df = result.to_pandas()
print(df.head())
df.to_csv("result.csv", index=False)


[32mINFO[0m  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.      
[32mINFO[0m  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.                              


Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


[32mINFO[0m   Kernel: Auto-selection: adding candidate `TritonV2QuantLinear`                              


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Some weights of the model checkpoint at /content/Qwen2-7B-Instruct-GPTQ-Int8 were not used when initializing Qwen2ForCausalLM: {'model.layers.8.mlp.down_proj.bias', 'model.layers.15.mlp.up_proj.bias', 'model.layers.25.mlp.gate_proj.bias', 'model.layers.27.mlp.down_proj.bias', 'model.layers.3.mlp.up_proj.bias', 'model.layers.0.self_attn.o_proj.bias', 'model.layers.17.self_attn.o_proj.bias', 'model.layers.12.mlp.up_proj.bias', 'model.layers.11.mlp.gate_proj.bias', 'model.layers.21.mlp.down_proj.bias', 'model.layers.25.mlp.down_proj.bias', 'model.layers.27.mlp.gate_proj.bias', 'model.layers.4.mlp.gate_proj.bias', 'model.layers.14.mlp.down_proj.bias', 'model.layers.8.mlp.up_proj.bias', 'model.layers.11.mlp.down_proj.bias', 'model.layers.25.self_attn.o_proj.bias', 'model.layers.14.mlp.gate_proj.bias', 'model.layers.0.mlp.gate_proj.bias', 'model.layers.2.mlp.up_proj.bias', 'model.layers.15.mlp.down_proj.bias', 'model.layers.11.mlp.up_proj.bias', 'model.layers.24.self_attn.o_proj.bias', 'mode

[32mINFO[0m  Format: Converting `checkpoint_format` from `gptq` to internal `gptq_v2`.                    
[32mINFO[0m  Format: Converting GPTQ v1 to v2                                                             
[32mINFO[0m  Format: Conversion complete: 0.05669260025024414s                                            
[32mINFO[0m  Optimize: `TritonV2QuantLinear` compilation triggered.                                       


Evaluating:   0%|          | 0/8 [00:00<?, ?it/s]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


                         question  \
0  When was the first Super Bowl?   
1   Who won the most Super Bowls?   

                                              answer  \
0      The first Super Bowl was held on Jan 15, 1967   
1  The most Super Bowls have been won by The New ...   

                                            contexts  \
0  [The first AFL–NFL World Championship Game was...   
1  [The New England Patriots have won the Super B...   

                                        ground_truth  context_recall  \
0  The first Super Bowl was held on January 15, 1967             1.0   
1  The New England Patriots have won the Super Bo...             1.0   

   context_precision  answer_relevancy  faithfulness  
0                1.0          0.836584           1.0  
1                1.0          0.000000           1.0  
