## Task 3: RAG in Practice with the Yuan LLM

### 2.1 Environment Setup

In [None]:
# List the installed dependencies
! pip list

In [None]:
# Install streamlit
! pip install streamlit==1.24.0

### 2.2 Model Download

In [1]:
# Download the embedding model
from modelscope import snapshot_download
model_dir = snapshot_download("AI-ModelScope/bge-small-zh-v1.5", cache_dir='.')



In [2]:
# Download the Yuan LLM
from modelscope import snapshot_download
model_dir = snapshot_download('IEITYuan/Yuan2-2B-Mars-hf', cache_dir='.')
# model_dir = snapshot_download('IEITYuan/Yuan2-2B-July-hf', cache_dir='.')



### 2.3 RAG in Practice

In [3]:
# Import the required libraries
from typing import List
import numpy as np

import torch
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM

In [4]:
# Define the embedding model class
class EmbeddingModel:
    """
    class for EmbeddingModel
    """

    def __init__(self, path: str) -> None:
        self.tokenizer = AutoTokenizer.from_pretrained(path)

        self.model = AutoModel.from_pretrained(path).cuda()
        print(f'Loading EmbeddingModel from {path}.')

    def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        """
        Compute one L2-normalized embedding per input text.
        """
        encoded_input = self.tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
        encoded_input = {k: v.cuda() for k, v in encoded_input.items()}
        with torch.no_grad():
            model_output = self.model(**encoded_input)
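            # CLS pooling: use the hidden state of the first token as the sentence embedding
            sentence_embeddings = model_output[0][:, 0]
        # L2-normalize so that a dot product between embeddings equals cosine similarity
        sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)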
        return sentence_embeddings.tolist()

In [6]:
print("> Create embedding model...")
embed_model_path = './AI-ModelScope/bge-small-zh-v1___5'
embed_model = EmbeddingModel(embed_model_path)

> Create embedding model...
Loading EmbeddingModel from ./AI-ModelScope/bge-small-zh-v1___5.
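
The pooling and normalization steps inside `get_embeddings` can be checked by hand on toy numbers. The following is a minimal pure-Python sketch, not the notebook's implementation: `cls_pool_and_normalize` and the toy hidden states are hypothetical and only mirror `model_output[0][:, 0]` followed by L2 normalization.

```python
import math
from typing import List


def cls_pool_and_normalize(hidden_states: List[List[List[float]]]) -> List[List[float]]:
    """Take the first-token vector of each sequence and scale it to unit L2 norm."""
    embeddings = []
    for token_vectors in hidden_states:
        cls_vector = token_vectors[0]  # mirrors model_output[0][:, 0]
        norm = math.sqrt(sum(x * x for x in cls_vector))
        # Guard against an all-zero vector to avoid division by zero
        embeddings.append([x / norm for x in cls_vector] if norm else list(cls_vector))
    return embeddings


# Toy batch (made-up numbers): 2 sequences, 2 tokens each, hidden size 2
toy_hidden = [
    [[3.0, 4.0], [1.0, 1.0]],
    [[0.0, 2.0], [5.0, 5.0]],
]
toy_embeddings = cls_pool_and_normalize(toy_hidden)
print(toy_embeddings)  # [[0.6, 0.8], [0.0, 1.0]]
```

Because every resulting vector has unit length, the dot product of two embeddings is already their cosine similarity.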


In [7]:
# Define the vector store index class
class VectorStoreIndex:
    """
    class for VectorStoreIndex
    """

    def __init__(self, document_path: str, embed_model: EmbeddingModel) -> None:
        self.documents = []
        # Read one document per line; the context manager closes the file afterwards
        with open(document_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                self.documents.append(line)

        self.embed_model = embed_model
        self.vectors = self.embed_model.get_embeddings(self.documents)

        print(f'Loading {len(self.documents)} documents for {document_path}.')

    def get_similarity(self, vector1: List[float], vector2: List[float]) -> float:
        """
        calculate cosine similarity between two vectors
        """
        dot_product = np.dot(vector1, vector2)
        magnitude = np.linalg.norm(vector1) * np.linalg.norm(vector2)
        if not magnitude:
            return 0
        return dot_product / magnitude

    def query(self, question: str, k: int = 1) -> List[str]:
        """
        return the k documents most similar to the question, best match first
        """
        question_vector = self.embed_model.get_embeddings([question])[0]
        result = np.array([self.get_similarity(question_vector, vector) for vector in self.vectors])
        return np.array(self.documents)[result.argsort()[-k:][::-1]].tolist()

In [8]:
print("> Create index...")
document_path = './knowledge.txt'
index = VectorStoreIndex(document_path, embed_model)

> Create index...
Loading 3 documents for ./knowledge.txt.
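
To see what `query` does without loading a model, the same top-k ranking by cosine similarity can be run on hand-written vectors. This is a pure-Python sketch under stated assumptions: `cosine_similarity`, `top_k`, the toy documents, and the 2-dimensional vectors are all hypothetical stand-ins for the notebook's NumPy-based code and real embeddings.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    magnitude = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / magnitude if magnitude else 0.0


def top_k(question_vector: Sequence[float],
          vectors: Sequence[Sequence[float]],
          documents: Sequence[str],
          k: int = 1) -> List[str]:
    """Return the k documents whose vectors are most similar to the question vector."""
    scores = [cosine_similarity(question_vector, v) for v in vectors]
    # Sort document indices by score, highest first
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in ranked[:k]]


# Toy data (made up for illustration)
toy_documents = ["doc about cats", "doc about dogs", "doc about birds"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_question = [0.0, 1.0]

print(top_k(toy_question, toy_vectors, toy_documents, k=2))
# ['doc about dogs', 'doc about birds']
```

The scan is linear in the number of documents, which is fine for a three-line knowledge file; a larger corpus would usually call for a dedicated vector index.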


In [9]:
question = '介绍一下广州大学'  # "Give an introduction to Guangzhou University"
print('> Question:', question)

context = index.query(question)
print('> Context:', context)

> Question: 介绍一下广州大学
> Context: ['广州大学（Guangzhou University），简称广大（GU），是由广东省广州市人民政府举办的全日制普通高等学校，实行省市共建、以市为主的办学体制，是国家“111计划”建设高校、广东省和广州市高水平大学重点建设高校。广州大学的办学历史可以追溯到1927年创办的私立广州大学；1951年并入华南联合大学；1983年筹备复办，1984年定名为广州大学；2000年7月，经教育部批准，与广州教育学院（1953年创办）、广州师范学院（1958年创办）、华南建设学院西院（1984年创办）、广州高等师范专科学校（1985年创办）合并组建成立新的广州大学。']


In [10]:
# Define the large language model class
class LLM:
    """
    class for Yuan2.0 LLM
    """

    def __init__(self, model_path: str) -> None:
        print("Creat tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
        self.tokenizer.add_tokens(['<sep>', '<pad>', '<mask>', '<predict>', '<FIM_SUFFIX>', '<FIM_PREFIX>', '<FIM_MIDDLE>','<commit_before>','<commit_msg>','<commit_after>','<jupyter_start>','<jupyter_text>','<jupyter_code>','<jupyter_output>','<empty_output>'], special_tokens=True)

        print("Creat model...")
        self.model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()

        print(f'Loading Yuan2.0 model from {model_path}.')

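    def generate(self, question: str, context: List) -> None:
        if context:
            # Prompt (in Chinese): "Background: {context} / Question: {question} /
            # Please answer the question based on the background."
            prompt = f'背景：{context}\n问题：{question}\n请基于背景，回答问题。'
        else:
            prompt = question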

        prompt += "<sep>"
        inputs = self.tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()
        outputs = self.model.generate(inputs, do_sample=False, max_length=1024)
        output = self.tokenizer.decode(outputs[0])

        print(output.split("<sep>")[-1])

In [11]:
print("> Create Yuan2.0 LLM...")
model_path = './IEITYuan/Yuan2-2B-Mars-hf'
# model_path = './IEITYuan/Yuan2-2B-July-hf'
llm = LLM(model_path)

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


> Create Yuan2.0 LLM...
Creat tokenizer...


You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Creat model...
Loading Yuan2.0 model from ./IEITYuan/Yuan2-2B-Mars-hf.
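
Note that `generate` interpolates the `context` list directly into the f-string, so the prompt contains the list's Python repr, including brackets and quotes. One possible alternative, sketched here as an assumption rather than as the notebook's method, joins the retrieved passages with newlines first. `build_prompt` is a hypothetical helper; the prompt wording and the trailing `<sep>` are copied from the `LLM` class.

```python
from typing import List


def build_prompt(question: str, context: List[str]) -> str:
    """Build the prompt string, joining retrieved passages instead of printing the list."""
    if context:
        background = "\n".join(context)  # one passage per line, no brackets or quotes
        prompt = f'背景：{background}\n问题：{question}\n请基于背景，回答问题。'
    else:
        prompt = question
    # <sep> is appended to the prompt, as in LLM.generate
    return prompt + "<sep>"


print(build_prompt("Q", []))                 # Q<sep>
print(build_prompt("Q", ["passage A", "passage B"]))
```

Whether this changes the model's answers has not been measured here; it mainly keeps the prompt readable when `k` is greater than 1.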


In [13]:
print('> Without RAG:')
llm.generate(question, [])

print('> With RAG:')
llm.generate(question, context)

> Without RAG:
 广州大学（Guangzhou University）是广东省内一所综合性大学，位于中国广东省广州市。广州大学成立于1952年，前身为广州工学院，是中华人民共和国成立后创建的第一所高等工科院校。
广州大学坐落在广州市海珠区，占地面积广阔，校园环境优美。学校拥有多个校区，其中主校区位于广州市番禺区，其他校区分布在广州市的其他地区。学校占地面积约4000亩，拥有现代化的教学、实验和生活设施。
广州大学以培养人才为宗旨，注重理论与实践相结合的教学模式。学校开设了多个学院和专业，涵盖了工学、理学、文学、法学、经济学、管理学、艺术学等多个领域。学校现有本科专业近300个，研究生专业涵盖科学、工程、管理、文学、法学、艺术等多个领域。
广州大学注重国际交流与合作，积极推进国际化办学。学校与许多国际知名大学建立了合作关系，开展学术交流和合作研究。此外，学校还鼓励学生参与国际交流项目，提供海外实习和留学机会，提升学生的国际视野和能力。
广州大学一直以来致力于为学生提供优质的教育环境和丰富的学习资源。学校拥有先进的教学设施和实验室，以及图书馆、体育场馆、艺术工作室等丰富的学生课外活动设施。
广州大学以其优秀的教学质量、领先的科研水平和培养优秀学生的能力而闻名。学校致力于培养具有创新精神和社会责任感的高素质人才，为地方经济发展和社会进步做出贡献。<eod>
> With RAG:
 广州大学是一所位于广东省广州市的全日制普通高等学校，实行省市共建、以市为主的办学体制。学校的办学历史可以追溯到1927年创办的私立广州大学，后来在1951年并入华南联合大学。1984年定名为广州大学。2000年，广州大学经过教育部批准，与广州教育学院、广州师范学院、华南建设学院西院、广州高等师范专科学校合并组建新的广州大学。<eod>
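
Both answers above still end with the `<eod>` marker, because `generate` prints everything after `<sep>` unchanged. A small post-processing step can strip it. The helper below is a hypothetical sketch, and the `sample` string is made up to imitate the shape of the decoded output rather than taken from a real run.

```python
def extract_answer(decoded: str, sep_token: str = "<sep>", eos_token: str = "<eod>") -> str:
    """Keep the text generated after the separator and drop the end-of-document marker."""
    answer = decoded.split(sep_token)[-1]
    # Discard the end marker and anything that follows it
    answer = answer.split(eos_token)[0]
    return answer.strip()


# Made-up decoded string in the same shape as the model output
sample = "介绍一下广州大学<sep> 广州大学是一所高校。<eod>"
print(extract_answer(sample))  # 广州大学是一所高校。
```

Returning the cleaned string from `generate`, instead of printing it, would also make the result easier to reuse in later cells.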
