<a href="https://colab.research.google.com/github/Crossme0809/frenzyTechAI/blob/main/autotrain/AutoTrain%EF%BC%9A%E5%9C%A8Google_Colab%E4%B8%8A%E5%BE%AE%E8%B0%83LLM%E7%9A%84%E6%9C%80%E7%AE%80%E5%8D%95%E6%96%B9%E6%B3%95.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

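# Training an LLM on a Free Google Colab Instance - AutoTrain
Welcome to this notebook, which shows you how to fine-tune an LLM on your own dataset. We will use:

- The recent peft library and bitsandbytes to load large models in 4-bit precision.
- AutoTrain to run the training.

The fine-tuning approach relies on a recent method called Low-Rank Adapters (LoRA): instead of fine-tuning the whole model, you fine-tune only these adapters and load them correctly into the model. After fine-tuning, you can also share your adapters on the 🤗 Hub and load them easily.

Note that this works with any model that supports `device_map` (that is, models loaded with accelerate).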
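To give a feel for how small the adapters are, here is a rough parameter count. It is a sketch, not AutoTrain's own code. The layer shapes (hidden size 4096, 32 decoder layers, `q_proj` 4096 to 4096, `v_proj` 4096 to 1024) and the approximate 7.24B base size are assumptions about Mistral-7B, and the rank matches the `--lora-r 16` value used later in this notebook.

```python
# Rough count of trainable LoRA parameters (illustrative sketch only).

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two small matrices per target layer: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank

RANK = 16            # matches --lora-r 16 in the training command
NUM_LAYERS = 32      # assumed number of decoder layers in Mistral-7B
TARGETS = {          # assumed (d_in, d_out) shapes of the targeted projections
    "q_proj": (4096, 4096),
    "v_proj": (4096, 1024),
}

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in TARGETS.values())
total = per_layer * NUM_LAYERS

print(f"trainable LoRA parameters: {total:,}")
print(f"share of a ~7.24B-parameter base model: {total / 7.24e9:.3%}")
```

Under these assumptions only a tiny fraction of the weights is trained, which is why the run can fit on a free Colab GPU.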

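## Step 0 - Define some helper functions:
- Enable text wrapping so we don't have to scroll horizontally.
- Define a wrapper function that passes our query to the model for inference and returns the decoded completion (the model's response).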

In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))

get_ipython().events.register('pre_run_cell', set_css)


Let's define a wrapper function that gets a completion from the model for a user question.

In [None]:
def get_completion(query: str, model, tokenizer) -> str:
  device = "cuda:0"

  prompt_template = """
  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  ### Question:
  {query}

  ### Answer:
  """
  prompt = prompt_template.format(query=query)

  encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)

  model_inputs = encodeds.to(device)


  generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
  decoded = tokenizer.batch_decode(generated_ids)
  return (decoded[0])
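
Because the template in `get_completion` is written inside an indented function body, every prompt line reaches the model with two leading spaces. The sketch below shows one way to build the same prompt without that indentation, using `textwrap.dedent`. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical, and whether the indentation affects answer quality has not been checked here.

```python
import textwrap

# Same wording as the template in get_completion, with the indentation removed.
PROMPT_TEMPLATE = textwrap.dedent("""\
    Below is an instruction that describes a task. Write a response that appropriately completes the request.
    ### Question:
    {query}

    ### Answer:
    """)

def build_prompt(query: str) -> str:
    """Fill the template with a user question (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(query=query.strip())

prompt = build_prompt("Will capital gains affect my tax bracket?")
print(prompt)
```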



**Setting up the runtime**

A GPU instance is essential for fine-tuning Mistral 7B. Follow these steps:

1. Go to Runtime (in the top menu bar).
2. Select Change Runtime Type.
3. Choose T4 GPU (or a comparable option).
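
After switching the runtime, it can help to confirm that a GPU is actually visible before installing anything. The sketch below uses only the standard library and the `nvidia-smi -L` command; the helper name `gpu_summary` is hypothetical.

```python
import shutil
import subprocess
from typing import Optional

def gpu_summary() -> Optional[str]:
    """Return the `nvidia-smi -L` listing, or None if no NVIDIA GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None

summary = gpu_summary()
print(summary if summary else "No GPU detected: check Runtime > Change runtime type.")
```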

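## Step 1 - Install the required packages and log in to Hugging Face
First, install the dependencies below to get started. The cell installs the released packages with pip and then runs `autotrain setup --update-torch`, which, as the flag name suggests, updates PyTorch.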


In [None]:
!pip install -q pandas
!pip install -q autotrain-advanced safetensors
!autotrain setup --update-torch
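
If the install output scrolls by too quickly, a quick check like the one below reports which of the packages are present. It is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

report = installed_versions(["pandas", "autotrain-advanced", "safetensors"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```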

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m130.4/130.4 kB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m16.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m174.1/174.1 kB[0m [31m14.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m519.6/519.6 kB[0m [31m22.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m72.9/72.9 kB[0m [31m9.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.4/13.4 MB[0m [31m72.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.0/302.0 kB[0m [31m37.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.0/60.0 kB[0m [31m8.9 

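**Connect to Hugging Face for model upload**

To make sure the model can be uploaded and used for inference, you need to log in to the Hugging Face Hub.

Steps to get a Hugging Face token:

1. Go to https://huggingface.co/settings/tokens
2. Create a write token and copy it to your clipboard.
3. Run the code below and enter your token.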

In [None]:
from huggingface_hub import notebook_login
notebook_login()
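
As an alternative to pasting the token into a widget or a command, the token can be kept in an environment variable so it never appears in the notebook or its logs. This is a sketch under assumptions: the variable name `HF_TOKEN` and the `hf_` prefix check are conventions assumed here, and `read_hf_token` is a hypothetical helper.

```python
import os
from typing import Optional

def read_hf_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in `env_var` if it looks like a Hugging Face token."""
    token = os.environ.get(env_var, "").strip()
    # Assumption: user access tokens start with "hf_".
    if token.startswith("hf_"):
        return token
    return None

token = read_hf_token()
print("token found" if token else "no token set; fall back to notebook_login()")

# With huggingface_hub installed, the token could then be used like this:
# from huggingface_hub import login
# login(token=token)
```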


## Step 2 - Load the dataset

Let's load a finance dataset to fine-tune our model on basic financial knowledge. In this guide we load only a small dataset.

In [None]:
from datasets import load_dataset
data = load_dataset("ronal999/finance-alpaca-demo", split='train')
# we'll only load 1/6 of the original dataset in the demo
data = data.shard(num_shards=6, index=0)
print(data)
# Explore the data
df = data.to_pandas()
df.head(10)

Generating train split:   0%|          | 0/690 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'instruction', 'input', 'output', 'prompt'],
    num_rows: 115
})
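
The 115 rows are one sixth of the full split (6 × 115 = 690). The sketch below illustrates the two ways a shard can be picked, strided or contiguous; which one `data.shard` uses by default depends on the installed `datasets` version, so treat this as an illustration of the arithmetic rather than of the library's internals. `shard_indices` is a hypothetical helper.

```python
def shard_indices(num_rows: int, num_shards: int, index: int, contiguous: bool):
    """Return the row indices that belong to shard `index` out of `num_shards`."""
    if contiguous:
        # Split into consecutive blocks; the first `extra` shards get one more row.
        base, extra = divmod(num_rows, num_shards)
        start = index * base + min(index, extra)
        end = start + base + (1 if index < extra else 0)
        return list(range(start, end))
    # Strided selection: every num_shards-th row, starting at `index`.
    return list(range(index, num_rows, num_shards))

strided = shard_indices(690, 6, 0, contiguous=False)
blocked = shard_indices(690, 6, 0, contiguous=True)
print(len(strided), strided[:3])
print(len(blocked), blocked[:3])
```

Either way the shard holds 115 rows; only the choice of rows differs.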


| | text | instruction | input | output | prompt |
|---|---|---|---|---|---|
| 0 | | For a car, what scams can be plotted with 0% f... | | The car deal makes money 3 ways. If you pay in... | Below is an instruction that describes a task.... |
| 1 | | Who can truly afford luxury cars? | | Most of the people I know that own them are sl... | Below is an instruction that describes a task.... |
| 2 | | How to evaluate an annuity | | Annuities are usually not good deals. Commissi... | Below is an instruction that describes a task.... |
| 3 | | Giving kids annual tax free gift of $28,000 | | From the IRS' website: How many annual exclus... | Below is an instruction that describes a task.... |
| 4 | | What does a reorganization fee that a company ... | | Its a broker fee, not something charged by the... | Below is an instruction that describes a task.... |
| 5 | | Paid cash for a car, but dealer wants to chang... | | As others have said, if the dealer accepted pa... | Below is an instruction that describes a task.... |
| 6 | | Google Finance: Input Parameters For Simple Mo... | | I looked at this a little more closely but the... | Below is an instruction that describes a task.... |
| 7 | | Medium-term money investment in Germany | | Due to the zero percent interest rate on the E... | Below is an instruction that describes a task.... |
| 8 | | question about short selling stocks | | If you had an agreement with your friend such ... | Below is an instruction that describes a task.... |
| 9 | | Theoretically, if I bought more than 50% of a ... | | Owning more than 50% of a company's stock nor... | Below is an instruction that describes a task.... |


Save the dataset locally as a CSV file in the Colab root directory.

---



In [None]:
data.to_csv("train.csv")

137937
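
Before training, it is worth confirming that the CSV really contains the column AutoTrain will read. The sketch below is a small standard-library check run on a tiny stand-in file so that it is self-contained; in the notebook you would call `csv_summary("train.csv", "prompt")`. The helper name `csv_summary` and the demo rows are hypothetical.

```python
import csv
import os
import tempfile

def csv_summary(path: str, text_column: str) -> dict:
    """Count rows and non-empty values of `text_column` in a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        if text_column not in columns:
            raise ValueError(f"{path} has no '{text_column}' column: {columns}")
        rows = non_empty = 0
        for row in reader:
            rows += 1
            if (row[text_column] or "").strip():
                non_empty += 1
    return {"columns": columns, "rows": rows, "non_empty": non_empty}

# Demo on a stand-in file with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "train.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["instruction", "prompt"])
        writer.writerow(["How to evaluate an annuity", "Below is an instruction..."])
        writer.writerow(["Empty example", ""])
    summary = csv_summary(demo_path, "prompt")

print(summary)
```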

## Step 3 - Meet AutoTrain and start training!

## AutoTrain command overview

#### A brief overview of what the command flags do.

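- `!autotrain`: The `!` prefix runs a shell command directly from a notebook environment such as Jupyter; `autotrain` is the AutoTrain command-line utility.

- `llm`: A subcommand that specifies the task type.

- `--train`: Initiates the training process.

- `--project_name`: Sets the project name.

- `--model`: Specifies the base model hosted on Hugging Face. You can change the model if you like; most text-generation models on the Hugging Face Hub can be used.

- `--data_path .`: The path to the training dataset. The `.` refers to the current directory, so the `train.csv` file needs to be in this directory.

- `--use_peft`: Uses Parameter-Efficient Fine-Tuning (PEFT) to reduce memory usage.

- `--use_int4`: Uses INT4 quantization to reduce the model size and speed up inference, at the cost of some precision.

- `--learning_rate 2e-4`: Sets the learning rate for training to 0.0002.

- `--train_batch_size 1`: Sets the training batch size to 1, the value used in the command below.

- `--num_train_epochs 3`: The training process iterates over the dataset 3 times.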
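
The batch size and epoch count together determine how many optimizer steps the run takes. The sketch below works this out for the 115-row dataset, using the `warmup_ratio=0.1` and `gradient_accumulation_steps=1` values that the run logs in its parameters. It assumes each CSV row is one training example (no packing) and that warmup steps are rounded up; `training_schedule` is a hypothetical helper.

```python
import math

def training_schedule(num_rows, batch_size, grad_accum, epochs, warmup_ratio):
    """Estimate optimizer steps and warmup steps for a simple training run."""
    batches_per_epoch = math.ceil(num_rows / batch_size)
    steps_per_epoch = max(1, batches_per_epoch // grad_accum)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

schedule = training_schedule(
    num_rows=115, batch_size=1, grad_accum=1, epochs=3, warmup_ratio=0.1
)
print(schedule)
```

Raising the batch size shortens each epoch in steps but needs more GPU memory, which is the usual trade-off on a T4.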

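### Steps required before running

Go to the `!autotrain` code cell below and update it as follows:

1. After `--project_name`, replace `*enter-a-project-name*` with a project name of your choice.
2. After `--repo_id`, replace `*username*/*repository*`: use your Hugging Face username for `*username*` and the name of the repository you want to create for `*repository*`. You do not need to create this repository beforehand; it is created and uploaded automatically when training finishes.
3. Confirm that `train.csv` is in the Colab root directory; the `--data_path .` flag makes AutoTrain look for your data there. If you keep the file elsewhere, for example at `data/train.csv`, point `--data_path` at that folder instead. The file must contain the text column.
4. Make sure to add the LoRA target modules to train: `--target_modules q_proj,v_proj`
5. Once these changes are made, everything is ready. Run the command below!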
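
The checklist above can also be expressed as a small script that assembles the command as an argument list, which avoids line-continuation mistakes such as a missing space before a backslash. This is a sketch, not part of AutoTrain: `build_autotrain_command` is a hypothetical helper, the repository and token values are placeholders, and the flags are the ones that appear in this notebook's command and help output.

```python
import shlex

def build_autotrain_command(project_name: str, repo_id: str, token: str):
    """Assemble the `autotrain llm` invocation as a list of arguments."""
    return [
        "autotrain", "llm", "--train",
        "--project_name", project_name,
        "--model", "mistralai/Mistral-7B-v0.1",
        "--data_path", ".",
        "--text_column", "prompt",
        "--learning_rate", "2e-4",
        "--train_batch_size", "1",
        "--num_train_epochs", "3",
        "--trainer", "sft",
        "--use_peft",
        "--use_int4",
        "--fp16",
        "--lora_r", "16",
        "--lora_alpha", "32",
        "--lora_dropout", "0.05",
        "--target_modules", "q_proj,v_proj",
        "--push_to_hub",
        "--repo_id", repo_id,
        "--token", token,
    ]

# Placeholder values; replace them with your own before running.
command = build_autotrain_command(
    project_name="mistral-7b-autotrained-finance",
    repo_id="your-username/your-repository",
    token="hf_xxx",
)
print(shlex.join(command))
```

The printed line can be pasted after `!` in a Colab cell, or the list can be passed to `subprocess.run(command, check=True)`.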






In [None]:
!autotrain llm \
--train \
--project_name mistral-7b-autotrained-finance \
--model mistralai/Mistral-7B-v0.1 \
--data_path . \
--text-column prompt \
--learning_rate 2e-4 \
--train_batch_size 1 \
--num_train_epochs 3 \
--trainer sft \
--use_peft \
--use_int4 \
--fp16 \
--lora-r 16 \
--lora-alpha 32 \
--lora-dropout 0.05 \
--target_modules q_proj,v_proj \
--push_to_hub \
--repo_id [Your_REPO_ID] \
--token [YOUR_TOKEN]


> INFO    Running LLM
> INFO    Params: Namespace(version=False, train=True, deploy=False, inference=False, data_path='.', train_split='train', valid_split=None, text_column='prompt', model='mistralai/Mistral-7B-v0.1', learning_rate=0.0002, num_train_epochs=3, train_batch_size=1, warmup_ratio=0.1, gradient_accumulation_steps=1, optimizer='adamw_torch', scheduler='linear', weight_decay=0.0, max_grad_norm=1.0, seed=42, add_eos_token=False, block_size=-1, use_peft=True, lora_r=16, lora_alpha=32, lora_dropout=0.05, logging_steps=-1, project_name='mistral-7b-autotrained-finance', evaluation_strategy='epoch', save_total_limit=1, save_strategy='epoch', auto_find_batch_size=False, fp16=False, push_to_hub=True, use_int8=False, model_max_length=1024, repo_id='Ronal999/mistral-7b-autotrained-finance', use_int4=True, trainer='sft', target_modules='q_proj,v_proj', merge_adapter=False, token='[REDACTED]', backend='default', username=None, use_flash_attention_2=

To learn more about the available command-line flags, run:

In [None]:
!autotrain llm -h

usage: autotrain <command> [<args>] llm
       [-h]
       [--train]
       [--deploy]
       [--inference]
       [--data_path DATA_PATH]
       [--train_split TRAIN_SPLIT]
       [--valid_split VALID_SPLIT]
       [--text_column TEXT_COLUMN]
       [--model MODEL]
       [--learning_rate LEARNING_RATE]
       [--num_train_epochs NUM_TRAIN_EPOCHS]
       [--train_batch_size TRAIN_BATCH_SIZE]
       [--warmup_ratio WARMUP_RATIO]
       [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
       [--optimizer OPTIMIZER]
       [--scheduler SCHEDULER]
       [--weight_decay WEIGHT_DECAY]
       [--max_grad_norm MAX_GRAD_NORM]
       [--seed SEED]
       [--add_eos_token]
       [--block_size BLOCK_SIZE]
       [--use_peft]
       [--lora_r LORA_R]
       [--lora_alpha LORA_ALPHA]
       [--lora_dropout LORA_DROPOUT]
       [--logging_steps LOGGING_STEPS]
       [--project_name PROJECT_NAME]
       [--evaluation_strategy EVALUATION_STRATEGY]
       [--save_total_limit SAVE_TOTAL_LIM

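## Step 4 - Done! Load the model for inference

Now we need to reload the base model in 4-bit precision and attach the adapters with the peft library. To avoid VRAM problems, I recommend restarting the notebook runtime, re-running the setup cells from the earlier steps (there is no need to train again), and then running the next cell.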
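
To see why 4-bit loading matters on a T4 (16 GB of memory), here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is a rough sketch: the parameter count of about 7.24 billion is an approximation, and the estimate ignores quantization overhead, activations, and the KV cache.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 7.24e9  # approximate size of Mistral-7B (assumption)

for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

At 16 bits the weights alone nearly fill the card, while at 4 bits they leave room for the adapters and for generation.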

Load the adapters directly from the Hub using the code below.

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "Ronal999/mistral-7b-autotrained-finance"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_4bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)
print(f"Successfully loaded the model {peft_model_id} into memory")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Successfully loaded the model Ronal999/mistral-7b-autotrained-finance into memory


You can then run inference directly with the trained model loaded from the 🤗 Hub, just as you normally would in Transformers.

In [None]:
result = get_completion(query="Will capital gains affect my tax bracket?", model=model, tokenizer=tokenizer)
print(result)



<s> 
  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  ### Question:
  Will capital gains affect my tax bracket?

  ### Answer:
  1. Yes. Capital gains are a type of income that can affect your tax bracket. \n 2. When you calculate your taxable income, you include all sources of income, including capital gains. \n 3. If the resulting taxable income would put you in a higher tax bracket, then the capital gains will affect your tax bracket. \n 4. If you have a lot of capital gains and not enough other sources of income to move you into a higher tax bracket, then you may be able to "zero out" your tax bracket by using tax-loss harvesting, which is when you sell off investments that have lost value so as to offset capital gains.</s>
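
The decoded output repeats the whole prompt and includes the `<s>` and `</s>` markers. If only the answer is wanted, a small post-processing step like the sketch below can trim it. `extract_answer` is a hypothetical helper, and the sample string is a shortened stand-in for the output above. Passing `skip_special_tokens=True` to `tokenizer.batch_decode` is another way to drop the special tokens.

```python
def extract_answer(decoded: str, marker: str = "### Answer:", eos: str = "</s>") -> str:
    """Return the text after the answer marker, without the end-of-sequence token."""
    _, found, tail = decoded.partition(marker)
    if not found:
        # Marker missing: fall back to the full text.
        tail = decoded
    return tail.split(eos, 1)[0].strip()

sample = (
    "<s> \n"
    "  ### Question:\n"
    "  Will capital gains affect my tax bracket?\n\n"
    "  ### Answer:\n"
    "  1. Yes. Capital gains are a type of income.</s>"
)
answer = extract_answer(sample)
print(answer)
```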
