# PAWSEY 11
---

Table of Contents
1.   Reproducing the models
*   Mistral 7B
*   Mixtral
*   BLOOMZ1b
*   Llama 2
*   Fine-tuned model

2.   Reproducing the benchmarks
3.   Testing on a custom dataset
4.   Fine-tuning
5.   Retriever
6.   RAG

*Note: all cells under the same top-level dropdown (section) should run on one runtime; different top-level dropdowns may use different runtimes.*



Run this before running any sections below:

In [2]:
# Device: CPU/GPU
!pip install torch
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")



In [None]:
# Device: TPU
!pip install torch -U
!pip install torch-xla

import torch
import torch_xla.core.xla_model as xm

# Device
device = xm.xla_device() if xm.xla_device() is not None else torch.device("cpu")





In [5]:
# Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## Reproducing the models

GPU: T4 GPU

### Mistral 7B (credit: referencing Imam's code)

In [None]:
# Dependencies
!pip install ctransformers

Collecting ctransformers
  Downloading ctransformers-0.2.27-py3-none-any.whl (9.9 MB)
Installing collected packages: ctransformers
Successfully installed ctransformers-0.2.27


In [None]:
# Imports
from ctransformers import AutoModelForCausalLM

In [None]:
# Load the model
mistral_model = AutoModelForCausalLM.from_pretrained("TheBloke/Mistral-7B-v0.1-GGUF", model_file="mistral-7b-v0.1.Q4_K_M.gguf", model_type="mistral", gpu_layers=0)

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

config.json:   0%|          | 0.00/31.0 [00:00<?, ?B/s]

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

mistral-7b-v0.1.Q4_K_M.gguf:   0%|          | 0.00/4.37G [00:00<?, ?B/s]

In [None]:
# Test the model
print(mistral_model("Can you tell me a joke?"))



I’m sorry, I’ve forgotten my password.

How about another one?

Oh! Here it is!

What do you call a guy with no arms and no legs in the middle of an active minefield?

Jason!

Why did the cannibal bite his wife?

Because he was hungry for a sandwich.

How many psychiatrists does it take to change a light bulb?

Only one, but the light bulb must want to change.

How many programmers does it take to change a light bulb?

None, they are waiting for someone else to write a program that will do it automatically.

What’s the difference between a well-designed algorithm and a badly designed algorithm?

The amount of documentation in comments that comes with them.

Why is there no toilet paper in the bathroom?

Someone took all my program code into the bathroom to debug it, I am afraid that the rest of the code will be flushed down the toilet when the toilet gets fixed.

How do you make a computer run faster?

Hit it on the back.

What is the


### Mixtral

In [None]:
# Dependencies
!pip install --upgrade ctransformers



In [None]:
# Imports
from ctransformers import AutoModelForCausalLM

In [None]:
# Load the model
model_name = "TheBloke/Mixtral-8x7B-v0.1-GGUF"
model_file = "mixtral-8x7b-v0.1.Q2_K.gguf"
gpu_layers = 0
mixtral_model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file, gpu_layers=gpu_layers)

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

RuntimeError: ignored

In [None]:
# Test the model
print(mixtral_model("Can you tell me a joke?"))

### BLOOMZ1b

In [None]:
# Dependencies
!pip install transformers



In [None]:
# Imports
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BloomTokenizerFast, BloomForCausalLM

In [None]:
# Load the Model
param = "bloomz-1b1"
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/" + param)
model = BloomForCausalLM.from_pretrained("bigscience/" + param).to(device)

tokenizer_config.json:   0%|          | 0.00/222 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/715 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.13G [00:00<?, ?B/s]

In [None]:
# Test the model
inputs = tokenizer.encode("Can you tell me a joke?", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))



Can you tell me a joke? A man is walking down the street and sees a dog. He


### Llama 2

In [None]:
# Dependencies
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 numpy==1.23.4 --force-reinstall --upgrade --no-cache-dir --verbose
!pip install huggingface_hub
!pip install llama-cpp-python==0.1.78
!pip install numpy==1.23.4

Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Collecting llama-cpp-python==0.1.78
  Downloading llama_cpp_python-0.1.78.tar.gz (1.7 MB)
  Running command pip subprocess to install build dependencies
  Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
  Collecting setuptools>=42
    Downloading setuptools-69.0.2-py3-none-any.whl (819 kB)
  Collecting scikit-build>=0.13
    Downloading scikit_build-0.17.6-py3-none-any.whl (84 kB)
  Collecting cmake>=3.18
    Downloading cmake-3.27.9-py2.py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (26.1 MB)

In [None]:
# Imports
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

In [None]:
# Load the model
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin"

model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

# GPU
lcpp_llm = None
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    n_batch=512, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_gpu_layers=32 # Change this value based on your model and your GPU VRAM pool.
    )

# See the number of layers in GPU
lcpp_llm.params.n_gpu_layers

llama-2-13b-chat.ggmlv3.q5_1.bin:   0%|          | 0.00/9.76G [00:00<?, ?B/s]

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 


32

In [None]:
# Test the model
response=lcpp_llm(prompt="Can you tell me a joke?", max_tokens=256, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=True)

print(response['choices'][0]['text'])

Llama.generate: prefix-match hit


Can you tell me a joke?

Sure, here's one:

Why couldn't the bicycle stand up by itself?

Do you want to guess or would you like me to give you the answer?


### Fine-tuned model

In [None]:
# Dependencies
!pip install transformers



In [None]:
# Imports
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM, BloomTokenizerFast, BloomForCausalLM
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

In [None]:
# Load the model
tokenizer = BloomTokenizerFast.from_pretrained("/content/gdrive/MyDrive/Pawsey/model-out/bloomz1b_finetune_token")
model = BloomForCausalLM.from_pretrained('/content/gdrive/MyDrive/Pawsey/model-out/bloomz1b_finetune').to(device)

In [None]:
# Test the model
inputs = tokenizer.encode("Can you tell me a joke?", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))



Can you tell me a joke? 
The man who made the first successful water 
transformation in


## Reproducing the benchmarks

BLOOMZ1b in MMLU style

Referencing [this website](https://www.kaggle.com/code/debarshichanda/llm-evaluation-mmlu-style)

In [None]:
# Dependencies
!pip install transformers
!pip install torch
!pip install datasets
!pip install scikit-learn
!pip install langchain

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.15.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6
Collecting langchain
  Downloading langchain-0.0.346-py3-none-any.whl (2.0 MB)

In [None]:
# Imports
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM, BloomTokenizerFast, BloomForCausalLM
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from langchain.prompts import PromptTemplate
from IPython.display import Markdown, display

In [None]:
# Load the pretrained LLM and tokenizer
param = "bloomz-1b1"
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/" + param)
model = BloomForCausalLM.from_pretrained("bigscience/" + param).to(device)

# Test the model
inputs = tokenizer.encode("How is the weather today", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))



How is the weather today? sunny</s>


In [None]:
# Load the benchmark dataset (download dataset from kaggle first)
dataset = load_dataset("csv", data_files="/content/gdrive/MyDrive/Pawsey/kaggle-llm-science-exam/train.csv")

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
# Prepare for the prompts
template = """Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D, E]

Question: {prompt}\n
A) {a}\n
B) {b}\n
C) {c}\n
D) {d}\n
E) {e}\n

Answer:"""

prompt = PromptTemplate(template=template, input_variables=['prompt', 'a', 'b', 'c', 'd', 'e'])

# See the format
sample = dataset['train'][0]
display(Markdown(prompt.format(prompt=sample['prompt'],
                               a=sample['A'],
                               b=sample['B'],
                               c=sample['C'],
                               d=sample['D'],
                               e=sample['E'])))

def format_text(example):
    text = prompt.format(prompt=example['prompt'],
                         a=example['A'],
                         b=example['B'],
                         c=example['C'],
                         d=example['D'],
                         e=example['E'])
    return {"text": text}

dataset = dataset.map(format_text)

Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D, E]

Question: Which of the following statements accurately describes the impact of Modified Newtonian Dynamics (MOND) on the observed "missing baryonic mass" discrepancy in galaxy clusters?

A) MOND is a theory that reduces the observed missing baryonic mass in galaxy clusters by postulating the existence of a new form of matter called "fuzzy dark matter."

B) MOND is a theory that increases the discrepancy between the observed missing baryonic mass in galaxy clusters and the measured velocity dispersions from a factor of around 10 to a factor of about 20.

C) MOND is a theory that explains the missing baryonic mass in galaxy clusters that was previously considered dark matter by demonstrating that the mass is in the form of neutrinos and axions.

D) MOND is a theory that reduces the discrepancy between the observed missing baryonic mass in galaxy clusters and the measured velocity dispersions from a factor of around 10 to a factor of about 2.

E) MOND is a theory that eliminates the observed missing baryonic mass in galaxy clusters by imposing a new mathematical formulation of gravity that does not require the existence of dark matter.


Answer:

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
# Get the answer of each question: rank the five option letters by the
# model's next-token logit and return the top three
def get_ans(text):
    inputs = tokenizer(text, return_tensors='pt').to(device)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask']).logits[0, -1]
    # Score each letter by the logit of its last token (with a leading space)
    options_list = [(logits[tokenizer(' ' + letter).input_ids[-1]].item(), letter) for letter in 'ABCDE']
    options_list = sorted(options_list, reverse=True)
    ans_list = [letter for _, letter in options_list[:3]]

    return ans_list
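
The ranking step in `get_ans` can be illustrated without loading a model. The sketch below uses made-up logit values (an assumption, not real model output) and a hypothetical helper, `top_k_options`, to show how the three highest-scoring letters are picked.

```python
# Hypothetical last-token logits for each answer letter (illustrative values only).
example_logits = {"A": 1.2, "B": 3.4, "C": -0.5, "D": 2.8, "E": 0.1}

def top_k_options(option_logits, k=3):
    """Return the k option letters with the highest logit, best first."""
    ranked = sorted(option_logits, key=option_logits.get, reverse=True)
    return ranked[:k]

top3 = top_k_options(example_logits)
print(top3)  # ['B', 'D', 'A']
```

The returned list plays the role of `ans_list`: the first letter is the model's preferred answer and the next two are its fallbacks.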

In [None]:
# Computes average precision at k between two lists of items
# actual: list of elements to be predicted
# predicted: list of predicted elements, best guess first
# k: int, maximum number of predicted elements
# returns: double, avg precision at k over input lists
def apk(actual, predicted, k=10):
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)
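
A short worked example may help in reading the scores that follow. With a single correct letter and three ranked predictions, `apk` reduces to 1/rank of the correct letter, or 0 if it is missing. The function is reproduced here only so the block runs on its own; the rankings are hypothetical.

```python
def apk(actual, predicted, k=10):
    """Average precision at k (same logic as the cell above)."""
    predicted = predicted[:k]
    score, num_hits = 0.0, 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    if not actual:
        return 0.0
    return score / min(len(actual), k)

# The correct answer is "A"; each ranking places it at a different position.
rankings = [["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"], ["B", "C", "D"]]
scores = [apk(["A"], ranking, k=3) for ranking in rankings]
print(scores)  # 1.0, 0.5, one third, 0.0
```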

In [None]:
# Login to weights&biases
!pip install wandb
!wandb login YOUR_WANDB_API_KEY # pass your own API key as an argument (do not commit real keys)

Collecting wandb
  Downloading wandb-0.16.1-py3-none-any.whl (2.1 MB)
Collecting GitPython!=3.1.29,>=1.0.0 (from wandb)
  Downloading GitPython-3.1.40-py3-none-any.whl (190 kB)
Collecting sentry-sdk>=1.0.0 (from wandb)
  Downloading sentry_sdk-1.38.0-py2.py3-none-any.whl (252 kB)

In [None]:
import wandb
run = wandb.init(project='baselines', # Name of project
                 name='BLOOMZ MMLU evaluation', # Name of Run
                 anonymous='must')

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [None]:
# Get results
aps = []
eval_table = wandb.Table(columns=['Question', 'Answer', 'Prediction 1', 'Prediction 2', 'Prediction 3', 'AP', 'A', 'B', 'C', 'D', 'E', 'text'])
bar = tqdm(enumerate(dataset['train']), total=len(dataset['train']))
for i, data in bar:
    ans_list = get_ans(data['text'])
    average_precision = apk([data['answer']], ans_list, k=3)
    aps.append(average_precision)
    ans1, ans2, ans3 = ans_list
    eval_table.add_data(data['prompt'],
                        data['answer'],
                        ans1,
                        ans2,
                        ans3,
                        average_precision,
                        data['A'],
                        data['B'],
                        data['C'],
                        data['D'],
                        data['E'],
                        data['text'])

wandb.log({'Evaluation': eval_table})
run.finish()

100%|██████████| 200/200 [00:35<00:00,  5.58it/s]



In [None]:
# View precision
mean_average_precision = np.mean(aps)
mean_average_precision

0.38166666666666665
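
For context, the score can be read against a chance baseline. The sketch below assumes that a random ranking places the correct letter at each of the five positions with equal probability, and that there is exactly one correct letter per question; under those assumptions the expected MAP@3 is (1 + 1/2 + 1/3) / 5.

```python
from fractions import Fraction

n_options = 5  # letters A to E
k = 3          # predictions kept per question

# With one correct letter, AP is 1/rank when the rank is within the top k, else 0.
# A uniformly random ranking hits each rank with probability 1/n_options.
random_map_at_k = sum(Fraction(1, rank) for rank in range(1, k + 1)) / n_options
print(random_map_at_k)  # 11/30, roughly 0.367
```

The measured value of about 0.382 sits only slightly above this assumed baseline.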

## Testing on customised dataset

BLOOMZ1b on customised dataset

(Dataset obtained manually by creating 36 questions in MMLU style)

In [None]:
# Dependencies
!pip install transformers
!pip install torch
!pip install datasets
!pip install scikit-learn
!pip install langchain



In [None]:
# Imports
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM, BloomTokenizerFast, BloomForCausalLM
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from langchain.prompts import PromptTemplate
from IPython.display import Markdown, display

In [None]:
# Load the bloomz model
param = "bloomz-1b1"
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/" + param)
model = BloomForCausalLM.from_pretrained("bigscience/" + param).to(device)

# Test the model
inputs = tokenizer.encode("How is the weather today", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

tokenizer_config.json:   0%|          | 0.00/222 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/715 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.13G [00:00<?, ?B/s]



How is the weather today? sunny</s>


In [None]:
# Get dataset
dataset = load_dataset("csv", data_files="/content/gdrive/MyDrive/Pawsey/custom-data/csiro-quiz.csv")

In [None]:
# Prepare for the prompts
template = """Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D, E]

Question: {prompt}\n
A) {a}\n
B) {b}\n
C) {c}\n
D) {d}\n
E) {e}\n

Answer:"""

prompt = PromptTemplate(template=template, input_variables=['prompt', 'a', 'b', 'c', 'd', 'e'])

# See the format
sample = dataset['train'][0]
display(Markdown(prompt.format(prompt=sample['prompt'],
                               a=sample['A'],
                               b=sample['B'],
                               c=sample['C'],
                               d=sample['D'],
                               e=sample['E'])))

def format_text(example):
    text = prompt.format(prompt=example['prompt'],
                         a=example['A'],
                         b=example['B'],
                         c=example['C'],
                         d=example['D'],
                         e=example['E'])
    return {"text": text}

dataset = dataset.map(format_text)

Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D, E]

Question: Which of the following are complementary strategies for managing and reducing the risks of climate change?

A) Climate change refers to any long-term trends or shifts in climate over many decades.

B) Implementing policies to regulate cloud cover.

C) Ignoring the impact of human activities on the environment.

D) Relying solely on natural climate variability.

E) Increasing greenhouse gas emissions for economic growth.


Answer:

Map:   0%|          | 0/17 [00:00<?, ? examples/s]

In [None]:
# Get the answer of each question: rank the five option letters by the
# model's next-token logit and return the top three
def get_ans(text):
    inputs = tokenizer(text, return_tensors='pt').to(device)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask']).logits[0, -1]
    # Score each letter by the logit of its last token (with a leading space)
    options_list = [(logits[tokenizer(' ' + letter).input_ids[-1]].item(), letter) for letter in 'ABCDE']
    options_list = sorted(options_list, reverse=True)
    ans_list = [letter for _, letter in options_list[:3]]

    return ans_list

In [None]:
# Computes average precision at k between two lists of items
# actual: list of elements to be predicted
# predicted: list of predicted elements, best guess first
# k: int, maximum number of predicted elements
# returns: double, avg precision at k over input lists
def apk(actual, predicted, k=10):
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

In [None]:
# Login to weights&biases
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install wandb
!wandb login YOUR_WANDB_API_KEY # pass your own API key as an argument (do not commit real keys)


In [None]:
import wandb
run = wandb.init(project='baselines', # Name of project
                #  name='BLOOMZ custom evaluation', # Name of run
                #  name='BLOOMZ custom evaluation csiro-quiz', # Name of run
                 name='BLOOMZ pretrain evaluation csiro-quiz', # Name of run
                 anonymous='must')

In [None]:
# Get results
aps = []
eval_table = wandb.Table(columns=['Question', 'Answer', 'Prediction 1', 'Prediction 2', 'Prediction 3', 'AP', 'A', 'B', 'C', 'D', 'E', 'text'])
bar = tqdm(enumerate(dataset['train']), total=len(dataset['train']))
for i, data in bar:
    ans_list = get_ans(data['text'])
    average_precision = apk([data['answer']], ans_list, k=3)
    aps.append(average_precision)
    ans1, ans2, ans3 = ans_list
    eval_table.add_data(data['prompt'],
                        data['answer'],
                        ans1,
                        ans2,
                        ans3,
                        average_precision,
                        data['A'],
                        data['B'],
                        data['C'],
                        data['D'],
                        data['E'],
                        data['text'])

wandb.log({'Evaluation': eval_table})
run.finish()

100%|██████████| 17/17 [00:10<00:00,  1.57it/s]



In [None]:
# View precision
mean_average_precision = np.mean(aps)
mean_average_precision

0.3725490196078431

## Fine-tuning

Requires a runtime with more RAM (I used a TPU runtime)

In [None]:
# Dependencies
!pip install transformers
!pip install torch
!pip install datasets
!pip install scikit-learn
!pip install langchain
!pip install PyPDF2
!pip install transformers[torch] -U
!pip install accelerate==0.20.3

ERROR: Could not find a version that satisfies the requirement accelerate==0.21.3 (from versions: 0.0.1, 0.1.0, 0.2.0, 0.2.1, 0.3.0, 0.4.0, 0.5.0, 0.5.1, 0.6.0, 0.6.1, 0.6.2, 0.7.0, 0.7.1, 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.13.2, 0.14.0, 0.15.0, 0.16.0, 0.17.0, 0.17.1, 0.18.0, 0.19.0, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.22.0, 0.23.0, 0.24.0, 0.24.1, 0.25.0)
ERROR: No matching distribution found for accelerate==0.21.3

Retrieving and preprocessing PDF

In [None]:
# Get PDF
import PyPDF2

def pdf_to_text(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        num_pages = len(pdf_reader.pages)
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            text += page.extract_text()
    return text

pdf_path = '/content/gdrive/MyDrive/Pawsey/csiro-data/372882eng.pdf'
text_content = pdf_to_text(pdf_path)

text_content



In [None]:
# Preprocessing the text by removing non-alphanumeric characters
import re

def preprocess_text(text):
    processed_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return processed_text

preprocessed_text = preprocess_text(text_content)
preprocessed_text
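
To see what this filter keeps and drops, it can be applied to a short string. The sample sentence below is made up for illustration; note that decimal points, hyphens, and percent signs are removed along with other punctuation, which changes how numbers read.

```python
import re

def preprocess_text(text):
    # Keep only ASCII letters, digits, and whitespace (same pattern as above).
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

sample_sentence = "CO2 rose by 1.5% (approx.) in 2020-21!"  # hypothetical input
cleaned = preprocess_text(sample_sentence)
print(cleaned)  # CO2 rose by 15 approx in 202021
```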



In [None]:
# Create dataset

# Split data into training and validation sets
train_data = preprocessed_text[:int(0.8 * len(preprocessed_text))]
val_data = preprocessed_text[int(0.8 * len(preprocessed_text)):]

# Save data to text files
with open('/content/gdrive/MyDrive/Pawsey/other-data/train.txt', 'w') as file:
    file.write(train_data)

with open('/content/gdrive/MyDrive/Pawsey/other-data/val.txt', 'w') as file:
    file.write(val_data)
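
The split above is done by character position, not by sentence or document. A toy string (hypothetical, standing in for `preprocessed_text`) shows the effect: the cut can land in the middle of a word.

```python
toy_text = "climate change adaptation"  # stand-in for preprocessed_text

cut = int(0.8 * len(toy_text))          # 80% of the characters go to training
toy_train, toy_val = toy_text[:cut], toy_text[cut:]

print(repr(toy_train))  # 'climate change adapt'
print(repr(toy_val))    # 'ation'
```

If a cleaner boundary matters, one option is to move the cut to the nearest whitespace before slicing.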

Fine-tune the model

In [None]:
# Imports
import accelerate
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM, BloomTokenizerFast, BloomForCausalLM
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from langchain.prompts import PromptTemplate
from IPython.display import Markdown, display

In [None]:
# Load the bloomz model
param = "bloomz-1b1"
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/" + param)
model = BloomForCausalLM.from_pretrained("bigscience/" + param).to(device)

# Fine-tune the model
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path='/content/gdrive/MyDrive/Pawsey/other-data/train.txt',
    block_size=128
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

training_args = TrainingArguments(
    output_dir='/content/gdrive/MyDrive/Pawsey/model-out',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

trainer.train()

tokenizer_config.json:   0%|          | 0.00/222 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/715 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.13G [00:00<?, ?B/s]





TrainOutput(global_step=33, training_loss=2.1786545262192236, metrics={'train_runtime': 285.3972, 'train_samples_per_second': 0.431, 'train_steps_per_second': 0.116, 'total_flos': 64231989313536.0, 'train_loss': 2.1786545262192236, 'epoch': 3.0})
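
As a rough cross-check of the `TrainOutput` above, the reported throughput implies how many 128-token blocks `TextDataset` produced. The sketch below assumes a single device and no gradient accumulation (the `TrainingArguments` defaults); the variable names are illustrative.

```python
import math

# Values copied from the TrainOutput and TrainingArguments above.
train_runtime = 285.3972          # seconds
train_samples_per_second = 0.431
num_train_epochs = 3
per_device_train_batch_size = 4

# Total samples seen across all epochs, then blocks per epoch.
total_samples = train_runtime * train_samples_per_second
blocks_per_epoch = round(total_samples / num_train_epochs)

# Steps per epoch round up because the last batch may be partial.
expected_steps = math.ceil(blocks_per_epoch / per_device_train_batch_size) * num_train_epochs
print(blocks_per_epoch, expected_steps)  # 41 33
```

Under these assumptions the training file yields about 41 blocks, and the 33 expected steps match the reported `global_step=33`.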

In [None]:
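# Evaluate

# # Example: evaluate on the validation set. val_data is a plain string, so
# # build a TextDataset from the saved val.txt before calling trainer.evaluate.
# val_dataset = TextDataset(tokenizer=tokenizer, file_path='/content/gdrive/MyDrive/Pawsey/other-data/val.txt', block_size=128)
# eval_results = trainer.evaluate(eval_dataset=val_dataset)
# print(eval_results)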

In [None]:
# Save the fine-tuned model and tokenizer
model.save_pretrained('/content/gdrive/MyDrive/Pawsey/model-out/bloomz1b_finetune')
tokenizer.save_pretrained('/content/gdrive/MyDrive/Pawsey/model-out/bloomz1b_finetune_token')

('/content/gdrive/MyDrive/Pawsey/model-out/bloomz1b_finetune_token/tokenizer_config.json',
 '/content/gdrive/MyDrive/Pawsey/model-out/bloomz1b_finetune_token/special_tokens_map.json',
 '/content/gdrive/MyDrive/Pawsey/model-out/bloomz1b_finetune_token/tokenizer.json')

Evaluate on the custom dataset

(Switch back to a GPU runtime)



In [None]:
# Dependencies
!pip install transformers
!pip install datasets
!pip install scikit-learn
!pip install langchain
!pip install PyPDF2
!pip install transformers[torch] -U

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.15.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6
Collecting langchain
  Downloading langchain-0.0.347-py3-none-any.whl (2.0 MB)

In [None]:
# Imports
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM, BloomTokenizerFast, BloomForCausalLM
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from langchain.prompts import PromptTemplate
from IPython.display import Markdown, display

In [None]:
# Load the model
tokenizer = BloomTokenizerFast.from_pretrained("/content/gdrive/MyDrive/Pawsey/model-out/bloomz1b_finetune_token")
model = BloomForCausalLM.from_pretrained('/content/gdrive/MyDrive/Pawsey/model-out/bloomz1b_finetune').to(device)

# Test the model
inputs = tokenizer.encode("How is the weather today", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))



How is the weather today 
differentiated from previous years 
Climate change occurs against a


In [None]:
# Get dataset
dataset = load_dataset("csv", data_files="/content/gdrive/MyDrive/Pawsey/custom-data/train.csv")

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
# Prepare for the prompts
template = """Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D, E]

Question: {prompt}\n
A) {a}\n
B) {b}\n
C) {c}\n
D) {d}\n
E) {e}\n

Answer:"""

prompt = PromptTemplate(template=template, input_variables=['prompt', 'a', 'b', 'c', 'd', 'e'])

# See the format
sample = dataset['train'][0]
display(Markdown(prompt.format(prompt=sample['prompt'],
                               a=sample['A'],
                               b=sample['B'],
                               c=sample['C'],
                               d=sample['D'],
                               e=sample['E'])))

def format_text(example):
    text = prompt.format(prompt=example['prompt'],
                         a=example['A'],
                         b=example['B'],
                         c=example['C'],
                         d=example['D'],
                         e=example['E'])
    return {"text": text}

dataset = dataset.map(format_text)

Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D, E]

Question: Which of the following are complementary strategies for managing and reducing the risks of climate change?

A) Adaptation and mitigation

B) Science and technology

C) Society and innovation

D) Love and peace

E) Parents and children


Answer:

Map:   0%|          | 0/36 [00:00<?, ? examples/s]

In [None]:
# Get the answer of each question
def get_ans(text):
    # Move inputs to the selected device instead of hard-coding .cuda()
    inputs = tokenizer(text, return_tensors='pt').to(device)
    # Logits for the token that follows the prompt (no gradients needed for evaluation)
    with torch.no_grad():
        logits = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask']).logits[0, -1]
    # Score each option by the logit of its token (with a leading space)
    options_list = [(logits[tokenizer(' ' + option).input_ids[-1]].item(), option) for option in 'ABCDE']
    # Keep the three highest-scoring options, best first
    options_list = sorted(options_list, reverse=True)
    ans_list = [option for _, option in options_list[:3]]

    return ans_list
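
The ranking step in `get_ans` can be illustrated without a model. The sketch below uses made-up scores in place of real next-token logits; `top_options` and `toy_logits` are hypothetical names used only for illustration.

```python
def top_options(option_scores, k=3):
    """Return the k option labels with the highest scores, best first."""
    ranked = sorted(option_scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:k]]

# Made-up next-token logits for the five answer letters
toy_logits = {"A": 2.1, "B": -0.3, "C": 1.4, "D": 0.2, "E": -1.0}

top3 = top_options(toy_logits)
print(top3)  # ['A', 'C', 'D']
```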

In [None]:
# Computes the average precision at k between two lists of items
# actual: list of elements to be predicted
# predicted: list of predicted elements (order matters)
# k: int, maximum number of predicted elements
# returns: float, the average precision at k over the input lists
def apk(actual, predicted, k=10):
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)
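
With a single correct answer and `k=3`, the score depends only on where the correct letter appears in the prediction list. The worked example below repeats the same `apk` logic so it runs on its own; the prediction lists are made up.

```python
def apk(actual, predicted, k=10):
    """Average precision at k (same logic as the cell above)."""
    predicted = predicted[:k]
    if not actual:
        return 0.0
    score, num_hits = 0.0, 0.0
    for i, p in enumerate(predicted):
        # Count a hit only the first time a correct item appears
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    return score / min(len(actual), k)

first = apk(["A"], ["A", "C", "D"], k=3)   # correct answer ranked 1st
second = apk(["A"], ["B", "A", "D"], k=3)  # ranked 2nd
third = apk(["A"], ["B", "C", "A"], k=3)   # ranked 3rd
missed = apk(["A"], ["B", "C", "D"], k=3)  # not in the top 3

print(first, second, third, missed)
```

The four cases give 1, 1/2, 1/3 and 0 respectively, so the mean over all questions rewards placing the correct letter higher in the top three.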

In [None]:
# Log in to Weights & Biases
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install wandb
!wandb login YOUR_WANDB_API_KEY # Replace with your own API key; do not share real keys

In [None]:
import wandb
run = wandb.init(project='baselines', # Name of project
                 name='BLOOMZ1b finetune evaluation', # Name of run
                 anonymous='must')

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [None]:
# Get results
aps = []
eval_table = wandb.Table(columns=['Question', 'Answer', 'Prediction 1', 'Prediction 2', 'Prediction 3', 'AP', 'A', 'B', 'C', 'D', 'E', 'text'])
bar = tqdm(enumerate(dataset['train']), total=len(dataset['train']))
for i, data in bar:
    ans_list = get_ans(data['text'])
    average_precision = apk([data['answer']], ans_list, k=3)
    aps.append(average_precision)
    ans1, ans2, ans3 = ans_list
    eval_table.add_data(data['prompt'],
                        data['answer'],
                        ans1,
                        ans2,
                        ans3,
                        average_precision,
                        data['A'],
                        data['B'],
                        data['C'],
                        data['D'],
                        data['E'],
                        data['text'])

wandb.log({'Evaluation': eval_table})
run.finish()

100%|██████████| 36/36 [00:02<00:00, 12.08it/s]


In [None]:
# View precision
mean_average_precision = np.mean(aps)
mean_average_precision

0.537037037037037

## Retriever

This section tests the performance of the two retrievers (DPR and LangChain similarity search) with the BloomZ1b model.

In [None]:
# Dependencies
!pip install transformers
!pip install torch
!pip install datasets
!pip install scikit-learn
!pip install langchain
!pip install PyPDF2
!pip install transformers[torch] -U
!pip install accelerate==0.20.3

In [None]:
# Imports
import torch
import accelerate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM, BloomTokenizerFast, BloomForCausalLM
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from transformers import DPRReader, RagTokenizer, RagRetriever, RagSequenceForGeneration, DPRReaderTokenizer
from transformers import pipeline
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from langchain.prompts import PromptTemplate
from IPython.display import Markdown, display
import PyPDF2
import re

DPR Retriever

In [None]:
# Set up retriever
r_tokenizer = DPRReaderTokenizer.from_pretrained("facebook/dpr-reader-single-nq-base")
r_model = DPRReader.from_pretrained("facebook/dpr-reader-single-nq-base").to(device)

In [None]:
# Get database PDF and preprocess
pdf_path = '/content/gdrive/MyDrive/Pawsey/csiro-data/372882eng.pdf'

def pdf_to_text(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        num_pages = len(pdf_reader.pages)
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            text += page.extract_text()
    return text

# Remove non-alphanumeric characters
def preprocess_text(text):
    processed_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return processed_text

text_content = pdf_to_text(pdf_path)
preprocessed_text = preprocess_text(text_content)

# View result
preprocessed_text[500:1000]
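
Note that the pattern in `preprocess_text` removes hyphens, decimal points and percent signs along with other punctuation, which would explain tokens such as `waterdependent` in the extracted text. A small standalone check, using a made-up input string:

```python
import re

def preprocess_text(text):
    # Same pattern as above: keep only ASCII letters, digits and whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

cleaned = preprocess_text("water-dependent, e.g. 93.8% of financing")
print(cleaned)  # waterdependent eg 938 of financing
```

If numbers and hyphenated terms matter for retrieval, a less aggressive pattern may be worth considering.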

In [None]:
# Test the retriever

# Split the long text into chunks
text_chunks = [preprocessed_text[i:i+2500] for i in range(0, len(preprocessed_text), 2500)]

# Process each chunk separately
predicted_answers = []
for chunk in text_chunks:
    encoded_inputs = r_tokenizer(
        questions=["what is energy production"],
        titles=["Environment"],
        texts=[chunk],
        return_tensors="pt",
    ).to(device)

    outputs = r_model(**encoded_inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    relevance_logits = outputs.relevance_logits

    # Extract answer span from the context
    start_index = torch.argmax(outputs.start_logits)
    end_index = torch.argmax(outputs.end_logits) + 1  # Adding 1 to include the end index itself
    answer_span = encoded_inputs["input_ids"][0][start_index:end_index]

    # Decode the answer span
    decoded_answer = r_tokenizer.decode(answer_span)
    predicted_answers.append(decoded_answer)

# Aggregate the results as needed
final_answer = " ".join(predicted_answers)
print("Retrieved Answer:", final_answer)
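
The fixed-size slicing used for `text_chunks` cuts purely by character count, so words can be split across chunk boundaries. A tiny sketch with a chunk size of 10 instead of 2500 (`chunk_text` is an illustrative helper, not part of the notebook):

```python
def chunk_text(text, size):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

source = "climate change affects water"
chunks = chunk_text(source, 10)
print(chunks)  # ['climate ch', 'ange affec', 'ts water']
```

No text is lost, but "change" and "affects" each end up in two different chunks, so the reader model never sees them whole.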

Langchain Similarity Retriever

In [None]:
!pip install sentence-transformers
!pip install faiss-cpu

In [None]:
# Imports
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader, TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.vectorstores import FAISS

In [None]:
huggingFaceAPIKey = "YOUR_HF_API_KEY"  # Replace with your own token; do not share real keys

In [None]:
# Get database PDF and preprocess
pdf_path = '/content/gdrive/MyDrive/Pawsey/csiro-data/372882eng.pdf'

def pdf_to_text(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        num_pages = len(pdf_reader.pages)
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            text += page.extract_text()
    return text

# Remove non-alphanumeric characters
def preprocess_text(text):
    processed_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return processed_text

text_content = pdf_to_text(pdf_path)
preprocessed_text = preprocess_text(text_content)

# View result
preprocessed_text[500:1000]

# Specify the path for the output .txt file
output_file_path = '/content/gdrive/MyDrive/Pawsey/custom-data/preprocessed_text.txt'

# Write the preprocessed_text to the file
with open(output_file_path, 'w', encoding='utf-8') as file:
    file.write(preprocessed_text)

# Split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader(output_file_path).load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

# Load into vector database
db = FAISS.from_documents(documents, HuggingFaceEmbeddings())

In [None]:
# Testing response
query = "What is climate?"
allSearchResults = db.similarity_search(query, k=2)

# Concatenate page_content into one string
result_string = ""

for i, searchResults in enumerate(allSearchResults):
    result_string += searchResults.page_content + "\n"

# Display
print(result_string)
final_answer = result_string
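
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose embeddings lie closest to it. The sketch below shows that idea in pure Python, assuming Euclidean distance and toy 2-D vectors in place of real sentence embeddings; `top_k_nearest` and `toy_docs` are illustrative names, not LangChain or FAISS API.

```python
import math

def top_k_nearest(query_vec, doc_vecs, k=2):
    """Return indices of the k vectors closest to query_vec (Euclidean distance)."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    return [idx for _, idx in sorted(distances)[:k]]

# Toy "embeddings" for four chunks
toy_docs = [(0.0, 1.0), (0.8, 0.0), (1.0, 0.5), (-1.0, 0.0)]

nearest = top_k_nearest((1.0, 0.0), toy_docs, k=2)
print(nearest)  # [1, 2]
```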

Generator

In [None]:
# Load the Model
param = "bloomz-1b1"
g_tokenizer = BloomTokenizerFast.from_pretrained("bigscience/" + param)
g_model = BloomForCausalLM.from_pretrained("bigscience/" + param).to(device)

In [None]:
# Test the model
input_question = "what is energy production"
generator_input = f"Question: {input_question} Document: {final_answer}"

inputs = g_tokenizer.encode(generator_input, return_tensors="pt").to(device)
outputs = g_model.generate(inputs)
print(g_tokenizer.decode(outputs[0]))

Retriever + Generator

In [None]:
# Get custom testing dataset
dataset = load_dataset("csv", data_files="/content/gdrive/MyDrive/Pawsey/custom-data/train.csv")

In [None]:
# Prepare for the retriever prompts
r_template = """{prompt}\n"""

prompt = PromptTemplate(template=r_template, input_variables=['prompt'])

sample = dataset['train'][0]
display(Markdown(prompt.format(prompt=sample['prompt'])))

def r_format_text(example):
    r_text = prompt.format(prompt=example['prompt'])
    return {"r_text": r_text}

dataset = dataset.map(r_format_text)

In [None]:
# Print out the 'r_text' column to see the questions
for example in dataset['train']:
    print(example['r_text'])

## RAG

In [None]:
# Dependencies
!pip install transformers
!pip install torch
!pip install datasets
!pip install scikit-learn
!pip install langchain
!pip install PyPDF2
!pip install transformers[torch] -U
!pip install accelerate==0.20.3

Successfully installed accelerate-0.25.0

Collecting accelerate==0.20.3
  Using cached accelerate-0.20.3-py3-none-any.whl (227 kB)
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
peft 0.7.1 requires accelerate>=0.21.0, but you have accelerate 0.20.3 which is incompatible.
Successfully installed accelerate-0.20.3


In [None]:
# Run this if using TPU
!pip install tensorflow -U



In [None]:
# Imports
import torch
import accelerate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM, BloomTokenizerFast, BloomForCausalLM
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from transformers import DPRReader, RagTokenizer, RagRetriever, RagSequenceForGeneration, DPRReaderTokenizer
from transformers import pipeline
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from langchain.prompts import PromptTemplate
from IPython.display import Markdown, display
import PyPDF2
import re

DPR Retriever (Skip if you use Langchain retriever)

In [None]:
# Set up retriever
r_tokenizer = DPRReaderTokenizer.from_pretrained("facebook/dpr-reader-single-nq-base")
r_model = DPRReader.from_pretrained("facebook/dpr-reader-single-nq-base").to(device)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRReaderTokenizer'.
Some weights of the model checkpoint at facebook/dpr-reader-single-nq-base were not used when initializing DPRReader: ['span_predictor.encoder.bert_model.pooler.dense.weight', 'span_predictor.encoder.bert_model.pooler.dense.bias']
- This IS expected if you are initializing DPRReader from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRReader from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Get database PDF and preprocess
pdf_path = '/content/gdrive/MyDrive/Pawsey/csiro-data/372882eng.pdf'

def pdf_to_text(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        num_pages = len(pdf_reader.pages)
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            text += page.extract_text()
    return text

# Remove non-alphanumeric characters
def preprocess_text(text):
    processed_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return processed_text

text_content = pdf_to_text(pdf_path)
preprocessed_text = preprocess_text(text_content)

# View result
preprocessed_text[500:1000]

'effective enjoyment of the human rights to water and sanitation for potentially billions of people The hydrological \nchanges induced by climate change will add challenges to the sustainable management of water resources which \nare already under severe pressure in many regions of the world \nFood security human health urban and rural settlements energy production industrial development economic \ngrowth and ecosystems are all waterdependent and thus vulnerable to the impacts of climate change Clima'

In [None]:
# Test the retriever

# Split the long text into chunks
text_chunks = [preprocessed_text[i:i+2500] for i in range(0, len(preprocessed_text), 2500)]

# Process each chunk separately
predicted_answers = []
for chunk in text_chunks:
    encoded_inputs = r_tokenizer(
        questions=["what is energy production"],
        titles=["Environment"],
        texts=[chunk],
        return_tensors="pt",
    ).to(device)

    outputs = r_model(**encoded_inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    relevance_logits = outputs.relevance_logits

    # Extract answer span from the context
    start_index = torch.argmax(outputs.start_logits)
    end_index = torch.argmax(outputs.end_logits) + 1  # Adding 1 to include the end index itself
    answer_span = encoded_inputs["input_ids"][0][start_index:end_index]

    # Decode the answer span
    decoded_answer = r_tokenizer.decode(answer_span)
    predicted_answers.append(decoded_answer)

# Aggregate the results as needed
final_answer = " ".join(predicted_answers)
print("Retrieved Answer:", final_answer)

Retrieved Answer: water agriculture water lower energy use and thus lower ghg emissions wetlands accommodate the largest carbon stocks among terrestrial ecosystems storing twice as much carbon as forests taking into account that wetlands offer multiple cobenefits including flood and drought mitigation water purification and biodiversity their restoration and conservation is of critical importance disaster risk reduction the current impacts and future anticipated risks associated with extreme events demand sustainable solutions for climate change adaptation and disaster risk reduction the range of available climate change adaptation and disaster risk reduction strategies includes hard structural and soft policy instruments approaches hard measures include enhanced water storage agriculture hydropower fossil fuels  water water consumption agriculture hydropower agriculture water energy industry
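
Taking the argmax of the start and end logits independently, as in the cell above, can return an end position that lies before the start, which gives an empty span. One common alternative is to score every (start, end) pair with start <= end and a cap on span length. A sketch with made-up logits; `best_span` and `max_len` are hypothetical names, not part of the transformers API:

```python
def best_span(start_logits, end_logits, max_len=10):
    """Return (start, end) maximising start_logit + end_logit with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within max_len tokens
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Independent argmax would pick start=2 and end=0 here (an empty span)
start = [0.1, 0.2, 3.0, 0.3]
end = [2.5, 0.1, 0.4, 2.0]

span = best_span(start, end)
print(span)  # (2, 3)
```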


Langchain Similarity Retriever (Referencing Imam's code) (Skip if you use DPR retriever)

In [None]:
!pip install sentence-transformers
!pip install faiss-cpu

In [None]:
# Imports
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader, TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.vectorstores import FAISS

In [None]:
huggingFaceAPIKey = "YOUR_HF_API_KEY"  # Replace with your own token; do not share real keys

In [None]:
# Get database PDF and preprocess
pdf_path = '/content/gdrive/MyDrive/Pawsey/csiro-data/372882eng.pdf'

def pdf_to_text(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        num_pages = len(pdf_reader.pages)
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            text += page.extract_text()
    return text

# Remove non-alphanumeric characters
def preprocess_text(text):
    processed_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return processed_text

text_content = pdf_to_text(pdf_path)
preprocessed_text = preprocess_text(text_content)

# View result
preprocessed_text[500:1000]

# Specify the path for the output .txt file
output_file_path = '/content/gdrive/MyDrive/Pawsey/custom-data/preprocessed_text.txt'

# Write the preprocessed_text to the file
with open(output_file_path, 'w', encoding='utf-8') as file:
    file.write(preprocessed_text)

# Split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader(output_file_path).load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

# Load into vector database
db = FAISS.from_documents(documents, HuggingFaceEmbeddings())

In [None]:
# Testing response
query = "What is climate?"
allSearchResults = db.similarity_search(query, k=2)

# Concatenate page_content into one string
result_string = ""

for i, searchResults in enumerate(allSearchResults):
    result_string += searchResults.page_content + "\n"

# Display
print(result_string)
final_answer = result_string

Executive SummaryWATER AND  
CLIMATE CHANGEThe United Nations World Water Development Report 2020
World Wa ter 
Assessment 
ProgrammeUnited Nations
Educational Scientic and
Cultural OrganizationSustainable 
Development 
Goalswater and
sanitationUnited Nations
Educational Scientic and
Cultural OrganizationThe United Nations World Water Development Report 2020  Water and Climate Change2Climate change will affect the availability quality and quantity of water for basic human needs threatening the 
effective enjoyment of the human rights to water and sanitation for potentially billions of people The hydrological 
changes induced by climate change will add challenges to the sustainable management of water resources which 
are already under severe pressure in many regions of the world 
Food security human health urban and rural settlements energy production industrial development economic 
growth and ecosystems are all waterdependent and thus vulnerable to the impacts of climate change Clima

Generator

Run this if you use the fine-tuned BloomZ1b (skip if you use the pretrained model)



In [None]:
# Load the model and tokenizer
model_path = '/content/gdrive/MyDrive/Pawsey/model-out/bloomz1b_finetune'
tokenizer_path = '/content/gdrive/MyDrive/Pawsey/model-out/bloomz1b_finetune_token'

g_tokenizer = BloomTokenizerFast.from_pretrained(tokenizer_path)
g_model = BloomForCausalLM.from_pretrained(model_path).to(device)

Run this if you use the pretrained BloomZ1b (skip if you use the fine-tuned model)

In [None]:
# Load the Model and tokenizer
param = "bloomz-1b1"
g_tokenizer = BloomTokenizerFast.from_pretrained("bigscience/" + param)
g_model = BloomForCausalLM.from_pretrained("bigscience/" + param).to(device)

---

In [None]:
# Test the model
input_question = "what is energy production"
generator_input = f"Question: {input_question} Document: {final_answer}"

inputs = g_tokenizer.encode(generator_input, return_tensors="pt").to(device)
outputs = g_model.generate(inputs)
print(g_tokenizer.decode(outputs[0]))



Question: what is energy production Document: water agriculture water lower energy use and thus lower ghg emissions wetlands accommodate the largest carbon stocks among terrestrial ecosystems storing twice as much carbon as forests taking into account that wetlands offer multiple cobenefits including flood and drought mitigation water purification and biodiversity their restoration and conservation is of critical importance disaster risk reduction the current impacts and future anticipated risks associated with extreme events demand sustainable solutions for climate change adaptation and disaster risk reduction the range of available climate change adaptation and disaster risk reduction strategies includes hard structural and soft policy instruments approaches hard measures include enhanced water storage agriculture hydropower fossil fuels  water water consumption agriculture hydropower agriculture water energy industry and


Retrieval-Augmented Generation on the custom dataset

In [None]:
# Get custom testing dataset
dataset = load_dataset("csv", data_files="/content/gdrive/MyDrive/Pawsey/custom-data/csiro-quiz.csv")

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
# Prepare for the retriever prompts
r_template = """{prompt}\n"""

prompt = PromptTemplate(template=r_template, input_variables=['prompt'])

sample = dataset['train'][0]
display(Markdown(prompt.format(prompt=sample['prompt'])))

def r_format_text(example):
    r_text = prompt.format(prompt=example['prompt'])
    return {"r_text": r_text}

dataset = dataset.map(r_format_text)

Which of the following are complementary strategies for managing and reducing the risks of climate change?


Map:   0%|          | 0/17 [00:00<?, ? examples/s]

In [None]:
# Print out the 'r_text' column to see the questions
for example in dataset['train']:
    print(example['r_text'])

Which of the following are complementary strategies for managing and reducing the risks of climate change?

Why is the world warming?

How has climate changed in the past?

Why do sea levels change?

How are large scale climate processes responding in a changing climate?

How is climate likely to change in the future?

What are greenhouse gases and how do they affect the climate system?

What are the sources of carbon dioxide in the atmosphere?

How are greenhouse gases measured, estimated, and reported?

How can we address climate change?

How can we adapt to a changing climate?

How does CSIRO contribute to climate change knowledge?

What are the impacts of extreme weather and climate events?

How will climate extremes change Australia?

How fast is the climate changing?

How confident are we about the science of climate change?

Where can I find more information about climate change?



In [None]:
# Get retriever answer (run this if you are using DPR retriever, skip if you use similarity retriever)
def get_rtv(text):
    # Uses the text_chunks list built in the DPR retriever cell above
    predicted_answers = []
    # Process each chunk separately
    for chunk in text_chunks:
        encoded_inputs = r_tokenizer(
            questions=[text],
            titles=["Environment"],
            texts=[chunk],
            return_tensors="pt",
        ).to(device)

        outputs = r_model(**encoded_inputs)
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits
        relevance_logits = outputs.relevance_logits

        # Extract answer span from the context
        start_index = torch.argmax(outputs.start_logits)
        end_index = torch.argmax(outputs.end_logits) + 1  # Adding 1 to include the end index itself
        answer_span = encoded_inputs["input_ids"][0][start_index:end_index]

        # Decode the answer span
        decoded_answer = r_tokenizer.decode(answer_span)
        predicted_answers.append(decoded_answer)

    # Aggregate the results as needed
    final_answer = " ".join(predicted_answers)
    return final_answer

# Test the function
get_rtv("what is water?")



In [None]:
# Get retriever answer (run this if you are using similarity retriever, skip if you use DPR retriever)
def get_rtv(text):
    allSearchResults = db.similarity_search(text, k=2)
    # Concatenate page_content into one string
    result_string = ""
    for i, searchResults in enumerate(allSearchResults):
        result_string += searchResults.page_content + "\n"
    return result_string

# Test the function
get_rtv("what is water?")



In [None]:
# Prepare for the generator prompts
template = """Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D, E]

Question: {prompt}\n
A) {a}\n
B) {b}\n
C) {c}\n
D) {d}\n
E) {e}\n

Use these information to help you: {retrieved}\n

Answer:"""

prompt = PromptTemplate(template=template, input_variables=['prompt', 'a', 'b', 'c', 'd', 'e', 'retrieved'])


sample = dataset['train'][0]
display(Markdown(prompt.format(prompt=sample['prompt'],
                               a=sample['A'],
                               b=sample['B'],
                               c=sample['C'],
                               d=sample['D'],
                               e=sample['E'],
                               retrieved=get_rtv(sample['prompt']))))

def format_text(example):
    retrieved_value = get_rtv(example['prompt'])
    text = prompt.format(
        prompt=example['prompt'],
        a=example['A'],
        b=example['B'],
        c=example['C'],
        d=example['D'],
        e=example['E'],
        retrieved=retrieved_value
    )
    return {"text": text, "retrieved": retrieved_value}

dataset = dataset.map(format_text)


# View a row
dataset['train'][0]

Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D, E]

Question: Which of the following are complementary strategies for managing and reducing the risks of climate change?

A) Climate change refers to any long-term trends or shifts in climate over many decades.

B) Implementing policies to regulate cloud cover.

C) Ignoring the impact of human activities on the environment.

D) Relying solely on natural climate variability.

E) Increasing greenhouse gas emissions for economic growth.


Use these information to help you:  ghgs water flood   energy wetlands water management two promising trends are generating opportunities for water projects to access climate finance the first is the increasing recognition of the mitigation potential within water and sanitation projects this trend could be particularly advantageous as mitigation made up 938 of climate financing in 2016 but water projects consisted of a fraction of 1 of that sum the second trend is an increasing emphasis on financing climate adaptation accessing climate finance can be competitive and difficult especially for complex water knowledge management water projects water and climate change10in transboundary basins technical and financial assistance can be shared up or downstream from wealthier to poorer riparian countries however even where funds are available transboundary water management irrigation noregrets 


Answer:

Map:   0%|          | 0/17 [00:00<?, ? examples/s]

{'id': 0,
 'prompt': 'Which of the following are complementary strategies for managing and reducing the risks of climate change?',
 'A': 'Climate change refers to any long-term trends or shifts in climate over many decades.',
 'B': 'Implementing policies to regulate cloud cover.',
 'C': 'Ignoring the impact of human activities on the environment.',
 'D': 'Relying solely on natural climate variability.',
 'E': 'Increasing greenhouse gas emissions for economic growth.',
 'answer': 'A',
 'text': 'Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D, E]\n\nQuestion: Which of the following are complementary strategies for managing and reducing the risks of climate change?\n\nA) Climate change refers to any long-term trends or shifts in climate over many decades.\n\nB) Implementing policies to regulate cloud cover.\n\nC) Ignoring the impact of human activities on the environment.\n\nD) Relying solely on natural climate 

In [None]:
# Get generator answer: rank the five options by the logit the model assigns
# to each option letter as the next token, and return the top three letters
def get_ans(text):
    inputs = g_tokenizer(text, return_tensors='pt')
    with torch.no_grad():  # inference only, no gradients needed
        logits = g_model(input_ids=inputs['input_ids'].cuda(), attention_mask=inputs['attention_mask'].cuda()).logits[0, -1]
    # Pair each option letter with the logit of its token (' A', ' B', ...)
    options_list = [(logits[g_tokenizer(f' {opt}').input_ids[-1]], opt) for opt in 'ABCDE']
    # Highest logit first
    options_list = sorted(options_list, reverse=True)
    return [opt for _, opt in options_list[:3]]
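The ranking step in `get_ans` can be tried without a model. The sketch below is only an illustration: `top_k_options` and the scores in `example_scores` are hypothetical stand-ins for the per-letter logits, not values produced by the generator.

```python
def top_k_options(option_scores, k=3):
    """Return the k option labels with the highest scores, best first."""
    ranked = sorted(option_scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:k]]

# Hypothetical next-token logits for the five option letters
example_scores = {'A': 1.2, 'B': 3.4, 'C': -0.5, 'D': 2.0, 'E': 0.1}

top3 = top_k_options(example_scores)
print(top3)
```

Returning three letters rather than one is what lets the evaluation below score partial credit with average precision at 3.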

Run the following two cells if using a TPU

In [None]:
!pip install torch torchvision torchaudio torch_xla

Collecting torch
  Downloading torch-2.1.0-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 2.1.2
    Uninstalling torch-2.1.2:
      Successfully uninstalled torch-2.1.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
peft 0.7.1 requires accelerate>=0.21.0, but you have accelerate 0.20.3 which is incompatible.
Successfully installed torch-2.1.0


In [None]:
import torch_xla.core.xla_model as xm

# Get generator answer
def get_ans(text):
    inputs = g_tokenizer(text, return_tensors='pt')

    # Move tensors to TPU
    input_ids = inputs['input_ids'].to(xm.xla_device())
    attention_mask = inputs['attention_mask'].to(xm.xla_device())

    # Forward pass on TPU
    logits = g_model(input_ids=input_ids, attention_mask=attention_mask).logits[0, -1].to(xm.xla_device())

    # Convert logits to CPU before indexing
    logits_cpu = logits.cpu()

    options_list = [
        (logits_cpu[g_tokenizer(' A').input_ids[-1]], 'A'),
        (logits_cpu[g_tokenizer(' B').input_ids[-1]], 'B'),
        (logits_cpu[g_tokenizer(' C').input_ids[-1]], 'C'),
        (logits_cpu[g_tokenizer(' D').input_ids[-1]], 'D'),
        (logits_cpu[g_tokenizer(' E').input_ids[-1]], 'E')
    ]

    options_list = sorted(options_list, reverse=True)
    ans_list = [options_list[i][1] for i in range(3)]

    return ans_list


In [None]:
# Computes the average precision at k between two lists of items
# actual: list of elements that should be predicted (order does not matter)
# predicted: list of predicted elements (order matters, best guess first)
# k: int, maximum number of predicted elements to consider
# returns: float, the average precision at k for this pair of lists
def apk(actual, predicted, k=10):
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)
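With a single correct answer and three ranked guesses, `apk` can only take four values. The sketch below repeats the same logic as the function above so that it runs on its own; the rankings are made-up examples, not model predictions.

```python
def apk(actual, predicted, k=10):
    """Average precision at k, same logic as the notebook's apk."""
    predicted = predicted[:k]
    score, num_hits = 0.0, 0.0
    for i, p in enumerate(predicted):
        # Count a hit only the first time a correct item appears
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    if not actual:
        return 0.0
    return score / min(len(actual), k)

# The correct answer is 'A'; it is ranked first, second, third, or not at all
rankings = (['A', 'B', 'C'], ['B', 'A', 'C'], ['B', 'C', 'A'], ['B', 'C', 'D'])
scores = [apk(['A'], ranking, k=3) for ranking in rankings]
print(scores)
```

The score is 1 for a correct first guess, 1/2 for a correct second guess, 1/3 for a correct third guess, and 0 otherwise, so the mean over the dataset rewards answers that are ranked high even when they are not ranked first.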

In [None]:
# Login to weights&biases
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install wandb
!wandb login "$WANDB_API_KEY" # API key; assumes WANDB_API_KEY was set beforehand, never paste the key into the notebook

Collecting wandb
  Downloading wandb-0.16.1-py3-none-any.whl (2.1 MB)
Collecting GitPython!=3.1.29,>=1.0.0 (from wandb)
  Downloading GitPython-3.1.40-py3-none-any.whl (190 kB)
Collecting sentry-sdk>=1.0.0 (from wandb)
  Downloading sentry_sdk-1.38.0-py2.py3-none-any.whl (252 kB)
Collecting docker-pycreds>=0.4.0 (from wandb)
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting setproctitle (from wandb)
  Downloading setproctitle-1.3.3-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30 kB)
Collecting gitdb<5,>=4.0.1 (from GitPython!=3.1.29,>=1.0.0->w

In [None]:
import wandb
run = wandb.init(project='baselines',
                #  name='BLOOMZ1b finetune-rag evaluation', # Use this if you are using DPR retriever
                #  name='BLOOMZ1b finetune-rag sim evaluation', # Use this if you are using similarity retriever
                #  name='BLOOMZ1b rag sim evaluation', # Use this if you are using DPR retriever w/o fine-tune
                #  name='BLOOMZ1b rag sim evaluation', # Use this if you are using similarity retriever w/o fine-tune
                #  name='BLOOMZ1b rag sim evaluation csiro-quiz', # Use this if you are using similarity retriever w/o fine-tune on csiro-quiz
                #  name='BLOOMZ1b rag dpr+pretrain evaluation csiro-quiz', # Use this if you are using DPR retriever with the pretrained model on csiro-quiz
                 name='testing', # Placeholder run name for quick tests
                 anonymous='must')


In [None]:
# Get Results
aps = []
eval_table = wandb.Table(columns=['Question', 'Answer', 'Prediction 1', 'Prediction 2', 'Prediction 3', 'AP', 'A', 'B', 'C', 'D', 'E', 'text', 'retrieved'])
bar = tqdm(enumerate(dataset['train']), total=len(dataset['train']))
for i, data in bar:
    ans_list = get_ans(data['text'])
    average_precision = apk([data['answer']], ans_list, k=3)
    aps.append(average_precision)
    ans1, ans2, ans3 = ans_list
    eval_table.add_data(data['prompt'],
                        data['answer'],
                        ans1,
                        ans2,
                        ans3,
                        average_precision,
                        data['A'],
                        data['B'],
                        data['C'],
                        data['D'],
                        data['E'],
                        data['text'],
                        data['retrieved'])

wandb.log({'Evaluation': eval_table})
run.finish()

100%|██████████| 17/17 [00:31<00:00,  1.88s/it]



In [None]:
# View precision
mean_average_precision = np.mean(aps)
mean_average_precision

0.40196078431372545

## Profanity Filter

In [None]:
!pip install --upgrade pip setuptools

In [None]:
!pip install ruamel.yaml -U

In [None]:
!pip install wheel

In [None]:
!pip3 install profanity-filter

Collecting profanity-filter
  Using cached profanity_filter-1.3.3-py3-none-any.whl (45 kB)
Collecting cached-property<2.0,>=1.5 (from profanity-filter)
  Using cached cached_property-1.5.2-py2.py3-none-any.whl (7.6 kB)
Collecting more-itertools<9.0,>=8.0 (from profanity-filter)
  Using cached more_itertools-8.14.0-py3-none-any.whl (52 kB)
Collecting ordered-set<4.0,>=3.0 (from profanity-filter)
  Using cached ordered_set-3.1.1-py2.py3-none-any.whl
Collecting ordered-set-stubs<0.2.0,>=0.1.3 (from profanity-filter)
  Using cached ordered_set_stubs-0.1.3-py2.py3-none-any.whl (4.8 kB)
Collecting poetry-version<0.2.0,>=0.1.3 (from profanity-filter)
  Using cached poetry_version-0.1.5-py2.py3-none-any.whl (13 kB)
Collecting redis<4.0,>=3.2 (from profanity-filter)
  Using cached redis-3.5.3-py2.py3-none-any.whl (72 kB)
Collecting ruamel.yaml<0.16.0,>=0.15.89 (from profanity-filter)
  Using cached ruamel.yaml-0.15.100.tar.gz (318 kB)
  Preparing metadata (setup.py) ... done
Collect

In [None]:
!pip install --no-deps ruamel.yaml

Collecting ruamel.yaml
  Downloading ruamel.yaml-0.18.5-py3-none-any.whl (116 kB)
Installing collected packages: ruamel.yaml
Successfully installed ruamel.yaml-0.18.5


In [None]:
!pip install better-profanity

Collecting better-profanity
  Downloading better_profanity-0.7.0-py3-none-any.whl (46 kB)
Installing collected packages: better-profanity
Successfully installed better-profanity-0.7.0


In [None]:
from better_profanity import profanity


text = "jc"
censored_text = profanity.censor(text)
print(censored_text)

jc


AttributeError: ignored

In [None]:
# Dependencies
!pip install ctransformers
!pip install langchain

Collecting langchain
  Downloading langchain-0.0.350-py3-none-any.whl (809 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.3-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-community<0.1,>=0.0.2 (from langchain)
  Downloading langchain_community-0.0.3-py3-none-any.whl (1.5 MB)
Collecting langchain-core<0.2,>=0.1 (from langchain)
  Downloading langchain_core-0.1.0-py3-none-any.whl (189 kB)
Collecting langsmith<0.1.0,>=0.0.63 (from langchain)
  Downloading langsmith-0

In [None]:
# Imports
# The GGUF checkpoint is loaded through ctransformers, as in the Mistral 7B section above;
# model_file, model_type and gpu_layers are ctransformers arguments
from ctransformers import AutoModelForCausalLM

In [None]:
# ctransformers models take raw strings as input, so no separate tokenizer is loaded
mistral_model = AutoModelForCausalLM.from_pretrained("TheBloke/Mistral-7B-v0.1-GGUF", model_file="mistral-7b-v0.1.Q2_K.gguf", model_type="mistral", gpu_layers=0)

In [None]:
# Test the model
print(mistral_model("Can you tell me a joke?"))

In [None]:
from langchain.llms import CTransformers
llm = CTransformers(model='TheBloke/Mistral-7B-v0.1-GGUF', model_file="mistral-7b-v0.1.Q2_K.gguf", model_type='mistral')
print(llm('AI is going to'))

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

AttributeError: ignored

## Data Collection

Parse for information

In [None]:
!pip install requests beautifulsoup4



In [None]:
# Imports
import requests
from bs4 import BeautifulSoup
import csv
import re

In [None]:
def get_wikipedia_page(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Failed to fetch the page. Status code: {response.status_code}")
        return None

def extract_text_chunks(html):
    soup = BeautifulSoup(html, 'html.parser')
    text_chunks = []

    paragraphs = soup.find_all('p')
    for paragraph in paragraphs:
        text_chunks.append(paragraph.get_text())

    return text_chunks

def preprocess_text(text):
    # Remove citations in the format [number]
    text = re.sub(r'\[\d+\]', '', text)

    # Remove consecutive whitespaces of 2 or more
    text = re.sub(r'\s{2,}', ' ', text)

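    # Remove new lines between paragraphs (note: the substitution above has already
    # collapsed every whitespace run, newlines included, so this pattern can no longer match)
    text = re.sub(r'\n\s*\n', '\n', text)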

    # Strip leading and trailing whitespaces
    text = text.strip()

    return text if text else None  # Return None for empty strings


def write_to_csv(data, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)

        # Write the header separately
        header = "Environment Knowledge Base"
        writer.writerow([header])

        # Write the rest of the data
        for row in data:
            if row is not None:  # Skip None values (empty strings)
                writer.writerow([row])

def main():
    wikipedia_urls = [
        "https://en.wikipedia.org/wiki/Environmental_issues",
        "https://en.wikipedia.org/wiki/Natural_environment",
        "https://en.wikipedia.org/wiki/Biophysical_environment",
        "https://en.wikipedia.org/wiki/Ecology",
        "https://en.wikipedia.org/wiki/Environment_(systems)",
        "https://en.wikipedia.org/wiki/Built_environment",
        "https://en.wikipedia.org/wiki/Climate_change",
        "https://en.wikipedia.org/wiki/Human_impact_on_the_environment",
        "https://en.wikipedia.org/wiki/Environment_of_Australia",
        "https://en.wikipedia.org/wiki/Environmental_protection",
        "https://en.wikipedia.org/wiki/Environmental_issues_in_Australia"
    ]

    all_text_chunks = []

    for url in wikipedia_urls:
        html_content = get_wikipedia_page(url)

        if html_content:
            text_chunks = extract_text_chunks(html_content)
            preprocessed_chunks = [preprocess_text(chunk) for chunk in text_chunks]
            all_text_chunks.extend(preprocessed_chunks)
            print(f"Data extracted for {url}")

    # Write all extracted texts to a single CSV file
    filename = '/content/gdrive/MyDrive/Pawsey/custom-data/parsed-knowledge_all.csv'
    write_to_csv(all_text_chunks, filename)
    print(f"All data extracted and saved to {filename}")



if __name__ == "__main__":
    main()

Data extracted for https://en.wikipedia.org/wiki/Environmental_issues
Data extracted for https://en.wikipedia.org/wiki/Natural_environment
Data extracted for https://en.wikipedia.org/wiki/Biophysical_environment
Data extracted for https://en.wikipedia.org/wiki/Ecology
Data extracted for https://en.wikipedia.org/wiki/Environment_(systems)
Data extracted for https://en.wikipedia.org/wiki/Built_environment
Data extracted for https://en.wikipedia.org/wiki/Climate_change
Data extracted for https://en.wikipedia.org/wiki/Human_impact_on_the_environment
Data extracted for https://en.wikipedia.org/wiki/Environment_of_Australia
Data extracted for https://en.wikipedia.org/wiki/Environmental_protection
Data extracted for https://en.wikipedia.org/wiki/Environmental_issues_in_Australia
All data extracted and saved to /content/gdrive/MyDrive/Pawsey/custom-data/parsed-knowledge_all.csv
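To see what the cleaning step does to a single paragraph, the sketch below repeats `preprocess_text` from the cell above and applies it to a made-up sentence with Wikipedia-style citation markers. The sample strings are invented for illustration.

```python
import re

def preprocess_text(text):
    # Same steps, in the same order, as the notebook's preprocess_text
    text = re.sub(r'\[\d+\]', '', text)      # drop citations such as [1] or [23]
    text = re.sub(r'\s{2,}', ' ', text)      # collapse runs of whitespace
    text = re.sub(r'\n\s*\n', '\n', text)    # blank lines between paragraphs
    text = text.strip()
    return text if text else None

sample = "Climate change[1] affects   ecosystems.[23]\n\n"
cleaned = preprocess_text(sample)
print(repr(cleaned))

# A paragraph that contains only whitespace is reported as None,
# which write_to_csv then skips
empty = preprocess_text("  \n ")
print(empty)
```

Because empty paragraphs come back as `None`, the row count of the CSV can be smaller than the number of `<p>` tags on the scraped pages.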


Inspecting the data

In [None]:
!pip install nltk



In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
def inspect_csv(csv_filename):
    try:
        with open(csv_filename, 'r', encoding='utf-8') as csvfile:
            reader = csv.reader(csvfile)
            header = next(reader)  # Assuming the first row is the header

            # Running totals over all rows of the knowledge base
            num_rows = 0
            num_words = 0
            num_sentences = 0

            for row in reader:
                num_rows += 1
                text = row[0]
                num_words += len(word_tokenize(text))
                num_sentences += len(sent_tokenize(text))

            return {
                "Header": header,
                "Number of Rows": num_rows,
                "Number of Words": num_words,
                "Number of Sentences": num_sentences,
            }
    except Exception as e:
        return {"Error": str(e)}

# Example usage
csv_filename = '/content/gdrive/MyDrive/Pawsey/custom-data/parsed-knowledge_all.csv'
inspection_result = inspect_csv(csv_filename)

# Print the inspection result in a readable format
print("Data Inspection:")
for key, value in inspection_result.items():
    print(f"{key}: {value}")

Data Inspection:
Header: ['Environment Knowledge Base']
Number of Rows: 577
Number of Words: 51526
Number of Sentences: 2150


## Self-supervised fine-tuning Mistral 7b

Run the second part of Data Collection (Inspecting the data) before running this section.

Configuration

In [None]:
!pip -qqq install bitsandbytes accelerate
!pip install -q -U transformers


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch


In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
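A rough estimate of what 4-bit loading saves can be derived from the checkpoint shard sizes reported in this notebook's download logs (9.94 GB and 4.54 GB). The sketch below is back-of-the-envelope arithmetic under two assumptions: the shards store 2 bytes per parameter, and NF4 stores about half a byte per parameter. It ignores quantization constants and any layers that stay in 16-bit, so the real footprint is somewhat higher.

```python
# Shard sizes in GB, taken from the download progress lines in this notebook
shard_gb = [9.94, 4.54]

# Assumption: the checkpoint is stored in a 16-bit format (2 bytes per parameter)
bytes_per_param_16bit = 2
params_billion = sum(shard_gb) / bytes_per_param_16bit

# Assumption: NF4 needs roughly 0.5 bytes per parameter (4 bits)
nf4_gb = params_billion * 0.5

print(f"approx. parameters: {params_billion:.2f} B")
print(f"16-bit weights:     {sum(shard_gb):.2f} GB")
print(f"4-bit weights:      {nf4_gb:.2f} GB (lower bound)")
```

Under these assumptions the weights shrink from about 14.5 GB to under 4 GB, which is what makes the model fit on a single T4.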

Getting the mistral model

In [None]:
model_name = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,  # already sets load_in_4bit=True, so the separate flag is redundant
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )

tokenizer_config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

KeyboardInterrupt: ignored

In [None]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer = tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

In [None]:
prompt = "As a data scientist, can you explain the concept of regularization in machine learning?"

sequences = pipe(
    prompt,
    do_sample=True,
    max_new_tokens=100,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
)
print(sequences[0]['generated_text'])

Fine-tuning

In [None]:
!pip install -U peft
!pip install -U accelerate
!pip install -U trl

Collecting trl
  Downloading trl-0.7.4-py3-none-any.whl (133 kB)
Collecting tyro>=0.5.11 (from trl)
  Downloading tyro-0.6.0-py3-none-any.whl (100 kB)
Collecting docstring-parser>=0.14.1 (from tyro>=0.5.11->trl)
  Downloading docstring_parser-0.15-py3-none-any.whl (36 kB)
Collecting shtab>=1.5.6 (from tyro>=0.5.11->trl)
  Downloading shtab-1.6.5-py3-none-any.whl (13 kB)
Installing collected packages: shtab, docstring-parser, tyro, trl
Successfully installed docstring-parser-0.15 shtab-1.6.5 trl-0.7.4 tyro-0.6.0


In [None]:
!pip install wandb



In [None]:
!pip install datasets



In [None]:
# Imports
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os, torch, wandb
from datasets import load_dataset
from trl import SFTTrainer



In [None]:
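# Read the access tokens from environment variables instead of hard-coding them
# (assumes HF_TOKEN and WANDB_API_KEY were set beforehand)
secret_hf = os.environ["HF_TOKEN"]
secret_wandb = os.environ["WANDB_API_KEY"]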

In [None]:
!huggingface-cli login --token $secret_hf

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
wandb.login(key = secret_wandb)
run = wandb.init(
    project='Fine tuning mistral 7B',
    job_type="training",
    anonymous="allow"
)

wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: fiona-junfei (llm-csiro). Use `wandb login --relogin` to force relogin


In [None]:
base_model = "mistralai/Mistral-7B-v0.1"
dataset_name = "/content/gdrive/MyDrive/Pawsey/custom-data/parsed-knowledge_all.csv"
new_model = "fionazhang/mistral_7b_environment"

In [None]:
# Split to train and test
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the CSV file into a pandas DataFrame
df = pd.read_csv(dataset_name)

# Split the dataset into train and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

In [None]:
from datasets import Dataset

# Assuming 'Environment Knowledge Base' is the column containing your features
feature_column = 'Environment Knowledge Base'

# Convert Pandas DataFrames to dictionaries
train_dict = {feature_column: train_df[feature_column].tolist()}
test_dict = {feature_column: test_df[feature_column].tolist()}

# Convert dictionaries to datasets.Dataset
train_dataset = Dataset.from_dict(train_dict)
test_dataset = Dataset.from_dict(test_dict)

# Print information about the datasets

train_dataset, test_dataset

(Dataset({
     features: ['Environment Knowledge Base'],
     num_rows: 461
 }),
 Dataset({
     features: ['Environment Knowledge Base'],
     num_rows: 116
 }))
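The 461/116 split of the 577 rows follows from how `train_test_split` sizes the two sets when only `test_size` is given as a fraction: the test count is rounded up and the training set gets the remainder. A minimal check of that arithmetic, using the row count reported by the data inspection above:

```python
import math

n_rows = 577       # rows in the knowledge base CSV (from the inspection output)
test_size = 0.2    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_rows)   # test count is rounded up
n_train = n_rows - n_test                # training set gets the rest

print(n_train, n_test)
```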

In [None]:
train_dataset['Environment Knowledge Base'][100]

"Humanity's overall impact on the planet is affected by many factors, not just the raw number of people. Their lifestyle (including overall affluence and resource use) and the pollution they generate (including carbon footprint) are equally important. In 2008, The New York Times stated that the inhabitants of the developed nations of the world consume resources like oil and metals at a rate almost 32 times greater than those of the developing world, who make up the majority of the human population."

In [None]:
model.config.use_cache = False # silence the warnings
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

In [None]:
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token

(True, True)

In [None]:
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
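The number of trainable parameters that this `LoraConfig` adds can be estimated by hand: each adapted linear layer of shape `in × out` gets two low-rank matrices with `r × (in + out)` weights in total. The sketch below assumes the published Mistral-7B-v0.1 dimensions (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers); these numbers are not printed anywhere in this notebook.

```python
# Assumed Mistral-7B-v0.1 dimensions (not taken from this notebook)
hidden, kv_dim, intermediate, n_layers = 4096, 1024, 14336, 32
r = 64  # LoRA rank from the LoraConfig above

# (input features, output features) of each targeted projection
target_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
}

# LoRA adds A (r x in) and B (out x r) per adapted layer
per_layer = sum(r * (n_in + n_out) for n_in, n_out in target_shapes.values())
total = per_layer * n_layers

print(f"trainable LoRA parameters: {total:,}")
print(f"size as 32-bit floats:     {total * 4 / 1e6:.0f} MB")
```

If the adapter is stored as 32-bit floats, this estimate comes out at roughly 369 MB, which is consistent with the size of the `adapter_model.safetensors` upload shown later in this section.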

In [None]:
training_arguments = TrainingArguments(
    output_dir="/content/gdrive/MyDrive/Pawsey/model-out",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="wandb"
)

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    max_seq_length= None,
    dataset_text_field='Environment Knowledge Base',
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)



Map:   0%|          | 0/461 [00:00<?, ? examples/s]

In [None]:
trainer.train()

Step,Training Loss
25,2.1007
50,1.9757
75,2.148
100,2.0108




TrainOutput(global_step=116, training_loss=2.0764361414416084, metrics={'train_runtime': 1344.3418, 'train_samples_per_second': 0.343, 'train_steps_per_second': 0.086, 'total_flos': 2492056970698752.0, 'train_loss': 2.0764361414416084, 'epoch': 1.0})
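The figures in `TrainOutput` can be cross-checked against the training arguments: 461 training rows in batches of 4 with no gradient accumulation give 116 optimizer steps for one epoch, and dividing by the reported runtime reproduces the throughput numbers. A small sanity check using only values from this notebook:

```python
import math

n_train = 461          # rows in train_dataset
batch_size = 4         # per_device_train_batch_size
grad_accum = 1         # gradient_accumulation_steps
epochs = 1             # num_train_epochs
runtime_s = 1344.3418  # train_runtime from TrainOutput

steps = math.ceil(n_train / (batch_size * grad_accum)) * epochs
samples_per_second = round(n_train * epochs / runtime_s, 3)
steps_per_second = round(steps / runtime_s, 3)

print(steps, samples_per_second, steps_per_second)
```

With `logging_steps=25`, 116 steps also explain why the loss table stops at step 100.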

In [None]:
trainer.model.save_pretrained('/content/gdrive/MyDrive/Pawsey/model-out/mistral_finetune_train')
wandb.finish()
model.config.use_cache = True


In [None]:
trainer.model.push_to_hub(repo_id=new_model)

adapter_model.safetensors:   0%|          | 0.00/369M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/fionazhang/mistral_7b_environment/commit/61d2690a1083ead78d14a666f43400f70a5d322c', commit_message='Upload model', commit_description='', oid='61d2690a1083ead78d14a666f43400f70a5d322c', pr_url=None, pr_revision=None, pr_num=None)

### Merge

In [None]:
!pip -q install peft
!pip -qqq install bitsandbytes accelerate
#!pip install -q -U transformers


In [None]:
base_model = "mistralai/Mistral-7B-v0.1"
eval_dataset = "/content/gdrive/MyDrive/Pawsey/custom-data/parsed-knowledge_all.csv"
new_model = "fionazhang/mistral_7b_environment"

In [None]:
import accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from peft import PeftModel
import torch

In [None]:
base_model_reload = AutoModelForCausalLM.from_pretrained(
        base_model,
        return_dict=True,
        low_cpu_mem_usage=True,
        device_map="auto",
        trust_remote_code=True,
)

model = PeftModel.from_pretrained(base_model_reload, new_model)


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/622 [00:00<?, ?B/s]

  warn("The installed version of bitsandbytes was compiled without GPU support. "


/usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32


adapter_model.safetensors:   0%|          | 0.00/369M [00:00<?, ?B/s]

ValueError: ignored

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

tokenizer_config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

In [None]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer = tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonF

In [None]:
prompt = "Climate change is"

sequences = pipe(
    f"<s>[INST] {prompt} [/INST]",
    do_sample=True,
    max_new_tokens=100,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
)
print(sequences[0]['generated_text'])

RuntimeError: ignored

In [None]:
model = model.merge_and_unload()

In [None]:
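import os

# Read the access tokens from environment variables instead of hard-coding them
# (assumes HF_TOKEN and WANDB_API_KEY were set beforehand)
secret_hf = os.environ["HF_TOKEN"]
secret_wandb = os.environ["WANDB_API_KEY"]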

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!huggingface-cli login --token $secret_hf

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
model.push_to_hub(repo_id=new_model)
tokenizer.push_to_hub(repo_id = new_model)



NotImplementedError: ignored

## Mistral7b fine-tune evaluation

In [None]:
!pip -qqq install bitsandbytes accelerate
!pip install -q -U transformers
!pip -q install peft

In [None]:
import accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from peft import PeftModel
import torch

In [None]:
# Load the tokenizer, adjust configuration if needed
model_name = "mistralai/Mistral-7B-v0.1"


bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,  # already sets load_in_4bit=True, so the separate flag is redundant
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )

# Load the fine-tuned model with its trained weights
fine_tuned_model = PeftModel.from_pretrained(model, 'fionazhang/mistral_7b_environment')


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
pipe = pipeline(
    "text-generation",
    model=fine_tuned_model,
    tokenizer = tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    pad_token_id = 2
)

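# Note: [INST] ... [/INST] is the Mistral Instruct chat format. The base model and this
# raw-text fine-tune were not trained on it, so the tags are treated as ordinary text here.
prompt = "Climate change is"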

sequences = pipe(
    f"<s>[INST] {prompt} [/INST]",
    do_sample=True,
    max_new_tokens=100,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
)
print(sequences[0]['generated_text'])

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonF

<s>[INST] Climate change is [/INST] expected to have significant impacts on agriculture and food security in the region, with impacts on food security and nutrition likely to be particularly acute. In some areas, the effects will be exacerbated by other stressors, such as poverty, rapid population growth, water scarcity and overfishing. The region is also vulnerable to impacts from extreme weather events, such as drought and floods.
In addition, the region is already highly vulnerable to the impacts of climate change, such as


In [None]:
!pip install datasets
!pip install scikit-learn
!pip install langchain



In [None]:
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from langchain.prompts import PromptTemplate
from IPython.display import Markdown, display

In [None]:
# Get dataset
dataset = load_dataset("csv", data_files="/content/gdrive/MyDrive/Pawsey/custom-data/csiro-quiz.csv")

In [None]:
# Prepare for the generator prompts
template = """Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D, E]

Question: {prompt}\n
A) {a}\n
B) {b}\n
C) {c}\n
D) {d}\n
E) {e}\n

Answer:"""

prompt = PromptTemplate(template=template, input_variables=['prompt', 'a', 'b', 'c', 'd', 'e'])


sample = dataset['train'][0]
display(Markdown(prompt.format(prompt=sample['prompt'],
                               a=sample['A'],
                               b=sample['B'],
                               c=sample['C'],
                               d=sample['D'],
                               e=sample['E'])))

def format_text(example):
    text = prompt.format(
        prompt=example['prompt'],
        a=example['A'],
        b=example['B'],
        c=example['C'],
        d=example['D'],
        e=example['E']
    )
    return {"text": text}

dataset = dataset.map(format_text)


# View a row
dataset['train'][0]

Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D, E]

Question: Which of the following are complementary strategies for managing and reducing the risks of climate change?

A) Climate change refers to any long-term trends or shifts in climate over many decades.

B) Implementing policies to regulate cloud cover.

C) Ignoring the impact of human activities on the environment.

D) Relying solely on natural climate variability.

E) Increasing greenhouse gas emissions for economic growth.


Answer:

{'id': 0,
 'prompt': 'Which of the following are complementary strategies for managing and reducing the risks of climate change?',
 'A': 'Climate change refers to any long-term trends or shifts in climate over many decades.',
 'B': 'Implementing policies to regulate cloud cover.',
 'C': 'Ignoring the impact of human activities on the environment.',
 'D': 'Relying solely on natural climate variability.',
 'E': 'Increasing greenhouse gas emissions for economic growth.',
 'answer': 'A',
 'text': 'Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D, E]\n\nQuestion: Which of the following are complementary strategies for managing and reducing the risks of climate change?\n\nA) Climate change refers to any long-term trends or shifts in climate over many decades.\n\nB) Implementing policies to regulate cloud cover.\n\nC) Ignoring the impact of human activities on the environment.\n\nD) Relying solely on natural climate 
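
The prompt assembly above does not depend on LangChain specifically. As a minimal sketch, the same template can be filled with plain `str.format`. The names `TEMPLATE`, `build_prompt` and `sample_row` are illustrative and do not appear elsewhere in this notebook; the row is copied from the output above.

```python
# Plain-Python sketch of the prompt template (illustrative, not the notebook's code path).
TEMPLATE = (
    "Answer the following multiple choice question by giving the most "
    "appropriate response. Answer should be one among [A, B, C, D, E]\n\n"
    "Question: {prompt}\n\n"
    "A) {a}\n\n"
    "B) {b}\n\n"
    "C) {c}\n\n"
    "D) {d}\n\n"
    "E) {e}\n\n\n"
    "Answer:"
)

def build_prompt(row):
    """Fill the template from a dataset row with keys prompt, A, B, C, D, E."""
    return TEMPLATE.format(
        prompt=row["prompt"],
        a=row["A"], b=row["B"], c=row["C"], d=row["D"], e=row["E"],
    )

# Row copied from the first dataset entry shown above.
sample_row = {
    "prompt": "Which of the following are complementary strategies for managing and reducing the risks of climate change?",
    "A": "Climate change refers to any long-term trends or shifts in climate over many decades.",
    "B": "Implementing policies to regulate cloud cover.",
    "C": "Ignoring the impact of human activities on the environment.",
    "D": "Relying solely on natural climate variability.",
    "E": "Increasing greenhouse gas emissions for economic growth.",
}

print(build_prompt(sample_row))
```

Ending the prompt with `Answer:` matters for the evaluation below, because the model's next-token prediction is then read as its choice of option letter.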

In [None]:
# Get generator answer
def get_ans(text):
    inputs = tokenizer(text, return_tensors='pt')
    # Logits for the token that would follow the prompt
    logits = model(input_ids=inputs['input_ids'].cuda(), attention_mask=inputs['attention_mask'].cuda()).logits[0, -1]
    # Score each option by the logit of its token (' A', ' B', ...)
    options_list = [(logits[tokenizer(f' {option}').input_ids[-1]], option) for option in 'ABCDE']
    # Rank the options from highest to lowest logit and keep the top three
    options_list = sorted(options_list, reverse=True)
    return [option for _, option in options_list[:3]]
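
The ranking step in `get_ans` can be seen in isolation without loading a model. The sketch below uses made-up scores in place of the last-position logits; `top_k_options` and `hypothetical_logits` are illustrative names, not part of the notebook.

```python
# Toy illustration of the ranking step in get_ans: no model is loaded here.
# The scores are made-up stand-ins for the logits of the tokens ' A' .. ' E'.
def top_k_options(option_scores, k=3):
    """Return the k option letters with the highest scores, best first."""
    ranked = sorted(option_scores.items(), key=lambda item: item[1], reverse=True)
    return [letter for letter, _ in ranked[:k]]

hypothetical_logits = {"A": 11.2, "B": 7.9, "C": 9.4, "D": 3.1, "E": 8.8}
print(top_k_options(hypothetical_logits))  # ['A', 'C', 'E']
```

Comparing only the five option tokens, rather than sampling free text, means every question yields a well-defined top-three ranking.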

In [None]:
# Computes the average precision at k between two lists of items.
# actual: list of elements to be predicted
# predicted: list of predicted elements, ordered from most to least likely
# k: int, maximum number of predicted elements to consider
# returns: float, the average precision at k
def apk(actual, predicted, k=10):
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)
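
When each question has exactly one correct option, as in this quiz, `apk` reduces to the reciprocal of the rank at which the correct letter appears (or 0 if it is missing from the top k). The short check below copies `apk` so that it runs on its own; the ranked lists are invented for illustration.

```python
def apk(actual, predicted, k=10):
    # Copied from the cell above so that this snippet runs on its own.
    if len(predicted) > k:
        predicted = predicted[:k]
    score = 0.0
    num_hits = 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    if not actual:
        return 0.0
    return score / min(len(actual), k)

# The correct answer is 'A'; it appears at rank 1, 2, 3, or not at all.
for ranked in (["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"], ["B", "C", "D"]):
    print(ranked, apk(["A"], ranked, k=3))
```

The four calls give 1, 1/2, 1/3 and 0 respectively, so a mean over questions close to 1 indicates that the correct option is usually ranked first.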

In [None]:
!pip install wandb
!wandb login  # paste your API key when prompted; do not hard-code it in the notebook


In [None]:
import wandb
run = wandb.init(project='baselines', # Name of project
                 name='mistral evaluation csiro-quiz', # Name of run
                 anonymous='must')

wandb: Currently logged in as: fiona-junfei (llm-csiro). Use `wandb login --relogin` to force relogin


In [None]:
# Get results
aps = []
eval_table = wandb.Table(columns=['Question', 'Answer', 'Prediction 1', 'Prediction 2', 'Prediction 3', 'AP', 'A', 'B', 'C', 'D', 'E', 'text'])
bar = tqdm(enumerate(dataset['train']), total=len(dataset['train']))
for i, data in bar:
    ans_list = get_ans(data['text'])
    average_precision = apk([data['answer']], ans_list, k=3)
    aps.append(average_precision)
    ans1, ans2, ans3 = ans_list
    eval_table.add_data(data['prompt'],
                        data['answer'],
                        ans1,
                        ans2,
                        ans3,
                        average_precision,
                        data['A'],
                        data['B'],
                        data['C'],
                        data['D'],
                        data['E'],
                        data['text'])

wandb.log({'Evaluation': eval_table})
run.finish()

100%|██████████| 17/17 [00:26<00:00,  1.55s/it]



In [None]:
# View precision
mean_average_precision = np.mean(aps)
mean_average_precision

0.9705882352941176
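
Because each question contributes 1, 1/2, 1/3 or 0 to the mean, the reported score constrains how the 17 questions were answered. The sketch below enumerates every possible split between rank-1, rank-2 and rank-3 hits and keeps those that reproduce the printed value. It is a consistency check on the arithmetic only, not a substitute for inspecting the logged W&B table.

```python
from fractions import Fraction

N_QUESTIONS = 17                     # from the tqdm output above
REPORTED_MAP = 0.9705882352941176    # value printed above

solutions = []
for first in range(N_QUESTIONS + 1):
    for second in range(N_QUESTIONS + 1 - first):
        for third in range(N_QUESTIONS + 1 - first - second):
            # Total score if `first` answers are at rank 1, `second` at rank 2, `third` at rank 3.
            total = first + Fraction(second, 2) + Fraction(third, 3)
            if abs(float(total / N_QUESTIONS) - REPORTED_MAP) < 1e-12:
                solutions.append((first, second, third))

print(solutions)  # [(16, 1, 0)]
```

The only split that matches is 16 questions with the correct option ranked first and one question with it ranked second, that is (16 + 0.5) / 17.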

## RAG with a PDF database

In [None]:
!pip install -U huggingface-hub

NotImplementedError: ignored

In [None]:
import os
from huggingface_hub import login

# Read the access token from the environment instead of hard-coding it
token = os.environ['HF_TOKEN']
login(token=token, add_to_git_credential=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
!pip install torch
!pip install transformers
!pip install langchain
!pip install chromadb
!pip install pypdf
!pip install xformers
!pip install sentence_transformers
!pip install InstructorEmbedding
!pip install pdf2image
!pip install pycryptodome
!pip install auto-gptq
!pip install pinecone-client

NotImplementedError: ignored

In [None]:
import torch
from auto_gptq import AutoGPTQForCausalLM
from langchain import HuggingFacePipeline, PromptTemplate
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
import pinecone
from langchain.vectorstores import Pinecone
from pdf2image import convert_from_path
from transformers import AutoTokenizer, TextStreamer, pipeline

In [None]:
# Initializing Pinecone Vector DB
import os

pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],  # keep the key out of the notebook
    environment='gcp-starter'
)

In [None]:
# Pinecone Vector DB index name
index_name = 'csiro-vector'
index = pinecone.Index(index_name)

Upsert (skip this step if the index is already populated)

In [None]:
loader = PyPDFDirectoryLoader("/content/gdrive/MyDrive/Pawsey/csiro-data")
docs = loader.load()
len(docs)

In [None]:
text_splitter = CharacterTextSplitter(
    chunk_size=1000,      # Maximum number of characters per chunk
    chunk_overlap=200,    # Characters shared between consecutive chunks to prevent loss of context
)

In [None]:
docs_split = text_splitter.split_documents(docs)
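
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch in plain Python. It is not LangChain's algorithm (`CharacterTextSplitter` first splits on a separator and then merges pieces), and `sliding_chunks` is a hypothetical helper used only for illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# 2,500 characters of dummy text.
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])            # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.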

In [None]:
embeddings = HuggingFaceEmbeddings()

.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

train_script.py:   0%|          | 0.00/13.1k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

In [None]:
# create new embedding to upsert in vector store
doc_db = Pinecone.from_documents(
          docs_split,
          embeddings,
          index_name=index_name
        )

In [None]:
query = "How does CSIRO respond to climate change?"

In [None]:
# search for matched entities and return score
search_docs = doc_db.similarity_search_with_score(query)

In [None]:
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.04399,
 'namespaces': {'': {'vector_count': 4399}},
 'total_vector_count': 4399}
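
Conceptually, a similarity search embeds the query and ranks the stored vectors by how close they are to it. The toy sketch below uses cosine similarity on three-dimensional vectors; the real index holds 768-dimensional vectors, and the metric configured for `csiro-vector` is not shown in this notebook, so cosine is an assumption here. The names `cosine_similarity`, `top_k` and `toy_index` are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k=1):
    """Return the k (doc_id, score) pairs most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Made-up vectors standing in for document embeddings.
toy_index = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.6, 0.8, 0.0],
    "doc-c": [0.0, 0.0, 1.0],
}

for doc_id, score in top_k([1.0, 0.1, 0.0], toy_index, k=2):
    print(doc_id, round(score, 3))
```

A hosted vector database performs the same ranking with an approximate nearest-neighbour index so that it scales beyond a brute-force scan.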

---


In [None]:
# Index using vector store just built
text_field = "text"
embeddings = HuggingFaceEmbeddings()

# switch back to normal index for langchain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embeddings.embed_query, text_field
)



In [None]:
query = "How does CSIRO respond to climate change?"

vectorstore.similarity_search(
    query,
    k=1  # return only the single most relevant document
)

[Document(page_content='11Australasia  Chapter 11\n1641Table\xa011.15a | \xa0Examples of Australian adaptation strategies, plans and initiatives by government agencies at national, sub-national and regional or local levels. These examples \nhave not been assessed for their effectiveness (see Supplementary Material Table\xa0SM11.1a).\nJurisdiction Strategies/Plans/Actions\nNational Level\nAustraliaNational Climate Resilience and Adaptation Strategy 2015 (CoA, 2015)\nNational Disaster Risk Reduction Framework (2018) (CoA, 2018b)\nNational Recovery and Resilience Agency and Australian Climate Service (CoA, 2021)\nSub-national\nAustralian Capital Territory (ACT)ACT Climate Change Strategy 2019–2025 (ACT Government, 2019)\nCanberra’s Living Infrastructure Plan: Cooling the City (ACT Government, 2020b); ACT Well-being Framework (ACT Government, 2020a)\nNew South Wales NSW Climate Change Policy Framework (NSW Government, 2016)\nCoastal Management Framework (OEH, 2018b) including\nCoastal Mana

## Retriever + fine-tuned Mistral

In [None]:
!pip install langchain
!pip -qqq install bitsandbytes accelerate
!pip install -q -U transformers
!pip -q install peft

Collecting langchain
  Downloading langchain-0.0.351-py3-none-any.whl (794 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.3-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-community<0.1,>=0.0.2 (from langchain)
  Downloading langchain_community-0.0.4-py3-none-any.whl (1.5 MB)
Collecting langchain-core<0.2,>=0.1 (from langchain)
  Downloading langchain_core-0.1.1-py3-none-any.whl (190 kB)
Collecting langsmith<0.1.0,>=0.0.70 (from langchain)
  Downloading langsmith-0.

In [None]:
from langchain.chains import RetrievalQA
import accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from peft import PeftModel
import torch

Getting model

In [None]:
# Load the tokenizer, adjust configuration if needed
model_name = "mistralai/Mistral-7B-v0.1"


bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,  # 4-bit loading is already specified in bnb_config
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )

# Load the fine-tuned model with its trained weights
fine_tuned_model = PeftModel.from_pretrained(model, 'fionazhang/mistral_7b_environment')


tokenizer_config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/622 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/369M [00:00<?, ?B/s]
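
The download log gives a rough sense of why 4-bit loading is used on a T4. The back-of-the-envelope sketch below assumes the two shards store the weights at 2 bytes per parameter (bfloat16); that storage format is an assumption, not something the log states. The 4-bit figure is a lower bound that ignores quantization constants, layers left unquantized, activations and the KV cache.

```python
# Shard sizes as reported by the download progress bars above (gigabytes).
shard_gb = [9.94, 4.54]

bf16_gb = sum(shard_gb)                      # assumed: 2 bytes per parameter
approx_params_billion = bf16_gb / 2          # GB / (bytes per parameter) -> billions of parameters
four_bit_gb = approx_params_billion * 0.5    # 4 bits = 0.5 bytes per parameter

print(f"16-bit checkpoint:            {bf16_gb:.2f} GB")
print(f"approx. parameter count:      {approx_params_billion:.2f} B")
print(f"4-bit weights (lower bound):  {four_bit_gb:.2f} GB")
```

Under these assumptions the base weights shrink from roughly 14.5 GB to about 3.6 GB, and the 369 MB LoRA adapter is loaded on top of that.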

In [None]:
# Get page content
query = "How does CSIRO respond to climate change?"

retriever = vectorstore.as_retriever()
retrieved_docs = retriever.invoke(query)
print(retrieved_docs[0].page_content)

11Australasia  Chapter 11
1641Table 11.15a |  Examples of Australian adaptation strategies, plans and initiatives by government agencies at national, sub-national and regional or local levels. These examples 
have not been assessed for their effectiveness (see Supplementary Material Table SM11.1a).
Jurisdiction Strategies/Plans/Actions
National Level
AustraliaNational Climate Resilience and Adaptation Strategy 2015 (CoA, 2015)
National Disaster Risk Reduction Framework (2018) (CoA, 2018b)
National Recovery and Resilience Agency and Australian Climate Service (CoA, 2021)
Sub-national
Australian Capital Territory (ACT)ACT Climate Change Strategy 2019–2025 (ACT Government, 2019)
Canberra’s Living Infrastructure Plan: Cooling the City (ACT Government, 2020b); ACT Well-being Framework (ACT Government, 2020a)
New South Wales NSW Climate Change Policy Framework (NSW Government, 2016)
Coastal Management Framework (OEH, 2018b) including
Coastal Management Act 2016, State Environmental Planning 

In [None]:
# Get RAG answer
def get_ans(text, tok, llm, ret):

  # Retrieve context for the question that was passed in (not the global `query`)
  retrieved_docs = ret.invoke(text)
  retrieved = retrieved_docs[0].page_content[100:500]

  pipe = pipeline(
      "text-generation",
      model=llm,
      tokenizer = tok,
      torch_dtype=torch.bfloat16,
      device_map="auto",
      pad_token_id = 2
  )

  sequences = pipe(
      f"Context: {retrieved} \n Question: {text} \n Answer: ",
      do_sample=True,
      max_new_tokens=100,
      temperature=0.7,
      top_k=50,
      top_p=0.95,
      num_return_sequences=1,
  )
  return sequences[0]['generated_text']

retriever = vectorstore.as_retriever()
print(get_ans("How does CSIRO respond to climate change? ", tokenizer, model, retriever))

Context: d initiatives by government agencies at national, sub-national and regional or local levels. These examples 
have not been assessed for their effectiveness (see Supplementary Material Table SM11.1a).
Jurisdiction Strategies/Plans/Actions
National Level
AustraliaNational Climate Resilience and Adaptation Strategy 2015 (CoA, 2015)
National Disaster Risk Reduction Framework (2018) (CoA, 2018b)
Nation 
 Question: How does CSIRO respond to climate change?  
 Answer: 
 The CSIRO is the Commonwealth's national science agency. Climate science is a core focus of the CSIRO. The CSIRO is a lead agency for the 
National Climate Resilience and Adaptation Strategy 2015. The CSIRO has released a Climate Adaptation Toolkit and Climate Adaptation Australia 
portal to support adaptation planning. The CSIRO also provides information on its website and through social media
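
The fixed slice `page_content[100:500]` cuts words in half, which is why the context above starts with "d initiatives" and ends with "Nation". One possible refinement, sketched below, trims the slice back to whole words. `trim_to_word_boundaries` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def trim_to_word_boundaries(text, start=100, end=500):
    """Slice text[start:end], then drop a partial word at either edge."""
    window = text[start:end]
    # Drop the leading fragment if the slice began inside a word.
    if window and start > 0 and not text[start - 1].isspace() and not window[0].isspace():
        parts = window.split(None, 1)
        window = parts[1] if len(parts) > 1 else ""
    # Drop the trailing fragment if the slice ended inside a word.
    if window and end < len(text) and not text[end].isspace() and not window[-1].isspace():
        parts = window.rsplit(None, 1)
        window = parts[0] if len(parts) > 1 else ""
    return window.strip()

demo = "alpha beta gamma delta epsilon"
print(repr(demo[2:20]))                              # 'pha beta gamma del'
print(repr(trim_to_word_boundaries(demo, 2, 20)))    # 'beta gamma'
```

Whether cleaner context edges change the generated answers has not been measured here; the sketch only removes the visibly truncated words.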


## Streamlit app

In [15]:
!pip -q install streamlit
!wget -q -O - ipv4.icanhazip.com

35.236.159.196


In [3]:
!npm install localtunnel

npm WARN saveError ENOENT: no such file or directory, open '/content/package.json'
npm notice created a lockfile as package-lock.json. You should commit this file.
npm WARN enoent ENOENT: no such file or directory, open '/content/package.json'
npm WARN content No description
npm WARN content No repository field.
npm WARN content No README data
npm WARN content No license field.

+ localtunnel@2.0.2
added 22 packages from 22 contributors and audited 22 packages in 2.789s

3 packages are looking for funding
  run `npm fund` for details

found 1 moderate severity vulnerability
  run `npm audit fix` to fix them, or `npm audit` for details


In [12]:
!pip -q install streamlit
!pip -q install torch
!pip -q install langchain
!pip -q install transformers
!pip -q install peft
!pip -q install bitsandbytes accelerate
!pip -q install pinecone-client  # provides the `pinecone` module imported below
!pip -q install sentence_transformers

In [93]:
%%writefile /content/gdrive/MyDrive/Pawsey/WebApp/streamlit.py
import streamlit as st
import torch
import pinecone
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Pinecone
import accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from peft import PeftModel
from huggingface_hub import login
import os

device = torch.device("cuda")

def hf_login():
  # Read the access token from the environment instead of hard-coding it
  token = os.environ['HF_TOKEN']
  login(token=token, add_to_git_credential=True)

def get_retriever():
  # Initializing Pinecone Vector DB
  pinecone.init(
      api_key=os.environ['PINECONE_API_KEY'],  # keep the key out of the source file
      environment='gcp-starter'
  )

  # Pinecone Vector DB index name
  index_name = 'csiro-vector'
  index = pinecone.Index(index_name)

  # Index using vector store just built
  text_field = "text"
  embeddings = HuggingFaceEmbeddings()

  # switch back to normal index for langchain
  index = pinecone.Index(index_name)

  vectorstore = Pinecone(
      index, embeddings.embed_query, text_field
  )
  retriever = vectorstore.as_retriever()
  return retriever

def get_model():
  # Load the tokenizer, adjust configuration if needed
  model_name = "mistralai/Mistral-7B-v0.1"

  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_use_double_quant=True,
  )

  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(
          model_name,
          quantization_config=bnb_config,  # 4-bit loading is already specified in bnb_config
          torch_dtype=torch.bfloat16,
          device_map="auto",
          trust_remote_code=True,
      )

  # Load the fine-tuned model with its trained weights
  fine_tuned_model = PeftModel.from_pretrained(model, 'fionazhang/mistral_7b_environment')
  return {'tok': tokenizer, 'model': fine_tuned_model}

def handle_userinput(text, tok, llm, ret):
    retrieved_docs = ret.invoke(text)
    retrieved_page_content = retrieved_docs[0].page_content

    # Calculate the total number of words in the page content
    total_words = len(retrieved_page_content.split())

    # Calculate the starting and ending indices for the middle 100 words
    start_index = max(0, total_words // 2 - 50)  # Ensure the start index is not negative
    end_index = start_index + 100

    # Extract the middle 100 words
    retrieved = ' '.join(retrieved_page_content.split()[start_index:end_index])

    source = retrieved_docs[0].metadata['source']
    source = source.split('/')
    source = source[-1]

    pipe = pipeline(
        "text-generation",
        model=llm,
        tokenizer=tok,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        pad_token_id=2
    )

    sequences = pipe(
        f"Context: {retrieved} \n Question: {text} \n Answer: ",
        do_sample=True,
        max_new_tokens=100,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        num_return_sequences=1,
    )

    generated_text = sequences[0]['generated_text']
    question = st.markdown(f"<div style='background-color: #c3e3fd; padding: 10px; border-radius: 5px; margin-bottom: 10px;'><span style='color: blue;'> Question   </span> <span style='color: black;'>{text}</span></div>", unsafe_allow_html=True)
    context = st.markdown(f"<div style='background-color: #e8f5ff; padding: 10px; border-radius: 5px; margin-bottom: 10px;'><span style='color: purple;'><b> Context </b> {source} </span><br><span style='color: black;'> ...{retrieved}... </span></div>", unsafe_allow_html=True)

    # Extract the answer part
    answer_start = generated_text.find("Answer:") + len("Answer:")

    # Find the next occurrence of "Question:" or "Context:"
    question_start = generated_text[answer_start:].find("Question:")
    context_start = generated_text[answer_start:].find("Context:")

    # Adjust indices by adding answer_start
    question_start = question_start + answer_start if question_start != -1 else len(generated_text)
    context_start = context_start + answer_start if context_start != -1 else len(generated_text)

    # Identify the end index of the answer
    answer_end = min(question_start, context_start)
    answer = generated_text[answer_start:answer_end].strip()
    st.markdown(f"<div style='background-color: #9acef8; padding: 10px; border-radius: 5px;'><span style='color: purple;'> Answer </span><br><span style='color: black;'>{answer}</span></div>", unsafe_allow_html=True)


def main():
  if 'hf' not in st.session_state:
    hf_login()
    st.session_state['hf'] = True

  if 'model' not in st.session_state:
    mod_tok = get_model()
    tok = mod_tok['tok']
    mod = mod_tok['model']
    ret = get_retriever()
    st.session_state['model'] = mod
    st.session_state['token'] = tok
    st.session_state['retriever'] = ret

  tok = st.session_state['token']
  mod = st.session_state['model']
  ret = st.session_state['retriever']

  st.set_page_config(page_title="Private LLM - RAG with Fine-tuned Mistral",
                       page_icon=":books:")

  st.header("Private LLM - RAG with Fine-tuned Mistral")
  st.write("Developed by Fiona")
  query = st.text_input('Enter your prompt here', key="input_query")
  if query:
    handle_userinput(query, tok, mod, ret)

if __name__ == '__main__':
    main()


Overwriting /content/gdrive/MyDrive/Pawsey/WebApp/streamlit.py
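
Two pieces of string handling in `handle_userinput` can be tested without Streamlit or a model: taking the middle 100 words of the retrieved page, and cutting the answer out of the generated text. The sketch below restates both as standalone functions. The names `middle_words` and `extract_answer` are illustrative, and unlike the app code, `extract_answer` returns the whole text when no `Answer:` marker is found.

```python
def middle_words(text, n_words=100):
    """Return the n_words words centred on the middle of text."""
    words = text.split()
    start = max(0, len(words) // 2 - n_words // 2)  # never start before the first word
    return " ".join(words[start:start + n_words])

def extract_answer(generated_text):
    """Return the text after 'Answer:' up to the next 'Question:' or 'Context:'."""
    marker = "Answer:"
    marker_pos = generated_text.find(marker)
    if marker_pos == -1:
        # Assumption for this sketch: fall back to the full text if the marker is missing.
        return generated_text.strip()
    answer_start = marker_pos + len(marker)
    stops = [generated_text.find(stop, answer_start) for stop in ("Question:", "Context:")]
    stops = [pos for pos in stops if pos != -1]
    answer_end = min(stops) if stops else len(generated_text)
    return generated_text[answer_start:answer_end].strip()

generated = "Context: some text \n Question: What? \n Answer: Forty-two. \n Question: Next?"
print(extract_answer(generated))                 # Forty-two.
print(middle_words("a b c d e f g h i j", 4))    # d e f g
```

Cutting at the next `Question:` or `Context:` guards against the model continuing the prompt pattern and inventing further question and answer pairs.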


In [91]:
!streamlit run /content/gdrive/MyDrive/Pawsey/WebApp/streamlit.py &>/content/logs.txt &

In [92]:
!npx localtunnel --port 8501

npx: installed 22 in 1.475s
your url is: https://tall-zebras-flash.loca.lt
^C
