How to use this Notebook:

Steps 1 and 3 will likely take about 5 minutes each. Make sure you are on a T4 instance!

1.   Install dependencies
2.   Get a Hugging Face API token and log in to the notebook. The code will likely fail if you skip this step. Guidance on obtaining an API token is available here: https://huggingface.co/docs/hub/security-tokens
3.   Import the model
4.   Prepare for inference by defining helper functions (mandatory) and explore the data as you like (not mandatory)
5.   Inference: run the model on a paper of your choice (one never before seen by the model)

Finally, if you wish to see longer answers, you can raise the value in generation_config.max_new_tokens = 50 in the "Define Helper Functions and settings" code block, but be aware that this will increase inference time significantly.
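
The T4 requirement above can be checked before running anything heavy. The snippet below is a minimal sketch that asks nvidia-smi for the GPU name; it assumes the nvidia-smi tool is on the PATH when a GPU is attached, and the helper name detect_gpu_names is only illustrative.

```python
import shutil
import subprocess


def detect_gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = detect_gpu_names()
if any("T4" in name for name in names):
    print("T4 detected:", names)
else:
    print("No T4 detected; found:", names or "no GPU")
```

If no T4 is reported, switch the runtime type before continuing.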


## Install Necessary Dependencies

In [None]:
!pip install -Uqqq pip
!pip install -qqq bitsandbytes==0.39.0
!pip install -qqq torch==2.0.1
!pip install -qqq -U git+https://github.com/huggingface/transformers.git@e03a9cc
!pip install -qqq -U git+https://github.com/huggingface/peft.git@42a184f
!pip install -qqq -U git+https://github.com/huggingface/accelerate.git@c9fbb71
!pip install -qqq datasets==2.12.0
!pip install -qqq loralib==0.1.1
!pip install -qqq einops==0.6.1
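
Since the notebook depends on exact pins, it can help to confirm what actually got installed. The following is a sketch using only the standard library; the PINNED table is copied from the install cell above, and the helper name check_pins is an assumption for illustration. The three git-installed packages are left out because a commit hash does not map to a single version string.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell (git-installed packages omitted).
PINNED = {
    "bitsandbytes": "0.39.0",
    "torch": "2.0.1",
    "datasets": "2.12.0",
    "loralib": "0.1.1",
    "einops": "0.6.1",
}


def check_pins(pins):
    """Map each package name to (expected, installed, matches)."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # A local build tag such as "2.0.1+cu118" still counts as a match.
        matches = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, matches)
    return report


for name, (expected, installed, ok) in check_pins(PINNED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name:<14} expected {expected:<8} installed {installed}  {status}")
```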


In [None]:
import json
import os
from pprint import pprint
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset
from huggingface_hub import notebook_login
from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    get_peft_model,
    prepare_model_for_kbit_training
)
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)

os.environ["CUDA_VISIBLE_DEVICES"] = "0"


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


In [None]:
import pandas as pd
import gdown
import re

file_id = "1u7rvNf84a__E3HaPfhcC83MLENOieBZe"
url = f'https://drive.google.com/uc?id={file_id}'
output = 'data.csv'
gdown.download(url, output, quiet=False)
df = pd.read_csv('data.csv')
filtered_df = df[df['DocID'].str.startswith('S')]

Downloading...
From: https://drive.google.com/uc?id=1u7rvNf84a__E3HaPfhcC83MLENOieBZe
To: /content/data.csv
100%|██████████| 1.31G/1.31G [00:09<00:00, 136MB/s]


## Log in to Huggingface

In [None]:
# You must log in to a Hugging Face account before loading the fine-tuned model.
# If you do not, the cell that loads the fine-tuned adapter WILL FAIL.

notebook_login()


## Import Model

In [None]:

from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "vilsonrodrigues/falcon-7b-instruct-sharded",
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)

config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

configuration_falcon.py:   0%|          | 0.00/6.70k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/vilsonrodrigues/falcon-7b-instruct-sharded:
- configuration_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_falcon.py:   0%|          | 0.00/56.9k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/vilsonrodrigues/falcon-7b-instruct-sharded:
- modeling_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Some weights of FalconForCausalLM were not initialized from the model checkpoint at vilsonrodrigues/falcon-7b-instruct-sharded and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/117 [00:00<?, ?B/s]
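
To see why the model is loaded with load_in_4bit=True on a T4 (which has 16 GB of memory), here is a rough arithmetic sketch. It assumes about 7e9 parameters, rounded from the model name, and counts weights only; activations, quantization constants, layers that stay unquantized, and CUDA overhead are all ignored, so treat the numbers as lower bounds rather than measurements.

```python
# Back-of-the-envelope weight memory for a ~7B-parameter model.
# ASSUMPTION: 7e9 parameters; only the weights are counted.
N_PARAMS = 7e9
GIB = 1024 ** 3


def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:<12} {weight_memory_gib(N_PARAMS, bits):6.1f} GiB")
```

Under these assumptions the 16-bit weights alone would take most of the card, while the 4-bit weights leave room for the adapter and for generation.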

In [None]:
tokenizer = AutoTokenizer.from_pretrained("vilsonrodrigues/falcon-7b-instruct-sharded")
tokenizer.pad_token = tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/287 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.73M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/281 [00:00<?, ?B/s]

In [None]:
# Will fail if you did not log in using the cell above
model = PeftModel.from_pretrained(model, "CodeChemist/Scientific_Topic_Modeling")


adapter_config.json:   0%|          | 0.00/396 [00:00<?, ?B/s]

adapter_model.bin:   0%|          | 0.00/18.9M [00:00<?, ?B/s]

## Prepare for Inference

### Define Helper Functions and settings

In [49]:
def generate_prompt(prompt_data):
  return f"""
<human>: {prompt_data}
<assistant>:
""".strip()

def generate_and_tokenize_prompt(prompt_data):
  full_prompt = generate_prompt(prompt_data)
  tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
  return tokenized_full_prompt
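
To make the template concrete, the sketch below prints the exact string generate_prompt produces and compares it with the same template written inside an indented block, which is how the inference functions further down build their prompt. The function indented_prompt is a name introduced here for illustration. Whether the extra whitespace matters depends on how the adapter was trained, which this notebook does not show.

```python
def generate_prompt(prompt_data):
    # Same template as the helper above.
    return f"""
<human>: {prompt_data}
<assistant>:
""".strip()


def indented_prompt(text):
    # Same template written inside an indented block, as in the
    # inference functions further down. strip() only trims the two ends,
    # so the spaces before <assistant>: survive.
    return f"""
          <human>: {text}
          <assistant>:
          """.strip()


flat = generate_prompt("Example abstract.")
indented = indented_prompt("Example abstract.")

print(repr(flat))
print(repr(indented))
print("identical:", flat == indented)
```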

In [50]:
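from datasets import Dataset
prepend_text = "Please describe the novel concepts presented in this work at the collegiate level: "
# Work on an explicit copy so the column assignments below do not trigger
# pandas' SettingWithCopyWarning.
filtered_df = filtered_df.copy()
filtered_df['Abstract'] = filtered_df['Abstract'].astype(str)
filtered_df['Prompt'] = filtered_df['Abstract'].apply(lambda x: prepend_text + x)
# Rows 2000-2999 by position; these are the papers available for inference below.
data = filtered_df[2000:3000]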

In [67]:
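generation_config = model.generation_config
# Upper bound on the number of generated tokens; raise for longer answers.
generation_config.max_new_tokens = 50
# temperature and top_p only affect generation when sampling is enabled (do_sample=True).
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id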
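
For readers unfamiliar with temperature and top_p, here is a toy sketch of what the two settings do to a next-token distribution. It is plain Python on made-up logits, not the transformers implementation, and the function names are illustrative. In transformers these two settings only come into play when sampling is enabled (do_sample=True).

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the surviving tokens; all others are dropped.
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # toy values, not model output
probs = softmax_with_temperature(logits, temperature=0.7)
print([round(p, 3) for p in probs])
print(top_p_filter(probs, top_p=0.7))
```

A temperature below 1 sharpens the distribution, and a lower top_p then trims the unlikely tail, so both settings push generation toward the most probable tokens.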

In [83]:
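def inference(x):
  device = "cuda:0"

  # Build the prompt for row x of `data`.
  prompt = f"""
          <human>: {data.loc[x, 'Prompt']}
          <assistant>:
          """.strip()

  # Truncate by characters to bound run time, and keep the truncated string
  # so that it is exactly what the model sees.
  prompt = prompt[:2048]

  encoding = tokenizer(prompt, return_tensors="pt").to(device)
  with torch.inference_mode():
    outputs = model.generate(
        input_ids = encoding.input_ids,
        attention_mask = encoding.attention_mask,
        generation_config = generation_config
    )

  # generate() returns the prompt tokens followed by the new tokens. Decode
  # only the new ones instead of splitting on the prompt text, which raises
  # an IndexError whenever the decoded text does not contain the exact prompt
  # (for example, after the prompt has been truncated).
  new_tokens = outputs[0][encoding.input_ids.shape[1]:]
  response = tokenizer.decode(new_tokens, skip_special_tokens=True)

  return response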

In [87]:
def custom_inference(text):
  device = "cuda:0"

  text = "Please describe the novel concepts presented in this work at the collegiate level: " + text

  prompt = f"""
          <human>: {text}
          <assistant>:
          """.strip()

  # Truncate by characters to bound run time, and keep the truncated string
  # so that it is exactly what the model sees.
  prompt = prompt[:2048]

  encoding = tokenizer(prompt, return_tensors="pt").to(device)
  with torch.inference_mode():
    outputs = model.generate(
        input_ids = encoding.input_ids,
        attention_mask = encoding.attention_mask,
        generation_config = generation_config
    )

  # Decode only the newly generated tokens (everything after the prompt),
  # rather than splitting the decoded text on the prompt string.
  new_tokens = outputs[0][encoding.input_ids.shape[1]:]
  response = tokenizer.decode(new_tokens, skip_special_tokens=True)

  return response
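
When a full output sequence is decoded, it contains the prompt followed by the reply. One way to separate the two without relying on an exact prompt match is to split on the `<assistant>:` tag from the prompt template. The sketch below works on plain strings; extract_reply is a hypothetical helper, and cutting at a following `<human>:` marker is my assumption about how a runaway generation might look, not something observed in this notebook.

```python
def extract_reply(decoded, tag="<assistant>:", next_turn="<human>:"):
    """Return the text after the first assistant tag, or None if the tag is absent."""
    _, sep, tail = decoded.partition(tag)
    if not sep:
        return None  # tag missing, e.g. because the prompt was cut short
    # Drop anything the model wrote after starting a new turn.
    reply, _, _ = tail.partition(next_turn)
    return reply.strip()


print(extract_reply("<human>: What is shown?\n<assistant>: A new coupling reaction."))
print(extract_reply("text without any tag"))
```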

### Explore the Data

In [54]:
data

| Unnamed: 0 | DocID | Title | Abstract | BodyText | Prompt |
|---|---|---|---|---|---|
| 2000 | S0014305719320518 | Effect of bio-based components on the chemical... | It seems to be obvious that conditions changes... | The most of the non-hydrogen bonded carbonyl g... | Please describe the novel concepts presented i... |
| 2001 | S0014305719322396 | UV irradiation of Cu-based complexes with alip... | The effect UV irradiation on Cu(II)-based comp... | Apart from these intact complex species, free ... | Please describe the novel concepts presented i... |
| 2002 | S0014480015002646 | Identification of inflammatory factor TNFα inh... | The inflammatory response is one of the first ... | Here we established the mouse model for AO her... | Please describe the novel concepts presented i... |
| 2003 | S0014480016302878 | TB-IRIS: Proteomic analysis of in vitro PBMC r... | Paradoxical tuberculosis-associated immune rec... | By contrast, our data suggest that non-IRIS-gr... | Please describe the novel concepts presented i... |
| 2004 | S0014480017306470 | Shallow whole genome sequencing for robust cop... | Pathology archives with linked clinical data a... | In our hands, we have not had much success in ... | Please describe the novel concepts presented i... |
| ... | ... | ... | ... | ... | ... |
| 2995 | S0022073618303364 | Improved acute haemodynamic response to cardia... | Background: The recently developed quadripolar... | The results observed in this study should be t... | Please describe the novel concepts presented i... |
| 2996 | S0022096513002476 | The puzzling difficulty of tool innovation: Wh... | Tool innovation-designing and making novel too... | All participants were tested by a female exper... | Please describe the novel concepts presented i... |
| 2997 | S002209651300252X | Causal knowledge and the development of induct... | We explored the development of sensitivity to ... | There was a significant linear trend (p < .000... | Please describe the novel concepts presented i... |
| 2998 | S0022096514000496 | Selective effects of explanation on learning d... | Two studies examined the specificity of effect... | The data suggest that the benefits of explanat... | Please describe the novel concepts presented i... |


In [55]:
#data['Abstract']
data['Title']

2000    Effect of bio-based components on the chemical...
2001    UV irradiation of Cu-based complexes with alip...
2002    Identification of inflammatory factor TNFα inh...
2003    TB-IRIS: Proteomic analysis of in vitro PBMC r...
2004    Shallow whole genome sequencing for robust cop...
                              ...                        
2995    Improved acute haemodynamic response to cardia...
2996    The puzzling difficulty of tool innovation: Wh...
2997    Causal knowledge and the development of induct...
2998    Selective effects of explanation on learning d...
2999    Regret and adaptive decision making in young c...
Name: Title, Length: 1000, dtype: object

In [56]:
#Pick a paper you like (ideally one that is biology/chemistry/medicine related).
paper_num = 2001
print('Title')
print(data.loc[paper_num, "Title"])
print('Abstract')
print(data.loc[paper_num, "Abstract"])

Title
UV irradiation of Cu-based complexes with aliphatic amine ligands as used in living radical polymerization
Abstract
The effect UV irradiation on Cu(II)-based complexes with aliphatic amine ligands is investigated. Four aliphatic amines are used as ligands and Cu(II)Br2 as the metal source for the formation of catalyst complexes that can be used for the photoinduced Cu-RDRP of methyl acrylate. Different characterization techniques such as transient electronic absorption spectroscopy (TEAS), ultraviolet–visible (UV–Vis) spectroscopy, electrospray ionization time of flight mass spectrometry (ESI-ToF-MS) and cyclic voltammetry (CV) are applied in order to provide insights into the catalyst behaviour upon photo-irradiation. The excited-state dynamics, the electrochemical behaviour of the Cu(II)/Cu(I) redox couples and the detection of different species upon complexation of the ligand to the metal center (before and after UV irradiation) are further depicted in the quality of the obtai

## Inference

In [85]:
# Paper index 2001 is chosen because it is chemistry related. Feel free to change it, though the best results will come from biology/chemistry/medicine papers. Valid choices range from 2000 to 2999.
#Tested working: 2001, 2024, 2025, 2030
#Tested not working: 2000, 2002, 2003, 2023
paper_num = 2001

model_response = inference(paper_num)
print('Title')
print(data.loc[paper_num, "Title"])
print('Abstract')
print(data.loc[paper_num, "Abstract"])
print("Model Summary")
print(model_response)

Title
UV irradiation of Cu-based complexes with aliphatic amine ligands as used in living radical polymerization
Abstract
The effect UV irradiation on Cu(II)-based complexes with aliphatic amine ligands is investigated. Four aliphatic amines are used as ligands and Cu(II)Br2 as the metal source for the formation of catalyst complexes that can be used for the photoinduced Cu-RDRP of methyl acrylate. Different characterization techniques such as transient electronic absorption spectroscopy (TEAS), ultraviolet–visible (UV–Vis) spectroscopy, electrospray ionization time of flight mass spectrometry (ESI-ToF-MS) and cyclic voltammetry (CV) are applied in order to provide insights into the catalyst behaviour upon photo-irradiation. The excited-state dynamics, the electrochemical behaviour of the Cu(II)/Cu(I) redox couples and the detection of different species upon complexation of the ligand to the metal center (before and after UV irradiation) are further depicted in the quality of the obtai
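
Since some paper indices work and others do not, it can be convenient to try several at once and collect the failures instead of stopping at the first error. The sketch below shows one way to do that. The name run_many is illustrative, and fake_inference is a hypothetical stand-in so the example runs without the model; in the notebook you would pass inference instead.

```python
def run_many(infer, paper_ids):
    """Call infer(paper_id) for each id, recording failures instead of stopping."""
    results, failures = {}, {}
    for paper_id in paper_ids:
        try:
            results[paper_id] = infer(paper_id)
        except Exception as exc:  # broad on purpose: we only want a report
            failures[paper_id] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_inference(paper_id):
    # Hypothetical stand-in for inference(); fails on even ids for the demo.
    if paper_id % 2 == 0:
        raise IndexError("list index out of range")
    return f"summary for {paper_id}"


results, failures = run_many(fake_inference, [2000, 2001, 2002])
print(results)
print(failures)
```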

## Custom Inference

If you would like to run the model on your own text, set it in the cell below. The prompt is cut to 2,048 characters to keep run time manageable, so given this (fairly small) limit I would suggest sticking to abstracts from journals.

Example: https://pubs.acs.org/doi/10.1021/acs.joc.3c00948
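
Before running your own text, you can check whether it fits in the 2048-character slice applied to the prompt. The sketch below rebuilds the prompt with the same template as custom_inference and reports how many characters would be cut; build_prompt and chars_over_budget are names introduced here for illustration. Because the slice removes characters from the end, the closing `<assistant>:` tag is the first thing lost when the text is too long.

```python
PREPEND_TEXT = "Please describe the novel concepts presented in this work at the collegiate level: "
MAX_PROMPT_CHARS = 2048  # the character limit applied to the prompt


def build_prompt(text):
    # Same template as custom_inference(), including the indented second line.
    return f"""
          <human>: {PREPEND_TEXT + text}
          <assistant>:
          """.strip()


def chars_over_budget(text):
    """How many characters of the prompt would be cut off (0 if it fits)."""
    return max(0, len(build_prompt(text)) - MAX_PROMPT_CHARS)


short = "A chemoselective Pd-mediated carbonylative Negishi-type protocol is reported."
long = "x" * 3000  # toy stand-in for an overlong abstract
print(chars_over_budget(short))
print(chars_over_budget(long))
```

A result of 0 means the whole prompt, tag included, reaches the model.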

In [90]:
my_text = "A chemoselective Pd-mediated carbonylative Negishi-type catalytic protocol for the synthesis of (hetero)aryl ketones is reported. The protocol employs the PEPPSI-IPr precatalyst and CO gas at atmospheric pressure (balloon) to foster the carbonylative coupling between diverse C(sp3)-hybridized organozinc reagents and a broad range of aryl iodides, including substrates carrying aldehyde, aniline, phenol, or carboxylic acid groups, and heteroaryls."

In [91]:
custom_response = custom_inference(my_text)
custom_response

' The novel carbonylative coupling reaction between organozinc reagents and aryl iodides is described. The reaction is performed under atmospheric pressure using a balloon to facilitate the coupling reaction. The reaction is applicable to a broad range of aryl iodides'