## Relevant Research:

## Embeddings
- Word2Vec: [link](https://www.tensorflow.org/tutorials/text/word2vec) an embedding system that trains a shallow neural network and uses its learned weights as the word representations.
- GloVe: [link](https://edumunozsala.github.io/BlogEms/jupyter/nlp/classification/embeddings/python/2020/08/15/Intro_NLP_WordEmbeddings_Classification.html#Word-Embeddings,-GloVe-and-Text-classification) an embedding that uses the co-occurrence statistics of words in a corpus to create vectors.
- Hugging Face MTEB: [link](https://huggingface.co/blog/mteb#:~:text=Models%20by%20average%20English%20MTEB%20score%20%28y%29%20vs,context%20awareness%20resulting%20in%20low%20average%20MTEB%20scores.) Article introducing the Massive Text Embedding Benchmark (MTEB), which compares embedding models to show which are the most accurate and which are the fastest. Might be a good place to look for embedding alternatives.

## LLMs
- LangChain: [link](https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a) a framework used to create LLM-powered apps; has well-documented uses for question answering and text summarization.
- BLOOM: [link](https://huggingface.co/bigscience/bloom#:~:text=BLOOM%20is%20an%20autoregressive%20Large%20Language%20Model%20%28LLM%29%2C,is%20hardly%20distinguishable%20from%20text%20written%20by%20humans.) One of the largest open-access models that can be run offline, comparable in kind to LLMs like ChatGPT. Circumvents the security risks of online services, but will be much slower and more space-consuming to run in-house, and may be less accurate.

- Falcon LLM: [link](https://falconllm.tii.ae/)

In [None]:
!pip install -U sentence_transformers

In [None]:
!pip install faiss-cpu



In [None]:
!pip install gradio

In [None]:
#Imports
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import time
from scipy.special import softmax
import math, csv
import faiss
import nltk
from nltk import tokenize
import gradio as gr

In [None]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Embedding comparison algorithm

In [None]:
def most_relevent_paper_test(query, papers, model):
  """
  Given an input query, a list of paper titles, and an embedding model, returns the
  title most related to the query, the softmaxed similarities for all titles, the
  error, and the accuracy. Reads the expected answer from the global exp_out dictionary.
  """
  #helper function, gives cosine sim of given two inputs
  def cosine_helper(input1, input2):
    return cosine_similarity([input1], [input2])

  #vector for input user prompt
  query_vec = model.encode(query)

  max_paper = None
  max_sim = -1
  max_ind = -1

  sim_list = []

  #finds cosine similarity between each paper and initial prompt, stores each similarity, and tracks paper with highest similarity
  for ind, paper in enumerate(papers):
    paper_vect = model.encode(paper)
    similarity = cosine_helper(query_vec, paper_vect)[0][0]
    sim_list.append(similarity)
    if similarity > max_sim:
      max_sim = similarity
      max_paper = paper
      max_ind = ind

  #apply softmax to accentuate difference between highs/lows
  sim_list = softmax(sim_list)

  #error: negative log of the softmax probability given to the top-ranked title
  #(a log-loss style score, not a mean squared error)
  exp_out_ind = exp_out[query]
  error = -math.log(sim_list[max_ind])

  #accuracy: 1 if the top-ranked title is the expected one, otherwise 0
  accuracy = 1 if max_ind == exp_out_ind else 0

  return max_paper, sim_list, error, accuracy
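
A note on reading the error values: cosine similarities sit in [-1, 1], so a plain softmax over 24 of them stays close to uniform, and the per-prompt error stays close to ln(24) ≈ 3.18 (about 25.4 summed over 8 prompts). The sketch below uses made-up similarity values to illustrate this. The `temperature` parameter is a hypothetical addition, not part of the function above, included only to show how the scores could be spread out.

```python
import numpy as np

def softmax(x, temperature=1.0):
    """Softmax with an optional temperature (hypothetical parameter, not in the notebook)."""
    z = np.asarray(x, dtype=float) / temperature
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

# Made-up cosine similarities for 24 titles: one clear match, the rest weak.
sims = np.full(24, 0.2)
sims[3] = 0.8
best = int(np.argmax(sims))

error = -np.log(softmax(sims)[best])                          # same formula as above
uniform_error = np.log(len(sims))                             # error if all titles tied
sharp_error = -np.log(softmax(sims, temperature=0.05)[best])  # assumed temperature value

print(f"error: {error:.3f}, uniform baseline: {uniform_error:.3f}, sharpened: {sharp_error:.5f}")
```

Even with a clear winner, the plain error stays within a fraction of a unit of the uniform baseline, so small differences between models on this metric should be read with caution.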

In [None]:
model_names = ['flax-sentence-embeddings/stackoverflow_mpnet-base', 'paraphrase-MiniLM-L6-v2',
               'paraphrase-MiniLM-L12-v2', 'sentence-transformers/average_word_embeddings_glove.6B.300d',
               'sentence-transformers/allenai-specter', 'sentence-transformers/LaBSE',
               'sentence-transformers/all-mpnet-base-v2']
#'sentence-transformers/sentence-t5-xl' taken out of testing for being disruptively slow (>100s)

model_stats = {}
input_stats = {x: 0 for x in exp_out}
for name in model_names:
  model = SentenceTransformer(name)
  #loop variable is named prompt, not input, so the built-in input() is not shadowed
  for prompt in exp_out.keys():
    start = time.time()
    max_paper, sim_list, error, acc = most_relevent_paper_test(prompt, article_titles, model)
    time_elapsed = time.time() - start
    if name in model_stats:
      model_stats[name][0] += error
      model_stats[name][1] += time_elapsed
      model_stats[name][2] += acc
    else: model_stats[name] = [error, time_elapsed, acc]
    if acc: input_stats[prompt] += 1

    print(f"expected: {article_titles[exp_out[prompt]]}")
    print(f"actual: {max_paper}")
    print(f"{time_elapsed:.2f}s for model {name} with error of {error:.4f}.")
    print(f"Accuracy: {acc}.")
    print()

In [None]:
#Article titles which the embeddings will compare
#(run this cell before the comparison cells above, which use article_titles and exp_out)
article_titles = ["Biochar effects on soil biota – A review",
"New approaches to measuring biochar density and porosity",
"Impact of biochar amendments on the quality of a typical Midwestern agricultural soil",
"Sustainable biochar to mitigate global climate change",
"Properties of biochar",
"Predicting water retention of biochar-amended soil from independent measurements of biochar and soil properties",
"Rapid Simulation of Decade-Scale Charcoal Aging in Soil: Changes in Physicochemical Properties and Their Environmental Implications",
"Anion exchange capacity of biochar",
"Impacts of fresh and aged biochars on plant available water and water use efficiency",
"An overview of the effect of pyrolysis process parameters on biochar stability",
"Impact of Pyrolysis Temperature and Feedstock on Surface Charge and Functional Group Chemistry of Biochars",
"Does biochar improve soil water retention? A systematic review and metaanalysis",
"Determination of polycyclic aromatic hydrocarbons in biochar and biochar amended soil",
"Effect of bentonite as a soil amendment on field water-holding capacity, and millet photosynthesis and grain quality",
"Effect of biochar and biochar particle size on plant-available water of sand, silt loam, and clay soil",
"Effects of Bentonite, Hydrogel and Biochar Amendments on Soil Hydraulic Properties from Saturation to Oven Dryness",
"An emerging environmental concern: Biochar-induced dust emissions and their potentially toxic properties",
"Environmental contextualisation of potential toxic elements and polycyclic aromatic hydrocarbons in biochar",
"Hydrogen production by methane decomposition: Origin of the catalytic activity of carbon materials",
"Experimental analysis of direct thermal methane cracking",
"Integrated Modeling of U.S. Agricultural Soil Emissions of Reactive Nitrogen and Associated Impacts on Air Pollution, Health, and Climate",
"Measurement of soil water characteristic curve using HYPROP2",
"Methane Pyrolysis for Zero-Emission Hydrogen Production: A Potential Bridge Technology from Fossil Fuels to a Renewable and Sustainable Hydrogen Economy",
"Maximizing the number of oxygen-containing functional groups on activated carbon by using ammonium persulfate and improving the temperature-programmed desorption characterization of carbon surface chemistry",
                  ]

#dictionary where each key is a prompt for the embeddings, and the value is the
#index of the title in article_titles which is the expected best response
exp_out = {"What role could biochar play in combating global warming?":3,
           "Studies of biochar in Indiana agriculture": 2, "how would aging affect biochars water holding capacity?": 8,
           "tell me about anion exchange in biochar": 7, "Why is biochar used to maintain plant fauna and bacterial diversity?": 0,
           "statistics on biochar porosity": 1,"how does the HYPROP measure soil characteristics?": 21,
           "can biochar dust be harmful?": 16}

## Analytics

In [None]:
least_error_order = sorted(model_stats.items(), key=lambda item: item[1][0])
fastest_order = sorted(model_stats.items(), key=lambda item: item[1][1])
most_accurate = sorted(model_stats.items(), key= lambda item: item[1][2], reverse=True)

In [None]:
for x in least_error_order: print(f"{x[0]} had an error of {x[1][0]:.4f}")

paraphrase-MiniLM-L12-v2 had an error of 22.0339
paraphrase-MiniLM-L6-v2 had an error of 22.3875
sentence-transformers/all-mpnet-base-v2 had an error of 22.7314
flax-sentence-embeddings/stackoverflow_mpnet-base had an error of 23.1311
sentence-transformers/LaBSE had an error of 23.2598
sentence-transformers/average_word_embeddings_glove.6B.300d had an error of 23.3292
sentence-transformers/allenai-specter had an error of 24.3138


In [None]:
for x in fastest_order: print(f"{x[0]} ran for {x[1][1]:.2f}s")

sentence-transformers/average_word_embeddings_glove.6B.300d ran for 0.20s
paraphrase-MiniLM-L6-v2 ran for 6.14s
paraphrase-MiniLM-L12-v2 ran for 9.68s
flax-sentence-embeddings/stackoverflow_mpnet-base ran for 26.41s
sentence-transformers/allenai-specter ran for 26.52s
sentence-transformers/all-mpnet-base-v2 ran for 27.15s
sentence-transformers/LaBSE ran for 30.46s


In [None]:
for x in most_accurate: print(f"{x[0]} had an accuracy of {x[1][2]}/{len(exp_out)}")

sentence-transformers/allenai-specter had an accuracy of 8/8
sentence-transformers/all-mpnet-base-v2 had an accuracy of 8/8
flax-sentence-embeddings/stackoverflow_mpnet-base had an accuracy of 7/8
paraphrase-MiniLM-L6-v2 had an accuracy of 7/8
paraphrase-MiniLM-L12-v2 had an accuracy of 7/8
sentence-transformers/LaBSE had an accuracy of 6/8
sentence-transformers/average_word_embeddings_glove.6B.300d had an accuracy of 5/8


mpnet (all-mpnet-base-v2) had high accuracy on my tests (8/8) and on MTEB (linked above).
Speed is less important if the vectors are calculated once and then stored in a database.
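
A minimal sketch of the "calculate once, then store" idea, assuming a plain `.npy` file stands in for the database. The random vectors, the file name `doc_vectors.npy`, and the helper `top_match` are placeholders for illustration; in practice the matrix would come from `model.encode(article_titles)`.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)

# Placeholder for precomputed document embeddings (24 titles, 768 dimensions).
doc_vectors = rng.normal(size=(24, 768)).astype("float32")

# Save once...
path = os.path.join(tempfile.mkdtemp(), "doc_vectors.npy")
np.save(path, doc_vectors)

# ...and reload at query time instead of re-encoding every document.
cached = np.load(path)

def top_match(query_vec, matrix):
    """Index of the row with the highest cosine similarity to query_vec."""
    matrix_n = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_n = query_vec / np.linalg.norm(query_vec)
    return int(np.argmax(matrix_n @ query_n))

# Placeholder query: a slightly perturbed copy of document 5.
query_vec = doc_vectors[5] + 0.01 * rng.normal(size=768).astype("float32")
best = top_match(query_vec, cached)
print(best)
```

With the document vectors cached, only the query has to be encoded per request, which is why the slower models may still be acceptable.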

In [None]:
#loop variable is named prompt, not input, so the built-in input() is not shadowed
for prompt, stat in input_stats.items():
  print(f"Prompt \"{prompt}\" was predicted correctly by {stat}/{len(model_names)} of the embeddings")

Prompt "What role could biochar play in combating global warming?" was predicted correctly by 7/7 of the embeddings
Prompt "Studies of biochar in Indiana agriculture" was predicted correctly by 5/7 of the embeddings
Prompt "how would aging affect biochars water holding capacity?" was predicted correctly by 7/7 of the embeddings
Prompt "tell me about anion exchange in biochar" was predicted correctly by 7/7 of the embeddings
Prompt "Why is biochar used to maintain plant fauna and bacterial diversity?" was predicted correctly by 3/7 of the embeddings
Prompt "statistics on biochar porosity" was predicted correctly by 7/7 of the embeddings
Prompt "how does the HYPROP measure soil characteristics?" was predicted correctly by 6/7 of the embeddings
Prompt "can biochar dust be harmful?" was predicted correctly by 6/7 of the embeddings


## Testing maximum input size/speed of different models

In [None]:
for name in model_names:
  model = SentenceTransformer(name)
  start = time.time()
  for i in range(0,100000,10000):
    model.encode("what. " * i + "what.")
  run_time = time.time() - start
  print(f"{name}: {run_time}")

flax-sentence-embeddings/stackoverflow_mpnet-base: 23.376994848251343
paraphrase-MiniLM-L6-v2: 3.8585009574890137
paraphrase-MiniLM-L12-v2: 7.216721296310425
sentence-transformers/average_word_embeddings_glove.6B.300d: 0.39156532287597656
sentence-transformers/allenai-specter: 22.589027881622314
sentence-transformers/LaBSE: 12.17437744140625
sentence-transformers/all-mpnet-base-v2: 16.77246904373169
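
The loop above reports one total time per model. A possible variation, sketched below, records the time for each input length separately, which makes it easier to see where the run time stops growing. `time_by_length` and `fake_encode` are hypothetical names; `fake_encode` is only a stand-in so the sketch runs without downloading a model, and `model.encode` would be passed in its place.

```python
import time

def time_by_length(encode, lengths, token="what. "):
    """Return {repeat_count: seconds} for encoding inputs of increasing size."""
    timings = {}
    for n in lengths:
        text = token * n + "what."
        start = time.perf_counter()
        encode(text)
        timings[n] = time.perf_counter() - start
    return timings

# Hypothetical stand-in for model.encode.
def fake_encode(text):
    return [len(word) for word in text.split()]

timings = time_by_length(fake_encode, range(0, 100000, 10000))
for n, seconds in timings.items():
    print(f"{n:>6} repeats: {seconds:.4f}s")
```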


## Full article embedding

In [None]:
df = pd.read_csv("academic_texts_google_search.csv")
#df = df.rename(columns={"file name":"title"})
#df = df.drop("Unnamed: 0", axis=1)
df = df.drop_duplicates()

In [None]:
df.head()

Unnamed: 0,title,text
0,Soil Systems | Free Full-Text | The 3R Princip...,The 3R Principles for Applying Biochar to Impr...
1,\r\n\tBiochar as a Soil Ameliorant: How Biocha...,"|\n[1]\n||\nAbel, S., Peters, A., Trinks, S., ..."
2,Biochar: An emerging soil amendment - Soil Health,Biochar: An emerging soil amendment\nProper us...
3,Biochar References Articles Books,|\n|\n|\n|\n|\nTerra Preta de Indio\n|\nBiocha...
4,Assessing the Potential of Using Biochar as a ...,Abstract\nBiochar is a product of pyrolysis of...


In [None]:
def embed_papers(df, model):
  embedding_dict = {}
  for ind, row in df.iterrows():
    embedding_dict[row["title"]] = model.encode(row["text"])

  return embedding_dict

In [None]:
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
embedding_dict = embed_papers(df, model)

In [None]:
def most_relevent_paper(query, papers, embeddings, model):
  """
  Given a query string, a pandas df of papers (with "title" and "text" columns), a
  dictionary mapping each title to its precomputed embedding, and the model used to
  build those embeddings, returns the rows of the three papers most related to the
  query, as well as the softmaxed similarity values for all papers.
  """
  #helper function, gives cosine sim of given two inputs
  def cosine_helper(input1, input2):
    return cosine_similarity([input1], [input2])

  #vector for input user prompt
  query_vec = model.encode(query)

  sim_list = []
  title_list = []

  #finds cosine similarity between the prompt and each paper's precomputed embedding
  for _, row in papers.iterrows():
    paper_vect = embeddings[row["title"]]
    similarity = cosine_helper(query_vec, paper_vect)[0][0]
    sim_list.append(similarity)
    title_list.append(row["title"])

  #indices of the three highest similarities, best first
  ranked = sorted(((value, index) for index, value in enumerate(sim_list)), reverse=True)
  top_indices = [index for _, index in ranked][0:3]
  titles = [title_list[i] for i in top_indices]

  #apply softmax to accentuate difference between highs/lows
  sim_list = softmax(sim_list)

  #filter papers itself rather than the global df; rows come back in df order, not ranked order
  max_rows = papers.loc[papers["title"].isin(titles)]

  return max_rows, sim_list
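
The top-three selection can also be written with `np.argsort`, which keeps the results in ranked order. This is a small sketch with made-up similarity values, and `top_k_indices` is a hypothetical helper name rather than part of the notebook.

```python
import numpy as np

def top_k_indices(similarities, k=3):
    """Indices of the k largest similarities, best first."""
    sims = np.asarray(similarities, dtype=float)
    order = np.argsort(-sims)  # negate so the sort is descending
    return order[:k].tolist()

# Made-up similarities for five papers.
sims = [0.12, 0.80, 0.45, 0.77, 0.05]
top = top_k_indices(sims, k=3)
print(top)
```

Indexing a DataFrame with these positions via `papers.iloc[top]` would return the rows best match first, if that ordering matters downstream.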

In [None]:
user_input = input("Prompt?: ")

max_papers, sim_list = most_relevent_paper(user_input, df, embedding_dict, model)

print("Chosen articles:")
#strip stray whitespace characters left over in the scraped titles
titles = [x.replace("\r","").replace("\t","").replace("\n","") for x in max_papers["title"].to_list()]
for title in titles: print(title)

## Analyze text in smaller chunks

In [None]:
#Node class to be used for easy tracking of parent/child data.
#parent = article title
#children = individual sentences/text chunks within an article
class Node:
  def __init__(self, data, parent, children):
    self.data = data
    self.parent = parent
    self.children = children

  def __len__(self):
    return len(self.children)

In [None]:
#extract texts as pandas dataframe
df = pd.read_excel("Knowledge center data.xlsx")
df.head()

Unnamed: 0.1,Unnamed: 0,file name,text
0,0,An overview of the effect of pyrolysis process...,Contents lists available at ScienceDirect Bior...
1,1,Anion exchange capacity of biochar-c5gc00828j.pdf,"Green Chemistry PAPER Cite this: Green Chem. ,..."
2,2,Biochar effects on soil biota e A review-1-s2....,Review Biochar effects on soil biota eA review...
3,3,Does biochar improve soil water retention A sy...,Contents lists available at ScienceDirect Geod...
4,4,Impact of biochar amendments on the quality of...,Impact of biochar amendments on the quality of...


In [None]:
#creates Node data structure for articles. Parent (article) nodes go, in order, into
#the papers list, and individual sentence nodes go into the sentences list.
#currently the connectivity of the nodes is mostly redundant, but it may come in handy
#in the future for finding relationships between a given sentence and others in the
#same article
papers = []
sentences = []
for _, row in df.iterrows():
  title = Node(row["file name"], None, None)
  text = tokenize.sent_tokenize(row["text"])
  fixed_text = []

  #the tokenizer treats the period in "Fig." as a sentence end, so a fragment ending
  #in "Fig." is rejoined with the fragment that follows it
  i = 0
  while i < len(text):
    if text[i].endswith("Fig."):
      x = " ".join(text[i:i+2])
      fixed_text.append(x)
      i += 2
    else:
      fixed_text.append(text[i])
      i += 1

  children = []
  for par in fixed_text:
    child = Node(par, title, None)
    children.append(child)

  title.children = children
  papers.append(title)
  sentences += children
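
To make the "Fig." handling concrete, here is a small standalone sketch of the same merge step on a made-up list of fragments. `merge_fig_splits` is a hypothetical helper name, and the fragments are invented for illustration rather than taken from the dataset.

```python
def merge_fig_splits(fragments):
    """Rejoin fragments that were split right after the abbreviation 'Fig.'."""
    merged = []
    i = 0
    while i < len(fragments):
        if fragments[i].endswith("Fig.") and i + 1 < len(fragments):
            merged.append(fragments[i] + " " + fragments[i + 1])
            i += 2
        else:
            merged.append(fragments[i])
            i += 1
    return merged

# Invented tokenizer output, for illustration only.
fragments = [
    "Density rose with temperature, as shown in Fig.",
    "2.",
    "Porosity followed the same trend.",
]
fixed = merge_fig_splits(fragments)
for sentence in fixed:
    print(sentence)
```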

In [None]:
#Calculates embedding vectors for all sentences, L2-normalizes them, saves them to a
#csv file, and stores them in a faiss index. On normalized vectors, L2 distance
#ranks results in the same order as cosine similarity.
def embed_papers(sent_list, model):
  texts = [sent.data for sent in sent_list]
  vectors = model.encode(texts)
  faiss.normalize_L2(vectors) #normalizes in place
  vector_dim = vectors.shape[1]
  index = faiss.IndexFlatL2(vector_dim)
  #newline="" avoids blank rows between records on Windows
  with open("vectors.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(vectors)
  index.add(vectors)
  return index
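
Why normalizing before an L2 index works: for unit-length vectors a and b, the squared L2 distance equals 2 - 2·cos(a, b), so the nearest neighbours by L2 distance are also the most similar by cosine. The sketch below checks the identity with NumPy on random vectors, used here purely as placeholders for sentence embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random placeholder vectors, scaled to unit length.
a = rng.normal(size=8)
b = rng.normal(size=8)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = float(a @ b)
squared_l2 = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(squared_l2, 2 - 2 * cosine)
```

Since `IndexFlatL2` reports squared L2 distances, a returned distance d on normalized vectors can be converted back to a cosine similarity as 1 - d / 2.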

In [None]:
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

index = embed_papers(sentences, model)

In [None]:
def most_relevent_paper(query, index, model):
  """
  Modified version of the previous most_relevent_paper(). Takes a query string, a faiss
  index, and a model. Returns the top k sentence Nodes that match the query, looked up
  in the global sentences list (which must be in the same order as the index).
  """
  k = 10 #number of sentences to be returned
  #vector for input user prompt
  query_vec = np.array([model.encode(query)])

  faiss.normalize_L2(query_vec)
  distances, ann = index.search(query_vec, k=k)

  return [sentences[i] for i in ann[0]]

In [None]:
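#results are stored under a new name: reusing "sentences" would overwrite the list
#that most_relevent_paper() indexes into and break any later query
results = most_relevent_paper("how does the pyrolysis method influence envelope and skeletal density of biochar",index, model)

In [None]:
for sent in results: print(f"{sent.parent.data}: {sent.data}")

In [None]:
with open("sentences.csv", "w", newline="") as f:
  writer = csv.writer(f)
  for sent in results:
    writer.writerow([sent.parent.data, sent.data])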