RAG System Using Llama2 With Hugging Face

In [1]:
%pip install faiss-cpu

Note: you may need to restart the kernel to use updated packages.


In [34]:
import time

import faiss
import nltk
import numpy as np
import pandas as pd
import spacy
import tqdm
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer
from tqdm import trange
from transformers import LlamaTokenizer, LlamaForCausalLM

# Fetch the NLTK stop-word list (skipped if it is already cached locally)
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jacob\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [24]:
model = SentenceTransformer('all-mpnet-base-v2')

# General Functions

In [26]:
nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))

In [27]:
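def preprocess_text(text):
    """Lowercase and lemmatize `text`, drop stop words, and tag negated words.

    A negation token (dependency label 'neg') is replaced by 'not_<head lemma>';
    the head word itself is still added when its own token is processed.
    """
    # Lowercase before parsing
    text = text.lower()
    # Run the spaCy pipeline (tokenization, dependency parse, lemmatization)
    doc = nlp(text)
    tokens = []

    for token in doc:
        if token.dep_ == 'neg':
            # Attach the negation to the lemma of the word it modifies
            tokens.append('not_' + token.head.lemma_)
        elif token.lemma_.lower() not in stop_words and token.is_alpha:
            # Keep alphabetic, non-stop-word lemmas
            tokens.append(token.lemma_)
    return tokens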

# Load Data

In [35]:
df = pd.read_csv('../results/final_data.csv')

combined_data = df.apply(lambda row: f"{row['trope_name']}: {row['trope_description']}", axis=1)


In [36]:
combined_descriptions = combined_data.tolist()
trope_names = df['trope_name'].tolist()
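# Parse the stored embedding strings without executing arbitrary code.
# Assumes each 'd_embedding' cell holds a plain Python list literal.
import ast
# FAISS works on float32 vectors, so cast explicitly
embeddings = np.array(df['d_embedding'].apply(ast.literal_eval).tolist(), dtype='float32')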

In [37]:
# Create and add to a FAISS index
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
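
`IndexFlatIP` ranks results by raw inner product, which matches cosine similarity only when the stored vectors and the query are all unit length. The sketch below is a plain-Python illustration on made-up 2-D vectors (not the trope embeddings, and the helper names are hypothetical) of how a long but poorly aligned vector can outrank a well-aligned one until everything is L2-normalized. Whether the vectors in `d_embedding` are already normalized is not stated here, so treat this as a check worth running rather than a confirmed problem.

```python
import math

def inner_product(a, b):
    """Raw inner product, the score an inner-product index ranks by."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]

def rank_by_inner_product(query, vectors):
    """Return vector indices sorted from highest to lowest inner product."""
    scores = [inner_product(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)

# Toy data (hypothetical): vector 0 points the same way as the query,
# vector 1 is longer but less well aligned.
toy_vectors = [[1.0, 0.0], [3.0, 4.0]]
toy_query = [1.0, 0.0]

raw_ranking = rank_by_inner_product(toy_query, toy_vectors)
unit_ranking = rank_by_inner_product(
    l2_normalize(toy_query), [l2_normalize(v) for v in toy_vectors]
)
print("raw inner product ranking:", raw_ranking)
print("normalized ranking:", unit_ranking)
```

With FAISS itself, `faiss.normalize_L2` applies the same normalization in place to a float32 array, and it would need to be applied to both the indexed embeddings and each query embedding.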

In [38]:
def retrieve_trope_and_generate(query, k=5):
    """Return the top-k trope names and descriptions for `query`.

    Despite the name, this function only performs retrieval; no text is generated here.
    """
    # Preprocess and encode the query the same way as the stored embeddings
    query_clean = preprocess_text(query)
    query_clean_flattened = ' '.join(query_clean)
    query_embedding = model.encode([query_clean_flattened], convert_to_numpy=True)

    # Search the FAISS index; with IndexFlatIP the returned values are
    # inner-product scores (higher means more similar), not distances
    scores, indices = index.search(query_embedding, k)

    # Retrieve the top-k relevant tropes and descriptions
    relevant_tropes = [trope_names[i] for i in indices[0]]
    relevant_descriptions = [combined_descriptions[i] for i in indices[0]]

    return relevant_tropes, relevant_descriptions
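
The function above stops at retrieval. As a rough sketch of how the retrieved descriptions could be handed to a language model afterwards, the helper below assembles them into a single prompt string. `build_prompt`, its `max_chars_per_description` parameter, and the prompt wording are all assumptions made for illustration; they are not part of this notebook and are not a Llama 2 chat template. The example passage is made up, and the two descriptions are shortened from the retrieval outputs shown in this notebook.

```python
def build_prompt(query, retrieved_descriptions, max_chars_per_description=500):
    """Assemble a plain-text prompt from a passage and retrieved trope descriptions."""
    context_lines = []
    for rank, description in enumerate(retrieved_descriptions, start=1):
        # Truncate long descriptions to keep the prompt short
        context_lines.append(f"{rank}. {description[:max_chars_per_description]}")
    context = "\n".join(context_lines)
    return (
        "Candidate tropes:\n"
        f"{context}\n\n"
        f"Passage: {query.strip()}\n"
        "Which candidate trope best matches the passage? Answer with the trope name."
    )

# Hypothetical passage; descriptions shortened from the notebook's own outputs
example_prompt = build_prompt(
    "A student doubts the chores his teacher assigns.",
    [
        "Wax On Wax Off: An odd form of training passed off by an unorthodox "
        "master on a skeptical student.",
        "This Is My Boomstick: In a God Guise or Time Travel scenario, a modern "
        "person with some technological convenience uses it to try and impress "
        "the more primitive locals.",
    ],
)
print(example_prompt)
```

In this notebook the second argument would be the `relevant_descriptions` list returned by `retrieve_trope_and_generate`; the truncation limit is arbitrary and would need tuning against the model's context window.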

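In [40]:
# Start the timer in the same cell as the call so the measurement
# excludes any delay between running separate cells
start_time = time.time()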
trope_results, descriptions = retrieve_trope_and_generate(" Technology in this world is a bit more advanced than our own; hologram projectors are small and cheap enough to be handed out with magazines à la CD-ROM demos and in an omake, Mt. Lady mentions 8K television.note  This is eventually subverted as culturally, it's played straight, but only because Midoriya notes at one point that when Quirks first appeared, human culture was thrown into such an uproar that culture and technology regressed. He says that if Quirks hadn't appeared, humans would be taking interstellar holidays at that point in history. It's also confirmed to have been at least eight or nine generations since Quirks first developed, which is an unspecified amount of time in the future from the present day.")
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed Time: {elapsed_time} seconds")
for trope, desc in zip(trope_results, descriptions):
    print(f"Trope: {trope}\nDescription: {desc}")

Elapsed Time: 0.1846780776977539 seconds
Trope: This Is My Boomstick
Description: This Is My Boomstick: ['', '', '', '', 'In a God Guise or Time Travel scenario, a modern person with some technological convenience uses it to try and impress the more primitive locals. Guns and cigarette lighters are common versions, with Polaroid cameras and portable radios not far behind. Japanese fiction has a particularly ludicrous variant, where a medieval Europe analogue is introduced to such "unbelievable technologies" as fold-forged sabers and short-form writing systems . See also Convenient Eclipse for another way to impress the locals.', 'It\'s almost never played straight anymore . If the time travelermodern technology to the locals, it\'s Giving Radio to the Romans . If it\'s done with contemporary music, it\'s A Little Something We Call "Rock and Roll" . If the time traveler uses the technology to pretend to have outright supernatural powers, they\'re a Sham Supernatural . If the natives buy

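In [42]:
# Start the timer in the same cell as the call so the measurement
# excludes any delay between running separate cells
start_time = time.time()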
trope_results, descriptions = retrieve_trope_and_generate("At two points in the story, U.A agrees to have joint training sessions with Shiketsu and Ketsubutsu, but we never get to see this actually happening or if it even happened to begin with.")
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed Time: {elapsed_time} seconds")
for trope, desc in zip(trope_results, descriptions):
    print(f"Trope: {trope}\nDescription: {desc}")

Elapsed Time: 0.10310506820678711 seconds
Trope: Wax On Wax Off
Description: Wax On Wax Off: ['', '', "&#010;:  don't change or remove without starting a new thread.&#010;&#010;", '', "An odd form of training passed off by an unorthodox master on a skeptical student. Sometimes comes disguised as a set of chores, but just as often is a general exercise that promotes a valuable physical or mental attribute in a strange way. Always dismissed as a waste of time early on, but appreciated later . Often, this also serves as a lesson to the skeptical student to trust the master and do all the crazy things the master asks without questioning, by demonstrating that the master really knows what he's doing and is in fact effectively teaching the student.", 'May be time-compressed in a Training Montage or Hard-Work Montage . This is an integral part of Improvised Training , due to the low cost involved.', "It's commonly subverted or parodied when a mildly Genre Savvy hero initially assumes he is re