## Chaining an LLM with PCA to visualize the characters and traits in Shakespeare's text through correspondence analysis

### Thought process:

## Steps:
1- **Choose a dataset:** Shakespeare. I look at the data and inspect the text.

2- **Preprocess and clean:** remove extra characters such as escape characters.
    - `***` markers separate the header and footer from the body, so I use them to throw away the unnecessary text.
    - Removed extra spaces.
    
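3- **Extract relevant text:** NER-style step: collect the passages that mention each character (the code below uses a simple name match and truncates the result to 3500 characters).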
    
    
4- **LLM request:** assign a score (0 to 10) to each trait of that character.
    - I run a for loop that makes one API call per character, get the response, and inspect it.
    - After each inspection I add more instructions to the prompt and modify it to get better results.
    - I printed the response attributes step by step until I ended up with `response.json['choices'][0]['message']['content']`.
    - Converted the returned string to a dict.
    
    
5- **PCA:** create a principal component analysis plot using the top 2 components (on a 2-D plane).
    - I played around with the traits, adding and removing some to see how the plot changes.
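
The cleaning in step 2 can be sketched as below. This is a minimal illustration, not the notebook's exact code: the helper name `clean_corpus` and the sample string are hypothetical, and picking the longest `***`-delimited segment is an assumption that replaces the hard-coded index used later in the notebook.

```python
import re

def clean_corpus(raw: str, marker: str = "***") -> str:
    """Keep the main body of a marker-delimited text and tidy its whitespace."""
    # Assumption: the body is the longest piece between the '***' markers;
    # headers, footers, and license text are much shorter.
    segments = raw.split(marker)
    body = max(segments, key=len)
    # Collapse runs of spaces or tabs into a single space.
    body = re.sub(r"[ \t]+", " ", body)
    # Collapse three or more consecutive newlines into one blank line.
    body = re.sub(r"\n{3,}", "\n\n", body)
    return body.strip()

# Hypothetical sample that mimics the header / body / footer layout.
sample = (
    "*** START OF HEADER ***\n"
    "HAMLET.   To be,  or not to be.\n\n\n\nExit.\n"
    "*** END ***\nlicense text"
)
print(clean_corpus(sample))
```

Unlike a single `replace("  ", " ")`, the regular expression collapses runs of any length in one pass.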


I want to use an LLM to take the corpus and extract the sections relevant to a character, for example to output the text related to Hamlet. I will state in the prompt that this output is used later.
Next, I'll use the output of this LLM as the input to a second LLM. In LLM 2, I want the agent to take the relevant information about a character and, given a list of traits, output a score for each trait, for example:

{"indecisiveness": 10, "ambitious": 7, "innocence": 1}
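
The two-step chain described above can be sketched with plain functions. Everything here is illustrative: `parse_scores`, `score_character`, `fake_extract`, and `fake_score` are hypothetical names, and the two stubs stand in for the real extraction step and the real LLM call. The sketch also assumes the model may wrap its JSON in extra text, so it pulls out the first `{...}` block before parsing.

```python
import json
import re

TRAITS = ["indecisiveness", "ambitious", "innocence"]

def parse_scores(reply, traits):
    """Extract the JSON object from an LLM reply and keep only the known traits."""
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))
    # Missing traits default to 0; scores are clamped to the 0..10 scale.
    return {t: max(0, min(10, int(data.get(t, 0)))) for t in traits}

def score_character(corpus, character, extract, score, traits):
    """Chain the two steps: extract passages, then score them."""
    passages = extract(corpus, character)        # step 1: relevant text
    reply = score(passages, character, traits)   # step 2: trait scores as text
    return parse_scores(reply, traits)

# Stub for step 1: keep sentence-like chunks that mention the character.
def fake_extract(corpus, character):
    return " ".join(s for s in corpus.split(".") if character in s)

# Stub for step 2: a canned reply in place of a real LLM call.
def fake_score(passages, character, traits):
    return 'Here you go:\n{"indecisiveness": 10, "ambitious": 7, "innocence": 1}'

result = score_character(
    "Hamlet hesitates. Macbeth plots.", "Hamlet", fake_extract, fake_score, TRAITS
)
print(result)
```

Passing the two steps in as callables keeps the chain testable without an API token; the real calls can be swapped in later.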

Then I feed the characters, together with their trait scores, into a PCA to map them all onto a Cartesian plane and compare and analyze them.
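
To make the mapping step concrete, here is a small sketch of PCA done with NumPy's SVD instead of scikit-learn. The score matrix is made up for illustration (the first row reuses the example above), the character labels are placeholders, and `pca_2d` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical scores, for illustration only (rows: characters, columns: traits).
characters = ["char_a", "char_b", "char_c", "char_d"]
traits = ["indecisiveness", "ambitious", "innocence"]
scores = np.array(
    [
        [10, 7, 1],
        [2, 9, 0],
        [3, 1, 9],
        [5, 5, 5],
    ],
    dtype=float,
)

def pca_2d(X):
    """Project rows of X onto their first two principal components."""
    centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ Vt[:2].T          # character coordinates on the plane
    explained = (S ** 2) / (S ** 2).sum() # share of variance per component
    return coords, Vt[:2], explained[:2]

coords, components, explained = pca_2d(scores)
for name, (x, y) in zip(characters, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
print("explained variance ratio:", explained.round(3))
```

Each row of `components` holds one loading per trait, which is what the trait markers in the plot at the end of the notebook represent. The sign of each axis is arbitrary, so a mirrored plot carries the same information.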

Why I chose only a few characters: I searched for the most famous characters in Shakespeare's works and, for simplicity, kept just a handful of them (the `characters` list in the code).

In [None]:
%pip install sacremoses==0.0.53

In [None]:
%pip install chromadb==0.4.21 tiktoken==0.5.2 sqlalchemy==2.0.15 faiss-cpu==1.7.4 langchain==0.0.352 mlflow==2.9.2 databricks-genai-inference

In [None]:
%pip install -U langchain-community 

In [None]:
dbutils.library.restartPython()

In [None]:
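import os
# A variable exported in a %sh cell exists only in that shell subprocess,
# so set the token in the Python process instead.
os.environ["DATABRICKS_TOKEN"] = "<My_token>"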

In [None]:
#test llm
from langchain.chat_models import ChatDatabricks
from langchain.embeddings import DatabricksEmbeddings

embeddings = DatabricksEmbeddings(endpoint="databricks-bge-large-en") # Use to generate embeddings
llm = ChatDatabricks(endpoint="databricks-meta-llama-3-70b-instruct", temperature=0.1) # used to query chat models such as Llama 3 70B

query = "What is the difference between a cat and a dog?"

embedding = embeddings.embed_query(query)
response = llm.invoke(query)

print(embedding[:5])
print(response.content)

In [None]:
#load dataset
from langchain.document_loaders import GutenbergLoader
from langchain.text_splitter import CharacterTextSplitter

full_text = GutenbergLoader("https://www.gutenberg.org/cache/epub/100/pg100.txt").load() # All of Shakespeare 5967830

print(f"{len(full_text[0].page_content)} characters in the doc") # print the length, not the whole text

In [None]:
#clean up
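# '***' markers separate the header and footer from the body; the piece at index 2 is the body
content = full_text[0].page_content.split('***')
# replace() makes a single pass, so each double space becomes one space (longer runs are only shortened)
cleaned_corpus = content[2].replace("  ", " ")#.replace("\n", " ")
# cleaned_corpus is what we need in our next step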
print(cleaned_corpus)

In [None]:
#extract relevant info
def extract_relevant_info(text, character):
  sentences = text.split(".") # split the whole text on periods, giving sentence-like chunks
  relevant_sentences = [sentence for sentence in sentences if character in sentence]
  # keep a chunk only if the character's name appears in it
  return " ".join(relevant_sentences)[0:3500] # cap at 3500 characters to stay well within the 4096-token limit

print(len(extract_relevant_info(cleaned_corpus, "Macbeth")))
characters = ["Hamlet", "Romeo", "Juliet", "Othello", "Macbeth"] # len =[11640, 7152, 6026, 8149]

In [None]:
#extract scale 0 to 10 using LLM
from databricks_genai_inference import ChatCompletion
import json
import pandas as pd


traits = ["indecisiveness", "innocence", "jealous", "ambitious", "beauty"]
df = pd.DataFrame(columns=['Character'] + traits)
# loop over the characters and for each Character create a list of traits with their scores
for char in characters:
  prompt = f"""
           Consider the character {char} from the collected works of Shakespeare, and provide scores on a scale of 0 to 10 
           for the following personality traits:
           "indecisiveness", "innocence", "jealous", "ambitious", "beauty".
           Give scores based on {char}'s dialogs, actions, and descriptions. Use only the content provided to you and do not make things up.
           Return only the JSON output, without any additional explanatory text. Pay close attention to the dialogs, actions, and descriptions and do your best in assigning scores to each trait. If there is neither an implicit nor an explicit mention of a specific trait for a character, and you cannot infer it, then assign 0 to that trait.
           Try not to assign similar scores for the same trait to different characters. I want to feed your output into a principal component analysis and map all traits and characters onto a single Cartesian plane. So assign the scores in a way that avoids a messy diagram as much as possible.
           Here is an example of desired output:
           {{
             "{char}":{{
             "indecisiveness": 7,
             "innocence": 9, 
             "jealous": 1, 
             "ambitious": 5, 
             "beauty":2
           }}
           }}
           """
  response = ChatCompletion.create(
            model="databricks-meta-llama-3-70b-instruct", #DBx serving endpoint
            messages = [           
                        {"role": "system", "content": prompt},
                        {"role": "user", "content": extract_relevant_info(cleaned_corpus, char)}],
            temperature = 0.2,
            max_tokens=4096
            )
  #print(response.json['choices'][0]['message']['content'])# type: string
  print("<<<<<<<<<<<<<<<>>>>>>>>>>>>")
  # convert string to dictionary to be able to work on it 
  response_dict = json.loads(response.json['choices'][0]['message']['content'])
  #print(response_dict, type(response_dict))
  scores = {trait: response_dict[char][trait] for trait in traits}
  scores['Character'] = char
  # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
  df = pd.concat([df, pd.DataFrame([scores])], ignore_index=True)

display(df)

In [None]:
print(dir(response.json))

In [None]:
#PCA. Principal component Analysis 
import matplotlib.pyplot as plt  
from sklearn.decomposition import PCA  

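df.set_index('Character', inplace=True) # use character names as row labels so only the trait columns go into PCA
pca = PCA(n_components = 2)
pca_result = pca.fit_transform(df.astype(float)) # one row per character, one column per principal component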

In [None]:
# create Visualization

plt.figure(figsize=(12, 8)) 
# plot characters
for i, character in enumerate(df.index):
  plt.scatter(pca_result[i, 0], pca_result[i, 1], color='green', s=100, edgecolors='black')
  plt.text(pca_result[i, 0], pca_result[i, 1], character)
# plot traits
for i, trait in enumerate(df.columns):
  plt.scatter(pca.components_[0, i], pca.components_[1, i], color='cyan', marker='^')
  plt.text(pca.components_[0, i], pca.components_[1, i], trait, fontsize=9, ha='left') 
  # ha stands for "horizontal alignment." It controls the alignment of the text relative to the coordinates provided.
plt.grid(True)
plt.title("PCA of Shakespeare's characters and traits")
plt.axhline(0, color='black', linewidth=0.5)
plt.axvline(0, color='black', linewidth=0.5)
plt.show()
#  ["indecisiveness", "innocence", "jealous", "ambitious", "beauty"]