# Integration of LangChain & Phoenix

[Working notebook](https://colab.research.google.com/drive/15YcUJ82BHqjxO3JaqiMbXf6ymG6UxpyE?usp=sharing)


## Results
After trying a couple of different ways to use LangChain to build a RAG system over Shakespeare, I settled on OpenAI as the core LLM for this exercise. Smaller local models (Bloomz-7b, Falcon-7b) were not doing well.

My first plan had been to get Phoenix Eval working well enough to show the difference in performance between LLM options, since the beauty of these chain structures is the ability to swap pieces out.

### Here's what the end result consists of
* Ask GPT-4 to write 20 questions about Shakespeare books
* Ask GPT-4 to answer its own questions
* Download and parse the collection
* Build embeddings into ChromaDB
* For each question
    * Generate answer through the Chat chain
    * Pull the set of retrievals used to create the answer
* Use Phoenix to assess the quality of the retrievals
* Create output report from Eval
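
The "for each question" step above can be sketched roughly as follows. This is a minimal outline and not the notebook's actual code: `QARecord`, `collect_records`, and `stub_chain` are names I made up for illustration, and the stub stands in for the real Chat chain so the example runs without any API keys.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class QARecord:
    """One evaluation row: a question, its reference answer, and what the chain produced."""
    question: str
    reference_answer: str
    generated_answer: str = ""
    retrievals: List[str] = field(default_factory=list)


# Hypothetical signature: takes a question, returns (answer, retrieved chunks).
AnswerFn = Callable[[str], Tuple[str, List[str]]]


def collect_records(qa_pairs: Iterable[Tuple[str, str]], answer_fn: AnswerFn) -> List[QARecord]:
    """Run every question through the chain and keep the retrievals next to the answer."""
    records = []
    for question, reference in qa_pairs:
        answer, retrieved = answer_fn(question)
        records.append(QARecord(question, reference, answer, list(retrieved)))
    return records


def stub_chain(question: str) -> Tuple[str, List[str]]:
    """Stand-in for the real Chat chain (assumption: it returns answer plus chunks)."""
    return f"stub answer to: {question}", ["chunk A", "chunk B"]


records = collect_records([("Who kills Hamlet's father?", "Claudius")], stub_chain)
print(records[0].generated_answer, records[0].retrievals)
```

Keeping the retrievals on the same record as the answer is what makes the later evaluation step straightforward, since each row already carries everything the evaluator needs.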

There were definitely some challenges getting this all hooked together. I intentionally avoided looking at your example notebooks until I got stuck. To be fair, a lot of what I was doing would have been easier with the tutorials you have online, at least for getting Phoenix itself running, if less so for making the Chat system function correctly.

### Ideas on how to make the process easier and more effective

* How can we integrate Eval into LangChain directly?
* Conda environment.yml on startup / install ...
    * not just our package, everything needed to get operating in a clean environment
* <b>Is there a better process to help the users generate the truth?</b> -- this is the biggest challenge I see right now.
* <b>Can we automate the prototyping using LLMs?</b>  -- and this is the biggest potential opportunity
    * i.e. point Phoenix at the data and we do the rest 
    * load, chain, construct challenges & truths, then test against several models
 
### How to get the truth for Eval?
The obvious answer is for a person to invest the time in labeling their own data. However, this is a potential impediment to adoption. The prospect of hand labeling 10s or 100s of items is just enough to potentially put a user into the "maybe later" mindset.

The approach I decided to play around with is auto-generating a truth.

This is similar to a project (tiny startup) that I kicked off in October 2021. The idea was for us to massively accelerate the process of generating labeled image data for training with Mask R-CNN to strengthen SOTA image recognition models. All the user would have to do is tell us what the next set of images was about, then draw a bounding box. Our software would then create the outlined image of the most important thing in the box. You can see the project here if you are interested:
[Scuba Steve Rapid Image Annotation](https://github.com/Gclabbe/scuba_steve)

How does that apply? Maybe we can pre-truth the customer's data through LLM assessment, then generate an evaluation result. This will certainly not be fully correct; however, the results would feed an interactive "do you agree with this truth?" session to help them label their data. If it's a large dataset, sample 10%, then re-generate until the errors are dropping and they have confidence.
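
A rough sketch of that sampling step, using only the standard library. The function names, the `"relevant"` label, and the `agrees` callback (which would be the human clicking yes or no) are all assumptions for illustration; nothing here is an existing Phoenix feature.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple


def sample_for_review(labels: Dict[str, str], fraction: float = 0.10,
                      seed: Optional[int] = None) -> List[Tuple[str, str]]:
    """Pick a random fraction of (item_id, generated_label) pairs for a human to check."""
    items = sorted(labels.items())  # stable order so a seed gives a repeatable sample
    if not items:
        return []
    k = max(1, round(len(items) * fraction))
    return random.Random(seed).sample(items, k)


def disagreement_rate(sample: List[Tuple[str, str]],
                      agrees: Callable[[str, str], bool]) -> float:
    """Share of sampled labels the reviewer rejected; re-generate while this stays high."""
    if not sample:
        return 0.0
    misses = sum(1 for item_id, label in sample if not agrees(item_id, label))
    return misses / len(sample)


# Toy data: 50 auto-generated labels and a reviewer who agrees with all of them.
labels = {f"q{i}": "relevant" for i in range(50)}
sample = sample_for_review(labels, fraction=0.10, seed=0)
rate = disagreement_rate(sample, lambda item_id, label: label == "relevant")
print(len(sample), rate)
```

Tracking the disagreement rate per round is one simple way to decide when "the errors are dropping" enough to stop.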

For this test of the idea, I took the answer that originally came from GPT-4 and the retrievals from the Chat completion, and asked GPT-4 whether the retrievals were good. It returned either an answer or a phrase indicating that the retrievals were not good, and I accepted that as the truth.
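
Something like the following captures the shape of that step. The prompt wording, the `NO_SUPPORT` sentinel phrase, and the `"relevant"` / `"irrelevant"` labels are placeholders I picked for this sketch, not the exact text used in the notebook, and `ask_llm` is a hypothetical callable wrapping whatever GPT-4 client is in use.

```python
from typing import Callable, List

# Hypothetical sentinel the model is asked to return when the retrievals do not help.
NO_SUPPORT = "RETRIEVALS NOT GOOD"


def build_truth_prompt(question: str, answer: str, retrievals: List[str]) -> str:
    """Assemble one prompt holding the question, reference answer, and numbered chunks."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(retrievals, start=1))
    return (
        "You are checking retrieved passages for a question-answering system.\n"
        f"Question: {question}\n"
        f"Reference answer: {answer}\n"
        f"Retrieved passages:\n{context}\n"
        "If the passages support the reference answer, restate the answer. "
        f"Otherwise reply exactly: {NO_SUPPORT}"
    )


def generated_truth(question: str, answer: str, retrievals: List[str],
                    ask_llm: Callable[[str], str]) -> str:
    """Map the model's free-text reply onto a binary label used as the truth."""
    reply = ask_llm(build_truth_prompt(question, answer, retrievals))
    return "irrelevant" if NO_SUPPORT in reply.upper() else "relevant"


# Stubbed model replies so the sketch runs offline.
good = generated_truth("Who kills Hamlet's father?", "Claudius", ["chunk A"], lambda p: "Claudius")
bad = generated_truth("Who kills Hamlet's father?", "Claudius", ["chunk A"], lambda p: "Retrievals not good")
print(good, bad)
```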

Next, feed it into the process for RAG assessment in Phoenix.

Since it's basically GPT-4 against GPT-4, the premise was that there would be high accuracy ... there wasn't.

I didn't feel like reading a lot of Shakespeare last night, so I didn't go further. The stage is set, though. At this point it would be trivial to loop through the items and ask whether we prefer the Generated Truth or the Phoenix assessment. The former means a true problem was found; the latter means we update the truth to match the Phoenix result.
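
That loop might look roughly like this. It is a sketch under assumptions: `Item`, `reconcile`, and the `prefer_generated` callback (the human's choice) are hypothetical names, and the labels are plain strings rather than whatever Phoenix actually emits.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    question: str
    generated_truth: str  # label from the auto-generated truth
    phoenix_label: str    # label from the Phoenix assessment


def reconcile(items: List[Item], prefer_generated: Callable[[Item], bool]) -> List[Item]:
    """Walk the disagreements; return true problems and update the truth for the rest."""
    true_problems = []
    for item in items:
        if item.generated_truth == item.phoenix_label:
            continue  # both sides agree, nothing to review
        if prefer_generated(item):
            true_problems.append(item)  # truth stands, so the evaluation missed
        else:
            item.generated_truth = item.phoenix_label  # truth was wrong, adopt Phoenix's label
    return true_problems


items = [
    Item("q1", "relevant", "relevant"),
    Item("q2", "relevant", "irrelevant"),
    Item("q3", "irrelevant", "relevant"),
]
# Stand-in for the reviewer: sides with the generated truth only on q2.
problems = reconcile(items, lambda item: item.question == "q2")
print([p.question for p in problems], items[2].generated_truth)
```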

### And this leads directly towards the idea of automating the process
Can we just ask the user to point us at a portion of their data, massage the data into a data chat system, prompt them with questions if needed to understand the goals of the model, assemble the chain, generate a truth and fine-tune the truth?

And take that further (mentioned below): once we have a stronger set of truth labels, loop across key parameters to check chunk size, overlap, number of chunks, LLM options, etc.
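
As a sketch of what that loop could look like, here is a plain grid built with `itertools.product`. The parameter names, the candidate values, and the toy scoring function are all placeholders; in practice the score would come from re-running the chain and the evaluation against the truth labels.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Iterator, List


def parameter_grid(chunk_sizes: Iterable[int], overlaps: Iterable[int],
                   top_ks: Iterable[int], llms: Iterable[str]) -> Iterator[Dict]:
    """Yield every combination, skipping ones where overlap is not smaller than the chunk."""
    for size, overlap, k, llm in product(chunk_sizes, overlaps, top_ks, llms):
        if overlap >= size:
            continue
        yield {"chunk_size": size, "chunk_overlap": overlap, "top_k": k, "llm": llm}


def best_config(configs: List[Dict], score_fn: Callable[[Dict], float]) -> Dict:
    """Return the configuration with the highest score."""
    return max(configs, key=score_fn)


# Placeholder values; "other-llm" is a made-up name standing in for any second model.
grid = list(parameter_grid([256, 512], [0, 64, 256], [2, 4], ["gpt-4", "other-llm"]))


def toy_score(cfg: Dict) -> float:
    """Meaningless stand-in score so the example runs; replace with a real evaluation."""
    return cfg["chunk_size"] + cfg["chunk_overlap"] + cfg["top_k"] + (1 if cfg["llm"] == "gpt-4" else 0)


winner = best_config(grid, toy_score)
print(len(grid), winner)
```

A full grid gets expensive quickly, which is where an optimizer that samples the space instead of enumerating it would come in.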

Basically, make it very hard to walk away from Phoenix as we try to fully automate their investigation and help them get to production, at which point Arize swoops in as the production observability system to monitor their new app.

# Lots of other ideas started popping out
Let me know if you'd like to talk through any of these live.  There's more bouncing around.

# Advancement concepts
* Look into matching the output style of RoBERTa for entailment
    * supporting, irrelevant, not-supporting 
* Explore auto-tuning ... become the Optuna of LLM projects
    * Maybe integration of Optuna directly???
    * Model parameters
    * Key components like chunk size & overlap
    * Randomization of chunks
    * Prompt style suggestions
* How to on <insert Cloud structure here>
    * Google Cloud
    * AWS Bedrock
    * Azure <whatever>
    * WhiteSpace
    * Kaggle contest integration

# Ideas on improvements to the metrics interface on Arize

## Simplify the interface
* I get disoriented every time I'm in there (maybe weekly)
* Too many options for detail views + dropdowns to build out a view
* Limits on the amount of data uploaded for prompts and responses need to be fixed
* Automated clustering and grouping
    * When there are obvious outliers, don't make me play a video game to isolate them
    * Easy clustering
    * Then, just send a sample from the different sets and let <foundation model> tell me what is different
* Multi-lingual options -- easier than ever today

## Autogen summaries using foundation model NLG
"customer X has shown a recent change in metric Y"  
"the new model release has changed accuracy by X%, however, precision seems to be down"

This can probably be done through NLG templating:
* Use the foundation model to generate templates that match the customer's language / features / model names, etc
* Fill in the blanks of the template
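
The fill-in-the-blanks half is easy to sketch with the standard library's `string.Template`. The two templates below are just the example sentences from above with placeholders added; the field names and values (`customer`, `metric`, `delta`, "Acme") are made up, and in the real idea the foundation model would generate the templates themselves.

```python
from string import Template

# Templates a foundation model might generate once per customer (assumed wording).
TEMPLATES = {
    "metric_change": Template(
        "customer $customer has shown a recent change in metric $metric"
    ),
    "release_change": Template(
        "the new model release has changed accuracy by $delta%, "
        "however, precision seems to be down"
    ),
}


def render(kind: str, **fields: str) -> str:
    """Fill one template; substitute() raises KeyError if a field is missing."""
    return TEMPLATES[kind].substitute(**fields)


summary = render("metric_change", customer="Acme", metric="recall")
release = render("release_change", delta="2.5")
print(summary)
print(release)
```

Using `substitute` rather than `safe_substitute` means a missing field fails loudly instead of shipping a summary with a raw placeholder in it.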

## Automated monitoring
* Beyond dashboarding with so many things to click
* Simplify the entire view to a more modern HTML5-type interface
* Block style model / customer / metrics with summary info and click-to-view details

## Integrated Phoenix Evaluation on RLHF
* @ Company-V, we were sending RLHF opinions and comments
    * Give value to RLHF component
    * Downgrade negative sentiment if comment is more positive (reduce bias)
    * On large population sets, search for bias (e.g. a particular customer "always rejects the probability")
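
The bias search in the last bullet could start as simply as a per-customer rejection rate. This is a sketch under assumptions: the `(customer, accepted)` feedback shape, the 0.9 threshold, and the minimum event count are all invented for illustration and would need tuning against real feedback data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rejection_stats(feedback: List[Tuple[str, bool]]) -> Dict[str, Tuple[float, int]]:
    """Map each customer to (rejection rate, number of feedback events)."""
    counts = defaultdict(lambda: [0, 0])  # customer -> [rejections, total]
    for customer, accepted in feedback:
        counts[customer][1] += 1
        if not accepted:
            counts[customer][0] += 1
    return {c: (rej / total, total) for c, (rej, total) in counts.items()}


def flag_habitual_rejectors(feedback: List[Tuple[str, bool]],
                            threshold: float = 0.9, min_events: int = 5) -> List[str]:
    """Customers who reject nearly everything, ignoring those with too little data."""
    stats = rejection_stats(feedback)
    return sorted(c for c, (rate, total) in stats.items()
                  if total >= min_events and rate >= threshold)


# Toy feedback: cust-a rejects everything, cust-b is mixed, cust-c has too few events.
feedback = (
    [("cust-a", False)] * 6
    + [("cust-b", True)] * 4 + [("cust-b", False)] * 2
    + [("cust-c", False)] * 2
)
flagged = flag_habitual_rejectors(feedback)
print(flagged)
```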
