# Chunky Monkey Experiment Notebook

This notebook is a testing ground for the concepts behind chunky monkey, a data science tool for finding an optimal chunking strategy for a given corpus. Ultimately, chunky monkey should be integrated with the RAG Experiment Accelerator.

The overall intended process is as follows:

1. Identify common topics in the corpus of data
2. Search for "Topic density" with a sliding window through each document
3. Dynamically assign chunks based on maximum topic density

The hypothesis is that by maximising topic density, relevance scores within a RAG system can be boosted.

## What we'll cover:

- Import some sample data
- Create a topic model using [BERTopic](https://maartengr.github.io/BERTopic/index.html)
- Extract all the topic terms
- Submit the topic terms to OpenAI to get a sensible topic label
- Scan through the document, and for each topic, calculate a topic density over a sliding window
- Select the optimal chunks based on the L2 norm of the topic densities across topics

## Future work
The aim is to integrate this work with the [RAG Experiment Accelerator](https://github.com/microsoft/rag-experiment-accelerator) so that it can be used as a chunking strategy there, and its effect on the overall efficacy of a RAG system can be measured.

Let's go!

## Environment Setup
We'll start by taking a look at your environment and configuring it to use the optimal GPU setup. If you don't have a GPU, don't worry, it'll just use the CPU.


In [None]:
import torch
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

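has_gpu = torch.cuda.is_available()
# is_available() also checks that this machine can actually use MPS, not just that PyTorch was built with it
has_mac_gpu = torch.backends.mps.is_available()
# PyTorch expects the device string "cuda" (not "gpu") for NVIDIA GPUs
device = "mps" if has_mac_gpu else "cuda" if has_gpu else "cpu"

print("NVIDIA/CUDA GPU is", "AVAILABLE" if has_gpu else "NOT AVAILABLE")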
print("MPS (Apple Metal) is", "AVAILABLE" if has_mac_gpu else "NOT AVAILABLE")
print(f"Target device is {device}")

## Import data
We'll begin by importing a list of news articles from the scikit-learn package. Because a large number of documents can cover a huge number of topics, it's recommended to subset the list for testing and learning purposes; here we use 2500 documents. We've created a helper function to bring in the data.

In [None]:
from topic.modelling import create_and_fit_topic_model
from data_processing.example import create_example_input_docs
from topic.labels import extract_all_topics, label_topics
from topic.processing import combined_densities

# Restrict to 2500 documents for testing
docs = create_example_input_docs(2500)

display(docs[:5])

## Create our topic model

What we're aiming to do here is understand what the themes (topics) within our corpus are. At a high level we're looking to separate the documents into groups that share a common theme.

Topic modelling is beyond the scope of this library, but we'll give a brief outline of the approach here. The intention is to scan our corpus and algorithmically determine what the common themes are. There are a number of well-established techniques to do this. We've chosen to use the [BERTopic](https://maartengr.github.io/BERTopic/index.html) package here as it is simple, fairly sophisticated and well documented.

At a high level this works by applying the following steps:
- Embed each document
- Reduce the number of dimensions
- Cluster the documents
- Tokenise the documents
- Apply a word weighting scheme

What's great about the package is that it's designed as plug and play, so that you can use your algorithm of choice at each step.
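
To make the last step more concrete, here is a minimal sketch of a class-based TF-IDF weighting, which is the word weighting scheme BERTopic's documentation describes as its default. It is pure Python for illustration only: the function name `class_tfidf` and the toy clusters are assumptions, and the library's real implementation differs in detail.

```python
import math
from collections import Counter


def class_tfidf(clusters):
    """Weight terms per cluster as normalised term frequency x log(1 + A / f_t)."""
    # Term counts per cluster (each cluster is treated as one long document)
    counts = {
        cid: Counter(word for doc in docs for word in doc.lower().split())
        for cid, docs in clusters.items()
    }
    # f_t: frequency of each term across all clusters
    total = Counter()
    for cluster_counts in counts.values():
        total.update(cluster_counts)
    # A: average number of words per cluster
    avg_words = sum(total.values()) / len(counts)

    weights = {}
    for cid, cluster_counts in counts.items():
        n_words = sum(cluster_counts.values())
        weights[cid] = {
            term: (n / n_words) * math.log(1 + avg_words / total[term])
            for term, n in cluster_counts.items()
        }
    return weights


# Toy clusters, standing in for the output of the clustering step
clusters = {
    0: ["the rocket launch was delayed", "nasa plans a new rocket"],
    1: ["the team won the hockey game", "hockey season starts soon"],
}
weights = class_tfidf(clusters)
top_terms = {
    cid: sorted(w, key=w.get, reverse=True)[:3] for cid, w in weights.items()
}
print(top_terms)
```

Terms that are frequent in one cluster but rare elsewhere get the highest weights, which is why they work well as the defining words of a topic.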

<div style="text-align: center;">

![alt text](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.svg "BERTopic components")

</div>


In [None]:
# For 2500 docs this should take less than a minute
kb_topic_model, topics, probs = create_and_fit_topic_model(docs)

### Topic Modelling
BERTopic comes with some handy analytics to support topic modelling, including visualisations that can help you evaluate your topics. Some examples are shown below.

In [None]:
kb_topic_model.visualize_barchart()

In [None]:
kb_topic_model.visualize_topics()

## Extracting and Labelling the Topics
Now we want to take all of the topics and give them a name. Most topic modelling techniques provide an ID and a set of words that define the topic. We want to make that ID meaningful to a human. First we extract all of the topic terms, then we label them by constructing a prompt and submitting it to GPT-3.5, asking it to come up with a two-word label for the topic. We combine the labels and terms into a dictionary that we can then use to calculate "densities" across potential chunks of our input documents.
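
As a rough illustration of the labelling step, the sketch below builds a prompt from a topic's terms and collects the labels into a dictionary. It is not the code in `topic.labels`: the names `build_label_prompt`, `label_topics_with` and `fake_llm`, the prompt wording and the dictionary layout are all assumptions, and the model call is replaced by a stand-in so the example runs offline.

```python
def build_label_prompt(terms, max_terms=10):
    """Build a prompt asking an LLM for a two-word label for one topic."""
    term_list = ", ".join(terms[:max_terms])
    return (
        "The following terms describe a single topic found in a corpus: "
        f"{term_list}.\n"
        "Reply with a two word label for this topic and nothing else."
    )


def label_topics_with(extracted, ask_llm):
    """Map each topic ID to its terms and the label returned by `ask_llm`."""
    return {
        topic_id: {
            "label": ask_llm(build_label_prompt(terms)).strip(),
            "terms": terms,
        }
        for topic_id, terms in extracted.items()
    }


# Hypothetical stand-in for a real chat completion call
def fake_llm(prompt):
    return "Space Exploration" if "rocket" in prompt else "Ice Hockey"


example = label_topics_with(
    {0: ["rocket", "nasa", "orbit"], 1: ["hockey", "team", "season"]},
    fake_llm,
)
print(example)
```

Passing the model call in as a function keeps the prompt logic testable without network access; one call is made per topic, so the number of topics drives the number of requests.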

In [None]:
# This should have a small number of topics, resulting in a small number of calls to OpenAI

extracted_topics = extract_all_topics(kb_topic_model.get_topics())

labelled_topics = label_topics(extracted_topics)

Let's take a look at some of our topics...

In [None]:
from itertools import islice
first_3_items = dict(islice(labelled_topics.items(), 3))
display(first_3_items)


## Topic Density Chunking (TDC)

Now that we have our topics, let's see how densely they are represented in our documents. We create a sliding and expanding window that moves through each document, giving a density score for each topic in each potential chunk.
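
A sliding and expanding window can be sketched as a generator of candidate `(start, end)` word offsets. This is an illustration only: the function name `candidate_windows` and the parameters `min_size`, `max_size`, `size_step` and `stride` are assumptions, not the settings used by `combined_densities`.

```python
def candidate_windows(n_words, min_size=50, max_size=200, size_step=50, stride=25):
    """Yield (start, end) word offsets for a sliding, expanding window."""
    # A document shorter than the smallest window is treated as a single chunk
    if 0 < n_words < min_size:
        yield 0, n_words
        return
    for start in range(0, n_words, stride):  # slide the window start
        for size in range(min_size, max_size + 1, size_step):  # expand the window
            end = start + size
            if end > n_words:
                break  # larger sizes would also run past the end
            yield start, end


# Small numbers so the output is easy to read
windows = list(candidate_windows(10, min_size=4, max_size=8, size_step=2, stride=3))
print(windows)
```

Each start position is tried with every window size that still fits in the document, so the number of candidates grows quickly as the stride shrinks or the size range widens.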

Density has been calculated as follows:

For each topic in each candidate chunk:

1. Count the number of times a topic term appears in the chunk
2. Divide that number by the total number of words in the chunk

Take the L2 norm across all topics for each chunk and select the chunk with the highest score.

The L2 norm (the square root of the sum of squared densities) has been chosen as it minimises the impact of topics that are not "high scoring" and focuses more on topics that are well represented in the chunk. Other metrics may also make sense!
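
The calculation described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation in `topic.processing`: the function names, the regex tokeniser and the toy topics are assumptions, and the candidate windows are passed in as word offsets.

```python
import math
import re


def topic_density(words, terms):
    """Fraction of the words in a chunk that are terms of one topic."""
    if not words:
        return 0.0
    term_set = set(terms)
    return sum(1 for w in words if w in term_set) / len(words)


def l2_norm(densities):
    """Square root of the sum of squared densities."""
    return math.sqrt(sum(d * d for d in densities))


def best_chunk(text, topics, windows):
    """Return (score, start, end) for the window with the highest L2 norm."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    scored = []
    for start, end in windows:
        chunk = words[start:end]
        densities = [topic_density(chunk, terms) for terms in topics.values()]
        scored.append((l2_norm(densities), start, end))
    # On ties, max() keeps the first window with the top score
    return max(scored, key=lambda item: item[0])


text = "The rocket reached orbit. Fans cheered as the hockey team scored again and again."
topics = {"Space": ["rocket", "orbit", "nasa"], "Hockey": ["hockey", "team", "season"]}
score, start, end = best_chunk(text, topics, [(0, 4), (0, 8), (4, 12)])
print(score, start, end)

# One dense topic scores higher than the same total density spread over two topics
print(l2_norm([0.3, 0.0]) > l2_norm([0.15, 0.15]))
```

The final comparison shows the property motivating the choice of metric: a chunk concentrated on one topic outranks a chunk that mentions several topics thinly.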

> **Note**: For a large corpus, unless you (or your data scientist) have carefully curated your number of topics, you may end up needing to do a lot of computation. Remember that for each doc, for each potential chunk, for each topic you will be measuring density.  
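
To get a feel for how the work scales, the sketch below counts how many density calculations a run would need. All inputs are made up for illustration: the document lengths, the topic count and the window parameters (`min_size`, `max_size`, `size_step`, `stride`) are assumptions rather than values taken from the library.

```python
def density_evaluations(doc_lengths, n_topics, min_size=50, max_size=200,
                        size_step=50, stride=25):
    """Count density calculations: one per document, candidate window and topic."""
    total = 0
    for n_words in doc_lengths:
        n_windows = 0
        for start in range(0, n_words, stride):
            for size in range(min_size, max_size + 1, size_step):
                if start + size <= n_words:
                    n_windows += 1
        total += n_windows * n_topics
    return total


# Hypothetical corpus: 2500 documents of 300 words each, 20 topics
print(density_evaluations([300] * 2500, n_topics=20))
# Doubling the number of topics doubles the work
print(density_evaluations([300] * 2500, n_topics=40))
```

The count is linear in the number of topics and in the number of candidate windows, so reducing either is the most direct way to shorten the run.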

Let's run the code and take a look at the results.

In [10]:
# This should run in 2-3 minutes with 2500 docs, grab a coffee :D!

df_new = combined_densities(docs, labelled_topics)
df_new.head()

2024-01-09 05:23:07,447 - INFO - topic.processing - Calculating topic densities for 2500 documents


Processing docs...:   0%|          | 0/2500 [00:00<?, ?it/s]

2024-01-09 05:23:07,461 - INFO - topic.processing - Calculating topic densities for document ff11cbec-ae3c-4647-b420-3ddff7f42a37
2024-01-09 05:23:07,470 - INFO - topic.processing - Calculating topic densities for document 7a6405ed-3c33-4fa9-9c8a-ce4b734f8197
2024-01-09 05:23:07,482 - INFO - topic.processing - Calculating topic densities for document 5dc1b3ef-ccb7-4cf8-8ea5-6022ef3835c2
2024-01-09 05:23:07,514 - INFO - topic.processing - Calculating topic densities for document 67d23214-15f7-4316-8566-ee9e20e77015
2024-01-09 05:23:07,524 - INFO - topic.processing - Calculating topic densities for document 3d7c9ef6-317d-4a81-8420-45d69533c577
2024-01-09 05:23:07,525 - INFO - topic.processing - Calculating topic densities for document 35f7e2ad-ce6b-4d19-8995-9c871f7aa58b
2024-01-09 05:23:07,533 - INFO - topic.processing - Calculating topic densities for document e2a50b75-7b31-45bc-85bf-79b04b2611d9
2024-01-09 05:23:07,536 - INFO - topic.processing - Calculating topic densities for docume

### Visualisation
Now that we have our results, we might want to take a look at the distribution of `l2_norm`. Ideally this would be maximised. Perhaps we could do some hyperparameter tuning across our chunk parameters to see what works best for our corpus?

In [None]:
import seaborn as sns
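# Drop chunks where no topic terms were found
filter_data = df_new[df_new['l2_norm'] != 0]

# Pass the column explicitly as x to plot a single horizontal box
sns.boxplot(x=filter_data['l2_norm'])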

## Conclusion

So we've seen that documents in a corpus can be dynamically chunked based on the density of the topics within them. Now that we have the start and end positions for each chunk, we may want to refine them.

- *Do we want to make sure that chunks include complete sentences?*
- *What is the impact of longer vs shorter chunk lengths?*
- ...
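
As one possible take on the first question, the sketch below widens a chunk so that it starts and ends on sentence boundaries. It is only an illustration: it assumes chunk positions are character offsets, uses a deliberately naive regex to find sentences, and the name `snap_to_sentences` is hypothetical.

```python
import re


def snap_to_sentences(text, start, end):
    """Expand a (start, end) character span to the enclosing sentence boundaries."""
    # Naive sentence splitter: text up to ., ! or ?, plus trailing whitespace
    spans = [m.span() for m in re.finditer(r"[^.!?]+[.!?]*\s*", text)]
    # Move the start back to the beginning of the sentence containing it
    new_start = max((s for s, _ in spans if s <= start), default=0)
    # Move the end forward to the end of the sentence containing it
    new_end = min((e for _, e in spans if e >= end), default=len(text))
    return new_start, new_end


text = (
    "Topic models find themes. Chunks should respect sentences. "
    "Short chunks lose context."
)
s, e = snap_to_sentences(text, 30, 50)
print(text[s:e].strip())
```

Expanding rather than shrinking keeps every word of the original chunk, at the cost of a slightly lower topic density; a real implementation would likely want a proper sentence tokeniser.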

Once we've refined our chunks, we're ready to start embedding them and adding them to our vector DB of choice.