# Adjacent-clustering

Some documents are too large to fit into an LLM's context window, so we need to split them into chunks.

In this notebook, we explore "Adjacent clustering", which groups similar (embedded) sentences into clusters.

**Pros**

Adjacent clustering has some advantages over other methods:

- Keep semantic coherence between sentences (unlike LangChain's splitter)
- Control the length of the chunks (unlike spacy and nltk tokenizers)
- Keep the order of the original sentences (unlike K-means clustering)

## References
- https://learn.deeplearning.ai/langchain/lesson
- https://towardsdatascience.com/how-to-chunk-text-data-a-comparative-analysis-3858c4a0997a#:~:text=The%20Langchain%20Character%20Text%20Splitter%20works%20by%20recursively%20dividing%20the,meet%20the%20desired%20size%20criterion.

## TODO

- Use HuggingFace embeddings
- Use other vector stores

**Imports**

In [162]:
import openai
import os
from dotenv import load_dotenv, find_dotenv

import urllib.request
from langchain.document_loaders import UnstructuredHTMLLoader, BSHTMLLoader

from langchain.vectorstores import DocArrayInMemorySearch, FAISS, Chroma
from langchain.docstore.document import Document
from langchain.indexes import VectorstoreIndexCreator
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI 
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback

from PyPDF2 import PdfReader

import numpy as np

import spacy
nlp = spacy.load('en_core_web_sm')

from IPython.display import display, Markdown, HTML

In [11]:
# Load and set API key
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ['OPENAI_API_KEY']

model_name="gpt-3.5-turbo"

**Read the webpage as a LangChain Document**

In [9]:
# Download the page
webpage = 'https://en.wikipedia.org/wiki/Artificial_intelligence'
webpage_path = 'data/artificial-intelligence.html'
urllib.request.urlretrieve(webpage, webpage_path)

# Create a loader for the page
loader = BSHTMLLoader(webpage_path, open_encoding='utf-8')
docs = loader.load()
docs



In [12]:
file_path = 'data/Survey of Success Factors in Data Science Project.pdf'

def extract_text_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        pdf = PdfReader(file)
        text = " ".join(page.extract_text() for page in pdf.pages)
    return text

# Extract the text from the PDF (sentence splitting happens later)
text = extract_text_from_pdf(file_path)
text

'A survey study of success factors in data science projects\nIñigo Martinez\nVicomtech Foundation\nBasque Research and Technology Alliance\nDonostia-San Sebastián 20009, Spain\nimartinez@vicomtech.orgElisabeth Viles\nUniversity of Navarra\nTECNUN School of Engineering\nDonostia-San Sebastián 20018, Spain\neviles@tecnun.esIgor G Olaizola\nVicomtech Foundation\nBasque Research and Technology Alliance\nDonostia-San Sebastián 20009, Spain\niolaizola@vicomtech.org\nAbstract —In recent years, the data science community has\npursued excellence and made signiﬁcant research efforts to de-\nvelop advanced analytics, focusing on solving technical problems\nat the expense of organizational and socio-technical challenges.\nAccording to previous surveys on the state of data science project\nmanagement, there is a signiﬁcant gap between technical and\norganizational processes. In this article we present new empirical\ndata from a survey to 237 data science professionals on the\nuse of project managem

## Clustering similar sentences

Large documents, such as the Wikipedia page about Artificial Intelligence or the survey paper loaded above, may not fit into the GPT-3.5 context window. We need to split them into chunks.

We use "Adjacent clustering" to group similar (embedded) sentences into clusters. This way, we keep semantic coherence between sentences (unlike LangChain's splitter) and we also control the length of the chunks (unlike spacy and nltk tokenizers).
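The idea can be sketched on toy data before applying it to real text. The block below is a minimal illustration, assuming hypothetical 2-D vectors in place of real sentence embeddings; `adjacent_clusters` and `toy` are illustrative names, not part of the notebook.

```python
import numpy as np

def adjacent_clusters(vecs, threshold):
    """Group consecutive rows of `vecs` while neighbours stay similar."""
    clusters = [[0]]
    for i in range(1, len(vecs)):
        # For unit vectors, the dot product equals the cosine similarity
        if np.dot(vecs[i], vecs[i - 1]) < threshold:
            clusters.append([])  # similarity dropped: start a new cluster
        clusters[-1].append(i)
    return clusters

# Hypothetical 2-D "embeddings" for five sentences (illustration only)
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.1]])
toy = toy / np.linalg.norm(toy, axis=1, keepdims=True)  # normalize each row

toy_clusters = adjacent_clusters(toy, threshold=0.5)
print(toy_clusters)
```

This yields `[[0, 1], [2, 3], [4]]`. Sentence 4 points in almost the same direction as sentences 0 and 1, yet it lands in its own cluster because only neighbouring sentences are compared. That is what keeps the original order intact.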

In [17]:
# !python -m spacy download en_core_web_sm

In [77]:
# a = [[0]]
# a.append([])
# print(a)
# a[-1].append(1)
# print(a)

**Cluster the sentences**

In [118]:
def process(text):
    """
    Tokenize the text and return the sentences
    and their normalized vectors
    """ 
    doc = nlp(text)
    sents = list(doc.sents)
    # Normalize and stack the vectors (fall back to 1.0 to avoid dividing by a zero norm)
    vecs = np.stack([sent.vector / (sent.vector_norm or 1.0) for sent in sents])

    return sents, vecs


def create_clusters(sents, vecs, threshold):
    """Create clusters of similar sentences"""
    # Initialize the clusters
    clusters = [[0]]            # first cluster
    for i in range(1, len(sents)):
        # If the similarity between the current and previous sentence
        # is below the threshold, create a new cluster
        if np.dot(vecs[i], vecs[i-1]) < threshold:
            clusters.append([])

        # Append sentence to the last created cluster
        clusters[-1].append(i)          
    
    return clusters


def clean_text(text):
    """Cleaning logic here"""
    return text
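`clean_text` is left as a placeholder above. One possible way to fill it in, given the artifacts visible in the extracted PDF text (the "ﬁ" ligature and words hyphenated across line breaks), is sketched below. The name `clean_text_sketch` and the specific rules are assumptions for illustration, not the notebook's own logic.

```python
import re
import unicodedata

def clean_text_sketch(text):
    """Hypothetical cleaning for PDF-extracted text (illustration only)."""
    # NFKC folds compatibility characters, e.g. the "fi" ligature (U+FB01) becomes "fi"
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across a line break: "de-\nvelop" -> "develop"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse newlines and runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()

sample = "signi\ufb01cant research efforts to de-\nvelop advanced analytics"
print(clean_text_sketch(sample))
```

Note that the hyphen rule also merges genuine compounds that happen to break at the end of a line, so it trades some accuracy for simplicity.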

In [119]:
# Split the text into sentences and compute their normalized vectors
threshold = 0.3
sents, vecs = process(text)

# Cluster the sentences
clusters = create_clusters(sents, vecs, threshold)
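The threshold controls how readily a new cluster is started: the number of clusters is one plus the number of neighbouring pairs whose similarity falls below it. The sketch below shows this on hypothetical unit vectors defined by angles; `count_clusters` and `toy_vecs` are illustrative names, and the thresholds are only examples.

```python
import numpy as np

def count_clusters(vecs, threshold):
    """Number of adjacent clusters for a given threshold."""
    # Row-wise dot product between each vector and its predecessor
    sims = np.einsum("ij,ij->i", vecs[1:], vecs[:-1])
    # Every neighbouring pair below the threshold opens a new cluster
    return 1 + int(np.sum(sims < threshold))

# Hypothetical unit vectors at the given angles (degrees), illustration only
angles = np.deg2rad([0, 10, 20, 80, 85, 170])
toy_vecs = np.column_stack([np.cos(angles), np.sin(angles)])

counts = {t: count_clusters(toy_vecs, t) for t in (0.3, 0.6, 0.99)}
print(counts)
```

A higher threshold produces more, smaller clusters, which is why a stricter value is a natural choice when re-splitting a cluster that came out too long.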

In [120]:
def cluster_text(clusters):
    """
    Group sentences of the same cluster into a single text.
    """

    # Initialize the clusters lengths list and final texts list
    clusters_lens = []
    final_texts = []


    for cluster in clusters:
        # Join all the sentences in the cluster
        cluster_txt = clean_text(' '.join([sents[i].text for i in cluster]))
        cluster_len = len(cluster_txt)
        
        # If the cluster is too short, skip it
        if cluster_len < 60:
            continue
        
        # If the cluster is too long, break it into smaller clusters
        elif cluster_len > 3000:
            threshold = 0.6
            sents_div, vecs_div = process(cluster_txt)
            subclusters = create_clusters(sents_div, vecs_div, threshold)
            
            # For each subcluster, do the same as the main loop
            for subcluster in subclusters:
                div_txt = clean_text(' '.join([sents_div[i].text for i in subcluster]))
                div_len = len(div_txt)
                
                # Skip subclusters that are still too short or too long
                if div_len < 60 or div_len > 3000:
                    continue
                
                # Append the subcluster to the final lists
                clusters_lens.append(div_len)
                final_texts.append(div_txt)

        # If the cluster is of the right size, append it to the final list        
        else:
            clusters_lens.append(cluster_len)
            final_texts.append(cluster_txt)

    print(f"{len(clusters_lens)}")

    return final_texts, clusters_lens

final_texts, _ = cluster_text(clusters)
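In `cluster_text`, a subcluster that is still longer than 3000 characters after re-splitting is skipped, so its text does not reach the final chunks. One alternative, sketched here as an assumption rather than as the notebook's method, is to fall back to greedy packing: consecutive sentences are joined until the length limit would be exceeded. `pack_sentences` is a hypothetical helper.

```python
def pack_sentences(sentences, max_len):
    """Greedily pack consecutive sentences into chunks of at most `max_len` characters."""
    chunks, current = [], ""
    for sent in sentences:
        candidate = f"{current} {sent}" if current else sent
        if len(candidate) <= max_len or not current:
            # Either it fits, or it is a single oversized sentence kept whole
            current = candidate
        else:
            chunks.append(current)
            current = sent
    if current:
        chunks.append(current)
    return chunks

packed = pack_sentences(["aaaa", "bbbb", "cccc", "dd"], max_len=9)
print(packed)
```

Greedy packing ignores semantic similarity, so it is best kept as a last resort for text that the similarity-based split could not bring under the limit.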

In [127]:
print(f"{len(final_texts)} clusters created ")
print(final_texts[0])

37
A survey study of success factors in data science projects
Iñigo Martinez
Vicomtech Foundation
Basque Research and Technology Alliance
Donostia-San Sebastián 20009, Spain
imartinez@vicomtech.orgElisabeth Viles
University of Navarra
TECNUN School of Engineering
Donostia-San Sebastián 20018, Spain
eviles@tecnun.esIgor G Olaizola
Vicomtech Foundation
Basque Research and Technology Alliance
Donostia-San Sebastián 20009, Spain
iolaizola@vicomtech.org
Abstract —In recent years, the data science community has
pursued excellence and made signiﬁcant research efforts to de-
velop advanced analytics, focusing on solving technical problems
at the expense of organizational and socio-technical challenges.
 According to previous surveys on the state of data science project
management, there is a signiﬁcant gap between technical and
organizational processes. In this article we present new empirical
data from a survey to 237 data science professionals on the
use of project management methodologies f

## Create a QA chain

**Convert text to Document**

In [134]:
# metadata = {"source": "Wikipedia", "topic": "Artificial Intelligence"}
metadata = {}
final_docs = [
    Document(page_content=text, metadata=metadata) for text in final_texts
]
final_docs[:3]

[Document(page_content='A survey study of success factors in data science projects\nIñigo Martinez\nVicomtech Foundation\nBasque Research and Technology Alliance\nDonostia-San Sebastián 20009, Spain\nimartinez@vicomtech.orgElisabeth Viles\nUniversity of Navarra\nTECNUN School of Engineering\nDonostia-San Sebastián 20018, Spain\neviles@tecnun.esIgor G Olaizola\nVicomtech Foundation\nBasque Research and Technology Alliance\nDonostia-San Sebastián 20009, Spain\niolaizola@vicomtech.org\nAbstract —In recent years, the data science community has\npursued excellence and made signiﬁcant research efforts to de-\nvelop advanced analytics, focusing on solving technical problems\nat the expense of organizational and socio-technical challenges.\n According to previous surveys on the state of data science project\nmanagement, there is a signiﬁcant gap between technical and\norganizational processes. In this article we present new empirical\ndata from a survey to 237 data science professionals on the

In [165]:
with get_openai_callback() as cb:
    # Shorthand for the index creator
    index = VectorstoreIndexCreator(
        embedding=OpenAIEmbeddings(),
        vectorstore_cls=Chroma,
    ).from_documents(final_docs)

    query = "What is this document about?"
    response = index.query(
        llm=ChatOpenAI(),
        question=query, 
        chain_type="stuff",
    )

    print(cb)

print(response)

Tokens Used: 142
	Prompt Tokens: 120
	Completion Tokens: 22
Successful Requests: 1
Total Cost (USD): $0.00022399999999999997
This document is about the survey results for the question "which title best describes your role?" (Q2).


In [167]:
# Embed the docs and store them in a vector store
db = Chroma.from_documents(
    final_docs,
    OpenAIEmbeddings()
)

retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": len(final_docs)}       # retrieve every chunk (k equals the number of chunks)
)
llm = ChatOpenAI(model=model_name, temperature=0.0)

# Create the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=retriever,
    return_source_documents=True,
    verbose=True
)

# Run the chain
query = (
    "What should I know about this document? "
    "Format your answer in markdown."
)
# response = qa_chain.run(query)
response = qa_chain({"query": query})

# Display the response
display(Markdown(response["result"]))



> Entering new RetrievalQA chain...

> Finished chain.


This document appears to be a survey questionnaire related to data science projects. It includes questions about job experience level, stakeholders, gender, project duration, working style, project magnitude, relevance of factors in data science projects, and the use of data science project methodologies. The document also mentions the results section, which likely contains the findings of the survey. The document includes a figure labeled "Fig. 1" which likely shows the distribution of survey respondents based on their answers.

In [170]:
_ = ''.join([d.page_content for d in final_docs])
len(_)

27574
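The character count can be turned into a rough token estimate to see why the chunks cannot simply be concatenated into one prompt. The sketch below assumes the common rule of thumb of about 4 characters per token for English text and a 4096-token context window; both numbers are assumptions to check against the actual tokenizer and model.

```python
def estimate_tokens(n_chars, chars_per_token=4):
    """Very rough token estimate from a character count (a heuristic, not a tokenizer)."""
    return n_chars // chars_per_token

total_chars = 27574      # length reported by the cell above
context_window = 4096    # assumed context size; check the model you actually use

est = estimate_tokens(total_chars)
print(est, est > context_window)
```

Under these assumptions the combined chunks come to roughly 6,900 tokens, which is more than the assumed window, so a chain type that processes chunks separately (such as `map_reduce`) is a reasonable fit.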

In [168]:
response["source_documents"]

[Document(page_content='Using the responses from question Q10\n(Do you usually follow a data science project methodology? )', metadata={}),
 Document(page_content='Using the responses from question Q10\n(Do you usually follow a data science project methodology? )', metadata={}),
 Document(page_content='Using the responses from question Q10\n(Do you usually follow a data science project methodology? )', metadata={}),
 Document(page_content='Using the responses from question Q10\n(Do you usually follow a data science project methodology? )', metadata={}),
 Document(page_content='Using the responses from question Q10\n(Do you usually follow a data science project methodology? )', metadata={}),
 Document(page_content='Using the responses from question Q10\n(Do you usually follow a data science project methodology? )', metadata={}),
 Document(page_content='Using the responses from question Q10\n(Do you usually follow a data science project methodology? )', metadata={}),
 Document(page_conte
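The retrieved sources above repeat the same chunk several times. One possible cause, stated here as an assumption, is that the same documents were added to the vector store more than once during the session. Whatever the cause, duplicates can be filtered out before further use. The sketch below uses a small dataclass, `Doc`, as a hypothetical stand-in for LangChain's `Document` so that it runs on its own.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (same two fields)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def dedupe_docs(docs):
    """Drop documents whose page_content was already seen, keeping the first of each."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique

retrieved = [
    Doc("Using the responses from question Q10"),
    Doc("Using the responses from question Q10"),
    Doc("It is also worth noting the factors"),
]
unique_docs = dedupe_docs(retrieved)
print(len(retrieved), "->", len(unique_docs))
```

The same function works on real `Document` objects, since it only reads the `page_content` attribute.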

In [137]:
# Run the chain
query = (
    "What should I know about this document? "
    "Format your answer in markdown."
)
response = qa_chain.run(query)

display(Markdown(response))

This document appears to be a survey questionnaire related to data science projects. It includes questions about job experience level, stakeholders, gender, project duration, working style, project magnitude, relevance of factors in data science projects, and the use of data science project methodologies. The document also mentions that the results section follows the survey questions.

Additionally, the document discusses various aspects related to data science methodology. It covers topics such as metadata enrichment, data visualization tools, data security and privacy, establishing timelines and deliverables, team collaboration and coordination, and communicating results to end-users. It emphasizes the importance of understanding team member skills and roles, deploying code, data, and models, identifying project risks and pitfalls, proactive team communication, and meeting project requirements. The document seems to provide guidance on how professionals should approach data science methodology.

However, without the full document, it is not possible to provide more specific details or insights.

In [143]:
final_docs[20]

Document(page_content='It is also worth noting the factors\nwith the highest standard deviations: deployment pipeline to\nproduction anddata security and privacy had more variance\nin their scores, which could indicate that professionals have\ndiffering views on the importance of these factors.\n  Furthermore, an additional analysis of these scores was\nperformed to determine whether there were any signiﬁcant dif-\nferences between professionals who use a project methodology\nand those who do not.', metadata={})