# Document Clustering with K-means  

Some documents are too large to fit into an LLM's context window, so we need to split them into chunks.

In this notebook, we explore K-means clustering to group similar (embedded) sentences into clusters.

**Pros**

K-means has some advantages over other methods:

- It keeps semantically related sentences together in the same chunk (unlike LangChain's splitter, which divides the text to meet a size criterion).

**Cons**

- We lose the order of the sentences because similar sentences might be far away from each other in the document.

See "Adjacent clustering" for a solution to this problem.
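
The "Adjacent clustering" notebook is not reproduced here. As a rough illustration of the idea, and only as one possible reading of it, the sketch below merges runs of consecutive sentences that share a cluster label, so each chunk preserves document order. The function name `group_adjacent` and the toy data are hypothetical.

```python
from itertools import groupby


def group_adjacent(sentences, labels):
    """Merge runs of consecutive sentences that share a cluster label."""
    chunks = []
    pairs = zip(labels, sentences)
    # groupby only groups neighbouring items, which is what keeps the order
    for label, run in groupby(pairs, key=lambda pair: pair[0]):
        chunks.append((label, " ".join(sentence for _, sentence in run)))
    return chunks


# Hypothetical toy data: five sentences and their cluster labels
toy_sentences = ["A1.", "A2.", "B1.", "A3.", "A4."]
toy_labels = [0, 0, 1, 0, 0]

adjacent_chunks = group_adjacent(toy_sentences, toy_labels)
print(adjacent_chunks)
```

With this approach a cluster label can produce several chunks (label `0` yields two here), which trades fewer, larger chunks for preserved sentence order.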

**TL;DR**

Clustering is a reasonable way to split large documents, but it is not perfect: sentences are cut off, and sentence order and potentially important information are lost.

## Importing Libraries

In [38]:
import openai
import os
from dotenv import load_dotenv, find_dotenv

from PyPDF2 import PdfReader
import urllib.request

from langchain.document_loaders import UnstructuredHTMLLoader, BSHTMLLoader
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.docstore.document import Document
from langchain.indexes import VectorstoreIndexCreator
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk
import string
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

from wordcloud import WordCloud
import matplotlib.pyplot as plt

nltk.download('punkt')
nltk.download('stopwords')

from IPython.display import display, Markdown, HTML

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\balde\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\balde\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [22]:
# Load and set API key
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ['OPENAI_API_KEY']

model_name = "gpt-3.5-turbo"

**Download the webpage**

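To read the webpage directly:

```python
import urllib.request
# Same URL as in the loading cell below
webpage = 'https://en.wikipedia.org/wiki/Artificial_intelligence'
file = urllib.request.urlopen(webpage)
myfile = file.read()  # raw bytes of the page
print(myfile.decode("utf-8"))
```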

## Clustering the document

**Load the page with LangChain**

In [13]:
# # Download the page
# webpage = 'https://en.wikipedia.org/wiki/Artificial_intelligence'
# webpage_path = 'data/artificial-intelligence.html'
# urllib.request.urlretrieve(webpage, webpage_path)

# # Create a loader for the page
# loader = BSHTMLLoader(webpage_path, open_encoding='utf-8')
# docs = loader.load()
# docs

**Load the PDF**

In [4]:
file_path = 'data/Survey of Success Factors in Data Science Project.pdf'

def extract_text_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        pdf = PdfReader(file)
        text = " ".join(page.extract_text() for page in pdf.pages)
    return text

# Extract the text from the PDF (it is split into sentences later)
text = extract_text_from_pdf(file_path)
text



'A survey study of success factors in data science projects\nIñigo Martinez\nVicomtech Foundation\nBasque Research and Technology Alliance\nDonostia-San Sebastián 20009, Spain\nimartinez@vicomtech.orgElisabeth Viles\nUniversity of Navarra\nTECNUN School of Engineering\nDonostia-San Sebastián 20018, Spain\neviles@tecnun.esIgor G Olaizola\nVicomtech Foundation\nBasque Research and Technology Alliance\nDonostia-San Sebastián 20009, Spain\niolaizola@vicomtech.org\nAbstract —In recent years, the data science community has\npursued excellence and made signiﬁcant research efforts to de-\nvelop advanced analytics, focusing on solving technical problems\nat the expense of organizational and socio-technical challenges.\nAccording to previous surveys on the state of data science project\nmanagement, there is a signiﬁcant gap between technical and\norganizational processes. In this article we present new empirical\ndata from a survey to 237 data science professionals on the\nuse of project managem

**Clustering the sentences**

We first embed the sentences into vectors so they can be clustered.

In [6]:
# Embedding sentences
model = SentenceTransformer('all-MiniLM-L6-v2')
# sentences = sent_tokenize(docs[0].page_content)       # for webpage above
sentences = sent_tokenize(text)
embeddings = model.encode(sentences)

# Clustering sentences
n_clusters = 30
# Set n_init explicitly to avoid scikit-learn's warning about its changing default
kmeans = KMeans(n_clusters=n_clusters, n_init=10)
clusters = kmeans.fit_predict(embeddings)
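
To make the clustering step less of a black box, here is a toy version of the loop that K-means runs (Lloyd's algorithm), written in plain Python on 2-D points. It is a sketch for intuition only, not scikit-learn's implementation, which adds a smarter initialisation and several restarts (`n_init`). The function `toy_kmeans`, the fixed starting centroids, and the toy points are all hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def toy_kmeans(points, centroids, n_iter=10):
    """Alternate between assigning points and moving centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical toy data: two obvious groups of 2-D points
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
toy_labels, toy_centroids = toy_kmeans(
    toy_points, centroids=[(0.0, 0.0), (10.0, 10.0)]
)
print(toy_labels)
print(toy_centroids)
```

In the notebook the points are sentence embeddings rather than 2-D coordinates, and `clusters` plays the role of `toy_labels`: one cluster index per sentence.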


In [7]:
def clean_sentence(sentence):
    """
    Tokenize a sentence, convert it to lower case,
    remove punctuation, non-alphabetic tokens and stop words.
    """
    stop_words = set(stopwords.words('english'))

    tokens = word_tokenize(sentence)
    tokens = [w.lower() for w in tokens]
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    words = [word for word in stripped if word.isalpha()]
    words = [w for w in words if w not in stop_words]

    return words

In [35]:
def cluster_sentences(sentences, clusters, n_clusters):
    """Group sentences by cluster and return one cleaned text per cluster."""

    # Will hold one cleaned text per cluster
    clustered_texts = []

    for i in range(n_clusters):
        # Collect all the sentences assigned to cluster i
        current_cluster_sentences = [
            s for s, label in zip(sentences, clusters) if label == i
        ]

        # Clean the cluster sentences
        cleaned_sentences = [
            ' '.join(clean_sentence(s)) for s in current_cluster_sentences
        ]
        text = ' '.join(cleaned_sentences)

        # Append the cluster's cleaned text to the clustered_texts list
        clustered_texts.append(text)

    return clustered_texts

clustered_texts = cluster_sentences(sentences, clusters, n_clusters)
clustered_texts[0]

'provide additional proﬁling survey respondents roles priorities executing data science projects fact companies report minimal impact ai furthermore provide additional proﬁling survey respondents roles data scientist data engineer business analyst etc among survey respondents data scientists data engineers business analysts data analysts machine learning engineers software developers level job experience among professionals polled distributed follows senior associate entrylevel executive implies majority data science projects large involve large number stakeholders case survey mostly business organizations furthermore survey participants state use kind data science project methodology another say depends project several aspects survey respondents work data science professionals evalu ated including experience level role data scientist data engineer business analyst etc however survey participants state follow data science project methodology question items count economist consultant st

**Visualizing the clusters**

In [10]:
def plot_word_clouds(texts):
    """Generate and plot word clouds for each chunk of documents generated by clustering."""

    for i, chunk in enumerate(texts):

        wordcloud = WordCloud(
            max_font_size=50, max_words=100, background_color="white"
        ).generate(chunk)

        plt.figure()
        plt.imshow(wordcloud, interpolation="bilinear")
        plt.axis("off")
        plt.title(f"Cluster {i+1}")
        plt.show()

# plot_word_clouds(clustered_texts[:5])
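
A word cloud sizes each word by how often it occurs, so a quick text-only alternative is to print the most frequent words per cluster. The sketch below does this with the standard library; `top_words` and the toy chunk are hypothetical, and in the notebook the input would be one entry of `clustered_texts`.

```python
from collections import Counter


def top_words(cleaned_text, n=5):
    """Return the n most frequent words of an already cleaned text."""
    return Counter(cleaned_text.split()).most_common(n)


# Hypothetical toy chunk, shaped like the cleaned cluster texts above
toy_chunk = "data science survey data project survey data"
print(top_words(toy_chunk, n=2))
```

This is handy when the plots are disabled, as they are in the commented-out call above.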

## Querying the document

**Convert sentences to Document**

In [13]:
def convert_to_docs(texts):
    """Convert a list of texts to a list of Documents without modifying the input."""
    return [
        Document(page_content=text, metadata={"source": file_path})
        for text in texts
    ]

doc_chunks = convert_to_docs(clustered_texts)
doc_chunks[:3]

[Document(page_content='provide additional proﬁling survey respondents roles priorities executing data science projects fact companies report minimal impact ai furthermore provide additional proﬁling survey respondents roles data scientist data engineer business analyst etc among survey respondents data scientists data engineers business analysts data analysts machine learning engineers software developers level job experience among professionals polled distributed follows senior associate entrylevel executive implies majority data science projects large involve large number stakeholders case survey mostly business organizations furthermore survey participants state use kind data science project methodology another say depends project several aspects survey respondents work data science professionals evalu ated including experience level role data scientist data engineer business analyst etc however survey participants state follow data science project methodology question items count 

**Trim the document**

In [37]:
# Remove docs that are too short or too long
doc_trimmed = [doc for doc in doc_chunks if 30 < len(doc.page_content) < 3000]

print(len(doc_trimmed), ' documents extracted.')
doc_trimmed[:2]

25  documents extracted.


[Document(page_content='furthermore survey conducted nearly data analytics leaders corinium intelligence found data science organizations established standardized processes microsoft team data science process', metadata={'source': 'local'}),
 Document(page_content='important success factors precisely describing stakeholders needs communicating results end users team collaboration coordination scrum kanban customized processes followed top three important success factors precisely describe stakeholders needs communicate avg std data augmentation metadata enrichmentdata visualization toolsdefine data lifecycle workflowharness knowledge future workestablishing timelines deliverablesdata security privacyunderstanding team member skills roledeployment pipeline productionversion control code data modelsidentify project potential risks pitfallsdevelop strategy meet project requirementsproactive team communicationteam collaboration coordinationcommunicating results describe stakeholders needs 
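
The run above keeps 25 documents although `n_clusters = 30`, so five clusters are discarded by the length filter. To see how much text such a filter throws away, a sketch like the following separates kept from dropped chunks and counts the discarded characters. The helper `split_by_length` and the toy texts are hypothetical; the thresholds mirror the ones used in the cell above.

```python
def split_by_length(texts, min_chars=30, max_chars=3000):
    """Separate texts inside the length window from those outside it."""
    kept, dropped = [], []
    for text in texts:
        if min_chars < len(text) < max_chars:
            kept.append(text)
        else:
            dropped.append(text)
    return kept, dropped


# Hypothetical toy data: one chunk too short, one fine, one too long
toy_texts = ["too short", "x" * 100, "y" * 5000]
kept, dropped = split_by_length(toy_texts)
discarded_chars = sum(len(t) for t in dropped)
print(f"kept {len(kept)}, dropped {len(dropped)}, "
      f"{discarded_chars} characters discarded")
```

In the notebook, the same check could be run on `[doc.page_content for doc in doc_chunks]` to decide whether long clusters should be split further instead of dropped.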

**Query the document**

In [26]:
# Embed the docs and store them in a vector store
db = DocArrayInMemorySearch.from_documents(
    doc_trimmed,
    OpenAIEmbeddings()
)

retriever = db.as_retriever()
llm = ChatOpenAI(model=model_name, temperature=0.0)

# Create the RetrievalQA chain
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=retriever,
    verbose=True
)

# Run the chain
query = (
    "What should I know about this document? "
    "Format your answer in markdown."
)
response = qa_stuff.run(query)

display(Markdown(response))



> Entering new RetrievalQA chain...

> Finished chain.


This document discusses the use of different methodologies in data science projects. The main methodology mentioned is CRISP-DM (Cross-Industry Standard Process for Data Mining), which was found to be commonly used by respondents in a survey conducted by KDnuggets. However, the document highlights that the percentage of people using CRISP-DM has changed significantly in recent years, indicating that technology has advanced faster than organizational processes for handling data science projects. 

The document also mentions that while many participants are aware of CRISP-DM, there are other methodologies such as Agile DS Lifecycle, RAMSYSMIDSTMICROSOFT TDSP, IBM FMDSDOMINO DS, and DS Lifecycle that some respondents are also familiar with. It suggests that the use of CRISP-DM may be decreasing compared to previous surveys.

Additionally, the document contains information about a scoring system based on a point Likert scale. It includes a figure that shows the distribution of scores, along with the weighted average and standard deviation. The factors are listed in descending order of importance. The weighted average and standard deviation are calculated based on the response count and weight of each answer choice. The figure also includes a comparison of the weighted average scores for people who follow or do not follow a data science project methodology.

Lastly, the document allows personal use of its content.
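
The answer above is built only from the chunks the retriever returns, so it helps to picture the retrieval step. A vector store typically embeds the query and ranks the stored chunks by a similarity measure such as cosine similarity. The sketch below shows that ranking on toy vectors; it is not the vector store's actual code, and the metric, the value of `k`, and the vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k=2):
    """Return the indices of the k vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), index)
        for index, vector in enumerate(doc_vectors)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [index for _, index in scored[:k]]


# Hypothetical toy embeddings for three chunks and one query
toy_doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
toy_query = [1.0, 0.0]
print(top_k(toy_query, toy_doc_vectors, k=2))
```

Because only the top-ranked chunks reach the LLM, any cluster that was trimmed away or ranked low cannot influence the summary.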

## Conclusion

Clustering is a reasonable way to split large documents, but it is not perfect. In the run above, sentences are cut off, sentence order is lost, and potentially important information is discarded: trimming keeps only 25 of the 30 clusters.

## References
- https://learn.deeplearning.ai/langchain/lesson
- https://towardsdatascience.com/how-to-chunk-text-data-a-comparative-analysis-3858c4a0997a