# Document Clustering
### Using Embeddings

In [1]:
import json
import os
import random
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

## Define Parameters

In [8]:
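n_clusters = 10   # number of KMeans clusters
n_samples = 10    # documents printed per cluster
n = 100           # number of transcript files to sample
batch_size = 64   # documents per embedding batch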

## Load Data

In [9]:
def create_df_multi(n=25):
    """Return a flat list of turn texts from n randomly sampled transcript files."""
    list_of_text = []
    data_dir = 'data/aligned data/c=4'

    files = os.listdir(data_dir)
    # random.choices samples with replacement, so a file can be picked more than once
    sampled_files = random.choices(files, k=n)

    for filename in sampled_files:
        # Load each sampled file once and close it when done
        with open(os.path.join(data_dir, filename)) as f:
            data = json.load(f)

        for x in data:
            for y in x['TURNS']:
                # Join all utterances of a turn into one document
                text = ' '.join(y['UTTERANCES'])
                list_of_text.append(text)
    return list_of_text
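
The nested loops above assume that each file holds a list of records, each with a `TURNS` list whose items carry an `UTTERANCES` list of strings. That layout is inferred from the key accesses in `create_df_multi`; the sample record below is invented for illustration and is not taken from the dataset. A minimal sketch of the flattening step on an in-memory JSON string (`flatten_turns` is a hypothetical helper name):

```python
import json

# Hypothetical record mimicking the structure that create_df_multi reads;
# the texts are invented placeholders.
sample_json = '''
[
  {"TURNS": [
    {"UTTERANCES": ["Hello there.", "Welcome back."]},
    {"UTTERANCES": ["Thanks!"]}
  ]}
]
'''

def flatten_turns(data):
    """Join each turn's utterances into one string, one string per turn."""
    return [' '.join(turn['UTTERANCES'])
            for record in data
            for turn in record['TURNS']]

texts = flatten_turns(json.loads(sample_json))
print(texts)  # ['Hello there. Welcome back.', 'Thanks!']
```

Each turn becomes one document, which is why the resulting list is much longer than the number of files sampled.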

In [10]:
docs = create_df_multi(n)
df = pd.DataFrame(docs)
print(df.shape)
df.head()

(258748, 1)


|   | 0 |
|---|---|
| 0 | Hello everyone! Close. Welcome to tonight's ep... |
| 1 | Oh my gosh, all that crazy merch that we have!... |
| 2 | Hide your neighbors, hide your wife. |
| 3 | That's right. And we also have our Scanlan con... |
| 4 | Fantastic. |


## Embed Docs



- Load Embedding Model
- Process Batches (May Take a While)

In [11]:
# model = SentenceTransformer('all-MiniLM-L6-v2')
model = SentenceTransformer('all-mpnet-base-v2')

embeddings = model.encode(docs, batch_size=batch_size, show_progress_bar=True)

print(embeddings[0].shape)

Batches:   0%|          | 0/4043 [00:00<?, ?it/s]

(768,)
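
The batch total shown by the progress bar follows from the corpus size and `batch_size`: the documents are split into batches of at most 64, and the last batch holds whatever remains. A quick check of that arithmetic, using the row count printed by `df.shape` earlier:

```python
import math

n_docs = 258748      # rows reported by df.shape
batch_size = 64      # value set in the parameters cell

n_batches = math.ceil(n_docs / batch_size)
last_batch = n_docs - (n_batches - 1) * batch_size

print(n_batches)   # 4043
print(last_batch)  # 60
```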


### Save embeddings just in case

In [12]:
# Store each embedding as a list in one column; to_csv writes these lists out as strings
df['embeddings'] = embeddings.tolist()
df.to_csv('data/doc_embeds.csv')
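
Writing the vectors through `to_csv` turns each list into a string that has to be parsed again on reload. An alternative sketch, assuming the embeddings are a float NumPy array as returned by `model.encode`, is to save the array in NumPy's binary format. The random array and the temporary file path below are stand-ins for illustration, not the notebook's data or paths:

```python
import os
import tempfile
import numpy as np

# Stand-in for the real embeddings: 5 vectors of dimension 768
rng = np.random.default_rng(0)
embeddings_demo = rng.random((5, 768), dtype=np.float32)

# Hypothetical output location; the notebook itself writes to data/doc_embeds.csv
path = os.path.join(tempfile.mkdtemp(), 'doc_embeds.npy')
np.save(path, embeddings_demo)

# Reload and confirm that shape, dtype, and values survive the round trip
restored = np.load(path)
print(restored.shape, restored.dtype)
print(np.array_equal(embeddings_demo, restored))
```

The binary file keeps the dtype and shape, so the reloaded array can be passed straight to `KMeans` without converting strings back to floats.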

## Define Output Functions

In [15]:
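def sample_clusters(labels, n_clusters=n_clusters, n_samples=10):
    """Print the size of each cluster and a few sample documents from it."""
    # Note: this writes the labels into the global df
    df['labels'] = labels
    for i in range(n_clusters):
        cluster = df[df.labels == i]
        print(f'Cluster {i}: {len(cluster)} / {len(df)}')
        # Sample with replacement so non-empty clusters smaller than n_samples still work;
        # column 0 holds the document text
        sample = cluster.sample(n_samples, replace=True)[0]
        for x in sample:
            print('\t- ' + x[:100])  # show only the first 100 characters
        print('\n')
        print('==='*30)
        print('\n')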
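
The sections that follow refit `KMeans` once per cluster count. The same sweep can be sketched as a loop. This version runs on small synthetic vectors rather than the real embeddings, and recording `inertia_` (the within-cluster sum of squared distances) is an addition for illustration, not something the notebook reports:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the embeddings: 200 points in 8 dimensions
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))

results = {}
for k in (5, 10, 20):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    sizes = np.bincount(km.labels_, minlength=k)  # points per cluster
    results[k] = (km.inertia_, sizes)
    print(f'k={k}: inertia={km.inertia_:.1f}, largest cluster={sizes.max()}')
```

Comparing cluster sizes across values of `k` gives a quick signal of whether one cluster is absorbing most of the documents.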

## 5 Clusters

In [13]:
n_clusters = 5 
n_samples = 10

In [17]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 126127 / 258748
	- Dungeons and Dragons!
	- (blows nose)
	- That's typically how it works.
	- That is a good deal, sir, and I will give you the copper.
	- This one here?
	- We'll see each other, again, I hope. Grog.
	- Okay, so which do you want to know, of all the things that you can choose from?
	- Pulling this ripcord.
	- I didn't expect you to say that.
	- Oh snap! Shit's getting real.




Cluster 1: 18941 / 258748
	- "He'll survive." And you can see now there's a look of pained anger in Uriel's eyes at seeing the fa
	- "To be fair Kima, in my experience and knowledge, ships function well for the movement of goods and 
	- "Parts smoldering and left behind, the path continues over the Lucidian Ocean, eastward beyond the s
	- I don't say anything. (long pause)
	- "I don't know of the name 'The Gentleman,' but--"
	- "To his credit, the current Plank King has done a very fine job of maintaining organization in the R
	- He exits the building. Another five minutes pass, and ev

## 10 Clusters

In [18]:
n_clusters = 10 
n_samples = 10

In [18]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 126127 / 258748
	- Send him on his merry way.
	- (laughs)
	- Who's to say where he's going?
	- Can we get everybody else but Sam and Marisha to leave?
	- Do we have to get in disguise again in order to leave?
	- Two more editions of the art book!
	- Would you like to see the library?
	- I'm okay at surviving.
	- God damn it-- the chain devils. All right.
	- Percy, you are aware that here have been problems in the past with various mountain-faring giants th




Cluster 1: 18941 / 258748
	- "Yes."
	- Hi. I step in after her. Yes, I've made some changes to the place. Living quarters are upstairs, the
	- "Dragons’ teeth actually, strangely enough, are a little easier to find. I would put, each dragon’s 
	- "Do you have some?"
	- She gives a narrow look through her eyes, looks to one of the other guards that you can see standing
	- "Interesting. You seem to be struggling, Percy."
	- "Yes, I am a skilled swordsman!"
	- "Now, we've had a few weeks of peace and recovery, of rebuildi

## 20 Clusters

In [19]:
n_clusters = 20 
n_samples = 10

In [20]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 14354 / 258748
	- By the way, guys, this finally came in today.
	- It's pretty awesome, we're excited.
	- That's amazing.
	- How's the weather; is it clear or grey?
	- These horns will be yours. and definitely go follow Kai on Instagram and on Facebook and definitely 
	- It's just the two of them at the moment.
	- I feel like I can do so much more. (phone rings)
	- Thank you both.
	- And during that whole exchange, I've been changing cloaks and putting on the red one. Great.
	- Kima is very happy.




Cluster 1: 15861 / 258748
	- It would take you about an hour or so to do so. To a studied physician, probably not going to hold u
	- Okay, I'm going to do what I originally planned which is to make the hole not look like a perfect ho
	- I don't know. I've never done it before.
	- No, you do not.
	- Yes, of course.
	- I can do this for the future.
	- Hey, you can get the note.
	- I mean, you can. The progress will slow a bit because you're now taking it through the hills and the

## 50 Clusters

In [22]:
n_clusters = 50 
n_samples = 10

In [23]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 4754 / 258748
	- I've already thought about it.
	- Go for it.
	- You could. But I know these two.
	- You have no choice?
	- I'm going to jack it up and see if I can--
	- I did!
	- You can go around this one there.
	- Once a year, I can do that. It takes a year to rest.
	- I don't know. I mean, like, I kind of want to. I mean, so like. Like bad things.
	- Yes, you can. Make a stealth check.




Cluster 1: 7041 / 258748
	- I have a platinum bracelet which I assumed was just fancy and pretty. Here, look at that one too, Pe
	- I was going to put it on that leather.
	- The biggest one's a bit--
	- Each page is a centimeter thick, though, on this book.
	- That's Travis's!
	- Currently, it looks like there was another sheet that looks like most has been torn off. The rest of
	- Gotcha. Hey, Miss Jester? Apparently you picked something up in Zadash-- I'm supposed to see the thi
	- I've also carved, in Elven, in the inside, "Kick Me."
	- Sinew.
	- It's big?




Cluster 2: 3628 / 2587