# Document Clustering
### Using Embeddings

In [1]:
import json
import os
import random

import numpy as np
import pandas as pd
# from sentence_transformers import SentenceTransformer
from sklearn.cluster import (
    AgglomerativeClustering,
    DBSCAN,
    KMeans,
    MiniBatchKMeans,
    OPTICS,
    SpectralClustering,
)


In [2]:
n_clusters = 10  # number of clusters to fit
n_samples = 10   # lines printed per cluster
n = 100
batch_size = 64

## Load Data

In [3]:
embeddings = np.load('data/embeddings_cleaned.npy')
embeddings.shape

(1387497, 768)

In [4]:
df = pd.read_csv('data/full_transcripts_cleaned.csv')
print(df.shape)
df.head()

(1387497, 6)


Unnamed: 0.1,Unnamed: 0,series,episode,line,char,text
0,1,dimension 20,The_Joy_of_Mistakes_(with_Kelly_Lynne_D%27Ange...,2,ADJUSTED,"Hello, one and all, and welcome back to anoth..."
1,2,dimension 20,The_Joy_of_Mistakes_(with_Kelly_Lynne_D%27Ange...,3,ADJUSTED,"Heroes of Pure Heart."" She's also a dungeon m..."
2,3,dimension 20,The_Joy_of_Mistakes_(with_Kelly_Lynne_D%27Ange...,4,Kelly,"Yeah, this is big, woo, the crowd goes wild. ..."
3,4,dimension 20,The_Joy_of_Mistakes_(with_Kelly_Lynne_D%27Ange...,5,Brennan,"Flowers, (laughs) flowers, flowers. Kelly, it..."
4,5,dimension 20,The_Joy_of_Mistakes_(with_Kelly_Lynne_D%27Ange...,6,Kelly,"Thanks for having me, Brennan. It's really ex..."
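
Cluster labels are later assigned to `df` by position, so the embedding matrix and the transcript table need the same number of rows in the same order. Both shapes above report 1387497 rows. A minimal sketch of a guard for this, using toy data; `check_alignment` is a hypothetical helper that is not part of the original notebook, and it can only compare row counts, not row order:

```python
import numpy as np
import pandas as pd


def check_alignment(embeddings: np.ndarray, frame: pd.DataFrame) -> int:
    """Return the shared row count, or raise if the two inputs disagree.

    Hypothetical helper: it only compares lengths, so it cannot detect
    rows that are present in both inputs but in a different order.
    """
    if embeddings.ndim != 2:
        raise ValueError(f'expected a 2-D array, got shape {embeddings.shape}')
    if embeddings.shape[0] != len(frame):
        raise ValueError(
            f'{embeddings.shape[0]} embeddings vs {len(frame)} transcript rows'
        )
    return embeddings.shape[0]


# Toy stand-ins for the real data (3 rows, 4-dimensional vectors)
toy_embeddings = np.zeros((3, 4))
toy_df = pd.DataFrame({'text': ['a', 'b', 'c']})
n_rows = check_alignment(toy_embeddings, toy_df)
print(n_rows)
```

In the notebook itself the call would presumably be `check_alignment(embeddings, df)`, placed before any clustering.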


## Define Output Functions

In [6]:
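def sample_clusters(labels, n_clusters=10, n_samples=10, column='label'):
    """Store `labels` in df[column], then print each cluster's size and a random sample of its lines."""
    df[column] = labels  # note: writes to the global `df`
    for i in range(n_clusters):
        cluster = df[df[column] == i]
        print(f'Cluster {i}: {len(cluster)} / {len(df)}')
        # Sample with replacement so clusters smaller than n_samples still work
        sample = cluster.sample(n_samples, replace=True)['text']
        for x in sample:
            print('\t- ' + str(x)[:100])  # show at most 100 characters per line
        print('\n')
        print('===' * 30)
        print('\n')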
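
`sample_clusters` depends on the global `df`, which makes it awkward to try on small inputs. The sketch below is a hypothetical variant, `sample_cluster_texts`, that takes the frame as an argument and returns the samples instead of printing them. The toy data and the `seed` parameter are illustrative assumptions. Unlike the original, it iterates only over labels that actually occur.

```python
import pandas as pd


def sample_cluster_texts(frame, labels, n_samples=3, column='label', seed=0):
    """Return {cluster_id: (size, sampled_texts)} without touching globals."""
    labelled = frame.assign(**{column: labels})  # copy; the input is not modified
    result = {}
    for cluster_id, group in labelled.groupby(column):
        # Sample with replacement, as in the original function, so that
        # clusters smaller than n_samples do not raise an error.
        texts = group['text'].sample(n_samples, replace=True, random_state=seed)
        result[cluster_id] = (len(group), [str(t)[:100] for t in texts])
    return result


toy = pd.DataFrame({'text': ['Okay.', 'Yeah.', 'Roll a stealth check.', 'All right.']})
samples = sample_cluster_texts(toy, labels=[1, 1, 0, 1])
for cluster_id, (size, texts) in samples.items():
    print(f'Cluster {cluster_id}: {size} / {len(toy)}')
    for text in texts:
        print('\t- ' + text)
```

Returning the samples keeps the printing format separate from the sampling logic, so the same function could feed a report or a test.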

## 5 Clusters

In [7]:
n_clusters = 5 
n_samples = 10

In [8]:
kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=42, batch_size=2048, max_iter=500)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples, column=f'cluster_{n_clusters}')

Cluster 0: 284750 / 1387497
	-  We've got to bail on it and come back for it later, if at all.
	-  We've got a little bit of business, a little bit of activity ahead of us, so we want to curve to th
	-  Uh, if you want to try to cop a look, you can roll a stealth check for me.   
	-  Just out of curiosity, have you mapped any new rooms within Halls of Halas? 
	-  Build a Brennan workshop.
	-  I'd like to see that.
	-   Oh, like you’r e on their side still?   
	-  Yeah, you know, what if we--
	-  And move down and help out with the chain.
	-    Yaaay!   




Cluster 1: 136031 / 1387497
	-  Yeah. 
	-  12.
	-  Yeah, 39.
	-  Okay.
	-  Yeah. 
	-  Yeah.
	-  Uh, no. 
	-  Fuck. 
	-  All right.
	-  Yeah.




Cluster 2: 257584 / 1387497
	-  This is good. That feels like, I mean, that's not just a good cover story, that's just a good story
	-  “That's all right.”
	-  Letters, let me help you out and I hook the-- 
	-  “I'm working on Shorthalt's Hall of Lady Favors.”
	-  And then I'm like, 
	-  Bu

## 10 Clusters

In [9]:
n_clusters = 10 
n_samples = 10

In [10]:
kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=42, batch_size=2048, max_iter=500)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples, column=f'cluster_{n_clusters}')

Cluster 0: 129394 / 1387497
	-  “Wonderful to see you as well. And please, do send me some fresher treats.”
	-  He looks at you. Go ahead and make an opposed Athletics check. You can do Acrobatics.
	-  It would be an honor, [drops die] sorry. It would still be an honor.
	-  (laughs) Yeah.
	-  "Then I didn't see it. I didn't touch it. 350 gold." (laughter) 
	-  (laughs) 
	-  (claps) 
	-  (laughs) 
	-  As you walk out talking about like, I'm glad we're keeping closed, yeah. Ronnie looks over and says
	-  "That is a word, yes."




Cluster 1: 158186 / 1387497
	-  Last we left our heroes off, they were going to see a Broadway show, and celebrating Christmas, thi
	-  And you are aware, as you get very close to the edge of this burning rune in the ground, you can fe
	-  To make it seem booby trapped. 
	-  Wow. The drow is down?
	-  You all pile into the elevator as that question hangs in your head. Why kill the Squire? This was s
	-  Power? The voices, the voices can reach you. I don't know 

## 20 Clusters

In [11]:
n_clusters = 20 
n_samples = 10

In [12]:
kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=42, batch_size=2048, max_iter=500)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples, column=f'cluster_{n_clusters}')

Cluster 0: 81910 / 1387497
	-  Okay, our brigade... Okay, Talespire crew, Brigade Tigers are going to... All right, they're gonna 
	-  Well, we've had some great Fjord spotlights, and maybe more for everyone as we go forward. So they'
	-  We want to go back to the Leaky Tap?
	-  Yeah, but then he'll pop where we are.
	-  Yeah, exactly. 
	-  We may not like it because we're very Humano centric as people. Clearly my bicycle was the chosen o
	-  Yeah, I also don't want to stick my neck out, but anything that is a thumb in the eye of the Empire
	-    This is the result of Griffin‘s Surge.   
	-  If there's pirates here, space pirates are gonna love gambling.
	-  And we see nothing.




Cluster 1: 35191 / 1387497
	-  Okay. 
	-    Okay.   
	-  Okay, okay, all right.
	-  Okay. 
	-  Yeah, okay. 
	-  Okay.
	-  Okay, okay. 
	-  That's fine. 
	-  All right. 
	-  Okay.




Cluster 2: 88068 / 1387497
	-  Caleb. 
	-  Marquet! Marquet it is.
	-  Ford, Cad, Yasha? Or Yasha--? 
	-  The vulture’s the bi

## 50 Clusters

In [13]:
n_clusters = 50 
n_samples = 10

In [14]:
kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=42, batch_size=2048, max_iter=500)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples, column=f'cluster_{n_clusters}')

Cluster 0: 23743 / 1387497
	-  There's enough time, it's an hour. 
	-    Point of order, it kind of sounds like, from the way you described y‘all‘s dream, that the Quell‘
	-  I'd like to see that.
	-  It could.
	-  We'll be in and out of Eshteross', yeah. 
	-  I know. Next week, possibly at the live show. There we go. That can be something for you to referen
	-   No, no fireworks.   
	-  I think they'll get out when the house catches on fire.
	-  Yeah, god, five years!
	-  Big gamble. That's a big gamble for a tiny payout. 




Cluster 1: 25106 / 1387497
	-    Okay.   
	-   Alright …  
	-  Okay. 
	-  Great. 
	-  Okay.
	-  Okay.
	-  Okay. 
	-  Okay.
	-    Okay.   
	-  Okay. 




Cluster 2: 51762 / 1387497
	-  (gags) For the love of– (gags)
	-  I have that. I have a lot of-- 
	-  Nerd shit? 
	-   What d‘you got, Dad?   
	-  Space Jam 3. 
	-  Tom Cruise ass. 
	-  Oh yeah, in twin speak-
	-  New die.
	-  The Wurst gunner channel in the Wurst. 
	-  What's your alignment? 




Cluster 3: 119
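
The cells for 5, 10, 20 and 50 clusters differ only in `n_clusters`. A minimal sketch of the same procedure written as a loop, run on synthetic data rather than the real embeddings. The toy array, the smaller `batch_size`, and the explicit `n_init=3` are assumptions made so the example runs quickly and behaves the same across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for the real embeddings: 600 points in 8 dimensions
rng = np.random.default_rng(42)
toy_embeddings = rng.normal(size=(600, 8))
toy_df = pd.DataFrame({'text': [f'line {i}' for i in range(len(toy_embeddings))]})

for k in (5, 10, 20):
    kmeans = MiniBatchKMeans(
        n_clusters=k, random_state=42, batch_size=256, max_iter=500, n_init=3
    )
    labels = kmeans.fit(toy_embeddings).labels_
    toy_df[f'cluster_{k}'] = labels  # one label column per cluster count
    # Cluster sizes, comparable to the "Cluster i: size / total" lines above
    sizes = np.bincount(labels, minlength=k)
    print(k, sizes.tolist())
```

Each pass adds one `cluster_{k}` column, which matches the column naming used in the cells above.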

In [15]:
df.head()

Unnamed: 0.1,Unnamed: 0,series,episode,line,char,text,cluster_5,cluster_10,cluster_20,cluster_50
0,1,dimension 20,The_Joy_of_Mistakes_(with_Kelly_Lynne_D%27Ange...,2,ADJUSTED,"Hello, one and all, and welcome back to anoth...",2,8,2,20
1,2,dimension 20,The_Joy_of_Mistakes_(with_Kelly_Lynne_D%27Ange...,3,ADJUSTED,"Heroes of Pure Heart."" She's also a dungeon m...",3,8,2,29
2,3,dimension 20,The_Joy_of_Mistakes_(with_Kelly_Lynne_D%27Ange...,4,Kelly,"Yeah, this is big, woo, the crowd goes wild. ...",2,0,6,40
3,4,dimension 20,The_Joy_of_Mistakes_(with_Kelly_Lynne_D%27Ange...,5,Brennan,"Flowers, (laughs) flowers, flowers. Kelly, it...",2,0,4,20
4,5,dimension 20,The_Joy_of_Mistakes_(with_Kelly_Lynne_D%27Ange...,6,Kelly,"Thanks for having me, Brennan. It's really ex...",2,0,4,20


In [16]:
# index=False keeps each save from adding another 'Unnamed: 0' column
df.to_csv('data/full_transcripts_cleaned.csv', index=False)
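
The `Unnamed: 0` and `Unnamed: 0.1` columns in the tables above are what pandas typically produces when a frame is saved with its index and then read back without declaring an index column. A small sketch of that round trip on toy data, using in-memory buffers instead of files:

```python
import io

import pandas as pd

toy = pd.DataFrame({'text': ['Okay.', 'Yeah.'], 'cluster_5': [1, 1]})

# Default behaviour: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the real columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))     # ['Unnamed: 0', 'text', 'cluster_5']
print(list(reloaded_without_index.columns))  # ['text', 'cluster_5']
```

For files that were already saved with an index, reading with `pd.read_csv(path, index_col=0)` is another way to avoid the extra column.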