# Document Clustering
### Using Embeddings

In [10]:
import json
import os
import random
import pandas as pd
import numpy as np
# from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans, MiniBatchKMeans

## Define Parameters

In [2]:
n_clusters = 10
n_samples = 10
n = 100
batch_size = 64

## Load Data

In [3]:
embeddings = np.load('data/embeddings_cleaned.npy')
embeddings.shape

(935818, 768)
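An array of this shape is large, so it may be worth memory-mapping the file rather than reading it into RAM in one go. The sketch below is not part of the original notebook. It writes a small toy array to a temporary file (a hypothetical stand-in for `data/embeddings_cleaned.npy`) and opens it with the `mmap_mode` argument of `np.load`.

```python
import os
import tempfile
import numpy as np

# Hypothetical small array standing in for data/embeddings_cleaned.npy
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy_embeddings.npy")
np.save(path, np.arange(12, dtype=np.float32).reshape(4, 3))

# mmap_mode="r" maps the file read-only instead of loading it all at once
mapped = np.load(path, mmap_mode="r")
print(mapped.shape, mapped.dtype)

# Slicing the memmap and wrapping it in np.asarray copies just those rows into memory
first_rows = np.asarray(mapped[:2])
print(first_rows)
```

Whether this helps in practice depends on the available memory and on how the downstream estimator accesses the data, so treat it as an option to try rather than a required step.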

In [4]:
df = pd.read_csv('data/full_transcripts_cleaned.csv')
df.head()

| Unnamed: 0.1 | Unnamed: 0 | series | episode | line | char | text |
|---|---|---|---|---|---|---|
| 0 | 0 | critical role | The_Echo_Tree.txt | 0 | MATT | (laughing) |
| 1 | 1 | critical role | The_Echo_Tree.txt | 1 | LAURA | And welcome back to tonight's episode. So, pi... |
| 2 | 2 | critical role | The_Echo_Tree.txt | 2 | MATT | Yeah. |
| 3 | 3 | critical role | The_Echo_Tree.txt | 3 | LAURA | It's been a long week. However, you guys have... |
| 4 | 4 | critical role | The_Echo_Tree.txt | 4 | MATT | Ahh, totally. |
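The cluster labels are later attached to `df` by position, so the embedding matrix and the transcript table need the same number of rows in the same order. The following is a minimal sketch of a row-count check, not part of the original notebook. `check_alignment` is a hypothetical helper, and the toy array and frame stand in for the real data. It cannot verify row order, only row count.

```python
import numpy as np
import pandas as pd

def check_alignment(embeddings: np.ndarray, frame: pd.DataFrame) -> None:
    """Raise ValueError if the embedding matrix and the table differ in row count."""
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {embeddings.shape}")
    if embeddings.shape[0] != len(frame):
        raise ValueError(
            f"{embeddings.shape[0]} embedding rows vs {len(frame)} transcript rows"
        )

# Toy stand-ins for the real files (hypothetical data)
toy_embeddings = np.zeros((3, 4), dtype=np.float32)
toy_frame = pd.DataFrame({"text": ["a", "b", "c"]})
check_alignment(toy_embeddings, toy_frame)  # passes silently when the counts match
```

In the notebook the equivalent call would be `check_alignment(embeddings, df)`, placed before any clustering is run.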


## Define Output Functions

In [5]:
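def sample_clusters(labels, n_clusters=n_clusters, n_samples=10):
    """Print each cluster's size and a random sample of its lines."""
    # The n_clusters default is fixed when the function is defined (10 here);
    # the calls below pass it explicitly.
    # Note: this writes the labels into the global `df` as a new column.
    df['labels'] = labels
    for i in range(n_clusters):
        cluster = df[df.labels == i]
        print(f'Cluster {i}: {len(cluster)} / {len(df)}')
        # replace=True lets clusters smaller than n_samples still be sampled
        sample = cluster.sample(n_samples, replace=True)['text']
        for x in sample:
            # str() guards against non-string values (e.g. NaN); show at most 100 characters
            print('\t- ' + str(x)[:100])
        print('\n')
        print('==='*30)
        print('\n')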
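Before reading the sampled lines, it can help to see how the rows are distributed across clusters. The following is a minimal sketch, not part of the original notebook. `cluster_sizes` is a hypothetical helper, and the label array is toy data standing in for `kmeans.labels_`.

```python
import numpy as np

def cluster_sizes(labels, n_clusters):
    """Return the number of rows assigned to each cluster id in [0, n_clusters)."""
    labels = np.asarray(labels)
    # minlength keeps empty clusters in the result as zeros
    return np.bincount(labels, minlength=n_clusters)

# Toy labels (hypothetical); in the notebook this would be kmeans.labels_
toy_labels = np.array([0, 0, 1, 3, 3, 3])
sizes = cluster_sizes(toy_labels, n_clusters=4)
for i, size in enumerate(sizes):
    print(f'Cluster {i}: {size} / {len(toy_labels)}')
```

A count of zero is worth noticing, because sampling from an empty cluster would fail in `sample_clusters`.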

## 5 Clusters

In [6]:
n_clusters = 5 
n_samples = 10

In [13]:
kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=42, batch_size=2048, max_iter=500)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 80033 / 935818
	-  Yes, yeah. 
	-  Yeah. 
	-  Yep. 
	-  All righty. 
	-  Okay!
	-  Yeah. 
	-  Yes?
	-  Okay! 
	-  All right. 
	-  Sure.




Cluster 1: 182804 / 935818
	-  Fjord, you're on deck.
	-  Can we hide?
	-  I have Guidance. 
	-  Amazing. You rise back up to face the erinyes above you. Following your turn, it is going to be the
	-  "Right." He tightens his grip on his axe. 
	-  And passes out again.
	-  "Oh. Mom!" Runs in and gives you a big hug. For a second, and then pushes you out and goes, "Hey." 
	-  (laughing) Make a perception check.
	-  Is he going to be looking for me? You got to do what you need to do.
	-  We're not going to evacuate several thousand people, it's just not going to happen.




Cluster 2: 92073 / 935818
	-  Plus performance. 
	-  Rolled a natural 18, so 28.
	-  Does a 17 hit? 
	-  That's a bad one. Eight.
	-  Eyy, that's nine plus three is 12 plus four is 16. 
	-  That is 14 points of piercing damage. 
	-  I'll say it's three-quarters cover. T
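The notebook repeats the same fit for 5, 10, 20, and 50 clusters in separate cells. A loop over candidate values is one way to compare them side by side. This is a minimal sketch on toy data, not the notebook's own method: `toy_embeddings`, `k_values`, and `inertia_by_k` are hypothetical names, the data comes from `make_blobs`, and the settings are scaled down from the ones used above.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Toy data (hypothetical): 600 points in 8 dimensions drawn around 4 centers
toy_embeddings, _ = make_blobs(n_samples=600, n_features=8, centers=4, random_state=42)

k_values = [2, 4, 8]
inertia_by_k = {}
for k in k_values:
    model = MiniBatchKMeans(n_clusters=k, random_state=42, batch_size=256,
                            max_iter=100, n_init=3)
    model.fit(toy_embeddings)
    # inertia_ is the sum of squared distances from each point to its assigned center
    inertia_by_k[k] = model.inertia_
    sizes = np.bincount(model.labels_, minlength=k)
    print(f'k={k}: inertia={model.inertia_:.1f}, sizes={sizes.tolist()}')
```

Inertia generally falls as the number of clusters grows, so it is more useful for spotting where the improvement levels off than for picking a value outright. Reading the sampled lines, as the notebook does, remains the check on whether the clusters are meaningful.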

## 10 Clusters

In [14]:
n_clusters = 10 
n_samples = 10

In [15]:
kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=42, batch_size=2048, max_iter=500)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 112665 / 935818
	-  I mean, it's a bonus action so I can't go very high with the level, can I?
	-  Don't hit the turtle.
	-  All right so… this would be Counterspell.
	-  16 misses. It swings out and you actually catch the spear off to the side and you're now in a stren
	-  Make an attack with disadvantage.
	-  I can Stone Shape it a little bit thinner.
	-  I wrap my hands in cloth, and then I do it.
	-  You got it, all right, cool. So you swing out of the way. Pike, you're up. You're conscious, you're
	-  No, I don't think, I don't know that I can-- I can't use an unarmed strike because I just used a sp
	-  That would be your one attack. 




Cluster 1: 149785 / 935818
	-  You could possibly end it.
	-  We could still run, I'm just saying.
	-  Hold on a second. Are you sure you don't want to smoke this bitch?
	-  All right. With a couple of provisions here, all right? One
	-  God. Save the plate for FCG. 
	-  All right, what the fuck are we going to do?
	-  No reason to spe

## 20 Clusters

In [16]:
n_clusters = 20 
n_samples = 10

In [17]:
kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=42, batch_size=2048, max_iter=500)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 48183 / 935818
	-  All right, roll damage. 
	-  Advantage and one re-roll, if you need it. 
	-  So I would be the only one who took damage if I hit that thing? 
	-  Grog is not in melee.
	-  I think the target of the attack has to be the target of the Cutting Words. Because yeah, it reduce
	-  The little bit of a pull would pull her a little more central and I'm going to use my superiority d
	-  You can use an attack to try and disarm it, if you'd like.
	-  That was action. Wait, hold on. I will use bonus action to do Second Wind on myself.
	-  We reroll ones for hit points.
	-  Then do it! Why don't you set a noise out behind it so it chases off in the other direction? 




Cluster 1: 80378 / 935818
	-  Okay. Let's go.
	-  Yeah cool, I'm just hanging.
	-  I'm going to keep looking at coins.
	-  I'm going to-- 
	-  Can I use-- oh shit. I'm going shitballs, shit.
	-  I'm too scared to do it. I'm too scared to do it. I think he's going to attack us soon. He sounded 
	-  I'm go

## 50 Clusters

In [18]:
n_clusters = 50 
n_samples = 10

In [19]:
kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=42, batch_size=2048, max_iter=500)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 28088 / 935818
	-  Here we go. I knew you were my brother. 
	-  You look great. 
	-  ♪ My hump, my hump, my hump, my hump ♪
	-  That's so cool.
	-  Thank you to our amazing crew.
	-  All right, and Tova's standing there looking strong and the camera's turning around her. It's like 
	-  I-- I made my own little placard for the-- I made my little cards. 
	-  Peachy-peachy.
	-  I'm being a nerd. Have fun!
	-  You'll thank me later. I'm just going to bap the sheet. 




Cluster 1: 23171 / 935818
	-  It's present.
	-  Cut the feed, cut the feed. 
	-  To what?
	-  Next year. That's right! Yeah, yeah. Oh my god. Aaah! Next year in 2021. 
	-  Which one? 
	-  It's true. Up. 
	-  The theme's running. 
	-  I'm just going to say
	-  This is you, this is it right here.
	-  Sit down.




Cluster 2: 18109 / 935818
	-  I need some supplies. 
	-  Do we have glitter? 
	-  I bought them. But I want Tiberius to make me more.
	-  Sure, yes. How much for four healing potions?
	-  How long have yo