# Embed Corpus
### Using Sentence Transformers

In [12]:
import json
import os
import random
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import pickle
from sklearn.cluster import KMeans

## Load Data

In [4]:
df = pd.read_csv('data/full_transcripts_cleaned.csv')

In [5]:
docs = df['text'].values.tolist()
docs

['(laughing)',
 " And welcome back to tonight's episode. So, picking up where we left off, Vox Machina, having to find and collect the artifacts known as the Vestiges of Divergence to prepare themselves for battle against the Chroma Conclave, a group of extremely dangerous chromatic dragons who are wreaking havoc on the Tal'Dorei countryside, and get vengeance for the loss of Vex and Vax's father in the past. Er, sorry, mother. The father's still around, and kind of a dick. ",
 ' Yeah.',
 " It's been a long week. However, you guys have been making your way into the Feywild in search of a specific item known as Fenthras, a bow of which Vex has been sort of keeping an eye on this whole time.",
 ' Ahh, totally.',
 " Just like that. (laughs) We're all having Gen Con plague residual, here. They've made their way through a number of strange, dangerous, and awkward encounters in the Feywild, including an encounter with their father, I meant to say earlier, in Syngorn, which is currently hidin

In [4]:
# create_df_multi and n are not defined in this notebook; they come from an earlier session.
docs = create_df_multi(n)
df = pd.DataFrame(docs)
print(df.shape)
df.head()

(60350, 1)


     0
0    Hello, everyone, and welcome to tonight's epis...
1    But no, seriously, don't do that. That's okay.
2    Still in front of a green screen.
3    Still in front of a green screen. Totally. Tha...
4    The merchandises? Well, you know, we do have t...


## Embed Docs



- Load Embedding Model
- Process Batches (may take a while)

In [7]:
# model = SentenceTransformer('all-MiniLM-L6-v2')
model = SentenceTransformer('all-mpnet-base-v2')

embeddings = model.encode(docs, batch_size=64, show_progress_bar=True)

print(embeddings[0].shape)

(768,)


In [9]:
np.save('data/embeddings_cleaned.npy', embeddings)
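
Since encoding is the slow step, it can be worth confirming that a saved array reloads intact before relying on it. The sketch below is only an illustration: it uses a small random array as a stand-in for the real embeddings and a temporary, hypothetical file name rather than `data/embeddings_cleaned.npy`.

```python
import os
import tempfile

import numpy as np

# Stand-in for the real (n_docs, 768) array returned by model.encode (assumption).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(4, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'embeddings_demo.npy')  # hypothetical file name
    np.save(path, fake_embeddings)
    restored = np.load(path)  # reads the whole array back into memory

# Shape, dtype, and values should all survive the round trip.
print(restored.shape, restored.dtype)
print(np.array_equal(fake_embeddings, restored))
```

With the real file, `np.load('data/embeddings_cleaned.npy')` should return the array without re-running the model.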

### Save embeddings just in case

In [11]:
embeddings[0].tolist()

[-0.01831851713359356,
 -0.014044173061847687,
 -0.01376888807862997,
 -0.020507827401161194,
 -0.0071776569820940495,
 0.009587233886122704,
 -0.03227889910340309,
 0.0074552916921675205,
 -0.0823998749256134,
 -0.02122035063803196,
 0.00803094170987606,
 -0.04777408391237259,
 -0.03849656879901886,
 0.007662064395844936,
 0.06783058494329453,
 -0.04066281020641327,
 0.023037610575556755,
 -0.004215782508254051,
 0.029753129929304123,
 -0.025848926976323128,
 0.009797895327210426,
 0.013975023292005062,
 -0.010840379633009434,
 0.015033246017992496,
 -0.051209207624197006,
 0.014495793730020523,
 0.03291277959942818,
 0.0015469701029360294,
 0.04282026365399361,
 -0.0343502052128315,
 0.017413655295968056,
 0.034994520246982574,
 -0.0022685567382723093,
 -0.014081939123570919,
 2.7196635983273154e-06,
 0.011549316346645355,
 0.006496992893517017,
 -0.0077456827275455,
 -0.05323972925543785,
 -0.00597526878118515,
 -0.009959000162780285,
 0.0606897808611393,
 -0.03696264699101448,
 0.0

In [12]:
# Store each embedding as a Python list; to_csv will write these lists out as strings.
df['embeddings'] = embeddings.tolist()
df.to_csv('data/doc_embeds.csv')
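
One thing to be aware of when reading such a CSV back: a column of lists is written as text, so each cell returns as a string rather than a list. The following is a minimal sketch of one way to recover the vectors, using a tiny made-up frame and an in-memory buffer instead of `data/doc_embeds.csv`. Parsing with `json.loads` is an assumption that holds only when the vectors contain finite floats.

```python
import io
import json

import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are made up for the demo.
demo = pd.DataFrame({'text': ['a', 'b']})
demo['embeddings'] = np.array([[0.1, 0.2], [0.3, 0.4]]).tolist()

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
raw_cell = loaded['embeddings'][0]
print(type(raw_cell))  # the list was serialized as text

# Parse each string back into a list, then stack into a 2-D array.
loaded['embeddings'] = loaded['embeddings'].apply(json.loads)
matrix = np.array(loaded['embeddings'].tolist())
print(matrix.shape)
```

For large corpora the `.npy` file is likely the more practical source for the vectors, with the CSV kept for the text.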

## Define Output Functions

In [15]:
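def sample_clusters(labels, n_clusters=None, n_samples=10):
    """Print each cluster's size and a random sample of its documents."""
    # Infer the cluster count from the labels instead of depending on a global default.
    if n_clusters is None:
        n_clusters = len(set(labels))
    df['labels'] = labels
    for i in range(n_clusters):
        cluster = df[df.labels == i]
        print(f'Cluster {i}: {len(cluster)} / {len(df)}')
        # Sample with replacement so clusters smaller than n_samples do not raise.
        sample = cluster.sample(n_samples, replace=True)[0]
        for x in sample:
            print('\t- ' + x[:100])  # show only the first 100 characters
        print('\n')
        print('==='*30)
        print('\n')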

## 5 Clusters

In [13]:
n_clusters = 5 
n_samples = 10

In [17]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)
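
Reading samples is a qualitative check. As a rough numeric complement, the same fit can be repeated over several cluster counts while recording inertia, silhouette score, and cluster sizes. The sketch below is an assumption-laden illustration: it runs on synthetic blobs rather than the real embeddings, and the values of `k` and the choice of metrics are not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the embedding matrix (assumption: 300 points, 8 dims).
X, _ = make_blobs(n_samples=300, centers=4, n_features=8, random_state=42)

results = {}
for k in (2, 4, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = {
        'inertia': km.inertia_,                         # within-cluster sum of squares
        'silhouette': silhouette_score(X, km.labels_),  # in [-1, 1], higher is better
        'sizes': np.bincount(km.labels_, minlength=k).tolist(),
    }

for k, r in results.items():
    print(f"k={k}: inertia={r['inertia']:.1f}, "
          f"silhouette={r['silhouette']:.3f}, sizes={r['sizes']}")
```

On the full corpus, `silhouette_score` can be slow because it compares pairs of points; its `sample_size` argument is one way to keep that manageable.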


## 10 Clusters

In [18]:
n_clusters = 10 
n_samples = 10

In [19]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 2465 / 60350
	- "I did not but good to know."
	- "Fair enough. Well, let's get you back there and talk to the boss, if he's awake. He should be." She
	- "I'm quite good with fire. That's not entirely impossible."
	- "Excuse us." He leads you guys back to where the cellar is.
	- "Me and the clerics might have to go see it, then."
	- Reaches into his satchel, pulls out a small leather ledger, past a few pages. There's little counts,
	- "It would-- it would be a change of pace that is quite welcome. Take a seat, both of you."
	- "Perhaps you have a third or fourth dagger I can try."
	- "That's wonderful. I think in the long run, that'll be a very useful feature, but it's also not that
	- "I don't judge. If you worship what you want, you can, just be careful about it."




Cluster 1: 5661 / 60350
	- 21 points of damage.
	- Does the repair take an action?
	- Oh shit, he's going to try to attack it. Can he do that?
	- So you've shot, reload, shot. And this turn--
	- (crossbow relo

## 20 Clusters

In [20]:
n_clusters = 20 
n_samples = 10

In [21]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 2365 / 60350
	- "Give me a little bit of time, I need to see if I can make this part of my communion. Hold on a seco
	- "But Utugash is strong but slow. Slovenly, sluggish. Has reveled in his own opulence for far too lon
	- "Not as often as I get woken up from my sleep. Which is never. Because only stupid people do that."
	- "What?! No! Not at all."
	- "Nice to meet you as well, Jester. Inside there is Jan. She helps run the place. Fantastic. I'll go 
	- "Of Taryon Darrington!"
	- "Adra."
	- "Okay." He goes ahead and continues the path of the Balleater.
	- She giggles in your ear. All right.
	- "No problem."




Cluster 1: 4482 / 60350
	- Cask. Cask of ale. Cask.
	- All of it. All of the research?
	- The barrel is too big, you barely got Dork the ox in there. Which, for the record, they did fit a sm
	- Like marble, or?
	- He has two rapiers, so it's like holding a stiletto.
	- I suppose that is a growth industry. This is interesting, I'm going to think it over. I'm going to 

## 50 Clusters

In [22]:
n_clusters = 50 
n_samples = 10

In [23]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 2315 / 60350
	- Yeah, there's no way you just thought that up.
	- Right. So we all heard that?
	- Good to know.
	- That's not how that works.
	- I mean, technically, yeah.
	- That's the thing.
	- Like super-secret spy. No, this is a very privileged role.
	- Oh, so it's going to--
	- I think so. I feel like I'm trying to learn that from you, Jester.
	- Yeah, you're right.




Cluster 1: 1268 / 60350
	- Suddenly he goes (muffled panic noises).
	- I telepathically tell Frumpkin to jump up and sit in the lap of the woman in the monk's robes who ju
	- She takes it and adds it to her tea.
	- And you see a bit of heartbreak but he nods his head and reaches out and takes you in, holds you the
	- Okay. She glances over at you as you come running around the corner, this giant engine of muscle and
	- "Apparently." The Gentleman now sits there. His comfortable attire today comes in a billowing linen,
	- He steps back. The lights dim on the inside of the temple as you exit. You guys step