# Embed Corpus
### Using Sentence Transformers

In [47]:
import json
import os
import random
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import pickle
# Note: every `KMeans` below is MiniBatchKMeans under an alias
from sklearn.cluster import MiniBatchKMeans as KMeans
from sklearn.manifold import TSNE

## Load Data

In [15]:
df = pd.read_csv('data/full_transcripts_cleaned.csv')

In [16]:
# Sample 50 distinct episodes; replace=False avoids drawing the same episode twice
selection = np.random.choice(df.episode.unique(), 50, replace=False)
df = df[df.episode.isin(selection)]

In [17]:
docs = df['text'].values.tolist()
docs

['Dimension 20Adventuring AcademySeason 2 Episode 16Animating Your Table (with Benjamin Scott)< [Previous Episode] | [Next Episode] >',
 "\xa0Hello, one at all. And welcome back to another thrilling episode of Adventuring Academy. I am your humble dungeon master Brennan Lee Mulligan. And this is Dropout’s home vodcast where we talk about all things related to tabletop RPGs. Running them at your home for your friends and going on all sorts of fantastical adventures. Oh my goodness! Our guest today, I could not be more excited. Our guest today is one of YouTube's premier animators and D&D storytellers that creates all kinds of funny relatable stories about D&D, the chaotic mishaps that happened at the table, emotional moments, problem players, all sorts of storied and fabled legends of tabletops all over the world. You know him, you love him, he's the creator of Puffin Forrest. My friend and yours, Benjamin Scott.(giggling)",
 '\xa0Thank you. I appreciate the intro.(laughing)',
 ' That w

## Embed Docs



- Load Embedding Model
- Process Batches (May Take a While)

In [18]:
# Smaller, faster alternative: SentenceTransformer('all-MiniLM-L6-v2')
model = SentenceTransformer('all-mpnet-base-v2')

# Encode every document on the GPU; requires a CUDA-capable device
embeddings = model.encode(docs, batch_size=64, show_progress_bar=True, device='cuda')

print(embeddings[0].shape)

Batches:   0%|          | 0/1092 [00:00<?, ?it/s]

(768,)
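
The `encode` call above hard-codes `device='cuda'`, which will typically raise an error on a machine without a GPU. A minimal sketch of a fallback follows. It assumes PyTorch (a dependency of sentence-transformers) is importable, and `pick_device` is a hypothetical helper name, not part of either library.

```python
def pick_device() -> str:
    """Return 'cuda' when PyTorch can see a CUDA GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to use
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)

# With the model loaded in the cell above, the call would become:
# embeddings = model.encode(docs, batch_size=64, show_progress_bar=True, device=device)
```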


In [19]:
np.save('data/embeddings_subset.npy', embeddings)
df.to_csv('data/transcript_subset.csv')
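
The saved array is only useful later if it still lines up row for row with the saved transcript subset. Below is a minimal sketch of reloading both files and checking that alignment. It writes to a temporary directory and uses toy stand-ins (`toy_embeddings`, `toy_df`) in place of the real data, so the names and sizes are assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-ins for the real embeddings matrix and transcript DataFrame
toy_embeddings = np.random.default_rng(0).normal(size=(4, 768)).astype(np.float32)
toy_df = pd.DataFrame({"text": ["a", "b", "c", "d"]})

with tempfile.TemporaryDirectory() as tmp:
    emb_path = os.path.join(tmp, "embeddings_subset.npy")
    csv_path = os.path.join(tmp, "transcript_subset.csv")

    # Save both artifacts side by side
    np.save(emb_path, toy_embeddings)
    toy_df.to_csv(csv_path, index=False)

    # Reload them as a later session would
    loaded_embeddings = np.load(emb_path)
    loaded_df = pd.read_csv(csv_path)

# One embedding row per transcript row
rows_match = len(loaded_df) == loaded_embeddings.shape[0]
print(loaded_embeddings.shape, rows_match)
```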

### Save embeddings just in case

In [11]:
embeddings[0].tolist()

[-0.01831851713359356,
 -0.014044173061847687,
 -0.01376888807862997,
 -0.020507827401161194,
 -0.0071776569820940495,
 0.009587233886122704,
 -0.03227889910340309,
 0.0074552916921675205,
 -0.0823998749256134,
 -0.02122035063803196,
 0.00803094170987606,
 -0.04777408391237259,
 -0.03849656879901886,
 0.007662064395844936,
 0.06783058494329453,
 -0.04066281020641327,
 0.023037610575556755,
 -0.004215782508254051,
 0.029753129929304123,
 -0.025848926976323128,
 0.009797895327210426,
 0.013975023292005062,
 -0.010840379633009434,
 0.015033246017992496,
 -0.051209207624197006,
 0.014495793730020523,
 0.03291277959942818,
 0.0015469701029360294,
 0.04282026365399361,
 -0.0343502052128315,
 0.017413655295968056,
 0.034994520246982574,
 -0.0022685567382723093,
 -0.014081939123570919,
 2.7196635983273154e-06,
 0.011549316346645355,
 0.006496992893517017,
 -0.0077456827275455,
 -0.05323972925543785,
 -0.00597526878118515,
 -0.009959000162780285,
 0.0606897808611393,
 -0.03696264699101448,
 0.0

In [12]:
# Store each embedding as a Python list in its own DataFrame cell
df['embeddings'] = embeddings.tolist()
df.to_csv('data/doc_embeds.csv')
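
One caveat with storing lists in a CSV column: when the file is read back, each cell is a string such as `"[0.1, -0.2, 0.3]"`, not a list. A minimal sketch of the round trip on a toy frame, using an in-memory buffer instead of `data/doc_embeds.csv` (the toy values are made up for illustration):

```python
import io
import json

import numpy as np
import pandas as pd

# Toy frame with a list-valued column, like df['embeddings'] above
toy = pd.DataFrame({
    "text": ["a", "b"],
    "embeddings": [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]],
})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Each cell is now a string; parse it back into floats
cell_type = type(restored["embeddings"][0])
restored_vectors = np.array([json.loads(cell) for cell in restored["embeddings"]])
print(cell_type, restored_vectors.shape)
```

For large matrices the `.npy` file saved earlier is usually the more compact and faster format; the CSV column is mainly convenient for inspection.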

## Define Output Functions

In [37]:
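def sample_clusters(labels, n_clusters=10, n_samples=10, column='labels'):
    """Store cluster labels on the global `df` and print sample texts per cluster."""
    df[column] = labels
    for i in range(n_clusters):
        members = df[df[column] == i]
        print(f'Cluster {i}: {len(members)} / {len(df)}')
        # replace=True lets clusters smaller than n_samples be sampled (with repeats)
        sample = members.sample(n_samples, replace=True)['text']
        for x in sample:
            # Show only the first 100 characters of each document
            print('\t- ' + x[:100])
        print('\n')
        print('===' * 30)
        print('\n')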

## 5 Clusters

In [38]:
n_clusters = 5 
n_samples = 10

In [39]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples, column='cluster_5')

Cluster 0: 9961 / 69873
	-   Oh, no, I‘m fine.   
	-  Thank you so much, thank you so much. (laughing) 
	-  “Nat 10.”
	-  Where'd the inspiration for this come from?" He lifts up the mask from the side of the table. You c
	-   Okay. And I feel like we could‘ve done more for Phandalin —  
	-  “All right. Then decide. I mean, if you wish to progress–”
	-    [laughs]   
	-  "It's all right. We mean not to take it from you."
	-  [soft] Okay!
	-    [imitating Columbo] Uh, just one more question. Uh, Adelaide —   




Cluster 1: 9894 / 69873
	-  Awesome!
	-   That‘s right.   
	-  Okay.
	-  Done.
	-  Okay. I will do that.
	-  All right.
	-  Yeah.
	-  Incredible.
	-   Yeah!   
	-  Damn




Cluster 2: 14191 / 69873
	-  You can do non-lethal damage to him. So you can do a full sneak attack and make it non-lethal, if y
	-  And here‟s the best part — that‟s a three plus one, that‟s a fo ur— but I‟m gonna use Psychic Blade
	-  All righty, so the necromancer that you just hit--
	-  That's enough to 

## 10 Clusters

In [40]:
n_clusters = 10 
n_samples = 10

In [41]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples, column='cluster_10')

Cluster 0: 9919 / 69873
	-  We traveled into the fucking nine hells to get Pike a set of armor. We went and battled a city of v
	-    It‘s legally binding. When you serve a subpoena, that‘s legally binding. You have to say —you hav
	-  I just look up, covered like Carrie, in the movie. Just dripping with fake blood.
	-  The thing under the tree?
	-  Pens you a new contract with him on the back. Says,
	-  Fear actually happens in a giant ass cone, I believe.
	-   She picks up, like, a baseball cap, and is like, already imprinting “Nermal’s Pile” on it. She’s n
	-  So fucking sick! You dream about it every day! And then just devils are real and fucking, you can j
	-  All magical, I assume. I assume they're glowing.
	-  “Whaska is one of the few giants that still maintains the privilege of freedom. There was a coup wi




Cluster 1: 9952 / 69873
	-  Make a perception check.
	-  Maybe a birthday card.
	-  Right, but had I told you, you would've been less injured by the sudden betrayal, and

## 20 Clusters

In [42]:
n_clusters = 20 
n_samples = 10

In [43]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples, column='cluster_20')

Cluster 0: 5386 / 69873
	-  We'll have to throw a hand ax at him.
	-  You've trained with Jackson in this exact talent.
	-   He takes you in  his arms and he cradles you. His fur is  so much softer  than you thought it was 
	-  He owes me 20 bucks!   Cloaca Patron  [Travis] 
	-    [laughs] Travis fuckin‘ said Barns a nd Nobles out loud and blacked out.   
	-  Wait a minute, if Markus is there then who's flying the ship?[everyone laughing]
	-  He fuck her so hard in one of them, they break the bed. He vampire fucked her so hard.
	-   Yoda style.   
	-  Or like commanding his troops. Not like magically commanding.
	-  Yeah, he might not have it.




Cluster 1: 5960 / 69873
	-  What about best out of five? You're probably so bored with life; why not?
	-  I didn't actually want it. I just wanted to best you.
	-    I wish I could, too.   
	-  Oh, baby. Hold on one second.
	-  Don't stretch out the elastic, like, adjust it a little, your head's a lot-- aww. 
	-   Okay , read your poem.   
	-

## 50 Clusters

In [45]:
n_clusters = 50 
n_samples = 10

In [46]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples, column='cluster_50')

Cluster 0: 1176 / 69873
	-  "Really?" 
	-  "To the left, sir." 
	-  "That is good. I'm proud of you. Do you wish to train?"
	-  "It would behoove you to tell me. I would like to keep this a very, very positive relationship. It'
	-  He looks back at Horris. Horris is like, "Right. Well, if you come across any dark physicianry, ple
	-  "But of course." 
	-  “This one's almost ruined us a few times.”
	-  “None yet, before you came in and rudely interrupted my update from Verath over there, but it seems
	-  “Here.” And he takes a moment to think of which house you're referring to.
	-  "Stalwart warrior of our mountain secrets."




Cluster 1: 1855 / 69873
	-  Actually yeah. You're a little thicker than Scanlan. Got some–
	-  The problem with like, being a monk, is you really... you do have to beat things to death?   
	-  Hey, man, yeah. I think it's bullshit. That's not how I roll back in my day. Look, I'm not a great 
	-  So, you're still in heavy leathers.
	-    Is this a choice I'm havi
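
The four runs above differ only in `n_clusters` and are compared by reading samples. A numeric score can complement that reading. The sketch below loops over several values of k and computes a silhouette score for each. The silhouette score is a technique not used in this notebook, and the data is a synthetic stand-in (`toy_embeddings`), not the real embeddings.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Synthetic stand-in for the embeddings: three blobs in 16 dimensions
centers = rng.normal(scale=10.0, size=(3, 16))
toy_embeddings = np.vstack([c + rng.normal(size=(100, 16)) for c in centers])

scores = {}
for k in (2, 3, 5, 8):
    km = MiniBatchKMeans(n_clusters=k, random_state=42, n_init=3).fit(toy_embeddings)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
    scores[k] = silhouette_score(toy_embeddings, km.labels_)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
```

On the full 69,873-row matrix, passing `sample_size` to `silhouette_score` would keep the pairwise-distance cost manageable.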

## t-SNE Projection

In [56]:
# Use a separate name so the SentenceTransformer in `model` is not overwritten
tsne = TSNE(n_components=2,
            perplexity=60,
            early_exaggeration=6)

dim_reduce = tsne.fit_transform(embeddings)
dim_reduce[:5]



array([[-20.495972 ,  19.639614 ],
       [-16.085897 ,  19.828945 ],
       [-11.513922 ,  33.263107 ],
       [  1.9483192,  30.601934 ],
       [-19.034723 ,  21.996254 ]], dtype=float32)
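
t-SNE is stochastic, so the coordinates above will differ between runs unless a seed is set, and it is slow on 768-dimensional input. A common approach is to reduce the dimensionality with PCA first and to fix `random_state`. That is an assumption here, not what this notebook does. A minimal sketch on a small synthetic matrix (`toy_embeddings`), with a perplexity chosen to suit the toy size:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for the embeddings: 120 vectors of 64 dimensions
toy_embeddings = rng.normal(size=(120, 64)).astype(np.float32)

# Step 1: linear reduction to 20 dimensions
reduced = PCA(n_components=20, random_state=0).fit_transform(toy_embeddings)

# Step 2: t-SNE down to 2 dimensions with a fixed seed
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
coords = tsne.fit_transform(reduced)
print(coords.shape)
```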

In [53]:
df['x1'] = dim_reduce[:,0]
df['y1'] = dim_reduce[:,1]

In [54]:
df.to_csv('data/transcript_subset.csv')