# Embed Corpus
### Using Sentence Transformers

In [None]:
! pip install sentence_transformers
! pip install hdbscan

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting hdbscan
  Downloading hdbscan-0.8.28.tar.gz (5.2 MB)
     |████████████████████████████████| 5.2 MB 8.3 MB/s 
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: hdbscan
  Building wheel for hdbscan (PEP 517) ... done
  Created wheel for hdbscan: filename=hdbscan-0.8.28-cp37-cp37m-linux_x86_64.whl size=2340271 sha256=c02a6a567b277860968e457f042a9110f1874fbc623644eb28db1ee9d4ff25b6
  Stored in directory: /root/.cache/pip/wheels/6e/7a/5e/259ccc841c085fc41b99ef4a71e896b62f5161f2bc8a14c97a
Successfully built hdbscan
Installing collected packages: hdbscan
Successfully installed hdbscan-0.8.28


In [None]:
import json
import os
import random
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import pickle
from sklearn.cluster import KMeans  # needed by the clustering cells below

## Load Data

In [None]:
df = pd.read_csv('full_transcripts_cleaned.csv')

In [None]:
docs = df['text'].values.tolist()
docs

['\xa0Hello, one and all, and welcome back to another exciting episode of "Adventuring Academy," Dimension 20\'s vodcast, where we talk about all things gaming and tabletop, and how to play, and what different theories and practices work the best at your table and tables all over. Thank you so much for tuning in. Oh my gosh, everybody, today I am so excited for the incredible guest who\'s gonna be here talking with us about all things tabletop. You know her work. She is an animation TV writer for "My Little Pony',
 ' Heroes of Pure Heart." She\'s also a dungeon master. You\'ve seen her dungeon mastering on "Girls Guts Glory," which is streamed from Wizards of the Coast. And you also can find her musical that she wrote and directed, "STARRY," the musical, available on Spotify right now, as well as an upcoming project, which is gonna be incredible and D&D, an all-native D&D actual play show coming soon to a live stream near you. We are so excited to have her here on the show. Please give

## Embed Docs



- Load Embedding Model
- Process Batches (May Take a While)
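Because encoding a large corpus can run for a long time, it can help to process it in coarse chunks so that partial results could be saved along the way. The following is a minimal sketch of that idea, not the notebook's actual method: `encode_in_chunks`, `fake_encode`, and `chunk_size` are hypothetical names, and the stand-in encoder only exists so the example runs without downloading a model.

```python
import math
import numpy as np

def encode_in_chunks(docs, encode_fn, chunk_size=1000):
    """Encode a non-empty list of docs chunk by chunk and stack the results."""
    parts = []
    n_chunks = math.ceil(len(docs) / chunk_size)
    for i in range(n_chunks):
        chunk = docs[i * chunk_size:(i + 1) * chunk_size]
        # encode_fn is any callable that maps a list of strings to a 2-D array
        parts.append(np.asarray(encode_fn(chunk)))
    return np.vstack(parts)

# Stand-in encoder for illustration only: maps each text to a 3-d vector.
def fake_encode(texts):
    return np.array([[len(t), t.count(' '), 1.0] for t in texts], dtype=np.float32)

demo_docs = ['hello there', 'roll for initiative', 'ok', 'natural twenty', 'thanks']
demo_embeddings = encode_in_chunks(demo_docs, fake_encode, chunk_size=2)
print(demo_embeddings.shape)
```

With a real model, `model.encode` could be passed as `encode_fn`. Note that `encode` already batches internally through its `batch_size` argument, so the outer chunking only adds natural points at which to checkpoint.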

In [None]:
# model = SentenceTransformer('all-MiniLM-L6-v2')
model = SentenceTransformer('all-mpnet-base-v2')

embeddings = model.encode(docs, batch_size=456, show_progress_bar=True)

print(embeddings[0].shape)

Batches:   0%|          | 0/3043 [00:00<?, ?it/s]

(768,)


### Save embeddings just in case

In [None]:
# Assumes Google Drive is already mounted at /content/drive
np.save('/content/drive/MyDrive/embeddings_cleaned.npy', embeddings)
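A quick way to confirm that a saved `.npy` file can be read back intact is a save/load round trip. The sketch below uses a small synthetic array in place of the real embeddings and a temporary directory in place of Google Drive; the file name `embeddings_demo.npy` and the array shape are assumptions for illustration.

```python
import os
import tempfile
import numpy as np

# Synthetic stand-in for the real embeddings array (shape and dtype are assumptions).
demo = np.random.default_rng(0).normal(size=(4, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'embeddings_demo.npy')  # hypothetical file name
    np.save(path, demo)
    restored = np.load(path)

# The restored array should match the original in shape and in every value.
round_trip_ok = restored.shape == demo.shape and np.array_equal(restored, demo)
print(round_trip_ok)
```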

In [None]:
embeddings[0].tolist()

[-0.002142562298104167,
 -0.026236597448587418,
 -0.01574837788939476,
 0.025495609268546104,
 0.03486192971467972,
 -0.017398877069354057,
 0.02647981233894825,
 0.03225597366690636,
 -0.05187968537211418,
 0.03522002696990967,
 0.045982133597135544,
 -0.0720503106713295,
 -0.014020022004842758,
 0.044196199625730515,
 0.06478025019168854,
 -0.07172895967960358,
 0.01821950636804104,
 0.011060502380132675,
 -0.07510791718959808,
 0.03720562160015106,
 -0.01036227960139513,
 0.015437924303114414,
 -0.04318411648273468,
 0.010663907043635845,
 -0.03388722985982895,
 -0.050251271575689316,
 -0.00638780789449811,
 0.009859152138233185,
 0.015568800270557404,
 0.014551300555467606,
 0.046619538217782974,
 -0.03287293761968613,
 0.0345999039709568,
 0.0197263415902853,
 2.3765019250276964e-06,
 0.004195962566882372,
 -0.030414476990699768,
 -0.0006133782444521785,
 -0.008546890690922737,
 0.013090326450765133,
 0.013666542246937752,
 0.04005613923072815,
 -0.03137534484267235,
 0.0355290472

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
os.makedirs('data', exist_ok=True)  # to_csv does not create missing directories
df['embeddings'] = embeddings.tolist()
df.to_csv('data/doc_embeds.csv')
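One caveat with storing a list-valued column in a CSV is that each list is written as its string representation, so it has to be parsed when the file is read back. The sketch below illustrates this with a tiny synthetic DataFrame and an in-memory buffer; the `demo_*` names are hypothetical, and parsing with `json.loads` assumes the vectors contain only finite floats.

```python
import io
import json
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the transcript DataFrame and its embeddings.
demo_df = pd.DataFrame({'text': ['first line', 'second line']})
demo_vectors = np.array([[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]])
demo_df['embeddings'] = demo_vectors.tolist()

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# After the round trip, each cell is a string such as '[0.1, -0.2, 0.3]'.
is_string = isinstance(loaded['embeddings'].iloc[0], str)

# Parse the strings back into lists, then rebuild the 2-D array.
loaded['embeddings'] = loaded['embeddings'].apply(json.loads)
restored_vectors = np.array(loaded['embeddings'].tolist())
print(is_string, restored_vectors.shape)
```

For large corpora the `.npy` file is likely the more compact and faster format; the CSV is mainly convenient for keeping text and vectors side by side.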

## Define Output Functions

In [None]:
def sample_clusters(labels, n_clusters, n_samples=10):
    # Attach the cluster labels to the global DataFrame
    df['labels'] = labels
    for i in range(n_clusters):
        cluster = df[df.labels == i]
        print(f'Cluster {i}: {len(cluster)} / {len(df)}')
        # Sample from the 'text' column; replace=True allows clusters smaller than n_samples
        sample = cluster['text'].sample(n_samples, replace=True)
        for x in sample:
            print('\t- ' + x[:100])
        print('\n')
        print('==='*30)
        print('\n')
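Alongside the sampled lines, the per-cluster sizes can be summarized in one step. This is a small sketch using `np.bincount` on made-up labels; `demo_labels` and `n_demo_clusters` are hypothetical stand-ins for real `kmeans.labels_` and `n_clusters`.

```python
import numpy as np

# Made-up cluster assignments for eight documents across three clusters.
demo_labels = np.array([0, 2, 1, 0, 0, 2, 1, 0])
n_demo_clusters = 3

# Count how many documents fall into each cluster, including empty ones.
sizes = np.bincount(demo_labels, minlength=n_demo_clusters)
shares = sizes / sizes.sum()

for cluster_id, (size, share) in enumerate(zip(sizes, shares)):
    print(f'Cluster {cluster_id}: {size} / {len(demo_labels)} ({share:.1%})')
```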

## 5 Clusters

In [None]:
n_clusters = 5 
n_samples = 10

In [None]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)
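The same fit-and-inspect steps are repeated below for other cluster counts, so they could also be written as a single loop. The following is a sketch on synthetic data rather than the real embeddings: the blob data, the `demo_embeddings` and `labels_by_k` names, and the explicit `n_init=10` are all assumptions made so the example runs quickly and behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for the embeddings: three loose blobs in 8 dimensions.
centers = rng.normal(scale=5.0, size=(3, 8))
demo_embeddings = np.vstack([c + rng.normal(size=(50, 8)) for c in centers])

labels_by_k = {}
for k in (2, 3, 5):
    # Fit one model per cluster count and keep its label assignments.
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(demo_embeddings)
    labels_by_k[k] = km.labels_
    print(k, np.bincount(km.labels_, minlength=k).tolist())
```

In the notebook itself, the loop body would call `sample_clusters(km.labels_, n_clusters=k, n_samples=n_samples)` instead of printing the counts.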

## 10 Clusters

In [None]:
n_clusters = 10 
n_samples = 10

In [None]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 2465 / 60350
	- "I did not but good to know."
	- "Fair enough. Well, let's get you back there and talk to the boss, if he's awake. He should be." She
	- "I'm quite good with fire. That's not entirely impossible."
	- "Excuse us." He leads you guys back to where the cellar is.
	- "Me and the clerics might have to go see it, then."
	- Reaches into his satchel, pulls out a small leather ledger, past a few pages. There's little counts,
	- "It would-- it would be a change of pace that is quite welcome. Take a seat, both of you."
	- "Perhaps you have a third or fourth dagger I can try."
	- "That's wonderful. I think in the long run, that'll be a very useful feature, but it's also not that
	- "I don't judge. If you worship what you want, you can, just be careful about it."




Cluster 1: 5661 / 60350
	- 21 points of damage.
	- Does the repair take an action?
	- Oh shit, he's going to try to attack it. Can he do that?
	- So you've shot, reload, shot. And this turn--
	- (crossbow relo

## 20 Clusters

In [None]:
n_clusters = 20 
n_samples = 10

In [None]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 2365 / 60350
	- "Give me a little bit of time, I need to see if I can make this part of my communion. Hold on a seco
	- "But Utugash is strong but slow. Slovenly, sluggish. Has reveled in his own opulence for far too lon
	- "Not as often as I get woken up from my sleep. Which is never. Because only stupid people do that."
	- "What?! No! Not at all."
	- "Nice to meet you as well, Jester. Inside there is Jan. She helps run the place. Fantastic. I'll go 
	- "Of Taryon Darrington!"
	- "Adra."
	- "Okay." He goes ahead and continues the path of the Balleater.
	- She giggles in your ear. All right.
	- "No problem."




Cluster 1: 4482 / 60350
	- Cask. Cask of ale. Cask.
	- All of it. All of the research?
	- The barrel is too big, you barely got Dork the ox in there. Which, for the record, they did fit a sm
	- Like marble, or?
	- He has two rapiers, so it's like holding a stiletto.
	- I suppose that is a growth industry. This is interesting, I'm going to think it over. I'm going to 

## 50 Clusters

In [None]:
n_clusters = 50 
n_samples = 10

In [None]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(embeddings)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 2315 / 60350
	- Yeah, there's no way you just thought that up.
	- Right. So we all heard that?
	- Good to know.
	- That's not how that works.
	- I mean, technically, yeah.
	- That's the thing.
	- Like super-secret spy. No, this is a very privileged role.
	- Oh, so it's going to--
	- I think so. I feel like I'm trying to learn that from you, Jester.
	- Yeah, you're right.




Cluster 1: 1268 / 60350
	- Suddenly he goes (muffled panic noises).
	- I telepathically tell Frumpkin to jump up and sit in the lap of the woman in the monk's robes who ju
	- She takes it and adds it to her tea.
	- And you see a bit of heartbreak but he nods his head and reaches out and takes you in, holds you the
	- Okay. She glances over at you as you come running around the corner, this giant engine of muscle and
	- "Apparently." The Gentleman now sits there. His comfortable attire today comes in a billowing linen,
	- He steps back. The lights dim on the inside of the temple as you exit. You guys step