# Topic Modeling
### Without Preprocessing

In [1]:
import json
import os
import random
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

## Define Parameters

In [42]:
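n_clusters = 5   # number of K-means clusters
n = 20           # number of transcripts to load
max_df = 0.01    # ignore terms found in more than 1% of documents
min_df = 0.0001  # ignore terms found in fewer than 0.01% of documents
n_samples = 10   # utterances to print per cluster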

## Load Data

In [3]:
def create_df_multi(n=25):
    """Load n randomly chosen transcripts and return one row per speaker turn."""
    list_of_text = []
    data_dir = 'data/aligned data/c=4'

    files = os.listdir(data_dir)
    # random.choices samples with replacement, so a transcript can be drawn more than once.
    sampled_files = random.choices(files, k=n)

    for filename in sampled_files:
        # Read the sampled file and close it again when done.
        with open(os.path.join(data_dir, filename)) as f:
            data = json.load(f)

        # Join the utterances of each turn into a single document.
        for x in data:
            for y in x['TURNS']:
                text = ' '.join(y['UTTERANCES'])
                list_of_text.append(text)
    df = pd.DataFrame(list_of_text)
    return df
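
The loader above implies a particular JSON layout: a list of chunks, each with a `TURNS` list, where every turn carries a list of `UTTERANCES` strings. The sketch below flattens a small in-memory example the same way. The example data is hypothetical, and real transcript files may contain further keys that the loader does not read.

```python
# Hypothetical transcript with the layout implied by create_df_multi:
# a list of chunks, each holding turns, each turn holding utterance strings.
example_transcript = [
    {
        "TURNS": [
            {"UTTERANCES": ["First sentence of a turn.", "Second sentence."]},
            {"UTTERANCES": ["A one-sentence turn."]},
        ]
    },
    {
        "TURNS": [
            {"UTTERANCES": ["A turn from the next chunk."]},
        ]
    },
]

# One document per turn: the turn's utterances joined with spaces.
turn_texts = [
    " ".join(turn["UTTERANCES"])
    for chunk in example_transcript
    for turn in chunk["TURNS"]
]
print(turn_texts)
```

Each element of `turn_texts` corresponds to one row of the DataFrame returned by `create_df_multi`.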

In [4]:
df = create_df_multi(n)
print(df.shape)
df.head()

(50767, 1)


| | 0 |
|---|---|
| 0 | Hello everyone, and welcome to tonight's episo... |
| 1 | We play Dungeons & Dragons! |
| 2 | We should just get a sound cue for that at thi... |
| 3 | It's expected now. |
| 4 | It's like our version of (air horn). |


## Vectorize Data



- Count Vectorizer
- Tfidf Vectorizer

Both vectorizers are used with each model.
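
Because `max_df=0.01` and `min_df=0.0001` are floats, scikit-learn reads them as proportions of documents rather than absolute counts. For the 50,767 turns loaded above, that is 0.01 × 50767 = 507.67 and 0.0001 × 50767 = 5.0767, so a term should be kept only if it appears in at least 6 and at most 507 turns. The sketch below shows the same filtering on a hypothetical four-document corpus, using thresholds chosen to make the effect visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "the" is in 4 of 4 documents, "dragon" in 2 of 4,
# and every other word in exactly 1 of 4.
demo_docs = [
    "the dragon attacks",
    "the dragon sleeps",
    "the party rests",
    "the rogue hides",
]

# Float thresholds are proportions of documents:
#   max_df=0.5 drops terms found in more than 50% of documents ("the")
#   min_df=0.5 drops terms found in fewer than 50% of documents (the singletons)
demo_cv = CountVectorizer(max_df=0.5, min_df=0.5)
demo_cv.fit(demo_docs)

demo_vocab = sorted(demo_cv.vocabulary_)
print(demo_vocab)
```

Only `dragon` survives both filters here. With the notebook's much tighter `max_df`, any word that occurs in more than about 1% of turns is removed before clustering.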

In [14]:
# tfidf=TfidfVectorizer(stop_words='english',max_df=.7,min_df=2,token_pattern=r'(?u)\b[A-Za-z]+\b')

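# Keep only purely alphabetic tokens; max_df and min_df are proportions of documents.
tfidf = TfidfVectorizer(
    stop_words='english',
    max_df=max_df,
    min_df=min_df,
    token_pattern=r'(?u)\b[A-Za-z]+\b',
)
tfidf_sparse = tfidf.fit_transform(df[0])
print(tfidf_sparse.shape)
# Dense copy of the sparse matrix, used as input to K-means below.
tfidf_df = pd.DataFrame(tfidf_sparse.toarray())
tfidf_df.tail()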

(50767, 4985)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4975,4976,4977,4978,4979,4980,4981,4982,4983,4984
50762,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50763,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50764,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50766,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
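# Same vocabulary settings as the TF-IDF vectorizer, but with raw term counts.
cv = CountVectorizer(
    stop_words='english',
    max_df=max_df,
    min_df=min_df,
    token_pattern=r'(?u)\b[A-Za-z]+\b',
)
cv_sparse = cv.fit_transform(df[0])
print(cv_sparse.shape)
# Dense copy of the sparse matrix, used as input to K-means below.
cv_df = pd.DataFrame(cv_sparse.toarray())
cv_df.tail()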

(50767, 4985)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4975,4976,4977,4978,4979,4980,4981,4982,4983,4984
50762,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
50763,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
50764,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
50765,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
50766,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
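
Each dense DataFrame above holds 50767 × 4985 values, roughly 253 million cells, almost all of them zero. scikit-learn's `KMeans` also accepts SciPy sparse matrices directly, which avoids the dense copy. The sketch below shows this on a hypothetical toy corpus. It is an alternative to the dense DataFrames, not what the cells in this notebook do.

```python
from scipy.sparse import issparse
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the transcript turns.
toy_docs = [
    "roll for initiative",
    "roll a dexterity check",
    "we sail to the port",
    "the ship reaches port",
]

# fit_transform returns a SciPy sparse matrix; no .toarray() call is needed.
toy_sparse = CountVectorizer().fit_transform(toy_docs)

# n_init is set explicitly because its default differs across scikit-learn versions.
toy_model = KMeans(n_clusters=2, random_state=42, n_init=10).fit(toy_sparse)

print(issparse(toy_sparse), toy_sparse.shape)
print(toy_model.labels_)
```

In the notebook this would mean passing `cv_sparse` or `tfidf_sparse` to `fit` in place of `cv_df` or `tfidf_df`.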


## Define Output Functions

In [56]:
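def sample_clusters(labels, n_clusters=n_clusters, n_samples=10):
    """Print each cluster's size and a random sample of its utterances."""
    df['labels'] = labels  # stores the labels on the global df
    for i in range(n_clusters):
        cluster = df[df.labels == i]
        print(f'Cluster {i}: {len(cluster)} / {len(df)}')
        # Sampling with replacement keeps this working for clusters smaller
        # than n_samples, but the same utterance can be printed more than once.
        sample = cluster.sample(n_samples, replace=True)[0]
        for x in sample:
            print('\t- ' + x[:100])  # truncate long utterances to 100 characters
        print('\n')
        print('==='*30)
        print('\n')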


## 5 Clusters

In [43]:
n_clusters = 5 
n_samples = 10


In [40]:
# K-means on the raw term counts
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans = kmeans.fit(cv_df)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0:
	- It's hard.
	- Let's go.
	- Aw. These were lame pirates! No treasure.
	- And the big hat?
	- "Well that's good, eh?"
	- I don't want--
	- We're running to the tree. So we can Transport Via Plants, was the thought.
	- What is that?!
	- As long as we abstain as well as we can from violence and as long as your men in the crew understand
	- Beau, you got this.




Cluster 1:
	- "I ask you, each and every one of you--" You hear some children giggle as he makes eye contact. "--g
	- "Come. Come. We shall speak, there is much to discuss, and it is growing cold out here, and you coul
	- That is eight points of lightning damage against you. Against you, Nott-- that is a 22 to hit. You t
	- Okay. So as you watch them both walk past you, one of them continues past with his hammer out, rushi
	- As the three of you complete your incantations, the energy unleashed from the arcane and divine scho
	- Okay. Cerkonos comes in. The first thing you notice-- previously, you had seen what was hi

In [41]:
# K-means on the TF-IDF features
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans = kmeans.fit(tfidf_df)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0:
	- Oh, god.
	- Oh my god, why is he doing that?
	- Oh god. I miss Pass Without a Trace so much.
	- Oh god!
	- God damn it. Okay.
	- He might have saw the face of a god.
	- If he's the god, we just go in behind him as the god.
	- Getting to know his new god.
	- Thank god! Such an asshole.
	- Oh my god.




Cluster 1:
	- Yeah, we all run out that--
	- I'm going to run up where Fjord disappeared.
	- All right, I'm going to use my bonus action to dash and run 60 feet after Percy to catch up.
	- Yes, that's the point of this, to get Trinket to run into you is the point. His attack bonus is plus
	- What's going on in your past, and if that means we have to go fight some shit, we're with you, and i
	- I would like to use my movement and my bonus action to run. I want to beeline towards the rogue. How
	- That's good, because who knows what else we'll run into while we're down here.
	- You've got to run. Just run.
	- I run back inside.
	- I was really tricksy because I would sit ther

## 10 Clusters

In [44]:
n_clusters = 10 
n_samples = 10

In [49]:
# K-means on the raw term counts
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans = kmeans.fit(cv_df)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0:
	- Thank you.
	- I'm going to--
	- Look, could we hear Horace when he was interrogating him?
	- No, this chamber is far further below. It's 40 feet below where you guys initially came through, so 
	- Oh, one of the chests.
	- Right, remember, we were going to let him stay him the guest room, but he was like, I'm going to sta
	- I think this might be a good look for me. How do I get it to stay?
	- Okay, make another dexterity check.
	- Baby steps.
	- I tried to save her. I was like, get the hint, what's going on?




Cluster 1:
	- Eyes previously locked on the frightening creature now seek a source and find, atop a platform withi
	- Eyes previously locked on the frightening creature now seek a source and find, atop a platform withi
	- Eyes previously locked on the frightening creature now seek a source and find, atop a platform withi
	- Eyes previously locked on the frightening creature now seek a source and find, atop a platform withi
	- Eyes previously locked on the frighte

In [50]:
# K-means on the TF-IDF features
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans = kmeans.fit(tfidf_df)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0:
	- Thank gods.
	- Yes, but their shelters are destroyed now, so.
	- Yeah, that was a much better roll. So that's 24 points of damage.
	- Okay.
	- Does it look like there's a central part where there's a lot of people gathered?
	- That's going to take a lot of thinking.
	- 12 points healed to you, Vex. All right. The mage, who is currently held in the whirlwind-- let's se
	- It's about to drop.
	- I don't want to enrich all of the--
	- And we can see it-- sorry.




Cluster 1:
	- She painted the ocean. It was amazing.
	- Amazing.
	- Yep! That's amazing.
	- That's amazing.
	- I. Was amazing. (all laugh)
	- Amazing.
	- That was amazing! You are so drenched right now.
	- Amazing.
	- I want to be on the flying carpet, 'cause that's amazing.
	- And the hospital was amazing.




Cluster 2:
	- I'm like the most wicked narwhal of all time! Give me a pool! (laughs)
	- I'm like the most wicked narwhal of all time! Give me a pool! (laughs)
	- (laughs)
	- Okay. (laughs)
	- (laughs) Yeah.

## 20 Clusters

In [51]:
n_clusters = 20 
n_samples = 10

In [55]:
# K-means on the raw term counts
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans = kmeans.fit(cv_df)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 33
	- "Oh, just drop me off at the nearest port you get to, please."
	- There ain't no fucking port around here, motherfucker.
	- Shouldn't we get to a port or something before we--?
	- It's a resource stop. It might have some repairs, in your experience. It's not a major port by any m
	- "Oh, just drop me off at the nearest port you get to, please."
	- "Oh, just drop me off at the nearest port you get to, please."
	- In the room you're in currently, there is not a port window.
	- Before port.
	- Because this is on a trade run, and this whole isle is part of that, could we make port for more ext
	- Look at the port window! Is there a port window?




Cluster 1: 48221
	- Percy? I'll try to back towards Percy.
	- Seems pretty relevant.
	- Are you lying to me to get me to pay you for a new gun?
	- He's not in a frenzied rage, though.
	- I'm going to take out my swords, and just I'm going to try and cleave some of these tentacles off. S
	- You guys have to get out of that tunnel

In [57]:
# K-means on the TF-IDF features
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans = kmeans.fit(tfidf_df)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 33 / 50767
	- It's a resource stop. It might have some repairs, in your experience. It's not a major port by any m
	- Starboard? Port?
	- We're heading to port regardless. I figure at some point--
	- Port Damali.
	- In the room you're in currently, there is not a port window.
	- Is this perhaps something to do with Sir Cadigan and Port Damali stuff? Do you know a Cadigan? Would
	- As a note, too, when the Squalleater came to port that night, Darktow was resupplied with cannonball
	- Well I came up from the town of Port Damali.
	- Port Damali.
	- Well I came up from the town of Port Damali.




Cluster 1: 48221 / 50767
	- Yeah!
	- Uh.
	- Yeah.
	- I'm not looking at your notes, I'm just looking through the wall.
	- I got it, I got it, I got it! (laughter)
	- Can you disarm the traps so you don't keep just setting them off?
	- Oh, I was about to say, and--
	- They wiped my poo away.
	- (singing) Whitestone saw the fire.
	- No. That's a kill shot.




Cluster 2: 189 / 50767
	- Y