# Document Clustering
### Without Preprocessing

In [1]:
import json
import os
import random
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

## Define Parameters

In [7]:
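n_clusters = 5
n = 20  # number of transcripts to sample
max_df = 0.1  # drop terms that appear in more than 10% of utterances
min_df = 0.0001  # drop terms that appear in fewer than 0.01% of utterances
n_samples = 10  # utterances to print per cluster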
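
When `max_df` and `min_df` are floats, scikit-learn treats them as fractions of the number of documents: terms with a document frequency above `max_df` or below `min_df` are dropped from the vocabulary. The sketch below shows the effect of `max_df` on a small made-up corpus (`toy_docs` and the 0.5 threshold are illustrative assumptions, not the notebook's data or settings).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "okay" is in 3 of 4 documents, "the" in 2 of 4
toy_docs = [
    "okay roll to attack",
    "okay that hits",
    "okay the door is locked",
    "the dragon breathes fire",
]

# No document-frequency filtering: every token is kept
full_vocab = sorted(CountVectorizer().fit(toy_docs).vocabulary_)

# max_df=0.5 drops terms found in more than 50% of documents
pruned_vocab = sorted(CountVectorizer(max_df=0.5).fit(toy_docs).vocabulary_)

print("okay" in full_vocab, "okay" in pruned_vocab)
print("the" in pruned_vocab)
```

"okay" is removed because it occurs in 75% of the toy documents, while "the" sits exactly at the 50% threshold and is kept. By the same logic, with tens of thousands of utterances a `min_df` of 0.0001 only removes terms that occur in a handful of rows.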

## Load Data

In [3]:
def create_df_multi(n=25):
    """Load the turns of n randomly chosen transcripts into a one-column DataFrame."""
    list_of_text = []
    data_dir = 'data/aligned data/c=4'

    files = os.listdir(data_dir)
    # random.choices samples with replacement, so a transcript can be picked more than once
    sampled_files = random.choices(files, k=n)

    for filename in sampled_files:
        # Load each sampled file once and close it when done
        with open(os.path.join(data_dir, filename)) as f:
            data = json.load(f)

        for x in data:
            for y in x['TURNS']:
                # Join a turn's utterances into a single string
                text = ' '.join(y['UTTERANCES'])
                list_of_text.append(text)
    df = pd.DataFrame(list_of_text)
    return df

In [4]:
df = create_df_multi(n)
print(df.shape)
df.head()

(49916, 1)


|   | 0 |
|---|---|
| 0 | Hello, everyone! And welcome to tonight's epis... |
| 1 | Indeed. Tonight, we are sadly down one Sam Rie... |
| 2 | Where is he? |
| 3 | I actually don't know. |
| 4 | Watching his kids! |


## Vectorize Data



- Count Vectorizer
- Tfidf Vectorizer

Both vectorizers are used with each clustering model below.
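
To make the difference between the two vectorizers concrete, here is a minimal sketch on a made-up corpus (`toy_docs` is an assumption for illustration). `CountVectorizer` stores raw term counts, so long or repetitive utterances get large vectors; `TfidfVectorizer` reweights terms by inverse document frequency and, with its default `norm='l2'`, scales every non-empty row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not the transcript data
toy_docs = ["yeah yeah yeah", "yeah that hits", "the door is locked"]

counts = CountVectorizer().fit_transform(toy_docs)
weights = TfidfVectorizer().fit_transform(toy_docs)

# Both vectorizers build the same vocabulary, so the shapes match
print(counts.shape, weights.shape)

# Raw counts grow with repetition: "yeah" appears three times in the first row
print(counts.toarray()[0].max())

# TF-IDF rows are L2-normalized by default
row_norms = np.asarray(np.sqrt(weights.multiply(weights).sum(axis=1))).ravel()
print(row_norms)
```

Because k-means uses Euclidean distance, this difference in scaling is one plausible reason the two representations can produce quite different clusters on the same text.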

In [8]:
# tfidf=TfidfVectorizer(stop_words='english',max_df=.7,min_df=2,token_pattern=r'(?u)\b[A-Za-z]+\b')

tfidf = TfidfVectorizer(stop_words='english', 
    max_df=max_df,
    min_df=min_df,
    token_pattern=r'(?u)\b[A-Za-z]+\b'
    )
tfidf_sparse = tfidf.fit_transform(df[0])
print(tfidf_sparse.shape)
tfidf_df = pd.DataFrame(tfidf_sparse.toarray())
tfidf_df.tail()

(49916, 6016)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6006,6007,6008,6009,6010,6011,6012,6013,6014,6015
49911,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49913,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49914,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49915,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
cv = CountVectorizer(stop_words='english', 
    max_df=max_df,
    min_df=min_df,
    token_pattern=r'(?u)\b[A-Za-z]+\b'
    )
cv_sparse = cv.fit_transform(df[0])
print(cv_sparse.shape)
cv_df = pd.DataFrame(cv_sparse.toarray())
cv_df.tail()

(49916, 6016)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6006,6007,6008,6009,6010,6011,6012,6013,6014,6015
49911,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
49912,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
49913,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
49914,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
49915,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
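
A note on memory: the previews above are almost entirely zeros, and converting a (49916, 6016) sparse matrix with `.toarray()` allocates every cell. `KMeans.fit` also accepts SciPy sparse matrices, so one possible alternative is to cluster the sparse output directly. The sketch below estimates the dense footprint (assuming 8-byte values) and checks on a made-up corpus (`toy_docs` is hypothetical) that sparse and dense inputs give the same partition.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Dense size of a (49916, 6016) array at 8 bytes per value
dense_bytes = 49916 * 6016 * 8
print(f"{dense_bytes / 1e9:.1f} GB")

# Hypothetical toy corpus with two obvious groups
toy_docs = ["okay okay", "okay then", "yeah yeah", "yeah sure"]
X_sparse = TfidfVectorizer().fit_transform(toy_docs)

# Fit once on the sparse matrix and once on its dense copy
sparse_model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_sparse)
dense_model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_sparse.toarray())

# An adjusted Rand score of 1.0 means the two partitions are identical
agreement = adjusted_rand_score(sparse_model.labels_, dense_model.labels_)
print(agreement)
```

The dense copy works out to about 2.4 GB per matrix, so skipping `.toarray()` may matter once more transcripts are loaded.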


## Define Output Functions

In [10]:
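def sample_clusters(labels, n_clusters=n_clusters, n_samples=10):
    """Print each cluster's size and a few sample utterances (first 100 characters)."""
    # Note: this stores the labels in the global `df` as a new column
    df['labels'] = labels
    for i in range(n_clusters):
        members = df[df.labels == i]
        print(f'Cluster {i}: {len(members)} / {len(df)}')
        # replace=True lets small clusters still yield n_samples rows (with repeats)
        sample = members.sample(n_samples, replace=True)[0]
        for x in sample:
            print('\t- ' + x[:100])
        print('\n')
        print('==='*30)
        print('\n')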


## 5 Clusters

In [11]:
n_clusters = 5 
n_samples = 10

In [12]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(cv_df)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 3747 / 49916
	- I can't do anything too much.
	- I'm a nice guy! I can't resist the charm of a woman, especially one who looks identical to you.
	- Oh, because I can't see the board.
	- I don't know, is that what we're supposed to do? Kill The Gentleman?
	- You don't know that.
	- Yeah, I know, we don't have to relive it.
	- Well, they can't play as well as you, but they're pretty good.
	- Any further attempts by the creature automatically fail until 24 hours have elapsed. He ain't going 
	- You want to go now? You don't want to look around or any--?
	- We can't do anything for them right now until we actually know what's happening.




Cluster 1: 2300 / 49916
	- Okay.
	- Okay. The door does not appear to be trapped, but it is locked.
	- Wow. Fucking betrayal. Okay.
	- Okay. Let's go!
	- Okay.
	- Okay! But with taxpayer's money. Also something else you taught me.
	- Okay! So you move in, to there. Go ahead and make your attacks.
	- Okay.
	- Okay, go ahead and roll to attack.

In [13]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(tfidf_df)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 43346 / 49916
	- It looks like an oversized, swollen brain with a set of claws.
	- 11.
	- Drop disguise-- in the middle of the crowd?
	- 29.
	- Too conscious?
	- Technically anyone can use a shield, but if you're not proficient in it--
	- There isn't a cone aspect to it.
	- What would you like to look like?
	- Radar picking up incoming debris field. Impact in 30 seconds.
	- I stand in the doorway.




Cluster 1: 1396 / 49916
	- Slip by the back, yeah.
	- Yeah.
	- "Yeah!"
	- That hits, yeah.
	- Yeah, that does.
	- Yeah, you see him.
	- You can see the two sides, yeah.
	- Yeah.
	- Yeah, I--
	- Yeah, we're not supposed to be here.




Cluster 2: 2766 / 49916
	- The door's going to blow up if we knock on it? I mean, we don't have much luck with doors.
	- Yeah, I don't feel doing that. Maybe you should just go upstairs.
	- I don't want to ride around in The Shit.
	- Don't know his name, he's just a dickhead. That's all you know?
	- Well, how else will they discern who I am? I don
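
In the TF-IDF run above, cluster 0 holds 43346 of 49916 utterances, roughly 87% of the data, so most rows end up in one catch-all group. A quick way to see this imbalance without reading the samples is to compute each cluster's share of the rows. The sketch below uses a made-up label array (`toy_labels`) and a hypothetical helper, `cluster_shares`.

```python
import numpy as np

def cluster_shares(labels, n_clusters):
    """Return the fraction of rows assigned to each cluster."""
    # minlength keeps empty clusters in the result as zeros
    counts = np.bincount(labels, minlength=n_clusters)
    return counts / counts.sum()

# Hypothetical labels: 7 rows in cluster 0, 1 in cluster 1, 2 in cluster 2
toy_labels = np.array([0, 0, 0, 0, 0, 0, 1, 2, 2, 0])
shares = cluster_shares(toy_labels, n_clusters=3)
print(shares)
```

With a fitted model, `cluster_shares(kmeans.labels_, n_clusters)` would give the same summary for the real data.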

## 10 Clusters

In [14]:
n_clusters = 10 
n_samples = 10

In [15]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(cv_df)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 2819 / 49916
	- Yeah, it's a 23.
	- Yeah, white noise.
	- Yeah, right? With the doves--
	- Yeah, I mean, what're we going to--
	- Yeah.
	- Yeah!
	- Yeah. I mean, once we kill this thing, which we will, we're diving into the unknown.
	- Yeah! I'll just crouch down then and hold my action until I see her.
	- Sure. Yeah. There's one that's trying to get away with one of our deckhands?
	- Yeah, that hits.




Cluster 1: 2511 / 49916
	- Are we going? Are we ready?
	- Word. I'm still a person. I'm gonna go ahead and also do a call lighting spell. Bam, call lightning 
	- What?! No! You know what, Vex, I'm still, I'm still practicing.
	- That is terrible, it's a terrible path you're going down right now, I am uncomfortable with this--
	- I'm going to fail this miserably.
	- Okay. So the (gunshot noise) blast from Bad News scatters through the air and slams into her side, a
	- I'm the Little Sapphire. I lift up my-- I'm the Little Sapphire.
	- Now it's the creature's turn. With you r

In [16]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(tfidf_df)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 86 / 49916
	- So terrible.
	- Does feel like a terrible abuse of hospitality.
	- If you have played a lot of Fortnite, watch and marvel at how terrible we are at the game.
	- I think it's a terrible plan.
	- Terrible.
	- Terrible evil, the Baumbauchs.
	- Ooh, that's terrible. 13.
	- Terrible.
	- Oh, I know, I know, it's terrible!
	- Thank you, I often get confused with Phillip the Terrible.




Cluster 1: 2707 / 49916
	- Hope we don't have to leave that way, but she can just dispel it.
	- We walked into a sort of political snafu because we didn't know the lay of the land, so is there any
	- "I'm making this up as I go! I don't know what I'm doing."
	- (Minnesotan accent) "Oh, don't ya know?"
	- I'm going to look at Frumpkin again and be like: Caleb, if you're watching, I don't trust Avantika? 
	- I will sing to him, (singing) You had a bad day, you gettin' knocked down. I sing you a song just to
	- I know. I know, I'm afraid these rogue tendencies I've been learning from you

## 20 Clusters

In [17]:
n_clusters = 20 
n_samples = 10

In [18]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(cv_df)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 2206 / 49916
	- Like it's day one. Vox Machina, let's do this!
	- It is quite a busy time in the afternoon. There is actually a bit of hustle and bustle. Looking arou
	- News from the front! This is Nigel J. Creamhutch reporting from BBC London. March 23, 1944 may be a 
	- What are you, like, a hag now? I sit on it, and I open the door, and I fly through the doorway.
	- Scanlan, you had made your way back to Kymal, in which you, along with the aid of a few of your frie
	- Very terrible discreet hand-off, where it's like-- She adjusts her glasses once more. "Thank you, an
	- I like to think that she looks both ways before attacking just to see--
	- As she hits me, my eyes flare white, and I go: Aah! Like that.
	- Yeah, do you have mead that doesn't taste like piss?
	- Like how, what do you mean?




Cluster 1: 138 / 49916
	- As you guys step out into the main road, you can see the cracked stones that make up the walkway her
	- As you do, you feel the blood fill your lungs and

In [19]:
kmeans = KMeans(n_clusters = n_clusters, random_state= 42)
kmeans = kmeans.fit(tfidf_df)
sample_clusters(kmeans.labels_, n_clusters=n_clusters, n_samples=n_samples)

Cluster 0: 1470 / 49916
	- Oh, just you, okay.
	- Oh. You think so?
	- Oh, yeah! [laughter]
	- Oh god.
	- Oh, that's right!
	- Oh, I believe I will.
	- Oh yas, queen.
	- Oh no.
	- Oh. Awesome.
	- Oh, no!




Cluster 1: 2060 / 49916
	- We're not going to say Ioun.
	- So 13 points of damage. All righty. So you hack at it as it's running away, but it still manages to 
	- What did--
	- I am going to back up like five or six steps and also do the same thing Nott did.
	- Where are we going to sell these? The world's on fire!
	- 21 points of damage?
	- Did we get him in?
	- I'd gain a few hit points, but it's nothing bad. I can take a potion.
	- Did Pike roll?
	- He's going for Scanlan.




Cluster 2: 30661 / 49916
	- I step into the acid.
	- That's the one. Thanks, everybody.
	- Bring back fucking cookies!
	- Okay, sharpshooter, go for the shot.
	- Is it still disadvantage to the middle target?
	- "Up by the temples."
	- I'll be hanging back a bit, as well.
	- Correct. That's two off of it; 