## Load Package

You can either start this notebook in the `intuit-topic` directory, or append the directory to the path within this session as seen below.

In [1]:
import os
import json
import sys
import importlib
add_modules = [
    '/home/shuyang/data4/intuit-topic'
]
for m in add_modules:
    if m not in sys.path:
        sys.path.append(m)

print('\n- '.join(sorted(filter(None, sys.path))))

/data4/shuyang/menu-generation
- /home/shuyang/data4/intuit-topic
- /root/.ipython
- /usr/lib/python3.7
- /usr/lib/python3.7/lib-dynload
- /usr/lib/python3.7/site-packages
- /usr/lib/python3.7/site-packages/IPython/extensions
- /usr/lib/python3.7/site-packages/deepx_website-0.0.1-py3.7.egg
- /usr/lib/python3/dist-packages
- /usr/lib/python37.zip
- /usr/local/lib/python3.7/dist-packages


## Load Data

Any text data will work. Here we use the AskUbuntu dataset from [Perkins and Yang 2020](https://arxiv.org/pdf/1908.11487.pdf).

In [2]:
import pandas as pd

base_dir = '/home/shuyang/data4/ubuntu_perkins'
proc_loc = os.path.join(base_dir, 'askubuntu_processed.csv')
raw_loc = os.path.join(base_dir, 'askubuntu_raw.csv')

df_p = pd.read_csv(proc_loc, engine='c').rename(columns={
    'view1': 'q1',
    'view2': 'q2',
    'question_body': 'body'
})
df_p = df_p.dropna()
df_p

Unnamed: 0,id,label,q1,q2,body
0,1,UNK,how to get the `` your battery is broken '' me...,maybe these instructions will help you to get ...,"every time i turn on my computer , i see a mes..."
1,3,UNK,how can i set the software center to install s...,you can modify the policykit permissions to al...,how can i set the software center to allow non...
2,6,UNK,how to graphically interface with a headless s...,"yes , x forwarding over ssh is a beautiful thi...",i have a ubuntu development server at work . i...
3,7,UNK,how do i run a successful ubuntu hour ?,"try and make it as regular as possible , that ...",i 'm taking my be-stickered laptop to a coffee...
4,8,UNK,how do i go back to kde splash / login after i...,splash screen is configured by the alternative...,"i started with ubuntu karmic , and wanted to t..."
...,...,...,...,...,...
104470,1097755,UNK,command not found after setting source in bash...,"im clueless on what causes this , but the simp...","following the instructions here , and everythi..."
104471,1097761,UNK,changing individual letter position with bash,"can do that with the transform command , e.g ....",if i have a file called and it contains one li...
104472,1097781,UNK,how to move multiple folders to another direct...,i could type whatever was typed in this articl...,how can i move multiple folders into another d...
104473,1097793,UNK,how to remove betblocker,according to betblocker.org : please remember ...,i 've installed betblocker but have noticed th...


In [3]:
# An example of what one data point looks like
# Input data is simply a list of strings, with each string corresponding to one document/turn of dialog
docs = df_p['body'].values.tolist()
print('{:,} question bodies'.format(len(docs)))
docs[0]

104,284 question bodies


'every time i turn on my computer , i see a message saying something like : i am already aware that my battery is bad . how do i suppress this message ?'
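
Since the input is just a list of strings, other sources can be adapted with a small cleaning step. The sketch below is only an illustration and not part of `intuit-topic`: `prepare_docs` is a hypothetical helper that drops missing or blank entries and trims whitespace, similar in spirit to the `dropna()` call above.

```python
from typing import Iterable, List, Optional


def prepare_docs(raw_texts: Iterable[Optional[str]]) -> List[str]:
    """Return non-empty, whitespace-trimmed documents (hypothetical helper)."""
    docs = []
    for text in raw_texts:
        if text is None:
            continue  # skip missing entries
        cleaned = text.strip()
        if cleaned:  # skip strings that are empty after trimming
            docs.append(cleaned)
    return docs


# Made-up inputs: two usable documents, one missing value, one blank string
example = prepare_docs(["how do i mount a usb drive ? ", None, "   ", "wifi keeps dropping"])
print(len(example), "documents")
```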

## Performing Clustering

The `OVERWRITE` flag is set to `True` for demo purposes, so the model is retrained on every run. For analysis, set it to `False` so that previously trained models are loaded from disk instead.

In [4]:
bdir = '/home/shuyang/data4/BERTopic-ubuntu'

OVERWRITE = True
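
The load-or-retrain behaviour that this flag controls can be sketched in isolation. This is a minimal illustration, not the notebook's actual logic: `load_or_compute` is a hypothetical helper, and the lambda stands in for an expensive step such as fitting the model.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute_fn, overwrite=False):
    """Load a pickled result from `path`, or compute it and cache it there.

    Hypothetical helper: returns (result, status), where status records
    whether the result was loaded from disk or freshly computed.
    """
    if not overwrite and os.path.exists(path):
        with open(path, "rb") as rf:
            return pickle.load(rf), "loaded"
    result = compute_fn()
    with open(path, "wb") as wf:
        pickle.dump(result, wf)
    return result, "computed"


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "top_probs.pkl")
    make = lambda: ([0, 1, -1], [0.9, 0.8, 0.0])  # made-up topics and probabilities
    first = load_or_compute(cache, make)                   # nothing cached yet
    second = load_or_compute(cache, make)                  # cache hit
    forced = load_or_compute(cache, make, overwrite=True)  # recompute on request
    statuses = [first[1], second[1], forced[1]]
    print(statuses)
```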

In [5]:
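# Number of topics at a level : list of (topic ID, topic probability), one entry per document
DOCUMENT_TOPICS = dict()

# Number of topics at a level : {topic ID : topic keywords}
TOPIC_DICTIONARY = dict()

# Number of topics at a level : {previous topic ID : new topic ID}
REDUCTION_MAPS = dict()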
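
As the later cells use them, each of these containers is keyed by the number of topics at a given level. The toy example below illustrates that layout with made-up values; the keywords, scores, and IDs are invented, and `keywords_for_doc` is a hypothetical helper.

```python
# Toy stand-ins for the containers; every value here is made up.
toy_document_topics = {
    3: [(0, 0.91), (-1, 0.0), (1, 0.75)],  # one (topic ID, probability) per document
}
toy_topic_dictionary = {
    3: {
        -1: [("the", 0.02), ("to", 0.02)],  # -1 is the outlier topic
        0: [("kernel", 0.11), ("boot", 0.07)],
        1: [("wifi", 0.13), ("driver", 0.06)],
    },
}


def keywords_for_doc(level, doc_idx):
    """Return the keywords of the topic assigned to one document at one level."""
    topic_id, _prob = toy_document_topics[level][doc_idx]
    return [word for word, _score in toy_topic_dictionary[level][topic_id]]


print(keywords_for_doc(3, 0))
print(keywords_for_doc(3, 1))
```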

In [6]:
import os
from intuit_topic import BERTopic
from datetime import datetime
import pickle
import pandas as pd

mdir = os.path.join(bdir, 'base_model')
os.makedirs(mdir, exist_ok=True)
sfile = os.path.join(mdir, 'model.bin')
bfile = os.path.join(mdir, 'top_probs.pkl')

start = datetime.now()
if not OVERWRITE and os.path.exists(sfile):
    topic_model = BERTopic.load(path=sfile)
    print('{} - Model loaded from {} ({:.2f} MB)'.format(
        datetime.now() - start,
        sfile, os.path.getsize(sfile) / 1024**2
    ))
    
    # Base topics and probabilities
    if os.path.exists(bfile):
        start_bload = datetime.now()
        base_topics, base_probs = pd.read_pickle(bfile)
        print('{} - Loaded {:,} base topics & {:,} probs from {} ({:.2f} MB)'.format(
            datetime.now() - start_bload,
            len(base_topics), len(base_probs),
            bfile, os.path.getsize(bfile) / 1024**2
        ))
    else:
        base_topics, base_probs = topic_model.transform(docs)
        print('{} - Extracted topics and probs'.format(datetime.now() - start))
        # Save topics & probs
        start_bsave = datetime.now()
        with open(bfile, 'wb') as wf:
            pickle.dump((base_topics, base_probs), wf)
        print('{} - Saved {:,} topics & {:,} probs to {} ({:.2f} MB)'.format(
            datetime.now() - start_bsave,
            len(base_topics), len(base_probs),
            bfile, os.path.getsize(bfile) / (1024**2)
        ))
else:
    topic_model = BERTopic(
        language='english',
        top_n_words=10,
        n_gram_range=(1, 3),
        min_topic_size=5,
        verbose=True,
        device='cpu')
    base_topics, base_probs = topic_model.fit_transform(docs)
    print('{} - Extracted topics and probs'.format(datetime.now() - start))
    
    # Save model itself
    start_save = datetime.now()
    topic_model.save(sfile)
    print('{} - Saved model with {:,} topics to {} ({:.2f} MB)'.format(
        datetime.now() - start_save,
        len(topic_model.get_topic_freq()),
        sfile, os.path.getsize(sfile) / (1024**2)
    ))

    # Save topics & probs
    start_bsave = datetime.now()
    with open(bfile, 'wb') as wf:
        pickle.dump((base_topics, base_probs), wf)
    print('{} - Saved {:,} topics & {:,} probs to {} ({:.2f} MB)'.format(
        datetime.now() - start_bsave,
        len(base_topics), len(base_probs),
        bfile, os.path.getsize(bfile) / (1024**2)
    ))

# SAVE TOPICS
topic_map = topic_model.get_topics()
n_topics = len(topic_map)
TOPIC_DICTIONARY[n_topics] = topic_map
DOCUMENT_TOPICS[n_topics] = [(topic, prob) for topic, prob in zip(base_topics, base_probs)]

## VISUALIZE TOPIC HIERARCHY
start_hier = datetime.now()
fig = topic_model.visualize_hierarchy(width=2000, height=1000)
hloc = os.path.join(mdir, "hier.html")
fig.write_html(hloc)
print('{} ({} total) - Saved hierarchy visualization to {} ({:.2f} KB)'.format(
    datetime.now() - start_hier,
    datetime.now() - start, 
    hloc, os.path.getsize(hloc) / 1024))

## VISUALIZE TERM RANKS
start_trank = datetime.now()
fig = topic_model.visualize_term_rank(width=2000, height=1000)
tloc = os.path.join(mdir, 'tranks.html')
fig.write_html(tloc)
print('{} ({} total) - Saved term rank visualization to {} ({:.2f} KB)'.format(
    datetime.now() - start_trank,
    datetime.now() - start, 
    tloc, os.path.getsize(tloc) / 1024))

Created BERTopic object - call .fit() or .fit_transform() to train model!
CREATED SENTENCE TRANSFORMER WRAPPER


Batches: 100%|██████████| 3259/3259 [20:20<00:00,  2.67it/s]
2022-02-26 23:27:38,376 - BERTopic - Transformed documents to Embeddings


0:20:29.504200 (0:20:29.572183 total) - Transformed documents to embeddings


2022-02-26 23:29:36,972 - BERTopic - Reduced dimensionality with UMAP


0:01:58.600622 (0:22:28.172911 total) - Reduced dimensionality with UMAP


2022-02-26 23:29:48,273 - BERTopic - Clustered UMAP embeddings with HDBSCAN


0:00:11.296345 (0:22:39.469601 total) - Clustered UMAP embeddings with HDBSCAN
0:29:25.021058 - Extracted topics and probs
0:00:51.646344 - Saved model with 1,782 topics to /home/shuyang/data4/BERTopic-ubuntu/base_model/model.bin (1168.66 MB)
0:00:00.004063 - Saved 104,284 topics & 104,284 probs to /home/shuyang/data4/BERTopic-ubuntu/base_model/top_probs.pkl (1.15 MB)
0:00:02.304714 (0:30:18.995155 total) - Saved hierarchy visualization to /home/shuyang/data4/BERTopic-ubuntu/base_model/hier.html (4035.89 KB)
0:00:00.975466 (0:30:19.970788 total) - Saved term rank visualization to /home/shuyang/data4/BERTopic-ubuntu/base_model/tranks.html (4295.12 KB)


## Looking at Topics

Here we show the keywords the model has identified for a random sample of topics, along with those of the outlier topic (`-1`).

Using seed 1111 for Ubuntu, the run shown below samples the following topics:
- 445: Media streaming (Kodi, MiniDLNA)
- 404: Rootkit detection (rkhunter, chkrootkit)
- 705: Running a watcher script or daemon on startup
- 1567: Stopping the cursor from blinking
- 1315: Default language settings

Topic IDs and keywords can change whenever the model is retrained, so these labels apply only to this run.

In [7]:
import random

def visualize_topics(topic_map: dict, keys: list = None, k: int = -1, show_scores: bool = False):
    """Display keywords for the outlier topic and for selected (or randomly sampled) topics."""
    outlier_key = -1
    all_keys = sorted(key for key in topic_map if key != outlier_key)

    # Choose which topics to show: explicit keys, every topic, or a random sample of size k
    if keys:
        sample_keys = keys
    elif k is None or k < 0 or k > len(all_keys):
        sample_keys = all_keys
    else:
        sample_keys = random.sample(all_keys, k=k)

    def show(topic_id):
        # Each topic is a list of (keyword, score) pairs
        if show_scores:
            display(topic_map[topic_id])
        else:
            display([pair[0] for pair in topic_map[topic_id]])
        print()

    # Outliers (skipped if the model has no outlier topic)
    if outlier_key in topic_map:
        print('Outlier keywords:')
        show(outlier_key)

    # Other keys
    for key in sample_keys:
        print(f'Topic {key}:')
        show(key)

random.seed(1111)
print('Visualizing keywords for topics @ leaf level with {:,} total topics'.format(
    n_topics
))
visualize_topics(
    topic_map=TOPIC_DICTIONARY[n_topics],
    k=5,
    show_scores=False
)

Visualizing keywords for topics @ leaf level with 1,782 total topics
Outlier keywords:


['the', 'to', 'and', 'it', 'is', 'in', 'my', 'of', 'that', 'this']


Topic 445:


['media',
 'kodi',
 'tv',
 'stream',
 'minidlna',
 'appletv',
 'to stream',
 'media server',
 'can stream',
 'play']


Topic 404:


['rkhunter',
 'rootkits',
 'chkrootkit',
 'rootkit',
 'infected',
 'false',
 'false positives',
 'positives',
 'for rootkits',


Topic 705:


['watcher',
 'script',
 'the script',
 'it to run',
 'on startup',
 'startup',
 'the watcher daemon',
 'have the watcher',
 'which makes changes',
 'watcher daemon']


Topic 1567:


['blinking',
 'stop blinking',
 'cursor',
 'the cursor',
 'blinking immediately',
 'blinking immediately after',
 'for gnome 38',
 'after the focus',
 'to stop blinking',
 'focus']


Topic 1315:


['language',
 'languages',
 'default language',
 'firefox',
 'language to',
 'english',
 'united',
 'for dutch',
 'default language to',
 'dutch']




## Iteratively Refining Topics

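This demonstrates how we merge topics upward from the original K topics (1,782 in this run) to 512, 128, 32, and finally 16 broad topics. The counts reported in the output (513, 129, 33, 17) are one higher than requested, apparently because they include the outlier topic `-1`.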
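
For intuition, one simple way to merge topics upward is to repeatedly fold the smallest topic into its most similar neighbour. The sketch below illustrates that idea on made-up topic vectors and sizes. It is an assumption for illustration only, not a description of what `reduce_topics` does internally, and `reduce_to_n` and `cosine` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reduce_to_n(vectors, sizes, n_target):
    """Greedily merge the smallest topic into its most similar topic.

    vectors: {topic_id: list of floats}; sizes: {topic_id: document count}.
    Returns (mapping of old ID -> surviving ID, merged sizes).
    """
    vectors = {t: list(v) for t, v in vectors.items()}
    sizes = dict(sizes)
    mapping = {t: t for t in vectors}
    while len(vectors) > n_target:
        smallest = min(sizes, key=sizes.get)
        others = [t for t in vectors if t != smallest]
        target = max(others, key=lambda t: cosine(vectors[smallest], vectors[t]))
        # A size-weighted average keeps the merged vector representative.
        total = sizes[smallest] + sizes[target]
        vectors[target] = [
            (a * sizes[smallest] + b * sizes[target]) / total
            for a, b in zip(vectors[smallest], vectors[target])
        ]
        sizes[target] = total
        # Redirect every topic that pointed at the removed one.
        for old, new in mapping.items():
            if new == smallest:
                mapping[old] = target
        del vectors[smallest], sizes[smallest]
    return mapping, sizes


# Made-up topics: 0 and 1 point one way, 2 and 3 the other
toy_vectors = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
toy_sizes = {0: 50, 1: 5, 2: 40, 3: 8}
mapping, merged_sizes = reduce_to_n(toy_vectors, toy_sizes, n_target=2)
print(mapping)
print(merged_sizes)
```

In this toy run the two small topics are absorbed by the larger topic pointing in the same direction, which leaves two broad topics.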

In [8]:
# target_n is the requested number of topics; the actual count (including
# outliers) is stored in n_topics once the reduction has run.
for target_n in [512, 128, 32, 16]:
    print('\n========= {} =========\n'.format(target_n))
    mdir = os.path.join(bdir, 'model_{}'.format(target_n))
    os.makedirs(mdir, exist_ok=True)
    
    # Reduce topics; this modifies topic_model in place
    start = datetime.now()
    base_topics, base_probs, topic_reduction = topic_model.reduce_topics(
        docs=docs,
        topics=base_topics,
        probabilities=base_probs,
        nr_topics=target_n,
    )
    print('{} - Extracted topics and probs with reduction to {:,}'.format(
        datetime.now() - start, target_n))

    # Save topics & probs
    start_bsave = datetime.now()
    bfile = os.path.join(mdir, 'top_probs.pkl')
    with open(bfile, 'wb') as wf:
        pickle.dump((base_topics, base_probs), wf)
    print('{} - Saved {:,} topics & {:,} probs to {} ({:.2f} MB)'.format(
        datetime.now() - start_bsave,
        len(base_topics), len(base_probs),
        bfile, os.path.getsize(bfile) / (1024**2)
    ))

    # SAVE TOPICS
    topic_map = topic_model.get_topics()
    n_topics = len(topic_map)
    TOPIC_DICTIONARY[n_topics] = topic_map
    DOCUMENT_TOPICS[n_topics] = [(topic, prob) for topic, prob in zip(base_topics, base_probs)]
    REDUCTION_MAPS[n_topics] = topic_reduction

    # Save model itself
    sfile = os.path.join(mdir, 'model.bin')
    start_save = datetime.now()
    topic_model.save(sfile)
    print('{} - Saved model with {:,} topics to {} ({:.2f} MB)'.format(
        datetime.now() - start_save,
        len(topic_model.get_topic_freq()),
        sfile, os.path.getsize(sfile) / (1024**2)
    ))

    ## VISUALIZE TOPIC HIERARCHY
    start_hier = datetime.now()
    fig = topic_model.visualize_hierarchy(width=2000, height=1000)
    hloc = os.path.join(mdir, "hier.html")
    fig.write_html(hloc)
    print('{} - Saved hierarchy visualization to {} ({:.2f} KB)'.format(
        datetime.now() - start_hier, hloc, os.path.getsize(hloc) / 1024))

    ## VISUALIZE TERM RANKS
    start_trank = datetime.now()
    fig = topic_model.visualize_term_rank(width=2000, height=1000)
    tloc = os.path.join(mdir, 'tranks.html')
    fig.write_html(tloc)
    print('{} - Saved term rank visualization to {} ({:.2f} KB)'.format(
        datetime.now() - start_trank, tloc, os.path.getsize(tloc) / 1024))
    
    print('Visualizing keywords @ level with {:,} total topics'.format(
        n_topics
    ))
    visualize_topics(
        topic_map=TOPIC_DICTIONARY[n_topics],
        k=5,
        show_scores=False
    )



0:00:04.342547 [_reduce_to_n_topics] created (1782, 1782) similarities matrix
100%|██████████| 104284/104284 [00:00<00:00, 3634057.53it/s]

0:00:25.967846 [_reduce_to_n_topics] Created topic map
0:00:26.001086 [_reduce_to_n_topics] Mapped topics



2022-02-26 23:40:42,700 - BERTopic - Reduced number of topics from 1782 to 513


0:03:13.862917 [_reduce_to_n_topics] Updated representations
0:03:13.901669 - Extracted topics and probs with reduction to 512
0:00:00.004278 - Saved 104,284 topics & 104,284 probs to /home/shuyang/data4/BERTopic-ubuntu/model_512/top_probs.pkl (1.16 MB)
0:00:35.859168 - Saved model with 513 topics to /home/shuyang/data4/BERTopic-ubuntu/model_512/model.bin (1128.57 MB)
0:00:00.262793 - Saved hierarchy visualization to /home/shuyang/data4/BERTopic-ubuntu/model_512/hier.html (3712.88 KB)
0:00:00.298842 - Saved term rank visualization to /home/shuyang/data4/BERTopic-ubuntu/model_512/tranks.html (3789.85 KB)
Visualizing keywords @ level with 513 total topics
Outlier keywords:


['the', 'to', 'and', 'it', 'is', 'in', 'my', 'that', 'of', 'this']


Topic 145:


['sublime',
 'sublime text',
 'text',
 'installed sublime',
 'sublimetext',
 'open sublime',
 'open',
 'install sublime',
 'installed sublime text',
 'st2']


Topic 372:


['brasero',
 'burn',
 'to burn',
 'cd',
 'dvd',
 'using brasero',
 'brasero to',
 'k3b',
 'nero',
 'burner']


Topic 457:


['subdomain',
 'host',
 'apache',
 'virtual host',
 'virtual',
 'server',
 'site',
 'domain',
 'wordpress',
 'website']


Topic 85:


['eclipse',
 'the eclipse',
 'install eclipse',
 'java',
 'installed eclipse',
 'cdt',
 'of eclipse',
 'eclipse and',
 'pydev',
 'project']


Topic 191:


['conky',
 'the conky',
 'my conky',
 'color',
 'conkyrc',
 'color e0e0e0',
 'e0e0e0',
 'desktop',
 'conky window',
 'conky manager']




0:00:02.560979 [_reduce_to_n_topics] created (513, 513) similarities matrix
100%|██████████| 104284/104284 [00:00<00:00, 3372258.57it/s]

0:00:10.140600 [_reduce_to_n_topics] Created topic map
0:00:10.175206 [_reduce_to_n_topics] Mapped topics



2022-02-26 23:43:06,808 - BERTopic - Reduced number of topics from 513 to 129


0:01:47.603282 [_reduce_to_n_topics] Updated representations
0:01:47.649777 - Extracted topics and probs with reduction to 128
0:00:00.004524 - Saved 104,284 topics & 104,284 probs to /home/shuyang/data4/BERTopic-ubuntu/model_128/top_probs.pkl (1.19 MB)
0:00:36.450054 - Saved model with 129 topics to /home/shuyang/data4/BERTopic-ubuntu/model_128/model.bin (1107.54 MB)
0:00:00.106674 - Saved hierarchy visualization to /home/shuyang/data4/BERTopic-ubuntu/model_128/hier.html (3617.07 KB)
0:00:00.097663 - Saved term rank visualization to /home/shuyang/data4/BERTopic-ubuntu/model_128/tranks.html (3636.79 KB)
Visualizing keywords @ level with 129 total topics
Outlier keywords:


['the', 'to', 'and', 'it', 'is', 'in', 'this', 'my', 'that', 'of']


Topic 23:


['video',
 'convert',
 'to convert',
 'files',
 'to',
 'mp4',
 'audio',
 'format',
 'ffmpeg',
 'mp3']


Topic 86:


['dns',
 'dns server',
 'server',
 'dnsmasq',
 'resolve',
 'resolvconf',
 'the dns',
 'bind9',
 'domain',
 'to']


Topic 27:


['download',
 'to download',
 'wget',
 'youtube',
 'file',
 'downloading',
 'youtubedl',
 'url',
 'to',
 'the']


Topic 112:


['workspaces',
 'workspace',
 'the workspace',
 'switch',
 'of workspaces',
 'to',
 'switcher',
 'viewport',
 'switch to',
 'the']


Topic 37:


['fan',
 'temperature',
 'the fan',
 'cpu',
 'laptop',
 'is',
 'the temperature',
 'sensors',
 'the',
 'hot']




0:00:01.291922 [_reduce_to_n_topics] created (129, 129) similarities matrix
100%|██████████| 104284/104284 [00:00<00:00, 3684878.80it/s]

0:00:02.741013 [_reduce_to_n_topics] Created topic map
0:00:02.772784 [_reduce_to_n_topics] Mapped topics



2022-02-26 23:45:05,602 - BERTopic - Reduced number of topics from 129 to 33


0:01:22.022864 [_reduce_to_n_topics] Updated representations
0:01:22.071670 - Extracted topics and probs with reduction to 32
0:00:00.003970 - Saved 104,284 topics & 104,284 probs to /home/shuyang/data4/BERTopic-ubuntu/model_32/top_probs.pkl (1.24 MB)
0:00:36.281101 - Saved model with 33 topics to /home/shuyang/data4/BERTopic-ubuntu/model_32/model.bin (1090.95 MB)
0:00:00.044042 - Saved hierarchy visualization to /home/shuyang/data4/BERTopic-ubuntu/model_32/hier.html (3593.37 KB)
0:00:00.048862 - Saved term rank visualization to /home/shuyang/data4/BERTopic-ubuntu/model_32/tranks.html (3598.53 KB)
Visualizing keywords @ level with 33 total topics
Outlier keywords:


['the', 'to', 'and', 'it', 'is', 'in', 'this', 'my', 'that', 'of']


Topic 11:


['the', 'usb', 'boot', 'and', 'ubuntu', 'to', 'install', 'it', 'on', 'with']


Topic 17:


['mysql',
 'to',
 'the',
 'and',
 'it',
 'this',
 'error',
 'server',
 'phpmyadmin',
 'is']


Topic 2:


['kernel', 'the', 'to', 'is', 'and', 'the kernel', 'it', 'of', 'that', 'this']


Topic 23:


['printer',
 'print',
 'the',
 'the printer',
 'to',
 'printing',
 'page',
 'and',
 'cups',
 'is']


Topic 30:


['login',
 'the',
 'screen',
 'to',
 'and',
 'login screen',
 'the login',
 'my',
 'it',
 'in']




0:00:00.582651 [_reduce_to_n_topics] created (33, 33) similarities matrix
100%|██████████| 104284/104284 [00:00<00:00, 3656141.22it/s]

0:00:00.853389 [_reduce_to_n_topics] Created topic map
0:00:00.885698 [_reduce_to_n_topics] Mapped topics



2022-02-26 23:46:59,449 - BERTopic - Reduced number of topics from 33 to 17


0:01:17.354167 [_reduce_to_n_topics] Updated representations
0:01:17.408785 - Extracted topics and probs with reduction to 16
0:00:00.072514 - Saved 104,284 topics & 104,284 probs to /home/shuyang/data4/BERTopic-ubuntu/model_16/top_probs.pkl (1.26 MB)
0:00:36.389725 - Saved model with 17 topics to /home/shuyang/data4/BERTopic-ubuntu/model_16/model.bin (1085.96 MB)
0:00:00.042526 - Saved hierarchy visualization to /home/shuyang/data4/BERTopic-ubuntu/model_16/hier.html (3589.52 KB)
0:00:00.071263 - Saved term rank visualization to /home/shuyang/data4/BERTopic-ubuntu/model_16/tranks.html (3592.18 KB)
Visualizing keywords @ level with 17 total topics
Outlier keywords:


['the', 'to', 'and', 'it', 'is', 'in', 'this', 'my', 'that', 'of']


Topic 9:


['theme', 'the', 'to', 'themes', 'and', 'in', 'is', 'it', 'of', 'color']


Topic 5:


['python', 'to', 'the', 'and', 'it', 'install', 'is', 'in', 'this', 'pip']


Topic 13:


['upgrade',
 'to',
 'the',
 'ubuntu',
 'to upgrade',
 'is',
 'and',
 'it',
 'lts',
 'release']


Topic 4:


['the', 'to', 'share', 'samba', 'and', 'is', 'on', 'my', 'mount', 'in']


Topic 3:


['to',
 'the',
 'permissions',
 'and',
 'permission',
 'user',
 'file',
 'files',
 'is',
 'it']




In [9]:
DOCUMENT_TOPICS.keys()

dict_keys([1782, 513, 129, 33, 17])
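
Because each reduction step records how previous topic IDs map to new ones, a leaf topic can be traced to its ancestor at any coarser level by chaining the maps. The following is a minimal sketch, assuming each entry of `REDUCTION_MAPS` is a plain `{previous ID: new ID}` dict. The IDs are made up and `trace_topic` is a hypothetical helper.

```python
def trace_topic(topic_id, reduction_maps, levels):
    """Follow one topic ID through successive reduction maps.

    reduction_maps: {level: {previous_id: new_id}} (assumed structure).
    levels: the levels in the order the reductions were applied.
    Returns the topic's ID at each level.
    """
    path = {}
    current = topic_id
    for level in levels:
        # IDs missing from a map are assumed to be unchanged at that step.
        current = reduction_maps[level].get(current, current)
        path[level] = current
    return path


# Made-up maps for three toy levels
toy_maps = {
    5: {7: 4, 8: 4, 9: 3},
    3: {4: 2, 3: 1},
    2: {2: 0, 1: 0},
}
print(trace_topic(7, toy_maps, levels=[5, 3, 2]))
print(trace_topic(-1, toy_maps, levels=[5, 3, 2]))
```

With the notebook's own objects, the equivalent call would be `trace_topic(445, REDUCTION_MAPS, levels=[513, 129, 33, 17])`, provided the assumption about the map structure holds.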

## EDA for Documents with Topics

Here we show:
- The % of documents at each level with no topics assigned
- Quantiles for topic likelihood as reported by our model
- Most commonly assigned topics by level

In [10]:
import numpy as np
from collections import Counter
import pandas as pd
pd.set_option('display.max_colwidth', None)

for n_topics, doc_topic_list in DOCUMENT_TOPICS.items():
    print('\n---- {} ----'.format(n_topics))
    n_outliers = 0
    non_outlier_probs = []
    topic_freqs = Counter()
    for top, top_prob in doc_topic_list:
        if top == -1:
            n_outliers += 1
            continue
        non_outlier_probs.append(top_prob)
        topic_freqs[top] += 1
    print('{:,} outliers ({:.2f}%)'.format(n_outliers, n_outliers / len(doc_topic_list) * 100.0))
    print('Topic probability quantiles:')
    q_levels = [0.25, 0.50, 0.9, 0.95]
    quants = np.quantile(non_outlier_probs, q_levels)
    for ql, qv in zip(q_levels, quants):
        print('  {}th Quantile: {:.1f} %'.format(int(100*ql), qv*100))
    
    print('> Top 5 most frequent topics:')
    rows = []
    for tid, tfreq in topic_freqs.most_common(5):
        rows.append({'Topic': [t for t, cs in TOPIC_DICTIONARY[n_topics][tid]], 'Freq': tfreq})
    display(pd.DataFrame(rows))


---- 1782 ----
47,427 outliers (45.48%)
Topic probability quantiles:
  25th Quantile: 72.0 %
  50th Quantile: 98.0 %
  90th Quantile: 100.0 %
  95th Quantile: 100.0 %
> Top 5 most frequent topics:


Unnamed: 0,Topic,Freq
0,"[kernel, the kernel, kernels, linux kernel, kernel version, kernel is, my kernel, kernel and, new kernel, old kernels]",968
1,"[python, pip, 27, python3, python 27, anaconda, of python, the python, for python, numpy]",762
2,"[samba, share, shares, mount, nfs, shared, the share, access, to mount, network]",717
3,"[encrypted, encryption, encrypt, luks, passphrase, to encrypt, truecrypt, an encrypted, ecryptfs, home]",655
4,"[juju, maas, openstack, node, nodes, deploy, charm, the maas, to deploy, charms]",621



---- 513 ----
54,781 outliers (52.53%)
Topic probability quantiles:
  25th Quantile: 70.7 %
  50th Quantile: 97.9 %
  90th Quantile: 100.0 %
  95th Quantile: 100.0 %
> Top 5 most frequent topics:


Unnamed: 0,Topic,Freq
0,"[kernel, the kernel, kernels, linux kernel, kernel version, version, update, upgrade, kernel is, boot]",1011
1,"[python, pip, 27, python3, python 27, install, the python, anaconda, to install, of python]",820
2,"[samba, share, mount, shares, nfs, shared, access, network, the share, server]",744
3,"[ssd, partition, partitions, windows, hdd, drive, install ubuntu, the ssd, to install ubuntu, gb]",705
4,"[encrypted, encryption, encrypt, home, luks, passphrase, to encrypt, truecrypt, password, partition]",655



---- 129 ----
69,372 outliers (66.52%)
Topic probability quantiles:
  25th Quantile: 78.3 %
  50th Quantile: 100.0 %
  90th Quantile: 100.0 %
  95th Quantile: 100.0 %
> Top 5 most frequent topics:


Unnamed: 0,Topic,Freq
0,"[partition, windows, drive, ubuntu, partitions, ssd, hdd, and, on, install]",1183
1,"[kernel, the kernel, kernels, the, to, is, and, version, of, that]",1075
2,"[share, samba, mount, shared, the, folder, access, to, shares, network]",832
3,"[python, pip, install, to, installed, 27, the, to install, and, python3]",820
4,"[wifi, wireless, network, broadcom, driver, the, my, the wifi, and, is]",756



---- 33 ----
85,914 outliers (82.38%)
Topic probability quantiles:
  25th Quantile: 87.4 %
  50th Quantile: 100.0 %
  90th Quantile: 100.0 %
  95th Quantile: 100.0 %
> Top 5 most frequent topics:


Unnamed: 0,Topic,Freq
0,"[partition, windows, to, ubuntu, the, and, my, on, drive, swap]",1395
1,"[wifi, wireless, the, to, network, my, and, is, it, connection]",1219
2,"[kernel, the, to, is, and, the kernel, it, of, that, this]",1075
3,"[permissions, to, the, permission, and, user, files, group, file, folder]",921
4,"[the, share, samba, to, and, mount, on, my, is, folder]",832



---- 17 ----
91,547 outliers (87.79%)
Topic probability quantiles:
  25th Quantile: 88.6 %
  50th Quantile: 100.0 %
  90th Quantile: 100.0 %
  95th Quantile: 100.0 %
> Top 5 most frequent topics:


Unnamed: 0,Topic,Freq
0,"[the, wifi, to, wireless, and, is, my, it, network, on]",1568
1,"[partition, the, to, ubuntu, and, windows, my, on, is, it]",1395
2,"[kernel, the, to, is, and, it, the kernel, of, that, this]",1075
3,"[to, the, permissions, and, permission, user, file, files, is, it]",921
4,"[the, to, share, samba, and, is, on, my, mount, in]",832
