# TextTiling with LDA for Topic Detection
> This project finds the topics in meeting transcripts.

* Copyright (C) 2016-2017 NCKUIIM Project
* Author: Tai-Chia Huang
* Email: vallwesture@gmail.com

# Workflow
* * *
###### 1. Train Corpus with LDA
1. Create wikipedia dictionary
2. Create wikipedia MmCorpus from step 1
3. Train an LDA model on the corpus to find the topics most associated with each word
4. Test: get topic similarity from topics
* * *
###### 2. Texttiling Algorithm
1. Declaration: Tokenization
2. Declaration: Block Comparison and Lexical Score Determination
3. Declaration: Smoothing and Boundary identification
4. Run Demo
* * *
###### 3. Given the segment, find the topic via LDA
1. Use LDA to find the most related topic
2. Given the topic_id, find the word with the highest probability for that topic

# 1. Load LDA dictionary and model

In [2]:
# import and setup modules we'll be using in this notebook
import logging
import itertools

import numpy as np
import gensim

logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO  # ipython sometimes messes up the logging setup; restore

In [3]:
%time id2word_wiki = gensim.corpora.Dictionary.load('wiki.dictionary', mmap=None)

CPU times: user 1.57 s, sys: 634 ms, total: 2.2 s
Wall time: 4.74 s


In [4]:
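# ignore words that appear in fewer than 20 documents or in more than 5% of documents
# filter_extremes removes uncommon words, but we want to keep "all" words,
# so what we really need to filter out are the common words.
# no_below removes words that appear in fewer than n documents;
# no_above removes words that appear in more than a fraction m of the documents.
# A larger no_above might therefore be better, but not necessarily:
# there may be many high-frequency words.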
id2word_wiki.filter_extremes(no_below=20, no_above=0.05)
print(id2word_wiki)

Dictionary(100000 unique tokens: [u'biennials', u'fawn', u'gai', u'constan\u021ba', u'nunnery']...)
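
To make the effect of `no_below` and `no_above` concrete, here is a minimal pure-Python sketch of the filtering rule. It assumes gensim's documented behaviour (keep tokens found in at least `no_below` documents and in at most `no_above * num_docs` documents, then keep the `keep_n` most frequent). The function name `filter_extremes_sketch` and the toy document frequencies are hypothetical, not part of the project.

```python
def filter_extremes_sketch(doc_freqs, num_docs, no_below=20, no_above=0.05, keep_n=100000):
    """Return the tokens that survive document-frequency filtering.

    doc_freqs maps token -> number of documents containing it.
    This mirrors the rule only; it is not gensim's implementation.
    """
    max_docs = no_above * num_docs  # no_above is a fraction of the corpus
    kept = [tok for tok, df in doc_freqs.items()
            if no_below <= df <= max_docs]
    # Most frequent first, then truncate to keep_n tokens.
    kept.sort(key=lambda tok: (-doc_freqs[tok], tok))
    return kept[:keep_n]


# Toy corpus of 1000 documents (made-up numbers).
toy_freqs = {'the': 990, 'speech': 45, 'coding': 30, 'hilbert': 20, 'tranf': 1}
surviving = filter_extremes_sketch(toy_freqs, num_docs=1000)
print(surviving)
```

In this toy case the very common word and the one-off typo are both dropped, which is the trade-off discussed in the comments above.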


The model's `get_term_topics` method returns, for a given word, the probability of that word belonging to each topic. It is used later when the token table is built.

In [5]:
lda_model = gensim.models.LdaModel.load('wiki_lda.mm')

# 2. Texttiling Algorithm

Gap scores are computed with the block comparison method.

To determine the number of boundaries, we use the conservative cutoff measure (HC).

Reference: http://www.nltk.org/_modules/nltk/tokenize/texttiling.html

In [6]:
import nltk
import re
import math
import pylab
from nltk.tokenize.api import TokenizerI

try:
    import numpy
except ImportError:
    pass

### 2.1 Declaration: Tokenization

### TokenTableField: stores the information for each token

In [7]:
class TokenTableField(object):
    """A field in the token table holding parameters for each token,
    used later in the process"""
    def __init__(self,
                 first_pos,
                 ts_occurences,
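                 similarity_topics=None, # [(1, 0.3), (2, 0.45), ...]
                 total_count=1,
                 par_count=1,
                 last_par=0,
                 last_tok_seq=None):
        # avoid sharing one mutable default list across instances
        if similarity_topics is None:
            similarity_topics = []
        self.__dict__.update(locals())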
        del self.__dict__['self']

### TokenSequence: packs the text into fixed-size pseudosentences

In [8]:
class TokenSequence(object):
    "A token list with its original length and its index"
    def __init__(self,
                 index,
                 wrdindex_list,
                 original_length=None):
        original_length=original_length or len(wrdindex_list)
        self.__dict__.update(locals())
        del self.__dict__['self']

In [9]:
def _mark_paragraph_breaks(text):
    """Identifies indented text or line breaks as the beginning of
    paragraphs"""

    MIN_PARAGRAPH = 100
    # Match a blank line (at least two newlines); used to split paragraphs
    pattern = re.compile("[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*")
    matches = pattern.finditer(text)

    last_break = 0
    pbreaks = [0]

    for pb in matches:
        print pb.group()
        # start() return the beginning index
        if pb.start()-last_break < MIN_PARAGRAPH:
            continue
        else:
            pbreaks.append(pb.start())
            last_break = pb.start()

    return pbreaks
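
A small worked example may help show how the minimum paragraph length interacts with the blank-line pattern. This is a sketch that reuses the same regular expression; the helper name `paragraph_breaks_sketch` and the sample text are made up for illustration.

```python
import re

# Same pattern as in _mark_paragraph_breaks: a blank line, possibly padded
# with other whitespace, separates two paragraphs.
BREAK_PATTERN = re.compile(r"[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*")


def paragraph_breaks_sketch(text, min_paragraph=100):
    """Return character offsets of accepted paragraph breaks (0 is always included)."""
    last_break = 0
    breaks = [0]
    for match in BREAK_PATTERN.finditer(text):
        # Skip breaks that would create a paragraph shorter than min_paragraph.
        if match.start() - last_break < min_paragraph:
            continue
        breaks.append(match.start())
        last_break = match.start()
    return breaks


# The first blank line comes after only 11 characters, so it is ignored.
sample = "short line.\n\n" + "a" * 120 + "\n\n" + "b" * 120
print(paragraph_breaks_sketch(sample))
```

Because the tokenized meeting text is written with one sentence per paragraph, short utterances are merged into the following paragraph by this rule.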

In [10]:
def _divide_to_tokensequences(text, w=20):
    "Divides the text into pseudosentences of fixed size"
    wrdindex_list = []
    matches = re.finditer("\w+", text)
    
    tuple_matches = []
    for match in matches:
        tuple_matches.append((match.group(), match.start()))
    numMatched = len(tuple_matches)
                             
    for index, match in enumerate(tuple_matches):
        wrdindex_list.append((match[0], match[1]))
        
        firstTupleList = nltk.pos_tag([match[0]])
        firstTuple = firstTupleList[0]
        
        if index + 1 < numMatched:
            secondTupleList = nltk.pos_tag([tuple_matches[index + 1][0]])
            secondTuple = secondTupleList[0]
            
            if (firstTuple[1] == 'JJ' and secondTuple[1] == 'NN') or \
               (firstTuple[1] == 'NN' and secondTuple[1] == 'NN') or \
               (firstTuple[1] == 'VB' and secondTuple[1] == 'NN'):
                n_gram2 = match[0] + ' ' + tuple_matches[index + 1][0]
                wrdindex_list.append((n_gram2, match[1]))  # the bigram keeps the start offset
                                                           # of its first word
        else:
            continue

    return [TokenSequence(i/w, wrdindex_list[i:i+w])
            for i in range(0, len(wrdindex_list), w)]
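
The chunking step on its own can be sketched as follows. This simplified version leaves out the POS-based bigram insertion and only shows how `(word, offset)` pairs are grouped into pseudosentences of size `w`; the name `pseudosentences_sketch` and the sample sentence are hypothetical.

```python
import re


def pseudosentences_sketch(text, w=20):
    """Split text into consecutive groups of at most w (word, start_offset) pairs."""
    words = [(m.group(), m.start()) for m in re.finditer(r"\w+", text)]
    return [words[i:i + w] for i in range(0, len(words), w)]


chunks = pseudosentences_sketch("we talk about speech coding and hilbert transforms", w=3)
for number, chunk in enumerate(chunks):
    print("%d: %s" % (number, [word for word, _ in chunk]))
```

The last pseudosentence may be shorter than `w`, which is why `TokenSequence` records its own length.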

In [11]:
def _create_token_table(token_sequences, par_breaks, model, id2word_wiki):
    "Creates a table of TokenTableFields"
    token_table = {}
    current_par = 0
    current_tok_seq = 0
    pb_iter = par_breaks.__iter__()

    current_par_break = next(pb_iter)
    if current_par_break == 0:
        try:
            current_par_break = next(pb_iter) #skip break at 0
        except StopIteration:
            raise ValueError(
                "No paragraph breaks were found(text too short perhaps?)"
                )
    for ts in token_sequences: # add each token sequence to token_table
        for word, index in ts.wrdindex_list:
            try:
                while index > current_par_break:
                    current_par_break = next(pb_iter)
                    current_par += 1
            except StopIteration:
                #hit bottom
                pass

            if word in token_table:
                token_table[word].total_count += 1

                if token_table[word].last_par != current_par:
                    token_table[word].last_par = current_par
                    token_table[word].par_count += 1

                if token_table[word].last_tok_seq != current_tok_seq:
                    token_table[word].last_tok_seq = current_tok_seq
                    token_table[word]\
                            .ts_occurences.append([current_tok_seq,1])
                else:
                    token_table[word].ts_occurences[-1][1] += 1
            else: #new word
                sim_to_topics = []
                try:
                    print '\nget topic: %s ...' % str(word)
                    split = word.split(' ')
                    
                    if(len(split) == 1): # single term
                        sim_to_topics = model.get_term_topics(word.encode('ascii'), minimum_probability=0.000000001)
                        print 'sim_to_topics: '
                        print sim_to_topics
                    else:
                        bow_vector = id2word_wiki.doc2bow(split)
                        sim_to_topics = model.get_document_topics(bow_vector, minimum_probability=0.000000001)
                        print 'get n = 2 gram: %s' % word.encode('ascii')

                    print 'in lda model: %s' % str(word)
                except: 
                    print 'not in lda model: %s' % str(word)
                    
                token_table[word] = TokenTableField(first_pos=index,
                                                    ts_occurences= \
                                                    [[current_tok_seq,1]],
                                                    similarity_topics= \
                                                    sim_to_topics, 
                                                    total_count=1,
                                                    par_count=1,
                                                    last_par=current_par,
                                                    last_tok_seq= \
                                                      current_tok_seq)
        current_tok_seq += 1

    return token_table

### 2.2 Declaration: Block Comparison and Lexical Score Determination

In [12]:
def sim_btw_blocks(block1, block2, token_table):
    score = 0.0
    # step 1: collect every word of each block into a list
    b1_words = []
    b2_words = []
    for word_list in block1:
        for word in word_list.wrdindex_list:
            b1_words.append(word)
            
    for word_list in block2:
        for word in word_list.wrdindex_list:
            b2_words.append(word)
    
    # step 2: look up each word's entry in token_table
    b1_words_token = [token_table[b1_word[0]]
                      for b1_word in b1_words]

    b2_words_token = [token_table[b2_word[0]]
                      for b2_word in b2_words]
    
    # step 3: extract the topic_ids associated with each word
    b1_topic_ids = [] # [0, 1, 1, 3, 4, 4, 5....]
    b2_topic_ids = [] # [0, 0, 1, 3, 4, ...]
    for b1_word_token in b1_words_token: # each token contains an array of tuple(s) of sim and topic
        if b1_word_token.similarity_topics: # [(topic_id, sim), (), ...]
            for topic_sim in b1_word_token.similarity_topics:
                b1_topic_ids.append(topic_sim[0]) # add topic to list
                
    for b2_word_token in b2_words_token:
        if b2_word_token.similarity_topics:
            for topic_sim in b2_word_token.similarity_topics:
                b2_topic_ids.append(topic_sim[0]) # add topic to list
    
    # step 4: find the topic_ids common to both blocks
    common_topics = list(set(b1_topic_ids) & set(b2_topic_ids))
    
    """average the similarities of the topics"""

    if len(common_topics) > 0:
        b1_topic_sim = []
        b2_topic_sim = []
        for b1_word_token in b1_words_token:
            for topic_sim in b1_word_token.similarity_topics:
                b1_topic_sim.append(topic_sim) # (1, 0.677)
                         
        for b2_word_token in b2_words_token:
            for topic_sim in b2_word_token.similarity_topics:
                b2_topic_sim.append(topic_sim) # (3, 0.66)
        
        b1_b2_avg_sim = []
        for match_id in common_topics:
            """sum similarities of b1"""
            b1_sum = 0.0
            b1_count = 0
            b1_avg_sim = 0.0
            for index, b1_topic_id in enumerate(b1_topic_ids):
                if match_id == b1_topic_id: # [(0, 0.77), (1, 0.43), (1, 0.56)...]
                    b1_sum += b1_topic_sim[index][1] # index for which topic, 1 for similarity
                    b1_count += 1
            
            """sum similarities of b2"""
            b2_sum = 0.0
            b2_count = 0
            b2_avg_sim = 0.0
            for index, b2_topic_id in enumerate(b2_topic_ids):
                if match_id == b2_topic_id:
                    b2_sum += b2_topic_sim[index][1]
                    b2_count += 1
            
            """average b1 and b2"""
            try:
                b1_avg_sim = b1_sum / b1_count
                b2_avg_sim = b2_sum / b2_count
            except ZeroDivisionError: 
                pass
            
            """multiply avg of b1 and b2 given match_id"""
            product = b1_avg_sim * b2_avg_sim
            b1_b2_avg_sim.append(product)
        
        score = sum(prod_sim for prod_sim in b1_b2_avg_sim) / len(b1_b2_avg_sim)        
    return score
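
The score computed by `sim_btw_blocks` can be summarised as: for every topic that occurs in both blocks, multiply the mean probability of that topic in block 1 by its mean probability in block 2, then average those products. The compact sketch below works on flat lists of `(topic_id, probability)` pairs instead of the token table; the function name and the probabilities are made up for illustration.

```python
from collections import defaultdict


def block_similarity_sketch(block1_topics, block2_topics):
    """Score two blocks from flat lists of (topic_id, probability) pairs."""
    def mean_by_topic(pairs):
        grouped = defaultdict(list)
        for topic_id, prob in pairs:
            grouped[topic_id].append(prob)
        return {t: sum(p) / len(p) for t, p in grouped.items()}

    means1 = mean_by_topic(block1_topics)
    means2 = mean_by_topic(block2_topics)
    common = set(means1) & set(means2)
    if not common:
        return 0.0
    # Average, over shared topics, of the product of the two per-block means.
    return sum(means1[t] * means2[t] for t in common) / len(common)


b1 = [(0, 0.2), (1, 0.4), (1, 0.6)]   # made-up probabilities
b2 = [(1, 0.5), (3, 0.9)]
score = block_similarity_sketch(b1, b2)
print(score)
```

Only topic 1 is shared here, so the score is the product of its two block means (0.5 and 0.5).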

In [13]:
def _block_comparison(tokseqs, token_table, k):
    "Implements the block comparison method"
    gap_scores = []
    numgaps = len(tokseqs)-1

    for curr_gap in range(numgaps):
        #adjust window size for boundary conditions
        if curr_gap < k-1:
            window_size = curr_gap + 1
        elif curr_gap > numgaps-k:
            window_size = numgaps - curr_gap
        else:
            window_size = k

        b1 = [ts for ts in tokseqs[curr_gap-window_size+1 : curr_gap+1]]
        b2 = [ts for ts in tokseqs[curr_gap+1 : curr_gap+window_size+1]]
        score = sim_btw_blocks(b1, b2, token_table)
        gap_scores.append(score)

    return gap_scores
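
To see how the block size shrinks near the start and end of the text, the window logic can be isolated as below. This is only a sketch of the boundary handling in `_block_comparison`; `window_sizes_sketch` is a hypothetical helper and the sizes are toy values.

```python
def window_sizes_sketch(num_tokseqs, k):
    """Return the block size used at each gap, shrinking near both ends."""
    numgaps = num_tokseqs - 1
    sizes = []
    for curr_gap in range(numgaps):
        if curr_gap < k - 1:
            sizes.append(curr_gap + 1)          # not enough sequences on the left
        elif curr_gap > numgaps - k:
            sizes.append(numgaps - curr_gap)    # not enough sequences on the right
        else:
            sizes.append(k)
    return sizes


print(window_sizes_sketch(num_tokseqs=8, k=3))
```

With 8 token sequences and `k = 3`, only the three middle gaps are compared using full-size blocks.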

### 2.3 Declaration: Smoothing and Boundary identification

In [14]:
def _smooth_scores(gap_scores, smoothing_width):
    "Wraps the smooth function from the SciPy Cookbook"
    return list(smooth(numpy.array(gap_scores[:]),
                       window_len = smoothing_width+1))

In [15]:
def _depth_scores(scores):
    """Calculates the depth of each gap, i.e. the average difference
    between the left and right peaks and the gap's score"""

    depth_scores = [0 for x in scores]
    #clip boundaries: this holds on the rule of thumb(my thumb)
    #that a section shouldn't be smaller than at least 2
    #pseudosentences for small texts and around 5 for larger ones.

    clip = min(max(len(scores)/10, 2), 5)
    index = clip

    for gapscore in scores[clip:-clip]:
        lpeak = gapscore
        for score in scores[index::-1]:
            if score >= lpeak:
                lpeak = score
            else:
                break
        rpeak = gapscore
        for score in scores[index:]:
            if score >= rpeak:
                rpeak = score
            else:
                break
        depth_scores[index] = lpeak + rpeak - 2 * gapscore
        index += 1

    return depth_scores
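
The depth of a single gap can be worked through by hand. The sketch below applies the same peak-climbing rule as `_depth_scores` to one index of a toy score list; `depth_at_sketch` and the scores are hypothetical.

```python
def depth_at_sketch(scores, index):
    """Depth of the valley at scores[index]: climb left and right while scores rise."""
    gapscore = scores[index]
    lpeak = gapscore
    for score in scores[index::-1]:      # walk left from the gap
        if score >= lpeak:
            lpeak = score
        else:
            break
    rpeak = gapscore
    for score in scores[index:]:         # walk right from the gap
        if score >= rpeak:
            rpeak = score
        else:
            break
    return lpeak + rpeak - 2 * gapscore


toy_scores = [0.5, 0.75, 0.25, 0.5, 1.0, 0.75]   # made-up smoothed gap scores
print(depth_at_sketch(toy_scores, 2))
```

For index 2 the left peak is 0.75 and the right peak is 1.0, so the depth is 0.75 + 1.0 - 2 * 0.25 = 1.25.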

In [16]:
def _identify_boundaries(depth_scores):
    """Identifies boundaries at the peaks of similarity score
    differences"""

    boundaries = [0 for x in depth_scores]

    avg = sum(depth_scores)/len(depth_scores)
    stdev = numpy.std(depth_scores)

    # using HC (conservative measure)
    cutoff = avg-stdev/2.0

    depth_tuples = sorted(zip(depth_scores, range(len(depth_scores))))
    depth_tuples.reverse()
    hp = list(filter(lambda x:x[0]>cutoff, depth_tuples))

    for dt in hp:
        boundaries[dt[1]] = 1
        for dt2 in hp: #undo if there is a boundary close already
            if dt[1] != dt2[1] and abs(dt2[1]-dt[1]) < 16 \
                   and boundaries[dt2[1]] == 1:
                boundaries[dt[1]] = 0
    return boundaries
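
The HC cutoff used above is the mean depth minus half the standard deviation. A minimal sketch of just that threshold, using plain Python instead of numpy (the population standard deviation matches `numpy.std`'s default), might look like this; the helper name and depth values are made up.

```python
import math


def boundary_cutoff_sketch(depth_scores):
    """Return the HC threshold: mean minus half the population standard deviation."""
    n = float(len(depth_scores))
    avg = sum(depth_scores) / n
    variance = sum((d - avg) ** 2 for d in depth_scores) / n
    return avg - math.sqrt(variance) / 2.0


toy_depths = [0.0, 0.0, 1.0, 0.0, 0.0, 3.0, 0.0, 0.0]   # made-up depth scores
cutoff = boundary_cutoff_sketch(toy_depths)
candidates = [i for i, d in enumerate(toy_depths) if d > cutoff]
print("cutoff=%.3f candidates=%s" % (cutoff, candidates))
```

Gaps whose depth exceeds the cutoff become boundary candidates; `_identify_boundaries` then removes candidates that lie too close to another accepted boundary.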

In [17]:
def _normalize_boundaries(text, boundaries, paragraph_breaks, w):
    """Normalize the boundaries identified to the original text's
    paragraph breaks"""

    norm_boundaries = []
    char_count, word_count, gaps_seen = 0, 0, 0
    seen_word = False

    for char in text:
        char_count += 1
        if char in " \t\n" and seen_word:
            seen_word = False
            word_count += 1
        if char not in " \t\n" and not seen_word:
            seen_word=True
        if gaps_seen < len(boundaries) and word_count > \
                                           (max(gaps_seen*w, w)):
            if boundaries[gaps_seen] == 1:
                #find closest paragraph break
                best_fit = len(text)
                for br in paragraph_breaks:
                    if best_fit > abs(br-char_count):
                        best_fit = abs(br-char_count)
                        bestbr = br
                    else:
                        break
                if bestbr not in norm_boundaries: #avoid duplicates
                    norm_boundaries.append(bestbr)
            gaps_seen += 1

    return norm_boundaries

In [18]:
#Pasted from the SciPy cookbook: http://www.scipy.org/Cookbook/SignalSmooth
def smooth(x,window_len=11,window='flat'):
    """smooth the data using a window with requested size.

    This method is based on the convolution of a scaled window with the signal.
    The signal is prepared by introducing reflected copies of the signal
    (with the window size) in both ends so that transient parts are minimized
    in the beginning and end part of the output signal.

    :param x: the input signal
    :param window_len: the dimension of the smoothing window; should be an odd integer
    :param window: the type of window from 'flat', 'hanning', 'hamming', 'bartlett', 'blackman'
        flat window will produce a moving average smoothing.

    :return: the smoothed signal

    example::

        t=linspace(-2,2,0.1)
        x=sin(t)+randn(len(t))*0.1
        y=smooth(x)

    :see also: numpy.hanning, numpy.hamming, numpy.bartlett, numpy.blackman, numpy.convolve,
        scipy.signal.lfilter

    TODO: the window parameter could be the window itself if an array instead of a string
    """

    if x.ndim != 1:
        raise ValueError("smooth only accepts 1 dimension arrays.")

    if x.size < window_len:
        raise ValueError("Input vector needs to be bigger than window size.")

    if window_len < 3:
        return x

    if not window in ['flat', 'hanning', 'hamming', 'bartlett', 'blackman']:
        raise ValueError("Window is one of 'flat', 'hanning', 'hamming', 'bartlett', 'blackman'")

    s=numpy.r_[2*x[0]-x[window_len:1:-1],x,2*x[-1]-x[-1:-window_len:-1]]

    #print(len(s))
    if window == 'flat': #moving average
        w = numpy.ones(window_len,'d')
    else:
        w = getattr(numpy, window)(window_len)  # look up the window function by name

    y = numpy.convolve(w/w.sum(), s, mode='same')

    return y[window_len-1:-window_len+1]
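
Away from the edges, the `'flat'` window amounts to a centred moving average. The sketch below is a simplified pure-Python version of that idea: it truncates the window at the ends rather than reflecting the signal as `smooth` does, so edge values will differ. The name `moving_average_sketch` and the input values are hypothetical.

```python
def moving_average_sketch(values, window_len=3):
    """Centred moving average; the window is truncated at the ends."""
    half = window_len // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(window) / float(len(window)))
    return smoothed


toy = [0.0, 3.0, 0.0, 3.0, 6.0]   # made-up gap scores
print(moving_average_sketch(toy))
```

A window of 3 corresponds to `smoothing_width + 1` with the `smoothing_width = 2` setting used in the demo.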

### 2.4 Run Demo

In [19]:
from nltk import word_tokenize, pos_tag
# load combined text
filePath = './Sample.txt'

output = []
with open(filePath, 'r') as f:
    for line in f:
        analyzedLine = word_tokenize(line)
        pos_line = pos_tag(analyzedLine)
        output.append(pos_line)

In [20]:
tokenized = []
for sentence in output:
    newSentence = ''
    for aTuple in sentence:
        if aTuple[0] == '.' or aTuple[0] == '?':
            newSentence += aTuple[0]
        else:
            if newSentence == '':
                newSentence += aTuple[0]
            else:
                newSentence += ' ' + aTuple[0]
    if len(newSentence) > 0: 
        tokenized.append(newSentence)

In [21]:
outputFile = open('./tokenized.txt', 'w')
for line in tokenized:
    outputFile.write(line)
    outputFile.write('\n\n')
outputFile.close()

In [22]:
# parameter declaration
# token-sequence size
# i.e. how many words are included in one pseudosentence
w = 20

# block size
# i.e. how many token sequences are included in one block
k = 10

smooth_rounds = 1
smoothing_width = 2

In [23]:
text = open('./tokenized.txt').read().decode('utf8')

In [24]:
"""Return a tokenized copy of *text*, where each "token" represents
a separate topic."""

lowercase_text = text.lower()
text_length = len(lowercase_text)
paragraph_breaks = _mark_paragraph_breaks(text)

In [25]:
# Remove punctuation
nopunct_text = ''.join(c for c in lowercase_text
                               if re.match("[a-z\-\' \n\t]", c))
nopunct_par_breaks = _mark_paragraph_breaks(nopunct_text)
tokseqs = _divide_to_tokensequences(nopunct_text, w)

In [26]:
stopWordsFilePath = './stops.txt'

# load the stop words into a list
with open(stopWordsFilePath, 'r') as stopWordsFile:
    stopwords = [line.strip() for line in stopWordsFile]

# Filter stopwords
for ts in tokseqs:
    new = []
    for wi in ts.wrdindex_list:
        split = wi[0].split(' ') 
        if(len(split) == 2): # n_gram = 2
            firstWord = split[0]
            secondWord = split[1]
            if (firstWord in stopwords) or (secondWord in stopwords):
                continue
            else:
                new.append(wi)
        else:
            if wi[0] in stopwords:
                continue
            else:
                new.append(wi)
            
    ts.wrdindex_list = new
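
The filtering rule above (drop a unigram if it is a stop word, drop a bigram if either word is a stop word) can be written more compactly. This is a sketch on toy data, with a hypothetical helper name; it uses a set for faster membership tests.

```python
def filter_stopwords_sketch(wrdindex_list, stopwords):
    """Drop unigrams that are stop words and bigrams containing any stop word."""
    stopset = set(stopwords)          # set lookup is O(1) per word
    return [(term, pos) for term, pos in wrdindex_list
            if not any(part in stopset for part in term.split(' '))]


toy_tokens = [('the', 0), ('speech', 4), ('speech coding', 4), ('of course', 18)]
print(filter_stopwords_sketch(toy_tokens, ['the', 'of']))
```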

In [27]:
token_table = _create_token_table(tokseqs, nopunct_par_breaks, lda_model, id2word_wiki)
# End of the Tokenization step


get topic: god ...
sim_to_topics: 
[(0, 0.00027128781368857983), (1, 0.00012554599369798394), (2, 0.00013482649247364251), (3, 0.0010919936566173142), (4, 0.00080495225521345643), (5, 0.00042855305911382037), (6, 7.3388432854537013e-05), (7, 0.00017904404479521699), (8, 8.1394841511849353e-05), (9, 0.00015313646610717261), (10, 0.00011761496504767783), (11, 0.00015728723514165326), (12, 0.00025293163984035699), (13, 0.00026602106824061679), (14, 0.00023490126880902669), (15, 0.00024305818856461907), (16, 0.00022716942742226581), (17, 0.00050666504942860433), (18, 0.00015760549960328946), (19, 0.00018987132672341846), (20, 0.0001052145453338182), (21, 0.0010068727511501902), (22, 1.8821597865485463e-05), (23, 0.00013761389463008946), (24, 8.1590464507964072e-05), (25, 0.00020813622537693542), (26, 0.00021505385097748332), (27, 0.00038718795574838586), (28, 0.00029272821657514293), (29, 0.0003058357916744278), (30, 0.00013008874608054342), (31, 6.3423830080427374e-05), (32, 0.0002953080

In [28]:
gap_scores = _block_comparison(tokseqs, token_table, k)

In [29]:
smooth_scores = _smooth_scores(gap_scores, smoothing_width)

In [30]:
# Boundary identification
depth_scores = _depth_scores(smooth_scores)
segment_boundaries = _identify_boundaries(depth_scores)

normalized_boundaries = _normalize_boundaries(text,
                                                   segment_boundaries,
                                                   paragraph_breaks, w)
# End of Boundary Identification
segmented_text = []
prevb = 0

for b in normalized_boundaries:
    if b == 0:
        continue
    segmented_text.append(text[prevb:b])
    prevb = b

if prevb < text_length: # append any text that may be remaining
    segmented_text.append(text[prevb:])

if not segmented_text:
    segmented_text = [text]
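
The final slicing step can be tried on its own with a short string. The sketch below follows the same logic (skip a boundary at offset 0, append any trailing text, fall back to the whole text); `split_at_boundaries_sketch` and the sample are made up for illustration.

```python
def split_at_boundaries_sketch(text, boundaries):
    """Cut text at the given character offsets; offset 0 is ignored."""
    segments = []
    prev = 0
    for b in boundaries:
        if b == 0:
            continue
        segments.append(text[prev:b])
        prev = b
    if prev < len(text):
        segments.append(text[prev:])   # trailing text after the last boundary
    return segments or [text]


print(split_at_boundaries_sketch("aaaa\n\nbbbb\n\ncccc", [4, 10]))
```

Each boundary offset points at the start of a paragraph break, so every segment after the first begins with the blank line that separated it from the previous one.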

### The results so far: gap_scores, smooth_scores, depth_scores, segment_boundaries

In [31]:
import plotly.plotly as py
import plotly.graph_objs as go
import plotly

plotly.offline.init_notebook_mode()

In [32]:
trace0 = go.Scatter(
    x = range(len(gap_scores)),
    y = gap_scores,
    name = 'gap_scores'
)
trace1 = go.Scatter(
    x = range(len(smooth_scores)),
    y = smooth_scores,
    name = 'smooth_scores'
)
trace2 = go.Scatter(
    x = range(len(depth_scores)), 
    y = depth_scores, 
    name = 'depth_scores', 
)
data = [trace0, trace1, trace2]

plotly.offline.iplot({
    "data": data,
    "layout": go.Layout(title='scores')
})

# # save offline
# imgName = './output data/ldaResult/' + fileRoot + '/scores.png'
# print imgName
# py.image.save_as(data, imgName)

In [33]:
trace3 = go.Scatter(
    x = range(len(segment_boundaries)), 
    y = segment_boundaries, 
    name = 'segment_boundaries'
)

plotly.offline.iplot({
    "data": [trace3],
    "layout": go.Layout(title='seg-boundaries')
})

# boundariesImgName = './output data/ldaResult/' + fileRoot + '/boundaries.png'
# py.image.save_as([trace3], boundariesImgName)

# 3. Given the segment, find the topic via LDA

In [34]:
# remove any stop word in segment_text
all_segment_text = []
for aSeg in segmented_text:
    single_seg = []
    matches = re.finditer("\w+", aSeg)
    for index, match in enumerate(matches):
        word = str(match.group())
        word = word.lower()
        if word not in stopwords:
            single_seg.append((word, index))
    all_segment_text.append(single_seg)

### 3.1 Use LDA to find the most related topic

In [35]:
topics = []
for index, aSeg in enumerate(all_segment_text):
    seg_text = [word[0] for word in aSeg]
    seg_topics = []
    bow_vector = id2word_wiki.doc2bow(seg_text)
    try:
        seg_topics = lda_model.get_document_topics(bow_vector, minimum_probability=0.000000001)
    except:
        print 'fail to find topics given the document'
    topics.append({'index': index, 'topics': seg_topics, 'seg_text': aSeg})

In [36]:
for index, topic in enumerate(topics):
    print 'Topic: %s' % str(index)
    print topic['topics']

Topic: 0
[(0, 0.00015625000000000017), (1, 0.00015625000000000017), (2, 0.00015625000000000017), (3, 0.00015625000000000017), (4, 0.00015625000000000017), (5, 0.00015625000000000017), (6, 0.00015625000000000017), (7, 0.00015625000000000017), (8, 0.00015625000000000017), (9, 0.00015625000000000017), (10, 0.00015625000000000017), (11, 0.00015625000000000017), (12, 0.00015625000000000017), (13, 0.00015625000000000017), (14, 0.00015625000000000017), (15, 0.00015625000000000017), (16, 0.00015625000000000017), (17, 0.00015625000000000017), (18, 0.00015625000000000017), (19, 0.020559055776547687), (20, 0.00015625000000000017), (21, 0.00015625000000000017), (22, 0.048596342853691826), (23, 0.00015625000000000017), (24, 0.00015625000000000017), (25, 0.00015625000000000017), (26, 0.035038375344532496), (27, 0.00015625000000000017), (28, 0.00015625000000000017), (29, 0.00015625000000000017), (30, 0.00015625000000000017), (31, 0.00015625000000000017), (32, 0.00015625000000000017), (33, 0.000156250

In [37]:
# select the most related topic of each segment (the one with the highest probability)
for aSeg in topics:
    topic_sim = aSeg['topics']
    
    highest_topic_id = topic_sim[0][0]
    highest_prob = topic_sim[0][1]
    for aTuple in topic_sim:
        if aTuple[1] > highest_prob:
            highest_prob = aTuple[1]
            highest_topic_id = aTuple[0]
    aSeg['high_topic_id'] = highest_topic_id
    aSeg['hight_topic_prob'] = highest_prob
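
The manual loop above is an arg-max over `(topic_id, probability)` pairs. As a sketch, the same selection can be expressed with the built-in `max`; the helper name and the probabilities are made up.

```python
def most_likely_topic_sketch(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])


toy_topics = [(3, 0.10), (7, 0.45), (9, 0.30)]   # made-up probabilities
best_id, best_prob = most_likely_topic_sketch(toy_topics)
print("topic %d with probability %.2f" % (best_id, best_prob))
```

Like the loop, `max` returns the first pair when several share the highest probability.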

In [38]:
for index, aSeg in enumerate(topics):
    print 'Segment: %d' % index
    print 'Topic: %s' % str(aSeg['high_topic_id'])
    print 'Prob: %f\n' % aSeg['hight_topic_prob']

Segment: 0
Topic: 68
Prob: 0.109634

Segment: 1
Topic: 53
Prob: 0.132681

Segment: 2
Topic: 80
Prob: 0.259936

Segment: 3
Topic: 22
Prob: 0.159878

Segment: 4
Topic: 90
Prob: 0.122155

Segment: 5
Topic: 53
Prob: 0.196132

Segment: 6
Topic: 1
Prob: 0.102952

Segment: 7
Topic: 67
Prob: 0.126762

Segment: 8
Topic: 67
Prob: 0.161679

Segment: 9
Topic: 80
Prob: 0.308587

Segment: 10
Topic: 53
Prob: 0.151078

Segment: 11
Topic: 32
Prob: 0.145911

Segment: 12
Topic: 8
Prob: 0.129497

Segment: 13
Topic: 80
Prob: 0.117694

Segment: 14
Topic: 23
Prob: 0.113538

Segment: 15
Topic: 45
Prob: 0.095229

Segment: 16
Topic: 80
Prob: 0.151920

Segment: 17
Topic: 80
Prob: 0.264192

Segment: 18
Topic: 80
Prob: 0.306165

Segment: 19
Topic: 79
Prob: 0.113178

Segment: 20
Topic: 12
Prob: 0.110365

Segment: 21
Topic: 45
Prob: 0.119951

Segment: 22
Topic: 66
Prob: 0.163780

Segment: 23
Topic: 39
Prob: 0.125049

Segment: 24
Topic: 80
Prob: 0.116233

Segment: 25
Topic: 52
Prob: 0.154564

Segment: 26
Topic: 57
Pr

### 3.2 Given the topic_id, find the word with the highest probability for that topic

In [39]:
# n-gram: 1
for aSeg in topics:
    word_sims = []
    for wordTuple in aSeg['seg_text']:
        word = wordTuple[0]
        print 'word: %s' % word
        prob = 0.0
        topic_sim = []
        try:
            posList = nltk.pos_tag([word])
            pos = posList[0]
            if pos[1] == 'NN':
                topic_sim = lda_model.get_term_topics(str(word), minimum_probability=0.000000001)
            else:
                print 'pos is not NN'
                continue
        except:
            print 'not in lda_model: %s' % word
            
        for topic_prob in topic_sim:
            if topic_prob[0] == aSeg['high_topic_id']: # only compare similarity if the word maps to this topic_id
                if topic_prob[1] > prob:
                    prob = topic_prob[1]
        a_word_sim = {'word': word, 'prob': prob}
        word_sims.append(a_word_sim)
    aSeg['word_to_topic_sim'] = word_sims

word: god
word: feel
word: natural
pos is not NN
word: detect
word: meeting
word: started
pos is not NN
word: starting
pos is not NN
word: voila
not in lda_model: voila
word: cool
word: start
word: idea
word: start
word: speak
word: bit
word: speech
word: coding
pos is not NN
word: speech
word: coding
pos is not NN
word: feedback
word: ve
word: ve
word: supposed
pos is not NN
word: speech
word: coding
pos is not NN
word: stuff
word: hynek
word: bit
word: based
pos is not NN
word: hilbert
word: transforms
pos is not NN
word: temporal
pos is not NN
word: context
word: deriving
pos is not NN
word: parameters
pos is not NN
word: tranf
not in lda_model: tranf
word: transcs
not in lda_model: transcs
word: transmitted
pos is not NN
word: decoder
word: andso
not in lda_model: andso
word: bit
word: stuff
word: people
pos is not NN
word: l_p_c_
not in lda_model: l_p_c_
word: based
pos is not NN
word: ca
pos is not NN
word: encode
word: speech
word: course
word: solve
word: biggest
pos is not NN


In [41]:
# n-gram: 2
for aSeg in topics:
    word_sims = []
    seg_text = aSeg['seg_text']
    length = len(seg_text)
    
    for index, wordTuple in enumerate(seg_text):
        if index + 1 < length:
            prob = 0.0
            topic_sim = []
            
            secondTuple = seg_text[index + 1]

            if (secondTuple[1] - wordTuple[1]) == 1:  # check whether they are adjacent
                gram_2 = [wordTuple[0], secondTuple[0]]

                posList = nltk.pos_tag(gram_2)
                firstPos = posList[0]
                secondPos = posList[1]
            
                try:
                    if (firstPos[1] == 'JJ' and secondPos[1] == 'NN') or \
                       (firstPos[1] == 'NN' and secondPos[1] == 'NN') or \
                       (firstPos[1] == 'VB' and secondPos[1] == 'NN'):
                        print 'find combination: %s %s' % (wordTuple[0], secondTuple[0])
                        bow_vector = id2word_wiki.doc2bow([wordTuple[0], secondTuple[0]])
                        topic_sim = lda_model.get_document_topics(bow_vector, minimum_probability=0.00000001)
                    else:
                        print 'pos combination is not ADJ + NN or NN + NN or VB + NN'
                        continue
                except:
                    print 'not in lda_model: %s %s' % (wordTuple[0], secondTuple[0])
            
                for topic_prob in topic_sim:
                    if topic_prob[0] == aSeg['high_topic_id']: # only compare similarity if the bigram maps to this topic_id
                        if topic_prob[1] > prob:
                            prob = topic_prob[1]
                        
                word_combin = wordTuple[0] + ' ' + secondTuple[0]
                word_combin_sim = {'word': word_combin, 'prob': prob}
                word_sims.append(word_combin_sim)
            else:
                continue
    aSeg['word_2_to_topic_sim'] = word_sims

pos combination is not ADJ + NN or NN + NN or VB + NN
find combination: voila cool
pos combination is not ADJ + NN or NN + NN or VB + NN
pos combination is not ADJ + NN or NN + NN or VB + NN
pos combination is not ADJ + NN or NN + NN or VB + NN
pos combination is not ADJ + NN or NN + NN or VB + NN
pos combination is not ADJ + NN or NN + NN or VB + NN
pos combination is not ADJ + NN or NN + NN or VB + NN
find combination: temporal context
find combination: tranf transcs
pos combination is not ADJ + NN or NN + NN or VB + NN
find combination: decoder andso
pos combination is not ADJ + NN or NN + NN or VB + NN
find combination: l_p_c_ stuff
find combination: source signal
find combination: white board
find combination: l_p_c_ stuff
find combination: speech signal
find combination: hilbert transform
pos combination is not ADJ + NN or NN + NN or VB + NN
find combination: analytic signal
find combination: analytic signal
find combination: apply hilbert
find combination: hilbert transform
find

In [42]:
# find the word with the highest probability
for aSeg in topics:
    word = 'no word matched!'
    prob = 0.0
    for word_sim in aSeg['word_to_topic_sim']:
        if word_sim['prob'] > prob:
            word = word_sim['word']
            prob = word_sim['prob']
    for word2_sim in aSeg['word_2_to_topic_sim']:
        if word2_sim['prob'] > prob:
            word = word2_sim['word']
            prob = word2_sim['prob']
    aSeg['topic_word'] = {'word': word, 'prob':prob}

In [43]:
remove_newline_text = [item.splitlines() for item in segmented_text]
result_text = []

newline = u''
for seg in remove_newline_text:
    new_seg = []
    for line in seg:
        if line != newline:
            new_seg.append(line)
    result_text.append(new_seg)

In [44]:
# Finally, the result is here:
resultPath = 'Sample_Seg.txt'
result = open(resultPath, 'w')
for index, aSeg in enumerate(topics):
    topic = 'Topic: %s' % str(aSeg['high_topic_id'])
    print topic
    result.write(topic + '\n')
    
    topic_word = 'Topic word: %s' % aSeg['topic_word']['word']
    print topic_word
    result.write(topic_word + '\n')
    
    print 'Context: '
    seg_context = result_text[index]
    print seg_context
    for seg in seg_context:
        result.write(seg + '\n')
    result.write('\n\n')
result.close()

Topic: 68
Topic word: tranf transcs
Context: 
[u'Oh my God.', u"Yeah , still does n't feel natural.", u"I I wish , you know , the the the the room would just detect when an when a meeting 's started it starting.", u'Yeah.', u'.', u'Voila cool.', u'So you better start.', u'It was your idea.', u'So.', u'Okay.', u'I should start? Yeah , no so I thought that we might s t speak a little bit about speech coding because nobody is doing here speech coding and I would like to have some maybe feedback because except me.', u"I 've I 've supposed to some speech coding stuff here with Hynek uh but you know that little bit at least that uh based on some let 's say Hilbert transforms using a longer temporal context and deriving some parameters would be tranf transcs transmitted to a decoder andSo it 's a little bit different than the stuff which people are using now , which is like a L_P_C_ based.", u'And but we are at the beginning with everything more or less.', u"So still um we ca n't uh d encode 