In [None]:
import re
from gensim import corpora, models
import pandas as pd
import gensim
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
posts_df = pd.read_pickle('../data/interum/text_target.pkl')

In [3]:
# convert into features and target 
feature = posts_df['tokens']
label = posts_df['target']  # label 

In [4]:
# keep only words longer than 2 characters (drops tokens of length 1 or 2)

feature = feature.apply(
    lambda x: [w for w in x if len(w) > 2])

## feature engineering for clustering

In [6]:
# dictionary for train 
dictionary = gensim.corpora.Dictionary(feature)

In [7]:
# drop tokens that appear in more than 5% of documents; keep at most 50,000 tokens
dictionary.filter_extremes(no_below=1, no_above=0.05, keep_n=50000)
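
To make the filtering step concrete, here is a simplified pure-Python sketch of document-frequency filtering. It is not gensim's implementation: `filter_vocab` and `toy_docs` are hypothetical names, and rounding of the `no_above` threshold is ignored.

```python
from collections import Counter

def filter_vocab(docs, no_below=1, no_above=0.05, keep_n=50000):
    """Simplified sketch of document-frequency vocabulary filtering."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each token
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # keep tokens seen in at least no_below docs and at most no_above * n_docs docs
    kept = [tok for tok, df in doc_freq.items()
            if df >= no_below and df <= no_above * n_docs]
    # of the survivors, keep the keep_n tokens with the highest document frequency
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept[:keep_n]

# toy corpus (hypothetical data)
toy_docs = [["java", "thread"], ["java", "byte"],
            ["java", "script"], ["java", "thread"]]
vocab = filter_vocab(toy_docs, no_below=2, no_above=0.5)
print(vocab)
```

With `no_below=1`, as in the cell above, the lower bound removes nothing, so only the 5% upper bound and the `keep_n` cap have any effect.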

In [8]:
# create bag of words 
bow = [dictionary.doc2bow(doc) for doc in feature]
# tfidf for bow 
tfidf = models.TfidfModel(bow)
corpus_tfidf = tfidf[bow]
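
A rough pure-Python sketch of what the TF-IDF step does to a bag of words. It assumes the weighting tf × log2(N/df) followed by unit-length normalization, which is my understanding of gensim's defaults, so treat the details as an assumption. `tfidf_weights` and `toy_docs` are hypothetical names.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {token: weight} dict per document (assumed tf * log2(N/df), L2-normalized)."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each token
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        raw = {tok: tf * math.log2(n_docs / doc_freq[tok])
               for tok, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # tokens that occur in every document get weight 0 and are dropped
        weighted.append({tok: w / norm for tok, w in raw.items() if w > 0}
                        if norm else {})
    return weighted

# toy corpus (hypothetical data)
toy_docs = [["java", "thread", "thread"], ["java", "byte"]]
weights = tfidf_weights(toy_docs)
print(weights)
```

In the toy corpus, "java" appears in every document, so it carries no weight under this scheme.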

In [9]:
# generate a model for bag of words with 5 topics 
lda_model = gensim.models.LdaMulticore(
    bow, num_topics=5, id2word=dictionary, passes=2, workers=4,random_state=42)

In [12]:
# generate a model on the tfidf-weighted corpus with 5 topics
lda_model_tfidf = gensim.models.LdaMulticore(
    corpus_tfidf, num_topics=5, id2word=dictionary, passes=2, workers=4, random_state=42)

In [11]:
def topic_top_word(model, num_topics=5, num_words=5):
    '''
    input:
    model: lda_model (bow or tfidf)
    return:
    a dataframe with the top words for each topic, one column per topic
    '''
    # formatted=False returns (topic_id, [(word, weight), ...]) pairs instead of a
    # string, so tokens with digits or underscores are not split by a regex
    topics = model.show_topics(
        num_topics=num_topics, num_words=num_words, formatted=False)
    topics_dict = {f'topic_{topic_id}': [word for word, _ in words]
                   for topic_id, words in topics}
    return pd.DataFrame(topics_dict)

In [13]:
topic_top_word(lda_model)

   topic_0  topic_1   topic_2  topic_3    topic_4
0     byte  android       std      org        log
1     long   script       foo      key     source
2   thread     item     event   import        org
3     char    model    option      foo  framework
4     size     date  template     self     module


In [14]:
topic_top_word(lda_model_tfidf)

   topic_0  topic_1   topic_2  topic_3    topic_4
0   script     date       std      foo     thread
1     byte     item       foo      key        log
2    event   script    vector    print  arraylist
3     long  android  template      log       json
4    image   button      date      bar    private


## Predict using both models

In [15]:
# predict with bow lda: assign each document its most probable topic
pred_bow = []
for doc in bow:
    result = lda_model[doc]  # list of (topic_id, probability) pairs
    pred_bow.append(max(result, key=lambda x: x[1])[0])

In [16]:
# predict with tfidf lda: assign each document its most probable topic
# (note: documents are passed in as raw bag-of-words counts, not tfidf weights)
pred_tdif = []
for doc in bow:
    result = lda_model_tfidf[doc]  # list of (topic_id, probability) pairs
    pred_tdif.append(max(result, key=lambda x: x[1])[0])
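
Both prediction cells pick the most probable topic from a list of `(topic_id, probability)` pairs. A minimal sketch of that selection as a reusable helper follows. `dominant_topic` is a hypothetical name, and returning `None` for an empty list is an assumption about how that case should be handled.

```python
def dominant_topic(topic_probs):
    """Return the id of the most probable topic from (topic_id, probability) pairs."""
    if not topic_probs:
        # no topic reported for this document (assumed handling)
        return None
    return max(topic_probs, key=lambda pair: pair[1])[0]

# toy topic distribution (hypothetical values)
example = [(0, 0.12), (3, 0.71), (4, 0.17)]
print(dominant_topic(example))
```

With such a helper, each prediction list could be built in one line, for example `[dominant_topic(lda_model[doc]) for doc in bow]`.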
        

In [17]:
# combine results 
result = pd.DataFrame([pred_bow,pred_tdif,label]).T
result.columns = ['bow','tdif','true_label']
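
One way to see how the predicted topics line up with the true labels is to count each (predicted topic, true label) pair. The sketch below uses only the standard library. `topic_label_counts` is a hypothetical helper, and the toy values are made up, not results from this notebook.

```python
from collections import Counter

def topic_label_counts(pred_topics, true_labels):
    """Count how often each (predicted topic, true label) pair occurs."""
    return Counter(zip(pred_topics, true_labels))

# toy values (hypothetical, not taken from the data)
toy_pred = [0, 0, 1, 1, 1]
toy_true = ["python", "python", "java", "java", "python"]
counts = topic_label_counts(toy_pred, toy_true)
for (topic, label), n in sorted(counts.items()):
    print(f"topic {topic} / label {label}: {n}")
```

On the real dataframe, `pd.crosstab(result['bow'], result['true_label'])` should give the same counts as a table.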

### combine predictions with the original df and look at the text

In [18]:
combined_df = pd.concat([posts_df[['text']],result], axis = 1)

### function to look at each topic separately

In [19]:
def text_topic(df, model, topic):
    '''
    input:
    df: dataframe with a 'text' column and one predicted-topic column per model
    model: name of the prediction column ('bow' or 'tdif')
    topic: topic number
    returns:
    a random text assigned to that topic
    '''
    texts = df.loc[df[model] == topic, 'text']
    ind = np.random.choice(texts.index)
    return texts[ind]
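
`text_topic` draws one random text per call, so five calls can return the same post more than once. A minimal sketch of drawing several distinct texts in one call follows. `sample_texts` and its `seed` parameter are hypothetical additions, not part of the notebook.

```python
import random

def sample_texts(texts, k=5, seed=None):
    """Draw up to k distinct items from texts; pass a seed for reproducible draws."""
    rng = random.Random(seed)
    texts = list(texts)
    # sampling without replacement, capped at the number of available texts
    return rng.sample(texts, min(k, len(texts)))

# toy posts (hypothetical data)
toy_texts = ["post a", "post b", "post c", "post d", "post e", "post f", "post g"]
picked = sample_texts(toy_texts, k=5, seed=0)
print(picked)
```

On the dataframe, this could be called as `sample_texts(combined_df.loc[combined_df['bow'] == 0, 'text'])`.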
    

## bow model: check 5 random texts for each of topics 0 to 4

In [22]:
for topic in range(5):
    for _ in range(5):
        print(f'for model bow and topic {topic}:')
        print('------------------------------')
        print(text_topic(combined_df, 'bow', topic))
        print('------------------------------')

for model bow and topic 0:
------------------------------


In RStudio you can run parts of code in the code editing window and the
results appear in the console.

You can also do cool stuff like selecting whether you want everything up to
the cursor to run or everything after the cursor or just the part that you
selected and so on. And there are hot keys for all that stuff.

It's like a step above the interactive shell in Python -- there you can use
readline to go back to previous individual lines but it doesn't have any
concept of what a function is a section of code etc.

Is there a tool like that for Python? Or do you have some sort of similar
workaround that you use say in vim?


------------------------------
for model bow and topic 0:
------------------------------


Is my function of creating cookie correct? and how do i delete cookie at the
beginning of my program run? is there a simple coding?

function createCookie(namevaluedays)

    
    
    <script> function setCookie(c_

## tfidf model: check 5 random texts for each of topics 0 to 4

In [25]:
for topic in range(5):
    for _ in range(5):
        print(f'for model tfidf and topic {topic}:')
        print('------------------------------')
        print(text_topic(combined_df, 'tdif', topic))
        print('------------------------------')

for model tfidf and topic 0:
------------------------------


I'm writing a unit test for this one method which returns void . I would like
to have one case that the test passes when there is no exception thrown. How
do I write that in C#?

    
    
    Assert.IsTrue(????) 

(My guess is this is how I should check but what goes into ??? )

I hope my question is clear enough.


------------------------------
for model tfidf and topic 0:
------------------------------


I know how to make a **view** condition in AngularJS that will display or hide
dom element dependent on the condition:

    
    
    <div ng-show= {{isTrue}} >Some content</div> 

but how do I create a **render** condition that determines whether to render
or not the div?


------------------------------
for model tfidf and topic 0:
------------------------------


I need to get a hold of every flag every switch used in the build process by
the Visual Studio binaries. I tried to obtain a verbose output by using
`vcbuild` but 