> In this notebook, we try to go one step beyond the LDA Mallet model results. Because the LDA Mallet model returns only a list of keywords for each topic rather than a named topic, we aim to extract that topic, or concept, from the keyword list. <br><br>
  Each word can have several meanings, but we assume that the words grouped under one concept share similar meanings. We use the WordNet module to obtain these meanings as synsets, and we measure their similarity with the Wu-Palmer similarity score.<br>
  Therefore, we took the following steps to achieve our goal: <ol>
    <li>Compute the pairwise synset similarity for all synset pairs drawn from each pair of distinct words</li>
    <li>Set a threshold for the similarity score for filtering</li>
    <li>Build a network from the remaining synsets, then identify its connected components and keep only the main ones</li>
    <li>For each component, compute the similarity scores between each synset and every one of its hypernyms, and use the median score.</li> 
    <li>Use the median score to build a weighted hypernym-synset tree <br>
        For each node, the weight is equal to the sum of its score and all its children's scores </li>
    <li>Select the appropriate hypernyms based on the score: it must lie between the 15th and 30th percentiles of all scores, to balance generality and specificity.</li>
    <li>Extract the nouns from the preprocessed selected hypernyms' definitions as Concepts/Topics</li>
    </ol>
  After testing on some of the topic keywords generated by the LDA model, we found that the concepts produced by this approach are still inaccurate and noisy. Because of time constraints, we did not explore further approaches. However, we found a paper that may suit our case and that we may test in the future: 
  http://www.wangzhongyuan.com/tutorial/ACL2016/Understanding-Short-Texts/Slides/Understanding-Short-Texts-Part-II-Explicit-Representation.pdf
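
For readers unfamiliar with the Wu-Palmer score, the sketch below computes its textbook form, 2 * depth(LCS) / (depth(a) + depth(b)), on a tiny hand-made hypernym tree. The tree and the names `TOY_PARENT`, `path_to_root` and `wu_palmer` are illustrative assumptions, not WordNet data or NLTK code. WordNet is a graph with several hypernym paths per synset, so `wn.wup_similarity` may differ in detail.

```python
# Toy hypernym tree (child -> parent); a hypothetical stand-in for WordNet.
TOY_PARENT = {
    "soldier": "person",
    "refugee": "person",
    "person": "entity",
    "weapon": "artifact",
    "artifact": "entity",
}


def path_to_root(node, parent=TOY_PARENT):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def wu_palmer(a, b, parent=TOY_PARENT):
    """Textbook Wu-Palmer score, with the root at depth 1."""
    ancestors_a = set(path_to_root(a, parent))
    # Walking up from b, the first node shared with a is the
    # least common subsumer (LCS) in a tree.
    lcs = next(n for n in path_to_root(b, parent) if n in ancestors_a)

    def depth(n):
        return len(path_to_root(n, parent))

    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("soldier", "refugee"))  # LCS is "person"
print(wu_palmer("soldier", "weapon"))   # LCS is "entity"
```

Two words that meet low in the tree (a deep LCS) score higher than two words that only meet at the root, which is the property the filtering steps rely on.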

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from nltk.corpus import wordnet as wn
from itertools import combinations,product
import networkx as nx

## Step by Step ##

In [9]:
# example topic keywords
topic = ['war','kill','conflict','refugee','military','soldier','weapon','Afghanistan','attack','peace']

In [10]:
# 1. Compute the pairwise synset similarity for all synset pairs drawn from each pair of distinct words

synset_pair = [] # [(s1,s2,w1,w2,score)]
scores = []
for w1,w2 in combinations(topic,r=2):
    sw1 = wn.synsets(w1)
    sw2 = wn.synsets(w2)
    
    for s1, s2 in product(sw1,sw2):
        score = wn.wup_similarity(s1,s2)
        if not score:
            continue
        if (s1,s2,w1,w2,score) in synset_pair or (s2,s1,w1,w2,score) in synset_pair:
            continue
        synset_pair.append((s1,s2,w1,w2,score))
        scores.append(score)
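
To make the enumeration in step 1 concrete, here is a minimal sketch of how `combinations` and `product` interact. It runs on a made-up sense inventory: `toy_senses` and `cross_word_sense_pairs` are hypothetical names, and the sense labels are placeholders rather than real synsets.

```python
from itertools import combinations, product

# Hypothetical sense inventory: word -> list of sense labels.
toy_senses = {
    "war": ["war#1", "war#2"],
    "conflict": ["conflict#1", "conflict#2"],
    "attack": ["attack#1"],
}


def cross_word_sense_pairs(senses):
    """Yield (sense1, sense2, word1, word2) for every pair of distinct words."""
    for w1, w2 in combinations(senses, r=2):            # unordered word pairs
        for s1, s2 in product(senses[w1], senses[w2]):  # all sense pairs
            yield s1, s2, w1, w2


pairs = list(cross_word_sense_pairs(toy_senses))
print(len(pairs))  # 2*2 + 2*1 + 2*1 sense pairs
```

Because `combinations` never pairs a word with itself, senses of the same word are never compared with each other. The number of comparisons grows with the product of the sense counts, so highly polysemous keywords dominate the score list.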

In [11]:
# 2. Set a threshold on the similarity score for filtering
# We consider three ways to decide how many synset pairs to keep:
#    1. An arbitrary cap of 30 synset pairs
#    2. The number of pairs scoring above the 90th percentile of the similarity scores
#    3. The number of pairs scoring above a fixed similarity score of 0.5
# We use whichever gives the smallest number.
# E.g. if that number is 10, the score at index 10 of the descending ranking (the 11th highest) becomes the cut-off

scores = np.array(scores)

max_sysnet_num = 30
NinetyPercentile_num = len(scores[scores>np.percentile(scores,90)])
halfScore_num = len(scores[scores>0.5])

min_synset_num = min(NinetyPercentile_num,halfScore_num,max_sysnet_num)
   
synset_pair_sorted = sorted(synset_pair,key=lambda x:x[-1], reverse=True)
threshold = synset_pair_sorted[min_synset_num][-1]
synset_pair_selected = [v for v in synset_pair_sorted if v[-1] >= threshold and v[0] != v[1]]
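
The cut-off rule can be checked by hand on a small list of scores. The sketch below is a standard-library approximation of the cell above: `percentile` is a hypothetical stand-in for `np.percentile` with its default linear interpolation, and `toy_scores` is invented data.

```python
def percentile(values, q):
    """Linear-interpolation percentile (stand-in for np.percentile's default)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def select_cutoff(scores, max_pairs=30, pct=90, fixed=0.5):
    """Return (n, threshold), where n is the smallest of the three counts."""
    n_pct = sum(s > percentile(scores, pct) for s in scores)
    n_fixed = sum(s > fixed for s in scores)
    n = min(n_pct, n_fixed, max_pairs)
    ranked = sorted(scores, reverse=True)
    return n, ranked[n]  # score at index n of the descending ranking


toy_scores = [0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1]
n, threshold = select_cutoff(toy_scores)
kept = [s for s in toy_scores if s >= threshold]
print(n, threshold, len(kept))
```

On this toy input only one score lies above the 90th percentile, so `n` is 1 and the threshold is the score at index 1. Since the filter uses `>=`, two scores survive, one more than `n`.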

In [12]:
# 3. Build a network from the remaining synsets, then identify and filter its components to obtain the main ones
# Components are filtered with a threshold, as in step 2:
#     the 3rd largest distinct component size is used as the cut-off
nodes = set([w[0] for w in synset_pair_selected] + [w[1] for w in synset_pair_selected] )
weighted_edge_list = {
    (w[0],w[1]):{'score':w[-1],'words':(w[2],w[3])} for w in synset_pair_selected
}

G = nx.Graph()

G.add_nodes_from(nodes)
G.add_edges_from(weighted_edge_list)
nx.set_edge_attributes(G, weighted_edge_list)


components = list(nx.connected_components(G))
if len(components) > 3:
    # distinct component sizes, largest first
    component_size = sorted({len(c) for c in components}, reverse=True)
    # cut off at the 3rd largest size, or at the smallest one when fewer than 3 distinct sizes exist
    size_threshold = component_size[2] if len(component_size) > 2 else component_size[-1]

    main_components = [c for c in components if len(c) >= size_threshold]
else:
    main_components = components
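
As a dependency-free illustration of the component filter, the sketch below finds connected components with a plain depth-first search and applies the same "3rd largest distinct size" rule. The function names and the toy graph are assumptions made for this example; the notebook itself uses `nx.connected_components`.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        components.append(component)
    return components


def keep_main_components(components, keep_ranks=3):
    """Keep components at least as large as the 3rd largest distinct size."""
    if len(components) <= keep_ranks:
        return list(components)
    sizes = sorted({len(c) for c in components}, reverse=True)
    size_threshold = sizes[min(keep_ranks, len(sizes)) - 1]
    return [c for c in components if len(c) >= size_threshold]


toy_nodes = list("abcdefghij")
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f"), ("f", "g"), ("h", "i")]
main = keep_main_components(connected_components(toy_nodes, toy_edges))
print(sorted(len(c) for c in main))
```

The isolated node `j` forms a component of size 1 and is dropped, while the three larger groups survive. Because the rule works on distinct sizes, several components of equal size are all kept or all dropped together.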

In [13]:
# 4. For each component, compute the similarity scores between each synset and every one of its hypernyms, and use the median score.
import statistics
def generate_hypernym_median_dict(above_threshold_synsets):
    '''
    @param above_threshold_synsets: list, a list of synsets 
    @return hypernym_median_dict: dict, a dictionary with synset/hypernym as keys, and the median score as values
    
    Description: The function collects every hypernym on the hypernym paths of the given synsets (the synsets themselves 
    included), computes its similarity score with each given synset, and uses the median as the final score for that synset/hypernym
    '''
    
    hypernym_dict = {}
    for s in above_threshold_synsets:
        for p in s.hypernym_paths():
            for hypernym in p:
                hypernym_dict[hypernym] = 1 
                    
    # score dict - for each hypernym in the tree, collect its similarity scores with all synsets
    hypernym_score_dict = {}
    for hypernym, syn in product(hypernym_dict.keys(),above_threshold_synsets):
        score = wn.wup_similarity(hypernym,syn)
        if score is None: # no common ancestor, so no score is defined
            continue
        hypernym_score_dict.setdefault(hypernym, []).append(score)
    
    hypernym_median_dict = {k:statistics.median(v) for k,v in hypernym_score_dict.items()} # take the median score
    
    return hypernym_median_dict

# 5. Use the median score to build a weighted hypernym-synset tree
def generate_weighted_hypernym_tree(synsets,synset_score_dict):
    '''
    @param synsets: list, the synsets of one component
    @param synset_score_dict: dictionary, the median score of each synset/hypernym
    
    @return tree: dictionary, the key-value pair is parent-children where the parent is the direct hypernym to all the children
    @return synset_weights: dictionary, the weight of each synset/hypernym 
    
    Description: the function will use the synset and synset_score_dict given to construct a tree with the weights. The tree is
    in the form of a dictionary with keys being the parent node and value being the list of the children nodes. The weight of 
    each node is calculated as the sum of its score and all its children's scores 
    '''
    tree= {}
    synset_weights = synset_score_dict.copy()
    for syn in synsets:
        for p in syn.hypernym_paths():
            for i in range(-2,-len(p)-1,-1): # the last one is the synset itself
                score = synset_weights[p[i+1]]
                
                if p[i] in synset_weights:
                    synset_weights[p[i]] += score
                else:
                    synset_weights[p[i]] = score
                
                if p[i] in tree:
                    if p[i+1] not in tree[p[i]]:
                        tree[p[i]].append(p[i+1])
                else:
                    tree[p[i]] = [(p[i+1])]
    
    return tree,synset_weights
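
To see what the accumulation loop in step 5 produces, the sketch below replays it on two hand-written hypernym paths. `propagate_weights`, the paths and the base scores are all hypothetical; only the loop structure is taken from `generate_weighted_hypernym_tree`.

```python
def propagate_weights(paths, base_scores):
    """Walk each root-to-leaf path upward, adding each child's weight to its parent."""
    tree = {}
    weights = dict(base_scores)
    for path in paths:
        # path[-1] is the leaf; visit (parent, child) pairs from the leaf up
        for i in range(len(path) - 2, -1, -1):
            parent, child = path[i], path[i + 1]
            weights[parent] = weights.get(parent, 0) + weights[child]
            children = tree.setdefault(parent, [])
            if child not in children:
                children.append(child)
    return tree, weights


toy_paths = [
    ["entity", "conflict", "war"],
    ["entity", "conflict", "battle"],
]
toy_scores = {"entity": 0.25, "conflict": 0.5, "war": 1.0, "battle": 1.0}

tree, weights = propagate_weights(toy_paths, toy_scores)
print(tree)
print(weights)
```

In this toy run `conflict` ends at 2.5, its own score plus both children. `entity` ends at 4.25 rather than 0.25 + 2.5, because a shared ancestor is updated once per path and each update adds the child's running total. General hypernyms near the root therefore accumulate weight quickly, which is worth keeping in mind when reading the percentile band in step 6.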

In [3]:
# Helper functions for Step 7: Extract the nouns from the preprocessed selected hypernyms' definitions as Concepts/Topics
import spacy
from functions.convert_to_nouns import convert

nlp = spacy.load("en_core_web_sm")

def get_convert_tag(tag):
    if tag[:2] == 'NN':
        return 'n'
    elif  tag[:2] == 'JJ':
        return 'a'
    elif  tag[:2] == 'VB':
        return 'v'
    elif tag[:2] == 'RB':
        return 'r'
    return ''
def extract_concept_noun_phrases(definition):
    doc = nlp(definition)
    result = []
    for noun_phrase in list(doc.noun_chunks):
        # Span.merge() was removed in spaCy v3; the root tag and span text used below do not depend on merging
        if noun_phrase.root.tag_ == "NN":
            noun_doc = nlp(noun_phrase.text)
            for t in noun_doc:
                if not t.is_stop:
                    if 'NN' in t.tag_:
                        result.append(t.lemma_)
                    else:
                        from_tag = get_convert_tag(t.tag_)
                        if from_tag != '':
                            result.append(convert(t.lemma_,from_tag,'n'))
    if result:
        return result
    else:
        return ''
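
The tag mapping in `get_convert_tag` can also be written as a lookup table, which makes the supported tag families easy to see. This is only an alternative sketch; `PREFIX_TO_POS` and `tag_to_pos` are hypothetical names that do not appear elsewhere in the notebook.

```python
# Penn Treebank tag prefix -> WordNet-style part-of-speech letter
PREFIX_TO_POS = {"NN": "n", "JJ": "a", "VB": "v", "RB": "r"}


def tag_to_pos(tag):
    """Map a fine-grained tag such as 'VBD' to 'v'; unknown tags give ''."""
    return PREFIX_TO_POS.get(tag[:2], "")


for tag in ["NNS", "VBD", "JJR", "RB", "DT"]:
    print(tag, repr(tag_to_pos(tag)))
```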

In [14]:
concepts= {} # topic:[concepts]
for i,comp in enumerate(main_components):
    topic_concept = set()
    
    hypernym_median_dict = generate_hypernym_median_dict(comp)
    tree,synset_weights = generate_weighted_hypernym_tree(comp,hypernym_median_dict)
    
    # 6. Select the appropriate hypernyms based on the score (15th to 30th percentile band)
    ceiling_threshold = np.percentile(list(synset_weights.values()),30)
    floor_threshold = np.percentile(list(synset_weights.values()),15)
    
    synset_weights_selected = [(k,v) for k,v in synset_weights.items() if v <= ceiling_threshold and v>=floor_threshold]
    
    # 7. Extract the nouns from the preprocessed selected hypernyms' definitions as Concepts/Topics
    definitions = [k.definition() for k,v in synset_weights_selected]
    def_tokens = [extract_concept_noun_phrases(d) for d in definitions]
    
    for ts in def_tokens:
        try:
            topic_concept = topic_concept | set(ts) 
        except TypeError:
            # set() fails on unhashable items; print the inputs for debugging
            print(definitions)
            print(def_tokens)
            print(ts)
    concepts[i] = list(topic_concept)
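
The percentile band of step 6 can be illustrated without NumPy. In the sketch below, `percentile` is a hypothetical stand-in for `np.percentile` with default linear interpolation, and `toy_weights` is invented so that the band edges are easy to verify by hand.

```python
def percentile(values, q):
    """Linear-interpolation percentile (stand-in for np.percentile's default)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def select_band(weights, low=15, high=30):
    """Return the keys whose weight lies within the [low, high] percentile band."""
    values = list(weights.values())
    floor, ceiling = percentile(values, low), percentile(values, high)
    return sorted(k for k, v in weights.items() if floor <= v <= ceiling)


# Eleven hypothetical nodes with weights 1.0 .. 11.0
toy_weights = {f"node{i}": float(i) for i in range(1, 12)}
print(select_band(toy_weights))
```

With eleven evenly spaced weights the band runs from 2.5 to 4.0, so only two nodes are selected. The band sits near the low end of the weight distribution, which excludes the heavily weighted general hypernyms close to the root.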

In [34]:
concepts[2] = concepts[2][:-1]

In [36]:
temp = pd.DataFrame.from_dict(concepts, orient='index') # one row per component; shorter lists are padded
temp.index = ['concept {}'.format(i) for i in temp.index]
temp

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
concept 0,hostility,offensive,activeness,struggle,sport,game,war,armed,course,waging,conflict,military,persuader,enemy,disagreement,argument
concept 1,opposition,state,absence,war,,,,,,,,,,,,
concept 2,destruction,missile,ship,plane,enemy,life,tank,act,,,,,,,,


## Create a function and Run it on all topic keywords ##

In [4]:
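def generate_topic(keywords):
    '''
    @param keywords: list, the keywords of one LDA topic
    @return concepts: dict, component index as keys and the list of extracted concept nouns as values
    
    Description: runs steps 1 to 7 above on a single list of topic keywords
    '''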
    synset_pair = []
    scores = []
    for w1,w2 in combinations(keywords,r=2):
        sw1 = wn.synsets(w1)
        sw2 = wn.synsets(w2)

        for s1, s2 in product(sw1,sw2):
            score = wn.wup_similarity(s1,s2)
            if not score:
                continue
            if (s1,s2,w1,w2,score) in synset_pair or (s2,s1,w1,w2,score) in synset_pair:
                continue
            synset_pair.append((s1,s2,w1,w2,score))
            scores.append(score)
            
    scores = np.array(scores)

    max_sysnet_num = 30
    NinetyPercentile_num = len(scores[scores>np.percentile(scores,90)])
    halfScore_num = len(scores[scores>0.5])

    min_synset_num = min(NinetyPercentile_num,halfScore_num,max_sysnet_num)

    synset_pair_sorted = sorted(synset_pair,key=lambda x:x[-1], reverse=True)
    threshold = synset_pair_sorted[min_synset_num][-1]
    synset_pair_selected = [v for v in synset_pair_sorted if v[-1] >= threshold and v[0] != v[1]]
    
    nodes = set([w[0] for w in synset_pair_selected] + [w[1] for w in synset_pair_selected] )
    weighted_edge_list = {
        (w[0],w[1]):{'score':w[-1],'words':(w[2],w[3])} for w in synset_pair_selected
    }

    G = nx.Graph()

    G.add_nodes_from(nodes)
    G.add_edges_from(weighted_edge_list)
    nx.set_edge_attributes(G, weighted_edge_list)


    components = list(nx.connected_components(G))
    if len(components) > 3:
        # distinct component sizes, largest first
        component_size = sorted({len(c) for c in components}, reverse=True)
        # cut off at the 3rd largest size, or at the smallest one when fewer than 3 distinct sizes exist
        size_threshold = component_size[2] if len(component_size) > 2 else component_size[-1]
        main_components = [c for c in components if len(c) >= size_threshold]
    else:
        main_components = components
        
    concepts= {} # topic:[concepts]
    for i,comp in enumerate(main_components):
        topic_concept = set()

        hypernym_median_dict = generate_hypernym_median_dict(comp)
        tree,synset_weights = generate_weighted_hypernym_tree(comp,hypernym_median_dict)

        # 6. Select the appropriate hypernyms based on the score (15th to 30th percentile band)
        ceiling_threshold = np.percentile(list(synset_weights.values()),30)
        floor_threshold = np.percentile(list(synset_weights.values()),15)

        synset_weights_selected = [(k,v) for k,v in synset_weights.items() if v <= ceiling_threshold and v>=floor_threshold]

        # 7. Extract the nouns from the preprocessed selected hypernyms' definitions as Concepts/Topics
        definitions = [k.definition() for k,v in synset_weights_selected]
        def_tokens = [extract_concept_noun_phrases(d) for d in definitions]

        for ts in def_tokens:
            try:
                topic_concept = topic_concept | set(ts) 
            except TypeError:
                # set() fails on unhashable items; print the inputs for debugging
                print(definitions)
                print(def_tokens)
                print(ts)
        concepts[i] = list(topic_concept)
    return concepts

In [5]:
topicWordsMatrix = pd.read_csv('data/output/topicWordMatrix.csv')

In [6]:
import ast
topicWords = topicWordsMatrix.applymap(lambda x:ast.literal_eval(x)[1])
topicWords

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,25,26,27,28,29,30,31,32,33,34
0,war,woman,patient,country,kind,work,datum,school,water,money,...,feel,brain,science,planet,ocean,city,art,government,cell,world
1,kill,man,cancer,world,structure,time,information,kid,energy,company,...,people,neuron,people,universe,fish,building,image,country,dna,people
2,conflict,girl,disease,change,form,people,internet,student,oil,business,...,experience,memory,question,space,water,design,work,power,gene,country
3,refugee,sex,health,China,system,change,people,child,carbon,market,...,life,consciousness,study,star,sea,build,create,political,body,India
4,military,boy,doctor,growth,pattern,problem,phone,teacher,climate,pay,...,love,sleep,answer,light,coral,place,light,democracy,life,child
5,soldier,gender,drug,time,work,good,online,education,fuel,buy,...,happy,cell,problem,life,animal,space,artist,people,human,poor
6,weapon,female,care,economic,understand,create,technology,learn,power,product,...,happiness,body,find,time,shark,people,project,citizen,genetic,community
7,Afghanistan,gay,medical,economy,thing,life,open,teach,gas,create,...,emotion,mind,scientist,galaxy,boat,live,picture,public,molecule,live
8,attack,talk,treatment,global,simple,thing,find,class,solar,cost,...,mind,control,wrong,particle,ice,community,color,change,genome,village
9,peace,young,hospital,future,shape,team,web,high,material,sell,...,feeling,child,datum,energy,whale,street,photograph,vote,bacteria,poverty
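
The `applymap` call above implies that each CSV cell holds a stringified pair whose second element is the keyword. The sketch below shows that parsing step on its own. The cell format `(weight, 'keyword')` and the sample values are assumptions, since the raw CSV is not shown in this notebook.

```python
import ast

# Hypothetical cell contents; the real CSV layout is not shown here.
raw_cells = ["(0.052, 'war')", "(0.031, 'kill')"]


def keyword_from_cell(cell):
    """Parse a stringified (weight, keyword) pair and return the keyword."""
    return ast.literal_eval(cell)[1]


print([keyword_from_cell(c) for c in raw_cells])
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for text read from a file.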


In [7]:
topic_concepts = {}
for i in range(topicWords.shape[1]): # one column per topic
    topic_keywords = topicWords.iloc[:,i].tolist()
    print(topic_keywords)
    concept = generate_topic(topic_keywords)
    topic_concepts[i] = concept

['war', 'kill', 'conflict', 'refugee', 'military', 'soldier', 'weapon', 'Afghanistan', 'attack', 'peace']
['woman', 'man', 'girl', 'sex', 'boy', 'gender', 'female', 'gay', 'talk', 'young']
['patient', 'cancer', 'disease', 'health', 'doctor', 'drug', 'care', 'medical', 'treatment', 'hospital']
['country', 'world', 'change', 'China', 'growth', 'time', 'economic', 'economy', 'global', 'future']
['kind', 'structure', 'form', 'system', 'pattern', 'work', 'understand', 'thing', 'simple', 'shape']
['work', 'time', 'people', 'change', 'problem', 'good', 'create', 'life', 'thing', 'team']
['datum', 'information', 'internet', 'people', 'phone', 'online', 'technology', 'open', 'find', 'web']
['school', 'kid', 'student', 'child', 'teacher', 'education', 'learn', 'teach', 'class', 'high']
['water', 'energy', 'oil', 'carbon', 'climate', 'fuel', 'power', 'gas', 'solar', 'material']
['money', 'company', 'business', 'market', 'pay', 'buy', 'product', 'create', 'cost', 'sell']
['human', 'animal', 'speci

In [8]:
import pickle
# use a context manager so the file is closed once the dump finishes
with open('data/pickle/topic_concepts.p','wb') as f:
    pickle.dump(topic_concepts,f)
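
For later analysis the saved dictionary can be read back with `pickle.load`. The round trip below is a minimal sketch that writes to a temporary directory instead of `data/pickle/`, using a small hypothetical `toy_topic_concepts` in place of the real results.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for topic_concepts: {topic index: {component index: [concepts]}}
toy_topic_concepts = {0: {0: ["war", "conflict"], 1: ["peace"]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "topic_concepts.p")
    with open(path, "wb") as fh:
        pickle.dump(toy_topic_concepts, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == toy_topic_concepts)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.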