---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Assignment 4 - Document Similarity & Topic Modelling

## Part 1 - Document Similarity

For the first part of this assignment, you will complete the functions `doc_to_synsets` and `similarity_score` which will be used by `document_path_similarity` to find the path similarity between two documents.

The following functions are provided:
* **`convert_tag:`** converts the tag given by `nltk.pos_tag` to a tag used by `wordnet.synsets`. You will need to use this function in `doc_to_synsets`.
* **`document_path_similarity:`** computes the symmetrical path similarity between two documents by finding the synsets in each document using `doc_to_synsets`, then computing similarities using `similarity_score`.

You will need to finish writing the following functions:
* **`doc_to_synsets:`** returns a list of synsets in the document. This function should first tokenize and part-of-speech tag the document using `nltk.word_tokenize` and `nltk.pos_tag`. Then it should find each token's corresponding synset using `wn.synsets(token, wordnet_tag)`. The first synset match should be used. If there is no match, that token is skipped.
* **`similarity_score:`** returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, find the synset in s2 with the largest similarity value. Sum all of the largest similarity values together and normalize this value by dividing it by the number of largest similarity values found. Be careful with data types, which should be floats. Missing values should be ignored.
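As a rough sketch of the normalization described for `similarity_score`, the snippet below works on a hand-written table of pairwise values instead of real synsets. The helper name `normalized_best_match` and the choice to return `nan` when nothing matches are assumptions made for illustration. The table reuses the path similarities that 'I like cats' and 'I like dogs' produce later in this notebook, with `None` standing for a missing value.

```python
import math


def normalized_best_match(pairwise):
    """pairwise[i][j] is the similarity of item i in s1 to item j in s2.

    None marks a missing value. Hypothetical helper, not part of the assignment.
    """
    best = []
    for row in pairwise:
        valid = [value for value in row if value is not None]
        if valid:  # items with no valid comparison are ignored
            best.append(max(valid))
    if not best:
        return math.nan  # assumption: nothing to average
    return sum(best) / float(len(best))


pairwise = [
    [1.0, None, 0.08333333],    # first synset of s1 against every synset of s2
    [0.125, 1.0, 0.08333333],   # second synset of s1
    [0.05882353, None, 0.2],    # third synset of s1
]
score = normalized_best_match(pairwise)
print(score)
```

The largest values per row are 1.0, 1.0 and 0.2, so the score is their mean.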

Once `doc_to_synsets` and `similarity_score` have been completed, submit to the autograder which will run `test_document_path_similarity` to test that these functions are running correctly. 

*Do not modify the functions `convert_tag`, `document_path_similarity`, and `test_document_path_similarity`.*

## A good article for getting a full grasp of synsets.
https://www.geeksforgeeks.org/nlp-synsets-for-a-word-in-wordnet/

In [200]:
import nltk
from nltk.corpus import wordnet as wn

hello = wn.synset('hello.n.01')
print ("Synset name :  ", hello.name())
# Defining the word
print ("Synset meaning : ", hello.definition())
# list of phrases that use the word in context
print ("Synset example : ", hello.examples())


from nltk.corpus import wordnet
syn = wordnet.synsets('hello')[0]
print ("\nSynset name :  ", syn.name())
# Defining the word
print ("Synset meaning : ", syn.definition())
# list of phrases that use the word in context
print ("Synset example : ", syn.examples())

Synset name :   hello.n.01
Synset meaning :  an expression of greeting
Synset example :  ['every morning they exchanged polite hellos']

Synset name :   hello.n.01
Synset meaning :  an expression of greeting
Synset example :  ['every morning they exchanged polite hellos']


In [201]:
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd


def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None
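A quick check of `convert_tag` on a few Penn Treebank tags of the kind `nltk.pos_tag` returns may help. The function is repeated here only so the snippet runs on its own; the list of example tags is an arbitrary choice.

```python
def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None


# Only the first letter of the tag matters: NNS -> n, VBP -> v, and so on.
examples = ['NNS', 'VBP', 'JJ', 'RB', 'DT', '.']
converted = {tag: convert_tag(tag) for tag in examples}
print(converted)
```

Tags outside the four mapped groups, such as determiners and punctuation, come back as `None`, which leaves the WordNet lookup unrestricted by part of speech.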

## 1. doc_to_synsets trials.

In [202]:
doc1 = 'Fish are nvqjp friends.'
doc1_tokens = nltk.word_tokenize(doc1)
doc1_pos = nltk.pos_tag(doc1_tokens)
new_tags = [convert_tag(pos[1]) for pos in doc1_pos]

doc1_synsets = []
for token , tag in zip(doc1_tokens,new_tags):
    syn_list = wn.synsets(token , tag)
    
    # If the list of synsets for this token is empty,
    # there is no match for this token, so skip it.
    if len(syn_list) == 0:
        continue
    # Here, we only need the first synset match.
    doc1_synsets.append(syn_list[0])

doc1_synsets

[Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]

In [203]:
def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
    """
    tokens = nltk.word_tokenize(doc)
    nltk_pos = nltk.pos_tag(tokens)
    wn_pos = [convert_tag(pos[1]) for pos in nltk_pos]
    
    synsets = []
    for token , tag in zip(tokens,wn_pos):
        syn_list = wn.synsets(token , tag)
        # If the list of synsets for this token is empty,
        # there is no match for this token, so skip it.
        if len(syn_list) == 0:
            continue
        # Here, we only need the first synset match.
        synsets.append(syn_list[0])
            
    return synsets

doc_to_synsets('Fish are nvqjp friends.')

[Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]

## 2. similarity_score trials.

In [204]:
s1 = doc_to_synsets('I like cats')
s2 = doc_to_synsets('I like dogs')
print(s1)
print(s2)

[Synset('iodine.n.01'), Synset('wish.v.02'), Synset('cat.n.01')]
[Synset('iodine.n.01'), Synset('wish.v.02'), Synset('dog.n.01')]


In [205]:
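# Rows follow s2 and columns follow s1, so each column holds the scores of one s1 synset.
path_sim = [[x.path_similarity(y) for x in s1] for y in s2]

# dtype=np.float64 turns the None values returned by path_similarity into nan.
path_sim_array = np.array(path_sim , dtype = np.float64)
# Best match in s2 for each synset in s1, ignoring nan.
largest = np.nanmax(path_sim_array ,axis=0)
print(path_sim_array)
# nanmean skips any s1 synset that had no valid match at all.
np.nanmean(largest)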

[[1.         0.125      0.05882353]
 [       nan 1.                nan]
 [0.08333333 0.08333333 0.2       ]]


0.7333333333333334
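The difference between a plain mean and a nan-aware mean shows up when one synset has no valid comparison at all. The following is a small sketch on made-up numbers (the array values are assumptions, not WordNet output): `np.nanmax` returns `nan` for an all-nan column and emits a `RuntimeWarning`, and that `nan` then propagates through `.mean()` but not through `np.nanmean`.

```python
import warnings

import numpy as np

# Rows follow the second synset list, columns follow the first one.
# The last column is entirely nan: that synset has no valid comparison.
scores = np.array([[1.0, 0.125, np.nan],
                   [np.nan, 1.0, np.nan],
                   [0.25, 0.5, np.nan]])

with warnings.catch_warnings():
    # nanmax warns "All-NaN slice encountered" for the last column.
    warnings.simplefilter('ignore', category=RuntimeWarning)
    largest = np.nanmax(scores, axis=0)

plain_mean = largest.mean()        # nan, because one entry of largest is nan
robust_mean = np.nanmean(largest)  # averages only the valid entries
print(largest, plain_mean, robust_mean)
```

Using `np.nanmean` matches the requirement that missing values be ignored.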

In [206]:
def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sum of all of the largest similarity values and normalize this value by dividing it by the
    number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """
    path_sim = [[x.path_similarity(y) for x in s1] for y in s2]

    path_sim_array = np.array(path_sim , dtype = np.float64)
    largest = np.nanmax(path_sim_array ,axis=0)
    
    return np.nanmean(largest)

similarity_score(doc_to_synsets('I like cats'),
                 doc_to_synsets('I like dogs'))

0.7333333333333334

In [207]:
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd


def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None


def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
    """
    tokens = nltk.word_tokenize(doc)
    nltk_pos = nltk.pos_tag(tokens)
    wn_pos = [convert_tag(pos[1]) for pos in nltk_pos]
    
    synsets = []
    for token , tag in zip(tokens,wn_pos):
        syn_list = wn.synsets(token , tag)
        # If the list of synsets for this token is empty,
        # there is no match for this token, so skip it.
        if len(syn_list) == 0:
            continue
        # Here, we only need the first synset match.
        synsets.append(syn_list[0])
            
    return synsets


def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sum of all of the largest similarity values and normalize this value by dividing it by the
    number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """
    path_sim = [[x.path_similarity(y) for x in s1] for y in s2]

    path_sim_array = np.array(path_sim , dtype = np.float64)
    largest = np.nanmax(path_sim_array ,axis=0)
    
    return np.nanmean(largest)


def document_path_similarity(doc1, doc2):
    """Finds the symmetrical similarity between doc1 and doc2"""

    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)

    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2
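To see why `document_path_similarity` averages both directions, consider a toy pairwise table (the numbers are invented, and the sketch assumes the pairwise measure itself gives the same value in either direction, so one table can serve both). The score of A onto B averages the best match per row, while B onto A averages the best match per column, and the two generally differ.

```python
import numpy as np

# Rows: synsets of document A. Columns: synsets of document B.
sim = np.array([[1.0, 0.2],
                [0.5, 0.25],
                [0.1, 0.2]])

a_onto_b = sim.max(axis=1).mean()   # best match in B for each synset of A
b_onto_a = sim.max(axis=0).mean()   # best match in A for each synset of B
symmetric = (a_onto_b + b_onto_a) / 2
print(a_onto_b, b_onto_a, symmetric)
```

Here the first direction averages 1.0, 0.5 and 0.2, while the second averages 1.0 and 0.25, so neither direction alone is a symmetric measure.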

### test_document_path_similarity

Use this function to check if doc_to_synsets and similarity_score are correct.

*This function should return the similarity score as a float.*

In [208]:
def test_document_path_similarity():
    doc1 = 'This is a function to test document_path_similarity.'
    doc2 = 'Use this function to see if your code in doc_to_synsets \
    and similarity_score is correct!'
    return document_path_similarity(doc1, doc2)

test_document_path_similarity()


0.554265873015873

<br>
___
`paraphrases` is a DataFrame which contains the following columns: `Quality`, `D1`, and `D2`.

`Quality` is an indicator variable which indicates if the two documents `D1` and `D2` are paraphrases of one another (1 for paraphrase, 0 for not paraphrase).

In [209]:
# Use this dataframe for questions most_similar_docs and label_accuracy
paraphrases = pd.read_csv('paraphrases.csv')
paraphrases.head()

|   | Quality | D1 | D2 |
|---|---|---|---|
| 0 | 1 | Ms Stewart, the chief executive, was not expec... | Ms Stewart, 61, its chief executive officer an... |
| 1 | 1 | After more than two years' detention under the... | After more than two years in detention by the ... |
| 2 | 1 | "It still remains to be seen whether the reven... | "It remains to be seen whether the revenue rec... |
| 3 | 0 | And it's going to be a wild ride," said Allan ... | Now the rest is just mechanical," said Allan H... |
| 4 | 1 | The cards are issued by Mexico's consulates to... | The card is issued by Mexico's consulates to i... |


___

### most_similar_docs

Using `document_path_similarity`, find the pair of documents in paraphrases which has the maximum similarity score.

*This function should return a tuple `(D1, D2, similarity_score)`*

## My_Solution.

In [210]:
docs1 = paraphrases['D1']
docs2 = paraphrases['D2']

result = []
for doc1 , doc2 in zip(docs1 , docs2):
    result.append(document_path_similarity(doc1,doc2))
index = np.argsort(result)
print(index)   
print(docs1[index[-1]])
print(docs2[index[-1]])
print(result[index[-1]])
print(max(result))



[16 17 10 18 12  8  7  0 11 14  9  3  6 19  5 15  2  1  4 13]
"Indeed, Iran should be put on notice that efforts to try to remake Iraq in their image will be aggressively put down," he said.
"Iran should be on notice that attempts to remake Iraq in Iran's image will be aggressively put down," he said.

0.9753086419753086
0.9753086419753086


In [211]:
def most_similar_docs():
    docs1 = paraphrases['D1']
    docs2 = paraphrases['D2']

    symm_path_sim = []
    for doc1 , doc2 in zip(docs1 , docs2):
        symm_path_sim.append(document_path_similarity(doc1,doc2))
    # argsort orders the scores in ascending order, so the last index belongs to the largest score.
    index = np.argsort(symm_path_sim)[-1]
    D1 = docs1[index]
    D2 = docs2[index]
    largest = symm_path_sim[index]
    
    return (D1 , D2 , largest)
most_similar_docs()


('"Indeed, Iran should be put on notice that efforts to try to remake Iraq in their image will be aggressively put down," he said.',
 '"Iran should be on notice that attempts to remake Iraq in Iran\'s image will be aggressively put down," he said.\n',
 0.9753086419753086)

## Another Geeky Solution.

In [None]:
simi = np.array([document_path_similarity(row['D1'], row['D2']) for _, row in paraphrases.iterrows()])
ind = np.nanargmax(simi)
# Named max_score so that the similarity_score function defined earlier is not overwritten.
max_score = np.nanmax(simi)
D1 = paraphrases['D1'][ind]
D2 = paraphrases['D2'][ind]

(D1, D2, max_score)
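The same "score every row, then pick the best" pattern can also be written with pandas alone. The sketch below is self-contained, so it swaps in a hypothetical stand-in scorer, `toy_similarity` (Jaccard overlap of word sets), and a made-up three-row frame; in the notebook the scorer would be `document_path_similarity` and the frame would be `paraphrases`.

```python
import pandas as pd


def toy_similarity(a, b):
    """Hypothetical stand-in scorer: Jaccard overlap of lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


pairs = pd.DataFrame({
    'D1': ['the cat sat', 'a quick brown fox', 'hello world'],
    'D2': ['the cat slept', 'slow green turtle', 'hello world'],
})

# One score per row, then the index label of the largest score.
scores = pairs.apply(lambda row: toy_similarity(row['D1'], row['D2']), axis=1)
best = scores.idxmax()
result = (pairs.loc[best, 'D1'], pairs.loc[best, 'D2'], float(scores.loc[best]))
print(result)
```

Looking rows up with `.loc` and the label returned by `idxmax` keeps the lookup correct even if the frame's index is not 0, 1, 2, and so on.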

### label_accuracy

Provide labels for the twenty pairs of documents by computing the similarity for each pair using `document_path_similarity`. Let the classifier rule be that if the score is greater than 0.75, label is paraphrase (1), else label is not paraphrase (0). Report accuracy of the classifier using scikit-learn's accuracy_score.

*This function should return a float.*

## My_Solution.

In [212]:
from sklearn.metrics import accuracy_score
docs1 = paraphrases['D1']
docs2 = paraphrases['D2']
correct_labels = paraphrases['Quality']

sim_scores = [document_path_similarity(doc1,doc2) for doc1,doc2 in zip(docs1,docs2)]

predictions = []
for score in sim_scores :
    if score > 0.75:
        predictions.append(1)
    else:
        predictions.append(0)

        
# accuracy_score expects the true labels first, then the predictions.
accuracy_score(correct_labels, predictions)


0.8

## Geeky one:

In [213]:
from sklearn.metrics import accuracy_score

simi = [document_path_similarity(row['D1'], row['D2']) for _, row in paraphrases.iterrows()]
predictions = list(map(lambda x: 1 if x > 0.75 else 0, simi))

accuracy_score(paraphrases['Quality'], predictions)


0.8

In [214]:
def label_accuracy():
    from sklearn.metrics import accuracy_score
    docs1 = paraphrases['D1']
    docs2 = paraphrases['D2']
    correct_labels = paraphrases['Quality']

    sim_scores = [document_path_similarity(doc1,doc2) for doc1,doc2 in zip(docs1,docs2)]

    predictions = []
    for score in sim_scores :
        if score > 0.75:
            predictions.append(1)
        else:
            predictions.append(0)
            
    return accuracy_score(correct_labels, predictions)
label_accuracy()


0.8
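The classifier rule is a strict threshold, which is easy to check on a handful of made-up scores. The sketch below uses invented scores and labels (not the `paraphrases` data) and computes accuracy by hand with NumPy, as the fraction of predictions that equal the true labels.

```python
import numpy as np

# Invented similarity scores and true labels, for illustration only.
scores = np.array([0.9, 0.8, 0.6, 0.76, 0.75])
labels = np.array([1, 0, 0, 1, 1])

# Strictly greater than 0.75 means "paraphrase"; exactly 0.75 is labelled 0.
predictions = (scores > 0.75).astype(int)

# Accuracy is the share of positions where prediction and label agree.
accuracy = float((predictions == labels).mean())
print(predictions, accuracy)
```

Three of the five predictions agree with the labels, so the accuracy here is 0.6.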

## Part 2 - Topic Modelling

For the second part of this assignment, you will use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in `newsgroup_data`. You will first need to finish the code in the cell below by using the `gensim.models.ldamodel.LdaModel` constructor to estimate LDA model parameters on the corpus, and save the result to the variable `ldamodel`. Extract 10 topics using `corpus` and `id_map`, with `passes=25` and `random_state=34`.

- Source : https://www.geeksforgeeks.org/latent-dirichlet-allocation/

In [199]:
import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer

# Load the list of documents
with open('newsgroups', 'rb') as f:
    newsgroup_data = pickle.load(f)

# Use CountVectorizer to find tokens of three or more characters, remove stop words,
# remove tokens that don't appear in at least 20 documents (min_df),
# and remove tokens that appear in more than 20% of the documents (max_df).
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')
# Fit and transform to create a sparse matrix.
X = vect.fit_transform(newsgroup_data)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
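To make the `token_pattern` and the `id_map` inversion more concrete, here is a standard-library sketch. It applies the same regular expression with `re.findall` to a made-up sentence and builds a small hypothetical vocabulary by hand; it does not run `CountVectorizer` itself, so stop-word removal and the `min_df`/`max_df` filters are not reproduced.

```python
import re

# Same pattern as in the vectorizer: word-bounded runs of three or more word characters.
token_pattern = r'(?u)\b\w\w\w+\b'

text = "An LDA model of 10 topics, e.g. space and NASA".lower()
tokens = re.findall(token_pattern, text)
print(tokens)

# Hypothetical vocabulary in the word -> id layout of vect.vocabulary_.
# ("and" survives here only because this sketch applies no stop-word list.)
vocabulary = {word: i for i, word in enumerate(sorted(set(tokens)))}

# Inverting it gives the id -> word mapping that LdaModel's id2word expects.
id_map = dict((v, k) for k, v in vocabulary.items())
print(id_map)
```

Short tokens such as "an", "of", "10", "e" and "g" are dropped by the pattern alone.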

In [221]:
id_map

{76: 'best',
 335: 'group',
 33: 'america',
 409: 'know',
 726: 'similar',
 544: 'organization',
 23: 'address',
 514: 'new',
 899: 'york',
 842: 'usa',
 450: 'lot',
 386: 'information',
 58: 'available',
 528: 'number',
 326: 'good',
 455: 'luck',
 749: 'sounds',
 84: 'blood',
 359: 'hey',
 708: 'send',
 231: 'don',
 776: 'study',
 363: 'history',
 524: 'north',
 241: 'early',
 604: 'probably',
 34: 'american',
 491: 'mode',
 837: 'understand',
 201: 'decide',
 632: 'reading',
 474: 'mean',
 687: 'said',
 822: 'toronto',
 324: 'going',
 873: 'win',
 184: 'cup',
 895: 'yeah',
 673: 'right',
 641: 'rec',
 365: 'hockey',
 332: 'great',
 214: 'didn',
 778: 'stupid',
 423: 'leave',
 571: 'place',
 63: 'bad',
 25: 'advice',
 464: 'major',
 344: 'hard',
 447: 'lose',
 817: 'time',
 441: 'long',
 26: 'age',
 373: 'hurt',
 897: 'years',
 297: 'following',
 854: 'video',
 835: 'type',
 120: 'caused',
 652: 'relatively',
 889: 'worst',
 713: 'seriously',
 780: 'suggest',
 443: 'look',
 658: 'rep

In [223]:
# Use the gensim.models.ldamodel.LdaModel constructor to estimate 
# LDA model parameters on the corpus, and save to the variable `ldamodel`

# Your code here:
ldamodel = gensim.models.ldamodel.LdaModel(corpus , num_topics=10,
                                           id2word=id_map,
                                           passes=25 , random_state=34)

### lda_topics

Using `ldamodel`, find a list of the 10 topics and the most significant 10 words in each topic. This should be structured as a list of 10 tuples where each tuple takes on the form:

`(9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.014*"sci"')`

for example.

*This function should return a list of tuples.*

In [225]:
ldamodel.print_topics(num_topics=10, num_words=10)

[(0,
  '0.056*"edu" + 0.043*"com" + 0.033*"thanks" + 0.022*"mail" + 0.021*"know" + 0.020*"does" + 0.014*"info" + 0.012*"monitor" + 0.010*"looking" + 0.010*"don"'),
 (1,
  '0.024*"ground" + 0.018*"current" + 0.018*"just" + 0.013*"want" + 0.013*"use" + 0.011*"using" + 0.011*"used" + 0.010*"power" + 0.010*"speed" + 0.010*"output"'),
 (2,
  '0.061*"drive" + 0.042*"disk" + 0.033*"scsi" + 0.030*"drives" + 0.028*"hard" + 0.028*"controller" + 0.027*"card" + 0.020*"rom" + 0.018*"floppy" + 0.017*"bus"'),
 (3,
  '0.023*"time" + 0.015*"atheism" + 0.014*"list" + 0.013*"left" + 0.012*"alt" + 0.012*"faq" + 0.012*"probably" + 0.011*"know" + 0.011*"send" + 0.010*"months"'),
 (4,
  '0.025*"car" + 0.016*"just" + 0.014*"don" + 0.014*"bike" + 0.012*"good" + 0.011*"new" + 0.011*"think" + 0.010*"year" + 0.010*"cars" + 0.010*"time"'),
 (5,
  '0.030*"game" + 0.027*"team" + 0.023*"year" + 0.017*"games" + 0.016*"play" + 0.012*"season" + 0.012*"players" + 0.012*"win" + 0.011*"hockey" + 0.011*"good"'),
 (6,
  '0.0

In [227]:
def lda_topics():
    result  = ldamodel.print_topics(num_topics=10, num_words=10)
    
    return result

### topic_distribution

For the new document `new_doc`, find the topic distribution. Remember to use `vect.transform` on the new doc, and `Sparse2Corpus` to convert the sparse matrix to a gensim corpus.

*This function should return a list of tuples, where each tuple is `(#topic, probability)`*

In [235]:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]

In [276]:
import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer

# Load the list of documents
with open('newsgroups', 'rb') as f:
    newsgroup_data = pickle.load(f)

# Use CountVectorizer to find tokens of three or more characters, remove stop words,
# remove tokens that don't appear in at least 20 documents (min_df),
# and remove tokens that appear in more than 20% of the documents (max_df).
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')
# Fit and transform to create a sparse matrix.
X = vect.fit_transform(newsgroup_data)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())

ldamodel = gensim.models.ldamodel.LdaModel(corpus , num_topics=10,
                                           id2word=id_map,
                                           passes=25 , random_state=34)




# transform new_doc to a document-term matrix
new_doc_vectorized = vect.transform(new_doc)

# Convert sparse matrix to gensim corpus.
new_corpus = gensim.matutils.Sparse2Corpus(new_doc_vectorized,
                                       documents_columns=False)

list(ldamodel.get_document_topics(new_corpus))

[[(0, 0.020003108),
  (1, 0.020003324),
  (2, 0.020001281),
  (3, 0.49674758),
  (4, 0.020004038),
  (5, 0.020004129),
  (6, 0.020002972),
  (7, 0.020002645),
  (8, 0.020003129),
  (9, 0.34322783)]]

In [277]:
def topic_distribution():
    # Transform new_doc to a document-term matrix using the already fitted vect.
    new_doc_vectorized = vect.transform(new_doc)

    # Convert sparse matrix to gensim corpus.
    new_corpus = gensim.matutils.Sparse2Corpus(new_doc_vectorized,
                                           documents_columns=False)

    result = list(ldamodel.get_document_topics(new_corpus))[0]

    return result

topic_distribution()

[(0, 0.020003108),
 (1, 0.020003324),
 (2, 0.020001281),
 (3, 0.496777),
 (4, 0.020004038),
 (5, 0.020004129),
 (6, 0.02000297),
 (7, 0.020002645),
 (8, 0.020003129),
 (9, 0.3431984)]
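A distribution in this `(topic, probability)` form is easy to inspect with plain Python. The sketch below copies the output shown above into a list and picks out the dominant topics; the variable names are arbitrary.

```python
# Topic distribution copied from the output above.
distribution = [(0, 0.020003108), (1, 0.020003324), (2, 0.020001281),
                (3, 0.496777), (4, 0.020004038), (5, 0.020004129),
                (6, 0.02000297), (7, 0.020002645), (8, 0.020003129),
                (9, 0.3431984)]

# The probabilities should add up to roughly 1.
total = sum(prob for _, prob in distribution)

# Sort by probability, highest first, and keep the two strongest topics.
ranked = sorted(distribution, key=lambda pair: pair[1], reverse=True)
top_two = [topic for topic, _ in ranked[:2]]
print(total, top_two)
```

For this document, topics 3 and 9 carry most of the probability mass, and the other eight sit near 0.02 each.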

### topic_names

From the list of the following given topics, assign topic names to the topics you found. If none of these names best matches the topics you found, create a new 1-3 word "title" for the topic.

Topics: Health, Science, Automobiles, Politics, Government, Travel, Computers & IT, Sports, Business, Society & Lifestyle, Religion, Education.

*This function should return a list of 10 strings.*

## A meaningful approach:
1. In lda_topics, we already have the 10 most significant words in each topic, like this:


- (9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.014*"sci"')


2. We also defined `document_path_similarity(doc1, doc2)` at the beginning.


3. Put them together: compute the similarity between the 10 words of each topic and each of the 12 candidate topic names, i.e. ["Health", 'Science', 'Automobiles', 'Politics', 'Government', 'Travel', 'Computers & IT', 'Sports', 'Business', 'Society & Lifestyle', 'Religion', 'Education'], then name each topic after the candidate with the highest similarity score.


- For example, if the similarity between (9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.014*"sci"') and "Science" is 0.6, higher than for "Health" and the others, we can name topic 9 "Science".



 Source : https://www.coursera.org/learn/python-text-mining/discussions/weeks/4/threads/x1DSkZ44EeiclQpjXTIwdA
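Step 1 needs the bare words out of a topic string. One possible way, sketched here with the standard `re` module on the example string above (an alternative to the gensim preprocessing filters, not the method of the linked thread), is to capture each weight and the quoted word that follows it.

```python
import re

topic = (9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + '
            '0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + '
            '0.015*"available" + 0.014*"center" + 0.014*"sci"')

# Each term looks like 0.068*"space": capture the weight and the quoted word.
pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic[1])
words = [word for _, word in pairs]
weights = [float(weight) for weight, _ in pairs]

# Joined with spaces, the words form a short "document" that can be scored.
topic_doc = ' '.join(words)
print(topic_doc)
```

The joined string is the kind of input that `document_path_similarity` would then compare against each candidate name.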

In [264]:
from gensim.parsing.preprocessing import preprocess_string, strip_punctuation, strip_numeric

filters = [lambda x: x.lower(), strip_punctuation, strip_numeric]

topic_10word = ldamodel.print_topics(num_topics=10, num_words=10)

the10word = []
for topic in topic_10word:
    the10word.append(preprocess_string(topic[1], filters))

the10word

[['edu',
  'com',
  'thanks',
  'mail',
  'know',
  'does',
  'info',
  'monitor',
  'looking',
  'don'],
 ['ground',
  'current',
  'just',
  'want',
  'use',
  'using',
  'used',
  'power',
  'speed',
  'output'],
 ['drive',
  'disk',
  'scsi',
  'drives',
  'hard',
  'controller',
  'card',
  'rom',
  'floppy',
  'bus'],
 ['time',
  'atheism',
  'list',
  'left',
  'alt',
  'faq',
  'probably',
  'know',
  'send',
  'months'],
 ['car',
  'just',
  'don',
  'bike',
  'good',
  'new',
  'think',
  'year',
  'cars',
  'time'],
 ['game',
  'team',
  'year',
  'games',
  'play',
  'season',
  'players',
  'win',
  'hockey',
  'good'],
 ['information',
  'help',
  'medical',
  'new',
  'use',
  'research',
  'university',
  'number',
  'program'],
 ['don',
  'people',
  'think',
  'just',
  'say',
  'know',
  'does',
  'good',
  'god',
  'way'],
 ['use',
  'apple',
  'power',
  'time',
  'data',
  'software',
  'pin',
  'memory',
  'simms',
  'port'],
 ['space',
  'nasa',
  'science',
  '

In [265]:
candi = ["Health", 'Science', 'Automobiles', 'Politics', 'Government', 'Travel', 
        'Computers & IT', 'Sports', 'Business', 'Society & Lifestyle',
         'Religion', 'Education']

# group = group of ten words in the list of lists called `the10word` above.
simi_arr = np.array([[document_path_similarity(' '.join(group), j) for group in the10word] for j in candi])
simi_arr


array([[0.07468834, 0.08934035, 0.07720982, 0.0755887 , 0.07043956,
        0.07565247, 0.07883634, 0.09225941, 0.08660058, 0.09379902],
       [0.10048008, 0.09632035, 0.08063223, 0.10286103, 0.08853535,
        0.09581148, 0.13506494, 0.10409091, 0.13218864, 0.63492063],
       [0.06674298, 0.0887026 , 0.10305372, 0.06304317, 0.71879902,
        0.07138312, 0.06312564, 0.07102564, 0.07027086, 0.07051893],
       [0.12001134, 0.1486678 , 0.10925533, 0.12199546, 0.11055556,
        0.1219494 , 0.12880076, 0.15218254, 0.1272592 , 0.14170996],
       [0.09651618, 0.10524634, 0.08791645, 0.09790765, 0.08994652,
        0.15476641, 0.13464313, 0.14131313, 0.11882735, 0.12673993],
       [0.08161927, 0.11748222, 0.07448229, 0.09898038, 0.09321862,
        0.11986451, 0.12420714, 0.09420746, 0.11688424, 0.09034091],
       [0.08640407, 0.10376082, 0.11569859, 0.08602945, 0.09784382,
        0.08861451, 0.10157105, 0.09004079, 0.1050512 , 0.14465909],
       [0.08893606, 0.18203463, 0.0806322

In [270]:
# For each group of 10 words (one column per topic), find which of the
# 12 candidate names (one row per name) has the highest similarity score.
# axis=0 applies the function down each column.
ind = np.nanargmax(simi_arr, axis=0)
ind

array([ 3, 11,  6, 11,  2, 11, 11,  9, 11,  1], dtype=int64)

In [267]:
# Extract the candidate names accordingly.
[candi[i] for i in ind]

['Politics',
 'Education',
 'Computers & IT',
 'Education',
 'Automobiles',
 'Education',
 'Education',
 'Society & Lifestyle',
 'Education',
 'Science']
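The orientation of the similarity array matters for the `axis=0` step, so a small made-up example may help. In this sketch the rows are candidate names and the columns are topics, matching the layout built above; the three names are taken from the candidate list, but the scores are invented.

```python
import numpy as np

candidates = ['Health', 'Science', 'Automobiles']

# Rows: candidate names. Columns: topics. Values are invented.
similarity = np.array([[0.07, 0.09, 0.07],
                       [0.10, 0.63, 0.08],
                       [0.07, 0.07, 0.72]])

# For every column (topic), the row index of the best-scoring candidate.
best_rows = np.nanargmax(similarity, axis=0)
names = [candidates[i] for i in best_rows]
print(best_rows, names)
```

Nothing stops one name from winning several columns, which is how a single candidate can end up assigned to more than one topic.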

In [271]:
def topic_names():
    from gensim.parsing.preprocessing import preprocess_string, strip_punctuation, strip_numeric

    filters = [lambda x: x.lower(), strip_punctuation, strip_numeric]

    topic_10word = ldamodel.print_topics(num_topics=10, num_words=10)

    the10word = []
    for topic in topic_10word:
        the10word.append(preprocess_string(topic[1], filters))

    candi = ["Health", 'Science', 'Automobiles', 'Politics', 'Government', 'Travel', 
        'Computers & IT', 'Sports', 'Business', 'Society & Lifestyle',
         'Religion', 'Education']
    simi_arr = np.array([[document_path_similarity(' '.join(group), j) for group in the10word] for j in candi])
    
    ind = np.nanargmax(simi_arr, axis=0)
    result = [candi[i] for i in ind]
    
    return result

topic_names()


['Politics',
 'Education',
 'Computers & IT',
 'Education',
 'Automobiles',
 'Education',
 'Education',
 'Society & Lifestyle',
 'Education',
 'Science']