In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', -1)

In [2]:
import ktrain
ktrain.__version__

Using TensorFlow backend.


using Keras version: 2.2.4


'0.6.0'

# Learning from Unlabeled Text Data

Unlabeled, unstructured text or document data abound, and it is often necessary to "make sense" of these data for various applications.  Examples include:
- *exploratory analysis of text data*:  discovering relevant information that one may not have even known to look for by providing rich overviews of the information space  
- *building training sets for text classification*:  identifying positive and negative example documents to train a [text classifier](https://en.wikipedia.org/wiki/Document_classification). 
- *document recommender system*:  recommending semantically similar documents given a specific document of interest.

Each of these examples involves **learning from unlabeled text data**.  In this notebook, we will show you how to accomplish the above with minimal coding using *ktrain*.  The *ktrain* library is an open-source, augmented ML library built around Keras and scikit-learn.  It can be installed with `pip3 install ktrain` and is [available on GitHub](https://github.com/amaiya/ktrain).

We will use the well-known [20-newsgroup dataset](http://qwone.com/~jason/20Newsgroups/) for this demonstration.

## Get Raw Document Data

In [3]:
# #dtic
# maxlen = 5000 # truncate documents to 5000 words
# docs = []
# for idx, filename in enumerate(ktrain.text.extract_filenames('/home/amaiya/data/publicDTIC_Text/')):
#     if idx % 2000 == 0: print(idx)
#     with open(filename, 'r') as f:
#         text = f.read()
#         text = " ".join(text.split()[:maxlen])
#         docs.append(text)

In [4]:
# 20newsgroups
from sklearn.datasets import fetch_20newsgroups

# we only want to keep the body of the documents!
remove = ('headers', 'footers', 'quotes')

# fetch train and test data
newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)

# combine train and test into a list of 18,846 cleaned posts (strings)
# lowercase each post and keep only purely alphabetic tokens
texts = [' '.join(filter(str.isalpha, raw.lower().split())) for raw in
        newsgroups_train.data + newsgroups_test.data]
targets = list(newsgroups_train.target) + list(newsgroups_test.target)
categories = [newsgroups_train.target_names[target] for target in targets]
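
To make the effect of the cleaning step concrete, here is a small sketch on a made-up sentence (the sample string is an assumption and does not come from the dataset). Note that `str.isalpha` is applied to whole whitespace-separated tokens, so a token containing a digit or punctuation mark is dropped entirely rather than stripped of those characters.

```python
# Hypothetical sample text, used only to illustrate the cleaning step above.
raw = "Hello, world! The SCSI-2 drive costs $300 but works fine."

# Lowercase, split on whitespace, and keep only purely alphabetic tokens.
tokens = raw.lower().split()
cleaned = ' '.join(filter(str.isalpha, tokens))

print(cleaned)
```

Because words that touch punctuation are discarded, the sample documents printed later in this notebook read as if some words (often sentence-final ones) are missing.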

## Train an LDA Topic Model to Discover Topics

In [5]:
%%time
tm = ktrain.text.get_topic_model(texts, n_topics=40, n_features=10000)

preprocessing texts...
fitting model...
iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
done.
CPU times: user 6min 53s, sys: 18min 51s, total: 25min 44s
Wall time: 53.2 s


#### Pre-compute Document-Topic matrix
We will pre-compute the document-topic matrix.  Each document is represented as a probability distribution over the topics.
We will also filter out documents whose maximum topic probability is less than 0.25 in order to consider the most representative documents for each topic.  This may help to improve clarity of visualizations by removing "unfocused" documents. 

In [6]:
%%time
tm.build(texts, threshold=0.25)

done.
CPU times: user 36.1 s, sys: 1min 34s, total: 2min 10s
Wall time: 5.65 s


Since the `build` method prunes documents based on the threshold, we should prune the original data in the same way for consistency.  This can be accomplished with the `filter` method. 

In [7]:
texts = tm.filter(texts)
categories = tm.filter(categories)

This is useful, for example, if you want to display the newsgroup category in a visualization of documents and topics.
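
The idea behind thresholding and filtering can be sketched in plain Python. This is a conceptual illustration with made-up data, not ktrain's internal code: the names `keep` and `apply_mask` are hypothetical. A document is kept when its largest topic probability reaches the threshold, and the same mask is applied to every parallel list so that indices stay aligned.

```python
# Conceptual sketch only: hypothetical data, not ktrain's implementation.
threshold = 0.25

doc_topics = [
    [0.80, 0.05, 0.05, 0.05, 0.05],  # focused on topic 0
    [0.20, 0.20, 0.20, 0.20, 0.20],  # unfocused: max is 0.20
    [0.10, 0.60, 0.10, 0.10, 0.10],  # focused on topic 1
    [0.24, 0.22, 0.20, 0.18, 0.16],  # unfocused: max is 0.24
]
texts = ["doc a", "doc b", "doc c", "doc d"]
categories = ["sci.space", "misc.forsale", "rec.autos", "talk.politics.misc"]

# Boolean mask: True where the strongest topic reaches the threshold.
keep = [max(row) >= threshold for row in doc_topics]

def apply_mask(items, mask):
    """Keep only the items whose mask entry is True."""
    return [item for item, flag in zip(items, mask) if flag]

kept_topics = apply_mask(doc_topics, keep)
kept_texts = apply_mask(texts, keep)
kept_categories = apply_mask(categories, keep)

print(kept_texts, kept_categories)
```

After masking, position `i` refers to the same document in all three lists, which is what makes lookups by document index safe.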

## Inspecting Topics
Let's take a look at the discovered topics by document count.

In [8]:
tm.print_topics(show_counts=True)

topic:5 | count:6202 | just think like people know good really going make want
topic:1 | count:1915 | use image program using available version window run graphics display
topic:11 | count:1184 | drive card disk hard scsi know use serial does apple
topic:26 | count:1068 | god jesus believe people does bible christian say christians man
topic:0 | count:1014 | game team games play hockey players player year won lost
topic:19 | count:858 | people government armenian said president armenians turkish united did rights
topic:12 | count:606 | space high water launch energy air solar years low earth
topic:28 | count:463 | used phone data use new number private access using systems
topic:7 | count:392 | does question know right questions discussion group answer certain set
topic:4 | count:317 | car new model thanks price engine video cars does like
topic:24 | count:89 | gun use drug cause law tax food laws free body
topic:34 | count:85 | key chip government encryption clipper security law keys 

The topic with the most documents appears to be conversational questions, replies, and comments that aren't focused on a particular subject.  Other topics are focused on specific domains (e.g., topic 27 with label "*jews israel jewish israeli arab muslims palestinian peace arabs land*").
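
As a rough illustration of how a listing like the one above could be produced, the sketch below builds a label from the highest-weighted words of each topic and counts documents under their strongest topic. Both the data and the counting rule (assigning each document to its most probable topic) are assumptions made for illustration; the notebook does not state how `print_topics` computes its counts.

```python
from collections import Counter

# Hypothetical word weights for two topics (not taken from the fitted model).
topic_word_weights = {
    0: {"game": 9.1, "team": 8.4, "hockey": 5.2, "the": 0.3, "play": 6.0},
    1: {"drive": 7.7, "disk": 6.9, "scsi": 6.5, "card": 7.0, "apple": 2.1},
}

def topic_label(weights, n_words=3):
    """Join the n_words highest-weighted words into a label."""
    top = sorted(weights, key=weights.get, reverse=True)[:n_words]
    return " ".join(top)

labels = {tid: topic_label(w) for tid, w in topic_word_weights.items()}

# Hypothetical document-topic rows; each document is counted under
# the index of its largest probability.
doc_topics = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
counts = Counter(max(range(len(row)), key=row.__getitem__) for row in doc_topics)

for tid, count in counts.most_common():
    print("topic:%s | count:%s | %s" % (tid, count, labels[tid]))
```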

#### Examining a Sample Topic

Let's look at the topic probabilities for the document most relevant to topic 26, which appears to be about Christianity and religion:

In [9]:
tm.get_doctopics(topic_ids=[26])[0]

array([0.00105053, 0.06334099, 0.00105098, 0.00105042, 0.00105048,
       0.00105193, 0.00105042, 0.00105283, 0.00105069, 0.00105042,
       0.00105042, 0.00105094, 0.00105097, 0.00105056, 0.00105042,
       0.00105045, 0.00105049, 0.01033958, 0.00105042, 0.00105152,
       0.00105042, 0.0010509 , 0.01004638, 0.00105048, 0.00105112,
       0.00105042, 0.86602134, 0.01347644, 0.00105148, 0.00105042,
       0.00105042, 0.0010506 , 0.00105042, 0.00105043, 0.00105059,
       0.00105048, 0.00105042, 0.00105042, 0.00105068, 0.00105068])

Its topic probability for the topic in question (Christianity) is **86.6%**, as can be seen in the array (sixth row, second column).

In [10]:
tm.get_doctopics(topic_ids=[26])[0][26]

0.8660213401727025
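
Rather than reading the position off the printed array, the dominant topic can be located programmatically. The sketch below copies the 40 probabilities from the output above into a plain list and finds the index of the largest value; the variable names are chosen for illustration.

```python
# Topic probabilities copied from the printed array above (40 topics).
probs = [
    0.00105053, 0.06334099, 0.00105098, 0.00105042, 0.00105048,
    0.00105193, 0.00105042, 0.00105283, 0.00105069, 0.00105042,
    0.00105042, 0.00105094, 0.00105097, 0.00105056, 0.00105042,
    0.00105045, 0.00105049, 0.01033958, 0.00105042, 0.00105152,
    0.00105042, 0.0010509, 0.01004638, 0.00105048, 0.00105112,
    0.00105042, 0.86602134, 0.01347644, 0.00105148, 0.00105042,
    0.00105042, 0.0010506, 0.00105042, 0.00105043, 0.00105059,
    0.00105048, 0.00105042, 0.00105042, 0.00105068, 0.00105068,
]

# Index of the largest probability, i.e. the dominant topic.
top_topic = max(range(len(probs)), key=probs.__getitem__)

# The three strongest topics, most probable first.
top3 = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:3]

# The array is printed five values per row, so divmod gives its position.
row, col = divmod(top_topic, 5)

print(top_topic, top3)
print("row %d, column %d" % (row + 1, col + 1))
```

With a NumPy array, `np.argmax` returns the same index in a single call.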

Let's examine this document as a sanity check to make sure it pertains to this topic:

In [11]:
tm.get_docs(topic_ids=[26])[0]

('no matter how it is to suggest that a common moral system created by mankind is it is not contrary to reason to suggest that a common moral system created by mankind is in for the bible to be of any use to mankind as a moral it must be interpreted by mankind and a workable moral system created for everyday the jewish talmud is the result of centuries of biblical scholars analysing every word of the torah to understand the morality behind the children of israel were given a very strict set of judicial and ceremonial laws to follow and yet this was clearly not enough to cover every instance of moral dilemma in their for a the situation is no it seems to me that the only code of morality that we have from the christian god is that which is contained in the bible we can see from the diverse opinions in the christian newsgroups is not there may well be an absolute morality defined by the god for mankind to follow but it seems that we only have a subset simply because the concept was writt

Looks right to me.  Note that the `get_docs` method returns a list of tuples of the form:
`(text, doc_id, probability_score, topic_id)`

... where 
- `text` is the raw text of the document
- `doc_id` is an index into the array returned by `get_doctopics`
- `probability_score` is the relevance of this document to the topic represented by `topic_id`
- `topic_id` is the index of the topic, a value in `range(n_topics)`

Within each `topic_id`, the tuples are sorted in descending order by `probability_score`.  Hence, the first item is the most relevant document to the selected topic (`topic_id=26`).  
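
A short sketch of working with tuples in this layout follows. The `Doc` namedtuple is a hypothetical convenience, not part of ktrain; the ids and scores mirror outputs that appear later in this notebook, and the texts are shortened placeholders.

```python
from collections import namedtuple

# Field order follows the tuple layout described above.
Doc = namedtuple("Doc", ["text", "doc_id", "probability_score", "topic_id"])

# Stand-in results with shortened text.
results = [
    Doc("software publishing superbase windows ...", 93, 0.5940832459912094, 33),
    Doc("here is my gateway micronics isa ram ...", 3243, 0.8519254747730485, 11),
    Doc("all this shows is that you know much ...", 6, 0.5495197635790885, 23),
]

# Plain tuple unpacking works because a namedtuple is still a tuple.
text, doc_id, score, topic_id = results[0]

# Rank documents across topics by relevance, highest first.
ranked = sorted(results, key=lambda d: d.probability_score, reverse=True)
best = ranked[0]

print(best.doc_id, best.topic_id, round(best.probability_score, 3))
```

Wrapping raw tuples this way (for example, `Doc(*item)`) avoids relying on positional indices such as `doc[1]` and `doc[3]`.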

Next, let's examine a set of topics that seem to be related to the same larger theme.


#### Examining Technology-Related Topics

We will examine a series of topics that appear to be about technology.  For each topic, we will first examine the topic label using `tm.topics` and then view the top document pertaining to that topic using `get_docs`.  

A topic about **cryptography**:

In [12]:
tm.topics[34]

'key chip government encryption clipper security law keys algorithm message'

In [13]:
tm.get_docs([34])[0]

('description of skipjack mostly chip structure the clipper chip contains a classified block encryption algorithm called the algorithm uses bit keys with for the and has rounds of scrambling with for the it supports all des modes of throughput is mbits a an family key that is common to all chips a serial number an secret key that unlocks all messages encrypted with the chip the key k and message stream m digitized are then fed into the clipper chip to produce two the encrypted message and a law enforcement three it looks like each bits of input gives you bits of bits bits bits n f bits bits do you really need to transmit all bits each or do you only transmit the bits of wiretap block at the all would be really obnoxious for applications like cellular phones even regular phones over how do the des modes interact with the do the various feedback modes only apply to the message or also to the wiretap if the wiretap block is only transmitted at the does it get incorporated into everything 

A topic about **Microsoft Windows**:

In [14]:
tm.topics[33]

'windows dos software font printer print fonts driver microsoft laser'

In [15]:
tm.get_docs([33])[0]

('software publishing superbase windows ocr system readright for windows ocr system readright for dos unregistered zortech bit compiler with multiscope windows whitewater resource library source code commonview windows applications framework for borland spontaneous assembly library with source code microsoft macro assembly microsoft windows sdk documentation microsoft foxpro wordperfect toolkit kedwell software databoss c code generator kedwell installboss installation generator liant software windows application framework with source code ibm toolkit cbtree library with source code symantec timeline for windows timeslip timesheet professional for windows',
 93,
 0.5940832459912094,
 33)

A topic about **computer hardware**:

In [16]:
tm.topics[11]

'drive card disk hard scsi know use serial does apple'

In [17]:
tm.get_docs([11])[0]

('here is my gateway micronics isa ram ide hd drive ide hd drive adaptec scsi with scsi bios enabled seagate scsi drive when i boot up i get the adaptec bios but it says something scsi bios not and i get to the seagate i go into phoenixbios remove the entry for drive and i can access the is there a way to get two ide drives and the seagate at the same i have but it just hangs the',
 3243,
 0.8519254747730485,
 11)

A topic about **Apple Macs**:

In [18]:
tm.topics[23]

'mac ibm info pc ii usa fast mode dec quadra'

In [19]:
tm.get_docs([23])[0]

('all this shows is that you know much about a controler range is indeed and that is all you have right about scsi a controller with burst note the increase in the mac quadra uses this version of so it does some pc use this set up with burst or fast with burst and with burst by your own data the scsi is twice as fast as is correct with a controller chip can reach which is indeed faster than of is all these scsi facts have been posted to this newsgroup in my mac ibm info sheet by ftp on in the as should be but may still be part of this problem is both mac and ibm pc are inconsiant about what scsi is though it is well documented that the quadra has a chip an apple salesperson said uses a fast at a burst it does is maximum synchronous and quadra uses ansynchronous scsi which is it seems that mac and ibm see interface and think when it maybe a interface driven in the machine by a controller chip in mode is much faster then true can',
 6,
 0.5495197635790885,
 23)

## Visualizing Topics

Let's combine these technology-related documents into a set of positive examples of technology-focused posts and visualize them.  To do so, we must represent each document as a probability distribution over topics.  We will also compile the original newsgroup categories for each document, so that they can be included in the visualization. (This is why we invoked the `filter` method earlier.)

In [20]:
tech_topics = [11, 23, 33, 34]
tech_probs = tm.get_doctopics(topic_ids=tech_topics)
doc_ids = [doc[1] for doc in tm.get_docs(topic_ids=tech_topics)]
newsgroup_categories = [categories[doc_id] for doc_id in doc_ids]

In [21]:
tm.visualize_documents(doc_topics=tech_probs, extra_info={'cat': newsgroup_categories})

reducing to 2 dimensions...[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1341 samples in 0.002s...
[t-SNE] Computed neighbors for 1341 samples in 0.116s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1341
[t-SNE] Computed conditional probabilities for sample 1341 / 1341
[t-SNE] Mean sigma: 0.073910
[t-SNE] KL divergence after 250 iterations with early exaggeration: 70.887177
[t-SNE] KL divergence after 1000 iterations: 1.186863
done.


If you hover over these points, you'll see that the cryptography-related documents are off to the right and are deemed less similar to the other documents, which relate to Windows, Macs, and computer hardware.  This agrees with intuition. 

To keep things more narrowly focused, let's remove the cryptography documents from the set before proceeding to the next section.

In [22]:
tech_topics = [11, 23, 33]
tech_probs = tm.get_doctopics(topic_ids=tech_topics)
tech_texts = tm.get_texts(topic_ids=tech_topics)

## Scoring Documents by Similarity

Once you've identified a set of documents that are interesting to your use case, you may want to identify additional documents that are semantically similar to this set.  Here, suppose we wanted to identify new documents that are related to computer technology.  We can accomplish this with the `train_scorer` method.  The `train_scorer` method compiles a list of seed documents based on supplied `topic_ids` or `doc_ids` and trains a scorer that rates new documents based on their similarity to the seed documents. Internally, this is accomplished by training a rudimentary [One-Class classifier](https://en.wikipedia.org/wiki/One-class_classification).  While this classifier can be used as is, it may be more useful to use the scorer to help compile a training set for a traditional binary classifier.

In [23]:
tm.train_scorer(topic_ids=tech_topics)
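
The notebook does not say which one-class algorithm is used internally, so the following is only a toy stand-in that conveys the idea: fit on seed examples alone, then score new points so that positive means "similar" and negative means "dissimilar". The `CentroidScorer` class and all vectors are hypothetical, and a centroid-distance rule is a deliberate simplification of a real one-class classifier.

```python
import math

def centroid(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class CentroidScorer:
    """Toy one-class scorer: inliers lie within a radius of the seed centroid."""

    def fit(self, seeds):
        self.center = centroid(seeds)
        # Radius = distance of the farthest seed, so every seed scores >= 0.
        self.radius = max(distance(s, self.center) for s in seeds)
        return self

    def score(self, vectors):
        # Positive inside the radius, negative outside.
        return [self.radius - distance(v, self.center) for v in vectors]

# Hypothetical topic distributions over (hardware, windows, sports).
seeds = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
scorer = CentroidScorer().fit(seeds)

new_docs = [
    [0.75, 0.15, 0.10],  # close to the seeds
    [0.05, 0.05, 0.90],  # mostly sports
]
scores = scorer.score(new_docs)
print([round(s, 3) for s in scores])
```

Only positive (seed) examples are needed for fitting, which is what makes this family of methods suitable when no negative examples have been labeled yet.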

We can now invoke the `score` method to measure the degree to which new documents are similar to our technology-related topics.  Let's use `score` to measure the similarity of the remaining documents in the corpus. Note that, although we are applying the scorer to documents within the same corpus used to train the topic model, this is not required. The scorer can be applied to any arbitrary set of documents.

Let's retrieve the text associated with all documents **not** associated with `topic_ids` 11, 23, and 33.

In [24]:
other_topics = [i for i in range(tm.n_topics) if i not in tech_topics]
other_texts = [d[0] for d in tm.get_docs(topic_ids=other_topics)]

Let's score these documents and place the results into a pandas DataFrame.

In [25]:
# score documents based on similarity
other_scores = tm.score(other_texts)

In [26]:
# display results in Pandas dataframe
other_preds = [int(score > 0) for score in other_scores]
data = sorted(list(zip(other_preds, other_scores, other_texts)), key=lambda item:item[1], reverse=True)
print('Top Inliers (or Most Similar to Our Technology-Related Topics)')
print('\t\tNumber of Predicted Inliers: %s' % sum(other_preds))
df = pd.DataFrame(data, columns=['Prediction', 'Score', 'Text'])
df.head()

Top Inliers (or Most Similar to Our Technology-Related Topics)
		Number of Predicted Inliers: 130


Unnamed: 0,Prediction,Score,Text
0,1,0.209006,my cousin got a second internal ide drive seagate can look up the model number if and been to help him install got a vested since busted and i have to use his until i get mine already has a seagate ide hd i forget the model number i can find i seem to get the bloody thing managed to get or the other drive up the other but not both the same whenever i the thing hangs during bootup gets past the system the ide instruction says it supports two i think configured the cmos the plugged in i even learned about the relationship that two hds are supposed to have pcs were into and i think i configured the jumpers one is the the new one is the many thanks in this is practically an emergency have papers to do on this thing for barnes suranet operations voice fax i speak for suranet and they speak for been told by our local computer guru that you do this unless you perform a low level format on your existing hard drive and set your system up for two hard drives from the i took him at his and i have not tried to find out any more about because not going to back everything up just to add another if anyone knows for sure what the scoop i would like to know thanks in advance bill willis do not do a low level format on an ide drive unless you have the executable for doing so supplied by the these are available from or mail but the mail version costs a nominal in addition to the jumper on an ide drive there is also another jumper to indicate whether a slave is get it the cabling is not an issue as long as pin goes to pin goes to pin no twisting or swapping on an ide be sure of pin on all three components do not make assumptions are ok but assumptions are if the cable and jumpers are and the cmos setup is then you may have to do an fdisk followed by a high level i have never personally found this but perhaps there is something gone wrong with the data on the probably not but i understand your predicament you will probably throw salt over your wear funny clothes and do a 
spooky sounding chant while dancing around the room if someone said it might good luck
1,1,0.193166,about two months ago i purchased the adaptec driver for use with a at the time this seemed the thing to do as the documentation i had with my adaptec scsi controller said that this is the driver to be used with since then i have learn that this driver is out of date in a major way and that adaptec have an upgrade deal for going to the next driver think called or i too fussed about this until i upgraded by drive from a sony to a sony i now find that the will not i assume it is not being handled correctly by the should i chase adaptec for an if so does anyone know their fax any assistance regards everything else works certainly seems that sony have caught up with the rest with the
2,1,0.192934,a few posts somebody mentioned that the duo might crash if it has the wrong kind of ram in my duo crashes sometimes after and i am wondering if there is any software which will tell me whether or not i have the right kind of ram i had thought that the problem was the battery thanks in
3,1,0.172741,does support the graphics accelerator board in the sun thanks in
4,1,0.17268,hi another of those type been given an oldish phillips televideo terminal type without a but no problem so when i dismantled i discovered that it is really just a standard rgb monitor with built in software phillips kindly labelled the circuit board with the rgb so i connected it up as a monitor and he presto it worked sort the problem is that i have no idea where to connect the sync the display rolls but does change modes only to cga but useful for my any of you wonderful people any knowledge of phillips tried phillips in the uk and a very helpful guy told me that he has had several enquiries of this but phillips computer is now under the auspices of dec least in the dec said sorry phillips make it any what is it a uk support dealer said so any


As you can see, we've found additional technology-related posts in the dataset.

Our `scorer` assigns a score to each document, where higher scores indicate a higher degree of similarity to the technology-related seed documents.  The `scorer` implements a decision function to make binary decisions on similarity, such that documents with positive scores are deemed similar and documents with negative scores are deemed dissimilar. We've used this to create a prediction of 1 for similar and 0 for dissimilar.  This identifies 130 documents as similar.  The `scorer`, however, employs a One-Class classifier, which tends to be more strict.  That is, there are likely documents with negative scores close to zero that are also similar.  Let's look at these.

In [27]:
df[df.Score <=0].head()

Unnamed: 0,Prediction,Score,Text
130,0,-0.001667,i am trying to compile a chart for windows and dos performance of local bus video card so if you have a and one of the local bus video cards please email me your winbench and in please give me winmark score at and i will post the chart if enough response if tseng vlb cl vlb based local bus ati ultra pro vlb orchid celsius vlb agx based vlb they matox mga based video cards
131,0,-0.001815,this may be a simple question we have a number of which we use to link to a mainframe using novell lan workplace for dos windows to make life easier for us we are thinking of using windows for workgroups to allow file sharing across our pc now does anyone know if it is possible to use and lan workplace for dos at the same ie can i access a file on another pc while being logged on to the mainframe at the same any help well
132,0,-0.00214,i have recently picked up a page scanner by the name of ii model the software for it was made for windows and will not work with the newwer does any one out there kow were i could find the company that made this beast say and the name gms a division of does anyone know if these companies still exist and if they do they have an email if anyone knows of a programme that is able to access this
133,0,-0.003871,hello i have a problem with my micro solutions sometimes it sometimes it i will either start a or start a tape and at about percent i get an error either saying the tape is bad or the has aborted for an unknown if i turn everything off and wait a half hour it works is it because the tape backup is too has anyone had similar
134,0,-0.006398,i have just added a panasonic dot matrix printer to a i installed the appropriate windows printer driver one specifically for this but unable to persuade the poxy thing to print what appears to be happening is that the truetype fonts get printed my experiments show that all graphic images example a line drawing from corel print graphicsworkshop for windows happily prints gifs ms notepad and ms write will print correctly providing the in the text are printer when i print truetype some lines appear to be printed in the wrong if i change the text font to a printer the problem is i have tried using printer drivers for printers which the namely epsom and ibm proprinter and the same problem if there is some kind soul who can tell me just what the hell is going i would be most


As you can see, these documents are also similar and related to technology (albeit slightly different aspects of technology than that of our seed set of documents).  Such negatively-scored documents are useful for identifying so-called informative examples.  Since documents are sorted by score (descending order), we can start at the beginning of the dataframe containing negatively-scored documents and add documents to the positive class until we start seeing negative documents that are **not** related to technology.  These informative negative examples can then be added to a negative class for training a traditional binary classifier.  This process is referred to as [active learning](https://en.wikipedia.org/wiki/Active_learning_(machine_learning)).
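
One way to organize this review is to split scored documents into three buckets. The sketch below uses made-up `(score, text)` pairs whose scores echo the ranges seen above; the `margin` value and the bucket names are hypothetical choices, not something prescribed by ktrain.

```python
# Hypothetical (score, text) pairs, sorted by score in descending order.
scored = [
    (0.209, "second internal ide drive"),
    (-0.002, "local bus video card benchmark"),
    (-0.006, "dot matrix printer driver"),
    (-1.800, "armenian history discussion"),
    (-3.952, "playoff leaders as of april"),
]

margin = 0.5  # hypothetical width of the "borderline" band below zero

# Positive scores: already predicted as similar to the seed documents.
predicted_similar = [t for s, t in scored if s > 0]

# Borderline negatives: worth reviewing by hand before assigning a label.
to_review = [t for s, t in scored if -margin < s <= 0]

# Clearly dissimilar documents: easy negatives, but less informative.
far_negatives = [t for s, t in scored if s <= -margin]

print(len(predicted_similar), len(to_review), len(far_negatives))
```

Documents in `to_review` are the ones a human labeler would inspect first, since their labels are the least certain and therefore the most informative for a downstream binary classifier.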

Documents later in the list are the most dissimilar to technology-related posts and may not be informative as inclusions into a training set.  As you can see in the cell below, sports-related documents are the most dissimilar to technology.

In [29]:
df[df.Score <=0].tail(3)

Unnamed: 0,Prediction,Score,Text
13760,0,-3.951539,playoff leaders as of april player team gp g a pts pim pit juneau bos noonan chi mogilny buf neely bos brown stl jagr pit oates bos carson la hunter was stevens nj cullen tor hull stl khristich was linden van racine det shanahan stl sydor la yzerman det bure van coffey det drake det emerson stl van johansson was lapointe que niedermayer nj ramsey pit sandstrom la smehlik buf stevens pit adams van barr nj bellows mon burr det chiasson det craven van dahlquist cal dionne mon felsner stl ferraro nyi francis pit gilmour tor hannan buf heinze bos howe det huddy la king win lafontaine buf lefebvre tor mcsorley la millen la ronning van rucinsky que sakic que sheppard det steen win suter cal sweeney buf tipett pit yawney cal young que barnes win borschevsky tor brunet mon chelios chi ciccarelli det clark tor desjardins mon dipietro mon donnelly la driver nj duchesne que ellett tor elynuik was flatley nyi fleury cal gallant det gill tor granato la gretzky la guerin nj hawerchuk buf holik nj housley win janney stl chi khmylev buf krygier was larmer chi macinnis cal matteau chi mceachern pit mclean van mcrae stl mullen pit muller mon murphy pit murzyn van otto cal pearson tor pivonka was primeau det probert det reichel cal ricci que robitaille la roenick chi samuelsson pit semak nj shannon win shuchuk la sundin que sutter chi taylor la tocchet pit vaske nyi
13761,0,-3.956802,player team gp g a pts pim pit francis pit oates bos yzerman det coffey det was mogilny buf thomas nyi lapointe que johansson was carson la brown stl fleury cal van flatley nyi macinnis cal ferraro nyi mceachern pit neely bos turgeon nyi bellows mon jagr pit khmylev buf khristich was hawerchuk buf hogue nyi juneau bos pit pit lafontaine buf ramsey pit smehlik buf noonan chi gilmour tor hull stl otto cal reichel cal bure van drake det linden van nieuwendyk cal roberts cal young que buf nj tocchet pit carpenter was pit ronning van suter cal yawney cal adams van chiasson det craven van cullen tor dahlquist cal king win racine det rychel la shanahan stl sheppard det sydor la barnes win emerson stl gill tor granato la gretzky la housley win janney stl king nyi kozlov det sandstrom la shuchuk la vaske nyi damphousse mon elynuik was guerin nj hannan buf holik nj muller mon sakic que semak nj sundin que taglianetti pit tipett pit barasso pit bondra was carney buf cavallini was desjardins mon duchesne que niedermayer nj ricci que ridley was pit blake la borschevsky tor zelepukin nj nyi burr det domi win fedorov det felsner stl howe det huddy la kurri la lefebvre tor lidstrom det lowry stl mcsorley la millen la mironov tor numminen win paslawski cal steen win ysebaert det anderson tor berube cal chelios chi ciccarelli det clark tor dahl cal dipietro mon donnelly la ellett tor gallant det chi kennedy det larmer chi matteau chi mclean van mcrae stl murzyn van musil cal pearson tor primeau det probert det ranheim cal robitaille la roenick chi selanne win shannon win skrudland cal sutter chi taylor la zhitnik la barr nj bourque bos burridge was dionne mon heinze bos leschyshyn que presley buf rucinsky que smolinski bos wood buf brunet mon daniels pit donato bos driver nj gusarov que houlder buf pit kamensky que krygier was loney pit may was miller was odelein mon pivonka was shaw bos straka pit belanger mon chorske nj druce win eagles win errey buf ewen mon 
foligno tor goulet chi grimson chi hughes bos kovalenko que leeman mon mcllwain tor osbourne tor richer nj roberge mon schneider mon watters la weimer bos andreychuk tor ashton cal babych van baron stl bassen stl baumgartner tor bautin win belfour chi berg tor billington nj blue bos borsato win bozon stl butcher stl cheveldae det conacher la corkum buf diduck van dirk van erickson win essensa win gilbert chi graham chi hardy la hedican stl hrudey la chi johansson cal joseph stl kasatonov nj kennedy win konstantinov det krushelnyski tor lidster van lumme van macoun tor marchment chi miller stl momesso van moog bos muni chi murphy chi murray chi nedved van olausson win pearson que potvin tor quintal stl ramage mon stl rouse tor russell chi ruuttu chi bos chi sandlak van semenov van stern cal van bos terreri nj tkachuk win ulanov win valk van vernon cal wilson stl zezel tor zhamnov win zombo stl albelin nj anderson was audette buf bodger buf brisebois mon nj carbonneau mon cavallini que cote was bos daigneault mon dalgarno nyi daneyko nj douris bos fetisov nj finn que fitzgerald nyi foote que fuhr buf bos haller mon hatcher was healy nyi hextall que hough que iafrate was jennings pit jones was kasparaitis nyi keane mon kimble bos krupp nyi kvartalnov bos leclair mon leach bos lebeau mon ledyard buf loiselle nyi maclean nj malakov nyi may buf mckay nj nicholls nj nolan que norton nyi patterson buf pilon nyi poulin bos roy mon savard mon simon que stapleton pit stastny nj sutton buf nyi tabaracci was vukota nyi wesley bos wolanin que
13762,0,-4.154509,of adirondack cdi leads first round springfield indians vs providence bruins gm springfield providence gm springfield providence gm providence at springfield gm providence at springfield gm springfield at providence gm providence at springfield gm springfield at providence cd islanders vs adirondack red wings gm adirondack cdi gm cdi at adirondack gm adirondack at cdi gm adirondack at cdi gm cdi at adirondack gm adirondack at cdi gm cdi at adirondack baltimore skipjacks at binghamton rangers gm baltimore at binghamton gm baltimore at binghamton gm binghamton at baltimore gm binghamton at baltimore gm baltimore at binghamton gm binghmaton at baltimore gm baltimore at binghamton utica devils vs rochester americans gm utica at rochester gm utica at rochester gm rochester at utica gm rochester at utica gm utica at rochester gm rochester at utica gm utica at rochester moncton hawks vs st maple leafs gm st moncton gm moncton vs st at halifax gm st at moncton gm st at moncton gm moncton vs st at halifax gm st at moncton gm moncton vs st at halifax cape breton oilers vs fredericton canadiens gm fredericton cape breton gm cape breton at fredericton gm fredericton at cape breton gm fredericton at cape breton gm cape breton at fredericton gm fredericton at cape breton gm cape breton at fredericton


## Recommending Similar Documents

In the previous section, given a set of seed documents, we scored **new** documents based on similarity.  Here, we will reverse this process. Given a **new** document, we will find (or recommend) documents that are semantically similar to it from the set of documents returned by `get_docs()`.

We must first train the recommender.

In [30]:
tm.train_recommender()
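
The notebook does not describe how the recommender works internally. As a conceptual sketch under that caveat, a recommender of this kind can be thought of as a nearest-neighbor search over document-topic vectors: the new text is converted to a topic distribution, and the stored documents closest to it are returned. The functions `cosine` and `recommend` and all vectors below are hypothetical, and cosine similarity is an assumed choice of metric.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(query, corpus, n=2):
    """Return the indices of the n corpus vectors most similar to query."""
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine(query, corpus[i]),
                    reverse=True)
    return ranked[:n]

# Hypothetical topic distributions over (religion, hardware, sports).
corpus = [
    [0.90, 0.05, 0.05],  # 0: religion
    [0.05, 0.90, 0.05],  # 1: hardware
    [0.70, 0.10, 0.20],  # 2: mostly religion
    [0.05, 0.05, 0.90],  # 3: sports
]

# Topic distribution of a new, unseen document about religion.
query = [0.85, 0.10, 0.05]

top = recommend(query, corpus, n=2)
print(top)
```

The returned indices play the role of `doc_id` values, which can then be used to look up the text and category of each recommended document.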

Now, let's create some text about Christianity and recommend the top 3 newsgroup posts similar to this text.

In [31]:
rawtext = """
          Jesus is a religious leader whose life and teachings are recorded in the Bible’s New Testament. 
          He is a central figure in Christianity and is emulated as the incarnation of God by many Christians 
          all over the world.
          """

In [32]:
for i, doc in enumerate(tm.recommend(text=rawtext, n=3)):
    print('RESULT #%s'% (i+1))
    print('TEXT:\n\t%s' % (doc[0]))
    print('NEWSGROUP:\n\t%s'% (categories[doc[1]]))
    print('TOPIC:\n\t%s' % (tm.topics[doc[3]]))
    print()

RESULT #1
TEXT:
	witnesses do not believe that christians are required to observe the whether it is on saturday or the sabbath was part of a covenent between god and the israelites and is not required for
NEWSGROUP:
	talk.religion.misc
TOPIC:
	god jesus believe people does bible christian say christians man

RESULT #2
TEXT:
	the argument for genealogy being that of mary is very according to luke and when he began his jesus himself was about thirty years of being supposedly the son of the son of aside from the fact that mary is not there are two possible either joseph was her father or he was her clearly this is not a third would be that the son of was her father and just happened to have the name as the man to whom she was but that would seem to be grasping at the most straightforward interpretation is that luke had no intention of tracing genealogy which case he would have named but that he traces her from son the matthew descendant list most definitely traces down from to matthew and