# About this helper
This helper script reads the plot_summaries.txt file and loads it into a dictionary with:<br>
k: movie_id<br>
v: plot_summary

This pre-processing is required for the subsequent analysis.
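
As a minimal sketch of the target structure, the snippet below turns tab-separated lines into a `movie_id -> plot_summary` dictionary. The sample lines and the helper name `parse_plot_line` are hypothetical and only mimic the layout of plot_summaries.txt.

```python
def parse_plot_line(line):
    """Split one 'movie_id<TAB>plot_summary' line into a (movie_id, plot_summary) pair."""
    # Split on the first tab only, so any tab inside the summary text is preserved
    movie_id, plot_summary = line.rstrip('\n').split('\t', 1)
    return movie_id, plot_summary

# Hypothetical sample lines in the same layout as plot_summaries.txt
sample_lines = [
    "101\tA taxi driver befriends a saxophonist.\n",
    "102\tTwo tributes fight to survive.\n",
]

plots = dict(parse_plot_line(line) for line in sample_lines)
print(plots["101"])
```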

In [1]:
from collections import defaultdict
import nltk
import os
import pickle
import spacy
import gensim

In [2]:
# Download stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /home/devel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# Change the paths if required
data_folder = os.path.join('data', 'MovieSummaries')
file_plot_summaries = 'plot_summaries.txt'
file_movie_metadata = 'movie.metadata.tsv'

In [4]:
f_plot_summaries = os.path.join(data_folder, file_plot_summaries)

In [5]:
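# Read with an explicit encoding so the result does not depend on the platform default
with open(f_plot_summaries, 'r', encoding='utf-8') as f: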
    plot_summaries = f.readlines()

In [6]:
type(plot_summaries), len(plot_summaries)

(list, 42306)

In [7]:
plot_summaries[0]

"23890098\tShlykov, a hard-working taxi driver and Lyosha, a saxophonist, develop a bizarre love-hate relationship, and despite their prejudices, realize they aren't so different after all.\n"

In [8]:
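# Map each movie_id to its plot summary
dict_plot_summaries = defaultdict(str)
for plot in plot_summaries:
    # Split on the first tab only and drop the trailing newline
    movie_id, plot_summary = plot.rstrip('\n').split('\t', 1)
    dict_plot_summaries[movie_id] = plot_summary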

## Check that the number of items processed is correct

In [9]:
print(len(dict_plot_summaries))

42306


In [10]:
for k,v in dict_plot_summaries.items():
    print(f'{k}: {v}')
    break

23890098: Shlykov, a hard-working taxi driver and Lyosha, a saxophonist, develop a bizarre love-hate relationship, and despite their prejudices, realize they aren't so different after all.



# Initialise the spaCy model

In [11]:
# Lemmatization using spaCy
from tqdm.notebook import tqdm

def lemmatization(texts, allowed_postags=('NOUN', 'PROPN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each text, keeping only tokens whose POS tag is in allowed_postags."""
    # Named-entity recognition and dependency parsing are not used here, so disable them for speed
    nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
    texts_out = []
    with tqdm(total=len(texts)) as pbar:
        for text in texts:
            doc = nlp(text)
            # Keep the lemma of every token with an allowed part-of-speech tag
            new_text = [token.lemma_ for token in doc if token.pos_ in allowed_postags]
            texts_out.append(" ".join(new_text))
            pbar.update()

    return texts_out

In [12]:
nlp = spacy.load('en_core_web_sm')

In [13]:
raw_docs = [v for _, v in dict_plot_summaries.items()]
print(f'Total of {len(raw_docs)} records loaded')

Total of 42306 records loaded


# Load Lemmatized File or run the process

We first check whether a local pickle file with the lemmatized output already exists. Reusing it on subsequent runs, or sharing it with others, saves the roughly 5-10 minutes that the lemmatization process takes.

To refresh the lemmatized output, delete the pickle file (or change the file name in the code below).

### ⚠️⚠️ <br>IMPORTANT NOTE: IF YOU HAVE CHANGED THE INPUT FILE, YOU MUST RUN THIS ONCE TO RECREATE THE LEMMATIZED TOKENS

In [14]:
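LEMMATIZED_FILE = 'lemmatized_docs.p'

try:
    # Reuse the cached lemmatized output if it exists
    with open(LEMMATIZED_FILE, 'rb') as f:
        lemmatized_docs = pickle.load(f)
except FileNotFoundError:
    input(f'Lemmatized input file {LEMMATIZED_FILE} not found.  Press ENTER to run lemmatization process (~10 minutes)')
    lemmatized_docs = lemmatization(raw_docs)
    # Save the result so that subsequent runs can skip the lemmatization step
    with open(LEMMATIZED_FILE, 'wb') as f:
        pickle.dump(lemmatized_docs, f)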
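
The warning above relies on remembering to refresh the cache by hand whenever the input file changes. One possible alternative, sketched here under assumptions, is to store a hash of the input file next to the cached result and recompute when the hash no longer matches. The helper names `file_fingerprint` and `load_or_compute` and the cache layout are hypothetical, not part of this notebook.

```python
import hashlib
import os
import pickle

def file_fingerprint(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        # Read in 1 MiB chunks so large files do not need to fit in memory
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()

def load_or_compute(input_path, cache_path, compute):
    """Return the cached result for input_path, recomputing it when the input has changed."""
    fingerprint = file_fingerprint(input_path)
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            cached = pickle.load(f)
        # Only trust the cache if it was built from exactly this input
        if isinstance(cached, dict) and cached.get('fingerprint') == fingerprint:
            return cached['result']
    result = compute()
    with open(cache_path, 'wb') as f:
        pickle.dump({'fingerprint': fingerprint, 'result': result}, f)
    return result
```

In this notebook it could be called as `load_or_compute(f_plot_summaries, 'lemmatized_cache.p', lambda: lemmatization(raw_docs))`, where the cache file name is again only an example.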

In [15]:
# Print out some samples
for i in range(2):
    print(raw_docs[i])
    print(lemmatized_docs[i])
    print()

Shlykov, a hard-working taxi driver and Lyosha, a saxophonist, develop a bizarre love-hate relationship, and despite their prejudices, realize they aren't so different after all.

Shlykov hard work taxi driver Lyosha saxophonist develop bizarre love hate relationship prejudice realize be so different after all

The nation of Panem consists of a wealthy Capitol and twelve poorer districts. As punishment for a past rebellion, each district must provide a boy and girl  between the ages of 12 and 18 selected by lottery  for the annual Hunger Games. The tributes must fight to the death in an arena; the sole survivor is rewarded with fame and wealth. In her first Reaping, 12-year-old Primrose Everdeen is chosen from District 12. Her older sister Katniss volunteers to take her place. Peeta Mellark, a baker's son who once gave Katniss bread when she was starving, is the other District 12 tribute. Katniss and Peeta are taken to the Capitol, accompanied by their frequently drunk mentor, past vic

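# Gensim module
Here we will use the gensim module to perform the following:

1. pre-processing (tokenizing the lemmatized text)
2. building a dictionary that maps each token to an integer id
3. converting each document to a bag-of-words vector
4. training a TF-IDF model and building a similarity index for matching
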

In [16]:
import gensim

from gensim.utils import simple_preprocess

In [17]:
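# Pre-process using gensim
# simple_preprocess lowercases and tokenizes the text, and drops tokens shorter than 2 or longer than 15 characters.
# It does not remove stop words; that is handled separately below.
preprocessed_docs = [simple_preprocess(s, deacc=False) for s in lemmatized_docs]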
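
To make the effect of this step concrete, here is a rough pure-Python approximation of what `simple_preprocess` does with its default length limits. It is a sketch for illustration only, not gensim's implementation; the name `approx_simple_preprocess` and the regular expression are assumptions.

```python
import re

# Runs of word characters that are not digits (roughly "alphabetic" tokens)
TOKEN_PATTERN = re.compile(r'(?:(?!\d)\w)+')

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and drop tokens that are too short or too long."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith('_')]

sample = "Shlykov hard work taxi driver a 12 Lyosha"
print(approx_simple_preprocess(sample))
# ['shlykov', 'hard', 'work', 'taxi', 'driver', 'lyosha']
```

The one-letter token and the number are dropped, while ordinary stop words of two or more letters would pass through unchanged.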

In [18]:
# Remove stopwords using the spaCy stopword list
en = spacy.load('en_core_web_sm')
spacy_stopwords = en.Defaults.stop_words

# Check the stopwords that currently exist.  The data structure is a set, so you can extend it
# if you wish to eliminate other types of words / characters from the corpus
print(len(spacy_stopwords), type(spacy_stopwords))
print(spacy_stopwords)

326 <class 'set'>
{'onto', 'us', 'either', 'should', 're', 'also', 'beyond', 'see', 'indeed', 'against', 'some', 'regarding', 'move', 'did', 'full', 'off', 'everywhere', 'often', 'or', 'whose', 'no', 'by', 'always', 'nevertheless', 'were', 'do', 'may', 'down', 'my', 'none', 'becomes', 'to', 'empty', 'here', 'everything', '’d', 'ever', 'n’t', 'herein', 'who', 'ten', 'his', 'that', 'same', 'though', 'whereby', 'ours', 'seems', 'as', 'side', 'amount', 'why', 'someone', 'forty', 'six', 'twenty', 'seem', 'eight', 'anyway', 'throughout', 'perhaps', 'sometime', 'former', 'beside', 'make', 'about', 'four', 'just', 'over', 'last', 'because', 'we', 'many', 'be', 'anything', 'least', 'front', 'been', 'whom', 'namely', 'not', 'anyhow', 'sometimes', 'very', 'give', 'this', 'towards', 'neither', 'yourselves', 'then', 'used', 'across', 'get', 'formerly', 'made', 'there', "'re", 'any', 'others', 'whence', '‘ll', 'without', 'nobody', 'really', 'among', 'must', 'wherever', 'yet', 'hereafter', 'yourself'

In [19]:
preprocessed_docs = [[w for w in doc if w not in spacy_stopwords] for doc in preprocessed_docs]
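
Because the stopword list is a plain set, it can be extended before filtering. The sketch below uses a tiny stand-in set instead of the real 326-entry spaCy set, and the extra words `film` and `movie` are hypothetical examples of domain-specific additions.

```python
# Stand-in for spacy_stopwords, kept tiny for illustration
base_stopwords = {'the', 'a', 'of', 'and'}

# Hypothetical domain-specific additions
custom_stopwords = base_stopwords | {'film', 'movie'}

docs = [['the', 'film', 'follows', 'a', 'detective'], ['movie', 'of', 'revenge']]
filtered = [[w for w in doc if w not in custom_stopwords] for doc in docs]
print(filtered)
# [['follows', 'detective'], ['revenge']]
```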

In [20]:
for i in range(2):
    print(f'original #{i}- ')
    print(raw_docs[i])
    print()
    print(f'pre-processed #{i}')
    print(preprocessed_docs[i])
    print()

original #0- 
Shlykov, a hard-working taxi driver and Lyosha, a saxophonist, develop a bizarre love-hate relationship, and despite their prejudices, realize they aren't so different after all.


pre-processed #0
['shlykov', 'hard', 'work', 'taxi', 'driver', 'lyosha', 'saxophonist', 'develop', 'bizarre', 'love', 'hate', 'relationship', 'prejudice', 'realize', 'different']

original #1- 
The nation of Panem consists of a wealthy Capitol and twelve poorer districts. As punishment for a past rebellion, each district must provide a boy and girl  between the ages of 12 and 18 selected by lottery  for the annual Hunger Games. The tributes must fight to the death in an arena; the sole survivor is rewarded with fame and wealth. In her first Reaping, 12-year-old Primrose Everdeen is chosen from District 12. Her older sister Katniss volunteers to take her place. Peeta Mellark, a baker's son who once gave Katniss bread when she was starving, is the other District 12 tribute. Katniss and Peeta are 

# Create Dictionary from Gensim

In [21]:
from gensim import corpora
dictionary = corpora.Dictionary(preprocessed_docs)
print(dictionary)
print('unique tokens', len(dictionary))

Dictionary(125097 unique tokens: ['bizarre', 'develop', 'different', 'driver', 'hard']...)
unique tokens 125097


In [22]:
token_to_id = dictionary.token2id
print(type(token_to_id))
print(token_to_id)

<class 'dict'>


In [23]:
# To get the word from its index, use dictionary[idx]
idx = 2
print(f'index: {idx}, word: {dictionary[idx]}')

# To get the index from a word, use token_to_id[word]
word_check = 'hungry'
print(f'word: {word_check}, index: {token_to_id[word_check]}')

index: 2, word: different
word: hungry, index: 6466
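
The two lookups above amount to a pair of mappings between tokens and integer ids. A minimal pure-Python sketch of that idea follows. Visiting new tokens in sorted order within each document is an assumption chosen to mirror the ids printed above, and `build_token_ids` is a hypothetical helper, not a gensim function.

```python
def build_token_ids(docs):
    """Assign a consecutive integer id to every distinct token."""
    token2id = {}
    for doc in docs:
        # Assumption: new tokens are numbered in sorted order within each document
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

docs = [['hard', 'work', 'taxi', 'driver'], ['taxi', 'chase', 'driver']]
token2id = build_token_ids(docs)
id2token = {idx: token for token, idx in token2id.items()}

print(token2id)
# {'driver': 0, 'hard': 1, 'taxi': 2, 'work': 3, 'chase': 4}
print(id2token[0])
# driver
```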


# Convert doc to vectors

In [24]:
# Test on a sample of 1 document (plot summary)
doc = preprocessed_docs[0]
vec = dictionary.doc2bow(doc)
print(vec)
print(doc)

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]
['shlykov', 'hard', 'work', 'taxi', 'driver', 'lyosha', 'saxophonist', 'develop', 'bizarre', 'love', 'hate', 'relationship', 'prejudice', 'realize', 'different']
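
The bag-of-words vector printed above is a list of `(token_id, count)` pairs. As a hedged illustration of the idea (not gensim's code), the following sketch counts known tokens with `collections.Counter` and ignores tokens missing from the vocabulary. The toy vocabulary and the helper `to_bow` are hypothetical.

```python
from collections import Counter

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens of doc found in token2id."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_token2id = {'driver': 0, 'hard': 1, 'taxi': 2, 'work': 3}  # hypothetical toy vocabulary
print(to_bow(['taxi', 'driver', 'taxi', 'unknown'], toy_token2id))
# [(0, 1), (2, 2)]
```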


In [25]:
# Create the entire corpus of sparse vectors
bow_corpus = [dictionary.doc2bow(doc) for doc in preprocessed_docs]

In [26]:
# Check words in a document
doc_check = 3

print('index\tfreq\tword')
for idx, freq in bow_corpus[doc_check]:
    print(f'{idx:>5}\t{freq:>4}\t{dictionary[idx]}')

index	freq	word
   16	   1	able
   23	   1	allow
   34	   1	attention
   36	   1	avoid
   37	   1	away
   42	   2	begin
   53	   1	care
   65	   1	convince
   72	   3	day
   73	   1	dead
   85	   1	drop
   92	   1	fall
   96	   2	find
  105	   1	furious
  139	   1	late
  140	   1	later
  156	   1	need
  158	   3	night
  160	   2	old
  166	   2	past
  183	   1	provide
  188	   1	recover
  190	   3	return
  191	   2	reveal
  206	   1	set
  211	   1	snow
  245	   4	time
  253	   1	turn
  262	   1	warn
  270	   1	year
  276	   1	action
  280	   1	appear
  282	   2	arrive
  297	   2	come
  299	   2	confront
  302	   2	court
  303	   1	crime
  306	   4	decide
  308	   1	early
  310	   1	enter
  317	   2	friend
  325	   1	hidden
  339	   2	jail
  341	   1	judge
  348	   2	learn
  354	   1	manage
  373	   1	pass
  377	   1	plan
  401	   2	sentence
  406	   1	stay
  418	   2	try
  421	   1	wife
  423	   1	abandon
  424	   1	accomplice
  425	   2	activity
  426	   1	advantage
  427	   1	amenity


# Build index for similarity matching later

In [27]:
from gensim import models

# train the model (using TFIDF)
tfidf = models.TfidfModel(bow_corpus)

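# Test string: this only checks which word tokens match the dictionary.  Document similarity comes later.
# A plain split() keeps punctuation attached and skips lemmatization, so only a few tokens match.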
test_string = 'in this hunger games, the competitors are pitted against each other for survival.  Find out who will last the longest!'
test_words = test_string.lower().split()
print(tfidf[dictionary.doc2bow(test_words)])

[(96, 0.07030812077965193), (126, 0.4704150634580455), (88843, 0.879639946923852)]


In [28]:
print(dictionary[96])

find
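
In the output above, the common word `find` receives a much lower weight than the rarer tokens. The sketch below shows one way such weights can arise, assuming raw term counts multiplied by `log2(N / df)` and then scaled to unit length. This is an illustrative assumption about the weighting, and gensim's implementation may differ in detail. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def fit_idf(bow_corpus):
    """Compute idf = log2(N / df) for every token id in a bag-of-words corpus."""
    n_docs = len(bow_corpus)
    doc_freq = Counter(token_id for bow in bow_corpus for token_id, _ in bow)
    return {token_id: math.log2(n_docs / df) for token_id, df in doc_freq.items()}

def tfidf_vector(bow, idf):
    """Weight raw counts by idf, drop zero weights, and scale the vector to unit length."""
    weighted = [(token_id, count * idf[token_id])
                for token_id, count in bow
                if idf.get(token_id, 0.0) > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(token_id, w / norm) for token_id, w in weighted] if norm else []

# Toy corpus: token 0 appears in every document, token 2 in only one
toy_corpus = [[(0, 1), (1, 2)], [(0, 1), (2, 1)], [(0, 3), (1, 1)]]
idf = fit_idf(toy_corpus)
toy_vector = tfidf_vector([(0, 1), (1, 1), (2, 1)], idf)
print(toy_vector)
```

Token 0 occurs in every toy document, so its weight is zero and it is dropped, while the rarest token ends up with the largest weight.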


In [29]:
from gensim import similarities

index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))
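
The index compares a query vector against every document vector. A minimal sketch of that comparison, assuming cosine similarity over sparse `(token_id, weight)` vectors, looks like this. The functions `cosine_similarity` and `rank_documents` and the toy vectors are hypothetical and written for clarity rather than speed.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (token_id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs, top_n=3):
    """Return the top_n (document_number, score) pairs, best match first."""
    scores = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:top_n]

doc_vecs = [[(0, 1.0)], [(0, 0.6), (1, 0.8)], [(2, 1.0)]]
ranking = rank_documents([(1, 1.0)], doc_vecs, top_n=2)
print(ranking)
```

Only the second toy document shares a token with the query, so it ranks first and the others score zero.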

In [30]:
# Randomly pick one document that the user likes
# query_document = raw_docs[1].split()
query_document = '''
The documentary is primarily a look at the making of the film Return of the Jedi, which had been released that year. However it includes considerable material and behind-the-scenes footage from the two previous Star Wars films and an extensive interview with Star Wars creator George Lucas who discusses his influences, his original plans, and the process of creating the saga. The footage of the making of Return of the Jedi includes a look at the creation of the various alien creatures seen in the film , on location in Yuma Desert in Arizona for the sequence aboard Jabba's sailbarge, on location in the redwood forests of Northern California for the Endor scenes, the filming of the speeder bike chase sequence, and the creation of the various alien languages and the songs "Lapti Nek" as performed by the character Sy Snootles and the Ewok celebration song at the end of the film. It also includes the original deleted scene from the first Star Wars film featuring Han Solo's meeting with Jabba the Hutt, who was then played by Irish actor Declan Mulholland .
'''.split()

query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]]

# Print only the 10 most similar documents rather than the scores of all 42306 documents
top_n = 10
for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True)[:top_n]:
    print(document_number, score)

4689 0.42362058
29987 0.14867638
21551 0.123634055
2031 0.11762673
13112 0.11689922
12308 0.11596168
38332 0.11399215
9041 0.113553345
993 0.11340341
30413 0.108312786


In [32]:
# Check the plot summary
raw_docs[2031]

'The Astro Investigation and Defence Service  sends Derek, Frank, Ozzy, and Barry to investigate the disappearance of everyone in the town of Kaihoro, New Zealand. They find the town has been overrun by space aliens disguised as humans. Barry kills one of the aliens and is attacked by others. After Derek notifies Frank and Ozzy, he begins torturing Robert, an alien they caught earlier. Robert\'s screaming attracts a number of aliens in the area. Derek kills the would-be rescuers, but he is attacked by Robert and falls over a cliff, to his presumed death. Meanwhile, a charity collector named Giles is passing through Kaihoro. He is attacked by Robert, who has been eating the brains of the alien killed earlier by Barry. Giles escapes in his car and stops at a nearby house for help. Another alien answers the door and captures Giles. He later wakes up in a tub of water and is told he is about to be eaten. Derek also wakes up to find that he landed in a seagull\'s nest. He also finds that hi
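
The query in cell In [30] is tokenized with a plain `.split()`, whereas the dictionary holds lowercase lemmas, so capitalized or punctuated query tokens cannot match. The sketch below illustrates the effect on a toy vocabulary. The helper `normalise_query` and the vocabulary are hypothetical, and for full parity the query would also need to pass through the same lemmatization and gensim pre-processing as the corpus.

```python
import string

def normalise_query(text):
    """Lowercase the text and strip surrounding punctuation from each whitespace-separated token."""
    tokens = (tok.strip(string.punctuation).lower() for tok in text.split())
    return [tok for tok in tokens if tok]

vocabulary = {'star', 'war', 'film', 'jedi'}  # hypothetical toy dictionary of lowercase lemmas
query = 'The Star Wars film, Return of the Jedi.'

raw_hits = [tok for tok in query.split() if tok in vocabulary]
clean_hits = [tok for tok in normalise_query(query) if tok in vocabulary]

print(raw_hits)
# []
print(clean_hits)
# ['star', 'film', 'jedi']
```

Note that `wars` still fails to match the lemma `war`, which is why lemmatizing the query matters as well.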

# Some random thoughts

1. This is a rudimentary model; we need to explore whether other pre-processing steps and/or models work better.
2. We need some functions or an interface to make it more intuitive; for now it is more of a proof-of-concept (POC) prototype.
3. What kind of support/penalty algorithm should we use to increase or decrease the ranking based on Like/Dislike?
4. Can we introduce some explainability concepts, e.g. "We recommended this because \[fill in what the similarity matches were]"?
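
For the explainability idea in point 4, one possible starting point is to list the shared terms that contribute most to the similarity score. The sketch below assumes both documents are available as sparse TF-IDF vectors and ranks shared terms by the product of their weights. The function `explain_match`, the toy vectors, and the vocabulary are hypothetical.

```python
def explain_match(query_vec, doc_vec, id2token, top_n=3):
    """Return the shared terms that contribute most to the dot product of two sparse vectors."""
    doc_weights = dict(doc_vec)
    contributions = [
        (id2token[token_id], weight * doc_weights[token_id])
        for token_id, weight in query_vec
        if token_id in doc_weights
    ]
    return sorted(contributions, key=lambda x: x[1], reverse=True)[:top_n]

toy_id2token = {0: 'alien', 1: 'space', 2: 'town'}  # hypothetical toy vocabulary
toy_query_vec = [(0, 0.8), (1, 0.6)]
toy_doc_vec = [(0, 0.5), (1, 0.1), (2, 0.86)]

explanation = explain_match(toy_query_vec, toy_doc_vec, toy_id2token)
for term, contribution in explanation:
    print(f'{term}: {contribution:.2f}')
```

The resulting terms could be slotted into a message such as "We recommended this because both plots mention: alien, space".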
