# lab-12-6 sequence to sequence with attention (Keras + eager version)

### simple neural machine translation training

* sequence to sequence
  
### References
* [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)
* [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)
* [Neural Machine Translation with Attention, from TensorFlow](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb)

In [47]:
from __future__ import absolute_import, division, print_function

# Import TensorFlow >= 1.10 and enable eager execution
import tensorflow as tf

tf.enable_eager_execution()

from matplotlib import font_manager, rc

rc('font', family='AppleGothic')  # AppleGothic is a macOS font; pick another family on other platforms

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import pad_sequences

from pprint import pprint
import numpy as np
import os
import pandas as pd
print(tf.__version__)

1.15.0


In [48]:
df = pd.read_json('moviedata_preprocessed.json')

In [55]:
# Drop the stale 'level_0' column; `columns=` already selects the axis, so `axis=1` is redundant
df.drop(columns='level_0', inplace=True)

In [57]:
df = df.reset_index()

In [2]:
# Toy English-Korean pairs; `sources` and `targets` are reassigned from the movie data below
sources = [['I', 'feel', 'hungry'],
           ['tensorflow', 'is', 'very', 'difficult'],
           ['tensorflow', 'is', 'a', 'framework', 'for', 'deep', 'learning'],
           ['tensorflow', 'is', 'very', 'fast', 'changing']]
targets = [['나는', '배가', '고프다'],
           ['텐서플로우는', '매우', '어렵다'],
           ['텐서플로우는', '딥러닝을', '위한', '프레임워크이다'],
           ['텐서플로우는', '매우', '빠르게', '변화한다']]

In [58]:
# Lowercase the summaries and strip periods so that e.g. 'city.' and 'city' map to the same token
df.movie_summary = df.movie_summary.apply(lambda x: x.lower())
df.movie_summary = df.movie_summary.apply(lambda x: x.replace('.',''))
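
The cell above normalizes only `movie_summary`, while the taglines printed further down still contain periods. A minimal sketch of a shared helper that could be applied to both columns follows; the name `normalize`, its `drop_chars` parameter, and the toy strings are hypothetical and not part of the original notebook.

```python
def normalize(text, drop_chars="."):
    """Lowercase text and delete every character listed in drop_chars."""
    # str.maketrans with a third argument maps those characters to None (deletion)
    table = str.maketrans("", "", drop_chars)
    return text.lower().translate(table)


# Toy strings standing in for the summary and tagline columns (assumed data)
summaries = ["Four teen girls dive. They learn fast."]
taglines = ["90 minutes. 22 miles. Zero back up."]

clean_summaries = [normalize(s) for s in summaries]
clean_taglines = [normalize(t) for t in taglines]

print(clean_summaries)
print(clean_taglines)
```

With the DataFrame used in this notebook, the same function could be passed to `apply` on each text column, which keeps the source and target sides on one normalization rule.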

In [59]:
df.movie_summary

0       four teen girls diving in a ruined underwater ...
1       after a space merchant vessel perceives an unk...
2       a small team of elite american intelligence of...
3       set deep in the wilds of appalachia where beli...
4       a newlywed couple are passing through a vacati...
5       former texas rangers sam ward and logan kelihe...
6       a strongwilled female stock car driver challen...
7       a young man from the countryside uses his skil...
8       on a business trip to new orleans a damaged ma...
9       a young couple grieving the recent death of th...
10      men capture the creature from the black lagoon...
11      the sadistic tale of a lonely mentally handica...
12      a newspaper photographer researches an 1873 do...
13      when the kinetic rory moves into his room in t...
14      a customs officer who can smell fear develops ...
15      a sheriff tries to stop the killing spree of a...
16      earth has been conquered by robots from a dist...
17      detent

In [60]:
df.movie_tagline

0                                        the next chapter
1                            its alien the 8th passenger.
2                     90 minutes. 22 miles. zero back up.
3                                    we all have our sins
4       these are the daughters of darkness... they ar...
5       the gunslinger whose life he had saved forced ...
6                      its not just a mans world anymore.
7           every journey must begin with one small step.
8        you never know whos going to be your wakeup call
9                              death isnt always the end.
10                       allnew 3d thills original poster
11                          you cant run from a nightmare
12                                   hell hath no fury...
13                            live life like you mean it.
14                              sense something beautiful
15      hes an indestructible man fused with powers be...
16                                robots never lie.......
17      this s

In [61]:

# Whitespace-tokenize each summary (source side) and each tagline (target side)
sources = [text.split() for text in df.movie_summary]
targets = [text.split() for text in df.movie_tagline]
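
As a rough illustration of what this tokenization step produces, the sketch below runs the same `split()` call on a few toy strings. The helper name `tokenize` and the sample sentences are assumptions made for the example; only the use of `str.split()` comes from the cell above.

```python
def tokenize(texts):
    """Split each string on runs of whitespace; an empty string yields []."""
    return [text.split() for text in texts]


# Toy stand-ins for the summary and tagline columns (assumed data)
toy_summaries = ["a sheriff tries to stop the killing spree", "robots  conquer earth"]
toy_taglines = ["robots never lie", ""]

toy_sources = tokenize(toy_summaries)
toy_targets = tokenize(toy_taglines)

# Longest sequence on each side, useful when choosing a padding length later
s_max_len = max(len(tokens) for tokens in toy_sources)
t_max_len = max(len(tokens) for tokens in toy_targets)

# Pairs whose target is empty carry no training signal and may be worth filtering
empty_targets = [i for i, tokens in enumerate(toy_targets) if not tokens]

print(toy_sources, s_max_len, t_max_len, empty_targets)
```

Because `split()` without arguments collapses repeated whitespace, the doubled space in the second summary does not create an empty token.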

In [62]:
sources

[['four',
  'teen',
  'girls',
  'diving',
  'in',
  'a',
  'ruined',
  'underwater',
  'city',
  'quickly',
  'learn',
  'theyve',
  'entered',
  'the',
  'territory',
  'of',
  'the',
  'deadliest',
  'shark',
  'species',
  'in',
  'the',
  'claustrophobic',
  'labyrinth',
  'of',
  'submerged',
  'caves',
  '47',
  'meters',
  'down',
  'uncaged',
  'follows',
  'the',
  'diving',
  'adventure',
  'of',
  'four',
  'teenage',
  'girls',
  'sophie',
  'nlisse',
  'corinne',
  'foxx',
  'brianne',
  'tju',
  'and',
  'sistine',
  'stallone',
  'exploring',
  'a',
  'submerged',
  'mayan',
  'city',
  'once',
  'inside',
  'their',
  'rush',
  'of',
  'excitement',
  'turns',
  'into',
  'a',
  'jolt',
  'of',
  'terror',
  'as',
  'they',
  'discover',
  'the',
  'sunken',
  'ruins',
  'are',
  'a',
  'hunting',
  'ground',
  'for',
  'deadly',
  'great',
  'white',
  'sharks',
  'with',
  'their',
  'air',
  'supply',
  'steadily',
  'dwindling',
  'the',
  'friends',
  'must',
  'n
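
The next cell builds the source vocabulary with `<pad>` at index 0. The sketch below shows, on a toy corpus, how such a vocabulary can turn token lists into fixed-length id sequences. The names `encode`, `word2idx`, `idx2word`, the toy sentences, and the choice of padding at the end are assumptions for illustration; this section of the notebook does not show how padding is configured.

```python
# Toy tokenized corpus (assumed data)
toy_sources = [["a", "young", "man"], ["a", "sheriff"]]

# Sorted unique tokens, with '<pad>' reserved at index 0
vocab = ["<pad>"] + sorted({word for sentence in toy_sources for word in sentence})
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for idx, word in enumerate(vocab)}


def encode(tokens, word2idx, max_len):
    """Map tokens to ids, truncate to max_len, then right-pad with the '<pad>' id."""
    ids = [word2idx[word] for word in tokens][:max_len]
    return ids + [word2idx["<pad>"]] * (max_len - len(ids))


max_len = 4
encoded = [encode(tokens, word2idx, max_len) for tokens in toy_sources]

# Decoding skips the padding id to recover the original tokens
decoded = [[idx2word[i] for i in ids if i != word2idx["<pad>"]] for ids in encoded]

print(vocab)
print(encoded)
```

Keras `pad_sequences`, which the notebook imports, performs the same padding on lists of ids; note that it pads at the front by default, so `padding='post'` would be needed to match this sketch.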

In [63]:
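# Vocabulary for the source side: sorted unique tokens, with '<pad>' reserved at index 0.
# A set comprehension avoids the quadratic list copying of sum(sources, []).
s_vocab = sorted({word for sentence in sources for word in sentence})
s_vocab = ['<pad>'] + s_vocab
source2idx = {word : idx for idx, word in enumerate(s_vocab)}
idx2source = {idx : word for idx, word in enumerate(s_vocab)}

pprint(source2idx)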

{'0': 1,
 '000': 2,
 '00000': 3,
 '007': 4,
 '04': 5,
 '06': 6,
 '1': 7,
 '10': 8,
 '100': 9,
 '1000': 10,
 '10000': 11,
 '100000': 12,
 '1000000': 13,
 '10000000': 14,
 '1000th': 15,
 '1001': 16,
 '1007': 17,
 '100foottall': 18,
 '100point': 19,
 '100th': 20,
 '100yearold': 21,
 '104': 22,
 '1087': 23,
 '10m': 24,
 '10th': 25,
 '10thcentury': 26,
 '10year': 27,
 '10yearold': 28,
 '11': 29,
 '111': 30,
 '1114': 31,
 '112': 32,
 '114': 33,
 '1145': 34,
 '117': 35,
 '11pm': 36,
 '11th': 37,
 '11year': 38,
 '11yearold': 39,
 '12': 40,
 '120': 41,
 '1200': 42,
 '12000': 43,
 '1200s': 44,
 '1214': 45,
 '122748': 46,
 '123': 47,
 '1249': 48,
 '125': 49,
 '1250': 50,
 '125000': 51,
 '126th': 52,
 '12at': 53,
 '12chapter': 54,
 '12day': 55,
 '12episode': 56,
 '12hour': 57,
 '12step': 58,
 '12th': 59,
 '12week': 60,
 '12year': 61,
 '12yearold': 62,
 '12yearsmissing': 63,
 '13': 64,
 '130': 65,
 '13000': 66,
 '130000': 67,
 '1303': 68,
 '131': 69,
 '1314': 70,
 '132': 71,
 '1325': 72,
 '1327': 7

 'adolphe': 1250,
 'adolpheleopold': 1251,
 'adonde': 1252,
 'adonijah': 1253,
 'adonis': 1254,
 'adopt': 1255,
 'adopted': 1256,
 'adopteddaughter': 1257,
 'adopting': 1258,
 'adoption': 1259,
 'adoptive': 1260,
 'adopts': 1261,
 'adorable': 1262,
 'adoration': 1263,
 'adore': 1264,
 'adored': 1265,
 'adores': 1266,
 'adoring': 1267,
 'adorned': 1268,
 'adrain': 1269,
 'adrenal': 1270,
 'adrenalin': 1271,
 'adrenaline': 1272,
 'adrenalinedeath': 1273,
 'adrenalineinjected': 1274,
 'adrenalinerushing': 1275,
 'adresses': 1276,
 'adrian': 1277,
 'adriana': 1278,
 'adriani': 1279,
 'adriano': 1280,
 'adriatic': 1281,
 'adrien': 1282,
 'adrienne': 1283,
 'adrift': 1284,
 'ads': 1285,
 'adult': 1286,
 'adulteress': 1287,
 'adulteries': 1288,
 'adulterous': 1289,
 'adultery': 1290,
 'adultfilm': 1291,
 'adulthood': 1292,
 'adultonly': 1293,
 'adults': 1294,
 'adultstill': 1295,
 'advance': 1296,
 'advanced': 1297,
 'advancement': 1298,
 'advancements': 1299,
 'advances': 1300,
 'advancing':

 'annika': 2355,
 'anniken': 2356,
 'annistiem': 2357,
 'anniversary': 2358,
 'announce': 2359,
 'announced': 2360,
 'announcement': 2361,
 'announcer': 2362,
 'announcers': 2363,
 'announces': 2364,
 'announcing': 2365,
 'annoy': 2366,
 'annoyance': 2367,
 'annoyances': 2368,
 'annoyed': 2369,
 'annoying': 2370,
 'annoyingly': 2371,
 'annoys': 2372,
 'anns': 2373,
 'annthe': 2374,
 'annual': 2375,
 'annuity': 2376,
 'annulled': 2377,
 'annulment': 2378,
 'anny': 2379,
 'anointed': 2380,
 'anointees': 2381,
 'anomalies': 2382,
 'anomaly': 2383,
 'anonymity': 2384,
 'anonymous': 2385,
 'anonymously': 2386,
 'anorexic': 2387,
 'another': 2388,
 'anotherand': 2389,
 'anotheritalian': 2390,
 'anothers': 2391,
 'anotherthey': 2392,
 'ans': 2393,
 'anselmo': 2394,
 'anshu': 2395,
 'ansi': 2396,
 'ansky': 2397,
 'answer': 2398,
 'answered': 2399,
 'answering': 2400,
 'answers': 2401,
 'ant': 2402,
 'antagonism': 2403,
 'antagonist': 2404,
 'antagonistic': 2405,
 'antagonized': 2406,
 'antagon

 'bachchan': 3605,
 'bachelor': 3606,
 'bachelorette': 3607,
 'bachelorhood': 3608,
 'bachelors': 3609,
 'bachelorstag': 3610,
 'bachman': 3611,
 'bachmann': 3612,
 'back': 3613,
 'backand': 3614,
 'backbencher': 3615,
 'backbone': 3616,
 'backbreaking': 3617,
 'backdrop': 3618,
 'backdropped': 3619,
 'backdrops': 3620,
 'backed': 3621,
 'backers': 3622,
 'backett': 3623,
 'backfire': 3624,
 'backfires': 3625,
 'backfiring': 3626,
 'background': 3627,
 'backgrounds': 3628,
 'backing': 3629,
 'backlash': 3630,
 'backlot': 3631,
 'backlots': 3632,
 'backpack': 3633,
 'backpacker': 3634,
 'backpackers': 3635,
 'backpacking': 3636,
 'backpacks': 3637,
 'backpage': 3638,
 'backpagecom': 3639,
 'backs': 3640,
 'backseat': 3641,
 'backside': 3642,
 'backstabbing': 3643,
 'backstage': 3644,
 'backstory': 3645,
 'backstreets': 3646,
 'backtrack': 3647,
 'backup': 3648,
 'backward': 3649,
 'backwards': 3650,
 'backwater': 3651,
 'backwoods': 3652,
 'backyard': 3653,
 'bacon': 3654,
 'bacteria': 

 'biggame': 4854,
 'bigger': 4855,
 'biggest': 4856,
 'biggs': 4857,
 'biggsley': 4858,
 'bighaired': 4859,
 'bighearted': 4860,
 'bigmanintown': 4861,
 'bigmoney': 4862,
 'bigot': 4863,
 'bigoted': 4864,
 'bigotry': 4865,
 'bigotschased': 4866,
 'bigscreen': 4867,
 'bigshot': 4868,
 'bigtime': 4869,
 'biguy': 4870,
 'bigwig': 4871,
 'bijenkorf': 4872,
 'biji': 4873,
 'bijli': 4874,
 'bijou': 4875,
 'bike': 4876,
 'bikel': 4877,
 'biker': 4878,
 'bikeracer': 4879,
 'bikers': 4880,
 'biking': 4881,
 'bikini': 4882,
 'bikiniclad': 4883,
 'bilal': 4884,
 'bilateral': 4885,
 'bilbao': 4886,
 'bilbo': 4887,
 'bilge': 4888,
 'bilias': 4889,
 'bilis': 4890,
 'bilk': 4891,
 'bilked': 4892,
 'bilkis': 4893,
 'bilks': 4894,
 'bill': 4895,
 'billable': 4896,
 'billar': 4897,
 'billboard': 4898,
 'billcollectors': 4899,
 'billeci': 4900,
 'billed': 4901,
 'billet': 4902,
 'billie': 4903,
 'billies': 4904,
 'billing': 4905,
 'billings': 4906,
 'billingsgate': 4907,
 'billingsley': 4908,
 'billion':

 'brute': 6195,
 'brutish': 6196,
 'brutus': 6197,
 'bruxelles': 6198,
 'bruza': 6199,
 'bruzas': 6200,
 'bruzzi': 6201,
 'bryan': 6202,
 'bryans': 6203,
 'bryant': 6204,
 'bryants': 6205,
 'bryerly': 6206,
 'bryerson': 6207,
 'bryon': 6208,
 'bs': 6209,
 'bsi': 6210,
 'bsqueda': 6211,
 'bstandards': 6212,
 'bsv': 6213,
 'btec': 6214,
 'bthory': 6215,
 'bu': 6216,
 'bub': 6217,
 'bubba': 6218,
 'bubber': 6219,
 'bubbers': 6220,
 'bubble': 6221,
 'bubblebrained': 6222,
 'bubbleheaded': 6223,
 'bubbling': 6224,
 'bubbly': 6225,
 'bubonic': 6226,
 'buccaneers': 6227,
 'buchanan': 6228,
 'bucharest': 6229,
 'buchner': 6230,
 'buck': 6231,
 'bucker': 6232,
 'bucket': 6233,
 'buckeye': 6234,
 'bucking': 6235,
 'buckingham': 6236,
 'buckle': 6237,
 'buckles': 6238,
 'buckley': 6239,
 'buckman': 6240,
 'buckmans': 6241,
 'buckminster': 6242,
 'bucknell': 6243,
 'buckner': 6244,
 'bucks': 6245,
 'buckskin': 6246,
 'bucktown': 6247,
 'bucky': 6248,
 'bucolic': 6249,
 'bucyks': 6250,
 'bud': 6251

 'changeling': 7604,
 'changes': 7605,
 'changing': 7606,
 'changs': 7607,
 'changthang': 7608,
 'chanler': 7609,
 'channel': 7610,
 'channeled': 7611,
 'channeling': 7612,
 'channels': 7613,
 'channing': 7614,
 'channon': 7615,
 'chans': 7616,
 'chant': 7617,
 'chantal': 7618,
 'chantel': 7619,
 'chanteuse': 7620,
 'chantry': 7621,
 'chants': 7622,
 'chanyat': 7623,
 'chaos': 7624,
 'chaotic': 7625,
 'chap': 7626,
 'chapa': 7627,
 'chaparral': 7628,
 'chapel': 7629,
 'chapelle': 7630,
 'chapelstrict': 7631,
 'chaperon': 7632,
 'chaperone': 7633,
 'chaperoned': 7634,
 'chaperones': 7635,
 'chaperoning': 7636,
 'chapin': 7637,
 'chaplain': 7638,
 'chaplains': 7639,
 'chaplin': 7640,
 'chaplins': 7641,
 'chapman': 7642,
 'chappy': 7643,
 'chapter': 7644,
 'chapters': 7645,
 'character': 7646,
 'characterdrive': 7647,
 'characterdriven': 7648,
 'characteristic': 7649,
 'characteristics': 7650,
 'characterized': 7651,
 'characters': 7652,
 'charactor': 7653,
 'charade': 7654,
 'charades': 

 'come': 8854,
 'comeback': 8855,
 'comebut': 8856,
 'comedian': 8857,
 'comedians': 8858,
 'comedic': 8859,
 'comedien': 8860,
 'comedienne': 8861,
 'comedies': 8862,
 'comedy': 8863,
 'comedya': 8864,
 'comedydrama': 8865,
 'comedyhorror': 8866,
 'comedysatire': 8867,
 'comedyvariety': 8868,
 'comedywithmusic': 8869,
 'comely': 8870,
 'comeons': 8871,
 'comes': 8872,
 'comet': 8873,
 'comeuppance': 8874,
 'comfort': 8875,
 'comfortable': 8876,
 'comfortably': 8877,
 'comforted': 8878,
 'comforter': 8879,
 'comforting': 8880,
 'comfortnursing': 8881,
 'comforts': 8882,
 'comfy': 8883,
 'comic': 8884,
 'comical': 8885,
 'comically': 8886,
 'comicbased': 8887,
 'comicbook': 8888,
 'comics': 8889,
 'comienza': 8890,
 'comin': 8891,
 'coming': 8892,
 'comingattractions': 8893,
 'comingofage': 8894,
 'comingofagefamily': 8895,
 'comingout': 8896,
 'comissioner': 8897,
 'comm': 8898,
 'command': 8899,
 'commandant': 8900,
 'commanded': 8901,
 'commandeered': 8902,
 'commandeers': 8903,
 'co

 'cross': 10303,
 'crossborder': 10304,
 'crossbows': 10305,
 'crossbreed': 10306,
 'crossbreeding': 10307,
 'crosscanada': 10308,
 'crosscountry': 10309,
 'crosscultural': 10310,
 'crosscutting': 10311,
 'crossdresser': 10312,
 'crossdressing': 10313,
 'crossed': 10314,
 'crosses': 10315,
 'crosseyed': 10316,
 'crossfire': 10317,
 'crossfit': 10318,
 'crossgenerational': 10319,
 'crossgenre': 10320,
 'crosshairs': 10321,
 'crossing': 10322,
 'crossings': 10323,
 'crossley': 10324,
 'crossover': 10325,
 'crosspurposes': 10326,
 'crossroads': 10327,
 'crosssection': 10328,
 'crossstate': 10329,
 'crossstitching': 10330,
 'crosstown': 10331,
 'crossword': 10332,
 'crotchety': 10333,
 'crow': 10334,
 'crowbar': 10335,
 'crowbarwaving': 10336,
 'crowd': 10337,
 'crowded': 10338,
 'crowder': 10339,
 'crowds': 10340,
 'crowley': 10341,
 'crowleys': 10342,
 'crown': 10343,
 'crowned': 10344,
 'crowning': 10345,
 'crowns': 10346,
 'crows': 10347,
 'crowthers': 10348,
 'croydon': 10349,
 'croyt

 'desai': 11603,
 'desaix': 11604,
 'desantis': 11605,
 'desantisoz': 11606,
 'desarrolla': 11607,
 'descend': 11608,
 'descendant': 11609,
 'descendants': 11610,
 'descended': 11611,
 'descendent': 11612,
 'descendents': 11613,
 'descending': 11614,
 'descends': 11615,
 'descent': 11616,
 'deschanel': 11617,
 'deschler': 11618,
 'describe': 11619,
 'described': 11620,
 'describes': 11621,
 'describing': 11622,
 'description': 11623,
 'descriptions': 11624,
 'desdemona': 11625,
 'desecrates': 11626,
 'desecration': 11627,
 'desegregated': 11628,
 'desenlace': 11629,
 'desert': 11630,
 'deserted': 11631,
 'desertedshowing': 11632,
 'deserter': 11633,
 'deserters': 11634,
 'deserthoned': 11635,
 'desertion': 11636,
 'deserts': 11637,
 'deserve': 11638,
 'deserved': 11639,
 'deserves': 11640,
 'deserving': 11641,
 'desfilar': 11642,
 'desha': 11643,
 'deshay': 11644,
 'deshays': 11645,
 'deshi': 11646,
 'desi': 11647,
 'design': 11648,
 'designated': 11649,
 'designed': 11650,
 'designer'

 'drifterhitman': 12998,
 'drifters': 12999,
 'drifting': 13000,
 'drifts': 13001,
 'drill': 13002,
 'driller': 13003,
 'drilling': 13004,
 'drills': 13005,
 'drina': 13006,
 'drink': 13007,
 'drinker': 13008,
 'drinking': 13009,
 'drinkingeven': 13010,
 'drinks': 13011,
 'drip': 13012,
 'driscoll': 13013,
 'driscollthe': 13014,
 'driss': 13015,
 'drive': 13016,
 'driveby': 13017,
 'drivein': 13018,
 'driveins': 13019,
 'drivel': 13020,
 'driven': 13021,
 'driver': 13022,
 'driverless': 13023,
 'drivers': 13024,
 'driverwhile': 13025,
 'drives': 13026,
 'driveway': 13027,
 'driving': 13028,
 'drivinghot': 13029,
 'droids': 13030,
 'droll': 13031,
 'drone': 13032,
 'drones': 13033,
 'drool': 13034,
 'drooling': 13035,
 'drools': 13036,
 'drop': 13037,
 'droplaug': 13038,
 'dropout': 13039,
 'dropouts': 13040,
 'dropped': 13041,
 'droppedin': 13042,
 'dropping': 13043,
 'droppings': 13044,
 'drops': 13045,
 'drosselmeier': 13046,
 'drought': 13047,
 'drove': 13048,
 'drovers': 13049,
 'd

 'erlands': 14353,
 'ermine': 14354,
 'erna': 14355,
 'erneman': 14356,
 'ernest': 14357,
 'ernesto': 14358,
 'ernests': 14359,
 'ernie': 14360,
 'ernies': 14361,
 'ernst': 14362,
 'eros': 14363,
 'erotic': 14364,
 'erotica': 14365,
 'erotically': 14366,
 'eroticism': 14367,
 'erotico': 14368,
 'errands': 14369,
 'errant': 14370,
 'erratic': 14371,
 'erratically': 14372,
 'erraticand': 14373,
 'erring': 14374,
 'errol': 14375,
 'erroll': 14376,
 'erroneously': 14377,
 'error': 14378,
 'errors': 14379,
 'errs': 14380,
 'erskine': 14381,
 'erstwhile': 14382,
 'erudite': 14383,
 'erudition': 14384,
 'eruera': 14385,
 'erupt': 14386,
 'erupted': 14387,
 'erupting': 14388,
 'eruption': 14389,
 'eruptions': 14390,
 'erupts': 14391,
 'erwin': 14392,
 'erzhlt': 14393,
 'es': 14394,
 'esau': 14395,
 'escadrille': 14396,
 'escalate': 14397,
 'escalates': 14398,
 'escalating': 14399,
 'escapade': 14400,
 'escapades': 14401,
 'escape': 14402,
 'escaped': 14403,
 'escapee': 14404,
 'escapees': 1440

 'fiction': 15778,
 'fictional': 15779,
 'fictionalised': 15780,
 'fictionalized': 15781,
 'fictions': 15782,
 'fictitious': 15783,
 'fida': 15784,
 'fiddle': 15785,
 'fiddler': 15786,
 'fide': 15787,
 'fidel': 15788,
 'fidelity': 15789,
 'fidgety': 15790,
 'field': 15791,
 'fieldbiologists': 15792,
 'fielding': 15793,
 'fields': 15794,
 'fiend': 15795,
 'fiendish': 15796,
 'fiends': 15797,
 'fiennes': 15798,
 'fierce': 15799,
 'fiercely': 15800,
 'fiercelycatholic': 15801,
 'fierceness': 15802,
 'fiercest': 15803,
 'fierson': 15804,
 'fiersons': 15805,
 'fiery': 15806,
 'fiesta': 15807,
 'fievel': 15808,
 'fifa': 15809,
 'fifi': 15810,
 'fifis': 15811,
 'fifteen': 15812,
 'fifteenth': 15813,
 'fifteenthousand': 15814,
 'fifteenyearold': 15815,
 'fifth': 15816,
 'fifthcolumnists': 15817,
 'fifthfloor': 15818,
 'fifthgeneration': 15819,
 'fifties': 15820,
 'fifty': 15821,
 'fiftyfour': 15822,
 'fiftynine': 15823,
 'fiftyninth': 15824,
 'fiftyone': 15825,
 'fiftysomething': 15826,
 'figa

 'furtrader': 17102,
 'furtrading': 17103,
 'furuya': 17104,
 'fury': 17105,
 'fuse': 17106,
 'fusecabinet': 17107,
 'fuses': 17108,
 'fusiliers': 17109,
 'fusing': 17110,
 'fusion': 17111,
 'fuss': 17112,
 'fussbudget': 17113,
 'fussy': 17114,
 'futa': 17115,
 'futas': 17116,
 'futile': 17117,
 'futilely': 17118,
 'futility': 17119,
 'futtock': 17120,
 'future': 17121,
 'futures': 17122,
 'futurewife': 17123,
 'futurist': 17124,
 'futuristic': 17125,
 'fuzz': 17126,
 'fuzzy': 17127,
 'fuzzy22': 17128,
 'fyodor': 17129,
 'g': 17130,
 'gaar': 17131,
 'gabby': 17132,
 'gabe': 17133,
 'gabita': 17134,
 'gabitas': 17135,
 'gable': 17136,
 'gables': 17137,
 'gabor': 17138,
 'gabriel': 17139,
 'gabriella': 17140,
 'gabrielle': 17141,
 'gabriels': 17142,
 'gaby': 17143,
 'gabys': 17144,
 'gachman': 17145,
 'gacys': 17146,
 'gad': 17147,
 'gadget': 17148,
 'gadgetladen': 17149,
 'gadgetry': 17150,
 'gadgets': 17151,
 'gaelic': 17152,
 'gaffar': 17153,
 'gag': 17154,
 'gaga': 17155,
 'gagged': 

 'grudge': 18545,
 'grudging': 18546,
 'grudgingly': 18547,
 'grueling': 18548,
 'gruell': 18549,
 'gruelling': 18550,
 'gruesome': 18551,
 'gruesomely': 18552,
 'gruff': 18553,
 'grump': 18554,
 'grumpily': 18555,
 'grumps': 18556,
 'grumpy': 18557,
 'grunder': 18558,
 'grundys': 18559,
 'grunge': 18560,
 'grunt': 18561,
 'grushenko': 18562,
 'grusomely': 18563,
 'gs': 18564,
 'gt500are': 18565,
 'gta': 18566,
 'gto': 18567,
 'guadagninis': 18568,
 'guadalcanal': 18569,
 'guadalupe': 18570,
 'guajira': 18571,
 'guam': 18572,
 'guan': 18573,
 'guangcheng': 18574,
 'guarantee': 18575,
 'guaranteed': 18576,
 'guaranteeing': 18577,
 'guarantees': 18578,
 'guaranty': 18579,
 'guard': 18580,
 'guardand': 18581,
 'guarded': 18582,
 'guardian': 18583,
 'guardians': 18584,
 'guardianship': 18585,
 'guarding': 18586,
 'guardone': 18587,
 'guards': 18588,
 'guardsman': 18589,
 'guardsmen': 18590,
 'guatemala': 18591,
 'gubernatorial': 18592,
 'guddu': 18593,
 'gudrun': 18594,
 'guen': 18595,
 'g

 'hideously': 19852,
 'hideout': 19853,
 'hideouts': 19854,
 'hides': 19855,
 'hideyoshi': 19856,
 'hiding': 19857,
 'hiedra': 19858,
 'hierarchy': 19859,
 'hieroglyphs': 19860,
 'higashikawa': 19861,
 'higgins': 19862,
 'high': 19863,
 'highachieving': 19864,
 'highaltitude': 19865,
 'highbrow': 19866,
 'highclass': 19867,
 'highend': 19868,
 'highenergy': 19869,
 'higher': 19870,
 'higherrisk': 19871,
 'higherups': 19872,
 'highest': 19873,
 'highestcrime': 19874,
 'highflying': 19875,
 'highfunctioning': 19876,
 'highjacking': 19877,
 'highkicking': 19878,
 'highland': 19879,
 'highlander': 19880,
 'highlands': 19881,
 'highlevel': 19882,
 'highlife': 19883,
 'highlight': 19884,
 'highlighted': 19885,
 'highlighting': 19886,
 'highlights': 19887,
 'highliving': 19888,
 'highly': 19889,
 'highlycultivated': 19890,
 'highlyprized': 19891,
 'highlyregarded': 19892,
 'highmaintenance': 19893,
 'highminded': 19894,
 'highoctane': 19895,
 'highpaying': 19896,
 'highpitched': 19897,
 'high

 'impersonating': 21101,
 'impersonation': 21102,
 'impersonations': 21103,
 'impersonator': 21104,
 'impervious': 21105,
 'impetuous': 21106,
 'impetus': 21107,
 'impinged': 21108,
 'impish': 21109,
 'implacable': 21110,
 'implant': 21111,
 'implantation': 21112,
 'implanted': 21113,
 'implanting': 21114,
 'implants': 21115,
 'implausible': 21116,
 'implement': 21117,
 'implementation': 21118,
 'implementing': 21119,
 'implements': 21120,
 'implicate': 21121,
 'implicated': 21122,
 'implicates': 21123,
 'implicating': 21124,
 'implication': 21125,
 'implications': 21126,
 'implicitly': 21127,
 'implied': 21128,
 'implies': 21129,
 'imploded': 21130,
 'imploding': 21131,
 'implores': 21132,
 'imply': 21133,
 'impolite': 21134,
 'import': 21135,
 'importance': 21136,
 'important': 21137,
 'importantly': 21138,
 'importantthe': 21139,
 'imported': 21140,
 'importer': 21141,
 'importing': 21142,
 'imports': 21143,
 'importune': 21144,
 'impose': 21145,
 'imposed': 21146,
 'imposes': 21147

 'janie': 22444,
 'janies': 22445,
 'janine': 22446,
 'janis': 22447,
 'janitor': 22448,
 'janitors': 22449,
 'janka': 22450,
 'jankens': 22451,
 'janki': 22452,
 'janna': 22453,
 'jannetier': 22454,
 'janoski': 22455,
 'jans': 22456,
 'jansci': 22457,
 'jansen': 22458,
 'jansens': 22459,
 'janson': 22460,
 'janta': 22461,
 'january': 22462,
 'janus': 22463,
 'janvan': 22464,
 'jap': 22465,
 'japan': 22466,
 'japanese': 22467,
 'japaneseamerican': 22468,
 'japaneseheld': 22469,
 'japaneseoccupied': 22470,
 'japaneseruled': 22471,
 'japanesesupported': 22472,
 'japanesetaken': 22473,
 'japans': 22474,
 'japanse': 22475,
 'jara': 22476,
 'jaras': 22477,
 'jareckis': 22478,
 'jared': 22479,
 'jareds': 22480,
 'jariwalas': 22481,
 'jaron': 22482,
 'jarred': 22483,
 'jarret': 22484,
 'jarrett': 22485,
 'jarvis': 22486,
 'jas': 22487,
 'jasminder': 22488,
 'jasmine': 22489,
 'jasmines': 22490,
 'jason': 22491,
 'jasons': 22492,
 'jasper': 22493,
 'jaspers': 22494,
 'jass': 22495,
 'jatin': 2

 'kostas': 23851,
 'kosumi': 23852,
 'kot': 23853,
 'kota': 23854,
 'kotchov': 23855,
 'kothag': 23856,
 'kotzebue': 23857,
 'kousevoeten': 23858,
 'kovac': 23859,
 'kovacs': 23860,
 'kovcs': 23861,
 'kowalski': 23862,
 'kowelska': 23863,
 'kozara': 23864,
 'kozelsk': 23865,
 'kozyr': 23866,
 'kpop': 23867,
 'kpw': 23868,
 'krairuzan': 23869,
 'krakatau': 23870,
 'krakow': 23871,
 'kram': 23872,
 'kramer': 23873,
 'krampus': 23874,
 'kranj': 23875,
 'krankenmal': 23876,
 'kransterdam': 23877,
 'kras': 23878,
 'krash': 23879,
 'krasnaya': 23880,
 'krasnov': 23881,
 'krause': 23882,
 'kravitz': 23883,
 'kray': 23884,
 'krbesgaard': 23885,
 'kreeg': 23886,
 'krein': 23887,
 'kremer': 23888,
 'kremlin': 23889,
 'krewl': 23890,
 'krillet': 23891,
 'kripac': 23892,
 'kris': 23893,
 'krishna': 23894,
 'krishnan': 23895,
 'krista': 23896,
 'kristallnacht': 23897,
 'kristen': 23898,
 'kristens': 23899,
 'kristian': 23900,
 'kristie': 23901,
 'kristin': 23902,
 'kristoff': 23903,
 'kristy': 2390

 'lolly': 25232,
 'loma': 25233,
 'lomas': 25234,
 'lomax': 25235,
 'lombard': 25236,
 'lombardi': 25237,
 'lombok': 25238,
 'lomond': 25239,
 'loms': 25240,
 'lon': 25241,
 'lonavala': 25242,
 'londeners': 25243,
 'london': 25244,
 'londonbased': 25245,
 'londoner': 25246,
 'londoners': 25247,
 'londons': 25248,
 'lone': 25249,
 'loneliness': 25250,
 'lonely': 25251,
 'loner': 25252,
 'loners': 25253,
 'lonesome': 25254,
 'lonetta': 25255,
 'long': 25256,
 'longabandoned': 25257,
 'longago': 25258,
 'longas': 25259,
 'longawaited': 25260,
 'longboat': 25261,
 'longboiling': 25262,
 'longdead': 25263,
 'longdistance': 25264,
 'longdisused': 25265,
 'longed': 25266,
 'longended': 25267,
 'longer': 25268,
 'longerserving': 25269,
 'longest': 25270,
 'longestranged': 25271,
 'longevity': 25272,
 'longewala': 25273,
 'longfellow': 25274,
 'longforgotten': 25275,
 'longhammer': 25276,
 'longhe': 25277,
 'longheld': 25278,
 'longhidden': 25279,
 'longhorn': 25280,
 'longhorns': 25281,
 'long

 'marya': 26592,
 'maryam': 26593,
 'maryana': 26594,
 'marybeth': 26595,
 'maryland': 26596,
 'marylouise': 26597,
 'marys': 26598,
 'marzone': 26599,
 'masamba': 26600,
 'masashi': 26601,
 'mascarado': 26602,
 'mascarenhas': 26603,
 'mascot': 26604,
 'masculine': 26605,
 'masculinely': 26606,
 'masculinity': 26607,
 'masekela': 26608,
 'masen': 26609,
 'maserati': 26610,
 'masetto': 26611,
 'masettos': 26612,
 'mash': 26613,
 'masha': 26614,
 'mashas': 26615,
 'mashiko': 26616,
 'mask': 26617,
 'masked': 26618,
 'masks': 26619,
 'maso': 26620,
 'masochism': 26621,
 'masochist': 26622,
 'masochistic': 26623,
 'mason': 26624,
 'masonry': 26625,
 'masons': 26626,
 'masonthe': 26627,
 'masque': 26628,
 'masquerade': 26629,
 'masquerades': 26630,
 'masquerading': 26631,
 'mass': 26632,
 'massa': 26633,
 'massachusetts': 26634,
 'massachusettsbased': 26635,
 'massacre': 26636,
 'massacred': 26637,
 'massage': 26638,
 'masse': 26639,
 'masses': 26640,
 'masseur': 26641,
 'massimo': 26642,
 

 'mildmannered': 27600,
 'mildred': 27601,
 'mildreds': 27602,
 'mile': 27603,
 'miles': 27604,
 'milestone': 27605,
 'miley': 27606,
 'milford': 27607,
 'milgram': 27608,
 'milgrams': 27609,
 'milhous': 27610,
 'milieu': 27611,
 'militant': 27612,
 'militantly': 27613,
 'militants': 27614,
 'militarism': 27615,
 'militarist': 27616,
 'military': 27617,
 'militarybiological': 27618,
 'militaryhospital': 27619,
 'militarys': 27620,
 'militarythemed': 27621,
 'militia': 27622,
 'militiastyle': 27623,
 'milius': 27624,
 'milk': 27625,
 'milked': 27626,
 'milkman': 27627,
 'milkshake': 27628,
 'mill': 27629,
 'milla': 27630,
 'milland': 27631,
 'millar': 27632,
 'millas': 27633,
 'millbarge': 27634,
 'millbrook': 27635,
 'millenarian': 27636,
 'millenia': 27637,
 'millenium': 27638,
 'millennia': 27639,
 'millennial': 27640,
 'millennials': 27641,
 'millennium': 27642,
 'miller': 27643,
 'millers': 27644,
 'millhausers': 27645,
 'millhouse': 27646,
 'millican': 27647,
 'millie': 27648,
 'm

 'nagrale': 28904,
 'nags': 28905,
 'naguib': 28906,
 'naharin': 28907,
 'nahum': 28908,
 'nail': 28909,
 'nailbiting': 28910,
 'nailed': 28911,
 'nailing': 28912,
 'nails': 28913,
 'naina': 28914,
 'nair': 28915,
 'nairobi': 28916,
 'naive': 28917,
 'naively': 28918,
 'naivelypositive': 28919,
 'naivet': 28920,
 'naivete': 28921,
 'naivety': 28922,
 'nak': 28923,
 'nakajo': 28924,
 'nakamura': 28925,
 'naked': 28926,
 'nakku': 28927,
 'nakoshi': 28928,
 'nakros': 28929,
 'naksan': 28930,
 'nalini': 28931,
 'nam': 28932,
 'namchulwoo': 28933,
 'name': 28934,
 'namecalling': 28935,
 'namechange': 28936,
 'named': 28937,
 'namediamond': 28938,
 'nameless': 28939,
 'namely': 28940,
 'names': 28941,
 'namesake': 28942,
 'namesakes': 28943,
 'namib': 28944,
 'naming': 28945,
 'namok': 28946,
 'namyi': 28947,
 'nan': 28948,
 'nana': 28949,
 'nance': 28950,
 'nancy': 28951,
 'nancys': 28952,
 'nandakumar': 28953,
 'nandita': 28954,
 'nandu': 28955,
 'nang': 28956,
 'nanking': 28957,
 'nanni':

 'offender': 30253,
 'offenders': 30254,
 'offending': 30255,
 'offends': 30256,
 'offense': 30257,
 'offenses': 30258,
 'offensive': 30259,
 'offensives': 30260,
 'offer': 30261,
 'offered': 30262,
 'offering': 30263,
 'offerings': 30264,
 'offers': 30265,
 'offfield': 30266,
 'offguard': 30267,
 'office': 30268,
 'officer': 30269,
 'officercapt': 30270,
 'officercaptain': 30271,
 'officercolonel': 30272,
 'officeronly': 30273,
 'officers': 30274,
 'officerturnedgunrunner': 30275,
 'offices': 30276,
 'officesingertypist': 30277,
 'official': 30278,
 'officialinsurance': 30279,
 'officially': 30280,
 'officials': 30281,
 'officiate': 30282,
 'officious': 30283,
 'offing': 30284,
 'offlimits': 30285,
 'offline': 30286,
 'offroad': 30287,
 'offscreen': 30288,
 'offshore': 30289,
 'offspring': 30290,
 'offstage': 30291,
 'offthecuff': 30292,
 'offthegrid': 30293,
 'offthemap': 30294,
 'offworld': 30295,
 'ofhong': 30296,
 'oflair': 30297,
 'oft': 30298,
 'often': 30299,
 'oftenabsent': 30

 'passer': 31512,
 'passerby': 31513,
 'passersby': 31514,
 'passes': 31515,
 'passing': 31516,
 'passion': 31517,
 'passionate': 31518,
 'passionately': 31519,
 'passionless': 31520,
 'passionof': 31521,
 'passions': 31522,
 'passive': 31523,
 'passon': 31524,
 'passos': 31525,
 'passport': 31526,
 'passports': 31527,
 'password': 31528,
 'passwords': 31529,
 'past': 31530,
 'pasta': 31531,
 'paste': 31532,
 'pasted': 31533,
 'pastellfarbenen': 31534,
 'pastiche': 31535,
 'pastime': 31536,
 'pastlife': 31537,
 'pastor': 31538,
 'pastoral': 31539,
 'pastors': 31540,
 'pasts': 31541,
 'pasttheirprime': 31542,
 'pastures': 31543,
 'pat': 31544,
 'patagonia': 31545,
 'patagonic': 31546,
 'patan': 31547,
 'patch': 31548,
 'patched': 31549,
 'patches': 31550,
 'patchup': 31551,
 'patchwork': 31552,
 'pateints': 31553,
 'patelarmammootty': 31554,
 'paten': 31555,
 'patens': 31556,
 'patent': 31557,
 'patents': 31558,
 'paterfamilias': 31559,
 'paternal': 31560,
 'paternalistic': 31561,
 'pat

 'pomeranian': 32839,
 'pomp': 32840,
 'pompeii': 32841,
 'pompeiis': 32842,
 'pompeo': 32843,
 'pomposity': 32844,
 'pompous': 32845,
 'pon': 32846,
 'ponce': 32847,
 'pond': 32848,
 'ponder': 32849,
 'pondering': 32850,
 'ponders': 32851,
 'pondok': 32852,
 'pong': 32853,
 'ponies': 32854,
 'ponorogo': 32855,
 'pons': 32856,
 'pontiac': 32857,
 'pontian': 32858,
 'pontificates': 32859,
 'pontios': 32860,
 'pony': 32861,
 'ponyo': 32862,
 'ponyos': 32863,
 'pooch': 32864,
 'pooda': 32865,
 'poodle': 32866,
 'pooh': 32867,
 'poohbear': 32868,
 'pooja': 32869,
 'poojas': 32870,
 'pool': 32871,
 'poole': 32872,
 'poolroom': 32873,
 'pools': 32874,
 'poolside': 32875,
 'poomadanthapuram': 32876,
 'poor': 32877,
 'poorend': 32878,
 'poorer': 32879,
 'poorest': 32880,
 'poorhouse': 32881,
 'poorly': 32882,
 'poorlyarmed': 32883,
 'pootie': 32884,
 'pop': 32885,
 'popculture': 32886,
 'pope': 32887,
 'popes': 32888,
 'popeye': 32889,
 'poplar': 32890,
 'popov': 32891,
 'popper': 32892,
 'pop

 'purses': 34099,
 'pursuant': 34100,
 'pursue': 34101,
 'pursued': 34102,
 'pursuer': 34103,
 'pursuers': 34104,
 'pursues': 34105,
 'pursuing': 34106,
 'pursuinghusbands': 34107,
 'pursuit': 34108,
 'pursuits': 34109,
 'purviance': 34110,
 'purvis': 34111,
 'purviss': 34112,
 'push': 34113,
 'pushed': 34114,
 'pusher': 34115,
 'pushers': 34116,
 'pushes': 34117,
 'pushing': 34118,
 'pushmepullyu': 34119,
 'pushp': 34120,
 'pushy': 34121,
 'puss': 34122,
 'pussy': 34123,
 'pussycat': 34124,
 'pussycats': 34125,
 'put': 34126,
 'putative': 34127,
 'puthiayara': 34128,
 'putonghua': 34129,
 'puts': 34130,
 'putsi': 34131,
 'putt': 34132,
 'puttering': 34133,
 'putterman': 34134,
 'puttermans': 34135,
 'putting': 34136,
 'puttiputti': 34137,
 'puyi': 34138,
 'puyol': 34139,
 'puzzle': 34140,
 'puzzled': 34141,
 'puzzles': 34142,
 'puzzling': 34143,
 'pvc': 34144,
 'pvt': 34145,
 'pyare': 34146,
 'pyeongryung': 34147,
 'pygmalion': 34148,
 'pygmies': 34149,
 'pygmy': 34150,
 'pyiotr': 341

 'renaissance': 35514,
 'renaissanceera': 35515,
 'renaldo': 35516,
 'rename': 35517,
 'renamed': 35518,
 'renames': 35519,
 'renard': 35520,
 'renata': 35521,
 'renato': 35522,
 'renault': 35523,
 'render': 35524,
 'rendered': 35525,
 'rendering': 35526,
 'renders': 35527,
 'rendevous': 35528,
 'rendezvous': 35529,
 'rending': 35530,
 'rendition': 35531,
 'renditions': 35532,
 'rene': 35533,
 'renee': 35534,
 'renees': 35535,
 'renegade': 35536,
 'renegades': 35537,
 'reneged': 35538,
 'reneging': 35539,
 'renew': 35540,
 'renewal': 35541,
 'renewed': 35542,
 'renewing': 35543,
 'renews': 35544,
 'renfield': 35545,
 'renford': 35546,
 'renfrew': 35547,
 'renfro': 35548,
 'renick': 35549,
 'renner': 35550,
 'rennie': 35551,
 'renny': 35552,
 'reno': 35553,
 'renounce': 35554,
 'renounces': 35555,
 'renovate': 35556,
 'renovated': 35557,
 'renovating': 35558,
 'renovation': 35559,
 'renovations': 35560,
 'renowed': 35561,
 'renown': 35562,
 'renowned': 35563,
 'rens': 35564,
 'rensing':

 'rudd': 36753,
 'rudderless': 36754,
 'rude': 36755,
 'rudeboy': 36756,
 'rudely': 36757,
 'rudenbergs': 36758,
 'rudeness': 36759,
 'rudgate': 36760,
 'rudi': 36761,
 'rudimentary': 36762,
 'rudis': 36763,
 'rudo': 36764,
 'rudolf': 36765,
 'rudolph': 36766,
 'rudra': 36767,
 'rudy': 36768,
 'rudyard': 36769,
 'rudys': 36770,
 'rufe': 36771,
 'ruffalo': 36772,
 'ruffalos': 36773,
 'ruffian': 36774,
 'ruffians': 36775,
 'ruffles': 36776,
 'rufo': 36777,
 'rufus': 36778,
 'rug': 36779,
 'rugby': 36780,
 'rugged': 36781,
 'ruggedly': 36782,
 'rugger': 36783,
 'ruggiero': 36784,
 'ruggles': 36785,
 'ruhl': 36786,
 'ruhlin': 36787,
 'ruin': 36788,
 'ruined': 36789,
 'ruining': 36790,
 'ruinous': 36791,
 'ruins': 36792,
 'ruiz': 36793,
 'rukh': 36794,
 'rukhsana': 36795,
 'rukmini': 36796,
 'rukminis': 36797,
 'ruksar': 36798,
 'rule': 36799,
 'rulebending': 36800,
 'ruled': 36801,
 'ruleeveryone': 36802,
 'ruler': 36803,
 'rulers': 36804,
 'rulership': 36805,
 'rulerwho': 36806,
 'rules':

 'semicomical': 38078,
 'semicorrupt': 38079,
 'semicriminal': 38080,
 'semidocumentary': 38081,
 'semifiction': 38082,
 'semifictional': 38083,
 'semifictionalized': 38084,
 'semifinal': 38085,
 'semifinals': 38086,
 'semigentleman': 38087,
 'semiinvalid': 38088,
 'seminal': 38089,
 'seminar': 38090,
 'seminary': 38091,
 'seminarybound': 38092,
 'seminola': 38093,
 'seminole': 38094,
 'seminoles': 38095,
 'seminude': 38096,
 'semiretired': 38097,
 'semiretirement': 38098,
 'semiskilled': 38099,
 'semitruck': 38100,
 'semiwild': 38101,
 'semple': 38102,
 'sempler': 38103,
 'semyon': 38104,
 'sen': 38105,
 'senac': 38106,
 'senate': 38107,
 'senator': 38108,
 'senatorial': 38109,
 'senators': 38110,
 'send': 38111,
 'sendak': 38112,
 'sender': 38113,
 'sendero': 38114,
 'sending': 38115,
 'sendler': 38116,
 'sendlers': 38117,
 'sends': 38118,
 'sendup': 38119,
 'senegal': 38120,
 'senegalese': 38121,
 'senff': 38122,
 'senile': 38123,
 'senility': 38124,
 'seninto': 38125,
 'senior': 38

 'slant': 39348,
 'slap': 39349,
 'slapjack': 39350,
 'slapped': 39351,
 'slappy': 39352,
 'slaps': 39353,
 'slapsie': 39354,
 'slapstick': 39355,
 'slash': 39356,
 'slashed': 39357,
 'slasher': 39358,
 'slashermovie': 39359,
 'slashers': 39360,
 'slashes': 39361,
 'slashing': 39362,
 'slate': 39363,
 'slated': 39364,
 'slater': 39365,
 'slattern': 39366,
 'slaughter': 39367,
 'slaughtered': 39368,
 'slaughterhouse': 39369,
 'slaughtering': 39370,
 'slaughters': 39371,
 'slausen': 39372,
 'slausens': 39373,
 'slave': 39374,
 'slavelike': 39375,
 'slaver': 39376,
 'slavers': 39377,
 'slavertraders': 39378,
 'slavery': 39379,
 'slaves': 39380,
 'slavesat': 39381,
 'slaveto': 39382,
 'slaveturnedgladiator': 39383,
 'slaving': 39384,
 'slavishly': 39385,
 'slavs': 39386,
 'slay': 39387,
 'slayer': 39388,
 'slaying': 39389,
 'slayings': 39390,
 'slays': 39391,
 'slayton': 39392,
 'slaytons': 39393,
 'sleaze': 39394,
 'sleazier': 39395,
 'sleaziest': 39396,
 'sleazy': 39397,
 'sled': 39398,


 'stashes': 40625,
 'stasio': 40626,
 'stasios': 40627,
 'state': 40628,
 'stated': 40629,
 'statehood': 40630,
 'stately': 40631,
 'statemandated': 40632,
 'statement': 40633,
 'statements': 40634,
 'stateoftheart': 40635,
 'stateroom': 40636,
 'states': 40637,
 'statesbased': 40638,
 'stateside': 40639,
 'statesman': 40640,
 'statess': 40641,
 'statewide': 40642,
 'static': 40643,
 'stating': 40644,
 'station': 40645,
 'stationed': 40646,
 'stationing': 40647,
 'stationmaster': 40648,
 'stations': 40649,
 'stationsand': 40650,
 'statistical': 40651,
 'statistically': 40652,
 'statistics': 40653,
 'stats': 40654,
 'statue': 40655,
 'statues': 40656,
 'statuesque': 40657,
 'statuette': 40658,
 'stature': 40659,
 'status': 40660,
 'statusconscious': 40661,
 'statusseeking': 40662,
 'statutes': 40663,
 'statutory': 40664,
 'staub': 40665,
 'stauffenberg': 40666,
 'staunch': 40667,
 'staunton': 40668,
 'stavanger': 40669,
 'stave': 40670,
 'stay': 40671,
 'stayathome': 40672,
 'stayed': 4

 'syndrome': 41977,
 'synesthesia': 41978,
 'synonymous': 41979,
 'synopsis': 41980,
 'synthesizers': 41981,
 'synthetic': 41982,
 'syphilis': 41983,
 'syphon': 41984,
 'syracuse': 41985,
 'syreena': 41986,
 'syria': 41987,
 'syrian': 41988,
 'syriana': 41989,
 'syrtis': 41990,
 'system': 41991,
 'systematic': 41992,
 'systematically': 41993,
 'systems': 41994,
 'szenes': 41995,
 'szeto': 41996,
 't': 41997,
 'ta': 41998,
 'taasha': 41999,
 'taashaafter': 42000,
 'tabacco': 42001,
 'tabak': 42002,
 'tabernacle': 42003,
 'tablas': 42004,
 'table': 42005,
 'tableau': 42006,
 'tableaux': 42007,
 'tables': 42008,
 'tablets': 42009,
 'tabloid': 42010,
 'taboo': 42011,
 'taboos': 42012,
 'tabrak': 42013,
 'tabs': 42014,
 'tacit': 42015,
 'taciturn': 42016,
 'tackett': 42017,
 'tackle': 42018,
 'tackled': 42019,
 'tackles': 42020,
 'tackling': 42021,
 'tacky': 42022,
 'taco': 42023,
 'tacos': 42024,
 'tactful': 42025,
 'tactic': 42026,
 'tactical': 42027,
 'tactician': 42028,
 'tactics': 4202

 'togetherness': 43254,
 'togetherwhen': 43255,
 'toggle': 43256,
 'togrul': 43257,
 'togruls': 43258,
 'toi': 43259,
 'toil': 43260,
 'toiler': 43261,
 'toilet': 43262,
 'toilets': 43263,
 'toiling': 43264,
 'toils': 43265,
 'toinette': 43266,
 'tojo': 43267,
 'token': 43268,
 'tokens': 43269,
 'tokugawa': 43270,
 'tokyo': 43271,
 'tokyos': 43272,
 'told': 43273,
 'toledano': 43274,
 'toler': 43275,
 'tolerance': 43276,
 'tolerant': 43277,
 'tolerate': 43278,
 'tolerated': 43279,
 'tolerates': 43280,
 'tolerating': 43281,
 'toles': 43282,
 'toll': 43283,
 'tollbooth': 43284,
 'tollgate': 43285,
 'tolliver': 43286,
 'tolls': 43287,
 'tolly': 43288,
 'tolson': 43289,
 'tolstoys': 43290,
 'tom': 43291,
 'tomahawk': 43292,
 'tomar': 43293,
 'tomars': 43294,
 'tomas': 43295,
 'tomasina': 43296,
 'tomasis': 43297,
 'tomasso': 43298,
 'tomato': 43299,
 'tomatoes': 43300,
 'tomb': 43301,
 'tomboy': 43302,
 'tomboyish': 43303,
 'tomboys': 43304,
 'tombs': 43305,
 'tombstone': 43306,
 'tomcatti

 'underneath': 44573,
 'underpaid': 44574,
 'underperforming': 44575,
 'underpinnings': 44576,
 'underprivileged': 44577,
 'underproduce': 44578,
 'underproduction': 44579,
 'underrated': 44580,
 'underrecognized': 44581,
 'underresourced': 44582,
 'undersea': 44583,
 'underside': 44584,
 'undersized': 44585,
 'understand': 44586,
 'understandably': 44587,
 'understanding': 44588,
 'understandingand': 44589,
 'understandingly': 44590,
 'understandings': 44591,
 'understands': 44592,
 'understood': 44593,
 'understudies': 44594,
 'understudy': 44595,
 'undertake': 44596,
 'undertaken': 44597,
 'undertaker': 44598,
 'undertakes': 44599,
 'undertaking': 44600,
 'undertones': 44601,
 'undertook': 44602,
 'underwater': 44603,
 'underway': 44604,
 'underwear': 44605,
 'underwood': 44606,
 'underworld': 44607,
 'underworlds': 44608,
 'underwrite': 44609,
 'underwrites': 44610,
 'undeserved': 44611,
 'undeservedly': 44612,
 'undesirable': 44613,
 'undesirables': 44614,
 'undesired': 44615,
 'u

 'vital': 45861,
 'vitali': 45862,
 'vitalinformation': 45863,
 'vitaliy': 45864,
 'vitally': 45865,
 'vitamin': 45866,
 'vitamins': 45867,
 'vitello': 45868,
 'vitiating': 45869,
 'vitka': 45870,
 'vito': 45871,
 'vitrine': 45872,
 'vitti': 45873,
 'vittore': 45874,
 'vittoria': 45875,
 'vittorio': 45876,
 'viv': 45877,
 'viva': 45878,
 'vivacious': 45879,
 'vivanne': 45880,
 'vivannes': 45881,
 'vive': 45882,
 'vivian': 45883,
 'vivians': 45884,
 'vivid': 45885,
 'vividly': 45886,
 'vivisect': 45887,
 'vivre': 45888,
 'vixen': 45889,
 'vixens': 45890,
 'vizcano': 45891,
 'vizier': 45892,
 'vjing': 45893,
 'vlad': 45894,
 'vladimir': 45895,
 'vlado': 45896,
 'vlads': 45897,
 'vlissingen': 45898,
 'vllig': 45899,
 'vlogging': 45900,
 'vnus': 45901,
 'vocabulary': 45902,
 'vocal': 45903,
 'vocalist': 45904,
 'vocation': 45905,
 'vocational': 45906,
 'vocationally': 45907,
 'voce': 45908,
 'vociferous': 45909,
 'vodka': 45910,
 'voga': 45911,
 'vogan': 45912,
 'vogel': 45913,
 'vogue': 4

 'wonderbra': 47122,
 'wonderbut': 47123,
 'wonderful': 47124,
 'wonderfully': 47125,
 'wondering': 47126,
 'wonderland': 47127,
 'wonders': 47128,
 'wonderwoman': 47129,
 'wonderworld': 47130,
 'wondrous': 47131,
 'wong': 47132,
 'wont': 47133,
 'woo': 47134,
 'wood': 47135,
 'woodall': 47136,
 'woodalls': 47137,
 'woodard': 47138,
 'woodbine': 47139,
 'woodboring': 47140,
 'woodbury': 47141,
 'woodcutters': 47142,
 'wooded': 47143,
 'wooden': 47144,
 'woodfield': 47145,
 'woodfoot': 47146,
 'woodfox': 47147,
 'woodhawk': 47148,
 'woodjiminy': 47149,
 'woodland': 47150,
 'woodlands': 47151,
 'woodlawn': 47152,
 'woodman': 47153,
 'woodrow': 47154,
 'woodruf': 47155,
 'woods': 47156,
 'woodsboro': 47157,
 'woodsman': 47158,
 'woodson': 47159,
 'woodstock': 47160,
 'woodward': 47161,
 'woodwards': 47162,
 'woody': 47163,
 'woodys': 47164,
 'wooed': 47165,
 'woofer': 47166,
 'woog': 47167,
 'wooing': 47168,
 'wool': 47169,
 'woolf': 47170,
 'woolfs': 47171,
 'woolly': 47172,
 'woolrich':

In [64]:
# vocabulary for targets
t_vocab = list(set(sum(targets, [])))
t_vocab.sort()
t_vocab = ['<pad>', '<bos>', '<eos>'] + t_vocab
target2idx = {word : idx for idx, word in enumerate(t_vocab)}
idx2target = {idx : word for idx, word in enumerate(t_vocab)}

pprint(target2idx)

{'.': 3,
 '..': 4,
 '...': 5,
 '....': 6,
 '.....': 7,
 '......': 8,
 '...a': 9,
 '...all': 10,
 '...and': 11,
 '...before': 12,
 '...being': 13,
 '...bring': 14,
 '...burning': 15,
 '...but': 16,
 '...cause': 17,
 '...each': 18,
 '...even': 19,
 '...fall': 20,
 '...from': 21,
 '...he': 22,
 '...hers': 23,
 '...hes': 24,
 '...high': 25,
 '...his': 26,
 '...i': 27,
 '...ill': 28,
 '...in': 29,
 '...is': 30,
 '...it': 31,
 '...its': 32,
 '...just': 33,
 '...justice': 34,
 '...lies': 35,
 '...meets': 36,
 '...men': 37,
 '...much': 38,
 '...no': 39,
 '...now': 40,
 '...or': 41,
 '...ora': 42,
 '...playing': 43,
 '...she': 44,
 '...slavery': 45,
 '...so': 46,
 '...starts': 47,
 '...street': 48,
 '...suddenly': 49,
 '...that': 50,
 '...the': 51,
 '...then': 52,
 '...there': 53,
 '...they': 54,
 '...theyre': 55,
 '...to': 56,
 '...together': 57,
 '...wait': 58,
 '...welcome': 59,
 '...what': 60,
 '...when': 61,
 '...with': 62,
 '...writing': 63,
 '...your': 64,
 '..told': 65,
 '.45caliber': 6

 'baseball': 1250,
 'baseball.': 1251,
 'based': 1252,
 'baseret': 1253,
 'bashes': 1254,
 'basketball': 1255,
 'bastard': 1256,
 'bastards...or': 1257,
 'batalla': 1258,
 'batavia': 1259,
 'bath': 1260,
 'baths': 1261,
 'batman': 1262,
 'battery': 1263,
 'battle': 1264,
 'battle.': 1265,
 'battle...': 1266,
 'battle...savage': 1267,
 'battled': 1268,
 'battlefield': 1269,
 'battlefield.': 1270,
 'battleground': 1271,
 'battles': 1272,
 'battlin': 1273,
 'battling': 1274,
 'baumer': 1275,
 'baums': 1276,
 'bawdy': 1277,
 'bax.': 1278,
 'baxter': 1279,
 'bayous': 1280,
 'bazooka': 1281,
 'bbc.': 1282,
 'bc': 1283,
 'be': 1284,
 'be.': 1285,
 'be...': 1286,
 'be...an': 1287,
 'be...thats': 1288,
 'beach': 1289,
 'beaches...': 1290,
 'beachheadall': 1291,
 'beachs': 1292,
 'beacon': 1293,
 'beans': 1294,
 'beanstalk': 1295,
 'bear': 1296,
 'bear...and': 1297,
 'bearing.': 1298,
 'bears': 1299,
 'bears.': 1300,
 'beast': 1301,
 'beast...': 1302,
 'beasts': 1303,
 'beasts...': 1304,
 'beast

 'clouded': 2583,
 'clouds': 2584,
 'clouds...flying': 2585,
 'clouseau': 2586,
 'clouzot': 2587,
 'clown': 2588,
 'clown.': 2589,
 'clowning': 2590,
 'clownyard': 2591,
 'club': 2592,
 'clubs...has': 2593,
 'clue': 2594,
 'clue.': 2595,
 'clues': 2596,
 'clyde': 2597,
 'coach': 2598,
 'coalition': 2599,
 'coasttocoast.': 2600,
 'cobra': 2601,
 'cobwebbed': 2602,
 'coca': 2603,
 'cockade': 2604,
 'cockeyed': 2605,
 'cocktail': 2606,
 'cocoanuts.': 2607,
 'cocteaus': 2608,
 'code': 2609,
 'code.': 2610,
 'codes': 2611,
 'codger': 2612,
 'coed': 2613,
 'coffee': 2614,
 'coffin': 2615,
 'coffin.': 2616,
 'coffinmate': 2617,
 'coffytized': 2618,
 'cohen': 2619,
 'coincidence': 2620,
 'col.': 2621,
 'colbert': 2622,
 'cold': 2623,
 'cold.': 2624,
 'coldblooded': 2625,
 'colder': 2626,
 'coldest': 2627,
 'cole': 2628,
 'collaborate': 2629,
 'collaboration': 2630,
 'collaboration.': 2631,
 'collaborative': 2632,
 'collection': 2633,
 'collective': 2634,
 'collector': 2635,
 'collects': 2636,


 'dramatization': 3833,
 'dramawith': 3834,
 'drank': 3835,
 'drape': 3836,
 'drauf': 3837,
 'draw': 3838,
 'drawn': 3839,
 'dread': 3840,
 'dreaded': 3841,
 'dreadedwisdoms': 3842,
 'dream': 3843,
 'dream.': 3844,
 'dream...': 3845,
 'dreamboat': 3846,
 'dreamed': 3847,
 'dreamer': 3848,
 'dreamers': 3849,
 'dreamers.': 3850,
 'dreaming': 3851,
 'dreaming...': 3852,
 'dreams': 3853,
 'dreams.': 3854,
 'dreams...': 3855,
 'dreamsnightmares...': 3856,
 'drer': 3857,
 'dress': 3858,
 'drew': 3859,
 'drifter...': 3860,
 'drifters': 3861,
 'drill.': 3862,
 'drills': 3863,
 'drink': 3864,
 'drinks': 3865,
 'drinks.': 3866,
 'dripping': 3867,
 'drive': 3868,
 'driven': 3869,
 'driver...wild': 3870,
 'drivers': 3871,
 'drives': 3872,
 'drives...': 3873,
 'driving': 3874,
 'drmmen': 3875,
 'droga.': 3876,
 'drogen.': 3877,
 'drogue': 3878,
 'drone': 3879,
 'drool': 3880,
 'drop': 3881,
 'dropped': 3882,
 'drops': 3883,
 'drove': 3884,
 'drowning': 3885,
 'drug': 3886,
 'drugs': 3887,
 'drugs.'

 'fully': 5104,
 'fun': 5105,
 'fun.': 5106,
 'fun...': 5107,
 'fun...done': 5108,
 'fun...on': 5109,
 'fund': 5110,
 'funeral.': 5111,
 'funfilled': 5112,
 'funky': 5113,
 'funloving': 5114,
 'funnier': 5115,
 'funniest': 5116,
 'funnsong': 5117,
 'funny': 5118,
 'funny.': 5119,
 'funny...': 5120,
 'funnybone': 5121,
 'funnyin': 5122,
 'funpacked': 5123,
 'funromp': 5124,
 'funster': 5125,
 'fuonly': 5126,
 'furious': 5127,
 'furlough': 5128,
 'furnished': 5129,
 'further': 5130,
 'further.': 5131,
 'fury': 5132,
 'fury...': 5133,
 'fury...and': 5134,
 'fury...like': 5135,
 'fury...of': 5136,
 'fuse': 5137,
 'fused': 5138,
 'futile': 5139,
 'future': 5140,
 'future.': 5141,
 'future...': 5142,
 'future...whether': 5143,
 'futures': 5144,
 'fuzz': 5145,
 'fuzzy': 5146,
 'fyra': 5147,
 'g.i.': 5148,
 'g.i.s': 5149,
 'g.w.': 5150,
 'ga': 5151,
 'gabin': 5152,
 'gable': 5153,
 'gags': 5154,
 'gaiety': 5155,
 'gaiety...jammed': 5156,
 'gaiety...moving': 5157,
 'gaiety12': 5158,
 'gain': 51

 'hunts': 6362,
 'hunts.': 6363,
 'huppert': 6364,
 'hur': 6365,
 'hurled': 6366,
 'hurricane': 6367,
 'hurry.': 6368,
 'hurt': 6369,
 'hurt...': 6370,
 'hurting': 6371,
 'hurtle': 6372,
 'hurts': 6373,
 'hurts.': 6374,
 'husband': 6375,
 'husband.': 6376,
 'husband...but': 6377,
 'husbands': 6378,
 'husbands.': 6379,
 'husbandthe': 6380,
 'hush': 6381,
 'hussein.': 6382,
 'hussy': 6383,
 'hustle': 6384,
 'hustler': 6385,
 'hutt': 6386,
 'hvad': 6387,
 'hvilke': 6388,
 'hvn': 6389,
 'hyde': 6390,
 'hyde.': 6391,
 'hye': 6392,
 'hylda': 6393,
 'hyperactivity': 6394,
 'hyperreality': 6395,
 'hypnotized': 6396,
 'hypocrites': 6397,
 'hysterical': 6398,
 'hysterical.': 6399,
 'hysterics': 6400,
 'i': 6401,
 'ibsens': 6402,
 'ice': 6403,
 'icetravaganza': 6404,
 'icey': 6405,
 'icon': 6406,
 'icon.': 6407,
 'iconic': 6408,
 'id': 6409,
 'idea': 6410,
 'idealist...through': 6411,
 'ideals': 6412,
 'ideas': 6413,
 'ideas...all': 6414,
 'identity': 6415,
 'identity.': 6416,
 'ides': 6417,
 'id

 'loser': 7641,
 'loser.': 7642,
 'loses': 7643,
 'losing': 7644,
 'loss': 7645,
 'lost': 7646,
 'lost.': 7647,
 'lost...': 7648,
 'lot': 7649,
 'lot.': 7650,
 'lottery': 7651,
 'loud': 7652,
 'louder': 7653,
 'louie': 7654,
 'louis': 7655,
 'louisiana.': 7656,
 'lounge': 7657,
 'lousy': 7658,
 'lovable': 7659,
 'love': 7660,
 'love.': 7661,
 'love...': 7662,
 'love.....her': 7663,
 'love...all': 7664,
 'love...and': 7665,
 'love...and...': 7666,
 'love...betrayal': 7667,
 'love...but': 7668,
 'love...for': 7669,
 'love...get': 7670,
 'love...marriage...and': 7671,
 'love...marriage...give': 7672,
 'love...then': 7673,
 'love..thru': 7674,
 'loveand': 7675,
 'lovebut': 7676,
 'loved': 7677,
 'loved.': 7678,
 'lovelies': 7679,
 'loveliest': 7680,
 'loveliness': 7681,
 'lovelorn': 7682,
 'lovely': 7683,
 'lovepassionobsession': 7684,
 'lover': 7685,
 'lover.': 7686,
 'lover...the': 7687,
 'lovers': 7688,
 'lovers.': 7689,
 'lovers...': 7690,
 'lovers...the': 7691,
 'loves': 7692,
 'loves

 'obsessed': 8918,
 'obsessed.': 8919,
 'obsession': 8920,
 'obsession.': 8921,
 'obsessions': 8922,
 'obsessive.': 8923,
 'obsolete': 8924,
 'occasion.': 8925,
 'occasional': 8926,
 'occult': 8927,
 'occupation': 8928,
 'occur.': 8929,
 'ocean': 8930,
 'ocean...': 8931,
 'oceans': 8932,
 'oceanthe': 8933,
 'och': 8934,
 'oclock': 8935,
 'octane': 8936,
 'october': 8937,
 'odd': 8938,
 'oddball': 8939,
 'oddest': 8940,
 'odds': 8941,
 'odds.': 8942,
 'oddsen': 8943,
 'odiar': 8944,
 'odio.': 8945,
 'odious.': 8946,
 'odyssey': 8947,
 'odyssey...experience': 8948,
 'oefening...': 8949,
 'of': 8950,
 'of...': 8951,
 'ofem': 8952,
 'ofer': 8953,
 'off': 8954,
 'off.': 8955,
 'off...the': 8956,
 'offbeat': 8957,
 'offbeatnik': 8958,
 'offcamera': 8959,
 'offen.': 8960,
 'offend': 8961,
 'offended': 8962,
 'offer': 8963,
 'offer.': 8964,
 'offered': 8965,
 'offerets': 8966,
 'offering': 8967,
 'offguard': 8968,
 'office': 8969,
 'office.': 8970,
 'officer': 8971,
 'officers': 8972,
 'offici

 'rebeccas': 10215,
 'rebel': 10216,
 'rebel.': 10217,
 'rebellion': 10218,
 'rebellion.': 10219,
 'rebellious': 10220,
 'rebels': 10221,
 'rebelswho': 10222,
 'rebirth.': 10223,
 'reborn.': 10224,
 'reborn...': 10225,
 'rebuke': 10226,
 'received': 10227,
 'receives': 10228,
 'receiving': 10229,
 'recent': 10230,
 'recipe': 10231,
 'reckless': 10232,
 'reckless...running': 10233,
 'recklessness': 10234,
 'reckoning': 10235,
 'reclaim': 10236,
 'recommended': 10237,
 'reconcile': 10238,
 'reconciliation': 10239,
 'reconcilliation': 10240,
 'reconnect': 10241,
 'reconstruction': 10242,
 'record': 10243,
 'recorded': 10244,
 'recorder': 10245,
 'recover': 10246,
 'recovery': 10247,
 'recovery.': 10248,
 'recreation': 10249,
 'recruiter.': 10250,
 'recruits': 10251,
 'red': 10252,
 'red...white...and': 10253,
 'redej': 10254,
 'redemption': 10255,
 'redemption.': 10256,
 'redfaced': 10257,
 'redhaired': 10258,
 'redhanded': 10259,
 'redhead': 10260,
 'redheaded': 10261,
 'redhot': 10262,


 'shoes.': 11348,
 'shook': 11349,
 'shookwhen': 11350,
 'shoot': 11351,
 'shoot...and': 11352,
 'shooter': 11353,
 'shootin': 11354,
 'shooting': 11355,
 'shootingiron': 11356,
 'shootout': 11357,
 'shoots': 11358,
 'shop.': 11359,
 'shopping.': 11360,
 'shops': 11361,
 'shopworn...': 11362,
 'shores': 11363,
 'short': 11364,
 'shortages': 11365,
 'shortcuts': 11366,
 'shortly': 11367,
 'shorts': 11368,
 'shot': 11369,
 'shot.': 11370,
 'shotgun.': 11371,
 'shots': 11372,
 'shots.': 11373,
 'should': 11374,
 'should.': 11375,
 'shoulda': 11376,
 'shoulder': 11377,
 'shoulders': 11378,
 'shouldnt': 11379,
 'show': 11380,
 'show.': 11381,
 'show...she': 11382,
 'showbut': 11383,
 'showdown': 11384,
 'showed': 11385,
 'showfull': 11386,
 'showgirl...and': 11387,
 'showing': 11388,
 'showmakes': 11389,
 'shown': 11390,
 'shows': 11391,
 'shows...it': 11392,
 'showstopping': 11393,
 'shraap': 11394,
 'shriek': 11395,
 'shrimp.': 11396,
 'shrink.': 11397,
 'shrouda': 11398,
 'shtless...': 1

 'terms...': 12596,
 'terms.on': 12597,
 'terrible': 12598,
 'terrible.': 12599,
 'terrific': 12600,
 'terrific...': 12601,
 'terrific...in': 12602,
 'terrified': 12603,
 'terrified...': 12604,
 'terrify': 12605,
 'terrifying': 12606,
 'territory': 12607,
 'terro': 12608,
 'terror': 12609,
 'terror.': 12610,
 'terror...': 12611,
 'terror...but': 12612,
 'terrorism': 12613,
 'terrorist': 12614,
 'terroristdeath': 12615,
 'terrorists': 12616,
 'terrorize': 12617,
 'terrorize.': 12618,
 'terrorized': 12619,
 'terrorized..': 12620,
 'terrorizes': 12621,
 'terrors': 12622,
 'terrors...': 12623,
 'terrorto': 12624,
 'terry': 12625,
 'terrythomas': 12626,
 'tervnde...': 12627,
 'test': 12628,
 'testing': 12629,
 'testosterone': 12630,
 'tevye...': 12631,
 'texas': 12632,
 'texas.': 12633,
 'than': 12634,
 'thank': 12635,
 'thankful': 12636,
 'thanksgiving': 12637,
 'that': 12638,
 'that.': 12639,
 'thatll': 12640,
 'thats': 12641,
 'the': 12642,
 'the...': 12643,
 'theater': 12644,
 'theater.

 'widest': 14074,
 'widmark': 14075,
 'widow': 14076,
 'wie': 14077,
 'wieder': 14078,
 'wife': 14079,
 'wife.': 14080,
 'wife...': 14081,
 'wife...his': 14082,
 'wifes': 14083,
 'wifeshopping...': 14084,
 'wilbur': 14085,
 'wild': 14086,
 'wild.': 14087,
 'wild...': 14088,
 'wild.it': 14089,
 'wilde': 14090,
 'wilder': 14091,
 'wilderness': 14092,
 'wilderness.': 14093,
 'wilderness...': 14094,
 'wildest': 14095,
 'wildfire...': 14096,
 'wildlife': 14097,
 'wildly': 14098,
 'wilds': 14099,
 'wiley': 14100,
 'will': 14101,
 'will.': 14102,
 'willard': 14103,
 'willard.': 14104,
 'william': 14105,
 'williams': 14106,
 'williams.': 14107,
 'williamson': 14108,
 'williamsport': 14109,
 'willie': 14110,
 'willing': 14111,
 'willing...to': 14112,
 'willkommen': 14113,
 'willowpoint': 14114,
 'wilson': 14115,
 'wily': 14116,
 'win': 14117,
 'win...a': 14118,
 'wind': 14119,
 'windbag': 14120,
 'winding': 14121,
 'window': 14122,
 'window.': 14123,
 'windows': 14124,
 'winds': 14125,
 'windso

In [65]:
def preprocess(sequences, max_len, dic, mode = 'source'):
    assert mode in ['source', 'target'], "mode must be either 'source' or 'target'."
    
    if mode == 'source':
        # preprocessing for source (encoder)
        s_input = list(map(lambda sentence : [dic.get(token) for token in sentence], sequences))
        s_len = list(map(lambda sentence : len(sentence), s_input))
        s_input = pad_sequences(sequences = s_input, maxlen = max_len, padding = 'post', truncating = 'post')
        return s_len, s_input
    
    elif mode == 'target':
        # preprocessing for target (decoder)
        # input
        t_input = list(map(lambda sentence : ['<bos>'] + sentence + ['<eos>'], sequences))
        t_input = list(map(lambda sentence : [dic.get(token) for token in sentence], t_input))
        t_len = list(map(lambda sentence : len(sentence), t_input))
        t_input = pad_sequences(sequences = t_input, maxlen = max_len, padding = 'post', truncating = 'post')
        
        # output
        t_output = list(map(lambda sentence : sentence + ['<eos>'], sequences))
        t_output = list(map(lambda sentence : [dic.get(token) for token in sentence], t_output))
        t_output = pad_sequences(sequences = t_output, maxlen = max_len, padding = 'post', truncating = 'post')
        
        return t_len, t_input, t_output
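
The target branch above builds two sequences per sentence: a decoder input that starts with `<bos>` and a decoder output that is the same sentence shifted one step left. The following is a minimal pure-Python sketch of that shift, not the notebook's own code. The toy vocabulary and the `pad_post` helper are hypothetical stand-ins for `target2idx` and for `pad_sequences(..., padding='post', truncating='post')`.

```python
# Toy vocabulary (hypothetical); ids 0-2 follow the special-token layout used above.
toy_vocab = {'<pad>': 0, '<bos>': 1, '<eos>': 2, 'a': 3, 'b': 4, 'c': 5}


def pad_post(ids, max_len, pad_id=0):
    """Hypothetical helper: keep the first max_len ids, then right-pad with pad_id."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


sentence = ['a', 'b', 'c']
max_len = 6

# Decoder input: <bos> + sentence + <eos>
dec_in = pad_post([toy_vocab[t] for t in ['<bos>'] + sentence + ['<eos>']], max_len)
# Decoder output: sentence + <eos>
dec_out = pad_post([toy_vocab[t] for t in sentence + ['<eos>']], max_len)

print(dec_in)   # [1, 3, 4, 5, 2, 0]
print(dec_out)  # [3, 4, 5, 2, 0, 0]
```

At every position `t`, `dec_out[t]` equals `dec_in[t + 1]`, which is why a training loop can read both the next input and the next label from the `<bos>`-prefixed sequence alone.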

In [66]:
# preprocessing for source
s_max_len = 10
s_len, s_input = preprocess(sequences = sources,
                            max_len = s_max_len, dic = source2idx, mode = 'source')
print(s_len, s_input)

[109, 346, 386, 109, 292, 56, 90, 200, 463, 25, 209, 104, 270, 231, 303, 230, 343, 179, 199, 406, 250, 561, 125, 681, 61, 181, 64, 187, 111, 162, 216, 54, 190, 94, 141, 25, 105, 70, 39, 189, 224, 55, 480, 89, 455, 73, 189, 279, 154, 18, 152, 128, 173, 34, 549, 205, 184, 80, 87, 62, 133, 58, 77, 62, 68, 104, 64, 216, 17, 17, 18, 150, 173, 332, 218, 134, 188, 129, 87, 475, 25, 101, 322, 223, 204, 189, 81, 117, 50, 248, 73, 321, 255, 219, 85, 286, 22, 99, 72, 374, 30, 199, 21, 198, 124, 113, 35, 108, 60, 79, 182, 280, 125, 402, 206, 50, 177, 35, 128, 102, 117, 33, 173, 248, 330, 79, 219, 672, 97, 116, 130, 53, 230, 49, 125, 24, 208, 174, 188, 102, 228, 98, 98, 79, 24, 186, 141, 404, 59, 162, 535, 174, 62, 107, 306, 35, 157, 415, 494, 122, 41, 128, 50, 46, 183, 225, 77, 64, 40, 716, 83, 182, 183, 111, 46, 165, 61, 133, 113, 172, 296, 310, 107, 268, 164, 77, 96, 342, 132, 150, 94, 269, 137, 668, 103, 153, 89, 32, 411, 185, 206, 111, 130, 131, 63, 322, 66, 75, 126, 892, 80, 169, 155, 179, 39

In [67]:
# preprocessing for target
t_max_len = 12
t_len, t_input, t_output = preprocess(sequences = targets,
                                      max_len = t_max_len, dic = target2idx, mode = 'target')
print(t_len, t_input, t_output)

[5, 7, 9, 7, 16, 22, 9, 10, 12, 7, 7, 8, 6, 8, 5, 28, 5, 10, 5, 8, 4, 9, 21, 16, 19, 9, 6, 8, 7, 8, 14, 10, 12, 6, 9, 11, 6, 15, 10, 19, 12, 7, 6, 5, 26, 9, 14, 7, 4, 5, 8, 5, 13, 12, 10, 7, 14, 12, 9, 8, 7, 7, 21, 7, 9, 14, 6, 16, 8, 6, 19, 10, 8, 12, 6, 10, 7, 16, 14, 11, 16, 7, 8, 16, 13, 7, 7, 8, 13, 13, 8, 10, 8, 8, 19, 36, 13, 10, 10, 6, 8, 6, 14, 9, 8, 11, 12, 10, 9, 12, 6, 8, 9, 6, 9, 7, 5, 5, 9, 12, 8, 28, 14, 12, 10, 10, 14, 5, 13, 9, 5, 5, 20, 13, 10, 10, 16, 9, 4, 8, 15, 10, 15, 63, 11, 8, 17, 11, 10, 9, 12, 6, 5, 18, 5, 5, 9, 26, 8, 11, 11, 7, 14, 13, 9, 7, 7, 7, 7, 6, 3, 5, 6, 10, 9, 14, 5, 7, 12, 7, 10, 19, 13, 12, 13, 6, 7, 6, 12, 6, 9, 6, 11, 8, 29, 34, 10, 9, 14, 6, 19, 10, 8, 8, 8, 6, 20, 8, 20, 8, 15, 4, 7, 8, 7, 7, 11, 8, 15, 11, 13, 15, 8, 8, 5, 21, 12, 5, 11, 7, 23, 5, 11, 12, 11, 9, 11, 6, 8, 7, 8, 8, 11, 7, 9, 9, 10, 10, 37, 13, 13, 12, 10, 11, 11, 9, 13, 10, 23, 7, 20, 22, 17, 5, 8, 9, 7, 10, 9, 6, 23, 7, 7, 7, 23, 5, 25, 8, 34, 11, 6, 11, 15, 18, 11, 5, 11, 5

# hyper-parameters

In [68]:
# hyper-parameters
epochs = 100
batch_size = 4
learning_rate = .005
total_step = epochs / batch_size
buffer_size = 100
n_batch = buffer_size//batch_size
embedding_dim = 32
units = 128

# input
data = tf.data.Dataset.from_tensor_slices((s_len, s_input, t_len, t_input, t_output))
data = data.shuffle(buffer_size = buffer_size)
data = data.batch(batch_size = batch_size)
# s_mb_len, s_mb_input, t_mb_len, t_mb_input, t_mb_output = iterator.get_next()
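
The cell above derives `n_batch` from `buffer_size`, while the number of batches the dataset actually yields per epoch depends on the number of examples. As a rough sketch (the helper name and the example counts are hypothetical), `Dataset.batch` keeps a smaller final batch unless `drop_remainder=True` is passed, so the count can be estimated like this:

```python
import math


def batches_per_epoch(num_examples, batch_size, drop_remainder=False):
    """Hypothetical helper: number of batches one pass over the data yields."""
    if drop_remainder:
        # The incomplete final batch is discarded.
        return num_examples // batch_size
    # The incomplete final batch is kept.
    return math.ceil(num_examples / batch_size)


print(batches_per_epoch(10, 4))                       # 3
print(batches_per_epoch(10, 4, drop_remainder=True))  # 2
```

If the average epoch loss is meant to be a per-batch mean, dividing by a count computed this way from `len(s_input)` may be closer to the intent than dividing by `buffer_size // batch_size`.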

In [69]:
def gru(units):
    # If a GPU is available, use CuDNNGRU, which provides a 3x speedup over GRU;
    # the check below selects it automatically.
    if tf.test.is_gpu_available():
        return tf.keras.layers.CuDNNGRU(units, 
                                        return_sequences=True, 
                                        return_state=True, 
                                        recurrent_initializer='glorot_uniform')
    else:
        return tf.keras.layers.GRU(units, 
                                   return_sequences=True, 
                                   return_state=True, 
                                   recurrent_activation='sigmoid', 
                                   recurrent_initializer='glorot_uniform')

In [70]:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = gru(self.enc_units)
        
    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)
#         print("state: {}".format(state.shape))
#         print("output: {}".format(output.shape))
              
        return output, state
    
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))

In [71]:
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = gru(self.dec_units)
        self.fc = tf.keras.layers.Dense(vocab_size)
        
        # used for attention
        self.W1 = tf.keras.layers.Dense(self.dec_units)
        self.W2 = tf.keras.layers.Dense(self.dec_units)
        self.V = tf.keras.layers.Dense(1)
        
    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        # * `score = FC(tanh(FC(EO) + FC(H)))`
        # score shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying self.V to tanh(FC(EO) + FC(H))
        score = self.V(tf.nn.tanh(self.W1(enc_output) + self.W2(hidden_with_time_axis)))
                
        # * `attention weights = softmax(score, axis = 1)`.
        # Softmax is applied on the last axis by default, but here we apply it on
        # axis 1, since the shape of score is (batch_size, max_length, 1).
        # `max_length` is the length of our input; since we are assigning a weight
        # to each input position, softmax should be applied on that axis.
        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        
        # context_vector shape after sum == (batch_size, hidden_size)
        # * `context vector = sum(attention weights * EO, axis = 1)`. Same reason as above for choosing axis as 1.
        context_vector = attention_weights * enc_output
        context_vector = tf.reduce_sum(context_vector, axis=1)
        
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        # * `embedding output` = The input to the decoder X is passed through an embedding layer.
        x = self.embedding(x)
        
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        # * `merged vector = concat(embedding output, context vector)`
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        
        # passing the concatenated vector to the GRU
        output, state = self.gru(x)
        
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        
        # output shape == (batch_size * 1, vocab)
        x = self.fc(output)
        
        return x, state, attention_weights
        
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.dec_units))
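
The shape comments in `Decoder.call` can be checked outside TensorFlow. Below is a small NumPy sketch of the same additive attention steps (score, softmax over axis 1, weighted sum). It assumes random matrices in place of the `Dense` layers `W1`, `W2` and `V`, omits their biases, and uses made-up sizes; it is only meant to illustrate the shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_length, hidden_size, attn_units = 2, 5, 4, 3  # illustrative sizes

enc_output = rng.normal(size=(batch_size, max_length, hidden_size))
hidden = rng.normal(size=(batch_size, hidden_size))

# Stand-ins for the Dense layers W1, W2 and V (biases omitted for brevity).
W1 = rng.normal(size=(hidden_size, attn_units))
W2 = rng.normal(size=(hidden_size, attn_units))
V = rng.normal(size=(attn_units, 1))

# (batch_size, 1, hidden_size), so it broadcasts against enc_output over time
hidden_with_time_axis = hidden[:, None, :]

# score shape == (batch_size, max_length, 1)
score = np.tanh(enc_output @ W1 + hidden_with_time_axis @ W2) @ V

# Softmax over axis 1 (the time axis), with the usual max-subtraction for stability
exp_score = np.exp(score - score.max(axis=1, keepdims=True))
attention_weights = exp_score / exp_score.sum(axis=1, keepdims=True)

# context_vector shape == (batch_size, hidden_size)
context_vector = (attention_weights * enc_output).sum(axis=1)

print(score.shape, attention_weights.shape, context_vector.shape)
```

For each example, the weights along axis 1 sum to 1, so the context vector is a convex combination of the encoder outputs for that example.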

In [72]:
encoder = Encoder(len(source2idx), embedding_dim, units, batch_size)
decoder = Decoder(len(target2idx), embedding_dim, units, batch_size)

def loss_function(real, pred):
    mask = 1 - np.equal(real, 0)
    loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=real, logits=pred) * mask
    
#     print("real: {}".format(real))
#     print("pred: {}".format(pred))
#     print("mask: {}".format(mask))
#     print("loss: {}".format(tf.reduce_mean(loss_)))
    
    return tf.reduce_mean(loss_)

# creating optimizer
optimizer = tf.train.AdamOptimizer()

# creating check point (Object-based saving)
checkpoint_dir = './data_out/training_checkpoints_attention'
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt')
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                encoder=encoder,
                                decoder=decoder)

# create writer for tensorboard
summary_writer = tf.contrib.summary.create_file_writer(logdir=checkpoint_dir)
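
To see what the padding mask in `loss_function` does, here is a small NumPy sketch, assuming (as the code above does) that id `0` is the padding token. `masked_loss` mirrors the notebook's behaviour: padded positions contribute zero, but the mean is still taken over every position. `masked_loss_per_token` is a hypothetical variant, not part of the notebook, that averages over real tokens only.

```python
import numpy as np

def sparse_softmax_cross_entropy(labels, logits):
    """Per-example cross entropy from integer labels and unnormalised logits."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def masked_loss(real, pred, pad_id=0):
    """Mean over the batch, with padded positions contributing zero."""
    mask = (real != pad_id).astype(pred.dtype)
    return (sparse_softmax_cross_entropy(real, pred) * mask).mean()

def masked_loss_per_token(real, pred, pad_id=0):
    """Hypothetical variant: average over non-padding tokens only."""
    mask = (real != pad_id).astype(pred.dtype)
    per_example = sparse_softmax_cross_entropy(real, pred) * mask
    return per_example.sum() / max(mask.sum(), 1.0)

# Toy batch: 4 examples at one time step, vocabulary of 3, id 0 is padding.
real = np.array([1, 2, 0, 0])
pred = np.zeros((4, 3))  # uniform logits, so every real token costs log(3)

print(masked_loss(real, pred))            # half of log(3): two of the four positions are padding
print(masked_loss_per_token(real, pred))  # log(3)
```

With the first form, batches that contain more padding report a smaller loss for the same prediction quality, which is worth keeping in mind when comparing loss values across batches.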

In [73]:
EPOCHS = 100

for epoch in range(EPOCHS):
    
    hidden = encoder.initialize_hidden_state()
    total_loss = 0
    
    for i, (s_len, s_input, t_len, t_input, t_output) in enumerate(data):
        loss = 0
        with tf.GradientTape() as tape:
            enc_output, enc_hidden = encoder(s_input, hidden)
            
            dec_hidden = enc_hidden
            
            dec_input = tf.expand_dims([target2idx['<bos>']] * batch_size, 1)
            
            #Teacher Forcing: feeding the target as the next input
            for t in range(1, t_input.shape[1]):
                predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
                
                loss += loss_function(t_input[:, t], predictions)
            
                dec_input = tf.expand_dims(t_input[:, t], 1) #using teacher forcing
                
        batch_loss = (loss / int(t_input.shape[1]))
        
        total_loss += batch_loss
        
        variables = encoder.variables + decoder.variables
        
        gradient = tape.gradient(loss, variables)
        
        optimizer.apply_gradients(zip(gradient, variables))
        
    if epoch % 10 == 0:
        # report the loss and save a checkpoint every 10 epochs
        print('Epoch {} Loss {:.4f} Batch Loss {:.4f}'.format(epoch,
                                            total_loss / n_batch,
                                            batch_loss.numpy()))
        checkpoint.save(file_prefix = checkpoint_prefix)

Epoch 0 Loss 440.3130 Batch Loss 4.9445
Epoch 10 Loss 254.8723 Batch Loss 2.3058
Epoch 20 Loss 179.2827 Batch Loss 1.4804
Epoch 30 Loss 143.5447 Batch Loss 1.0996
Epoch 40 Loss 117.7241 Batch Loss 1.1751
Epoch 50 Loss 100.6196 Batch Loss 1.0711
Epoch 60 Loss 73.8384 Batch Loss 0.5898
Epoch 70 Loss 69.3896 Batch Loss 0.3891
Epoch 80 Loss 49.8837 Batch Loss 0.2853
Epoch 90 Loss 46.2909 Batch Loss 0.2025
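
The teacher-forcing loop in the training cell can be illustrated without a model. The sketch below lists the (decoder input, expected output) pairs the loop produces for one target sequence. It assumes the sequence starts with the `<bos>` id, and the token ids are hypothetical.

```python
def teacher_forcing_pairs(target_ids):
    """Return (decoder_input, expected_output) pairs for one target sequence.

    target_ids is assumed to start with the <bos> id, as in the training loop.
    """
    pairs = []
    dec_input = target_ids[0]              # the first decoder input is <bos>
    for t in range(1, len(target_ids)):
        pairs.append((dec_input, target_ids[t]))
        dec_input = target_ids[t]          # feed the ground truth, not the prediction
    return pairs

# Hypothetical vocabulary ids: 1 = <bos>, 2 = <eos>, 0 = <pad>.
sequence = [1, 7, 8, 9, 2, 0]
print(teacher_forcing_pairs(sequence))
```

The last pair has a padding id as its expected output. That is the case the mask in `loss_function` is meant to zero out.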


In [74]:
def evaluate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ):
    attention_plot = np.zeros((max_length_targ, max_length_inp))
    
    inputs = []
    x = sentence.split(' ')
    # Handle KeyError: skip words that are not in the source vocabulary
    for word in x:
        if word in inp_lang:
            inputs.append(inp_lang[word])
            
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs], maxlen=max_length_inp, padding='post')
    inputs = tf.convert_to_tensor(inputs)
    
    result = ''

    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang['<bos>']], 0)

    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)
        
        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()

        result += idx2target[predicted_id] + ' '

        if idx2target.get(predicted_id) == '<eos>':
            return result, sentence, attention_plot
        
        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence, attention_plot

# result, sentence, attention_plot = evaluate(sentence, encoder, decoder, source2idx, target2idx,
#                                             s_max_len, t_max_len)
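
The decoding loop in `evaluate` is a greedy search: at each step it takes the arg-max token and feeds it back as the next input. Here is a minimal sketch of that control flow. `toy_step` is a hypothetical stand-in for the decoder, and the token ids are made up for illustration.

```python
import numpy as np

def greedy_decode(step_fn, bos_id, eos_id, max_length):
    """Pick the arg-max token at each step and feed it back until <eos> or max_length."""
    token, state, output = bos_id, None, []
    for _ in range(max_length):
        logits, state = step_fn(token, state)
        token = int(np.argmax(logits))
        output.append(token)
        if token == eos_id:
            break
    return output

def toy_step(token, state):
    """Hypothetical stand-in for the decoder: always prefers token + 1."""
    vocab_size = 5
    logits = np.zeros(vocab_size)
    logits[(token + 1) % vocab_size] = 1.0
    return logits, state

print(greedy_decode(toy_step, bos_id=0, eos_id=3, max_length=10))
```

If the model never produces `<eos>`, the loop stops only at `max_length`. This would explain translations that run to the length limit and end mid-phrase.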

In [75]:
# function for plotting the attention weights
def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='viridis')
    
    fontdict = {'fontsize': 14}
    
    ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)

    plt.show()

In [76]:

def translate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ):
    result, sentence, attention_plot = evaluate(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ)
        
    print('Input: {}'.format(sentence))
    print('Predicted translation: {}'.format(result))
    
#     attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
#     plot_attention(attention_plot, sentence.split(' '), result.split(' '))

In [77]:
#restore checkpoint

checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x26195630>

In [91]:
sentence = df.movie_summary[1]


In [92]:
sentence



In [93]:

# sentence = 'tensorflow is a framework for deep learning'

translate(sentence, encoder, decoder, source2idx, target2idx, s_max_len, t_max_len)

Predicted translation: a new comedy of a film of motion picture of a film 


In [94]:
df[df['movie_tagline'].str.contains('a new comedy')]

Unnamed: 0,level_0,index,movie_genre,movie_name,movie_poster,movie_summary,movie_synopsis,movie_tagline
6049,6442,19810,[Comedy],Nearly a Nasty Accident,www.imdb.com/title/tt0055221/mediaviewer/rm233...,the raf group captain has a hard job to restra...,"[\n , \n ]",that daffadilly blonde from carry on nurse in ...
8334,85,178,"[Comedy, Romance]",She's All That,www.imdb.com/title/tt0160862/mediaviewer/rm296...,a high school jock makes a bet that he can tur...,"[\n , \n ]",a new comedy that proves theres more to attrac...


In [96]:
df.movie_tagline[8334]

'a new comedy that proves theres more to attraction than meets the eye.'

In [85]:
sentence = df.movie_summary[106].split()[:15]
sentence = ' '.join(sentence)

In [86]:
sentence

'a group of bostonbred gangsters set up shop in balmy florida during the prohibition era'

In [91]:
translate(sentence, encoder, decoder, source2idx, target2idx, s_max_len, t_max_len)

Input: a group of bostonbred gangsters set up shop in balmy florida during the prohibition era
Predicted translation: The master of the most important and powerful films of the most 


In [110]:
sentence = df.movie_summary[107].split()[:30]
sentence = ' '.join(sentence)
sentence

'an indepth examination of the ways in which the us vietnam war impacts and disrupts the lives of people in a small industrial town in pennsylvania michael steven and nick'

In [111]:
translate(sentence, encoder, decoder, source2idx, target2idx, s_max_len, t_max_len)

Input: an indepth examination of the ways in which the us vietnam war impacts and disrupts the lives of people in a small industrial town in pennsylvania michael steven and nick
Predicted translation: The Past Never Stays in Focus. <eos> 


In [112]:
df[df['movie_tagline'].str.contains('Past Never Stays in Focus')]

Unnamed: 0,movie_genre,movie_name,movie_poster,movie_summary,movie_synopsis,movie_tagline
108,"['Drama', 'Mystery', 'Romance']",The Souvenir,www.imdb.com/title/tt6920356/mediaviewer/rm353...,a young film student in the early 80s becomes ...,Julie Honor Swinton Byrne is a young film stud...,The Past Never Stays in Focus.


In [108]:
df.movie_tagline[107]

'One of the most important and powerful films of all time!'

In [109]:
df.movie_summary[154]

'selfish yuppie charlie babbitts father left a fortune to his savant brother raymond and a pittance to charlie they travel crosscountry charles sanford charlie babbit is a selfcentered los angelesbased automobile dealerhustlerbookie who is at war with his own life charlie as a young teenager used his fathers 1949 buick convertible without permission and as a result he went to jail for two days on account that his father reported it stolen it is then that charlie learns that his estranged father died and left him from his last will and testament a huge bed of roses and the car while the remainder will of 3 million goes into a trust fund to be distributed to someone charlie seemed pretty angry by this and decides to look into this matter it seems as if that someone is raymond charlies unknown brother an autistic savant who lives in a world of his own resides at the walbrook institute charlie then kidnaps raymond and decides to take him on a lust for life trip to the west coast as a threa

In [82]:
sentence = 'will persist in the gulf It’s unlikely to end until the excess capacity in real estate has been cleared'
sentence

'will persist in the gulf It’s unlikely to end until the excess capacity in real estate has been cleared'

In [83]:
translate(sentence, encoder, decoder, source2idx, target2idx, s_max_len, t_max_len)

Input: will persist in the gulf It’s unlikely to end until the excess capacity in real estate has been cleared
Predicted translation: a daughter a father go to run like paradise like paradise like 
