### Use Word2Vec to train your own model on a dataset.

1) **Optional** - Find your own dataset of documents to train your model on. You are going to need a lot of data, so it's probably not realistic to scrape data for this assignment given the time constraints that we're working under. Try to find a dataset that has more than 5,000 documents.

- If you can't find a dataset to use try this one: <https://www.kaggle.com/c/quora-question-pairs>

2) Clean/Tokenize the documents.

3) Train a Word2Vec model on the tokenized documents and explore the results using each of the following at least once:

- your_model.wv.most_similar()
- your_model.wv.similarity()
- your_model.wv.doesnt_match()

### Word2Vec with Movie Review Dataset

In [12]:
##### Your Code Here #####
import pandas as pd

df = pd.read_csv('movie_data.csv')
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | My family and I normally do not watch local mo... | 1 |
| 1 | Believe it or not, this was at one time the wo... | 0 |
| 2 | After some internet surfing, I found the "Home... | 0 |
| 3 | One of the most unheralded great works of anim... | 1 |
| 4 | It was the Sixties, and anyone with long hair ... | 0 |


In [13]:
df = df.loc[0:9999,:]
df.shape

(10000, 2)

In [14]:
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | My family and I normally do not watch local mo... | 1 |
| 1 | Believe it or not, this was at one time the wo... | 0 |
| 2 | After some internet surfing, I found the "Home... | 0 |
| 3 | One of the most unheralded great works of anim... | 1 |
| 4 | It was the Sixties, and anyone with long hair ... | 0 |


In [15]:
from nltk.tokenize import word_tokenize
from gensim.models.word2vec import Word2Vec

sentences = [word_tokenize(text) for text in df.review]

print(sentences[:5])

[['My', 'family', 'and', 'I', 'normally', 'do', 'not', 'watch', 'local', 'movies', 'for', 'the', 'simple', 'reason', 'that', 'they', 'are', 'poorly', 'made', ',', 'they', 'lack', 'the', 'depth', ',', 'and', 'just', 'not', 'worth', 'our', 'time.', '<', 'br', '/', '>', '<', 'br', '/', '>', 'The', 'trailer', 'of', '``', 'Nasaan', 'ka', 'man', "''", 'caught', 'my', 'attention', ',', 'my', 'daughter', 'in', 'law', "'s", 'and', 'daughter', "'s", 'so', 'we', 'took', 'time', 'out', 'to', 'watch', 'it', 'this', 'afternoon', '.', 'The', 'movie', 'exceeded', 'our', 'expectations', '.', 'The', 'cinematography', 'was', 'very', 'good', ',', 'the', 'story', 'beautiful', 'and', 'the', 'acting', 'awesome', '.', 'Jericho', 'Rosales', 'was', 'really', 'very', 'good', ',', 'so', "'s", 'Claudine', 'Barretto', '.', 'The', 'fact', 'that', 'I', 'despised', 'Diether', 'Ocampo', 'proves', 'he', 'was', 'effective', 'at', 'his', 'role', '.', 'I', 'have', 'never', 'been', 'this', 'touched', ',', 'moved', 'and', 'a
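The tokens printed above still include HTML line-break fragments (`'<'`, `'br'`, `'/'`, `'>'`) and standalone punctuation, all of which become vocabulary entries. A minimal cleaning sketch follows, assuming a simple regular-expression tokenizer in place of `word_tokenize`. The helper name `clean_tokenize` and the sample string are hypothetical, and lowercasing is a design choice that would merge tokens such as `Academy` and `academy`.

```python
import re

# Matches HTML tags such as <br /> so they can be replaced with a space.
TAG_RE = re.compile(r"<[^>]+>")
# Matches lowercase words or numbers, optionally followed by an apostrophe suffix ("it's").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def clean_tokenize(text):
    """Strip HTML tags, lowercase the text, and return a list of word tokens."""
    text = TAG_RE.sub(" ", text)
    return TOKEN_RE.findall(text.lower())

# Hypothetical sample resembling the raw review text.
sample = "Not worth our time.<br /><br />The trailer caught my attention."
print(clean_tokenize(sample))
# ['not', 'worth', 'our', 'time', 'the', 'trailer', 'caught', 'my', 'attention']
```

Applied to the notebook's data, this would replace the list comprehension with something like `[clean_tokenize(text) for text in df.review]`.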

In [16]:
from gensim.models.word2vec import Word2Vec

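# min_count=1 keeps every token, including words that appear only once.
# Note: `size` is the gensim 3.x parameter name; gensim 4.x renamed it to `vector_size`.
model = Word2Vec(sentences, min_count=1, size=200)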

print(model)

Word2Vec(vocab=84123, size=200, alpha=0.025)
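With `min_count=1` every distinct token is kept, which likely contributes to the large vocabulary reported above. Raising `min_count` drops rare tokens before training. The sketch below only illustrates that frequency cutoff with `collections.Counter` on a toy corpus. It is not gensim's internal code, and `prune_vocab` and `toy_sentences` are hypothetical names.

```python
from collections import Counter

def prune_vocab(sentences, min_count):
    """Return the set of tokens that occur at least min_count times in total."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

# Hypothetical toy corpus of already-tokenized sentences.
toy_sentences = [
    ["the", "movie", "was", "good"],
    ["the", "acting", "was", "awesome"],
    ["the", "movie", "was", "cheesy"],
]

print(len(prune_vocab(toy_sentences, min_count=1)))     # 7
print(sorted(prune_vocab(toy_sentences, min_count=2)))  # ['movie', 'the', 'was']
```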


In [17]:
model.wv.most_similar('Academy')

[('Award', 0.9167477488517761),
 ('Awards', 0.8830267190933228),
 ('BAFTA', 0.8776016235351562),
 ('Globe', 0.8755573034286499),
 ('nomination', 0.8745793104171753),
 ('Picture', 0.8718630075454712),
 ('Actor', 0.8670637011528015),
 ('Golden', 0.8619368076324463),
 ('Best', 0.8599739670753479),
 ('Brothers', 0.8595072031021118)]

In [18]:
model.wv.similarity('Academy', 'Picture')

0.8718629652728691
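`similarity` returns the cosine similarity of the two word vectors, which is why the value agrees with the `'Picture'` score in the `most_similar` output. A pure-Python sketch of that calculation follows. The 3-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors standing in for learned embeddings.
academy = [0.9, 0.1, 0.3]
picture = [0.8, 0.2, 0.4]
cheesiest = [-0.2, 0.9, -0.1]

# Vectors pointing in similar directions score higher.
print(cosine_similarity(academy, picture) > cosine_similarity(academy, cheesiest))  # True
```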

In [20]:
model.wv.doesnt_match("Globe Picture Best cheesiest".split())

'cheesiest'
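`doesnt_match` picks the word whose vector is least similar to the rest of the group. The sketch below approximates that idea in pure Python: normalize each vector, average them, and return the word with the lowest cosine similarity to the mean. It is an illustration under that assumption rather than gensim's exact implementation, and `odd_one_out` and the toy vectors are hypothetical.

```python
import math

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the key whose unit vector is least similar to the group's mean direction."""
    normed = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(vec[i] for vec in normed.values()) / len(normed) for i in range(dim)])
    # Dot product of two unit vectors equals their cosine similarity.
    scores = {word: sum(a * b for a, b in zip(vec, mean)) for word, vec in normed.items()}
    return min(scores, key=scores.get)

# Hypothetical toy vectors; three point the same way, one does not.
toy_vectors = {
    "Globe": [0.9, 0.1, 0.2],
    "Picture": [0.8, 0.2, 0.3],
    "Best": [0.85, 0.15, 0.1],
    "cheesiest": [-0.1, 0.9, 0.2],
}

print(odd_one_out(toy_vectors))  # cheesiest
```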

### Stretch Goals:

1) Use Doc2Vec to train a model on your dataset, then provide the model with a new document and let it find similar documents.

2) Download the pre-trained word vectors from Google. Access the pre-trained vectors via the following link: https://code.google.com/archive/p/word2vec

Load the pre-trained word vectors and train the Word2vec model

Examine the first 100 keys or words of the vocabulary

Output the vector representation for a select set of words - the words can be of your choice

Examine the similarity between words - the words can be of your choice

For example:

model.similarity('house', 'bungalow')

model.similarity('house', 'umbrella')