For this homework, make sure that you format your notbook nicely and cite all sources in the appropriate sections. Programmatically generate or embed any figures or graphs that you need.

Names: Calvin Zikakis, Sarah Schwallier

Section 1: Word2Vec paper questions
---------------------------

    1) Describe how a CBOW word embedding is generated.

    2) What is a CBOW word embedding and how is it different than a skip-gram word embedding?

    3) What is the task that the authors use to evaluate the generated word embeddings?

    4) What are PCA and t-SNE? Why are these important to the task of training and interpreting word embeddings?

Sources Cited
--------------------------
Efficient Estimation of Word Representations in Vector Space by Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean: https://arxiv.org/pdf/1301.3781.pdf  
J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006). Effects of Age and Gender on Blogging in Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.
SENTENCE ORDERING USING RECURRENT NEURAL NETWORKS by Lajanugen Logeswaran, Honglak Lee & Dragomir Radev
Speech and Language Processing
An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition by Daniel Jurafsky and James H. Martin

Section 2: Training your own word embeddings
--------------------------------

The spooky authors dataset consists of excerpts from authors of horror novels including Edgar Allan Poe, Mary Shelley, and HP Lovecraft. These excerpts each have a unique ID as well as a three letter tag describing which author wrote the excerpt. The data is split into a training set and a test set. The test set is lacking the three letter code which labels the author. 


We are using the The Blog Authorship Corpus for our secondary dataset. We decided on this dataset as it is comprised of 681,288 posts from 19,320 bloggers. We scanned through this database and pulled a small chunk of the total amount of posts. This was to reduce the overall size of the dataset to help with performance in training word embedding. This dataset will provide a data that is written with a style simular to normal human conversation simularly to the spooky authors dataset. This should help insure our generated sentences have a natural sound to them.





In [120]:
# import your libraries here
import numpy as np
import sklearn
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

import keras


from keras.layers import Dense, Activation
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.models import Sequential
from keras.utils import to_categorical




import matplotlib.pyplot as plt
import pandas as pd

import seaborn as sns
import csv


%matplotlib inline

In [121]:
# -----------------Secondary Dataset Formatting and Trimming-------------------
# This cell trims and fixes the secondary dataset to get the data in a workable style
import re
from csv import reader


def format_secondaryDataset(training_file_path, output_file, sentence_length):
    '''
    this function takes the dataset and splits it to sentences and stores those in a txt file

    training_file_path = filepath of blogposts.csv
    output_file = outputfile name (.txt)
    sentence_length = minimum length sentences to grab (value is how many words per sentences)
    '''

    
    with open(training_file_path) as file:
        sentences = file.readlines()
    #open file
    
    file.close()
    
    output = open(output_file, "w+")
    
    count = 0
    
    for line in reader(sentences):
        blog_post = line[6]
        #Line[6] contains the blog post
        
        if count >= 7:
        #skip the stuff in the beggining. It's unneeded
        
            sentences = blog_post.split(".")
            #split the post on the periods to extract individual sentences
            
            for sentence in sentences:
            #loop over our list of sentences
            
                if sentence != "":
                #some blog posts contain '...'. This creates empty sentences. We don't want empty sentences
                    
                    #lets clean the sentence of symbols and make it all lowercase
                    
                    res = re.sub(' +', ' ', sentence)
                    res.strip('\t')
                    res.strip('\n')
                    #strip tabs and newlines
                    
                    lower = res.lower()
                    #make all lower case

                    whitelist = set('abcdefghijklmnopqrstuvwxyz 1234567890')
                    no_numbers_punct = ''.join(filter(whitelist.__contains__, lower))
                    #gets rid of punctuation

                    cleaned = no_numbers_punct.split()
                    
                    black_list = ['urllink']
                    #allows us to remove all 'urlLink' occurances
                    
                    if len(cleaned) >= sentence_length:
                        #adjust 4 if you only want longer sentences
                        #we are only concerned with sentences longer than 4 words
                        output.write(" ".join([i for i in cleaned if i not in black_list]) + "\n")
                    
                    
        if count == 2000:
        #Do not need this full dataset... It's 800mb's
            break
        
        count += 1
        

format_secondaryDataset("blogtext.csv", "secondaryDataset.txt", 5)

In [146]:
# code to train your word embeddings
from csv import reader
from gensim.models import Word2Vec

EMB = 300


def convert_data(data):
#flattens data to 1D matrix
    data_flattened = []
    
    for sentences in data:
        for word in sentences:
            data_flattened.append(word)
    
    return data_flattened



def standardize_length(words,length):
    counter = 0
    output = []
    
    sentence = []
    for word in words:
        if counter < length:
            sentence.append(word)
        else:
            output.append(sentence)
            sentence = []
            counter = -1
        counter += 1
            
    return output

# -----------------Primary Dataset-------------------

def Clean_data_primary_dataset(training_file_path):
    #This function tokenizes the primary dataset and returns a cleaned version where each word making up a sentence is a nested list inside a larger list of the corpus
    output_list = []

    with open(training_file_path) as file:
        sentences = file.readlines()
    #open file
    file.close()
    
    count = 0
        
    for line in reader(sentences):
        
        if count != 0:
        #don't want first sentence

            sentence = line[1]
            
            lower = sentence.lower()
            #make all lower case

            whitelist = set('abcdefghijklmnopqrstuvwxyz 1234567890')
            no_numbers_punct = ''.join(filter(whitelist.__contains__, lower))
            #gets rid of punctuation

            cleaned = no_numbers_punct.split()
            
            output_list.append(cleaned)
            
        count += 1

    return output_list


pri_Dataset = convert_data(Clean_data_primary_dataset("train.csv"))
#imports and cleans dataset

sentences_primaryDataset = standardize_length(pri_Dataset, 40)

model_primaryDataset = Word2Vec(sentences_primaryDataset, min_count=1, size=EMB, window=4, negative=10, iter=10, workers=4)
#creates word2vec model

print(model_primaryDataset)
#model summary

words_primaryDataset = list(model_primaryDataset.wv.vocab)
print(words_primaryDataset)
#shows the vocab

print(model_primaryDataset['sentence'])
#our model

# -----------------Secondary Dataset-------------------

#secondary dataset is stored as 'secondaryDataset.txt' after processing it

def tokenize_secondary_dataset(training_file_path):
    #tokenizes the secondary dataset and returns a cleaned version where each word making up a sentence is a nested list inside a larger list of the corpus
    
    output_list = []

    with open(training_file_path) as file:
        sentences = file.readlines()
        
    #open file
    file.close()
    
    for sentence in sentences:
    #loop over sentences
    
        words = sentence.split()
        #split sentences on the words
        
        output_list.append(words)
        #append words list to final output
    
    return output_list


sec_Dataset = convert_data(tokenize_secondary_dataset("secondaryDataset.txt"))
#secondary sentences

sentences_secondaryDataset = standardize_length(sec_Dataset, 40)

model_secondaryDataset = Word2Vec(sentences_secondaryDataset, min_count=1, size=EMB, window=4, negative=10, iter=10, workers=4)
#creates word2vec model

print(model_secondaryDataset)
#model summary

words_secondaryDataset = list(model_secondaryDataset.wv.vocab)
print(words_secondaryDataset)
#shows the vocab

print(model_secondaryDataset['sentence'])
#our model


Word2Vec(vocab=25422, size=300, alpha=0.025)
[-1.87201872e-02  1.21088393e-01  1.08820856e-01  7.85525888e-02
  5.32334268e-01  2.83965953e-02 -9.00506750e-02 -1.25300288e-01
 -1.27922092e-02  1.44604579e-01  1.39038516e-02 -2.34118804e-01
 -6.76755160e-02 -1.82309762e-01  7.02055246e-02 -2.45808065e-01
 -1.80962477e-02 -3.25352103e-01 -2.50371158e-01 -1.42340168e-01
  9.09577385e-02 -9.13541913e-02  9.32370126e-02  1.04349479e-01
  2.68350542e-01  3.70003320e-02 -1.85922548e-01 -1.42489463e-01
  9.52619314e-02 -3.08503807e-02 -6.57869428e-02 -7.06019104e-02
  2.74333924e-01 -1.00004785e-01  2.89140139e-02  1.96617514e-01
 -4.17433195e-02 -1.88345477e-01  3.23576480e-02  1.96594447e-01
 -1.13914713e-01 -1.28085703e-01  2.32560728e-02  2.59798914e-01
  1.84598491e-02 -1.73371434e-02  6.59476891e-02  7.91257545e-02
 -3.98730449e-02 -3.07504646e-02  1.50735080e-01  5.53137064e-03
 -7.15860352e-02  7.67737702e-02  1.16168141e-01  1.11783803e-01
 -7.13716894e-02  9.72867291e-03  5.37171885e



Sources Cited
--------------------------
An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition by Daniel Jurafsky and James H. Martin


Brownlee, Jason. “How to Develop Word Embeddings in Python with Gensim.” Machine Learning Mastery, 7 Aug. 2019, machinelearningmastery.com/develop-word-embeddings-python-gensim/.

Section 3: Evaluate the differences between the word embeddings
----------------------------

(make sure to include graphs, figures, and paragraphs with full sentences)

In [98]:
# -----------------Section 3: Evaluate the differences between the word embeddings-------------------
#This section is evaluating via PCAs
primaryModel = model_primaryDataset[model_primaryDataset.wv.vocab]
secondaryModel = model_secondaryDataset[model_secondaryDataset.wv.vocab]
#Retirives the vectors from each embedding

def buildSimilarWords(randWord, pSimilarWords, sSimilarWords):
    psList = []
    psList.append(randWord)
    for wordTuple in pSimilarWords:
        word  = wordTuple[0]
        if word not in words_secondaryDataset:
            psList.append(word)
    return psList

def getPrimaryIndex(word):
    for i, iWord in enumerate(list(model_primaryDataset.wv.vocab)):
        if word == iWord:
            return i

randIndex = np.random.randint(0, high=len(words_primaryDataset))
randWord = words_primaryDataset[randIndex]
while randWord not in words_secondaryDataset:
    randIndex = np.random.randint(0, high=len(words_primaryDataset))
    randWord = words_primaryDataset[randIndex]
pSimilarWords = model_primaryDataset.wv.most_similar(randWord)
sSimilarWords = model_secondaryDataset.wv.most_similar(randWord)
print("Word: ", randWord, "\n")
print(pSimilarWords, "\n")
print(sSimilarWords, "\n")
similarWords = buildSimilarWords(randWord, pSimilarWords, sSimilarWords)
dnpWord = model_primaryDataset.wv.doesnt_match(similarWords)
dnsWord = model_secondaryDataset.wv.doesnt_match(similarWords)
print(dnpWord)
print(dnsWord)

pcaP = PCA(n_components=3)
resultP = pcaP.fit_transform(primaryModel)
ax = plt.figure(figsize=(10,8)).gca(projection='3d')
ax.scatter(resultP[:, 0], resultP[:, 1], resultP[:, 2], s=5, color='teal')
words_primaryDataset = list(model_primaryDataset.wv.vocab)
ax.set_title('Three-Dimensional PCA for the Primary Data Set')
ax.set_xlabel('Variable B')
ax.set_ylabel('Variable A')
ax.set_zlabel('Variable C')
ax.text(resultP[randIndex, 0], resultP[randIndex, 1], resultP[randIndex, 2], randWord,  fontweight='bold')
for word in similarWords:
    if word != dnpWord and word != randWord:
        p2 = getPrimaryIndex(word)
        ax.text(resultP[p2, 0], resultP[p2, 1], resultP[p2, 2], word)
p2 = words_primaryDataset.index(dnpWord)
ax.text(resultP[p2, 0], resultP[p2, 1], resultP[p2, 2], dnpWord, color='red')
plt.show()
#PCA model for the primary dataset
pcaP = PCA(n_components=2)
resultP = pcaP.fit_transform(primaryModel)
plt.scatter(resultP[:, 0], resultP[:, 1], s=5, color='teal')
words_primaryDataset = list(model_primaryDataset.wv.vocab)
plt.title('Two-Dimensional PCA for the Primary Data Set')
plt.xlabel('Variable B')
plt.ylabel('Variable A')
plt.show()
#PCA model for the primary dataset

randIndex = np.random.randint(0, high=len(words_secondaryDataset))
randWord = words_secondaryDataset[randIndex]
while randWord not in words_secondaryDataset:
    randIndex = np.random.randint(0, high=len(words_secondaryDataset))
    randWord = words_secondaryDataset[randIndex]
pSimilarWords = model_secondaryDataset.wv.most_similar(randWord)
sSimilarWords = model_secondaryDataset.wv.most_similar(randWord)
print("Word: ", randWord, "\n")
print(pSimilarWords, "\n")
print(sSimilarWords, "\n")
similarWords = buildSimilarWords(randWord, pSimilarWords, sSimilarWords)
dnpWord = model_secondaryDataset.wv.doesnt_match(similarWords)
dnsWord = model_secondaryDataset.wv.doesnt_match(similarWords)
print(dnpWord)
print(dnsWord)
def getSecondaryIndex(word):
    for i, iWord in enumerate(list(model_secondaryDataset.wv.vocab)):
        if word == iWord:
            return i
pcaS = PCA(n_components=3)
resultS = pcaS.fit_transform(secondaryModel)
ax = plt.figure(figsize=(10,8)).gca(projection='3d')
ax.scatter(resultS[:, 0], resultS[:, 1], resultS[:, 2], s=5, color='coral')
words_secondaryDataset = list(model_secondaryDataset.wv.vocab)
ax.set_title('Three-Dimensional PCA for the Secondary Data Set')
ax.set_xlabel('Variable B')
ax.set_ylabel('Variable A')
ax.set_zlabel('Variable C')
ax.text(resultS[randIndex, 0], resultS[randIndex, 1], resultS[randIndex, 2], randWord,  fontweight='bold')
for word in similarWords:
    if word != dnpWord and word != randWord:
        s2 = getSecondaryIndex(word)
        ax.text(resultS[s2, 0], resultS[s2, 1], resultS[s2, 2], word)
s2 = words_secondaryDataset.index(dnpWord)
ax.text(resultS[s2, 0], resultS[s2, 1], resultS[s2, 2], dnpWord, color='darkorchid')
plt.show()

def getSecondaryIndex(word):
    for i, iWord in enumerate(list(model_secondaryDataset.wv.vocab)):
        if word == iWord:
            return i

#PCA model for the secondary dataset
pcaS = PCA(n_components=2)
resultS = pcaS.fit_transform(secondaryModel)
plt.scatter(resultS[:, 0], resultS[:, 1], s=5, color='coral')
words_secondaryDataset = list(model_secondaryDataset.wv.vocab)
plt.title('Two-Dimensional PCA for the Secondary Data Set')
plt.xlabel('Variable B')
plt.ylabel('Variable A')
plt.show()
#PCA model for the secondary dataset

  This is separate from the ipykernel package so we can avoid doing imports until
  after removing the cwd from sys.path.


Word:  consistent 

[('viso', 0.9947690963745117), ('lxg', 0.9942871332168579), ('sereno', 0.994221031665802), ('rousseau', 0.9942161440849304), ('eccossois', 0.9941856265068054), ('porto', 0.994128942489624), ('accorto', 0.9941179752349854), ('hayti', 0.9939699172973633), ('mortali', 0.9938950538635254), ('andava', 0.993792712688446)] 

[('solid', 0.9751346111297607), ('concept', 0.963414192199707), ('contacted', 0.9629279375076294), ('rookie', 0.9617125988006592), ('denies', 0.9606367349624634), ('birkies', 0.9598064422607422), ('dull', 0.9596190452575684), ('reputation', 0.9588959217071533), ('photograph', 0.9587100744247437), ('conspiracy', 0.9582803249359131)] 

hayti
consistent


  vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)


ValueError: Unknown projection '3d'

<Figure size 720x576 with 0 Axes>

Sources Cited
--------------------------
An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition by Daniel Jurafsky and James H. Martin
Brownlee, Jason. “How to Develop Word Embeddings in Python with Gensim.” Machine Learning Mastery, 7 Aug. 2019, machinelearningmastery.com/develop-word-embeddings-python-gensim/. <br>
Durksen, Luuk. "Visualising high-dimensional datasets using PCA and t-SNE in Python" 29 Oct. 2016, https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b

Section 4: Feedforward Neural Language Model
--------------------------

In [184]:
# code to train a feedforward neural language model
# on a set of given word embeddings
# make sure not to just copy + paste to train your two

vec = 300

def convert_data(data):
#flattens data to 1D matrix
    data_flattened = []
    
    for sentences in data:
        for word in sentences:
            data_flattened.append(word)
    
    return data_flattened


def data_to_index(data, model):
#assigns index values to data
    data_index = []
    
    for word in data:
        if word in model:
            data_index.append(model.vocab[word].index)
            
    return data_index


def generate_words(word2vec_model, keras_model, words_list, length=12):
    word_index = []
    words = []
    for word in words_list:
        word_index.append(word2vec_model.wv.vocab[word].index)
        words.append(word)
    
    for i in range(length):
        pred = keras_model.predict(x=np.array(word_index))
        pred.astype('float64')
        probability = np.random.multinomial(1,pred,1)
        
        prediction = np.argmax(probability[-1])
        word_index.append(prediction)

        
        index_to_word = word2vec_model.wv.index2word[prediction]
        words.append(index_to_word)
    return(words)
    


    train_x = np.zeros([len(), 40])
    train_y = np.zeros([len(sentences)])

    for i, sentence in enumerate(sentences):
        for t, word in enumerate(sentence[:-1]):
            train_x[i, t] = word2idx(word)
        train_y[i] = word2idx(sentence[-1])
#------------------ Primary Dataset -----------------

training_data_x = np.zeros([len(sentences_primaryDataset), 40])
training_data_y = np.zeros([len(sentences_primaryDataset)])

training_data_x.astype('float64')
training_data_y.astype('float64')

i = 0
for word_list in sentences_primaryDataset:
    j = 0
    for word in word_list[:-1]:
        training_data_x[i, j] = model_primaryDataset.wv.vocab[word].index
        j+=1
    training_data_y[i] = model_primaryDataset.wv.vocab[word_list[-1]].index
    i+=1


#Create Keras Model

trained_weights_primaryDataset = model_primaryDataset.wv.vectors
vocab_size_primaryDataset, embedding_size_primaryDataset = trained_weights_primaryDataset.shape


primary_FFNN = Sequential()
#primary_FFNN.add(Embedding(input_dim=vocab_size_primaryDataset, output_dim=embedding_size_primaryDataset, weights=[trained_weights_primaryDataset]))
primary_FFNN.add(Dense(input_dim=vocab_size_primaryDataset, output_dim=embedding_size_primaryDataset))
primary_FFNN.add(Activation('tanh'))
primary_FFNN.add(Dense(units=vocab_size_primaryDataset))
primary_FFNN.add(Activation('softmax'))
primary_FFNN.compile(optimizer="adadelta", loss='mean_squared_error')


words = generate_words(model_primaryDataset,primary_FFNN, ["this", "is","not"])        
print(words)


#------------------ Secondary Dataset -----------------

  
trained_weights_secondaryDataset = model_secondaryDataset.wv.vectors
vocab_size_secondaryDataset, embedding_size_secondaryDataset = trained_weights_secondaryDataset.shape



secondary_FFNN = Sequential()
secondary_FFNN.add(Embedding(input_dim=vocab_size_secondaryDataset, output_dim=embedding_size_secondaryDataset, weights=[trained_weights_secondaryDataset]))
secondary_FFNN.add(Dense(units=vocab_size_secondaryDataset))
secondary_FFNN.add(Activation('softmax'))
secondary_FFNN.compile(optimizer='adam', loss='sparse_categorical_crossentropy')


        
words = generate_words(model_secondaryDataset,secondary_FFNN, ["cocacola","is","cool"])        
print(words)

        
        



ValueError: Error when checking input: expected dense_113_input to have shape (25154,) but got array with shape (1,)

Sources Cited
--------------------------
An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition by Daniel Jurafsky and James H. Martin

Section 5: Recurrent Neural Language Model
--------------------------

In [182]:
# code to train a recurrent neural language model 
# on a set of given word embeddings
# make sure not to just copy + paste to train your two 

vec = 300

def convert_data(data):
#flattens data to 1D matrix
    data_flattened = []
    
    for sentences in data:
        for word in sentences:
            data_flattened.append(word)
    
    return data_flattened


def data_to_index(data, model):
#assigns index values to data
    data_index = []
    
    for word in data:
        if word in model:
            data_index.append(model.vocab[word].index)
            
    return data_index


def generate_words(word2vec_model, keras_model, word, length=12):
    word_index = [word2vec_model.wv.vocab[word].index]
    words = [word]
    
    for i in range(length):
        pred = keras_model.predict(x=np.array(word_index))
        pred.astype('float64')
        probability = np.random.multinomial(1,pred,1)
        
        prediction = np.argmax(probability)
        word_index.append(prediction)

        
        index_to_word = word2vec_model.wv.index2word[prediction]
        words.append(index_to_word)
    return(words)
        
#------------------ Primary Dataset -----------------


#Create Keras Model

trained_weights_primaryDataset = model_primaryDataset.wv.syn0
vocab_size_primaryDataset, embedding_size_primaryDataset = trained_weights_primaryDataset.shape



primary_RNN = Sequential()
primary_RNN.add(Embedding(input_dim=vocab_size_primaryDataset, output_dim=embedding_size_primaryDataset, weights=[trained_weights_primaryDataset]))
primary_RNN.add(Simple)
primary_RNN.add(Dense(units=vocab_size_primaryDataset))
primary_RNN.add(Activation('relu'))
primary_RNN.compile(optimizer='adam', loss='sparse_categorical_crossentropy')


        
words = generate_words(model_primaryDataset,primary_FFNN, "this")        
print(words)


#------------------ Secondary Dataset -----------------

  
trained_weights_secondaryDataset = model_secondaryDataset.wv.syn0
vocab_size_secondaryDataset, embedding_size_secondaryDataset = trained_weights_secondaryDataset.shape



secondary_FFNN = Sequential()
secondary_FFNN.add(Embedding(input_dim=vocab_size_secondaryDataset, output_dim=embedding_size_secondaryDataset, weights=[trained_weights_secondaryDataset]))
secondary_FFNN.add(Dense(units=vocab_size_secondaryDataset))
secondary_FFNN.add(Dense(300))
secondary_FFNN.add(Activation('relu'))
secondary_FFNN.compile(optimizer='adam', loss='sparse_categorical_crossentropy')


        
words = generate_words(model_secondaryDataset,secondary_FFNN, "cocacola")        
print(words)

        
        



['this', 'the', 'of', 'and', 'to', 'i', 'a', 'in', 'was', 'that', 'my', 'it', 'had']




['cocacola', 'the', 'i', 'to', 'and', 'a', 'of', 'in', 'that', 'it', 'my', 'is', 'was']


Sources Cited
--------------------------
An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition by Daniel Jurafsky and James H. Martin

“Python Gensim Word2Vec Tutorial with TensorFlow and Keras.” Adventures in Machine Learning, 1 Sept. 2017, adventuresinmachinelearning.com/gensim-word2vec-tutorial/.


Shukla, Vishal ShuklaVishal. “Using Pre-Trained word2vec with LSTM for Word Generation.” Stack Overflow, 1AD, stackoverflow.com/questions/42064690/using-pre-trained-word2vec-with-lstm-for-word-generation.

Section 6: Evaluate the differences between the two language models
----------------------------

(make sure to include graphs, figures, and paragraphs with full sentences)

Sources Cited
--------------------------
An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition by Daniel Jurafsky and James H. Martin