### DATA340 HW01
Problem Set 01: Understanding Vector Spaces

Rija Masroor and Yera Park

### Question 1

We have chosen Gensim as our Word2Vec implementation.

#### What two algorithms are used to train word2vec models? Explain the differences between the two algorithms?
The two algorithms used to train Word2Vec models are Continuous Bag of Words (CBOW) and Skip-Gram. Both learn to represent words in a continuous vector space. However, CBOW uses the context words to predict the target word, while Skip-Gram uses the target word to predict its context words.<br />
More precisely, CBOW is a shallow neural network with a single hidden layer: the context words are the input and the predicted target word is the output. Skip-Gram reverses this, taking the target word as input and predicting each context word as output.<br />
CBOW averages the context, so it is faster to train and tends to do well on frequent words, whereas Skip-Gram trains on each (target, context) pair separately, which is slower but tends to represent rare words better and to work well even with smaller datasets.<br />
In short, the two algorithms differ in the direction of prediction between target and context words, in training cost, and in the settings where each performs best.<br />
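
As a minimal sketch of the difference in prediction direction, the toy functions below build the training examples each algorithm would see for one sentence. The function names and the tiny sentence are illustrative assumptions, not part of Gensim, and no neural network is trained here.

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of tokens[i], excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: the whole context is one input; the target word is the label."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """Skip-Gram: the target word is the input; each context word is a separate label."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]

sentence = ["the", "movie", "was", "great"]
print(cbow_examples(sentence, window=1))      # one (context, target) example per position
print(skipgram_examples(sentence, window=1))  # one (target, context word) pair per neighbor
```

Skip-Gram produces more, smaller examples from the same text, which is one reason it is slower to train than CBOW.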

#### Relate the GloVe model to the word2vec model. What are the differences between the two models?
Although both capture semantic relationships between words, Word2Vec is a predictive model trained with a shallow neural network (CBOW or Skip-Gram), whereas GloVe is a count-based model built on corpus statistics. <br />
CBOW and Skip-Gram predict the target word or the context words, respectively, through a single hidden layer. GloVe instead builds a word-word co-occurrence matrix for the whole corpus and learns vectors by factorizing it, so that the dot product of two word vectors (plus bias terms) approximates the logarithm of their co-occurrence count. <br />
For training, Word2Vec slides a fixed-size window over the text and updates on one local context at a time. GloVe also counts co-occurrences within a window, but it aggregates those counts over the entire corpus before training, so it learns from global statistics rather than from individual windows. <br />
The accuracy of Word2Vec depends on the choice of algorithm and hyperparameter settings, whereas GloVe is better suited to capturing global semantic relationships and the overall distributional properties of words. <br />
Although the two models share similar goals, Word2Vec comes from neural network architectures and local context prediction, while GloVe is a count-based model that uses global co-occurrence statistics.
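
To make the count-based side concrete, here is a small sketch of how a global co-occurrence table could be accumulated before any vectors are learned. The function name and toy corpus are assumptions for illustration; the 1/distance weighting follows the description in the GloVe paper, and the factorization step itself is not shown.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Accumulate symmetric co-occurrence counts over the whole corpus.

    Each pair of words that are d positions apart contributes 1/d to the count.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at most `window` positions; add the pair in both directions
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts

corpus = [["good", "movie"], ["good", "film"], ["bad", "movie"]]
counts = cooccurrence_counts(corpus, window=2)
print(counts[("good", "movie")])  # 1.0
```

The table is built once from the entire corpus, in contrast to Word2Vec, which updates its weights window by window.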

### Initial Set Up

In [1]:
!pip install nltk
!pip install gensim
!pip install scikit-learn
!pip install scikit-optimize



In [2]:
import pandas as pd
import numpy as np

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import re
import string
nltk.download('stopwords')

import gensim
from gensim.models import Word2Vec
from gensim.models import Doc2Vec

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Yera\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Import Dataset

In [3]:
df = pd.read_csv("IMDB_Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


### Data Preprocessing (cleaning, tokenization)

In [4]:
# Remove HTML line-break tags
df['review'] = df['review'].apply(lambda x: re.sub('<br />', ' ', x))

# Create an instance of the Porter stemmer
stemmer = PorterStemmer()

# Treat punctuation and digits as stopwords, in addition to the English stopword list
punct = list(string.punctuation) + list(string.digits)
stop_words = set(stopwords.words('english') + punct + ['null'])

# Function to tokenize and stem text (Porter stemming, not lemmatization)
def tokenize_and_stem(text):
    tokens = word_tokenize(text.lower())
    tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
    return tokens

# Apply tokenization and stemming to the review column
df['tokenized_review'] = df['review'].apply(tokenize_and_stem)

df.head()

Unnamed: 0,review,sentiment,tokenized_review
0,One of the other reviewers has mentioned that ...,positive,"[one, review, mention, watch, oz, episod, 'll,..."
1,A wonderful little production. The filming t...,positive,"[wonder, littl, product, film, techniqu, unass..."
2,I thought this was a wonderful way to spend ti...,positive,"[thought, wonder, way, spend, time, hot, summe..."
3,Basically there's a family where a little boy ...,negative,"[basic, 's, famili, littl, boy, jake, think, '..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[petter, mattei, 's, ``, love, time, money, ''..."


In [5]:
# Convert tokenized_reviews column to a list
tokenized_review_list = df['tokenized_review'].tolist()
tokenized_review_list 

[['one', 'review', 'mention', 'watch', 'oz', 'episod', "'ll", 'hook', 'right', 'exactli', 'happen', 'first', 'thing', 'struck', 'oz', 'brutal', 'unflinch', 'scene', 'violenc', 'set',
  ...

In [6]:
# Convert sentiment column to a list
sentiment_list = df['sentiment'].tolist()
sentiment_list

['positive', 'positive', 'positive', 'negative', 'positive', 'positive', 'positive', 'negative', 'negative', 'positive',
 ...

### Train the Initial Word2Vec model

In [51]:
# Train the Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_review_list, vector_size=100, window=5, min_count=5, sg=0, negative=5, epochs=10)

# Explanation of Hyperparameter Choices:
# - Vector Size: 100 dimensions to balance between capturing semantic information and computational efficiency.
# - Window Size: 5 words on each side of the target, enough to capture the local context within a review.
# - Minimum Word Count: 5 to filter out rare words and improve model generalization.
# - Training Algorithm: CBOW (sg=0) chosen for simplicity and efficiency.
# - Negative Sampling: 5 negative samples to improve training efficiency.
# - Epochs: 10 epochs chosen to balance between training duration and model convergence.

# Save or use the trained Word2Vec model for further analysis.

In [52]:
word2vec_model

<gensim.models.word2vec.Word2Vec at 0x221be058a60>

In [53]:
# Vectorize reviews using the trained Word2Vec model
def vectorize_review(review_tokens, model):
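    # key_to_index is a dict, so membership checks are O(1); index_to_key is a list and far slower here
    vectors = [model.wv[word] for word in review_tokens if word in model.wv.key_to_index]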
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(model.vector_size)  # return zero vector if review has no known words
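
The averaging step can be illustrated without Gensim. The sketch below uses a hypothetical two-dimensional embedding table (`toy_embeddings`) in place of a trained model; it only shows how unknown words are skipped and how an all-unknown review falls back to a zero vector.

```python
def mean_vector(tokens, embeddings, size):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * size
    # zip(*vectors) groups the values dimension by dimension
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical embeddings, standing in for model.wv
toy_embeddings = {"good": [1.0, 0.0], "movi": [0.0, 1.0]}

print(mean_vector(["good", "movi", "zzz"], toy_embeddings, size=2))  # [0.5, 0.5]
print(mean_vector(["zzz"], toy_embeddings, size=2))                  # [0.0, 0.0]
```

Averaging discards word order, so two reviews with the same words in a different order receive the same vector.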

In [54]:
X = np.array([vectorize_review(tokens, word2vec_model) for tokens in tokenized_review_list])
y = np.array(sentiment_list)

In [55]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=317)

Logistic Regression: A relatively high accuracy score of 87% is achieved.

In [56]:
# Train a logistic regression classifier
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.8718


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Linear Regression: Not appropriate for sentiment analysis because sentiment analysis here is a binary classification task, while linear regression predicts an unbounded continuous value.
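
As a rough illustration of the point, the sketch below contrasts a raw linear score, which can take any value, with the logistic (sigmoid) transform that maps the same score into a probability between 0 and 1. The scores are made-up numbers, not outputs of any model in this notebook.

```python
import math

def sigmoid(score):
    """Map an unbounded linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical linear scores (w . x + b) for five reviews
scores = [-3.0, -0.5, 0.0, 2.0, 8.0]

for s in scores:
    p = sigmoid(s)
    label = "positive" if p >= 0.5 else "negative"
    print(f"linear score={s:5.1f}  probability={p:.3f}  label={label}")
```

A linear regression would report values such as 8.0 directly, which have no interpretation as a class label without an extra thresholding rule.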

In [57]:
# from sklearn.preprocessing import LabelEncoder

# # Encode categorical target labels
# label_encoder = LabelEncoder()
# y_train_encoded = label_encoder.fit_transform(y_train)

# # Train a linear regression model
# linear_model = LinearRegression()
# linear_model.fit(X_train, y_train_encoded)

# # Make predictions
# y_pred_encoded = linear_model.predict(X_test)

# # Convert predictions back to original categorical labels
# y_pred = label_encoder.inverse_transform(y_pred_encoded)

# # Evaluate the model
# accuracy = accuracy_score(y_test, y_pred)
# print(f"Accuracy: {accuracy}")


Decision Tree: Accuracy is lower than for logistic regression, possibly because an unconstrained tree overfits the training data.

In [58]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train a decision tree classifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)

# Make predictions
y_pred = decision_tree.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Accuracy: {accuracy}")


Decision Tree Accuracy: 0.7177


## **Modify Word2Vec Hyperparameters**

### Model 2

Vector Size: Increasing the vector size to 150 dimensions to capture more semantic information raised the logistic regression accuracy only from 87.18% to 87.76%, a very small difference.

In [59]:
# Train the Word2Vec model
word2vec_model_2 = Word2Vec(sentences=tokenized_review_list, vector_size=150, window=5, min_count=5, sg=0, negative=5, epochs=10)

# Explanation of Hyperparameter Changes:
# - Vector Size: Increased to 150 dimensions to capture more semantic information.

In [60]:
# Vectorize reviews using word2vec_model_2
X2 = np.array([vectorize_review(tokens, word2vec_model_2) for tokens in tokenized_review_list])

In [61]:
# sentiment_list contains the sentiment labels (positive or negative) from the original dataset
y2 = np.array(sentiment_list)

# Split the data into training and testing sets
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.2, random_state=317)

In [62]:
# Train a logistic regression classifier
classifier = LogisticRegression()
classifier.fit(X_train2, y_train2)

# Make predictions
y_pred2 = classifier.predict(X_test2)

# Evaluate the model
accuracy = accuracy_score(y_test2, y_pred2)
print(f"Accuracy: {accuracy}")

Accuracy: 0.8776


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Model 3

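Model 3 returns to a vector size of 100 but switches to Skip-Gram (sg=1). Accuracy rose slightly to 0.8781, compared with 0.8776 for Model 2 (CBOW with a vector size of 150).

This suggests that Skip-Gram may be slightly more effective for this dataset, although the difference is very small.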

In [63]:
# Train the Word2Vec model
word2vec_model_3 = Word2Vec(sentences=tokenized_review_list, vector_size=100, window=5, min_count=5, sg=1, negative=5, epochs=10)

# Explanation of Hyperparameter Changes:
# - Training Algorithm Use skipgram (sg=1) instead of continuous bag of words.

In [64]:
# Vectorize reviews using word2vec_model_3
X3 = np.array([vectorize_review(tokens, word2vec_model_3) for tokens in tokenized_review_list])

KeyboardInterrupt: 

In [None]:
# sentiment_list contains the sentiment labels (positive or negative) from the original dataset
y3 = np.array(sentiment_list)

# Split the data into training and testing sets
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, test_size=0.2, random_state=317)

In [None]:
# Train a logistic regression classifier
classifier = LogisticRegression()
classifier.fit(X_train3, y_train3)

# Make predictions
y_pred3 = classifier.predict(X_test3)

# Evaluate the model
accuracy = accuracy_score(y_test3, y_pred3)
print(f"Accuracy: {accuracy}")

Accuracy: 0.8781


### Bayesian Optimization for Hyperparameter Tuning

In [None]:
# Split data into training, validation, and test sets
X_train_val_4, X_test_4, y_train_val_4, y_test_4 = train_test_split(X, y, test_size=0.2, random_state=317)
X_train_4, X_val_4, y_train_4, y_val_4 = train_test_split(X_train_val_4, y_train_val_4, test_size=0.25, random_state=317)  # Split remaining data into training and validation sets
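
The two successive splits above work out to a 60/20/20 division of the data, because the second `test_size=0.25` applies to the 80% that remains after the test set is removed. A short check of the arithmetic (variable names are illustrative):

```python
test_fraction = 0.2               # first split: held-out test set
val_fraction_of_remainder = 0.25  # second split: share of the remaining data

remainder = 1 - test_fraction
val_fraction = remainder * val_fraction_of_remainder
train_fraction = remainder - val_fraction

print(f"train={train_fraction:.2f}, validation={val_fraction:.2f}, test={test_fraction:.2f}")
```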

In [None]:
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

In [None]:
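# Define objective function (kept for reference; BayesSearchCV below runs its own cross-validated scoring and does not call it)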
def objective_function(params):
    # Extract hyperparameters
    C = params['C']
    penalty = params['penalty']
    
    # Train model with hyperparameters
    model = LogisticRegression(C=C, penalty=penalty, solver='liblinear')
    model.fit(X_train_4, y_train_4)
    
    # Evaluate performance
    accuracy = model.score(X_val_4, y_val_4)
    
    return -accuracy  # Minimize negative accuracy

# Define search space
param_space = {
    'C': (0.01, 10.0, 'log-uniform'),
    'penalty': ['l1', 'l2']
}

# Initialize Bayesian optimizer
bayes_search = BayesSearchCV(
    estimator=LogisticRegression(solver='liblinear'),
    search_spaces=param_space,
    n_iter=50,
    cv=StratifiedKFold(n_splits=5),
    n_jobs=-1
)

# Run optimization
bayes_search.fit(X_train_4, y_train_4)

# Get best hyperparameters
best_params = bayes_search.best_params_

# Train final model with best hyperparameters
final_model = LogisticRegression(**best_params, solver='liblinear')
final_model.fit(X_train_4, y_train_4)

# Evaluate on test set
test_accuracy = final_model.score(X_test_4, y_test_4)

In [None]:
test_accuracy

0.8699

### Question 4 for Word2Vec

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['review'])

# Calculate cosine similarity between TF-IDF vectors
def find_similar_documents(tfidf_vector, tfidf_matrix, top_n=5):
    cosine_similarities = cosine_similarity(tfidf_vector, tfidf_matrix).flatten()
    # Get indices of top n similar documents
    similar_doc_indices = cosine_similarities.argsort()[-top_n:][::-1]
    similar_documents = [(idx, cosine_similarities[idx]) for idx in similar_doc_indices]
    return similar_documents

# Example usage
query = "This movie was average. "
query_vector = tfidf_vectorizer.transform([query])
similar_documents = find_similar_documents(query_vector, tfidf_matrix)

# Print the most similar documents
for i, (idx, similarity) in enumerate(similar_documents, 1):
    print(f"Similar Document {i}:")
    print(df['review'][idx])
    print()

Similar Document 1:
I read all these reviews on here about how this is a such a good movie. Jeez, this movie was predictable and pretty boring. The acting was below average most of the time, especially by Mckenna. I haven't seen a more pathetic attempt at making someone "badass" in a movie. Oh man, this movie was a letdown. I also read somewhere this might be a cult classic. I know there are followers of the director, but this movie was just a average piece of film.<br /><br />The script was lame, for the most part the acting was lame, this movie was lame.<br /><br />Oh and pray for the guy that used to be in Cheers. He looks really bad. <br /><br />The best actor in this movie was probably the guy in Office Space, and he was only in this movie for about 8 minutes.<br /><br />4/10

Similar Document 2:
no comment - stupid movie, acting average or worse... screenplay - no sense at all... SKIP IT!

Similar Document 3:
I can't believe this movie has an average rating of 7.0! It is a fiendi

### Reasoning behind my choice of similarity measure

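Cosine similarity measures the cosine of the angle between two vectors, that is, their inner product divided by the product of their lengths. Because it depends only on direction and not on magnitude, a long review and a short review with similar word distributions still score as similar, which makes it well suited to comparing TF-IDF vectors of documents of different lengths.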
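
A minimal pure-Python sketch of the measure is shown below. The three vectors are invented to show that scaling a vector does not change its cosine similarity; returning 0.0 for an all-zero vector is a convention chosen here, not something taken from scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for all-zero vectors (an assumption of this sketch)
    return dot / (norm_u * norm_v)

short_review = [1.0, 2.0, 0.0]
long_review = [10.0, 20.0, 0.0]  # same direction, ten times the magnitude
unrelated = [0.0, 0.0, 3.0]      # shares no terms with the others

print(cosine(short_review, long_review))  # approximately 1.0
print(cosine(short_review, unrelated))    # 0.0
```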

## Doc2Vec

### Question 3

#### Explain how the Doc2Vec model works and how it differs from the Word2Vec model.
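Doc2Vec (also known as Paragraph Vector) extends Word2Vec by learning a vector for each document in addition to the vectors for individual words. Each document is given a unique tag, and the vector for that tag is trained jointly with the word vectors. In the Distributed Memory variant (PV-DM, `dm=1`), the document vector is combined with the context word vectors to predict the target word, analogous to CBOW. In the Distributed Bag of Words variant (PV-DBOW, `dm=0`), the document vector alone is used to predict words from the document, analogous to Skip-Gram. The main difference from Word2Vec is therefore the output: Word2Vec yields only word vectors, so a document must be represented indirectly (for example, by averaging its word vectors), whereas Doc2Vec learns a document-level vector directly and can infer vectors for unseen documents.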
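
As a rough sketch of what Doc2Vec adds to Word2Vec, the toy functions below list the training examples for the two Doc2Vec variants: PV-DM (`dm=1`), where the document tag joins the context words as input, and PV-DBOW (`dm=0`), where the tag alone predicts each word. The function names and the sample document are assumptions for illustration; no model is trained.

```python
def pv_dm_examples(doc_tag, tokens, window=2):
    """PV-DM: (document tag + context words) -> target word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append(((doc_tag, context), target))
    return examples

def pv_dbow_examples(doc_tag, tokens):
    """PV-DBOW: document tag -> each word in the document."""
    return [(doc_tag, word) for word in tokens]

doc = ["great", "movie", "overall"]
print(pv_dm_examples("doc_0", doc, window=1))
print(pv_dbow_examples("doc_0", doc))
```

In both variants the tag appears in every example drawn from its document, which is how a single document-level vector comes to summarize the whole text.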

In [9]:
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import multiprocessing

# Load the dataset
df = pd.read_csv("IMDB_Dataset.csv")

# Preprocess the text data
tagged_corpus = [TaggedDocument(words=doc.split(), tags=[str(i)]) for i, doc in enumerate(df['review'])]

# Define hyperparameters to test
hyperparameters = [
    {'vector_size': 100, 'window': 5, 'min_count': 5, 'epochs': 10},
    {'vector_size': 200, 'window': 5, 'min_count': 5, 'epochs': 100},
    {'vector_size': 200, 'window': 5, 'min_count': 5, 'epochs': 200}
]

# Define a function to train and evaluate the model with given hyperparameters
def train_and_evaluate_model(params, tagged_corpus):
    cores = multiprocessing.cpu_count()
    model = Doc2Vec(vector_size=params['vector_size'], window=params['window'],
                    min_count=params['min_count'], workers=cores-1, epochs=params['epochs'])
    model.build_vocab(tagged_corpus)
    model.train(tagged_corpus, total_examples=model.corpus_count, epochs=model.epochs)

    # Split the dataset into training and testing sets
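    # model.dv.vectors is the array of document vectors, in the same order as the tagged corpus
    X_train, X_test, y_train, y_test = train_test_split(model.dv.vectors, df['sentiment'], test_size=0.2, random_state=317)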

    # Train a logistic regression classifier
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)

    # Make predictions
    y_pred = classifier.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)

    return accuracy

# Train and evaluate models with different hyperparameters
results = []
for params in hyperparameters:
    accuracy = train_and_evaluate_model(params, tagged_corpus)
    results.append({'params': params, 'accuracy': accuracy})

# Print the results
for result in results:
    print(f"Hyperparameters: {result['params']} | Accuracy: {result['accuracy']}")


KeyboardInterrupt: 

In [73]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import multiprocessing

# Tag the tokenized reviews with unique identifiers
tagged_corpus = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(tokenized_review_list)]

# Set the number of CPU cores for multiprocessing
cores = multiprocessing.cpu_count()

# Train the Doc2Vec model
model = Doc2Vec(vector_size=100, window=5, min_count=1, workers=cores-1, epochs=10)
model.build_vocab(tagged_corpus)
model.train(tagged_corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Save the trained model
model.save("doc2vec_model")

print("Doc2Vec model training completed.")

Doc2Vec model training completed.


In [77]:
# Load the trained Doc2Vec model
model = Doc2Vec.load("doc2vec_model")

# Extract document vectors
doc_vectors = []
for i in range(len(model.dv)):
    doc_vectors.append(model.dv[i])

# Convert the list of vectors to a numpy array
doc_vectors = np.array(doc_vectors)

In [80]:
from sklearn.metrics import accuracy_score

def evaluate_model(model, X_test, y_test):
    # Infer document vectors for the testing set
    inferred_vectors = [model.infer_vector(tagged_document.words) for tagged_document in tagged_corpus[len(X_train):]]

    # Calculate similarity and determine accuracy
    similarities = []
    for i, inferred_vector in enumerate(inferred_vectors):
        actual_vector = model.dv[i + len(X_train)]  # Access the corresponding vector directly
        similarity = cosine_similarity([inferred_vector], [actual_vector])[0][0]
        similarities.append(similarity)

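    # Threshold the similarity at 0.5. Note that this similarity compares each inferred
    # vector with its own trained vector, so it does not carry sentiment information,
    # and the integer predictions do not match the string labels in y_test.
    predictions = [1 if sim >= 0.5 else 0 for sim in similarities]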

    # Calculate accuracy
    accuracy = accuracy_score(y_test, predictions)
    return accuracy


In [81]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(doc_vectors, sentiment_list, test_size=0.2, random_state=42)

# Define hyperparameters to test
hyperparameters = [
    {'vector_size': 100, 'window': 5, 'min_count': 5, 'epochs': 10},
    {'vector_size': 200, 'window': 5, 'min_count': 5, 'epochs': 10},
    # Add more hyperparameter combinations as needed
]

results = []

for params in hyperparameters:
    # Train the Doc2Vec model with the current hyperparameters
    model = Doc2Vec(vector_size=params['vector_size'], window=params['window'], min_count=params['min_count'], epochs=params['epochs'])
    model.build_vocab(tagged_corpus)
    model.train(tagged_corpus, total_examples=model.corpus_count, epochs=model.epochs)

    # Evaluate the model
    accuracy = evaluate_model(model, X_test, y_test)
    results.append({'params': params, 'accuracy': accuracy})

# Print the results
for result in results:
    print(f"Hyperparameters: {result['params']} | Accuracy: {result['accuracy']}")



  score = y_true == y_pred


Hyperparameters: {'vector_size': 100, 'window': 5, 'min_count': 5, 'epochs': 10} | Accuracy: 0.0
Hyperparameters: {'vector_size': 200, 'window': 5, 'min_count': 5, 'epochs': 10} | Accuracy: 0.0


  score = y_true == y_pred
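
The 0.0 accuracies reported above are what one would expect when integer predictions (0/1) are compared against string labels ('positive'/'negative'): no element can ever match. The sketch below reproduces that effect on made-up labels with a hand-written accuracy function; the `label_names` mapping is an assumption about how the classes would be encoded.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = ["positive", "negative", "positive", "negative"]
y_pred_int = [1, 0, 1, 1]  # integer predictions, as produced by a 0/1 threshold

print(accuracy(y_true, y_pred_int))  # 0.0, because 1 never equals "positive"

# Map integers back to label strings before scoring
label_names = {1: "positive", 0: "negative"}
y_pred_str = [label_names[p] for p in y_pred_int]

print(accuracy(y_true, y_pred_str))  # 0.75
```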


In [67]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

corpus = []

# Tokenize the text, remove stopwords, and lemmatize the text
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
for review in df['review']:
    tokens = word_tokenize(review.lower())
    processed_tokens = [lemmatizer.lemmatize(word) for word in tokens if word.isalnum() and word not in stop_words]
    corpus.append(processed_tokens)

# Store the processed tokens in the DataFrame and save it to a new CSV file
df['processed_review'] = [' '.join(tokens) for tokens in corpus]
df.to_csv("IMDB_processed.csv", index=False)

In [None]:
tagged_corpus = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(corpus)]
cores = multiprocessing.cpu_count()

In [None]:
model = Doc2Vec(vector_size=100, window=5, min_count=10, workers=cores-1, epochs=100)
model.build_vocab(tagged_corpus)
model.train(tagged_corpus, total_examples=model.corpus_count, epochs=model.epochs)

In [None]:
# Extract document vectors
document_vectors = [model.infer_vector(tagged_document.words) for tagged_document in tagged_corpus]

# Convert document vectors to NumPy array
document_vectors_array = np.array(document_vectors)

print(document_vectors_array.shape)

In [None]:
# Extract document vectors
doc_vectors = []
for i in range(len(model.dv)):
    doc_vectors.append(model.dv[i])

# Convert the list of vectors to a numpy array
doc_vectors = np.array(doc_vectors)

# Save the document vectors to disk as tab-separated embeddings
with open("doc_vectors.tsv", "w", encoding="utf-8") as file:
    for vector in doc_vectors:
        vector_str = "\t".join([str(x) for x in vector])
        file.write(vector_str + "\n")

# Create a metadata file with one label per document
with open("metadata.tsv", "w", encoding="utf-8") as file:
    for i in range(len(model.dv)):
        file.write(f"Document {i}\n")

**Train the Doc2Vec Model**

Data Preprocessing

In [None]:
review_list = df['review'].tolist()


Initialize using the defaults

In [None]:
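# corpus_file expects a plain-text file with one whitespace-tokenized document per line,
# not a raw CSV, so train from TaggedDocument objects built from the tokenized reviews instead.
tagged_reviews = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(tokenized_review_list)]
doc2vec_model = Doc2Vec(documents=tagged_reviews, vector_size=100, dm=1, dm_mean=1, dbow_words=0, window=5, min_count=5, negative=5, epochs=5)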


KeyboardInterrupt: 

In [None]:
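# Vectorize reviews by averaging the word vectors learned by doc2vec_model
# (this uses doc2vec_model.wv, not the document vectors in doc2vec_model.dv)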
Xa = np.array([vectorize_review(docs, doc2vec_model) for docs in tokenized_review_list])

KeyboardInterrupt: 