### DATA340 HW01
Problem Set 01: Understanding Vector Spaces

Rija Masroor and Yera Park

### Question 1

We have chosen Gensim's implementation of Word2Vec as our word2vec model.

#### What two algorithms are used to train word2vec models? Explain the differences between the two algorithms.
The two algorithms used to train Word2Vec models are called Continuous Bag of Words (CBOW) and Skip-Gram. Both represent words in a continuous vector space. However, CBOW utilizes context words to predict the target word, while Skip-Gram uses the target word to predict the context words.<br />
More precisely, CBOW uses a shallow neural network with a single hidden layer in which the context words are the input and the target word is the predicted output; Skip-Gram reverses the input and output layers. <br />
CBOW trains faster and tends to represent frequent words well, whereas Skip-Gram is slower but tends to produce better representations for rare words and works well with smaller training sets. <br />
The main differences between these algorithms are therefore the direction of prediction between target and context words, the training cost, and the settings in which each performs best.<br />
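
To make the reversed prediction direction concrete, here is a minimal sketch of the training examples each algorithm would draw from one toy sentence. The helper names `cbow_examples` and `skipgram_examples` are our own illustrations, not part of Gensim, and the sketch ignores details such as negative sampling and subsampling.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word) pairs: the context predicts the target."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield (left + right, target)


def skipgram_examples(tokens, window=2):
    """Yield (target_word, context_word) pairs: the target predicts each context word."""
    for context, target in cbow_examples(tokens, window):
        for word in context:
            yield (target, word)


sentence = ["the", "movie", "was", "great"]
cbow = list(cbow_examples(sentence, window=1))
skipgram = list(skipgram_examples(sentence, window=1))

print(cbow[1])       # (['the', 'was'], 'movie')
print(skipgram[:3])  # [('the', 'movie'), ('movie', 'the'), ('movie', 'was')]
```

CBOW produces one example per position, while Skip-Gram produces one example per (target, context) pair, which is one reason Skip-Gram training is slower.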

#### Relate the GloVe model to the word2vec model. What are the differences between the two models?
Although both capture semantic relationships between words, Word2Vec is a predictive model that trains a shallow neural network with the CBOW or Skip-Gram algorithm, whereas GloVe is a count-based model built on corpus-wide co-occurrence statistics. <br />
CBOW and Skip-Gram predict the target word or the context words, respectively, with a single hidden layer. GloVe instead builds a word-word co-occurrence matrix and learns word vectors by factorizing it, so that the dot product of two word vectors approximates the logarithm of how often the words co-occur. <br />
For training, Word2Vec slides a fixed-size window over the text and learns from one local context at a time. GloVe also counts co-occurrences within a window, but it aggregates them over the entire corpus first and then trains on those global statistics. <br />
The accuracy of Word2Vec depends on the choice of algorithm and hyperparameter settings, whereas GloVe is better suited to capturing global semantic relationships and the overall distributional properties of words. <br />
Although the two models were created for similar goals, Word2Vec is based on a neural network architecture and local context prediction, while GloVe is a count-based model that uses global co-occurrence statistics. 
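
As a rough illustration of the "global co-occurrence statistics" that a count-based model starts from, the sketch below counts word pairs over a whole toy corpus. The function `cooccurrence_counts` and the toy corpus are our own assumptions. Actual GloVe additionally weights pairs by their distance and then fits vectors to these counts, which is not shown here.

```python
from collections import Counter


def cooccurrence_counts(corpus, window=2):
    """Count (word, context) pairs within `window` tokens, summed over the whole corpus."""
    counts = Counter()
    for tokens in corpus:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


toy_corpus = [
    ["good", "movie"],
    ["good", "film"],
    ["bad", "movie"],
]
counts = cooccurrence_counts(toy_corpus, window=1)

print(counts[("good", "movie")])  # 1
print(counts[("movie", "bad")])   # 1
print(counts[("good", "bad")])    # 0 (never seen together)
```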

### Initial Set Up

In [7]:
!pip install nltk
!pip install gensim
!pip install scikit-learn
!pip install scikit-optimize



In [8]:
import pandas as pd
import numpy as np

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import re
import string
nltk.download('stopwords')

import gensim
from gensim.models import Word2Vec
from gensim.models import Doc2Vec

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Yera\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Import Dataset

In [9]:
df = pd.read_csv("IMDB_Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


### Data Preprocessing (cleaning, tokenization)

In [10]:
# Replace HTML line-break tags with spaces
df['review'] = df['review'].apply(lambda x: re.sub('<br />', ' ', x))

# Create an instance of the Porter stemmer
stemmer = PorterStemmer()

# Treat punctuation marks and digits as stopwords too
punct = list(string.punctuation) + list(string.digits) 
stop_words = set(stopwords.words('english') + punct + ['null'])

# Tokenize, lowercase, drop stopwords, and stem each token
# (despite its name, this function applies Porter stemming, not lemmatization)
def tokenize_and_lemmatize(text):
    tokens = word_tokenize(text.lower())
    tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
    return tokens

# Apply tokenization and stemming to the review column
df['tokenized_review'] = df['review'].apply(tokenize_and_lemmatize)

df.head()

Unnamed: 0,review,sentiment,tokenized_review
0,One of the other reviewers has mentioned that ...,positive,"[one, review, mention, watch, oz, episod, 'll,..."
1,A wonderful little production. The filming t...,positive,"[wonder, littl, product, film, techniqu, unass..."
2,I thought this was a wonderful way to spend ti...,positive,"[thought, wonder, way, spend, time, hot, summe..."
3,Basically there's a family where a little boy ...,negative,"[basic, 's, famili, littl, boy, jake, think, '..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[petter, mattei, 's, ``, love, time, money, ''..."


In [None]:
# Convert tokenized_reviews column to a list
tokenized_review_list = df['tokenized_review'].tolist()

In [None]:
# Convert sentiment column to a list
sentiment_list = df['sentiment'].tolist()
sentiment_list

## **Word2Vec Model**

In [13]:
# Train the Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_review_list, vector_size=100, window=5, min_count=5, sg=0, negative=5, epochs=10)

# Explanation of Hyperparameter Choices:
# - Vector Size: 100 dimensions to balance between capturing semantic information and computational efficiency.
# - Window Size: 5 words on each side of the target, enough to capture the local context within movie reviews.
# - Minimum Word Count: 5 to filter out rare words and improve model generalization.
# - Training Algorithm: CBOW (sg=0) chosen for simplicity and efficiency.
# - Negative Sampling: 5 negative samples to improve training efficiency.
# - Epochs: 10 epochs chosen to balance between training duration and model convergence.

# Save or use the trained Word2Vec model for further analysis.

In [14]:
word2vec_model

<gensim.models.word2vec.Word2Vec at 0x25a46c58970>

In [15]:
# Vectorize reviews using the trained Word2Vec model
def vectorize_review(review_tokens, model):
    # key_to_index is a dict, so membership checks are O(1) instead of scanning the index_to_key list
    vectors = [model.wv[word] for word in review_tokens if word in model.wv.key_to_index]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(model.vector_size)  # return zero vector if review has no known words

In [16]:
X = np.array([vectorize_review(tokens, word2vec_model) for tokens in tokenized_review_list])
y = np.array(sentiment_list)

In [17]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=317)

Logistic Regression: A relatively high accuracy score of 87% is achieved.

In [18]:
# Train a logistic regression classifier
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.8706


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Linear Regression: Not appropriate for sentiment analysis because sentiment analysis is a binary classification task, while linear regression provides a continuous range of values.

In [19]:
# from sklearn.preprocessing import LabelEncoder

# # Encode categorical target labels
# label_encoder = LabelEncoder()
# y_train_encoded = label_encoder.fit_transform(y_train)

# # Train a linear regression model
# linear_model = LinearRegression()
# linear_model.fit(X_train, y_train_encoded)

# # Make predictions
# y_pred_encoded = linear_model.predict(X_test)

# # Convert predictions back to original categorical labels
# y_pred = label_encoder.inverse_transform(y_pred_encoded)

# # Evaluate the model
# accuracy = accuracy_score(y_test, y_pred)
# print(f"Accuracy: {accuracy}")


Decision Tree: Accuracy is lower than for logistic regression, possibly because the tree overfit the training data.

In [20]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train a decision tree classifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)

# Make predictions
y_pred = decision_tree.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Accuracy: {accuracy}")


Decision Tree Accuracy: 0.7395


## **Modify Word2Vec Hyperparameters**

### Model 2

Vector Size: Increasing the vector size to 150 dimensions to capture more semantic information raised the logistic regression accuracy only from 87.06% to 87.90%, a very small difference.

In [21]:
# Train the Word2Vec model
word2vec_model_2 = Word2Vec(sentences=tokenized_review_list, vector_size=150, window=5, min_count=5, sg=0, negative=5, epochs=10)

# Explanation of Hyperparameter Changes:
# - Vector Size: Increased to 150 dimensions to capture more semantic information.

In [22]:
# Vectorize reviews using word2vec_model_2
X2 = np.array([vectorize_review(tokens, word2vec_model_2) for tokens in tokenized_review_list])

In [23]:
# sentiment_list contains the sentiment labels (positive or negative) from the original dataset
y2 = np.array(sentiment_list)

# Split the data into training and testing sets
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.2, random_state=317)

In [24]:
# Train a logistic regression classifier
classifier = LogisticRegression()
classifier.fit(X_train2, y_train2)

# Make predictions
y_pred2 = classifier.predict(X_test2)

# Evaluate the model
accuracy = accuracy_score(y_test2, y_pred2)
print(f"Accuracy: {accuracy}")

Accuracy: 0.879


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Model 3

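Vector size was reduced back to 100, but with Skip-Gram (sg=1) instead of CBOW. Accuracy reached 0.8793, compared with 0.8706 for CBOW at the same vector size (Model 1) and 0.8790 for CBOW with 50 more dimensions (Model 2).

This suggests that Skip-Gram is slightly more effective than CBOW for this dataset.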

In [25]:
# Train the Word2Vec model
word2vec_model_3 = Word2Vec(sentences=tokenized_review_list, vector_size=100, window=5, min_count=5, sg=1, negative=5, epochs=10)

# Explanation of Hyperparameter Changes:
# - Training Algorithm: Use Skip-Gram (sg=1) instead of Continuous Bag of Words.

In [26]:
# Vectorize reviews using word2vec_model_3
X3 = np.array([vectorize_review(tokens, word2vec_model_3) for tokens in tokenized_review_list])

In [27]:
# sentiment_list contains the sentiment labels (positive or negative) from the original dataset
y3 = np.array(sentiment_list)

# Split the data into training and testing sets
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, test_size=0.2, random_state=317)

In [28]:
# Train a logistic regression classifier
classifier = LogisticRegression()
classifier.fit(X_train3, y_train3)

# Make predictions
y_pred3 = classifier.predict(X_test3)

# Evaluate the model
accuracy = accuracy_score(y_test3, y_pred3)
print(f"Accuracy: {accuracy}")

Accuracy: 0.8793


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Bayesian Optimization for Hyperparameter Tuning

In [29]:
# Split data into training, validation, and test sets
X_train_val_4, X_test_4, y_train_val_4, y_test_4 = train_test_split(X, y, test_size=0.2, random_state=317)
X_train_4, X_val_4, y_train_4, y_val_4 = train_test_split(X_train_val_4, y_train_val_4, test_size=0.25, random_state=317)  # Split remaining data into training and validation sets
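
The two chained splits above should leave roughly 60% of the rows for training, 20% for validation, and 20% for testing, because 0.25 of the remaining 80% is 20% of the total. A small arithmetic sketch follows. `split_sizes` is a hypothetical helper, and rounding the held-out part up is our assumption about how fractional sizes are resolved.

```python
import math


def split_sizes(n_rows, test_size=0.2, val_size=0.25):
    """Return (train, validation, test) row counts for two chained splits."""
    n_test = math.ceil(n_rows * test_size)     # first split: hold out the test set
    n_remaining = n_rows - n_test
    n_val = math.ceil(n_remaining * val_size)  # second split: validation from the remainder
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


print(split_sizes(1000))  # (600, 200, 200)
```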

In [30]:
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

In [31]:
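# Define objective function
# NOTE: illustrative only; BayesSearchCV below runs its own cross-validated search and never calls this function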
def objective_function(params):
    # Extract hyperparameters
    C = params['C']
    penalty = params['penalty']
    
    # Train model with hyperparameters
    model = LogisticRegression(C=C, penalty=penalty, solver='liblinear')
    model.fit(X_train_4, y_train_4)
    
    # Evaluate performance
    accuracy = model.score(X_val_4, y_val_4)
    
    return -accuracy  # Minimize negative accuracy

# Define search space
param_space = {
    'C': (0.01, 10.0, 'log-uniform'),
    'penalty': ['l1', 'l2']
}

# Initialize Bayesian optimizer
bayes_search = BayesSearchCV(
    estimator=LogisticRegression(solver='liblinear'),
    search_spaces=param_space,
    n_iter=50,
    cv=StratifiedKFold(n_splits=5),
    n_jobs=-1
)

# Run optimization
bayes_search.fit(X_train_4, y_train_4)

# Get best hyperparameters
best_params = bayes_search.best_params_

# Train final model with best hyperparameters
final_model = LogisticRegression(**best_params, solver='liblinear')
final_model.fit(X_train_4, y_train_4)

# Evaluate on test set
test_accuracy = final_model.score(X_test_4, y_test_4)

In [32]:
test_accuracy

0.8709

### Question 4 for Word2Vec

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['review'])

# Calculate cosine similarity between TF-IDF vectors
def find_similar_documents(tfidf_vector, tfidf_matrix, top_n=5):
    cosine_similarities = cosine_similarity(tfidf_vector, tfidf_matrix).flatten()
    # Get indices of top n similar documents
    similar_doc_indices = cosine_similarities.argsort()[-top_n:][::-1]
    similar_documents = [(idx, cosine_similarities[idx]) for idx in similar_doc_indices]
    return similar_documents

# Example usage
query = "This movie was average. "
query_vector = tfidf_vectorizer.transform([query])
similar_documents = find_similar_documents(query_vector, tfidf_matrix)

# Print the most similar documents
for i, (idx, similarity) in enumerate(similar_documents, 1):
    print(f"Similar Document {i}:")
    print(df['review'][idx])
    print()

Similar Document 1:
I read all these reviews on here about how this is a such a good movie. Jeez, this movie was predictable and pretty boring. The acting was below average most of the time, especially by Mckenna. I haven't seen a more pathetic attempt at making someone "badass" in a movie. Oh man, this movie was a letdown. I also read somewhere this might be a cult classic. I know there are followers of the director, but this movie was just a average piece of film.  The script was lame, for the most part the acting was lame, this movie was lame.  Oh and pray for the guy that used to be in Cheers. He looks really bad.   The best actor in this movie was probably the guy in Office Space, and he was only in this movie for about 8 minutes.  4/10

Similar Document 2:
no comment - stupid movie, acting average or worse... screenplay - no sense at all... SKIP IT!

Similar Document 3:
I can't believe this movie has an average rating of 7.0! It is a fiendishly bad movie, and I saw it when it was

### Reasoning behind my choice of similarity measure

Cosine similarity measures the cosine of the angle between two vectors in an inner product space, so it compares their direction rather than their magnitude. This makes it well suited to text analysis, where documents of very different lengths can still be about the same topic.
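
For reference, the standard definition for two vectors $u$ and $v$ (our notation, not taken from the code above) is:

$$\cos(\theta) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

The value lies between -1 and 1 in general. For TF-IDF vectors, whose entries are non-negative, it lies between 0 and 1.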

## Doc2Vec

### Question 3

#### Explain how the Doc2Vec model works and how it differs from the Word2Vec model.

Doc2Vec is a model that represents each document as a vector. It creates a vectorised representation of a group of words taken collectively as a single unit.

Doc2Vec has two main architectures: Distributed Memory (DM) and Distributed Bag of Words (DBOW). In DM, the model learns to predict a target word from its context words together with a vector for the document (identified by a unique document ID). In DBOW, the model ignores the context words and predicts words from the document using the document vector alone. DM is analogous to CBOW in Word2Vec, and DBOW is analogous to Skip-Gram.

In contrast, a Word2Vec model represents each word as a vector. While Word2Vec focuses on word-level semantics, Doc2Vec extends this concept to capture both word and document semantics.
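
The difference between the two Doc2Vec architectures can be sketched by listing the training examples each would see for one toy review. The helpers `dm_examples` and `dbow_examples` are hypothetical names of ours, not Gensim functions, and the sketch leaves out how the vectors are actually learned.

```python
def dm_examples(doc_id, tokens, window=1):
    """Distributed Memory: (doc_id, context words) -> target word."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield ((doc_id, context), target)


def dbow_examples(doc_id, tokens):
    """Distributed Bag of Words: doc_id alone -> each word in the document."""
    for word in tokens:
        yield (doc_id, word)


review = ["great", "acting", "overall"]
dm = list(dm_examples("doc_0", review))
dbow = list(dbow_examples("doc_0", review))

print(dm[1])  # (('doc_0', ['great', 'overall']), 'acting')
print(dbow)   # [('doc_0', 'great'), ('doc_0', 'acting'), ('doc_0', 'overall')]
```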

In [34]:
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import multiprocessing

# Load the dataset
df = pd.read_csv("IMDB_Dataset.csv")

# Tag each review (split on whitespace) with its row index as a string
tagged_corpus = [TaggedDocument(words=doc.split(), tags=[str(i)]) for i, doc in enumerate(df['review'])]

# Define hyperparameters to test
hyperparameters = [
    {'vector_size': 100, 'window': 5, 'min_count': 5, 'epochs': 10},
    {'vector_size': 150, 'window': 5, 'min_count': 5, 'epochs': 50}
]

# Define a function to train and evaluate the model with given hyperparameters
def train_and_evaluate_model(params, tagged_corpus):
    cores = multiprocessing.cpu_count()
    # Leave one core free, but always keep at least one worker thread
    model = Doc2Vec(vector_size=params['vector_size'], window=params['window'],
                    min_count=params['min_count'], workers=max(1, cores - 1), epochs=params['epochs'])
    model.build_vocab(tagged_corpus)
    model.train(tagged_corpus, total_examples=model.corpus_count, epochs=model.epochs)

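    # Collect the learned document vectors in row order (tags are str(i)), then split into training and testing sets
    doc_vectors = [model.dv[str(i)] for i in range(len(tagged_corpus))]
    X_train, X_test, y_train, y_test = train_test_split(doc_vectors, df['sentiment'], test_size=0.2, random_state=317)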

    # Train a logistic regression classifier
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)

    # Make predictions
    y_pred = classifier.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)

    return accuracy

# Train and evaluate models with different hyperparameters
results = []
for params in hyperparameters:
    accuracy = train_and_evaluate_model(params, tagged_corpus)
    results.append({'params': params, 'accuracy': accuracy})

# Print the results
for result in results:
    print(f"Hyperparameters: {result['params']} | Accuracy: {result['accuracy']}")

Hyperparameters: {'vector_size': 100, 'window': 5, 'min_count': 5, 'epochs': 10} | Accuracy: 0.8064
Hyperparameters: {'vector_size': 150, 'window': 5, 'min_count': 5, 'epochs': 50} | Accuracy: 0.8503


#### Question 3 Part 2

Document and explain your model's hyperparameters and the reasoning behind your choices.

Explanation of Hyperparameter Choices:
- Vector Size: 100 and 150 dimensions to balance semantic detail against dimensionality and computational cost. These are popular choices for Doc2Vec models. Larger vectors can capture more detailed relationships but require more computational power. 
- Window Size: 5, to capture the local context around each word within a movie review.
- Minimum Word Count: 5 to filter out rare words and improve model generalization. 
- Epochs: 10 and 50 epochs chosen to balance training duration against model convergence. Each Doc2Vec epoch is time-consuming because of the size of the dataset, and we found training slower than for Word2Vec. 

We also tried a larger configuration (vector_size=300, window=5, min_count=5, workers=-1, epochs=5000), but this took over 2 hours, most likely because of the 5000 epochs. Training more than 3 hyperparameter sets (models) also took far longer than training 2.
Interestingly, training the 2 models took around 9 minutes on one laptop but over 15 minutes on the other. 
Doc2Vec appears to need significantly more computational power than Word2Vec, so we could make only a limited number of long-running attempts to find suitable models and parameters.

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit-transform the review texts to TF-IDF vectors
tfidf_matrix = tfidf_vectorizer.fit_transform(df['review'])

# Get the TF-IDF vector for a given document (e.g., document at index 0)
given_document_index = 0
given_tfidf_vector = tfidf_matrix[given_document_index]

# Calculate the cosine similarity between the given TF-IDF vector and all other TF-IDF vectors
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(given_tfidf_vector, tfidf_matrix)

# Sort the similarities to find the most similar documents
most_similar_indices = similarities.argsort()[0][::-1]

# Print the most similar documents
num_similar_documents = 5
for i in range(1, num_similar_documents + 1):
    similar_document_index = most_similar_indices[i]
    similar_document = df.iloc[similar_document_index]['review']
    print(f"Similar Document {i}:")
    print(similar_document)
    print()


Similar Document 1:
One of the best TV shows out there, if not the best one. Why? Simple: it has guts to show us real life in prison, without any clichés and predictable twists. This is not Prison Break or any other show, actually comparing to Oz the show Sopranos look like story for children's. Profanity, cursing, shots of explicit violence and using drugs, disgusting scenes of male sexual organs and rapes... all this and more in Oz. But this is not the best part of Oz; the characters are the strongest point of this show; they're all excellent and not annoying, despite the fact we are looking at brutal criminals. The actors are excellent, my favorite are the actors who are playing Ryan O'Reilly and Tobias Beecher, because they're so unique and changing their behavior completely. And most of all... the don't have no remorse for their actions. Overall... Oz is amazing show, the best one out there. Forget about CSI and shows about stupid doctors... this is the deal... OZ!

Similar Docume

#### Question 4 Part 2

Explain the reasoning behind your choice of similarity measure.

Cosine similarity is a well-suited choice for measuring similarity with Word2Vec models because it is normalized and captures the direction of vectors. Word frequency and document length may vary, but normalization ensures that the orientation of the vectors matters rather than their magnitude. Because it is the inner product of the two vectors divided by the product of their lengths, cosine similarity captures semantic similarity: words with similar meanings point in similar directions. In addition, cosine similarity is computationally efficient and can thus be used with large datasets. 
As cosine similarity indicates how close words or documents are while normalizing away differences in length, it is an excellent choice for both Word2Vec and Doc2Vec. 
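
The point about document length can be checked on a toy example. The sketch below uses plain word counts rather than our trained embeddings, and the vocabulary and reviews are invented for illustration. A review repeated three times keeps a cosine similarity of 1 with the original, while the Euclidean distance between the two grows.

```python
import math
from collections import Counter


def count_vector(tokens, vocabulary):
    """Bag-of-words counts in a fixed vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


vocabulary = ["good", "movie", "boring"]
short_review = ["good", "movie"]
long_review = short_review * 3  # same wording, three times as long

u = count_vector(short_review, vocabulary)  # [1, 1, 0]
v = count_vector(long_review, vocabulary)   # [3, 3, 0]

print(round(cosine(u, v), 6))     # 1.0
print(round(euclidean(u, v), 3))  # 2.828
```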

**Q5: Explore the properties of subspaces and their relationship to vector spaces.**

**Explain the properties of subspaces and their relationship to vector spaces.**
    


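A subspace is a vector space that is contained within another vector space: it is a subset that contains the zero vector and is closed under vector addition and scalar multiplication. So every subspace is a vector space in its own right, but it is also defined relative to some other (larger) vector space.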

**Provide an example of a subspace and its relationship to a vector space.**

In the vector space V = R^3 (the real coordinate space over the field R of real numbers), take W to be the set of all vectors in V whose last component is 0. Then W is a subspace of V.

Proof:

    The zero vector (0, 0, 0) has last component 0, so it is an element of W.
    Given u and v in W, they can be expressed as u = (u1, u2, 0) and v = (v1, v2, 0). Then u + v = (u1+v1, u2+v2, 0+0) = (u1+v1, u2+v2, 0). Thus, u + v is an element of W, too.
    Given u in W and a scalar c in R, if u = (u1, u2, 0) again, then cu = (cu1, cu2, c·0) = (cu1, cu2, 0). Thus, cu is an element of W too.
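
The closure properties used in the proof can also be spot-checked numerically. This is only a sanity check on randomly sampled integer vectors, not a substitute for the proof, and the helper names are our own.

```python
import random


def in_W(v):
    """W = vectors in R^3 whose last component is 0."""
    return len(v) == 3 and v[2] == 0


def add(u, v):
    return tuple(a + b for a, b in zip(u, v))


def scale(c, u):
    return tuple(c * a for a in u)


random.seed(0)
closed = True
for _ in range(1000):
    u = (random.randint(-9, 9), random.randint(-9, 9), 0)
    v = (random.randint(-9, 9), random.randint(-9, 9), 0)
    c = random.randint(-9, 9)
    closed = closed and in_W(add(u, v)) and in_W(scale(c, u))

print(in_W((0, 0, 0)))  # True: W contains the zero vector
print(closed)           # True: no sampled counterexample to closure
print(in_W((1, 2, 1)))  # False: a vector of R^3 that lies outside W
```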

**Q6 Define a static word vector and a dynamic word vector. Explain the differences between the two types of word vectors.**

Static vector: A fixed representation of a word in a vector space. Once training is complete, each word has a single vector that is the same in every context in which the word appears (e.g., Word2Vec and GloVe vectors).

Dynamic vector: A word representation in vector space that is context-dependent, so the same word receives different vectors in different sentences (e.g., ELMo and BERT).

Differences:
Static word vectors consider only semantics, while dynamic vectors consider both semantics and pragmatics. Static vectors are fixed and can be reused across multiple tasks without modification, while dynamic word vectors are more adaptive across contexts. Dynamic word vectors are thus better suited than static representations for tasks that depend on contextual information.
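
A toy sketch of the distinction follows. The 2-dimensional embeddings are invented for illustration, and the `contextual_vector` function is only a stand-in: real contextual models use deep networks, whereas here we simply mix a word's vector with the average of its neighbours to show that the output changes with the sentence.

```python
# Hypothetical 2-dimensional embeddings, invented purely for illustration.
STATIC = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def static_vector(tokens, index):
    """Static: the vector depends only on the word itself."""
    return STATIC[tokens[index]]


def contextual_vector(tokens, index):
    """Toy context-dependent vector: half the word, half the mean of its neighbours.

    Assumes the sentence has at least two tokens.
    """
    own = STATIC[tokens[index]]
    others = [STATIC[t] for i, t in enumerate(tokens) if i != index]
    context = [sum(dim) / len(others) for dim in zip(*others)]
    return [0.5 * a + 0.5 * b for a, b in zip(own, context)]


s1 = ["river", "bank"]
s2 = ["money", "bank"]

print(static_vector(s1, 1) == static_vector(s2, 1))  # True: same vector for "bank"
print(contextual_vector(s1, 1))                      # [0.75, 0.25]
print(contextual_vector(s2, 1))                      # [0.25, 0.75]
```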





