1. [20 pts] In this assignment, we will approach the problem with Word2Vec and contextual analysis of keywords towards sentiment/category processing in our pipeline.
Generate a gensim model of movie reviews. Use any parameters you like while answering the questions (2.) and (3.) below.
Report the size of the vocabulary and characteristics of the gensim model, such as the number of mapping dimensions, etc.

In [1]:
import re
from nltk.corpus import stopwords
import nltk
def ie_preprocess(document):
    document = re.sub('<br />', '', document)
    document = re.sub(r'[^\w\s]', ' ', document)
    sentences = [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(document)]
    stop_words = set(stopwords.words('english'))
    return [[word for word in sent if word.lower() not in stop_words] for sent in sentences][0]

In [2]:
import csv
import pandas as pd
path='./movie_data.csv'
df=pd.read_csv(path)

In [3]:
df['review']=df['review'].apply(ie_preprocess)

In [4]:
import nltk
import gensim
print(f'gensim version= {gensim.__version__}')
from gensim.models import Word2Vec

from nltk.corpus import abc

sents = list(df['review'])

model = Word2Vec(df['review'], min_count=2, workers=4)
X = list(model.wv.index_to_key)


print(f'gensim model vocabulary has {len(X)} words mapped to N= {model.vector_size} dimensions')

gensim version= 4.3.2
gensim model vocabulary has 77747 words mapped to N= 100 dimensions


2. [20 pts] Generate the contexts for the following keywords:
(a.) melancholy
(b.) ghastly
(c.) lackluster
(d.) romantic

In [5]:
model.wv.most_similar('melancholy'),

([('powerfully', 0.8492774963378906),
  ('wistful', 0.8488004207611084),
  ('elegance', 0.8439174294471741),
  ('lyrical', 0.8364806175231934),
  ('sensual', 0.8362995386123657),
  ('dreamy', 0.8284203410148621),
  ('tenderness', 0.8245759606361389),
  ('enchanting', 0.8214840888977051),
  ('evocative', 0.8206413984298706),
  ('affecting', 0.8191242218017578)],)

In [6]:
model.wv.most_similar('ghastly')

[('transparent', 0.7996708154678345),
 ('rubbery', 0.7784503102302551),
 ('thudding', 0.7772441506385803),
 ('naff', 0.7730329632759094),
 ('flabby', 0.7718597650527954),
 ('soulless', 0.7715681195259094),
 ('pastel', 0.7656320929527283),
 ('grotesquely', 0.7638882994651794),
 ('ogre', 0.7606737017631531),
 ('greenish', 0.759727418422699)]

In [7]:
model.wv.most_similar('lackluster')

[('uninspired', 0.8729032278060913),
 ('uneven', 0.8582224249839783),
 ('dismal', 0.845085620880127),
 ('uninspiring', 0.841022789478302),
 ('pedestrian', 0.8365830779075623),
 ('leaden', 0.8307527899742126),
 ('turgid', 0.8289234042167664),
 ('lethargic', 0.817794144153595),
 ('stilted', 0.8150004148483276),
 ('amateurish', 0.809618353843689)]

In [8]:
model.wv.most_similar('romantic')

[('romance', 0.7989781498908997),
 ('screwball', 0.7238531112670898),
 ('tender', 0.6482086777687073),
 ('poignant', 0.6228998303413391),
 ('touching', 0.6206278800964355),
 ('bittersweet', 0.6038159132003784),
 ('sentimental', 0.5991921424865723),
 ('reaffirming', 0.5954933762550354),
 ('frothy', 0.5837821364402771),
 ('uplifting', 0.5831743478775024)]

3. [20 pts] Group the reviews into two by the original ground truth of sentiments and repeat question (2.) by generating a positive and a negative gensim model. Report the contexts (4 positive and 4 negative).

In [9]:

def different_sentiment(df, sentiment):
    reviews = df[df['sentiment'] == sentiment]['review']
    model = Word2Vec(reviews, min_count=2, workers=4)
    X = list(model.wv.index_to_key)

    print(f'gensim model vocabulary has {len(X)} words mapped to N= {model.vector_size} dimensions')
    print(model.wv.most_similar('melancholy'))
    print(model.wv.most_similar('ghastly'))
    print(model.wv.most_similar('lackluster'))
    print(model.wv.most_similar('romantic'))

# Assuming df is your DataFrame
different_sentiment(df, 0)
different_sentiment(df, 1)

gensim model vocabulary has 55169 words mapped to N= 100 dimensions
[('Seberg', 0.9365330338478088), ('enigmatic', 0.9292488694190979), ('Bergen', 0.9209020733833313), ('Irit', 0.9208400249481201), ('Eggar', 0.9208371639251709), ('substituting', 0.9200350046157837), ('mannered', 0.9186382293701172), ('Binoche', 0.9186217188835144), ('bohemian', 0.9180573225021362), ('depressives', 0.9179893732070923)]
[('chintzy', 0.9380182027816772), ('excessively', 0.9327852129936218), ('BLUE', 0.9279466867446899), ('Liotti', 0.9270133376121521), ('FLIES', 0.9209648966789246), ('stagy', 0.9208546876907349), ('plethora', 0.9195451736450195), ('BEATS', 0.9194968938827515), ('ranging', 0.9188086986541748), ('slabs', 0.9185315370559692)]
[('uninspired', 0.9211742281913757), ('uneven', 0.9176979660987854), ('lacklustre', 0.9164472818374634), ('leaden', 0.8993801474571228), ('pedestrian', 0.8918802738189697), ('unremarkable', 0.8904005289077759), ('dismal', 0.8877403736114502), ('unimpressive', 0.879656851

4. [20 pts] Comment about similarities and differences in (3.). Any comments on why romantic context was not affected?

One Similarity is that uninspired is the leading similarity for both sentiment 0 and the combined sentiments, which makes sense, since that kind of word is most commonly used within the context of negativity. Sensual is also shared in the positive sentiment for melancholy, as well as tenderness.

One of the reasons that romance is not affected by the sentiment is likely due to the fact that there is no difference in the context that these words are presented. For example, while melancholy might change depending on the sentiment, since melancholy would have multiple examples in the same semantic context as melancholy in sentiment 0, there would be a lot of similar words within that context. 

Since there may be a lack of examples for melancholy in sentiment 1, the words the words with the highest similarity scores might be inappropriate examples.

5. [20 pts] Read the following paper:
Maas, Andrew L., et al. "Learning word vectors for sentiment analysis." Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1. Association for Computational Linguistics, 2011.
Comment about and/or align your results in this homework.

In the paper, there was discussion of the differences of semantics and sentiment. Expressive content and Semantic Content are distinct. Because of this, we see that our model attempts to capture only the semantic meaning of a word, especially when it appears in the same contexts in terms of vector represenation but fails to capture the sentimental meaning of a word. Since words like romance have more of a neutral sentiment, as well as context, this is likely why it succeeds regardless of the sentiment group.

In the paper, there was discussion of the differences of semantics and sentiment. Expressive content and Semantic Content are distinct. Because of this, we see that our model attempts to capture only the semantic meaning of a word, especially when it appears in the same contexts in terms of vector represenation but fails to capture the sentimental meaning of a word. Since words like romance have more of a neutral sentiment, this is likely why it succeeds regardless of the sentiment group. This is also why, for instance, the model fails to capture