**Submitted By:**

**Name: Anoushka Mergoju**

**SUID: 328542442**

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/glove6b100dtxt/glove.6B.100d.txt
/kaggle/input/googlenewsvectorsnegative300/GoogleNews-vectors-negative300.bin.gz
/kaggle/input/googlenewsvectorsnegative300/GoogleNews-vectors-negative300.bin


In [3]:
!pip install nltk
!pip install gensim



**Part 1: Training a Word2Vec Model on the NLTK Movie Review Corpus**

In [4]:
# Import necessary libraries and modules
import gensim
import nltk
from nltk.corpus import movie_reviews
from gensim.models import Word2Vec
import logging

# Download nltk movie reviews corpus
nltk.download('movie_reviews')

# Re-bind the corpus reader (equivalent to the import above)
movie_reviews = nltk.corpus.movie_reviews

# Enable logging to monitor training
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Prepare data: List of lists of words
sentences = [list(movie_reviews.words(fileid)) for fileid in movie_reviews.fileids()]

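# Train the Word2Vec model: 100-dimensional vectors, a context window of 5 words,
# and words that occur fewer than 2 times are dropped from the vocabulary
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)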

# Save the model for later use
model.save("movie_reviews_word2vec.model")
print("Model training completed and saved.")

[nltk_data] Error loading movie_reviews: <urlopen error [Errno -3]
[nltk_data]     Temporary failure in name resolution>
Model training completed and saved.


**Part 2: Testing the Model**

*Test 1: Find Top 5 Similar Words*

In [5]:
# Load the model
model = Word2Vec.load("movie_reviews_word2vec.model")

test_words = ["movie", "star", "computer", "science", "king", "queen", "man", "woman", "dog", "cat"]
for word in test_words:
    try:
        similar_words = model.wv.most_similar(word, topn=5)
        print(f"Words most similar to '{word}': {similar_words}")
    except KeyError:
        print(f"The word '{word}' is not in the vocabulary.")


Words most similar to 'movie': [('film', 0.9487786293029785), ('picture', 0.835985541343689), ('sequel', 0.7876085042953491), ('ending', 0.7412267923355103), ('story', 0.7386842966079712)]
Words most similar to 'star': [('classic', 0.8083040714263916), ('trilogy', 0.8051865100860596), ('wars', 0.7903246879577637), ('episode', 0.7891380786895752), ('witch', 0.7834601998329163)]
Words most similar to 'computer': [('generated', 0.8773682713508606), ('rom', 0.806863009929657), ('plots', 0.8039576411247253), ('domain', 0.8026084899902344), ('sub', 0.7935991287231445)]
Words most similar to 'science': [('fiction', 0.93857741355896), ('pulp', 0.9123406410217285), ('horror', 0.8692793846130371), ('slasher', 0.8201507329940796), ('classic', 0.8167328238487244)]
Words most similar to 'king': [('jerry', 0.8562236428260803), ('captain', 0.85112065076828), ('edward', 0.8398842215538025), ('george', 0.8381919860839844), ('jackson', 0.8328857421875)]
Words most similar to 'queen': [('amidala', 0.9296
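
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that ranking, the block below recomputes it with NumPy. The vectors in `toy_vectors` and the helper names `cosine_similarity` and `top_n_similar` are hypothetical, invented for illustration; they are not taken from the trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def top_n_similar(word, vectors, n=5):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is never returned
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical 3-dimensional vectors, invented for illustration only
toy_vectors = {
    "movie": [0.9, 0.1, 0.0],
    "film": [0.8, 0.2, 0.0],
    "sequel": [0.6, 0.5, 0.1],
    "dog": [0.0, 0.1, 0.9],
}

print(top_n_similar("movie", toy_vectors, n=2))
```

With these toy vectors, "film" ranks first and "sequel" second, because their directions are closest to that of "movie".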

*Test 2: Analogy Game*

In [6]:
analogies = [
    ("man", "woman", "king"),
    ("paris", "france", "berlin"),
    ("dog", "puppy", "cat"),
    ("man", "woman", "programmer")
]

for a, b, c in analogies:
    try:
        result = model.wv.most_similar(positive=[c, b], negative=[a], topn=1)
        print(f"'{a}' is to '{b}' as '{c}' is to '{result[0][0]}'")
    except KeyError:
        print(f"An error occurred with the words: {a}, {b}, {c}")


'man' is to 'woman' as 'king' is to 'jay'
'paris' is to 'france' as 'berlin' is to '1963'
'dog' is to 'puppy' as 'cat' is to 'impersonal'
'man' is to 'woman' as 'programmer' is to 'fong'
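
The call `most_similar(positive=[c, b], negative=[a])` searches for the word whose vector lies closest to `c + b - a`, excluding the three input words. The sketch below reproduces that arithmetic on unit-normalized vectors with NumPy. The vectors in `toy_vectors` and the helpers `unit` and `solve_analogy` are hypothetical and only approximate what the library does internally.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1 so that dot products become cosines."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def solve_analogy(a, b, c, vectors):
    """Return the word closest to (c + b - a), skipping the three inputs."""
    target = unit(unit(vectors[c]) + unit(vectors[b]) - unit(vectors[a]))
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # input words are excluded from the candidates
        score = float(np.dot(target, unit(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

# Hypothetical vectors: axis 0 = "male", axis 1 = "female", axis 2 = "royal"
toy_vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.0, 0.0, 1.0],
}

word, score = solve_analogy("man", "woman", "king", toy_vectors)
print(f"'man' is to 'woman' as 'king' is to '{word}'")
```

The analogy only resolves cleanly when the relationship is encoded as a consistent offset between vectors, which a small corpus may not provide.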


**Part 3: Using Pre-trained Google News Model**

In [9]:
from gensim.models import KeyedVectors

# Load Google News model
google_model_path = '/kaggle/input/googlenewsvectorsnegative300/GoogleNews-vectors-negative300.bin'
google_model = KeyedVectors.load_word2vec_format(google_model_path, binary=True)

# Repeat the tests
for word in test_words:
    similar_words = google_model.most_similar(word, topn=5)
    print(f"Words most similar to '{word}' with Google News: {similar_words}")

for a, b, c in analogies:
    result = google_model.most_similar(positive=[c, b], negative=[a], topn=1)
    print(f"'{a}' is to '{b}' as '{c}' is to '{result[0][0]}' with Google News")


Words most similar to 'movie' with Google News: [('film', 0.8676770329475403), ('movies', 0.8013108372688293), ('films', 0.7363011837005615), ('moive', 0.6830360889434814), ('Movie', 0.6693680286407471)]
Words most similar to 'star' with Google News: [('stars', 0.7763954997062683), ('superstar', 0.7340598702430725), ('starlet', 0.6381064057350159), ('megastar', 0.6165120005607605), ('heart_throb', 0.5726701617240906)]
Words most similar to 'computer' with Google News: [('computers', 0.7979379892349243), ('laptop', 0.6640493273735046), ('laptop_computer', 0.6548868417739868), ('Computer', 0.647333562374115), ('com_puter', 0.6082080006599426)]
Words most similar to 'science' with Google News: [('faith_Jezierski', 0.6965422034263611), ('sciences', 0.6821076273918152), ('biology', 0.6775783896446228), ('scientific', 0.6535001993179321), ('mathematics', 0.6300910115242004)]
Words most similar to 'king' with Google News: [('kings', 0.7138045430183411), ('queen', 0.6510956883430481), ('monarc

**Part 4: Using Pre-trained GloVe Model**

In [12]:
def glove_to_word2vec(glove_input_file, word2vec_output_file):
    from gensim.scripts.glove2word2vec import glove2word2vec
    glove2word2vec(glove_input_file, word2vec_output_file)

# Call the function with the appropriate file paths
glove_input_file = '/kaggle/input/glove6b100dtxt/glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.w2vformat.txt'
glove_to_word2vec(glove_input_file, word2vec_output_file)

  glove2word2vec(glove_input_file, word2vec_output_file)
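
The conversion step exists because the GloVe text file has no header, while the word2vec text format starts with a line giving the number of vectors and their dimensionality. The sketch below shows that difference using only the standard library. The function `add_word2vec_header` and the two-line demo file are hypothetical and are not the gensim implementation. Recent gensim releases (4.x) also document `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)` as a way to load GloVe text vectors directly, which may make the conversion unnecessary.

```python
import os
import tempfile

def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, prepending the '<count> <dim>' header line."""
    count, dim = 0, 0
    # First pass: count the vectors and read the dimensionality from line 1
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if count == 0:
                dim = len(line.rstrip().split(" ")) - 1  # minus the word itself
            count += 1
    # Second pass: write the header, then the vectors unchanged
    with open(glove_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        dst.write(f"{count} {dim}\n")
        for line in src:
            dst.write(line)
    return count, dim

# Tiny hypothetical GloVe-style file, written to a temporary directory
tmp_dir = tempfile.mkdtemp()
glove_demo = os.path.join(tmp_dir, "glove_demo.txt")
w2v_demo = os.path.join(tmp_dir, "glove_demo.w2v.txt")
with open(glove_demo, "w", encoding="utf-8") as f:
    f.write("movie 0.1 0.2 0.3\n")
    f.write("film 0.2 0.1 0.3\n")

print(add_word2vec_header(glove_demo, w2v_demo))
```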


In [15]:
from gensim.models import KeyedVectors

# Load the converted model
glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)




In [14]:
test_words = ["movie", "star", "computer", "science", "king", "queen", "man", "woman", "dog", "cat"]
for word in test_words:
    try:
        similar_words = glove_model.most_similar(word, topn=5)
        print(f"Words most similar to '{word}': {similar_words}")
    except KeyError:
        print(f"The word '{word}' is not in the vocabulary.")

analogies = [
    ("man", "woman", "king"),
    ("paris", "france", "berlin"),
    ("dog", "puppy", "cat"),
    ("man", "woman", "programmer")
]

for a, b, c in analogies:
    try:
        result = glove_model.most_similar(positive=[c, b], negative=[a], topn=1)
        print(f"'{a}' is to '{b}' as '{c}' is to '{result[0][0]}'")
    except KeyError:
        print(f"An error occurred with the words: {a}, {b}, {c}")


Words most similar to 'movie': [('film', 0.9055121541023254), ('movies', 0.8959327340126038), ('films', 0.866355299949646), ('hollywood', 0.8239826560020447), ('comedy', 0.8141382932662964)]
Words most similar to 'star': [('stars', 0.8661765456199646), ('superstar', 0.728345513343811), ('movie', 0.6531304717063904), ('legend', 0.6483872532844543), ('actor', 0.6472946405410767)]
Words most similar to 'computer': [('computers', 0.8751984238624573), ('software', 0.8373122215270996), ('technology', 0.7642159461975098), ('pc', 0.7366448640823364), ('hardware', 0.7290390729904175)]
Words most similar to 'science': [('sciences', 0.8073161244392395), ('physics', 0.7914698123931885), ('institute', 0.7663252353668213), ('mathematics', 0.7607672810554504), ('studies', 0.7590447664260864)]
Words most similar to 'king': [('prince', 0.7682328820228577), ('queen', 0.7507690787315369), ('son', 0.7020888328552246), ('brother', 0.6985775232315063), ('monarch', 0.6977890729904175)]
Words most similar to 

**Part 5: Model Comparison and Analysis**

Given the output from the three different models (trained on the NLTK Movie Review corpus, the pre-trained Google News Word2Vec, and the pre-trained GloVe model), let's analyze the performance of each based on the results:

#### 1. **GloVe Model**
   - **Similarity Task**: The GloVe model generally returns neighbours that match our semantic expectations. For "movie" it lists the close variants "film", "movies", and "films" alongside the broader associations "hollywood" and "comedy".
   - **Analogy Task**: It resolves the standard analogy correctly ("man" is to "woman" as "king" is to "queen"). It falters on the less straightforward "dog" is to "puppy" as "cat" is to ?, predicting "puppies" instead of "kitten".

#### 2. **Google News Word2Vec**
   - **Similarity Task**: This model returns accurate and contextually relevant neighbours, reflecting its training on a vast and diverse dataset. However, it includes some noise: "moive" (a misspelling of "movie" picked up from the training text) and the case variant "Movie" both appear among the neighbours of "movie".
   - **Analogy Task**: It performs very well on standard analogies, with predictions that closely match the expected results, which suggests it handles frequently encountered real-world relationships effectively.

#### 3. **NLTK Movie Reviews Word2Vec**
   - **Similarity Task**: This model's outputs are more narrowly focused around movie-related content, which is a reflection of its training data. For example, the word "science" brings up genre-related words like "fiction" and "horror."
   - **Analogy Task**: It struggles with analogies and returns unrelated words, such as "jay" for "man" is to "woman" as "king" is to ?, and "1963" for "paris" is to "france" as "berlin" is to ?. This is likely due to the limited and highly specialized training data.
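
One way to probe the "limited training data" explanation is to count how often each analogy word occurs in the corpus. Words below `min_count` are dropped entirely, and words just above it receive few training updates. The sketch below is a hedged illustration on a hypothetical toy corpus; `corpus_support` is an invented helper, and the real counts would come from the `sentences` list built in Part 1.

```python
from collections import Counter

def corpus_support(words, sentences, min_count=2):
    """Map each word to (frequency, kept_in_vocabulary) for a tokenized corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {w: (counts[w], counts[w] >= min_count) for w in words}

# Hypothetical toy corpus, invented for illustration only
toy_sentences = [
    ["the", "king", "of", "the", "movie"],
    ["the", "movie", "was", "a", "king", "sized", "hit"],
    ["berlin", "is", "a", "setting"],
]

print(corpus_support(["king", "berlin", "paris"], toy_sentences, min_count=2))
```

In this toy corpus "king" survives the cutoff, while "berlin" (seen once) and "paris" (never seen) would not get a vector at all.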

### Which Model Performs Best?

- **For General Language Understanding and Breadth**: The **Google News Word2Vec model** stands out as the most robust across a wide variety of common language tasks, owing to its training on a large and diverse corpus. It returns relevant nearest neighbours and resolves analogies with a high degree of reliability.

- **For Context-Specific Tasks (Movies and Entertainment)**: The **NLTK Movie Reviews Word2Vec model** would be preferred if the domain of interest is strictly entertainment-related content, as it might capture nuanced sentiments or jargon specific to movie reviews better than the others.

- **For a Balance of Nuance and Semantic Depth**: The **GloVe model** appears to strike a balance between breadth and depth. Its vectors are learned from global co-occurrence statistics, and its neighbours mix close variants with broader topical associations, which is useful for tasks that depend on word relationships rather than surface forms.
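
The comparison above is qualitative. A rough way to quantify it is top-1 analogy accuracy, sketched below. The expected answers "queen" and "kitten" are the targets named in the analysis; "germany" is an assumption added for illustration. `analogy_accuracy` and `lookup_predict` are hypothetical helpers, and the answers scored here are the ones the movie-review model printed in Test 2.

```python
def analogy_accuracy(predict, cases):
    """Fraction of (a, b, c, expected) cases where predict(a, b, c) == expected."""
    if not cases:
        return 0.0
    hits = sum(1 for a, b, c, expected in cases if predict(a, b, c) == expected)
    return hits / len(cases)

# (a, b, c, expected); "germany" is an assumed target, not from the notebook
cases = [
    ("man", "woman", "king", "queen"),
    ("paris", "france", "berlin", "germany"),
    ("dog", "puppy", "cat", "kitten"),
]

# Answers printed by the movie-review model in Test 2
movie_review_answers = {
    ("man", "woman", "king"): "jay",
    ("paris", "france", "berlin"): "1963",
    ("dog", "puppy", "cat"): "impersonal",
}

def lookup_predict(a, b, c):
    """Stand-in predictor that replays the recorded answers."""
    return movie_review_answers[(a, b, c)]

print(analogy_accuracy(lookup_predict, cases))  # 0.0
```

For a live model, `predict` could wrap the call used earlier in the notebook, for example `model.wv.most_similar(positive=[c, b], negative=[a], topn=1)[0][0]`. Three cases are far too few for a reliable score, so this is only a starting point.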

### Conclusion

The choice of the best model depends on the specific needs of the application:
- **Google News Word2Vec** is ideal for general-purpose NLP tasks.
- **GloVe** offers deep semantic insights suitable for nuanced NLP applications.
- **NLTK Movie Reviews Word2Vec** is best suited to domain-specific contexts where movie-related content is predominant.

Each model's strengths and limitations reflect the nature of its training data and architecture, underscoring the importance of model selection based on the target application's specific requirements.