<a href="https://colab.research.google.com/github/Love1117/Machine_learning-Projects/blob/main/Machine_Learning%20Project/04_NLP%20Projects/Word2Vec%20Embeddings/Building%20Words%20with%20Similar%20meaning/Words_with_Similar_meaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Project Summary: Building Word Similarity Model Using Gensim (Word2Vec) on Netflix Reviews**

##**Overview**
This project builds a word similarity model using Gensim's Word2Vec implementation. The dataset contains Netflix user reviews, and only the review text column (`content`) was used to train the model. After cleaning and preprocessing, the model learned meaningful semantic relationships between words.

The trained model demonstrates high similarity scores for related words and low similarity scores for unrelated terms, confirming that the embedding space captures contextual meanings effectively.



##**Aim of the Project**
The primary aim of this project is to:

Learn distributed word representations using Word2Vec.

Capture semantic relationships and context-based meaning between words.

Build a model capable of identifying how closely related two words are, based on learned embeddings from real-world user reviews.

##**Import libraries for my project**

In [None]:
from google.colab import drive
drive.mount("/content/drive")
!pip install gensim

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [None]:
import gensim
import pandas as pd

##**Loading Dataset from drive**

In [None]:
df = pd.read_csv("/content/drive/My Drive/Text Data/netflix_reviews.csv")
df[:10]

Unnamed: 0,reviewId,userName,content,score,thumbsUpCount,reviewCreatedVersion,at,appVersion
0,21dda6fd-b840-4a9c-9bf5-91eef444aec2,Fidel Hitez,awesome,5,0,,2025-11-19 10:17:19,
1,ba557ee1-ec49-4f93-bdf9-64ba175d1dfc,Thomas Auntie,good 😊😊,5,0,,2025-11-19 10:16:33,
2,05ed80f7-2350-434e-a247-f012eecdc04b,Eagle eagle,Bad customer service and can't solve the issues,1,0,9.42.0 build 8 63616,2025-11-19 10:10:48,9.42.0 build 8 63616
3,2eb7a262-bad2-4a48-92b2-a1d61ef204a5,Shuha Shabnam,👍👍👍👍👍👍🇧🇩🇧🇩😘😘,5,0,8.143.5 build 19 52400,2025-11-19 09:28:26,8.143.5 build 19 52400
4,e3dcd6c4-004a-407f-8564-c0c1b74bc1d9,Morne Munnik,I love the movies,5,0,9.41.0 build 8 63576,2025-11-19 09:22:10,9.41.0 build 8 63576
5,88857e48-b326-4950-9373-9a3d0f88a77e,Shaista,GTA San Andreas 👌,5,0,,2025-11-19 09:11:53,
6,6335cfb4-6512-46e4-baa4-55dd77838b54,shikhar jain,I have made the payment but still it has not r...,1,0,9.42.0 build 8 63616,2025-11-19 09:01:51,9.42.0 build 8 63616
7,7a59652c-5d4e-43f1-94f9-11a1c711b3e8,Kathrine Hernandez,If you take supernatural off Netflix I'm comin...,2,0,9.42.0 build 8 63616,2025-11-19 08:52:43,9.42.0 build 8 63616
8,8c772e3b-11be-45b0-9463-b084d7cc2594,l—îg—îndŒ±r—á gŒ±m√≠ng,into the dead 2game game after update not play...,3,0,,2025-11-19 08:40:04,
9,2bdc17e4-bfbe-42ba-9bd1-acf2a53ced50,Kelvin Kuik,the app keeps crashing after signing in,1,0,9.42.0 build 8 63616,2025-11-19 08:28:20,9.42.0 build 8 63616


##**Checking whether the dataset has null values**

In [None]:
df_reviews = df['content']
print(f"Num of null values: {df_reviews.isna().sum()}")

Num of null values: 6


##**Drop null values**

In [None]:
# Drop missing reviews into a new Series; reviews_text is used in the cells below
reviews_text = df_reviews.dropna()
print(f"Num of null values: {reviews_text.isna().sum()}")


Num of null values: 0


In [None]:
len(reviews_text)

142641

##**Preprocessing text data using the simple_preprocess function from gensim**

In [None]:
processed_reviews = reviews_text.apply(gensim.utils.simple_preprocess)
processed_reviews

Unnamed: 0,content
0,[awesome]
1,[good]
2,"[bad, customer, service, and, can, solve, the,..."
3,[]
4,"[love, the, movies]"
...,...
142642,"[really, like, it, there, are, so, many, movie..."
142643,"[love, netflix, always, enjoy, my, time, using..."
142644,"[sound, quality, is, very, slow, of, movies]"
142645,"[rate, is, very, expensive, bcos, we, see, net..."
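To make the tokenization step more concrete, here is a rough standard-library approximation of what `simple_preprocess` does with its default settings: lowercase the text, keep runs of alphabetic characters, and discard tokens shorter than 2 or longer than 15 characters. The helper name `approx_simple_preprocess` is hypothetical, and the regex is a simplification rather than gensim's actual implementation.

```python
import re

# Runs of word characters that are not digits (a simplification of gensim's tokenizer).
TOKEN_PATTERN = re.compile(r"[^\W\d]+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, and drop very short or very long ones."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]


print(approx_simple_preprocess("Bad customer service and can't solve the issues"))
print(approx_simple_preprocess("I love the movies"))
print(approx_simple_preprocess("12345 !!!"))
```

This is why a review made up only of emoji, digits, or punctuation becomes an empty list, and why the single letter left over from splitting "can't" disappears.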


##**Building my gensim model**

In [None]:
model = gensim.models.Word2Vec(window=5, min_count=2, workers=3)

model.build_vocab(processed_reviews, progress_per=1000)
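As a small illustration of what `min_count=2` means during vocabulary building, the sketch below counts tokens in a toy corpus and keeps only those that appear at least twice. The toy data and variable names are made up for this example; gensim's own vocabulary building does more than this (for instance, it also prepares frequency-based downsampling).

```python
from collections import Counter

# Toy tokenized corpus standing in for processed_reviews (illustrative data only).
toy_reviews = [
    ["love", "netflix"],
    ["love", "the", "movies"],
    ["the", "app", "keeps", "crashing"],
]

MIN_COUNT = 2  # same threshold as the model above

# Count how often each token occurs across all reviews.
counts = Counter(token for review in toy_reviews for token in review)

# Keep only tokens that reach the threshold, sorted for a stable display.
vocab = sorted(word for word, n in counts.items() if n >= MIN_COUNT)
print(vocab)
```

Words seen only once, such as rare misspellings, never receive a vector, which keeps the vocabulary smaller and the embeddings less noisy.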

##**Train my model: by default, the number of epochs is set to 5**

In [None]:
print(model.epochs)
model.train(processed_reviews, total_examples= model.corpus_count, epochs=model.epochs)

5


(12978397, 17842260)
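Reading the returned tuple as gensim's (effective words trained, raw words seen) pair, which is an assumption about the return value rather than something this notebook states, a little arithmetic shows how much text each epoch covered and what share of it was actually used.

```python
# Values copied from the training output above.
effective_words, raw_words = 12978397, 17842260
epochs = 5

# Raw tokens seen in a single pass over the corpus.
raw_words_per_epoch = raw_words // epochs

# Share of raw tokens that actually contributed to training.
kept_fraction = effective_words / raw_words

print(raw_words_per_epoch)
print(round(kept_fraction, 3))
```

Under that reading, the gap between the two numbers would come from tokens the model skips, such as words below `min_count` and very frequent words that are randomly downsampled.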

In [None]:
model.save("/content/drive/My Drive/Text Data/netflix_text_reviews.model")

In [None]:
loaded_model = gensim.models.Word2Vec.load("/content/drive/My Drive/Text Data/netflix_text_reviews.model")
print(loaded_model)

Word2Vec<vocab=19923, vector_size=100, alpha=0.025>


##**Experimenting with my model to see similar words**

In [None]:
model.wv.most_similar("good")


In [None]:
model.wv.most_similar("expensive")

[('costly', 0.7782313823699951),
 ('pricey', 0.7511305809020996),
 ('cheap', 0.6899311542510986),
 ('affordable', 0.6469461917877197),
 ('overpriced', 0.6373260021209717),
 ('complicated', 0.6245598196983337),
 ('pricy', 0.6105921268463135),
 ('greedy', 0.6021075248718262),
 ('outrageous', 0.6019524931907654),
 ('unfair', 0.5734629034996033)]

In [None]:
model.wv.most_similar("bad")

[('poor', 0.6944313049316406),
 ('disappointing', 0.6535304188728333),
 ('pathetic', 0.5715718269348145),
 ('disgusting', 0.5678569078445435),
 ('good', 0.5593602061271667),
 ('dissapointing', 0.5589182376861572),
 ('inconvenient', 0.552000880241394),
 ('terrible', 0.5492978692054749),
 ('dumb', 0.5487064719200134),
 ('sad', 0.5447703003883362)]
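For intuition about how a nearest-neighbour lookup of this kind can work, the sketch below ranks words by cosine similarity over a handful of made-up 2-dimensional vectors. The vectors, `cosine`, and `toy_most_similar` are all hypothetical stand-ins, not gensim's implementation, which operates on the learned 100-dimensional vectors.

```python
import math

# Toy 2-dimensional vectors; the trained model uses 100 dimensions.
toy_vectors = {
    "expensive": [0.9, 0.1],
    "costly": [0.8, 0.2],
    "cheap": [0.6, 0.5],
    "movies": [0.1, 0.9],
}


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


for neighbour, score in toy_most_similar("expensive", toy_vectors):
    print(neighbour, round(score, 3))
```

Because the ranking depends on shared contexts rather than meaning alone, opposites can score high: "cheap" appears among the neighbours of "expensive", and "good" among the neighbours of "bad", since these words are used in the same kinds of sentences.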

##**Cosine similarity scores for similar and dissimilar words**

In [None]:
print(f"Cosine similarity of 'great' and 'nice': {model.wv.similarity(w1='great', w2='nice')}")
print(f"Cosine similarity of 'expensive' and 'costly': {model.wv.similarity(w1='expensive', w2='costly')}")
print("\n")
print("Both scores are above 0.5, so the model treats these word pairs as closely related")

Cosine similarity of 'great' and 'nice': 0.7676674723625183
Cosine similarity of 'expensive' and 'costly': 0.7782313227653503


Both scores are above 0.5, so the model treats these word pairs as closely related


In [None]:
print(f"Cosine similarity of 'great' and 'bad': {model.wv.similarity(w1='great', w2='bad')}")
print(f"Cosine similarity of 'service' and 'good': {model.wv.similarity(w1='service', w2='good')}")
print("\n")
print("Both scores are well below 0.5, so the model treats these word pairs as weakly related at best")

Cosine similarity of 'great' and 'bad': 0.23094773292541504
Cosine similarity of 'service' and 'good': -0.06856516003608704


Both scores are well below 0.5, so the model treats these word pairs as weakly related at best


In [None]:
print(f"Cosine similarity of a word with itself: {model.wv.similarity(w1='expensive', w2='expensive')}")

Cosine similarity of a word with itself: 1.0
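
The score returned by `similarity` is a cosine similarity rather than a probability. For reference, with $u$ and $v$ denoting the two word vectors (symbols introduced here for illustration):

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}, \qquad -1 \le \cos(u, v) \le 1
$$

When both words are the same, $u = v$ and the ratio is exactly 1, which matches the 1.0 above. Values near 0 indicate little relationship, and slightly negative values such as the one for "service" and "good" are possible, so the 0.5 cut-off used here is a rule of thumb rather than a percentage.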



##**Libraries:**
Gensim and pandas

##**Outcome:**

High similarity for semantically related words

Low similarity for unrelated words




##**Conclusion / Deployment Summary:**

When deployed, this Word2Vec model can:

Find words with similar meanings, enabling features like recommendation of alternative phrases or improving search functionality.

Enhance NLP pipelines, such as clustering, sentiment analysis, and document similarity tasks.

Serve as a pretrained embedding layer for downstream machine learning models.

Support semantic search by ranking words or documents based on contextual meaning rather than exact matching.


Overall, the model provides a robust foundation for building intelligent text-processing systems that rely on understanding real-world language usage from user reviews.
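
As one possible illustration of the "pretrained embedding layer" idea, the sketch below turns a word-to-vector mapping into an indexable matrix with a reserved row for padding and unknown words. The toy vectors, `word_to_index`, `embedding_matrix`, and `encode` are hypothetical names and values chosen for this example; in practice the rows would come from the trained model's vectors.

```python
# Toy word vectors standing in for the trained model's vectors (illustrative values only).
toy_vectors = {
    "good": [0.2, 0.7, 0.1],
    "bad": [0.3, 0.6, -0.2],
    "netflix": [-0.5, 0.1, 0.9],
}
VECTOR_SIZE = 3

# Reserve row 0 for padding / unknown words, then one row per known word.
word_to_index = {word: i + 1 for i, word in enumerate(sorted(toy_vectors))}
embedding_matrix = [[0.0] * VECTOR_SIZE] + [toy_vectors[w] for w in sorted(toy_vectors)]


def encode(tokens):
    """Map tokens to row indices, sending unknown words to row 0."""
    return [word_to_index.get(t, 0) for t in tokens]


ids = encode(["netflix", "is", "good"])
print(ids)
print([embedding_matrix[i] for i in ids])
```

A downstream model would then look up rows of this matrix by index instead of learning word vectors from scratch.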

