# Word2Vec: Learning Word Embeddings on a Small Spam Dataset

**Goal:** Learn how to convert the words in text messages into dense vector embeddings using Word2Vec.

**Why Word2Vec?**

- Bag of Words (BoW) and TF-IDF:
  - Sparse vectors
  - Ignore semantic meaning
  - Treat words independently

- Word2Vec:
  - Dense embeddings
  - Captures semantic relationships (king - man + woman ≈ queen)
  - Uses context (CBOW or Skip-gram)
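
To make the "sparse" and "treat words independently" points concrete, here is a minimal sketch of Bag of Words built by hand with the standard library. The two token lists are taken from the messages in this dataset; the helper name `bow_vector` is made up for illustration.

```python
from collections import Counter

# Two tokenized messages (tokens taken from the spam dataset used below)
messages = [
    ["free", "entry", "in", "a", "wkly", "comp"],
    ["ok", "lar", "joking", "wif", "u", "oni"],
]

# The BoW vocabulary is every distinct token; one vector slot per word
vocab = sorted({tok for msg in messages for tok in msg})

def bow_vector(tokens):
    """Count how often each vocabulary word occurs in `tokens`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [bow_vector(msg) for msg in messages]
for vec in vectors:
    print(vec)

# Messages that share no words have a dot product of zero,
# no matter how related their meanings are
overlap = sum(a * b for a, b in zip(vectors[0], vectors[1]))
print("vocabulary size:", len(vocab), "| dot product:", overlap)
```

Each vector is as long as the vocabulary and mostly zeros, and the vectors carry no notion of which words are related. Word2Vec replaces this with a short dense vector per word.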

**Dataset:** Small sample from `spam.csv`  
- Column `v2` contains text messages  
- Column `v1` contains labels (`ham`/`spam`) → ignored for Word2Vec  
- 20 short, real-world messages → small enough to inspect by hand while learning the workflow


In [1]:
# Import libraries
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from gensim.models import Word2Vec, KeyedVectors

# Download punkt tokenizer
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     D:\miniconda_setup\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# Load the small dataset
df = pd.read_csv('spam.csv', encoding='latin-1')

# Take first 20 messages
df_small = df[['v2']].head(20)

# Display the selected messages
df_small


| | v2 |
|---|---|
| 0 | Go until jurong point, crazy.. Available only ... |
| 1 | Ok lar... Joking wif u oni... |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | U dun say so early hor... U c already then say... |
| 4 | Nah I don't think he goes to usf, he lives aro... |
| 5 | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | Even my brother is not like to speak with me. ... |
| 7 | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | WINNER!! As a valued network customer you have... |
| 9 | Had your mobile 11 months or more? U R entitle... |


**Explanation:**

- The dataset contains short messages (`v2`) with labels (`v1`)  
- For Word2Vec, we **only need the text (`v2`)**  
- Labels (`ham`/`spam`) will be ignored

## Text Preprocessing

In [3]:
# Combine all 20 messages into one string
text = " ".join(df_small['v2'].astype(str))

# Clean text: lowercase & remove punctuation/numbers
text_clean = re.sub(r'[^a-z\s]', '', text.lower())

# Tokenize into sentences, then words
sentences = [word_tokenize(sent) for sent in sent_tokenize(text_clean)]

# Show sample tokenized sentences
print("Sample tokenized sentences:", sentences[:3])


Sample tokenized sentences: [['go', 'until', 'jurong', 'point', 'crazy', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'there', 'got', 'amore', 'wat', 'ok', 'lar', 'joking', 'wif', 'u', 'oni', 'free', 'entry', 'in', 'a', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', 'st', 'may', 'text', 'fa', 'to', 'to', 'receive', 'entry', 'questionstd', 'txt', 'ratetcs', 'apply', 'overs', 'u', 'dun', 'say', 'so', 'early', 'hor', 'u', 'c', 'already', 'then', 'say', 'nah', 'i', 'dont', 'think', 'he', 'goes', 'to', 'usf', 'he', 'lives', 'around', 'here', 'though', 'freemsg', 'hey', 'there', 'darling', 'its', 'been', 'weeks', 'now', 'and', 'no', 'word', 'back', 'id', 'like', 'some', 'fun', 'you', 'up', 'for', 'it', 'still', 'tb', 'ok', 'xxx', 'std', 'chgs', 'to', 'send', 'to', 'rcv', 'even', 'my', 'brother', 'is', 'not', 'like', 'to', 'speak', 'with', 'me', 'they', 'treat', 'me', 'like', 'aids', 'patent', 'as', 'per', 'your', 'request', 'melle', 'mel

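**Explanation:**

- Lowercase all text → standardize words  
- Remove punctuation/numbers → keep only alphabetic tokens, so the vocabulary is not cluttered with digits and symbols  
- Sentence tokenization → intended to split the text into sentences; because punctuation has already been removed, `sent_tokenize` has no sentence boundaries left to split on, and the output above shows several messages running together in one list  
- Word tokenization → split each sentence into individual words (tokens)  
- `sentences` is a **list of lists** of tokens, the input format Word2Vec expects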
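
One possible alternative, sketched here as an assumption and not as the notebook's method, is to treat each message as its own sentence so that message boundaries are preserved. The sketch uses plain `str.split` in place of `word_tokenize` to stay dependency-free. The two sample strings are copied from the table above and stand in for `df_small['v2']`, and `tokenize_message` is a hypothetical helper name.

```python
import re

# Stand-in for df_small['v2']: two messages as displayed in the table above
messages = [
    "Ok lar... Joking wif u oni...",
    "U dun say so early hor... U c already then say...",
]

def tokenize_message(message):
    """Lowercase, keep only letters and whitespace, then split on whitespace."""
    cleaned = re.sub(r"[^a-z\s]", "", message.lower())
    return cleaned.split()

# One token list per message, so context windows never span two messages
sentences_per_message = [tokenize_message(m) for m in messages]
print(sentences_per_message)
```

With the real data, the same list comprehension could run over `df_small['v2'].astype(str)`, and the result could be passed to `Word2Vec` in place of `sentences`.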


## Train Custom Word2Vec

In [4]:
# Train Word2Vec model
custom_model = Word2Vec(
    sentences,
    vector_size=50,
    window=5,
    min_count=1,
    sg=0
)

print("Custom Word2Vec model trained!")

Custom Word2Vec model trained!


**Explanation:**

- `vector_size=50` → each word is a 50-dimensional vector  
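- `window=5` → uses up to 5 words on each side of the target as context  
- `min_count=1` → keeps every word, including those that appear only once  
- `sg=0` → CBOW (predict target from context); `sg=1` would select Skip-gram  
- The model is **kept in memory** (not saved to disk) for learning and exploration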
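
To illustrate what `window` and CBOW mean in practice, the following sketch lists the (context, target) pairs that a fixed window would produce for one tokenized sentence. It is a simplified illustration, not gensim's training loop: as far as I know, gensim also randomly shrinks the window for each target by default, which this sketch ignores. `cbow_pairs` is a hypothetical helper name.

```python
def cbow_pairs(tokens, window=5):
    """Return (context_words, target_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        pairs.append((left + right, target))
    return pairs

# Tokens taken from the preprocessing output above
tokens = ["free", "entry", "in", "a", "wkly", "comp"]

# A small window keeps the printout readable
for context, target in cbow_pairs(tokens, window=2):
    print(context, "->", target)
```

CBOW learns to predict each target from its context words. Skip-gram reverses the direction and predicts each context word from the target.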


In [5]:
# Vector for the word 'free'
print("Vector for 'free':\n", custom_model.wv['free'])

# Most similar words to 'free'
print("Words most similar to 'free':\n", custom_model.wv.most_similar('free'))


Vector for 'free':
 [-0.00042949 -0.01769441 -0.0171706   0.00560984 -0.01630914 -0.01817991
 -0.00456867 -0.01697105 -0.01432883 -0.01687582 -0.0005864  -0.00941468
  0.01330951  0.0032145  -0.00664493  0.01230546 -0.01190136 -0.00931185
 -0.01450581 -0.00868222 -0.00354911  0.0131164  -0.00541984  0.0096872
  0.01395269 -0.01488608  0.00903487  0.01216818 -0.0058824   0.01324652
  0.01227542 -0.01281739 -0.01361191  0.0051191  -0.00333173 -0.01197002
  0.01918649 -0.01017233 -0.01296445 -0.00031145 -0.00513106  0.00092714
 -0.00711915 -0.00084244 -0.00121083  0.00171802  0.01633341 -0.01154558
 -0.00341097  0.01115458]
Words most similar to 'free':
 [('id', 0.49181419610977173), ('fine', 0.4748515784740448), ('your', 0.3071158528327942), ('scotland', 0.3011452257633209), ('amore', 0.2860356569290161), ('hl', 0.2728937864303589), ('dont', 0.2691073715686798), ('early', 0.2579798102378845), ('sunday', 0.25507038831710815), ('wkly', 0.25023606419563293)]
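
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces that ranking by hand with NumPy. The 3-dimensional vectors are hand-written for illustration only and do not come from the trained model, and `rank_by_similarity` is a hypothetical helper name.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(query, vectors, topn=5):
    """Rank every other word by cosine similarity to `query`, highest first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors, made up for this example (NOT from custom_model)
toy_vectors = {
    "free": [1.0, 0.0, 0.0],
    "win": [0.9, 0.1, 0.0],
    "brother": [0.0, 1.0, 0.0],
}

print(rank_by_similarity("free", toy_vectors, topn=2))
```

With only 20 messages of training data, the similarities printed above are low and the neighbours look arbitrary, which is expected for such a small corpus.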


In [6]:
# Get all words in vocabulary
unique_words = list(custom_model.wv.index_to_key)

# Loop through each word
for word in unique_words:
    vector = custom_model.wv[word]                 # 50D vector
    similar_words = custom_model.wv.most_similar(word, topn=5)  # top 5 similar words
    print(f"\nWord: {word}")
    print(f"Vector (first 5 dims shown): {vector[:5]} ...")  # snippet
    print(f"Top 5 similar words: {similar_words}")


Word: to
Vector (first 5 dims shown): [-0.00099044  0.00044592  0.01032321  0.01804609 -0.01848899] ...
Top 5 similar words: [('dun', 0.3258264362812042), ('credit', 0.32306140661239624), ('gota', 0.3029196262359619), ('amore', 0.2749740481376648), ('dont', 0.2742103040218353)]

Word: the
Vector (first 5 dims shown): [-0.0161322   0.00906084 -0.00819695  0.00157982  0.01717829] ...
Top 5 similar words: [('remember', 0.4195212423801422), ('this', 0.37753814458847046), ('wat', 0.302047461271286), ('fine', 0.2724485695362091), ('info', 0.2646276354789734)]

Word: i
Vector (first 5 dims shown): [-0.01711326  0.00738902  0.01049401  0.01144767  0.01516186] ...
Top 5 similar words: [('goalsteam', 0.37226977944374084), ('help', 0.355258584022522), ('oru', 0.34388771653175354), ('httpwap', 0.33819687366485596), ('has', 0.32935234904289246)]

Word: u
Vector (first 5 dims shown): [ 0.01578102 -0.01901144 -0.00028401  0.00688657 -0.00170458] ...
Top 5 similar words: [('code', 0.31285619735717773

## Use Pre-trained Word2Vec

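We can also use a **pre-trained Word2Vec model** (e.g., GoogleNews) for comparison.
- Pre-trained embeddings are trained on huge corpora (billions of words; the GoogleNews vectors were trained on roughly 100 billion)
- Useful for small datasets like this one, where 20 messages give a custom model too little context to learn reliable similarities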
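
A minimal sketch of loading such vectors with gensim's `KeyedVectors.load_word2vec_format` follows. The file path is an assumption: the GoogleNews binary has to be downloaded separately and may live elsewhere on your machine. The `limit` value and the helper names `load_pretrained` and `safe_vector` are likewise illustrative choices, not part of the original notebook.

```python
from pathlib import Path

# Assumed local path; the file is not bundled with gensim
PRETRAINED_PATH = "GoogleNews-vectors-negative300.bin"

def load_pretrained(path=PRETRAINED_PATH, limit=200_000):
    """Load word2vec-format vectors, reading only the first `limit` words to save memory."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Pre-trained vectors not found at {path}")
    # Imported here so the path check above runs before the heavier import
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(str(path), binary=True, limit=limit)

def safe_vector(kv, word):
    """Return the vector for `word`, or None if the word is out of vocabulary."""
    return kv[word] if word in kv else None
```

Once the file is in place, `pretrained = load_pretrained()` followed by `pretrained.most_similar('free')` can be compared with the custom model's neighbours. Informal SMS tokens such as `wkly` or `lar` may be missing from a news-trained vocabulary, which is why `safe_vector` checks membership first.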
