# GloVe Word Embeddings

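GloVe (Global Vectors for Word Representation) is a **word embedding method** that captures **word meaning** from **global co-occurrence statistics**, i.e. how often words appear near each other across an entire corpus. Here we use **pretrained** vectors rather than training our own.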

In this notebook:
- Use the first **10 messages** from our `spam.csv` dataset
- Preprocess the text (lowercase, strip punctuation and digits, tokenize)
- Load **pretrained GloVe embeddings** (50-dimensional)
- Get vectors for words and compute **similarities**


In [1]:
# Basic libraries
import pandas as pd
import numpy as np
import re
import nltk
from nltk.tokenize import word_tokenize

# Download punkt tokenizer data (used by word_tokenize)
# Note: newer NLTK releases may additionally require nltk.download('punkt_tab')
nltk.download('punkt')

# Gensim for pretrained GloVe embeddings
import gensim.downloader as api


[nltk_data] Downloading package punkt to
[nltk_data]     D:\miniconda_setup\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# Load spam dataset
df = pd.read_csv('spam.csv', encoding='latin-1')

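# Use only 'v2' column (message text) and first 10 messages
# .copy() gives an independent DataFrame, so adding columns later does not
# trigger pandas' SettingWithCopyWarning
df_small = df[['v2']].head(10).copy()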

# Display dataset
df_small

index,v2
0,"Go until jurong point, crazy.. Available only ..."
1,Ok lar... Joking wif u oni...
2,Free entry in 2 a wkly comp to win FA Cup fina...
3,U dun say so early hor... U c already then say...
4,"Nah I don't think he goes to usf, he lives aro..."
5,FreeMsg Hey there darling it's been 3 week's n...
6,Even my brother is not like to speak with me. ...
7,As per your request 'Melle Melle (Oru Minnamin...
8,WINNER!! As a valued network customer you have...
9,Had your mobile 11 months or more? U R entitle...


In [3]:
# Clean text: lowercase + remove punctuation/numbers
df_small['clean_text'] = df_small['v2'].apply(lambda x: re.sub(r'[^a-z\s]', '', str(x).lower()))

# Tokenize each message
df_small['tokens'] = df_small['clean_text'].apply(word_tokenize)

# Display tokenized messages
df_small[['v2', 'tokens']]


index,v2,tokens
0,"Go until jurong point, crazy.. Available only ...","[go, until, jurong, point, crazy, available, o..."
1,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
2,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, a, wkly, comp, to, win, fa, ..."
3,U dun say so early hor... U c already then say...,"[u, dun, say, so, early, hor, u, c, already, t..."
4,"Nah I don't think he goes to usf, he lives aro...","[nah, i, dont, think, he, goes, to, usf, he, l..."
5,FreeMsg Hey there darling it's been 3 week's n...,"[freemsg, hey, there, darling, its, been, week..."
6,Even my brother is not like to speak with me. ...,"[even, my, brother, is, not, like, to, speak, ..."
7,As per your request 'Melle Melle (Oru Minnamin...,"[as, per, your, request, melle, melle, oru, mi..."
8,WINNER!! As a valued network customer you have...,"[winner, as, a, valued, network, customer, you..."
9,Had your mobile 11 months or more? U R entitle...,"[had, your, mobile, months, or, more, u, r, en..."


In [4]:
# Load small pretrained GloVe embeddings (50D)
glove_model = api.load('glove-wiki-gigaword-50')  # ~66 MB
print("✅ GloVe 50D model loaded! Vocabulary size:", len(glove_model))

✅ GloVe 50D model loaded! Vocabulary size: 400000


### Pretrained GloVe Embeddings
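- GloVe = Global Vectors for Word Representation
- 50-dimensional vectors trained on Wikipedia 2014 + Gigaword 5 (about 6 billion tokens, 400,000-word vocabulary)
- Useful for small datasets like ours, where there is far too little text to learn good embeddings from scratch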
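
To make "global co-occurrence statistics" concrete, here is a minimal sketch of how such counts can be accumulated over a corpus. It is an illustration only, not the training code behind the pretrained vectors: the function name `cooccurrence_counts`, the window size, and the toy sentences are assumptions made for this example. The `1 / distance` weighting follows the scheme described for GloVe, where nearer neighbours contribute more.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Accumulate symmetric, distance-weighted co-occurrence counts.

    `sentences` is a list of token lists. This is a hypothetical helper
    for illustration, not part of gensim or the GloVe release.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look at up to `window` tokens to the right of position i
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                weight = 1.0 / (j - i)  # closer pairs count more
                # Record both directions so the counts stay symmetric
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)

# Toy corpus (made up for this example)
toy_sentences = [["free", "entry", "to", "win"], ["win", "free", "entry"]]
counts = cooccurrence_counts(toy_sentences, window=2)
print(counts[("free", "entry")])  # adjacent in both sentences
```

GloVe then fits word vectors so that their dot products approximate the logarithm of these counts, which is why words that appear in similar contexts end up with similar vectors.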


In [7]:
# Function: average a message's word vectors; optionally print the first few word vectors
def sentence_vector(tokens, model, show_words=3):
    # Keep only tokens that exist in the pretrained vocabulary
    vectors = [model[word] for word in tokens if word in model]
    if len(vectors) == 0:
        # No known words: fall back to a zero vector of the right size
        return np.zeros(model.vector_size)

    # Among the first `show_words` tokens, print the full vector of each one found in the vocabulary
    for word in tokens[:show_words]:
        if word in model:
            print(f"\nWord: {word}")
            print(f"Full {model.vector_size}D vector: {model[word]}")

    return np.mean(vectors, axis=0)

# Apply to all messages
df_small['sentence_vector'] = df_small['tokens'].apply(lambda x: sentence_vector(x, glove_model))

# Show sentence vectors (first 5 dims only for brevity)
for i, vec in enumerate(df_small['sentence_vector']):
    print(f"\nMessage {i}: {df_small['v2'].iloc[i]}")
    print(f"Sentence vector (first 5 dims): {vec[:5]} ...")



Word: go
Full 50D vector: [ 1.4828e-01  1.7761e-01  4.2346e-01 -3.1489e-01  3.2273e-01 -7.2413e-01
 -7.8955e-01  4.9214e-01 -2.0693e-01 -5.5088e-04 -4.7877e-01  2.8853e-01
 -5.7376e-01  2.7217e-01  1.1129e+00  5.7808e-01  6.9321e-01 -2.8652e-01
 -5.4545e-02 -6.1826e-01  1.7227e-01  2.9263e-01  3.8184e-01  6.2186e-01
  5.5461e-01 -1.7411e+00 -2.8802e-01 -1.7140e-01  7.4743e-01 -1.0135e+00
  3.3596e+00  1.1370e+00 -1.0028e+00  1.7685e-01 -6.1795e-03 -6.3491e-02
  1.9077e-01  4.4046e-02  3.8228e-01 -4.1607e-01 -5.0359e-01 -8.3803e-02
  1.7508e-01  4.0420e-01  7.7324e-02  1.7415e-01  1.2541e-01 -2.1820e-01
  1.2971e-01  3.2953e-01]

Word: until
Full 50D vector: [ 0.20025  -0.32821  -0.40859  -0.79438  -0.016211 -0.15642  -0.87742
  0.79077  -0.72598  -0.84135   0.32721   0.16083  -0.39978  -0.16564
  0.97777   0.75359  -0.58771  -0.18122  -0.86418   0.21439   0.93586
 -0.27561   1.1309    0.13037  -0.26363  -1.5425    0.44697   0.11369
  0.81079   0.50902   3.5276   -0.40359  -1.1329   -0
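
The introduction lists computing **similarities** as a goal, so here is a minimal sketch of cosine similarity between vectors. The helper names `cosine_similarity` and `pairwise_cosine` and the toy vectors are assumptions for this example; the zero-vector handling mirrors the zero fallback used by `sentence_vector` for messages with no known words.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)

def pairwise_cosine(vectors):
    """Return an (n, n) matrix of cosine similarities between all vectors."""
    mat = np.vstack(vectors).astype(float)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # leave all-zero rows as zeros instead of dividing by 0
    unit = mat / norms
    return unit @ unit.T

# Toy vectors standing in for sentence vectors (made up for this example)
toy_vectors = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 0.0])]
sim = pairwise_cosine(toy_vectors)
print(np.round(sim, 3))
```

In this notebook the same function could be applied as `pairwise_cosine(df_small['sentence_vector'].tolist())`. For single words, gensim's `KeyedVectors` already provides `glove_model.similarity(w1, w2)` and `glove_model.most_similar(word)`.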

**What is FastText?**  
FastText is a word embedding model developed by Facebook AI Research. Like Word2Vec and GloVe, it converts words into dense numerical vectors that capture semantic meaning.

**Key Differences from GloVe / Word2Vec:**
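1. **Subword Information**: FastText breaks words into smaller character n-grams, after wrapping the word in boundary markers `<` and `>`.  
   - Example (3-grams): `"playing"` → `"<pl"`, `"pla"`, `"lay"`, `"ayi"`, `"yin"`, `"ing"`, `"ng>"`  
   - This helps it generate vectors for **words not seen during training**, by combining the vectors of their n-grams.
2. **Vocabulary Coverage**: Can handle **out-of-vocabulary (OOV) words**, whereas GloVe and Word2Vec have no vector for a word outside their fixed vocabulary.
3. **Vector Size**: The pretrained FastText vectors used here are 300D (vs 50D in our small GloVe example). More dimensions can encode finer distinctions at the cost of memory; note that GloVe is also published in 100D, 200D, and 300D versions.
4. **Use Case**: Well suited to datasets with typos, rare words, or informal text (like SMS/spam messages).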
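
The subword idea from point 1 can be sketched in a few lines. This is a simplified illustration, not FastText's actual implementation: the function name `char_ngrams` is an assumption for this example, and the real model additionally keeps the whole word as its own token and hashes n-grams into a fixed number of buckets. The default range of 3 to 6 characters matches FastText's usual defaults.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of `word`, FastText-style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams in the middle of a word.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

# 3-grams only, to keep the output short
print(char_ngrams("playing", min_n=3, max_n=3))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']
```

A misspelling such as `"playng"` still shares n-grams like `"<pl"`, `"pla"`, and `"lay"` with `"playing"`, which is why a subword model can place it near the correct word.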

**Purpose in our Notebook:**  
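- Convert each SMS message into a **sentence vector** using FastText.  
- Compare with GloVe vectors to see differences in handling unknown words. Note that the OOV benefit requires a model that keeps its subword n-grams; vectors loaded through `gensim.downloader` may come back as plain `KeyedVectors`, so check `word in model` before looking a word up.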
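
Before comparing models, it helps to see which tokens a model actually knows. The sketch below splits a token list into known and unknown words. The helper name `split_by_vocab` and the toy vocabulary are assumptions for this example; the vocabulary here is made up and says nothing about which of these words GloVe or FastText really contain.

```python
def split_by_vocab(tokens, vocab):
    """Split tokens into (known, unknown) according to membership in `vocab`.

    `vocab` can be any container supporting `in`, e.g. a set or a
    gensim KeyedVectors object.
    """
    known = [t for t in tokens if t in vocab]
    unknown = [t for t in tokens if t not in vocab]
    return known, unknown

# Hypothetical vocabulary, for illustration only
toy_vocab = {"ok", "joking", "u"}
tokens = ["ok", "lar", "joking", "wif", "u", "oni"]

known, unknown = split_by_vocab(tokens, toy_vocab)
coverage = len(known) / len(tokens)
print(f"known={known} unknown={unknown} coverage={coverage:.0%}")
```

In the notebook, passing `glove_model` (or the FastText model) as `vocab` and applying this to `df_small['tokens']` would show how much of each message is silently dropped when averaging word vectors.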

## FastText Embeddings (Pretrained Model)

```python
import gensim.downloader as api

# 300-dimensional vectors; this is a large download, cached locally after the first call
fasttext_model = api.load('fasttext-wiki-news-subwords-300')
```

### Summary: GloVe vs FastText

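- Both represent each word as a dense vector, so downstream steps (averaging into sentence vectors, computing similarities) work the same way with either.  
- **Difference:** FastText can build vectors for **rare or misspelled words** from subword (character n-gram) information; GloVe stores one vector per vocabulary word and has none for unseen words.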