### Flashcards

<details>
<summary>What two problems with n-gram / TF-IDF representations do word embeddings solve?</summary>

1. Similar words don't have similar representations (help vs. assistance)
    > Similar words have similar vectors
2. Vector size is too large (the vector is sparse and as long as the entire vocabulary)
   > Vector size can be set manually (dimensions are low)
</details>
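
To make the two points concrete, here is a minimal sketch with a made-up five-word vocabulary and hand-picked 3-dimensional dense vectors (both are illustrative assumptions, not real embeddings). With one-hot vectors every pair of distinct words has cosine similarity 0, while dense vectors can place `help` and `assistance` close together in far fewer dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vocabulary (assumption for illustration only)
vocab = ["help", "assistance", "bread", "cat", "profit"]

# One-hot vectors: one dimension per vocabulary word,
# so the vector length grows with the vocabulary
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hand-picked dense vectors (hypothetical values, not trained embeddings)
dense = {
    "help": [0.9, 0.1, 0.0],
    "assistance": [0.8, 0.2, 0.1],
    "bread": [0.0, 0.9, 0.3],
}

sparse_sim = cosine(one_hot["help"], one_hot["assistance"])
dense_sim = cosine(dense["help"], dense["assistance"])
dense_unrelated = cosine(dense["help"], dense["bread"])

print("vector lengths:", len(one_hot["help"]), "vs", len(dense["help"]))
print("one-hot help/assistance:", sparse_sim)
print("dense help/assistance:", round(dense_sim, 3))
print("dense help/bread:", round(dense_unrelated, 3))
```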


![](https://i.imgur.com/8ucXB3v.png)

![](https://i.imgur.com/MTzWBX0.png)

- Model Variations

![](https://i.imgur.com/Kqvk8hF.png) 

![](https://i.imgur.com/qoSwkWc.png)



![](https://i.imgur.com/3pVJO0W.png) 

### Code Implementation

In [3]:
import spacy

https://spacy.io/models/en

Word vectors require the large (lg) or medium (md) model; the small (sm) model does not ship with word vectors.

- en_core_web_lg -- 685k keys, 343k unique vectors (300 dimensions)
- en_core_web_md -- 685k keys, 20k unique vectors (300 dimensions)


In [None]:
!python -m spacy download en_core_web_lg

In [5]:
nlp = spacy.load("en_core_web_lg")

In [11]:
doc = nlp("In the southern sky, adfdfkdgsj Franz Yash")

for token in doc:
    print(token.text, token.has_vector, not token.is_oov)

In True True
the True True
southern True True
sky True True
, True True
adfdfkdgsj False False
Franz True True
Yash True True


In [15]:
doc[0].vector.shape

(300,)

In [16]:
doc = nlp('''
In the southern sky, I see a light that is shining into the night.
Looking at it I remember what I'm meant to recall.
That what I hear right now in front of me was once such a song.
''')

In [19]:
doc.vector.shape

(300,)
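
A single token and a whole document both come back as 300-dimensional vectors because spaCy's `Doc.vector` defaults to the average of the token vectors. A minimal NumPy sketch of that averaging, using made-up 4-dimensional token vectors in place of the real 300-dimensional ones:

```python
import numpy as np

# Hypothetical token vectors for a three-token document (4 dims instead of 300)
token_vectors = np.array([
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 4.0, 1.0, 0.0],
])

# Averaging over the token axis collapses (n_tokens, dim) to (dim,)
doc_vector = token_vectors.mean(axis=0)

print(doc_vector.shape)
print(doc_vector)
```

Because the result is an average, the document vector does not depend on word order.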

In [20]:
base_token = nlp("bread")

In [53]:
tokens = nlp('bread sandwich butter rice pita person wheat computer cat mage')

for token in tokens:
    # This calculates cosine similarity
    print(token.text, token.similarity(base_token))

bread 0.9999999744752309
sandwich 0.6341067417450952
butter 0.7212939096748688
rice 0.6082261546210023
pita 0.5335656224460084
person 0.2385450112052788
wheat 0.615036141030184
computer 0.14402428531962433
cat 0.1255933926864667
mage 0.015320008784434172


- Words that appear in similar contexts have higher similarity
- Examples: bread and butter, profit and loss
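
`token.similarity` reports cosine similarity: the dot product of the two vectors divided by the product of their lengths. The sketch below uses made-up 3-dimensional vectors (assumed values, not spaCy output) to show that the score depends only on direction, not on vector length. Floating-point rounding is presumably also why `bread` compared with itself printed 0.9999999... above rather than exactly 1.

```python
import numpy as np

def cosine_similarity_1d(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors chosen for illustration
bread = np.array([2.0, 1.0, 0.0])
butter = np.array([1.5, 1.0, 0.5])
computer = np.array([-1.0, 0.2, 3.0])

sim_self = cosine_similarity_1d(bread, bread)
sim_butter = cosine_similarity_1d(bread, butter)
sim_computer = cosine_similarity_1d(bread, computer)
# Scaling a vector changes its length but not its direction
sim_scaled = cosine_similarity_1d(bread, 10 * bread)

print(sim_self, sim_butter, sim_computer, sim_scaled)
```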

In [30]:
def print_similarity(base_word, words):

    base_token = nlp(base_word)
    tokens = nlp(words)

    for token in tokens:
        print(f"{token.text:<8} {round(token.similarity(base_token),3)}")

In [35]:
print_similarity('profit', 'gain loss money bread value cat water')

gain     0.352
loss     0.24
money    0.516
bread    0.171
value    0.523
cat      -0.065
water    0.16


In [83]:
print_similarity('iphone', 'apple samsung phone ipad android cat steve stevejobs')

apple    0.439
samsung  0.671
phone    0.729
ipad     0.774
android  0.674
cat      0.114
steve    0.122
stevejobs 0.0




In [42]:
vv = lambda word: nlp.vocab[word].vector

In [46]:
result = vv('king') - vv('man') + vv('woman')

In [50]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([result], [vv('queen')])

array([[0.6178014]], dtype=float32)

In [51]:
cosine_similarity([result], [vv('king')])

array([[0.8489542]], dtype=float32)

In [52]:
cosine_similarity([vv('king')], [vv('queen')])

array([[0.61088413]], dtype=float32)

- `result` is expected to be closer to `queen` than to `king` because of the vector arithmetic, but here it is closer to `king` (0.849 vs. 0.618)
- Explore this later
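
One likely reason `result` stays closer to `king`: the offset `woman - man` is small compared with the `king` vector itself, so the sum barely moves away from `king`. Analogy evaluations in the word2vec tradition usually exclude the input words from the candidate list before picking the nearest neighbour. A toy sketch with hand-made 3-dimensional vectors (hypothetical values, not spaCy's):

```python
import numpy as np

# Hand-made 3-d vectors (hypothetical values chosen for illustration)
vectors = {
    "king":  np.array([3.0,  0.3, 0.5]),
    "queen": np.array([2.0, -0.3, 1.5]),
    "man":   np.array([0.2,  0.3, 0.1]),
    "woman": np.array([0.2, -0.3, 0.1]),
    "apple": np.array([0.1,  0.0, 2.0]),
}

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude=()):
    """Return the vocabulary word whose vector is most similar to target."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

result = vectors["king"] - vectors["man"] + vectors["woman"]

# Searching the full vocabulary finds the input word itself
with_inputs = nearest(result)
# Excluding the three input words gives the analogy answer
without_inputs = nearest(result, exclude=("king", "man", "woman"))

print(with_inputs, without_inputs)
```

In this toy setup the unrestricted search returns `king`, and `queen` only wins once the inputs are removed, which mirrors the similarities observed above.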

<details>
    <summary>ChatGPT Explanation</summary>

### Explanation for This Behavior:
1. **Word Embedding Quality**:
   - The embeddings used in your model may not be sufficiently trained or large enough to capture nuanced relationships like the analogy "king - man + woman ≈ queen."
   - Word embeddings trained on smaller or less diverse corpora may fail to generalize these relationships well.

2. **Vector Space Limitations**:
   - While the analogy calculation \( \text{king} - \text{man} + \text{woman} \) often works well with high-quality embeddings, the resulting vector may not always align perfectly with the target word due to:
     - Overlapping concepts (e.g., "king" and "queen" are both rulers and share semantic space).
     - Noise or less distinctive gender-related vector components.

3. **Cosine Similarity Context**:
   - The cosine similarity metric only measures the angle between two vectors, not their magnitude. If "king" and the `result` vector are closer in direction than "queen," this can explain the higher similarity with "king."

4. **Pretrained Embedding Limitations**:
   - If these embeddings are static (like Word2Vec or GloVe), they cannot handle word context, making the analogy computations less precise.
   - Contextual embeddings like BERT or GPT might yield better results, but even then, analogical reasoning is not guaranteed.

---

### How to Improve Results:
1. **Use Higher-Quality Embeddings**:
   - Use embeddings trained on large, diverse corpora (e.g., pretrained Word2Vec on Google News or fastText).

2. **Normalize Vectors**:
   - Normalize all vectors before performing operations to ensure consistency in cosine similarity calculations.

3. **Fine-Tune Embeddings**:
   - Train embeddings specifically on a dataset that captures the relationships you aim to explore (e.g., text rich in hierarchical or gendered concepts).

4. **Verify the Corpus**:
   - If using a custom corpus, ensure it has balanced and representative data for the desired analogies.

---

### Takeaway:
Your specific embeddings might still encode a relationship between "king," "man," "woman," and "queen," but the geometry of the vector space doesn't yield a perfect analogy in this instance. This discrepancy is not unusual and highlights the limitations of static embeddings in capturing all semantic relationships. 
</details>

### Text Classification using spaCy Word Vectors

Problem Statement

Using classical NLP techniques, classify whether a given message/text is **Real or Fake**.

We will use **GloVe embeddings** from spaCy, which are trained on a massive Wikipedia dataset, to vectorize the text, and then apply different classification algorithms.

Dataset

Credits: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset

This data consists of two columns:

- Text
- label

`Text` contains statements or messages regarding a particular event/situation.

The `label` column tells whether the given text is Fake or Real.

As there are only two classes, this is a **binary classification** problem.

In [56]:
!wget -O Fake_Real_Data.csv https://raw.githubusercontent.com/codebasics/nlp-tutorials/main/14_word_vectors_spacy_text_classification/Fake_Real_Data.csv

--2025-01-16 10:30:28--  https://raw.githubusercontent.com/codebasics/nlp-tutorials/main/14_word_vectors_spacy_text_classification/Fake_Real_Data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25876225 (25M) [text/plain]
Saving to: ‘Fake_Real_Data.csv’


2025-01-16 10:30:39 (41.8 MB/s) - ‘Fake_Real_Data.csv’ saved [25876225/25876225]



In [58]:
import pandas as pd

df = pd.read_csv('/kaggle/working/Fake_Real_Data.csv')

In [59]:
df

Unnamed: 0,Text,label
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake
1,U.S. conservative leader optimistic of common ...,Real
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real
3,Court Forces Ohio To Allow Millions Of Illega...,Fake
4,Democrats say Trump agrees to work on immigrat...,Real
...,...,...
9895,Wikileaks Admits To Screwing Up IMMENSELY Wit...,Fake
9896,Trump consults Republican senators on Fed chie...,Real
9897,Trump lawyers say judge lacks jurisdiction for...,Real
9898,WATCH: Right-Wing Pastor Falsely Credits Trum...,Fake


In [60]:
df['label'].value_counts()

label
Fake    5000
Real    4900
Name: count, dtype: int64

In [61]:
df['fake'] = df['label'].apply(lambda x: 1 if x == 'Fake' else 0)
# df['fake'] = df['label'].map({'Fake' : 1, 'Real' : 0})

In [63]:
df.iloc[0]

Text      Top Trump Surrogate BRUTALLY Stabs Him In The...
label                                                 Fake
fake                                                     1
Name: 0, dtype: object

In [64]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [66]:
df['vector'] = df['Text'].apply(lambda x: nlp(x).vector)

In [67]:
df

Unnamed: 0,Text,label,fake,vector
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake,1,"[-0.6759837, 1.4263071, -2.318466, -0.451093, ..."
1,U.S. conservative leader optimistic of common ...,Real,0,"[-1.8355803, 1.3101058, -2.4919677, 1.0268308,..."
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real,0,"[-1.9851209, 0.14389805, -2.4221718, 0.9133005..."
3,Court Forces Ohio To Allow Millions Of Illega...,Fake,1,"[-2.7812982, -0.16120885, -1.609772, 1.3624227..."
4,Democrats say Trump agrees to work on immigrat...,Real,0,"[-2.2010763, 0.9961637, -2.4088492, 1.128273, ..."
...,...,...,...,...
9895,Wikileaks Admits To Screwing Up IMMENSELY Wit...,Fake,1,"[-1.6682401, 0.78006977, -2.2337353, -0.159771..."
9896,Trump consults Republican senators on Fed chie...,Real,0,"[-1.9297235, 0.8007302, -1.8990824, 0.42668718..."
9897,Trump lawyers say judge lacks jurisdiction for...,Real,0,"[-1.5289013, 1.0250993, -1.9861357, 0.4278564,..."
9898,WATCH: Right-Wing Pastor Falsely Credits Trum...,Fake,1,"[-1.3928099, 0.7792715, -2.2072845, 0.13192406..."
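
Calling `nlp(x)` row by row inside `apply` is the slow step here. spaCy's `nlp.pipe` processes texts in batches, which is usually faster. The helper below is only a sketch: `texts_to_vectors` and its `batch_size` default are a name and a value chosen for illustration, and it expects an already loaded pipeline such as the `nlp` object above.

```python
import numpy as np

def texts_to_vectors(nlp, texts, batch_size=64):
    """Return a 2-D array with one document vector per input text.

    `nlp` is a loaded spaCy pipeline; `nlp.pipe` yields Doc objects
    in the same order as the input texts.
    """
    docs = nlp.pipe(texts, batch_size=batch_size)
    return np.stack([doc.vector for doc in docs])

# Usage (assumes spaCy and the en_core_web_lg model are installed):
# import spacy
# nlp = spacy.load("en_core_web_lg")
# X = texts_to_vectors(nlp, df["Text"].tolist())
```

The function returns a 2-D array directly, so a later `np.stack` over the column is not needed when it is used this way.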


In [69]:
df.to_csv('Fake_Real_Data_Processed.csv', index=False)
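
One caveat with this cell: `to_csv` stores each array in the `vector` column as its text representation, so reading the CSV back yields strings rather than NumPy arrays. If the vectors need to be reloaded later, one option is to save them separately in NumPy's binary format. A minimal sketch with a small stand-in matrix (the file name `vectors.npy` is an assumption):

```python
import os
import tempfile

import numpy as np

# Stand-in for np.stack(df['vector'].values): 3 documents, 4 dimensions
vectors = np.arange(12, dtype=np.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vectors.npy")
    np.save(path, vectors)      # binary format keeps dtype and shape
    restored = np.load(path)    # comes back as a real ndarray

print(restored.shape, restored.dtype)
```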

In [73]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['vector'].values,
    df['fake'],
    test_size = 0.2,
    random_state = 1,
    stratify = df['fake']
)

In [75]:
X_train.shape

(7920,)

In [81]:
# X_train is a 1-D object array whose elements are NumPy arrays
# Convert it to a 2-D NumPy array

import numpy as np

X_train = np.stack(X_train)
X_test = np.stack(X_test)

In [82]:
X_train.shape

(7920, 300)
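
The shape changes from `(7920,)` to `(7920, 300)` because `df['vector'].values` is a 1-D object array whose elements are themselves arrays, and `np.stack` joins them along a new first axis. A small sketch with toy data:

```python
import numpy as np

# Build a 1-D object array holding three separate 4-d vectors,
# mimicking what df['vector'].values looks like
nested = np.empty(3, dtype=object)
for i in range(3):
    nested[i] = np.full(4, float(i))

print(nested.shape, nested.dtype)

# Stack the inner arrays into one regular 2-D float array
stacked = np.stack(nested)

print(stacked.shape, stacked.dtype)
```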

In [84]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('scaler', MinMaxScaler()),
    ('classifier', MultinomialNB())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

In [86]:
from sklearn.metrics import classification_report

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.94      0.94      0.94       980
           1       0.94      0.94      0.94      1000

    accuracy                           0.94      1980
   macro avg       0.94      0.94      0.94      1980
weighted avg       0.94      0.94      0.94      1980
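
`MinMaxScaler` is in the pipeline because `MultinomialNB` rejects negative feature values, and word-vector components are often negative. With its default settings the scaler maps each column to the range [0, 1] using that column's minimum and maximum from the training data. A NumPy sketch of the formula on toy numbers (not the real vectors):

```python
import numpy as np

X_train_toy = np.array([
    [-2.0, 0.5],
    [ 0.0, 1.5],
    [ 2.0, 2.5],
])
X_test_toy = np.array([[1.0, 0.0]])

# Column-wise minimum and range are learned from the training data only
col_min = X_train_toy.min(axis=0)
col_range = X_train_toy.max(axis=0) - col_min

# The same training statistics are applied to both splits
train_scaled = (X_train_toy - col_min) / col_range
test_scaled = (X_test_toy - col_min) / col_range

print(train_scaled)
print(test_scaled)
```

Test values that lie outside the training range end up outside [0, 1], as the second test feature does here.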



In [87]:
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
    ('scaler', MinMaxScaler()),
    ('classifier', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       980
           1       1.00      0.98      0.99      1000

    accuracy                           0.99      1980
   macro avg       0.99      0.99      0.99      1980
weighted avg       0.99      0.99      0.99      1980



### Example Problem (from TF-IDF)

In [88]:
!kaggle datasets download -d saurabhshahane/ecommerce-text-classification

Dataset URL: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification
License(s): Attribution 4.0 International (CC BY 4.0)
Downloading ecommerce-text-classification.zip to /kaggle/working
 89%|█████████████████████████████████▊    | 7.00M/7.86M [00:01<00:00, 8.86MB/s]
100%|██████████████████████████████████████| 7.86M/7.86M [00:01<00:00, 5.99MB/s]


In [100]:
import pandas as pd
df = pd.read_csv('/kaggle/working/ecommerce-text-classification.zip', header=None, names=['label', 'text'])
df.head()

Unnamed: 0,label,text
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,Household,Incredible Gifts India Wooden Happy Birthday U...


In [101]:
df = df.dropna()
df['label'].value_counts()

label
Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8670
Name: count, dtype: int64

In [102]:
label_map = {label:i for i, label in enumerate(df['label'].unique())}
df['label_num'] = df['label'].map(label_map)

In [103]:
df = df.groupby('label').sample(n=6000, random_state=1).reset_index(drop=True)
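
The line above balances the classes by drawing the same number of rows from each label. A toy sketch of the same pattern (the data frame below is made up for illustration):

```python
import pandas as pd

# Imbalanced toy data: 5 rows of A, 3 of B, 4 of C
toy = pd.DataFrame({
    "label": ["A"] * 5 + ["B"] * 3 + ["C"] * 4,
    "text": [f"item {i}" for i in range(12)],
})

# Draw the same number of rows from every class;
# without replacement, n must not exceed the smallest class size
balanced = (
    toy.groupby("label")
    .sample(n=3, random_state=1)
    .reset_index(drop=True)
)

print(balanced["label"].value_counts())
```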

In [104]:
df

Unnamed: 0,label,text,label_num
0,Books,Differential Calculus for JEE Mains and Advanced,1
1,Books,The Way to Geometry (Perfect Library),1
2,Books,Guns 101: A Beginner's Guide to Buying and Own...,1
3,Books,ROAMS with Primes Supplement (2018) PGMEE BOOKS,1
4,Books,Exam Ref 70-533 Implementing Microsoft Azure I...,1
...,...,...,...
23995,Household,HOKIPO Cotton Folding 48-Litre Round Laundry B...,0
23996,Household,Rena Germany Knife Sharpening Rod - Stainless ...,0
23997,Household,"Klaxon Rado Side Table (Matte Finish, Black)",0
23998,Household,MBTC Delton Cafeteria Bar Stool Chair in Light...,0


In [105]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [117]:
# To show a progress bar,
# call progress_apply instead of apply

from tqdm.notebook import tqdm
tqdm.pandas()

In [118]:
df['vector'] = df['text'].progress_apply(lambda x: nlp(x).vector)


In [119]:
df

Unnamed: 0,label,text,label_num,vector
0,Books,Differential Calculus for JEE Mains and Advanced,1,"[-2.9665587, -2.1354587, 0.82516855, 1.6652772..."
1,Books,The Way to Geometry (Perfect Library),1,"[-4.694405, -0.9549776, 2.4569287, 0.001825004..."
2,Books,Guns 101: A Beginner's Guide to Buying and Own...,1,"[-1.1508142, 0.38399985, 1.0359765, -0.1661633..."
3,Books,ROAMS with Primes Supplement (2018) PGMEE BOOKS,1,"[-2.494515, -3.5772448, 2.9306443, 1.0085555, ..."
4,Books,Exam Ref 70-533 Implementing Microsoft Azure I...,1,"[-1.0092915, -0.5511473, -0.16532014, 0.582062..."
...,...,...,...,...
23995,Household,HOKIPO Cotton Folding 48-Litre Round Laundry B...,0,"[-1.8697267, 0.7694015, -1.7398616, 0.52211505..."
23996,Household,Rena Germany Knife Sharpening Rod - Stainless ...,0,"[-2.2516534, -0.3035365, -1.5092915, 2.1536775..."
23997,Household,"Klaxon Rado Side Table (Matte Finish, Black)",0,"[-3.926707, -2.5546288, 2.2692208, 2.030727, 1..."
23998,Household,MBTC Delton Cafeteria Bar Stool Chair in Light...,0,"[-2.4423666, 0.21136172, -1.9957286, 1.2733226..."


In [120]:
df.to_csv('ecommerce_text_classification_processed.csv', index=False)

In [121]:
label_map

{'Household': 0, 'Books': 1, 'Clothing & Accessories': 2, 'Electronics': 3}

In [122]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['vector'].values,
    df['label_num'],
    test_size = 0.2,
    random_state = 1,
    stratify = df['label_num']
)

In [124]:
import numpy as np

X_train = np.stack(X_train)
X_test = np.stack(X_test)

In [125]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('scaler', MinMaxScaler()),
    ('classifier', MultinomialNB())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.56      0.68      0.61      1200
           1       0.90      0.79      0.85      1200
           2       0.83      0.74      0.78      1200
           3       0.63      0.64      0.64      1200

    accuracy                           0.71      4800
   macro avg       0.73      0.71      0.72      4800
weighted avg       0.73      0.71      0.72      4800



Results from TF-IDF, for comparison:
```
              precision    recall  f1-score   support

           0       0.98      0.92      0.95      1200
           1       0.96      0.99      0.97      1200
           2       0.94      0.95      0.94      1200
           3       0.92      0.94      0.93      1200

    accuracy                           0.95      4800
   macro avg       0.95      0.95      0.95      4800
weighted avg       0.95      0.95      0.95      4800
```

In [126]:
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
    ('scaler', MinMaxScaler()),
    ('classifier', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.87      0.92      0.89      1200
           1       0.98      0.96      0.97      1200
           2       0.95      0.94      0.94      1200
           3       0.95      0.92      0.94      1200

    accuracy                           0.94      4800
   macro avg       0.94      0.94      0.94      4800
weighted avg       0.94      0.94      0.94      4800



In [127]:
from sklearn.neighbors import KNeighborsClassifier

clf = Pipeline([
    ('scaler', MinMaxScaler()),
    ('classifier', KNeighborsClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.85      0.82      0.84      1200
           1       0.97      0.91      0.94      1200
           2       0.90      0.94      0.92      1200
           3       0.87      0.92      0.90      1200

    accuracy                           0.90      4800
   macro avg       0.90      0.90      0.90      4800
weighted avg       0.90      0.90      0.90      4800

