## **Word2Vec**
Word2Vec is a popular natural language processing (NLP) technique used to represent words as vectors in a continuous vector space. This method was introduced by Tomas Mikolov and his team at Google in a paper titled "Efficient Estimation of Word Representations in Vector Space" in 2013.

Word2Vec essentially learns distributed representations of words based on their contextual information in a given corpus. The key idea is that words with similar meanings should have similar vector representations. Word2Vec consists of two main models: the Continuous Bag of Words (CBOW) model and the Skip-gram model.

1. **Continuous Bag of Words (CBOW):** This model predicts a target word based on its context, i.e., the surrounding words in a sentence.

2. **Skip-gram Model:** This model, on the other hand, predicts the context words given a target word.
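
The difference between the two models is easiest to see in the training examples each one draws from a sentence. The sketch below is a simplified illustration, not Word2Vec's actual sampling code: the helper names `skipgram_pairs` and `cbow_pairs` are made up for this example, and details such as subsampling of frequent words are left out.

```python
def skipgram_pairs(tokens, window):
    """Yield (target, context) pairs: the target word predicts each neighbour."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]


def cbow_pairs(tokens, window):
    """Yield (context_words, target) pairs: the neighbours jointly predict the target."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            yield context, target


tokens = ["the", "case", "fits", "well"]
sg = list(skipgram_pairs(tokens, window=1))
cb = list(cbow_pairs(tokens, window=1))
print(sg)  # one (target, context) pair per neighbour
print(cb)  # one (context list, target) pair per position
```

With a window of 1, Skip-gram produces one example per neighbouring word, while CBOW produces one example per position with all neighbours grouped together.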

Word2Vec has proven to be effective in capturing semantic relationships between words and has become a fundamental tool in various NLP applications, including sentiment analysis, document clustering, and recommendation systems.

By representing words as vectors, Word2Vec enables mathematical operations on words, such as finding similarities or analogies between words. The vector representations obtained from Word2Vec can be used as features for downstream machine learning tasks or as embeddings for various applications in natural language understanding.
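
Similarity between two word vectors is usually measured with cosine similarity, the cosine of the angle between them. The following is a minimal sketch using only the standard library; the three-dimensional vectors in `toy` are invented for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors, invented for illustration only.
toy = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "cable": [0.0, 0.1, 0.9],
}

sim_good_great = cosine_similarity(toy["good"], toy["great"])
sim_good_cable = cosine_similarity(toy["good"], toy["cable"])
print(sim_good_great, sim_good_cable)
```

Vectors that point in nearly the same direction score close to 1, while unrelated directions score close to 0.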

## **Implementing Word2Vec on Amazon review datasets**

### 1. Import Libraries and Dataset

In [70]:
import gensim
import numpy as np
import pandas as pd
import pprint

In [2]:
# Forward slashes work on every OS and avoid the invalid "\C" escape sequence
df = pd.read_json("dataset/Cell_Phones_and_Accessories_5.json", lines=True)
df.head()

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A30TL5EWN6DFXT | 120401325X | christina | [0, 0] | They look good and stick good! I just don't li... | 4 | Looks Good | 1400630400 | 05 21, 2014 |
| 1 | ASY55RVNIL0UD | 120401325X | emily l. | [0, 0] | These stickers work like the review says they ... | 5 | Really great product. | 1389657600 | 01 14, 2014 |
| 2 | A2TMXE2AFO7ONB | 120401325X | Erica | [0, 0] | These are awesome and make my phone look so st... | 5 | LOVE LOVE LOVE | 1403740800 | 06 26, 2014 |
| 3 | AWJ0WZQYMYFQ4 | 120401325X | JM | [4, 4] | Item arrived in great time and was in perfect ... | 4 | Cute! | 1382313600 | 10 21, 2013 |
| 4 | ATX7CZYFXI1KW | 120401325X | patrice m rogoza | [2, 3] | awesome! stays on, and looks great. can be use... | 5 | leopard home button sticker for iphone 4s | 1359849600 | 02 3, 2013 |


In [4]:
df.shape

(194439, 9)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194439 entries, 0 to 194438
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   reviewerID      194439 non-null  object
 1   asin            194439 non-null  object
 2   reviewerName    190920 non-null  object
 3   helpful         194439 non-null  object
 4   reviewText      194439 non-null  object
 5   overall         194439 non-null  int64 
 6   summary         194439 non-null  object
 7   unixReviewTime  194439 non-null  int64 
 8   reviewTime      194439 non-null  object
dtypes: int64(2), object(7)
memory usage: 13.4+ MB


### 2. Preprocess the Dataset with gensim's simple_preprocess

`gensim.utils.simple_preprocess()` is a Gensim utility function for basic text preprocessing.

It converts the text to lowercase, splits it into individual word tokens (dropping punctuation and digits along the way), and discards tokens that are very short or very long (by default, shorter than 2 or longer than 15 characters).
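
To make these steps concrete, here is a rough standard-library approximation. It is not gensim's implementation: `rough_preprocess` is a hypothetical helper written for this illustration, and it may differ from `simple_preprocess` on edge cases such as underscores or accented characters.

```python
import re


def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate gensim.utils.simple_preprocess with the standard library."""
    # Lowercase, then keep runs of letters; digits and punctuation act as separators.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Drop tokens that are too short or too long.
    return [t for t in tokens if min_len <= len(t) <= max_len]


tokens = rough_preprocess("They look good and stick good! I just don't like it.")
print(tokens)
# ['they', 'look', 'good', 'and', 'stick', 'good', 'just', 'don', 'like', 'it']
```

Note how "I" and the "t" from "don't" disappear because they are shorter than two characters.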

In [18]:
print(df['reviewText'][:5])

0    They look good and stick good! I just don't li...
1    These stickers work like the review says they ...
2    These are awesome and make my phone look so st...
3    Item arrived in great time and was in perfect ...
4    awesome! stays on, and looks great. can be use...
Name: reviewText, dtype: object


In [17]:
# Tokenize every review with gensim's simple_preprocess
review_text = df['reviewText'].apply(gensim.utils.simple_preprocess)
review_text[:5]

0    [they, look, good, and, stick, good, just, don...
1    [these, stickers, work, like, the, review, say...
2    [these, are, awesome, and, make, my, phone, lo...
3    [item, arrived, in, great, time, and, was, in,...
4    [awesome, stays, on, and, looks, great, can, b...
Name: reviewText, dtype: object

### 3. Training the Word2Vec Model

#### Initialize the Model

In [19]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2
)

#### Build Vocabulary

In [20]:
model.build_vocab(review_text, progress_per=1000)

In [22]:
model.corpus_count

194439
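
The `corpus_count` above equals the number of reviews, because each review is treated as one sentence. The `min_count=2` setting passed to the model means that words seen fewer than two times are left out of the vocabulary. The toy sketch below imitates that filtering with a plain `Counter`; `build_toy_vocab` and `toy_reviews` are hypothetical names for this illustration and have nothing to do with gensim's internals.

```python
from collections import Counter


def build_toy_vocab(sentences, min_count):
    """Count tokens across all sentences and keep those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}


toy_reviews = [
    ["great", "case", "fits", "well"],
    ["case", "broke", "fast"],
    ["great", "price"],
]

vocab = build_toy_vocab(toy_reviews, min_count=2)
print(len(toy_reviews))  # 3 sentences, analogous to corpus_count
print(vocab)             # {'great': 2, 'case': 2}
```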

#### Train the Word2Vec Model

In [23]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

(61502436, 83868975)

#### Save the Model

In [24]:
model.save("model/word2vec-amazon-cell-accessories-reviews-short.model")

#### Load saved model

In [3]:
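from gensim.models import Word2Vec
# The file was written by model.save(), so it holds a full Word2Vec model;
# load it with Word2Vec.load() and access the word vectors through model.wv
model = Word2Vec.load("model/word2vec-amazon-cell-accessories-reviews-short.model", mmap='r')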

#### Finding Similar Words and Similarity between words

In [11]:
len(model.wv['good']), model.wv['good']

(100,
 array([ 3.7908862e+00,  2.1037699e-01,  5.6654531e-01,  1.7465203e+00,
        -2.0633793e+00,  1.7683836e+00,  2.9764297e+00, -1.1771388e+00,
         3.3997285e+00,  1.9098814e+00, -1.0367624e+00,  2.9892800e+00,
        -4.9457373e-03, -2.0754257e-01, -1.6682088e+00,  1.2165045e+00,
        -1.1478634e+00,  5.3653604e-01,  2.0549905e+00, -7.0834577e-01,
        -2.0617397e+00,  1.3905910e+00,  3.1469545e-01, -8.2369465e-01,
        -2.5021929e-01,  9.6753442e-01, -1.2798059e+00,  2.4592168e+00,
        -5.0034218e+00,  1.9439898e-01,  3.0021605e-01, -4.8863497e-01,
         3.4718969e+00, -2.1894975e-01,  1.1027510e+00, -1.2691100e+00,
         1.2202055e+00, -6.0457015e-01, -2.4700704e+00,  8.3072662e-01,
        -3.1335495e+00,  1.2516685e+00,  3.1859093e+00,  2.6590747e-01,
         1.6406068e+00, -1.5159631e+00,  2.2028489e+00, -4.4344463e+00,
        -1.9970876e+00,  5.5638456e-01,  2.4537299e+00,  2.6398520e+00,
        -6.1077476e-01,  2.2352457e+00,  8.8071078e-02,  5

In [4]:
model.wv.most_similar("good")

[('decent', 0.818357527256012),
 ('great', 0.7874462008476257),
 ('fantastic', 0.7144420146942139),
 ('nice', 0.7064128518104553),
 ('excellent', 0.637588381767273),
 ('superb', 0.6370196342468262),
 ('outstanding', 0.6097298860549927),
 ('awesome', 0.6054986119270325),
 ('exceptional', 0.6053029298782349),
 ('reasonable', 0.6011152863502502)]

In [5]:
model.wv.similarity(w1="cheap", w2="expensive")

0.46682075

In [20]:
model.wv.most_similar(positive=["good", "bad"], negative=['nice'])

[('terrible', 0.6243672370910645),
 ('shabby', 0.5796376466751099),
 ('horrible', 0.5532450079917908),
 ('poor', 0.5488090515136719),
 ('ok', 0.5208736062049866),
 ('okay', 0.5044753551483154),
 ('mediocre', 0.5011917948722839),
 ('decent', 0.47254306077957153),
 ('darn', 0.4574422240257263),
 ('crappy', 0.45334017276763916)]
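
A query with `positive` and `negative` words can be read as vector arithmetic: add the positive word vectors, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The sketch below is a simplified stand-in, not gensim's code. `toy_most_similar` and the two-dimensional vectors are invented for illustration.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the query words
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 2-d vectors: axis 0 ~ sentiment, axis 1 ~ intensity.
toy = {
    "good": [1.0, 0.5],
    "nice": [1.0, 0.2],
    "bad": [-1.0, 0.5],
    "terrible": [-1.0, 1.0],
    "fine": [0.6, 0.1],
}

result = toy_most_similar(toy, positive=["good", "bad"], negative=["nice"])
print(result)  # "terrible" ranks first, "fine" last
```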

## **Machine Learning Implementation using Word2Vec model**

### 1. Select Columns and Preprocess the Data

In [24]:
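# .copy() gives an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning
df_ml = df[['reviewText', 'overall']].copy()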
df_ml.head()

| | reviewText | overall |
|---|---|---|
| 0 | They look good and stick good! I just don't li... | 4 |
| 1 | These stickers work like the review says they ... | 5 |
| 2 | These are awesome and make my phone look so st... | 5 |
| 3 | Item arrived in great time and was in perfect ... | 4 |
| 4 | awesome! stays on, and looks great. can be use... | 5 |


In [28]:
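def vectorize_text(text_token, model=model.wv):
    """Turn a list of tokens into one fixed-size vector by averaging word vectors."""
    vector_size = model.vector_size
    vector_init = np.zeros(vector_size)
    # ctr starts at 1 so the division below never hits zero when no token is
    # in the vocabulary; as a side effect the sum is divided by (matches + 1)
    ctr = 1
    for token in text_token:
        if token in model:  # skip out-of-vocabulary tokens
            ctr += 1
            vector_init += model[token]
    vector_init = vector_init / ctr
    return vector_init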
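
Because `ctr` starts at 1, `vectorize_text` divides the sum by the number of matched tokens plus one, which shrinks short reviews toward zero more than long ones. If an exact mean is preferred, one possible variant is sketched below. `mean_vector` is a hypothetical alternative, not the function used to produce the results in this notebook, and the toy vectors are invented for the example.

```python
import numpy as np


def mean_vector(tokens, vectors, vector_size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(vector_size)
    return np.mean(known, axis=0)


# Toy lookup table standing in for model.wv (any mapping with `in` and `[]` works).
toy_vectors = {"good": np.array([2.0, 0.0]), "bad": np.array([-1.0, 3.0])}

doc_vector = mean_vector(["good", "bad", "good", "unknownword"], toy_vectors, vector_size=2)
empty_vector = mean_vector(["unknownword"], toy_vectors, vector_size=2)
print(doc_vector)    # [1. 1.]
print(empty_vector)  # [0. 0.]
```

Switching to this variant would change the feature values, so the classification results would need to be recomputed.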

In [35]:
vectorize_text(['good', 'bad', 'good'], model=model.wv)

array([ 2.32552895,  0.62897152,  0.60898492,  1.07130483, -0.97558424,
        1.75729698,  1.42751348, -1.34696162,  1.860542  ,  1.11578837,
       -0.60967407,  1.62458852, -0.92041551,  0.03486456, -1.09615907,
        0.43038236, -0.76662487, -0.08310533,  0.64802644, -1.35172886,
       -1.34757328,  1.26133353,  0.53866123, -0.41255819, -0.64601198,
        0.98255458, -0.20671302,  1.16864669, -3.66476607,  0.47433361,
       -0.42900349,  0.0667076 ,  2.09990939,  0.46773907,  1.20073783,
       -0.7566806 ,  0.75944646, -1.23000658, -0.61687022,  0.02962214,
       -1.67007179,  0.54423916,  1.59732327,  0.27242199,  1.26959392,
       -1.57794321,  1.80544955, -3.17269886, -0.53241852,  0.73868734,
        1.77294254,  1.76466867, -0.09138958,  1.96993625, -0.09940507,
        0.45390694, -1.31525636, -0.22537105, -0.46347475, -2.6034815 ,
       -0.30415373, -1.06664398,  2.01131868, -0.26063541,  0.67370261,
        0.5297662 , -1.51603401, -2.40979874,  2.27096868, -2.60

In [34]:
df_ml['token'] = df['reviewText'].apply(gensim.utils.simple_preprocess)


In [40]:
df_ml['v_token'] = df_ml['token'].apply(lambda x: vectorize_text(x, model=model.wv))


### 2. Train Machine Learning Model

In [51]:
from sklearn.model_selection import train_test_split
x = df_ml['v_token'].to_list()
y = df_ml['overall'].to_list()

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)

In [64]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
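
The warning suggests two remedies: raise `max_iter` or scale the features. A minimal sketch of both is shown below on synthetic data, so the names `X_demo`, `y_demo`, and `scaled_clf` are assumptions for this illustration and the score has no bearing on the review dataset. Applying the same pipeline to the review vectors would likely change the metrics reported in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the averaged review vectors: 200 samples, 10 features.
X_demo = rng.normal(scale=5.0, size=(200, 10))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Standardize each feature, then fit with a higher iteration limit.
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_clf.fit(X_demo, y_demo)

demo_accuracy = scaled_clf.score(X_demo, y_demo)
print(demo_accuracy)
```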


In [65]:
from sklearn.metrics import classification_report
print(classification_report(y_test, clf.predict(X_test)))

              precision    recall  f1-score   support

           1       0.51      0.48      0.49      2656
           2       0.23      0.04      0.06      2213
           3       0.34      0.22      0.27      4288
           4       0.43      0.21      0.28      7998
           5       0.68      0.93      0.79     21733

    accuracy                           0.62     38888
   macro avg       0.44      0.37      0.38     38888
weighted avg       0.56      0.62      0.56     38888
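
When reading this report, it helps to compare the accuracy against a majority-class baseline, since 5-star reviews dominate the test set. The short calculation below uses only the support and recall figures printed above; the variable names are chosen for this illustration.

```python
# Support and per-class recall copied from the classification report above
# (recall values are rounded to two decimals in the report).
support = {1: 2656, 2: 2213, 3: 4288, 4: 7998, 5: 21733}
recall = {1: 0.48, 2: 0.04, 3: 0.22, 4: 0.21, 5: 0.93}

total = sum(support.values())

# Accuracy of a model that always predicts the most frequent class.
majority_baseline = max(support.values()) / total

# Support-weighted recall equals overall accuracy (up to rounding).
weighted_recall = sum(recall[c] * support[c] for c in support) / total

# Macro recall treats every class equally, regardless of size.
macro_recall = sum(recall.values()) / len(recall)

print(f"majority-class baseline accuracy: {majority_baseline:.3f}")  # 0.559
print(f"support-weighted recall:          {weighted_recall:.3f}")    # 0.622
print(f"macro recall:                     {macro_recall:.3f}")       # 0.376
```

Always predicting 5 stars would already reach about 0.56 accuracy, so the reported 0.62 is a modest improvement, and the low macro average shows that ratings 2 to 4 are rarely recovered.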



## **Using Pretrained Embeddings with gensim**

In [68]:
import gensim.downloader

### 1. Load Pretrained Vectors (GloVe)

In [72]:
pprint.pprint(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']


In [74]:
# Download the "glove-twitter-25" embeddings
glove_vectors = gensim.downloader.load('glove-twitter-25')



In [77]:
len(glove_vectors['good']), glove_vectors.most_similar('good')

(25,
 [('too', 0.9648017287254333),
  ('day', 0.9533665180206299),
  ('well', 0.9503170847892761),
  ('nice', 0.9438973665237427),
  ('better', 0.9425962567329407),
  ('fun', 0.9418926239013672),
  ('much', 0.9413353800773621),
  ('this', 0.9387555122375488),
  ('hope', 0.9383506774902344),
  ('great', 0.9378516674041748)])

### 2. Prepare the Data

In [78]:
df_ml['v_token'] = df_ml['token'].apply(lambda x: vectorize_text(x, model=glove_vectors))


In [80]:
from sklearn.model_selection import train_test_split
x = df_ml['v_token'].to_list()
y = df_ml['overall'].to_list()

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)

### 3. Train the Machine Learning Model

In [81]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [82]:
from sklearn.metrics import classification_report
print(classification_report(y_test, clf.predict(X_test)))

              precision    recall  f1-score   support

           1       0.33      0.10      0.16      2656
           2       0.11      0.00      0.00      2213
           3       0.26      0.06      0.09      4288
           4       0.31      0.05      0.08      7998
           5       0.58      0.96      0.73     21733

    accuracy                           0.56     38888
   macro avg       0.32      0.23      0.21     38888
weighted avg       0.45      0.56      0.45     38888

