### <b>Word2Vec</b>

<br>

##### What is Word2Vec?

Word2Vec is a word embedding technique used in NLP tasks. With an embedding we can extract vector representations of the words in a text. Since Word2Vec is one of the most widely used embeddings in NLP, let's study the concept.
 
<br>

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.
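
"Close to one another" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors (the words and numbers are illustrative assumptions, not output from a trained model) to show how related words would score higher than unrelated ones:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings; real Word2Vec vectors have hundreds of dimensions
toy_vectors = {
    "guitar": [0.9, 0.8, 0.1],
    "bass":   [0.8, 0.9, 0.2],
    "refund": [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_vectors["guitar"], toy_vectors["bass"])
sim_unrelated = cosine_similarity(toy_vectors["guitar"], toy_vectors["refund"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

A value near 1 means the two vectors point in almost the same direction.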

<br>


#### How does Word2Vec work?

Word2Vec is a method for constructing such an embedding. It can be trained in two ways, both involving neural networks: Skip-Gram and Continuous Bag of Words (CBOW).

CBOW Model: This method takes the context of each word as the input and tries to predict the word corresponding to that context. Consider our example: Have a great day.

Let the input to the neural network be the word great. Notice that here we are trying to predict a target word (day) using a single context input word (great). More specifically, we use the one-hot encoding of the input word and measure the output error against the one-hot encoding of the target word (day). In the process of predicting the target word, we learn the vector representation of the target word.
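
As a rough illustration of the one-hot encoding step for the example sentence, here is a minimal sketch. The alphabetical vocabulary ordering and the helper name `one_hot` are assumptions made for this example:

```python
sentence = "have a great day".split()

# Assumed vocabulary order: alphabetical
vocab = sorted(set(sentence))
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list of length V with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

x = one_hot("great")  # network input: the context word
t = one_hot("day")    # training label: the target word
print(vocab)  # ['a', 'day', 'great', 'have']
print(x)      # [0, 0, 1, 0]
print(t)      # [0, 1, 0, 0]
```
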
Let us look deeper into the actual architecture.

<br>


<p align=center>
<img src="https://miro.medium.com/max/700/0*3DFDpaXoglalyB4c.png" width="70%"></p>

<br>

The input, or context word, is a one-hot encoded vector of size V. The hidden layer contains N neurons, and the output is again a vector of length V whose elements are the softmax values.

Let's get the terms in the picture right:
- Wvn is the weight matrix that maps the input x to the hidden layer (a V×N matrix)
- W'nv is the weight matrix that maps the hidden layer outputs to the final output layer (an N×V matrix)

I won't get into the mathematics. We'll just get an idea of what's going on.
The hidden layer neurons just copy the weighted sum of inputs to the next layer. There is no activation like sigmoid, tanh or ReLU. The only non-linearity is the softmax calculation in the output layer.
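
The forward pass described here can be sketched in a few lines. The sizes (V = 4, N = 2) and the weight values are arbitrary assumptions for illustration; a real model would learn the weights during training:

```python
import math

V, N = 4, 2  # toy vocabulary size and hidden size

# Hypothetical weights
W_in = [[0.1, 0.2],
        [0.3, 0.4],
        [0.5, 0.6],
        [0.7, 0.8]]             # V x N, input -> hidden
W_out = [[0.1, 0.2, 0.3, 0.4],
         [0.5, 0.6, 0.7, 0.8]]  # N x V, hidden -> output

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x):
    # Hidden layer: plain weighted sum, no activation function
    h = [sum(x[i] * W_in[i][j] for i in range(V)) for j in range(N)]
    # Output layer: scores for every vocabulary word, then softmax
    u = [sum(h[j] * W_out[j][k] for j in range(N)) for k in range(V)]
    return h, softmax(u)

x = [0, 0, 1, 0]  # one-hot input selecting word index 2
h, y = forward(x)
print(h)       # equals row 2 of W_in
print(sum(y))  # the softmax outputs sum to 1 (up to rounding)
```

Because the input is one-hot, the hidden vector is simply one row of the input weight matrix, and that row is the word's embedding.
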
However, the model above uses a single context word to predict the target. We can use multiple context words to do the same.

<br>


<p align=center>
<img src="https://miro.medium.com/max/596/0*CCsrTAjN80MqswXG" width="70%"></p>

<br>


The above model takes C context words. When Wvn is used to calculate hidden layer inputs, we take an average over all these C context word inputs.
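
A minimal sketch of that averaging step follows, assuming a toy 4×2 input weight matrix with made-up values. Multiplying a one-hot vector by the matrix selects one row, so averaging over C context words amounts to averaging C rows:

```python
# Hypothetical V x N input weight matrix (V = 4, N = 2)
W_in = [[0.1, 0.2],
        [0.3, 0.4],
        [0.5, 0.6],
        [0.7, 0.8]]

def average_hidden(context_indices):
    """Average the rows of W_in selected by the C context words."""
    C = len(context_indices)
    N = len(W_in[0])
    return [sum(W_in[i][j] for i in context_indices) / C for j in range(N)]

# Two context words (C = 2) with vocabulary indices 0 and 2
h = average_hidden([0, 2])
print(h)  # approximately [0.3, 0.4]
```
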
So, we have seen how word representations are generated using the context words. But there's one more way to do the same. We can use the target word (whose representation we want to generate) to predict the context, and in the process we produce the representations. Another variant, called the Skip-Gram model, does this.

<br>

#### Skip-Gram model


This looks like the multiple-context CBOW model just got flipped. To some extent that is true.
We input the target word into the network. The model outputs C probability distributions. What does this mean?
For each of the C context positions, we get a probability distribution over the V words in the vocabulary.
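
To make the flipped direction concrete, the sketch below generates (target, context) training pairs from the example sentence. The helper name `skipgram_pairs` and the window size of 1 (one word on each side, so up to C = 2 context positions) are assumptions for illustration:

```python
def skipgram_pairs(tokens, window=1):
    """Return (target, context) pairs for every context position within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("have a great day".split(), window=1)
print(pairs)
# [('have', 'a'), ('a', 'have'), ('a', 'great'),
#  ('great', 'a'), ('great', 'day'), ('day', 'great')]
```

Each pair is one training example: the first word is the network input and the second is the word to be predicted at a context position.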


<br>
<p align=center>
<img src="https://miro.medium.com/max/700/0*Ta3qx5CQsrJloyCA.png" width="70%"></p>

<br>



#### Who wins?

Both have their own advantages and disadvantages. According to Mikolov, Skip-Gram works well with small amounts of data and represents rare words well.
On the other hand, CBOW is faster and has better representations for more frequent words.


<br>

<hr>



* use string.punctuation
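
One way to act on this note is sketched below. The helper name `strip_punctuation` is hypothetical and is not used elsewhere in the notebook:

```python
import string

def strip_punctuation(text):
    """Delete every character listed in string.punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

print(strip_punctuation("Great mic, no more pops!"))  # Great mic no more pops
```

Note that this deletes punctuation instead of replacing it with a space, so "it's" becomes "its".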

In [18]:
import os 
import re 
import string 
import time 
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns 
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')



import nltk 
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from wordcloud import WordCloud


from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, GRU
from tensorflow.keras.layers import Dropout, InputLayer, Bidirectional
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.losses import BinaryCrossentropy, SparseCategoricalCrossentropy
from tensorflow.keras.models import model_from_json

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# seed 
seed = 42
np.random.seed(seed)
tf.random.set_seed(seed)

In [3]:
path = '/content/drive/My Drive/Deep Learning - Projetos/Embeddings /Musical_instruments_reviews.csv'
data = pd.read_csv(path)
data.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A2IBPI20UZIR0U,1384719342,"cassandra tu ""Yeah, well, that's just like, u...","[0, 0]","Not much to write about here, but it does exac...",5.0,good,1393545600,"02 28, 2014"
1,A14VAT5EAX3D9S,1384719342,Jake,"[13, 14]",The product does exactly as it should and is q...,5.0,Jake,1363392000,"03 16, 2013"
2,A195EZSQDW3E21,1384719342,"Rick Bennette ""Rick Bennette""","[1, 1]",The primary job of this device is to block the...,5.0,It Does The Job Well,1377648000,"08 28, 2013"
3,A2C00NNG1ZQQG2,1384719342,"RustyBill ""Sunday Rocker""","[0, 0]",Nice windscreen protects my MXL mic and preven...,5.0,GOOD WINDSCREEN FOR THE MONEY,1392336000,"02 14, 2014"
4,A94QU4C90B1AX,1384719342,SEAN MASLANKA,"[0, 0]",This pop filter is great. It looks and perform...,5.0,No more pops when I record my vocals.,1392940800,"02 21, 2014"


In [4]:
# lower case columns 
data.columns = data.columns.str.lower()

In [5]:
# drop columns 
cols_drop = ['reviewerid','asin','reviewername','helpful','unixreviewtime','reviewtime']
data.drop(columns=cols_drop, inplace=True)  # `columns=` already implies axis=1

In [6]:
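# merge ratings into three groups: 1 joins 2 (negative), 4 joins 5 (positive)
data['overall'] = data['overall'].replace({1:2,4:5})


# map the merged ratings to sentiment labels
data['overall'] = data['overall'].replace({2:'Negative',
                                           3:'Neutral',
                                           5:'Positive'
                                           })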

data['overall'].value_counts()

Positive    9022
Neutral      772
Negative     467
Name: overall, dtype: int64

In [7]:
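# cast to str; note that astype(str) turns NaN into the string 'nan',
# so the isnull() check below can no longer flag missing reviews
data['reviewtext'] = data['reviewtext'].astype(str)
data['overall'] = data['overall'].astype(str)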

In [8]:
data.isnull().sum()

reviewtext    0
overall       0
summary       0
dtype: int64

In [9]:
data.dropna(axis=0, inplace=True)

In [10]:
data['reviewtext'] = data['reviewtext'] + ' ' + data['summary']
data.drop('summary', axis=1, inplace=True)
data.head()

Unnamed: 0,reviewtext,overall
0,"Not much to write about here, but it does exac...",Positive
1,The product does exactly as it should and is q...,Positive
2,The primary job of this device is to block the...,Positive
3,Nice windscreen protects my MXL mic and preven...,Positive
4,This pop filter is great. It looks and perform...,Positive


In [11]:
X = data['reviewtext']
y = data['overall']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.30, random_state=42)

encoder = LabelEncoder()

y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)

print('Classes: ', encoder.inverse_transform([0,1,2]))

Classes:  ['Negative' 'Neutral' 'Positive']


In [12]:
# feature engineering for NLP

def removing_noise(text):
    """Replace mentions, URLs and non-alphanumeric runs with spaces, then lower-case."""
    removing_list = r"@\S+|https?:\S+|[^A-Za-z0-9]+"
    text = re.sub(removing_list, " ", str(text))
    return text.lower().strip()



def lemmatization(text):
    """Lemmatize token by token (WordNetLemmatizer.lemmatize expects a single word)."""
    lemmatizer = WordNetLemmatizer()
    return " ".join(lemmatizer.lemmatize(token) for token in text.split())



def tokenization(X_train, X_test, max_sequence_length=10, words_token=10000):
    """Fit a Keras Tokenizer on the training texts and pad both splits."""
    tokenizer = Tokenizer(num_words=words_token)
    tokenizer.fit_on_texts(X_train)

    word_index = tokenizer.word_index
    num_words = len(word_index) + 1  # +1 because index 0 is reserved for padding

    sequences_train = tokenizer.texts_to_sequences(X_train)
    sequences_test = tokenizer.texts_to_sequences(X_test)

    X_train = pad_sequences(sequences_train, maxlen=max_sequence_length, padding='post', truncating='post')
    X_test = pad_sequences(sequences_test, maxlen=max_sequence_length, padding='post', truncating='post')

    return (X_train, X_test, num_words)



def stop_words(text):
    """Drop English stop words and return the remaining tokens as one string."""
    stop_list = set(stopwords.words('english'))
    # collect every kept token before joining (returning inside the loop
    # would keep only the first one)
    tokens = [token for token in text.split() if token not in stop_list]
    return " ".join(tokens)

In [13]:
# NOTE: X_train / X_test were split from `data` in an earlier cell,
# so this cleaning only updates `data`, not the split sets
data['reviewtext'] = data['reviewtext'].apply(removing_noise)
data['reviewtext'] = data['reviewtext'].apply(stop_words)
data['reviewtext'] = data['reviewtext'].apply(lemmatization)

In [14]:
# CountVectorizer 
cv = CountVectorizer(tokenizer=word_tokenize)
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)


# TF-IDF 
tfidf = TfidfTransformer()
X_train_idf = tfidf.fit_transform(X_train_cv)
X_test_idf = tfidf.transform(X_test_cv)

In [15]:
mdl = MultinomialNB()
mdl.fit(X_train_idf, y_train)
y_pred = mdl.predict(X_test_idf)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       139
           1       0.00      0.00      0.00       229
           2       0.88      1.00      0.94      2711

    accuracy                           0.88      3079
   macro avg       0.29      0.33      0.31      3079
weighted avg       0.78      0.88      0.82      3079
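
The averaged rows in this report can be reproduced by hand, which helps explain why an accuracy of 0.88 is misleading here. The sketch below copies the per-class numbers from the report above; only the variable names are assumptions:

```python
# Per-class values copied from the MultinomialNB report above
support = {0: 139, 1: 229, 2: 2711}
f1 = {0: 0.00, 1: 0.00, 2: 0.94}
recall = {0: 0.00, 1: 0.00, 2: 1.00}

total = sum(support.values())

# Macro average: every class counts equally, regardless of its size
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class is weighted by its support
weighted_recall = sum(recall[c] * support[c] for c in support) / total

print(total)                      # 3079
print(round(macro_f1, 2))         # 0.31
print(round(weighted_recall, 2))  # 0.88
```

The model assigns essentially everything to class 2 (Positive), so its accuracy matches the share of Positive reviews in the test set, while the macro average exposes that the Negative and Neutral classes are never recovered.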



In [16]:
mdl = XGBClassifier(random_state=42)
mdl.fit(X_train_idf, y_train)
y_pred = mdl.predict(X_test_idf)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.50      0.02      0.04       139
           1       0.58      0.08      0.14       229
           2       0.89      1.00      0.94      2711

    accuracy                           0.88      3079
   macro avg       0.66      0.37      0.37      3079
weighted avg       0.85      0.88      0.84      3079



In [17]:
X_train, X_test, num_words = tokenization(X_train, X_test, max_sequence_length=50, words_token=10000)

In [19]:
max_sequence_length = 50

In [30]:
# Bidirectional LSTM with Embedding
# Layers follow the summary printed below. A second LSTM stacked directly on
# LSTM(128) would fail, because the first one does not return sequences.

model = Sequential()
model.add(InputLayer(input_shape=(max_sequence_length,)))  # shape must be a tuple
model.add(Embedding(input_dim=num_words,
                    output_dim=300,
                    input_length=max_sequence_length))
model.add(Dropout(0.20))
model.add(Bidirectional(LSTM(128, recurrent_dropout=0.20)))
model.add(Dense(64, activation='relu'))
model.add(Dense(3, activation='softmax'))


model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 50, 300)           5545500   
_________________________________________________________________
dropout_1 (Dropout)          (None, 50, 300)           0         
_________________________________________________________________
bidirectional (Bidirectional (None, 256)               439296    
_________________________________________________________________
dense_4 (Dense)              (None, 64)                16448     
_________________________________________________________________
dense_5 (Dense)              (None, 3)                 195       
Total params: 6,001,439
Trainable params: 6,001,439
Non-trainable params: 0
_________________________________________________________________


In [31]:
model.compile(optimizer=RMSprop(0.001),
              loss=SparseCategoricalCrossentropy(),
              metrics=['accuracy'])


history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [32]:
loss, accuracy = model.evaluate(X_test, y_test)



In [33]:
y_pred = model.predict(X_test)
y_pred = np.argmax(y_pred, axis=1)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.31      0.20      0.24       139
           1       0.31      0.25      0.28       229
           2       0.91      0.94      0.93      2711

    accuracy                           0.86      3079
   macro avg       0.51      0.47      0.48      3079
weighted avg       0.84      0.86      0.85      3079

