# Word2Vec for Text Classification

This short notebook shows how to use a pre-trained Word2Vec model to extract features for text classification.

We will use the Sentiment Labelled Sentences dataset from the UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

The dataset consists of 1,500 positive and 1,500 negative sentences from Amazon, Yelp, and IMDB. First, combine the three separate data files into one using the following Unix command:

```cat amazon_cells_labelled.txt imdb_labelled.txt yelp_labelled.txt > sentiment_sentences.txt```
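
If a Unix shell is not available, the same concatenation can be sketched in Python. The helper name `combine_files` is hypothetical, and the sketch assumes that each input file ends with a newline so that no two sentences are joined on one line.

```python
import shutil

def combine_files(input_paths, output_path):
    """Concatenate the input files, in order, into output_path (like `cat a b c > out`)."""
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)

# Usage (assumes the three dataset files are in the current directory):
# combine_files(
#     ["amazon_cells_labelled.txt", "imdb_labelled.txt", "yelp_labelled.txt"],
#     "sentiment_sentences.txt",
# )
```

Copying in binary mode leaves the bytes untouched, so the result should match what `cat` produces.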

For a pre-trained embedding model, we will use the Google News vectors.

In [1]:
#basic imports
import warnings
warnings.filterwarnings('ignore')
import os
import wget
import gzip
import shutil
from time import time

#pre-processing imports
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

#imports related to modeling
import numpy as np
import gensim.downloader
from gensim.models import Word2Vec, KeyedVectors
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

[nltk_data] Downloading package stopwords to /Users/fulin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/fulin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Loading the dataset

In [2]:
#Read the text data and categories.
#Each line of the file holds a sentence and its category, separated by a tab.
texts = []
categories = []
with open('./Data/sentiment_sentences.txt') as fh:  #the file is closed automatically
    for line in fh:
        text, sentiment = line.split("\t")
        texts.append(text)
        categories.append(sentiment)  #keeps the trailing newline; stripped before training

In [3]:
#Inspect the dataset
print(len(categories), len(texts))
print("the category of text 1: '",texts[1], "' is ", categories[1])
print("The set of categories are: ", set(categories))

3000 3000
the category of text 1: ' Good case, Excellent value. ' is  1

The set of categories are:  {'0\n', '1\n'}
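
The labels printed above still end with a newline character, because each line is split without being stripped first; the notebook converts them to integers later, before training. As a minimal sketch, both steps could be done while reading. The function name `parse_labelled_lines` is hypothetical, and the sketch assumes every non-empty line ends with a tab followed by a `0` or `1` label.

```python
def parse_labelled_lines(lines):
    """Split 'sentence<TAB>label' lines into (texts, labels) with integer labels."""
    texts, labels = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab, so a tab inside the sentence does not break parsing
        text, label = line.rsplit("\t", 1)
        texts.append(text)
        labels.append(int(label))
    return texts, labels

# Illustrative input; the second line is made up for this sketch
sample = ["Good case, Excellent value.\t1\n", "An invented negative sentence.\t0\n"]
sample_texts, sample_labels = parse_labelled_lines(sample)
print(sample_texts, sample_labels)
```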


## Loading the word2vec model
For this task, use the `gensim.downloader` API, which returns the vectors as a `KeyedVectors` object. <br>
* Load the pre-trained Google News 300-dimensional Word2Vec model into the variable w2v_model
* Inspect the model by counting the words in its vocabulary

**NB:**<br>
The magic command *%time* reports the CPU time and wall-clock time taken by the statement that follows it.
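
Outside IPython or Jupyter, `%time` is not available. A rough wall-clock equivalent, using only the standard library, is to wrap the call with `time.perf_counter`. The helper `timed` below is a hypothetical name, not part of gensim or IPython, and it measures wall-clock time only, not CPU time.

```python
from time import perf_counter

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = perf_counter()
    result = func(*args, **kwargs)
    elapsed = perf_counter() - start
    return result, elapsed

# Example with a cheap stand-in for the slow model load
total, seconds = timed(sum, range(1_000_000))
print(f"Wall time: {seconds:.3f} s")
```

The same wrapper could be applied to the model-loading call, for example `timed(gensim.downloader.load, 'word2vec-google-news-300')`.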

In [4]:
#Load W2V model. This will take some time.
%time w2v_model = gensim.downloader.load('word2vec-google-news-300')
print('done loading Word2Vec')

CPU times: user 17.8 s, sys: 667 ms, total: 18.5 s
Wall time: 19 s
done loading Word2Vec


In [5]:
#Inspect the model
word2vec_vocab = list(w2v_model.key_to_index.keys())
word2vec_vocab_lower = [item.lower() for item in word2vec_vocab]
print(len(set(word2vec_vocab)), len(set(word2vec_vocab_lower)))

3000000 2702150


## Preprocessing text
* Convert all text to lower case
* Remove punctuation
* Tokenize the text
* Remove stopwords
* Remove tokens that consist only of digits

In [9]:
#preprocess the text.

import string

def preprocess_corpus(texts):
    
    stop_words = set(stopwords.words('english'))
    processed_texts = []

    for text in texts:
        text = text.lower().translate(str.maketrans('', '', string.punctuation))
        
        words = word_tokenize(text)
        
        words_filtered = [word for word in words if word not in stop_words and not word.isdigit()]
        processed_texts.append(' '.join(words_filtered))

    return processed_texts

texts_processed = preprocess_corpus(texts)
print(len(categories), len(texts_processed))
print(texts_processed[1])
print(categories[1])

3000 3000
good case excellent value
1



## Converting text to numeric
In this section, we will convert the text into numerical features that can be fed to a machine learning model for classification. <br>
We will look up the embedding of each word using w2v_model. <br>
Each sentence is then represented by the average of the embeddings of its words; words missing from the model's vocabulary are skipped.
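
To make the averaging concrete, here is a toy sketch with made-up 3-dimensional vectors standing in for the 300-dimensional Google News embeddings. The lookup table `toy_vectors` and the function `average_embedding` are hypothetical, and plain Python lists are used instead of NumPy arrays so the arithmetic stays visible. Tokens missing from the table are skipped, and a sentence with no known tokens maps to the zero vector.

```python
# Made-up 3-dimensional "embeddings"; real Word2Vec vectors here have 300 dimensions.
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "case": [3.0, 2.0, 0.0],
}

def average_embedding(tokens, vectors, dimension=3):
    """Average the vectors of the known tokens; return zeros if none are known."""
    total = [0.0] * dimension
    known = 0
    for token in tokens:
        if token in vectors:
            total = [t + v for t, v in zip(total, vectors[token])]
            known += 1
    if known == 0:
        return total
    return [t / known for t in total]

print(average_embedding(["good", "case", "unknownword"], toy_vectors))  # [2.0, 1.0, 1.0]
print(average_embedding(["unknownword"], toy_vectors))                  # [0.0, 0.0, 0.0]
```

Averaging discards word order, so two sentences with the same words in a different order get the same feature vector.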

In [10]:
# Create one feature vector per sentence by averaging the embeddings of its words
def embedding_feats(sentences):
    DIMENSION = 300
    features = []
    for sentence in sentences:
        # Running sum of the embeddings of the in-vocabulary tokens
        feature_vector = np.zeros(DIMENSION)
        known_tokens = 0
        for token in sentence.split():
            if token in w2v_model:
                feature_vector += w2v_model[token]
                known_tokens += 1
        # Sentences with no in-vocabulary tokens keep the zero vector
        if known_tokens > 0:
            feature_vector /= known_tokens
        features.append(feature_vector)
    return features


embedded_texts = embedding_feats(texts_processed)
print(len(embedded_texts))

3000


## Text Classification
For this example, we will use a simple logistic regression model to classify the text.<br>
* Convert the categories to integers
* Split embedded_texts and categories into training and test sets
* Initialize a Logistic Regression model
* Fit the model on the training samples
* Predict on the test samples
* Print the classification report

In [11]:
categories = [int(category.strip()) for category in categories]

In [12]:
X_train, X_test, y_train, y_test = train_test_split(embedded_texts, categories, test_size=0.2, random_state=42)

In [13]:
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train, y_train)

In [14]:
y_pred = lr_model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.81      0.81       299
           1       0.81      0.81      0.81       301

    accuracy                           0.81       600
   macro avg       0.81      0.81      0.81       600
weighted avg       0.81      0.81      0.81       600

