## Multi-class classification using Gensim's Word2Vec and Doc2Vec models

Data: Titles and headlines from 93,239 news articles shared on Facebook, LinkedIn, and GooglePlus (we will use just news headlines and topics). Source: https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms

Classifiers used: Stochastic Gradient Descent (SGD), MLPclassifier (neural network)

Inputs used: Sentence vectors (as mean of word vectors), Doc2Vec vectors

In [1]:
import pandas as pd
df = pd.read_csv('c:/Users/abhatt/Desktop/Text_Analytics/python/data/News_SocialMedia.csv')
df.dtypes

IDLink               float64
Title                 object
Headline              object
Source                object
Topic                 object
PublishDate           object
SentimentTitle       float64
SentimentHeadline    float64
Facebook               int64
GooglePlus             int64
LinkedIn               int64
dtype: object

In [2]:
df = df[['Title', 'Headline', 'Topic']]
df = df.drop(df[df['Headline'].isna()].index, axis=0)   # 15 obs dropped

In [3]:
# Note: simple_preprocess() normalizes cases, drops punctuations, and tokenizes
# but does not remove stopwords or stem/lemmatize

from gensim.utils import simple_preprocess
tokenized_list = [simple_preprocess(h) for h in df['Headline']]



In [4]:
# Create test/train split

from sklearn.model_selection import train_test_split
import numpy as np

train_x, test_x, train_y, test_y = \
    train_test_split(np.array(tokenized_list), np.array(df['Topic']), 
    test_size=0.25, random_state=42)
train_x.shape, test_x.shape

  train_test_split(np.array(tokenized_list), np.array(df['Topic']),


((69918,), (23306,))

### Feature engineering using Word2Vec 
Word2Vec is Google's word vectorization model. In this approach, we will average word2vec vectors for all words in each title to compute a "sentence vector". This is not an ideal way to compute sentence vectors, doc2vec is a better approach.

In [5]:
from gensim.models import Word2Vec
w2v = Word2Vec(train_x, window=8, min_count=2, sample=1e-3, sg=1, workers=8)
vocab = set(w2v.wv.index_to_key)
len(vocab)

24915

In [6]:
num_features = 100

def average_word_vectors(tokens, model, vocabulary, num_features):
    feature_vector = np.zeros((num_features,), dtype="float64")
    ntokens = 0.
    for t in tokens:
        if t in vocabulary: 
            ntokens = ntokens + 1.
            feature_vector = np.add(feature_vector, model.wv[t])
    if ntokens:
        feature_vector = np.divide(feature_vector, ntokens)
    return feature_vector

In [7]:
w2v_train_x = [average_word_vectors(sent_tokens, w2v, vocab, num_features) 
               for sent_tokens in train_x]
avg_w2v_train_x = np.array(w2v_train_x)

w2v_test_x = [average_word_vectors(sent_tokens, w2v, vocab, num_features) 
              for sent_tokens in test_x]
avg_w2v_test_x = np.array(w2v_test_x)

print('Train features shape:', avg_w2v_train_x.shape, 
      '\nTest features shape:', avg_w2v_test_x.shape)

Train features shape: (69918, 100) 
Test features shape: (23306, 100)


### Feature engineering using Doc2Vec
Doc2Vec is Google's document vectorization model.

In [8]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(doc, [i]) for i, doc in enumerate(train_x)]
d2v = Doc2Vec(vector_size=100, window=3, min_count=4, workers=4, epochs=40)
d2v.build_vocab(docs)
d2v.train(docs, total_examples=d2v.corpus_count, epochs=d2v.epochs)

In [9]:
d2v_train_x = [d2v.infer_vector(i) for i in train_x]
d2v_test_x =  [d2v.infer_vector(i) for i in test_x]

### Classification using SGD classifier
SGD (Stochastic Gradient Descent) is a single-layer neural network (perceptron, with no hidden layers). All neural networks require hyper-parameter tuning. The best way to do that is by using grid search. In this example, we are not doing that, but using fixed values of hyper-parameters instead. However, this example uses SGD with 5-fold cross-validation.

In [10]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score, recall_score, precision_score, confusion_matrix, classification_report

sgd = SGDClassifier(loss='hinge', penalty='l2', random_state=42, max_iter=500)
metrics = pd.DataFrame(columns=['Classification_Model', 'Training_Data', 'Recall', 'Precision', 'F1_Score'])

train_x = [avg_w2v_train_x, d2v_train_x]
test_x  = [avg_w2v_test_x,  d2v_test_x]
input_model = ['Word2Vec', 'Doc2Vec']

In [11]:
for i, x in enumerate(train_x):
    sgd.fit(x, train_y)
    sgd_cv_accuracy = cross_val_score(sgd, x, train_y, cv=5)
    print('Input model:', input_model[i], '\n')
    print('CV Accuracy (5-fold):', sgd_cv_accuracy)
    
    sgd_cv_mean_accuracy = np.mean(sgd_cv_accuracy)
    print('Mean CV Accuracy:', sgd_cv_mean_accuracy)
    
    sgd_test_accuracy = sgd.score(test_x[i], test_y)
    print('Test Accuracy:', sgd_test_accuracy)

    pred_y = sgd.predict(test_x[i])
    print(classification_report(test_y, pred_y, target_names=df['Topic'].unique()))

    recall = recall_score(pred_y, test_y, average='weighted') 
    precision = precision_score(pred_y, test_y, average='weighted')  
    f1score = f1_score(pred_y, test_y, average='weighted') 
    
    metrics = metrics.append(pd.Series(['SGD Classifier', input_model[i], recall, 
                precision, f1score], index=metrics.columns), ignore_index=True)

Input model: Word2Vec 

CV Accuracy (5-fold): [0.95995423 0.9597397  0.96124142 0.96173925 0.95973682]
Mean CV Accuracy: 0.960482286557181
Test Accuracy: 0.9618982236333992
              precision    recall  f1-score   support

       obama       0.96      0.96      0.96      8525
     economy       0.97      0.97      0.97      5514
   microsoft       0.96      0.96      0.96      7025
   palestine       0.96      0.93      0.95      2242

    accuracy                           0.96     23306
   macro avg       0.96      0.96      0.96     23306
weighted avg       0.96      0.96      0.96     23306

Input model: Doc2Vec 

CV Accuracy (5-fold): [0.84854119 0.85018593 0.85633581 0.85561038 0.849174  ]
Mean CV Accuracy: 0.8519694620191796
Test Accuracy: 0.8363082468033982
              precision    recall  f1-score   support

       obama       0.81      0.87      0.84      8525
     economy       0.87      0.86      0.86      5514
   microsoft       0.83      0.84      0.84      7025
  

In [12]:
metrics.sort_values(['F1_Score'], ascending=False)

Unnamed: 0,Classification_Model,Training_Data,Recall,Precision,F1_Score
0,SGD Classifier,Word2Vec,0.961898,0.961983,0.961913
1,SGD Classifier,Doc2Vec,0.836308,0.843277,0.838011


### Classification using MLP classifier
MLP classifier is a multi-layer neural network. Here, we use two hidden layers of 512 and 128 nodes. We are NOT doing k-fold cross-validation here.

In [13]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(512, 128), activation='relu', solver='adam', 
    learning_rate='adaptive', early_stopping=True, alpha=1e-5, random_state=42)

In [14]:
for i, x in enumerate(train_x):
    mlp.fit(x, train_y)
    mlp_test_accuracy = mlp.score(test_x[i], test_y)
    print('Test Accuracy:', mlp_test_accuracy)    
    
    pred_y = mlp.predict(test_x[i])
    print(classification_report(test_y, pred_y, target_names=df['Topic'].unique()))

    recall = recall_score(pred_y, test_y, average='weighted') 
    precision = precision_score(pred_y, test_y, average='weighted')  
    f1score = f1_score(pred_y, test_y, average='weighted') 
    
    metrics = metrics.append(pd.Series(['MLP Classifier', input_model[i], recall, 
                precision, f1score], index=metrics.columns), ignore_index=True)

Test Accuracy: 0.9690637604050459
              precision    recall  f1-score   support

       obama       0.96      0.97      0.97      8525
     economy       0.98      0.98      0.98      5514
   microsoft       0.97      0.97      0.97      7025
   palestine       0.96      0.95      0.96      2242

    accuracy                           0.97     23306
   macro avg       0.97      0.97      0.97     23306
weighted avg       0.97      0.97      0.97     23306

Test Accuracy: 0.8522698017677851
              precision    recall  f1-score   support

       obama       0.85      0.87      0.86      8525
     economy       0.86      0.89      0.88      5514
   microsoft       0.87      0.84      0.85      7025
   palestine       0.79      0.74      0.77      2242

    accuracy                           0.85     23306
   macro avg       0.84      0.83      0.84     23306
weighted avg       0.85      0.85      0.85     23306



In [15]:
metrics.sort_values(['F1_Score'], ascending=False)

Unnamed: 0,Classification_Model,Training_Data,Recall,Precision,F1_Score
2,MLP Classifier,Word2Vec,0.969064,0.969065,0.969057
0,SGD Classifier,Word2Vec,0.961898,0.961983,0.961913
3,MLP Classifier,Doc2Vec,0.85227,0.853415,0.852623
1,SGD Classifier,Doc2Vec,0.836308,0.843277,0.838011


For this data, Word2Vec outperforms Doc2Vec by a significant margin (0.97 to 0.85), and MLP outperforms SGD by a slight margin (0.97 to 0.96 or 0.85 to 0.84). This is surprising because we expected the average Word2Vec vector to be an inferior input compared to the Doc2Vec vector.