<h1 style="text-align: center;"><span style="color: #333399;">Word2Vec and Doc2Vec for Text Classification</span></h1>

<h6 style="text-align: center;">Created by: Michael Gagliano on 2/8/19</h6>
<h6 style="text-align: center;">Last Update: Michael Gagliano 3/11/19</h6>


# I.  Overview - Preface

In this notebook I will demonstrate how to utilize Word2Vec and Doc2Vec in order to generate and apply word embeddings used for text classification.

This notebook is *reference-based*, so the methods being implemented will be discussed in an appendix.

References are provided throughout to credit original processes, code, and ideas where due. As always, there are countless ways to solve a problem. We just want to do it:  
<br>
<b>
1) Efficiently  (No need to fry your i5 or i7 processors running GridSearch locally with 1.5GB of data)  
    


2) Accurately  (Did we get the program to do what we wanted it to do i.e. is the estimated hypothesis as close to the target function as possible?)


3) Elegantly   (concise code; scalable, reproducible, annotated and simple)
</b>
</br>

***This notebook is a reference adaptation of the wonderful work done by Susan Li found [here][li]***

[li]: https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568

# II. Package Importing

In [84]:
# Data Extraction Packages from Web Pages
import requests
import urllib.request  # the submodule must be imported for urllib.request.urlopen below
import bs4 as bs  

# Standard Data Analytics Packages
import csv
import numpy as np
import pandas as pd

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# NLP Packages 
import gensim
import nltk
from nltk import FreqDist, word_tokenize
from nltk.corpus import stopwords
import re 
from bs4 import BeautifulSoup

# Data Visualization Packages
import seaborn as sns
sns.set_style("whitegrid") # Sets grid to generated graphs
sns.set_context("poster") # Makes images large

import logging

import matplotlib.pyplot as plt
from matplotlib import rcParams # special matplotlib argument for improved plots
%matplotlib inline

---

# Text Classification

<b><u>Brief Review:</u></b>

Text classification is ***Supervised Machine Learning***. 

For supervised learning, the data must contain observations (X) and labels (y).
 - The classification may be binary, or multiclass/multilabel. Methods of analysis differ between them.
 
 
 - If you need to perform classification but the data is unlabeled, you may need to perform clustering of some kind to determine underlying structure based on features (i.e. If you don't have labels, get them in a reasonable manner)

A classifier uses its inductive bias to narrow a space of candidate hypotheses down to one whose predictions (y_pred) are as close as possible to the true labels given by the target function (y_test).

In doing so, new and unseen instances can be classified by the trained classifier algorithm with (hopefully) the best accuracy.

## 1. Implementation

### 1.1 Fetching Data

In [47]:
# Fetching Data; Cached memory, not locally stored
url = 'https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv'
response = urllib.request.urlopen(url)
html = response.read()

# Saving file locally now
with open('so_data.csv', 'wb') as f:
        f.write(html)

In [48]:
# Load data into pandas DataFrame object and preview first 5 rows
df = pd.read_csv('so_data.csv')
df.head()

| | post | tags |
|---|---|---|
| 0 | what is causing this behavior in our c# datet... | c# |
| 1 | have dynamic html load as if it was in an ifra... | asp.net |
| 2 | how to convert a float value in to min:sec i ... | objective-c |
| 3 | .net framework 4 redistributable just wonderi... | .net |
| 4 | trying to calculate and print the mean and its... | python |


In [49]:
# Determine the structure of the data i.e. How many rows and columns
df.shape

(40000, 2)

In [50]:
# Count the total number of words across all posts in the 'post' column
print(df['post'].apply(lambda x: len(x.split(' '))).sum())

10286120


The corpus contains 10,286,120 words across 40,000 posts collected from the Stack Overflow forums. The number of features and dimensions can explode rapidly in NLP analysis, and even so, the data we are using here is considered a very **small** dataset.
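
One caveat on the count above: `x.split(' ')` splits on every single space, so runs of consecutive spaces (which the raw posts contain) produce empty strings that are counted as words. A minimal sketch on a made-up sample string, comparing it with `split()`, which collapses any run of whitespace:

```python
# Hypothetical sample text with a double space, similar to the raw posts
sample = "what is causing this behavior  in our c# datetime type"

# Splitting on a single space keeps an empty string for each extra space
tokens_single_space = sample.split(' ')

# Splitting with no argument collapses any run of whitespace
tokens_whitespace = sample.split()

print(len(tokens_single_space))
print(len(tokens_whitespace))
```

The raw total is therefore best read as an upper bound on the true word count.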

### 1.2. Examining Class Distribution

Done to determine whether or not classes are balanced/evenly distributed.  
If they were not, further statistical pre-processing would be needed to create valid assumptions for the analysis.

In [51]:
df['tags'].value_counts()

javascript       2000
php              2000
mysql            2000
c#               2000
java             2000
android          2000
angularjs        2000
jquery           2000
iphone           2000
ruby-on-rails    2000
python           2000
c++              2000
ios              2000
asp.net          2000
html             2000
sql              2000
objective-c      2000
.net             2000
c                2000
css              2000
Name: tags, dtype: int64
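
The visual check above can also be turned into a small reusable test. This is only a sketch: the helper name `is_balanced` and the `tolerance` parameter are my own assumptions, and the toy label lists stand in for `df['tags']`.

```python
from collections import Counter

def is_balanced(labels, tolerance=0.0):
    """Return True if every class count is within `tolerance` (a fraction) of the mean count."""
    counts = Counter(labels)
    mean_count = sum(counts.values()) / len(counts)
    return all(abs(c - mean_count) <= tolerance * mean_count for c in counts.values())

# Hypothetical toy labels standing in for df['tags']
balanced = ['c#'] * 3 + ['python'] * 3 + ['sql'] * 3
skewed = ['c#'] * 8 + ['python'] * 1 + ['sql'] * 3

print(is_balanced(balanced))
print(is_balanced(skewed, tolerance=0.2))
```

In this notebook, `is_balanced(df['tags'])` should report a balanced dataset, since every tag appears exactly 2000 times.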

### 1.3 Examining Post Content

Very, very rarely will mined text data be clean and ready to process right out of the gate. It's important to examine the data to get an idea of how extensive the pre-processing might be, and to determine the most effective ways to accomplish it.

In [52]:
# Post Content for first entry in data (index pos = 0)
df['post'][0]

'what is causing this behavior  in our c# datetime type  <pre><code>[test] public void sadness() {    var datetime = datetime.utcnow;    assert.that(datetime  is.equalto(datetime.parse(datetime.tostring()))); } </code></pre>   failed :   <pre><code> expected: 2011-10-31 06:12:44.000  but was:  2011-10-31 06:12:44.350 </code></pre>   i wish to know what is happening behind the scenes in tostring() etc to cause this behavior.    edit after seeing jon s answer :   <pre><code>[test] public void newsadness() {     var datetime = datetime.utcnow;     assert.that(datetime  is.equalto(datetime.parse(datetime.tostring( o )))); } </code></pre>   result :   <pre><code>expected: 2011-10-31 12:03:04.161 but was:  2011-10-31 06:33:04.161 </code></pre>   same result with capital and small  o  . i m reading up the docs  but still unclear.'

In [53]:
# Tag assignment for first entry in data (index pos = 0)
df['tags'][0]

'c#'

In [54]:
# Make it a function 

def print_plot(index):
    example = df[df.index == index][['post', 'tags']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Tag:', example[1])
print_plot(0)

what is causing this behavior  in our c# datetime type  <pre><code>[test] public void sadness() {    var datetime = datetime.utcnow;    assert.that(datetime  is.equalto(datetime.parse(datetime.tostring()))); } </code></pre>   failed :   <pre><code> expected: 2011-10-31 06:12:44.000  but was:  2011-10-31 06:12:44.350 </code></pre>   i wish to know what is happening behind the scenes in tostring() etc to cause this behavior.    edit after seeing jon s answer :   <pre><code>[test] public void newsadness() {     var datetime = datetime.utcnow;     assert.that(datetime  is.equalto(datetime.parse(datetime.tostring( o )))); } </code></pre>   result :   <pre><code>expected: 2011-10-31 12:03:04.161 but was:  2011-10-31 06:33:04.161 </code></pre>   same result with capital and small  o  . i m reading up the docs  but still unclear.
Tag: c#


You will notice here that in order to create a high-quality word embedding, we will need to remove all of the HTML tags, symbols, and extra spaces here that are not relevant to analysis. 

**Note:** Many of our tags contain a mix of alphanumeric characters and symbols (Example: '.net', 'c#', etc.) so we must take additional care to ensure those terms are maintained and not eliminated. 

# 2. Text Pre-Processing

Regular expressions, despite what many people believe, are your friend here. Combined with the BeautifulSoup package, they make an extremely efficient way to clean text.

***IMPORTANT:*** Word2Vec, Doc2Vec, GloVe, and other word-embedding methods fall under the Text Pre-Processing section of text data mining.

### 2.1 Create Regular Expression Rules

Compiling regular expression patterns into pattern objects accomplishes two things:

1. The patterns are written once and can be reused, rather than re-written each time
2. Each pattern object can be applied directly for matching and substitution

In [62]:
# Compiled rule matching any of:  / ( ) { } [ ] | @ , ;
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')

# Compiled rule matching anything that IS NOT a digit, a lowercase letter, a space, "#", "+", or "_"
    # In other words: remove the remaining symbols such as:   ":", ".", "-", "="
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')

# Create general stopwords lexicon using the pre-compiled English stopwords list from NLTK
STOPWORDS = set(stopwords.words('english'))

### 2.2 Text Extraction with Beautiful Soup and Regular Expression

In [63]:
def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = BeautifulSoup(text, "lxml").text # HTML decoding to remove HTML source elements like <post>, <code>, <pre>
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # delete stopwords from text
    return text
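
Given the earlier note about tags that mix letters and symbols, it is worth checking what the two regex steps do to the tag names themselves. The sketch below re-declares the same two patterns so it runs on its own; the helper `strip_symbols` is a hypothetical name that applies only the regex steps of `clean_text` (no HTML decoding or stopword removal).

```python
import re

# Same two patterns as in the notebook, written as raw strings
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')

def strip_symbols(text):
    """Apply only the two regex steps of clean_text (no HTML or stopword handling)."""
    text = REPLACE_BY_SPACE_RE.sub(' ', text.lower())
    return BAD_SYMBOLS_RE.sub('', text)

for tag in ['c#', 'c++', '.net', 'asp.net', 'objective-c', 'ruby-on-rails']:
    print(repr(tag), '->', repr(strip_symbols(tag)))
```

`c#` and `c++` survive unchanged, while dotted and hyphenated tags are collapsed into a single token (for example `asp.net` becomes `aspnet`). Whether that matters depends on which spellings the embedding vocabulary contains.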

### 2.3 Pull examples to observe results; Make adjustments if necessary

In [64]:
# First Row; index pos = 0

df['post'] = df['post'].apply(clean_text)
print_plot(0)

causing behavior c# datetime type test public void sadness var datetime datetimeutcnow assertthat datetime isequalto datetimeparse datetimetostring failed expected 20111031 061244000 20111031 061244350 wish know happening behind scenes tostring etc cause behavior edit seeing jon answer test public void newsadness var datetime datetimeutcnow assertthat datetime isequalto datetimeparse datetimetostring result expected 20111031 120304161 20111031 063304161 result capital small reading docs still unclear
Tag: c#


In [65]:
# 11th Row; Index Pos = 10

# df['post'] was already cleaned in the previous cell, so clean_text is not re-applied here
print_plot(10)

need interface c# possible duplicate would want use interfaces need interface want know use example interface idemo function prototype public void show first class using interface class myclass1 idemo public void show function body comes responsewrite myclass second class using interface class myclass2 idemo public void show function body comes responsewrite myclass2 responsewrite two classes function name different body even achieved without interface need interface use
Tag: c#


This is a **MUCH** better outcome. Text cleaning is hard, and it will always be difficult to completely isolate words and phrases no matter how specific we make our regular expression patterns. What we just did is an example of [Dimensionality Reduction](#Dimensionality-Reduction:), a way of reducing the number of unnecessary variables/features in our data. We can observe this via the following cell.

In [68]:
# Check updated word count found in posts
df['post'].apply(lambda x: len(x.split(' '))).sum()

3169018

Reducing the total word count from just over 10 million before pre-processing to just over 3 million is a significant reduction!
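
To put a number on that reduction, here is the arithmetic using the two totals printed in this notebook:

```python
words_before = 10286120  # word count before cleaning (from the earlier cell)
words_after = 3169018    # word count after cleaning

remaining = words_after / words_before
reduction = 1 - remaining

print("remaining: {:.1%}".format(remaining))
print("reduction: {:.1%}".format(reduction))
```

Roughly 31% of the original tokens remain, so cleaning removed a little over two thirds of them.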

### 2.4 Splitting the Data for Model Development

We now need to split the data into separate training and testing sets, so that a classification model can be built and validated.

<b><u>For this notebook:</u></b>

* I am splitting the Train/Test data on a 75/25 split for model building and validation

In [70]:
X = df.post
y = df.tags
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 42)

In [71]:
# Checking sizes of each data split
print("training data points: {}".format(len(y_train)))
print("testing data points: {}".format(len(y_test)))

training data points: 30000
testing data points: 10000


# 2.5 Feature Engineering -    ** IMPORTANT **

<h1 style="text-align: center;"><span style="color: green">Word2Vec</span></h1>

Before we actually create any text classification algorithm pipelines, we need a [word embedding](#Word-Embedding:). Word2Vec simply helps us convert large amounts of data into a word embedding for further analysis. *Applying* the Word2Vec output for methods such as Text Classification/Sentiment Analysis is what our actual goal here is.

<b><u>For this notebook:</u></b>

* We are using a pre-trained model from Google, trained on a ***100 billion word corpus*** from Google News, with ***300-dimensional*** word vectors.

  
* We are comparing technical programming forum text documents relative to this pre-trained word2vec for Google news. Do you believe this will affect the accuracy of our classification models? If so, how?

<i>The pre-trained binary model can be downloaded [here][word2vecgoogle]. It is a large 1.5 GB file, so it may take a while! </i>

Make sure it is downloaded/stored in the same working directory; otherwise you will need to specify the full filepath when importing the model later.

[word2vecgoogle]: https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download

<b><u>Outside of this notebook:</u></b>

* You can train your own Word2Vec model, so that the embedding is best aligned with the focus of your text documents.

See: [Appendix A](#Appendix-A:-External-References) No. 2



### 2.5.1 Load Word2Vec pre-trained word embedding

In [73]:
# Will take a while to load ~1-3 min
from gensim.models import Word2Vec

wv = gensim.models.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
wv.init_sims(replace=True) # L2-normalizes the vectors in place (replace=True), which lowers memory load (gensim 3.x API)

After training, the word2vec model holds two vectors for each word in the vocabulary: 

<b>

1. Word embedding (rows of input/hidden matrix)  
 
2. Context embedding (columns of hidden/output matrix)  
</b>

The most common bag-of-words way to combine word vectors into a single document vector is to average them (as opposed to summing the vectors without averaging, or concatenating the vectors).

In [80]:
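def word_averaging(wv, words):
    """Return the unit-length mean of the Word2Vec vectors for `words`."""
    all_words, mean = set(), []
    
    for word in words:
        if isinstance(word, np.ndarray):
            # Already a vector: use it as-is
            mean.append(word)
        elif word in wv.vocab:
            # Look up the L2-normalized vector of an in-vocabulary word
            mean.append(wv.vectors_norm[wv.vocab[word].index])
            all_words.add(wv.vocab[word].index)

    if not mean:
        # None of the tokens were found in the embedding vocabulary
        logging.warning("cannot compute similarity with no input %s", words)
        # FIXME: remove these examples in pre-processing
        return np.zeros(wv.vector_size,)

    # Average the word vectors, then rescale the result to unit length
    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean

def word_averaging_list(wv, text_list):
    # Stack one averaged vector per document into a 2-D array
    return np.vstack([word_averaging(wv, post) for post in text_list])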
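
To make the averaging step concrete, here is a toy version in plain Python. The 3-dimensional vectors and the helper name `average_vector` are made up for illustration; the real model uses 300-dimensional vectors and the gensim-based function above.

```python
import math

# Hypothetical 3-dimensional embeddings; real Word2Vec vectors have 300 dimensions
toy_vectors = {
    'python': [1.0, 0.0, 0.0],
    'pandas': [0.0, 1.0, 0.0],
}

def average_vector(vectors, tokens, size=3):
    """Mean of the known tokens' vectors, rescaled to unit length (zeros if none are known)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    mean = [sum(component) / len(known) for component in zip(*known)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean

# 'datetimeutcnow' is not in the toy vocabulary, so it is silently skipped
doc_vector = average_vector(toy_vectors, ['python', 'pandas', 'datetimeutcnow'])
print(doc_vector)
```

Note that out-of-vocabulary tokens contribute nothing to the document vector, which matters for posts full of identifiers that a news corpus never saw.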

### 2.5.2. Tokenization Check

Make sure the text you want to map against the Word2Vec model is tokenized; documents must be in tokenized form to be mapped properly. If they are not, as in our current case, we need to tokenize them before applying the word embedding.

In [81]:
def w2v_tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text, language='english'):
        for word in nltk.word_tokenize(sent, language='english'):
            if len(word) < 2:
                continue
            tokens.append(word)
    return tokens
    
train, test = train_test_split(df, test_size=0.3, random_state = 42)

test_tokenized = test.apply(lambda r: w2v_tokenize_text(r['post']), axis=1).values
train_tokenized = train.apply(lambda r: w2v_tokenize_text(r['post']), axis=1).values

X_train_word_average = word_averaging_list(wv,train_tokenized)
X_test_word_average = word_averaging_list(wv,test_tokenized)
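
One side effect of the `len(word) < 2` filter is worth checking: single-character tokens are dropped, and one of the tags in this dataset is `c`. The sketch below uses plain whitespace splitting rather than NLTK (an assumption made to keep it self-contained; NLTK may split tokens such as `c#` differently), so it only illustrates the length filter. The helper name and the sample post are hypothetical.

```python
def filter_short_tokens(tokens, min_length=2):
    """Keep only tokens with at least `min_length` characters, as w2v_tokenize_text does."""
    return [token for token in tokens if len(token) >= min_length]

# Hypothetical cleaned post, tokenized on whitespace instead of with NLTK
tokens = "pointer arithmetic c struct array".split()
kept = filter_short_tokens(tokens)
print(kept)
```

The token `c` disappears, so a post about C loses one of its most informative words before the embedding lookup.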



In [82]:
my_tags = ['java','html','asp.net','c#','ruby-on-rails','jquery','mysql',
           'php','ios','javascript','python','c','css','android','iphone',
           'sql','objective-c','c++','angularjs','.net']
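
A caution on `my_tags`: as far as I know, when `labels` is not passed, scikit-learn's `classification_report` orders the classes by their sorted label values and pairs `target_names` with them by position. If so, a hand-ordered list like the one above attaches names to the wrong rows, and the row names in the printed reports in this notebook may not match the true classes. A quick check of the sorted order, using only the tag names from this notebook:

```python
my_tags = ['java', 'html', 'asp.net', 'c#', 'ruby-on-rails', 'jquery', 'mysql',
           'php', 'ios', 'javascript', 'python', 'c', 'css', 'android', 'iphone',
           'sql', 'objective-c', 'c++', 'angularjs', '.net']

# Plain string sorting, which is how the label values would be ordered
sorted_tags = sorted(my_tags)

print(sorted_tags)
print(sorted_tags == my_tags)
```

Under that assumption, passing `target_names=sorted(df['tags'].unique())`, or passing `labels=my_tags` together with `target_names=my_tags`, should keep names and rows aligned.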

# 3. Model Deployment and Analysis

### 3.1 Logistic Regression (Multi-class text classification) - Word2Vec

In [85]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg = logreg.fit(X_train_word_average, train['tags'])

y_pred = logreg.predict(X_test_word_average)

print('accuracy %s' % accuracy_score(y_pred, test.tags))
print(classification_report(test.tags, y_pred,target_names=my_tags))



accuracy 0.6363333333333333
               precision    recall  f1-score   support

         java       0.62      0.59      0.61       613
         html       0.73      0.75      0.74       620
      asp.net       0.65      0.66      0.66       587
           c#       0.50      0.47      0.48       586
ruby-on-rails       0.70      0.77      0.73       599
       jquery       0.44      0.39      0.41       589
        mysql       0.66      0.61      0.63       594
          php       0.72      0.81      0.76       610
          ios       0.60      0.59      0.60       617
   javascript       0.56      0.53      0.54       587
       python       0.55      0.51      0.53       611
            c       0.63      0.61      0.62       594
          css       0.65      0.65      0.65       619
      android       0.62      0.58      0.60       574
       iphone       0.68      0.71      0.70       584
          sql       0.42      0.43      0.43       578
  objective-c       0.68      0.72  

<b>Using the pre-trained Word2Vec word embedding, the accuracy of the multi-class logistic regression model is only 63.6%</b>

This is certainly not ideal. Consider why this might be the case, given all that we've done so far.

*Hint: Pre-processing/text cleaning, the pre-trained model context, hyperparameters in the ML model, etc.*
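
One way to probe the hint about the pre-trained model's context is to measure how many tokens in the posts are found in the embedding vocabulary at all. This is a sketch on made-up data: `vocabulary_coverage`, the tiny `news_vocab` set, and the two token lists are all hypothetical stand-ins.

```python
def vocabulary_coverage(tokenized_docs, vocabulary):
    """Fraction of all tokens that appear in `vocabulary`."""
    total = sum(len(doc) for doc in tokenized_docs)
    if total == 0:
        return 0.0
    found = sum(1 for doc in tokenized_docs for token in doc if token in vocabulary)
    return found / total

# Hypothetical stand-ins: a tiny "news" vocabulary and two cleaned posts
news_vocab = {'public', 'void', 'test', 'result', 'expected', 'need', 'interface'}
docs = [
    ['datetimeutcnow', 'assertthat', 'public', 'void', 'test'],
    ['need', 'interface', 'responsewrite', 'myclass2', 'idemo'],
]

print(vocabulary_coverage(docs, news_vocab))
```

With the gensim 3.x objects used in this notebook, something like `vocabulary_coverage(train_tokenized, wv.vocab)` should give the real figure. A low value would point to the cleaning step (fused tokens such as `datetimeutcnow`) and the domain mismatch as likely culprits.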

### 3.2 Comparing the Word2Vec Embedding vs. Traditional Methods

In [86]:
X = df.post
y = df.tags
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 42)

### 3.2.1 Naive Bayes - Traditional Bag-of-Words (TF-IDF) Method

In [88]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

# Create Data Transformation Pipeline
nb = Pipeline([('vect', CountVectorizer()), # Creates word-frequency matrix
               ('tfidf', TfidfTransformer()), # Convert word-freq matrix into TF-IDF scored matrix
               ('clf', MultinomialNB()), # Naive Bayes Classification
              ])
nb.fit(X_train, y_train)

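# NOTE: a bare %time times an empty statement (hence "Wall time: 0 ns" below); to time prediction, use: %time y_pred = nb.predict(X_test)
%time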
from sklearn.metrics import classification_report
y_pred = nb.predict(X_test)

print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=my_tags))

Wall time: 0 ns
accuracy 0.7425
               precision    recall  f1-score   support

         java       0.65      0.64      0.65       518
         html       0.93      0.87      0.90       507
      asp.net       0.86      0.90      0.88       493
           c#       0.68      0.76      0.72       479
ruby-on-rails       0.72      0.86      0.78       495
       jquery       0.73      0.53      0.62       484
        mysql       0.77      0.75      0.76       490
          php       0.69      0.91      0.78       504
          ios       0.63      0.61      0.62       524
   javascript       0.58      0.64      0.61       487
       python       0.72      0.51      0.59       510
            c       0.81      0.78      0.80       500
          css       0.84      0.60      0.70       512
      android       0.66      0.85      0.74       472
       iphone       0.67      0.81      0.74       493
          sql       0.68      0.63      0.66       496
  objective-c       0.81      0.

<b>Results:</b> Using the Naive Bayes classifier with TF-IDF features instead of Word2Vec embeddings **results in a significantly greater model accuracy of 74%**

**Why could this be?**

Remember, we represented the Stack Overflow posts using word embeddings from the Google News pre-trained model. The *contexts* of these two corpora are very different: Google News covers a broad range of general topics, while the Stack Overflow posts are likely to contain very technical terms and vocabulary not typically seen in news articles.

Naive Bayes has a high accuracy here, but that is because we did not introduce new external data to validate the model. To do this, K-Folds cross validation would be a desirable method. However, we need to keep in mind [target leakage][link1] and the implications it can have when external validation data isn't used.

[link1]: https://www.datarobot.com/wiki/target-leakage/
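
As a rough picture of what K-Folds cross validation does, the sketch below generates the train/test index sets for each fold in plain Python. The function name `k_fold_indices` is hypothetical, and the folds are contiguous with no shuffling, which is a simplification; in practice scikit-learn's `KFold` or `cross_val_score` would normally be used instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Every sample is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split.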

### 3.2.2 SVM - Traditional Bag-of-Words (TF-IDF) Method

In [90]:
from sklearn.linear_model import SGDClassifier

sgd = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42, max_iter=5, tol=None)),
               ])
sgd.fit(X_train, y_train)

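# NOTE: a bare %time times an empty statement (hence "Wall time: 0 ns" below); to time prediction, use: %time y_pred = sgd.predict(X_test)
%time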

y_pred = sgd.predict(X_test)

print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=my_tags))



Wall time: 0 ns
accuracy 0.7881
               precision    recall  f1-score   support

         java       0.71      0.68      0.69       518
         html       0.83      0.94      0.88       507
      asp.net       0.88      0.94      0.91       493
           c#       0.83      0.75      0.79       479
ruby-on-rails       0.75      0.87      0.80       495
       jquery       0.79      0.41      0.54       484
        mysql       0.82      0.69      0.75       490
          php       0.68      0.95      0.80       504
          ios       0.79      0.57      0.66       524
   javascript       0.74      0.60      0.66       487
       python       0.73      0.64      0.68       510
            c       0.82      0.87      0.84       500
          css       0.76      0.78      0.77       512
      android       0.79      0.87      0.83       472
       iphone       0.82      0.80      0.81       493
          sql       0.72      0.70      0.71       496
  objective-c       0.82      0.

**Results:** SVMs are strong classifiers, and it shows here. This linear SVM (trained with stochastic gradient descent) is the most accurate model so far at 78.8%.
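
Both traditional pipelines above share the same features: word counts re-weighted by TF-IDF. Here is a minimal pure-Python sketch of that weighting, assuming scikit-learn's default smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row. The function name and the two toy documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed idf = ln((1 + n) / (1 + df)) + 1 and L2-normalized rows."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Hypothetical toy documents: 'select' appears in both, the other terms in one each
docs = [['select', 'table', 'sql'], ['select', 'div', 'css']]
rows = tfidf(docs)
print(rows[0])
```

Terms that appear in every document (`select` here) get a lower weight than terms that distinguish one document from another, which is why these features work well for separating tags.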

# 4. Doc2Vec Word Embedding

<h1 style="text-align: center;"><span style="color: green">Doc2Vec</span></h1>

Doc2Vec builds on Word2Vec by learning a vector for each whole sentence, paragraph, or document alongside the word vectors, so that entire documents can be compared with one another. The goal of this method is to provide a more accurate, clear representation of each document as a whole.

In [98]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

### 4.1 Document Tagging for Doc2Vec

In [99]:
def label_sentences(corpus, label_type):
    
    """
    Gensim's Doc2Vec implementation requires each document/paragraph to have a label associated with it.
    We do this by using the TaggedDocument class. The format will be "Train_i" or "Test_i" where "i" is
    a dummy index of the post.
    """
    
    labeled = []
    for i, v in enumerate(corpus):
        label = label_type + '_' + str(i)
        labeled.append(TaggedDocument(v.split(), [label]))
    return labeled


X_train, X_test, y_train, y_test = train_test_split(df.post, df.tags, random_state=0, test_size=0.3)

X_train = label_sentences(X_train, 'Train')
X_test = label_sentences(X_test, 'Test')
all_data = X_train + X_test

#### Preview the tagged documents (words)

In [100]:
all_data[:2]

[TaggedDocument(words=['fulltext', 'search', 'php', 'pdo', 'returning', 'result', 'searched', 'lot', 'matter', 'find', 'wrong', 'setup', 'trying', 'fulltext', 'search', 'using', 'pdo', 'php', 'get', 'results', 'error', 'messages', 'table', 'contains', 'customer', 'details', 'id', 'int', '11', 'auto_increment', 'name', 'varchar', '150', 'lastname', 'varchar', '150', 'company', 'varchar', '250', 'adress', 'varchar', '150', 'postcode', 'int', '5', 'city', 'varchar', '150', 'email', 'varchar', '250', 'phone', 'varchar', '20', 'orgnr', 'varchar', '15', 'timestamp', 'timestamp', 'current_timestamp', 'run', 'sqlquery', 'alter', 'table', 'system_customer', 'add', 'fulltext', 'name', 'lastname', 'except', 'columns', 'id', 'postcode', 'timestamp', 'signs', 'trouble', 'far', 'idea', 'problem', 'lies', 'db', 'configuration', 'php', 'code', 'goes', 'php', 'sth', 'dbhprepare', 'select', 'name', 'lastname', 'company', 'adress', 'city', 'phone', 'email', 'orgnr', 'db_pre', 'customer', 'match', 'name',

When training the Doc2Vec model, we will set the following parameters:

> dm=0 , distributed bag of words (DBOW) is used.


> vector_size=300 , 300-dimensional feature vectors.


> negative=5 , specifies how many “noise words” should be drawn.


> min_count=1, ignores all words with total frequency lower than this.


> alpha=0.065 , the initial learning rate.



### 4.2 Create a Distributed Representation Model and Train over 30 Iterations of Shuffled Data

In [101]:
# Shows a progress bar while iterating over the documents (it tracks the list iteration, not the training itself)
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from sklearn import utils

# Build Model
model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, min_count=1, alpha=0.065, min_alpha=0.065)
model_dbow.build_vocab([x for x in tqdm(all_data)])

# Train model over 30 iterations of shuffled data
for epoch in range(30):
    model_dbow.train(utils.shuffle([x for x in tqdm(all_data)]), total_examples=len(all_data), epochs=1)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha

100%|██████████████████████████████████████████| 40000/40000 [00:00<00:00, 4003917.71it/s]
100%|██████████████████████████████████████████| 40000/40000 [00:00<00:00, 4011193.04it/s]
100%|██████████████████████████████████████████| 40000/40000 [00:00<00:00, 2809877.40it/s]
100%|██████████████████████████████████████████| 40000/40000 [00:00<00:00, 1503294.36it/s]
100%|██████████████████████████████████████████| 40000/40000 [00:00<00:00, 2570984.43it/s]
100%|██████████████████████████████████████████| 40000/40000 [00:00<00:00, 2781368.70it/s]
100%|██████████████████████████████████████████| 40000/40000 [00:00<00:00, 2675535.99it/s]
100%|██████████████████████████████████████████| 40000/40000 [00:00<00:00, 2864961.75it/s]
100%|██████████████████████████████████████████| 40000/40000 [00:00<00:00, 4264020.74it/s]
100%|██████████████████████████████████████████| 40000/40000 [00:00<00:00, 1604308.45it/s]
100%|██████████████████████████████████████████| 40000/40000 [00:00<00:00, 2076156.86it/s]
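
The manual loop above lowers `alpha` by 0.002 after every pass. Assuming each `train` call picks up the current `model_dbow.alpha`, the learning rate per epoch works out as follows (the variable names here are my own):

```python
initial_alpha = 0.065
step = 0.002
epochs = 30

# Learning rate used in each epoch: alpha is lowered after every pass
schedule = [round(initial_alpha - step * epoch, 3) for epoch in range(epochs)]

print(schedule[0], schedule[-1])

# Value left in alpha once the loop has finished
final_alpha = round(initial_alpha - step * epochs, 3)
print(final_alpha)
```

So training starts at 0.065 and the last pass runs at 0.007, a linear decay over the 30 epochs.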

### 4.3 Extract the trained Doc2Vec Word Embeddings for Analysis

In [102]:
def get_vectors(model, corpus_size, vectors_size, vectors_type):
    """
    Get vectors from trained doc2vec model
    :param model: Trained Doc2Vec model
    :param corpus_size: Size of the data
    :param vectors_size: Size of the embedding vectors
    :param vectors_type: Training or Testing vectors
    :return: list of vectors
    """
    vectors = np.zeros((corpus_size, vectors_size))
    for i in range(0, corpus_size):
        prefix = vectors_type + '_' + str(i)
        vectors[i] = model.docvecs[prefix]
    return vectors
    
train_vectors_dbow = get_vectors(model_dbow, len(X_train), 300, 'Train')
test_vectors_dbow = get_vectors(model_dbow, len(X_test), 300, 'Test')

### 4.4 Logistic Regression Model Analysis - Doc2Vec Word Embedding

In [103]:
logreg = LogisticRegression(n_jobs=1, C=1e5)

logreg = logreg.fit(train_vectors_dbow, y_train)

y_pred = logreg.predict(test_vectors_dbow)

print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred,target_names=my_tags))



accuracy 0.8036666666666666
               precision    recall  f1-score   support

         java       0.69      0.68      0.69       589
         html       0.89      0.91      0.90       661
      asp.net       0.93      0.94      0.93       606
           c#       0.77      0.77      0.77       613
ruby-on-rails       0.84      0.89      0.86       601
       jquery       0.71      0.71      0.71       585
        mysql       0.87      0.81      0.84       621
          php       0.81      0.83      0.82       587
          ios       0.67      0.66      0.67       560
   javascript       0.69      0.65      0.67       611
       python       0.66      0.67      0.67       593
            c       0.78      0.83      0.81       581
          css       0.82      0.75      0.78       608
      android       0.84      0.84      0.84       593
       iphone       0.84      0.82      0.83       592
          sql       0.72      0.66      0.69       597
  objective-c       0.85      0.89  

# Appendix A: External References

[1. RegEx Tester][Ref1]

[Ref1]: https://regexr.com/

[2. Building Your Own Custom Word Embedding for a Word2Vec Model][Ref2]

[Ref2]: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

[3. Word2Vec for Recommender Tags and Advertising][Ref3]

[Ref3]: http://mccormickml.com/2018/06/15/applying-word2vec-to-recommenders-and-advertising/

[4.a Word2Vec's init_sims() method][Ref4]

[Ref4]: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors.init_sims

[5. Doc2Vec Sentence and Paragraph Tagging for Pre-Processing via TaggedDocument method][Ref5]

[Ref5]: https://radimrehurek.com/gensim/models/doc2vec.html

[6. FastText API, an NLP Framework for Text Classification][Ref6]

[Ref6]: https://www.analyticsvidhya.com/blog/2017/07/word-representations-text-classification-using-fasttext-nlp-facebook/


# Appendix B: Terms and External References

[Return to Start of Notebook](#Text-Classification)

#### Word Embedding: 

A representation of text where words that have the same meaning have a similar representation. In other words, it represents words in a coordinate system where related words, based on a corpus of relationships, are placed closer together.

#### Dimensionality Reduction:

See: [link][dimred]

[dimred]: https://en.wikipedia.org/wiki/Dimensionality_reduction


#### Target Leakage:

Data leakage is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know and in turn invalidate the estimated performance of the model being constructed. 

See: [link][link2]

[link2]: https://machinelearningmastery.com/data-leakage-machine-learning/