## Baseline Tests

This notebook includes the classifiers that will be used to determine the performance of the deep learning-based classifier.

### Classifiers:

1) Naive Bayes

2) Support Vector Machines (SVM)

### Performance Measures:

1) Storage requirements of the classifier and feature representation used

2) Training time of the classifier

3) Speed of the classifier

4) Accuracy of the classifier

In [5]:
# Importing the libraries
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gensim
from gensim.test.utils import datapath
from gensim.models import Word2Vec
import os

# Helpful variables
EXT_DATA_FOLDER = "C:\\Users\\Admin\\Projects\\thesis\\data\\"
ANALYSIS_SAMPLES = os.path.join(EXT_DATA_FOLDER, "Credibility_Analysis_Samples\\September_25\\")
dataset_columns = ['Identifier', 'Type', 'Category', 'URL', 'Cat1', 'Cat2', 'Cat3', 'Cat4', 'Cat5',
 'Cat6', 'Cat7', 'Score', 'First date_time', 'Tweets', 'Likes', 'Retweets',
 'Potential exposure', 'HTML', 'TEXT']

### Reading in csv and excel data

In [133]:
def create_dataset(corpus_path, annotated_samples):
    """
    Input: 
    corpus_path: Path for a CSV file containing a list of article URLs and its article text
    annotated_samples: Path of the excel file containing articles and its associated URL along with its labels
    
    Method:
    Retrieves the article text by matching the URLs within the corpus_path and annotated_samples and creates a dataframe 
    containing the URL, article text and the article's corresponding labels.
    
    Output:
    A pandas dataframe
    """
    article_corpus = pd.read_csv(corpus_path)
    annotated_corpus = pd.read_excel((annotated_samples))
    article_corpus.columns = ["URL", "HTML", "TEXT"]
    annotated_articles = annotated_corpus.loc[(annotated_corpus["Cat7"] == 0) | (annotated_corpus["Cat7"] == 1)]
    dataset = pd.merge(annotated_articles, article_corpus, how='left', on='URL')
    return dataset


In [128]:
article_corpus = pd.read_csv(corpus_path)
article_corpus.columns = ["URL", "HTML", "TEXT"]
print(article_corpus.shape)

print(article_corpus.loc[article_corpus["URL"] == "https://www.thestar.com/news/world/2017/05/07/anti-vaccine-activists-just-sparked-a-us-states-worst-measles-outbreak-in-decades.html"]["TEXT"])
print(article_corpus.loc[article_corpus["URL"] == "https://www.newscientist.com/article/mg23531335-800-cancer-vaccines-could-prime-our-own-bodies-to-fight-tumours/?utm_campaign=RSS%7CNSNS&utm_source=NSNS&utm_medium=RSS&campaign_id=RSS%7CNSNS-"])

(1116, 3)
65    TITLE: Anti-vaccine activists just sparked a U...
Name: TEXT, dtype: object
Empty DataFrame
Columns: [URL, HTML, TEXT]
Index: []


In [141]:
corpus_path = os.path.join(EXT_DATA_FOLDER, "url_text.csv")
excel_files = ["sample_third_adam_new.xlsx", "sample_third_amalie_new.xlsx", "sample_third_maryke_new.xlsx"]

df_files = []

for filename in excel_files:
    annotated_path = os.path.join(ANALYSIS_SAMPLES, filename)
    data = create_dataset(corpus_path, annotated_path)
    df_files.append(data)
    
dataset = pd.concat(df_files)

print(dataset.columns.values)
print(dataset.shape)

['Identifier' 'Type' 'Category' 'URL' 'Cat1' 'Cat2' 'Cat3' 'Cat4' 'Cat5'
 'Cat6' 'Cat7' 'Score' 'First date_time' 'Tweets' 'Likes' 'Retweets'
 'Potential exposure' 'HTML' 'TEXT']
(447, 19)


In [102]:
for filename in excel_files[1:]:
    print(filename)

September_13\sample_third_amalie_new.xlsx
September_13\sample_third_maryke_new.xlsx


In [5]:
#Example of article with missing text
print(dataset.head()["URL"][3])
print(dataset.head()["TEXT"][3])  
print(dataset.head()["HTML"])

http://triblive.com/news/healthnow/12759008-74/stronger-flu-vaccine-for-elderly-could-help-younger-adults-with-chronic-conditions
TITLE: Stronger flu vaccine for elderly could help younger adults with chronic conditions | TribLIVE
TEXT:     “Persons who have these conditions have a much greater risk of the flu being more severe to the point of needing to be hospitalized,” Dr. Ken Smith, professor of medicine at Pitt and co-author of the paper, told the Tribune-Review on Thursday. “If you are hospitalized with the flu, your risk of dying is certainly something that is a possibility.”  The high-dose vaccine is recommended for the elderly population because their immune response to the standard-dose vaccine lessens as they age. However, the price for a standard dose is about $11, while the stronger vaccine is about $31 per dose, Smith said. He said the dose for the elderly is about 24 percent stronger than a standard vaccine.   “The growing proportion of middle-aged adults with chronic he

In [143]:
#Save dataset locally
writer = pd.ExcelWriter("dataset3.xlsx")
dataset.to_excel(writer, "Sheet1")
writer.save()

In [6]:
#pre-processing
import nltk
from nltk import word_tokenize
from nltk import sent_tokenize
from collections import defaultdict

labelled_articles = pd.read_excel("dataset3.xlsx")
labelled_articles = labelled_articles.dropna(subset=['TEXT'])
print(labelled_articles.shape)
criterias = ["Cat1", "Cat2", "Cat3", "Cat4", "Cat5", "Cat6", "Cat7"]
art_text_sent = np.array([sent_tokenize(article.split("TITLE: ")[1].replace("TEXT: ","").strip(" ")) for article in labelled_articles["TEXT"]])
art_text_word = np.array([word_tokenize(article.split("TITLE: ")[1].replace("TEXT: ","").strip(" ")) for article in labelled_articles["TEXT"]])
art_text_sent_word = np.array([[word_tokenize(sent) for sent in article] for article in art_text_sent])

(208, 19)


### Baseline classifier performance

Performance of classifier is measured using an f1_micro score:

'f1_micro': Calculate metrics globally by counting the total true positives, false negatives and false positives and accounts for class imbalance. [Source](https://stackoverflow.com/questions/43421456/computing-macro-f1-score-using-sklearn)

f1_micro scores are calculated using stratified k-fold cross validation for k = 10



In [18]:
from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

import warnings
warnings.filterwarnings('ignore')

categories = ['Not Satisfied', 'Satisfied']
nb_bow = []
nb_tfidf = []
nb_w2v = []
svm_bow = []
svm_tfidf = []
svm_w2v = []

### BoW performance

In [8]:
for criteria in criterias:
    
    X_train, X_test, y_train, y_test = train_test_split(list(labelled_articles["TEXT"]), list(labelled_articles[criteria]), test_size=int(20))

    nb_clf = Pipeline([('vect', CountVectorizer()),
                         #('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB()),])
    nb_clf.fit(X_train, y_train)
    nb_predicted = list(nb_clf.predict(X_test))
    #print("Actual vs. NB Predicted labels:\n" + str(y_test) + "\n" + str(nb_predicted))

    #print(metrics.classification_report(y_test, nb_predicted, target_names=categories))

    cv_scores = cross_val_score(nb_clf, X_train, y_train, cv=10, scoring='f1_micro')
    print("For " + criteria + ":")
    print("NB Average micro f1-score: %0.2f (+/- %0.2f)" % (cv_scores.mean(), cv_scores.std()))
    nb_bow.append((cv_scores.mean(), cv_scores.std()))

    svm_clf = Pipeline([('vect', CountVectorizer()),
                         #('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, random_state=69,
                                               max_iter=5, tol=None)),
    ])

    svm_clf.fit(X_train, y_train)
    svm_predicted = list(svm_clf.predict(X_test))
    #print("Actual vs. SVM Predicted labels:\n" + str(y_test) + "\n" + str(svm_predicted))

    #print(metrics.classification_report(y_test, svm_predicted, target_names=categories))

    svm_cv_scores = cross_val_score(svm_clf, X_train, y_train, cv=10, scoring='f1_micro')
    print("SVM Average micro f1-score: %0.2f (+/- %0.2f)" % (svm_cv_scores.mean(), svm_cv_scores.std()))
    svm_bow.append((svm_cv_scores.mean(), svm_cv_scores.std()))

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
}

#print(nb_f1_scores)
#print("====")
#print(svm_f1_scores)

For Cat1:
NB Average micro f1-score: 0.84 (+/- 0.05)
SVM Average micro f1-score: 0.64 (+/- 0.24)
For Cat2:
NB Average micro f1-score: 0.80 (+/- 0.08)
SVM Average micro f1-score: 0.76 (+/- 0.09)
For Cat3:
NB Average micro f1-score: 0.88 (+/- 0.03)
SVM Average micro f1-score: 0.82 (+/- 0.24)
For Cat4:
NB Average micro f1-score: 0.88 (+/- 0.03)
SVM Average micro f1-score: 0.78 (+/- 0.16)
For Cat5:
NB Average micro f1-score: 0.82 (+/- 0.06)
SVM Average micro f1-score: 0.76 (+/- 0.10)
For Cat6:
NB Average micro f1-score: 0.78 (+/- 0.07)
SVM Average micro f1-score: 0.69 (+/- 0.10)
For Cat7:
NB Average micro f1-score: 0.85 (+/- 0.05)
SVM Average micro f1-score: 0.77 (+/- 0.13)


### TF-IDF Performance

In [10]:
for criteria in criterias:
    
    X_train, X_test, y_train, y_test = train_test_split(list(labelled_articles["TEXT"]), list(labelled_articles[criteria]), test_size=int(20))

    nb_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB()),])
    nb_clf.fit(X_train, y_train)
    nb_predicted = list(nb_clf.predict(X_test))
    #print("Actual vs. NB Predicted labels:\n" + str(y_test) + "\n" + str(nb_predicted))


    #print(metrics.classification_report(y_test, nb_predicted, target_names=categories))

    cv_scores = cross_val_score(nb_clf, X_train, y_train, cv=10, scoring='f1_micro')
    print("For " + criteria + ":")
    print("NB Average micro f1-score: %0.2f (+/- %0.2f)" % (cv_scores.mean(), cv_scores.std()))
    nb_tfidf.append((cv_scores.mean(), cv_scores.std()))

    svm_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, random_state=69,
                                               max_iter=5, tol=None)),
    ])

    svm_clf.fit(X_train, y_train)
    svm_predicted = list(svm_clf.predict(X_test))
    #print("Actual vs. SVM Predicted labels:\n" + str(y_test) + "\n" + str(svm_predicted))

    #print(metrics.classification_report(y_test, svm_predicted, target_names=categories))

    svm_cv_scores = cross_val_score(svm_clf, X_train, y_train, cv=10, scoring='f1_micro')
    print("SVM Average micro f1-score: %0.2f (+/- %0.2f)" % (svm_cv_scores.mean(), svm_cv_scores.std()))
    svm_tfidf.append((svm_cv_scores.mean(), svm_cv_scores.std()))

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
}

#print(nb_f1_scores)
#print("====")
#print(svm_f1_scores)

For Cat1:
NB Average micro f1-score: 0.80 (+/- 0.02)
SVM Average micro f1-score: 0.83 (+/- 0.06)
For Cat2:
NB Average micro f1-score: 0.78 (+/- 0.01)
SVM Average micro f1-score: 0.85 (+/- 0.05)
For Cat3:
NB Average micro f1-score: 0.89 (+/- 0.00)
SVM Average micro f1-score: 0.89 (+/- 0.02)
For Cat4:
NB Average micro f1-score: 0.83 (+/- 0.02)
SVM Average micro f1-score: 0.87 (+/- 0.04)
For Cat5:
NB Average micro f1-score: 0.79 (+/- 0.00)
SVM Average micro f1-score: 0.78 (+/- 0.04)
For Cat6:
NB Average micro f1-score: 0.69 (+/- 0.02)
SVM Average micro f1-score: 0.73 (+/- 0.11)
For Cat7:
NB Average micro f1-score: 0.82 (+/- 0.02)
SVM Average micro f1-score: 0.80 (+/- 0.02)


In [11]:
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(word2vec.items())

    def fit(self, X, y):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

In [14]:
class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = len(word2vec.items())

    def fit(self, X, y):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        # if a word was never seen - it must be at least as infrequent
        # as any of the known words - so the default idf is the max of 
        # known idf's
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])

        return self

    def transform(self, X):
        return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w]
                         for w in words if w in self.word2vec] or
                        [np.zeros(self.dim)], axis=0)
                for words in X
            ])

In [9]:
with open(os.path.join(EXT_DATA_FOLDER, "glove.6B.300d.txt"), "rb") as lines:
    w2v = {line.split()[0]: np.array(map(float, line.split()[1:]))
           for line in lines}

In [10]:
w2v_df = pd.DataFrame.from_dict(w2v, orient='index')
w2v_df.head()[0]

b'the'    <map object at 0x00000244CDB60FD0>
b','      <map object at 0x00000244CD986FD0>
b'.'      <map object at 0x00000244CD986EF0>
b'of'     <map object at 0x00000244CDB7D080>
b'to'     <map object at 0x00000244CDB7D0F0>
Name: 0, dtype: object

### Word2Vec with Mean Embedding Vectorizer

In [None]:
from gensim.sklearn_api import W2VTransformer

for criteria in criterias[:3]:
    
    X_train, X_test, y_train, y_test = train_test_split(list(labelled_articles["TEXT"]), list(labelled_articles[criteria]), test_size=int(20))

    print("For " + criteria + ":")
    
    
    nb_clf = Pipeline([('w2v', MeanEmbeddingVectorizer(w2v)),
                         ('clf', MultinomialNB()),])
    nb_predicted = nb_clf.predict(X_test)
    #print(metrics.classification_report(y_test, nb_predicted, target_names=categories))
    nb_cv_scores = cross_val_score(nb_clf, X_train, y_train, scoring='f1_micro')
    
    
    print("NB Average micro f1-score: %0.2f (+/- %0.2f)" % (nb_cv_scores.mean(), nb_cv_scores.std()))
    nb_w2v.append((nb_cv_scores.mean(), nb_cv_scores.std()))
    """
    
    svm_clf = Pipeline([('w2v', MeanEmbeddingVectorizer(w2v)),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, random_state=69,
                                               max_iter=5, tol=None)),
    ])
    
    svm_clf.fit(X_train, y_train)
    svm_predicted = svm_clf.predict(X_test)
    svm_cv_scores = cross_val_score(svm_clf, X_train, y_train, scoring='f1_micro')
    print("SVM Average micro f1-score: %0.2f (+/- %0.2f)" % (svm_cv_scores.mean(), svm_cv_scores.std()))
    svm_w2v.append((svm_cv_scores.mean(), svm_cv_scores.std()))
    """


For Cat1:
NB Average micro f1-score: 0.82 (+/- 0.00)
For Cat2:
NB Average micro f1-score: 0.78 (+/- 0.00)
For Cat3:


### Word2Vec with Tf-idf Weighted Vectorizer

In [19]:
from gensim.sklearn_api import W2VTransformer

for criteria in criterias:
    
    X_train, X_test, y_train, y_test = train_test_split(list(labelled_articles["TEXT"]), list(labelled_articles[criteria]), test_size=int(20))

    print("For " + criteria + ":")
    
    
    nb_clf = Pipeline([('w2v', TfidfEmbeddingVectorizer(w2v)),
                         ('clf', MultinomialNB()),])
    nb_clf.fit(X_train, y_train)

    nb_predicted = nb_clf.predict(X_test)
    #print(metrics.classification_report(y_test, nb_predicted, target_names=categories))
    nb_cv_scores = cross_val_score(nb_clf, X_train, y_train, scoring='f1_micro')
    
    
    print("NB Average micro f1-score: %0.2f (+/- %0.2f)" % (nb_cv_scores.mean(), nb_cv_scores.std()))
    nb_w2v.append((nb_cv_scores.mean(), nb_cv_scores.std()))

    
    svm_clf = Pipeline([('w2v', TfidfEmbeddingVectorizer(w2v)),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, random_state=69,
                                               max_iter=5, tol=None)),
    ])
    
    svm_clf.fit(X_train, y_train)
    svm_predicted = svm_clf.predict(X_test)
    svm_cv_scores = cross_val_score(svm_clf, X_train, y_train, scoring='f1_micro')
    print("SVM Average micro f1-score: %0.2f (+/- %0.2f)" % (svm_cv_scores.mean(), svm_cv_scores.std()))
    svm_w2v.append((svm_cv_scores.mean(), svm_cv_scores.std()))
    


For Cat1:
NB Average micro f1-score: 0.78 (+/- 0.01)
SVM Average micro f1-score: 0.41 (+/- 0.27)
For Cat2:
NB Average micro f1-score: 0.79 (+/- 0.00)
SVM Average micro f1-score: 0.40 (+/- 0.28)
For Cat3:
NB Average micro f1-score: 0.90 (+/- 0.01)
SVM Average micro f1-score: 0.90 (+/- 0.01)
For Cat4:
NB Average micro f1-score: 0.81 (+/- 0.01)
SVM Average micro f1-score: 0.60 (+/- 0.30)
For Cat5:
NB Average micro f1-score: 0.77 (+/- 0.01)
SVM Average micro f1-score: 0.77 (+/- 0.01)
For Cat6:
NB Average micro f1-score: 0.68 (+/- 0.00)
SVM Average micro f1-score: 0.44 (+/- 0.17)
For Cat7:
NB Average micro f1-score: 0.81 (+/- 0.00)


SystemError: <built-in method __deepcopy__ of numpy.ndarray object at 0x000002448479C940> returned a result with an error set

In [120]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

In [116]:
print(len(twenty_train.target))
print(len(cat7_scores))

2257
66


In [8]:
#dividing into training and testing set

#merge text and scores
cat4_dataset = np.array(list(zip(art_text, cat4_scores)))

#TODO: split this dataset into training and testing and then pass these into the next cell
#training_set = cat4_dataset[:int(len(cat4_dataset)*0.8)]
#testing_set = cat4_dataset[int(len(cat4_dataset)*0.8):]

split = 0.8
training_articles = art_text[:int(len(art_text)*split)]
training_preds = cat4_scores[:int(len(cat4_scores)*split)]

testing_articles = art_text[int(len(art_text)*split):]
testing_preds = cat4_scores[int(len(cat4_scores)*split):]
print("===== Training set size ====")
print("# of articles in testing set: {}".format(len(training_preds)))
print("Number of articles that satisfy this category (=1): {}\n".format(len(training_preds[training_preds == 1])))

print("===== Testing set size ====")
print("# of articles in testing set: {}".format(len(testing_preds)))
print("Number of articles that satisfy this category (=1): {}\n".format(len(testing_preds[testing_preds == 1])))

===== Training set size ====
# of articles in testing set: 52
Number of articles that satisfy this category (=1): 13

===== Testing set size ====
# of articles in testing set: 14
Number of articles that satisfy this category (=1): 1



In [9]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.externals import joblib

import numpy as np

random_state = 42

categories = ['Not Satisfies', 'Satisfies']

print("Number of articles: {}".format(len(training_articles)))

docs_test = testing_articles

# Naive Bayes classifier
bayes_clf = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB())
                      ])
bayes_clf.fit(training_articles, training_preds)
joblib.dump(bayes_clf, "naive_bayes.pkl", compress=9)

# Predict the test dataset using Naive Bayes
predicted = bayes_clf.predict(docs_test)
print('Naive Bayes correct prediction: {:4.2f}'.format(np.mean(predicted == testing_preds)))
print(metrics.classification_report(testing_preds, predicted, target_names=categories))

# Support Vector Machine (SVM) classifier
svm_clf = Pipeline([('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, n_iter=   5, random_state=42)),
])
svm_clf.fit(training_articles, training_preds)
joblib.dump(svm_clf, "svm.pkl", compress=9)
# Predict the test dataset using SVM
predicted = svm_clf.predict(docs_test)
print('SVM correct prediction: {:4.2f}'.format(np.mean(predicted == testing_preds)))
print(metrics.classification_report(testing_preds, predicted, target_names=categories))

print(metrics.confusion_matrix(testing_preds, predicted))

Number of articles: 52
Naive Bayes correct prediction: 0.93
               precision    recall  f1-score   support

Not Satisfies       0.93      1.00      0.96        13
    Satisfies       0.00      0.00      0.00         1

  avg / total       0.86      0.93      0.89        14



  'precision', 'predicted', average, warn_for)


SVM correct prediction: 0.93
               precision    recall  f1-score   support

Not Satisfies       0.93      1.00      0.96        13
    Satisfies       0.00      0.00      0.00         1

  avg / total       0.86      0.93      0.89        14

[[13  0]
 [ 1  0]]


  'precision', 'predicted', average, warn_for)


In [68]:
print("type of twent_test.data: ", type(twenty_train.target))
print(len(art_text))
print(twenty_test.target)

type of twent_test.data:  <class 'numpy.ndarray'>
66
[2 2 2 ... 2 2 1]


### Using gensim for word2vec

#### Inputs
Requires a sequence of sentences where the sentence is a list of words:
E.g. "Hi there. Goodbye there" -> [["Hi", "there"], ["Goodbye", "there"]]

In [389]:
def load_embeddings(filename):
    """
    Load a DataFrame from the generalized text format used by word2vec, GloVe,
    fastText, and ConceptNet Numberbatch. The main point where they differ is
    whether there is an initial line with the dimensions of the matrix.
    """
    labels = []
    rows = []
    with open(filename, encoding='utf-8') as infile:
        for i, line in enumerate(infile):
            items = line.rstrip().split(' ')
            if len(items) == 2:
                # This is a header row giving the shape of the matrix
                continue
            labels.append(items[0])
            values = np.array([float(x) for x in items[1:]], 'f')
            rows.append(values)
    
    arr = np.vstack(rows)
    return pd.DataFrame(arr, index=labels, dtype='f')

In [366]:
#Loading pre-trained word2vec
pre_word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(datapath(os.path.join(EXT_DATA_FOLDER, "GoogleNews-vectors-negative300.bin")), binary=True)
pre_word2vec_model.save("pre_word2vec.model")

In [73]:
print("Similarity of 'woman' and 'man': ", pre_word2vec_model.similarity('woman', 'man'))
print("Similarity of 'woman' and 'woman': ", pre_word2vec_model.similarity('woman', 'woman'))
print("Similarity of 'dog' and 'hotdog': ", pre_word2vec_model.similarity('dog', 'hotdog'))

Similarity of 'woman' and 'man':  0.76640123
Similarity of 'woman' and 'woman':  1.0
Similarity of 'dog' and 'hotdog':  0.38931656


  if np.issubdtype(vec.dtype, np.int):


In [391]:
glove_embedding = load_embeddings(os.path.join(EXT_DATA_FOLDER, "glove.6B.300d.txt"))

In [390]:
w2v = load_embeddings(os.path.join(EXT_DATA_FOLDER, "GoogleNews-vectors-negative300.bin"))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 19: invalid start byte

In [16]:
print ("Helo")

Helo
