# spaCy/HDBScan Feature Extraction Pipeline

### Note: it can be quite complicated to install spaCy and sense2vec, given conflicting low-level requirementso, so at this point I wouldn't suggest that others try to install the libraries and run this notebook.
  
However, it is well worth scanning down to the cell titled ***Harvesting Word Features***. In the output of that cell, there are examples of 52 feature clusters harvested by this process. The ultimate output of this process will produce a dataset containing a product ID (asin), overall rating, and word feature, for each word feature found in each product review. I don't consider these feature clusters as the final product, and we should discuss.


### We can use this output for several purposes. 

1. First, we should be able to quite easily make the data available to th web interface, so that we can display the top n word features (by overall rating) associated with products returned.

2. We will want to also include the user's selected word features in our model evaluation, to enable them to "drill into" selected features and thus explore the product/feature landscape.

3. Finally, I think it would be worth training a model on a vectorized representation of the top n most highly rated features, which may give us another dimension for predicting rating based on feature combination/interaction.

In [1]:
#!conda uninstall -y spacy

#Installing spaCy :

# For Linux:
# !conda install -y spacy -c conda-forge
# (use "spacy[cuda100]", if you have the 10.0 cuda driver installed) 

# For Mac:
# !pip install spacy==2.0.7

# Installing HDBScan and sense2vec
# !conda install -y -c conda-forge hdbscan
# !pip install sense2vec==1.1.1a0

In [2]:
#!pip install matplotlib
#!python -m pip list

In [3]:
import pandas as pd
import gzip
import time
# Install a few python packages using pip
from common import utils
utils.require_package("wget")      # for fetching dataset

In [4]:
# Standard python helper libraries.
from __future__ import print_function
from __future__ import division
import os, sys, time
import collections
import itertools

# Numerical manipulation libraries.
import numpy as np

#Visualization
import matplotlib
%matplotlib inline

import spacy
#activated = spacy.prefer_gpu()

In [5]:
def parse(path):
  print('start parse')
  start_parse = time.time()
  g = gzip.open(path, 'rb')
  for l in g:
    yield eval(l)
  end_parse = time.time()
  print('end parse with time for parse',end_parse - start_parse)

def getDF(path):
  print('start getDF')
  start = time.time()
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  print('end getDF')
  end = time.time()
  print('time taken to load data = ',end-start)
  return pd.DataFrame.from_dict(df, orient='index')

start = time.time()
print("Reading Pandas dataframe from reviews_Toys_and_Games_5.json.gz...")
df = getDF('./data/reviews_Toys_and_Games_5.json.gz')
print("...read reviews_Toys_and_Games_5.json.gz in {} seconds.".format(time.time()-start))

Reading Pandas dataframe from reviews_Toys_and_Games_5.json.gz...
start getDF
start parse
end parse with time for parse 8.654070615768433
end getDF
time taken to load data =  8.654944658279419
...read reviews_Toys_and_Games_5.json.gz in 10.878942489624023 seconds.


In [6]:
print(df.shape)
print(df.columns)
df.head(2)

(167597, 9)
Index(['helpful', 'reviewerName', 'summary', 'overall', 'reviewText',
       'reviewerID', 'unixReviewTime', 'reviewTime', 'asin'],
      dtype='object')


Unnamed: 0,helpful,reviewerName,summary,overall,reviewText,reviewerID,unixReviewTime,reviewTime,asin
0,"[0, 0]",Angie,Magnetic board,5.0,I like the item pricing. My granddaughter want...,A1VXOAVRGKGEAK,1390953600,"01 29, 2014",439893577
1,"[1, 1]",Candace,it works pretty good for moving to different a...,4.0,Love the magnet easel... great for moving to d...,A8R62G708TSCM,1395964800,"03 28, 2014",439893577


In [7]:
start = time.time()
print("Collecting summary counts of reviews by rating...")
print(df.groupby('overall').count())
print("...completed counts by rating in {} seconds.".format(time.time()-start))

Collecting summary counts of reviews by rating...
         helpful  reviewerName  summary  reviewText  reviewerID  \
overall                                                           
1.0         4707          4693     4707        4707        4707   
2.0         6298          6279     6298        6298        6298   
3.0        16357         16299    16357       16357       16357   
4.0        37445         37292    37445       37445       37445   
5.0       102790        102196   102790      102790      102790   

         unixReviewTime  reviewTime    asin  
overall                                      
1.0                4707        4707    4707  
2.0                6298        6298    6298  
3.0               16357       16357   16357  
4.0               37445       37445   37445  
5.0              102790      102790  102790  
...completed counts by rating in 0.12965989112854004 seconds.


In [8]:
import hdbscan
import seaborn as sns
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
from spacy.tokens import Doc



from sense2vec import Sense2VecComponent

start = time.time()
print("Reading English core web medium language data using spaCy...")
nlp = spacy.load('en_core_web_md')
print("...finished reading English core web medium language data in {} seconds.".format(time.time()-start))

#nlp_plain = spacy.load('en_core_web_md')
print("spaCy loaded en_core_web_md")
s2v = Sense2VecComponent('/home/burgew/w210_data/reddit_vectors-1.1.0')

IGNORED_LEMMAS = ['-PRON-']
IGNORED_POS = ['PUNCT', 'SPACE', 'PART', 'DET']

print("nlp: {}".format(nlp))
last_nlp_component = nlp.pipeline[-1]
if last_nlp_component[0] != 'sense2vec':
    nlp.add_pipe(s2v)
    print("added sense2vec to spaCy NLP pipeline")
else:
    print("sense2vec previously added to spaCy NLP pipeline")


Reading English core web medium language data using spaCy...
...finished reading English core web medium language data in 16.037681579589844 seconds.
spaCy loaded en_core_web_md
nlp: <spacy.lang.en.English object at 0x7f41677d2e80>
added sense2vec to spaCy NLP pipeline


In [9]:
print("nlp.pipeline[{}]: {}".format(len(nlp.pipeline), nlp.pipeline))

nlp.pipeline[4]: [('tagger', <spacy.pipeline.Tagger object at 0x7f41677d2fd0>), ('parser', <spacy.pipeline.DependencyParser object at 0x7f41677e44c0>), ('ner', <spacy.pipeline.EntityRecognizer object at 0x7f41677e4518>), ('sense2vec', <sense2vec.Sense2VecComponent object at 0x7f41669472e8>)]


In [10]:

debug = False

lemmas = {}
ignore_words = []

def consume_suspect_tokens(text, nlp, ignore_words, lemmas, ignore_initial=False):
    """ Parse text and return ignored words at the begining of the phrase, along with the valid phrase (if any)
    at the end of the phrase token.
    
    Args
    ----------
    ignore_initial (boolean) indicator of whether to ignore the first word in the phrase token (when a recursive call)
    text (string)            text to be parsed, tokenized, and vectorized
    nlp (spaCy pipeline)     pipeline to use for processing the input text
    ignore_words (list)      collected set of words to be ignored, to which this function may add words
    lemmas (dict)            dict of word->
    
    Returns:
    ----------
    None
    """
    if ignore_initial:
        token_parts = text.split(" ")
        first_word = token_parts[0]
        
        if first_word not in ignore_words:
            ignore_words.append(first_word)
            #print("Ignoring word '{}' in feature extraction...".format(first_word))    
        
        if (" " in text) and (len(text.split(" "))>1):
            text = " ".join(token_parts[1:])
        else:
            return None

    doc = nlp(text)
    
    for token in doc:
        if debug:
            print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop, [child for child in token.children])
        lower_token = token.text.lower()
        if (lower_token is not None):
            if (lower_token not in ignore_words):
                if (token.lemma_ in IGNORED_LEMMAS) or (token.pos_ in IGNORED_POS):
                    consume_suspect_tokens(text, nlp, ignore_words, lemmas, ignore_initial=True)
                else:
                    lemmas[token.text] = lower_token
                
        return None


def get_vectors(text, nlp, ignore_words, lemmas):
    """ <generator> Get embedding word vectors from a given text object. 
    Args
    ----------
    text (string)            text to be parsed, tokenized, and vectorized
    nlp (spaCy pipeline)     pipeline to use for processing the input text
    ignore_words (list)      collected set of words to be ignored, to which this function may add words
    lemmas (dict)            dict of word->
    
    Generates:
    ----------
    processed text (string) 
    phrase vector (numpy.ndarray)
    """
    
    chunks_seen = []
    consume_suspect_tokens(text, nlp, ignore_words, lemmas)
              
    doc = nlp(text)
    #####
    # Next, iterate through the sentences and within those the noun chunks.
    # These noun chunks will be lemmatized and collected as potential features.
    #####
    for sent in doc.sents:
        for chunk in sent.noun_chunks:
           
            if chunk.text not in chunks_seen:
                chunks_seen.append(chunk.text)
                processed_text = chunk.text
                
                lemmatized_tokens = []
                
                if lemmas is not None:
                    
                    for chunk_token in chunk.text.split(' '):
                        lower_token = chunk_token.lower()
                        if (ignore_words is None) or (lower_token not in ignore_words):
                            
                            try:
                                this_lemma = lemmas[lower_token]
                            except:
                                this_lemma = lower_token
                                
                            lemmatized_tokens.append(this_lemma)
                            
                    if len(lemmatized_tokens)>0:
                        processed_text = " ".join(lemma for lemma in lemmatized_tokens)
                    else:
                        continue
                    
                yield processed_text, chunk.vector
    

#####
# Collecting word feature clusters using 1 million characters of review text
#####

debug = False

# spaCy has a default limit of 1000000 characters of text when processing a document
max_text_length = 1000000

reviews = ""
start = time.time()
print("Reading up to {} characters of review text for vectorizing...".format(max_text_length))
for i in range(0, len(df)):
    reviewText = df.at[i, 'reviewText']
    if len(reviews) + len(reviewText) > max_text_length:
        break
    reviews = reviews + " " +reviewText
print("...finished reading {} characters of reviews in {} seconds.".format(len(reviews), time.time()-start))
print("This included the text of {} reviews".format(i))

     
#print("{} Reviews ".format(i), reviews)
                  
#review = df.loc[2,'reviewText']
#print(review)

reviewText = df

#start = time.time()
#print("Vectorizing words of {} characters using spaCy...".format(len(reviews)))
#doc = nlp(reviews)
#print("...finished vectorizing words in {} seconds.".format(time.time()-start))



In [11]:
i = 0
print("Vocab len: ",len(nlp.vocab))
print("Vocab vectors len: ",len(nlp.vocab.vectors))
print("\nA sample of words:")
for key in nlp.vocab.vectors:
    i += 1
    print("Key: ", key)
    print("String: ", nlp.vocab.strings[key])
    if i>5:
        break

Vocab len:  1344233
Vocab vectors len:  20000

A sample of words:
Key:  13683662949380521979
String:  ponerla
Key:  5106546397431201793
String:  HANG-OUT
Key:  11215641602458386432
String:  Minkoff
Key:  3362791329292965205
String:  s18
Key:  3424551750583975941
String:  croup
Key:  1378098176466092041
String:  Bamboo


In [17]:
vectors_filepath = './data/vectors_each.tsv'
metadata_filepath = './data/metadata_each.tsv'

In [12]:
#!rm -f ./vectors_each.tsv
#!rm -f ./metadata_each.tsv

def write_vectors(product, rating, concept_vec):
    """Write product, rating, phrase and sense vector to file"""
    
    with open(vectors_filepath, 'a') as out_v, open(metadata_filepath, 'a') as out_m:
        phrase = concept_vec[0]
        sense_vector = concept_vec[1]
        out_m.write('{}\t{}\t{}\n'.format(product, str(rating),phrase))
        out_v.write('\t'.join([str(x) for x in sense_vector]) + "\n")


In [13]:
write_to_file = True

sample_vect = [vec for vec in get_vectors("example", nlp, ignore_words, lemmas)][0][1]
vect_dim = sample_vect.shape
print("Sample vect[{}]".format(vect_dim))
index = []
output = None

iterations = 838
iteration_size = 200

print("Collecting word concept vectors for {} reviews...".format(iteration_size*iterations))

total_start = time.time()
for j in range(iterations):
    print("Starting iteration over reviews {} to {}...".format(j*iteration_size, (j+1)*iteration_size))

    iter_start = time.time()
    for i in range(iteration_size):
        # compute a review index from iterations and iteration_size
        review_ind = i + (j*iteration_size)
        
        #print(df['reviewerID'].iloc[i])
        product = df['asin'].iloc[review_ind]
        rating = df['overall'].iloc[review_ind]
        review = df['reviewText'].iloc[review_ind]
        #print(review)
        for concept_vec in get_vectors(review, nlp, ignore_words, lemmas):
            
            if write_to_file:
                # Append data to files for later reading
                write_vectors(product, rating, concept_vec)
            else:
                # Append data to a list and a numpy array
                index.append([product, rating, concept_vec[0]])
        
                if output is None:
                    # Create an np.array with the first row as the retrieved word vector
                    output = np.array([concept_vec[1]])
                else:
                    # Append the next vector to the end of the vectors array
                    output = np.append(output, np.array([concept_vec[1]]), axis=0)
    print("...completed an iteration of {} reviews in {} seconds.".format(iteration_size, time.time()-iter_start))
    
print("Collected {} word vectors in {} seconds.".format(i*j, time.time()-total_start))


Sample vect[(300,)]
Collecting word concept vectors for 200000 reviews...
Starting iteration over reviews 0 to 200...
...completed an iteration of 200 reviews in 20.799611568450928 seconds.
Starting iteration over reviews 200 to 400...
...completed an iteration of 200 reviews in 24.264477252960205 seconds.
Starting iteration over reviews 400 to 600...
...completed an iteration of 200 reviews in 20.551329612731934 seconds.
Starting iteration over reviews 600 to 800...
...completed an iteration of 200 reviews in 25.270239114761353 seconds.
Starting iteration over reviews 800 to 1000...
...completed an iteration of 200 reviews in 20.141697883605957 seconds.
Starting iteration over reviews 1000 to 1200...
...completed an iteration of 200 reviews in 30.214080095291138 seconds.
Starting iteration over reviews 1200 to 1400...
...completed an iteration of 200 reviews in 29.77208185195923 seconds.
Starting iteration over reviews 1400 to 1600...
...completed an iteration of 200 reviews in 22.864

...completed an iteration of 200 reviews in 17.002442359924316 seconds.
Starting iteration over reviews 13800 to 14000...
...completed an iteration of 200 reviews in 17.47238540649414 seconds.
Starting iteration over reviews 14000 to 14200...
...completed an iteration of 200 reviews in 17.61144185066223 seconds.
Starting iteration over reviews 14200 to 14400...
...completed an iteration of 200 reviews in 18.064692974090576 seconds.
Starting iteration over reviews 14400 to 14600...
...completed an iteration of 200 reviews in 16.74238133430481 seconds.
Starting iteration over reviews 14600 to 14800...
...completed an iteration of 200 reviews in 18.652212858200073 seconds.
Starting iteration over reviews 14800 to 15000...
...completed an iteration of 200 reviews in 16.68256425857544 seconds.
Starting iteration over reviews 15000 to 15200...
...completed an iteration of 200 reviews in 14.28907322883606 seconds.
Starting iteration over reviews 15200 to 15400...
...completed an iteration of 

...completed an iteration of 200 reviews in 15.83445429801941 seconds.
Starting iteration over reviews 27400 to 27600...
...completed an iteration of 200 reviews in 12.99533724784851 seconds.
Starting iteration over reviews 27600 to 27800...
...completed an iteration of 200 reviews in 16.752342224121094 seconds.
Starting iteration over reviews 27800 to 28000...
...completed an iteration of 200 reviews in 12.56347370147705 seconds.
Starting iteration over reviews 28000 to 28200...
...completed an iteration of 200 reviews in 14.777369022369385 seconds.
Starting iteration over reviews 28200 to 28400...
...completed an iteration of 200 reviews in 14.056739807128906 seconds.
Starting iteration over reviews 28400 to 28600...
...completed an iteration of 200 reviews in 15.31201171875 seconds.
Starting iteration over reviews 28600 to 28800...
...completed an iteration of 200 reviews in 16.07841205596924 seconds.
Starting iteration over reviews 28800 to 29000...
...completed an iteration of 200

...completed an iteration of 200 reviews in 15.500776529312134 seconds.
Starting iteration over reviews 41000 to 41200...
...completed an iteration of 200 reviews in 16.061554193496704 seconds.
Starting iteration over reviews 41200 to 41400...
...completed an iteration of 200 reviews in 15.03829288482666 seconds.
Starting iteration over reviews 41400 to 41600...
...completed an iteration of 200 reviews in 15.879251718521118 seconds.
Starting iteration over reviews 41600 to 41800...
...completed an iteration of 200 reviews in 16.258179664611816 seconds.
Starting iteration over reviews 41800 to 42000...
...completed an iteration of 200 reviews in 16.653322219848633 seconds.
Starting iteration over reviews 42000 to 42200...
...completed an iteration of 200 reviews in 13.139135360717773 seconds.
Starting iteration over reviews 42200 to 42400...
...completed an iteration of 200 reviews in 14.025718688964844 seconds.
Starting iteration over reviews 42400 to 42600...
...completed an iteration

...completed an iteration of 200 reviews in 15.264626502990723 seconds.
Starting iteration over reviews 54600 to 54800...
...completed an iteration of 200 reviews in 17.833472728729248 seconds.
Starting iteration over reviews 54800 to 55000...
...completed an iteration of 200 reviews in 15.100022077560425 seconds.
Starting iteration over reviews 55000 to 55200...
...completed an iteration of 200 reviews in 12.043143033981323 seconds.
Starting iteration over reviews 55200 to 55400...
...completed an iteration of 200 reviews in 14.660714149475098 seconds.
Starting iteration over reviews 55400 to 55600...
...completed an iteration of 200 reviews in 12.894985914230347 seconds.
Starting iteration over reviews 55600 to 55800...
...completed an iteration of 200 reviews in 16.475217580795288 seconds.
Starting iteration over reviews 55800 to 56000...
...completed an iteration of 200 reviews in 23.149320125579834 seconds.
Starting iteration over reviews 56000 to 56200...
...completed an iteratio

...completed an iteration of 200 reviews in 13.32876992225647 seconds.
Starting iteration over reviews 68200 to 68400...
...completed an iteration of 200 reviews in 12.467150211334229 seconds.
Starting iteration over reviews 68400 to 68600...
...completed an iteration of 200 reviews in 16.236023664474487 seconds.
Starting iteration over reviews 68600 to 68800...
...completed an iteration of 200 reviews in 21.48687720298767 seconds.
Starting iteration over reviews 68800 to 69000...
...completed an iteration of 200 reviews in 13.191395282745361 seconds.
Starting iteration over reviews 69000 to 69200...
...completed an iteration of 200 reviews in 13.253115892410278 seconds.
Starting iteration over reviews 69200 to 69400...
...completed an iteration of 200 reviews in 13.186903238296509 seconds.
Starting iteration over reviews 69400 to 69600...
...completed an iteration of 200 reviews in 15.264591217041016 seconds.
Starting iteration over reviews 69600 to 69800...
...completed an iteration 

...completed an iteration of 200 reviews in 15.079401969909668 seconds.
Starting iteration over reviews 81800 to 82000...
...completed an iteration of 200 reviews in 13.583588123321533 seconds.
Starting iteration over reviews 82000 to 82200...
...completed an iteration of 200 reviews in 13.103094339370728 seconds.
Starting iteration over reviews 82200 to 82400...
...completed an iteration of 200 reviews in 14.150864362716675 seconds.
Starting iteration over reviews 82400 to 82600...
...completed an iteration of 200 reviews in 11.323687076568604 seconds.
Starting iteration over reviews 82600 to 82800...
...completed an iteration of 200 reviews in 12.369129180908203 seconds.
Starting iteration over reviews 82800 to 83000...
...completed an iteration of 200 reviews in 16.358638286590576 seconds.
Starting iteration over reviews 83000 to 83200...
...completed an iteration of 200 reviews in 14.395226955413818 seconds.
Starting iteration over reviews 83200 to 83400...
...completed an iteratio

...completed an iteration of 200 reviews in 18.270150423049927 seconds.
Starting iteration over reviews 95400 to 95600...
...completed an iteration of 200 reviews in 24.021104335784912 seconds.
Starting iteration over reviews 95600 to 95800...
...completed an iteration of 200 reviews in 18.569719076156616 seconds.
Starting iteration over reviews 95800 to 96000...
...completed an iteration of 200 reviews in 16.506463766098022 seconds.
Starting iteration over reviews 96000 to 96200...
...completed an iteration of 200 reviews in 14.95996642112732 seconds.
Starting iteration over reviews 96200 to 96400...
...completed an iteration of 200 reviews in 23.098137140274048 seconds.
Starting iteration over reviews 96400 to 96600...
...completed an iteration of 200 reviews in 21.918591022491455 seconds.
Starting iteration over reviews 96600 to 96800...
...completed an iteration of 200 reviews in 13.736265659332275 seconds.
Starting iteration over reviews 96800 to 97000...
...completed an iteration

...completed an iteration of 200 reviews in 12.893296957015991 seconds.
Starting iteration over reviews 108800 to 109000...
...completed an iteration of 200 reviews in 9.512125015258789 seconds.
Starting iteration over reviews 109000 to 109200...
...completed an iteration of 200 reviews in 14.377772092819214 seconds.
Starting iteration over reviews 109200 to 109400...
...completed an iteration of 200 reviews in 18.85701847076416 seconds.
Starting iteration over reviews 109400 to 109600...
...completed an iteration of 200 reviews in 20.579145669937134 seconds.
Starting iteration over reviews 109600 to 109800...
...completed an iteration of 200 reviews in 12.927685976028442 seconds.
Starting iteration over reviews 109800 to 110000...
...completed an iteration of 200 reviews in 13.14707064628601 seconds.
Starting iteration over reviews 110000 to 110200...
...completed an iteration of 200 reviews in 14.061819791793823 seconds.
Starting iteration over reviews 110200 to 110400...
...complete

...completed an iteration of 200 reviews in 11.731244802474976 seconds.
Starting iteration over reviews 122200 to 122400...
...completed an iteration of 200 reviews in 21.801027059555054 seconds.
Starting iteration over reviews 122400 to 122600...
...completed an iteration of 200 reviews in 14.262207508087158 seconds.
Starting iteration over reviews 122600 to 122800...
...completed an iteration of 200 reviews in 17.090633153915405 seconds.
Starting iteration over reviews 122800 to 123000...
...completed an iteration of 200 reviews in 17.346162796020508 seconds.
Starting iteration over reviews 123000 to 123200...
...completed an iteration of 200 reviews in 18.26294994354248 seconds.
Starting iteration over reviews 123200 to 123400...
...completed an iteration of 200 reviews in 14.274372100830078 seconds.
Starting iteration over reviews 123400 to 123600...
...completed an iteration of 200 reviews in 14.037562847137451 seconds.
Starting iteration over reviews 123600 to 123800...
...comple

...completed an iteration of 200 reviews in 17.525763511657715 seconds.
Starting iteration over reviews 135600 to 135800...
...completed an iteration of 200 reviews in 15.174272775650024 seconds.
Starting iteration over reviews 135800 to 136000...
...completed an iteration of 200 reviews in 20.107938051223755 seconds.
Starting iteration over reviews 136000 to 136200...
...completed an iteration of 200 reviews in 12.684232711791992 seconds.
Starting iteration over reviews 136200 to 136400...
...completed an iteration of 200 reviews in 13.98273253440857 seconds.
Starting iteration over reviews 136400 to 136600...
...completed an iteration of 200 reviews in 11.49935507774353 seconds.
Starting iteration over reviews 136600 to 136800...
...completed an iteration of 200 reviews in 11.638357400894165 seconds.
Starting iteration over reviews 136800 to 137000...
...completed an iteration of 200 reviews in 13.695765495300293 seconds.
Starting iteration over reviews 137000 to 137200...
...complet

...completed an iteration of 200 reviews in 21.862679481506348 seconds.
Starting iteration over reviews 149000 to 149200...
...completed an iteration of 200 reviews in 19.46582317352295 seconds.
Starting iteration over reviews 149200 to 149400...
...completed an iteration of 200 reviews in 12.52415132522583 seconds.
Starting iteration over reviews 149400 to 149600...
...completed an iteration of 200 reviews in 14.279038906097412 seconds.
Starting iteration over reviews 149600 to 149800...
...completed an iteration of 200 reviews in 16.561567544937134 seconds.
Starting iteration over reviews 149800 to 150000...
...completed an iteration of 200 reviews in 22.464833736419678 seconds.
Starting iteration over reviews 150000 to 150200...
...completed an iteration of 200 reviews in 12.240449905395508 seconds.
Starting iteration over reviews 150200 to 150400...
...completed an iteration of 200 reviews in 16.77539086341858 seconds.
Starting iteration over reviews 150400 to 150600...
...complete

...completed an iteration of 200 reviews in 21.059853076934814 seconds.
Starting iteration over reviews 162400 to 162600...
...completed an iteration of 200 reviews in 21.54162359237671 seconds.
Starting iteration over reviews 162600 to 162800...
...completed an iteration of 200 reviews in 19.366231441497803 seconds.
Starting iteration over reviews 162800 to 163000...
...completed an iteration of 200 reviews in 14.396927833557129 seconds.
Starting iteration over reviews 163000 to 163200...
...completed an iteration of 200 reviews in 16.599801778793335 seconds.
Starting iteration over reviews 163200 to 163400...
...completed an iteration of 200 reviews in 18.891541481018066 seconds.
Starting iteration over reviews 163400 to 163600...
...completed an iteration of 200 reviews in 17.058192014694214 seconds.
Starting iteration over reviews 163600 to 163800...
...completed an iteration of 200 reviews in 15.001845359802246 seconds.
Starting iteration over reviews 163800 to 164000...
...comple

IndexError: single positional indexer is out-of-bounds

In [70]:
import tqdm

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

vdim = file_len(vectors_filepath)
print("vectors file contains {} lines".format(vdim))
index = []
output = None

sample_prob=0.05
# Generate random samples set, used for both the vectors file and the metadata file so the elements match
samples = np.random.choice(a=[True,False], size=vdim, p=[sample_prob,1.0-sample_prob])
print(samples)


vectors file contains 2994035 lines
[False False False ... False False False]


In [None]:
with open(vectors_filepath, 'r') as in_v:
    print('File {} contains sense vectors of dimension {}'.format(vectors_filepath, vdim))
    curr_line = 0
    t = tqdm.tqdm(total=vdim)
    
    for line in in_v:
        sample_this_row = samples[curr_line]
        
        if sample_this_row:
            if output is None:
                # Create an np.array with the first row as the retrieved word vector
                output = np.array([np.array(line.split('\t'))])
            else:
                # Append the next vector to the end of the vectors array
                output = np.append(output, np.array([np.array(line.split('\t'))]), axis=0)
                
        curr_line += 1
        t.update()
        #if len(output)>2:        
        #    break
    t.close()







  0%|          | 0/2994035 [00:00<?, ?it/s][A[A[A[A[A[A




30518it [03:55, 129.71it/s][A[A[A[A[A




  0%|          | 4400/2994035 [00:00<01:08, 43914.50it/s][A[A[A[A[A

File ./data/vectors_each.tsv contains sense vectors of dimension 2994035







  0%|          | 6851/2994035 [00:00<01:24, 35347.96it/s][A[A[A[A[A




  0%|          | 8657/2994035 [00:00<01:49, 27358.38it/s][A[A[A[A[A




  0%|          | 10351/2994035 [00:00<02:20, 21294.11it/s][A[A[A[A[A




  0%|          | 11958/2994035 [00:00<03:11, 15552.98it/s][A[A[A[A[A




  0%|          | 13369/2994035 [00:00<04:07, 12030.16it/s][A[A[A[A[A




  0%|          | 14593/2994035 [00:00<04:59, 9951.95it/s] [A[A[A[A[A




  1%|          | 15663/2994035 [00:01<06:20, 7824.21it/s][A[A[A[A[A




  1%|          | 16569/2994035 [00:01<06:59, 7095.59it/s][A[A[A[A[A




  1%|          | 17378/2994035 [00:01<08:16, 6000.67it/s][A[A[A[A[A




  1%|          | 18077/2994035 [00:01<10:06, 4906.78it/s][A[A[A[A[A




  1%|          | 18684/2994035 [00:01<09:34, 5175.04it/s][A[A[A[A[A




  1%|          | 19277/2994035 [00:01<10:11, 4862.99it/s][A[A[A[A[A




  1%|          | 19820/2994035 [00:02<11:51, 4181.05it/s][A[A[A

  1%|▏         | 40309/2994035 [00:13<1:22:17, 598.21it/s][A[A[A[A[A




  1%|▏         | 40379/2994035 [00:14<1:30:07, 546.19it/s][A[A[A[A[A




  1%|▏         | 40509/2994035 [00:14<1:17:14, 637.29it/s][A[A[A[A[A




  1%|▏         | 40584/2994035 [00:14<1:30:28, 544.02it/s][A[A[A[A[A




  1%|▏         | 40657/2994035 [00:14<1:27:51, 560.23it/s][A[A[A[A[A




  1%|▏         | 40738/2994035 [00:14<1:24:28, 582.68it/s][A[A[A[A[A




  1%|▏         | 40815/2994035 [00:14<1:19:12, 621.38it/s][A[A[A[A[A




  1%|▏         | 40918/2994035 [00:14<1:09:46, 705.32it/s][A[A[A[A[A




  1%|▏         | 41026/2994035 [00:14<1:05:59, 745.85it/s][A[A[A[A[A




  1%|▏         | 41107/2994035 [00:15<1:30:53, 541.50it/s][A[A[A[A[A




  1%|▏         | 41173/2994035 [00:15<1:36:53, 507.94it/s][A[A[A[A[A




  1%|▏         | 41278/2994035 [00:15<1:24:22, 583.22it/s][A[A[A[A[A




  1%|▏         | 41452/2994035 [00:15<1:09:03, 712.64it/s][A[A

  2%|▏         | 49269/2994035 [00:28<2:00:29, 407.34it/s][A[A[A[A[A




  2%|▏         | 49360/2994035 [00:28<1:42:39, 478.08it/s][A[A[A[A[A




  2%|▏         | 49416/2994035 [00:28<1:42:04, 480.83it/s][A[A[A[A[A




  2%|▏         | 49470/2994035 [00:28<1:42:50, 477.22it/s][A[A[A[A[A




  2%|▏         | 49572/2994035 [00:28<1:28:14, 556.12it/s][A[A[A[A[A




  2%|▏         | 49705/2994035 [00:28<1:14:29, 658.73it/s][A[A[A[A[A




  2%|▏         | 49784/2994035 [00:29<1:13:42, 665.68it/s][A[A[A[A[A




  2%|▏         | 49872/2994035 [00:29<1:10:26, 696.52it/s][A[A[A[A[A




  2%|▏         | 49949/2994035 [00:29<1:19:00, 621.01it/s][A[A[A[A[A




  2%|▏         | 50039/2994035 [00:29<1:14:21, 659.80it/s][A[A[A[A[A




  2%|▏         | 50154/2994035 [00:29<1:07:04, 731.45it/s][A[A[A[A[A




  2%|▏         | 50236/2994035 [00:29<1:07:40, 725.03it/s][A[A[A[A[A




  2%|▏         | 50313/2994035 [00:29<1:23:53, 584.86it/s][A[A

In [None]:
print("Output shape: {}".format(output.shape))

with open(metadata_filepath, 'r') as in_m:
    mdim = None
    curr_line = 0
    for line in in_m:
        if mdim is None:
            mdim = line.count('\t')+1
            print('File {} contains index entries of of dimension {}'.format(metadata_filepath, vdim))
        if line.endswith('\n'):
            line = line[:-1]
        sample_this_row = samples[curr_line]
        if sample_this_row:
            index.append(line.split('\t')
        curr_line += 1

In [33]:
print("Index length: {}".format(len(index)))

[['0439893577', '5.0', 'i\n'], ['0439893577', '5.0', 'i']]

In [None]:
print("Index({}, {}): {}".format(len(index), len(index[0]), index[:5]))
print("Output{}: {}".format(output.shape, output[:5]))

# Write these arrays to files for use in Tensorflow embedding visualization
        
import io

out_v = io.open('./vectors{}.tsv'.format(len(output)), 'w', encoding='utf-8')
out_m = io.open('./metadata{}.tsv'.format(len(index)), 'w', encoding='utf-8')
for word_num in range(len(output)):
  word = index[word_num][2]
  embeddings = output[word_num]
  out_m.write(word + "\n")
  out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()

In [None]:
HDBSCAN_METRIC = 'manhattan'

start = time.time()
print("Creating word clusters from word vectors...")
hdbscanner = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=5, metric=HDBSCAN_METRIC, gen_min_span_tree=True)
hdbscanner.fit(output)
print("...completed clustering in {} seconds.".format(time.time()-start))

In [None]:
start = time.time()
print("Condensing the linkage tree and then plotting...")
#hdbscanner.single_linkage_tree_.plot(cmap='viridis', colorbar=True)
hdbscanner.condensed_tree_.plot()
hdbscanner.condensed_tree_.plot(select_clusters=True)
print("...plotted condensed tree in {} seconds.".format(time.time()-start))
tree = hdbscanner.condensed_tree_
print("Found {} clusters".format(len(tree._select_clusters())))


### This can only be graphed as single linkage tree for very small datasets 

start = time.time()
print("Plotting single linkage tree (not for large data) ...")
hdbscanner.single_linkage_tree_.plot(cmap='viridis', colorbar=True)
print("...plotted single linkage tree tree in {} seconds.".format(time.time()-start))

start = time.time()
print("Plotting condensed tree...")
hdbscanner.condensed_tree_.plot()
print("...plotted condensed tree in {} seconds.".format(time.time()-start))

In [None]:
start = time.time()
print("Plotting condensed tree with vectors selected...")
hdbscanner.condensed_tree_.plot(select_clusters=True, selection_palette=sns.color_palette())
print("...plotted condensed tree with selected vectors in {} seconds.".format(time.time()-start))

In [None]:
def exemplars(cluster_id, condensed_tree):
    raw_tree = condensed_tree._raw_tree
    # Just the cluster elements of the tree, excluding singleton points
    cluster_tree = raw_tree[raw_tree['child_size'] > 1]
    # Get the leaf cluster nodes under the cluster we are considering
    leaves = hdbscan.plots._recurse_leaf_dfs(cluster_tree, cluster_id)
    # Now collect up the last remaining points of each leaf cluster (the heart of the leaf)
    result = np.array([])
    for leaf in leaves:
        max_lambda = raw_tree['lambda_val'][raw_tree['parent'] == leaf].max()
        points = raw_tree['child'][(raw_tree['parent'] == leaf) & 
                                   (raw_tree['lambda_val'] == max_lambda)]
        result = np.hstack((result, points))
    return result.astype(np.int)

# Harvesting Word Features
The below logic collects and filters word features from the condensed tree created in the cells above.

In [None]:
tree = hdbscanner.condensed_tree_

#print('Index, for reference:')
#for ind, entry in enumerate(index):
#    print("cluster: {}, ind: {}, entry: {}".format(hdbscanner.labels_[ind], ind, entry))

start = time.time()
print("Selecting clusters in tree...")
clusters = tree._select_clusters()
print("...finished selecting clusters in {} seconds.".format(time.time()-start))

initial_cluster_count = len(clusters)
print("Found {} clusters".format(initial_cluster_count))

selected_clusters = []

for i, c in enumerate(clusters):
    c_exemplars = exemplars(c, tree)
    
    #plt.scatter(data.T[0][c_exemplars], data.T[1][c_exemplars], c=palette[i], **plot_kwds)
    
    #print("Index: ", enumerate(index))
    #print("Output: ", output[:5])

    cluster_exemplars = set()
    for ind, ex_ind in enumerate(c_exemplars):
        #print("Exemplar -- {} : {}".format(index[ex_ind][0], index[ex_ind][2]))
        cluster_exemplars.add(index[ex_ind][2])
    
    members = set()
    for label_ind, label in np.ndenumerate(hdbscanner.labels_):
        if label == i:
            members.add(index[label_ind[0]][2])
            
            #print("Member: {} : {}".format(index[label_ind[0]][0], index[label_ind[0]][2]))
    
    exemplars_len = float(len(cluster_exemplars))
    members_len = float(len(members))
    
    if ((exemplars_len>0) and (len(members)>(2.0*exemplars_len))):
        #print("\nCluster {} persistence: {}".format(i, hdbscanner.cluster_persistence_.item(i)))
        #print("Cluster {} Exemplars: ".format(i),c_exemplars)
        #print("Cluster {} Exemplar Probabilities: ".format(i),[hdbscanner.probabilities_[ind] for ind in c_exemplars])
    
        example_cluster_exemplars = ", ".join(cluster_exemplars)
        example_cluster_members = ", ".join(members)
        
        selected_clusters.append([example_cluster_exemplars, example_cluster_members])

selected_cluster_count = len(selected_clusters)
if (selected_cluster_count>0):
    print("\nFound {} clusters ({}% of initially collected):".
          format(len(selected_clusters), 100.0*float(selected_cluster_count)/float(initial_cluster_count)))
    for example in selected_clusters:
        print("\nExemplars: {}".format(example[0]))
        print("Members: {}".format(example[1]))
                                                                    
noise_count = sum([1 for label in hdbscanner.labels_ if label == -1])
print("\nThere were {} words that were considered noise.".format(noise_count))

                                  
#print("\nOutliers.")
#for label_ind, label in np.ndenumerate(hdbscanner.labels_):
#    if label == -1:
#        print("{} : {}".format(index[label_ind[0]][0], index[label_ind[0]][2]))

In [None]:
hdbscanner.labels_


In [None]:
print(c_exemplars[0])

In [None]:
import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn import cluster, datasets
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler

import hdbscan

np.random.seed(0)
plt.style.use('fivethirtyeight')

def make_var_density_blobs(n_samples=750, centers=[[0,0]], cluster_std=[0.5], random_state=0):
    samples_per_blob = n_samples // len(centers)
    blobs = [datasets.make_blobs(n_samples=samples_per_blob, centers=[c], cluster_std=cluster_std[i])[0]
             for i, c in enumerate(centers)]
    labels = [i * np.ones(samples_per_blob) for i in range(len(centers))]
    return np.vstack(blobs), np.hstack(labels)

# Generate datasets. We choose the size big enough to see the scalability
# of the algorithms, but not too big to avoid too long running times
n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
                                      noise=.08)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.10)
blobs = datasets.make_blobs(n_samples=n_samples-200, random_state=8)
noisy_blobs = np.vstack((blobs[0], 25.0*np.random.rand(200, 2)-[10.0,10.0])), np.hstack((blobs[1], -1*np.ones(200))) 
varying_blobs = make_var_density_blobs(n_samples,
                                       centers=[[1, 1],
                                                [-1, -1],
                                                [1, -1]],
                                       cluster_std=[0.2, 0.35, 0.5])
no_structure = np.random.rand(n_samples, 2), None

colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)

clustering_names = [
    'MiniBatchKMeans', 'AffinityPropagation',
    'SpectralClustering', 'AgglomerativeClustering',
    'DBSCAN', 'HDBSCAN']

plt.figure(figsize=(len(clustering_names) * 2 + 3, 9.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
                    hspace=.01)

plot_num = 1

datasets = [noisy_circles, noisy_moons, noisy_blobs, varying_blobs, no_structure]
for i_dataset, dataset in enumerate(datasets):
    X, y = dataset
    # normalize dataset for easier parameter selection
    X = StandardScaler().fit_transform(X)

    # estimate bandwidth for mean shift
    bandwidth = cluster.estimate_bandwidth(X, quantile=0.3)

    # connectivity matrix for structured Ward
    connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
    # make connectivity symmetric
    connectivity = 0.5 * (connectivity + connectivity.T)

    # create clustering estimators
    two_means = cluster.MiniBatchKMeans(n_clusters=2)
    spectral = cluster.SpectralClustering(n_clusters=2,
                                          eigen_solver='arpack',
                                          affinity="nearest_neighbors")
    dbscan = cluster.DBSCAN(eps=.2)
    affinity_propagation = cluster.AffinityPropagation(damping=.9,
                                                       preference=-200)

    average_linkage = cluster.AgglomerativeClustering(
        linkage="average", affinity="cityblock", n_clusters=2,
        connectivity=connectivity)

    #hdbscanner = hdbscan.HDBSCAN()
    clustering_algorithms = [
        two_means, affinity_propagation, spectral, average_linkage,
        dbscan, hdbscanner]

    for name, algorithm in zip(clustering_names, clustering_algorithms):
        # predict cluster memberships
        t0 = time.time()
        algorithm.fit(X)
        t1 = time.time()
        if hasattr(algorithm, 'labels_'):
            y_pred = algorithm.labels_.astype(np.int)
        else:
            y_pred = algorithm.predict(X)

        # plot
        plt.subplot(5, len(clustering_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)
        plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)

        if hasattr(algorithm, 'cluster_centers_'):
            centers = algorithm.cluster_centers_
            center_colors = colors[:len(centers)]
            plt.scatter(centers[:, 0], centers[:, 1], s=100, c=center_colors)
        plt.xlim(-2, 2)
        plt.ylim(-2, 2)
        plt.xticks(())
        plt.yticks(())
        plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
        plot_num += 1

plt.show()

In [None]:
import tensorflow as tf

# value = np.array(value)
# value = value.reshape([2, 4])
output_init = tf.constant_initializer(output)

print('fitting shape:')
tf.reset_default_graph()
with tf.Session() :
    embedding_var = tf.get_variable('x', shape = [len(output), len(output[0])], initializer = output_init)
    embedding_var.initializer.run()
    print(embedding_var.eval())
    
sess = tf.Session()

sess.run(embedding_init, feed_dict={embedding_var: embedding_var)})

path_for_metadata = './graphs/embedding_test/embedding_test.ckpt'

with open(path_for_metadata,'w') as f:
    f.write("Index\tLabel\n")
    for ind,label_line in enumerate(index):
        label = '{}:{}:{}'.format(label_line[0], label_line[1], label_line[2])
        f.write("%d\t%s\n" % (ind,label))

In [None]:
from tensorflow.contrib.tensorboard.plugins import projector

with tf.Session() as sess:
    # Create summary writer.
    writer = tf.summary.FileWriter('./graphs/embedding_test', sess.graph)
    # Initialize embedding_var
    #sess.run(embedding_var.initializer)
    sess.run(embedding_init, feed_dict={embedding_var: embedding_var})
    # Create Projector config
    config = projector.ProjectorConfig()
    # Add embedding visualizer
    embedding = config.embeddings.add()
    # Attache the name 'embedding'
    embedding.tensor_name = embedding_var.name
    # Metafile which is described later
    embedding.metadata_path = path_for_metadata
    # Add writer and config to Projector
    projector.visualize_embeddings(writer, config)
    # Save the model
    saver_embed = tf.train.Saver([embedding_var])
    saver_embed.save(sess, './graphs/embedding_test/embedding_test.ckpt', 1)

writer.close()