In [18]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords
import nltk

#!conda install -c anaconda --yes nltk 
#nltk.download('gutenberg')
# !conda install -c anaconda --yes gensim 
#!pip install google-compute-engine
#!conda install -c conda-forge google-cloud-sdk  --yes
#!conda uninstall -c conda-forge google-cloud-sdk  --yes

### Cracking Spacy

In [19]:
# Dependency parse, token by token
nlp = spacy.load('en', disable=['ner'])

doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies. It basically talks about bitcoins.')
 
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))


Wall/NNP <--compound-- Street/NNP
Street/NNP <--compound-- Journal/NNP
Journal/NNP <--nsubj-- published/VBD
just/RB <--advmod-- published/VBD
published/VBD <--ROOT-- published/VBD
an/DT <--det-- piece/NN
interesting/JJ <--amod-- piece/NN
piece/NN <--dobj-- published/VBD
on/IN <--prep-- piece/NN
crypto/JJ <--compound-- currencies/NNS
currencies/NNS <--pobj-- on/IN
./. <--punct-- published/VBD
It/PRP <--nsubj-- talks/VBZ
basically/RB <--advmod-- talks/VBZ
talks/VBZ <--ROOT-- talks/VBZ
about/IN <--prep-- talks/VBZ
bitcoins/NNS <--pobj-- about/IN
./. <--punct-- talks/VBZ


In [20]:
nlp = spacy.load('en')

doc = nlp('There was nothing so very remarkable in that')
 
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

There/EX <--expl-- was/VBD
was/VBD <--ROOT-- was/VBD
nothing/NN <--attr-- was/VBD
so/RB <--advmod-- remarkable/JJ
very/RB <--advmod-- remarkable/JJ
remarkable/JJ <--amod-- nothing/NN
in/IN <--prep-- nothing/NN
that/DT <--pobj-- in/IN


In [26]:
from spacy import displacy
spacy.displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

TypeError: __init__() got an unexpected keyword argument 'encoding'

In [3]:
# Dependency parse, by sentence
nlp = spacy.load('en', disable=['ner'])

doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies. It basically talks about bitcoins.')
 
for sent in doc.sents:
    print("root: ", sent.root, "sentence:", sent.text)
print()
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))


root:  published sentence: Wall Street Journal just published an interesting piece on crypto currencies.
root:  talks sentence: It basically talks about bitcoins.

Wall/NNP <--compound-- Street/NNP
Street/NNP <--compound-- Journal/NNP
Journal/NNP <--nsubj-- published/VBD
just/RB <--advmod-- published/VBD
published/VBD <--ROOT-- published/VBD
an/DT <--det-- piece/NN
interesting/JJ <--amod-- piece/NN
piece/NN <--dobj-- published/VBD
on/IN <--prep-- piece/NN
crypto/JJ <--compound-- currencies/NNS
currencies/NNS <--pobj-- on/IN
./. <--punct-- published/VBD
It/PRP <--nsubj-- talks/VBZ
basically/RB <--advmod-- talks/VBZ
talks/VBZ <--ROOT-- talks/VBZ
about/IN <--prep-- talks/VBZ
bitcoins/NNS <--pobj-- about/IN
./. <--punct-- talks/VBZ


In [4]:
# Plot the dependency graph
from spacy import displacy
 
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

## Intro to word2vec

The most common unsupervised neural network approach for NLP is word2vec, a shallow neural network model that converts words to vectors using a distributed representation: each word is represented by many neurons, and each neuron is involved in representing many words.  At the highest level of abstraction, word2vec starts by assigning a vector of random values to each word.  For a word W, it looks at the words that are near W in the sentence and shifts the values in the word vectors so that the vectors for words near W move closer to the W vector, and the vectors for words not near W move farther away from it.  With a large enough corpus, this eventually results in words that often appear together having vectors that are near one another, and words that rarely or never appear together having vectors that are far apart.  Then, using the vectors, a similarity score can be computed for each pair of words by taking the cosine of the angle between their vectors.  
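
To make the last step concrete, here is a minimal sketch of cosine similarity. The three-dimensional vectors and the words attached to them are hand-picked assumptions for illustration only; they do not come from a trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical word vectors (assumption: made up for this example).
vectors = {
    "knife": [0.9, 0.1, 0.0],
    "fork": [0.8, 0.2, 0.1],
    "cloud": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(vectors["knife"], vectors["fork"])
sim_far = cosine_similarity(vectors["knife"], vectors["cloud"])

# Vectors pointing in nearly the same direction score close to 1,
# nearly orthogonal vectors score close to 0.
print(sim_close, sim_far)
```

Because the cosine depends only on direction, the length of a vector does not affect the score.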

This may sound quite similar to the Latent Semantic Analysis approach you just learned.  The conceptual difference is that LSA creates vector representations of sentences based on the words in them, while word2vec creates representations of individual words, based on the words around them.

## What is it good for?

Word2vec is useful any time computers need to parse requests written by humans. The problem with human communication is that there are so many different ways to communicate the same concept. It's easy for us, as humans, to know that "the silverware" and "the utensils" can refer to the same thing. Computers can't do that unless we teach them, and this can be a real chokepoint for human/computer interactions. If you've ever played a text adventure game (think _Colossal Cave Adventure_ or _Zork_), you may have encountered the following scenario:

And your brain explodes from frustration. A text adventure game that incorporates a properly trained word2vec model would have vectors for "pick up", "lift", and "take" that are close to the vector for "grab" and therefore could accept those other verbs as synonyms so you could move ahead faster. In more practical applications, word2vec and other similar algorithms are what help a search engine return the best results for your query and not just the ones that contain the exact words you used. In fact, search is a better example, because not only does the search engine need to understand your request, it also needs to match it to web pages that were _also written by humans_ and therefore _also use idiosyncratic language_.

Humans, man.  

So how does it work?

## Generating vectors: Multiple algorithms

In considering the relationship between a word and its surrounding words, word2vec has two options that are the inverse of one another:

 * _Continuous Bag of Words_ (CBOW): the identity of a word is predicted using the words near it in a sentence.
 * _Skip-gram_: the identities of words are predicted from the word they surround. Skip-gram seems to work better for larger corpora.

For the sentence "Terry Gilliam is a better comedian than a director", if we focus on the word "comedian" then CBOW will try to predict "comedian" using "is", "a", "better", "than", "a", and "director".  Skip-gram will try to predict "is", "a", "better", "than", "a", and "director" using the word "comedian". In practice, for CBOW the vector for "comedian" will be pulled closer to the other words, while for skip-gram the vectors for the other words will be pulled closer to "comedian".  
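
The difference between the two framings can be sketched by building the training examples for the sentence above. This is only an illustration of how inputs and targets are arranged, not of the training itself; the helper name `context_window` and the choice `window=3` are assumptions made so that the context matches the six words listed in the text.

```python
def context_window(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

sentence = "terry gilliam is a better comedian than a director".split()
target_index = sentence.index("comedian")
context = context_window(sentence, target_index, window=3)

# CBOW: one example, all context words together predict the target word.
cbow_example = (context, "comedian")

# Skip-gram: one example per context word, the target predicts each of them.
skipgram_examples = [("comedian", word) for word in context]

print(cbow_example)
print(skipgram_examples)
```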

In addition to moving the vectors for nearby words closer together, each time a word is processed some vectors are moved farther away. Word2vec has two approaches to "pushing" vectors apart:
 
 * _Negative sampling_: Like it says on the tin, each time a word is pulled toward some neighbors, the vectors for a randomly chosen small set of other words are pushed away.
 * _Hierarchical softmax_: Every neighboring word is pulled closer to or pushed farther from a subset of words chosen based on a tree of probabilities.
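
As a rough sketch of the negative sampling idea, the snippet below performs plain gradient steps on the standard skip-gram negative sampling loss for a single (word, neighbor) pair. The vector size, number of negatives, learning rate, and random starting vectors are all assumptions for illustration; real implementations such as gensim add many refinements on top of this.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(center, context, negatives):
    """Negative sampling loss for one center word and one true neighbor."""
    positive_term = -np.log(sigmoid(context @ center))
    negative_term = -np.sum(np.log(sigmoid(-(negatives @ center))))
    return float(positive_term + negative_term)

def step(center, context, negatives, lr=0.1):
    """One gradient step: pull the neighbor closer, push the negatives away."""
    pos_err = sigmoid(context @ center) - 1.0   # scalar, always negative
    neg_err = sigmoid(negatives @ center)       # one value per negative word
    grad_center = pos_err * context + neg_err @ negatives
    grad_context = pos_err * center
    grad_negatives = np.outer(neg_err, center)
    return (center - lr * grad_center,
            context - lr * grad_context,
            negatives - lr * grad_negatives)

rng = np.random.default_rng(0)
dim, n_negative = 8, 5                          # assumed toy sizes
center = rng.normal(scale=0.1, size=dim)
context = rng.normal(scale=0.1, size=dim)
negatives = rng.normal(scale=0.1, size=(n_negative, dim))

loss_before = loss(center, context, negatives)
for _ in range(50):
    center, context, negatives = step(center, context, negatives)
loss_after = loss(center, context, negatives)

print(loss_before, loss_after)
```

Only the sampled negatives are updated on each step, which is what makes this cheaper than adjusting the vector of every word in the vocabulary.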

## What is similarity? Word2vec strengths and weaknesses

Keep in mind that word2vec operates on the assumption that frequent proximity indicates similarity, but words can be "similar" in various ways. They may be conceptually similar ("royal", "king", and "throne"), but they may also be functionally similar ("tremendous" and "negligible" are both common modifiers of "size"). Here is a more detailed exploration, [with examples](https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/), of what "similarity" means in word2vec.

One cool thing about word2vec is that it can identify similarities between words _that never occur near one another in the corpus_. For example, consider these sentences:

"The dog played with an elastic ball."
"Babies prefer the ball that is bouncy."
"I wanted to find a ball that's elastic."
"Tracy threw a bouncy ball."

"Elastic" and "bouncy" are similar in meaning in the text but don't appear in the same sentence. However, both appear near "ball". In the process of nudging the vectors around so that "elastic" and "bouncy" are both near the vector for "ball", the words also become nearer to one another and their similarity can be detected.
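
The shared-neighbor effect can be seen even without training a model. The sketch below swaps in simple sentence-level co-occurrence counts instead of word2vec (an assumption made to keep the example tiny and deterministic): "elastic" and "bouncy" never co-occur, yet their count vectors overlap because both appear with "ball".

```python
from collections import Counter
from itertools import permutations
import math

# The four example sentences, lowercased with punctuation removed.
sentences = [
    "the dog played with an elastic ball",
    "babies prefer the ball that is bouncy",
    "i wanted to find a ball that's elastic",
    "tracy threw a bouncy ball",
]

# cooc[a][b] = number of sentences in which words a and b both appear.
cooc = {}
for sentence in sentences:
    words = set(sentence.split())
    for a, b in permutations(words, 2):
        cooc.setdefault(a, Counter())[b] += 1

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

direct = cooc["elastic"]["bouncy"]               # never in the same sentence
shared = cosine(cooc["elastic"], cooc["bouncy"])  # but they share neighbors

print(direct, shared)
```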

For a while after it was introduced, [no one was really sure why word2vec worked as well as it did](https://arxiv.org/pdf/1402.3722v1.pdf) (see last paragraph of the linked paper). A few years later, some additional math was developed to explain word2vec and similar models. If you are comfortable with both math and "academese", have a lot of time on your hands, and want to take a deep dive into the inner workings of word2vec, [check out this paper](https://arxiv.org/pdf/1502.03520v7.pdf) from 2016.  

One of the draws of word2vec when it first came out was that the vectors could be used to convert analogies ("king" is to "queen" as "man" is to "woman", for example) into mathematical expressions ("king" + "woman" - "man" = ?) and solve for the missing element ("queen"). This is kinda nifty.
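
Here is a minimal sketch of that arithmetic on toy data. The two-dimensional vectors are hand-picked assumptions (roughly "royalty" and "gender" axes) so that the analogy works out exactly; vectors from a trained model are much noisier. The helper `solve_analogy` is hypothetical and only mimics the positive/negative style of query.

```python
import numpy as np

# Hypothetical 2-d vectors: first axis ~ royalty, second axis ~ gender.
toy_vectors = {
    "king": np.array([0.9, 0.9]),
    "queen": np.array([0.9, -0.9]),
    "man": np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
    "throne": np.array([0.9, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# "king" + "woman" - "man" = ?
answer = solve_analogy(["king", "woman"], ["man"], toy_vectors)
print(answer)  # queen
```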

A drawback of word2vec is that it works best with a corpus that is at least several billion words long. Even though the word2vec algorithm is speedy, this is a lot of data and takes a long time! Our example dataset is only two million words long, which allows us to run it in the notebook without overwhelming the kernel, but probably won't give great results.  Still, let's try it!

There are a few word2vec implementations in Python, but the general consensus is that the easiest one to use is in [gensim](https://radimrehurek.com/gensim/models/word2vec.html). Now is a good time to `pip install gensim` if you don't have it yet.

In [5]:
# Utility function to clean text.
def text_cleaner(text):
    
    # Visual inspection shows spaCy does not recognize the double dash '--'.
    # Better get rid of it now!
    text = re.sub(r'--',' ',text)
    
    # Get rid of headings in square brackets.
    text = re.sub(r'\[.*?\]', '', text)
    
    # Get rid of chapter titles.
    text = re.sub(r'Chapter \d+','',text)
    
    # Get rid of extra whitespace.
    text = ' '.join(text.split())
    
    return text


In [6]:
# This gives a very long text. spacy has a limit
# # Import all the Austen in the Project Gutenberg corpus.
# austen = ""
# for novel in ['persuasion','emma','sense']:
#     work = gutenberg.raw('austen-' + novel + '.txt')
#     austen = austen + work

# # Clean the data.
# austen_clean = text_cleaner(austen)

In [7]:
# Parse the data. This can take some time.
nlp = spacy.load('en')

In [8]:
# Clean each novel separately.
persuasion = text_cleaner(gutenberg.raw('austen-persuasion.txt'))
emma = text_cleaner(gutenberg.raw('austen-emma.txt'))
sense = text_cleaner(gutenberg.raw('austen-sense.txt'))

In [9]:
print(len(persuasion))
print(len(emma))
print(len(sense))

462818
876869
666583


In [10]:
# Parsing everything at once doesn't work: the kernel crashes or complains about the memory limit.
#austen_doc = nlp(austen_clean)
# Parse each novel separately instead.
persuasion_doc = nlp(persuasion)
emma_doc = nlp(emma)
sense_doc = nlp(sense)

In [11]:
from nltk.corpus import stopwords

# Build the stop word set once; calling stopwords.words() for every token is slow.
stop_words = set(stopwords.words('english'))

# Organize the parsed docs into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
sentences = []
for austen_doc in (persuasion_doc, emma_doc, sense_doc):
    for sentence in austen_doc.sents:
        sentence = [
            token.lemma_.lower()
            for token in sentence
            if token.lemma_.lower() not in stop_words
            and not token.is_punct
            and token.lemma_ != "-PRON-"
        ]
        sentences.append(sentence)

print(sentences[20])
# len() of a string counts characters, so report the total as characters, not tokens.
print('We have {} sentences and {} characters.'.format(len(sentences), len(persuasion) + len(emma) + len(sense)))

['one', 'daughter', 'eld', 'would', 'really', 'give', 'thing', 'much', 'tempt']
We have 17853 sentences and 2006270 characters.


In [12]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [13]:
import gensim
from gensim.models import word2vec

model = word2vec.Word2Vec(
    sentences,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=10,  # Minimum word count threshold.
    window=6,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    size=300,      # Word vector length (this argument is named vector_size in gensim 4+).
    hs=1           # Use hierarchical softmax.
)

print('done!')

2019-02-21 19:43:01,070 : INFO : 'pattern' package not found; tag filters are not available for English
2019-02-21 19:43:01,079 : INFO : collecting all words and their counts
2019-02-21 19:43:01,080 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-02-21 19:43:01,105 : INFO : PROGRESS: at sentence #10000, processed 92243 words, keeping 6042 word types
2019-02-21 19:43:01,126 : INFO : collected 7603 word types from a corpus of 166833 raw words and 17853 sentences
2019-02-21 19:43:01,127 : INFO : Loading a fresh vocabulary
2019-02-21 19:43:01,137 : INFO : effective_min_count=10 retains 2021 unique words (26% of original 7603, drops 5582)
2019-02-21 19:43:01,138 : INFO : effective_min_count=10 leaves 152265 word corpus (91% of original 166833, drops 14568)
2019-02-21 19:43:01,146 : INFO : deleting the raw counts dictionary of 7603 items
2019-02-21 19:43:01,148 : INFO : sample=0.001 downsamples 61 most-common words
2019-02-21 19:43:01,149 : INFO : downsampling

done!


In [14]:
# see sample word vector
model.wv["dance"][:10]

array([-0.01862508, -0.11993532, -0.13501856,  0.07445867,  0.03582802,
        0.00733998, -0.06704213,  0.09933243,  0.04451099, -0.05703803],
      dtype=float32)

In [15]:
# List of words in model.
vocab = model.wv.vocab.keys()

print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))

# Similarity is calculated using the cosine, so again 1 is total
# similarity and 0 is no similarity.
print(model.wv.similarity('loud', 'aloud'))
print(model.wv.similarity('mr', 'mrs'))

# One of these things is not like the other...
print(model.wv.doesnt_match("breakfast marriage dinner lunch".split()))

2019-02-21 19:43:03,376 : INFO : precomputing L2-norms of word weight vectors


[('attention', 0.6234357953071594), ('daughter', 0.6037120819091797), ('people', 0.5309017896652222), ('friend', 0.5290405750274658), ('satisfaction', 0.49827444553375244), ('indisposition', 0.49275344610214233), ('able', 0.4799255132675171), ('pleasing', 0.47090378403663635), ('address', 0.46492719650268555), ('father', 0.4638645648956299)]
0.8242786
0.331333
marriage


Clearly this model is not great – while some words given above might possibly fill in the analogy woman:lady::man:?, most answers likely make little sense. You'll notice as well that re-running the model likely gives you different results, indicating random chance plays a large role here.

We do, however, get a nice result on "marriage" being dissimilar to "breakfast", "lunch", and "dinner". 

## Drill 0

Take a few minutes to modify the hyperparameters of this model and see how its answers change. Can you wrangle any improvements?

In [16]:
# Tinker with hyperparameters here.
model = word2vec.Word2Vec(
    sentences,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=5,   # Minimum word count threshold.
    window=10,     # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    size=300,      # Word vector length (this argument is named vector_size in gensim 4+).
    hs=1           # Use hierarchical softmax.
)
# List of words in model.
vocab = model.wv.vocab.keys()

print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))

# Similarity is calculated using the cosine, so again 1 is total
# similarity and 0 is no similarity.
print(model.wv.similarity('loud', 'aloud'))
print(model.wv.similarity('mr', 'mrs'))

# One of these things is not like the other...
print(model.wv.doesnt_match("breakfast marriage dinner lunch".split()))


2019-02-21 19:43:03,397 : INFO : collecting all words and their counts
2019-02-21 19:43:03,398 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-02-21 19:43:03,419 : INFO : PROGRESS: at sentence #10000, processed 92243 words, keeping 6042 word types
2019-02-21 19:43:03,434 : INFO : collected 7603 word types from a corpus of 166833 raw words and 17853 sentences
2019-02-21 19:43:03,435 : INFO : Loading a fresh vocabulary
2019-02-21 19:43:03,444 : INFO : effective_min_count=5 retains 3017 unique words (39% of original 7603, drops 4586)
2019-02-21 19:43:03,445 : INFO : effective_min_count=5 leaves 158867 word corpus (95% of original 166833, drops 7966)
2019-02-21 19:43:03,457 : INFO : deleting the raw counts dictionary of 7603 items
2019-02-21 19:43:03,459 : INFO : sample=0.001 downsamples 58 most-common words
2019-02-21 19:43:03,459 : INFO : downsampling leaves estimated 142915 word corpus (90.0% of prior 158867)
2019-02-21 19:43:03,464 : INFO : constructing 

[('observant', 0.5732364654541016), ('entertain', 0.5313925743103027), ('associate', 0.5287098288536072), ('daughter', 0.5250798463821411), ('train', 0.5061901211738586), ('congratulate', 0.5046765804290771), ('handsome', 0.5040509700775146), ('hysteric', 0.49939125776290894), ('back', 0.4969525635242462), ('row', 0.48633307218551636)]
0.6655203
0.43741384
marriage


# Example word2vec applications

You can use the vectors from word2vec as features in other models, or try to gain insight from the vector compositions themselves.

Here are some neat things people have done with word2vec:

 * [Visualizing word embeddings in Jane Austen's Pride and Prejudice](http://blogger.ghostweather.com/2014/11/visualizing-word-embeddings-in-pride.html). Skip to the bottom to see a _truly honest_ account of this data scientist's process.

 * [Tracking changes in Dutch Newspapers' associations with words like 'propaganda' and 'alien' from 1950 to 1990](https://www.slideshare.net/MelvinWevers/concepts-through-time-tracing-concepts-in-dutch-newspaper-discourse-using-sequential-word-vector-spaces).

 * [Helping customers find clothing items similar to a given item but differing on one or more characteristics](http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/).

## Drill 1: Word2Vec on 100B+ words

As we mentioned, word2vec really works best on a big corpus, but it can take half a day to clean such a corpus and run word2vec on it.  Fortunately, there are word2vec models available that have already been trained on _really_ big corpora. They are big files, but you can download a [pretrained model of your choice here](https://github.com/3Top/word2vec-api). At minimum, the ones built with word2vec (check the "Architecture" column) should load smoothly using an appropriately modified version of the code below, and you can play to your heart's content.

Because the models are so large, however, you may run into memory problems or crash the kernel. If you can't get a pretrained model to run locally, check out this [interactive web app of the Google News model](https://rare-technologies.com/word2vec-tutorial/#bonus_app) instead.

However you access it, play around with a pretrained model. Is there anything interesting you're able to pull out about analogies, similar words, or words that don't match? Write up a quick note about your tinkering and discuss it with your mentor during your next session.

In [19]:
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('/home/tigial3535/google_word_vectors.bin.gz', binary=True)

2019-02-21 19:44:26,896 : INFO : loading projection weights from /home/tigial3535/google_word_vectors.bin.gz
2019-02-21 19:46:28,112 : INFO : loaded (3000000, 300) matrix from /home/tigial3535/google_word_vectors.bin.gz


In [20]:
# Play around with your pretrained model here.
# The loaded object is already a KeyedVectors instance, so query it directly.
print(model.most_similar(positive=['lady', 'man'], negative=['woman']))

  
2019-02-21 19:47:04,840 : INFO : precomputing L2-norms of word weight vectors


[('fella', 0.6031545400619507), ('gentleman', 0.5849649906158447), ('chap', 0.5543248653411865), ('gent', 0.543907880783081), ('guy', 0.5265033841133118), ('lad', 0.5139425992965698), ('feller', 0.5072450041770935), ('bloke', 0.49030160903930664), ('rascal', 0.4873698949813843), ('ladies', 0.47617611289024353)]


This is a much better result. If woman is to lady, then man is to gentleman; "fella" and "guy" are also good answers.

In [22]:
print(model.most_similar(positive=['place', 'service']))


[('services', 0.5195369720458984), ('ser_vice', 0.47518986463546753), ('service.The', 0.4567696452140808), ('sevice', 0.4489085078239441), ('places', 0.4361933469772339), ('facilities_distinguishes_EarthSearch', 0.4195823669433594), ('Complimentary_valet', 0.4184284806251526), ('servive', 0.41484370827674866), ('Princeton_Wendy_Benchley', 0.4123907685279846), ('Starbuck_Lind_Mortuary', 0.4115825891494751)]


It caught words with slight spelling mistakes, such as "sevice" and "servive". This should be useful for analyzing reviews that contain many typos.

In [24]:
print(model.most_similar(positive=['place', 'service'], negative=['restaurant', 'food']))


[('palce', 0.3032969534397125), ('1st_SOPS', 0.2987600862979889), ('sevice', 0.29494941234588623), ('Cablevision_Optimum_Lightpath', 0.2813442349433899), ('ILS_Proton_launch', 0.27688130736351013), ('#oo#', 0.27500271797180176), ('Khokhrapar_Munabao_train', 0.27090984582901), ('BigPond_Broadband', 0.26873475313186646), ('doctor_Lothar_Heinrich', 0.2677452266216278), ('Euro_PacketCable', 0.2674042284488678)]


In [26]:
print(model.most_similar(positive=['Paris', 'Italy'], negative=['France']))


[('Milan', 0.7222141027450562), ('Rome', 0.7028310298919678), ('Palermo_Sicily', 0.5967570543289185), ('Italian', 0.5911272764205933), ('Tuscany', 0.5632812976837158), ('Bologna', 0.5608358383178711), ('Sicily', 0.5596384406089783), ('Bologna_Italy', 0.5470059514045715), ('Berna_Milan', 0.5464028120040894), ('Genoa', 0.5308899879455566)]


In [29]:
print(model.most_similar(positive=['2Cents']))


[('Sworn_Enemy', 0.6559710502624512), ('Hinder_Papa_Roach', 0.6507534980773926), ('KILL_HANNAH', 0.6474124193191528), ('Singer_Jacoby_Shaddix', 0.645458459854126), ('Gracious_Few', 0.644819974899292), ('Buckcherry_Papa_Roach', 0.6439394950866699), ('Drummer_Quits', 0.6436043977737427), ('Jonny_Lives', 0.6390442848205566), ('Damned_Things', 0.6344761848449707), ('Duff_McKagan_Loaded', 0.6319283246994019)]


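The earlier query with "restaurant" and "food" as negatives (In [24]) found services other than restaurants and food, although the similarity scores are low (about 0.3).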

Technical Coaching

Tiago [4 days ago]
Hello, Tinsae. I'm taking a look at stack overflow. Can you share with me the problematic part of the code and the dataset?


Tinsae [4 days ago]
4.4.4+Unsupervised+Neural+Networks+and+NLP

Input 5


Tinsae [4 days ago]
spacyerror.png



Tiago [4 days ago]
How many GBs of RAM do you have available?


Tinsae [4 days ago]
It is 8 GB computer. I assume at least 5 GB is free
raminfo.png



Tiago [4 days ago]
can you take a look at task manager, please?

Tiago [4 days ago]
you're using windows, right?


Tinsae [4 days ago]
yes


Tiago [4 days ago]
meanwhile, I'm trying to run the same thing in google colab, just to give us an environment with tons of RAM to spare


Tiago [4 days ago]
Tinsae, I was able to run it in Google Colab with the following modification:

```# Parse the data. This can take some time.
nlp = spacy.load('en')
nlp.max_length = 2006272 + 1
austen_doc = nlp(austen_clean)```


Tiago [4 days ago]
aaaand my session crashed after using all available RAM


Tinsae [4 days ago]
How much RAM did Google Colab give you?


Tiago [4 days ago]
about 12GB


Tinsae [4 days ago]
I have 56 GB ram VM on Google Cloud Platform


Tiago [4 days ago]
btw, 12GB free


Tinsae [4 days ago]
I am using the free $300


Tiago [4 days ago]
that might work. But I want to see if we are able to break that work in chunks

Tiago [4 days ago]
and make it less memory-hungry


Tiago [4 days ago]
(another way would be to make a swapfile, but performance will be awful)


Tinsae [4 days ago]
Yeah, making it less memory-hungry is what I am also aiming for.


Tiago [4 days ago]
a dumb way to solve it is to use spacy 1.x

Tiago [4 days ago]
I'll try that first


Tiago [4 days ago]
you using anaconda or pure python?


Tinsae [4 days ago]
anaconda


Tiago [4 days ago]
okay, it works under spacy 1.x


Tiago [4 days ago]
let me give you the command to do a conda install of spacy 1.x


Tiago [4 days ago]
one moment, as I figure this out


Tiago [4 days ago]
got it. You have to open a prompt and type:

```conda install 'spacy<2'```


Tiago [4 days ago]
and now, let me talk in the mentor channel as how we update the curriculum


Tinsae [4 days ago]
Is that downgrading spacy?


Tiago [4 days ago]
yes


Tinsae [4 days ago]
ok. That is one way to solve it


Tiago [4 days ago]
yes, that's the "dumb" way. The other one is to figure out the pieces of the pipeline (which you can do by typing `nlp.pipeline` in the notebook) and checking whether they can run in chunks of data


Tiago [4 days ago]
I'm going to give it a try, but, as it is no longer high-priority, if other tickets arise, I'll have to stop to deal with them, okay?

Tinsae [4 days ago]
ok. The ticket is solved


Tinsae [4 days ago]
Thanks Tiago


Tiago [4 days ago]
I'll spend some more time with it before marking it as solved, okay?


Tinsae [4 days ago]
ok

Technical Coaching 2

 Requested by @Tinsae G. Alemayehu
*`import gensim` raises the error `module 'boto' has no attribute 'plugin'`. How can I solve it?*

• What have you tried so far?

Tinsae   [1 minute ago]
Solved it!!

Tinsae   [1 minute ago]
I removed a virtual machine which had prebuilt libraries like tensorflow, sklearn on Debian OS. I installed a new Ubuntu image and installed anaconda from scratch. Ubuntu has more support on stack exchange.

I uninstalled gensim, which had been installed by conda, and used pip instead:

pip install gensim

pip install google_cloud_platform

pip install --upgrade gensim smart_open

finally I restarted the kernel and it worked!!

Ticket Solved!


Technical Coaching 3


Tiago   [7 hours ago]
Hey, Tinsae. You have 5 billion user reviews, like, actual reviews? Wow. Where did you get that?

Tiago   [7 hours ago]
And let me take a look at the Google News Corpus.

Tinsae   [7 hours ago]
Sorry I meant 5 million :slightly_smiling_face:

Tiago   [7 hours ago]
But unfortunately I have no specific answer for you, I don't know what would be a good size.

Tinsae   [7 hours ago]
you may check this
https://github.com/3Top/word2vec-api

Tiago   [7 hours ago]
thanks!

Tinsae   [7 hours ago]
The size of google news corpus is 100B. Does that mean 100B words?

Tiago   [6 hours ago]
I am trying to understand what is that 100B. Because the vocabulary size is 3 million

Tiago   [6 hours ago]
(and anyway, where did you get the 5 million user reviews? )

Tiago   [6 hours ago]
(not pointing fingers, just "oh, cool, I might do some cool stuff with that")

Tinsae   [6 hours ago]
Yelp reviews from kaggle

Tiago   [6 hours ago]
oh, okay

Tiago   [6 hours ago]
another question: why are you using that word2vec-api?

Tinsae   [6 hours ago]
I am not using it

Tiago   [6 hours ago]
oooooooh, okay

Tinsae   [6 hours ago]
The link is given in the course to check out pretrained word vectors

Tiago   [6 hours ago]
FOUND IT: about 100 billion words

Tiago   [6 hours ago]
https://code.google.com/archive/p/word2vec/

Tiago   [6 hours ago]
And so, I ask you, how many words do all of your reviews have?

Tinsae   [6 hours ago]
I don't know the actual number. I created a bag of words with min_df=0.001 and obtained 3000 words out of 1M reviews.

Tiago   [6 hours ago]
which function did you use for the bag of words?

Tinsae   [6 hours ago]
CountVectorizer

Tiago   [6 hours ago]
okay

Tiago   [6 hours ago]
and, do you have a link to the raw data?

Tiago   [6 hours ago]
or to the kaggle page

Tinsae   [6 hours ago]
It will take more than 2 hours to count the words.  I thought we could guess the word count.

https://www.kaggle.com/yelp-dataset/yelp-dataset/kernels

Tiago   [6 hours ago]
Are you doing the count single-threaded? Only one processor core?

Tinsae   [6 hours ago]
njobs=-1

Tiago   [6 hours ago]
where do you use that param njobs?

Tinsae   [6 hours ago]
I went back to the code and found that I didn't use CountVectorizer in parallel. It doesn't have an njobs kwarg.

Tinsae   [6 hours ago]
But the virtual machine I used  has 8 cores with 30GB ram

Tiago   [6 hours ago]
another way to check if it uses parallelization under the core is to run the thing and monitor CPU Usage (with something like htop)

Tiago   [6 hours ago]
(I am saying this, because a looot of times I found something was parallelized by default, got a beefy VM and saw only one processor being used)

Tinsae   [6 hours ago]
The CPU utilization never crossed 50%

Tiago   [6 hours ago]
what did you use to check CPU Usage?

Tinsae   [6 hours ago]
GCP compute engine has a monitor page for every VM

Tiago   [6 hours ago]
oh, okay

Tiago   [6 hours ago]
well, anyway, if it wasn't bigger than 50%, definitely not parallel

Tinsae   [6 hours ago]
monitoring.png 

Tiago   [6 hours ago]
Well, I see two ways forward here:

1) Just train the model and let's see what happens
2) I go to google colab and do some parallel code to find out how many words that corpus has

Tiago   [6 hours ago]
I am willing to take the second route because there aren't many people here now. But I want to finish my avocado first.

Tinsae   [6 hours ago]
:avocado: ok.

Tinsae   [6 hours ago]
Can you download and upload 5GB of data easily? It is a large JSON file.

Tiago   [6 hours ago]
my idea is to download from kaggle straight to google colab's notebook

Tinsae   [6 hours ago]
k

Tiago   [6 hours ago]
btw, if you have any direct links just laying around, now is the time to share

Tinsae   [6 hours ago]
```# Import the necessary libraries.
from ftplib import FTP

# Log in to the FTP server (credentials redacted).
server = "##.##.###.##"
username = "******@tinsaealemayehu.com"
password = "********"
ftp = FTP(server)
ftp.login(user=username, passwd=password)

# Download the review file in binary mode; the file is closed automatically.
with open("yelp_academic_dataset_review.json", "wb") as rvfile:
    ftp.retrbinary('RETR yelp_academic_dataset_review.json', rvfile.write)
ftp.quit()```

Tinsae   [6 hours ago]
You can use the above code. I sent you the server, username, and password through PM

Greg   [6 hours ago]
I have trained doc2vec models on considerably less text, and those are just fancy word2vec models, so my 2cents are to go for it.

Tinsae   [5 hours ago]
Thanks @Greg. Just for fun, I searched "2cents" using google news vectors and found a bunch of rock bands
'Sworn_Enemy', 0.6559710502624512), ('Hinder_Papa_Roach', 0.6507534980773926), ('KILL_HANNAH', 0.6474124193191528), ('Singer_Jacoby_Shaddix', 0.645458459854126)

Tiago   [5 hours ago]
Thanks, Greg!

Tiago   [5 hours ago]
But I think I'll do the useless coding now just for the sake of fun

Tiago   [4 hours ago]
Aaaand the results are just in: 6685900 words

Tiago   [4 hours ago]
so, almost 7 million words

Tiago   [4 hours ago]
in total

Tiago   [4 hours ago]
now, to count uniques

Tinsae   [4 hours ago]
Nice!! waiting for the unique counts

Tiago   [2 hours ago]
Aaand the count just used all the RAM!

Tinsae   [2 hours ago]
You have spent enough time on it. If possible, could you share your code? I would try it on my 56 GB RAM virtual machine.

Tiago   [2 hours ago]
```import pandas as pd
import dask.dataframe as dd
!wget -c "https://storage.googleapis.com/kaggle-datasets/10100/277695/yelp-dataset.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1551046405&Signature=p%2BBda9sOLceu8EceqHAhrOPKgraMDDfZ8%2FqiW1z8DygzgoJaUzBW5xL8MWtD5z55OVJWe61Zv9qpdVeHCXcZ0r4mpV7dezhwA2Mmus2OyJqAhhWoXxPSieNB36fHaFmrm6xu%2FOEp5R1TJ2s72vWKU7tnOJS9%2BFBYnvOuV6cPN4lrpVVTrixOjCLv8QQWndfGR7V1Wcz2MqLaqAVFWuN94WU6ycz%2BlcSvdgqPplNHmmzcT%2BBvepxcuTheWMPuHusGUPJ1FHHQ4BEv8RgrhtTvXtV6jxupyw69eMHWpdN0J4bWHACc2%2BGthxgK00Tplt%2FrINAFiPATMf7dpciyNKigAA%3D%3D" -O yelp.zip
!unzip yelp.zip
ddf = dd.read_json('yelp_academic_dataset_review.json', blocksize=2**28)
textao = ddf.text.to_bag()
size_all_itens = textao.count().compute()
distinct_items = textao.distinct().compute()```

Tiago   [2 hours ago]
soo.... apparently, only 4388 words

Tiago   [2 hours ago]
that seems too little

Tiago   [2 hours ago]
aaaand my code is broken

Tiago   [2 hours ago]
anyway, going to close the ticket because the issue is solved (I guess)

Tinsae   [10 minutes ago]
Thank you very much, @Tiago. We had a very productive session. I learned a lot of new things