### 荃者所以在鱼 得鱼而忘荃  言者所以在意 得意而忘言 (庄子)
### Nets are for fish; once you get the fish, you can forget the net. Words are for meaning; once you get the meaning, you can forget the words. (Zhuangzi / Chuang Tzu)

##  <s>Word2Vec</s>
##  <s>TF-IDF</s>
##  <s>GloVe</s>
##  <s>Cosine Similarity</s>
##  <s>ASCII      (ASCI27)</s>
##  <s>One-hot</s>


 <mark> • In vector semantics, a word is modeled as a vector, a point in high-dimensional space, also called an embedding. With **static embeddings**, each word is mapped to a fixed embedding.</mark>

<mark> • Vector semantic models fall into two classes: sparse and dense. In sparse models each dimension corresponds to a word in the vocabulary V and cells are functions of co-occurrence counts. The term-document matrix has a row for each word (term) in the vocabulary and a column for each document. The word-context or term-term matrix has a row for each (target) word in the vocabulary and a column for each context term in the vocabulary. Two sparse weightings are common: the tf-idf weighting, which weights each cell by its term frequency and inverse document frequency, and **PPMI (positive pointwise mutual information)**, which is most common for word-context matrices.</mark>
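
As a rough illustration of the two sparse weightings, the sketch below applies tf-idf to a tiny term-document count matrix and PPMI to a tiny word-context count matrix. The toy counts, the function names `tf_idf` and `ppmi`, and the specific tf-idf variant (log10(count + 1) times log10(N / df)) are assumptions made for this example; several other tf-idf variants are in common use.

```python
import math

def tf_idf(counts):
    """counts maps each term to its raw counts, one entry per document."""
    n_docs = len(next(iter(counts.values())))
    weighted = {}
    for term, row in counts.items():
        df = sum(1 for c in row if c > 0)            # number of documents containing the term
        idf = math.log10(n_docs / df) if df else 0.0
        weighted[term] = [math.log10(c + 1) * idf for c in row]
    return weighted

def ppmi(cooc):
    """cooc is a list of rows (target words) over columns (context words)."""
    total = sum(sum(row) for row in cooc)
    row_sums = [sum(row) for row in cooc]
    col_sums = [sum(col) for col in zip(*cooc)]
    result = []
    for i, row in enumerate(cooc):
        out = []
        for j, c in enumerate(row):
            if c == 0:
                out.append(0.0)                      # log of zero is undefined; PPMI uses 0
                continue
            pmi = math.log2((c * total) / (row_sums[i] * col_sums[j]))
            out.append(max(pmi, 0.0))                # clip negative PMI to zero
        result.append(out)
    return result

# Toy term-document counts (3 documents) and word-context counts (2 x 2).
term_doc = {"embedding": [3, 0, 1], "the": [5, 4, 6], "fish": [0, 2, 0]}
weights = tf_idf(term_doc)
word_context = [[4, 0], [1, 3]]
ppmi_matrix = ppmi(word_context)

print(weights["the"])    # a term that occurs in every document gets idf = 0
print(ppmi_matrix)
```

A term such as "the" that appears in every document ends up with zero weight, which is the point of the idf factor; PPMI likewise zeroes out pairs that co-occur less often than chance would predict.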
    
<mark> • Dense vector models have dimensionality 50–1000. Word2vec algorithms like **skip-gram** are a popular way to compute dense embeddings. Skip-gram trains a logistic regression classifier to compute the probability that two words are ‘likely to occur nearby in text’. This probability is computed from the dot product between the embeddings for the two words.</mark>
    
<mark> • Skip-gram uses stochastic gradient descent to train the classifier, by learning embeddings that have a high dot product with embeddings of words that occur nearby and a low dot product with noise words.</mark>
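
The skip-gram idea can be sketched in a few lines: the probability that two words occur nearby is the sigmoid of the dot product of their embeddings, and each SGD step nudges the target toward its true context word and away from sampled noise words. The vectors, the learning rate, and the helper name `sgns_step` below are assumptions for illustration only; this is a minimal sketch of skip-gram with negative sampling, not the word2vec implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_step(target, context, noise, lr=0.1):
    """One SGD update on the logistic loss; mutates the vectors in place."""
    grad_target = [0.0] * len(target)
    # Label 1 for the true context word, 0 for each noise word.
    for vec, label in [(context, 1.0)] + [(n, 0.0) for n in noise]:
        err = sigmoid(dot(target, vec)) - label
        for k in range(len(target)):
            grad_target[k] += err * vec[k]       # accumulate using the old vec[k]
            vec[k] -= lr * err * target[k]       # update the context or noise vector
    for k in range(len(target)):
        target[k] -= lr * grad_target[k]         # update the target vector last

# Hypothetical 3-dimensional embeddings.
target = [0.1, -0.2, 0.3]
context = [0.2, 0.1, -0.1]
noise = [[-0.3, 0.2, 0.1], [0.05, 0.4, -0.2]]

before_pos = sigmoid(dot(target, context))
before_neg = sigmoid(dot(target, noise[0]))
for _ in range(200):
    sgns_step(target, context, noise)
after_pos = sigmoid(dot(target, context))
after_neg = sigmoid(dot(target, noise[0]))

print(before_pos, "->", after_pos)   # probability for the true pair should rise
print(before_neg, "->", after_neg)   # probability for a noise pair should fall
```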
    
 <mark> • Other important embedding algorithms include GloVe, a method based on ratios of word co-occurrence probabilities.</mark>
     
<mark> • Whether using sparse or dense vectors, word and document similarities are computed by some function of the dot product between vectors. The cosine of two vectors (a normalized dot product) is the most popular such metric.</mark>
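
To make the "normalized dot product" concrete, here is a minimal cosine function in plain Python. The function name and the example vectors are illustrative assumptions, and the sketch assumes neither vector is all zeros.

```python
import math

def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)   # assumes both norms are nonzero

same_direction = cosine([1, 2, 3], [2, 4, 6])   # parallel vectors, different lengths
orthogonal = cosine([1, 0], [0, 1])             # no shared direction
opposite = cosine([1, 0], [-1, 0])              # pointing in opposite directions

print(same_direction, orthogonal, opposite)
```

Because of the normalization, vector length (for example, word frequency in a count-based model) does not affect the score; only direction does.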

In [1]:
import spacy
import numpy as np
import pandas as pd
from spacy import displacy


In [4]:
nlp1 = spacy.load("en_core_web_sm")
nlp2 = spacy.load("en_core_web_md")
nlp3 = spacy.load("en_core_web_lg")

In [2]:
text1 = """Word Embedding demo on a Tuesday afternoon"""
text2 = """Word Embedding demo on a Thurdays afternoon"""
text3 = """Word Embedding demo on a Tuesday afternoon every two weeks"""

In [5]:
doc1 = nlp1(text1)
doc2 = nlp1(text2)
doc3 = nlp1(text3)

In [6]:
pd.DataFrame(list((doc1.vector, doc2.vector, doc3.vector)))

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.604855 | 0.870129 | 0.120647 | -0.967765 | 0.157772 | -1.040268 | 0.443886 | 1.090996 | -1.688075 | 2.247317 | ... | -0.186533 | 0.798131 | 0.428295 | -0.06881 | -0.93145 | -1.342759 | -1.490826 | 0.51776 | 1.276725 | -0.960276 |
| 1 | 1.68109 | 0.971054 | 0.421737 | -0.955669 | 0.252836 | -0.803135 | 0.24796 | 1.519428 | -1.447631 | 2.360409 | ... | -0.387708 | 0.766207 | 0.514951 | -0.023413 | -0.101163 | -1.337172 | -1.350518 | 0.19408 | 1.533168 | -0.610841 |
| 2 | 1.279288 | 0.591502 | -0.130218 | -1.383584 | -0.223938 | -1.321685 | 0.487777 | 0.471892 | -1.597132 | 1.744678 | ... | 1.025017 | 0.045974 | 0.886794 | -0.137476 | -0.312957 | -1.715649 | -0.816762 | 1.14616 | 1.498479 | -0.997208 |


In [10]:
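# Note: this call compares doc2 with itself, so the printed score is 1.0 by construction.
# Use doc1.similarity(doc2) to compare the two documents.
print(doc1, "<-similarity to>", doc2, doc2.similarity(doc2))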


Word Embedding demo on a Tuesday afternoon <-similarity to> Word Embedding demo on a Thurdays afternoon 1.0


In [7]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text3)
# Since this is an interactive Jupyter environment, we can use displacy.render here
displacy.render(doc, style='dep')

In [22]:
for token in doc3:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop, token.head)

Word word NOUN NN compound Xxxx True False Embedding
Embedding Embedding PROPN NNP compound Xxxxx True False demo
demo demo NOUN NNS ROOT xxxx True False demo
on on ADP IN prep xx True True demo
a a DET DT det x True True afternoon
Tuesday Tuesday PROPN NNP compound Xxxxx True False afternoon
afternoon afternoon NOUN NN pobj xxxx True False on
every every DET DT det xxxx True True weeks
two two NUM CD nummod xxx True True weeks
weeks week NOUN NNS npadvmod xxxx True False demo


In [11]:
text_job = """We are looking for a Mobile developer responsible for the development
and maintenance of applications aimed at a vast number of diverse devices. Familiarity 
with RESTful APIs to connect iOS or Android applications to back-end services, work from home. """

In [27]:
text_job_new = text_job + text_job
text_job_new_2 = text_job + "hahaha.  what is that"

In [30]:
doc_job_new = nlp2(text_job_new)  # needed by the similarity call in cell In [31]
doc_job_new_2 = nlp2(text_job_new_2)

In [12]:
profile_description_tech = "founder,full stack developer,web developer,mobile developer,lead developer,tech lead,technician lead"
profile_description_nontech = 'account executive,account coordinator,waitress,director of marketing,senior vice president.'


In [15]:
doc_job = nlp2(text_job)
doc_profile_tech = nlp2(profile_description_tech)
doc_profile_nontech = nlp2(profile_description_nontech)


In [31]:
doc_job_new.similarity(doc_job_new_2)

0.9959470979394311

In [18]:
doc_job.similarity(doc_profile_tech),doc_job.similarity(doc_profile_nontech)

(0.7751471282917346, 0.6852704822774701)

aws lambda invoke --function-name prod-dhi-pac-tech-profile-classifier-model:live --payload '{"profile_description": "We are looking for a Mobile developer responsible for the development and maintenance of applications aimed at a vast number of diverse devices. Familiarity with RESTful APIs to connect iOS or Android applications to back-end services, work from home."}' --profile dhi-profileacquisition-prod outfile_job.json


aws lambda invoke --function-name prod-dhi-pac-tech-profile-classifier-model:live --payload '{"profile_description": "founder,full stack developer,web developer,mobile developer,lead developer,python developer, aws,tech lead,technician lead, agile"}' --profile dhi-profileacquisition-prod outfile_profiletech.json

aws lambda invoke --function-name prod-dhi-pac-tech-profile-classifier-model:live --payload '{"profile_description": "account executive,account coordinator,waitress,director of marketing,self employed,renovation , senior vice president."}' --profile dhi-profileacquisition-prod outfile_profile_nontech.json

aws lambda invoke --function-name prod-dhi-pac-tech-profile-classifier-model:live --payload '{"profile_description": "荃 者 所 以 在 鱼 得 鱼 而 忘 荃  言 者 所 以 在 意 得 意 而 忘 言"}' --profile dhi-profileacquisition-prod outfile.json


aws lambda invoke --function-name prod-dhi-pac-tech-profile-classifier-model:live --payload '{"profile_description": "Nets are for fish; Once you get the fish, you can forget the net. Words are for meaning; Once you get the meaning, you can forget the words."}' --profile dhi-profileacquisition-prod outfile_trnsl.json

In [20]:
doc_zz = nlp2("荃 者 所 以 在 鱼 得 鱼 而 忘 荃  言 者 所 以 在 意 得 意 而 忘 言")
doc_zz_tr = nlp2("Nets are for fish; Once you get the fish, you can forget the net. Words are for meaning; Once you get the meaning, you can forget the words.")

In [21]:
doc_zz.similarity(doc_zz_tr)


0.0
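
One plausible reading of the 0.0 score: with models that ship static word vectors, spaCy's default `Doc.vector` is an average of the token vectors and `Doc.similarity` is a cosine over those averages, so a text whose tokens are all out of vocabulary for an English model ends up with an all-zero vector. The sketch below imitates that behaviour with a hypothetical three-dimensional lookup table; `TOY_VECTORS`, `doc_vector`, and `similarity` are illustrative names, not spaCy APIs.

```python
import math

# Hypothetical 3-dimensional lookup table; real models use far more dimensions.
TOY_VECTORS = {
    "nets": [0.9, 0.1, 0.0],
    "fish": [0.8, 0.3, 0.1],
    "words": [0.1, 0.9, 0.2],
    "meaning": [0.2, 0.8, 0.3],
}
DIM = 3

def doc_vector(tokens):
    """Average the token vectors; unknown tokens contribute zeros."""
    total = [0.0] * DIM
    for tok in tokens:
        vec = TOY_VECTORS.get(tok.lower(), [0.0] * DIM)
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(tokens) for t in total] if tokens else total

def similarity(tokens_a, tokens_b):
    """Cosine of the averaged vectors, defined as 0.0 when either is all zeros."""
    u, v = doc_vector(tokens_a), doc_vector(tokens_b)
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / norm

english = ["Nets", "fish", "words", "meaning"]
chinese = ["荃", "者", "鱼", "言"]   # none of these are in the toy English table

sim_oov = similarity(chinese, english)
sim_same = similarity(english, english)
print(sim_oov, sim_same)
```

Under this reading the score says nothing about how close the two passages are in meaning; it only reflects that the English vocabulary has no vectors for the Chinese tokens.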

## Future work
### 1. Pretraining our own model
### 2. Richard's tagging machine 