## Towards a Better Search: Leveraging word2vec for Contextualization

In [1]:
import nltk, string, json

import pyspark as ps

def tokenize(text):
    # load the stopword list once per call instead of once per word
    stopwords = set(nltk.corpus.stopwords.words('english'))
    tokens = []
    
    for word in nltk.word_tokenize(text):
        if word not in stopwords \
            and word not in string.punctuation \
            and word != '``':
                tokens.append(word)
    
    return tokens

In [2]:
sc = ps.SparkContext()

In [1]:
import os

# need to get local path since we are reading local files
cwd = os.getcwd()

In [3]:
from collections import Counter

from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

# parse input essay file
essay_rdd = sc.textFile('file://' + cwd + '/data/donors_choose/essay_1000.json')
row_rdd = essay_rdd.map(lambda x: json.loads(x))

# tokenize documents
tokenized_rdd = row_rdd.filter(lambda row: row['essay'] and row['essay'] != '') \
                       .map(lambda row: row['essay']) \
                       .map(lambda text: text.replace('\\n', '').replace('\r', '')) \
                       .map(lambda text: tokenize(text))

In [4]:
essay_rdd.take(1)

[u'{"essay":"\\"I am currently a Special Education Math teacher in a high needs middle school. My students are eager to learn but lack proper classroom resources. I am requesting necessary classroom basics in order to create a supply center. \\r\\\\n\\r\\\\nMy students are very motivated and try their best despite the academic disabilities. In addition to learning the language, my students come from low income homes. All of the students in my school are eligible for free or reduced lunch due to the household incomes. \\r\\\\n\\r\\\\nMy students are classified as emotionally disturbed, academically delayed or learning disabled. They need all the help they can get to get them to the same level as their middle school peers. Most don\'t have the supplies due to the financial strain buying school supplies puts on the family. Having a large supply of pens, pencils, glue sticks, stapler, and other classroom basics, would be a blessing each day. Most of these materials are needed for everyday 

In [5]:
# compute term and document frequencies
hashingTF = HashingTF(numFeatures=50000)
tf = hashingTF.transform(tokenized_rdd)

tf.cache()
idf = IDF(minDocFreq=2).fit(tf)
tfidf = idf.transform(tf)
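
To make the cell above less of a black box, here is a minimal pure-Python sketch of the two steps: the hashing trick that maps each token to one of `numFeatures` buckets, and the smoothed inverse document frequency `log((m + 1) / (df + 1))` that MLlib documents for `IDF`, with buckets seen in fewer than `minDocFreq` documents set to 0. The helper names (`sketch_hashed_tf`, `sketch_fit_idf`, `sketch_tfidf`), the toy documents, and the use of `zlib.crc32` as the hash are assumptions for illustration only; MLlib uses its own hash function, so the bucket indices will not match the output shown in this notebook.

```python
import math
import zlib
from collections import Counter

def sketch_hashed_tf(tokens, num_features=50000):
    """Count token occurrences per hash bucket (the 'hashing trick')."""
    counts = Counter()
    for token in tokens:
        # crc32 is a stand-in hash picked for reproducibility (assumption)
        index = zlib.crc32(token.encode('utf-8')) % num_features
        counts[index] += 1
    return dict(counts)

def sketch_fit_idf(tf_docs, min_doc_freq=0):
    """Smoothed IDF per bucket: log((m + 1) / (df + 1)), 0 below min_doc_freq."""
    m = len(tf_docs)
    doc_freq = Counter()
    for tf in tf_docs:
        doc_freq.update(tf.keys())
    return {
        index: math.log((m + 1.0) / (df + 1.0)) if df >= min_doc_freq else 0.0
        for index, df in doc_freq.items()
    }

def sketch_tfidf(tf, idf_weights):
    """Scale each bucket count by its IDF weight."""
    return {index: count * idf_weights.get(index, 0.0) for index, count in tf.items()}

# hypothetical toy corpus, already tokenized
toy_docs = [
    ['math', 'teacher', 'school'],
    ['school', 'laptop'],
    ['school', 'math', 'math'],
]
toy_tf = [sketch_hashed_tf(doc) for doc in toy_docs]
toy_idf = sketch_fit_idf(toy_tf, min_doc_freq=2)
toy_weighted = [sketch_tfidf(tf, toy_idf) for tf in toy_tf]
```

A token that occurs in every document, such as `school` here, gets a weight of 0. That is one way an entry like `25588: 0.0` can end up in the sparse vectors printed in this notebook; a bucket below `minDocFreq` would also be 0.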

In [6]:
tf_rows = tfidf.take(2)

In [7]:
tf_rows

[SparseVector(50000, {6: 2.6047, 1024: 1.9459, 1080: 0.5236, 1175: 2.1189, 1646: 3.8642, 2331: 6.4904, 2509: 2.1047, 4241: 2.3136, 4829: 2.5913, 5288: 2.2936, 5465: 5.9445, 5735: 3.1952, 5739: 2.5267, 5829: 1.0144, 6031: 3.7566, 6679: 2.274, 6834: 2.9385, 7447: 0.998, 9488: 2.1047, 9969: 2.3136, 10131: 9.0217, 10254: 4.7115, 10541: 1.4281, 10738: 1.5906, 11291: 4.8293, 11952: 4.2697, 12395: 5.2993, 12644: 4.8857, 12776: 3.2452, 13047: 4.2697, 13525: 0.8452, 13594: 4.9628, 13600: 1.3873, 13605: 2.4779, 13749: 1.96, 15104: 2.7499, 15177: 1.8589, 16223: 4.4238, 16334: 1.6723, 16641: 2.4428, 17725: 1.96, 17740: 1.3519, 18115: 2.6355, 19329: 5.2993, 19494: 4.2697, 19712: 2.6892, 19973: 8.5394, 19988: 2.7344, 19997: 2.1814, 20439: 5.5225, 20782: 1.4534, 21059: 2.4779, 21592: 2.3979, 21943: 2.3136, 22369: 7.2258, 22707: 5.2993, 24125: 2.7499, 24142: 3.2921, 24557: 5.2993, 24866: 1.3113, 24870: 4.5109, 25076: 3.3824, 25101: 5.2642, 25149: 2.4916, 25588: 0.0, 25962: 4.2007, 26406: 1.6841, 26967

## word2vec

For more details on the theory, see https://github.com/Jay-Oh-eN/awesome-resources/blob/master/nlp.md#word2vec

In [8]:
from pyspark.mllib.feature import Word2Vec

In [9]:
word2vec = Word2Vec()
model = word2vec.fit(tokenized_rdd)

In [11]:
model.save(sc, 'file:///Users/jonathandinu/spark-ds-applications/word2vec_train.model')

In [12]:
import numpy as np

word_vecs = model.getVectors()

In a production setting you would not be able to store or process all of the vector data locally, so the `doc2vec` calculation would have to run as Spark transformations. PySpark's `word2vec` is currently limited, though: the resulting word vectors can only be used locally.
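
One possible shape for a distributed version is a per-document function that closes over a plain dictionary of word vectors and can be handed to a `map`. The sketch below is pure Python so it runs without a cluster: `make_doc_vectorizer`, the toy vectors, and the dictionary-based weights are all hypothetical names and simplifications, not part of this notebook or of PySpark. With Spark, the lookup table would presumably be shipped to the executors (for example with `sc.broadcast`) and the function applied with an RDD `map`; that part is an assumption and is not shown.

```python
def make_doc_vectorizer(vectors, dim):
    """Return a function mapping (tokens, weights) to a weighted mean vector.

    `vectors` is a plain dict {word: list of floats}; `weights` is a dict
    {word: tf-idf weight}. Both are simplifications for illustration.
    """
    def vectorize(doc):
        tokens, weights = doc
        total = [0.0] * dim
        matched = 0
        for token in tokens:
            vec = vectors.get(token)
            if vec is None:
                # skip tokens without a word vector
                continue
            weight = weights.get(token, 0.0)
            total = [t + weight * v for t, v in zip(total, vec)]
            matched += 1
        # divide by the number of matched tokens; keep zeros if none matched
        return [t / matched for t in total] if matched else total
    return vectorize

# hypothetical toy data standing in for the word vectors and tf-idf scores
toy_word_vectors = {'school': [1.0, 0.0], 'math': [0.0, 1.0]}
toy_weighted_docs = [
    (['school', 'math', 'unknown'], {'school': 0.5, 'math': 2.0}),
    (['unknown'], {}),
]

vectorize = make_doc_vectorizer(toy_word_vectors, dim=2)
# the built-in map stands in for an RDD map here
toy_doc_vectors = list(map(vectorize, toy_weighted_docs))
```

Because the function only depends on its arguments and the captured dictionary, it has no reference to the driver-side model object, which is the property a distributed version would need.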

In [13]:
word_v = word_vecs['school']

In [14]:
# word vectors are Py4J Java objects
type(word_v)

py4j.java_collections.JavaList

In [15]:
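def doc2vec(document_tup):
    '''
    sum the tf-idf weighted word vectors of a (tokens, tfidf_vector) tuple
    and divide by the number of tokens that have a word vector
    '''
    doc_vec = np.zeros(100)  # 100 is the default vectorSize of MLlib's Word2Vec
    tot_words = 0
    
    for word in document_tup[0]:
        try:
            weight = document_tup[1][hashingTF.indexOf(word)]
            vec = np.array([ v for v in word_vecs[word] ])
            tot_words += 1
        except Exception:
            # skip words without a learned vector
            continue
            
        doc_vec += weight * vec
    
    # avoid dividing by zero when no token has a word vector
    if tot_words == 0:
        return doc_vec
        
    return doc_vec / float(tot_words)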

In [16]:
ex = tokenized_rdd.zip(tfidf).take(1)

In [17]:
doc2vec(ex[0])

array([-0.12463389, -0.06361799,  0.02757775,  0.02979903,  0.09447952,
       -0.00094989, -0.05210614,  0.04388065,  0.00771801, -0.00877458,
       -0.00588867, -0.06390029,  0.02143864,  0.0763591 , -0.00178546,
       -0.07834113, -0.05603771, -0.03714454,  0.02887979,  0.01351358,
       -0.01813273,  0.01339551,  0.03628958, -0.05751648,  0.06928916,
        0.10708321, -0.02594397, -0.00249983, -0.03420761,  0.03307056,
        0.0542931 , -0.00116116,  0.01661484,  0.03526229, -0.03708085,
       -0.026805  ,  0.00984268,  0.06029845,  0.02275407,  0.04823143,
        0.03965133, -0.04779522, -0.08765039,  0.0632452 ,  0.02556925,
       -0.0324754 ,  0.01029455,  0.0363825 , -0.01970801, -0.074114  ,
        0.0570937 , -0.06795619, -0.02039386, -0.00944272,  0.04804359,
       -0.01009867,  0.10815541,  0.09950729, -0.06671419, -0.01148975,
        0.04038947, -0.07650435, -0.20667949,  0.06650247,  0.00640644,
       -0.0954082 ,  0.01003449, -0.11984633, -0.10211463,  0.10

In [18]:
# combine tf-idf scores with original document and collect() locally
document_vectors = tokenized_rdd.zip(tfidf).collect()

In [19]:
d2v = [ doc2vec(doc) for doc in document_vectors ]

In [20]:
from scipy.spatial import distance

def query(q, docs):
    '''
    use scipy cosine distance to find the 3 most similar essay descriptions
    '''
    tokens = tokenize(q)
    tf_q = idf.transform(hashingTF.transform(tokens))
    q_vec = doc2vec((tokens, tf_q))
    # cdist returns cosine distance (1 - similarity), so smallest is most similar
    distances = distance.cdist(docs, np.array([q_vec]), 'cosine')
    return np.argsort(distances[:, 0])[:3]
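
The ranking step can be illustrated without Spark or SciPy. The sketch below is a pure-Python stand-in that assumes the same definition of cosine distance as the `'cosine'` metric (1 minus the cosine of the angle between two vectors); `cosine_distance`, `top_k`, and the toy vectors are hypothetical and only meant to show why sorting distances in ascending order puts the closest documents first.

```python
import math

def cosine_distance(a, b):
    """1 - cos(angle between a and b); 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k documents with the smallest cosine distance to the query."""
    distances = [cosine_distance(doc, query_vec) for doc in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: distances[i])[:k]

# hypothetical 2-dimensional document vectors
toy_doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
nearest = top_k([2.0, 0.1], toy_doc_vecs, k=3)
```

Cosine distance ignores vector length, so the query `[2.0, 0.1]` is closest to `[1.0, 0.0]` even though their magnitudes differ. A vector pointing the opposite way ends up last.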

In [21]:
query('field trip to aquarium', d2v)

array([772, 576, 258])

In [22]:
essay_rdd.zipWithIndex().take(1)

[(u'{"essay":"\\"I am currently a Special Education Math teacher in a high needs middle school. My students are eager to learn but lack proper classroom resources. I am requesting necessary classroom basics in order to create a supply center. \\r\\\\n\\r\\\\nMy students are very motivated and try their best despite the academic disabilities. In addition to learning the language, my students come from low income homes. All of the students in my school are eligible for free or reduced lunch due to the household incomes. \\r\\\\n\\r\\\\nMy students are classified as emotionally disturbed, academically delayed or learning disabled. They need all the help they can get to get them to the same level as their middle school peers. Most don\'t have the supplies due to the financial strain buying school supplies puts on the family. Having a large supply of pens, pencils, glue sticks, stapler, and other classroom basics, would be a blessing each day. Most of these materials are needed for everyday

In [23]:
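def find_projects(indices, num):
    # indices are positions in tokenized_rdd; they match essay_rdd only if
    # the empty-essay filter dropped no rows
    q = indices[:num]
    return essay_rdd.zipWithIndex().filter(lambda x: x[1] in q).collect()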
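
A `filter` returns the matching rows in their original order, not in order of similarity. If the ranking matters, the hits can be re-sorted after collecting them. The helper below is a hypothetical sketch (`order_by_rank` is not part of this notebook) that works on a plain list of `(row, index)` pairs, the same shape `zipWithIndex()` produces.

```python
def order_by_rank(ranked_indices, indexed_rows):
    """Keep only rows whose index is in ranked_indices, best match first."""
    # position of each index in the ranking; int() normalizes numpy integers
    rank = {int(idx): pos for pos, idx in enumerate(ranked_indices)}
    hits = [(row, idx) for row, idx in indexed_rows if idx in rank]
    return sorted(hits, key=lambda pair: rank[pair[1]])

# placeholder rows standing in for the collected (essay, index) pairs
toy_rows = [('essay %d' % i, i) for i in range(1000)]
ordered = order_by_rank([772, 576, 258], toy_rows)
```

Here `ordered` lists the row with index 772 first, followed by 576 and 258, mirroring the order returned by the earlier `query('field trip to aquarium', d2v)` call.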

In [24]:
find_projects(query('computers', d2v), 2)

[(u'{"essay":"\\"My students are in second grade in a low-income school where 95% of the students are English Language Learners. This community is a high-need community with lots of homelessness. The students in our school all qualify for reduced lunch, and face great adversity. \\r\\\\n\\r\\\\nThe children in my classroom come from homes where their parents speak little to no English. Most of the homes have no computer or Internet accessibility so many of the children come to school eager to use the Smartboard in our classroom to play learning games and to access the classroom website which has information on projects which we are doing in class. But since our laptop has broke the children are not able to work with the Smartboard. \\r\\\\n\\r\\\\nIf my students were granted a classroom laptop I could pay the monthly bill to integrate wireless Internet on it. Our school has many problems with wiring and therefore half of the school has no Internet, but we all have Smartboards. I have a

In [25]:
find_projects(query('computers math at risk neighborhood spanish', d2v), 2)

[(u'{"essay":"\\"The school that I teach at is in rural Eastern Oregon.  I am a 7-12 grade science teacher, in which I teach 7\\/8 computer science, 7\\/8 science, 9\\/10 biology, 11\\/12 physics, 11\\/12 personal finance,9-12 anatomy, and independent study (chemistry and physical science will be offered next year).  \\r\\\\n\\r\\\\nThis school is a high poverty school in which 15.9% of the county is unemployed.  Most of the students in the school are on free or reduced meals.  Many of these students will be the first in their family to attend college, let alone graduate high school. \\r\\\\n\\r\\\\nThe State of Oregon (Department of Education) has taken a 1.2% budget reduction clear from the top (higher education) to the bottom (K-12 classrooms).  This means that schools have had to make significant reductions in their budgets for the rest of the school year (frozen spending).  \\r\\\\n\\r\\\\nWe currently have the gel electrophoresis apparatus, but do not have the funding to buy the 

In [26]:
tokenized_rdd.flatMap(lambda x: x).countByValue()

defaultdict(<type 'int'>, {u'aided': 1, u'Andreas': 1, u'colorguards': 1, u'Poetry': 1, u'activist/artists': 1, u'Heights': 1, u'four': 48, u'woods': 2, u'hanging': 2, u'Until': 3, u'marching': 4, u'Foundation': 1, u'granting': 2, u'eligible': 13, u'electricity': 16, u'party\u201d': 1, u'Journey': 1, u'Western': 2, u'immature': 1, u'sinking': 1, u'Headsprout': 2, u'bowls.': 1, u'oceans': 2, u'capoeira': 3, u'yellow': 1, u'uncertain': 1, u'bringing': 15, u'writing.': 2, u'differentiated': 15, u'basics': 12, u'specify': 1, u'grueling': 1, u'Less': 2, u'wooden': 2, u'selections': 1, u'Feel': 1, u'Does': 4, u'succession': 3, u'Paul': 2, u'straight': 6, u'Sandy': 3, u'HAPPENED': 1, u'charter': 38, u'specially': 1, u'tired': 2, u'consists': 24, u'second': 149, u'attended': 6, u'scraped': 1, u'prosody': 1, u'Townsend': 2, u'2012-2013': 1, u'errors': 2, u'Initially': 2, u'so.I': 1, u'cooking': 12, u'fingers': 11, u'Hamilton': 2, u'fossil': 2, u'designing': 7, u'resilient': 7, u'replaced': 8, u

In [27]:
find_projects(query('powerful calculator', d2v), 2)

[(u'{"essay":"\\"Do you remember not being able to see what your teacher was doing because the board was so far away or way off to the side?  You probably wish the board could have been moved.  This stand will allow the \\"\\"board\\"\\" to be moved closer to the students and will be put directly in front of them. \\r\\\\n\\r\\\\nMy students are eager, bright and tremendous young learners.  We learn in a small rural school surrounded by woods and corn fields.  Just last week, we watched the corn chopper just outside our window go up and down the rows harvesting the corn.  Approximately 50% of our population receives free or reduced lunch.  Teachers and staff work hard to support all the needs of each student.  We work together to overcome challenges, which can sometimes be great. \\r\\\\n\\r\\\\nI am requesting 1 chart stand that has a pocket chart, white board and chart paper to use for phonics and writing instruction. A Deluxe Chart Stand will allow the \\"\\"board\\"\\" and teaching

In [28]:
find_projects(query('paper computer', d2v), 3)

[(u'{"essay":"\\"   I am requesting a class set of white boards in order to make my instruction more effective.  My students will use this class set of 32 white boards, markers, and erasers to perform guided learning work in all subjects.  This will allow me to check for all my students\' understanding of the lesson\'s objective with a quick glance around the classroom, rather than calling on a few students and hoping the whole class comprehends the lesson.       \\r\\\\n\\r\\\\n   Today, while visiting a KIPP school in Newark, NJ, I witnessed a teacher using these white boards in his 7th grade classroom for word work.  I anticipate using these white boards for everything from word work, practicing editing marks, answering multiple choice questions on the overhead for any subject, and doing sample math problems to test my students\' understanding of the mini-lesson.  \\r\\\\n   I work at a under resourced school in the Bronx where no teachers currently use white boards in their instruc