In [4]:
import graphlab as gl
import numpy as np

In [5]:
msf = gl.load_sframe('../data/kindle_data.sf/')
meta = gl.load_sframe('../data/meta_data.sf/')
wc = gl.load_sframe('../data/meta_data.sf/')

[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1493383788.log




## Section I: Item Similarity Recommender
#### We use a pre-built model that uses user reviews of items with ratings to calculate Jaccard similarity scores

In [6]:
def isValidImg(url):
    # Only URLs ending in 'jpg' are treated as loadable product images.
    return url[-3:] == 'jpg'
    
def getAsinRows(asin):
    sf = meta.filter_by(asin, 'asin', exclude=False)
    sf['prodImg'] = sf['imUrl'].apply(lambda x: gl.Image(x) if isValidImg(x) else gl.Image(path=None))
    return(sf)

def getPredefinedRelated(asin):
    # Look up the row(s) for this asin in the `related` SFrame built in the next cell.
    rel = related.filter_by(asin, 'asin', exclude=False)
    search = gl.SArray(rel['bought'].unique())
    sf = getAsinRows(search)
    return(sf)

In [7]:
predefined_items = msf['related'].unpack(column_name_prefix='rel')
related = gl.SFrame()
related['asin'] = msf['asin']
related['viewed'] = predefined_items['rel.also_viewed']
related['bought'] = predefined_items['rel.buy_after_viewing']

This model first computes the similarity between items using the observations of users who have interacted with both items. Given a similarity between item $i$ and $j$, $\mbox{S}(i,j)$, it scores an item $j$ for user $u$ using a weighted average of the user's previous observations $I_u$.

Jaccard similarity measures the similarity between two sets of elements. In the context of recommendation, the Jaccard similarity between two items is computed as

$$
\mbox{JS}(i,j) = \frac{\mid U_i \cap U_j \mid}{\mid U_i \cup U_j \mid}
$$

where $U_i$ is the set of users who rated item $i$. Jaccard is a good choice when one only has implicit feedback on items (e.g., whether people rated them or not), or when one does not care about how many stars items received.
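
As a rough illustration of the definition above, the following plain-Python sketch computes the Jaccard similarity between two sets of user IDs. The function name, the toy reviewer IDs, and the convention of returning 0.0 when nobody rated either item are assumptions made for this example; they are not taken from the GraphLab implementation.

```python
def jaccard_similarity(users_i, users_j):
    """Jaccard similarity between two collections of user IDs."""
    users_i, users_j = set(users_i), set(users_j)
    union = users_i | users_j
    if not union:
        # Assumed convention: two items that nobody rated have similarity 0.
        return 0.0
    return len(users_i & users_j) / float(len(union))


# Hypothetical reviewers of two items
raters_item_a = {'u1', 'u2', 'u3'}
raters_item_b = {'u2', 'u3', 'u4', 'u5'}

# Two shared reviewers out of five distinct reviewers overall
js_ab = jaccard_similarity(raters_item_a, raters_item_b)
```

Because only set membership matters, the star ratings themselves never enter the computation.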

In [8]:
sm = gl.recommender.item_similarity_recommender.create(msf,
                                                user_id='reviewerID',
                                                item_id='asin',
                                                target='overall',
                                                only_top_k=50,
                                                threshold=0.01)

Let's select a book from our product database to test whether the model does a good job of predicting similarity from user interactions.

In [9]:
sample_item = meta['asin'][52387]
sample = getAsinRows(sample_item)

In [10]:
gl.canvas.set_target('ipynb')
sample['prodImg'].show()

Let's look at the description of Conrad's book "The Mirror of the Sea".

In [11]:
sample['description']

dtype: str
Rows: 1
['Conrad is regarded as one of the great novelists in English, though he did not speak the language fluently until he was in his twenties (and then always with a marked Polish accent). He wrote stories and novels, predominantly with a nautical setting, that depict trials of the human spirit by the demands of duty and honour. Conrad was a master prose stylist who brought a distinctly non-English tragic sensibility into English literature. While some of his works have a strain of romanticism, he is viewed as a precursor of modernist literature. His narrative style and anti-heroic characters have influenced many authors.']

Given these item similarities, the predicted score of item $j$ for user $u$ is the similarity-weighted average of the user's ratings $r_{ui}$:

$$ y_{uj} = \frac{\sum_{i \in I_u} \mbox{S}(i,j) r_{ui}}{\sum_{i \in I_u} \mbox{S}(i,j)} $$
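
The weighted average in the formula above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name, the dictionary inputs, and the choice to return `None` when no rated item is similar to the target are all hypothetical and not part of the GraphLab API.

```python
def predict_score(user_ratings, similarity_to_j):
    """Similarity-weighted average of a user's ratings.

    user_ratings    : dict mapping each item i the user rated to r_ui
    similarity_to_j : dict mapping item i to its similarity with item j
    """
    numerator = 0.0
    denominator = 0.0
    for item, rating in user_ratings.items():
        sim = similarity_to_j.get(item, 0.0)  # unseen pairs count as 0
        numerator += sim * rating
        denominator += sim
    if denominator == 0.0:
        # Assumed convention: no similar rated item means no prediction.
        return None
    return numerator / denominator


# Hypothetical ratings and similarities to the target item
ratings = {'a': 5.0, 'b': 3.0, 'c': 4.0}
sims = {'a': 0.5, 'b': 0.25}

# (0.5 * 5.0 + 0.25 * 3.0) / (0.5 + 0.25)
score = predict_score(ratings, sims)
```

Item `'c'` has no recorded similarity to the target, so it contributes nothing to either sum.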

In [12]:
similar_items = sm.get_similar_items(sample['asin'])

The result contains the $k$ most similar items, each with a similarity score and a rank, as shown below:

In [13]:
similar_items.head()

asin,similar,score,rank
B004TPFMI2,B005HKHOG6,0.0769230723381,1
B004TPFMI2,B0075EV2BK,0.0769230723381,2
B004TPFMI2,B0082T27MU,0.0769230723381,3
B004TPFMI2,B0074CUBI8,0.0769230723381,4
B004TPFMI2,B00C1PCKW8,0.0769230723381,5
B004TPFMI2,B006OA0WZ8,0.0769230723381,6
B004TPFMI2,B00INHMNLG,0.0769230723381,7
B004TPFMI2,B0076795JM,0.0769230723381,8
B004TPFMI2,B008QO92U6,0.0769230723381,9
B004TPFMI2,B004UK001K,0.0769230723381,10


We can then use the product `asin` identifiers to look up the similar books and display their cover images.

In [14]:
similar_items['similar']

dtype: str
Rows: 10
['B005HKHOG6', 'B0075EV2BK', 'B0082T27MU', 'B0074CUBI8', 'B00C1PCKW8', 'B006OA0WZ8', 'B00INHMNLG', 'B0076795JM', 'B008QO92U6', 'B004UK001K']

In [15]:
related_items = getAsinRows(similar_items['similar'])

In [16]:
related_items['prodImg'].show()

A quick glance at the recommendations reveals several books by Ted Hughes. Like Conrad, Hughes is a major figure in English literature, although he is known primarily as a poet rather than a novelist. His Wikipedia biography includes this description of his work:

>Hughes's earlier poetic work is rooted in nature and, in particular, the innocent savagery of animals, an interest from an early age. He wrote frequently of the mixture of beauty and violence in the natural world.[57] Animals serve as a metaphor for his view on life: animals live out a struggle for the survival of the fittest in the same way that humans strive for ascendancy and success. Examples can be seen in the poems "Hawk Roosting" and "Jaguar".[57]

>The West Riding dialect of Hughes's childhood remained a staple of his poetry, his lexicon lending a texture that is concrete, terse, emphatic, economical yet powerful. The manner of speech renders the hard facts of things and wards off self-indulgence.[14]

>Hughes's later work is deeply reliant upon myth and the British bardic tradition, heavily inflected with a modernist, Jungian and ecological viewpoint.[57] He re-worked classical and archetypal myth working with a conception of the dark sub-conscious.[57]

*Source: [https://en.wikipedia.org/wiki/Ted_Hughes](https://en.wikipedia.org/wiki/Ted_Hughes)*

## Section II: [ZiWei]

Get an SArray of the concatenated text in the `summary`, `reviewText`, and `description` fields.

In [21]:
docs = msf.apply(lambda x: str(x['summary']) + ' ' + str(x['reviewText']) + ' ' + str(x['description']))

Create a function to count words from a `docs` SArray that outputs a `docs_sf` SFrame with associated word counts

In [26]:
def get_word_frequency(docs):
    """
    Returns the frequency of occurrence of words in an SArray of documents
    Args:
    docs: An SArray (of dtype str) of documents
    Returns:
    An SFrame with the following columns:
     'word'      : Word used
     'count'     : Number of times the word occurred in all documents.
     'frequency' : Relative frequency of the word in the set of input documents.
    """

    # Use the count_words function to count the number of words.
    docs_sf = gl.SFrame()
    docs_sf['words'] = gl.text_analytics.count_words(docs)

    # Stack the dictionary into individual word-count pairs.
    docs_sf = docs_sf.stack('words', 
                         new_column_name=['word', 'count'])

    # Sum the counts for each unique word across all documents
    docs_sf = docs_sf.groupby('word', {'count': gl.aggregate.SUM('count')})
    docs_sf['frequency'] = docs_sf['count'] / docs_sf["count"].sum()
    return docs_sf
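
To make the output of this function concrete, here is a minimal plain-Python sketch of the same count-then-normalize idea using `collections.Counter`. It is an analogue, not the GraphLab code path: lowercasing and splitting on whitespace are simplifying assumptions, and the function name and toy documents are invented for the example.

```python
from collections import Counter


def word_frequency(documents):
    """Return {word: (count, relative_frequency)} over all documents."""
    counts = Counter()
    for doc in documents:
        # Assumed tokenization: lowercase, split on whitespace
        counts.update(doc.lower().split())
    total = float(sum(counts.values()))
    return {word: (count, count / total) for word, count in counts.items()}


# Hypothetical mini-corpus
toy_docs = ['great book', 'Great story great pace']
freq = word_frequency(toy_docs)
```

Here `'great'` appears three times among six tokens, so its relative frequency is one half.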

In [None]:
docs_sf = get_word_frequency(docs)

In [None]:
def predict(document_bow, word_topic_counts, topic_counts, vocab,
            alpha=0.1, beta=0.01, num_burnin=5):
    """
    Make predictions for a single document.
    Parameters
    ----------
    document_bow : dict
        Dictionary with words as keys and their counts in the document as values.
    word_topic_counts : numpy array, num_vocab x num_topics
        Number of times a given word has ever been assigned to a topic.
    topic_counts : numpy vector of length num_topics
        Number of times any word has been assigned to a topic.
    vocab : dict
        Words are keys and unique integer is the value.
    alpha : float
        Hyperparameter. See topic_model docs.
    beta : float
        Hyperparameter. See topic_model docs.
    num_burnin : int
        Number of iterations of Gibbs sampling to perform at predict time.
    Returns
    -------
    out : numpy array of length num_topics
        Probabilities that the document belongs to each topic.
    """
    num_vocab, num_topics = word_topic_counts.shape

    # proportion of each topic in this test doc
    doc_topic_counts = np.zeros(num_topics)

    # Keep only the words seen during training, so that positions in
    # doc_topic_assignments line up with the words we iterate over below.
    # NB: we are assuming document_bow doesn't change.
    known_words = [(w, f) for w, f in document_bow.items() if w in vocab]

    # Initialize assignments and counts
    doc_topic_assignments = []
    for word, freq in known_words:
        # randint excludes the upper bound, so this covers every topic
        topic = np.random.randint(0, num_topics)
        doc_topic_assignments.append(topic)
        doc_topic_counts[topic] += freq

    # Sample topic assignments for the test document
    for burnin in range(num_burnin):
        for i, (word, freq) in enumerate(known_words):
            word_id = vocab[word]

            # Get old topic and decrement counts
            topic = doc_topic_assignments[i]
            doc_topic_counts[topic] -= freq

            # Sample a new topic
            gamma = np.zeros(num_topics)  # store probabilities
            for k in range(num_topics):
                gamma[k] = (doc_topic_counts[k] + alpha) * (word_topic_counts[word_id, k] + beta) / (topic_counts[k] + num_vocab * beta)
            gamma = gamma / gamma.sum()  # normalize to probabilities
            topic = np.random.choice(num_topics, p=gamma)  # scalar topic index

            # Use new topic to increment counts
            doc_topic_assignments[i] = topic
            doc_topic_counts[topic] += freq

    # Create predictions
    predictions = np.zeros(num_topics)
    total_doc_topic_counts = doc_topic_counts.sum()
    for k in range(num_topics):
        predictions[k] = (doc_topic_counts[k] + alpha) / (total_doc_topic_counts + num_topics * alpha)
    return predictions / predictions.sum()


if __name__ == '__main__':
    docs = gl.SFrame({'text': [{'first': 5, 'doc': 1}, {'second': 3, 'doc': 5}]})
    m = gl.topic_model.create(docs)

    # Get test document in bag of words format
    document_bow = docs['text'][0]

    # Input: Global parameters from trained model

    # Number of times each word in the vocabulary has ever been assigned to
    # topic k (in any document). You can approximate this by multiplying
    # m['topics'] by a large number (e.g. the number of tokens in the corpus)
    # that indicates how strongly you "believe" in these topics, then
    # flooring the result to integer counts.
    prior_strength = 1000000
    word_topic_counts = np.array(m['topics']['topic_probabilities'])
    word_topic_counts = np.floor(prior_strength * word_topic_counts)

    # Number of times any word has been assigned to each topic.
    topic_counts = word_topic_counts.sum(0)

    # Get vocabulary lookup
    num_topics = m['num_topics']
    vocab = {}
    for i, w in enumerate(m['topics']['vocabulary']):
        vocab[w] = i
    num_vocab = len(vocab)

    # Make prediction on test document
    probs = predict(document_bow, word_topic_counts, topic_counts, vocab)
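
The step that turns topic-word probabilities into approximate counts can be tried on its own with a small NumPy array. The sketch below uses a hypothetical 3-word, 2-topic probability matrix and an assumed `prior_strength` of 1000; neither comes from a trained model.

```python
import numpy as np

# Hypothetical topic-word probabilities: 3 words x 2 topics.
# Each column sums to 1, as a per-topic word distribution would.
topic_probabilities = np.array([[0.5, 0.125],
                                [0.25, 0.125],
                                [0.25, 0.75]])

# Assumed value: larger means stronger belief in the trained topics.
prior_strength = 1000

# Scale probabilities up and floor them to get pseudo-counts per word/topic.
pseudo_counts = np.floor(prior_strength * topic_probabilities)

# Total pseudo-count per topic, summing over the vocabulary axis.
per_topic_totals = pseudo_counts.sum(axis=0)
```

With a larger `prior_strength`, the few counts contributed by a single test document shift the word-topic term of the sampler less.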

In [16]:
print msf[1]['summary']
print msf[1]['reviewText']
print msf[1]['description']

Okay for true beginners
So, I bought this book a few days ago and have tried three recipes so far.  The first was a total flop.  There must be an error, but be forewarned, do NOT make the Blueberry Coffee Cake as it comes out as inedible mush--WAY too much water.  The other two recipes (mac and cheese and grilled cheese with tomato) were decent for quick lunches or dinners.  They were average in taste, but considering the short amount of time it took to make them, I'm okay with that.  All in all, it's a nice idea book to get creative with everyday ingredients, but with errors and only average taste, I give it three stars.
In less time and for less money than it takes to order pizza, you can make it yourself!Three harried but heatlh-conscious college students compiled and tested this collection of more than 200 tasty, hearty, inexpensive recipes anyone can cook -- yes, anyone!Whether you're short on cash, fearful of fat, counting your calories, or just miss home cooking, The Healthy Col

In [8]:
wc = gl.text_analytics.count_words(docs, to_lower=True)

In [13]:
trimmer = gl.toolkits.feature_engineering.RareWordTrimmer(threshold=2)

# Fit and transform the data. fit_transform expects an SFrame, so the SArray
# of word counts is wrapped in one; passing wc directly raises
# "ToolkitError: Input data is not an SFrame."
transformed_sf = trimmer.fit_transform(gl.SFrame({'words': wc}))
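
To make the trimming step concrete, here is a plain-Python sketch of what a rare-word trimmer does to a list of bag-of-words dictionaries. It assumes, for illustration only, that a word is dropped when its total count across the corpus falls below the threshold; the function name and toy data are hypothetical and this is not GraphLab's implementation.

```python
from collections import Counter


def trim_rare_words(bags, threshold=2):
    """Drop words whose total count across all bags is below threshold."""
    totals = Counter()
    for bag in bags:
        totals.update(bag)  # adds each word's count to the running total
    return [
        {word: count for word, count in bag.items() if totals[word] >= threshold}
        for bag in bags
    ]


# Hypothetical word-count dictionaries, one per document
toy_bags = [{'great': 2, 'flop': 1}, {'great': 1, 'recipes': 1}]
trimmed = trim_rare_words(toy_bags, threshold=2)
```

Only `'great'` survives here, because it is the only word whose corpus-wide count reaches the threshold.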

In [11]:
len(wc)

3205467

In [None]:
# Use the count_words function to count the number of words.
docs_sf = gl.SFrame()
docs_sf['words'] = gl.text_analytics.count_words(docs)

# Stack the dictionary into individual word-count pairs.
docs_sf = docs_sf.stack('words', 
                     new_column_name=['word', 'count'])

# Sum the counts for each unique word across all documents
docs_sf = docs_sf.groupby('word', {'count': gl.aggregate.SUM('count')})
docs_sf['frequency'] = docs_sf['count'] / docs_sf["count"].sum()

# Run CTM with Spark

In [1]:
import findspark
findspark.init('/Users/Zoe/spark-2.1.0-bin-hadoop2.7/')
from pyspark.sql import SparkSession

from pyspark.context import SparkContext
sc = SparkContext('local')
spark = SparkSession(sc)

## Prepare data

In [2]:
review_df = spark.read.json("/Users/Zoe/Documents/Spring2017/GR5243/MyPrjs/localData/prj5/reviews_Kindle_Store.json")
#meta_df = spark.read.json("/Users/Zoe/Documents/Spring2017/GR5243/MyPrjs/localData/prj5/meta_Kindle_Store.json")

In [3]:
df = review_df.select(review_df.asin,review_df.overall,review_df.reviewerID)

In [6]:
# Drop the references to the full DataFrames; only df is needed from here on
review_df = 0
meta_df = 0

In [5]:
df.take(5)

[Row(asin=u'1603420304', overall=4.0, reviewerID=u'A2GZ9GFZV1LWB0'),
 Row(asin=u'1603420304', overall=3.0, reviewerID=u'A1K7VSUDCVAPW8'),
 Row(asin=u'1603420304', overall=4.0, reviewerID=u'A35J5XRE5ZT6H2'),
 Row(asin=u'1603420304', overall=4.0, reviewerID=u'A3DGZNFSMNWSX5'),
 Row(asin=u'1603420304', overall=5.0, reviewerID=u'A2CVDQ6H36L4VL')]

In [7]:
asins_code = df.select('asin').distinct().rdd.zipWithIndex()
users_code = df.select('reviewerID').distinct().rdd.zipWithIndex()

In [8]:
asins_df = spark.createDataFrame(asins_code.map(lambda r: (r[0][0],r[1])),['asin','item'])
users_df = spark.createDataFrame(users_code.map(lambda r: (r[0][0],r[1])),['reviewerID','user'])
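
The encoding step above replaces string IDs with contiguous integer codes. A plain-Python analogue of that idea is sketched below; it does not use Spark, and the helper name and toy reviews are assumptions made for the example.

```python
def build_index(ids):
    """Map each distinct ID to a contiguous integer code, starting at 0."""
    index = {}
    for value in ids:
        if value not in index:
            index[value] = len(index)
    return index


# Hypothetical (asin, reviewerID) pairs
toy_reviews = [('item_a', 'user_x'), ('item_a', 'user_y'), ('item_b', 'user_x')]

item_index = build_index(asin for asin, _ in toy_reviews)
user_index = build_index(reviewer for _, reviewer in toy_reviews)

# Re-express every review as a (user code, item code) pair
encoded = [(user_index[reviewer], item_index[asin]) for asin, reviewer in toy_reviews]
```

Because the codes start at 0, the largest code is one less than the number of distinct IDs, which matters when sizing a matrix that is indexed by these codes.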

In [9]:
Ratings = df.select(df.asin,df.overall,df.reviewerID).join(asins_df,"asin").join(users_df,"reviewerID")

In [10]:
Ratings = Ratings.select(Ratings.user, Ratings.item, Ratings.overall.alias('rating'))

In [19]:
row1 = Ratings.agg({"user": "max", "item":"max"}).collect()

In [20]:
row1

[Row(max(item)=430529, max(user)=1406889)]

In [12]:
ItemTopics = spark.read.load('/Users/Zoe/Documents/Spring2017/GR5243/MyPrjs/localData/prj5/predictions.csv', 
                      format='com.databricks.spark.csv', 
                      header='true', 
                      inferSchema='true')

In [13]:
ItemTopicsRDD = asins_df.join(ItemTopics,"asin").drop("asin").rdd.map(lambda r: (r[0],[r[i] for i in range(1,51)]))

In [14]:
ItemTopics = spark.createDataFrame(ItemTopicsRDD,['item','topic'])

In [23]:
Full = Ratings.join(ItemTopics, "item")

In [24]:
subFull = Full.limit(20)

In [None]:
subFull.collect()

## Train CTM on Data

In [15]:
from pyspark.sql.functions import collect_list, first
from time import time
import numpy as np
from numpy.random import rand
from numpy import matrix

In [22]:
def CTM_train(Full,I,J,K,LAMBDA,max_iter=10,n_partition=6):
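    '''
    Fit the CTM factor matrices by alternating least squares.

    Full        : DataFrame with columns user, item, rating and topic
    I, J        : number of users and items (rows of U and V)
    K           : number of latent factors; must match the length of each topic vector
    LAMBDA      : regularization strength
    max_iter    : number of alternating update rounds
    n_partition : number of Spark partitions to use
    Returns the factor matrices (U, V).
    '''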
    
    # define update functions
    def updateU(i,v_ind,R,V,LAMBDA):
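        '''
        Regularized least-squares update for one user's factor vector.
        i: user code (unused); v_ind: codes of the items the user rated;
        R: the corresponding ratings; V: current item factor matrix.
        '''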
        r = v_ind.shape[0]
        K = V.shape[1]
    
        A = V[v_ind,:].T.dot(V[v_ind,:]) + LAMBDA*r*np.eye(K)
        b = V[v_ind,:].T.dot(R).T
        
        return (np.linalg.solve(A, b)).T
    
    def updateV(j,u_ind,R,U,LAMBDA,Th):
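        '''
        Regularized least-squares update for one item's factor vector.
        j: item code (unused); u_ind: codes of the users who rated the item;
        R: the corresponding ratings; U: current user factor matrix;
        Th: the item's topic vector, toward which the factor is shrunk.
        '''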
        r = u_ind.shape[0]
        K = U.shape[1]
    
        A = U[u_ind,:].T.dot(U[u_ind,:]) + LAMBDA*r*np.eye(K)
        b = U[u_ind,:].T.dot(R).T + LAMBDA*r*Th.reshape([K,1])
    
        return (np.linalg.solve(A, b)).T
    
    print('pre-compute block information...')
    Full = Full.repartition(n_partition)
    U_map = Full.groupBy("user").agg(collect_list("item").alias('items'),collect_list("rating").alias('ratings')).sort('user')
    V_map = Full.groupBy("item").agg(collect_list("user").alias('users'),collect_list("rating").alias('ratings'), first('topic').alias('topic')).sort('item')
    U_map = U_map.repartition(n_partition)
    V_map = V_map.repartition(n_partition)
    
    print('initialize parameters...')
    U = matrix(rand(I,K))
    V = matrix(rand(J,K))
    
    Us = sc.broadcast(U)
    Vs = sc.broadcast(V)
    
    print('update parameters...')
    for i in range(max_iter):
        
        
        st = time()
        U = U_map.rdd.map(lambda r: updateU(r[0],np.array(r[1]),np.array(r[2]),Vs.value,LAMBDA)).reduce(lambda a,b: np.vstack((a,b)))
        Us = sc.broadcast(U)
        
        
        V = V_map.rdd.map(lambda r: updateV(r[0],np.array(r[1]),np.array(r[2]),Us.value,LAMBDA,np.array(r[3]))).reduce(lambda a,b: np.vstack((a,b)))
        Vs = sc.broadcast(V)
        ed = time()
        
        
        print('Finish iteration round: '+str(i)+', use time: '+str(round(ed-st,4))+'s.\n')
    
    return (U,V)        
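
The two update rules inside `CTM_train` each solve a small regularized linear system. The NumPy sketch below reproduces them for a single user and a single item on random toy data, without Spark. The function names, the random inputs, and the value of `lam` are assumptions made for illustration; plain arrays are used here instead of `numpy.matrix`.

```python
import numpy as np


def solve_user_factor(V_rated, ratings, lam):
    """Solve (V'V + lam*r*I) u = V'R for one user's factor vector."""
    r, K = V_rated.shape
    A = V_rated.T.dot(V_rated) + lam * r * np.eye(K)
    b = V_rated.T.dot(ratings)
    return np.linalg.solve(A, b)


def solve_item_factor(U_raters, ratings, lam, theta):
    """Same system for an item, with the solution shrunk toward theta."""
    r, K = U_raters.shape
    A = U_raters.T.dot(U_raters) + lam * r * np.eye(K)
    b = U_raters.T.dot(ratings) + lam * r * theta
    return np.linalg.solve(A, b)


rng = np.random.RandomState(0)
ratings = np.array([5.0, 3.0, 4.0, 1.0])  # hypothetical ratings

V_rated = rng.rand(4, 3)                  # factors of 4 items, K = 3
u = solve_user_factor(V_rated, ratings, lam=0.02)

U_raters = rng.rand(4, 3)                 # factors of 4 users, K = 3
theta = np.array([0.2, 0.3, 0.5])         # hypothetical topic proportions
v = solve_item_factor(U_raters, ratings, lam=0.02, theta=theta)
```

As the regularization weight grows, the item factor approaches `theta`, which is how the topic proportions act as a prior on the item factors.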

In [21]:
# choose the number of partitions carefully to improve performance;
# zipWithIndex codes start at 0, so the matrix sizes are the max codes above plus one
U,V = CTM_train(Full,1406890,430530,50,LAMBDA=0.02,max_iter=10,n_partition=200)

pre-compute block information...
initialize parameters...
update parameters...


Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 64.0 failed 1 times, most recent failure: Lost task 2.0 in stage 64.0 (TID 4507, localhost, executor driver): TaskResultLost (result lost from block manager)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)


## Predict on test data

In [None]:
def CTM_predict(test,U,V):
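    '''
    Predict a rating for every (user, item) pair in test as the dot product
    of the user's row of U and the item's row of V.
    Returns an RDD of ((user, item), prediction) pairs.
    '''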
    
    preds = test.rdd.map(lambda r: ((r[0],r[1]),U[r[0],:].dot(V[r[1],:].T)[0,0]))
    return preds

In [None]:
preds = CTM_predict(Ratings,U,V)

In [None]:
def evaluate(test,preds,N):
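    '''
    Mean squared error between the ratings in test and the predictions in
    preds, where N is the number of ratings to average over.
    '''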
    
    se = test.rdd.map(lambda r: ((r[0],r[1]),r[2])).join(preds).map(lambda r: (r[1][0]-r[1][1])**2).reduce(lambda a,b: a+b)
    return se/N
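
The same error measure can be checked on a handful of values without Spark. The sketch below joins actual and predicted ratings on their `(user, item)` key and averages the squared differences; the function name, the dictionary inputs, and the choice to average over matched pairs only are assumptions for this example.

```python
def mean_squared_error(actual, predicted):
    """Mean squared error over the (user, item) keys present in both dicts."""
    shared = [key for key in actual if key in predicted]
    if not shared:
        raise ValueError('no overlapping (user, item) pairs')
    squared_error = sum((actual[key] - predicted[key]) ** 2 for key in shared)
    return squared_error / float(len(shared))


# Hypothetical ratings and predictions keyed by (user, item)
actual = {(0, 0): 4.0, (0, 1): 3.0}
predicted = {(0, 0): 3.5, (0, 1): 4.0}

# (0.5**2 + 1.0**2) / 2
mse = mean_squared_error(actual, predicted)
```

When every rating has a prediction, dividing by the number of matched pairs gives the same result as dividing by the total number of ratings.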

In [None]:
N = Ratings.count()
# preds was computed on Ratings above, so evaluate against the same DataFrame
evaluate(Ratings,preds,N)