<h1 align="center">CS340 - Assignment 3</h1>
<h3 align="center">Due Date: 30 April 2017</h3>
<br>
<p style="text-indent: 40px">In this project we will apply a clustering method (LDA) to the Yelp dataset. The dataset is included in the assignment folder. Unzip the file and upload it to IBM's data science tool.
</p>


## Topics From Reviews

"LDA is a topic model which infers topics from a collection of text documents."
For the purpose of this assignment, we can treat each Yelp review as a document and extract two topics from the data using LDA.

For this assignment you should use the ML feature transformers and an ML Pipeline. For convenience, reviewsRdd is provided for you. Convert this RDD to a DataFrame and work on that. Expect to consult the PySpark ML documentation frequently while working on the assignment.

The end result should look like this:

<div>
    <img src="http://image.prntscr.com/image/13eb8a01533346c0bb3186dfc5402b73.png" width=200>
</div>

Your output does not have to match this exactly, because the LDA model can produce different results each time you run it.

<p style="text-indent: 40px">Note: Remove stopwords with a feature transformer. The words to remove are in the stopwords list that we have provided.</p>
<p style="text-indent: 40px">Important: Do not discuss the solution with your friends. <b>Plagiarism</b> will not be tolerated, and any case will be referred to the <b>disciplinary committee</b>.</p>
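
To make the stopword note concrete, here is a minimal pure-Python sketch of the kind of filtering a stopword-removal transformer performs on a list of tokens. The function name `remove_stopwords` and the sample data are made up for illustration. This is not Spark's implementation, and in the assignment the filtering should still be done with a feature transformer. The sketch ignores case, which is an assumption about the desired behavior.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword list (case-insensitive)."""
    # A set gives constant-time membership checks.
    stop_set = set(word.lower() for word in stopwords)
    return [token for token in tokens if token.lower() not in stop_set]


# Hypothetical sample data, not taken from the Yelp dataset.
sample_stopwords = ["the", "was", "and", "very"]
sample_tokens = ["The", "pizza", "was", "very", "tasty", "and", "cheap"]

filtered = remove_stopwords(sample_tokens, sample_stopwords)
print(filtered)
```

The order of the remaining tokens is preserved, so later stages still see the words in their original sequence.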


In [1]:
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# @hidden_cell
# This function is used to setup the access of Spark to your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
# Please read the documentation of PySpark to learn more about the possibilities to load data files.
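# The following variable contains the path to your file on your Object Storage.
# Note: `name` is not defined in the visible cells; it is presumably set by the hidden credentials setup.
path_1 = "swift://CS340." + name + "/100kReviews.txt"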


In [2]:
stopwords = ["i","me","my","myself","we","our","ours","ourselves","you","your","yours","yourself","yourselves","he","him","his","himself","she","her","hers","herself","it","its","itself","they","them","their","theirs","themselves","what","which","who","whom","this","that","these","those","am","is","are","was","were","be","been","being","have","has","had","having","do","does","did","doing","a","an","the","and","but","if","or","because","as","until","while","of","at","by","for","with","about","against","between","into","through","during","before","after","above","below","to","from","up","down","in","out","on","off","over","under","again","further","then","once","here","there","when","where","why","how","all","any","both","each","few","more","most","other","some","such","no","nor","not","only","own","same","so","than","too","very","s","t","can","will","just","don","should","now","i'll","you'll","he'll","she'll","we'll","they'll","i'd","you'd","he'd","she'd","we'd","they'd","i'm","you're","he's","she's","it's","we're","they're","i've","we've","you've","they've","isn't","aren't","wasn't","weren't","haven't","hasn't","hadn't","don't","doesn't","didn't","won't","wouldn't","shan't","shouldn't","mustn't","can't","couldn't","cannot","could","here's","how's","let's","ought","that's","there's","what's","when's","where's","who's","why's","would"]
# Convert all words to lowercase and keep only words of 5 to 24 characters
# (this drops words that are too short or too long).
reviewsRdd = sc.textFile(path_1)\
            .map(lambda line: line.lower())\
            .map(lambda line: " ".join(word for word in line.strip().split() if 25 > len(word) > 4))\
            .zipWithIndex()\
            .map(lambda pair: Row(Id=pair[1], Text=pair[0]))
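
The per-line cleaning done in the RDD above can be tried out without Spark. The following is a small pure-Python sketch of the same lowercase-and-length filter (`25 > len(word) > 4`, i.e. 5 to 24 characters). The helper name `keep_mid_length_words` and the sample sentence are hypothetical.

```python
def keep_mid_length_words(line, min_len=5, max_len=24):
    """Lowercase a line and keep only words whose length is in [min_len, max_len]."""
    words = line.lower().strip().split()
    return " ".join(word for word in words if min_len <= len(word) <= max_len)


# Hypothetical review text, not taken from the dataset.
cleaned = keep_mid_length_words("The food was REALLY great here")
print(cleaned)  # really great
```

A side effect of this filter is that every stopword of four letters or fewer is already gone before the stopword transformer runs. Longer entries in the list, such as "about" or "their", still need to be removed by the transformer.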

In [3]:
reviews = sqlContext.createDataFrame(reviewsRdd)

In [4]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, StopWordsRemover
from pyspark.ml.clustering import LDA, LDAModel
from pyspark.ml.linalg import Vector, Vectors

In [5]:
# Defining the pipeline
tokenizer = Tokenizer(inputCol='Text', outputCol='Words')
removeStopWords = StopWordsRemover(inputCol='Words', outputCol='filtered_words', stopWords=stopwords)
count_vectorizer = CountVectorizer(inputCol='filtered_words', outputCol='vectors')
words_per_topic = 6
num_of_topics = 2
lda_model = LDA(featuresCol='vectors', k=num_of_topics)
pipeline = Pipeline(stages=[tokenizer, removeStopWords, count_vectorizer, lda_model])

model = pipeline.fit(reviews)
cv = model.stages[2]
all_vocab = cv.vocabulary
lda = model.stages[3]

In [6]:
topics_indices = lda.describeTopics(words_per_topic)

In [7]:
topics_indices.show()

+-----+--------------------+--------------------+
|topic|         termIndices|         termWeights|
+-----+--------------------+--------------------+
|    0|[46, 208, 241, 28...|[0.00700437883614...|
|    1|  [0, 1, 2, 3, 4, 5]|[0.00842281832305...|
+-----+--------------------+--------------------+
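
The `termIndices` column holds positions in the `CountVectorizer` vocabulary, not words. To our understanding, Spark's `CountVectorizer` orders its vocabulary by term frequency across the corpus, so index 0 would be the most frequent term, which is consistent with topic 1 listing indices 0 to 5. The pure-Python sketch below mimics that indexing on made-up documents. `build_vocabulary` and `to_count_vector` are hypothetical helper names, not Spark APIs.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return terms sorted from most to least frequent across all documents."""
    counts = Counter(term for doc in documents for term in doc)
    return [term for term, _ in counts.most_common()]


def to_count_vector(doc, vocabulary):
    """Return a dense term-count vector for one tokenized document."""
    doc_counts = Counter(doc)
    return [doc_counts[term] for term in vocabulary]


# Made-up tokenized documents, chosen so that no two terms tie in frequency.
docs = [
    ["place", "great", "place"],
    ["place", "great", "service"],
    ["place", "great"],
]

vocabulary = build_vocabulary(docs)
print(vocabulary)
print(to_count_vector(docs[0], vocabulary))
```

Each document becomes a vector with one slot per vocabulary term, which is the representation the LDA stage consumes through `featuresCol`.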



In [8]:
# Render each topic as the vocabulary terms behind its term indices
topics = topics_indices.select("termIndices").rdd\
                     .map(lambda row: [all_vocab[idx] for idx in row.termIndices]).collect()

for topic_id, terms in enumerate(topics):
    # print(...) with a single argument behaves the same in Python 2 and 3
    print("Topic" + str(topic_id))
    for term in terms:
        print(term)
    print("")

Topic0
nicht
einen
essen
waren
einem
etwas

Topic1
place
great
really
service
always
little
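
The listing above shows only the terms. If the weights are also of interest, the `termIndices` and `termWeights` columns can be paired up entry by entry. The sketch below shows the idea in plain Python on a hypothetical vocabulary and a hypothetical row with rounded weights; `label_topic` is an illustrative helper, not part of PySpark.

```python
def label_topic(term_indices, term_weights, vocabulary):
    """Pair each term index with its vocabulary word and its weight."""
    return [(vocabulary[idx], weight)
            for idx, weight in zip(term_indices, term_weights)]


# Hypothetical vocabulary and one hypothetical describeTopics-style row.
vocab = ["place", "great", "really", "service"]
row = {"termIndices": [0, 1, 3], "termWeights": [0.008, 0.006, 0.004]}

labeled = label_topic(row["termIndices"], row["termWeights"], vocab)
for term, weight in labeled:
    print("{:<10s} {:.3f}".format(term, weight))
```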



In [9]:
# model.transform(reviews).show()