# Topic Modeling of the Yelp Academic Dataset

***Yading Guo***

**2016/11/15**

This report applies topic modeling to the Yelp academic dataset. It has two main parts:
* **Topic modeling of all the reviews in the dataset**
* **Comparison of topics of reviews from different sexes**

## Topic modeling of all the reviews in the dataset

This part involves three major steps:
* Processing the raw reviews into words
* Constructing the TF-IDF vectors used for topic modeling
* Fitting the topic model

### Firstly, processing the raw reviews into words

First, I import the modules used later.
* nltk and string are used for natural language processing
* json and io are used for reading and writing files
* graphlab is used for topic modeling
* sexmachine is used for inferring gender from user names
* pyLDAvis is used for topic visualization

In [2]:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
import json
import graphlab
import io
import sexmachine.detector as gender
import pyLDAvis
import pyLDAvis.graphlab

[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1479187159.log


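Next, the raw data is loaded and some constants are defined: the hyperparameters for constructing the TF-IDF vectors (`trim_value` and `min_length`), the stop-word list and so on.

The helper functions used in the rest of the report are defined in the same cell.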

In [3]:
male_id = set()
female_id = set()
data_user = []
data_review = []    
with open('data/yelp_academic_dataset_review.json') as f:
    for line in f:
        data_review.append(json.loads(line))
with open('data/yelp_academic_dataset_user.json') as f:
    for line in f:
        data_user.append(json.loads(line))
trim_value = 2
min_length = 10
extra_words = set(["food", "it", "get", "go", "u"])
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
detector1 = gender.Detector()
pyLDAvis.enable_notebook()
def write_text_lists(texts_list, file_name):
    # Write one review per line, tokens separated by commas.
    with io.open(file_name, 'w', encoding='utf8') as f:
        for line in texts_list:
            f.write(u','.join(line) + u'\n')
def load_text_lists(file_name):
    # Read the file back into a list of token lists; the file is closed on exit.
    with io.open(file_name, 'r', encoding='utf8') as f:
        return [line.strip().split(',') for line in f]
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
def data_review_to_sf_bag_words(texts_list):
    sf_text = graphlab.SFrame({'text': texts_list})
    encoder = graphlab.feature_engineering.WordCounter()
    transformed_sf = encoder.fit_transform(sf_text)
    return transformed_sf
def data_review_to_sf_tfidf(texts_list):
    sf_text = graphlab.SFrame({'text': texts_list})
    encoder = graphlab.feature_engineering.WordCounter()
    bag_words_sf = encoder.fit_transform(sf_text)
    encoder_tfidf = graphlab.feature_engineering.TFIDF('text')
    encoder_tfidf = encoder_tfidf.fit(bag_words_sf)
    result = encoder_tfidf.transform(bag_words_sf)
    return result
def create_topic_model(text_file):
    print "loading save text"
    text_list = load_text_lists(text_file)
    print "creating bag of words vector"
    bag_of_words_vector = data_review_to_sf_bag_words(text_list)
    print "triming vector by value "+ str(trim_value)
    bag_of_words_vector = bag_of_words_vector['text'].dict_trim_by_values(trim_value)
    print "delect short line then " + str(min_length)
    ix = bag_of_words_vector.apply(lambda x: len(x.keys()) >= min_length)
    bag_of_words_vector = bag_of_words_vector[ix]
    print "remove extra words"
    bag_of_words_vector = bag_of_words_vector.dict_trim_by_keys(extra_words,exclude=True)
    print "creating tfidf vector"
    tfidf_vector = graphlab.text_analytics.tf_idf(bag_of_words_vector)
    model = graphlab.topic_model.create(tfidf_vector,
                              num_topics=10,       # number of topics
                              num_iterations=100,   # algorithm parameters
                              alpha=10, beta=0.1)  # hyperparameters
    return model,tfidf_vector
def split_male_female(data_user, data_review):
    for user in data_user:
        # Look up each name once; any result other than exactly
        # "male" or "female" is skipped.
        user_gender = detector1.get_gender(user['name'])
        if user_gender == "male":
            male_id.add(user['user_id'])
        elif user_gender == "female":
            female_id.add(user['user_id'])
    male_review = []
    female_review = []
    for review in data_review:
        if review['user_id'] in male_id:
            male_review.append(review['text'])
        elif review['user_id'] in female_id:
            female_review.append(review['text'])
        else:
            pass
    return male_review,female_review


The raw reviews are processed into lists of words in the following steps.
1. Stop words are removed.
2. Punctuation is removed.
3. Each word is lemmatized.
4. The text is tokenized into a list of words.

The function *clean* is the workhorse of the first three steps, and the tokenization is a final `split()` on its result. The processed data is then written to a file using the function *write_text_lists*.

In [4]:
text_data = [x['text'] for x in data_review]
text_list1 = [clean(doc).split() for doc in text_data]
write_text_lists(text_list1, 'clean_text.txt')
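
The order of the steps inside *clean* matters, so here is a dependency-free sketch of it. It is only an illustration under stated assumptions: `TOY_STOP` is a made-up stop-word list standing in for NLTK's English list, `clean_sketch` is a hypothetical name, and lemmatization is left out because it needs the WordNet data.

```python
import string

# Hypothetical, tiny stop-word list for illustration only;
# the report itself uses NLTK's English stop-word list.
TOY_STOP = {"the", "was", "and", "a", "it"}
PUNCT = set(string.punctuation)

def clean_sketch(doc):
    # Step 1: lowercase, split on whitespace, drop stop words.
    stop_free = " ".join(w for w in doc.lower().split() if w not in TOY_STOP)
    # Step 2: drop punctuation characters.
    punc_free = "".join(ch for ch in stop_free if ch not in PUNCT)
    # Step 3 (lemmatization) is omitted in this sketch.
    # Step 4: tokenize into a list of words.
    return punc_free.split()

print(clean_sketch("The pizza was great, and the service was fast!"))
print(clean_sketch("Loved it, really."))
```

Because stop words are matched before punctuation is stripped, a token such as `it,` passes the stop-word filter and only becomes `it` afterwards. This may be why `it` is listed again in `extra_words`.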

### Secondly, constructing the TF-IDF vector

There are two procedures in this part.
1. A bag-of-words vector is constructed for each review. Rare words, meaning words counted fewer than 2 times in a review's bag of words, are removed, and so are very short reviews, meaning reviews left with fewer than 10 unique words.
2. The extra words listed in `extra_words` are dropped and the bag-of-words vectors are converted into TF-IDF vectors.

The first part of the function *create_topic_model* performs these two steps.
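
To make the trimming and weighting concrete, here is a minimal pure-Python sketch of the same sequence of operations. It does not use GraphLab; the function names are hypothetical, and the weighting assumes the common definition of TF-IDF as term count times log(N / document frequency), which may differ in detail from what `graphlab.text_analytics.tf_idf` computes.

```python
import math
from collections import Counter

def trim_and_filter(docs, trim_value=2, min_length=10, extra_words=()):
    """Hypothetical helper mirroring the trimming steps on plain token lists."""
    kept = []
    for tokens in docs:
        counts = Counter(tokens)
        # Drop words counted fewer than trim_value times in this review.
        counts = {w: c for w, c in counts.items() if c >= trim_value}
        # Drop reviews left with fewer than min_length unique words.
        if len(counts) < min_length:
            continue
        # Drop the extra words last, as in create_topic_model.
        kept.append({w: c for w, c in counts.items() if w not in extra_words})
    return kept

def tf_idf(bags):
    """Assumed weighting: count * log(N / number of reviews containing the word)."""
    n = len(bags)
    df = Counter(w for bag in bags for w in bag)
    return [{w: c * math.log(n / float(df[w])) for w, c in bag.items()}
            for bag in bags]

# Toy reviews; min_length is lowered so the tiny example survives the filter.
docs = [
    ["taco", "taco", "salsa", "salsa", "food", "food", "once"],
    ["taco", "taco", "sushi", "sushi", "food", "food"],
]
bags = trim_and_filter(docs, trim_value=2, min_length=2, extra_words={"food"})
weights = tf_idf(bags)
print(bags)
print(weights)
```

In this toy run `taco` appears in both reviews, so its weight is zero, while `salsa` and `sushi` each appear in one review and keep a positive weight.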

### Thirdly, fitting the topic model
The second part of the function *create_topic_model* fits the topic model.
An interactive visualization of the fitted model is then constructed with pyLDAvis.
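
For reference, and assuming that `alpha` and `beta` in `graphlab.topic_model.create` play the role of the symmetric Dirichlet parameters of standard latent Dirichlet allocation (an assumption about the library, not something verified in this report), the model fitted with $K = 10$ topics can be written as:

$$
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
$$

Here $\theta_d$ is the topic mixture of review $d$, $\phi_k$ is the word distribution of topic $k$, and $z_{d,n}$ is the topic assigned to the $n$-th word $w_{d,n}$ of review $d$. Under this reading, a larger $\alpha$ (10 here) spreads each review over more topics, and a small $\beta$ (0.1 here) concentrates each topic on fewer words.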

In [5]:
all_model,all_tfidf = create_topic_model('clean_text.txt')
pyLDAvis.graphlab.prepare(all_model, all_tfidf)

loading save text
creating bag of words vector
triming vector by value 2
delect short line then 10
remove extra words
creating tfidf vector



As the visualization shows, there are several apparent topics. For example, topic 8 is about foreign cuisine, topic 3 is about businesses providing services other than food, such as salons and car service, and topic 1 is about ordinary food.

## Comparison of topics of reviews from different sexes

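I split the reviews into those written by male users and those written by female users.

This is done by the function *split_male_female*, which infers each user's gender from the user's name with sexmachine. Reviews by users whose names are not classified as exactly male or female are left out.

I then apply exactly the same procedure as in the first part of this report to each set of reviews, producing a male topic model and a female topic model, and visualize both for comparison.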
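
The splitting logic can be sketched without sexmachine by passing the name classifier in as a function. This is only an illustration: `split_by_gender`, `toy_guess` and `TOY_NAMES` are hypothetical names, and the toy lookup table stands in for `detector1.get_gender`.

```python
def split_by_gender(users, reviews, guess_gender):
    """Split review texts by the gender inferred from each user's name."""
    male_ids, female_ids = set(), set()
    for user in users:
        label = guess_gender(user["name"])  # one lookup per user
        if label == "male":
            male_ids.add(user["user_id"])
        elif label == "female":
            female_ids.add(user["user_id"])
        # Any other label is ignored, as in split_male_female.
    male_reviews = [r["text"] for r in reviews if r["user_id"] in male_ids]
    female_reviews = [r["text"] for r in reviews if r["user_id"] in female_ids]
    return male_reviews, female_reviews

# Hypothetical stand-in for the name-based detector.
TOY_NAMES = {"John": "male", "Mary": "female"}

def toy_guess(name):
    return TOY_NAMES.get(name, "unknown")

users = [
    {"user_id": "u1", "name": "John"},
    {"user_id": "u2", "name": "Mary"},
    {"user_id": "u3", "name": "Pat"},
]
reviews = [
    {"user_id": "u1", "text": "Fixed my car quickly."},
    {"user_id": "u2", "text": "Lovely dessert menu."},
    {"user_id": "u3", "text": "Decent tacos."},
]
male_reviews, female_reviews = split_by_gender(users, reviews, toy_guess)
print(male_reviews)
print(female_reviews)
```

The review by the third user is dropped from both groups because the toy classifier does not label the name, which mirrors how unclassified names are handled in the report.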

In [6]:
male_review, female_review = split_male_female(data_user, data_review)
male_text_list = [clean(doc).split() for doc in male_review]
female_text_list = [clean(doc).split() for doc in female_review]
write_text_lists(male_text_list, 'male_clean_text.txt')
write_text_lists(female_text_list, 'female_clean_text.txt')

In [7]:
male_model,male_tfidf = create_topic_model('male_clean_text.txt')
female_model,female_tfidf = create_topic_model('female_clean_text.txt')

loading save text
creating bag of words vector
triming vector by value 2
delect short line then 10
remove extra words
creating tfidf vector


loading save text
creating bag of words vector
triming vector by value 2
delect short line then 10
remove extra words
creating tfidf vector


In [8]:
pyLDAvis.graphlab.prepare(male_model, male_tfidf)

In [9]:
pyLDAvis.graphlab.prepare(female_model, female_tfidf)

As we can see, there are obvious differences between the male and female topics. In the male model, topic 3 is about cars, repairs and problems. Other male topics are about Mexican food (topic 9), gambling (topic 4) and shows (topic 6). These topics do not appear among the female topics. The female topics concentrate on drinks and desserts (topic 9), haircuts and massages (topic 8) and specific food (topic 5).
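
One way to make such a comparison less dependent on reading the plots is to measure how much the top words of two topics overlap. The sketch below uses the Jaccard index for this. The function names and the word lists are made up for illustration and are not output from the models above.

```python
def jaccard(words_a, words_b):
    """Jaccard index of two word lists: |A & B| / |A | B|."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / float(len(a | b))

def best_match(topic, other_topics):
    """Return (index, score) of the most similar topic in other_topics."""
    scores = [jaccard(topic, other) for other in other_topics]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy top-word lists (hypothetical, not taken from the fitted models).
male_topic = ["car", "repair", "problem", "service"]
female_topics = [
    ["hair", "massage", "salon", "service"],
    ["drink", "dessert", "cake", "coffee"],
]
print(best_match(male_topic, female_topics))
```

A low best-match score for a topic would suggest that it has no close counterpart in the other model.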