# NLP in Python


1. The dataset 
2. Text processing with spaCy
3. Automatic phrase modeling
4. Topic modeling with LDA
5. Visualizing topic models with pyLDAvis
6. Word vector models with Word2vec

# The Dataset

The restaurant reviews used in this notebook are available on Kaggle:

https://www.kaggle.com/residentmario/exploring-tripadvisor-uk-restaurant-reviews/data



# spaCy


spaCy is an industrial-strength natural language processing (NLP) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:

Tokenization <br/>
Text normalization, such as lowercasing and lemmatization<br/>
Part-of-speech tagging<br/>
Syntactic dependency parsing<br/>
Sentence boundary detection<br/>
Named entity recognition and annotation<br/>


In [9]:
import pandas as pd
import numpy as np
import spacy
import itertools as it

import os
import codecs


In [6]:
#default english model

nlp = spacy.load('en_core_web_sm')


In [7]:
nlp


In [8]:
#read in data

data = pd.read_csv('C:/Users/chenjf/Desktop/data_science-master/data_science-master/restaurant_reviews/restaurant_reviews.csv', encoding='utf-8')

In [None]:
data

In [None]:
#take review_text field

fields = ["review_text"]
data = pd.read_csv('C:/Users/chenjf/Desktop/data_science-master/data_science-master/restaurant_reviews/restaurant_reviews.csv', encoding='utf-8', na_values=['NA'], usecols = fields)

In [None]:
# concatenate the first nine reviews (rows 0-8) into one string

print(data['review_text'][0:9].str.cat(sep=' '))

In [None]:
#use your own path file

reviews_path = 'C:/Users/chenjf/Desktop/data_science-master/data_science-master/restaurant_reviews/sample_reviews.txt'

In [None]:
# join the first 200 reviews (rows 0-199) into one string

sample_reviews = data['review_text'][0:200].str.cat(sep=' ')

In [None]:
#print string

sample_reviews

In [None]:
# write the sample to reviews_path; the with block closes the file even on error
with open(reviews_path, "w", encoding="utf-8") as text_file:
    text_file.write(sample_reviews)

Hand these reviews to spaCy, and be prepared to wait...

In [None]:
#parse and tag

parsed_reviews = nlp(sample_reviews)

In [None]:
print(parsed_reviews)

Looks the same. What did this do?

In [None]:
# sentence detection

for num, sentence in enumerate(parsed_reviews.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('')
    


In [None]:
# entity detection
# https://spacy.io/usage/linguistic-features


for num, entity in enumerate(parsed_reviews.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print('')

In [None]:
# part of speech tagging

token_text = [token.orth_ for token in parsed_reviews]
token_pos = [token.pos_ for token in parsed_reviews]

pd.DataFrame(list(zip(token_text, token_pos)),
             columns=['token_text', 'part_of_speech'])

In [None]:
# normalization
# lemmatization, shape analysis


token_lemma = [token.lemma_ for token in parsed_reviews]
token_shape = [token.shape_ for token in parsed_reviews]

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)),
             columns=['token_text', 'token_lemma', 'token_shape'])

#### What about other token-level attributes?
* relative frequency of tokens
* whether or not a token matches any of these categories: stopword, punctuation, whitespace, represents a number, and whether or not the token is included in spaCy's default vocabulary

In [None]:
# token attributes

token_attributes = [(token.orth_,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_reviews]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))
                                               
df

# Phrase Modeling


Phrase modeling is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the words in our reviews and looking for words that co-occur (i.e., appear one after another) much more frequently than you would expect by random chance. To decide whether two tokens $A$ and $B$ constitute a phrase, gensim's default scoring compares how often the pair appears in order with how often each token appears on its own, scaled by the size of the corpus vocabulary:

$$\frac{count(A\ B) - count_{min}}{count(A) \times count(B)} \times N > threshold$$

where $count(A)$ and $count(B)$ are the number of times each token appears in the corpus, $count(A\ B)$ is the number of times the two appear in that order, $N$ is the size of the corpus vocabulary, and $count_{min}$ and $threshold$ are user-set parameters.


Once our phrase model has been trained, we can apply it to new text. When our model encounters two tokens in new text that it identifies as a phrase, it will merge the two into a single new token.
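To make the scoring idea concrete, here is a minimal pure-Python sketch. It assumes the scoring rule that gensim documents as its default, (count(A B) - min_count) * N / (count(A) * count(B)). The function name `score_bigrams` and the toy sentences are made up for illustration, and `min_count=1` is chosen only so that the tiny corpus produces non-zero scores; it is not the real `Phrases` implementation.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score each adjacent token pair (a, b) as
    (count(a b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigram_counts)
    scores = {}
    for (a, b), count_ab in bigram_counts.items():
        scores[(a, b)] = ((count_ab - min_count) * vocab_size
                          / (unigram_counts[a] * unigram_counts[b]))
    return scores

# hypothetical, already-lemmatized toy sentences
toy_sentences = [
    "we go for happy hour".split(),
    "happy hour be fun".split(),
    "the food be great".split(),
    "we love the happy hour menu".split(),
    "the service be slow".split(),
]

scores = score_bigrams(toy_sentences)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 2))  # ('happy', 'hour') 3.11
```

Pairs whose score exceeds a chosen threshold would be treated as phrases. Every pair that occurs only once scores zero here, which is the role `min_count` plays: it suppresses pairs seen too rarely to trust.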

Phrase modeling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model. But you would also expect multi-word expressions that represent common concepts without being named entities (such as happy hour) to become phrases in the model.

We turn to the indispensable gensim library to help us with phrase modeling, and to its Phrases class in particular.

In [None]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

We'll perform phrase modeling through iterative data transformation:

Segment text of complete reviews into sentences & normalize text <br/>
First-order phrase modeling $\rightarrow$ apply first-order phrase model to transform sentences<br/>
Second-order phrase modeling $\rightarrow$ apply second-order phrase model to transform sentences<br/>
Apply text normalization and second-order phrase model to text of complete reviews<br/>

We'll use this transformed data as the input for some higher-level modeling approaches in the following sections. The code in this notebook trains only the first-order (bigram) model; a second-order model would be trained the same way on the bigram-transformed sentences.
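The reason for applying phrase models in two passes can be seen in a small sketch: once a first pass has merged a pair into one token, a second pass can merge that token with its neighbour, giving a three-word phrase. The helper `merge_phrases` and the hand-picked phrase sets below are hypothetical stand-ins for trained models, not gensim code.

```python
def merge_phrases(tokens, phrases, delimiter="_"):
    """Join adjacent token pairs that appear in `phrases` into single tokens."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append(delimiter.join(pair))
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# hand-picked phrase sets standing in for trained first- and second-order models
first_order = {("happy", "hour")}
second_order = {("happy_hour", "special")}

tokens = "the happy hour special be cheap".split()
bigrams = merge_phrases(tokens, first_order)
trigrams = merge_phrases(bigrams, second_order)

print(bigrams)   # ['the', 'happy_hour', 'special', 'be', 'cheap']
print(trigrams)  # ['the', 'happy_hour_special', 'be', 'cheap']
```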

First, let's define a few helper functions that we'll use for text normalization. In particular, the lemmatized_sentence_corpus generator function will use spaCy to:

Iterate over the reviews <br/>
Segment the reviews into individual sentences<br/>
Remove punctuation and excess whitespace<br/>
Lemmatize the text<br/>
(and do so efficiently in parallel when data is huge, thanks to spaCy's nlp.pipe() function)

In [None]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_review(filename):
    """
    generator function to read in reviews from the file,
    one per line, replacing each line break with a space
    """
    
    # the review files were written as UTF-8, so read them back the same way
    with open(filename, encoding='utf-8') as f:
        for review in f:
            yield review.replace('\n', ' ')
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    # n_threads is omitted: it has no effect in recent spaCy 2.x releases
    # and is not accepted by spaCy 3
    for parsed_review in nlp.pipe(line_review(filename), batch_size=1000):
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

Write this data back out to a new file (unigram_sentences_all), with one normalized sentence per line. We'll use this data for learning our phrase models.

In [None]:
unigram_sentences_filepath = 'C:/Users/chenjf/Desktop/data_science-master/data_science-master/restaurant_reviews/unigram_sentences_all.txt'

In [None]:
#reviews_path is path to sample_reviews.txt
with codecs.open(unigram_sentences_filepath, 'w', encoding='utf-8') as f:
    for sentence in lemmatized_sentence_corpus(reviews_path): 
        f.write(sentence + '\n')

The `unigram_sentences_all` file is now a large text file with one document/sentence per line. gensim's *LineSentence* class provides an iterator over such a file for use with other gensim components. It streams the documents/sentences from disk, so that you never have to hold the entire corpus in RAM at once. This allows you to scale your modeling pipeline up to potentially very large corpora.
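The streaming idea can be illustrated with a few lines of plain Python. The class `StreamedSentences` below is a hypothetical stand-in for illustration, not gensim's implementation: it re-opens the file each time it is iterated, so several passes over the corpus are possible while only one line is held in memory at a time.

```python
import os
import tempfile

class StreamedSentences:
    """Yield one whitespace-tokenized sentence per line, re-reading the file on each pass."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

# write a tiny throwaway corpus so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("the food be great\nwe love happy hour\n")
    toy_path = tmp.name

sentences = StreamedSentences(toy_path)
first_pass = list(sentences)
second_pass = list(sentences)  # works again because __iter__ reopens the file
os.remove(toy_path)

print(first_pass)
```

A plain generator would be exhausted after the first pass. Models that need more than one pass over the data depend on a restartable iterable like this one.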

In [None]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

In [None]:
for unigram_sentence in it.islice(unigram_sentences, 0, 20):
    print(u' '.join(unigram_sentence))
    print(u'')

Next, we'll learn a phrase model that will link individual words into two-word phrases.

In [None]:
bigram_model_filepath = 'C:/Users/chenjf/Desktop/data_science-master/data_science-master/restaurant_reviews/bigram_model_all.txt'

bigram_model = Phrases(unigram_sentences)

bigram_model.save(bigram_model_filepath)
    
# load the finished model
bigram_model = Phrases.load(bigram_model_filepath)

Now that we have a trained phrase model for word pairs, let's apply it to the review sentences data and explore the results.

In [None]:
bigram_sentences_filepath = 'C:/Users/chenjf/Desktop/data_science-master/data_science-master/restaurant_reviews/bigrammed_sentences_all.txt'


with codecs.open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
    for unigram_sentence in unigram_sentences:
        bigram_sentence = u' '.join(bigram_model[unigram_sentence])
        f.write(bigram_sentence + '\n')

In [None]:
bigram_sentences = LineSentence(bigram_sentences_filepath)

In [None]:
#look at a subset

for bigram_sentence in it.islice(bigram_sentences, 20, 50):
    print(u' '.join(bigram_sentence))
    print(u'')

In [None]:
bigram_reviews_filepath = 'C:/Users/chenjf/Desktop/data_science-master/data_science-master/restaurant_reviews/bigram_transformed_reviews_all.txt'

In [None]:
#list of stop words
spacy.lang.en.English.Defaults.stop_words

At this point, you would usually run your entire file of reviews through the same steps: lemmatize the text, apply the phrase model, and remove stopwords.

In [None]:
with codecs.open(bigram_reviews_filepath, 'w', encoding='utf_8') as f:

    for parsed_review in nlp.pipe(line_review(reviews_path)):

        # lemmatize the text, removing punctuation and whitespace
        unigram_review = [token.lemma_ for token in parsed_review
                          if not punct_space(token)]

        # apply the first-order phrase model
        bigram_review = bigram_model[unigram_review]

        # remove any remaining stopwords
        bigram_review = [term for term in bigram_review
                         if term not in spacy.lang.en.English.Defaults.stop_words]

        # write the transformed review as a line in the new file
        bigram_review = u' '.join(bigram_review)
        f.write(bigram_review + '\n')

In [None]:
print(u'Original:' + u'\n')

for review in it.islice(line_review(reviews_path), 0,1):
    print(review)

print(u'----' + u'\n')
print(u'Transformed:' + u'\n')

with codecs.open(bigram_reviews_filepath, encoding='utf_8') as f:
    for review in it.islice(f, 0,1):
        print(review)

# Topic Modeling with Latent Dirichlet Allocation (LDA)

LDA is fully unsupervised. The topics are "discovered" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.

In [None]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
# note: recent pyLDAvis releases rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

### 1st step: learn the full vocabulary

In [None]:
bigram_dictionary_filepath= 'C:/Users/chenjf/Desktop/data_science-master/data_science-master/restaurant_reviews/bigram_dict_all.dict'

In [None]:
bigram_reviews = LineSentence(bigram_sentences_filepath)

# learn the dictionary by iterating over the bigram-transformed text
# (this streams bigram_sentences_filepath, so each "document" is one sentence)
bigram_dictionary = Dictionary(bigram_reviews)

# filter tokens that are very rare or too common from
# the dictionary (filter_extremes) and reassign integer ids (compactify)
bigram_dictionary.filter_extremes(no_below=5, no_above=0.2)
bigram_dictionary.compactify()

bigram_dictionary.save(bigram_dictionary_filepath)

# load the finished dictionary from disk
bigram_dictionary = Dictionary.load(bigram_dictionary_filepath)
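As a rough illustration of what the `no_below` and `no_above` arguments mean, the sketch below filters a vocabulary by document frequency in plain Python. The helper name and the toy documents are hypothetical, and the sketch leaves out other options of the real method (such as capping the vocabulary size).

```python
from collections import Counter

def filter_by_document_frequency(documents, no_below=5, no_above=0.2):
    """Keep tokens found in at least `no_below` documents and in
    at most a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document

    max_docs = no_above * len(documents)
    return {token for token, df in doc_freq.items()
            if no_below <= df <= max_docs}

# hypothetical toy corpus: "food" is everywhere, "pasta" and "curry" are rare
toy_docs = [
    ["food", "pizza"],
    ["food", "pizza"],
    ["food", "pasta"],
    ["food", "curry"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['pizza']
```

Very common tokens carry little information about which topic a document belongs to, and very rare ones give the model too little evidence, so both ends are trimmed.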


Like many NLP techniques, LDA uses a simplifying assumption known as the bag-of-words model. In the bag-of-words model, a document is represented by the counts of distinct terms that occur within it. Additional information, such as word order, is discarded.
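A minimal sketch of the bag-of-words idea, using only the standard library: each document becomes a list of (token id, count) pairs, and word order is lost. The helpers `build_vocabulary` and `to_bow` are illustrative names, not gensim functions; they loosely mirror what a dictionary-based conversion does, including ignoring tokens that are not in the vocabulary.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in documents:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, ignoring tokens not in the vocabulary."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())

toy_docs = [["good", "food", "good", "service"], ["service", "slow"]]
vocab = build_vocabulary(toy_docs)

bow = to_bow(["good", "food", "good", "wine"], vocab)
print(bow)  # [(0, 2), (1, 1)]

# word order is discarded: a reordered document gives the same representation
same_bow = to_bow(["food", "good", "good"], vocab)
print(bow == same_bow)  # True
```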

We use the gensim Dictionary we learned to generate a bag-of-words representation for each review. The bigram_bow_generator function implements this. We'll save the resulting bag-of-words reviews as a matrix.

In the code, "bag-of-words" is abbreviated to bow.


In [None]:
bigram_bow_filepath = 'C:/Users/chenjf/Desktop/data_science-master/data_science-master/restaurant_reviews/bigram_bow_corpus_all.mm'

In [None]:
def bigram_bow_generator(filepath):
    """
    function to read reviews from a file
    output: bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        yield bigram_dictionary.doc2bow(review)

In [None]:
# generate bag-of-words representations for all reviews and save them as a matrix
MmCorpus.serialize(bigram_bow_filepath,
                       bigram_bow_generator(bigram_sentences_filepath))
    
# load the finished bag-of-words corpus from disk
bigram_bow_corpus = MmCorpus(bigram_bow_filepath)

With the bag-of-words corpus, we're finally ready to learn our topic model from the reviews. We simply need to pass the bag-of-words matrix and Dictionary from our previous steps to LdaMulticore as inputs, along with the number of topics the model should learn.

In [None]:
lda_model_filepath = 'C:/Users/chenjf/Desktop/data_science-master/data_science-master/restaurant_reviews/lda_model_all'

In [None]:
lda = LdaMulticore(bigram_bow_corpus,num_topics=10,
                   id2word=bigram_dictionary, 
                   workers=2)
    
lda.save(lda_model_filepath)
    
# load the finished LDA model from disk
lda = LdaMulticore.load(lda_model_filepath)

Since each topic is represented as a mixture of tokens, you can manually inspect which tokens have been grouped together into which topics to try to understand the patterns the model has discovered in the data.
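The "mixture" idea can be made concrete with a toy example. In the sketch below, each topic is a probability distribution over terms and a document is a probability distribution over topics; the probability of a term in the document is the topic-weighted sum. All numbers and topic names are invented for illustration and are not output from the model trained above.

```python
# hypothetical topic-term probabilities (each row sums to 1)
topic_term = {
    "food":    {"pizza": 0.5, "pasta": 0.4, "waiter": 0.1},
    "service": {"pizza": 0.1, "pasta": 0.1, "waiter": 0.8},
}

# hypothetical topic mixture for a single document (sums to 1)
doc_topic = {"food": 0.75, "service": 0.25}

def term_probability(term, doc_topic, topic_term):
    """p(term | doc) = sum over topics of p(topic | doc) * p(term | topic)."""
    return sum(weight * topic_term[topic].get(term, 0.0)
               for topic, weight in doc_topic.items())

def top_terms(topic, topic_term, topn=1):
    """Return the topn highest-probability terms for one topic."""
    ranked = sorted(topic_term[topic].items(),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:topn]

p_pizza = term_probability("pizza", doc_topic, topic_term)
print(round(p_pizza, 3))                 # 0.4
print(top_terms("service", topic_term))  # [('waiter', 0.8)]
```

Inspecting the top terms of a topic, as the next cell does with `show_topic`, amounts to reading off the largest entries in one row of such a table.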

In [None]:
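def topics(topic_number, topn=25):
    """print the topn highest-weighted terms for one topic"""
    print(u'{:10} {}'.format(u'term', u'frequency') + u'\n')

    # pass topn through instead of hard-coding it
    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print(u'{:10} {:.3f}'.format(term, round(frequency, 3)))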

In [None]:
topics(topic_number = 2)

pyLDAvis includes a one-line function to take topic models created with gensim and prepare their data for visualization.

In [None]:
#take topic models prepared by gensim and prepare data for visualization

LDAvis_prepared = pyLDAvis.gensim.prepare(lda, bigram_bow_corpus,
                                              bigram_dictionary)


In [None]:
pyLDAvis.display(LDAvis_prepared)

What an LDA visualization shows:
1. Better interpretation of individual topics
2. Relationships between different topics

Distance: topics that are similar appear closer together, dissimilar topics appear farther apart <br/>
Size: relative frequency of topic in dataset<br/>
Bar chart: 30 most relevant terms<br/>


# Word2vec

The goal of word vector embedding models, or word vector models for short, is to learn dense, numerical vector representations for each term in a corpus vocabulary. If the model is successful, the vectors it learns about each term should encode some information about the meaning or concept the term represents, and the relationship between it and other terms in the vocabulary. 

# I like ___ food.

a) italian
b) mexican
c) pen
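d) chair

Options (a) and (b) fit the blank; (c) and (d) do not. Word2vec learns from this kind of context, so terms that appear in similar contexts end up with similar vectors.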
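Similarity between word vectors is commonly measured with cosine similarity, which is also what gensim's `most_similar` reports. The sketch below uses hand-made three-dimensional vectors purely for illustration; real Word2vec vectors are learned from data and have many more dimensions, and the helper names here are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# invented toy vectors, not learned embeddings
toy_vectors = {
    "italian": [0.9, 0.8, 0.1],
    "mexican": [0.8, 0.9, 0.2],
    "chair":   [0.1, 0.0, 0.9],
}

def most_similar(term, vectors, topn=2):
    """Rank every other term by cosine similarity to `term`."""
    scored = [(other, cosine_similarity(vectors[term], vector))
              for other, vector in vectors.items() if other != term]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar("italian", toy_vectors)
for word, similarity in ranking:
    print(u'{:10} {}'.format(word, round(similarity, 3)))
```

With these toy values, "mexican" ranks well above "chair" for the query "italian", which is the pattern the fill-in-the-blank example is meant to suggest.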

In [None]:
from gensim.models import Word2Vec

bigram_sentences = LineSentence(bigram_sentences_filepath)
word2vec_filepath = 'C:/Users/chenjf/Desktop/data_science-master/data_science-master/restaurant_reviews/word2vec_model_all'

In [None]:
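# note: gensim >= 4.0 renames the `size` argument to `vector_size`
food2vec = Word2Vec(bigram_sentences, size=100, window=5,
                    min_count=20, sg=1, workers=4)

food2vec.save(word2vec_filepath)

# load the finished model from disk
food2vec = Word2Vec.load(word2vec_filepath)
# precompute normalized vectors (deprecated and unnecessary in gensim >= 4.0)
food2vec.init_sims()

# train_count is the number of times train() has been called, not the number of epochs
print(u'{} training runs so far'.format(food2vec.train_count))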

In [None]:
# look up the topn most similar terms to token

def get_related_terms(token, topn=10):
    # query the word vectors through the model's .wv attribute
    for word, similarity in food2vec.wv.most_similar(positive=[token], topn=topn):
        print(u'{:10} {}'.format(word, round(similarity, 5)))

In [None]:
get_related_terms('restaurant')