Methodologies:
- 1. Decide which category level to build the topic model on, based on token frequencies (level 3, because most level-4 vocabularies overlap **check**)
- 2. Decide on the preprocessing pipeline to run before the n-gram parser (**NOTE**: it differs slightly from the text-parsing method):
    - tokenize + punct removal + ngram 
    - tokenize + punct removal + ngram + lemmatize 
    - tokenize + punct removal + ngram + stopwords removal + lemmatize (**check**)
- 3. Compare n-gram parsers: `gensim.models.phrases` vs. `nltk.collocations` (**check**)
    - `gensim` scoring metric: relative frequency.
        - `gensim` parameters: `min_count=5`, `threshold=10`. A pair is joined when its score exceeds `threshold`, where score = _(cnt(a, b) - min_count) * N / (cnt(a) * cnt(b))_ and _N_ is the vocabulary size.
    - `nltk.collocations` scoring metric: `pmi` (**check**) or `raw_freq`. PMI (pointwise mutual information) quantifies how far the probability of two words co-occurring departs from what their individual probabilities would predict if they were independent.
    
- 4. Decide which tokens to keep after POS tagging
    - ngram: adjective + noun (**check**)
    - unigram: adjective, noun, adverb, verb (**check**)
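
As a rough illustration of item 3, the `gensim` score can be worked out by hand. The sketch below assumes the formula quoted in the list, with N taken to be the vocabulary size; `phrase_score` is a hypothetical helper and the counts are invented for illustration, not taken from the review data.

```python
def phrase_score(cnt_a, cnt_b, cnt_ab, vocab_size, min_count=5):
    """Score from the formula above: (cnt(a, b) - min_count) * N / (cnt(a) * cnt(b))."""
    return (cnt_ab - min_count) * vocab_size / (cnt_a * cnt_b)

# Hypothetical counts: a tight pair vs. a loose pair built from very frequent words.
tight = phrase_score(cnt_a=60, cnt_b=50, cnt_ab=45, vocab_size=5000)
loose = phrase_score(cnt_a=2000, cnt_b=1500, cnt_ab=45, vocab_size=5000)

threshold = 10
print(f"tight pair: {tight:.2f} -> joined: {tight > threshold}")
print(f"loose pair: {loose:.2f} -> joined: {loose > threshold}")
```

Both pairs co-occur 45 times, but only the pair whose words rarely appear apart clears the threshold.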

In [1]:
%matplotlib inline
import random
random.seed(1234)

import pandas as pd
import gzip
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
# import nltk
# nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
import spacy
import matplotlib.pyplot as plt
import pyLDAvis #python library for interactive topic model visualization
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")
# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

# import warnings
# warnings.filterwarnings("ignore",category=DeprecationWarning)

import pickle
import numpy as np

# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

# Plotting tools
# import pyLDAvis
# import pyLDAvis.sklearn
# import matplotlib.pyplot as plt

from tqdm.notebook import tqdm as tqdm
tqdm.pandas()

# Example for detecting bigrams 
import math
import nltk
from collections import defaultdict

nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
from nltk.probability import FreqDist

from gensim.models.ldamulticore import LdaMulticore

In [32]:
def sent_to_words(sentences):
    '''
    Lowercase and tokenize each string in an iterable.
    
            Parameters:
                    sentences (iterable): Review strings in a list or as a pandas.Series.
                    
            Returns:
                    _ (generator): One list of tokens per review.  
    '''
    for sentence in sentences:
        # simple_preprocess lowercases and drops punctuation and digits while tokenizing;
        # deacc=True additionally strips accent marks.
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

def remove_stopwords(texts, extra_stopwords=None):
    '''
    Remove stopwords from each review in an iterable.

        Parameters:
                texts (iterable): Reviews, each either a raw string or a list of tokens.
                extra_stopwords (list): Additional domain-specific words to drop.

        Returns:
                _ (list): One list of tokens per review, with stopwords removed.  
    '''
    stop_words = set(stopwords.words('english')) | set(extra_stopwords or [])
    # str(doc) lets simple_preprocess re-tokenize both raw strings and token lists.
    return [[word for word in simple_preprocess(str(doc), deacc=True) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    '''
    Form bigrams.
    
        Parameters:
            texts (iterable): Review tokens in a list or as a pandas.series.

        Returns:
            _ (list): Token lists in which detected bigrams are joined as a_b.   
    '''
    bigram = gensim.models.Phrases(texts, min_count=5, threshold=10) #train based on corpus
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    '''
    Form trigrams.
    
        Parameters:
            texts (iterable): Review tokens in a list or as a pandas.series.

        Returns:
            _ (list): Token lists in which detected trigrams are joined as a_b_c.   
    '''
    bigram = gensim.models.Phrases(texts, min_count=5, threshold=10) 
    bigram_mod = gensim.models.phrases.Phraser(bigram)

    trigram = gensim.models.Phrases(bigram[texts], threshold=10) 
    trigram_mod = gensim.models.phrases.Phraser(trigram)
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

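def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    '''
    Lemmatize tokens with spaCy, optionally keeping only certain parts of speech.

        Parameters:
            texts (iterable): Review tokens, one list per review.
            allowed_postags (list): spaCy POS tags to keep; an empty list keeps every token.

        Returns:
            texts_out (list): One list of lemmas per review.
    '''
    nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))  
        if len(allowed_postags) > 0: 
            texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
        else: 
            texts_out.append([token.lemma_ for token in doc])
    return texts_out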

def preprocess_review(sentences, extra_stopwords = [], allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    # Materialize the generator: it is consumed by remove_stopwords and read again by lemmatization.
    sens1 = list(sent_to_words(sentences))
    tokens1 = remove_stopwords(sens1, extra_stopwords = extra_stopwords)
    bigram_lst = make_bigrams(tokens1)
    trigram_lst = make_trigrams(tokens1)
    reviews = lemmatization(sens1, allowed_postags = allowed_postags)
    return bigram_lst, trigram_lst, reviews
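
`sent_to_words` is a generator function, so its result can be iterated only once; any later step that reads the same object again sees an empty sequence. A minimal sketch with a stand-in tokenizer (`toy_tokens` is hypothetical, not part of the notebook):

```python
def toy_tokens(sentences):
    # Stand-in for sent_to_words: lazily yields one token list per sentence.
    for sentence in sentences:
        yield sentence.lower().split()

gen = toy_tokens(["My cat loves it", "Too salty"])
first_pass = [tokens for tokens in gen]
second_pass = [tokens for tokens in gen]   # the generator is already exhausted

# Wrapping the call in list(...) gives a sequence that can be reused.
materialized = list(toy_tokens(["My cat loves it", "Too salty"]))
print(len(first_pass), len(second_pass), len(materialized))
```

This is why the token stream needs to be turned into a list before it feeds more than one preprocessing step.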

# read data 

In [3]:
df = pd.read_csv('recategorized_dogs_cats_data.csv', index_col = False)

In [4]:
df.loc[df.category_2 == 'Cats', ['category_3', 'final_category_4']].value_counts()

category_3                     final_category_4         
Food                           Wet                          3955
Beds & Furniture               Beds & Sofas                 2366
Toys                           Catnip Toys                  2268
Litter & Housebreaking         Litter Boxes                 2257
Toys                           Mice & Animal Toys           2091
Food                           Dry                          1961
Treats                         Snacks                       1578
Health Supplies                Relaxants                    1432
Litter & Housebreaking         Litter                       1242
Beds & Furniture               Activity Trees               1213
Toys                           Feather Toys                 1140
Beds & Furniture               Scratching Pads              1075
Health Supplies                Supplements & Vitamins       1072
                               Flea, Lice & Tick Control     708
Toys                           Ba

In [8]:
df1 = df.loc[(df.category_2 == 'Cats') & (df.category_3 == 'Food') & (df.final_category_4 == 'Wet')].reset_index(drop = True) 
df2 = df.loc[(df.category_2 == 'Cats') & (df.category_3 == 'Food') & (df.final_category_4 == 'Dry')].reset_index(drop = True)  

df3 = df.loc[(df.category_2 == 'Cats') & (df.category_3 == 'Beds & Furniture') & (df.final_category_4 == 'Beds & Sofas')]
df4 = df.loc[(df.category_2 == 'Cats') & (df.category_3 == 'Beds & Furniture') & (df.final_category_4 == 'Activity Trees')]
df5 = df.loc[(df.category_2 == 'Cats') & (df.category_3 == 'Beds & Furniture') & (df.final_category_4 == 'Scratching Pads')]

df6 = df.loc[(df.category_2 == 'Cats') & (df.category_3 == 'Litter & Housebreaking') & (df.final_category_4 == 'Litter')]
df7 = df.loc[(df.category_2 == 'Cats') & (df.category_3 == 'Litter & Housebreaking') & (df.final_category_4 == 'Litter Boxes')]

df8 = df.loc[(df.category_2 == 'Cats') & (df.category_3 == 'Toys') & (df.final_category_4 == 'Mice & Animal Toys')]
df9 = df.loc[(df.category_2 == 'Cats') & (df.category_3 == 'Toys') & (df.final_category_4 == 'Catnip Toys')]
df10 = df.loc[(df.category_2 == 'Cats') & (df.category_3 == 'Toys') & (df.final_category_4 == 'Feather Toys')]

df11 = df.loc[(df.category_2 == 'Cats') & (df.category_3 == 'Treats') & (df.final_category_4 == 'Snacks')]
df12 = df.loc[(df.category_2 == 'Cats') & (df.category_3 == 'Treats') & (df.final_category_4 == 'Snacks')]  # NOTE: same filter as df11

df13 = df.loc[(df.category_2 == 'Cats') & (df.category_3 == 'Health Supplies') & (df.final_category_4 == 'Relaxants')]
df14 = df.loc[(df.category_2 == 'Cats') & (df.category_3 == 'Health Supplies') & (df.final_category_4 == 'Supplements & Vitamins')]

In [33]:
cats_toys_bigram, cats_toys_trigram, cats_toys_reviews = preprocess_review(df1.reviewText, extra_stopwords = ['cat', 'cats', 'toy', 'toys'])

In [15]:
cats_toys_bigram, cats_toys_trigram, cats_toys_reviews = preprocess_review(df1.reviewText, extra_stopwords = ['cat', 'cats', 'toy', 'toys'])
cats_food_bigram, cats_food_trigram, cats_food_reviews = preprocess_review(df2.reviewText, extra_stopwords = ['cat', 'cats', 'food'])
cats_beds_bigram, cats_beds_trigram, cats_beds_reviews = preprocess_review(df3.reviewText, extra_stopwords = ['cat', 'cats', 'bed', 'beds'] )
cats_litter_bigram, cats_litter_trigram, cats_litter_reviews = preprocess_review(df4.reviewText, extra_stopwords = ['cat', 'cats', 'litters', 'litter'])
cats_health_supplies_bigram, cats_health_supplies_trigram, cats_health_supplies_reviews = preprocess_review(df5.reviewText, extra_stopwords = ['cat', 'cats'])
cats_treats_bigram, cats_treats_trigram, cats_treats_reviews = preprocess_review(df6.reviewText, extra_stopwords = ['cat', 'cats', 'treat', 'treats'])
cats_grooming_bigram, cats_grooming_trigram, cats_grooming_reviews = preprocess_review(df7.reviewText, extra_stopwords = ['cat', 'cats'])
cats_feeding_supplies_bigram, cats_feeding_supplies_trigram, cats_feeding_supplies_reviews = preprocess_review(df8.reviewText, extra_stopwords = ['cat', 'cats'])

NameError: name 'data_words' is not defined

In [None]:
cats_toys_bigram, cats_toys_trigram, cats_toys_reviews2 = preprocess_review(df1.reviewText, extra_stopwords = ['cat', 'cats', 'toy', 'toys'],
                                                                          allowed_postags = ['ADJ', 'VERB', 'ADV'])
cats_food_bigram, cats_food_trigram, cats_food_reviews2 = preprocess_review(df2.reviewText, extra_stopwords = ['cat', 'cats', 'food'],
                                                                          allowed_postags = ['ADJ', 'VERB', 'ADV'])
cats_beds_bigram, cats_beds_trigram, cats_beds_reviews2 = preprocess_review(df3.reviewText, extra_stopwords = ['cat', 'cats', 'bed', 'beds'],
                                                                          allowed_postags = ['ADJ', 'VERB', 'ADV'])
cats_litter_bigram, cats_litter_trigram, cats_litter_reviews2 = preprocess_review(df4.reviewText, extra_stopwords = ['cat', 'cats', 'litters', 'litter'],
                                                                                allowed_postags = ['ADJ', 'VERB', 'ADV'])
cats_health_supplies_bigram, cats_health_supplies_trigram, cats_health_supplies_reviews2 = preprocess_review(df5.reviewText, extra_stopwords = ['cat', 'cats'],
                                                                                                           allowed_postags = ['ADJ', 'VERB', 'ADV'])
cats_treats_bigram, cats_treats_trigram, cats_treats_reviews2 = preprocess_review(df6.reviewText, extra_stopwords = ['cat', 'cats', 'treat', 'treats'],
                                                                                allowed_postags = ['ADJ', 'VERB', 'ADV'])
cats_grooming_bigram, cats_grooming_trigram, cats_grooming_reviews2 = preprocess_review(df7.reviewText, extra_stopwords = ['cat', 'cats'],
                                                                                      allowed_postags = ['ADJ', 'VERB', 'ADV'])
cats_feeding_supplies_bigram, cats_feeding_supplies_trigram, cats_feeding_supplies_reviews2 = preprocess_review(df8.reviewText, extra_stopwords = ['cat', 'cats'],
                                                                                                               allowed_postags = ['ADJ', 'VERB', 'ADV'])

# method 1:  GENSIM.simple_preprocess(); gensim.models.phrases.Phraser(ngram)

In [34]:
cats_wet_food_reviews = df1.reviewText #wet food
cats_wet_food_data_words = list(sent_to_words(tqdm(cats_wet_food_reviews.tolist())))
cats_wet_food_data_words_nostops = remove_stopwords(tqdm(cats_wet_food_data_words))
cats_wet_food_data_words_bigrams = make_bigrams(tqdm(cats_wet_food_data_words_nostops))
cats_wet_food_unigrams_lemmatized = lemmatization(tqdm(cats_wet_food_data_words_nostops), allowed_postags=['NOUN','ADJ', 'VERB', 'ADV'])
cats_wet_food_bigrams_lemmatized = lemmatization(tqdm(cats_wet_food_data_words_bigrams), allowed_postags=['NOUN','ADJ', 'VERB', 'ADV'])

  0%|          | 0/3955 [00:00<?, ?it/s]

  0%|          | 0/3955 [00:00<?, ?it/s]

  0%|          | 0/3955 [00:00<?, ?it/s]

  0%|          | 0/3955 [00:00<?, ?it/s]

  0%|          | 0/3955 [00:00<?, ?it/s]

In [35]:
cats_dry_food_reviews = df2.reviewText # dry food
cats_dry_food_data_words = list(sent_to_words(tqdm(cats_dry_food_reviews.tolist())))
cats_dry_food_data_words_nostops = remove_stopwords(tqdm(cats_dry_food_data_words))
cats_dry_food_data_words_bigrams = make_bigrams(tqdm(cats_dry_food_data_words_nostops))
cats_dry_food_unigrams_lemmatized = lemmatization(tqdm(cats_dry_food_data_words_nostops), allowed_postags=['NOUN','ADJ', 'VERB', 'ADV'])
cats_dry_food_bigrams_lemmatized = lemmatization(tqdm(cats_dry_food_data_words_bigrams), allowed_postags=['NOUN','ADJ', 'VERB', 'ADV'])

  0%|          | 0/1961 [00:00<?, ?it/s]

  0%|          | 0/1961 [00:00<?, ?it/s]

  0%|          | 0/1961 [00:00<?, ?it/s]

  0%|          | 0/1961 [00:00<?, ?it/s]

  0%|          | 0/1961 [00:00<?, ?it/s]

## GENSIM: take one review as an example

In [36]:
cats_wet_food_reviews[714] 

'9 Lives Cat Food is one Meal my Cats really like and enjoy eating It. Daily Essentials Real Flaked Tuna in Sauce, one of the many Flavors that is Good for them and I get it Locally. 100% Completed and Balanced Nutrition with all kinds of Vitamins and Nutrients all supposed to be Good for Cats. I can get several Meals out of each Can of Food, my Cats are a little Finicky so I have to try different Flavors and Brands all the time. They enjoy eating it maybe twice a week, they don\'t eat a lot at one time, they are Nibblers and prefer to eat many little Meals. I have tried to give them other more expensive Cat Food from Popular "Specialty Pet Shops" or other "Food Boutiques" for Pets, however, they don\'t want any of that, they Crave and like eating 9 Lives. I am Glad there are such Diversity of Flavors to make my Cats Happy and to choose from. I even Feed 9 Lives to several Cats outside and they Love It! ...Thank You D.D.'

In [37]:
[" ".join(cats_wet_food_data_words[714])] 

['lives cat food is one meal my cats really like and enjoy eating it daily essentials real flaked tuna in sauce one of the many flavors that is good for them and get it locally completed and balanced nutrition with all kinds of vitamins and nutrients all supposed to be good for cats can get several meals out of each can of food my cats are little finicky so have to try different flavors and brands all the time they enjoy eating it maybe twice week they don eat lot at one time they are nibblers and prefer to eat many little meals have tried to give them other more expensive cat food from popular specialty pet shops or other food boutiques for pets however they don want any of that they crave and like eating lives am glad there are such diversity of flavors to make my cats happy and to choose from even feed lives to several cats outside and they love it thank you']
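
The tokenized review above has lost `9`, `I`, `a`, and the `t` of `don't`. That follows from how `simple_preprocess` works: it lowercases the text, tokenizes on alphabetic runs (so digits and punctuation disappear), and by default discards tokens shorter than 2 or longer than 15 characters. The sketch below is a standard-library approximation written for illustration; `approx_simple_preprocess` is a hypothetical name, it is not gensim's implementation, and it may differ on edge cases.

```python
import re
import unicodedata

# Runs of word characters that contain no digits.
_ALPHABETIC = re.compile(r"(((?![\d])\w)+)", re.UNICODE)

def approx_simple_preprocess(text, min_len=2, max_len=15):
    # Strip accent marks (the effect of deacc=True), then lowercase.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    text = unicodedata.normalize("NFC", text).lower()
    tokens = [m.group() for m in _ALPHABETIC.finditer(text)]
    # Drop very short and very long tokens.
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = "9 Lives Cat Food, they don't eat a lot. Café!"
print(approx_simple_preprocess(sample))
```

The single-character fragments (`t`, `a`) never reach later steps, which is why `don't` shows up as `don` in the output.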

In [None]:
[" ".join(cats_wet_food_data_words_nostops[714])] 

In [38]:
[" ".join(cats_wet_food_data_words_bigrams[714])] #enjoy_eating?  completed_balanced? 

['lives cat food one meal cats really like enjoy_eating daily essentials real flaked_tuna sauce one many_flavors good get_locally completed_balanced nutrition kinds_vitamins nutrients_supposed good cats get several_meals food cats little_finicky try_different flavors brands time enjoy_eating maybe twice_week eat lot one time_nibblers prefer eat many little_meals tried_give expensive cat food popular_specialty pet_shops food_boutiques pets_however want_crave like eating lives glad_diversity flavors_make cats happy_choose even feed lives several cats outside_love thank']

In [40]:
[" ".join(cats_wet_food_unigrams_lemmatized[714])] 

['live cat food meal cat really enjoy eat daily essential real flaked tuna sauce many flavor good get locally complete balanced nutrition kind vitamin nutrient suppose good cat get several meal food cat little finicky try different flavor brand time enjoy eat maybe twice week eat lot time nibbler prefer eat many little meal try give expensive cat food popular specialty pet shop food boutique pet however want crave eat life glad diversity flavor make cat happy choose even feed live several cat love thank']

In [39]:
[" ".join(cats_wet_food_bigrams_lemmatized[714])] #kinds_vitamin?  nutrients_supposed? 

['live cat food meal cat really enjoy_eate daily essential real flaked_tuna sauce many_flavor good get_locally completed_balance nutrition nutrients_suppose good cat get several_meal food cat try_different flavor brand time enjoy_eate maybe twice_week eat lot time_nibbler prefer eat many little_meal tried_give expensive cat food popular_specialty pet_shop food_boutique want_crave eat life glad_diversity cat happy_choose even feed live several cat outside_love thank']

# method 2: NLTK.collocations

In [41]:
def sent_to_words(sentences):
    '''
    Lowercase and tokenize each string in an iterable.
    
            Parameters:
                    sentences (iterable): Review strings in a list or as a pandas.Series.
                    
            Returns:
                    _ (generator): One list of tokens per review.  
    '''
    for sentence in sentences:
        # simple_preprocess lowercases and drops punctuation and digits while tokenizing;
        # deacc=True additionally strips accent marks.
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

def get_bigrams(data_words): 
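    '''
    Score candidate bigrams by PMI.
    
        Parameters:
            data_words (iterable): Review tokens in a list or as a pandas.Series.

        Returns:
           finder (BigramCollocationFinder): the fitted collocation finder.
           bigram_measures (BigramAssocMeasures): the association measures used for scoring.
           bigram_pmi (data frame): each bigram and its PMI score, sorted in descending order.
    '''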
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = nltk.collocations.BigramCollocationFinder.from_documents(data_words) #data_words_nostops
    finder.apply_freq_filter(20)
    bigram_scores = finder.score_ngrams(bigram_measures.pmi) 
    bigram_pmi = pd.DataFrame(bigram_scores)
    bigram_pmi.columns = ['bigram', 'pmi']
    bigram_pmi.sort_values(by='pmi', axis = 0, ascending = False, inplace = True)
    return finder, bigram_measures, bigram_pmi

def bigram_filter(bigram):
    """
    Decide whether a candidate bigram is worth keeping.
        
        Parameters:
            bigram (tuple): a pair of tokens, e.g. ('urinary', 'tract').

        Returns:
           _ (boolean): True to keep the bigram, False to drop it.
    """
    stop_words = stopwords.words('english')   
    tag = nltk.pos_tag(bigram)
    #if tag[0][1] not in ['JJ', 'NN', 'NNS'] and tag[1][1] not in ['NN', 'NNS']: #we only want adjective + noun
    # As written, the bigram is dropped only when the first token is neither JJ nor NN
    # AND the second token is not NN; a strict adjective + noun rule would use `or`.
    if tag[0][1] not in ['JJ', 'NN'] and tag[1][1] not in ['NN']:
        return False
    if bigram[0] in stop_words or bigram[1] in stop_words: 
        return False
    if 'n' in bigram or 't' in bigram: # single-letter fragments, e.g. from contractions
        return False
    if 'PRON' in bigram: # we don't want pronoun placeholders
        return False
    if len(bigram[0]) <= 2 or len(bigram[1]) <= 2: # very short tokens are rarely meaningful
        return False
    return True

def get_trigrams(data_words): 
    trigram_measures = nltk.collocations.TrigramAssocMeasures()
    finder = nltk.collocations.TrigramCollocationFinder.from_documents(data_words) #data_words_nostops
    finder.apply_freq_filter(20)
    trigram_scores = finder.score_ngrams(trigram_measures.pmi)

    trigram_pmi = pd.DataFrame(trigram_scores)
    trigram_pmi.columns = ['trigram', 'pmi']
    trigram_pmi.sort_values(by='pmi', axis = 0, ascending = False, inplace = True)
    return finder, trigram_measures, trigram_pmi

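def trigram_filter(trigram):
    '''Decide whether a candidate trigram (a tuple of three tokens) is worth keeping.'''
    tag = nltk.pos_tag(trigram)
    stop_words = stopwords.words('english')
    #if tag[0][1] not in ['JJ', 'NN', 'NNS'] and tag[1][1] not in ['NN', 'NNS']: #we only want adjective + noun
    # Dropped only when the first token is neither JJ nor NN AND the second is neither JJ nor NN.
    if tag[0][1] not in ['JJ', 'NN'] and tag[1][1] not in ['JJ','NN']:
        return False
    if trigram[0] in stop_words or trigram[-1] in stop_words or trigram[1] in stop_words:
        return False
    if 'n' in trigram or 't' in trigram: # single-letter fragments, e.g. from contractions
        return False
    if 'PRON' in trigram:
        return False
    # `and` binds tighter than `or`; the parentheses make the evaluated grouping explicit.
    if len(trigram[0]) <= 2 or (len(trigram[1]) <= 2 and len(trigram[2]) <= 2):
        return False
    return True 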

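def replace_ngram(x, bigrams, trigrams):
    '''
    Join known n-grams inside a review string with underscores.
        
        Parameters:
            x (string): a review as space-separated tokens.
            bigrams (list): bigram phrases as space-separated strings, e.g. 'cat food'.
            trigrams (list): trigram phrases as space-separated strings.

        Returns:
           x (string): the review with each listed phrase rewritten as a_b or a_b_c.
    '''
    # Trigrams are replaced first so that a bigram cannot split a longer phrase.
    for gram in trigrams:
        x = x.replace(gram, '_'.join(gram.split()))
    for gram in bigrams:
        x = x.replace(gram, '_'.join(gram.split()))
    return x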

def remove_stopwords(texts, extra_stopwords = []):
    '''
    Remove stopwords from whitespace-separated review strings.

        Parameters:
                texts (iterable): Review strings in a list or as a pandas.Series.
                extra_stopwords (list): Additional domain-specific words to drop.

        Returns:
                _ (list): One list of tokens per review, with stopwords removed.  
    '''
    stop_words = set(stopwords.words('english')) | set(extra_stopwords)
    return [[word for word in doc.split() if word not in stop_words] for doc in texts]

def lemmatize_skip_ngrams(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    '''
    Lemmatize single tokens while leaving underscore-joined n-grams untouched.

        Parameters:
            texts (iterable): Review tokens, one list per review.
            allowed_postags (list): spaCy POS tags to keep; an empty list keeps every token.

        Returns:
            texts_out (list): One list of lemmas and n-grams per review.
    '''
    nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
    texts_out = []
    for sent in texts:
        doc = nlp(' '.join(sent)) 
        returned_doc = []
        for token in doc:
            if '_' in token.text:
                returned_doc.append(token.text)    # joined n-gram: keep as is
            elif not allowed_postags or token.pos_ in allowed_postags:
                returned_doc.append(token.lemma_)  # single word: lemmatize
        texts_out.append(returned_doc)
    return texts_out

In [42]:
cats_wet_food_data_words = list(sent_to_words(tqdm(cats_wet_food_reviews.tolist())))
cats_wet_food_bigram_finder, cats_wet_food_bigram_measures, cats_wet_food_bigram_pmi = get_bigrams(tqdm(cats_wet_food_data_words))
cats_wet_food_trigram_finder, cats_wet_food_trigram_measures, cats_wet_food_trigram_pmi = get_trigrams(tqdm(cats_wet_food_data_words))
cats_wet_food_filtered_bigram = cats_wet_food_bigram_pmi[cats_wet_food_bigram_pmi.progress_apply(lambda bigram:\
                                              bigram_filter(bigram['bigram'])\
                                              and bigram.pmi > 5, axis = 1)][:500]

cats_wet_food_filtered_trigram = cats_wet_food_trigram_pmi[cats_wet_food_trigram_pmi.progress_apply(lambda trigram: \
                                                 trigram_filter(trigram['trigram'])\
                                                 and trigram.pmi > 5, axis = 1)][:500]
cats_wet_food_bigrams = [' '.join(x) for x in cats_wet_food_filtered_bigram.bigram.values]
cats_wet_food_trigrams = [' '.join(x) for x in cats_wet_food_filtered_trigram.trigram.values]

cats_wet_food_reviews_ngrams = pd.DataFrame([' '.join(sen) for sen in cats_wet_food_data_words], columns = ['reviewText'])
cats_wet_food_reviews_ngrams.reviewText = cats_wet_food_reviews_ngrams.reviewText.map(lambda x: replace_ngram(x, cats_wet_food_bigrams, cats_wet_food_trigrams))
cats_wet_food_cleaned_reviews_ngrams = remove_stopwords(tqdm(cats_wet_food_reviews_ngrams.reviewText))
cats_wet_food_lemmatized_reviews_ngrams = lemmatize_skip_ngrams(tqdm(cats_wet_food_cleaned_reviews_ngrams), allowed_postags = [])

  0%|          | 0/3955 [00:00<?, ?it/s]

  0%|          | 0/3955 [00:00<?, ?it/s]

  0%|          | 0/3955 [00:00<?, ?it/s]

  0%|          | 0/2124 [00:00<?, ?it/s]

  0%|          | 0/591 [00:00<?, ?it/s]

  0%|          | 0/3955 [00:00<?, ?it/s]

  0%|          | 0/3955 [00:00<?, ?it/s]

In [43]:
cats_wet_food_lemmatized_reviews_ngrams2 = lemmatize_skip_ngrams(tqdm(cats_wet_food_cleaned_reviews_ngrams), allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV'])

  0%|          | 0/3955 [00:00<?, ?it/s]

In [44]:
cats_wet_food_bigram_pmi.head(10)

,bigram,pmi
0,"(officials, aafco)",13.122603
1,"(elegant, medleys)",12.646165
2,"(levels, established)",12.572023
3,"(reasonably, priced)",12.570062
4,"(maine, coon)",12.387646
5,"(pot, pie)",12.329706
6,"(petite, cuisine)",12.286269
7,"(spot, stew)",12.1732
8,"(ocean, whitefish)",12.152786
9,"(control, officials)",11.981619
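
The scores above are PMI values in bits. A sketch of the textbook definition, generalized to n-grams, shows why pairs whose words almost never appear apart rank highest and why a frequency filter such as `apply_freq_filter(20)` is applied before scoring. `ngram_pmi` is a hypothetical helper and the counts are invented, not taken from the review corpus.

```python
import math

def ngram_pmi(ngram_count, word_counts, total_words):
    """PMI in bits: log2( P(w1..wn) / (P(w1) * ... * P(wn)) )."""
    n = len(word_counts)
    numerator = ngram_count * total_words ** (n - 1)
    denominator = math.prod(word_counts)
    return math.log2(numerator / denominator)

# Hypothetical counts in a 100,000-token corpus.
rare_pair = ngram_pmi(20, [20, 25], 100_000)          # words almost always co-occur
common_pair = ngram_pmi(400, [5000, 4000], 100_000)   # frequent words, frequent pair
print(round(rare_pair, 2), round(common_pair, 2))
```

The rare pair scores far higher even though it occurs twenty times less often, which matches the dominance of brand and regulatory terms at the top of the table.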


In [45]:
cats_wet_food_bigrams[:20]
cats_wet_food_filtered_bigram[:20]

,bigram,pmi
1,"(elegant, medleys)",12.646165
4,"(maine, coon)",12.387646
5,"(pot, pie)",12.329706
6,"(petite, cuisine)",12.286269
7,"(spot, stew)",12.1732
8,"(ocean, whitefish)",12.152786
9,"(control, officials)",11.981619
10,"(pro, plan)",11.864805
11,"(urinary, tract)",11.747529
12,"(blue, buffalo)",11.513261
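
The filtered list above still contains noun + noun pairs such as `(control, officials)`. The POS condition in `bigram_filter` joins two negated tests with `and`, so a bigram is rejected only when both tests fail. The sketch below compares that lenient rule with a strict rule on hand-tagged pairs; the tags are typed in as assumptions rather than produced by `nltk.pos_tag`, and both helper names are hypothetical.

```python
def lenient_keep(tags):
    # Mirrors the notebook condition: reject only if BOTH tests fail.
    first, second = tags
    return not (first not in ("JJ", "NN") and second not in ("NN",))

def strict_keep(tags):
    # Keep only (adjective or singular noun) followed by a singular noun.
    first, second = tags
    return first in ("JJ", "NN") and second in ("NN",)

examples = {
    ("urinary", "tract"): ("JJ", "NN"),
    ("control", "officials"): ("NN", "NNS"),
    ("reasonably", "priced"): ("RB", "VBN"),
}
for bigram, tags in examples.items():
    print(bigram, lenient_keep(tags), strict_keep(tags))
```

Under these assumed tags the lenient rule keeps the first two pairs and drops the third, while the strict rule keeps only the first.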


In [46]:
cats_wet_food_trigrams[:10]
cats_wet_food_filtered_trigram[:10]

,trigram,pmi
0,"(control, officials, aafco)",25.104222
1,"(nutritional, levels, established)",24.536364
2,"(specialty, pet, shops)",22.252273
4,"(feed, control, officials)",20.860519
5,"(american, feed, control)",20.47719
6,"(chicken, pot, pie)",20.402118
10,"(try, different, ones)",16.837204
11,"(local, pet, store)",16.550639
12,"(ounce, cans, pack)",16.463351
13,"(feast, gravy, lovers)",16.422643


In [47]:
cats_dry_food_data_words = list(sent_to_words(tqdm(cats_dry_food_reviews.tolist())))
cats_dry_food_bigram_finder, cats_dry_food_bigram_measures, cats_dry_food_bigram_pmi = get_bigrams(tqdm(cats_dry_food_data_words))
cats_dry_food_trigram_finder, cats_dry_food_trigram_measures, cats_dry_food_trigram_pmi = get_trigrams(tqdm(cats_dry_food_data_words))
cats_dry_food_filtered_bigram = cats_dry_food_bigram_pmi[cats_dry_food_bigram_pmi.progress_apply(lambda bigram:\
                                              bigram_filter(bigram['bigram'])\
                                              and bigram.pmi > 5, axis = 1)][:500]

cats_dry_food_filtered_trigram = cats_dry_food_trigram_pmi[cats_dry_food_trigram_pmi.progress_apply(lambda trigram: \
                                                 trigram_filter(trigram['trigram'])\
                                                 and trigram.pmi > 5, axis = 1)][:500]
cats_dry_food_bigrams = [' '.join(x) for x in cats_dry_food_filtered_bigram.bigram.values]
cats_dry_food_trigrams = [' '.join(x) for x in cats_dry_food_filtered_trigram.trigram.values]

cats_dry_food_reviews_ngrams = pd.DataFrame([' '.join(sen) for sen in cats_dry_food_data_words], columns = ['reviewText'])
cats_dry_food_reviews_ngrams.reviewText = cats_dry_food_reviews_ngrams.reviewText.map(lambda x: replace_ngram(x, cats_dry_food_bigrams, cats_dry_food_trigrams))
cats_dry_food_cleaned_reviews_ngrams = remove_stopwords(tqdm(cats_dry_food_reviews_ngrams.reviewText))

  0%|          | 0/1961 [00:00<?, ?it/s]

  0%|          | 0/1961 [00:00<?, ?it/s]

  0%|          | 0/1961 [00:00<?, ?it/s]

  0%|          | 0/1143 [00:00<?, ?it/s]

  0%|          | 0/182 [00:00<?, ?it/s]

  0%|          | 0/1961 [00:00<?, ?it/s]

In [48]:
cats_dry_food_lemmatized_reviews_ngrams = lemmatize_skip_ngrams(tqdm(cats_dry_food_cleaned_reviews_ngrams), allowed_postags = [])

  0%|          | 0/1961 [00:00<?, ?it/s]

In [49]:
cats_dry_food_lemmatized_reviews_ngrams2 = lemmatize_skip_ngrams(tqdm(cats_dry_food_cleaned_reviews_ngrams), allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV'])

  0%|          | 0/1961 [00:00<?, ?it/s]

In [50]:
cats_dry_food_filtered_trigram[:10]

,trigram,pmi
0,"(blue, buffalo, wilderness)",18.578908
2,"(hill, science, diet)",16.129572
11,"(year, old, cat)",12.679082
12,"(grain, free, diet)",12.652765
17,"(grain, free, dry)",11.958455
24,"(grain, free, cat)",11.236011
26,"(high, quality, food)",11.034184
27,"(grain, free, food)",10.89294
49,"(dry, cat, food)",9.15687
58,"(adult, cat, food)",8.761447


## NLTK: take one review as an example

In [52]:
cats_wet_food_reviews[714] #wet food

'9 Lives Cat Food is one Meal my Cats really like and enjoy eating It. Daily Essentials Real Flaked Tuna in Sauce, one of the many Flavors that is Good for them and I get it Locally. 100% Completed and Balanced Nutrition with all kinds of Vitamins and Nutrients all supposed to be Good for Cats. I can get several Meals out of each Can of Food, my Cats are a little Finicky so I have to try different Flavors and Brands all the time. They enjoy eating it maybe twice a week, they don\'t eat a lot at one time, they are Nibblers and prefer to eat many little Meals. I have tried to give them other more expensive Cat Food from Popular "Specialty Pet Shops" or other "Food Boutiques" for Pets, however, they don\'t want any of that, they Crave and like eating 9 Lives. I am Glad there are such Diversity of Flavors to make my Cats Happy and to choose from. I even Feed 9 Lives to several Cats outside and they Love It! ...Thank You D.D.'

In [53]:
' '.join(cats_wet_food_data_words[714])

'lives cat food is one meal my cats really like and enjoy eating it daily essentials real flaked tuna in sauce one of the many flavors that is good for them and get it locally completed and balanced nutrition with all kinds of vitamins and nutrients all supposed to be good for cats can get several meals out of each can of food my cats are little finicky so have to try different flavors and brands all the time they enjoy eating it maybe twice week they don eat lot at one time they are nibblers and prefer to eat many little meals have tried to give them other more expensive cat food from popular specialty pet shops or other food boutiques for pets however they don want any of that they crave and like eating lives am glad there are such diversity of flavors to make my cats happy and to choose from even feed lives to several cats outside and they love it thank you'

In [54]:
cats_wet_food_reviews_ngrams.loc[714, 'reviewText'] 

'lives cat food is one meal my cats really like and enjoy_eating it daily essentials real flaked tuna in sauce one of the many_flavors that is good for them and get it locally completed and balanced_nutrition with all kinds of vitamins and nutrients all supposed to be good for cats can get several meals out of each can of food my cats are little_finicky so have to try different_flavors and brands all the time they enjoy_eating it maybe twice week they don eat lot at one time they are nibblers and prefer to eat many little meals have tried to give them other more expensive cat food from popular specialty_pet_shops or other food_boutiques for pets however they don want any of that they crave and like eating lives am glad there are such diversity of flavors to make my cats happy and to choose from even feed lives to several_cats_outside and they love it thank you'
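
`replace_ngram` relies on `str.replace`, which matches substrings rather than whole words, so a phrase can also be joined inside longer words. One possible word-boundary variant using `re` is sketched below; `replace_ngram_whole_words` is a hypothetical name, and the notebook outputs shown here were produced with the original function.

```python
import re

def replace_ngram_whole_words(text, ngrams):
    # Join each listed phrase with underscores, matching whole words only.
    for gram in ngrams:
        pattern = r"\b" + re.escape(gram) + r"\b"
        text = re.sub(pattern, gram.replace(" ", "_"), text)
    return text

text = "my cat food order arrived but the bobcat foods did not"
substring_version = text.replace("cat food", "cat_food")
whole_word_version = replace_ngram_whole_words(text, ["cat food"])
print(substring_version)
print(whole_word_version)
```

The substring version also rewrites `bobcat foods`, while the word-boundary version leaves it alone.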

In [55]:
' '.join(cats_wet_food_lemmatized_reviews_ngrams[714]) 

'live cat food one meal cat really like enjoy_eating daily essential real flaked tuna sauce one many_flavors good get locally complete balanced_nutrition kind vitamin nutrient suppose good cat get several meal food cat little_finicky try different_flavors brand time enjoy_eating maybe twice week eat lot one time nibbler prefer eat many little meal try give expensive cat food popular specialty_pet_shops food_boutiques pet however want crave like eat life glad diversity flavor make cat happy choose even feed life several_cats_outside love thank'

In [56]:
' '.join(cats_wet_food_lemmatized_reviews_ngrams2[714]) #m?

'live cat food meal cat really enjoy_eating daily essential real flaked tuna sauce many_flavors good get locally complete balanced_nutrition kind vitamin nutrient suppose good cat get several meal food cat little_finicky try different_flavors brand time enjoy_eating maybe twice week eat lot time nibbler prefer eat many little meal try give expensive cat food popular specialty_pet_shops food_boutiques pet however want crave eat life glad diversity flavor make cat happy choose even feed life several_cats_outside love thank'
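
Both lemmatized outputs keep joined n-grams such as `enjoy_eating` intact, in contrast to the `enjoy_eate` produced when the Gensim bigrams were lemmatized directly. The core rule can be shown without spaCy; the `TOY_LEMMAS` lookup below is a hypothetical stand-in for `token.lemma_`.

```python
# Hypothetical lemma lookup standing in for spaCy's token.lemma_.
TOY_LEMMAS = {"cats": "cat", "flavors": "flavor", "eating": "eat", "lives": "life"}

def lemmatize_skip_joined(tokens, lemmas=TOY_LEMMAS):
    out = []
    for tok in tokens:
        if "_" in tok:
            out.append(tok)                    # joined n-gram: keep verbatim
        else:
            out.append(lemmas.get(tok, tok))   # single word: look up its lemma
    return out

result = lemmatize_skip_joined(["cats", "enjoy_eating", "many_flavors", "eating"])
print(result)
```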

In [57]:
fdist = FreqDist([word for sen in cats_wet_food_lemmatized_reviews_ngrams2 for word in sen])
lst1 = [tu[0] for tu in fdist.most_common(100)]

fdist2 = FreqDist([word for sen in cats_dry_food_lemmatized_reviews_ngrams2 for word in sen])
lst2 = [tu[0] for tu in fdist2.most_common(100)]

In [58]:
trigrams_fdist1 = sorted(cats_wet_food_trigram_finder.nbest(cats_wet_food_trigram_measures.raw_freq, 20))
trigrams_lst1 = [' '.join(tu) for tu in trigrams_fdist1]

trigrams_fdist2 = sorted(cats_dry_food_trigram_finder.nbest(cats_dry_food_trigram_measures.raw_freq, 20))
trigrams_lst2 = [' '.join(tu) for tu in trigrams_fdist2]

In [59]:
trigrams_lst1

['all the time',
 'canned cat food',
 'cats love it',
 'cats love this',
 'for my cats',
 'love this food',
 'my cat loves',
 'my cats are',
 'my cats love',
 'of cat food',
 'of my cats',
 'one of my',
 'one of the',
 'seems to be',
 'subscribe and save',
 'the fancy feast',
 'this cat food',
 'this food and',
 'this food is',
 'this is the']

In [60]:
len([w for w in trigrams_lst1 if w in trigrams_lst2])

13

In [61]:
bigrams_fdist1 = sorted(cats_wet_food_bigram_finder.nbest(cats_wet_food_bigram_measures.raw_freq, 20))
bigrams_lst1 = [' '.join(tu) for tu in bigrams_fdist1]

bigrams_fdist2 = sorted(cats_dry_food_bigram_finder.nbest(cats_dry_food_bigram_measures.raw_freq, 20))
bigrams_lst2 = [' '.join(tu) for tu in bigrams_fdist2]

In [62]:
bigrams_lst1

['and the',
 'and they',
 'cat food',
 'cats love',
 'dry food',
 'eat it',
 'fancy feast',
 'food and',
 'for the',
 'in the',
 'it is',
 'my cat',
 'my cats',
 'of the',
 'one of',
 'the food',
 'this food',
 'this is',
 'to be',
 'to the']

In [63]:
len([w for w in bigrams_lst1 if w in bigrams_lst2])

17
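
The counts above (13 of 20 trigrams, 17 of 20 bigrams) measure how much the most frequent wet-food and dry-food n-grams coincide. The same comparison can be expressed with sets, which also yields a Jaccard ratio. `overlap_stats` is a hypothetical helper, and the short lists are illustrative stand-ins rather than the full top-20 lists.

```python
def overlap_stats(list_a, list_b):
    # Number of shared items and Jaccard similarity of two lists.
    a, b = set(list_a), set(list_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return len(shared), jaccard

wet = ["cat food", "my cats", "fancy feast", "this food"]
dry = ["cat food", "my cats", "science diet", "this food"]
shared, jaccard = overlap_stats(wet, dry)
print(shared, round(jaccard, 2))
```

A high ratio between two subcategories suggests their frequent vocabulary is too similar to separate them by token frequency alone.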

In [64]:
cats_dry_food_reviews[44] # dry food

'I used this in my automatic feeder instead of the overpriced science diet CD from the vet. I use friskies special diet cans normally, since wet food is much better for UTIs, but this dry is ok when we go away for a few days. More meat than Purinas cheaper urinary food.'

In [65]:
' '.join(cats_dry_food_data_words[44])

'used this in my automatic feeder instead of the overpriced science diet cd from the vet use friskies special diet cans normally since wet food is much better for utis but this dry is ok when we go away for few days more meat than purinas cheaper urinary food'

In [66]:
cats_dry_food_reviews_ngrams.loc[44, 'reviewText'] 

'used this in my automatic feeder instead of the overpriced science_diet cd from the vet use friskies special diet cans normally since wet food is much better for utis but this dry is ok when we go away for few days more meat than purinas cheaper urinary food'

In [67]:
' '.join(cats_dry_food_lemmatized_reviews_ngrams[44]) 

'use automatic feeder instead overprice science_diet cd vet use friskie special diet can normally since wet food much well utis dry ok go away day meat purina cheap urinary food'

In [68]:
' '.join(cats_dry_food_lemmatized_reviews_ngrams2[44]) 

'use automatic feeder instead overprice science_diet cd vet use friskie special diet can normally wet food much well utis dry go away day meat purina cheap urinary food'