# Collocations
A collocation is a phrase of two or more words that co-occur in a given context more often than the frequencies of the individual words would suggest.
The two most common types of collocation are bigrams and trigrams. Bigrams are two adjacent words, such as ‘CT scan’, ‘machine learning’, or ‘social media’. Trigrams are three adjacent words, such as ‘out of business’ or ‘Procter and Gamble’.
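
As a minimal illustration of "adjacent words", the sketch below builds bigrams and trigrams from a token list with plain Python. The helper name `ngrams` and the toy sentence are assumptions made for this example; the notebook itself relies on NLTK's collocation finders.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Toy token list (illustrative only)
toy_tokens = ["the", "ct", "scan", "went", "out", "of", "business"]

toy_bigrams = ngrams(toy_tokens, 2)
toy_trigrams = ngrams(toy_tokens, 3)

print(toy_bigrams)
print(toy_trigrams)
```

A list of k tokens yields k-1 bigrams and k-2 trigrams, so most candidates are ordinary word sequences rather than true collocations, which is why the ranking methods below are needed.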

This notebook explores several methods for ranking and filtering the most meaningful collocations: frequency counting, Pointwise Mutual Information (PMI), hypothesis testing (t-test and chi-square), and the likelihood ratio test.

Part-of-speech patterns used to filter candidates:
- Bigrams: (Noun, Noun), (Adjective, Noun)
- Trigrams: (Adjective/Noun, Anything, Adjective/Noun)
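
The patterns can be sketched as two small predicates over already-tagged n-grams. The function names and the hand-written Penn Treebank tags are assumptions for illustration; a real tagger may assign different tags.

```python
ADJ_OR_NOUN = {"JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS"}
NOUN = {"NN", "NNS", "NNP", "NNPS"}

def bigram_pattern_ok(tagged):
    """Accept (Adjective/Noun, Noun); tagged is ((word, tag), (word, tag))."""
    (_, first_tag), (_, second_tag) = tagged
    return first_tag in ADJ_OR_NOUN and second_tag in NOUN

def trigram_pattern_ok(tagged):
    """Accept (Adjective/Noun, anything, Adjective/Noun)."""
    (_, first_tag), _, (_, third_tag) = tagged
    return first_tag in ADJ_OR_NOUN and third_tag in ADJ_OR_NOUN

# Hand-tagged examples (tags are assumed, not produced by a tagger)
print(bigram_pattern_ok((("good", "JJ"), ("food", "NN"))))
print(bigram_pattern_ok((("the", "DT"), ("food", "NN"))))
print(trigram_pattern_ok((("chinese", "JJ"), ("new", "JJ"), ("year", "NN"))))
```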

In [6]:
#load all libraries
import os
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
import spacy
import string

Data: Singapore Reviews

In [8]:
#load reviews data
#os.chdir(r"D:\Choogle\Data\dataset_review")
#raw string so the backslashes in the Windows path are not read as escape sequences
os.chdir(r"D:\Choogle\Data\LargerSampleReviews")
#df = pd.read_csv("dataset2_london.csv")
df = pd.read_csv("SingaporeReviews.csv")
#df = pd.read_excel('sqllab_untitled_query_3_20200203T120456.xlsx')
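print(df.head)  #note: without parentheses this prints the bound method and the whole frame; df.head() would show only the first five rows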

<bound method NDFrame.head of            crayon_product_id  \
0      R-102001000-200121252   
1      R-102001000-200121252   
2      R-102001000-200121252   
3      R-102001000-200121252   
4      R-102001000-200121252   
...                      ...   
14995  R-102001000-201078136   
14996  R-102001000-201078136   
14997  R-102001000-201078136   
14998  R-102001000-201078136   
14999  R-102001000-201078136   

                                                    text       city  
0      4th level Tangs Plaza; former Island Coffee Ho...  Singapore  
1      We wouldn't have found this restaurant if it w...  Singapore  
2      Dined here with friends and had a good night o...  Singapore  
3      Truly an enjoyable gastronomic experience if y...  Singapore  
4      Tucked away on the 4th floor of Tangs, you wil...  Singapore  
...                                                  ...        ...  
14995  I had a stop over in Singapore after a fun vac...  Singapore  
14996  We had some homema

In [10]:
#load reviews data
reviews = pd.read_csv('SingaporeReviews.csv')

In [11]:
reviews.head(2)

Unnamed: 0,crayon_product_id,text,city
0,R-102001000-200121252,4th level Tangs Plaza; former Island Coffee Ho...,Singapore
1,R-102001000-200121252,We wouldn't have found this restaurant if it w...,Singapore


Extract only the reviews...

In [12]:
#extract only reviews
comments = reviews['text']
comments = comments.astype('str')

In [13]:
comments

0        4th level Tangs Plaza; former Island Coffee Ho...
1        We wouldn't have found this restaurant if it w...
2        Dined here with friends and had a good night o...
3        Truly an enjoyable gastronomic experience if y...
4        Tucked away on the 4th floor of Tangs, you wil...
                               ...                        
14995    I had a stop over in Singapore after a fun vac...
14996    We had some homemade noodles, which were actua...
14997    Of all the vendors and restaurants on Singapor...
14998    Located on Smith Street alongside the hawker s...
14999    As you would expect from the name of this plac...
Name: text, Length: 15000, dtype: object

## Preprocessing

In [14]:
#function to remove non-ascii characters
def _removeNonAscii(s):
    return "".join(ch for ch in s if ord(ch) < 128)

In [15]:
comments = comments.astype('str')

In [16]:
#remove non-ascii characters
comments = comments.map(_removeNonAscii)

In [17]:
#get stop words of all languages
STOPWORDS_DICT = {lang: set(nltk.corpus.stopwords.words(lang)) for lang in nltk.corpus.stopwords.fileids()}

In [18]:
#function to detect language based on # of stop words for particular language
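def get_language(text):
    words = set(nltk.wordpunct_tokenize(text.lower()))
    #count the stop words shared with each language and keep the best match
    overlaps = {lang: len(words & lang_stopwords) for lang, lang_stopwords in STOPWORDS_DICT.items()}
    best_lang = max(overlaps, key=overlaps.get)
    #returns True only when English has the largest overlap
    return best_lang == 'english'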
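
The stop-word heuristic can be tried without NLTK on a toy scale. The sketch below uses tiny hand-picked stop-word sets (hypothetical stand-ins for NLTK's lists) and a hypothetical helper `guess_language`; it mirrors the idea of the cell above rather than reproducing it.

```python
# Tiny hand-picked stop-word sets (assumed for illustration only)
TOY_STOPWORDS = {
    "english": {"the", "and", "is", "was", "a", "of", "to"},
    "french": {"le", "la", "et", "est", "un", "de", "les"},
    "german": {"der", "die", "und", "ist", "ein", "das", "zu"},
}

def guess_language(text):
    """Pick the language whose stop words overlap most with the text."""
    words = set(text.lower().split())
    overlap = {lang: len(words & sw) for lang, sw in TOY_STOPWORDS.items()}
    return max(overlap, key=overlap.get)

print(guess_language("The food was great and the service is fast"))
print(guess_language("La cuisine est bonne et le service est rapide"))
print(guess_language("Yummy"))  # no stop words at all: every count is zero
```

When every overlap is zero, `max` returns the first language in iteration order, so very short reviews are classified by dictionary order rather than by content. The same tie-breaking applies to any heuristic built this way.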

In [19]:
#filter for only english comments
eng_comments=comments[comments.apply(get_language)]

In [20]:
eng_comments.head()

0    4th level Tangs Plaza; former Island Coffee Ho...
1    We wouldn't have found this restaurant if it w...
2    Dined here with friends and had a good night o...
3    Truly an enjoyable gastronomic experience if y...
4    Tucked away on the 4th floor of Tangs, you wil...
Name: text, dtype: object

In [21]:
#drop duplicates
#reassign instead of using inplace=True, since eng_comments is a filtered slice of comments
eng_comments = eng_comments.drop_duplicates()

In [22]:
#load spacy
#nlp = spacy.load('en')
nlp = spacy.load("en_core_web_sm")

In [23]:
#function to clean and lemmatize comments
def clean_comments(text):
    #remove punctuations
    regex = re.compile('[' + re.escape(string.punctuation) + '\\r\\t\\n]')
    nopunct = regex.sub(" ", str(text))
    #use spacy to lemmatize comments
    doc = nlp(nopunct, disable=['parser','ner'])
    lemma = [token.lemma_ for token in doc]
    return lemma
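
The punctuation step on its own can be checked with the standard library. The sketch below applies the same character class to a made-up sentence (the helper name `strip_punct` and the sample text are assumptions); lemmatization is left out because it needs the spaCy model.

```python
import re
import string

# Same character class as in clean_comments: punctuation plus \r, \t, \n
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "\\r\\t\\n]")

def strip_punct(text):
    """Replace every punctuation or control character with a single space."""
    return PUNCT_RE.sub(" ", text)

raw = "Tangs Plaza; former cafe.\nWe wouldn't mind!"
cleaned = strip_punct(raw)

print(repr(cleaned))
print(cleaned.split())
```

Two side effects are visible here: contractions split into pieces such as `wouldn` and `t`, and punctuation next to a space leaves a run of two spaces. This is likely where the stray `t` and whitespace tokens in the lemmatized output come from, and it is why the filters later in the notebook reject them explicitly.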

In [24]:
#apply function to clean and lemmatize comments
lemmatized = eng_comments.map(clean_comments)

In [25]:
#make sure to lowercase everything
lemmatized = lemmatized.map(lambda x: [word.lower() for word in x])

In [26]:
lemmatized.head()

0    [4th, level, tangs, plaza,  , former, island, ...
1    [-pron-, wouldn, t, have, find, this, restaura...
2    [dine, here, with, friend, and, have, a, good,...
3    [truly, an, enjoyable, gastronomic, experience...
4    [tuck, away, on, the, 4th, floor, of, tang,  ,...
Name: text, dtype: object

In [27]:
#turn all comments' tokens into one single list
unlist_comments = [item for items in lemmatized for item in items]

## Initialize NLTK's Bigrams/Trigrams Finder

In [28]:
bigrams = nltk.collocations.BigramAssocMeasures()
trigrams = nltk.collocations.TrigramAssocMeasures()

In [29]:
bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(unlist_comments)
trigramFinder = nltk.collocations.TrigramCollocationFinder.from_words(unlist_comments)
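
One consequence of feeding a single flattened token list to the finders is that n-grams can span the boundary between two reviews. The sketch below shows the effect on two toy documents in plain Python; the helper `bigrams_of` and the data are assumptions for illustration.

```python
def bigrams_of(tokens):
    """Adjacent token pairs within one token list."""
    return list(zip(tokens, tokens[1:]))

# Two toy "reviews", already tokenized
docs = [["great", "ice", "cream"], ["service", "be", "slow"]]

# Flattened, as in the notebook: pairs may cross the review boundary
flat = [tok for doc in docs for tok in doc]
flat_bigrams = bigrams_of(flat)

# Per document: pairs never cross a boundary
per_doc_bigrams = [bg for doc in docs for bg in bigrams_of(doc)]

spurious = set(flat_bigrams) - set(per_doc_bigrams)
print(spurious)
```

With 15,000 reviews the number of boundary pairs is small relative to the corpus, so the effect on the rankings is probably minor. NLTK's finders also provide a `from_documents` constructor that takes a list of token lists, which may be worth checking if boundary pairs matter.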

## 1. Counting Frequencies of Adjacent Words
- Main idea: rank n-grams by raw frequency
- Issue: the top of the list is dominated by very frequent function words (pronouns, articles, prepositions)
- Solution: keep only n-grams that match the adjective/noun patterns
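
The issue is easy to reproduce on a toy scale with `collections.Counter`. The sentence and the three-word stop list below are made up for illustration, and the stop-word check stands in for the fuller part-of-speech filter used in the notebook.

```python
from collections import Counter

# Toy lemmatized text (illustrative only)
toy = "the food be good and the service be good and the ice cream be great".split()

# Count every adjacent pair
bigram_counts = Counter(zip(toy, toy[1:]))

# Assumed mini stop list, standing in for NLTK's English stop words
TOY_STOPWORDS = {"the", "be", "and"}
content_only = Counter(
    {bg: c for bg, c in bigram_counts.items() if not (set(bg) & TOY_STOPWORDS)}
)

print(bigram_counts.most_common(3))
print(content_only.most_common())
```

Every one of the most frequent pairs contains a function word, while the only pair that survives the filter is the genuine collocation.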

In [30]:
bigram_freq = bigramFinder.ngram_fd.items()

In [31]:
bigramFreqTable = pd.DataFrame(list(bigram_freq), columns=['bigram','freq']).sort_values(by='freq', ascending=False)

In [32]:
bigramFreqTable.head().reset_index(drop=True)

Unnamed: 0,bigram,freq
0,"( , -pron-)",19211
1,"( , the)",12983
2,"(-pron-, be)",10252
3,"(-pron-, have)",5401
4,"(and, -pron-)",4514


In [33]:
bigramFreqTable[:10]

Unnamed: 0,bigram,freq
163,"( , -pron-)",19211
27,"( , the)",12983
151,"(-pron-, be)",10252
238,"(-pron-, have)",5401
264,"(and, -pron-)",4514
336,"(of, the)",4283
144,"(be, a)",4118
83,"(the, food)",3897
484,"( , and)",3626
424,"(and, the)",3626


In [34]:
#get english stopwords
en_stopwords = set(stopwords.words('english'))

In [35]:
#function to filter for ADJ/NN bigrams
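def rightTypes(ngram):
    #drop ngrams with pronoun placeholders, empty/whitespace tokens, or the stray 't' left by split contractions
    if '-pron-' in ngram or '' in ngram or ' ' in ngram or 't' in ngram:
        return False
    #drop ngrams containing any english stop word
    for word in ngram:
        if word in en_stopwords:
            return False
    acceptable_types = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    second_type = ('NN', 'NNS', 'NNP', 'NNPS')
    tags = nltk.pos_tag(ngram)
    #first word must be an adjective or noun, second word must be a noun
    return tags[0][1] in acceptable_types and tags[1][1] in second_type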

In [36]:
#filter bigrams
filtered_bi = bigramFreqTable[bigramFreqTable.bigram.map(lambda x: rightTypes(x))]

In [37]:
#nltk.download('averaged_perceptron_tagger')

In [38]:
filtered_bi[:10]

Unnamed: 0,bigram,freq
3853,"(good, food)",405
875,"(ice, cream)",391
3147,"(service, staff)",373
922,"(good, service)",318
3732,"(first, time)",317
2382,"(great, food)",285
2638,"(great, place)",273
3857,"(good, place)",273
3868,"(great, service)",264
23019,"(dim, sum)",232


In [39]:
trigram_freq = trigramFinder.ngram_fd.items()

In [40]:
trigramFreqTable = pd.DataFrame(list(trigram_freq), columns=['trigram','freq']).sort_values(by='freq', ascending=False)

In [41]:
trigramFreqTable.head().reset_index(drop=True)

Unnamed: 0,trigram,freq
0,"( , -pron-, be)",3450
1,"( , -pron-, have)",1991
2,"(the, food, be)",1836
3,"( , the, food)",1290
4,"( , amp, )",1184


In [42]:
trigramFreqTable[:10]

Unnamed: 0,trigram,freq
175,"( , -pron-, be)",3450
3466,"( , -pron-, have)",1991
264,"(the, food, be)",1836
84,"( , the, food)",1290
2714,"( , amp, )",1184
277,"(and, -pron-, be)",1033
94,"( , service, be)",975
1501,"( , there, be)",948
3289,"(-pron-, be, a)",935
2398,"(-pron-, have, a)",884


In [43]:
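#function to filter for ADJ/NN trigrams
def rightTypesTri(ngram):
    #drop ngrams with pronoun placeholders, empty/whitespace tokens, or the stray 't' left by split contractions
    if '-pron-' in ngram or '' in ngram or ' ' in ngram or '  ' in ngram or 't' in ngram:
        return False
    #drop ngrams containing any english stop word
    for word in ngram:
        if word in en_stopwords:
            return False
    first_type = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    third_type = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    tags = nltk.pos_tag(ngram)
    #first and third words must be adjectives or nouns; the middle word can be anything
    return tags[0][1] in first_type and tags[2][1] in third_type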

In [44]:
filtered_tri = trigramFreqTable[trigramFreqTable.trigram.map(lambda x: rightTypesTri(x))]

In [45]:
filtered_tri[:10]

Unnamed: 0,trigram,freq
89059,"(beach, road, kitchen)",99
173575,"(michelin, star, restaurant)",67
253182,"(crab, bee, hoon)",44
146697,"(din, tai, fung)",38
22443,"(chinese, new, year)",34
56213,"(f, amp, b)",33
19128,"(fine, dining, restaurant)",32
49133,"(north, indian, food)",31
553346,"(squid, ink, paella)",31
46948,"(good, indian, food)",31


In [46]:
freq_bi = filtered_bi[:20].bigram.values

In [47]:
freq_tri = filtered_tri[:20].trigram.values

## 2. PMI
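
PMI compares how often two words occur together with how often they would co-occur if they were independent: PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). The sketch below estimates it from raw counts; the function name and all counts are hypothetical, chosen only to show how the score behaves.

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """log2( P(x,y) / (P(x) * P(y)) ), estimated from raw counts."""
    return math.log2(n_xy * n_total) - math.log2(n_x * n_y)

N = 1_000_000  # assumed corpus size in tokens

# Rare pair whose words only ever appear together
rare_pair = pmi(n_xy=20, n_x=20, n_y=20, n_total=N)

# Frequent pair that co-occurs twice as often as chance predicts
common_pair = pmi(n_xy=400, n_x=20_000, n_y=10_000, n_total=N)

# Pair that co-occurs exactly as often as chance predicts
independent_pair = pmi(n_xy=200, n_x=20_000, n_y=10_000, n_total=N)

print(rare_pair, common_pair, independent_pair)
```

Because rare words that always appear together get very high scores, PMI is usually combined with a minimum-frequency cutoff, which is the role of the frequency filter applied in this section.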

In [48]:
bigramFinder.apply_freq_filter(20)

In [49]:
bigramPMITable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.pmi)), columns=['bigram','PMI']).sort_values(by='PMI', ascending=False)

In [50]:
bigramPMITable[:10]

Unnamed: 0,bigram,PMI
0,"(tiong, bahru)",15.571748
1,"(tanjong, pagar)",15.361328
2,"(earl, grey)",15.294214
3,"(telok, ayer)",15.111393
4,"(ngoh, hiang)",14.920756
5,"(aglio, olio)",14.838293
6,"(sri, lankan)",14.742418
7,"(xi, yan)",14.687971
8,"(gula, melaka)",14.664416
9,"(ikan, bili)",14.645121


In [51]:
trigramFinder.apply_freq_filter(20)

In [52]:
trigramPMITable = pd.DataFrame(list(trigramFinder.score_ngrams(trigrams.pmi)), columns=['trigram','PMI']).sort_values(by='PMI', ascending=False)

In [53]:
trigramPMITable[:10]

Unnamed: 0,trigram,PMI
0,"(din, tai, fung)",28.281361
1,"(kueh, pie, tee)",26.590174
2,"(sri, lankan, crab)",24.386658
3,"(squid, ink, paella)",23.733306
4,"(angel, hair, pasta)",23.647481
5,"(xiao, long, bao)",23.499672
6,"(f, amp, b)",23.331157
7,"(buah, keluak, ice)",23.034429
8,"(salted, egg, yolk)",21.752068
9,"(crab, bee, hoon)",21.584104


In [54]:
pmi_bi = bigramPMITable[:20].bigram.values

In [55]:
pmi_tri = trigramPMITable[:20].trigram.values

## 3. t-test
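
The t-test asks whether the observed count of a pair is significantly higher than the count expected under independence. A commonly used approximation for collocations is t = (O - E) / sqrt(O), with E = n_x * n_y / N. The sketch below implements that approximation; the function name and counts are hypothetical, and whether it matches NLTK's `student_t` to the last digit is an assumption.

```python
import math

def t_score(n_xy, n_x, n_y, n_total):
    """(observed - expected) / sqrt(observed) for a bigram."""
    expected = n_x * n_y / n_total
    return (n_xy - expected) / math.sqrt(n_xy)

N = 1_000_000  # assumed corpus size in tokens

rare_pair = t_score(n_xy=20, n_x=20, n_y=20, n_total=N)
common_pair = t_score(n_xy=400, n_x=20_000, n_y=10_000, n_total=N)

print(rare_pair, common_pair)
```

Unlike PMI, this statistic grows with the raw count, so frequent pairs outrank rare ones. That is consistent with the unfiltered table in this section being topped by function-word pairs, and it is why the adjective/noun filter is applied again here.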

In [56]:
bigramTtable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.student_t)), columns=['bigram','t']).sort_values(by='t', ascending=False)

In [57]:
bigramTtable.head()

Unnamed: 0,bigram,t
0,"( , -pron-)",90.000731
1,"(-pron-, be)",66.561675
2,"( , the)",65.799605
3,"(-pron-, have)",62.747189
4,"(the, food)",53.233167


In [58]:
filteredT_bi = bigramTtable[bigramTtable.bigram.map(lambda x: rightTypes(x))]

In [59]:
filteredT_bi[:10]

Unnamed: 0,bigram,t
124,"(ice, cream)",19.760473
151,"(service, staff)",18.263802
164,"(first, time)",17.621035
215,"(good, food)",15.546471
219,"(great, place)",15.389335
223,"(dim, sum)",15.227592
235,"(main, course)",15.012504
238,"(good, service)",14.906034
242,"(great, service)",14.841409
251,"(good, value)",14.69567


In [60]:
trigramTtable = pd.DataFrame(list(trigramFinder.score_ngrams(trigrams.student_t)), columns=['trigram','t']).sort_values(by='t', ascending=False)

In [61]:
trigramTtable.head()

Unnamed: 0,trigram,t
0,"( , -pron-, be)",53.272253
1,"( , -pron-, have)",43.003582
2,"(the, food, be)",42.210463
3,"( , the, food)",34.45655
4,"( , amp, )",34.090925


In [62]:
filteredT_tri = trigramTtable[trigramTtable.trigram.map(lambda x: rightTypesTri(x))]

In [63]:
filteredT_tri.head(10)

Unnamed: 0,trigram,t
354,"(beach, road, kitchen)",9.949871
668,"(michelin, star, restaurant)",8.185266
1241,"(crab, bee, hoon)",6.633247
1513,"(din, tai, fung)",6.164414
1779,"(chinese, new, year)",5.830925
1839,"(f, amp, b)",5.744562
1925,"(fine, dining, restaurant)",5.656574
2012,"(squid, ink, paella)",5.567764
2014,"(north, indian, food)",5.567655
2029,"(good, indian, food)",5.558119


In [64]:
t_bi = filteredT_bi[:20].bigram.values

In [65]:
t_tri = filteredT_tri[:20].trigram.values

## 4. Chi-Square
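
The chi-square test compares all four cells of the 2x2 contingency table for a pair (x followed by y, x followed by something else, and so on) with the values expected under independence. The sketch below uses the closed form for a 2x2 table; the function name and counts are hypothetical.

```python
def chi_square(n_xy, n_x, n_y, n_total):
    """Pearson chi-square for a bigram's 2x2 contingency table."""
    a = n_xy                          # x followed by y
    b = n_x - n_xy                    # x followed by something else
    c = n_y - n_xy                    # something else followed by y
    d = n_total - n_x - n_y + n_xy    # neither x first nor y second
    numerator = n_total * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

N = 1_000_000  # assumed number of bigrams in the corpus

# Words that only ever appear together: the score equals N
print(chi_square(n_xy=20, n_x=20, n_y=20, n_total=N))

# Pair that co-occurs exactly as often as chance predicts: the score is 0
print(chi_square(n_xy=200, n_x=20_000, n_y=10_000, n_total=N))
```

A perfectly associated pair scores N regardless of how often it occurs, so, like PMI, chi-square tends to rank rare but exclusive pairs highly.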

In [66]:
bigramChiTable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.chi_sq)), columns=['bigram','chi-sq']).sort_values(by='chi-sq', ascending=False)

In [67]:
bigramChiTable.head(20)

Unnamed: 0,bigram,chi-sq
0,"(tanjong, pagar)",1136538.0
1,"(buah, keluak)",1034317.0
2,"(bee, hoon)",1005597.0
3,"(mondo, mio)",1001438.0
4,"(aglio, olio)",995973.9
5,"(tiong, bahru)",974073.2
6,"(telok, ayer)",955749.3
7,"(gula, melaka)",934819.9
8,"(amuse, bouche)",932177.6
9,"(ngoh, hiang)",930493.4


In [68]:
trigramChiTable = pd.DataFrame(list(trigramFinder.score_ngrams(trigrams.chi_sq)), columns=['trigram','chi-sq']).sort_values(by='chi-sq', ascending=False)

In [69]:
trigramChiTable.head(20)

Unnamed: 0,trigram,chi-sq
0,"(din, tai, fung)",12397150000.0
1,"(kueh, pie, tee)",2626746000.0
2,"(xiao, long, bao)",794662100.0
3,"(sri, lankan, crab)",438717100.0
4,"(squid, ink, paella)",432494600.0
5,"(angel, hair, pasta)",407394300.0
6,"(f, amp, b)",348250700.0
7,"(beach, road, kitchen)",309480200.0
8,"(buah, keluak, ice)",198089300.0
9,"(crab, bee, hoon)",138737400.0


In [70]:
chi_bi = bigramChiTable[:20].bigram.values

In [71]:
chi_tri = trigramChiTable[:20].trigram.values

## 5. Likelihood Ratio
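
The likelihood ratio test uses the same 2x2 contingency table as chi-square but scores it with the log-likelihood statistic G² = 2 * sum( O * ln(O / E) ) over the four cells. The sketch below is a plain-Python version under that definition; the function name and counts are hypothetical, and NLTK's `likelihood_ratio` may differ in small smoothing details.

```python
import math

def log_likelihood_ratio(n_xy, n_x, n_y, n_total):
    """G^2 = 2 * sum(O * ln(O / E)) over the 2x2 contingency table."""
    observed = [n_xy, n_x - n_xy, n_y - n_xy, n_total - n_x - n_y + n_xy]
    row_totals = [n_x, n_x, n_total - n_x, n_total - n_x]
    col_totals = [n_y, n_total - n_y, n_y, n_total - n_y]
    expected = [r * c / n_total for r, c in zip(row_totals, col_totals)]
    # Cells with an observed count of zero contribute nothing to the sum
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

N = 1_000_000  # assumed number of bigrams in the corpus

independent_pair = log_likelihood_ratio(n_xy=200, n_x=20_000, n_y=10_000, n_total=N)
common_pair = log_likelihood_ratio(n_xy=400, n_x=20_000, n_y=10_000, n_total=N)

print(independent_pair, common_pair)
```

The statistic is zero when the observed counts match independence exactly and grows as they diverge from it.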

In [72]:
bigramLikTable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.likelihood_ratio)), columns=['bigram','likelihood ratio']).sort_values(by='likelihood ratio', ascending=False)

In [73]:
bigramLikTable.head()

Unnamed: 0,bigram,likelihood ratio
0,"( , -pron-)",19560.563697
1,"(-pron-, have)",13954.808912
2,"(do, not)",10360.894292
3,"(-pron-, be)",10097.745585
4,"(in, singapore)",10067.997184


In [74]:
filteredLik_bi = bigramLikTable[bigramLikTable.bigram.map(lambda x: rightTypes(x))]

In [75]:
filteredLik_bi.head(10)

Unnamed: 0,bigram,likelihood ratio
24,"(ice, cream)",5702.415267
46,"(dim, sum)",4014.077732
76,"(michelin, star)",2674.953464
90,"(first, time)",2397.946925
95,"(bee, hoon)",2323.381667
103,"(main, course)",2218.830595
110,"(little, india)",2033.399712
119,"(foie, gra)",1975.89743
121,"(free, flow)",1939.37302
123,"(fine, dining)",1893.135982


In [76]:
trigramLikTable = pd.DataFrame(list(trigramFinder.score_ngrams(trigrams.likelihood_ratio)), columns=['trigram','likelihood ratio']).sort_values(by='likelihood ratio', ascending=False)

In [77]:
trigramLikTable.head()

Unnamed: 0,trigram,likelihood ratio
0,"( , -pron-, have)",50794.051757
1,"( , -pron-, be)",50077.418314
2,"(be, , -pron-)",42706.664042
3,"(the, food, be)",41525.333609
4,"(amp, , -pron-)",37555.519943


In [78]:
filteredLik_tri = trigramLikTable[trigramLikTable.trigram.map(lambda x: rightTypesTri(x))]

In [79]:
filteredLik_tri.head(20)

Unnamed: 0,trigram,likelihood ratio
1307,"(vanilla, ice, cream)",9079.999392
1319,"(keluak, ice, cream)",8986.17417
1591,"(fine, dining, experience)",7187.516348
2050,"(michelin, star, restaurant)",4810.427538
2118,"(beach, road, kitchen)",4537.000934
2166,"(good, dining, experience)",4364.239334
2194,"(crab, bee, hoon)",4240.804258
2286,"(bee, hoon, soup)",3881.226689
2494,"(north, indian, food)",3288.234831
2551,"(fine, dining, restaurant)",3122.226393


In [80]:
lik_bi = filteredLik_bi[:20].bigram.values

In [81]:
lik_tri = filteredLik_tri[:20].trigram.values

## Bigram Comparison

In [82]:
bigramsCompare = pd.DataFrame([freq_bi, pmi_bi, t_bi, chi_bi, lik_bi]).T

In [83]:
bigramsCompare.columns = ['Frequency With Filter', 'PMI', 'T-test With Filter', 'Chi-Sq Test', 'Likelihood Ratio Test With Filter']

In [84]:
bigramsCompare

Unnamed: 0,Frequency With Filter,PMI,T-test With Filter,Chi-Sq Test,Likelihood Ratio Test With Filter
0,"(good, food)","(tiong, bahru)","(ice, cream)","(tanjong, pagar)","(ice, cream)"
1,"(ice, cream)","(tanjong, pagar)","(service, staff)","(buah, keluak)","(dim, sum)"
2,"(service, staff)","(earl, grey)","(first, time)","(bee, hoon)","(michelin, star)"
3,"(good, service)","(telok, ayer)","(good, food)","(mondo, mio)","(first, time)"
4,"(first, time)","(ngoh, hiang)","(great, place)","(aglio, olio)","(bee, hoon)"
5,"(great, food)","(aglio, olio)","(dim, sum)","(tiong, bahru)","(main, course)"
6,"(great, place)","(sri, lankan)","(main, course)","(telok, ayer)","(little, india)"
7,"(good, place)","(xi, yan)","(good, service)","(gula, melaka)","(foie, gra)"
8,"(great, service)","(gula, melaka)","(great, service)","(amuse, bouche)","(free, flow)"
9,"(dim, sum)","(ikan, bili)","(good, value)","(ngoh, hiang)","(fine, dining)"
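
One way to read the table is to count how many bigrams two methods share. The sketch below does this for the ten rows shown above, with the lists copied by hand from the table; the variable names and the helper `overlap` are assumptions for this example.

```python
# Top-10 lists copied from the comparison table above
freq_top10 = [("good", "food"), ("ice", "cream"), ("service", "staff"),
              ("good", "service"), ("first", "time"), ("great", "food"),
              ("great", "place"), ("good", "place"), ("great", "service"),
              ("dim", "sum")]
t_top10 = [("ice", "cream"), ("service", "staff"), ("first", "time"),
           ("good", "food"), ("great", "place"), ("dim", "sum"),
           ("main", "course"), ("good", "service"), ("great", "service"),
           ("good", "value")]
pmi_top10 = [("tiong", "bahru"), ("tanjong", "pagar"), ("earl", "grey"),
             ("telok", "ayer"), ("ngoh", "hiang"), ("aglio", "olio"),
             ("sri", "lankan"), ("xi", "yan"), ("gula", "melaka"),
             ("ikan", "bili")]
chi_top10 = [("tanjong", "pagar"), ("buah", "keluak"), ("bee", "hoon"),
             ("mondo", "mio"), ("aglio", "olio"), ("tiong", "bahru"),
             ("telok", "ayer"), ("gula", "melaka"), ("amuse", "bouche"),
             ("ngoh", "hiang")]

def overlap(a, b):
    """Number of bigrams that appear in both lists."""
    return len(set(a) & set(b))

print("frequency vs t-test:", overlap(freq_top10, t_top10))
print("PMI vs chi-square:  ", overlap(pmi_top10, chi_top10))
print("frequency vs PMI:   ", overlap(freq_top10, pmi_top10))
```

On these rows, frequency and the t-test share most of their entries, PMI and chi-square share more than half, and frequency and PMI share none. The methods fall into two families: one favors frequent generic phrases, the other favors rare but exclusive names and dishes.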


## Trigram Comparison

In [85]:
trigramsCompare = pd.DataFrame([freq_tri, pmi_tri, t_tri, chi_tri, lik_tri]).T

In [86]:
trigramsCompare.columns = ['Frequency With Filter', 'PMI', 'T-test With Filter', 'Chi-Sq Test', 'Likelihood Ratio Test With Filter']

In [87]:
trigramsCompare

Unnamed: 0,Frequency With Filter,PMI,T-test With Filter,Chi-Sq Test,Likelihood Ratio Test With Filter
0,"(beach, road, kitchen)","(din, tai, fung)","(beach, road, kitchen)","(din, tai, fung)","(vanilla, ice, cream)"
1,"(michelin, star, restaurant)","(kueh, pie, tee)","(michelin, star, restaurant)","(kueh, pie, tee)","(keluak, ice, cream)"
2,"(crab, bee, hoon)","(sri, lankan, crab)","(crab, bee, hoon)","(xiao, long, bao)","(fine, dining, experience)"
3,"(din, tai, fung)","(squid, ink, paella)","(din, tai, fung)","(sri, lankan, crab)","(michelin, star, restaurant)"
4,"(chinese, new, year)","(angel, hair, pasta)","(chinese, new, year)","(squid, ink, paella)","(beach, road, kitchen)"
5,"(f, amp, b)","(xiao, long, bao)","(f, amp, b)","(angel, hair, pasta)","(good, dining, experience)"
6,"(fine, dining, restaurant)","(f, amp, b)","(fine, dining, restaurant)","(f, amp, b)","(crab, bee, hoon)"
7,"(north, indian, food)","(buah, keluak, ice)","(squid, ink, paella)","(beach, road, kitchen)","(bee, hoon, soup)"
8,"(squid, ink, paella)","(salted, egg, yolk)","(north, indian, food)","(buah, keluak, ice)","(north, indian, food)"
9,"(good, indian, food)","(crab, bee, hoon)","(good, indian, food)","(crab, bee, hoon)","(fine, dining, restaurant)"
