In [None]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


I've chosen the third assignment: classifying whether a given bigram is a collocation or merely a co-occurrence.

Let's start with some motivation for why extracting collocations might be of interest:
1. Improving insight analysis and topic modeling.
2. Identifying the most relevant keywords in a document.

In [None]:
import os
PATH='/content/gdrive/MyDrive/gdrive/Github'
os.chdir(os.path.join(PATH,'NLP_assignment'))

The dataset we will use

Hotel reviews data:
https://www.kaggle.com/datafiniti/hotel-reviews/data

The relevant data is in the 7282_1.csv file, and the reviews themselves are in the reviews.text column.

This dataset was chosen mainly because it is small (to save processing time),
but still descriptive enough to demonstrate the approach.

The first step is, as always, data preparation. All the required functions are in the clean_data.py script.

The script contains functions for punctuation removal, non-ASCII character removal, contraction expansion, etc.
The prepare_data function applies all of these functions to the reviews in a logical order, so that we end up with text that is suitable to work with.
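
The clean_data.py script itself is not shown in this notebook. As a rough illustration only, the kind of steps listed above could be sketched as follows. The function name `clean_review` and the tiny contraction table are hypothetical, and lemmatization (which the real pipeline appears to perform, judging by its output) is left out.

```python
import string

# Hypothetical, deliberately tiny contraction table; a real pipeline would
# use a much larger mapping.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "we're": "we are"}

# Map every punctuation character to a space so that words glued together
# by punctuation (e.g. "hotel.stayed") are still split apart.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_review(text):
    """Lowercase, expand contractions, drop non-ASCII and punctuation, tokenize."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Drop every character outside the ASCII range
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = text.translate(PUNCT_TO_SPACE)
    return text.split()


print(clean_review("It's a lovely hotel ★, we don't regret it!"))
```

Contractions are expanded before punctuation is removed; otherwise the apostrophe would be gone by the time the lookup runs.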

In [None]:
!pip install contractions
!python -m spacy download en_core_web_sm

In [None]:
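import nltk          # used below for bigram collocation scoring
import pandas as pd  # used below for loading the data and ranking bigrams
import clean_data
from clean_data import *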

In [None]:
#load data
reviews = pd.read_csv('7282_1.csv')
#extract reviews
comments = reviews[['reviews.text']]

In [None]:
comments_preproccesed=clean_data.prepare_data(comments,'reviews.text')

In [None]:
comments_preproccesed.head()

Unnamed: 0,reviews.text,processed_reviews.text
0,Pleasant 10 min walk along the sea front to th...,"[pleasant, 10, min, walk, along, the, sea, fro..."
1,Really lovely hotel. Stayed on the very top fl...,"[really, lovely, hotel, stay, on, the, very, t..."
2,We stayed here for four nights in October. The...,"[we, stay, here, for, four, night, in, october..."
3,We loved staying on the island of Lido! You ne...,"[we, love, stay, on, the, island, of, lido, yo..."
4,Lovely view out onto the lagoon. Excellent vie...,"[lovely, view, out, onto, the, lagoon, excelle..."


In [None]:
comments_list=comments_preproccesed["processed_reviews.text"].values
final=[val for sublist in comments_list for val in sublist]

Let's generate some bigrams and rank them by frequency.

In [None]:
bigrams = nltk.collocations.BigramAssocMeasures()
bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(final)
bigram_freq = bigramFinder.ngram_fd.items()
bigramFreq = pd.DataFrame(list(bigram_freq), columns=['bigram','freq']).sort_values(by='freq', ascending=False)

Let's take a look at the bigrams

In [None]:
bigramFreq[:5]

Unnamed: 0,bigram,freq
193,"(it, be)",8995
96,"(the, room)",8782
97,"(room, be)",8431
302,"(in, the)",8240
171,"(be, very)",7733


We see that stop words, articles, prepositions, and pronouns are very common in the texts but carry little meaning, and they prevent us from obtaining a meaningful list of co-occurrences/collocations.

Let's do something about it!

We will keep only those bigrams that contain none of the above and have one of the following structures:

(Noun, Noun), (Adjective, Noun)

In practice, we will apply a POS filter.

The filter_matching function is in the clean_data file.
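
The body of filter_matching is not shown here. A minimal sketch of a POS filter of this kind, assuming the bigram has already been tagged with Penn Treebank tags by some tagger, might look like this. The name `matches_pattern` and the input format are assumptions made for illustration, not the actual implementation.

```python
# Penn Treebank tag prefixes: NN* covers nouns, JJ* covers adjectives
NOUN_PREFIX = "NN"
ADJ_PREFIX = "JJ"


def matches_pattern(tagged_bigram):
    """Return True for (Noun, Noun) or (Adjective, Noun) bigrams.

    `tagged_bigram` is ((word1, tag1), (word2, tag2)); the tags are assumed
    to come from a POS tagger that was run beforehand.
    """
    (_, tag1), (_, tag2) = tagged_bigram
    first_ok = tag1.startswith(NOUN_PREFIX) or tag1.startswith(ADJ_PREFIX)
    return first_ok and tag2.startswith(NOUN_PREFIX)


print(matches_pattern((("front", "NN"), ("desk", "NN"))))   # True
print(matches_pattern((("nice", "JJ"), ("hotel", "NN"))))   # True
print(matches_pattern((("in", "IN"), ("the", "DT"))))       # False
```

Function words such as prepositions and determiners carry tags outside the NN*/JJ* families, so a pattern check like this rejects most stop-word bigrams as a side effect.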

In [None]:
filtered_bigrams = bigramFreq[bigramFreq.bigram.map(lambda x: clean_data.filter_matching(x))]  

In [None]:
filtered_bigrams.head(10)

Unnamed: 0,bigram,freq
1033,"(front, desk)",2616
65,"(great, location)",795
252,"(friendly, staff)",761
2450,"(walk, distance)",684
1435,"(clean, room)",639
5190,"(hot, tub)",593
85,"(hotel, staff)",590
249,"(nice, hotel)",528
3006,"(continental, breakfast)",521
4042,"(free, breakfast)",520


Looks much better now!

We see that the top results, filtered by POS and ranked by frequency, are in fact a mixture of two groups: 
collocations, i.e. words that make more sense together and co-occur more often in a given context than they occur separately, such as walk-distance or front-desk,
and bigrams that are just co-occurrences, such as nice hotel. 
Frequency alone is therefore not enough to tell which group a given bigram belongs to.
So...
Let's try a more sophisticated analysis!

Let's do hypothesis testing with 3 tests:
t-test, chi-square, likelihood ratios

**T-test** : 

It is used to compare the means of two samples; here, the observed probability of a bigram is compared with the probability expected under independence. 

Point to note: 
The t-test assumes that probabilities are approximately normally distributed, which is not true in general!

Let's assume our corpus consists of N words, and we examine a given bigram (a1, a2).

Null hypothesis: a1 and a2 are independent.

$H_0$: the occurrence of "a1 a2" has probability $\mu = P(a1)P(a2) = \frac{Count(a1)}{N} \cdot \frac{Count(a2)}{N}$

Alternative hypothesis:

$H_1$: the occurrence of "a1 a2" does not have the expected probability $\mu$

t-statistic: $t = \frac{Count(a1\ a2)/N - \mu}{\sqrt{s^2/N}}$

For a Bernoulli trial: $s^2 = P(1-P) \approx P \approx Count(a1\ a2)/N$ (since P is very small)
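
To make the formula concrete, here is a minimal sketch that computes the t-statistic directly from counts. The function name `t_score` and the counts in the example calls are hypothetical; they are not taken from the hotel reviews corpus.

```python
import math


def t_score(c1, c2, c12, n):
    """t-statistic for the bigram (a1, a2).

    c1, c2 -- counts of a1 and a2; c12 -- count of "a1 a2"; n -- corpus size.
    """
    x_bar = c12 / n             # observed bigram probability
    mu = (c1 / n) * (c2 / n)    # probability expected under independence
    s2 = x_bar                  # Bernoulli variance P(1 - P), approximated by P
    return (x_bar - mu) / math.sqrt(s2 / n)


# Made-up counts: both words occur 100 times in a corpus of 10,000 words.
print(t_score(100, 100, 1, 10_000))    # bigram seen once, as independence predicts
print(t_score(100, 100, 50, 10_000))   # bigram seen 50 times, far more than predicted
```

The first call gives a score of about zero because the observed count equals the expected one; the second gives a clearly positive score.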


**Chi-square test** :

Points to note: 
1. The chi-square test does not assume normally distributed probabilities!
2. The chi-square test is also appropriate for large probabilities. 
3. The chi-square test is not appropriate for sparse data: if the numbers in the 2-by-2 table are small, very low-frequency bigrams might get very high chi-square values, which is a misleading result!

The null hypothesis is the same as in the t-test.

For each bigram, the following table is calculated:

| | word1 == a1 | word1 != a1 |
|---|---|---|
| word2 == a2 | Count("a1 a2") | Count("x a2"), x != a1 |
| word2 != a2 | Count("a1 x"), x != a2 | Count("x1 x2"), x1 != a1 and x2 != a2 |

$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

$O_{ij}$ - the observed value in the table above for row i, column j

$E_{ij}$ - the value expected under independence, (total of row i) * (total of column j) / N; for the top-left cell this is $N \cdot \frac{Count(a1)}{N} \cdot \frac{Count(a2)}{N}$

Taking the points above into account, we should filter out results that are based on small frequencies. After examining the data, we concluded that bigrams with a frequency of 20 or less are unlikely to form collocations, so we will ignore them in this test.
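
As an illustration, here is a minimal sketch of the chi-square statistic for a single bigram, built from the 2-by-2 table described above. The function name `chi_square` and all counts are hypothetical and not taken from the hotel reviews corpus, and N is treated as the total number of bigram positions.

```python
def chi_square(c1, c2, c12, n):
    """Chi-square statistic for the bigram (a1, a2).

    c1, c2 -- counts of a1 and a2; c12 -- count of "a1 a2"; n -- corpus size.
    """
    # Observed table: rows are word2 == a2 / word2 != a2,
    # columns are word1 == a1 / word1 != a1.
    observed = [
        [c12, c2 - c12],
        [c1 - c12, n - c1 - c2 + c12],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Made-up counts in a corpus of 10,000 words.
print(chi_square(100, 100, 1, 10_000))   # observed count matches independence
print(chi_square(1, 1, 1, 10_000))       # two words seen once each, together
```

The second call shows the sparse-data problem: a pair seen only once receives a score equal to the corpus size, higher than any frequent bigram could reach, which is why low-frequency bigrams need to be filtered out.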

**Likelihood ratios**

Two hypotheses are compared in the likelihood ratio test:

– Hypothesis 1: formalization of independence, $P(a2|a1) = p = P(a2|\neg a1)$

– Hypothesis 2: formalization of dependence, $P(a2|a1) = p_1 \neq p_2 = P(a2|\neg a1)$


Assuming a binomial distribution, the log likelihood ratio is calculated as follows:

$\log\lambda = \log\frac{L(H_1)}{L(H_2)} = \log\frac{b(c_{12};\, c_1, p)\; b(c_2 - c_{12};\, N - c_1, p)}{b(c_{12};\, c_1, p_1)\; b(c_2 - c_{12};\, N - c_1, p_2)}$

$= \log L(c_{12}; c_1, p) + \log L(c_2 - c_{12}; N - c_1, p) - \log L(c_{12}; c_1, p_1) - \log L(c_2 - c_{12}; N - c_1, p_2)$

where $L(k; n, x) = x^k (1-x)^{n-k}$

$c_1$ - frequency of a1

$c_2$ - frequency of a2

$c_{12}$ - frequency of the bigram "a1 a2"

N - total number of words in the corpus

$p = c_2/N$, $p_1 = c_{12}/c_1$, $p_2 = (c_2 - c_{12})/(N - c_1)$

Bigrams are ranked by $-2\log\lambda$: the higher this value, the more likely it is that "a1 a2" is a collocation.
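
The calculation can be sketched in a few lines of Python. The function names and the example counts below are hypothetical; the sketch returns $-2\log\lambda$, so that larger values mean stronger evidence against independence.

```python
import math


def log_l(k, n, x):
    """log of x^k * (1 - x)^(n - k); terms with a zero exponent are skipped."""
    total = 0.0
    if k > 0:
        total += k * math.log(x)
    if n - k > 0:
        total += (n - k) * math.log(1 - x)
    return total


def neg2_log_lambda(c1, c2, c12, n):
    """-2 log(lambda) for the bigram (a1, a2).

    c1, c2 -- counts of a1 and a2; c12 -- count of "a1 a2"; n -- corpus size.
    """
    p = c2 / n                    # P(a2) under independence
    p1 = c12 / c1                 # P(a2 | a1) under dependence
    p2 = (c2 - c12) / (n - c1)    # P(a2 | not a1) under dependence
    log_lambda = (
        log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
        - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2)
    )
    return -2 * log_lambda


# Made-up counts in a corpus of 10,000 words.
print(neg2_log_lambda(100, 100, 1, 10_000))    # counts consistent with independence
print(neg2_log_lambda(100, 100, 50, 10_000))   # a2 follows a1 half of the time
```

Working with logarithms keeps the computation numerically stable, since the raw likelihoods are products of many small probabilities.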

**T-test**

In [None]:
bigramT_test = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.student_t)), columns=['bigram','t-score']).sort_values(by='t-score', ascending=False)
# keep only bigrams that contain no stop words and have the structure (Noun, Noun) or (Adjective, Noun)
filteredT_bigrams = bigramT_test[bigramT_test.bigram.map(lambda x: clean_data.filter_matching(x))]
bigramT_test_freq=pd.merge(filteredT_bigrams, bigramFreq, on='bigram').sort_values(by='t-score', ascending=False)
#bigramT_test_freq_filtered=bigramT_test_freq[bigramT_test_freq['freq']>20].reset_index(drop=True)
bigramT_test_freq.head(10)

Unnamed: 0,bigram,t-score,freq
0,"(front, desk)",50.998854,2616
1,"(great, location)",27.315524,795
2,"(friendly, staff)",26.362455,761
3,"(walk, distance)",26.102098,684
4,"(hot, tub)",24.301816,593
5,"(continental, breakfast)",22.689935,521
6,"(free, breakfast)",22.326432,520
7,"(great, place)",21.258349,510
8,"(parking, lot)",20.579494,428
9,"(customer, service)",20.305471,415


Note that this ranking closely resembles the ranking by raw frequency.

**Chi-Square test**

In [None]:
bigramChi_test = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.chi_sq)), columns=['bigram','chi-sq']).sort_values(by='chi-sq', ascending=False)
# keep only bigrams that contain no stop words and have the structure (Noun, Noun) or (Adjective, Noun)
filteredChi_bigrams = bigramChi_test[bigramChi_test.bigram.map(lambda x: clean_data.filter_matching(x))]
bigramChi_test_freq=pd.merge(filteredChi_bigrams, bigramFreq, on='bigram').sort_values(by='chi-sq', ascending=False)
bigramChi_test_freq_filtered=bigramChi_test_freq[bigramChi_test_freq['freq']>20].reset_index(drop=True)
bigramChi_test_freq_filtered.head(10)

Unnamed: 0,bigram,chi-sq,freq
0,"(wi, fi)",1450628.0,225
1,"(cracker, barrel)",1180855.0,44
2,"(howard, johnson)",1069493.0,38
3,"(la, quinta)",934854.7,130
4,"(front, desk)",902855.7,2616
5,"(santa, barbara)",789857.3,36
6,"(santana, row)",741320.7,51
7,"(elk, springs)",691330.9,58
8,"(french, quarter)",663305.3,98
9,"(flat, screen)",618073.7,115


All the top 10 bigrams are actually collocations!

**Likelihood test**

In [None]:
bigramLikelihood_test = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.likelihood_ratio)), columns=['bigram','likelihood']).sort_values(by='likelihood', ascending=False)
# keep only bigrams that contain no stop words and have the structure (Noun, Noun) or (Adjective, Noun)
filteredLikelihood_bigrams = bigramLikelihood_test[bigramLikelihood_test.bigram.map(lambda x: clean_data.filter_matching(x))]
bigramLikelihood_test_freq=pd.merge(filteredLikelihood_bigrams, bigramFreq, on='bigram').sort_values(by='likelihood', ascending=False)
#bigramLikelihood_test_freq_filtered=bigramLikelihood_test_freq[bigramLikelihood_test_freq['freq']>20].reset_index(drop=True)
bigramLikelihood_test_freq.head(10)

Unnamed: 0,bigram,likelihood,freq
0,"(front, desk)",31155.002334,2616
1,"(walk, distance)",8289.266979,684
2,"(hot, tub)",6791.858799,593
3,"(continental, breakfast)",5077.996503,521
4,"(customer, service)",4336.616941,415
5,"(wi, fi)",4291.368227,225
6,"(great, location)",4191.052979,795
7,"(parking, lot)",3852.821968,428
8,"(holiday, inn)",3502.51899,291
9,"(friendly, staff)",3447.044517,761


Let's compare the results of the three tests!

In [None]:
# Top 20 bigrams of each test, side by side
tables = [bigramT_test_freq, bigramChi_test_freq_filtered, bigramLikelihood_test_freq]
bigramsCompare = pd.DataFrame([t.bigram.values[:20] for t in tables]).T
bigramsCompare.columns = ['T-test', 'Chi-square test', 'Likelihood ratio']
bigramsCompare

Unnamed: 0,T-test,Chi-square test,Likelihood ratio
0,"(front, desk)","(wi, fi)","(front, desk)"
1,"(great, location)","(cracker, barrel)","(walk, distance)"
2,"(friendly, staff)","(howard, johnson)","(hot, tub)"
3,"(walk, distance)","(la, quinta)","(continental, breakfast)"
4,"(hot, tub)","(front, desk)","(customer, service)"
5,"(continental, breakfast)","(santa, barbara)","(wi, fi)"
6,"(free, breakfast)","(santana, row)","(great, location)"
7,"(great, place)","(elk, springs)","(parking, lot)"
8,"(parking, lot)","(french, quarter)","(holiday, inn)"
9,"(customer, service)","(flat, screen)","(friendly, staff)"


**Conclusion**

The superiority of the Chi-square and likelihood ratio results over the t-test results is obvious: the t-test ranking closely follows the ranking by raw frequency.
The basic statistical assumption of the t-test (normally distributed probabilities) limits the applicability of this test.

The results of the Chi-square test and the likelihood ratio test are quite similar; however, the latter is definitely preferable, since it does not share the limiting assumptions of the Chi-square test and therefore does not require manually examining the data and choosing a cutoff for low-frequency items, as we did above.
