# Word Collocations

Collocations are pairs or groups of words that frequently appear together in a language.

They sound “natural” to native speakers because they occur commonly.

Examples:

- strong tea (not powerful tea)
- make a decision (not do a decision)
- fast food, heavy rain, take a break

Collocations can be:

- Adjective + noun (e.g., "heavy rain")
- Verb + noun (e.g., "make money")
- Adverb + verb (e.g., "deeply regret")
- Noun + noun (e.g., "traffic jam")
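The patterns above can be matched mechanically once each word has a part-of-speech tag. The following is a minimal sketch under stated assumptions: the coarse tags are assigned by hand for illustration (a real pipeline would take them from a tagger), and `PATTERNS` and `classify_pairs` are hypothetical names, not part of NLTK.

```python
# Hypothetical, hand-assigned coarse POS tags for one example sentence.
tagged = [
    ("they", "PRON"), ("make", "VERB"), ("money", "NOUN"),
    ("despite", "ADP"), ("the", "DET"), ("traffic", "NOUN"),
    ("jam", "NOUN"), ("and", "CONJ"), ("heavy", "ADJ"), ("rain", "NOUN"),
]

# Tag pairs that correspond to common collocation types.
PATTERNS = {
    ("ADJ", "NOUN"): "adjective + noun",
    ("VERB", "NOUN"): "verb + noun",
    ("ADV", "VERB"): "adverb + verb",   # e.g. "deeply regret"
    ("NOUN", "NOUN"): "noun + noun",
}


def classify_pairs(tagged_tokens):
    """Return (phrase, pattern) for each adjacent pair whose tags match a pattern."""
    matches = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        label = PATTERNS.get((t1, t2))
        if label is not None:
            matches.append((f"{w1} {w2}", label))
    return matches


pairs = classify_pairs(tagged)
for phrase, label in pairs:
    print(f"{phrase}: {label}")
```

A tag pattern only proposes candidates; whether a candidate is a genuine collocation still depends on how often the pair occurs in a corpus.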

They help NLP models understand natural language patterns and improve text generation.

In NLP, collocations are detected using:

- Frequency counts
- Bigrams and trigrams
- Statistical measures like PMI (Pointwise Mutual Information)
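These three techniques can be combined in a few lines. The sketch below counts single words and adjacent pairs, then scores each pair with PMI = log2(P(w1, w2) / (P(w1) · P(w2))), estimating every probability as a count divided by the number of tokens. The function name `pmi_scores` and the toy token list are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in tokens."""
    n = len(tokens)
    unigrams = Counter(tokens)                  # frequency counts
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent pairs
    scores = {}
    for (w1, w2), joint in bigrams.items():
        # log2( (joint / n) / ((c1 / n) * (c2 / n)) ) simplifies to this form
        scores[(w1, w2)] = math.log2(joint * n / (unigrams[w1] * unigrams[w2]))
    return scores


# Toy input, assumed already lowercased and tokenized.
tokens = ["heavy", "rain", "fell", "and", "heavy", "rain",
          "stopped", "the", "heavy", "traffic"]
scores = pmi_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 4))
```

In this toy input, pairs built from words that occur only once outrank the repeated pair "heavy rain". PMI rewards rarity, which is why it is usually combined with a minimum-frequency filter.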

Collocations improve applications like machine translation, speech recognition, and text prediction.

In [11]:
# Identify word collocations
import string

import nltk
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.corpus import stopwords
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures

# Download the required NLTK resources (only needed once)
nltk.download('punkt_tab')
nltk.download('stopwords')
text="""Natural Language Processing is a fascinating \
     area of artificial intelligence.
     It deals with how computers understand \
      and generate human language.\
     Word Collocations are pairs of words that\
     appear together more often than by chance.
     For example, 'artificial intelligence',\
     'machine learning', and 'deep learning'."""

# Lowercase and tokenize, then drop stopwords and punctuation tokens.
# Note: a token such as "'artificial" keeps its leading quote, because the
# check below only removes tokens that consist of punctuation alone.
tokens = nltk.word_tokenize(text.lower())
stop_words = set(stopwords.words('english'))
filtered_tokens = [t for t in tokens if t not in stop_words and t not in string.punctuation]

# Bigrams
print("\n=== 📍 TOP BIGRAM COLLOCATIONS(with PMI scores)===")
bigram_finder = BigramCollocationFinder.from_words(filtered_tokens)
# A minimum frequency of 1 keeps every bigram; raise it to drop rare pairs on larger corpora
bigram_finder.apply_freq_filter(1)
# score_ngrams returns [((w1, w2), score), ...] sorted from highest to lowest score
scored_bigrams = bigram_finder.score_ngrams(BigramAssocMeasures.pmi)

for (w1, w2), score in scored_bigrams[:10]:  # Top 10
    print(f"{w1}-{w2} | PMI: {score:.4f}")

# Trigrams
print("\n=== 📍 TOP TRIGRAM COLLOCATIONS(with PMI scores)===")
trigram_finder = TrigramCollocationFinder.from_words(filtered_tokens)
trigram_finder.apply_freq_filter(1)  # Keeps every trigram, as above
scored_trigrams = trigram_finder.score_ngrams(TrigramAssocMeasures.pmi)

for (w1, w2, w3), score in scored_trigrams[:10]:  # Top 10
    print(f"{w1}-{w2}-{w3} | PMI: {score:.4f}")


=== 📍 TOP BIGRAM COLLOCATIONS(with PMI scores)===
appear-together | PMI: 4.8074
area-artificial | PMI: 4.8074
chance-example | PMI: 4.8074
collocations-pairs | PMI: 4.8074
computers-understand | PMI: 4.8074
deals-computers | PMI: 4.8074
example-'artificial | PMI: 4.8074
fascinating-area | PMI: 4.8074
generate-human | PMI: 4.8074
often-chance | PMI: 4.8074

=== 📍 TOP TRIGRAM COLLOCATIONS(with PMI scores)===
appear-together-often | PMI: 9.6147
chance-example-'artificial | PMI: 9.6147
collocations-pairs-words | PMI: 9.6147
computers-understand-generate | PMI: 9.6147
deals-computers-understand | PMI: 9.6147
fascinating-area-artificial | PMI: 9.6147
often-chance-example | PMI: 9.6147
pairs-words-appear | PMI: 9.6147
processing-fascinating-area | PMI: 9.6147
together-often-chance | PMI: 9.6147
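
Every n-gram listed above receives the same score, which reflects the tiny corpus rather than equally strong collocations. When each word and each n-gram occurs exactly once, count-based PMI reduces to log2(N) for bigrams and 2 · log2(N) for trigrams, where N is the number of filtered tokens. The check below assumes N = 28, the value implied by the printed scores; it is a sanity check on the arithmetic, not a reproduction of NLTK's internals.

```python
import math

n_tokens = 28  # assumption: filtered token count implied by the scores above

# Bigram whose words each occur once: log2(1 * N / (1 * 1))
bigram_pmi = math.log2(1 * n_tokens / (1 * 1))
# Trigram whose words each occur once: log2(1 * N**2 / (1 * 1 * 1))
trigram_pmi = math.log2(1 * n_tokens ** 2 / (1 * 1 * 1))

print(f"{bigram_pmi:.4f}")   # 4.8074
print(f"{trigram_pmi:.4f}")  # 9.6147
```

With a larger corpus and a higher minimum frequency, the scores would spread out and repeated pairs such as "artificial intelligence" could be ranked meaningfully.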


