# Statistical NLP Model

This program implements a bigram probability model, a basic statistical technique in natural language processing (NLP).
It analyzes a text and finds, for each word, the word that most frequently follows it, along with the probability of that transition.


⭐ What This Program Does (Short Summary)

It builds a bigram model and calculates, for every word w1,
which next word w2 has the highest probability of occurring after it.

In simple terms:
➡ "For each word, which word usually comes next?"

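Probability formula:

$$P(w_2 \mid w_1) = \frac{\mathrm{Count}(w_1, w_2)}{\mathrm{Count}(w_1)}$$

Here Count(w1, w2) is the number of times w2 immediately follows w1, and Count(w1) is the number of times w1 is followed by any word.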
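As a worked instance of the formula, the following minimal sketch counts bigrams in a short made-up sentence and evaluates the ratio directly. The sample sentence and the helper name `conditional_probability` are illustrative assumptions, not part of the program.

```python
from collections import Counter

# Hypothetical sample sentence, used only to illustrate the formula.
tokens = "to learn and to guide and to learn".split()

# Every adjacent pair (w1, w2) in the token list.
pairs = list(zip(tokens, tokens[1:]))

bigram_counts = Counter(pairs)                     # Count(w1, w2)
first_counts = Counter(w1 for w1, _ in pairs)      # Count(w1) as the first word of a pair


def conditional_probability(w1, w2):
    """Return P(w2 | w1), or 0.0 if w1 is never followed by any word."""
    if first_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / first_counts[w1]


print(f"{conditional_probability('to', 'learn'):.2f}")  # 0.67 (2 of the 3 words after 'to')
print(f"{conditional_probability('to', 'guide'):.2f}")  # 0.33
```

The final `learn` in the sentence adds nothing to the denominator for `learn`, because no word follows it.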


⭐ What Type of Program is This?

This program is a:

--> Bigram Language Model

--> Statistical NLP Model

--> Next-word probability calculator

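Before deep learning, NLP relied heavily on n-gram models, which predict a word from the n-1 words before it:

Unigrams (n = 1, no context)

Bigrams (n = 2, one word of context)

Trigrams (n = 3, two words of context)

Higher-order n-grams (n > 3)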
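The difference between these model types is only the window size. A minimal sketch, assuming a hypothetical helper named `ngrams` that is not part of the program, shows how the same token list yields unigrams, bigrams, or trigrams:

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    # Zipping n shifted copies of the list lines up each token with its successors.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "education empowers students to".split()

print(ngrams(tokens, 1))
# [('education',), ('empowers',), ('students',), ('to',)]
print(ngrams(tokens, 2))
# [('education', 'empowers'), ('empowers', 'students'), ('students', 'to')]
print(ngrams(tokens, 3))
# [('education', 'empowers', 'students'), ('empowers', 'students', 'to')]
```

A list of k tokens produces k - n + 1 n-grams, so larger n gives fewer, more specific windows.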

⭐ Uses of This Program

Predicting next words

Text generation (basic)

Understanding word patterns

Language modeling

Autocomplete systems (early versions)
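Next-word prediction and basic text generation can both be sketched by repeatedly following the most frequent successor. The example below is one possible approach under stated assumptions: the function names `build_next_word_table` and `generate` and the sample sentence are hypothetical and do not appear in the program.

```python
from collections import Counter, defaultdict


def build_next_word_table(text):
    """Map each word to a Counter of the words that follow it."""
    words = text.split()
    table = defaultdict(Counter)
    for w1, w2 in zip(words, words[1:]):
        table[w1][w2] += 1
    return table


def generate(table, start, max_words=5):
    """Greedily follow the most frequent next word, starting from `start`."""
    output = [start]
    current = start
    for _ in range(max_words - 1):
        if current not in table:  # no observed successor, so stop early
            break
        current = table[current].most_common(1)[0][0]
        output.append(current)
    return " ".join(output)


# Hypothetical sample text, chosen only for illustration.
table = build_next_word_table("the cat sat on the mat and the cat slept")
print(generate(table, "sat", max_words=4))  # sat on the cat
```

Greedy selection is deterministic and can fall into loops on small texts, which is why `max_words` caps the output length. Sampling the next word in proportion to its count is a common alternative.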

In [10]:
from collections import defaultdict


def highest_probability(text):
    """For each word w1, return its most likely next word w2 and P(w2 | w1)."""
    # Step 1: tokenize on whitespace (case and punctuation are kept as-is).
    words = text.split()

    # Step 2: count each bigram (w1, w2) and how often w1 starts a bigram.
    bigram_counts = defaultdict(lambda: defaultdict(int))
    first_word_counts = defaultdict(int)
    for w1, w2 in zip(words, words[1:]):
        bigram_counts[w1][w2] += 1
        first_word_counts[w1] += 1

    # Step 3: compute P(w2 | w1) and keep the most probable w2 for each w1.
    result = {}
    for w1 in bigram_counts:
        max_prob = 0
        best_w2 = None
        for w2 in bigram_counts[w1]:
            prob = bigram_counts[w1][w2] / first_word_counts[w1]
            if prob > max_prob:  # strict ">" keeps the first-seen word on ties
                max_prob = prob
                best_w2 = w2
        result[w1] = (best_w2, max_prob)
    return result


text = "Education empowers students to learn and education helps teachers to guide and education creates opportunities to learn and grow in education systems around the world."
output = highest_probability(text)
print("Highest probability of word (w2) occurring after another word (w1):")
for w1, (w2, prob) in output.items():
    print(f"After '{w1}' -> '{w2}' with probability {prob:.2f}")


Highest probability of word (w2) occurring after another word (w1):
After 'Education' -> 'empowers' with probability 1.00
After 'empowers' -> 'students' with probability 1.00
After 'students' -> 'to' with probability 1.00
After 'to' -> 'learn' with probability 0.67
After 'learn' -> 'and' with probability 1.00
After 'and' -> 'education' with probability 0.67
After 'education' -> 'helps' with probability 0.33
After 'helps' -> 'teachers' with probability 1.00
After 'teachers' -> 'to' with probability 1.00
After 'guide' -> 'and' with probability 1.00
After 'creates' -> 'opportunities' with probability 1.00
After 'opportunities' -> 'to' with probability 1.00
After 'grow' -> 'in' with probability 1.00
After 'in' -> 'education' with probability 1.00
After 'systems' -> 'around' with probability 1.00
After 'around' -> 'the' with probability 1.00
After 'the' -> 'world.' with probability 1.00
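
The output above lists 'Education' and 'education' as separate words and keeps the period in 'world.', because `text.split()` splits only on whitespace. A minimal sketch of a normalizing tokenizer follows. The name `tokenize` and the choice of regular expression are assumptions for illustration, not part of the program.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


print(tokenize("Education empowers students. education helps the world."))
# ['education', 'empowers', 'students', 'education', 'helps', 'the', 'world']
```

Using such a tokenizer in place of `text.split()` would count all occurrences of 'education' together, which changes the resulting probabilities for that word.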
