# Natural Language Processing - N-Gram Language Modeling

## Gaurav Jain   2K20/MC/49

### Importing Libraries

In [1]:
# text preprocessing
import re

# model creation
from collections import defaultdict

# text prediction
import random


### Text Corpus On Olympic Games

In [2]:
data_text = """The modern Olympic Games or Olympics are the leading international sporting events featuring summer and winter sports competitions in which thousands of athletes from around the world participate in a variety of competitions. The Olympic Games are considered the world's foremost sports competition with more than 200 teams, representing sovereign states and territories participating; by default the Games generally substitute for any World Championships the year in which they take place (however, each class usually maintains their own records).[2] The Olympic Games are normally held every four years, and since 1994, have alternated between the Summer and Winter Olympics every two years during the four-year period.

Their creation was inspired by the ancient Olympic Games, held in Olympia, Greece from the 8th century BC to the 4th century AD. Baron Pierre de Coubertin founded the International Olympic Committee (IOC) in 1894, leading to the first modern Games in Athens in 1896. The IOC is the governing body of the Olympic Movement (which encompasses all entities and individuals involved in the Olympic Games) with the Olympic Charter defining its structure and authority.

The evolution of the Olympic Movement during the 20th and 21st centuries has resulted in several changes to the Olympic Games. Some of these adjustments include the creation of the Winter Olympic Games for snow and ice sports, the Paralympic Games for athletes with disabilities, the Youth Olympic Games for athletes aged 14 to 18, the five Continental games (Pan American, African, Asian, European, and Pacific), and the World Games for sports that are not contested in the Olympic Games. The IOC also endorses the Deaflympics and the Special Olympics. The IOC has needed to adapt to a variety of economic, political, and technological advancements. The abuse of amateur rules by the Eastern Bloc nations prompted the IOC to shift away from pure amateurism, as envisioned by Coubertin, to the acceptance of professional athletes participating at the Games. The growing importance of mass media has created the issue of corporate sponsorship and general commercialisation of the Games. World wars led to the cancellation of the 1916, 1940, and 1944 Olympics; large-scale boycotts during the Cold War limited participation in the 1980 and 1984 Olympics;[3] and the 2020 Olympics were postponed until 2021 as a result of the COVID-19 pandemic.

The Olympic Movement consists of international sports federations (IFs), National Olympic Committees (NOCs), and organising committees for each specific Olympic Games. As the decision-making body, the IOC is responsible for choosing the host city for each Games, and organises and funds the Games according to the Olympic Charter. The IOC also determines the Olympic programme, consisting of the sports to be contested at the Games. There are several Olympic rituals and symbols, such as the Olympic flag and torch, as well as the opening and closing ceremonies. Over 14,000 athletes competed at the 2020 Summer Olympics and 2022 Winter Olympics combined, in 40 different sports and 448 events.[4][5] The first-, second-, and third-place finishers in each event receive Olympic medals: gold, silver, and bronze, respectively.

The Games have grown to the point that nearly every nation is now represented; colonies and overseas territories are often allowed to field their own teams. This growth has created numerous challenges and controversies, including boycotts, doping, bribery, and terrorism. Every two years, the Olympics and its media exposure provide athletes with the chance to attain national and international fame. The Games also provide an opportunity for the host city and country to showcase themselves to the world.

The Ancient Olympic Games were religious and athletic festivals held every four years at the sanctuary of Zeus in Olympia, Greece. Competition was among representatives of several city-states and kingdoms of Ancient Greece. These Games featured mainly athletic but also combat sports such as wrestling and the pankration, horse and chariot racing events. It has been widely written that during the Games, all conflicts among the participating city-states were postponed until the Games were finished. This cessation of hostilities was known as the Olympic peace or truce.[6] This idea is a modern myth because the Greeks never suspended their wars. The truce did allow those religious pilgrims who were travelling to Olympia to pass through warring territories unmolested because they were protected by Zeus.[7]

The origin of the Olympics is shrouded in mystery and legend;[8] one of the most popular myths identifies Heracles and his father Zeus as the progenitors of the Games.[9][10][11] According to legend, it was Heracles who first called the Games "Olympic" and established the custom of holding them every four years.[12] The myth continues that after Heracles completed his twelve labours, he built the Olympic Stadium as an honour to Zeus. Following its completion, he walked in a straight line for 200 steps and called this distance a "stadion" (Ancient Greek: στάδιον, Latin: stadium, "stage"), which later became a unit of distance. The most widely accepted inception date for the Ancient Olympics is 776 BC; this is based on inscriptions, found at Olympia, listing the winners of a footrace held every four years starting in 776 BC.[13] The Ancient Games featured running events, a pentathlon (consisting of a jumping event, discus and javelin throws, a foot race, and wrestling), boxing, wrestling, pankration, and equestrian events.[14][15] Tradition has it that Coroebus, a cook from the city of Elis, was the first Olympic champion.[16]

The Olympics were of fundamental religious importance, featuring sporting events alongside ritual sacrifices honouring both Zeus (whose famous statue by Phidias stood in his temple at Olympia) and Pelops, divine hero and mythical king of Olympia. Pelops was famous for his chariot race with King Oenomaus of Pisatis.[17] The winners of the events were admired and immortalised in poems and statues.[18] The Games were held every four years, and this period, known as an Olympiad, was used by Greeks as one of their units of time measurement. The Games were part of a cycle known as the Panhellenic Games, which included the Pythian Games, the Nemean Games, and the Isthmian Games.[19]

The Olympic Games reached the height of their success in the 6th and 5th centuries BC, but then gradually declined in importance as the Romans gained power and influence in Greece. While there is no scholarly consensus as to when the Games officially ended, the most commonly held date is 393 AD, when the emperor Theodosius I decreed that all pagan cults and practices be eliminated.[b] Another date commonly cited is 426 AD, when his successor, Theodosius II, ordered the destruction of all Greek temples.[21]"""

### Cleaning The Text Data

In [3]:
def text_cleaner(text):
    # mark sentence boundaries: 'stkn' is the starting token and 'etkn' is the ending token
    t_text = 'stkn ' + text
    t_text = re.sub(r"\.", " etkn stkn", t_text)

    # lower-case the text and drop possessive 's
    newString = t_text.lower()
    newString = re.sub(r"'s\b", "", newString)

    # replace everything that is not a letter (punctuation, digits) with a space
    newString = re.sub("[^a-zA-Z]", " ", newString)

    # split into words; the final period leaves a dangling 'stkn' with no
    # sentence after it, so drop it instead of slicing off a fixed number of characters
    words = newString.split()
    if words and words[-1] == 'stkn':
        words.pop()
    return " ".join(words)

# preprocess the text
data_new = text_cleaner(data_text)
data = data_new.split(' ')

In [4]:
print(data_new)

stkn the modern olympic games or olympics are the leading international sporting events featuring summer and winter sports competitions in which thousands of athletes from around the world participate in a variety of competitions etkn stkn the olympic games are considered the world foremost sports competition with more than teams representing sovereign states and territories participating by default the games generally substitute for any world championships the year in which they take place however each class usually maintains their own records etkn stkn the olympic games are normally held every four years and since have alternated between the summer and winter olympics every two years during the four year period etkn stkn their creation was inspired by the ancient olympic games held in olympia greece from the th century bc to the th century ad etkn stkn baron pierre de coubertin founded the international olympic committee ioc in leading to the first modern games in athens in etkn stkn
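The effect of the boundary tokens is easier to see on a two-sentence input. The sketch below is a minimal illustration, assuming the same conventions as the cleaner above (every period closes a sentence, non-letters are discarded); `mark_sentences` is a hypothetical helper introduced here, not a function from the notebook.

```python
import re

def mark_sentences(text):
    # hypothetical helper: wrap each sentence in 'stkn' ... 'etkn'
    tagged = 'stkn ' + re.sub(r"\.", " etkn stkn ", text)
    # lower-case and replace anything that is not a letter with a space
    tagged = re.sub(r"[^a-z]", " ", tagged.lower())
    tokens = tagged.split()
    # the last period leaves a dangling start token; drop it
    if tokens and tokens[-1] == 'stkn':
        tokens.pop()
    return tokens

tokens = mark_sentences("The Games began in 776 BC. They were held at Olympia.")
print(tokens)
# ['stkn', 'the', 'games', 'began', 'in', 'bc', 'etkn',
#  'stkn', 'they', 'were', 'held', 'at', 'olympia', 'etkn']
```

Note that the digits disappear along with the punctuation, which is why the cleaned corpus contains fragments such as "the th century bc".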

### Functions For Each N-Gram

### Bigram

In [5]:
model_bi = defaultdict(lambda: defaultdict(int))

In [6]:
n_gram = 2
for i in range(n_gram,len(data)+1):
    w1 = data[i-n_gram+0]
    w2 = data[i-n_gram+1]
    model_bi[w1][w2] += 1

In [7]:
for w1 in model_bi:
    total_count = float(sum(model_bi[w1].values()))
    for w2 in model_bi[w1]:
        model_bi[w1][w2] /= total_count
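
Cells 6 and 7 together amount to a maximum-likelihood estimate of each bigram probability. Written out, with $C(\cdot)$ denoting a raw count in the cleaned corpus (notation introduced here for illustration, not taken from the notebook):

$$
P(w_2 \mid w_1) = \frac{C(w_1, w_2)}{\sum_{w} C(w_1, w)}
$$

As a check against the `stkn` row printed in the next cell: the entries that occur once equal $1/45 \approx 0.0222$, so `stkn` is followed by a word 45 times, and the value $0.5556$ for `the` corresponds to $25/45$, meaning 25 of the 45 sentences open with "the".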

In [8]:
model_bi

defaultdict(<function __main__.<lambda>()>,
            {'stkn': defaultdict(int,
                         {'the': 0.5555555555555556,
                          'their': 0.022222222222222223,
                          'baron': 0.022222222222222223,
                          'some': 0.022222222222222223,
                          'world': 0.022222222222222223,
                          'as': 0.022222222222222223,
                          'there': 0.022222222222222223,
                          'over': 0.022222222222222223,
                          'this': 0.06666666666666667,
                          'every': 0.022222222222222223,
                          'competition': 0.022222222222222223,
                          'these': 0.022222222222222223,
                          'it': 0.022222222222222223,
                          'according': 0.022222222222222223,
                          'following': 0.022222222222222223,
                          'tradition': 0.022222222222222223,
  

In [9]:
def bigram_gen(input_text):
    
    starting_words = input_text.split(' ')
    sentence = list(starting_words)
    next_word = ''

    while next_word != 'etkn':
        w1 = sentence[-1]
        if w1 in model_bi:
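            # sample the next word in proportion to its probability given w1
            next_word = random.choices(list(model_bi[w1].keys()), weights=list(model_bi[w1].values()))[0]
            sentence.append(next_word)
        else:
            # unseen context: no continuation is known, so stop
            break
        
        # minimum length: below 30 words, reset next_word so that 'etkn' does not end the loop
        if len(sentence) < 30:
            next_word = ''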

    return ' '.join(sentence)
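
The sampling step in the generator relies on `random.choices` drawing keys in proportion to their weights. The sketch below checks that behaviour on a toy distribution; the three probabilities are made up for illustration and are not the values from `model_bi`, and the seeded `random.Random` instance is only there to make the run repeatable.

```python
import random
from collections import Counter

# toy conditional distribution P(next word | 'stkn'); values are illustrative only
next_word_probs = {'the': 0.6, 'this': 0.3, 'every': 0.1}

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
draws = rng.choices(list(next_word_probs),
                    weights=list(next_word_probs.values()),
                    k=10_000)
counts = Counter(draws)

# observed frequencies should sit close to the requested probabilities
for word, p in next_word_probs.items():
    print(f"{word:>6}: expected {p:.2f}, observed {counts[word] / len(draws):.3f}")
```

Because each call draws at random, two runs of the generator with the same starting word will usually produce different sentences unless the random state is seeded.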

### Trigram

In [10]:
model_tri = defaultdict(lambda: defaultdict(lambda: 0))

In [11]:
n_gram = 3
for i in range(n_gram,len(data)+1):
    w1 = data[i-n_gram+0]
    w2 = data[i-n_gram+1]
    w3 = data[i-n_gram+2]
    model_tri[(w1,w2)][w3] += 1

In [12]:
for w1_w2 in model_tri:
    total_count = float(sum(model_tri[w1_w2].values()))
    for w3 in model_tri[w1_w2]:
        model_tri[w1_w2][w3] /= total_count
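
After normalisation, the probabilities stored under each context should sum to 1. A quick sanity check along the following lines can catch a model that was normalised twice or not at all. `is_normalised` is a hypothetical helper and the toy model uses made-up numbers; only the `{context: {next_word: probability}}` shape is taken from the notebook.

```python
import math

def is_normalised(model, tol=1e-9):
    """True if every context's next-word probabilities sum to 1 (within tol)."""
    return all(math.isclose(sum(dist.values()), 1.0, abs_tol=tol)
               for dist in model.values())

# toy model in the same shape as model_tri; the numbers are illustrative
toy_model = {
    ('stkn', 'the'): {'olympic': 0.5, 'games': 0.25, 'ioc': 0.25},
    ('the', 'modern'): {'olympic': 1.0},
}
print(is_normalised(toy_model))                            # True
print(is_normalised({('a', 'b'): {'c': 0.7, 'd': 0.2}}))   # False
```

In the notebook the same check could be run as `is_normalised(model_tri)`.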

In [13]:
model_tri

defaultdict(<function __main__.<lambda>()>,
            {('stkn',
              'the'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'modern': 0.04,
                          'olympic': 0.16,
                          'ioc': 0.16,
                          'evolution': 0.04,
                          'abuse': 0.04,
                          'growing': 0.04,
                          'first': 0.04,
                          'games': 0.16,
                          'ancient': 0.08,
                          'truce': 0.04,
                          'origin': 0.04,
                          'myth': 0.04,
                          'most': 0.04,
                          'olympics': 0.04,
                          'winners': 0.04}),
             ('the',
              'modern'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'olympic': 1.0}),
             ...

In [14]:
def trigram_gen(input_text):
    
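    starting_words = input_text.split(' ')
    if len(starting_words) < 2:
        # a trigram context needs two preceding words
        raise ValueError("trigram_gen needs at least two starting words")
    sentence = list(starting_words)
    next_word = ''
    
    while next_word != 'etkn':
        w1, w2 = sentence[-2], sentence[-1]
        if (w1, w2) in model_tri:
            # sample the next word in proportion to its probability given (w1, w2)
            next_word = random.choices(list(model_tri[(w1, w2)].keys()), weights=list(model_tri[(w1, w2)].values()))[0]
            sentence.append(next_word)
        else:
            # unseen context: no continuation is known, so stop
            break
        
        # minimum length: below 30 words, reset next_word so that 'etkn' does not end the loop
        if len(sentence) < 30:
            next_word = ''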
            
    return ' '.join(sentence)

### Four-gram

In [15]:
model_four = defaultdict(lambda: defaultdict(int))

In [16]:
n_gram = 4
for i in range(n_gram,len(data)+1):
    w1 = data[i-n_gram+0]
    w2 = data[i-n_gram+1]
    w3 = data[i-n_gram+2]
    w4 = data[i-n_gram+3]
    model_four[(w1, w2, w3)][w4] += 1

In [17]:
for w1_w2_w3 in model_four:
    total_count = float(sum(model_four[w1_w2_w3].values()))
    for w4 in model_four[w1_w2_w3]:
        model_four[w1_w2_w3][w4] /= total_count

In [18]:
model_four

defaultdict(<function __main__.<lambda>()>,
            {('stkn', 'the', 'modern'): defaultdict(int, {'olympic': 1.0}),
             ('the', 'modern', 'olympic'): defaultdict(int, {'games': 1.0}),
             ('modern', 'olympic', 'games'): defaultdict(int, {'or': 1.0}),
             ('olympic', 'games', 'or'): defaultdict(int, {'olympics': 1.0}),
             ('games', 'or', 'olympics'): defaultdict(int, {'are': 1.0}),
             ('or', 'olympics', 'are'): defaultdict(int, {'the': 1.0}),
             ('olympics', 'are', 'the'): defaultdict(int, {'leading': 1.0}),
             ('are',
              'the',
              'leading'): defaultdict(int, {'international': 1.0}),
             ('the',
              'leading',
              'international'): defaultdict(int, {'sporting': 1.0}),
             ('leading',
              'international',
              'sporting'): defaultdict(int, {'events': 1.0}),
             ...

In [19]:
def fourgram_gen(input_text):
    
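    starting_words = input_text.split(' ')
    if len(starting_words) < 3:
        # a four-gram context needs three preceding words
        raise ValueError("fourgram_gen needs at least three starting words")
    sentence = list(starting_words)
    next_word = ''

    while next_word != 'etkn':
        w1, w2, w3 = sentence[-3], sentence[-2], sentence[-1]
        if (w1, w2, w3) in model_four:
            # sample the next word in proportion to its probability given (w1, w2, w3)
            next_word = random.choices(list(model_four[(w1, w2, w3)].keys()), weights=list(model_four[(w1, w2, w3)].values()))[0]
            sentence.append(next_word)
        else:
            # unseen context: no continuation is known, so stop
            break
            
        # minimum length: below 30 words, reset next_word so that 'etkn' does not end the loop
        if len(sentence) < 30:
            next_word = ''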
            
    return ' '.join(sentence)
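
The bigram, trigram and four-gram cells repeat the same count-then-normalise pattern with a different window size. A minimal sketch of a single builder that takes `n` as a parameter is given below. `build_ngram_model` is a hypothetical name, and unlike `model_bi` it always keys contexts by tuple (a bigram context is the 1-tuple `('the',)` rather than the string `'the'`), which is a design choice of this sketch rather than the notebook's convention.

```python
from collections import defaultdict

def build_ngram_model(tokens, n):
    """Return {context tuple: {next word: probability}} for an n-gram model."""
    if n < 2:
        raise ValueError("n must be at least 2")
    counts = defaultdict(lambda: defaultdict(int))
    # slide a window of n tokens over the corpus: n-1 context words + 1 target
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    # turn each context's counts into relative frequencies
    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        model[context] = {word: c / total for word, c in followers.items()}
    return model

toy_tokens = "stkn the games etkn stkn the olympics etkn".split()
toy_bigram = build_ngram_model(toy_tokens, 2)
toy_trigram = build_ngram_model(toy_tokens, 3)
print(toy_bigram[('stkn',)])         # {'the': 1.0}
print(toy_bigram[('the',)])          # {'games': 0.5, 'olympics': 0.5}
print(toy_trigram[('stkn', 'the')])  # {'games': 0.5, 'olympics': 0.5}
```

Returning plain dictionaries instead of `defaultdict` objects also avoids creating empty entries by accident when an unseen context is looked up.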

### Generating The Outputs

In [23]:
input_text = 'and'
generated_sentence = bigram_gen(input_text)
print(generated_sentence)

and bronze respectively etkn stkn pelops divine hero and chariot racing events etkn stkn competition was inspired by phidias stood in the world championships the sports that all greek latin stadium as an opportunity for his twelve labours he built the youth olympic flag and overseas territories are normally held in poems and bronze respectively etkn


In [21]:
input_text = 'the modern'
generated_sentence = trigram_gen(input_text)
print(generated_sentence)

the modern olympic games are considered the world foremost sports competition with more than teams representing sovereign states and territories participating by default the games according to the five continental games pan american african asian european and pacific and the world etkn


In [22]:
input_text = 'the modern olympic'
generated_sentence = fourgram_gen(input_text)
print(generated_sentence)

the modern olympic games or olympics are the leading international sporting events featuring summer and winter olympics combined in different sports and events etkn stkn tradition has it that coroebus a cook from the city of elis was the first olympic champion etkn
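
The generated strings still carry the `stkn` and `etkn` markers. For display they can be turned back into sentence-like text. The sketch below is one possible way to do that; `to_sentences` is a hypothetical helper, and the capitalisation and trailing `...` for an unfinished sentence are presentation choices assumed here.

```python
def to_sentences(generated):
    """Turn a generated token string into readable text by interpreting stkn/etkn."""
    sentences, current = [], []
    for token in generated.split():
        if token == 'stkn':
            continue                      # the start marker carries no text
        if token == 'etkn':
            if current:                   # close the sentence collected so far
                sentences.append(' '.join(current).capitalize() + '.')
                current = []
        else:
            current.append(token)
    if current:                           # words after the last 'etkn' form an unfinished sentence
        sentences.append(' '.join(current).capitalize() + ' ...')
    return ' '.join(sentences)

sample = "the modern olympic games etkn stkn tradition has it that coroebus"
print(to_sentences(sample))
# The modern olympic games. Tradition has it that coroebus ...
```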


### Reverse Generation

In [24]:
r_data = list(reversed(data))
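
Training the same counting code on the reversed token list swaps the roles of the two words: the model now estimates how likely a word is to precede a given word. In count notation, with $C(\cdot)$ a raw count over the original left-to-right corpus (notation assumed here for illustration):

$$
P_{\text{rev}}(w_{i-1} \mid w_i) = \frac{C(w_{i-1}, w_i)}{\sum_{w} C(w, w_i)}
$$

For example, the `etkn` row printed further down lists sentence-final words: the value $0.1778$ for `games` corresponds to $8/45$, meaning 8 of the 45 sentences end in "games". For the reverse trigram model the context is also read right to left, so the seed for "the games" has to be given as `'games the'`.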

### Reverse Bigram

In [25]:
model_rbi = defaultdict(lambda: defaultdict(int))

In [26]:
n_gram = 2
for i in range(n_gram,len(r_data)+1):
    w1 = r_data[i-n_gram+0]
    w2 = r_data[i-n_gram+1]
    model_rbi[w1][w2] += 1

In [27]:
for w1 in model_rbi:
    total_count = float(sum(model_rbi[w1].values()))
    for w2 in model_rbi[w1]:
        model_rbi[w1][w2] /= total_count

In [28]:
model_rbi

defaultdict(<function __main__.<lambda>()>,
            {'etkn': defaultdict(int,
                         {'temples': 0.022222222222222223,
                          'eliminated': 0.022222222222222223,
                          'greece': 0.06666666666666667,
                          'games': 0.17777777777777778,
                          'measurement': 0.022222222222222223,
                          'statues': 0.022222222222222223,
                          'pisatis': 0.022222222222222223,
                          'olympia': 0.022222222222222223,
                          'champion': 0.022222222222222223,
                          'events': 0.06666666666666667,
                          'bc': 0.022222222222222223,
                          'distance': 0.022222222222222223,
                          'zeus': 0.044444444444444446,
                          'years': 0.022222222222222223,
                          'wars': 0.022222222222222223,
                          ...

In [29]:
def bigram_rgen(input_text):
    
    starting_words = input_text.split(' ')
    sentence = list(starting_words)
    next_word = ''

    while next_word != 'stkn':
        w1 = sentence[-1]
        if w1 in model_rbi:
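            # sample the preceding word in proportion to its probability given w1
            next_word = random.choices(list(model_rbi[w1].keys()), weights=list(model_rbi[w1].values()))[0]
            sentence.append(next_word)
        else:
            # unseen context: no continuation is known, so stop
            break
        
        # minimum length: below 30 words, reset next_word so that 'stkn' does not end the loop
        if len(sentence) < 30:
            next_word = ''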

    return ' '.join(reversed(sentence))

In [30]:
input_text = 'the'
before = bigram_rgen(input_text)
print(before)

stkn this growth has been widely accepted inception date for any world participate in a pentathlon consisting of the deaflympics and events alongside ritual sacrifices honouring both zeus whose famous statue by coubertin to adapt to the greeks as the olympic games have grown to the first called the th centuries bc etkn stkn over athletes with disabilities the


In [31]:
input_text = 'the'
after = bigram_gen(input_text)
print(after)

the games were admired and closing ceremonies etkn stkn the olympic games or truce etkn stkn the youth olympic stadium as the host city for each class usually maintains their success in bc but also combat sports competitions etkn


In [32]:
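# drop the seed from the end of `before` and the start of `after`, then insert it once, highlighted
cut = len(input_text) + 1   # seed length plus the separating space
final = before[:-cut] + ' *' + input_text + '* ' + after[cut:]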
print(final)

stkn this growth has been widely accepted inception date for any world participate in a pentathlon consisting of the deaflympics and events alongside ritual sacrifices honouring both zeus whose famous statue by coubertin to adapt to the greeks as the olympic games have grown to the first called the th centuries bc etkn stkn over athletes with disabilities *the* games were admired and closing ceremonies etkn stkn the olympic games or truce etkn stkn the youth olympic stadium as the host city for each class usually maintains their success in bc but also combat sports competitions etkn


### Reverse Trigram

In [33]:
model_rtri = defaultdict(lambda: defaultdict(lambda: 0))

In [34]:
n_gram = 3
for i in range(n_gram,len(r_data)+1):
    w1 = r_data[i-n_gram+0]
    w2 = r_data[i-n_gram+1]
    w3 = r_data[i-n_gram+2]
    model_rtri[(w1,w2)][w3] += 1

In [35]:
for w1_w2 in model_rtri:
    total_count = float(sum(model_rtri[w1_w2].values()))
    for w3 in model_rtri[w1_w2]:
        model_rtri[w1_w2][w3] /= total_count

In [36]:
model_rtri

defaultdict(<function __main__.<lambda>()>,
            {('etkn',
              'temples'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'greek': 1.0}),
             ('temples',
              'greek'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'all': 1.0}),
             ('greek',
              'all'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'of': 1.0}),
             ('all',
              'of'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'destruction': 1.0}),
             ('of',
              'destruction'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'the': 1.0}),
             ('destruction',
              'the'): defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
                         {'ordered': 1.0}),
             ...

In [37]:
def trigram_rgen(input_text):
    
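    starting_words = input_text.split(' ')
    if len(starting_words) < 2:
        # a trigram context needs two words, given here in reversed order
        raise ValueError("trigram_rgen needs at least two starting words")
    sentence = list(starting_words)
    next_word = ''
    
    while next_word != 'stkn':
        w1, w2 = sentence[-2], sentence[-1]
        if (w1, w2) in model_rtri:
            # sample the preceding word in proportion to its probability given (w1, w2)
            next_word = random.choices(list(model_rtri[(w1, w2)].keys()), weights=list(model_rtri[(w1, w2)].values()))[0]
            sentence.append(next_word)
        else:
            # unseen context: no continuation is known, so stop
            break
        
        # minimum length: below 30 words, reset next_word so that 'stkn' does not end the loop
        if len(sentence) < 30:
            next_word = ''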
            
    return ' '.join(reversed(sentence))

In [38]:
input_text = 'games the'
before = trigram_rgen(input_text)
print(before)

stkn the olympics were of fundamental religious importance featuring sporting events alongside ritual sacrifices honouring both zeus whose famous statue by phidias stood in his temple at olympia and pelops divine hero and mythical king of olympia etkn stkn over athletes competed at the games


In [39]:
input_text = 'the games'
after = trigram_gen(input_text)
print(after)

the games generally substitute for any world championships the year in which they take place however each class usually maintains their own records etkn stkn the growing importance of mass media has created the issue of corporate sponsorship and general commercialisation of the and olympics large scale boycotts during the games olympic and established the custom of holding them every four years starting in bc etkn


In [40]:
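# drop the seed from the end of `before` and the start of `after`, then insert it once, highlighted
cut = len(input_text) + 1   # seed length plus the separating space
final = before[:-cut] + ' *' + input_text + '* ' + after[cut:]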
print(final)

stkn the olympics were of fundamental religious importance featuring sporting events alongside ritual sacrifices honouring both zeus whose famous statue by phidias stood in his temple at olympia and pelops divine hero and mythical king of olympia etkn stkn over athletes competed at *the games* generally substitute for any world championships the year in which they take place however each class usually maintains their own records etkn stkn the growing importance of mass media has created the issue of corporate sponsorship and general commercialisation of the and olympics large scale boycotts during the games olympic and established the custom of holding them every four years starting in bc etkn
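
Both joins above work on character offsets that depend on the length of the seed. A token-based variant, sketched below, avoids counting characters and also verifies that the two halves really meet at the seed. `stitch` is a hypothetical helper, not part of the notebook, and it assumes the seed is passed in reading order (for example `'the games'`).

```python
def stitch(before, seed, after):
    """Join a backward and a forward generation that share the same seed words."""
    seed_tokens = seed.split()
    k = len(seed_tokens)
    b, a = before.split(), after.split()
    # `before` must end with the seed and `after` must start with it
    if b[-k:] != seed_tokens or a[:k] != seed_tokens:
        raise ValueError("both strings must contain the seed at the join")
    # keep the seed once, wrapped in asterisks to highlight it
    return ' '.join(b[:-k] + ['*' + seed + '*'] + a[k:])

joined = stitch("stkn over athletes competed at the games",
                "the games",
                "the games generally substitute for any world championships etkn")
print(joined)
# stkn over athletes competed at *the games* generally substitute for any world championships etkn
```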


# Thank You For Listening