# N-gram
An n-gram is a contiguous sequence of n words, where n is a positive integer. For example, the word “cheese” is a 1-gram (unigram). The two-word sequence “cheese flavored” is a 2-gram (bigram). Similarly, “cheese flavored snack” is a 3-gram (trigram), and “ultimate cheese flavored snack” is a 4-gram (four-gram).

<img src="./images/ngram.JPG">
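The definition above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `word_ngrams` is hypothetical, and it splits on whitespace rather than using a real tokenizer.

```python
def word_ngrams(text, n):
    """Return all contiguous n-word sequences in `text` as tuples."""
    tokens = text.split()
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

phrase = "ultimate cheese flavored snack"
for n in range(1, 5):
    print(n, word_ngrams(phrase, n))
```

A phrase of four words yields four unigrams, three bigrams, two trigrams, and a single four-gram.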

We start with n-gram ranking.
In n-gram ranking, we simply rank the n-grams by how many times they appear in a body of text, be it a book, a collection of tweets, or reviews left by customers.
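As a minimal sketch of the idea, the snippet below ranks the bigrams of a toy sentence with `collections.Counter`. The sentence and the helper name `rank_ngrams` are made up for illustration and are not part of the tweet analysis that follows.

```python
from collections import Counter

def rank_ngrams(tokens, n, top=3):
    """Count every n-gram in `tokens` and return the `top` most frequent."""
    # zip over n shifted copies of the token list yields the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(top)

toy_tokens = "the cat sat on the mat and the cat slept".split()
top_bigrams = rank_ngrams(toy_tokens, 2)
print(top_bigrams)
```

Here the bigram `('the', 'cat')` ranks first because it is the only one that occurs twice.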

In [2]:
# Reading the data
import pandas as pd
df = pd.read_csv('./data/tweets.csv')
df.head()

Unnamed: 0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str
0,Twitter for iPhone,LOSER! https://t.co/p5imhMJqS1,05-18-2020 14:55:14,32295,135445,False,1262396333064892416
1,Twitter for iPhone,Most of the money raised by the RINO losers of...,05-05-2020 18:18:26,19706,82425,False,1257736426206031874
2,Twitter for iPhone,....because they don’t know how to win and the...,05-05-2020 04:46:34,12665,56868,False,1257532112233803782
3,Twitter for iPhone,....lost for Evan “McMuffin” McMullin (to me)....,05-05-2020 04:46:34,13855,62268,False,1257532114666508291
4,Twitter for iPhone,....get even for all of their many failures. Y...,05-05-2020 04:46:33,8122,33261,False,1257532110971318274


In [3]:
df.shape

(261, 7)

In [4]:
# importing relevant packages
# natural language processing: n-gram ranking
import re
import unicodedata
import nltk
from nltk.corpus import stopwords
# add appropriate words that will be ignored in the analysis
ADDITIONAL_STOPWORDS = ['covfefe']

import matplotlib.pyplot as plt

## Cleaning the data
- Lemmatization
- Stop-word removal
- Unicode normalization (return the normal form of the Unicode string)
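To make two of these steps concrete, here is a small standard-library sketch of Unicode normalization and stop-word removal. Lemmatization is left out because it needs the WordNet data from NLTK. The stop-word set `TOY_STOPWORDS`, the helper names, and the sample sentence are all hypothetical.

```python
import re
import unicodedata

# Hypothetical, tiny stop-word list used only for this illustration.
TOY_STOPWORDS = {"the", "a", "is", "of"}

def normalize_ascii(text):
    """Decompose characters (NFKD), drop anything non-ASCII, and lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()

def toy_clean(text):
    """Normalize, strip punctuation, and remove the toy stop words."""
    text = normalize_ascii(text)
    words = re.sub(r"[^\w\s]", "", text).split()
    return [w for w in words if w not in TOY_STOPWORDS]

cleaned = toy_clean("The café’s menu is a WINNER!")
print(cleaned)
```

Note that the accented "é" is reduced to "e", while the curly apostrophe is dropped entirely, so "café’s" becomes "cafes".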

In [5]:
def basic_clean(text):
  """
  Clean up the text. After Unicode normalization, ASCII encoding, and
  basic regex parsing, every word that is not a stop word is lemmatized.
  """
  wnl = nltk.stem.WordNetLemmatizer()
  # A set gives fast membership tests; the name avoids shadowing the imported `stopwords`
  stop_words = set(nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS)
  # NFKD-normalize, drop non-ASCII characters, and lowercase;
  # see https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize
  text = (unicodedata.normalize('NFKD', text)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
    .lower())
  # Strip punctuation, then split on whitespace
  words = re.sub(r'[^\w\s]', '', text).split()
  return [wnl.lemmatize(word) for word in words if word not in stop_words]

In [25]:
words = basic_clean(str(df['text'].tolist()))
print ("# of words : ", len(words))
words[:15]

# of words :  3944


['loser',
 'httpstcop5imhmjqs1',
 'money',
 'raised',
 'rino',
 'loser',
 'socalled',
 'lincoln',
 'project',
 'go',
 'pocket',
 'ive',
 'done',
 'judge',
 'tax']

Get the 10 most frequent bigrams

In [26]:
# top 10 bigrams (pandas was already imported as pd above)
(pd.Series(nltk.ngrams(words, 2)).value_counts())[:10]

(hater, loser)        42
(total, loser)        32
(loser, hater)        14
(amp, loser)          13
(donald, trump)       11
(hater, amp)          11
(winner, loser)        8
(separate, winner)     8
(twist, fate)          6
(reacts, new)          6
dtype: int64

In [27]:
# top 10 trigrams
(pd.Series(nltk.ngrams(words, 3)).value_counts())[:10]

(hater, amp, loser)             10
(separate, winner, loser)        8
(winner, loser, person)          6
(loser, person, reacts)          6
(new, twist, fate)               6
(person, reacts, new)            6
(hater, loser, happy)            6
(reacts, new, twist)             6
(including, hater, loser)        5
(everyone, including, hater)     5
dtype: int64

## Character N-gram

In [1]:
from nltk import ngrams
["".join(k1) for k1 in list(ngrams("hello world",n=3))]

['hel', 'ell', 'llo', 'lo ', 'o w', ' wo', 'wor', 'orl', 'rld']

In [2]:
b='student'
[b[i:i+2] for i in range(len(b)-1)]

['st', 'tu', 'ud', 'de', 'en', 'nt']
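Both cells above slide a window of width n over the string, so a string of length L yields L - n + 1 character n-grams. A small sketch that generalizes the slicing approach (the function name `char_ngrams` is an assumption made for illustration):

```python
def char_ngrams(text, n):
    """Return every contiguous substring of length n in `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

bigrams = char_ngrams("student", 2)
trigrams = char_ngrams("hello world", 3)
print(bigrams)
print(len("hello world"), len(trigrams))
```

For "student" (7 characters) this gives 6 bigrams, and for "hello world" (11 characters, counting the space) it gives 9 trigrams, matching the outputs shown above.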

## Build a Character N-Gram Model

In [3]:
import nltk
import numpy as np
import random
import string

import bs4 as bs
import urllib.request
import re

In [4]:
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Tennis')
raw_html = raw_html.read()

article_html = bs.BeautifulSoup(raw_html, 'lxml')
article_paragraphs = article_html.find_all('p')
article_text = ''

for para in article_paragraphs:
    article_text += para.text

article_text = article_text.lower()
article_text



In [5]:
article_text = re.sub(r'[^A-Za-z. ]', '', article_text)

In [6]:
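# Map each 3-character sequence to the list of characters that follow it.
# Note: this dict reuses the name `ngrams`, so it replaces the function imported from nltk earlier.
ngrams = {}
chars = 3

for i in range(len(article_text)-chars):
    seq = article_text[i:i+chars]
    print(seq)
    if seq not in ngrams:
        ngrams[seq] = []
    # Record the character that comes right after this sequence
    ngrams[seq].append(article_text[i+chars])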

ten
enn
nni
nis
is 
s i
 is
is 
s a
 a 
a r
 ra
rac
ack
cke
ket
et 
t s
 sp
spo
por
ort
rt 
t t
 th
tha
hat
at 
t c
 ca
can
an 
n b
 be
be 
e p
 pl
pla
lay
aye
yed
ed 
d i
 in
ind
ndi
div
ivi
vid
idu
dua
ual
all
lly
ly 
y a
 ag
aga
gai
ain
ins
nst
st 
t a
 a 
a s
 si
sin
ing
ngl
gle
le 
e o
 op
opp
ppo
pon
one
nen
ent
nt 
t s
 si
sin
ing
ngl
gle
les
es 
s o
 or
or 
r b
 be
bet
etw
twe
wee
een
en 
n t
 tw
two
wo 
o t
 te
tea
eam
ams
ms 
s o
 of
of 
f t
 tw
two
wo 
o p
 pl
pla
lay
aye
yer
ers
rs 
s e
 ea
eac
ach
ch 
h d
 do
dou
oub
ubl
ble
les
es.
s. 
. e
 ea
eac
ach
ch 
h p
 pl
pla
lay
aye
yer
er 
r u
 us
use
ses
es 
s a
 a 
a t
 te
ten
enn
nni
nis
is 
s r
 ra
rac
ack
cke
ket
et 
t t
 th
tha
hat
at 
t i
 is
is 
s s
 st
str
tru
run
ung
ng 
g w
 wi
wit
ith
th 
h c
 co
cor
ord
rd 
d t
 to
to 
o s
 st
str
tri
rik
ike
ke 
e a
 a 
a h
 ho
hol
oll
llo
low
ow 
w r
 ru
rub
ubb
bbe
ber
er 
r b
 ba
bal
all
ll 
l c
 co
cov
ove
ver
ere
red
ed 
d w
 wi
wit
ith
th 
h f
 fe
fel
elt
lt 
t o
 ov
ove
ver


In [9]:
ngrams.keys()

dict_keys(['ten', 'enn', 'nni', 'nis', 'is ', 's i', ' is', 's a', ' a ', 'a r', ' ra', 'rac', 'ack', 'cke', 'ket', 'et ', 't s', ' sp', 'spo', 'por', 'ort', 'rt ', 't t', ' th', 'tha', 'hat', 'at ', 't c', ' ca', 'can', 'an ', 'n b', ' be', 'be ', 'e p', ' pl', 'pla', 'lay', 'aye', 'yed', 'ed ', 'd i', ' in', 'ind', 'ndi', 'div', 'ivi', 'vid', 'idu', 'dua', 'ual', 'all', 'lly', 'ly ', 'y a', ' ag', 'aga', 'gai', 'ain', 'ins', 'nst', 'st ', 't a', 'a s', ' si', 'sin', 'ing', 'ngl', 'gle', 'le ', 'e o', ' op', 'opp', 'ppo', 'pon', 'one', 'nen', 'ent', 'nt ', 'les', 'es ', 's o', ' or', 'or ', 'r b', 'bet', 'etw', 'twe', 'wee', 'een', 'en ', 'n t', ' tw', 'two', 'wo ', 'o t', ' te', 'tea', 'eam', 'ams', 'ms ', ' of', 'of ', 'f t', 'o p', 'yer', 'ers', 'rs ', 's e', ' ea', 'eac', 'ach', 'ch ', 'h d', ' do', 'dou', 'oub', 'ubl', 'ble', 'es.', 's. ', '. e', 'h p', 'er ', 'r u', ' us', 'use', 'ses', 'a t', 's r', 't i', 's s', ' st', 'str', 'tru', 'run', 'ung', 'ng ', 'g w', ' wi', 'wit', 'i
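One way such a mapping could be used is to generate text: start from a seed sequence, pick one of the recorded following characters at random, and repeat. The sketch below is an assumption about how the model might be applied, not something shown in the cells above. It rebuilds a small model from a short sample string instead of the downloaded article, and the names `build_char_model` and `generate` are hypothetical.

```python
import random

def build_char_model(text, chars=3):
    """Map each `chars`-long sequence to the characters observed after it."""
    model = {}
    for i in range(len(text) - chars):
        seq = text[i:i + chars]
        model.setdefault(seq, []).append(text[i + chars])
    return model

def generate(model, seed, length=40, rng=None):
    """Extend `seed` one character at a time by sampling from the model."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    chars = len(seed)
    out = seed
    for _ in range(length):
        followers = model.get(out[-chars:])
        if not followers:  # sequence never seen with a successor: stop
            break
        out += rng.choice(followers)
    return out

sample_text = ("tennis is a racket sport that can be played "
               "individually against a single opponent")
model = build_char_model(sample_text)
generated = generate(model, "ten")
print(generated)
```

Because followers are stored with repetition, characters that follow a sequence more often are proportionally more likely to be sampled.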

## Word N-Gram Model

In [10]:
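# Map each 3-word sequence to the list of words that follow it.
# Note: `words` is reassigned here and no longer holds the cleaned tweet tokens.
ngrams = {}
words = 3

words_tokens = nltk.word_tokenize(article_text)
for i in range(len(words_tokens)-words):
    seq = ' '.join(words_tokens[i:i+words])
    print(seq)
    if seq not in ngrams:
        ngrams[seq] = []
    # Record the word that comes right after this sequence
    ngrams[seq].append(words_tokens[i+words])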

tennis is a
is a racket
a racket sport
racket sport that
sport that can
that can be
can be played
be played individually
played individually against
individually against a
against a single
a single opponent
single opponent singles
opponent singles or
singles or between
or between two
between two teams
two teams of
teams of two
of two players
two players each
players each doubles
each doubles .
doubles . each
. each player
each player uses
player uses a
uses a tennis
a tennis racket
tennis racket that
racket that is
that is strung
is strung with
strung with cord
with cord to
cord to strike
to strike a
strike a hollow
a hollow rubber
hollow rubber ball
rubber ball covered
ball covered with
covered with felt
with felt over
felt over or
over or around
or around a
around a net
a net and
net and into
and into the
into the opponents
the opponents court
opponents court .
court . the
. the object
the object of
object of the
of the game
the game is
game is to
is to manoeuvre
to manoeuvre the
man