# N-GRAM

An N-gram is a contiguous sequence of N items, such as words or characters, taken from text or speech. Depending on the application, the items can be letters, words, or base pairs. The value of N determines the order of the N-gram. N-grams are a fundamental concept used in many NLP tasks, such as language modeling, text classification, and machine translation.

**N-grams are named according to the value of N. For word-level N-grams:**

**Unigrams (1-grams)** are single words.

**Bigrams (2-grams)** are pairs of consecutive words.

**Trigrams (3-grams)** are triplets of consecutive words.
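
The definitions above can be illustrated with a few lines of plain Python. This is a minimal sketch: the helper name `ngrams` and the sample sentence are made up for illustration and are not part of the notebook's pipeline.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    # A window of width n can start at positions 0 .. len(tokens) - n
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Illustrative sentence (an assumption, not taken from the dataset)
tokens = "free entry to win cup final tickets".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence with 7 words yields 7 unigrams, 6 bigrams, and 5 trigrams, because each increase in N removes one possible starting position.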

In [1]:
# Data cleaning and preprocessing
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aasif\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
import pandas as pd

# The file is tab-separated and has no header row, so the column names are supplied here
messages = pd.read_csv(r"C:\Users\aasif\Downloads\Streamlit\spamhamdata.csv", sep="\t", names=["label", "message"])

In [3]:
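corpus = []          # one list of lowercase tokens per message
stemmed_corpus = []  # stemmed text without stop words, one string per message

stop_words = set(stopwords.words("english"))  # build the set once, not once per word

for i in range(len(messages)):
    # Keep letters only, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', messages["message"][i])
    review = review.lower().split()
    corpus.append(review)
    # Store the stemmed variant separately so that corpus keeps
    # exactly one entry per message, aligned with the labels
    stemmed = [ps.stem(word) for word in review if word not in stop_words]
    stemmed_corpus.append(" ".join(stemmed))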
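
To see what the cleaning steps do to a single message, here is a small self-contained sketch. It assumes a tiny hand-written stop-word set (`SAMPLE_STOPWORDS`, hypothetical) in place of NLTK's English list so that it runs without downloading any data; the function name `clean` and the sample message are also illustrative.

```python
import re

# Hypothetical stop-word set for illustration only; the notebook uses
# NLTK's much larger English list instead.
SAMPLE_STOPWORDS = {"to", "a", "the", "in", "is"}


def clean(message, stop_words=SAMPLE_STOPWORDS):
    """Keep letters only, lowercase, tokenize, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", message)  # digits and punctuation become spaces
    tokens = letters_only.lower().split()
    return [t for t in tokens if t not in stop_words]


cleaned = clean("Free entry in 2 a wkly comp to win FA Cup!")
print(cleaned)
```

The digit and the exclamation mark disappear, everything is lowercased, and the stop words `in`, `a`, and `to` are filtered out. Stemming is left out here because it depends on the stemmer's rules.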

In [4]:
# Creating the Bag of Words (BoW) model
from sklearn.feature_extraction.text import CountVectorizer

# Join each token list back into one string per message
corpus_str = [' '.join(tokens) for tokens in corpus]
# ngram_range=(1, 1) extracts unigrams only; pass binary=True for a binary BoW
cov = CountVectorizer(max_features=2500, lowercase=True, ngram_range=(1, 1))

In [5]:
X=cov.fit_transform(corpus_str).toarray()
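
As a rough picture of what the count matrix contains, the sketch below builds one by hand with the standard library. It is a simplification under stated assumptions, not scikit-learn's implementation: it splits on whitespace, ignores `max_features`, and keeps single-character tokens, which `CountVectorizer` drops by default. The names `ngram_counts`, `docs`, `vocabulary`, and `X_toy` are illustrative.

```python
from collections import Counter


def ngram_counts(text, n_min=1, n_max=1):
    """Count word n-grams of every order from n_min to n_max in one document."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Toy documents (assumed for illustration, not taken from the dataset)
docs = ["free entry to win", "win free tickets"]
doc_counts = [ngram_counts(d, 1, 1) for d in docs]

# Map each term to a column index in alphabetical order
all_terms = sorted(set().union(*doc_counts))
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

# One row per document, one column per vocabulary term
X_toy = [[counts[term] for term in vocabulary] for counts in doc_counts]

print(vocabulary)
print(X_toy)
```

Calling `ngram_counts(d, 2, 2)` instead would count only bigrams, which mirrors the role of the `ngram_range` parameter.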

In [6]:
cov.vocabulary_

{'go': 839,
 'until': 2259,
 'point': 1595,
 'crazy': 468,
 'available': 149,
 'only': 1484,
 'in': 1014,
 'bugis': 283,
 'great': 864,
 'world': 2432,
 'la': 1127,
 'cine': 385,
 'there': 2131,
 'got': 855,
 'wat': 2345,
 'ok': 1472,
 'lar': 1137,
 'joking': 1086,
 'wif': 2394,
 'oni': 1482,
 'free': 764,
 'entry': 646,
 'wkly': 2414,
 'comp': 426,
 'to': 2175,
 'win': 2399,
 'fa': 681,
 'cup': 481,
 'final': 718,
 'tkts': 2171,
 'st': 1985,
 'may': 1279,
 'text': 2110,
 'receive': 1708,
 'question': 1665,
 'std': 2000,
 'txt': 2231,
 'rate': 1685,
 'apply': 106,
 'over': 1512,
 'dun': 603,
 'say': 1808,
 'so': 1936,
 'early': 609,
 'already': 68,
 'then': 2129,
 'nah': 1396,
 'don': 575,
 'think': 2136,
 'he': 917,
 'goes': 842,
 'usf': 2280,
 'lives': 1192,
 'around': 124,
 'here': 930,
 'though': 2146,
 'freemsg': 766,
 'hey': 931,
 'darling': 501,
 'it': 1058,
 'been': 189,
 'week': 2362,
 'now': 1450,
 'and': 82,
 'no': 1434,
 'word': 2426,
 'back': 161,
 'like': 1182,
 'some': 1

In [7]:
# Creating the bigram model: ngram_range=(2, 2) extracts only pairs of consecutive words
corpus_str = [' '.join(tokens) for tokens in corpus]
from sklearn.feature_extraction.text import CountVectorizer
cov = CountVectorizer(max_features=2500, lowercase=True, ngram_range=(2, 2))  # pass binary=True for a binary BoW

X = cov.fit_transform(corpus_str).toarray()

In [8]:
cov.vocabulary_  # bigram vocabulary for ngram_range=(2, 2)

{'free entry': 539,
 'entry in': 486,
 'to win': 2033,
 'to to': 2028,
 'to receive': 2012,
 'std txt': 1781,
 'txt rate': 2063,
 'rate apply': 1547,
 'so early': 1711,
 'don think': 461,
 'hey there': 734,
 'now and': 1351,
 'you up': 2459,
 'up for': 2078,
 'for it': 515,
 'it still': 976,
 'to send': 2015,
 'send to': 1645,
 'even my': 490,
 'my brother': 1258,
 'is not': 924,
 'like to': 1087,
 'with me': 2329,
 'me like': 1178,
 'as per': 111,
 'per your': 1486,
 'your request': 2494,
 'has been': 671,
 'been set': 193,
 'set as': 1660,
 'as your': 117,
 'your callertune': 2475,
 'callertune for': 295,
 'for all': 506,
 'all callers': 17,
 'callers press': 294,
 'press to': 1526,
 'to copy': 1970,
 'copy your': 397,
 'your friends': 2481,
 'friends callertune': 549,
 'as valued': 114,
 'customer you': 415,
 'you have': 2413,
 'have been': 679,
 'been selected': 192,
 'selected to': 1630,
 'to claim': 1966,
 'claim call': 353,
 'call claim': 267,
 'claim code': 354,
 'valid hours':