# Text classification n-gram(s) model 

![ngram](https://user-images.githubusercontent.com/54467567/68990210-27e76400-0816-11ea-9bdb-5057291a9f91.jpg)


In [19]:
import tarfile
import os
import pyprind
import pandas as pd
import numpy as np
import re

########### TOKENIZATION ###########
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.util import ngrams

########### STEMMING ###############
import nltk
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

########### STOPWORDS ##############
nltk.download('stopwords')
from nltk.corpus import stopwords

########### LEMMATIZATION ##########
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
nltk.download('wordnet')

########### LDA ####################
from sklearn.decomposition import LatentDirichletAllocation

########### TEST & TRAIN SPLIT #####
from sklearn.model_selection import train_test_split

########### LOGISTIC REGRESSION #####
from sklearn.linear_model import LogisticRegression

os.chdir(r"C:\important_files")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Thatoi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Thatoi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### File Loading

In [2]:
# tar.gz file extraction
with tarfile.open('aclImdb_v1.tar.gz','r:gz') as tar :
    tar.extractall()

# combining files into a dataframe
labels = {'pos' :1 ,
          'neg' :0}

pbar = pyprind.ProgBar(50000)
df = pd.DataFrame()

In [3]:
# collecting every review with its label, then building the dataframe
# (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list)
basepath = 'aclImdb'
rows = []
for s in ('test','train'):
    for l in ('pos','neg'):
        path = os.path.join(basepath,s,l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file),
                      'r',encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt,labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['reviews','sentiment'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:03:01


In [4]:
df.head()

| | reviews | sentiment |
|---|---|---|
| 0 | I went and saw this movie last night after bei... | 1 |
| 1 | Actor turned director Bill Paxton follows up h... | 1 |
| 2 | As a recreational golfer with some knowledge o... | 1 |
| 3 | I saw this film in a sneak preview, and it is ... | 1 |
| 4 | Bill Paxton has taken the true story of the 19... | 1 |


In [5]:
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
print(df.shape)
print(df.head(3))

(50000, 2)
                                                 reviews  sentiment
11841  In 1974, the teenager Martha Moxley (Maggie Gr...          1
19602  OK... so... I really like Kris Kristofferson a...          0
45519  ***SPOILER*** Do not read this, if you think a...          0


Process of performing sentiment analysis :
![model](https://user-images.githubusercontent.com/54467567/68970713-89271d00-07ad-11ea-87d2-e5d55344a31c.PNG)

reference : https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk

## Cleaning dataset {html tags, emoticons & non-characters}

In [6]:
df.loc[50,'reviews']

"Whether one views him as a gallant cavalier of the plains or a glory hunting egomaniac, debates about the life and military career of George Armstrong Custer continue down to the present day. They Died With Their Boots On presents certain facts of the Custer story and has taken liberty with others.<br /><br />He did in fact graduate at the bottom of his class at West Point and got this overnight promotion on the battlefield to Brigadier General. His record leading the Michigan Regiment under his command was one of brilliance.<br /><br />It was also true that his marriage to Libby Bacon was one of the great love matches of the 19th century. Libby and George were married for 12 years until The Little Big Horn. What's not known to today's audience is that Libby survived until 1933. During that time she was the custodian of the Custer legend. By dint of her own iron will and force of personality her late husband became a hero because she would not allow him to be remembered in any other w

In [4]:
def preprocessor(text):
    # removing HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # storing emoticons
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # removing non-word characters, lowercasing and appending emoticons
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', '')) # removing nose in emoticons
    return text

In [5]:
#example
preprocessor(df.loc[50,'reviews'])

'whether one views him as a gallant cavalier of the plains or a glory hunting egomaniac debates about the life and military career of george armstrong custer continue down to the present day they died with their boots on presents certain facts of the custer story and has taken liberty with others he did in fact graduate at the bottom of his class at west point and got this overnight promotion on the battlefield to brigadier general his record leading the michigan regiment under his command was one of brilliance it was also true that his marriage to libby bacon was one of the great love matches of the 19th century libby and george were married for 12 years until the little big horn what s not known to today s audience is that libby survived until 1933 during that time she was the custodian of the custer legend by dint of her own iron will and force of personality her late husband became a hero because she would not allow him to be remembered in any other way i think raoul walsh and warn

In [9]:
df['reviews'] = df['reviews'].apply(preprocessor)

### NLTK Ngrams, Stemming  & Stop words removal

#### Tokenization :
Tokenization is the process of dividing a large body of text into smaller parts called tokens.
Natural language processing is used to build applications such as text classification, intelligent chatbots,
sentiment analysis and language translation, and all of these depend on recognizing patterns in the text.
Tokens are the units in which such patterns are found, and tokenization is also the
base step for stemming and lemmatization.
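
As a minimal sketch of the idea, the snippet below splits a sentence into lowercase word tokens with a regular expression. The `simple_tokenize` helper is hypothetical and much cruder than `nltk.word_tokenize`, which this notebook uses for the real data.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: lowercase the text and keep runs of letters,
    # digits and apostrophes as tokens (other punctuation is dropped).
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = simple_tokenize("The movie wasn't bad, it was GREAT!")
print(tokens)
# ['the', 'movie', "wasn't", 'bad', 'it', 'was', 'great']
```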

##### multigram tokens
When performing machine learning tasks related to natural language processing, we usually need to generate n-grams from input sentences. For example, in text classification tasks, in addition to using each individual token found in the corpus, we may want to add bi-grams or tri-grams as features to represent our documents. Bi-grams and tri-grams can add predictive power because each feature captures a short phrase (for example a subject with its verb, or a negation such as "not good"), whereas a uni-gram feature contains only a single word.
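
To make the sliding-window idea concrete, here is a small pure-Python sketch. The `make_ngrams` name is an assumption made for this illustration; the notebook itself relies on `nltk.util.ngrams`.

```python
def make_ngrams(tokens, n):
    # Slide a window of size n over the token list and join each window
    # into a single space-separated string.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie was not good".split()
print(make_ngrams(tokens, 1))  # ['this', 'movie', 'was', 'not', 'good']
print(make_ngrams(tokens, 2))  # ['this movie', 'movie was', 'was not', 'not good']
print(make_ngrams(tokens, 3))  # ['this movie was', 'movie was not', 'was not good']
```

Note that the bi-gram `'not good'` keeps the negation together, which the uni-grams `'not'` and `'good'` cannot do on their own.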

In [79]:
def extract_ngrams(data, num):
    '''
    tokenizes a sentence with nltk.word_tokenize
    (needs NLTK's punkt tokenizer data) and returns
    its n-grams as space-joined strings
    '''
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [ ' '.join(grams) for grams in n_grams]

In [83]:
# creating 1,2 & 3 grams tokens
df['1gram_tokens'] = df['reviews'].apply(lambda text: extract_ngrams(text, 1))
df['2gram_tokens'] = df['reviews'].apply(lambda text: extract_ngrams(text, 2))
df['3gram_tokens'] = df['reviews'].apply(lambda text: extract_ngrams(text, 3))

In [94]:
df[['reviews','1gram_tokens','2gram_tokens','3gram_tokens']].head(5)

| | reviews | 1gram_tokens | 2gram_tokens | 3gram_tokens |
|---|---|---|---|---|
| 11841 | in 1974 the teenager martha moxley maggie grac... | [in, 1974, the, teenag, martha, moxley, maggi,... | [in 1974, 1974 the, the teenag, teenager marth... | [in 1974 th, 1974 the teenag, the teenager mar... |
| 19602 | ok so i really like kris kristofferson and his... | [ok, so, i, realli, like, kri, kristofferson, ... | [ok so, so i, i real, really lik, like kri, kr... | [ok so i, so i r, i really lik, really like kr... |
| 45519 | spoiler do not read this if you think about w... | [spoiler, do, not, read, thi, if, you, think, ... | [spoiler do, do not, not read, read thi, this ... | [spoiler do not, do not read, not read thi, re... |
| 25747 | hi for all the people who have seen this wonde... | [hi, for, all, the, peopl, who, have, seen, th... | [hi for, for al, all th, the peopl, people who... | [hi for al, for all th, all the peopl, the peo... |
| 42642 | i recently bought the dvd forgetting just how ... | [i, recent, bought, the, dvd, forget, just, ho... | [i rec, recently bought, bought th, the dvd, d... | [i recently bought, recently bought th, bought... |


#### Stemming :
Stemming algorithms work by cutting off the end or the beginning of a word, using a list of common prefixes and suffixes found in inflected words. This indiscriminate cutting works in some cases but not always, which is the main limitation of the approach. The figure below illustrates the method with examples in both English and Spanish. Stemming only requires knowledge of the language's prefixes and suffixes.

![stemming](https://user-images.githubusercontent.com/54467567/68964605-6aba2500-079f-11ea-9144-6e4c5b43cc79.PNG)

reference : https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/
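
The suffix-cutting idea can be sketched with a toy stemmer. This is not the Porter algorithm used in this notebook: the `SUFFIXES` list and the `toy_stem` helper are made up for illustration, and Porter applies a much larger, ordered set of rewrite rules.

```python
# Toy suffix list for illustration only.
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def toy_stem(word, min_stem_len=3):
    # Strip the first matching suffix, provided the remaining stem keeps
    # at least min_stem_len characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for w in ["playing", "played", "plays", "quickly", "studies", "was"]:
    print(w, "->", toy_stem(w))
# playing -> play
# played -> play
# plays -> play
# quickly -> quick
# studies -> studi
# was -> was
```

The `studies -> studi` result shows the limitation mentioned above: blind cutting can produce a stem that is not a dictionary word.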

In [91]:
def stemming(tokens) :
    '''
    stems each token with the porter stemming
    algorithm, which removes morphological
    endings from words. Each list element is
    treated as ONE word, so for 2-gram and 3-gram
    tokens only the end of the string is stemmed
    (e.g. 'the teenager' -> 'the teenag')
    '''
    return [porter.stem(token) for token in tokens]

In [92]:
df['1gram_tokens'] = df['1gram_tokens'].apply(stemming)
df['2gram_tokens'] = df['2gram_tokens'].apply(stemming)
df['3gram_tokens'] = df['3gram_tokens'].apply(stemming)

#### STOP word removal :
When working with text mining applications, we often hear of the term “stop words” or “stop word list” or even “stop list”. Stop words are just a set of commonly used words in any language. Stop words are commonly eliminated from many text processing applications because these words can be distracting, non-informative (or non-discriminative) and are additional memory overhead.
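
A minimal sketch of stop-word filtering follows. The `TOY_STOP_WORDS` set and the `drop_stop_words` helper are assumptions made for this example; NLTK's English list is considerably longer.

```python
# A tiny, hand-picked stop list for illustration only.
TOY_STOP_WORDS = {"the", "a", "is", "was", "of", "and", "this"}

def drop_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    # Keep only the tokens that are not in the stop list.
    return [t for t in tokens if t not in stop_words]

tokens = "this is the worst movie of the year".split()
print(drop_stop_words(tokens))         # ['worst', 'movie', 'year']
print(drop_stop_words(["the worst"]))  # ['the worst']
```

The second call shows that an exact-match filter leaves multi-word tokens untouched, because `'the worst'` as a whole is not a stop word.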

In [18]:
stop = set(stopwords.words('english'))  # a set gives fast membership tests

def remove_stopwords(tokens):
    '''
    removes tokens that exactly match an English
    stop word. Multi-word tokens such as 'in 1974'
    never match a single stop word, so 2-gram and
    3-gram lists pass through unchanged
    '''
    return [w for w in tokens if w not in stop]



In [95]:
df['1gram_tokens'] = df['1gram_tokens'].apply(remove_stopwords)
df['2gram_tokens'] = df['2gram_tokens'].apply(remove_stopwords)
df['3gram_tokens'] = df['3gram_tokens'].apply(remove_stopwords)

In [96]:
# AFTER APPLYING STOPWORD REMOVAL
df[['reviews','1gram_tokens','2gram_tokens','3gram_tokens']].head(5)

| | reviews | 1gram_tokens | 2gram_tokens | 3gram_tokens |
|---|---|---|---|---|
| 11841 | in 1974 the teenager martha moxley maggie grac... | [1974, teenag, martha, moxley, maggi, grace, m... | [in 1974, 1974 the, the teenag, teenager marth... | [in 1974 th, 1974 the teenag, the teenager mar... |
| 19602 | ok so i really like kris kristofferson and his... | [ok, realli, like, kri, kristofferson, hi, usu... | [ok so, so i, i real, really lik, like kri, kr... | [ok so i, so i r, i really lik, really like kr... |
| 45519 | spoiler do not read this if you think about w... | [spoiler, read, thi, think, watch, movi, altho... | [spoiler do, do not, not read, read thi, this ... | [spoiler do not, do not read, not read thi, re... |
| 25747 | hi for all the people who have seen this wonde... | [hi, peopl, seen, thi, wonder, movi, im, sure,... | [hi for, for al, all th, the peopl, people who... | [hi for al, for all th, all the peopl, the peo... |
| 42642 | i recently bought the dvd forgetting just how ... | [recent, bought, dvd, forget, much, hate, movi... | [i rec, recently bought, bought th, the dvd, d... | [i recently bought, recently bought th, bought... |


#### Lemmatization : 
Lemmatization, by contrast, relies on a morphological analysis of each word. It needs detailed dictionaries that the algorithm can search to link an inflected form back to its lemma. The figure below shows how it handles the same example words. Lemmatization therefore requires a complete dictionary of the language.
![lemma](https://user-images.githubusercontent.com/54467567/68964888-13688480-07a0-11ea-8745-578d07bd186f.PNG)

reference : https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/
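
The dictionary lookup at the heart of lemmatization can be sketched as below. The `TOY_LEMMAS` mapping and the `toy_lemmatize` helper are hypothetical; in this notebook WordNet plays the role of the dictionary.

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
TOY_LEMMAS = {"was": "be", "is": "be", "better": "good",
              "mice": "mouse", "movies": "movie"}

def toy_lemmatize(tokens, lemmas=TOY_LEMMAS):
    # Look each token up; words missing from the dictionary are returned unchanged.
    return [lemmas.get(t, t) for t in tokens]

print(toy_lemmatize(["the", "movies", "was", "better"]))
# ['the', 'movie', 'be', 'good']
```

Unlike suffix stripping, the lookup can map irregular forms such as `better` to `good`, but only for words the dictionary actually contains.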

In [49]:
def lemmatize_word(tokens):
    '''
    converts words into lemmas (root words).
    lemmatize() treats every word as a noun
    unless a pos argument is passed
    '''
    return [lemmatizer.lemmatize(word) for word in tokens]

In [97]:
df['1gram_tokens'] = df['1gram_tokens'].apply(lemmatize_word)
df['2gram_tokens'] = df['2gram_tokens'].apply(lemmatize_word)
df['3gram_tokens'] = df['3gram_tokens'].apply(lemmatize_word)

### SKLearn Ngrams, Tf-Idf vector & Text Topics using Latent Dirichlet Allocation
scikit-learn provides vectorizers (`CountVectorizer` and `TfidfVectorizer`) that generate n-grams directly from raw text and return a bag-of-words matrix built on those n-grams. By default they lowercase the text, so lowercasing does not have to be done explicitly. Stop words are removed only when the `stop_words` parameter is set (for example `stop_words='english'`), although a low `max_df` also filters out terms that occur in a large share of the documents. Latent Dirichlet Allocation (LDA) is a probabilistic method that looks for groups of words that frequently occur together across documents. Its input is the bag-of-words matrix, which it decomposes into a document-to-topic matrix and a word-to-topic matrix.

### TF-IDF

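Term frequency-inverse document frequency (tf-idf) assigns each term in a document a numeric weight that reflects how relevant the term is to that document. It is calculated as

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t)$$

where $\text{tf}(t,d)$ is the number of times term $t$ occurs in document $d$, and $\text{idf}(t)$ is the inverse document frequency, which shrinks as the number of documents containing $t$ grows. With its default settings, scikit-learn's `TfidfVectorizer` uses $\text{idf}(t)=\ln\frac{1+n}{1+\text{df}(t)}+1$, where $n$ is the number of documents and $\text{df}(t)$ is the number of documents that contain $t$, and then scales each document vector to unit Euclidean length. The result is stored as a sparse matrix with one row per document and one column per term.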
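
The weighting can be worked through by hand on a toy corpus. The sketch below is a pure-Python illustration, not scikit-learn's implementation; it assumes the smoothed form idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is what `TfidfVectorizer` uses with default settings. The `tfidf` function name is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (assumed to mirror scikit-learn's default).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
for row in tfidf(docs):
    print({t: round(w, 3) for t, w in row.items()})
```

In the second document, `bad` occurs in only one document while `movie` occurs in two, so `bad` receives the larger weight even though both appear once.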

In [6]:
# CREATING 1,2 & 3 GRAMS WORD FROM TEXT
count_1 = TfidfVectorizer(max_df=.1 , max_features = 50, ngram_range=(1,1))
bag_1 = count_1.fit_transform(df['reviews'].values)

count_2 = TfidfVectorizer(max_df=.1 , max_features = 50, ngram_range=(2,2))
bag_2 = count_2.fit_transform(df['reviews'].values)

count_3 = TfidfVectorizer(max_df=.1 , max_features = 50, ngram_range=(3,3))
bag_3 = count_3.fit_transform(df['reviews'].values)

In [7]:
# SAMPLE DATAFRAME CREATED TO DO SENTIMENT MODELING
a=pd.DataFrame(bag_1.toarray())
a.columns = count_1.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
a.head()

Unnamed: 0,action,although,am,anyone,away,believe,book,comedy,course,day,...,script,series,set,since,sure,trying,tv,woman,worst,yet
0,0.0,0.0,0.0,0.30865,0.0,0.0,0.0,0.309077,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.318364,0.0,0.0,0.0,0.0
1,0.525235,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.267626,0.265719,...,0.0,0.0,0.269061,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.445067,0.0,0.0,0.0,0.0,0.0,0.0,0.448112,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.708978,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
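feature_array = np.array(count_1.get_feature_names_out())
# NOTE: argsort works row by row, so after flatten()[::-1] the first entries
# rank the features of the LAST document only, not of the whole corpus
tfidf_sorting = np.argsort(bag_1.toarray()).flatten()[::-1]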

n = 3
top_n = feature_array[tfidf_sorting[:n]]
print('Top 3 from 1-gram bag of word model by tfidf value :' , top_n )

Top 3 from 1-gram bag of word model by tfidf value : ['goes' 'day' 'yet']
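
The cell above calls `np.argsort` row by row and then reads the end of the flattened result, so the three names it prints come from the last row of the matrix only. If a corpus-wide ranking is wanted, one option (an assumption about the intent, not the notebook's method) is to average each feature's weight over all documents. The pure-Python sketch below shows that idea on a made-up matrix; `top_features_by_mean` and the toy values are hypothetical.

```python
def top_features_by_mean(matrix, feature_names, n=3):
    # matrix: list of rows (one per document), each a list of tf-idf weights.
    n_docs = len(matrix)
    # Average each column (feature) over all documents.
    means = [sum(row[j] for row in matrix) / n_docs
             for j in range(len(feature_names))]
    # Sort feature indices by mean weight, highest first.
    order = sorted(range(len(feature_names)), key=lambda j: means[j], reverse=True)
    return [feature_names[j] for j in order[:n]]

toy_matrix = [[0.0, 0.9, 0.1],
              [0.5, 0.4, 0.0],
              [0.7, 0.0, 0.2]]
print(top_features_by_mean(toy_matrix, ["plot", "acting", "music"], n=2))
# ['acting', 'plot']
```

With the notebook's sparse matrix, the same column means could be obtained with `np.asarray(bag_1.mean(axis=0)).ravel()` and ranked with `np.argsort`.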


In [9]:
# SAMPLE DATAFRAME CREATED TO DO SENTIMENT MODELING
b=pd.DataFrame(bag_2.toarray())
b.columns = count_2.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
b.head()

Unnamed: 0,and his,as well,at all,at least,br this,could have,going to,have to,he is,he was,...,to say,trying to,want to,was the,which is,who is,would be,would have,you can,you re
0,0.0,0.444586,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.471144,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.710714,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.530374,0.0,0.0,0.485885,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.410938,0.0,0.0,0.409518,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
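feature_array = np.array(count_2.get_feature_names_out())
# NOTE: argsort works row by row, so after flatten()[::-1] the first entries
# rank the features of the LAST document only, not of the whole corpus
tfidf_sorting = np.argsort(bag_2.toarray()).flatten()[::-1]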

n = 3
top_n = feature_array[tfidf_sorting[:n]]
print('Top 3 from 2-grams bag of word model by tfidf value :' , top_n )

Top 3 from 2-grams bag of word model by tfidf value : ['to say' 'the whole' 'is one']


In [11]:
# SAMPLE DATAFRAME CREATED TO DO SENTIMENT MODELING
c=pd.DataFrame(bag_3.toarray())
c.columns = count_3.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
c.head()

Unnamed: 0,as well as,at the end,br br and,br br but,br br if,br br in,br br it,br br there,br br this,could have been,...,the movie was,the rest of,the story is,there is no,this film is,this is one,this is the,this movie is,this movie was,would have been
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.748112,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.656479,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.420694,0.0,...,0.0,0.0,0.514837,0.0,0.0,0.541982,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
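feature_array = np.array(count_3.get_feature_names_out())
# NOTE: argsort works row by row, so after flatten()[::-1] the first entries
# rank the features of the LAST document only, not of the whole corpus
tfidf_sorting = np.argsort(bag_3.toarray()).flatten()[::-1]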

n = 3
top_n = feature_array[tfidf_sorting[:n]]
print('Top 3 from 3-grams bag of word model by tfidf value :' , top_n )

Top 3 from 3-grams bag of word model by tfidf value : ['this is one' 'br br there' 'is one of']


In [127]:
# EXAMPLE: TEXT TOPICS ON BIGRAM WORDS
lda_2 = LatentDirichletAllocation(n_components=10,random_state=123,learning_method='batch')
X_topics = lda_2.fit_transform(bag_2)

n_top_words = 5
feature_names = count_2.get_feature_names_out()  # bigram vectorizer fitted above
for topic_idx, topic in enumerate(lda_2.components_) :
    print("Topic %d:" % (topic_idx+1))
    print(" ".join([feature_names[i]
                   for i in topic.argsort()\
                   [:-n_top_words - 1: -1]]))

Topic 1:
ever seen the worst the book have ever movie and
Topic 2:
he is and his she is in his who is
Topic 3:
was the this was he was was very and was
Topic 4:
the show this show you can they are the characters
Topic 5:
would have could have there was was the this was
Topic 6:
he is that he he was they are is that
Topic 7:
the original the characters into the the world some of
Topic 8:
the acting special effects the whole the rest at all
Topic 9:
is very the acting very good is one is great
Topic 10:
you re you can kind of it not going to


### Modeling  : Text Classifier - ngram model

In [13]:
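# ADDING TARGET VARIABLE
# (.values assigns by position; df was shuffled, so aligning on its index would mismatch rows)
a['sentiment'] = df['sentiment'].values
b['sentiment'] = df['sentiment'].values
c['sentiment'] = df['sentiment'].values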

In [14]:
print(a.shape,b.shape,c.shape)

(50000, 51) (50000, 51) (50000, 51)


In [15]:
# SEPARATE FEATURES (X) AND TARGET (y) FOR EACH N-GRAM MODEL
X_1 = a.drop(['sentiment'],axis=1)
y_1 = a['sentiment']
X_2 = b.drop(['sentiment'],axis=1)
y_2 = b['sentiment']
X_3 = c.drop(['sentiment'],axis=1)
y_3 = c['sentiment']

In [18]:
# CREATING TRAIN & TEST DF
X_1_tr, X_1_te , y_1_tr,y_1_te = train_test_split(X_1, y_1, test_size=0.3, random_state=42)
X_2_tr, X_2_te , y_2_tr,y_2_te = train_test_split(X_2, y_2, test_size=0.3, random_state=42)
X_3_tr, X_3_te , y_3_tr,y_3_te = train_test_split(X_3, y_3, test_size=0.3, random_state=42)

In [33]:
# SIMPLE LOGISTIC REGRESSION
logisticRegr = LogisticRegression()
logisticRegr.fit(X_1_tr, y_1_tr)
score_1 = logisticRegr.score(X_1_te, y_1_te)
print('Accuracy of 1gram text classifier ', score_1)

logisticRegr.fit(X_2_tr, y_2_tr)
score_2 = logisticRegr.score(X_2_te, y_2_te)
print('Accuracy of 2grams text classifier ', score_2)

logisticRegr.fit(X_3_tr, y_3_tr)
score_3 = logisticRegr.score(X_3_te, y_3_te)
print('Accuracy of 3grams text classifier ', score_3)



Accuracy of 1gram text classifier  0.6567333333333333
Accuracy of 2grams text classifier  0.6438
Accuracy of 3grams text classifier  0.5864


The drop in accuracy from the 1-gram to the 3-gram model boils down to data sparsity. As the n-gram length increases, the number of times any given n-gram is seen decreases. In the most extreme case, if the longest document in a corpus has n tokens and you look for an m-gram with m = n + 1, you will have no data points at all, because a sequence of that length cannot occur in the data set. The sparser the data, the harder it is to model. For this reason, although a higher-order n-gram model in theory contains more information about a word's context, it cannot easily generalize to other data sets (overfitting), because the number of occurrences of each n-gram it has seen during training shrinks as n increases. A lower-order model, on the other hand, lacks contextual information and so may underfit the data.

For this reason, if you have a relatively large number of token types (i.e. the vocabulary of your text is very rich) but each type has a very low frequency, you may get better results with a lower-order n-gram model. Similarly, if your training data set is very small, you may do better with a lower-order n-gram model. However, assuming that you have enough data to avoid overfitting, a higher-order model gives better separability of your data.

reference : https://stackoverflow.com/questions/36542993/when-are-uni-grams-more-suitable-than-bi-grams-or-higher-n-grams
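
The sparsity argument can be checked on a toy corpus. The sketch below, with a made-up three-sentence corpus and a hypothetical `ngram_counts` helper, counts how many distinct n-grams exist and how many of them are seen more than once as n grows.

```python
from collections import Counter

def ngram_counts(docs, n):
    # Count every n-gram (as a tuple of tokens) across all documents.
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

toy_corpus = [
    "the movie was good",
    "the movie was bad",
    "the plot was good",
]
for n in (1, 2, 3, 5):
    counts = ngram_counts(toy_corpus, n)
    repeated = sum(1 for c in counts.values() if c > 1)
    print(n, "distinct:", len(counts), "seen more than once:", repeated)
# 1 distinct: 6 seen more than once: 4
# 2 distinct: 6 seen more than once: 3
# 3 distinct: 5 seen more than once: 1
# 5 distinct: 0 seen more than once: 0
```

Fewer and fewer n-grams repeat as n increases, and with n = 5 (one more than the longest document) there are no n-grams at all, which is the extreme case described above.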