### Multinomial Naive Bayes and the Bag of Words Model: -

* Multinomial Naïve Bayes uses term frequency, i.e. the number of times a given term appears in a document.
* Term frequency is often normalized by dividing the raw term frequency by the document length.
* After normalization, term frequency can be used to compute maximum likelihood estimates of the conditional probabilities from the training data.
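
As a rough illustration of how such conditional probabilities can be estimated, the sketch below computes P(word | class) from summed term counts on a tiny made-up corpus. The training sentences, the helper name `word_likelihoods`, and the use of add-one (Laplace) smoothing are assumptions for this example; `alpha=1.0` is chosen to match the default of scikit-learn's `MultinomialNB`.

```python
from collections import Counter

# Hypothetical toy training data: (tokens, class label)
train = [
    (["dark", "night", "raven"], "EAP"),
    (["dark", "ancient", "horror"], "HPL"),
    (["raven", "dark", "dark"], "EAP"),
]

# Vocabulary = every distinct token seen in training
vocab = sorted({tok for tokens, _ in train for tok in tokens})

def word_likelihoods(label, alpha=1.0):
    """Estimate P(word | label) from term counts with additive smoothing."""
    counts = Counter()
    for tokens, y in train:
        if y == label:
            counts.update(tokens)
    total = sum(counts.values())
    # Smoothing keeps unseen words from getting probability zero
    return {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}

p_eap = word_likelihoods("EAP")
print(p_eap["dark"])   # (3 + 1) / (6 + 5)
```

With `alpha=0` the same function gives the plain maximum likelihood estimate, at the cost of zero probabilities for words a class has never seen.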

### The Bag of Words Model:
* With an ever-growing amount of textual information stored in electronic form such as legal documents, policies, company strategies, etc., automatic text classification is becoming increasingly important. 
* Text classification is a supervised learning technique that classifies every new document by assigning one or more class labels from a fixed, predefined set of classes.
* It uses the bag of words approach, where the individual words in the document constitute its features, and the order of the words is ignored.
* It treats the language like it’s just a bag full of words and each message is a random handful of them.
* When the text data is very large, the learning algorithm has to tackle high-dimensional problems, which affects both classification performance and computational speed.
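
To make the "order is ignored" point concrete, here is a minimal sketch using only the standard library. The helper name `bag_of_words` and the two sentences are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to word counts; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a)
print(a == b)  # both sentences produce the same bag
```

The two sentences mean different things, yet they are indistinguishable to a bag of words model. That is the price paid for the simplicity of the representation.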
    
Feature extraction and Selection are the most important sub-tasks in pattern classification. The three main criteria of good features are:

1. Salient: The features should be meaningful and important to the problem.
2. Invariant: The features should be resistant to scaling, distortion, orientation, etc.
3. Discriminatory: For training of classifiers, the features should carry enough information to distinguish between patterns.

### Stemming and Lemmatization: -

* Stemming and lemmatization both transform a word into its root form. Stemming strips suffixes heuristically and may produce non-words, while lemmatization aims to return the grammatically correct base form of the word. Both are preprocessing steps used with the Bag of Words Model.
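
A small sketch of stemming, assuming NLTK is installed. It uses `PorterStemmer`, which needs no extra corpus download; the word list is made up for illustration. Lemmatization could be tried in the same way with NLTK's `WordNetLemmatizer`, which additionally requires the WordNet corpus to be downloaded, so it is left out here.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample words; the stemmer strips suffixes by rule
words = ["running", "cats", "studies"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Note that a stem does not have to be a dictionary word, which is the main practical difference from a lemma.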

In [1]:
import pandas as pd

data = pd.read_csv(r"C:\Users\singhegm\Downloads\train\train.csv")
data.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [2]:
data['author_num'] = data["author"].map({'EAP':0, 'HPL':1, 'MWS':2})
data.head()

Unnamed: 0,id,text,author,author_num
0,id26305,"This process, however, afforded me no means of...",EAP,0
1,id17569,It never once occurred to me that the fumbling...,HPL,1
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP,0
3,id27763,How lovely is spring As we looked from Windsor...,MWS,2
4,id12958,"Finding nothing else, not even gold, the Super...",HPL,1


In [3]:
X = data['text']
y = data['author_num']

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
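
The split above is random, so the scores reported later will vary slightly from run to run. A hedged sketch of a reproducible variant is shown below on made-up data: `random_state=42` is an arbitrary seed, and `stratify` keeps the class proportions similar in both parts. Neither argument is used in this notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the text and author_num columns
texts = [f"sentence {i}" for i in range(10)]
labels = [0, 1] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.3,      # same proportion as above
    random_state=42,    # fixed seed makes the split repeatable
    stratify=labels,    # preserve class balance in train and test
)
print(len(X_tr), len(X_te))
```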

### The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data:

1. CountVectorizer - To convert text to word count vectors.
2. TfidfVectorizer - To convert text to TF-IDF weighted word frequency vectors.
3. HashingVectorizer - To convert text to fixed-length vectors by hashing each word to an integer index.

### Model 1: CountVectorizer

* The CountVectorizer provides a simple way not only to tokenize a collection of text documents and build a vocabulary of known words, but also to encode new documents using that vocabulary.

* You can use it as follows:

1. Create an instance of the CountVectorizer class.
2. Call the fit() function in order to learn a vocabulary from one or more documents.
3. Call the transform() function on one or more documents as needed to encode each as a vector.

* An encoded vector is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document.
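
The three steps can be sketched on a toy corpus as follows. The documents and the variable name `cv_demo` are made up for illustration, so they do not clash with the `cv` object fitted on the real data further down.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus
docs = ["the cat sat", "the dog sat on the mat"]

# 1. Create an instance
cv_demo = CountVectorizer()

# 2. Learn the vocabulary
cv_demo.fit(docs)
vocab_in_column_order = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)
print(vocab_in_column_order)

# 3. Encode a new document; words outside the vocabulary ("and") are dropped
encoded = cv_demo.transform(["the cat and the dog"])
print(encoded.toarray())
```

The result of `transform()` is a sparse matrix, so `toarray()` is only called here to make the counts visible. On a real corpus the dense form would be mostly zeros.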

### Stop Words: -

* Stop words are uninformative words (such as so, and, or, the) that are usually removed from the document.

In [5]:
from nltk.corpus import stopwords

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
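
Removing stop words amounts to a simple membership filter. The sketch below uses a small hand-picked set rather than the full NLTK list, and the helper name `remove_stop_words` is hypothetical.

```python
# Hypothetical, deliberately tiny stop word set for illustration
stop_words = {"the", "and", "or", "so", "of", "a", "is"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

filtered = remove_stop_words("The raven is a bird of the night")
print(filtered)  # ['raven', 'bird', 'night']
```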

### Tokenization: -

* It is the process of breaking down the text corpus into individual elements. These individual elements act as an input to machine learning algorithms.
* Divide the texts into words or smaller sub-texts, which will enable good generalization of relationship between the texts and the labels. This determines the “vocabulary” of the dataset (set of unique tokens present in the data).
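
A minimal tokenization sketch with the standard library is given below. The regular expression keeps runs of two or more word characters, which mirrors the default `token_pattern` of scikit-learn's `CountVectorizer`; the two-sentence corpus and the helper name `tokenize` are made up for illustration.

```python
import re

# Tokens are runs of two or more word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

corpus = ["It was a dark night.", "A dark, dark house!"]
tokens = [tokenize(doc) for doc in corpus]

# The vocabulary is the set of unique tokens across the corpus
vocabulary = sorted({tok for doc in tokens for tok in doc})

print(tokens)
print(vocabulary)
```

Single-character words such as "a" are discarded by this pattern, and punctuation never becomes a token.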


### Vectorization: -
* The process of converting words into numbers is called vectorization, i.e. defining a good numerical measure to characterize these texts.
* It follows tokenization, which separates a piece of text into smaller units called tokens. Tokens can be words, characters, or subwords.
* Word embedding, or word vectorization, is a methodology in NLP that maps words or phrases from a vocabulary to corresponding vectors of real numbers, which are used for word prediction and for measuring word similarity or semantics.


#### "The text must be parsed to remove words, called tokenization. Then the words need to be encoded as integers or floating point values for use as input to a machine learning algorithm, called feature extraction (or vectorization)."

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words = 'english')

X_train_cv = cv.fit_transform(X_train)
X_train_cv.shape

(13705, 21535)

### The %time and %%time magic commands: -
* Both can report CPU time (time that the CPU is busy) and wall time (total elapsed time for the execution).
* So wall time minus CPU time gives the time that the system is busy elsewhere (e.g. time.sleep or time for I/O).
* %time is a line magic that times only the statement written on the same line, whereas %%time is a cell magic that times the whole cell. A bare %time with nothing after it times an empty statement, which is why the `In [7]` cell reports `Wall time: 0 ns`.
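
The difference between wall time and CPU time can also be seen outside IPython. This is a sketch with the standard library's `time.perf_counter` (wall clock) and `time.process_time` (CPU time of the current process); the sleep duration and loop size are arbitrary choices for illustration.

```python
import time

wall_start = time.perf_counter()   # wall-clock timer
cpu_start = time.process_time()    # CPU time used by this process

time.sleep(0.2)                         # waiting: wall time passes, CPU is idle
total = sum(i * i for i in range(100_000))  # computing: both timers advance

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"Wall time: {wall_elapsed:.3f} s")
print(f"CPU time:  {cpu_elapsed:.3f} s")
print(f"Waiting:   {wall_elapsed - cpu_elapsed:.3f} s")
```

The gap between the two numbers is roughly the time spent sleeping, which is the quantity described as "busy elsewhere" above.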

In [7]:
%time

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


model = MultinomialNB()
model.fit(X_train_cv, y_train)
print(model)
print(model.score(X_train_cv, y_train))

Wall time: 0 ns
MultinomialNB()
0.915213425757023


In [8]:
%%time

X_test_cv = cv.transform(X_test) 
print (model.score(X_test_cv, y_test))

0.8261831801157644
Wall time: 261 ms


In [9]:
prediction1 = model.predict(X_test_cv)

In [10]:
from sklearn.metrics import classification_report

print(classification_report(y_test,prediction1))

              precision    recall  f1-score   support

           0       0.84      0.82      0.83      2381
           1       0.86      0.81      0.83      1735
           2       0.78      0.86      0.82      1758

    accuracy                           0.83      5874
   macro avg       0.83      0.83      0.83      5874
weighted avg       0.83      0.83      0.83      5874



### Model 2: TfidfVectorizer

* TF-IDF stands for "Term Frequency – Inverse Document Frequency".
* Term Frequency: This summarizes how often a given word appears within a document.
* Inverse Document Frequency: This downscales words that appear a lot across documents.
* TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.
* The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.

**The same create, fit, and transform process is used as with the CountVectorizer.**
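
To see what the weights look like, the sketch below computes TF-IDF by hand on a made-up, already tokenized corpus. It assumes the smoothed formula that `TfidfVectorizer` uses by default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each document vector; the helper names `idf` and `tfidf` are hypothetical.

```python
import math

# Hypothetical tokenized corpus
docs = [["dark", "night"], ["dark", "dark", "house"], ["bright", "day"]]
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency."""
    df = sum(term in doc for doc in docs)  # number of documents containing term
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(doc):
    """Raw term count times idf, scaled to unit L2 length."""
    weights = {t: doc.count(t) * idf(t) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

print(idf("dark"), idf("house"))  # "house" is rarer, so its idf is larger
scores = tfidf(docs[1])
print(scores)
```

A word that occurs in fewer documents gets a larger idf, so it is boosted relative to its raw count.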

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfid = TfidfVectorizer(stop_words = 'english')

X_train_tfidf = tfid.fit_transform(X_train) 
X_train_tfidf.shape

(13705, 21535)

In [12]:
%%time

model2 = MultinomialNB()
model2.fit(X_train_tfidf, y_train)
print(model2)
print(model2.score(X_train_tfidf, y_train))

MultinomialNB()
0.9165997811017876
Wall time: 27 ms


In [13]:
%%time

X_test_tfidf = tfid.transform(X_test) 
print (model2.score(X_test_tfidf, y_test))

0.8094994892747702
Wall time: 542 ms


In [14]:
prediction2 = model2.predict(X_test_tfidf)

In [15]:
print(classification_report(y_test,prediction2))

              precision    recall  f1-score   support

           0       0.76      0.88      0.82      2381
           1       0.91      0.70      0.79      1735
           2       0.81      0.82      0.82      1758

    accuracy                           0.81      5874
   macro avg       0.83      0.80      0.81      5874
weighted avg       0.82      0.81      0.81      5874



### Model 3: HashingVectorizer

* Counts and frequencies can be very useful, but one limitation of these methods is that the vocabulary can become very large.
* HashingVectorizer: To avoid this, we can use a one-way hash of words to convert them to integers. The clever part is that no vocabulary is required and you can choose an arbitrary fixed vector length via n_features.
* The downside is that the hash is a one-way function, so there is no way to convert the encoding back to a word (which may not matter for many supervised learning tasks).
* As MultinomialNB() cannot take negative values as input, we can specify alternate_sign = False so that all feature values are non-negative.
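
The idea behind the hashing trick can be sketched in a few lines. This is an illustration only: it uses MD5 from the standard library as a stand-in hash, whereas scikit-learn uses its own hash function internally, and the helper name `hash_vectorize` and the vector length of 16 are made up.

```python
import hashlib

def hash_vectorize(text, n_features=16):
    """Count tokens in a fixed-length vector indexed by a hash of each token."""
    vec = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features  # column for this token
        vec[index] += 1                       # counts only, so never negative
    return vec

v = hash_vectorize("dark dark night")
print(v)
```

No vocabulary is stored, so the same function can encode unseen text directly. Different words can land in the same column (a collision), which becomes more likely as `n_features` gets smaller.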

In [16]:
from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(stop_words = 'english', alternate_sign = False, n_features = 21528)

X_train_hv = hv.transform(X_train)
print(X_train_hv.shape)

(13705, 21528)


In [17]:
%%time

model3 = MultinomialNB()
model3.fit(X_train_hv, y_train)
print(model3)
print(model3.score(X_train_hv, y_train))

MultinomialNB()
0.8623859905144108
Wall time: 37 ms


In [18]:
%%time

X_test_hv = hv.transform(X_test) 
print (model3.score(X_test_hv, y_test))

0.7689819543752128
Wall time: 286 ms


In [19]:
prediction3 = model3.predict(X_test_hv)

In [20]:
print(classification_report(y_test,prediction3))

              precision    recall  f1-score   support

           0       0.71      0.87      0.78      2381
           1       0.87      0.63      0.73      1735
           2       0.80      0.76      0.78      1758

    accuracy                           0.77      5874
   macro avg       0.79      0.76      0.77      5874
weighted avg       0.78      0.77      0.77      5874

