**Basic sentiment analysis using scikit-learn and naive Bayes**

In [None]:
import pandas as pd
import numpy as np

In [None]:
from google.colab import files
uploaded = files.upload()

Saving imdb_labelled.txt to imdb_labelled.txt


In [None]:
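# Each line of the file has the form "<sentence>\t<label>", where the label is 0 or 1.
with open("imdb_labelled.txt", "r") as text_file:
    lines = text_file.read().split("\n")

# Keep only well-formed rows: exactly two tab-separated fields and a non-empty label.
rows = (line.split("\t") for line in lines)
newLines = [row for row in rows if len(row) == 2 and row[1] != ""]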

In [None]:
newLines

In [None]:
train_documents = [line[0] for line in newLines]
train_labels = [int(line[1]) for line in newLines]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

In [None]:
count_vectorizer = CountVectorizer(binary=True)
train_documents = count_vectorizer.fit_transform(train_documents)

In [None]:
classifier = BernoulliNB().fit(train_documents, train_labels)

In [None]:
def predictionOutput(sentence):
    prediction = classifier.predict(count_vectorizer.transform([sentence]))
    if(prediction[0] == 1):
        print("This is a Positive Sentiment Sentence")
    elif (prediction[0] == 0):
        print("This is a Negative Sentiment Sentence")

In [None]:
predictionOutput("I am having a very bad day")

This is a Negative Sentiment Sentence


**Text encoding with scikit learn:**

The text must first be parsed to split it into words, a step called tokenization. The words then need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization).


We cannot work with text directly when using machine learning algorithms.

Instead, we need to convert the text to numbers.

We may want to perform classification of documents, so each document is an “input” and a class label is the “output” for our predictive algorithm. Algorithms take vectors of numbers as input, therefore we need to convert documents to fixed-length vectors of numbers.

A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words Model, or BoW.

The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.

This can be done by assigning each word a unique number. Then any document we see can be encoded as a fixed-length vector with the length of the vocabulary of known words. The value in each position in the vector could be filled with a count or frequency of each word in the encoded document.

This is the bag of words model, where we are only concerned with encoding schemes that represent what words are present or the degree to which they are present in encoded documents without any information about order.
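The encoding described above can be sketched by hand in a few lines. This is a minimal illustration only: the toy corpus and the helper name `encode` are made up for this example, and tokenization is a plain whitespace split rather than what scikit-learn does.

```python
# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat on the mat", "the dog sat"]

# Assign each distinct word a unique index (sorted for a reproducible order).
vocab = {word: idx for idx, word in enumerate(sorted({w for d in docs for w in d.split()}))}

def encode(document, vocabulary):
    """Return a fixed-length count vector; words outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word in document.split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector

print(vocab)
print(encode("the cat sat on the mat", vocab))
print(encode("the mat sat on the cat", vocab))  # same vector: word order is discarded
```

Both sentences map to the same vector because only word occurrence is kept, which is exactly the information the bag-of-words model throws away.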

The scikit-learn library has three relevant schemes for this purpose:

1) Text to word count vectors with CountVectorizer

2) Text to word frequency vectors with TfidfVectorizer.

3) Text to unique integers with HashingVectorizer.




1) Word Counts with CountVectorizer

The CountVectorizer provides a simple way both to tokenize a collection of text documents and build a vocabulary of known words, and to encode new documents using that vocabulary. It is used as follows:

- Create an instance of the CountVectorizer class.
- Call the fit() function in order to learn a vocabulary from one or more documents.
- Call the transform() function on one or more documents as needed to encode each as a vector.



In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# feeding text input
text = ["In a beautiful day like this its easy to feel hopeful"]



In [None]:
# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)

In [None]:
# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

In [None]:

# vocab check: none of these words are in the learned vocabulary, so every entry should be zero
text2 = ["No words are there above"]
vector = vectorizer.transform(text2)
print(vector.toarray())

2) Word Frequencies with TfidfVectorizer 




Word counts are a good starting point, but are very basic.

One issue with simple counts is that some words like “the” will appear many times and their large counts will not be very meaningful in the encoded vectors.

An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for “Term Frequency – Inverse Document Frequency”, which are the components of the resulting scores assigned to each word.

- Term Frequency: This summarizes how often a given word appears within a document.
- Inverse Document Frequency: This downscales words that appear a lot across documents.

Without going into the math, TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.
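For readers who do want to see the arithmetic, here is a small hand-rolled sketch. It assumes the smoothed IDF variant that scikit-learn uses by default, ln((1 + n) / (1 + df)) + 1, and it omits the per-document L2 normalization that TfidfVectorizer applies afterwards. The corpus mirrors the one in the TfidfVectorizer cell (lowercased), and the helper names are hypothetical.

```python
import math

# Toy corpus mirroring the TfidfVectorizer cell, lowercased and split on whitespace.
docs = ["in a beautiful day like this its easy to feel hopeful", "day", "day", "day"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, with df = number of documents containing the term."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

def term_frequency(term, tokens):
    """Raw count of the term in one document."""
    return tokens.count(term)

for term in ("day", "hopeful"):
    tf = term_frequency(term, tokenized[0])
    idf = smoothed_idf(term)
    print(term, "tf =", tf, "idf =", round(idf, 3), "tf*idf =", round(tf * idf, 3))
```

"day" appears in every document, so it gets the lowest possible weight, while "hopeful" appears in only one document and is weighted more heavily.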

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. Alternately, if you already have a learned CountVectorizer, you can use it with a TfidfTransformer to just calculate the inverse document frequencies and start encoding documents.
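The two-step route mentioned above can be sketched as follows. This is only an illustration with default parameters; with those defaults the two-step result should match what TfidfVectorizer produces in one step.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = ["In a beautiful day like this its easy to feel hopeful", "day", "day", "day"]

# Step 1: learn the vocabulary and produce raw word counts.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)

# Step 2: learn IDF weights from the counts and rescale them.
tfidf_transformer = TfidfTransformer()
two_step = tfidf_transformer.fit_transform(counts)

# One-step equivalent, for comparison.
one_step = TfidfVectorizer().fit_transform(docs)

print(two_step.shape)
print(np.allclose(two_step.toarray(), one_step.toarray()))
```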

The same create, fit, and transform process is used as with the CountVectorizer.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Feeding in text input
text = ["In a beautiful day like this its easy to feel hopeful",
		"day", "day", "day"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

3) Hashing with HashingVectorizer

Counts and frequencies can be very useful, but one limitation of these methods is that the vocabulary can become very large.

This, in turn, will require large vectors for encoding documents and impose large requirements on memory and slow down algorithms.

A clever workaround is to use a one-way hash of words to convert them to integers. The clever part is that no vocabulary is required, and you can choose a fixed vector length of any size. The trade-off is that the hash is one way, so the encoding cannot be converted back to words.

The HashingVectorizer class implements this approach: it consistently hashes words and can then tokenize and encode documents as needed.
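The idea behind the hashing trick can be sketched in plain Python. Note the assumptions: this sketch uses MD5 from the standard library as a stand-in hash and produces raw bucket counts, whereas HashingVectorizer uses its own hash function and, by default, signed and normalized values. The helper name `hash_encode` is hypothetical.

```python
import hashlib

def hash_encode(document, n_features=20):
    """Map each word to a bucket via a stable hash and count hits per bucket."""
    vector = [0] * n_features
    for word in document.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_features  # no vocabulary lookup needed
        vector[bucket] += 1
    return vector

encoded = hash_encode("In a beautiful day like this its easy to feel hopeful")
print(len(encoded), sum(encoded))
```

The vector length is fixed by `n_features` regardless of how many distinct words exist; with a small value, different words can collide in the same bucket.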

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer
# Feeding in text input
text = ["In a beautiful day like this its easy to feel hopeful"]
# create the transform
vectorizer = HashingVectorizer(n_features=20)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

**Classification**

-Rule-based systems

Rule-based approaches classify text into organized groups by using a set of handcrafted linguistic rules. These rules instruct the system to use semantically relevant elements of a text to identify relevant categories based on its content.
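A deliberately tiny sketch of a rule-based classifier is shown here for illustration. The keyword lists and the function name are hypothetical; a real rule-based system would rely on far richer linguistic patterns.

```python
# Hypothetical handcrafted keyword rules, for illustration only.
POSITIVE_WORDS = {"good", "great", "excellent", "love"}
NEGATIVE_WORDS = {"bad", "awful", "terrible", "hate"}

def rule_based_sentiment(sentence):
    """Label a sentence by comparing counts of positive and negative keywords."""
    words = sentence.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "unknown"  # no rule fired, or the rules cancel out

print(rule_based_sentiment("I am having a very bad day"))
```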

-Machine learning-based systems

Instead of relying on manually crafted rules, machine learning text classification learns to make classifications based on past observations. By using pre-labeled examples as training data, machine learning algorithms can learn the different associations between pieces of text, and that a particular output (i.e., tags) is expected for a particular input (i.e., text).

Examples: Naive Bayes, support vector machines, deep learning (CNN, RNN)

-Hybrid systems

Hybrid systems combine a machine learning-trained base classifier with a rule-based system, used to further improve the results. These hybrid systems can be easily fine-tuned by adding specific rules for those conflicting tags that haven’t been correctly modeled by the base classifier.

Naive Bayes Classifier

The naive Bayes classifier is a machine learning algorithm for classifying or filtering data. It is widely used in natural language processing, for tasks such as categorizing news articles into topics, filtering spam mail, and sentiment analysis. If you want to classify, filter, or categorize text data, naive Bayes is a reasonable first choice. It comes in several variants, including Gaussian, multinomial, and Bernoulli naive Bayes.

Naive Bayes is a supervised learning method based on Bayes’ theorem. Bayes’ theorem is a result from probability theory about conditional probability: it gives the probability of an event given observed evidence, starting from the prior probability of that event. In other words, it shows how the prior probability is updated by the evidence to give the final (posterior) probability. As a supervised method, naive Bayes needs labeled training data. The quality of the training data affects classification accuracy, so it is important to collect a good-quality training dataset.
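Written out, with class label y and features x_1, ..., x_n (symbols introduced here for illustration), Bayes’ theorem and the resulting naive Bayes decision rule look like this. The "naive" step is the assumption that the features are conditionally independent given the class, which lets the joint likelihood factor into a product; the denominator is dropped because it is the same for every class.

```latex
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
\quad\Longrightarrow\quad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```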


Types of NB in scikit learn

1) Gaussian NB

GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian.

2) Multinomial NB

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice).

Note:
ComplementNB implements the complement naive Bayes (CNB) algorithm. CNB is an adaptation of the standard multinomial naive Bayes (MNB) algorithm that is particularly suited for imbalanced data sets. Specifically, CNB uses statistics from the complement of each class to compute the model’s weights. The inventors of CNB show empirically that the parameter estimates for CNB are more stable than those for MNB. Further, CNB regularly outperforms MNB (often by a considerable margin) on text classification tasks.

3) Bernoulli NB

BernoulliNB implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. In the case of text classification, word occurrence vectors (rather than word count vectors) may be used to train and use this classifier. BernoulliNB might perform better on some datasets, especially those with shorter documents. 

4) Categorical NB

CategoricalNB implements the categorical naive Bayes algorithm for categorically distributed data. It assumes that each feature, which is described by the index i, has its own categorical distribution.

Note:
Out-of-core NB fitting

Naive Bayes models can be used to tackle large-scale classification problems for which the full training set might not fit in memory. To handle this case, MultinomialNB, BernoulliNB, and GaussianNB expose a partial_fit method that can be used incrementally, as with other classifiers, and as demonstrated in the scikit-learn example “Out-of-core classification of text documents”.
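A minimal sketch of incremental fitting is given here. The mini-batches are hypothetical stand-ins for chunks streamed from disk, and pairing MultinomialNB with HashingVectorizer is one possible choice, not the only one.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-batches standing in for chunks streamed from disk.
batches = [
    (["a great and enjoyable film", "awful plot and bad acting"], [1, 0]),
    (["great acting and a great story", "bad and boring film"], [1, 0]),
]

# HashingVectorizer is stateless, so no vocabulary pass over the full data is needed.
# alternate_sign=False keeps features non-negative, as MultinomialNB requires.
vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False)
classifier = MultinomialNB()

for texts, labels in batches:
    X = vectorizer.transform(texts)
    # classes must list every possible label on the first call to partial_fit.
    classifier.partial_fit(X, labels, classes=[0, 1])

prediction = classifier.predict(vectorizer.transform(["a great film"]))
print(prediction)
```

Only one batch is held in memory at a time, which is the point of the out-of-core approach.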


*Advantages*

-It is not only a simple approach but also a fast and accurate method for prediction.

-Naive Bayes has very low computation cost.

-It can efficiently work on a large dataset.

-It performs well in case of discrete response variable compared to the continuous variable.

-It can be used with multiple class prediction problems.

-It also performs well in the case of text analytics problems.

-When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression.

*Disadvantages*

-The assumption of independent features. In practice, it is almost impossible to obtain a set of predictors that are entirely independent.

-If a feature value never occurs together with a particular class in the training data, its estimated likelihood for that class is zero, which forces the posterior probability of that class to zero whenever the value appears. The model then cannot make a sensible prediction. This is known as the Zero Probability/Frequency Problem.

*Improvements*

-If continuous features are not normally distributed, apply a transformation or another method to bring them closer to a normal distribution.

-If the test data set has a zero-frequency issue, apply a smoothing technique such as the “Laplace correction” to predict the class of the test data.

-Remove correlated features: highly correlated features are effectively counted twice in the model, which can over-inflate their importance.

-Naive Bayes classifiers have limited options for parameter tuning, such as alpha=1 for smoothing and fit_prior=[True|False] to choose whether to learn class prior probabilities.

-Consider applying a classifier combination technique such as ensembling, bagging, or boosting.
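The smoothing point in the list above can be illustrated with a short sketch of additive (Laplace) smoothing for a word likelihood. The counts and the helper name are hypothetical; `alpha` plays the same role as the alpha parameter mentioned for the scikit-learn classifiers.

```python
def smoothed_likelihood(word_count, total_count, vocab_size, alpha=1.0):
    """Estimate P(word | class) with additive smoothing: (count + alpha) / (total + alpha * V)."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# Hypothetical counts: the word never appeared in this class during training.
unsmoothed = smoothed_likelihood(0, 100, 50, alpha=0.0)
smoothed = smoothed_likelihood(0, 100, 50, alpha=1.0)
print(unsmoothed, smoothed)
```

Without smoothing, the single zero would wipe out the whole product of likelihoods for that class; with alpha=1 the estimate becomes small but non-zero.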