# Introduction
This post examines how effective bigram features are for sentiment analysis, compared with unigram features alone.

# Environment set-up and data preparation
Let's start by setting up the environment.
To get a clean installation that does not interfere with my existing Python packages, I created a virtual environment named sentimentVenv. The Python version is 3.6.

```console
virtualenv sentimentVenv --python=python3.6
```

Now, activate the environment.

```console
source sentimentVenv/bin/activate
```

Inside this environment, we'll need to install these libraries:
* scikit-learn
* scipy
* jupyter

```console
pip install scikit-learn
pip install scipy
pip install jupyter
```

The environment should now be ready.
The dataset can be downloaded from this link. It includes 50,000 text files, each containing one movie review. The files are stored in pos and neg directories according to their sentiment.
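Before loading anything, it can help to confirm that the archive was extracted where the code expects it. The sketch below counts the `.txt` files in each split and label directory. The helper name `count_reviews` is hypothetical, and the directory layout (`aclImdb/<split>/<label>/` under `data/raw`) is an assumption taken from the loading code used in this post.

```python
import os


def count_reviews(root):
    """Count .txt review files under root/<split>/<label> (hypothetical helper)."""
    counts = {}
    for split in ('train', 'test'):
        for label in ('neg', 'pos'):
            dir_name = os.path.join(root, split, label)
            # A missing directory is reported as 0 instead of raising.
            if not os.path.isdir(dir_name):
                counts[(split, label)] = 0
                continue
            counts[(split, label)] = sum(
                1 for name in os.listdir(dir_name) if name.endswith('.txt'))
    return counts


if __name__ == '__main__':
    # Assumed location, matching the DATA_DIR used in this post.
    root = os.path.join(os.getcwd(), 'data/raw', 'aclImdb')
    for key, n in sorted(count_reviews(root).items()):
        print(key, n)
```

If the dataset is complete, each of the four directories should report 12,500 files, which matches the support column in the evaluation reports.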

Let's load the python libraries and have a look at the dataset.

In [1]:
import os

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.naive_bayes import MultinomialNB

DATA_DIR = os.path.join(os.getcwd(), 'data/raw')

Let's define a function that loads the dataset and extracts the two columns we need:
* The sentiment: a binary label (neg or pos)
* The text of the movie review: string

In [2]:
def load_data():
    print('Now Loading...')
    res = {'train': {'label': [], 'data': []},
           'test': {'label': [], 'data': []}}
    for test_or_train in ['test', 'train']:
        for label in ['neg', 'pos']:
            dir_name = os.path.join(DATA_DIR, 'aclImdb', test_or_train, label)
            files = os.listdir(dir_name)
            for file in files:
                if not file.endswith('.txt'):
                    continue
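                # Read as UTF-8 explicitly so the result does not depend on the platform default.
                with open(os.path.join(dir_name, file), encoding='utf-8') as f: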
                    item = f.read()
                    res[test_or_train]['label'].append(label)
                    res[test_or_train]['data'].append(item)
    X_train, X_test = res['train']['data'], res['test']['data']
    y_train, y_test = res['train']['label'], res['test']['label']
    print('Dataset loaded')
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = load_data()

Now Loading...
Dataset loaded


# Building a sentiment classifier: unigram features
Let's now get to the sentiment classification part.
To classify texts, we first have to turn them into vectors. In scikit-learn this is easy: we only have to pass the dataset to CountVectorizer, which tokenizes the text and converts the tokens into a term-frequency matrix. As a further refinement, we can weight each word by its importance instead of using raw counts. One such weight is the tf-idf score.
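To make the two steps concrete, here is a minimal sketch on a three-document toy corpus (made up for illustration, not part of the dataset). It first builds the count matrix with `CountVectorizer` and then reweights it with `TfidfTransformer`, both with default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Toy corpus, made up for illustration only.
docs = ['good movie', 'bad movie', 'good good plot']

# Step 1: tokenize and count. Rows are documents, columns are vocabulary terms.
vect = CountVectorizer()
counts = vect.fit_transform(docs)
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)             # column order of the matrix
print(counts.toarray())  # raw term frequencies

# Step 2: reweight the counts with tf-idf (rows are L2-normalized by default).
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray().round(2))
```

In the second document, "bad" ends up with a higher weight than "movie" because it occurs in fewer documents, which is the kind of importance weighting tf-idf is meant to capture.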

Let's start by building a pipeline that computes the tf-idf matrix and passes it to a multinomial Naive Bayes classifier.

In [3]:
def build_pipeline():
    text_clf = Pipeline([('vect', CountVectorizer(min_df=1, stop_words='english')),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB()),
                         ])
    return text_clf

We should now be ready to feed these vectors into a classifier. 

In [4]:
text_clf = build_pipeline()
text_clf = text_clf.fit(X_train, y_train)

Now that the model is trained, let's evaluate it on the test set:

In [5]:
y_pred = text_clf.predict(X_test)
print('Accuracy: {}'.format(accuracy_score(y_test, y_pred)))
print(classification_report(y_test, y_pred))

Accuracy: 0.82992
             precision    recall  f1-score   support

        neg       0.80      0.88      0.84     12500
        pos       0.87      0.78      0.82     12500

avg / total       0.83      0.83      0.83     25000



Almost 83% accuracy, which is not bad. Tuning the pipeline's parameters may push the score higher.
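One way to do such tuning is a grid search over the pipeline's parameters. The sketch below uses scikit-learn's `GridSearchCV` on a tiny made-up corpus so that it runs in seconds. The parameter grid and the `cv=2` setting are arbitrary assumptions for illustration, not settings used to produce the results in this post.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for X_train / y_train.
texts = ['great film', 'loved it', 'wonderful acting', 'great fun',
         'terrible film', 'hated it', 'awful acting', 'terrible bore']
labels = ['pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg']

pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'tfidf__use_idf': [True, False],
    'clf__alpha': [0.1, 1.0],
}

# Try every combination with 2-fold cross-validation, then refit on all data.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring='accuracy')
search.fit(texts, labels)
print(search.best_params_)
print(search.best_score_)
```

On the real data, `texts` and `labels` would be replaced by `X_train` and `y_train`, and a larger number of folds would likely be more appropriate.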

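# Building a sentiment classifier: bigram features
Next, let's add bigram features. Setting ngram_range=(1, 2) makes CountVectorizer extract both unigrams and bigrams. Note that the custom token_pattern also keeps single-character tokens, which the default pattern drops.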

In [6]:
def build_pipeline():
    text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1, stop_words='english')),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB()),
                         ])
    return text_clf
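To see what `ngram_range=(1, 2)` changes, the following sketch compares the vocabularies extracted from a single made-up sentence. It also shows, as a side note, that with `stop_words='english'` the stop words are removed before the bigrams are formed, so a bigram such as "not good" does not survive.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentence for illustration only.
doc = ['this movie is not good']

# Unigrams only (the default).
unigram = CountVectorizer(ngram_range=(1, 1)).fit(doc)
print(sorted(unigram.vocabulary_))
# ['good', 'is', 'movie', 'not', 'this']

# Unigrams plus bigrams of adjacent tokens.
bigram = CountVectorizer(ngram_range=(1, 2)).fit(doc)
print(sorted(bigram.vocabulary_))
# ['good', 'is', 'is not', 'movie', 'movie is', 'not', 'not good', 'this', 'this movie']

# Stop words are filtered first, then n-grams are built from what remains.
filtered = CountVectorizer(ngram_range=(1, 2), stop_words='english').fit(doc)
print(sorted(filtered.vocabulary_))
# ['good', 'movie', 'movie good']
```

Whether removing stop words helps or hurts a bigram model is an empirical question. Negations such as "not" are in scikit-learn's English stop list, so it may be worth comparing both settings.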

We should now be ready to feed these vectors into a classifier.

In [7]:
text_clf = build_pipeline()
text_clf = text_clf.fit(X_train, y_train)

Now that the model is trained, let's evaluate it on the test set:

In [8]:
y_pred = text_clf.predict(X_test)
print('Accuracy: {}'.format(accuracy_score(y_test, y_pred)))
print(classification_report(y_test, y_pred))

Accuracy: 0.85484
             precision    recall  f1-score   support

        neg       0.83      0.90      0.86     12500
        pos       0.89      0.81      0.85     12500

avg / total       0.86      0.85      0.85     25000



About 85.5% accuracy, roughly 2.5 points higher than with unigram features alone. Further parameter tuning may improve the score again.

# Conclusion
In this post we explored different features for sentiment analysis: we built sentiment classifiers using unigram and bigram features.
The classifier using unigram features reached an accuracy of about 83%, which is not bad.

To improve on this, we added bigram features. That classifier reached an accuracy of about 85.5%, higher than the unigram-only classifier.

I hope this tutorial was a good introduction to sentiment analysis.