## TUTORIAL. Machine Learning Methods for NLP -I - Text Classification

In this tutorial, we cover text classification using machine learning methods. For this, we chose the popular text classification task of **Sentiment Analysis**.

**This tutorial is graded**. Complete the exercises and turn them in under week 7.

## 1. What is Sentiment Analysis?
Sentiment analysis (SA), also known as opinion mining, is a natural language processing (NLP) task that involves determining and quantifying the emotional tone or sentiment expressed within a piece of text, typically written or spoken language. In simple terms, sentiment analysis aims to classify text into predefined categories that represent the sentiment or emotional polarity conveyed by the text. These categories are often binary, classifying text as either *positive* or *negative*, but they can also be more fine-grained, such as *positive*, *negative*, or *neutral*.

### 1.1. Dataset for SA
There are many datasets for sentiment analysis, from which we chose the popular "IMDb movie reviews dataset." It is widely used for sentiment analysis and consists of 2000 movie reviews from the Internet Movie Database (IMDb) website -- 1000 positive and 1000 negative. The dataset is often used to train and evaluate machine learning models for sentiment classification tasks. The IMDb movie review data is now a part of NLTK and can be accessed through **nltk.download()**.

**Note:** I have already extracted and provided the training and test data in the form of CSV files.
- `train.csv` - Contains 80% of the IMDB data to be used for training classifiers.

- `test.csv` - Contains 20% of the IMDb data to be used for evaluating classifiers.

Each CSV file has two data columns (plus an unnamed index column):

- **text** : containing the movie review
- **label** : containing the original sentiment -- 0 representing negative and 1 representing positive

Let's load the data into dataframes.


In [1]:
# These two lines are needed to print variables by just mentioning them, e.g., training_data.head()
# If we don't use this, only the last call of a variable gets printed

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd

training_data = pd.read_csv("train.csv")
print (f"Training Data: {len(training_data)} example")
training_data.head()

test_data = pd.read_csv("test.csv")
print (f"Test Data: {len(test_data)} example")
test_data.head()



Training Data: 1600 example
Test Data: 400 example


| Unnamed: 0 | text | label |
|---|---|---|
| 0 | the verdict : spine-chilling drama from horror... | 1 |
| 1 | " the 44 caliber killer has struck again . " ... | 0 |
| 2 | in the company of men made a splash at the sun... | 1 |
| 3 | in the year 2029 , captain leo davidson ( mark... | 0 |
| 4 | [note that followups are directed to rec . art... | 1 |
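
The 80/20 split described above could have been produced in several ways; how the provided CSV files were actually generated is not stated. As a minimal sketch, assuming a stratified random split and a hypothetical toy dataframe (`toy_reviews` is made up for illustration), scikit-learn's `train_test_split` does the job:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the full review set
toy_reviews = pd.DataFrame({
    "text": [f"review number {i}" for i in range(10)],
    "label": [0, 1] * 5,  # 0 = negative, 1 = positive
})

# 80/20 split; stratify keeps the positive/negative ratio the same in both parts
toy_train, toy_test = train_test_split(
    toy_reviews, test_size=0.2, stratify=toy_reviews["label"], random_state=0
)

print(len(toy_train), len(toy_test))
print(toy_test["label"].value_counts().to_dict())
```

Stratifying on the label is a design choice: it keeps the test set balanced, so accuracy on it is easier to interpret.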


## 2. Instantiate a Set of Machine Learning Classifiers

These classifiers work on a wide range of tasks and datasets (and not just textual data) as long as the datasets are featurized and labels are encoded into numbers.

We will instantiate the following classifiers:

1. Logistic Regression
2. Support Vector Classifier
3. Feed Forward Neural Network



In [2]:
## 1. Logistic Regression
from sklearn.linear_model import LogisticRegression
## 2. Support Vector Machine
from sklearn.svm import SVC
## 3. Feed forward neural network or multi-layered perceptron
from sklearn.neural_network import MLPClassifier
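
To illustrate that these classifiers are not tied to text, here is a small sketch on made-up numeric features. The data (`X_toy`, `raw_labels`) is hypothetical, and `LabelEncoder` is just one possible way to turn string labels into numbers. All three classifiers expose the same `fit` / `predict` interface:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Hypothetical non-text features: two numeric measurements per example
X_toy = np.array([[0.1, 0.2], [0.2, 0.1], [0.0, 0.3],
                  [0.9, 0.8], [0.8, 0.9], [1.0, 0.7]])
raw_labels = ["negative", "negative", "negative",
              "positive", "positive", "positive"]

# Encode string labels as integers (classes are sorted: negative -> 0, positive -> 1)
encoder = LabelEncoder()
y_toy = encoder.fit_transform(raw_labels)

models = {
    "logistic regression": LogisticRegression(),
    "support vector classifier": SVC(kernel="linear"),
    "feed-forward network": MLPClassifier(random_state=1, max_iter=2000),
}
predictions = {}
for name, model in models.items():
    model.fit(X_toy, y_toy)                 # same call for every classifier
    predictions[name] = model.predict(X_toy)
    print(name, predictions[name])
```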

We also write a generic function that can be reused for any classifier, as long as it comes from the scikit-learn package.

In [3]:
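from sklearn.metrics import accuracy_score

def train_and_evaluate_classifier(classifier, X_train, y_actual, X_test, y_test_actual):
    """Fit the classifier on the training split and return its accuracy on the test split."""
    classifier.fit(X_train, y_actual)
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test_actual, y_pred)
    return accuracy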
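
The function above reports accuracy, i.e. the fraction of test examples whose predicted label matches the true label. A tiny worked example with made-up labels (`y_true` and `y_hat` are hypothetical) shows that `accuracy_score` agrees with the manual computation:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1]  # hypothetical gold labels
y_hat = [1, 0, 0, 1]   # hypothetical predictions; the third one is wrong

# Accuracy = (number of correct predictions) / (number of examples)
manual = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)
library = accuracy_score(y_true, y_hat)
print(manual, library)  # both are 0.75 (3 of 4 correct)
```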


## 3. Using Unigram / Bag of Words Features for classification

Unigram (Bag of Words) vectorization converts text data into a numerical representation by recording the presence/absence or the frequency of individual words (unigrams) in a document, creating a sparse vector where each dimension corresponds to a unique word in the corpus. This technique disregards word order and looks only at which words occur (and how often), making it a basic but efficient method for text classification.



In [4]:
from sklearn.feature_extraction.text import CountVectorizer
# compute "goodness" of classification through accuracy
from sklearn.metrics import accuracy_score

# Extract text and labels
X_train = training_data['text']
y_train = training_data['label']
X_test = test_data['text']
y_test = test_data['label']

# Create a CountVectorizer for unigrams (bag of words)
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

Train and test Logistic Regression.  

In [5]:
classifier = LogisticRegression(max_iter=1000)
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of Logistic Regression = {accuracy*100}%")

Accuracy of Logistic Regression = 82.0%


Train and test SVM

In [6]:
classifier = SVC(kernel="linear")
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of Support Vector Classification = {accuracy*100}%")

Accuracy of Support Vector Classification = 81.75%


Train and test Feed Forward Network

In [7]:
classifier = MLPClassifier(random_state=1, max_iter=300)
accuracy = train_and_evaluate_classifier(classifier, X_train_vec, y_train, X_test_vec, y_test)
print (f"Accuracy of Feed Forward Network = {accuracy*100}%")

Accuracy of Feed Forward Network = 85.0%


## 4. Using Linguistic Features alongside Unigram Features

Adding Part-of-Speech (POS) feature extraction to the Count Vectorizer can enhance sentiment analysis by considering the grammatical structure and syntactic information in text. POS tags provide valuable insights into the role of words in a sentence, allowing the model to capture nuances that simple word counts may miss.

**Example:** Consider the sentence "The movie was not good." In this case, POS tagging can help distinguish between the negation "not" and the sentiment-carrying word "good." The Count Vectorizer alone treats "not" and "good" as two separate unigrams without capturing their relationship. By extracting POS tags, you can represent the sentence as follows: "DT NN VBD RB JJ." Here, DT is a determiner, NN is a noun, VBD is a past-tense verb, RB is an adverb, and JJ is an adjective. This added information can help the sentiment analysis model better understand the sentence's structure and sentiment orientation.

We extract POS tags from each review using one of the techniques we discussed in our previous practicum. We then define POS unigrams (the count of each POS tag) as a feature and concatenate them with the unigram features.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk import pos_tag
import nltk

nltk.download('averaged_perceptron_tagger')

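# Define a function to extract POS tags from text.
# The reviews appear to be pre-tokenized (space-separated), so split() is used here.
def tokenize_and_get_pos_tags(text):
    tokens = text.split()
    pos_tags = pos_tag(tokens)
    return ' '.join([tag for word, tag in pos_tags])

# Create a CountVectorizer for text and a CountVectorizer for POS tags.
# Note: with default settings, CountVectorizer lowercases the tags and keeps only
# tokens of two or more word characters, so punctuation tags such as "." are dropped.
text_vectorizer = CountVectorizer()
pos_vectorizer = CountVectorizer()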

X_train_text_vec = text_vectorizer.fit_transform(X_train)
X_test_text_vec = text_vectorizer.transform(X_test)

X_train_pos_vec = pos_vectorizer.fit_transform(X_train.apply(tokenize_and_get_pos_tags))
X_test_pos_vec = pos_vectorizer.transform(X_test.apply(tokenize_and_get_pos_tags))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [15]:
# Concatenate the two feature sets
import scipy.sparse as sp
X_train_combined = sp.hstack([X_train_text_vec, X_train_pos_vec])
X_test_combined = sp.hstack([X_test_text_vec, X_test_pos_vec])

# Train a classifier
classifier = LogisticRegression(max_iter=10000)

accuracy = train_and_evaluate_classifier(classifier, X_train_combined, y_train, X_test_combined, y_test)
print(f"Accuracy with POS Features and Count Vectorizer: {accuracy*100}")


Accuracy with POS Features and Count Vectorizer: 82.75


## 5. Using Word Vectors such as GloVe as Features for Classification

**Motivation:** Using averaged GloVe (Global Vectors for Word Representation) vectors as features can capture semantic relationships between words and improve the sentiment analysis model's understanding of text. GloVe vectors encode word meanings in dense vector representations, allowing the model to capture subtle nuances in language, such as word similarities and context.

Following practicum II, we can process the data and extract the average embedding vectors for each review. This vector will serve as our feature.
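
The averaging step itself is simple and can be sketched without downloading anything. The 3-dimensional vectors below are made up for illustration (real GloVe vectors have 50 to 300 dimensions), and `average_vector` is a hypothetical helper name:

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" standing in for GloVe vectors
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
}

def average_vector(text, vectors, dim=3):
    """Mean of the vectors of known words; a zero vector if no word is known."""
    known = [vectors[w] for w in text.split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

both_known = average_vector("good movie", toy_vectors)       # [0.5 0.5 0. ]
one_known = average_vector("good unseenword", toy_vectors)   # [1. 0. 0.]
none_known = average_vector("unseenword", toy_vectors)       # [0. 0. 0.]
print(both_known, one_known, none_known)
```

Out-of-vocabulary words are simply skipped, and a review with no known words falls back to a zero vector. Like bag of words, averaging ignores word order.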

First we download and prepare Glove vectors for usage:



In [16]:
# this is a one time download
!wget -c http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

# do some necessary conversions
!python -m gensim.scripts.glove2word2vec --input  glove.6B.50d.txt --output glove.6B.50d.vec
!python -m gensim.scripts.glove2word2vec --input  glove.6B.200d.txt --output glove.6B.200d.vec
!rm glove*.txt


--2024-02-29 04:16:57--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-02-29 04:16:57--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-02-29 04:16:57--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



In [17]:
import numpy as np
from gensim.models import KeyedVectors

# Load pre-trained GloVe embeddings
word_vectors = KeyedVectors.load_word2vec_format('glove.6B.200d.vec', binary=False)

# Define a function to calculate the average GloVe vector for a text
def get_average_glove_vector(text):
    vectors = [word_vectors[word] for word in text.split() if word in word_vectors]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(word_vectors.vector_size)

# Apply the function to the dataset
X_train_glove = [get_average_glove_vector(text) for text in X_train]
X_test_glove = [get_average_glove_vector(text) for text in X_test]

# Train a classifier (e.g., Logistic Regression)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train_glove, y_train)

# Evaluate the classifier
y_pred = classifier.predict(X_test_glove)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with Glove Vector Averaging:", accuracy)


Accuracy with Glove Vector Averaging: 0.7575


## Exercise:

Now let's do the following.

1. Repeat Section 3 with `TfidfVectorizer`.
2. Train and evaluate the SVC and MLP classifiers using GloVe features, following Section 5. Write down your observations.