# Simple Text Classifier
### What is Text Classification?
Text classification, also known as text categorization, is the task of automatically assigning one or more predefined categories (labels) to a document based on its content. It is a fundamental task in natural language processing (NLP) and is used in applications such as spam detection, sentiment analysis, topic classification, and document categorization.
### How to Represent a Text Document?
There are several common methods for representing a text document in a format that can be processed by machine learning algorithms. Here are some of the most popular techniques:

- Bag-of-Words (BoW): In this approach, each document is represented as a vector where each element corresponds to the frequency of a particular word in the document. Stop words (commonly occurring words like "and," "the," "is") are often removed to reduce noise. The order of words is disregarded, hence the term "bag-of-words."

- Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF is similar to the BoW approach but also takes into account the importance of words in the context of the entire corpus. It calculates a weight for each word based on its frequency in the document (term frequency) and its rarity across all documents (inverse document frequency).

- Word Embeddings: Word embeddings represent words as dense, low-dimensional vectors in a continuous vector space. Techniques like Word2Vec, GloVe, and FastText learn to map words to these vector representations in a way that captures semantic relationships between words.

- N-grams: Instead of considering individual words, N-grams represent sequences of N contiguous words in the text. For example, bigrams (N=2) consider pairs of consecutive words, while trigrams (N=3) consider sequences of three words.

- Character-level Representations: In some cases, especially when dealing with languages with complex morphology or when word boundaries are not well-defined (e.g., Chinese or Thai), character-level representations can be used. Each character is treated as a separate feature.
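
To make these representations concrete, here is a minimal pure-Python sketch on a two-document toy corpus. The helper names (`tokenize`, `ngrams`, `tf_idf`) and the corpus are made up for illustration. The TF-IDF weight uses one common textbook variant, tf × ln(N / df); libraries such as scikit-learn apply a smoothed version, so their numbers differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
corpus = ["the quick brown fox", "the lazy dog"]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def ngrams(tokens, n):
    """Return all runs of n contiguous tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(term, doc_tokens, all_docs_tokens):
    """Textbook TF-IDF: raw term count times ln(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for tokens in all_docs_tokens if term in tokens)
    return tf * math.log(len(all_docs_tokens) / df) if df else 0.0

docs = [tokenize(d) for d in corpus]

bow = Counter(docs[0])            # bag-of-words counts for document 1
bigrams = ngrams(docs[0], 2)      # word bigrams for document 1
chars = list(corpus[1])           # character-level view of document 2

print(bow)
print(bigrams)
print(tf_idf("the", docs[0], docs))    # 0.0: "the" occurs in every document
print(tf_idf("quick", docs[0], docs))  # ln(2), about 0.693
```

The word "the" gets a weight of zero because it appears in every document, which is exactly the effect the inverse document frequency is meant to have.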

In this tutorial, we will build a simple text classifier based on the Bag-of-Words representation of documents.

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Load Dataset
In this tutorial, we will use the IMDB review dataset. 

The IMDB dataset contains 50,000 movie reviews for natural language processing and text analytics.
It is a binary sentiment classification dataset with substantially more data than earlier benchmark datasets. The original release provides 25,000 highly polar movie reviews for training and 25,000 for testing; the CSV file used here holds all 50,000 reviews, so we create our own train/test split below. The task is to predict whether each review is positive or negative.

The dataset can be accessed from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews 

In [2]:
# Load the dataset
df = pd.read_csv("IMDB Dataset.csv")

# Explore the dataset
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


### Data Preprocessing

In [3]:
# Convert sentiment labels to numerical values (0 for negative, 1 for positive)
df['sentiment'] = df['sentiment'].map({'negative': 0, 'positive': 1})

df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1
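
One thing worth checking after this step: `Series.map` with a dictionary returns NaN for any value that is not a key of the dictionary, so an inconsistently spelled label such as "Positive" would silently become missing. The toy frame below is made up to show the effect; it is not part of the IMDB data.

```python
import pandas as pd

# Hypothetical mini-frame with one label that is spelled inconsistently
toy = pd.DataFrame({"sentiment": ["positive", "negative", "Positive"]})

mapped = toy["sentiment"].map({"negative": 0, "positive": 1})
print(mapped.tolist())

# Count labels that failed to map; on clean data this should be 0
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

Running the same `isna().sum()` check on `df['sentiment']` is a cheap way to confirm that every review received a label.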


In [4]:
# Split the data into features (reviews) and target (sentiment)
X = df['review']
y = df['sentiment']

print(X.shape, y.shape)

(50000,) (50000,)


In [5]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
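
The call above splits purely at random. If an exact class balance in both splits matters, `train_test_split` also accepts a `stratify` argument. The sketch below uses made-up toy labels (not the IMDB data) to show its effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, exactly half in each class (hypothetical)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 50 + [1] * 50)

# stratify=y_toy keeps the 50/50 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))  # 80 20
print(np.bincount(y_te))     # class counts in the test split
```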

Next, we convert each text document (movie review) into a "bag of words", i.e., a vector of term frequencies (TF). 
In this tutorial, "term", "token", and "word" are used interchangeably. 
Vectorization converts a collection of text documents into a matrix of token counts.
The "CountVectorizer" class in the scikit-learn library can be used for this task. Here is how it works:

- **Tokenization**: First, the text is split into individual words or terms (tokens). For example, the sentence "The quick brown fox jumps over the lazy dog" would be tokenized into ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]. By default, CountVectorizer also lowercases the text first, so "The" and "the" are counted as the same token.
- **Stopword removal**: Stopwords such as "a" and "the" carry little meaning on their own, so they are often removed from the token list to reduce noise.
- **Vocabulary Building**: Next, the unique tokens from all the documents are collected to build a vocabulary. This vocabulary consists of all the unique words or terms present in the entire corpus of documents.
- **Vectorization**: For each document, the count of each token in the vocabulary is calculated. This count represents how many times each token appears in the document. The result is a matrix where each row corresponds to a document and each column corresponds to a token in the vocabulary. The value at each position in the matrix is the count of the corresponding token in the document.

For example, consider the following documents:

- Document 1: "The quick brown fox"
- Document 2: "The lazy dog"

The vocabulary would be: ["The", "quick", "brown", "fox", "lazy", "dog"]

The count matrix would be:

| | The | quick | brown | fox | lazy | dog |
|---|---|---|---|---|---|---|
| Document 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| Document 2 | 1 | 0 | 0 | 0 | 1 | 1 |

Each row in the matrix represents a document, and each column represents a token from the vocabulary. The value at each position indicates the count of the corresponding token in the document.


In [6]:
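# Convert text data into numerical features using CountVectorizer
# (to use TF-IDF weights instead, swap in the commented-out line below)
#vectorizer = TfidfVectorizer(stop_words='english')
vectorizer = CountVectorizer(stop_words='english')
# fit_transform/transform return sparse matrices; .toarray() makes them dense,
# which needs a lot of RAM at this size (40000 x 92692 int64 is roughly 30 GB)
X_train_vec = vectorizer.fit_transform(X_train).toarray()
X_test_vec = vectorizer.transform(X_test).toarray()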

The **fit_transform** method is used to both fit the vectorizer to the training data and transform the training data into a matrix of token counts or TF-IDF features. 
During the fitting process, the vectorizer learns the vocabulary from the training data, which involves tokenization and vocabulary building. It collects all unique tokens (words or terms) from the training data and assigns each one an index.

The **transform** method is used to transform the testing or unseen data into the same matrix representation based on the vocabulary learned from the training data.
Unlike fit_transform, transform does not learn a new vocabulary. It only counts tokens that are already in the vocabulary learned during fitting; words that never appeared in the training data are ignored.
This method is applied to the testing or new data after the vectorizer has been fitted to the training data. It ensures that the testing data is represented in the same format as the training data, allowing for consistency in feature representation.
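
A small sketch of the difference, on made-up sentences: the vocabulary is fixed by `fit_transform`, and `transform` silently drops any word it has not seen before ("purple" here).

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the quick brown fox", "the lazy dog"]   # toy training data
new_docs = ["the quick purple dog"]                    # contains an unseen word

toy_vectorizer = CountVectorizer()
train_counts = toy_vectorizer.fit_transform(train_docs)  # learns the 6-word vocabulary
new_counts = toy_vectorizer.transform(new_docs)          # reuses it, learns nothing new

print(sorted(toy_vectorizer.vocabulary_))      # ['brown', 'dog', 'fox', 'lazy', 'quick', 'the']
print(new_counts.toarray())                    # [[0 1 0 0 1 1]]
print("purple" in toy_vectorizer.vocabulary_)  # False
```

Calling `fit_transform` on the test data instead would build a different vocabulary, and the columns would no longer line up with what the model was trained on.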

In [7]:
X_train_vec.shape

(40000, 92692)

In [8]:
X_test_vec.shape

(10000, 92692)

In [9]:
X_test_vec[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [10]:
# Convert NumPy arrays to PyTorch tensors
# (batching is handled later by DataLoader)
X_train = torch.tensor(X_train_vec, dtype=torch.float32)
X_test = torch.tensor(X_test_vec, dtype=torch.float32)
y_train = torch.tensor(y_train.values, dtype=torch.long)
y_test = torch.tensor(y_test.values, dtype=torch.long)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

torch.Size([40000, 92692]) torch.Size([10000, 92692]) torch.Size([40000]) torch.Size([10000])
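
Converting the full matrices to dense tensors is the most memory-hungry step of this notebook. One possible alternative, sketched below, keeps the features as a SciPy sparse matrix and densifies only the rows of each batch. `SparseRowDataset` is a hypothetical helper, not part of PyTorch or scikit-learn, and the random toy matrix stands in for the output of `vectorizer.fit_transform(...)` without `.toarray()`.

```python
import numpy as np
import torch
from scipy.sparse import csr_matrix
from torch.utils.data import DataLoader, Dataset

class SparseRowDataset(Dataset):
    """Serve rows of a SciPy sparse matrix as dense float32 tensors (hypothetical helper)."""

    def __init__(self, features, labels):
        self.features = features.tocsr()   # CSR makes row slicing cheap
        self.labels = torch.as_tensor(np.asarray(labels), dtype=torch.long)

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        # Only this one row is densified, never the whole matrix
        row = self.features[idx].toarray().ravel().astype(np.float32)
        return torch.from_numpy(row), self.labels[idx]

# Toy stand-in for a sparse count matrix: 10 documents, 50 vocabulary entries
rng = np.random.default_rng(0)
toy_counts = csr_matrix(rng.integers(0, 3, size=(10, 50)))
toy_labels = np.array([0, 1] * 5)

loader = DataLoader(SparseRowDataset(toy_counts, toy_labels), batch_size=4)
inputs, targets = next(iter(loader))
print(inputs.shape, inputs.dtype, targets.shape)
```

The trade-off is some extra work per batch in exchange for never holding the dense matrix in memory.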


### Define the Neural Network

In [11]:
# Define neural network architecture
class Net(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.fc2 = nn.Linear(64, 2)  # Output layer with 2 neurons for binary classification
        
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Instantiate the model
input_size = X_train.shape[1]
model = Net(input_size)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
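
As a quick sanity check of the architecture, the sketch below pushes a dummy batch through a network with the same two layers. The class is redefined as `TinyNet` with a made-up input size of 10 so the snippet runs on its own; the real model uses the vocabulary size instead. Note that the network returns raw scores (logits): `nn.CrossEntropyLoss` applies log-softmax internally, so no softmax layer is needed in `forward`.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Same layout as Net above, redefined so this snippet is self-contained."""
    def __init__(self, input_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.fc2 = nn.Linear(64, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

toy_model = TinyNet(input_size=10)          # 10 is an arbitrary toy vocabulary size
dummy_batch = torch.rand(4, 10)             # 4 fake documents
logits = toy_model(dummy_batch)             # shape: (batch, 2)
probs = torch.softmax(logits, dim=1)        # only needed to read scores as probabilities

# Trainable parameters: 10*64 + 64 (fc1) + 64*2 + 2 (fc2) = 834
n_params = sum(p.numel() for p in toy_model.parameters())
print(logits.shape, n_params)

# CrossEntropyLoss takes logits plus integer class indices (dtype long)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 1, 0]))
print(loss.item())
```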

### Create Data Loaders

In [12]:
# Create DataLoader for training and testing sets
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

test_dataset = TensorDataset(X_test, y_test)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
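
To see what a loader yields, the sketch below builds one over ten made-up samples. With `batch_size=4` it produces three batches, the last one holding the two leftover samples. By the same arithmetic, 40,000 training reviews in batches of 32 give 1,250 steps per epoch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten toy samples with three features each (made-up data)
toy_X = torch.arange(30, dtype=torch.float32).reshape(10, 3)
toy_y = torch.tensor([0, 1] * 5)

toy_loader = DataLoader(TensorDataset(toy_X, toy_y), batch_size=4, shuffle=False)

print(len(toy_loader))  # 3 batches: 4 + 4 + 2 samples
for inputs, targets in toy_loader:
    print(inputs.shape, targets.shape)
```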

### Train the Neural Network

In [13]:
# Training loop
num_epochs = 1
for epoch in range(num_epochs):
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        inputs, targets = data

        # Forward pass
        outputs = model(inputs)
        
        # Calculate loss
        loss = criterion(outputs, targets)
        
        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 100 == 99:    # Print every 100 mini-batches
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Loss: {running_loss / 100:.4f}')
            running_loss = 0.0

print('Finished Training')

Epoch [1/1], Step [100/1250], Loss: 0.4648
Epoch [1/1], Step [200/1250], Loss: 0.3297
Epoch [1/1], Step [300/1250], Loss: 0.3693
Epoch [1/1], Step [400/1250], Loss: 0.3288
Epoch [1/1], Step [500/1250], Loss: 0.3067
Epoch [1/1], Step [600/1250], Loss: 0.3007
Epoch [1/1], Step [700/1250], Loss: 0.3039
Epoch [1/1], Step [800/1250], Loss: 0.2859
Epoch [1/1], Step [900/1250], Loss: 0.2967
Epoch [1/1], Step [1000/1250], Loss: 0.2839
Epoch [1/1], Step [1100/1250], Loss: 0.2750
Epoch [1/1], Step [1200/1250], Loss: 0.2644
Finished Training
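
The loop above runs on the CPU for a single epoch. A possible refactoring, sketched here on synthetic data, wraps one epoch in a function, moves batches to a GPU when one is available, and returns the mean loss so epochs can be compared. `train_one_epoch` and the toy linear model are illustrative names, not part of the original notebook.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, criterion, optimizer, device):
    """Run one pass over loader and return the mean loss per batch."""
    model.train()
    total_loss = 0.0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()                 # clear gradients from the previous step
        loss = criterion(model(inputs), targets)
        loss.backward()                       # backpropagate
        optimizer.step()                      # update the weights
        total_loss += loss.item()
    return total_loss / len(loader)

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic toy problem: the label is 1 when the first feature is positive
toy_X = torch.randn(256, 5)
toy_y = (toy_X[:, 0] > 0).long()
toy_loader = DataLoader(TensorDataset(toy_X, toy_y), batch_size=32, shuffle=True)

toy_model = nn.Linear(5, 2).to(device)
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.01)
toy_criterion = nn.CrossEntropyLoss()

losses = [train_one_epoch(toy_model, toy_loader, toy_criterion, toy_optimizer, device)
          for _ in range(3)]
print(losses)
```

With the notebook's objects, the equivalent call would pass `model`, `train_loader`, `criterion`, and `optimizer`, after moving `model` to the same device.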


### Evaluate on Test Set

In [14]:
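# Evaluation on test set
model.eval()  # inference mode; no effect on this model's outputs (no dropout/batch norm), but a good habit
correct = 0
total = 0
with torch.no_grad():
    for data in test_loader:
        inputs, targets = data
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)  # index of the larger logit is the predicted class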
        total += targets.size(0)
        correct += (predicted == targets).sum().item()

print(f'Accuracy on test set: {100 * correct / total:.2f}%')

Accuracy on test set: 89.62%
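
Accuracy alone hides which kind of mistake the model makes. A minimal sketch of counting the four outcome types with plain tensor operations follows; the `predicted` and `targets` tensors here are made-up stand-ins for the values collected in the loop above.

```python
import torch

# Made-up predictions and labels (1 = positive, 0 = negative)
predicted = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0])
targets   = torch.tensor([1, 0, 0, 1, 1, 0, 1, 0])

tp = ((predicted == 1) & (targets == 1)).sum().item()  # positive reviews found
tn = ((predicted == 0) & (targets == 0)).sum().item()  # negative reviews found
fp = ((predicted == 1) & (targets == 0)).sum().item()  # negatives flagged as positive
fn = ((predicted == 0) & (targets == 1)).sum().item()  # positives that were missed

accuracy = (tp + tn) / len(targets)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, tn, fp, fn)               # 3 3 1 1
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```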


In [15]:
# Alternative: evaluate the whole test set in a single forward pass
with torch.no_grad():
    outputs = model(X_test)
    _, predicted = torch.max(outputs, 1)
    accuracy = (predicted == y_test).sum().item() / len(y_test)
    print(f'Test Accuracy: {100 * accuracy:.2f}%')

Test Accuracy: 89.62%
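
Once trained, the model can score new text as long as the text goes through the same vectorizer. The helper below, `predict_sentiment`, is a hypothetical sketch. To stay self-contained it is demonstrated with a toy vectorizer and an untrained linear layer, so the printed labels are arbitrary; in the notebook you would pass the fitted `vectorizer` and the trained `model` instead.

```python
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import CountVectorizer

def predict_sentiment(texts, vectorizer, model):
    """Return 'positive' or 'negative' for each text (illustrative helper)."""
    # transform (not fit_transform) keeps the vocabulary learned during training
    features = torch.tensor(vectorizer.transform(texts).toarray(), dtype=torch.float32)
    model.eval()
    with torch.no_grad():
        predicted = model(features).argmax(dim=1)
    return ["positive" if p == 1 else "negative" for p in predicted.tolist()]

# Toy stand-ins so the snippet runs on its own
toy_vectorizer = CountVectorizer().fit(["a wonderful film", "a terrible film"])
toy_model = nn.Linear(len(toy_vectorizer.vocabulary_), 2)   # untrained

result = predict_sentiment(["wonderful", "terrible plot"], toy_vectorizer, toy_model)
print(result)
```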
