In [1]:
from IPython.display import clear_output

In [3]:
# %pip install torchtext scikit-learn nltk tqdm

%pip install --upgrade portalocker

clear_output()

# Content

In this demo, we will build an SVM-based text sentiment classifier.

We'll use the IMDB movie review dataset, with scikit-learn for model training and evaluation.

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from torchtext.datasets import IMDB

from tqdm import tqdm

## Downloading and preparing the data

In [21]:
train_data, test_data = IMDB(split=('train', 'test'))

train_data = list(tqdm(train_data, total=25000))
test_data = list(tqdm(test_data, total=25000))

100%|██████████| 25000/25000 [00:00<00:00, 33271.78it/s]
100%|██████████| 25000/25000 [00:00<00:00, 36820.03it/s]


In [45]:
y_train, X_train = zip(*train_data)
y_test, X_test = zip(*test_data)



## Training the model

We will use SVC (support vector classifier) for training.

The labels are:

1 - Negative

2 - Positive

We will also use TF-IDF, rather than a simple word-to-index mapping, to convert our text to numerical form.

### If you don't know about TF-IDF:
The details of TF-IDF are outside the scope of this demo, so we will not go too deep into how it works. The important part is that it learns a vocabulary from the training text and converts each document into a numerical vector, which can then be fed to a model.

It also down-weights words that occur in most documents (such as stop words), so they contribute little to the vectors and we usually don't need to remove them manually.

Lastly, the scikit-learn API accepts raw strings as they are, so we don't have to tokenize the text ourselves.
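
To make the down-weighting concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences and the `toy_*` names are assumptions for illustration, not part of the IMDB data). It compares the learned IDF weight of a word that appears in every document with one that appears only once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration): "movie" appears in every document,
# while "boring" appears in only one.
toy_docs = [
    "the movie was boring",
    "the movie was fun",
    "the movie had a great cast",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # raw strings in, sparse matrix out

# vocabulary_ maps each term to its column index; idf_ holds one learned weight per column
idf = {term: toy_vectorizer.idf_[col] for term, col in toy_vectorizer.vocabulary_.items()}

print("matrix shape:", toy_matrix.shape)
print("idf('movie') :", round(idf["movie"], 3))   # appears in all documents
print("idf('boring'):", round(idf["boring"], 3))  # appears in one document
```

The word that occurs everywhere ends up with the smaller weight. Note that with scikit-learn's default settings the weight is reduced but not set to exactly zero.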


In [46]:
# Vectorize the text data
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

In [50]:
# Train the SVM classifier
clf = SVC(kernel='linear')
# clf = LogisticRegression(max_iter=1000)  # If we want to use Logistic Regression instead of SVC
clf.fit(X_train, y_train)  # SVC takes some time to train
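
If training time is a concern, one possible alternative (not used in this demo) is scikit-learn's `LinearSVC`, which also fits a linear SVM but generally scales better to large sparse inputs than `SVC(kernel='linear')`. The sketch below runs on a tiny made-up dataset; the texts and the `toy_*` and `fast_clf` names are assumptions for illustration, and results may differ slightly from `SVC` because the solver and default loss are different.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up dataset using the same label convention as this demo
# (1 = negative, 2 = positive)
toy_texts = [
    "boring and far too long",
    "a dull boring plot",
    "fun and exciting from start to finish",
    "a fun movie with a great cast",
]
toy_labels = [1, 1, 2, 2]

toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_texts)

# LinearSVC is a linear SVM with a different solver and default loss than SVC
fast_clf = LinearSVC()
fast_clf.fit(toy_features, toy_labels)

# Reuse the fitted vectorizer for new text, exactly as with the SVC above
new_features = toy_vectorizer.transform(["boring dull", "fun great"])
toy_predictions = fast_clf.predict(new_features)
print(toy_predictions)
```

On the full IMDB data, the same swap would mean replacing only the `clf = ...` line; the vectorizing and evaluation steps stay unchanged.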

In [51]:
# Make predictions on the test data
y_pred = clf.predict(X_test)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.87936


In [52]:
sentences = ["I was bored during this movie", "The movie was fun"]
sentence_vectors = vectorizer.transform(sentences)
clf.predict(sentence_vectors)  # 1 is neg. 2 is pos

array([1, 2])
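
The numeric output is easier to read when mapped back to label names. A small sketch, assuming the label convention stated earlier (1 = negative, 2 = positive); `LABEL_NAMES` and `describe_predictions` are hypothetical helpers, and `example_predictions` stands in for the array returned by `clf.predict`.

```python
import numpy as np

# Label convention stated in this demo: 1 = negative, 2 = positive
LABEL_NAMES = {1: "negative", 2: "positive"}


def describe_predictions(sentences, predictions):
    """Pair each sentence with the readable name of its predicted label."""
    return [(s, LABEL_NAMES[int(p)]) for s, p in zip(sentences, predictions)]


# Stand-in for the array returned by clf.predict(sentence_vectors)
example_sentences = ["I was bored during this movie", "The movie was fun"]
example_predictions = np.array([1, 2])

for sentence, label in describe_predictions(example_sentences, example_predictions):
    print(f"{label:>8}: {sentence}")
```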