## Text Classification with Sentence Transformers

This notebook builds a sentiment classifier with the sentence-transformers library: movie reviews are encoded as sentence embeddings and then classified as positive or negative. The data is the IMDB movie reviews dataset, loaded here from a local CSV file; the same dataset is available in various places, including the TensorFlow Datasets catalog.

In [33]:
import pandas as pd
import numpy as np

In [34]:
review_df = pd.read_csv('IMDB Dataset.csv')
review_df.shape

(50000, 2)

In [35]:
review_df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. `<br /><br />`The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


The full set of 50,000 reviews is large, so take a 1% random sample (500 reviews) to keep the later steps fast.

In [37]:
df = review_df.sample(frac=0.01)
df.shape

(500, 2)
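Because `sample` draws rows at random, each run of the cell above selects a different subset, so later numbers will not reproduce exactly. A minimal sketch of a reproducible draw, assuming a fixed seed is acceptable; the toy frame and the seed value 42 are illustrative and not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for review_df (hypothetical data, 1,000 rows).
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(1000)],
    "sentiment": ["positive" if i % 2 == 0 else "negative" for i in range(1000)],
})

# random_state fixes the seed, so the same rows are drawn on every run.
sample_a = toy_df.sample(frac=0.01, random_state=42)
sample_b = toy_df.sample(frac=0.01, random_state=42)

print(sample_a.shape)                          # 1% of 1,000 rows, 2 columns
print(sample_a.index.equals(sample_b.index))   # identical rows in both draws
```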

#### 1. Preprocessing

In [41]:
import spacy

nlp = spacy.load('en_core_web_sm')

def preprocess_text(text):
    """Lowercase the text and keep only alphabetic tokens that are not stop words."""
    doc = nlp(text.lower())
    # is_alpha already excludes punctuation and digits; is_stop drops spaCy's English stop words.
    tokens = [token.text for token in doc if token.is_alpha and not token.is_stop]
    return ' '.join(tokens)

df['clean_review'] = df['review'].apply(preprocess_text)
df.head()

| | review | sentiment | clean_review |
|---|---|---|---|
| 25368 | I couldn't find anyone to watch DiG! with me b... | positive | find watch dig knew fan bands naturally assume... |
| 10364 | Absolutely fantastic trash....this one has it ... | positive | absolutely fantastic trash nudity good fight s... |
| 49550 | (Review in English, since Swedish is not allow... | negative | review english swedish saw movie extremely low... |
| 2784 | Christopher Guest is the master of the mockume... | negative | christopher guest master mockumentary werner h... |
| 5675 | I enjoy watching Robert Forster. That was the ... | negative | enjoy watching robert forster main reason rent... |
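
The raw reviews contain HTML line breaks such as `<br /><br />` (visible in the first `head()` output). The `is_alpha` filter may let the tag name `br` through as an ordinary token, so it can help to strip tags before tokenizing. A minimal sketch using only the standard library; `strip_html` is a hypothetical helper, not part of the original notebook, and the regular expression is a simple heuristic rather than a full HTML parser:

```python
import re

# Matches anything that looks like an HTML tag, e.g. <br /> or <i>.
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_html(text):
    """Replace HTML-like tags with a space and collapse repeated whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()

# Made-up example text.
print(strip_html("Great film. <br /><br />Loved it."))   # Great film. Loved it.
```

It could be applied ahead of the cleaning step, for example `df['review'].apply(strip_html).apply(preprocess_text)`.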


#### 2. Sentence Embeddings

In [44]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer model
MODEL_NAME = 'bert-base-nli-mean-tokens'
model = SentenceTransformer(MODEL_NAME)

sentence_embeddings = model.encode(df['clean_review'].tolist())
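
The `mean-tokens` part of the model name refers to mean pooling: the per-token vectors produced by BERT are averaged into one fixed-size vector per input text. For this model, `model.encode` should return an array with one 768-dimensional row per review (768 being the hidden size of BERT-base). The sketch below illustrates the pooling idea with NumPy on made-up numbers; `mean_pool` is a hypothetical helper and is not the library's internal implementation:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim) array
    attention_mask:   (batch, seq_len) array, 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(float)   # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)     # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)     # avoid division by zero
    return summed / counts

tokens = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],   # 2 real tokens + 1 padding slot
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],   # 3 real tokens
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(tokens, mask)
print(pooled)   # rows: [2. 3.] and [4. 4.]
```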

In [45]:
import pickle

df.to_pickle('df.pkl')

with open('sentence_embeddings.pkl', 'wb') as file:
    pickle.dump(sentence_embeddings, file)
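
Encoding is the slow step, so the cell above caches its results on disk. A later session could reload them along these lines; this is a sketch in which a small random array and a temporary directory stand in for the real embeddings and the working directory:

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for sentence_embeddings (hypothetical shape: 5 texts, 768 dimensions).
embeddings = np.random.rand(5, 768).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentence_embeddings.pkl")

    # Save, mirroring the cell above.
    with open(path, "wb") as file:
        pickle.dump(embeddings, file)

    # Reload, as a later session would.
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored.shape)
print(np.array_equal(embeddings, restored))
```

The DataFrame can be restored in the same spirit with `pd.read_pickle('df.pkl')`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.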

#### 3. Model Training

In [47]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

X = sentence_embeddings
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
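
The output above is scikit-learn's convergence warning: the solver hit its iteration limit (`max_iter` defaults to 100) before converging. Following the warning's own suggestions, one option is to scale the features and raise `max_iter`. A minimal sketch on synthetic data, assuming 768-dimensional features like the embeddings; `max_iter=1000`, `random_state=42`, and the use of `stratify` are illustrative choices, not the original notebook's settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embeddings and labels (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 768))
y_demo = np.where(X_demo[:, 0] + 0.5 * rng.normal(size=500) > 0, "positive", "negative")

# stratify keeps the class ratio the same in both splits;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Standardize the features, then fit logistic regression with a higher iteration cap.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)
print(clf.score(X_te, y_te))   # accuracy on the held-out synthetic data
```

Results on the real embeddings would differ; the point is only that the scaled, higher-`max_iter` fit is a drop-in replacement for `LogisticRegression()`.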


#### 4. Evaluation

In [48]:
y_pred = log_reg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_report_result = classification_report(y_test, y_pred)

In [55]:
print(classification_report_result)

              precision    recall  f1-score   support

    negative       0.77      0.80      0.78        70
    positive       0.82      0.79      0.80        80

    accuracy                           0.79       150
   macro avg       0.79      0.79      0.79       150
weighted avg       0.79      0.79      0.79       150
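
For reference, each row of the report is derived from counts of correct and incorrect predictions for that class: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. A small pure-Python sketch on made-up labels (`per_class_metrics` is a hypothetical helper, not a scikit-learn function), followed by a check that the F1 values reported above follow from the reported precision and recall:

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration only.
y_true_demo = ["negative", "negative", "positive", "positive", "positive"]
y_pred_demo = ["negative", "positive", "positive", "positive", "negative"]
print(per_class_metrics(y_true_demo, y_pred_demo, "positive"))

# F1 recomputed from the rounded precision and recall in the report above.
f1_negative = 2 * 0.77 * 0.80 / (0.77 + 0.80)
f1_positive = 2 * 0.82 * 0.79 / (0.82 + 0.79)
print(round(f1_negative, 2), round(f1_positive, 2))   # 0.78 0.8
```

Because the inputs to the second check are already rounded to two decimals, it only confirms consistency to that precision.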

