### Task: Sentiment Classification of Movie Reviews  


Alice is a time traveler who visits different eras in the past to carry out important missions. While there, she must always disguise herself so that no one learns she is from the future. This time, she joined an NLP company in 2014 and was assigned the task of sentiment analysis on user reviews of movies. Help Alice with this task.

You need to solve a sentiment classification task using the IMDb movie review dataset. Each review is labeled as either positive (1) or negative (0), indicating its sentiment. You are provided with a basic LinearSVC classifier with TF-IDF features.

You need to solve 3 tasks:

1.   Task 1: Text Preprocessing with spaCy (this is your baseline)
2.   Task 2: Adding Part-of-Speech (POS) Features as a TF-IDF for Each POS Category
3.   Task 3: Development of new features to improve classification accuracy

**Note!** Do not change the classifier. Change only the cells marked with TODO.



In [None]:
import os
import random
import re
import numpy as np
import pandas as pd
import spacy

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import (
    TfidfVectorizer,
    CountVectorizer,
)
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

In [None]:
os.environ["PYTHONHASHSEED"] = str(42)

random.seed(42)
np.random.seed(42)

### Loading the dataset

In [None]:
! gdown --id 1C6TIP8c33fHM6dxs6DoxJeKY6ZXGWpBx
! gdown --id 1K8WBFVVvVlsvIMRG8HiaFkldiyuNkLD2

Downloading...
From: https://drive.google.com/uc?id=1C6TIP8c33fHM6dxs6DoxJeKY6ZXGWpBx
To: /content/imdb_train_hw1.csv
100% 8.25M/8.25M [00:00<00:00, 27.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=1K8WBFVVvVlsvIMRG8HiaFkldiyuNkLD2
To: /content/imdb_test_hw1.csv
100% 2.10M/2.10M [00:00<00:00, 151MB/s]


In [None]:
df_train = pd.read_csv("imdb_train_hw1.csv")
df_test = pd.read_csv("imdb_test_hw1.csv")
df_train.sample(5)

Unnamed: 0.1,Unnamed: 0,label,text
8681,8681,1,I noticed this movie was getting trashed well ...
2362,2362,1,When it comes to creating a universe George Lu...
6232,6232,0,"""National Treasure"" (2004) is a thoroughly mis..."
1318,1318,1,I must admit - the only reason I bought this m...
543,543,1,Ten out of the 11 short films in this movie ar...


In [None]:
y_train = df_train["label"]
y_test = df_test["label"]

Since the classes in our dataset are nearly balanced, we can use accuracy as the evaluation metric. Accuracy provides a straightforward measure of how well the model classifies reviews correctly across both sentiment classes.  

However, we will consider the F1-score for a more detailed performance assessment. Even with balanced classes, the model might still be biased towards one class due to feature distributions (e.g., it may predict negative reviews more confidently than positive ones).  

The F1-score, which is the harmonic mean of precision and recall, helps us identify such imbalances. It ensures that both false positives and false negatives are accounted for, providing a better understanding of how well the model performs on each sentiment class.
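
To make the relationship between accuracy and the per-class F1-score concrete, here is a minimal sketch in plain Python. The helper name `per_class_scores` and the toy labels are illustrative assumptions, not part of the assignment; in the notebook itself `classification_report` does this work.

```python
def per_class_scores(y_true, y_pred, positive):
    """Precision, recall and F1 for the class treated as `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy, perfectly balanced labels (made up for illustration).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
# A model that is biased towards predicting the negative class.
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores_pos = per_class_scores(y_true, y_pred, positive=1)
scores_neg = per_class_scores(y_true, y_pred, positive=0)

print("accuracy:", accuracy)
print("positive class (precision, recall, f1):", scores_pos)
print("negative class (precision, recall, f1):", scores_neg)
```

Even though the toy classes are balanced, the two F1 values differ noticeably, which is the kind of bias a single accuracy number hides.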

## 0. LinearSVC with TF-IDF Features  

We will now train a LinearSVC model using TF-IDF (Term Frequency-Inverse Document Frequency) as features.
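
As a rough illustration of what a TF-IDF weight is, the sketch below computes the vectors by hand in plain Python. It assumes smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row, which matches the `TfidfVectorizer` defaults as far as we know; the whitespace tokenizer and the name `tfidf_vectors` are simplifications introduced here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    # Simplification: lowercase + whitespace split instead of a real tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


vocab, rows = tfidf_vectors(["great movie great cast", "boring movie"])
for row in rows:
    print(dict(zip(vocab, (round(v, 2) for v in row))))
```

Terms that are frequent in one review but rare across reviews ("great") receive a larger weight than terms that appear everywhere ("movie").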

In [None]:
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(df_train["text"])
X_test_tfidf = vectorizer.transform(df_test["text"])

In [None]:
y_train = df_train["label"]
y_test = df_test["label"]

In [None]:
model = LinearSVC(random_state=42)
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)
print("Accuracy (TF-IDF):", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy (TF-IDF): 0.841747984726347
              precision    recall  f1-score   support

           0       0.85      0.84      0.85      1213
           1       0.83      0.84      0.84      1144

    accuracy                           0.84      2357
   macro avg       0.84      0.84      0.84      2357
weighted avg       0.84      0.84      0.84      2357



The model's accuracy using TF-IDF is 0.8417 (84.17%). This is our **baseline result**.

## Task 1: Text Preprocessing with spaCy

Lemmatize the original review texts with the [spaCy](https://spacy.io/usage/linguistic-features#lemmatization) library.
Using spaCy, remove:

*   stop words
*   punctuation
*   digits
*   emails
*   numbers
*   empty tokens

Train the classifier on the new TF-IDF representation of the text and obtain the baseline classification metrics.

In [None]:
import re
import string

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
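def clean_text(text):
    """Strip emails, digits and punctuation, then lemmatize with spaCy."""
    # Remove email-like chunks (any run of non-space characters containing "@").
    text = re.sub(r'\S+@\S+', '', text)

    # Remove runs of digits.
    text = re.sub(r'\d+', '', text)

    # Remove ASCII punctuation before tokenization.
    text = text.translate(str.maketrans('', '', string.punctuation))

    doc = nlp(text)

    # Keep lowercase lemmas of tokens that are not stop words, punctuation,
    # number-like words (e.g. "ten"), or whitespace.
    tokens = [
        token.lemma_.lower()
        for token in doc
        if not (token.is_stop or token.is_punct or token.like_num or token.is_space)
        and token.text.strip()
    ]

    return ' '.join(tokens)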
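
To see what the regex and `translate` steps of `clean_text` do before spaCy ever sees the text, here is a small standalone sketch. The helper name `strip_noise` and the sample sentence are hypothetical; the final whitespace collapse is added only to make the printed result easier to read.

```python
import re
import string


def strip_noise(text):
    """Apply the same three pre-cleaning steps as clean_text, without spaCy."""
    text = re.sub(r'\S+@\S+', '', text)   # drop email-like chunks
    text = re.sub(r'\d+', '', text)       # drop digit runs
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Collapse the leftover whitespace (illustration only).
    return ' '.join(text.split())


cleaned = strip_noise("Mail me at bob@example.com, I don't rate it 10/10!")
print(cleaned)
```

Note that removing punctuation this early also removes apostrophes, so a contraction such as "don't" reaches the tokenizer as "dont".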


In [None]:
from tqdm import tqdm
tqdm.pandas()

In [None]:
df_train["text_lemmatized"] = df_train["text"].progress_apply(clean_text)
df_test["text_lemmatized"] = df_test["text"].progress_apply(clean_text)

100%|██████████| 9427/9427 [06:14<00:00, 25.21it/s]
100%|██████████| 2357/2357 [01:26<00:00, 27.22it/s]


In [None]:
tfidf_vectorizer = TfidfVectorizer()

In [None]:
# TODO get tf-idf vectors for your lemmatized texts

X_train_tfidf_lemmatized = tfidf_vectorizer.fit_transform(df_train['text_lemmatized'])
X_test_tfidf_lemmatized = tfidf_vectorizer.transform(df_test['text_lemmatized'])

In [None]:
model = LinearSVC(random_state=42)
model.fit(X_train_tfidf_lemmatized, y_train)
y_pred = model.predict(X_test_tfidf_lemmatized)
print("Accuracy (TF-IDF):", accuracy_score(y_test, y_pred))

Accuracy (TF-IDF): 0.841747984726347


In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.84      0.85      1213
           1       0.84      0.84      0.84      1144

    accuracy                           0.84      2357
   macro avg       0.84      0.84      0.84      2357
weighted avg       0.84      0.84      0.84      2357



These are your **baseline** metrics!

## Task 2: Adding Part-of-Speech (POS) Features as a TF-IDF for Each POS Category

For each text, add part-of-speech (POS) tags as features in a TF-IDF manner. Use spaCy to get the POS tag features. Combine them with the lemmatized TF-IDF features obtained in Task 1.

For example, if you have two sentences with the following TF-IDF vectors:

1.   sent1: "The cat sat on the mat." -> [0.63, 0.44, 0.31, 0.31, 0.44, 0, 0]
2.   sent2: "The dog sat on the floor. " -> [0.63, 0, 0.31, 0.31, 0, 0.44, 0.44]

And you obtained the following POS tag features (with dictionary {'det': 1, 'noun': 2, 'verb': 3, 'adp': 0}):

*   sent1: [0.63, 0.63, 0.31, 0.31]
*   sent2: [0.63, 0.63, 0.31, 0.31]


Then final representation should be:

*   sent1: [0.63, 0.44, 0.31, 0.31, 0.44, 0, 0, 0.63, 0.63, 0.31, 0.31]
*   sent2: [0.63, 0, 0.31, 0.31, 0, 0.44, 0.44, 0.63, 0.63, 0.31, 0.31]

**Note!** Do not use the POS tags of punctuation and empty tokens.
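
The concatenation in the example above can be sketched in plain Python as follows. The helper name `combine` is an assumption made for illustration; in the notebook the same column-wise stacking is done on sparse matrices with `scipy.sparse.hstack`.

```python
# Vectors copied from the example above.
sent1_tfidf = [0.63, 0.44, 0.31, 0.31, 0.44, 0, 0]
sent2_tfidf = [0.63, 0, 0.31, 0.31, 0, 0.44, 0.44]
sent1_pos = [0.63, 0.63, 0.31, 0.31]
sent2_pos = [0.63, 0.63, 0.31, 0.31]


def combine(word_features, pos_features):
    """Append the POS features after the word-level features of one document."""
    return list(word_features) + list(pos_features)


sent1_final = combine(sent1_tfidf, sent1_pos)
sent2_final = combine(sent2_tfidf, sent2_pos)

print(len(sent1_final), sent1_final)
print(len(sent2_final), sent2_final)
```

Each document keeps its 7 word-level columns and gains 4 POS columns, so the classifier sees 11 features per row.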

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from scipy.sparse import hstack
import re

In [None]:
# Load spaCy model
nlp = spacy.load("en_core_web_sm")

In [None]:
def extract_pos_tags(text):
    doc = nlp(text)
    pos_tags = []
    for token in doc:
        if token.pos_ not in ["PUNCT", "SPACE"]:  # skip punctuation and whitespace
            pos_tags.append(token.pos_.lower())    # convert to lowercase
    return " ".join(pos_tags)

In [None]:
df_train["pos_text"] = df_train["text"].progress_apply(extract_pos_tags)
df_test["pos_text"] = df_test["text"].progress_apply(extract_pos_tags)

100%|██████████| 9427/9427 [06:13<00:00, 25.25it/s]
100%|██████████| 2357/2357 [01:36<00:00, 24.44it/s]


We need to bring the POS tag features to the same scale as the word-level TF-IDF features. The easiest way is to apply TfidfTransformer to the CountVectorizer result. TfidfVectorizer, used in the next cell, is equivalent to that two-step combination.
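
The sketch below checks, on two made-up POS strings, that `CountVectorizer` followed by `TfidfTransformer` gives the same matrix as a single `TfidfVectorizer` when both use their default parameters. The toy strings are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical POS-tag "documents", in the format produced by extract_pos_tags.
pos_docs = [
    "det noun verb adp det noun",
    "pron verb det adj noun",
]

# One step: raw text -> TF-IDF.
direct = TfidfVectorizer().fit_transform(pos_docs)

# Two steps: raw text -> counts -> TF-IDF.
counts = CountVectorizer().fit_transform(pos_docs)
two_step = TfidfTransformer().fit_transform(counts)

same = np.allclose(direct.toarray(), two_step.toarray())
print("identical matrices:", same)
```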

In [None]:
lemm_tfidf_vectorizer = TfidfVectorizer()

X_train_tfidf_lemmatized = lemm_tfidf_vectorizer.fit_transform(df_train["text_lemmatized"])
X_test_tfidf_lemmatized = lemm_tfidf_vectorizer.transform(df_test["text_lemmatized"])

pos_tfidf_vectorizer = TfidfVectorizer()

X_train_pos_tfidf = pos_tfidf_vectorizer.fit_transform(df_train["pos_text"])
X_test_pos_tfidf = pos_tfidf_vectorizer.transform(df_test["pos_text"])


X_train_combined = hstack([X_train_tfidf_lemmatized, X_train_pos_tfidf])
X_test_combined = hstack([X_test_tfidf_lemmatized, X_test_pos_tfidf])

In [None]:
lr_combined = LinearSVC(random_state=42)
lr_combined.fit(X_train_combined, y_train)
y_pred_combined = lr_combined.predict(X_test_combined)

print("Accuracy (tf-idf + POS):", accuracy_score(y_test, y_pred_combined))

Accuracy (tf-idf + POS): 0.8408994484514213


## Task 3: Development of new features to improve classification accuracy

Come up with another feature or set of features and help Alice improve the quality. Remember that Alice is in the past and does not have access to any . Additional training data cannot be used either. You can use third-party resources to generate features.

Compare with the result of your **baseline** from Task 1. Any improvement will be counted. Use X_train_tfidf_lemmatized and X_test_tfidf_lemmatized, and combine your features with them as in Task 2.

In [None]:
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize, sent_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

stop_words = set(stopwords.words('english'))

def get_custom_feature(text):
    words = word_tokenize(text)
    sentences = sent_tokenize(text)

    word_count = len(words)
    char_count = sum(len(word) for word in words)
    avg_word_length = char_count / word_count if word_count > 0 else 0
    unique_word_count = len(set(words))
    lexical_diversity = unique_word_count / word_count if word_count > 0 else 0

    sentence_count = len(sentences)

    punctuation_count = len(re.findall(r'[^\w\s]', text))

    uppercase_words = sum(1 for word in words if word.isupper())
    uppercase_ratio = uppercase_words / word_count if word_count > 0 else 0

    stopword_count = sum(1 for word in words if word.lower() in stop_words)
    stopword_ratio = stopword_count / word_count if word_count > 0 else 0

    return [
        word_count,
        sentence_count,
        avg_word_length,
        char_count,
        unique_word_count,
        lexical_diversity,
        punctuation_count,
        uppercase_ratio,
        stopword_ratio
    ]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
# Compute the custom features for every text in the training and test sets
X_train_custom = np.array(df_train['text'].progress_apply(get_custom_feature).tolist())
X_test_custom = np.array(df_test['text'].progress_apply(get_custom_feature).tolist())


100%|██████████| 9427/9427 [00:17<00:00, 552.27it/s]
100%|██████████| 2357/2357 [00:03<00:00, 645.25it/s]


In [None]:
X_train_combined = hstack([X_train_tfidf_lemmatized, X_train_custom])
X_test_combined = hstack([X_test_tfidf_lemmatized, X_test_custom])
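
The custom columns are raw counts and ratios, while TF-IDF rows are L2-normalized. In the spirit of the scaling remark in Task 2, one option is to standardize the custom columns with statistics estimated on the training set only. The sketch below shows the idea in plain Python; the helper names `fit_standardizer` and `apply_standardizer` and the toy rows are hypothetical, and whether this changes the accuracy reported here has not been checked.

```python
from statistics import mean, pstdev


def fit_standardizer(rows):
    """Estimate per-column mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_standardizer(rows, means, stds):
    """Scale rows with previously fitted statistics (never refit on test data)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy rows: [word_count, sentence_count, stopword_ratio] (made up).
train_rows = [[120, 6, 0.45], [300, 14, 0.52], [80, 4, 0.40]]
test_rows = [[200, 9, 0.50]]

means, stds = fit_standardizer(train_rows)
train_scaled = apply_standardizer(train_rows, means, stds)
test_scaled = apply_standardizer(test_rows, means, stds)

print(train_scaled)
print(test_scaled)
```

In the notebook, `sklearn.preprocessing.StandardScaler` fitted on `X_train_custom` would play the same role before the `hstack` call.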


In [None]:
lr_combined = LinearSVC(random_state=42)
lr_combined.fit(X_train_combined, y_train)
y_pred_combined = lr_combined.predict(X_test_combined)

print("Accuracy (tf-idf + Custom feature):", accuracy_score(y_test, y_pred_combined))

Accuracy (tf-idf + Custom feature): 0.9681798896902842


