### Task: Sentiment Classification of Movie Reviews  


Alice is a time traveler who visits different eras in the past to carry out important missions. While there, she must always disguise herself so that no one learns she is from the future. This time, she has joined an NLP company in 2014 and has been assigned the task of sentiment analysis on user reviews of movies. Help Alice with this task.

You need to solve a sentiment classification task using the IMDb movie review dataset. Each review is labeled as either positive (1) or negative (0), indicating its sentiment. You are provided with a basic LinearSVC classifier that uses TF-IDF features.

You need to solve 3 tasks:

1.   Task 1: Text Preprocessing with spaCy (this is your baseline)
2.   Task 2: Adding Part-of-Speech (POS) Features as a TF-IDF for Each POS Category
3.   Task 3: Development of new features to improve classification accuracy

**Note!** Do not change the classifier. Change only the cells marked with TODO.



In [1]:
import os
import random
import re
import numpy as np
import pandas as pd
import spacy

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import (
    TfidfVectorizer,
    CountVectorizer,
)
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

In [2]:
os.environ["PYTHONHASHSEED"] = str(42)

random.seed(42)
np.random.seed(42)

### Loading the dataset

In [3]:
! gdown --id 1C6TIP8c33fHM6dxs6DoxJeKY6ZXGWpBx
! gdown --id 1K8WBFVVvVlsvIMRG8HiaFkldiyuNkLD2

Downloading...
From: https://drive.google.com/uc?id=1C6TIP8c33fHM6dxs6DoxJeKY6ZXGWpBx
To: /kaggle/working/imdb_train_hw1.csv
100%|██████████████████████████████████████| 8.25M/8.25M [00:00<00:00, 40.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1K8WBFVVvVlsvIMRG8HiaFkldiyuNkLD2
To: /kaggle/working/imdb_test_hw1.csv
100%|███████████████████████████████████████| 2.10M/2.10M [00:00<00:00, 157MB/s]


In [4]:
df_train = pd.read_csv("imdb_train_hw1.csv")
df_test = pd.read_csv("imdb_test_hw1.csv")
df_train.sample(5)

Unnamed: 0.1,Unnamed: 0,label,text
8681,8681,1,I noticed this movie was getting trashed well ...
2362,2362,1,When it comes to creating a universe George Lu...
6232,6232,0,"""National Treasure"" (2004) is a thoroughly mis..."
1318,1318,1,I must admit - the only reason I bought this m...
543,543,1,Ten out of the 11 short films in this movie ar...


In [5]:
y_train = df_train["label"]
y_test = df_test["label"]

Since the classes in our dataset are nearly balanced, we can use accuracy as the evaluation metric. Accuracy provides a straightforward measure of how well the model classifies reviews correctly across both sentiment classes.  

However, we will consider the F1-score for a more detailed performance assessment. Even with balanced classes, the model might still be biased towards one class due to feature distributions (e.g., it may predict negative reviews more confidently than positive ones).  

The F1-score, which is the harmonic mean of precision and recall, helps us identify such imbalances. It ensures that both false positives and false negatives are accounted for, providing a better understanding of how well the model performs on each sentiment class.
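
As a small illustration of why per-class F1 can reveal what accuracy hides, the sketch below uses made-up labels (not taken from the dataset) for a classifier that leans towards the negative class. The names `y_true_toy`, `y_pred_toy` and `f1_by_class` are introduced here only for the example.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels for illustration: 1 = positive, 0 = negative.
# The toy classifier predicts "negative" too often.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true_toy, y_pred_toy)

f1_by_class = {}
for label in (0, 1):
    precision = precision_score(y_true_toy, y_pred_toy, pos_label=label)
    recall = recall_score(y_true_toy, y_pred_toy, pos_label=label)
    # F1 is the harmonic mean of precision and recall for this class
    f1_by_class[label] = f1_score(y_true_toy, y_pred_toy, pos_label=label)
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1_by_class[label]:.2f}")

print(f"accuracy={accuracy:.2f}")
```

Here the accuracy is 0.75, while the F1-score is 0.80 for the negative class and about 0.67 for the positive class, so the bias shows up only in the per-class numbers.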

## 0. LinearSVC with TF-IDF Features  

We will now train a LinearSVC model using TF-IDF (Term Frequency-Inverse Document Frequency) as features.
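
To get a feel for the weighting, here is a minimal sketch on a made-up three-review corpus (the reviews and the `toy_` names are assumptions for illustration only). With scikit-learn's default settings, a word that occurs in every review, such as "movie", should receive a lower inverse document frequency than a word that occurs in just one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "movie" appears in every review, the other words in one each
toy_reviews = ["good movie", "bad movie", "great movie"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_reviews)

# Map each term to its learned inverse document frequency
idf_by_term = {
    term: toy_vectorizer.idf_[index]
    for term, index in toy_vectorizer.vocabulary_.items()
}

for term, idf in sorted(idf_by_term.items()):
    print(f"{term}: idf={idf:.3f}")

# One row per review, one column per vocabulary term
print(toy_matrix.shape)
```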

In [6]:
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(df_train["text"])
X_test_tfidf = vectorizer.transform(df_test["text"])

In [7]:
y_train = df_train["label"]
y_test = df_test["label"]

In [8]:
model = LinearSVC(random_state=42)
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)
print("Accuracy (TF-IDF):", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy (TF-IDF): 0.841747984726347
              precision    recall  f1-score   support

           0       0.85      0.84      0.85      1213
           1       0.83      0.84      0.84      1144

    accuracy                           0.84      2357
   macro avg       0.84      0.84      0.84      2357
weighted avg       0.84      0.84      0.84      2357



The model's accuracy using TF-IDF is 0.8417 (84.17%); this is our **baseline result**.

## Task 1: Text Preprocessing with spaCy

Lemmatize the original review texts with the [spaCy](https://spacy.io/usage/linguistic-features#lemmatization) library.
Use spaCy to remove:

*   stop words
*   punctuation
*   digits
*   emails
*   numbers
*   empty tokens (whitespace)

Train the classifier on the new TF-IDF representation of the text and obtain the baseline classification metrics.
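
The filters in the list above map onto spaCy token attributes. The sketch below shows them on one made-up sentence. It assumes a blank English pipeline (`spacy.blank("en")`) so that it runs without downloading a model. A blank pipeline has no lemmatizer, so the sketch keeps `token.text` where the task itself asks for `token.lemma_`. The helper name `keep_token` is hypothetical.

```python
import spacy

# Assumption: tokenizer and lexical attributes only, no trained components
blank_nlp = spacy.blank("en")

def keep_token(token):
    """Return True for tokens that survive the Task 1 filters."""
    return not (
        token.is_stop        # stop words
        or token.is_punct    # punctuation
        or token.is_digit    # digits
        or token.like_email  # emails
        or token.like_num    # numbers, including spelled-out ones
        or token.is_space    # whitespace-only ("empty") tokens
    )

doc = blank_nlp("Write to bob@example.com about 3 great movies , or twenty !")
kept = [token.text for token in doc if keep_token(token)]
print(kept)
```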

In [34]:
import pandas as pd
import string

# Function for cleaning text
def clean_text(text):
    # Convert the text to lower case
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove all spaces (note: this joins the words together, as the output below shows)
    text = text.replace(" ", "")
    return text

# Load the data
df_train = pd.read_csv("imdb_train_hw1.csv")

# Apply clean_text to the texts
df_train["text_cleaned"] = df_train["text"].apply(clean_text)

# Print the first 5 rows as a check
print(df_train[["text", "text_cleaned"]].head())

                                                text  \
0  So, this movie has been hailed, glorified, and...   
1  This Filmfour funded Sci-Fi movie is most defi...   
2  Okay this is stupid,they say their not making ...   
3  Of course, by any normal standard of film crit...   
4  What the movie The 60s really represents (to t...   

                                        text_cleaned  
0  sothismoviehasbeenhailedglorifiedandcarriedtoi...  
1  thisfilmfourfundedscifimovieismostdefinitelyam...  
2  okaythisisstupidtheysaytheirnotmakinganotherni...  
3  ofcoursebyanynormalstandardoffilmcriticismsold...  
4  whatthemoviethe60sreallyrepresentstothoseofusw...  


In [35]:
nlp = spacy.load("en_core_web_sm")

# TODO: function to clean text using spaCy
def clean_text(text):
    doc = nlp(text)
    cleaned_tokens = []
    for token in doc:
        # Skip stop words, punctuation, digits, emails, numbers and whitespace tokens
        skip = (token.is_stop or token.is_punct or token.is_digit
                or token.like_email or token.like_num or token.is_space)
        if not skip:
            cleaned_tokens.append(token.lemma_)
    return " ".join(cleaned_tokens)



In [36]:
df_train["text_lemmatized"] = df_train["text"].apply(clean_text)
df_test["text_lemmatized"] = df_test["text"].apply(clean_text)

In [38]:
# TODO get tf-idf vectors for your lemmatized texts

vectorizer = TfidfVectorizer()
X_train_tfidf_lemmatized = vectorizer.fit_transform(df_train["text_lemmatized"])
X_test_tfidf_lemmatized = vectorizer.transform(df_test["text_lemmatized"])

In [39]:
model = LinearSVC(random_state=42)
model.fit(X_train_tfidf_lemmatized, y_train)
y_pred = model.predict(X_test_tfidf_lemmatized)
print("Accuracy (TF-IDF):", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy (TF-IDF): 0.8413237165888842
              precision    recall  f1-score   support

           0       0.85      0.84      0.84      1213
           1       0.83      0.85      0.84      1144

    accuracy                           0.84      2357
   macro avg       0.84      0.84      0.84      2357
weighted avg       0.84      0.84      0.84      2357



These are your **baseline** metrics!

## Task 2: Adding Part-of-Speech (POS) Features as a TF-IDF for Each POS Category

For each text, add part-of-speech (POS) tags as features in a TF-IDF manner. Use spaCy to get the POS tag features. Combine them with the lemmatized TF-IDF features obtained in Task 1.

For example, if you have two sentences with the following TF-IDF vectors:

1.   sent1: "The cat sat on the mat." -> [0.63, 0.44, 0.31, 0.31, 0.44, 0, 0]
2.   sent2: "The dog sat on the floor." -> [0.63, 0, 0.31, 0.31, 0, 0.44, 0.44]

And you obtained the following POS tag features (with the dictionary {'det': 1, 'noun': 2, 'verb': 3, 'adp': 0}):

*   sent1: [0.63, 0.63, 0.31, 0.31]
*   sent2: [0.63, 0.63, 0.31, 0.31]


Then the final representation should be:

*   sent1: [0.63, 0.44, 0.31, 0.31, 0.44, 0, 0, 0.63, 0.63, 0.31, 0.31]
*   sent2: [0.63, 0, 0.31, 0.31, 0, 0.44, 0.44, 0.63, 0.63, 0.31, 0.31]

**Note!** Do not use the POS tags of punctuation and empty (whitespace) tokens.

We need to bring the features obtained by CountVectorizer for POS tags to the same scale as TF-IDF. The easiest way is to apply TfidfTransformer to the CountVectorizer result.
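
A minimal sketch of this idea on the two example sentences is given below. The POS tag strings are written by hand here as an assumption (in the notebook they come from spaCy), and the variable names are chosen only for the illustration. The sketch also checks that `CountVectorizer` followed by `TfidfTransformer` gives the same values as `TfidfVectorizer` with default settings.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

sentences = ["The cat sat on the mat.", "The dog sat on the floor."]
# Assumption: hand-written POS tags, one tag per word, punctuation left out
pos_strings = ["DET NOUN VERB ADP DET NOUN", "DET NOUN VERB ADP DET NOUN"]

# Word-level TF-IDF features
word_tfidf = TfidfVectorizer().fit_transform(sentences)

# POS counts, then rescaled to TF-IDF so they are on the same scale
pos_counter = CountVectorizer()
pos_counts = pos_counter.fit_transform(pos_strings)
pos_tfidf = TfidfTransformer().fit_transform(pos_counts)

# The two-step route matches TfidfVectorizer applied directly
pos_tfidf_direct = TfidfVectorizer().fit_transform(pos_strings)
same_values = np.allclose(pos_tfidf.toarray(), pos_tfidf_direct.toarray())

# Put the POS columns to the right of the word columns
combined = hstack([word_tfidf, pos_tfidf]).tocsr()

print(pos_counter.vocabulary_)
print(same_values)
print(combined.shape)
```

The combined matrix has 7 word columns plus 4 POS columns, which matches the 11-dimensional vectors in the example above. Note that scikit-learn orders the columns alphabetically by term.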

In [5]:
import pandas as pd
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
from scipy.sparse import hstack

df_train = pd.read_csv("imdb_train_hw1.csv")
df_test = pd.read_csv("imdb_test_hw1.csv")

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def clean_text(text):
    doc = nlp(text)
    cleaned_tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct and not token.is_space]
    return " ".join(cleaned_tokens)

df_train["text_lemmatized"] = df_train["text"].apply(clean_text)
df_test["text_lemmatized"] = df_test["text"].apply(clean_text)

tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf_lemmatized = tfidf_vectorizer.fit_transform(df_train["text_lemmatized"])
X_test_tfidf_lemmatized = tfidf_vectorizer.transform(df_test["text_lemmatized"])

# The `nlp` pipeline loaded above is reused here for POS tagging
def extract_pos_tags(text):
    doc = nlp(text)
    # Keep only content-word tags; this also leaves out punctuation (PUNCT) and whitespace (SPACE)
    pos_tags = [token.pos_ for token in doc if token.pos_ in {"NOUN", "ADJ", "VERB", "ADV"}]
    return " ".join(pos_tags)

df_train["pos_text"] = df_train["text"].apply(extract_pos_tags)
df_test["pos_text"] = df_test["text"].apply(extract_pos_tags)



In [6]:
pos_vectorizer = CountVectorizer()
X_train_pos_bow = pos_vectorizer.fit_transform(df_train["pos_text"])
X_test_pos_bow = pos_vectorizer.transform(df_test["pos_text"])

pos_tfidf_transformer = TfidfTransformer()
X_train_pos_tfidf = pos_tfidf_transformer.fit_transform(X_train_pos_bow)
X_test_pos_tfidf = pos_tfidf_transformer.transform(X_test_pos_bow)

X_train_combined = hstack([X_train_tfidf_lemmatized, X_train_pos_tfidf])
X_test_combined = hstack([X_test_tfidf_lemmatized, X_test_pos_tfidf])

In [8]:
y_train = df_train["label"]
y_test = df_test["label"]

svc_model = LinearSVC(random_state=42, C=0.1)
svc_model.fit(X_train_combined, y_train)

In [9]:
y_pred_svc = svc_model.predict(X_test_combined)
print("Accuracy (TF-IDF + POS + LinearSVC):", accuracy_score(y_test, y_pred_svc))
print(classification_report(y_test, y_pred_svc))

Accuracy (TF-IDF + POS + LinearSVC): 0.835383962664404
              precision    recall  f1-score   support

           0       0.86      0.82      0.84      1213
           1       0.81      0.86      0.83      1144

    accuracy                           0.84      2357
   macro avg       0.84      0.84      0.84      2357
weighted avg       0.84      0.84      0.84      2357



## Task 3: Development of new features to improve classification accuracy

Come up with another feature or set of features and help Alice improve the quality. Remember that Alice is in the past and does not have access to any . Additional training data cannot be used either. You can use third-party resources to generate features.

Compare with the result of your **baseline** from Task 1. Any improvement will be counted. Use X_train_tfidf_lemmatized and X_test_tfidf_lemmatized, and combine your features with them as in Task 2.
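
As one possible direction, the sketch below shows how a few dense, hand-crafted features can be scaled and appended to a sparse TF-IDF matrix. It is only an illustration on made-up texts: the `surface_features` helper and its three features (exclamation marks, question marks, share of upper-case letters) are hypothetical, and nothing here claims that they improve accuracy on this dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

def surface_features(text):
    """Hypothetical features: '!' count, '?' count, share of upper-case letters."""
    letters = [ch for ch in text if ch.isalpha()]
    upper_share = sum(ch.isupper() for ch in letters) / max(len(letters), 1)
    return [text.count("!"), text.count("?"), upper_share]

# Made-up texts standing in for the train and test reviews
train_texts = ["I LOVED it!!!", "Why was this made?", "A quiet, decent film."]
test_texts = ["Terrible... why?!"]

tfidf = TfidfVectorizer()
X_train_words = tfidf.fit_transform(train_texts)
X_test_words = tfidf.transform(test_texts)

# Fit the scaler on the training features only, then reuse it for the test set
scaler = StandardScaler()
X_train_extra = scaler.fit_transform(np.array([surface_features(t) for t in train_texts]))
X_test_extra = scaler.transform(np.array([surface_features(t) for t in test_texts]))

# Append the dense columns to the right of the sparse TF-IDF columns
X_train_all = hstack([X_train_words, csr_matrix(X_train_extra)]).tocsr()
X_test_all = hstack([X_test_words, csr_matrix(X_test_extra)]).tocsr()

print(X_train_all.shape, X_test_all.shape)
```

Fitting the vectorizer and the scaler on the training split only keeps information from the test reviews out of the features.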

In [10]:
pip install textblob

Note: you may need to restart the kernel to use updated packages.


In [11]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
from scipy.sparse import hstack
from textblob import TextBlob
from sklearn.preprocessing import StandardScaler

df_train = pd.read_csv("imdb_train_hw1.csv")
df_test = pd.read_csv("imdb_test_hw1.csv")

def get_custom_feature(text):
    blob = TextBlob(text)
    return [blob.sentiment.polarity, blob.sentiment.subjectivity]

In [12]:
# Compute polarity and subjectivity once per review, then split them into two columns
train_scores = df_train["text"].apply(get_custom_feature)
test_scores = df_test["text"].apply(get_custom_feature)
df_train["sentiment_score"] = train_scores.apply(lambda s: s[0])
df_train["subjectivity_score"] = train_scores.apply(lambda s: s[1])
df_test["sentiment_score"] = test_scores.apply(lambda s: s[0])
df_test["subjectivity_score"] = test_scores.apply(lambda s: s[1])

scaler = StandardScaler()
X_train_sentiment = scaler.fit_transform(df_train[["sentiment_score", "subjectivity_score"]])
X_test_sentiment = scaler.transform(df_test[["sentiment_score", "subjectivity_score"]])

In [13]:
X_train_combined = hstack([X_train_tfidf_lemmatized, X_train_sentiment])
X_test_combined = hstack([X_test_tfidf_lemmatized, X_test_sentiment])

svc_model = LinearSVC(random_state=42, C=0.1)
svc_model.fit(X_train_combined, y_train)

In [14]:
y_pred_svc = svc_model.predict(X_test_combined)
print("Accuracy (TF-IDF + Sentiment + LinearSVC):", accuracy_score(y_test, y_pred_svc))
print(classification_report(y_test, y_pred_svc))

Accuracy (TF-IDF + Sentiment + LinearSVC): 0.8294442087399236
              precision    recall  f1-score   support

           0       0.84      0.82      0.83      1213
           1       0.81      0.84      0.83      1144

    accuracy                           0.83      2357
   macro avg       0.83      0.83      0.83      2357
weighted avg       0.83      0.83      0.83      2357

