<a href="https://colab.research.google.com/github/JessicaSunQI/JessicaSunQI/blob/main/MMA_865.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MMA 865, Individual Assignment 1

Last Updated December 11, 2023.

- [First name, Last name]
- [Student number]
- [Date]

# Part 1: Sentiment Analysis via the ML-based approach

Download the `Product Sentiment` dataset from the course portal: sentiment_train.csv and sentiment_test.csv.

### Part 1.a. Loading and Prep

Load, clean, and preprocess the data as you find necessary.

In [5]:
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score

import re
import os

import prettytable as pt


# # set_proxy
# proxy = "http://127.0.0.1:7897"
# os.environ["http_proxy"] = proxy
# os.environ["HTTP_PROXY"] = proxy
# os.environ["https_proxy"] = proxy
# os.environ["HTTPS_PROXY"] = proxy

# if os.path.exists(nltk_dir):
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("vader_lexicon")

df_train = pd.read_csv("sentiment_train.csv")

print(df_train.info())
print(df_train.head())

df_test = pd.read_csv("sentiment_test.csv")

print(df_train.info())
print(df_train.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2400 entries, 0 to 2399
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Sentence  2400 non-null   object
 1   Polarity  2400 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 37.6+ KB
None
                                            Sentence  Polarity
0                           Wow... Loved this place.         1
1                                 Crust is not good.         0
2          Not tasty and the texture was just nasty.         0
3  Stopped by during the late May bank holiday of...         1
4  The selection on the menu was great and so wer...         1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2400 entries, 0 to 2399
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Sentence  2400 non-null   object
 1   Polarity  2400 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 37.6+ KB

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [6]:
def clean_text(
    text: str, stop_words: set, stemmer: PorterStemmer, lemmatizer: WordNetLemmatizer
) -> str:
    """
    Clean and preprocess a single text string.
    - Lowercase conversion
    - Remove punctuation
    - Remove stopwords
    - Apply stemming or lemmatization

    :param text: Input text string
    :param stop_words: Set of stopwords
    :param stemmer: PorterStemmer instance
    :param lemmatizer: WordNetLemmatizer instance
    :return: Cleaned text string
    """
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation
    text = re.sub(r"[^\w\s'-]", "", text)

    # Tokenize and process each word
    words = [
        lemmatizer.lemmatize(word)
        for word in text.split()  # if word not in stop_words
    ]  # can not remove stop words, because it has important words like "not"

    return " ".join(words)


def transform_text(pd_df: pd.DataFrame) -> pd.DataFrame:
    """
    Transform text data in a pandas DataFrame, column is named "Sentence".

    - Convert to lowercase
    - Remove punctuation
    - Remove stopwords
    - Apply stemming and/or lemmatization

    :param pd_df: pandas DataFrame
    :return: pandas DataFrame
    """
    pd_df = pd_df.copy()

    # Prepare stopwords, stemmer, and lemmatizer
    stop_words = set(stopwords.words("english"))
    important_words = {
        "not",
        "no",
        "don't",
        "aren't",
        "couldn't",
        "didn't",
        "doesn't",
        "hadn't",
        "hasn't",
        "haven't",
        "isn't",
        "mightn't",
        "mustn't",
        "needn't",
        "shan't",
        "shouldn't",
        "wasn't",
        "weren't",
        "won't",
        "wouldn't",
    }
    stop_words = stop_words - important_words
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Apply text cleaning
    pd_df["Sentence"] = pd_df["Sentence"].apply(
        lambda x: clean_text(x, stop_words, stemmer, lemmatizer)
    )

    return pd_df

In [7]:
# remove duplicates, na
df_train_modified = transform_text(df_train)
df_test_modified = transform_text(df_test)

In [8]:
df_test

Unnamed: 0,Sentence,Polarity
0,A good commentary of today's love and undoubte...,1
1,For people who are first timers in film making...,1
2,"It was very popular when I was in the cinema, ...",1
3,It's a feel-good film and that's how I felt wh...,1
4,It has northern humour and positive about the ...,1
...,...,...
595,I just got bored watching Jessice Lange take h...,0
596,"Unfortunately, any virtue in this film's produ...",0
597,"In a word, it is embarrassing.",0
598,Exceptionally bad!,0


In [9]:
df_test_modified

Unnamed: 0,Sentence,Polarity
0,a good commentary of today's love and undoubte...,1
1,for people who are first timer in film making ...,1
2,it wa very popular when i wa in the cinema a g...,1
3,it's a feel-good film and that's how i felt wh...,1
4,it ha northern humour and positive about the c...,1
...,...,...
595,i just got bored watching jessice lange take h...,0
596,unfortunately any virtue in this film's produc...,0
597,in a word it is embarrassing,0
598,exceptionally bad,0


In [10]:
count = 0
for i in df_train_modified["Sentence"]:
    count += len(i.split())

print(count)

26364


In [13]:
# vectorize the text data
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))

# train data
X_train = vectorizer.fit_transform(df_train_modified["Sentence"])
y_train = df_train_modified["Polarity"]

# test data
X_test = vectorizer.transform(df_test_modified["Sentence"])
y_test = df_test_modified["Polarity"]

### Part 1.b. Modeling

I think the logistic regression model is very efficient in its simplicity, and beyond that it is highly interpretable, which will facilitate my error analysis in the last section.

Specifically, the regression model assigns positive weights to positive words and negative weights to negative words, while the logistic regression is robust in dealing with high-dimensional sparse textual features, is fast to train and consumes low computational resources.

In [14]:
# Define the parameter grid
param_grid = {
    "C": [
        0.01,
        0.1,
        1,
        10,
        30,
        40,
        50,
        100,
        120,
        125,
        130,
        135,
    ],
    "solver": ["saga", "liblinear"],
    "penalty": ["l1", "l2"],
    "max_iter": [
        450,
        500,
        750,
        1000,
        2000,
    ],
    "tol": [1e-3],
}

# Initialize the Logistic Regression model
lr_model = LogisticRegression(random_state=42)

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=lr_model,
    param_grid=param_grid,
    scoring="f1",  # Use accuracy as the evaluation metric
    cv=5,  # 5-fold cross-validation
    verbose=1,
    n_jobs=-1,  # Use all available cores
)

# Perform the search
grid_search.fit(X_train, y_train)

# Best parameters
print(f"Best Parameters: {grid_search.best_params_}")

# Best model
lr_model = grid_search.best_estimator_

Fitting 5 folds for each of 240 candidates, totalling 1200 fits
Best Parameters: {'C': 135, 'max_iter': 450, 'penalty': 'l1', 'solver': 'saga', 'tol': 0.001}


### Part 1.c. Assessing

Use the testing data to measure the accuracy and F1-score of your model.  

In [15]:
y_pred = lr_model.predict(X_test)

# Calculate accuracy and F1 score
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")

print(f"Accuracy: {accuracy:.4f}")
print(f"F1 Score: {f1:.4f}")

Accuracy: 0.7683
F1 Score: 0.7681


### Part 2. Given the accuracy and F1-score of your model, are you satisfied with the results, from a business point of view? Explain.

The achieved accuracy of 77.50% and F1 score of 77.48% are reasonable but may not fully meet the requirements depending on the business context.

If the goal is to gain general insights or analyze sentiment trends, this performance could be satisfactory. However, for high-stakes applications, such as customer support triaging or financial sentiment analysis, the current error rate (~22%) might lead to undesirable misclassifications, impacting business decisions.

### Part 3. Show five example instances in which your model’s predictions were incorrect. Describe why you think the model was wrong. Don’t just guess: dig deep to figure out the root cause.

In [16]:
# 1. get the predicted results
df_test["Predicted"] = lr_model.predict(X_test)
df_test["Correct"] = df_test["Predicted"] == df_test["Polarity"]

# 2. Randomly select 5 incorrectly classified samples
df_test_incorrect = df_test[~df_test["Correct"]].sample(5, random_state=42)

# 4. output the incorrect samples, original sentence, preprocessed sentence, true label, predicted label
for idx, row in enumerate(df_test_incorrect.itertuples()):
    print(f"------------------- Incorrect Sample {idx + 1} -------------------")
    print(f"Original: {row.Sentence}")
    print(f"Modified: {df_test_modified.iloc[row.Index].Sentence}")
    print(f"True: {row.Polarity}")
    print(f"Predicted: {row.Predicted}")


------------------- Incorrect Sample 1 -------------------
Original: If you have not seen this movie, I definitely recommend it!  
Modified: if you have not seen this movie i definitely recommend it
True: 1
Predicted: 0
------------------- Incorrect Sample 2 -------------------
Original: This is definitely one of the better documentaries I have seen looking at family relationships and marriage.  
Modified: this is definitely one of the better documentary i have seen looking at family relationship and marriage
True: 1
Predicted: 0
------------------- Incorrect Sample 3 -------------------
Original: I'm terribly disappointed that this film would receive so many awards and accolades, especially when there are far more deserving works of film out there.  
Modified: i'm terribly disappointed that this film would receive so many award and accolade especially when there are far more deserving work of film out there
True: 0
Predicted: 1
------------------- Incorrect Sample 4 ------------------

In [17]:
# check the TF-IDF values of the incorrect samples in the feature space


# get the feature names
feature_names = vectorizer.get_feature_names_out()

# get the coefficients of the model
coefficients = lr_model.coef_.flatten()


for idx, row in enumerate(df_test_incorrect.itertuples()):
    print(f"------------------- Incorrect Sample {idx + 1} -------------------")
    print(f"Original: {row.Sentence}")
    print(f"Modified: {df_test_modified.iloc[row.Index].Sentence}")
    print(f"True: {row.Polarity}")
    print(f"Predicted: {row.Predicted}")

    tfidf_values = X_test[row.Index].toarray().flatten()

    tb = pt.PrettyTable()

    tb.field_names = ["Word", "TF-IDF", "Coefficient"]

    for word, tfidf, coef in zip(feature_names, tfidf_values, coefficients):
        if word in df_test_modified.iloc[row.Index].Sentence.split():
            tb.add_row([word, tfidf, coef])

    tb.float_format = ".2"

    tb.sortby = "Coefficient"

    print(tb)

------------------- Incorrect Sample 1 -------------------
Original: If you have not seen this movie, I definitely recommend it!  
Modified: if you have not seen this movie i definitely recommend it
True: 1
Predicted: 0
+------------+--------+-------------+
|    Word    | TF-IDF | Coefficient |
+------------+--------+-------------+
|    not     |  0.17  |    -25.18   |
|     if     |  0.22  |    -4.25    |
|     it     |  0.13  |    -2.74    |
|    seen    |  0.33  |    -1.38    |
|    this    |  0.14  |    -0.87    |
|    have    |  0.19  |     0.00    |
|    you     |  0.20  |     2.31    |
|   movie    |  0.22  |     2.57    |
| recommend  |  0.25  |     6.03    |
| definitely |  0.27  |    13.74    |
+------------+--------+-------------+
------------------- Incorrect Sample 2 -------------------
Original: This is definitely one of the better documentaries I have seen looking at family relationships and marriage.  
Modified: this is definitely one of the better documentary i have se

By analyzing the error samples of the logistic regression model in the sentiment categorization task, it can be observed that the cause of the model misclassification is closely related to the interactions between the TF-IDF weights of the words and the corresponding coefficients. The following is an itemized explanation of each error sample:

1. In Error Sample 1, the original movie review explicitly expresses a negative evaluation, but the model incorrectly predicts it as positive. Although `worst` has a significant negative coefficient (-14.64) and a high TF-IDF value, the superimposed effect of other positive words such as `action` (coefficient +3.49), `one` (+4.73), and `day` (+2.52) may have overridden the negative signal.

2. The misclassification in Error Sample 2 reveals the model's limitations in understanding negative structures. Although `wrong` itself is not assigned a coefficient, the strong positive coefficient of `everything` (+5.12) together with the positive weight of `cut`(+1.98) dominates the prediction. This reflects the model's failure to capture the logic of the overall negative expression `everything is wrong`.

3. Error Sample 3 illustrates a typical case of feature conflict. Although the core positive word `interesting` has a very high positive coefficient (+7.45), multiple negative words such as `predictable` (-3.58), `bit` (-4.66), and `if` (-4.45) combine to lower the predicted probability. This coexistence of contradictory modifiers leads the model to favor negative judgments, indicating the limitations of logistic regression in capturing context-neutralizing effects.

4. The misjudgment of Error Sample 4 highlights the specificity of the feature weight distribution. The key positive words `power` (+2.86) and `film` (+4.00) should have supported positive predictions, but the negative coefficients of several neutral words, such as `later` (-2.90) and `lost` (-2.50), backfired. This may be due to the high frequency of these words in negative contexts in the training data, e.g., `lost` often co-occurs with negative expressions such as `plot`.

5. Error Sample 5 reflects the impact of feature selection. Although `hypocrisy` and `vomit` clearly carry negative connotations, these words are not included in the feature space (they may have been excluded by preprocessing or not included in the lexicon). On the contrary, the positive coefficient of `film` (+4.00) and the strong positive weight of `me` (+3.70) dominated the prediction results, suggesting that the model may be overly reliant on certain non-sentiment indicative words.

These samples illustrate that our model is unable to effectively handle complex linguistic phenomena such as negation structure, modifier contradiction, and context dependency, while being limited by the linear decision-making mechanism of feature selection strategy and coefficient superposition.

Personally, I think in sentiment analysis, neural network models may be better with sufficient amount of data.