*De La Salle University – Dasmariñas* \
*College of Information and Computer Studies*

**S–CSIS312LA — Natural Language Processing (Laboratory)** \
__Instructor:__ Josephine T. Eduardo

**Name:** Luis Anton P. Imperial \
**CYS:** BCS32

# Finals Activity 3

## Problem

A team of data scientists at a marketing firm wants to classify customer reviews into two categories: *Positive* and *Negative*. They are debating whether to remove stopwords from the text during the preprocessing phase of their NLP pipeline. The goal is to understand how stopword removal impacts the model's performance.

## Approach

1. **Dataset:** A collection of 5,000 customer reviews from an e-commerce website.
2. **Preprocessing:**
   - **Model 1:** Removes stopwords using a predefined list from the NLTK library.
   - **Model 2:** Does not remove stopwords, keeping all words for classification.
3. **Evaluation Metrics:** Precision, recall, and F1-score.

## Results

- **Model 1 (with stopword removal):** Achieved a precision of 0.85, recall of 0.83, and F1-score of 0.84.
- **Model 2 (without stopword removal):** Achieved a precision of 0.83, recall of 0.81, and F1-score of 0.82.
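
As a quick consistency check on the figures above, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the reported precision and recall. The helper name `f1_from` is illustrative and not part of the activity.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reported (precision, recall) pairs from the Results section
model_1 = f1_from(0.85, 0.83)  # stopwords removed
model_2 = f1_from(0.83, 0.81)  # stopwords retained

print(f"Model 1 F1: {model_1:.2f}")
print(f"Model 2 F1: {model_2:.2f}")
```

Both values round to the reported F1-scores of 0.84 and 0.82.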

## Analysis

- Removing stopwords improved precision, recall, and F1-score by about 0.02 each (0.85/0.83/0.84 versus 0.83/0.81/0.82). With function words filtered out, the model could give more weight to sentiment-bearing words, which are more indicative of positive or negative reviews.
- The improvement is attributed to the reduced dimensionality of the feature space and the focus on content-heavy words.
- The gain was small, however, which indicates that stopword removal had only a modest impact on this specific task.
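
To make the "reduced dimensionality of the feature space" point concrete, the sketch below counts distinct tokens in three sample reviews before and after filtering. It is only an illustration: `STOPWORDS` is a small hand-picked subset (an assumption, NLTK's English list is much longer), and the regex tokenizer is a stand-in for a real one.

```python
import re

# Illustrative subset of stopwords (an assumption; NLTK's English list is longer).
STOPWORDS = {"this", "is", "not", "and", "the", "was", "just"}

reviews = [
    "Wow... Loved this place.",
    "Crust is not good.",
    "Not tasty and the texture was just nasty.",
]

def tokenize(text):
    """Lowercase the text and keep runs of letters or apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

# Each distinct token becomes one feature (one column) in a bag-of-words model.
vocab_all = {tok for review in reviews for tok in tokenize(review)}
vocab_filtered = vocab_all - STOPWORDS

print(len(vocab_all), len(vocab_filtered))  # 15 8
```

On this toy sample the vocabulary shrinks from 15 features to 8. The proportion removed on a full dataset will differ.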


In [4]:
import pandas as pd

In [5]:
dataset = "Imperial_FActivity3_Reviews.tsv"

In [6]:
# Load the dataset
df = pd.read_csv(dataset, sep="\t")

# Display the first few rows
print(df.head())

                                              Review  Liked
0                           Wow... Loved this place.      1
1                                 Crust is not good.      0
2          Not tasty and the texture was just nasty.      0
3  Stopped by during the late May bank holiday of...      1
4  The selection on the menu was great and so wer...      1


In [7]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

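def remove_stopwords(text):
    """Return `text` with NLTK English stopwords removed (case-insensitive)."""
    word_tokens = word_tokenize(text)
    # Punctuation tokens are kept; only tokens found in stop_words are dropped.
    filtered_text = [w for w in word_tokens if w.lower() not in stop_words]
    return " ".join(filtered_text)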

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lpimp\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [8]:
# word_tokenize depends on the Punkt tokenizer data
nltk.download('punkt_tab')

# Create a new column with stopwords removed
df['Review_No_Stopwords'] = df['Review'].apply(remove_stopwords)

# Display the first few rows with the new column
print(df.head())

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\lpimp\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


                                              Review  Liked  \
0                           Wow... Loved this place.      1   
1                                 Crust is not good.      0   
2          Not tasty and the texture was just nasty.      0   
3  Stopped by during the late May bank holiday of...      1   
4  The selection on the menu was great and so wer...      1   

                                 Review_No_Stopwords  
0                              Wow ... Loved place .  
1                                       Crust good .  
2                              tasty texture nasty .  
3  Stopped late May bank holiday Rick Steve recom...  
4                      selection menu great prices .  
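
The output above shows a side effect of the predefined list: "Crust is not good." becomes "Crust good ." because the list contains "not", so the negation that carries the sentiment is dropped. One possible mitigation, sketched here as an assumption rather than something this activity does, is to subtract negation words from the stopword set before filtering. The names `demo_stop_words`, `negations`, and `drop_stopwords` are hypothetical, and the whitespace split is a stand-in for `word_tokenize`.

```python
# Illustrative subset of stopwords (an assumption; NLTK's English list is longer).
demo_stop_words = {"is", "this", "the", "and", "was", "just", "not", "no"}

# Hypothetical choice: negation words to keep even though they are stopwords.
negations = {"not", "no", "nor"}
custom_stop_words = demo_stop_words - negations

def drop_stopwords(text, stop_list):
    """Remove tokens found in stop_list, using a simple whitespace split."""
    tokens = text.rstrip(".").split()
    return " ".join(t for t in tokens if t.lower() not in stop_list)

review = "Crust is not good."
print(drop_stopwords(review, demo_stop_words))    # Crust good
print(drop_stopwords(review, custom_stop_words))  # Crust not good
```

Whether keeping negations helps the classifier on this dataset would need to be measured. The sketch only shows the mechanics.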


In [12]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Prepare the data for model training
X_with_stopwords = df['Review']
y = df['Liked']  # Assuming 'Liked' column represents sentiment labels (0 or 1)

X_without_stopwords = df['Review_No_Stopwords']

# Split the data into training and testing sets.
# Both calls use the same random_state, so the two splits select the same rows.
X_train_with, X_test_with, y_train_with, y_test_with = train_test_split(
    X_with_stopwords, y, test_size=0.2, random_state=42
)

X_train_without, X_test_without, y_train_without, y_test_without = train_test_split(
    X_without_stopwords, y, test_size=0.2, random_state=42
)

# Create and train the models

# Model 2 in the Approach section (stopwords retained)
vectorizer_with = TfidfVectorizer()
X_train_vec_with = vectorizer_with.fit_transform(X_train_with)
X_test_vec_with = vectorizer_with.transform(X_test_with)

model_with = LogisticRegression()
model_with.fit(X_train_vec_with, y_train_with)
y_pred_with = model_with.predict(X_test_vec_with)


# Model 1 in the Approach section (stopwords removed)
vectorizer_without = TfidfVectorizer()
X_train_vec_without = vectorizer_without.fit_transform(X_train_without)
X_test_vec_without = vectorizer_without.transform(X_test_without)

model_without = LogisticRegression()
model_without.fit(X_train_vec_without, y_train_without)
y_pred_without = model_without.predict(X_test_vec_without)

# Evaluate the models

print("Model with Stopwords Removed:")
print(classification_report(y_test_without, y_pred_without))

print("\nModel with Stopwords Retained:")
print(classification_report(y_test_with, y_pred_with))

Model with Stopwords Removed:
              precision    recall  f1-score   support

           0       0.72      0.85      0.78        96
           1       0.84      0.69      0.76       104

    accuracy                           0.77       200
   macro avg       0.78      0.77      0.77       200
weighted avg       0.78      0.77      0.77       200


Model with Stopwords Retained:
              precision    recall  f1-score   support

           0       0.75      0.85      0.80        96
           1       0.85      0.74      0.79       104

    accuracy                           0.80       200
   macro avg       0.80      0.80      0.79       200
weighted avg       0.80      0.80      0.79       200
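
The class-1 rows of the two reports can be cross-checked by hand. Since recall times support is the number of correctly classified rows, the underlying counts can be recovered from the printed values and the precision, recall, and F1-score recomputed from them. The sketch below does this. The function name `positive_class_metrics` is illustrative, and the recovered counts are a reconstruction from rounded figures, not values printed by the notebook.

```python
def positive_class_metrics(recall_0, recall_1, support_0, support_1):
    """Recompute class-1 precision, recall, and F1 from a printed report."""
    # recall * support is the number of correctly classified rows; for these
    # supports, rounding recovers the integer count from the two-decimal recall.
    correct_0 = round(recall_0 * support_0)
    tp = round(recall_1 * support_1)
    fp = support_0 - correct_0  # class-0 rows predicted as 1
    fn = support_1 - tp         # class-1 rows predicted as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return tp, fp, fn, round(precision, 2), round(recall, 2), round(f1, 2)

removed = positive_class_metrics(0.85, 0.69, 96, 104)
retained = positive_class_metrics(0.85, 0.74, 96, 104)
print("stopwords removed: ", removed)   # (72, 14, 32, 0.84, 0.69, 0.76)
print("stopwords retained:", retained)  # (77, 14, 27, 0.85, 0.74, 0.79)
```

Both tuples reproduce the class-1 rows printed above. Under this reconstruction, the two runs have the same 14 false positives, and the model that retains stopwords correctly identifies 77 positive reviews versus 72, which accounts for its higher recall and accuracy on this test split.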

