**Objectives:**

- The goal of this assignment is to build a sentiment analysis system using the Bag of Words (BoW) and TF-IDF techniques. Students will preprocess the dataset, clean and tokenize text using regular expressions (regex) in Python, and apply at least three machine learning models to classify the sentiment of given text data. Finally, they will evaluate and compare model performances to determine the best-performing model.

In [1]:
import pandas as pd

dataset = pd.read_csv("./Data/Dataset.csv")
dataset.head()

Unnamed: 0,text,sentiment
0,,0
1,Horrible!!! The worst experience ever. Do not ...,0
2,Terrible service!! I won't buy from here again...,0
3,"I had high hopes, but it broke after a week. :-/",0
4,"Product is okay, but packaging was awful. ?!?",0


In [2]:
print(dataset.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       980 non-null    object
 1   sentiment  1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB
None


In [3]:
print(dataset.isnull().sum())

text         20
sentiment     0
dtype: int64


In [4]:
# Drop rows with missing values in the 'text' or 'sentiment' columns
dataset.dropna(subset = ["text", "sentiment"], inplace = True)

# Reset the index after dropping rows
dataset.reset_index(drop = True, inplace = True)
dataset.head()

Unnamed: 0,text,sentiment
0,Horrible!!! The worst experience ever. Do not ...,0
1,Terrible service!! I won't buy from here again...,0
2,"I had high hopes, but it broke after a week. :-/",0
3,"Product is okay, but packaging was awful. ?!?",0
4,"Good quality, but a bit expensive. Worth it th...",0


In [5]:
# Count the number of unique texts in the dataset
unique_text_count = dataset["text"].nunique()
print(f"Number of unique text entries: {unique_text_count}")

# Count the number of duplicate texts in the dataset
duplicates_count = dataset["text"].duplicated().sum()
print(f"Number of duplicate text entries: {duplicates_count}")

Number of unique text entries: 20
Number of duplicate text entries: 960


In [6]:
# #Drop duplicates
# dataset.drop_duplicates(subset = ['text'], inplace = True)
# # Reset the index after dropping duplicates
# dataset.reset_index(drop = True, inplace = True)
# dataset.info()

In [7]:
import re

def clean_text(text):

    # Remove every character that is not an ASCII letter or whitespace
    text = re.sub(r"[^a-zA-Z\s]", "", text)

    # Remove extra spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text
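
To see what the two substitutions in `clean_text` keep and drop, here is a small illustrative check. The sample strings are made up for this sketch and are not taken from the dataset; the function body mirrors the cell above.

```python
import re

def clean_text(text):
    # Keep only ASCII letters and whitespace; digits, punctuation and accented letters are dropped
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Collapse runs of whitespace into a single space and trim the ends
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample inputs, not taken from the dataset
samples = ["I won't buy 2 of these... café!", "  Great   value :-)  "]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
```

Removing the apostrophe turns "won't" into "wont", which is why `wont` shows up as its own vocabulary column later in the notebook.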

In [8]:
# Apply the cleaning function to the "text" column
dataset["text"] = dataset["text"].apply(clean_text)
dataset.head()

Unnamed: 0,text,sentiment
0,Horrible The worst experience ever Do not buy,0
1,Terrible service I wont buy from here again,0
2,I had high hopes but it broke after a week,0
3,Product is okay but packaging was awful,0
4,Good quality but a bit expensive Worth it though,0


In [9]:
dataset["text"] = dataset["text"].str.lower()
dataset.head()

Unnamed: 0,text,sentiment
0,horrible the worst experience ever do not buy,0
1,terrible service i wont buy from here again,0
2,i had high hopes but it broke after a week,0
3,product is okay but packaging was awful,0
4,good quality but a bit expensive worth it though,0


In [10]:
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rahul\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
# Get English stopwords
stop_words = set(stopwords.words("english"))

def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in stop_words])

In [12]:
# Apply the stopwords removal function
dataset["text"] = dataset["text"].apply(remove_stopwords)
dataset.head()

Unnamed: 0,text,sentiment
0,horrible worst experience ever buy,0
1,terrible service wont buy,0
2,high hopes broke week,0
3,product okay packaging awful,0
4,good quality bit expensive worth though,0
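
Row 0 above shows "do not buy" reduced to "buy", because the stopword list includes negations. If negation should be preserved, one option is to subtract a few words from the set before filtering. The sketch below uses a small hand-written stopword set (an assumption, so the example runs without NLTK data); in the notebook the same subtraction could be applied to `stop_words`.

```python
# Hypothetical stand-in for the NLTK English stopword set, kept tiny for illustration
demo_stop_words = {"i", "am", "do", "not", "no", "the", "with", "is", "a"}

# Words we choose to keep because they can flip sentiment (an illustrative choice)
negations = {"not", "no"}
stop_words_keep_neg = demo_stop_words - negations

def remove_stopwords(text, stop_words):
    # Drop every whitespace-separated token found in stop_words
    return " ".join(word for word in text.split() if word not in stop_words)

sentence = "i am not satisfied with the service"
without_negation = remove_stopwords(sentence, demo_stop_words)
with_negation = remove_stopwords(sentence, stop_words_keep_neg)
print(without_negation)
print(with_negation)
```

Whether keeping negations helps depends on the data; with unigram features the word "not" is still detached from the word it modifies.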


In [13]:
from nltk.stem import WordNetLemmatizer

# Download WordNet if not already downloaded
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rahul\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [14]:
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

In [15]:
# Apply the lemmatization function
dataset["text"] = dataset["text"].apply(lemmatize_text)

dataset.head()

Unnamed: 0,text,sentiment
0,horrible worst experience ever buy,0
1,terrible service wont buy,0
2,high hope broke week,0
3,product okay packaging awful,0
4,good quality bit expensive worth though,0


**Summary of Preprocessing Steps**

- Handled Missing Values: Dropped rows with missing values in the text or sentiment columns to ensure data quality.

- Removed Non-Alphabetic Characters: Used regex to strip everything except ASCII letters and whitespace (digits, punctuation, emoticons), then collapsed repeated spaces.

- Converted Text to Lowercase: Ensured uniformity in the text data.

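- Removed Stopwords: Eliminated common English words from the NLTK stopword list that carry little sentiment on their own. Note that this list also removes negations such as "not" (row 0 lost "do not").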

- Performed Lemmatization: Normalized words to their base forms for consistency.
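
The steps listed above can be chained into one helper so that training data and unseen text go through identical processing. This is a minimal sketch: `preprocess`, `DEMO_STOPWORDS`, and the optional `lemmatize` argument are names introduced here, and the tiny stopword set stands in for the NLTK list used in the notebook.

```python
import re

# Hypothetical, tiny stopword set for illustration; the notebook uses NLTK's English list
DEMO_STOPWORDS = {"the", "do", "not", "i", "it", "a", "is", "but", "was"}

def preprocess(text, stop_words=DEMO_STOPWORDS, lemmatize=None):
    # 1. Keep only ASCII letters and whitespace
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # 2. Collapse whitespace and lowercase
    text = re.sub(r"\s+", " ", text).strip().lower()
    # 3. Drop stopwords
    tokens = [word for word in text.split() if word not in stop_words]
    # 4. Optionally lemmatize each token with a caller-supplied function,
    #    for example WordNetLemmatizer().lemmatize
    if lemmatize is not None:
        tokens = [lemmatize(word) for word in tokens]
    return " ".join(tokens)

result = preprocess("Horrible!!! The worst experience ever. Do not buy")
print(result)
```

Passing the lemmatizer in as a function keeps the helper usable even where the WordNet data is not installed.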

### Implement Bag of Words and TF-IDF

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [17]:
# Initialize CountVectorizer
bow_vectorizer = CountVectorizer()

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

In [18]:
# Fit and transform the text data
X_bow = bow_vectorizer.fit_transform(dataset["text"])
print(f"Shape of the bag of words matrix: {X_bow.shape}")

Shape of the bag of words matrix: (980, 76)


In [19]:
# Fit and transform the text data
X_tfidf = tfidf_vectorizer.fit_transform(dataset["text"])
print(f"Shape of the TF-IDF matrix: {X_tfidf.shape}")

Shape of the TF-IDF matrix: (980, 76)


In [20]:
# Convert the Bag of Words matrix to a DataFrame
df_bow = pd.DataFrame(X_bow.toarray(), columns = bow_vectorizer.get_feature_names_out())
df_bow.head()

Unnamed: 0,absolutely,advertised,amazing,arrived,away,awful,best,better,bit,broke,...,time,took,trust,week,wont,work,worst,worth,worthit,would
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0


In [21]:
# Convert the TF-IDF matrix to a DataFrame
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns = tfidf_vectorizer.get_feature_names_out())
df_tfidf.head()

Unnamed: 0,absolutely,advertised,amazing,arrived,away,awful,best,better,bit,broke,...,time,took,trust,week,wont,work,worst,worth,worthit,would
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.388662,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.474759,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,...,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.551071,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.432693,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.432693,0.0,0.0


| **Aspect**           | **Bag of Words (BoW)**                                              | **TF-IDF**                                                             |
|-----------------------|---------------------------------------------------------------------|-------------------------------------------------------------------------|
| **Values**           | Raw word counts (e.g., 1, 2, etc.).                                 | Weighted scores based on term frequency and inverse document frequency. |
| **Focus**            | Focuses on word frequency.                                          | Focuses on word importance in a document relative to the corpus.        |
| **Common Words**     | Common words may dominate unless stopwords are removed.             | Common words are down-weighted automatically.                           |
| **Interpretability** | Easier to interpret as it directly represents counts.               | Harder to interpret due to weighted values.                             |
| **Use Case**         | Suitable for simple models or when frequency is sufficient.         | Suitable for tasks where word relevance matters more.                   |
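
To make the "weighted scores" row concrete, the sketch below rebuilds TF-IDF values by hand from raw counts and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization); the three toy documents are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, not taken from the dataset
docs = ["good quality good price", "awful quality", "good service"]

# Bag of Words: raw term counts per document
counts = CountVectorizer().fit_transform(docs).toarray()

# Document frequency: in how many documents each term appears
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn default)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale every row to unit L2 length
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

tfidf = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual, tfidf)
print(matches)
```

Because of the row normalization, a document whose terms all share the same IDF ends up with equal weights, which is consistent with the two 0.5 entries visible in row 2 of the TF-IDF table above.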

### Model Training and Evaluation

In [22]:
# Splitting the Dataset
from sklearn.model_selection import train_test_split

In [23]:
# Split the data into training and testing sets
X_train_bow, X_test_bow, y_train_bow, y_test_bow = train_test_split(X_bow, 
                                                                    dataset["sentiment"], 
                                                                    test_size = 0.2, 
                                                                    random_state = 42,
                                                                    shuffle = True)

X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(X_tfidf, 
                                                                            dataset["sentiment"], 
                                                                            test_size = 0.2, 
                                                                            random_state = 42,
                                                                            shuffle = True)

In [24]:
# Shapes of the Train and Test sets
print(f"Training set size (Bag of Words): {X_train_bow.shape}")
print(f"Testing set size (Bag of Words): {X_test_bow.shape}")

print(f"Training set size (TF-IDF): {X_train_tfidf.shape}")
print(f"Testing set size (TF-IDF): {X_test_tfidf.shape}")

Training set size (Bag of Words): (784, 76)
Testing set size (Bag of Words): (196, 76)
Training set size (TF-IDF): (784, 76)
Testing set size (TF-IDF): (196, 76)
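
The cells above vectorize the full dataset and then split the resulting matrices. An alternative ordering, sketched below on made-up toy sentences, splits the raw text first and lets a scikit-learn pipeline fit the vectorizer on the training portion only, so the vocabulary and IDF weights never see test rows. The texts, labels, and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy data, not taken from the dataset (1 = positive, 0 = negative is assumed here)
texts = [
    "great product love it", "awful service never again",
    "excellent quality", "terrible packaging broke",
    "best purchase ever", "worst experience ever",
    "amazing value", "horrible support",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw strings first; stratify keeps the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is fitted inside the pipeline, on training text only
pipe = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
print(preds)
```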


Model Training

In [25]:
# Models
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Metrics for evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

For Bag of Words

In [26]:
# Initialize models
model_xgb_bow = XGBClassifier(use_label_encoder = False, eval_metric = "mlogloss")

model_svm_bow = SVC(kernel = "linear", probability = True)

model_rf_bow = RandomForestClassifier(n_estimators = 100, random_state = 42)

In [27]:
# Train models for Bag of Words
model_xgb_bow.fit(X_train_bow, y_train_bow)
model_svm_bow.fit(X_train_bow, y_train_bow)
model_rf_bow.fit(X_train_bow, y_train_bow)

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


In [28]:
# Make predictions
y_pred_xgb = model_xgb_bow.predict(X_test_bow)
y_pred_svm = model_svm_bow.predict(X_test_bow)
y_pred_rf = model_rf_bow.predict(X_test_bow)

For TF-IDF vector

In [29]:
# Initialize models
model_xgb_tfidf = XGBClassifier(use_label_encoder = False, eval_metric = "mlogloss")

model_svm_tfidf = SVC(kernel = "linear", probability = True)

model_rf_tfidf = RandomForestClassifier(n_estimators = 100, random_state = 42)

In [30]:
# Train models for TF-IDF
model_xgb_tfidf.fit(X_train_tfidf, y_train_tfidf)
model_svm_tfidf.fit(X_train_tfidf, y_train_tfidf)
model_rf_tfidf.fit(X_train_tfidf, y_train_tfidf)

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


In [34]:
# Make predictions
y_pred_xgb_tfidf = model_xgb_tfidf.predict(X_test_tfidf)
y_pred_svm_tfidf = model_svm_tfidf.predict(X_test_tfidf)
y_pred_rf_tfidf = model_rf_tfidf.predict(X_test_tfidf)

In [31]:
# Function to evaluate models
def evaluate_model(name, y_test, y_pred):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average = "weighted", zero_division = 0)
    recall = recall_score(y_test, y_pred, average = "weighted")
    f1 = f1_score(y_test, y_pred, average = "weighted")

    print(f'\nEvaluation Metrics for {name} Model:')
    print(f'{"Metric":<10} {"Score":<10}')
    print("-" * 20)
    print(f'{"Accuracy":<10} {accuracy:.4f}')
    print(f'{"Precision":<10} {precision:.4f}')
    print(f'{"Recall":<10} {recall:.4f}')
    print(f'{"F1 Score":<10} {f1:.4f}')
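
`evaluate_model` uses `average = "weighted"`, which averages the per-class scores using each class's number of true samples as the weight. The sketch below reproduces that for precision on a small made-up label vector; the arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels and predictions for illustration
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 1])

# Support: number of true samples per class
classes, support = np.unique(y_true, return_counts=True)

# Per-class precision, returned in sorted label order
per_class = precision_score(y_true, y_pred, average=None, zero_division=0)

# Support-weighted mean of the per-class values
manual = np.sum(per_class * support) / support.sum()
library = precision_score(y_true, y_pred, average="weighted", zero_division=0)
print(manual, library)
```

With balanced classes the weighted and macro averages coincide; they diverge when one class dominates the test set.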

#### Evaluation for the Bag of Words

In [33]:
# Evaluate all models
print("Evaluation for the Bag of Words vectors.")
evaluate_model("XGBoost", y_test_bow, y_pred_xgb)
evaluate_model("SVM", y_test_bow, y_pred_svm)
evaluate_model("Random Forest", y_test_bow, y_pred_rf)

Evaluation for the Bag of Words vectors.

Evaluation Metrics for XGBoost Model:
Metric     Score     
--------------------
Accuracy   1.0000
Precision  1.0000
Recall     1.0000
F1 Score   1.0000

Evaluation Metrics for SVM Model:
Metric     Score     
--------------------
Accuracy   1.0000
Precision  1.0000
Recall     1.0000
F1 Score   1.0000

Evaluation Metrics for Random Forest Model:
Metric     Score     
--------------------
Accuracy   1.0000
Precision  1.0000
Recall     1.0000
F1 Score   1.0000


#### Evaluation for TF-IDF


In [36]:
# Evaluate all models
print("Evaluation for the TF-IDF vectors.")
evaluate_model("XGBoost", y_test_tfidf, y_pred_xgb_tfidf)
evaluate_model("SVM", y_test_tfidf, y_pred_svm_tfidf)
evaluate_model("Random Forest", y_test_tfidf, y_pred_rf_tfidf)

Evaluation for the TF-IDF vectors.

Evaluation Metrics for XGBoost Model:
Metric     Score     
--------------------
Accuracy   1.0000
Precision  1.0000
Recall     1.0000
F1 Score   1.0000

Evaluation Metrics for SVM Model:
Metric     Score     
--------------------
Accuracy   1.0000
Precision  1.0000
Recall     1.0000
F1 Score   1.0000

Evaluation Metrics for Random Forest Model:
Metric     Score     
--------------------
Accuracy   1.0000
Precision  1.0000
Recall     1.0000
F1 Score   1.0000
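
Perfect scores across every model and feature combination are worth a sanity check. Given the earlier count of 20 unique texts among 980 rows, one plausible explanation is that copies of the same review land in both the training and test sets. The sketch below measures that overlap on a toy series; the sample texts and variable names are illustrative and not taken from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in: 4 unique texts, each repeated 5 times
texts = pd.Series(
    ["great product", "awful service", "love it", "broke after week"] * 5
)

train, test = train_test_split(texts, test_size=0.2, random_state=42)

# Share of test rows whose exact text also appears in the training rows
overlap = float(test.isin(set(train)).mean())
print(f"Share of test texts also present in training: {overlap:.2f}")

# Dropping duplicates before splitting leaves only the distinct texts
unique_texts = texts.drop_duplicates()
print(f"Rows after deduplication: {len(unique_texts)}")
```

When the overlap is close to 1, test metrics mostly measure memorization of repeated rows, so they say little about performance on genuinely new text.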


### Predictions on Unseen Data

In [37]:
# Unseen text for prediction
unseen_text = ["This is a great product! I love it.", "I am not satisfied with the service."]

# Clean the unseen text
unseen_text_cleaned = [clean_text(text) for text in unseen_text]
unseen_text_cleaned = [text.lower() for text in unseen_text_cleaned]
unseen_text_cleaned = [remove_stopwords(text) for text in unseen_text_cleaned]
unseen_text_cleaned = [lemmatize_text(text) for text in unseen_text_cleaned]

In [38]:
# Convert the cleaned text to Bag of Words and TF-IDF features
X_unseen_bagofword = bow_vectorizer.transform(unseen_text_cleaned)  # BoW
X_unseen_tfidf = tfidf_vectorizer.transform(unseen_text_cleaned)  # TF-IDF

In [39]:
# Make predictions using BoW (BoW features go to the models trained on BoW)
y_pred_unseen_xgb_bow = model_xgb_bow.predict(X_unseen_bagofword)
y_pred_unseen_svm_bow = model_svm_bow.predict(X_unseen_bagofword)
y_pred_unseen_rf_bow = model_rf_bow.predict(X_unseen_bagofword)

# Make predictions using TF-IDF
y_pred_unseen_xgb_tfidf = model_xgb_tfidf.predict(X_unseen_tfidf)
y_pred_unseen_svm_tfidf = model_svm_tfidf.predict(X_unseen_tfidf)
y_pred_unseen_rf_tfidf = model_rf_tfidf.predict(X_unseen_tfidf)

In [42]:
# Print the predictions
print("🔹 Predictions for Unseen Text:")
for i, text in enumerate(unseen_text):
    print(f"\n📌 Text: {text}")
    print(f"   ✅ XGBoost (BoW): {y_pred_unseen_xgb_bow[i]} | XGBoost (TF-IDF): {y_pred_unseen_xgb_tfidf[i]}")
    print(f"   ✅ SVM (BoW): {y_pred_unseen_svm_bow[i]} | SVM (TF-IDF): {y_pred_unseen_svm_tfidf[i]}")
    print(f"   ✅ Random Forest (BoW): {y_pred_unseen_rf_bow[i]} | Random Forest (TF-IDF): {y_pred_unseen_rf_tfidf[i]}")

🔹 Predictions for Unseen Text:

📌 Text: This is a great product! I love it.
   ✅ XGBoost (BoW): 1 | XGBoost (TF-IDF): 1
   ✅ SVM (BoW): 1 | SVM (TF-IDF): 1
   ✅ Random Forest (BoW): 1 | Random Forest (TF-IDF): 1

📌 Text: I am not satisfied with the service.
   ✅ XGBoost (BoW): 0 | XGBoost (TF-IDF): 0
   ✅ SVM (BoW): 1 | SVM (TF-IDF): 0
   ✅ Random Forest (BoW): 0 | Random Forest (TF-IDF): 0
