<center><h3 style="color:red;">Sentiment analysis system</h3></center>

**Objectives:**

- The goal of this assignment is to build a sentiment analysis system using the Bag of Words (BoW) and TF-IDF techniques. Students will preprocess the dataset, clean and tokenize text using regular expressions (regex) in Python, and apply at least three machine learning models to classify the sentiment of given text data. Finally, they will evaluate and compare model performances to determine the best-performing model.

**1. Data Preprocessing & Cleaning (30 Marks)**

In [56]:
import pandas as pd

df = pd.read_csv("../Data/Dataset.csv")
df.head()

Unnamed: 0,text,sentiment
0,,0
1,Horrible!!! The worst experience ever. Do not ...,0
2,Terrible service!! I won't buy from here again...,0
3,"I had high hopes, but it broke after a week. :-/",0
4,"Product is okay, but packaging was awful. ?!?",0


In [57]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       980 non-null    object
 1   sentiment  1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB
None


**Justification**: Missing values can lead to errors during analysis. Identifying them early ensures proper handling.

In [58]:
print(df.isnull().sum())

text         20
sentiment     0
dtype: int64


In [59]:
# Drop rows with missing values in the 'text' or 'sentiment' columns
df.dropna(subset=['text', 'sentiment'], inplace=True)

# Reset the index after dropping rows
df.reset_index(drop=True, inplace=True)
df.head(5)

Unnamed: 0,text,sentiment
0,Horrible!!! The worst experience ever. Do not ...,0
1,Terrible service!! I won't buy from here again...,0
2,"I had high hopes, but it broke after a week. :-/",0
3,"Product is okay, but packaging was awful. ?!?",0
4,"Good quality, but a bit expensive. Worth it th...",0


**Justification:** Missing sentiment labels cannot be inferred, and rows with missing text cannot be used for analysis. Dropping these rows ensures data quality.

In [60]:
# Count the number of unique texts in the dataset
unique_text_count = df['text'].nunique()
print(f"Number of unique text entries: {unique_text_count}")

# Count the number of duplicate texts in the dataset
duplicates_count = df['text'].duplicated().sum()
print(f"Number of duplicate text entries: {duplicates_count}")

Number of unique text entries: 20
Number of duplicate text entries: 960


In [61]:
# #Drop duplicates
# df.drop_duplicates(subset=['text'], inplace=True)
# # Reset the index after dropping duplicates
# df.reset_index(drop=True, inplace=True)
# df.info()

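**Justification**: Dropping duplicate text entries usually keeps redundant rows from biasing the model and reduces overfitting. Here, however, only 20 of the 980 texts are unique, so deduplicating would leave just 20 rows. The step is left commented out and all 980 rows are kept (the feature matrices built later have 980 rows). The trade-off is that identical texts end up in both the training and test splits, which inflates the evaluation scores.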
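To make the effect of duplicates on evaluation concrete, the sketch below measures how many test rows also occur in the training split. It is only an illustration under stated assumptions: the data is a synthetic stand-in (20 placeholder strings, each repeated 49 times to give 980 rows, which mirrors the counts above but is not the real dataset), and the split is a plain shuffled 80/20 cut rather than `train_test_split`.

```python
import random

# Synthetic stand-in (assumption): 20 distinct texts, each repeated 49 times = 980 rows
texts = [f"review {i}" for i in range(20)] * 49

# Shuffle reproducibly, then cut 80/20 as in the notebook
rng = random.Random(42)
rng.shuffle(texts)
split = int(0.8 * len(texts))
train, test = texts[:split], texts[split:]

# Count test rows whose exact text was already seen during training
train_set = set(train)
leaked = sum(t in train_set for t in test)
leak_rate = leaked / len(test)
print(f"{leaked}/{len(test)} test rows also occur in the training set ({leak_rate:.0%})")
```

When nearly every test row has an identical copy in the training set, a high test score mostly shows that the model memorised those texts, not that it generalises to new ones.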

In [62]:
import re

def clean_text(text):
    # Remove everything except ASCII letters and whitespace (digits, punctuation, emoticons)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the cleaning function to the 'text' column
df['text'] = df['text'].apply(clean_text)
df.head()

Unnamed: 0,text,sentiment
0,Horrible The worst experience ever Do not buy,0
1,Terrible service I wont buy from here again,0
2,I had high hopes but it broke after a week,0
3,Product is okay but packaging was awful,0
4,Good quality but a bit expensive Worth it though,0


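**Justification:** Digits, punctuation, and other non-alphabetic characters mostly add noise to a word-count representation, so the regex keeps only ASCII letters and whitespace. Note that this also strips apostrophes ("won't" becomes "wont") and emoticons such as ":-/", which can carry sentiment.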
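As a small illustration of what the regex does to contractions, the sketch below reruns the same two substitutions on one of the sample reviews and compares them with a variant that expands negated contractions first. The helper `clean_text_expand_negation` is hypothetical and not part of the notebook; it is one possible way to keep negation visible, not a recommendation from the assignment.

```python
import re

def clean_text(text):
    # Same two steps as the notebook: keep letters and whitespace, collapse spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def clean_text_expand_negation(text):
    # Hypothetical variant: expand negated contractions before stripping punctuation
    text = re.sub(r"\bwon't\b", "will not", text, flags=re.IGNORECASE)
    text = re.sub(r"\bcan't\b", "can not", text, flags=re.IGNORECASE)
    text = re.sub(r"n't\b", " not", text)
    return clean_text(text)

sample = "Terrible service!! I won't buy from here again. :-/"
cleaned = clean_text(sample)
expanded = clean_text_expand_negation(sample)
print(cleaned)   # apostrophe removed, leaving the non-word "wont"
print(expanded)  # negation kept as a separate token
```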

In [63]:
df['text'] = df['text'].str.lower()
df.head()

Unnamed: 0,text,sentiment
0,horrible the worst experience ever do not buy,0
1,terrible service i wont buy from here again,0
2,i had high hopes but it broke after a week,0
3,product is okay but packaging was awful,0
4,good quality but a bit expensive worth it though,0


**Justification:** Lowercasing ensures that words like "Happy" and "happy" are treated as the same token.



In [64]:
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
import nltk
nltk.download('stopwords')

# Get English stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])

# Apply the stopwords removal function
df['text'] = df['text'].apply(remove_stopwords)
df.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rajprasadshrestha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,text,sentiment
0,horrible worst experience ever buy,0
1,terrible service wont buy,0
2,high hopes broke week,0
3,product okay packaging awful,0
4,good quality bit expensive worth though,0


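**Justification:** Stopwords such as "the" and "is" carry little sentiment on their own, so removing them reduces dimensionality. One caveat: NLTK's English list also contains negations such as "not", so "do not buy" is reduced to "buy" in row 0.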
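The sketch below shows one way to keep negation words while still removing other stopwords. It is a sketch under assumptions: `stop_words` is a small hand-picked subset standing in for `stopwords.words('english')` so the example runs without NLTK data, and the `negations` keep-list is hypothetical.

```python
# Illustrative subset standing in for NLTK's English stopword list (assumption)
stop_words = {"the", "do", "not", "i", "from", "here", "again", "is", "a"}

# Hypothetical keep-list of negation words to exclude from removal
negations = {"not", "no", "nor"}

def remove_stopwords(text, stop_words):
    # Keep only the whitespace-separated tokens that are not stopwords
    return ' '.join(word for word in text.split() if word not in stop_words)

text = "horrible the worst experience ever do not buy"
dropped = remove_stopwords(text, stop_words)             # negation removed
kept = remove_stopwords(text, stop_words - negations)    # negation preserved
print(dropped)
print(kept)
```

Whether keeping negations helps depends on the data; with unigram features, "not" is only a weak signal unless n-grams are also used.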



In [65]:
from nltk.stem import WordNetLemmatizer

# Download WordNet if not already downloaded
nltk.download('wordnet')

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

# Apply the lemmatization function
df['text'] = df['text'].apply(lemmatize_text)

df.head()

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/rajprasadshrestha/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,text,sentiment
0,horrible worst experience ever buy,0
1,terrible service wont buy,0
2,high hope broke week,0
3,product okay packaging awful,0
4,good quality bit expensive worth though,0


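**Justification:** Lemmatization reduces words to their base forms so that different inflections of the same word map to one token. `WordNetLemmatizer.lemmatize` treats each word as a noun unless a part-of-speech tag is passed, which is why "hopes" becomes "hope" while the verb "broke" is left unchanged.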

**Summary of Preprocessing Steps**

- Handled Missing Values: Dropped rows with missing values in the text or sentiment columns to ensure data quality.

- Removed Non-Alphabetic Characters: Used regex to remove digits, punctuation, and other special characters, and to collapse extra spaces.

- Converted Text to Lowercase: Ensured uniformity in the text data.

- Removed Stopwords: Eliminated common words that do not contribute to sentiment analysis.

- Performed Lemmatization: Normalized words to their base forms for consistency.

**2. Feature Engineering using NLP Techniques (20 Marks)**

- Implement Bag of Words (BoW) for text representation

In [66]:
#Implement bag of words in df[text]

# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
bagofword_vectorizer = CountVectorizer()

# Fit and transform the text data
X_bagofword = bagofword_vectorizer.fit_transform(df['text'])

print(f"Shape of the bag of words matrix: {X_bagofword.shape}")


Shape of the bag of words matrix: (980, 76)
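To show what the matrix contains, here is a minimal pure-Python sketch of the Bag of Words idea: build a sorted vocabulary, then count each vocabulary word per document. The first two documents are cleaned rows from the output above; the third is a made-up example added to show a count greater than 1. This is an illustration, not how `CountVectorizer` is implemented (its default tokenizer, for instance, also ignores single-character tokens).

```python
from collections import Counter

docs = [
    "product okay packaging awful",   # from the cleaned data
    "terrible service wont buy",      # from the cleaned data
    "awful service awful product",    # hypothetical example
]

# Vocabulary: every distinct token, sorted alphabetically (one column per word)
vocab = sorted({word for doc in docs for word in doc.split()})

# Document-term matrix: one row per document, raw counts per vocabulary word
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```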


- Implement TF-IDF for text representation.


In [67]:
#implement TF-IDF in df[text]
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the text data
X_tfidf = tfidf_vectorizer.fit_transform(df['text'])
print(f"Shape of the TF-IDF matrix: {X_tfidf.shape}")


Shape of the TF-IDF matrix: (980, 76)
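The weights can be reproduced by hand on a toy corpus. The sketch below follows the formula scikit-learn documents for the default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. The third document is a made-up example, and the code is a simplified illustration rather than the library's implementation.

```python
import math
from collections import Counter

docs = [
    "product okay packaging awful",   # from the cleaned data
    "terrible service wont buy",      # from the cleaned data
    "awful service awful product",    # hypothetical example
]
n = len(docs)
vocab = sorted({word for doc in docs for word in doc.split()})

# Document frequency: number of documents that contain each word
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

# Smoothed inverse document frequency
idf = {word: math.log((1 + n) / (1 + doc_freq[word])) + 1 for word in vocab}

def tfidf_row(doc):
    # Term frequency times idf, then scale the row to unit L2 length
    tf = Counter(doc.split())
    raw = [tf[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

rows = [tfidf_row(doc) for doc in docs]
for word in ("awful", "okay"):
    print(f"idf[{word}] = {idf[word]:.4f}")
```

Words that occur in fewer documents ("okay") receive a larger idf than words shared across documents ("awful"), which is the down-weighting of common words that distinguishes TF-IDF from raw counts.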


**Comparison of TF-IDF and Bag of Words**

In [68]:
# Convert the sparse matrix to a DataFrame
df_bagofword = pd.DataFrame(X_bagofword.toarray(), columns=bagofword_vectorizer.get_feature_names_out())
print(df_bagofword)

     absolutely  advertised  amazing  arrived  away  awful  best  better  bit  \
0             0           0        0        0     0      0     0       0    0   
1             0           0        0        0     0      0     0       0    0   
2             0           0        0        0     0      0     0       0    0   
3             0           0        0        0     0      1     0       0    0   
4             0           0        0        0     0      0     0       0    1   
..          ...         ...      ...      ...   ...    ...   ...     ...  ...   
975           0           0        0        0     0      0     0       0    0   
976           0           0        0        0     0      0     0       0    0   
977           0           0        0        0     0      0     0       0    0   
978           1           0        0        0     0      0     0       0    0   
979           0           0        0        0     0      0     0       0    0   

     broke  ...  time  took

In [69]:
# Use the TF-IDF vectorizer's own feature names for the column labels
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print(df_tfidf)

     absolutely  advertised  amazing  arrived  away     awful  best  better  \
0      0.000000         0.0      0.0      0.0   0.0  0.000000   0.0     0.0   
1      0.000000         0.0      0.0      0.0   0.0  0.000000   0.0     0.0   
2      0.000000         0.0      0.0      0.0   0.0  0.000000   0.0     0.0   
3      0.000000         0.0      0.0      0.0   0.0  0.551071   0.0     0.0   
4      0.000000         0.0      0.0      0.0   0.0  0.000000   0.0     0.0   
..          ...         ...      ...      ...   ...       ...   ...     ...   
975    0.000000         0.0      0.0      0.0   0.0  0.000000   0.0     0.0   
976    0.000000         0.0      0.0      0.0   0.0  0.000000   0.0     0.0   
977    0.000000         0.0      0.0      0.0   0.0  0.000000   0.0     0.0   
978    0.370199         0.0      0.0      0.0   0.0  0.000000   0.0     0.0   
979    0.000000         0.0      0.0      0.0   0.0  0.000000   0.0     0.0   

          bit  broke  ...  time  took  trust  week 

| **Aspect**           | **Bag of Words (BoW)**                                              | **TF-IDF**                                                             |
|-----------------------|---------------------------------------------------------------------|-------------------------------------------------------------------------|
| **Values**           | Raw word counts (e.g., 1, 2, etc.).                                 | Weighted scores based on term frequency and inverse document frequency. |
| **Focus**            | Focuses on word frequency.                                          | Focuses on word importance in a document relative to the corpus.        |
| **Common Words**     | Common words may dominate unless stopwords are removed.             | Common words are down-weighted automatically.                           |
| **Interpretability** | Easier to interpret as it directly represents counts.               | Harder to interpret due to weighted values.                             |
| **Use Case**         | Suitable for simple models or when frequency is sufficient.         | Suitable for tasks where word relevance matters more.                   |

**3. Model Training & Evaluation (30 Marks)**

<center><b><i>XGBoost</i></b></center>

- Split the dataset into training and testing sets (80/20) for both Bag of Words and TF-IDF.

In [70]:

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train_bagofword, X_test_bagofword, y_train_bagofword, y_test_bagofword = train_test_split(
    X_bagofword, df['sentiment'],
    test_size=0.2, random_state=42, shuffle=True, stratify=df['sentiment'],
)
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(
    X_tfidf, df['sentiment'],
    test_size=0.2, random_state=42, shuffle=True, stratify=df['sentiment'],
)
print(f"Training set size (Bag of Words): {X_train_bagofword.shape}")
print(f"Testing set size (Bag of Words): {X_test_bagofword.shape}")
print(f"Training set size (TF-IDF): {X_train_tfidf.shape}")
print(f"Testing set size (TF-IDF): {X_test_tfidf.shape}")

Training set size (Bag of Words): (784, 76)
Testing set size (Bag of Words): (196, 76)
Training set size (TF-IDF): (784, 76)
Testing set size (TF-IDF): (196, 76)


In [71]:
# Train a model using Bag of Words using XGBoost
from xgboost import XGBClassifier

# Initialize the model
# Note: recent XGBoost releases ignore use_label_encoder (hence the warning in the output below)
model_bagofword = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')

# Train the model using the bag of words features
model_bagofword.fit(X_train_bagofword, y_train_bagofword)

# Make predictions on the test set of bag of words
y_pred_bagofword = model_bagofword.predict(X_test_bagofword)

#Evaluate models using accuracy, precision, recall, and F1-score.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Calculate metrics
accuracy_bagofword = accuracy_score(y_test_bagofword, y_pred_bagofword)
precision_bagofword = precision_score(y_test_bagofword, y_pred_bagofword, average='weighted')
recall_bagofword = recall_score(y_test_bagofword, y_pred_bagofword, average='weighted')
f1_bagofword = f1_score(y_test_bagofword, y_pred_bagofword, average='weighted')

# Print metrics in a formatted table
print("Evaluation Metrics for Bag of Words Model:")
print(f"{'Metric':<10} {'Score':<10}")
print("-" * 20)
print(f"{'Accuracy':<10} {accuracy_bagofword:.4f}")
print(f"{'Precision':<10} {precision_bagofword:.4f}")
print(f"{'Recall':<10} {recall_bagofword:.4f}")
print(f"{'F1 Score':<10} {f1_bagofword:.4f}")


Evaluation Metrics for Bag of Words Model:
Metric     Score     
--------------------
Accuracy   1.0000
Precision  1.0000
Recall     1.0000
F1 Score   1.0000


Parameters: { "use_label_encoder" } are not used.



In [72]:
# Train a model using TF-IDF using XGBoost

# Initialize the model
# Note: recent XGBoost releases ignore use_label_encoder (hence the warning in the output below)
model_tfidf = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')

# Train the model using the TF-IDF features
model_tfidf.fit(X_train_tfidf, y_train_tfidf)

# Make predictions on the test set of TF-IDF
y_pred_tfidf = model_tfidf.predict(X_test_tfidf)

# Calculate metrics
accuracy_tfidf = accuracy_score(y_test_tfidf, y_pred_tfidf)
precision_tfidf = precision_score(y_test_tfidf, y_pred_tfidf, average='weighted')
recall_tfidf = recall_score(y_test_tfidf, y_pred_tfidf, average='weighted')
f1_tfidf = f1_score(y_test_tfidf, y_pred_tfidf, average='weighted')

# Print metrics in a formatted table
print("Evaluation Metrics for TF-IDF Model:")
print(f"{'Metric':<10} {'Score':<10}")
print("-" * 20)
print(f"{'Accuracy':<10} {accuracy_tfidf:.4f}")
print(f"{'Precision':<10} {precision_tfidf:.4f}")
print(f"{'Recall':<10} {recall_tfidf:.4f}")
print(f"{'F1 Score':<10} {f1_tfidf:.4f}")


Evaluation Metrics for TF-IDF Model:
Metric     Score     
--------------------
Accuracy   1.0000
Precision  1.0000
Recall     1.0000
F1 Score   1.0000


Parameters: { "use_label_encoder" } are not used.
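Both evaluation cells pass `average='weighted'` to the precision, recall, and F1 functions. The sketch below shows what that averaging does: each metric is computed per class and then averaged with weights proportional to the number of true samples in each class. The labels are made-up toy values, and `weighted_scores` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    # Per-class precision, recall and F1, averaged by class support
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = support[c] / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1

# Toy labels (assumption), not taken from the notebook's test set
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]
precision, recall, f1 = weighted_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

Support-weighted recall always equals accuracy, so with this averaging the recall row repeats the accuracy row and adds no new information.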



**Prediction on unseen data**

In [73]:
# Create unseen text for prediction
unseen_text = ["This is a great product! I love it.", "I am not satisfied with the service."]

# Clean the unseen text
unseen_text_cleaned = [clean_text(text) for text in unseen_text]

# Convert the cleaned text to lowercase
unseen_text_cleaned = [text.lower() for text in unseen_text_cleaned]

# Remove stopwords from the unseen text
unseen_text_cleaned = [remove_stopwords(text) for text in unseen_text_cleaned]

# Lemmatize the unseen text
unseen_text_cleaned = [lemmatize_text(text) for text in unseen_text_cleaned]


# Convert the cleaned text to Bag of Words features
X_unseen_bagofword = bagofword_vectorizer.transform(unseen_text_cleaned)
# Convert the cleaned text to TF-IDF features
X_unseen_tfidf = tfidf_vectorizer.transform(unseen_text_cleaned)

# Make predictions using the Bag of Words model
y_pred_unseen_bagofword = model_bagofword.predict(X_unseen_bagofword)
# Make predictions using the TF-IDF model
y_pred_unseen_tfidf = model_tfidf.predict(X_unseen_tfidf)
# Print the predictions
print("Predictions for unseen text using Bag of Words model:")
for text, pred in zip(unseen_text, y_pred_unseen_bagofword):
    print(f"Text: {text} | Predicted Sentiment: {pred}")
print("Predictions for unseen text using TF-IDF model:")
for text, pred in zip(unseen_text, y_pred_unseen_tfidf):
    print(f"Text: {text} | Predicted Sentiment: {pred}")
    

Predictions for unseen text using Bag of Words model:
Text: This is a great product! I love it. | Predicted Sentiment: 1
Text: I am not satisfied with the service. | Predicted Sentiment: 0
Predictions for unseen text using TF-IDF model:
Text: This is a great product! I love it. | Predicted Sentiment: 1
Text: I am not satisfied with the service. | Predicted Sentiment: 0
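The cell above repeats each preprocessing step by hand for the unseen text. One way to keep training and inference in sync is to chain the steps in a single function, sketched below. This is a simplified illustration under assumptions: `preprocess` and `STOP_WORDS` are hypothetical names, the stopword set is a small subset standing in for NLTK's list, and lemmatization is left out so the example runs without NLTK data.

```python
import re
from functools import reduce

# Illustrative subset standing in for NLTK's English stopword list (assumption)
STOP_WORDS = {"this", "is", "a", "i", "it", "am", "not", "with", "the"}

def clean_text(text):
    # Keep letters and whitespace, then collapse repeated spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def remove_stopwords(text):
    return ' '.join(word for word in text.split() if word not in STOP_WORDS)

def preprocess(text, steps=(clean_text, str.lower, remove_stopwords)):
    # Apply each step in order, feeding the result of one into the next
    return reduce(lambda acc, step: step(acc), steps, text)

unseen_text = ["This is a great product! I love it.", "I am not satisfied with the service."]
processed = [preprocess(text) for text in unseen_text]
print(processed)
```

In this sketch the second sentence loses its "not" during stopword removal, so the features passed to the model no longer contain the negation.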


Discuss results and select the best-performing model.