#### Week 10 Exercise 10.2

Author: Rex Gayas

Course & Section: DSC360-T301 Data Mining: Text Analytics an (2243-1)

Date: 18 FEB 2024

#### Loading the Dataset and Checking Its Structure

In [24]:
import pandas as pd

# File path of Hotel Reviews dataset
file_path = 'D:\\ALPHA\\Dynamic Folder\\Bellevue\\Winter 2023\\Data Mining\\Week 10\\archive\\hotel-reviews.csv'
hotel_reviews = pd.read_csv(file_path)

# Display the first few rows to confirm structure of dataset
hotel_reviews.head()



,User_ID,Description,Browser_Used,Device_Used,Is_Response
0,id10326,The room was kind of clean but had a VERY stro...,Edge,Mobile,not happy
1,id10327,I stayed at the Crown Plaza April -- - April -...,Internet Explorer,Mobile,not happy
2,id10328,I booked this hotel through Hotwire at the low...,Mozilla,Tablet,not happy
3,id10329,Stayed here with husband and sons on the way t...,InternetExplorer,Desktop,happy
4,id10330,My girlfriends and I stayed here to celebrate ...,Edge,Tablet,not happy


In [25]:
# Display the column names of the DataFrame
hotel_reviews.columns


Index(['User_ID', 'Description', 'Browser_Used', 'Device_Used', 'Is_Response'], dtype='object')

#### Preprocess the Text

This step cleans the text data to improve the performance of the sentiment analysis model. The normalizer strips HTML tags, drops accented (non-ASCII) characters, converts the text to lowercase, replaces newlines with spaces, removes special characters and digits, collapses extra whitespace, and removes a small set of stopwords. These steps standardize the text and reduce noise in the dataset.

In [27]:
import re
import pandas as pd

# Define the TextNormalizer class used from Week 4
class TextNormalizer:
    def __init__(self):
        self.stop_words = set(['a', 'an', 'the', 'and', 'or', 'for', 'to', 'of', 'in', 'on', 'at', 'by', 'up', 'out', 'as'])

    def strip_html_tags(self, text):
        pattern = re.compile('<.*?>')
        return re.sub(pattern, '', text)

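    def remove_accented_chars(self, text):
        # Note: this drops non-ASCII characters entirely (e.g. 'café' becomes 'caf')
        # rather than replacing them with an unaccented equivalent
        return text.encode('ascii', 'ignore').decode('ascii')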

    def text_to_lower(self, text):
        return text.lower()

    def remove_extra_newlines(self, text):
        return re.sub(r'\r\n|\r|\n', ' ', text)

    def remove_special_characters_and_digits(self, text, remove_digits=True):
        # Keep only letters and whitespace when remove_digits is True; otherwise keep digits too
        pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
        return re.sub(pattern, '', text)

    def remove_extra_whitespace(self, text):
        return re.sub(' +', ' ', text.strip())

    def remove_stopwords(self, text):
        return ' '.join([word for word in text.split() if word not in self.stop_words])

    def normalize_corpus(self, corpus):
        normalized_corpus = []
        for doc in corpus:
            doc = self.strip_html_tags(doc)
            doc = self.remove_accented_chars(doc)
            doc = self.text_to_lower(doc)
            doc = self.remove_extra_newlines(doc)
            doc = self.remove_special_characters_and_digits(doc)
            doc = self.remove_extra_whitespace(doc)
            doc = self.remove_stopwords(doc)
            normalized_corpus.append(doc)
        return normalized_corpus

# Instantiate the text normalizer
text_normalizer = TextNormalizer()

# Normalize the review text
hotel_reviews['Cleaned_Description'] = text_normalizer.normalize_corpus(hotel_reviews['Description'])

# Display the first few rows of the cleaned text
hotel_reviews[['Description', 'Cleaned_Description']].head()


,Description,Cleaned_Description
0,The room was kind of clean but had a VERY stro...,room was kind clean but had very strong smell ...
1,I stayed at the Crown Plaza April -- - April -...,i stayed crown plaza april april staff was fri...
2,I booked this hotel through Hotwire at the low...,i booked this hotel through hotwire lowest pri...
3,Stayed here with husband and sons on the way t...,stayed here with husband sons way alaska cruis...
4,My girlfriends and I stayed here to celebrate ...,my girlfriends i stayed here celebrate our th ...
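
One detail of the normalizer worth illustrating: encoding to ASCII with `errors='ignore'` deletes accented letters instead of replacing them with their base letter. The sketch below shows one possible alternative using the standard library's `unicodedata` module. The function name `strip_accents` is hypothetical and is not part of the Week 4 class.

```python
import unicodedata

def strip_accents(text):
    """Replace accented letters with their closest ASCII base letter."""
    # NFKD decomposition splits an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors='ignore' then discards only the combining marks
    return decomposed.encode('ascii', 'ignore').decode('ascii')

sample = 'caf\u00e9'  # the word "café" with a precomposed accented e

print(strip_accents(sample))                              # cafe
print(sample.encode('ascii', 'ignore').decode('ascii'))   # caf
```

Whether this matters for the hotel reviews depends on how many accented characters the dataset actually contains, which is not checked here.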


#### Label Encoding

This step converts the categorical text labels into a numerical format, because subsequent models require numeric input.

In [28]:
# Encode 'Is_Response' column: 'happy' as 0 and 'not happy' as 1
hotel_reviews['Is_Response_Encoded'] = hotel_reviews['Is_Response'].apply(lambda x: 0 if x == 'happy' else 1)

# Display the first few rows to check the encoded column
hotel_reviews[['Is_Response', 'Is_Response_Encoded']].head()


,Is_Response,Is_Response_Encoded
0,not happy,1
1,not happy,1
2,not happy,1
3,happy,0
4,not happy,1
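
The lambda above maps every value other than 'happy' to 1, so a missing or misspelled label would silently count as 'not happy'. A stricter variant, sketched here in plain Python, uses an explicit mapping and rejects unexpected values. The names `LABEL_MAP` and `encode_label` are hypothetical.

```python
# Explicit mapping that matches the encoding used above: 'happy' -> 0, 'not happy' -> 1
LABEL_MAP = {'happy': 0, 'not happy': 1}

def encode_label(label):
    """Return the numeric code for a label, rejecting anything unexpected."""
    try:
        return LABEL_MAP[label]
    except KeyError:
        raise ValueError(f'Unexpected label: {label!r}') from None

print([encode_label(x) for x in ['not happy', 'happy']])  # [1, 0]
```

On the DataFrame itself, `hotel_reviews['Is_Response'].map(LABEL_MAP)` would apply the same mapping and leave NaN for any label that is not in the dictionary, which makes such rows easy to find.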


#### Data Splitting

To evaluate the performance of the sentiment analysis model, we need to split our dataset into training, validation, and testing sets. This allows us to train our model on one subset of the data and test its performance on another, unseen subset, providing an estimate of performance on an independent dataset.

In [29]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
# Save 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    hotel_reviews['Cleaned_Description'],  # the features used for training a.k.a the cleaned reviews
    hotel_reviews['Is_Response_Encoded'],  # the target variable or the label encoded 'Is_Response'
    test_size=0.2,  # specifies the proportion of the dataset to include in the test split
    random_state=42  # a seed for the random number generator for reproducible results
)

# Then split the training data into a training set and a validation set
# Taking 25% of the original training set to create the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train,  # the remaining features for training
    y_train,  # the remaining target variable for training
    test_size=0.25,  # specifies the proportion of the training dataset to include in the validation split
    random_state=42  # a seed for the random number generator for reproducible results
)

# It should be noted that X_train and y_train are for training, X_val and y_val are for validation, 
# and X_test and y_test are for final testing.

# Confirming the size of each dataset
print(f'Training set: {X_train.shape[0]} samples')
print(f'Validation set: {X_val.shape[0]} samples')
print(f'Test set: {X_test.shape[0]} samples')


Training set: 23358 samples
Validation set: 7787 samples
Test set: 7787 samples
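
The two-stage split yields roughly 60% training, 20% validation, and 20% test data, because 25% of the remaining 80% is 20% of the whole. The arithmetic can be checked with the short sketch below. It assumes that a fractional `test_size` is converted to a count by rounding up, which is consistent with the sizes printed above. The helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the held-out count up."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test

total = 23358 + 7787 + 7787  # total number of reviews implied by the printed sizes

# First split: hold out 20% for testing
n_train_full, n_test = split_sizes(total, 0.2)
# Second split: hold out 25% of what remains for validation
n_train, n_val = split_sizes(n_train_full, 0.25)

print(n_train, n_val, n_test)  # 23358 7787 7787
print(round(n_train / total, 3), round(n_val / total, 3), round(n_test / total, 3))  # 0.6 0.2 0.2
```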


#### Sentiment Analysis with AFINN

From the text, the AFINN lexicon is a list of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words were manually labeled by Finn Årup Nielsen between 2009 and 2011. We will use the AFINN lexicon to score the sentiment of the hotel reviews.
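
To make the idea of lexicon scoring concrete, here is a minimal sketch that sums word valences over a review. The word list and its values are made up for illustration and are not the real AFINN lexicon; `TOY_LEXICON` and `toy_lexicon_score` are hypothetical names, and the actual `afinn` package may tokenize and match words differently.

```python
# Toy lexicon for illustration only; these valences are assumptions,
# not a copy of the real AFINN word list
TOY_LEXICON = {'excellent': 3, 'nice': 3, 'clean': 2, 'dirty': -2, 'terrible': -3}

def toy_lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every token found in the lexicon; unknown words count as 0."""
    return sum(lexicon.get(token, 0) for token in text.lower().split())

print(toy_lexicon_score('room was clean but microwave was dirty'))  # 0
print(toy_lexicon_score('excellent staff and nice lobby'))          # 6
```

The first example shows how a mixed review can cancel out to zero, which is one reason the choice of threshold for converting scores into labels matters.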

In [30]:
from afinn import Afinn

# Initialize Afinn sentiment analyzer
afinn = Afinn()

# Define a function to apply Afinn and calculate sentiment scores
def calculate_sentiment_scores(reviews):
    sentiment_scores = [afinn.score(review) for review in reviews]
    return sentiment_scores

# Calculate sentiment scores for each dataset
train_sentiments = calculate_sentiment_scores(X_train)
val_sentiments = calculate_sentiment_scores(X_val)
test_sentiments = calculate_sentiment_scores(X_test)

# Convert sentiment scores to binary labels: 1 if the score is positive (> 0), otherwise 0
# Caution: 'Is_Response_Encoded' uses 1 for 'not happy', so a positive score is labeled 'not happy' here
y_train_pred = [1 if score > 0 else 0 for score in train_sentiments]
y_val_pred = [1 if score > 0 else 0 for score in val_sentiments]
y_test_pred = [1 if score > 0 else 0 for score in test_sentiments]


In [31]:
# Print out the first 5 sentiment scores and their corresponding binary labels from the training set
print("Sample sentiment scores and their corresponding binary labels from the training set:")
for review, score, label in zip(X_train[:5], train_sentiments[:5], y_train_pred[:5]):
    print(f"Review: {review}, Sentiment score: {score}, Binary label: {label}")

Sample sentiment scores and their corresponding binary labels from the training set:
Review: i stayed here with my dog during one snow storms hotel did excellant job keeping sidewalk clear they were middle renovating lobby which had fireplace several nice seating areas there were computers tv with seating area it was nice they let my lb dog stay there but gave me doggy bag that looked like it had been used previously i cleaned after my dog outside but there were no trash cans anywhere outside hotel me dispose his poop electronic lock my door didnt work i stopped front desk let them know hours later no one had come fix it which meant if i left my room it would not be locked i called front desk again they arrived mins later fix it seems it just needed new battery cleanliness side my room was clean room did not have refridgerator microwave but there was shared microwave near vending machines it was dirty with something spilled all over inside it days also room that held vending machines m

#### Model Evaluation

After scoring the sentiment of the reviews using AFINN, we evaluate the model's performance by comparing the predicted binary labels to the actual labels in the test set. We use accuracy, precision, recall, and F1-score as our metrics.
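
As a reminder of what these four metrics measure, the sketch below computes them from the counts of true positives, false positives, false negatives, and true negatives. The labels are made up for illustration, and `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Made-up labels for illustration (1 = 'not happy', 0 = 'happy')
y_true_demo = [1, 1, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 0, 1]

print(binary_metrics(y_true_demo, y_pred_demo))
```

Here the positive class is 1, which corresponds to 'not happy' under the encoding used in this notebook, so precision and recall describe how well 'not happy' reviews are identified.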

In [32]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate performance metrics on the test set
accuracy = accuracy_score(y_test, y_test_pred)
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)
f1 = f1_score(y_test, y_test_pred)

# Print the performance metrics
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')


Accuracy: 0.2312
Precision: 0.2456
Recall: 0.6723
F1 Score: 0.3597


#### Observations and Analysis

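The high recall and low precision indicate that the model is over-predicting the “not happy” class. In other words, it catches most “not happy” reviews at the cost of incorrectly classifying many “happy” reviews as “not happy”. Note that the prediction step assigns 1 to reviews with a positive AFINN score, while the label encoding uses 1 for “not happy”, so positive-scoring reviews are predicted as “not happy”. An accuracy of 0.2312, well below chance for a binary task, is consistent with this inverted mapping rather than with the lexicon alone.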

The AFINN lexicon model uses a fixed list of words with assigned sentiment scores. It cannot capture the sentiment of words that are not in the lexicon, nor can it interpret the context in which words are used. A context-aware approach, such as a machine learning model trained on a large labeled dataset, may capture sentiment better. A lexicon model also does not improve over time or adapt to new expressions of sentiment. Hence, training a supervised machine learning model on a labeled dataset, or using a deep learning model (covered in Week 11), may yield predictions that improve as more labeled data becomes available.
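
The label convention also matters when scores are turned into predictions. With 0 for 'happy' and 1 for 'not happy', a positive score should map to 0. The sketch below shows a mapping aligned with that encoding and compares it with its inverse on made-up data. The names `scores_to_labels` and `simple_accuracy` are hypothetical.

```python
def scores_to_labels(scores):
    """Map sentiment scores to the encoding 0 = 'happy', 1 = 'not happy'."""
    # A positive score is read as 'happy' (0); zero or negative as 'not happy' (1)
    return [0 if score > 0 else 1 for score in scores]

def simple_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up scores and labels for illustration only
demo_scores = [4.0, -2.0, 0.0, 7.0, 1.0]
demo_truth = [0, 1, 1, 0, 1]

aligned = scores_to_labels(demo_scores)
inverted = [1 - label for label in aligned]  # assigns 1 to positive scores instead

print(simple_accuracy(demo_truth, aligned), simple_accuracy(demo_truth, inverted))  # 0.8 0.2
```

For binary labels, the accuracies of a prediction and its inverse always sum to 1, so flipping the mapping on the reported test predictions would turn 0.2312 into 1 - 0.2312 = 0.7688. Precision, recall, and F1 would need to be recomputed from the predictions, because they do not flip in such a simple way.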
