# Task and Data Analysis

#### Overview of the Data

The dataset consists of product reviews from Amazon, segmented into features with annotated sentiment scores. Reviews include multiple aspects such as ease of use, picture quality, and additional functionalities, each tagged with a corresponding sentiment score. These annotations provide a rich basis for sentiment analysis but require precise parsing to effectively utilize the structured format in which they are presented.

#### System-Level Outline

##### Data Parsing and Pre-processing

Using a custom `read_file` function, the system initially parses the structured reviews from text files. This function handles the dataset's specific format, which includes initial metadata and reviewer comments split by a unique delimiter ('##'). This is crucial for separating feature tags from review content, facilitating subsequent analysis.

The `pre_process_review` function refines this further: it detects title lines (marked `[t]`) and appends each title to the review line that follows it, so that the context provided by review headers is not lost. All other lines are passed through unchanged and in their original order.

##### Enhancing NLP with Custom Processing

To deepen the analysis, the `preserve_compound_phrases` function is employed. This function utilizes an NLP model to identify and preserve compound nouns and adjectives directly linked to nouns, which are often critical in understanding the specific features discussed. By preserving these compounds, the system maintains the granular detail necessary for precise feature extraction.

Following this, the `chunking_post_process` function splits the rejoined text on whitespace and joins it back into a single normalized string, keeping each underscore-joined compound phrase as one token.

##### Comprehensive Review Filtering

The `pre_processing_controller` function orchestrates the entire preprocessing pipeline. It transforms raw review texts into a tokenized format, applies compound preservation, and executes two levels of filtering, both of which lowercase tokens and drop non-alphabetic ones: soft filtering (which keeps stop words) and hard filtering (which also removes stop words). This dual approach allows for flexibility in analysis, from high-level sentiment trends to detailed feature-specific sentiments.

##### Sentiment Analysis and Feature Extraction

Once preprocessed, the data is ripe for sentiment analysis. Leveraging the structured format of feature tags and sentiment scores, the system can map sentiments directly to product features, allowing for an aggregated sentiment score for each feature. This quantification is pivotal in determining which features are most appreciated or criticized by users.

##### Leveraging Data for Business Insights

The final step involves synthesizing the analyzed data into actionable business insights. By understanding which features correlate strongly with positive or negative sentiments, companies can prioritize product improvements or highlight successful aspects in marketing strategies.

In [87]:
import pandas as pd
import numpy as np
import nltk
import spacy
import gensim.downloader as api
import copy

from nltk.tokenize import word_tokenize
from nltk import pos_tag
from collections import Counter
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import stopwords
from gensim.models.word2vec import Word2Vec

nlp = spacy.load("en_core_web_sm")
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
Word2Vec_corpus = api.load('text8') 
Word2Vec_model = Word2Vec(Word2Vec_corpus) 
glove_model = api.load("glove-twitter-25") 

[nltk_data] Downloading package stopwords to /Users/leon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [88]:
files = ['Data/Customer_review_data/Apex AD2600 Progressive-scan DVD player.txt',
         'Data/Customer_review_data/Canon G3.txt',
         'Data/Customer_review_data/Creative Labs Nomad Jukebox Zen Xtra 40GB.txt',
         'Data/Customer_review_data/Nikon coolpix 4300.txt',
         'Data/Customer_review_data/Nokia 6610.txt',
         'Data/CustomerReviews-3_domains/Computer.txt',
         'Data/CustomerReviews-3_domains/Router.txt',
         'Data/CustomerReviews-3_domains/Speaker.txt',
         'Data/Reviews-9-products/Canon PowerShot SD500.txt',
         'Data/Reviews-9-products/Canon S100.txt',
         'Data/Reviews-9-products/Diaper Champ.txt',
         'Data/Reviews-9-products/Hitachi router.txt',
         'Data/Reviews-9-products/ipod.txt',
         'Data/Reviews-9-products/Linksys Router.txt',
         'Data/Reviews-9-products/MicroMP3.txt',
         'Data/Reviews-9-products/Nokia 6600.txt',
         'Data/Reviews-9-products/norton.txt']

# Data Pre-Processing

#### Data Ingestion and Initial Processing

The process begins with reading the data files, each of which may contain many reviews in slightly different layouts. The `read_file` function opens a file and splits its content into lines, one entry per line. Some files begin with a line of 77 asterisks that introduces a header block; when that marker is present, the first 11 lines are skipped so that only review text is processed further.

Reviews within these files are then split using '##' as a delimiter to segregate tags from the main content, which allows for the extraction of embedded metadata or sentiment tags when present. Each piece of the review, along with its associated tags, is stored in a structured format within a pandas DataFrame, facilitating subsequent manipulations and analyses.
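
As a rough illustration of the `##` split described above, the sketch below parses a few lines into a tag list and a content string. The helper name `split_tagged_line` and the sample lines are made up for this example, and the `[+2]` style of tag is an assumption about the annotation format that is not parsed further here.

```python
def split_tagged_line(line):
    """Split one review line into (tags, content) on the first '##'."""
    tags_part, sep, content = line.partition('##')
    if not sep:
        # No delimiter: the whole line is content and there are no tags
        return [], line.strip()
    tags = [t.strip() for t in tags_part.split(',') if t.strip()]
    return tags, content.strip()

# Hypothetical sample lines; the "[+2]" tag syntax is assumed for illustration
samples = [
    "picture quality[+2], size[+1]##great pictures from a small camera .",
    "##i bought this camera last month .",
    "a line without any delimiter",
]
parsed = [split_tagged_line(s) for s in samples]
for tags, content in parsed:
    print(tags, '|', content)
```

Unlike `read_file`, this sketch drops empty tag strings, so an untagged line yields an empty list rather than a list containing one empty string.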

#### Advanced Text Processing Techniques

Once the initial ingestion is complete, the reviews undergo a series of sophisticated text processing steps encapsulated within the `pre_process_review` function. This function is designed to handle various nuances of the text, including title concatenation where necessary and preservation of the textual integrity for reviews that continue across multiple lines.
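
The title handling can be pictured with a small stand-alone sketch. The helper `attach_titles` and the sample lines are hypothetical; the sketch only assumes that a title line starts with the `[t]` marker and belongs to the first review line after it.

```python
def attach_titles(lines, marker='[t]'):
    """Attach each title line to the first review line that follows it."""
    out, pending = [], None
    for line in lines:
        if line.startswith(marker):
            # Remember the title text without the marker
            pending = line[len(marker):].strip()
        elif pending is not None:
            # Append the pending title to this review line, separated by a space
            out.append(f"{line} {pending}".strip())
            pending = None
        else:
            out.append(line)
    return out

# Hypothetical input lines
lines = ["[t] great camera", "##i love it .", "zoom[+1]##the zoom works well ."]
merged = attach_titles(lines)
print(merged)
```

Only the first line after a title receives it; later lines of the same review are left as they are.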

##### Preservation of Semantic Structures

To maintain the semantic integrity of phrases within the reviews, the `preserve_compound_phrases` function is applied. This function utilizes spaCy, an advanced NLP library, to parse the text and identify compound nouns and adjectival modifiers, which are crucial for understanding the context and sentiment related to specific product features. These phrases are then reconstructed with underscores replacing spaces to ensure they are treated as single tokens in subsequent analyses, preventing the loss of their semantic unity.
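
To make the underscore joining concrete without loading a parser, the sketch below works on hand-written dependency tuples that stand in for spaCy's output. The tuple layout and the helper `join_compounds` are assumptions made for this example; matching heads by their text is a simplification that would be ambiguous if a word occurred twice in a sentence.

```python
# Each tuple is (text, dependency label, head text, head part of speech).
# These are hand-written stand-ins for what a dependency parser would return.
parsed = [
    ("the", "det", "life", "NOUN"),
    ("battery", "compound", "life", "NOUN"),
    ("life", "nsubj", "is", "AUX"),
    ("is", "ROOT", "is", "AUX"),
    ("great", "acomp", "is", "AUX"),
]

def join_compounds(parsed_tokens):
    # Heads with a 'compound' child are emitted only inside the joined token
    compound_heads = {head for _, dep, head, _ in parsed_tokens if dep == "compound"}
    out = []
    for text, dep, head, head_pos in parsed_tokens:
        if dep in ("compound", "amod") and head_pos == "NOUN":
            out.append(f"{text}_{head}")
        elif text in compound_heads:
            continue
        else:
            out.append(text)
    return out

tokens = join_compounds(parsed)
print(tokens)
```

Here "battery" and "life" collapse into the single token `battery_life`, while the remaining words pass through unchanged.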

##### Enhancement of Tokenization and Filtering

Post semantic preservation, the `pre_processing_controller` function orchestrates several layers of tokenization and filtering. The text is first tokenized, ensuring that each word or phrase is individually analyzed. This tokenization feeds into a dual filtering process where:
1. A 'Soft Filtered Review' captures tokens that either form part of the identified compound phrases or are standalone alphabetic words.
2. A 'Filtered Review' applies a more stringent filter, additionally removing common stop words to focus on the more meaningful terms relevant to sentiment analysis.

These tokens are then reassembled into strings, which form the basis for further linguistic analysis. The filtered strings are lemmatized with spaCy, reducing each word to its base or dictionary form. Stemming, which strips words more aggressively to their root forms, is not applied in the code below.
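
The difference between the two filters can be seen on a single token list. This is a minimal sketch under stated assumptions: `STOP_WORDS` is a tiny stand-in for NLTK's English stop word list, and the tokens and the helper `keep` are invented for illustration.

```python
# A tiny stand-in stop word list; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "is", "a", "and", "it"}

# Hypothetical tokens, as they might look after compound preservation
tokens = ["The", "battery_life", "is", "great", ",", "and", "it", "lasts", "2", "days"]

def keep(token, drop_stop_words):
    if "_" in token:            # compound phrases always survive
        return True
    if not token.isalpha():     # drop punctuation and numbers
        return False
    return not (drop_stop_words and token.lower() in STOP_WORDS)

soft = [t.lower() for t in tokens if keep(t, drop_stop_words=False)]
hard = [t.lower() for t in tokens if keep(t, drop_stop_words=True)]

print(soft)
print(hard)
```

The soft list keeps function words, which is useful when the surrounding wording matters for sentiment, while the hard list keeps only content-bearing tokens for feature matching.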

In [89]:
def read_file(file_path):
    
    tagged_reviews = []
    
    with open(file_path, 'r') as file:
        text = file.read()
        # Split the text into lines and remove any leading/trailing whitespace
        reviews = text.strip().split('\n')

        # Check if the file starts with a specific marker line of asterisks
        if reviews[0] == '*' * 77:
            # Skip the first 11 lines if the marker is present - This is a quirk to parse the data files
            reviews = reviews[11:]

        reviews = pre_process_review(reviews)
        
        for review in reviews:
            # Split each review on '##' to separate tags from the content
            parts = review.split('##')
            
            # If the split results in more than one part, process tags and content
            if len(parts) > 1:
                tags = parts[0].strip().split(',')
                content = parts[1].strip() 
            else:
                # If no '##' is found, set tags as empty and keep the whole line as a string
                tags = []
                content = parts[0].strip()
                
            # Append a dictionary of tags and review content to the list
            tagged_reviews.append({'Tags': tags, 'Review': content})

        df = pd.DataFrame(tagged_reviews)
        # Store the name of the file as an attribute of the DataFrame
        df.attrs['title'] = file_path.split('/')[-1]

        return df

In [90]:
# Pre-processing functions

def pre_process_review(reviews):
    processed_reviews = []
    title_switch = False  # Indicates whether next review should append a title
    title = ''

    for review in reviews:
        if review.startswith('[t]'):  # Checks for title marker
            title = review[3:]  # Stores the title
            title_switch = True
        elif title_switch:  # Appends title to the review if flag is true
            processed_reviews.append(f"{review} {title.strip()}")  # Space keeps the last word and the title apart
            title_switch = False
            title = ''
        else:
            processed_reviews.append(review)  # Adds review as is if no title is pending

    return processed_reviews



def preserve_compound_phrases(text):
    # Process the text with an NLP model
    doc = nlp(text)
    processed_tokens = []

    for token in doc:
        # Check for compounds or adjectives linked directly to nouns
        if token.dep_ in ('compound', 'amod') and token.head.pos_ == 'NOUN':
            compound_phrase = f"{token.text}_{token.head.text}"
            if compound_phrase not in processed_tokens:
                processed_tokens.append(compound_phrase)
        # Skip nouns that are already part of a compound to prevent duplicates
        elif token.pos_ == 'NOUN' and any(child.dep_ == 'compound' for child in token.children):
            continue
        # Include all other tokens normally
        else:
            processed_tokens.append(token.text)

    return processed_tokens




def chunking_post_process(text):
    # Split the text into individual words
    words = text.split()
    processed_words = []

    i = 0
    while i < len(words):
        # Check if the current word is part of a compound phrase
        if '_' in words[i]:
            # Append all parts of the compound phrase to the list
            while i < len(words) and '_' in words[i]:
                processed_words.append(words[i])
                i += 1
            continue  # Move to the next word after finishing the compound phrase
        # Append non-compound words directly to the list
        processed_words.append(words[i])
        i += 1

    # Return the processed words as a single string
    return ' '.join(processed_words)

In [91]:
def pre_processing_controller(df):
    
    # Convert lists to strings and applies compound phrase preservation
    df['Tokenised_Review'] = df['Review'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)
    df['Tokenised_Review'] = df['Tokenised_Review'].apply(lambda review: preserve_compound_phrases(review))
    
    # Soft filter: keep compound phrases and alphabetic tokens, lowercased; numbers and punctuation are dropped
    df['Soft_Filtered_Review'] = df['Tokenised_Review'].apply(lambda tokens: [token.lower() for token in tokens if ("_" in token) or token.isalpha()])
    
    # Convert lists of tokens back to strings and retains the compound phrases - soft means not to filter out stop words
    df['Soft_Filtered_Review_String'] = df['Soft_Filtered_Review'].apply(lambda tokens: ' '.join(tokens))
    df['Soft_Filtered_Review_String'] = df['Soft_Filtered_Review_String'].apply(chunking_post_process)
    
    # Filters the reviews. Handling compound phrases, capitalisation, numbers and stop words 
    df['Filtered_Review'] = df['Tokenised_Review'].apply(lambda tokens: [token.lower() for token in tokens if ("_" in token) or (token.isalpha() and token.lower() not in stop_words)])
    df['Filtered_Review_String'] = df['Filtered_Review'].apply(lambda tokens: ' '.join(tokens))
    
    # Lemmatise the filtered review strings, then tokenise the lemmatised result
    df['Lemmatised_Review_String'] = df['Filtered_Review_String'].apply(lambda review_string: " ".join([token.lemma_ for token in nlp(review_string)]))
    df['Lemmatised_Tokenised_Filtered_Review'] = df['Lemmatised_Review_String'].apply(lambda review: word_tokenize(review))
    
    return df

# Product Feature Extraction

##### Part-of-Speech (POS) Noun Tagging

The first step in the feature extraction process is the identification of nouns from customer reviews. This is accomplished by the `POS_Noun_Tagging` function, which employs tokenization and POS tagging to sift through the textual data. Nouns are indicative of features, as they often name the components or aspects of a product that customers discuss. By focusing on nouns, the function narrows down the vast amount of information in reviews to specific elements likely relevant to consumer sentiments.
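
The counting step can be sketched on its own with pre-tagged input. The tagged pairs below are written by hand to stand in for a POS tagger's output (Penn Treebank tags, where noun tags start with `NN`); they are not taken from the dataset.

```python
from collections import Counter

# Hand-tagged (word, tag) pairs standing in for a POS tagger's output
tagged_reviews = [
    [("the", "DT"), ("camera", "NN"), ("takes", "VBZ"), ("sharp", "JJ"), ("pictures", "NNS")],
    [("battery", "NN"), ("drains", "VBZ"), ("fast", "RB")],
    [("love", "VBP"), ("this", "DT"), ("camera", "NN")],
]

# Keep every word whose tag marks it as a noun
nouns = [word.lower()
         for review in tagged_reviews
         for word, tag in review
         if tag.startswith("NN")]

# Rank nouns by frequency and keep the two most common
top_nouns = Counter(nouns).most_common(2)
print(top_nouns)
```

The most frequent noun ends up first in the list, which matters later because the pipeline treats that first noun as the product itself.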

##### Concrete Noun Identification

Further refining the noun extraction, the `is_concrete_noun` function determines whether a noun describes a tangible aspect of a product. It walks the hypernym chain of each of the word's WordNet noun synsets and returns `True` as soon as it finds a lemma such as "object", "artifact", or "device". This distinction is useful because mentions of tangible features often relate directly to customer satisfaction and are actionable from a product improvement perspective. Note that `opinion_miner_controller`, as defined in the Evaluation section, does not call this function, so it acts as an optional filter.
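
The hypernym walk can be illustrated without WordNet by using a small hand-made graph. Everything in `HYPERNYMS` is hypothetical toy data rather than real WordNet content, and `is_concrete` is a simplified stand-in for the idea, not the notebook's function.

```python
# Toy hypernym graph (child -> parents); hypothetical, not real WordNet data
HYPERNYMS = {
    "camera": ["equipment"],
    "equipment": ["instrumentality"],
    "instrumentality": ["artifact"],
    "artifact": ["object"],
    "quality": ["attribute"],
    "attribute": ["abstraction"],
}
CONCRETE = {"object", "artifact", "instrumentality", "container", "device"}

def is_concrete(word):
    """Return True if any ancestor of the word is a concrete indicator."""
    seen, stack = set(), list(HYPERNYMS.get(word, []))
    while stack:
        node = stack.pop()
        if node in CONCRETE:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(HYPERNYMS.get(node, []))
    return False

print(is_concrete("camera"), is_concrete("quality"), is_concrete("unknown"))
```

A word with no entry in the graph is treated as not concrete, which mirrors how a word without noun synsets would fall through to `False`.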

##### Similarity-Based Filtering

With the candidate nouns identified, the `similarity_filter` function uses two vector space models, Word2Vec (trained on `text8`) and GloVe (`glove-twitter-25`), to keep only the nouns that are semantically related to the most frequent noun, which is treated as the product itself. For each candidate it averages the two models' similarity to that first noun and keeps the word if the average is at least 0.25; words missing from either vocabulary are skipped. This removes frequent but unrelated nouns from the feature list.
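
The thresholding logic can be sketched with toy two-dimensional vectors in place of the real embedding models. The dictionaries `model_a` and `model_b`, their vectors, and the helper names are all invented for illustration; only the averaging and the 0.25 threshold follow the description above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Two toy "embedding models" with made-up vectors
model_a = {"camera": [1.0, 0.0], "lens": [0.8, 0.6], "time": [0.0, 1.0]}
model_b = {"camera": [1.0, 0.0], "lens": [0.6, 0.8], "time": [-0.6, 0.8]}

def filter_by_anchor(words, threshold=0.25):
    anchor = words[0]  # the most frequent noun acts as the reference word
    kept = []
    for w in words:
        if w not in model_a or w not in model_b:
            continue  # skip out-of-vocabulary words
        avg = (cosine(model_a[anchor], model_a[w]) +
               cosine(model_b[anchor], model_b[w])) / 2
        if avg >= threshold:
            kept.append(w)
    return kept

kept = filter_by_anchor(["camera", "lens", "time", "zoom"])
print(kept)
```

In this toy setup "lens" averages 0.7 and is kept, "time" averages -0.3 and is dropped, and "zoom" is skipped because neither model knows it.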

##### Further Similarity Filtering

The `further_similar_filter` function adds a second pass that works in the opposite direction: it looks up the ten nearest GloVe neighbours of the first noun and removes any candidate that appears among them. This drops near-synonyms of the product name, so the remaining list describes features rather than restating the product.
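
A minimal sketch of this neighbour-based removal, assuming a precomputed neighbour table instead of a real embedding model. The `NEIGHBOURS` table and its contents are hypothetical.

```python
# Hypothetical nearest-neighbour table: each word maps to its closest words
NEIGHBOURS = {"camera": ["cameras", "cam", "lens"]}

def drop_near_duplicates(nouns, topn=2):
    """Remove nouns that are among the top-n neighbours of the first noun."""
    anchor = nouns[0]
    near = set(NEIGHBOURS.get(anchor, [])[:topn])
    return [n for n in nouns if n not in near]

remaining = drop_near_duplicates(["camera", "cam", "lens", "battery"])
print(remaining)
```

With `topn=2` only "cameras" and "cam" count as near-duplicates, so "lens" survives even though it is the third neighbour in the table.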

##### Building the Feature Dictionary and DataFrames

The `create_product_features_dict` function takes the filtered noun list, treats its first element as the product title and the remaining elements as features, and builds a nested dictionary with zeroed `positive` and `negative` counters for each feature. This dictionary is the basis for aggregating sentiment per feature. The `build_featured_df` function then constructs a DataFrame containing only the reviews that mention at least one of these features, recording the matched features and the original row index for each.

##### Sentiment Analysis and Count Aggregation

The final analytical steps involve `feature_dict_count`, which tallies positive and negative mentions for each feature, providing a quantified sentiment outlook for each aspect of the product. This aggregation is pivotal for identifying strengths and weaknesses in the product as perceived by consumers.
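
The dictionary layout and the tallying step can be sketched together on plain Python data. The helper names and the labelled mentions below are invented for this example; the sketch assumes sentiment labels of 1, -1 and 0, with 0 meaning the review is not counted.

```python
def make_feature_dict(product, features):
    """Nested dict: product -> feature -> zeroed sentiment counters."""
    return {product: {f: {"positive": 0, "negative": 0} for f in features}}

def tally(feature_dict, labelled_mentions):
    product = next(iter(feature_dict))  # single top-level key
    for feature, sentiment in labelled_mentions:
        counts = feature_dict[product].get(feature)
        if counts is None or sentiment == 0:
            continue  # unknown feature or unclassified review
        counts["positive" if sentiment > 0 else "negative"] += 1
    return feature_dict

# Hypothetical (feature, sentiment) pairs
mentions = [("zoom", 1), ("zoom", -1), ("battery", 1), ("zoom", 1), ("zoom, battery", 0)]
result = tally(make_feature_dict("camera", ["zoom", "battery"]), mentions)
print(result)
```

The final entry names two features and carries a sentiment of 0, so it leaves the counters unchanged.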

##### Efficient Data Management

Lastly, the `df_filter` function restructures the DataFrame to optimize it for analysis, ensuring data is clean and well-organized for subsequent processing steps or visualization.


In [97]:
def POS_Noun_Tagging(string_list):
    # Convert the pandas Series of review strings to a plain list
    reviews = string_list.tolist()
    features = []
    
    # Process each review to extract nouns
    for review in reviews:
        tokens = word_tokenize(review)  # Tokenize the text
        tagged = pos_tag(tokens)  # POS tagging
        # Collect nouns from tags
        features.extend(word.lower() for word, tag in tagged if tag.startswith('NN'))
    
    # Count and retrieve the 10 most common nouns
    feature_counts = Counter(features)
    common_features = feature_counts.most_common(10)
    
    return common_features



def is_concrete_noun(word):
    # Define indicators for concrete nouns
    concrete_indicators = {'object', 'artifact', 'instrumentality', 'container', 'device'}
    # Retrieve synsets for the word as a noun
    synsets = wn.synsets(word, pos=wn.NOUN)
    
    # Check categories for each synset to determine if it's a concrete noun
    for synset in synsets:
        for hypernym in synset.closure(lambda s: s.hypernyms()):
            if concrete_indicators.intersection(hypernym.lemma_names()):
                return True
    return False

In [102]:
def similarity_filter(word_tuple_list):
    # Set a threshold for filtering similar words
    similarity_threshold = 0.25
    
    # Create a list of words from the tuples
    words = [word for word, _ in word_tuple_list]
    
    filtered_words = []
    # Iterate over words to compute similarities
    for word in words:
        try:
            # Calculate similarity using Word2Vec
            w2v_similarity = Word2Vec_model.wv.similarity(words[0], word)
            # Calculate similarity using GloVe
            glove_similarity = glove_model.similarity(words[0], word)
            
            # Calculate average similarity
            avg_similarity = (w2v_similarity + glove_similarity) / 2
            
            # Append word to list if it meets the threshold
            if avg_similarity >= similarity_threshold:
                filtered_words.append(word)
        except KeyError:
            # Skip the word if it's not found in the model's vocabulary
            continue

    return filtered_words





def further_similar_filter(noun_list):
    # Nothing to filter if the previous step returned no nouns
    if not noun_list:
        return []

    # Look up the 10 nearest GloVe neighbours of the first noun once, outside the loop
    glove_similar_words = dict(glove_model.most_similar(noun_list[0], topn=10))

    # Drop any noun that appears among those neighbours with a positive similarity
    filtered_list = [item for item in noun_list if glove_similar_words.get(item, 0) <= 0]

    return filtered_list

In [103]:
def create_product_features_dict(product_info):
    # Extract the product title
    product_title = product_info[0]

    # Extract the features
    product_features = product_info[1:]

    # Create the dictionary with the desired structure
    product_dict = {
        product_title: {
            feature: {"positive": 0, "negative": 0} for feature in product_features
        }
    }

    return product_dict



def build_featured_df(df, feature_list_pos_noun):
    
    feature_df = pd.DataFrame()
    feature_list = feature_list_pos_noun[1:]
    
    featured_items_list = []
    
    for idx, review in df.iterrows():
        tokenised_review = review['Filtered_Review']
        # Find the features present in the tokenised_review
        featured_items = [item for item in feature_list if item in tokenised_review]
            
        if featured_items:
            # Convert the review Series to a DataFrame with one row
            review_df = review.to_frame().transpose()
            
            # Store the matched features as a single comma-separated string
            review_df['Featured_Items'] = [', '.join(featured_items)]  # As a single string of items
    
            review_df['Main_Index'] = idx
            
            # Append this review to the feature_df
            feature_df = pd.concat([review_df, feature_df], axis=0)
            
            # Additionally, append the found features to the featured_items_list
            featured_items_list.append(featured_items)

    return feature_df



def feature_dict_count(feature_df, product_dict):
    # The dictionary has a single top-level key: the product title
    product_features = product_dict[next(iter(product_dict))]

    for _, row in feature_df[['Featured_Items', 'Sentiment']].iterrows():
        feature = row['Featured_Items']
        sentiment = row['Sentiment']

        if sentiment == 1:
            product_features[feature]['positive'] += 1
        elif sentiment == -1:
            product_features[feature]['negative'] += 1
        # Sentiment 0 marks reviews that mention more than one feature; these are not counted

    return product_dict



def df_filter(df):
    df = df[['Main_Index', 'Featured_Items', 'Sentiment', 'Tags', 'Review', 'Tokenised_Review', 'Soft_Filtered_Review', 'Soft_Filtered_Review_String', 'Filtered_Review', 'Filtered_Review_String', 'Lemmatised_Review_String', 'Lemmatised_Tokenised_Filtered_Review']]
    df = df.set_index('Main_Index', drop=True) 
    df.index.name = 'Main Index'
    df = df.iloc[::-1]
    return df

# Sentiment Analysis

##### Sentiment Controller Function

The core of the sentiment analysis system is the `sentiment_controller` function, which is responsible for aggregating and assigning sentiment scores to each review in the dataset. This function works by iterating through each review entry and identifying the associated product features. It is designed to handle reviews with a single feature distinctly, ensuring that the sentiment analysis is as precise and relevant to specific product attributes as possible.

For reviews associated with exactly one product feature, the function utilizes the `get_phrase_sentiment` function to calculate a comprehensive sentiment score based on the textual content of the review. This distinction is crucial as it allows for a focused analysis, linking sentiments directly to individual features rather than generalizing across multiple attributes.
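
The labelling rule can be written down in a few lines. This is a sketch of the rule as described, with a hypothetical helper name; it assumes the features arrive as one comma-separated string and that a tie between the two scores is labelled negative.

```python
def label_review(featured_items, pos_score, neg_score):
    """Return 1, -1, or 0 for a review given its features and scores."""
    features = [f.strip() for f in featured_items.split(",") if f.strip()]
    if len(features) != 1:
        return 0  # several (or no) features: leave the review unclassified
    # Strictly greater positive score means positive; ties count as negative
    return 1 if pos_score > neg_score else -1

print(label_review("zoom", 0.3, 0.1))
print(label_review("zoom, battery", 0.9, 0.0))
print(label_review("zoom", 0.2, 0.2))
```

Treating ties as negative means a review whose words have no lexicon scores at all is counted as a negative mention, which is worth keeping in mind when reading the aggregated counts.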

##### Phrase and Word-Level Sentiment Calculation

The `get_phrase_sentiment` function represents a layered approach to sentiment scoring. Initially, it attempts to understand the sentiment of the entire phrase as a compound unit, preserving the contextual integrity of the user’s opinion. If specific sentiment scores are available for the compound phrase, these are used directly. This approach is beneficial for capturing the sentiment of phrases where the combined meaning might differ significantly from the sum of individual word sentiments.

If a direct compound sentiment score is unavailable, the function breaks down the phrase into individual words and computes the average sentiment score across all words. This breakdown is critical for capturing the nuances of language where compound scoring is not feasible, ensuring no sentiment information is lost.
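
The two-stage lookup can be sketched with a toy lexicon in place of SentiWordNet. The `LEXICON` entries and their scores are made up for illustration; only the control flow (whole phrase first, then the per-word average) follows the description above.

```python
# Toy lexicon: word or underscore-joined phrase -> (positive, negative) scores
LEXICON = {"good": (0.75, 0.0), "poor": (0.0, 0.625), "well_made": (0.5, 0.0)}

def word_score(word):
    # Unknown words contribute no sentiment
    return LEXICON.get(word, (0.0, 0.0))

def phrase_score(phrase):
    # Stage 1: try the whole phrase as one underscore-joined entry
    pos, neg = word_score(phrase.replace(" ", "_"))
    if pos or neg:
        return pos, neg
    # Stage 2: fall back to the average over individual words
    words = phrase.split()
    if not words:
        return 0.0, 0.0
    scores = [word_score(w) for w in words]
    return (sum(p for p, _ in scores) / len(words),
            sum(n for _, n in scores) / len(words))

print(phrase_score("well made"))
print(phrase_score("good zoom poor battery"))
```

Because the average is taken over all words, including those without a lexicon entry, long reviews with few opinion words produce scores close to zero.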

##### Individual Word Sentiment Scoring

At the word level, the `get_word_sentiment` function retrieves sentiment scores using SentiWordNet, a well-regarded lexicon in sentiment analysis research. It takes the positive and negative scores of the word's first listed synset as a simple estimate of the word's polarity; words with no synset score (0, 0).

This granular approach to sentiment analysis allows for a nuanced understanding of how specific words contribute to the overall sentiment of a phrase or sentence, enhancing the system's ability to detect and interpret the subtlest emotional undertones in consumer feedback.

##### Integration into Review Analysis

Once sentiments are calculated for each review, the `sentiment_controller` function appends these scores to the main DataFrame as a new column, facilitating further analysis and visualization of sentiment trends across different product features. This integration is pivotal in enabling end-to-end analysis, from raw review data to actionable insights.

In [104]:
def sentiment_controller(feature_df):
    # Initialize a list to hold sentiment values for each review
    sentiment_list = []
    
    # Iterate over each row in the DataFrame
    for idx, entry in feature_df.iterrows():
        review = entry['Soft_Filtered_Review_String']
        features = [feature.strip() for feature in entry['Featured_Items'].split(',')]
        
        # Process sentiment only if there is exactly one feature
        if len(features) == 1:
            pos_score, neg_score = get_phrase_sentiment(review)
            # Ties, including reviews with no scored words, are labelled negative
            sentiment = 1 if pos_score > neg_score else -1
            sentiment_list.append(sentiment)
        else:
            # Append neutral sentiment (0) if there are multiple or no features
            sentiment_list.append(0)

    # Add the sentiment list to the DataFrame as a new column
    feature_df['Sentiment'] = sentiment_list
    return feature_df



def get_phrase_sentiment(phrase):
    # First try to get sentiment score for the whole phrase as a compound
    pos_score, neg_score = get_word_sentiment(phrase.replace(" ", "_"))
    if pos_score or neg_score:
        return pos_score, neg_score
    
    # If no score, break down the phrase and calculate average sentiment scores
    words = phrase.split()
    total_pos, total_neg = 0, 0
    for word in words:
        pos, neg = get_word_sentiment(word)
        total_pos += pos
        total_neg += neg
    
    # Average the sentiment scores
    avg_pos = total_pos / len(words) if words else 0
    avg_neg = total_neg / len(words) if words else 0
    
    return avg_pos, avg_neg



def get_word_sentiment(word):
    # Retrieve sentiment scores from SentiWordNet, using only the first listed synset
    synsets = list(swn.senti_synsets(word))
    if synsets:
        return synsets[0].pos_score(), synsets[0].neg_score()
    else:
        return 0, 0


# Evaluation

In [105]:
def opinion_miner_controller(file):
    
    df = read_file(file)
    df = pre_processing_controller(df)
    
    nouns = POS_Noun_Tagging(df['Lemmatised_Review_String'])
    similar_nouns = similarity_filter(nouns)
    features = further_similar_filter(similar_nouns)
    feature_df = build_featured_df(df, features)
    feature_df = sentiment_controller(feature_df)
    feature_df = df_filter(feature_df)
    product_dict = create_product_features_dict(features)
    feature_dict = feature_dict_count(feature_df, copy.deepcopy(product_dict))
    return feature_dict


feature_dict = opinion_miner_controller(files[1])
display(feature_dict)

{'camera': {'use': {'positive': 24, 'negative': 12},
  'picture': {'positive': 2, 'negative': 10},
  'canon': {'positive': 18, 'negative': 18},
  'time': {'positive': 8, 'negative': 15},
  'shoot': {'positive': 15, 'negative': 5},
  'feature': {'positive': 0, 'negative': 1},
  'quality': {'positive': 14, 'negative': 2}}}