# Mining and Summarizing Customer Reviews


## Introduction

As e-commerce continues to flourish, an ever-increasing volume of customer feedback on products and services is generated online. Popular products often attract hundreds of reviews, creating a rich yet overwhelming source of information for potential buyers. This project addresses the need to make sense of this plethora of reviews by introducing an innovative approach to summarizing customer feedback. Distinct from traditional text summarization methods, this approach zeroes in on extracting and organizing customer opinions about specific product features and categorizing them as positive or negative. Our summarization process diverges from the usual practice of sentence selection or rewriting, focusing instead on pinpointing and collating opinions related to product features. The paper outlines various strategies for effective feature mining, backed by experimental results.

This project is centered around the challenge of feature-based opinion summarization in customer reviews for online products. It encompasses a two-stage process: initially, identifying the product features that customers have commented on, referred to as "opinion features." These features are then ranked by their frequency in the reviews. The second stage is associating each feature with positive or negative sentiments. For each identified feature, we quantify the number of customer reviews expressing positive or negative views and link these specific reviews to their respective features. Our goal is to enhance the review navigation experience for potential customers, making it more straightforward and efficient.


## Pipeline for Mining and Summarizing Customer Reviews

1. Analyze the task and define the framework.
2. Preprocess data, inspect, and gain insights.
3. Identify relevant information, extract data.
4. Select the most suitable algorithm and implement it.
5. Apply the algorithm practically, test, and evaluate its performance.

### 1. Analyze the Task and Define the Framework

#### 1.1 Define Used Libraries

In this critical initial phase, we clearly define the task and establish a comprehensive framework. Our main objective is to extract pivotal features from sentences found in product reviews, which are methodically organized in various folders. We start by identifying essential libraries that are key to processing the reviews, including extracting both implicit and explicit features and determining opinion polarities. The subsequent sections will explore various functions, each uniquely crafted to harness the full potential of these libraries for specific tasks.

#### 1.2 Read Data from Files in the Folders

Here, we developed two functions to efficiently navigate through folders and files, where the product reviews are stored, while excluding 'Readme.txt' files:
   
   - `read_files_in_folder`: This function is designed to read and process text files within a specified folder. It efficiently extracts the product names from the file names, using these as keys in a dictionary, with the corresponding processed text file data as values.
   
   - `read_folders`: This function extends the capability of reading text files to multiple folders. It accepts a list of folder names as input and compiles the data into a comprehensive dictionary, encompassing processed information from text files across all the specified folders.

In [1]:
import os
import re
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser
from apyori import apriori
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
nlp = spacy.load("en_core_web_sm")

folder_names = ['Customer_review_data', 'CustomerReviews-3_domains', 'Reviews-9-products']
#folder_names = ['CustomerReviews-3_domains']


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\ALIENWARE\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [2]:
def read_files_in_folder(folder_path):
    """
     
    Reads text files in a specified folder and processes their content.

    Parameters:
    - folder_path (str): The path to the folder containing text files.

    Returns:
    - file_data (dict): A dictionary where keys are product names (extracted from file names)
      and values are processed data from corresponding text files.
    """
    
    file_data = {}
    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)
        if file_name == "Readme.txt":
            continue
        else:
            if os.path.isfile(file_path) and file_name.endswith('.txt'):
                with open(file_path, 'r') as file:
                    content = file.readlines()
                    product_name = file_name.replace(".txt", "")
                    processed_data = process_content(content)
                    file_data[product_name] = processed_data
    return file_data


def read_folders(folders):
    """
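    Reads text files from several folders and merges the results.

    Parameters:
    - folders (list of str): Folder names, resolved relative to the current working directory.

    Returns:
    - all_data (dict): A dictionary mapping product names to processed review data
      from every folder. If two files share a name, the later folder overwrites the earlier one.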
    """
    all_data = {}
    base_path = os.getcwd()  # Use os.getcwd() to get the current working directory
    for folder in folders:
        folder_path = os.path.join(base_path, folder)
        folder_data = read_files_in_folder(folder_path)
        all_data.update(folder_data)
    return all_data


### 2. Data Preprocessing and Insights Extraction

In the second phase of the data processing pipeline, the primary focus is on cleaning, parsing, and structuring sentences for subsequent analysis. The central function, `process_content`, organizes a list of lines into a structured format: it skips empty lines and lines starting with `*`, starts a new review at each `[t]` line, separates the sentence text from its annotations at the `##` delimiter, and extracts the labeled features using a regular expression. The result is a structured dataset for the next step, which operates mainly on the extracted sentences: further processing keeps the important parts of each sentence and eliminates the rest.
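The annotation format handled by `process_content` can be illustrated on a few lines. The sketch below reuses the feature regular expression from the notebook; the helper name `parse_annotated_line` and the sample lines are illustrative assumptions, not taken from the dataset. Unlike `process_content`, it applies the pattern only to the part before `##`.

```python
import re

# Same feature pattern as in process_content: name, strength such as +2, optional tag such as [u]
FEATURE_PATTERN = re.compile(r'([\w\-_]+(?:\s[\w\-_]+)*)\[([\+\-]?\d)\](?:\[(\w+)\])?')


def parse_annotated_line(line):
    """Split one annotated line into (sentence_text, feature_tuples). Hypothetical helper."""
    annotations, _, sentence = line.partition("##")
    features = FEATURE_PATTERN.findall(annotations)
    return sentence.strip(), features


# Made-up lines that follow the "features##sentence" layout
sample_lines = [
    "dvd player[+2]##i love this dvd player!",
    "picture quality[+3], size[-1][u]##great pictures, but it is bulky.",
    "##i bought it last month.",
]
parsed_lines = [parse_annotated_line(line) for line in sample_lines]

for sentence, features in parsed_lines:
    print(features, "->", sentence)
```

Each feature becomes a `(name, strength, tag)` tuple, and a line with no annotation yields an empty feature list.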

`preprocess_reviews_in_place`:
Next, we take the sentence portion of each entry in the dictionary of lists. Each sentence is lowercased, stripped of special characters, tokenized, and POS-tagged; stop words are then removed from the tagged tokens for two reasons:

- **Text Size Reduction:** Eliminating stop words reduces the overall text size, which can lead to faster processing in later steps, particularly for large datasets.

- **Noise Reduction:** Stop words contribute noise to feature mining, so removing them early in the process simplifies downstream tasks.

POS (Part-of-Speech) tagging is executed with the `pos_tag` function from the NLTK library, which assigns a grammatical category to each token. In the code, tagging runs on the full token sequence before stop words are filtered out, so the tagger sees each word in its original context.

The next step optionally normalizes the tokens. With `lemmatization=True`, spaCy transforms each word into its base or dictionary form, which promotes text normalization. The function also accepts a `stemming` flag. Note that spaCy does not ship a stemmer: the `stemming` branch reads `token.orth_`, which is the token's original text, so it leaves words unchanged. This notebook calls the function with `lemmatization=True` and `stemming=False`.

The processed information, now including cleaned, POS-tagged, and normalized tokens, seamlessly integrates back into the original data structure. The `preprocess_reviews_in_place` function encapsulates these preprocessing steps, serving as a versatile tool for optimizing the quality of text data before advancing to subsequent analyses.
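The cleaning and filtering steps can be sketched without any NLP dependency. This is a simplified illustration, not the notebook's function: whitespace splitting stands in for `word_tokenize`, POS tagging and lemmatization are omitted, and `DEMO_STOP_WORDS` is a tiny hypothetical list rather than NLTK's English stop word list. The two regular expressions are the ones used in `preprocess_reviews_in_place`.

```python
import re

# Tiny illustrative stop word list (assumption); the notebook uses stopwords.words('english')
DEMO_STOP_WORDS = {"i", "it", "the", "a", "is", "but", "if", "for", "my", "of", "you", "this"}


def clean_and_filter(sentence, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip special characters, split on spaces, drop stop words and 1-char tokens."""
    lowered = sentence.lower()
    cleaned = re.sub(r'[^A-Za-z0-9\s\-]+', '', lowered)  # remove special characters
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()        # collapse whitespace runs
    tokens = cleaned.split(' ')
    return [token for token in tokens if token not in stop_words and len(token) > 1]


demo_tokens = clean_and_filter("I love it, but the remote is flimsy!")
print(demo_tokens)
```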

In [3]:
def process_content(content):
    reviews = []
    # Temporary variables to store information about the current review
    current_review = []  # Initialize current_review as an empty list
    current_sentence = None

    for line in content:
        # Skip empty lines
        if not line.strip():
            continue

        if line.startswith("*"):
            continue

        if line.startswith("[t]"):
            # If there is an existing review, add it to the reviews list
            if current_review:
                reviews.append(current_review)

            # Start a new review with the current line
            current_review = []  # Initialize current_review as an empty list
            current_sentence = None
        else:

            if line.startswith("##"):
                sentence_text = line[2:].strip()  # Remove "##" and strip whitespace
            else:
                sentence_text = re.sub(r'^(.+?)##', '', line).strip()

            # Create a tuple for each feature
            features = re.findall(r'([\w\-_]+(?:\s[\w\-_]+)*)\[([\+\-]?\d)\](?:\[(\w+)\])?', line)
            feature_tuples = [tuple(feature) for feature in features]

            # If it's a continuation of the previous sentence, add the features
            if current_sentence == sentence_text:
                current_review[-1].append([feature_tuples])
            else:
                # Start a new sentence in the review
                current_review.append([sentence_text] + [feature_tuples])
                current_sentence = sentence_text

    # Add the last review to the reviews list
    if current_review:
        reviews.append(current_review)

    return reviews


In [4]:
"""
 Read and process the folders listed in folder_names with the read_folders function.
 The result maps each product name to its list of parsed reviews
 and is printed to the console.

"""

result = read_folders(folder_names)

print("result :\n",result)

result :


In [5]:
def preprocess_reviews_in_place(sentence_info, lemmatization=False, stemming=False, stop_words=True):
    # Get the list of English stop words
    stop_words_set = set(stopwords.words('english'))

    u = []
    original_sentence = sentence_info[0].lower()

    # Remove special characters and replace sequences of special characters with a single space
    cleaned_sentence = re.sub(r'[^A-Za-z0-9\s\-]+', '', original_sentence)
    cleaned_sentence = re.sub(r'\s+', ' ', cleaned_sentence)


    # Tokenize the cleaned sentence
    tokens = word_tokenize(cleaned_sentence)

    # Get POS for all tokens without removing stop words
    filtered_tokens = pos_tag(tokens)

    if lemmatization:
        # Lemmatization using SpaCy
        doc = nlp(cleaned_sentence)
        lemmatized_tokens = [token.lemma_ for token in doc]
        filtered_tokens = list(zip(lemmatized_tokens, [pos for _, pos in filtered_tokens]))

    if stemming:
        # NOTE: spaCy has no stemmer; token.orth_ is the verbatim token text, so this branch leaves words unchanged
        doc = nlp(cleaned_sentence)
        stemmed_tokens = [token.orth_ for token in doc]
        filtered_tokens = list(zip(stemmed_tokens, [pos for _, pos in filtered_tokens]))

    if stop_words:
        # Remove stop words and get POS for all tokens
        filtered_tokens = [(token, pos) for token, pos in filtered_tokens if token.lower() not in stop_words_set]

    filtered_tokens = [(token, pos) for token, pos in filtered_tokens if len(token) > 1]

    u.extend(sentence_info)
    u.extend([filtered_tokens])

    return u

In the final stage of text preprocessing, the tokens retrieved in the preceding step, along with their respective Parts of Speech (PoS), are appended to each sentence entry in the original dictionary of lists. The resulting structure, stored in `result_preproc`, mirrors the format illustrated in the following example:

```plaintext
[
  "But, if you're looking for my opinion of the Apex DVD player, I love it!",
  [('dvd player', '+2', '')],
  [
    ('look', ('looking', 'VBG')),
    ('opinion', ('opinion', 'NN')),
    ('Apex', ('Apex', 'NN')),
    ('DVD', ('DVD', 'NN')),
    ('player', ('player', 'NN')),
    ('love', ('love', 'VBP'))
  ]
]
```
 

In [6]:
result_preproc = {}  # Initialize an empty dictionary to store preprocessed data

for product, reviews in result.items():
    result_preproc[product] = []  # Create an entry for the current product in the preprocessed data dictionary
    for review in reviews:
        preproc_review = []  # Initialize an empty list to store preprocessed sentences in the current review
        for sentence_info in review:
            # Perform preprocessing in-place using the preprocess_reviews_in_place function
            # The function handles tasks such as lowercasing, removing special characters, tokenization,
            # lemmatization, stemming, and stop words removal based on specified parameters
            preproc_review.append(preprocess_reviews_in_place(sentence_info, lemmatization=True, stemming=False, stop_words=True))
        result_preproc[product].append(preproc_review)  # Append the preprocessed review to the product's entry in the dictionary


In [7]:
#result_preproc

### 3. Define Relevant Information, Extract Information from Data

### Extract Reviews Features
The `extract_features` function extracts candidate features from the preprocessed text data. The input is a dictionary in which each product maps to a list of reviews. For each product, the function walks through its reviews and sentences and reads the POS-tagged tokens stored as the last element of each sentence entry.

It then applies a grammar for Noun Phrase (NP) chunking, `NP: {<NN.*>+}`, which groups each run of consecutive noun tags into one phrase. NLTK's `RegexpParser` performs the chunking based on this rule.

The function collects both the NP chunks and the individual nouns from the POS tags, so the features of a sentence include standalone nouns as well as multi-word noun phrases. The two lists are merged and duplicates are removed.

The output is a dictionary, `all_features`, in which each product maps to a list of per-sentence feature lists. These lists serve as the transactions for the association mining step that follows.
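The effect of the chunking rule can be shown on a short tagged sentence. The sketch below is a dependency-free approximation that groups consecutive tokens whose tag starts with `NN`, which is what the `NP: {<NN.*>+}` rule selects; it does not use `RegexpParser`, and the name `noun_features` and the tagged sample are assumptions for illustration. It keeps first-seen order, whereas the notebook deduplicates with a `set`.

```python
from itertools import groupby


def noun_features(tagged_tokens):
    """Return individual nouns and noun-run phrases from (token, tag) pairs."""
    features = []
    for is_noun, run in groupby(tagged_tokens, key=lambda pair: pair[1].startswith("NN")):
        if not is_noun:
            continue
        words = [token for token, _ in run]
        features.extend(words)            # individual nouns
        features.append(" ".join(words))  # the whole run as one noun phrase
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(features))


# Hand-tagged sample (assumption), in the (token, POS) shape produced by preprocessing
tagged = [("love", "VBP"), ("dvd", "NN"), ("player", "NN"), ("great", "JJ"), ("picture", "NN")]
demo_features = noun_features(tagged)
print(demo_features)
```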


### Association Rules and Pruning

### Association Rules Extraction

To further enhance the analysis, frequent features are mined from the previously obtained per-sentence feature lists. The `get_association_rules` function runs the Apriori algorithm (via the `apyori` package) with the specified support and confidence thresholds and collects the items of every itemset it returns. The results are stored in the `top_features` dictionary, which associates each product with its frequent features.

The `min_support` parameter is set to 0.005, the minimum support threshold for an itemset to be considered frequent. In association rule mining, support measures the fraction of transactions (here, sentences) that contain an itemset. A relatively low `min_support` value such as 0.005 lets the algorithm keep itemsets that occur infrequently in the dataset, which helps capture more diverse patterns from the review sentences.
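To make the support threshold concrete: with `min_support=0.005` and, say, 1,000 sentences for a product (a hypothetical count), an itemset must appear in at least 5 sentences to be kept. The sketch below computes relative support by brute-force counting on a toy set of transactions. It is not Apriori's candidate pruning and does not call `apyori`; the function name `itemset_support`, the toy data, and the 0.5 threshold are assumptions chosen to keep the example small.

```python
from collections import Counter
from itertools import combinations


def itemset_support(transactions, max_size=2):
    """Relative support of every itemset with up to max_size items."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    total = len(transactions)
    return {itemset: count / total for itemset, count in counts.items()}


# Each transaction is the feature list of one sentence (made-up data)
demo_transactions = [
    ["battery", "battery life", "life"],
    ["battery", "price"],
    ["picture", "price"],
    ["battery", "battery life", "life"],
]
support = itemset_support(demo_transactions)
frequent = {itemset for itemset, value in support.items() if value >= 0.5}

print(support[("battery",)])
print(sorted(frequent))
```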
 
#### Compactness Pruning

The `compactness_pruning` function prunes multi-word features based on their compactness, keeping a phrase only if its words occur close together in at least two sentences. It defines an internal function, `is_compact`, which rejects single-word phrases. For a phrase with more than one word, it counts the sentences in which at least two words match words of the phrase and every pair of consecutive matches is at most 3 positions apart. If the count reaches two, the feature is considered compact. The main function iterates over each product's feature phrases in `top_features`, evaluates their compactness with `is_compact`, and builds a dictionary, `pruned_features`, that associates each product with its list of compact features.
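The distance check at the heart of this step can be illustrated on single sentences. The sketch below is a simplified view under stated assumptions: the helper names and the `max_distance` parameter are hypothetical (the notebook hard-codes the value 3), and the sample sentences are made up.

```python
def max_gap_between_matches(phrase, sentence):
    """Largest gap between consecutive phrase-word positions, or None if fewer than two words match."""
    phrase_words = set(phrase.lower().split())
    positions = [i for i, word in enumerate(sentence.lower().split()) if word in phrase_words]
    if len(positions) < 2:
        return None
    return max(later - earlier for earlier, later in zip(positions, positions[1:]))


def is_compact_in(phrase, sentence, max_distance=3):
    """True if the phrase words that occur in the sentence lie within max_distance of each other."""
    gap = max_gap_between_matches(phrase, sentence)
    return gap is not None and gap <= max_distance


near = "the battery life is excellent"
far = "the battery died early in my life as a photographer"

print(is_compact_in("battery life", near))
print(is_compact_in("battery life", far))
```

A phrase is then kept when at least two sentences pass this check.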

#### Redundancy Pruning

The `redundancy_pruning` function refines the single-word features. It iterates over each product (file) in the dataset, skips multi-word features, and compares each remaining feature with the compact features from the previous step using the `is_superset` helper, which tests whether the word set of one string contains the word set of another. It then calculates the p-support of each remaining feature, counted as the number of sentences in the preprocessed data in which the feature appears, and filters out features whose p-support is below `min_p_support` (default 3). The resulting `redundancy_features` dictionary associates each product with the single-word features that survive this pruning.
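One common reading of p-support is the number of sentences that contain a feature but none of the longer feature phrases that include it. The sketch below follows that reading on toy data; it is an illustration under that assumption rather than a copy of `redundancy_pruning`, and the function name, sentences, and phrases are made up.

```python
def p_support(feature, phrases, sentences):
    """Count sentences that contain `feature` as a word but no phrase that includes it."""
    supersets = [phrase for phrase in phrases if phrase != feature and feature in phrase.split()]
    count = 0
    for sentence in sentences:
        lowered = sentence.lower()
        if feature not in lowered.split():
            continue  # the feature does not occur in this sentence
        if any(phrase in lowered for phrase in supersets):
            continue  # the occurrence belongs to a longer feature phrase
        count += 1
    return count


demo_sentences = [
    "the battery life is short",
    "battery life could be better",
    "the life of this camera is long",
    "manual mode is great",
]
demo_phrases = ["battery life"]

print(p_support("life", demo_phrases, demo_sentences))
print(p_support("battery", demo_phrases, demo_sentences))
```

With a minimum p-support of 3, both `life` and `battery` would be dropped here, while `battery life` remains as the compact feature.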

#### Features Combining

The final step involves combining the compact and redundancy-pruned features into a unified set of meaningful features. The `combined_features` dictionary associates each product with the union of its compact multi-word features and its redundancy-pruned single-word features, with duplicates removed.

This comprehensive set of functions and processes, from feature extraction to association rules and pruning, forms a robust foundation for deriving insights and understanding key patterns within product reviews.

In [8]:
def extract_features(result_preproc):
    all_features = {}

    for product, reviews in result_preproc.items():
        product_features = []

        for review in reviews:
            for sentence_info in review:
                # Extract the POS tags from the preprocessed data
                pos_tags = sentence_info[-1]

                # Define a grammar for NP (noun phrase) chunking
                
                grammar = r"""
                   NP: {<NN.*>+} # Noun Phrase
                  """

                # Create a chunk parser with the defined grammar
                chunk_parser = RegexpParser(grammar)

                # Perform NP chunking
                chunks = chunk_parser.parse(pos_tags)

                # Extract NP chunks
                noun_phrases = [" ".join([token[0] for token in subtree.leaves()]) for subtree in
                                chunks.subtrees(filter=lambda t: t.label() == 'NP')]

                # Extract individual nouns
                nouns = [token[0] for token in pos_tags if token[1].startswith('NN')]

                # Combine nouns and noun phrases, remove duplicates
                features = list(set(nouns + noun_phrases))

                product_features.append(features)

        all_features[product] = product_features

    return all_features

In [9]:
result_features = extract_features(result_preproc)
#print("Result Features:\n", result_features)



In [10]:
def get_association_rules(sentences, min_support=0.005, min_confidence=0.2):
    # Generate association rules using the Apriori algorithm
    rules = apriori(sentences, min_support=min_support, min_confidence=min_confidence)

    # List to store top features obtained from association rules
    top_features = []

    # Extract items from each rule and add to the top_features list
    for r in list(rules):
        top_features.extend(r.items)

    # Return the list of unique top features
    return list(set(top_features))


In [11]:
# Dictionary to store association rules for each product
top_features = {}

# Iterate over products and their corresponding features
for product, features in result_features.items():
    # Obtain association rules for the current product's features
    association_rules = get_association_rules(features)

    # Store the cleaned association rules in the dictionary
    top_features[product] = association_rules

# Print the dictionary containing top features for each product
print("Top Features:\n", top_features)


Top Features:
 {'Apex AD2600 Progressive-scan DVD player': ['problem', 'dvd', 'name', 'video', 'dvds', 'vcd', 'format', 'number', 'apex', 'amazon', 'movie', 'return', 'work', 'medium', 'customer', 'remote', 'customer service', 'time', 'everything', 'gift', 'sound', 'disc', 'dvd player', 'player', 'service', 'quality', 'apex dvd player', 'picture', 'year', 'button', 'month', 'present', 'brand', 'money', 'christmas', 'support', 'purchase', 'feature', 'review', 'price', 'thing', 'play', 'unit', 'output', 'family'], 'Canon G3': ['problem', 'auto mode', 'buy', 'life', 'shoe', 'image', 'shutter', 'line', 'flaw', 'corner', 'year', 'speed', 'battery', 'review', 'point', 'strap', 'camera', 'choice', 'card', 'consumer', 'lot', 'viewfinder', 'ability', 'week', 'view', 'moment', 'quality', 'fact', 'nikon', 'shot', 'pic', 'software', 'lcd screen', 'resolution', 'powershot', 'point shoot', 'experience', 'g3', 'mode', 'ease', 'adobe', 'detail', 'market', 'setting', 'canon g3', 'month', 'cap', 'flash'

In [12]:
def compactness_pruning(result_preproc, top_features):
    def is_compact(phrase, sentences, min_words=2):
        
        if len(phrase.split(' ')) < 2:
            return 0

        compact_count = 0
    
        # Loop through each sentence in the dataset
        for sentence in sentences:
            # Find the indices of words in the sentence that match words in the phrase
            words_in_order = [i for i, word in enumerate(sentence.split(' ')) if word.lower() in phrase.split(' ')]
    
            # Check if there are at least min_words matching words
            if len(words_in_order) >= min_words:
                # Calculate the distances between consecutive matching words
                distances = [j - i for i, j in zip(words_in_order[:-1], words_in_order[1:])]
    
                # Check if all distances are less than or equal to 3
                if all(distance <= 3 for distance in distances):
                    compact_count += 1
            
            if compact_count >= 2:
                break

        # Check if the compact count is greater than or equal to 2
        return compact_count >= 2

    pruned_features = {}

    for file_name, feature_phrases in top_features.items():
        sentences = [sentence_info[0].lower() for review in result_preproc[file_name] for sentence_info in review]
        compact_features = [phrase for phrase in feature_phrases if is_compact(phrase, sentences)]
        pruned_features[file_name] = compact_features

    return pruned_features


In [13]:
# Apply compactness pruning
compact_features = compactness_pruning(result_preproc, top_features)
print('compact_features:\n', compact_features,"\n\n\n")

compact_features:
 {'Apex AD2600 Progressive-scan DVD player': ['customer service', 'dvd player', 'apex dvd player'], 'Canon G3': ['auto mode', 'lcd screen', 'point shoot', 'canon g3', 'battery life'], 'Creative Labs Nomad Jukebox Zen Xtra 40GB': ['scroll wheel', 'zen xtra', 'mp3 player', 'battery life'], 'Nikon coolpix 4300': ['card reader', 'quality picture', 'flash card', 'camera price range', 'cf card', 'picture quality', 'scene mode', 'nikon coolpix', 'buyer remorse', 'memory card', 'camera money', 'battery life'], 'Nokia 6610': ['battery life', 'color screen', 'camera phone', 'fm radio', 'sound quality'], 'Computer': ['customer service', 'monitor price', 'lcd monitor'], 'Router': ['router modem', 'access point', 'wireless router'], 'Speaker': ['mp3 player', 'sound quality'], 'Canon PowerShot SD500': ['high speed', 'camera owner', 'video quality', 'sd card', 'picture quality', 'pocket camera', 'lcd screen'], 'Canon S100': ['digital camera', 'canon s100', 'picture quality', 'memory

In [14]:
# Perform redundancy pruning on association rules based on p-support
def redundancy_pruning(result_preproc, top_features, compact_features, min_p_support=3):
    """
    :param result_preproc: Preprocessed data.
    :param top_features: Extracted features.
    :param compact_features: Pruned features based on compactness.
    :param min_p_support: Minimum p-support for pruning.
    :return: Meaningful features after redundancy pruning.
    """
    # Dictionary to store redundancy-pruned features for each file (product)
    redundancy_features = {}

    # Iterate over each product and its associated features
    for file_name, features in top_features.items():
        # Retrieve compact features set for the current file
        compact_features_set = set(compact_features[file_name]) if file_name in compact_features else set()

        # List to store redundancy-pruned rules
        pruned_rules = []

        # Iterate over each feature for the current file
        for feature in features:
            # Skip features with more than one word
            if len(feature.split(' ')) > 1:
                continue

            # Check for redundancy by comparing with compact features
            redundant = any(is_superset(feature, compact_feature) for compact_feature in compact_features_set)

            # Add non-redundant features to the pruned_rules list
            if not redundant:
                pruned_rules.append(feature)

        # Apply p-support pruning
        p_support = {}  # Dictionary to store p-support for each feature
        for rule in pruned_rules:
            p_support[rule] = 0

        # Extract sentences from the preprocessed data for the current file
        sentences = [sentence_info[0].lower() for review in result_preproc[file_name] for sentence_info in review]

        # Count p-support for each rule in the sentences
        for sentence in sentences:
            for rule in pruned_rules:
                if rule in sentence:
                    # Check if there is no superset of rule in the sentence
                    if all(rule not in s for s in pruned_rules if s != rule):
                        p_support[rule] += 1

        # Filter features based on minimum p-support
        pruned_rules = [rule for rule in pruned_rules if p_support[rule] >= min_p_support]

        # Store the redundancy-pruned features for the current file
        redundancy_features[file_name] = pruned_rules

    return redundancy_features


# Helper function to check whether the words of string 'a' include all words of string 'b'
def is_superset(a, b):
    """
    :param a: Whitespace-separated string 'a'.
    :param b: Whitespace-separated string 'b'.
    :return: True if the word set of 'a' is a superset of the word set of 'b', False otherwise.
    """
    return set(a.split()) >= set(b.split())


In [15]:
# Apply redundancy pruning
redundancy_features = redundancy_pruning(result_preproc, top_features, compact_features)
print('Redundancy Features:\n', redundancy_features)

Redundancy Features:
 {'Apex AD2600 Progressive-scan DVD player': ['problem', 'name', 'video', 'dvds', 'vcd', 'format', 'number', 'apex', 'amazon', 'movie', 'return', 'work', 'customer', 'remote', 'time', 'everything', 'gift', 'sound', 'disc', 'player', 'service', 'quality', 'picture', 'year', 'button', 'month', 'present', 'brand', 'money', 'christmas', 'support', 'purchase', 'feature', 'review', 'price', 'unit', 'output', 'family'], 'Canon G3': ['problem', 'buy', 'life', 'shoe', 'image', 'shutter', 'flaw', 'corner', 'year', 'speed', 'battery', 'review', 'point', 'strap', 'camera', 'choice', 'card', 'consumer', 'lot', 'viewfinder', 'ability', 'week', 'moment', 'quality', 'fact', 'nikon', 'software', 'resolution', 'powershot', 'experience', 'g3', 'mode', 'ease', 'adobe', 'detail', 'market', 'setting', 'month', 'cap', 'flash', 'shoot', 'feature', 'screen', 'lcd', 'color', 'flexibility', 'auto', 'lens', 'metz', 'g2', 'people', 'time', 'everything', 'research', 'picture', 'wife', 'film', '

In [16]:
# Combine compact_features and redundancy_features into a single dictionary
combined_features = {}

for product in compact_features:
    compact_list = compact_features[product]
    redundancy_list = redundancy_features[product] if product in redundancy_features else []
    
    combined_list = compact_list + redundancy_list
    
    # Remove duplicates by converting the list to a set and back to a list
    combined_list = list(set(combined_list))
    
    combined_features[product] = combined_list

# Print the combined features
print('Combined Features:', combined_features)

Combined Features: {'Apex AD2600 Progressive-scan DVD player': ['problem', 'name', 'video', 'dvds', 'vcd', 'format', 'number', 'apex', 'amazon', 'movie', 'return', 'work', 'customer service', 'customer', 'remote', 'time', 'everything', 'gift', 'sound', 'disc', 'dvd player', 'player', 'service', 'quality', 'apex dvd player', 'picture', 'year', 'button', 'month', 'present', 'brand', 'money', 'christmas', 'support', 'purchase', 'feature', 'review', 'price', 'unit', 'output', 'family'], 'Canon G3': ['viewfinder', 'problem', 'auto mode', 'ability', 'buy', 'auto', 'week', 'g3', 'lens', 'metz', 'g2', 'people', 'time', 'everything', 'mode', 'research', 'life', 'shoe', 'resolution', 'image', 'ease', 'adobe', 'moment', 'quality', 'shutter', 'fact', 'flaw', 'nikon', 'detail', 'market', 'corner', 'setting', 'picture', 'year', 'speed', 'canon g3', 'month', 'wife', 'cap', 'flash', 'shoot', 'battery life', 'film', 'day', 'battery', 'use', 'feature', 'screen', 'review', 'point', 'photography', 'way', 

### 4. Feature Extraction and Sentiment Analysis Pipeline

#### Select Appropriate Algorithm and Implement It

In this step, we compare user-labeled features for every product with the features extracted either before or after association. The process involves iterating through the labeled feature list, checking for the presence of each labeled feature in the top feature list. If a match is found, the true positive counter increases; otherwise, the false positive counter increases. Another iteration is performed on the extracted features list, checking if any of the extracted features are present in the user-labeled list. If not, the false negative counter increases.

These steps are crucial for calculating precision and recall, which serve as indicators to evaluate the accuracy of the feature extraction process when compared to user-defined labeled features. The `Features_Evaluation` function is implemented for this purpose, utilizing the `lemmatize_labels` function to lemmatize labeled features for alignment with the pre-extracted features.
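Precision and recall can be sketched with set operations. The example below uses the conventional definitions, which is an assumption about how the counters are meant to be read: a true positive is a labeled feature that was extracted, a false positive is an extracted feature that is not labeled, and a false negative is a labeled feature that was not extracted. It is not the `Features_Evaluation` function, and the feature sets are made up.

```python
def precision_recall(labeled, extracted):
    """Precision and recall of extracted features against labeled features."""
    labeled, extracted = set(labeled), set(extracted)
    true_positives = len(labeled & extracted)
    false_positives = len(extracted - labeled)
    false_negatives = len(labeled - extracted)
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


demo_labeled = {"battery life", "picture", "price", "lens"}
demo_extracted = {"battery life", "picture", "camera", "month", "price"}
demo_precision, demo_recall = precision_recall(demo_labeled, demo_extracted)

print(f"precision={demo_precision:.2f} recall={demo_recall:.2f}")
```

Here three of the five extracted features are labeled and three of the four labeled features are found.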

We evaluate the results of the `Features_Evaluation` function on two pairs of inputs. The first pair is `Features_Evaluation(result, top_features)`, where `result` is the main dictionary of lists and `top_features` contains the features extracted before pruning. The result shows low precision and recall. The second pair is `Features_Evaluation(result, combined_features)`, where `combined_features` contains the compact and redundancy-pruned features obtained after association mining. Precision and recall slightly worsen.

#### Extracting Infrequent Features (Explicit)

To improve precision and recall, we attempt to extract infrequent features and add them to the feature set. Following the terminology used here, the previously extracted features are the implicit features and the newly added ones are the explicit features. For every word in a review sentence that does not contain any of the pre-extracted features (combined = compactness + redundancy), we check the part-of-speech tag of the word immediately before and after it. If either neighbor is an adjective or an adverb, the word is kept as an explicit feature. These features are then merged with the pre-extracted features to form `final_features`, which contains both implicit and explicit features. Calling `Features_Evaluation(result, final_features)` shows that precision improves (0.43 to 0.60 for the products with complete rows) while recall worsens to 0.10 or below, because the many added words that have no matching label inflate `fn_count` (for example, from 20 to 627 for the Apex DVD player).
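The neighbor check described above can be sketched on a single pre-tagged sentence. This is an illustration under stated assumptions rather than the notebook's cell: the function name `infrequent_candidates`, the sample sentence, and the `nouns_only` switch are hypothetical. The notebook keeps any word with an adjective or adverb next to it, and restricting candidates to nouns is shown here only as one possible way to cut down on noise.

```python
# Penn Treebank tags for adjectives and adverbs, the same set the notebook checks
OPINION_TAGS = {"JJ", "JJR", "JJS", "RB", "RBR", "RBS"}
# Assumption: noun tags used for the optional noun filter
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def infrequent_candidates(tagged_words, known_features, nouns_only=True):
    """Return words next to an adjective/adverb that contain no known feature."""
    candidates = set()
    for j, (word, pos) in enumerate(tagged_words):
        # Skip words that already contain an extracted feature (substring match)
        if any(feature in word for feature in known_features):
            continue
        # Hypothetical tightening: optionally keep only nouns as candidates
        if nouns_only and pos not in NOUN_TAGS:
            continue
        # POS tags of the neighbours (None at the sentence boundaries)
        before_pos = tagged_words[j - 1][1] if j > 0 else None
        after_pos = tagged_words[j + 1][1] if j < len(tagged_words) - 1 else None
        if before_pos in OPINION_TAGS or after_pos in OPINION_TAGS:
            candidates.add(word)
    return sorted(candidates)


# Hypothetical pre-tagged sentence:
# "the picture is very sharp and the optical zoom works"
tagged = [("the", "DT"), ("picture", "NN"), ("is", "VBZ"), ("very", "RB"),
          ("sharp", "JJ"), ("and", "CC"), ("the", "DT"), ("optical", "JJ"),
          ("zoom", "NN"), ("works", "VBZ")]

with_noun_filter = infrequent_candidates(tagged, ["picture"])
without_filter = infrequent_candidates(tagged, ["picture"], nouns_only=False)
print(with_noun_filter)
print(without_filter)
```

Without the noun filter, function words and the opinion words themselves are also collected, which is consistent with the large number of unlabeled features reported for `final_features`.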

#### Features Sentiment Analysis

This step assigns a polarity to the features of every product. We use the `SentimentIntensityAnalyzer` class from VADER, a lexicon- and rule-based sentiment analysis tool, to compute a compound score for every review sentence. A sentence with a compound score above 0 is treated as positive, and any other sentence is treated as negative. Each sentence, together with its labeled features, extracted features, and predicted polarity, is stored per product in `sentiment_results`, and the positive and negative counts are accumulated per feature.
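The accumulation of per-feature polarity counts can be sketched without running VADER itself. In the block below the compound scores are hypothetical floats standing in for VADER's output, and the function names are illustrative. It also shows a variation on the notebook's rule: instead of treating every score that is not above 0 as negative, it uses a neutral band of ±0.05, the thresholds suggested in VADER's documentation.

```python
from collections import defaultdict


def label_from_compound(score, threshold=0.05):
    """Map a compound score in [-1, 1] to a polarity label with a neutral band."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def tally_feature_polarity(scored_sentences, threshold=0.05):
    """Count polarity labels per feature from (features, compound_score) pairs."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0, "neutral": 0})
    for features, score in scored_sentences:
        label = label_from_compound(score, threshold)
        # Every feature mentioned in the sentence inherits the sentence label
        for feature in features:
            counts[feature][label] += 1
    return dict(counts)


# Hypothetical (features, compound score) pairs, not taken from the dataset
scored = [
    (["picture", "zoom"], 0.62),
    (["battery"], -0.48),
    (["picture"], 0.0),
]
tallies = tally_feature_polarity(scored)
print(tallies)
```

Keeping neutral sentences in a separate bucket avoids counting purely factual sentences as negative opinions, at the cost of a third category in the summary.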

### 5. Sentiment Analysis Evaluation

The final step in this iteration of the pipeline evaluates the performance of the sentiment analysis. The code compares the predicted sentiment of each sentence with the labeled sentiment of the features in that sentence. For each product it counts true positives (the labeled feature is in `final_features` and the predicted sentiment matches the label), false positives (the feature is in `final_features` but the sentiments differ), and false negatives (the labeled feature is not in `final_features`), and then computes precision and recall. The results are stored in the `evaluation_results` dictionary and printed with the `print_results_table` function.

For the products with complete rows in the results table, precision ranges from 0.73 to 0.88 and recall from 0.68 to 0.78.

In [17]:
def lemmatize_labels(labels):
    doc = nlp(labels)
    lemmatized_labels = ' '.join([token.lemma_ for token in doc])
    return lemmatized_labels

def Features_Evaluation(User_Label_List,Features_List):
    New_results = {}
    for product, reviews in User_Label_List.items():        
        # Get the combined features for the current product
        product_features = Features_List.get(product, [])
        labeled_features = []
    
        tp_count = 0
        fp_count = 0
        fn_count = 0
        # Iterate over each review
        for review in reviews:
            for review_line in review:
                # Extract the user label               
                labels   = review_line[1]

                # Lemmatize user labels using spaCy
                user_labels = [lemmatize_labels(label[0].lower()) for label in labels]

                labeled_features.extend(user_labels)

        labeled_features = list(set(labeled_features))
        
        # Compare the labeled features with the extracted features
        for user_label in labeled_features:
            if user_label in product_features:
                # Labeled feature that was also extracted
                tp_count += 1
            else:
                # Labeled feature that was not extracted (counted here as fp)
                fp_count += 1
    
        # Count extracted features that have no matching label (counted here as fn)
        for product_feature in product_features:
            if product_feature not in labeled_features:
                fn_count += 1
    
        # Calculate Precision and Recall
        precision = round(tp_count / (tp_count + fp_count),2) if (tp_count + fp_count) > 0 else 0
        recall = round(tp_count / (tp_count + fn_count),2) if (tp_count + fn_count) > 0 else 0
    
        # Store the metrics for this product
        New_results[product] = {
            'tp_count': tp_count,
            'fp_count': fp_count,
            'fn_count': fn_count,
            'precision': precision,
            'recall': recall
        }

    print_results_table(New_results)
    
def print_results_table(results_dict):
    header = ["Product", "True Positives", "False Positives", "False Negatives", "Precision", "Recall"]

    # Extract data for each column
    columns = [header] + [[product, New_results['tp_count'], New_results['fp_count'], New_results['fn_count'], New_results['precision'], New_results['recall']] for product, New_results in results_dict.items()]
    
    # Calculate column widths
    col_widths = [max(len(str(item)) for item in col) + 1 for col in zip(*columns)]
    
    # Print header
    print(" | ".join(f"{col.ljust(width)}" for col, width in zip(header, col_widths)))
    print("-" * (sum(col_widths) + len(header) - 1))
    
    # Print data
    for i, row in enumerate(columns):
        if i == 0:
            continue  # Skip printing header again for data lines
        print(" | ".join(f"{str(item).ljust(width)}" for item, width in zip(row, col_widths)))




In [18]:
Features_Evaluation(result,top_features)

Product                                    | True Positives  | False Positives  | False Negatives  | Precision  | Recall 
---------------------------------------------------------------------------------------------------------------
Apex AD2600 Progressive-scan DVD player    | 24              | 91               | 21               | 0.21       | 0.53   
Canon G3                                   | 25              | 80               | 54               | 0.24       | 0.32   
Creative Labs Nomad Jukebox Zen Xtra 40GB  | 22              | 164              | 16               | 0.12       | 0.58   
Nikon coolpix 4300                         | 22              | 53               | 95               | 0.29       | 0.19   
Nokia 6610                                 | 34              | 75               | 42               | 0.31       | 0.45   
Computer                                   | 17              | 133              | 23               | 0.11       | 0.42   
Router                            

In [19]:
Features_Evaluation(result,combined_features)

Product                                    | True Positives  | False Positives  | False Negatives  | Precision  | Recall 
---------------------------------------------------------------------------------------------------------------
Apex AD2600 Progressive-scan DVD player    | 21              | 94               | 20               | 0.18       | 0.51   
Canon G3                                   | 23              | 82               | 49               | 0.22       | 0.32   
Creative Labs Nomad Jukebox Zen Xtra 40GB  | 22              | 164              | 16               | 0.12       | 0.58   
Nikon coolpix 4300                         | 20              | 55               | 71               | 0.27       | 0.22   
Nokia 6610                                 | 29              | 80               | 41               | 0.27       | 0.41   
Computer                                   | 17              | 133              | 22               | 0.11       | 0.44   
Router                            

### Extract Infrequent Features

In [20]:
infrequent_features = {}
for product, reviews in result_preproc.items():
    infrequent_per_product = []
    for review in reviews:
        for sentence in review:
            words = sentence[-1]

            for j, (word, pos) in enumerate(words):
                # Only consider words that contain none of combined_features[product]
                if all(feature not in word for feature in combined_features[product]):
                    # POS tags of the neighbouring words (None at sentence boundaries)
                    before_pos = words[j - 1][1] if j > 0 else None
                    after_pos  = words[j + 1][1] if j < len(words) - 1 else None

                    # Keep the word if an adjective or adverb precedes or follows it
                    opinion_tags = ['JJ', 'JJR', 'JJS', 'RB', 'RBR', 'RBS']
                    if before_pos in opinion_tags or after_pos in opinion_tags:
                        infrequent_per_product.append(word)

    infrequent_features[product] = list(set(infrequent_per_product))

final_features = {}
for product, pruned_features in combined_features.items():
    final_features[product] = infrequent_features[product] + pruned_features

In [21]:
Features_Evaluation(result,final_features)

Product                                    | True Positives  | False Positives  | False Negatives  | Precision  | Recall 
---------------------------------------------------------------------------------------------------------------
Apex AD2600 Progressive-scan DVD player    | 57              | 58               | 627              | 0.5        | 0.08   
Canon G3                                   | 55              | 50               | 678              | 0.52       | 0.08   
Creative Labs Nomad Jukebox Zen Xtra 40GB  | 102             | 84               | 1186             | 0.55       | 0.08   
Nikon coolpix 4300                         | 32              | 43               | 487              | 0.43       | 0.06   
Nokia 6610                                 | 65              | 44               | 607              | 0.6        | 0.1    
Computer                                   | 74              | 76               | 653              | 0.49       | 0.1    
Router                            

### Features Sentiment Analysis

In [22]:
sentiment_results  = {}
feature_sentiments = {}

# Create the VADER analyzer once and reuse it for every sentence
sid = SentimentIntensityAnalyzer()

for product, reviews in result_preproc.items():
    sentiment_results[product] = []
    feature_sentiments[product] = {}

    # Iterate over each review
    for review in reviews:
        for sentence_info in review:
            # Use VADER sentiment analysis to get the compound sentiment score
            score = sid.polarity_scores(sentence_info[0])['compound']

            # A compound score of exactly 0 (neutral) is counted as negative
            sentiment = 'positive' if score > 0 else 'negative'

            sentiment_results[product].append([sentence_info + [sentiment]])

            words = sentence_info[-2]
            for word in words:
                if word:  # Check if word is not an empty list
                    feature = word[0]
                    if feature not in feature_sentiments[product]:
                        feature_sentiments[product][feature] = {'positive': 0, 'negative': 0}

                    if sentiment == 'positive':
                        feature_sentiments[product][feature]['positive'] += 1
                    else:
                        feature_sentiments[product][feature]['negative'] += 1

    print(f"Product:         {product}")
    for feature, sentiment_counts in feature_sentiments[product].items():
        print(f"          Feature: {feature}")
        print(f"                      Positive: {sentiment_counts['positive']}")
        print(f"                      Negative: {sentiment_counts['negative']}\n\n")


Product:         Apex AD2600 Progressive-scan DVD player
          Feature: p button
                      Positive: 1
                      Negative: 0


          Feature: dvd player
                      Positive: 9
                      Negative: 3


          Feature: player
                      Positive: 53
                      Negative: 47


          Feature: sound
                      Positive: 6
                      Negative: 1


          Feature: price
                      Positive: 4
                      Negative: 3


          Feature: look
                      Positive: 9
                      Negative: 2


          Feature: panel button layout
                      Positive: 1
                      Negative: 0


          Feature: feature
                      Positive: 7
                      Negative: 2


          Feature: forward
                      Positive: 1
                      Negative: 1


          Feature: rewind
                      Positive: 0


Product:         Canon G3
          Feature: canon powershot g3
                      Positive: 1
                      Negative: 0


          Feature: use
                      Positive: 11
                      Negative: 0


          Feature: picture
                      Positive: 10
                      Negative: 5


          Feature: picture quality
                      Positive: 9
                      Negative: 0


          Feature: camera
                      Positive: 45
                      Negative: 11


          Feature: feature
                      Positive: 6
                      Negative: 1


          Feature: option
                      Positive: 1
                      Negative: 0


          Feature: dial
                      Positive: 1
                      Negative: 1


          Feature: viewfinder
                      Positive: 3
                      Negative: 9


          Feature: speed
                      Positive: 1
                      Neg

Product:         Creative Labs Nomad Jukebox Zen Xtra 40GB
          Feature: affordability
                      Positive: 0
                      Negative: 1


          Feature: bang-for-the-buck
                      Positive: 0
                      Negative: 1


          Feature: size
                      Positive: 16
                      Negative: 10


          Feature: weight
                      Positive: 7
                      Negative: 7


          Feature: navigational system
                      Positive: 1
                      Negative: 0


          Feature: sound
                      Positive: 29
                      Negative: 5


          Feature: screen
                      Positive: 13
                      Negative: 3


          Feature: deal
                      Positive: 3
                      Negative: 0


          Feature: wma file
                      Positive: 0
                      Negative: 1


          Feature: software
                 

Product:         Nikon coolpix 4300
          Feature: camera
                      Positive: 34
                      Negative: 7


          Feature: picture
                      Positive: 13
                      Negative: 4


          Feature: macro
                      Positive: 2
                      Negative: 1


          Feature: size
                      Positive: 7
                      Negative: 1


          Feature: weight
                      Positive: 2
                      Negative: 0


          Feature: feature
                      Positive: 7
                      Negative: 0


          Feature: manual
                      Positive: 1
                      Negative: 1


          Feature: auto focus
                      Positive: 1
                      Negative: 0


          Feature: scene mode
                      Positive: 6
                      Negative: 0


          Feature: rechargable battery
                      Positive: 0
                  

Product:         Computer
          Feature: inexpensive
                      Positive: 0
                      Negative: 1


          Feature: monitor
                      Positive: 19
                      Negative: 16


          Feature: screen
                      Positive: 6
                      Negative: 2


          Feature: picture quality
                      Positive: 0
                      Negative: 2


          Feature: Display
                      Positive: 0
                      Negative: 1


          Feature: colors
                      Positive: 9
                      Negative: 1


          Feature: size
                      Positive: 4
                      Negative: 0


          Feature: computer
                      Positive: 5
                      Negative: 2


          Feature: quality
                      Positive: 4
                      Negative: 0


          Feature: keyboard
                      Positive: 4
                      Negativ

Product:         Router
          Feature: price
                      Positive: 5
                      Negative: 1


          Feature: item
                      Positive: 1
                      Negative: 0


          Feature: customer service
                      Positive: 4
                      Negative: 2


          Feature: set up
                      Positive: 7
                      Negative: 3


          Feature: speed
                      Positive: 4
                      Negative: 5


          Feature: connection
                      Positive: 0
                      Negative: 4


          Feature: firmware upgrade
                      Positive: 2
                      Negative: 0


          Feature: netgear genie tool
                      Positive: 1
                      Negative: 0


          Feature: signal strength
                      Positive: 3
                      Negative: 0


          Feature: router
                      Positive: 11
          

Product:         Speaker
          Feature: speakers
                      Positive: 33
                      Negative: 8


          Feature: sound
                      Positive: 41
                      Negative: 9


          Feature: bass
                      Positive: 14
                      Negative: 3


          Feature: sound quality
                      Positive: 6
                      Negative: 4


          Feature: portable
                      Positive: 3
                      Negative: 0


          Feature: remote
                      Positive: 1
                      Negative: 1


          Feature: Sound quality
                      Positive: 3
                      Negative: 0


          Feature: Tuner quality
                      Positive: 1
                      Negative: 0


          Feature: Looks
                      Positive: 2
                      Negative: 0


          Feature: Setup
                      Positive: 0
                      Negati

Product:         Canon PowerShot SD500
          Feature: SD500
                      Positive: 3
                      Negative: 0


          Feature: design
                      Positive: 1
                      Negative: 0


          Feature: image-processing system
                      Positive: 1
                      Negative: 0


          Feature: image
                      Positive: 4
                      Negative: 1


          Feature: pricey
                      Positive: 0
                      Negative: 1


          Feature: LCD
                      Positive: 2
                      Negative: 7


          Feature: camera
                      Positive: 20
                      Negative: 6


          Feature: pictures
                      Positive: 5
                      Negative: 4


          Feature: manual control
                      Positive: 1
                      Negative: 0


          Feature: Canon
                      Positive: 1
               

Product:         Diaper Champ
          Feature: Works
                      Positive: 1
                      Negative: 1


          Feature: Diaper Champ
                      Positive: 19
                      Negative: 3


          Feature: diapers
                      Positive: 2
                      Negative: 2


          Feature: odors
                      Positive: 3
                      Negative: 1


          Feature: refills
                      Positive: 9
                      Negative: 3


          Feature: bags
                      Positive: 0
                      Negative: 3


          Feature: bag
                      Positive: 0
                      Negative: 1


          Feature: smelling
                      Positive: 0
                      Negative: 1


          Feature: diaper
                      Positive: 3
                      Negative: 2


          Feature: product
                      Positive: 9
                      Negative: 3


     

Product:         Hitachi router
          Feature: performed
                      Positive: 1
                      Negative: 0


          Feature: adjustment
                      Positive: 9
                      Negative: 4


          Feature: collet
                      Positive: 3
                      Negative: 3


          Feature: use
                      Positive: 6
                      Negative: 2


          Feature: price
                      Positive: 9
                      Negative: 8


          Feature: power
                      Positive: 8
                      Negative: 6


          Feature: speed
                      Positive: 5
                      Negative: 3


          Feature: runs
                      Positive: 2
                      Negative: 1


          Feature: cuts
                      Positive: 0
                      Negative: 1


          Feature: bigger
                      Positive: 0
                      Negative: 1


          F

Product:         ipod
          Feature: features
                      Positive: 1
                      Negative: 3


          Feature: sound
                      Positive: 6
                      Negative: 0


          Feature: equalizer
                      Positive: 0
                      Negative: 2


          Feature: bass
                      Positive: 2
                      Negative: 0


          Feature: treble
                      Positive: 1
                      Negative: 0


          Feature: battery
                      Positive: 1
                      Negative: 7


          Feature: Replaceable battery
                      Positive: 0
                      Negative: 1


          Feature: flip side
                      Positive: 1
                      Negative: 0


          Feature: case
                      Positive: 0
                      Negative: 1


          Feature: FM tuner
                      Positive: 0
                      Negative: 1



Product:         Linksys Router
          Feature: router
                      Positive: 13
                      Negative: 10


          Feature: setup
                      Positive: 7
                      Negative: 4


          Feature: installation
                      Positive: 3
                      Negative: 2


          Feature: install
                      Positive: 2
                      Negative: 2


          Feature: works
                      Positive: 6
                      Negative: 2


          Feature: Wizard
                      Positive: 0
                      Negative: 1


          Feature: wizard
                      Positive: 0
                      Negative: 4


          Feature: configuration
                      Positive: 1
                      Negative: 1


          Feature: CD
                      Positive: 2
                      Negative: 5


          Feature: user manual
                      Positive: 0
                      Negativ

Product:         MicroMP3
          Feature: instruction
                      Positive: 1
                      Negative: 1


          Feature: touch-pad
                      Positive: 0
                      Negative: 2


          Feature: navigate
                      Positive: 0
                      Negative: 1


          Feature: EXPLAIN
                      Positive: 1
                      Negative: 0


          Feature: navigating
                      Positive: 0
                      Negative: 1


          Feature: sound
                      Positive: 22
                      Negative: 4


          Feature: bass
                      Positive: 1
                      Negative: 3


          Feature: machine
                      Positive: 1
                      Negative: 0


          Feature: Zen Micro
                      Positive: 4
                      Negative: 1


          Feature: store
                      Positive: 0
                      Negative: 1


Product:         Nokia 6600
          Feature: phone
                      Positive: 36
                      Negative: 28


          Feature: battery life
                      Positive: 4
                      Negative: 11


          Feature: 6600
                      Positive: 10
                      Negative: 5


          Feature: bluetooth
                      Positive: 7
                      Negative: 10


          Feature: LCD
                      Positive: 1
                      Negative: 0


          Feature: camera quality
                      Positive: 1
                      Negative: 0


          Feature: PDA
                      Positive: 2
                      Negative: 0


          Feature: w
                      Positive: 0
                      Negative: 1


          Feature: priced
                      Positive: 0
                      Negative: 1


          Feature: use
                      Positive: 2
                      Negative: 1


       

Product:         norton
          Feature: software
                      Positive: 0
                      Negative: 1


          Feature: Norton products
                      Positive: 0
                      Negative: 1


          Feature: McAfee Anti-Virus 8
                      Positive: 0
                      Negative: 1


          Feature: Norton
                      Positive: 3
                      Negative: 9


          Feature: Installation
                      Positive: 0
                      Negative: 1


          Feature: NIS
                      Positive: 1
                      Negative: 0


          Feature: Manual shutdown
                      Positive: 0
                      Negative: 1


          Feature: product
                      Positive: 6
                      Negative: 14


          Feature: uninstall
                      Positive: 1
                      Negative: 2


          Feature: NIS 2003
                      Positive: 1
         

### Sentiment Analysis Evaluation

In [23]:
evaluation_results = {}
for product, reviews in sentiment_results.items():
    tp_count = 0
    fp_count = 0
    fn_count = 0
    for review in reviews:
        for sentence in review:
            labelled_features = sentence[1]
            sentiment = sentence[-1]

            for labelled_feature in labelled_features:
                f = lemmatize_labels(labelled_feature[0].lower())
                if f in final_features[product]:
                    labelled_sentiment = 'positive' if labelled_feature[1][0] == '+' else 'negative'
                    if sentiment == labelled_sentiment:
                        tp_count += 1
                    else:
                        fp_count += 1
                else:
                    fn_count += 1
          
    # Calculate Precision and Recall
    precision = round(tp_count / (tp_count + fp_count), 2) if (tp_count + fp_count) > 0 else 0
    recall = round(tp_count / (tp_count + fn_count), 2) if (tp_count + fn_count) > 0 else 0


    # Store the metrics for this product
    evaluation_results[product] = {
            'tp_count': tp_count,
            'fp_count': fp_count,
            'fn_count': fn_count,
            'precision': precision,
            'recall': recall
    }

print_results_table(evaluation_results)

                

Product                                    | True Positives  | False Positives  | False Negatives  | Precision  | Recall 
---------------------------------------------------------------------------------------------------------------
Apex AD2600 Progressive-scan DVD player    | 280             | 72               | 78               | 0.8        | 0.78   
Canon G3                                   | 170             | 41               | 74               | 0.81       | 0.7    
Creative Labs Nomad Jukebox Zen Xtra 40GB  | 508             | 186              | 152              | 0.73       | 0.77   
Nikon coolpix 4300                         | 129             | 18               | 56               | 0.88       | 0.7    
Nokia 6610                                 | 220             | 53               | 65               | 0.81       | 0.77   
Computer                                   | 202             | 55               | 97               | 0.79       | 0.68   
Router                            

### Future Work

Several directions could improve sentiment analysis and feature-level sentiment extraction for product reviews.

On the modeling side, advanced deep learning models such as BERT and GPT could refine sentiment analysis by capturing richer contextual information. Fine-tuning sentiment analysis models on domain-specific datasets, and adding more nuanced approaches such as sentiment strength analysis and aspect-based sentiment analysis, may provide richer insight into user opinions. Better handling of negation and of context-dependent sentiment could further improve accuracy.

On the feature extraction side, tuning the parameters of the Apriori algorithm could improve the association step and the handling of implicit and explicit features by surfacing more meaningful frequent patterns. Word Sense Disambiguation (WSD) techniques could help distinguish between the different meanings of a word, which would refine the treatment of implicit features and improve sentiment analysis accuracy. External knowledge sources, such as domain-specific ontologies and expanded sentiment lexicons, could also improve the model's understanding of product features and sentiments.

On the evaluation side, additional metrics such as the F1 score and confusion matrices would give a more complete picture of model performance. Finally, active learning strategies, in which users are encouraged to provide explicit sentiment labels that feed model updates, could help the system adapt to evolving language patterns.
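As a small illustration of the F1 score mentioned above, the sketch below computes precision, recall, and their harmonic mean from raw counts. The function name `precision_recall_f1` is hypothetical, and the counts are the ones reported for the Apex AD2600 row of the sentiment evaluation table.

```python
def precision_recall_f1(tp, fp, fn):
    """Return precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # F1 is the harmonic mean of precision and recall
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Counts from the Apex AD2600 row of the sentiment evaluation table
p, r, f1 = precision_recall_f1(tp=280, fp=72, fn=78)
print(round(p, 2), round(r, 2), round(f1, 2))
```

A single F1 value per product makes it easier to compare configurations in which precision and recall move in opposite directions, as they do when infrequent features are added.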

### References

Hu, M., Liu, B., 2004. Mining and summarizing customer reviews. *KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining*, 22 August 2004, Seattle, Washington, USA. New York, United States: Association for Computing Machinery, pp. 168-177.

Honnibal, M., Montani, I., 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

Hu, M., Liu, B., 2004. Mining Opinion Features in Customer Reviews. *AAAI'04: Proceedings of the 19th national conference on Artificial intelligence*, 25-29 July 2004, San Jose, California, USA. Palo Alto, California, USA: AAAI Press, pp. 755-760.

Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. *Proceedings of the 20th International Conference on Very Large Data Bases*.