# Task and Data Analysis

My opinion miner processes consumer reviews from several product categories through an NLP pipeline built for unstructured text. The pipeline first parses and preprocesses the raw review files into a structured pandas DataFrame. Reviews are split on their embedded sentiment tags, and the text analysis phase uses Noun Phrase Chunking to isolate candidate phrases and a Concrete Noun Filter to keep nouns that refer to physical product attributes.

The reviews vary widely in content and style: they mix technical specifications with personal experience, use colloquial terms, and range from a single line to several lines. To refine the extracted features, a Similarity Buffer measures each candidate's semantic proximity to the identified product name, so that only features related to that product are kept. This matters because product models appear in review titles only inconsistently, which makes it harder to associate a review with a specific product.

For handling the diverse and sometimes ambiguous nature of product identification, I employ sophisticated text classification models that are trained to detect and categorize product mentions accurately. Each identified feature then undergoes sentiment classification where the context surrounding each feature is analyzed to determine sentiment scores, which are subsequently aggregated into a product-feature dictionary.

However, the evaluation process faces significant challenges due to the lack of explicit sentiment tags in much of the data, and where tags exist, they often display inconsistency in format and correlation with textual sentiment. This discrepancy necessitates the development of advanced models and custom rules to align and calibrate the sentiment analysis, ensuring accuracy and reliability in results. Through continuous learning mechanisms that update models with new data and user feedback, the system remains adaptable and current with evolving consumer language and product features.

Overall, by integrating advanced NLP techniques and machine learning models, my opinion miner is capable of handling the complexities of a multi-product review environment, providing detailed and actionable insights that aid businesses in understanding consumer sentiments across a broad spectrum of products. This facilitates strategic decision-making based on robust data-driven analytics, enhancing product design, marketing strategies, and customer satisfaction.

In [21]:
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import spacy
import re
import warnings
import copy

warnings.filterwarnings('ignore', message='Discarded redundant search for Synset')
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt') 
nltk.download('stopwords')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('sentiwordnet')
nltk.download('wordnet')
nltk.download('omw-1.4')
nlp = spacy.load("en_core_web_sm")

from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
Word2Vec_corpus = api.load('text8') 
Word2Vec_model = Word2Vec(Word2Vec_corpus) 
glove_model = api.load("glove-twitter-25") 

from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk
from nltk.chunk import ne_chunk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter

from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.utils import deaccent
from sklearn.decomposition import LatentDirichletAllocation as LDA
from spacy.matcher import Matcher

from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


In [22]:
files = ['Data/Customer_review_data/Apex AD2600 Progressive-scan DVD player.txt',
         'Data/Customer_review_data/Canon G3.txt',
         'Data/Customer_review_data/Creative Labs Nomad Jukebox Zen Xtra 40GB.txt',
         'Data/Customer_review_data/Nikon coolpix 4300.txt',
         'Data/Customer_review_data/Nokia 6610.txt',
         'Data/CustomerReviews-3_domains/Computer.txt',
         'Data/CustomerReviews-3_domains/Router.txt',
         'Data/CustomerReviews-3_domains/Speaker.txt',
         'Data/Reviews-9-products/Canon PowerShot SD500.txt',
         'Data/Reviews-9-products/Canon S100.txt',
         'Data/Reviews-9-products/Diaper Champ.txt',
         'Data/Reviews-9-products/Hitachi router.txt',
         'Data/Reviews-9-products/ipod.txt',
         'Data/Reviews-9-products/Linksys Router.txt',
         'Data/Reviews-9-products/MicroMP3.txt',
         'Data/Reviews-9-products/Nokia 6600.txt',
         'Data/Reviews-9-products/norton.txt']

# Data Pre-Processing

#### Data Ingestion and Initial Processing

The process begins with reading the data files, each of which may contain multiple reviews with varying structures. A Python function, `read_file`, opens a file and splits its content into lines, one review per line. Some files begin with a marker line of 77 asterisks followed by header text that is not part of the review content. When that marker is found, the first 11 lines are skipped so that only review text is processed further.

Reviews within these files are then split using '##' as a delimiter to segregate tags from the main content, which allows for the extraction of embedded metadata or sentiment tags when present. Each piece of the review, along with its associated tags, is stored in a structured format within a pandas DataFrame, facilitating subsequent manipulations and analyses.
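The tag and content split described above can be sketched on a few in-memory lines. This is a minimal illustration, not the notebook's `read_file`: the helper name `split_tagged_line` and the sample lines are hypothetical, and unlike `read_file` the sketch drops empty tag strings.

```python
def split_tagged_line(line):
    """Split one review line into (tags, content) on the first '##'."""
    head, sep, tail = line.partition('##')
    if sep:  # a '##' delimiter was found
        tags = [tag.strip() for tag in head.split(',') if tag.strip()]
        return tags, tail.strip()
    # No delimiter: the whole line is content and there are no tags
    return [], line.strip()


# Hypothetical sample lines, for illustration only
sample_lines = [
    "battery, screen##The battery lasts all day and the screen is sharp.",
    "##Arrived two days late.",
    "A line without any delimiter.",
]

for line in sample_lines:
    print(split_tagged_line(line))
```

Using `str.partition` keeps any later `##` inside the content instead of discarding it, which may or may not be the behaviour wanted for this dataset.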

#### Advanced Text Processing Techniques

Once the initial ingestion is complete, the reviews undergo a series of sophisticated text processing steps encapsulated within the `pre_process_review` function. This function is designed to handle various nuances of the text, including title concatenation where necessary and preservation of the textual integrity for reviews that continue across multiple lines.
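Title concatenation can be pictured as a small state machine: a title line is remembered and attached to the next review line. The sketch below assumes the `[t]` title marker used in the code cells; the function name `attach_titles` and the sample lines are hypothetical.

```python
def attach_titles(lines, marker='[t]'):
    """Append each title line to the review line that follows it."""
    merged = []
    pending_title = None  # title waiting for the next review line
    for line in lines:
        if line.startswith(marker):
            pending_title = line[len(marker):].strip()
        elif pending_title is not None:
            merged.append(f"{line} {pending_title}")
            pending_title = None
        else:
            merged.append(line)
    return merged


# Hypothetical sample, for illustration only
sample = [
    "[t]great little camera",
    "##I bought this last month.",
    "##The zoom works well.",
]
print(attach_titles(sample))
```

Only the first review after a title receives it, so later lines of the same review are left unchanged.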

##### Preservation of Semantic Structures

To maintain the semantic integrity of phrases within the reviews, the `preserve_compound_phrases` function is applied. This function utilizes spaCy, an advanced NLP library, to parse the text and identify compound nouns and adjectival modifiers, which are crucial for understanding the context and sentiment related to specific product features. These phrases are then reconstructed with underscores replacing spaces to ensure they are treated as single tokens in subsequent analyses, preventing the loss of their semantic unity.

##### Enhancement of Tokenization and Filtering

After compound phrases have been preserved, the `pre_processing_controller` function applies several layers of tokenization and filtering. The text is first tokenized so that each word or phrase can be analyzed individually. The tokens then pass through two filters:
1. A 'Soft Filtered Review' captures tokens that either form part of the identified compound phrases or are standalone alphabetic words.
2. A 'Filtered Review' applies a more stringent filter, additionally removing common stop words to focus on the more meaningful terms relevant to sentiment analysis.

These tokens are then reassembled into strings that form the basis for deeper linguistic analysis, including lemmatization and stemming. Lemmatization reduces words to their base or dictionary form, whereas stemming strips words further to their root forms, which groups more inflected variants together at the cost of some precision.
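The difference between the two filters can be seen on a single token list. This is a minimal sketch: the stop-word set is a small hand-written stand-in for a full stop-word list, and the function names `soft_filter` and `strict_filter` are hypothetical.

```python
# Small hand-written stop-word set, standing in for a full stop-word list
STOP_WORDS = {"the", "is", "a", "and", "but", "it", "this"}


def soft_filter(tokens):
    """Keep underscore-joined phrases and purely alphabetic words, lower-cased."""
    return [t.lower() for t in tokens if "_" in t or t.isalpha()]


def strict_filter(tokens, stop_words=STOP_WORDS):
    """As soft_filter, but also drop stop words (phrases are always kept)."""
    return [
        t.lower()
        for t in tokens
        if "_" in t or (t.isalpha() and t.lower() not in stop_words)
    ]


tokens = ["The", "battery_life", "is", "great", ",", "but", "the", "menu", "is", "slow", "!"]
print(soft_filter(tokens))    # punctuation removed, stop words kept
print(strict_filter(tokens))  # punctuation and stop words removed
```

The soft version keeps words such as "but" and "is", which can matter later when negations and intensifiers are inspected.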

In [23]:
def read_file(file_path):
    # Initialise an empty list to store reviews and their associated tags
    tagged_reviews = []
    
    # Open the specified file in read mode
    with open(file_path, 'r') as file:
        # Read the entire content of the file
        text = file.read()
        # Split the text into lines and remove any leading/trailing whitespace
        reviews = text.strip().split('\n')

        # Check if the file starts with a specific marker line of asterisks
        if reviews[0] == '*' * 77:
            # Skip the first 11 lines if the marker is present
            reviews = reviews[11:]

        # Attach review titles to the following review (pre_process_review is defined in the next cell)
        reviews = pre_process_review(reviews)
        
        # Loop over each review in the processed list
        for review in reviews:
            # Split each review on '##' to separate tags from the content
            parts = review.split('##')
            
            # If the split results in more than one part, process tags and content
            if len(parts) > 1:
                # Split the first part by commas to get tags and strip spaces
                tags = parts[0].strip().split(',')
                # Take the second part as the review content and strip spaces
                content = parts[1].strip() 
            else:
                # If no '##' is found, set tags as empty and content to the whole line
                tags = []
                content = review.strip()
                
            # Append a dictionary of tags and review content to the list
            tagged_reviews.append({'Tags': tags, 'Review': content})

        # Convert the list of tagged reviews into a pandas DataFrame
        df = pd.DataFrame(tagged_reviews)
        # Store the name of the file as an attribute of the DataFrame
        df.attrs['title'] = file_path.split('/')[-1]

        # Return the DataFrame containing the tagged reviews
        return df

In [24]:
# Pre-processing functions

def pre_process_review(reviews):
    processed_reviews = []
    title_switch = False  # Indicates whether next review should append a title
    title = ''

    for review in reviews:
        if review.startswith('[t]'):  # Checks for title marker
            title = review[3:]  # Stores the title
            title_switch = True
        elif title_switch:  # Appends title to the review if flag is true
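            # Separate the review and its title with a space so the last and first words do not merge
            processed_reviews.append(review + ' ' + title.strip())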
            title_switch = False
            title = ''
        else:
            processed_reviews.append(review)  # Adds review as is if no title is pending

    return processed_reviews


def preserve_compound_phrases(text):
    doc = nlp(text)  # Process text with NLP model
    processed_tokens = []

    for token in doc:
        if token.dep_ in ('compound', 'amod') and token.head.pos_ == 'NOUN':
            compound_phrase = token.text + "_" + token.head.text
            if compound_phrase not in processed_tokens:
                processed_tokens.append(compound_phrase)
        elif token.pos_ == 'NOUN' and any(child.dep_ == 'compound' for child in token.children):
            continue  # Avoids duplicating compound nouns
        else:
            processed_tokens.append(token.text)

    return processed_tokens


def chuncking_post_process(text):
    words = text.split()
    processed_words = []

    i = 0
    while i < len(words):
        if '_' in words[i]:  # Handles compound phrases with underscores
            while i < len(words) and '_' in words[i]:
                processed_words.append(words[i])
                i += 1
            continue  # Skips to next after processing a compound phrase
        processed_words.append(words[i])
        i += 1
            
    return ' '.join(processed_words)  # Returns the processed text


In [25]:
def pre_processing_controller(df):
    
    # Convert lists to strings in the 'Review' column, if necessary
    df['Tokenised_Review'] = df['Review'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)
    
    # Apply compound phrase preservation to the 'Tokenised_Review' column
    df['Tokenised_Review'] = df['Tokenised_Review'].apply(lambda review: preserve_compound_phrases(review))
    
    # Create a 'Soft_Filtered_Review' column: keep compound phrases and alphabetic tokens, lower-cased
    df['Soft_Filtered_Review'] = df['Tokenised_Review'].apply(lambda tokens: [token.lower() for token in tokens if ("_" in token) or token.isalpha()])
    
    # Convert lists of tokens back to strings in the 'Soft_Filtered_Review_String' column
    df['Soft_Filtered_Review_String'] = df['Soft_Filtered_Review'].apply(lambda tokens: ' '.join(tokens))
    
    # Apply chunking post-process to the soft filtered review strings
    df['Soft_Filtered_Review_String'] = df['Soft_Filtered_Review_String'].apply(chuncking_post_process)
    
    # Create a 'Filtered_Review' column applying a stricter filtering with stop words check
    df['Filtered_Review'] = df['Tokenised_Review'].apply(lambda tokens: [token.lower() for token in tokens if ("_" in token) or (token.isalpha() and token.lower() not in stop_words)])
    
    # Convert lists of filtered tokens back to strings in the 'Filtered_Review_String' column
    df['Filtered_Review_String'] = df['Filtered_Review'].apply(lambda tokens: ' '.join(tokens))
    
    # Lemmatize the filtered review strings
    df['Lemmatised_Review_String'] = df['Filtered_Review_String'].apply(lambda review_string: " ".join([token.lemma_ for token in nlp(review_string)]))
    
    # Tokenize the lemmatized and filtered review strings
    df['Lemmatised_Tokenised_Filtered_Review'] = df['Lemmatised_Review_String'].apply(lambda review: word_tokenize(review))
    
    # Apply stemming to the tokenized words
    df['Stemmed_Review'] = df['Lemmatised_Tokenised_Filtered_Review'].apply(lambda tokens: [stemmer.stem(token) for token in tokens])
    
    # Convert lists of stemmed tokens back to strings in the 'Stemmed_Review_String' column
    df['Stemmed_Review_String'] = df['Stemmed_Review'].apply(lambda tokens: ' '.join(tokens))
    
    return df

# Product Feature Extraction

#### Feature Extraction Process

The feature extraction begins with the identification of key nouns and phrases that potentially represent product features. This is achieved through two main processes: **POS Noun Tagging** and **Noun Phrase Chunking**. POS (Part of Speech) Noun Tagging involves parsing reviews to tag parts of speech and extracting nouns, as these often represent features. This process not only identifies individual nouns but also tags them according to their grammatical types, enhancing the accuracy of feature identification.
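The counting step behind POS Noun Tagging can be sketched without a tagger by starting from tokens that are already tagged. The input below is hand-tagged toy data and the helper name `count_nouns` is hypothetical; a real run would obtain the tags from a POS tagger.

```python
from collections import Counter

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # Penn Treebank noun tags


def count_nouns(tagged_reviews, top_n=3):
    """Count lower-cased nouns across reviews given as lists of (word, tag) pairs."""
    counts = Counter(
        word.lower()
        for review in tagged_reviews
        for word, tag in review
        if tag in NOUN_TAGS
    )
    return counts.most_common(top_n)


# Hand-tagged toy input, for illustration only
tagged_reviews = [
    [("The", "DT"), ("battery", "NN"), ("drains", "VBZ"), ("quickly", "RB")],
    [("Great", "JJ"), ("battery", "NN"), ("and", "CC"), ("lens", "NN")],
    [("Battery", "NN"), ("is", "VBZ"), ("fine", "JJ")],
]
print(count_nouns(tagged_reviews))  # [('battery', 3), ('lens', 1)]
```

Lower-casing before counting merges "Battery" and "battery" into one candidate feature.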

Noun Phrase Chunking goes a step further by extracting coherent noun phrases using Natural Language Processing (NLP) techniques. This process involves analyzing the syntactic patterns in the text to capture phrases that are likely to represent product attributes. These extracted noun phrases are then normalized to lower case to maintain consistency across the dataset.

#### Refinement and Semantic Analysis

Following the initial extraction, the features undergo a refinement process using two distinct filters: the **Concrete Noun Filter** and the **Semantic Filter**. The Concrete Noun Filter assesses each noun or phrase to determine if it pertains to tangible product attributes. This is done by checking if the nouns align with concrete categories such as objects, devices, or artifacts, using semantic networks like WordNet to understand their hypernym relationships.
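The hypernym check can be illustrated on a toy taxonomy. The map below is invented for the example and is not WordNet data; the names `hypernym_chain` and `is_concrete` are hypothetical. The idea is the same as in the filter: walk upward through the ancestors of a word and stop as soon as a concrete category is found.

```python
# Toy hypernym map (child -> parent); a stand-in for WordNet, not real WordNet data
TOY_HYPERNYMS = {
    "lens": "device",
    "device": "instrumentality",
    "instrumentality": "artifact",
    "artifact": "object",
    "happiness": "feeling",
    "feeling": "state",
}

CONCRETE_INDICATORS = {"object", "artifact", "instrumentality", "container", "device"}


def hypernym_chain(word, hypernyms=TOY_HYPERNYMS):
    """Yield every ancestor of `word` by following parent links upward."""
    seen = set()  # guards against accidental cycles in the map
    current = hypernyms.get(word)
    while current is not None and current not in seen:
        seen.add(current)
        yield current
        current = hypernyms.get(current)


def is_concrete(word):
    """True if any ancestor of `word` is one of the concrete indicator categories."""
    return any(ancestor in CONCRETE_INDICATORS for ancestor in hypernym_chain(word))


print(is_concrete("lens"))       # True
print(is_concrete("happiness"))  # False
```

In WordNet a word usually has several senses, so a real filter has to decide whether one concrete sense is enough to keep the word.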

The Semantic Filter keeps only features that are relevant to the product being reviewed. Each candidate is compared with a reference term (in `similarity_filter`, the first entry of the candidate list) by averaging a Word2Vec similarity with a GloVe neighbour score. Candidates whose average falls below a threshold of 0.2 are dropped, and the rest are retained for further analysis.
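A minimal sketch of threshold-based similarity filtering follows, using made-up 2-D vectors in two embedding spaces. It is a simplification and not the notebook's `similarity_filter`: here both scores are plain cosine similarities, whereas the notebook takes its second score from a top-10 GloVe neighbour list. All names and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_by_average_similarity(reference, candidates, space_a, space_b, threshold=0.2):
    """Keep candidates whose mean similarity to `reference` over two spaces meets the threshold."""
    kept = []
    for word in candidates:
        if word not in space_a or word not in space_b:
            continue  # skip out-of-vocabulary words
        sim_a = cosine(space_a[reference], space_a[word])
        sim_b = cosine(space_b[reference], space_b[word])
        if (sim_a + sim_b) / 2 >= threshold:
            kept.append(word)
    return kept


# Made-up vectors, for illustration only
space_a = {"camera": (1.0, 0.0), "lens": (0.8, 0.6), "weather": (0.0, 1.0)}
space_b = {"camera": (1.0, 0.0), "lens": (0.6, 0.8), "weather": (-0.6, 0.8)}

print(filter_by_average_similarity("camera", ["lens", "weather", "tripod"], space_a, space_b))
```

With these vectors "lens" averages about 0.7 and is kept, "weather" averages about -0.3 and is dropped, and "tripod" is skipped because it is missing from both vocabularies.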

#### Feature-Sentiment Linkage and Dictionary Construction

Each identified and refined feature is then linked with its respective sentiment scores derived from the sentiment analysis phase of the opinion miner. This linkage is crucial as it allows for the aggregation of sentiments specifically associated with individual features. The result is a structured product-feature dictionary where each feature is associated with a compiled sentiment score, facilitating an organized review of consumer opinions.
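The product-feature dictionary and its update step can be sketched as follows. The nested structure mirrors the one described above; the helper names `build_feature_dict` and `tally`, the product name, and the observations are hypothetical.

```python
def build_feature_dict(product, features):
    """Create {product: {feature: {'positive': 0, 'negative': 0}}}."""
    return {product: {f: {"positive": 0, "negative": 0} for f in features}}


def tally(product_dict, observations):
    """Update counts from (feature, sentiment) pairs, with sentiment in {1, -1, 0}."""
    counts = product_dict[next(iter(product_dict))]  # single top-level product key
    for feature, sentiment in observations:
        if feature not in counts:
            continue  # ignore features that were not extracted for this product
        if sentiment == 1:
            counts[feature]["positive"] += 1
        elif sentiment == -1:
            counts[feature]["negative"] += 1
        # sentiment 0 is treated as unclassified and is not counted
    return product_dict


# Hypothetical data, for illustration only
product_dict = build_feature_dict("camera", ["lens", "battery"])
observations = [("lens", 1), ("battery", -1), ("lens", 1), ("battery", 0)]
tally(product_dict, observations)
print(product_dict)
```

Keeping separate positive and negative counts, rather than a single net score, preserves how contested a feature is.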

#### Data Structuring and Review Aggregation

To enhance the usability of the extracted data, the features and their corresponding sentiments are structured into a DataFrame, which organizes the data by product and feature. This structured format is essential for conducting detailed analysis and supports the aggregation of data across multiple reviews.

Additionally, the DataFrame is filtered and refined to ensure that only the most relevant and accurately tagged data is presented. This involves filtering out redundancies and ensuring that the data is presented in a clear and concise manner, making it easy for stakeholders to interpret and make informed decisions.

In [26]:
# Noun Extraction - Method 1  

def POS_Noun_Tagging(string_list):
    
    reviews = string_list.tolist()
    features = []
    
    for review in reviews:
        tokens = word_tokenize(review)
        tagged = pos_tag(tokens)
        # Extract nouns from the POS-tagged text, since nouns are likely feature names
        features.extend([word.lower() for word, tag in tagged if tag in ['NN', 'NNS', 'NNP', 'NNPS']])
    
    feature_counts = Counter(features)
    common_features = feature_counts.most_common(15)

    return common_features



def is_concrete_noun(word):
    """Check if a noun is concrete by examining its categories."""
    # Categories that indicate a concrete noun
    concrete_indicators = {'object', 'artifact', 'instrumentality', 'container', 'device'}
    # Get all noun meanings of the word
    synsets = wn.synsets(word, pos=wn.NOUN)
    # Check each meaning for relevant categories
    for synset in synsets:
        # Explore each category hierarchy for the word
        for hypernym in synset.closure(lambda s: s.hypernyms()):
            # Check if any category names indicate a concrete noun
            if concrete_indicators.intersection(set(hypernym.lemma_names())):
                return True
    return False





# def Noun_Phrase_Chuncking(string_list):
    
#     all_noun_phrases = []
    
#     for review in string_list:
#         doc = nlp(review)
#         noun_phrases = [chunk.text.lower() for chunk in doc.noun_chunks]
#         all_noun_phrases.extend(noun_phrases)
    
#     # Count the occurrences of each noun phrase

#     phrase_counts = Counter(all_noun_phrases).most_common(15)
    
#     # Display most common noun phrases
#     common_phrases = phrase_counts
#     # display(common_phrases)
    
#     return common_phrases



# def context_based_filter(nouns):
#     """Filter nouns based on enhanced concrete checks and contextual usage."""
#     filtered_features = []
#     for noun in nouns:
#         if is_concrete_noun(noun):
#             filtered_features.append(noun)
#     return filtered_features



# -------------------- Concrete noun filter 2 ---------------------

# # Define a list of concrete domains (as WordNet synsets)
# concrete_domains = [
#     wn.synset('artifact.n.01'),  # Artifacts, objects made by humans
#     wn.synset('device.n.01'),    # Devices, tools or instruments
#     wn.synset('instrumentality.n.03'),  # Instrumentalities, means of achieving an end
#     # Add more domains as necessary
# ]

# def is_related_to_concrete_domain(word):
#     """
#     Check if a word is semantically related to predefined concrete domains.
#     """
#     synsets = wn.synsets(word, pos=wn.NOUN)
#     for synset in synsets:
#         for domain in concrete_domains:
#             if domain in synset.closure(lambda s: s.hypernyms()):
#                 return True
#     return False


# def filter_nouns_semantically(nouns):
#     """
#     Filter a list of nouns, keeping only those related to concrete domains.
#     """
#     return [noun for noun in nouns if is_related_to_concrete_domain(noun)]

In [27]:
def similarity_filter(word_tuple_list):
    
    # Define a threshold for filtering; this is arbitrary and might need adjustment
    similarity_threshold = 0.2
    
    # Extract just the words for similarity comparison
    words = [word for word, _ in word_tuple_list]
    
    # Calculate average similarities and filter
    filtered_words = []
    if not words:
        return filtered_words

    # The first entry of the list is used as the reference term
    reference = words[0]
    try:
        # Top-10 GloVe neighbours of the reference term, computed once instead of per word
        glove_similar_words = dict(glove_model.most_similar(reference, topn=10))
    except KeyError:
        # Reference term is missing from the GloVe vocabulary, so nothing can be scored
        return filtered_words

    for word in words:
        try:
            # Calculate similarity using Word2Vec
            w2v_similarity = Word2Vec_model.wv.similarity(reference, word)
        except KeyError:
            continue  # Skip words missing from the Word2Vec vocabulary
        # GloVe score is the neighbour similarity, or 0 if the word is not a top-10 neighbour
        glove_similarity = glove_similar_words.get(word, 0)
        # Average the similarities and filter based on the threshold
        avg_similarity = (w2v_similarity + glove_similarity) / 2
        if avg_similarity >= similarity_threshold:
            filtered_words.append(word)

    return filtered_words


def further_similar_filter(noun_list):

    # The reference term is hard-coded; nouns that are close GloVe neighbours of it (e.g. 'pic') are dropped
    main_word = 'picture'

    # Top-10 GloVe neighbours of the reference term, computed once
    glove_similar_words = dict(glove_model.most_similar(main_word, topn=10))

    # Keep only the nouns that are not among those neighbours
    filtered_list = [item for item in noun_list if not glove_similar_words.get(item, 0) > 0]
    return filtered_list

In [28]:
def create_product_features_dict(product_info):
    # Extract the product title
    product_title = product_info[0]

    # Extract the features
    product_features = product_info[1:]

    # Create the dictionary with the desired structure
    product_dict = {
        product_title: {
            feature: {"positive": 0, "negative": 0} for feature in product_features
        }
    }

    return product_dict



def build_featured_df(df, feature_list_pos_noun):
    
    feature_df = pd.DataFrame()
    feature_list = feature_list_pos_noun[1:]
    
    featured_items_list = []
    
    for idx, review in df.iterrows():
        tokenised_review = review['Filtered_Review']
        # Find the features present in the tokenised_review
        featured_items = [item for item in feature_list if item in tokenised_review]
            
        if featured_items:
            # Convert the review Series to a DataFrame with one row
            review_df = review.to_frame().transpose()
            
            # New: Append the found features as a string (or you can keep it as list)
            review_df['Featured_Items'] = [', '.join(featured_items)]  # As a single string of items
    
            review_df['Main_Index'] = idx
            
            # Append this review to the feature_df
            feature_df = pd.concat([review_df, feature_df], axis=0)
            
            # Additionally, append the found features to the featured_items_list
            featured_items_list.append(featured_items)

    return feature_df



def feature_dict_count(feature_df, product_dict):

    # The dictionary has a single top-level key: the product title
    feature_counts = product_dict[next(iter(product_dict))]

    for _, row in feature_df[['Featured_Items', 'Sentiment']].iterrows():
        feature = row['Featured_Items']
        sentiment = row['Sentiment']

        if sentiment == 1:
            feature_counts[feature]['positive'] += 1
        elif sentiment == -1:
            feature_counts[feature]['negative'] += 1
        # Sentiment 0 marks a review that mentions more than one feature; it is not counted

    return product_dict



def df_filter(df):
    df = df[['Main_Index', 'Featured_Items', 'Sentiment', 'Tags', 'Review', 'Tokenised_Review', 'Soft_Filtered_Review', 'Soft_Filtered_Review_String', 'Filtered_Review', 'Filtered_Review_String', 'Lemmatised_Review_String', 'Lemmatised_Tokenised_Filtered_Review', 'Stemmed_Review', 'Stemmed_Review_String']]
    df = df.set_index('Main_Index', drop=True) 
    df.index.name = 'Main Index'
    df = df.iloc[::-1]
    return df

# Sentiment Analysis

#### Overview of Sentiment Analysis Process

The sentiment analysis process starts with the `sentiment_controller` function, which processes each review in the feature DataFrame to determine its sentiment orientation. For each row, the controller splits the `Featured_Items` entry into individual features. A review that mentions exactly one feature is passed to the sentiment classifier, while a review that mentions several features is assigned a neutral score of 0, because its sentiment cannot be attributed to a single feature.

#### Detailed Sentiment Classification Approach

Sentiment classification is performed by `sentiment_classifier_1`, a rule-based model that averages the scores of two sentiment evaluation methods. The first method looks up the phrase as a compound term in the SentiWordNet lexicon. If that lookup returns a non-zero score, it is used directly; otherwise the phrase is split into individual words and their sentiment scores are averaged.
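The lookup-then-average logic of the first method can be sketched with a toy lexicon in place of SentiWordNet. The lexicon values are invented and the names `word_scores` and `phrase_scores` are hypothetical.

```python
# Toy lexicon mapping a term to (positive, negative) scores; the values are invented
TOY_LEXICON = {
    "good": (0.75, 0.0),
    "poor": (0.0, 0.625),
    "battery_life": (0.25, 0.0),
}


def word_scores(word, lexicon=TOY_LEXICON):
    """Return (positive, negative) for a term, or (0.0, 0.0) if it is unknown."""
    return lexicon.get(word, (0.0, 0.0))


def phrase_scores(phrase, lexicon=TOY_LEXICON):
    """Try the whole phrase as a compound term, then fall back to a per-word average."""
    pos, neg = word_scores(phrase.replace(" ", "_"), lexicon)
    if pos or neg:
        return pos, neg
    words = phrase.split()
    if not words:
        return 0.0, 0.0
    total_pos = sum(word_scores(w, lexicon)[0] for w in words)
    total_neg = sum(word_scores(w, lexicon)[1] for w in words)
    return total_pos / len(words), total_neg / len(words)


print(phrase_scores("battery life"))  # direct compound hit: (0.25, 0.0)
print(phrase_scores("good screen"))   # per-word average: (0.375, 0.0)
```

Because unknown words contribute zero but still count towards the length, long phrases with few opinion words are pulled towards neutral.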

For a more nuanced interpretation, especially in contexts with negations or intensifiers such as "not" or "very," the second method adjusts the sentiment scores accordingly. This involves parsing the phrase, tagging each word with its part of speech, and then adjusting the sentiment scores if modifiers that change sentiment polarity or intensity are detected. This dual-method approach ensures a balanced sentiment evaluation, accounting for both overt and subtle linguistic cues that might influence sentiment perception.
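A minimal sketch of modifier handling follows, again with a toy lexicon. Two choices here are assumptions made for the example, not the notebook's behaviour: negation swaps the positive and negative scores, and the intensifier multiplier of 1.5 for "very" is hypothetical.

```python
NEGATIONS = {"not", "no"}
INTENSIFIERS = {"very": 1.5}  # hypothetical multiplier

# Toy lexicon of (positive, negative) scores; the values are invented
TOY_LEXICON = {"good": (0.75, 0.0), "bad": (0.0, 0.5)}


def score_with_modifiers(words, lexicon=TOY_LEXICON):
    """Sum word scores, adjusting each word for the modifier directly before it."""
    total_pos = total_neg = 0.0
    for i, word in enumerate(words):
        pos, neg = lexicon.get(word.lower(), (0.0, 0.0))
        previous = words[i - 1].lower() if i > 0 else None
        if previous in NEGATIONS:
            pos, neg = neg, pos  # negation flips the polarity
        elif previous in INTENSIFIERS:
            factor = INTENSIFIERS[previous]
            pos, neg = pos * factor, neg * factor  # intensifier scales both scores
        total_pos += pos
        total_neg += neg
    return total_pos, total_neg


print(score_with_modifiers(["good"]))         # (0.75, 0.0)
print(score_with_modifiers(["not", "good"]))  # (0.0, 0.75)
print(score_with_modifiers(["very", "bad"]))  # (0.0, 0.75)
```

Only the word immediately before each term is inspected, so a construction such as "not very good" would be treated as intensified rather than negated.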

#### Handling Complex Sentiment Dynamics

Recognizing the limitations of rule-based methods in capturing the full spectrum of human emotions, the sentiment analysis also includes mechanisms to handle complex sentiment dynamics. Adjustments for negations and intensifiers are crucial, as they can significantly alter the sentiment conveyed by a phrase. For instance, the phrase "not good" has a different sentiment implication compared to "good," and the system is calibrated to recognize and adjust for such nuances.

The sentiment analysis framework is also designed to be robust against the variability in sentiment expression across different reviews. It includes a comprehensive sentiment scoring system that aggregates individual word and phrase scores to form an overall sentiment score for each reviewed feature. This aggregate score is then used to update the sentiment attributes in the feature DataFrame, providing a holistic view of consumer sentiment towards specific product features.

#### Integration with Machine Learning Models

[Space for future development and integration of machine learning models]

This section will detail the incorporation of advanced machine learning models to enhance sentiment analysis capabilities. These models, based on neural networks or other machine learning techniques, will be trained on domain-specific corpora to improve accuracy in sentiment classification, especially in interpreting complex sentence structures and sarcasm, which are often challenging for rule-based systems.

#### Conclusion

In conclusion, the sentiment analysis component of my opinion miner combines rule-based methods, with room for machine learning models to be added later, to interpret and quantify the sentiment expressed in consumer reviews. By handling negations, intensifiers, and compound phrases in consumer feedback, it gives businesses feature-level sentiment counts that they can use to refine their products and strategies.

In [29]:
def sentiment_controller(feature_df):
    sentiment_list = []
    
    for idx, entry in feature_df.iterrows():
        review = entry['Soft_Filtered_Review_String']

        features = entry['Featured_Items'].split(',')
        features = [feature.strip() for feature in features] 
        
        if len(features) == 1:
            
            sentiment = sentiment_classifier_1(review)
            sentiment_list.append(sentiment)
            
        else:
            sentiment_list.append(0)

    feature_df['Sentiment'] = sentiment_list
    return feature_df



def sentiment_classifier_1(review):
    # Average sentiment method 1 and 2 
    pos_score_1, neg_score_1 = get_phrase_sentiment_1(review)
    pos_score_2, neg_score_2 = get_phrase_sentiment_2(review)
    avg_pos_score = (pos_score_1 + pos_score_2) / 2
    avg_neg_score = (neg_score_1 + neg_score_2) / 2

    if avg_pos_score > avg_neg_score:
        sentiment = 1
    else:
        sentiment = -1 

    # Single model
    # pos_score, neg_score = get_phrase_sentiment_2(review)

    # if pos_score > neg_score:
    #     sentiment = 1
    # else:
    #     sentiment = -1 

    return sentiment 



"""
- This method assumes equal weighting of words in calculating the average sentiment, which might not reflect the actual sentiment conveyed by the phrase.
- Handling negations and intensifiers (e.g., "not" in "not good", "very" in "very good") requires more sophisticated logic, as they can significantly alter the sentiment.
- Advanced models designed for sentiment analysis at the sentence or document level (like BERT-based models) may provide more accurate sentiment assessments for phrases and sentences by considering the broader context.
This approach gives a basic approximation but is limited by the nuances of natural language. For more accurate sentiment analysis on phrases or sentences, consider using pre-trained sentiment analysis models or services.
"""


def get_word_sentiment(word):
    synsets = list(swn.senti_synsets(word))
    if synsets:
        return synsets[0].pos_score(), synsets[0].neg_score()
    else:
        return 0, 0


def get_phrase_sentiment_1(phrase):
    # Try the phrase directly (useful for compound terms recognized by SWN)
    pos_score, neg_score = get_word_sentiment(phrase.replace(" ", "_"))
    if pos_score or neg_score:
        return pos_score, neg_score
    
    # Split the phrase into individual words and average their sentiment scores
    words = phrase.split()
    total_pos, total_neg = 0, 0
    for word in words:
        pos, neg = get_word_sentiment(word)
        total_pos += pos
        total_neg += neg
    avg_pos = total_pos / len(words) if words else 0
    avg_neg = total_neg / len(words) if words else 0
    
    return avg_pos, avg_neg



"""
This version attempts to address negations directly before a word and could be extended to consider intensifiers (like "very") by further
modifying the sentiment scores. For even more nuanced sentiment analysis, exploring deep learning models trained specifically for sentiment 
analysis is recommended.
"""


def penn_to_wn(tag):
    """Converts Penn Treebank tags to WordNet tags."""
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None

def get_word_sentiment_2(word, tag):
    wn_tag = penn_to_wn(tag)
    if wn_tag:
        synsets = list(swn.senti_synsets(word, wn_tag))
        if synsets:
            return synsets[0].pos_score(), synsets[0].neg_score()
    return 0, 0

def adjust_scores_for_negation_and_intensifiers(scores, words, i):
    """Adjusts sentiment scores based on negations and intensifiers around the i-th word."""
    if i > 0 and words[i-1].lower() in ["not", "no"]:
        # Invert the sentiment by swapping the scores, which keeps both non-negative
        return scores[1], scores[0]
    # Further adjustments for intensifiers (like "very") can be added here
    return scores

def get_phrase_sentiment_2(phrase):
    words = word_tokenize(phrase)
    tagged = pos_tag(words)
    total_pos, total_neg = 0, 0
    
    for i, (word, tag) in enumerate(tagged):
        scores = get_word_sentiment_2(word, tag)
        scores = adjust_scores_for_negation_and_intensifiers(scores, words, i)
        total_pos += scores[0]
        total_neg += scores[1]
    
    avg_pos = total_pos / len(words) if words else 0
    avg_neg = total_neg / len(words) if words else 0
    
    return avg_pos, avg_neg
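
The notes above mention that intensifiers such as "very" could be handled alongside negation. A minimal sketch of one way to do that is given here. The `INTENSIFIERS` table, its multiplier values, and the function name `adjust_scores` are hypothetical and only illustrative; they are not tuned on the review data. Negation is handled by swapping the two scores, and intensified scores are capped at 1.0.

```python
# Hypothetical multipliers: illustrative values only, not tuned on any data.
INTENSIFIERS = {"very": 1.5, "extremely": 2.0, "slightly": 0.5}
NEGATIONS = {"not", "no"}


def adjust_scores(scores, words, i):
    """Return (pos, neg) for words[i], adjusted by the word directly before it."""
    pos, neg = scores
    if i > 0:
        prev = words[i - 1].lower()
        if prev in NEGATIONS:
            # Swap the scores so that a negated positive word counts as negative.
            return neg, pos
        if prev in INTENSIFIERS:
            factor = INTENSIFIERS[prev]
            # Cap at 1.0 so the result stays within the 0..1 score range.
            return min(pos * factor, 1.0), min(neg * factor, 1.0)
    return pos, neg


# Example: the same base score for "good" under different preceding words.
for phrase in (["good"], ["very", "good"], ["not", "good"]):
    print(phrase, adjust_scores((0.5, 0.0), phrase, len(phrase) - 1))
```

This only looks one word back, so it would miss patterns such as "not very good"; a wider look-back window would be needed for those.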

In [30]:
def opinion_miner_controller(file):
    
    df = read_file(file)
    df = pre_processing_controller(df)
    nouns = POS_Noun_Tagging(df['Lemmatised_Review_String'])
    similar_nouns = similarity_filter(nouns)
    features = further_similar_filter(similar_nouns)

    if len(features) == 1:
        # Only the product name was found, so there is nothing to score
        print(f'Only item name extracted: {features[0]} - No other features')
        return None, None
    else:
        feature_df = build_featured_df(df, features)
        feature_df = sentiment_controller(feature_df)
        feature_df = df_filter(feature_df)
        product_dict = create_product_features_dict(features)
        feature_dict = feature_dict_count(feature_df, copy.deepcopy(product_dict))
        return feature_dict, feature_df


# for file in files:
#     feature_dict = opinion_miner_controller(file)
#     print(feature_dict)
#     print('\n')

feature_dict, feature_df = opinion_miner_controller(files[1])
display(feature_dict)
# display(feature_df)

{'camera': {'picture': {'positive': 8, 'negative': 10},
  'shoot': {'positive': 17, 'negative': 5},
  'flash': {'positive': 8, 'negative': 10},
  'photo': {'positive': 1, 'negative': 0}}}
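
The raw counts above are easier to compare once they are turned into a share of positive mentions per feature. The following is a small sketch of that step, using the dictionary printed above as input. The helper names `positive_share` and `summarise` are hypothetical and not part of the pipeline.

```python
# Counts copied from the output above.
feature_dict = {
    "camera": {
        "picture": {"positive": 8, "negative": 10},
        "shoot": {"positive": 17, "negative": 5},
        "flash": {"positive": 8, "negative": 10},
        "photo": {"positive": 1, "negative": 0},
    }
}


def positive_share(counts):
    """Fraction of mentions that are positive, or None if there are no mentions."""
    total = counts["positive"] + counts["negative"]
    return counts["positive"] / total if total else None


def summarise(feature_dict):
    """Map each product to {feature: positive share}."""
    return {
        product: {name: positive_share(counts) for name, counts in features.items()}
        for product, features in feature_dict.items()
    }


for product, shares in summarise(feature_dict).items():
    for name, share in shares.items():
        total = sum(feature_dict[product][name].values())
        label = "n/a" if share is None else f"{share:.2f}"
        print(f"{product} / {name}: positive share {label} over {total} mentions")
```

Printing the mention count next to the share matters here: "photo" has a single mention, so its share says very little compared with "shoot", which has 22.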

# Evaluation and Discussion

# Experimentation

In [17]:
from transformers import AlbertTokenizer, AlbertModel

try:
    # Replace 'albert-base-v2' with any model you actually want to test
    tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
    model = AlbertModel.from_pretrained('albert-base-v2')
    print("Model and tokenizer loaded successfully.")
except Exception as e:
    print(f"An error occurred: {e}")


An error occurred: 
AlbertTokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.



In [18]:
from transformers import pipeline
import spacy

# Load the sentiment analysis pipeline
sentiment_analysis = pipeline("sentiment-analysis")

def extract_features_and_analyze_sentiment(text, features):
    # Analyze the text with spaCy for NER
    doc = nlp(text)
    feature_sentiments = {}

    # Identify provided features in text
    for ent in doc.ents:
        if ent.text.lower() in [feature.lower() for feature in features]:
            # Perform sentiment analysis on the sentence containing the feature
            sentiment = sentiment_analysis(ent.sent.text)
            feature_sentiments[ent.text] = sentiment

    return feature_sentiments

features = ["battery life", "camera", "display"]
text = "The phone has a great camera and display but the battery life could be better."

sentiments = extract_features_and_analyze_sentiment(text, features)
print(sentiments)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


{}
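
One possible reason for the empty result is that spaCy's named entity recogniser targets named entities such as people, organisations, and places, so generic nouns like "camera" are usually not returned in `doc.ents`. This is an assumption about the cause and is not verified in this notebook. A minimal sketch of an alternative that matches the supplied features directly against each sentence is shown here. It uses only the standard library, a naive regex sentence splitter, and a hypothetical helper name `sentences_by_feature`.

```python
import re


def sentences_by_feature(text, features):
    """Map each feature to the sentences that mention it (case-insensitive)."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    found = {}
    for feature in features:
        # Word boundaries prevent "display" from matching inside "displayed".
        pattern = re.compile(r"\b" + re.escape(feature.strip()) + r"\b", re.IGNORECASE)
        hits = [s for s in sentences if pattern.search(s)]
        if hits:
            found[feature] = hits
    return found


features = ["battery life", "camera", "display"]
text = "The phone has a great camera and display but the battery life could be better."
print(sentences_by_feature(text, features))
```

Each matched sentence could then be passed to the `sentiment_analysis` pipeline. In this example all three features share one sentence, so a sentence-level classifier would still give them the same label.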


In [20]:
sample = feature_df[feature_df.index == 214]

sample_features = sample['Featured_Items'].iloc[0].split(',')
sample_review = sample['Review'].iloc[0]

# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

def analyze_feature_sentiment(sentence, features):
    # Dictionary to store sentiment results for each feature
    sentiment_results = {}

    features = [feature.strip() for feature in features]  
    
    # Analyze sentiment for each feature
    for feature in features:
        # Extract context around the feature if needed (optional improvement)
        start_index = sentence.lower().find(feature.lower())
        if start_index != -1:
            # Extract a sub-sentence for context-based sentiment analysis
            sub_sentence = sentence[max(start_index - 30, 0):min(start_index + 30 + len(feature), len(sentence))]

            # Get sentiment using TextBlob
            tb_sentiment = TextBlob(sub_sentence).sentiment.polarity
            # Get sentiment using VADER
            vader_sentiment = analyzer.polarity_scores(sub_sentence)['compound']

            # Store results
            sentiment_results[feature] = {
                'TextBlob Sentiment': 'Positive' if tb_sentiment > 0 else 'Negative' if tb_sentiment < 0 else 'Neutral',
                'VADER Sentiment': 'Positive' if vader_sentiment > 0.05 else 'Negative' if vader_sentiment < -0.05 else 'Neutral'
            }
        else:
            sentiment_results[feature] = {
                'TextBlob Sentiment': 'Not Found',
                'VADER Sentiment': 'Not Found'
            }

    return sentiment_results

results = analyze_feature_sentiment(sample_review, sample_features)
# display(results)
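
The fixed 30-character slice used above can cut words in half at either end of the sub-sentence. A word-based window is one alternative. The sketch below is illustrative only: the function name `token_window` and the default of five words on each side are assumptions, and it splits on whitespace rather than using a proper tokenizer.

```python
def token_window(sentence, feature, k=5):
    """Return up to k words on each side of the first occurrence of feature."""
    tokens = sentence.split()
    feat_tokens = feature.lower().split()
    if not feat_tokens:
        return None
    n = len(feat_tokens)
    # Lower-case and strip trailing punctuation for matching only.
    lowered = [t.lower().strip(".,!?;:") for t in tokens]
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == feat_tokens:
            start = max(i - k, 0)
            end = min(i + n + k, len(tokens))
            return " ".join(tokens[start:end])
    return None  # feature not found


sentence = "The phone has a great camera and display but the battery life could be better."
print(token_window(sentence, "battery life", k=2))
print(token_window(sentence, "camera", k=1))
```

The returned string could replace `sub_sentence` before it is scored with TextBlob and VADER.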

In [None]:
# ******************************************************************************************************************************************************************

In [392]:
# def display_freq_dist(df):
#     all_words = [word for review in df['Filtered_Review'] for word in review]
#     freq_dist = FreqDist(all_words)
#     top_items = sorted(freq_dist.items(), key=lambda x: x[1], reverse=True)[:20]
#     words, frequencies = zip(*top_items)
#     plt.figure(figsize=(6, 3))  
#     plt.bar(words, frequencies, color='skyblue')  
#     plt.xlabel('Words') 
#     plt.ylabel('Frequency') 
#     plt.title('Top Words Frequency Distribution')  
#     plt.xticks(rotation=45) 
#     plt.show()

# display_freq_dist(df)

In [332]:
# def k_means(string_list):

#     num_clusters = 3

#     tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'), max_df=0.85, min_df=2)
#     tfidf_matrix = tfidf_vectorizer.fit_transform(string_list)
    
#     km = KMeans(n_clusters=num_clusters, n_init=10)
#     km.fit(tfidf_matrix)
#     clusters = km.labels_.tolist()
    
#     order_centroids = km.cluster_centers_.argsort()[:, ::-1]
#     terms = tfidf_vectorizer.get_feature_names_out()
    
#     for i in range(num_clusters):
#         top_terms = [terms[ind] for ind in order_centroids[i, :10]]  # Get top 10 terms for each cluster
#         print(f"Cluster {i}: {top_terms}")
    
#     pca = PCA(n_components=2)
#     reduced_data = pca.fit_transform(tfidf_matrix.toarray())
    
#     # Get the cluster labels for each data point
#     cluster_labels = km.labels_
    
#     plt.figure(figsize=(8, 4))  # Set figure size
    
#     # Scatter plot of the reduced data, colored by cluster labels
#     plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=cluster_labels, cmap='viridis', s=50, alpha=0.6)
    
#     # Adding labels for axes
#     plt.xlabel('PCA 1')
#     plt.ylabel('PCA 2')
    
#     # Title of the plot
#     plt.title('2D Visualization of K-Means Clusters')
    
#     # Display the plot
#     print('\n')
#     plt.show()


# df = read_file(files[1])
# df = pre_processing_controller(df)
# # k_means(df['Filtered_Review_String'])
# # k_means(df['Lemmatised_Review_String'])
# # k_means(df['Stemmed_Review_String'])

In [333]:
# def LDA_Model(tokenised_reviews):
    
#     # Create a dictionary representation of the documents
#     dictionary = corpora.Dictionary(tokenised_reviews)
    
#     # Convert dictionary to a bag of words corpus
#     corpus = [dictionary.doc2bow(text) for text in tokenised_reviews]
    
#     # Number of topics
#     num_topics = 5
    
#     # Generate LDA model
#     lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=100, update_every=1, passes=10, alpha='auto')
    
#     # Print the topics
#     topics = lda_model.print_topics(num_words=5)
#     for topic in topics:
#         print(topic)
        

# # LDA_Model(df['Stemmed_Review'])
# # print('\n')
# # LDA_Model(df['Filtered_Review'])
# # print('\n')
# # LDA_Model(df['Lemmatised_Tokenised_Filtered_Review'])

In [334]:
# def LDA_Model_2(string_reviews):
    
#     tfidf_vectorizer = TfidfVectorizer()
#     tfidf_matrix = tfidf_vectorizer.fit_transform(string_reviews)
    
#     num_topics = 5
#     lda = LDA(n_components=num_topics)
#     lda.fit_transform(tfidf_matrix)
    
#     # Explore the topics
#     terms = tfidf_vectorizer.get_feature_names_out()
#     for topic_idx, topic in enumerate(lda.components_):
#         print(f"Topic #{topic_idx+1}:")
#         print(" ".join([terms[i] for i in topic.argsort()[:-10 - 1:-1]]))
#         print('\n')

# # LDA_Model_2(df['Lemmatised_Review_String'])

In [335]:
# def disambiguate_word_sense(sentence, word):
#     # Use Lesk algorithm for WSD
#     sense = lesk(nltk.word_tokenize(sentence), word)
#     if not sense:
#         return None
    
#     # Get sentiment scores
#     senti_synset = swn.senti_synset(sense.name())
#     return {
#         'word': word,
#         'synset_name': sense.name(),
#         'definition': sense.definition(),
#         'examples': sense.examples(),
#         'positivity_score': senti_synset.pos_score(),
#         'negativity_score': senti_synset.neg_score(),
#         'objectivity_score': senti_synset.obj_score()
#     }


# # review = df.iloc[200]
# # for word in review['Filtered_Review']:
# #     disambiguated_sense = disambiguate_word_sense(review['Review'], word)
#     # print(disambiguated_sense)
#     # print('\n')

In [336]:
# def preprocess_and_ner(tokens):
    
#     tagged = pos_tag(tokens)
#     named_entities = ne_chunk(tagged)
#     return named_entities

# # tokenised_text = df['Filtered_Review'].iloc[106]
# # named_entities = preprocess_and_ner(tokenised_text)
# # print(named_entities)

In [337]:
# def tf_idf(reviews):
#     # Initialize the TF-IDF Vectorizer
#     tfidf_vectorizer = TfidfVectorizer()
    
#     # Transform the reviews into a TF-IDF matrix
#     tfidf_matrix = tfidf_vectorizer.fit_transform(reviews)
    
#     # Extract the feature names/terms from the TF-IDF Vectorizer
#     feature_names = tfidf_vectorizer.get_feature_names_out()
    
#     # Calculate the average TF-IDF score for each term across all documents
#     scores = tfidf_matrix.mean(axis=0)
#     term_scores = {feature_names[col]: scores[0, col] for col in range(scores.shape[1])}
    
#     # Sort the terms by their average TF-IDF score in descending order
#     sorted_term_scores = sorted(term_scores.items(), key=lambda x: x[1], reverse=True)
    
#     # Optionally: Display the top 15 terms with the highest average TF-IDF scores
#     print("Top 15 terms by average TF-IDF score:")
#     for term, score in sorted_term_scores[:15]:
#         print(f"Term: {term}, Score: {round(score, 4)}")
    
#     # Calculate cosine similarity among the documents using the TF-IDF matrix
#     cos_sim_matrix = cosine_similarity(tfidf_matrix)
    
#     # return tfidf_matrix, feature_names, cos_sim_matrix

# # tf_idf(df['Filtered_Review_String'])

In [338]:
# matcher = Matcher(nlp.vocab)

# # Define custom patterns
# patterns = [
#     [{"POS": "ADJ"}, {"POS": "NOUN", "OP": "+"}],  # Adjective followed by one or more nouns
#     [{"POS": "NOUN", "OP": "+"}, {"LOWER": "mode"}]  # One or more nouns followed by "mode"
# ]
# matcher.add("CUSTOM_PATTERNS", patterns)


# def POS_Chuck_Parser_Matcher(review):
#     doc = nlp(review)
#     matches = matcher(doc)

#     extracted_phrases = []
#     for match_id, start, end in matches:
#         span = doc[start:end]
#         extracted_phrases.append(span.text)

#     print(extracted_phrases)

In [339]:
# def POS_Chuck_Parser(review):

#     doc = nlp(review)
    
#     # Initialize a list to hold our extracted phrases
#     extracted_phrases = []
    
#     # Iterate over tokens in the doc
#     for token in doc:
#         # Look for an adverb modifying an adjective and check the adjective doesn't have a noun child
#         if token.pos_ == "ADV" and token.head.pos_ == "ADJ":
#             is_adj_modified = False
#             for child in token.head.children:
#                 if child.dep_ in ["attr", "dobj", "pobj"]:  # The adjective is modifying a noun
#                     is_adj_modified = True
#                     break
#             if not is_adj_modified:
#                 # Capture the adverb-adjective pair "rather heavy"
#                 extracted_phrases.append(token.text + " " + token.head.text)
    
#         # Look for an adjective modifying a noun and check if it's in a prepositional phrase
#         if token.pos_ == "ADJ" and token.head.pos_ in ["NOUN", "PROPN"]:
#             is_in_prep_phrase = False
#             for ancestor in token.head.ancestors:
#                 if ancestor.dep_ == "prep":
#                     is_in_prep_phrase = True
#                     break
#             if not is_in_prep_phrase:
#                 # Capture the adjective-noun pair "great camera"
#                 extracted_phrases.append(token.text + " " + token.head.text)

#     print(extracted_phrases)

# # index = 24

# # print(df['Tags'].iloc[index])
# # print(df['Review'].iloc[index])

# # POS_Chuck_Parser(df['Review'].iloc[index])
# # POS_Chuck_Parser(df['Filtered_Review_String'].iloc[index])
# # POS_Chuck_Parser(df['Lemmatised_Review_String'].iloc[index])

# # POS_Chuck_Parser_Matcher(df['Review'].iloc[index])
# # POS_Chuck_Parser_Matcher(df['Filtered_Review_String'].iloc[index])
# # POS_Chuck_Parser_Matcher(df['Lemmatised_Review_String'].iloc[index])