<a href="https://colab.research.google.com/github/Priyasi7/DS_task/blob/main/DS_Task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To solve the Subtheme Sentiment Analysis task, we'll develop a system that can identify subthemes (aspects and problems) in a review and determine their respective sentiments. We'll break down the solution into the following steps:

1. Text Preprocessing: Clean and prepare the text for analysis.
2. Subtheme Identification: Detect specific subthemes related to the service.
3. Sentiment Analysis: Determine the sentiment associated with each subtheme.
4. Combining Subthemes with Sentiments: Link each identified subtheme with its respective sentiment.
5. Output the Results: Format and display the results.
Step-by-Step Implementation
Step 1: Text Preprocessing
We'll use the spaCy library for text preprocessing, which includes tokenization and lemmatization.

Step 2: Subtheme Identification
We'll identify subthemes by using a predefined dictionary of keywords for each subtheme.
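
As a rough illustration of the dictionary lookup, the sketch below matches lowercased word tokens against a keyword table. It is a dependency-free stand-in rather than the notebook's method: a regular expression replaces spaCy's tokenizer, no lemmatization is performed, and the names `ASPECT_KEYWORDS` and `match_aspects` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical keyword table with the same shape as the aspect dictionary
ASPECT_KEYWORDS = {
    "incorrect tyres sent": {"incorrect", "missing", "wrong", "tyre", "tire"},
    "garage service": {"garage", "service", "mechanic"},
    "wait time": {"delay", "wait", "time"},
}

def match_aspects(review, aspect_keywords):
    """Return {aspect: [matched words]} using exact, case-insensitive word matches."""
    matches = defaultdict(list)
    # Regex tokenization is an assumption made to avoid external dependencies
    for word in re.findall(r"[a-z']+", review.lower()):
        for aspect, keywords in aspect_keywords.items():
            if word in keywords:
                matches[aspect].append(word)
    return dict(matches)

print(match_aspects("The garage was great but the wait was long.", ASPECT_KEYWORDS))
```

Because this stand-in skips lemmatization, a plural such as "tyres" would not match the keyword "tyre"; that gap is what the lemma-based lookup is meant to close.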

Step 3: Sentiment Analysis
We'll use TextBlob to determine the sentiment (positive or negative) for phrases related to each subtheme.
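
TextBlob reports polarity as a float between -1.0 and 1.0, so a binary cut at zero labels text with a polarity of exactly 0.0 as negative. The sketch below shows one possible alternative that adds a neutral band around zero. The function name `polarity_to_label` and the band width of 0.05 are assumptions chosen for illustration, not tuned values.

```python
def polarity_to_label(polarity, neutral_band=0.05):
    """Map a polarity score in [-1.0, 1.0] to 'positive', 'negative' or 'neutral'."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    # Scores inside the band are treated as carrying no clear sentiment
    return "neutral"

for score in (0.6, 0.0, -0.3):
    print(score, polarity_to_label(score))
```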

Step 4: Combining Subthemes with Sentiments
We'll aggregate the identified subthemes with their respective sentiments.
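
One way to link a subtheme to the text that mentions it is to keep only the sentences containing one of its keywords. The sketch below does this with whole-word matching, so that a keyword like "time" does not fire on "Sometimes". The helper name `sentences_for_keywords` and the regex-based sentence splitter are assumptions; the splitter is a simplification and will mishandle abbreviations.

```python
import re

def sentences_for_keywords(review, keywords):
    """Return the sentences that contain at least one keyword as a whole word."""
    if not keywords:
        return []
    # Split after ., ! or ? followed by whitespace (a simplification)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", review) if s.strip()]
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE
    )
    return [s for s in sentences if pattern.search(s)]

review = "Sometimes the site is slow. The wait at the garage was long! Tyres were fine."
print(sentences_for_keywords(review, ["wait", "time"]))
```

The selected sentences can then be joined and passed to whichever sentiment scorer is in use.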

Step 5: Output the Results
We'll format the output to show each subtheme and its sentiment.
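
If the results need to be stored rather than printed, one option is to flatten them into one row per review and subtheme. The sketch below writes such rows as CSV text using only the standard library. The function name `results_to_csv`, the column names, and the sample data are all hypothetical.

```python
import csv
import io

def results_to_csv(results):
    """Serialize {review_id: {subtheme: sentiment}} as CSV text, one row per pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["review_id", "subtheme", "sentiment"])
    for review_id, subthemes in results.items():
        for subtheme, sentiment in subthemes.items():
            writer.writerow([review_id, subtheme, sentiment])
    return buffer.getvalue()

# Hypothetical results, invented for illustration only
example = {1: {"garage service": "positive"}, 2: {"wait time": "negative"}}
print(results_to_csv(example))
```

Reviews in which no subtheme was found simply contribute no rows in this layout.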

Step-by-Step Implementation

In [None]:
import spacy
from textblob import TextBlob
from collections import defaultdict
import pandas as pd

# Load pre-trained NLP model
nlp = spacy.load('en_core_web_sm')

# Predefined aspect dictionary
aspects = {
    "incorrect tyres sent": ["incorrect", "missing", "wrong", "tyre", "tire"],
    "garage service": ["garage", "service", "mechanic"],
    "wait time": ["delay", "wait", "time"]
}

def identify_aspects(review, aspects):
    doc = nlp(review)
    identified_aspects = defaultdict(list)

    for token in doc:
        for aspect, keywords in aspects.items():
            # Lowercase the lemma so capitalized tokens still match the keywords
            if token.lemma_.lower() in keywords:
                identified_aspects[aspect].append(token.text)

    return identified_aspects

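def get_sentiment(phrase):
    # TextBlob polarity ranges from -1.0 to 1.0; a polarity of exactly 0
    # (no opinion words found) is labelled negative by this rule
    blob = TextBlob(phrase)
    return "positive" if blob.sentiment.polarity > 0 else "negative"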

def subtheme_sentiment_analysis(review, aspects):
    identified_aspects = identify_aspects(review, aspects)
    subtheme_sentiments = {}

    for aspect in identified_aspects:
        # Extract the sentences that contain the aspect
        sentences = [sent for sent in review.split('.') if any(word in sent for word in identified_aspects[aspect])]
        combined_sentence = ' '.join(sentences)
        sentiment = get_sentiment(combined_sentence)
        subtheme_sentiments[aspect] = sentiment

    return subtheme_sentiments

# Load dataset from CSV file
dataset = pd.read_csv('/content/Evaluation-dataset.csv')

# Iterate over each row in the dataset
for index, row in dataset.iterrows():
    review_text = row.iloc[0]  # Assuming the review text is in the first column (positional access)
    subtheme_sentiments = subtheme_sentiment_analysis(review_text, aspects)

    # Output the results
    print(f"Review {index + 1}:")
    for subtheme, sentiment in subtheme_sentiments.items():
        print(f"{subtheme}: {sentiment}")
    print("\n")


Streaming output truncated to the last 5000 lines.

Review 8947:
garage service: positive


Review 8948:


Review 8949:
garage service: positive


Review 8950:
garage service: positive


Review 8951:
incorrect tyres sent: positive
garage service: positive


Review 8952:
garage service: positive
incorrect tyres sent: positive


Review 8953:
incorrect tyres sent: positive
garage service: positive
wait time: positive


Review 8954:
incorrect tyres sent: positive
wait time: positive


Review 8955:


Review 8956:
garage service: positive
incorrect tyres sent: positive


Review 8957:


Review 8958:


Review 8959:
wait time: negative


Review 8960:


Review 8961:
garage service: positive


Review 8962:
garage service: positive
incorrect tyres sent: positive
wait time: positive


Review 8963:
garage service: positive
incorrect tyres sent: positive


Review 8964:
garage service: positive
incorrect tyres sent: positive


Review 8965:
incorrect tyres sent: positive
garage service: p

Evaluation
Approach 1:

The first approach preprocesses the text, identifies subthemes by keyword matching, analyzes sentiment, and combines the results per subtheme.
It uses spaCy for tokenization and lemmatization and TextBlob for sentiment analysis.

Approach 2: To enhance sentiment accuracy, we can integrate more sophisticated models such as VADER and a fine-tuned BERT model into our subtheme sentiment analysis. Here's how we can incorporate these models into our approach:

VADER Sentiment Analysis: VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

BERT Sentiment Analysis: BERT (Bidirectional Encoder Representations from Transformers) can be fine-tuned on a sentiment analysis dataset to capture nuanced sentiments.

In [None]:
!pip install spacy textblob vaderSentiment transformers torch
!python -m spacy download en_core_web_sm



Step-by-Step Implementation with VADER and BERT

In [None]:
import spacy
from collections import defaultdict
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline
import pandas as pd

# Load pre-trained NLP model
nlp = spacy.load('en_core_web_sm')

# Predefined aspect dictionary
aspects = {
    "incorrect tyres sent": ["incorrect", "missing", "wrong", "tyre", "tire"],
    "garage service": ["garage", "service", "mechanic", "dealt"],
    "wait time": ["delay", "wait", "time"]
}

# Initialize VADER sentiment analyzer
vader_analyzer = SentimentIntensityAnalyzer()

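# Initialize BERT sentiment analysis pipeline.
# No model name is passed, so transformers falls back to its default checkpoint
# (a DistilBERT model fine-tuned on SST-2); pin a model name for reproducibility.
bert_analyzer = pipeline('sentiment-analysis')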

def identify_aspects(review, aspects):
    doc = nlp(review)
    identified_aspects = defaultdict(list)

    for token in doc:
        for aspect, keywords in aspects.items():
            # Lowercase the lemma so capitalized tokens still match the keywords
            if token.lemma_.lower() in keywords:
                identified_aspects[aspect].append(token.text)

    return identified_aspects

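def get_vader_sentiment(phrase):
    # The compound score ranges from -1 to 1; a score of exactly 0 counts as positive here
    scores = vader_analyzer.polarity_scores(phrase)
    return "positive" if scores['compound'] >= 0 else "negative"

def get_bert_sentiment(phrase):
    # truncation=True keeps long inputs within the model's maximum sequence length
    results = bert_analyzer(phrase, truncation=True)
    return results[0]['label'].lower()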

def subtheme_sentiment_analysis(review, aspects):
    identified_aspects = identify_aspects(review, aspects)
    subtheme_sentiments = {}

    for aspect in identified_aspects:
        # Extract the sentences that contain the aspect
        sentences = [sent for sent in review.split('.') if any(word in sent for word in identified_aspects[aspect])]
        combined_sentence = ' '.join(sentences)

        # Get sentiments from VADER and BERT
        vader_sentiment = get_vader_sentiment(combined_sentence)
        bert_sentiment = get_bert_sentiment(combined_sentence)

        # Use the shared label when both models agree; otherwise fall back to BERT.
        # Either way the result equals the BERT label, so it is assigned directly.
        final_sentiment = bert_sentiment

        subtheme_sentiments[aspect] = final_sentiment

    return subtheme_sentiments

# Load dataset from CSV file
dataset = pd.read_csv('/content/Evaluation-dataset.csv')

# Iterate over each row in the dataset
for index, row in dataset.iterrows():
    review_text = row.iloc[0]  # Assuming the review text is in the first column (positional access)
    subtheme_sentiments = subtheme_sentiment_analysis(review_text, aspects)

    # Output the results
    print(f"Review {index + 1}:")
    for subtheme, sentiment in subtheme_sentiments.items():
        print(f"{subtheme}: {sentiment}")
    print("\n")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Streaming output truncated to the last 5000 lines.
Review 1070:
incorrect tyres sent: negative
garage service: negative


Review 1071:
incorrect tyres sent: negative
garage service: negative


Review 1072:
incorrect tyres sent: negative


Review 1073:
garage service: positive


Review 1074:
garage service: negative
incorrect tyres sent: negative
wait time: negative


Review 1075:


Review 1076:
incorrect tyres sent: positive


Review 1077:
wait time: negative
incorrect tyres sent: negative


Review 1078:


Review 1079:
incorrect tyres sent: negative
garage service: positive


Review 1080:
incorrect tyres sent: negative


Review 1081:
garage service: negative
incorrect tyres sent: negative


Review 1082:


Review 1083:
garage service: positive


Review 1084:
garage service: negative
incorrect tyres sent: negative
wait time: negative


Review 1085:
incorrect tyres sent: negative


Review 1086:


Review 1087:


Review 1088:


Review 1089:
incorrect tyres sent: positive


Rev

Explanation:

Text Preprocessing:
We use spaCy for tokenization and lemmatization.

Subtheme Identification:
The identify_aspects function loops through each token in the review and matches its lemma against the keywords in the aspect dictionary.
Identified aspects are stored in a dictionary with the aspect as the key and the list of matching tokens as the value.

Sentiment Analysis:
We use VADER to get sentiment scores and classify the sentiment as positive or negative based on the compound score (a compound score of 0 or higher counts as positive).
We use a BERT-family transformer (the pipeline's default DistilBERT checkpoint, fine-tuned on SST-2) as a second, context-aware sentiment classifier.

Combining Sentiments:
For each subtheme, we join the sentences that mention it and score the joined text with both VADER and BERT.
When the two labels agree, that label is used; when they disagree, the code falls back to the BERT label, so the final sentiment always matches BERT's output.
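
The combination rule could also take the classifier's confidence into account, since the Hugging Face pipeline returns a `score` alongside each `label`. The sketch below is one possible variant, not the rule used in the notebook: the function name `combine_sentiments`, the 0.75 confidence threshold, and the extra "uncertain" label are all assumptions made for illustration.

```python
def combine_sentiments(vader_compound, bert_label, bert_score, min_bert_score=0.75):
    """Combine a VADER compound score with a classifier label and confidence.

    Returns the shared label when both agree. On disagreement, keeps the
    classifier label only if its confidence reaches min_bert_score,
    otherwise returns 'uncertain'.
    """
    vader_label = "positive" if vader_compound >= 0 else "negative"
    bert_label = bert_label.lower()
    if vader_label == bert_label:
        return bert_label
    return bert_label if bert_score >= min_bert_score else "uncertain"

print(combine_sentiments(0.42, "POSITIVE", 0.98))  # both models agree
print(combine_sentiments(0.10, "NEGATIVE", 0.91))  # confident classifier wins
print(combine_sentiments(0.10, "NEGATIVE", 0.55))  # low confidence on disagreement
```

Flagging low-confidence disagreements separately makes it easier to review the cases where the two models conflict.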