*Problem Statement*: Develop an approach that, given a sample review, identifies its subthemes along with their respective sentiments.
**SUBTHEME EXTRACTION**

In [48]:
#Importing Necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # conventional alias; "sns" normally refers to seaborn
%matplotlib inline

In [49]:
df = pd.read_csv("/content/Evaluation-dataset.csv", usecols=[0], names=['Reviews'],header=0)


Some manual cleanup is needed to get the dataframe into the required structure: the next cell strips stray double quotes from both ends of each review.

In [50]:
df['Reviews'] = df['Reviews'].str.strip('\"')


In [51]:
df.head()

Unnamed: 0,Reviews
0,Tires where delivered to the garage of my choi...
1,"Easy Tyre Selection Process, Competitive Prici..."
2,Very easy to use and good value for money.
3,Really easy and convenient to arrange
4,It was so easy to select tyre sizes and arrang...


In [52]:
print("Total rows:", df.shape[0])


Total rows: 10132


All 10,132 rows loaded, so we have the full dataframe.

In [53]:
# Importing necessary libraries for Natural Language Processing (pandas is already imported above)
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from textblob import TextBlob
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Here we preprocess the data:

*   Type checking for strings
*   Tokenization
*   Lowercasing and removal of non-alphanumeric tokens
*   Stopword removal
*   Lemmatization

In [54]:
#Preprocessing the data
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean(review):
    if not isinstance(review, str):  # Treat non-string values (e.g. NaN) as empty text
        review = ''
    tokens = word_tokenize(review)
    preprocessed = [lemmatizer.lemmatize(word.lower()) for word in tokens if word.isalnum() and word.lower() not in stop_words]
    return preprocessed
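
To see what the token filter in `clean` keeps and drops, here is a dependency-free sketch of the same condition. The token list is hand-written to resemble tokenizer output and the stopword set is a tiny illustrative stand-in for NLTK's English list; both are assumptions, and lemmatization is left out.

```python
# Hand-written tokens, roughly what a word tokenizer might produce for
# "The tyres weren't well-fitted." (an assumption, not real NLTK output).
tokens = ["The", "tyres", "were", "n't", "well-fitted", "."]

# Tiny illustrative stopword set; the notebook uses NLTK's English list.
demo_stop_words = {"the", "were"}

def filter_tokens(tokens, stop_words):
    """Keep lowercased alphanumeric tokens that are not stopwords."""
    return [t.lower() for t in tokens
            if t.isalnum() and t.lower() not in stop_words]

kept = filter_tokens(tokens, demo_stop_words)
print(kept)
```

Note that `str.isalnum()` rejects any token containing punctuation, so hyphenated words and contraction fragments such as `n't` are dropped along with the punctuation itself. That may remove negation cues before sentiment is scored.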

**Now that we have the preprocessed corpus**

We create a function that extracts subthemes using part-of-speech tagging: consecutive nouns, adjectives, adverbs, and verbs are grouped into one subtheme.

We then use TextBlob to score the sentiment of each subtheme.
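
The grouping idea can be sketched on its own with `itertools.groupby`. The tags below are hand-assigned Penn Treebank tags chosen for illustration, not actual `pos_tag` output, and `chunk_by_tag` is a hypothetical helper name.

```python
from itertools import groupby

# Hand-assigned tags (an assumption for illustration, not real tagger output).
tagged = [("one", "CD"), ("tyre", "NN"), ("went", "VBD"), ("missing", "VBG"),
          ("two", "CD"), ("tyres", "NNS"), ("fitted", "VBN")]

# Tag prefixes that count as part of a subtheme: nouns, adjectives, adverbs, verbs.
KEEP_PREFIXES = ("NN", "JJ", "RB", "VB")

def chunk_by_tag(tagged_words):
    """Join runs of consecutive words whose tag starts with a kept prefix."""
    chunks = []
    for keep, group in groupby(tagged_words,
                               key=lambda wt: wt[1].startswith(KEEP_PREFIXES)):
        if keep:
            chunks.append(" ".join(word for word, _ in group))
    return chunks

chunks = chunk_by_tag(tagged)
print(chunks)
```

Any word with a tag outside the kept prefixes (here the cardinal numbers tagged `CD`) ends the current run, so it acts as a boundary between subthemes.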


In [55]:
#Function for extraction of subtheme
def subtheme_sentiment(review, pos_tags):
    if not isinstance(review, str) or not isinstance(pos_tags, list) or not pos_tags:
        return {}
    extracted_subthemes = []
    current_subtheme = []
    for word, pos in pos_tags:
        # Part-of-speech tags used to pick out theme words: nouns, adjectives, adverbs, verbs
        if pos.startswith('NN') or pos.startswith('JJ') or pos.startswith('RB') or pos.startswith('VB'):
            current_subtheme.append(word)
        else:
            if current_subtheme:
                extracted_subthemes.append(" ".join(current_subtheme))
                current_subtheme = []
    if current_subtheme:
        extracted_subthemes.append(" ".join(current_subtheme))
    sentiments = {}
    for theme in extracted_subthemes:
        theme_blob = TextBlob(theme)
        sentiment_score = theme_blob.sentiment.polarity
        sentiment_label = "positive" if sentiment_score > 0 else "negative" if sentiment_score < 0 else "neutral"
        sentiments[theme] = sentiment_label
    return sentiments
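
The polarity-to-label rule used above can be isolated in a small helper. This is a minimal sketch: `label_polarity` and its `threshold` parameter are hypothetical names, and the optional neutral band is an assumption that the notebook itself does not use (it compares against exactly 0).

```python
def label_polarity(score, threshold=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label.

    With threshold=0.0 this matches the rule in the cell above;
    a positive threshold treats weak scores as neutral.
    """
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

for score in (0.4, -0.1, 0.0):
    print(score, label_polarity(score))
```

A small non-zero threshold could be worth trying if many short subthemes end up with near-zero scores, but any particular value would need checking against labelled examples.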

In [56]:
# We take the example given in the assignment
"""One tyre went missing, so there was a delay to get the two tyres fitted. The way garage dealt with it was fantastic."""

message = 'One tyre went missing, so there was a delay to get the two tyres fitted. The way garage dealt with it was fantastic.'

cleaned_words = clean(message)
tagged_words = pos_tag(cleaned_words)
final_result = subtheme_sentiment(message, tagged_words)
print(final_result)



{'tyre went missing delay get': 'negative', 'tyre fitted way garage dealt fantastic': 'positive'}


The example above yields a sentiment for each extracted subtheme. Next, we apply the same steps to the whole dataframe.

In [57]:
def pred_dataframe(df, text_column):
    subthemes = []
    sentiments = []
    for review in df[text_column]:
        cleaned_words = clean(review)
        tagged_words = pos_tag(cleaned_words)
        result = subtheme_sentiment(review, tagged_words)
        for subtheme, sentiment in result.items():
            subthemes.append(subtheme)
            sentiments.append(sentiment)
    result_df = pd.DataFrame({
        'subtheme': subthemes,
        'sentiment': sentiments
    })
    return result_df


In [58]:
#Final result
result_df = pred_dataframe(df, 'Reviews')
print(result_df)

                                                subtheme sentiment
0      tire delivered garage choice garage notified d...  positive
1      easy tyre selection process competitive pricin...  positive
2                              easy use good value money  positive
3                         really easy convenient arrange  positive
4      easy select tyre size arrange local fitting pr...  positive
...                                                  ...       ...
16113  ordered tyre needed line booked specified time...   neutral
16114          use redacted good price tyre quick search  positive
16115  excellent service point order fitting complain...  positive
16116                          seamless well managed end  positive
16117                                          recommend   neutral

[16118 rows x 2 columns]
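
The result has 16,118 rows for 10,132 reviews because a single review can yield several subthemes, and the output no longer records which review each row came from. One possible way to keep that link is sketched below. `collect_rows`, `fake_extract`, and the `review_id` field are hypothetical names, and `fake_extract` is only a stand-in for the real extraction step.

```python
def collect_rows(reviews, extract):
    """Build one row per (review, subtheme) pair, keeping the review index."""
    rows = []
    for review_id, review in enumerate(reviews):
        for subtheme, sentiment in extract(review).items():
            rows.append({"review_id": review_id,
                         "subtheme": subtheme,
                         "sentiment": sentiment})
    return rows

def fake_extract(review):
    """Hypothetical stand-in: split on full stops and label everything neutral."""
    return {part.strip(): "neutral" for part in review.split(".") if part.strip()}

rows = collect_rows(["Fast delivery. Good price.", "Recommend"], fake_extract)
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, which would allow grouping subthemes by `review_id` afterwards.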


#Thank You!