In [21]:
import pandas as pd
import numpy as np
from copy import deepcopy

import sys
sys.path.append('./utils')
from utils import review_feature
rf = review_feature()



In [33]:
#from pandas_profiling import ProfileReport

In [23]:
pwd

'C:\\Users\\manik\\Documents\\online_Courses\\projects\\ecommerce\\ReviewRanking'

In [24]:
df = pd.read_csv('./data/Preprocessed_Reviews.csv').sort_values(by = ['product'])

In [25]:
df

Unnamed: 0,product,answer_option,label,review_len
0,Accucheck,Fast and accurate delivery,0,4
1453,Accucheck,Expected a longer expire date. Your Product Li...,0,14
1454,Accucheck,I liked the prompt service,0,5
1455,Accucheck,Good product,0,2
1456,Accucheck,I not needed,0,3
...,...,...,...,...
131,shampoo,Its not much effective as it has been stated i...,0,12
130,shampoo,Liked it very nicely working now my scalp is a...,1,11
129,shampoo,its my regular choice,0,4
139,shampoo,Good but not very effective,0,5


# Feature Engineering
### From the review text we want to extract signals of relevance, capturing what each review actually says about the product.

#### Feature extraction should cover every relevant property of a review and express it quantitatively, because accurate results depend on well-measured features. This section describes all the features extracted from the reviews.

1. Noun Strength (Rn): Nouns name the subjects of a sentence and are treated here as the most informative part of speech, because they tell us what the review is about. The more of a review's weight that sits on nouns, the more informative the review is likely to be. We use POS tagging to find the nouns in a review and compute the score as:
<br>Score(Rn) = TFIDF(nouns) / TFIDF(all words)
<br><br>
2. Review Polarity (Rp): Its value lies between -1 and +1 and indicates whether a review carries negative or positive sentiment.
<br><br>
3. Review Subjectivity (Rs): Subjectivity measures how opinionated a review is, on a scale from 0 (objective) to 1 (subjective). Objective expressions state facts, while subjective expressions are opinions that describe a person’s feelings. Consider the following expressions:
<br>Bournvita tastes very good with milk: Subjective <br>
Bournvita is brown in color: Objective
<br><br>
4. Review Complexity (Rc): Measures how rich a review's vocabulary is, as the number of unique words in the review relative to the number of unique words in the whole review corpus of that product.
<br>Rc = Number of unique words in a Review / Number of unique words in entire Corpus
<br><br>
5. Review Word Length (Rw): Word count of a Review
<br><br>
6. Service Tagger (Rd): The most useful reviews describe the product itself: how it is, how it tastes, what it is used for, and how effective it is. To separate these from reviews about service, delivery, or customer support, we created a dictionary of service-related words and use it to tag such reviews.
<br>Every word in a review is fuzzy-matched against the dictionary words using Levenshtein distance. Levenshtein distance measures the difference between two sequences, so it tolerates spelling errors in reviews. For example, if “My delivery was on time” is misspelled as “My dilivery was on time”, fuzzy matching still matches “dilivery” to “delivery”.
<br><br>
7. Compound Score (Rsc): To improve the system further, we compute a compound score using the sentiment analyzer from VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based sentiment analysis tool that is specifically tuned to sentiments expressed in social media content. It can score slang (e.g. SUX!), emoji (😩, 😂), emoticons ( :), :D ), and capitalization used for emphasis (“I am SAD” and “I am sad” are scored differently).
<br>Rsc ≥ 0.5 (Positive Sentiment)
<br>-0.5 < Rsc < +0.5 (Neutral Sentiment)
<br>Rsc ≤ -0.5 (Negative Sentiment)
<br><br>
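The Rsc thresholds listed under Compound Score can be written as a small helper. This is only a sketch: the function name `sentiment_label` is hypothetical and not part of the notebook's `utils` module, and the cut-offs are simply the ±0.5 values stated above.

```python
def sentiment_label(rsc: float) -> str:
    """Map a compound score in [-1, 1] to the three buckets used in this notebook.

    The score itself would come from VADER, e.g.
    SentimentIntensityAnalyzer().polarity_scores(text)["compound"];
    that call is not made here so the sketch runs without extra packages.
    """
    if rsc >= 0.5:
        return "positive"
    if rsc <= -0.5:
        return "negative"
    return "neutral"


# Illustrative scores only, not values computed from the dataset.
for score in (0.8, 0.1, -0.6):
    print(score, sentiment_label(score))
```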
Miscellaneous: We deliberately did not include the review's rating as a feature. Including ratings would mislead the system, for two reasons:
<br>1. Ratings and review text often disagree. For example, someone rates the product ‘1’ (on a rating scale of 1–5, with ‘1’ the lowest and ‘5’ the highest) but writes ‘very good and useful medicine’ as the review comment.
<br>2. A large portion of customer ratings are either 5 stars or 1 star, so the rating carries little graded information.

TextBlob: https://textblob.readthedocs.io/en/dev/index.html <br>
VaderSentiment: https://github.com/cjhutto/vaderSentiment <br>
spaCy: https://spacy.io/ <br>
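
The fuzzy matching described for the Service Tagger (Rd) can be sketched in plain Python. This is a minimal illustration, not the notebook's `rf.service_tag`: the word list `SERVICE_WORDS`, the helper names, and the `max_distance=1` tolerance are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


# Hypothetical dictionary; the real one is defined in the notebook's utils.
SERVICE_WORDS = {"delivery", "service", "support", "packaging"}


def is_service_review(review: str, max_distance: int = 1) -> bool:
    """True if any word is within max_distance edits of a dictionary word."""
    words = review.lower().split()
    return any(levenshtein(word, term) <= max_distance
               for word in words for term in SERVICE_WORDS)


print(is_service_review("My dilivery was on time"))  # misspelling still matches
print(is_service_review("Good product"))
```

A small tolerance keeps short, unrelated words from matching by accident; a larger dictionary would probably need a length-relative threshold instead.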


![reviewfeature](Photos/ReviewFeature.png)
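
The Noun Strength ratio can be illustrated on a toy corpus. This sketch makes several assumptions that are not taken from the notebook: the ratio is read as a sum of TF-IDF weights over nouns divided by the sum over all words, TF-IDF uses a smoothed IDF, and a hand-written noun set stands in for a POS tagger such as spaCy. The names `tfidf_weights` and `noun_strength` are hypothetical, and `rf.noun_score` may be implemented differently.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: tf-idf} dict per document, using a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of reviews in which each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens))
                  * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return weights


def noun_strength(weights, nouns):
    """Share of a review's total TF-IDF weight that falls on nouns."""
    total = sum(weights.values())
    noun_total = sum(w for word, w in weights.items() if word in nouns)
    return noun_total / total if total else 0.0


reviews = ["good product", "delivery was fast", "very nice"]
nouns = {"product", "delivery"}  # stand-in for POS-tagger output
for review, w in zip(reviews, tfidf_weights(reviews)):
    print(review, round(noun_strength(w, nouns), 3))
```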

In [26]:
## Add Feature Columns
df['Rn'] = 0.0
df['Rp'] = 0.0
df['Rs'] = 0.0
df['Rc'] = 0.0
df['Rd'] = 0.0
df['Rsc'] = 0.0

In [27]:
df

Unnamed: 0,product,answer_option,label,review_len,Rn,Rp,Rs,Rc,Rd,Rsc
0,Accucheck,Fast and accurate delivery,0,4,0.0,0.0,0.0,0.0,0.0,0.0
1453,Accucheck,Expected a longer expire date. Your Product Li...,0,14,0.0,0.0,0.0,0.0,0.0,0.0
1454,Accucheck,I liked the prompt service,0,5,0.0,0.0,0.0,0.0,0.0,0.0
1455,Accucheck,Good product,0,2,0.0,0.0,0.0,0.0,0.0,0.0
1456,Accucheck,I not needed,0,3,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
131,shampoo,Its not much effective as it has been stated i...,0,12,0.0,0.0,0.0,0.0,0.0,0.0
130,shampoo,Liked it very nicely working now my scalp is a...,1,11,0.0,0.0,0.0,0.0,0.0,0.0
129,shampoo,its my regular choice,0,4,0.0,0.0,0.0,0.0,0.0,0.0
139,shampoo,Good but not very effective,0,5,0.0,0.0,0.0,0.0,0.0,0.0


In [28]:
product_list = df['product'].unique()

In [29]:
for product in product_list:
    data = df[df['product']==product]

    # Vocabulary of this product's reviews (lower-cased, whitespace-tokenised)
    unique_bag = set()
    for review in data['answer_option']:
        review = review.lower()
        words = review.split()
        unique_bag = unique_bag.union(set(words))

    for indx in data.index:
        review = data.at[indx, 'answer_option']
        df.at[indx, 'Rp'] = rf.polarity_sentiment(review)
        df.at[indx, 'Rs'] = rf.subjectivity_sentiment(review)
        df.at[indx, 'Rd'] = rf.service_tag(review)
        df.at[indx, 'Rsc'] = rf.slang_emoji_polarity_compoundscore(review)
        # Lower-case here too, so the numerator is tokenised like unique_bag
        df.at[indx, 'Rc'] = float(len(set(review.lower().split()))) / float(len(unique_bag))

    # Noun score is computed over all of the product's reviews at once
    df.loc[df['product']==product, 'Rn'] = rf.noun_score(data['answer_option'].values).values
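
To make the Rc computation concrete, here is a small standalone sketch of the same ratio on three reviews from the shampoo product. The function name `complexity_scores` is hypothetical; it assumes the same lower-casing and whitespace tokenisation used to build `unique_bag` above.

```python
def complexity_scores(reviews):
    """Rc per review: unique words in the review / unique words in the corpus."""
    token_sets = [set(review.lower().split()) for review in reviews]
    vocabulary = set().union(*token_sets)
    if not vocabulary:  # no words at all, so avoid dividing by zero
        return [0.0] * len(reviews)
    return [len(tokens) / len(vocabulary) for tokens in token_sets]


reviews = [
    "Good product",
    "Good but not very effective",
    "its my regular choice",
]
# The corpus has 10 unique words, so the scores are 2/10, 5/10 and 4/10.
print(complexity_scores(reviews))  # [0.2, 0.5, 0.4]
```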

## With these features we have extracted the informative content of a review.
### One may add more features, such as a readability score (e.g. the SMOG index), depending on the use case.

### We do not use a readability score as a metric because our reviews come from Tier I, Tier II and Tier III cities, and we don't want to penalise reviews written by people from underprivileged backgrounds.

#### Source- [Wikipedia](https://en.wikipedia.org/wiki/Readability)

In [30]:
df.to_csv('data/Features.csv',index = False)

In [32]:
#profile = ProfileReport(df)

In [None]:
#profile

In [12]:
#profile.to_file(output_file="feature_analysis.html")

## We have 1655 reviews with us; let's move on to the Model Training section.