We analyse each review by extracting the problems and benefits of the application that it mentions.
To do so, we use the Hugging Face question-answering pipeline.
Hugging Face also provides a sentiment-analysis pipeline, which lets us score the sentiment of each review.

In [1]:
#libraries
import pandas as pd


In [2]:
#import data
appStore = pd.read_csv('AppStoreData.csv')
googlePlay = pd.read_csv('PlayStoreData.csv')

In [3]:
#combine review data 
as_review = appStore['review']
gp_review = googlePlay['text']

reviews = as_review.tolist() + gp_review.tolist()

In [4]:
#data cleaning to remove weird comments
print(reviews)
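
The cell above only prints the raw list, and the notebook does not state a cleaning rule. A minimal sketch of one possible cleaning pass follows, assuming that "weird" means missing values, blank strings, and exact duplicates. The helper name `clean_reviews` and those three rules are assumptions, not the authors' method.

```python
def clean_reviews(raw_reviews):
    """Hypothetical helper: drop missing, blank, and duplicate reviews."""
    cleaned = []
    seen = set()
    for item in raw_reviews:
        # pandas stores missing cells as float NaN; skip those and any other non-string
        if not isinstance(item, str):
            continue
        # Collapse runs of whitespace (including \r\n) into single spaces
        text = " ".join(item.split())
        if not text:
            continue
        # Keep only the first occurrence of each review (case-insensitive, an assumption)
        key = text.lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned

sample = ["Great app!", float("nan"), "   ", "great app!", "Fair loans.\r\nLow interest."]
cleaned_sample = clean_reviews(sample)
print(cleaned_sample)
```

In the notebook this could be applied as `reviews = clean_reviews(reviews)` before any scoring, if these rules suit the data.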



In [5]:
from transformers import pipeline
sent_pipeline = pipeline("sentiment-analysis")
qa_pipeline = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Sentiment analysis

In [6]:
sentiment_scoring = [sent_pipeline(review) for review in reviews]

In [7]:
# Each pipeline call returns a one-element list: [{'label': ..., 'score': ...}]
label = []
score = []
for lst in sentiment_scoring:
    label.append(lst[0]['label'])
    score.append(lst[0]['score'])

In [8]:
sentiment_df = pd.DataFrame(list(zip(reviews, label, score)), columns=['review','label', 'sentiment_score'])

In [9]:
unique_labels = sentiment_df['label'].unique()

# Print the unique labels
print("Unique Labels:", unique_labels)

Unique Labels: ['POSITIVE' 'NEGATIVE']


In [10]:
sentiment_df.head()

                                              review     label  sentiment_score
0  Great banking app with attractive interest rat...  POSITIVE         0.994833
1  A bank like no other, no bank have such amazin...  POSITIVE         0.999390
2  Notice that the drop in interest rate of 0.8% ...  NEGATIVE         0.998625
3  Sending money into my GXS account is a breeze ...  NEGATIVE         0.994006
4  I have to say that the UI/UX is one of the bes...  POSITIVE         0.998633


In [11]:
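import torch
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def roberta_score(example):
    # Truncate long reviews so they fit within the model's maximum input length
    encoded_text = tokenizer(example, return_tensors='pt', truncation=True, max_length=512)
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        output = model(**encoded_text)
    # output[0] holds the logits; [0] selects the single review in the batch
    scores = output[0][0].numpy()
    # Convert the logits to probabilities that sum to 1
    scores = softmax(scores)
    # This model orders its classes as negative, neutral, positive
    score_dict = {
        'roberta_neg' : scores[0],
        'roberta_neu' : scores[1],
        'roberta_pos' : scores[2]
    }
    return score_dict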

neg_scores = []
neu_scores = []
pos_scores = []

# Score each example and store the results
for rev in reviews:
    scores = roberta_score(rev)
    neg_scores.append(scores['roberta_neg'])
    neu_scores.append(scores['roberta_neu'])
    pos_scores.append(scores['roberta_pos'])

# Create a DataFrame from the scores
score_df = pd.DataFrame({
    'Review': reviews,
    'Negative': neg_scores,
    'Neutral': neu_scores,
    'Positive': pos_scores
})

print(score_df.head())

                                              Review  Negative   Neutral  \
0  Great banking app with attractive interest rat...  0.006360  0.044363   
1  A bank like no other, no bank have such amazin...  0.002113  0.013991   
2  Notice that the drop in interest rate of 0.8% ...  0.632239  0.330659   
3  Sending money into my GXS account is a breeze ...  0.908289  0.080240   
4  I have to say that the UI/UX is one of the bes...  0.001639  0.011583   

   Positive  
0  0.949278  
1  0.983896  
2  0.037101  
3  0.011471  
4  0.986778  


In [12]:
labels = []
for neg, neu, pos in zip(neg_scores, neu_scores, pos_scores):
    if max(neg, neu, pos) == neg:
        labels.append('Detractors')
    elif max(neg, neu, pos) == neu:
        labels.append('Passives')
    else:
        labels.append('Promoters')

# Add labels to the DataFrame
score_df['Label'] = labels

print(score_df.head())

                                              Review  Negative   Neutral  \
0  Great banking app with attractive interest rat...  0.006360  0.044363   
1  A bank like no other, no bank have such amazin...  0.002113  0.013991   
2  Notice that the drop in interest rate of 0.8% ...  0.632239  0.330659   
3  Sending money into my GXS account is a breeze ...  0.908289  0.080240   
4  I have to say that the UI/UX is one of the bes...  0.001639  0.011583   

   Positive       Label  
0  0.949278   Promoters  
1  0.983896   Promoters  
2  0.037101  Detractors  
3  0.011471  Detractors  
4  0.986778   Promoters  


In [13]:
unique_labels = score_df['Label'].unique()

# Print the unique labels
print("Unique Labels:", unique_labels)

Unique Labels: ['Promoters' 'Detractors' 'Passives']


In [14]:
specific_label = 'Passives'  # Specify the label you want to filter for
specific_label_rows = score_df[score_df['Label'] == specific_label]

# Print the extracted rows
print("Rows with label", specific_label, ":")
print(specific_label_rows)

Rows with label Passives :
                                                Review  Negative   Neutral  \
5    Have been waiting for a slot for the account s...  0.391285  0.465227   
18   Already 1 year but basic functions like adding...  0.306678  0.420772   
32                                       User friendly  0.058341  0.539364   
38   Noticed the news writing they offer 3.48% inte...  0.433862  0.441348   
67   Show allow to increase amount not reduce and s...  0.377178  0.559982   
68                          Guys u guy to try this out  0.028829  0.749295   
80   could you add a way to sign in using singpass ...  0.041927  0.880339   
81   Not prepared? Then don’t launch. See your dire...  0.289861  0.407345   
85                                           Efficient  0.048277  0.507522   
89   Fair and transparent loans.\r\nInterest is not...  0.443049  0.503660   
92   Reduced interest from 3.48 to 2.68% in less th...  0.330284  0.606729   
102             Did an intern write t

In [15]:
label_counts = score_df['Label'].value_counts()

# Print the count of each label
print("Label Counts:")
print(label_counts)

Label Counts:
Label
Detractors    195
Promoters     163
Passives       60
Name: count, dtype: int64


In [16]:
#NPS (pretend the topic splitting actually works)
import pandas as pd

def net_promoter_score(score_df, category_column):
    categories = score_df[category_column].unique()  # Get unique categories from the specified column
    category_results = {}  # Dictionary to store results for each category

    for category in categories:
        # Filter the DataFrame for the current category
        category_df = score_df[score_df[category_column] == category]

        # Count the occurrences of each label
        label_counts = category_df['Label'].value_counts()

        # Tally promoters, detractors, and passives
        promoter_count = label_counts.get('Promoters', 0)
        detractor_count = label_counts.get('Detractors', 0)
        passive_count = label_counts.get('Passives', 0)
        total_count = promoter_count + detractor_count + passive_count

        # Calculate NPS
        nps = ((promoter_count - detractor_count) / total_count) * 100

        # Store the result for the current category
        category_results[category] = round(nps, 2)

        
    return category_results

# Example usage
score_df = pd.DataFrame({
    'Category': ['Category A', 'Category B', 'Category A', 'Category B', 'Category A'],
    'Review': ['Review 1', 'Review 2', 'Review 3', 'Review 4', 'Review 5'],
    'Negative': [0.1, 0.3, 0.2, 0.5, 0.1],
    'Neutral': [0.6, 0.4, 0.3, 0.2, 0.4],
    'Positive': [0.3, 0.3, 0.5, 0.3, 0.5],
    'Label': ['Passives', 'Passives', 'Promoters', 'Detractors', 'Promoters']
})


grouped_df = score_df.groupby('Category')
for name, group in grouped_df:
    print(f"Category: {name}")
    print(group)
    print()

# Calculate the NPS for each category
category_diffs = net_promoter_score(score_df, 'Category')
print("Category Differences:")
print(category_diffs)


Category: Category A
     Category    Review  Negative  Neutral  Positive      Label
0  Category A  Review 1       0.1      0.6       0.3   Passives
2  Category A  Review 3       0.2      0.3       0.5  Promoters
4  Category A  Review 5       0.1      0.4       0.5  Promoters

Category: Category B
     Category    Review  Negative  Neutral  Positive       Label
1  Category B  Review 2       0.3      0.4       0.3    Passives
3  Category B  Review 4       0.5      0.2       0.3  Detractors

Category Differences:
{'Category A': 66.67, 'Category B': -50.0}
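
The score above is the share of promoters minus the share of detractors, in percent: NPS = 100 × (promoters − detractors) / total. As a sketch, the same arithmetic on a plain list of labels reproduces the two figures for the example categories. The helper name `nps_from_labels` and the `None` return for an empty group are assumptions made for illustration.

```python
from collections import Counter

def nps_from_labels(labels):
    """Hypothetical helper: NPS in percent from 'Promoters'/'Passives'/'Detractors' labels."""
    counts = Counter(labels)
    total = counts['Promoters'] + counts['Passives'] + counts['Detractors']
    if total == 0:
        # No usable labels, so the score is undefined (assumption: signal this with None)
        return None
    return round((counts['Promoters'] - counts['Detractors']) / total * 100, 2)

# Labels taken from the example DataFrame above
category_a = ['Passives', 'Promoters', 'Promoters']
category_b = ['Passives', 'Detractors']
nps_a = nps_from_labels(category_a)
nps_b = nps_from_labels(category_b)
print(nps_a, nps_b)
```

Passives count toward the total but not the numerator, which is why Category A scores 66.67 rather than 100.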


In [28]:
#VADER
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Initialize the SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

# Initialize lists to store data
review_texts = []
positive_scores = []
negative_scores = []
neutral_scores = []
compound_scores = []
nps_indiv = []
nps_category = []  # New column for NPS categories

# Perform sentiment analysis and store scores in lists
for review in reviews:
    vs = analyzer.polarity_scores(review)
    review_texts.append(review)
    positive_scores.append(vs['pos'])
    negative_scores.append(vs['neg'])
    neutral_scores.append(vs['neu'])
    compound_scores.append(vs['compound'])
    
    # Map the compound score in [-1, 1] onto an integer 0-10 using 11 equal-width bins (width 2/11)
    if -1 <= vs['compound'] <= -9/11:
        nps_indiv.append(0)
    elif -9/11 < vs['compound'] <= -7/11:
        nps_indiv.append(1)
    elif -7/11 < vs['compound'] <= -5/11:
        nps_indiv.append(2)
    elif -5/11 < vs['compound'] <= -3/11:
        nps_indiv.append(3)
    elif -3/11 < vs['compound'] <= -1/11:
        nps_indiv.append(4)
    elif -1/11 < vs['compound'] <= 1/11:
        nps_indiv.append(5)
    elif 1/11 < vs['compound'] <= 3/11:
        nps_indiv.append(6)
    elif 3/11 < vs['compound'] <= 5/11:
        nps_indiv.append(7)
    elif 5/11 < vs['compound'] <= 7/11:
        nps_indiv.append(8)
    elif 7/11 < vs['compound'] <= 9/11:
        nps_indiv.append(9)
    else:
        nps_indiv.append(10)
    
    # Map nps_indiv scores to NPS categories
    if nps_indiv[-1] >= 9:  # Promoters
        nps_category.append('Promoter')
    elif nps_indiv[-1] >= 7:  # Passives
        nps_category.append('Passive')
    else:  # Detractors
        nps_category.append('Detractor')

# Create dataframe
score_df = pd.DataFrame({
    'Review': review_texts,
    'Positive Score': positive_scores,
    'Negative Score': negative_scores,
    'Neutral Score': neutral_scores,
    'Compound Score': compound_scores,
    'nps_indiv': nps_indiv,
    'nps_category': nps_category  # Adding the new column for NPS categories
})

# Display the dataframe
print(score_df)

                                                Review  Positive Score  \
0    Great banking app with attractive interest rat...           0.367   
1    A bank like no other, no bank have such amazin...           0.147   
2    Notice that the drop in interest rate of 0.8% ...           0.201   
3    Sending money into my GXS account is a breeze ...           0.059   
4    I have to say that the UI/UX is one of the bes...           0.141   
..                                                 ...             ...   
413  Not ready to roll out completely. Aint even al...           0.115   
414                                       Can't work .           0.000   
415  Can not download yet, just always show pending...           0.000   
416  Looks cool and sleek! Can I get an invite if I...           0.208   
417  It's doesn't work, they're just trying to coll...           0.079   

     Negative Score  Neutral Score  Compound Score  nps_indiv nps_category  
0             0.024          0.609
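
The if/elif ladder in the cell above splits [-1, 1] into 11 equal bins whose upper edges are -9/11, -7/11, ..., 9/11. A compact equivalent can be sketched with the standard-library `bisect` module. The names `EDGES`, `compound_to_nps`, and `nps_bucket` are illustrative assumptions, and the sketch assumes the compound score stays within [-1, 1].

```python
from bisect import bisect_left

# Upper edges of bins 0..9; anything above 9/11 falls into bin 10
EDGES = [k / 11 for k in range(-9, 10, 2)]

def compound_to_nps(compound):
    """Hypothetical helper: map a compound score in [-1, 1] to an integer 0-10."""
    # bisect_left counts the edges strictly below the value,
    # so a value equal to an edge stays in the lower bin
    return bisect_left(EDGES, compound)

def nps_bucket(score):
    """Hypothetical helper: standard NPS grouping of a 0-10 score."""
    if score >= 9:
        return 'Promoter'
    if score >= 7:
        return 'Passive'
    return 'Detractor'

# Compound scores that appear in the notebook's output
for compound in (0.9622, -0.8316, 0.8313):
    individual = compound_to_nps(compound)
    print(compound, individual, nps_bucket(individual))
```

Because the edges are computed with the same divisions as the ladder, boundary values land in the same bins.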

In [35]:
#NPS (pretend the topic splitting actually works)
import pandas as pd
import numpy as np

def net_promoter_score(score_df, category_column):
    categories = score_df[category_column].unique()  # Get unique categories from the specified column
    category_results = {}  # Dictionary to store results for each category

    for category in categories:
        # Filter the DataFrame for the current category
        category_df = score_df[score_df[category_column] == category]

        # Count the occurrences of each label
        label_counts = category_df['nps_category'].value_counts()

        # Tally promoters, detractors, and passives
        promoter_count = label_counts.get('Promoter', 0)
        detractor_count = label_counts.get('Detractor', 0)
        passive_count = label_counts.get('Passive', 0)
        total_count = promoter_count + detractor_count + passive_count

        # Calculate NPS
        nps = ((promoter_count - detractor_count) / total_count) * 100

        # Store the result for the current category
        category_results[category] = round(nps, 2)

        
    return category_results

# Example usage
example = score_df.head().copy()  # explicit copy, so adding a column does not write to a slice of score_df
categories = ['A', 'B']
example['Category'] = np.random.choice(categories, size=len(example))  # Assign a random placeholder category to each row

grouped_df = example.groupby('Category')
for name, group in grouped_df:
    print(f"Category: {name}")
    print(group)
    print()

# Calculate the NPS for each category
category_diffs = net_promoter_score(example, 'Category')
print("Category Differences:")
print(category_diffs)

Category: A
                                              Review  Positive Score  \
0  Great banking app with attractive interest rat...           0.367   
3  Sending money into my GXS account is a breeze ...           0.059   
4  I have to say that the UI/UX is one of the bes...           0.141   

   Negative Score  Neutral Score  Compound Score  nps_indiv nps_category  \
0           0.024          0.609          0.9622         10     Promoter   
3           0.103          0.838         -0.8316          0    Detractor   
4           0.032          0.826          0.8313         10     Promoter   

  Category  
0        A  
3        A  
4        A  

Category: B
                                              Review  Positive Score  \
1  A bank like no other, no bank have such amazin...           0.147   
2  Notice that the drop in interest rate of 0.8% ...           0.201   

   Negative Score  Neutral Score  Compound Score  nps_indiv nps_category  \
1           0.051          0.801    


Review analysis

In [18]:
reviews[0]

'Great banking app with attractive interest rates! Please allow us to add and/or save payees so we don’t have to keep typing out UEN numbers or account numbers. Would be nice to be able to add the debit card to Apple Pay too!!'

In [19]:
# Wrap the QA calls in a function so they can be applied to every review

def report(review):
    # Return a dictionary with the extracted strength and suggested improvement
    ans = {}
    ans['Good'] = qa_pipeline(question="What is good and positive about this application?", context=review)["answer"]
    ans['Suggested improvements'] = qa_pipeline(question="How can the application improve?", context=review)["answer"]
    return ans

context = reviews[0]
print(qa_pipeline(question="What is good and positive about this application?", context=context))
print(qa_pipeline(question="What can be added to the application?", context=context))
print(qa_pipeline(question="What do the application lack", context=context))

{'score': 0.27957233786582947, 'start': 23, 'end': 48, 'answer': 'attractive interest rates'}


{'score': 0.7127764821052551, 'start': 196, 'end': 206, 'answer': 'debit card'}
{'score': 0.10196662694215775, 'start': 128, 'end': 158, 'answer': 'UEN numbers or account numbers'}


In [20]:
check = reviews[0]
print(check)
report(check)

Great banking app with attractive interest rates! Please allow us to add and/or save payees so we don’t have to keep typing out UEN numbers or account numbers. Would be nice to be able to add the debit card to Apple Pay too!!


{'Good': 'attractive interest rates',
 'Suggested improvements': 'Please allow us to add and/or save payees'}

In [21]:
# Run QA on all the reviews (slow: two model calls per review)
qa_reports = [report(review) for review in reviews]

KeyboardInterrupt: 

In [None]:
print(qa_reports)
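
Running `report` over every review was interrupted above, presumably because each review needs two model calls. One way to keep the run manageable, sketched here under the assumption that a partial result is acceptable, is to cap the number of reviews and keep whatever finished if the run is stopped. `report_many`, `limit`, and the stand-in `fake_report` are hypothetical names; in the notebook, `report` would be passed instead of `fake_report`.

```python
def report_many(reviews, report_fn, limit=None):
    """Hypothetical helper: apply report_fn to at most `limit` reviews."""
    results = []
    subset = reviews if limit is None else reviews[:limit]
    try:
        for review in subset:
            # Store the review next to its extracted answers
            results.append({'review': review, **report_fn(review)})
    except KeyboardInterrupt:
        # Keep whatever was finished instead of losing the whole run
        print(f"Interrupted after {len(results)} of {len(subset)} reviews")
    return results

def fake_report(review):
    # Stand-in for report(); it only slices the text so the sketch runs without a model
    return {'Good': review[:5], 'Suggested improvements': review[-5:]}

demo = report_many(["Great app, add Apple Pay", "Cannot transfer money out"], fake_report, limit=1)
print(demo)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` for inspection alongside the sentiment scores.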



Another question-answering method that may be more accurate

In [None]:

# import
from transformers import pipeline

# var
model_name = "deepset/xlm-roberta-base-squad2"

# generate pipeline
# Note: this model's tokenizer needs the sentencepiece package (pip install sentencepiece)
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)

# Named qa_input to avoid shadowing the built-in input()
qa_input = {
    'question': 'How can the application improve?' ,
    'context': 'My name is Mohit. I am going to visit my grandmother. She is old.'
}
print(nlp(qa_input))
## Output --> {'score': 0.30, 'start': 10, 'end': 17, 'answer': ' Mohit.'}


ValueError: Couldn't instantiate the backend tokenizer from one of: 
(1) a `tokenizers` library serialization file, 
(2) a slow tokenizer instance to convert or 
(3) an equivalent slow tokenizer class to instantiate and convert. 
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

Comparing the default Hugging Face QA pipeline with the xlm-roberta pipeline (nlp)

In [None]:
#i think what we can do is to analyse the sentiment of the review. If it is bad
print(reviews[3])
# Named qa_input to avoid shadowing the built-in input()
qa_input = {
    'question': 'How can the application be improved?',
    'context': reviews[3]
}
print(nlp(qa_input))

qa_input = {
    'question': 'what is good about the application',
    'context': reviews[3]
}
print(nlp(qa_input))

# Compare with the default Hugging Face pipeline
check = reviews[3]
report(check)


Sending money into my GXS account is a breeze and instantaneous - regardless of the amounts. I’m able to immediately see that my funds are in GXS. 

Transferring money OUT is a huge issue. Since June I’ve had problems transferring amounts higher than $500 back to my other banking accounts, each time a red banner will pop up and said something went wrong please try again later. TODAY I can’t transfer more than $1000 back to myself - even the $1000 had to be transferred in TWO transactions of $500 each. Customer service officers did their best to help each time but it’s annoying that the advice provided (killing the app, re-logging in with SingPass) still don’t work.
{'score': 0.14726075530052185, 'start': 608, 'end': 655, 'answer': ' (killing the app, re-logging in with SingPass)'}
{'score': 0.018554577603936195, 'start': 608, 'end': 655, 'answer': ' (killing the app, re-logging in with SingPass)'}


{'Good': 'it’s annoying that the advice provided',
 'Suggested improvements': 'still don’t work'}
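
In the comparison above, the xlm-roberta pipeline returns the same span for both questions with low scores (about 0.15 and 0.02), which suggests the span is not a reliable answer to either. A simple guard, sketched here, is to discard answers whose score falls below a cutoff. The helper name `confident_answer` and the 0.2 cutoff are assumptions chosen for illustration, not tuned values.

```python
def confident_answer(result, min_score=0.2):
    """Hypothetical helper: return the stripped answer, or None if the score is below the cutoff."""
    if result['score'] < min_score:
        return None
    return result['answer'].strip()

# Result dictionaries copied from the outputs in this notebook
improve = {'score': 0.14726075530052185, 'start': 608, 'end': 655,
           'answer': ' (killing the app, re-logging in with SingPass)'}
debit = {'score': 0.7127764821052551, 'start': 196, 'end': 206, 'answer': 'debit card'}

print(confident_answer(improve))
print(confident_answer(debit))
```

A review with no confident answer could then be left blank in the report instead of showing a misleading span.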