# **REVIEW ANALYSIS**



## **Introduction**
This project analyzes customer reviews of the Inpost app across four platforms: Playstore, Appstore, Trustpilot, and Twitter. The goal is to observe the general opinions of Inpost users. Because of limited capacity, this analysis considers only 1,650 reviews overall.

Here is the order of work for the project:
- We will install all dependencies.
- Next, we will import all required libraries (any others will be imported as we go).
- For simplicity and to limit repetition, we will define helper functions (tools) that keep the code short.
- We will analyze the reviews from each platform individually: Playstore, Appstore, Trustpilot, and Twitter, in that order.

In short, we will extract reviews from these platforms and use a BERT transformer model to label each review as positive, negative, or neutral. Finally, we analyze the sentiment counts and identify common topics and mentions within the reviews.



### **INSTALLING DEPENDENCIES**

Here are the dependencies we will require:
- transformers: This loads and runs the BERT model that generates the sentiments.
- google-play-scraper: This is useful for extracting reviews from Google Playstore.
- app-store-scraper: This is useful for extracting reviews from Apple Appstore.

In [None]:
pip install transformers

In [None]:
pip install google-play-scraper

In [None]:
pip install app-store-scraper

### **IMPORTING LIBRARIES**

Here are some of the necessary libraries:
 - Sort and reviews from google-play-scraper, needed to access the reviews from Playstore.
 - AppStore from app-store-scraper, needed for extracting reviews from Appstore.
 - Plotly for visualization.
 - Pandas and Numpy for data manipulation using dataframes and numerical operations.
 - Torch (PyTorch), which is required to run the BERT model.
 - Regular expressions for text processing.


 ***NB:*** We may need more libraries as we go, and will import them where they are first used.

In [None]:
from google_play_scraper import Sort, reviews  #Extract playstore reviews
from app_store_scraper import AppStore                 # Extract Appstore reviews
from transformers import AutoTokenizer, AutoModelForSequenceClassification #Bert model for sentiment groupings
import plotly.express as px # for visualization
import pandas as pd
import numpy as np
import torch # Using Pytorch (useful for BERT)
import re # regular expression (useful for text processing)

### **DEFINING TOOLS**
The data structure for each platform is kept largely the same to make analysis easy. To avoid retyping the same code, we define the necessary functions up front.

The functions are:
- A function to plot a chart for the ratings
- A function to plot a chart for the sentiments
- A function to extract common words and phrases from the reviews
- A function to assign each review to a category
- A function to calculate distributions for ratings, sentiments, and categories

In [None]:
# Function to plot bar chart for ratings distribution
def plot_rating_distribution(df, rating_column='Rating', title='Rating Distribution'):
    # Calculate rating counts
    rating_counts = df[rating_column].value_counts().sort_index()

    # Create the bar chart
    fig = px.bar(
        x=rating_counts.index,
        y=rating_counts.values,
        labels={'x': rating_column, 'y': 'Count'},
        title=title
    )

    # Customize the chart
    fig.update_layout(
        xaxis_type='category',
        xaxis_title=rating_column,
        yaxis_title='Number of Reviews'
    )
    fig.update_layout(title_x=0.47)
    # Show the chart
    fig.show()


In [None]:
# Function to plot a pie chart for sentiment distribution
def plot_sentiment_donut(df, sentiment_column='Sentiments', title='Sentiment Distribution'):
    sentiment_counts = df[sentiment_column].value_counts()
    fig = px.pie(
        values=sentiment_counts.values,
        names=sentiment_counts.index,
        hole=0.4,
        title=title
    )
    fig.update_traces(textposition='inside', textinfo='percent+label')

    fig.update_layout(title_x=0.47)
    return fig

In [None]:
# Function to identify common words and phrases from reviews
from sklearn.feature_extraction.text import CountVectorizer

def analyze_reviews(dataframe, sentiment_column):
    """
    Extracts the most frequent words and phrases (bigrams and trigrams) from the
    text column named by `sentiment_column` (e.g., 'Reviews').
    Filter the dataframe by sentiment (Positive, Negative) before calling this function.
    """
    # Select the column that holds the review text
    reviews = dataframe[sentiment_column]

    # Extract most frequent words
    vectorizer_words = CountVectorizer(stop_words='english', max_features=20)
    keywords = vectorizer_words.fit_transform(reviews)
    word_counts = dict(zip(vectorizer_words.get_feature_names_out(), keywords.toarray().sum(axis=0)))

    # Extract most frequent phrases (bigrams and trigrams)
    vectorizer_phrases = CountVectorizer(ngram_range=(2, 3), stop_words='english', max_features=20)
    phrases = vectorizer_phrases.fit_transform(reviews)
    phrase_counts = dict(zip(vectorizer_phrases.get_feature_names_out(), phrases.toarray().sum(axis=0)))

    # Calculate total reviews for percentage calculation
    total_reviews = len(dataframe)
    print("====Common words:=====\n")
    for word, count in word_counts.items():
      percentage = (count / total_reviews) * 100
      print(f"'{word}': {count} occurrences ({percentage:.2f}%)")
    print("\n=====Common Phrases:=====\n")
    for phrase, count in phrase_counts.items():
      percentage = (count / total_reviews) * 100
      print(f"'{phrase}': {count} occurrences ({percentage:.2f}%)")

    return word_counts
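
The phrase counts printed by this function can look odd at first (for example 'easy use' or 'doesn work'). As far as we understand CountVectorizer's defaults, this is because text is lowercased, only tokens of two or more characters are kept, and stop words are dropped before the n-grams are built. Below is a minimal sketch of that behaviour using only the standard library. The stop-word list and the sample reviews are made up for illustration, and the helper names `tokenize` and `ngrams` are ours, not part of scikit-learn.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; scikit-learn's 'english' list is much longer
STOP_WORDS = {"to", "the", "it", "is", "so", "and"}

def tokenize(text):
    # Lowercase, keep tokens of 2+ word characters, then drop stop words
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    # Join every run of n consecutive tokens into one phrase
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up sample reviews
sample_reviews = [
    "So easy to use. Love it.",
    "The app is easy to use",
    "It doesn't work",
]

phrase_counts = Counter()
for review in sample_reviews:
    phrase_counts.update(ngrams(tokenize(review), 2))

print(phrase_counts)
```

Here "easy to use" becomes the bigram 'easy use' because 'to' is removed first, and "doesn't" becomes 'doesn' because the apostrophe splits the word and the single letter 't' is discarded.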

In [None]:
# Function to create count and percentage dataframes
def calculate_distribution(df, reference_column, reference_df=None):
    """
    Calculate counts and percentages for the values in `reference_column`.
    Percentages are relative to `reference_df` when it is given, otherwise to `df`.
    """
    # Use the provided reference DataFrame or the input DataFrame for total reviews
    total_reviews = len(reference_df) if reference_df is not None else len(df)

    # Count how often each value occurs
    total_counts = df[reference_column].value_counts()

    # Convert the counts to percentages of the total
    total_percentages = round((total_counts / total_reviews) * 100,2)

    # Create the output DataFrame
    sentiment_df = pd.DataFrame({'Counts': total_counts, 'Percentage(%)': total_percentages})

    return sentiment_df
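
The `reference_df` argument changes what the percentages mean, so a small worked example may help. The sketch below mirrors the same arithmetic with the standard library only. The function name `distribution` and the sample labels are hypothetical and are not used elsewhere in this notebook.

```python
from collections import Counter

def distribution(labels, total=None):
    # total defaults to the number of labels, like reference_df=None above
    total = len(labels) if total is None else total
    counts = Counter(labels)
    return {label: (n, round(n / total * 100, 2)) for label, n in counts.items()}

# Hypothetical subset: 3 categorised positive reviews out of 10 reviews overall
positive_categories = ["Ease of Use", "Ease of Use", "Convenience"]

within_subset = distribution(positive_categories)             # share of the positive reviews
of_all_reviews = distribution(positive_categories, total=10)  # share of all reviews

print(within_subset)
print(of_all_reviews)
```

With a reference total, the category percentages add up to the share of the subset in the whole dataset rather than to 100%.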


In [None]:
# Function to categorize reviews
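def categorize_review(review):
    # Relies on the global `categories` dict defined just before each call
    review_lower = review.lower() # convert the review to lowercase
    for category, keywords in categories.items():
        # Lowercase the keywords too, so entries such as "QR code" can match
        if any(keyword.lower() in review_lower for keyword in keywords):
            return category # the first matching category wins
    return 'Other'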
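
Two properties of this keyword matching are worth keeping in mind when reading the category tables: the first matching category wins, and keywords match as substrings. The sketch below illustrates both with a variant that takes the dictionary as a parameter. The name `categorize` and the short keyword lists are hypothetical and exist only for this example.

```python
def categorize(review, categories):
    # Return the first category whose keyword list has a substring match
    review_lower = review.lower()
    for category, keywords in categories.items():
        if any(keyword.lower() in review_lower for keyword in keywords):
            return category
    return "Other"

# Hypothetical keyword lists, much shorter than the ones used later
demo_categories = {
    "Ease of Use": ["easy"],
    "App Functionality": ["app", "open"],
}

print(categorize("The app is easy to use", demo_categories))  # Ease of Use (listed first)
print(categorize("Happy with the service", demo_categories))  # App Functionality ("app" inside "happy")
print(categorize("Parcel arrived late", demo_categories))     # Other
```

Because dictionaries keep insertion order, the order in which categories are listed decides ties, and short keywords such as "app" can match inside unrelated words.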

In [None]:
# Function to create chart for categories
def category_chart(df, title):
    """
    Creates and displays a bar chart for categories and their percentages.
    """
    # Create the bar chart
    fig = px.bar(
        x=df.index,
        y=df['Percentage(%)'],
        labels={'x': 'category', 'y': 'Percentage'},
        title=title
    )

    # Customize the chart
    fig.update_layout(
        xaxis_type='category',
        xaxis_title='Categories',
        yaxis_title='Percentages(%)',
        title_x=0.47
    )

    # Show the chart
    fig.show()


### **BERT Model Initialization**

We use the *'nlptown/bert-base-multilingual-uncased-sentiment'* fine-tuned model, which is designed for multilingual sentiment analysis and uses BERT's transformer architecture to classify text into sentiment categories. It predicts a rating from 1 to 5 stars for each text.

Other fine-tuned models could be explored as well.

Here we initialize the BERT model, using AutoTokenizer to preprocess the input text into a BERT-compatible format, and AutoModelForSequenceClassification to load the pretrained sentiment classifier.

In [None]:
# Initiating the transformer
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')


In [None]:
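# Getting the sentiment score (1-5 stars) from the BERT model
def sentiment_score(review):
    # Tokenize and truncate to the model's 512-token limit
    tokens = tokenizer.encode(review, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad(): # inference only, no gradients needed
        result = model(tokens)
    return int(torch.argmax(result.logits))+1 # class index 0-4 -> 1-5 stars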
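
To make the scoring step concrete, here is a minimal sketch of how a vector of five logits becomes a star rating and then a sentiment label, following the grouping this notebook uses (1-2 Negative, 3 Neutral, 4-5 Positive). It uses plain Python lists instead of tensors, the logits are made-up values rather than real model output, and the helper names are hypothetical.

```python
def stars_from_logits(logits):
    # Index of the largest logit (0-4) plus one gives the star rating (1-5)
    return max(range(len(logits)), key=lambda i: logits[i]) + 1

def label_from_stars(stars):
    # Same grouping as the notebook: 1-2 Negative, 3 Neutral, 4-5 Positive
    if stars in (1, 2):
        return "Negative"
    if stars == 3:
        return "Neutral"
    if stars in (4, 5):
        return "Positive"
    return "Unknown"

hypothetical_logits = [-1.2, -0.4, 0.3, 1.1, 2.5]  # made-up values, not model output
stars = stars_from_logits(hypothetical_logits)
print(stars, label_from_stars(stars))  # 5 Positive
```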

## **Google Playstore**

In this phase we extract reviews from Playstore using the google-play-scraper library. We write a function that extracts the reviews and stores them in a pandas dataframe. We scrape a total of **500** reviews from the platform, starting from the most recent.

Once extracted, we pass the reviews through the BERT model to generate sentiments, and then we analyse the data: the distribution of the sentiments and ratings, and the common themes in the negative and positive reviews.




In [None]:
# Getting Google Playstore reviews
def get_google_play_reviews(app_id):
    # Fetch the 500 newest reviews using google-play-scraper.
    # The continuation token is not needed: passing it to a second call would
    # replace `result` with the next page instead of the newest reviews.
    result, _ = reviews(
        app_id,
        lang='en',
        country='us',
        sort=Sort.NEWEST,
        count=500,
        filter_score_with=None
    )

    # Convert the list of review dicts to a DataFrame
    g_df = pd.DataFrame(result)

    # Extract relevant columns and rename them
    df1 = g_df[['content', 'score', 'at']]
    df1 = df1.rename(columns={'content': 'Reviews', 'score': 'Ratings', 'at': 'Date'})

    # Add a 'Source' column
    df1['Source'] = 'Google Play'

    # Add a 'Year' column
    df1['Year'] = df1['Date'].dt.year

    # Return the DataFrame
    return df1

In [None]:
# Printing the dataframe
google_df = get_google_play_reviews('uk.co.inpost.inmobile')
google_df.head()

Unnamed: 0,Reviews,Ratings,Date,Source,Year
0,so easy to use. love it.,5,2024-06-03 11:01:29,Google Play,2024
1,Love it... so easy to use 👍,5,2024-06-02 11:40:11,Google Play,2024
2,"Brilliant, so much easier using the app",5,2024-06-01 18:56:21,Google Play,2024
3,Reports locker availability but when you get t...,1,2024-06-01 18:26:42,Google Play,2024
4,very good forst times used the app,5,2024-06-01 15:09:07,Google Play,2024


In [None]:
# Including Sentiment analysis to the data from the bert model
j = google_df['Reviews'].apply(lambda x: sentiment_score(x[:512])) # Using the sentiment score function(bert model)
google_df['Sentiments'] = np.select([j.isin([1,2]), j==3, j.isin([4,5])],['Negative','Neutral','Positive'], default ='Unknown') # Splitting sentiments into ranges
google_df.head()

Unnamed: 0,Reviews,Ratings,Date,Source,Year,Sentiments
0,brilliant,5,2024-05-22 04:47:33,Google Play,2024,Positive
1,Can not speak to anyone. Except a robot with l...,1,2024-05-21 12:21:54,Google Play,2024,Negative
2,"very user friendly, no problems at all",5,2024-05-21 12:21:12,Google Play,2024,Positive
3,"wow, much easier to us app to open.",5,2024-05-20 10:27:53,Google Play,2024,Positive
4,Great service,5,2024-05-20 08:49:48,Google Play,2024,Positive


In [None]:
# Saving the data as a csv
google_df.to_csv('google_reviews.csv', index=False)

**A quick note:** On inspection, the BERT model classifies the reviews well, but there are a few incorrect classifications.

Given the correlation between the Ratings and the Sentiments, it makes sense to treat reviews with high ratings (4 or 5) as positive and those with low ratings (1 or 2) as negative. Reviews rated 3 keep the label assigned by the model.

This will be done for the reviews from all other platforms.
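
The override rule can be written as a small standalone function. This is only a sketch of the logic applied in the next cell with `.loc`. The name `final_sentiment` is hypothetical and is not used elsewhere in the notebook.

```python
def final_sentiment(rating, model_label):
    # Ratings of 1-2 and 4-5 override the model; 3-star reviews keep the model's label
    if rating in (1, 2):
        return "Negative"
    if rating in (4, 5):
        return "Positive"
    return model_label

print(final_sentiment(5, "Negative"))  # Positive
print(final_sentiment(3, "Neutral"))   # Neutral
print(final_sentiment(3, "Positive"))  # Positive
```

This is why a 3-star review can still appear among the positive or negative reviews later on: only its model label decides.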

In [None]:
# Update the Sentiments column based on Ratings
google_df.loc[google_df['Ratings'].isin([1, 2]), 'Sentiments'] = 'Negative'  # Ratings 1 or 2 -> Negative
google_df.loc[google_df['Ratings'].isin([4, 5]), 'Sentiments'] = 'Positive'  # Ratings 4 or 5 -> Positive

In [None]:
# Getting the general info of the data
google_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Reviews     500 non-null    object
 1   Ratings     500 non-null    int64 
 2   Date        500 non-null    object
 3   Source      500 non-null    object
 4   Year        500 non-null    int64 
 5   Sentiments  500 non-null    object
dtypes: int64(2), object(4)
memory usage: 23.6+ KB


In [None]:
# calculate sentiment counts
playstore_sentiment_count = calculate_distribution(google_df, 'Sentiments', reference_df=None)
playstore_sentiment_count

Sentiments,Counts,Percentage(%)
Positive,385,77.0
Negative,108,21.6
Neutral,7,1.4


In [None]:
# Playstore Sentiment
plot_sentiment_donut(google_df, sentiment_column='Sentiments', title='Playstore Sentiment Distribution')

In [None]:
# Calculate Ratings count
playstore_ratings_count = calculate_distribution(google_df, 'Ratings', reference_df=None)
playstore_ratings_count

Ratings,Counts,Percentage(%)
5,369,73.8
1,86,17.2
2,15,3.0
4,15,3.0
3,15,3.0


In [None]:
#Visualizing the ratings count
plot_rating_distribution(google_df, rating_column='Ratings', title='Google Play Ratings')

In [None]:
# Positive Reviews
positive_result_df = google_df[google_df['Sentiments'] == 'Positive'].copy() # copy so a Category column can be added without a SettingWithCopyWarning
positive_results = analyze_reviews(positive_result_df, 'Reviews')


====Common words:=====

'amazing': 17 occurrences (4.42%)
'app': 66 occurrences (17.14%)
'brilliant': 46 occurrences (11.95%)
'convenient': 17 occurrences (4.42%)
'easier': 20 occurrences (5.19%)
'easy': 126 occurrences (32.73%)
'excellent': 33 occurrences (8.57%)
'good': 26 occurrences (6.75%)
'great': 56 occurrences (14.55%)
'locker': 20 occurrences (5.19%)
'love': 34 occurrences (8.83%)
'open': 31 occurrences (8.05%)
'parcel': 18 occurrences (4.68%)
'parcels': 16 occurrences (4.16%)
'quick': 23 occurrences (5.97%)
'really': 15 occurrences (3.90%)
'remote': 16 occurrences (4.16%)
'remotely': 15 occurrences (3.90%)
'service': 37 occurrences (9.61%)
'use': 65 occurrences (16.88%)

=====Common Phrases:=====

'absolutely brilliant': 5 occurrences (1.30%)
'app easy': 5 occurrences (1.30%)
'app easy use': 4 occurrences (1.04%)
'brilliant easy': 6 occurrences (1.56%)
'brilliant easy use': 4 occurrences (1.04%)
'easy use': 48 occurrences (12.47%)
'excellent app': 5 occurrences (1.30%)
'excel

In [None]:
categories = {
    "Ease of Use": ["very user friendly", "easy", "ease", "easy peasy", "so quick and easy", "simple quick n easy", "fabulous easy 2 use", "incredibly easy to use"],
    "App Functionality": ["remote opening", "app", "live updates", "tracking information", "notifications", "QR code", "open remotely", "open", "remotely"],
    "Convenience": ["convenient", "hassle free", "24/7 access", "no need to queue", "collect anytime", "saves time", "quicker", "easier", "delivery"],
    "Service Quality": ["excellent service", "great service", "service", "brilliant", "fantastic", "amazing service", "superb", "excellent"],
    "Overall Experience": ["game changer", "amazing", "love", "the future", "impressive", "awesome", "love it", "would highly recommend", "great", "good"]
}

# Positive Reviews categories
positive_result_df['Category'] = positive_result_df['Reviews'].apply(categorize_review)
positive_result_df.head()







Unnamed: 0,Reviews,Ratings,Date,Source,Year,Sentiments,Category
0,brilliant,5,2024-05-22 04:47:33,Google Play,2024,Positive,Service Quality
2,"very user friendly, no problems at all",5,2024-05-21 12:21:12,Google Play,2024,Positive,Ease of Use
3,"wow, much easier to us app to open.",5,2024-05-20 10:27:53,Google Play,2024,Positive,App Functionality
4,Great service,5,2024-05-20 08:49:48,Google Play,2024,Positive,Service Quality
5,super,5,2024-05-18 21:53:23,Google Play,2024,Positive,Other


In [None]:
# Calculate category count
playstore_positive_category= calculate_distribution(positive_result_df, 'Category', reference_df=google_df)
playstore_positive_category

Category,Counts,Percentage(%)
Ease of Use,127,25.4
Service Quality,73,14.6
App Functionality,58,11.6
Overall Experience,56,11.2
Other,41,8.2
Convenience,30,6.0


#### **Top Positive Themes:**
**Ease of Use:** Customers frequently describe the product as easy to use and appreciate that it is straightforward and user-friendly.

**App Functionality:** Users frequently describe the app as helpful, functional, or well-designed.

**Service Quality:**
Many reviews describe the locker service or the customer service as great or excellent, or express overall satisfaction with the experience.

**Convenience:**
Words like "convenient" are used to express strong approval of using lockers for parcel delivery.


In [None]:
# Negative Reviews
negative_result_df = google_df[google_df['Sentiments'] == 'Negative'].copy() # copy so a Category column can be added without a SettingWithCopyWarning
negative_results = analyze_reviews(negative_result_df, 'Reviews')

====Common words:=====

'app': 68 occurrences (62.96%)
'availability': 13 occurrences (12.04%)
'code': 20 occurrences (18.52%)
'delivery': 15 occurrences (13.89%)
'doesn': 13 occurrences (12.04%)
'inpost': 20 occurrences (18.52%)
'locker': 31 occurrences (28.70%)
'lockers': 15 occurrences (13.89%)
'number': 17 occurrences (15.74%)
'open': 12 occurrences (11.11%)
'parcel': 30 occurrences (27.78%)
'parcels': 20 occurrences (18.52%)
'phone': 12 occurrences (11.11%)
'send': 10 occurrences (9.26%)
'time': 10 occurrences (9.26%)
'update': 17 occurrences (15.74%)
'useless': 14 occurrences (12.96%)
'verification': 11 occurrences (10.19%)
'won': 11 occurrences (10.19%)
'work': 18 occurrences (16.67%)

=====Common Phrases:=====

'app doesn': 3 occurrences (2.78%)
'app good': 3 occurrences (2.78%)
'app says': 3 occurrences (2.78%)
'doesn work': 9 occurrences (8.33%)
'error message': 3 occurrences (2.78%)
'live availability': 5 occurrences (4.63%)
'locker availability': 4 occurrences (3.70%)
'lock

In [None]:
categories = {
    "App Issues": ['app', 'phone', 'downloaded app', 'error message', 'use app', 'useless app', 'won let', 'work'],
    "Locker Issues": ['inpost', 'locker', 'lockers', 'locker availability', 'lockers available', 'live availability',
                                   'says lockers', 'qr code', 'availability', 'available', 'space'],
    "Verification Issues": ['number', 'mobile number', 'number send', 'phone number', 'verification', 'verification code', 'link'],
    "Parcel Issues": ['parcel', 'parcels', 'code', 'delivery', 'track parcel', 'send parcel', 'open', 'damage', 'lost', 'broken']
}

# Negative Reviews categories
negative_result_df['Category'] = negative_result_df['Reviews'].apply(categorize_review)
negative_result_df.head()







Unnamed: 0,Reviews,Ratings,Date,Source,Year,Sentiments,Category
1,Can not speak to anyone. Except a robot with l...,1,2024-05-21 12:21:54,Google Play,2024,Negative,Other
7,First use..App says some lockers. Lockers say ...,1,2024-05-18 12:40:55,Google Play,2024,Negative,App Issues
8,"Absolutely awful customer service, you cannot ...",1,2024-05-16 12:37:08,Google Play,2024,Negative,Locker Issues
10,This app is unreliable. Also I think the couri...,1,2024-05-15 15:30:29,Google Play,2024,Negative,App Issues
14,"It won't even let me type in my phone number, ...",1,2024-05-15 08:14:29,Google Play,2024,Negative,App Issues


In [None]:
# Calculate category count
playstore_negative_category= calculate_distribution(negative_result_df, 'Category', reference_df=google_df)
playstore_negative_category

Category,Counts,Percentage(%)
App Issues,68,13.6
Parcel Issues,16,3.2
Locker Issues,12,2.4
Other,8,1.6
Verification Issues,4,0.8


#### **Top Issues:**
**App-related Issues:** Customers frequently mention problems with the app itself, such as bugs, usability challenges, or performance issues.

**Parcel-related Problems:** Complaints about handling parcels, such as delays, lost items, or difficulties in retrieving them.

**Locker-related Concerns:** Issues with accessing or using lockers, including malfunctioning systems or unavailability.

**Verification Problems:** Frustrations with verifying codes and phone numbers.

In [None]:
# Combining the negative and positive categories
playstore_category_df = pd.concat([playstore_positive_category, playstore_negative_category], ignore_index=False)
playstore_category_df = playstore_category_df.sort_values(by='Counts', ascending=False)
playstore_category_df

Category,Counts,Percentage(%)
Ease of Use,127,25.4
Service Quality,73,14.6
App Issues,68,13.6
App Functionality,58,11.6
Overall Experience,56,11.2
Other,41,8.2
Convenience,30,6.0
Parcel Issues,16,3.2
Locker Issues,12,2.4
Other,8,1.6


In [None]:
# visualize the Categories
category_chart(playstore_category_df, 'Playstore Review Categories')

## **Apple Appstore**

In this phase we extract reviews from Appstore using the app-store-scraper library. We write a function that extracts the reviews and stores them in a pandas dataframe. We extract a total of **400** reviews, starting from the most recent.

Once extracted, we pass the reviews through the BERT model to generate sentiments, and then we analyse the data: the distribution of the sentiments and ratings, and the common themes in the negative and positive reviews.


In [None]:
# Getting App Store reviews

def get_app_store_reviews(countrycode, app_name, app_id):
    # Get 400 reviews using app_store_scraper
    a_reviews = AppStore(countrycode, app_name, app_id)
    a_reviews.review(how_many=400)

    # Convert the list of review dicts to a DataFrame
    a_df = pd.DataFrame(a_reviews.reviews)

    # Extract relevant columns and rename them
    df2 = a_df[['review', 'rating', 'date']]
    df2 = df2.rename(columns={'review': 'Reviews', 'rating': 'Ratings', 'date': 'Date'})

    # Add a 'Source' column
    df2['Source'] = 'App Store'

    # Add a 'Year' column
    df2['Year'] = df2['Date'].dt.year

    return df2

In [None]:
appstore_df = get_app_store_reviews("gb","inpost-uk",1591214233)
appstore_df.head()

Unnamed: 0,Reviews,Ratings,Date,Source,Year
0,What a brilliant app. So many delivery service...,5,2024-01-27 11:32:02,App Store,2024
1,I have just returned from my third trip to the...,1,2024-04-06 16:57:15,App Store,2024
2,I’ve sent 3 parcels so far with inpost as a tr...,1,2024-08-18 12:22:51,App Store,2024
3,Would've been rated brilliant 5-stars but ther...,4,2024-01-30 09:29:19,App Store,2024
4,This app is literally so good for selling and ...,5,2024-08-20 20:53:31,App Store,2024


In [None]:
# Including Sentiment analysis to the data
j = appstore_df['Reviews'].apply(lambda x: sentiment_score(x[:512])) # Using the sentiment score function
appstore_df['Sentiments'] = np.select([j.isin([1,2]), j==3, j.isin([4,5])],['Negative','Neutral','Positive'], default ='Unknown') # Splitting sentiments into ranges
appstore_df.head()

Unnamed: 0,Reviews,Ratings,Date,Source,Year,Sentiments
0,What a brilliant app. So many delivery service...,5,2024-01-27 11:32:02,App Store,2024,Positive
1,I have just returned from my third trip to the...,1,2024-04-06 16:57:15,App Store,2024,Negative
2,I’ve sent 3 parcels so far with inpost as a tr...,1,2024-08-18 12:22:51,App Store,2024,Negative
3,Would've been rated brilliant 5-stars but ther...,4,2024-01-30 09:29:19,App Store,2024,Neutral
4,This app is literally so good for selling and ...,5,2024-08-20 20:53:31,App Store,2024,Positive


In [None]:
# Saving the data as a csv
appstore_df.to_csv('Appstore_reviews.csv', index=False)

**A quick note:** On inspection, the BERT model classifies the reviews well, but there are a few incorrect classifications.

Given the correlation between the Ratings and the Sentiments, it makes sense to treat reviews with high ratings (4 or 5) as positive and those with low ratings (1 or 2) as negative. Reviews rated 3 keep the label assigned by the model.

In [None]:
# Update the Sentiments column based on Ratings
appstore_df.loc[appstore_df['Ratings'].isin([1, 2]), 'Sentiments'] = 'Negative'  # Ratings 1 or 2 -> Negative
appstore_df.loc[appstore_df['Ratings'].isin([4, 5]), 'Sentiments'] = 'Positive'  # Ratings 4 or 5 -> Positive

In [None]:
appstore_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Reviews     400 non-null    object
 1   Ratings     400 non-null    int64 
 2   Date        400 non-null    object
 3   Source      400 non-null    object
 4   Year        400 non-null    int64 
 5   Sentiments  400 non-null    object
dtypes: int64(2), object(4)
memory usage: 18.9+ KB


In [None]:
# calculate sentiment counts
Appstore_sentiment_count = calculate_distribution(appstore_df, 'Sentiments', reference_df=None)
Appstore_sentiment_count

Sentiments,Counts,Percentage(%)
Negative,202,50.5
Positive,189,47.25
Neutral,9,2.25


In [None]:
#visualizing Appstore sentiments
plot_sentiment_donut(appstore_df, sentiment_column='Sentiments', title='Appstore Sentiment Distribution')

In [None]:
# Calculate the Ratings count
Appstore_ratings_count = calculate_distribution(appstore_df, 'Ratings', reference_df=None)
Appstore_ratings_count

Ratings,Counts,Percentage(%)
1,157,39.25
5,156,39.0
2,34,8.5
4,27,6.75
3,26,6.5


In [None]:
#visualizing the ratings count
plot_rating_distribution(appstore_df, rating_column='Ratings', title='App Store Ratings')

In [None]:
# Positive Reviews
app_positive_result = appstore_df[appstore_df['Sentiments'] == 'Positive'].copy() # copy so a Category column can be added without a SettingWithCopyWarning
positive_results = analyze_reviews(app_positive_result, 'Reviews')

====Common words:=====

'able': 17 occurrences (8.99%)
'app': 81 occurrences (42.86%)
'collect': 15 occurrences (7.94%)
'easy': 60 occurrences (31.75%)
'excellent': 15 occurrences (7.94%)
'good': 29 occurrences (15.34%)
'great': 37 occurrences (19.58%)
'inpost': 26 occurrences (13.76%)
'just': 16 occurrences (8.47%)
'like': 15 occurrences (7.94%)
'locker': 37 occurrences (19.58%)
'lockers': 36 occurrences (19.05%)
'love': 33 occurrences (17.46%)
'open': 19 occurrences (10.05%)
'parcel': 21 occurrences (11.11%)
'parcels': 31 occurrences (16.40%)
'really': 21 occurrences (11.11%)
'service': 33 occurrences (17.46%)
'time': 23 occurrences (12.17%)
'use': 48 occurrences (25.40%)

=====Common Phrases:=====

'app easy': 10 occurrences (5.29%)
'app easy use': 6 occurrences (3.17%)
'collect parcel': 4 occurrences (2.12%)
'easy use': 31 occurrences (16.40%)
'excellent service': 6 occurrences (3.17%)
'great app': 8 occurrences (4.23%)
'inpost app': 5 occurrences (2.65%)
'local lockers': 5 occurre

In [None]:
categories = {
    "Ease of Use": ["quick","easy","easier", "easy use"],
    "App Functionality": ["remote opening","app",'track','qr code'],
    "Convenience": ["convenient",'touch screen', 'remote', 'remotely'],
    "Service Quality": ["service","excellent","parcel","parcels",'locker','lockers'],
    "Overall Experience": ["game changer","amazing","brilliant","love","the future","impressive","awesome","love it","would highly recommend","great","good"]
}

# Positive Reviews categories
app_positive_result['Category'] = app_positive_result['Reviews'].apply(categorize_review)
app_positive_result.head()






Unnamed: 0,Reviews,Ratings,Date,Source,Year,Sentiments,Category
0,What a brilliant app. So many delivery service...,5,2024-01-27 11:32:02,App Store,2024,Positive,Ease of Use
3,Would've been rated brilliant 5-stars but ther...,4,2024-01-30 09:29:19,App Store,2024,Positive,App Functionality
4,This app is literally so good for selling and ...,5,2024-08-20 20:53:31,App Store,2024,Positive,App Functionality
5,Being able to check availability in the locker...,3,2024-05-18 14:03:01,App Store,2024,Positive,App Functionality
6,I love using the app to collect things. It sav...,5,2024-07-11 16:23:07,App Store,2024,Positive,App Functionality


In [None]:
# Calculate the Category count
appstore_positive_category= calculate_distribution(app_positive_result, 'Category', reference_df= appstore_df)
appstore_positive_category

Category,Counts,Percentage(%)
Ease of Use,75,18.75
App Functionality,46,11.5
Service Quality,26,6.5
Overall Experience,18,4.5
Other,15,3.75
Convenience,9,2.25


#### **Top Positive Themes:**

**App Functionality:** Users frequently describe the app as helpful, functional, or well-designed.

**Ease of Use:** Customers frequently describe the product as easy to use and appreciate that it is straightforward and user-friendly.

**Overall Experience:**
Many customers appreciate the convenience of using lockers for parcel delivery, particularly because the lockers are local and can be opened remotely.

**Service Quality:** Customers explicitly say that they "love" the locker service.




In [None]:
# Appstore Negative Reviews
app_negative_result = appstore_df[appstore_df['Sentiments'] == 'Negative'].copy() # copy so a Category column can be added without a SettingWithCopyWarning
negative_results = analyze_reviews(app_negative_result, 'Reviews')

====Common words:=====

'app': 155 occurrences (76.73%)
'availability': 33 occurrences (16.34%)
'available': 32 occurrences (15.84%)
'doesn': 18 occurrences (8.91%)
'don': 28 occurrences (13.86%)
'inpost': 34 occurrences (16.83%)
'just': 26 occurrences (12.87%)
'live': 21 occurrences (10.40%)
'locker': 69 occurrences (34.16%)
'lockers': 75 occurrences (37.13%)
'parcel': 77 occurrences (38.12%)
'parcels': 44 occurrences (21.78%)
'says': 28 occurrences (13.86%)
'send': 29 occurrences (14.36%)
'service': 33 occurrences (16.34%)
'time': 50 occurrences (24.75%)
'tracking': 27 occurrences (13.37%)
'use': 34 occurrences (16.83%)
'using': 24 occurrences (11.88%)
've': 26 occurrences (12.87%)

=====Common Phrases:=====

'app doesn': 12 occurrences (5.94%)
'app says': 8 occurrences (3.96%)
'app track': 4 occurrences (1.98%)
'checked app': 5 occurrences (2.48%)
'customer service': 10 occurrences (4.95%)
'doesn work': 6 occurrences (2.97%)
'downloaded app': 5 occurrences (2.48%)
'inpost app': 6 oc
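
The percentages in the output above are consistent with dividing each count by the number of negative reviews (155 / 202 ≈ 76.73%). The definition of `analyze_reviews` lives earlier in the notebook, so the following is only a rough standard-library sketch of that idea. The names `tokenize`, `document_frequency`, `STOP_WORDS`, and the sample reviews are assumptions made for illustration, and the sketch counts each word at most once per review, which may differ from the original helper.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "is", "to", "and", "it", "my", "in", "of", "i"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def document_frequency(reviews, ngram=1):
    """Count in how many reviews each n-gram appears, with its share of all reviews."""
    counts = Counter()
    for review in reviews:
        tokens = tokenize(review)
        grams = {" ".join(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)}
        counts.update(grams)  # a set, so each review counts at most once per n-gram
    total = len(reviews)
    return {gram: (n, round(100 * n / total, 2)) for gram, n in counts.most_common()}

# Made-up reviews for illustration only.
sample_reviews = [
    "The app says the locker is available",
    "App doesn't work and the locker is full",
    "Parcel lost, customer service is slow",
]
words = document_frequency(sample_reviews)
phrases = document_frequency(sample_reviews, ngram=2)
print(words["app"])         # (2, 66.67)
print(phrases["app says"])  # (1, 33.33)
```

Splitting on letters only is why fragments such as `doesn` and `ve` show up as tokens in listings like the one above.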

In [None]:
categories = {
    "App Issues": ['downloaded app', 'error message', "app doesn", 'qr code','availability', 'available'],
    "Locker Issues": ['inpost', 'locker', 'lockers', 'locker availability', 'lockers available', 'live availability',
                      'says lockers', 'space'],
    "Verification Issues": ['number', 'mobile number', 'number send', 'phone number', 'verification', 'verification code', 'link'],
    "Parcel Issues": ['parcel', 'parcels', 'delivery', 'tracking', 'track parcel', 'send parcel', 'open', 'damage', 'lost', 'broken']
}
# Negative Reviews categories
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on a filtered slice
app_negative_result = app_negative_result.assign(Category=app_negative_result['Reviews'].apply(categorize_review))
app_negative_result.head()



Unnamed: 0,Reviews,Ratings,Date,Source,Year,Sentiments,Category
1,I have just returned from my third trip to the...,1,2024-04-06 16:57:15,App Store,2024,Negative,App Issues
2,I’ve sent 3 parcels so far with inpost as a tr...,1,2024-08-18 12:22:51,App Store,2024,Negative,Locker Issues
7,Absolute rubbish! What is the purpose of havin...,2,2024-04-17 05:11:09,App Store,2024,Negative,App Issues
11,How hard can it be to add an option to delete ...,3,2024-10-16 13:51:55,App Store,2024,Negative,Parcel Issues
13,"Shows that there is space in a locker, when th...",2,2024-05-29 14:12:54,App Store,2024,Negative,Locker Issues


In [None]:
# Calculate the Category count
appstore_negative_category= calculate_distribution(app_negative_result, 'Category', reference_df= appstore_df)
appstore_negative_category

Unnamed: 0_level_0,Counts,Percentage(%)
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Locker Issues,55,13.75
App Issues,49,12.25
Parcel Issues,48,12.0
Other,39,9.75
Verification Issues,11,2.75
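
The percentages in this table are shares of all App Store reviews rather than of the negative reviews alone: the figures imply a denominator of 400 (55 / 400 = 13.75%), which is what passing `reference_df` appears to do. As a minimal sketch of that arithmetic, assuming a hypothetical helper named `distribution` (the notebook's own `calculate_distribution` is defined earlier and may differ):

```python
from collections import Counter

def distribution(labels, reference_total=None):
    """Return (label, count, percentage) rows sorted by count, largest first.

    reference_total is the denominator for the percentages; when omitted,
    the number of labels is used instead.
    """
    total = reference_total if reference_total is not None else len(labels)
    counts = Counter(labels)
    return [(label, n, round(100 * n / total, 2)) for label, n in counts.most_common()]

# Rebuild the negative-category labels from the counts shown in the table above.
negative_labels = (["Locker Issues"] * 55 + ["App Issues"] * 49 + ["Parcel Issues"] * 48
                   + ["Other"] * 39 + ["Verification Issues"] * 11)

# Share of all 400 App Store reviews (the denominator implied by the table).
print(distribution(negative_labels, reference_total=400)[0])  # ('Locker Issues', 55, 13.75)
# Share of the 202 negative reviews only.
print(distribution(negative_labels)[0])                       # ('Locker Issues', 55, 27.23)
```

Using the full review count as the denominator keeps the positive and negative category tables on the same scale, which matters when they are concatenated later.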


#### **Top Issues:**

**App-related Issues:** Customers frequently mention problems with the app itself, especially complaints about inaccurate information on locker availability.

**Parcel-related Problems:** Complaints about parcel handling, such as delays, lost items, or difficulties retrieving parcels.

**Locker-related Concerns:** Issues with accessing or using lockers, including malfunctioning lockers, unavailability, or lack of space in the lockers.

**Verification Issues:** Complaints about code verification in the app.

In [None]:
# Combining the negative and positive categories
appstore_category_df = pd.concat([appstore_positive_category, appstore_negative_category], ignore_index=False)
appstore_category_df = appstore_category_df.sort_values(by='Counts', ascending=False)
appstore_category_df

Unnamed: 0_level_0,Counts,Percentage(%)
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Ease of Use,75,18.75
Locker Issues,55,13.75
App Issues,49,12.25
Parcel Issues,48,12.0
App Functionality,46,11.5
Other,39,9.75
Service Quality,26,6.5
Overall Experience,18,4.5
Other,15,3.75
Verification Issues,11,2.75


In [None]:
# visualize the Categories
category_chart(appstore_category_df,'Appstore Review Categories')

## **Trustpilot**

In this phase, we write a Python script that scrapes reviews from Trustpilot. The function accepts two arguments: the URL of the review page to scrape and the maximum number of reviews to collect. Here we extract **500** reviews.

As before, we pass the scraped reviews through the BERT model to classify them into sentiments.


In [None]:
# Getting TrustPilot Reviews
import requests
from bs4 import BeautifulSoup
import json
import time
import pandas as pd
from random import choice
import logging

def scrape_trustpilot_reviews(base_url: str, max_reviews: int = 1000) -> list:
    """
    Scrape reviews from Trustpilot with rotating user agents and review limit.

    Args:
        base_url (str): The base URL of the Trustpilot reviews page
        max_reviews (int, optional): The maximum number of reviews to scrape. Defaults to 1000.

    Returns:
        list: A list of dictionaries containing review data.
    """
    # List of common user agents for rotation
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Edge/91.0.864.59"
    ]

    reviews_data = []
    page_number = 1

    # Setup logging
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    while len(reviews_data) < max_reviews:
        try:
            # Rotate user agents
            current_user_agent = choice(user_agents)
            url = f"{base_url}?page={page_number}"

            # Make request with current user agent
            headers = {"User-Agent": current_user_agent}
            req = requests.get(url, headers=headers, timeout=10)
            req.raise_for_status()

            # Add delay to avoid rate limiting
            time.sleep(2 + choice([0, 0.5, 1, 1.5]))  # Random delay between 2-3.5 seconds

            # Parse the page
            soup = BeautifulSoup(req.text, 'html.parser')
            reviews_raw = soup.find("script", id="__NEXT_DATA__")

            if not reviews_raw:
                logger.info(f"No more reviews found at page {page_number}")
                break

            reviews = json.loads(reviews_raw.string)["props"]["pageProps"]["reviews"]

            if not reviews:
                logger.info(f"No more reviews found at page {page_number}")
                break

            # Process reviews
            for review in reviews:
                if len(reviews_data) >= max_reviews:
                    break

                data = {
                    'Date': pd.to_datetime(review["dates"]["publishedDate"]).strftime("%Y-%m-%d"),
                    'Author': review["consumer"]["displayName"],
                    'Body': review["text"],
                    'Heading': review["title"],
                    'Rating': review["rating"],
                    'Location': review["consumer"]["countryCode"]
                }
                reviews_data.append(data)

            logger.info(f"Collected {len(reviews_data)} reviews so far...")
            page_number += 1


            if page_number % 5 == 0:
                logger.info("Adding a longer delay after scraping multiple pages...")
                time.sleep(10 + choice([2, 3, 4]))  # Random delay between 12-14 seconds

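        except requests.RequestException as e:
            logger.error(f"Request error on page {page_number}: {str(e)}")
            # An HTTP error status (e.g. a 404 past the last page) is unlikely to recover, so stop rather than retry forever
            if isinstance(e, requests.HTTPError):
                break
            time.sleep(5)  # Longer delay on error
            continue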
        except (json.JSONDecodeError, AttributeError, KeyError) as e:
            logger.error(f"Parsing error on page {page_number}: {str(e)}")
            break
        except Exception as e:
            logger.error(f"Unexpected error on page {page_number}: {str(e)}")
            break

    # Remove duplicates based on all fields
    reviews_data = [dict(t) for t in {tuple(d.items()) for d in reviews_data}]

    logger.info(f"Scraping completed. Total reviews collected: {len(reviews_data)}")
    return reviews_data
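
The scraper above depends on the review data being embedded as JSON inside a `<script id="__NEXT_DATA__">` tag. To make that structure concrete, here is a small offline sketch. The HTML fragment is made up to mimic the keys the scraper reads, `extract_reviews` is a hypothetical helper, and a regular expression stands in for BeautifulSoup only so that the example needs nothing beyond the standard library.

```python
import json
import re

# Made-up page fragment that mimics the structure the scraper reads.
sample_html = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"reviews": [
  {"dates": {"publishedDate": "2024-12-16T10:00:00.000Z"},
   "consumer": {"displayName": "Guest", "countryCode": "GB"},
   "title": "Very straightforward!", "text": "Very straightforward!", "rating": 5}
]}}}
</script>
</body></html>
"""

def extract_reviews(html):
    """Return the review list embedded in the __NEXT_DATA__ script tag, or []."""
    match = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    if match is None:
        return []
    payload = json.loads(match.group(1))
    # Chained .get() calls avoid a KeyError if the page layout changes.
    return payload.get("props", {}).get("pageProps", {}).get("reviews", [])

reviews = extract_reviews(sample_html)
print(reviews[0]["rating"], reviews[0]["consumer"]["countryCode"])  # 5 GB
```

If the site changes its page layout, this key path is the first thing to check.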


In [None]:
# Extracting the reviews
trustpilot_reviews = scrape_trustpilot_reviews("https://ie.trustpilot.com/review/inpost.co.uk", max_reviews=500)

#Converting the reviews to a dataframe
trustpilot_df = pd.DataFrame(trustpilot_reviews)
trustpilot_df.head()

Unnamed: 0,Date,Author,Body,Heading,Rating,Location
0,2024-12-16,Obd,🟩🟩🟩🟩🟩🟩🟩\n⬜️❤️⬜️❤️⬜️❤️⬜️\n⬛️⬛️⬛️⬛️⬛️⬛️⬛️,🟩🟩🟩🟩🟩🟩🟩,5,GB
1,2024-12-17,Sally Anne Chester,Fast and efficient- easy till use for both sen...,In post are great!,5,GB
2,2024-12-16,customer,Easy to use and was easy to track parcel.,Easy to use and was easy to track…,4,GB
3,2024-12-16,Carolyn,Great service very easy to use,Great service very easy to use,5,GB
4,2024-12-16,Guest,Very straightforward!,Very straightforward!,5,GB
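
The rows above are not in page order, which may be a side effect of the set-based duplicate removal at the end of `scrape_trustpilot_reviews`, since a set does not keep insertion order. If the scraped order matters, an order-preserving variant could look like the sketch below. The helper name `dedupe_preserving_order` and the sample records are assumptions for illustration.

```python
def dedupe_preserving_order(records):
    """Drop exact duplicate dicts while keeping the first occurrence in place."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Made-up records for illustration only.
sample = [
    {"Date": "2024-12-17", "Author": "A", "Rating": 5},
    {"Date": "2024-12-16", "Author": "B", "Rating": 4},
    {"Date": "2024-12-17", "Author": "A", "Rating": 5},  # duplicate of the first
]
print(dedupe_preserving_order(sample))
```

Like the original one-liner, this assumes every value in a record is hashable (strings and integers here).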


In [None]:
# Including Sentiment analysis to the data
j = trustpilot_df['Body'].apply(lambda x: sentiment_score(x[:512])) # Using the sentiment score function
trustpilot_df['Sentiments'] = np.select([j.isin([1,2]), j==3, j.isin([4,5])],['Negative','Neutral','Positive'], default ='Unknown') # Splitting sentiments into ranges
trustpilot_df.head()

Unnamed: 0,Date,Author,Body,Heading,Rating,Location,Sentiments
0,2024-12-16,Obd,🟩🟩🟩🟩🟩🟩🟩\n⬜️❤️⬜️❤️⬜️❤️⬜️\n⬛️⬛️⬛️⬛️⬛️⬛️⬛️,🟩🟩🟩🟩🟩🟩🟩,5,GB,Positive
1,2024-12-17,Sally Anne Chester,Fast and efficient- easy till use for both sen...,In post are great!,5,GB,Positive
2,2024-12-16,customer,Easy to use and was easy to track parcel.,Easy to use and was easy to track…,4,GB,Positive
3,2024-12-16,Carolyn,Great service very easy to use,Great service very easy to use,5,GB,Positive
4,2024-12-16,Guest,Very straightforward!,Very straightforward!,5,GB,Negative


In [None]:
# Saving the data as a csv
trustpilot_df.to_csv('Trustpilot_reviews.csv', index=False)

In [None]:
# dropping unneeded columns
trustpilot_df = trustpilot_df.drop(columns=["Author", "Heading", "Location"])
trustpilot_df.head(2)

Unnamed: 0,Date,Body,Rating,Sentiments
0,2024-12-16,🟩🟩🟩🟩🟩🟩🟩\n⬜️❤️⬜️❤️⬜️❤️⬜️\n⬛️⬛️⬛️⬛️⬛️⬛️⬛️,5,Positive
1,2024-12-17,Fast and efficient- easy till use for both sen...,5,Positive


In [None]:
# Renaming columns to follow previous patterns
new_column_names = {
    'Body': 'Reviews',
    'Rating': 'Ratings',
    'Date': 'Date',
    'Sentiments': 'Sentiments'
}
trustpilot_df = trustpilot_df.rename(columns=new_column_names)
trustpilot_df.head(2)

Unnamed: 0,Date,Reviews,Ratings,Sentiments
0,2024-12-16,🟩🟩🟩🟩🟩🟩🟩\n⬜️❤️⬜️❤️⬜️❤️⬜️\n⬛️⬛️⬛️⬛️⬛️⬛️⬛️,5,Positive
1,2024-12-17,Fast and efficient- easy till use for both sen...,5,Positive


**A quick note:** The BERT model does a good job of classifying the reviews into sentiments, but a few classifications are incorrect (for example, the 5-star review "Very straightforward!" above is labelled Negative).
Given the correlation between Ratings and Sentiments, it makes sense to treat reviews with high ratings (4 or 5) as positive and those with low ratings (1 or 2) as negative; 3-star reviews keep the model's label.

In [None]:
# Update the Sentiments column based on Ratings
trustpilot_df.loc[trustpilot_df['Ratings'].isin([1, 2]), 'Sentiments'] = 'Negative'  # Ratings 1 or 2 -> Negative
trustpilot_df.loc[trustpilot_df['Ratings'].isin([4, 5]), 'Sentiments'] = 'Positive'  # Ratings 4 or 5 -> Positive
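
For readers who want to check the rule on single values, the two steps above (mapping the model's 1 to 5 score to a label, then letting a clear star rating override it) can be restated in plain Python. The function names `label_from_score` and `reconcile` are hypothetical and exist only for this sketch.

```python
def label_from_score(score):
    """Map a 1-5 score from the model to a sentiment label."""
    if score in (1, 2):
        return "Negative"
    if score == 3:
        return "Neutral"
    if score in (4, 5):
        return "Positive"
    return "Unknown"

def reconcile(model_label, rating):
    """Let a clear star rating override the model; keep the model's label for 3 stars."""
    if rating in (1, 2):
        return "Negative"
    if rating in (4, 5):
        return "Positive"
    return model_label

# The model scored the text low, but the reviewer gave 5 stars: the rating wins.
print(reconcile(label_from_score(1), rating=5))  # Positive
# A 3-star rating is ambiguous, so the model's label is kept.
print(reconcile(label_from_score(2), rating=3))  # Negative
```

This is why a 3-star review can still appear among the negative reviews later on.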

In [None]:
trustpilot_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Date        500 non-null    object
 1   Reviews     500 non-null    object
 2   Ratings     500 non-null    int64 
 3   Sentiments  500 non-null    object
dtypes: int64(1), object(3)
memory usage: 15.8+ KB


In [None]:
# Calculate the sentiment count
trustpilot_sentiment_count = calculate_distribution(trustpilot_df, 'Sentiments', reference_df=None)
trustpilot_sentiment_count

Unnamed: 0_level_0,Counts,Percentage(%)
Sentiments,Unnamed: 1_level_1,Unnamed: 2_level_1
Positive,463,92.6
Negative,34,6.8
Neutral,3,0.6


In [None]:
#visualizing Trustpilot sentiments
plot_sentiment_donut(trustpilot_df, sentiment_column='Sentiments', title='Trustpilot Sentiment Distribution')

In [None]:
# Calculate the Ratings count
trustpilot_ratings_count = calculate_distribution(trustpilot_df, 'Ratings', reference_df=None)
trustpilot_ratings_count

Unnamed: 0_level_0,Counts,Percentage(%)
Ratings,Unnamed: 1_level_1,Unnamed: 2_level_1
5,445,89.0
1,23,4.6
4,16,3.2
3,11,2.2
2,5,1.0


In [None]:
# Visualizing the Ratings
plot_rating_distribution(trustpilot_df, rating_column='Ratings', title='Trustpilot Rating Distribution')

In [None]:
# Trustpilot Positive Reviews
trust_positive_result = trustpilot_df[trustpilot_df['Sentiments'] == 'Positive']
positive_results = analyze_reviews(trust_positive_result, 'Reviews')

====Common words:=====

'collect': 20 occurrences (4.32%)
'convenient': 52 occurrences (11.23%)
'delivery': 34 occurrences (7.34%)
'easy': 260 occurrences (56.16%)
'efficient': 22 occurrences (4.75%)
'fast': 33 occurrences (7.13%)
'good': 38 occurrences (8.21%)
'great': 69 occurrences (14.90%)
'inpost': 25 occurrences (5.40%)
'parcel': 41 occurrences (8.86%)
'parcels': 22 occurrences (4.75%)
'post': 20 occurrences (4.32%)
'quick': 90 occurrences (19.44%)
'really': 18 occurrences (3.89%)
'reliable': 17 occurrences (3.67%)
'service': 104 occurrences (22.46%)
'simple': 28 occurrences (6.05%)
'time': 28 occurrences (6.05%)
'use': 138 occurrences (29.81%)
'way': 16 occurrences (3.46%)

=====Common Phrases:=====

'brilliant service': 7 occurrences (1.51%)
'convenient easy': 6 occurrences (1.30%)
'easy collect': 7 occurrences (1.51%)
'easy convenient': 10 occurrences (2.16%)
'easy quick': 13 occurrences (2.81%)
'easy use': 102 occurrences (22.03%)
'fast delivery': 7 occurrences (1.51%)
'fast 

In [None]:
categories = {
    "App Functionality": ["remote opening", "app", "track", "qr code", "barcode", "phone"],
    "Convenience": ["convenient", "touch screen", "remote", "remotely", "time", "post", "collect"],
    "Service Quality": ["service", "excellent", "parcel", "parcels", "locker", "lockers", "fast", "reliable", "efficient"],
    "Overall Experience": ["game changer", "amazing", "good", "brilliant", "straight forward", "straightforward", "love",
                           "the future", "impressive", "awesome", "love it", "would highly recommend", "great"],
    "Ease of Use": ["quick", "easier", "very easy", "simple", "easy use", "easy to use", "easy"]
}

# Positive Reviews categories
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on a filtered slice
trust_positive_result = trust_positive_result.assign(Category=trust_positive_result['Reviews'].apply(categorize_review))
trust_positive_result.head()



Unnamed: 0,Date,Reviews,Ratings,Sentiments,Category
0,2024-12-16,🟩🟩🟩🟩🟩🟩🟩\n⬜️❤️⬜️❤️⬜️❤️⬜️\n⬛️⬛️⬛️⬛️⬛️⬛️⬛️,5,Positive,Other
1,2024-12-17,Fast and efficient- easy till use for both sen...,5,Positive,Convenience
2,2024-12-16,Easy to use and was easy to track parcel.,4,Positive,App Functionality
3,2024-12-16,Great service very easy to use,5,Positive,Service Quality
4,2024-12-16,Very straightforward!,5,Positive,Overall Experience


In [None]:
# Calculate the category count
trustpilot_positive_category = calculate_distribution(trust_positive_result, 'Category', reference_df=trustpilot_df)
trustpilot_positive_category

Unnamed: 0_level_0,Counts,Percentage(%)
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Service Quality,138,27.6
Convenience,131,26.2
Ease of Use,102,20.4
Overall Experience,42,8.4
App Functionality,27,5.4
Other,23,4.6


#### **Top Positive Themes:**

**Service Quality:** Many comments mention good, great, or brilliant service, indicating customer satisfaction with the overall quality of service provided.

**Convenience:** The words "fast" and "quick" and the phrase "fast delivery" are mentioned very often, showing that customers appreciate how quick the InPost service is.

**Ease of Use:** Customers frequently describe using lockers for parcel deliveries as convenient and easy.

**Overall Experience:** Most users consider the service a reliable way to deliver parcels.


In [None]:
# Trustpilot Negative Reviews
trust_negative_result = trustpilot_df[trustpilot_df['Sentiments'] == 'Negative']
negative_results = analyze_reviews(trust_negative_result, 'Reviews')

====Common words:=====

'box': 5 occurrences (14.71%)
'code': 3 occurrences (8.82%)
'collect': 4 occurrences (11.76%)
'going': 3 occurrences (8.82%)
'inpost': 4 occurrences (11.76%)
'item': 6 occurrences (17.65%)
'locker': 5 occurrences (14.71%)
'lockers': 13 occurrences (38.24%)
'lost': 4 occurrences (11.76%)
'parcel': 21 occurrences (61.76%)
'received': 3 occurrences (8.82%)
'refund': 5 occurrences (14.71%)
'returned': 3 occurrences (8.82%)
'sent': 3 occurrences (8.82%)
'service': 3 occurrences (8.82%)
'time': 5 occurrences (14.71%)
'told': 3 occurrences (8.82%)
'use': 5 occurrences (14.71%)
'went': 4 occurrences (11.76%)
'wrong': 3 occurrences (8.82%)

=====Common Phrases:=====

'collect parcel': 2 occurrences (5.88%)
'item received': 2 occurrences (5.88%)
'lost lost': 2 occurrences (5.88%)
'providers alwyas faster': 1 occurrences (2.94%)
'public obviously': 1 occurrences (2.94%)
'public obviously thought': 1 occurrences (2.94%)
'push highest': 1 occurrences (2.94%)
'push highest lo

In [None]:
categories = {
    "App Issues": ['downloaded app', 'error message', "app doesn", 'qr code'],
    "Returns Issues": ['return', 'returned'],
    "Parcel Issues": ['parcel', 'parcels', 'delivery', 'tracking', 'track parcel', 'send parcel', 'returning', 'damage', 'lost', 'broken'],
    "Locker Issues": ['inpost', 'locker', 'lockers', 'locker availability', 'lockers available', 'live availability', 'open',
                      'code', 'says lockers', 'availability', 'available', 'space', 'screen']
}
# Negative Reviews categories
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on a filtered slice
trust_negative_result = trust_negative_result.assign(Category=trust_negative_result['Reviews'].apply(categorize_review))
trust_negative_result.head()



Unnamed: 0,Date,Reviews,Ratings,Sentiments,Category
58,2024-12-16,Had to visit the lockers 3 times as the first ...,2,Negative,Locker Issues
100,2024-12-16,The screen at the post it point was not workin...,3,Negative,Locker Issues
111,2024-12-16,Really bad amount of time to take to take par...,1,Negative,Parcel Issues
124,2024-12-17,InPost refuse to check CCTV to prove their mac...,1,Negative,Parcel Issues
126,2024-12-16,Very Bad for returning parcels & not servicing...,2,Negative,Returns Issues


In [None]:
# Calculate the category count
trustpilot_negative_category = calculate_distribution(trust_negative_result, 'Category', reference_df=trustpilot_df)
trustpilot_negative_category

Unnamed: 0_level_0,Counts,Percentage(%)
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Parcel Issues,17,3.4
Locker Issues,11,2.2
Returns Issues,3,0.6
Other,3,0.6


#### **Top Issues:**

**Parcel Issues:** Some customers report problems collecting parcels from lockers, wrong parcel deliveries, and even lost or damaged parcels.

**Locker Issues:** Some concerns relate to the lockers themselves, particularly their size and the inability to open them with codes.

**Returns Issues:** Some customers are frustrated with the parcel return process, complaining that it is slow.



In [None]:
# Combining the negative and positive categories
trustpilot_category_df = pd.concat([trustpilot_positive_category, trustpilot_negative_category], ignore_index=False)
trustpilot_category_df = trustpilot_category_df.sort_values(by='Counts', ascending=False)
trustpilot_category_df

Unnamed: 0_level_0,Counts,Percentage(%)
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Service Quality,138,27.6
Convenience,131,26.2
Ease of Use,104,20.8
Overall Experience,40,8.0
App Functionality,27,5.4
Other,23,4.6
Parcel Issues,17,3.4
Locker Issues,11,2.2
Returns Issues,3,0.6
Other,3,0.6


In [None]:
# visualize the Categories
category_chart(trustpilot_category_df,'Trustpilot Review Categories')

## **Twitter Analysis**
Here tweets are scraped from Twitter and analysed. Because the code used to extract and clean the tweet data is lengthy, I have put it in a separate notebook; a link will be provided here in due time.

For this phase, we do a pain-point analysis, since most tweets are in fact from individuals going on Twitter to express their frustration with the service.


In [None]:
# Reading the Twitter data
twitter_df= pd.read_excel('Tweet_data.xlsx')
twitter_df.head()

Unnamed: 0,TWEET,DATE
0,"@InPostUK ugh inpostttttt, this is why I'm alw...",2024-09-12 00:00:00
1,@InPostUK why don't you reply to emails ? Also...,2024-09-12 00:00:00
2,@InPostUK I WANT A REFUND NOW!!! MY PARCEL IS ...,2024-10-12 00:00:00
3,@InPostUK please can you help me regarding two...,2024-09-12 00:00:00
4,"@InPostUK hi,is there a way of checking if the...",2024-10-12 00:00:00


In [None]:
# Trimming extra whitespace from the column names
twitter_df.columns = twitter_df.columns.str.strip()

twitter_df.columns.tolist()


['TWEET', 'DATE']

In [None]:
# Getting basic info of the data
twitter_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249 entries, 0 to 248
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   TWEET   249 non-null    object
 1   DATE    248 non-null    object
dtypes: object(2)
memory usage: 4.0+ KB


In [None]:
Common_tweets = analyze_reviews(twitter_df, 'TWEET')

====Common words:=====

'customer': 24 occurrences (9.64%)
'days': 21 occurrences (8.43%)
'delivery': 25 occurrences (10.04%)
'help': 21 occurrences (8.43%)
'hi': 19 occurrences (7.63%)
'inpost': 48 occurrences (19.28%)
'inpostuk': 204 occurrences (81.93%)
'item': 21 occurrences (8.43%)
'just': 24 occurrences (9.64%)
'locker': 60 occurrences (24.10%)
'lockers': 27 occurrences (10.84%)
'parcel': 105 occurrences (42.17%)
'parcels': 29 occurrences (11.65%)
'sent': 28 occurrences (11.24%)
'service': 41 occurrences (16.47%)
'time': 21 occurrences (8.43%)
'tracking': 19 occurrences (7.63%)
've': 41 occurrences (16.47%)
'vinted': 47 occurrences (18.88%)
'vinteduk': 25 occurrences (10.04%)

=====Common Phrases:=====

'broken locker': 4 occurrences (1.61%)
'collect parcel': 6 occurrences (2.41%)
'customer service': 18 occurrences (7.23%)
'deliver parcel': 5 occurrences (2.01%)
'haven received': 4 occurrences (1.61%)
'inpost locker': 8 occurrences (3.21%)
'inpostuk hi': 14 occurrences (5.62%)
'i

In [None]:
# Define categories and keywords
categories = {
    'Collection Point Problems': ['collection point', 'moved', 'changed', 'Machines','code'],
    'Locker Problems': ['broken','locker','lockers', 'malfunctioning', 'out of order', 'full', 'unavailable','no space', 'locker full'],
    'Delivery Issues': ['delayed', 'missing', 'lost', 'stolen', 'not delivered', 'stuck', 'deliver', 'parcel', 'delivery'],
    'Customer Service': ['customer service', 'no reply', 'ignored', 'unhelpful'],
    'Tracking and Updates': ['tracking', 'update', 'status', 'tracking parcel'],
    'App and Website Issues': ['app', 'website', 'qr code', 'label','scan' ],
    'Refund and Compensation': ['refund', 'compensation']
}

# Categorize tweets
twitter_df['Category'] = twitter_df['TWEET'].apply(categorize_review)
twitter_df.head()

Unnamed: 0,TWEET,DATE,Category
0,"@InPostUK ugh inpostttttt, this is why I'm alw...",2024-09-12 00:00:00,Delivery Issues
1,@InPostUK why don't you reply to emails ? Also...,2024-09-12 00:00:00,Collection Point Problems
2,@InPostUK I WANT A REFUND NOW!!! MY PARCEL IS ...,2024-10-12 00:00:00,Locker Problems
3,@InPostUK please can you help me regarding two...,2024-09-12 00:00:00,Delivery Issues
4,"@InPostUK hi,is there a way of checking if the...",2024-10-12 00:00:00,Locker Problems


In [None]:
# Calculate the category count
tweet_category = calculate_distribution(twitter_df, 'Category', reference_df=None)
tweet_category

Unnamed: 0_level_0,Counts,Percentage(%)
Category,Unnamed: 1_level_1,Unnamed: 2_level_1
Delivery Issues,99,39.76
Locker Problems,54,21.69
Other,48,19.28
Collection Point Problems,18,7.23
Customer Service,10,4.02
App and Website Issues,9,3.61
Tracking and Updates,8,3.21
Refund and Compensation,3,1.2


In [None]:
j=twitter_df[twitter_df['Category'] == 'Delivery Issues']
print(j['TWEET'].iloc[48])


@InPostUK
 my local locker hasn’t been working in over 24 hours, i have two parcels inside. 
Locker is 110 Staines road at Esso, London/Twickenham


In [None]:
# visualize the Categories
category_chart(tweet_category,'Tweet Categories')

In [None]:
def categorize_reviews(review):
    """Return the category whose keywords match the review most often, or 'Other'."""
    review_lower = review.lower()  # Convert review to lowercase

    # Count keyword matches for each category (substring matching)
    match_count = {
        category: sum(keyword in review_lower for keyword in keywords)
        for category, keywords in categories.items()
    }

    # Keep only the categories with at least one match
    matched_categories = {category: count for category, count in match_count.items() if count > 0}

    # If no matches, classify as 'Other'
    if not matched_categories:
        return 'Other'

    # Return the category with the most matches (ties go to the category defined first)
    return max(matched_categories, key=matched_categories.get)
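
One caveat with the function above is that `keyword in review_lower` is a substring test, so a short keyword such as `app` also matches inside `happy`. A possible variant, sketched below under the assumption that whole-word matching is wanted, uses regular-expression word boundaries. The name `categorize_whole_words` and the sample categories are illustrative only and are not the notebook's own.

```python
import re

# Illustrative categories only; the notebook defines its own `categories` dict.
sample_categories = {
    "App and Website Issues": ["app", "website", "qr code"],
    "Locker Problems": ["locker", "out of order", "full"],
}

def categorize_whole_words(review, categories):
    """Return the category with the most whole-word keyword hits, or 'Other'."""
    text = review.lower()
    hits = {}
    for category, keywords in categories.items():
        count = sum(
            bool(re.search(r"\b" + re.escape(keyword) + r"\b", text))
            for keyword in keywords
        )
        if count:
            hits[category] = count
    return max(hits, key=hits.get) if hits else "Other"

print("app" in "happy with the service")                                     # True: substring hit
print(categorize_whole_words("Happy with the service", sample_categories))   # Other
print(categorize_whole_words("The app shows the locker is full", sample_categories))  # Locker Problems
```

The trade-off is that whole-word matching no longer lets `locker` stand in for `lockers`, so plural forms have to be listed explicitly.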

#### **Observation**
As a side note, although not highlighted in the categories, a recurring complaint concerned the integration with **Vinted UK**: the word 'vinted' has 47 occurrences (18.88%) in the common-word counts above.
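
A mention count like this can be checked directly against the tweets. The sketch below assumes a hypothetical helper `count_mentions` and uses made-up tweets; on the real data the same call would be applied to the `TWEET` column.

```python
import re

def count_mentions(tweets, pattern):
    """Return how many tweets match the pattern and their share of all tweets."""
    regex = re.compile(pattern, re.IGNORECASE)
    matched = sum(1 for tweet in tweets if regex.search(tweet))
    return matched, round(100 * matched / len(tweets), 2)

# Made-up tweets for illustration only.
sample_tweets = [
    "@InPostUK my Vinted parcel is stuck in the locker",
    "@InPostUK @VintedUK who do I contact about a refund?",
    "@InPostUK locker screen is broken",
    "@InPostUK tracking has not updated for days",
]
print(count_mentions(sample_tweets, r"vinted"))  # (2, 50.0)
```

Because the pattern is matched as a substring, it covers both `Vinted` and the `@VintedUK` handle, which the word counts above list as separate tokens.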


## **Conclusion**
We have now highlighted issues that need attention, especially locker availability and space, a properly functioning app, and better handling of parcel deliveries.
Even so, given its years of operation in the UK, the company has made a good impression on its customers.

Note that the number of reviews used here is quite limited and so provides only a small sample; a more elaborate analysis would need more reviews.
