# Sentiment and Thematic Analysis
This notebook performs sentiment scoring on user reviews using a pretrained model (DistilBERT) to uncover satisfaction signals across banks. Themes will be extracted in the next section.

## Setup and Dataset Overview
Load required packages and preview the cleaned reviews dataset.
Ensure the presence of key columns like `review_clean`, `bank`, and `rating`.

In [1]:
import pandas as pd
import numpy as np
from transformers import pipeline
import os 
import sys
from tqdm import tqdm
tqdm.pandas()

sys.path.append(os.path.abspath("../"))
from src.utils.utils import load_data
from src.utils.keyword_text_processor import preprocess_for_keywords
from src.models.keyword_and_topicmodeling import extract_topic_keywords, map_topics_to_labels, save_topics_with_labels
from src.models.sentiments import  load_sentiment_model

# Load cleaned review data
raw_path = "../data/processed/cleaned_reviews.csv"
df = load_data(raw_path)

# Quick preview
df.head()



Unnamed: 0,review_id,review,rating,date,bank,source,review_clean
0,24b9381c-3cd8-431a-b3c9-bd156427585a,wow,5,2025-07-25,Commercial Bank of Ethiopia (CBE),Google Play,wow
1,6a7da8ad-486f-4132-8eb9-aee1e55f7322,excellent,5,2025-07-25,Commercial Bank of Ethiopia (CBE),Google Play,excellent
2,d12f5fc4-9d9a-4f3d-b3e0-074f7b379f14,great,5,2025-07-24,Commercial Bank of Ethiopia (CBE),Google Play,great
3,fba3dc77-c9b1-4cfc-8afe-317b302b007c,Great,5,2025-07-24,Commercial Bank of Ethiopia (CBE),Google Play,Great
4,bd502a94-e136-4cee-bb6b-e0c51cc5c245,there is many thing u have to fix.,1,2025-07-24,Commercial Bank of Ethiopia (CBE),Google Play,there is many thing u have to fix.


## Sentiment Model Initialization
We use Hugging Face Transformers to load `distilbert-base-uncased-finetuned-sst-2-english`, a binary classifier that labels each review as `POSITIVE` or `NEGATIVE` and returns a confidence score. The model has no neutral class.

In [2]:
# Load sentiment model
sentiment_model = load_sentiment_model()


Device set to use cpu
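
The helper `load_sentiment_model` lives in `src/models/sentiments.py` and is not shown in this notebook. As a rough sketch of what such a loader could look like, assuming it simply wraps the Transformers `pipeline` factory (the function name `load_sentiment_model_sketch` is hypothetical):

```python
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"


def load_sentiment_model_sketch(model_name: str = MODEL_NAME):
    """Hypothetical stand-in for src.models.sentiments.load_sentiment_model."""
    # Imported lazily so that defining the function does not require the library.
    from transformers import pipeline

    # device=-1 keeps inference on the CPU, in line with the
    # "Device set to use cpu" message printed above.
    return pipeline("sentiment-analysis", model=model_name, device=-1)
```

Calling the returned pipeline on a string gives a list with one dict holding a `label` (`POSITIVE` or `NEGATIVE`) and a `score`, which is the shape the inference cell below relies on.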


## Sentiment Inference on Reviews
Apply the model to each review, one row at a time with `progress_apply`, and store the results as two new columns: `sentiment_label` and `sentiment_score`.

In [3]:
def get_sentiment_outputs(text):
    result = sentiment_model(text)[0]
    return pd.Series([result['label'], result['score']], index=['sentiment_label', 'sentiment_score'])

df[['sentiment_label', 'sentiment_score']] = df['review_clean'].progress_apply(get_sentiment_outputs)
df.head()

100%|██████████| 1492/1492 [01:11<00:00, 21.00it/s]


Unnamed: 0,review_id,review,rating,date,bank,source,review_clean,sentiment_label,sentiment_score
0,24b9381c-3cd8-431a-b3c9-bd156427585a,wow,5,2025-07-25,Commercial Bank of Ethiopia (CBE),Google Play,wow,POSITIVE,0.999592
1,6a7da8ad-486f-4132-8eb9-aee1e55f7322,excellent,5,2025-07-25,Commercial Bank of Ethiopia (CBE),Google Play,excellent,POSITIVE,0.999843
2,d12f5fc4-9d9a-4f3d-b3e0-074f7b379f14,great,5,2025-07-24,Commercial Bank of Ethiopia (CBE),Google Play,great,POSITIVE,0.999863
3,fba3dc77-c9b1-4cfc-8afe-317b302b007c,Great,5,2025-07-24,Commercial Bank of Ethiopia (CBE),Google Play,Great,POSITIVE,0.999863
4,bd502a94-e136-4cee-bb6b-e0c51cc5c245,there is many thing u have to fix.,1,2025-07-24,Commercial Bank of Ethiopia (CBE),Google Play,there is many thing u have to fix.,NEGATIVE,0.987213
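
The cell above calls the model once per review. If throughput matters, the same work can be done in batches, since a Transformers pipeline accepts a list of strings and returns one result dict per input. The sketch below shows the batching logic only; `classify_in_batches` and `stub_classifier` are hypothetical names, and the stub stands in for `sentiment_model` so the example runs without downloading weights.

```python
def iter_batches(texts, batch_size):
    """Yield consecutive slices of `texts` with at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def classify_in_batches(classifier, texts, batch_size=32):
    """Call `classifier` on each batch and flatten the per-text results."""
    results = []
    for batch in iter_batches(texts, batch_size):
        results.extend(classifier(batch))
    return results


def stub_classifier(batch):
    # Illustrative stand-in for sentiment_model: same input/output shape,
    # but the labels come from a trivial keyword rule, not a real model.
    return [
        {"label": "POSITIVE" if "good" in text else "NEGATIVE", "score": 1.0}
        for text in batch
    ]


outputs = classify_in_batches(stub_classifier, ["good app", "slow", "very good"], batch_size=2)
```

With the real model, one would pass `sentiment_model` and `df['review_clean'].tolist()`, then build the two columns from the `label` and `score` fields of the returned dicts.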


In [4]:
sentiment_summary = (
    df.groupby(['bank', 'rating'])['sentiment_score']
    .mean()
    .reset_index()
    .sort_values(by='sentiment_score', ascending=False)
)

sentiment_summary[sentiment_summary["rating"] == 1]  # mean confidence of the predicted label for 1-star reviews, per bank

Unnamed: 0,bank,rating,sentiment_score
10,Dashen Bank,1,0.995749
0,Bank of Abyssinia (BOA),1,0.976366
5,Commercial Bank of Ethiopia (CBE),1,0.971417


## Sentiment Analysis Interpretation for Low Ratings

For 1-star reviews, the mean `sentiment_score` is high for all three banks, ranging from **0.971 to 0.996**. Note that `sentiment_score` is the model's confidence in whichever label it predicted, so these averages indicate that:

- The model is **highly confident** in its predictions for these reviews; few of them read as ambiguous.
- If the predicted labels are mostly `NEGATIVE`, which the label counts per rating would confirm, then negative tone is **strongly aligned** with low user ratings.
- Among the banks:
  - **Dashen Bank** shows the highest mean confidence (**0.996**).
  - **BOA and CBE** follow closely, with mean confidence above **0.97**.

### Insight
The tight clustering of confidence scores across 1-star reviews suggests **minimal ambiguity** in the tone of low-rated reviews, a key factor when evaluating rating-sentiment consistency. Confidence alone does not establish that the predicted label is correct, so this reading should be paired with a comparison of `sentiment_label` against `rating`.

This supports downstream analyses such as contradiction checks, thematic clustering, and comparative sentiment strength across institutions.
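
Because `sentiment_score` is the confidence in the predicted label, a complementary check is to tabulate which labels were predicted at each rating. The sketch below computes the share of each label per rating in plain Python; `label_share_by_rating` is a hypothetical helper and the sample rows are made up for illustration, not taken from the dataset.

```python
from collections import Counter, defaultdict


def label_share_by_rating(rows):
    """rows: iterable of (rating, sentiment_label) pairs.

    Returns {rating: {label: share of reviews with that label}}.
    """
    counts = defaultdict(Counter)
    for rating, label in rows:
        counts[rating][label] += 1
    return {
        rating: {label: n / sum(counter.values()) for label, n in counter.items()}
        for rating, counter in counts.items()
    }


# Made-up rows for illustration only.
sample = [(1, "NEGATIVE"), (1, "NEGATIVE"), (1, "POSITIVE"), (5, "POSITIVE")]
shares = label_share_by_rating(sample)
```

In the notebook itself, `pd.crosstab(df['rating'], df['sentiment_label'], normalize='index')` would give the same shares directly from the DataFrame.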


# Thematic Analysis of User Reviews Across Banks

This notebook extracts meaningful keyword patterns from Google Play reviews to uncover common themes.  
It uses TF-IDF n-gram vectorization and LDA topic modeling to identify clusters of feedback signals.  
We apply this analysis to **Dashen Bank** first, then generalize manually across other banks as needed.

### Workflow Summary:
1. Text preprocessing for keyword extraction (emojis, stopwords, lemmatization)  
2. TF-IDF vectorization using 3–4 word n-grams  
3. LDA topic modeling with manual labeling setup  
4. Pipeline ready for judgment-based extension to other banks

In [5]:
# Text cleaning and processing
import re
import emoji
import nltk
import pandas as pd
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# NLP Modeling
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Setup
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adoni\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\adoni\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\adoni\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\adoni\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\adoni\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

#### Clean `review_clean` column for keyword modeling

We'll apply a lemmatizer alongside stopword removal and emoji stripping.  
The goal is to normalize user feedback before extracting n-gram keywords.
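
The actual cleaning is done by `preprocess_for_keywords` from `src/utils/keyword_text_processor.py`, which is not shown here. As a simplified sketch of the idea, assuming lowercasing, symbol stripping, and stopword filtering (lemmatization is left out, and the stopword set is a tiny illustrative subset rather than NLTK's full list):

```python
import re

# Tiny illustrative stopword list; the notebook relies on NLTK's English list.
STOPWORDS = {"there", "is", "have", "to", "the", "a", "an", "and", "it", "this"}


def simple_keyword_clean(text: str) -> str:
    """Hypothetical, simplified stand-in for preprocess_for_keywords."""
    text = text.lower()
    # Replace every character that is not a word character or whitespace
    # with a space. Emojis and punctuation are dropped this way, while
    # letters from any script (including Amharic) are kept.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(simple_keyword_clean("there is many thing u have to fix."))
```

Emoji-only reviews come out as empty strings under this scheme, which matters later because such rows carry no n-gram features.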

## Select Dashen Bank reviews and clean them for TF-IDF

This prepares a `keyword_ready` column with normalized phrases ready for vectorization.

In [6]:
dashen_df = df[df['bank'] == 'Dashen Bank'].copy()
dashen_df['keyword_ready'] = dashen_df['review_clean'].apply(preprocess_for_keywords)
dashen_df[['review_clean', 'keyword_ready']].head()

Unnamed: 0,review_clean,keyword_ready
1000,"Simple for usage and well designed, attractive...",simple usage well designed attractive visual d...
1001,amazing product,amazing product
1002,this app is not good and compatible with other...,app good compatible ethiopian bank apps fast s...
1003,ok,ok
1004,like it,like


## Extract bigrams and trigrams using TF-IDF

We restrict the n-gram range to 3–4 words (`ngram_range=(3,4)`) to capture multi-word phrases such as "best mobile banking" or "dashen super app".

In [7]:
# initialize vectorizer 
vectorizer = TfidfVectorizer(
    max_df=0.1,
    min_df=1,
    stop_words='english',
    ngram_range=(3,4)
)
print("tfidf vectorizer initialized successfully✅")

tfidf vectorizer initialized successfully✅


In [8]:
# create tfidf matrix and extract feature names
dashen_tfidf_matrix = vectorizer.fit_transform(dashen_df['keyword_ready'])
dashen_feature_names = vectorizer.get_feature_names_out()
print("features and matrix created")

features and matrix created
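
To make the `ngram_range=(3,4)` setting concrete, the sketch below lists the word 3-grams and 4-grams of a cleaned review. It is a plain-Python illustration, not the vectorizer's own code: `TfidfVectorizer` also applies its token pattern and the `stop_words='english'` filter first, so its features can differ. The function name `word_ngrams` is hypothetical.

```python
from typing import List


def word_ngrams(text: str, n_min: int = 3, n_max: int = 4) -> List[str]:
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("best digital platform ive used"))
print(word_ngrams("amazing product"))  # fewer than three tokens: no n-grams
```

One consequence worth keeping in mind: a review with fewer than three usable tokens (for example "ok" or "amazing product") produces no features, so its TF-IDF row is all zeros.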


## Identify latent topics with LDA and extract top keyword lists

This step models topics from the TF-IDF matrix and returns terms for each topic,  
so you can manually assign human-readable labels.

In [36]:
# initialize the lda model with custom parameters
n_topics = 5
lda_model = LatentDirichletAllocation(n_components=n_topics, max_iter=70, learning_method='batch', random_state=42)
print("lda model initialized successfully✅")

lda model initialized successfully✅


In [37]:
# fit the tfidf matrix to the lda model
lda_model.fit(dashen_tfidf_matrix)

# extract topics
lda_topics = extract_topic_keywords(lda_model, dashen_feature_names, top_n=20)
# display the topics
for topic, keywords in lda_topics.items():
    print(f"{topic}: {keywords}")

Topic 1: ['mobile banking app', 'proud dashen bank', 'easy flexible app', 'thanks digital transaction', 'easy simple use', 'real life changer', 'lightweightcatchy smooth app', 'platform ive used', 'digital platform ive', 'transaction seamless shopping', 'ive used smooth transaction', 'used smooth transaction', 'best digital platform', 'best digital platform ive', 'platform ive used smooth', 'ive used smooth', 'used smooth transaction seamless', 'smooth transaction seamless shopping', 'digital platform ive used', 'smooth transaction seamless']
Topic 2: ['app make difference', 'amazing app seen', 'hope better amole', 'fast best app', 'better catch competition', 'step ahead masterpiece', 'dashen bank choice', 'fastest easy use', 'ዳሽን ባንክ ይለያል', 'amazing app experience', 'app good guy', 'good easy use', 'wow excellent app', 'app easy use', 'good job dashen', 'dashen super app easy', 'super app easy', 'super app easy use', 'dashen super app', 'wly super app']
Topic 3: ['best mobile banking 
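
The helper `extract_topic_keywords` is defined in `src/models/keyword_and_topicmodeling.py` and is not shown. A common way to get such lists is to rank the features in each row of the fitted model's `components_` matrix by weight. The sketch below does that on a toy weight matrix; the function name and the toy numbers are assumptions made for illustration.

```python
def top_keywords_per_topic(topic_term_weights, feature_names, top_n=3):
    """Return {"Topic k": [top_n feature names by descending weight]}."""
    topics = {}
    for idx, weights in enumerate(topic_term_weights):
        ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        # Topic names are 1-based here, as in the printed output above.
        topics[f"Topic {idx + 1}"] = [feature_names[j] for j in ranked[:top_n]]
    return topics


# Toy example: two topics over three features (weights are made up).
features = ["app easy use", "worst app seen", "mobile banking app"]
weights = [[0.2, 0.1, 0.9], [0.7, 0.05, 0.3]]
topics = top_keywords_per_topic(weights, features, top_n=2)
```

Note the off-by-one between the two conventions used in this notebook: printed topics are numbered from 1, while the theme label dictionaries are keyed from 0.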

### Topic Labeling Interpretation (Dashen Bank)

We assigned five distinct thematic labels to topics derived from Dashen Bank user reviews. These labels reflect recurring patterns in customer feedback across usability, brand perception, and feature expectations:

- **Topic 1: SeamlessTransactions_DigitalPlatform** — emphasizes smooth, lightweight app performance and satisfaction with Dashen’s digital platform for everyday banking.

- **Topic 2: SuperApp_ExperiencePraise** — highlights enthusiastic praise for Dashen’s super app, ease of use, and competitive positioning against other platforms like Amole.

- **Topic 3: BankingExcellence_FeatureStrength** — showcases strong endorsements of Dashen’s mobile banking capabilities, game-changing features, and cash service integration.

- **Topic 4: BrandIdentity_ProgressiveDesign** — reflects Dashen’s branding as a forward-thinking bank, with references to inclusivity, innovation, and OTP-related platform improvements.

- **Topic 5: UserSentiment_TrustSignals** — blends emotional expressions of trust, national pride, and satisfaction with the app’s power, speed, and user-friendliness.

These thematic assignments will support future insight generation, sentiment alignment, and feature prioritization. We’ll next map these labels to individual reviews using topic probabilities from `lda.transform()`.

In [38]:
# Define Dashen Bank topic keywords (updated)
dashen_topic_keywords = {
    "Topic 1": ['mobile banking app', 'proud dashen bank', 'easy flexible app', 'thanks digital transaction',
                'easy simple use', 'real life changer', 'lightweightcatchy smooth app', 'platform ive used',
                'digital platform ive', 'transaction seamless shopping', 'ive used smooth transaction',
                'used smooth transaction', 'best digital platform', 'best digital platform ive',
                'platform ive used smooth', 'ive used smooth', 'used smooth transaction seamless',
                'smooth transaction seamless shopping', 'digital platform ive used', 'smooth transaction seamless'],

    "Topic 2": ['app make difference', 'amazing app seen', 'hope better amole', 'fast best app',
                'better catch competition', 'step ahead masterpiece', 'dashen bank choice', 'fastest easy use',
                'ዳሽን ባንክ ይለያል', 'amazing app experience', 'app good guy', 'good easy use', 'wow excellent app',
                'app easy use', 'good job dashen', 'dashen super app easy', 'super app easy', 'super app easy use',
                'dashen super app', 'wly super app'],

    "Topic 3": ['best mobile banking application', 'best mobile banking', 'mobile banking application',
                'wowslnwoooo wo amazing', 'best inclass app', 'good compared amole', 'best supper app',
                'dashen super app', 'wow dashen super', 'wow dashen super app', 'excellent game changer',
                'excellent game changer app', 'best banking app seen', 'banking app seen', 'need start cash',
                'need start cash service', 'start cash service', 'good mobile bank', 'mobile bank app',
                'good mobile bank app'],

    "Topic 4": ['dashen bank step', 'bank step ahead', 'dashen bank step ahead', 'dashen super app',
                'wow amazing app', 'good app easy', 'better inclusive app', 'super smart app', 'best app seen',
                'alway slow loading', 'worst app seen', 'awesome app going', 'truly db alwaysonestepahead',
                'simple fast easy', 'atm east africa', 'dashen bank super', 'best platform avoid otp',
                'platform avoid otp', 'best platform avoid', 'app finance sector'],

    "Topic 5": ['wow saff app', 'mobile banking level', 'amazing super app', 'amazing mobile app',
                'wallahi fantastic bank', 'best app easy', 'best app easy use', 'come busy working',
                'come busy working good', 'busy working good', 'amazings dashen bank', 'ethiopian amazings dashen',
                'ethiopian amazings dashen bank', 'easy use powerful application', 'easy use powerful',
                'use powerful application', 'job home bank', 'great job home', 'great job home bank',
                'fast friendly user application']
}

# Define theme labels for Dashen Bank topics (updated)
dashen_theme_labels = {
    0: "SeamlessTransactions_DigitalPlatform",
    1: "SuperApp_ExperiencePraise",
    2: "BankingExcellence_FeatureStrength",
    3: "BrandIdentity_ProgressiveDesign",
    4: "UserSentiment_TrustSignals"
}

print("✅ Dashen Bank topic keywords and theme labels are ready.")

✅ Dashen Bank topic keywords and theme labels are ready.


In [34]:
save_topics_with_labels("dashen", dashen_topic_keywords, dashen_theme_labels, "../data/processed/bank_themes")

✅ Saved to: ../data/processed/bank_themes\dashen_theme_map.json
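
`save_topics_with_labels` is another project helper whose body is not shown. The sketch below illustrates one plausible way to persist such a mapping as JSON. Only the `<bank>_theme_map.json` file name is taken from the output above; the JSON layout and the function name `save_theme_map_sketch` are assumptions.

```python
import json
import os
import tempfile


def save_theme_map_sketch(bank, topic_keywords, theme_labels, out_dir):
    """Write {topic: {"label": ..., "keywords": [...]}} to <bank>_theme_map.json."""
    os.makedirs(out_dir, exist_ok=True)
    # Relies on topic_keywords being ordered Topic 1..N so that the
    # position i lines up with the 0-based keys of theme_labels.
    payload = {
        topic: {"label": theme_labels[i], "keywords": keywords}
        for i, (topic, keywords) in enumerate(topic_keywords.items())
    }
    path = os.path.join(out_dir, f"{bank}_theme_map.json")
    with open(path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps Amharic keywords readable in the file.
        json.dump(payload, fh, ensure_ascii=False, indent=2)
    return path


# Round trip in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_theme_map_sketch("demo", {"Topic 1": ["app easy use"]}, {0: "Ease of Use"}, tmp)
    with open(saved, encoding="utf-8") as fh:
        loaded = json.load(fh)
```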


## Assign Identified Theme to Each Review Using LDA Topic Probabilities

We use the trained LDA model to rank the topics for each review by probability.  
The three most likely topics are linked to the manually labeled themes and stored as a list in the `identified_theme` column for downstream analysis.

In [39]:
# map each review to the labels of its three most likely topics
dashen_df['identified_theme'] = map_topics_to_labels(lda_model, dashen_tfidf_matrix, dashen_theme_labels)

✅ Topics mapped to labels


In [40]:
dashen_df[["identified_theme","review_clean"]].head()

Unnamed: 0,identified_theme,review_clean
1000,"[BankingExcellence_FeatureStrength, UserSentim...","Simple for usage and well designed, attractive..."
1001,"[UserSentiment_TrustSignals, BrandIdentity_Pro...",amazing product
1002,"[BankingExcellence_FeatureStrength, UserSentim...",this app is not good and compatible with other...
1003,"[UserSentiment_TrustSignals, BrandIdentity_Pro...",ok
1004,"[UserSentiment_TrustSignals, BrandIdentity_Pro...",like it
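
The mapping step can be pictured as follows: for each review, sort the topic probabilities in descending order and translate the first three topic indices into labels. `map_topics_to_labels` itself is not shown in the notebook, so the function below (`top_k_labels`) is a hypothetical sketch and the probability rows are invented.

```python
def top_k_labels(doc_topic_probs, theme_labels, k=3):
    """For each row of topic probabilities, return the labels of the k most likely topics."""
    mapped = []
    for probs in doc_topic_probs:
        # sorted() is stable, so tied topics keep their index order.
        ranked = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
        mapped.append([theme_labels[t] for t in ranked[:k]])
    return mapped


labels = {
    0: "SeamlessTransactions_DigitalPlatform",
    1: "SuperApp_ExperiencePraise",
    2: "BankingExcellence_FeatureStrength",
    3: "BrandIdentity_ProgressiveDesign",
    4: "UserSentiment_TrustSignals",
}
rows = [
    [0.05, 0.10, 0.60, 0.05, 0.20],  # one clearly dominant topic
    [0.20, 0.20, 0.20, 0.20, 0.20],  # uniform: no topic stands out
]
themes = top_k_labels(rows, labels)
```

The second row illustrates a caveat. When a review yields no 3–4 word features, its topic distribution is likely close to uniform, and the "top 3" is then decided by tie-breaking rather than content. That would explain why very short reviews such as "ok" and "like it" share an identical theme list in the table above.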


# We will repeat the process for the remaining two banks 
- Commercial Bank of Ethiopia (CBE)   
- Bank of Abyssinia (BOA)  

# CBE bank reviews

In [17]:
# preprocess the CBE reviews for thematic analysis
cbe_df = df[df['bank'] == 'Commercial Bank of Ethiopia (CBE)'].copy()
cbe_df['keyword_ready'] = cbe_df['review_clean'].apply(preprocess_for_keywords)
cbe_df[['review_clean', 'keyword_ready']].head()

Unnamed: 0,review_clean,keyword_ready
0,wow,wow
1,excellent,excellent
2,great,great
3,Great,great
4,there is many thing u have to fix.,many thing u fix


In [18]:
vectorizer = TfidfVectorizer(
    max_df=0.1, # to avoid very frequent words
    min_df=1, # to capture words that are inherent to reviews 
    stop_words='english',
    ngram_range=(3,4)
)

# Create TF-IDF matrix for CBE bank
cbe_tfidf_matrix = vectorizer.fit_transform(cbe_df['keyword_ready'])
cbe_feature_names = vectorizer.get_feature_names_out()

In [19]:
# refit the same LDA model instance on the CBE TF-IDF matrix (this overwrites the Dashen fit)
lda_model.fit(cbe_tfidf_matrix)

lda_topics = extract_topic_keywords(lda_model, cbe_feature_names, top_n=20)

# Display the extracted topics 
for topic, keywords in lda_topics.items():
    print(f"{topic}: {keywords}")

Topic 1: ['fast easy use', 'garrantty bank ebc', 'nice use app', 'slow account used', 'woxe harimo ribiso', 'good app fast', 'easy use clear', 'best app finance', 'app say sync', 'app android phone', 'nice app android phone', 'nice app android', 'app proactive good', 'proactive good connection', 'app proactive good connection', 'necessary app people', 'app necessary app', 'app necessary app people', 'excellent easy access uptodate', 'excellent easy access']
Topic 2: ['make life easy', 'best mobile banking ethiopia', 'mobile banking ethiopia', 'best mobile banking', 'work abroad cbe', 'removing screenshot feature', 'service mobile banking', 'work screen shot', 'fast best service head', 'fast best service', 'best service head', 'excellent application user', 'application user friendlynice', 'excellent application user friendlynice', 'dont know prefer', 'busy dont know prefer', 'busy dont know', 'amazing app im enjoying', 'amazing app im', 'app im enjoying']
Topic 3: ['cbe mobile banking',

### Topic Labeling Interpretation (CBE Bank)

We assigned five distinct thematic labels to topics derived from CBE user reviews. These labels reflect recurring patterns in customer feedback across usability, service quality, and operational functionality:

- **Topic 1: Ease of Use & App Performance.** Reviews praise the app’s clarity, speed, proactive design, and accessibility across devices.
- **Topic 2: Service Satisfaction & International Access.** Highlights satisfaction with mobile banking services, international usability, and overall convenience.
- **Topic 3: Functionality & Mixed Reliability Signals.** Blends praise for CBE’s legacy and secure performance with critical mentions of technical issues.
- **Topic 4: Feature Requests & Security Expectations.** Focuses on requests for network tweaks, returning screenshots, and security policies; also contains national pride themes.
- **Topic 5: Update Needs & Sentiment Variation.** Contains varied opinions about usability, service quality, update needs, and dissatisfaction with the app’s current state.

These thematic assignments will support future insight generation, sentiment alignment, and recommendation tracking. We’ll next map these labels to individual reviews using topic probabilities from `lda.transform()`.

In [20]:
# Create a dictionary for CBE topic keywords
cbe_topic_keywords = {
    "Topic 1": ['fast easy use', 'garrantty bank ebc', 'nice use app', 'slow account used', 'woxe harimo ribiso', 'good app fast', 'easy use clear', 'best app finance', 'app say sync', 'app android phone', 'nice app android phone', 'nice app android', 'app proactive good', 'proactive good connection', 'app proactive good connection', 'necessary app people', 'app necessary app', 'app necessary app people', 'excellent easy access uptodate', 'excellent easy access'],
    "Topic 2": ['make life easy', 'best mobile banking ethiopia', 'mobile banking ethiopia', 'best mobile banking', 'work abroad cbe', 'removing screenshot feature', 'service mobile banking', 'work screen shot', 'fast best service head', 'fast best service', 'best service head', 'excellent application user', 'application user friendlynice', 'excellent application user friendlynice', 'dont know prefer', 'busy dont know prefer', 'busy dont know', 'amazing app im enjoying', 'amazing app im', 'app im enjoying'],
    "Topic 3": ['cbe mobile banking', 'simple use amazing', 'wow amazing app', 'best app loved', 'cbe mobilr bankg', 'safe secure fast', 'fast simple easy', 'fast simple easy use', 'applicatiom doesnt work', 'amazing applicatiom doesnt work', 'amazing applicatiom doesnt', 'good app time manager', 'app time manager', 'good app time', 'send cbebirr app app', 'cbebirr app app', 'send cbebirr app', 'leading commercial bank', 'cbe leading commercial', 'cbe leading commercial bank'],
    "Topic 4": ['best mobile banking app', 'mobile banking app', 'best mobile banking', 'change default network', 'low quality application', 'bring screenshot feature', 'reliable easy use', 'pride ethiopian bank', 'working safari network', 'wow best application', 'best banking app', 'good easy use', 'nice fast app', 'transfer money telebirr', 'app mobile banking', 'option app mobile banking', 'option app mobile', 'update security policy screenshots', 'update security policy', 'security policy screenshots'],
    "Topic 5": ['app good using', 'malkaamuu jiidhaa namoo', 'satisfied beautiful app', 'easy thank cebe', 'good week oky', 'bad hard use', 'able make screenshot', 'wow simple life', 'best best fast', 'wonder easy use', 'good simple use', 'good thanks service', 'need update today', 'good need update today', 'good need update', 'app sure secure', 'satisfied app sure', 'satisfied app sure secure', 'worst bank app', 'bank app ethiopia']
}

# Define theme labels for CBE topic clusters
cbe_theme_labels = {
    0: "Ease of Use & App Performance",
    1: "Service Satisfaction & International Access",
    2: "Functionality & Mixed Reliability Signals",
    3: "Feature Requests & Security Expectations",
    4: "Update Needs & Sentiment Variation"
}

print("✅ CBE topic keywords and theme labels are defined successfully.")

✅ CBE topic keywords and theme labels are defined successfully.


In [21]:
# save theme labels for cbe for later analysis
save_topics_with_labels("cbe", cbe_topic_keywords, cbe_theme_labels, "../data/processed/bank_themes")

✅ Saved to: ../data/processed/bank_themes\cbe_theme_map.json


In [22]:
# Apply the thematic analysis function to CBE bank reviews
cbe_df['identified_theme'] = map_topics_to_labels(lda_model, cbe_tfidf_matrix, cbe_theme_labels)

✅ Topics mapped to labels


In [23]:
cbe_df[["identified_theme","review_clean"]]

Unnamed: 0,identified_theme,review_clean
0,"[Update Needs & Sentiment Variation, Feature R...",wow
1,"[Update Needs & Sentiment Variation, Feature R...",excellent
2,"[Update Needs & Sentiment Variation, Feature R...",great
3,"[Update Needs & Sentiment Variation, Feature R...",Great
4,"[Update Needs & Sentiment Variation, Feature R...",there is many thing u have to fix.
...,...,...
495,"[Update Needs & Sentiment Variation, Feature R...",wow.......cbe.....keep it up.....!!!!!!
496,"[Update Needs & Sentiment Variation, Service S...",ጊዜን ቆጣቢ እና ህይወትን ቀለል ከሚያደርጉ ኢትዬጲያ ካሉ ፋይናንስ አፕል...
497,"[Update Needs & Sentiment Variation, Feature R...",Excellent🙏app
498,"[Update Needs & Sentiment Variation, Feature R...",the most useful


# BOA bank reviews

In [24]:
# preprocess the BOA reviews for thematic analysis
boa_df = df[df['bank'] == 'Bank of Abyssinia (BOA)'].copy()
boa_df['keyword_ready'] = boa_df['review_clean'].apply(preprocess_for_keywords)
boa_df[['review_clean', 'keyword_ready']].head()

Unnamed: 0,review_clean,keyword_ready
500,nothing when I need to install the Apk it say ...,nothing need install apk say reup date
501,I can log in from any where,log
502,😇,
503,no proplem,proplem
504,👍👍👍,


In [25]:
vectorizer = TfidfVectorizer(
    max_df=0.1, # to avoid very frequent words
    min_df=1, # to capture words that are inherent to reviews 
    stop_words='english',
    ngram_range=(3,4)
)

# Create TF-IDF matrix for BOA
boa_tfidf_matrix = vectorizer.fit_transform(boa_df['keyword_ready'])
boa_feature_names = vectorizer.get_feature_names_out()

In [26]:
# refit the same LDA model instance on the BOA TF-IDF matrix
lda_model.fit(boa_tfidf_matrix)

lda_topics = extract_topic_keywords(lda_model, boa_feature_names, top_n=20)

# Display the extracted topics 
for topic, keywords in lda_topics.items():
    print(f"{topic}: {keywords}")

Topic 1: ['app crush frequently', 'great financial company', 'lemn embi yilal', 'doesnt workso frustrating', 'professional banking app', 'ብዙዬ ሺዋየ በለጠ', 'አቢስኒያ የሁሉም ምርጫ', 'guy getting worst', 'faster bank abissinya', 'opening really frustrating', 'corrupted poor app', 'worst mobile banking app', 'mobile banking app', 'worst mobile banking', 'worst app human created', 'app human created', 'open display error message', 'display error message', 'open display error', 'better compared cbe']
Topic 2: ['poor application turned', 'work correctly update', 'gooood app dear', 'boa mobile backing', 'aadan axmed barkhadle', 'ahmed mohammed husen', 'liking application good', 'app doesnt work', 'dirtiest application seen', 'useless app downgraded', 'app doesnt start', 'ቆንጆ ነው በርቱ', 'worst app seen like', 'worst app seen', 'app seen like', 'want international mobile', 'international mobile banking', 'want international mobile banking', 'boa greqt ethiopian bank', 'boa greqt ethiopian']
Topic 3: ['worst

### Topic Labeling Interpretation (BOA App)

We assigned five distinct thematic labels to topics derived from BOA user reviews. These themes reflect recurring signals around functionality challenges, praise, and user frustration:

- **Topic 1: App Crashes & Frustration Signals** — highlights frequent app crashes, error messages, and comparison to competitor apps.
- **Topic 2: Inconsistency & Usability Complaints** — contains issues with app reliability, poor user flow, and requests for international access.
- **Topic 3: Mixed Feedback & Customer Experience** — blends praise for banking service with disruptions, update failures, and slow performance.
- **Topic 4: Trust Issues & Service Downtime** — surfaces poor load times, security concerns, user doubt, and negative experience emphasis.
- **Topic 5: Prior App Preference & Feature Breakdown** — mentions legacy app functionality, developer settings, and problematic app behavior.

These thematic groupings offer meaningful insight into user sentiment and app stability concerns. Next, we’ll map these themes to reviews using LDA output.

In [27]:
# Define BOA topic keywords
boa_topic_keywords = {
    "Topic 1": ['app crush frequently', 'great financial company', 'lemn embi yilal', 'doesnt workso frustrating',
                'professional banking app', 'ብዙዬ ሺዋየ በለጠ', 'አቢስኒያ የሁሉም ምርጫ', 'guy getting worst',
                'faster bank abissinya', 'opening really frustrating', 'corrupted poor app', 'worst mobile banking app',
                'mobile banking app', 'worst mobile banking', 'worst app human created', 'app human created',
                'open display error message', 'display error message', 'open display error', 'better compared cbe'],

    "Topic 2": ['poor application turned', 'work correctly update', 'gooood app dear', 'boa mobile backing',
                'aadan axmed barkhadle', 'ahmed mohammed husen', 'liking application good', 'app doesnt work',
                'dirtiest application seen', 'useless app downgraded', 'app doesnt start', 'ቆንጆ ነው በርቱ',
                'worst app seen like', 'worst app seen', 'app seen like', 'want international mobile',
                'international mobile banking', 'want international mobile banking', 'boa greqt ethiopian bank',
                'boa greqt ethiopian'],

    "Topic 3": ['worst banking app', 'poorest mobile banking', 'update doesnt work', 'ok stop sundenly',
                'far good lug', 'good app fore', 'fast suitable customer', 'want download dont',
                'awasome app head', 'bast bank ethiopia', 'yegema app tish', 'yes active user', 'bad app slow',
                'አሪፍ ነው በርቱልን', 'harun tamam galanaa', 'possible gove star', 'app android phone',
                'best financial app', 'banking app work', 'worst banking app work'],

    "Topic 4": ['app isnt working', 'worest app loading', 'help like ittttt', 'good app helpful', 'apps crash error',
                'open open service', 'አይሰራም ሼም ነው', 'trust bank service', 'dont trust bank service',
                'dont trust bank', 'የእርስዎን ተሞክሮ ይግለጹ አማራጭ', 'ተሞክሮ ይግለጹ አማራጭ', 'የእርስዎን ተሞክሮ ይግለጹ',
                'bad app vety bad', 'app vety bad', 'bad app vety', 'attention important difficult',
                'attention important difficult time', 'important difficult time', 'uselss app dont download'],

    "Topic 5": ['worest app recommende', 'problematic hardly work', 'nise mobile bankig', 'poorly functioning app',
                'amazing bank app', 'previous application better', 'worest mb app', 'doesnt work device',
                'new app good', 'working day check', 'disable developer option', 'demand disable developer option',
                'demand disable developer', 'best banking app wworld', 'banking app wworld', 'occured worst app',
                'error occured worst', 'error occured worst app', 'use applicationplease help', 'open use applicationplease help']
}

# Define theme labels for BOA topics
boa_theme_labels = {
    0: "App Crashes & Frustration Signals",
    1: "Inconsistency & Usability Complaints",
    2: "Mixed Feedback & Customer Experience",
    3: "Trust Issues & Service Downtime",
    4: "Prior App Preference & Feature Breakdown"
}

print("✅ BOA topic keywords and theme labels are ready.")

✅ BOA topic keywords and theme labels are ready.


In [28]:
# save theme labels for BOA for later analysis
save_topics_with_labels("boa", boa_topic_keywords, boa_theme_labels, "../data/processed/bank_themes")

✅ Saved to: ../data/processed/bank_themes\boa_theme_map.json


In [31]:
# Apply the thematic analysis function to BOA reviews
boa_df['identified_theme'] = map_topics_to_labels(lda_model, boa_tfidf_matrix, boa_theme_labels)

✅ Topics mapped to labels


In [32]:
boa_df[["identified_theme","review_clean"]]

Unnamed: 0,identified_theme,review_clean
500,"[Mixed Feedback & Customer Experience, Inconsi...",nothing when I need to install the Apk it say ...
501,"[Prior App Preference & Feature Breakdown, Tru...",I can log in from any where
502,"[Prior App Preference & Feature Breakdown, Tru...",😇
503,"[Prior App Preference & Feature Breakdown, Tru...",no proplem
504,"[Prior App Preference & Feature Breakdown, Tru...",👍👍👍
...,...,...
995,"[Prior App Preference & Feature Breakdown, Mix...",Always error occured. The worst app ever
996,"[Mixed Feedback & Customer Experience, Inconsi...",ከዚህ ትልቅ ባንክ የማይጠበቅ ድንዝዝዝዝ ያለ App.... ዛግግግ ነው ያ...
997,"[Prior App Preference & Feature Breakdown, Tru...",Excellent service
998,"[Prior App Preference & Feature Breakdown, Tru...",It's not convenient


# combine all the dataframes into one

In [42]:
# Combine all three bank DataFrames
combined_df = pd.concat([dashen_df, cbe_df, boa_df], ignore_index=True)
# check the shape of the combined dataframe
print(combined_df.shape)
# check the columns to confirm the merge
combined_df.columns

(1492, 11)


Index(['review_id', 'review', 'rating', 'date', 'bank', 'source',
       'review_clean', 'sentiment_label', 'sentiment_score', 'keyword_ready',
       'identified_theme'],
      dtype='object')

In [43]:
# save the final combined dataframe
combined_df.to_csv("../data/processed/reviews with sentiments and themes.csv")
print("saved the combined dataframe successfully✅")

saved the combined dataframe successfully✅
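
One detail to keep in mind when reading this file back: CSV has no list type, so a column such as `identified_theme` that holds Python lists is written as text. The sketch below reproduces the effect with the standard `csv` module and shows how `ast.literal_eval` restores the list. It assumes the column contains plain Python lists of strings, as the displayed tables suggest; the theme names are placeholders.

```python
import ast
import csv
import io

# Write one row whose theme column is the string form of a Python list,
# which is what ends up in the file when a list-valued column is exported.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["review_clean", "identified_theme"])
writer.writerow(["ok", str(["Theme A", "Theme B", "Theme C"])])

# Read it back: the value is now a string, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["identified_theme"]

# literal_eval safely parses the string back into a list.
themes = ast.literal_eval(raw)
```

After `pd.read_csv`, the equivalent would be `df['identified_theme'].apply(ast.literal_eval)`.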


# Sentiment and Thematic Analysis — Full Notebook Summary

This notebook provides a complete workflow for extracting sentiment and thematic insights from Google Play user reviews of three major Ethiopian banks: **Dashen Bank**, **Commercial Bank of Ethiopia (CBE)**, and **Bank of Abyssinia (BOA)**. The analysis uses modern NLP techniques and is designed for modularity and reproducibility.

---

## 1. Setup and Data Loading

- **Imports**: Essential libraries for:
  - Data: `pandas`, `numpy`
  - NLP: `transformers`, `nltk`, `emoji`, `sklearn`
  - Local utilities from `src` directory

- **Data Loading**:
  - Reads cleaned review data from `../data/processed/cleaned_reviews.csv`
  - Uses a custom `load_data` function

- **Preview**:
  - Confirms key columns: `review_clean`, `bank`, `rating`

---

## 2. Sentiment Analysis

### Model Initialization
- Loads pretrained model:  
  `distilbert-base-uncased-finetuned-sst-2-english` from HuggingFace Transformers

### Inference and Annotation
- Defines `get_sentiment_outputs` function
- Applies model to reviews
- Stores outputs in:
  - `sentiment_label` (positive/negative)
  - `sentiment_score` (confidence)

### Summary Statistics
- Aggregates sentiment scores:
  - By **bank**
  - By **rating**, focusing on 1-star reviews
- Shows high confidence in negative predictions for low-rated reviews

### Interpretation
- 1-star reviews receive **high-confidence** predictions (mean scores 0.971–0.996)
- Tight clustering of confidence scores → Low ambiguity in low-rated reviews

---

## 3. Thematic Analysis Pipeline

### Text Preprocessing
- Cleaning steps:
  - Lemmatization
  - Stopword removal
  - Emoji stripping
  - Punctuation/digit normalization
- Saves cleaned text in a `keyword_ready` column

### TF-IDF Vectorization
- Uses `TfidfVectorizer` to extract **trigrams and four-grams**
- Filters overly frequent terms using `max_df=0.1`
- Produces TF-IDF matrices and features per bank

### Topic Modeling (LDA)
- Configures `LatentDirichletAllocation`:
  - `n_components=5` topics
- Fits LDA model on each bank's TF-IDF matrix
- Uses `extract_topic_keywords()` to get top words per topic

### Manual Topic Labeling
- Reviews top keywords and assigns **manual labels** to topics
- Creates dictionaries:
  - Maps topic indices → Human-readable themes
  - Separate mapping for Dashen, CBE, BOA

### Theme Mapping
- For each review:
  - Computes top 3 most likely themes from topic distribution
- Stores mapped labels in `identified_theme` column

### Saving Results
- Topic keywords and labels saved to disk using `save_topics_with_labels`

---

## 4. Data Integration and Export

- Merges processed DataFrames from all three banks into `combined_df`
- Saves annotated DataFrame to:  
  `../data/processed/reviews with sentiments and themes.csv`

---

## 5. Utility and Modularization

- **Reusable Functions**: All major steps implemented as standalone functions
- **Modules** (stored in `src/`):
  - `utils`
  - `keyword_text_processor`
  - `sentiments`
  - `keyword_and_topicmodeling`

---

## 6. Workflow Coverage

- **Banks Analyzed**: Dashen, CBE, BOA
- **Sentiment**: Model-based inference + confidence
- **Themes**:
  - N-gram extraction (TF-IDF)
  - LDA topic modeling
  - Manual label mapping
  - Per-review theme assignment
- **Output**: Final labeled dataset ready for downstream use

---

## Conclusion

This notebook delivers a robust, end-to-end pipeline that extracts both sentiment and thematic insights from real user feedback. With clear modular structure, it supports:
- Reusability and extension to other banks or datasets
- Reproducibility through function-driven design
- Export-ready annotated data for business or research analysis
