# Finding Labels

## Problem Statement

We've thousands of app reviews for any given client. We do not know what categories or tags would apply. The intent behind this notebook is to explore one approach, which can combine data with your understanding of the data/business context. 

## Approach

Step 1: We use 

In [23]:
import json
import random
from collections import Counter

import pandas as pd
import spacy
import textacy
from pydantic import BaseModel
from spacy.lang.en import English
from textacy import extract
from textacy.representations.vectorizers import Vectorizer
from textacy.similarity.tokens import jaccard
from textacy.tm import TopicModel
from tqdm.notebook import tqdm

from pathlib import Path
from typing import Dict, List, Optional, Tuple

%load_ext autoreload
%autoreload 2
Path.ls = lambda x: list(x.iterdir())

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


The main contribution here is a programmatic way to find labels for topic models, and then classify documents into them -- but while still retaining some degree of human intervention.

In [2]:
data_dir = Path("../data/raw").resolve()
assert data_dir.exists()
data_dir.ls()

[WindowsPath('C:/Users/nirantk/Documents/GitHub/AppReview/data/raw/Clubhouse_us_app_store_reviews.json'),
 WindowsPath('C:/Users/nirantk/Documents/GitHub/AppReview/data/raw/com.ubercab_us_play_store_reviews.json'),
 WindowsPath('C:/Users/nirantk/Documents/GitHub/AppReview/data/raw/frequency_count.json'),
 WindowsPath('C:/Users/nirantk/Documents/GitHub/AppReview/data/raw/IndiaGold_in_app_store_reviews.json'),
 WindowsPath('C:/Users/nirantk/Documents/GitHub/AppReview/data/raw/Moj_in_app_store_reviews.json'),
 WindowsPath('C:/Users/nirantk/Documents/GitHub/AppReview/data/raw/Moj_us_app_store_reviews.json'),
 WindowsPath('C:/Users/nirantk/Documents/GitHub/AppReview/data/raw/Netflix_us_app_store_reviews.json'),
 WindowsPath('C:/Users/nirantk/Documents/GitHub/AppReview/data/raw/Uber_us_app_store_reviews.json')]

Note:

Here we will explore App Reviews for just one app: Uber (Passenger/Cab, not the Driver). The additional data to reproduce this for other clients is left as an exercise for you.

But to get an overview of all of them, we combine them into a larger single text string and explore them. 

In [3]:
file_path = data_dir / "Uber_us_app_store_reviews.json"; assert file_path.exists()
with file_path.open("r") as f:
    raw_data = pd.read_json(f)
    reviews = raw_data["review"].to_list()

In [4]:
nlp = spacy.load("en_core_web_sm")
# nlp = spacy.load("en_core_web_trf")

In [5]:
%time reviews = [rev.strip() for rev in reviews]
len(reviews)

Wall time: 987 µs


5000

In [6]:
%time corpus = textacy.Corpus("en_core_web_sm", data=reviews)

Wall time: 2min 14s


In [7]:
corpus.n_docs, corpus.n_sents, corpus.n_tokens

(5000, 43269, 729300)

In [8]:
word_counts = corpus.word_counts(by="lemma_", filter_stops= True, filter_nums=True, filter_punct=True)
sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:25]

[('driver', 8032),
 ('Uber', 7978),
 ('ride', 4797),
 ('time', 4369),
 ('app', 4232),
 ('charge', 2756),
 ('uber', 2527),
 ('minute', 2192),
 ('cancel', 2157),
 ('service', 2135),
 ('use', 2080),
 ('pick', 1940),
 ('wait', 1928),
 ('customer', 1916),
 ('$', 1834),
 ('car', 1791),
 ('try', 1774),
 ('go', 1639),
 ('get', 1607),
 ('way', 1586),
 ('work', 1516),
 ('trip', 1514),
 ('say', 1513),
 ('take', 1460),
 ('need', 1438)]

# TODO
- [ ] NMF Topic Modeling
- [ ] Noun Chunks
- [ ] N-Grams

## Topic Modeling with Textacy via Scikit-Learn

In [9]:
tokenized_docs = (
    (term.lemma_ for term in textacy.extract.terms(doc, ngs=1, ents=True))
    for doc in corpus
)

In [10]:
vectorizer = Vectorizer(
    tf_type="linear", idf_type="smooth", norm="l2", min_df=3, max_df=0.95
)

In [11]:
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
doc_term_matrix

<5000x4544 sparse matrix of type '<class 'numpy.float64'>'
	with 203279 stored elements in Compressed Sparse Row format>

In [12]:
n_topics = 10

In [13]:
%%time
model = TopicModel("nmf", n_topics=n_topics)
model.fit(doc_term_matrix)



Wall time: 2.35 s


In [14]:
doc_topic_matrix = model.transform(doc_term_matrix)

In [38]:
class Topic(BaseModel):
    title: Optional[str]
    terms: Optional[List[str]]
    top_docs_idx: Optional[List[int]]
    keyterms: Optional[Dict] = {}
    linguistic_terms: Optional[Dict] = {}

lst_topics = []

In [39]:
for topic_idx, top_terms in model.top_topic_terms(
    vectorizer.id_to_term, topics=range(n_topics)
):
    print(f"Topic #{topic_idx}:", "\t".join(top_terms))
    lst_topics.append(Topic(terms=[str(term) for term in top_terms]))

Topic #0: app	Uber	driver	ride	time	work	use	need	like	great
Topic #1: customer	service	issue	support	contact	response	Uber	company	resolve	refund
Topic #2: cancel	driver	charge	fee	wait	trip	cancellation	pick	call	request
Topic #3: walk	drop	pick	location	pool	block	destination	address	point	street
Topic #4: $	charge	price	ride	Uber	pay	cost	money	dollar	take
Topic #5: card	credit	payment	gift	method	cash	use	Uber	account	debit
Topic #6: minute	wait	away	time	10	20	10 minute	5	15	late
Topic #7: uber	lyft	money	time	use	nt	work	need	like	bad
Topic #8: car	drive	seat	get	airport	ask	say	clean	smell	driver
Topic #9: account	number	phone	email	try	help	log	sign	app	Uber


In [89]:
# Top documents for each topic
for topic_idx, top_docs_idx in model.top_topic_docs(
    doc_topic_matrix, weights=True, top_n=20
):
    #     print(f"{topic_idx}: {top_docs_idx}")
    lst_topics[topic_idx].top_docs_idx = top_docs_idx

# Exploring Each Topic

In [111]:
# Top Ngrams, Entities and Noun Chunks for each document in a topic
for topic in lst_topics:
    l_terms, k_terms = [], []
    for doc_idx, weight in topic.top_docs_idx:
        doc = corpus[doc_idx]
        l_terms.extend(
            [
                str(term).lower()  # ignore case
                for term in extract.terms(doc, ngs=[2, 3], ncs=True, ents=True)
            ]
        )
        k_terms.extend([pair[0] for pair in extract.keyterms.yake(doc, ngrams=[2, 3])])
    #     print(len(topic.linguistic_terms))
    topic.linguistic_terms = Counter(l_terms).most_common(200)
    topic.linguistic_terms, topic.keyterms = get_terms(
        topic.linguistic_terms
    ), get_terms(topic.keyterms)
    topic.keyterms = Counter(k_terms).most_common(200)
    # remove duplicates across docs
#     print(len(topic.linguistic_terms))

In [112]:
def get_terms(terms: List[Tuple[str, int]]) -> List[str]:
    return [ele[0] for ele in terms]

In [115]:
# Finding a Title for Each Topic
for topic in lst_topics:
    combined_list = topic.terms + topic.keyterms + topic.linguistic_terms
    combined_list = Counter(combined_list).most_common(20)
    print(get_terms(combined_list))

['app', 'driver', 'ride', 'time', 'work', 'need', 'Uber', 'use', 'like', 'great', ('Lyft app', 2), ('couple time', 1), ('helpful driver', 1), ('great rating', 1), ('driver time', 1), ('couple thing', 1), ('cancellation tab', 1), ('star rating', 1), ('bad review', 1), ('phone battery', 1)]
['customer', 'service', 'issue', 'support', 'response', 'company', 'refund', 'contact', 'Uber', 'resolve', ('customer service', 20), ('customer support', 6), ('service number', 4), ('customer service number', 4), ('phone number', 3), ('terrible customer', 2), ('poor customer', 2), ('poor customer service', 2), ('horrible customer', 2), ('horrible customer service', 2)]
['driver', 'fee', 'trip', 'request', 'cancel', 'charge', 'wait', 'cancellation', 'pick', 'call', ('cancellation fee', 12), ('Uber driver', 4), ('new driver', 2), ('phone number', 2), ('second driver', 2), ('long time', 2), ('Ciudad Juarez', 1), ('El Paso', 1), ('cancel fee', 1), ('bad thing', 1)]
['drop', 'pick', 'location', 'pool', 'bl

In [None]:
for topic in lst_topics:
    print(topic.title, topic.terms)

# Labels

Happy with the app [Skipped]

|Index|Label Title| Label Description|
|---|:---|:---|
|1|payment|Payment Methods|
|2|cancel_fees|Cancellation Fee|
|3|price|Price|
|4|pickup|Pickup|
|5|pool|Pool|
|6|support|Customer Support|
|7|advance_ride|Advance Ride Booking|