# Simple KeyPhrases Workflow

This workflow shows how to find keyphrases with Relevance AI.

In [1]:
!pip install -q RelevanceAI
!pip install -q rake-nltk

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.8.0 requires tf-estimator-nightly==2.8.0.dev2021122109, which is not installed.
arviz 0.11.4 requires typing-extensions<4,>=3.7.4.3, but you have typing-extensions 4.0.1 which is incompatible.

In [2]:
from relevanceai import Client 
client = Client()

In [17]:
from relevanceai.datasets import get_dummy_ecommerce_dataset
docs = get_dummy_ecommerce_dataset()

In [20]:
dataset_id = "ecommerce-example"
text_fields = ["product_title"]


In [18]:
ds = client.Dataset("ecommerce-example")
ds.upsert_documents(docs)

✅ All documents inserted/edited successfully.


# RAKE 

Relevance AI supports a keyphrase extraction algorithm called `rake` (Rapid Automatic Keyword Extraction), which is the default. Users can also switch to plain n-gram extraction with `nltk` and pass additional stopwords to exclude. For example:
- `nltk` with `unigram` (single-word) extraction
- `nltk` with `bigram` (two-word) extraction
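
To make the scoring less opaque, here is a minimal pure-Python sketch of the RAKE idea: split text into candidate phrases at stopwords and punctuation, score each word by its degree divided by its frequency, and sum the word scores per phrase. The tiny `STOPWORDS` set and the helper names `candidate_phrases` and `rake_scores` are assumptions made for illustration; this is not a description of how `ds.keyphrases` is implemented.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real lists are much larger).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of consecutive non-stopword tokens."""
    phrases, current = [], []
    # Punctuation is kept as its own token so that it also breaks phrases.
    for token in re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower()):
        if token in stopwords or not token[0].isalnum():
            if current:
                phrases.append(tuple(current))
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(tuple(current))
    return phrases


def rake_scores(texts, stopwords=STOPWORDS):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    phrases = [p for t in texts for p in candidate_phrases(t, stopwords)]
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # co-occurrences per word, counting the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


titles = [
    "Eau de Parfum Spray for Women",
    "Logitech Wireless Mouse with USB Receiver",
]
for phrase, score in rake_scores(titles):
    print(phrase, score)
```

Under this scoring a phrase's score grows with its length, so long titles with few stopwords tend to rank highest. That may be why whole product titles lead the `rake` output for this dataset.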

In [22]:
ds.keyphrases(text_fields=text_fields, algorithm="rake")



Updating word count...
Retrieving documents...

[('hamilton beach 49970 personal cup single serve one cup pod disc coffee brewer',
  332.5),
 ('sandisk ultra mobile microsd hc 8gb 16gb 32gb 64gb class10 memory card wholesale',
  331.0),
 ('ounce eau de parfum spray', 273.75),
 ('aidan mattox gold strapless fairy tale empire waist bead evening dress',
  225.5),
 ('turboion croc skin titanium ceramic digital flat hair iron gift set',
  209.0),
 ('nike ladies pink lunar duet sport golf shoes', 204.71666666666667),
 ('micro sd 4gb 8gb 16gb tf flash memory card sdhc microsd', 203.5),
 ('secret vanilla lace scented massage oil sparkling caress rare htf', 200.0),
 ('mens stainless steel silver rope twist chain necklace jewelry 2', 197.5),
 ('36mens silver stainless steel braided wheat chain necklace jewelry 3',
  195.33333333333331)]

# Specifying the number of words in the keyphrases

In [24]:
ds.keyphrases(text_fields=text_fields, algorithm="nltk", n=3)



Updating word count...
Retrieving documents...

[("'s Gold Toe", 46),
 ("'s DC Shoes", 44),
 ("Men 's Gold", 40),
 ("Men 's DC", 38),
 ("Nike Women 's", 30),
 ('White Mark Women', 24),
 ("Mark Women 's", 24),
 ("Levi 's Women", 24),
 ("'s Women 's", 24),
 ("Nike Men 's", 22)]
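
Entries such as `'s Gold Toe` suggest that the possessive `'s` is treated as a separate token before n-grams are counted. The sketch below reproduces that effect with the standard library only. The regex tokenizer and the helper names `tokenize` and `ngram_counts` are assumptions for illustration, not the tokenizer that Relevance AI or `nltk` actually uses.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a possessive 's becomes its own token, as Treebank-style
    # tokenizers do. All other punctuation is simply dropped.
    return re.findall(r"'s|[A-Za-z0-9]+", text)


def ngram_counts(texts, n):
    """Count every run of n consecutive tokens across all texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


titles = ["Men's Gold Toe Socks", "Men's Gold Toe Crew"]
print(ngram_counts(titles, n=3).most_common(2))
```

With `n=3`, the title `Men's Gold Toe Socks` yields `Men 's Gold`, `'s Gold Toe`, and `Gold Toe Socks`, which matches the shape of the phrases in the output above.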

# Infinitely Hackable With Preprocessing Hooks

In [29]:
def remove_apostrophe(string):
    # Preprocessing hook: strip the possessive "'s" so "Men's" becomes "Men"
    return string.replace("'s", "")

In [30]:
ds.keyphrases(text_fields=text_fields, algorithm="nltk", n=3, preprocess_hooks=[remove_apostrophe])



Updating word count...
Retrieving documents...

[('Men Gold Toe', 40),
 ('Men DC Shoes', 38),
 ('White Mark Women', 24),
 ('Eau de Parfum', 22),
 ('External Hard Drive', 22),
 ('Lunar Duet Sport', 20),
 ('Logitech Wireless Mouse', 20),
 ('de Parfum Spray', 18),
 ('Lane Boots Women', 18),
 ('Nike Womens Lunar', 16)]
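
Several hooks can be combined. The sketch below assumes that each hook is a function from string to string and that hooks run in list order, as the `remove_apostrophe` example suggests. `apply_hooks`, `remove_possessive`, and `collapse_whitespace` are hypothetical helpers written for this illustration and are not part of the RelevanceAI package.

```python
import re


def apply_hooks(text, hooks):
    """Run each hook on the text in order and return the final string."""
    for hook in hooks:
        text = hook(text)
    return text


def remove_possessive(text):
    # Strip 's only at the end of a word, so "'socks" is left untouched.
    return re.sub(r"'s\b", "", text)


def collapse_whitespace(text):
    # Replace any run of whitespace with a single space.
    return " ".join(text.split())


hooks = [remove_possessive, collapse_whitespace]
print(apply_hooks("Men's  Gold Toe", hooks))
```

A list built this way could be passed as `preprocess_hooks=hooks` in the same manner as the single hook above.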

# Adding Additional Stopwords

Users can also add their own stopwords on top of the default stopword list to filter out uninformative terms.

In [33]:
ds.keyphrases(text_fields=text_fields, algorithm="nltk", n=3, additional_stopwords=["Men", "Women"], preprocess_hooks=[remove_apostrophe])



Updating word count...
Retrieving documents...

[('Eau de Parfum', 22),
 ('External Hard Drive', 22),
 ('Lunar Duet Sport', 20),
 ('Logitech Wireless Mouse', 20),
 ('de Parfum Spray', 18),
 ('Nike Womens Lunar', 16),
 ('Foam Mattress Topper', 16),
 ('Womens Lunar Duet', 14),
 ('Synthetic Athletic Shoe', 14),
 ('SonicFuel Hybrid Earbud', 14)]
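
One plausible way to apply extra stopwords is to drop matching tokens before the n-grams are built. The sketch below does this with an exact, case-sensitive match, which would be consistent with `Nike Womens Lunar` surviving above while phrases containing `Women` disappear. The placeholder `BASE_STOPWORDS` set, the whitespace tokenizer, and the function name `filtered_ngram_counts` are assumptions; the library's actual filtering may differ.

```python
from collections import Counter

# Placeholder for the default stopword list (an assumption for this sketch).
BASE_STOPWORDS = {"the", "a", "of", "for"}


def filtered_ngram_counts(texts, n, additional_stopwords=()):
    """Count n-grams after removing default and additional stopwords."""
    stopwords = BASE_STOPWORDS | set(additional_stopwords)
    counts = Counter()
    for text in texts:
        # Exact, case-sensitive match: "Womens" is not equal to "Women".
        tokens = [t for t in text.split() if t not in stopwords]
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


titles = ["Men Gold Toe Socks", "Nike Womens Lunar Duet"]
print(filtered_ngram_counts(titles, n=3, additional_stopwords=["Men", "Women"]))
```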

# Cluster Keyphrases

Users can also extract keyphrases for each cluster. This is helpful as an automated way to label clusters.
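
The idea can be sketched locally before involving the API: group documents by cluster, count n-grams within each group, and keep the most frequent phrase as the label. The flat `cluster` key, the sample documents, and the function name `label_clusters` are hypothetical and chosen only to keep the example self-contained.

```python
from collections import Counter, defaultdict


def label_clusters(docs, text_field, cluster_field, n=3, top_k=1):
    """Return the top_k most frequent n-grams per cluster value."""
    grouped = defaultdict(Counter)
    for doc in docs:
        tokens = doc[text_field].lower().split()
        for i in range(len(tokens) - n + 1):
            grouped[doc[cluster_field]][" ".join(tokens[i:i + n])] += 1
    return {
        cluster: [phrase for phrase, _ in counter.most_common(top_k)]
        for cluster, counter in grouped.items()
    }


# Hypothetical documents with a flat cluster field.
docs = [
    {"product_title": "External Hard Drive 1TB", "cluster": "cluster-1"},
    {"product_title": "Portable External Hard Drive", "cluster": "cluster-1"},
    {"product_title": "Nike Lunar Duet Sport", "cluster": "cluster-2"},
    {"product_title": "Lunar Duet Sport Golf", "cluster": "cluster-2"},
]
print(label_clusters(docs, "product_title", "cluster"))
```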

In [42]:
# First we run some clustering
clusterer = ds.auto_cluster("kmeans-5", vector_fields=["product_title_clip_vector_"])

Retrieving all documents
Fitting and predicting on all documents



---------------------------
Grade: D
Mean Silhouette Score: 0.08688070325190632
---------------------------
Updating the database...
Inserting centroid documents...
Build your clustering app here: https://cloud.relevance.ai/dataset/ecommerce-example/deploy/recent/cluster
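
The next cell reads the cluster labels back through a dotted field path and filters documents by cluster value. Those two pieces can be factored into small helpers, sketched below. The helper names `cluster_field_path` and `cluster_filter` are hypothetical; the path layout and the filter keys are copied from the cell that follows, not from separate API documentation.

```python
def cluster_field_path(vector_fields, alias, cluster_field="_cluster_"):
    """Build the dotted path: <cluster_field>.<sorted vector fields>.<alias>."""
    vector_fields_str = ".".join(sorted(vector_fields))
    return f"{cluster_field}.{vector_fields_str}.{alias}"


def cluster_filter(field, cluster_value):
    """Build a filter that keeps documents whose cluster equals cluster_value."""
    return {
        "field": field,
        "filter_type": "contains",
        "condition": "==",
        "condition_value": cluster_value,
    }


field = cluster_field_path(["product_title_clip_vector_"], "kmeans-5")
print(field)
print(cluster_filter(field, "cluster-1"))
```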


In [61]:
from tqdm.auto import tqdm

cluster_field = "_cluster_"
preprocess_hooks = []
vector_fields = ["product_title_clip_vector_"]
cluster_alias = "kmeans-5"
text_fields = ["product_title"]

# Cluster labels are read from _cluster_.<vector fields>.<alias>
vector_fields_str = ".".join(sorted(vector_fields))
field = f"{cluster_field}.{vector_fields_str}.{cluster_alias}"
# List the cluster values; page_size=2 returns two of them in this run
all_clusters = ds.facets([field], page_size=2)
most_common = 10
cluster_counters = {}
for c in tqdm(all_clusters['results'][field]):
    cluster_value = c[field]
    # Extract keyphrases only from the documents in this cluster
    top_words = ds.keyphrases(
        text_fields=text_fields,
        n=3,
        filters=[
            {
                "field": field,
                "filter_type": "contains",
                "condition": "==",
                "condition_value": cluster_value,
            }
        ],
        most_common=most_common,
        preprocess_hooks=preprocess_hooks,
        algorithm="rake",
    )
    cluster_counters[cluster_value] = top_words
cluster_counters




Updating word count...
Retrieving documents...
Updating word count...
Retrieving documents...


{'cluster-1': [('hamilton beach 49970 personal cup single serve one cup pod disc coffee brewer',
   169.0),
  ('turboion croc skin titanium ceramic digital flat hair iron gift set',
   105.0),
  ('secret vanilla lace scented massage oil sparkling caress rare htf', 100.0),
  ('seagate backup plus portable stdr2000102 2 tb external hard drive', 100.0),
  ('seagate backup plus portable stdr2000101 2 tb external hard drive', 100.0),
  ('turboion croc skin titanium ceramic digital flat iron gift set', 96.0),
  ('ounce eau de parfum spray', 94.95238095238095),
  ('zaq galaxy ceramic litemist aromatherapy 200 ml essential oil diffuser',
   91.0),
  ('rayovac aa 12 pack platinum rechargeable low discharge nimh batteries',
   86.83333333333334),
  ('nemo digital white crystal pave twisted heart earbud headphones', 80.0)],
 'cluster-3': [('nike ladies pink lunar duet sport golf shoes',
   104.43333333333334),
  ('nike ladies lunar duet sport golf shoes', 74.1462148962149),
  ('western chief hell