# Simple Keyword Clustering - SEO

Check out the repository for more code https://github.com/PabloRosales/code-examples

You can also follow me on twitter at [@_PabloDev](https://twitter.com/_PabloDev)

You can also help by supporting on Patreon: https://www.patreon.com/_pablodev

We'll just use the example from SBert with some tweaks: https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/clustering/agglomerative.py

# Install dependencies

We will use Sentence Transformers 

Check out the documentation at https://www.sbert.net/

In [None]:
%%capture
!pip install sentence_transformers

# Download the model

We'll use `all-MiniLM-L6-v2`, a small and fast general-purpose model. You can try different models; check out their page: https://www.sbert.net/docs/pretrained_models.html

In [None]:
%%capture
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2', cache_folder='/tmp/.cache')

# Other dependencies

We will do Agglomerative Clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering

We will also be using Pandas and Numpy

In [None]:
import numpy as np
import pandas as pd

Seaborn will be used to graph the length of our keywords

In [None]:
import seaborn as sns

Let's import `io` to load our file

In [None]:
import io

# Add your keywords from a CSV file

Your CSV file must include a **keyword** column, otherwise you'll need to adapt the code.

In [None]:
from google.colab import files

uploaded = files.upload()
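If you want to confirm that a file has the expected shape before clustering, a check like the one below may help. It is only a sketch using the standard library: `read_keywords`, the `volume` column, and the sample rows are made up for illustration and are not part of the notebook.

```python
import csv
import io

def read_keywords(csv_bytes, column="keyword"):
    """Return the values of `column` from raw CSV bytes, or raise if it is missing."""
    # utf-8-sig drops a byte-order mark if the file has one
    text = csv_bytes.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"CSV needs a '{column}' column, found: {reader.fieldnames}")
    return [row[column] for row in reader]

# Hypothetical sample file contents
sample = b"keyword,volume\nrunning shoes,1000\ntrail running shoes,400\n"
keywords = read_keywords(sample)
print(keywords)
```

If your column has a different name, passing it as `column` is the kind of adaptation the note above refers to.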

# Load and check our keywords

The fun part is actually pretty simple with Sentence Transformers.

But first, let's grab the keywords from the CSV:

In [None]:
filename = list(uploaded.keys())[0]
df = pd.read_csv(io.BytesIO(uploaded[filename]))
corpus_of_keywords = df['keyword'].values

Let's check how many keywords we have:

In [None]:
len(corpus_of_keywords)

Let's check the length of the keywords:

In [None]:
df["keyword_word_len"] = df["keyword"].apply(lambda x : len(x.split()))
df["keyword_len"] = df["keyword"].apply(lambda x : len(x))

First by word count

In [None]:
sns.displot(df.keyword_word_len, kde=False)

Now by character length

In [None]:
sns.displot(df.keyword_len, kde=False)

You can do some cleaning up now...
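What counts as cleaning up depends on your data. As one possible starting point, the sketch below lowercases keywords, collapses repeated whitespace, and drops empty values and duplicates. `clean_keywords` and the sample list are hypothetical, and the rules are assumptions you may want to change.

```python
def clean_keywords(keywords):
    """Lowercase, collapse whitespace, and drop empties and duplicates (order kept)."""
    seen = set()
    cleaned = []
    for kw in keywords:
        if not isinstance(kw, str):
            continue  # skip missing values such as None or NaN
        normalized = " ".join(kw.lower().split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned

# Hypothetical raw keywords
raw = ["Running Shoes", "running  shoes ", "", None, "trail running shoes"]
cleaned = clean_keywords(raw)
print(cleaned)
```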

# Now let's cluster them

Let's first create the embeddings and normalize them to unit length:

In [None]:
corpus_embeddings = model.encode(corpus_of_keywords)
corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
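To make the normalization step concrete: once vectors have length 1, their dot product is the cosine similarity, and cosine distance is one minus that. The plain-Python sketch below uses made-up 2-D vectors instead of real embeddings, so it runs without the model.

```python
import math

def normalize(vec):
    """Scale a vector so that its Euclidean length is 1."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy vectors standing in for two keyword embeddings
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

cosine_similarity = dot(a, b)            # dot product of unit vectors
cosine_distance = 1.0 - cosine_similarity
print(round(cosine_similarity, 2), round(cosine_distance, 2))
```

The clustering threshold used later in the notebook is compared against this kind of distance, so smaller values mean stricter clusters.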

You can adjust the threshold depending on how tight you want your clusters to be.

A `distance_threshold` of 0.10 (measured as cosine distance) works great if you want the keywords in each cluster to be very similar semantically; you can use 0.30 to get looser clusters.

In [None]:
distance_threshold = 0.10

In [None]:
# scikit-learn 1.2+ names this parameter `metric`; older versions call it `affinity`
clustering_model = AgglomerativeClustering(n_clusters=None, metric='cosine', linkage='average', distance_threshold=distance_threshold)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
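To get a feel for how `distance_threshold` changes the result, here is a small plain-Python sketch of average-linkage merging on cosine distance. It is an illustration only, not scikit-learn's implementation: `average_linkage` and the 2-D toy vectors are made up for this example, and the loop is far too slow for real keyword lists.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def average_linkage(vectors, distance_threshold):
    """Merge the closest pair of clusters until none is closer than the threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def avg_dist(c1, c2):
        # Average of all pairwise distances between the two clusters
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        n = len(clusters)
        dist, i, j = min(
            (avg_dist(clusters[i], clusters[j]), i, j)
            for i in range(n) for j in range(i + 1, n)
        )
        if dist >= distance_threshold:
            break  # clusters at or above the threshold are not merged
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy vectors: the first two point almost the same way, the last is far off
vectors = [(1.0, 0.0), (0.99, 0.1), (0.8, 0.6), (0.0, 1.0)]
tight = average_linkage(vectors, 0.10)
loose = average_linkage(vectors, 0.30)
print(tight, loose)
```

With the stricter threshold only the two near-identical vectors merge; with the looser one the third vector joins them as well.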

Now let's group the keywords by their cluster label:

In [None]:
clustered_sentences = {}
for sentence_id, cluster_id in enumerate(cluster_assignment):
    if cluster_id not in clustered_sentences:
        clustered_sentences[cluster_id] = []
    clustered_sentences[cluster_id].append(corpus_of_keywords[sentence_id])
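The same grouping can also be written with `collections.defaultdict`, which removes the explicit membership check. This is just an alternative sketch; `group_by_label` and the toy keywords and labels are made up for illustration.

```python
from collections import defaultdict

def group_by_label(items, labels):
    """Collect items into lists keyed by their cluster label."""
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        groups[label].append(item)
    return dict(groups)

# Hypothetical keywords and the labels a clustering step might assign
toy_keywords = ["running shoes", "buy laptop", "shoes for running", "pizza near me"]
toy_labels = [0, 1, 0, 2]
toy_grouped = group_by_label(toy_keywords, toy_labels)
print(toy_grouped)
```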

And finally, name each cluster after its first keyword; keywords that ended up alone in a cluster go into `others`:

In [None]:
_cluster_items = clustered_sentences.items()

clusters = {'others': []}
for cluster in _cluster_items:
    if len(cluster[1]) == 1:
        clusters['others'].append(cluster[1][0])
        continue
    clusters[cluster[1][0]] = cluster[1]
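A toy run makes the naming rule easier to see. The sketch below wraps the same logic in a hypothetical `name_clusters` function and applies it to made-up groups; note that a keyword literally called `others` would collide with the bucket name.

```python
def name_clusters(grouped):
    """Name each multi-keyword cluster after its first keyword; pool singletons."""
    named = {"others": []}
    for members in grouped.values():
        if len(members) == 1:
            named["others"].append(members[0])
        else:
            named[members[0]] = members
    return named

# Hypothetical output of the grouping step
toy_groups = {
    0: ["running shoes", "shoes for running"],
    1: ["buy laptop"],
    2: ["pizza near me"],
}
toy_named = name_clusters(toy_groups)
print(toy_named)
```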

Check how many clusters we got (the count includes the `others` bucket):

In [None]:
len(clusters.keys())

In [None]:
keywords_with_cluster = []

for cluster in clusters.keys():
    for kw in clusters[cluster]:
        keywords_with_cluster.append(dict(
            keyword=kw,
            cluster=cluster,
        ))

df_clusters = pd.DataFrame(keywords_with_cluster)

In [None]:
df_clusters

# Download CSV with cluster data

In [None]:
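# index=False leaves the row numbers out of the file; utf-8-sig helps Excel detect the encoding
df_clusters.to_csv('keywords_clustered.csv', encoding='utf-8-sig', index=False)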
files.download('keywords_clustered.csv')

# Final notes

Semantic clustering is a great first step, but just because two keywords are semantically different does not mean that Google treats them as separate intents. **I'll share another notebook for SERP clustering if asked.**