# Cluster a bulk list of keywords with ValueSERP


Quickly cluster a large number of keywords using SERP results.

A summary and conclusions, including next steps and alternate methods, appear at the bottom.


### About Alton

Follow me for more data and tutorials

- twitter: https://twitter.com/alton_lex @alton_lex

- linkedin: https://www.linkedin.com/in/altonalexander/


### About Data Winners

Join the conversation:

- private Discord community

- Video tutorials

- Feedback and support on this and other scripts

Join now: https://datawinners.gumroad.com/l/data-analytics-for-seo

In [638]:
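# install non-standard libraries
# note: the imports below use the standard-library zipfile module,
# so the zipfile36 backport is optional on current Python versions

!pip install zipfile36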


Defaulting to user installation because normal site-packages is not writeable
Collecting zipfile36
  Downloading zipfile36-0.1.3-py3-none-any.whl (20 kB)
Installing collected packages: zipfile36
Successfully installed zipfile36-0.1.3


In [639]:
import json
from io import BytesIO
from zipfile import ZipFile

import pandas as pd
import requests

In [388]:
# input variables

api_key = "yourkeyhere"

common_urls = 2

gl = "us"
hl = "en"
engine = "google"
location = "United States"
domain ="google.com"
device = "mobile"

In [634]:
# load the keywords

filename = "./input-keywords.csv"
df_keywords_input = pd.read_csv(filename)
df_keywords_input.head()

df_keywords_input

Unnamed: 0,Keyword
0,league of american bicyclists
1,rules of the road bicycles
2,bike classes near me
3,bike to work day 2022
4,rules for bicycles on the road
...,...
8042,bike stores near uc davis
8043,5 + -2
8044,10725 international drive rancho cordova ca 95...
8045,9287 n waverly dr. bayside wi


In [379]:
# create a new empty batch
# see example in docs https://www.valueserp.com/docs/batches-api/overview

value_serp_batch_name = "My First Auto Batch"

body = {
  "name": value_serp_batch_name,
  "enabled": True,
  "schedule_type": "manual",
  "priority": "normal"
}

# create the batch on valueserp (searches are attached in a later step)
api_result = requests.post('https://api.valueserp.com/batches?api_key='+api_key, json=body)

api_response = api_result.json()

print("Batch Created: ", json.dumps(api_response))


Batch Created:  {"request_info": {"success": true}, "batch": {"id": "D5A4837D", "created_at": "2023-02-14T14:51:12.718Z", "last_run": null, "name": "My First Auto Batch", "schedule_type": "manual", "enabled": true, "status": "idle", "searches_total_count": 0, "searches_page_count": 0, "credits_required": 0, "next_result_set_id": 1, "results_count": 0, "priority": "normal", "destination_ids": [], "searches_type": "mixed", "searches_type_locked": false}}


In [381]:
# get the resulting batch id

batch_id = api_response['batch']['id']
batch_id

'D5A4837D'

In [394]:
# upload all keywords as 'searches' in the batch
# see tutorial https://www.valueserp.com/docs/batches-api/searches/create


# upload in pages where each page has at most n_max searches
n_max = 1000

list_of_searches = []
list_of_search_to_batch_api_responses = []

for i, row in df_keywords_input.iterrows():

    search_for = {
        "q": row['Keyword'],
        "gl":gl,
        "hl":hl,
        "engine":engine,
        "location": location,
        "google_domain":domain,
        "device":device,
        # "search_type": "web", # defaults to web
        # "custom_id": "MyCustomID_001"
    }
    
    list_of_searches.append(search_for)
    
    if len(list_of_searches) == n_max:
        print("sending batch", len(list_of_search_to_batch_api_responses))
        
        # send the max at a time
        body = {"searches": list_of_searches}

        api_result = requests.put('https://api.valueserp.com/batches/'+batch_id+'?api_key='+api_key, json=body)
        search_to_batch_api_response = api_result.json()
        list_of_search_to_batch_api_responses.append(search_to_batch_api_response)
        
        list_of_searches = []
    

# send any remaining searches to the batch (skipped if the last page was exactly full)
if list_of_searches:
    print("sending final batch", len(list_of_search_to_batch_api_responses))
    body = {"searches": list_of_searches}

    api_result = requests.put(
        'https://api.valueserp.com/batches/'+batch_id+'?api_key='+api_key, json=body)
    search_to_batch_api_response = api_result.json()
    list_of_search_to_batch_api_responses.append(search_to_batch_api_response)


sending batch 0
sending batch 1
sending batch 2
sending batch 3
sending batch 4
sending batch 5
sending batch 6
sending batch 7
sending final batch 8
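
The paging logic above can also be written as a small generator that slices a list into pages of at most `n_max` items. This is a minimal sketch: `chunked` and the sample searches are illustrative names that are not part of the notebook or the ValueSERP API, and no request is sent.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Illustrative data only: 2,500 fake searches split into pages of 1,000.
searches = [{"q": f"keyword {n}"} for n in range(2500)]
pages = list(chunked(searches, 1000))
page_sizes = [len(page) for page in pages]
print(page_sizes)
```

Each page could then be sent with the same `requests.put` call used in the cell above. An empty input yields no pages, so no empty request would be made.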


In [463]:
# check that the searches got attached to the batch correctly

for status in list_of_search_to_batch_api_responses:
    print("page #", status['batch']['searches_page_count'], status['batch']['credits_required'], status['request_info']['success'])

page # 1 1000 True
page # 2 2000 True
page # 3 3000 True
page # 4 4000 True
page # 5 5000 True
page # 6 6000 True
page # 7 7000 True
page # 8 8000 True
page # 9 8047 True
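
Reading nine lines of output by eye works here, but with more pages it may help to list only the uploads that failed. The sketch below assumes each response has the `request_info.success` field shown in the outputs above; `failed_pages` and the sample responses are made up for illustration.

```python
def failed_pages(responses):
    """Return the 1-based positions of responses whose request did not succeed."""
    failed = []
    for position, response in enumerate(responses, start=1):
        # Treat a missing 'request_info' or 'success' field as a failure.
        if not response.get("request_info", {}).get("success", False):
            failed.append(position)
    return failed


# Hypothetical responses shaped like the ones printed above.
sample = [
    {"request_info": {"success": True}, "batch": {"searches_page_count": 1}},
    {"request_info": {"success": False}},
    {},
]
print(failed_pages(sample))
```

In the notebook this could be called as `failed_pages(list_of_search_to_batch_api_responses)`; an empty list would mean every page was accepted.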


In [398]:
# start the batch job
# WARNING this will cost credits every time you run this cell!

# see docs https://www.valueserp.com/docs/batches-api/batches/start

params = {
  'api_key': api_key
}

batch_api_result = requests.get('https://api.valueserp.com/batches/'+batch_id+'/start', params)

batch_api_response = batch_api_result.json()

print("Batch started.")
batch_api_response

Batch started.


{'request_info': {'success': True},
 'batch': {'id': 'D5A4837D',
  'created_at': '2023-02-14T14:51:12.718Z',
  'last_run': None,
  'name': 'My First Auto Batch',
  'schedule_type': 'manual',
  'enabled': True,
  'status': 'queued',
  'api_requests_required': 8047,
  'searches_total_count': 8047,
  'searches_page_count': 9,
  'credits_required': 8047,
  'next_result_set_id': 1,
  'results_count': 0,
  'priority': 'normal',
  'destination_ids': [],
  'searches_type': 'web',
  'searches_per_type_count': {'web': 8047},
  'searches_type_locked': False}}

In [404]:
# check for job completed
# see docs https://www.valueserp.com/docs/batches-api/batches/get

params = {
  'api_key': api_key
}

api_result = requests.get('https://api.valueserp.com/batches/'+batch_id, params)

api_response = api_result.json()

print("Batch Name: ", api_response['batch']['name'])

api_response

Batch Name:  My First Auto Batch


{'request_info': {'success': True},
 'batch': {'id': 'D5A4837D',
  'created_at': '2023-02-14T14:51:12.718Z',
  'last_run': '2023-02-14T15:16:00.123Z',
  'name': 'My First Auto Batch',
  'schedule_type': 'manual',
  'enabled': True,
  'status': 'idle',
  'api_requests_required': 8047,
  'searches_total_count': 8047,
  'searches_page_count': 9,
  'credits_required': 8047,
  'next_result_set_id': 2,
  'results_count': 1,
  'priority': 'normal',
  'destination_ids': [],
  'searches_type': 'web',
  'searches_per_type_count': {'web': 8047},
  'searches_type_locked': False}}
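
Rather than re-running the check by hand, the status request can be wrapped in a polling loop. This is a sketch under one assumption taken from the two responses shown in this notebook: a finished run reports `status` of `'idle'` with `results_count` increased. `wait_for_batch` and its parameters are hypothetical names, and the demonstration uses canned responses instead of real API calls.

```python
import time


def wait_for_batch(fetch_batch, expected_results=1, interval_seconds=30,
                   max_checks=60, sleep=time.sleep):
    """Poll `fetch_batch` until the batch is idle and has produced results.

    `fetch_batch` is any callable returning the parsed JSON of the batch
    endpoint, for example:
        lambda: requests.get('https://api.valueserp.com/batches/' + batch_id, params).json()
    """
    for _ in range(max_checks):
        batch = fetch_batch()["batch"]
        # Assumption: 'idle' plus enough results means the run has finished.
        if batch["status"] == "idle" and batch["results_count"] >= expected_results:
            return batch
        sleep(interval_seconds)
    raise TimeoutError("batch did not finish within the allotted checks")


# Offline demonstration with canned responses.
canned = iter([
    {"batch": {"status": "queued", "results_count": 0}},
    {"batch": {"status": "idle", "results_count": 1}},
])
finished = wait_for_batch(lambda: next(canned), sleep=lambda seconds: None)
print(finished["status"], finished["results_count"])
```

Passing the fetch step in as a callable keeps the loop testable without spending credits or touching the network.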

In [405]:
# get the link for the download as a CSV! (optional json download)

# see docs https://www.valueserp.com/docs/batches-api/results/get

params = {
  'api_key': api_key
}

batch_resultset_api_result = requests.get('https://api.valueserp.com/batches/'+batch_id+'/results/1/csv', params)

batch_resultset_api_response = batch_resultset_api_result.json()
batch_resultset_api_response

{'request_info': {'success': True},
 'batch_id': 'D5A4837D',
 'result': {'id': 1,
  'started_at': '2023-02-14T15:16:00.123Z',
  'ended_at': '2023-02-14T15:19:25.799Z',
  'expires_at': '2023-02-28T15:16:00.123Z',
  'results_page_count': 9,
  'searches_completed': 8047,
  'searches_failed': 0,
  'searches_total': 8047,
  'webhook_status': {'status': 'not_applicable', 'log': []},
  'destination_status': {},
  'download_links': {'pages': ['https://data.valueserp.com/results/14_FEBRUARY_2023/1516/Batch_Results_D5A4837D_1_Page_1_002c0139d00c9dd8e4d1b13c504517942bd20489.csv',
    'https://data.valueserp.com/results/14_FEBRUARY_2023/1516/Batch_Results_D5A4837D_1_Page_2_002c0139d00c9dd8e4d1b13c504517942bd20489.csv',
    'https://data.valueserp.com/results/14_FEBRUARY_2023/1516/Batch_Results_D5A4837D_1_Page_3_002c0139d00c9dd8e4d1b13c504517942bd20489.csv',
    'https://data.valueserp.com/results/14_FEBRUARY_2023/1516/Batch_Results_D5A4837D_1_Page_4_002c0139d00c9dd8e4d1b13c504517942bd20489.csv',
 

In [412]:
# get the Zip of all files

zip_url = batch_resultset_api_response['result']['download_links']['all_pages']
zip_url

'https://data.valueserp.com/results/14_FEBRUARY_2023/1516/Batch_Results_D5A4837D_1_All_Pages_002c0139d00c9dd8e4d1b13c504517942bd20489_csv.zip'

In [413]:
# download the zip file and list the csvs inside

content = requests.get(zip_url)
zf = ZipFile(BytesIO(content.content))

for name in zf.namelist():
    print("File in zip:", name)

File in zip: Batch_Results_D5A4837D_1_Page_1_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
File in zip: Batch_Results_D5A4837D_1_Page_2_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
File in zip: Batch_Results_D5A4837D_1_Page_3_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
File in zip: Batch_Results_D5A4837D_1_Page_4_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
File in zip: Batch_Results_D5A4837D_1_Page_5_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
File in zip: Batch_Results_D5A4837D_1_Page_6_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
File in zip: Batch_Results_D5A4837D_1_Page_7_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
File in zip: Batch_Results_D5A4837D_1_Page_8_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
File in zip: Batch_Results_D5A4837D_1_Page_9_002c0139d00c9dd8e4d1b13c504517942bd20489.csv


In [618]:
# find the csv files in the zip:

list_of_df_batch_pages = []

for csv_filename in [s for s in zf.namelist() if s.endswith(".csv")]:
    print(csv_filename)

    df_batch_page = pd.read_csv(zf.open(csv_filename), low_memory=False)
    list_of_df_batch_pages.append(df_batch_page)
    
# combine all result pages into one data frame
serp_input = pd.concat(list_of_df_batch_pages)

Batch_Results_D5A4837D_1_Page_1_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
Batch_Results_D5A4837D_1_Page_2_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
Batch_Results_D5A4837D_1_Page_3_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
Batch_Results_D5A4837D_1_Page_4_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
Batch_Results_D5A4837D_1_Page_5_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
Batch_Results_D5A4837D_1_Page_6_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
Batch_Results_D5A4837D_1_Page_7_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
Batch_Results_D5A4837D_1_Page_8_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
Batch_Results_D5A4837D_1_Page_9_002c0139d00c9dd8e4d1b13c504517942bd20489.csv
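
To try the unzip-and-combine step without downloading anything, the same pattern can be run on a zip built in memory. The sketch below uses only the standard library (`csv` instead of pandas) so it runs anywhere; the file names and rows are made up, and only the column names are taken from the ValueSERP CSVs used in this notebook.

```python
import csv
import io
from zipfile import ZipFile

# Build a small zip in memory to stand in for the downloaded results file.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("page_1.csv", "search.q,result.organic_results.position\nbike shop,1\n")
    archive.writestr("page_2.csv", "search.q,result.organic_results.position\nbike lane,1\n")
    archive.writestr("readme.txt", "not a csv")

rows = []
with ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    for name in sorted(archive.namelist()):
        if not name.endswith(".csv"):
            continue  # skip anything that is not a results page
        with archive.open(name) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            rows.extend(csv.DictReader(text))

print(len(rows), [row["search.q"] for row in rows])
```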


## Cluster Keywords


Additional details on a similar method are available here:

https://www.searchenginejournal.com/automate-search-intent-clustering/413760/


In [619]:
# make dataframe from the serp lists
serp_input = serp_input[['result.organic_results.link', 'search.q', 'result.organic_results.position']]
serp_input = serp_input.rename(columns={'result.organic_results.link':'url', 'search.q':'keyword', 'result.organic_results.position':'rank'})

serp_input

Unnamed: 0,url,keyword,rank
0,,-2+(-5),
1,https://www.dmv.ca.gov/portal/handbook/califor...,street laws,1.0
2,https://m.driving-tests.org/beginner-drivers/r...,street laws,2.0
3,https://www.findlaw.com/traffic/traffic-ticket...,street laws,3.0
4,https://www.progressive.com/lifelanes/on-the-r...,street laws,4.0
...,...,...,...
331,https://officialbikeweek.com/Bike-Week-Calendar/,bike week 2021,1.0
332,https://officialbikeweek.com/,bike week 2021,2.0
333,https://azbikeweek.com/,bike week 2021,3.0
334,https://motorcycleshippers.com/2020/05/the-lat...,bike week 2021,4.0


In [620]:
# keep at most max_organic_urls organic results per keyword
# - should already be filtered by valueserp
# - useful for other serp apis

max_organic_urls = 10

serp_input = serp_input.dropna()
serps_grpby_keyword = serp_input.groupby("keyword").head(max_organic_urls).reset_index(drop=True)
serps_grpby_keyword

Unnamed: 0,url,keyword,rank
0,https://www.dmv.ca.gov/portal/handbook/califor...,street laws,1.0
1,https://m.driving-tests.org/beginner-drivers/r...,street laws,2.0
2,https://www.findlaw.com/traffic/traffic-ticket...,street laws,3.0
3,https://www.progressive.com/lifelanes/on-the-r...,street laws,4.0
4,https://streetlaw.org/,street laws,5.0
...,...,...,...
68617,https://officialbikeweek.com/Bike-Week-Calendar/,bike week 2021,1.0
68618,https://officialbikeweek.com/,bike week 2021,2.0
68619,https://azbikeweek.com/,bike week 2021,3.0
68620,https://motorcycleshippers.com/2020/05/the-lat...,bike week 2021,4.0


In [621]:
# collect the ranking URLs for each keyword into a list

serp_results_by_keyword = serps_grpby_keyword.groupby('keyword')['url'].apply(list).reset_index(name="organic_urls")
serp_results_by_keyword

Unnamed: 0,keyword,organic_urls
0,"""the league""","[https://www.theleague.com/, https://en.m.wiki..."
1,"""washington examiner""","[https://www.washingtonexaminer.com/, https://..."
2,$lci,"[https://finance.yahoo.com/quote/LCI/, https:/..."
3,%?,[https://stackoverflow.com/questions/53979979/...
4,(5,"[https://en.m.wikipedia.org/wiki/5, https://en..."
...,...,...
7964,نصائح,[https://www.instagram.com/explore/tags/%D9%86...
7965,نصايح,[https://www.instagram.com/explore/tags/%D9%86...
7966,وود,"[https://www.woodplc.com/, https://almajed4oud..."
7967,联盟,[https://zh.wikipedia.org/zh/%E8%81%AF%E7%9B%9...


In [622]:
# group keywords by the URL they rank for; eligible pairs come from URLs shared by 2+ keywords

serp_results_by_url = serps_grpby_keyword.groupby('url')['keyword'].apply(list).reset_index(name="organic_keywords")
serp_results_by_url

Unnamed: 0,url,organic_keywords
0,http://101.pjstar.com/76-bicycle-safety-town/,[bicycle safety town peoria il]
1,http://2018.thevinechristianacademy.com/,[tvca]
2,http://209.204.164.86/calistoga.html,[rainbow ag calistoga]
3,http://35.196.144.73/hampden/agawam/home-appra...,[53 harvey johnson drive agawam ma]
4,http://352arts.org/directory/latina-womens-lea...,[latina women's league]
...,...,...
45942,https://zugobike.com/blogs/articles/how-to-pul...,[bike commute los angeles]
45943,https://zutobi.com/us/driver-guides/rules-of-t...,"[what are the traffic laws, what are traffic l..."
45944,https://zutobi.com/us/driver-guides/the-worst-...,[road rankings by state]
45945,https://zwiftinsider.com/rider-categorization-...,[cycling category chart]


In [623]:
# keep only URLs that rank for more than one keyword
serp_results_by_url = serp_results_by_url[ serp_results_by_url['organic_keywords'].apply(len) > 1 ]
serp_results_by_url

Unnamed: 0,url,organic_keywords
10,http://aba4u.org/,"[american bikers association, bikers for ameri..."
13,http://adamriethlaw.com/,"[adam rieth attorney, adam reith law, adam rieth]"
23,http://amp.soapcentral.com/young-and-restless/...,"[the young and the restless 4/11/22, young and..."
29,http://azmag.gov/LinkClick.aspx?fileticket=YkR...,"[pro walk pro bike, pro bike pro walk]"
39,http://bikecommutetips.blogspot.com/?m=1,"[bike commuting blogs, bike commute blog, bike..."
...,...,...
45886,https://yen.com.gh/facts-lifehacks/biographies...,"[whitaker family west virginia, west virginia ..."
45889,https://yoloboard.com/collections/bikes,"[bike boards, board bike]"
45893,https://youqueen.com/love/in-bed/master-the-wo...,"[how to ride better, tips for riding on top]"
45909,https://za.pinterest.com/RvdM555/bike-shop-ideas/,"[bike shop ideas, bike shop design]"


In [624]:
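# how many pairs will we compare?
# note: n*n - 1 per URL is only a rough upper bound; the exact number of
# unordered pairs among n keywords is n*(n-1)/2
def myfunc( x ):
    return len(x)*len(x)-1

sum(serp_results_by_url['organic_keywords'].apply(myfunc))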

322027
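
For a URL that ranks for n keywords, the number of distinct unordered keyword pairs is n(n-1)/2, which is what the pairing step actually generates per URL. The short check below confirms the formula against `itertools.combinations`; the three keywords are illustrative, and `unordered_pair_count` is a helper name introduced only for this example.

```python
from itertools import combinations


def unordered_pair_count(n):
    """Number of distinct keyword pairs among n keywords sharing a URL."""
    return n * (n - 1) // 2


keywords = ["bike boards", "board bike", "bike shop ideas"]
pairs = list(combinations(keywords, 2))
print(len(pairs), unordered_pair_count(len(keywords)))
```

For the same three keywords the `n*n - 1` estimate gives 8, so the total printed above overstates the number of comparisons.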

In [625]:

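possible_pairs = []
def cross( x ):
    # every unordered pair of keywords that share this URL
    res = [{"keyword_x":a, "keyword_y":b} for idx, a in enumerate(x) for b in x[idx + 1:]]
    
    
    possible_pairs.extend(res)
    return res

# apply is used only for its side effect of filling possible_pairs
serp_results_by_url.apply(lambda x: cross(x.organic_keywords), axis=1)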

serp_results_crossed = pd.DataFrame(possible_pairs)

# drop duplicates
serp_results_crossed = serp_results_crossed.drop_duplicates()

serp_results_crossed = serp_results_crossed.merge(serp_results_by_keyword, left_on="keyword_x", right_on="keyword")
serp_results_crossed['organic_urls_x'] = serp_results_crossed['organic_urls']
serp_results_crossed = serp_results_crossed.drop(['keyword', 'organic_urls'], axis=1)
serp_results_crossed = serp_results_crossed.merge(serp_results_by_keyword, left_on="keyword_y", right_on="keyword")
serp_results_crossed['organic_urls_y'] = serp_results_crossed['organic_urls']
serp_results_crossed = serp_results_crossed.drop(['keyword', 'organic_urls'], axis=1)
serp_results_crossed

Unnamed: 0,keyword_x,keyword_y,organic_urls_x,organic_urls_y
0,american bikers association,bikers for america,"[http://aba4u.org/, https://m.facebook.com/aba...","[https://m.facebook.com/Bikers4America/, https..."
1,america bike,bikers for america,"[https://www.bikekc.com/, https://www.americas...","[https://m.facebook.com/Bikers4America/, https..."
2,bike america,bikers for america,"[https://www.bikekc.com/, https://www.concordt...","[https://m.facebook.com/Bikers4America/, https..."
3,bike american,bikers for america,"[https://www.bikekc.com/, https://www.bikekc.c...","[https://m.facebook.com/Bikers4America/, https..."
4,america bikes,bikers for america,"[https://www.americasbikecompany.com/, https:/...","[https://m.facebook.com/Bikers4America/, https..."
...,...,...,...,...
65056,walk score com,walkscore dc,"[https://www.walkscore.com/, https://www.redfi...",[https://www.walkscore.com/score/washington_d....
65057,walking score portland,walkscore portland,"[https://www.walkscore.com/OR/Portland, https:...","[https://www.walkscore.com/OR/Portland, https:..."
65058,fat tire bike shop tulsa,bike shops tulsa ok,[https://www.phattirebikeshop.com/about/the-st...,"[https://www.tomsbicycles.com/, https://www.t-..."
65059,golden apple bike ride,westchester cycle club,"[https://m.facebook.com/wccgoldenapple/, https...","[https://www.westchestercycleclub.org/, https:..."


In [626]:
# count the URLs that the two keywords' results have in common
def serps_similarity(a, b):
    return len(set(a) & set(b))

# Apply the function
serp_results_crossed['similarity'] = serp_results_crossed.apply(lambda x: serps_similarity(x.organic_urls_x, x.organic_urls_y), axis=1)
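
As a worked example of the similarity score, the snippet below counts shared URLs for two made-up result lists and compares the count with the `common_urls` threshold. The URLs and the helper name `shared_url_count` are illustrative only.

```python
def shared_url_count(urls_a, urls_b):
    """Count URLs that appear in both result lists, ignoring order and duplicates."""
    return len(set(urls_a) & set(urls_b))


# Made-up result lists for two hypothetical keywords.
serp_a = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
serp_b = ["https://example.com/c", "https://example.com/a", "https://example.org/z"]

common_urls = 2  # same threshold as the input variables at the top
overlap = shared_url_count(serp_a, serp_b)
print(overlap, overlap >= common_urls)
```

Because the score is a raw count, ranking position plays no part: a URL shared at positions 1 and 10 counts the same as one shared at positions 1 and 2.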

In [627]:
# keep pairs with at least common_urls shared URLs; strongest matches first, then shorter keyword_x

serp_results_crossed['len_kw_x'] = serp_results_crossed['keyword_x'].apply(len)
serp_results_crossed = serp_results_crossed[ serp_results_crossed['similarity'] >= common_urls ].sort_values(['similarity','len_kw_x'],ascending=[False,True])
serp_results_crossed

Unnamed: 0,keyword_x,keyword_y,organic_urls_x,organic_urls_y,similarity,len_kw_x
63334,نصايح,نصائح,[https://www.instagram.com/explore/tags/%D9%86...,[https://www.instagram.com/explore/tags/%D9%86...,10,5
62273,e leage,e legue,"[https://www.eleague.com/, https://www.eleague...","[https://www.eleague.com/, https://www.eleague...",10,7
26807,bucycles,bicyles,"[https://www.trekbikes.com/us/en_US/, https://...","[https://www.trekbikes.com/us/en_US/, https://...",10,8
62112,la ligue,la leaque,"[https://www.ligue1.com/, https://www.espn.com...","[https://www.ligue1.com/, https://www.espn.com...",10,8
62268,e league,e leauge,"[https://www.eleague.com/, https://www.eleague...","[https://www.eleague.com/, https://www.eleague...",10,8
...,...,...,...,...,...,...
40029,what are the rights of a person riding a bicyc...,bicycle laws by state,"[https://bikeleague.org/StateBikeLaws, https:/...","[https://bikeleague.org/StateBikeLaws, https:/...",2,65
40076,what are the rights of a person riding a bicyc...,rules of the road for bicycles,"[https://bikeleague.org/StateBikeLaws, https:/...","[https://bikeleague.org/StateBikeLaws, https:/...",2,65
61250,private exclusive clubs are nonprofit 501(c)(7...,how to start a social club for profit,[https://quizlet.com/11669815/sports-managemen...,[https://smallbusiness.chron.com/start-small-b...,2,74
61423,private exclusive clubs are nonprofit 501(c)(7...,can a 501c7 donate to a 501c3,[https://quizlet.com/11669815/sports-managemen...,[https://www.raise-funds.com/can-one-non-profi...,2,74


In [628]:
simi_lim = common_urls

queries_in_df = list(set(serp_results_crossed.keyword_x.to_list()))
topic_groups_numbered = {}
topics_added = []

def find_topics(keyword_x, keyword_y, similarity):
    # greedy assignment: pairs are processed in the sorted order set above
    if (keyword_y not in topics_added) and (keyword_x not in topics_added):
        # both are new, so start a cluster named after keyword_x
        topics_added.append(keyword_y)
        topics_added.append(keyword_x)
        topic_groups_numbered[keyword_x] = [keyword_y, keyword_x]

    elif (keyword_y not in topics_added) and (keyword_x in topics_added):
        # add y to the cluster that already holds x
        j = [key for key, value in topic_groups_numbered.items() if keyword_x in value]
        topics_added.append(keyword_y)
        topic_groups_numbered[j[0]].append(keyword_y)

    elif (keyword_y in topics_added) and (keyword_x not in topics_added):
        # add x to the cluster that already holds y
        j = [key for key, value in topic_groups_numbered.items() if keyword_y in value]
        topics_added.append(keyword_x)
        topic_groups_numbered[j[0]].append(keyword_x)

    # if both keywords are already assigned, nothing changes: existing clusters are never merged

def apply_impl_ft(df):
    return df.apply( lambda row: find_topics(row.keyword_x, row.keyword_y, row.similarity), axis=1)

apply_impl_ft(serp_results_crossed)

topic_groups_numbered = {k:list(set(v)) for k, v in topic_groups_numbered.items()}


In [629]:
len(topic_groups_numbered)

1319
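
The assignment rule in `find_topics` is easier to follow on a toy input. The sketch below restates the same three cases with a dictionary lookup in place of the list scans; `greedy_clusters` and the single-letter keywords are illustrative and not part of the notebook.

```python
def greedy_clusters(pairs):
    """Assign keywords to clusters in the order the pairs arrive.

    A pair whose keywords are both already assigned is skipped,
    so two existing clusters are never merged.
    """
    cluster_of = {}   # keyword -> cluster name
    clusters = {}     # cluster name -> list of member keywords
    for keyword_x, keyword_y in pairs:
        x_seen = keyword_x in cluster_of
        y_seen = keyword_y in cluster_of
        if not x_seen and not y_seen:
            # both new: start a cluster named after keyword_x
            clusters[keyword_x] = [keyword_y, keyword_x]
            cluster_of[keyword_x] = cluster_of[keyword_y] = keyword_x
        elif x_seen and not y_seen:
            name = cluster_of[keyword_x]
            clusters[name].append(keyword_y)
            cluster_of[keyword_y] = name
        elif y_seen and not x_seen:
            name = cluster_of[keyword_y]
            clusters[name].append(keyword_x)
            cluster_of[keyword_x] = name
    return clusters


toy_pairs = [("a", "b"), ("c", "d"), ("b", "e"), ("a", "c")]
result = {name: sorted(members) for name, members in greedy_clusters(toy_pairs).items()}
print(result)
```

The last pair, `("a", "c")`, links two existing clusters but is ignored, which is why the order of the pairs matters to the final grouping.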

In [630]:
topic_groups_lst = []

for k, l in topic_groups_numbered.items():
    cluster_size = len(l)
    for v in l:
        topic_groups_lst.append([k, v, cluster_size])

topic_groups_dictdf = pd.DataFrame(topic_groups_lst, columns=['cluster', 'keyword','cluster_size'])

# sort by size
topic_groups_dictdf = topic_groups_dictdf.sort_values("cluster_size",ascending=False)
topic_groups_dictdf

Unnamed: 0,cluster,keyword,cluster_size
252,bikes for college campuses,best campus bikes,21
245,bikes for college campuses,best bicycles for college students,21
254,bikes for college campuses,best bike for campus,21
253,bikes for college campuses,best type of bike for college,21
251,bikes for college campuses,bicycle for college,21
...,...,...,...
2837,honorable mentions,honorable mention,2
2836,honorable mentions,honorable mentions,2
2826,nj bike helmet law,bike helmet law nj,2
2825,nj bike helmet law,nj bike helmet law,2


In [631]:
# add back every keyword that did not land in a cluster, as its own cluster of size 1
queries_de_duped = list(set(df_keywords_input['Keyword']).difference(set(topics_added)))
no_cluster = pd.DataFrame({'keyword': queries_de_duped, 'cluster': queries_de_duped})
no_cluster['cluster_size'] = 1

no_cluster

Unnamed: 0,keyword,cluster,cluster_size
0,raven riley ohio state,raven riley ohio state,1
1,278 tripp hollow rd brooklyn,278 tripp hollow rd brooklyn,1
2,go two miles,go two miles,1
3,pag tucson,pag tucson,1
4,tax id number for youth sports team,tax id number for youth sports team,1
...,...,...,...
3598,freedom to ride bike shops,freedom to ride bike shops,1
3599,transportation advocacy group,transportation advocacy group,1
3600,san jose bicycle map,san jose bicycle map,1
3601,illinois biking,illinois biking,1


In [632]:
# add these on to the master list
df_cluster = pd.concat([topic_groups_dictdf, no_cluster])
df_cluster

Unnamed: 0,cluster,keyword,cluster_size
252,bikes for college campuses,best campus bikes,21
245,bikes for college campuses,best bicycles for college students,21
254,bikes for college campuses,best bike for campus,21
253,bikes for college campuses,best type of bike for college,21
251,bikes for college campuses,bicycle for college,21
...,...,...,...
3598,freedom to ride bike shops,freedom to ride bike shops,1
3599,transportation advocacy group,transportation advocacy group,1
3600,san jose bicycle map,san jose bicycle map,1
3601,illinois biking,illinois biking,1


In [636]:
# save the clusters and queries, including the unclustered keywords in the master list

df_cluster.to_csv("./output-clustered-queries.csv", index=False)

# Conclusions

In this sample we input over 8,000 keywords and clustered them into 1,319 clusters.

3,602 keywords were not clustered; each sits in its own cluster of one keyphrase.

### Next steps:

- SERP clustering is one of many methodologies for clustering keywords

- SERP clustering can be adjusted beyond just URLs that match to include intent, related searches, etc.

- More sophisticated clustering methods include NLP

- Less sophisticated clustering methods include n-grams
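
As a rough illustration of the n-gram alternative mentioned above, the sketch below groups keywords that share a two-word sequence. It is not the method used in this notebook; `word_ngrams`, `group_by_shared_ngram`, and the choice of n=2 are assumptions made for the example, and the sample keywords are illustrative.

```python
from collections import defaultdict


def word_ngrams(text, n=2):
    """Return the set of n-word sequences in `text` (lowercased, whitespace split)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def group_by_shared_ngram(keywords, n=2):
    """Map each n-gram to the keywords containing it, keeping n-grams shared by 2+ keywords."""
    groups = defaultdict(list)
    for keyword in keywords:
        for gram in sorted(word_ngrams(keyword, n)):
            groups[gram].append(keyword)
    return {gram: members for gram, members in groups.items() if len(members) > 1}


sample = ["nj bike helmet law", "bike helmet law nj", "bike shop ideas"]
groups = group_by_shared_ngram(sample)
print(groups)
```

Unlike SERP overlap, this approach needs no API credits, but it only sees shared wording and cannot tell whether two phrasings return the same results.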




# Questions & support

Feel free to reach out to me on https://twitter.com/alton_lex

