Copyright © 2024 Praneeth Vadlapati

## Setup:

Example of `.env`:
```bash
# Groq API to use LLMs - https://console.groq.com/keys
# Groq is preferred for fast responses
LM_PROVIDER_BASE_URL=https://api.groq.com/openai/v1
LM_API_KEY=
LM_MODEL=llama-3.1-70b-versatile

# For Google Trends API - https://serpapi.com/manage-api-key
SERP_API_KEY=
```

Installing packages:
```bash
pip install openai python-dotenv google-search-results markdown2 pandas
```
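
Before running the notebook, it can help to confirm that the settings from `.env` are actually populated. The following is a minimal sketch, not part of the original notebook: the helper name `missing_settings` is hypothetical, and it assumes the variables have already been loaded into the environment (for example by `python-dotenv`).

```python
import os

# Names taken from the .env example above
REQUIRED_SETTINGS = ('LM_PROVIDER_BASE_URL', 'LM_API_KEY', 'LM_MODEL', 'SERP_API_KEY')


def missing_settings(env=None):
    """Return the required setting names that are absent or blank.

    `env` defaults to the process environment; passing a dict makes
    the check easy to try out without touching real variables.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name, '').strip()]
```

Calling `missing_settings()` with no arguments checks the current environment and returns an empty list when every value is set.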

## 1. Loading a Language Model, adding the page content

In [1]:
import os
import json

import markdown2
import pandas as pd
from serpapi import SerpApiClient

from common_functions import get_lm_response, extract_data, \
    print_progress, print_error, model, data_folder, display_md, load_env

trends_engine = 'google_trends'

# User inputs
topic = 'Language Model Learning a Dataset for Data-Augmented Prediction'
topic_for_url = topic.replace(' ', '-')

content_file = os.path.join(data_folder, f'{topic_for_url}.md')
metadata_file = os.path.join(data_folder, f'{topic_for_url}.json')

print(f'Content file: {content_file}')
print(f'Model: {model}')

# try to read from file
page_content = None
if os.path.exists(content_file):
	with open(content_file, 'r', encoding='utf-8') as f:
		page_content = f.read().strip() or None

if not page_content:
	print('-'*7)
	print('No content found for this topic. Please add content in markdown to the file.')
	print(f'File: {content_file}')
	raise SystemExit

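# the first line is the title; the rest is the content
# (partition avoids an IndexError when the file has only one line)
first_line, _, page_content = page_content.partition('\n')
page_title = first_line.strip().strip('#').strip().strip('*').strip()
page_content = page_content.strip()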

display_md(f'Page title: **{page_title}**')
display_md(page_content[:500])

Content file: data/Language-Model-Learning-a-Dataset-for-Data-Augmented-Prediction.md
Model: llama-3.1-70b-versatile


Page title: **Introducing LML-DAP: A New Way to Combine Language Models with Data for Better Predictions**

Machine learning (ML) models are powerful, but they often lack transparency. They make predictions, but the reasoning behind them can be a mystery. ML models need a lot of pre-processing before they can work with data. But what if there was a faster, more explainable method for predictions?

**LML-DAP** (*Language Model Learning a Dataset for Data-Augmented Prediction*), offers an exciting alternative. Instead of relying on standard ML techniques, this method uses **Large Language Models (LLMs)*

## 2. Generating keywords

Generate keywords related to the page and content

In [2]:
keyword_gen_prompt_template = '''
Page content:
{page_content}
---
Page title: {page_title}

Generate 10 SEO keywords for the above page based on the page content.

All keywords must be in the same language as the page title and content.
Respond with a list of keywords separated by commas, inside tags like this:
Sample response: 
<keywords>
	keyword1, keyword2, keyword3
</keywords>
'''.strip()

keyword_gen_prompt = keyword_gen_prompt_template.format(
    page_title=page_title, page_content=page_content)
keyword_gen_response = get_lm_response(keyword_gen_prompt)
keywords = extract_data(keyword_gen_response, tag='keywords')
# load keywords to array
keywords = keywords.split(',')
keywords = [keyword.strip() for keyword in keywords]

print(f'Generated {len(keywords)} keywords:')
keywords[:5]

Generated 10 keywords:


['machine learning',
 'language models',
 'explainable predictions',
 'data-augmented prediction',
 'transparent AI']
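
The helper `extract_data` comes from `common_functions` and is not shown in this notebook. As an illustration only, tag extraction and comma splitting of this kind could be sketched as follows; the function names `extract_tagged` and `parse_keywords` are hypothetical and are not the author's implementation.

```python
import re


def extract_tagged(text, tag):
    """Return the text between <tag> and </tag>, or None if the tags are missing."""
    pattern = rf'<{re.escape(tag)}>(.*?)</{re.escape(tag)}>'
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_keywords(response, tag='keywords'):
    """Split the tagged, comma-separated section into clean keywords."""
    inner = extract_tagged(response, tag)
    if not inner:
        return []
    # drop empty entries caused by trailing commas or blank lines
    return [part.strip() for part in inner.split(',') if part.strip()]
```

Returning an empty list when the tags are absent lets the caller decide whether to retry the prompt instead of failing on `None.split(',')`.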

Generate keywords for related topics that the page does not cover directly

In [3]:
similar_keys_prompt_template = '''
Page content:
{page_content}
---
Page title: {page_title}
Find 5 related topics to the above page.

Respond with a list of SEO keywords separated by commas, inside tags like this:
Sample response: 
<keywords>
	keyword1, keyword2, keyword3
</keywords>

All keywords must be in the same language as the page title and content.
Keywords should not be directly related to the content, but
should be related to the topic.
'''.strip()

similar_keys_prompt = similar_keys_prompt_template.format(
    page_title=page_title, page_content=page_content)
similar_keys_response = get_lm_response(similar_keys_prompt)
similar_keys_keywords = extract_data(similar_keys_response, tag='keywords')
similar_keys_keywords = similar_keys_keywords.split(',')
similar_keys_keywords = [keyword.strip() for keyword in similar_keys_keywords]
similar_keys_keywords = similar_keys_keywords[:5]  # take only first few
print(f'Generated {len(similar_keys_keywords)} similar keywords:')
similar_keys_keywords

Generated 5 similar keywords:


['Explainable AI',
 'Natural Language Processing Applications',
 'Transparent Machine Learning',
 'Interpretable Predictive Models',
 'Data-Driven Decision Making']

In [4]:
keywords.extend(similar_keys_keywords)  # add similar keywords
if page_title not in keywords:
	keywords.insert(0, page_title)
keywords = [keyword.lower() for keyword in keywords]

# remove duplicates without disturbing the order
# (a list must not be modified while it is being iterated)
keywords = list(dict.fromkeys(keywords))
# remove long keywords
keywords = [keyword for keyword in keywords if len(keyword) <= 60]

print(f'Final {len(keywords)} keywords:')
keywords[:5]

Final 15 keywords:


['machine learning',
 'language models',
 'explainable predictions',
 'data-augmented prediction',
 'transparent ai']
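
The cleanup step (lowercase, de-duplicate while keeping the first occurrence, drop keywords longer than 60 characters) can be expressed as one small function. This is a sketch under the assumption that building a new list is acceptable; the name `clean_keywords` is hypothetical.

```python
def clean_keywords(keywords, max_len=60):
    """Lowercase, de-duplicate in first-seen order, and drop overly long keywords."""
    lowered = (keyword.strip().lower() for keyword in keywords)
    # dict keys keep insertion order, so the first occurrence of each keyword wins
    unique = dict.fromkeys(lowered)
    return [keyword for keyword in unique if keyword and len(keyword) <= max_len]
```

Building a new list sidesteps the skipped-element problem that arises when items are removed from a list during a `for` loop over that same list.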

## 3. Fetching trends

In [6]:
def fetch_trends_data(q, history_start='7-d', data_type=None):
	# https://serpapi.com/google-trends-api
	now = 'now' if history_start[-1] == 'd' else 'today'
	query = { 'q': [q], 'engine': trends_engine,
			 'date': f'{now} {history_start}'}
	if data_type:
		query['data_type'] = data_type
	search = SerpApiClient(query)
	search.SERP_API_KEY = os.getenv('SERP_API_KEY')
	search = search.get_dict()
	if 'error' in search and 'out of searches' in search['error']:
		print_error()
		raise Exception(search['error'])
	return search

get_value = lambda x: x['values'][0]['extracted_value']

data_list = []
for keyword in keywords:
	trend_data = fetch_trends_data(keyword)
	if 'error' in trend_data:
		print_error()
		if 'out of searches' in trend_data['error']:
			break
		if 'hasn\'t returned any results' in trend_data['error']:
			continue
		if not len(data_list):  # first case got failed, so stop now
			print(trend_data['error'])
			break
		continue
	data = trend_data['interest_over_time']['timeline_data']
	oldest_value = get_value(data[0]) or 1
	latest_value = get_value(data[-1]) or get_value(data[-2])
	growth = round(((latest_value - oldest_value) / oldest_value) * 100)
	data_list.append([keyword, oldest_value, latest_value, growth])
	print_progress()

df_7d = pd.DataFrame(data_list, columns=['Keyword', 'Oldest_Value',
										'Latest_Value', 'Growth'])
df_7d.sort_values(by=['Growth', 'Latest_Value', 'Oldest_Value'],
					ascending=False, inplace=True)
df_7d = df_7d[df_7d['Latest_Value'] > 0]  # remove if latest value is 0
df_7d.reset_index(drop=True, inplace=True)
df_7d

..!!..!.....!!.

| | Keyword | Oldest_Value | Latest_Value | Growth |
|---|---|---|---|---|
| 0 | explainable ai | 1 | 89 | 8800 |
| 1 | transparent ai | 46 | 84 | 83 |
| 2 | machine learning | 53 | 71 | 34 |
| 3 | large language models | 37 | 70 | 89 |
| 4 | language models | 43 | 67 | 56 |
| 5 | dataset analysis | 42 | 51 | 21 |
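
The growth figure is the percentage change between the oldest and latest points of the timeline, with an oldest value of 0 replaced by 1 to avoid division by zero. Here is a small worked sketch of that arithmetic; the `sample_timeline` payload is made up for illustration and only mirrors the `values[0]['extracted_value']` shape that the code above reads.

```python
def get_value(point):
    """Read the numeric interest value from one timeline entry."""
    return point['values'][0]['extracted_value']


def growth_percent(timeline):
    """Return (oldest, latest, growth %) using the same fallbacks as the loop above."""
    oldest = get_value(timeline[0]) or 1  # 0 would make the division fail
    latest = get_value(timeline[-1]) or get_value(timeline[-2])
    return oldest, latest, round((latest - oldest) / oldest * 100)


# Hypothetical data, not a real API response
sample_timeline = [
    {'values': [{'extracted_value': 0}]},
    {'values': [{'extracted_value': 40}]},
    {'values': [{'extracted_value': 89}]},
]
print(growth_percent(sample_timeline))  # (1, 89, 8800)
```

The fallback explains very large growth values: a keyword that starts at 0 is treated as starting at 1, so a latest value of 89 yields 8800%.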


## 4. Generating long-tail keywords and their trend data

In [7]:
# take the top trending keywords and generate long-tail keywords from them
top_n = 10
top_short_keywords = df_7d['Keyword'].head(top_n).tolist()  # a list keeps the ranking order

# Generate long-tail keywords from top short-tail keywords
long_kw_prompt_template = '''
Top trending short-tail SEO keywords:
`{top_short_keywords}`
---
Topic: `{page_title}`

Generate 5 long-tail SEO keywords for each of the top trending short-tail keywords.
Generate those that might have more demand.
Respond with a list of long-tail SEO keywords separated by commas, inside tags like this:
Sample response:
<keywords>
	keyword1, keyword2, keyword3
</keywords>
'''.strip()

long_kw_prompt = long_kw_prompt_template.format(page_title=page_title,
					top_short_keywords=', '.join(top_short_keywords))
long_kw_response = get_lm_response(long_kw_prompt)
long_keywords = extract_data(long_kw_response, tag='keywords')
long_keywords = long_keywords.split(',')
long_keywords = [keyword.strip() for keyword in long_keywords]
long_keywords = [keyword.lower() for keyword in long_keywords]
long_keywords = long_keywords[:5]  # take only first few
print(f'Generated {len(long_keywords)} long-tail keywords:')
long_keywords

Generated 5 long-tail keywords:


['language models for data prediction',
 'language models for machine learning applications',
 'training language models on small datasets',
 'language models for data augmentation techniques',
 'using language models for predictive analytics']

In [9]:
data_list = []
for keyword in long_keywords:
	trend_data = fetch_trends_data(keyword)
	if 'error' in trend_data:
		if 'hasn\'t returned any results' in trend_data['error']:
			print_progress(',')
			continue
		print_error()
		if not len(data_list):
			break
		continue
	data = trend_data['interest_over_time']['timeline_data']
	oldest_value = get_value(data[0]) or 1
	latest_value = get_value(data[-1]) or get_value(data[-2]) or get_value(data[-3])
	growth = round(((latest_value - oldest_value) / oldest_value) * 100)
	data_list.append([keyword, oldest_value, latest_value, growth])
	print_progress()

df_long_7d = pd.DataFrame(data_list, columns=['Keyword', 'Oldest_Value',
										'Latest_Value', 'Growth'])
df_long_7d.sort_values(by=['Growth', 'Latest_Value', 'Oldest_Value'],
					ascending=False, inplace=True)
# df_long_7d = df_long_7d[df_long_7d['Latest_Value'] > 0]
df_long_7d.reset_index(drop=True, inplace=True)
df_long_7d

.,...

| | Keyword | Oldest_Value | Latest_Value | Growth |
|---|---|---|---|---|
| 0 | language models for data prediction | 1 | 0 | -100 |
| 1 | training language models on small datasets | 1 | 0 | -100 |
| 2 | language models for data augmentation techniques | 1 | 0 | -100 |
| 3 | using language models for predictive analytics | 1 | 0 | -100 |


In [10]:
print('After removing keywords with 0 latest value:')
df_long_7d = df_long_7d[df_long_7d['Latest_Value'] > 0]
df_long_7d.reset_index(drop=True, inplace=True)
df_long_7d

After removing keywords with 0 latest value:


| | Keyword | Oldest_Value | Latest_Value | Growth |
|---|---|---|---|---|


## 5. Generating a description for metadata

In [11]:
top_n_final = 5
all_keywords = set()
for df in [df_7d, df_long_7d]:  # df_1m, df_related, 
	if df is not None:
		all_keywords.update(df['Keyword'].head(top_n_final).tolist())

all_keywords = list(all_keywords)
all_keywords

['language models',
 'transparent ai',
 'large language models',
 'explainable ai',
 'machine learning']

In [12]:
combiner_prompt_template = '''
Act as an SEO expert.
Combine multiple keywords without altering their content.
For example, "A" + "A B" + "B" + "B C D" -> "A B C D".
For example, input: ['uses', 'uses of car',
	'cars', 'cars in the world', 'drive a car']
output: <keywords>uses of cars in the world drive a car</keywords>

Use the following SEO keywords as input:
<keywords>{top_keywords}</keywords>

**Important Rules**:
1. Combine only the keywords provided. **Do not add new words**.
2. **All input keywords** must be present in the output exactly as provided. Do not change, modify, or substitute them.
3. Handle duplicates intelligently by removing redundancy (e.g., "cars" + "cars in the world" -> "cars in the world").
4. The order of keywords can be flexible but should maintain logical flow and readability.
5. The final output should not exceed the total number of characters of the original keywords combined.
All the input keywords that must exist in the output keyword:
<keywords>{top_keywords}</keywords>

Respond with a single combined keyword inside tags like this:
Sample response:
<keywords>(keyword here)</keywords>
'''.strip()

combiner_prompt = combiner_prompt_template.format(
			top_keywords=' \n '.join(all_keywords))
combined_keyword = None
for _ in range(10):
	try:
		combiner_response = get_lm_response(combiner_prompt)
		candidate = extract_data(combiner_response, tag='keywords').strip()
		for keyword in all_keywords:
			if keyword not in candidate:
				raise Exception(f'Error: {keyword} not found in combined keyword')
		# compare against the inputs joined with spaces, as in the output
		if len(candidate) > len(' '.join(all_keywords)):
			print('Words not reduced')
			raise Exception('Combined keyword is too long. Please try again.')
		# check word by word (not character by character) for added words
		allowed_words = set(' '.join(all_keywords).split())
		for word in candidate.split():
			if word not in allowed_words:
				print('Extra words got added')
				raise Exception('Extra words got added. Please try again.')
		combined_keyword = candidate  # accept only a validated result
		break
	except Exception as e:
		print_error()
		if 'out of searches' in str(e):
			break
		continue
if combined_keyword is None:
	raise SystemExit('Could not generate a valid combined keyword.')
combined_keyword

!

'large language models transparent ai explainable ai machine learning'
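
The three rules the retry loop enforces (every input keyword present verbatim, no longer than the inputs, no new words) can be checked in isolation. Below is a minimal sketch with a hypothetical helper name `is_valid_combination`; it assumes the length limit is measured against the input keywords joined by single spaces.

```python
def is_valid_combination(combined, keywords):
    """Check a combined keyword against the rules given in the prompt."""
    # 1. every input keyword must appear verbatim
    if any(keyword not in combined for keyword in keywords):
        return False
    # 2. the result must not be longer than the inputs joined by spaces
    if len(combined) > len(' '.join(keywords)):
        return False
    # 3. no word may appear that is absent from the inputs
    allowed_words = set(' '.join(keywords).split())
    return all(word in allowed_words for word in combined.split())


keywords_in = ['language models', 'transparent ai', 'large language models',
               'explainable ai', 'machine learning']
combined_out = 'large language models transparent ai explainable ai machine learning'
print(is_valid_combination(combined_out, keywords_in))  # True
```

Splitting on whitespace matters here: iterating over a string directly yields characters, which would let almost any extra word pass the third check.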

## 6. Generating tags

In [13]:
tags_prompt_template = '''
Page content:
{page_content}
---
Page title: {page_title}

Generate in regular case. Tags need not be in lower case.
Act as an SEO expert. Create a list of 10 most popular tags for the blog post.
Respond with a list of tags separated by commas, inside tags like this:
Sample response:
<tags>
	tag1, tag2, tag3
</tags>
'''.strip()
tags_prompt = tags_prompt_template.format(
	page_content=page_content, page_title=page_title)
tags_response = get_lm_response(tags_prompt)
tags = extract_data(tags_response, tag='tags')
tags = tags.split(',')
tags = [tag.strip() for tag in tags]
tags = tags[:10]  # take only first few
print(f'Generated {len(tags)} tags:')
for tag in tags:
	print(f'“{tag}”, ', end='')

Generated 10 tags:
“Transparent AI”, “Explainable AI”, “Machine Learning”, “Large Language Models”, “Natural Language Processing”, “AI for Healthcare”, “Data Augmented Prediction”, “Interpretable AI”, “Artificial Intelligence”, “Language Model Learning”, 

In [14]:
# find the most trending tags
tags_data = []
for tag in tags:
	trend_data = fetch_trends_data(tag)
	if 'error' in trend_data:
		if 'hasn\'t returned any results' in trend_data['error']:
			print_progress(',')
			continue
		print_error()
		if not len(tags_data):
			break
		continue
	data = trend_data['interest_over_time']['timeline_data']
	oldest_value = get_value(data[0]) or 1
	latest_value = get_value(data[-1]) or get_value(data[-2]) or get_value(data[-3])
	growth = round(((latest_value - oldest_value) / oldest_value) * 100)
	tags_data.append([tag, oldest_value, latest_value, growth])
	print_progress()

df_tags = pd.DataFrame(tags_data, columns=['Tag', 'Oldest_Value',
										'Latest_Value', 'Growth'])
df_tags.sort_values(by=['Growth', 'Latest_Value', 'Oldest_Value'],
					ascending=False, inplace=True)
df_tags = df_tags[df_tags['Latest_Value'] > 0]
df_tags.reset_index(drop=True, inplace=True)
df_tags

....,,....

| | Tag | Oldest_Value | Latest_Value | Growth |
|---|---|---|---|---|
| 0 | AI Model Explainability | 1 | 100 | 9900 |
| 1 | Machine Learning | 51 | 72 | 41 |
| 2 | Language Model Learning | 46 | 65 | 41 |
| 3 | Large Language Models | 44 | 58 | 32 |
| 4 | Explainable AI | 40 | 50 | 25 |
| 5 | AI Predictions | 42 | 48 | 14 |
| 6 | Transparent AI | 57 | 64 | 12 |


In [15]:
used_tags = df_tags.head(5)  # take only top 5 tags
trending_tags = used_tags['Tag'].tolist()
print(f'There are {len(trending_tags)} trending tags:')
for tag in trending_tags:
	print(f'“{tag}”, ', end='')

used_tags

There are 5 trending tags:
“AI Model Explainability”, “Machine Learning”, “Language Model Learning”, “Large Language Models”, “Explainable AI”, 

| | Tag | Oldest_Value | Latest_Value | Growth |
|---|---|---|---|---|
| 0 | AI Model Explainability | 1 | 100 | 9900 |
| 1 | Machine Learning | 51 | 72 | 41 |
| 2 | Language Model Learning | 46 | 65 | 41 |
| 3 | Large Language Models | 44 | 58 | 32 |
| 4 | Explainable AI | 40 | 50 | 25 |
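
The ranking used for tags sorts by growth first, then by latest value, then by oldest value, all descending, and keeps the top rows. As a rough pure-Python equivalent of the `sort_values(...).head(5)` steps (the helper name `top_rows` is hypothetical; rows are `(tag, oldest, latest, growth)` tuples copied from the table above):

```python
def top_rows(rows, n=5):
    """Rank (tag, oldest, latest, growth) rows and keep the first n with latest > 0."""
    ranked = sorted(rows, key=lambda row: (row[3], row[2], row[1]), reverse=True)
    return [row for row in ranked if row[2] > 0][:n]


tag_rows = [
    ('Transparent AI', 57, 64, 12),
    ('Explainable AI', 40, 50, 25),
    ('Language Model Learning', 46, 65, 41),
    ('AI Model Explainability', 1, 100, 9900),
    ('Machine Learning', 51, 72, 41),
    ('AI Predictions', 42, 48, 14),
    ('Large Language Models', 44, 58, 32),
]
print([row[0] for row in top_rows(tag_rows)])
```

The tuple key resolves ties: both "Machine Learning" and "Language Model Learning" grew by 41, so the higher latest value decides their order.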


In [16]:
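# append trending tags that are not already in combined_keyword
# (the check is case-sensitive and no length cap is applied here;
# the description is truncated to 155 characters later)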
for tag in trending_tags:
	if tag not in combined_keyword:
		combined_keyword += f' {tag}'
combined_keyword

'large language models transparent ai explainable ai machine learning AI Model Explainability Machine Learning Language Model Learning Large Language Models Explainable AI'
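
Because the membership check above is case-sensitive, tags such as "Machine Learning" are appended even though "machine learning" is already present. A possible variant, shown only as a sketch, compares case-insensitively and stops at a length cap; the function name `append_tags` and the 160-character default are assumptions for illustration, not the notebook's behavior.

```python
def append_tags(combined, tags, max_len=160):
    """Append tags that are not yet present (ignoring case) until max_len is reached."""
    result = combined
    for tag in tags:
        if tag.lower() in result.lower():
            continue  # already covered by an existing phrase
        candidate = f'{result} {tag}'
        if len(candidate) > max_len:
            break  # adding this tag would exceed the cap
        result = candidate
    return result


base = 'large language models transparent ai explainable ai machine learning'
tags_in = ['AI Model Explainability', 'Machine Learning', 'Language Model Learning',
           'Large Language Models', 'Explainable AI']
print(append_tags(base, tags_in))
```

With the inputs from this notebook, only "AI Model Explainability" and "Language Model Learning" are appended, since the other three tags already occur in lowercase.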

## 7. Generating a title for SEO

In [16]:
seo_title_prompt_template = '''
Page content:
{page_content}
---
SEO keywords:
`{keywords}`
Page title: {page_title}

Generate an SEO title for the blog post based on the keywords.
Respond with the title inside tags like this:
Sample response:
<title>(title here)</title>
'''.strip()
seo_title_prompt = seo_title_prompt_template.format(
	page_content=page_content, keywords=', '.join(all_keywords),
	page_title=page_title)
seo_title_response = get_lm_response(seo_title_prompt)
seo_title = extract_data(seo_title_response, tag='title')
seo_title = seo_title.strip()
print(f'SEO title: {seo_title}')

SEO title: Introducing LML-DAP: A New Way to Combine Language Models with Data for Better, Explainable Predictions


## 8. Using the keywords in the meta tags of the HTML page

In [17]:
html_meta_tags = '''
<!DOCTYPE html>
<html>
<head>
	<title>{seo_title}</title>
	<meta charset="UTF-8">
	<meta property="og:title" content="{seo_title}">
	<meta name="description" content="{seo_description}">
	<meta property="og:description" content="{seo_description}">
	<meta name="keywords" content="{keywords}">
	<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
<!-- hidden tags are not supported on Medium.com -->
<!-- <h1 style="display: none;">{combined_keyword}</h1> -->

<h1>{page_title}</h1>

{page_content}

Keywords: {combined_keyword}
</body>
</html>
'''.lstrip()
html_content = html_meta_tags.format(
	seo_title=seo_title, page_title=page_title, keywords=', '.join(all_keywords),
	seo_description=f'Content on {combined_keyword}'[:155],
	combined_keyword=combined_keyword,
	page_content=markdown2.markdown(page_content),  # md to html
)

html_file = os.path.join(data_folder, f'{topic_for_url}.html')
with open(html_file, 'w', encoding='utf-8') as f:
	f.write(html_content)
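
The template places generated text directly into attribute values such as `content="..."`. If a title or description ever contains a double quote or `&`, the markup breaks. One way to guard against that, sketched here with the standard library's `html.escape`, is to escape each value before formatting; the helper name `meta_tag` is hypothetical and not part of the notebook.

```python
from html import escape


def meta_tag(name, content, max_len=None):
    """Build a <meta> tag with attribute values escaped for HTML."""
    if max_len is not None:
        content = content[:max_len]  # truncate before escaping
    return f'<meta name="{escape(name, quote=True)}" content="{escape(content, quote=True)}">'


print(meta_tag('description', 'Fast & "explainable" AI'))
# <meta name="description" content="Fast &amp; &quot;explainable&quot; AI">
```

Truncating before escaping keeps the cut from landing in the middle of an entity such as `&quot;`.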

In [18]:
# save in metadata_file as backup
metadata = {
	'topic': topic,
	'page_title': page_title,
	'all_keywords': all_keywords,
	'combined_keyword': combined_keyword,
	'trending_tags': trending_tags,
}
with open(metadata_file, 'w', encoding='utf-8') as f:
	json.dump(metadata, f, indent=4)
	print('Metadata saved in json file')

# to load from metadata_file anytime
with open(metadata_file, 'r', encoding='utf-8') as f:
	metadata = json.load(f)
	topic = metadata['topic']
	page_title = metadata['page_title']
	all_keywords = metadata['all_keywords']
	combined_keyword = metadata['combined_keyword']
	trending_tags = metadata['trending_tags']
	print('Metadata loaded from json file')

Metadata loaded from json file


## 9. Generating paths

In [19]:
alias_prompt_template = '''
Page content:
{page_content}
---
SEO keywords: `{keywords}`
Page title: {page_title}
Example path: /benefits-of-apples/

Generate meaningful paths for the blog post.
A path must have the least possible number of words (max 4).

Return a list of 2 paths line-by-line, inside tags like this:
Sample response:
<paths>
	/alias1 \n /alias2
</paths>
'''.strip()
alias_prompt = alias_prompt_template.format(
	topic=topic, page_title=page_title, page_content=page_content,
	keywords=', '.join(all_keywords))
alias_response = get_lm_response(alias_prompt)
url_paths = extract_data(alias_response, tag='paths')
url_paths = url_paths.split('\n')
url_paths = [path.strip() for path in url_paths if path.strip()]  # skip blank lines
url_paths

['/explainable-ai-predictions', '/lml-dap-alternative-ml']
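
The prompt asks for short paths of at most four words, but the model's reply is not checked. A light normalization step could enforce that shape; the following is only a sketch, the helper name `normalize_path` is hypothetical, and it assumes paths should be lowercase ASCII words joined by hyphens.

```python
import re


def normalize_path(raw, max_words=4):
    """Turn free text or a rough path into '/word-word' with at most max_words words."""
    words = re.findall(r'[a-z0-9]+', raw.lower())
    return '/' + '-'.join(words[:max_words])


print(normalize_path('/lml-dap-alternative-ml'))   # /lml-dap-alternative-ml
print(normalize_path(' /Benefits of Apples/ '))    # /benefits-of-apples
```

Paths that already follow the convention pass through unchanged, so the step is safe to apply to every returned path.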