## **Keyword extraction using semantic search**
***

In [None]:
from keyword_extractor import extract_and_save_keywords_with_semantic_search
dream_df = extract_and_save_keywords_with_semantic_search()
dream_df.head()

The dataframe has many columns we don't need here. Let's view only the keywords and the dream text:

In [None]:
from yaml_parser import load_config
config = load_config()
columns_to_show = [config['data']['keywords_column'], config['data']['dream_text_column']]
dream_df[columns_to_show]

A quick look at the results shows that some dreams received keywords that are too diverse, while others received keywords that are not diverse enough. We will try changing the diversity factor and/or the size of the semantic search output. To keep the comparison quick, we will run it on samples of 100 rows of the data, just to show the change.
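To make the two knobs concrete, here is a minimal sketch of how a semantic search stage and a diversity factor can interact. The variable name `top_n_mmr` used in this notebook suggests maximal marginal relevance (MMR), so the sketch uses a common MMR formulation. The function `mmr_select` and the toy vectors are hypothetical and are not taken from `keyword_extractor`, whose actual implementation may differ.

```python
import numpy as np

def mmr_select(query_vec, cand_vecs, top_k, top_n, diversity):
    """Return top_n candidate indices: semantic search keeps top_k, then MMR re-ranks."""
    # Normalize so that dot products are cosine similarities
    q = query_vec / np.linalg.norm(query_vec)
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)

    # Stage 1: semantic search keeps the top_k candidates closest to the query
    sim_to_query = c @ q
    pool = np.argsort(-sim_to_query)[:top_k].tolist()

    # Stage 2: MMR greedily trades relevance against redundancy
    selected = [pool.pop(0)]  # start from the most relevant candidate
    while pool and len(selected) < top_n:
        # Highest similarity of each remaining candidate to anything already selected
        redundancy = (c[pool] @ c[selected].T).max(axis=1)
        scores = (1 - diversity) * sim_to_query[pool] - diversity * redundancy
        selected.append(pool.pop(int(np.argmax(scores))))
    return selected

# Toy data: candidates 0 and 1 are near-duplicates, 2 is distinct, 3 is unrelated
query = np.array([1.0, 0.0])
cands = np.array([[1.0, 0.1], [1.0, 0.12], [0.6, 0.8], [-1.0, 0.0]])

print(mmr_select(query, cands, top_k=3, top_n=2, diversity=0.0))  # [0, 1]
print(mmr_select(query, cands, top_k=3, top_n=2, diversity=0.7))  # [0, 2]
print(mmr_select(query, cands, top_k=2, top_n=2, diversity=0.7))  # [0, 1]
```

In this sketch a higher diversity pushes the second pick away from the near-duplicate, while a smaller `top_k` limits how far from the dream text the picks can drift, because distant candidates never enter the pool.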

In [None]:
from keyword_extractor import extract_keywords_with_semantic_search, read_datasets
import torch
from sentence_transformers import SentenceTransformer
from IPython.display import display, Markdown
import pandas as pd

# Set display options for DataFrames
pd.set_option("display.max_columns", None)  # Show all columns
pd.set_option("display.max_rows", None)     # Show all rows
pd.set_option("display.max_colwidth", None) # Do not truncate column contents

# Reload the config first so the datasets are read with the current settings
config = load_config()
dream_df, keywords_df = read_datasets(config)
rseeds = [11, 22, 33, 44, 55, 66, 77, 88, 99]
candidate_keywords = keywords_df[config['data']['keywords_column']].dropna().unique().tolist()

# Load model
model_name = config['model']['name']
model = SentenceTransformer(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Encode keywords
keywords_embeddings = model.encode(candidate_keywords, convert_to_tensor=True, device=model.device.type)

# Extract keywords from each dream using semantic search
top_k_semantic = config['model']['num_semantic']    # 50
top_n_mmr = config['model']['num_keywords']         # 5
diversity = config['model']['diversity']            # 0.5


for rs in rseeds:
    sample = dream_df.sample(100, random_state=rs)
    results = []

    for dream in sample[config['data']['dream_text_column']]:
        keywords = extract_keywords_with_semantic_search(dream, keywords_embeddings, candidate_keywords, model, top_k_semantic, top_n_mmr, diversity)
        results.append(",".join(keywords))

    sample[config['data']['keywords_column']] = results
    display(Markdown(f"#### **{top_k_semantic=}, {diversity=}, {rs=}**\n***"))
    display(sample[columns_to_show])

It seems this only made things worse. Let's try lowering the diversity even more.

In [None]:
from keyword_extractor import extract_keywords_with_semantic_search, read_datasets
import torch
from sentence_transformers import SentenceTransformer
from IPython.display import display, Markdown
import pandas as pd

# Set display options for DataFrames
pd.set_option("display.max_columns", None)  # Show all columns
pd.set_option("display.max_rows", None)     # Show all rows
pd.set_option("display.max_colwidth", None) # Do not truncate column contents

# Reload the config first so the datasets are read with the current settings
config = load_config()
dream_df, keywords_df = read_datasets(config)
rseeds = [11, 22, 33, 44, 55, 66, 77, 88, 99]
candidate_keywords = keywords_df[config['data']['keywords_column']].dropna().unique().tolist()

# Load model
model_name = config['model']['name']
model = SentenceTransformer(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Encode keywords
keywords_embeddings = model.encode(candidate_keywords, convert_to_tensor=True, device=model.device.type)

# Extract keywords from each dream using semantic search
top_k_semantic = config['model']['num_semantic']    # 50
top_n_mmr = config['model']['num_keywords']         # 5
diversity = config['model']['diversity']            # 0.3


for rs in rseeds:
    sample = dream_df.sample(100, random_state=rs)
    results = []

    for dream in sample[config['data']['dream_text_column']]:
        keywords = extract_keywords_with_semantic_search(dream, keywords_embeddings, candidate_keywords, model, top_k_semantic, top_n_mmr, diversity)
        results.append(",".join(keywords))

    sample[config['data']['keywords_column']] = results
    display(Markdown(f"#### **{top_k_semantic=}, {diversity=}, {rs=}**\n***"))
    display(sample[columns_to_show])

A little better, but still not satisfying. Let's put the diversity factor back at 0.7 and increase the semantic search output to 20% of the candidate keywords (which is 240).
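The pool size can also be derived from the candidate list instead of being typed by hand. The helper below is a hypothetical sketch; the total of 1,200 candidates is only inferred from the statement that 20% equals 240, and in the notebook the count would come from `len(candidate_keywords)`.

```python
def semantic_pool_size(num_candidates: int, fraction: float) -> int:
    """Number of candidates to keep after the semantic search stage (at least 1)."""
    return max(1, round(num_candidates * fraction))

# 1,200 is an inferred total, used here only to reproduce the 240 from the text
print(semantic_pool_size(1200, 0.20))  # 240
```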

In [None]:
from keyword_extractor import extract_keywords_with_semantic_search, read_datasets
import torch
from sentence_transformers import SentenceTransformer
from IPython.display import display, Markdown
import pandas as pd

# Set display options for DataFrames
pd.set_option("display.max_columns", None)  # Show all columns
pd.set_option("display.max_rows", None)     # Show all rows
pd.set_option("display.max_colwidth", None) # Do not truncate column contents

# Reload the config first so the datasets are read with the current settings
config = load_config()
dream_df, keywords_df = read_datasets(config)
rseeds = [11, 22, 33, 44, 55, 66, 77, 88, 99]
candidate_keywords = keywords_df[config['data']['keywords_column']].dropna().unique().tolist()

# Load model
model_name = config['model']['name']
model = SentenceTransformer(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Encode keywords
keywords_embeddings = model.encode(candidate_keywords, convert_to_tensor=True, device=model.device.type)

# Extract keywords from each dream using semantic search
top_k_semantic = config['model']['num_semantic']    # 240
top_n_mmr = config['model']['num_keywords']         # 5
diversity = config['model']['diversity']            # 0.7


for rs in rseeds:
    sample = dream_df.sample(100, random_state=rs)
    results = []

    for dream in sample[config['data']['dream_text_column']]:
        keywords = extract_keywords_with_semantic_search(dream, keywords_embeddings, candidate_keywords, model, top_k_semantic, top_n_mmr, diversity)
        results.append(",".join(keywords))

    sample[config['data']['keywords_column']] = results
    display(Markdown(f"#### **{top_k_semantic=}, {diversity=}, {rs=}**\n***"))
    display(sample[columns_to_show])

Still not satisfying. Let's try lowering the number of extracted keywords to the top 3.

In [None]:
from keyword_extractor import extract_keywords_with_semantic_search, read_datasets
import torch
from sentence_transformers import SentenceTransformer
from IPython.display import display, Markdown
import pandas as pd

# Set display options for DataFrames
pd.set_option("display.max_columns", None)  # Show all columns
pd.set_option("display.max_rows", None)     # Show all rows
pd.set_option("display.max_colwidth", None) # Do not truncate column contents

# Reload the config first so the datasets are read with the current settings
config = load_config()
dream_df, keywords_df = read_datasets(config)
rseeds = [11, 22, 33, 44, 55, 66, 77, 88, 99]
candidate_keywords = keywords_df[config['data']['keywords_column']].dropna().unique().tolist()

# Load model
model_name = config['model']['name']
model = SentenceTransformer(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Encode keywords
keywords_embeddings = model.encode(candidate_keywords, convert_to_tensor=True, device=model.device.type)

# Extract keywords from each dream using semantic search
top_k_semantic = config['model']['num_semantic']    # 240
top_n_mmr = config['model']['num_keywords']         # 3
diversity = config['model']['diversity']            # 0.7


for rs in rseeds:
    sample = dream_df.sample(100, random_state=rs)
    results = []

    for dream in sample[config['data']['dream_text_column']]:
        keywords = extract_keywords_with_semantic_search(dream, keywords_embeddings, candidate_keywords, model, top_k_semantic, top_n_mmr, diversity)
        results.append(",".join(keywords))

    sample[config['data']['keywords_column']] = results
    display(Markdown(f"#### **{top_k_semantic=}, {diversity=}, {rs=}**\n***"))
    display(sample[columns_to_show])

It seems we lost some valuable keywords. Let's keep 5 keywords but narrow the semantic search to 50.

In [None]:
from keyword_extractor import extract_keywords_with_semantic_search, read_datasets
import torch
from sentence_transformers import SentenceTransformer
from IPython.display import display, Markdown
import pandas as pd

# Set display options for DataFrames
pd.set_option("display.max_columns", None)  # Show all columns
pd.set_option("display.max_rows", None)     # Show all rows
pd.set_option("display.max_colwidth", None) # Do not truncate column contents

# Reload the config first so the datasets are read with the current settings
config = load_config()
dream_df, keywords_df = read_datasets(config)
rseeds = [11, 22, 33, 44, 55, 66, 77, 88, 99]
candidate_keywords = keywords_df[config['data']['keywords_column']].dropna().unique().tolist()

# Load model
model_name = config['model']['name']
model = SentenceTransformer(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Encode keywords
keywords_embeddings = model.encode(candidate_keywords, convert_to_tensor=True, device=model.device.type)

# Extract keywords from each dream using semantic search
top_k_semantic = config['model']['num_semantic']    # 50
top_n_mmr = config['model']['num_keywords']         # 5
diversity = config['model']['diversity']            # 0.7


for rs in rseeds:
    sample = dream_df.sample(100, random_state=rs)
    results = []

    for dream in sample[config['data']['dream_text_column']]:
        keywords = extract_keywords_with_semantic_search(dream, keywords_embeddings, candidate_keywords, model, top_k_semantic, top_n_mmr, diversity)
        results.append(",".join(keywords))

    sample[config['data']['keywords_column']] = results
    display(Markdown(f"#### **{top_k_semantic=}, {diversity=}, {rs=}**\n***"))
    display(sample[columns_to_show])

Narrowing the semantic search seems to have helped. Let's narrow it even more, to 30:

In [None]:
from keyword_extractor import extract_keywords_with_semantic_search, read_datasets
import torch
from sentence_transformers import SentenceTransformer
from IPython.display import display, Markdown
import pandas as pd

# Set display options for DataFrames
pd.set_option("display.max_columns", None)  # Show all columns
pd.set_option("display.max_rows", None)     # Show all rows
pd.set_option("display.max_colwidth", None) # Do not truncate column contents

# Reload the config first so the datasets are read with the current settings
config = load_config()
dream_df, keywords_df = read_datasets(config)
rseeds = [11, 22, 33, 44, 55, 66, 77, 88, 99]
candidate_keywords = keywords_df[config['data']['keywords_column']].dropna().unique().tolist()

# Load model
model_name = config['model']['name']
model = SentenceTransformer(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Encode keywords
keywords_embeddings = model.encode(candidate_keywords, convert_to_tensor=True, device=model.device.type)

# Extract keywords from each dream using semantic search
top_k_semantic = config['model']['num_semantic']    # 30
top_n_mmr = config['model']['num_keywords']         # 5
diversity = config['model']['diversity']            # 0.7


for rs in rseeds:
    sample = dream_df.sample(100, random_state=rs)
    results = []

    for dream in sample[config['data']['dream_text_column']]:
        keywords = extract_keywords_with_semantic_search(dream, keywords_embeddings, candidate_keywords, model, top_k_semantic, top_n_mmr, diversity)
        results.append(",".join(keywords))

    sample[config['data']['keywords_column']] = results
    display(Markdown(f"#### **{top_k_semantic=}, {diversity=}, {rs=}**\n***"))
    display(sample[columns_to_show])

The results looked best with a semantic search output of 50 (with a diversity factor of 0.7 and 5 keywords). We'll keep it that way.
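The comparisons above were made by editing the config and re-running the same cell. As a rough sketch, the same comparison could be run in one pass over a grid of settings. The `sweep` helper and the `extract_fn(text, top_k, top_n, diversity)` callable are hypothetical; the sampling uses Python's `random` module rather than `DataFrame.sample`, so the sampled rows would differ from the ones shown in this notebook.

```python
from itertools import product
import random

def sweep(texts, extract_fn, top_ks, diversities, top_n=5, sample_size=100, seed=11):
    """Run extract_fn on one fixed random sample for every (top_k, diversity) pair."""
    rng = random.Random(seed)
    sample = rng.sample(texts, min(sample_size, len(texts)))
    results = {}
    for top_k, diversity in product(top_ks, diversities):
        # Same sample for every setting, so differences come from the parameters only
        results[(top_k, diversity)] = [
            ",".join(extract_fn(text, top_k, top_n, diversity)) for text in sample
        ]
    return sample, results

# Stand-in extractor so the sketch runs on its own. In the notebook it could be:
# lambda text, k, n, d: extract_keywords_with_semantic_search(
#     text, keywords_embeddings, candidate_keywords, model, k, n, d)
def fake_extract(text, top_k, top_n, diversity):
    return text.split()[:top_n]

texts = ["flying over a city at night", "late for an exam", "lost in a forest"]
sample, results = sweep(texts, fake_extract, top_ks=[30, 50, 240], diversities=[0.3, 0.5, 0.7])
for (top_k, diversity), keywords in results.items():
    print(f"{top_k=}, {diversity=}: {keywords[0]}")
```

Holding the sample fixed across settings makes the outputs directly comparable row by row.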

In [None]:
extract_and_save_keywords_with_semantic_search()