## **Step 1 - Keyword Extraction**
***

We have two datasets, one with dream text descriptions:

In [None]:
from keyword_extractor import read_datasets, extract_and_save_keywords_from_dataframes
from yaml_parser import load_config
config = load_config()
dream_df, keywords_df = read_datasets(config)
dream_df.head()

And another one with interpretations of dreams according to keywords:

In [None]:
keywords_df

Next, we used a pretrained sentence transformer to embed both the dream descriptions and the keywords, and used these embeddings to extract the most significant keywords for each dream.
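
The internals of `extract_and_save_keywords_from_dataframes` are not shown in this notebook. As a rough illustration of the idea only, the sketch below ranks keywords by cosine similarity to a dream embedding. The toy 3-dimensional vectors stand in for real sentence embeddings, and `top_k_keywords` and its `k` parameter are hypothetical names, not part of `keyword_extractor`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_keywords(dream_vec, keyword_vecs, k=5):
    """Return the k keywords whose embeddings are closest to the dream embedding."""
    scored = [(cosine(dream_vec, vec), kw) for kw, vec in keyword_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [kw for _, kw in scored[:k]]

# Toy vectors (assumption): real embeddings would come from the sentence transformer.
toy_keywords = {
    "water": [0.9, 0.1, 0.0],
    "falling": [0.1, 0.9, 0.1],
    "teeth": [0.0, 0.1, 0.9],
}
toy_dream = [0.8, 0.3, 0.0]
print(top_k_keywords(toy_dream, toy_keywords, k=2))
```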

### **all-MiniLM-L6-v2**
***

In [None]:
dream_df = extract_and_save_keywords_from_dataframes()
dream_df

To make the dataframe easier to read, we keep only the columns of interest:

In [None]:
columns_to_show = ['text_dream', 'Dream Symbol']
dream_df[columns_to_show]

## Step 2 - Summarize interpretations

After extracting the meaningful keywords, we fetched the matching interpretation for each extracted keyword and used a pretrained LLM to summarize these interpretations into a single interpretation.

### Load data and prepare (small) dataset for experimenting

In [None]:
import pandas as pd
from datetime import datetime
from transformers import pipeline
from utils import release_all_gpu_memory, save_df_as_pretty_html


In [None]:
from summarizer import load_causal_model, batch_generate_interpretations
import torch

In [None]:
dream_df = pd.read_csv('datasets/rsos_dream_data.tsv', sep='\t')
dream_df

In [None]:
keywords_df = pd.read_csv("datasets/fixed_interpretations.csv")
keywords_df

In [None]:
exmpl = dream_df[dream_df["text_dream"].str.len() < 300]     # Limit the dream length

In [None]:
exmpl = exmpl[["text_dream","Dream Symbol"]].sample(5, random_state=45)     # Create sample

In [None]:
exmpl

In [None]:
exmpl["Dream Symbol"]   # keywords of the sample

Next, we build the prompt for the LLM. The prompt asks the model to interpret the dream, and each example pairs it with the dream description, the extracted keywords, and the interpretations of those keywords.

In [None]:
dataset = []

prmt = """Given dream description, interpret the meaning of the dream. 
Provided also are the dream symbols that appear in the dream and their meanings. 
Use the dream symbols meanings to help you interpret the dream. """.replace("\n", " ")


for i, ex in exmpl.iterrows():
    #print(ex)
    keys = [k.strip() for k in ex["Dream Symbol"].split(",")[:5]]  # strip stray whitespace so isin() matches
    
    #print(keys)
    syms = keywords_df[keywords_df["Dream Symbol"].isin(keys)]

    descr = syms.apply(lambda r: f' - {r["Dream Symbol"]}:  {r["Interpretation"]}', axis = 1)
    item = {
        "prompt": prmt, 
        "dream": ex["text_dream"],
        "symbols": "\n".join(descr),
        }
    dataset.append(item)
    

dataset = pd.DataFrame(dataset)
dataset
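
How the three columns are joined into a single model input is handled inside `batch_generate_interpretations`, which is not shown here. The sketch below is one possible layout, assuming the input ends with an `Interpretation:` marker (the Mistral post-processing cell later splits on that string). `build_model_input` and the section labels are hypothetical, and the symbol meanings are placeholders.

```python
def build_model_input(prompt, dream, symbols):
    """Join the three dataset fields into one text prompt (hypothetical layout)."""
    return (
        f"{prompt.strip()}\n\n"
        f"Dream: {dream.strip()}\n\n"
        f"Symbols:\n{symbols}\n\n"
        "Interpretation:"  # the generated text is expected to follow this marker
    )

example = build_model_input(
    "Given dream description, interpret the meaning of the dream.",
    "I was flying over my old school.",
    " - Flying:  placeholder meaning\n - School:  placeholder meaning",
)
print(example)
```

Ending the input with a fixed marker makes it easy to separate the generated interpretation from the echoed prompt when a text-generation model returns both.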


### Summarize with flan-T5-large model

In [None]:
release_all_gpu_memory()

In [None]:
# Step 1: Load FLAN-T5 model and tokenizer
model_name = "google/flan-t5-large"
model_name_short = model_name.split("/")[-1]
device = 0 if torch.cuda.is_available() else -1
model, tokenizer = load_causal_model(model_name)

In [None]:
text2text_generator = pipeline(
        "text2text-generation",
        model=model,
        tokenizer=tokenizer,
        max_length=1024,           # ✅ allow longer input
        truncation=True,           # ✅ ensure truncation at tokenizer level
        device=device,
    )

Create interpretations in batches:

In [None]:
tstp = datetime.now().strftime(r"%y.%m.%d-%H")
result_df = batch_generate_interpretations(dataset, text2text_generator, batch_size=10, max_length=250)
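
The batching itself happens inside `batch_generate_interpretations`, whose code is not part of this notebook. As a minimal sketch of what splitting work into batches of `batch_size` typically looks like, the hypothetical helper `iter_batches` below slices a list of prompts into consecutive chunks.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 23 dummy prompts split into batches of 10: the last batch is smaller.
prompts = [f"prompt {i}" for i in range(23)]
sizes = [len(batch) for batch in iter_batches(prompts, batch_size=10)]
print(sizes)  # [10, 10, 3]
```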


In [None]:
postproc = lambda out: out["generated_text"].strip()
result_df["interpretation"] = result_df["interpretation"].apply(postproc)


In [None]:
result_df

In [None]:
result_df.columns

The interpretations were poor and only loosely related to the dream descriptions. We saved the dataframe for further inspection, saw that the problem affected many entries, and therefore tried another model, Mistral.

In [None]:
# TODO: Should probably delete that cell
# save_df = result_df[['prompt', 'symbols','dream', 'interpretation']]
#
# path = f"output/{model_name_short}_{tstp}"
#
# save_df_as_pretty_html(save_df, path + ".html")
#
# save_df.to_csv(path + ".csv")

In [None]:
result_df.interpretation.str.len()  # TODO: Remember why we sorted interpretations by length...

### Summarize with Mistral model

In [None]:
from summarizer import load_mistral_4bit_model

In [None]:
dataset

In [None]:
release_all_gpu_memory(["model", "tokenizer", "text2text_generator"])


In [None]:
print("Loading Mistral-7B-Instruct in 4-bit...")
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
model_name_short = model_name.split("/")[-1]
  
max_new_tokens=256

model, tokenizer = load_mistral_4bit_model(model_name)


In [None]:
model_pipeline = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=max_new_tokens,
        do_sample=False
    )

In [None]:

print("\n🧠 Running interpretations...")
tstp = datetime.now().strftime(r"%y.%m.%d-%H")

result_df = batch_generate_interpretations(dataset, model_pipeline, batch_size=10)
#print(result_df[["dream", "interpretation"]])


In [None]:
postproc = lambda out: out[0]["generated_text"].split("Interpretation:")[-1].strip()
result_df["interpretation"] = result_df["interpretation"].apply(postproc)


In [None]:
result_df

In [None]:

save_df = result_df[['prompt', 'symbols','dream', 'interpretation']]

path = f"output/{model_name_short}_{tstp}"
save_df_as_pretty_html(save_df, path + ".html")

save_df.to_csv(path + ".csv")

Switching models did not seem to help, so we next tried to improve the keyword extraction itself, in two steps:
1. Semantic search, to narrow the candidate keywords down to those that are semantically close to the dream.
2. MMR (Maximal Marginal Relevance), to increase the diversity of the keywords extracted from each dream.
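
The actual implementation lives in `better_keywords_extraction.ipynb` and is not reproduced here. The sketch below shows the general greedy MMR idea on toy 2-dimensional vectors: each pick maximizes relevance to the dream minus similarity to keywords already chosen. The function `mmr_select`, the weight `lam`, and the vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, candidates, k=3, lam=0.5):
    """Greedy MMR: balance relevance to the query against redundancy with earlier picks."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        scores = {}
        for name, vec in remaining.items():
            relevance = cosine(query_vec, vec)
            # Highest similarity to anything already selected (0 if nothing selected yet).
            redundancy = max((cosine(vec, candidates[s]) for s in selected), default=0.0)
            scores[name] = lam * relevance - (1 - lam) * redundancy
        best = max(scores, key=scores.get)
        selected.append(best)
        del remaining[best]
    return selected

# Toy vectors (assumption): "water" and "ocean" are near-duplicates.
toy_candidates = {
    "water": [1.0, 0.0],
    "ocean": [0.95, 0.1],
    "falling": [0.3, 0.9],
}
toy_dream = [1.0, 0.3]
print(mmr_select(toy_dream, toy_candidates, k=2, lam=0.5))
print(mmr_select(toy_dream, toy_candidates, k=2, lam=1.0))
```

With `lam=1.0` the selection reduces to plain relevance ranking and returns the two near-duplicate keywords; lowering `lam` penalizes the duplicate and lets a less similar keyword through.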

#TODO: Summarize better_keywords_extraction.ipynb

## **Evaluation**
***

We evaluated the dream interpretations using BLEU, perplexity, ROUGE, and BERTScore. **Important: the evaluation was run on a small sample of 5 rows, and each interpretation was compared against the dream text itself, which may explain the poor metric values.**
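
To see why comparing an interpretation against the dream text tends to give low overlap scores, the sketch below computes a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is not the code in `evaluation`, it uses plain whitespace tokenization, and both sentences are made-up toy examples.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between two texts (simplified ROUGE-1, whitespace tokens)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

dream = "i was flying over my old school"
interpretation = "the dream reflects a wish for freedom from old obligations"
print(round(rouge1_f1(dream, interpretation), 3))
```

An interpretation paraphrases and abstracts rather than repeats the dream, so only a few words are shared even when the interpretation is reasonable.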

In [None]:
from evaluation import evaluate_dream_interpretations
import pandas as pd
dreams_interpretations_df = pd.read_csv('datasets/Mistral-7B-Instruct-v0.2_25.04.17-16.csv')
dreams_interpretations_df = evaluate_dream_interpretations(dreams_interpretations_df)
dreams_interpretations_df.to_csv('datasets/Mistral-7B-Instruct-v0.2_25.04.17-16_evaluated.csv', index=False)
dreams_interpretations_df

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Create figure with subplots for non-perplexity scores
fig1, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10))
ax1.grid(True)  # grid on the distribution plot only; plt.grid() would target the heatmap axes

# Plot distributions without perplexity
scores_without_perplexity = ['BLEU', 'ROUGE', 'BERT']
for score in scores_without_perplexity:
    sns.kdeplot(data=dreams_interpretations_df[score], label=score, ax=ax1)
ax1.set_title('Score Distributions (BLEU, ROUGE, BERT)')
ax1.legend()

# Calculate statistics for heatmap without perplexity
stats_df = pd.DataFrame()
for score in scores_without_perplexity:
    stats_df[score] = [
        dreams_interpretations_df[score].min(),
        dreams_interpretations_df[score].max(),
        dreams_interpretations_df[score].mean(),
        dreams_interpretations_df[score].median(),
        stats.mode(dreams_interpretations_df[score])[0]
    ]
stats_df.index = ['Min', 'Max', 'Average', 'Median', 'Mode']

# Plot heatmap
sns.heatmap(stats_df, annot=True, fmt='.3f', cmap='YlOrRd', ax=ax2)
ax2.set_title('Score Statistics (BLEU, ROUGE, BERT)')
plt.tight_layout()

# Create separate figure for perplexity
fig2, (ax3, ax4) = plt.subplots(2, 1, figsize=(12, 10))

# Plot perplexity distribution
sns.kdeplot(data=dreams_interpretations_df['perplexity'], ax=ax3)
ax3.set_title('Perplexity Distribution')

# Calculate perplexity statistics
perplexity_stats = pd.DataFrame({
    'perplexity': [
        dreams_interpretations_df['perplexity'].min(),
        dreams_interpretations_df['perplexity'].max(),
        dreams_interpretations_df['perplexity'].mean(),
        dreams_interpretations_df['perplexity'].median(),
        stats.mode(dreams_interpretations_df['perplexity'])[0]
    ]
})
perplexity_stats.index = ['Min', 'Max', 'Average', 'Median', 'Mode']

# Plot perplexity heatmap
sns.heatmap(perplexity_stats, annot=True, fmt='.3f', cmap='YlOrRd', ax=ax4)
ax4.set_title('Perplexity Statistics')
plt.tight_layout()

plt.show()


We can draw the following conclusions:
1. The BLEU score is extremely low: a good result would be 20–40, and we did not even reach 1. This means there is very little n-gram overlap between the dream and its interpretation.
2. The same holds for ROUGE.
3. BERTScore averages about 0.6. This is below the 0.85–0.9 range considered good, but it still indicates some semantic similarity between the dream and its interpretation.
4. Perplexity is very poor, since a good value is under 20.
5. We interpreted the results using the following table:

| **Metric** | **High Score Meaning** | **Low Score Meaning** | **Preferred Score** | **Typical Values for Good Results** | **Why?** |
| --- | --- | --- | --- | --- | --- |
| **BLEU** | High n-gram overlap between reference and candidate text | Low n-gram overlap between reference and candidate text | **High** | 20–40 (moderate), 40+ (good) | High BLEU indicates the candidate text closely matches the reference text. |
| **Perplexity** | Candidate text is unpredictable and diverges from reference distribution | Candidate text is predictable, fluent, and aligned with reference distribution | **Low** | < 20 (for good results) | Low perplexity shows that the candidate text is fluent, consistent, and aligned with the reference. |
| **ROUGE** | More overlapping n-grams (e.g., unigrams, bigrams) and higher recall of key phrases | Fewer overlapping n-grams and poor recall of key phrases | **High** | 30–50 (good), 50+ (very good) | High ROUGE suggests greater similarity between the candidate and reference texts. |
| **BERTScore** | Strong semantic similarity between the candidate and reference text | Weak semantic similarity between the candidate and reference text | **High** | 0.85–0.98 (good) | Higher BERTScore reflects that the candidate preserves the meaning of the reference text. |
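
For reference, the usual definition of perplexity for a token sequence $x_1, \dots, x_N$ under a scoring language model $p_\theta$ is given below. This assumes `evaluate_dream_interpretations` follows the standard definition, which the notebook does not show.

$$
\mathrm{PPL}(x_1, \dots, x_N) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right)\right)
$$

Under this definition, a lower value means the scoring model finds the text more predictable; a model that assigned probability 1 to every token would give the minimum value of 1.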

