# Story Generation
We remember things better as stories. The plan here is to pick a subset of our phrases, extract their vocabulary, and generate a story from it. We can then pull in more flashcards / phrases to cover more of the story's vocabulary.

The story name takes the form story_some_title. When added as a tag in Anki, it produces a hyperlink to a Google Cloud bucket in the format bucket/language/story_name/story_name.html

This makes it easy to add new stories to an existing flashcard deck, and the links update as soon as you add the tags.
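
As a rough illustration of the link format described above, the following sketch builds the URL from a tag. The bucket name, the `https://storage.googleapis.com` host, and the `story_url` helper are assumptions for illustration; the project may construct its links differently.

```python
def story_url(bucket: str, language: str, story_name: str) -> str:
    """Build a link of the form bucket/language/story_name/story_name.html."""
    if not story_name.startswith("story_"):
        raise ValueError("story tags are expected to start with 'story_'")
    # Assumption: the bucket is served from the public Google Cloud Storage host.
    return f"https://storage.googleapis.com/{bucket}/{language}/{story_name}/{story_name}.html"


print(story_url("example-bucket", "swedish", "story_brussels_trip"))
```

Because the URL is derived entirely from the tag, renaming a story only requires changing the tag on the notes.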

In [2]:
%load_ext autoreload
%autoreload 2
import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

PAY_FOR_API = True  # True runs cells that cost money via API calls; set to False to skip them

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [5]:
import os
import pickle
import random
import sys
from pathlib import Path
from pprint import pprint

from dotenv import load_dotenv

from src.config_loader import config
from src.nlp import (
    create_flashcard_index,
    get_vocab_dict_from_dialogue,
    get_vocab_dictionary_from_phrases,
)
from src.utils import load_json, load_text_file, save_json
from src.anki_tools import get_deck_contents, AnkiCollectionReader

load_dotenv()  # load environment variables from a .env file


True

In [None]:
filepath = "../data/longman_1000_phrases.txt"
phrases = load_text_file(filepath)
pprint(f"First few phrases {phrases[:10]}")

# We already have flashcards generated for some phrases.
# A flashcard index lets us select flashcards that cover a specific
# vocabulary range. It is computationally expensive to build, so we
# generate it once with create_flashcard_index and reuse it.



## Create the flashcard index
This makes it very fast to find matching flashcards from a given vocab list.

In [6]:
# long process, so only create if it doesn't exist
notebook_dir = Path().absolute()  # This gives src/notebooks
data_dir = notebook_dir.parent / "data" / "longman_1000_phrase_index.json"

if data_dir.exists():
    phrase_index = load_json(data_dir)
else:
    phrase_index = create_flashcard_index(phrases)
    save_json(phrase_index, data_dir)
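
To make the idea of the index concrete, here is a minimal sketch of an inverted index that maps each word to the positions of the phrases containing it. It assumes naive whitespace tokenisation; the real `create_flashcard_index` presumably does proper NLP and keeps separate verb and vocab indexes, so `build_toy_index` is purely illustrative.

```python
from collections import defaultdict


def build_toy_index(phrases: list[str]) -> dict:
    """Map each lowercase word to the positions of the phrases containing it."""
    vocab_index = defaultdict(list)
    for position, phrase in enumerate(phrases):
        # Assumption: naive whitespace tokenisation instead of real NLP.
        words = {word.strip(".,?!").lower() for word in phrase.split()}
        for word in sorted(words):
            vocab_index[word].append(position)
    return {"phrases": phrases, "vocab_index": dict(vocab_index)}


toy_index = build_toy_index(["Stop wasting time", "Do you have time?"])
print(toy_index["vocab_index"]["time"])  # [0, 1]
```

Looking up a word is then a single dictionary access rather than a scan over every phrase.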


## Sample some phrases to generate the story from
This pins the story to the vocab found in phrases we already have flashcards for.

In [7]:
# list the Anki decks so we can pick one to draw known phrases from:
with AnkiCollectionReader() as reader:
    pprint(reader.get_deck_names())

{1: 'Default',
 1731524665442: 'Swedish EAL',
 1732020971325: 'RapidRetention - Swedish - LM1000',
 1732309563077: 'RapidRetention - Dutch - LM1000',
 1732312948269: 'RapidRetention - German - LM1000',
 1732313960891: 'RapidRetention - Arabic - LM1000',
 1732314196963: 'RapidRetention - Spanish - LM1000',
 1732314413500: 'RapidRetention - Japanese - LM1000',
 1732316149591: 'RapidRetention - Russian - LM1000',
 1732316158895: 'RapidRetention - Basque - LM1000',
 1732316821915: 'RapidRetention - French - LM1000',
 1732316936163: 'RapidRetention - Italian - LM1000',
 1732460522330: 'RapidRetention - Persian - LM1000',
 1732465028917: 'RapidRetention - Mandarin Chinese - LM1000',
 1732637740663: 'RapidRetention - Welsh - LM1000',
 1732869083179: 'RapidRetention - Serbian - LM1000',
 1732980361514: 'RapidRetention - Russian - GCSE',
 1732993700879: 'Persian Alphabet',
 1733170456922: 'RapidRetention - Swedish - GCSE',
 1733171641992: 'RapidRetention - Mandarin Chinese - GCSE',
 17342602274

In [9]:
df = get_deck_contents(deck_name="RapidRetention - Swedish - LM1000")
df.head()

Unnamed: 0,note_id,model_name,tags,n_cards,avg_ease,total_reps,avg_reps,total_lapses,avg_lapses,avg_interval,TargetText,TargetAudio,TargetAudioSlow,EnglishText,WiktionaryLinks,Picture,TargetLanguageName,knowledge_score
0,1732020511348,Language Practice With Images,,3,0.0,0,0.0,0,0.0,0.0,Var mer uppmärksam på detaljer,[sound:a821f020-5a84-44fb-af42-c6c4133e4379.mp3],[sound:3d6997c2-92c7-43c9-b31e-a20cc4f0bf9e.mp3],Pay more attention to details,"<a href=""https://en.wiktionary.org/wiki/var#Sw...","<img src=""f7153993-cfee-40f4-841c-1bd6cfaeb5a9...",Swedish,0.0
1,1732020511352,Language Practice With Images,,3,265.0,12,4.0,0,0.0,39.0,Kommer kunden att känna igen mig?,[sound:fa15b936-ef4e-44d5-932d-e94a6b477c9d.mp3],[sound:79a26555-55ef-43d6-a9ca-6ee02ade7721.mp3],Will the customer recognize me?,"<a href=""https://en.wiktionary.org/wiki/kommer...","<img src=""cd5e83e4-813e-4097-962a-cf536f866e99...",Swedish,0.354
2,1732020511356,Language Practice With Images,,3,78.3,4,1.3,0,0.0,7.0,Vänligen svara ärligt på alla frågor,[sound:f54483fc-3303-427d-acef-6710ae244bc9.mp3],[sound:41c55da5-347a-4d0b-b6c8-40b27b46c1b9.mp3],Please answer all questions honestly,"<a href=""https://en.wiktionary.org/wiki/v%C3%A...","<img src=""1a6640ec-fe2d-4808-b638-33a38b694224...",Swedish,0.314
3,1732020511360,Language Practice With Images,,3,0.0,0,0.0,0,0.0,0.0,Sluta slösa tid på detta,[sound:b34a331b-6dd9-44a2-a588-f92be1b11d06.mp3],[sound:a1a6714c-9bfd-4cd7-b30e-1a209c0ab42b.mp3],Stop wasting time on this,"<a href=""https://en.wiktionary.org/wiki/sluta#...","<img src=""8de5a6d3-b10c-45ea-860c-512cc2673be7...",Swedish,0.0
4,1732020511364,Language Practice With Images,,3,255.0,8,2.7,0,0.0,20.7,Vi producerar en ny produkt snart,[sound:10281f18-fddc-4a69-ab7b-6098f63b948f.mp3],[sound:0e472f07-e35b-4ab0-9161-e0a1474c6e34.mp3],We're producing a new product soon,"<a href=""https://en.wiktionary.org/wiki/vi#Swe...","<img src=""329bfcb3-7cb4-4174-923f-567a1bfe7ec9...",Swedish,0.331


In [58]:
known_phrases = df.query("knowledge_score > 0.2").sort_values(by="knowledge_score", ascending=False)['EnglishText'].tolist()

known_phrase_indicies = set(map(lambda x: phrase_index['phrases'].index(x), known_phrases))
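
`list.index` scans the list on every call and raises `ValueError` if a deck phrase is absent from the index. A minimal alternative, assuming the same kind of phrase list, is a one-off lookup dictionary. The `phrase_positions` helper and the sample data are hypothetical.

```python
def phrase_positions(known: list[str], indexed: list[str]) -> set[int]:
    """Return index positions for known phrases, skipping any not indexed."""
    lookup = {}
    for position, phrase in enumerate(indexed):
        lookup.setdefault(phrase, position)  # keep the first occurrence, like list.index
    return {lookup[phrase] for phrase in known if phrase in lookup}


indexed_phrases = ["Stop wasting time on this", "Pay more attention to details"]
known = ["Pay more attention to details", "A phrase missing from the index"]
print(phrase_positions(known, indexed_phrases))  # {1}
```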

In [61]:
from src.nlp import remove_unknown_index_values
phrase_index['verb_index'] = remove_unknown_index_values(known_phrase_indicies, phrase_index['verb_index'])
phrase_index['vocab_index'] = remove_unknown_index_values(known_phrase_indicies, phrase_index['vocab_index'])
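
The source of `remove_unknown_index_values` is not shown here. One plausible reading, sketched below under the assumption that each index maps a word to a list of phrase positions, is that it drops positions for phrases we have not learned yet. The `keep_known_positions` name is hypothetical.

```python
def keep_known_positions(known: set[int], index: dict[str, list[int]]) -> dict[str, list[int]]:
    """Keep only known phrase positions; drop words left with none."""
    filtered = {}
    for word, positions in index.items():
        kept = [position for position in positions if position in known]
        if kept:
            filtered[word] = kept
    return filtered


toy_vocab_index = {"time": [0, 3], "park": [2], "train": [5]}
print(keep_known_positions({0, 2}, toy_vocab_index))  # {'time': [0], 'park': [2]}
```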

In [18]:
vocab_dict_flashcards = get_vocab_dictionary_from_phrases(known_phrases[:75]) #75 phrases should give a decent amount of vocab

Now generate the story

In [23]:
from src.dialogue_generation import generate_story

story_name = "story_brussels_trip"  # MUST start with story_; update it once the story is generated
story_path = notebook_dir.parent / "outputs" / "stories" / config.TARGET_LANGUAGE_NAME / f"{story_name}.json"

if story_path.exists():
    story_dict = load_json(story_path)
elif globals().get("story_dict"):
    # a story is already in memory (e.g. after renaming story_name), so save it
    save_json(story_dict, story_path)
    print(f"saved story to {story_path}")
elif PAY_FOR_API:
    story_dict = generate_story(vocab_dict_flashcards)
    # once this is generated, alter story_name above to match the story, then re-run to save


We find that the LLM goes a bit beyond the vocab found in the flashcards, so we measure how much of the story's vocabulary the flashcards cover.

In [24]:
from src.nlp import get_vocab_dict_from_dialogue

vocab_dict_story = get_vocab_dict_from_dialogue(story_dict, limit_story_parts=None)

In [64]:
from src.nlp import find_missing_vocabulary

vocab_overlap = find_missing_vocabulary(vocab_dict_flashcards, vocab_dict_story)

=== VOCABULARY COVERAGE ANALYSIS ===
Target verbs covered by flashcards: 60.6%
Target vocabulary covered by flashcards: 49.1%

Verbs needing new flashcards:
['expire', 'find', 'save', 'excite', 'beat'] ...

Vocabulary needing new flashcards:
['but', 'campus', 'sure', 'another', 'mind'] ...
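
The percentages above are set-coverage figures. A minimal sketch of how such a figure can be computed, assuming each vocabulary is simply a collection of words (the real vocab dictionaries may carry more structure, and `coverage_percent` is a hypothetical helper):

```python
def coverage_percent(flashcard_words, story_words) -> float:
    """Percentage of story words that also appear in the flashcards."""
    story = set(story_words)
    if not story:
        return 100.0  # nothing to cover
    covered = story & set(flashcard_words)
    return round(100 * len(covered) / len(story), 1)


story_verbs = ["find", "save", "go", "see", "eat"]
flashcard_verbs = ["go", "see", "eat", "run"]
print(coverage_percent(flashcard_verbs, story_verbs))  # 60.0
```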


In [28]:
missing_vocab_dict = vocab_overlap['missing_vocab']

In [65]:
from src.nlp import get_matching_flashcards_indexed

# Let's pull all the existing phrases we need to cover the vocab on our story
results = get_matching_flashcards_indexed(vocab_dict_story, phrase_index)
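
Choosing a small set of phrases that covers a target vocabulary is a set-cover problem. The sketch below shows a greedy approach over a word-to-positions index. It is an illustration only, and `get_matching_flashcards_indexed` may use a different strategy. The `greedy_cover` helper and toy data are hypothetical.

```python
def greedy_cover(target_words: set[str], index: dict[str, list[int]]) -> list[int]:
    """Pick phrase positions greedily until no more target words can be covered."""
    # Invert the index: phrase position -> target words it contains.
    words_by_phrase: dict[int, set[str]] = {}
    for word in target_words:
        for position in index.get(word, []):
            words_by_phrase.setdefault(position, set()).add(word)

    uncovered = set(target_words)
    selected = []
    while uncovered:
        # Best phrase = the one covering the most still-uncovered words.
        best = max(words_by_phrase, key=lambda p: len(words_by_phrase[p] & uncovered), default=None)
        if best is None or not (words_by_phrase[best] & uncovered):
            break  # remaining words have no flashcard yet
        selected.append(best)
        uncovered -= words_by_phrase[best]
    return selected


toy_index = {"park": [0, 2], "walk": [0], "train": [1], "evening": [0, 1]}
print(greedy_cover({"park", "walk", "train", "campus"}, toy_index))  # [0, 1]
```

Words such as "campus" that no existing phrase contains stay uncovered, which is why new phrases have to be generated for the remaining vocabulary.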

In [66]:
proposed_flashcard_phrases = [card.get('phrase') for card in results['selected_cards']]
vocab_from_new_flashcards = get_vocab_dictionary_from_phrases(proposed_flashcard_phrases)
new_overlap = find_missing_vocabulary(vocab_from_new_flashcards, vocab_dict_story)

=== VOCABULARY COVERAGE ANALYSIS ===
Target verbs covered by flashcards: 80.3%
Target vocabulary covered by flashcards: 71.7%

Verbs needing new flashcards:
['expire', 'got', 'stress', 'save', 'worry'] ...

Vocabulary needing new flashcards:
['rich', 'maybe', 'professor', 'agreed', 'proud'] ...


In [83]:
len(proposed_flashcard_phrases) #now use wider index to pick phrases for missing vocab

78

In [39]:
filtered_df = df.loc[df['EnglishText'].isin(proposed_flashcard_phrases)]

In [None]:
(filtered_df.knowledge_score > 0.2).mean() # we know 61% of the flashcards

0.6119402985074627
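
Note that `isin` silently drops proposed phrases that are absent from the deck, so the mean above is taken over the matched rows only. The value equals 41/67, which suggests 67 rows matched out of the 78 proposed phrases. A small pure-Python sketch (hypothetical `known_fraction` helper and toy scores) makes the unmatched phrases visible:

```python
def known_fraction(proposed: list[str], scores: dict[str, float], threshold: float = 0.2):
    """Return (fraction of matched phrases above threshold, unmatched phrases)."""
    matched = [phrase for phrase in proposed if phrase in scores]
    unmatched = [phrase for phrase in proposed if phrase not in scores]
    if not matched:
        return 0.0, unmatched
    known = sum(scores[phrase] > threshold for phrase in matched)
    return known / len(matched), unmatched


toy_scores = {"Stop wasting time on this": 0.0, "Will the customer recognize me?": 0.354}
proposed = ["Stop wasting time on this", "Will the customer recognize me?", "Not in the deck"]
print(known_fraction(proposed, toy_scores))  # (0.5, ['Not in the deck'])
```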

In [None]:
knowledge_scores = []
for phrase in proposed_flashcard_phrases:
    pass  # TODO: cell left unfinished; per-phrase knowledge scores are not collected yet

In [None]:
#we can fill in the gap with some missing flashcards:

missing_vocab_dict = new_overlap['missing_vocab']
missing_vocab_dict

In [None]:
from src.phrase import generate_phrases_from_vocab_dict

missing_phrases = generate_phrases_from_vocab_dict(missing_vocab_dict)
missing_phrases

In [None]:
num_cards = len(results["selected_cards"])
print(f"We need {num_cards + len(missing_phrases)} flashcards to cover the story")

In [None]:
from src.utils import save_text_file

save_text_file(proposed_flashcard_phrases + missing_phrases, "../data/stories/test_story/test_phrases.txt")

We will need to generate images for the missing phrases; then we can create an Anki deck for that particular story.

In [None]:
from src.images import generate_images_from_phrases

PAY_FOR_API = True

output_dir = notebook_dir.parent / "data" / "longman_phrase_images" / "longman1000"

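if not output_dir.exists():
    # refuse to spend money if the images would be written to the wrong place
    print(f"wrong directory: {output_dir} does not exist")
    PAY_FOR_API = False

if PAY_FOR_API:
    image_files_and_prompts = generate_images_from_phrases(phrases=missing_phrases, output_dir=output_dir)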



## Linking stories to flashcards
We will use Anki's tag feature. Given a list of English phrases that are required to understand a story, we can tag each of those phrases within a specific Anki deck.

In [6]:
# load the phrases
phrases = load_text_file("../data/stories/test_story/test_phrases.txt")

{1: 'Default',
 1731524665442: 'Swedish EAL',
 1731700590019: 'Custom study session',
 1732020971325: 'RapidRetention - Swedish - LM1000',
 1732309563077: 'RapidRetention - Dutch - LM1000',
 1732312948269: 'RapidRetention - German - LM1000',
 1732313960891: 'RapidRetention - Arabic - LM1000',
 1732314196963: 'RapidRetention - Spanish - LM1000',
 1732314413500: 'RapidRetention - Japanese - LM1000',
 1732316149591: 'RapidRetention - Russian - LM1000',
 1732316158895: 'RapidRetention - Basque - LM1000',
 1732316821915: 'RapidRetention - French - LM1000',
 1732316936163: 'RapidRetention - Italian - LM1000',
 1732460522330: 'RapidRetention - Persian - LM1000',
 1732465028917: 'RapidRetention - Mandarin Chinese - LM1000',
 1732637740663: 'RapidRetention - Welsh - LM1000',
 1732869083179: 'RapidRetention - Serbian - LM1000',
 1732980361514: 'RapidRetention - Russian - GCSE',
 1732993700879: 'Persian Alphabet',
 1733170456922: 'RapidRetention - Swedish - GCSE',
 1733171641992: 'RapidRetention 

In [10]:
from src.anki_tools import add_tag_to_matching_notes
deck_name = "RapidRetention - Swedish - LM1000"
updates, errors = add_tag_to_matching_notes(
    deck_name=deck_name,
    phrases=phrases,
    tag="story_community_park"
)

print(f"Updated {updates} notes")
if errors:
    print("Errors encountered:")
    for error in errors:
        print(f"- {error}")

audio-language-trainer\src\anki_tools.py:223:save() is deprecated: saving is automatic
Updated 88 notes
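
Conceptually, tagging works by matching each note's English text against the phrase list and appending the story tag when it is missing. The sketch below mimics that on plain dictionaries. It does not use the Anki API, and the note structure and `tag_matching_notes` helper are assumptions made for illustration.

```python
def tag_matching_notes(notes: list[dict], phrases: list[str], tag: str):
    """Add `tag` to notes whose EnglishText is in `phrases`; return (updates, errors)."""
    wanted = set(phrases)
    found = set()
    updates = 0
    for note in notes:
        text = note["EnglishText"]
        if text not in wanted:
            continue
        found.add(text)
        if tag not in note["tags"]:  # keep the operation idempotent
            note["tags"].append(tag)
            updates += 1
    errors = [f"No note found for phrase: {phrase}" for phrase in phrases if phrase not in found]
    return updates, errors


notes = [
    {"EnglishText": "Do you walk through the park every evening?", "tags": []},
    {"EnglishText": "I'm so happy for you", "tags": []},
]
updates, errors = tag_matching_notes(
    notes, ["Do you walk through the park every evening?", "Missing phrase"], "story_community_park"
)
print(updates, errors)  # 1 ['No note found for phrase: Missing phrase']
```

Running the same call twice leaves the notes unchanged the second time, so re-tagging a deck after adding phrases is safe.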


In [34]:



df_deck = get_deck_contents(deck_name)

In [36]:
df_deck.sort_values(by="knowledge_score", ascending=False)

Unnamed: 0,note_id,model_name,tags,n_cards,avg_ease,total_reps,avg_reps,total_lapses,avg_lapses,avg_interval,TargetText,TargetAudio,TargetAudioSlow,EnglishText,WiktionaryLinks,Picture,TargetLanguageName,knowledge_score
46,1732020511532,Language Practice With Images++,,3,280.0,9,3.0,0,0.0,53.3,Kommer de att lära oss hur man applicerar smink?,[sound:4442968e-cc2a-4c2d-ae1d-651ef4b60172.mp3],[sound:69bee3d7-f0ee-47b1-94e6-dc1ffc9ae57b.mp3],Are they going to teach us how to apply makeup?,"<a href=""https://en.wiktionary.org/wiki/kommer...","<img src=""5d11ecf0-dbff-4043-88b1-0902abc965a1...",Swedish,0.378
327,1732023524630,Language Practice With Images++,,3,280.0,10,3.3,0,0.0,54.3,Är du redo för festen ikväll?,[sound:06049362-4504-4f34-bdd9-d5213a75ea4c.mp3],[sound:0a402d7a-6360-48ee-953b-565b42469f37.mp3],Are you ready for the party tonight?,"<a href=""https://en.wiktionary.org/wiki/%C3%A4...","<img src=""f32df7ae-328e-4c07-b9a7-1f0d40f7e433...",Swedish,0.378
446,1732024523785,Language Practice With Images++,,3,280.0,10,3.3,0,0.0,53.0,Tänk på gapet mellan tåget och perrongen,[sound:2042fb3d-f849-48dd-be78-70d66190082b.mp3],[sound:08093457-1c8e-4935-90c3-3f7966a55e35.mp3],Mind the gap between the train and the platform,"<a href=""https://en.wiktionary.org/wiki/t%C3%A...","<img src=""4bc37b79-0848-4d89-95b4-1b99c740429f...",Swedish,0.376
36,1732020511492,Language Practice With Images++,story_community_park,3,280.0,10,3.3,0,0.0,52.3,Går du genom parken varje kväll?,[sound:074e3e18-114e-4fe9-9e46-785344942ee4.mp3],[sound:08dcb0de-4fbe-43c7-ab56-d56e4510c235.mp3],Do you walk through the park every evening?,"<a href=""https://en.wiktionary.org/wiki/g%C3%A...","<img src=""70d988fc-0496-4408-848d-e210e739dd47...",Swedish,0.375
280,1732023030318,Language Practice With Images++,,3,280.0,9,3.0,0,0.0,50.0,Kan du visa mig hur man organiserar detta?,[sound:d7cb2207-ec3a-46b9-848c-5fefcf21dfb3.mp3],[sound:d3822dc9-3adc-4599-9c35-a3ab95edebe3.mp3],Can you show me how to organize this?,"<a href=""https://en.wiktionary.org/wiki/kan#Sw...","<img src=""15a3928a-a2bf-4911-9f96-a04ffd04b070...",Swedish,0.373
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
350,1732024003622,Language Practice With Images++,,3,0.0,0,0.0,0,0.0,0.0,Jag är så glad för din skull,[sound:cb00e745-d723-4cdd-b63c-831b07b65c14.mp3],[sound:5a336bc0-e720-44b2-ba7e-ee1ec74bc42f.mp3],I'm so happy for you,"<a href=""https://en.wiktionary.org/wiki/jag#Sw...","<img src=""c1fbaea6-694f-4c71-abf1-e7115c2e7e0e...",Swedish,0.000
351,1732024003626,Language Practice With Images++,,3,0.0,0,0.0,0,0.0,0.0,Jag svarar snart på dina frågor,[sound:6f755ebe-31d2-4546-870d-4cde9b1ea7b8.mp3],[sound:c433d488-45ea-4185-ab30-bcda291bf880.mp3],I'll answer your questions soon,"<a href=""https://en.wiktionary.org/wiki/jag#Sw...","<img src=""0f11b583-c3af-4cd9-b2db-14add4f8ea10...",Swedish,0.000
353,1732024003634,Language Practice With Images++,,3,0.0,0,0.0,0,0.0,0.0,Har du övervägt alla dina alternativ?,[sound:8fd911c3-a7c1-4740-a47c-f7f74b2f7ded.mp3],[sound:a5005cff-614e-4d8a-888c-18bc6fbe249c.mp3],Have you considered all your options?,"<a href=""https://en.wiktionary.org/wiki/har#Sw...","<img src=""6a27e1e7-749e-4874-91ac-c4292ee63f17...",Swedish,0.000
354,1732024003638,Language Practice With Images++,,3,0.0,0,0.0,0,0.0,0.0,Är du orolig för ekonomin?,[sound:6a777f36-f9a5-471b-9ead-4395131666f6.mp3],[sound:996ecdbb-c233-4ecd-a784-0757ba94d88b.mp3],Are you worried about the economy?,"<a href=""https://en.wiktionary.org/wiki/%C3%A4...","<img src=""0cbc4ae9-ce87-40f7-adc3-f6a65d56d6bf...",Swedish,0.000
