
The Bottish Play #27

Open
greg-kennedy opened this issue Nov 19, 2023 · 3 comments
Comments


greg-kennedy commented Nov 19, 2023

The Bottish Play

A computer speech audio production of Shakespeare's "Macbeth"

Listen to the Final Production

Listen to the Encore

View the Repository

The Write-Up

While trying to generate a lot of junk audio to consume bandwidth against someone's NFT project, I once again fell into the rabbit hole of computer text-to-speech synthesis. War and Peace can easily be turned into a 47-hour read-a-thon, but upon reviewing it, I found a lot of quality issues that I thought I could fix up. "I need to take off the Gutenberg header. And the Unicode characters aren't being parsed correctly. How about trying to find quotes, and maybe change the voice there? In fact, I could start with a play, which already has speakers clearly defined..." Classic mistake: now I've tricked myself into a project.

This year, I'm working on an audio book. NaNoGenMo has had audiobook entries before (see Scam Likely and Gaimidian Graveyard), as well as NaOpGenMo, but this one is my take on William Shakespeare's play Macbeth. The goal is to use TTS software to turn Macbeth into a listenable audio version.

The first thing to do is to get a copy of Macbeth. I did begin looking at Gutenberg and other online libraries, but the idea of parsing out the text format to assign speeches to speakers seemed annoying. That's when I ran across the Folger Library's XML version. "Aha!" I thought. "They have hopefully already done the work for me".

TEI Format

It turns out their version follows the Text Encoding Initiative guidelines. This is an XML-based markup system for capturing text as it's printed, including page breaks, metadata, annotations, and much more. The idea is noble, in that it should be able to cover pretty much any kind of printed text, and indeed TEI formatted files are available in a lot of places. I even ran across TEI annotated newspaper clippings of 19th century London ghost stories!

Unfortunately, TEI does seem to be very application-specific in practice. The Folger texts have XML elements for every word, space, punctuation character, line break, sound effect, stage direction, speech, etc. They can be nested within one another as well. Quotation marks are a block, as are song titles, names, foreign words... I'm sure this is great for scholars or something, but I just ended up hacking together a parser that works on Macbeth and I will be extremely lucky if it works on any other Folger play, let alone a generic document.

<sp xml:id="sp-0001" who="#WITCHES.1_Mac">
   <speaker xml:id="spk-0001">
       <w xml:id="w0000200">FIRST</w>
       <c xml:id="c0000210"> </c>
       <w xml:id="w0000220">WITCH</w>
   </speaker>
   <ab xml:id="ab-0001">
       <lb xml:id="lb-00005"/>
       <milestone unit="ftln" xml:id="ftln-0001" n="1.1.1" ana="#verse" corresp="#w0000230 #c0000240 #w0000250 #c0000260 #w0000270 #c0000280 #w0000290 #c0000300 #w0000310 #c0000320 #w0000330 #p0000340"/>
       <w xml:id="w0000230" n="1.1.1">When</w>
       <c xml:id="c0000240" n="1.1.1"> </c>
       <w xml:id="w0000250" n="1.1.1">shall</w>
       <c xml:id="c0000260" n="1.1.1"> </c>
       <w xml:id="w0000270" n="1.1.1">we</w>
       <c xml:id="c0000280" n="1.1.1"> </c>
       <w xml:id="w0000290" n="1.1.1">three</w>
       <c xml:id="c0000300" n="1.1.1"> </c>
       <w xml:id="w0000310" n="1.1.1">meet</w>
       <c xml:id="c0000320" n="1.1.1"> </c>
       <w xml:id="w0000330" n="1.1.1">again</w>
       <pc xml:id="p0000340" n="1.1.1">?</pc>
       <lb xml:id="lb-00010"/>
       <milestone unit="ftln" xml:id="ftln-0002" n="1.1.2" ana="#verse" corresp="#w0000350 #c0000360 #w0000370 #p0000380 #c0000390 #w0000400 #p0000410 #c0000420 #w0000430 #c0000440 #w0000450 #c0000460 #w0000470 #p0000480"/>
       <w xml:id="w0000350" n="1.1.2">In</w>
       <c xml:id="c0000360" n="1.1.2"> </c>
       <w xml:id="w0000370" n="1.1.2">thunder</w>
       <pc xml:id="p0000380" n="1.1.2">,</pc>
       <c xml:id="c0000390" n="1.1.2"> </c>
       <w xml:id="w0000400" n="1.1.2">lightning</w>
       <pc xml:id="p0000410" n="1.1.2">,</pc>
       <c xml:id="c0000420" n="1.1.2"> </c>
       <w xml:id="w0000430" n="1.1.2">or</w>
       <c xml:id="c0000440" n="1.1.2"> </c>
       <w xml:id="w0000450" n="1.1.2">in</w>
       <c xml:id="c0000460" n="1.1.2"> </c>
       <w xml:id="w0000470" n="1.1.2">rain</w>
       <pc xml:id="p0000480" n="1.1.2">?</pc>
   </ab>
</sp>
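As a sketch of what the hacked-together parser has to do, here is a minimal Python extractor for an <sp> block like the one above. The helper name and the namespace-free handling are my own simplifications; real Folger files carry the TEI namespace and many more element types, so treat this as illustrative only.

```python
import xml.etree.ElementTree as ET

# Toy version of the <sp> structure shown above (real files also carry
# the TEI namespace, line breaks, milestones, stage directions, etc.)
SAMPLE = """<sp xml:id="sp-0001" who="#WITCHES.1_Mac">
  <speaker><w>FIRST</w><c> </c><w>WITCH</w></speaker>
  <ab>
    <w>When</w><c> </c><w>shall</w><c> </c><w>we</w><c> </c>
    <w>three</w><c> </c><w>meet</w><c> </c><w>again</w><pc>?</pc>
  </ab>
</sp>"""

def extract_speech(sp_xml):
    """Return (speaker_id, spoken_text) from one <sp> element."""
    sp = ET.fromstring(sp_xml)
    who = sp.get("who", "").lstrip("#")
    # Words (<w>), spaces (<c>), and punctuation (<pc>) are separate
    # elements; concatenating their text in document order rebuilds
    # the spoken line.  The <speaker> label is skipped entirely.
    parts = [el.text for el in sp.find("ab").iter()
             if el.tag in ("w", "c", "pc") and el.text]
    return who, "".join(parts).strip()

print(extract_speech(SAMPLE))
```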

That said, I did think the XML files were pretty neat, mainly for their metadata section: at the top are editor's notes, a detailed description of the format, information about printed editions and corrections, and even a detailed character list - which includes character names, relationships to one another, gender, groupings (the three murderers are in a Murderers group), and even the point in the text where each character dies! There are also "milestone" indicators which classify speeches as "verse" or "prose" - if I wanted to, I could use these to adjust the speech emphasis.

Anyway, with that done I need an output format.

SSML Format

The solution comes in the form of Speech Synthesis Markup Language, another XML flavor, but this one designed to feed into text-to-speech systems. It is a W3C standard, now implemented by a number of cloud-based TTS systems like Polly, Azure, Google Speech, etc. It also works with Windows' on-device Speech API, on both desktops and mobile devices.

Again, despite being a "standard", SSML has a lot of vendor-specific support and extensions. The broad outline is the same: a <speak> element containing <voice> definitions, <p> and <s> elements for paragraphs and sentences, <break> to add pauses, as well as inline hints such as <say-as> (to tell the synthesizer to read the digits of a number instead of the whole numeral), <emphasis> to mark inflection and volume changes, and <prosody> for general speech effects. You can even add <audio> to introduce a prerecorded sound file, or speak IPA phonemes directly! It's really quite flexible.

Still, a vendor may add attributes or features that don't work on other platforms. Azure TTS, for example, lets you add a "speaking style" (angry, newscast, whispering, sports_commentary) which does not work anywhere else. SAPI 5.3 on Windows is much more limited. It does, at least, support voice changes.

In fact, I found a Microsoft blog post where they go through carefully tagging the introductory scene of Macbeth for better replay.

<speak
version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
   <p>
       <s>When shall we three meet again
           <break/>
           <prosody rate="slow">in
               <emphasis level="moderate">thunder,</emphasis>
               <break time="200ms"/>lightning,
               <emphasis level="reduced">
                   <break time="200ms"/>or in rain?
               </emphasis>
           </prosody>
       </s>
       <s>When the
           <emphasis level="strong">hurlyburly’s</emphasis> done
       </s>
       <s>
           <break time="500ms"/>When the battle’s 
           <emphasis level="moderate">lost</emphasis> and won
       </s>
       <s>
           <break time="500ms"/>That will be ere 
           <break time="200ms"/>set of sun
       </s>
   </p>
   <p>
       <s>
           <break time="500ms"/>Where the place?
           <break time="250ms"/>
       </s>
       <s>
           <emphasis level="reduced">Upon the heath
               <break time="1s"/>
           </emphasis>
       </s>
       <s>There to meet
           <break time="500ms"/>with 
           <emphasis level="strong">Macbeth</emphasis>
       </s>
   </p>
</speak>

Pretty cool! Also, way more time-consuming than I want this to be. We'll stick with the defaults.

The Program

The core of this entry's first draft, then, is a tool to translate the Macbeth TEI into Microsoft-compatible SSML. You can find it in the repository (https://github.com/greg-kennedy/The-Bottish-Play), named tei2ssml.pl.

There is one additional .xml file needed to complete the translation: a voices.xml file, which maps the play's speakers to attributes of the SSML <voice> tag - in effect, casting the characters. I have provided one that uses all the available English voice packs for Windows 10, as well as narration by Microsoft Eva - a "hidden" TTS voice which is an early version of Cortana, and can be re-enabled with some registry tweaks. It's a quick mapping and doesn't quite work: you can't make a "male" voice speak with a "female" affect, so the genders get swapped (though this works on other systems). I also think the "age" attribute doesn't function, though it should accept values of 10, 15, 30, or 65 according to the documentation. But it's OK for a first draft.
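In spirit, the casting step looks something like this sketch (in Python rather than the Perl of tei2ssml.pl; the CAST dictionary, speaker ids, and attribute names here are illustrative, not the actual contents of voices.xml):

```python
from xml.sax.saxutils import escape

# Hypothetical casting table: speaker id -> attributes for the SSML
# <voice> tag.  The real voices.xml may use different ids and names.
CAST = {
    "WITCHES.1_Mac": {"name": "Microsoft Zira Desktop"},
    "Macbeth_Mac":   {"gender": "male", "age": "30"},
}

def voice_wrap(who, text):
    """Wrap one speech in a <voice> element chosen by the casting table."""
    attrs = "".join(f' {k}="{v}"' for k, v in CAST.get(who, {}).items())
    return f"<voice{attrs}><s>{escape(text)}</s></voice>"

body = voice_wrap("WITCHES.1_Mac", "When shall we three meet again?")
ssml = ('<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        + body + "</speak>")
print(ssml)
```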

Performing the SSML

Once I have an output XML file, a short PowerShell sequence will invoke the TTS engine and render it to an output .wav file.

Add-Type -AssemblyName System.Speech                  # load the .NET speech assembly
$Speech = New-Object System.Speech.Synthesis.SpeechSynthesizer
$Text = Get-Content -Path "out.xml" -Raw              # read the SSML document as a single string
$Speech.SetOutputToWaveFile("output.wav")             # render to disk instead of the speakers
$Speech.SpeakSsml($Text)                              # synchronous; returns when the file is done

Et voilà! We have a fully spoken play, with separately voiced characters! And you can listen to it here:

https://youtu.be/NyBwvhex4dk

(PowerShell can also speak plain text instead of SSML, with $Speech.Speak("Hello, world!") - useful in a pinch if you need some audio and have no other tools available.)

Epilogue

Of Shakespeare's plays, I like Macbeth the most, not least because it is a lot shorter than Hamlet and the rest. That's a major drawback for NaNoGenMo, though, where 50,000 words are needed to clear the bar. I consider the SSML file to be the "novel", in that it is essentially a "script" as one would use for a play, except formatted for computers instead of humans. Even so, the online word counter clocks in at only 26,199 words.

There is, of course, only one solution: at the end of the play, the cast goes on for an encore performance of "Cats" :P

That said, the month is barely half finished - and I have ideas of how to continue this further! Stay tuned...

hugovk added the preview label Nov 19, 2023

greg-kennedy commented Dec 1, 2023

Click to watch: The Robot Community Theatre Presents: "Macbeth"

What?

Despite proving that Macbeth can be turned into an audiobook, I find the result lacking: Microsoft's built-in voices don't cover most of the characters, and they're a little monotonous. Really, there's nothing to the final construction other than concatenating spoken phrases together. Why should they all come from this one particular speech synth?

The process is "simple": take the cast members, identify a speech-synthesizer (modern, retro, weird, whatever), record the lines using the synth, and assemble them together.

Welcome to my personal hell!

The Cast

  • Three Witches: These are some of the few (and important!) female characters of the novel. Due to their weïrd nature, I felt it appropriate to cast some similarly "out there" voices for the roles.
    • First Witch: Jessie (TikTok Text-to-Speech). It's pretty incredible how this particular voice has worked its way into the public consciousness through its ubiquitous application to viral videos seen by millions of people. There's just something about her bright, cheery attitude that is in stark contrast to the usual dispassionate readouts of other systems. To get these phrases, I used the unofficial TikTok TTS interface built by weilbyte.
    • Second Witch: Alexa (Amazon AWS). Alexa and co. are the modern incarnation of the sci-fi "home assistant robot voice", and her soothing voice is always ready to assist your toilet paper ordering needs. The Alexa Skills Kit, normally used to enhance Alexa with third-party skills, also includes a test feature for creating arbitrary speech. Alexa's success highlights the importance of "brand recognition" in TTS, in a way we didn't necessarily see before. They're big business! In fact, Amazon won't let Alexa swear.
    • Third Witch: Heather (CereProc TTS). CereProc is a commercial offering that exemplifies the state-of-the-art in TTS, from around the early 2010s... prior to deep language models which upended other programmatic techniques. I used their Live Demo to collect spoken sentences for this project. Compared to the other two witches, this is a mundane choice, but I wanted to include it for a few reasons:
      • Heather is CereProc's flagship voice - unusual, as most other contemporaries focused on male voices first instead.
      • "Deep fakes" make headlines as a very recent worry, but CereProc demonstrates "voice cloning" going back for many years. Their site has an Obama example, and with Internet Archive I was able to locate a fake George W. Bush clip they created in 2006 - that's 17 years ago.
      • We think about TTS as an assistive technology for the vision-impaired; however, our unique voice is a part of us, and when that is lost, TTS can step in and help here too! CereProc was involved in efforts to synthesize Roger Ebert's voice from past recordings after he lost his voice due to surgery, as well as for NFL player Steve Gleason, who was diagnosed with MND. Those at risk of losing speech now sometimes record themselves in advance, anticipating that those recordings will eventually be fed into a system that talks like they did before.
  • Hecate: Queen Elizabeth II (deep neural network model). Training some RNN on public clips and having it speak new phrases has drastically reduced both the expertise and time needed to produce a voice clone from scratch. In 2020, 15.ai launched to the public, providing free access to a voice synthesizer trained on popular characters like Twilight Sparkle and Spongebob Squarepants. The site was a tremendous hit, and though it has since been dormant for over a year, it proved the concept - while a lot of other sites were happy to follow its footsteps. I was able to use one, powered by Tacotron 2, which had been trained on clips of the late Queen Elizabeth II. A curious thing about these models is that they are not stable: running the same text snippet multiple times may produce different enunciations on output. This method of text-to-speech has proven so powerful and effective that legacy TTS companies have scrambled to add similar offerings, while other companies have launched a "make your own voice" service - allowing users to construct a brand new speaking TTS voice from scratch in minutes, and then using it for Youtube voiceovers or the like.
  • Spirits: Vocoder (Librivox recordings). The Spirits, otherworldly apparitions, are voiced here by actual humans. I retrieved a Librivox production of Macbeth and trimmed out the Spirit speeches, then used a Vocoder effect (Audacity) on each to produce a different ghostly sound. Vocoders are not strictly TTS systems; they pair an analysis half (a filter bank that identifies the frequencies in the input sound) with a "voder" half that replays those frequencies on a carrier waveform - and the standalone Voder is loosely considered the first electronic speech synthesizer. Vocoders are useful not just for making robot voices (see Kraftwerk, Afrika Bambaataa) but also as a tool for speech compression: "encoded" speech can be sent over lower-bandwidth links, then decoded back into a listenable simulacrum of the original speaker's voice.
  • King Duncan, and his sons Malcolm and Donalbain, share a TTS family here.
    • Duncan: S.A.M. (Don't Ask Software, Commodore 64). SAM, the "Software Automated Mouth", was one of the very first all-software voice synthesizers. Released in 1982 for various 6502-based home computers, S.A.M. produced barely-comprehensible speech from a floppy disk and nothing else - where other systems needed a dedicated TTS IC. Once loaded to memory, you could have S.A.M. recite words or phonemes, or call him from BASIC programs to add speech to your games. Don't Ask later rebranded as SoftVoice, providing a version of the engine to the Amiga Workbench built-in TTS program. I think SAM is great: he has a definite energy and a tendency to wail on vowels. Instead of digging out my Commodore 64, I found that there is a reverse-engineered C version for *nix which works on the command line and can output .wav files.
    • Malcolm: Fred (Apple Macintalk 3). It's true! SoftVoice provided their TTS engine to Apple, who used an Apple-ified version for their infamous 1984 Macintosh unveiling in which Bruce says "It sure is great to get out of that bag!" Later versions of Macintalk, released along with Mac OS versions, continued to improve on the speech. Macintalk 3 for OS 8 contains a number of voices, of which "Fred" was the default. You might recognize him from "Fitter, Happier" by Radiohead. I used Basilisk II to emulate Mac OS 8 and Simpletext to speak the lines, which I then recorded with Audacity.
    • Donalbain: Junior (Apple Macintalk 3). Junior was another voice included with Macintalk 3. Typical of Apple at the time, the software shipped with a number of novelty voices, like a hoarse man, singing robots, non-stop laughter during speech, and underwater bubble noises. Junior demonstrates using TTS software to make a "child" voice instead.
  • Macbeth, Lady Macbeth: This pairing uses one synthesizer with different voice selections.
    • Macbeth: Perfect Paul (DECTalk DTC-01). DECTalk launched their all-in-one speech synthesizer in 1983 and it immediately found success: controllable by serial port, and with built-in phone jacks, it jumpstarted all manner of automated systems without a human involved. DECTalk (used to) read weather reports on NOAA radio, serve phone menus, announce stock quotes, etc. Later, DECTalk's IP was made available through other systems as a software synthesizer. Its most famous user, the one and only Stephen Hawking, found it invaluable as an assistive technology - allowing him to "speak" using his wheelchair-mounted computer. In contrast to voice cloning (above), Hawking instead found the robotic voice of the DECTalk to be "his" new voice, and identified so strongly with it that after testing a new TTS voice in 2004 from NeoSpeech (a much more natural-sounding voice), he quickly reverted to his former DECTalk which he used for the rest of his life.
    • Lady Macbeth: Beautiful Betty (DECTalk DTC-01). Another stock voice on the DTC-01. Dennis Klatt, speech synthesis pioneer and creator of the DTC-01, based its original voices on his own family, with his wife providing the inspiration for Beautiful Betty. As DECTalk is a softsynth now, someone has graciously provided a web-based version, which I used to get the samples for this. The DECTalk legacy lives on. John Madden!
  • Macduff, Lady Macduff, and Macduff's Son used various iterations of Microsoft's TTS prior to Windows 10.
    • Macduff: Microsoft SAM (Windows XP). Microsoft introduced its own native TTS system for Windows 2000, though most users first experienced it with XP at home. It isn't very good, but it was everywhere, and its creaky sound is memorable all the same. In addition to the TTS voice itself, Windows 2000 introduced a "Speech API" component, which developers could use to add the OS-level speech facilities to their apps, or to swap in any SAPI-compatible voice.
    • Lady Macduff: Microsoft Anna (Windows 7). Anna was the default voice for Windows 7, and also heralded the introduction of Speech API 5.0. It provided a more realistic voice than SAM, though still clearly stilted and choppy.
    • Macduff's Son: Bonzi Buddy (Windows XP). Everyone's favorite purple monkey desktop assistant and/or spyware app, Bonzi Buddy - technically a "Microsoft Agent", like Clippy - used a slightly tweaked Microsoft SAPI 4 voice. Here, he plays the role of Macduff's Son, cracking wise and getting stabbed. All these voices were recorded using an emulated Windows XP VM and a TTS reader that could dump wave files for later.
  • Assorted Thanes and Lords used a modern Microsoft voice instead.
    • Lennox: Microsoft Sean (Windows 10).
    • Ross: Microsoft George (Windows 10).
    • Angus: Microsoft David Desktop (Windows 10).
    • Menteith: Microsoft James (Windows 10).
    • Caithness: Microsoft Mark (Windows 10).
    • Lord: Microsoft Ravi (Windows 10). Details of these recordings are in the first post above.
  • Siward and Young Siward share the same synthesizer.
    • Siward: MBROLA Roger (eSpeak, Linux).
    • Young Siward: MBROLA us2 (eSpeak, Linux). eSpeak is a Linux TTS engine with its own built-in voices and text-to-phoneme rule sets. It represents a FOSS approach to TTS, and is often called by GUI screen readers, so users may customize the defaults to speak differently. eSpeak's built-in voices are very bad (see below), but eSpeak can also act as a front-end to other systems. Here, I use the freely available MBROLA voices, with eSpeak as an interface, to provide voices for Siward and his son.
  • Banquo and Fleance share the same synthesizer.
    • Banquo: CMU Arctic Alan (Festival, Linux).
    • Fleance: CMU James (Festival, Linux). Festival is not so much a speech synthesizer as it is a "speech development toolkit". Written in Scheme (or its C++ variant "flite"), it supports a lot of different methods of generation, including the MBROLA voices (above). It's also notoriously hard to work with: though I thought I had everything set up correctly, the output includes noticeable artifacts, which the web demo does not suffer from. The voices used here are from the CMU Arctic data set, a list of phrases which a reader records, and which can then be turned into a voice using the tagged information. The two examples here are "awb" and "jmk" from FestVox: Alan (a Scottish accent) and James (Canadian English), rendered with the built-in text2wave tool.
  • Seyton: K.K. Slider (Animal Crossing, Nintendo Gamecube). Nintendo's Animal Crossing features tremendous amounts of in-game text and conversation, and to prevent these from being just boring speech bubbles it includes an extremely rudimentary text-to-speech system. There are only 26 phonemes - just the letters being read directly - which are concatenated to make a sound. It's unintelligible in the US version, although over time you may get used to certain words sounding a certain way, or to the pauses in speech that become familiar. Later versions improve on the technique, while the Japanese version has always sounded different / more understandable (as the individual Japanese characters, when concatenated, do sound like actual words much of the time). Getting this audio required me to use the Dolphin emulator, edit the script for K.K. Slider's introduction speech, and record it to disk. Because it included background music, I also recorded a version where K.K. Slider just pauses and waits, then inverted this recording and added it to the first to isolate the speech alone.
    K.K. Slider says "The Queen, my lord, is dead."
  • Murderers: Votrax SC-01. The Votrax SC-01 is an all-in-one speech synthesis IC released in 1980. Votrax, formerly Federal Screw Works, did important early work in computer speech. Their chip made its way into arcade machines like Gorf, Wizard of Wor, and Q*BERT - especially funny, as the !@#$ speech bubbles are voiced by simply playing random phonemes off the SC-01! - but its only real use in home computers was as an addition to the Apple II "Mockingboard" sound card, alongside the AY-3-8910 for music and sfx. It has a uniquely nasal tone for the time. Unfortunately, I could not get MAME to emulate the Mockingboard, and so resorted to Plogue Chipspeech for these audio clips. Each murderer uses a different pitch and speed. Incidentally, Votrax collaborated with the Naval Research Lab in 1976 to produce an influential paper on using rules to turn English text into phonemes - a necessity when a pronunciation dictionary simply would not fit on contemporary machines - and, in turn, some form of this paper's rules made their way into a number of other speech synthesizers of the '80s.
  • Doctors: Dr. Sbaitso (Creative Labs, MS-DOS). Dr. Sbaitso was a demonstration program released for the Sound Blaster IBM-PC sound card. It bolted a primitive Eliza-like chatbot to a speech synthesizer, so that the doctor would "talk" back to you. A simple idea that proved immensely popular, as people still hold a lot of nostalgia for booting up this old program and hearing its deep voice intone "PARITY ERROR." Creative Labs did not actually build the TTS engine for this: it was produced by First Byte Software and named "Monologue" or, later (earlier?), SmoothTalker. Sbaitso comes with only the male voice, but it is possible to adjust its parameters, and so I have recorded DOSBox output to voice both doctors in the play. I think Sbaitso has a lot of personality: Monologue has a tendency to really ramp the pitch up or down based on punctuation, leading to a very melodic recitation, but by default the voice is deep and bassy.
  • Porter and Gentlewoman, two servants who attend the house of Macbeth
    • Porter: Mike (AT&T Natural Voices).
    • Gentlewoman: Crystal (AT&T Natural Voices). For a time, in the early '00s, when the best you could get on a computer was Microsoft SAM or Macintalk Fred, commercial TTS offerings were far and away superior to your home computer's speech. AT&T Natural Voices, designed for automated phone speech, was a leading product and a great example of what speech synthesis could attain in its particular time period - before better techniques became available. As time moved on, AT&T lost interest in the product and sold it off, so it now lives on as a product of various resellers - who offer live demos, which I could use to capture the few sentences I needed without paying :P
  • Soldiers: ST-SPEECH 2.0 (Atari 520ST). The Amiga shipped with SoftVoice, and not to be left out, Atari provided their own program for TTS: "ST-SPEECH". More a proof of concept than a serious attempt at a TTS engine, this small program churns out harsh and trashy syllables on the Atari 16-bit line. There were other systems available later (e.g. SmoothTalker was ported here), but, ST-SPEECH lived on in its own way: as Atari was already a popular music making machine due to its inclusion of MIDI ports (Cubase began life here), naturally electronic musicians looking for some computer vocals would pull up the readily available ST-SPEECH.TOS program. U-96 "Das Boot" from Germany provides a fantastic example with the ST shouting "TECHNO!" and "MAXIMUM VELOCITY!". I used the Hatari emulator to boot the program and capture wave files of the samples.
  • Servants: Texas Instruments LPC (TI-99/4a). Texas Instruments produced a series of "LPC" (linear predictive coding) speech chips in the 1980s, perhaps best known as the chip behind the original Speak-and-Spell toy. One of their many models was built into the optional Voice Synthesizer add-on for TI-99/4a, enabling the computer to speak to users. The original design called for a ROM containing word-to-phoneme mappings, and so there is an empty space in the module for swapping ROM chips. Later, the release of the "Terminal Emulator II" cartridge made this unnecessary, as it included text-to-phoneme rules in software instead. TI chips have found their way into some very odd places, like the Chrysler "Electronic Voice Alert", where it would say things to the driver like "Please close your passenger door." and "Your engine oil pressure is critical." The Classic99 emulator can emulate the speech synthesizer; I used short BASIC programs to set the voice parameters and then speak each line.
  • Messengers: eSpeak (Linux). All Messengers use the built-in eSpeak voices, with various pronunciations. As mentioned above, eSpeak can serve as a front-end to other speech synths, but in this case I let the (bad) defaults do the talking instead.
  • Old Man: General Instrument SP0256-AL2. I saved this choppy old chip for last. The SP0256 was a dedicated speech synthesis chip line, of which many variants came mask-programmed with ROMs containing words for speaking. The Intellivoice add-on had a particular collection of video-game phrases that could be triggered, though only a few games used it. The AL2 version contains only 59 allophones, and can be triggered by a microcomputer (or similar) to string them together to make words and phrases. Radio Shack sold this device under the name "The Orator", and computer magazines of the time included articles on how to interface with it - like this "CheepTalk" mod from A.N.A.L.O.G. Magazine for Atari 8-bit computers. I hold a particular fondness for this device, awful as it is: my dad built a CheepTalk for Atari when I was very young, and used it to add speech to some of his BASIC programs - a dice game that would call you a "wimp" if you didn't bet much, etc. He also coded a program to help me learn the alphabet, with some art done by my mom. Pressing a key would trigger speech and an animation, like "The T is for Train" and so on. Ah, memories.
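One technique in the cast list above deserves a sketch: the music-cancellation trick used to isolate K.K. Slider's voice. If the two emulator takes are sample-aligned, inverting the music-only take and mixing it with the speech+music take cancels the music. A toy demonstration on plain sample lists (real audio would be .wav sample arrays, and getting the takes aligned is the hard part):

```python
# Sketch of the cancellation trick: record the scene once with speech
# plus music, once with music alone, then invert the music-only take
# and mix.  Toy integer samples stand in for real audio frames.
music    = [3, -2, 5, 1]                     # music-only recording
speech   = [1,  0, -1, 2]                    # what we want to recover
mixed    = [m + s for m, s in zip(music, speech)]   # speech + music take
inverted = [-m for m in music]               # phase-inverted music take
isolated = [x + i for x, i in zip(mixed, inverted)]
print(isolated)                              # matches the speech samples
```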

The last speaker in the text... is me! I recorded myself saying ~200 words of stage directions, then used sox to put the words back together into composite phrases for all directions and act / scene announcements. This is also a common technique for phone menus, etc. A "talking clock" can be made by just recording a handful of words and triggering them correctly.
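The splicing step can be approximated with Python's stdlib wave module in place of sox (the helper names, frame counts, and filename here are mine, purely for illustration): concatenating the raw frames of same-format clips, in order, is all a composite phrase is.

```python
import io
import wave

def make_clip(nframes):
    """Create an in-memory mono 16-bit wav with nframes silent frames
    (a stand-in for a recorded word)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(8000)
        w.writeframes(b"\x00\x00" * nframes)
    buf.seek(0)
    return buf

def concat(clips, out_path):
    """Glue same-format clips end to end, like `sox a.wav b.wav out.wav`."""
    with wave.open(out_path, "wb") as out:
        for i, clip in enumerate(clips):
            with wave.open(clip, "rb") as w:
                if i == 0:
                    out.setparams(w.getparams())
                out.writeframes(w.readframes(w.getnframes()))

words = [make_clip(100), make_clip(200), make_clip(50)]
concat(words, "phrase.wav")
with wave.open("phrase.wav", "rb") as w:
    print(w.getnframes())   # the three clips end to end
```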


greg-kennedy commented Dec 1, 2023

More details

A big part of the last couple days' effort was listening to the whole play end-to-end and trying to clean up common mispronunciations. Most of the speakers stumbled over Shakespeare's (mis)use of apostrophes, as in 't for "it" (which readers will say "TEE"), or i' for "in", 's for "his", etc. Some regex and re-runs got most of the issues shaken out, but a few snuck in... Witch Three says "Listen but not speak to tee" about the spirits, and towards the end Macduff reads "That way the noise is. Tyrant, show thy face!" as "That way the noise Island Tyrant, ..." which is a very funny choice. I find the older pronunciation quirks charming, especially Duncan talking about "nobleness" (knob LEE ness). Pronunciation dictionaries in newer software mostly sort this out, but homographs still throw them off sometimes ("he lives here" vs. "their lives are cut short"). But that's true of me too...! I read the word list aloud and pronounced "sewer" as in "the underground pipe system", when the text actually means "a person who sews (clothing)", whoops.
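A sketch of the kind of regex pass described here (the patterns are my guesses, not the project's actual substitution list, and Shakespeare's elisions have plenty of edge cases these would miss):

```python
import re

# Expand a couple of the elided forms that trip up the synthesizers.
# An apostrophe is not a word character, so lookarounds stand in for
# \b on the apostrophe side of each pattern.
ELISIONS = [
    (r"(?<!\w)'t\b", "it"),      # "to 't"  -> "to it"
    (r"(?<!\w)i'(?=\s)", "in"),  # "i' th'" -> "in th'"
]

def expand_elisions(line):
    for pattern, repl in ELISIONS:
        line = re.sub(pattern, repl, line)
    return line

print(expand_elisions("And fall'n i' th' sere; come to 't"))
```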

Sound effects are provided by a selection from Windows 95 "Plus!" themes (Musica, Leonardo da Vinci, etc) as well as some music from Interplay's "Castles" and Maxis "SimCity 2000". This is the weakest part of the project, I think: I wish that I had more time to come up with something more fitting here. It would have been great to have e.g. Hatsune Miku sing for the musical cues, but I did run out of time, and the sound portions are not the most important ones anyway.

For the video, I took screenshots of the Folger Library "Macbeth" and then wrote a Processing sketch to scroll through it. Each act/scene is synchronized, such that the start and end match the complete scene. What happens in between could be anything - the scene where the Witches summon the Spirits runs off the page for a bit, until some of the longer-winded speakers take long enough to bring it back on the screen.

Encore! Encore!!

Placing every spoken word in the play into a file (including introduction, credits, and stage directions) and running wc script.txt gives a paltry 17,827 words. As before, the cast is happy to take up the slack by putting on a special production of Cats. Rather than spend 2 hours on individual meows, they've worked on improving their efficiency. All 43 cast members will individually say "meow" as fast as they can, simultaneously, which clears the remaining 32,173 words in a mere eight minutes.
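For the curious, the arithmetic behind the encore (the per-cat rate at the end is my own back-of-the-envelope figure, not from the post):

```python
# NaNoGenMo target minus the script's word count leaves the meow debt,
# split evenly across the 43 simultaneous cast members.
target, script_words, cast = 50_000, 17_827, 43
meows_needed = target - script_words        # 32,173 meows total
per_cat = meows_needed / cast               # ~748 meows per cast member
rate_per_minute = per_cat / 8               # pace needed to finish in 8 min
print(meows_needed, round(per_cat), round(rate_per_minute))
```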

Epilogue

This project was fun, and a huge success! I learned a lot about text-to-speech software! If I never hear a robot speak again, it will be too soon! Augh!!

@bibliotechy

I love this so much!
