
# Machine Learning Project 1
### Kiron Ang, DSC 3344, October 2024
---
Using the Python library *elevenlabs*, this file offers an initial look at applying machine learning tools to real-world problems. In this case, an AI text-to-speech tool produces audio files with lifelike speech for my elderly friend Jerry, who cannot read his emails without my assistance. Typically, he asks me to print his emails, but for an 84-year-old, continuously spending money on toner and paper is not financially wise. One other person consistently helps him print emails, but she charges Jerry excessively, while I help him for free. Unfortunately, I will graduate within the next two years and move away from Waco. Although he is far from feeble or blind, other ways of working around his lack of information technology skills must be explored. The ideal solution is free, uses minimal resources, and plays voice recordings through Jerry's landline phone.

I created this IPYNB file with Google Colab, an online web application for executing Python code on hosted runtime environments. Additionally, I had to create an account at elevenlabs.io and purchase the "Starter" plan for $5.33 to access voice cloning capabilities. In the present work, I provide my API key ("api_key.txt") and a recording of my voice ("kiron.mp3") to create a voice template on my account. This allows the *elevenlabs* package to generate speech audio that resembles my own. I then input different text samples into the *generate()* function to produce audio files. After presenting them to Jerry and receiving his feedback, I include a final evaluation at the end of this file.

In [None]:
# Output this hosted runtime's Python version
!python -V

# Install ElevenLabs Python Library: github.com/elevenlabs/elevenlabs-python
!pip install -U elevenlabs > output.txt
import elevenlabs
print("elevenlabs", elevenlabs.__version__)

# For documentation purposes, play the voice sample I provided
import IPython.display
print("IPython", IPython.__version__)
IPython.display.Audio("kiron.mp3")

# The only other input file that was generated externally was the
# TXT file containing my API key, but I cannot share its contents here
# for security reasons, since this IPYNB file is available
# publicly on my GitHub profile. It contains one line of plain text.

Python 3.10.12
elevenlabs 1.9.0
IPython 7.34.0


Once the *elevenlabs* library was installed and imported into the environment, I instantiated a client with my API key. I then used the client to request voice cloning with an audio file containing my voice. On its website, ElevenLabs claims that only a small amount of audio is needed to clone a voice:

> Clone a voice from a clean sample recording. Samples should contain 1 speaker and be over 1 minute long and not contain background noise...Sample quality is more important than quantity. Noisy samples may give bad results. Providing more than 5 minutes of audio in total brings little improvement.

The audio recording I made contains 1 minute and 45 seconds of my speech. The voice template created from this can be used to generate speech from English text. Ideally, the generated audio should sound as if I had recorded it myself with my own voice.

In [None]:
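# Read the API key obtained after signing up at elevenlabs.io,
# stripping the trailing newline that a one-line text file may contain
with open("api_key.txt") as key_file:
    api_key = key_file.read().strip()

# Create a client to use various voice generation methods
client = elevenlabs.client.ElevenLabs(api_key = api_key)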

# Clone my voice using the audio file I provided
client.clone(name = "Kiron Ang",
             files = ["kiron.mp3"],
             description = "This is my voice!")

Voice(voice_id='a4WQAfgl4PoXH3o6KEDY', name='Kiron Ang', samples=[VoiceSample(sample_id='7W20eZwuSRPwhSAl3C0Y', file_name='kiron.mp3', mime_type='audio/mpeg', size_bytes=1271211, hash='0771ee0139b4a4e01e7120be63f7c58b')], category='cloned', fine_tuning=FineTuningResponse(is_allowed_to_fine_tune=False, state={}, verification_failures=[], verification_attempts_count=0, manual_verification_requested=False, language=None, progress={}, message={}, dataset_duration_seconds=None, verification_attempts=None, slice_ids=None, manual_verification=None, finetuning_state=None), labels={}, description='This is my voice!', preview_url=None, available_for_tiers=[], settings=None, sharing=None, high_quality_base_model_ids=[], safety_control='NONE', voice_verification=VoiceVerificationResponse(requires_verification=False, is_verified=False, verification_failures=[], verification_attempts_count=0, language=None, verification_attempts=None), owner_id=None, permission_on_resource=None, is_legacy=False, is_
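
As a rough sanity check on the sample, the sketch below compares the stated duration (1 minute and 45 seconds) with the guidance quoted earlier and derives the average bitrate implied by the `size_bytes` value in the output above. The function names and the two thresholds (60 and 300 seconds, read off the quoted guidance) are my own illustrative assumptions and are not part of the *elevenlabs* library.

```python
# Hypothetical helpers; thresholds are assumptions read off the quoted guidance.
MIN_SECONDS = 60          # "over 1 minute long"
MAX_USEFUL_SECONDS = 300  # "more than 5 minutes ... brings little improvement"

def describe_sample(duration_seconds: float) -> str:
    """Classify a sample duration against the quoted guidance."""
    if duration_seconds <= MIN_SECONDS:
        return "too short"
    if duration_seconds > MAX_USEFUL_SECONDS:
        return "longer than needed"
    return "within guidance"

def average_bitrate_kbps(size_bytes: int, duration_seconds: float) -> float:
    """Average bitrate implied by a file size and duration, in kbit/s."""
    return size_bytes * 8 / duration_seconds / 1000

duration = 1 * 60 + 45  # 1 minute 45 seconds, as stated in the text
print(describe_sample(duration))
print(round(average_bitrate_kbps(1271211, duration), 1))
```

With the numbers reported in this notebook, the sample falls within the guidance and works out to roughly 97 kbit/s on average, which is plausible for an MP3 recording of speech.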

Then, I generate audio for several different emails that Jerry would normally receive. Jerry owns a Texas healthcare staffing company, where he is the only employee. The real-world problem here is therefore limited to generating speech for emails related to the business. I am the only one who uses the email account since Jerry cannot use a computer, so I have permission to reveal non-sensitive email content here.

Below, I adapted some text from an email I sent to one of our clients. I modified it to reflect the kind of voice message I would want to send to Jerry. After generating audio with this text, I save it to a separate file. This is important because generating audio consumes "credits" on my account, meaning that my speech generation is limited. As a result, I make sure to save everything to have a copy on my local machine for evaluation later on.

In [None]:
audio = client.generate(
    text = "Hi Jerry! I sent out an email to Kelli from Roswell, New Mexico. Hi Kelli, attached are three candidates for your review. Please let me know if you would like to contact them. Thank you for your time. Sincerely, Kiron Ang, Associate, The Lewis Group.",
    voice = "Kiron Ang"
)

elevenlabs.save(
    audio = audio,
    filename = "audio_1.wav"
)
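
Because every generation call consumes credits, one way to avoid paying twice for the same audio is to skip generation whenever the output file already exists. The sketch below is a minimal illustration of that idea; `save_once` and `make_audio` are hypothetical names, and I am assuming the audio arrives either as `bytes` or as an iterable of byte chunks.

```python
from pathlib import Path
from typing import Callable, Iterable, Union

AudioData = Union[bytes, Iterable[bytes]]

def save_once(filename: str, make_audio: Callable[[], AudioData]) -> bool:
    """Write audio to filename unless the file already exists.

    Returns True when make_audio was called and False when the
    previously saved file was reused.
    """
    path = Path(filename)
    if path.exists():
        return False  # reuse the saved copy, so no new generation happens
    audio = make_audio()
    # Accept either a single bytes object or a stream of byte chunks
    data = audio if isinstance(audio, bytes) else b"".join(audio)
    path.write_bytes(data)
    return True

# In this notebook, a call might look like this (not executed here):
# save_once("audio_1.wav",
#           lambda: client.generate(text = "...", voice = "Kiron Ang"))
```

Passing the generation step as a callable means it only runs when the file is missing, so rerunning the notebook would not spend additional credits on audio that was already saved.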

Below is an audio player to hear the result.

In [None]:
IPython.display.Audio("audio_1.wav")

I repeat the above process with a lightly modified email that a candidate sent to me.

In [None]:
audio = client.generate(
    text = "Hi Jerry! I received an email from Rose, a family nurse practitioner from Dallas, Texas. Good Afternoon, thank you for taking the time to speak with me today regarding FNP positions in Texas. Please find a copy of my resume attached. I look forward to working with your company. Best Regards, Rose C. Osuji.",
    voice = "Kiron Ang"
)

elevenlabs.save(
    audio = audio,
    filename = "audio_2.wav"
)

Below is an audio player to hear the result.

In [None]:
IPython.display.Audio("audio_2.wav")

I repeat the process three more times with various emails.

In [None]:
audio = client.generate(
    text = "Hi Jerry! I sent an email out to Okorie Vivian. Here it is: Hi Okorie Vivian! I work with Jerry Lewis at the Lewis Group, and one of our clients is looking for a nurse practitioner to work in Fort Worth, TX. Can you send me a PDF of your most updated resume so I can send it over to our client for review? Thank you for your time. Sincerely, Kiron Ang, Associate, The Lewis Group",
    voice = "Kiron Ang"
)

elevenlabs.save(
    audio = audio,
    filename = "audio_3.wav"
)

audio = client.generate(
    text = "Hi Jerry! I sent an email to your friend, Leonard Gruppo. This was the message I sent: Hi Mr. Leonard Gruppo! I work with Jerry Lewis at the Lewis Group and he really wants to help you find a new job. I wanted to reach out and ask if you want any help with your resume. There are certain strategies you can use to make it past the automated systems that most companies use these days. Thanks for being Jerry's friend. Sincerely, Kiron Ang, Associate, The Lewis Group",
    voice = "Kiron Ang"
)

elevenlabs.save(
    audio = audio,
    filename = "audio_4.wav"
)

audio = client.generate(
    text = "Hi Jerry! I got a reply from Leonard Gruppo. He said: Hello Mr. Ang! I’m sorry for the late reply but somehow your email went to my junk folder. Good thing I noticed it! Thank you so much for offering to help with my CV. I have not posted it on any of the job sites and did not plan to. I don't want to do that particularly. However, if Jerry wants me to, I’ll consider it. What does he want/need me to do? Peace, Len",
    voice = "Kiron Ang"
)

elevenlabs.save(
    audio = audio,
    filename = "audio_5.wav"
)
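
The three near-identical calls above could also be folded into a loop. The sketch below shows one way to do that; `synthesize_batch` and its `start` parameter are hypothetical, and I am assuming that `generate` and `save` accept the same keyword arguments used in this notebook (`text` and `voice`, then `audio` and `filename`).

```python
from typing import Callable, List, Sequence

def synthesize_batch(texts: Sequence[str],
                     generate: Callable,
                     save: Callable,
                     voice: str = "Kiron Ang",
                     start: int = 1) -> List[str]:
    """Generate and save one audio file per text and return the file names."""
    filenames = []
    for number, text in enumerate(texts, start = start):
        # Follow the audio_<number>.wav naming used in this notebook
        filename = f"audio_{number}.wav"
        audio = generate(text = text, voice = voice)
        save(audio = audio, filename = filename)
        filenames.append(filename)
    return filenames

# In this notebook, a call might look like this (not executed here):
# synthesize_batch(email_texts, client.generate, elevenlabs.save, start = 3)
```

Here `email_texts` stands for a list holding the three message strings, which keeps the long text separate from the generation logic.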

In [None]:
IPython.display.Audio("audio_3.wav")

In [None]:
IPython.display.Audio("audio_4.wav")

In [None]:
IPython.display.Audio("audio_5.wav")

As you can hear, the audio matches the pitch and timbre of my voice for the most part. Although the generated speech mispronounces my first and last name, it is easy to understand the messages in each audio file. At this point, I decided to play these audio files for Jerry and gather his opinion.

Jerry was quite amazed! He thought that the audio samples sounded realistic and similar to my actual voice. I asked him whether he would feel annoyed or uncomfortable hearing these recordings if he already knew that they were machine-generated. He said that he would have no problem, and he thought that processing email text and playing the audio files through the landline phone in his office was a great idea. Of course, implementing this solution would require using the Gmail API, which requires a Google account with Gmail enabled and a Google Cloud project; furthermore, I am not sure how I would send these files to his landline phone. Nevertheless, this prototype demonstration suggests that even people who are unfamiliar or uncomfortable with information technology may still be willing to adopt new technologies if familiar people help implement them.
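
To make the idea of processing email text more concrete, the sketch below turns a raw email into the kind of spoken script used earlier in this file ("Hi Jerry! I received an email from ..."). It relies only on Python's standard `email` package and does not touch the Gmail API. The function name `email_to_script`, the sample message, and the assumption that the email body is unencoded plain text are all mine, for illustration only.

```python
from email import message_from_string
from email.utils import parseaddr

def email_to_script(raw_email: str, listener: str = "Jerry") -> str:
    """Build a short spoken script from a raw plain-text email."""
    message = message_from_string(raw_email)
    # Prefer the sender's display name and fall back to the address
    name, address = parseaddr(message.get("From", ""))
    sender = name or address or "an unknown sender"
    if message.is_multipart():
        # Assumption: only unencoded text/plain parts are read aloud
        parts = [part.get_payload() for part in message.walk()
                 if part.get_content_type() == "text/plain"]
        body = " ".join(parts)
    else:
        body = message.get_payload()
    # Collapse line breaks and repeated spaces so the speech flows naturally
    body = " ".join(body.split())
    return f"Hi {listener}! I received an email from {sender}. {body}"

sample = (
    "From: Example Sender <sender@example.com>\n"
    "Subject: Resume\n"
    "\n"
    "Good afternoon,\n"
    "please find my resume attached.\n"
)
print(email_to_script(sample))
```

The returned string could then be passed as the `text` argument when generating audio, in place of the hand-edited messages used in this prototype.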

It should be noted that Jerry and I are close friends, so there may be a strong positive bias in his willingness to accept this solution. However, I do not believe that this weakens the potential of the *elevenlabs* technology, nor does it reduce the value of text-to-speech solutions for the elderly. It merely underscores that context is crucial when introducing new solutions, which often alienate those outside of the ideal user persona.

Additionally, younger listeners may recognize that the pacing of the speech generated here is unrealistic. However, elderly listeners, at least in my limited experience in the state of Texas, are quite prepared for semi-robotic voices for two reasons. First, their hearing has often deteriorated, making it harder for them to notice such small differences, especially when they are focused on understanding the meaning of the speech. Second, elderly people are more likely to make phone calls for any kind of task, meaning that they interact with robotic answering machines much more than the average person under the age of 30. Although those robotic voices are a major pain point in customer service operations ("Sorry, I did not understand you. Can you repeat your identification number?"), they may also present a golden opportunity for bridging the technology gap for some open-minded octogenarians.