<a href="https://colab.research.google.com/github/sidhusmart/CoRise_Prompt_Design_Course/blob/cohort2/Week_0/CoRise_Week0_StudentVersion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this project, I'll be developing an LLM app that efficiently summarizes podcast episodes, identifies guests, and highlights key points. Special thanks to Sidharth Ramachandran from Uplimit for providing valuable guidance throughout this endeavor.

notes: this notebook is designed to be run on Google Colab

# The Problem
Sidharth, a devoted podcast enthusiast, values the format for its in-depth insights into various industries and technologies, learning from global experiences. However, due to time constraints, he can only listen to a select few. Subscribed to several engaging podcasts, which release 1-2 episodes weekly, he struggles to pinpoint episodes of personal interest. Although many offer show notes, links, and timestamps, they fall short in truly capturing the episode's essence and sparking his curiosity. How can he make finding and enjoying podcasts easier and more enjoyable?

# Solution
I want to create a custom weekly newsletter summarizing new podcast episodes. It'll feature guest info, key topics, and highlights. Users provide a list of RSS feeds, and regularly, it processes the latest episodes to create the newsletter. This serves as a week-in-review, offering enough detail for users to choose which episodes to listen to.

# Approach
The steps to build this product can be divided into three parts:

    Part 1: use a Large Language Model (LLM) from OpenAI to develop the information extraction functionality, coupled with a Speech to Text model for transcribing the podcast
    Part 2: utilize a straightforward cloud deployment provider to seamlessly convert the information extraction function for on-demand use - this will serve as the app backend
    Part 3: develop and deploy a front-end that enables users to experience the end-to-end functionality

# Part 1: Podcast transcription and information extraction
Step 1 - Retrieve the audio file using the RSS feed of the podcast. 

In [1]:
!pip install feedparser

Collecting feedparser
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/81.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.1/81.1 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting sgmllib3k (from feedparser)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... [?25l[?25hdone
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6047 sha256=c74dfffac96b0b3d7c1db4e10a02e83da9301711b50f41776b23b6ccb12c014c
  Stored in directory: /root/.cache/pip/wheels/f0/69/93/a47e9d621be168e9e33c7ce60524393c0b92ae83cf6c6e89c5
Successfully built sgmllib3k
Installing collected packages: sgmllib3k, feedparser
Successfully installed feedparser-6.0.10 sgmllib3k-1.0.0


In [5]:
import feedparser
podcast_feed_url = "http://feeds.feedburner.com/TEDTalks_audio"
podcast_feed = feedparser.parse(podcast_feed_url)

In [6]:
print ("The number of podcast entries is ", len(podcast_feed.entries))

The number of podcast entries is  182


In [7]:
# download the mp3 file and save it on Google Colab
for item in podcast_feed.entries[0].links:
  if (item['type'] == 'audio/mpeg'):
    episode_url = item.href
!wget -O 'podcast_episode.mp3' {episode_url}

--2023-08-26 17:08:31--  https://dts.podtrac.com/redirect.mp3/download.ted.com/talks/StuartKauffman_2023.mp3?apikey=172BB350-0207
Resolving dts.podtrac.com (dts.podtrac.com)... 3.211.155.0, 44.207.102.56, 54.84.2.247
Connecting to dts.podtrac.com (dts.podtrac.com)|3.211.155.0|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://download.ted.com/talks/StuartKauffman_2023.mp3?apikey=172BB350-0207 [following]
--2023-08-26 17:08:31--  https://download.ted.com/talks/StuartKauffman_2023.mp3?apikey=172BB350-0207
Resolving download.ted.com (download.ted.com)... 54.172.44.16, 52.206.157.182
Connecting to download.ted.com (download.ted.com)|54.172.44.16|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://chtbl.com/track/48D18/https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/70/9fef2426-7491-4929-8a59-e48ff4a882c2/StuartKauffman_2023_VO_Intro.mp3 [following]
--2023-08-26 17:08:31--  https://chtbl.com/track/48D18/https://dt

Step 2 - transcribing the audio file

Here I will use Whisper as the speech-to-text model. The model can be freely downloaded and used directly. I will use the medium model to transcribe the downloaded podcast.

In [8]:
!pip install git+https://github.com/openai/whisper.git  -q

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m12.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for openai-whisper (pyproject.toml) ... [?25l[?25hdone


In [9]:
%%time

import pathlib
import whisper
# Perform download only once and save to Network storage
model_path = pathlib.Path("/content/podcast/medium.pt")
if model_path.exists():
  print ("Model has been downloaded, no re-download necessary")
else:
  print ("Starting download of Whisper Model")
  whisper._download(whisper._MODELS["medium"], '/content/podcast/', False)

Starting download of Whisper Model


100%|██████████████████████████████████████| 1.42G/1.42G [00:12<00:00, 119MiB/s]


CPU times: user 8.9 s, sys: 4.15 s, total: 13 s
Wall time: 25.6 s


In [10]:
# Load model from saved location
model = whisper.load_model('medium', device='cuda', download_root='/content/podcast/')

In [11]:
# transcribing
%%time
result = model.transcribe("/content/podcast_episode.mp3")

CPU times: user 1min 49s, sys: 550 ms, total: 1min 49s
Wall time: 2min 2s


In [12]:
# Check the transcription happened correctly by peeking into the first 500 characters
podcast_transcript = result['text']
result['text'][:500]

" TED Audio Collective You're listening to TED Talks Daily. I'm Elise Hulme. Stuart Kaufman founded an idea called the adjacent possible. It's a mathematical theory that helps us understand, well, what's possible. In his talk from TED 2023, he explains the science behind deducing what happens next after the break. Support for TED Talks Daily comes from better help. Gosh, there are so many forks in the road, so many times in my life where I have felt uncertain whether it was just entering adulthoo"

In [13]:
podcast_transcript_1 = podcast_transcript

" TED Audio Collective You're listening to TED Talks Daily. I'm Elise Hulme. Stuart Kaufman founded an idea called the adjacent possible. It's a mathematical theory that helps us understand, well, what's possible. In his talk from TED 2023, he explains the science behind deducing what happens next after the break. Support for TED Talks Daily comes from better help. Gosh, there are so many forks in the road, so many times in my life where I have felt uncertain whether it was just entering adulthood, coming out of college and not knowing what to do with my life, or at midlife. You know, after I had a family and a husband and children and really wasn't sure what I wanted to do with the rest of my life. So whether you're dealing with decisions around your career or relationships or anything else, something I've always turned to as an adult is therapy. Therapy helps you stay connected to what you really want while you navigate life. Trusting yourself to make decisions that align with your v

Step 3 - Creating a summary of the podcast

I will ask the LLM (`gpt-3.5-turbo`) from OpenAI to generate the summary.

In [15]:
!pip install openai
!pip install tiktoken

Collecting openai
  Downloading openai-0.27.9-py3-none-any.whl (75 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.5/75.5 kB[0m [31m1.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: openai
Successfully installed openai-0.27.9


In [16]:
import openai
from getpass import getpass

openai.api_key = getpass('Enter the OpenAI API Key in the cell  ')

Enter the OpenAI API Key in the cell  ··········


In [17]:
# confirming that the API key works by listing all the OpenAI models
models = openai.Model.list()
for model in models["data"]:
  print (model["root"])

davinci
text-davinci-001
text-search-curie-query-001
gpt-3.5-turbo
babbage
text-babbage-001
curie-instruct-beta
davinci-similarity
code-davinci-edit-001
text-similarity-curie-001
ada-code-search-text
gpt-3.5-turbo-0613
text-search-ada-query-001
gpt-3.5-turbo-16k-0613
babbage-search-query
ada-similarity
text-curie-001
gpt-3.5-turbo-16k
text-search-ada-doc-001
text-search-babbage-query-001
code-search-ada-code-001
curie-search-document
davinci-002
text-search-davinci-query-001
text-search-curie-doc-001
babbage-search-document
babbage-002
babbage-code-search-text
text-embedding-ada-002
davinci-instruct-beta
davinci-search-query
text-similarity-babbage-001
text-davinci-002
code-search-babbage-text-001
text-davinci-003
text-search-davinci-doc-001
code-search-ada-text-001
ada-search-query
text-similarity-ada-001
ada-code-search-code
whisper-1
text-davinci-edit-001
davinci-search-document
curie-search-query
babbage-similarity
ada
ada-search-document
text-ada-001
text-similarity-davinci-001
cu

**Context Window**

This is the maximum of the combined text that can be used in one API call to the gpt-3.5-turbo model. It is not only a combination of the input text sent to the model but also takes into consideration the output response as well. Also keep in mind that this is measured in terms of tokens and not words. While we could treat them as analogous, it's technically not the same as one word may actually be broken down into multiple tokens.

In [18]:
# check the number of tokens in the text
import tiktoken
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
print ("Number of tokens in input prompt ", len(enc.encode(podcast_transcript)))

Number of tokens in input prompt  2363


The podcast turns out to have only 2363 tokens, lower than the 4096 tokens that is accepted by the default gpt-3.5-turbo model. However, for future proof, I will use the larger model that has a context size of 16,384 tokens.

In [19]:
instructPrompt = """
You are an expert copywriter who is responsible for publishing a newsletter with hundreds of thousands of subscribers. You recently listened to a great podcast and want to share a summary of it with your readers. Please write the summary of this podcast in a concise and engaging way, use bullet points or numbered lists if necessary..
The transcript of the podcast is provided below
"""

request = instructPrompt + podcast_transcript

In [20]:
chatOutput = openai.ChatCompletion.create(model="gpt-3.5-turbo-16k",
                                            messages=[{"role": "system", "content": "You are a helpful assistant."},
                                                      {"role": "user", "content": request}
                                                      ]
                                            )

In [21]:
podcastSummary = chatOutput.choices[0].message.content
podcastSummary

'Summary of the podcast:\n\n- Stuart Kaufman founded the concept of the adjacent possible, a mathematical theory that explains what is possible in the future.\n- The biosphere has been evolving for billions of years, creating new possibilities through jury rigging and recombination.\n- The theory of the adjacent possible (TAP) suggests that things can be combined to create new things, leading to an exponential growth of possibilities.\n- This pattern of slow progress followed by a burst of innovation can be seen in the Cambrian explosion, as well as in human evolution and economic growth.\n- As technology and innovation continue to accelerate, the waiting time for new discoveries and inventions is being cut in half.\n- The podcast highlights the importance of finding better adjacent possibilities, especially in solving environmental challenges like climate change and soil degradation.\n- Utilizing fungal bacterial communities and implementing sustainable practices, such as composting a

step 4 - extract additional information to provide additional context on the episode

I will use the function calling capability of the OpenAI API to ensure the output from the API is as structured as possible. It is needed since I will pass the extracted name, org, and title of the podcast guest to a function, to find his/her information in Wikipedia.

In [29]:
request = podcast_transcript[:10000]
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
print ("Number of tokens in input prompt ", len(enc.encode(request)))

Number of tokens in input prompt  2235


In [30]:
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",                              # using the non 16k model to save cost
    messages=[{"role": "user", "content": request}],
    functions=[
    {
        "name": "get_podcast_guest_information",
        "description": "Get information on the podcast guest using their full name and the name of the organization they are part of to search for them on Wikipedia or Google",
        "parameters": {
            "type": "object",
            "properties": {
                "guest_name": {
                    "type": "string",
                    "description": "The full name of the guest who is speaking in the podcast",
                },
                "guest_organization": {
                    "type": "string",
                    "description": "The full name of the organization that the podcast guest belongs to or runs",
                },
                "guest_title": {
                    "type": "string",
                    "description": "The title, designation or role of the podcast guest in their organization",
                },
            },
            "required": ["guest_name"],
        },
    }
],
function_call={"name": "get_podcast_guest_information"}
)

In [31]:
import json

podcast_guest = ""
podcast_guest_org = ""
podcast_guest_title = ""
response_message = completion["choices"][0]["message"]
if response_message.get("function_call"):
  function_name = response_message["function_call"]["name"]
  function_args = json.loads(response_message["function_call"]["arguments"])
  podcast_guest=function_args.get("guest_name")
  podcast_guest_org=function_args.get("guest_organization")
  podcast_guest_title=function_args.get("guest_title")

In [32]:
print (podcast_guest)
print (podcast_guest_org)
print (podcast_guest_title)

Stuart Kaufman
None
None


In [33]:
if podcast_guest_org is None:
  podcast_guest_org = ""
if podcast_guest_title is None:
  podcast_guest_title = ""

In [26]:
!pip install wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... [?25l[?25hdone
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11678 sha256=57c04dcd44fecd8c35efe58b2a2f3b790013cbf6bb9b45c0d7536c4f3eeb6f7b
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [27]:
import wikipedia
input = wikipedia.page(podcast_guest, auto_suggest=False)

In [34]:
input = wikipedia.page(podcast_guest + " " + podcast_guest_org + " " + podcast_guest_title, auto_suggest=True)

In [35]:
input.summary

'Charles Stuart Kaufman (; born November 19, 1958) is an American filmmaker and novelist. He wrote the films Being John Malkovich (1999), Adaptation (2002), and Eternal Sunshine of the Spotless Mind (2004). He both wrote and directed the films Synecdoche, New York (2008), Anomalisa (2015), and I\'m Thinking of Ending Things (2020). In 2020, Kaufman made his literary debut with the release of his first novel, Antkind.\nOne of the most celebrated screenwriters of his era, Kaufman has received an Academy Award, three BAFTA Awards, two Independent Spirit Awards, and a Writers Guild of America Award. Film critic Roger Ebert called Synecdoche, New York "the best movie of the decade" in 2009. Three of Kaufman\'s scripts appear in the Writers Guild of America\'s list of the 101 greatest movie screenplays ever written.\n\n'

Step 5: Extract the highlights of the podcast

In this step, I want to extract some key moments in the podcast. These are typically interesting insights from the guest or critical questions that the host might have put forward. It could also be a discussion on a hot topic or controversial opinion.

In [36]:
instructPrompt = """
You are a podcast editor and producer. You are provided with the transcript of a podcast episode and have to identify the 5 most significant moments in the podast as highlights.
- Each highlight needs to be a statement by one of the podcast guests
- Each highlight has to be impactful and an important takeaway from this podcast episode
- Each highlight must be concise and make listeners want to hear more about why the podcast guest said that
- The highlights that you pick must be spread out throughout the episode

Provide only the highlights and nothing else. Prodive the full sentence of the highlight and format it as follows:

- Highlight 1 of the podcast
- Highlight 2 of the podcast
- Highlight 3 of the podcast
"""

request = instructPrompt + podcast_transcript

In [37]:
chatOutput = openai.ChatCompletion.create(model="gpt-3.5-turbo-16k",
                                            messages=[{"role": "system", "content": "You are a helpful assistant."},
                                                      {"role": "user", "content": request}
                                                      ]
                                            )

In [38]:
chatOutput.choices[0].message.content

'- Highlight 1: "We cannot deduce what is in the adjacent possible that the evolving biosphere will create then become. We do not even know what can happen."\n- Highlight 2: "Therefore, this process, the TAP process has the property that for a long time, the number of things increases very, very, very slowly. Then something stunning happens. There\'s a hockey stick explosion and the number of things reaches infinity in a finite time."\n- Highlight 3: "Once you\'ve made a bow, a crossbow is in the adjacent possible. The pattern that we saw in the Cambrian of a long period, nothing happening and then a burst, is here right now."\n- Highlight 4: "The TAP process has the following property. Every time you make something new, the waiting time for the next new thing is cut in half."\n- Highlight 5: "We need to find a better adjacent possible. We\'re rampaging over the planet. And the hope is in soils."'

In [39]:
podcastHighlights = chatOutput.choices[0].message.content

# Part 2: On-demand information extraction

Here I will build the back-end service. I will package the information extraction steps from previous part into an on-demand cloud function. The goal is to have this as the back-end service that can process an RSS feed provided by the user, perform the necessary steps and return the final output with all the extracted information.

first, encapsulate the podcast retrieval and transcription step into a function.

In [40]:
!pip install feedparser
!pip install git+https://github.com/openai/whisper.git  -q
!pip install requests

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


In [41]:
def get_transcribe_podcast(rss_url, local_path):
  print ("Starting Podcast Transcription Function")
  print ("Feed URL: ", rss_url)
  print ("Local Path:", local_path)

  # Read from the RSS Feed URL
  import feedparser
  intelligence_feed = feedparser.parse(rss_url)
  for item in intelligence_feed.entries[0].links:
    if (item['type'] == 'audio/mpeg'):
      episode_url = item.href
  episode_name = "podcast_episode.mp3"
  print ("RSS URL read and episode URL: ", episode_url)

  # Download the podcast episode by parsing the RSS feed
  from pathlib import Path
  p = Path(local_path)
  p.mkdir(exist_ok=True)

  print ("Downloading the podcast episode")
  import requests
  with requests.get(episode_url, stream=True) as r:
    r.raise_for_status()
    episode_path = p.joinpath(episode_name)
    with open(episode_path, 'wb') as f:
      for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)

  print ("Podcast Episode downloaded")

  # Load the Whisper model
  import os
  import whisper
  print ("Download and Load the Whisper model")
  model = whisper.load_model("medium")
  print (model.device)

  # Perform the transcription
  print ("Starting podcast transcription")
  result = model.transcribe(local_path + episode_name)

  # Return the transcribed text
  print ("Podcast transcription completed, returning results...")
  return result

In [42]:
output = get_transcribe_podcast("http://feeds.feedburner.com/TEDTalks_audio", "/content/podcast/")

Starting Podcast Transcription Function
Feed URL:  http://feeds.feedburner.com/TEDTalks_audio
Local Path: /content/podcast/
RSS URL read and episode URL:  https://dts.podtrac.com/redirect.mp3/download.ted.com/talks/StuartKauffman_2023.mp3?apikey=172BB350-0207&prx_url=https://chtbl.com/track/48D18/https://dovetail.prxu.org/70/9fef2426-7491-4929-8a59-e48ff4a882c2/StuartKauffman_2023_VO_Intro.mp3
Downloading the podcast episode
Podcast Episode downloaded
Download and Load the Whisper model


100%|██████████████████████████████████████| 1.42G/1.42G [00:14<00:00, 105MiB/s]


cuda:0
Starting podcast transcription
Podcast transcription completed, returning results...


In [43]:
# checking the transcription to make sure that the function worked
output['text'][:500]

" TED Audio Collective You're listening to TED Talks Daily. I'm Elise Hulme. Stuart Kaufman founded an idea called the adjacent possible. It's a mathematical theory that helps us understand, well, what's possible. In his talk from TED 2023, he explains the science behind deducing what happens next after the break. Support for TED Talks Daily comes from better help. Gosh, there are so many forks in the road, so many times in my life where I have felt uncertain whether it was just entering adulthoo"

Step 1 - Create a cloud transcription function

I will make use of [Modal Labs](https://modal.com/), a service that allows conversion of any python function to run on-demand in the cloud. It supports the use of GPUs which is important in the transcription step.

In [44]:
!pip install modal

Collecting modal
  Downloading modal-0.51.3192-py3-none-any.whl (1.2 kB)
Collecting modal-client==0.51.3192 (from modal)
  Downloading modal_client-0.51.3192-py3-none-any.whl (283 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m284.0/284.0 kB[0m [31m4.5 MB/s[0m eta [36m0:00:00[0m
Collecting aiostream (from modal-client==0.51.3192->modal)
  Downloading aiostream-0.4.5-py3-none-any.whl (35 kB)
Collecting asgiref (from modal-client==0.51.3192->modal)
  Downloading asgiref-3.7.2-py3-none-any.whl (24 kB)
Collecting fastapi (from modal-client==0.51.3192->modal)
  Downloading fastapi-0.102.0-py3-none-any.whl (66 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m66.0/66.0 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting grpclib==0.4.3 (from modal-client==0.51.3192->modal)
  Downloading grpclib-0.4.3.tar.gz (62 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.1/62.1 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[?25h

In [45]:
# granting access to Modal labs
!modal token new --source corise > authenticationURL.txt

In [46]:
import getpass
import subprocess

def set_modal_token():
  token_id = getpass.getpass('Please enter your Modal token ID in the cell: ')
  token_secret = getpass.getpass('Please enter your Modal token secret in the cell:  ')

  # Using subprocess to execute the command
  subprocess.run(f"!modal token set --token-id (token_id) --token-secret (token_secret)", shell=True)

In [47]:
set_modal_token()

Please enter your Modal token ID in the cell: ··········
Please enter your Modal token secret in the cell:  ··········


changing the existing transcription function to adapt it to run in the cloud using modal.

In [48]:
%%writefile /content/podcast/podcast_backend.py
import modal

def download_whisper():
  # Load the Whisper model
  import os
  import whisper
  print ("Download the Whisper model")

  # Perform download only once and save to Container storage
  whisper._download(whisper._MODELS["medium"], '/content/podcast/', False)


stub = modal.Stub("corise-podcast-project")
corise_image = modal.Image.debian_slim().pip_install("feedparser",
                                                     "https://github.com/openai/whisper/archive/9f70a352f9f8630ab3aa0d06af5cb9532bd8c21d.tar.gz",
                                                     "requests",
                                                     "ffmpeg").apt_install("ffmpeg").run_function(download_whisper)

@stub.function(image=corise_image, gpu="any")
def get_transcribe_podcast(rss_url, local_path):
  print ("Starting Podcast Transcription Function")
  print ("Feed URL: ", rss_url)
  print ("Local Path:", local_path)

  # Read from the RSS Feed URL
  import feedparser
  intelligence_feed = feedparser.parse(rss_url)
  for item in intelligence_feed.entries[0].links:
    if (item['type'] == 'audio/mpeg'):
      episode_url = item.href
  episode_name = "podcast_episode.mp3"
  print ("RSS URL read and episode URL: ", episode_url)

  # Download the podcast episode by parsing the RSS feed
  from pathlib import Path
  p = Path(local_path)
  p.mkdir(exist_ok=True)

  print ("Downloading the podcast episode")
  import requests
  with requests.get(episode_url, stream=True) as r:
    r.raise_for_status()
    episode_path = p.joinpath(episode_name)
    with open(episode_path, 'wb') as f:
      for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)

  print ("Podcast Episode downloaded")

  # Load the Whisper model
  import os
  import whisper

  # Load model from saved location
  print ("Load the Whisper model")
  model = whisper.load_model('medium', device='cuda', download_root='/content/podcast/')

  # Perform the transcription
  print ("Starting podcast transcription")
  result = model.transcribe(local_path + episode_name)

  # Return the transcribed text
  print ("Podcast transcription completed, returning results...")
  return result

@stub.local_entrypoint()
def main(url, path):
  output = get_transcribe_podcast.call(url, path)
  print (output['text'])

Writing /content/podcast/podcast_backend.py


In [49]:
# invoke the function from the command line
!modal run /content/podcast/podcast_backend.py --url http://feeds.feedburner.com/TEDTalks_audio --path /content/podcast/

[?25l[34m⠋[0m Initializing...[2K[32m✓[0m Initialized. [37mView app at [0m[4;37mhttps://modal.com/apps/ap-aX7R5Ix1hq9FFKbdfYdh9I[0m
[2K[34m⠋[0m Initializing...
[2K[34m⠸[0m Creating objects...
[37m├── [0m[34m⠋[0m Creating get_transcribe_podcast...
[2K[1A[2K[1A[2K[34m⠦[0m Creating objects...
[37m├── [0m[34m⠸[0m Creating get_transcribe_podcast...
[37m├── [0m[32m🔨[0m Created mount /content/podcast/podcast_backend.py
[37m├── [0m[32m🔨[0m Created download_whisper.
[2K[1A[2K[1A[2K[1A[2K[1A[2K[34m⠦[0m Creating objects...
[37m├── [0m[32m🔨[0m Created get_transcribe_podcast.
[37m├── [0m[32m🔨[0m Created mount /content/podcast/podcast_backend.py
[37m├── [0m[32m🔨[0m Created download_whisper.
[37m└── [0m[32m🔨[0m Created mount /content/podcast/podcast_backend.py
[1A[2K[1A[2K[1A[2K[1A[2K[1A[2K[32m✓[0m Created objects.
[37m├── [0m[32m🔨[0m Created get_transcribe_podcast.
[37m├── [0m[32m🔨[0m Created mount /content/podcast

Step 2 - Create a cloud information extraction function

updating the backend

In [52]:
%%writefile /content/podcast/podcast_backend.py
import modal

def download_whisper():
  # Load the Whisper model
  import os
  import whisper
  print ("Download the Whisper model")

  # Perform download only once and save to Container storage
  whisper._download(whisper._MODELS["medium"], '/content/podcast/', False)


stub = modal.Stub("corise-podcast-project")
corise_image = modal.Image.debian_slim().pip_install("feedparser",
                                                     "https://github.com/openai/whisper/archive/9f70a352f9f8630ab3aa0d06af5cb9532bd8c21d.tar.gz",
                                                     "requests",
                                                     "ffmpeg",
                                                     "openai",
                                                     "tiktoken",
                                                     "wikipedia",
                                                     "ffmpeg-python").apt_install("ffmpeg").run_function(download_whisper)

@stub.function(image=corise_image, gpu="any", timeout=600)
def get_transcribe_podcast(rss_url, local_path):
  print ("Starting Podcast Transcription Function")
  print ("Feed URL: ", rss_url)
  print ("Local Path:", local_path)

  # Read from the RSS Feed URL
  import feedparser
  intelligence_feed = feedparser.parse(rss_url)
  podcast_title = intelligence_feed['feed']['title']
  episode_title = intelligence_feed.entries[0]['title']
  episode_image = intelligence_feed['feed']['image'].href
  for item in intelligence_feed.entries[0].links:
    if (item['type'] == 'audio/mpeg'):
      episode_url = item.href
  episode_name = "podcast_episode.mp3"
  print ("RSS URL read and episode URL: ", episode_url)

  # Download the podcast episode by parsing the RSS feed
  from pathlib import Path
  p = Path(local_path)
  p.mkdir(exist_ok=True)

  print ("Downloading the podcast episode")
  import requests
  with requests.get(episode_url, stream=True) as r:
    r.raise_for_status()
    episode_path = p.joinpath(episode_name)
    with open(episode_path, 'wb') as f:
      for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)

  print ("Podcast Episode downloaded")

  # Load the Whisper model
  import os
  import whisper

  # Load model from saved location
  print ("Load the Whisper model")
  model = whisper.load_model('medium', device='cuda', download_root='/content/podcast/')

  # Perform the transcription
  print ("Starting podcast transcription")
  result = model.transcribe(local_path + episode_name)

  # Return the transcribed text
  print ("Podcast transcription completed, returning results...")
  output = {}
  output['podcast_title'] = podcast_title
  output['episode_title'] = episode_title
  output['episode_image'] = episode_image
  output['episode_transcript'] = result['text']
  return output

@stub.function(image=corise_image, secret=modal.Secret.from_name("my-openai-secret"))
def get_podcast_summary(podcast_transcript):
  import openai
  instructPrompt = """
  You are an expert copywriter who is responsible for publishing a newsletter with hundreds of thousands of subscribers. You recently listened to a great podcast and want to share a summary of it with your readers. Please write the summary of this podcast in a concise and engaging way, use bullet points or numbered lists if necessary..
  The transcript of the podcast is provided below
  """

  request = instructPrompt + podcast_transcript
  chatOutput = openai.ChatCompletion.create(model="gpt-3.5-turbo-16k",
                                            messages=[{"role": "system", "content": "You are a helpful assistant."},
                                                      {"role": "user", "content": request}
                                                      ]
                                            )

  podcastSummary = chatOutput.choices[0].message.content
  return podcastSummary

@stub.function(image=corise_image, secret=modal.Secret.from_name("my-openai-secret"))
def get_podcast_guest(podcast_transcript):
  import openai
  import wikipedia
  import json
  request = podcast_transcript[:10000]
  completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": request}],
    functions=[
    {
        "name": "get_podcast_guest_information",
        "description": "Get information on the podcast guest using their full name and the name of the organization they are part of to search for them on Wikipedia or Google",
        "parameters": {
            "type": "object",
            "properties": {
                "guest_name": {
                    "type": "string",
                    "description": "The full name of the guest who is speaking in the podcast",
                },
                "guest_organization": {
                    "type": "string",
                    "description": "The full name of the organization that the podcast guest belongs to or runs",
                },
                "guest_title": {
                    "type": "string",
                    "description": "The title, designation or role of the podcast guest in their organization",
                },
            },
            "required": ["guest_name"],
        },
    }
  ],
  function_call={"name": "get_podcast_guest_information"}
    )
  podcast_guest = ""
  podcast_guest_org = ""
  podcast_guest_title = ""
  response_message = completion["choices"][0]["message"]
  if response_message.get("function_call"):
    function_name = response_message["function_call"]["name"]
    function_args = json.loads(response_message["function_call"]["arguments"])
    podcast_guest=function_args.get("guest_name")
    podcast_guest_org=function_args.get("guest_organization")
    podcast_guest_title=function_args.get("guest_title")

  return podcast_guest, podcast_guest_org, podcast_guest_title

@stub.function(image=corise_image, secret=modal.Secret.from_name("my-openai-secret"))
def get_podcast_highlights(podcast_transcript):
  import openai
  instructPrompt = """
  You are a podcast editor and producer. You are provided with the transcript of a podcast episode and have to identify the 5 most significant moments in the podcast as highlights.
  - Each highlight needs to be a statement by one of the podcast guests
  - Each highlight has to be impactful and an important takeaway from this podcast episode
  - Each highlight must be concise and make listeners want to hear more about why the podcast guest said that
  - The highlights that you pick must be spread out throughout the episode

  Provide only the highlights and nothing else. Prodive the full sentence of the highlight and format it as follows:

  - Highlight 1 of the podcast
  - Highlight 2 of the podcast
  - Highlight 3 of the podcast


  """
  request = instructPrompt + podcast_transcript
  chatOutput = openai.ChatCompletion.create(model="gpt-3.5-turbo-16k",
                                            messages=[{"role": "system", "content": "You are a helpful assistant."},
                                                      {"role": "user", "content": request}
                                                      ]
                                            )
  chatOutput.choices[0].message.content
  podcastHighlights = chatOutput.choices[0].message.content
  return podcastHighlights

@stub.function(image=corise_image, secret=modal.Secret.from_name("my-openai-secret"), timeout=1200)
def process_podcast(url, path):
  output = {}
  podcast_details = get_transcribe_podcast.call(url, path)
  podcast_summary = get_podcast_summary.call(podcast_details['episode_transcript'])
  podcast_guest = get_podcast_guest.call(podcast_details['episode_transcript'])
  podcast_highlights = get_podcast_highlights.call(podcast_details['episode_transcript'])
  output['podcast_details'] = podcast_details
  output['podcast_summary'] = podcast_summary
  output['podcast_guest'] = podcast_guest
  output['podcast_highlights'] = podcast_highlights
  return output

@stub.local_entrypoint()
def test_method(url, path):
  output = {}
  podcast_details = get_transcribe_podcast.call(url, path)
  print ("Podcast Summary: ", get_podcast_summary.call(podcast_details['episode_transcript']))
  print ("Podcast Guest Information: ", get_podcast_guest.call(podcast_details['episode_transcript']))
  print ("Podcast Highlights: ", get_podcast_highlights.call(podcast_details['episode_transcript']))

Overwriting /content/podcast/podcast_backend.py


Running the function with the local_entrypoint to check that the entire information extraction works.

In [53]:
!modal run /content/podcast/podcast_backend.py --url http://feeds.feedburner.com/TEDTalks_audio --path /content/podcast/

[?25l[34m⠋[0m Initializing...[2K[32m✓[0m Initialized. [37mView app at [0m[4;37mhttps://modal.com/apps/ap-r86UHymz2Nh6GzJdk68Lv1[0m
[2K[34m⠋[0m Initializing...
[2K[34m⠸[0m Creating objects...
[37m├── [0m[34m⠋[0m Creating get_transcribe_podcast...
[37m└── [0m[34m⠋[0m Creating mount /content/podcast/podcast_backend.py: Uploaded 0/0 inspected
[2K[1A[2K[1A[2K[1A[2K[34m⠦[0m Creating objects...
[37m├── [0m[34m⠸[0m Creating get_transcribe_podcast...
[37m├── [0m[32m🔨[0m Created mount /content/podcast/podcast_backend.py
[37m├── [0m[34m⠋[0m Creating download_whisper...
[2K[1A[2K[1A[2K[1A[2K[1A[2K[34m⠏[0m Creating objects...
[37m├── [0m[32m🔨[0m Created get_transcribe_podcast.
[37m├── [0m[32m🔨[0m Created mount /content/podcast/podcast_backend.py
[37m├── [0m[32m🔨[0m Created download_whisper.
[37m├── [0m[32m🔨[0m Created mount /content/podcast/podcast_backend.py
[2K[1A[2K[1A[2K[1A[2K[1A[2K[1A[2K[34m⠹[0m Creating obje

In [54]:
# deploying the function
!modal deploy /content/podcast/podcast_backend.py

[2K[34m⠸[0m Creating objects...
[37m├── [0m[34m⠋[0m Creating get_transcribe_podcast...
[37m└── [0m[34m⠋[0m Creating mount /content/podcast/podcast_backend.py: Uploaded 0/0 inspected
[2K[1A[2K[1A[2K[1A[2K[34m⠦[0m Creating objects...
[37m├── [0m[34m⠸[0m Creating get_transcribe_podcast...
[37m├── [0m[32m🔨[0m Created mount /content/podcast/podcast_backend.py
[37m├── [0m[34m⠋[0m Creating download_whisper...
[2K[1A[2K[1A[2K[1A[2K[1A[2K[34m⠏[0m Creating objects...
[37m├── [0m[32m🔨[0m Created get_transcribe_podcast.
[37m├── [0m[32m🔨[0m Created mount /content/podcast/podcast_backend.py
[37m├── [0m[32m🔨[0m Created download_whisper.
[37m├── [0m[32m🔨[0m Created mount /content/podcast/podcast_backend.py
[2K[1A[2K[1A[2K[1A[2K[1A[2K[1A[2K[34m⠹[0m Creating objects...
[37m├── [0m[32m🔨[0m Created get_transcribe_podcast.
[37m├── [0m[32m🔨[0m Created mount /content/podcast/podcast_backend.py
[37m├── [0m[32m🔨[0m Created down

In [63]:
# Trying to call the deployed function from another python session
import modal
f = modal.Function.lookup("corise-podcast-project", "process_podcast")
output = f.call('https://5minute.libsyn.com/rss', '/content/podcast/')

<ipython-input-63-ba568c7c10d7>:4: DeprecationError: 2023-08-16: `f.call(...)` is deprecated. It has been renamed to `f.remote(...)`
  output = f.call('https://5minute.libsyn.com/rss', '/content/podcast/')


In [64]:
import json
with open("/content/podcast/podcast-3.json", "w") as outfile:
  json.dump(output, outfile)

# Part 3: Deploying the front-end application

I choose to go with a streamlit application for the front-end for the ease of deployment using the Streamlit Share.

In [57]:
%%writefile /content/podcast/podcast_frontend.py
import streamlit as st
import modal
import json
import os

def main():
    st.title("Newsletter Dashboard")

    available_podcast_info = create_dict_from_json_files('.')

    # Left section - Input fields
    st.sidebar.header("Podcast RSS Feeds")

    # Dropdown box
    st.sidebar.subheader("Available Podcasts Feeds")
    selected_podcast = st.sidebar.selectbox("Select Podcast", options=available_podcast_info.keys())

    if selected_podcast:

        podcast_info = available_podcast_info[selected_podcast]

        # Right section - Newsletter content
        st.header("Newsletter Content")

        # Display the podcast title
        st.subheader("Episode Title")
        st.write(podcast_info['podcast_details']['episode_title'])

        # Display the podcast summary and the cover image in a side-by-side layout
        col1, col2 = st.columns([7, 3])

        with col1:
            # Display the podcast episode summary
            st.subheader("Podcast Episode Summary")
            st.write(podcast_info['podcast_summary'])

        with col2:
            st.image(podcast_info['podcast_details']['episode_image'], caption="Podcast Cover", width=300, use_column_width=True)

        # Display the podcast guest and their details in a side-by-side layout
        col3, col4 = st.columns([3, 7])

        with col3:
            st.subheader("Podcast Guest")
            st.write(podcast_info['podcast_guest']['name'])

        with col4:
            st.subheader("Podcast Guest Details")
            st.write(podcast_info["podcast_guest"]['summary'])

        # Display the five key moments
        st.subheader("Key Moments")
        key_moments = podcast_info['podcast_highlights']
        for moment in key_moments.split('\n'):
            st.markdown(
                f"<p style='margin-bottom: 5px;'>{moment}</p>", unsafe_allow_html=True)

    # User Input box
    st.sidebar.subheader("Add and Process New Podcast Feed")
    url = st.sidebar.text_input("Link to RSS Feed")

    process_button = st.sidebar.button("Process Podcast Feed")
    st.sidebar.markdown("**Note**: Podcast processing can take upto 5 mins, please be patient.")

    if process_button:

        # Call the function to process the URLs and retrieve podcast guest information
        podcast_info = process_podcast_info(url)

        # Right section - Newsletter content
        st.header("Newsletter Content")

        # Display the podcast title
        st.subheader("Episode Title")
        st.write(podcast_info['podcast_details']['episode_title'])

        # Display the podcast summary and the cover image in a side-by-side layout
        col1, col2 = st.columns([7, 3])

        with col1:
            # Display the podcast episode summary
            st.subheader("Podcast Episode Summary")
            st.write(podcast_info['podcast_summary'])

        with col2:
            st.image(podcast_info['podcast_details']['episode_image'], caption="Podcast Cover", width=300, use_column_width=True)

        # Display the podcast guest and their details in a side-by-side layout
        col3, col4 = st.columns([3, 7])

        with col3:
            st.subheader("Podcast Guest")
            st.write(podcast_info['podcast_guest']['name'])

        with col4:
            st.subheader("Podcast Guest Details")
            st.write(podcast_info["podcast_guest"]['summary'])

        # Display the five key moments
        st.subheader("Key Moments")
        key_moments = podcast_info['podcast_highlights']
        for moment in key_moments.split('\n'):
            st.markdown(
                f"<p style='margin-bottom: 5px;'>{moment}</p>", unsafe_allow_html=True)

def create_dict_from_json_files(folder_path):
    json_files = [f for f in os.listdir(folder_path) if f.endswith('.json')]
    data_dict = {}

    for file_name in json_files:
        file_path = os.path.join(folder_path, file_name)
        with open(file_path, 'r') as file:
            podcast_info = json.load(file)
            podcast_name = podcast_info['podcast_details']['podcast_title']
            # Process the file data as needed
            data_dict[podcast_name] = podcast_info

    return data_dict

def process_podcast_info(url):
    f = modal.Function.lookup("corise-podcast-project", "process_podcast")
    output = f.call(url, '/content/podcast/')
    return output

if __name__ == '__main__':
    main()

Writing /content/podcast/podcast_frontend.py


In [58]:
from google.colab import files

# Download the file locally
files.download('/content/podcast/podcast_frontend.py')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [59]:
# create requirements file
%%writefile /content/podcast/requirements.txt
streamlit
modal

Writing /content/podcast/requirements.txt


In [60]:
from google.colab import files

# Download the file locally
files.download('/content/podcast/requirements.txt')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [65]:
# pre-populate the app with some pre-processed podcasts
from google.colab import files

# Download the file locally
files.download('/content/podcast/podcast-1.json')
files.download('/content/podcast/podcast-2.json')
files.download('/content/podcast/podcast-3.json')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>