This notebook is the submission for the course: [Building AI products with OpenAI](https://corise.com/go/building-ai-products-with-openai-MWKY3)

# Intro

I decided to use local models because I didn't have any OpenAI credits left. This also gave me more control over which models are used.

I used [Ollama](https://ollama.ai/) to download and manage models on the local machine.
[LangChain](https://www.langchain.com/) is then used to run the Ollama models.

Below are some issues I ran into:
- Whisper needs ffmpeg. To install it (on macOS, via Homebrew), run: `brew install ffmpeg`
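
Because a missing ffmpeg only shows up once transcription starts, it can help to check for it up front. This is a minimal sketch using only the standard library; the helper name `has_executable` is hypothetical and not part of the original notebook.

```python
import shutil


def has_executable(name):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


if not has_executable("ffmpeg"):
    print("ffmpeg not found on PATH; install it before transcribing")
```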


# We install everything we need


In [13]:
!pip install gradio
!pip install feedparser
!pip install git+https://github.com/stlukey/whispercpp.py
!pip install langchain
!pip install langchain-community
!pip install fastapi
!pip install nest_asyncio
!pip install uvicorn

Collecting git+https://github.com/stlukey/whispercpp.py
  Cloning https://github.com/stlukey/whispercpp.py to /private/var/folders/_r/gx7qddks347_dfxnml9wdkcr0000gn/T/pip-req-build-nlot6ay0
  Running command git clone --filter=blob:none --quiet https://github.com/stlukey/whispercpp.py /private/var/folders/_r/gx7qddks347_dfxnml9wdkcr0000gn/T/pip-req-build-nlot6ay0
  Resolved https://github.com/stlukey/whispercpp.py to commit 7af678159c29edb3bc2a51a72665073d58f2352f
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


# We define the functions that download the podcast information and an episode

There are 2 functions:
  - get_podcast_information: get the podcast information in a dictionary
  - download_episode: download an episode locally
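
To make the shape of that dictionary concrete, here is a small sketch with placeholder values (not taken from a real feed). The helper `downloadable_episodes` is hypothetical; it keeps only the episodes for which an audio URL was found, since the `url` key is set only when the feed entry has an MP3 link.

```python
# Placeholder values for illustration only; a real feed supplies its own
podcast = {
    "title": "Example Podcast",
    "image": "https://example.com/cover.jpg",
    "link": "https://example.com",
    "description": "An example show",
    "episodes": [
        {
            "id": "ep-2",
            "title": "Second episode",
            "description": "Summary text",
            "number": "2",
            "date": "Fri, 02 Feb 2024 00:00:00 +0000",
            "url": "https://example.com/ep2.mp3",
        },
        {
            # No "url" key: no MP3 link was found for this entry
            "id": "ep-1",
            "title": "First episode",
            "description": "Summary text",
            "number": "1",
            "date": "Fri, 26 Jan 2024 00:00:00 +0000",
        },
    ],
}


def downloadable_episodes(podcast):
    """Return only the episodes that carry an audio URL."""
    return [episode for episode in podcast["episodes"] if "url" in episode]


for episode in downloadable_episodes(podcast):
    print(episode["title"], episode["url"])
```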

In [34]:
import feedparser
from pathlib import Path
import logging
import requests
import uuid
import os


logging.basicConfig(filename='app.log', filemode='w', format='%(name)s - %(levelname)s - %(message)s', level=logging.DEBUG)
logger = logging.getLogger()


def get_podcast_information(rss_feed):
    """
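    Return the podcast information for an RSS feed URL.

    Parameters:
        rss_feed (str): The URL of the podcast RSS feed

    Returns:
        podcast (dict): A dictionary with the podcast info
        podcast["title"] (str): The name of the podcast
        podcast["image"] (str): URL of the podcast thumbnail
        podcast["link"] (str): URL of the podcast website
        podcast["description"] (str): The podcast description
        podcast["episodes"] (list): A list with one dictionary per episode
        episode (dict): A dictionary with the information about an episode
        episode["id"] (str): The episode identifier from the feed
        episode["title"] (str): The episode name
        episode["description"] (str): The episode summary
        episode["number"] (str): The iTunes episode number
        episode["date"] (str): The publication date
        episode["url"] (str): The link to the episode audio (set only when an MP3 link is found)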
    """
    feed = feedparser.parse(rss_feed)

    logger.debug('Getting podcast information from {}'.format(rss_feed))
    
    podcast = {
        "title": feed['feed']['title'],
        "image": feed['feed']['image'].href,
        "link": feed['feed']['link'],
        "description": feed['feed']['description'],
        "episodes": []
    }

    logger.debug(podcast)
    
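    for episode in feed.entries:
        # Log instead of printing: printing every raw entry floods the notebook output
        logger.debug(episode)
        episode_dict = {
            "id": episode.id,
            "title": episode.title,
            "description": episode.summary,
            # Not every feed sets the iTunes episode number, so avoid a KeyError
            "number": episode.get("itunes_episode"),
            "date": episode.published
        }

        # Keep the MP3 link as the episode URL; some links carry no type at all
        for link in episode.links:
            if link.get('type') in ('audio/mpeg', 'audio/mp3'):
                episode_dict["url"] = link.href
        podcast["episodes"].append(episode_dict)
        logger.debug(episode_dict)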
    return podcast
    



def download_episode(episode_url, episode_name=None, output_folder="/tmp/podcasts/"):
    """
    Download the audio file

    Parameters:
       episode_url (str): The URL of the episode to download
       episode_name (str, optional): The file name for the episode. If not provided, a random UUID is generated.
       output_folder (str, optional): The folder to save the downloaded episode. Defaults to "/tmp/podcasts/"
    
    Returns:
        episode_path (str): the path of the downloaded episode
    """
    # Generate the default name per call: a default computed in the signature is evaluated only once
    if episode_name is None:
        episode_name = str(uuid.uuid4()) + ".mp3"
    logger.debug('Downloading {url} to folder {folder}'.format(url=episode_url, folder=output_folder))
    folder_path = Path(output_folder)
    # parents=True also creates missing parent folders
    folder_path.mkdir(parents=True, exist_ok=True)

    with requests.get(episode_url, stream=True) as request:
        request.raise_for_status()
        episode_path = folder_path.joinpath(episode_name)
        with open(episode_path, 'wb') as file:
            for chunk in request.iter_content(chunk_size=8192):
                file.write(chunk)

    logger.debug("Podcast Episode downloaded to {path}".format(path=episode_path))
    return str(episode_path)

# podcast = get_podcast_information("https://www.marketplace.org/feed/podcast/make-me-smart")
# if podcast["episodes"][0]:
#     download_episode(podcast["episodes"][0]["url"], output_folder=os.getcwd()+"/episodes", episode_name="make_me_smart.mp3")
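
One detail about the random default file name: Python evaluates default argument values once, when the function is defined, so a UUID computed in the signature would be shared by every call that omits the name. The sketch below generates the name at call time instead. Both helper names are hypothetical and not part of the notebook.

```python
import uuid


def default_episode_name(extension=".mp3"):
    """Build a fresh random file name each time it is called."""
    return str(uuid.uuid4()) + extension


def resolve_episode_name(episode_name=None):
    """Use the given name, or generate a new random one for this call."""
    if episode_name is None:
        episode_name = default_episode_name()
    return episode_name


# Two calls without a name get two different file names
print(resolve_episode_name())
print(resolve_episode_name())
```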

IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)



# We define the function that transcribes the episode audio

We run Whisper locally to transcribe the episode, then save the transcript next to the audio file.
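
Saving the transcript "next to" the audio file comes down to swapping the file extension. A minimal sketch with `pathlib`, which works for extensions of any length (a slice such as `[:-3]` only works for three-letter ones). The helper name `transcript_path_for` is hypothetical.

```python
from pathlib import Path


def transcript_path_for(audio_path):
    """Return the path of the .txt file that sits next to the audio file."""
    return str(Path(audio_path).with_suffix(".txt"))


print(transcript_path_for("episodes/make_me_smart.mp3"))
print(transcript_path_for("episodes/interview.flac"))
```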

In [10]:
from whispercpp import Whisper

def local_transcribe_audio(audio_path):
    """
    Transcribe the audio file and save the content locally next to the audio file
    
    Parameters:
        audio_path (str): the path of the local audio file
    
    Returns:
        transcript_path (str): the path of the transcript text file
        transcript (str): the transcript of the audio
    """
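    # Swap the audio extension for .txt, whatever its length
    transcript_path = str(Path(audio_path).with_suffix(".txt"))
    logger.debug('Transcribing audio from {audio_path}'.format(audio_path=audio_path))
    print(transcript_path)
    whisper = Whisper('tiny')
    result = whisper.transcribe(audio_path)
    segments = whisper.extract_text(result)
    print(segments)
    # extract_text returns a list of text segments; join them so the file
    # and the return value hold plain text rather than the list's repr
    transcript = "".join(segments)
    with open(transcript_path, 'w') as file:
        file.write(transcript)
    logger.debug("Transcript saved to {transcript_path}".format(transcript_path=transcript_path))
    return transcript_path, transcript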


# audio = os.getcwd()+"/episodes/make_me_smart.mp3"
# path,audio_transcript =local_transcribe_audio(audio)
    

/Users/loiclemerlus/DataspellProjects/building-ai-products-with-open-ai/episodes/make_me_smart.txt
Loading data..


whisper_init_from_file_no_state: loading model from '/Users/loiclemerlus/.ggml-models/ggml-tiny.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  129.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB


Transcribing..
Extracting text...
[' [MUSIC]', " >> Everybody in the car rose though, we'll come back to make this more professional.", ' >> It is Friday, it is second of February.', " >> I'm Nova Soffielding for Kimberly Adams.", " Thanks for joining us on the podcast and it's Friday, so it's the YouTube live stream.", " It means it's also time for our weekly happy hour episodes so excited about that.", " >> And then Willie goes out the door and leaves the door to the shed open and it's like 50 degrees in", ' Los Angeles.', " >> I'm freezing my head.", ' >> Oh no, 50 degrees.', " >> It's terrible.", ' >> How do you stand in it?', " >> I don't know.", " >> Well, I bundle up this, what I do when I put on the space you're so, it's Friday.", " We will do what we usually do, I'm amusing to know about today.", ' We will do what we usually do, which is a little bit of news.', " We'll do a little half a half empty, I think Drew's on.", ' And we will check to see what people are drinking.', ' 

whisper_full_with_state: progress =   5%
whisper_full_with_state: progress =  10%
whisper_full_with_state: progress =  15%
whisper_full_with_state: progress =  20%
whisper_full_with_state: progress =  25%
whisper_full_with_state: progress =  30%
whisper_full_with_state: progress =  35%
whisper_full_with_state: progress =  40%
whisper_full_with_state: progress =  45%
whisper_full_with_state: progress =  50%
whisper_full_with_state: progress =  55%
whisper_full_with_state: progress =  60%
whisper_full_with_state: progress =  65%
whisper_full_with_state: progress =  70%
whisper_full_with_state: progress =  75%
whisper_full_with_state: progress =  80%
whisper_full_with_state: progress =  85%
whisper_full_with_state: progress =  90%
whisper_full_with_state: progress =  95%


# We create a function that configures the model we want to use

We use Ollama to download the model locally, then LangChain to run it.

In [ ]:
from langchain_community.llms import Ollama

def setup_local_model(model):
    llm = Ollama(model=model)
    return llm
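
A minimal sketch of how such a model object might then be called. It assumes the LangChain `invoke` method and an Ollama model that has already been pulled; the model name "llama2" and the names `ask_local_model` and `EchoModel` are assumptions for illustration, not part of the notebook. The stand-in class lets the helper be tried without a running Ollama server.

```python
def ask_local_model(prompt, llm=None, model="llama2"):
    """Send one prompt to a model and return its text response.

    `llm` is any object with an `invoke(prompt)` method. When omitted, an
    Ollama model is created; "llama2" is an assumed model name.
    """
    if llm is None:
        # Imported lazily so the helper can be tried without LangChain installed
        from langchain_community.llms import Ollama
        llm = Ollama(model=model)
    return llm.invoke(prompt)


class EchoModel:
    """Stand-in model that just echoes the prompt back."""

    def invoke(self, prompt):
        return "echo: " + prompt


print(ask_local_model("Summarize this transcript.", llm=EchoModel()))
```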

# We create the prompts to extract different information from the transcript

We want to get the following information:
- Bullet points covering the podcast's topics
- The podcast hosts and guests
- The guests' picks, if there are any
- Keywords from the podcast
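
The notebook does not include the prompts themselves, so the following is only a sketch of what they could look like. The wording of each instruction, the `PROMPTS` dictionary, and the `build_prompts` helper are all assumptions.

```python
# One instruction per kind of information; the wording is illustrative
PROMPTS = {
    "topics": "List the main topics of this podcast episode as bullet points.",
    "people": "Name the hosts and guests of this podcast episode.",
    "picks": "List any picks made by the guests. If there are none, answer 'none'.",
    "keywords": "Give a comma-separated list of keywords for this podcast episode.",
}


def build_prompts(transcript):
    """Return one full prompt per kind of information, with the transcript appended."""
    return {
        name: instruction + "\n\nTranscript:\n" + transcript
        for name, instruction in PROMPTS.items()
    }


prompts = build_prompts("It is Friday, it is second of February.")
print(prompts["keywords"])
```

Each resulting prompt is a plain string, so it can be passed as-is to whichever model is configured.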

# Setting up the web server

We want a web server that serves an interactive website.
We use FastAPI because it is a lightweight Python web framework.

Because this assignment requires submitting only this notebook, the HTML and JavaScript files have to be created from the notebook rather than shipped as separate static files. The files are still available in the GitHub repository.
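
Creating those files from the notebook could look roughly like this. The sketch is an assumption, not the notebook's actual code: the HTML is a placeholder page, and `write_static_file` is a hypothetical helper that writes a string into the `static` folder.

```python
from pathlib import Path

# Placeholder page; the real index.html lives in the repository
INDEX_HTML = """<!DOCTYPE html>
<html>
  <head><title>Podcast notebook</title></head>
  <body>
    <div id="app"></div>
    <script src="/static/app.js"></script>
  </body>
</html>
"""


def write_static_file(name, content, static_dir="static"):
    """Write `content` to `static_dir/name`, creating the folder if needed."""
    folder = Path(static_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / name
    path.write_text(content, encoding="utf-8")
    return path
```

Calling `write_static_file("index.html", INDEX_HTML)` before starting the server would make the page available under `/static/index.html`.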

In [35]:
from fastapi import FastAPI, File, UploadFile, Request
from fastapi.responses import HTMLResponse, RedirectResponse
from fastapi.staticfiles import StaticFiles
from fastapi.templating import Jinja2Templates
import json
import nest_asyncio
import uvicorn


app = FastAPI()
os.makedirs("static", exist_ok=True)
app.mount("/static", StaticFiles(directory="static"), name="static")
# templates = Jinja2Templates(directory="templates")

@app.get("/")
def index():
    return RedirectResponse(url="/static/index.html")

@app.get("/api/rss")
async def rss(request: Request):
    rss_feed = request.query_params.get('feed')
    print(rss_feed)
    podcast_from_rss = get_podcast_information(rss_feed)
    # Return the dictionary that was just built
    return podcast_from_rss

@app.post("/api/download")
async def download_podcast(request: Request):
    # request.json() is a coroutine; it returns the parsed JSON body as a dict
    podcast_request = await request.json()
    audio_path = download_episode(podcast_request["url"], podcast_request["name"], output_folder=os.getcwd()+"/episodes")
    return audio_path

@app.post("/api/transcribe")
async def transcribe_audio(request: Request):
    # Assumes the client posts a JSON body with a "path" key
    audio_file = await request.json()
    audio_path = audio_file["path"]
    transcript_path, transcript = local_transcribe_audio(audio_path)
    return {"transcript_path": transcript_path, "transcript": transcript}

if __name__ == "__main__":
    nest_asyncio.apply()
    uvicorn.run(app)
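
For reference, here is a sketch of how a client might build requests for the two POST endpoints, assuming they read a JSON body with `url`/`name` and `path` keys, as the field names used in the handlers suggest. The episode URL and file names are placeholders, and `json_post` is a hypothetical helper. Nothing is sent until `urlopen` is called.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # default uvicorn address


def json_post(path, payload):
    """Build a POST request carrying `payload` as a JSON body."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


download_request = json_post(
    "/api/download",
    {"url": "https://example.com/episode.mp3", "name": "episode.mp3"},
)
transcribe_request = json_post("/api/transcribe", {"path": "episodes/episode.mp3"})

# With the server running, a request can be sent with:
# with urllib.request.urlopen(download_request) as response:
#     print(json.load(response))
```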



INFO:     Started server process [6615]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)


INFO:     127.0.0.1:55963 - "GET /static/index.html HTTP/1.1" 304 Not Modified
INFO:     127.0.0.1:55963 - "GET /static/app.js HTTP/1.1" 304 Not Modified
INFO:     127.0.0.1:55963 - "GET /static/PodcastPage.js HTTP/1.1" 304 Not Modified
INFO:     127.0.0.1:55963 - "GET /static/HomePage.js HTTP/1.1" 304 Not Modified
https://www.marketplace.org/feed/podcast/make-me-smart


INFO:     127.0.0.1:56213 - "GET /static/index.html HTTP/1.1" 304 Not Modified
INFO:     127.0.0.1:56213 - "GET /static/PodcastPage.js HTTP/1.1" 200 OK
https://www.marketplace.org/feed/podcast/make-me-smart


INFO:     127.0.0.1:57151 - "GET /static/index.html HTTP/1.1" 304 Not Modified
INFO:     127.0.0.1:57151 - "GET /static/app.js HTTP/1.1" 304 Not Modified
INFO:     127.0.0.1:57151 - "GET /static/PodcastPage.js HTTP/1.1" 200 OK
https://www.marketplace.org/feed/podcast/make-me-smart


INFO:     127.0.0.1:57383 - "GET /static/index.html HTTP/1.1" 304 Not Modified
INFO:     127.0.0.1:57383 - "GET /static/HomePage.js HTTP/1.1" 304 Not Modified
INFO:     127.0.0.1:57384 - "GET /static/PodcastPage.js HTTP/1.1" 304 Not Modified
https://www.marketplace.org/feed/podcast/make-me-smart



INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [6615]
