# Stuff You Should Know Podcast

### Notebook Covers
0. Import dependencies
1. Retrieving xml from sysk feed
2. Parsing xml and transforming into json
3. Fetching mp3 files for episodes
4. Transcribing mp3 using whisper tiny model
5. Chunking and embedding transcripts
6. Chat with the data using retrieval augmented generation (RAG)

### 0. Import dependencies

This project relies on Whisper for speech-to-text transcription and on LangChain for chunking, embedding, and chatting over the transcripts. Run this section first if you start from any later section, since every section depends on these imports.

In [11]:
import whisper
import os
import requests
from dotenv import load_dotenv
import xmltodict
import json
import tiktoken
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage, AIMessage
from langchain.schema import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter


load_dotenv()
openai_api_key = os.getenv('OPENAI_API_KEY')

### 1. Retrieving xml from sysk feed

First, we need to retrieve the latest SYSK RSS feed. The feed contains information about the podcast, such as episode descriptions and links to the mp3 files.

In [2]:
# Create directory for rss_feed if not exists
parent_directory = os.path.dirname(os.getcwd())
save_folder = os.path.join(parent_directory, 'rss_feed')
os.makedirs(save_folder, exist_ok=True)

# Request rss file
podcast_url = 'https://omnycontent.com/d/playlist/e73c998e-6e60-432f-8610-ae210140c5b1/A91018A4-EA4F-4130-BF55-AE270180C327/44710ECC-10BB-48D1-93C7-AE270180C33E/podcast.rss'
response = requests.get(podcast_url)

# Save rss file to new directory
save_file_path = os.path.join(save_folder, 'sysk_podcast.rss')

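if response.status_code == 200:
    with open(save_file_path, 'wb') as file:
        file.write(response.content)
else:
    # Make a failed request visible instead of silently skipping the save
    print(f'Failed to fetch feed: HTTP {response.status_code}')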

### 2. Parsing xml and transforming into json

Next, to clean things up, let's extract the useful fields from the XML RSS feed and save them as JSON, one file per episode.

In [3]:
# Load the rss file saved in section 1. The paths are rebuilt here in case you start fresh from section 2
parent_directory = os.path.dirname(os.getcwd())
load_file_path = os.path.join(parent_directory, 'rss_feed', 'sysk_podcast.rss')
with open(load_file_path, 'r', errors='ignore') as file:
    content = file.read()

# Create folder for episode information
save_folder = os.path.join(parent_directory, 'episode_info')
os.makedirs(save_folder, exist_ok=True)

# Parse xml into a dict
xml_dict = xmltodict.parse(content)

# Get list of items (aka podcast episodes)
items = xml_dict['rss']['channel']['item']

# Keep only the first 10 'Short Stuff' episodes for practice; comment out the next two lines to process the full feed
short_items = [item for item in items if 'Short Stuff' in item['title']]
items = short_items[:10]

# Loop through items in xml dict and dump into new folder as json
for item in items:
    episode_info = {
        'title': item['title'],
        'guid': item['guid']['#text'],
        'publish_date': item['pubDate'],
        'mp3_path': item['enclosure']['@url']
    }

    save_file_path = os.path.join(save_folder, item['guid']['#text'] + '.json')
    with open(save_file_path, 'w') as file:
        json.dump(episode_info, file)
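
To make the field mapping concrete, here is a minimal standard-library sketch that pulls the same four fields out of a tiny, made-up feed. The sample XML, the `extract_episode` helper, and the use of `xml.etree.ElementTree` in place of `xmltodict` are assumptions for illustration only; the real feed carries many more tags per item.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed; titles, guids and URLs are invented for this sketch.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <item>
      <title>Short Stuff: Example One</title>
      <guid isPermaLink="false">guid-001</guid>
      <pubDate>Mon, 01 Jan 2024 10:00:00 +0000</pubDate>
      <enclosure url="https://example.com/one.mp3" type="audio/mpeg" />
    </item>
    <item>
      <title>A Full-Length Example</title>
      <guid isPermaLink="false">guid-002</guid>
      <pubDate>Tue, 02 Jan 2024 10:00:00 +0000</pubDate>
      <enclosure url="https://example.com/two.mp3" type="audio/mpeg" />
    </item>
  </channel>
</rss>"""


def extract_episode(item):
    """Map one <item> element to the dict shape used in this notebook."""
    return {
        'title': item.findtext('title'),
        'guid': item.findtext('guid'),
        'publish_date': item.findtext('pubDate'),
        'mp3_path': item.find('enclosure').get('url'),
    }


root = ET.fromstring(SAMPLE_FEED)
episodes = [extract_episode(item) for item in root.iter('item')]

# Same filter as above: keep at most 10 'Short Stuff' episodes.
short_stuff = [e for e in episodes if 'Short Stuff' in e['title']][:10]
print(short_stuff)
```

Only the first sample item survives the filter, which mirrors how the practice subset above is built from the full item list.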

### 3. Fetching mp3 files for episodes

Here, we use the mp3_path parsed from the RSS feed (and persisted as JSON) to download each episode's mp3 into a separate audio folder.

In [4]:
# Retrieve episode_info
parent_directory = os.path.dirname(os.getcwd())
load_folder = os.path.join(parent_directory, 'episode_info')
episode_info_dir = os.listdir(load_folder)

# Create folder for mp3 audio if it does not already exist
save_folder = os.path.join(parent_directory, 'audio')
os.makedirs(save_folder, exist_ok=True)

# Open up each file in episode_info folder, download mp3 files to new audio folder, name each mp3 with the associated episode guid
for episode in episode_info_dir:
    file_path = os.path.join(load_folder, episode)

    with open(file_path, 'r') as episode_info:
        data = json.load(episode_info)
        mp3_path = data.get('mp3_path')
        guid = data.get('guid')
        response = requests.get(mp3_path)

        if response.status_code == 200:
            save_file_path = os.path.join(save_folder, guid + '.mp3')
            with open(save_file_path, "wb") as file:
                file.write(response.content)
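
The loop above re-downloads every mp3 on each run and holds each file in memory before writing it. As a rough sketch of one alternative, the helper below skips files that already exist and streams the body to disk. The function name `download_if_missing` and the `.part` suffix are hypothetical, and it uses the standard library's `urllib.request` rather than `requests` so that it runs without extra packages.

```python
import os
import shutil
import urllib.request


def download_if_missing(url, dest_path):
    """Stream url to dest_path unless the file is already there.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(dest_path):
        return False

    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file under the final name.
    tmp_path = dest_path + '.part'
    with urllib.request.urlopen(url) as response, open(tmp_path, 'wb') as out:
        shutil.copyfileobj(response, out)
    os.replace(tmp_path, dest_path)
    return True
```

Inside the loop, a call such as `download_if_missing(mp3_path, os.path.join(save_folder, guid + '.mp3'))` could stand in for the request and write steps.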

### 4. Transcribe audio into transcripts

In [5]:
# Create whisper model "tiny"
model = whisper.load_model("tiny")

In [6]:
# Get necessary paths
parent_directory = os.path.dirname(os.getcwd())
audio_path = os.path.join(parent_directory, 'audio')
episode_info_path = os.path.join(parent_directory, 'episode_info')

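# Loop through episode_info folder, look up audio by guid, transcribe using whisper, write transcription back into episode_info json
# List the folder here (instead of reusing the variable from section 3) so this section runs on its own
for episode in os.listdir(episode_info_path):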
    file_path = os.path.join(episode_info_path, episode)

    with open(file_path, 'r') as episode_info:
        data = json.load(episode_info)
        guid = data.get('guid')
        audio_file_path = os.path.join(audio_path, guid + '.mp3')

        transcription = model.transcribe(audio_file_path)
        data['transcription'] = transcription['text']
    
    with open(file_path, 'w') as file:
        json.dump(data, file)
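
Transcription is typically the slowest step, and the loop above transcribes every episode again on each run. One possible way to make reruns cheaper is to select only the JSON files that have no transcription yet. The helper name `episodes_needing_transcription` is hypothetical; it assumes the file layout produced in section 2.

```python
import json
import os


def episodes_needing_transcription(episode_info_path):
    """Return paths of episode JSON files with no 'transcription' value yet."""
    pending = []
    for name in sorted(os.listdir(episode_info_path)):
        if not name.endswith('.json'):
            continue
        file_path = os.path.join(episode_info_path, name)
        with open(file_path, 'r') as f:
            data = json.load(f)
        # A missing key or an empty string both count as "not transcribed".
        if not data.get('transcription'):
            pending.append(file_path)
    return pending
```

Looping over the returned paths instead of the whole folder leaves finished episodes untouched.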



### 5. Chunking and Embedding Transcripts

In [20]:
parent_directory = os.path.dirname(os.getcwd())

docs = []

# Load a single episode (hard-coded guid) to experiment with chunking
with open(os.path.join(parent_directory, 'episode_info', '3ce14c09-a1ae-4df1-8f51-b01200d74112.json'), 'r') as file:
    data = json.load(file)

    transcript = data['transcription']
    docs.append(Document(page_content=transcript))

In [21]:
print(tiktoken.encoding_for_model('gpt-3.5-turbo'))
tokenizer = tiktoken.get_encoding('cl100k_base')

def tiktoken_len(text):
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20, length_function=tiktoken_len, separators=[' ', ''])

<Encoding 'cl100k_base'>
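
To build intuition for chunk_size and chunk_overlap, here is a simplified sliding-window sketch. It is not LangChain's actual algorithm (RecursiveCharacterTextSplitter splits on the given separators and merges the pieces back together), and it treats each list element as one token instead of using tiktoken. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = [f'w{i}' for i in range(10)]
chunks = sliding_chunks(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(' '.join(chunk))
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Each window starts with the last token of the previous one. The same effect shows up in the real output of the next cell, where the second chunk opens with the closing sentences of the first.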


In [22]:
chunks = text_splitter.split_text(docs[0].page_content)
print(len(chunks))
print(chunks)

19
["You know, there are some things in life you just can't trust. Like a free couch on the side of the road. Or the sushi rolls from your local gas station. Or when your kid says they don't need the bathroom before the road trip. But there are some things in life you can't trust. Like the HP Smart Tank Printer. With up to two years of ink included and outstanding print quality, you can rely on the HP Smart Tank Printer from HP. America's most trusted printer brand. Hey, I'm welcome to the short stuff. I'm Josh, there's Chuck Jerry's here too. Dave's here in Spare Let's Go. Big thanks to Still Assignmenton, great name. One of the great house stuff works.com writers. And I am really thrilled with this one because I had always known about and heard the term in fully understood what Rub Goldberg meant. A Rub Goldberg machine. Something that is really complex and kind of awesome. And it's usually some", "A Rub Goldberg machine. Something that is really complex and kind of awesome. And it's