In [6]:
# %pip install pandas
# %pip install numpy
# %pip install tiktoken
# %pip install bs4

## Scrape Website

In [7]:
import requests
from bs4 import BeautifulSoup

In [5]:
roasted_reflections_url = 'https://benmcdougal.com/roasted-reflections/'
response = requests.get(roasted_reflections_url)
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(response.text, 'html.parser')

soup

<!DOCTYPE html>

<html lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="http://gmpg.org/xfn/11" rel="profile"/>
<meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots">
<meta content="4.9.13" name="dlm-version"/>
<!-- This site is optimized with the Yoast SEO plugin v22.5 - https://yoast.com/wordpress/plugins/seo/ -->
<title>Roasted Reflections</title>
<meta content="Welcome to a library of weekly writings by Ben McDougal." name="description">
<link href="https://benmcdougal.com/roasted-reflections/" rel="canonical">
<meta content="en_US" property="og:locale">
<meta content="article" property="og:type"/>
<meta content="Roasted Reflections" property="og:title"/>
<meta content="Welcome to a library of weekly writings by Ben McDougal." property="og:description"/>
<meta content="https://benmcdougal.com/roasted-reflections/" property="og:url"/>
<meta content="COMM

In [8]:
content_columns = soup.find_all('div', {'class': 'content-column one_third'})
# links = soup.find_all('a', {'target' : '_blank', 'rel': 'noopener'})

corpus = []

idx = 0

for div in content_columns:
  links = div.find_all('a', {'target' : '_blank', 'rel': 'noopener'})
  for link in links:
    href = link['href']

    # skip the nft link
    if href == 'https://benmcdougal.com/nft':
      continue

    response = requests.get(href)
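    # name the parser explicitly so bs4 does not have to guess one
    page = BeautifulSoup(response.text, 'html.parser')
    # page.title is None when a page has no <title> tag
    title = page.title.text if page.title else href
    print(f'parsing {title}')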

    article = page.find('div', {'class': 'single-post-wrap entry-content'})

    if not article:
      continue

    paragraphs = article.find_all('p')

    for i, p in enumerate(paragraphs):
      text = p.text.strip()

      # skip empty paragraphs
      if not text:
        continue

      # skip paragraphs that contain only the text 'Extra Shot'
      if text == 'Extra Shot':
        continue

      corpus.append({
        'index': idx,
        'url': href,
        'paragraph': i,
        'text': text
      })

      # running index across all articles, incremented per kept paragraph
      idx += 1

parsing The Headline Trap
parsing Personal Bandwidth
parsing Pure Wonder
parsing The Idea Machine
parsing Super Sentence
parsing Love Triangles
parsing Career Nirvana
parsing Champions of Change
parsing Co-Founders
parsing Lone Wolves
parsing Pain Relievers vs. Vitamins
parsing Life's a Pitch
parsing Content Creation
parsing Content Creation: Writing
parsing Content Creation: Photography
parsing Content Creation: Videography
parsing Content Creation: Graphic Design
parsing Content Creation: Creativity
parsing Content Creation: Organization
parsing Feedback is Data
parsing Pitch Competitions
parsing Slide Deck Design
parsing Early Moves
parsing Ship It
parsing First in Line
parsing Launch
parsing Dark Matter
parsing Linchpin - COMMUNITY BUILDER
parsing Stealth Mode - COMMUNITY BUILDER
parsing Anxiety
parsing Minerva
parsing Maverick
parsing Milestones
parsing Celebrate
parsing Ideaworks
parsing Shifting Gears
parsing Intrinsic
parsing Not to Lose
parsing Melting Momentum
parsing Wirefra
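
The filtering rules in the loop above (skip empty paragraphs, skip the literal 'Extra Shot' line, keep one running index) can be pulled into a small pure function so they can be checked without touching the network. This is only a sketch, assuming the paragraph texts have already been extracted; `build_records`, its parameters, and the example URL are hypothetical and not part of the notebook.

```python
def build_records(url, paragraph_texts, start_index=0, skip=("Extra Shot",)):
    """Turn raw paragraph strings from one article into corpus records.

    Hypothetical helper that mirrors the filtering in the scraping loop.
    """
    records = []
    idx = start_index
    for i, raw in enumerate(paragraph_texts):
        text = raw.strip()
        # same rules as the loop: drop empty paragraphs and skipped labels
        if not text or text in skip:
            continue
        # 'paragraph' keeps the original position on the page,
        # 'index' only counts the paragraphs that were kept
        records.append({"index": idx, "url": url, "paragraph": i, "text": text})
        idx += 1
    return records


sample = build_records(
    "https://example.com/post/",  # placeholder URL, for illustration only
    ["  First paragraph. ", "", "Extra Shot", "Last paragraph."],
)
for record in sample:
    print(record)
```

To continue the numbering across several articles, pass `start_index=len(corpus)` on each call and extend `corpus` with the returned list.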

In [11]:
corpus[:5]

[{'index': 0,
  'url': 'https://benmcdougal.com/the-headline-trap/',
  'paragraph': 0,
  'text': 'Reducing barriers to entrepreneurship allows more people to feel inspired by their work. There are many common barriers to entrepreneurship. As I’ve worked with students, new entrepreneurs and intrapreneurs working inside existing companies, I’ve noticed a self-limiting ideology we can call The Headline Trap.'},
 {'index': 1,
  'url': 'https://benmcdougal.com/the-headline-trap/',
  'paragraph': 1,
  'text': 'The Headline Trap is an emotional barrier that can subconsciously make people think their own entrepreneurial abilities don’t warrant action. It festers from the deception that business ventures must “go big” or make a bunch of cash to positively impact one’s career portfolio.'},
 {'index': 2,
  'url': 'https://benmcdougal.com/the-headline-trap/',
  'paragraph': 2,
  'text': 'This is no surprise. Successful startup stories are celebrated loudly. These spotlights are well deserved and c

## Load Corpus

In [14]:
# %pip install chromadb
# %pip install python-dotenv

In [24]:
import chromadb
from dotenv import load_dotenv

load_dotenv()

chroma_client = chromadb.PersistentClient(path="_data/chroma")
docs = chroma_client.get_or_create_collection("benbot")

In [27]:
docs.add(
    documents=[doc["text"] for doc in corpus],
    metadatas=[{"url": doc["url"], "paragraph": doc["paragraph"]} for doc in corpus],
    ids=[f"corpus:{doc['index']}" for doc in corpus],
)

Insert of existing embedding ID: corpus:0
Insert of existing embedding ID: corpus:1
Add of existing embedding ID: corpus:0
Add of existing embedding ID: corpus:1
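
The "existing embedding ID" messages suggest that some `corpus:N` IDs were already in the persistent collection, for example from an earlier run of this cell. One way to make the load step safe to re-run is to build the payload once and hand it to `upsert` instead of `add`. A minimal sketch, assuming the `corpus` record shape shown earlier; `to_chroma_payload` and the example records are hypothetical, and the Chroma call is left commented out so the block runs without a client.

```python
def to_chroma_payload(corpus, prefix="corpus"):
    """Split corpus records into the parallel lists the collection expects."""
    return {
        "documents": [doc["text"] for doc in corpus],
        "metadatas": [
            {"url": doc["url"], "paragraph": doc["paragraph"]} for doc in corpus
        ],
        "ids": [f"{prefix}:{doc['index']}" for doc in corpus],
    }


# made-up records in the same shape as the scraped corpus
example_corpus = [
    {"index": 0, "url": "https://example.com/a/", "paragraph": 0, "text": "one"},
    {"index": 1, "url": "https://example.com/a/", "paragraph": 2, "text": "two"},
]
payload = to_chroma_payload(example_corpus)

# every ID has to be unique within the collection
assert len(set(payload["ids"])) == len(payload["ids"])

# With the real collection from the cell above (assumption: `docs` exists):
# docs.upsert(**payload)
```

`upsert` takes the same keyword arguments as `add` and overwrites entries whose ID already exists, so repeated runs update the stored paragraphs rather than reporting duplicates.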


In [33]:
results = docs.query(query_texts=["what is benbot?"], n_results=5)

for doc in results["documents"][0]:
  print(doc)

Wow. I’ll stop talking. It’s time to give BEN BOT a try! Our new friend is always thirsty to chat and we can’t wait to hear what you think – BENBOT.ai
Five more robotic NFTs will be released into the Roasted Reflections NFT Collection every Wednesday in April. Active NFT ownership includes a variety of utilities and 24/7 access to BEN BOT. While NFT ownership is the best way to access BEN BOT, not everyone wants an NFT. Your web3 exploration is rewarded with lower prices and more value with NFTs, but to make BEN BOT accessible to all, a monthly membership (paid by credit card) is also available.
BEN BOT goes online April 1st.
BEN BOT infuses all 125+ weekly writings in Roasted Reflections, key takeaways from YDNTB, and other strategic embeddings into a conversational chatbot!
Installation complete. BEN BOT is now online!
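
The loop above prints only the paragraph text. The query result also carries the metadata stored at load time, so each hit can be traced back to its article and paragraph. A sketch, assuming the result is a dict of per-query lists under the `documents`, `metadatas` and `distances` keys; `iter_hits` is a hypothetical helper and `fake_results` is made-up data in that shape, not real query output.

```python
def iter_hits(results, query_index=0):
    """Yield (text, metadata, distance) tuples for one query in the result."""
    documents = results["documents"][query_index]
    metadatas = results["metadatas"][query_index]
    distances = results["distances"][query_index]
    yield from zip(documents, metadatas, distances)


# made-up stand-in for the dict returned by docs.query(...)
fake_results = {
    "ids": [["corpus:7", "corpus:3"]],
    "documents": [["first hit text", "second hit text"]],
    "metadatas": [[
        {"url": "https://example.com/a/", "paragraph": 4},
        {"url": "https://example.com/b/", "paragraph": 1},
    ]],
    "distances": [[0.25, 0.5]],
}

for text, meta, dist in iter_hits(fake_results):
    # smaller distance means a closer match to the query text
    print(f"{dist:.3f}  {meta['url']} (paragraph {meta['paragraph']}): {text[:60]}")
```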
