In [1]:
# Reload modules automatically before executing user code
%load_ext autoreload
%autoreload 2

In [2]:
import sys
!{sys.executable} -m pip install -r ../requirements.txt

Collecting scrapy
  Using cached Scrapy-2.9.0-py2.py3-none-any.whl (277 kB)
Collecting matplotlib
  Using cached matplotlib-3.7.1-cp310-cp310-macosx_11_0_arm64.whl (7.3 MB)
Collecting plotly
  Using cached plotly-5.15.0-py2.py3-none-any.whl (15.5 MB)
Collecting scipy
  Using cached scipy-1.10.1-cp310-cp310-macosx_12_0_arm64.whl (28.8 MB)
Collecting scikit-learn
  Using cached scikit_learn-1.2.2-cp310-cp310-macosx_12_0_arm64.whl (8.5 MB)
Collecting python-dotenv
  Using cached python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Collecting httpx
  Using cached httpx-0.24.1-py3-none-any.whl (75 kB)
Collecting service-identity>=18.1.0
  Using cached service_identity-23.1.0-py3-none-any.whl (12 kB)
Collecting itemadapter>=0.1.0
  Using cached itemadapter-0.8.0-py3-none-any.whl (11 kB)
Collecting zope.interface>=5.1.0
  Using cached zope.interface-6.0-cp310-cp310-macosx_11_0_arm64.whl (202 kB)
Collecting Twisted>=18.9.0
  Using cached Twisted-22.10.0-py3-none-any.whl (3.1 MB)


In [2]:
# Optional: silence unclosed SSL socket warnings and deprecation warnings
import warnings

# Unclosed-socket warnings are emitted as ResourceWarning, not ImportWarning
warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Laying the foundations

### Storage

We're going to use Redis as the database for both the document contents and the vector embeddings. You will need the full Redis Stack because it includes RediSearch, the module that provides the vector similarity search behind semantic search. More detail is in the [docs for Redis Stack](https://redis.io/docs/stack/get-started/install/docker/).

To set this up locally, you will need to install Docker and then run the following command: ```docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest```.

The code used here draws heavily on [this repo](https://github.com/RedisAI/vecsim-demo).

After setting up the Docker instance of Redis Stack, follow the instructions below to open a Redis connection and create a Hierarchical Navigable Small World (HNSW) index for semantic search.
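For reference, index creation can be sketched as the raw `FT.CREATE` command that RediSearch accepts. This is a minimal sketch and not the code in the `importer` module: the index name, key prefix, field names, and the dimension of 1536 are all illustrative assumptions. The helper only builds the argument list, which could then be sent with `redis_client.execute_command(*args)`.

```python
def build_hnsw_index_command(index_name, key_prefix, vector_field="vector",
                             text_field="text", dim=1536,
                             distance_metric="COSINE"):
    """Return the FT.CREATE argument list for a hash index with an HNSW vector field.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    # Vector attributes are name/value pairs; RediSearch expects the total
    # number of these arguments immediately after the algorithm name.
    vector_attrs = ["TYPE", "FLOAT32", "DIM", str(dim),
                    "DISTANCE_METRIC", distance_metric]
    return [
        "FT.CREATE", index_name,
        "ON", "HASH",
        "PREFIX", "1", key_prefix,
        "SCHEMA",
        text_field, "TEXT",
        vector_field, "VECTOR", "HNSW", str(len(vector_attrs)), *vector_attrs,
    ]


args = build_hnsw_index_command("demo-index", "demo_doc:")
print(" ".join(args))
```

The `DIM` value has to match the length of the embedding vectors that are stored, so it depends on the embedding model in use.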

In [3]:
# Check that Redis is set up and running
from database import get_redis_connection

redis_client = get_redis_connection()

redis_client.ping()

True

In [8]:
# Optional step to drop the indexes if they already exist
from importer import NOTION_INDEX_NAME, WEB_SCRAPE_INDEX_NAME

redis_client.ft(NOTION_INDEX_NAME).dropindex()
redis_client.ft(WEB_SCRAPE_INDEX_NAME).dropindex()

ResponseError: Unknown Index name
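The `ResponseError` above is what `dropindex()` raises when the index has not been created yet, so it is harmless on a fresh database. A minimal sketch of a tolerant helper follows. `drop_index_if_exists` is a hypothetical name, and matching on the error text is an assumption based on the message shown above; with redis-py one would normally catch `redis.exceptions.ResponseError` rather than `Exception`.

```python
def drop_index_if_exists(client, index_name):
    """Drop a search index, returning False if it did not exist.

    Assumes the client exposes `ft(name).dropindex()` as in the cell above and
    that a missing index is reported with the text "Unknown Index name".
    """
    try:
        client.ft(index_name).dropindex()
        return True
    except Exception as exc:  # narrow to redis.exceptions.ResponseError in practice
        if "unknown index name" in str(exc).lower():
            return False
        raise  # anything else is a real problem and should surface
```

With this in place, the optional cell could call `drop_index_if_exists(redis_client, NOTION_INDEX_NAME)` and the same for `WEB_SCRAPE_INDEX_NAME`.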

### Ingestion

We'll load our Notion pages into documents.

In [1]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
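Note that `basicConfig` already attaches a stream handler to the root logger, so adding a second `StreamHandler` makes every record print twice: once in the default `LEVEL:name:message` format and once as the bare message. The sketch below reproduces this in isolation with an in-memory stream; `count_log_lines` is a hypothetical helper written only for this illustration.

```python
import io
import logging


def count_log_lines(add_extra_handler):
    """Emit one INFO record and return how many lines reach the stream."""
    stream = io.StringIO()
    root = logging.getLogger()
    saved_handlers, saved_level = root.handlers[:], root.level
    root.handlers = []  # start from a clean root logger for the demo
    try:
        logging.basicConfig(stream=stream, level=logging.INFO)
        if add_extra_handler:
            root.addHandler(logging.StreamHandler(stream=stream))
        logging.getLogger("demo").info("hello")
        return len(stream.getvalue().splitlines())
    finally:
        # Restore whatever logging setup existed before the demo
        root.handlers = saved_handlers
        root.setLevel(saved_level)


print(count_log_lines(add_extra_handler=True))   # 2
print(count_log_lines(add_extra_handler=False))  # 1
```

Dropping the `addHandler` line is enough if single log lines are preferred.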

In [2]:
from importer import import_notion_data

notion_index = import_notion_data()

INFO:numexpr.utils:Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
NumExpr defaulting to 8 threads.
[Document(text='\n\t\tFocused Labs @ The Old Post Office Building\n\t\t\n433 W Van Buren St, \nSuite 1100-B, \nChicago, IL 60607\n\nEntering the Building\nThe Focused Labs office is located in the Old Post Office on the 11th floor, Suite B.  You can access the 11th floor using the elevators on the North side of the building, bank B.\nThe Old Post Office requires you to carry an access badge at all times. The badge will provide you access to the building as well as the Focused Labs office suite 24/7\nIf it is your first day, a team member will meet you in the lobby and guide you up to our office!\nTravel\nThe Old Post Office is accessible via CTA, Metra, Bike and Automobile\nIf you use the CTA 

In [3]:
# Optional
# Proves that the redis database contains data

from importer import number_of_stored_notion_docs
print(number_of_stored_notion_docs())

791


In [8]:
# Set the logging level to DEBUG for more detailed output
query_engine = notion_index.as_query_engine()
response = query_engine.query("Where is the Denver office?")
response.response

INFO:llama_index.vector_stores.redis:Querying index notion-fl-index
Querying index notion-fl-index
INFO:llama_index.vector_stores.redis:Found 2 results for query with id ['notionfocusedlabsdocs_d63f3b0a-e1c6-44d5-b616-a740c08e08c5', 'notionfocusedlabsdocs_0e7023fe-b2fc-4198-9dbc-7297b6794ec4']
Found 2 results for query with id ['notionfocusedlabsdocs_d63f3b0a-e1c6-44d5-b616-a740c08e08c5', 'notionfocusedlabsdocs_0e7023fe-b2fc-4198-9dbc-7297b6794ec4']
INFO:llama_index.token_counter.token_counter:> [retrieve] Total LLM token usage: 0 tokens
> [retrieve] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [retrieve] Total embedding token usage: 6 tokens
> [retrieve] Total embedding token usage: 6 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total LLM token usage: 1109 tokens
> [get_response] Total LLM token usage: 1109 tokens
INFO:llama_index.token_counter.token_counter:> [get_response] Total embedding token usage: 0 tokens
> [get_response

'The Denver office is located at 1800 Wazee St, 3rd floor, Denver, CO 80202.'
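The log shows the two stages of a query: a retrieval step that embeds the question and fetches the closest stored documents (two here), followed by an LLM call that writes the answer from them. The retrieval idea can be illustrated with a toy example. The three-dimensional vectors and document names below are made up, and a Redis HNSW index would perform an approximate search rather than this exhaustive comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k=2):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
docs = {
    "denver_office": [0.9, 0.1, 0.0],
    "chicago_office": [0.7, 0.3, 0.1],
    "holiday_policy": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

print(top_k(query, docs))  # ['denver_office', 'chicago_office']
```

Only the retrieved documents are passed to the LLM, which is consistent with the retrieval step reporting embedding tokens and no LLM tokens.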

Next, we add the web-scraped data to its own index.

In [4]:
from importer import import_web_scrape_data

web_scrape_index = import_web_scrape_data()

[Document(text='\n\nA digital transformation partner focused on software delivery\n\n\n\n      var show = localStorage.getItem(\'show\');\n      if(show === \'true\'){\n        document.documentElement.classList.add(\'dark\');\n      } \n    \n\nhsjQuery = window[\'jQuery\'];\n\n\n\n\n\na.cta_button{-moz-box-sizing:content-box !important;-webkit-box-sizing:content-box !important;box-sizing:content-box !important;vertical-align:middle}.hs-breadcrumb-menu{list-style-type:none;margin:0px 0px 0px 0px;padding:0px 0px 0px 0px}.hs-breadcrumb-menu-item{float:left;padding:10px 0px 10px 10px}.hs-breadcrumb-menu-divider:before{content:\'›\';padding-left:10px}.hs-featured-image-link{border:0}.hs-featured-image{float:right;margin:0 0 20px 20px;max-width:50%}@media (max-width: 568px){.hs-featured-image{float:none;margin:0;width:100%;max-width:100%}}.hs-screen-reader-text{clip:rect(1px, 1px, 1px, 1px);height:1px;overflow:hidden;position:absolute !important;width:1px}\n\n\n\n\n\n\n\n  \n  .cards_galle

In [5]:
# Optional
# Proves that the redis database contains data

from importer import number_of_stored_web_scrape_docs
print(number_of_stored_web_scrape_docs())

841


In [None]:
query_engine = web_scrape_index.as_query_engine()
response = query_engine.query("What are some of the solutions that Focused Labs has created?")
response.response

In [28]:
# Optional: run this only if you haven't installed the NLTK stop words yet
# In the downloader, go to the Corpora tab, use the arrow keys to scroll down to "stopwords", and hit Enter to install
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


KeyboardInterrupt: 
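Calling `nltk.download()` with no arguments opens the interactive downloader, which is presumably why the cell above ended in a `KeyboardInterrupt`. A non-interactive alternative is to request the `stopwords` corpus by name, and only when it is missing. In the sketch below, `ensure_stopwords` is a hypothetical helper; it takes the `nltk` module as an argument so that it can be exercised without a network connection.

```python
def ensure_stopwords(nltk_module):
    """Download the NLTK stop words corpus if it is not installed yet.

    Returns True if a download was triggered, False if the corpus was found.
    """
    try:
        # data.find raises LookupError when the resource is not installed
        nltk_module.data.find("corpora/stopwords")
        return False
    except LookupError:
        nltk_module.download("stopwords", quiet=True)
        return True
```

In the notebook this would be called as `import nltk` followed by `ensure_stopwords(nltk)`.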

In [1]:
from importer import compose_graph

graph = compose_graph()

In [29]:
# Optional
# Proves that the graph is built

# query_engine = graph.as_query_engine()
response = graph.query("What are some of the solutions that Focused Labs has created?")

print(str(response))
# print(response.get_formatted_sources())

Some of the solutions that Focused Labs has created include streamlining onboarding with BTR Energy's Bridge platform, managing EV charging data, helping Hertz leverage technology to capture new markets, building highly productive software teams in a traditional IT environment, building a marketplace platform to enable a new business model, building a strong remote culture with the transparency leadership needs, designing and documenting a repeatable publication flow, and creating a remote first playbook to help organizations un-stuck on long standing problems.
