## Embed your Google Drive Docs in a DataStax Astra Vector Database with Unstructured!


Author: Nina Lopatina from Unstructured

Nina's X handle: [@NinaLopatina](https://x.com/ninalopatina)

Nina's LinkedIn: https://www.linkedin.com/in/ninalopatina

Last updated: 09.05.24

Do you have files in Google Drive that you want to parse, embed, and import into your Astra database for RAG? If so, this notebook walks you through every step.

Here are the initial non-code steps:

A. Sign up for your [Unstructured API key](https://app.unstructured.io/), which comes with a two-week free trial for up to 1000 documents. You can find your API credentials in your dashboard.

B. Create a [Google Drive service account](https://support.google.com/a/answer/7378726?hl=en) or find the JSON key file with your login info. Make sure you share the Google Drive directory where your data is stored with the service account email address.

C. Sign up for [Astra DB](https://www.google.com/url?q=https%3A%2F%2Fwww.datastax.com%2Flp%2Fastra-registration) to get your database endpoint and token.

D. Decide which embeddings to use, and obtain the appropriate API token if needed (this notebook uses a Hugging Face model for embedding generation).

Store any private API keys in a .env file in your Google Drive.
_______________




1. Now starting with the code below, we will install all the necessary libraries

In [None]:
!pip install -q -U "unstructured-ingest[google-drive, astradb, embed-huggingface]" langchain-community httpx python-dotenv


2. [Mount your Google Drive locally](https://colab.research.google.com/notebooks/io.ipynb) to load your dotenv file and to store your .json files locally in case you want to reference them later. A pop-up will ask you to connect to your Google Drive.

  The files themselves are pulled through a connector that uses a service account, which allows Google Docs files to be processed in addition to the standard file formats that can be saved in your Drive.

  The secret parameters to set in your .env file, using the names that the pipeline code in step 4 reads, are:
  
  UNSTRUCTURED_API_KEY

  UNSTRUCTURED_API_URL
  
  ASTRA_DB_APPLICATION_TOKEN

  ASTRA_DB_API_ENDPOINT

  

### Note that in this notebook, you are sharing your Google Drive with the Colab notebook itself, not with Unstructured or DataStax.
  If you prefer not to share your Drive, you can access your .env and Drive .json files another way, e.g. by downloading this notebook as a .ipynb and running it locally with local directory access.
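
If you do run the notebook locally, the only thing that really changes is where the .env file lives. The sketch below is one possible way to pick the path. The helper names `running_in_colab` and `env_file_path` are hypothetical, and the local default (a `.env` file in the current directory) is an assumption rather than anything this notebook prescribes.

```python
from pathlib import Path


def running_in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
    except ImportError:
        return False
    return True


def env_file_path(in_colab: bool, local_dir: str = ".") -> Path:
    """Pick the .env location: the mounted Drive in Colab, a local folder otherwise."""
    if in_colab:
        # Same path that this notebook loads after mounting Drive
        return Path("/content/drive/MyDrive/.env")
    # Assumption: when running locally, the .env sits in local_dir
    return Path(local_dir) / ".env"


print(env_file_path(running_in_colab()))
```

The returned path can then be passed to `dotenv.load_dotenv(...)` in place of the hard-coded Drive path.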

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import dotenv

dotenv.load_dotenv('/content/drive/MyDrive/.env')

True
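
`load_dotenv` returning `True` only says that the file was found and parsed, not that every key is present. As a hedged sanity check, the sketch below lists any variable that is still unset or empty. The names in `REQUIRED_VARS` are the ones the pipeline cell in step 4 looks up with `os.getenv`; the helper `missing_vars` is a hypothetical name introduced here for illustration.

```python
import os

# Names that the pipeline cell in step 4 reads with os.getenv
REQUIRED_VARS = [
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_API_URL",
    "ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_APPLICATION_TOKEN",
]


def missing_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_vars(REQUIRED_VARS)
if missing:
    print("Missing from the environment:", ", ".join(missing))
else:
    print("All required variables are set.")
```

Running this right after loading the .env file surfaces a misspelled key before the pipeline starts, rather than as a `null` value in the pipeline logs.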

3. Next, we set the additional parameters that are not secret, so they are easier to modify directly in the notebook.

In [None]:
from google.colab import userdata  # Optional: read values from Colab secrets instead, e.g. userdata.get('Google-json-output')

# Path to the JSON key you downloaded for your service account
os.environ['GCP_SERVICE_ACCOUNT_KEY'] = '/content/drive/MyDrive/secret/unstructured-podcast-efb838617cbd.json'
# ID of the Google Drive folder that contains your unstructured data
os.environ['GOOGLE_DRIVE_FOLDER_ID'] = '1cF_wp5Mkuiyvrcee0KKBWWmLcyPj8BiP'
os.environ['LOCAL_FILE_DOWNLOAD_DIR'] = '/content/drive/MyDrive/output78/'
os.environ['ASTRA_DB_COLLECTION'] = 'ninatest'
# Depends on the embedding model you choose. For the current default model per provider: 384 for Hugging Face, 1536 for OpenAI
os.environ['ASTRA_DB_EMBEDDING_DIMENSIONS'] = '384'
os.environ["ASTRA_DB_NAMESPACE"] = 'nina_namespace'
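
Two of these settings are easy to get wrong: the Drive folder has to be shared with the service account's email address, and the embedding dimension has to match the embedding model. The following is a minimal pre-flight sketch under a few assumptions. It relies on the `client_email` field that Google service account key files contain, the helper names are hypothetical, and the `"openai"` key in the lookup table is only an illustrative label for the 1536 value mentioned above, not a confirmed provider name.

```python
import json
import os

# Dimensions quoted in this notebook for each provider's default model.
# "langchain-huggingface" is the provider used in step 4; "openai" is an
# illustrative label only.
EXPECTED_DIMENSIONS = {"langchain-huggingface": 384, "openai": 1536}


def service_account_email(key_path):
    """Read the service account email from a downloaded JSON key file."""
    with open(key_path, encoding="utf-8") as key_file:
        return json.load(key_file)["client_email"]


def dimension_matches(provider, configured):
    """Compare the configured dimension (a string from os.environ) with the table."""
    return EXPECTED_DIMENSIONS.get(provider) == int(configured)


key_path = os.getenv("GCP_SERVICE_ACCOUNT_KEY")
if key_path and os.path.exists(key_path):
    print("Share the Drive folder with:", service_account_email(key_path))

dimension = os.getenv("ASTRA_DB_EMBEDDING_DIMENSIONS")
if dimension:
    print("Dimension matches provider:", dimension_matches("langchain-huggingface", dimension))
```

If the printed email address does not have access to the folder, the source connector cannot download its files, so it is worth confirming the share before running the pipeline.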

#### Note that we temporarily have a bug in processing native Google Docs, Sheets, and Slides files (.doc, .xlsx, .ppt, etc. work fine). As a temporary workaround, you can use the V1 SDK code, or download and upload your files.

4. Set up Unstructured API access and process the documents as described in our [Google Drive source connector](https://docs.unstructured.io/open-source/ingest/source-connectors/google-drive) documentation, then set up the [Astra destination connector](https://docs.unstructured.io/open-source/ingest/destination-connectors/astra). Note that these will shortly be updated for our new Serverless API.

  At the end of this workflow, your unstructured documents will have been extracted, chunked, embedded, and loaded into your Astra DB!

In [None]:
#All of the imports
from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.embedder import EmbedderConfig


from unstructured_ingest.v2.processes.connectors.google_drive import (
    GoogleDriveConnectionConfig,
    GoogleDriveAccessConfig,
    GoogleDriveIndexerConfig,
    GoogleDriveDownloaderConfig
)
from unstructured_ingest.v2.processes.connectors.astradb import (
    AstraDBConnectionConfig,
    AstraDBAccessConfig,
    AstraDBUploadStagerConfig,
    AstraDBUploaderConfig
)
import os

In [None]:
Pipeline.from_configs(
    context=ProcessorConfig(),
    # Source: index and download files from the Google Drive folder
    indexer_config=GoogleDriveIndexerConfig(),
    downloader_config=GoogleDriveDownloaderConfig(download_dir=os.getenv("LOCAL_FILE_DOWNLOAD_DIR")),
    source_connection_config=GoogleDriveConnectionConfig(
        access_config=GoogleDriveAccessConfig(
            service_account_key_path=os.getenv("GCP_SERVICE_ACCOUNT_KEY")
        ),
        drive_id=os.getenv("GOOGLE_DRIVE_FOLDER_ID"),
    ),
    # Partition through the Unstructured API
    partitioner_config=PartitionerConfig(
        partition_by_api=True,
        api_key=os.getenv("UNSTRUCTURED_API_KEY"),
        partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
        strategy="hi_res",
        additional_partition_args={
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15,
        },
    ),
    chunker_config=ChunkerConfig(chunking_strategy="by_title"),
    embedder_config=EmbedderConfig(embedding_provider="langchain-huggingface"),
    # Destination: stage and upload the embedded chunks to Astra DB
    destination_connection_config=AstraDBConnectionConfig(
        access_config=AstraDBAccessConfig(
            api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
            token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
        )
    ),
    stager_config=AstraDBUploadStagerConfig(),
    uploader_config=AstraDBUploaderConfig(
        namespace=os.getenv("ASTRA_DB_NAMESPACE"),
        collection_name=os.getenv("ASTRA_DB_COLLECTION"),
        # Environment variables are strings, so convert the dimension to an integer
        embedding_dimension=int(os.getenv("ASTRA_DB_EMBEDDING_DIMENSIONS")),
    ),
).run()

2024-09-06 00:09:28,254 MainProcess INFO     Created index with configs: {"extensions": null, "recursive": false}, connection configs: {"access_config": "**********", "drive_id": "1cF_wp5Mkuiyvrcee0KKBWWmLcyPj8BiP"}
2024-09-06 00:09:28,260 MainProcess INFO     Created download with configs: {"download_dir": "/content/drive/MyDrive/output78"}, connection configs: {"access_config": "**********", "drive_id": "1cF_wp5Mkuiyvrcee0KKBWWmLcyPj8BiP"}
2024-09-06 00:09:28,265 MainProcess INFO     Created partition with configs: {"strategy": "hi_res", "ocr_languages": null, "encoding": null, "additional_partition_args": {"split_pdf_page": true, "split_pdf_allow_failed": true, "split_pdf_concurrency_level": 15}, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": [], "metadata_include": [], "partition_endpoint": null, "partition_by_api": true, "api_key": "*******", "hi_res_model_name": null}
2024-09-06 00:09:53,739 MainProcess INFO     Created upload_stage with configs: {}
2024-09-06 00:09:53,740 MainProcess INFO     Created upload with configs: {"collection_name": "ninatest", "embedding_dimension": 384, "namespace": "nina_namespace", "requested_indexing_policy": null, "batch_size": 20}, connection configs: {"access_config": "**********", "connection_type": "astradb"}
2024-09-06 00:09:56,836 MainProcess INFO     Running local pipline: index (GoogleDriveIndexer) -> download (GoogleDriveDownloader) -> partition (hi_res) -> chunk (by_title) -> embed (langchain-huggingface) -> upload_stage (AstraDBUploadStager) -> upload (AstraDBUploader) with configs: {"reprocess": false, "verbose": false, "tqdm": false, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "num_processes": 2, "max_connections": null, "raise_on_error": false, "disable_parallelism": false, "preserve_downloads": false, "download_only": false, "max_docs": null, "re_download": false, "uncompress": false, "otel