## 📚 Prerequisites

Before executing this notebook, make sure you have properly set up your Azure Services, created your Conda environment, and configured your environment variables as per the instructions provided in the [README.md](README.md) file.

## 📋 Table of Contents

This notebook guides you through the following sections:

1. [**Create an Azure AI Search Index**](#create-index): This index will store the content from a document hosted on SharePoint Online.

2. [**Initialize the `client_extractor` client**](#init-client): This client manages the connection to a SharePoint site through the Microsoft Graph REST API and retrieves the Site ID for the site.

3. [**Download and Process Content and Metadata**](#download-process): The `client_extractor` client provides several methods for this:
    - Download and process all `.docx` and `.pdf` files from a SharePoint site.
    - Download and process only `.docx` files from a specific SharePoint site that were modified or uploaded in the last 60 minutes.
    - Download and process files from a specific folder within a SharePoint site.
    - Download and process a specific file within a SharePoint site.

4. [**Ingest into Azure AI Search Index**](#ingest-index): The extracted content and metadata are ingested into the Azure AI Search Index for easy retrieval and search.

For more details, refer to the following resources:
- [Quickstart: Register an app with the Azure AD v2.0 endpoint](https://learn.microsoft.com/en-us/azure/active-directory/develop/console-app-quickstart?pivots=devlang-python)
- [Create a Demo SharePoint Online Environment](https://cdx.transform.microsoft.com/) (Note: To use this, you need to either be a Microsoft Employee or part of the Microsoft Partner Program: [Microsoft Partner Program](https://partner.microsoft.com/dashboard/account/v3/enrollment/introduction/partnership))



## Getting Started

#### Configure Environment Variables 

Before running this notebook, configure the environment variables listed below. Storing configuration in environment variables keeps sensitive values, such as API keys, out of the source code, so they are less likely to be accidentally committed and pushed to a version control system.

Create a `.env` file in your project root (use the provided `.env.sample` as a template) and add the following variables:

```env
# Azure AI Search Service Configuration
AZURE_AI_SEARCH_SERVICE_ENDPOINT="[Your Azure Search Service Endpoint]"
AZURE_SEARCH_ADMIN_KEY="[Your Azure Search Admin Key]"

# Azure OpenAI Configuration
AZURE_OPENAI_API_KEY='[Your OpenAI API Key]'
AZURE_OPENAI_ENDPOINT='[Your OpenAI Endpoint]'
AZURE_OPENAI_API_VERSION='[Your Azure OpenAI API Version]'

# Azure Storage Configuration
AZURE_STORAGE_CONNECTION_STRING='[Your Azure Storage Connection String]'
```

Replace the placeholders (e.g., `[Your Azure Search Service Endpoint]`) with your actual values.

- `AZURE_AI_SEARCH_SERVICE_ENDPOINT` and `AZURE_SEARCH_ADMIN_KEY` are used to configure the Azure AI Search service.
- `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, and `AZURE_OPENAI_API_VERSION` are used to configure the Azure OpenAI service.
- `AZURE_STORAGE_CONNECTION_STRING` is used to configure the Azure Storage service.
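
A missing or empty variable usually only surfaces later as an authentication or connection error, so it can help to check for gaps up front. The following is a minimal sketch using only the standard library; `REQUIRED_VARIABLES` and `find_missing_variables` are illustrative names that are not part of this repository. It inspects the current process environment, so it is only meaningful after the values from `.env` have been loaded into the process.

```python
import os

# Variable names taken from the .env template above.
REQUIRED_VARIABLES = [
    "AZURE_AI_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_STORAGE_CONNECTION_STRING",
]


def find_missing_variables(names, environ=None):
    """Return the names that are unset or blank in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name, "").strip()]


missing = find_missing_variables(REQUIRED_VARIABLES)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

The optional `environ` argument exists only so the check can be exercised against a plain dictionary instead of the real environment.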

> 📌 **Note**
> Do not commit the `.env` file to your version control system. Add it to your `.gitignore` file to prevent it from being tracked.

#### Setting Up Conda Environment and Configuring VSCode for Jupyter Notebooks (Optional)

Follow these steps to create a Conda environment and set up your VSCode for running Jupyter Notebooks:

##### Create Conda Environment from the Repository

> Instructions for Windows users: 

1. **Create the Conda Environment**:
   - In your terminal or command line, navigate to the repository directory.
   - Execute the following command to create the Conda environment using the `environment.yml` file:
     ```bash
     conda env create -f environment.yml
     ```
   - This command creates a Conda environment as defined in `environment.yml`.

2. **Activating the Environment**:
   - After creation, activate the new Conda environment by using:
     ```bash
     conda activate sharepoint-indexing
     ```

> Instructions for Linux users (or Windows users with WSL or another Linux setup): 

1. **Use `make` to Create the Conda Environment**:
   - In your terminal or command line, navigate to the repository directory and look at the Makefile.
   - Execute the `make` command specified below to create the Conda environment using the `environment.yml` file:
     ```bash
     make create_conda_env
     ```

2. **Activating the Environment**:
   - After creation, activate the new Conda environment by using:
     ```bash
     conda activate sharepoint-indexing
     ```

##### Configure VSCode for Jupyter Notebooks

1. **Install Required Extensions**:
   - Download and install the `Python` and `Jupyter` extensions for VSCode. These extensions provide support for running and editing Jupyter Notebooks within VSCode.

2. **Open the Notebook**:
   - Open the Jupyter Notebook file (`01-indexing-content.ipynb`) in VSCode.

3. **Attach Kernel to VSCode**:
   - Once the Conda environment has been created, it should appear in the kernel selection dropdown in the top-right corner of the VSCode interface.
   - Select your newly created environment (`sharepoint-indexing`) from the dropdown. This sets it as the kernel for running your Jupyter Notebooks.

4. **Run the Notebook**:
   - Once the kernel is attached, you can run the notebook by clicking on the "Run All" button in the top menu, or by running each cell individually.


By following these steps, you establish a dedicated Conda environment for the project and configure VSCode to run Jupyter Notebooks with it. The environment includes all the dependencies specified in `environment.yml`. To add packages or change versions, run `pip install` in a notebook cell or in a terminal with the environment activated, then restart the kernel so the changes take effect.
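
To confirm that the notebook is really running on the intended kernel, a quick check such as the following can be run in the first cell. This is a sketch under two assumptions: `describe_kernel` is an illustrative helper that is not part of the repository, and the `CONDA_DEFAULT_ENV` variable, which `conda activate` sets, may be absent if the kernel was started without activating the environment.

```python
import os
import sys


def describe_kernel(environ=None):
    """Report the interpreter path and the active Conda environment name, if any."""
    environ = os.environ if environ is None else environ
    return {
        "python_executable": sys.executable,
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),
    }


kernel_info = describe_kernel()
print(f"Python executable: {kernel_info['python_executable']}")
print(f"Conda environment: {kernel_info['conda_env']}")
```

If the setup above worked, the reported environment should be `sharepoint-indexing`, or the interpreter path should point inside that environment's directory.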

In [13]:
import os

# Define the target directory
target_directory = r"C:\Users\pablosal\Desktop\gbbai-langchain-azureai-search"  # change your directory here

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbbai-langchain-azureai-search
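
The cell above relies on a hardcoded absolute path that has to be edited on every machine. One possible alternative, sketched here, is to search upward from the current directory for a file known to sit in the repository root. The helper `find_project_root` is an illustrative name, and using `environment.yml` as the marker is an assumption based on the setup instructions, which place that file in the repository directory.

```python
import os
from pathlib import Path


def find_project_root(start, marker="environment.yml"):
    """Walk upward from `start` and return the first directory containing `marker`."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    return None


project_root = find_project_root(Path.cwd())
if project_root is not None:
    # Change the current working directory to the detected project root
    os.chdir(project_root)
    print(f"Directory changed to {project_root}")
else:
    print("Could not find environment.yml in this directory or any parent.")
```

This only works when the notebook is started from somewhere inside the repository; otherwise the explicit path from the cell above is still needed.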


## Initialize `TextChunkingIndexing`

In [14]:
# Import the TextChunkingIndexing class from the langchain_integration_azureai module
from src.gbb_ai.langchain_integration_azureai import TextChunkingIndexing

# Create an instance of the TextChunkingIndexing class
gbb_ai_client = TextChunkingIndexing()

# Load the environment variables from the .env file
gbb_ai_client.load_environment_variables_from_env_file()

# Specify the name of the embedding model deployment in Azure OpenAI
DEPLOYMENT_NAME = "foundational-ada"

# Load the embedding model associated with the specified deployment
embedding_model = gbb_ai_client.load_embedding_model(azure_deployment=DEPLOYMENT_NAME)

2023-12-21 01:51:03,965 - micro - MainProcess - INFO     Loading OpenAIEmbeddings object with model, deployment foundational-ada, and chunk size 1000 (langchain_integration_azureai.py:load_embedding_model:113)


2023-12-21 01:51:05,605 - micro - MainProcess - INFO     AzureOpenAIEmbeddings object created successfully. (langchain_integration_azureai.py:load_embedding_model:124)


## Create/Load the Azure AI Search Index


In [15]:
# Define the name of the Azure Search index
# This is the index where your data is stored in Azure Search
INDEX_NAME = "index-test"

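### Setting Up Search Fields with Azure AI Search

In this section, we define the fields of the Azure AI Search index. Each field holds one piece of every indexed chunk: `id` is the document key, `content` stores the chunk text for full-text search, `content_vector` stores the chunk's embedding for vector search, `metadata` is a searchable string field, and `source` is a filterable string field. The dimension of `content_vector` is taken from the length of a sample embedding, so it matches the embedding model loaded above.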

In [16]:
from azure.search.documents.indexes.models import (
    SearchFieldDataType,
    SearchField,
    SimpleField,
    SearchableField,
    SemanticSettings,
    SemanticConfiguration,
    PrioritizedFields,
    SemanticField,
)

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True, filterable=True),
    SearchableField(name="content", type=SearchFieldDataType.String, searchable=True),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=len(embedding_model.embed_query("Text")),
        vector_search_configuration="default",
    ),
    SearchableField(name="metadata", type=SearchFieldDataType.String, searchable=True),
    SimpleField(name="source", type=SearchFieldDataType.String, filterable=True),
]
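
The `content_vector` field must declare the same number of dimensions as the vectors the embedding model returns, which is why the cell above measures the length of a sample embedding. The following is a minimal sketch of that idea. It assumes a hypothetical `FakeEmbeddingModel` stand-in so that it runs without an Azure OpenAI deployment; `probe_embedding_dimensions` and `VECTOR_DIMENSIONS` are likewise illustrative names, not part of the library.

```python
class FakeEmbeddingModel:
    """Hypothetical stand-in for the real embedding model (makes no network calls)."""

    def __init__(self, dimensions):
        self.dimensions = dimensions

    def embed_query(self, text):
        # Return a zero vector of the configured length, ignoring the text.
        return [0.0] * self.dimensions


def probe_embedding_dimensions(model, probe_text="Text"):
    """Embed one short string and return the length of the resulting vector."""
    return len(model.embed_query(probe_text))


# 1536 is only an illustrative value for the stand-in model.
fake_model = FakeEmbeddingModel(dimensions=1536)
VECTOR_DIMENSIONS = probe_embedding_dimensions(fake_model)
print(VECTOR_DIMENSIONS)
```

With the real `embedding_model`, the probe is most likely a request to the embedding service, so storing the result in a variable avoids repeating that request if the field list is rebuilt.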

### Configuring Semantic Search Parameters

In this section, we set up the configuration for semantic search. Semantic search is a type of information retrieval that focuses on the meaning of queries, rather than just matching keywords. It uses natural language processing (NLP) and other advanced techniques to understand the context and intent behind a user's search query, providing more relevant and accurate results. The configuration below uses `content` as both the title field and the prioritized content field, and `metadata` as the keywords field.

In [17]:
semantic_settings_config = [
    SemanticConfiguration(
        name="config",
        prioritized_fields=PrioritizedFields(
            title_field=SemanticField(field_name="content"),
            prioritized_content_fields=[SemanticField(field_name="content")],
            prioritized_keywords_fields=[SemanticField(field_name="metadata")],
        ),
    )
]

In [18]:
# Set up the Azure Search client with the specified index
# This prepares the client to interact with the Azure Search service
gbb_ai_client.setup_azure_search(
    index_name=INDEX_NAME,
    fields=fields,
    semantic_settings_config=semantic_settings_config,
)

2023-12-21 01:55:09,525 - micro - MainProcess - INFO     Azure Cognitive Search client configured successfully. (langchain_integration_azureai.py:setup_azure_search:222)


<langchain_community.vectorstores.azuresearch.AzureSearch at 0x20a9785dca0>

## Indexing PDFs

In [20]:
# Load a PDF and split its text into chunks
# Path to the PDF file to index (relative to the project root; os is imported in an earlier cell)
file_1 = os.path.join("utils", "data", "ultraflex_user_manual.pdf")

# Set the chunk size and overlap size for splitting the text
CHUNK_SIZE = 512
OVERLAP_SIZE = 128
# Raw string so that \w reaches the regex engine unchanged
# Note: SEPARATOR is defined here but is not passed to the call below
SEPARATOR = r"(\n\w|\w\n)"

# Load the PDF, split the text into chunks, and store the chunks
# The text is split into chunks of size CHUNK_SIZE, with an overlap of OVERLAP_SIZE between consecutive chunks
text_chunked = gbb_ai_client.load_and_split_text_by_character_from_pdf(
    source=file_1, chunk_size=CHUNK_SIZE, chunk_overlap=OVERLAP_SIZE
)

# Embed the chunks and index them in Azure AI Search
# This function converts the text chunks into vector embeddings and stores them in the Azure AI Search index
gbb_ai_client.embed_and_index(text_chunked)

2023-12-21 01:59:26,727 - micro - MainProcess - INFO     Reading PDF files from C:\Users\pablosal\Desktop\gbbai-langchain-azureai-search\utils\data\ultraflex_user_manual.pdf. (langchain_integration_azureai.py:read_and_load_pdfs:320)
2023-12-21 01:59:51,660 - micro - MainProcess - INFO     Starting to embed and index 39 chuncks. (langchain_integration_azureai.py:embed_and_index:387)
2023-12-21 02:00:43,435 - micro - MainProcess - INFO     Successfully embedded and indexed 39 chuncks. (langchain_integration_azureai.py:embed_and_index:389)
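
To illustrate what chunk size and chunk overlap mean, here is a simplified sliding-window splitter written in plain Python. It is a sketch only, not the splitter the library uses: `split_with_overlap` is an illustrative name, and the actual `load_and_split_text_by_character_from_pdf` method may also split on separators, so its chunk boundaries can differ.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")

    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
# Prints:
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each chunk repeats the last four characters of the previous one, which is the role `OVERLAP_SIZE` plays above: text near a chunk boundary appears in both neighboring chunks, so a sentence cut at the boundary is still retrievable in context.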
