<a id="0"></a> <br>
# Table of Contents  
1. [Data preprocessing](#1)     
2. [Create embeddings](#2)
3. [Create the search index](#3)
4. [Insert text and embeddings into vector store](#4)
5. [Vector similarity search](#5)
6. [Vector similarity search with a filter](#6)
7. [Keyword search](#7)
8. [Hybrid search](#8)
9. [LLM model](#9)

In [1]:
import pandas as pd
import numpy as np

<a id="1"></a>
## Data preprocessing
[Back to the top](#0)

Load the dataset.

In [2]:
data_google = pd.read_excel('search_results.xlsx')
data_google.shape

(8205, 8)

In [3]:
data_google.head(5)

Unnamed: 0,title,link,displayed_link,snippet,keywords,position,language,timestamp
0,Hairpin-Technologie,https://de.wikipedia.org/wiki/Hairpin-Technologie,https://de.wikipedia.org,... Kontaktieren von Statoren für elektrische ...,"stator wicklung ""kontaktieren"" -""kontaktieren ...",1,de,04/04/2024 17:14
1,EP2858212B1 - Wickelverfahren für eine Statorw...,https://patents.google.com/patent/EP2858212B1/de,https://patents.google.com,[0006]. Insbesondere das Verbinden der Leiterd...,"stator wicklung elektrisch draht spule phase ""...",1,de,04/04/2024 17:14
2,DE10321956A1 - Hairpin wound stator for electr...,https://patents.google.com/patent/DE10321956A1/en,https://patents.google.com,... verbinden. Ein Prototyp dieser Technologie...,"stator wicklung draht spule phase ""verbinden"" ...",1,de,04/04/2024 17:14
3,Stator und Verfahren zur Herstellung eines Sta...,https://patents.google.com/patent/DE1020191113...,https://patents.google.com,Zur Kontaktierung ist eine Kontakteinrichtung ...,stator wicklung draht spule wicklungsende phas...,1,de,04/04/2024 17:14
4,risomat - Prozesse,https://www.risomat.de/prozesse/,https://www.risomat.de,Für das Wickeln von Spulen für Stator-und Roto...,"stator wicklung ""kontaktieren"" -""kontaktieren ...",1,de,04/04/2024 17:14


Check if our dataset has missing values.

In [4]:
data_google.isnull().sum()

title              0
link               0
displayed_link     0
snippet           31
keywords           0
position           0
language           0
timestamp          0
dtype: int64

Remove rows with missing values.

In [5]:
data_google = data_google.dropna(how='any',axis=0)

In [6]:
data_google.isnull().sum()

title             0
link              0
displayed_link    0
snippet           0
keywords          0
position          0
language          0
timestamp         0
dtype: int64

Check the dataset for duplicates.

In [7]:
data_google.duplicated().sum()

0

Reset the index after the preprocessing steps.

In [8]:
data_google.reset_index(drop=True, inplace=True)

<a id="2"></a>
## Create embeddings
[Back to the top](#0)

In [9]:
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential

In [10]:
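# The original values were omitted; replace each placeholder with your own settings
endpoint = "<azure-ai-search-endpoint>"
key = "<azure-ai-search-api-key>"
credential = AzureKeyCredential(key)
index_name = "<index-name>"
azure_openai_endpoint = "<azure-openai-endpoint>"
azure_openai_key = "<azure-openai-api-key>"
azure_openai_embedding_deployment = "<embedding-deployment-name>"
embedding_model_name = "<embedding-model-name>"
azure_openai_api_version = "<api-version>"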

In [11]:
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

In [12]:
openai_credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(openai_credential, "https://cognitiveservices.azure.com/.default")

Initialize the Azure OpenAI client.

In [13]:
client = AzureOpenAI(
    azure_deployment=azure_openai_embedding_deployment,
    api_version=azure_openai_api_version,
    azure_endpoint=azure_openai_endpoint,
    api_key=azure_openai_key,
    azure_ad_token_provider=token_provider if not azure_openai_key else None
)

Convert the `title` and `snippet` columns to lists.

In [14]:
# titles = data_google['title'].tolist()
# len(titles)

In [15]:
# snippets = data_google['snippet'].tolist()
# len(snippets)

Generate embeddings.

In [16]:
# def generate_embeddings(text, model=embedding_model_name):
#     title_response = client.embeddings.create(input = text, model=model)
#     return [item.embedding for item in title_response.data]

In [17]:
# int(np.ceil(len(titles) / 2048))

The embedding model accepts at most 2048 inputs per request, so we split the data into chunks of up to 2048 items and create the embeddings chunk by chunk.

In [18]:
# %%time
# c = int(np.ceil(len(titles) / 2048))
# k = 0
# embedds_titles = []
# for i in range(c):
#   n = len(titles) - 2048 * (c-i-1)
#   embedds_titles = embedds_titles + generate_embeddings(titles[k:n])
#   k = n

In [19]:
# %%time
# c = int(np.ceil(len(snippets) / 2048))
# k = 0
# embedds_snippets = []
# for i in range(c):
#   n = len(snippets) - 2048 * (c-i-1)
#   embedds_snippets = embedds_snippets + generate_embeddings(snippets[k:n])
#   k = n
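
The chunking in the two cells above can also be expressed with a small generic helper. The sketch below is an illustration, not the notebook's own code: `chunked` is a hypothetical name, the list of titles is replaced by toy data, and unlike the loops above it puts the short remainder chunk last rather than first.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Toy stand-in for the list of titles (assumption: 5000 items)
sample = [f"title {i}" for i in range(5000)]
chunk_sizes = [len(chunk) for chunk in chunked(sample, 2048)]
print(chunk_sizes)  # [2048, 2048, 904]
```

With such a helper, the embedding loop would reduce to extending a result list with `generate_embeddings(chunk)` for each chunk of `titles` or `snippets`.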

In [20]:
# len(embedds_titles)

In [21]:
# len(embedds_snippets)

In [22]:
# len(embedds_titles[3])

In [23]:
# len(embedds_snippets[3])

Append new columns `title_embedding` and `snippet_embedding` to the dataframe.

In [24]:
# data_google['title_embedding'] = pd.Series((i for i in embedds_titles))

In [25]:
# data_google['snippet_embedding'] = pd.Series((i for i in embedds_snippets))

In [26]:
# data_google.head()

In [27]:
# data_google.tail()

Add a new `id` column to the dataframe.

In [28]:
# new_col = range(1, len(data_google) + 1)
# new_col

In [29]:
# idx = 0
# data_google.insert(loc=idx, column='id', value=new_col)

In [30]:
# data_google['id'] = data_google['id'].apply(str)

Make sure that column names in the dataframe match the column names of the future vector store.

In [31]:
# data_google = data_google.rename(columns={"title_embedding": "titleVector", "snippet_embedding": "snippetVector"})

Select the columns that will be stored in the vector store.

In [32]:
# data = data_google[['id', 'title', 'snippet', 'keywords', 'link', 'displayed_link', 'titleVector', 'snippetVector']]

In [33]:
# data.head()

In [34]:
# data.tail()

Export the dataframe to a JSON file, which will be uploaded to the vector store.

In [35]:
#import json

In [36]:
# data.to_json("gdrive/My Drive/text_analysis/Vectors.json", orient="records")

In [37]:
# path = 'gdrive/My Drive/text_analysis/Vectors.json'
# with open(path, 'r', encoding='utf-8') as file:
#     documents = json.load(file)

In [38]:
#documents[:2]

<a id="3"></a>
## Create the search index
[Back to the top](#0)

In [39]:
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchFieldDataType,
    SearchableField,
    SearchField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    #SemanticConfiguration,
    #SemanticPrioritizedFields,
    #SemanticField,
    #SemanticSearch,
    SearchIndex
)

Create a search index schema.

In [40]:
# index_client = SearchIndexClient(
#     endpoint=endpoint, credential=credential)
# fields = [
#     SimpleField(name="id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True),
#     SearchableField(name="title", type=SearchFieldDataType.String),
#     SearchableField(name="snippet", type=SearchFieldDataType.String),
#     SearchableField(name="keywords", type=SearchFieldDataType.String),
#     SearchableField(name="link", type=SearchFieldDataType.String),
#     SearchableField(name="displayed_link", type=SearchFieldDataType.String,
#                     filterable=True),
#     SearchField(name="titleVector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
#                 searchable=True, vector_search_dimensions=1536, vector_search_profile_name="myHnswProfile"),
#     SearchField(name="snippetVector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
#                 searchable=True, vector_search_dimensions=1536, vector_search_profile_name="myHnswProfile"),
# ]

Define the vector search configuration.

In [41]:
# vector_search = VectorSearch(
#     algorithms=[
#         HnswAlgorithmConfiguration(
#             name="myHnsw"
#         )
#     ],
#     profiles=[
#         VectorSearchProfile(
#             name="myHnswProfile",
#             algorithm_configuration_name="myHnsw",
#         )
#     ]
# )

Create an instance of the search index.

In [42]:
# index = SearchIndex(name=index_name, fields=fields,
#                     vector_search=vector_search)
# result = index_client.create_or_update_index(index)
# print(f' {result.name} created')

<a id="4"></a>
## Insert text and embeddings into vector store
[Back to the top](#0)

In [43]:
from azure.search.documents import SearchClient

In [44]:
# len(documents)

Upload all documents to the vector store. Again, we split the data into chunks, this time of up to 1000 documents each, because `SearchClient` cannot upload all of the documents in a single request.

In [45]:
# search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)
# c = int(np.ceil(len(documents) / 1000))
# k = 0
# for i in range(c):
#   n = len(documents) - 1000 * (c-i-1)
#   result = search_client.upload_documents(documents[k:n])
#   k = n
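
The upload loop above discards the per-document results. A minimal sketch of a batching helper that also counts failures is shown below. `upload_in_batches`, `FakeResult` and `fake_upload` are hypothetical names introduced for illustration; the only assumption about the real service is that each returned result exposes a boolean `succeeded` attribute, as the SDK's indexing results do.

```python
from typing import Any, Callable, List, Sequence


def upload_in_batches(
    upload: Callable[[List[dict]], Sequence[Any]],
    documents: Sequence[dict],
    batch_size: int = 1000,
) -> int:
    """Send `documents` through `upload` in batches; return how many failed."""
    failed = 0
    for start in range(0, len(documents), batch_size):
        batch = list(documents[start:start + batch_size])
        results = upload(batch)
        # Each result is expected to expose a boolean `succeeded` attribute
        failed += sum(1 for r in results if not getattr(r, "succeeded", False))
    return failed


# Stand-in for search_client.upload_documents, used only for illustration
class FakeResult:
    def __init__(self, succeeded: bool) -> None:
        self.succeeded = succeeded


batch_lengths: List[int] = []


def fake_upload(batch: List[dict]) -> List[FakeResult]:
    batch_lengths.append(len(batch))
    return [FakeResult(True) for _ in batch]


docs = [{"id": str(i)} for i in range(1, 2501)]
failures = upload_in_batches(fake_upload, docs)
print(batch_lengths, failures)  # [1000, 1000, 500] 0
```

In the notebook this would be called as `upload_in_batches(search_client.upload_documents, documents)`, and a non-zero return value would signal documents that need to be retried.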

<a id="5"></a>
## Vector similarity search
[Back to the top](#0)

In [46]:
search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)

Create an embedding for the query and retrieve the 6 best-matching documents, searching both `titleVector` and `snippetVector`. Print the `title`, score, `snippet` and `link` of each result.

In [47]:
from azure.search.documents.models import VectorizedQuery

query = "Welded wire mesh"

embedding = client.embeddings.create(input=query, model=embedding_model_name).data[0].embedding
vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=50, fields="titleVector, snippetVector")

results = search_client.search(
    search_text=None,
    vector_queries= [vector_query],
    select=["title", "snippet", "link"],
    top=6
)

for result in results:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['snippet']}")
    print(f"Link: {result['link']}\n")

Title: Magnet Wire
Score: 0.03229166567325592
Content: Enamelled wire is used for wound coil of motor or transformer. ... Small welding method (connecting wire, lead wire connection) ... Wire stretching shall be ...
Link: https://www.casa.co.nz/cables/wire/magnet/MagnetWire-Hitachi_40p.pdf

Title: SelfBonding Wire
Score: 0.03131881356239319
Content: Contact · Map · Country. Selfbonding Wire, Selfbonding ... Use of selfbonding enamelled wire offers advantages over conventional enamelled wire in certain winding ...
Link: https://www.elektrisola.com/en/Selfbonding-Wire/Info

Title: Spot welding of enamelled wires on stator
Score: 0.029236022382974625
Content: The enamelled copper wire needs to be reliably welded to the copper contact pin without removing the enamel in advance. Solution. The connection was created ...
Link: https://www.telsonic.com/en/application-finder/spot-welding-of-enamelled-wires-on-stator/

Title: Enamelled Copper Wires for Automotive
Score: 0.024610823020339012
Cont

Create an embedding for the query and retrieve the 6 best-matching documents, this time searching `snippetVector` only. Print the `title`, score, `snippet` and `link` of each result.

In [48]:
from azure.search.documents.models import VectorizedQuery

query = "Welded wire mesh"

embedding = client.embeddings.create(input=query, model=embedding_model_name).data[0].embedding
vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=50, fields="snippetVector")

results = search_client.search(
    search_text=None,
    vector_queries= [vector_query],
    select=["title", "snippet", "link"],
    top=6
)

for result in results:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['snippet']}")
    print(f"Link: {result['link']}\n")

Title: PROJECT 1 COMPLETE REPORT (docx)
Score: 0.8654317
Content: ... attach the device to a power source and ... Wire Mesh Wire mesh manufactured from ... winding. The field winding is a wire coil contained in ...
Link: https://www.cliffsnotes.com/study-notes/4862057

Title: Resource Center
Score: 0.85823053
Content: For a demo, we used a sewing machine to sew two DexMat CNT wires / yarn / rope into a piece of fabric and connect... GAMMA-RAY SHIELDING PERFORMANCE OF ...
Link: https://dexmat.com/resource-center/

Title: Direct-Extruded High-Conductivity Copper for Electric ...
Score: 0.85118
Content: Shape Wire process ... The high conductivity wire for stator winding could spill over into permanent magnet motor ... using non-contact infrared ...
Link: https://www.energy.gov/eere/vehicles/articles/directly-extruded-high-conductivity-copper-electric-machines

Title: 1/8" Form-I-Glas K1230P .030" 100% Polyester Shrinkable ...
Score: 0.8511506
Content: Magnet/Winding Wire · Fine Magnet/Wi
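
Vector similarity search ranks documents by how close their embeddings are to the query embedding. The sketch below illustrates the idea with exact cosine similarity on toy 2-dimensional vectors. It assumes the index uses the cosine metric; the service itself runs an approximate HNSW search over 1536-dimensional vectors and reports its own score, so the numbers here will not match the output above. All names in the block are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy embeddings standing in for the query and three documents
query_vec = [1.0, 0.0]
doc_vecs = {"doc_a": [1.0, 0.0], "doc_b": [1.0, 1.0], "doc_c": [0.0, 1.0]}

# Rank documents from most to least similar to the query
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)  # ['doc_a', 'doc_b', 'doc_c']
```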

<a id="6"></a>
## Vector similarity search with a filter
[Back to the top](#0)

Add a filter to the vector similarity search so that only documents whose `displayed_link` equals `https://patents.google.com` are considered.

In [49]:
from azure.search.documents.models import VectorizedQuery
from azure.search.documents.models import VectorFilterMode

query = "Welded wire mesh"

embedding = client.embeddings.create(input=query, model=embedding_model_name).data[0].embedding
vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=50, fields="titleVector, snippetVector")
filter_str = 'https://patents.google.com'

results = search_client.search(
    search_text=None,
    vector_queries= [vector_query],
    vector_filter_mode=VectorFilterMode.PRE_FILTER,
    filter=f"displayed_link eq '{filter_str}'",
    select=["title", "snippet", "link"],
    top=6
)

for result in results:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['snippet']}")
    print(f"Link: {result['link']}\n")

Title: Stator assembly made from a molded web of core ...
Score: 0.030365770682692528
Content: Wire is then wound around the poles to form stator windings. ... Of primary concern are glues used to attach ... step, as a continuous length of wire for each phase ...
Link: https://patents.google.com/patent/US20040034988A1/en

Title: Wire guide for winding dynamo-electric machine stators ...
Score: 0.0297619067132473
Content: While it is possible to attach a guide 100 to just one coil ... wire guiding head for winding stator coils ... Technologies, L.L.C. Method of winding an electric ...
Link: https://patents.google.com/patent/US6325318B1/en

Title: EP3402050A1 - Insultated wire of a coil for a random-wound stator
Score: 0.028665028512477875
Content: A winding wire (1) of a coil for a random-wound stator ... winding wire 1 endings, where high electric fields may occur. ... contacting method. GB2332559A * 1997-11 ...
Link: https://patents.google.com/patent/EP3402050A1/en

Title: Process for
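
The filter above is an OData expression built with an f-string, which breaks if the value contains a single quote. A minimal sketch of a helper that escapes quotes by doubling them, as OData string literals require, is shown below. `odata_eq` is a hypothetical name, and the second example value is invented for illustration.

```python
def odata_eq(field: str, value: str) -> str:
    """Build an OData `field eq 'value'` expression, doubling single quotes."""
    escaped = value.replace("'", "''")
    return f"{field} eq '{escaped}'"


print(odata_eq("displayed_link", "https://patents.google.com"))
# displayed_link eq 'https://patents.google.com'

# Hypothetical value containing a single quote
print(odata_eq("displayed_link", "https://example.com/o'brien"))
# displayed_link eq 'https://example.com/o''brien'
```

The result could then be passed as `filter=odata_eq("displayed_link", filter_str)` in the search call.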

<a id="7"></a>
## Keyword search
[Back to the top](#0)

Perform a keyword (full-text) search using the words in the query.

In [50]:
from azure.search.documents.models import VectorizedQuery

query = "Welded wire mesh"

results = search_client.search(
    search_text=query,
    select=["title", "snippet", "link"],
    top=6
)

for result in results:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['snippet']}")
    print(f"Link: {result['link']}\n")

Title: PEEK Wire | Magnet & Winding Wire Insulation - Zeus
Score: 13.158661
Content: As a single layer solution, PEEK wire HB enables motor ... Our patented extrusion technology enables ... Connect With Zeus. LinkedIn · YouTube. Copyright 2024 ...
Link: https://www.zeusinc.com/products/insulated-wire/peek-wire/

Title: PROJECT 1 COMPLETE REPORT (docx)
Score: 12.687489
Content: ... attach the device to a power source and ... Wire Mesh Wire mesh manufactured from ... winding. The field winding is a wire coil contained in ...
Link: https://www.cliffsnotes.com/study-notes/4862057

Title: SelfBonding Wire
Score: 12.551285
Content: Contact · Map · Country. Selfbonding Wire, Selfbonding ... Use of selfbonding enamelled wire offers advantages over conventional enamelled wire in certain winding ...
Link: https://www.elektrisola.com/en/Selfbonding-Wire/Info

Title: Magnet Wire
Score: 12.205912
Content: Enamelled wire is used for wound coil of motor or transformer. ... Small welding method (conne

<a id="8"></a>
## Hybrid search
[Back to the top](#0)

Perform a hybrid search, which combines vector similarity search and keyword search in a single request. According to the Microsoft documentation, this approach gives the best results in most cases.

In [51]:
from azure.search.documents.models import VectorizedQuery

query = "Welded wire mesh"

embedding = client.embeddings.create(input=query, model=embedding_model_name).data[0].embedding
vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=50, fields="titleVector, snippetVector")

results = search_client.search(
    search_text=query,
    vector_queries= [vector_query],
    select=["title", "snippet", "link"],
    top=10
)

for result in results:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['snippet']}")
    print(f"Link: {result['link']}\n")

Title: Magnet Wire
Score: 0.04816468432545662
Content: Enamelled wire is used for wound coil of motor or transformer. ... Small welding method (connecting wire, lead wire connection) ... Wire stretching shall be ...
Link: https://www.casa.co.nz/cables/wire/magnet/MagnetWire-Hitachi_40p.pdf

Title: SelfBonding Wire
Score: 0.047447845339775085
Content: Contact · Map · Country. Selfbonding Wire, Selfbonding ... Use of selfbonding enamelled wire offers advantages over conventional enamelled wire in certain winding ...
Link: https://www.elektrisola.com/en/Selfbonding-Wire/Info

Title: Spot welding of enamelled wires on stator
Score: 0.037569355219602585
Content: The enamelled copper wire needs to be reliably welded to the copper contact pin without removing the enamel in advance. Solution. The connection was created ...
Link: https://www.telsonic.com/en/application-finder/spot-welding-of-enamelled-wires-on-stator/

Title: PROJECT 1 COMPLETE REPORT (docx)
Score: 0.03306011110544205
Content: 
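
A hybrid query has to merge a keyword ranking and a vector ranking into one list. Azure AI Search documents Reciprocal Rank Fusion (RRF) for this step; the sketch below shows the general technique on toy rankings. The constant `k = 60`, the 1-based ranks and all document ids are assumptions for illustration, so the scores will not reproduce the service's output exactly.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Fuse several ranked lists of ids; each hit adds 1 / (k + rank) to its id."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Return the ids ordered from highest to lowest fused score
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Toy rankings standing in for the keyword and vector result lists
keyword_ranking = ["doc_b", "doc_a", "doc_c"]
vector_ranking = ["doc_a", "doc_d", "doc_b"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(list(fused))  # ['doc_a', 'doc_b', 'doc_d', 'doc_c']
```

Documents that appear near the top of both lists accumulate the highest fused score, which is why hybrid scores are small fractions rather than similarity values.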

<a id="9"></a>
## LLM model
[Back to the top](#0)

In [52]:
%pip install langchain_openai






In [53]:
from langchain_openai import AzureOpenAI

Set the Azure OpenAI connection settings as environment variables.

In [54]:
import os
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2024-02-01"
os.environ["OPENAI_API_KEY"] = "<your-azure-openai-api-key>"  # placeholder; never commit a real key
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://azure-openai-service-iwis-1.openai.azure.com/"
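
Secrets such as the API key are safer when they are set outside the notebook and only read here. A minimal sketch of a helper that fails early when a setting is missing is shown below. `require_env` and the variable name `DEMO_OPENAI_API_KEY` are hypothetical and used only for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo with a hypothetical variable name so real settings are not touched
os.environ["DEMO_OPENAI_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_OPENAI_API_KEY"))  # not-a-real-key
```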

Initialize the Azure OpenAI model.

In [55]:
llm = AzureOpenAI(deployment_name = "azure-llm-model",
                  model = "gpt-35-turbo-instruct",
                  temperature=1)

Create a function that retrieves context with hybrid search and passes it, together with the question, to the LLM defined above to generate a response.

In [56]:
def generate_response(user_question):

    # Retrieve the most relevant documents with a hybrid (keyword + vector) query
    embedding = client.embeddings.create(input=user_question, model=embedding_model_name).data[0].embedding
    vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=50, fields="titleVector, snippetVector")

    results = search_client.search(
        search_text=user_question,
        vector_queries=[vector_query],
        select=["title", "snippet", "link"],
        top=20
    )

    # Concatenate the retrieved documents into a single context string
    context = ""
    for result in results:
        context += f"Title: {result['title']}\nContent: {result['snippet']}\nLink: {result['link']}\n\n"

    # Combine the context and the question into the prompt
    qna_prompt_template = f"""You will be provided with the question and a related context, you need to answer the question using the context.

Context:
{context}

Question:
{user_question}

Make sure to answer the question only using the context provided, if the context doesn't contain the answer then return "I don't have enough information to answer the question".

Answer:"""

    def splitter(n, s):
        # Split the text into pieces of n words so that long answers wrap when printed
        pieces = s.split()
        return (" ".join(pieces[i:i+n]) for i in range(0, len(pieces), n))

    # Call the LLM to generate the response
    response = llm.invoke(qna_prompt_template)
    print("Answer:")

    for piece in splitter(25, response):
        print(piece)

    print("\nRelated documents:\n")
    print(context)
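
The function concatenates every retrieved hit into the prompt, which can grow large when `top=20`. A minimal sketch of a separate context builder with a size cap is shown below. `build_context` and `max_chars` are hypothetical; a character budget is only a crude proxy for the model's token limit, and the toy hits stand in for real search results.

```python
from typing import Iterable, Mapping


def build_context(results: Iterable[Mapping[str, str]], max_chars: int = 6000) -> str:
    """Concatenate search hits into one context string, stopping before max_chars."""
    parts = []
    used = 0
    for hit in results:
        entry = f"Title: {hit['title']}\nContent: {hit['snippet']}\nLink: {hit['link']}\n\n"
        if used + len(entry) > max_chars:
            break  # Adding this hit would exceed the budget
        parts.append(entry)
        used += len(entry)
    return "".join(parts)


# Toy hits with the same keys as the selected index fields
hits = [
    {"title": "A", "snippet": "first", "link": "https://example.com/a"},
    {"title": "B", "snippet": "second", "link": "https://example.com/b"},
]
print(build_context(hits))
```

Keeping this step in its own function also makes it possible to test the prompt assembly without calling the search service or the LLM.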

Test the function.

In [59]:
user_question = "Find all papers written by T. Gläßel."
generate_response(user_question)

Answer:
1) "Kontaktierung von Antrieben für die Elektromobilität: Innovative Vorgehensweisen, Prozessketten und Technologien" by T. Gläßel and Jörg Franke in Annals 2017 2) "Development of a
Production Process for Formed Litz Wire..." by T. Gläßel and J. Franke in chapter/10.1007/978-3-030-78424-9_44 (accessed: 2022/07/17) 3) "Skinning of Insulated Copper Wires within the Production
Chain" by Tobias Gläßel and Karsten Seefried in Kontaktierung von Antrieben für die Elektromobilität.

Related documents:

Title: Annals 2017
Content: Gläßel, T.; Franke, Jörg: Kontaktierung von Antrieben für die Elektromobilität: Innovative Vorgehensweisen,. Prozessketten und Technologien ...
Link: https://www.faps.fau.eu/wp-content/uploads/2018/03/FAPS_Annals_2017.pdf

Title: Development of a Production Process for Formed Litz Wire ...
Content: chapter/10.1007/978-3-030-78424-9_44 (accessed: 2022/07/17). [9] T. Gläßel and J. Franke, “Kontaktierung von Antrieben für die.
Link: https://www.researchgate.net/

Ask the same question again, this time without providing the LLM with our data.

In [58]:
user_question = "Find all papers written by T. Gläßel."
def splitter(n, s):
    # Split the text into pieces of n words so that long answers wrap when printed
    pieces = s.split()
    return (" ".join(pieces[i:i+n]) for i in range(0, len(pieces), n))

# Call the LLM with the bare question, without any retrieved context
response_llm = llm.invoke(user_question)
print("Answer:")

for piece in splitter(25, response_llm):
    print(piece)

Answer:
There is not enough information available to answer this question accurately as it depends on which field or subject the person is researching on or
which specific T. Gläßel is being referred to. Without additional information, it is not possible to provide an accurate list of papers written by T.
Gläßel.
