Муртазалиев Матвей, 466797, J3110


## 0. Loading the data

In [2]:
import os, gc, random, warnings
from pathlib import Path
import numpy as np, pandas as pd
from tqdm.notebook import tqdm
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings

warnings.filterwarnings('ignore')



In [3]:
# Read the files
DATA_DIR = '../data'
questions = pd.read_csv(os.path.join(DATA_DIR, 'Questions_cleared.csv'), encoding='latin1', parse_dates=['CreationDate', 'ClosedDate'])
answers = pd.read_csv(os.path.join(DATA_DIR, 'Answers_cleared.csv'), encoding='latin1', parse_dates=['CreationDate'])
tags = pd.read_csv(os.path.join(DATA_DIR, 'Tags.csv'), encoding='latin1')

print('Questions:', questions.shape)
print('Answers  :', answers.shape)
print('Tags     :', tags.shape)


Questions: (1264216, 13)
Answers  : (2014516, 7)
Tags     : (3750994, 2)


In [4]:
questions.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body,AnswerCount,TagList,CleanBody,CleanTitle,AcceptedAnswer,doc
0,80,26.0,2008-08-01 13:57:07+00:00,NaT,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...,3,"['flex', 'actionscript-3', 'air']",write database generation script sql want exec...,sqlstatement execute multiple query one statement,wound use kind hack actually work pretty well ...,write database generation script sql want exec...
1,90,58.0,2008-08-01 14:41:24+00:00,2012-12-26 03:45:49+00:00,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...,3,"['svn', 'tortoisesvn', 'branch', 'branching-an...",really good tutorial explain branching merge a...,good branching merge tutorial tortoisesvn,easy click click instruction specific tortoise...,really good tutorial explain branching merge a...
2,120,83.0,2008-08-01 15:50:08+00:00,NaT,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...,1,"['sql', 'asp.net', 'sitemap']",anyone get experience create sql base asp net ...,asp net site map,jeff prosise version msdn magazine work pretty...,anyone get experience create sql base asp net ...
3,180,2089740.0,2008-08-01 18:42:19+00:00,NaT,53,Function for creating color wheels,<p>This is something I've pseudo-solved many t...,9,"['algorithm', 'language-agnostic', 'colors', '...",something pseudo solve many time never quite f...,function create color wheel,first thought generate vector space maximize d...,something pseudo solve many time never quite f...
4,260,91.0,2008-08-01 23:22:08+00:00,NaT,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,9,"['c#', '.net', 'scripting', 'compiler-construc...",little game write us database back end trading...,add script functionality net application,oleg shilo script solution code project really...,little game write us database back end trading...


In [13]:
answers.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body,CleanBody
0,92,61.0,2008-08-01 14:45:37+00:00,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Vers...",version control subversion good resource sourc...
1,124,26.0,2008-08-01 16:09:47+00:00,80,12,<p>I wound up using this. It is a kind of a ha...,wound use kind hack actually work pretty well ...
2,199,50.0,2008-08-01 19:36:46+00:00,180,1,<p>I've read somewhere the human eye can't dis...,read somewhere human eye distinguish less valu...
3,269,91.0,2008-08-01 23:49:57+00:00,260,4,"<p>Yes, I thought about that, but I soon figur...",yes thought soon figure another domain specifi...
4,307,49.0,2008-08-02 01:49:46+00:00,260,28,"<p><a href=""http://www.codeproject.com/Article...",oleg shilo script solution code project really...


## 1. Generating embeddings


### Building the text (question + best answer)

In [6]:
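# Assign the result back: calling fillna(inplace=True) on a selected column is deprecated in recent pandas
questions['AcceptedAnswer'] = questions['AcceptedAnswer'].fillna('')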

In [7]:
# Concatenate the cleaned question body with the accepted answer (empty string if there is none)
questions['doc'] = questions['CleanBody'] + ' ' + questions['AcceptedAnswer']
documents = questions['doc'].tolist()

### Creating the embeddings

In [8]:
MODEL = 'sentence-transformers/all-MiniLM-L6-v2'  # 384-dim
model = SentenceTransformer(MODEL)
BATCH = 1024
embeddings = []
for i in tqdm(range(0, len(documents), BATCH), desc='embed'):
    batch_emb = model.encode(documents[i:i + BATCH], convert_to_numpy=True, normalize_embeddings=True)
    embeddings.append(batch_emb)
embeddings = np.vstack(embeddings)
print('Embeddings shape:', embeddings.shape)


Embeddings shape: (1264216, 384)


## 2&3. Loading the embeddings into Chroma

In [9]:
CHROMA_DIR = 'chroma_db'
client = chromadb.PersistentClient(path=CHROMA_DIR, settings=Settings(anonymized_telemetry=False))  # Create the client
COLL = 'StackOverflowQnA'
# Fetch the collection, creating it with cosine distance if it does not exist yet
col = client.get_or_create_collection(COLL, metadata={'hnsw:space': 'cosine'})

# If the collection is empty, upload the embeddings
if col.count() == 0:
    STEP = 1000
    for i in tqdm(range(0, len(questions), STEP), desc='upload'):
        batch = questions.iloc[i:i + STEP]
        ids = batch.index.astype(str).tolist()
        # Separate name, so the `documents` list built for embedding is not overwritten
        bodies = batch['Body'].tolist()
        metas = [
            {'title': title[:200], 'answer': answer[:500]}
            for title, answer in zip(batch['Title'], batch['AcceptedAnswer'])
        ]
        col.add(ids=ids, embeddings=embeddings[i:i + STEP].tolist(), documents=bodies, metadatas=metas)
print('Vectors in Chroma:', col.count())


Vectors in Chroma: 1264216


## 4. Semantic search

In [10]:
def semantic_search(query: str, k: int = 5):
    """Return the k nearest stored questions as dicts with 'title', 'answer' and 'score'."""
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    res = col.query(query_embeddings=[q_emb.tolist()], n_results=k, include=['distances', 'metadatas'])
    hits = []
    for dist, meta in zip(res['distances'][0], res['metadatas'][0]):
        # The collection uses cosine distance, so 1 - dist is the cosine similarity
        hits.append({'answer': meta.get('answer', ''), 'score': 1 - dist, 'title': meta.get('title', '')})
    return hits
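
The `score` returned by `semantic_search` is `1 - dist`. Assuming the collection was created with `'hnsw:space': 'cosine'`, the reported distance is one minus the cosine similarity, so the score is the cosine similarity itself. The sketch below shows that relationship in plain Python on two toy 2-dimensional vectors. The helper names and the vectors are illustrative only and are not part of the notebook.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product of the unit-length versions of a and b."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """Cosine distance as one minus cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


# Toy vectors (hypothetical, not real embeddings)
query = [1.0, 0.0]
same_direction = [3.0, 0.0]   # points the same way as the query
orthogonal = [0.0, 2.0]       # unrelated direction

# Converting distance back to a score, as semantic_search does
score_same = 1.0 - cosine_distance(query, same_direction)
score_orth = 1.0 - cosine_distance(query, orthogonal)
print(round(score_same, 3), round(score_orth, 3))
```

A score close to 1 means the query and the stored document point in nearly the same direction, and a score close to 0 means they are unrelated.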

In [11]:
# Test it
for h in semantic_search('How to convert a string to a list in Python?', k=1):
    print(f"Title:  {h['title']}\nAnswer: {h['answer']}\nScore:  {h['score']:.3f}\n")

Title:  Assign part of a string to a variable [Python]
Answer: easy python fun object actually string list safely convert list
Score:  0.770



In [12]:
for h in semantic_search('How to convert string to int in Python?', k=1):
    print(f"Title:  {h['title']}\nAnswer: {h['answer']}\nScore:  {h['score']:.3f}\n")

Title:  Is there other ways to convert a string to int in python2 without int and string.atoi?
Answer: would recommend use try catch also use module
Score:  0.730



## 5&6. Storage, comparison, and conclusions

Storing data as vectors in Chroma:
- The embeddings take $N \times d \times 4$ bytes of memory, where $N$ is the number of objects, $d$ is the embedding dimension, and 4 is the size in bytes of one float32 value. For this project ($N = 1264216$, $d = 384$) that is about 1.9 GB for the raw vectors alone, not counting the HNSW index, the documents, and the metadata
- Search in Chroma: Chroma uses ANN (Approximate Nearest Neighbors) indexes such as HNSW to find nearest neighbors quickly in the vector space. This makes it possible to find similar objects efficiently even in large datasets
- To update data, the old records can be deleted with the `delete` method and new ones added under the same identifiers with the `add` method. Chroma collections also provide `update` and `upsert`, which overwrite existing records in a single call
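
The storage estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that counts only the raw float32 vectors. The function name `embedding_bytes` is hypothetical, and the real on-disk size of the Chroma directory will be larger because of the index, the documents, and the metadata.

```python
def embedding_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of n_vectors embeddings of dimension dim (float32 by default)."""
    return n_vectors * dim * bytes_per_value


N = 1_264_216   # number of questions, from the printed embeddings shape
d = 384         # dimension of all-MiniLM-L6-v2 embeddings

raw = embedding_bytes(N, d)
print(f"{raw} bytes = {raw / 1e9:.2f} GB = {raw / 2**30:.2f} GiB")
```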

Comparison of classical and semantic search:
- Classical search relies on exact word matches and does not account for synonyms or context
- Semantic search, which uses embeddings, handles synonyms and context better but requires more resources for storage and computation
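
To make the limitation of exact matching concrete, here is a toy keyword search in plain Python. It is a deliberately simplified stand-in for classical search (no stemming, no ranking), and the two documents are invented for illustration rather than taken from the dataset.

```python
def keyword_search(query, documents):
    """Return the documents that contain every query word as an exact token."""
    query_words = set(query.lower().split())
    return [doc for doc in documents if query_words <= set(doc.lower().split())]


# Hypothetical mini-corpus: both documents are about the same task
docs = [
    "convert string to int in python",
    "parse text into an integer",
]

# Exact wording matches the first document
exact_hits = keyword_search("convert string int", docs)

# Same intent, different words: no document contains all three tokens
synonym_hits = keyword_search("parse string integer", docs)

print(exact_hits)
print(synonym_hits)
```

The second query returns nothing even though both documents are relevant to it. An embedding-based search such as `semantic_search` is meant to close this gap, at the cost of storing and comparing dense vectors.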
