# faiss usage example

See [faiss-useage.ipynb](https://colab.research.google.com/drive/1MSrwFndb62j87-00Rk4s9TQEXjgAWHLw?usp=sharing#scrollTo=C4FiOXnEtl1f).

## Setup

In [5]:
%%time
%%capture
!pip install sentence-transformers
!apt-get install libomp-dev -y
!pip install faiss-cpu

CPU times: user 25.5 ms, sys: 54.8 ms, total: 80.3 ms
Wall time: 5.98 s


## Encoding the documents

In [1]:
%%time

from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("/models/bge-large-zh-v1.5")

CPU times: user 3.03 s, sys: 880 ms, total: 3.91 s
Wall time: 3.55 s


In [2]:
%%time

data = [
    'What is your name?',
    'What is your age?',
]
encoded_data = encoder.encode(data)

CPU times: user 372 ms, sys: 16.5 ms, total: 388 ms
Wall time: 389 ms
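
The index built in the next section ranks results by inner product. Inner product equals cosine similarity only when the vectors have unit L2 norm, so it can be worth checking the norms of `encoded_data` before indexing. The sketch below uses small made-up vectors in place of real embeddings, and `l2_normalize` is a hypothetical helper, not part of sentence-transformers or faiss.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero
    return vectors / norms

# Made-up 3-dimensional vectors standing in for real embeddings.
vectors = np.array([[3.0, 4.0, 0.0], [1.0, 0.0, 0.0]], dtype=np.float32)
unit = l2_normalize(vectors)

# After normalization, the inner product of two rows is their cosine similarity.
cosine = float(unit[0] @ unit[1])
```

With the real data, `np.linalg.norm(encoded_data, axis=1)` should be close to 1 if the model already normalizes its output. If it does not, `encoder.encode(data, normalize_embeddings=True)` asks sentence-transformers to do so.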


## Building the index

In [3]:
%%time

import faiss
import numpy as np

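# IndexFlatIP: exact (brute-force) inner-product search, suited to small datasets
# IndexIDMap: wrapper that stores a caller-supplied int64 id with each vector
# 1024 is the embedding dimension of bge-large-zh-v1.5
index = faiss.IndexIDMap(faiss.IndexFlatIP(1024))
index.add_with_ids(encoded_data, np.arange(len(data), dtype=np.int64))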

CPU times: user 18.4 ms, sys: 632 µs, total: 19.1 ms
Wall time: 19.4 ms
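
To make clear what a flat inner-product index wrapped in an ID map computes, here is a rough NumPy equivalent. It is a sketch of the idea, not faiss's implementation: `flat_ip_search` is a hypothetical function, the vectors and ids are made up, and it assumes `k` is no larger than the number of stored vectors (faiss instead pads missing results with id -1).

```python
import numpy as np

def flat_ip_search(vectors, ids, queries, k):
    """Brute-force inner-product search; returns (scores, ids), each of shape (n_queries, k)."""
    scores = queries @ vectors.T                  # (n_queries, n_vectors)
    order = np.argsort(-scores, axis=1)[:, :k]    # best-scoring positions first
    top_scores = np.take_along_axis(scores, order, axis=1)
    return top_scores, ids[order]                 # map positions to external ids

# Made-up 2-dimensional vectors with arbitrary external ids.
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
ids = np.array([10, 20, 30], dtype=np.int64)
queries = np.array([[0.0, 1.0]], dtype=np.float32)

scores, found = flat_ip_search(vectors, ids, queries, k=2)
```

The returned pair has the same layout as the `(scores, ids)` tuple from `index.search`: one row per query, best match first.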


## Search

In [4]:
%%time

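def search(query, k=1):
    # Encode the query as a (1, dim) batch; the index expects a 2-D array
    query_vector = encoder.encode([query])
    # top_k is a (scores, ids) tuple, each of shape (1, k)
    top_k = index.search(query_vector, k)
    print(top_k)
    return [
        # faiss pads with id -1 when fewer than k vectors are indexed
        data[_id] for _id in top_k[1][0] if _id != -1
    ]

search("你是张三么？")  # "Are you Zhang San?"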

(array([[0.5025584]], dtype=float32), array([[0]]))
CPU times: user 24 ms, sys: 22.7 ms, total: 46.7 ms
Wall time: 47.6 ms


['What is your name?']

## Saving the index

In [5]:
%%time

path = './faiss-only.index'

# Save index
faiss.write_index(index, path)

CPU times: user 196 µs, sys: 0 ns, total: 196 µs
Wall time: 201 µs


## Loading and using the index

In [6]:
%%time

index = faiss.read_index(path)
# search("How old are you?")

CPU times: user 888 µs, sys: 225 µs, total: 1.11 ms
Wall time: 830 µs
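
`faiss.write_index` stores the vectors and their ids, but not the `data` list that `search` uses to turn ids back into text, so a fresh process needs the texts as well. One possible approach, sketched here with only the standard library, is a JSON file saved next to the index. `save_docs`, `load_docs`, and the file name are hypothetical.

```python
import json
import os
import tempfile

def save_docs(docs, path):
    """Write the document texts as JSON; list position doubles as the document id."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(docs, f, ensure_ascii=False)

def load_docs(path):
    """Read the document texts back in the same order."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

docs = ["What is your name?", "What is your age?"]
with tempfile.TemporaryDirectory() as tmp:
    docs_path = os.path.join(tmp, "faiss-only.docs.json")  # hypothetical file name
    save_docs(docs, docs_path)
    restored = load_docs(docs_path)
```

Saving both files together keeps the ids in the index aligned with the positions in the restored list.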


## Adding new data

In [7]:
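new_data = '西游记是吴承恩的著作'  # "Journey to the West is a work by Wu Cheng'en"
new_encoded_data = encoder.encode([new_data])

# The new id is the position the text will occupy in `data` (2 here)
index.add_with_ids(new_encoded_data, np.array([len(data)], dtype=np.int64))
data.append(new_data)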

In [9]:
%%time

search("西游记的作者是谁？")  # "Who is the author of Journey to the West?"

(array([[0.6554023]], dtype=float32), array([[2]]))
CPU times: user 25.5 ms, sys: 23.3 ms, total: 48.8 ms
Wall time: 51.5 ms


['西游记是吴承恩的著作']
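
When several documents are added at once, each one needs its own id, and those ids must keep matching positions in the `data` list for `search` to return the right text. A minimal sketch of one way to allocate them, assuming ids are always list positions as in this notebook; `append_docs` is a hypothetical helper.

```python
import numpy as np

def append_docs(docs, new_docs):
    """Append new_docs to docs and return the int64 ids (list positions) they received."""
    start = len(docs)
    docs.extend(new_docs)
    return np.arange(start, start + len(new_docs), dtype=np.int64)

docs = ["What is your name?", "What is your age?"]
new_ids = append_docs(docs, ["doc A", "doc B"])
```

The returned array is what would be passed as the second argument to `index.add_with_ids`, alongside the encoded vectors for the same batch.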