<code>xb</code> is the database matrix: it holds all the vectors to be indexed and searched. Its size is nb-by-d.

<code>xq</code> is the matrix of query vectors, for which we want to find the nearest neighbors. Its size is nq-by-d. For a single query vector, nq=1.

Both matrices are always represented as numpy arrays, and their dtype must be float32.
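Because numpy creates float64 arrays by default, a conversion step is usually needed before handing data to Faiss. The following is a minimal sketch of such a step; <code>to_faiss_array</code> is a hypothetical helper name, not part of Faiss, and the row-major (C-contiguous) layout it enforces is an assumption about what the Python wrappers expect.

```python
import numpy as np

def to_faiss_array(x):
    """Hypothetical helper: return x as a C-contiguous float32 matrix."""
    # ascontiguousarray converts the dtype and fixes the memory layout in one
    # step; it copies only when a conversion is actually required
    x = np.ascontiguousarray(x, dtype='float32')
    if x.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n, d)")
    return x

raw = np.random.random((5, 64))        # float64 by default
xb_small = to_faiss_array(raw)
print(raw.dtype, '->', xb_small.dtype, xb_small.shape)
```

Converting once, up front, keeps the dtype requirement in one place instead of scattering <code>astype</code> calls through the notebook.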

In [1]:
import numpy as np
d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

##### Building an index and adding the vectors to it
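Once the index is built and trained, two operations can be performed on it: <code>add</code> and <code>search</code>. A flat index such as <code>IndexFlatL2</code> needs no training, so <code>is_trained</code> is already <code>True</code> right after construction.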






In [2]:
import faiss                   # make faiss available
index = faiss.IndexFlatL2(d)   # build the index
print(index.is_trained)
index.add(xb)                  # add vectors to the index
print(index.ntotal)

True
100000
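To make the <code>search</code> operation concrete, here is a minimal numpy sketch of what an exhaustive L2 search computes. It is an illustration under stated assumptions, not Faiss code: <code>brute_force_l2_search</code> and the <code>xb_demo</code> data are hypothetical names, and the sizes are kept small because the sketch materializes every pairwise difference.

```python
import numpy as np

def brute_force_l2_search(xb, xq, k):
    """Hypothetical helper: exact k-nearest-neighbor search by squared L2 distance."""
    # Pairwise differences, shape (nq, nb, d); only suitable for small inputs
    diff = xq[:, None, :] - xb[None, :, :]
    dist = (diff ** 2).sum(axis=-1)              # squared distances, shape (nq, nb)
    I = np.argsort(dist, axis=1)[:, :k]          # ids of the k closest vectors
    D = np.take_along_axis(dist, I, axis=1)      # their squared distances
    return D, I

rng = np.random.default_rng(1234)
xb_demo = rng.random((1000, 16), dtype=np.float32)

# Sanity check: query with the first 5 database vectors themselves
D, I = brute_force_l2_search(xb_demo, xb_demo[:5], k=4)
print(I)
print(D)
```

Each row of <code>I</code> starts with the query's own id and each row of <code>D</code> starts with 0, since every query vector is also stored in the database. Rows are sorted by increasing distance.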


In [38]:
import faiss
import numpy as np


np.random.seed(0)


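# 1. Creating an index (the database)
def test01():

    dim = 256
    data = np.random.rand(10000, dim).astype('float32')  # Faiss expects float32
    """
    dim: the dimensionality of the stored vectors
    IndexFlat: exhaustive (linear) search
    Two similarity measures:
        L2: Euclidean distance (Faiss reports squared distances; smaller means more similar)
        IP: inner product (larger means more similar)

    index_factory: factory function that builds an index from a description string

    Other index types: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
    """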
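    # Four ways to build a flat index (two per metric). Each assignment replaces
    # the previous one, so the inner-product index from the last line is used below.
    index = faiss.IndexFlatL2(dim)
    index = faiss.IndexFlatIP(dim)
    index = faiss.index_factory(dim, "Flat", faiss.METRIC_L2)
    index = faiss.index_factory(dim, "Flat", faiss.METRIC_INNER_PRODUCT)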

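    # 2. Add the vectors
    index.add(data)

    # 3. Search
    query_vectors = np.random.rand(2, 256).astype('float32')
    """
    search(queries, k) -> (D, I): the similarity scores of the k best matches
    for each query, and the IDs of those matches
    """
    D, I = index.search(query_vectors, k=2)
    # print(D)
    # print(I)

    # Return only the IDs of the nearest vectors
    I = index.assign(query_vectors, k=2)
    # print(I)

    # Reconstruct the vector stored at a given position; not every index type supports this
    # print(index.reconstruct(0))

    # 4. Remove vectors by ID ([1, 2, 3] are vector IDs, not positions in the index)
    index.remove_ids(np.array([1, 2, 3]))
    # print(index.ntotal)

    # Remove all vectors
    index.reset()
    # print(index.ntotal)

    # 5. Save the index (it is empty at this point because of the reset() above)
    faiss.write_index(index, 'vectors1.faiss')

    # 6. Load the index
    index = faiss.read_index('vectors1.faiss')
    print(index)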


# 2. Mapping vectors to custom IDs
def test02():
    # By default each vector is assigned a sequential number.
    # Here we want to give each vector an ID of our own.
    query_vectors = np.random.rand(1, 256).astype('float32')
    index = faiss.IndexFlatIP(256)
    # Wrap the index so that it accepts custom vector IDs
    """
    faiss.IndexIDMap(index) -> index
    Notes:
        1. Some faiss index classes (e.g. IndexFlat) do not support add_with_ids(),
           so they need to be wrapped in IndexIDMap.
        2. Other faiss index classes already support add_with_ids().
    """
    index = faiss.IndexIDMap(index)

    # Argument 1: the vectors to add
    # Argument 2: the ID of each vector
    index.add_with_ids(np.random.rand(10000, 256).astype('float32'), np.arange(10000, 20000))
    print(index.ntotal)

    # search() returns a (distances, ids) tuple
    result = index.search(query_vectors, k=1)
    print(result)

    index.remove_ids(np.array([10000, 10002, 10003]))
    print(index.ntotal)


if __name__ == '__main__':
    # test01()
    test02()

10000
(array([[73.85191]], dtype=float32), array([[12387]], dtype=int64))
9997
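The output above shows the custom ID 12387 rather than a sequential position. Conceptually, an ID map keeps an array of user-supplied IDs alongside the stored vectors and translates positions into those IDs after each search. The numpy sketch below illustrates that idea only; the variable names and the example positions are hypothetical, and it does not claim to mirror how Faiss implements <code>IndexIDMap</code> internally.

```python
import numpy as np

# Custom IDs, one per stored vector (same range as in test02)
custom_ids = np.arange(10000, 20000, dtype='int64')

# Suppose the wrapped index returned these sequential positions (hypothetical)
positions = np.array([[2387], [0]])

# Translating positions into user-supplied IDs is a plain array lookup
mapped = custom_ids[positions]
print(mapped)                      # IDs 12387 and 10000

# Removing by ID drops the matching entries, whatever their position
to_remove = np.array([10000, 10002, 10003])
custom_ids = custom_ids[~np.isin(custom_ids, to_remove)]
print(custom_ids.size)             # 9997
```

Because removal is keyed on the ID values, the three removed entries reduce the count from 10000 to 9997, matching the <code>ntotal</code> printed above.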
