feat:support python milvus & python lint #24

Merged 5 commits on Oct 6, 2023
21 changes: 21 additions & 0 deletions .github/workflows/pythonci-lint.yml
@@ -0,0 +1,21 @@
name: Linter

on:
  - push
  - pull_request

jobs:
  lint-python:
    name: ruff
    runs-on: ubuntu-latest
    if: github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name != github.event.pull_request.base.repo.full_name
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: 3.11
      - name: Install Ruff
        run: pip install ruff==0.0.272
      - name: Run Ruff
        run: ruff .
4 changes: 3 additions & 1 deletion .gitignore
@@ -41,4 +41,6 @@ mr-*
wukong_100m*
*.trie_tree
data/
*.inverted
*.inverted
weights/
index/
267 changes: 11 additions & 256 deletions README.md
@@ -1,59 +1,34 @@
# Tangseng: A Search Engine Written in Go

# Overall Project Architecture
**[Detailed project documentation is available here](https://cocainecong.github.io/tangseng/#/)**

## Overall Project Architecture

1. gin as the HTTP framework, gRPC as the RPC framework, and etcd for service discovery. \
2. The overall service is split into the `user module`, `favorites module`, `index platform`, `search engine (text)`, and `search engine (image)`. \
3. A distributed crawler scrapes data, sends it to a Kafka cluster, and consumers persist it to the database. \
3. A distributed crawler scrapes data, sends it to a Kafka cluster, and consumers persist it to the database (the crawler isn't written yet, but that doesn't stop me from making plans...). \
4. Text search in the search engine module uses boltdb on its own to store the index. \
5. Image search: to be decided...
5. Term suggestions are implemented with a trie tree. \
6. Image search uses ResNet50 to vectorize queries, combined with a lookup in a Milvus or Faiss vector database (work has started...); a rough sketch follows below the diagram.


![Overall project architecture](docs/images/tangseng.png)
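
Point 6 above can be sketched roughly as follows: embed the query image with ResNet50, then look up nearest neighbors in Milvus via pymilvus. This is a minimal illustration only; the Milvus address, the `images` collection, and the `embedding`/`url` fields are assumptions for the example, not the project's actual schema.

```python
import torch
from PIL import Image
from torchvision import models, transforms
from pymilvus import connections, Collection

# Drop the classification head so ResNet50 outputs a 2048-d feature vector.
model = models.resnet50(pretrained=True)
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    img = preprocess(Image.open("query.jpg").convert("RGB")).unsqueeze(0)
    vec = model(img).squeeze(0).tolist()  # query embedding

connections.connect(host="127.0.0.1", port="19530")  # assumed Milvus address
collection = Collection("images")                     # assumed collection name
collection.load()
results = collection.search(
    data=[vec],
    anns_field="embedding",  # assumed vector field
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,
    output_fields=["url"],   # assumed payload field
)
```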

# 🧑🏻‍💻 Frontend
## 🧑🏻‍💻 Frontend

The frontend is built with React, but it is still a work in progress.

[react-tangseng](https://github.com/CocaineCong/react-tangseng)

# 🌈 Main Features
## 1. User Module
- Sign-up and login

## 2. Favorites Module
- Create/update/delete/list favorites folders
- Create/delete/list favorites entries for search-result URLs

## 3. Search Module

### 3.1 Text Retrieval

> * x.inverted stores the inverted index files
> * x.trie_tree stores the dictionary trie used for term suggestions (a small sketch follows below)
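
As a rough illustration of the term-suggestion idea only (the project's actual trie lives in Go under `pkg/trie`; this Python sketch is not its implementation):

```python
# Minimal trie with prefix-based suggestions.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def suggest(self, prefix):
        # Walk down to the prefix node, then collect every word below it.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        out = []
        def dfs(n, acc):
            if n.is_word:
                out.append(prefix + acc)
            for ch, child in n.children.items():
                dfs(child, acc + ch)
        dfs(node, "")
        return out

trie = Trie()
for term in ["search", "seat", "seal", "engine"]:
    trie.insert(term)
print(trie.suggest("sea"))  # e.g. ['search', 'seat', 'seal'] (order may vary)
```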

#### Forward Index

* Currently stored in MySQL, but it will be moved to StarRocks later

#### Inverted Index

* The term file is stored in bolt, with the token as the key and that token's postings list as the value. Because the file has grown too large, this will later be changed to store only the offset and size within the inverted index file, compressing the storage footprint.

**Depending on implementation difficulty, mmap may later be used to read the inverted index; a rough sketch of that idea follows below**
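
A minimal illustration of the offset/size plus mmap idea (the project itself stores its index with boltdb in Go; the on-disk layout below, little-endian uint32 doc IDs, is assumed purely for the example):

```python
import mmap
import struct

# Build a toy postings file: doc IDs 1, 7, 42 for the single token "tangseng".
with open("inverted.postings", "wb") as f:
    f.write(struct.pack("<3I", 1, 7, 42))

term_dict = {"tangseng": (0, 12)}  # token -> (offset, size) inside the postings file

with open("inverted.postings", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    offset, size = term_dict["tangseng"]
    raw = mm[offset:offset + size]                  # only this slice of the file is read
    doc_ids = struct.unpack(f"<{size // 4}I", raw)  # -> (1, 7, 42)
    mm.close()
```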

#### Index Platform

Index building and recall are kept separate: index construction and storage both live on the index platform, while recall sits on its own in the search_engine module.

### Future Plans
#### 1. Architecture
## Future Plans
### Architecture

- [ ] Introduce degradation and circuit breaking
- [ ] Introduce Jaeger for distributed tracing
- [ ] Introduce SkyWalking or Prometheus for monitoring
- [ ] Extract the dao init logic and fetch the relevant database instance by key

#### 2. Features
### Features

- [x] Index building was too slow; add concurrency where the index is built
- [ ] Index compression: the inverted index table will later store offsets and be read via mmap
@@ -78,223 +53,3 @@

![text search](docs/images/text2text.jpg)


### 3.2 Image Search

Under development.

# Main Project Dependencies
- gin
- gorm
- etcd
- grpc
- jwt-go
- logrus
- viper
- protobuf

# ✨ Project Structure

## 1. tangseng (overall project layout)
```
tangseng/
├── app                 // individual microservices
│   ├── favorite        // favorites
│   ├── gateway         // gateway
│   ├── index_platform  // index platform
│   ├── mapreduce       // mapreduce service (deprecated)
│   ├── search_engine   // search microservice (text)
│   ├── search_img      // search microservice (image)
│   └── user            // user microservice
├── bin                 // compiled binaries
├── config              // configuration files
├── consts              // constant definitions
├── doc                 // API documentation
├── idl                 // protoc files
│   └── pb              // generated pb files
├── loading             // global loading, utilities usable by every microservice
├── logs                // log output module
├── pkg                 // shared packages
│   ├── bloom_filter    // Bloom filter
│   ├── ctl             // user-info helpers
│   ├── discovery       // etcd service registration, keep-alive, service lookup, etc.
│   ├── es              // es module
│   ├── jwt             // jwt authentication
│   ├── kfk             // kafka producing and consuming
│   ├── logger          // logging
│   ├── mapreduce       // mapreduce service
│   ├── res             // unified response wrapper
│   ├── retry           // retry helpers
│   ├── trie            // trie (prefix tree)
│   ├── util            // misc utilities: time handling, string handling, etc.
│   └── wrappers        // circuit breaking
└── types               // struct definitions
```

## 2. gateway (API gateway)
```
gateway/
├── cmd         // entry point
├── internal    // business logic (not exposed externally)
│   ├── handler // handler (view) layer
│   └── service // service layer
│       └── pb  // generated pb files
├── logs        // log output module
├── middleware  // middleware
├── routes      // http routing module
└── rpc         // rpc calls
```

## 3. user && favorite (user and favorites modules)
```
user/
├── cmd                // entry point
└── internal           // business logic (not exposed externally)
    ├── service        // business services
    └── repository     // persistence layer
        └── db         // db module
            ├── dao    // database access operations
            └── model  // database model definitions
```

## 4. search-engine (search engine module)

```
search-engine/
├── analyzer    // analyzer (tokenizer)
├── cmd         // entry point
├── data        // data layer
├── ranking     // ranker
├── respository // stored data
│   ├── spark   // spark storage, to be supported later...
│   └── storage // boltdb storage (to be migrated to spark later)
├── service     // services
├── test        // test files
└── types       // struct definitions
```

## 5. index_platform (index platform)

```
index_platform/
├── analyzer    // analyzer (tokenizer)
├── cmd         // entry point
├── consts      // constants
├── crawl       // distributed crawler
├── input_data  // csv files (crawler not yet implemented)
├── respository // stored data
│   ├── spark   // spark storage, to be supported later...
│   └── storage // boltdb storage (to be migrated to spark later)
├── service     // services
└── trie        // trie storage
```

# Project Configuration

Rename the `config.yml.example` file in the config folder to `config.yml`.

```yaml
server:
  port: :4000
  version: 1.0
  jwtSecret: 38324-search-engine

mysql:
  driverName: mysql
  host: 127.0.0.1
  port: 3306
  database: search_engine
  username: search_engine
  password: search_engine
  charset: utf8mb4

redis:
  user_name: default
  address: 127.0.0.1:6379
  password:

etcd:
  address: 127.0.0.1:2379

services:
  gateway:
    name: gateway
    loadBalance: true
    addr:
      - 127.0.0.1:10001

  user:
    name: user
    loadBalance: false
    addr:
      - 127.0.0.1:10002 # listen address

  favorite:
    name: favorite
    loadBalance: false
    addr:
      - 127.0.0.1:10003 # listen address

  searchEngine:
    name: favorite
    loadBalance: false
    addr:
      - 127.0.0.1:10004 # listen address

domain:
  user:
    name: user
  favorite:
    name: favorite
  searchEngine:
    name: searchEngine
```


# Running the Project
## Starting with the Makefile

Startup commands

```shell
make env-up    # start the container environment
make user      # start the user module
make task      # start the task module
make gateway   # start the gateway
make env-down  # stop and remove the container environment
```

Other commands
```shell
make run    # start all modules
make proto  # regenerate the proto files; rerun whenever a proto changes
```
Generating the .pb files requires the tools `protoc-gen-go`, `protoc-gen-go-grpc`, and `protoc-go-inject-tag`.


## Manual Start

1. Use compose to quickly bring up the environment
```shell
docker-compose up -d
```
2. Make sure MySQL and etcd are up, then run the following under each module's cmd directory in the app folder
```shell
go run main.go
```

# Importing the API Documentation

Open Postman and click Import

![postman import](docs/images/1.点击import导入.png)

Select the file to import
![select the file to import](docs/images/2.选择文件.png)

![import](docs/images/3.导入.png)

Result

![postman](docs/images/4.效果.png)
2 changes: 1 addition & 1 deletion app/search_img/cirtorch/__init__.py
@@ -3,4 +3,4 @@
from .datasets import datahelpers, genericdataset, testdataset, traindataset
from .layers import functional, loss, normalization, pooling
from .networks import imageretrievalnet
from .utils import general, download, evaluate, whiten
from .utils import general, download, evaluate, whiten
12 changes: 10 additions & 2 deletions app/search_img/cirtorch/datasets/datahelpers.py
@@ -3,6 +3,7 @@

import torch


def cid2filename(cid, prefix):
    """
    Creates a training image path out of its CID name
@@ -18,12 +19,14 @@ def cid2filename(cid, prefix):
    """
    return os.path.join(prefix, cid[-2:], cid[-4:-2], cid[-6:-4], cid)


def pil_loader(path):
    # open path as file to avoid ResourceWarning (https://github.com/python-pillow/Pillow/issues/835)
    with open(path, 'rb') as f:
        img = Image.open(f)
        return img.convert('RGB')


def accimage_loader(path):
    import accimage
    try:
@@ -32,25 +35,30 @@ def accimage_loader(path):
        # Potentially a decoding problem, fall back to PIL.Image
        return pil_loader(path)


def default_loader(path):
    from torchvision import get_image_backend
    if get_image_backend() == 'accimage':
        return accimage_loader(path)
    else:
        return pil_loader(path)


def imresize(img, imsize):
    img.thumbnail((imsize, imsize), Image.ANTIALIAS)
    return img


def flip(x, dim):
    xsize = x.size()
    dim = x.dim() + dim if dim < 0 else dim
    x = x.view(-1, *xsize[dim:])
    x = x.view(x.size(0), x.size(1), -1)[:, getattr(torch.arange(x.size(1)-1, -1, -1), ('cpu','cuda')[x.is_cuda])().long(), :]
    x = x.view(x.size(0), x.size(1), -1)[:,
        getattr(torch.arange(x.size(1) - 1, -1, -1), ('cpu', 'cuda')[x.is_cuda])().long(), :]
    return x.view(xsize)


def collate_tuples(batch):
    if len(batch) == 1:
        return [batch[0][0]], [batch[0][1]]
    return [batch[i][0] for i in range(len(batch))], [batch[i][1] for i in range(len(batch))]
    return [batch[i][0] for i in range(len(batch))], [batch[i][1] for i in range(len(batch))]
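
For context, the reformatted `flip` helper above reverses a tensor along one dimension; a tiny usage sketch, equivalent in effect to `torch.flip` (the import path is assumed, it only works if the `cirtorch` package is importable):

```python
import torch
from cirtorch.datasets.datahelpers import flip  # assumed import path

x = torch.arange(6).reshape(2, 3)                        # [[0, 1, 2], [3, 4, 5]]
assert torch.equal(flip(x, 1), torch.flip(x, dims=[1]))  # both give [[2, 1, 0], [5, 4, 3]]
```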