Distributed Document Search Engine

This is an open-source project for a paper search engine, which includes a Scrapy-Redis distributed crawler, an Elasticsearch search engine, and a Django frontend. The project was designed to provide a platform for users to easily search and access research papers.

Features

  • Scrapy-Redis distributed crawler using CSS selectors.
  • Centralized request deduplication via Redis to support distributed crawling.
  • Text search engine implemented with Elasticsearch.
  • Full-stack web application built with Django.

Technology Stack

The main technology stack used in this project includes:

  • Scrapy-Redis
  • Elasticsearch
  • Django


👉👉👉 More technical details that help in understanding the project are given below. Chinese version

Technology selection: Scrapy vs. requests + BeautifulSoup

  1. requests and BeautifulSoup are libraries, while Scrapy is a framework;
  2. requests and BeautifulSoup can still be used inside the Scrapy framework;
  3. Scrapy is based on Twisted; performance is its biggest advantage;
  4. Scrapy is easy to extend and provides many built-in features;
  5. Scrapy's built-in CSS and XPath selectors are very convenient; BeautifulSoup's biggest disadvantage is that it is slow.

Depth first and breadth first

Depth first (recursive implementation)

def depth_tree(tree_node):
    # Pre-order depth-first traversal: visit the node, then recurse into both subtrees
    if tree_node is not None:
        print(tree_node._data)
        if tree_node._left is not None:
            depth_tree(tree_node._left)   # no "return" here, or the right subtree would be skipped
        if tree_node._right is not None:
            depth_tree(tree_node._right)

Breadth first (queue implementation)

from collections import deque

def level_queue(root):
    # Breadth-first (level-order) traversal using a FIFO queue
    if root is None:
        return
    my_queue = deque([root])
    while my_queue:
        node = my_queue.popleft()   # O(1) pop from the front, unlike list.pop(0)
        print(node.elem)
        if node.lchild is not None:
            my_queue.append(node.lchild)
        if node.rchild is not None:
            my_queue.append(node.rchild)

URL deduplication strategy

  1. Save visited URLs in a database;
  2. Save visited URLs in a set, so checking whether a URL has been seen costs only O(1);
  3. Hash each URL (for example with MD5) before saving it in the set, which keeps each entry small (a sketch follows this list);
  4. Use a bitmap: map each visited URL to a single bit through a hash function;
  5. Use a Bloom filter, which improves on the bitmap by using multiple hash functions to reduce collisions.
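As a small, hypothetical sketch of strategy 3 (not code from this repository): hash each URL with MD5 and store only the digest in a set.

import hashlib

seen_hashes = set()

def is_new_url(url):
    # Return True the first time a URL is seen, False afterwards.
    # Storing the fixed-size 16-byte MD5 digest instead of the full URL
    # keeps the memory cost per entry small and constant.
    digest = hashlib.md5(url.encode("utf-8")).digest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True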

String encoding: encode and decode

  1. Computers can only process numbers, so text must be converted to numbers before it can be processed. Eight bits form one byte, so the largest number a single byte can represent is 255;
  2. ASCII (one byte per character) became the standard encoding in the United States;
  3. ASCII cannot handle Chinese, so China developed the GB2312 encoding, which uses two bytes to represent a Chinese character;
  4. Unicode unifies all languages into a single code space;
  5. This solves the garbled-text problem, but if the content is all English, Unicode requires twice the storage space of ASCII and twice the bandwidth when transmitting;
  6. The variable-length encoding UTF-8 stores English characters in one byte and Chinese characters in three bytes (especially rare characters take 4-6 bytes), so when a large amount of English is transmitted the benefit of UTF-8 is obvious (see the example below).
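A quick illustration of points 5 and 6; the byte counts below follow from standard UTF-8 behaviour and are not specific to this project.

text = "paper 论文"

utf8_bytes = text.encode("utf-8")     # ASCII characters take 1 byte each, Chinese characters 3 bytes each
print(len(utf8_bytes))                # 12 = 6 bytes for "paper " + 2 * 3 bytes for 论 and 文

decoded = utf8_bytes.decode("utf-8")  # decoding with the same codec restores the original string
assert decoded == text

print(utf8_bytes.decode("latin-1"))   # decoding with the wrong codec produces garbled text (mojibake)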

scrapy

Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python for crawling web sites and extracting structured data from pages. Advantage: high concurrency (the underlying layer is an asynchronous I/O event loop plus callbacks). Official documentation

  1. Install: pip install Scrapy
  2. Create a new project: scrapy startproject namexxx

XPath syntax: res.xpath('...').extract_first('')

  1. XPath uses path expressions to navigate XML and HTML documents;
  2. XPath includes a standard function library;
  3. XPath is a W3C standard (a usage sketch follows this list).
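A hedged sketch of how these selectors are typically used inside a Scrapy callback; the spider name, URL, and field selectors here are illustrative, not taken from this project's spiders.

import scrapy

class PaperDemoSpider(scrapy.Spider):
    name = "paper_demo"
    start_urls = ["https://example.com/papers"]  # placeholder URL

    def parse(self, response):
        # XPath: navigate the HTML tree with a path expression
        title = response.xpath("//h1[@class='title']/text()").extract_first("")
        # CSS selector: the equivalent extraction in CSS syntax
        authors = response.css("div.authors span::text").extract()
        yield {"title": title, "authors": authors}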

Advantages of distributed crawlers

  1. Make full use of the bandwidth of multiple machines to accelerate crawling;
  2. Make full use of the IP addresses of multiple machines to accelerate crawling.

From a stand-alone crawler to a distributed crawler: problems to solve

  1. Centralized management of the request queue: the scheduler keeps requests in an in-memory queue, so other servers cannot access the contents of the current server's memory;
  2. Centralized deduplication. Solution: put the request queue and the deduplication set into a third-party component, Redis (an in-memory database with very fast reads).

Redis

Redis is a key-value storage system, and data is stored in memory.

Redis data types

String, hash, list, set, sorted set

Points to note when writing crawlers with Scrapy-Redis

  1. Inherit from RedisSpider;
  2. Requests are no longer handled by the local scheduler but by the Scrapy-Redis scheduler;
  3. The starting URL must be pushed into Redis (a sketch follows this list).
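A minimal sketch of these three points, assuming the standard scrapy-redis components; the spider name and Redis key are made up for illustration.

from scrapy_redis.spiders import RedisSpider

class PaperRedisDemoSpider(RedisSpider):
    name = "paper_redis_demo"
    # Start URLs are popped from this Redis list instead of a local start_urls attribute
    redis_key = "paper_redis_demo:start_urls"

    def parse(self, response):
        yield {"url": response.url}

# settings.py: hand scheduling and deduplication over to Redis
# SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# REDIS_URL = "redis://localhost:6379"

# Point 3: push the starting URL, for example from redis-cli:
#   lpush paper_redis_demo:start_urls https://example.com/papers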

The difference between session and cookie

  1. Cookies are stored on the client in key-value form (a session, by contrast, is kept on the server).

If package installation fails

  1. pip install wheel
  2. pip install -r requirements.txt

Integrate Redis

Integrate BloomFilter

Incremental crawling

  1. How to quickly discover new data
    1. While the full crawl is still running
      1. Start another crawler: one responsible for the full crawl, the other for incremental crawling
      2. Use a priority queue (easier to maintain)
    2. After the full crawl has finished
      1. The crawler has been shut down
        1. How to discover that there are new URLs to crawl; once a URL appears, a script is needed to start the crawler
      2. The crawler keeps waiting: simply keep pushing URLs to it
  2. How to handle data that has already been crawled (Scrapy has a built-in deduplication mechanism)
    1. After the list data has been crawled, keep crawling
    2. Whether to re-crawl items that have already been crawled (this involves updating them). Optimal solution: modify the scrapy-redis source code.

Complete incremental crawling by modifying scrapy-redis

Crawler data update

Field that will be updated: citation count

Search engine requirements

  1. Efficient
  2. Zero configuration and completely free
  3. Able to interact with the search engine simply through JSON over HTTP
  4. A stable search server
  5. Able to scale easily from one server to hundreds

Introduction to elasticsearch

  1. A search server based on Lucene
  2. Provides a distributed, multi-user full-text search engine
  3. Exposes a RESTful web interface
  4. Developed in Java and released as open source under the terms of the Apache License

Disadvantages of searching with a relational database

  1. Cannot score results, so cannot rank them by relevance
  2. Not distributed
  3. Cannot parse complex search requests
  4. Low efficiency
  5. No word segmentation (tokenization)

elasticsearch installation

  1. Install elasticsearch-rtf
  2. Install the head plugin and Kibana

Cross-domain configuration

http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-methods: OPTIONS, HEAD, GET, POST, PUT, DELETE
http.cors.allow-headers: "X-Requested-With, Content-Type, Content-Length, X-User"

elasticsearch concepts

  1. Cluster: one or more nodes organized together
  2. Node: a single server in the cluster, identified by a name (by default the name of a random comic character)
  3. Shard: an index can be divided into multiple parts, allowing horizontal partitioning and scaling; multiple shards can serve requests, improving performance and throughput
  4. Replica: one or more copies of a shard can be created, so the remaining nodes can take over when a node fails

elasticsearch vs mysql

  1. index => database
  2. type => table
  3. document => row
  4. fields => columns

Inverted index

An inverted index arises from the practical need to look up records by the values of their attributes. Each entry in the index table contains an attribute value and the addresses of all records that have that attribute value. Because records are located by attribute value, rather than attribute values being determined by records, it is called an inverted index. A file indexed this way is referred to as an inverted file.
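A toy illustration of the idea (not how Elasticsearch stores its indexes internally): map each term to the IDs of the documents that contain it.

from collections import defaultdict

docs = {
    1: "distributed crawler with redis",
    2: "elasticsearch inverted index",
    3: "distributed search with elasticsearch",
}

inverted_index = defaultdict(set)   # term -> ids of documents containing the term
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

print(inverted_index["distributed"])    # {1, 3}
print(inverted_index["elasticsearch"])  # {2, 3}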

TF-IDF
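As a reminder of the classic weighting behind relevance scoring (a simplified sketch; Lucene and Elasticsearch add further normalization on top of this): a term's weight in a document is its term frequency multiplied by the inverse document frequency.

import math

documents = [
    "distributed crawler with redis",
    "elasticsearch search engine",
    "distributed search with elasticsearch",
]

def tf_idf(term, doc):
    # term frequency: how often the term appears in this document (normalized by length)
    words = doc.split()
    tf = words.count(term) / len(words)
    # document frequency: how many documents contain the term at all
    df = sum(1 for d in documents if term in d.split())
    idf = math.log(len(documents) / df) if df else 0.0
    return tf * idf

print(tf_idf("crawler", documents[0]))        # rare term -> higher weight
print(tf_idf("elasticsearch", documents[1]))  # common term -> lower weight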

Inverted index pending issues

  1. Case normalization: for example, python and PYTHON should be treated as the same word
  2. Stemming: looking and look should be treated as one word
  3. Word segmentation
  4. The inverted index file can be very large and needs compression encoding

Elasticsearch can handle all of the problems above.

elasticsearch basic index

Mapping

Mapping: when creating an index, you can predefine the field types and related attributes.

Otherwise, ES guesses the field mapping it thinks you want from the basic types of the JSON source data and turns the input data into searchable index terms. A mapping defines the data type of each field and tells ES how to index the data and whether it can be searched.

Role: it makes index creation more detailed and complete (a sketch of creating an index with an explicit mapping follows).
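A hedged sketch of predefining a mapping through the REST interface; the index name and fields are invented for illustration and are not this project's actual index layout (older Elasticsearch versions also expect a document-type level under "mappings").

import requests

mapping = {
    "mappings": {
        "properties": {
            "title":     {"type": "text"},
            "abstract":  {"type": "text"},
            "citations": {"type": "integer"},
        }
    }
}

# Create the index with explicit field types instead of letting ES guess them
resp = requests.put("http://localhost:9200/papers_demo", json=mapping)
print(resp.json())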

es query

  1. Basic query: query using ES's built-in query conditions
  2. Combined query: combine multiple queries into a compound query
  3. Filtering: filter the data with filter conditions without affecting scoring (see the sketch below)
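A hedged example combining points 2 and 3: a bool query whose filter clause narrows the results without affecting relevance scoring (index and field names are illustrative).

import requests

query = {
    "query": {
        "bool": {
            # "must" clauses contribute to the relevance score
            "must": [{"match": {"title": "distributed crawler"}}],
            # "filter" clauses only filter documents and do not affect scoring
            "filter": [{"range": {"citations": {"gte": 10}}}],
        }
    }
}

resp = requests.get("http://localhost:9200/papers_demo/_search", json=query)
for hit in resp.json().get("hits", {}).get("hits", []):
    print(hit["_score"], hit["_source"].get("title"))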

Edit distance

Edit distance is a measure of similarity between strings: the edit distance between two strings is the minimum number of insert, delete, replace, or adjacent-character-swap operations needed to turn one string into the other.

Edit distance is commonly computed with dynamic programming.
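A minimal dynamic-programming implementation of the insert/delete/replace variant (Levenshtein distance; the adjacent-swap operation mentioned above is left out for brevity).

def edit_distance(a, b):
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all characters of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all characters of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # delete a[i-1]
                dp[i][j - 1] + 1,         # insert b[j-1]
                dp[i - 1][j - 1] + cost,  # replace a[i-1] with b[j-1] (or keep it)
            )
    return dp[m][n]

print(edit_distance("elasticsearch", "elastcsearch"))  # 1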

Environment migration

  1. pip freeze > requirements.txt
  2. pip install -r requirements.txt

References

The difference between ik_max_word and ik_smart in Elasticsearch (in Chinese)

The theory behind relevance scoring (in Chinese)

Optimizing Chinese word segmentation for Elasticsearch search (in Chinese)

Several problems encountered in Elasticsearch Chinese search

  1. Searching for the keyword 葡萄糖 (glucose) should return only results about glucose, not 葡萄 (grape), while searching for grape should still match glucose.
  2. Searching for "人民币" only matches content containing that exact keyword; in fact "人民币" and "RMB" are synonyms, and a user searching for either should match both. How are ES synonyms configured?
  3. Users may search by pinyin, such as "baidu" or the pinyin initials "bd"; how do these match the keyword "百度"? And if the user types "摆渡" (same pronunciation), it should also match "百度". How is Chinese pinyin matching implemented?
  4. How do we ensure that search keywords are segmented correctly? Usually a custom dictionary is used, so how is a custom dictionary obtained?

ik tokenizer

  1. ik_max_word: splits the text at the finest granularity; for example, "the Great Hall of the People of the People's Republic of China" is split into terms such as "People's Republic of China", "Chinese people", "China", "people's republic", "people", "republic", "great hall", "assembly", and "hall".
  2. ik_smart: performs the coarsest-grained split; for example, the same phrase is split into "People's Republic of China" and "Great Hall of the People" (the two analyzers are compared in the sketch after this list).
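A hedged way to see the two analyzers side by side is the _analyze API (assuming the IK plugin is installed on a local node; the request-body form shown here applies to recent Elasticsearch versions).

import requests

for analyzer in ("ik_max_word", "ik_smart"):
    resp = requests.get(
        "http://localhost:9200/_analyze",
        json={"analyzer": analyzer, "text": "中华人民共和国人民大会堂"},
    )
    tokens = [t["token"] for t in resp.json().get("tokens", [])]
    print(analyzer, tokens)   # ik_max_word returns many fine-grained terms, ik_smart only a few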

Best Practices

The best practice for using the two tokenizers is: use ik_max_word for indexing and ik_smart for searching.

That is, the article content is segmented at the finest granularity when it is indexed, while searching uses coarser segmentation so that the results are more precise. At index time, to improve the coverage of the index, the ik_max_word analyzer is usually used, which indexes with the finest-grained segmentation; to improve search precision, the ik_smart analyzer is used for coarse-grained segmentation at query time.

ES analysis process and analyzers

  1. character filter: processes the string before tokenization, for example removing HTML tags;
  2. tokenizer: English can be split on whitespace; Chinese word segmentation is more complex and may use machine-learning algorithms;
  3. token filters: lowercase terms, remove stop words, add synonyms, and so on;
  4. The ES analysis pipeline: character filter --> tokenizer --> token filters
  5. Custom analyzers
  6. Analyzer settings in the mapping:
"content": {
    "type": "string",
    "analyzer": "ik_max_word",
    "search_analyzer": "ik_smart"
}

Synonym

Suggestion (completion) segmentation

Suggestion terms need to support prefix matching against full pinyin, pinyin initials, and Chinese characters. For example, for "百度", typing "baidu", "bd", or "百" must all match, so at index time the term has to be analyzed and stored with several different analyzers: Chinese uses single-character segmentation, while pinyin initials and full pinyin require custom analyzers (a completion-suggester sketch follows).
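A hedged sketch of prefix suggestions using Elasticsearch's completion suggester (the pinyin parts would additionally need the pinyin analysis plugin, which is not shown; index, field, and document names are illustrative).

import requests

# A field of type "completion" is stored in a structure optimised for prefix lookup
requests.put(
    "http://localhost:9200/suggest_demo",
    json={"mappings": {"properties": {"suggest": {"type": "completion"}}}},
)

# Index one entry with several inputs, so 百度, baidu and bd all lead to the same suggestion
requests.put(
    "http://localhost:9200/suggest_demo/_doc/1",
    json={"suggest": ["百度", "baidu", "bd"]},
)

# Ask for suggestions matching the prefix the user has typed so far
resp = requests.get(
    "http://localhost:9200/suggest_demo/_search",
    json={"suggest": {"s": {"prefix": "bai", "completion": {"field": "suggest"}}}},
)
for option in resp.json()["suggest"]["s"][0]["options"]:
    print(option["text"])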
