
Designed and implemented a web crawler using Selenium, which crawled all the financial data in the 2016 "resident life" section of the statistical yearbooks of China's 34 provinces and automatically entered it into MySQL. For a few provinces with specially formatted fiscal data, BeautifulSoup4 was used for direct data crawling and storage.


Government

Unfortunately, all of the code was irreparably lost during a computer replacement, but most of the project's application information is provided below.

Demo

Introduction

The project is implemented with Vue.js + Flask + ECharts + DataV.

Vue

Vue.js is a progressive framework under continuous development that provides reactive data binding and componentization through a simple API for building data-driven web interfaces. A single-page application can update parts of the page locally, without issuing a full request on every navigation, keeping the data and the DOM in sync.

The Vue.js documentation can be found at https://vuejs.org/guide/introduction.html

Echarts

ECharts is a commercial-grade, pure JavaScript charting library that runs smoothly on PCs and mobile devices and is compatible with most current browsers (IE6/7/8/9/10/11, Chrome, Firefox, Safari, etc.). Its lower layer relies on the lightweight canvas library ZRender, and it provides intuitive, vivid, interactive, and highly customizable data visualization charts.

The ECharts documentation can be found at https://echarts.apache.org/handbook/en/get-started/

Flask

Flask is a "micro" framework for web development in Python; its minimal form leaves more choices to developers. The purpose of using Flask in this project is to provide a JSON data interface for the front end. Because the server program uses Flask, the Flask team's Flask-SQLAlchemy extension, which is tightly integrated with SQLAlchemy, greatly simplifies development of the server program.

The Flask documentation can be found at https://flask.palletsprojects.com/en/2.2.x/

DataV

DataV is a Vue-based data visualization component library that provides SVG borders and decorations to enhance the page's visual effect.

The DataV documentation can be found at https://github.com/DataV-Team/Datav

Backend/excel_crawler

This is the core part of the project. It is used to crawl and parse the complex Excel tables of the statistical yearbooks.
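The original source is gone, but as a hedged sketch, BeautifulSoup4 might locate yearbook Excel download links roughly like this. The HTML structure, file names, and paths below are invented for illustration; a real province page would differ:

```python
# Sketch: extract .xls download links from a yearbook index page with
# BeautifulSoup4. The sample HTML is an invented stand-in; the actual
# structure of each province's page is unknown (original code lost).
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<ul class="yearbook">
  <li><a href="/2016/income.xls">8-1 Resident income</a></li>
  <li><a href="/2016/notes.html">Explanatory notes</a></li>
  <li><a href="/2016/expenditure.xls">8-2 Resident expenditure</a></li>
</ul>
"""

def excel_links(html):
    soup = BeautifulSoup(html, "html.parser")
    # Keep only anchors whose href points at an Excel file.
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].endswith(".xls")]
```

For pages rendered by JavaScript, Selenium would supply the `html` string (via `driver.page_source`) before this parsing step.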

Deficiencies in the system

  • The system still has flaws in data acquisition and cannot obtain all the data; for example, some provinces do not provide downloads of their statistical yearbooks. The crawler parses the yearbooks with the older xlrd library, whereas the newer openpyxl supports the current xlsx format. As the Windows XP era fades from office scenarios, the old Microsoft Office file format no longer suits the current environment, so if the yearbook format changes in the future, the crawler's parsing code will have to be rewritten. openpyxl's usability is very high, and moving to it would greatly improve the crawler's accuracy and maintainability, although, owing to the library's characteristics, loading a file may take longer than with xlrd.

    For example:

    This is part of the code of the current crawling algorithm:

    if self.wb.xf_list[self.tb.cell_xf_index(rx, 0)].background.pattern_colour_index == 44:
    

    This line checks whether the current row is the column-header row; the principle is to test whether the row's background color is blue. As you can see, the cell positioning in the first half is complicated and unintuitive, and the meaning of the '44' at the end of the line is even less clear.

    But with the new library you can change to:

    if self.tb[f'A{rx}'].fill.start_color.rgb == colors.BLUE:
    

    This is obviously more concise than the above: it uses Excel's own positioning scheme (a column-letter/row-number combination built with a format string), and the color is a constant predefined by openpyxl that can be referenced directly.

  • Because the login flow does not use OpenID, the login interface is not actually secure. A UUID serves as both the user's primary key in the database and the interface-authentication field, so if that UUID is stolen, an intruder can use it to access the system for as long as the user record exists in the database. With OpenID, the authentication field handed to the front end for a specific user would carry an expiration date, becoming invalid when the session ends or after a fixed period, which avoids the problem above. Flask offers a functional extension, Flask-OpenID, that could achieve this; however, since the existing verification method is already in place and the front end and back end are tightly integrated, the change would not be easy to make, so this project does not use OpenID for login verification.

  • The system has a lot of space for improvement in the data visualization display. At present only basic display effects are used, and the animation part needs to be deepened.
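The xlrd-to-openpyxl migration discussed in the first point above can be fleshed out into a small sketch. The ARGB value standing in for xlrd's colour index 44, and the assumption that the header marker sits in column A, are illustrative choices, not the original crawler's logic:

```python
# Sketch: locate a yearbook sheet's header row by its blue fill using
# openpyxl instead of xlrd. HEADER_BLUE is an assumed ARGB stand-in
# for the palette index 44 used in the original xlrd-based check.
from openpyxl import Workbook
from openpyxl.styles import PatternFill

HEADER_BLUE = "FF0000FF"  # assumed header background color (ARGB)

def header_row(ws, max_rows=50):
    """Return the 1-based index of the first row whose column-A cell
    carries the solid blue header fill, or None if none is found."""
    for rx in range(1, max_rows + 1):
        fill = ws[f"A{rx}"].fill
        if fill.fill_type == "solid" and fill.start_color.rgb == HEADER_BLUE:
            return rx
    return None
```

Compared with the xlrd version, the cell lookup reads like an Excel address and the color comparison is explicit, which is exactly the maintainability gain described above.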
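The expiring-credential idea raised in the login discussion can be sketched with the standard library alone: an HMAC-signed token with an embedded expiry. This is illustrative only; it is neither OpenID nor the project's actual scheme, and the secret and lifetime are placeholder values:

```python
# Sketch: a time-limited, HMAC-signed token as an alternative to a
# permanent UUID credential. Illustrative only; a real deployment
# would use a vetted mechanism rather than this hand-rolled one.
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # placeholder; never hard-code in production

def issue(uuid, lifetime=3600, now=None):
    """Return 'uuid:expiry:signature', valid for `lifetime` seconds."""
    expiry = int(now if now is not None else time.time()) + lifetime
    msg = f"{uuid}:{expiry}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{uuid}:{expiry}:{sig}"

def verify(token, now=None):
    """Return the uuid if the signature is valid and the token is
    unexpired, else None."""
    try:
        uuid, expiry, sig = token.rsplit(":", 2)
    except ValueError:
        return None
    msg = f"{uuid}:{expiry}".encode()
    good = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, good):
        return None
    if int(expiry) < (now if now is not None else time.time()):
        return None
    return uuid
```

Unlike a bare UUID, a stolen token of this form stops working once its expiry passes, which is the property the bullet above identifies as missing.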
