DeFiHap

A tool for automatically detecting and fixing HiveQL anti-patterns. Application developers can leverage DeFiHap to create efficient, maintainable and accurate queries in big data applications.

Architecture

The architecture of DeFiHap is shown in the figure. It consis ts of three components.

DeFiHap first implements a static analysis of AST, HiveQL data and metadata to search for the latent statement and configuration anti-patterns. Then, for statement anti-patterns, DeFiHap suggests the fix through a template based rewriting technique. And for configuration anti-patterns, DeFiHap employs a machine learning based approach to recommend the desired reducer settings for different join queries.

You can try a demo version of DeFiHap on http://202.120.40.28:50012

HiveQL Anti-Patterns

We conduct an empirical study on the posts about HiveQL programming issues in Stack Overflow. 38 HiveQL anti-patterns have been identified, which are categorized as statement and configuration anti-patterns. DeFiHap is able to detect 25 HiveQL anti-patterns and generates the fix suggestions for 17 of them, as shown in the list.

Project Structure

There are three main component in this project:

UploadAndDetect: A web interface for using DeFiHap.
StaticAnalysis: Check HiveQL statement and configuration anti-patterns with HiveQL AST and so on.
DynamicAnalysis: Recommend the desired reducer settings for different join queries and check if there exists data skew.

Install

Package two maven projects: staticCheck and dynamicCheck, then deploy them. Attention: copy the src folder into your deploying path.
Deploy python flask api pred_api. Attention: the model checkpoint is trained on our cluster for demostration, to support recommending number of reducers in your Hive cluster, you should replace dataset with your join logs and re-train the MLP model.
Run client web interface (a VUE frontend project) to use DeFiHap. Remember to set configuration in DeFiHap->Configurations->Set-Configuration first.

Evaluation

We evaluate DeFiHap on 110 real-world HiveQL statements collected non-duplicated HiveQL statements from Stack Overflow, to quantify its capabilities in processing HiveQL APs. Three widely used metrics are selected: precision, recall, and F1.

The dataset contains 96 HiveQL statements for evaluating the technique of detecting and fixing HiveQL APs and 14 different join queries for evaluating the technique of recommending reducer settings. (The recommended number is considered correct if the difference between its corresponding execution time and the optimal one is less than 2 seconds.)

We build Hive tables and load synthetic data for each statement in our cluster. The server of DeFiHap runs on Hive 2.3.4 with default configurations in a 3-node Hadoop cluster, and the Hive metadata is stored in MySQL.

Here are our evaluation results.

Function	Precision	Recall	F1
Detecting HiveQL APs	100%	96.88%	98.42%
Fixing HiveQL APs	92.11%	-	-
Recommending the reducer number	78.57%	-	-

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
DynamicAnalysis		DynamicAnalysis
StaticAnalysis		StaticAnalysis
UploadAndDetect		UploadAndDetect
pic		pic
src/main		src/main
.gitattributes		.gitattributes
.gitignore		.gitignore
AP.md		AP.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DeFiHap

Architecture

HiveQL Anti-Patterns

Project Structure

Install

Evaluation

About

Contributors 3

Languages

ChrisCN97/DeFiHap

Folders and files

Latest commit

History

Repository files navigation

DeFiHap

Architecture

HiveQL Anti-Patterns

Project Structure

Install

Evaluation

About

Topics

Resources

Stars

Watchers

Forks

Contributors 3

Languages