A tool for automatically detecting and fixing HiveQL anti-patterns. Application developers can leverage DeFiHap to create efficient, maintainable and accurate queries in big data applications.
The architecture of DeFiHap is shown in the figure. It consis ts of three components.
DeFiHap first implements a static analysis of AST, HiveQL data and metadata to search for the latent statement and configuration anti-patterns. Then, for statement anti-patterns, DeFiHap suggests the fix through a template based rewriting technique. And for configuration anti-patterns, DeFiHap employs a machine learning based approach to recommend the desired reducer settings for different join queries.
You can try a demo version of DeFiHap on http://202.120.40.28:50012
We conduct an empirical study on the posts about HiveQL programming issues in Stack Overflow. 38 HiveQL anti-patterns have been identified, which are categorized as statement and configuration anti-patterns. DeFiHap is able to detect 25 HiveQL anti-patterns and generates the fix suggestions for 17 of them, as shown in the list.
There are three main component in this project:
- UploadAndDetect: A web interface for using DeFiHap.
- StaticAnalysis: Check HiveQL statement and configuration anti-patterns with HiveQL AST and so on.
- DynamicAnalysis: Recommend the desired reducer settings for different join queries and check if there exists data skew.
- Package two maven projects: staticCheck and dynamicCheck, then deploy them. Attention: copy the src folder into your deploying path.
- Deploy python flask api pred_api. Attention: the model checkpoint is trained on our cluster for demostration, to support recommending number of reducers in your Hive cluster, you should replace dataset with your join logs and re-train the MLP model.
- Run client web interface (a VUE frontend project) to use DeFiHap. Remember to set configuration in DeFiHap->Configurations->Set-Configuration first.
We evaluate DeFiHap on 110 real-world HiveQL statements collected non-duplicated HiveQL statements from Stack Overflow, to quantify its capabilities in processing HiveQL APs. Three widely used metrics are selected: precision, recall, and F1.
The dataset contains 96 HiveQL statements for evaluating the technique of detecting and fixing HiveQL APs and 14 different join queries for evaluating the technique of recommending reducer settings. (The recommended number is considered correct if the difference between its corresponding execution time and the optimal one is less than 2 seconds.)
We build Hive tables and load synthetic data for each statement in our cluster. The server of DeFiHap runs on Hive 2.3.4 with default configurations in a 3-node Hadoop cluster, and the Hive metadata is stored in MySQL.
Here are our evaluation results.
Function | Precision | Recall | F1 |
---|---|---|---|
Detecting HiveQL APs | 100% | 96.88% | 98.42% |
Fixing HiveQL APs | 92.11% | - | - |
Recommending the reducer number | 78.57% | - | - |