Data cleaning is an important precursor to any form of data analysis. Data Cleaners is a Spark-based application responsible for cleaning any dataset provided to it, either from files or from database tables. A user may define a set of rules to construct a request, which in turn produces a detailed report of every entry that violates any of the rules. Violating entries can then be separated from, or removed from, the original dataset, thus ensuring data quality.
- Registration of datasets from both CSV files and database tables, without loading them fully into memory.
- Construction and execution of robust requests on registered datasets, utilizing a wide range of highly customisable checks:
  - Primary Key Check for column uniqueness. ✅
  - Foreign Key Check between two registered datasets. ✅
  - Domain Type Check for column validity against the data's type. ✅
  - Domain Value Check for verifying that a column contains only certain values. ✅
  - Format Check for column validity and consistency. ✅
  - Not Null Check for column completeness. ✅
  - Numeric Constraint Check for ensuring that a column's numeric values fall within a defined range. ✅
  - User Defined Expression Check for complex user-defined mathematical expression tests on a single entry. ✅
  - User Defined Conditional Check for complex user-defined mathematical expression tests on single entries that satisfy a certain condition. ✅
  - User Defined Aggregation Check for complex user-defined mathematical expression tests that use aggregation functions, on a single entry. ✅
  - User Defined Group Check for complex user-defined mathematical expression tests on groups of entries. ❌
- Detailed log generation for each executed request, in several formats:
  - TXT File ✅
  - HTML File ❌
  - MARKDOWN File ❌
- Violating Row Policy: different options for handling rejected (or invalid) entries (a minimal sketch follows this list):
  - WARN: Generate the log. ✅
  - ISOLATE: Generate the log and produce two TSV files: one with the rejected entries and one with the passed entries. ✅
  - PURGE: Generate the log and produce one TSV file with only the passed entries. ✅
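As a quick preview of how a policy plugs into a request (the full walkthrough appears further below), here is a minimal sketch that uses only the builder and facade calls shown later in this README; the dataset name `employees` is a hypothetical placeholder:

```java
// Minimal sketch: a request with a single uniqueness check whose
// violating rows are isolated from the passing rows (ISOLATE policy).
IDataCleanerFacade facade = new FacadeFactory().createDataCleanerFacade();

ClientRequest request = ClientRequest.builder()
        .onDataset("employees")                           // hypothetical name used during registration
        .withPrimaryKeys("ID")                            // Primary Key Check on the ID column
        .withViolationPolicy(ViolatingRowPolicy.ISOLATE)  // rejected and passed entries go to separate TSV files
        .build();

facade.executeClientRequest(request);
```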
Choose your preferred IDE and import the project as a Maven project. Make sure to set the JAVA_HOME environment variable to the location of a Java 8 (or later) installation. Afterwards, check the `Client` class for an example of the creation, execution, and report production of a request.
All tests are stored within the `test` folder. To execute all of them, simply run:

```bash
./mvnw test
```
Since this is a Maven build, ensure that the M2_HOME and MAVEN_HOME environment variables have been set correctly.
Consider that you need to perform some form of analysis on a dataset that follows this schema:
| ID | Name     | Wage   |
|----|----------|--------|
| 1  | John     | 120    |
| 2  | Mike     | null   |
| 3  | Samantha | 500    |
| 2  | Jane     | -1     |
| 5  | Bob      | 102213 |
This dataset, however, is not only large, but it also contains several quality issues. We first define some logical rules for this schema:
- The `ID` column contains unique, not-null, numeric values.
- The `Name` column contains non-numeric strings.
- The `Wage` column contains numeric values that should be between 0 and 1_000.
With the help of our application, we can now create a quality-enforcing request and determine which entries are problematic. First, we need to register the dataset in question.
```java
FacadeFactory facadeFactory = new FacadeFactory();
IDataCleanerFacade facade = facadeFactory.createDataCleanerFacade();

boolean hasHeader = true;
String frameName = "dataset";
// Register the CSV file under the name "dataset" so that requests can reference it.
facade.registerDataset("path/of/file/dataset.csv", frameName, hasHeader);
```
With the dataset registered via our facade, we proceed to define our request:
```java
ClientRequest req = ClientRequest.builder()
        .onDataset("dataset")                          // The name used during registration
        // For the ID column
        .withPrimaryKeys("ID")
        .withColumnType("ID", DomainType.INTEGER)
        // For the Name column
        .withColumnType("Name", DomainType.ALPHA)
        // For the Wage column
        .withColumnType("Wage", DomainType.NUMERIC)
        .withNumericColumn("Wage", 0, 1_000)
        .withViolationPolicy(ViolatingRowPolicy.PURGE)
        .build();

facade.executeClientRequest(req);
```
We also choose the PURGE violating row policy in order to immediately dispose of all problematic entries when generating a report.
```java
facade.generateReport("dataset", "output/path/directory", ReportType.TEXT);
```
Finally, we call the `generateReport` function to create a `log.txt` file, as well as a TSV file with all the conforming entries.
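To sanity-check the result against the rules defined above: the entries for John and Samantha satisfy every rule and should appear in the purged TSV, while the entry with the duplicate ID 2 and a negative wage (Jane) and the entry with a wage of 102213 (Bob) violate the Primary Key and Numeric Constraint checks and should be dropped. Whether Mike's null wage is rejected depends on how the numeric checks treat nulls; adding a Not Null Check on the `Wage` column would make that rule explicit.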
- Nikolaos Taflampas
- Panos Vassiliadis