Data cleaning is an important precursor to any form of data analysis. Data Cleaners is a Spark-based application responsible for cleaning datasets supplied either as files or as database tables. A user defines a set of rules to construct a request, which in turn produces a detailed report of every entry that violates any of the rules. Violating entries can then be separated or removed from the original dataset, thus ensuring data quality.
Registration of Datasets from both CSV files and database tables, without relying on memory.
Construction and execution of robust requests on registered datasets, utilizing a wide range of heavily customisable checks:
- Primary Key Check for column uniqueness. ✅
- Foreign Key Check between two registered datasets. ✅
- Domain Type Check for column validity with respect to the data's type. ✅
- Domain Value Check for checking if a column contains only certain values. ✅
- Format Check for column validity and consistency. ✅
- Not Null Check for column completeness. ✅
- Numeric Constraint Check for making sure a column has numeric values within a defined range. ✅
- User Defined Expression Check for complex user-defined mathematical expression tests regarding a single entry. ✅
- User Defined Conditional Check for complex user-defined mathematical expression tests regarding single entries that satisfy a certain condition. ✅
- User Defined Aggregation Check for complex user-defined mathematical expression tests that use aggregation functions, regarding a single entry. ✅
- User Defined Group Check for complex user-defined mathematical expression tests regarding groups of entries. ❌
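To make the first two checks in the list above concrete, here is a minimal plain-Java sketch of what a Primary Key Check and a Foreign Key Check test for. It is not the application's Spark-based implementation: the class `KeyCheckSketch` and both method names are hypothetical, columns are modelled as in-memory lists of strings, and the sketch assumes that only the later occurrence of a repeated key is flagged (the application may report duplicates differently).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of the Data Cleaners API.
class KeyCheckSketch {

    // Primary Key Check: indices of rows whose key value has already appeared.
    static List<Integer> duplicateKeyRows(List<String> keys) {
        Set<String> seen = new HashSet<>();
        List<Integer> violations = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            // Set.add returns false when the value was already present.
            if (!seen.add(keys.get(i))) {
                violations.add(i);
            }
        }
        return violations;
    }

    // Foreign Key Check: indices of child rows whose value does not occur
    // in the referenced (parent) column.
    static List<Integer> danglingReferenceRows(List<String> childKeys, List<String> parentKeys) {
        Set<String> parents = new HashSet<>(parentKeys);
        List<Integer> violations = new ArrayList<>();
        for (int i = 0; i < childKeys.size(); i++) {
            if (!parents.contains(childKeys.get(i))) {
                violations.add(i);
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        // The fourth key repeats "2", so row index 3 is reported.
        System.out.println(duplicateKeyRows(Arrays.asList("1", "2", "3", "2", "5")));
        // "4" has no match in the parent column, so row index 1 is reported.
        System.out.println(danglingReferenceRows(Arrays.asList("1", "4"), Arrays.asList("1", "2", "3")));
    }
}
```

Both methods return row indices rather than a boolean, mirroring the idea that a request reports each violating entry instead of just failing.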
Detailed log generation for each executed request in several formats:
- TXT File ✅
- HTML File ❌
Violating Row Policy: different options for handling rejected (or invalid) entries:
- WARN: Generate the log. ✅
- ISOLATE: Generate the log, and produce two TSV files; one with the rejected and one with the passed entries. ✅
- PURGE: Generate the log, and produce one TSV file with just the passed entries. ✅
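The difference between the three policies above comes down to which outputs are produced. The following plain-Java sketch models that under stated assumptions: the enum `Policy`, the class `RowPolicySketch`, and the output names `log`, `passed` and `rejected` are hypothetical and chosen for illustration, and the log is represented simply by the list of rejected rows rather than by a formatted report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Data Cleaners API.
class RowPolicySketch {

    // Mirrors the three policy names listed above.
    enum Policy { WARN, ISOLATE, PURGE }

    // Splits rows into passed and rejected and returns the outputs each policy would produce.
    // rejectedFlags.get(i) is true when rows.get(i) violated at least one check.
    static Map<String, List<String>> outputs(Policy policy, List<String> rows, List<Boolean> rejectedFlags) {
        List<String> passed = new ArrayList<>();
        List<String> rejected = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            if (rejectedFlags.get(i)) {
                rejected.add(rows.get(i));
            } else {
                passed.add(rows.get(i));
            }
        }
        Map<String, List<String>> out = new LinkedHashMap<>();
        // Every policy generates the log (stand-in: the rejected rows).
        out.put("log", rejected);
        // ISOLATE additionally writes the rejected entries to their own file.
        if (policy == Policy.ISOLATE) {
            out.put("rejected", rejected);
        }
        // ISOLATE and PURGE both write the passed entries.
        if (policy != Policy.WARN) {
            out.put("passed", passed);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("row-a", "row-b", "row-c");
        List<Boolean> flags = Arrays.asList(false, true, false);
        for (Policy policy : Policy.values()) {
            System.out.println(policy + " -> " + outputs(policy, rows, flags).keySet());
        }
    }
}
```

In this model WARN leaves the data untouched, ISOLATE keeps both halves of the split, and PURGE keeps only the conforming half.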
Open the project in your preferred IDE and import it as a Maven project. Make sure the JAVA_HOME property points to the location of a Java 8 or later installation.
Afterwards, check the Client class for an example of the creation, execution and report production of a request.
All tests are stored within the test folder. To execute all of them, simply run:
./mvnw test
Since it's a Maven script, ensure that M2_HOME and MAVEN_HOME system variables have been set correctly.
Suppose you need to perform some form of analysis on a dataset that follows this schema:
ID | Name | Wage |
1 | John | 120 |
2 | Mike | null |
3 | Samantha | 500 |
2 | Jane | -1 |
5 | Bob | 102213 |
This dataset, however, is not only large but also contains several quality issues. We first define some logical rules for this schema:
- The ID column contains unique, not null, numeric values.
- The Name column contains non-numeric strings.
- The Wage column contains numeric values that should be between 0 and 1_000.
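As a rough illustration of what the rules above imply for the five example rows, the plain-Java sketch below applies them by hand. It does not use the application's API: the class `ExampleRulesSketch` and its methods are hypothetical, the `null` wage is represented by a Java `null`, and two assumptions are made that the application may handle differently: a missing Wage counts as a violation, and every row sharing a repeated ID is flagged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Data Cleaners API.
class ExampleRulesSketch {

    private static final String NUMERIC = "-?\\d+(\\.\\d+)?";

    // The five example rows as {ID, Name, Wage}.
    static List<String[]> exampleRows() {
        return Arrays.asList(
                new String[] {"1", "John", "120"},
                new String[] {"2", "Mike", null},
                new String[] {"3", "Samantha", "500"},
                new String[] {"2", "Jane", "-1"},
                new String[] {"5", "Bob", "102213"});
    }

    // Returns the 0-based indices of rows that break at least one of the three rules.
    static List<Integer> violatingRows(List<String[]> rows) {
        // Count how often each ID occurs so that repeated IDs can be detected.
        Map<String, Integer> idCounts = new HashMap<>();
        for (String[] row : rows) {
            idCounts.merge(row[0], 1, Integer::sum);
        }
        List<Integer> violations = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            String id = rows.get(i)[0];
            String name = rows.get(i)[1];
            String wage = rows.get(i)[2];
            // ID: not null, an integer, and unique within the dataset.
            boolean idOk = id != null && id.matches("-?\\d+") && idCounts.get(id) == 1;
            // Name: a non-numeric string.
            boolean nameOk = name != null && !name.matches(NUMERIC);
            // Wage: numeric and within [0, 1000]; a missing value is assumed to be a violation.
            boolean wageOk = wage != null && wage.matches(NUMERIC)
                    && Double.parseDouble(wage) >= 0 && Double.parseDouble(wage) <= 1000;
            if (!(idOk && nameOk && wageOk)) {
                violations.add(i);
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        // Mike (repeated ID, missing wage), Jane (repeated ID, negative wage)
        // and Bob (wage above 1000) are flagged.
        System.out.println(violatingRows(exampleRows())); // prints [1, 3, 4]
    }
}
```

Under these assumptions only the John and Samantha rows satisfy all three rules.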
With the help of our application, we can now create a quality-enforcing request and determine which entries are problematic. First, we need to register the dataset in question.
FacadeFactory facadeFactory = new FacadeFactory();
IDataCleanerFacade facade = facadeFactory.createDataCleanerFacade();
boolean hasHeader = true;
String frameName = "dataset";
facade.registerDataset("path//of//file//dataset.csv", frameName, hasHeader);
With the dataset registered via our facade, we proceed to define our request:
ClientRequest req = ClientRequest.builder()
    .onDataset("dataset") // The name used during registration
    // For the ID column
    .withColumnType("ID", DomainType.INTEGER)
    // For the Name column
    .withColumnType("Name", DomainType.ALPHA)
    // For the Wage column
    .withColumnType("Wage", DomainType.NUMERIC)
    .withNumericColumn("Wage", 0, 1_000)
We also choose the PURGE violating row policy in order to immediately dispose of all problematic entries when generating a report.
facade.generateReport("dataset", "output//path//directory", ReportType.TEXT);
Finally, we call the generateReport function to create a log.txt file, as well as a TSV file with all the conforming entries.
Nikolaos Taflampas
Panos Vassiliadis