
03. Assignment: Clean dataset

Chazzers edited this page Nov 18, 2019 · 1 revision

3. Assignment: Clean dataset

I got an assignment to clean a part of a 'dirty' dataset.

This dataset came from a survey that was filled in by students.

Chosen part of dataset: Haircolor

The survey asked students for their hair color as a hex code. Since not every student answered the question in the right format, the data had to be cleaned.

Cleaning the column Haircolor

To load the CSV file into a JavaScript project, I used the JavaScript library D3. To import only the part of D3 that I needed, I used JavaScript's import statement like this:

import { csv } from 'd3';

Next, I wanted to make sure that the survey's CSV data was loaded before doing anything with it.

Because csv() returns a JavaScript Promise, the code that works with the data goes inside .then():

csv("../data/enquete.csv")
	.then(data => {
	});

I wanted to clean the data step by step. I created a variable for each step and let each one do part of the work until all the data was cleaned, for example like this:

csv("../data/enquete.csv")
	.then(data => {

		let items = [];

		for(let i = 0; i < data.length; i++) {
			items.push(data[i]["Kleur haar (HEX code)"].toUpperCase());
		}

		let filterUndefined = items.filter(item => item !== '' && item !== "0");
	});

And the complete code looked like this:

csv("../data/enquete.csv")
	.then(data => {
		let items = [];
		for(let i = 0; i < data.length; i++) {
			items.push(data[i]["Kleur haar (HEX code)"].toUpperCase());
		}
		//remove empty answers and "0"
		let filterUndefined = items.filter(item => item !== '' && item !== "0");
		//regex for values that start with a hashtag
		let regExHashtag = /^#.*/;
		//regex for values that do not start with a hashtag
		let regExNoHashtag = /^(?!#).*/;
		//keep the values that already have a hashtag
		let itemsWithHashtag = filterUndefined.filter(item => item.match(regExHashtag));
		//keep the values that have no hashtag
		let itemsWithoutHashtag = filterUndefined.filter(item => item.match(regExNoHashtag));
		//remove color names, which are not hex codes
		let filterKleurNaam = itemsWithoutHashtag.filter(item => !item.includes("BLOND") && !item.includes("BRUIN"));
		let hashtag = "#";
		//add # to the values without a hashtag
		let listWithHashtag = filterKleurNaam.map(item => hashtag + item);
		//make one array of the values with and without an original hashtag
		let cleanList = itemsWithHashtag.concat(listWithHashtag);
	});
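To see what this split-and-concatenate approach does, here is a minimal sketch that runs on a handful of made-up answers. The `sample` array is hypothetical and not taken from the survey, and the sketch uses one regex with `.test()` instead of two regexes with `.match()`, so it is a simplification of the code above rather than the code itself.

```javascript
// Hypothetical answers, already upper-cased and with '' and "0" removed.
const sample = ["#A52A2A", "FFF0C1", "BLOND", "#000000", "DONKERBRUIN"];

// True when a value starts with '#'.
const startsWithHashtag = /^#/;

// Split the answers into values with and without a leading '#'.
const withHashtag = sample.filter(item => startsWithHashtag.test(item));
const withoutHashtag = sample.filter(item => !startsWithHashtag.test(item));

// Drop the color names, then prefix the remaining values with '#'.
const prefixed = withoutHashtag
	.filter(item => !item.includes("BLOND") && !item.includes("BRUIN"))
	.map(item => "#" + item);

// Join both groups into one list.
const cleanList = withHashtag.concat(prefixed);
console.log(cleanList); // [ '#A52A2A', '#000000', '#FFF0C1' ]
```

Note that concatenating the two groups changes the order of the answers: values that already had a `#` end up first.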

After showing my assignment to the teacher, I received the following feedback: "You need to make the code functional".

So I started refactoring my code so it would follow functional programming principles. The same code, refactored in a functional style, looks like this:

import { csv } from 'd3';
csv("../data/enquete.csv")
	.then(data => makeArray(data))
	.then(data => filterUndefined(data))
	.then(data => filterKleurNaam(data))
	.then(data => console.log(addHashtag(data)));

function makeArray(items) {
	return items.map(item => item["Kleur haar (HEX code)"].toUpperCase());
}
function filterUndefined(items) {
	return items.filter(item => item !== '' && item !== "0");
}
function filterKleurNaam(items) {
	return items.filter(item => !item.includes("BLOND") && !item.includes("BRUIN"));
}
function addHashtag(items) {
	return items.map(item => item[0] !== "#" ? "#" + item : item);
}
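The four functions can be tried out without loading the CSV file. The sketch below repeats them and runs them on a few hypothetical rows, shaped like the row objects D3 produces (one object per row, keyed by column header). The `sampleRows` values are made up for illustration.

```javascript
function makeArray(items) {
	return items.map(item => item["Kleur haar (HEX code)"].toUpperCase());
}
function filterUndefined(items) {
	return items.filter(item => item !== '' && item !== "0");
}
function filterKleurNaam(items) {
	return items.filter(item => !item.includes("BLOND") && !item.includes("BRUIN"));
}
function addHashtag(items) {
	return items.map(item => item[0] !== "#" ? "#" + item : item);
}

// Hypothetical rows standing in for the parsed survey data.
const sampleRows = [
	{ "Kleur haar (HEX code)": "#a52a2a" },
	{ "Kleur haar (HEX code)": "fff0c1" },
	{ "Kleur haar (HEX code)": "" },
	{ "Kleur haar (HEX code)": "0" },
	{ "Kleur haar (HEX code)": "Blond" }
];

// Each step takes the previous step's output as its input.
const upperCased = makeArray(sampleRows);     // [ '#A52A2A', 'FFF0C1', '', '0', 'BLOND' ]
const nonEmpty = filterUndefined(upperCased); // [ '#A52A2A', 'FFF0C1', 'BLOND' ]
const hexOnly = filterKleurNaam(nonEmpty);    // [ '#A52A2A', 'FFF0C1' ]
const cleaned = addHashtag(hexOnly);

console.log(cleaned); // [ '#A52A2A', '#FFF0C1' ]
```

Unlike the step-by-step version, this pipeline keeps the answers in their original order, because it never splits the list into two groups.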

In-depth function explanation

function makeArray();

This function takes the rows received from the dataset, reads the value of the chosen column "Kleur haar (HEX code)" from each row, converts it to upper case, and puts the results in a new array using the .map() method.

function makeArray(items) {
	return items.map(item => item["Kleur haar (HEX code)"].toUpperCase());
}

function filterUndefined();

This function removes empty answers and answers equal to "0".

function filterUndefined(items) {
	return items.filter(item => item !== '' && item !== "0");
}
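The comparison is against the string "0" rather than the number 0 because D3's csv(), when called without a row conversion function, leaves every field as a string. A minimal sketch of that behavior, using made-up values in the hypothetical `answers` and `mixed` arrays:

```javascript
function filterUndefined(items) {
	return items.filter(item => item !== '' && item !== "0");
}

// Hypothetical answers as strings, the way they arrive from the CSV file.
const answers = ["#A52A2A", "", "0", "000000"];
console.log(filterUndefined(answers)); // [ '#A52A2A', '000000' ]

// Strict inequality does not convert types, so a numeric 0 would not be removed.
const mixed = ["0", 0];
console.log(filterUndefined(mixed)); // [ 0 ]
```

The value "000000" is kept, which is what you want here, since it is a valid hex code for black.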

function filterKleurNaam();

This function removes the values that are not hex codes. It does this by dropping answers that contain the color names "BLOND" or "BRUIN".

function filterKleurNaam(items) {
	return items.filter(item => !item.includes("BLOND") && !item.includes("BRUIN"));
}
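Excluding two color names only works if those are the only non-hex answers in the column. A more general check, sketched below as an alternative and not the approach used in this assignment, keeps only values that look like a six-digit hex code. The names `hexPattern` and `isHexCode` and the sample answers are hypothetical.

```javascript
// Matches an optional '#' followed by exactly six hexadecimal digits.
const hexPattern = /^#?[0-9A-F]{6}$/i;

function isHexCode(value) {
	return hexPattern.test(value);
}

// Hypothetical answers: two hex codes, two color names, one code that is too short.
const answers = ["#A52A2A", "FFF0C1", "BLOND", "ZWART", "#12345"];
console.log(answers.filter(isHexCode)); // [ '#A52A2A', 'FFF0C1' ]
```

This pattern is stricter than the color-name filter: it would also reject three-digit shorthand such as "#FFF", so whether it fits depends on which answers should count as valid.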

function addHashtag();

This function adds a hashtag to hexcodes that didn't have one.

function addHashtag(items) {
	return items.map(item => item[0] !== "#" ? "#" + item : item);
}

Calling the functions using .then

To make sure all the functions are executed in the right order, I used promise chaining like this:

csv("../data/enquete.csv")
	.then(data => makeArray(data))
	.then(data => filterUndefined(data))
	.then(data => filterKleurNaam(data))
	.then(data => addHashtag(data));
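Because each function takes one argument and returns the input for the next step, the functions can also be passed to .then() by reference. The sketch below shows this with `Promise.resolve` standing in for `csv(...)`, so it runs without the data file. The `sampleRows` values are hypothetical, and the `.catch()` handler is an addition for illustration, not part of the original assignment.

```javascript
function makeArray(items) {
	return items.map(item => item["Kleur haar (HEX code)"].toUpperCase());
}
function filterUndefined(items) {
	return items.filter(item => item !== '' && item !== "0");
}
function filterKleurNaam(items) {
	return items.filter(item => !item.includes("BLOND") && !item.includes("BRUIN"));
}
function addHashtag(items) {
	return items.map(item => item[0] !== "#" ? "#" + item : item);
}

// Hypothetical rows standing in for the parsed survey data.
const sampleRows = [
	{ "Kleur haar (HEX code)": "#a52a2a" },
	{ "Kleur haar (HEX code)": "fff0c1" },
	{ "Kleur haar (HEX code)": "0" },
	{ "Kleur haar (HEX code)": "Bruin" }
];

// Promise.resolve stands in for csv("../data/enquete.csv") here.
// .then(makeArray) behaves the same as .then(data => makeArray(data)).
const cleaned = Promise.resolve(sampleRows)
	.then(makeArray)
	.then(filterUndefined)
	.then(filterKleurNaam)
	.then(addHashtag);

cleaned
	.then(list => console.log(list)) // [ '#A52A2A', '#FFF0C1' ]
	.catch(error => console.error("Cleaning failed:", error));
```

A rejected promise, for example when the file cannot be loaded, skips the remaining .then() steps and ends up in the .catch() handler.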