DATA SCRUBBING
Much like many categories of fruit, datasets nearly always require some form
of upfront cleaning and human manipulation before they are ready to digest.
For machine learning and data science more broadly, there are a vast number
of techniques to scrub data.
Scrubbing is the technical process of refining your dataset to make it more
workable. This can involve modifying and sometimes removing incomplete,
incorrectly formatted, irrelevant or duplicated data. It can also entail
converting text-based data to numerical values and redesigning features.
For data practitioners, data scrubbing usually demands the greatest
investment of time and effort.

Feature Selection
To generate the best results from your data, it is important to first identify the
variables most relevant to your hypothesis. In practice, this means being
selective about the variables you include in your model.
Rather than creating a four-dimensional scatterplot with four features in the
model, there may be an opportunity to select two highly relevant features and
build a two-dimensional plot that is easier to interpret. Moreover, preserving
features that do not correlate strongly with the outcome value can, in fact,
distort the model and reduce its accuracy. Consider the following table
excerpt downloaded from kaggle.com documenting dying languages.

Database: https://www.kaggle.com/the-guardian/extinct-languages

Let’s say our goal is to identify variables that lead to a language becoming
endangered. Based on this goal, it’s unlikely that a language’s “Name in
Spanish” will lead to any relevant insight. We can therefore go ahead and
delete this vector (column) from the dataset. This will help to prevent
overcomplication and potential inaccuracies, and will also improve the overall
processing speed of the model.
Secondly, the dataset holds duplicate information in the form of separate
vectors for “Countries” and “Country Code.” Including both of these vectors
doesn’t provide any additional insight; hence, we can choose to delete one
and retain the other.
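As a minimal sketch of this kind of column removal, the following Python snippet drops two columns from a small list of records. The rows and the "Name in English" column are made up for illustration; only the "Name in Spanish," "Countries," "Country Code," and "No. of Speakers" column names come from the dataset discussed here.

```python
# Hypothetical rows loosely modeled on the endangered-languages table;
# the values (and the "Name in English" column) are illustrative only.
rows = [
    {"Name in English": "Language A", "Name in Spanish": "Lengua A",
     "Countries": "Japan", "Country Code": "JPN", "No. of Speakers": 15000},
    {"Name in English": "Language B", "Name in Spanish": "Lengua B",
     "Countries": "Argentina", "Country Code": "ARG", "No. of Speakers": 800},
]

# Columns judged irrelevant or redundant for the stated goal.
columns_to_drop = {"Name in Spanish", "Country Code"}

# Rebuild each row without the dropped columns.
selected = [
    {key: value for key, value in row.items() if key not in columns_to_drop}
    for row in rows
]

print(list(selected[0].keys()))
```

Keeping "Countries" rather than "Country Code" is an arbitrary choice in this sketch; either column could be retained.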
Another method to reduce the number of features is to roll multiple features
into one. In the next table, we have a list of products sold on an e-commerce
platform. The dataset comprises four buyers and eight products. This is not a
large sample size of buyers and products—due in part to the spatial
limitations of the book format. A real-life e-commerce platform would have
many more columns to work with, but let’s go ahead with this example.

Table: products purchased by buyer (e.g., protein shake, Nike sneakers)

In order to analyze the data in a more efficient way, we can reduce the
number of columns by merging similar features into fewer columns. For
instance, we can remove individual product names and replace the eight
product items with a lower number of categories or subtypes. As all product
items fall under the single category of “fitness,” we will sort by product
subtype and compress the columns from eight to three. The three newly
created product subtype columns are “Health Food,” “Apparel,” and
“Digital.”

Table: products grouped by subtype (Health Food, Apparel, Digital)
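The column merge described above can be sketched in Python as follows. This is only an illustration under stated assumptions: apart from "Protein Shake" and "Nike Sneakers," the product names are placeholders, the purchase records are invented, and summing purchases per subtype is one possible way to combine the columns.

```python
# Hypothetical mapping of eight product columns to the three subtypes
# named in the text. Only "Protein Shake" and "Nike Sneakers" appear in
# the source; the other product names are placeholders.
SUBTYPE_OF = {
    "Protein Shake": "Health Food",
    "Protein Bar": "Health Food",
    "Vitamins": "Health Food",
    "Nike Sneakers": "Apparel",
    "Running Shorts": "Apparel",
    "Yoga Pants": "Apparel",
    "Fitness App": "Digital",
    "Workout E-book": "Digital",
}

# Made-up purchase records for four buyers: 1 = bought.
buyers = {
    "Buyer 1": {"Protein Shake": 1, "Nike Sneakers": 1},
    "Buyer 2": {"Protein Bar": 1, "Vitamins": 1, "Workout E-book": 1},
    "Buyer 3": {"Running Shorts": 1, "Yoga Pants": 1},
    "Buyer 4": {"Fitness App": 1, "Protein Shake": 1},
}


def compress(purchases):
    """Sum per-product purchase counts into per-subtype counts."""
    totals = {"Health Food": 0, "Apparel": 0, "Digital": 0}
    for product, count in purchases.items():
        totals[SUBTYPE_OF[product]] += count
    return totals


compressed = {buyer: compress(items) for buyer, items in buyers.items()}
print(compressed["Buyer 2"])
```

Each buyer is now described by three values instead of eight, at the cost of no longer knowing which specific product was purchased.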

This compression comes at a cost. Rather than recommending products to users
according to other individual products, recommendations will instead be based
on relationships between product subtypes, so information about specific
products is lost.
Nonetheless, this approach does uphold a high level of data relevancy.
Buyers will be recommended health food when they buy other health food or
when they buy apparel (depending on the level of correlation), and obviously
not machine learning textbooks—unless it turns out that there is a strong
correlation there! But alas, such a variable is outside the frame of this dataset.
Remember that data reduction is also a business decision, and business
owners in counsel with the data science team will need to consider the trade-
off between convenience and the overall precision of the model.

Row Compression
In addition to feature selection, there may also be an opportunity to reduce
the number of rows and thereby compress the total number of data points.
This can involve merging two or more rows into one. For example, in the
following dataset, “Tiger” and “Lion” can be merged and renamed
“Carnivore.”

Table: row compression comparison

However, by merging these two rows (Tiger & Lion), the feature values for
both rows must also be aggregated and recorded in a single row. In this case,
it is viable to merge the two rows because they both possess the same
categorical values for all features except y (Race Time)—which can be
aggregated. The race time of the Tiger and the Lion can be added and divided
by two.
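A rough Python sketch of this merge is shown below. The feature values and the helper name merge_rows are assumptions made for illustration; only the animal names, the merged label "Carnivore," and the "Race Time" column come from the example. The helper refuses to merge rows whose other features differ, which mirrors the condition stated above.

```python
from statistics import mean

# Hypothetical values; only the animal names, the merged label
# "Carnivore", and the "Race Time" column come from the text.
animals = [
    {"Animal": "Tiger", "Legs": 4, "Meat Eater": True, "Race Time": 2.0},
    {"Animal": "Lion", "Legs": 4, "Meat Eater": True, "Race Time": 3.0},
    {"Animal": "Tortoise", "Legs": 4, "Meat Eater": False, "Race Time": 10.0},
]


def merge_rows(rows, names, new_name, numeric_key):
    """Merge the named rows into one, averaging the numeric column."""
    to_merge = [r for r in rows if r["Animal"] in names]
    others = [r for r in rows if r["Animal"] not in names]
    # Merging is only safe when every other feature already matches.
    ignore = {"Animal", numeric_key}
    shared = [{k: v for k, v in r.items() if k not in ignore} for r in to_merge]
    if any(s != shared[0] for s in shared):
        raise ValueError("rows differ in a categorical feature")
    merged = dict(to_merge[0])
    merged["Animal"] = new_name
    merged[numeric_key] = mean(r[numeric_key] for r in to_merge)
    return [merged] + others


compressed = merge_rows(animals, {"Tiger", "Lion"}, "Carnivore", "Race Time")
print(compressed[0])
```

With the made-up race times of 2.0 and 3.0, the merged row records 2.5, which is the two values added and divided by two.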
Numerical values, such as time, are normally simple to aggregate unless they
act as categories. For instance, it would be impossible to aggregate an animal
with four legs and an animal with two legs! We obviously can’t merge these
two animals and set “three” as the aggregate number of legs.
Row compression can also be difficult to implement when numerical values
aren’t available. For example, the values “Japan” and “Argentina” are very
difficult to merge. The countries “Japan” and “South Korea” can be merged,
as they can be grouped under the same continent, “Asia,” or region, “East Asia.”
However, if we add “Pakistan” and “Indonesia” to the same group, we may
begin to see skewed results, as there are significant cultural, religious,
economic, and other dissimilarities between these four countries.
In summary, non-numerical and categorical row values can be problematic to
merge while preserving the true value of the original data. Also, row
compression is normally less attainable than feature compression for most
datasets.

One-hot Encoding
After choosing variables and rows, you next want to look for text-based
features that can be converted into numbers. Aside from set text-based values
such as True/False (that automatically convert to “1” and “0” respectively),
many algorithms and also scatterplots are not compatible with non-numerical
data.
One means to convert text-based features into numerical values is through
one-hot encoding, which transforms features into binary form, represented as
“1” or “0”—“True” or “False.” A “0,” representing False, means that the
feature does not belong to a particular category, whereas a “1”—True or
“hot”—denotes that the feature does belong to a set category.
Below is another excerpt of the dataset on dying languages, which we can use
to practice one-hot encoding.

Table: Degree of Endangerment

First, note that the values contained in the “No. of Speakers” column are
written without commas or spaces, e.g. 7500000 rather than 7,500,000 or
7 500 000. Although such formatting does make large numbers clearer for our
eyes, programming languages don’t require such niceties. In fact, formatting
numbers can lead to
an invalid syntax or trigger an unwanted result, depending on the
programming language you use. So remember to keep numbers unformatted
for programming purposes. Feel free, though, to add spacing or commas at
the data visualization stage, as this will make it easier for your audience to
interpret!
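To make this concrete in one language, here is a small Python sketch. Python's int() rejects a string containing comma separators, so the separators are stripped before conversion and added back only for display. The variable names are arbitrary.

```python
raw = "7,500,000"  # formatted for human readers

# int("7,500,000") raises ValueError in Python, so strip separators first.
speakers = int(raw.replace(",", "").replace(" ", ""))

# Re-apply separators only when presenting the number.
display = f"{speakers:,}"
print(speakers, display)
```

Other languages and tools behave differently, so this illustrates the general advice rather than a universal rule.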
On the right-hand side of the table is a vector categorizing the degree of
endangerment of the nine different languages. We can convert this column to
numerical values by applying the one-hot encoding method, as demonstrated
in the subsequent table.

Table: one-hot encoding

Using one-hot encoding, the dataset has expanded to five columns and we
have created three new features from the original feature (Degree of
Endangerment). We have also set each column value to “1” or “0,”
depending on the original category value.
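The transformation can be sketched in plain Python as below. The language names, speaker counts, and the three category labels are assumptions chosen for illustration, and one_hot is a hypothetical helper, not a library function.

```python
# Illustrative rows; language names, counts, and category labels are
# assumptions, not values taken from the original table.
rows = [
    {"Name": "Language A", "No. of Speakers": 7500000,
     "Degree of Endangerment": "Vulnerable"},
    {"Name": "Language B", "No. of Speakers": 1200,
     "Degree of Endangerment": "Critically endangered"},
    {"Name": "Language C", "No. of Speakers": 0,
     "Degree of Endangerment": "Extinct"},
]


def one_hot(rows, column):
    """Replace one categorical column with a 1/0 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[category] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


encoded = one_hot(rows, "Degree of Endangerment")
print(encoded[0])
```

Each encoded row has five columns: the two untouched features plus one binary column per category, exactly one of which is set to 1.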
This now makes it possible for us to input the data into our model and choose
from a wider array of machine learning algorithms. The downside is that we
have more dataset features, which may lead to slightly longer processing
time. This is nonetheless manageable, but it can be problematic for datasets
where original features are split into a larger number of new features.
One hack to minimize the number of features is to restrict binary cases to a
single column. As an example, there is a speed dating dataset on kaggle.com
that lists “Gender” in a single column using one-hot encoding. Rather than
create discrete columns for both “Male” and “Female,” they merged these
two features into one. According to the dataset’s key, females are denoted as
“0” and males are denoted as “1.” The creator of the dataset also used this
technique for “Same Race” and “Match.”

Database: https://www.kaggle.com/annavictoria/speed-dating-experiment
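A minimal Python sketch of this single-column approach follows. The 0 and 1 codes match the key described above, while the sample values are invented.

```python
# Key as described in the text: female = 0, male = 1.
GENDER_CODE = {"Female": 0, "Male": 1}

# Made-up sample values for illustration.
genders = ["Female", "Male", "Male", "Female"]

# One binary column carries the same information as two one-hot columns,
# because a 0 in "is male" already implies "female" in this dataset's key.
gender_column = [GENDER_CODE[g] for g in genders]
print(gender_column)
```

This shortcut only applies when a feature has exactly two possible values; features with three or more categories still need one column per category or a similar scheme.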

Binning
Binning is another method of feature engineering that is used to convert
numerical values into a category.
Whoa, hold on! Didn’t you say that numerical values were a good thing? Yes,
numerical values tend to be preferred in most cases. Numerical values are
less ideal in situations where they capture variations irrelevant to the goals
of your analysis. Let’s take house price evaluation as an example. The exact
measurements of a tennis court might not matter greatly when evaluating
house prices. The relevant information is whether the house has a tennis
court. The same logic probably also applies to the garage and the swimming
pool, where the existence or non-existence of the variable is more influential
than their specific measurements.
The solution here is to replace the numeric measurements of the tennis court
with a True/False feature or a categorical value such as “small,” “medium,”
and “large.” Another alternative would be to apply one-hot encoding with “0”
for homes that do not have a tennis court and “1” for homes that do have a
tennis court.
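Both options can be sketched in Python as follows. The court areas and the size thresholds are arbitrary assumptions chosen for the example, and size_bin is a hypothetical helper.

```python
# Hypothetical tennis court areas in square meters; 0 means no court.
court_areas = [0, 195.7, 0, 260.9, 668.9]


def size_bin(area, small_max=200, medium_max=300):
    """Map an area to a coarse category; the thresholds are arbitrary."""
    if area == 0:
        return "none"
    if area <= small_max:
        return "small"
    if area <= medium_max:
        return "medium"
    return "large"


# Option 1: a binary feature recording only whether a court exists.
has_court = [1 if area > 0 else 0 for area in court_areas]

# Option 2: a categorical feature with a few size bins.
size_bins = [size_bin(area) for area in court_areas]

print(has_court)
print(size_bins)
```

Where the bin boundaries should sit depends on the data and the goal of the analysis, so the defaults here should not be read as recommendations.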

Missing Data
Dealing with missing data is never a desired situation. Imagine unpacking a
jigsaw puzzle that you discover has five percent of its pieces missing.
Missing values in a dataset can be equally frustrating and will ultimately
interfere with your analysis and final predictions. There are, however,
strategies to minimize the negative impact of missing data.
One approach is to approximate missing values using the mode value. The
mode represents the single most common value of a variable in the
dataset. This works best with categorical and binary variable types.

Figure 1: A visual example of the mode and median respectively

The second approach to manage missing data is to approximate missing
values using the median value, which adopts the value(s) located in the
middle of the variable’s sorted values. This works best with integers (whole
numbers) and continuous variables (numbers with decimals).
As a last resort, rows with missing values can be removed altogether. The
obvious downside to this approach is having less data to analyze and
potentially less comprehensive results.
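The three strategies above can be sketched with Python's standard statistics module. The housing rows, the column names, and the fill_missing helper are assumptions made for illustration, with None standing in for a missing value.

```python
from statistics import median, mode

# Made-up rows; None marks a missing value.
rows = [
    {"Has Garage": True, "Bedrooms": 3},
    {"Has Garage": None, "Bedrooms": 4},
    {"Has Garage": True, "Bedrooms": None},
    {"Has Garage": False, "Bedrooms": 2},
]


def fill_missing(rows, column, statistic):
    """Fill missing entries in one column using the given statistic."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill_value = statistic(observed)
    return [
        {**r, column: fill_value if r[column] is None else r[column]}
        for r in rows
    ]


# Mode for the binary feature, median for the integer feature.
filled = fill_missing(rows, "Has Garage", mode)
filled = fill_missing(filled, "Bedrooms", median)

# Last resort: keep only the rows that have no missing values.
complete_only = [r for r in rows if None not in r.values()]

print(filled)
print(len(complete_only))
```

In this tiny example, dropping incomplete rows halves the dataset, which illustrates why removal is treated as a last resort.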