## Chapter 4
---
# Handling Numerical Data

### 4.0 Introduction
Quantitative data is the measurement of something, whether class size, monthly sales, or student scores. The natural way to represent these quantities is numerically (e.g., 20 students, $529,392 in sales). In this chapter we will cover numerous strategies for transforming raw numerical data into features purpose-built for machine learning algorithms.

### 4.1 Rescaling a Feature
Use scikit-learn's `MinMaxScaler` to rescale a feature array.

In [2]:
import numpy as np
from sklearn import preprocessing

# create a feature
feature = np.array([
    [-500.5],
    [-100.1],
    [0],
    [100.1],
    [900.9]
])

# create scaler
minmax_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))

# scale feature
scaled_feature = minmax_scaler.fit_transform(feature)

scaled_feature

array([[0.        ],
       [0.28571429],
       [0.35714286],
       [0.42857143],
       [1.        ]])

#### Discussion
Rescaling is a common preprocessing task in machine learning. Many of the algorithms described later in this book will assume all features are on the same scale, typically 0 to 1 or -1 to 1. There are a number of rescaling techniques, but one of the simplest is called *min-max scaling*. Min-max scaling uses the minimum and maximum values of a feature to rescale values to within a range. Specifically, min-max calculates:
$$
x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}
$$

where $x$ is the feature vector, $x_i$ is an individual element of feature $x$, and $x_i'$ is the rescaled element.
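
As a quick check of the formula, the following minimal sketch applies min-max scaling by hand with NumPy to the same values used in the solution. The variable names are illustrative only, and the target range is assumed to be 0 to 1.

```python
import numpy as np

# same feature values as in the solution
feature = np.array([[-500.5], [-100.1], [0.0], [100.1], [900.9]])

# column-wise minimum and maximum of the feature
feature_min = feature.min(axis=0)
feature_max = feature.max(axis=0)

# (x_i - min(x)) / (max(x) - min(x)) for every element
manual_scaled = (feature - feature_min) / (feature_max - feature_min)

print(manual_scaled.ravel())
```

The result should agree with the `MinMaxScaler` output shown earlier in this recipe.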

#### See Also
* Feature scaling, Wikipedia (https://en.wikipedia.org/wiki/Feature_scaling)

### 4.2 Standardizing a Feature
scikit-learn's `StandardScaler` transforms a feature to have a mean of 0 and a standard deviation of 1.

In [3]:
import numpy as np
from sklearn import preprocessing

# create a feature
feature = np.array([
    [-1000.1],
    [-200.2],
    [500.5],
    [600.6],
    [9000.9]
])

# create scaler
scaler = preprocessing.StandardScaler()

# transform the feature
standardized = scaler.fit_transform(feature)

standardized

array([[-0.76058269],
       [-0.54177196],
       [-0.35009716],
       [-0.32271504],
       [ 1.97516685]])

#### Discussion
A common alternative to min-max scaling is rescaling features to be approximately standard normally distributed. To achieve this, we use standardization to transform the data so that it has a mean of 0 and a standard deviation of 1. Specifically, each element in the feature is transformed so that:
$$
x_i' = \frac{x_i - \bar x}{\sigma}
$$

where $x_i'$ is our standardized form of $x_i$, and $\bar x$ and $\sigma$ are the mean and standard deviation of the original feature. The transformed feature represents the number of standard deviations the original value is away from the feature's mean value (also called a *z-score* in statistics).
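
The z-score calculation can also be written out directly. This is a minimal sketch with illustrative variable names; it uses the population standard deviation (`ddof=0`, NumPy's default), which is what `StandardScaler` uses.

```python
import numpy as np

# same feature values as in the solution
feature = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])

# mean and population standard deviation of the feature
feature_mean = feature.mean(axis=0)
feature_std = feature.std(axis=0)

# (x_i - mean) / standard deviation for every element
z_scores = (feature - feature_mean) / feature_std

print(z_scores.ravel())
```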

Standardization is a common go-to scaling method for machine learning preprocessing and in my experience is used more than min-max scaling. However, it depends on the learning algorithm. For example, principal component analysis often works better using standardization, while min-max scaling is often recommended for neural networks. As a general rule, I'd recommend defaulting to standardization unless you have a specific reason to use an alternative.

We can see the effect of standardization by looking at the mean and standard deviation of our solution's output:

In [4]:
print("Mean {}".format(round(standardized.mean())))
print("Standard Deviation: {}".format(standardized.std()))

Mean 0.0
Standard Deviation: 1.0


If our data has significant outliers, they can negatively impact our standardization by affecting the feature's mean and variance. In this scenario, it is often helpful to instead rescale the feature using the median and interquartile range. In scikit-learn, we do this using `RobustScaler`:

In [5]:
# create scaler
robust_scaler = preprocessing.RobustScaler()

# transform feature
robust_scaler.fit_transform(feature)

array([[-1.87387612],
       [-0.875     ],
       [ 0.        ],
       [ 0.125     ],
       [10.61488511]])
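
To make the median and quartile-based rescaling concrete, here is a minimal sketch that reproduces it by hand. It assumes the scaler's default behavior of subtracting the median and dividing by the range between the 25th and 75th percentiles; the variable names are illustrative.

```python
import numpy as np

# same feature values as in the solution
feature = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])

# median and the 25th/75th percentiles of the feature
median = np.median(feature, axis=0)
q1, q3 = np.percentile(feature, [25, 75], axis=0)

# (x_i - median) / (Q3 - Q1) for every element
robust_scaled = (feature - median) / (q3 - q1)

print(robust_scaled.ravel())
```

Because the median and quartiles barely move when the extreme value 9000.9 changes, the other four values keep a sensible spread, unlike in the standardized output.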

### 4.3 Normalizing Observations
Use scikit-learn's `Normalizer` to rescale the feature values of each observation to have unit norm (a total length of 1).

In [6]:
import numpy as np
from sklearn.preprocessing import Normalizer

# create feature matrix
features = np.array([
    [0.5, 0.5],
    [1.1, 3.4],
    [1.5, 20.2],
    [1.63, 34.4],
    [10.9, 3.3]
])

# create normalizer
normalizer = Normalizer(norm="l2")

# transform feature matrix
normalizer.transform(features)

array([[0.70710678, 0.70710678],
       [0.30782029, 0.95144452],
       [0.07405353, 0.99725427],
       [0.04733062, 0.99887928],
       [0.95709822, 0.28976368]])

#### Discussion
Many rescaling methods operate on features; however, we can also rescale across individual observations. `Normalizer` rescales the values of each individual observation to have unit norm (the observation, viewed as a vector, has a length of 1). This type of rescaling is often used when we have many equivalent features (e.g., text classification, when every word or n-word group is a feature).

`Normalizer` provides three norm options with Euclidean norm (often called L2) being the default:
$$
||x||_2 = \sqrt{x_1^2 + x_2^2 + ... + x_n^2}
$$

where $x$ is an individual observation and $x_n$ is that observation's value for the $n$th feature.

Alternatively, we can specify Manhattan norm (L1), which sums the absolute values of the observation's elements:
$$
||x||_1 = \sum_{i=1}^n{|x_i|}
$$

Intuitively, L2 norm can be thought of as the distance between two points in New York for a bird (i.e., a straight line), while L1 can be thought of as the distance for a human walking on the street (walk north one block, east one block, north one block, east one block, etc.), which is why it is called "Manhattan norm" or "Taxicab norm".
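
The two norms can be compared on a small example. This is a minimal sketch using a hypothetical observation `[3.0, 4.0]` that does not appear in the recipe; the L1 norm is computed on absolute values.

```python
import numpy as np

# hypothetical observation with two feature values
observation = np.array([3.0, 4.0])

# L2 (Euclidean) norm: square root of the sum of squared values
l2_norm = np.sqrt(np.sum(observation ** 2))

# L1 (Manhattan) norm: sum of absolute values
l1_norm = np.sum(np.abs(observation))

# dividing by a norm rescales the observation to unit length under that norm
l2_unit = observation / l2_norm
l1_unit = observation / l1_norm

print("L2 norm:", l2_norm)
print("L1 norm:", l1_norm)
```

For this observation the straight-line length is 5 and the block-by-block length is 7, which mirrors the bird versus pedestrian intuition.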

Practically, notice that `norm='l1'` rescales an observation's values so that they sum to 1 (strictly, their absolute values do), which can sometimes be a desirable quality:

In [8]:
# transform feature matrix
features_l1_norm = Normalizer(norm="l1").transform(features)
print("Sum of the first observation's values: {}".format(features_l1_norm[0,0] + features_l1_norm[0,1]))

Sum of the first observation's values: 1.0


### 4.9 Grouping Observations Using Clustering

In [9]:
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

features, _ = make_blobs(n_samples = 50,
                         n_features = 2,
                         centers = 3,
                         random_state = 1)

df = pd.DataFrame(features, columns=["feature_1", "feature_2"])

# make k-means clusterer
clusterer = KMeans(3, random_state=0)

# fit clusterer
clusterer.fit(features)

# predict values
df['group'] = clusterer.predict(features)

df.head()

|   | feature_1 | feature_2 | group |
|---|-----------|-----------|-------|
| 0 | -9.877554 | -3.336145 | 0 |
| 1 | -7.287210 | -8.353986 | 2 |
| 2 | -6.943061 | -7.023744 | 2 |
| 3 | -7.440167 | -8.791959 | 2 |
| 4 | -6.641388 | -8.075888 | 2 |


### 4.10 Deleting Observations with Missing Values

In [10]:
import numpy as np

features = np.array([
    [1.1, 11.1],
    [2.2, 22.2],
    [3.3, 33.3],
    [np.nan, 55]
])

# keep only observations that are not (denoted by ~) missing
features[~np.isnan(features).any(axis=1)]

array([[ 1.1, 11.1],
       [ 2.2, 22.2],
       [ 3.3, 33.3]])

In [11]:
import pandas as pd
df = pd.DataFrame(features, columns=["feature_1", "feature_2"])
df.dropna()

|   | feature_1 | feature_2 |
|---|-----------|-----------|
| 0 | 1.1 | 11.1 |
| 1 | 2.2 | 22.2 |
| 2 | 3.3 | 33.3 |


#### Discussion
Most machine learning algorithms cannot handle any missing values in the target and feature arrays. The simplest solution is to delete every observation that contains one or more missing values.

There are three types of missing data:

*Missing Completely At Random (MCAR)*
* The probability that a value is missing is independent of everything.

*Missing At Random (MAR)*
* The probability that a value is missing is not completely random, but depends on information captured in other features.

*Missing Not At Random (MNAR)*
* The probability that a value is missing is not random and depends on information not captured in our features.
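
The three mechanisms above can be illustrated with simulated data. This is only a sketch under stated assumptions: the features `age` and `income`, the thresholds, and the missingness probabilities are all hypothetical, and letting missingness depend on the unobserved `income` value itself is just one common way to produce MNAR data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# hypothetical features: age is always observed, income will receive missing values
age = rng.uniform(20, 70, size=n)
income = rng.normal(50, 15, size=n)

# MCAR: every value has the same 10% chance of being missing
mcar_mask = rng.random(n) < 0.10

# MAR: the chance of being missing depends on another observed feature (age)
mar_mask = rng.random(n) < np.where(age > 50, 0.30, 0.05)

# MNAR: the chance of being missing depends on the value that goes unobserved
mnar_mask = rng.random(n) < np.where(income > 60, 0.30, 0.05)

# apply each mask to a copy of the income feature
income_mcar = np.where(mcar_mask, np.nan, income)
income_mar = np.where(mar_mask, np.nan, income)
income_mnar = np.where(mnar_mask, np.nan, income)

print("Missing under MCAR:", np.isnan(income_mcar).sum())
print("Missing under MAR: ", np.isnan(income_mar).sum())
print("Missing under MNAR:", np.isnan(income_mnar).sum())
```

Deleting incomplete observations is safest in the MCAR case. Under MAR or MNAR, the rows that remain are no longer a random sample of the original data.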

#### See Also
* Identifying the Three Types of Missing Data (https://measuringu.com/missing-data/)
* Missing-Data Imputation (http://www.stat.columbia.edu/~gelman/arm/missing.pdf)

### 4.11 Imputing Missing Values

In [14]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs
from sklearn.impute import SimpleImputer

# make fake data
features, _ = make_blobs(n_samples = 1000,
                         n_features = 2,
                         random_state = 1)

# standardize the features
scaler = StandardScaler()
standardized_features = scaler.fit_transform(features)

# replace the first feature's first value with a missing value
true_value = standardized_features[0, 0]
standardized_features[0, 0] = np.nan

# create imputer that fills missing values with the column mean
# (SimpleImputer replaces the Imputer class removed in scikit-learn 0.22)
mean_imputer = SimpleImputer(strategy="mean")

# impute values in the array that actually contains the missing value
features_mean_imputed = mean_imputer.fit_transform(standardized_features)

# compare true and imputed values
print("True Value: {:.6f}".format(true_value))
print("Imputed Value: {:.6f}".format(features_mean_imputed[0, 0]))

True Value: 0.873019
Imputed Value: -0.000874


#### See Also
* A Study of K-Nearest Neighbor as an Imputation Method (http://conteudo.icmc.usp.br/pessoas/gbatista/files/his2002.pdf)