# Handling Numerical Data
<!-- 2018, Albon, Machine Learning with Python Cookbook. Cap 4 -->

## Rescaling a Feature
Rescaling is a common preprocessing task in machine learning. Many of the algorithms
described later in this book will assume all features are on the same scale, typically
`0` to `1` or `-1` to `1`. There are a number of rescaling techniques, but one of the
simplest is called **min-max scaling**. Min-max scaling uses the minimum and maximum
values of a feature to rescale values to within a range. Specifically, min-max calculates:

$x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$

where $x$ is the feature vector, $x_i$ is an individual element of feature $x$, and $x'_i$ is the
rescaled element.

In [None]:
# You need to rescale the values of a numerical feature to be between two values.
# Load libraries
import numpy as np
from sklearn import preprocessing

# Create feature
feature = np.array([[-500.5], 
                    [-100.1], 
                    [0], 
                    [100.1], 
                    [900.9]])

# Create scaler
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))

# Scale feature
scaled_feature = minmax_scale.fit_transform(feature)

# Show feature
print(scaled_feature)

[[0.        ]
 [0.28571429]
 [0.35714286]
 [0.42857143]
 [1.        ]]
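As a sanity check on the formula above, here is a minimal NumPy sketch that applies min-max scaling by hand to the same values. The names `feature_min`, `feature_max`, and `manual_scaled` are illustrative only; the result should agree with the `MinMaxScaler` output for the default `feature_range=(0, 1)`.

```python
import numpy as np

# Same values as the feature used in the cell above
feature = np.array([[-500.5], [-100.1], [0.0], [100.1], [900.9]])

# Min-max scaling by hand: (x - min) / (max - min), computed per column
feature_min = feature.min(axis=0)
feature_max = feature.max(axis=0)
manual_scaled = (feature - feature_min) / (feature_max - feature_min)

print(manual_scaled)
```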


## Standardizing a Feature
A common alternative to min-max scaling is to rescale features so that they
have the mean and spread of a **standard normal distribution**. To achieve this, we use
standardization to transform the data such that it has a mean, $\bar{x}$, of 0 and a standard
deviation, $\sigma$, of 1. Specifically, each element in the feature is transformed so that:

$x'_i = \frac{x_i - \bar{x}}{\sigma}$

where $x'_i$ is our standardized form of $x_i$. The transformed feature represents the number
of standard deviations the original value is away from the feature’s mean value
(also called a z-score in statistics).

Standardization is a common go-to scaling method for machine learning preprocessing
and in my experience is used more than min-max scaling. However, it depends on
the learning algorithm. For example, principal component analysis often works better
using standardization, while min-max scaling is often recommended for neural networks. As a general rule, I’d recommend
defaulting to standardization unless you have a specific reason to use an alternative.

In [None]:
# You want to transform a feature to have a mean of 0 and a standard deviation of 1.
# Load libraries
import numpy as np
from sklearn import preprocessing

# Create feature
x = np.array([[-1000.1], 
              [-200.2], 
              [500.5], 
              [600.6], 
              [9000.9]])

# Create scaler
scaler = preprocessing.StandardScaler()

# Transform the feature
standardized = scaler.fit_transform(x)

# Show feature
print(standardized)

[[-0.76058269]
 [-0.54177196]
 [-0.35009716]
 [-0.32271504]
 [ 1.97516685]]


- We can see the effect of standardization by looking at the mean and standard deviation of our solution’s output:

In [None]:
# Print mean and standard deviation
print("Mean:", round(x.mean()))
print("Standard deviation:", x.std())

Mean: 1780
Standard deviation: 3655.6709067420165


In [None]:
# Print mean and standard deviation
print("Mean:", round(standardized.mean()))
print("Standard deviation:", standardized.std())

Mean: 0
Standard deviation: 1.0
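The z-score formula can also be applied directly with NumPy. The sketch below assumes the same five values as above and uses `np.std` with its default `ddof=0` (the population standard deviation), which is what `StandardScaler` divides by; `manual_standardized` is an illustrative name.

```python
import numpy as np

# Same values as the feature used in the cells above
x = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])

# z-score by hand: subtract the column mean, divide by the column
# standard deviation (np.std defaults to ddof=0, the population form)
manual_standardized = (x - x.mean(axis=0)) / x.std(axis=0)

print(manual_standardized)
```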


- If our data has significant outliers, it can negatively impact our standardization by affecting the feature’s mean and variance. In this scenario, it is often helpful to instead rescale the feature using the median and interquartile range.

In [None]:
# Create scaler
robust_scaler = preprocessing.RobustScaler()

# Transform feature
robust_scaler.fit_transform(x)

array([[-1.87387612],
       [-0.875     ],
       [ 0.        ],
       [ 0.125     ],
       [10.61488511]])
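To make the median and interquartile range rescaling concrete, the following sketch reproduces it by hand. It assumes `RobustScaler`'s default settings (centering on the median and scaling by the 25th to 75th percentile range); the variable names are illustrative.

```python
import numpy as np

# Same values as the feature used in the cells above
x = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])

# Center on the median and divide by the interquartile range (Q3 - Q1)
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual_robust = (x - median) / (q3 - q1)

print(manual_robust)
```

Because the median and quartiles barely move when a single extreme value is added, the four typical values stay on a similar scale while the outlier remains clearly separated.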

## Normalizing Observations
Many rescaling methods (e.g., min-max scaling and standardization) operate on features;
however, we can also rescale across individual observations. `Normalizer`
rescales the values on individual observations to have **unit norm** (a total length of 1). This type of rescaling is often used when we have many equivalent features
(e.g., text classification when every word or n-word group is a feature).
`Normalizer` provides three norm options, with the Euclidean norm (often called L2)
being the default argument:
    
$|| x ||_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$

where $x$ is an individual observation and $x_n$ is that observation’s value for the nth feature.

In [None]:
# You want to rescale the feature values of observations to have unit norm
# (a total length of 1).
# Load libraries
import numpy as np
from sklearn.preprocessing import Normalizer

# Create feature matrix
features = np.array([[0.5, 0.5], 
                     [1.1, 3.4], 
                     [1.5, 20.2], 
                     [1.63, 34.4], 
                     [10.9, 3.3]])

# Create normalizer
normalizer = Normalizer(norm="l2")

# Transform feature matrix
features_l2_norm = normalizer.transform(features)

# Show feature matrix
print(features_l2_norm)

[[0.70710678 0.70710678]
 [0.30782029 0.95144452]
 [0.07405353 0.99725427]
 [0.04733062 0.99887928]
 [0.95709822 0.28976368]]
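The L2 normalization can be reproduced by dividing each row by its Euclidean length. This is a small sketch using `np.linalg.norm`; `row_norms` and `manual_l2` are illustrative names.

```python
import numpy as np

# Same feature matrix as in the cell above
features = np.array([[0.5, 0.5],
                     [1.1, 3.4],
                     [1.5, 20.2],
                     [1.63, 34.4],
                     [10.9, 3.3]])

# Euclidean length of each observation (row), kept as a column for broadcasting
row_norms = np.linalg.norm(features, axis=1, keepdims=True)

# Divide every row by its own length
manual_l2 = features / row_norms

# Each rescaled observation should now have length 1
print(np.linalg.norm(manual_l2, axis=1))
```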


- Alternatively, we can specify Manhattan norm (L1):
$|| x ||_1 = \sum _{i=1}^{n} |x_i|$

In [None]:
# Transform feature matrix
features_l1_norm = Normalizer(norm="l1").transform(features)

# Show feature matrix
print(features_l1_norm)

[[0.5        0.5       ]
 [0.24444444 0.75555556]
 [0.06912442 0.93087558]
 [0.04524008 0.95475992]
 [0.76760563 0.23239437]]


- In practice, notice that `norm="l1"` rescales an observation’s values so that they sum to 1 (strictly, so that their absolute values sum to 1), which can sometimes be a desirable quality:

In [None]:
# Print sum
print("Sum of the first observation\'s values:", features_l1_norm[0, 0] + features_l1_norm[0, 1])

Sum of the first observation's values: 1.0
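The same check can be run for every observation at once. The sketch below computes the L1 normalization by hand and sums each row; note that the rows sum to exactly 1 here only because all values in this matrix are non-negative. The names `row_abs_sums` and `manual_l1` are illustrative.

```python
import numpy as np

# Same feature matrix as in the cells above
features = np.array([[0.5, 0.5],
                     [1.1, 3.4],
                     [1.5, 20.2],
                     [1.63, 34.4],
                     [10.9, 3.3]])

# Manhattan (L1) norm of each row: the sum of absolute values
row_abs_sums = np.abs(features).sum(axis=1, keepdims=True)

# Divide every row by its own L1 norm
manual_l1 = features / row_abs_sums

# Sum of each rescaled observation's values
print(manual_l1.sum(axis=1))
```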


## Generating Polynomial and Interaction Features

Polynomial features are often created when we want to include the notion that there
exists a nonlinear relationship between the features and the target. For example, we
might suspect that the effect of age on the probability of having a major medical condition
is not constant over time but increases as age increases. We can encode that
nonconstant effect in a feature, $x$, by generating that feature’s higher-order forms ($x^2$, $x^3$, etc.).

In [None]:
# You want to create polynomial and interaction features.
# Load libraries
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Create feature matrix
features = np.array([[2, 3], 
                     [2, 3], 
                     [2, 3]])

# Create PolynomialFeatures object
polynomial_interaction = PolynomialFeatures(degree=2, include_bias=False)

# Create polynomial features
polynomial_interaction.fit_transform(features)

array([[2., 3., 4., 6., 9.],
       [2., 3., 4., 6., 9.],
       [2., 3., 4., 6., 9.]])

The degree parameter determines the maximum degree of the polynomial. For example, degree=2 will create new features raised to the second power:

$x_1, x_2, x_1^2 , x_2^2$
    
while degree=3 will create new features raised to the second and third power:

$x_1, x_2, x_1^2 , x_2^2, x_1^3 , x_2^3$

Furthermore, by default PolynomialFeatures includes interaction features: 

$x_1x_2$
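To see which column is which, the degree-2 expansion can be built by hand. This sketch assumes the column order $x_1, x_2, x_1^2, x_1x_2, x_2^2$, which is consistent with the `[2., 3., 4., 6., 9.]` rows printed earlier; `manual_poly` is an illustrative name.

```python
import numpy as np

# Same feature matrix as in the cell above
features = np.array([[2, 3],
                     [2, 3],
                     [2, 3]])

x1 = features[:, 0]
x2 = features[:, 1]

# Original features, then squares and the pairwise interaction term
manual_poly = np.column_stack([x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

print(manual_poly)
```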

In [None]:
# We can restrict the features created to only interaction features by setting interaction_only to True:
interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interaction.fit_transform(features)

array([[2., 3., 6.],
       [2., 3., 6.],
       [2., 3., 6.]])

## Transforming Features

It is common to want to make some custom transformations to one or more features. 

- For example, we might want to create a feature that is the natural log of the values of another feature. We can do this by creating a function and then mapping it to features using either scikit-learn’s `FunctionTransformer` or pandas’ `apply`.

- The solution below uses a very simple function, `add_ten`, which adds `10` to each input, but there is no reason we could not define a much more complex function.

In [None]:
# You want to make a custom transformation to one or more features.
# Load libraries
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Create feature matrix
features = np.array([[2, 3],
                     [2, 3],
                     [2, 3]])

# Define a simple function
def add_ten(x):
    return x + 10

# Create transformer
ten_transformer = FunctionTransformer(add_ten)

# Transform feature matrix
ten_transformer.transform(features)

array([[12, 13],
       [12, 13],
       [12, 13]])

- We can create the same transformation in pandas using apply:

In [None]:
# Load library
import pandas as pd

# Create DataFrame
df = pd.DataFrame(features, columns=["feature_1", "feature_2"])

# Apply function
df.apply(add_ten)

df.shape

(3, 2)
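The natural log transformation mentioned at the start of this section follows the same pattern. A minimal sketch, assuming all feature values are strictly positive (otherwise `np.log` returns `-inf` or `nan`); `log_transformer` and `log_features` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Same feature matrix as in the cells above (all values are positive)
features = np.array([[2, 3],
                     [2, 3],
                     [2, 3]])

# Wrap NumPy's natural log in a transformer
log_transformer = FunctionTransformer(np.log)

# Apply the natural log element-wise
log_features = log_transformer.fit_transform(features)

print(log_features)
```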

## Detecting Outliers

There is no single best technique for detecting outliers. Instead, we have a collection
of techniques all with their own advantages and disadvantages. Our best strategy is
often trying multiple techniques (e.g., both EllipticEnvelope and IQR-based detection)
and looking at the results as a whole.

In [None]:
# You want to identify extreme observations.
# Load libraries
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Create simulated data
features, _ = make_blobs(n_samples = 10, 
                         n_features = 2, 
                         centers = 1, 
                         random_state = 1)
print(features)

# Replace the first observation's values with extreme values
features[0,0] = 10000
features[0,1] = 10000
print(features)

# Create detector
outlier_detector = EllipticEnvelope(contamination=.1)

# Fit detector
outlier_detector.fit(features)

# Predict outliers
outlier_detector.predict(features)

[[-1.83198811  3.52863145]
 [-2.76017908  5.55121358]
 [-1.61734616  4.98930508]
 [-0.52579046  3.3065986 ]
 [ 0.08525186  3.64528297]
 [-0.79415228  2.10495117]
 [-1.34052081  4.15711949]
 [-1.98197711  4.02243551]
 [-2.18773166  3.33352125]
 [-0.19745197  2.34634916]]
[[ 1.00000000e+04  1.00000000e+04]
 [-2.76017908e+00  5.55121358e+00]
 [-1.61734616e+00  4.98930508e+00]
 [-5.25790464e-01  3.30659860e+00]
 [ 8.52518583e-02  3.64528297e+00]
 [-7.94152277e-01  2.10495117e+00]
 [-1.34052081e+00  4.15711949e+00]
 [-1.98197711e+00  4.02243551e+00]
 [-2.18773166e+00  3.33352125e+00]
 [-1.97451969e-01  2.34634916e+00]]


array([-1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

A major limitation of this approach is the need to specify a `contamination` parameter,
which is the proportion of observations that are outliers—a value that we don’t
know. Think of `contamination` as our estimate of the cleanliness of our data. If we
expect our data to have few outliers, we can set `contamination` to something small.
However, if we believe that the data is very likely to have outliers, we can set it to a
higher value.

- Instead of looking at observations as a whole, we can look at individual features and identify extreme values in those features using interquartile range (IQR):

In [None]:
# Create one feature
feature = features[:,0]

# Create a function to return the indices of outliers
def indices_of_outliers(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return np.where((x > upper_bound) | (x < lower_bound))

# Run function
indices_of_outliers(feature)

(array([0]),)

IQR is the difference between the first and third quartile of a set of data. You can
think of IQR as the spread of the bulk of the data, with outliers being observations far
from the main concentration of data. Outliers are commonly defined as any value 1.5
IQRs less than the first quartile or 1.5 IQRs greater than the third quartile.
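A small worked example may help make the 1.5 IQR rule concrete. The five values below are made up for illustration: the quartiles are 2 and 4, so the IQR is 2 and the bounds are -1 and 7.

```python
import numpy as np

# Hypothetical values with one obvious extreme value
values = np.array([1, 2, 3, 4, 100])

# First and third quartiles, and the spread between them
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Anything more than 1.5 IQRs outside the quartiles is flagged
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outlier_mask = (values < lower_bound) | (values > upper_bound)

print(lower_bound, upper_bound)  # -1.0 7.0
print(values[outlier_mask])      # [100]
```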

## Handling Outliers

Similar to detecting outliers, there is no hard-and-fast rule for handling them. How
we handle them should be based on two aspects.

- First, we should consider what makes them an outlier. If we believe they are errors in the data such as from a broken sensor or a miscoded value, then we might drop the observation or replace outlier values with NaN since we can’t believe those values. However, if we believe the outliers are genuine extreme values (e.g., a house [mansion] with 200 bathrooms), then marking them as outliers or transforming their values is more appropriate.

- Second, how we handle outliers should be based on our goal for machine learning. For example, if we want to predict house prices based on features of the house, we might reasonably assume the price for mansions with over 100 bathrooms is driven by a different dynamic than regular family homes. Furthermore, if we are training a model to use as part of an online home loan web application, we might assume that our potential users will not include billionaires looking to buy a mansion.

So what should we do if we have outliers? Think about why they are outliers, have an
end goal in mind for the data, and, most importantly, remember that not making a
decision to address outliers is itself a decision with implications.

One additional point: ***if you do have outliers, standardization might not be appropriate***
because the mean and variance might be highly influenced by the outliers. In this
case, use a rescaling method more robust against outliers, like `RobustScaler`.

In [None]:
# You have outliers. 1. We can drop them:
# Load library
import pandas as pd

# Create DataFrame
houses = pd.DataFrame()
houses['Price'] = [534433, 392333, 293222, 4322032]
houses['Bathrooms'] = [2, 3.5, 2, 116]
houses['Square_Feet'] = [1500, 2500, 1500, 48000]

# Filter observations
houses[houses['Bathrooms'] < 20]

Unnamed: 0,Price,Bathrooms,Square_Feet
0,534433,2.0,1500
1,392333,3.5,2500
2,293222,2.0,1500


In [None]:
# You have outliers. 2. We can mark them as outliers and include that flag as a feature:

# Load library
import numpy as np

# Create feature based on boolean condition
houses["Outlier"] = np.where(houses["Bathrooms"] < 20, 0, 1)

# Show data
houses

Unnamed: 0,Price,Bathrooms,Square_Feet,Outlier
0,534433,2.0,1500,0
1,392333,3.5,2500,0
2,293222,2.0,1500,0
3,4322032,116.0,48000,1


In [None]:
# You have outliers. 3. we can transform the feature to dampen the effect of the outlier:

# Log feature (np.log is vectorized, so it applies to the whole column)
houses["Log_Of_Square_Feet"] = np.log(houses["Square_Feet"])

# Show data
houses

Unnamed: 0,Price,Bathrooms,Square_Feet,Outlier,Log_Of_Square_Feet
0,534433,2.0,1500,0,7.31322
1,392333,3.5,2500,0,7.824046
2,293222,2.0,1500,0,7.31322
3,4322032,116.0,48000,1,10.778956


## Discretizing Features

Discretization can be a fruitful strategy when we have reason to believe that a numerical
feature should behave more like a categorical feature. For example, we might
believe there is very little difference in the spending habits of 19- and 20-year-olds,
but a significant difference between 20- and 21-year-olds (the age in the United States
when young adults can consume alcohol). In that example, it could be useful to break
up individuals in our data into those who can drink alcohol and those who cannot.
Similarly, in other cases it might be useful to discretize our data into three or more
bins.

1. First, we can binarize the feature according to some threshold:

In [None]:
# You have a numerical feature and want to break it up into discrete bins.

# Load libraries
import numpy as np
from sklearn.preprocessing import Binarizer

# Create feature
age = np.array([[6], 
                [12], 
                [20], 
                [36], 
                [65]])

# Create binarizer
binarizer = Binarizer(threshold=18)

# Transform feature
binarizer.fit_transform(age)

array([[0],
       [0],
       [1],
       [1],
       [1]])

In [None]:
# Bin feature
np.digitize(age, bins=[18])

array([[0],
       [0],
       [1],
       [1],
       [1]])

2. Second, we can break up numerical features according to multiple thresholds:

In [None]:
# Bin feature
print(age)
print(np.digitize(age, bins=[20,30,64]))

[[ 6]
 [12]
 [20]
 [36]
 [65]]
[[0]
 [0]
 [1]
 [2]
 [3]]
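By default `np.digitize` treats each bin as closed on the left, which is why `20` landed in bin `1` above. The sketch below shows the effect of NumPy's `right=True` argument, which closes each bin on the right instead; `binned_right` is an illustrative name.

```python
import numpy as np

# Same ages as in the cells above
age = np.array([[6], [12], [20], [36], [65]])

# With right=True a value equal to a threshold falls into the lower bin,
# so 20 is now grouped with the younger ages
binned_right = np.digitize(age, bins=[20, 30, 64], right=True)

print(binned_right.ravel())  # [0 0 0 2 3]
```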


## Grouping Observations Using Clustering
Clustering can also be used as a preprocessing step. Specifically, we use unsupervised learning algorithms
like k-means to cluster observations into groups. The end result is a categorical feature with similar observations being members of the same group.

In [None]:
# You want to cluster observations so that similar observations are grouped together.

# Load libraries
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Make simulated feature matrix
features, _ = make_blobs(n_samples = 50, 
                         n_features = 2, 
                         centers = 3, 
                         random_state = 1)

# Create DataFrame
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])

# Make k-means clusterer
clusterer = KMeans(3, random_state=0)

# Fit clusterer
clusterer.fit(features)

# Predict values
dataframe["group"] = clusterer.predict(features)

# View first few observations
dataframe.head(10)

Unnamed: 0,feature_1,feature_2,group
0,-9.877554,-3.336145,0
1,-7.28721,-8.353986,2
2,-6.943061,-7.023744,2
3,-7.440167,-8.791959,2
4,-6.641388,-8.075888,2
5,-0.794152,2.104951,1
6,-2.760179,5.551214,1
7,-9.946905,-4.590344,0
8,-0.52579,3.306599,1
9,-1.981977,4.022436,1
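Since the resulting group is a categorical feature, one possible follow-up is to one-hot encode it before passing it to a model. This is a sketch under that assumption, using `pd.get_dummies`; the explicit `n_init=10` and the names `groups` and `group_dummies` are choices made here for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same simulated data as in the cell above
features, _ = make_blobs(n_samples=50, n_features=2, centers=3, random_state=1)

# Fit k-means and get a cluster label for every observation
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
groups = clusterer.fit_predict(features)

# One indicator column per cluster
group_dummies = pd.get_dummies(pd.Series(groups, name="group"), prefix="group")

print(group_dummies.head())
```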


## Deleting Observations with Missing Values

Most machine learning algorithms cannot handle any missing values in the target and
feature arrays. For this reason, we cannot ignore missing values in our data and must
address the issue during preprocessing.

The simplest solution is to delete every observation that contains one or more missing
values, a task quickly and easily accomplished using NumPy or pandas.

In [None]:
# You need to delete observations containing missing values.
# Load library
import numpy as np

# Create feature matrix
features = np.array([[1.1, 11.1], 
                     [2.2, 22.2],
                     [3.3, 33.3],
                     [4.4, 44.4],
                     [np.nan, 55]])

# Keep only observations that are not (denoted by ~) missing
features[~np.isnan(features).any(axis=1)]

array([[ 1.1, 11.1],
       [ 2.2, 22.2],
       [ 3.3, 33.3],
       [ 4.4, 44.4]])

- Alternatively, we can drop missing observations using pandas:

In [None]:
# Load library
import pandas as pd

# Load data
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])

# Remove observations with missing values
dataframe.dropna()

Unnamed: 0,feature_1,feature_2
0,1.1,11.1
1,2.2,22.2
2,3.3,33.3
3,4.4,44.4
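Before deleting rows it can be worth counting how many observations would be lost. The sketch below does that and also shows pandas' `subset` argument, which drops rows only when specific columns are missing; `rows_with_missing` and `kept` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same feature matrix as in the cells above
features = np.array([[1.1, 11.1],
                     [2.2, 22.2],
                     [3.3, 33.3],
                     [4.4, 44.4],
                     [np.nan, 55]])
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])

# Number of observations with at least one missing value
rows_with_missing = dataframe.isna().any(axis=1).sum()
print(rows_with_missing)  # 1

# Drop rows only if feature_2 is missing; the last row is kept
# because its missing value is in feature_1
kept = dataframe.dropna(subset=["feature_2"])
print(len(kept))  # 5
```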


## Imputing Missing Values

There are two main strategies for replacing missing data with substitute values, each
of which has strengths and weaknesses.

- First, we can use machine learning to predict the values of the missing data. To do this we treat the feature with missing values as a target vector and use the remaining subset of features to predict missing values.

- An alternative and more scalable strategy is to fill in all missing values with some average value.

In [None]:
!pip install fancyimpute


In [None]:
# You have missing values in your data and want to fill in or predict their values.

# Load libraries
import numpy as np
from fancyimpute import KNN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# Make a simulated feature matrix
features, _ = make_blobs(n_samples = 1000,
                        n_features = 2,
                        random_state = 1)

# Standardize the features
scaler = StandardScaler()
standardized_features = scaler.fit_transform(features)

# Replace the first feature's first value with a missing value
true_value = standardized_features[0,0]
standardized_features[0,0] = np.nan

# Predict the missing values in the feature matrix
features_knn_imputed = KNN(k=5, verbose=0).fit_transform(standardized_features)

# Compare true and imputed values
print("True Value:", true_value)
print("Imputed Value:", features_knn_imputed[0,0])

True Value: 0.8730186113995938
Imputed Value: 1.0955332713113226


- Alternatively, we can use scikit-learn’s `SimpleImputer` class to fill in missing values with the feature’s mean, median, or most frequent value.

In [None]:
# Load library
from sklearn.impute import SimpleImputer

# Create imputer
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

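# Impute values (fit and transform in one step)
# Note: `features` is the unscaled matrix and contains no missing values; the NaN
# was set in `standardized_features`, so the value printed below is the original
# unscaled entry rather than an imputed one.
features_mean_imputed = mean_imputer.fit_transform(features)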

# Compare true and imputed values
print("True Value:", true_value)
print("Imputed Value:", features_mean_imputed[0,0])

True Value: 0.8730186113995938
Imputed Value: -3.058372724614996
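Mean imputation is easiest to verify on a tiny matrix that actually contains a missing value. In this sketch the made-up array `small` has one `NaN` in its second column, whose observed values are 2 and 6, so the imputed value should be their mean, 4.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical matrix with one missing value in the second column
small = np.array([[1.0, 2.0],
                  [3.0, np.nan],
                  [5.0, 6.0]])

# Replace each NaN with the mean of the observed values in its column
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
small_imputed = mean_imputer.fit_transform(small)

print(small_imputed[1, 1])  # 4.0
```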
