# Missing values

Handling missing values is an essential preprocessing task that can drastically deteriorate your model when not done with sufficient care. A few questions should come up when handling missing values:

Do I have missing values? How are they expressed in the data? Should I withhold samples with missing values? Or should I replace them? If so, which values should they be replaced with?

Before you start handling missing values, it is important to identify them and to know which placeholder values are used to represent them. You should be able to find this out by combining the metadata with exploratory analysis.
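
As a rough sketch of that exploratory step, the snippet below counts both real NaNs and a sentinel value per column. The frame `df`, its column names and the sentinel 999 are assumptions made up for illustration; in practice the sentinel should come from your metadata.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: NaN and the sentinel 999 both mean "missing"
df = pd.DataFrame({"age": [25, np.nan, 999, 40],
                   "income": [3000, 999, 2500, np.nan]})

nan_counts = df.isna().sum()           # real NaN values per column
sentinel_counts = (df == 999).sum()    # placeholder values per column

print(nan_counts)
print(sentinel_counts)
```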

Once you know a bit more about the missing data you have to decide whether or not you want to keep entries with missing data. According to Chris Albon (Machine Learning with Python Cookbook), this decision should partially depend on how random missing values are.

If they are completely at random, they don’t give any extra information and can be omitted. On the other hand, if they’re not at random, the fact that a value is missing is itself information and can be expressed as an extra binary feature.

Also keep in mind that deleting a whole observation because it has one missing value might be a poor decision and lead to information loss. Likewise, keeping a row that consists almost entirely of missing values just because one of them is meaningful might not be your best move.
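
The idea of expressing missingness as an extra binary feature can be sketched with plain pandas. The column name `income` and the suffix `_missing` are hypothetical choices for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries
df = pd.DataFrame({"income": [3000.0, np.nan, 2500.0, np.nan]})

# 1 where the value is missing, 0 otherwise; the original column is kept
df["income_missing"] = df["income"].isna().astype(int)
print(df)
```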

Let’s materialize this theory with some coding examples using sklearn’s MissingIndicator. To give our code some meaning, we’ll create a very small data set with three features and five samples. The data contains obvious missing values expressed as not-a-number or 999.

In [1]:
import numpy as np
import pandas as pd

# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
X = pd.DataFrame(np.array([5,7,8, np.nan, np.nan, np.nan, -5,
                           0,25,999,1,-1, np.nan, 0, np.nan])
                 .reshape((5,3)))
X.columns = ['f1', 'f2', 'f3'] #feature 1, feature 2, feature 3
X

      f1   f2    f3
0    5.0  7.0   8.0
1    NaN  NaN   NaN
2   -5.0  0.0  25.0
3  999.0  1.0  -1.0
4    NaN  0.0   NaN


Take a quick look at the data so you know where the missing values are located. Rows or columns with too many non-meaningful missing values can be deleted from your data with pandas’ dropna function. Let’s take a look at the most important parameters:

    axis: 0 for rows, 1 for columns
    thresh: the minimum number of non-NaN values required to keep a row or column
    inplace: update the frame in place

We update our dataset by deleting all the rows (axis=0) with only missing values. Note that in this case, instead of setting thresh to 1, you can also set the how parameter to ‘all’. As a result our second sample is dropped, since it only consists of missing values.

In [2]:
X.dropna(axis='rows', thresh=1, inplace=True)
X.reset_index(inplace=True)

# drop the old index column; otherwise the frame would carry two index columns
X.drop(['index'], axis='columns', inplace=True)

# ignore this line for now; Y is needed for an example further below
Y = X

X

      f1   f2    f3
0    5.0  7.0   8.0
1   -5.0  0.0  25.0
2  999.0  1.0  -1.0
3    NaN  0.0   NaN
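
The claim that `thresh=1` and `how='all'` drop the same rows can be checked on a small frame. This is only a sketch on made-up data; it also uses `reset_index(drop=True)`, which discards the old index in a single step.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose middle row holds only missing values
df = pd.DataFrame({"f1": [5, np.nan, np.nan], "f2": [7, np.nan, 0]})

by_thresh = df.dropna(axis=0, thresh=1)   # keep rows with at least 1 non-NaN
by_how = df.dropna(axis=0, how="all")     # drop rows where every value is NaN

# drop=True throws the old index away instead of adding it as a column
reindexed = by_how.reset_index(drop=True)
print(reindexed)
```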


Let’s create boolean features that tell whether a sample contains missing values, using MissingIndicator.

Since it does not support multiple types of missing values, we convert the value 999 to NaN manually. Then we create, fit and transform a MissingIndicator object that shows where the NaNs are.

In [3]:
from sklearn.impute import MissingIndicator

X.replace({999.0 : np.nan}, inplace=True)

# specify which value counts as missing (only one value is allowed)
indicator = MissingIndicator(missing_values=np.nan)
indicator = indicator.fit_transform(X)
indicator = pd.DataFrame(indicator, columns=['m1', 'm3'])
indicator

# note that the second column (which has no missing values) is absent from
# the indicator: it ignores such columns and only outputs those that contain
# at least one missing value

      m1     m3
0  False  False
1  False  False
2   True  False
3   True   True
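
If you would rather get one indicator column per input feature, including features without missing values, MissingIndicator accepts `features='all'`. The sketch below rebuilds the example data under that setting; the frame `df` and the column names m1 to m3 are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Same shape of data as above, with 999 already replaced by NaN
df = pd.DataFrame({"f1": [5.0, -5.0, np.nan, np.nan],
                   "f2": [7.0, 0.0, 1.0, 0.0],
                   "f3": [8.0, 25.0, -1.0, np.nan]})

# features='all' keeps a mask column for every feature, not only for
# those that contain missing values (the default is 'missing-only')
ind = MissingIndicator(missing_values=np.nan, features="all")
mask = pd.DataFrame(ind.fit_transform(df), columns=["m1", "m2", "m3"])
print(mask)
```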


# Imputing values

After deciding to keep (some of) your missing values and creating missing value indicators, the next question is whether you should replace the missing values. Most learning algorithms perform poorly when missing values are expressed as not-a-number (np.nan) and need some form of missing value imputation. Be aware that some libraries and algorithms, such as XGBoost, can handle missing values themselves and learn how to treat them during training.

For filling up missing values with common strategies, sklearn provides a SimpleImputer. The four main strategies are mean, most_frequent, median and constant (don’t forget to set the fill_value parameter). In the example below we impute missing values for our dataframe X with the feature’s mean.

In [4]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit_transform(X)

# the missing values are filled, but this method returns a NumPy array,
# which is not very convenient for us

array([[ 5.        ,  7.        ,  8.        ],
       [-5.        ,  0.        , 25.        ],
       [ 0.        ,  1.        , -1.        ],
       [ 0.        ,  0.        , 10.66666667]])

Note that the returned values are put into a NumPy array, so we lose all the meta-information. Since all these strategies can be mimicked in pandas, we are going to use the pandas fillna method to impute missing values. For ‘mean’ we can use the following code. The pandas implementation also provides options to fill forward (ffill) or fill backward (bfill), which are convenient when working with time series.

In [5]:
# the same can be done in pandas like this:
X.fillna(X.mean(), inplace=True)
X

    f1   f2         f3
0  5.0  7.0        8.0
1 -5.0  0.0       25.0
2  0.0  1.0       -1.0
3  0.0  0.0  10.666667
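
The forward and backward fill options mentioned above can be sketched on a tiny time series. The dates and values are made up; the dedicated `ffill()` and `bfill()` methods are used here because recent pandas versions deprecate passing `method=` to `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a two-day gap
ts = pd.Series([1.0, np.nan, np.nan, 4.0],
               index=pd.date_range("2024-01-01", periods=4, freq="D"))

forward = ts.ffill()    # propagate the last known value forward
backward = ts.bfill()   # pull the next known value backward

print(pd.DataFrame({"raw": ts, "ffill": forward, "bfill": backward}))
```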


Other popular ways to impute missing data are filling a value from its k nearest neighbors (KNN) or interpolating the values using one of a wide range of interpolation methods. Neither technique is part of sklearn’s preprocessing module (scikit-learn does ship a KNNImputer in sklearn.impute, and pandas has an interpolate method), and they won’t be discussed in detail here.
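
For orientation only, here is a minimal sketch of both ideas, assuming pandas’ `interpolate` and scikit-learn’s `KNNImputer` (available in `sklearn.impute` since version 0.22). The data and the choice of `n_neighbors=2` are arbitrary illustrations, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Linear interpolation between the neighbouring known values
s = pd.Series([1.0, np.nan, 3.0])
linear = s.interpolate(method="linear")

# KNN imputation: a missing entry is filled from the rows that are
# closest on the features that are present
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                   "b": [1.0, 2.0, 3.0, 4.0]})
knn = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(knn.fit_transform(df), columns=df.columns)

print(linear)
print(filled)
```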

# Polynomial features

Creating polynomial features is a simple and common way of feature engineering that adds complexity to numeric input data by combining features.

Polynomial features are often created when we want to include the notion that there exists a nonlinear relationship between the features and the target. They are mostly used to add complexity to linear models with few features, or when we suspect that the effect of one feature depends on another feature.

Before handling missing values, you need to decide whether you want to use polynomial features. If, for example, you replace all missing values with 0, every cross-product involving this feature will be 0. Moreover, if you don’t replace missing values (NaN), creating polynomial features will raise a ValueError in the fit_transform phase, since the input must be finite.
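
The interplay between imputation and cross-products can be made concrete with a small experiment. The frame below is hypothetical; `include_bias=False` is set only to keep the output short.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data with one missing value in column a
df = pd.DataFrame({"a": [2.0, np.nan, 4.0], "b": [1.0, 3.0, 5.0]})
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

# Output columns are a, b and a*b
zero_filled = poly.fit_transform(df.fillna(0))          # a*b collapses to 0
mean_filled = poly.fit_transform(df.fillna(df.mean()))  # a*b stays informative

# Leaving the NaN in place is rejected by the transformer
try:
    poly.fit_transform(df)
    nan_rejected = False
except ValueError:
    nan_rejected = True

print(zero_filled[1], mean_filled[1], nan_rejected)
```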

In this respect, replacing missing values by the median or the mean seems to be a reasonable choice. Since I’m not completely sure about this, and can’t find any consistent information, I asked this question on the data science StackExchange.

Sklearn provides a PolynomialFeatures class to create polynomial features from scratch. The degree parameter determines the maximum degree of the polynomial. For example, when degree is set to two and X = (x1, x2), the features created will be 1, x1, x2, x1², x1x2 and x2². The interaction_only parameter lets the class know that we only want the interaction features, i.e. 1, x1, x2 and x1x2.

Here, we create polynomial features up to the third degree, interaction features only. As a result we get four new features: f1.f2, f1.f3, f2.f3 and f1.f2.f3. Note that our original features are also included in the output, so we slice off the new features to add them to our data later.

In [6]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3, interaction_only=True)

# build a frame from the polynomial features and slice it
# so that the final frame only contains the polynomial features
polynomials = pd.DataFrame(poly.fit_transform(X),
                           columns=['0','1','2','3',
                                    'p1', 'p2', 'p3',
                                    'p4'])[['p1', 'p2', 'p3', 'p4']]
polynomials

     p1     p2    p3     p4
0  35.0   40.0  56.0  280.0
1  -0.0 -125.0   0.0   -0.0
2   0.0   -0.0  -1.0   -0.0
3   0.0    0.0   0.0    0.0


Just as with any other form of feature engineering, it is important to create polynomial features before doing any feature scaling.

Now, let’s concatenate our new missing indicator features and polynomial features to our data with pandas concat method.

In [7]:
X = pd.concat([X, indicator, polynomials], axis='columns')
X

    f1   f2         f3     m1     m3    p1     p2    p3     p4
0  5.0  7.0        8.0  False  False  35.0   40.0  56.0  280.0
1 -5.0  0.0       25.0  False  False  -0.0 -125.0   0.0   -0.0
2  0.0  1.0       -1.0   True  False   0.0   -0.0  -1.0   -0.0
3  0.0  0.0  10.666667   True   True   0.0    0.0   0.0    0.0


# Categorical features

Munging categorical data is another essential step of data preprocessing. Unfortunately, most estimators in sklearn do not accept raw categorical data. Even for tree-based models, it is necessary to convert categorical features to a numerical representation.

Before you start transforming your data, it is important to figure out if the feature you’re working on is ordinal (as opposed to nominal). An ordinal feature is best described as a feature with natural, ordered categories where the distances between the categories are not known.

Once you know what type of categorical data you’re working on, you can pick a suitable transformation tool. In sklearn that will be an OrdinalEncoder for ordinal data, and a OneHotEncoder for nominal data.
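
A minimal OrdinalEncoder sketch is shown below. The category order low < medium < high is passed explicitly, since otherwise the encoder sorts the categories alphabetically; the column name and values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column without missing values
df = pd.DataFrame({"edu_level": ["medium", "high", "high", "low"]})

# One list of ordered categories per encoded column
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
codes = encoder.fit_transform(df[["edu_level"]])

print(codes.ravel())
```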

Let’s consider a simple example to demonstrate how ordinal and nominal encoding work. Create a dataframe with five entries and three features: sex, blood type and education level.

In [8]:
# create a test frame
X = pd.DataFrame([['M', 'O-', 'medium'],
                 ['M', 'O-', 'high'],
                 ['F', 'O+', 'high'],
                 ['F', 'AB', 'low'],
                 ['F', 'B+', np.nan]])            
X.columns = ['sex', 'blood_type', 'edu_level']
print(X)

  sex blood_type edu_level
0   M         O-    medium
1   M         O-      high
2   F         O+      high
3   F         AB       low
4   F         B+       NaN


Now we need to process the ordinal data, which is the last column. To do so, we create a Categorical object, passing it the column to be "categorized", the list of categories, and ordered=True.

In [9]:
cat = pd.Categorical(X['edu_level'], 
                     categories=['missing', 'low', 
                                 'medium', 'high'], 
                     ordered=True)

# this is what our object looks like now
print(cat)

# replace the NaNs with 'missing'
cat = cat.fillna('missing')
print(cat)

# pd.factorize() encodes the object as an enumerated type or categorical variable.
# Passing it our cat object (with its categories) and sort=True (since the data is ordinal)
# returns labels, an np.ndarray with the category code of each sample in our column,
# and unique, the set of unique categories
labels, unique = pd.factorize(cat, sort=True)

# finally, assign these labels to the edu_level column, replacing the strings with ordered
# numeric values and making the column usable in ML algorithms
X['edu_level'] = labels
X

[medium, high, high, low, NaN]
Categories (4, object): [missing < low < medium < high]
[medium, high, high, low, missing]
Categories (4, object): [missing < low < medium < high]


  sex blood_type  edu_level
0   M         O-          2
1   M         O-          3
2   F         O+          3
3   F         AB          1
4   F         B+          0


Now we need to convert our nominal data to numbers. Remember that we can’t replace these features by a number since this would imply the features have an order, which is untrue in case of sex or blood type.

The most popular way to encode nominal features is one-hot-encoding. Essentially, each categorical feature with n categories is transformed into n binary features.

In [10]:
from sklearn.preprocessing import OneHotEncoder

# create a OneHotEncoder with integer output and sparse_output=False so that
# onehot.fit_transform() returns a regular array instead of a sparse matrix
# (scikit-learn versions before 1.2 call this parameter sparse; np.int was
# removed from NumPy, so the built-in int is used)
onehot = OneHotEncoder(dtype=int, sparse_output=False)

# fit & transform our two nominal categoricals.
nominals = pd.DataFrame(onehot.fit_transform(X[['sex', 'blood_type']]),
                        columns=['F', 'M', 'AB', 'B+','O+', 'O-'])
print(nominals) # check the result

# add the edu_level column prepared earlier
nominals['edu_level'] = X['edu_level']
nominals

   F  M  AB  B+  O+  O-
0  0  1   0   0   0   1
1  0  1   0   0   0   1
2  1  0   0   0   1   0
3  1  0   1   0   0   0
4  1  0   0   1   0   0


   F  M  AB  B+  O+  O-  edu_level
0  0  1   0   0   0   1          2
1  0  1   0   0   0   1          3
2  1  0   0   0   1   0          3
3  1  0   1   0   0   0          1
4  1  0   0   1   0   0          0


Since there were no missing values in our nominal data, it is worth saying a word on how to handle missing values with the OneHotEncoder. A missing value can easily be handled as an extra category. Note that to do this, you need to replace the missing value with an arbitrary value first (e.g. ‘missing’). If, on the other hand, you want to ignore the missing value and encode the instance as all zeros (False), you can set the handle_unknown parameter of the OneHotEncoder to ‘ignore’, which encodes any category that was not seen during fitting as all zeros.
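
Both options can be sketched in a few lines. The frames `train` and `new` are hypothetical; the placeholder string 'missing' becomes a regular category, while a value never seen during fitting ('B+' here) is encoded as all zeros.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Training data where the missing value was already replaced by 'missing'
train = pd.DataFrame({"blood_type": ["O-", "O+", "AB", "missing"]})
# New data containing a category that was not present during fit
new = pd.DataFrame({"blood_type": ["O-", "B+"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

encoded = enc.transform(new).toarray()   # the default output is sparse
categories = list(enc.categories_[0])

print(categories)
print(encoded)
```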

# Numerical features

Just like categorical data can be encoded, numerical features can be ‘decoded’ into categorical features. The two most common ways to do this are discretization and binarization.

## Discretization

Discretization, also known as quantization or binning, divides a continuous feature into a pre-specified number of categories (bins), and thus makes the data discrete.

One of the main goals of discretization is to significantly reduce the number of distinct values of a continuous attribute, which is why this transformation can improve the performance of tree-based models.

sklearn.preprocessing provides a KBinsDiscretizer class that takes care of this. Its important strategy parameter can be set to one of three values:
    
    1) uniform, where all bins in each feature have identical widths.
    2) quantile (default), where all bins in each feature have the same number of points.
    3) kmeans, where all values in each bin have the same nearest center of a 1D k-means cluster.

It is important to pick the strategy parameter with care. The uniform strategy, for example, is very sensitive to outliers and can leave you with bins that contain just a few data points, i.e. the outliers.
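
The outlier sensitivity of the uniform strategy shows up even on six made-up values. In this sketch `n_bins=3` and `encode='ordinal'` are arbitrary choices that keep the output readable; newer scikit-learn releases may emit a warning about default settings of the quantile strategy.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Five ordinary values and one outlier
values = np.array([1, 2, 3, 4, 5, 100], dtype=float).reshape(-1, 1)

uniform = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
quantile = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")

uniform_bins = uniform.fit_transform(values).ravel()
quantile_bins = quantile.fit_transform(values).ravel()

# uniform: equal-width bins, so the outlier sits alone in the last bin
# quantile: bins hold roughly the same number of points
print(uniform_bins)
print(quantile_bins)
```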

See the documentation for the rest.


## Binarization

Feature binarization is the process of thresholding numerical features to get boolean values. In other words, it assigns a boolean value (True or False) to each sample based on a threshold. Note that binarization is an extreme form of discretization with only two bins.

Binarization is useful as a feature engineering technique for creating new features that indicate something meaningful.

It is imported like this: from sklearn.preprocessing import Binarizer
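
A short usage sketch on made-up ages follows. Values strictly greater than the threshold map to 1 and everything else, including the threshold itself, maps to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical ages, one feature per row
ages = np.array([[15.0], [18.0], [30.0]])

# 1.0 where age > 18, otherwise 0.0
over_18 = Binarizer(threshold=18.0).fit_transform(ages)
print(over_18.ravel())
```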


# Custom transformers

If you want to apply a custom data transformation, you can use FunctionTransformer. Alternatively, just apply a lambda function via df['column'].apply(lambda x: func(x)).
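
Both routes are sketched below on a hypothetical `income` column; the log transform and the division by 1000 are arbitrary examples of a custom transformation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({"income": [0.0, 9.0, 99.0]})

# Route 1: wrap any function in a transformer, which also fits in a Pipeline
log_tf = FunctionTransformer(np.log1p)
logged = log_tf.fit_transform(df[["income"]])

# Route 2: apply a lambda directly to a pandas column
df["income_k"] = df["income"].apply(lambda x: x / 1000)

print(logged)
print(df)
```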


# Feature scaling

The next logical step in data preparation is feature scaling. There is a catch, however: if you split your dataset into a training set and a test set, you must split first and only then scale. Otherwise both sets end up scaled around a mean that is not their own mean but the mean of the whole dataset, so after splitting an already scaled dataset neither the training nor the test set is properly scaled any more. Here are the main kinds of feature scaling:

## Standardization

Standardization is a transformation that centers the data by removing the mean value of each feature and then scales it by dividing (non-constant) features by their standard deviation. After standardizing the data, the mean will be zero and the standard deviation one.

sklearn has a whole bunch of scalers: StandardScaler, MinMaxScaler, MaxAbsScaler and RobustScaler.

They are all applied in roughly the same way:

In [12]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
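# Y still refers to the imputed numeric frame saved in In [2];
# X was reassigned to the categorical example in the meantime
scaler.fit_transform(Y.f3.values.reshape(-1, 1))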

array([[-0.28562322],
       [ 1.53522482],
       [-1.24960159],
       [ 0.        ]])
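
Tying this back to the split-first caveat at the start of this section, one common pattern is to fit the scaler on the training set only and reuse its statistics for the test set. The random data, the 80/20 split and the seeds below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 2 features
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

# Split first ...
train, test = train_test_split(features, test_size=0.2, random_state=0)

# ... then scale: the mean and standard deviation come from train only
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)   # reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```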

## Normalization

Normalization is the process of scaling individual samples to have unit norm. In basic terms you need to normalize data when the algorithm predicts based on the weighted relationships formed between data points. Scaling inputs to unit norms is a common operation for text classification or clustering.

One of the key differences between scaling (e.g. standardizing) and normalizing, is that normalizing is a row-wise operation, while scaling is a column-wise operation.

Although there are many other ways to normalize data, sklearn provides three norms (the value to which the individual values are compared): l1, l2 and max. When creating a new instance of the Normalizer class you can specify the desired norm under the norm parameter.

It is applied in the same way as the standardizer.
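
As a rough illustration of the three norms, the sketch below normalizes a single made-up row (3, 4). Each row is divided by its own norm, which is what makes this a row-wise operation.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

row = np.array([[3.0, 4.0]])

l1 = Normalizer(norm="l1").fit_transform(row)    # divide by |3| + |4| = 7
l2 = Normalizer(norm="l2").fit_transform(row)    # divide by sqrt(9 + 16) = 5
mx = Normalizer(norm="max").fit_transform(row)   # divide by max(|3|, |4|) = 4

print(l1, l2, mx)
```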

For details on applying the standardizers and normalizers, see the documentation.