We will cover:
 - Removing and imputing missing values from the dataset;
 - Getting categorical data into shape for machine learning algorithms;
 - Selecting relevant features for the model construction.

# Dealing with missing data
First, we have to identify the missing values (in this case, in tabular data):

In [30]:
import pandas as pd
from io import StringIO
csv_data = \
'''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,
'''
df = pd.read_csv(StringIO(csv_data))
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [31]:
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

## Eliminating training examples or features with missing values
Rows with missing values can easily be dropped this way:

In [32]:
df.dropna(axis=0)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


Similarly for columns:

In [33]:
df.dropna(axis=1)

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0


`dropna` supports a few other useful parameters:
- `df.dropna(how='all')` drops rows where all columns are NaN;
- `df.dropna(thresh=4)` drops rows that have fewer than 4 non-NaN values;
- `df.dropna(subset=['C'])` drops rows where NaN appears in specific columns (here `'C'`, for example).
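As a quick check of these three options, the sketch below rebuilds the same small CSV used earlier and counts how many rows survive each call. The variable names (`df_missing`, `all_nan_dropped`, `at_least_four`, `c_present`) are illustrative and not part of the original notebook.

```python
import pandas as pd
from io import StringIO

# Same toy data as in the cells above
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,
'''
df_missing = pd.read_csv(StringIO(csv_data))

# No row is entirely NaN, so nothing is dropped
all_nan_dropped = df_missing.dropna(how='all')

# Only the first row has 4 non-NaN values
at_least_four = df_missing.dropna(thresh=4)

# Only the second row has a NaN in column 'C'
c_present = df_missing.dropna(subset=['C'])

print(len(all_nan_dropped), len(at_least_four), len(c_present))
```

None of these calls modifies `df_missing`; each one returns a new DataFrame.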

## Imputing missing values
When a value is missing, we can estimate it from the other values in the dataset using an interpolation technique. The most common is _mean imputation_, where we simply replace the missing value with the mean of the entire feature column.
Scikit-learn offers several imputers; `SimpleImputer` covers this case:

In [34]:
from sklearn.impute import SimpleImputer
import numpy as np
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])

- `fit` learns the parameters (here, the column means) from the training data;
- `transform` uses those parameters to transform the data.

Other options for the `strategy` parameter are `median`, `most_frequent` and `constant`. With `most_frequent`, missing values are replaced by the most frequent value in the column, which is useful for categorical features.
A more convenient approach is the `fillna` method, for example `df.fillna(df.mean())`, where everything is done directly in pandas.
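A minimal sketch of the pandas route, assuming the same toy CSV as above: it fills each NaN with its column mean and checks that the result matches what `SimpleImputer(strategy='mean')` produces. The names `df_missing`, `filled` and `sk_filled` are illustrative.

```python
import numpy as np
import pandas as pd
from io import StringIO
from sklearn.impute import SimpleImputer

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,
'''
df_missing = pd.read_csv(StringIO(csv_data))

# df.mean() skips NaN by default, so each column mean is computed
# from the observed values only
filled = df_missing.fillna(df_missing.mean())

# Same imputation done with scikit-learn, for comparison
sk_filled = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(
    df_missing.values
)

print(filled)
print(np.allclose(filled.values, sk_filled))
```

One difference worth keeping in mind: `fillna` computes the means on whatever DataFrame it is given, whereas a fitted `SimpleImputer` can reuse the means learned on the training data when transforming new data.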

# Handling categorical data

In [35]:
df = pd.DataFrame([
    ['green','M',10.1,'class2'],
    ['red','L',13.5,'class1'],
    ['blue','XL',15.3,'class2'],
])
df.columns = ['color','size','price','classlabel']
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class2
1,red,L,13.5,class1
2,blue,XL,15.3,class2


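Machine learning algorithms need numeric input, so we map the ordinal `size` feature to integers that preserve its order (XL > L > M). The order cannot be inferred automatically, so we define the mapping by hand.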

In [36]:
size_mapping = {
    'XL': 3,
    'L': 2,
    'M': 1,
}
df['size'] = df['size'].map(size_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,class2
1,red,2,13.5,class1
2,blue,3,15.3,class2


In [37]:
# To invert the mapping procedure:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)

0     M
1     L
2    XL
Name: size, dtype: object

## Encoding class labels

In [38]:
# Find the unique labels in the column and enumerate them
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping

{'class1': 0, 'class2': 1}

In [39]:
df['classlabel'] = df['classlabel'].map(class_mapping)
inv_class_mapping = {v: k for k,v in class_mapping.items()}
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,1
1,red,2,13.5,0
2,blue,3,15.3,1
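The inverse dictionary `inv_class_mapping` is built above but not applied. The sketch below shows the full round trip on a standalone Series holding the same three labels; `labels`, `encoded` and `restored` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same class labels as in the DataFrame above
labels = pd.Series(['class2', 'class1', 'class2'], name='classlabel')

# Forward mapping: string label -> integer
class_mapping = {label: idx for idx, label in enumerate(np.unique(labels))}
encoded = labels.map(class_mapping)

# Inverse mapping: integer -> string label
inv_class_mapping = {v: k for k, v in class_mapping.items()}
restored = encoded.map(inv_class_mapping)

print(encoded.tolist())
print(restored.tolist())
```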


In [40]:
# A more convenient way of doing this with scikit-learn:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
# Note: classlabel was already mapped to integers above, so this call leaves the values unchanged;
# run it on the original string labels to see the string-to-integer encoding
y = class_le.fit_transform(df['classlabel'].values)

# Let's do the same for the color column
X = df[['color','size','price']].values
color_le = LabelEncoder()
X[:,0] = color_le.fit_transform(X[:,0])
X

array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

Using this version of X is problematic though: the model will assume that the colors are ordered, so that one color is "larger" or "smaller" than another, which is incorrect for a nominal feature.

To solve this problem we will use the *one-hot encoding* method. Simply put, we create a new dummy feature for each unique value in the nominal feature column:

In [41]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
X = df[['color','size','price']].values
c_transf = ColumnTransformer([
    ('onehot',OneHotEncoder(),[0]),
    ('nothing','passthrough',[1,2]), # Here we are saying to not do any transformation on the other two columns
])
c_transf.fit_transform(X).astype(float)

array([[ 0. ,  1. ,  0. ,  1. , 10.1],
       [ 0. ,  0. ,  1. ,  2. , 13.5],
       [ 1. ,  0. ,  0. ,  3. , 15.3]])

In [42]:
# A more convenient way of doing this in pandas is simply to call the get_dummies method
pd.get_dummies(df[['price','color','size']]).astype(float)

Unnamed: 0,price,size,color_blue,color_green,color_red
0,10.1,1.0,0.0,1.0,0.0
1,13.5,2.0,0.0,0.0,1.0
2,15.3,3.0,1.0,0.0,0.0


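> Note that one-hot encoding creates highly correlated features (multicollinearity), which can be a problem for certain machine learning algorithms, for example those that require matrix inversion.
> To reduce the correlation, we can drop one of the dummy columns. No information is lost, since the dropped column can be derived from the remaining ones (here, a sample that is neither green nor red must be blue).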

In [43]:
pd.get_dummies(df[['price','color','size']], drop_first=True).astype(float)

Unnamed: 0,price,size,color_green,color_red
0,10.1,1.0,1.0,0.0
1,13.5,2.0,0.0,1.0
2,15.3,3.0,0.0,0.0
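The same redundant column can be dropped on the scikit-learn side by passing `drop='first'` to `OneHotEncoder`. The sketch below reuses the `ColumnTransformer` setup from earlier; the array `X` is rebuilt by hand here so that the example is self-contained.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# color, size (already mapped to integers), price
X = np.array([
    ['green', 1, 10.1],
    ['red', 2, 13.5],
    ['blue', 3, 15.3],
], dtype=object)

c_transf = ColumnTransformer([
    # drop='first' removes the first category (alphabetically, 'blue')
    ('onehot', OneHotEncoder(drop='first'), [0]),
    ('nothing', 'passthrough', [1, 2]),
])
X_encoded = c_transf.fit_transform(X).astype(float)
print(X_encoded)
```

The two remaining dummy columns correspond to green and red, so the blue sample is encoded as two zeros, matching the `get_dummies(..., drop_first=True)` result.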


There are many more encoding schemes for nominal data, including some that are well suited to nominal features with high cardinality, for example:
- Binary encoding;
- Count or Frequency encoding.
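Both schemes can be sketched in plain pandas. The `colors` Series below is made-up illustrative data (not from the dataset above), and the binary encoding shown is one simple variant: each category gets an integer code, and the bits of that code become the feature columns. Dedicated libraries may order or name the columns differently.

```python
import pandas as pd

# Hypothetical toy column, for illustration only
colors = pd.Series(['green', 'red', 'blue', 'green', 'green', 'red'], name='color')

# Count encoding: replace each category by its number of occurrences
count_encoded = colors.map(colors.value_counts())

# Frequency encoding: same idea, with relative frequencies
freq_encoded = colors.map(colors.value_counts(normalize=True))

# Binary encoding: integer code per category (alphabetical order),
# then one column per bit of that code
codes = colors.astype('category').cat.codes.astype(int).to_numpy()
n_bits = max(int(codes.max()).bit_length(), 1)
binary_encoded = pd.DataFrame(
    {f'color_bit{b}': (codes >> b) & 1 for b in range(n_bits)}
)

print(count_encoded.tolist())
print(binary_encoded)
```

With k categories, binary encoding needs only about log2(k) columns instead of the k columns of one-hot encoding, which is why it is attractive for high-cardinality features.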

In [44]:
# We could even encode ordinal features as binary threshold indicators if we wanted to:
df['x > M'] = df['size'].apply(
    lambda x: 1 if x in [2, 3] else 0
)
df['x > L'] = df['size'].apply(
    lambda x: 1 if x == 3 else 0
)
del df['size']
df

Unnamed: 0,color,price,classlabel,x > M,x > L
0,green,10.1,1,0,0
1,red,13.5,0,1,0
2,blue,15.3,1,1,1


# Partitioning a dataset