# 05.04 - Feature Engineering

Unfortunately for us, real-world data rarely comes in a tidy, neat format. Often we need to manipulate the initial data to turn it into numbers that we can use to build our feature matrix.

In this section, we will cover features for representing:

* **Categorical data**
* **Text**
* **Images**

Additionally, we will discuss:

* **Derived features** for increasing model complexity
* **Missing data** handling and imputation

This whole process is also known as vectorization, as it involves converting arbitrary data into well-behaved vectors.

### Categorical Features

One common type of non-numerical data is **categorical data**. For example, the name of the neighborhood in a housing price dataset:

In [1]:
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

Encoding this data with a straightforward numerical mapping (say, one integer per neighborhood) does not make much sense: Scikit-learn models would interpret the codes as quantities, implying an ordering and arithmetic between neighborhoods that do not exist.
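To make the problem concrete, here is a minimal sketch of such a naive mapping. The integer codes and the `neighborhood_codes` name are arbitrary assumptions chosen only for illustration:

```python
# Hypothetical integer codes: the numbers are arbitrary labels, not measurements.
neighborhood_codes = {'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}

neighborhoods = ['Queen Anne', 'Fremont', 'Wallingford', 'Fremont']
encoded = [neighborhood_codes[name] for name in neighborhoods]
print(encoded)  # [1, 2, 3, 2]

# A numeric model reads these codes as magnitudes, so it implicitly
# accepts statements that mean nothing for neighborhoods:
print(neighborhood_codes['Queen Anne'] < neighborhood_codes['Fremont'])  # True
print(neighborhood_codes['Wallingford'] - neighborhood_codes['Fremont']
      == neighborhood_codes['Queen Anne'])  # True
```

Nothing in the data justifies "Queen Anne < Fremont" or "Wallingford - Fremont = Queen Anne", yet a model fed these codes would treat both as meaningful.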

In this case, one proven technique is **one-hot encoding**, which creates one extra column per category and marks the presence or absence of that category with a value of 1 or 0, respectively.
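The idea can be sketched by hand in plain Python before reaching for a library. This is only an illustration of the technique, assuming one column per distinct category in sorted order; the variable names are hypothetical:

```python
neighborhoods = ['Queen Anne', 'Fremont', 'Wallingford', 'Fremont']

# One column per distinct category, in sorted order.
categories = sorted(set(neighborhoods))

# Each row gets a 1 in the column of its own category and 0 elsewhere.
one_hot = [[int(name == category) for category in categories]
           for name in neighborhoods]

print(categories)  # ['Fremont', 'Queen Anne', 'Wallingford']
for row in one_hot:
    print(row)
# [0, 1, 0]
# [1, 0, 0]
# [0, 0, 1]
# [1, 0, 0]
```

Each row contains exactly one 1, so no ordering between categories is implied.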

If your data comes as a list of dicts, Scikit-learn's <code>DictVectorizer</code> will do this for you:

In [2]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)

array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]], dtype=int32)

In [3]:
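# inspect the meaning of each column
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
vec.get_feature_names_out()

array(['neighborhood=Fremont', 'neighborhood=Queen Anne',
       'neighborhood=Wallingford', 'price', 'rooms'], dtype=object)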

However, this approach can greatly increase the size of your dataset, since every distinct category adds a column. Because the encoded columns contain mostly zeros, a sparse output can be an efficient solution:

In [4]:
vec = DictVectorizer(sparse=True, dtype=int)
vec.fit_transform(data)

<4x5 sparse matrix of type '<class 'numpy.int32'>'
	with 12 stored elements in Compressed Sparse Row format>
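As a rough sketch of how the sparse result can be inspected, the following self-contained cell rebuilds the same data and checks the matrix. It assumes a scikit-learn installation in which `DictVectorizer` returns a SciPy sparse matrix, and the variable names `X` and `dense` are introduced here only for illustration:

```python
from sklearn.feature_extraction import DictVectorizer

data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

vec = DictVectorizer(sparse=True, dtype=int)
X = vec.fit_transform(data)

print(X.shape)  # (4, 5)
print(X.nnz)    # 12 stored values: price, rooms, and one neighborhood flag per row

# Convert to a dense NumPy array only when the data is small enough to fit in memory.
dense = X.toarray()
print(dense[0].tolist())  # [0, 1, 0, 850000, 4]
```

Only the non-zero entries are stored, so the saving grows with the number of categories: each additional neighborhood adds a column but no extra stored values per row.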