In [1]:
%matplotlib inline

## [Feature Extraction - Dicts](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer)
- Converts feature arrays (represented as lists of Python dicts) into NumPy arrays or SciPy sparse matrices.
- Uses **one-hot** (one-of-K) encoding for categorical (string-valued) features; numeric values pass through unchanged.

In [2]:
measurements = [
    {'city': 'Dubai',         'temperature': 33.},
    {'city': 'London',        'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
vec.fit_transform(measurements).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vec.get_feature_names_out().tolist()

['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
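
The cell above displays only the feature names, so the encoded matrix itself is never shown. The following is a minimal sketch, not part of the original notebook, that prints each column next to its feature name and round-trips the result. It assumes `sparse=False` is acceptable here (the default returns a SciPy sparse matrix); the variable names `X` and `restored` are illustrative.

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'city': 'Dubai',         'temperature': 33.},
    {'city': 'London',        'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

# sparse=False makes fit_transform return a dense NumPy array directly
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(measurements)

# vocabulary_ maps each feature name to its column index
for name, col in sorted(vec.vocabulary_.items(), key=lambda kv: kv[1]):
    print(col, name, X[:, col])

# inverse_transform rebuilds one dict per row, omitting zero-valued entries
restored = vec.inverse_transform(X)
print(restored[0])
```

Each city becomes its own 0/1 column, while `temperature` keeps its numeric value in a single column.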

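- `DictVectorizer` is also useful in NLP for encoding features extracted from a window of words around a specific word.
- Below: the words and part-of-speech (POS) tags surrounding a target word are vectorized into a sparse 2D matrix that can be fed to a classifier.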

In [3]:
pos_window = [
    {
        'word-2': 'the',
        'pos-2': 'DT',
        'word-1': 'cat',
        'pos-1': 'NN',
        'word+1': 'on',
        'pos+1': 'PP',
    },
]
vec = DictVectorizer()
pos_vectorized = vec.fit_transform(pos_window)
print(pos_vectorized)

pos_vectorized.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vec.get_feature_names_out().tolist()

  (0, 0)	1.0
  (0, 1)	1.0
  (0, 2)	1.0
  (0, 3)	1.0
  (0, 4)	1.0
  (0, 5)	1.0


['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']