You have a dictionary and want to convert it into a feature matrix.
Use **DictVectorizer**.

In [1]:
from sklearn.feature_extraction import DictVectorizer

In [2]:
# Create a list of dictionaries, one per observation
data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yellow": 2},
             {"Red": 2, "Yellow": 2}]

In [6]:
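# Create a vectorizer that returns a dense NumPy array
dictVectorizer = DictVectorizer(sparse=False)
# Learn the feature names and build the feature matrix
features = dictVectorizer.fit_transform(data_dict)
features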

array([[4., 2., 0.],
       [3., 4., 0.],
       [0., 1., 2.],
       [0., 2., 2.]])

By default, DictVectorizer outputs a sparse matrix that stores only the
nonzero elements. This can be very helpful when we have massive matrices
(often encountered in natural language processing) and want to minimize the
memory requirements. We can force DictVectorizer to output a dense matrix by
setting sparse=False. Keys that are missing from a dictionary become 0 in
that row.
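
As a minimal sketch of the default behavior, the cell below fits a second vectorizer without sparse=False and inspects the result. The names sparse_vectorizer, sparse_features, stored, total and dense_features are illustrative and not part of the original notebook.

```python
from sklearn.feature_extraction import DictVectorizer

data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yellow": 2},
             {"Red": 2, "Yellow": 2}]

# With the default sparse=True, fit_transform returns a SciPy sparse matrix
sparse_vectorizer = DictVectorizer()
sparse_features = sparse_vectorizer.fit_transform(data_dict)

# Number of explicitly stored values versus the total number of cells
stored = sparse_features.nnz
total = sparse_features.shape[0] * sparse_features.shape[1]

# Convert to a dense NumPy array only when a dense view is really needed
dense_features = sparse_features.toarray()
```

Calling toarray() on a very large matrix gives up the memory savings, so it is usually reserved for small examples like this one.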

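We can get the names of each generated feature using the
get_feature_names_out method (scikit-learn 1.0 and later; the older
get_feature_names method was removed in version 1.2).

In [8]:
featurenames = dictVectorizer.get_feature_names_out()
featurenames

array(['Blue', 'Red', 'Yellow'], dtype=object)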

While not necessary, for the sake of illustration we can create a pandas
DataFrame to view the output more easily.

In [9]:
import pandas as pd
pd.DataFrame(features, columns=featurenames)

   Blue  Red  Yellow
0   4.0  2.0     0.0
1   3.0  4.0     0.0
2   0.0  1.0     2.0
3   0.0  2.0     2.0


A dictionary is a popular data structure used by many programming languages;
however, machine learning algorithms expect the data to be in the form of a
matrix. We can perform this conversion using scikit-learn’s DictVectorizer.

This is a common situation when working with natural language processing. For
example, we might have a collection of documents, and for each document we
have a dictionary containing the number of times every word appears in that
document. Using DictVectorizer, we can easily create a feature matrix in which
each row is a document and each feature is the number of times a word appears
in that document.
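
If the word counts are not already available, one way to build them is with collections.Counter. This is only a sketch: the two documents below are made up for illustration, and splitting on whitespace stands in for real tokenization.

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy documents, not taken from the example in this notebook
documents = ["red blue blue red blue blue",
             "red yellow yellow"]

# Counter is a dict subclass, so DictVectorizer accepts it directly
word_counts = [Counter(doc.split()) for doc in documents]

vectorizer = DictVectorizer(sparse=False)
counts_matrix = vectorizer.fit_transform(word_counts)

# Columns follow the sorted feature names: blue, red, yellow
print(vectorizer.feature_names_)
print(counts_matrix)
```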



In [10]:
# Create word counts dictionaries for four documents
doc_1_word_count = {"Red": 2, "Blue": 4}
doc_2_word_count = {"Red": 4, "Blue": 3}
doc_3_word_count = {"Red": 1, "Yellow": 2}
doc_4_word_count = {"Red": 2, "Yellow": 2}

In [11]:
# Create list
doc_word_counts = [doc_1_word_count,
                   doc_2_word_count,
                   doc_3_word_count,
                   doc_4_word_count]

In [13]:
# Convert list of word count dictionaries into feature matrix
dictVectorizer.fit_transform(doc_word_counts)


array([[4., 2., 0.],
       [3., 4., 0.],
       [0., 1., 2.],
       [0., 2., 2.]])

In [14]:
dictVectorizer.feature_names_

['Blue', 'Red', 'Yellow']

In [17]:
pd.DataFrame(dictVectorizer.fit_transform(doc_word_counts), columns=dictVectorizer.feature_names_)

   Blue  Red  Yellow
0   4.0  2.0     0.0
1   3.0  4.0     0.0
2   0.0  1.0     2.0
3   0.0  2.0     2.0


In our toy example there are only three unique words (Red, Yellow, Blue), so
there are only three features in our matrix; however, you can imagine that if each
document were actually a book in a university library, our feature matrix would be
very large (and then we would want to set sparse to True).
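
To get a rough feel for the scale argument, the sketch below builds a synthetic corpus in which every document uses its own small set of words. The corpus and the variable names are hypothetical; real text would have overlapping vocabularies.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical corpus: 200 documents, each with 5 words no other document uses
docs = [{f"word_{i}_{j}": 1 for j in range(5)} for i in range(200)]

# sparse=True is the default; it is spelled out here for emphasis
sparse_matrix = DictVectorizer(sparse=True).fit_transform(docs)

n_rows, n_cols = sparse_matrix.shape   # one row per document, one column per word
stored_values = sparse_matrix.nnz      # values actually kept in memory
dense_cells = n_rows * n_cols          # cells a dense array would allocate

print(n_rows, n_cols, stored_values, dense_cells)
```

In this artificial case the sparse matrix stores 1,000 values, whereas a dense array would allocate 200,000 cells, almost all of them zero.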