<h1 align="center"><b>AI Lab: Computer Vision and NLP</b></h1>
<h3 align="center">Lesson 14: Categorical Features</h3>

---

Let's create some fake categorical data, such as a set of houses, each with its `price`, its number of `rooms` and the `neighbourhood` it is in:

In [1]:
data = [
    {
        "price": 834648,
        "rooms": 4,
        "neighbourhood": "Prati",
    },
    {
        "price": 293487,
        "rooms": 3,
        "neighbourhood": "Tor Bella Monaca",
    },
    {
        "price": 598403,
        "rooms": 4,
        "neighbourhood": "Cinecittà",
    }
]

How could we represent the `neighbourhood` category as numbers? A simple first attempt is to map each name to an integer:

In [2]:
neighbourhoods = {
    "Prati": 1,
    "Tor Bella Monaca": 2,
    "Cinecittà": 3
}

This approach would be fine for a lookup table in an Excel file, but it is a poor fit for Machine Learning, because the model might infer an ordering that does not exist: for instance, that `Tor Bella Monaca` is somehow "larger" than `Prati`, or that `Cinecittà` is worth three times `Prati`. A better way to encode the neighbourhoods is **one-hot encoding**: we add one column for each possible value and set it to 1 only in the rows that have that value:

In [4]:
data_one_hot = [
    {
        "price": 834648,
        "rooms": 4,
        "Prati": 1,
        "Tor Bella Monaca": 0,
        "Cinecittà": 0
    },
    {
        "price": 293487,
        "rooms": 3,
        "Prati": 0,
        "Tor Bella Monaca": 1,
        "Cinecittà": 0
    },
    {
        "price": 598403,
        "rooms": 4,
        "Prati": 0,
        "Tor Bella Monaca": 0,
        "Cinecittà": 1
    }
]
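
The same table does not have to be typed by hand. As a minimal sketch in plain Python, assuming the `data` list from the first cell, the columns can be derived from the values that actually occur; the names `categories`, `one_hot` and `data_one_hot_auto` are introduced here for illustration only:

```python
data = [
    {"price": 834648, "rooms": 4, "neighbourhood": "Prati"},
    {"price": 293487, "rooms": 3, "neighbourhood": "Tor Bella Monaca"},
    {"price": 598403, "rooms": 4, "neighbourhood": "Cinecittà"},
]

# Collect the distinct neighbourhoods; sorting gives a stable column order
categories = sorted({house["neighbourhood"] for house in data})


def one_hot(house, categories):
    """Return a copy of `house` with `neighbourhood` replaced by 0/1 columns."""
    encoded = {key: value for key, value in house.items() if key != "neighbourhood"}
    for category in categories:
        # 1 only for the neighbourhood this house belongs to, 0 otherwise
        encoded[category] = int(house["neighbourhood"] == category)
    return encoded


data_one_hot_auto = [one_hot(house, categories) for house in data]
print(data_one_hot_auto[0])
```

Each row ends up with exactly one neighbourhood column set to 1, which is what makes the encoding "one-hot".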

Notice how the dataset grows wider as the number of possible values of the category increases. Rather than writing the columns by hand, we can use the `DictVectorizer` class from `scikit-learn` to convert the data into a one-hot representation:

In [5]:
from sklearn.feature_extraction import DictVectorizer

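We can now use the class, whose constructor accepts (among others) the following arguments:
> ```python
> DictVectorizer(sparse=bool, dtype=datatype)
> ```
> where:
>  - `sparse` controls whether the output is a `scipy.sparse` matrix (`True`, the default) or a dense NumPy array (`False`);
>  - `dtype` is the data type of the values in the output.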

In [7]:
vec = DictVectorizer(sparse=False, dtype=int)
result = vec.fit_transform(data)

# Let's print the result
print(result)

[[     0      1      0 834648      4]
 [     0      0      1 293487      3]
 [     1      0      0 598403      4]]
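
With many possible values most one-hot entries are 0, which is where `sparse=True` helps. The following is a small sketch, assuming the same `data` as above; the names `sparse_vec`, `sparse_result` and `dense_again` are introduced here for illustration:

```python
from sklearn.feature_extraction import DictVectorizer

data = [
    {"price": 834648, "rooms": 4, "neighbourhood": "Prati"},
    {"price": 293487, "rooms": 3, "neighbourhood": "Tor Bella Monaca"},
    {"price": 598403, "rooms": 4, "neighbourhood": "Cinecittà"},
]

# sparse=True returns a scipy.sparse matrix instead of a dense NumPy array
sparse_vec = DictVectorizer(sparse=True, dtype=int)
sparse_result = sparse_vec.fit_transform(data)

# Same logical shape as the dense result: 3 houses, 5 columns
print(sparse_result.shape)

# Only non-zero entries are stored: one neighbourhood flag, price and rooms per row
print(sparse_result.nnz)

# Convert back to a dense array when needed (e.g. for printing)
dense_again = sparse_result.toarray()
print(dense_again)
```

On three rows the saving is negligible; the sparse form starts to matter when a category has hundreds or thousands of distinct values.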


The columns are sorted by feature name, so the result has the following form:

|Cinecittà|Prati|Tor Bella Monaca|Price|Rooms|
|---|---|---|---|---|
| ... | ... | ... | ... | ... |

We can also print the names of the features with the `feature_names_` attribute:

In [8]:
names = vec.feature_names_
print(names)

['neighbourhood=Cinecittà', 'neighbourhood=Prati', 'neighbourhood=Tor Bella Monaca', 'price', 'rooms']
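
These names can be paired with the columns of a row to make the encoding easier to read. The sketch below also encodes a new house with the already fitted vectorizer; the house itself (its price, rooms and the `Trastevere` neighbourhood) is a made-up example, and `first_row`, `new_house` and `encoded` are names introduced here for illustration:

```python
from sklearn.feature_extraction import DictVectorizer

data = [
    {"price": 834648, "rooms": 4, "neighbourhood": "Prati"},
    {"price": 293487, "rooms": 3, "neighbourhood": "Tor Bella Monaca"},
    {"price": 598403, "rooms": 4, "neighbourhood": "Cinecittà"},
]

vec = DictVectorizer(sparse=False, dtype=int)
result = vec.fit_transform(data)

# Pair every value of the first row with the name of its column
first_row = dict(zip(vec.feature_names_, result[0].tolist()))
print(first_row)

# Hypothetical new house in a neighbourhood that was not seen during fit
new_house = {"price": 450000, "rooms": 2, "neighbourhood": "Trastevere"}

# transform() reuses the columns learned by fit_transform();
# the unseen neighbourhood is ignored, so all three flags stay 0
encoded = vec.transform([new_house])
print(encoded)
```

Because values not seen during fitting are dropped silently, it is worth fitting the vectorizer on data that covers every category expected later.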
