<h1 align="center"><b>AI Lab: Computer Vision and NLP</b></h1>
<h3 align="center">Lessons 14 - 15: Categorical Features</h3>

---

Let's create some fake categorical data, such as a set of houses described by their `price`, their number of `rooms` and the `neighbourhood` they are in:

In [1]:
data = [
    {
        "price": 834648,
        "rooms": 4,
        "neighbourhood": "Prati",
    },
    {
        "price": 293487,
        "rooms": 3,
        "neighbourhood": "Tor Bella Monaca",
    },
    {
        "price": 598403,
        "rooms": 4,
        "neighbourhood": "Cinecittà",
    }
]

How could we represent the `neighbourhood` category as numbers? A first, simple attempt is to map each name to an integer:

In [2]:
neighbourhoods = {
    "Prati": 1,
    "Tor Bella Monaca": 2,
    "Cinecittà": 3
}

This approach would be fine in an Excel file, but it is not suitable for Machine Learning, because the model might infer an ordering or a magnitude that does not exist: for instance, that `Tor Bella Monaca` (2) is "larger" than `Prati` (1), or that `Cinecittà` is worth three times `Prati`. A better way to encode the neighbourhoods is **one-hot encoding**: we add a column for each possible value and set it to 1 only in the rows that belong to that value:

In [3]:
data_one_hot = [
    {
        "price": 834648,
        "rooms": 4,
        "Prati": 1,
        "Tor Bella Monaca": 0,
        "Cinecittà": 0
    },
    {
        "price": 293487,
        "rooms": 3,
        "Prati": 0,
        "Tor Bella Monaca": 1,
        "Cinecittà": 0
    },
    {
        "price": 598403,
        "rooms": 4,
        "Prati": 0,
        "Tor Bella Monaca": 0,
        "Cinecittà": 1
    }
]
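
The hand-written encoding above can also be generated programmatically. The following is a minimal sketch in plain Python, not a library routine: the names `houses`, `categories` and `one_hot_encode` are introduced here only for illustration, and the columns are sorted alphabetically as an arbitrary choice.

```python
# Same three houses as in the first cell, re-declared so the snippet runs on its own
houses = [
    {"price": 834648, "rooms": 4, "neighbourhood": "Prati"},
    {"price": 293487, "rooms": 3, "neighbourhood": "Tor Bella Monaca"},
    {"price": 598403, "rooms": 4, "neighbourhood": "Cinecittà"},
]

# Collect every distinct neighbourhood; sorting gives a stable column order
categories = sorted({house["neighbourhood"] for house in houses})


def one_hot_encode(house, categories):
    """Return a copy of `house` where `neighbourhood` is replaced by 0/1 columns."""
    encoded = {key: value for key, value in house.items() if key != "neighbourhood"}
    for category in categories:
        encoded[category] = 1 if house["neighbourhood"] == category else 0
    return encoded


encoded_houses = [one_hot_encode(house, categories) for house in houses]

for row in encoded_houses:
    print(row)
```

Each encoded row contains exactly one `1` among the neighbourhood columns, which is where the name "one-hot" comes from.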

Notice how the dataset grows wider as the number of possible values of the category increases, since each value needs its own column. Rather than writing the columns by hand, we can use the `DictVectorizer` class from `scikit-learn` to convert the data into a one-hot representation:

In [4]:
from sklearn.feature_extraction import DictVectorizer

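We can now use the class, whose constructor has the following syntax:
> ```python
> DictVectorizer(sparse=bool, dtype=datatype)
> ```
> where:
>  - `sparse` controls whether the result is returned as a SciPy sparse matrix (`True`, the default), which stores only the non-zero entries, or as a dense NumPy array (`False`);
>  - `dtype` is the datatype of the values in the result.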

In [5]:
vec = DictVectorizer(sparse=False, dtype=int)
result = vec.fit_transform(data)

# Let's print the result
print(result)

[[     0      1      0 834648      4]
 [     0      0      1 293487      3]
 [     1      0      0 598403      4]]
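
As a side note on the `sparse` parameter, here is a minimal sketch of what changes when it is set to `True`. The list `houses` simply repeats the three houses defined at the start so that the snippet is self-contained, and the variable names are chosen here for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Same three houses as before, re-declared so the snippet runs on its own
houses = [
    {"price": 834648, "rooms": 4, "neighbourhood": "Prati"},
    {"price": 293487, "rooms": 3, "neighbourhood": "Tor Bella Monaca"},
    {"price": 598403, "rooms": 4, "neighbourhood": "Cinecittà"},
]

sparse_vec = DictVectorizer(sparse=True, dtype=int)
sparse_result = sparse_vec.fit_transform(houses)

# The sparse matrix stores only the non-zero entries
print(type(sparse_result))
print("shape:", sparse_result.shape)
print("stored non-zero entries:", sparse_result.nnz)

# Convert back to a dense array when a regular table is needed
dense_result = sparse_result.toarray()
print(dense_result)
```

With only three neighbourhoods the saving is negligible, but with hundreds of categories most one-hot columns are zero and the sparse form becomes much more compact.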


The columns of the printed result are sorted by feature name, so they appear in the following order:

|Cinecittà|Prati|Tor Bella Monaca|Price|Rooms|
|---|---|---|---|---|
| ... | ... | ... | ... | ... |

We can also print the names of the features with the `feature_names_` attribute:

In [9]:
names = vec.feature_names_
print(names)

['neighbourhood=Cinecittà', 'neighbourhood=Prati', 'neighbourhood=Tor Bella Monaca', 'price', 'rooms']
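
The feature names make it easier to read a single encoded row. The sketch below pairs each name with its value and then encodes a hypothetical new house (the price, the room count and the neighbourhood `Trastevere` are made up for illustration). It relies on the documented behaviour that `DictVectorizer.transform` silently ignores features it did not see during fitting.

```python
from sklearn.feature_extraction import DictVectorizer

# Same three houses as before, re-declared so the snippet runs on its own
houses = [
    {"price": 834648, "rooms": 4, "neighbourhood": "Prati"},
    {"price": 293487, "rooms": 3, "neighbourhood": "Tor Bella Monaca"},
    {"price": 598403, "rooms": 4, "neighbourhood": "Cinecittà"},
]

house_vec = DictVectorizer(sparse=False, dtype=int)
encoded = house_vec.fit_transform(houses)

# Pair every column name with the corresponding value of the first row
first_row = dict(zip(house_vec.feature_names_, encoded[0].tolist()))
print(first_row)

# Hypothetical new house in a neighbourhood that was not present during fitting:
# all the known neighbourhood columns stay at 0
new_house = [{"price": 450000, "rooms": 2, "neighbourhood": "Trastevere"}]
new_encoded = house_vec.transform(new_house)
print(new_encoded)
```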


Let's now move from categorical data to text and create a small sample dataset of sentences. We'll also use the `pandas` package, which is very useful when we have to deal with tables:

In [14]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

data = [
    "trying to catch",
    "catch the devil",
    "the devil is trying"
]

vec = CountVectorizer()

Now we can transform our data into a matrix of word counts, and then turn the result into a `pandas` table:

In [15]:
X = vec.fit_transform(data)
tdata = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

print(tdata)

   catch  devil  is  the  to  trying
0      1      0   0    0   1       1
1      1      1   0    1   0       0
2      0      1   1    1   0       1


The table can be read as follows: each row is a sentence, each column is a word, and each cell counts how many times that word occurs in that sentence. Summing a column gives the number of occurrences in the whole dataset: `catch` appears twice, `is` only once, and so on. Intuitively, the fewer sentences a word appears in, the more **important** (distinctive) it is.
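
To make the counting explicit, the same table can be rebuilt in plain Python. This is only a sketch: it assumes that lowercasing and splitting on whitespace is enough to tokenize these three sentences, which is simpler than the tokenization `CountVectorizer` applies by default. The names `documents`, `tokenized`, `vocabulary`, `counts` and `totals` are introduced here for illustration.

```python
from collections import Counter

documents = [
    "trying to catch",
    "catch the devil",
    "the devil is trying",
]

# Assumption: lowercase + whitespace split is a good enough tokenizer here
tokenized = [doc.lower().split() for doc in documents]

# One column per distinct word, in alphabetical order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per sentence: how many times each vocabulary word occurs in it
counts = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

# Column sums: total occurrences of each word in the whole dataset
totals = {word: sum(row[i] for row in counts) for i, word in enumerate(vocabulary)}

print(vocabulary)
for row in counts:
    print(row)
print(totals)
```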

There is another way to represent the data, called `TF-IDF` (**T**erm **F**requency - **I**nverse **D**ocument **F**requency), which weights each count by how rare the word is across the sentences. We can repeat the same steps with the new vectorizer:

In [17]:
vec = TfidfVectorizer()
X = vec.fit_transform(data)
tdata = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

print(tdata)

      catch     devil        is       the        to    trying
0  0.517856  0.000000  0.000000  0.000000  0.680919  0.517856
1  0.577350  0.577350  0.000000  0.577350  0.000000  0.000000
2  0.000000  0.459854  0.604652  0.459854  0.000000  0.459854


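The values are now weights rather than raw counts: a word that appears in **fewer sentences** is considered **more important**. In the first row, for example, `to` (found in one sentence only) gets a higher weight than `catch` or `trying` (each found in two sentences).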
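
To see where these weights come from, the sketch below recomputes them by hand. It assumes the default settings of `TfidfVectorizer`, namely a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It also assumes that a lowercase whitespace split is a sufficient tokenizer for these sentences. All variable and function names are introduced here for illustration.

```python
import math
from collections import Counter

documents = [
    "trying to catch",
    "catch the devil",
    "the devil is trying",
]

# Assumption: lowercase + whitespace split is a good enough tokenizer here
tokenized = [doc.lower().split() for doc in documents]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(documents)

# Document frequency: in how many sentences each word appears
doc_freq = {word: sum(1 for tokens in tokenized if word in tokens) for word in vocabulary}

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocabulary}


def tfidf_row(tokens):
    """Term count times idf for every vocabulary word, scaled to unit L2 norm."""
    term_counts = Counter(tokens)
    raw = [term_counts[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


weights = [tfidf_row(tokens) for tokens in tokenized]

print(vocabulary)
for row in weights:
    print([round(value, 6) for value in row])
```

Words found in a single sentence (`is`, `to`) have a larger `idf` than words found in two, which is why they receive the highest weight in their rows.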