# Feature Engineering



In this assignment, we will cover two common examples of feature engineering tasks:

1.   features for representing *categorical data*
2.   features for representing *text*


This process is also known as *vectorization*, as it involves converting arbitrary data into vectors. Machine learning models operate on numerical vectors, so our data must be converted into that form before a model can use it.





## Categorical Features

### This section is only given for illustration purposes and for possible future assignments.

One common type of non-numerical data is *categorical* data.
For example, imagine you are exploring some data on housing prices, and along with numerical features like "price" and "rooms", you also have "neighborhood" information.
Your data might look something like this:

In [1]:
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Charlotte'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'New York'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Miami'},
]

You might be tempted to encode this data with a straightforward numerical mapping:

In [2]:
{'Charlotte': 1, 'New York': 2, 'Miami': 3};

The problem with a mapping like this is that machine learning models make the fundamental assumption that numerical features reflect algebraic quantities.
Thus such a mapping would imply, for example, that *Charlotte < New York < Miami*, or even that *Miami - Charlotte = New York*, which (niche demographic jokes aside) does not make much sense.

In this case, one proven technique is to use *one-hot encoding*, which effectively creates extra columns indicating the presence or absence of a category with a value of 1 or 0, respectively.
When your data comes as a list of dictionaries, Scikit-Learn's ``DictVectorizer`` will do this for you:

In [None]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)

array([[     1,      0,      0, 850000,      4],
       [     0,      0,      1, 700000,      3],
       [     0,      1,      0, 650000,      3]])

Notice that the 'neighborhood' column has been expanded into three separate columns, representing the three neighborhood labels, and that each row has a 1 in the column associated with its neighborhood.
With these categorical features thus encoded, you can proceed as normal with fitting a Scikit-Learn model.
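To make the difference between the two encodings concrete, here is a minimal sketch that compares distances between neighborhoods under the integer mapping and under one-hot vectors. The one-hot column order is chosen here only for illustration.

```python
import numpy as np

# Integer codes from the naive mapping discussed earlier.
codes = {'Charlotte': 1, 'New York': 2, 'Miami': 3}

# One-hot vectors, one position per neighborhood (order is illustrative).
one_hot = {
    'Charlotte': np.array([1, 0, 0]),
    'Miami':     np.array([0, 1, 0]),
    'New York':  np.array([0, 0, 1]),
}

pairs = [('Charlotte', 'New York'), ('Charlotte', 'Miami'), ('New York', 'Miami')]

# Under integer codes, the distance depends on which pair is compared.
code_dist = {p: abs(codes[p[0]] - codes[p[1]]) for p in pairs}

# Under one-hot encoding, every pair of categories is the same distance apart.
onehot_dist = {p: float(np.linalg.norm(one_hot[p[0]] - one_hot[p[1]])) for p in pairs}

print(code_dist)
print(onehot_dist)
```

The integer codes place Charlotte twice as far from Miami as from New York, an ordering the data never contained, while the one-hot vectors treat all three neighborhoods symmetrically.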

To see the meaning of each column, you can inspect the feature names:

In [None]:
list(vec.get_feature_names_out())  # use vec.get_feature_names() on scikit-learn < 1.0

['neighborhood=Charlotte',
 'neighborhood=Miami',
 'neighborhood=New York',
 'price',
 'rooms']

There is one clear disadvantage of this approach: if your category has many possible values, this can *greatly* increase the size of your dataset.
However, because the encoded data contains mostly zeros, a sparse output can be a very efficient solution:

In [None]:
vec = DictVectorizer(sparse=True, dtype=int)
vec.fit_transform(data)

<3x5 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>
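The saving becomes clearer with more categories. The sketch below uses hypothetical data (1000 rows, each with its own made-up category label) and compares how many cells a dense array would hold with how many entries the sparse matrix actually stores.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical data: 1000 rows, each with a distinct category label.
rows = [{'neighborhood': f'area_{i}'} for i in range(1000)]

X_sparse = DictVectorizer(sparse=True, dtype=int).fit_transform(rows)

n_cells = X_sparse.shape[0] * X_sparse.shape[1]  # entries a dense array would hold
n_stored = X_sparse.nnz                          # entries the sparse matrix stores

print(X_sparse.shape, n_cells, n_stored)
```

Each row contributes a single nonzero entry, so the sparse representation grows with the number of rows rather than with rows times categories.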

Many (though not yet all) of the Scikit-Learn estimators accept such sparse inputs when fitting and evaluating models. ``sklearn.preprocessing.OneHotEncoder`` and ``sklearn.feature_extraction.FeatureHasher`` are two additional tools that Scikit-Learn includes to support this type of encoding.
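As a rough sketch of how these two tools might be applied to the neighborhood column alone: the value `n_features=8` below is an arbitrary choice for illustration, not a recommendation.

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction import FeatureHasher

# One categorical column, three rows.
neighborhoods = [['Charlotte'], ['New York'], ['Miami']]

# OneHotEncoder expects a 2D array-like and returns a sparse matrix by default.
enc = OneHotEncoder()
onehot = enc.fit_transform(neighborhoods)
print(enc.categories_)
print(onehot.toarray())

# FeatureHasher hashes feature names into a fixed number of columns,
# so it does not need to store a vocabulary.
hasher = FeatureHasher(n_features=8, input_type='dict')
hashed = hasher.transform([{'neighborhood': row[0]} for row in neighborhoods])
print(hashed.shape)
```

With hashing, the output width is fixed in advance, at the cost of possible collisions between categories and no direct way to recover the column names.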

## Text Features

### This section is related to the assignment.
Another common need in feature engineering is to convert text to a set of representative numerical values.

One of the simplest methods of encoding data is by *word counts*: you take each snippet of text, count the occurrences of each word within it, and put the results in a table.

For example, consider the following set of three phrases:

In [2]:
sample = ['problem of evil',
          'evil joker',
          'regression problem']

For a vectorization of this data based on word count, we could construct a column representing the word "problem," the word "evil," the word "joker," and so on.
While doing this by hand would be possible, the tedium can be avoided by using Scikit-Learn's ``CountVectorizer``:

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(sample)
X

<3x5 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>

The result is a sparse matrix recording the number of times each word appears; it is easier to inspect if we convert this to a ``DataFrame`` with labeled columns:

In [None]:
import pandas as pd
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0

|   | evil | joker | of | problem | regression |
|---|------|-------|----|---------|------------|
| 0 | 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 |
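For comparison, the same table can be built by hand with the standard library. This is only a sketch of the idea, not how ``CountVectorizer`` is implemented; the tokenizer below lowercases the text and keeps tokens of two or more word characters, which mirrors the vectorizer's default settings.

```python
import re
from collections import Counter

sample = ['problem of evil', 'evil joker', 'regression problem']

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r'\b\w\w+\b', text.lower())

# One Counter of word occurrences per phrase.
counts = [Counter(tokenize(doc)) for doc in sample]

# The vocabulary is the sorted set of all words seen in any phrase.
vocabulary = sorted(set(word for c in counts for word in c))

# One row per phrase, one column per vocabulary word (missing words count as 0).
table = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in table:
    print(row)
```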


There are some issues with this approach, however: raw word counts put too much weight on words that appear very frequently, which can be sub-optimal in some classification algorithms.
One approach to fix this is known as *term frequency-inverse document frequency* (*TF–IDF*), which weights each word count by a measure of how many documents the word appears in, so that words common to many documents count for less.
The syntax for computing these features is similar to the previous example:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0

|   | evil | joker | of | problem | regression |
|---|------|-------|----|---------|------------|
| 0 | 0.517856 | 0.0 | 0.680919 | 0.517856 | 0.0 |
| 1 | 0.605349 | 0.795961 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.605349 | 0.795961 |
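The numbers in this table can be reproduced by hand from the word counts. The sketch below assumes the default ``TfidfVectorizer`` settings: a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the word, followed by scaling each row to unit Euclidean length.

```python
import numpy as np

# Word counts for the three sample phrases
# (columns: evil, joker, of, problem, regression).
counts = np.array([[1, 0, 1, 1, 0],
                   [1, 1, 0, 0, 0],
                   [0, 0, 0, 1, 1]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # documents containing each word

# Smoothed inverse document frequency: rarer words get larger weights.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

weighted = counts * idf

# Scale each row to unit Euclidean (L2) length.
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(tfidf, 6))
```

In the first phrase, "of" appears in only one document and so receives a higher weight (about 0.68) than "problem" and "evil" (about 0.52), which each appear in two.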


Assignment Question: Use *CountVectorizer* and *TfidfVectorizer* for the TEXT column in the data files we are using for the class (English and Italian).



Step 1: You will need to load the data files into a dataframe. Feel free to use your solution from the previous assignment here. 

Step 2: Next, use the sample code given above to create vectors (count vectors and TF-IDF vectors) for the entire TEXT column of the English language data file.


Step 3: Repeat the process for the Italian language data file.
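Since Steps 2 and 3 repeat the same operations on different files, one option is to wrap them in a small helper. The function name `vectorize_column` and the toy DataFrame below are hypothetical, used only for illustration; the real data files are not loaded here.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def vectorize_column(df, column, vectorizer):
    """Fit the vectorizer on one text column and return a labeled DataFrame."""
    X = vectorizer.fit_transform(df[column])
    # get_feature_names_out() exists in scikit-learn >= 1.0;
    # older versions only provide get_feature_names().
    if hasattr(vectorizer, 'get_feature_names_out'):
        names = vectorizer.get_feature_names_out()
    else:
        names = vectorizer.get_feature_names()
    return pd.DataFrame(X.toarray(), columns=names)

# Hypothetical stand-in for a data file with a TEXT column.
toy = pd.DataFrame({'TEXT': ['problem of evil', 'evil joker', 'regression problem']})

count_df = vectorize_column(toy, 'TEXT', CountVectorizer())
tfidf_df = vectorize_column(toy, 'TEXT', TfidfVectorizer())
print(count_df.shape, tfidf_df.shape)
```

The same call could then be made once per language file, passing a fresh vectorizer each time so that the English and Italian vocabularies stay separate.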




In [11]:
# your solution goes here
import pandas as pd

# The data files are tab-separated.
eng_data = pd.read_csv('CONcreTEXT_trial_EN.tsv', sep='\t')
IT_data = pd.read_csv('CONcreTEXT_trial_IT.tsv', sep='\t')

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(eng_data['TEXT'])
X

<100x637 sparse matrix of type '<class 'numpy.int64'>'
	with 1194 stored elements in Compressed Sparse Row format>

In [13]:
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0

Unnamed: 0,30,about,academic,achievements,across,action,activated,activates,active,actually,...,woman,women,work,world,worth,wrist,years,you,your,yourself
0,0,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,1,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
96,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
97,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(eng_data['TEXT'])
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0

Unnamed: 0,30,about,academic,achievements,across,action,activated,activates,active,actually,...,woman,women,work,world,worth,wrist,years,you,your,yourself
0,0.0,0.0,0.365892,0.335751,0.0,0.0,0.000000,0.00000,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.167100,0.0
1,0.0,0.0,0.000000,0.267340,0.0,0.0,0.000000,0.00000,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.247274,0.133053,0.0
2,0.0,0.0,0.000000,0.000000,0.0,0.0,0.440947,0.00000,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.201377,0.0
3,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.31925,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.145799,0.0
4,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.00000,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.139450,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.00000,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.171157,0.0
96,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.00000,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0
97,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.00000,0.0,0.0,...,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0
98,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.00000,0.0,0.0,...,0.000000,0.340887,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0


In [15]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
y = vec.fit_transform(IT_data['TEXT'])
y

<100x702 sparse matrix of type '<class 'numpy.int64'>'
	with 1161 stored elements in Compressed Sparse Row format>

In [16]:
pd.DataFrame(y.toarray(), columns=vec.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0

Unnamed: 0,125,250,30,abilità,abito,abitudinario,accendi,accesso,accettare,accondiscendente,...,vista,vita,vivi,vodka,volpe,volta,volte,vuoi,vuole,zentangle
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
96,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
97,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
y = vec.fit_transform(IT_data['TEXT'])
pd.DataFrame(y.toarray(), columns=vec.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0

Unnamed: 0,125,250,30,abilità,abito,abitudinario,accendi,accesso,accettare,accondiscendente,...,vista,vita,vivi,vodka,volpe,volta,volte,vuoi,vuole,zentangle
0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.32087,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,...,0.366112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
