# Intro: from Categorical Features to Text Data

- Computers handle numbers well, but textual data is much harder for them to work with.



- Text Analysis is a major application field for machine learning algorithms. 



- However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves.


- Most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.


- Most automatic mining of social media data relies on some form of encoding the text as numbers.

In [2]:
import sklearn as sk
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

- How do we approach this problem?


- One common type of non-numerical data is *categorical* data.


- For example, imagine you are exploring some data on housing prices, and along with numerical features like "price" and "rooms", you also have "neighborhood" information.


- Your data might look something like this:

In [3]:
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

You might be tempted to encode this data with a straightforward numerical mapping:

In [4]:
{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3};

- One problem: models make the fundamental assumption that numerical features reflect algebraic quantities.


- Thus such a mapping would imply an **order**, i.e. that *Queen Anne < Fremont < Wallingford*, or even that *Wallingford - Queen Anne = Fremont*, which does not make much sense.

- A better approach in this case is *one-hot encoding*, which effectively creates extra columns indicating the presence or absence of a category with a value of 1 or 0, respectively.

<img src="figures/hotencoding.png" width="30%">
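To make the idea concrete, here is a minimal hand-rolled sketch of one-hot encoding with NumPy, using the neighborhood labels from the housing example. The variable names (`categories`, `column_of`, `one_hot`) are illustrative only, and sorting the categories alphabetically is just one reproducible way to fix the column order.

```python
import numpy as np

# Neighborhood labels from the housing example, one per record.
neighborhoods = ['Queen Anne', 'Fremont', 'Wallingford', 'Fremont']

# Sorted unique categories define the column order (an arbitrary but reproducible choice).
categories = sorted(set(neighborhoods))
column_of = {name: j for j, name in enumerate(categories)}

# One row per record, one column per category; exactly one 1 per row.
one_hot = np.zeros((len(neighborhoods), len(categories)), dtype=int)
for i, name in enumerate(neighborhoods):
    one_hot[i, column_of[name]] = 1

print(categories)
print(one_hot)
```

No column is "larger" than another here, so the encoding does not impose an order on the neighborhoods.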



When your data comes as a list of dictionaries, Scikit-Learn's ``DictVectorizer`` will do this for you:

In [5]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)

array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]], dtype=int64)

- The 'neighborhood' column is expanded into three separate columns representing the three neighborhood labels.


- Each row has a 1 in the column associated with its neighborhood.


- With these categorical features thus encoded, you can proceed as normal with fitting a Scikit-Learn model.
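As a rough sketch of what "proceeding as normal" could look like, the example below separates `price` as the target, encodes the remaining features, and fits a `LinearRegression`. The choice of model, the names `records`, `features`, `prices`, and the new listing are assumptions made for illustration. Four records are far too few for a meaningful model; the point is only the workflow.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Same housing records as above.
records = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'},
]

# Separate the target ('price') from the input features.
features = [{k: v for k, v in r.items() if k != 'price'} for r in records]
prices = [r['price'] for r in records]

# One-hot encode 'neighborhood'; 'rooms' passes through as a numeric column.
encoder = DictVectorizer(sparse=False)
X_encoded = encoder.fit_transform(features)  # shape (4, 4)

model = LinearRegression().fit(X_encoded, prices)

# Encode a hypothetical new listing with the already-fitted encoder.
new_listing = encoder.transform([{'rooms': 3, 'neighborhood': 'Fremont'}])
print(model.predict(new_listing))
```

Note that new data is passed through `transform`, not `fit_transform`, so that it gets the same column layout as the training data.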


To see the meaning of each column, you can inspect the feature names:

In [6]:
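# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (1.0+) replaces it
vec.get_feature_names_out().tolist()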

['neighborhood=Fremont',
 'neighborhood=Queen Anne',
 'neighborhood=Wallingford',
 'price',
 'rooms']

- There is one clear disadvantage of this approach: if your category has many possible values, this can *greatly* increase the size of your dataset.

    
- However, because the encoded data contains mostly zeros, a sparse output can be a very efficient solution:

In [7]:
vec = DictVectorizer(sparse=True, dtype=int)
vec.fit_transform(data)

<4x5 sparse matrix of type '<class 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>
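To give a feel for the savings, the sketch below builds a larger one-hot matrix directly in Compressed Sparse Row form and compares its storage with the dense equivalent. The sizes (10,000 samples, 1,000 categories) and the random category codes are made up for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical setting: 10,000 samples, each belonging to one of 1,000 categories.
rng = np.random.default_rng(0)
n_samples, n_categories = 10_000, 1_000
codes = rng.integers(0, n_categories, size=n_samples)

# Build the one-hot matrix in CSR form: a single nonzero entry per row.
rows = np.arange(n_samples)
values = np.ones(n_samples, dtype=np.int8)
one_hot_sparse = sparse.csr_matrix(
    (values, (rows, codes)), shape=(n_samples, n_categories)
)

# Dense storage holds every entry; CSR holds only the nonzeros plus index arrays.
dense_bytes = one_hot_sparse.toarray().nbytes
sparse_bytes = (
    one_hot_sparse.data.nbytes
    + one_hot_sparse.indices.nbytes
    + one_hot_sparse.indptr.nbytes
)
print(one_hot_sparse.nnz, dense_bytes, sparse_bytes)
```

The sparse form stores one value per sample regardless of how many categories there are, while the dense form grows with the number of categories.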

- Many (though not yet all) of the Scikit-Learn estimators accept such sparse inputs when fitting and evaluating models. 


- ``sklearn.preprocessing.OneHotEncoder`` and ``sklearn.feature_extraction.FeatureHasher`` are two additional tools that Scikit-Learn includes to support this type of encoding.
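As a brief illustration of the first of these, the sketch below applies `OneHotEncoder` to the neighborhood labels alone. Unlike `DictVectorizer`, it expects a 2-D array with one column per categorical feature. The array `neighborhoods` is constructed here for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One row per sample, one column per categorical feature.
neighborhoods = np.array([['Queen Anne'], ['Fremont'], ['Wallingford'], ['Fremont']])

encoder = OneHotEncoder()  # returns a sparse matrix by default
encoded = encoder.fit_transform(neighborhoods)

print(encoder.categories_)  # categories found for each input column
print(encoded.toarray())    # densified only for display
```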

## Text Feature Extraction

- So how do we convert text to a set of representative numerical values?



- One of the simplest methods of encoding data is by *word counts*: 

    
- You take each snippet of text, count the occurrences of each word within it, and put the results in a table.



For example, consider the following set of three phrases:

In [10]:
sample = ['problem of evil',
          'evil queen',
          'horizon problem']

- For a vectorization of this data based on word count, we could construct a column representing the word "problem," the word "evil," the word "horizon," and so on.


- While doing this by hand would be possible, the **tedium** can be avoided by using Scikit-Learn's ``CountVectorizer``:

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(sample)
X

<3x5 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>

In [15]:
print(X)  # Prints only the nonzero entries as (document, word) pairs; e.g. document 0 contains words 0, 2 and 3 once each

  (0, 0)	1
  (0, 2)	1
  (0, 3)	1
  (1, 4)	1
  (1, 0)	1
  (2, 1)	1
  (2, 3)	1


The result is a sparse matrix recording the number of times each word appears; it is easier to inspect if we convert this to a ``DataFrame`` with labeled columns:

In [18]:
import pandas as pd
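# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (1.0+) replaces it
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())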

   evil  horizon  of  problem  queen
0     1        0   1        1      0
1     1        0   0        0      1
2     0        1   0        1      0
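To make explicit what the vectorizer computed, the same table can be rebuilt by hand in plain Python. This is a simplified sketch: it assumes tokens are lowercase, whitespace-separated words, which happens to match `CountVectorizer`'s default behavior on these three phrases but not on arbitrary text (the default tokenizer also handles punctuation and drops single-character tokens).

```python
from collections import Counter

sample = ['problem of evil',
          'evil queen',
          'horizon problem']

# Simplifying assumption: split on whitespace after lowercasing.
tokenized = [doc.lower().split() for doc in sample]

# Vocabulary: all distinct words, sorted alphabetically to fix the column order.
vocabulary = sorted({word for doc in tokenized for word in doc})

# One row per document, one count per vocabulary word (0 if the word is absent).
counts = []
for doc in tokenized:
    word_counts = Counter(doc)
    counts.append([word_counts[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```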


- The main issue with this approach, however, is that **the raw word counts lead to features which put too much weight on words that appear very frequently**, and this can be sub-optimal in some classification algorithms.



- One approach to fix this is known as *term frequency-inverse document frequency* (*TF–IDF*).



- It weights each word count by a measure of how many documents the word appears in, so that words common to many documents count for less.
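A minimal sketch of this weighting, assuming Scikit-Learn's `TfidfVectorizer` with its default settings (smoothed inverse document frequency, each row scaled to unit Euclidean length), applied to the same three phrases. The names `tfidf`, `X_tfidf`, `col` and `row0` are chosen for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sample = ['problem of evil',
          'evil queen',
          'horizon problem']

tfidf = TfidfVectorizer()  # defaults: smoothed IDF, rows normalized to unit L2 norm
X_tfidf = tfidf.fit_transform(sample)

# Map each word to its column index, to look up individual weights.
col = tfidf.vocabulary_

# In the first phrase every word occurs once, yet the weights differ:
# 'of' appears in one document, 'evil' in two, so 'of' is weighted higher.
row0 = X_tfidf.toarray()[0]
print(row0[col['of']], row0[col['evil']])
```

Under raw counts both words would have the value 1 in the first document; here the word shared across more documents is down-weighted.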