## Feature Extraction

This notebook discusses several methods of feature extraction, also known as text vectorization or text representation. These techniques convert words into numeric vectors so that they can be used for modeling. The techniques are described below, with attention to how well each one captures semantic meaning.


### 1. One hot encoding: 

This is the most basic technique for converting words into vectors. Consider the following example:
- Doc1 = "It is raining"
- Doc2 = "Boston is raining"
- Doc3 = "Rain raining everywhere."

Vocabulary of this corpus = ["It","is","raining","Boston","rain","everywhere"].
In the vector form these docs can be converted as:
    
```
Doc1 = [[1,0,0,0,0,0],[0,1,0,0,0,0],[0,0,1,0,0,0]] -- a 3x6 matrix (3 words, 6 vocabulary entries).
Doc2 = [[0,0,0,1,0,0],[0,1,0,0,0,0],[0,0,1,0,0,0]] -- a 3x6 matrix.
Doc3 = [[0,0,0,0,1,0],[0,0,1,0,0,0],[0,0,0,0,0,1]] -- a 3x6 matrix.
```

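In general, each word of each document is converted to a vector whose length equals the vocabulary size, so a document with n words becomes an n x V matrix, where V is the number of words in the vocabulary.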

**Pros:** 
- Very simple, intuitive, easy to implement.
    
**Cons:** 
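- Sparsity: each word vector holds a single 1 and is otherwise all zeros, and its length grows with the vocabulary.
- Documents with different numbers of words produce arrays of different sizes, which won't work as input for models that expect a fixed size.
- Out-of-vocabulary words cannot be encoded and will throw an error.
- No semantic meaning is captured.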
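
The encoding above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library implementation: the helper name `one_hot_encode` is hypothetical, and lowercasing the text and stripping the full stop before lookup are assumptions made so that "Rain" matches the vocabulary entry "rain".

```python
# Vocabulary from the example above, lowercased (an assumption of this sketch).
vocabulary = ["it", "is", "raining", "boston", "rain", "everywhere"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot_encode(document):
    """Return one one-hot row per word of the document."""
    # Simplistic tokenization: lowercase, drop full stops, split on whitespace.
    tokens = document.lower().replace(".", "").split()
    rows = []
    for token in tokens:
        if token not in word_to_index:
            # Out-of-vocabulary words cannot be represented at all.
            raise KeyError(f"out-of-vocabulary word: {token!r}")
        row = [0] * len(vocabulary)
        row[word_to_index[token]] = 1
        rows.append(row)
    return rows


doc1 = one_hot_encode("It is raining")             # 3 rows x 6 columns
doc3 = one_hot_encode("Rain raining everywhere.")  # 3 rows x 6 columns
short_doc = one_hot_encode("Boston rain")          # 2 rows x 6 columns
```

Note that `short_doc` has a different number of rows from `doc1`, which is the "different array sizes" drawback, and that calling `one_hot_encode("It is snowing")` raises a `KeyError`, which is the out-of-vocabulary drawback.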

### 2. Bag of Words:
It deals with the occurrence of words in a document: each document is represented by how many times each vocabulary word appears in it.

- Doc1 = "The more the merrier."
- Doc2 = "More people enrolled"
- Doc3 = "Enrolled last week"

Vocabulary of this corpus = ["The","more","merrier","people","enrolled","last","week"].
In the vector form these docs can be converted as:
    
```
Doc1 = [2,1,1,0,0,0,0]
Doc2 = [0,1,0,1,1,0,0]
Doc3 = [0,0,0,0,1,1,1]
```
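
The counts above can be reproduced by hand with `collections.Counter`. This is a minimal sketch, not how any particular library does it: the helper name `bag_of_words` is hypothetical, and the lowercasing and full-stop stripping are simplifying assumptions.

```python
from collections import Counter

# Vocabulary in the same order as the example above, lowercased.
vocabulary = ["the", "more", "merrier", "people", "enrolled", "last", "week"]


def bag_of_words(document):
    """Count how often each vocabulary word occurs in the document."""
    tokens = document.lower().replace(".", "").split()
    counts = Counter(tokens)
    # Counter returns 0 for missing keys, so absent words become 0 and
    # words outside the vocabulary are silently ignored.
    return [counts[word] for word in vocabulary]


doc1_vec = bag_of_words("The more the merrier.")  # [2, 1, 1, 0, 0, 0, 0]
doc2_vec = bag_of_words("More people enrolled")   # [0, 1, 0, 1, 1, 0, 0]
doc3_vec = bag_of_words("Enrolled last week")     # [0, 0, 0, 0, 1, 1, 1]
```

Every document maps to a vector of the same length (the vocabulary size), however many words it contains.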

**Pros:**
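- Simple and easy to understand.
- Documents do not need to be the same length: every document maps to a vector whose length is the vocabulary size.
- Doesn't throw an error for out-of-vocabulary words.
- Captures a little semantic meaning, since documents that share words get similar vectors.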

**Cons:**
- Ignores new words, i.e. words that were not in the vocabulary when the vectorizer was fitted.
- Sparsity
- Word order and context are lost. Consider: "This is a good book" and "This is not a good book". Because the two docs contain almost the same words at the same frequencies (they differ only by "not"), this approach treats them as very similar, when in reality they convey opposite meanings.
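
The "good book" example can be checked numerically. The sketch below vectorizes both sentences with `CountVectorizer` and compares them with cosine similarity. Cosine similarity is an illustrative choice of measure that is not part of the original notebook, and the variable names are hypothetical. With default settings `CountVectorizer` drops one-letter tokens, so "a" does not appear in the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["This is a good book", "This is not a good book"]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences)

# Pairwise cosine similarity; entry [0, 1] compares the two sentences.
similarity = cosine_similarity(vectors)[0, 1]
print(f"Cosine similarity: {similarity:.3f}")  # about 0.894
```

The two count vectors differ only in the column for "not", so the similarity is close to 1 even though the meanings are opposite.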

In [2]:
import pandas as pd

In [3]:
df = pd.DataFrame({'documents':["The more the merrier.","More people enrolled","Enrolled last week"]})

In [4]:
df

|   | documents |
|---|-----------|
| 0 | The more the merrier. |
| 1 | More people enrolled |
| 2 | Enrolled last week |


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [12]:
bow = cv.fit_transform(df['documents'])

In [15]:
#Vocabulary
cv.vocabulary_

{'the': 5,
 'more': 3,
 'merrier': 2,
 'people': 4,
 'enrolled': 0,
 'last': 1,
 'week': 6}

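Each number is the column index of that word in the output vectors. The indices are assigned in alphabetical order of the words, and `CountVectorizer` lowercases the text by default, which is why "The" appears as "the".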
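
To make the column order easier to read, the vocabulary can be sorted by index. This is a small sketch that refits a fresh vectorizer on the same three documents so that it runs on its own; `ordered_words` is a hypothetical name.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The more the merrier.", "More people enrolled", "Enrolled last week"]
cv = CountVectorizer()
cv.fit(docs)

# vocabulary_ maps word -> column index; sorting by that index gives the
# words in the same order as the columns of the count matrix.
ordered_words = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(ordered_words)
# ['enrolled', 'last', 'merrier', 'more', 'people', 'the', 'week']
```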

In [18]:
print("Vector form of doc1:",bow[0].toarray())
print("Vector form of doc2:",bow[1].toarray())
print("Vector form of doc3:",bow[2].toarray())

Vector form of doc1: [[0 0 1 1 0 2 0]]
Vector form of doc2: [[1 0 0 1 1 0 0]]
Vector form of doc3: [[1 1 0 0 0 0 1]]


In [20]:
cv.transform(['This sentence contains more words']).toarray()

array([[0, 0, 0, 1, 0, 0, 0]], dtype=int64)

As expected, the new words were ignored in the vector form: only "more" is in the fitted vocabulary, so only index 3 is non-zero. This is one of the drawbacks of BoW.
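
Because unseen words are dropped silently, it can help to measure how much of a new document falls outside the vocabulary. The sketch below is one possible check, not part of the original notebook. It uses `build_analyzer()`, which returns the same preprocessing and tokenization function the vectorizer applies internally; `unknown` and `oov_ratio` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The more the merrier.", "More people enrolled", "Enrolled last week"]
cv = CountVectorizer().fit(docs)

# Tokenize the new document exactly as the vectorizer would.
analyzer = cv.build_analyzer()
new_doc = "This sentence contains more words"
tokens = analyzer(new_doc)

# Tokens that are missing from the fitted vocabulary are ignored by transform().
unknown = [token for token in tokens if token not in cv.vocabulary_]
oov_ratio = len(unknown) / len(tokens)

print("Ignored tokens:", unknown)
print("Out-of-vocabulary ratio:", oov_ratio)
```

Here four of the five tokens are unknown, so the resulting vector carries very little information about the sentence.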