## Bag-of-Words Vectorisation
* Text analysis is a major application field for machine learning algorithms. However, the raw data (a sequence of symbols) cannot be fed directly to the algorithms, as most of them expect fixed-size numerical feature vectors rather than raw text documents of variable length.
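
> As a rough illustration of the idea (not how scikit-learn implements it), the sketch below turns variable-length documents into fixed-size count vectors. The helper name `bag_of_words` and the whitespace-only tokenisation are assumptions made for this example.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, rows): sorted terms and one count row per document."""
    # Simplifying assumption: lowercase and split on whitespace only.
    tokenised = [doc.lower().split() for doc in docs]
    # One column per distinct token, in sorted order.
    vocabulary = sorted({tok for toks in tokenised for tok in toks})
    rows = []
    for toks in tokenised:
        counts = Counter(toks)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

vocab, rows = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocab)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(rows)   # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

> Every row has the same length as the vocabulary, regardless of how long the original document was.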

### 1. CountVectorizer
* Convert a collection of text documents to a matrix of token counts, with one row per document and one column per token.
```
sklearn.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
```

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
vectorizer = CountVectorizer()

In [4]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

In [7]:
X = vectorizer.fit_transform(corpus)
X

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [8]:
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

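> The default configuration lowercases the text and tokenizes it by extracting words of at least 2 letters. Punctuation is treated as a separator, and single-character words such as "a" are dropped.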

In [17]:
analyze = vectorizer.build_analyzer()
analyze("This is a boy.") == ("This is boy".lower().split(" "))

True
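
> The default tokenisation can be approximated with the standard `re` module, using the `token_pattern` from the signature shown earlier. The helper `default_tokens` is a name made up for this sketch. It only mimics the `lowercase=True` and `token_pattern` defaults, not the full analyzer.

```python
import re

# Default token_pattern taken from the CountVectorizer signature.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(text):
    # Lowercase first, then keep runs of two or more word characters.
    return TOKEN_PATTERN.findall(text.lower())

print(default_tokens("This is a boy."))       # ['this', 'is', 'boy']
print(default_tokens("I am on a B-52, OK?"))  # ['am', 'on', '52', 'ok']
```

> Note that "a", "I" and the lone "B" disappear because they are only one character long.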

> Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. 

In [20]:
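# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
vectorizer.get_feature_names_out().tolist()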

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [21]:
corpus

['This is the first document.',
 'This is the second second document.',
 'And the third one.',
 'Is this the first document?']

In [23]:
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

> The word "document" (column 1) appears in the first, second and last documents, while the word "and" (column 0) appears in the third document only.

In [24]:
vectorizer.vocabulary_.get("document")

1

In [25]:
vectorizer.vocabulary_.get("and")

0
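
> The same term-to-column mapping can be used in reverse to read a row of the matrix. The sketch below hard-codes the feature names and the second row of `X.toarray()` from the outputs above, so it runs without scikit-learn.

```python
feature_names = ['and', 'document', 'first', 'is', 'one',
                 'second', 'the', 'third', 'this']
row = [0, 1, 0, 1, 0, 2, 1, 0, 1]  # second row of X.toarray()

# Equivalent of vocabulary_: term -> column index.
vocabulary = {term: idx for idx, term in enumerate(feature_names)}

# Keep only the terms that actually occur in this document.
decoded = {term: row[idx] for term, idx in vocabulary.items() if row[idx] > 0}
print(decoded)
# {'document': 1, 'is': 1, 'second': 2, 'the': 1, 'this': 1}
```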

> Words that were not seen in the training corpus are therefore completely ignored in future calls to the `transform` method:

In [27]:
vectorizer.transform(["he authenticates again."]).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
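
> A minimal sketch of why unseen words vanish: with a fixed vocabulary, a token without a column index has nowhere to be counted. The function `transform` below is a stand-in written for this example, not the scikit-learn method.

```python
import re

feature_names = ['and', 'document', 'first', 'is', 'one',
                 'second', 'the', 'third', 'this']
vocabulary = {term: idx for idx, term in enumerate(feature_names)}

def transform(doc, vocabulary):
    row = [0] * len(vocabulary)
    for tok in re.findall(r"\b\w\w+\b", doc.lower()):
        idx = vocabulary.get(tok)
        if idx is not None:  # tokens outside the vocabulary are skipped
            row[idx] += 1
    return row

print(transform("he authenticates again.", vocabulary))
# [0, 0, 0, 0, 0, 0, 0, 0, 0]
print(transform("This is the newest document", vocabulary))
# [0, 1, 0, 1, 0, 0, 1, 0, 1]
```

> In the second call only "newest" is dropped; the known words are still counted.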

> **Note that in the previous corpus, the first and the last documents contain exactly the same words and are therefore encoded as equal vectors. In particular, we lose the information that the last document is in interrogative form. To preserve some of the local ordering information, we can extract 2-grams of words in addition to the 1-grams (individual words):**

In [30]:
bigram_vectorizer = CountVectorizer(ngram_range=(1,2), 
                                    token_pattern=r'\b\w+\b',
                                    min_df=1)

In [33]:
analyze = bigram_vectorizer.build_analyzer()
analyze('Bi-grams are cool!') == ( ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])

True
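
> How the 1-grams and 2-grams are produced can be sketched in a few lines. The helper `word_ngrams` is hypothetical. It assumes the same `token_pattern` as `bigram_vectorizer` and emits all n-grams of one size before moving to the next.

```python
import re

def word_ngrams(text, ngram_range=(1, 2)):
    # Same pattern as bigram_vectorizer: single-character tokens are kept.
    tokens = re.findall(r"\b\w+\b", text.lower())
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n consecutive tokens over the document.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("Bi-grams are cool!"))
# ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']
```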

In [35]:
X = bigram_vectorizer.fit(corpus).transform(corpus)
print(X.toarray())

[[0 0 1 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0]
 [0 0 1 0 0 1 1 0 0 2 1 1 1 0 1 0 0 0 1 1 0]
 [1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 1 0 0 0]
 [0 0 1 1 1 1 0 1 0 0 0 0 1 1 0 0 0 0 1 0 1]]
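
> The first and last rows now differ. A short sketch of where the difference comes from, comparing only the word bigrams of the two documents (the helper `bigrams` is an assumption for this example):

```python
import re

def bigrams(text):
    tokens = re.findall(r"\b\w+\b", text.lower())
    # Pair each token with its successor.
    return {" ".join(pair) for pair in zip(tokens, tokens[1:])}

first = bigrams("This is the first document.")
last = bigrams("Is this the first document?")

only_first = sorted(first - last)
only_last = sorted(last - first)
print(only_first)  # ['is the', 'this is']
print(only_last)   # ['is this', 'this the']
```

> The bigrams "is this" and "this the" occur only in the interrogative document, which is what lets the two vectors be told apart.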


### Stop words

In [64]:
v = CountVectorizer()
X = v.fit_transform(corpus)
v.get_stop_words()  # returns None: no stop-word list is configured by default

In [65]:
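# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
v.get_feature_names_out().tolist()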

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

> Removing **stop words** with the built-in English list:

In [67]:
v = CountVectorizer(stop_words="english")
x = v.fit_transform(corpus)
x.todense()

matrix([[1, 0],
        [1, 2],
        [0, 0],
        [1, 0]], dtype=int64)

In [70]:
x.toarray().reshape(-1)

array([1, 0, 1, 2, 0, 0, 1, 0], dtype=int64)

In [76]:
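# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
v.get_feature_names_out().tolist()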

['document', 'second']
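
> The effect of stop-word removal can be reproduced by filtering tokens before counting. The `stop_words` set below is an assumption: it holds only the words that were dropped from this corpus, as inferred from the output above, not the full built-in list.

```python
import re

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
# Assumed subset of the English stop-word list relevant to this corpus.
stop_words = {"and", "first", "is", "one", "the", "third", "this"}

tokenised = [
    [tok for tok in re.findall(r"\b\w\w+\b", doc.lower()) if tok not in stop_words]
    for doc in corpus
]
vocabulary = sorted({tok for toks in tokenised for tok in toks})
rows = [[toks.count(term) for term in vocabulary] for toks in tokenised]

print(vocabulary)  # ['document', 'second']
print(rows)        # [[1, 0], [1, 2], [0, 0], [1, 0]]
```

> The third document ends up as an all-zero row because every one of its words is a stop word.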

In [75]:
v.get_stop_words()

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides',
           ...})

### 2. TfidfVectorizer
> As tf–idf is very often used for text features, there is also a class called **TfidfVectorizer** that combines all the options of **CountVectorizer** and **TfidfTransformer** in a single model:

In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

In [49]:
X = vectorizer.fit_transform(corpus)
X

<4x9 sparse matrix of type '<class 'numpy.float64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [50]:
print(X.toarray())

[[0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]
 [0.         0.27230147 0.         0.27230147 0.         0.85322574
  0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.
  0.28847675 0.55280532 0.        ]
 [0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]]
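
> The values above can be reproduced by hand from the count matrix. The sketch assumes the default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), for which the scikit-learn user guide gives idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row.

```python
import math

# Count matrix from the CountVectorizer example.
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 2, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents containing each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the row
    tfidf.append([v / norm for v in weighted])

print([round(v, 8) for v in tfidf[0]])
```

> The word "the" occurs in every document, so it gets the smallest idf (exactly 1) and the lowest weight among the non-zero entries of the first row.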


### References
* [scikit-learn `CountVectorizer` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)