# Feature Extraction Using Scikit-Learn

## Bag of Words and Tf-idf
In the above examples, each vector can be considered a *bag of words*. By themselves, raw counts may not be very helpful until we consider *term frequencies*, or how often individual words appear in a document relative to its length. A simple way to calculate a term frequency is to divide the number of occurrences of a word by the total number of words in the document. This lets us compare how often a word appears in a large document with how often it appears in a smaller one.
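As a minimal sketch of the calculation just described, the snippet below divides each word's count by the document's length. The helper name `term_frequencies` and the sample sentence are made up for illustration, and the text is split on whitespace only.

```python
from collections import Counter


def term_frequencies(document):
    """Return {word: count / total words} for one document (hypothetical helper)."""
    words = document.lower().split()  # crude whitespace tokenization
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


tf = term_frequencies("the cat sat on the mat")
print(tf["the"])  # 2 occurrences out of 6 words
print(tf["cat"])  # 1 occurrence out of 6 words
```

Because every count is divided by the same total, the frequencies for one document always sum to 1.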

However, it may be hard to differentiate documents based on term frequency if a word shows up in a majority of documents. To handle this we also consider *inverse document frequency*, which is the total number of documents divided by the number of documents that contain the word. In practice we convert this value to a logarithmic scale, as described [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency).
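Written out, with $N$ the total number of documents and $\mathrm{df}(t)$ the number of documents that contain term $t$ (symbols chosen here for illustration), the log-scaled form is:

$$
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}
$$

A word that appears in every document gets $\log(N/N) = 0$, so it contributes nothing toward telling documents apart.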

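Multiplying term frequency by inverse document frequency gives [**tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf): a word scores highly when it is frequent within a document but rare across the collection.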
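To make the product concrete, here is a rough sketch that scores words in a three-document toy corpus using the plain definitions above (term frequency times the log of inverse document frequency). The corpus and the function name `tf_idf` are invented for illustration. Library implementations such as scikit-learn's apply extra smoothing and normalization, so their numbers will differ.

```python
import math


def tf_idf(word, document, corpus):
    """Textbook tf-idf for one word in one document (hypothetical helper).

    Assumes the word occurs in at least one document of the corpus.
    """
    words = document.split()
    tf = words.count(word) / len(words)
    # Number of documents containing the word at least once.
    df = sum(1 for doc in corpus if word in doc.split())
    idf = math.log(len(corpus) / df)
    return tf * idf


corpus = ["the cat sat", "the dog barked", "the cat purred"]
score_the = tf_idf("the", corpus[0], corpus)  # in every document, so idf is 0
score_cat = tf_idf("cat", corpus[0], corpus)  # in 2 of 3 documents
score_sat = tf_idf("sat", corpus[0], corpus)  # in 1 of 3 documents
print(score_the, score_cat, score_sat)
```

All three words occur once in the first document, so the ranking comes entirely from how rare each word is across the corpus.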

## Stop Words and Word Stems

Some words like "the" and "and" appear so frequently, and in so many documents, that we needn't bother counting them. These are called *stop words*. It may also make sense to record only the stem of a word, say `cat` in place of both `cat` and `cats`. Both steps shrink our vocab array, which reduces memory use and computation.
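The following sketch illustrates both ideas on a single sentence. The stop-word set is a tiny hand-picked sample and `naive_stem` is a deliberately crude stand-in that only strips a trailing "s"; real stop-word lists and stemmers are far more thorough.

```python
# Tiny hand-picked stop-word set for illustration; real lists are much longer.
STOP_WORDS = {"the", "and", "a", "is"}


def naive_stem(word):
    """Strip a trailing 's' from words longer than three letters (crude stand-in for a stemmer)."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def normalize(document):
    """Lowercase, drop stop words, and reduce each remaining word to its naive stem."""
    return [naive_stem(w) for w in document.lower().split() if w not in STOP_WORDS]


tokens = normalize("The cat and the cats")
vocabulary = sorted(set(tokens))
print(tokens)      # ['cat', 'cat']
print(vocabulary)  # ['cat']
```

Five input words collapse to a one-word vocabulary, which is the shrinking effect described above.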

## Tokenization and Tagging

When we created our vectors the first thing we did was split the incoming text on whitespace with `.split()`. This was a crude form of *tokenization* - that is, dividing a document into individual words. In this simple example we didn't worry about punctuation or different parts of speech. In the real world we rely on some fairly sophisticated *morphology* to parse text appropriately.
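A small comparison shows why whitespace splitting is crude. The sample sentence is made up, and the regular expression keeps lowercase runs of two or more word characters, which to my knowledge matches the default token pattern of scikit-learn's `CountVectorizer`; treat that correspondence as an assumption.

```python
import re

sentence = "Hello, world! It's a test."

# Whitespace splitting keeps punctuation attached to the words.
print(sentence.split())
# ['Hello,', 'world!', "It's", 'a', 'test.']

# Keep runs of two or more word characters, lowercased.
tokens = re.findall(r"\b\w\w+\b", sentence.lower())
print(tokens)
# ['hello', 'world', 'it', 'test']
```

Note that the pattern also drops single-character tokens such as "a", a side effect that may or may not be wanted.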

Once the text is divided, we can go back and *tag* our tokens with information about parts of speech, grammatical dependencies, etc. This adds more dimensions to our data and enables a deeper understanding of the context of specific documents. Because any single document uses only a small fraction of these features, the resulting vectors are stored as ***high dimensional sparse matrices***.

In [1]:
text = ["This is first line", "This is second Line", "Another LINE", "not the FIrst linE"]

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

#### 1) Count Vectorizer:

In [3]:
cv = CountVectorizer()  # lowercases text and builds a vocabulary of word counts

In [4]:
cv.fit_transform(text)

<4x8 sparse matrix of type '<class 'numpy.int64'>'
	with 14 stored elements in Compressed Sparse Row format>

In [5]:
sparse_mat = cv.fit_transform(text)

In [6]:
sparse_mat

<4x8 sparse matrix of type '<class 'numpy.int64'>'
	with 14 stored elements in Compressed Sparse Row format>

In [7]:
sparse_mat.todense()

matrix([[0, 1, 1, 1, 0, 0, 0, 1],
        [0, 0, 1, 1, 0, 1, 0, 1],
        [1, 0, 0, 1, 0, 0, 0, 0],
        [0, 1, 0, 1, 1, 0, 1, 0]], dtype=int64)
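To see where these rows come from, the sketch below rebuilds the same matrix by hand: lowercase each document, collect the sorted vocabulary, and count each term per document. It is a simplified stand-in rather than scikit-learn's actual implementation, and it assumes whitespace splitting is enough for this input.

```python
text = ["This is first line", "This is second Line", "Another LINE", "not the FIrst linE"]

# Lowercase and split each document into tokens.
docs = [doc.lower().split() for doc in text]

# Columns are the distinct terms in alphabetical order.
vocabulary = sorted({token for doc in docs for token in doc})
print(vocabulary)
# ['another', 'first', 'is', 'line', 'not', 'second', 'the', 'this']

# One row per document, one count per vocabulary term.
counts = [[doc.count(term) for term in vocabulary] for doc in docs]
for row in counts:
    print(row)
```

The column for `line` is 1 in every row because lowercasing merges `line`, `Line`, `LINE` and `linE` into one term.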

In [8]:
cv = CountVectorizer(stop_words="english")  # drop words on scikit-learn's built-in English stop-word list

In [10]:
sparse_mat = cv.fit_transform(text)

In [11]:
sparse_mat

<4x2 sparse matrix of type '<class 'numpy.int64'>'
	with 5 stored elements in Compressed Sparse Row format>

In [12]:
sparse_mat.todense()

matrix([[1, 0],
        [1, 1],
        [1, 0],
        [1, 0]], dtype=int64)

In [13]:
cv.vocabulary_  # maps each remaining term to its column index

{'line': 0, 'second': 1}

#### 2) Tfidf Transformer:

In [14]:
tfidf_transformer = TfidfTransformer()  # converts a count matrix into tf-idf weights

In [15]:
cv = CountVectorizer()

In [16]:
sparse_mat = cv.fit_transform(text)

In [17]:
sparse_mat

<4x8 sparse matrix of type '<class 'numpy.int64'>'
	with 14 stored elements in Compressed Sparse Row format>

In [18]:
tfidf_mat = tfidf_transformer.fit_transform(sparse_mat)

In [19]:
tfidf_mat

<4x8 sparse matrix of type '<class 'numpy.float64'>'
	with 14 stored elements in Compressed Sparse Row format>

In [20]:
tfidf_mat.todense()

matrix([[0.        , 0.53931298, 0.53931298, 0.35696573, 0.        ,
         0.        , 0.        , 0.53931298],
        [0.        , 0.        , 0.4970962 , 0.32902288, 0.        ,
         0.6305035 , 0.        , 0.4970962 ],
        [0.88654763, 0.        , 0.        , 0.46263733, 0.        ,
         0.        , 0.        , 0.        ],
        [0.        , 0.46345796, 0.        , 0.30675807, 0.58783765,
         0.        , 0.58783765, 0.        ]])
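These values differ from the textbook formula because, with its default settings, `TfidfTransformer` uses raw counts as the term frequency, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit Euclidean length. The sketch below reproduces the first row by hand under those assumptions; the counts and document frequencies are read off the count matrix above.

```python
import math

n_docs = 4
# Counts for "This is first line" over the columns
# [another, first, is, line, not, second, the, this].
row_counts = [0, 1, 1, 1, 0, 0, 0, 1]
# Number of documents containing each term, taken from the column sums.
doc_freq = [1, 2, 2, 4, 1, 1, 1, 2]

# Smoothed idf, assumed to match TfidfTransformer's defaults.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
weighted = [count * w for count, w in zip(row_counts, idf)]

# L2-normalize so the row has unit length.
norm = math.sqrt(sum(v * v for v in weighted))
tfidf_row = [v / norm for v in weighted]
print([round(v, 8) for v in tfidf_row])
```

The word `line` appears in all four documents, so it gets the smallest idf and the lowest nonzero weight in the row.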

#### 3) Tfidf Vectorizer:

In [21]:
tfidf_vectorizer = TfidfVectorizer()  # CountVectorizer followed by TfidfTransformer, in one step

In [22]:
sparse_mat = tfidf_vectorizer.fit_transform(text)

In [23]:
sparse_mat

<4x8 sparse matrix of type '<class 'numpy.float64'>'
	with 14 stored elements in Compressed Sparse Row format>

In [24]:
sparse_mat.todense()

matrix([[0.        , 0.53931298, 0.53931298, 0.35696573, 0.        ,
         0.        , 0.        , 0.53931298],
        [0.        , 0.        , 0.4970962 , 0.32902288, 0.        ,
         0.6305035 , 0.        , 0.4970962 ],
        [0.88654763, 0.        , 0.        , 0.46263733, 0.        ,
         0.        , 0.        , 0.        ],
        [0.        , 0.46345796, 0.        , 0.30675807, 0.58783765,
         0.        , 0.58783765, 0.        ]])