# Part Two: Feature Extraction with Scikit-Learn

Let's use scikit-learn to carry out the tasks described above, which is the more realistic way to do them in practice.

# Scikit-Learn's Text Feature Extraction Options

In [1]:
text = ['This is a line',
        'This is another line',
        'Completely different line']

## CountVectorizer

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [13]:
cv = CountVectorizer()

In [14]:
sparse_matrix = cv.fit_transform(text)

In [15]:
sparse_matrix.todense()

matrix([[0, 0, 0, 1, 1, 1],
        [1, 0, 0, 1, 1, 1],
        [0, 1, 1, 0, 1, 0]])
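The result of `fit_transform` is a SciPy sparse matrix, which stores only the nonzero counts. The snippet below is a small sketch of how one might inspect it before densifying. The variable names `counts` and `dense` are illustrative, and `toarray()` is used here as an alternative to `todense()` because it returns a plain NumPy array.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

text = ['This is a line',
        'This is another line',
        'Completely different line']

counts = CountVectorizer().fit_transform(text)

# The matrix is sparse: only nonzero entries are stored
print(sparse.issparse(counts))
print(counts.shape)   # (number of documents, vocabulary size)
print(counts.nnz)     # number of stored nonzero counts

# toarray() gives a regular ndarray; fine for tiny examples,
# but it can exhaust memory on a large corpus
dense = counts.toarray()
print(type(dense))
```

For a real corpus the vocabulary can run to tens of thousands of columns, so keeping the matrix sparse is usually the safer choice.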

In [16]:
cv.vocabulary_

{'this': 5, 'is': 3, 'line': 4, 'another': 0, 'completely': 1, 'different': 2}

## TfidfTransformer

TfidfVectorizer operates on raw text documents, while TfidfTransformer operates on an existing count matrix, such as the one returned by CountVectorizer.

In [12]:
from sklearn.feature_extraction.text import TfidfTransformer

In [17]:
tfidf = TfidfTransformer()

In [18]:
results = tfidf.fit_transform(sparse_matrix) # Bag Of Words --> TF-IDF

In [19]:
results.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])
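To see where these numbers come from, the sketch below recomputes them by hand with NumPy and compares against `TfidfTransformer`. It assumes the transformer's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is then scaled to unit length. The names `df`, `idf`, `weighted`, and `manual` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ['This is a line',
        'This is another line',
        'Completely different line']

sparse_counts = CountVectorizer().fit_transform(text)
counts = sparse_counts.toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf (default settings assumed)

weighted = counts * idf                     # term frequency times idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

sk = TfidfTransformer().fit_transform(sparse_counts).toarray()
print(np.allclose(manual, sk))
print(manual.round(4))
```

The term "line" appears in every document, so it gets the lowest idf and the smallest weight in each row, while terms unique to one document ("another", "completely", "different") get the largest.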

## TfidfVectorizer

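TfidfVectorizer performs both of the steps above (counting with CountVectorizer, then weighting with TfidfTransformer) in a single step!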

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [21]:
tv = TfidfVectorizer()

In [23]:
tv_results = tv.fit_transform(text)

In [24]:
tv_results.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])
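The printed matrix matches the TfidfTransformer output shown earlier. A quick way to confirm this programmatically, and to see how a fitted vectorizer handles text it was not fitted on, might look like the sketch below. The sentence passed to `transform` is made up for illustration, and the variable names are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

text = ['This is a line',
        'This is another line',
        'Completely different line']

# Two-step route: counts first, then TF-IDF weighting
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(text))

# One-step route
tv = TfidfVectorizer()
one_step = tv.fit_transform(text)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)

# Reuse the fitted vocabulary and idf weights on unseen text;
# words outside the learned vocabulary are ignored
new_doc = tv.transform(['A brand new line'])
print(new_doc.toarray())
```

Calling `transform` rather than `fit_transform` on new text keeps the columns aligned with the matrix the vectorizer was fitted on, which matters when the features feed a trained model.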