Here we'll see how to perform feature extraction using scikit-learn.

We'll be using three main APIs of scikit-learn:
1. Count Vectorizer
2. TF-IDF transformer
3. TF-IDF vectorizer

In [1]:
# Creating some artificial text

text = ['This is a line','This is another line','Completely different line']

So we can see that some words are shared between these lines (for example, "line" appears in all 3) and some are unique to a single line.

# 1. Count Vectorizer

The first thing we're going to do is explore the CountVectorizer, which essentially creates a bag-of-words model. This is what we did last time in the manual process.

But here we will use scikit-learn for that.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
#help(CountVectorizer)

So CountVectorizer is pretty straightforward: it counts the frequency of each word and then creates the numeric vectors, essentially what we did manually last time.
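As a rough reminder of what happens under the hood, here is a minimal pure-Python sketch of a bag-of-words count. It is not the manual process from last time and not scikit-learn's actual implementation; the names `tokenized`, `vocab` and `bow` are made up for illustration. It assumes lowercasing and tokens of two or more word characters, which mirrors CountVectorizer's default behaviour.

```python
import re

text = ['This is a line', 'This is another line', 'Completely different line']

# Lowercase each document and keep tokens with at least two word characters
# (assumption: this mirrors CountVectorizer's default token pattern).
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in text]

# Assign a column index to every unique token, in alphabetical order.
unique_tokens = sorted({tok for doc in tokenized for tok in doc})
vocab = {tok: idx for idx, tok in enumerate(unique_tokens)}

# One row per document, one column per vocabulary entry.
bow = [[0] * len(vocab) for _ in text]
for row, doc in enumerate(tokenized):
    for tok in doc:
        bow[row][vocab[tok]] += 1

print(vocab)
for row in bow:
    print(row)
```

Note that the single-character word "a" never makes it into the vocabulary under this token pattern.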

In [5]:
cv = CountVectorizer()

In [6]:
cv.fit_transform(text)

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

So it stores the result in a sparse matrix to save memory.

The shape is 3x6 because the input has 3 separate documents/sentences and 6 unique words in total. (The word "a" is not counted, because CountVectorizer's default tokenizer ignores single-character tokens.)

fit_transform does two things on the text: fit learns the unique vocabulary, and transform then performs the frequency counts on the three documents inside the list called "text".
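To make the two steps more concrete, the sketch below calls `fit` and `transform` separately and then transforms a sentence that was not part of the fitted data. The variable names (`cv_demo`, `counts`, `new_counts`) and the extra sentence are made up for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['This is a line', 'This is another line', 'Completely different line']

cv_demo = CountVectorizer()
cv_demo.fit(text)                 # step 1: learn the vocabulary only
counts = cv_demo.transform(text)  # step 2: count words using that vocabulary

# Transforming unseen text reuses the fitted vocabulary.
# Words that were not seen during fit ("brand", "new") are ignored.
new_counts = cv_demo.transform(['Another brand new line'])

print(counts.shape)
print(new_counts.toarray())
```

This is why, in a real project, we would typically fit on the training text only and just call transform on any new text.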

In [7]:
sparse_matrix = cv.fit_transform(text)

In [8]:
sparse_matrix

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [9]:
sparse_matrix.todense()

matrix([[0, 0, 0, 1, 1, 1],
        [1, 0, 0, 1, 1, 1],
        [0, 1, 1, 0, 1, 0]], dtype=int64)

#### Note: when dealing with really large datasets, we should avoid calling todense() to avoid memory issues.

The sparse matrix has only 6 columns because there are only 6 unique features. We can see them in the vocabulary, which is stored as a dict mapping each word to its column index.
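A few things can be inspected without densifying the whole matrix. The snippet below is only a sketch under the assumption that the result is a SciPy CSR matrix, as in the output above; `term_totals` and `first_row` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

text = ['This is a line', 'This is another line', 'Completely different line']
counts = CountVectorizer().fit_transform(text)

# Shape and number of stored (non-zero) entries are cheap to read.
print(counts.shape, counts.nnz)

# Column sums give the total count of each term across all documents.
# The result has only one value per feature, so it is small.
term_totals = np.asarray(counts.sum(axis=0)).ravel()
print(term_totals)

# If a dense view is needed, densify only a small slice, e.g. one row.
first_row = counts[0].toarray().ravel()
print(first_row)
```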

In [10]:
cv.vocabulary_

{'this': 5, 'is': 3, 'line': 4, 'another': 0, 'completely': 1, 'different': 2}

For example, the vocabulary reports that "another" is at index 0, and in the matrix above we can see that only the second sentence contains the word "another", exactly once.

The other two lines have zero for the word "another".
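The same lookup can be done in code. This is a small sketch with illustrative names (`cv_demo`, `col`, `ordered_terms`): it reads the column index for "another" from the vocabulary and also lists the terms in column order, since the dict itself is not ordered by index.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ['This is a line', 'This is another line', 'Completely different line']
cv_demo = CountVectorizer()
counts = cv_demo.fit_transform(text)

# Column index of the word "another"
col = cv_demo.vocabulary_['another']

# Counts of "another" in each of the three documents
another_counts = counts[:, col].toarray().ravel().tolist()
print(col, another_counts)

# Terms sorted by their column index, so position i matches column i
ordered_terms = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)
print(ordered_terms)
```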

### Removing the Stop words 

Here we remove the most common words in the English language, called stop words.

In [12]:
# Create the vectorizer again, but this time exclude the English stop words
cv = CountVectorizer(stop_words='english')

In [14]:
sparse_matrix = cv.fit_transform(text)

In [15]:
sparse_matrix.todense()

matrix([[0, 0, 1],
        [0, 0, 1],
        [1, 1, 1]], dtype=int64)

In [16]:
cv.vocabulary_

{'line': 2, 'completely': 0, 'different': 1}

So this time the CountVectorizer has removed all the words it considers stop words ("this", "is" and "another") and kept only the 3 remaining words.

This is probably what we want, since words like "this" and "is" are among the most common stop words and don't really add much value.
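To check which of our words are on scikit-learn's built-in English list, we can look them up in `ENGLISH_STOP_WORDS`, the set that `stop_words='english'` refers to. The helper dict `is_stop_word` is just an illustrative name.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

candidates = ['this', 'is', 'another', 'line', 'completely', 'different']

# Map each word to whether it appears in the built-in English stop word list
is_stop_word = {word: word in ENGLISH_STOP_WORDS for word in candidates}

for word, flag in is_stop_word.items():
    print(f"{word:<12} stop word: {flag}")
```

The built-in list is fairly generic, so for a specific domain it may be worth passing a custom list to `stop_words` instead.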

# 2. TF-IDF Transformer

We saw the CountVectorizer, but we want to be able to take this bag of words model and transform it into a "Term Frequency Inverse Document Frequency" vector.

We use scikit-learn's TfidfTransformer. It takes in a "vector of word frequency counts" and transforms it into a "vector of term frequency - inverse document frequency values".

In [17]:
from sklearn.feature_extraction.text import TfidfTransformer

In [18]:
tfidf = TfidfTransformer()

In [19]:
sparse_matrix

<3x3 sparse matrix of type '<class 'numpy.int64'>'
	with 5 stored elements in Compressed Sparse Row format>

So we have the sparse matrix which has the frequency counts and we'll be able to get the term frequency - inverse document frequency for it.

So first we can try this out without removing the stopwords to get a better idea.

In [21]:
cv = CountVectorizer()
sparse_mat = cv.fit_transform(text)

In [22]:
sparse_mat

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [24]:
# tfidf = TfidfTransformer()
results = tfidf.fit_transform(sparse_mat) # Pass in the BOW (bag-of-words) counts and convert them into tf-idf weights

In [25]:
results

<3x6 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [26]:
results.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])

In [28]:
cv.vocabulary_

{'this': 5, 'is': 3, 'line': 4, 'another': 0, 'completely': 1, 'different': 2}

In [29]:
import pandas as pd



In [30]:
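# vocabulary_ is a dict in insertion order, so sort the words by column index to label the columns correctly
pd.DataFrame(data=results.todense(), columns=sorted(cv.vocabulary_, key=cv.vocabulary_.get))

Unnamed: 0,another,completely,different,is,line,this
0,0.0,0.0,0.0,0.619805,0.481334,0.619805
1,0.631745,0.0,0.0,0.480458,0.373119,0.480458
2,0.0,0.652491,0.652491,0.0,0.385372,0.0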


### So here we can see the tf-idf values for each of those words for the given text. So this TfidfTransformer allows us to go from a bag of words to term frequency - inverse document frequency.
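The numbers above can be reproduced by hand. The sketch below assumes TfidfTransformer's default settings: a smoothed idf of `ln((1 + n_docs) / (1 + df)) + 1`, followed by L2 normalisation of each row. It is a plain-Python illustration, not the library's implementation, and `tfidf_manual` is a made-up name.

```python
import math

# Raw counts from the CountVectorizer output above.
# Columns by index: another, completely, different, is, line, this
counts = [
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term appear?
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf_manual = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]    # tf * idf
    norm = math.sqrt(sum(v * v for v in weighted))    # L2 norm of the row
    tfidf_manual.append([v / norm for v in weighted])

for row in tfidf_manual:
    print([round(v, 6) for v in row])
```

"line" appears in every document, so it gets the lowest idf and therefore the smallest weight in each row, while words that appear in only one document get the largest weight.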

# 3. TF-IDF Vectorizer

Now we have seen the CountVectorizer and the TfidfTransformer, but separately.

Scikit-learn has something called TfidfVectorizer that does both of them in one step.

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [32]:
tv = TfidfVectorizer()

In [33]:
tv_results = tv.fit_transform(text)

In [34]:
tv_results

<3x6 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [36]:
tv_results.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])

In [37]:
tv.vocabulary_

{'this': 5, 'is': 3, 'line': 4, 'another': 0, 'completely': 1, 'different': 2}

In [38]:
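# vocabulary_ is a dict in insertion order, so sort the words by column index to label the columns correctly
pd.DataFrame(data=tv_results.todense(), columns=sorted(tv.vocabulary_, key=tv.vocabulary_.get))

Unnamed: 0,another,completely,different,is,line,this
0,0.0,0.0,0.0,0.619805,0.481334,0.619805
1,0.631745,0.0,0.0,0.480458,0.373119,0.480458
2,0.0,0.652491,0.652491,0.0,0.385372,0.0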


#### So above we can see the combined result of Count Vectorizer and Tf-idf transformer in a single step.
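Rather than comparing the two printed matrices by eye, we can check the equivalence numerically. This is a small sketch with default settings on both sides; `two_step`, `one_step` and `same` are illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

text = ['This is a line', 'This is another line', 'Completely different line']

# Two steps: raw counts first, then tf-idf weighting
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(text))

# One step: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(text)

# The dataset is tiny, so densifying for the comparison is fine here
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```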

#### Again, the point to note is that here we had a really small dataset with very few documents, so we could display the results.

#### But with really big datasets this might not be possible: if we try to display the dense results, rendering can take a very long time and the program might crash.
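For larger datasets, one option is to look only at summaries of the sparse result. The sketch below is one possible approach, not the only one: it reads the CSR arrays (`indptr`, `indices`, `data`) directly to find the two highest-weighted terms per document without ever building a dense matrix. The names `weights`, `terms` and `top_terms` are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['This is a line', 'This is another line', 'Completely different line']

tv_demo = TfidfVectorizer()
weights = tv_demo.fit_transform(text)

# Terms ordered by column index
terms = sorted(tv_demo.vocabulary_, key=tv_demo.vocabulary_.get)

# Cheap summary: shape and number of stored non-zero values
print(weights.shape, weights.nnz)

top_terms = []
for i in range(weights.shape[0]):
    # Non-zero entries of row i live in this slice of the CSR arrays
    start, end = weights.indptr[i], weights.indptr[i + 1]
    cols = weights.indices[start:end]
    vals = weights.data[start:end]
    # Indices of the two largest weights in this row
    best = cols[np.argsort(vals)[::-1][:2]]
    top_terms.append([terms[j] for j in best])

for i, row_terms in enumerate(top_terms):
    print(i, row_terms)
```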