# Feature Extraction from Text

### Feature Extraction from Text with scikit-learn

This notebook shows how to turn raw text into numeric feature vectors with scikit-learn.

We will discuss the following:

##### Bag-of-Words (BoW)
##### N-Gram
Both are built with `CountVectorizer`: **https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer**

##### TF-IDF
Built with `TfidfVectorizer`: **https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html**

In [3]:

from sklearn.feature_extraction.text import CountVectorizer

In [4]:
corpus = ['This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?']

## Bag-of-Words (BoW) Based on Word Counts

In [16]:
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(corpus)

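# Feature names: the vocabulary words, in column order.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vectorizer.get_feature_names_out().tolist())

print()

# Word indices in the vocabulary
print(vectorizer.vocabulary_)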

print()

# Numeric Vectors
print(X.toarray())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
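
To make the counts above less opaque, here is a minimal pure-Python sketch of what a word-count bag-of-words does. It assumes `CountVectorizer`'s default settings (lowercasing, and a token pattern that keeps tokens of two or more word characters). The helper names `tokenize` and `count_vector` are made up for illustration and are not part of scikit-learn.

```python
import re
from collections import Counter

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Assumption: mimic CountVectorizer's default tokenization
# (lowercase, then tokens of two or more word characters).
token_pattern = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Hypothetical helper: split one document into lowercase tokens."""
    return token_pattern.findall(doc.lower())

# The vocabulary is the sorted set of all tokens; each token gets one column.
vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})

def count_vector(doc):
    """Hypothetical helper: count how often each vocabulary word occurs."""
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in vocab]

bow = [count_vector(doc) for doc in corpus]

print(vocab)
for row in bow:
    print(row)
```

Under these assumptions, the printed rows should agree with the matrix from `X.toarray()` above. Note that word order is discarded: the first and fourth documents get identical vectors.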


## N-Gram Feature Vectors

In [18]:
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))

X2 = vectorizer2.fit_transform(corpus)

# Feature names
#print(vectorizer2.get_feature_names_out())

#print()

# N-gram indices in the vocabulary
print(vectorizer2.vocabulary_)

print()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print('Dimension of the N-gram feature vector: ', len(vectorizer2.get_feature_names_out()))

print()
# Numeric Vectors
print(X2.toarray())

{'this is': 11, 'is the': 3, 'the first': 6, 'first document': 2, 'this document': 10, 'document is': 1, 'the second': 7, 'second document': 5, 'and this': 0, 'the third': 8, 'third one': 9, 'is this': 4, 'this the': 12}

Dimension of the N-gram feature vector:  13

[[0 0 1 1 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 1 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 1]]
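
As a rough illustration of where the 13 bigram features come from, the sketch below slides a window of two tokens over each document. It assumes the same default tokenization as `CountVectorizer`; the function name `word_ngrams` is hypothetical and only used here.

```python
import re

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Assumption: same default tokenization as CountVectorizer.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(doc, n):
    """Hypothetical helper: return the word n-grams of one document, in order."""
    tokens = token_pattern.findall(doc.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Bigram vocabulary: sorted set of every bigram seen in the corpus.
bigram_vocab = sorted({gram for doc in corpus for gram in word_ngrams(doc, 2)})

print(word_ngrams(corpus[0], 2))
print(len(bigram_vocab))
```

In `CountVectorizer`, `ngram_range=(2, 2)` keeps only bigrams, while `ngram_range=(1, 2)` would keep unigrams and bigrams together. Unlike the plain bag-of-words, bigrams retain some word order, so the first and fourth documents no longer map to the same vector.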


## TF-IDF Feature Vector

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(corpus)

#print(vectorizer.get_feature_names_out())
print()

# Word indices in the vocabulary
print(vectorizer.vocabulary_)

print()


{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}



In [14]:
print(X.shape)

(4, 9)


In [12]:
# Numeric Vectors
print(X.toarray())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
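
The weights above can be reproduced by hand, which may help show what TF-IDF actually computes. The sketch below assumes `TfidfVectorizer`'s default settings: raw term counts as TF, smoothed IDF `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and L2 normalization of each row. The helper name `tfidf_vector` is hypothetical.

```python
import math
import re
from collections import Counter

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Assumption: same default tokenization as TfidfVectorizer.
token_pattern = re.compile(r"(?u)\b\w\w+\b")
docs = [Counter(token_pattern.findall(d.lower())) for d in corpus]
vocab = sorted(set().union(*docs))
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = {t: sum(1 for d in docs if t in d) for t in vocab}

# Smoothed IDF (assumed defaults: smooth_idf=True, sublinear_tf=False).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(counts):
    """Hypothetical helper: TF * IDF per term, then L2-normalize the row."""
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf = [tfidf_vector(d) for d in docs]

for row in tfidf:
    print([round(v, 8) for v in row])
```

Terms that occur in every document (`is`, `the`, `this`) get the minimum IDF of 1, so rarer terms such as `first` or `second` end up with larger weights in the rows where they appear.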
