# Extracting features from text

Many machine learning problems use text as an explanatory variable. Before a
model can use it, text must be transformed into a numeric representation that
encodes as much of its meaning as possible in a feature vector. The following
sections review variations of the most common representation of text used in
machine learning: the bag-of-words model.

## The bag-of-words representation

The bag-of-words model:

 - creates one feature for each word of interest in the text
 - is used effectively for document classification and retrieval

A collection of documents is called a corpus. Let's use a corpus with the following
two documents to examine the bag-of-words model:

In [1]:
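from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]

vectorizer = CountVectorizer()

In [2]:
# Learn the vocabulary and print one row of word counts per document
print(vectorizer.fit_transform(corpus).toarray())
# Mapping from each word to its column index in the count matrix
print(vectorizer.vocabulary_)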

[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
{'played': 5, 'duke': 1, 'basketball': 0, 'in': 3, 'lost': 4, 'game': 2, 'the': 6, 'unc': 7}
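
To make explicit what the vectorizer computed above, here is a minimal pure-Python sketch of the same bag-of-words encoding. It assumes the documented `CountVectorizer` defaults of lowercasing the text and keeping only tokens of two or more word characters; the helper names `tokenize` and `to_count_vector` are illustrative and not part of scikit-learn.

```python
import re

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
]

def tokenize(document):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", document.lower())

# Vocabulary: every distinct token, mapped to a column index in sorted order
distinct_tokens = {token for document in corpus for token in tokenize(document)}
vocabulary = {word: index for index, word in enumerate(sorted(distinct_tokens))}

def to_count_vector(document):
    # One slot per vocabulary word, incremented for each occurrence
    vector = [0] * len(vocabulary)
    for token in tokenize(document):
        vector[vocabulary[token]] += 1
    return vector

vectors = [to_count_vector(document) for document in corpus]
for vector in vectors:
    print(vector)
print(vocabulary)
```

Word order is discarded: each document is reduced to how often each vocabulary word occurs in it, which is why the representation is called a "bag" of words.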


In [None]:
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]

In [None]:
vectorizer = CountVectorizer()

In [None]:
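# Dense array with one row of word counts per document
counts = vectorizer.fit_transform(corpus).toarray()
counts

In [None]:
print(vectorizer.vocabulary_)

In [None]:
from sklearn.metrics.pairwise import euclidean_distances

# euclidean_distances expects 2-D inputs, so slice out single-row arrays
print('Distance between 1st and 2nd documents:', euclidean_distances(counts[0:1], counts[1:2]))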
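
The Euclidean distance between two count vectors is the square root of the sum of squared differences between their word counts, so documents that share more words lie closer together. The sketch below computes it by hand for every pair of documents in the three-document corpus. It assumes the same tokenization as `CountVectorizer`'s defaults; the helpers `word_counts` and `euclidean_distance` are illustrative names.

```python
import re
from collections import Counter
from itertools import combinations
from math import sqrt

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich',
]

def word_counts(document):
    # Lowercase, keep tokens of two or more word characters, count them
    return Counter(re.findall(r"\b\w\w+\b", document.lower()))

def euclidean_distance(a, b):
    # A Counter returns 0 for words it has not seen, so iterating over the
    # union of both vocabularies covers every dimension that can differ
    return sqrt(sum((a[word] - b[word]) ** 2 for word in set(a) | set(b)))

bags = [word_counts(document) for document in corpus]
distances = {
    (i, j): euclidean_distance(bags[i], bags[j])
    for i, j in combinations(range(len(bags)), 2)
}
for (i, j), distance in distances.items():
    print(f'Distance between documents {i + 1} and {j + 1}: {distance:.3f}')
```

The two basketball documents differ in six words, while each of them shares no words at all with the sandwich document, so the first pair ends up closest.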

## Stop-word filtering

In [None]:
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]

In [None]:
# Drop common English words such as 'the' and 'in' before counting
vectorizer = CountVectorizer(stop_words='english')

In [None]:
print(vectorizer.fit_transform(corpus).toarray())
print(vectorizer.vocabulary_)
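
To see exactly which features the stop-word filter removes, one option is to fit the vectorizer with and without `stop_words='english'` and compare the two vocabularies. This is a small sketch on the same corpus; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich',
]

# Vocabulary learned without any filtering
all_words = set(CountVectorizer().fit(corpus).vocabulary_)
# Vocabulary learned with scikit-learn's built-in English stop-word list
kept_words = set(CountVectorizer(stop_words='english').fit(corpus).vocabulary_)

removed = sorted(all_words - kept_words)
print('Removed as stop words:', removed)
print('Features before:', len(all_words), 'after:', len(kept_words))
```

Note that `I` and `a` never appear in either vocabulary: the default tokenizer already discards single-character tokens before stop-word filtering is applied.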


In [None]:
corpus = [
    'He ate the sandwiches',
    'Every sandwich was eaten by him'
]

In [None]:
vectorizer = CountVectorizer(stop_words='english')

In [None]:
counts = vectorizer.fit_transform(corpus).toarray()
print(counts)
print(vectorizer.vocabulary_)

In [None]:
from sklearn.metrics.pairwise import euclidean_distances

# Inflected forms ('sandwich'/'sandwiches', 'ate'/'eaten') count as different words
print('Distance between 1st and 2nd documents:', euclidean_distances(counts[0:1], counts[1:2]))

## Stemming and lemmatization

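The previous example treats `sandwich` and `sandwiches`, and `ate` and `eaten`, as unrelated features even though the two documents mean nearly the same thing. Stemming and lemmatization reduce inflected words to a common base form to address this. The Natural Language Toolkit (NLTK) library is commonly used for this purpose, but it is outside the scope of this tutorial.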

## Extending bag-of-words with TF-IDF weights

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
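vectorizer = TfidfVectorizer(stop_words='english')
# Same layout as the count matrix, but each entry is a TF-IDF weight
tfidf = vectorizer.fit_transform(corpus).toarray()
print(tfidf)

In [None]:
from sklearn.metrics.pairwise import euclidean_distances

# euclidean_distances expects 2-D inputs, so slice out single-row arrays
print('Distance between 1st and 2nd documents:', euclidean_distances(tfidf[0:1], tfidf[1:2]))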
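
TF-IDF scales each word count by how rare the word is across the corpus, so words that appear in every document contribute less than distinctive ones. The following is a minimal pure-Python sketch of that weighting, assuming the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`). It reuses the count vectors from the first two-document corpus, without stop-word filtering; `tfidf_row` is an illustrative helper name.

```python
import math

# Count vectors for 'UNC played Duke in basketball' and
# 'Duke lost the basketball game' (columns in alphabetical word order)
vocabulary = ['basketball', 'duke', 'game', 'in', 'lost', 'played', 'the', 'unc']
counts = [
    [1, 1, 0, 1, 0, 1, 0, 1],
    [1, 1, 1, 0, 1, 0, 1, 0],
]

n_documents = len(counts)
# Document frequency: number of documents that contain each word
document_frequency = [
    sum(1 for row in counts if row[column] > 0)
    for column in range(len(vocabulary))
]
# Smoothed inverse document frequency: rarer words get larger weights
idf = [math.log((1 + n_documents) / (1 + df)) + 1 for df in document_frequency]

def tfidf_row(row):
    # Weight each term count by its IDF, then scale the row to unit length
    weighted = [tf * weight for tf, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    return [value / norm for value in weighted]

tfidf = [tfidf_row(row) for row in counts]
for word, weight in zip(vocabulary, tfidf[0]):
    print(f'{word:12s}{weight:.3f}')
```

In the first document, `basketball` and `duke` occur in both documents and therefore receive smaller weights than `unc`, `played`, and `in`, which occur only once in the corpus.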