# Feature extraction from text

## Part 1: Manual feature extraction (without using libraries)

The texts are kept very simple, with no punctuation, so that splitting on whitespace is enough to tokenize them.

In [1]:
import pandas as pd

In [2]:
words_one = "This is a story about dogs our canine pets Dogs are furry animals".lower().split()
unique_words_one = set(words_one)
    
unique_words_one

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'dogs',
 'furry',
 'is',
 'our',
 'pets',
 'story',
 'this'}

In [3]:
words_two = "This story is about surfing Catching waves is fun Surfing is a popular water sport".lower().split()
unique_words_two = set(words_two)
    
unique_words_two

{'a',
 'about',
 'catching',
 'fun',
 'is',
 'popular',
 'sport',
 'story',
 'surfing',
 'this',
 'water',
 'waves'}

In [4]:
all_unique_words = set()
all_unique_words.update(unique_words_one)
all_unique_words.update(unique_words_two)

all_unique_words

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'catching',
 'dogs',
 'fun',
 'furry',
 'is',
 'our',
 'pets',
 'popular',
 'sport',
 'story',
 'surfing',
 'this',
 'water',
 'waves'}

In [5]:
# Assign each unique word a column index.
# Set iteration order is arbitrary, so the indices may differ between runs.
full_vocab = {word: i for i, word in enumerate(all_unique_words)}
    
full_vocab

{'furry': 0,
 'water': 1,
 'is': 2,
 'animals': 3,
 'story': 4,
 'about': 5,
 'are': 6,
 'waves': 7,
 'surfing': 8,
 'our': 9,
 'catching': 10,
 'dogs': 11,
 'pets': 12,
 'this': 13,
 'fun': 14,
 'sport': 15,
 'canine': 16,
 'a': 17,
 'popular': 18}

In [6]:
one_freq = [0] * len(full_vocab)
two_freq = [0] * len(full_vocab)
all_words = [''] * len(full_vocab)

print(one_freq)
print(two_freq)
print(all_words)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


In [7]:
one_text = "This is a story about dogs our canine pets Dogs are furry animals".lower().split()
    
for word in one_text:
    word_ind = full_vocab[word]
    one_freq[word_ind] += 1
    
one_freq

[1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 2, 1, 1, 0, 0, 1, 1, 0]

In [8]:
two_text = "This story is about surfing Catching waves is fun Surfing is a popular water sport".lower().split()
    
for word in two_text:
    word_ind = full_vocab[word]
    two_freq[word_ind] += 1

two_freq    

[0, 1, 3, 0, 1, 1, 0, 1, 2, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]

In [9]:
for word in full_vocab:
    word_ind = full_vocab[word]
    all_words[word_ind] = word

Putting both count vectors into a DataFrame gives us our **bag of words**:

In [10]:
pd.DataFrame(data=[one_freq, two_freq], columns=all_words)

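|   | furry | water | is | animals | story | about | are | waves | surfing | our | catching | dogs | pets | this | fun | sport | canine | a | popular |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 1 | 0 | 1 | 3 | 0 | 1 | 1 | 0 | 1 | 2 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |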


Comparing the two vectors, we see that some words appear in both texts and others in only one. Extending this logic to tens of thousands of documents, the vocabulary could grow to hundreds of thousands of words, and each document vector would consist mostly of zeros. Such data is best stored as a sparse matrix.
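To make the sparsity point concrete, here is a minimal sketch that keeps only the non-zero entries of each count vector. The helper names `to_sparse` and `sparsity` are hypothetical and chosen for illustration; the two vectors are copied from the outputs above.

```python
# Count vectors copied from the outputs of In [7] and In [8]
one_freq = [1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 2, 1, 1, 0, 0, 1, 1, 0]
two_freq = [0, 1, 3, 0, 1, 1, 0, 1, 2, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]

def to_sparse(vector):
    """Keep only the non-zero entries as {column_index: count} (hypothetical helper)."""
    return {i: v for i, v in enumerate(vector) if v != 0}

def sparsity(vector):
    """Fraction of entries that are zero (hypothetical helper)."""
    return vector.count(0) / len(vector)

sparse_one = to_sparse(one_freq)
sparse_two = to_sparse(two_freq)

print(sparse_one)
print(sparsity(one_freq), sparsity(two_freq))
```

With only two short texts about a third of each vector is zero; with a large vocabulary that fraction approaches one, which is why storing only the non-zero entries pays off.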

### Bag of Words and Tf-idf

In the examples above, each vector can be considered a bag of words. On their own, raw counts may not be very helpful until we consider **term frequencies**, or how often individual words appear in a document. A simple way to calculate a term frequency is to divide the number of occurrences of a word by the total number of words in the document. This normalization makes counts from long documents comparable with counts from short ones.
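The term frequency calculation described above can be sketched in a few lines. The function name `term_frequencies` is hypothetical, and the sketch assumes whitespace tokenization as in Part 1.

```python
from collections import Counter

def term_frequencies(text):
    """Return {word: occurrences / total_words} for a whitespace-tokenized text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

doc_one = "This is a story about dogs our canine pets Dogs are furry animals"
tf_one = term_frequencies(doc_one)

# "dogs" occurs 2 times among the 13 words of the text
print(tf_one["dogs"])
```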

It may be hard to differentiate documents based on term frequency alone if a word shows up in a majority of documents. To handle this we also consider **inverse document frequency**, which is the total number of documents divided by the number of documents that contain the word. In practice we take the logarithm of this ratio. The tf-idf weight of a word in a document is then its term frequency multiplied by its inverse document frequency.
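As a rough illustration of the definition above, the following sketch computes the plain logarithmic inverse document frequency for the two texts from Part 1. The function name is hypothetical, and the sketch assumes the word occurs in at least one document (otherwise the division fails).

```python
import math

docs = [
    "This is a story about dogs our canine pets Dogs are furry animals",
    "This story is about surfing Catching waves is fun Surfing is a popular water sport",
]
# One set of unique words per document
tokenized_docs = [set(doc.lower().split()) for doc in docs]

def inverse_document_frequency(word, tokenized_docs):
    """log(total documents / documents containing the word); hypothetical helper."""
    n_docs = len(tokenized_docs)
    n_containing = sum(1 for doc in tokenized_docs if word in doc)
    return math.log(n_docs / n_containing)

idf_story = inverse_document_frequency("story", tokenized_docs)  # in both texts
idf_dogs = inverse_document_frequency("dogs", tokenized_docs)    # in one text only

print(idf_story, idf_dogs)
```

A word found in every document gets an inverse document frequency of zero, so it contributes nothing to telling the documents apart, while a word found in only one document gets the largest weight.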

### Stop Words and Word Stems

Some words, like "the" and "and", appear so frequently and in so many documents that we needn't bother counting them. It may also make sense to record only the root of a word, say "cat" in place of both "cat" and "cats". Both steps shrink the vocabulary and can improve performance.
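A deliberately crude sketch of both ideas follows. The stop word list and the `naive_stem` rule are illustrative assumptions, not a real stop list or stemming algorithm; an established stemmer, such as the Porter stemmer, handles far more cases.

```python
# Tiny illustrative stop word list (an assumption, not a standard list)
STOP_WORDS = {"the", "and", "a", "is", "are", "this", "our", "about"}

def naive_stem(word):
    """Strip a trailing 's' from longer words; a crude stand-in for real stemming."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, then stem what remains."""
    return [naive_stem(w) for w in text.lower().split() if w not in STOP_WORDS]

tokens = preprocess("This is a story about dogs our canine pets Dogs are furry animals")
print(tokens)
print(len(set(tokens)))
```

For the first text from Part 1 this reduces the vocabulary from 12 unique words to 6.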

## Part 2: Feature extraction with Scikit-Learn

In [11]:
text = ["This is a line", "This is another line", "Completely different line"]

### CountVectorizer

In [12]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

All words

In [13]:
cv = CountVectorizer()
sparse_matrix = cv.fit_transform(text)
sparse_matrix.todense()

matrix([[0, 0, 0, 1, 1, 1],
        [1, 0, 0, 1, 1, 1],
        [0, 1, 1, 0, 1, 0]], dtype=int64)

In [14]:
cv.vocabulary_

{'this': 5, 'is': 3, 'line': 4, 'another': 0, 'completely': 1, 'different': 2}
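The vocabulary maps each word to its column index in the count matrix, so sorting the words by index recovers the column order. The sketch below works on a copy of the dictionary printed above and needs no extra library. Note that "a" is missing from the vocabulary: with its default token pattern, CountVectorizer only keeps tokens of two or more characters.

```python
# Copy of the cv.vocabulary_ output shown above
vocabulary = {'this': 5, 'is': 3, 'line': 4, 'another': 0, 'completely': 1, 'different': 2}

# Sort the words by their column index to get the column labels in order
columns = sorted(vocabulary, key=vocabulary.get)
print(columns)

# First row of the count matrix above, paired with its column labels
first_row = [0, 0, 0, 1, 1, 1]
print(dict(zip(columns, first_row)))
```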

Without stop words

In [15]:
cv = CountVectorizer(stop_words='english')

In [16]:
sparse_matrix = cv.fit_transform(text)
sparse_matrix.todense()

matrix([[0, 0, 1],
        [0, 0, 1],
        [1, 1, 1]], dtype=int64)

In [17]:
cv.vocabulary_

{'line': 2, 'completely': 0, 'different': 1}

### TfidfTransformer

TfidfVectorizer works on raw text documents, while TfidfTransformer works on an existing count matrix, such as the one returned by CountVectorizer.

In [18]:
tfidf_transformer = TfidfTransformer()

In [19]:
cv = CountVectorizer()

In [20]:
sparse_matrix = cv.fit_transform(text)
sparse_matrix

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [21]:
tfidf = tfidf_transformer.fit_transform(sparse_matrix)
tfidf.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])
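The values above differ from the plain formula in Part 1 because, with its default settings, TfidfTransformer uses the raw counts as term frequencies, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean length. The following pure-Python sketch reproduces the matrix under those assumptions; the variable names are illustrative.

```python
import math

# Count matrix copied from the CountVectorizer output above
counts = [
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents in which each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (scikit-learn default, smooth_idf=True)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf_manual = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm of the row
    tfidf_manual.append([v / norm for v in weighted])

for row in tfidf_manual:
    print([round(v, 8) for v in row])
```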

### TfidfVectorizer

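TfidfVectorizer performs both of the steps above, counting and tf-idf weighting, in a single step, so its output matches the TfidfTransformer result.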

In [22]:
tfidf = TfidfVectorizer()

In [23]:
sparse_matrix = tfidf.fit_transform(text)
sparse_matrix.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])