___

<a href='http://www.pieriandata.com'><img src='../Pierian_Data_Logo.png'/></a>
___
<center><em>Copyright by Pierian Data Inc.</em></center>
<center><em>For more information, visit us at <a href='http://www.pieriandata.com'>www.pieriandata.com</a></em></center>

# Feature Extraction from Text

This notebook is divided into two sections:
* First, we'll find out what is necessary to build an NLP system that can turn a body of text into a numerical array of *features* by manually calculating frequencies and building out TF-IDF.
* Next, we'll show how to perform these steps using scikit-learn tools.

# Part One: Core Concepts on Feature Extraction


In this section we'll use basic Python to build a rudimentary NLP system. We'll build a *corpus of documents* (two small text files), create a *vocabulary* from all the words in both documents, and then demonstrate a *Bag of Words* technique to extract features from each document.<br>
<div class="alert alert-info" style="margin: 20px">This first section is for illustration only!
<br>Don't worry about memorizing this code. Later on we will let scikit-learn's text feature extraction tools do this for us.</div>

In [33]:
import pandas as pd

## Start with some documents:
For simplicity, the text files One.txt and Two.txt contain no punctuation. Let's quickly open and read them. Keep in mind that you should avoid reading an entire file at once if it is very large: `read()` loads the whole contents into memory, and printing the result displays all of it.
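For larger files, one option is to iterate over the file object, which yields one line at a time instead of loading everything into memory. The following is a minimal sketch of that idea; the temporary file and the `count_words` helper are hypothetical stand-ins (not part of the original notebook) used so the example runs on its own.

```python
import os
import tempfile

def count_words(path):
    """Count whitespace-separated words without loading the whole file."""
    total = 0
    with open(path) as f:
        for line in f:  # the file object yields one line at a time
            total += len(line.split())
    return total

# Hypothetical stand-in for One.txt so the example is self-contained
sample = "This is a story about dogs\nour canine pets\nDogs are furry animals\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

n_words = count_words(tmp_path)
os.remove(tmp_path)  # clean up the temporary file
print(n_words)
```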


In [2]:
with open('One.txt') as mytext:
    print(mytext.read())

This is a story about dogs
our canine pets
Dogs are furry animals



In [3]:
with open('Two.txt') as mytext:
    print(mytext.read())

This story is about surfing
Catching waves is fun
Surfing is a popular water sport



### Reading entire text as a string

In [4]:
with open('One.txt') as mytext:
    entire_text = mytext.read()

In [5]:
entire_text

'This is a story about dogs\nour canine pets\nDogs are furry animals\n'

In [6]:
print(entire_text)

This is a story about dogs
our canine pets
Dogs are furry animals



### Reading Each Line as a List

In [7]:
with open('One.txt') as mytext:
    lines = mytext.readlines()

In [8]:
lines

['This is a story about dogs\n',
 'our canine pets\n',
 'Dogs are furry animals\n']

### Reading in Words Separately

In [9]:
with open('One.txt') as f:
    words = f.read().lower().split()

In [10]:
words

['this',
 'is',
 'a',
 'story',
 'about',
 'dogs',
 'our',
 'canine',
 'pets',
 'dogs',
 'are',
 'furry',
 'animals']

## Building a vocabulary (Creating a "Bag of Words")

Next, we create a dictionary that maps each unique word in the documents to an index. We can think of this as mapping out all the possible words available across all (both) documents.

In [11]:
with open('One.txt') as f:
    words_one = f.read().lower().split()

In [12]:
words_one

['this',
 'is',
 'a',
 'story',
 'about',
 'dogs',
 'our',
 'canine',
 'pets',
 'dogs',
 'are',
 'furry',
 'animals']

In [13]:
len(words_one)

13

In [14]:
uni_words_one = set(words_one)

In [15]:
uni_words_one

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'dogs',
 'furry',
 'is',
 'our',
 'pets',
 'story',
 'this'}

**Repeat for Two.txt**

In [16]:
with open('Two.txt') as f:
    words_two = f.read().lower().split()
    uni_words_two = set(words_two)

In [17]:
uni_words_two

{'a',
 'about',
 'catching',
 'fun',
 'is',
 'popular',
 'sport',
 'story',
 'surfing',
 'this',
 'water',
 'waves'}

**Get all unique words across all documents**

In [18]:
all_uni_words = set()
all_uni_words.update(uni_words_one)
all_uni_words.update(uni_words_two)

In [19]:
all_uni_words

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'catching',
 'dogs',
 'fun',
 'furry',
 'is',
 'our',
 'pets',
 'popular',
 'sport',
 'story',
 'surfing',
 'this',
 'water',
 'waves'}

In [20]:
full_vocab = dict()
i = 0

for word in all_uni_words:
    full_vocab[word] = i
    i = i+1

In [21]:
# Iterating over a set follows its internal hash-based order, which is arbitrary, not alphabetical!
full_vocab

{'are': 0,
 'fun': 1,
 'about': 2,
 'a': 3,
 'story': 4,
 'animals': 5,
 'water': 6,
 'sport': 7,
 'pets': 8,
 'popular': 9,
 'surfing': 10,
 'is': 11,
 'our': 12,
 'this': 13,
 'waves': 14,
 'catching': 15,
 'canine': 16,
 'furry': 17,
 'dogs': 18}
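Because set iteration order is arbitrary, the indices above can differ from one run to the next. If a reproducible mapping is wanted, one possible approach is to sort the words before numbering them. This is only a sketch: the names `doc_one`, `doc_two`, and `sorted_vocab` are hypothetical, and the document strings are copied from the file contents printed earlier rather than read from disk.

```python
# Contents of One.txt and Two.txt as shown earlier, inlined for a self-contained example
doc_one = "This is a story about dogs\nour canine pets\nDogs are furry animals\n"
doc_two = "This story is about surfing\nCatching waves is fun\nSurfing is a popular water sport\n"

# Union of the unique lowercase words in both documents
unique_words = set(doc_one.lower().split()) | set(doc_two.lower().split())

# Sorting first makes the word-to-index mapping identical on every run
sorted_vocab = {word: index for index, word in enumerate(sorted(unique_words))}

print(len(sorted_vocab))
print(sorted_vocab["a"], sorted_vocab["dogs"], sorted_vocab["waves"])
```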

## Bag of Words to Frequency Counts

Now that we've encapsulated our "entire language" in a dictionary, let's perform *feature extraction* on each of our original documents:

**Empty counts per doc**

In [22]:
# Create an empty vector with space for each word in the vocabulary:
one_freq = [0]*len(full_vocab)
two_freq = [0]*len(full_vocab)
all_words = ['']*len(full_vocab)

In [23]:
one_freq

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [24]:
two_freq

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [25]:
all_words

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

In [26]:
for word in full_vocab:
    word_ind = full_vocab[word]
    all_words[word_ind] = word    

In [27]:
all_words

['are',
 'fun',
 'about',
 'a',
 'story',
 'animals',
 'water',
 'sport',
 'pets',
 'popular',
 'surfing',
 'is',
 'our',
 'this',
 'waves',
 'catching',
 'canine',
 'furry',
 'dogs']

**Add in counts per word per doc:**

In [28]:
# Map the frequencies of each word in One.txt to our vector:
with open('One.txt') as f:
    one_text = f.read().lower().split()
    
for word in one_text:
    word_ind = full_vocab[word]
    one_freq[word_ind]+=1

In [29]:
one_freq

[1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 2]

In [30]:
# Do the same for the second document:
with open('Two.txt') as f:
    two_text = f.read().lower().split()
    
for word in two_text:
    word_ind = full_vocab[word]
    two_freq[word_ind]+=1

In [31]:
two_freq

[0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 2, 3, 0, 1, 1, 1, 0, 0, 0]

In [34]:
pd.DataFrame(data=[one_freq,two_freq],columns=all_words)

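|   | are | fun | about | a | story | animals | water | sport | pets | popular | surfing | is | our | this | waves | catching | canine | furry | dogs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 2 |
| 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 2 | 3 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |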
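The same per-document counts can also be produced more compactly with `collections.Counter` from the standard library. The sketch below is an alternative to the loops above, not the notebook's own method; `count_vector`, `vocab`, and the inlined document strings are hypothetical names introduced for illustration, and the vocabulary is sorted alphabetically, so the column order differs from the table above.

```python
from collections import Counter

# Contents of One.txt and Two.txt as shown earlier, inlined for a self-contained example
doc_one = "This is a story about dogs\nour canine pets\nDogs are furry animals\n"
doc_two = "This story is about surfing\nCatching waves is fun\nSurfing is a popular water sport\n"

# Alphabetically sorted vocabulary over both documents
vocab = sorted(set(doc_one.lower().split()) | set(doc_two.lower().split()))

def count_vector(text, vocab):
    """Return the count of each vocabulary word in the given text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]  # a Counter returns 0 for missing words

one_vec = count_vector(doc_one, vocab)
two_vec = count_vector(doc_two, vocab)

print(one_vec[vocab.index("dogs")])  # count of 'dogs' in the first document
print(two_vec[vocab.index("is")])    # count of 'is' in the second document
```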


# Part Two:  Feature Extraction with Scikit-Learn

Let's explore the more realistic process of using scikit-learn to complete the tasks mentioned above!

# Scikit-Learn's Text Feature Extraction Options

In [None]:
text = ['This is a line',
        'This is another line',
        'Completely different line']

## CountVectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer,TfidfVectorizer,CountVectorizer

In [None]:
cv = CountVectorizer()

In [None]:
cv.fit_transform(text)

In [None]:
sparse_mat = cv.fit_transform(text)

In [None]:
sparse_mat.todense()

In [None]:
cv.vocabulary_
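With its default settings, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters (its default `token_pattern` is `(?u)\b\w\w+\b`), so the single-letter word "a" never reaches the vocabulary. The sketch below approximates that behavior with the standard `re` module. It is an illustration under those assumptions, not scikit-learn's actual implementation, and `approx_vocab` is a hypothetical name.

```python
import re

text = ['This is a line',
        'This is another line',
        'Completely different line']

# Same regular expression as CountVectorizer's default token_pattern
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# Lowercase each document, then extract tokens of two or more word characters
tokenized = [token_pattern.findall(doc.lower()) for doc in text]

# Number the unique tokens in alphabetical order
unique_tokens = sorted({token for doc in tokenized for token in doc})
approx_vocab = {token: index for index, token in enumerate(unique_tokens)}

print(approx_vocab)
```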

In [None]:
cv = CountVectorizer(stop_words='english')

In [None]:
cv.fit_transform(text).todense()

In [None]:
cv.vocabulary_

## TfidfTransformer

TfidfVectorizer is applied to raw text documents, while TfidfTransformer is applied to an existing count matrix, such as the one returned by CountVectorizer.

In [None]:
tfidf_transformer = TfidfTransformer()

In [None]:
cv = CountVectorizer()

In [None]:
counts = cv.fit_transform(text)

In [None]:
counts

In [None]:
tfidf = tfidf_transformer.fit_transform(counts)

In [None]:
tfidf.todense()
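To see what the weighting involves, here is a rough pure-Python sketch of TF-IDF on the same three lines. It assumes the default `TfidfTransformer` settings (smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The count matrix is written out by hand with columns in alphabetical order, and names such as `count_rows` and `manual_tfidf` are hypothetical.

```python
import math

# Hand-written count matrix for the three example lines; columns are
# another, completely, different, is, line, this
count_rows = [
    [0, 0, 0, 1, 1, 1],  # 'This is a line'
    [1, 0, 0, 1, 1, 1],  # 'This is another line'
    [0, 1, 1, 0, 1, 0],  # 'Completely different line'
]

n_docs = len(count_rows)
n_terms = len(count_rows[0])

# Document frequency: number of documents that contain each term
doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed formula, see lead-in)
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    weighted = [tf * weight for tf, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    return [value / norm for value in weighted] if norm else weighted

manual_tfidf = [tfidf_row(row) for row in count_rows]

for row in manual_tfidf:
    print([round(value, 3) for value in row])
```

A term that appears in every document, such as "line", gets the lowest idf, so it contributes less to each row than rarer terms like "another".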

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
pipe = Pipeline([('cv',CountVectorizer()),('tfidf',TfidfTransformer())])

In [None]:
results = pipe.fit_transform(text)

In [None]:
results

In [None]:
results.todense()

## TfidfVectorizer

TfidfVectorizer performs both of the steps above (counting and TF-IDF weighting) in a single step!

In [None]:
tfidf = TfidfVectorizer()

In [None]:
new = tfidf.fit_transform(text)

In [None]:
new.todense()
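As a quick sanity check, the two-step pipeline and the single-step vectorizer can be compared directly. The sketch below assumes scikit-learn and NumPy are installed and that both estimators use their default settings; `two_step`, `one_step`, and `same` are hypothetical names.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import Pipeline

text = ['This is a line',
        'This is another line',
        'Completely different line']

# Two steps: raw counts first, then TF-IDF weighting
pipe = Pipeline([('cv', CountVectorizer()), ('tfidf', TfidfTransformer())])
two_step = pipe.fit_transform(text)

# One step: TfidfVectorizer handles counting and weighting together
one_step = TfidfVectorizer().fit_transform(text)

# Compare the dense versions of the two sparse results
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```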