
# Feature Extraction from Text

# Part One: Core Concepts on Feature Extraction


In this section we'll use basic Python to build a rudimentary NLP system. We'll build a *corpus of documents* (two small text files), create a *vocabulary* from all the words in both documents, and then demonstrate a *Bag of Words* technique to extract features from each document.<br>
<div class="alert alert-info" style="margin: 20px">This first section is for illustration only!</div>


In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
with open('gdrive/My Drive/csv_files/One.txt') as mytext:
  a = mytext.read()

In [3]:
a

'This is a story about dogs\nour canine pets\nDogs are furry animals\n'

In [4]:
print(a)

This is a story about dogs
our canine pets
Dogs are furry animals



In [5]:
with open('gdrive/My Drive/csv_files/One.txt') as mytext:
  word_one = mytext.read().lower().split()
  unique_word_one = set(word_one)

In [6]:
unique_word_one

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'dogs',
 'furry',
 'is',
 'our',
 'pets',
 'story',
 'this'}

In [7]:
with open('gdrive/My Drive/csv_files/Two.txt') as mytext:
  word_two = mytext.read().lower().split()
  unique_word_two = set(word_two)

In [8]:
unique_word_two

{'a',
 'about',
 'catching',
 'fun',
 'is',
 'popular',
 'sport',
 'story',
 'surfing',
 'this',
 'water',
 'waves'}

**Get all unique words across all documents**

In [9]:
uni_words = set()
uni_words.update(unique_word_one)

In [10]:
uni_words.update(unique_word_two)

In [11]:
uni_words

{'a',
 'about',
 'animals',
 'are',
 'canine',
 'catching',
 'dogs',
 'fun',
 'furry',
 'is',
 'our',
 'pets',
 'popular',
 'sport',
 'story',
 'surfing',
 'this',
 'water',
 'waves'}

In [12]:
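full_vocab = dict()

# Assign each unique word an integer index. Sets are unordered,
# so the exact word-to-index mapping can differ between runs.
for i, word in enumerate(uni_words):
  full_vocab[word] = i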

In [13]:
full_vocab

{'animals': 0,
 'story': 1,
 'canine': 2,
 'pets': 3,
 'a': 4,
 'about': 5,
 'popular': 6,
 'fun': 7,
 'surfing': 8,
 'sport': 9,
 'waves': 10,
 'is': 11,
 'this': 12,
 'dogs': 13,
 'are': 14,
 'furry': 15,
 'catching': 16,
 'water': 17,
 'our': 18}

## Bag of Words to Frequency Counts

Now that we've encapsulated our "entire language" in a dictionary, let's perform *feature extraction* on each of our original documents:

**Empty counts per doc**

In [14]:
one_freq = [0]*len(full_vocab)
two_freq = [0]*len(full_vocab)
all_words = ['']*len(full_vocab)

In [15]:
with open('gdrive/My Drive/csv_files/One.txt') as mytext:
  one_text = mytext.read().lower().split()

**Add in counts per word per doc:**

In [16]:
for word in one_text:
  word_ind = full_vocab[word]
  one_freq[word_ind] +=1

In [17]:
one_freq

[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 0, 0, 1]

In [18]:
with open('gdrive/My Drive/csv_files/Two.txt') as mytext:
  two_text = mytext.read().lower().split()

In [19]:
for word in two_text:
  word_ind = full_vocab[word]
  two_freq[word_ind] +=1

In [20]:
two_freq

[0, 1, 0, 0, 1, 1, 1, 1, 2, 1, 1, 3, 1, 0, 0, 0, 1, 1, 0]

In [21]:
# Invert the vocabulary: place each word at its index position
for word, word_ind in full_vocab.items():
  all_words[word_ind] = word

In [22]:
all_words

['animals',
 'story',
 'canine',
 'pets',
 'a',
 'about',
 'popular',
 'fun',
 'surfing',
 'sport',
 'waves',
 'is',
 'this',
 'dogs',
 'are',
 'furry',
 'catching',
 'water',
 'our']

In [23]:
import pandas as pd
bow = pd.DataFrame(data=[one_freq,two_freq],columns=all_words)

In [24]:
bow

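|   | animals | story | canine | pets | a | about | popular | fun | surfing | sport | waves | is | this | dogs | are | furry | catching | water | our |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 3 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |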


By comparing the vectors we see that some words are common to both documents, some appear only in `One.txt`, and others only in `Two.txt`. Extending this logic to tens of thousands of documents, the vocabulary dictionary could grow to hundreds of thousands of words. Each document vector would then consist mostly of zeros, which is why such data is stored as **sparse matrices**.
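To make the idea of sparsity concrete, here is a minimal sketch that keeps only the non-zero counts of each vector. The two count vectors are copied from the outputs above; the helper names `to_sparse` and `zero_fraction` and the dict-based layout are illustrative assumptions, not how any particular library stores sparse data.

```python
# Count vectors copied from the outputs above (19 vocabulary slots each).
one_freq = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 0, 0, 1]
two_freq = [0, 1, 0, 0, 1, 1, 1, 1, 2, 1, 1, 3, 1, 0, 0, 0, 1, 1, 0]


def to_sparse(vector):
    """Keep only non-zero entries as {index: count} (hypothetical helper)."""
    return {i: count for i, count in enumerate(vector) if count != 0}


def zero_fraction(vector):
    """Fraction of entries in the vector that are zero."""
    return vector.count(0) / len(vector)


sparse_one = to_sparse(one_freq)
sparse_two = to_sparse(two_freq)

print(sparse_one)
print(f"zeros in doc one: {zero_fraction(one_freq):.0%}")
print(f"zeros in doc two: {zero_fraction(two_freq):.0%}")
```

With only two short documents, roughly a third of the entries are already zero. That share grows as more documents add words that most other documents never use.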




## Bag of Words and Tf-idf
In the above examples, each vector can be considered a *bag of words*. By themselves these may not be helpful until we consider *term frequencies*, or how often individual words appear in documents. A simple way to calculate term frequencies is to divide the number of occurrences of a word by the total number of words in the document. In this way, the number of times a word appears in large documents can be compared to that of smaller documents.
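As a minimal sketch of that calculation, the snippet below divides each word count by the document length. The text is copied from `One.txt` as printed earlier; the function name `term_frequencies` is an illustrative assumption.

```python
from collections import Counter


def term_frequencies(text):
    """Return {word: count / total words} for one document."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


doc_one = "This is a story about dogs\nour canine pets\nDogs are furry animals\n"
tf_one = term_frequencies(doc_one)

print(tf_one["dogs"])   # 2 occurrences out of 13 words
print(tf_one["story"])  # 1 occurrence out of 13 words
```

Because every value is a proportion of the document's length, the frequencies of a single document sum to 1 regardless of how long the document is.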

However, it may be hard to differentiate documents based on term frequency if a word shows up in a majority of documents. To handle this we also consider *inverse document frequency*, which is the total number of documents divided by the number of documents that contain the word. In practice we convert this value to a logarithmic scale, as described [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency).

Together these terms become [**tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
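A minimal sketch of how the two quantities combine, using the plain textbook definition idf = log(N / df). The three toy documents and the function names are assumptions for illustration. Libraries such as scikit-learn apply smoothing and normalization on top of this, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [
    ["this", "is", "a", "story", "about", "dogs"],
    ["this", "is", "a", "story", "about", "surfing"],
    ["surfing", "is", "a", "water", "sport"],
]


def inverse_document_frequency(word, docs):
    """log(total documents / documents containing the word).

    Assumes the word occurs in at least one document.
    """
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)


def tf_idf(word, doc, docs):
    """Term frequency in one document times the corpus-wide idf."""
    tf = Counter(doc)[word] / len(doc)
    return tf * inverse_document_frequency(word, docs)


print(tf_idf("is", docs[0], docs))    # word in every document -> 0.0
print(tf_idf("dogs", docs[0], docs))  # word in one document -> positive
```

A word that appears in every document gets an idf of log(1) = 0, so it contributes nothing, while a word confined to a single document is weighted most heavily.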

## Stop Words and Word Stems
Some words like "the" and "and" appear so frequently, and in so many documents, that we needn't bother counting them. Also, it may make sense to only record the root of a word, say `cat` in place of both `cat` and `cats`. Both steps shrink the vocabulary and can improve performance.
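The sketch below illustrates both ideas on a toy scale. The stop list is a small hand-picked set and `naive_stem` only strips a trailing plural "s"; both are assumptions made for illustration and are far cruder than the stop lists and stemmers that NLP libraries provide.

```python
# Tiny hand-picked stop list; real lists are much longer (assumption).
STOP_WORDS = {"the", "and", "a", "is", "are", "this", "about", "our"}


def naive_stem(word):
    """Strip a trailing plural 's' (toy rule, not a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(text):
    """Lowercase, split on whitespace, drop stop words, stem the rest."""
    return [naive_stem(w) for w in text.lower().split() if w not in STOP_WORDS]


tokens = normalize("This is a story about dogs and the cats")
print(tokens)  # ['story', 'dog', 'cat']
```

Nine input words collapse to three vocabulary entries, which is the shrinking effect described above.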

## Tokenization and Tagging
When we created our vectors the first thing we did was split the incoming text on whitespace with `.split()`. This was a crude form of *tokenization* - that is, dividing a document into individual words. In this simple example we didn't worry about punctuation or different parts of speech. 
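To show where whitespace splitting falls short, here is a minimal sketch comparing `.split()` with a simple regular-expression tokenizer. The sample sentence and the `tokenize` function are illustrative assumptions; the pattern only handles plain ASCII letters, digits and apostrophes.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters, digits or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sentence = "Dogs are furry animals, aren't they?"

print(sentence.lower().split())
# ['dogs', 'are', 'furry', 'animals,', "aren't", 'they?']
print(tokenize(sentence))
# ['dogs', 'are', 'furry', 'animals', "aren't", 'they']
```

With `.split()`, `animals,` and `animals` would become two separate vocabulary entries; stripping punctuation during tokenization avoids that.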

Once the text is divided, we can go back and *tag* our tokens with information about parts of speech, grammatical dependencies, etc. This adds more dimensions to our data and enables a deeper understanding of the context of specific documents. These extra dimensions are another reason the resulting vectors end up as ***high-dimensional sparse matrices***.

# Part Two: Feature Extraction with Scikit-Learn



# Scikit-Learn's Text Feature Extraction Options

In [25]:
text = ['This is a line',
        'This is another line',
        'Completely different line']

## CountVectorizer

In [26]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer

In [None]:
help(CountVectorizer)

In [28]:
cv = CountVectorizer()

In [29]:
sparse_matrix = cv.fit_transform(text)

In [30]:
sparse_matrix.todense()

matrix([[0, 0, 0, 1, 1, 1],
        [1, 0, 0, 1, 1, 1],
        [0, 1, 1, 0, 1, 0]])

In [31]:
cv.vocabulary_

{'this': 5, 'is': 3, 'line': 4, 'another': 0, 'completely': 1, 'different': 2}

In [32]:
cv2 = CountVectorizer(stop_words='english')
sparse_matrix2 = cv2.fit_transform(text)

In [33]:
sparse_matrix2.todense()

matrix([[0, 0, 1],
        [0, 0, 1],
        [1, 1, 1]])

In [34]:
cv2.vocabulary_

{'line': 2, 'completely': 0, 'different': 1}

## TfidfTransformer

`TfidfVectorizer` works on raw text documents, while `TfidfTransformer` works on an existing count matrix, such as the one returned by `CountVectorizer`.

In [35]:
tf = TfidfTransformer()

In [36]:
results = tf.fit_transform(sparse_matrix)

In [37]:
results.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer

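## TfidfVectorizer

`TfidfVectorizer` performs both of the steps above (counting, then tf-idf weighting) in a single step, so its output matches the `TfidfTransformer` result.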

In [39]:
tv = TfidfVectorizer()

In [40]:
tv_results = tv.fit_transform(text)

In [41]:
tv_results.todense()

matrix([[0.        , 0.        , 0.        , 0.61980538, 0.48133417,
         0.61980538],
        [0.63174505, 0.        , 0.        , 0.4804584 , 0.37311881,
         0.4804584 ],
        [0.        , 0.65249088, 0.65249088, 0.        , 0.38537163,
         0.        ]])
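The tf-idf values above can be checked by hand. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which idf = ln((1 + n) / (1 + df)) + 1 and each row is then scaled to unit Euclidean length. The count matrix is copied from the `CountVectorizer` output; the variable names are illustrative.

```python
import math

# Count matrix copied from the CountVectorizer output above.
# Columns: another, completely, different, is, line, this
counts = [
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 0],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents in which each term appears.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf, assuming scikit-learn's default smooth_idf=True.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [count * w for count, w in zip(row, idf)]
    # L2 normalization, assuming the default norm='l2'.
    length = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([round(v / length, 8) for v in weighted])

for row in tfidf:
    print(row)
```

Note that "line" appears in all three documents and therefore receives the lowest weight in every row, while words unique to one document, such as "another", receive the highest.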