## Feature Extraction

This notebook discusses common methods of feature extraction, also known as text vectorization or text representation. These techniques convert words into numeric vectors so that they can be used for modeling. Each technique is described below, along with how well it captures semantic meaning.


### 1. One hot encoding: 

The most basic technique for converting words into vectors. Consider the following example.
- Doc1 = "It is raining"
- Doc2 = "Boston is raining"
- Doc3 = "Rain raining everywhere."

Vocabulary of this corpus = ["It","is","raining","Boston","Rain","everywhere"].
In the vector form these docs can be converted as:
    
```
Doc1 = [[1,0,0,0,0,0],[0,1,0,0,0,0],[0,0,1,0,0,0]] -- shape is 3x6 (3 words, 6 vocabulary entries).
Doc2 = [[0,0,0,1,0,0],[0,1,0,0,0,0],[0,0,1,0,0,0]] -- shape is 3x6.
Doc3 = [[0,0,0,0,1,0],[0,0,1,0,0,0],[0,0,0,0,0,1]] -- shape is 3x6.
```

In general, each word becomes a vector whose length equals the vocabulary size, so a document with n words becomes a matrix with n rows.

**Pros:** 
- Very simple, intuitive, easy to implement.
    
**Cons:** 
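- Sparsity: every word vector holds a single 1 and zeros everywhere else.
- Documents with different word counts produce matrices of different shapes, which most models cannot accept directly.
- Out-of-vocabulary words cannot be encoded and will typically throw an error.
- No capturing of semantic meaning.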
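
The encoding above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming whitespace tokenization and a vocabulary built in order of first appearance; the helper name `one_hot_encode` is made up for this example and is not part of any library.

```python
# Toy corpus from the example above (trailing punctuation removed).
docs = ["It is raining", "Boston is raining", "Rain raining everywhere"]

# Build the vocabulary in order of first appearance.
vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)

# Map each word to its position in the vocabulary.
index = {word: i for i, word in enumerate(vocab)}

def one_hot_encode(doc):
    """Return one row per word; each row has a single 1 at the word's index."""
    rows = []
    for word in doc.split():
        row = [0] * len(vocab)
        row[index[word]] = 1  # raises KeyError for an out-of-vocabulary word
        rows.append(row)
    return rows

print(vocab)
for doc in docs:
    print(one_hot_encode(doc))
```

Calling `one_hot_encode("Snow is falling")` raises a `KeyError`, which is the out-of-vocabulary problem listed in the cons.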

### 2. Bag of Words:
Represents each document by how many times each vocabulary word occurs in it, ignoring word order. With single words as features, this is the unigram model.

- Doc1 = "The more the merrier."
- Doc2 = "More people enrolled"
- Doc3 = "Enrolled last week"

Vocabulary of this corpus (ignoring case) = ["the","more","merrier","people","enrolled","last","week"].
In the vector form these docs can be converted as:
    
```
Doc1 = [2,1,1,0,0,0,0]
Doc2 = [0,1,0,1,1,0,0]
Doc3 = [0,0,0,0,1,1,1]
```

**Pros:**
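- Simple and easy to understand.
- Documents do not need to be the same length: every document maps to a vector whose length is the vocabulary size.
- Doesn't throw an error for out-of-vocabulary words.
- Captures a little semantic meaning: documents that share many words get similar vectors.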

**Cons:**
- Ignores words that were not in the dataset used to build the vocabulary.
- Sparsity.
- Ignores word order and context. Consider "This is a good book" and "This is not a good book". The two docs share every word except "not", so this approach treats them as very similar, when in reality they convey opposite meanings.
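
The last point can be checked numerically. The sketch below builds bag-of-words vectors by hand and compares them with cosine similarity; the choice of cosine similarity and the helper names `bow_vector` and `cosine` are assumptions made for illustration, not something the notebook prescribes.

```python
import math
from collections import Counter

def bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc_a = "This is a good book"
doc_b = "This is not a good book"

# Shared vocabulary, sorted alphabetically.
vocab = sorted(set(doc_a.lower().split()) | set(doc_b.lower().split()))
vec_a = bow_vector(doc_a, vocab)
vec_b = bow_vector(doc_b, vocab)
similarity = cosine(vec_a, vec_b)

print(vocab)
print(vec_a, vec_b)
print(round(similarity, 3))
```

The similarity comes out above 0.9 even though the sentences have opposite meanings, because the vectors differ only in the "not" column.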

In [2]:
import pandas as pd

In [42]:
df1 = pd.DataFrame({'documents':["The more the merrier.","More people enrolled","Enrolled last week"]})

In [43]:
df1

Unnamed: 0,documents
0,The more the merrier.
1,More people enrolled
2,Enrolled last week


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [44]:
bow = cv.fit_transform(df1['documents'])

In [45]:
#Vocabulary
cv.vocabulary_

{'the': 5,
 'more': 3,
 'merrier': 2,
 'people': 4,
 'enrolled': 0,
 'last': 1,
 'week': 6}

The numbers are the column indices of each word in the output vectors. `CountVectorizer` lowercases the text and assigns the indices in alphabetical order.

In [46]:
print("Vector form of doc1:",bow[0].toarray())
print("Vector form of doc2:",bow[1].toarray())
print("Vector form of doc3:",bow[2].toarray())

Vector form of doc1: [[0 0 1 1 0 2 0]]
Vector form of doc2: [[1 0 0 1 1 0 0]]
Vector form of doc3: [[1 1 0 0 0 0 1]]


In [47]:
cv.transform(['This sentence contains more words']).toarray()

array([[0, 0, 0, 1, 0, 0, 0]], dtype=int64)

As expected, the new words are ignored in the vector form and only "more" is counted; this is one of the drawbacks of BoW.

In [21]:
df = pd.read_csv('IMDB Dataset.csv')

In [24]:
import re
def removetags(text):
    # Strip HTML tags such as <br />; the non-greedy .*? stops at the first '>'
    pattern = r'<.*?>'
    return re.sub(pattern, '', text)

In [28]:
df['review'] = df['review'].apply(removetags)

In [30]:
bow = cv.fit_transform(df['review'])

In [32]:
len(cv.vocabulary_)

104083

In [34]:
df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

In [39]:
print(bow[0].toarray())
print(bow[1].toarray())

[[0 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]]


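The sparsity is visible here: each review uses only a small fraction of the 104083 vocabulary words, so almost every entry in its vector is zero.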
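
Sparsity can also be quantified instead of read off the printed array. The sketch below is self-contained and uses the small three-document corpus rather than the IMDB data, so the numbers are only illustrative; the same `shape` and `nnz` attributes are available on the matrix returned for the full dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The more the merrier.", "More people enrolled", "Enrolled last week"]

# fit_transform returns a SciPy sparse matrix that stores only non-zero counts.
matrix = CountVectorizer().fit_transform(docs)

n_rows, n_cols = matrix.shape
# nnz is the number of stored (non-zero) entries.
density = matrix.nnz / (n_rows * n_cols)

print(f"shape={matrix.shape}, non-zero={matrix.nnz}, density={density:.3f}")
```

On a large corpus the density is typically far lower than in this toy case, which is why the vectorizer returns a sparse matrix rather than a dense array.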

### 3. N-gram:
The vocabulary consists of sequences of n consecutive words (n-grams) instead of single words.

**Pros:**
- Consider "This book is very good" and "This book is not good". Using bigrams, the vocabulary is ["This book","book is","is very","very good","is not","not good"].
```
Doc1 = [1,1,1,1,0,0]
Doc2 = [1,1,0,0,1,1]
```
The two vectors now differ in four of the six positions, so the documents are further apart in the vector space and less similar than under unigram BoW, where they differ only in "very" and "not".
- Better conveying of semantic meaning.
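
The bigram comparison above can be checked with a short sketch. It assumes whitespace tokenization, lowercasing, and cosine similarity as the measure of closeness; the helper names `ngrams`, `count_vector`, and `cosine` are made up for this example.

```python
import math

def ngrams(doc, n):
    """Return the list of n-grams (as strings) of consecutive words."""
    words = doc.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def count_vector(tokens, vocab):
    """Count how often each vocabulary entry occurs among the tokens."""
    return [tokens.count(term) for term in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

doc1 = "This book is very good"
doc2 = "This book is not good"

similarities = {}
for n in (1, 2):
    tokens1, tokens2 = ngrams(doc1, n), ngrams(doc2, n)
    vocab = sorted(set(tokens1) | set(tokens2))
    similarities[n] = cosine(count_vector(tokens1, vocab), count_vector(tokens2, vocab))

print(similarities)
```

The similarity drops when moving from unigrams to bigrams, because "is very", "very good", "is not", and "not good" each appear in only one of the two documents.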
  
**Cons:**
- Dimension of vocabulary increases --> the model becomes harder to interpret and more expensive to compute.
- Out-of-vocabulary n-grams are still ignored.

In [55]:
cv_2 = CountVectorizer(ngram_range=(2,2))

In [53]:
bow = cv_2.fit_transform(df1['documents'])

In [54]:
cv_2.vocabulary_

{'the more': 6,
 'more the': 3,
 'the merrier': 5,
 'more people': 2,
 'people enrolled': 4,
 'enrolled last': 0,
 'last week': 1}

Setting `ngram_range=(1,2)` produces a vocabulary that contains both unigrams and bigrams.

### 4. TF-IDF
Assigns a weight to each word based on how often it occurs in a document and how rare it is across the corpus.

**Term Frequency (TF)** : Number of times word w occurs in a given document / Total number of words in the document. <br>
**Inverse Document Frequency (IDF)** : log(Number of documents / Number of documents with word w) --> Captures how rare or frequent a word is across the corpus.

In [56]:
df1

Unnamed: 0,documents
0,The more the merrier.
1,More people enrolled
2,Enrolled last week


In [62]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf.fit_transform(df1['documents']).toarray()

array([[0.        , 0.        , 0.42339448, 0.32200242, 0.        ,
        0.84678897, 0.        ],
       [0.51785612, 0.        , 0.        , 0.51785612, 0.68091856,
        0.        , 0.        ],
       [0.4736296 , 0.62276601, 0.        , 0.        , 0.        ,
        0.        , 0.62276601]])

In [64]:
print(tfidf.idf_)

[1.28768207 1.69314718 1.69314718 1.28768207 1.69314718 1.69314718
 1.69314718]


In [67]:
#Vocabulary
print(tfidf.get_feature_names_out())

['enrolled' 'last' 'merrier' 'more' 'people' 'the' 'week']
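
The printed `idf_` values do not match the plain log(N / df) formula, because scikit-learn's default settings use a smoothed variant, ln((1 + N) / (1 + df)) + 1, take the raw count as the term frequency, and then scale each row to unit L2 norm. The sketch below reproduces the first row of the matrix by hand under those defaults; the document frequencies are typed in manually from the toy corpus.

```python
import math

n_docs = 3

# Number of documents containing each vocabulary word (from the toy corpus).
doc_freq = {"enrolled": 2, "last": 1, "merrier": 1, "more": 2,
            "people": 1, "the": 1, "week": 1}

# Smoothed IDF used by scikit-learn's defaults (smooth_idf=True).
idf = {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}

# Raw term counts for Doc1 = "The more the merrier."
counts = {"merrier": 1, "more": 1, "the": 2}

# Weight = count * idf, then divide by the row's L2 norm (norm='l2').
weights = {word: count * idf[word] for word, count in counts.items()}
l2_norm = math.sqrt(sum(value * value for value in weights.values()))
row = {word: value / l2_norm for word, value in weights.items()}

print({word: round(value, 8) for word, value in idf.items()})
print({word: round(value, 8) for word, value in row.items()})
```

Words that appear in two documents ("enrolled", "more") get the lower IDF, and "the" gets the largest weight in Doc1 because it occurs twice there and in no other document.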


**Why do we take log in the calculation of IDF?**

If you compute the IDF of a word that is rare in the corpus without the log, the value becomes very large and dominates the term frequency. Taking the log dampens this effect.
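
A small numeric sketch of this effect, using a hypothetical corpus of 50,000 documents (the size is an assumption chosen for illustration) and the plain log(N / df) form of IDF:

```python
import math

n_docs = 50_000  # hypothetical corpus size

def idf_without_log(doc_freq):
    """Raw ratio N / df, with no dampening."""
    return n_docs / doc_freq

def idf_with_log(doc_freq):
    """Natural log of the ratio."""
    return math.log(n_docs / doc_freq)

# From a very rare word (1 document) to a word present in every document.
for doc_freq in (1, 10, 1_000, 50_000):
    print(doc_freq, idf_without_log(doc_freq), round(idf_with_log(doc_freq), 2))
```

Without the log, a word found in a single document is weighted 50,000 times more heavily than one found everywhere; with the log the values stay within a range of roughly 0 to 11, so term frequency still contributes to the final score.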

**Pros:**
- Used for information retrieval.

**Cons:**
- Sparsity, out-of-vocabulary words, a very large dimension, and no capturing of the semantic meaning of a document.