
#Feature Engineering of Text Data

Feature extraction from text has two main methods: 

bag-of-words, and 

word embedding. 

Both are commonly used, but they take different approaches. I will explain both of them and the differences between them.

#Bag of Words with TF-IDF


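Bag-of-Words with TF-IDF is a traditional and simple feature extraction method in natural language processing. Bag-of-Words is a “representation model” for text data: each document becomes a vector of word counts, ignoring word order. TF-IDF is a “calculation method” that scores the importance of each word in a document: a word scores higher the more often it occurs in that document and the fewer documents contain it.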
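To make the TF-IDF score concrete, here is a minimal pure-Python sketch. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each document vector. The toy documents and the `tfidf` helper are hypothetical and only for illustration.

```python
import math
from collections import Counter

# Toy, pre-tokenized documents (hypothetical data for illustration only).
docs = [
    ["language", "language", "language", "natural"],
    ["software", "machine"],
    ["software", "engineer", "engineer"],
]

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 4) for t, w in row.items()})
```

In the second toy document, "software" gets a lower weight than "machine" even though both occur once, because "software" also appears in another document.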

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample sentences.
sentences = [
    "Natural language processing helps computers communicate with humans in their own language and scales other language-related tasks",
    "Machine learning (ML) is a type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so",
    "You are a very good software engineer, engineer.",
]

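# Create a CountVectorizer, which builds the bag-of-words model.
# stop_words: language of the built-in stop-word list to remove.
vectorizer = CountVectorizer(stop_words='english')

# Learn the vocabulary from the sentences.
vectorizer.fit(sentences)

# Get the learned vocabulary in column order.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
vectorizer.get_feature_names_out().tolist()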



['accurate',
 'ai',
 'allows',
 'applications',
 'artificial',
 'communicate',
 'computers',
 'engineer',
 'explicitly',
 'good',
 'helps',
 'humans',
 'intelligence',
 'language',
 'learning',
 'machine',
 'ml',
 'natural',
 'outcomes',
 'predicting',
 'processing',
 'programmed',
 'related',
 'scales',
 'software',
 'tasks',
 'type']

In [11]:
# Transform each sentence into a vector of word counts.
vector = vectorizer.transform(sentences)
vector_spaces = vector.toarray()

vector_spaces

array([[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 3, 0, 0, 0, 1, 0, 0, 1, 0,
        1, 1, 0, 1, 0],
       [1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1,
        0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0]])

In [12]:
# Show sentences and vector space representation.
for i, v in zip(sentences, vector_spaces):
    print(i)
    print(v)

Natural language processing helps computers communicate with humans in their own language and scales other language-related tasks
[0 0 0 0 0 1 1 0 0 0 1 1 0 3 0 0 0 1 0 0 1 0 1 1 0 1 0]
Machine learning (ML) is a type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so
[1 1 1 1 1 0 0 0 1 0 0 0 1 0 1 1 1 0 1 1 0 1 0 0 1 0 1]
You are a very good software engineer, engineer.
[0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
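The counting step above can be sketched without scikit-learn. This is a simplified illustration, not CountVectorizer's actual implementation: it assumes the default behaviour of lowercasing and keeping tokens of two or more word characters, and `STOP_WORDS` is a hypothetical miniature list (the built-in English list is far longer).

```python
import re
from collections import Counter

# Hypothetical miniature stop-word list, for illustration only.
STOP_WORDS = {"you", "are", "very", "is", "the"}

def tokenize(text):
    # Lowercase, keep tokens of two or more word characters, drop stop words.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(texts):
    tokenized = [tokenize(t) for t in texts]
    # The vocabulary is sorted alphabetically; each word owns one column.
    vocab = sorted({tok for doc in tokenized for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for tok, count in Counter(doc).items():
            row[index[tok]] = count
        rows.append(row)
    return vocab, rows

vocab, rows = bag_of_words([
    "You are a very good software engineer, engineer.",
    "Software is good.",
])
print(vocab)  # ['engineer', 'good', 'software']
print(rows)   # [[2, 1, 1], [0, 1, 1]]
```

Each row has one column per vocabulary word, so "engineer" counts twice in the first sentence while word order is lost entirely.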


In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Natural language processing helps computers communicate with humans in their own language and scales other language-related tasks",
    "Machine learning (ML) is a type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so",
    "You are a very good software engineer, engineer.",
]

# Create a TfidfVectorizer.
# stop_words: remove English stop words.
vectorizer = TfidfVectorizer(stop_words='english')

# Learn the vocabulary and IDF weights from the sentences.
vectorizer.fit(sentences)

# Get the mapping from each term to its column index.
vectorizer.vocabulary_

{'accurate': 0,
 'ai': 1,
 'allows': 2,
 'applications': 3,
 'artificial': 4,
 'communicate': 5,
 'computers': 6,
 'engineer': 7,
 'explicitly': 8,
 'good': 9,
 'helps': 10,
 'humans': 11,
 'intelligence': 12,
 'language': 13,
 'learning': 14,
 'machine': 15,
 'ml': 16,
 'natural': 17,
 'outcomes': 18,
 'predicting': 19,
 'processing': 20,
 'programmed': 21,
 'related': 22,
 'scales': 23,
 'software': 24,
 'tasks': 25,
 'type': 26}

In [14]:
# Transform the sentences into a TF-IDF document-term matrix (sparse).
vector_spaces = vectorizer.transform(sentences)
vector_spaces.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.23570226, 0.23570226, 0.        , 0.        , 0.        ,
        0.23570226, 0.23570226, 0.        , 0.70710678, 0.        ,
        0.        , 0.        , 0.23570226, 0.        , 0.        ,
        0.23570226, 0.        , 0.23570226, 0.23570226, 0.        ,
        0.23570226, 0.        ],
       [0.26190578, 0.26190578, 0.26190578, 0.26190578, 0.26190578,
        0.        , 0.        , 0.        , 0.26190578, 0.        ,
        0.        , 0.        , 0.26190578, 0.        , 0.26190578,
        0.26190578, 0.26190578, 0.        , 0.26190578, 0.26190578,
        0.        , 0.26190578, 0.        , 0.        , 0.19918609,
        0.        , 0.26190578],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.84678897, 0.        , 0.42339448,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.    

In [15]:
for i, v in zip(sentences, vector_spaces):
    print(i)
    print(v)

Natural language processing helps computers communicate with humans in their own language and scales other language-related tasks
  (0, 25)	0.2357022603955158
  (0, 23)	0.2357022603955158
  (0, 22)	0.2357022603955158
  (0, 20)	0.2357022603955158
  (0, 17)	0.2357022603955158
  (0, 13)	0.7071067811865475
  (0, 11)	0.2357022603955158
  (0, 10)	0.2357022603955158
  (0, 6)	0.2357022603955158
  (0, 5)	0.2357022603955158
Machine learning (ML) is a type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so
  (0, 26)	0.26190577641518953
  (0, 24)	0.19918609370383894
  (0, 21)	0.26190577641518953
  (0, 19)	0.26190577641518953
  (0, 18)	0.26190577641518953
  (0, 16)	0.26190577641518953
  (0, 15)	0.26190577641518953
  (0, 14)	0.26190577641518953
  (0, 12)	0.26190577641518953
  (0, 8)	0.26190577641518953
  (0, 4)	0.26190577641518953
  (0, 3)	0.26190577641518953
  (0, 2)	0.26190577641518953
  (0, 

#Word Embedding

Word embedding is a representation of words in a vector space model. Unlike the Bag-of-Words model, it captures the context and semantics of words. Bag-of-Words only records the number of occurrences of each word in a document, without any relationships or context. Word embedding, on the other hand, preserves the contexts and relationships of words, so it detects similar words more accurately.
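Similarity between embedded words is usually measured with cosine similarity. The sketch below illustrates the idea with hand-written three-dimensional vectors; the words and numbers are hypothetical and not taken from any trained model.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for illustration only.
embeddings = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

sim_king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])
print(round(sim_king_queen, 3), round(sim_king_apple, 3))
```

Vectors pointing in nearly the same direction score close to 1, which is how an embedding model ranks "similar" words.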

#Word2vec

Word2vec is one of the most popular implementations of word embedding; it was introduced by researchers at Google in 2013. It learns word embeddings with a shallow two-layer neural network trained on the contexts in which words appear.

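Given enough training text, Word2vec is good at grouping similar words and at making accurate guesses about the meaning of a word based on its contexts. 
It has two different training algorithms:


CBoW (Continuous Bag-of-Words), which predicts a word from its surrounding context words, and 
the skip-gram model, which predicts the surrounding context words from a word.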
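The difference between the two algorithms is easiest to see in the training examples each one builds from a sentence. The helpers below are a hypothetical, simplified sketch with a fixed context window; they do not reproduce the details of a real Word2vec implementation.

```python
def skipgram_pairs(tokens, window=2):
    # Skip-gram: each (center, context) pair asks the model to
    # predict one context word from the center word.
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    # CBoW: each (context_words, center) example asks the model to
    # predict the center word from all of its context words at once.
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, center))
    return examples

sentence = ["user", "response", "time"]
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

With a window of 1, the word "response" yields two skip-gram pairs but a single CBoW example whose context is both neighbours.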

In [7]:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

# Small tokenized sample corpus that ships with gensim.
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [16]:
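# Train a Word2Vec model.
# vector_size is the embedding dimension (named `size` in gensim versions before 4.0).
model = Word2Vec(common_texts, vector_size=100, window=5, min_count=1, workers=4)

# Get the vector of a specific word in the vocabulary.
model.wv["human"]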

array([ 9.9272060e-04,  1.0017658e-03,  1.7501498e-04, -8.6333523e-05,
        4.4604125e-03, -3.5663932e-03,  7.0894028e-05, -2.4404787e-03,
       -3.5875663e-03,  1.1043018e-03, -2.7340502e-03, -1.8319436e-03,
        7.5065304e-04, -1.2301384e-03,  4.8667877e-03,  2.4788410e-03,
       -7.2513492e-04,  3.9498815e-03,  6.8107276e-04,  3.1631251e-03,
       -1.7963970e-03,  9.0599572e-04, -1.5415214e-03, -2.0505588e-03,
       -4.8715952e-03,  4.7385395e-03,  2.5972070e-03, -3.4232132e-04,
       -9.5266529e-04, -1.9619314e-03,  1.3030939e-03,  1.7025197e-03,
        6.5801857e-04,  4.1995966e-03, -2.3826668e-03, -1.5172569e-04,
       -4.9485448e-03,  4.7071259e-03, -3.2198715e-03,  3.7962880e-03,
        1.5945356e-04,  2.6255578e-03, -1.5805045e-03, -5.7039055e-04,
        1.4873416e-03, -2.1401814e-03,  3.1070798e-03,  4.4205035e-03,
        2.0905412e-03,  2.3956869e-03,  2.9179810e-03, -1.8138132e-03,
       -1.5461032e-03, -2.3744792e-04,  9.6716528e-04,  3.4993188e-03,
      

In [17]:
model.wv.most_similar("human")

[('user', 0.12816676497459412),
 ('interface', 0.1062312126159668),
 ('eps', 0.09679573774337769),
 ('computer', 0.09228645265102386),
 ('minors', 0.07000488042831421),
 ('survey', 0.05878273397684097),
 ('response', 0.022046033293008804),
 ('trees', 0.0005789585411548615),
 ('system', -0.020276952534914017),
 ('graph', -0.021966716274619102)]