
#Feature Extraction in NLP

Machine learning algorithms learn from a predefined set of features in the training data and use them to produce output for the test data. The main obstacle in language processing is that these algorithms cannot operate on raw text directly, so we need feature extraction techniques that convert text into a matrix (or vector) of numeric features.
Two of the most popular feature extraction methods are:

      1. Bag-of-Words
      2. TF-IDF

#Bag of Words

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

    1. A vocabulary of known words.
    2. A measure of the presence of known words.
    
It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is concerned only with whether (and how often) known words occur in the document, not where they occur.
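The two ingredients above (a vocabulary and a count per known word) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the notebook's method: the helper name `bag_of_words` and the toy sentences are made up for the example, and tokens are split on whitespace only.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors) for whitespace-tokenised documents.

    Hypothetical helper for illustration only.
    """
    tokenised = [doc.lower().split() for doc in documents]
    # 1. The vocabulary: every distinct word seen in any document, sorted.
    vocabulary = sorted({token for doc in tokenised for token in doc})
    # 2. The measure of presence: how often each vocabulary word occurs.
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)  # missing words count as 0
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(
    ["dog bites man", "man bites dog", "dog bites dog"]
)
print(vocabulary)
for vector in vectors:
    print(vector)
```

The first two sentences mean different things but produce identical vectors, which is exactly the loss of word order that the "bag" name refers to.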

In [1]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [20]:
paragraph=input("Enter the paragraph:")

Enter the paragraph:Parents remain very conscious regarding their child’s learning and development. They try to do everything good for their children. When children learn a language in their school days English, parents want to hear simple sentences in English. They love to see their kid speaking a foreign language fluently.


In [21]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [22]:
wordnet=WordNetLemmatizer()

In [25]:
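sentences=nltk.sent_tokenize(paragraph)
# Build the stop-word set once instead of once per word
stop_words=set(stopwords.words("english"))
finaltext=[]
for sentence in sentences:
  # Replace everything except letters with spaces, then lowercase and tokenize
  review=re.sub('[^a-zA-Z]', ' ', sentence)
  review=review.lower()
  review=review.split()
  # Drop stop words and lemmatize the remaining tokens
  review=[wordnet.lemmatize(word) for word in review if word not in stop_words]
  review=" ".join(review)
  finaltext.append(review)
print(finaltext)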

['parent remain conscious regarding child learning development', 'try everything good child', 'child learn language school day english parent want hear simple sentence english', 'love see kid speaking foreign language fluently']


In [26]:
#creating the Bag of words model

from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
bag=cv.fit_transform(finaltext).toarray()
print(bag)

[[1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0]
 [1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [1 0 1 0 2 0 0 0 0 1 0 1 1 0 0 1 0 0 1 0 1 1 0 0 1]
 [0 0 0 0 0 0 1 1 0 0 1 1 0 0 1 0 0 0 0 1 0 0 1 0 0]]
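The printed matrix has one row per sentence and one column per vocabulary term, but the column labels are not shown. The sketch below recovers them in plain Python, relying on the fact that `CountVectorizer` assigns column indices to terms in sorted order. The sentences and the third matrix row are copied from the output above; the variable names are chosen for illustration.

```python
from collections import Counter

# Cleaned sentences, copied from the preprocessing output above.
docs = [
    "parent remain conscious regarding child learning development",
    "try everything good child",
    "child learn language school day english parent want hear simple sentence english",
    "love see kid speaking foreign language fluently",
]

# Sorting the distinct tokens reproduces the column order of the matrix.
vocabulary = sorted({token for doc in docs for token in doc.split()})

# Third row of the printed matrix (index 2).
row = [1, 0, 1, 0, 2, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# Attach a term to every non-zero count in the row.
labelled = {term: count for term, count in zip(vocabulary, row) if count}
print(labelled)

# Cross-check by counting the tokens of the third sentence directly.
recount = dict(Counter(docs[2].split()))
print(labelled == recount)
```

The only entry greater than 1 in that row belongs to `english`, which appears twice in the third sentence.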


#TF-IDF

TF-IDF (term frequency-inverse document frequency) is a statistical measure of how relevant a word is to a document within a collection of documents. It is computed by multiplying two metrics: the term frequency, which is how many times the word appears in the document, and the inverse document frequency, which shrinks as the word appears in more documents of the collection.
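To make the product of the two metrics concrete, here is a minimal sketch on a made-up toy corpus. It assumes the plain textbook variant, with a raw count for TF and `log(N / df)` for IDF; libraries such as scikit-learn use a smoothed and normalized variant, so their numbers will differ. The function names are hypothetical.

```python
import math

# Toy corpus, already tokenised (made up for this example).
docs = [
    ["the", "cat", "sat"],
    ["the", "cat", "cat", "hat"],
    ["the", "dog", "sat"],
]

def tf(term, doc):
    """Term frequency: raw count of the term in one document."""
    return doc.count(term)

def idf(term, docs):
    """Inverse document frequency: log(N / df).

    Assumes the term occurs in at least one document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["the", "cat", "hat"]:
    print(term, tf_idf(term, docs[1], docs))
```

In the second document, `the` scores exactly zero because it occurs in every document, and `hat` outscores `cat` even though `cat` occurs twice, because `hat` is unique to that document.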

In [27]:
paragraph=input("Enter the paragraph:")

Enter the paragraph:A paragraph is a group of words put together to form a group that is usually longer than a sentence. Paragraphs are often made up of several sentences. There are usually between three and eight sentences. Paragraphs can begin with an indentation (about five spaces), or by missing a line out, and then starting again.


In [28]:
wordnet=WordNetLemmatizer()

In [29]:
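sentences=nltk.sent_tokenize(paragraph)
# Build the stop-word set once instead of once per word
stop_words=set(stopwords.words("english"))
finaltext=[]
for sentence in sentences:
  # Replace everything except letters with spaces, then lowercase and tokenize
  review=re.sub('[^a-zA-Z]', ' ', sentence)
  review=review.lower()
  review=review.split()
  # Drop stop words and lemmatize the remaining tokens
  review=[wordnet.lemmatize(word) for word in review if word not in stop_words]
  review=" ".join(review)
  finaltext.append(review)
print(finaltext)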

['paragraph group word put together form group usually longer sentence', 'paragraph often made several sentence', 'usually three eight sentence', 'paragraph begin indentation five space missing line starting']


#Creating the TF-IDF model

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [31]:
cv=TfidfVectorizer()

In [32]:
tfidf=cv.fit_transform(finaltext).toarray()

In [33]:
print(tfidf)

[[0.         0.         0.         0.30954541 0.61909081 0.
  0.         0.30954541 0.         0.         0.         0.19757882
  0.30954541 0.19757882 0.         0.         0.         0.
  0.30954541 0.24404915 0.30954541]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.51199172 0.         0.51199172 0.32679768
  0.         0.32679768 0.51199172 0.         0.         0.
  0.         0.         0.        ]
 [0.         0.57457953 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.36674667 0.         0.         0.         0.57457953
  0.         0.4530051  0.        ]
 [0.36742339 0.         0.36742339 0.         0.         0.36742339
  0.36742339 0.         0.         0.36742339 0.         0.23452159
  0.         0.         0.         0.36742339 0.36742339 0.
  0.         0.         0.        ]]


In [37]:
import pandas as pd

In [41]:
tfIdfVectorizer=TfidfVectorizer()

In [46]:
tfIdf=tfIdfVectorizer.fit_transform(finaltext)

In [58]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df=pd.DataFrame(tfIdf[3].T.toarray(),index=tfIdfVectorizer.get_feature_names_out(),columns=["TF*IDF"])

In [59]:
df=df.sort_values("TF*IDF",ascending=False)

In [60]:
print(df)

               TF*IDF
begin        0.367423
five         0.367423
indentation  0.367423
line         0.367423
starting     0.367423
missing      0.367423
space        0.367423
paragraph    0.234522
sentence     0.000000
usually      0.000000
together     0.000000
three        0.000000
several      0.000000
often        0.000000
put          0.000000
eight        0.000000
made         0.000000
longer       0.000000
group        0.000000
form         0.000000
word         0.000000
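The weights in the table above can be reproduced by hand. The sketch below assumes the default `TfidfVectorizer` settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The sentences are copied from the preprocessing output; the helper `tfidf_vector` is a hypothetical name, not part of scikit-learn.

```python
import math
from collections import Counter

# Cleaned sentences, copied from the preprocessing output above.
docs = [
    "paragraph group word put together form group usually longer sentence",
    "paragraph often made several sentence",
    "usually three eight sentence",
    "paragraph begin indentation five space missing line starting",
]
tokenised = [doc.split() for doc in docs]
n_docs = len(tokenised)

# Document frequency: in how many sentences does each term occur?
doc_freq = Counter(term for doc in tokenised for term in set(doc))

def tfidf_vector(doc):
    """TF-IDF weights for one tokenised document (hypothetical helper)."""
    counts = Counter(doc)
    # Raw count times smoothed IDF (assumed scikit-learn default).
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    # L2-normalize so the vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

last = tfidf_vector(tokenised[3])
for term, weight in sorted(last.items(), key=lambda item: -item[1]):
    print(f"{term:<12}{weight:.6f}")
```

Under these assumptions the seven terms unique to the last sentence share the highest weight, while `paragraph` scores lower because it also occurs in two other sentences. Terms absent from the sentence are simply missing from the dictionary, which corresponds to the zero rows in the table.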
