
# Understanding Natural Language Processing 
1. Tokenization
2. Stemming
3. Lemmatization
4. Bag of Words

# Step 1 Assign a Paragraph
This paragraph is our corpus. We will tokenize it and then reduce it to its important words.
Tokenization is the process of splitting a piece of text into smaller units called tokens. Tokens can be words, characters, or subwords.
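
To make the three token granularities concrete, here is a minimal sketch that uses only Python's standard library. The sample text, the regular expression, and the fixed-size chunks standing in for subwords are all illustrative assumptions; real subword tokenizers learn their units from data.

```python
import re

text = "Stemming helps tokenization"  # illustrative sample, not from the corpus

# Word tokens: runs of letters, found with a simple regular expression.
word_tokens = re.findall(r"[A-Za-z]+", text)

# Character tokens: every non-space character is its own token.
char_tokens = [ch for ch in text if not ch.isspace()]

# Crude stand-in for subword tokens: split each word into chunks of up to 4 letters.
# (Assumption for illustration only; real subword vocabularies are learned.)
def chunk(word, size=4):
    return [word[i:i + size] for i in range(0, len(word), size)]

subword_tokens = [piece for word in word_tokens for piece in chunk(word)]

print(word_tokens)       # ['Stemming', 'helps', 'tokenization']
print(len(char_tokens))  # 25
print(subword_tokens)    # ['Stem', 'ming', 'help', 's', 'toke', 'niza', 'tion']
```

The rest of this notebook works with word tokens only.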


In [None]:
# A paragraph from GeeksforGeeks
paragraph = '''
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Lemmatization is similar to stemming but it brings context to the words. So it links words with similar meanings to one word. 
Text preprocessing includes both Stemming as well as Lemmatization. 
Many times people find these two terms confusing. 
Some treat these two as the same. Actually, lemmatization is preferred over Stemming because lemmatization does morphological analysis of the words.
'''

# Step 2 NLTK
We use the NLTK library for stemming and tokenization. Let's get started.

In [None]:
import nltk
from nltk.stem import PorterStemmer # For Stemming
from nltk.corpus import stopwords



**Punkt provides NLTK's tokenizer models, so to use tokenization we need to download punkt with the statement below.**

***Also download the other data files listed in the cell below.***

In [None]:
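nltk.download('punkt')  # tokenizer models (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('wordnet')  # dictionary used by WordNetLemmatizer
nltk.download('omw-1.4')  # multilingual WordNet data
nltk.download('stopwords')  # stop-word lists, needed by the preprocessing cell further down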

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
sentences = nltk.sent_tokenize(paragraph)

In [None]:
print(sentences)

['\nLemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.', 'Lemmatization is similar to stemming but it brings context to the words.', 'So it links words with similar meanings to one word.', 'Text preprocessing includes both Stemming as well as Lemmatization.', 'Many times people find these two terms confusing.', 'Some treat these two as the same.', 'Actually, lemmatization is preferred over Stemming because lemmatization does morphological analysis of the words.']
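
To see why a trained sentence tokenizer is worth the download, here is a minimal sketch of a naive splitter built on a regular expression. The function name `naive_sent_split` and the sample sentence are hypothetical and not part of NLTK. The rule breaks on every period followed by whitespace, so it wrongly cuts after the abbreviation "Dr.", a case that trained tokenizers such as Punkt are designed to handle.

```python
import re

def naive_sent_split(text):
    # Split after '.', '!' or '?' whenever whitespace follows.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Dr. Smith uses NLTK. It works well."  # illustrative sentence
pieces = naive_sent_split(sample)
print(pieces)  # ['Dr.', 'Smith uses NLTK.', 'It works well.']
```

The sample has two sentences, but the naive rule returns three pieces.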


# Step 3
After tokenization, we can choose either stemming or lemmatization.
A stemmer may produce stems that are not real words, whereas a lemmatizer returns dictionary forms that keep the meaning of the word, so the choice depends on your use case. Before applying either, let's do some pre-processing on our sentences.

The code below creates both a stemmer and a lemmatizer so that we can use either one.

In [None]:
from nltk.stem import WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

**What is Stemming?**

Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). The stem is not guaranteed to be a valid word, unlike the lemma, which is the dictionary form of the word produced by lemmatization. Stemming is important in natural language understanding (NLU) and natural language processing (NLP).

In [None]:
print(stemmer.stem('thinking'))
print(stemmer.stem('history'))
print(stemmer.stem('going'))

think
histori
go
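
The idea of stripping suffixes can be sketched in a few lines of plain Python. This is a toy rule set written for illustration, not the Porter algorithm that `PorterStemmer` implements; the function name `toy_stem`, the suffix list, and the minimum stem length are all assumptions.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative suffix list, checked in this order

def toy_stem(word, min_stem_len=2):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for w in ["thinking", "going", "includes", "history", "sing"]:
    print(w, "->", toy_stem(w))
# thinking -> think
# going -> go
# includes -> includ
# history -> history
# sing -> sing
```

Like the real stemmer, the toy version can return non-words such as "includ". Unlike it, the toy version leaves "history" untouched, because Porter applies many more rules than a single suffix list.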


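Let's remove special characters and keep only the words. For this we use the regular expression function re.sub, which replaces every character other than a-z and A-Z with a space. We also lowercase the text, drop stopwords, and lemmatize each remaining word.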

In [None]:
import re
stop_words = set(stopwords.words('english'))  # build the stop-word set once
corpus = []
for sentence in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence)  # keep letters only
    review = review.lower().split()  # lowercase and split into words
    # drop stop words and lemmatize what remains
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

In [None]:
corpus

['lemmatization process grouping together different inflected form word analyzed single item',
 'lemmatization similar stemming brings context word',
 'link word similar meaning one word',
 'text preprocessing includes stemming well lemmatization',
 'many time people find two term confusing',
 'treat two',
 'actually lemmatization preferred stemming lemmatization morphological analysis word']
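
The cleaning steps can be tried in isolation without any NLTK data. The sketch below is a simplified stand-in: the function name `clean`, the sample sentence, and the four-word stop list are illustrative assumptions, and lemmatization is left out.

```python
import re

STOP_WORDS = {"is", "to", "but", "it"}  # tiny illustrative subset, not NLTK's list

def clean(sentence):
    letters_only = re.sub(r"[^a-zA-Z]", " ", sentence)  # non-letters become spaces
    tokens = letters_only.lower().split()                # lowercase, split on whitespace
    return " ".join(t for t in tokens if t not in STOP_WORDS)

result = clean("Lemmatization is similar to stemming, but it brings context... 100%!")
print(result)  # lemmatization similar stemming brings context
```

Digits and punctuation disappear because they are replaced by spaces before the text is split.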

# Let's Apply Stemming
1. We take every sentence from the corpus and tokenize it (converting each sentence into words).
2. We then apply stemming to each word of the tokenized sentence and skip any stopwords.


In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Stem every word of every sentence in the corpus
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
l1 = []  # will hold all stemmed words
for sentence in corpus:
  words = nltk.word_tokenize(sentence)
  for word in words:
    # skip stop words (most were already dropped while building corpus)
    if word not in stop_words:
      stem = stemmer.stem(word)
      l1.append(stem)
      print(stem)

lemmat
process
group
togeth
differ
inflect
form
word
analyz
singl
item
lemmat
similar
stem
bring
context
word
link
word
similar
mean
one
word
text
preprocess
includ
stem
well
lemmat
mani
time
peopl
find
two
term
confus
treat
two
actual
lemmat
prefer
stem
lemmat
morpholog
analysi
word


In [None]:
# Let's also apply the lemmatizer so we can compare its output with the stemmer's.
stop_words = set(stopwords.words('english'))  # build the stop-word set once
l2 = []  # will hold the lemmatized words
for sentence in corpus:
  words = nltk.word_tokenize(sentence)
  for word in words:
    if word not in stop_words:
      l2.append(lemmatizer.lemmatize(word))


In [None]:
l2

['lemmatization',
 'process',
 'grouping',
 'together',
 'different',
 'inflected',
 'form',
 'word',
 'analyzed',
 'single',
 'item',
 'lemmatization',
 'similar',
 'stemming',
 'brings',
 'context',
 'word',
 'link',
 'word',
 'similar',
 'meaning',
 'one',
 'word',
 'text',
 'preprocessing',
 'includes',
 'stemming',
 'well',
 'lemmatization',
 'many',
 'time',
 'people',
 'find',
 'two',
 'term',
 'confusing',
 'treat',
 'two',
 'actually',
 'lemmatization',
 'preferred',
 'stemming',
 'lemmatization',
 'morphological',
 'analysis',
 'word']
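
To compare the two approaches side by side, the sketch below pairs the stems and lemmas of the first sentence. Both lists are copied by hand from the outputs printed above, so the snippet runs without NLTK; the variable names are illustrative.

```python
# Outputs for the first sentence, copied from the stemming and lemmatizing cells above.
stems = ['lemmat', 'process', 'group', 'togeth', 'differ', 'inflect',
         'form', 'word', 'analyz', 'singl', 'item']
lemmas = ['lemmatization', 'process', 'grouping', 'together', 'different',
          'inflected', 'form', 'word', 'analyzed', 'single', 'item']

# Keep only the positions where the stemmer and the lemmatizer disagree.
changed = [(lemma, stem) for lemma, stem in zip(lemmas, stems) if lemma != stem]
for lemma, stem in changed:
    print(f"{lemma:15} -> {stem}")
print(len(changed), "of", len(lemmas), "words differ")  # 7 of 11 words differ
```

The lemmas remain readable words, while stems such as "togeth" and "singl" do not.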

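# Creating Features

From all this data we will build features with the bag-of-words technique. We use scikit-learn's CountVectorizer to turn the words into features. With binary = True, each feature records only whether a word occurs in a sentence (1) or not (0), not how many times it occurs.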

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)

In [None]:
X = cv.fit_transform(corpus)
cv.vocabulary_

{'actually': 0,
 'analysis': 1,
 'analyzed': 2,
 'brings': 3,
 'confusing': 4,
 'context': 5,
 'different': 6,
 'find': 7,
 'form': 8,
 'grouping': 9,
 'includes': 10,
 'inflected': 11,
 'item': 12,
 'lemmatization': 13,
 'link': 14,
 'many': 15,
 'meaning': 16,
 'morphological': 17,
 'one': 18,
 'people': 19,
 'preferred': 20,
 'preprocessing': 21,
 'process': 22,
 'similar': 23,
 'single': 24,
 'stemming': 25,
 'term': 26,
 'text': 27,
 'time': 28,
 'together': 29,
 'treat': 30,
 'two': 31,
 'well': 32,
 'word': 33}

In [None]:
corpus[0]

'lemmatization process grouping together different inflected form word analyzed single item'

In [None]:
X[0].toarray()

array([[0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]])
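
A row like the one above can be reproduced by hand. The sketch below is a simplified pure-Python imitation of the idea, not scikit-learn's implementation; the names `docs`, `vocabulary`, and `vectorize` are illustrative, and only two sentences from the corpus are used to keep the vectors short.

```python
docs = [
    "lemmatization similar stemming brings context word",
    "link word similar meaning one word",
]  # two preprocessed sentences taken from corpus above

# Vocabulary: every distinct word, sorted alphabetically, mapped to a column index.
all_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

def vectorize(doc, binary=True):
    """Return one bag-of-words row: 0/1 flags if binary, raw counts otherwise."""
    row = [0] * len(vocabulary)
    for word in doc.split():
        if binary:
            row[vocabulary[word]] = 1
        else:
            row[vocabulary[word]] += 1
    return row

print(vocabulary)
print(vectorize(docs[1]))                # [0, 0, 0, 1, 1, 1, 1, 0, 1]
print(vectorize(docs[1], binary=False))  # [0, 0, 0, 1, 1, 1, 1, 0, 2]
```

The second sentence contains "word" twice, so the last column is 2 in count mode and 1 in binary mode.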

In [None]:
import pandas as pd
count_matrix = X.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(data=count_matrix, columns=cv.get_feature_names_out())



In [None]:
df.head()

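| | actually | analysis | analyzed | brings | confusing | context | different | find | form | grouping | ... | single | stemming | term | text | time | together | treat | two | well | word |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |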


In [None]:
df.columns

Index(['actually', 'analysis', 'analyzed', 'brings', 'confusing', 'context',
       'different', 'find', 'form', 'grouping', 'includes', 'inflected',
       'item', 'lemmatization', 'link', 'many', 'meaning', 'morphological',
       'one', 'people', 'preferred', 'preprocessing', 'process', 'similar',
       'single', 'stemming', 'term', 'text', 'time', 'together', 'treat',
       'two', 'well', 'word'],
      dtype='object')

**That's all for NLP Bag of Words**

Bag of words is a technique for converting text into features, here using CountVectorizer. If you like this kind of content, subscribe to my YouTube channel.

Youtube-Link: <a href = 'https://www.youtube.com/channel/UCsfYqXa3LoaLkB-9F2vmplA'>Click Here</a>

Or connect with me on LinkedIn.

LinkedIn: <a href = 'https://www.linkedin.com/in/tusharnautiyal/'>Click Here</a>

@author: Tushar Nautiyal
Hope you liked it 😀