# LEMMATIZATION AND STEMMING

In natural language processing (NLP), both lemmatization and stemming are techniques used to reduce words to their base or root form. They are employed to normalize text data, improve text analysis, and enhance computational efficiency. While they serve a similar purpose, there are key differences between the two approaches.

Lemmatization:
Lemmatization aims to reduce words to their base form, known as the lemma. The lemma is the dictionary form or the canonical form of a word. It ensures that different inflected forms of a word are mapped to the same base form. For example:

Lemmatization of the word "running" would result in "run."
Lemmatization of "better" would yield "good."
Lemmatization considers the context and part of speech (POS) of the word in order to produce the appropriate lemma. For instance, the lemma of "better" could be "good" as an adjective, but it could also be "well" as an adverb.
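
The dependence of the lemma on part of speech can be illustrated with a minimal sketch. It uses a hand-written lookup table rather than a real lemmatizer; the names `LEMMA_TABLE` and `toy_lemmatize` and the POS labels are hypothetical and exist only for this illustration. A real lemmatizer consults a full lexical database instead.

```python
# Toy lemma table keyed by (word, part of speech); the entries are illustrative only.
LEMMA_TABLE = {
    ("better", "adj"): "good",
    ("better", "adv"): "well",
    ("running", "verb"): "run",
    ("ran", "verb"): "run",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), or the lowercased word if it is not in the table."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(toy_lemmatize("better", "adj"))    # good
print(toy_lemmatize("better", "adv"))    # well
print(toy_lemmatize("running", "verb"))  # run
```

The same surface form maps to different lemmas depending on the POS tag, which is why a lemmatizer needs that tag to choose correctly.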

Stemming:
Stemming involves reducing words to their stems, which are the core parts of words. It involves removing prefixes or suffixes from words to obtain the base form. The stemming process may produce stems that are not actual words. For example:

Stemming the word "running" would result in "run."
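Stemming "easily" with the Porter stemmer would produce "easili," which is not a dictionary word.
Stemming "better" with the Porter stemmer would leave it as "better," because no suffix rule connects it to "good."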
Stemming algorithms apply a set of rules or heuristics to perform the reductions. Since stemming relies on rule-based manipulation, it is generally faster than lemmatization. However, it may not always generate linguistically correct stems.
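
To make the idea of rule-based suffix stripping concrete, here is a small sketch. It is a toy rule set written for this illustration, not the Porter algorithm; `SUFFIX_RULES`, `toy_stem`, and the `min_stem_len` parameter are hypothetical names and values.

```python
# Ordered (suffix, replacement) rules; a toy rule set, NOT the Porter algorithm.
SUFFIX_RULES = [
    ("ies", "i"),
    ("ing", ""),
    ("ly", ""),
    ("ed", ""),
    ("s", ""),
]


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Apply the first matching suffix rule that leaves a stem of at least min_stem_len letters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if len(stem) >= min_stem_len:
                return stem + replacement
    return word


for w in ["jumping", "jumps", "jumped", "quickly", "studies", "running", "better"]:
    print(w, "->", toy_stem(w))
# jumping -> jump, jumps -> jump, jumped -> jump, quickly -> quick,
# studies -> studi, running -> runn, better -> better
```

The rules never consult a dictionary, so the function is fast but can return non-words such as "studi" and "runn". Real stemmers add further rules (for example, cleaning up doubled consonants) to reduce such cases.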

The choice between lemmatization and stemming depends on the specific NLP task and the desired outcome. Lemmatization is often preferred when the output must consist of valid dictionary words, because it produces meaningful base forms that are useful for tasks such as language understanding and topic modeling. Stemming, on the other hand, is suitable for tasks like information retrieval, search engines, and sentiment analysis, where speed and simplicity are prioritized over linguistic accuracy.

NLTK (Natural Language Toolkit) provides both stemmers and lemmatizers, including stemmers for several languages. spaCy provides lemmatization as part of its processing pipeline but does not ship a stemmer.

# 1. Bag of Words (BoW):

BoW is a representation of text data where each document is represented as a collection of words, disregarding grammar and word order.
It involves creating a vocabulary of unique words from the entire corpus and representing each document as a vector of word frequencies or presence/absence indicators.
BoW is a simple and commonly used approach for text classification and information retrieval tasks.
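
The vocabulary-then-vector procedure described above can be sketched in plain Python. The two example documents and the names `docs`, `vocabulary`, and `bow_vector` are made up for this illustration; a library vectorizer would normally do this work.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Vocabulary: sorted unique words across the whole corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def bow_vector(doc: str, binary: bool = False) -> list[int]:
    """Map a document to a vector aligned with `vocabulary` (counts, or 0/1 if binary)."""
    counts = Counter(doc.split())
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]


print(vocabulary)                        # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(bow_vector(docs[0]))               # [1, 0, 0, 1, 1, 1, 2]
print(bow_vector(docs[0], binary=True))  # [1, 0, 0, 1, 1, 1, 1]
print(bow_vector(docs[1]))               # [0, 1, 1, 0, 1, 1, 2]
```

Word order is lost: any reordering of the words in a document yields the same vector.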

# 2. Unigrams, Bigrams, and N-grams:

Unigrams refer to individual words in a text.
Bigrams are pairs of consecutive words occurring together in a text.
N-grams are sequences of N words occurring together in a text.
N-grams capture more context and can provide a richer representation of text compared to individual words.
By considering different values of N, such as trigrams or higher-order n-grams, more extensive contextual information can be captured.
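
As a brief sketch of the definitions above, the helper below slides a window of N tokens over a token list. The function name `ngrams` and the sample sentence are illustrative assumptions; whitespace splitting stands in for a proper tokenizer.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-token windows, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the quick brown fox jumps".split()
print(ngrams(tokens, 1))  # five unigrams
print(ngrams(tokens, 2))  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]
print(ngrams(tokens, 3))  # three trigrams
```

A sequence of T tokens yields T - N + 1 n-grams, so larger N gives fewer but more specific features.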

# 3. Text to Speech (TTS):

TTS is a technology that converts written text into spoken words.
It involves synthesizing human-like speech from text input.
TTS systems use various techniques, such as concatenative synthesis or parametric synthesis, to generate speech.
TTS finds applications in voice assistants, audiobooks, accessibility tools, and more.

# 4. Speech to Text (STT):

STT, also known as Automatic Speech Recognition (ASR), is a technology that converts spoken language into written text.
It involves analyzing audio recordings and transcribing the spoken words into a textual form.
STT systems use techniques like acoustic modeling, language modeling, and decoding algorithms to recognize and convert speech.
STT is used in applications like transcription services, voice-controlled systems, voice assistants, and more.

Both TTS and STT are important components of natural language understanding and communication systems, enabling interactions between humans and machines through speech and text.






# Stemming

In [2]:
import nltk
from nltk.stem.porter import PorterStemmer

The import above brings in the Porter stemmer from the NLTK library. The Porter algorithm is one of the most widely used stemming algorithms in natural language processing. It reduces English words to their stems by stripping common suffixes in a fixed sequence of rule-based steps.

The Porter stemmer suits applications where speed and simplicity are prioritized over linguistic accuracy, and it may produce stems that are not actual words. NLTK also provides the Snowball stemmers, which cover several languages, and the Lancaster stemmer, which is more aggressive than Porter. When linguistically valid base forms are required, a lemmatizer is the better choice.

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
stemming = PorterStemmer()

In [6]:
words = ['run', 'runs', 'running', 'ran', 'easily', 'fairly', 'easy', 'fair', 'history', 'historical']

In [7]:
for word in words:
  print(word + '--->'+stemming.stem(word))

run--->run
runs--->run
running--->run
ran--->ran
easily--->easili
fairly--->fairli
easy--->easi
fair--->fair
history--->histori
historical--->histor


In [8]:
sentence = "The government has constituted a new internal oversight mechanism for official data, revamping a Standing Committee on Economic Statistics (SCES) set up in late 2019, soon after the findings from the last round of household surveys on consumption expenditure and employment were junked, citing ‘data quality issues’. In an order issued last Thursday, the Statistics Ministry said that the SCES — which was tasked with examining economic indicators only — will now be replaced by a Standing Committee on Statistics (SCoS) which has a broader mandate to review the framework and results of all surveys conducted under the aegis of the National Statistical Office (NSO).Pronab Sen, India’s first chief statistician and the former chairman of the National Statistical Commission (NSC), has been named the chair of the new committee."

In [9]:
from nltk.corpus import stopwords
nltk.download('stopwords')  # the stop-word lists must be downloaded before first use

In [10]:
my_review = nltk.sent_tokenize(sentence)

In [11]:
my_review

['The government has constituted a new internal oversight mechanism for official data, revamping a Standing Committee on Economic Statistics (SCES) set up in late 2019, soon after the findings from the last round of household surveys on consumption expenditure and employment were junked, citing ‘data quality issues’.',
 'In an order issued last Thursday, the Statistics Ministry said that the SCES — which was tasked with examining economic indicators only — will now be replaced by a Standing Committee on Statistics (SCoS) which has a broader mandate to review the framework and results of all surveys conducted under the aegis of the National Statistical Office (NSO).Pronab Sen, India’s first chief statistician and the former chairman of the National Statistical Commission (NSC), has been named the chair of the new committee.']

In [12]:
stemming = PorterStemmer()

In [13]:
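# Build the stop-word set once instead of once per token
stop_words = set(stopwords.words('english'))
for i in range(len(my_review)):
    words = nltk.word_tokenize(my_review[i])
    # The check is case-sensitive, so a capitalized stop word such as "The" is kept
    words = [stemming.stem(word) for word in words if word not in stop_words]
    my_review[i] = ' '.join(words)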

In [14]:
my_review

['the govern constitut new intern oversight mechan offici data , revamp stand committe econom statist ( sce ) set late 2019 , soon find last round household survey consumpt expenditur employ junk , cite ‘ data qualiti issu ’ .',
 'in order issu last thursday , statist ministri said sce — task examin econom indic — replac stand committe statist ( sco ) broader mandat review framework result survey conduct aegi nation statist offic ( nso ) .pronab sen , india ’ first chief statistician former chairman nation statist commiss ( nsc ) , name chair new committe .']

# Lemmatization


In [15]:
words = ['run', 'runs', 'running', 'ran', 'easily', 'fairly', 'easy', 'fair', 'history', 'historical']

In [16]:
from nltk.stem import WordNetLemmatizer


The import above brings in the WordNet lemmatizer from the NLTK library. It reduces words to their base form, or lemma, by looking them up in the WordNet lexical database. Its lemmatize() method treats each word as a noun unless a different part of speech is passed through the pos argument (for example, pos='v' for verbs), so verb forms such as "running" come back unchanged under the default.

In [17]:
lemmatizer = WordNetLemmatizer()

In [18]:
for word in words:
    print(word + '--->' + lemmatizer.lemmatize(word))


run--->run
runs--->run
running--->running
ran--->ran
easily--->easily
fairly--->fairly
easy--->easy
fair--->fair
history--->history
historical--->historical


In [19]:
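# Note: my_review was overwritten with stemmed text in In [13], so this loop lemmatizes
# stems; re-run nltk.sent_tokenize(sentence) first to lemmatize the original text.
stop_words = set(stopwords.words('english'))
for i in range(len(my_review)):
    words = nltk.word_tokenize(my_review[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    my_review[i] = ' '.join(words)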

In [20]:
my_review

['govern constitut new intern oversight mechan offici data , revamp stand committe econom statist ( sce ) set late 2019 , soon find last round household survey consumpt expenditur employ junk , cite ‘ data qualiti issu ’ .',
 'order issu last thursday , statist ministri said sce — task examin econom indic — replac stand committe statist ( sco ) broader mandat review framework result survey conduct aegi nation statist offic ( nso ) .pronab sen , india ’ first chief statistician former chairman nation statist commiss ( nsc ) , name chair new committe .']

# BAG OF WORDS (BOW)

In [21]:
sentence = "The SCoS — with “enhanced terms of reference” vis-à-vis the SCES, “to ensure more coverage” — has 10 official members, and four non-official members who are eminent academics. The panel can have up to 16 members, as per the order issued by the Ministry of Statistics and Programme Implementation (MoSPI). The development assumes significance amid sharp critiques of India’s statistical machinery by members of the Economic Advisory Council to the Prime Minister, including its chairperson Bibek Debroy. He had mooted an overhaul of the system and contended that the Indian Statistical Service has “little expertise in survey design”. “The term of the SCES was coming to an end in any case, so it was decided to expand the committee’s mandate beyond economic data and advise the Ministry on technical aspects for all surveys, such as sampling frame, design, survey methodology and finalisation of results,” an official said.Apart from addressing issues raised from time to time on the subject, results and methodology for all surveys, the SCoS’ terms of reference include the identification of data gaps that need to be filled by official statistics, along with an appropriate strategy to plug those gaps. It has also been mandated to explore the use of administrative statistics to improve data outcomes."

In [22]:
my_review = nltk.sent_tokenize(sentence)

In [23]:
my_review

['The SCoS — with “enhanced terms of reference” vis-à-vis the SCES, “to ensure more coverage” — has 10 official members, and four non-official members who are eminent academics.',
 'The panel can have up to 16 members, as per the order issued by the Ministry of Statistics and Programme Implementation (MoSPI).',
 'The development assumes significance amid sharp critiques of India’s statistical machinery by members of the Economic Advisory Council to the Prime Minister, including its chairperson Bibek Debroy.',
 'He had mooted an overhaul of the system and contended that the Indian Statistical Service has “little expertise in survey design”.',
 '“The term of the SCES was coming to an end in any case, so it was decided to expand the committee’s mandate beyond economic data and advise the Ministry on technical aspects for all surveys, such as sampling frame, design, survey methodology and finalisation of results,” an official said.Apart from addressing issues raised from time to time on th

In [26]:
# Import the regular-expression module and the NLTK helpers

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

The import re statement loads Python's built-in re module, which provides support for regular expressions: patterns used to match and manipulate strings.

# Pre-Processing steps

In [29]:
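corpus = []
stop_words = set(stopwords.words('english'))  # build once, reuse for every sentence

for i in range(len(my_review)):
    review = re.sub('[^a-zA-Z]', ' ', my_review[i])  # replace non-letters with spaces
    review = review.lower()  # lowercase so capitalized stop words are matched too
    words = nltk.word_tokenize(review)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    review = ' '.join(words)
    corpus.append(review)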

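In this code, a loop iterates over the indices of my_review. For each index i, re.sub('[^a-zA-Z]', ' ', my_review[i]) replaces every non-alphabetic character in my_review[i] with a space. This removes digits and punctuation, and also non-ASCII letters such as the "à" in "vis-à-vis".

review = review.lower() converts the text to lowercase. String comparison in Python is case-sensitive, so without this step "The" would not match the stop word "the".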
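
The cleaning steps can be tried in isolation with the sketch below. It uses only the standard library; the small `STOP_WORDS` set and the `clean` helper are assumptions made to keep the example self-contained and stand in for NLTK's stop-word list and tokenizer.

```python
import re

# A tiny hand-picked stop-word set, used here only to keep the sketch self-contained.
STOP_WORDS = {"the", "of", "to", "and", "a", "in", "has"}


def clean(text: str) -> str:
    """Replace non-letters with spaces, lowercase, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)


sample = "The panel has 10 official members, and four non-official members."
print(clean(sample))  # panel official members four non official members

# Replacing with '' instead of ' ' would glue neighbouring words together:
print(re.sub(r"[^a-zA-Z]", "", "non-official members"))  # nonofficialmembers
```

Substituting a space rather than an empty string keeps word boundaries intact, which matters because the next step splits the text into tokens.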


In [30]:
corpus

['scos enhanced term reference vi vi sces ensure coverage official member four non official member eminent academic',
 'panel member per order issued ministry statistic programme implementation mospi',
 'development assumes significance amid sharp critique india statistical machinery member economic advisory council prime minister including chairperson bibek debroy',
 'mooted overhaul system contended indian statistical service little expertise survey design',
 'term sces coming end case decided expand committee mandate beyond economic data advise ministry technical aspect survey sampling frame design survey methodology finalisation result official said apart addressing issue raised time time subject result methodology survey scos term reference include identification data gap need filled official statistic along appropriate strategy plug gap',
 'also mandated explore use administrative statistic improve data outcome']

# Building Bag of Words Model

In [37]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=500, binary=True)
x = cv.fit_transform(corpus).toarray()


In [38]:
x

array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1,
        1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1,
        0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [39]:
import pandas as pd
df = pd.DataFrame(x)

In [40]:
df.head()

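|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 |
|---|---|---|---|---|---|---|---|---|---|---|-----|----|----|----|----|----|----|----|----|----|----|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | ... | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |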


In [41]:
df.shape

(6, 91)

In [42]:
# Bag of Words
# Converting text to vectors
    # Binary BoW
    # Boolean BoW
# Disadvantage: creates a sparse matrix (mostly zero values)
# Disadvantage: we cannot figure out which word is more significant
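
Both disadvantages noted above can be seen on a tiny example. The sketch below builds a binary BoW matrix in plain Python and measures the share of zero cells; the three short documents and all variable names are invented for this illustration.

```python
docs = [
    "data data data survey",
    "survey design",
    "ministry order",
]

vocabulary = sorted({word for doc in docs for word in doc.split()})

# Binary BoW: 1 if the word occurs in the document, else 0.
matrix = [[1 if word in doc.split() else 0 for word in vocabulary] for doc in docs]

total_cells = len(matrix) * len(vocabulary)
zero_cells = sum(row.count(0) for row in matrix)
sparsity = zero_cells / total_cells

print(vocabulary)  # ['data', 'design', 'ministry', 'order', 'survey']
for row in matrix:
    print(row)
print(f"sparsity = {sparsity:.2f}")  # sparsity = 0.60
```

In the first document "data" occurs three times and "survey" once, yet both are encoded as 1, so the binary representation cannot show which word matters more. Even with five vocabulary words, 60% of the cells are zero, and the share grows as the vocabulary grows.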

# Text to Speech

In [43]:
import os
os.getcwd()

'/content'

In [44]:
!pip install gTTS
# gTTS  - google text to speech

Collecting gTTS
  Downloading gTTS-2.3.2-py3-none-any.whl (28 kB)
Installing collected packages: gTTS
Successfully installed gTTS-2.3.2


In [45]:
from gtts import gTTS

In [48]:
my_sentence = input('Please enter your sentence here to convert into audio : \n>')

Please enter your sentence here to convert into audio : 
>While the panel will help finalise survey results, the NSC will have the ultimate authority to approve the publication of those results. The government had reconstituted the NSC last December, appointing Rajeeva Laxman Karandikar, Professor Emeritus at the Chennai Mathematical Institute, as its part-time chairperson. While the NSC still has two vacancies, the appointments of Mr. Karandikar and two other part-time members were notified on May 30 this year.


In [49]:
sentence = gTTS(text=my_sentence, lang='en')

In [50]:
sentence.save('english.mp3')

In [51]:
# other Language

In [52]:
my_sentence = input('Please enter your sentence here to convert into audio : \n')

Please enter your sentence here to convert into audio : 
এ বার বড় পর্দায় রবীন্দ্রনাথ ঠাকুর। কবিগুরুর সৃষ্ট গান, কবিতা, গল্প বা নাটক নিয়ে নয়— এ বার স্বয়ং তাঁকে নিয়েই কাজ বলিউডে। রবি ঠাকুরের চরিত্রে ইতিমধ্যেই চূড়ান্ত হয়ে গিয়েছেন অভিনেতাও। সম্প্রতি সমাজমাধ্যমের পাতায় প্রকাশ্যে এসেছে রবীন্দ্রনাথের বেশে সেই অভিনেতার ‘লুক’ও। বড় পর্দায় রবীন্দ্রনাথের বেশে দেখা যেতে চলেছে কোন বলিউড অভিনেতাকে?


In [53]:
sentence = gTTS(text=my_sentence, lang='bn')

In [64]:
sentence.save('bengali.mp3')  # separate file name so the English audio is not overwritten

In [65]:
my_sentence = input('Please enter your sentence here to convert into audio : \n')

Please enter your sentence here to convert into audio : 
এসো আমার ঘরে।বাহির হয়ে এসো তুমি যে আছ অন্তরে॥ স্বপনদুয়ার খুলে এসো অরুণ-আলোকে, মুগ্ধ এ চোখে। ক্ষণকালের আভাস হতে, চিরকালের তরে এসো আমার ঘরে॥


In [66]:
sentence = gTTS(text=my_sentence, lang='bn')

In [68]:
sentence.save('bengali_telugu.mp3')