# <b>NLP: Intro to stemming</b>

> Description:
  * Stemming reduces a word to a base or root form (its stem) by stripping affixes; the Porter stemmer used below strips suffixes only. It is a common text-normalization step in Natural Language Processing (NLP): collapsing inflected variants into one token shrinks the vocabulary, which reduces the dimensionality of text features and the cost of processing them. For example, "running" and "runs" both reduce to the stem "run"; an irregular form such as "ran" is left unchanged, because a stemmer applies spelling rules rather than consulting a dictionary. Stemming can help tasks such as sentiment analysis, text classification, and search engine indexing, although the benefit depends on the task and the data.
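
To make the idea of suffix stripping concrete, here is a minimal sketch of a toy stemmer. It is not the Porter algorithm: the suffix list, the minimum stem length, and the name `toy_stem` are assumptions made purely for illustration.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical list, tried in this order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant ("runn" -> "run"), but keep
            # doubled vowels, l and s ("see", "fall", "pass").
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word  # no rule applies, e.g. the irregular form "ran"


for w in ["running", "runs", "ran", "galaxies"]:
    print(w, "->", toy_stem(w))
# running -> run
# runs -> run
# ran -> ran
# galaxies -> galaxi
```

Even this tiny rule set shows both sides of stemming: regular inflections collapse to one stem, while "ran" is missed and "galaxies" becomes the non-word "galaxi".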

In [1]:
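# Import the libraries
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
# One-time downloads of the tokenizer models and the stop-word list
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')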


In [2]:
# Create the paragraph that will be stemmed.

paragraph  = '''The Cosmic Microwave Background (CMB) is a form of electromagnetic radiation that pervades the entire universe. It is thought to be the afterglow of the Big Bang, the event that marks the beginning of the universe as we know it.

The CMB was first discovered in 1964 by two radio astronomers, Arno Penzias and Robert Wilson, who were working at Bell Labs in New Jersey. They were using a large horn-shaped antenna to study radio waves emitted by the Milky Way, but they kept detecting a mysterious signal that seemed to be coming from all directions in the sky. After ruling out a number of possible explanations, they realized that they had stumbled upon the CMB.

The CMB is incredibly faint, with a temperature of just 2.7 Kelvin (-270.45 degrees Celsius). However, it is remarkably uniform across the entire sky, with temperature variations of just a few parts in 100,000. These tiny fluctuations are thought to be the result of slight density variations in the early universe, which were stretched out by cosmic expansion to form the large-scale structures we see today, such as galaxies and clusters of galaxies.

Studying the CMB has been crucial to our understanding of the universe and its evolution. It has provided strong evidence for the Big Bang theory, as well as for the existence of dark matter and dark energy. It has also allowed astronomers to measure the age, size, and composition of the universe with unprecedented accuracy.

In recent years, the study of the CMB has entered a new era, with a number of high-precision experiments, such as the Planck satellite and the Atacama Cosmology Telescope, providing even more detailed maps of the CMB and shedding light on some of the universe's deepest mysteries.
'''

In [3]:
# Perform sentence tokenization:

sentences = nltk.sent_tokenize(paragraph)
print(len(sentences))

12


In [4]:
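# Perform stemming:
stemmer = PorterStemmer()
# Build the stop-word set once instead of once per token
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # The membership test is case-sensitive, so capitalized stop words such as
    # "The" are kept; PorterStemmer.stem() then lowercases them
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)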
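
The loop above compares each token with the stop-word list exactly as written. NLTK's English list is lowercase, so sentence-initial words such as "The" and "It" pass the filter and show up in the stemmed output. If that is not wanted, one option is to lowercase each token before the membership test. The sketch below is independent of NLTK; the mini stop-word list and the name `remove_stop_words` are assumptions for illustration only.

```python
# Hypothetical mini stop-word list; NLTK's English list is much longer.
STOP_WORDS = frozenset({"the", "is", "a", "of", "that", "it", "to", "be"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token whose lowercase form is a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ["The", "CMB", "is", "a", "form", "of", "radiation"]

# Case-sensitive test, as in the loop above: "The" survives.
print([tok for tok in tokens if tok not in STOP_WORDS])
# ['The', 'CMB', 'form', 'radiation']

# Case-insensitive test: "The" is removed as well.
print(remove_stop_words(tokens))
# ['CMB', 'form', 'radiation']
```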

In [5]:
sentences

['the cosmic microwav background ( cmb ) form electromagnet radiat pervad entir univers .',
 'it thought afterglow big bang , event mark begin univers know .',
 'the cmb first discov 1964 two radio astronom , arno penzia robert wilson , work bell lab new jersey .',
 'they use larg horn-shap antenna studi radio wave emit milki way , kept detect mysteri signal seem come direct sky .',
 'after rule number possibl explan , realiz stumbl upon cmb .',
 'the cmb incred faint , temperatur 2.7 kelvin ( -270.45 degre celsiu ) .',
 'howev , remark uniform across entir sky , temperatur variat part 100,000 .',
 'these tini fluctuat thought result slight densiti variat earli univers , stretch cosmic expans form large-scal structur see today , galaxi cluster galaxi .',
 'studi cmb crucial understand univers evolut .',
 'it provid strong evid big bang theori , well exist dark matter dark energi .',
 'it also allow astronom measur age , size , composit univers unpreced accuraci .',
 "in recent year , s

## Problem with stemming
> Stemming can produce stems that are not real words: the output above includes "univers", "microwav", and "celsiu".

> To overcome this, we use lemmatization, which maps each word to its dictionary form (lemma).
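
As a rough illustration of the difference, a lemmatizer looks words up in a vocabulary instead of trimming their endings. The sketch below fakes that vocabulary with a tiny hand-written table; `LEMMAS` and `toy_lemmatize` are hypothetical names, and a real lemmatizer (for example NLTK's `WordNetLemmatizer`) relies on a full lexicon and usually on part-of-speech information.

```python
# Hypothetical lookup table for illustration; a real lemmatizer uses a full
# lexicon such as WordNet rather than a hand-written dictionary.
LEMMAS = {
    "ran": "run",
    "running": "run",
    "runs": "run",
    "galaxies": "galaxy",
    "studies": "study",
}


def toy_lemmatize(word: str) -> str:
    """Return the dictionary form if known, otherwise the lowercased word."""
    key = word.lower()
    return LEMMAS.get(key, key)


for w in ["ran", "galaxies", "universe"]:
    print(w, "->", toy_lemmatize(w))
# ran -> run
# galaxies -> galaxy
# universe -> universe
```

Because the result always comes from the vocabulary, the output is a real word, and irregular forms such as "ran" can be handled.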