### Introduction
NLTK (Natural Language Toolkit) is a toolkit built for natural language processing (NLP) in Python. It provides a variety of text-processing libraries.
#### Text Processing steps:
1. Tokenization
2. Lower case conversion
3. Stop words removal
4. Stemming
5. Lemmatization
6. Parse tree or syntax tree generation
7. POS tagging

#### 1. Tokenization
Tokenization is the breaking down of text into smaller units called tokens. Depending on the tokenizer, a token can be a sentence, a word, or a punctuation mark.

In [31]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Text preprocessing is a method to clean the text data and make it ready to feed data to the model. Text data contains noise in various forms like emotions, punctuation, text in a different case. When we talk about Human Language then, there are different ways to say the same thing, And this is only the main problem we have to deal with because machines will not understand words, they need numbers so we need to convert text to numbers in an efficient manner"

print(sent_tokenize(text))
print(word_tokenize(text))

['Text preprocessing is a method to clean the text data and make it ready to feed data to the model.', 'Text data contains noise in various forms like emotions, punctuation, text in a different case.', 'When we talk about Human Language then, there are different ways to say the same thing, And this is only the main problem we have to deal with because machines will not understand words, they need numbers so we need to convert text to numbers in an efficient manner']
['Text', 'preprocessing', 'is', 'a', 'method', 'to', 'clean', 'the', 'text', 'data', 'and', 'make', 'it', 'ready', 'to', 'feed', 'data', 'to', 'the', 'model', '.', 'Text', 'data', 'contains', 'noise', 'in', 'various', 'forms', 'like', 'emotions', ',', 'punctuation', ',', 'text', 'in', 'a', 'different', 'case', '.', 'When', 'we', 'talk', 'about', 'Human', 'Language', 'then', ',', 'there', 'are', 'different', 'ways', 'to', 'say', 'the', 'same', 'thing', ',', 'And', 'this', 'is', 'only', 'the', 'main', 'problem', 'we', 'have',

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\antuh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
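
To make the idea of splitting text into tokens concrete, here is a rough sketch that uses only Python's `re` module. It is not how NLTK's tokenizers work internally; the two patterns and the `sample` string are assumptions chosen for illustration.

```python
import re

sample = "Machines need numbers. So we convert text to numbers!"

# Naive sentence split: break after ., ! or ? when whitespace follows.
sentences = re.split(r"(?<=[.!?])\s+", sample)

# Naive word split: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", sample)

print(sentences)  # ['Machines need numbers.', 'So we convert text to numbers!']
print(tokens)     # ['Machines', 'need', 'numbers', '.', 'So', 'we', 'convert', 'text', 'to', 'numbers', '!']
```

A pattern like this would also split after abbreviations such as "Dr.", which is one reason to prefer a dedicated tokenizer for real text.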


#### 2. Lower Case Conversion
Converting the text to lower case makes words such as "Text" and "text" count as the same token. The cell below also replaces punctuation with spaces before splitting the text into words.

In [32]:
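import re
# Lowercase the text and replace every non-alphanumeric character with a space
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
# Split on whitespace to get a list of words
words = text.split()
print(words)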

['text', 'preprocessing', 'is', 'a', 'method', 'to', 'clean', 'the', 'text', 'data', 'and', 'make', 'it', 'ready', 'to', 'feed', 'data', 'to', 'the', 'model', 'text', 'data', 'contains', 'noise', 'in', 'various', 'forms', 'like', 'emotions', 'punctuation', 'text', 'in', 'a', 'different', 'case', 'when', 'we', 'talk', 'about', 'human', 'language', 'then', 'there', 'are', 'different', 'ways', 'to', 'say', 'the', 'same', 'thing', 'and', 'this', 'is', 'only', 'the', 'main', 'problem', 'we', 'have', 'to', 'deal', 'with', 'because', 'machines', 'will', 'not', 'understand', 'words', 'they', 'need', 'numbers', 'so', 'we', 'need', 'to', 'convert', 'text', 'to', 'numbers', 'in', 'an', 'efficient', 'manner']
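
The effect of the pattern is easier to see on a short string. The sketch below applies the same regular expression to a made-up sentence (the `sample` text is an assumption) and shows one side effect worth knowing: because the pattern keeps only ASCII letters and digits, accented characters are dropped as well.

```python
import re

sample = "Hello, World! The café opens at 9."

# Same normalization as above: lowercase, then blank out anything
# that is not an ASCII letter or digit.
normalized = re.sub(r"[^a-zA-Z0-9]", " ", sample.lower())
sample_words = normalized.split()

print(sample_words)  # ['hello', 'world', 'the', 'caf', 'opens', 'at', '9']
```

If the text contains non-English words, a broader pattern may be a better fit than this ASCII-only one.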


#### 3. Stop Words Removal
Stop words are common words such as “the,” “is,” and “and” that occur frequently but convey little semantic meaning. Removing them can make text analysis more efficient by reducing noise.

In [34]:
nltk.download('stopwords')

from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\antuh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [36]:
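# Build the stop-word set once; a set lookup is much faster than rescanning the list for every word
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in words if w.lower() not in stop_words]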

print(filtered_words)

['text', 'preprocessing', 'method', 'clean', 'text', 'data', 'make', 'ready', 'feed', 'data', 'model', 'text', 'data', 'contains', 'noise', 'various', 'forms', 'like', 'emotions', 'punctuation', 'text', 'different', 'case', 'talk', 'human', 'language', 'different', 'ways', 'say', 'thing', 'main', 'problem', 'deal', 'machines', 'understand', 'words', 'need', 'numbers', 'need', 'convert', 'text', 'numbers', 'efficient', 'manner']
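
Note that "not" was removed from "machines will not understand words" in the output above, which can matter for tasks such as sentiment analysis. The sketch below shows one possible way to keep negations. It uses a small hand-picked stop-word set (an assumption for illustration; each word also appears in NLTK's English list) rather than the full list.

```python
# Hand-picked subset of stop words, for illustration only.
base_stop_words = {"will", "not", "the", "we", "to", "with"}

tokens = ["machines", "will", "not", "understand", "words"]

# Default filtering drops the negation along with the other stop words.
default_filtered = [t for t in tokens if t not in base_stop_words]

# Removing "not" from the stop-word set keeps the negation in the text.
custom_stop_words = base_stop_words - {"not"}
negation_kept = [t for t in tokens if t not in custom_stop_words]

print(default_filtered)  # ['machines', 'understand', 'words']
print(negation_kept)     # ['machines', 'not', 'understand', 'words']
```

The same set subtraction should work on a set built from `stopwords.words('english')`.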


#### 4. Stemming
Text often contains related forms such as playing, played, and playfully that share the root word play and convey much the same meaning. Stemming extracts that root and discards the rest. The resulting root is called the stem, and it does not have to be an existing word with a meaning of its own. Stems are generated simply by stripping affixes (in the Porter stemmer used here, suffixes).

In [37]:
from nltk.stem.porter import PorterStemmer
# Reduce words to their stems, reusing a single stemmer instance
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)

['text', 'preprocess', 'is', 'a', 'method', 'to', 'clean', 'the', 'text', 'data', 'and', 'make', 'it', 'readi', 'to', 'feed', 'data', 'to', 'the', 'model', 'text', 'data', 'contain', 'nois', 'in', 'variou', 'form', 'like', 'emot', 'punctuat', 'text', 'in', 'a', 'differ', 'case', 'when', 'we', 'talk', 'about', 'human', 'languag', 'then', 'there', 'are', 'differ', 'way', 'to', 'say', 'the', 'same', 'thing', 'and', 'thi', 'is', 'onli', 'the', 'main', 'problem', 'we', 'have', 'to', 'deal', 'with', 'becaus', 'machin', 'will', 'not', 'understand', 'word', 'they', 'need', 'number', 'so', 'we', 'need', 'to', 'convert', 'text', 'to', 'number', 'in', 'an', 'effici', 'manner']
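
To illustrate why stems such as 'variou' are not real words, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which uses a much larger, ordered rule set with conditions on the remaining stem. The suffix list and the three-character minimum are assumptions made for this sketch.

```python
# Hypothetical suffix list, checked in order; not Porter's actual rules.
SUFFIXES = ("fully", "ing", "ed", "ly", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

sample = ["playing", "played", "playfully", "plays", "various"]
stems = [naive_stem(w) for w in sample]

print(stems)  # ['play', 'play', 'play', 'play', 'variou']
```

Because the rules only look at the spelling of the word, "various" loses its final "s" even though "variou" has no meaning on its own.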


#### 5. Lemmatization
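Lemmatization also reduces a word to a base form, but it ensures that the result does not lose its meaning. It relies on a predefined dictionary (WordNet in the code below) and checks each word against it while reducing, so the result, called the lemma, is a valid word. Note that WordNetLemmatizer.lemmatize treats every word as a noun unless a pos argument is passed, which is why verbs such as 'contains' and 'are' come out unchanged in the output.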

In [38]:
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\antuh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [39]:
from nltk.stem.wordnet import WordNetLemmatizer
# Reduce words to their dictionary form (lemma), reusing a single lemmatizer instance
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w) for w in words]
print(lemmed)

['text', 'preprocessing', 'is', 'a', 'method', 'to', 'clean', 'the', 'text', 'data', 'and', 'make', 'it', 'ready', 'to', 'feed', 'data', 'to', 'the', 'model', 'text', 'data', 'contains', 'noise', 'in', 'various', 'form', 'like', 'emotion', 'punctuation', 'text', 'in', 'a', 'different', 'case', 'when', 'we', 'talk', 'about', 'human', 'language', 'then', 'there', 'are', 'different', 'way', 'to', 'say', 'the', 'same', 'thing', 'and', 'this', 'is', 'only', 'the', 'main', 'problem', 'we', 'have', 'to', 'deal', 'with', 'because', 'machine', 'will', 'not', 'understand', 'word', 'they', 'need', 'number', 'so', 'we', 'need', 'to', 'convert', 'text', 'to', 'number', 'in', 'an', 'efficient', 'manner']
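
The dictionary idea behind lemmatization can be sketched with a plain Python dict. This is only a toy stand-in for WordNet: the `LEMMAS` table is hypothetical, with entries that mirror pairs visible in the output above, and WordNet's actual procedure also applies suffix-detachment rules before the lookup.

```python
# Hypothetical mini-dictionary standing in for WordNet.
LEMMAS = {
    "forms": "form",
    "emotions": "emotion",
    "ways": "way",
    "machines": "machine",
}

def lookup_lemma(word: str) -> str:
    """Return the dictionary form if the word is known, else the word unchanged."""
    return LEMMAS.get(word, word)

sample = ["machines", "forms", "this", "various"]
lemmas = [lookup_lemma(w) for w in sample]

print(lemmas)  # ['machine', 'form', 'this', 'various']
```

Unlike the stemmed output, where 'this' and 'various' became 'thi' and 'variou', words that are not found in the dictionary are left intact.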
