# 1. What is the purpose of text preprocessing in NLP, and why is it essential before analysis?

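Text preprocessing cleans and transforms unstructured text into a consistent form that can be analyzed. Common steps include tokenization, stop-word removal, stemming, lemmatization, and part-of-speech tagging. It is essential because raw text is noisy: the same word appears in different cases, inflections, and punctuation contexts, and without preprocessing a model treats each variant as a separate feature.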
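
As a rough illustration of how several preprocessing steps chain together, here is a minimal sketch in plain Python. The `preprocess` function and the tiny `STOP_WORDS` set are hypothetical names made up for this example; they are not part of NLTK or of the cells in this notebook.

```python
import re

# Hypothetical, hand-picked stop-word list for illustration only;
# real projects usually load a curated list instead.
STOP_WORDS = {"the", "is", "a", "and", "of", "to", "in"}

def preprocess(text):
    """Lowercase, tokenize on letters/digits, and drop stop words."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z0-9]+", lowered)
    return [tok for tok in tokens if tok not in STOP_WORDS]

cleaned = preprocess("The cat is sitting in the garden, and the dog is barking!")
print(cleaned)  # ['cat', 'sitting', 'garden', 'dog', 'barking']
```

The regular expression keeps only ASCII letters and digits, which is an assumption that suits this English example but not text in other scripts.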

# 2. Describe tokenization in NLP and explain its significance in text processing.

Tokenization splits paragraphs and sentences into smaller units, called tokens, that can be more easily assigned meaning. It is typically the first step after gathering the data: a document is broken into sentences, and each sentence into words and punctuation marks. Almost every later step (stemming, stop-word removal, vectorization) operates on these tokens.
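
To show the idea without any library, the sketch below tokenizes with regular expressions. The helpers `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical and far cruder than NLTK's tokenizers; for instance, the sentence rule would wrongly split after an abbreviation such as "Dr.".

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    """Return runs of word characters and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "Tokenization splits text. It yields words (tokens)!"
sentences = simple_sent_tokenize(sample)
words = simple_word_tokenize(sentences[1])

print(sentences)  # ['Tokenization splits text.', 'It yields words (tokens)!']
print(words)      # ['It', 'yields', 'words', '(', 'tokens', ')', '!']
```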

In [5]:
from nltk.tokenize import sent_tokenize, word_tokenize

text="""Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning. The first step of the NLP process is gathering the data (a sentence) and breaking it into understandable parts (words)."""

# Sentence tokenization: split the text into sentences
print('Original text:\n',text)
print()
tokenised_sent=sent_tokenize(text)
print('After sentence tokenization:\n',tokenised_sent)
print()
print('No. of sentences:\n',len(tokenised_sent))
print('='*70)
print()

# Word tokenization: split the text into words and punctuation marks
print('Original text:\n',text)
print()
tokenised_word=word_tokenize(text)
print('After word tokenization:\n',tokenised_word)
print()
print('No. of words:\n',len(tokenised_word))

Original text:
 Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning. The first step of the NLP process is gathering the data (a sentence) and breaking it into understandable parts (words).

After sentence tokenization:
 ['Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning.', 'The first step of the NLP process is gathering the data (a sentence) and breaking it into understandable parts (words).']

No. of sentences:
 2

Original text:
 Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning. The first step of the NLP process is gathering the data (a sentence) and breaking it into understandable parts (words).

After word tokenization:
 ['Tokenization', 'is', 'used', 'in', 'natural', 'language', 'processing', 'to', 'sp

In [2]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to C:\Users\Sravan
[nltk_data]     Kumar\AppData\Roaming\nltk_data...


True

In [4]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Sravan
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

# 3. What are the differences between stemming and lemmatization in NLP? When would you choose one over the other?

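Stemming strips suffixes from a word using fixed rules, so the result is not always a valid word and can lose or distort the meaning. Lemmatization uses vocabulary and the word's part of speech to return a meaningful base form, called the lemma. For instance, a crude stemmer that simply drops 'ing' turns 'Caring' into 'Car', whereas lemmatization returns 'Care'. Stemming is faster and is often good enough for search and indexing; lemmatization is the better choice when the output must be real words or when meaning matters.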
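
The 'Caring' example can be reproduced with a toy comparison. This is only a sketch: `naive_stem`, `lookup_lemma`, the `SUFFIXES` tuple, and the `LEMMAS` table are hypothetical stand-ins written for this illustration. They are not the Porter algorithm or WordNet, both of which are considerably more careful.

```python
# Hypothetical suffix list and lemma table, chosen only for this illustration.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"caring": "care", "better": "good", "studies": "study"}

def naive_stem(word):
    """Strip the first matching suffix without checking the result is a word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lookup_lemma(word):
    """Return the dictionary form if known, otherwise the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ["Caring", "studies", "better"]:
    print(w, "->", naive_stem(w), "|", lookup_lemma(w))
# Caring -> car | care
# studies -> studi | study
# better -> better | good
```

The stemmer is cheap but produces non-words such as 'studi'; the lookup returns real words but only for entries it knows about, which is why lemmatizers rely on a large dictionary.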

In [6]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Stemming: apply the Porter rules to every token
ps=PorterStemmer()
stemmed_words=[ps.stem(w) for w in tokenised_word]

print('='*70)
print('Tokenized words - without stemming:\n\n\t',tokenised_word)
print('='*70)
print('\nTokenized words - after stemming are:\n\t',stemmed_words)

# Lemmatization: treat every token as a verb (pos='v')
lemma=WordNetLemmatizer()
lemma_words=[lemma.lemmatize(word,pos='v') for word in tokenised_word]

print('='*70)
print('Lemmatized words:\n',lemma_words)

Tokenized words - without stemming:

	 ['Tokenization', 'is', 'used', 'in', 'natural', 'language', 'processing', 'to', 'split', 'paragraphs', 'and', 'sentences', 'into', 'smaller', 'units', 'that', 'can', 'be', 'more', 'easily', 'assigned', 'meaning', '.', 'The', 'first', 'step', 'of', 'the', 'NLP', 'process', 'is', 'gathering', 'the', 'data', '(', 'a', 'sentence', ')', 'and', 'breaking', 'it', 'into', 'understandable', 'parts', '(', 'words', ')', '.']

Tokenized words - after stemming are:
	 ['token', 'is', 'use', 'in', 'natur', 'languag', 'process', 'to', 'split', 'paragraph', 'and', 'sentenc', 'into', 'smaller', 'unit', 'that', 'can', 'be', 'more', 'easili', 'assign', 'mean', '.', 'the', 'first', 'step', 'of', 'the', 'nlp', 'process', 'is', 'gather', 'the', 'data', '(', 'a', 'sentenc', ')', 'and', 'break', 'it', 'into', 'understand', 'part', '(', 'word', ')', '.']
Lemmatized words:
 ['Tokenization', 'be', 'use', 'in', 'natural', 'language', 'process', 'to', 'split', 'paragraph', 'and

# 4. Explain the concept of stop words and their role in text preprocessing. How do they impact NLP tasks?

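Stop words are very common words that carry little content on their own. In English, “the”, “is”, and “and” are typical examples. In NLP and text mining, a stop-word list is used to filter these words out so that later steps focus on the more informative ones; this shrinks the vocabulary and reduces noise. The filter is not always beneficial: in tasks such as sentiment analysis, dropping a word like “not” can change the meaning of a sentence.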
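
One way to see how stop-word removal can affect a task is to look at negation. The sketch below uses a small hypothetical `STOP_WORDS` set that contains "not", as some general-purpose lists do; the set and the `remove_stop_words` helper are made up for this example.

```python
# Hypothetical stop-word set for illustration; note that it contains "not".
STOP_WORDS = {"this", "is", "the", "a", "not", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

positive = remove_stop_words("This movie is good".split())
negative = remove_stop_words("This movie is not good".split())

print(positive)  # ['movie', 'good']
print(negative)  # ['movie', 'good']
```

Both sentences collapse to the same tokens, so a sentiment model would no longer be able to tell them apart. For tasks like this it can be worth removing negation words from the stop-word list before filtering.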

In [8]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Sravan
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [9]:
from nltk.corpus import stopwords
stop_words=set(stopwords.words('english'))

# Keep only tokens that are not in the stop-word list.
# The list is lowercase and this check is case-sensitive, so 'The' is kept.
filtered_tokens=[w for w in tokenised_word if w not in stop_words]

print('Length of words:\t',len(tokenised_word))
print('='*70)
print('Tokenized words - with stop words:\n\n\t',tokenised_word)
print('='*70)
print('Length after the removal of stopwords:\t',len(filtered_tokens))
print('='*70)
print('\nTokenized words - after removing the stopwords are:\n\t',filtered_tokens)

Length of words:	 48
Tokenized words - with stop words:

	 ['Tokenization', 'is', 'used', 'in', 'natural', 'language', 'processing', 'to', 'split', 'paragraphs', 'and', 'sentences', 'into', 'smaller', 'units', 'that', 'can', 'be', 'more', 'easily', 'assigned', 'meaning', '.', 'The', 'first', 'step', 'of', 'the', 'NLP', 'process', 'is', 'gathering', 'the', 'data', '(', 'a', 'sentence', ')', 'and', 'breaking', 'it', 'into', 'understandable', 'parts', '(', 'words', ')', '.']
Length after the removal of stopwords:	 31

Tokenized words - after removing the stopwords are:
	 ['Tokenization', 'used', 'natural', 'language', 'processing', 'split', 'paragraphs', 'sentences', 'smaller', 'units', 'easily', 'assigned', 'meaning', '.', 'The', 'first', 'step', 'NLP', 'process', 'gathering', 'data', '(', 'sentence', ')', 'breaking', 'understandable', 'parts', '(', 'words', ')', '.']


# 5. How does the process of removing punctuation contribute to text preprocessing in NLP? What are its benefits?

Removing punctuation is another common preprocessing step. It ensures that a token is treated the same whether or not punctuation is attached to it: for example, “data” and “data!” map to the same word once the punctuation is removed. It also reduces the vocabulary, since punctuation marks no longer count as tokens of their own.
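
The "data" versus "data!" case can be sketched with the standard library alone. The `strip_punctuation` helper and the sample tokens are hypothetical; `string.punctuation` covers ASCII punctuation only, so curly quotes and similar characters would need extra handling.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(token):
    """Remove ASCII punctuation characters from a single token."""
    return token.translate(PUNCT_TABLE)

raw_tokens = ["data", "data!", "(words)", "."]
stripped = [strip_punctuation(t) for t in raw_tokens]
cleaned_tokens = [t for t in stripped if t]  # drop tokens that became empty

print(stripped)        # ['data', 'data', 'words', '']
print(cleaned_tokens)  # ['data', 'data', 'words']
```

Unlike an `isalpha()` filter, this approach keeps tokens that contain digits, but it also turns a contraction such as "don't" into "dont", which may or may not be desirable.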

In [10]:
# Keep only alphabetic tokens; this drops punctuation, and also any token containing digits
words=[word for word in tokenised_word if word.isalpha()]

print('Original text:\n',text)
print('='*70)

print('After removing punctuation:\n')
print(words)

Original text:
 Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning. The first step of the NLP process is gathering the data (a sentence) and breaking it into understandable parts (words).
After removing punctuation:

['Tokenization', 'is', 'used', 'in', 'natural', 'language', 'processing', 'to', 'split', 'paragraphs', 'and', 'sentences', 'into', 'smaller', 'units', 'that', 'can', 'be', 'more', 'easily', 'assigned', 'meaning', 'The', 'first', 'step', 'of', 'the', 'NLP', 'process', 'is', 'gathering', 'the', 'data', 'a', 'sentence', 'and', 'breaking', 'it', 'into', 'understandable', 'parts', 'words']


# 6. Discuss the importance of lowercase conversion in text preprocessing. Why is it a common step in NLP tasks?

Lowercasing converts every word to lower case (NLP -> nlp). Words like “Book” and “book” mean the same thing, but if they are not lowercased they are represented as two different words in the vector space model, which adds dimensions without adding information.

The common approach is to lowercase everything for simplicity. Lowercasing applies to most text mining and NLP tasks and makes the output more consistent. However, some words, such as “US” and “us”, change meaning when lowercased, so case should be kept where it carries information (for example, in named-entity recognition).
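
A quick way to see both effects, the smaller vocabulary and the “US”/“us” collision, is to compare the set of distinct tokens before and after lowercasing. The token list below is made up for this sketch.

```python
tokens = ["Book", "book", "BOOK", "US", "us"]

vocab_before = set(tokens)               # case-sensitive vocabulary
vocab_after = {t.lower() for t in tokens}  # vocabulary after lowercasing

print(len(vocab_before))    # 5
print(sorted(vocab_after))  # ['book', 'us']
```

Five distinct tokens shrink to two, which is the intended reduction in dimensions, but the country name and the pronoun are now indistinguishable.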

In [11]:
lower_words=[word.lower() for word in tokenised_word]
print('Original text:\n',text)
print('='*70)

print('after lowering the case:\n')
print(lower_words)

Original text:
 Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning. The first step of the NLP process is gathering the data (a sentence) and breaking it into understandable parts (words).
after lowering the case:

['tokenization', 'is', 'used', 'in', 'natural', 'language', 'processing', 'to', 'split', 'paragraphs', 'and', 'sentences', 'into', 'smaller', 'units', 'that', 'can', 'be', 'more', 'easily', 'assigned', 'meaning', '.', 'the', 'first', 'step', 'of', 'the', 'nlp', 'process', 'is', 'gathering', 'the', 'data', '(', 'a', 'sentence', ')', 'and', 'breaking', 'it', 'into', 'understandable', 'parts', '(', 'words', ')', '.']


# 7. Explain the term "vectorization" concerning text data. How do techniques like CountVectorizer contribute to text preprocessing in NLP?

In the context of natural language processing (NLP), vectorization is the process of converting text into numerical representations, typically vectors of real numbers. Machine learning models operate on numbers rather than raw strings, so this step is required for tasks such as machine translation and sentiment analysis. CountVectorizer implements the bag-of-words approach: it learns a vocabulary from the input documents and represents each document as a vector of counts, one per vocabulary term.
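
The counting idea behind a bag-of-words vectorizer can be sketched in plain Python. This is an illustration of the concept only, not how scikit-learn implements it; the two sample `documents` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

documents = ["the cat sat", "the cat saw the dog"]

# Tokenize on whitespace and build a sorted vocabulary of distinct terms.
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per document, one column per vocabulary term.
count_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    count_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(count_matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is a document and each column a word, so the second document's count of 2 in the last column records that "the" appears twice. Word order is discarded, which is the main limitation of this representation.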

In [12]:
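from sklearn.feature_extraction.text import CountVectorizer

vector=CountVectorizer()

# Each item in the list is treated as one document, so here every token
# becomes its own row in the count matrix
x=vector.fit_transform(tokenised_word)

# Vocabulary learned from the input (lowercased by default)
feature_names=vector.get_feature_names_out()

print('Feature names:\n',feature_names)
print('*'*70)
print('Token counts matrix:')
print(x.toarray())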

Feature names:
 ['and' 'assigned' 'be' 'breaking' 'can' 'data' 'easily' 'first'
 'gathering' 'in' 'into' 'is' 'it' 'language' 'meaning' 'more' 'natural'
 'nlp' 'of' 'paragraphs' 'parts' 'process' 'processing' 'sentence'
 'sentences' 'smaller' 'split' 'step' 'that' 'the' 'to' 'tokenization'
 'understandable' 'units' 'used' 'words']
**********************************************************************
Token counts matrix:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 ...
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


# 8. Describe the concept of normalization in NLP. Provide examples of normalization techniques used in text preprocessing.

Normalization in natural language processing (NLP) is the process of transforming text into a consistent, standardized format. It reduces variability and redundancy in the text, which makes it easier for NLP algorithms to process and analyze. Common normalization techniques include lowercasing, removing punctuation and extra whitespace, stemming and lemmatization, Unicode normalization, and expanding contractions (for example, “don't” to “do not”). These steps prepare the data for downstream tasks such as machine translation, sentiment analysis, and text classification.
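
A few normalization steps can be combined in a small helper. The sketch below uses only the standard library; the `normalize` function and its choice of steps (Unicode NFKC, case folding, whitespace cleanup) are assumptions for illustration, and a real pipeline would pick steps to suit the task.

```python
import re
import unicodedata

def normalize(text):
    """Apply Unicode NFKC normalization, case folding, and whitespace cleanup."""
    text = unicodedata.normalize("NFKC", text)  # e.g. the ligature U+FB01 becomes "fi"
    text = text.casefold()                      # aggressive lowercasing
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()

# The input contains a ligature (U+FB01), a tab, repeated spaces,
# and an "e" followed by a combining acute accent (U+0301).
normalized = normalize("  The \ufb01le   NAME\tis  Cafe\u0301 ")
print(normalized)  # the file name is café
```

After normalization, visually identical strings that were encoded differently compare as equal, which keeps them from being counted as separate vocabulary entries.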