# 1. Text Preprocessing

Text preprocessing is a crucial step in natural language processing (NLP) and text analysis. Its purpose is to clean raw text and transform it into a format that machine learning algorithms and other analytical tools can work with. The main preprocessing steps, and the reason each one matters, are:

1. **Noise Reduction:** Raw text data often contains irrelevant information, such as special characters, numbers, and punctuation, which may not contribute to the analysis. Text preprocessing helps remove this noise to focus on the relevant content.

2. **Tokenization:** Breaking down text into smaller units, such as words or phrases (tokens), facilitates further analysis. Tokenization is a fundamental step in text preprocessing that aids in understanding the structure of the text.

3. **Normalization:** Standardizing the text by converting it to lowercase helps ensure uniformity and consistency. This is important, especially when dealing with case-insensitive tasks like text comparison or sentiment analysis.

4. **Lemmatization and Stemming:** These techniques reduce words to their base or root form, helping to group variations of words together. This is beneficial for tasks like information retrieval and text mining.

5. **Stopword Removal:** Stopwords are common words (e.g., "and," "the," "is") that often do not carry significant meaning. Removing stopwords can improve the efficiency of text analysis by focusing on more meaningful terms.

6. **Handling Missing Data:** Dealing with missing or incomplete text data is essential for a comprehensive analysis. Text preprocessing may involve strategies such as filling in missing values or excluding incomplete entries.

7. **Feature Engineering:** Transforming text data into numerical features is essential for machine learning models. Techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings convert text into a format suitable for modeling.

8. **Handling Special Characters and Encoding:** Text data may contain special characters or need encoding to handle different character sets. Preprocessing addresses these issues to ensure the data is ready for analysis.

In summary, text preprocessing is crucial before analysis because it enhances the quality of the data, removes unnecessary elements, and transforms text into a structured format that facilitates effective analysis and modeling in the field of natural language processing.
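
As a rough illustration of how several of these steps chain together, here is a minimal sketch using only the standard library. The stopword set and the function name `preprocess` are hypothetical choices made for this example; a real pipeline would more likely use NLTK's tokenizers and stopword list, as the later sections do.

```python
import re

# Hypothetical, hand-picked stopword subset used only for this illustration.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "in"}

def preprocess(text):
    """Lowercase, tokenize on alphabetic runs, and drop stopwords."""
    lowered = text.lower()                    # normalization
    tokens = re.findall(r"[a-z]+", lowered)   # tokenization; digits and symbols are discarded
    return [t for t in tokens if t not in STOP_WORDS]  # stopword removal

result = preprocess("The price of the 2 books is $30, and shipping is free!")
print(result)  # ['price', 'books', 'shipping', 'free']
```

The order of the steps matters here: lowercasing first lets the stopword check match "The" as well as "the".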

# 2. Tokenization

Tokenization is the process of breaking down a text into smaller units, often words or phrases, known as tokens. In the context of natural language processing (NLP), tokenization is a fundamental step in text processing and analysis. The significance of tokenization lies in its ability to structure and organize raw text data, making it suitable for further analysis. Here are some key aspects of tokenization and its significance:

1. **Breaking Text into Units:**
   - **Word Tokenization:** In word tokenization, the text is segmented into individual words. This is a common approach and is suitable for many NLP tasks.
   - **Sentence Tokenization:** In sentence tokenization, the text is segmented into sentences. This is useful for tasks that require a sentence-level analysis.

2. **Facilitating Analysis:**
   - Tokenization helps in understanding the structure of the text by breaking it down into meaningful units. This is essential for various NLP applications, such as sentiment analysis, part-of-speech tagging, and named entity recognition.

3. **Preparation for Further Processing:**
   - Once the text is tokenized, it can be further processed and analyzed. Each token becomes a unit of information that can be examined, transformed, or used as a feature in machine learning models.

4. **Feature Extraction:**
   - Tokenization is a crucial step in feature extraction for NLP tasks. Many natural language processing models require numerical input, and tokenization is often followed by techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings to represent words in a numerical format.

5. **Language Understanding:**
   - Tokenization is essential for language understanding as it breaks down the text into elements that can be analyzed and interpreted. This is particularly important for tasks like sentiment analysis, where the sentiment may be associated with specific words or phrases.

6. **Text Comparison and Retrieval:**
   - Tokenization is useful in tasks that involve text comparison and retrieval. It enables the comparison of individual words or phrases, making it easier to identify similarities and differences between texts.

7. **Improved Efficiency:**
   - Breaking text into tokens reduces the complexity of the data and makes it more manageable. This can lead to more efficient processing and analysis, especially in the context of large datasets.

8. **Handling Ambiguity:**
   - Tokenization can help in handling ambiguity in natural language. By breaking down the text into smaller units, the ambiguity associated with certain phrases or expressions can be reduced, making it easier to analyze.

In summary, tokenization is a critical step in NLP that involves breaking down text into smaller units to facilitate analysis and processing. It plays a key role in various NLP applications by providing a structured representation of text data for further exploration and modeling.
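
To make the idea of "breaking text into units" concrete, the sketch below approximates sentence and word tokenization with regular expressions. The function names are illustrative, and the rules are deliberately naive: they are not how NLTK's tokenizers work internally, and abbreviations such as "Dr. Smith" would be split incorrectly.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def naive_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Tokenization is useful. It splits text into units! Does it work?"
sentences = naive_sent_tokenize(sample)
words = naive_word_tokenize(sentences[0])

print(sentences)  # ['Tokenization is useful.', 'It splits text into units!', 'Does it work?']
print(words)      # ['Tokenization', 'is', 'useful', '.']
```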

# Breaking Text into Units

# Sentence Tokenization - splitting text into sentences

In [1]:
from nltk.tokenize import sent_tokenize

text="""Hello Mr.Starc, i hope you doing good.
By the way I have a plan to visit to your house in the next week of the month.
We have a big business plan to discuss"""

print('Original text:')
print('='*100)
print(text)
print('='*100)
print()
tokenised_sent=sent_tokenize(text)
print('After Sentence Tokenization:\n',tokenised_sent)
print('='*100)
print('No.of Sentences:\t',len(tokenised_sent))


Original text:
Hello Mr.Starc, i hope you doing good.
By the way I have a plan to visit to your house in the next week of the month.
We have a big business plan to discuss

After Sentence Tokenization:
 ['Hello Mr.Starc, i hope you doing good.', 'By the way I have a plan to visit to your house in the next week of the month.', 'We have a big business plan to discuss']
No.of Sentences:	 3


# Word Tokenization - splitting text into words

In [2]:
from nltk.tokenize import word_tokenize

print('Original text:')
print('='*100)
print(text)
print('='*100)
print()

token_word=word_tokenize(text)

print('After word Tokenization:\n',token_word)
print('='*100)
print('No.of words:\t',len(token_word))
print('='*100)

for i in token_word:
    print('word : ',i,'&','Length :',len(i))

Original text:
Hello Mr.Starc, i hope you doing good.
By the way I have a plan to visit to your house in the next week of the month.
We have a big business plan to discuss

After word Tokenization:
 ['Hello', 'Mr.Starc', ',', 'i', 'hope', 'you', 'doing', 'good', '.', 'By', 'the', 'way', 'I', 'have', 'a', 'plan', 'to', 'visit', 'to', 'your', 'house', 'in', 'the', 'next', 'week', 'of', 'the', 'month', '.', 'We', 'have', 'a', 'big', 'business', 'plan', 'to', 'discuss']
No.of words:	 37
word :  Hello & Length : 5
word :  Mr.Starc & Length : 8
word :  , & Length : 1
word :  i & Length : 1
word :  hope & Length : 4
word :  you & Length : 3
word :  doing & Length : 5
word :  good & Length : 4
word :  . & Length : 1
word :  By & Length : 2
word :  the & Length : 3
word :  way & Length : 3
word :  I & Length : 1
word :  have & Length : 4
word :  a & Length : 1
word :  plan & Length : 4
word :  to & Length : 2
word :  visit & Length : 5
word :  to & Length : 2
word :  your & Length : 4
word :  h

# 3. Difference between stemming and lemmatization

Stemming and lemmatization are both techniques used in natural language processing (NLP) to reduce words to their base or root forms. The key differences are:

- **Stemming:** Removes prefixes or suffixes by rule to obtain a word's root form, so the result may not be an actual word. It is faster but less accurate than lemmatization.

- **Lemmatization:** Involves reducing words to their base or dictionary form (lemma), ensuring the result is a valid word. It is a more accurate but computationally intensive process.

Choose stemming for efficiency in information retrieval or search engine applications. Choose lemmatization for tasks requiring linguistic precision, such as sentiment analysis or machine translation.
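
The contrast can be sketched with two toy functions. Both the suffix list and the lookup table below are hypothetical and far smaller than what a real stemmer (such as Porter) or lemmatizer (such as WordNet) uses; the point is only that rule-based stripping can produce non-words, while dictionary lookup returns valid ones.

```python
# Toy rules and a toy dictionary, invented for this illustration only.
SUFFIXES = ("ing", "ies", "ed", "es", "s")
LEMMA_TABLE = {"studies": "study", "studying": "study", "better": "good", "ran": "run"}

def toy_stem(word):
    """Strip the first matching suffix; the result may not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "studying", "better", "ran"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

Here `toy_stem("studies")` gives `stud`, which is not a word, while `toy_lemmatize("studies")` gives `study`. Irregular forms such as "better" and "ran" are untouched by suffix rules but resolved by the lookup.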

In [6]:
#Stemming
from nltk.stem import PorterStemmer

#Sample string; word_tokenize keeps the commas as separate tokens
st='connect,connected,connecting'
st_words=word_tokenize(st)

#Reduce each token to its stem with the Porter algorithm
ps=PorterStemmer()
stemmed_words=[ps.stem(w) for w in st_words]
 
print('Length of words:\t',len(st_words))
print('-'*100)
print('Filtered words:\n\n\t',st_words)
print('='*100)
print('Length after Stemmer:\t',len(stemmed_words))
print('-'*100)
print('\n After Stemming- words are:\n\t',stemmed_words)


Length of words:	 5
----------------------------------------------------------------------------------------------------
Filtered words:

	 ['connect', ',', 'connected', ',', 'connecting']
Length after Stemmer:	 5
----------------------------------------------------------------------------------------------------

 After Stemming- words are:
	 ['connect', ',', 'connect', ',', 'connect']


In [7]:
#Lemmatization


from nltk.stem import WordNetLemmatizer

#Initialize the wordnet lemmatizer
lemma=WordNetLemmatizer()

#prompt to take input string
text=input('Enter a sentence:\n\n\t')

#split as word tokens
words=word_tokenize(text)
print('='*100)
print('Tokens of word:\n',words)
print('Length: \t',len(words))
print('*'*100)

#Apply lemmatization to each word, treating every token as a verb (pos='v')

lemma_words=[lemma.lemmatize(word,pos='v') for word in words ]

print('-'*100)
print('Original text:\n',text)
print('='*100)
print('Lemmatized words:\n',lemma_words)


Enter a sentence:

	connected connecting flying swimming seems
Tokens of word:
 ['connected', 'connecting', 'flying', 'swimming', 'seems']
Length: 	 5
****************************************************************************************************
----------------------------------------------------------------------------------------------------
Original text:
 connected connecting flying swimming seems
Lemmatized words:
 ['connect', 'connect', 'fly', 'swim', 'seem']


# 4. StopWords

**Stopwords** are common words, such as "and," "the," and "is," that are often removed during text preprocessing in natural language processing (NLP). Removing them helps by:

1. **Noise Reduction:** Removing stopwords helps eliminate frequently occurring but less meaningful words, reducing noise in the data.

2. **Improved Efficiency:** By excluding stopwords, NLP tasks become more computationally efficient as the focus shifts to more meaningful words.

3. **Enhancing Analysis:** Stopword removal allows algorithms to focus on content-carrying words, which can improve the accuracy of tasks like text classification and information retrieval. Standard lists also contain negations such as "not", so for sentiment analysis the list should be reviewed before use.

In summary, removing stopwords can improve the efficiency and accuracy of text analysis by discarding common but less informative words.

In [8]:
from nltk.corpus import stopwords
stop_words=set(stopwords.words('english'))
print('Total No.of Stopwords:\t',len(stop_words))
print('='*100)
print('The Stop words are:\n')
print(stop_words)

Total No.of Stopwords:	 179
The Stop words are:

{'any', 'll', "it's", 'of', 'he', "hasn't", "weren't", "you'll", 'into', 'for', 'yourselves', 'own', 'at', 'all', 'wasn', 'isn', 'too', 'some', 'same', 'was', 'when', 'nor', 'herself', 'further', 'to', 'where', 'm', 'can', "aren't", 'below', 'am', 'before', 'just', 'ours', 'being', 'his', 'each', 'as', 'its', 'because', 'aren', "shouldn't", "needn't", "hadn't", 'i', 'the', "she's", "shan't", 'her', "wasn't", 'be', 'yourself', 'above', 's', 'from', 'in', 'if', 'do', 'yours', 'your', 'does', "couldn't", "haven't", 'doing', 'up', 'after', 'under', 'shan', "you'd", 'having', 'a', 'myself', 'then', 'so', "don't", "isn't", 'with', 'out', 'again', 'mustn', 'an', 'himself', 'this', 'by', 'theirs', "should've", 'my', 'only', 'those', 'until', 'have', 'ma', 'what', 'o', 'over', 'why', 'most', 'not', 'hers', 'them', 'these', 'don', 'are', 'weren', 'down', 'were', 'we', 't', 'had', 'whom', 've', "didn't", 'who', 'during', 'is', 'between', 'mightn', 

In [9]:
filtered_tokens=[]
for w in token_word:
    if w not in stop_words:
        filtered_tokens.append(w)
print('Length of words:\t',len(token_word))
print('-'*100)
print('Tokenized words-with stopwords:\n\n\t',token_word)
print('='*100)
print('Length after Remove stopwords:\t',len(filtered_tokens))
print('-'*100)
print('\n After Removing stopwords- words are:\n\t',filtered_tokens)

Length of words:	 37
----------------------------------------------------------------------------------------------------
Tokenized words-with stopwords:

	 ['Hello', 'Mr.Starc', ',', 'i', 'hope', 'you', 'doing', 'good', '.', 'By', 'the', 'way', 'I', 'have', 'a', 'plan', 'to', 'visit', 'to', 'your', 'house', 'in', 'the', 'next', 'week', 'of', 'the', 'month', '.', 'We', 'have', 'a', 'big', 'business', 'plan', 'to', 'discuss']
Length after Remove stopwords:	 21
----------------------------------------------------------------------------------------------------

 After Removing stopwords- words are:
	 ['Hello', 'Mr.Starc', ',', 'hope', 'good', '.', 'By', 'way', 'I', 'plan', 'visit', 'house', 'next', 'week', 'month', '.', 'We', 'big', 'business', 'plan', 'discuss']
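
Note that 'By', 'I' and 'We' survive the filter above even though their lowercase forms are stopwords: the comparison is case-sensitive and the stopword list is all lowercase. The sketch below shows the effect of lowercasing each token before the lookup. To stay runnable without NLTK data, it assumes a small hand-written stopword subset rather than the full list.

```python
# First tokens copied from the word-tokenization output above.
tokens = ['Hello', 'Mr.Starc', ',', 'i', 'hope', 'you', 'doing', 'good', '.',
          'By', 'the', 'way', 'I', 'have', 'a', 'plan']

# Assumption: a hand-written lowercase subset standing in for the full stopword list.
stop_subset = {'i', 'you', 'doing', 'by', 'the', 'have', 'a'}

# Case-sensitive check: capitalized stopwords slip through.
case_sensitive = [t for t in tokens if t not in stop_subset]

# Case-insensitive check: lowercase each token only for the comparison.
case_insensitive = [t for t in tokens if t.lower() not in stop_subset]

print(case_sensitive)    # still contains 'By' and 'I'
print(case_insensitive)  # 'By' and 'I' are removed
```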


# 5. Removing Punctuation

**Removing punctuation** in text preprocessing in NLP contributes by:

1. **Noise Reduction:** Eliminating punctuation helps reduce unnecessary characters that might not contribute to the meaning of the text, reducing noise in the data.

2. **Facilitating Tokenization:** Punctuation removal aids in breaking down the text into meaningful units (tokens), making subsequent tokenization processes more effective.

3. **Improved Analysis:** Removing punctuation simplifies word-count based tasks such as text classification. For tasks like sentiment analysis and part-of-speech tagging, where the presence or absence of punctuation can influence the meaning of a text, it should be applied with care.

In summary, removing punctuation is beneficial for noise reduction, tokenization, and improving the accuracy of various NLP tasks.
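
One common way to strip punctuation at the string level, before tokenizing, is `str.translate` with `string.punctuation`. The sketch below is a minimal example; the function name `strip_punctuation` is illustrative, and it only covers ASCII punctuation.

```python
import string

def strip_punctuation(text):
    """Delete every ASCII punctuation character listed in string.punctuation."""
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

cleaned = strip_punctuation("Hello, world! Isn't this great?")
print(cleaned)  # Hello world Isnt this great
```

The example also shows a trade-off: deleting the apostrophe turns "Isn't" into "Isnt", so contractions may need handling before this step.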

In [10]:
from nltk.tokenize import word_tokenize

#prompt to take input string
text=input('Enter a sentence:\n\n\t')

#split as word tokens
tokens=word_tokenize(text)
print('='*100)
print('Length: \t',len(tokens))
print('Tokens of word:\n',tokens)
print('*'*100)

#Keep only purely alphabetic tokens; this drops punctuation and mixed tokens such as 'Mr.Starc'
words=[word for word in tokens if word.isalpha()]

print('-'*100)
print('Original text:\n',text)
print('='*100)
print('After Removing Punctuation:\n')
print(words,len(words))

Enter a sentence:

	Hello Mr.Starc, i hope you doing good. By the way I have a plan to visit to your house in the next week of the month. We have a big business plan to discuss
Length: 	 37
Tokens of word:
 ['Hello', 'Mr.Starc', ',', 'i', 'hope', 'you', 'doing', 'good', '.', 'By', 'the', 'way', 'I', 'have', 'a', 'plan', 'to', 'visit', 'to', 'your', 'house', 'in', 'the', 'next', 'week', 'of', 'the', 'month', '.', 'We', 'have', 'a', 'big', 'business', 'plan', 'to', 'discuss']
****************************************************************************************************
----------------------------------------------------------------------------------------------------
Original text:
 Hello Mr.Starc, i hope you doing good. By the way I have a plan to visit to your house in the next week of the month. We have a big business plan to discuss
After Removing Punctuation:

['Hello', 'i', 'hope', 'you', 'doing', 'good', 'By', 'the', 'way', 'I', 'have', 'a', 'plan', 'to', 'visit', 'to', 'y

# 6. Lowercase Conversion

**Lowercase conversion** is a common step in text preprocessing for NLP tasks due to its importance in:

1. **Consistency:** Converting text to lowercase ensures uniformity, making it easier to compare and analyze words without being case-sensitive.

2. **Normalization:** It helps in standardizing the text, reducing the complexity of variations arising from different letter cases.

3. **Efficient Matching:** Lowercasing facilitates efficient matching of words, improving the accuracy of tasks such as information retrieval and text classification.

4. **Feature Extraction:** Many NLP models and algorithms rely on the numerical representation of text features. Lowercasing ensures consistency in feature extraction.

5. **Enhanced Generalization:** Lowercasing enables models to generalize better by treating words with different cases as the same entity, improving performance across diverse text inputs.

In summary, lowercase conversion in text preprocessing is crucial for ensuring consistency, simplifying text analysis, and supporting efficient feature extraction in various NLP tasks.
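
A small sketch of the effect on word counts, using only the standard library. The token list is made up for illustration.

```python
from collections import Counter

tokens = ["Apple", "apple", "APPLE", "Banana", "banana"]

# Without lowercasing, every casing variant is counted as a separate word.
raw_counts = Counter(tokens)

# After lowercasing, variants collapse into a single entry per word.
lowered_counts = Counter(t.lower() for t in tokens)

print(len(raw_counts))       # 5
print(dict(lowered_counts))  # {'apple': 3, 'banana': 2}

# For non-English text, str.casefold() is a more aggressive alternative:
# it maps the German 'ß' to 'ss', which str.lower() does not.
print("Straße".casefold() == "STRASSE".casefold())  # True
```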

# 7. Vectorization

**Vectorization** in the context of text data refers to the process of converting textual information into numerical vectors that can be used as input for machine learning algorithms. In natural language processing (NLP), vectorization is a crucial step because most machine learning models require numerical input, while text data is inherently non-numeric.

**CountVectorizer** is a common technique for vectorization in NLP. It works by converting a collection of text documents to a matrix of token counts. Here's how CountVectorizer contributes to text preprocessing:

1. **Word Frequency Representation:** CountVectorizer represents each document as a vector, where each element corresponds to the count of a particular word in the document. This captures the frequency of words and their distribution in the corpus.

2. **Sparse Matrix:** The output of CountVectorizer is typically a sparse matrix, where most entries are zero because not every word appears in every document. This sparse representation is memory-efficient.

3. **Feature Extraction:** CountVectorizer extracts features from text data, allowing machine learning models to work with the information. These features can be used for tasks like text classification, clustering, and information retrieval.

4. **Handling Stopwords:** CountVectorizer can be configured to exclude common stopwords during the vectorization process. This helps in focusing on more meaningful terms and reduces the impact of common, less informative words.


In [11]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
corpus = ["This is the first document.", "This document is the second document.", "And this is the third one."]

# Create the CountVectorizer object
vectorizer = CountVectorizer()

# Fit and transform the text data
X = vectorizer.fit_transform(corpus)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Convert the sparse matrix to a dense array for better visibility
dense_array = X.toarray()

# Print the feature names and the resulting matrix
print("Feature Names:", feature_names)
print("Count Vectorizer Output:\n", dense_array)


Feature Names: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Count Vectorizer Output:
 [[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]]
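
To make the counting step concrete, the following sketch rebuilds the same matrix with only the standard library. It assumes CountVectorizer's default behavior of lowercasing the text and keeping tokens of two or more word characters; the helper name `tokenize` is illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

corpus = ["This is the first document.",
          "This document is the second document.",
          "And this is the third one."]

def tokenize(doc):
    """Lowercase, then keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per document, one column per vocabulary term.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row matches the corresponding row of the CountVectorizer output above, for example `[0, 2, 0, 1, 0, 1, 1, 0, 1]` for the second document, where "document" occurs twice.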


# 8. Normalization

Normalization in the context of NLP involves transforming text data to a standard or normalized form. This step is essential to ensure uniformity and consistency in the representation of words, making it easier to analyze and compare text. Here are some common normalization techniques used in text preprocessing:

Examples:

1. Lowercasing:

Description: Converting all letters in the text to lowercase.
Example:

In [12]:
text = "This is an Example Text."
normalized_text = text.lower()
print(normalized_text)

this is an example text.


2. Stemming:

Description: Reducing words to their base or root form by removing prefixes or suffixes.
Example:

In [13]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word = "running"
stemmed_word = stemmer.stem(word)
print(stemmed_word)


run


3. Lemmatization:

Description: Reducing words to their base or dictionary form (lemma) to ensure they are valid words.
Example:

In [14]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
word = "running"
lemmatized_word = lemmatizer.lemmatize(word, pos='v')  # 'v' indicates the part of speech (verb)
print(lemmatized_word)


run


4. Removing Special Characters:

Description: Eliminating non-alphabetic characters such as digits, punctuation, and other symbols.
Example:

In [15]:
import re

text = "This is an example sentence with 123 numbers!"
normalized_text = re.sub(r'[^a-zA-Z\s]', '', text)
print(normalized_text)


This is an example sentence with  numbers
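
The pattern `[^a-zA-Z\s]` also deletes accented letters, which can damage words in text that is not plain ASCII. A possible workaround, sketched below with the standard `unicodedata` module, is to fold accents to their base letters first. The function name `strip_accents` and the sample string are illustrative.

```python
import re
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# "Café déjà vu costs 5 euros!" written with escapes so the accents are unambiguous.
text = "Caf\u00e9 d\u00e9j\u00e0 vu costs 5 euros!"

ascii_only = re.sub(r'[^a-zA-Z\s]', '', text)
folded = re.sub(r'[^a-zA-Z\s]', '', strip_accents(text))

print(ascii_only)  # accented letters are lost: 'Caf dj vu costs  euros'
print(folded)      # base letters are kept:     'Cafe deja vu costs  euros'
```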


5. Removing Stopwords:

Description: Eliminating common words that typically do not contribute much to the meaning of the text.
Example:

In [16]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
text = "This is an example sentence with some common words."
words = word_tokenize(text)
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)


['example', 'sentence', 'common', 'words', '.']
