![](./lab%20header%20image.png)

<div style="text-align: center;">
    <h3>Assignment No. 01</h3>
</div>

<img src="./Student%20Information.png" style="width: 100%;" alt="Student Information">

<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: start;">
    <strong>Q. Write a note on word normalization and stemming. Explain case folding with a suitable example.</strong>
</div>

**Text preprocessing** is a crucial step in natural language processing (NLP) that prepares raw text data for analysis. It typically involves several key steps: tokenization (breaking text into individual words or phrases), removing punctuation and special characters, case folding (converting text to a single case, usually lowercase), eliminating stop words (common words like "the" or "and" that add little meaning), and applying stemming or lemmatization (reducing words to a stem or dictionary base form). Together, these word normalization steps standardize the text, reduce noise, and map surface variants of the same word to one form. The goal is to transform unstructured text into a format that is easier for machines to process, which supports NLP tasks such as sentiment analysis, topic modeling, or text classification.
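
Case folding maps every token to a single case so that "NLP", "Nlp", and "nlp" are counted as the same word. The sketch below uses only Python built-ins; the sample words are illustrative and are not taken from the notebook's data. `str.lower()` is usually enough for plain English text, while `str.casefold()` is a more aggressive variant meant for caseless matching across languages.

```python
# Case folding: map differently-cased variants to one canonical form.
words = ["NLP", "Nlp", "nlp", "Natural", "NATURAL"]

lowered = [w.lower() for w in words]
unique_after_folding = sorted(set(lowered))
print(unique_after_folding)  # ['natural', 'nlp']

# casefold() goes further than lower() for some non-English characters.
german = "Stra\u00dfe"  # "Straße"
print(german.lower())     # straße
print(german.casefold())  # strasse
```

The trade-off is that folding discards distinctions that case sometimes carries, for example between the pronoun "us" and the abbreviation "US".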

In [1]:
%%capture

import nltk
import string
import unicodedata
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

##### **1. Tokenization**
- **Definition**: Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, phrases, or even characters, depending on the level of granularity needed.
- **Example**: For the sentence "I love programming!", tokenization would yield `["I", "love", "programming", "!"]`.
- **Purpose**: Tokenization helps in analyzing the structure of the text and is a foundational step in NLP, enabling further processing like filtration, stemming, and model building.
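
To make the example above concrete, here is a minimal regex-based tokenizer. It is a simplified stand-in, not the algorithm behind NLTK's `word_tokenize`, and the function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("I love programming!"))  # ['I', 'love', 'programming', '!']
```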

In [2]:
text = """
Natural language processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through language. 
It involves the analysis, understanding, and generation of the languages that humans use naturally. NLP has various applications, including sentiment analysis, 
machine translation, and speech recognition.
"""

sentences = sent_tokenize(text)
tokens = word_tokenize(text)

print("\nSentences:", sentences)
print("\nTokens:", tokens)


Sentences: ['\nNatural language processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through language.', 'It involves the analysis, understanding, and generation of the languages that humans use naturally.', 'NLP has various applications, including sentiment analysis, \nmachine translation, and speech recognition.']

Tokens: ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'humans', 'through', 'language', '.', 'It', 'involves', 'the', 'analysis', ',', 'understanding', ',', 'and', 'generation', 'of', 'the', 'languages', 'that', 'humans', 'use', 'naturally', '.', 'NLP', 'has', 'various', 'applications', ',', 'including', 'sentiment', 'analysis', ',', 'machine', 'translation', ',', 'and', 'speech', 'recognition', '.']


##### **2. Filtration**
   - **Definition**: Filtration involves removing unnecessary elements from the text, such as punctuation marks, special characters, numbers, or even specific words that do not contribute meaningfully to the analysis.
   - **Example**: In the sentence "I have 2 dogs.", filtration might remove the number "2" and punctuation, resulting in "I have dogs".
   - **Purpose**: Filtration cleans the text, making it easier to process and reducing noise in the data, which improves the performance of NLP models.
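
The "I have 2 dogs." example can be reproduced with a short sketch that assumes the sentence has already been tokenized. It uses `str.isalpha()`, which keeps only tokens made entirely of letters.

```python
# Tokens for "I have 2 dogs." (already tokenized for this illustration)
tokens = ["I", "have", "2", "dogs", "."]

# Keep only tokens that consist entirely of alphabetic characters.
filtered = [t for t in tokens if t.isalpha()]
print(" ".join(filtered))  # I have dogs

# Caveat: isalpha() also rejects tokens that mix letters with other characters.
print("don't".isalpha(), "COVID19".isalpha())  # False False
```

Because of that caveat, a stricter or looser filter may be appropriate when contractions or alphanumeric terms matter to the analysis.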

In [3]:
# Filtration: keep only purely alphabetic tokens, which drops punctuation, special characters, and numbers

filtered_tokens = [word for word in tokens if word.isalpha()]
print("\nFiltered Tokens:", filtered_tokens)


Filtered Tokens: ['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'humans', 'through', 'language', 'It', 'involves', 'the', 'analysis', 'understanding', 'and', 'generation', 'of', 'the', 'languages', 'that', 'humans', 'use', 'naturally', 'NLP', 'has', 'various', 'applications', 'including', 'sentiment', 'analysis', 'machine', 'translation', 'and', 'speech', 'recognition']


##### **3. Script Validation**
   - **Definition**: Script validation checks whether the text is in the expected script (like Latin, Cyrillic, etc.) or language. This step ensures that the text is valid and relevant to the context of the analysis.
   - **Example**: If analyzing English text, script validation might remove text written in other scripts, like "こんにちは" (Japanese).
   - **Purpose**: Script validation helps in maintaining consistency in the data, ensuring that only relevant text is processed.
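
One possible way to check scripts is sketched below, under the assumption that Unicode character names are a good enough signal. The helper names `is_ascii` and `is_latin` are hypothetical. The sample words show how an ASCII-only check and a Latin-script check can disagree.

```python
import unicodedata

def is_ascii(word):
    # True when every character falls in the 7-bit ASCII range.
    return all(ord(ch) < 128 for ch in word)

def is_latin(word):
    # unicodedata.name(ch, "") returns "" instead of raising for unnamed characters.
    return all("LATIN" in unicodedata.name(ch, "") for ch in word)

samples = ["hello", "caf\u00e9", "こんにちは"]  # "caf\u00e9" is "café"
for w in samples:
    print(w, is_ascii(w), is_latin(w))
```

Here "café" fails the ASCII check but passes the Latin check, while "こんにちは" fails both, so the choice between the two checks depends on whether accented Latin text should be kept.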

In [4]:
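# Check that every character in the token is ASCII.
# This is a rough proxy for English text: it also rejects accented Latin letters such as "é".
def is_english_script(word):
    return all(ord(char) < 128 for char in word)

# Check that every character in the token belongs to the Latin script,
# based on its Unicode character name (e.g. "LATIN SMALL LETTER A").
def is_latin_script(word):
    try:
        return all('LATIN' in unicodedata.name(char) for char in word)
    except ValueError:
        # unicodedata.name() raises ValueError for characters that have no name
        return False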

# a: Script Validation for English (ASCII characters only)
english_tokens = [word for word in filtered_tokens if is_english_script(word)]
print("\nEnglish Script Tokens:", english_tokens)

# b: Script Validation for Latin Script
latin_tokens = [word for word in filtered_tokens if is_latin_script(word)]
print("\nLatin Script Tokens:", latin_tokens)


English Script Tokens: ['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'humans', 'through', 'language', 'It', 'involves', 'the', 'analysis', 'understanding', 'and', 'generation', 'of', 'the', 'languages', 'that', 'humans', 'use', 'naturally', 'NLP', 'has', 'various', 'applications', 'including', 'sentiment', 'analysis', 'machine', 'translation', 'and', 'speech', 'recognition']

Latin Script Tokens: ['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'humans', 'through', 'language', 'It', 'involves', 'the', 'analysis', 'understanding', 'and', 'generation', 'of', 'the', 'languages', 'that', 'humans', 'use', 'naturally', 'NLP', 'has', 'various', 'applications', 'including', 'sentiment', 'analysis', 'machine', 'translation', 'and', 'speech', 'recognition']

##### **4. Stop Word Removal**
   - **Definition**: Stop words are common words in a language (like "and," "the," "is") that do not add significant meaning and are often removed from the text to focus on more informative words.
   - **Example**: For the sentence "The cat is on the mat," stop word removal would yield `["cat", "mat"]`.
   - **Purpose**: Removing stop words reduces the dimensionality of the data, focusing on the most relevant information for analysis.
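
The "The cat is on the mat" example can be sketched without any downloads by using a small hand-written stop list. `STOP_WORDS` and `remove_stop_words` are hypothetical names for this illustration, and the list is far shorter than NLTK's English stop word list.

```python
# Hypothetical, hand-picked stop list for illustration only.
STOP_WORDS = {"the", "is", "on", "a", "and", "of"}

def remove_stop_words(tokens):
    # Lowercase each token before the lookup so that "The" matches "the".
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "is", "on", "the", "mat"]))  # ['cat', 'mat']
```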

In [5]:
# Stop word removal; tokens are lowercased (case folding) before the lookup so that "It" matches "it"
stop_words = set(stopwords.words('english'))
tokens_without_stopwords = [word.lower() for word in english_tokens if word.lower() not in stop_words]
print("\nTokens without Stop Words:", tokens_without_stopwords)


Tokens without Stop Words: ['natural', 'language', 'processing', 'nlp', 'field', 'artificial', 'intelligence', 'focuses', 'interaction', 'computers', 'humans', 'language', 'involves', 'analysis', 'understanding', 'generation', 'languages', 'humans', 'use', 'naturally', 'nlp', 'various', 'applications', 'including', 'sentiment', 'analysis', 'machine', 'translation', 'speech', 'recognition']


##### **5. Stemming**
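   - **Definition**: Stemming reduces words to a common stem by stripping affixes with heuristic rules; the Porter stemmer used here removes suffixes only. The stem need not be a dictionary word: "running" becomes "run" and "played" becomes "play", but "language" becomes "languag".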
   - **Example**: "Playing", "played", and "plays" would all be reduced to "play".
   - **Purpose**: Stemming simplifies the text by consolidating different forms of a word into a single term, making it easier for models to learn patterns.
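
To show the idea behind suffix stripping, here is a deliberately crude stemmer. It is not the Porter algorithm, which applies many more rules and conditions; the suffix list, the minimum stem length of 3, and the name `naive_stem` are all assumptions made for this sketch.

```python
# Suffixes tried in order; the first one that matches is removed.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least 3 characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["Playing", "played", "plays"]])  # ['play', 'play', 'play']
print(naive_stem("running"))  # runn
```

The "runn" result shows why practical stemmers need extra rules, such as handling doubled consonants, beyond plain suffix removal.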

In [6]:
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in tokens_without_stopwords]
print("\nStemmed Tokens:", stemmed_tokens)


Stemmed Tokens: ['natur', 'languag', 'process', 'nlp', 'field', 'artifici', 'intellig', 'focus', 'interact', 'comput', 'human', 'languag', 'involv', 'analysi', 'understand', 'gener', 'languag', 'human', 'use', 'natur', 'nlp', 'variou', 'applic', 'includ', 'sentiment', 'analysi', 'machin', 'translat', 'speech', 'recognit']


Text preprocessing in NLP prepares raw text for analysis through steps like tokenization, filtration, case folding, stop word removal, and stemming or lemmatization. These steps simplify the text and can improve model performance, but they have limitations: stemming can produce stems that are not real words, such as "natur" and "variou" in the output above, and removing stop words or case can discard context. Understanding these trade-offs is crucial for effective text processing and insight extraction.

<div style="float: right; border: 1px solid black; display: inline-block; padding: 10px; text-align: center">
    <br>
    <br>
    <span style="font-weight: bold;">Signature of Lab Incharge</span>
    <br>
    <span>(Prof. Rupali Sharma)</span> 
</div>