# **Text Processing: NLP**

# What is NLP Text Processing

Text processing in the context of NLP (Natural Language Processing) refers to a set of techniques and operations applied to textual data to make it more accessible and useful for various NLP tasks. It involves the manipulation, analysis, and transformation of text to extract valuable information or gain insights. Text processing is a crucial preliminary step in many NLP applications. Here are some common text processing tasks:

1. **Tokenization:** Tokenization involves splitting a text into individual units, such as words or sentences. These units are known as tokens. Tokenization is a fundamental step in NLP, as it breaks down text into manageable components.

2. **Stemming and Lemmatization:** Stemming and lemmatization are techniques to reduce words to their root or base forms. Stemming strips affixes (mostly suffixes) with heuristic rules, so the result is not always a real word, while lemmatization uses a vocabulary and the word's part of speech to find its dictionary form.

3. **Stop Word Removal:** Stop words are common words like "the," "and," "in," which may not provide valuable information in certain NLP tasks. Removing stop words can reduce noise in the text.

4. **Normalization:** Text normalization involves standardizing text, making it consistent by converting all text to lowercase, removing special characters, and handling abbreviations or acronyms.

5. **Text Cleaning:** Text cleaning aims to remove noise or irrelevant information from text. This may include removing HTML tags, punctuation, or unwanted characters.

6. **Sentence Segmentation:** Sentence segmentation involves identifying sentence boundaries in a paragraph of text. This is important for tasks like machine translation or summarization.

7. **Text Encoding:** Text encoding converts text into a numerical format that machine learning models can work with. This is typically done using techniques like one-hot encoding or word embeddings (e.g., Word2Vec or GloVe).

8. **Part-of-Speech Tagging:** This task involves labeling words in a sentence with their part of speech (e.g., noun, verb, adjective). It's useful for understanding the grammatical structure of text.

9. **Named Entity Recognition (NER):** NER identifies and classifies entities in text, such as names of people, organizations, locations, dates, and more.

10. **Sentiment Analysis:** Sentiment analysis determines the sentiment or emotion expressed in a piece of text, typically classifying it as positive, negative, or neutral.

11. **Text Summarization:** Text summarization techniques aim to create a concise summary of a longer text while retaining its essential information.

12. **Topic Modeling:** Topic modeling algorithms can identify topics or themes within a collection of documents.

13. **Dependency Parsing:** Dependency parsing analyzes the grammatical structure of a sentence to identify the relationships between words.

14. **Text Translation:** Machine translation systems use NLP techniques to translate text from one language to another.

Text processing plays a crucial role in NLP because it prepares raw textual data for subsequent analysis, modeling, and understanding. The specific techniques used depend on the NLP task at hand and the nature of the text data being processed.
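
As a rough illustration of the cleaning, normalization, and stop word removal steps listed above, here is a minimal sketch using only the Python standard library. The `STOP_WORDS` set and the function names are made up for this example; real projects usually rely on a library-provided stop word list.

```python
import html
import re

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "and", "in", "a", "is", "of"}

def clean_text(raw):
    """Strip HTML tags, decode entities, lowercase, and drop punctuation."""
    text = re.sub(r"<[^>]+>", " ", raw)       # remove HTML tags
    text = html.unescape(text)                 # decode entities such as &amp;
    text = text.lower()                        # normalize case
    text = re.sub(r"[^\w\s]", " ", text)       # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]

raw = "<p>The Cat &amp; the Dog sat in the garden!</p>"
cleaned = clean_text(raw)
tokens = remove_stop_words(cleaned.split())

print(cleaned)  # the cat the dog sat in the garden
print(tokens)   # ['cat', 'dog', 'sat', 'garden']
```

The order of the steps matters: tags are removed before entities are decoded so that a decoded `<` or `>` is not mistaken for markup.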

# String `Tokenization`

String tokenization is a fundamental process in Natural Language Processing (NLP) that involves breaking down a text or a sequence of characters (often a sentence or a document) into smaller, meaningful units called tokens. Tokens are usually words, subwords, or even individual characters, depending on the level of granularity desired for text analysis. Tokenization is the first step in preparing textual data for further analysis and is crucial for many NLP tasks. Here's an overview of string tokenization in NLP:

1. **Token Types:**
   - **Word Tokenization:** In this most common form of tokenization, text is divided into individual words or terms. For example, the sentence "The quick brown fox" would be tokenized into the tokens: ["The", "quick", "brown", "fox"].
   - **Subword Tokenization:** Some NLP tasks, like language modeling or handling languages with complex morphology, require breaking words into subword units. This is often done using techniques like Byte-Pair Encoding (BPE) or WordPiece, or with tools such as SentencePiece. A WordPiece-style tokenizer, for example, might split "subword" into the tokens ["sub", "##word"], where the "##" prefix marks a piece that continues the previous one.
   - **Character Tokenization:** In some cases, especially in languages without clear word boundaries or for character-level tasks, tokenization may occur at the character level. For example, "apple" could be tokenized as ["a", "p", "p", "l", "e"].

2. **Token Boundaries:**
   - **Whitespace Tokenization:** The most basic form of tokenization splits text on spaces, tabs, or newlines. However, this can lead to issues in languages where compound words are common or in languages without spaces between words.
   - **Punctuation Tokenization:** Text can also be tokenized based on punctuation marks, which works well for many languages. For example, "I am happy!" would be tokenized into ["I", "am", "happy", "!"].
   - **Specialized Tokenization:** Depending on the task and language, tokenization can be tailored to the specific needs. For instance, for some languages, you might tokenize based on specific character sequences or linguistic rules.

3. **Tokenization Libraries:** Various NLP libraries and tools offer tokenization functions, such as spaCy, NLTK (the Natural Language Toolkit), Hugging Face's `tokenizers`, or the tokenization tools provided alongside machine learning frameworks like TensorFlow or PyTorch.

4. **Preprocessing:** Tokenization is often a crucial part of text preprocessing in NLP, preparing the text for downstream tasks like sentiment analysis, machine translation, text classification, and named entity recognition.

5. **Challenges:** Tokenization can be challenging in languages with complex word structures, morphological variations, and ambiguous word boundaries. For these cases, advanced techniques and language-specific tokenizers are employed.

6. **Post-processing:** After tokenization, you may need to further process the tokens, such as removing stopwords (common words like "the," "a") or stemming (reducing words to their root form).

Tokenization is a critical step in NLP because it structures the text data into manageable units, making it suitable for various language analysis and machine learning tasks. The choice of tokenization method and granularity depends on the specific NLP task and the characteristics of the language being processed.
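
The boundary choices described above can be compared side by side with a few lines of standard-library Python. This is only a sketch: the regular expression is one simple way to separate punctuation, not the rule set used by any particular library.

```python
import re

text = "I am happy!"

# Whitespace tokenization: punctuation stays attached to the neighboring word.
whitespace_tokens = text.split()

# Punctuation-aware tokenization: runs of word characters, or single
# non-space, non-word characters, each become their own token.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character tokenization: every character is a token.
char_tokens = list("apple")

print(whitespace_tokens)  # ['I', 'am', 'happy!']
print(punct_tokens)       # ['I', 'am', 'happy', '!']
print(char_tokens)        # ['a', 'p', 'p', 'l', 'e']
```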

### Using `nltk`

In [1]:
import nltk
nltk.download('punkt')  # Tokenizer data; recent NLTK releases may ask for 'punkt_tab' instead

from nltk.tokenize import word_tokenize

text = "Tokenization is an important step in NLP."

# Tokenize the text into words
tokens = word_tokenize(text)

# Print the tokens
print(tokens)

['Tokenization', 'is', 'an', 'important', 'step', 'in', 'NLP', '.']


[nltk_data] Downloading package punkt to /home/blackheart/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Using Hugging Face's `tokenizers`

In [2]:
! pip install tokenizers

Defaulting to user installation because normal site-packages is not writeable
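
The cell above only installs the package. Independently of any particular library, the idea behind WordPiece-style subword tokenization can be sketched in plain Python as a greedy longest-match search over a vocabulary. The function name `wordpiece_tokenize` and the tiny vocabulary are assumptions made for this illustration; this is not the `tokenizers` API, and real vocabularies are learned from a corpus rather than written by hand.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first.

    Pieces that continue a word carry a "##" prefix. If some part of the
    word cannot be matched, the whole word maps to the unknown token.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical hand-written vocabulary for the example.
vocab = {"sub", "##word", "token", "##ization", "##s"}

print(wordpiece_tokenize("subword", vocab))       # ['sub', '##word']
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("xyz", vocab))           # ['[UNK]']
```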


# What is Sentence Processing in NLP?

Sentence processing in Natural Language Processing (NLP) refers to the analysis and understanding of individual sentences or text fragments within a larger body of text. It involves various linguistic and computational techniques to extract meaning, syntactic structure, and context from sentences. Sentence processing is a fundamental step in many NLP applications, including text classification, sentiment analysis, machine translation, and question-answering systems. Here are key aspects of sentence processing:

1. **Tokenization:** The first step in sentence processing is often tokenization, where a sentence is divided into smaller units called tokens. Tokens can be words, subwords, or characters, depending on the specific requirements of the NLP task. Tokenization helps create a structured representation of the sentence, making it easier for subsequent analysis.

2. **Part-of-Speech Tagging:** Part-of-speech tagging involves assigning grammatical labels (e.g., noun, verb, adjective) to each word in a sentence. This helps in understanding the syntactic structure of the sentence and disambiguating word meanings. For example, it distinguishes between the noun "bat" and the verb "bat."

3. **Parsing:** Parsing is the process of determining the syntactic structure of a sentence, typically in the form of a parse tree or dependency tree. It shows how words in a sentence relate to each other and their grammatical roles. For instance, it can reveal subject-verb-object relationships.

4. **Named Entity Recognition (NER):** NER identifies and classifies named entities in a sentence, such as names of people, places, organizations, dates, and more. It is crucial for information extraction and context understanding.

5. **Sentiment Analysis:** In sentiment analysis, the goal is to determine the emotional tone or sentiment expressed in a sentence, such as whether it is positive, negative, or neutral. This is valuable for understanding public opinion or customer feedback.

6. **Dependency Parsing:** Dependency parsing is a technique for representing the grammatical structure of a sentence as a directed graph, where words are nodes and grammatical relationships are edges. It's valuable for understanding the relationships between words in a sentence.

7. **Semantics and Word Sense Disambiguation:** These techniques aim to resolve word meanings and understand the semantics of a sentence. Word sense disambiguation helps in determining which sense of a word is intended in a given context.

8. **Coreference Resolution:** Coreference resolution identifies when multiple words or phrases in a text refer to the same entity, often across sentence boundaries. For example, in "John met Mary. He gave her a book," coreference resolution helps connect "He" to "John" and "her" to "Mary."

9. **Question Answering:** In question-answering systems, sentence processing is crucial for extracting relevant information from a sentence to answer a user's question.

10. **Machine Translation:** Sentence processing is vital in machine translation systems to break down sentences in one language and construct equivalent sentences in another language.

11. **Text Summarization:** For abstractive or extractive text summarization, sentence processing techniques are employed to identify important sentences and extract relevant information.

Sentence processing is a complex and multifaceted field within NLP, involving a combination of linguistic knowledge and computational techniques. The goal is to transform text into a structured format that can be used for various NLP tasks, enabling machines to understand, interpret, and generate human language.
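
Before individual sentences can be analyzed, a larger text has to be split into sentences. A naive way to do this, shown here as a sketch rather than a robust solution, is to split after sentence-ending punctuation that is followed by whitespace. The helper name `split_sentences` is made up for this example.

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

text = "John met Mary. He gave her a book! Did she like it?"
sentences = split_sentences(text)
print(sentences)
# ['John met Mary.', 'He gave her a book!', 'Did she like it?']

# The rule is too simple for abbreviations, which is one reason
# trained sentence tokenizers exist.
print(split_sentences("Dr. Smith arrived."))
# ['Dr.', 'Smith arrived.']
```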



In [4]:
import re
import nltk

def process_sentence(sentence):
    """Processes a sentence by performing the following steps:
        1. Removing punctuation
        2. Lowercasing the text
        3. Tokenizing the text
        4. Removing stop words
        5. Lemmatizing the text

    Args:
        sentence: A string containing the sentence to be processed.

    Returns:
        A list of strings containing the processed tokens.
    """

    # Remove punctuation
    sentence = re.sub(r'[^\w\s]', '', sentence)

    # Lowercase the text
    sentence = sentence.lower()

    # Tokenize the text
    tokens = nltk.word_tokenize(sentence)

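    # Remove stop words (requires nltk.download('stopwords')).
    # A set makes each membership test constant time.
    stop_words = set(nltk.corpus.stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the text (requires nltk.download('wordnet')).
    # With no POS argument, every token is treated as a noun.
    lemmatizer = nltk.stem.WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]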

    return tokens

# Example usage:

sentence = "This is a sentence to be processed."
processed_tokens = process_sentence(sentence)

print(processed_tokens)


['sentence', 'processed']


# What is `Word Embedding`?

Word embeddings are a fundamental concept in Natural Language Processing (NLP) and are used to represent words as dense, continuous-valued vectors in a high-dimensional space. Word embeddings capture the semantic and contextual information of words, making them suitable for various NLP tasks. The primary idea behind word embeddings is to map words with similar meanings or contextual usage to nearby points in the vector space. Word embeddings have become a key component in modern NLP and have significantly improved the performance of various language-related tasks. Here are some key points to understand word embeddings:

1. **Distributed Representation:** Word embeddings provide a distributed representation for words, where each word is represented as a vector of real numbers. Unlike one-hot encoding, where words are represented as binary vectors, word embeddings capture more nuanced information about word relationships.

2. **Semantic Similarity:** Words with similar meanings or usage contexts tend to have similar word vectors. This is because word embeddings are trained on large text corpora, allowing them to capture the semantic relationships between words. For example, in a good word embedding space, "king" and "queen" would be closer together than "king" and "car."

3. **Contextual Information:** Word embeddings are learned from the contexts in which words appear, following the idea that words used in similar contexts tend to have similar meanings. Note that static models such as Word2Vec and GloVe assign a single vector to each word, so "bank" gets the same vector in "river bank" and "financial bank"; contextual models such as BERT instead produce a different vector for each occurrence, depending on the surrounding words.

4. **Word Embedding Models:** Several word embedding models exist, with Word2Vec, GloVe (Global Vectors for Word Representation), and FastText being some of the most popular. These models are trained on large text corpora, and they learn word embeddings based on co-occurrence statistics, predicting context words given target words or vice versa.

5. **Word Embedding Dimension:** You can choose the dimensionality of word embeddings, which determines the length of the vectors. Common dimensions include 50, 100, 200, or 300. The choice of dimensionality may depend on the specific task and available resources.

6. **Pre-trained Word Embeddings:** Instead of training word embeddings from scratch, you can use pre-trained word embeddings obtained from large text corpora. These embeddings are available for many languages and have been used to boost the performance of various NLP applications.

7. **Word Embeddings and NLP Tasks:** Word embeddings serve as input features or pre-trained knowledge for many NLP tasks, including text classification, sentiment analysis, machine translation, and named entity recognition. They are used to improve the efficiency and effectiveness of models.

8. **Word Embedding Evaluation:** The quality of word embeddings can be evaluated using intrinsic and extrinsic evaluation methods. Intrinsic evaluations measure how well embeddings capture semantic relationships (e.g., word similarity tasks), while extrinsic evaluations assess their usefulness in downstream NLP tasks.

Word embeddings have become a cornerstone in NLP and paved the way for powerful language models like BERT, GPT, and more. These models learn their own token embeddings and pass them through deep contextual layers, so the representation of a word depends on the sentence it appears in. This has significantly advanced the state of the art in various NLP applications.
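
The semantic similarity point above is usually measured with cosine similarity between vectors. The sketch below uses hand-made three-dimensional vectors chosen purely for illustration; they are not the output of any trained model, and real embeddings have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, invented for this example.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "car": [0.1, 0.2, 0.95],
}

king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
king_car = cosine_similarity(embeddings["king"], embeddings["car"])

print(f"king vs queen: {king_queen:.3f}")
print(f"king vs car:   {king_car:.3f}")
print(king_queen > king_car)  # True for these toy vectors
```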


In [5]:
import numpy as np

class WordEmbedding:
    def __init__(self, vocabulary, embedding_dim):
        """Initializes the word embedding.

        Args:
            vocabulary: A list of strings containing the words in the vocabulary.
            embedding_dim: The dimension of the embedding vectors.
        """

        self.vocabulary = vocabulary
        self.embedding_dim = embedding_dim

        # Map each word to its row index so lookups avoid a linear list search.
        self.word_to_index = {word: i for i, word in enumerate(vocabulary)}

        # Initialize the embedding matrix with random values. These vectors are
        # untrained placeholders and carry no semantic information.
        self.embedding_matrix = np.random.randn(len(vocabulary), embedding_dim)

    def get_embedding(self, word):
        """Returns the embedding vector for the given word.

        Args:
            word: A string containing the word to get the embedding for.

        Returns:
            A numpy array containing the embedding vector for the given word.
        """

        if word not in self.word_to_index:
            raise KeyError(f"Word '{word}' not found in vocabulary.")

        return self.embedding_matrix[self.word_to_index[word]]

# Example usage:

vocabulary = ["cat", "dog", "bird", "fish"]
embedding_dim = 10

word_embedding = WordEmbedding(vocabulary, embedding_dim)

# Get the embedding vector for the word "cat".
cat_embedding = word_embedding.get_embedding("cat")

print(cat_embedding)

[-2.17725952  2.71517822 -0.00587937 -0.62705346  1.16898162  0.61741934
  0.62009413 -0.57880074  0.03731495 -1.12025105]


In [8]:
import gensim
from nltk.corpus import brown

# Load the Brown Corpus (requires nltk.download('brown'))
corpus = brown.sents()

# Create the Word2Vec model. gensim trains CBOW by default; pass sg=1 for skip-gram.
word2vec = gensim.models.Word2Vec(corpus, vector_size=100, window=5, min_count=5)

# Get the embedding vector for the word "cat"
cat_embedding = word2vec.wv.get_vector("cat")

print(cat_embedding)


[ 0.13148445  0.13151881  0.12148918  0.03608256  0.00664068 -0.17636096
  0.04377532  0.29101297 -0.15758431 -0.15795347 -0.02130313 -0.23803149
  0.07775006  0.10257284  0.17024262 -0.11945024  0.07561881 -0.04605036
 -0.07598561 -0.2583426   0.20864409  0.05027337  0.23839258  0.08950492
 -0.00207741 -0.04361648 -0.15078464  0.04011612 -0.01617661 -0.01065332
  0.1590092  -0.0477016   0.19777317 -0.181487   -0.0091598   0.11103874
 -0.05676052  0.02131273 -0.1728234  -0.06679914  0.05130099 -0.11396933
  0.00971357 -0.02986136  0.04378323  0.04729443 -0.13513972  0.07450062
 -0.00672828  0.15183085  0.10400049 -0.15612711 -0.08382687 -0.08972693
  0.06481903 -0.08024552  0.01797736  0.01247341 -0.09846347 -0.05893303
 -0.04407734  0.14424768 -0.02657512  0.0418223  -0.17740534  0.1396236
  0.13482891  0.12985158 -0.2287927   0.2821855  -0.00092568  0.04678812
  0.18353017 -0.04521959  0.11435018  0.09481178  0.0900376   0.0298633
  0.00390646 -0.00905123 -0.02584108  0.1497851  -0.1
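
To make the `window` parameter used above more concrete, the following sketch generates the (target, context) pairs that a skip-gram style model would try to predict. It is a simplified illustration with a made-up helper name, `skipgram_pairs`; gensim's actual training additionally involves details such as subsampling and negative sampling, and it uses CBOW unless `sg=1` is passed.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["the", "quick", "brown", "fox"]
pairs = skipgram_pairs(sentence, window=1)
for pair in pairs:
    print(pair)
# ('the', 'quick')
# ('quick', 'the')
# ('quick', 'brown')
# ('brown', 'quick')
# ('brown', 'fox')
# ('fox', 'brown')
```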

# Lemmatization In Text Processing

Lemmatization is a text processing technique used in Natural Language Processing (NLP) to reduce words to their base or dictionary form, known as the "lemma." The primary goal of lemmatization is to transform different inflected forms of a word into a common base form so that they can be analyzed, compared, and understood more easily. Lemmatization helps in addressing word variations caused by tense, case, gender, number, and other grammatical differences. Here's how lemmatization works:

1. **Lemmatization vs. Stemming:**
   Lemmatization is often compared to stemming, another text normalization technique. Stemming strips suffixes with heuristic rules, so the resulting "stem" is not guaranteed to be a valid word. Lemmatization, on the other hand, reduces words to their dictionary form (lemma), so the output is a valid word. The two can agree on regular forms ("jumping" -> "jump" in both cases) but diverge elsewhere. For example, with the Porter stemmer:
   - Stemming: "studies" -> "studi"
   - Lemmatization: "studies" -> "study"

2. **Lemmatization Process:**
   Lemmatization involves dictionary or vocabulary lookup to map each word to its lemma. The process typically considers the word's part of speech (POS) because the lemma of a word may vary depending on whether it's used as a noun, verb, adjective, etc. For example, the lemma of "better" could be "good" when used as an adjective but "well" when used as an adverb.

3. **POS Tagging:** To perform accurate lemmatization, it's essential to determine the part of speech of each word in the text. This is typically done using POS tagging, a process where each word is assigned a grammatical category (noun, verb, adjective, etc.) based on its context.

4. **Lemmatization Algorithms:** Lemmatization is often implemented using linguistic resources like WordNet or through lemmatization algorithms that take into account the word's morphology and the POS. Popular lemmatization libraries in Python include spaCy and NLTK.

5. **Example:**
   - Lemmatization of the word "running":
     - When used as a verb: "run"
     - When used as a noun: "running"

6. **Use Cases:**
   Lemmatization is valuable in various NLP tasks, including:
   - Information retrieval and text indexing.
   - Text classification and sentiment analysis.
   - Named entity recognition (NER) and part-of-speech tagging.
   - Machine translation.
   - Search engines for matching queries to documents.
   - Text summarization and document clustering.

Lemmatization helps improve the quality of text analysis and language understanding, particularly when words need to be matched, compared, or aggregated in a way that considers their grammatical variations. It is a valuable preprocessing step in NLP and can enhance the performance of various language-related tasks.
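
The role of the part of speech, and the contrast with stemming, can be illustrated with a toy example. Everything here is hypothetical: `LEMMA_TABLE` is a hand-written lookup standing in for a real dictionary such as WordNet, and `toy_stem` is a crude suffix stripper, not the Porter algorithm.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("better", "adj"): "good",
    ("better", "adv"): "well",
    ("running", "verb"): "run",
    ("running", "noun"): "running",
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

def toy_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_lemmatize("better", "adj"))    # good
print(toy_lemmatize("better", "adv"))    # well
print(toy_lemmatize("running", "verb"))  # run
print(toy_lemmatize("running", "noun"))  # running
print(toy_stem("running"))               # runn
print(toy_stem("studies"))               # studi
```

The same surface form maps to different lemmas depending on its part of speech, while the stemmer ignores grammar entirely and can produce strings that are not words.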

In [11]:
import nltk

def lemmatize_text(text):
    """Lemmatizes the given text.

    Args:
        text: A string containing the text to be lemmatized.

    Returns:
        A list of strings containing the lemmatized tokens.
    """

    # Tokenize the text
    tokens = nltk.word_tokenize(text)

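    # Lemmatize the tokens. Without a pos argument, WordNetLemmatizer treats
    # every token as a noun, so verbs such as "running" are left unchanged
    # and "was" / "has" are reduced to "wa" / "ha".
    lemmatizer = nltk.stem.WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]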

    return lemmatized_tokens

# Example usage:

text = "This is a sentence to be lemmatized."
text2 = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
lem = lemmatize_text(text2)
lemmatized_tokens = lemmatize_text(text)
print(lem)
print(lemmatized_tokens)

['He', 'wa', 'running', 'and', 'eating', 'at', 'same', 'time', '.', 'He', 'ha', 'bad', 'habit', 'of', 'swimming', 'after', 'playing', 'long', 'hour', 'in', 'the', 'Sun', '.']
['This', 'is', 'a', 'sentence', 'to', 'be', 'lemmatized', '.']
