# **NLP Text Processing**

# What is NLP Text Processing

Text processing in the context of NLP (Natural Language Processing) refers to a set of techniques and operations applied to textual data to make it more accessible and useful for various NLP tasks. It involves the manipulation, analysis, and transformation of text to extract valuable information or gain insights. Text processing is a crucial preliminary step in many NLP applications. Here are some common text processing tasks:

1. **Tokenization:** Tokenization involves splitting a text into individual units, such as words or sentences. These units are known as tokens. Tokenization is a fundamental step in NLP, as it breaks down text into manageable components.

2. **Stemming and Lemmatization:** Stemming and lemmatization are techniques to reduce words to their root or base forms. Stemming strips affixes (in practice mostly suffixes) using heuristic rules, so the result is not always a real word, while lemmatization uses a vocabulary and the word's part of speech to return its dictionary form.

3. **Stop Word Removal:** Stop words are common words like "the," "and," "in," which may not provide valuable information in certain NLP tasks. Removing stop words can reduce noise in the text.

4. **Normalization:** Text normalization involves standardizing text, making it consistent by converting all text to lowercase, removing special characters, and handling abbreviations or acronyms.

5. **Text Cleaning:** Text cleaning aims to remove noise or irrelevant information from text. This may include removing HTML tags, punctuation, or unwanted characters.

6. **Sentence Segmentation:** Sentence segmentation involves identifying sentence boundaries in a paragraph of text. This is important for tasks like machine translation or summarization.

7. **Text Encoding:** Text encoding converts text into a numerical format that machine learning models can work with. This is typically done using techniques like one-hot encoding or word embeddings (e.g., Word2Vec or GloVe).

8. **Part-of-Speech Tagging:** This task involves labeling words in a sentence with their part of speech (e.g., noun, verb, adjective). It's useful for understanding the grammatical structure of text.

9. **Named Entity Recognition (NER):** NER identifies and classifies entities in text, such as names of people, organizations, locations, dates, and more.

10. **Sentiment Analysis:** Sentiment analysis determines the sentiment or emotion expressed in a piece of text, typically classifying it as positive, negative, or neutral.

11. **Text Summarization:** Text summarization techniques aim to create a concise summary of a longer text while retaining its essential information.

12. **Topic Modeling:** Topic modeling algorithms can identify topics or themes within a collection of documents.

13. **Dependency Parsing:** Dependency parsing analyzes the grammatical structure of a sentence to identify the relationships between words.

14. **Text Translation:** Machine translation systems use NLP techniques to translate text from one language to another.

Text processing plays a crucial role in NLP because it prepares raw textual data for subsequent analysis, modeling, and understanding. The specific techniques used depend on the NLP task at hand and the nature of the text data being processed.
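As a rough illustration of how several of these steps chain together, the sketch below normalizes, tokenizes, removes stop words, and then encodes text as bag-of-words count vectors (a close relative of one-hot encoding). It uses only the Python standard library; the stop word list, the sample sentences, and the helper names `preprocess`, `build_vocabulary`, and `bag_of_words` are assumptions made for this example, not part of any library.

```python
import re
from collections import Counter

# Hypothetical mini stop word list, for illustration only.
STOP_WORDS = {"the", "and", "in", "a", "of", "is"}

def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocabulary(documents):
    """Map each distinct token to an integer index, in sorted order."""
    vocab = sorted({tok for doc in documents for tok in preprocess(doc)})
    return {tok: i for i, tok in enumerate(vocab)}

def bag_of_words(text, vocabulary):
    """Encode a text as a vector of token counts over the vocabulary."""
    counts = Counter(preprocess(text))
    return [counts.get(tok, 0) for tok in vocabulary]

docs = ["The cat sat in the hat.", "A cat is a cat."]
vocabulary = build_vocabulary(docs)
vectors = [bag_of_words(d, vocabulary) for d in docs]
print(vocabulary)  # {'cat': 0, 'hat': 1, 'sat': 2}
print(vectors)     # [[1, 1, 1], [2, 0, 0]]
```

Each position in a vector corresponds to one vocabulary entry, which is the kind of numerical input a classical machine learning model expects. Word embeddings such as Word2Vec or GloVe replace these sparse counts with dense learned vectors.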

# NLP vs. Theory of Automata vs. Compiler Design

Natural Language Processing (NLP), theory of automata, and compiler design are related in some ways, as they all deal with aspects of language, but they are distinct fields with different focuses. Here's how they relate:

1. **Theory of Automata:**
   - The theory of automata, particularly finite automata and regular languages, is a fundamental concept in computer science and provides the theoretical basis for several low-level NLP tools.
   - Regular expressions and finite automata are used in NLP for tasks like pattern matching, tokenization, and text processing.
   - Automata theory can be applied to language recognition and parsing in NLP, such as in the development of regular expression-based text processing tools.

2. **Compiler Design:**
   - Compiler design is a field of computer science that deals with the development of compilers, which are programs that translate source code into machine code or another language.
   - While compiler design and NLP are distinct fields, they can intersect when dealing with domain-specific languages or domain-specific compilers. In some cases, NLP techniques might be used to assist in developing compilers for languages that have natural language-like features.
   - Natural language might also be used in error messages or documentation produced by compilers.

3. **NLP:**
   - NLP is a specialized field within artificial intelligence and linguistics that focuses on the interaction between computers and human language.
   - NLP involves a wide range of tasks, including text processing, sentiment analysis, machine translation, speech recognition, and more.
   - While NLP may use concepts from automata theory and compiler design, its primary focus is on understanding and generating human language.

In summary, while there are connections between NLP, automata theory, and compiler design, they are separate fields with distinct objectives. NLP primarily deals with human language understanding and generation, while automata theory and compiler design are more foundational to computer science and programming language translation. Concepts from these fields can be applied in NLP when appropriate, but they are not the central focus of the discipline.
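To make the link between automata theory and tokenization concrete, here is a minimal sketch of a hand-written finite-state tokenizer. The function name `fsm_tokenize`, the three states, and the token labels are assumptions chosen for this illustration; production tokenizers are considerably more elaborate.

```python
def fsm_tokenize(text):
    """Tokenize with a three-state machine: OUTSIDE, IN_WORD, IN_NUMBER."""
    OUTSIDE, IN_WORD, IN_NUMBER = range(3)
    state, current, tokens = OUTSIDE, [], []

    def classify(ch):
        # The next state depends only on the class of the incoming character.
        if ch.isalpha():
            return IN_WORD
        if ch.isdigit():
            return IN_NUMBER
        return OUTSIDE

    def emit():
        label = "WORD" if state == IN_WORD else "NUMBER"
        tokens.append((label, "".join(current)))

    for ch in text:
        target = classify(ch)
        if target != state and current:
            # Leaving a word or number: emit the token collected so far.
            emit()
            current = []
        if target != OUTSIDE:
            current.append(ch)
        state = target
    if current:
        emit()  # flush the final token at end of input
    return tokens

print(fsm_tokenize("Tesla moved to the United States in 1884."))
```

For plain ASCII input, the regular expression `[A-Za-z]+|[0-9]+` recognizes the same tokens, which reflects the equivalence between regular expressions and finite automata.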

# Implementation

In [1]:
import re # regular expression library for string matching and manipulation
import nltk # natural language toolkit
from nltk.corpus import stopwords # stopwords corpus
from nltk.stem import PorterStemmer # stemming library
from nltk.stem import WordNetLemmatizer # lemmatization library
from nltk.tokenize import word_tokenize # word tokenizer library for tokenization
from nltk.tokenize import sent_tokenize # sentence tokenizer library for tokenization

### Sample Text

In [2]:
sample_text="""
Nikola Tesla: A Pioneer in Electrical Engineering

Nikola Tesla, born on July 10, 1856, in Smiljan, Croatia (then part of the Austro-Hungarian Empire), was a visionary inventor and electrical engineer who made groundbreaking contributions to the field of electrical engineering. His life and work are a testament to innovation and scientific curiosity.

Tesla's Early Years:
Tesla displayed an early aptitude for mathematics and mechanics, and he attended the Technical University of Graz and the University of Prague before finding work in the emerging electrical industry. He moved to the United States in 1884, where he began working with the legendary inventor Thomas Edison.

AC vs. DC:
One of the most significant chapters in Tesla's life revolved around the "War of Currents." He championed alternating current (AC) as a more efficient and safer method for electricity transmission than Edison's direct current (DC). This battle between AC and DC technologies ultimately led to the widespread adoption of AC as the standard for electrical power distribution.

Inventions and Innovations:
Tesla was a prolific inventor and held numerous patents. His inventions included the Tesla coil, a device capable of generating high-voltage, low-current electricity; the alternating current induction motor; and the development of the polyphase system for electricity distribution.

Wireless Power Transmission:
Tesla was also fascinated by the idea of wireless power transmission. He proposed the construction of the Wardenclyffe Tower, a wireless transmission facility, to transmit electricity wirelessly across vast distances. Although this project was never completed, it laid the groundwork for future developments in wireless technology.

Contributions to Science:
Tesla's work extended beyond electrical engineering. He made contributions to fields such as wireless communication, radio waves, and X-rays. His "Tesla's oscillator" was an early example of radio wave generation, which later played a crucial role in modern communication.

Challenges and Later Life:
Despite his many achievements, Tesla faced financial difficulties and isolation in his later years. He continued to work on inventions and ideas but struggled to bring them to fruition. He passed away on January 7, 1943, in New York City.

Legacy:
Nikola Tesla's legacy is profound, and his ideas continue to influence the fields of electrical engineering, technology, and science. He is celebrated for his contributions to the development of the modern electrical power system and his pioneering work in areas that laid the foundation for future innovations.

In conclusion, Nikola Tesla's story is one of scientific brilliance and innovation, and his work has left an enduring impact on the world of technology and engineering.


"""

### Sentence and Word Tokenization

In [3]:

nltk.download('punkt') # download the punkt tokenizer models for sentence tokenization
nltk.download('stopwords') # download the stopwords corpus for stopword removal
nltk.download('wordnet') # download the wordnet corpus for lemmatization 

[nltk_data] Downloading package punkt to /home/blackheart/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/blackheart/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/blackheart/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
sentences = sent_tokenize(sample_text)
words = [word_tokenize(sentence) for sentence in sentences]
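`sent_tokenize` relies on the pretrained Punkt model downloaded above. To see why a trained model is used instead of a simple rule, consider this naive splitter, which breaks after any `.`, `!`, or `?` followed by whitespace. `naive_sent_split` is a hypothetical helper written for this comparison and is not part of NLTK.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (no abbreviation handling)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = "AC vs. DC was a real debate. Tesla championed AC."
for sentence in naive_sent_split(text):
    print(repr(sentence))
# 'AC vs.'
# 'DC was a real debate.'
# 'Tesla championed AC.'
```

The naive rule wrongly ends a sentence at the abbreviation "vs.", whereas the notebook's output keeps "AC vs. DC:" together.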

### Lowercasing and Removing Special Characters

In [5]:

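# Lowercase each token and strip every non-alphanumeric character.
english_stopwords = set(stopwords.words('english'))  # build once, not once per token
clean_word = [[re.sub(r'[^a-zA-Z0-9]', '', word.lower())
               for word in word_tokenize(sentence) if word not in english_stopwords]
              for sentence in sent_tokenize(sample_text)]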


In [6]:
print(clean_word)

[['nikola', 'tesla', '', 'a', 'pioneer', 'electrical', 'engineering', 'nikola', 'tesla', '', 'born', 'july', '10', '', '1856', '', 'smiljan', '', 'croatia', '', 'part', 'austrohungarian', 'empire', '', '', 'visionary', 'inventor', 'electrical', 'engineer', 'made', 'groundbreaking', 'contributions', 'field', 'electrical', 'engineering', ''], ['his', 'life', 'work', 'testament', 'innovation', 'scientific', 'curiosity', ''], ['tesla', 's', 'early', 'years', '', 'tesla', 'displayed', 'early', 'aptitude', 'mathematics', 'mechanics', '', 'attended', 'technical', 'university', 'graz', 'university', 'prague', 'finding', 'work', 'emerging', 'electrical', 'industry', ''], ['he', 'moved', 'united', 'states', '1884', '', 'began', 'working', 'legendary', 'inventor', 'thomas', 'edison', ''], ['ac', 'vs', 'dc', '', 'one', 'significant', 'chapters', 'tesla', 's', 'life', 'revolved', 'around', '', 'war', 'currents', '', ''], ['he', 'championed', 'alternating', 'current', '', 'ac', '', 'efficient', 'saf

In [7]:
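# Remove stopwords again now that every token is lowercase (this catches
# words such as "He" that slipped through earlier) and drop the empty
# strings left behind by punctuation-only tokens.

stop_words = set(stopwords.words('english'))
filtered_words = [[word for word in sentence if word and word not in stop_words] for sentence in clean_word]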
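The order of lowercasing and stop word filtering matters, which is why tokens like 'he' and 'his' still appear in the printed `clean_word` output. The sketch below reproduces the effect with a small made-up stop word set (an assumption for this example; NLTK's English list is likewise all lowercase).

```python
# Hypothetical mini stop word list, all lowercase like NLTK's English list.
STOP_WORDS = {"he", "to", "the", "in"}
tokens = ["He", "moved", "to", "the", "United", "States", "in", "1884"]

# Filter first, lowercase second: "He" does not match "he" and survives.
filter_then_lower = [t.lower() for t in tokens if t not in STOP_WORDS]

# Lowercase first, filter second: all stop words are removed in one pass.
lower_then_filter = [t for t in (tok.lower() for tok in tokens) if t not in STOP_WORDS]

print(filter_then_lower)  # ['he', 'moved', 'united', 'states', '1884']
print(lower_then_filter)  # ['moved', 'united', 'states', '1884']
```

Lowercasing before filtering removes the need for a second stop word pass.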

### Stemming and Lemmatization

In [8]:
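stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed_words = [[stemmer.stem(word) for word in sentence] for sentence in filtered_words]
# lemmatize() treats every word as a noun by default (pos='n'), so verb forms
# such as "moved" stay unchanged unless a part of speech is passed explicitly.
lemmatized_words = [[lemmatizer.lemmatize(word) for word in sentence] for sentence in filtered_words]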
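The difference between the two techniques can be seen in a toy comparison. The stemmer below is a deliberately simplified suffix stripper and the lemmatizer is a hand-written lookup table. Neither is the Porter algorithm or WordNet, and all names here are assumptions for illustration.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, just the general idea.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a hand-written lookup table standing in for a real dictionary.
LEMMA_TABLE = {"studies": "study", "moved": "move", "was": "be", "currents": "current"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "moved", "was", "currents"]:
    print(f"{w:10} stem={toy_stem(w):10} lemma={toy_lemmatize(w)}")
```

Rule-based stripping yields fragments such as "studi" and "mov" that are not real words, while the lookup returns dictionary forms, including irregular ones like "was" to "be".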

### Printing Processed Sentences

In [9]:
print("Original Sentences:")
for sentence in sentences:
    print(sentence)

print("\nProcessed Sentences (Lemmatized):")
for sentence in lemmatized_words:
    print(' '.join(sentence))

Original Sentences:

Nikola Tesla: A Pioneer in Electrical Engineering

Nikola Tesla, born on July 10, 1856, in Smiljan, Croatia (then part of the Austro-Hungarian Empire), was a visionary inventor and electrical engineer who made groundbreaking contributions to the field of electrical engineering.
His life and work are a testament to innovation and scientific curiosity.
Tesla's Early Years:
Tesla displayed an early aptitude for mathematics and mechanics, and he attended the Technical University of Graz and the University of Prague before finding work in the emerging electrical industry.
He moved to the United States in 1884, where he began working with the legendary inventor Thomas Edison.
AC vs. DC:
One of the most significant chapters in Tesla's life revolved around the "War of Currents."
He championed alternating current (AC) as a more efficient and safer method for electricity transmission than Edison's direct current (DC).
This battle between AC and DC technologies ultimately led

# **Thank You!**