# Text Preprocessing in Natural Language Processing (NLP)

Text preprocessing is a crucial step in preparing raw text data for analysis and modeling in Natural Language Processing (NLP). This process involves several techniques to clean and transform the text into a format that can be effectively used by machine learning algorithms.

## Steps of Text Preprocessing

### 1. **Tokenization**
   - **Definition**: The process of breaking down text into smaller units called tokens, which can be words, phrases, or sentences.
   - **Purpose**: Gives every later step (filtering, stemming, tagging) discrete units to operate on instead of one undivided string.

### 2. **Lowercasing**
   - **Definition**: Converting all characters in the text to lowercase.
   - **Purpose**: Shrinks the vocabulary by treating words like "Apple" and "apple" as the same token. The trade-off is that case distinctions (for example, a company name versus a fruit) are lost.

### 3. **Stopword Removal**
   - **Definition**: The removal of common words (stopwords) that do not add significant meaning to the text, such as "is", "the", "and".
   - **Purpose**: Reduces the noise in the text, allowing the model to focus on more informative words.

### 4. **Stemming**
   - **Definition**: The process of reducing words to a root form by stripping affixes with heuristic rules; the Porter stemmer used below removes suffixes only. For example, "running" becomes "run". The result is not always a dictionary word ("beautiful" becomes "beauti").
   - **Purpose**: Consolidates inflected variants of a word into a single token.

### 5. **Lemmatization**
   - **Definition**: Similar to stemming, but lemmatization uses a vocabulary and the word's part of speech to convert it to its base or dictionary form (lemma). For example, "better" becomes "good" when it is treated as an adjective.
   - **Purpose**: Provides a more accurate representation of the word by understanding its meaning and usage.

### 6. **Part-of-Speech (POS) Tagging**
   - **Definition**: The process of assigning grammatical categories (such as noun, verb, adjective) to each word in a sentence.
   - **Purpose**: Helps in understanding the grammatical structure and meaning of sentences.

### 7. **Named Entity Recognition (NER)**
   - **Definition**: The identification and classification of named entities in the text (such as names of people, organizations, and locations).
   - **Purpose**: Useful for extracting meaningful information from text and organizing it into predefined categories.
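
NER is not demonstrated in the NLTK walkthrough later in this notebook, so here is a minimal sketch of the input and output shape only. It uses a hand-written lookup table (a gazetteer) rather than a trained model; `GAZETTEER` and `tag_entities` are hypothetical names introduced for illustration. Real systems, such as NLTK's `nltk.ne_chunk`, rely on statistical models instead.

```python
# Hypothetical gazetteer: a hand-written table standing in for a trained NER model.
GAZETTEER = {
    "Alice": "PERSON",
    "Bob": "PERSON",
    "Paris": "LOCATION",
    "Seine": "LOCATION",
}

def tag_entities(tokens):
    """Return (token, label) pairs for the tokens found in the gazetteer."""
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]

tokens = ["Last", "summer", ",", "Alice", "and", "Bob",
          "traveled", "to", "Paris", "."]
print(tag_entities(tokens))
```

The lookup is case-sensitive, which hints at why NER is usually run before lowercasing: capitalization is one of the strongest cues for spotting names.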

### 8. **Normalization**
   - **Definition**: Converting text to a consistent canonical form, for example by removing punctuation and special characters and by correcting typos.
   - **Purpose**: Ensures a consistent format for text data, improving the quality of analysis.
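
A minimal sketch of string-level normalization using only the standard library. The `normalize` function is a hypothetical helper, it handles ASCII punctuation only, and it does not attempt typo correction, which usually needs a dictionary or a spelling model.

```python
import re
import string
import unicodedata

def normalize(text: str) -> str:
    # Fold compatibility characters (full-width forms, ligatures) into plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # Delete ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  Last summer,   Alice & Bob traveled to Paris!! "))
```

The order of the steps matters little here, but collapsing whitespace last cleans up the gaps left behind by the deleted punctuation.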


In [2]:
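import nltk
from nltk.tokenize import word_tokenize, sent_tokenize  # tokenizers
from nltk.corpus import stopwords  # stopword lists
from nltk.stem import PorterStemmer, WordNetLemmatizer  # stemmer and lemmatizer

# One-time data downloads; resource names can differ between NLTK versions.
for resource in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(resource, quiet=True)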



In [3]:
text = "Last summer, Alice and Bob traveled to Paris. They enjoyed running along the Seine and admired the beautiful buildings."

## Tokenization
- Sentence Tokenizer
- Word Tokenizer


In [4]:
print("Sentence Tokens: \n", sent_tokenize(text))


Sentence Tokens: 
 ['Last summer, Alice and Bob traveled to Paris.', 'They enjoyed running along the Seine and admired the beautiful buildings.']


In [5]:
words = word_tokenize(text)
print("Word Tokens: \n", words)

Word Tokens: 
 ['Last', 'summer', ',', 'Alice', 'and', 'Bob', 'traveled', 'to', 'Paris', '.', 'They', 'enjoyed', 'running', 'along', 'the', 'Seine', 'and', 'admired', 'the', 'beautiful', 'buildings', '.']
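
To show what word tokenization does at its simplest, the sketch below splits the same sentence with a regular expression: runs of word characters become tokens, and each punctuation mark becomes its own token. This is an approximation for illustration, not how `word_tokenize` works internally; NLTK's tokenizer also handles contractions, abbreviations, and other cases this pattern ignores.

```python
import re

text = ("Last summer, Alice and Bob traveled to Paris. "
        "They enjoyed running along the Seine and admired the beautiful buildings.")

# \w+ matches a run of word characters; [^\w\s] matches one punctuation character.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)
```

For this particular sentence the regex yields the same 22 tokens as the NLTK output above; on text with contractions such as "don't" the two would diverge.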


## Lowercasing

In [6]:
words = [word.lower() for word in words]
print(words)

['last', 'summer', ',', 'alice', 'and', 'bob', 'traveled', 'to', 'paris', '.', 'they', 'enjoyed', 'running', 'along', 'the', 'seine', 'and', 'admired', 'the', 'beautiful', 'buildings', '.']


## Stopword Removal

In [7]:
stop_words = stopwords.words("english")  # lowercase list; its length depends on the NLTK data version
print(len(stop_words), stop_words[0])

179 i


In [8]:
words_filtered = [word for word in words if word not in stop_words]
print(words_filtered) # stop words removed.

['last', 'summer', ',', 'alice', 'bob', 'traveled', 'paris', '.', 'enjoyed', 'running', 'along', 'seine', 'admired', 'beautiful', 'buildings', '.']
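
The filtered list still contains the punctuation tokens `','` and `'.'`, because they are not in the stopword list. A small sketch of dropping them with the standard library, assuming only ASCII punctuation needs to be handled (`words_no_punct` is a name introduced here):

```python
import string

# Tokens copied from the stopword-removal output above.
words_filtered = ['last', 'summer', ',', 'alice', 'bob', 'traveled', 'paris', '.',
                  'enjoyed', 'running', 'along', 'seine', 'admired', 'beautiful',
                  'buildings', '.']

punctuation = set(string.punctuation)  # ASCII punctuation only

# Keep a token unless every one of its characters is punctuation.
words_no_punct = [w for w in words_filtered
                  if not all(ch in punctuation for ch in w)]
print(words_no_punct)
```

Checking every character, rather than testing membership of the whole token, also removes multi-character tokens such as `'...'` while leaving hyphenated words intact.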


## Stemming


In [9]:
stemmer = PorterStemmer()
stem_words = [stemmer.stem(word) for word in words_filtered]
print(stem_words)

['last', 'summer', ',', 'alic', 'bob', 'travel', 'pari', '.', 'enjoy', 'run', 'along', 'sein', 'admir', 'beauti', 'build', '.']
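
To make rule-based suffix stripping concrete, here is a deliberately crude sketch. It is not the Porter algorithm, which applies several ordered phases of rules; `crude_stem` and its three-suffix list are hypothetical and exist only to show why stemmers can produce non-words.

```python
def crude_stem(word: str) -> str:
    """Strip one common English suffix; a toy illustration, not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Require at least three remaining characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["enjoyed", "buildings", "admired", "paris", "last"]:
    print(w, "->", crude_stem(w))
```

Like the Porter output above, this sketch turns "paris" into "pari": the rules look only at spelling, so a name ending in "s" is treated like a plural.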


## Lemmatization

In [10]:
lemmatizer = WordNetLemmatizer()
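# lemmatize() treats every word as a noun (pos="n") unless told otherwise,
# so verb forms such as "running" and "traveled" pass through unchanged.
lemmatized_words = [lemmatizer.lemmatize(word) for word in words_filtered]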
print(lemmatized_words)

['last', 'summer', ',', 'alice', 'bob', 'traveled', 'paris', '.', 'enjoyed', 'running', 'along', 'seine', 'admired', 'beautiful', 'building', '.']


## POS Tagging

In [13]:
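# Tagging lowercased, stopword-filtered tokens removes cues the tagger relies on,
# which is why names such as 'alice' and 'paris' come out as NN (common noun).
pos_tags = nltk.pos_tag(words_filtered)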
print(pos_tags)

[('last', 'JJ'), ('summer', 'NN'), (',', ','), ('alice', 'NN'), ('bob', 'NN'), ('traveled', 'VBD'), ('paris', 'NN'), ('.', '.'), ('enjoyed', 'VBN'), ('running', 'VBG'), ('along', 'RB'), ('seine', 'NN'), ('admired', 'VBD'), ('beautiful', 'JJ'), ('buildings', 'NNS'), ('.', '.')]
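
One practical use of these tags is to tell the lemmatizer which part of speech each word has. The sketch below maps Penn Treebank tags, as returned by `nltk.pos_tag`, to the single-letter codes that `WordNetLemmatizer.lemmatize` accepts in its `pos` argument (`"n"`, `"v"`, `"a"`, `"r"`). The helper name `penn_to_wordnet` is hypothetical, and falling back to noun for every other tag is an assumption that mirrors the lemmatizer's own default.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for everything else

# A few (word, tag) pairs taken from the tagger output above.
sample_tags = [("traveled", "VBD"), ("running", "VBG"),
               ("beautiful", "JJ"), ("buildings", "NNS")]
for word, tag in sample_tags:
    print(word, tag, "->", penn_to_wordnet(tag))
```

With this mapping, a call such as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))` lets verbs like "running" be reduced to "run" instead of being left untouched.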
