<a href="https://colab.research.google.com/github/SKumarAshutosh/natural-language-processing/blob/master/Preprocessing_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Natural Language Processing (NLP) preprocessing is a series of steps that converts raw text into a format machine learning models can consume. Here is a general guide to common NLP preprocessing steps:

### 1. Text Cleaning:
   - **Lowercasing:** Convert all text to lowercase to maintain consistency.
   - **Remove Punctuation:** Eliminate punctuation symbols.
   - **Handling Special Characters:** Address symbols, emojis, etc.
   - **Remove Stopwords:** Exclude common words (e.g., "and", "the") that may not add much meaning.
   - **Remove HTML Tags:** If working with web data, remove any HTML tags.

### 2. Tokenization:
   - **Word Tokenization:** Break text into words.
   - **Sentence Tokenization:** Break text into sentences.

### 3. Normalization:
   - **Stemming:** Strip suffixes with heuristic rules to approximate a word's root (e.g., "running" to "run"); the result is not always a dictionary word.
   - **Lemmatization:** Reduce words to their base form considering the context (e.g., "went" to "go").

### 4. Handling Numbers:
   - **Number Removal:** Sometimes numbers are removed to reduce complexity.
   - **Number Replacement:** Replace numbers with tokens or words (e.g., "1000" to "<NUM>").
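
As a minimal sketch of both options, the snippet below uses a single regular expression. The pattern and the function names `replace_numbers` and `remove_numbers` are illustrative assumptions; the pattern covers plain integers, decimals, and comma-separated thousands, and will need adjusting for other number formats.

```python
import re

# Assumed pattern: digits, optional ",ddd" thousands groups, optional decimal part.
NUMBER_PATTERN = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def replace_numbers(text, placeholder="<NUM>"):
    """Replace every number in the text with a placeholder token."""
    return NUMBER_PATTERN.sub(placeholder, text)

def remove_numbers(text):
    """Delete numbers, then collapse the whitespace they leave behind."""
    return re.sub(r"\s+", " ", NUMBER_PATTERN.sub("", text)).strip()

sentence = "The model saw 1,000 samples and scored 92.5 points."
print(replace_numbers(sentence))  # The model saw <NUM> samples and scored <NUM> points.
print(remove_numbers(sentence))   # The model saw samples and scored points.
```

Replacement keeps the signal that a number was present, which removal discards; which one helps depends on the task.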

### 5. Handling URLs and Usernames:
   - **URL Removal/Replacement:** Exclude or replace URLs.
   - **Username Removal/Replacement:** Exclude or replace usernames or identifiers.
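
The following sketch replaces URLs and usernames with placeholder tokens. It assumes Twitter-style handles (`@name`) and reuses the URL pattern from the cleaning code in this notebook; the tokens `<URL>` and `<USER>` and the function name are illustrative choices.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Assumption: a username is "@" plus word characters, and is not part of an
# email address (the "@" must not follow a word character).
USER_PATTERN = re.compile(r"(?<!\w)@\w+")

def replace_urls_and_users(text, url_token="<URL>", user_token="<USER>"):
    """Swap URLs and @handles for placeholder tokens."""
    text = URL_PATTERN.sub(url_token, text)
    return USER_PATTERN.sub(user_token, text)

print(replace_urls_and_users("Thanks @nlp_fan, see https://www.nlp-example.com for more."))
# Thanks <USER>, see <URL> for more.
```

Because `\S+` runs to the next whitespace, punctuation glued to the end of a URL (such as a sentence-final period) is swallowed along with it. Passing empty strings as the tokens removes the matches instead of replacing them.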

### 6. Spell Checking and Correction:
   - **Auto-Correction:** Correct misspelled words using predefined dictionaries or algorithms.

### 7. Part-of-Speech Tagging:
   - Identify and tag the part of speech (e.g., noun, verb) of each word.

### 8. Named Entity Recognition:
   - Identify and categorize entities (e.g., names, organizations, locations).

### 9. Noise Removal:
   - **Handle Typos:** Address spelling mistakes.
   - **Remove Extra Whitespaces:** Exclude additional space characters.

### 10. Text Encoding:
   - **One-Hot Encoding:** Convert words into vectors of 0s and 1s.
   - **Word Embedding:** Utilize pre-trained models like Word2Vec, GloVe, or use embeddings from models like BERT.
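
To make one-hot encoding concrete, here is a small pure-Python sketch. The helper names `index_tokens` and `one_hot` are assumptions made for illustration; in practice a library encoder would usually be used instead.

```python
def index_tokens(tokens):
    """Map each distinct token to an index (sorted so the mapping is reproducible)."""
    return {token: i for i, token in enumerate(sorted(set(tokens)))}

def one_hot(token, vocab):
    """Return a vector of zeros with a single 1 at the token's index."""
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector

tokens = ["nlp", "is", "fun", "nlp"]
vocab = index_tokens(tokens)
print(vocab)                               # {'fun': 0, 'is': 1, 'nlp': 2}
print([one_hot(t, vocab) for t in tokens])
# [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Each vector is as long as the vocabulary, which is why dense embeddings are generally preferred once the vocabulary grows large.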

### 11. Removing/Handling Contractions:
   - **Contraction Expansion:** Convert contractions (e.g., "it's" to "it is").
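
A dictionary lookup is one simple way to expand contractions. The sketch below uses a tiny hand-written mapping (an assumption for illustration; a real project needs a far larger table) and lowercases the text first.

```python
import re

# Illustrative mapping only. Some contractions are ambiguous ("it's" can also
# mean "it has"); this table always picks one reading.
CONTRACTIONS = {
    "it's": "it is",
    "i'm": "i am",
    "isn't": "is not",
    "don't": "do not",
    "can't": "cannot",
}

CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
)

def expand_contractions(text):
    """Lowercase the text, normalize curly apostrophes, and expand known contractions."""
    text = text.lower().replace("’", "'")
    return CONTRACTION_PATTERN.sub(lambda m: CONTRACTIONS[m.group(1)], text)

print(expand_contractions("It's exciting, isn't it?"))  # it is exciting, is not it?
```

Expansion has to run before punctuation removal; once the apostrophes are gone, "isn't" has already become "isnt" and the lookup no longer matches.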

### 12. Feature Engineering:
   - **Bag of Words:** Create a matrix of word frequencies.
   - **TF-IDF:** Utilize term frequency-inverse document frequency to reflect words’ importance.
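
The two representations can be sketched in plain Python as follows. This version assumes pre-tokenized, non-empty documents and uses the textbook weighting tf = count / document length and idf = log(N / df); libraries often apply smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    vocab = sorted({term for doc in docs for term in doc})
    counts = [Counter(doc) for doc in docs]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

def tf_idf(docs):
    """Weight each count by term frequency times inverse document frequency."""
    vocab, matrix = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(len(vocab))]
    weights = [
        [(row[j] / len(doc)) * math.log(n_docs / df[j]) for j in range(len(vocab))]
        for row, doc in zip(matrix, docs)
    ]
    return vocab, weights

docs = [["nlp", "is", "fun"], ["nlp", "is", "hard"]]
vocab, counts = bag_of_words(docs)
print(vocab)   # ['fun', 'hard', 'is', 'nlp']
print(counts)  # [[1, 0, 1, 1], [0, 1, 1, 1]]
_, weights = tf_idf(docs)
print(weights[0])
```

Terms that occur in every document ("nlp" and "is" here) get an idf of log(1) = 0, so only the distinguishing terms keep a non-zero weight.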

### 13. Handling Imbalanced Data:
   - **Oversampling:** Duplicate instances from the minority class.
   - **Undersampling:** Remove instances from the majority class.
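
A minimal sketch of random oversampling and undersampling is shown below. It assumes the data is a list of `(text, label)` pairs; the function names and the fixed seed are illustrative. Resampling is normally applied to the training split only, so that evaluation data keeps its original class distribution.

```python
import random
from collections import defaultdict

def group_by_label(samples):
    """Collect (text, label) pairs into one list per label."""
    groups = defaultdict(list)
    for text, label in samples:
        groups[label].append((text, label))
    return groups

def random_oversample(samples, seed=0):
    """Duplicate random samples of smaller classes until all match the largest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

def random_undersample(samples, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(samples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced

data = [("great", "pos"), ("nice", "pos"), ("good", "pos"), ("awful", "neg")]
print(len(random_oversample(data)))   # 6
print(len(random_undersample(data)))  # 2
```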

### 14. Data Augmentation:
   - **Back Translation:** Translate text to another language and then back to the original.
   - **Synonym Replacement:** Replace words with their synonyms.
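
Synonym replacement can be sketched with a lookup table and a replacement probability. The `SYNONYMS` table below is a hypothetical, hand-written stand-in for a real thesaurus such as WordNet, and `synonym_replace` is an illustrative name.

```python
import random

# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
}

def synonym_replace(tokens, p=0.5, seed=0):
    """Replace each token that has synonyms with a random one, with probability p."""
    rng = random.Random(seed)
    augmented = []
    for token in tokens:
        options = SYNONYMS.get(token)
        if options and rng.random() < p:
            augmented.append(rng.choice(options))
        else:
            augmented.append(token)
    return augmented

print(synonym_replace(["the", "quick", "dog", "is", "happy"], p=1.0))
```

Synonyms are rarely exact, so it is worth checking a sample of augmented sentences to confirm that labels such as sentiment still hold.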

### 15. Dependency Parsing:
   - Analyze the grammatical structure and determine the dependencies between words.

### 16. Coreference Resolution:
   - Identify when different words (pronouns, nouns) refer to the same entity in the text.

### 17. Sentiment Analysis (if applicable):
   - Analyze and classify the sentiment expressed in the text.

### 18. Document Classification (if applicable):
   - Categorize documents into predefined classes.

### 19. Creating Sequences (for deep learning):
   - Convert text into sequences of tokens or embeddings for input into models like LSTMs or GRUs.

### 20. Padding Sequences (for deep learning):
   - Ensure that all text sequences are of the same length for model training.
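
Sequence creation and padding can be sketched together. The example assumes a toy vocabulary and reserves id 0 for padding and id 1 for unknown tokens; these ids, the right-padding choice, and the helper names are assumptions rather than a fixed convention.

```python
PAD_ID = 0  # reserved for padding
UNK_ID = 1  # reserved for tokens missing from the vocabulary

def to_sequence(tokens, vocab):
    """Convert tokens to integer ids, falling back to UNK_ID for unknown tokens."""
    return [vocab.get(token, UNK_ID) for token in tokens]

def pad_sequence(ids, max_len, pad_id=PAD_ID):
    """Truncate to max_len, then right-pad with pad_id up to max_len."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

vocab = {"nlp": 2, "is": 3, "fun": 4}  # hypothetical toy vocabulary
sequences = [to_sequence(s.split(), vocab) for s in ["nlp is fun", "nlp rocks"]]
padded = [pad_sequence(seq, max_len=4) for seq in sequences]
print(padded)  # [[2, 3, 4, 0], [2, 1, 0, 0]]
```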

### 21. Creating Data Batches (for deep learning):
   - Group data into batches for efficient training.
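
Batching can be as simple as slicing a list. The generator below is a framework-free sketch with an assumed name, `make_batches`; deep learning libraries provide their own data loaders that also handle shuffling and parallel loading.

```python
def make_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

batches = list(make_batches(list(range(7)), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The last batch may be smaller than the rest; whether to keep or drop it is a training-time decision.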

Remember that the relevance of each step can depend on the specific NLP task and data at hand. Always tailor your preprocessing pipeline according to the requirements and characteristics of your problem.

In [1]:
# The leading ! runs this as a shell command in a notebook; drop it when running in a terminal
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 52.9 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
import re
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

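# Load spaCy's small English pipeline (tokenizer, tagger, parser, NER).
# The "sm" model ships without static word vectors, and it is not used by the NLTK steps below.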
nlp = spacy.load("en_core_web_sm")

# Example Text
text = "Hello, I'm learning NLP! Visit https://www.nlp-example.com for more info. It's exciting, isn't it?"

# 1. Text Cleaning
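def clean_text(text):
    text = text.lower()  # Lowercasing
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
    text = re.sub(r"<.*?>", "", text)  # Remove HTML tags if any
    # Remove punctuation. This also deletes apostrophes, so "i'm" becomes "im";
    # expand contractions before this line if that matters for your task.
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text)  # Collapse runs of whitespace into one space
    return text.strip()  # Drop leading/trailing whitespace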

# 2. Tokenization
def tokenize_text(text):
    return word_tokenize(text)

# 3. Remove Stopwords
def remove_stopwords(tokens):
    stop_words = set(stopwords.words("english"))
    return [token for token in tokens if token not in stop_words]

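# 4. Lemmatization
def lemmatize_tokens(tokens):
    lemmatizer = WordNetLemmatizer()
    # lemmatize() treats every token as a noun unless a POS tag is passed,
    # so verb forms such as "learning" are left unchanged here.
    return [lemmatizer.lemmatize(token) for token in tokens]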

# Applying Preprocessing Steps
cleaned_text = clean_text(text)
tokens = tokenize_text(cleaned_text)
tokens_without_stopwords = remove_stopwords(tokens)
lemmatized_tokens = lemmatize_tokens(tokens_without_stopwords)

# Output
print(f"Original Text: {text}")
print(f"Cleaned Text: {cleaned_text}")
print(f"Tokens: {tokens}")
print(f"Tokens without Stopwords: {tokens_without_stopwords}")
print(f"Lemmatized Tokens: {lemmatized_tokens}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Original Text: Hello, I'm learning NLP! Visit https://www.nlp-example.com for more info. It's exciting, isn't it?
Cleaned Text: hello im learning nlp visit for more info its exciting isnt it
Tokens: ['hello', 'im', 'learning', 'nlp', 'visit', 'for', 'more', 'info', 'its', 'exciting', 'isnt', 'it']
Tokens without Stopwords: ['hello', 'im', 'learning', 'nlp', 'visit', 'info', 'exciting', 'isnt']
Lemmatized Tokens: ['hello', 'im', 'learning', 'nlp', 'visit', 'info', 'exciting', 'isnt']


**Explanation**:

**clean_text**: Lowercases text, removes URLs, HTML tags, punctuation, and extra spaces.


**tokenize_text**: Splits the text into words.


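**remove_stopwords**: Excludes common English words (stopwords) from the tokens. In the output above, "im" and "isnt" survive because punctuation removal fused the contractions into forms that are not in the stopword list.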


**lemmatize_tokens**: Converts each word to its base or root form.

**Note**:
Make sure Python and the required packages (`nltk` and `spacy`) are installed before running this code on your local machine. Choose only the preprocessing steps that are relevant to your specific use case and data.

Feel free to add or modify steps according to your NLP task, such as using different tokenization/normalization techniques, adding feature engineering steps, etc.

While the previous guide provides an extensive overview of many common preprocessing steps, there are other techniques and strategies that might be relevant for specific Natural Language Processing (NLP) applications:

### 1. Context Preservation Techniques:
   - **Coreference Resolution:** Resolve words that refer to the same entity (e.g., “Obama” and “he”).
   - **Anaphora Resolution:** Identify what a pronoun or a noun phrase refers to.

### 2. Text Summarization:
   - **Extractive Summarization:** Selecting important sentences or phrases as they are.
   - **Abstractive Summarization:** Generating new sentences that retain the original meaning.

### 3. Dealing with Slang and Abbreviations:
   - **Slang Normalization:** Convert slang or informal expressions to their standard form.
   - **Abbreviation Expansion:** Convert abbreviations to their full forms.

### 4. Language Detection:
   - Identifying the language of the text.

### 5. Morphological Analysis:
   - Understanding the structure and construction of words.

### 6. Collocation Extraction:
   - Identifying words that often occur together.
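
One common way to score candidate collocations is pointwise mutual information (PMI) over adjacent word pairs. The sketch below is a simplified version with assumed names and a `min_count` cutoff to filter out rare pairs; dedicated tools offer more robust association measures.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by PMI: log2(P(w1, w2) / (P(w1) * P(w2)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

tokens = "new york is big and new york is busy".split()
for pair, score in bigram_pmi(tokens):
    print(pair, round(score, 2))
```

A positive score means the pair occurs together more often than its individual word frequencies would predict.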

### 7. Language Translation:
   - If dealing with multilingual data, translating text to a consistent language.

### 8. Language Model Fine-tuning:
   - Adjusting pre-trained language models to better suit specific tasks/domains.

### 9. Handling Dates and Times:
   - Standardizing date and time formats.
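
As a sketch of date standardization, the function below tries a short list of known formats and emits ISO 8601 (`YYYY-MM-DD`). The format list is an assumption to be extended for your data; it reads slash-separated dates as day-first, and month names are parsed according to the active locale (English by default).

```python
from datetime import datetime

# Formats to try, in order. Assumed for illustration; "%d/%m/%Y" treats
# 03/04/2023 as 3 April, which is wrong for month-first data.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %b %Y"]

def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(normalize_date("October 15, 2023"))  # 2023-10-15
print(normalize_date("15/10/2023"))        # 2023-10-15
print(normalize_date("next week"))         # None
```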

### 10. Word Case Normalization:
   - Handling capitalized words used for emphasis or in headers.

### 11. Building Vocabulary:
   - Creating a mapping of words to unique integer indices for tokenization.
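
A vocabulary can be built with a frequency counter. The sketch below assumes pre-tokenized documents, drops tokens rarer than `min_freq`, and reserves the first indices for special tokens; the `<PAD>` and `<UNK>` names are conventional but still an assumption here.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, specials=("<PAD>", "<UNK>")):
    """Map tokens to integer ids: special tokens first, then by descending frequency."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Ties are broken alphabetically so the mapping is deterministic.
    kept = sorted(
        (token for token, count in counts.items() if count >= min_freq),
        key=lambda token: (-counts[token], token),
    )
    return {token: idx for idx, token in enumerate(list(specials) + kept)}

corpus = [["nlp", "is", "fun"], ["nlp", "is", "hard"]]
vocab = build_vocab(corpus, min_freq=2)
print(vocab)  # {'<PAD>': 0, '<UNK>': 1, 'is': 2, 'nlp': 3}
```

Tokens below the cutoff ("fun" and "hard" here) would be mapped to `<UNK>` at encoding time.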

### 12. Bi-gram or N-gram Models:
   - Considering multiple word phrases as features.
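
N-grams can be generated with a sliding window over the token list, as in this small sketch (the function name is illustrative):

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["not", "very", "good"]
print(ngrams(tokens, 2))  # [('not', 'very'), ('very', 'good')]
print(ngrams(tokens, 3))  # [('not', 'very', 'good')]
```

Bigrams such as `('not', 'very')` keep local word order that a unigram bag-of-words model discards.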

### 13. Syntactic Parsing:
   - Understanding sentence structure and hierarchical organization of words.

### 14. Negation Handling:
   - Recognizing and handling negated expressions.
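
One simple heuristic for negation handling is to prefix every token between a negation word and the next punctuation mark with `NOT_`. The sketch below assumes the input is already tokenized with punctuation as separate tokens; the negation word list is a small illustrative sample.

```python
# Illustrative negation cues; extend as needed.
NEGATIONS = {"not", "no", "never", "cannot"}
CLAUSE_END = {".", ",", ";", ":", "!", "?"}

def mark_negation(tokens):
    """Prefix tokens that follow a negation cue with NOT_ until the clause ends."""
    marked, negated = [], False
    for token in tokens:
        lowered = token.lower()
        if token in CLAUSE_END:
            negated = False  # punctuation closes the negation scope
            marked.append(token)
        elif lowered in NEGATIONS or lowered.endswith("n't"):
            negated = True
            marked.append(token)
        else:
            marked.append("NOT_" + token if negated else token)
    return marked

tokens = ["i", "do", "not", "like", "this", "movie", ",", "but", "it", "ends", "well"]
print(mark_negation(tokens))
# ['i', 'do', 'not', 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'it', 'ends', 'well']
```

With this marking, "like" and "NOT_like" become distinct features for a bag-of-words model. The scope rule is crude and will mislabel some sentences.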

### 15. Code Switching Handling:
   - Managing instances where two or more languages are used within the same utterance or context.

### 16. Text Alignment:
   - Aligning text in parallel corpora (useful in translation tasks).

### 17. Dialogue Turn Breaking:
   - Separating and identifying different turns in dialogue or conversation data.

### 18. Discourse Analysis:
   - Understanding and identifying logical structures and argumentation within the text.

### 19. Phonetics Normalization:
   - Converting words to a representation of their phonetic form.

### 20. Text Categorization:
   - Assigning predefined labels or categories based on content.

### 21. Semantic Role Labeling (SRL):
   - Determining the semantic roles of constituent phrases in a sentence.

### 22. Identifying Text Genres:
   - Identifying the genre of text like news, fiction, scientific paper, etc.

### 23. Multi-modal Data Handling:
   - Managing data that combines text with other modalities, such as images or audio.

### 24. Encoding Techniques:
   - Implementing diverse text encoding strategies like BPE (Byte Pair Encoding).
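
The core of BPE training is a loop that repeatedly merges the most frequent adjacent symbol pair. The sketch below is a simplified version: it starts from single characters, omits the end-of-word marker many implementations add, and breaks frequency ties by picking the lexicographically largest pair (an arbitrary assumption made for determinism).

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merges from a {word: frequency} mapping."""
    # Each word starts as a tuple of single-character symbols.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges

print(learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2))
# [('o', 'w'), ('l', 'ow')]
```

The learned merge list is then applied in the same order to split unseen words into subword units.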

### 25. Relevance and Redundancy Check:
   - Ensuring the provided text is relevant and not redundant for the task.

### 26. Dialect Identification:
   - Identifying specific dialects within a language.

### 27. Paraphrase Identification:
   - Identifying and handling paraphrases in text.

### 28. Text Segmentation:
   - Dividing text into meaningful segments or chunks.

The preprocessing steps should be selected and implemented based on the specific NLP task, dataset characteristics, and desired outcomes. Always tailor your preprocessing strategy according to the unique demands and challenges of your project.