# NLP Pipeline Notebook

#### **Motivation:** Generic functions and a pipeline that can fit into our project (I want my life to be easy).

It includes:

- Reading input text
- Preprocessing
- Tokenization and POS tagging
- Dependency parsing
- Named Entity Recognition (NER)
- Sentiment Analysis using both TextBlob and VADER


## Step 1: Environment Setup


In [2]:
# !pip install spacy textblob nltk
# !python -m textblob.download_corpora
# !python -m spacy download en_core_web_sm
# !python -m nltk.downloader vader_lexicon

### My Little Explanation:
- **spaCy**: A fast and production-ready library for NLP tasks.
- **TextBlob**: Provides an easy API for basic NLP tasks, e.g. sentiment analysis.
- **nltk**: A classic NLP library. We'll use its VADER sentiment analyzer.
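
Before running the later cells, it can help to confirm that these three packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of the original notebook. It checks packages only, not the spaCy model or the NLTK corpora.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the top-level package names that cannot be found (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]

required = ["spacy", "textblob", "nltk"]
missing = missing_packages(required)
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("All required packages are importable.")
```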


## Step 2: Import Packages and Load Models


In [3]:
import spacy
from textblob import TextBlob
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

import re
from spacy.lang.en.stop_words import STOP_WORDS

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required NLTK resources (if not already downloaded)
# nltk.download('punkt')
# nltk.download('stopwords')
nltk.download('wordnet')
# nltk.download('omw-1.4')  # For WordNet lemmatizer language support


# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Download VADER lexicon for sentiment analysis (if not already downloaded)
# nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()




[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
# import nltk
# nltk.download('punkt', quiet=True)
# nltk.download('stopwords', quiet=True)
# nltk.download('wordnet', quiet=True)
# nltk.download('omw-1.4', quiet=True)


In [5]:
# nltk.data.path

## Preprocessing

In [6]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

def nltk_preprocess_text(text, lower=True, remove_numbers=True, remove_punct=True, 
                           remove_stopwords=True, use_lemmatization=False, use_stemming=False,
                           extra_clean=True):
    """
    Preprocesses input text using NLTK.
    
    The function performs the following steps:
    1. **Extra Cleaning:** Removes extra spaces and, optionally, unwanted characters via regex.
    2. **Lowercasing:** Converts the text to lowercase.
    3. **Tokenization:** Splits the text into tokens using NLTK's word_tokenize.
    4. **Punctuation & Number Removal:** Filters out tokens that are punctuation or numbers.
    5. **Stop Word Removal:** Removes common stop words using NLTK's stop word list.
    6. **Lemmatization (Optional):** Converts tokens to their base forms using the WordNetLemmatizer.
    7. **Stemming (Optional):** Applies PorterStemmer to tokens, after lemmatization if both are enabled.
    
    Parameters:
      - text (str): The raw input text.
      - lower (bool): If True, converts text to lowercase. Default: True.
      - remove_numbers (bool): If True, removes tokens that are digits. Default: True.
      - remove_punct (bool): If True, removes tokens that are non-alphanumeric. Default: True.
      - remove_stopwords (bool): If True, filters out common stop words. Default: True.
      - use_lemmatization (bool): If True, uses WordNetLemmatizer on tokens. Default: False.
      - use_stemming (bool): If True, applies PorterStemmer to tokens after lemmatization. Default: False.
      - extra_clean (bool): If True, performs extra regex cleaning to remove unwanted characters and extra spaces. Default: True.
      
    Returns:
      str: The cleaned, preprocessed text.
    """
    # Extra cleaning using regex: collapse whitespace, then replace punctuation with spaces.
    if extra_clean:
        text = re.sub(r'\s+', ' ', text)      # Collapse each run of whitespace into a single space.
        text = re.sub(r'[^\w\s]', ' ', text)   # Replace each punctuation character with a space.
    
    # Lowercase conversion
    if lower:
        text = text.lower()
    
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Define stop words set
    stops = set(stopwords.words('english'))
    
    # Initialize lemmatizer and stemmer objects
    wnl = WordNetLemmatizer()
    ps = PorterStemmer()
    
    processed_tokens = []
    for token in tokens:
        # Optionally remove tokens that are punctuation (i.e., non-alphanumeric)
        if remove_punct and not token.isalnum():
            continue
        # Optionally remove numbers
        if remove_numbers and token.isdigit():
            continue
        # Optionally remove stop words
        if remove_stopwords and token in stops:
            continue
        
        # Optionally apply lemmatization (get the base form of the word)
        if use_lemmatization:
            token = wnl.lemmatize(token)
        # Optionally apply stemming after lemmatization
        if use_stemming:
            token = ps.stem(token)
        
        # Add token if it's not empty
        if token.strip():
            processed_tokens.append(token.strip())
    
    return " ".join(processed_tokens)




# --- Example usage ---
sample_text = ("Apple Inc. is planning to launch a new iPhone this September! "
               "Visit www.apple.com for details in 2023. Running, runner, runs.")

print("Original Text:")
print(sample_text)

print("\nProcessed Text (no lemm no stemming):")
print(nltk_preprocess_text(sample_text, use_lemmatization=False, use_stemming=False))

print("\nProcessed Text (only  Stemming):")
print(nltk_preprocess_text(sample_text, use_lemmatization=False, use_stemming=True))


Original Text:
Apple Inc. is planning to launch a new iPhone this September! Visit www.apple.com for details in 2023. Running, runner, runs.

Processed Text (no lemm no stemming):
apple inc planning launch new iphone september visit www apple com details running runner runs

Processed Text (only  Stemming):
appl inc plan launch new iphon septemb visit www appl com detail run runner run
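
The output above splits `www.apple.com` into three tokens and drops `2023`. A small sketch of why, using only `re` and the same two patterns as `nltk_preprocess_text`. As a simplifying assumption, `str.split` stands in for `word_tokenize`, which should behave similarly once punctuation is gone.

```python
import re

text = "Visit   www.apple.com for details in 2023."

# Same two substitutions as the extra_clean step.
collapsed = re.sub(r'\s+', ' ', text)
no_punct = re.sub(r'[^\w\s]', ' ', collapsed)

# Assumption: whitespace splitting approximates word_tokenize on punctuation-free text.
tokens = no_punct.lower().split()

# Mirrors remove_numbers=True: drop tokens made only of digits.
kept = [t for t in tokens if not t.isdigit()]

print(repr(no_punct))
print(kept)
```

If URLs should survive as single tokens, they would need to be handled before the punctuation substitution runs.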


## Step 3: Reading Input Text


In [7]:
def read_input_file(file_path):
    """
    Reads text from a given file path.
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read().strip()
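
A short usage sketch for `read_input_file`, assuming nothing about the project's real file layout: it writes a throwaway temporary file and reads it back. The function body is repeated so the block runs on its own.

```python
import os
import tempfile

def read_input_file(file_path):
    """Reads text from a given file path (same body as the cell above)."""
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read().strip()

# Write a throwaway file so the example does not depend on any real path.
with tempfile.NamedTemporaryFile('w', suffix='.txt', encoding='utf-8', delete=False) as tmp:
    tmp.write("  Apple Inc. is planning to launch a new iPhone this September.\n\n")
    tmp_path = tmp.name

try:
    loaded_text = read_input_file(tmp_path)
    print(repr(loaded_text))  # leading and trailing whitespace is stripped
finally:
    os.remove(tmp_path)
```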


## Step 4: Tokenization and Part-of-Speech Tagging


In [8]:
def tokenize_and_tag(text):
    """
    Tokenizes and tags the text, returning tokens with their POS tags.
    """
    doc = nlp(text)
    return [(token.text, token.pos_) for token in doc]

example_text = "Apple Inc. is planning to launch a new iPhone this September."
tokenize_and_tag(example_text)


[('Apple', 'PROPN'),
 ('Inc.', 'PROPN'),
 ('is', 'AUX'),
 ('planning', 'VERB'),
 ('to', 'PART'),
 ('launch', 'VERB'),
 ('a', 'DET'),
 ('new', 'ADJ'),
 ('iPhone', 'PROPN'),
 ('this', 'DET'),
 ('September', 'PROPN'),
 ('.', 'PUNCT')]
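
The `(token, tag)` pairs are plain tuples, so downstream filtering needs no further spaCy calls. A sketch that keeps content words from the output above; the set of tags treated as "content" is an assumption made for illustration.

```python
# Output of tokenize_and_tag(example_text), copied from the cell above.
tagged = [('Apple', 'PROPN'), ('Inc.', 'PROPN'), ('is', 'AUX'), ('planning', 'VERB'),
          ('to', 'PART'), ('launch', 'VERB'), ('a', 'DET'), ('new', 'ADJ'),
          ('iPhone', 'PROPN'), ('this', 'DET'), ('September', 'PROPN'), ('.', 'PUNCT')]

# Assumption: these Universal POS tags count as "content" words for this example.
CONTENT_TAGS = {'NOUN', 'PROPN', 'VERB', 'ADJ', 'ADV'}

content_words = [text for text, pos in tagged if pos in CONTENT_TAGS]
print(content_words)
```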

## Step 5: Dependency Parsing

#### We parse the sentence to understand the grammatical relationships between words.


In [9]:
# Note to self: if dependency parsing is unfamiliar, read up on it first; it won't take even 30 minutes.

def dependency_parse(text):
    """
    Returns the dependency relations for each token.
    """
    doc = nlp(text)
    return [(token.text, token.dep_, token.head.text) for token in doc]

dependency_parse(example_text)

[('Apple', 'compound', 'Inc.'),
 ('Inc.', 'nsubj', 'planning'),
 ('is', 'aux', 'planning'),
 ('planning', 'ROOT', 'planning'),
 ('to', 'aux', 'launch'),
 ('launch', 'xcomp', 'planning'),
 ('a', 'det', 'iPhone'),
 ('new', 'amod', 'iPhone'),
 ('iPhone', 'dobj', 'launch'),
 ('this', 'det', 'September'),
 ('September', 'npadvmod', 'launch'),
 ('.', 'punct', 'planning')]

### Explanation:
- The `dependency_parse` function shows the relationship (like subject, object, etc.) between each token and its head word.
- This helps in understanding how parts of the sentence relate to each other.
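
To make the head relationships easier to see, the triples can be regrouped into a tree. This is a sketch over the output shown above, not part of the original pipeline; `tree_lines` is a hypothetical helper. It keys on token text, which only works because no word repeats in this sentence. Real code would more safely key on spaCy's `token.i`.

```python
from collections import defaultdict

# Output of dependency_parse(example_text), copied from the cell above.
triples = [('Apple', 'compound', 'Inc.'), ('Inc.', 'nsubj', 'planning'),
           ('is', 'aux', 'planning'), ('planning', 'ROOT', 'planning'),
           ('to', 'aux', 'launch'), ('launch', 'xcomp', 'planning'),
           ('a', 'det', 'iPhone'), ('new', 'amod', 'iPhone'),
           ('iPhone', 'dobj', 'launch'), ('this', 'det', 'September'),
           ('September', 'npadvmod', 'launch'), ('.', 'punct', 'planning')]

children = defaultdict(list)
root = None
for text, dep, head in triples:
    if dep == 'ROOT':
        root = text  # the root token is listed as its own head
    else:
        children[head].append((dep, text))

def tree_lines(word, relation='ROOT', depth=0):
    """Return indented lines for `word` and everything attached below it."""
    lines = [f"{'  ' * depth}{relation}: {word}"]
    for dep, child in children[word]:
        lines.extend(tree_lines(child, dep, depth + 1))
    return lines

print("\n".join(tree_lines(root)))
```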


## Step 6: Named Entity Recognition (NER)

In [10]:
def named_entities(text):
    """
    Extracts named entities from the text.
    """
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


named_entities(example_text)


[('Apple Inc.', 'ORG'), ('iPhone', 'ORG'), ('this September', 'DATE')]
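
When there are many entities, grouping them by label can be more convenient than a flat list. A minimal sketch over the output above; the grouping step is an addition for illustration, not something the notebook does.

```python
from collections import defaultdict

# Output of named_entities(example_text), copied from the cell above.
entities = [('Apple Inc.', 'ORG'), ('iPhone', 'ORG'), ('this September', 'DATE')]

by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(f"{label}: {texts}")
```

Note that the model labels `iPhone` as `ORG` here, so grouped results are worth spot-checking before they are relied on.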

## Step 7: Sentiment Analysis

We'll perform sentiment analysis using both TextBlob and VADER.


In [11]:
def sentiment_textblob(text):
    """
    Uses TextBlob to analyze the sentiment of the text.
    Returns polarity and subjectivity.
    """
    blob = TextBlob(text)
    return blob.sentiment.polarity, blob.sentiment.subjectivity

def sentiment_vader(text):
    """
    Uses nltk's VADER to analyze sentiment.
    Returns a dictionary with sentiment scores.
    """
    return sia.polarity_scores(text)




In [12]:
example_text

'Apple Inc. is planning to launch a new iPhone this September.'

In [13]:
print("TextBlob Sentiment:", sentiment_textblob(example_text))
print("VADER Sentiment:", sentiment_vader(example_text))

TextBlob Sentiment: (0.13636363636363635, 0.45454545454545453)
VADER Sentiment: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


### Explanation:
- **TextBlob** returns sentiment polarity ([-1, 1]) and subjectivity ([0, 1]).  
- **VADER** is tuned primarily for social media text. It returns the proportions of the text that are positive, negative, and neutral, plus a normalized compound score in [-1, 1].
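
The compound score is often turned into a coarse label. A sketch using the commonly cited ±0.05 cutoffs from the VADER project's documentation; the helper `label_from_compound` is hypothetical, and the threshold is an assumption that may need tuning for our data.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label (hypothetical helper)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores taken from the two VADER outputs in this notebook.
for score in (0.0, 0.5719):
    print(score, label_from_compound(score))
```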


## Step 8: Full NLP Pipeline Function



In [14]:
def run_pipeline_with_preprocessing(text):
    """
    Runs the full NLP pipeline after preprocessing the input text.
    """
    print("----- Raw Input -----")
    print(text)
    
    preprocessed_text = nltk_preprocess_text(text, lower=True, remove_numbers=True, remove_punct=True, 
                           remove_stopwords=False, use_stemming=False,
                           extra_clean=True)
    
    print("\n----- Preprocessed Text -----")
    print(preprocessed_text)
    
    print("\n Tokens & POS Tags:")
    tokens_pos = tokenize_and_tag(preprocessed_text)
    print(tokens_pos)
    
    print("\n Dependency Parse:")
    dependencies = dependency_parse(preprocessed_text)
    print(dependencies)
    
    print("\n Named Entities:")
    entities = named_entities(preprocessed_text)
    print(entities)
    
    print("\n Sentiment Analysis (TextBlob):")
    polarity, subjectivity = sentiment_textblob(preprocessed_text)
    print(f"Polarity: {polarity}, Subjectivity: {subjectivity}")
    
    print("\n Sentiment Analysis (VADER):")
    vader_scores = sentiment_vader(preprocessed_text)
    print(vader_scores)


input_text = 'The weather in Bengaluru today is perfect for a walk in the park'  # positive
# input_text = 'VADER is not smart, handsome, nor funny.'  # negative
run_pipeline_with_preprocessing(input_text)


----- Raw Input -----
The weather in Bengaluru today is perfect for a walk in the park

----- Preprocessed Text -----
the weather in bengaluru today is perfect for a walk in the park

 Tokens & POS Tags:
[('the', 'DET'), ('weather', 'NOUN'), ('in', 'ADP'), ('bengaluru', 'NOUN'), ('today', 'NOUN'), ('is', 'AUX'), ('perfect', 'ADJ'), ('for', 'ADP'), ('a', 'DET'), ('walk', 'NOUN'), ('in', 'ADP'), ('the', 'DET'), ('park', 'NOUN')]

 Dependency Parse:
[('the', 'det', 'weather'), ('weather', 'nsubj', 'is'), ('in', 'prep', 'weather'), ('bengaluru', 'pobj', 'in'), ('today', 'npadvmod', 'is'), ('is', 'ROOT', 'is'), ('perfect', 'acomp', 'is'), ('for', 'prep', 'is'), ('a', 'det', 'walk'), ('walk', 'pobj', 'for'), ('in', 'prep', 'walk'), ('the', 'det', 'park'), ('park', 'pobj', 'in')]

 Named Entities:
[('today', 'DATE')]

 Sentiment Analysis (TextBlob):
Polarity: 1.0, Subjectivity: 1.0

 Sentiment Analysis (VADER):
{'neg': 0.0, 'neu': 0.748, 'pos': 0.252, 'compound': 0.5719}
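
The function above prints its results, which makes them hard to reuse in other code. One possible variation, sketched under assumptions, collects each step's output in a dictionary instead. `run_steps` and the stand-in steps are hypothetical; in the notebook the steps would be `tokenize_and_tag`, `dependency_parse`, `named_entities`, `sentiment_textblob`, and `sentiment_vader`.

```python
def run_steps(text, steps):
    """Apply each named callable to `text` and collect the results (hypothetical helper)."""
    return {name: step(text) for name, step in steps.items()}

# Stand-in steps so the sketch runs without spaCy or NLTK installed.
demo_steps = {
    "tokens": str.split,
    "length": len,
}

demo_text = "the weather in bengaluru today is perfect"
results = run_steps(demo_text, demo_steps)
print(results["tokens"])
```

One thing to watch in the output above: `bengaluru` does not appear among the named entities after lowercasing. spaCy's models rely partly on capitalization, and VADER uses capitalization and punctuation as intensity cues, so passing the raw text to those steps may give better results than passing the preprocessed text.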
