<a href="https://colab.research.google.com/github/Krishishah7/nlp-pipeline/blob/main/nlp_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini NLP Pipeline

This notebook demonstrates an end-to-end NLP pipeline that processes raw text through preprocessing, POS tagging, word frequency analysis, and sentiment analysis.


In [1]:
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('vader_lexicon')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [2]:
text = """
Natural Language Processing is very useful for analyzing text data.
However, it can sometimes be challenging and complex.
"""
print("Original Text:\n", text)


Original Text:
 
Natural Language Processing is very useful for analyzing text data.
However, it can sometimes be challenging and complex.



## Text Preprocessing

In [3]:
# Lowercase
text = text.lower()

# Remove punctuation
text = text.translate(str.maketrans('', '', string.punctuation))

# Tokenization
tokens = word_tokenize(text)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]

print("Preprocessed Tokens:\n", filtered_tokens)


Preprocessed Tokens:
 ['natural', 'language', 'processing', 'useful', 'analyzing', 'text', 'data', 'however', 'sometimes', 'challenging', 'complex']
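
The four steps above can be bundled into one reusable function. The following is a minimal sketch, not the notebook's own code: the name `preprocess`, its parameters, and the tiny `DEMO_STOP_WORDS` set are assumptions made for illustration, and `str.split` stands in for `word_tokenize` so that the sketch runs without any NLTK data.

```python
import string

# Hypothetical, deliberately tiny stopword set for this demo only;
# the notebook uses NLTK's much larger English list.
DEMO_STOP_WORDS = {"is", "very", "for", "it", "can", "be", "and"}


def preprocess(text, stop_words=DEMO_STOP_WORDS, tokenize=str.split):
    """Lowercase, strip punctuation, tokenize, and drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = tokenize(text)
    return [word for word in tokens if word not in stop_words]


sample = (
    "Natural Language Processing is very useful for analyzing text data.\n"
    "However, it can sometimes be challenging and complex."
)
tokens = preprocess(sample)
print(tokens)
```

To mirror the notebook's behavior, pass `tokenize=word_tokenize` and `stop_words=set(stopwords.words('english'))` instead of the defaults.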


## POS Tagging

In [4]:
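# Tag each token with its part of speech (Penn Treebank tags).
# Note: the input is lowercased and stopword-filtered, so the tagger
# sees less context than it would in the original sentences.
pos_tags = nltk.pos_tag(filtered_tokens)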

print("\nPOS Tagging:")
for word, tag in pos_tags:
    print(word, "->", tag)



POS Tagging:
natural -> JJ
language -> NN
processing -> NN
useful -> JJ
analyzing -> VBG
text -> NN
data -> NNS
however -> RB
sometimes -> RB
challenging -> VBG
complex -> JJ
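
Grouping the tagged words by tag can make the result easier to scan. The sketch below is an illustration only: it hard-codes the (word, tag) pairs from the output above so that it runs without NLTK, and the helper name `group_by_tag` is an assumption, not part of the notebook.

```python
from collections import defaultdict

# (word, tag) pairs copied from the POS tagging output above.
pos_tags = [
    ("natural", "JJ"), ("language", "NN"), ("processing", "NN"),
    ("useful", "JJ"), ("analyzing", "VBG"), ("text", "NN"),
    ("data", "NNS"), ("however", "RB"), ("sometimes", "RB"),
    ("challenging", "VBG"), ("complex", "JJ"),
]


def group_by_tag(tagged):
    """Collect words under their POS tag, preserving input order."""
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[tag].append(word)
    return dict(groups)


words_by_tag = group_by_tag(pos_tags)
for tag, words in words_by_tag.items():
    print(tag, "->", words)
```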


## Word Frequency Analysis

In [5]:
# Count how often each preprocessed token occurs
freq_dist = FreqDist(filtered_tokens)

print("\nWord Frequency:")
for word, freq in freq_dist.items():
    print(word, ":", freq)



Word Frequency:
natural : 1
language : 1
processing : 1
useful : 1
analyzing : 1
text : 1
data : 1
however : 1
sometimes : 1
challenging : 1
complex : 1
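
Every token in this sample occurs once, so the counts above do not rank anything. On longer texts, the most frequent terms are usually what matters. `FreqDist` is a subclass of `collections.Counter`, so `most_common` is available on `freq_dist` as well. The sketch below uses a plain `Counter` and a hypothetical token list with repeats (not taken from the notebook) so that it runs without NLTK.

```python
from collections import Counter

# Hypothetical token list with repeats, chosen so the ranking is meaningful.
demo_tokens = ["text", "data", "text", "nlp", "text", "data"]

counts = Counter(demo_tokens)

# The two most frequent tokens, highest count first.
top_two = counts.most_common(2)
print(top_two)
```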


## Sentiment Analysis

In [6]:
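# VADER-based analyzer (uses the 'vader_lexicon' resource downloaded above)
sia = SentimentIntensityAnalyzer()
# Note: `text` is already lowercased and stripped of punctuation at this point.
# VADER treats capitalization and punctuation as intensity cues, so scoring
# the original raw text may give different values.
sentiment_scores = sia.polarity_scores(text)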

print("\nSentiment Scores:")
print(sentiment_scores)



Sentiment Scores:
{'neg': 0.0, 'neu': 0.673, 'pos': 0.327, 'compound': 0.7425}


In [7]:
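# Map the compound score (range -1 to 1) to a label using the
# conventional VADER cutoffs of +0.05 and -0.05
compound = sentiment_scores['compound']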

if compound >= 0.05:
    sentiment = "Positive"
elif compound <= -0.05:
    sentiment = "Negative"
else:
    sentiment = "Neutral"

print("Overall Sentiment:", sentiment)


Overall Sentiment: Positive
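
The threshold logic above can be wrapped in a small function so it can be reused across many texts. This is a minimal sketch: the name `label_from_compound` and the `threshold` parameter are assumptions for illustration, and the default of 0.05 simply mirrors the cutoff used in the cell above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound sentiment score to a coarse label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Compound score taken from the sentiment output above.
print("Overall Sentiment:", label_from_compound(0.7425))
```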


## Final Pipeline Summary

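- Raw text was lowercased, stripped of punctuation, tokenized, and filtered for English stopwords
- POS tagging labeled each remaining token with its grammatical category (for example, `JJ` for adjectives and `NN` for nouns)
- Word frequency analysis counted each remaining token; in this short sample every token occurs exactly once, so no term stands out
- Sentiment analysis with VADER classified the text as Positive (compound score 0.7425)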
