
## NLP Tasks with Examples

## Setup and Installation

In [1]:
# Install necessary libraries
!pip install nltk spacy transformers

# Download necessary resources for NLTK
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Download Spacy model
!python -m spacy download en_core_web_sm



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## Tokenization


In [2]:
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Natural Language Processing is a fascinating field of AI. Let's explore it!"

# Word Tokenization
word_tokens = word_tokenize(text)
print("Word Tokens:", word_tokens)

# Sentence Tokenization
sentence_tokens = sent_tokenize(text)
print("\nSentence Tokens:", sentence_tokens)

Word Tokens: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'Let', "'s", 'explore', 'it', '!']

Sentence Tokens: ['Natural Language Processing is a fascinating field of AI.', "Let's explore it!"]
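
To make the splitting behaviour visible, here is a minimal sketch of a regex-based word tokenizer. It is not how `word_tokenize` works internally; the pattern and the name `simple_word_tokenize` are assumptions made for illustration. It happens to reproduce the word tokens shown above for this sentence, but it will differ from NLTK on other inputs (for example, it splits "don't" into "don" and "'t").

```python
import re

# Hypothetical pattern: a clitic such as 's, a run of word characters,
# or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"'\w+|\w+|[^\w\s]")

def simple_word_tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

text = "Natural Language Processing is a fascinating field of AI. Let's explore it!"
tokens = simple_word_tokenize(text)
print(tokens)
```

Punctuation and the clitic "'s" come out as separate tokens, which is why later steps see "." and "!" as items in the token list.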


## Stopword Removal

In [3]:
from nltk.corpus import stopwords

# Get English stopwords
stop_words = set(stopwords.words('english'))

filtered_sentence = [word for word in word_tokens if word.lower() not in stop_words]
print("Filtered Sentence (Without Stopwords):", filtered_sentence)

Filtered Sentence (Without Stopwords): ['Natural', 'Language', 'Processing', 'fascinating', 'field', 'AI', '.', 'Let', "'s", 'explore', '!']
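
The filtered list above still contains punctuation and the clitic "'s", because the stopword check only looks at words. A possible follow-up step, sketched here with a small hand-picked stopword set (an assumption; NLTK's English list is much longer), also drops every token that is not purely alphabetic.

```python
# Tokens copied from the word-tokenization output above.
word_tokens = ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating',
               'field', 'of', 'AI', '.', 'Let', "'s", 'explore', 'it', '!']

# Hypothetical stopword subset, just large enough for this sentence.
SMALL_STOP_WORDS = {"is", "a", "of", "it"}

def remove_stopwords_and_punctuation(tokens, stop_words):
    """Keep alphabetic tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

content_words = remove_stopwords_and_punctuation(word_tokens, SMALL_STOP_WORDS)
print(content_words)
```

Whether to drop punctuation depends on the downstream task; sentence-level models often need it, while bag-of-words features usually do not.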


## Stemming

In [4]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

words = ["running", "going", "played", "worked"]

# Stemming
stemmed_words = [ps.stem(word) for word in words]
print("Stemmed Words:", stemmed_words)

Stemmed Words: ['run', 'go', 'play', 'work']
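
For intuition about what a stemmer does, the sketch below strips two common suffixes by hand. It is a deliberately naive illustration, not the Porter algorithm, and the function name `naive_stem` is an assumption. It agrees with the output above on these four words but will mis-stem many others (for example, "falling" becomes "fal").

```python
def naive_stem(word):
    """Strip a trailing "ing" or "ed", then undo consonant doubling."""
    for suffix in ("ing", "ed"):
        # Require at least two remaining characters so short words
        # such as "sing" or "red" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[: -len(suffix)]
            # Collapse a doubled final consonant ("runn" -> "run").
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word

words = ["running", "going", "played", "worked"]
naive_stems = [naive_stem(word) for word in words]
print(naive_stems)
```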


## Part of Speech (POS) Tagging

In [5]:
from nltk import pos_tag

# POS Tagging
pos_tags = pos_tag(word_tokens)
print("POS Tags:", pos_tags)

POS Tags: [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('fascinating', 'JJ'), ('field', 'NN'), ('of', 'IN'), ('AI', 'NNP'), ('.', '.'), ('Let', 'NNP'), ("'s", 'POS'), ('explore', 'VB'), ('it', 'PRP'), ('!', '.')]


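NNP: Proper noun, singular (e.g., "Language", "Processing", "AI"). The tagger also labels "Let" as NNP, which is a tagging error: it is a verb here.

VBZ: Verb, 3rd person singular present (e.g., "is")

JJ: Adjective (e.g., "Natural", "fascinating")

NN: Noun, singular (e.g., "field")

IN: Preposition or subordinating conjunction (e.g., "of")

PRP: Personal pronoun (e.g., "it")

VB: Base form of a verb (e.g., "explore")

DT: Determiner (e.g., "a")

POS: Possessive ending (e.g., "'s"). In "Let's" the "'s" is a contraction of "us", so this tag is also a tagging error.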
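
The tag list above can be turned into a small lookup table and used to group the tagged tokens. This is a sketch: the dictionary `TAG_DESCRIPTIONS` and the helper `group_by_tag` are names introduced here for illustration, and the tagged tokens are copied from the output above rather than recomputed.

```python
from collections import defaultdict

# Short descriptions for the tags discussed above.
TAG_DESCRIPTIONS = {
    "NNP": "proper noun", "VBZ": "verb, 3rd person singular present",
    "JJ": "adjective", "NN": "noun", "IN": "preposition",
    "PRP": "pronoun", "VB": "verb, base form", "DT": "determiner",
    "POS": "possessive ending",
}

# Tagged tokens copied from the POS tagging output above.
pos_tags = [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'),
            ('is', 'VBZ'), ('a', 'DT'), ('fascinating', 'JJ'), ('field', 'NN'),
            ('of', 'IN'), ('AI', 'NNP'), ('.', '.'), ('Let', 'NNP'),
            ("'s", 'POS'), ('explore', 'VB'), ('it', 'PRP'), ('!', '.')]

def group_by_tag(tagged_tokens):
    """Map each tag to the list of tokens that received it, in order."""
    groups = defaultdict(list)
    for token, tag in tagged_tokens:
        groups[tag].append(token)
    return dict(groups)

groups = group_by_tag(pos_tags)
for tag, tokens in groups.items():
    description = TAG_DESCRIPTIONS.get(tag, "not described above")
    print(f"{tag} ({description}): {tokens}")
```

Grouping by tag makes questionable labels easy to spot, such as "Let" appearing among the proper nouns.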


## Named Entity Recognition (NER)

In [11]:
import spacy

# Load English model
nlp = spacy.load("en_core_web_sm")

# Applying NER
doc = nlp('Amazon is one of the largest companies in the world.')
for ent in doc.ents:
    print(ent.text, ent.label_)

## Sentiment Analysis

In [16]:
from transformers import pipeline

# Initialize sentiment analysis model
classifier = pipeline("sentiment-analysis")

# Example text for sentiment analysis
sentiment = classifier("I love working with NLP!")
print(sentiment)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


[{'label': 'POSITIVE', 'score': 0.999126136302948}]


In [17]:
from transformers import pipeline

# Initialize sentiment analysis model
classifier = pipeline("sentiment-analysis")

# Example text for sentiment analysis
sentiment = classifier("I hate working with NLP!")
print(sentiment)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9991481304168701}]


In [20]:
from transformers import pipeline

# Initialize sentiment analysis model
classifier = pipeline("sentiment-analysis")

# Example text for sentiment analysis
sentiment = classifier("he is lovely man and bad working with NLP!")
print(sentiment)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9613810181617737}]
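
The third sentence mixes positive and negative wording, and its score is noticeably lower than the first two. One way to surface such cases is to compare the score against a cutoff. The sketch below works on the results copied from the outputs above; the helper `describe_prediction` and the 0.99 threshold are assumptions chosen for illustration, not recommended values.

```python
# Results copied from the three classifier outputs above, keyed by input sentence.
results = {
    "I love working with NLP!": [{'label': 'POSITIVE', 'score': 0.999126136302948}],
    "I hate working with NLP!": [{'label': 'NEGATIVE', 'score': 0.9991481304168701}],
    "he is lovely man and bad working with NLP!": [{'label': 'POSITIVE', 'score': 0.9613810181617737}],
}

def describe_prediction(prediction, threshold=0.99):
    """Return (label, is_confident) for the top entry of one pipeline result."""
    top = prediction[0]
    return top["label"], top["score"] >= threshold

for sentence, prediction in results.items():
    label, is_confident = describe_prediction(prediction)
    note = "" if is_confident else " (below threshold, worth a manual check)"
    print(f"{label}{note}: {sentence}")
```

A single label hides mixed sentiment, so a lower score is only a rough hint that the input deserves a closer look.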


## Language Modeling (Text Generation)

In [22]:
# Initialize text-generation pipeline (reuses the `pipeline` import from the sentiment cells)
generator = pipeline("text-generation")

# Generate text
generated_text = generator("Natural language processing is", max_length=1000)
print("Generated Text:", generated_text[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text: Natural language processing is a long, complicated process where you have to process multiple parts and do them differently to separate the words and sentences. And even if all the words are translated correctly, sometimes some words you see are in fact words from which you do not hear them.

And then sometimes your interpreter is trying to translate, like this example, "I'm so really cool, huh?" and then you've just seen one of the different sentences. Or if there are two different sentences, or if you've seen two different words, you've actually seen the two different words. It's not something that's easy to do because it just takes a lot of imagination.

What are the problems, how do you do this work in languages you have no experience with?

We use the concept of lexical ambiguity in terms of what we consider syntactic specificity. All languages are at their own kind of ambiguity. I'll start off with a language you know, that in Arabic we refer to as "al Qis" or, in

## Text Summarization

In [24]:
# Initialize summarization pipeline (reuses the `pipeline` import from the sentiment cells)
summarizer = pipeline("summarization")

# Example text for summarization
long_text = """
Natural language processing is a field of artificial intelligence that deals with the interaction between computers and humans through language.
It has many applications such as sentiment analysis, translation, and summarization.
It is a rapidly growing field with many advancements being made every day.
"""

# Generate summary
summary = summarizer(long_text, max_length=50, min_length=25)
print("Summary:", summary[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Summary:  Natural language processing is a field of artificial intelligence that deals with the interaction between computers and humans through language . It has many applications such as sentiment analysis, translation, and summarization .
