# **Assignment 8 : Natural Language Processing (NLP)**

# **1. Text Preprocessing:**
**• What is tokenization in NLP? Explain its importance.**

`Tokenization` is the process of breaking down a sequence of text into smaller units, known as tokens. These tokens can be words, sentences, or subwords.

**Importance of Tokenization:**

1. **Foundation for Further Analysis:** Tokenization is the first step in most NLP tasks (such as text classification, sentiment analysis, machine translation, etc.), making it essential for enabling further text processing.
2. **Simplifies Text Representation:** By dividing the text into smaller units, tokenization allows each token to be analyzed for its meaning, function, or significance in a specific context.
3. **Text Preprocessing:** It allows for efficient filtering, stemming, lemmatization, and feature extraction, which are necessary for effective machine learning algorithms.
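
To make the idea concrete, here is a minimal regex-based sketch of word and sentence tokenization. It is only an illustration of what a tokenizer produces, not how NLTK's tokenizers are implemented; the function names `simple_word_tokenize` and `simple_sent_tokenize` and the sample text are assumptions made for this example.

```python
import re

def simple_word_tokenize(text):
    # A word is a run of letters/digits/underscores;
    # any other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split after '.', '!' or '?' when followed by whitespace.
    # Abbreviations such as "Dr." would be split incorrectly by this rule.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Tokenization comes first. Everything else builds on it!"
print(simple_word_tokenize(sample))
print(simple_sent_tokenize(sample))
```

Rules this simple break on abbreviations, decimals, and contractions, which is why trained tokenizers such as NLTK's are normally preferred.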




**• Perform word and sentence tokenization on the following text: "Data science is an interdisciplinary field that uses scientific methods processes, algorithms, and systems to extract knowledge from data."**

In [1]:
# Install necessary libraries
!pip install nltk spacy transformers



In [2]:
# Download necessary resources for NLTK
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [3]:
# Download Spacy model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [4]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [5]:
text = "Data science is an interdisciplinary field that uses scientific methods processes, algorithms, and systems to extract knowledge from data."

In [6]:
# Word Tokenization
word_tokens = word_tokenize(text)
print("Word Tokens:", word_tokens)

Word Tokens: ['Data', 'science', 'is', 'an', 'interdisciplinary', 'field', 'that', 'uses', 'scientific', 'methods', 'processes', ',', 'algorithms', ',', 'and', 'systems', 'to', 'extract', 'knowledge', 'from', 'data', '.']


In [7]:
# Sentence Tokenization
sentence_tokens = sent_tokenize(text)
print("\nSentence Tokens:", sentence_tokens)


Sentence Tokens: ['Data science is an interdisciplinary field that uses scientific methods processes, algorithms, and systems to extract knowledge from data.']


**• Explain stopwords in NLP. Why is it important to remove them?**

`Stopwords` are common words (such as "the", "is", "in", "and", "or", etc.) that are typically removed from text data during preprocessing because they carry little meaning on their own and contribute little to the overall content of a sentence.

**Importance of removing stopwords:**

**Reduce Noise:** Removing stopwords helps focus on the more meaningful, content-carrying words in the text.

**Improve Efficiency:** Processing large datasets becomes more efficient when stopwords are eliminated, as fewer tokens need to be analyzed.

**Better Performance:** In many cases, models perform better when stopwords are removed because the model isn't distracted by words that don't contribute to the meaning of the text.
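
Stopword removal is not always harmless, though. The sketch below uses a small hypothetical stopword set (`STOP_WORDS` is an assumption for this example; real lists are much longer, and NLTK's English list also contains "not") to show how dropping a negation word can make two opposite reviews look identical.

```python
# Hypothetical mini stopword list, for illustration only.
STOP_WORDS = {"the", "is", "was", "and", "not", "a"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    # Keep only tokens whose lowercase form is not a stopword.
    return [t for t in tokens if t.lower() not in stop_words]

review_a = "the service was good".split()
review_b = "the service was not good".split()

print(remove_stopwords(review_a))
print(remove_stopwords(review_b))
```

Both reviews reduce to the same tokens, so for sentiment-style tasks it can be worth keeping negation words in the text.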



**• Perform stopword removal on the tokenized words from the text above.**

In [8]:
from nltk.corpus import stopwords

In [9]:
# Get English stopwords
stop_words = set(stopwords.words('english'))

In [10]:
filtered_sentence = [word for word in word_tokens if word.lower() not in stop_words]
print("Filtered Sentence (Without Stopwords):", filtered_sentence)

Filtered Sentence (Without Stopwords): ['Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', 'processes', ',', 'algorithms', ',', 'systems', 'extract', 'knowledge', 'data', '.']



**• What is stemming? Why is it useful in text preprocessing?**

`Stemming` is the process of reducing words to their root or base form (also called the "stem"), typically by stripping suffixes with heuristic rules. The resulting stem does not have to be a dictionary word.

**Usefulness:**

1. **Reduces Complexity:** Stemming helps normalize words, reducing the number of unique tokens and thus simplifying the model's task.

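2. **Improves Matching:** It maps inflected forms of a word to the same stem (e.g., "running" and "runs" both become "run"), allowing the model to generalize better. Irregular forms such as "ran" are not covered by suffix stripping and need lemmatization instead.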

3. **Text Preprocessing:** It is particularly useful in search engines, document classification, and sentiment analysis, where various forms of a word need to be treated as equivalent.
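
As a rough illustration of the suffix-stripping idea, here is a deliberately naive stemmer. It is not the Porter algorithm (which applies several ordered rule phases); the function name `naive_stem`, the suffix list, and the minimum stem length of 3 are assumptions chosen for this sketch.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    # Strip the first matching suffix, as long as at least
    # three characters of the word remain afterwards.
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connect", "connected", "connecting", "connects"]:
    print(w, ":", naive_stem(w))
```

All four forms collapse to the same stem, which is the vocabulary reduction described above. A rule set this small also makes mistakes that real stemmers handle better, for example words that already end in "ss".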



**• Apply stemming on the following words: ["improving", "processed", "arguing", "analysis"]**

In [11]:
from nltk.stem import PorterStemmer

In [12]:
ps = PorterStemmer()

In [13]:
words = ['improving', 'processed', 'arguing', 'analysis']


In [19]:
# Stemming and printing:
for word in words:
    stemmed_word = ps.stem(word)
    print(f"{word} : {stemmed_word}")

improving : improv
processed : process
arguing : argu
analysis : analysi


# **2. Part-of-Speech (POS) Tagging:**
**• What is part-of-speech (POS) tagging? How does it help in understanding text?**

`Part-of-Speech (POS) tagging` is the process of assigning specific grammatical categories or "tags" to each word in a sentence based on its role or function in that sentence. These categories can include nouns, verbs, adjectives, adverbs, prepositions, and more.

For example, the word "run" can be tagged as a verb (action) or a noun (an activity), depending on the context.




**• Perform POS tagging on the sentence: "Machine learning models require large datasets for accurate predictions."**

In [15]:
# Download the required resource
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [16]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize

In [17]:
# Tokenize the sentence
word_tokens = word_tokenize("Machine learning models require large datasets for accurate predictions.")

In [18]:
# POS Tagging
pos_tags = pos_tag(word_tokens)
print("POS Tags:", pos_tags)

POS Tags: [('Machine', 'NN'), ('learning', 'NN'), ('models', 'NNS'), ('require', 'VBP'), ('large', 'JJ'), ('datasets', 'NNS'), ('for', 'IN'), ('accurate', 'JJ'), ('predictions', 'NNS'), ('.', '.')]


Explanation of POS Tags:

1. NN: Noun, singular or mass (e.g., "Machine", "learning")
2. NNS: Noun, plural (e.g., "models", "datasets", "predictions")
3. VBP: Verb, non-3rd-person singular present (e.g., "require")
4. JJ: Adjective (e.g., "large", "accurate")
5. IN: Preposition or subordinating conjunction (e.g., "for")
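
Penn Treebank tags that belong to the same word class share a two-letter prefix (NN\*, VB\*, JJ\*), so the tagger output can be grouped into coarse categories with a few lines of Python. The sketch below reuses the tagged output printed above; the `COARSE` mapping and the helper name `coarse_category` are assumptions made for this example.

```python
from collections import defaultdict

# Tagged output copied from the POS tagging cell above.
pos_tags = [('Machine', 'NN'), ('learning', 'NN'), ('models', 'NNS'),
            ('require', 'VBP'), ('large', 'JJ'), ('datasets', 'NNS'),
            ('for', 'IN'), ('accurate', 'JJ'), ('predictions', 'NNS'),
            ('.', '.')]

# Hypothetical mapping from tag prefix to a coarse word class.
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective"}

def coarse_category(tag):
    # Use the first two characters of the tag as the lookup key.
    return COARSE.get(tag[:2], "other")

grouped = defaultdict(list)
for word, tag in pos_tags:
    grouped[coarse_category(tag)].append(word)

for category, words in grouped.items():
    print(category, ":", words)
```

Grouping like this is one simple way POS tags help with understanding text: the nouns give candidate topics, the verbs the actions, and the adjectives the descriptions.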

**• Explain the significance of POS tags like NN, VB, and JJ in text analysis.**

**1. NN (noun):**
  
  Significance: Nouns name the people, things, and concepts a text is about, so they drive tasks such as keyword extraction, topic modeling, and named entity recognition.

**2. VB (verb):**

  Significance: Verbs express actions and states. Identifying them, together with tense variants such as VBD (past tense) and VBP (present tense), is important for syntactic parsing and sentiment analysis, where the time of the action (past, present, or future) affects the meaning of a sentence or the sentiment conveyed.

**3. JJ (adjective):**

  Significance: Adjectives describe nouns and often carry opinion (e.g., "accurate", "terrible"), which makes them key features in sentiment analysis, opinion mining, and document classification.

# **3. Named Entity Recognition (NER):**
**• What is Named Entity Recognition (NER)? List some common types of entities identified by NER models.**

`Named Entity Recognition (NER)` is an NLP technique used to identify and classify named entities in text into predefined categories such as names of people, organizations, locations, dates, and more.  

**Common types of entities identified by NER models:**
1. **Person (PER):** Names of individuals (e.g., "Elon Musk").  
2. **Organization (ORG):** Companies, institutions, etc. (e.g., "Google").  
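3. **Location (LOC):** Physical locations. In label schemes that also define GPE (such as the one used by spaCy's English models), LOC covers non-political places like mountain ranges and bodies of water (e.g., "Mount Everest").  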
4. **Date/Time (DATE/TIME):** Specific dates and times (e.g., "January 30, 2025").  
5. **Money (MONEY):** Currency amounts (e.g., "$100").  
6. **Percent (PERCENT):** Percentages (e.g., "50%").  
7. **GPE (Geopolitical Entity):** Countries, states, cities (e.g., "Nepal").  
8. **Product (PRODUCT):** Named products (e.g., "iPhone 15").  






**• Perform NER on the following sentence: "Google is planning to open a new office in New York next year."**

In [20]:
import spacy

In [21]:
# Load English model
nlp = spacy.load("en_core_web_sm")

In [24]:
# Applying NER - Google is planning to open a new office in New York next year.
doc = nlp('Google is planning to open a new office in New York next year.')
for ent in doc.ents:
    print(ent.text, ":", ent.label_)

Google : ORG
New York : GPE
next year : DATE


**• Identify the entities and explain their labels.**

Identified entities and their explanations:

1. **Google: ORG (Organization):**  
   - Represents a company, institution, or organization.  
   - Example: Google, Microsoft, OpenAI.  

2. **New York: GPE (Geopolitical Entity):**
   - Refers to geographical locations such as cities, states, or countries.  
   - Example: New York, Nepal, France.  

3. **next year: DATE (Date/Time):**
   - Represents a time-related entity, including specific dates, years, or relative time expressions.  
   - Example: 2025, January 30, next month.  
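
An entity is essentially a span of the text plus a label. The following sketch takes the three entities printed above and marks them inline in the sentence; it uses plain string search rather than spaCy, and the helper name `annotate` and the `[text](LABEL)` output format are assumptions for this example.

```python
sentence = "Google is planning to open a new office in New York next year."

# (entity text, label) pairs copied from the NER output above.
entities = [("Google", "ORG"), ("New York", "GPE"), ("next year", "DATE")]

def annotate(text, ents):
    # Walk through the entities in order and wrap each one with its label.
    pieces, cursor = [], 0
    for ent_text, label in ents:
        start = text.find(ent_text, cursor)
        if start == -1:
            continue  # entity not found after the cursor; skip it
        pieces.append(text[cursor:start])
        pieces.append(f"[{ent_text}]({label})")
        cursor = start + len(ent_text)
    pieces.append(text[cursor:])
    return "".join(pieces)

print(annotate(sentence, entities))
```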

# **4. Sentiment Analysis:**
**• What is sentiment analysis? How is it useful in understanding text data?**

`Sentiment Analysis` is an NLP technique used to determine the emotional tone of a given text. It classifies text into categories such as positive, negative, or neutral based on the sentiment expressed.  

**Usefulness in understanding text data:**
1. **Customer Feedback Analysis:** Helps businesses analyze reviews and improve services.  
2. **Brand Monitoring:** Tracks public perception of a company or product.  
3. **Market Research:** Identifies trends and consumer opinions.  
4. **Social Media Analysis:** Detects public sentiment towards events, policies, or individuals.  
5. **Automated Support Systems:** Enhances chatbot responses by understanding user emotions.  




**• Perform sentiment analysis on the following text: "The project outcome was highly satisfying, and the team did an excellent job." "The service was terrible, and the support was unresponsive."**

In [25]:
from transformers import pipeline

In [26]:
# Initialize sentiment analysis model
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cpu


In [27]:
# Example text for sentiment analysis
sentiment = classifier("The project outcome was highly satisfying, and the team did an excellent job.")
print(sentiment)

[{'label': 'POSITIVE', 'score': 0.9998728036880493}]


In [28]:
# Example text for sentiment analysis
sentiment = classifier("The service was terrible, and the support was unresponsive.")
print(sentiment)

[{'label': 'NEGATIVE', 'score': 0.9995108842849731}]


**• Explain the sentiment results and their interpretation.**
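*   The first statement conveys praise and satisfaction. The model labels it POSITIVE with a score of about 0.9999, meaning it is almost certain of that label.
*   The second statement reflects frustration and disappointment. The model labels it NEGATIVE with a score of about 0.9995.
*   The score is the model's confidence in the predicted label, not a measure of how intense the sentiment is.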
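
In practice, the label and score are often combined into a single decision. The sketch below works on the two result dictionaries printed above and treats low-confidence predictions as "uncertain"; the helper name `interpret` and the 0.9 threshold are assumptions for this example, not part of the `transformers` API.

```python
# Results copied from the sentiment analysis cells above.
results = [
    {"label": "POSITIVE", "score": 0.9998728036880493},
    {"label": "NEGATIVE", "score": 0.9995108842849731},
]

def interpret(result, threshold=0.9):
    # Fall back to "uncertain" when the model's confidence is below the threshold.
    if result["score"] < threshold:
        return "uncertain"
    return result["label"].lower()

for r in results:
    print(interpret(r), round(r["score"], 4))
```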



# **5. Text Generation:**
**• What is text generation in NLP? Provide real-world applications of text generation models.**

`Text generation` in NLP is the process of generating coherent, meaningful text based on a given input or prompt. It uses models like GPT to predict the next word or sequence of words.

**Real-world applications:**
- **Chatbots/Assistants:** Customer support, virtual assistants.
- **Content Creation:** Automated writing for blogs, news articles.
- **Text Summarization:** Creating concise summaries of lengthy texts.
- **Creative Writing:** Generating stories, poetry, or dialogue.
- **Translation:** Converting text between languages.




**• Use a pre-trained language model to generate text for the prompt: "Artificial intelligence is revolutionizing"**

In [29]:
generator = pipeline("text-generation")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [30]:
# Generate text
generated_text = generator("Artificial intelligence is revolutionizing", max_length=100)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [31]:
print("Generated Text:", generated_text[0]['generated_text'])

Generated Text: Artificial intelligence is revolutionizing the way we live. To paraphrase the "paleo" philosophy, AI is coming into being in every form, and I am committed to making it the most powerful it can possibly be, until at the very most important milestone it can make an impact. I am talking about AI in terms of power—that is to say, how much, how widely, and how hard people can do something. This is why I am starting the Future of Our Species Program


**• Discuss how text generation models like GPT work.**

Text generation models like GPT (Generative Pretrained Transformer) work by predicting the next word or sequence of words based on a given input. Here's how they function:

1. **Pretraining:** The model is trained on large datasets of text to learn patterns, grammar, and context.
2. **Architecture:** GPT uses a Transformer architecture with attention mechanisms, which allow the model to focus on relevant words in a sentence and capture long-range dependencies.
3. **Input Processing:** When a prompt is given, the model tokenizes the input into smaller units (words or subwords).
4. **Prediction:** GPT predicts the next word by calculating probabilities for each possible next word, based on context learned during pretraining.
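5. **Output:** The model picks a next token, either the most probable one (greedy decoding) or one sampled from the predicted distribution, appends it to the input, and repeats the prediction step until it reaches a length limit or an end-of-text token.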
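
The predict-and-append loop can be illustrated with a toy bigram model. This is a heavily simplified stand-in, not a Transformer: it looks only at the previous word and uses raw counts instead of learned weights. The tiny corpus and the function names `next_token_probs` and `generate` are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical miniature "training corpus".
corpus = (
    "artificial intelligence is revolutionizing healthcare . "
    "artificial intelligence is revolutionizing healthcare . "
    "artificial intelligence is revolutionizing finance . "
    "artificial intelligence is improving education ."
)
tokens = corpus.split()

# "Pretraining": count which word follows which.
counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev):
    # Turn follower counts into a probability distribution.
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def generate(prompt, max_new_tokens=2):
    # Greedy decoding: always append the most probable next token.
    out = prompt.split()
    for _ in range(max_new_tokens):
        probs = next_token_probs(out[-1])
        if not probs:
            break
        out.append(max(probs, key=probs.get))
    return " ".join(out)

print(next_token_probs("revolutionizing"))
print(generate("artificial intelligence is revolutionizing"))
```

GPT follows the same loop but conditions on the whole preceding context through attention, and typically samples from the distribution rather than always taking the top token.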

# **6. Text Summarization:**
**• What is text summarization? Differentiate between extractive and abstractive summarization techniques.**

`Text Summarization` is the process of reducing a large piece of text to a shorter version while retaining key information and meaning.

**Types of Summarization:**

**1. Extractive Summarization:**
   - **Method:** Selects and extracts sentences directly from the source text to form a summary.
   - **Strength:** Keeps the original wording intact.
   - **Limitation:** May lack coherence and flow between selected sentences.
   
**2. Abstractive Summarization:**
   - **Method:** Generates new sentences that convey the core ideas of the original text.
   - **Strength:** Produces more coherent and fluent summaries.
   - **Limitation:** May introduce errors or lose important details due to rephrasing.
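
To show what "extractive" means in code, here is a minimal frequency-based sketch: each sentence is scored by the average frequency of its content words, and the top-scoring sentences are returned unchanged. This is one simple heuristic among many, not the method used by the pre-trained summarizer; the `STOPWORDS` set and the function name `extractive_summary` are assumptions for this example.

```python
import re
from collections import Counter

# Hypothetical mini stopword list, for illustration only.
STOPWORDS = {"the", "of", "in", "to", "and", "a", "is", "that", "are",
             "their", "from", "has", "like", "being"}

def content_words(text):
    # Lowercase alphabetic words with stopwords removed.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def extractive_summary(text, num_sentences=1):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Return the selected sentences in their original order.
    return [s for s in sentences if s in top]

doc = ("Artificial intelligence refers to the simulation of human intelligence "
       "in machines that are programmed to think like humans and mimic their actions. "
       "AI is being applied in a wide range of industries, from healthcare to finance, "
       "and has the potential to improve efficiency and decision-making.")
print(extractive_summary(doc, num_sentences=1))
```

Because the output sentences are copied verbatim, the wording stays intact, which is the strength and the limitation listed above. An abstractive model would generate new sentences instead.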




**• Summarize the following text into 2-3 sentences: "Artificial intelligence refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. AI is being applied in a wide range of industries, from healthcare to finance, and has the potential to improve efficiency and decision-making."**

In [32]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [33]:
# Example text for summarization
long_text = """
Artificial intelligence refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. AI is being applied in a wide range of industries, from healthcare to finance, and has the potential to improve efficiency and decision-making.
"""

In [34]:
# Generate summary
summary = summarizer(long_text, max_length=50, min_length=25)
print("Summary:", summary[0]['summary_text'])

Summary:  Artificial intelligence is being applied in a wide range of industries, from healthcare to finance . It has the potential to improve efficiency and decision-making .


**• Discuss the importance of summarization in NLP.**

**Importance of Summarization in NLP:**

**1. Information Overload:** Summarization helps condense large volumes of text, making it easier to digest and extract important information quickly.
  
**2. Efficiency:** Saves time by providing concise versions of lengthy documents, making it ideal for busy professionals or researchers.

**3. Search & Retrieval:** Enhances the relevance of search results by offering summarized content, improving user experience.

**4. Content Creation:** Automates the creation of summaries for news articles, reports, and social media, aiding content generation.

**5. Data Compression:** Reduces storage requirements by creating more compact representations of information, especially useful for large datasets.

**6. Improved Understanding:** Helps users quickly grasp the essence of texts, making it easier to identify key points, especially in technical or academic content.