Q1: NLP Preprocessing Pipeline
Write a Python function that performs basic NLP preprocessing on a sentence. The function should do the following steps:

1. Tokenize the sentence into individual words.
2. Remove common English stopwords (e.g., "the", "in", "are").
3. Apply stemming to reduce each word to its root form.

Use the sentence:
"NLP techniques are used in virtual assistants like Alexa and Siri."

The function should print:
• A list of all tokens
• The list after stop words are removed
• The final list after stemming

Expected Output:
Your program should print three outputs in order:
1. Original Tokens – All words and punctuation split from the sentence
2. Tokens Without Stopwords – Only meaningful words remain
3. Stemmed Words – Each word is reduced to its base/root form

Short Answer Questions:

1. What is the difference between stemming and lemmatization? Provide examples with the word “running.”
2. Why might removing stop words be useful in some NLP tasks, and when might it actually be harmful?


In [2]:
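import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Fetch the tokenizer model and stopword list (skipped if already present).
# Note: recent NLTK releases may also require nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')

def preprocess(sentence):
    """Tokenize a sentence, drop English stopwords, stem, and print each stage."""
    # Step 1: split into word and punctuation tokens
    tokens = word_tokenize(sentence)
    print(f"tokens: {tokens}")

    # Step 2: case-insensitive stopword filter; punctuation such as '.' is kept
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    print(f'tokens without stop words: {filtered_tokens}')

    # Step 3: Porter stemming (also lowercases each token by default)
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    print(f'words after stemming: {stemmed_tokens}')

sentence = "NLP techniques are used in virtual assistants like Alexa and Siri."

preprocess(sentence)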

tokens: ['NLP', 'techniques', 'are', 'used', 'in', 'virtual', 'assistants', 'like', 'Alexa', 'and', 'Siri', '.']
tokens without stop words: ['NLP', 'techniques', 'used', 'virtual', 'assistants', 'like', 'Alexa', 'Siri', '.']
words after stemming: ['nlp', 'techniqu', 'use', 'virtual', 'assist', 'like', 'alexa', 'siri', '.']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gadda\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gadda\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
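
The filtered list printed above still contains the '.' token, because punctuation is not in the stopword list. If punctuation should be dropped as well, one option is an extra check in the filter. The following is a minimal sketch, not part of the original solution: the token list is copied from the printed output, and the small stop word set and the helper name drop_stop_words_and_punct are assumptions made for illustration.

```python
# Tokens copied from the printed output above.
tokens = ['NLP', 'techniques', 'are', 'used', 'in', 'virtual',
          'assistants', 'like', 'Alexa', 'and', 'Siri', '.']

# Assumption: only the three stop words that the run above removed.
stop_words = {"are", "in", "and"}

def drop_stop_words_and_punct(tokens, stop_words):
    """Keep tokens that contain a letter or digit and are not stop words."""
    return [t for t in tokens
            if any(ch.isalnum() for ch in t) and t.lower() not in stop_words]

cleaned = drop_stop_words_and_punct(tokens, stop_words)
print(cleaned)
# ['NLP', 'techniques', 'used', 'virtual', 'assistants', 'like', 'Alexa', 'Siri']
```

Checking for at least one alphanumeric character, rather than requiring the whole token to be alphabetic, keeps tokens such as "don't" or "3D" while still removing pure punctuation.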


Short Answer Questions:

1. What is the difference between stemming and lemmatization? Provide examples with the word “running.”

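- Stemming reduces a word to a stem by stripping suffixes (and, in some stemmers, prefixes) with fixed rules. It does not consider context or check that the result is a dictionary word.
Example: the Porter stemmer maps "running" to "run". A cruder rule that only strips "-ing" would give "runn", and even Porter produces non-words, such as "techniqu" in the output above.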

- Lemmatization reduces a word to its lemma, the dictionary base form, using a vocabulary and morphological analysis. It takes the part of speech into account.
Example: "running" tagged as a verb becomes "run"; tagged as a noun (as in "running is fun") it stays "running".

- In short, lemmatization is generally more accurate because it returns a real dictionary word, but it is slower than stemming because it needs dictionary lookups and part-of-speech information.
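
The contrast in answer 1 can be sketched with a toy example. This is only an illustration under stated assumptions: naive_stem, stem_with_undoubling, toy_lemmatize and the TOY_LEMMAS table are hypothetical helpers written for this answer, not NLTK's Porter stemmer or WordNet lemmatizer. The point is that a stemmer applies string rules, while a lemmatizer looks up a dictionary form for a given part of speech.

```python
# Toy illustration only: these helpers are hypothetical and are NOT NLTK's algorithms.

def naive_stem(word):
    """Strip a trailing 'ing' with no further checks (rule-based, context-free)."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    return word

def stem_with_undoubling(word):
    """Strip 'ing', then collapse a doubled final consonant (e.g. 'nn' -> 'n')."""
    stem = naive_stem(word)
    if (stem != word.lower() and len(stem) >= 2
            and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz"):
        return stem[:-1]
    return stem

# Hand-made lemma table keyed by (word, part of speech); a real lemmatizer
# would use a full dictionary such as WordNet.
TOY_LEMMAS = {
    ("running", "verb"): "run",
    ("running", "noun"): "running",
    ("better", "adj"): "good",
}

def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the word itself if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())

print(naive_stem("running"))             # runn
print(stem_with_undoubling("running"))   # run
print(toy_lemmatize("running", "verb"))  # run
print(toy_lemmatize("running", "noun"))  # running
print(toy_lemmatize("better", "adj"))    # good
```

The last line shows something no suffix rule can do: mapping "better" to "good" requires vocabulary knowledge, which is why lemmatization costs more than stemming.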

2. Why might removing stop words be useful in some NLP tasks, and when might it actually be harmful?
- Removing stop words can be useful in tasks such as text summarization, where the focus is on the core content of the text.
- Removing stop words can be harmful in sentiment analysis. If "not" is removed from "not good", the negation is lost and the text may be misclassified as positive. NLTK's English stopword list includes "not", so this applies to the pipeline above.
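
The negation problem can be demonstrated with a short sketch. It assumes a tiny hand-written stop word set (TOY_STOP_WORDS) and a hypothetical helper remove_stop_words in place of a real stop word list and tokenizer, so it runs without any downloads.

```python
# Assumption: a tiny hand-written stop word set stands in for a real list.
TOY_STOP_WORDS = {"the", "is", "was", "a", "an", "not", "this"}

def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]

positive = remove_stop_words("The movie was good")
negative = remove_stop_words("The movie was not good")

print(positive)              # ['movie', 'good']
print(negative)              # ['movie', 'good']
print(positive == negative)  # True: the two reviews are now indistinguishable

# Keeping negation words in the text is one possible mitigation.
keep_negation = TOY_STOP_WORDS - {"not"}
print(remove_stop_words("The movie was not good", keep_negation))
# ['movie', 'not', 'good']
```

Removing negation words from the stop word set before filtering is a common workaround, though whether it is enough depends on the task and the model.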