# Theory Questions

**Question 1: Compare and contrast NLTK and spaCy in terms of features, ease of use, and performance.**

**Answer:**

| Criteria        | **NLTK**                                                   | **spaCy**                                                            |
| --------------- | ---------------------------------------------------------- | -------------------------------------------------------------------- |
| **Purpose**     | For teaching, research, and experimentation                | For production-ready, real-world NLP                                 |
| **Features**    | Many tools + large corpora; supports classical NLP methods | Modern pretrained models; efficient pipelines; supports transformers |
| **Ease of Use** | More steps, less streamlined                               | Very easy, plug-and-play pipelines                                   |
| **Performance** | Generally slower (mostly pure Python)                      | Generally faster (core written in Cython)                            |
| **Accuracy**    | Good for rule-based tasks                                  | Higher accuracy for NER, POS, parsing                                |
| **Best For**    | Learning NLP concepts                                      | Large-scale, industrial applications                                 |
                  

**Question 2: What is TextBlob and how does it simplify common NLP tasks like sentiment analysis and translation?**

**Answer:**
TextBlob is a simple and beginner-friendly Python library built on top of NLTK and Pattern. It provides an easy API for performing common NLP tasks with minimal code.

| Task                   | How TextBlob Helps                                                                                                                                                       |
| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Sentiment Analysis** | Provides a built-in sentiment method that directly returns **polarity** (positive/negative) and **subjectivity**. No need to train models or preprocess text manually. |
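| **Translation**        | Older releases exposed `translate()` and `detect_language()`, which called the Google Translate API and made translation a one-line operation. These methods were later deprecated and removed, so current versions need a separate translation library. |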
| **Other Tasks**        | Easily performs tokenization, POS tagging, noun phrase extraction, and spelling correction using very short code.                                                        |



**Question 3: Explain the role of Stanford NLP in academic and industry NLP projects.**

**Answer:** Stanford NLP (usually meaning the Stanford CoreNLP toolkit from the Stanford NLP Group) is a powerful NLP toolkit widely used in both academic research and industry applications because of its accuracy and comprehensive language processing tools.

*Role in Academic Projects*
* Provides state-of-the-art models for tasks like POS tagging, NER, parsing, sentiment analysis, and coreference resolution.
* Widely used for linguistics research, developing new algorithms, and benchmarking NLP models.
* Offers a rich, rule-based and statistical approach, making it suitable for experiments and theoretical studies.

*Role in Industry Projects*
* Used for production-level NLP tasks such as information extraction, customer analytics, document processing, and text understanding.
* Known for high accuracy and reliable results, especially in NER and dependency parsing.
* Provides server-based APIs, allowing integration into real-time systems in sectors like finance, healthcare, legal, and customer service.

**Question 4: Describe the architecture and functioning of a Recurrent Neural Network (RNN).**

**Answer:**
A Recurrent Neural Network (RNN) is a type of neural network designed to process sequential data such as text, speech, or time-series.

*Architecture*
* Consists of input layer, hidden (recurrent) layer, and output layer.
* The hidden layer has recurrent connections, meaning it receives the input for the current step together with its own hidden state from the previous step.
* This creates a loop that allows the network to store information across time steps.

*Functioning*
* At each time step, the RNN takes an input (e.g., a word or value).
* It combines this input with the previous hidden state to generate a new hidden state.
* This hidden state acts as the network’s memory, carrying information from earlier steps.
* The final output is produced from the hidden state, depending on the task (classification, prediction, etc.).
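The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a trainable model: the dimensions, the random weights, and the names `W_xh`, `W_hh`, and `b_h` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 4, 3, 5  # illustrative sizes

# Parameters shared across all time steps (randomly initialised here)
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden
b_h = np.zeros(hidden_dim)

inputs = rng.normal(size=(steps, input_dim))  # one vector per time step
h = np.zeros(hidden_dim)                      # initial hidden state

hidden_states = []
for x_t in inputs:
    # New hidden state combines the current input with the previous state
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    hidden_states.append(h)

hidden_states = np.stack(hidden_states)
print(hidden_states.shape)  # (5, 3): one hidden state per time step
```

For a classification task, the last row of `hidden_states` would typically be passed to an output layer.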

**Question 5: What is the key difference between LSTM and GRU networks in NLP applications?**

**Answer:**
| Aspect                | **LSTM (Long Short-Term Memory)**              | **GRU (Gated Recurrent Unit)**          |
| --------------------- | ---------------------------------------------- | --------------------------------------- |
| **Gates Used**        | Three gates: **input**, **forget**, **output** | Two gates: **reset** and **update**     |
| **Memory Components** | Has a **separate cell state** and hidden state | Combines both into **one hidden state** |
| **Complexity**        | More complex, more parameters                  | Simpler, fewer parameters               |
| **Training Speed**    | Slower                                         | Faster                                  |
| **Performance**       | Good for long sequences and complex patterns   | Similar performance but more efficient  |
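The "fewer parameters" row can be made concrete with a small calculation. The sketch below uses the textbook formulation, in which each gate or candidate has an input weight matrix, a recurrent weight matrix, and a bias vector. The helper names and the sizes (16-dimensional inputs, 32 units) are assumptions for illustration; framework implementations can differ slightly, for example the default Keras GRU keeps an extra bias vector per gate.

```python
def lstm_params(input_dim, units):
    # Four weight sets: input, forget, and output gates plus the candidate cell state
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim, units):
    # Three weight sets: reset gate, update gate, and the candidate hidden state
    return 3 * (units * (input_dim + units) + units)


input_dim, units = 16, 32
print("LSTM parameters:", lstm_params(input_dim, units))  # 6272
print("GRU parameters: ", gru_params(input_dim, units))   # 4704
```

With the same layer width, the GRU here has three quarters of the LSTM's recurrent parameters, which is where its training-speed advantage comes from.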



# Practical Questions

In [1]:
'''
Question 6: Write a Python program using TextBlob to perform sentiment analysis on
the following paragraph of text:

“I had a great experience using the new mobile banking app. The interface is intuitive,
and customer support was quick to resolve my issue. However, the app did crash once
during a transaction, which was frustrating.”

Your program should print out the polarity and subjectivity scores.
(Include your Python code and output in the code box below.)

Answer:

'''
from textblob import TextBlob

text = """I had a great experience using the new mobile banking app.
The interface is intuitive, and customer support was quick to resolve my issue.
However, the app did crash once during a transaction, which was frustrating."""

# Create TextBlob object
blob = TextBlob(text)

# Get sentiment scores
polarity = blob.sentiment.polarity
subjectivity = blob.sentiment.subjectivity

# Print results
print("Polarity:", polarity)
print("Subjectivity:", subjectivity)


Polarity: 0.21742424242424244
Subjectivity: 0.6511363636363636


In [5]:
'''
Question 7: Given the sample paragraph below, perform string tokenization and
frequency distribution using Python and NLTK:

“Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.”

(Include your Python code and output in the code box below.)

Answer:
'''
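import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# word_tokenize needs the Punkt tokenizer data. Newer NLTK releases look for
# "punkt_tab" and older ones for "punkt", so both are requested here.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)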


text = """Natural Language Processing (NLP) is a fascinating field that combines
linguistics, computer science, and artificial intelligence. It enables machines to
understand, interpret, and generate human language. Applications of NLP include
chatbots, sentiment analysis, and machine translation. As technology advances,
the role of NLP in modern solutions is becoming increasingly critical."""

# Tokenize text
tokens = word_tokenize(text)

# Frequency Distribution
freq = FreqDist(tokens)

print("Tokens:")
print(tokens)
print("\nMost Common Words:")
print(freq.most_common(10))



Tokens:
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'It', 'enables', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'Applications', 'of', 'NLP', 'include', 'chatbots', ',', 'sentiment', 'analysis', ',', 'and', 'machine', 'translation', '.', 'As', 'technology', 'advances', ',', 'the', 'role', 'of', 'NLP', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical', '.']

Most Common Words:
[(',', 7), ('.', 4), ('NLP', 3), ('and', 3), ('is', 2), ('of', 2), ('Natural', 1), ('Language', 1), ('Processing', 1), ('(', 1)]
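The most frequent items in the output above are punctuation marks, which is rarely what a frequency analysis is after. One possible refinement is to lowercase the text and count alphabetic words only. The sketch below deliberately swaps NLTK for the standard library (`re` and `collections.Counter`) so that it runs without any downloads; the regular expression is a simplifying assumption and would split contractions and hyphenated words.

```python
import re
from collections import Counter

text = """Natural Language Processing (NLP) is a fascinating field that combines
linguistics, computer science, and artificial intelligence. It enables machines to
understand, interpret, and generate human language. Applications of NLP include
chatbots, sentiment analysis, and machine translation. As technology advances,
the role of NLP in modern solutions is becoming increasingly critical."""

# Keep runs of letters only, after lowercasing, so "Language" and "language" merge
words = re.findall(r"[a-z]+", text.lower())

counts = Counter(words)
print(counts.most_common(5))
# [('nlp', 3), ('and', 3), ('language', 2), ('is', 2), ('of', 2)]
```

The same filtering idea applies to the NLTK token list, for example by keeping only tokens for which `str.isalpha()` is true before building the `FreqDist`.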


In [6]:
'''
Question 8: Implement a basic LSTM model in Keras for a text classification task using
the following dummy dataset. Your model should classify sentences as either positive
(1) or negative (0).

# Dataset
texts = [
“I love this project”, #Positive
“This is an amazing experience”, #Positive
“I hate waiting in line”, #Negative
“This is the worst service”, #Negative
“Absolutely fantastic!” #Positive
]

labels = [1, 1, 0, 0, 1]

Preprocess the text, tokenize it, pad sequences, and build an LSTM model to train on
this data. You may use Keras with TensorFlow backend.

(Include your Python code and output in the code box below.)

Answer:
'''

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Dataset
texts = [
    "I love this project",
    "This is an amazing experience",
    "I hate waiting in line",
    "This is the worst service",
    "Absolutely fantastic!"
]

labels = [1, 1, 0, 0, 1]

# --------------------------
# 1. Text Tokenization
# --------------------------
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

print("Tokenized Sequences:", sequences)

# --------------------------
# 2. Padding Sequences
# --------------------------
max_len = max(len(seq) for seq in sequences)
padded = pad_sequences(sequences, maxlen=max_len, padding='post')

print("Padded Sequences:\n", padded)

# Convert labels to array
labels = np.array(labels)

# --------------------------
# 3. Build LSTM Model
# --------------------------
vocab_size = len(tokenizer.word_index) + 1

model = Sequential()
# input_length is omitted: recent Keras versions deprecate it and infer the length from the data
model.add(Embedding(input_dim=vocab_size, output_dim=16))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# --------------------------
# 4. Train Model
# --------------------------
history = model.fit(padded, labels, epochs=10, verbose=1)

# --------------------------
# 5. Evaluate
# --------------------------
loss, acc = model.evaluate(padded, labels, verbose=0)
print("\nTraining Accuracy:", acc)


Tokenized Sequences: [[2, 4, 1, 5], [1, 3, 6, 7, 8], [2, 9, 10, 11, 12], [1, 3, 13, 14, 15], [16, 17]]
Padded Sequences:
 [[ 2  4  1  5  0]
 [ 1  3  6  7  8]
 [ 2  9 10 11 12]
 [ 1  3 13 14 15]
 [16 17  0  0  0]]
Epoch 1/10




[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2s/step - accuracy: 0.4000 - loss: 0.6943
Epoch 2/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 45ms/step - accuracy: 0.4000 - loss: 0.6926
Epoch 3/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 42ms/step - accuracy: 0.6000 - loss: 0.6909
Epoch 4/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 35ms/step - accuracy: 0.6000 - loss: 0.6891
Epoch 5/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step - accuracy: 0.6000 - loss: 0.6874
Epoch 6/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 60ms/step - accuracy: 0.6000 - loss: 0.6856
Epoch 7/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 36ms/step - accuracy: 0.6000 - loss: 0.6837
Epoch 8/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 37ms/step - accuracy: 0.6000 - loss: 0.6818
Epoch 9/10
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms

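To make the tokenization and padding steps less of a black box, the sketch below reproduces them in plain Python. It is an approximation of what the Keras `Tokenizer` and `pad_sequences` do for this dataset, not their actual implementation: the helper `simple_tokens` is hypothetical, and only the default behaviour used above (lowercasing, punctuation removal, frequency-ranked indices, post-padding) is imitated.

```python
import string
from collections import Counter

texts = [
    "I love this project",
    "This is an amazing experience",
    "I hate waiting in line",
    "This is the worst service",
    "Absolutely fantastic!",
]


def simple_tokens(text):
    # Lowercase, strip punctuation, then split on whitespace
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


# Rank words by descending frequency; ties keep first-seen order.
# Index 0 is left unused so it can serve as the padding value.
counts = Counter(word for text in texts for word in simple_tokens(text))
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sequences = [[word_index[w] for w in simple_tokens(t)] for t in texts]

# Post-padding: append zeros until every sequence has the same length
max_len = max(len(seq) for seq in sequences)
padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]

print(sequences)
print(padded)
```

Because index 0 is reserved for padding, the embedding layer needs `len(word_index) + 1` rows, which is why the model code adds one to the vocabulary size.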
In [7]:
'''
Question 9: Using spaCy, build a simple NLP pipeline that includes tokenization,
lemmatization, and entity recognition. Use the following paragraph as your dataset:

“Homi Jehangir Bhaba was an Indian nuclear physicist who played a key role in the
development of India’s atomic energy program. He was the founding director of the Tata
Institute of Fundamental Research (TIFR) and was instrumental in establishing the
Atomic Energy Commission of India.”

Write a Python program that processes this text using spaCy, then prints tokens, their
lemmas, and any named entities found.

(Include your Python code and output in the code box below.)

Answer:

'''
import spacy

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

text = """Homi Jehangir Bhaba was an Indian nuclear physicist who played a key role
in the development of India’s atomic energy program. He was the founding director
of the Tata Institute of Fundamental Research (TIFR) and was instrumental in
establishing the Atomic Energy Commission of India."""

# Process text
doc = nlp(text)

# -----------------------------
# 1. Tokenization + Lemmatization
# -----------------------------
print("Tokens and Lemmas:")
for token in doc:
    print(f"{token.text} --> {token.lemma_}")

# -----------------------------
# 2. Named Entity Recognition
# -----------------------------
print("\nNamed Entities:")
for ent in doc.ents:
    print(f"{ent.text} --> {ent.label_}")


Tokens and Lemmas:
Homi --> Homi
Jehangir --> Jehangir
Bhaba --> Bhaba
was --> be
an --> an
Indian --> indian
nuclear --> nuclear
physicist --> physicist
who --> who
played --> play
a --> a
key --> key
role --> role

 --> 

in --> in
the --> the
development --> development
of --> of
India --> India
’s --> ’s
atomic --> atomic
energy --> energy
program --> program
. --> .
He --> he
was --> be
the --> the
founding --> found
director --> director

 --> 

of --> of
the --> the
Tata --> Tata
Institute --> Institute
of --> of
Fundamental --> Fundamental
Research --> Research
( --> (
TIFR --> TIFR
) --> )
and --> and
was --> be
instrumental --> instrumental
in --> in

 --> 

establishing --> establish
the --> the
Atomic --> Atomic
Energy --> Energy
Commission --> Commission
of --> of
India --> India
. --> .

Named Entities:
Homi Jehangir Bhaba --> FAC
Indian --> NORP
India --> GPE
the Tata Institute of Fundamental Research --> ORG
the Atomic Energy Commission of India --> ORG
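The empty `-->` lines in the token listing come from the line breaks inside the triple-quoted string: spaCy keeps them as whitespace tokens. One possible way to skip them is to filter on `token.is_space`. The sketch below assumes spaCy is installed and uses `spacy.blank("en")`, a tokenizer-only pipeline, so it needs no model download. For that reason it shows tokens only; lemmas and entities still require a trained pipeline such as `en_core_web_sm`.

```python
import spacy

# Tokenizer-only English pipeline (no tagger, lemmatizer, or NER)
nlp = spacy.blank("en")

# A fragment of the paragraph with the same kind of embedded line break
doc = nlp("key role\nin the development")

# Drop tokens that consist solely of whitespace, such as "\n"
tokens = [token.text for token in doc if not token.is_space]
print(tokens)  # ['key', 'role', 'in', 'the', 'development']
```

The same `if not token.is_space` condition can be added to the printing loop of the full pipeline.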


**Question 10: You are working on a chatbot for a mental health platform. Explain how you would leverage LSTM or GRU networks along with libraries like spaCy or Stanford NLP to understand and respond to user input effectively. Detail your architecture, data preprocessing pipeline, and any ethical considerations.**

(Include your Python code and output in the code box below.)

**Answer:** A mental-health chatbot must understand user emotions, context, and intent. To achieve this, we combine spaCy (for preprocessing) with an LSTM/GRU model (for intent/emotion classification).

1. Architecture Overview

-> Preprocessing Layer (spaCy / Stanford NLP)
* Tokenization
* Lemmatization
* Named Entity Recognition (NER)
* Stopword removal
* Sentence vectorization

-> LSTM/GRU Classification Layer
* Takes processed vectors as input
* Predicts intent (e.g., stress, sadness, emergency, normal conversation)
* Detects sentiment
* Routes user to appropriate response module

-> Response Generation Layer
* Rule-based responses for safety-sensitive cases
* Template + retrieval-based responses for normal conversation
* No unsafe generative responses in crisis situations

2. Ethical Considerations
* Privacy: User messages must be encrypted and not stored unnecessarily.
* Bias Reduction: Train on balanced, diverse mental-health datasets.
* Safety: If the model detects self-harm/suicide intent, escalate to help lines.
* Transparency: Inform users that the chatbot is not a therapist.
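The routing and safety rules above can be sketched as a small function in which rule-based checks run before the classifier's score is consulted. Everything here is hypothetical: the phrase list, the module names, and the 0.5 threshold are placeholders for illustration, and a real system would need clinically reviewed rules and human escalation paths. The score follows the convention of a sigmoid output where values below 0.5 indicate distress.

```python
# Hypothetical crisis phrases; a production list must be clinically reviewed
CRISIS_PHRASES = ("hurt myself", "end my life", "suicide")


def route_message(text, positive_score):
    """Pick a response module for one user message.

    positive_score is an assumed classifier output in [0, 1];
    values below 0.5 are treated as distress.
    """
    lowered = text.lower()

    # Safety rules take priority over whatever the model predicted
    if any(phrase in lowered for phrase in CRISIS_PHRASES):
        return "crisis_escalation"

    if positive_score < 0.5:
        return "supportive_template"

    return "general_conversation"


print(route_message("I want to end my life", 0.9))  # crisis_escalation
print(route_message("I feel sad and lonely", 0.2))  # supportive_template
print(route_message("I am doing fine today", 0.8))  # general_conversation
```

Checking the rules first means a misclassification by the LSTM/GRU model cannot suppress an escalation.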


In [9]:
import spacy
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# --------------------------
# 1. spaCy preprocessing
# --------------------------
nlp = spacy.load("en_core_web_sm")

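def preprocess(text):
    doc = nlp(text)
    # Drop stop words and whitespace, but keep tokens the parser labels as
    # negation: spaCy's stop list includes words such as "not", and removing
    # them would turn "not happy" into "happy".
    tokens = [
        token.lemma_.lower()
        for token in doc
        if not token.is_space and (not token.is_stop or token.dep_ == "neg")
    ]
    return " ".join(tokens)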

# Training dataset
texts = [
    "I feel very sad and lonely",
    "I am extremely anxious today",
    "I am feeling a bit better now",
    "I am happy and relaxed"
]

labels = [0, 0, 1, 1]  # 0 = distress, 1 = positive

# Preprocess training text
cleaned = [preprocess(t) for t in texts]

# --------------------------
# 2. Tokenization + Padding
# --------------------------
tokenizer = Tokenizer()
tokenizer.fit_on_texts(cleaned)
seq = tokenizer.texts_to_sequences(cleaned)
padded = pad_sequences(seq, padding='post')
labels = np.array(labels)

# --------------------------
# 3. Build LSTM Model
# --------------------------
vocab_size = len(tokenizer.word_index) + 1
model = Sequential([
    Embedding(vocab_size, 16),
    LSTM(32),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train model
model.fit(padded, labels, epochs=10, verbose=0)

# --------------------------
# 4. User Input Prediction
# --------------------------
user_text = input("Enter your message: ")

clean_user = preprocess(user_text)
user_seq = tokenizer.texts_to_sequences([clean_user])
user_pad = pad_sequences(user_seq, maxlen=padded.shape[1], padding='post')

prediction = model.predict(user_pad)[0][0]

if prediction < 0.5:
    print("Model Output: Distress / Negative Emotion (0)")
else:
    print("Model Output: Positive Emotion (1)")


Enter your message: i feel sad and  anxious
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 121ms/step
Model Output: Distress / Negative Emotion (0)
