Question 1: What is Computational Linguistics and how does it relate to NLP?

Ans:-
Computational Linguistics is the scientific study of language using computational methods.
It aims to understand how human language works by creating models that mimic or analyze linguistic behavior.

It combines:

Linguistics (syntax, semantics, phonetics, morphology, pragmatics)

Computer Science (algorithms, data structures, machine learning)

Mathematics (probability, statistics)

The goals of CL:

To model natural language in a precise, testable way

To build systems that understand linguistic rules

To study language scientifically using computational tools

CL is the more theoretical of the two fields: it leans toward understanding why language works the way it does.

NLP, by contrast, is the engineering discipline that applies computational models of language to build language-based applications. The two are deeply interconnected: CL provides theories and models, and NLP uses them to create practical systems.

Question 2: Briefly describe the historical evolution of Natural Language Processing

Ans:-

1. 1950s–1960s: Rule-Based Systems (Symbolic NLP)

Early NLP relied on hand-crafted linguistic rules.

Inspired by Noam Chomsky's grammar theories.

Focus on machine translation (e.g., Georgetown-IBM experiment, 1954).

Systems used simple pattern matching and manually written grammars.

2. 1970s–1980s: Knowledge-Based and Linguistic Models

Development of syntactic parsers, semantic networks, and expert systems.

NLP systems started using world knowledge and linguistic theories.

SHRDLU (1970) demonstrated natural language understanding in restricted domains.

3. 1990s: Statistical NLP Revolution

Shift from rules to probability and statistics.

Introduction of:

Hidden Markov Models (HMMs)

N-grams

Statistical Machine Translation

Large annotated corpora such as the Penn Treebank became available (the earlier Brown Corpus dates from the 1960s).

Data-driven methods largely displaced hand-written rules.

4. 2000s: Machine Learning Era

Use of supervised and unsupervised learning for NLP tasks.

Algorithms like SVMs, decision trees, and CRFs became common.

Improvement in text classification, POS tagging, named entity recognition.

5. 2010s: Deep Learning Era

Major breakthroughs with neural networks:

Word2Vec (2013)

RNNs, LSTMs, GRUs

Seq2Seq models with attention (2014)

Enabled better machine translation, summarization, and sentiment analysis.

6. 2017–Present: Transformer Models & Large Language Models

Introduction of the Transformer architecture (Vaswani et al., 2017), which replaced recurrence with self-attention and made training far easier to parallelize.

Transformers enabled:

BERT (2018)

GPT series (2018–present)

T5, XLNet, RoBERTa

Large Language Models (LLMs) have billions of parameters and are trained on huge datasets.

NLP now focuses on:

Zero-shot learning

Few-shot learning

Multimodal models

Generative AI (ChatGPT, Claude, Gemini)

Question 3: List and explain three major use cases of NLP in today’s tech industry.

Ans:-

1. Chatbots and Virtual Assistants
Use Case:

Customer support chatbots

Voice assistants like Siri, Alexa, Google Assistant

AI support in banking, e-commerce, healthcare

How NLP is used:

Intent detection: Understand what the user wants

Entity recognition: Extract important information (name, date, location)

Natural language generation: Respond naturally and conversationally

Impact:

Reduces customer service cost

Provides 24/7 automated assistance

Improves user experience
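The intent detection and entity recognition steps described above can be illustrated with a rule-based sketch. The intent names, keyword sets, and date format below are hypothetical and chosen only for illustration; production assistants typically learn these from labeled conversations rather than from keyword lists.

```python
import re

# Hypothetical intents and trigger keywords (assumptions for this sketch).
INTENT_KEYWORDS = {
    "check_balance": {"balance", "funds"},
    "block_card": {"block", "lost", "stolen"},
}

# Matches dates written as DD/MM/YYYY or D/M/YYYY.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

def detect_intent(message):
    """Return the intent whose keywords overlap most with the message, or 'unknown'."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    best_intent, best_hits = "unknown", 0
    for intent, keywords in INTENT_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_intent, best_hits = intent, hits
    return best_intent

def extract_dates(message):
    """Return every date-like entity found in the message."""
    return DATE_PATTERN.findall(message)

message = "I lost my card on 12/03/2024, please block it"
print(detect_intent(message))   # block_card
print(extract_dates(message))   # ['12/03/2024']
```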

2. Sentiment Analysis
Use Case:

Analyzing customer reviews on Amazon, Flipkart

Monitoring brand sentiment on social media

Detecting public opinion for products, movies, political campaigns

How NLP is used:

Classifies text as positive, negative, or neutral

Understands emotions, tone, and context

Impact:

Helps companies understand customer satisfaction

Supports marketing and product decision-making

Detects trends and public reactions

3. Machine Translation
Use Case:

Google Translate, DeepL

Real-time translation in apps, websites, and customer support

Cross-language communication for global companies

How NLP is used:

Sequence-to-sequence models and Transformers

Converts text from one language to another while preserving meaning

Impact:

Breaks language barriers

Enables international business and communication

Helps in multilingual content creation

Question 4: What is text normalization and why is it essential in text processing tasks?

Ans:-

Text normalization is the process of converting raw, unstructured text into a standard, consistent, and clean format so it can be easily processed by NLP models.

It reduces variations in text caused by:

Spelling differences

Case differences

Punctuation

Abbreviations

Slang or noisy data

Why is Text Normalization Essential?

It is essential because NLP models perform more reliably on uniform, clean input. Raw text contains a lot of noise (inconsistent casing, extra spaces, emojis, punctuation, slang), which inflates the vocabulary and can confuse models.

Key reasons it is important:
1. Improves Model Accuracy

Removes inconsistencies that affect tokenization and feature extraction.

Ensures similar words like “Dog”, “dog”, and “DOG” are treated as the same word.

2. Reduces Data Sparsity

Converts multiple forms of the same word (e.g., "running", "runs", "ran") into a standardized form.

Helps statistical and ML models by reducing the vocabulary size.

3. Enhances Performance of Downstream Tasks

Tasks like sentiment analysis, translation, classification, and summarization work better with normalized text.

Examples of Text Normalization Techniques

Lowercasing

Removing punctuation

Stopword removal

Stemming and Lemmatization

Expanding contractions (e.g., “don’t” → “do not”)

Removing special characters, emojis, URLs

Spell correction
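Several of the techniques listed above can be chained into one function. The following is a minimal sketch using only the standard library; the contraction table is a tiny hypothetical sample, and the order of the steps is one reasonable choice rather than a fixed rule.

```python
import re
import string

# Tiny illustrative contraction map (an assumption); real pipelines use a much larger table.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def normalize(text):
    """Lowercase, expand contractions, drop URLs and punctuation, collapse whitespace."""
    text = text.lower()
    text = text.replace("\u2019", "'")            # unify curly and straight apostrophes
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)          # expand contractions before punctuation is removed
    text = re.sub(r"https?://\S+", " ", text)     # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()      # collapse repeated whitespace

print(normalize("Don\u2019t MISS it!!  Visit https://example.com today."))
# do not miss it visit today
```

Contractions are expanded before punctuation is stripped, because removing the apostrophe first would turn "don't" into "dont" and the lookup would no longer match.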

Question 5: Compare and contrast stemming and lemmatization with suitable examples.

Ans:-

| Feature             | Stemming                         | Lemmatization                         |
| ------------------- | -------------------------------- | ------------------------------------- |
| Based on            | Heuristic suffix-stripping rules | Vocabulary and morphological analysis |
| Accuracy            | Low–Medium                       | High                                  |
| Output              | Not always a real word           | A dictionary form (lemma)             |
| Speed               | Faster                           | Slower                                |
| Context Awareness   | No                               | Yes (can use the part of speech)      |
| Example ("studies") | "studi"                          | "study"                               |
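The contrast in the table can be made concrete with a toy implementation. The suffix rules and the lemma table below are hypothetical and deliberately tiny; they are not the Porter stemmer or the WordNet lemmatizer, only a sketch of the two strategies (rule-based stripping versus dictionary lookup).

```python
# Toy suffix rules and a hand-written lemma table (both assumptions for this sketch).
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("es", ""), ("s", "")]
LEMMA_TABLE = {"studies": "study", "running": "run", "ran": "run", "better": "good"}

def toy_stem(word):
    """Strip the first matching suffix; uses no dictionary and no context."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    """Look the word up in the lemma table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "running", "ran", "better"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The stemmer leaves irregular forms such as "ran" and "better" untouched and can produce non-words such as "studi", while the lookup maps them to "run" and "good" at the cost of needing a vocabulary.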




In [1]:
'''Question 6: Write a Python program that uses regular expressions (regex) to extract all
email addresses from the following block of text:
“Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz'''

import re

text = """
Hello team, please contact us at support@xyz.com for technical issues,
or reach out to our HR at hr@xyz.com. You can also connect with John at
john.doe@xyz.org and jenny via jenny_clarke126@mail.co.us.
For partnership inquiries, email partners@xyz.biz
"""

# Regex pattern for email extraction
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Extract all emails
emails = re.findall(pattern, text)

print("Extracted Email Addresses:")
for email in emails:
    print(email)


Extracted Email Addresses:
support@xyz.com
hr@xyz.com
john.doe@xyz.org
jenny_clarke126@mail.co.us
partners@xyz.biz


In [2]:
'''Question 7: Given the sample paragraph below, perform string tokenization and
frequency distribution using Python and NLTK:
“Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.”'''
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Download required NLTK resources.
# Newer NLTK releases load the tokenizer data from 'punkt_tab'; older ones use 'punkt'.
nltk.download('punkt')
nltk.download('punkt_tab')

paragraph = """
Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.
"""

# --------- Step 1: Tokenization ---------
tokens = word_tokenize(paragraph)

print("Tokens:")
print(tokens)

# --------- Step 2: Frequency Distribution ---------
freq_dist = FreqDist(tokens)

print("\nFrequency Distribution:")
for word, freq in freq_dist.items():
    print(word, ":", freq)


In [3]:
'''Question 8: Create a custom annotator using spaCy or NLTK that identifies and labels
proper nouns in a given text.
(Include your Python code and output in the code box below.)
'''
import spacy

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = """
Natural Language Processing (NLP) is widely used by companies like Google, Microsoft,
and OpenAI. Researchers such as John Smith and Emily Clarke have contributed
significantly to the field. The University of Cambridge and Stanford University
are leaders in AI research.
"""

# Process text
doc = nlp(text)

# Custom annotator to extract proper nouns
proper_nouns = [(token.text, token.pos_) for token in doc if token.pos_ == "PROPN"]

print("Proper Noun Annotations:")
for word, label in proper_nouns:
    print(f"{word} --> {label}")


Proper Noun Annotations:
Natural --> PROPN
Language --> PROPN
Processing --> PROPN
NLP --> PROPN
Google --> PROPN
Microsoft --> PROPN
OpenAI --> PROPN
John --> PROPN
Smith --> PROPN
Emily --> PROPN
Clarke --> PROPN
University --> PROPN
Cambridge --> PROPN
Stanford --> PROPN
University --> PROPN
AI --> PROPN


In [4]:
'''Question 9: Using Gensim, demonstrate how to train a simple Word2Vec model on the
following dataset consisting of example sentences:
dataset = [
 "Natural language processing enables computers to understand human language",
 "Word embeddings are a type of word representation that allows words with similar
meaning to have similar representation",
 "Word2Vec is a popular word embedding technique used in many NLP applications",
 "Text preprocessing is a critical step before training word embeddings",
 "Tokenization and normalization help clean raw text for modeling"
]
Write code that tokenizes the dataset, preprocesses it, and trains a Word2Vec model using
Gensim.
'''
# Requires gensim. Install it first if it is missing:
#   pip install gensim        (or `%pip install gensim` inside a notebook cell)

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# -------------------------
# Dataset
# -------------------------
dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
    "Word2Vec is a popular word embedding technique used in many NLP applications",
    "Text preprocessing is a critical step before training word embeddings",
    "Tokenization and normalization help clean raw text for modeling"
]

# -------------------------
# Step 1: Preprocessing & Tokenization
# -------------------------
# simple_preprocess() lowercases, tokenizes, and drops punctuation, digits, and very short tokens
tokenized_data = [simple_preprocess(sentence) for sentence in dataset]

print("Tokenized Sentences:")
for tokens in tokenized_data:
    print(tokens)

# -------------------------
# Step 2: Train Word2Vec Model
# -------------------------
model = Word2Vec(
    sentences=tokenized_data,
    vector_size=50,      # size of word vector
    window=3,            # context window size
    min_count=1,         # include all words
    workers=4,           # parallel threads
    sg=1                 # skip-gram (sg=1), CBOW (sg=0)
)

# -------------------------
# Step 3: Example Output
# -------------------------
print("\nVector for word 'language':")
print(model.wv['language'])

print("\nMost similar words to 'word':")
print(model.wv.most_similar('word'))



Question 10: Imagine you are a data scientist at a fintech startup. You’ve been tasked
with analyzing customer feedback. Outline the steps you would take to clean, process,
and extract useful insights using NLP techniques from thousands of customer reviews.

Ans:-

1. Data Collection

Gather reviews from:

Mobile app feedback

Play Store / App Store reviews

Website feedback forms

Customer support chat logs

Store data in a database or CSV for preprocessing.

2. Data Cleaning

Clean the raw text to remove noise:

a. Remove unwanted characters

HTML tags

URLs

Emojis

Special symbols

b. Normalize text

Lowercasing

Expanding contractions (can’t → cannot)

Spell correction (optional)

c. Remove stopwords

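Words like “the”, “is”, “and” carry little meaning on their own. Keep negations such as “not” and “no”, because they change the sentiment of a review.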

d. Tokenization

Split sentences into words for processing.
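The cleaning steps above can be sketched as a single function. This is a minimal standard-library version; the stopword list is a small hypothetical sample, and `clean_review` is an illustrative name rather than part of any library.

```python
import html
import re

# Small illustrative stopword list (an assumption); "not" is deliberately kept for sentiment work.
STOPWORDS = {"the", "is", "and", "a", "to", "it"}

def clean_review(raw):
    """Strip HTML and URLs, lowercase, tokenize, and remove stopwords."""
    text = html.unescape(raw)                    # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)         # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = text.lower()
    tokens = re.findall(r"[a-z']+", text)        # letters only, so emojis and symbols drop out
    return [t for t in tokens if t not in STOPWORDS]

print(clean_review("<p>The app is NOT loading \U0001F621 see https://example.com/help</p>"))
# ['app', 'not', 'loading', 'see']
```

HTML tags are removed before URLs so that a closing tag attached to a link is not swallowed by the URL pattern.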

3. Text Preprocessing
a. Lemmatization or Stemming

Reduce words to a common base form
e.g., “running” → “run”

b. Handle negations

Combine negation + word
e.g., “not good” → “not_good”

c. Remove very rare/very frequent terms

Reduces noise for machine learning models.
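The negation handling in step 3b can be sketched as follows. The set of negation words is a small assumption, and the function only joins a negation with the single token that follows it; longer negation scopes would need a more careful treatment.

```python
# Negation words to merge with the following token (a small illustrative set).
NEGATIONS = {"not", "no", "never"}

def join_negations(tokens):
    """Merge a negation with the next token: ['not', 'good'] -> ['not_good']."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip the token that was just merged
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(join_negations(["app", "is", "not", "good"]))
# ['app', 'is', 'not_good']
```

This step only works if negation words survived stopword removal earlier in the pipeline.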

4. Exploratory Data Analysis (EDA)
a. Word frequency analysis

Create frequency distribution of important terms

Identify common complaints and positive feedback

b. N-gram analysis

Bigram/trigram extraction to find patterns like:

“late payment”

“poor customer support”

“fast transactions”

c. Word Cloud

Visualize the most common words (after filtering).
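The frequency and n-gram analysis above can be sketched with `collections.Counter`. The three tokenized reviews are made-up sample data, used only to show the mechanics.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical, already-cleaned reviews.
reviews = [
    ["late", "payment", "again"],
    ["poor", "customer", "support"],
    ["late", "payment", "fee"],
]

bigram_counts = Counter(bg for tokens in reviews for bg in ngrams(tokens, 2))
print(bigram_counts.most_common(1))
# [('late payment', 2)]
```

N-grams are built per review, so no bigram spans the boundary between two different reviews.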

5. Sentiment Analysis
Approaches:

Rule-based (e.g., VADER for short reviews)

Machine Learning (SVM, Naive Bayes)

Deep Learning (LSTM, BERT, RoBERTa)

Output:

Positive

Negative

Neutral

Helps identify how customers feel about:

App performance

Financial services

Onboarding process
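The rule-based approach can be illustrated with a very small lexicon scorer. This is not VADER; the word lists are hypothetical and the scoring ignores intensity and negation, so treat it as a sketch of the idea only.

```python
# Hypothetical sentiment lexicons (assumptions for this sketch).
POSITIVE_WORDS = {"fast", "easy", "great", "smooth"}
NEGATIVE_WORDS = {"slow", "crash", "failed", "poor"}

def lexicon_sentiment(tokens):
    """Label a tokenized review by counting positive and negative lexicon hits."""
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(t in NEGATIVE_WORDS for t in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(lexicon_sentiment(["fast", "transactions", "great", "app"]))  # Positive
print(lexicon_sentiment(["payment", "failed", "again"]))            # Negative
print(lexicon_sentiment(["opened", "account", "today"]))            # Neutral
```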

6. Topic Modeling

Use unsupervised learning to discover hidden themes:

Methods:

LDA (Latent Dirichlet Allocation)

NMF (Non-negative Matrix Factorization)

BERTopic (builds topics by clustering Transformer embeddings)

Typical fintech topics:

App crashes

Payment delays

Loan approval issues

Security concerns

Good customer service

7. Aspect-Based Sentiment Analysis (ABSA)

Break down reviews by aspects:

| Aspect            | Sentiment |
| ----------------- | --------- |
| App UI            | Positive  |
| KYC verification  | Negative  |
| Transaction speed | Mixed     |
| Customer support  | Negative  |

Helps identify exactly which features need improvement.
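A per-aspect summary like the one above can be produced by aggregating individual mentions. The sketch below assumes an upstream model has already labeled each mention with an aspect and a sentiment; the sample mentions and the rule "conflicting labels become Mixed" are assumptions for illustration.

```python
from collections import defaultdict

def summarize_aspects(labeled_mentions):
    """Collapse (aspect, sentiment) pairs into one label per aspect."""
    sentiments_by_aspect = defaultdict(set)
    for aspect, sentiment in labeled_mentions:
        sentiments_by_aspect[aspect].add(sentiment)
    return {
        aspect: next(iter(labels)) if len(labels) == 1 else "Mixed"
        for aspect, labels in sentiments_by_aspect.items()
    }

# Hypothetical output of an aspect-level sentiment model.
mentions = [
    ("App UI", "Positive"),
    ("Transaction speed", "Positive"),
    ("Transaction speed", "Negative"),
    ("KYC verification", "Negative"),
]
print(summarize_aspects(mentions))
```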

8. Classification (Optional)

Build models to auto-classify reviews into categories:

Bug report

Feature request

Complaint

Praise

Algorithms: Logistic Regression, SVM, Random Forest, or BERT-based classifiers.
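As a dependency-free baseline, the classification step can be sketched with a small multinomial Naive Bayes classifier. Naive Bayes is used here as a stand-in because it fits in a few lines of standard-library code; it is not one of the algorithms listed above, and the four training reviews are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Count classes and per-class word frequencies from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(model, tokens):
    """Return the label with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total_docs)  # class prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled reviews, already tokenized.
training_data = [
    (["app", "crashes", "on", "login"], "Bug report"),
    (["please", "add", "dark", "mode"], "Feature request"),
    (["crashes", "when", "paying"], "Bug report"),
    (["add", "export", "option"], "Feature request"),
]
model = train_naive_bayes(training_data)
print(predict(model, ["app", "crashes"]))  # Bug report
```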

9. Summarization

Use text summarization models (e.g., T5, BART) to generate:

Monthly summary of customer issues

Executive reports

10. Visualization & Reporting

Create dashboards using:

Power BI, Tableau

Matplotlib / Plotly

Metrics to include:

Sentiment trends over time

Word frequency

Most common complaint categories

Satisfaction score
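A metric such as the sentiment trend over time can be computed before it reaches a dashboard. The sketch below assumes each review already carries an ISO date string and a sentiment label; the records are made-up sample data and `negative_share_by_month` is an illustrative name.

```python
from collections import defaultdict

def negative_share_by_month(records):
    """Return the fraction of negative reviews per month (YYYY-MM), sorted by month."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for date, label in records:
        month = date[:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[month] += 1
        if label == "Negative":
            negatives[month] += 1
    return {month: negatives[month] / totals[month] for month in sorted(totals)}

# Hypothetical (date, sentiment) pairs.
records = [
    ("2024-01-05", "Negative"),
    ("2024-01-20", "Positive"),
    ("2024-02-02", "Negative"),
    ("2024-02-11", "Negative"),
]
print(negative_share_by_month(records))
# {'2024-01': 0.5, '2024-02': 1.0}
```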

11. Actionable Insights for Product Team

From NLP analysis, provide recommendations such as:

Improve KYC verification time

Fix payment gateway reliability

Simplify loan application flow

Enhance customer support responsiveness