**Question 1:** What is Computational Linguistics and how does it relate to NLP?


Computational Linguistics (CL) is the scientific study of language from a computational perspective.
It aims to understand how language works by creating models that can process, analyze, and generate human language using algorithms.

In simple words:
 Computational Linguistics = Linguistics + Computer Science

It tries to answer questions like:

* How can a computer understand sentences?

* How can we represent grammar in a machine-readable way?

* How can a computer learn meaning, context, and ambiguity?

**How does it relate to NLP (Natural Language Processing)?**

Natural Language Processing is the practical application side, where we build real systems that work with human language, such as chatbots, translators, summarizers, and voice assistants.

Relationship:

* Computational Linguistics provides the theories and scientific foundations (grammar, parsing, semantics, morphology, etc.).

* NLP uses those theories to build working applications (text classification, speech recognition, translation).

Think of it like this:

| Field | Purpose |
| :---- | :------ |
| Computational Linguistics | Understand language scientifically & model it computationally |
| NLP | Build tools/systems that use language in the real world |

**Analogy**

CL = Studying the rules of language + building theoretical models

NLP = Using those rules to build applications like ChatGPT, Google Translate, Siri

**Question 2:** Briefly describe the historical evolution of Natural Language Processing.

**Historical Evolution of Natural Language Processing (NLP)**

NLP has evolved through several major phases:

1. 1950s–1960s: Rule-Based & Symbolic Era

* NLP began with linguistic rules written by experts.

* Early systems used grammar rules, dictionaries, and logic.

* Famous milestone: Alan Turing’s Turing Test (1950).

* Machine Translation experiments (e.g., Georgetown experiment, 1954).

2. 1970s–1980s: Knowledge-Based Systems

* Development of semantic networks, frames, and AI knowledge bases.

* Systems tried to “understand” language using world knowledge.

* Example: SHRDLU (natural language understanding in a blocks world).

3. 1990s–2000s: Statistical NLP

* Shift from rules to probabilistic and statistical models.

* Use of large corpora and machine learning.

* Algorithms like Hidden Markov Models, Decision Trees, n-grams.

* Speech recognition and part-of-speech tagging improved greatly.

4. 2010s: Deep Learning Revolution

* Widespread adoption of neural networks for language tasks, especially recurrent architectures (RNNs, LSTMs, GRUs).

* Dense word embeddings (Word2Vec, GloVe) were introduced to represent word meaning as vectors.

* Major improvements in translation, sentiment analysis, and speech.

5. 2018–Present: Transformer Models & Large Language Models

* Breakthrough model: Transformer (Vaswani et al., 2017).

* Enabled BERT, GPT, T5, XLNet, etc.

* These models use self-attention to capture context across a whole passage and can generate human-like text.

* Today’s NLP is dominated by large-scale pre-trained models.


**Question 3:** List and explain three major use cases of NLP in today’s tech industry.

Three Major Use Cases of NLP in Today’s Tech Industry:


**1. Machine Translation**

* Converts text from one language to another (e.g., English → Hindi).

* Used in tools like Google Translate, Microsoft Translator.

* Helps in global communication, content localization, and multilingual support.

**2. Sentiment Analysis**

* Determines whether a text expresses a positive, negative, or neutral sentiment.

* Used by companies to analyze customer reviews, social media posts, and feedback.

* Helps businesses understand public opinion and improve products or services.

**3. Chatbots and Virtual Assistants**

* NLP powers conversational agents like ChatGPT, Siri, Alexa, Google Assistant.

* Helps automate customer support, answer queries, and provide personalized assistance.

* Widely used in banking, e-commerce, healthcare, and customer service.

**Question 4:** What is text normalization and why is it essential in text processing tasks?

**What is Text Normalization?**

Text normalization is the process of converting raw text into a consistent, standard, and uniform format so that a machine can easily understand and process it.

It removes variations in text by applying steps like:

* Lowercasing

* Removing punctuation

* Expanding abbreviations

* Correcting spelling

* Lemmatization or stemming

**Why is Text Normalization Essential?**

Text data is often messy, inconsistent, and full of variations.
Normalization is important because:

1. Improves accuracy of NLP models by reducing noise.

2. Ensures similar words are treated the same (e.g., “Running”, “RUNNING”, “run”).

3. Reduces vocabulary size, making processing faster and more efficient.

4. Helps algorithms focus on meaning, not formatting differences.
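The normalization steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production pipeline: the function name `normalize` and the `ABBREVIATIONS` map are hypothetical, and spelling correction and lemmatization are left out because they need external resources.

```python
import string

# Hypothetical abbreviation map, for illustration only.
ABBREVIATIONS = {"u": "you", "pls": "please", "thx": "thanks"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, collapse whitespace, expand abbreviations."""
    text = text.lower()
    text = text.translate(PUNCT_TO_SPACE)
    tokens = text.split()  # split() with no argument also collapses repeated spaces
    return [ABBREVIATIONS.get(token, token) for token in tokens]


print(normalize("Pls   call me, THX!"))
print(normalize("Running RUNNING"))
```

After this step, "Running" and "RUNNING" become the same token, which is the vocabulary reduction described in point 3.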

**Question 5:** Compare and contrast stemming and lemmatization with suitable
examples.

Stemming vs Lemmatization

**1. Stemming**

* A rule-based process that cuts off word endings to reduce a word to its base form (called “stem”).

* It does not consider grammar or meaning.

* Often produces non-dictionary words.

Example:

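* “Running” → “run” (Porter stemmer) or “runn” (naive suffix stripping)

* “Better” → “better” with the Porter stemmer (a more aggressive stemmer may cut it further); a stemmer cannot map it to “good”

* “Studies” → “studi”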

**2. Lemmatization**

* A linguistically informed process that converts a word to its dictionary base form (called “lemma”).

* Considers grammar, part of speech, and context.

* Always produces valid words.

Example:

* “Running” → “run”

* “Better” → “good” (comparative form)

* “Studies” → “study”

**Comparison Table**

| Feature | Stemming | Lemmatization |
| :------ | :------- | :------------ |
| Method | Removes suffixes using rules | Uses vocabulary + grammar |
| Output | May be non-words | Always valid dictionary words |
| Accuracy | Lower | Higher |
| Speed | Faster | Slower |
| Example (“Studies”) | “studi” | “study” |  
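To make the contrast concrete, here is a toy sketch of both ideas in plain Python. It is not the Porter algorithm or a real lemmatizer: `SUFFIX_RULES` and `LEMMA_TABLE` are hypothetical, hand-written stand-ins that only cover the three example words.

```python
# Toy suffix rules; a real stemmer has many more rules and conditions.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Hypothetical mini-lexicon mapping (word, part of speech) to a lemma.
LEMMA_TABLE = {
    ("studies", "NOUN"): "study",
    ("running", "VERB"): "run",
    ("better", "ADJ"): "good",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, ignoring grammar and meaning."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least three characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemmatize(word: str, pos: str) -> str:
    """Look the word up by part of speech; fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


for word, pos in [("Studies", "NOUN"), ("Running", "VERB"), ("Better", "ADJ")]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word, pos))
```

The stemmer needs no vocabulary and is fast, but it returns non-words such as "studi". The lemmatizer returns dictionary forms, but only for words and parts of speech it knows about.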

**Question 6:** Write a Python program that uses regular expressions (regex) to extract all
email addresses from the following block of text:

“Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.”


```python
import re

text = """
Hello team, please contact us at support@xyz.com for technical issues,
or reach out to our HR at hr@xyz.com. You can also connect with John at
john.doe@xyz.org and jenny via jenny_clarke126@mail.co.us. For partnership
inquiries, email partners@xyz.biz.
"""

# Regular expression pattern for email extraction
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

# Extract all email addresses
emails = re.findall(pattern, text)

# Print result
print("Extracted Email Addresses:")
for email in emails:
    print(email)
```


```text
Extracted Email Addresses:
support@xyz.com
hr@xyz.com
john.doe@xyz.org
jenny_clarke126@mail.co.us
partners@xyz.biz
```
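The same pattern is easier to read when written with `re.VERBOSE`, which allows whitespace and comments inside the expression. The sketch below also removes case-insensitive duplicates; the helper name `extract_emails` and the sample string are assumptions for illustration. Like the pattern above, it is a pragmatic approximation rather than a full validator of email syntax.

```python
import re

EMAIL_RE = re.compile(
    r"""
    [a-zA-Z0-9._%+-]+   # local part: letters, digits, and . _ % + -
    @                   # literal at-sign
    [a-zA-Z0-9.-]+      # domain labels, possibly containing dots
    \.                  # dot before the final label
    [a-zA-Z]{2,}        # final label: at least two letters
    """,
    re.VERBOSE,
)


def extract_emails(text: str) -> list[str]:
    """Return addresses in order of first appearance, without duplicates."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(text):
        key = match.lower()  # treat A.B@x.org and a.b@x.org as the same address
        if key not in seen:
            seen.add(key)
            result.append(match)
    return result


sample = "Write to a.b@example.org, A.B@example.org. or sales@shop.example.co.uk"
print(extract_emails(sample))
```

Because the final label must consist of letters only, a sentence-ending period directly after an address is not captured as part of the match.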


**Question 7:** Given the sample paragraph below, perform string tokenization and
frequency distribution using Python and NLTK:

“Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.”


```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Sample paragraph
text = """
Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.
"""

# Download tokenizer (run once)
nltk.download('punkt_tab')

# 1. Tokenization
tokens = word_tokenize(text)
print("Tokens:")
print(tokens)

# 2. Frequency Distribution
freq_dist = FreqDist(tokens)
print("\nFrequency Distribution:")
for word, frequency in freq_dist.most_common(10):
    print(word, ":", frequency)
```

```text
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.

Tokens:
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'It', 'enables', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'Applications', 'of', 'NLP', 'include', 'chatbots', ',', 'sentiment', 'analysis', ',', 'and', 'machine', 'translation', '.', 'As', 'technology', 'advances', ',', 'the', 'role', 'of', 'NLP', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical', '.']

Frequency Distribution:
, : 7
. : 4
NLP : 3
and : 3
is : 2
of : 2
Natural : 1
Language : 1
Processing : 1
( : 1
```
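In the distribution above, punctuation takes the top two places and "Language" and "language" are counted separately. As a hedged alternative, the sketch below counts only lowercased alphanumeric words using the standard library. The function name `word_frequencies` and the sample sentence are assumptions, and the regex tokenizer is much cruder than `word_tokenize`.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count lowercased alphanumeric words, ignoring punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words)


sample = "NLP is fun. Language models power NLP, and language matters."
freq = word_frequencies(sample)
for word, count in freq.most_common(3):
    print(word, ":", count)
```

Whether to filter punctuation and fold case depends on the task: for a topic overview it removes noise, while for tasks such as sentence splitting the punctuation tokens carry information.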


**Question 8:** Create a custom annotator using spaCy or NLTK that identifies and labels proper nouns in a given text.


```python
import spacy

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

text = "John and Mary visited London to meet Professor Charles Xavier from Oxford University."

# Process the text
doc = nlp(text)

# Custom annotator to extract proper nouns
print("Proper Nouns Identified:")
for token in doc:
    if token.pos_ == "PROPN":
        print(token.text, "→ PROPER NOUN")
```


```text
Proper Nouns Identified:
John → PROPER NOUN
Mary → PROPER NOUN
London → PROPER NOUN
Professor → PROPER NOUN
Charles → PROPER NOUN
Xavier → PROPER NOUN
Oxford → PROPER NOUN
University → PROPER NOUN
```
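The output above labels each token separately, so "Charles Xavier" and "Oxford University" appear as unrelated words. One possible refinement is to merge runs of consecutive proper-noun tokens into a single labelled span. The sketch below shows only that merging logic in plain Python: the function name `label_proper_noun_spans` is hypothetical, and the `(text, tag)` pairs are hand-written stand-ins for what `token.text` and `token.pos_` would supply (the non-PROPN tags in particular are assumed, not taken from a model run).

```python
from typing import List, Tuple


def label_proper_noun_spans(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge runs of consecutive PROPN tokens into one labelled span."""
    spans = []
    current = []
    for text, pos in tagged:
        if pos == "PROPN":
            current.append(text)
        elif current:
            # A non-PROPN token closes the current run.
            spans.append((" ".join(current), "PROPER_NOUN"))
            current = []
    if current:  # flush a run that reaches the end of the input
        spans.append((" ".join(current), "PROPER_NOUN"))
    return spans


# Hand-written (token, tag) pairs standing in for a tagger's output.
tagged_tokens = [
    ("John", "PROPN"), ("and", "CCONJ"), ("Mary", "PROPN"), ("visited", "VERB"),
    ("London", "PROPN"), ("to", "PART"), ("meet", "VERB"),
    ("Professor", "PROPN"), ("Charles", "PROPN"), ("Xavier", "PROPN"),
    ("from", "ADP"), ("Oxford", "PROPN"), ("University", "PROPN"), (".", "PUNCT"),
]

for span_text, label in label_proper_noun_spans(tagged_tokens):
    print(span_text, "->", label)
```

With a loaded spaCy pipeline, the input could be built as `[(t.text, t.pos_) for t in doc]`.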


**Question 9:** Using Gensim, demonstrate how to train a simple Word2Vec model on the following dataset consisting of example sentences:
dataset = [
 "Natural language processing enables computers to understand human language",
 "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
 "Word2Vec is a popular word embedding technique used in many NLP applications",
 "Text preprocessing is a critical step before training word embeddings",
 "Tokenization and normalization help clean raw text for modeling"
]
Write code that tokenizes the dataset, preprocesses it, and trains a Word2Vec model using
Gensim.


```python
# Install Gensim first (run once in a notebook cell):
# !pip install gensim
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Given dataset
dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
    "Word2Vec is a popular word embedding technique used in many NLP applications",
    "Text preprocessing is a critical step before training word embeddings",
    "Tokenization and normalization help clean raw text for modeling"
]

# 1. Preprocessing & Tokenization using simple_preprocess
# simple_preprocess lowercases, drops very short tokens such as "a",
# and splits "Word2Vec" into "word" and "vec" (see the output).
tokenized_sentences = [simple_preprocess(sentence) for sentence in dataset]
print("Tokenized Sentences:")
print(tokenized_sentences)

# 2. Train a Word2Vec model
model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=50,    # dimensionality of word vectors
    window=5,          # context window size
    min_count=1,       # include all words
    workers=4,         # for parallel training
    sg=0               # CBOW model (sg=1 → Skip-gram)
)

# 3. Example: Get vector for a word
print("\nVector for 'language':")
print(model.wv['language'])

# 4. Example: Similar words
# With only five sentences, these similarities are not meaningful;
# the call is shown to demonstrate the API.
print("\nMost similar words to 'word':")
print(model.wv.most_similar('word'))
```

```text
Tokenized Sentences:
[['natural', 'language', 'processing', 'enables', 'computers', 'to', 'understand', 'human', 'language'], ['word', 'embeddings', 'are', 'type', 'of', 'word', 'representation', 'that', 'allows', 'words', 'with', 'similar', 'meaning', 'to', 'have', 'similar', 'representation'], ['word', 'vec', 'is', 'popular', 'word', 'embedding', 'technique', 'used', 'in', 'many', 'nlp', 'applications'], ['text', 'preprocessing', 'is', 'critical', 'step', 'before', 'training', 'word', 'embeddings'], ['tokenization', 'and', 'normalization', 'help', 'clean', 'raw', 'text', 'fo
```
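To give some intuition for the `window` parameter, the sketch below lists the (target, context) word pairs that a context window produces for one tokenized sentence. It is a conceptual illustration only: the function name `training_pairs` is hypothetical, and Gensim's actual training loop is more involved (for example, it may sample a smaller effective window per word), so this does not reproduce what the library does internally.

```python
def training_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["text", "preprocessing", "is", "critical"]
for pair in training_pairs(tokens, window=1):
    print(pair)
```

A larger window pairs each word with more distant neighbours, which tends to capture broader topical similarity, while a small window focuses on immediate neighbours.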

**Question 10:** Imagine you are a data scientist at a fintech startup. You’ve been tasked
with analyzing customer feedback. Outline the steps you would take to clean, process,
and extract useful insights using NLP techniques from thousands of customer reviews.


As a data scientist at a fintech startup analyzing thousands of customer reviews, I would follow these systematic steps:

**1. Data Collection & Understanding**

* Gather customer reviews from app stores, surveys, emails, support tickets, etc.

* Understand the structure: text length, language, duplicates, metadata (date, rating).

**2. Text Cleaning & Preprocessing**

Perform normalization to make text uniform and machine-friendly:

* Lowercasing all text

* Removing noise: punctuation, URLs, numbers, HTML tags, special characters

* Tokenization: splitting text into words

* Stopword removal: removing common words (is, the, this…), while keeping negations such as “not” when sentiment matters

* Spelling correction (optional)

* Stemming or lemmatization to reduce words to root form

* Handling emojis/emoticons, as they carry sentiment

* Removing duplicate reviews

**3. Exploratory Text Analysis**

Generate initial insights:

* Most common words (word frequency)

* N-grams (phrases like “late payment”, “loan approval”)

* Word clouds

* Review length distribution

This helps to understand general customer concerns.
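N-gram counting needs nothing beyond the standard library. The sketch below counts bigrams per review, so that phrases never span two different reviews; the helper name `ngrams` and the tokenized sample reviews are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical, already cleaned and tokenized reviews.
reviews_tokens = [
    ["loan", "approval", "slow"],
    ["loan", "approval", "rejected"],
    ["late", "payment", "fee"],
]

bigram_counts = Counter()
for tokens in reviews_tokens:
    bigram_counts.update(ngrams(tokens, 2))

print(bigram_counts.most_common(1))
```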

**4. Sentiment Analysis**

Apply supervised models (Logistic Regression, SVM) or pre-trained transformers to classify reviews as positive, negative, or neutral.

* Aggregate sentiment across time to track customer satisfaction trends.
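Once each review has a sentiment label, tracking the trend is a simple aggregation. The sketch below averages a numeric score per month; the function name `monthly_sentiment`, the score mapping, and the dated sample labels are all hypothetical, and the labels could come from any classifier.

```python
from collections import defaultdict

# Assumed numeric encoding of the three sentiment classes.
SCORE = {"Positive": 1, "Neutral": 0, "Negative": -1}


def monthly_sentiment(labelled_reviews):
    """Average sentiment score per 'YYYY-MM' month from (ISO date, label) pairs."""
    totals = defaultdict(lambda: [0, 0])  # month -> [score sum, review count]
    for date, label in labelled_reviews:
        month = date[:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[month][0] += SCORE[label]
        totals[month][1] += 1
    return {month: total / count for month, (total, count) in sorted(totals.items())}


# Made-up example data.
data = [
    ("2024-01-05", "Positive"),
    ("2024-01-20", "Negative"),
    ("2024-02-02", "Negative"),
    ("2024-02-11", "Negative"),
]
print(monthly_sentiment(data))
```

A falling average from one month to the next is a signal to look at that month's topics and complaint categories in more detail.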

**5. Topic Modeling**

Use unsupervised techniques to discover hidden themes:

* LDA (Latent Dirichlet Allocation)

* NMF (Non-negative Matrix Factorization)

Example topics:

* “App performance issues”

* “Loan rejection complaints”

* “Good customer support”

**6. Named Entity Recognition (NER)**

Identify important entities:

* Product names

* Transaction types

* Bank names

* Complaint categories

This helps find exactly where issues are occurring.

**7. Text Classification**

Train models to classify reviews into categories such as:

* Payment issues

* KYC problems

* Account login issues

* Customer support complaints

This is useful for routing complaints to the right department.

**8. Aspect-Based Sentiment Analysis**

Analyze sentiment for specific features:

* Loan approval time → negative

* Mobile app design → positive

* Customer support → mixed

This helps prioritize product improvements.
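A very rough way to approximate aspect-based sentiment is to split a review into clauses, score each clause with a small sentiment lexicon, and credit the score to whichever aspects the clause mentions. The sketch below does this in plain Python. Everything in it is an assumption for illustration: the aspect keywords, the tiny lexicons, and the function name `aspect_sentiment`. A real system would typically use a trained model instead.

```python
import re
from collections import defaultdict

# Hypothetical aspect keywords and sentiment lexicons.
ASPECT_KEYWORDS = {
    "loan approval": {"loan", "approval"},
    "app design": {"design", "interface"},
    "customer support": {"support", "agent"},
}
POSITIVE = {"great", "helpful", "fast", "easy"}
NEGATIVE = {"slow", "frustrating", "rude", "confusing"}


def aspect_sentiment(review: str) -> dict:
    """Label each mentioned aspect as positive, negative, or mixed."""
    scores = defaultdict(int)
    # Split into clauses on punctuation and on the word "but".
    for clause in re.split(r"[.,;!?]|\bbut\b", review.lower()):
        words = set(re.findall(r"[a-z]+", clause))
        clause_score = len(words & POSITIVE) - len(words & NEGATIVE)
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if words & keywords:
                scores[aspect] += clause_score
    return {
        aspect: "positive" if score > 0 else "negative" if score < 0 else "mixed"
        for aspect, score in scores.items()
    }


print(aspect_sentiment("The loan approval was slow, but the app design is great."))
```

The clause split is what lets one review contribute a negative score to one aspect and a positive score to another.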

**9. Visualization & Reporting**

Use dashboards (Power BI, Tableau, Python plots):

* Sentiment trends

* Top issues

* Topic summaries

* Customer pain points

Provide actionable insights to product, engineering, and support teams.

**10. Continuous Monitoring**

* Automate the pipeline to analyze new reviews daily.

* Update models regularly to maintain accuracy.

```python
# Install necessary packages (run once in your environment)
# !pip install nltk gensim spacy textblob
# !python -m spacy download en_core_web_sm

import re
import nltk
import spacy
import gensim
from gensim import corpora
from textblob import TextBlob
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download required resources
nltk.download("punkt")      # tokenizer data; newer NLTK releases may require "punkt_tab" instead
nltk.download("stopwords")

# Sample dataset (normally thousands of reviews)
reviews = [
    "The loan approval process is very slow and frustrating.",
    "Great app design and easy to use features!",
    "Customer support is extremely helpful and polite.",
    "I faced issues during KYC verification.",
    "Too many bugs in the latest update.",
    "Instant loan disbursal. Loved the experience!",
    "Unable to complete payment, the app keeps crashing."
]

# ------------------------------------------------------------
# 1. TEXT CLEANING FUNCTION
# ------------------------------------------------------------
def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)       # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # remove special characters
    text = re.sub(r"\s+", " ", text).strip()  # remove extra spaces
    return text

cleaned_reviews = [clean_text(r) for r in reviews]

# ------------------------------------------------------------
# 2. TOKENIZATION & STOPWORD REMOVAL
# ------------------------------------------------------------
stop_words = set(stopwords.words("english"))

def tokenize(text):
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in stop_words]
    return tokens

tokenized_reviews = [tokenize(r) for r in cleaned_reviews]

print("Tokenized Reviews:")
print(tokenized_reviews)

# ------------------------------------------------------------
# 3. SENTIMENT ANALYSIS (TextBlob)
# ------------------------------------------------------------
def get_sentiment(text):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return "Positive"
    elif polarity < 0:
        return "Negative"
    else:
        return "Neutral"

sentiments = [get_sentiment(r) for r in cleaned_reviews]

print("\nSentiment Analysis:")
for review, sentiment in zip(reviews, sentiments):
    print(f"{review} --> {sentiment}")

# ------------------------------------------------------------
# 4. TOPIC MODELING USING LDA (Gensim)
# ------------------------------------------------------------
# Prepare data for LDA
dictionary = corpora.Dictionary(tokenized_reviews)
corpus = [dictionary.doc2bow(text) for text in tokenized_reviews]

# Train LDA model
lda_model = gensim.models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    passes=10
)

print("\nLDA Topics:")
topics = lda_model.print_topics(num_words=5)
for t in topics:
    print(t)

# ------------------------------------------------------------
# 5. NER USING spaCy (to extract entities like product names, places, etc.)
# ------------------------------------------------------------
nlp = spacy.load("en_core_web_sm")

print("\nNamed Entities:")
for review in reviews:
    doc = nlp(review)
    print(review)
    for ent in doc.ents:
        print("  ", ent.text, "-->", ent.label_)
    print()
```


```text
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Tokenized Reviews:
[['loan', 'approval', 'process', 'slow', 'frustrating'], ['great', 'app', 'design', 'easy', 'use', 'features'], ['customer', 'support', 'extremely', 'helpful', 'polite'], ['faced', 'issues', 'kyc', 'verification'], ['many', 'bugs', 'latest', 'update'], ['instant', 'loan', 'disbursal', 'loved', 'experience'], ['unable', 'complete', 'payment', 'app', 'keeps', 'crashing']]

Sentiment Analysis:
The loan approval process is very slow and frustrating. --> Negative
Great app design and easy to use features! --> Positive
Customer support is extremely helpful and polite. --> Negative
I faced issues during KYC verification. --> Neutral
Too many bugs in the latest update. --> Positive
Instant loan disbursal. Loved the experience! --> Positive
Unable to complete payment, the app keeps crashing. --> Negative

LDA Topics:
(0, '0.065*"app" + 0.065*"loan" + 0.039*"crashing" + 0.039*"complete" + 0.039*"keeps"')
(1, '0.051*"bugs" + 0.051*"latest" + 0.051*"update" + 0.051*"helpful" + 0
```