**Instructions**: Carefully read each question. Use Google Docs, Microsoft Word, or a similar tool
to create a document where you type out each question along with its answer. Save the
document as a PDF, and then upload it to the LMS. Please do not zip or archive the files before
uploading them. Each question carries 20 marks.

1. What is Computational Linguistics and how does it relate to NLP?


Computational Linguistics (CL) is the scientific study of language using computational methods. It focuses on modeling natural language mathematically and algorithmically so that computers can analyze, understand, and generate human language. CL sits at the intersection of linguistics, computer science, cognitive science, and AI.

What Computational Linguistics Does



- Builds formal models of syntax, semantics, phonology, morphology, and discourse.

- Studies how humans process language and how to replicate or simulate that in computers.

- Develops algorithms for parsing, grammar checking, speech recognition, translation, etc.


Relationship to NLP

Natural Language Processing (NLP) is the engineering and application-oriented side of language technologies. It uses insights from computational linguistics to build practical systems.

In short:

- Computational Linguistics = the science behind language computation.

- NLP = the technology built from that science.

2.  Briefly describe the historical evolution of Natural Language Processing.

History:

   1. 1950s–1960s: Early Rule-Based Systems

    Inspired by Alan Turing’s work and early AI research.

    Focus on machine translation (e.g., Georgetown-IBM experiment, 1954).

    Systems relied on hand-crafted grammar rules and dictionaries.

    Progress slowed after the 1966 ALPAC report, which criticized MT results.

  2. 1970s–1980s: Linguistic and Knowledge-Based Approaches

    Development of formal grammars, parsers, and semantic networks.

    Systems like SHRDLU demonstrated limited but deep understanding in small domains.

    Emphasis on encoding linguistic and world knowledge explicitly.

  3. 1990s: Statistical Revolution

    Shift from rules to data-driven methods due to larger corpora and faster computers.

    Introduction of probabilistic models: Hidden Markov Models, n-grams, decision trees.

    Major applications: speech recognition, part-of-speech tagging, early MT.

  4. 2000s: Machine Learning Expansion

    Use of supervised learning, SVMs, CRFs.

    NLP tasks standardized with large datasets (e.g., Penn Treebank).

    Improvements in information extraction, parsing, sentiment analysis.

  5. 2010s: Deep Learning Era

    Neural networks transform NLP—especially with word embeddings (Word2Vec, GloVe).

    RNNs, LSTMs, CNNs dominate tasks like speech recognition and MT.

    2017: Transformers introduced (Vaswani et al.), enabling large-scale pretraining.

  6. 2020s–Present: Large Language Models (LLMs)

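    Massive pretrained models trained on broad corpora, scaling up late-2010s work such as BERT, GPT, and T5 into systems like GPT-3 and LLaMA.

    Achieve strong, sometimes near-human, results on many generation, translation, and question-answering tasks, although reliable reasoning remains an open problem.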

    Increasing focus on alignment, safety, multimodality, and efficiency.

3.  List and explain three major use cases of NLP in today’s tech industry.

    1. Chatbots and Virtual Assistants

        NLP powers systems like customer-service chatbots, Siri, Alexa, and Google Assistant.
        It enables:

        - Understanding user queries (intent detection)

        - Extracting key information (entities like names, dates)

        - Generating natural, human-like responses

        This helps companies automate support, reduce costs, and provide 24/7 service.
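The first two capabilities, intent detection and entity extraction, can be illustrated with a deliberately simple rule-based sketch. The intent names, keyword lists, and date format below are hypothetical and chosen only for illustration; production assistants typically rely on trained classifiers rather than keyword matching.

```python
import re

# Hypothetical keyword lists for two made-up intents.
INTENT_KEYWORDS = {
    "check_balance": ["balance", "how much"],
    "transfer_money": ["transfer", "send"],
}

# Assumed date format for this sketch: DD/MM/YYYY.
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def detect_intent(query):
    """Return the first intent whose keyword appears in the query."""
    lowered = query.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return intent
    return "unknown"


def extract_dates(query):
    """Return every DD/MM/YYYY date found in the query."""
    return DATE_PATTERN.findall(query)


query = "Please transfer 50 dollars to Sam on 05/03/2025"
print(detect_intent(query))   # transfer_money
print(extract_dates(query))   # ['05/03/2025']
```

Keyword rules like these are easy to audit but brittle, which is one reason statistical and neural models replaced them for open-ended queries.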



    2. Machine Translation

        Tools like Google Translate and DeepL rely on NLP to convert text or speech from one language to another.
        Modern translation uses deep learning and large language models to:

        - Understand context

        - Retain meaning across languages

        - Produce fluent, natural translations

        This is crucial for global communication, international business, and multilingual content.

    
    3. Sentiment Analysis

        Companies use NLP to automatically detect opinions and emotions in text from:

        - Social media posts

        - Product reviews

        - Customer feedback

        Sentiment analysis helps businesses understand public perception, track brand reputation, and make data-driven decisions.
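As a rough illustration of how opinions can be detected automatically, the sketch below scores text against a tiny word list. The lexicon, its scores, and the threshold are all hypothetical; real systems use much larger lexicons or trained models.

```python
# Hypothetical mini-lexicon mapping words to polarity scores.
LEXICON = {
    "love": 1.0, "great": 1.0, "amazing": 1.0,
    "slow": -0.5, "terrible": -1.0, "crashing": -1.0,
}


def lexicon_sentiment(text):
    """Average the polarity of known words; return 0.0 if none are found."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def to_label(score, threshold=0.25):
    """Map a numeric score to a coarse sentiment label."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


for review in ["Great support, love it!",
               "Terrible app, keeps crashing.",
               "The package arrived today."]:
    print(review, "->", to_label(lexicon_sentiment(review)))
```

A word-list approach ignores negation and sarcasm ("not great" still scores as positive here), which is why it serves only as a baseline.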

4.  What is text normalization and why is it essential in text processing tasks?

    Text normalization is the process of converting text into a consistent, standard, and machine-readable form. It prepares raw, messy text so that NLP models can process it accurately.

    Common text normalization steps

    Lowercasing (e.g., “Apple” → “apple”)

    Removing punctuation or special characters

    Expanding contractions (“don’t” → “do not”)

    Lemmatization or stemming (“running” → “run”)

    Standardizing spelling or formats (e.g., dates, numbers)

    Handling slang or informal text (“u” → “you”)


    Why is text normalization essential?

    - Reduces variation in text: similar words are converted to a common form, reducing noise and improving model performance.

    - Improves consistency across data: models treat equivalent words or structures the same way, which boosts accuracy in tasks like classification or clustering.

    - Enhances downstream NLP tasks: tasks such as sentiment analysis, machine translation, and search retrieval become more reliable because the input is cleaner and more uniform.


Text normalization ensures that raw textual data is uniform, structured, and easier for NLP systems to understand and process effectively.
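Several of the normalization steps listed above can be chained in a few lines. This is a minimal sketch using only the standard library; the contraction and slang tables are hypothetical stand-ins for the much larger resources a real pipeline would use.

```python
import re

# Hypothetical lookup tables, kept tiny for illustration.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
SLANG = {"u": "you", "gr8": "great"}


def normalize(text):
    """Lowercase, expand contractions, strip punctuation, map slang."""
    text = text.lower()
    text = text.replace("\u2019", "'")  # unify curly and straight apostrophes
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation with spaces
    tokens = [SLANG.get(token, token) for token in text.split()]
    return " ".join(tokens)


print(normalize("Don\u2019t worry, U can use the Apple app!"))
# do not worry you can use the apple app
```

The order matters: contractions are expanded before punctuation is removed, otherwise the apostrophe would already be gone and "don't" would become "don t".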

5. Compare and contrast stemming and lemmatization with suitable
examples.

    1. Stemming

        Definition:
        A rule-based process that chops off word endings to reduce a word to its stem, which may not be a valid dictionary word.

        Characteristics:

        Fast and simple

        Often crude; may produce non-words

        Ignores context and part-of-speech


        Examples (original word → stemmed form):

        - running → run / runn (depending on stemmer)
        - studies → studi
        - better → better (unchanged; no suffix rule applies)

        Use case: When speed matters and perfect accuracy isn’t required (e.g., search engines).


    2. Lemmatization

        Definition:
        A linguistically informed process that reduces a word to its lemma—its dictionary form—using vocabulary and morphological analysis.

        Characteristics:

        More accurate

        Produces valid words

        Considers part-of-speech and context

        Examples (original word with POS → lemma):

        - running (verb) → run
        - studies (noun) → study
        - better (adj.) → good


    Stemming = quick, mechanical cutting of word endings

    Lemmatization = precise, linguistically meaningful reduction of a word to its base form
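The contrast can be made concrete with a toy sketch. Neither function below is a real stemmer or lemmatizer: `toy_stem` is a hypothetical suffix stripper and `toy_lemmatize` is a hypothetical three-entry dictionary lookup, written only to reproduce the examples above. In practice one would use a library implementation such as those in NLTK or spaCy.

```python
def toy_stem(word):
    """Strip the first matching suffix without consulting any dictionary."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Mimic the common 'ies' -> 'i' rule, which yields non-words.
            return stem + "i" if suffix == "ies" else stem
    return word


# Hypothetical lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("running", "verb"): "run",
    ("studies", "noun"): "study",
    ("better", "adj"): "good",
}


def toy_lemmatize(word, pos):
    """Look the word up with its part of speech; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


for word, pos in [("running", "verb"), ("studies", "noun"), ("better", "adj")]:
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word, pos)}")
# running: stem=runn, lemma=run
# studies: stem=studi, lemma=study
# better: stem=better, lemma=good
```

The stemmer needs no linguistic resources but produces "runn" and "studi"; the lemmatizer returns valid words but only because it has a dictionary and a part-of-speech tag to work from.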

6. Write a Python program that uses regular expressions (regex) to extract all
email addresses from the following block of text:
“Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.”

In [1]:
import re

text = """Hello team, please contact us at support@xyz.com for technical issues,
or reach out to our HR at hr@xyz.com. You can also connect with John at
john.doe@xyz.org and jenny via jenny_clarke126@mail.co.us.
For partnership inquiries, email partners@xyz.biz."""

# Regex pattern for extracting email addresses
pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

# Extract all matches
emails = re.findall(pattern, text)

# Print results
print("Extracted email addresses:")
for email in emails:
    print(email)


Extracted email addresses:
support@xyz.com
hr@xyz.com
john.doe@xyz.org
jenny_clarke126@mail.co.us
partners@xyz.biz
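To make the pattern easier to read, the same regular expression can be written with `re.VERBOSE`, which allows whitespace and comments inside the pattern. This is a sketch of a pragmatic matcher for typical addresses, not a full validator of every address the email standards permit; the `sample` string is a shortened excerpt of the text above.

```python
import re

EMAIL_PATTERN = re.compile(
    r"""
    [A-Za-z0-9._%+-]+   # local part: letters, digits, and . _ % + -
    @                   # separator between local part and domain
    [A-Za-z0-9.-]+      # domain labels, possibly containing dots or hyphens
    \.                  # dot before the top-level domain
    [A-Za-z]{2,}        # top-level domain of at least two letters
    """,
    re.VERBOSE,
)

sample = "Write to john.doe@xyz.org or jenny_clarke126@mail.co.us."
print(EMAIL_PATTERN.findall(sample))
# ['john.doe@xyz.org', 'jenny_clarke126@mail.co.us']
```

Because the final component only accepts letters, the sentence-ending period after an address is left out of the match.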


7. Given the sample paragraph below, perform string tokenization and
frequency distribution using Python and NLTK:
“Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.”

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

# Sample paragraph
text = """Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical."""

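# Download tokenizer resources (run once).
# Newer NLTK releases look for 'punkt_tab' instead of 'punkt', so request both.
nltk.download('punkt')
nltk.download('punkt_tab')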

# Tokenization
tokens = word_tokenize(text)

print("Tokens:")
print(tokens)

# Frequency Distribution
freq_dist = FreqDist(tokens)

print("\nFrequency Distribution:")
for word, count in freq_dist.items():
    print(f"{word}: {count}")


In [None]:
NLP: 3
language: 1
is: 2
and: 3
...
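The counts above are case-sensitive and include punctuation tokens. A possible variation, sketched here with only the standard library, lowercases the text and keeps alphabetic tokens before counting. The regex tokenizer is a simplification and is not equivalent to NLTK's `word_tokenize`.

```python
import re
from collections import Counter

text = """Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical."""

# Lowercase first, then keep runs of letters as tokens (punctuation is dropped).
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(tokens)

for word in ["nlp", "and", "is", "language"]:
    print(f"{word}: {counts[word]}")
# nlp: 3
# and: 3
# is: 2
# language: 2
```

After lowercasing, "Language" and "language" are merged, so the count for "language" rises from 1 to 2.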


8.  Create a custom annotator using spaCy or NLTK that identifies and labels
proper nouns in a given text.
(Include your Python code and output in the code box below.)

In [4]:
# ----------- Using spaCy to Identify and Label Proper Nouns -----------

import spacy

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Barack Obama met Elon Musk in California to discuss SpaceX projects."

# Process text
doc = nlp(text)

# Custom annotator: extract and label proper nouns (PROPN)
proper_nouns = [(token.text, token.pos_) for token in doc if token.pos_ == "PROPN"]

print("Proper Nouns Identified:")
for pn in proper_nouns:
    print(pn)


Proper Nouns Identified:
('Barack', 'PROPN')
('Obama', 'PROPN')
('Elon', 'PROPN')
('Musk', 'PROPN')
('California', 'PROPN')
('SpaceX', 'PROPN')
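
The annotator labels each token separately, so "Barack" and "Obama" appear as two entries. One possible extension, sketched below, joins adjacent proper nouns into a single name. To stay self-contained it works on a hard-coded list of (token, tag) pairs: the PROPN tags are copied from the output above, while the "OTHER" tag is a placeholder rather than a real spaCy label.

```python
def group_proper_nouns(tagged_tokens):
    """Join runs of adjacent PROPN tokens into single multi-word names."""
    names, current = [], []
    for text, pos in tagged_tokens:
        if pos == "PROPN":
            current.append(text)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:  # flush a name that ends the sentence
        names.append(" ".join(current))
    return names


# Hard-coded tags; "OTHER" stands in for any non-PROPN tag.
tagged = [
    ("Barack", "PROPN"), ("Obama", "PROPN"), ("met", "OTHER"),
    ("Elon", "PROPN"), ("Musk", "PROPN"), ("in", "OTHER"),
    ("California", "PROPN"), ("to", "OTHER"), ("discuss", "OTHER"),
    ("SpaceX", "PROPN"), ("projects", "OTHER"), (".", "OTHER"),
]

print(group_proper_nouns(tagged))
# ['Barack Obama', 'Elon Musk', 'California', 'SpaceX']
```

With a live spaCy pipeline the same function could be fed `[(token.text, token.pos_) for token in doc]`.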


9. Using Gensim, demonstrate how to train a simple Word2Vec model on the
following dataset consisting of example sentences:
dataset = [
 "Natural language processing enables computers to understand human language",
 "Word embeddings are a type of word representation that allows words with similar
meaning to have similar representation",
 "Word2Vec is a popular word embedding technique used in many NLP applications",
 "Text preprocessing is a critical step before training word embeddings",
 "Tokenization and normalization help clean raw text for modeling"
]
Write code that tokenizes the dataset, preprocesses it, and trains a Word2Vec model using
Gensim.
(Include your Python code and output in the code box below.)


In [None]:
# ---------------------- Training Word2Vec with Gensim ----------------------

import gensim
from gensim.models import Word2Vec
import re

# Dataset
dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
    "Word2Vec is a popular word embedding technique used in many NLP applications",
    "Text preprocessing is a critical step before training word embeddings",
    "Tokenization and normalization help clean raw text for modeling"
]

# ----------- Preprocessing + Tokenization Function -----------
def preprocess(sentence):
    # Lowercase
    sentence = sentence.lower()
    # Remove punctuation (digits are kept so that "word2vec" stays intact)
    sentence = re.sub(r"[^a-z0-9\s]", "", sentence)
    # Tokenize (simple whitespace split)
    tokens = sentence.split()
    return tokens

# Apply preprocessing to the dataset
processed_data = [preprocess(sentence) for sentence in dataset]

print("Tokenized & Preprocessed Sentences:")
for sent in processed_data:
    print(sent)

# ----------- Train Word2Vec Model -----------
model = Word2Vec(
    sentences=processed_data,
    vector_size=50,   # dimensionality of word vectors
    window=5,         # context window
    min_count=1,      # include all words
    workers=2,        # number of threads
    sg=1              # skip-gram (sg=0 for CBOW)
)

# Display the learned vector for a sample word
word = "language"
print(f"\nVector for '{word}':")
print(model.wv[word])

# Display most similar words
print("\nMost similar words to 'word':")
print(model.wv.most_similar("word"))


In [None]:
Output (illustrative; the vector values and similarity scores change from run to run):

Tokenized & Preprocessed Sentences:
['natural', 'language', 'processing', 'enables', 'computers', 'to', 'understand', 'human', 'language']
['word', 'embeddings', 'are', 'a', 'type', 'of', 'word', 'representation', 'that', 'allows', 'words', 'with', 'similar', 'meaning', 'to', 'have', 'similar', 'representation']
['word2vec', 'is', 'a', 'popular', 'word', 'embedding', 'technique', 'used', 'in', 'many', 'nlp', 'applications']
['text', 'preprocessing', 'is', 'a', 'critical', 'step', 'before', 'training', 'word', 'embeddings']
['tokenization', 'and', 'normalization', 'help', 'clean', 'raw', 'text', 'for', 'modeling']

Vector for 'language':
[ 0.00451 -0.00762 ... 0.01537 ]  # (50-dimensional vector)

Most similar words to 'word':
[('embedding', 0.31), ('embeddings', 0.29), ('representation', 0.26), ...]
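
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows that computation on made-up 3-dimensional vectors; the vectors and the helper name `rank_by_similarity` are hypothetical and unrelated to the trained model above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen so the geometry is easy to see.
vectors = {
    "word": [1.0, 2.0, 0.0],
    "embedding": [2.0, 4.0, 0.0],   # same direction as "word"
    "raw": [0.0, 0.0, 3.0],         # orthogonal to "word"
}


def rank_by_similarity(target):
    """Return the other words sorted from most to least similar."""
    others = [w for w in vectors if w != target]
    return sorted(
        others,
        key=lambda w: cosine_similarity(vectors[target], vectors[w]),
        reverse=True,
    )


print(rank_by_similarity("word"))  # ['embedding', 'raw']
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0. On a five-sentence corpus the learned vectors stay close to their random initialization, so the similarity scores in the output above carry little meaning.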


10.  Imagine you are a data scientist at a fintech startup. You’ve been tasked
with analyzing customer feedback. Outline the steps you would take to clean, process,
and extract useful insights using NLP techniques from thousands of customer reviews.
(Include your Python code and output in the code box below.)

In [None]:
# ================== NLP PIPELINE FOR CUSTOMER FEEDBACK ANALYSIS ==================
# Imagine you are a data scientist at a fintech startup analyzing thousands of reviews.
# Below is a practical outline + example Python code demonstrating major steps:
#   1. Data Cleaning
#   2. Preprocessing (tokenization, stopword removal, lemmatization)
#   3. Sentiment Analysis
#   4. Topic Modeling (LDA)
#   5. Extracting Insights

# ----------------------- 1. IMPORT LIBRARIES -----------------------
import pandas as pd
import re
import spacy
import nltk
from nltk.corpus import stopwords
from gensim import corpora, models
from textblob import TextBlob

# Download the NLTK stopword list (run once)
nltk.download("stopwords")

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# ----------------------- 2. SAMPLE DATA (simulate thousands of reviews) -----------------------
reviews = [
    "The app is very easy to use, but sometimes the login process is slow.",
    "Great customer support! I resolved my issue within minutes.",
    "I love the interface, but the transaction fees are too high.",
    "Terrible experience. The app keeps crashing when I transfer money.",
    "Amazing features! Helps me manage my expenses effortlessly."
]

df = pd.DataFrame({"review": reviews})


# ----------------------- 3. CLEANING FUNCTION -----------------------
def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    return text


df["cleaned"] = df["review"].apply(clean_text)


# ----------------------- 4. PREPROCESSING: TOKENIZATION + STOPWORDS + LEMMATIZATION -----------------------
stop_words = set(stopwords.words("english"))

def preprocess(text):
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc
              if token.text not in stop_words and len(token.text) > 2]
    return tokens

df["tokens"] = df["cleaned"].apply(preprocess)


# ----------------------- 5. SENTIMENT ANALYSIS -----------------------
df["sentiment"] = df["review"].apply(lambda x: TextBlob(x).sentiment.polarity)


# ----------------------- 6. TOPIC MODELING (LDA) -----------------------
dictionary = corpora.Dictionary(df["tokens"])
corpus = [dictionary.doc2bow(text) for text in df["tokens"]]

lda_model = models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    passes=10,
    random_state=42   # fixed seed so repeated runs give the same topics
)

# ----------------------- OUTPUT -----------------------
print("=== CLEANED & TOKENIZED REVIEWS ===")
print(df[["review", "tokens"]], "\n")

print("=== SENTIMENT SCORES ===")
print(df[["review", "sentiment"]], "\n")

print("=== TOPICS DISCOVERED (LDA) ===")
for idx, topic in lda_model.print_topics():
    print(f"Topic {idx}: {topic}")


In [None]:
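Output (illustrative; exact tokens, scores, and topics depend on library versions and on LDA's random initialization):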

=== CLEANED & TOKENIZED REVIEWS ===
                                             review
0  The app is very easy to use, but sometimes th...
1  Great customer support! I resolved my issue wi...
2  I love the interface, but the transaction fees...
3  Terrible experience. The app keeps crashing wh...
4  Amazing features! Helps me manage my expenses ...
                       tokens
0   ['app', 'easy', 'use', 'sometimes', 'login', 'process', 'slow']
1             ['great', 'customer', 'support', 'resolve', 'issue', 'within', 'minute']
2        ['love', 'interface', 'transaction', 'fee', 'high']
3        ['terrible', 'experience', 'app', 'keep', 'crash', 'transfer', 'money']
4     ['amazing', 'feature', 'help', 'manage', 'expense', 'effortlessly']

=== SENTIMENT SCORES ===
                                             review  sentiment
0  The app is very easy to use, but sometimes th...     0.3500
1  Great customer support! I resolved my issue wi...     0.8000
2  I love the interface, but the transaction fees...     0.3167
3  Terrible experience. The app keeps crashing wh...    -1.0000
4  Amazing features! Helps me manage my expenses ...     0.6250

=== TOPICS DISCOVERED (LDA) ===
Topic 0: 0.08*"app" + 0.06*"easy" + 0.06*"use" + 0.05*"crash" + 0.05*"terrible" ...
Topic 1: 0.09*"great" + 0.07*"support" + 0.07*"customer" + 0.06*"amazing" + 0.06*"feature" ...
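
The last step in the outline, extracting insights, amounts to aggregating the per-review results. A minimal sketch follows, using hypothetical (polarity, tokens) pairs rather than the pipeline's real output; the threshold of 0.1 is also an assumption.

```python
from collections import Counter

# Hypothetical per-review results: (polarity score, lemmatized tokens).
results = [
    (0.8, ["great", "support", "issue"]),
    (-0.9, ["app", "crash", "transfer"]),
    (-0.4, ["login", "slow", "app"]),
    (0.05, ["fee", "interface"]),
]


def to_label(polarity, threshold=0.1):
    """Bucket a polarity score into positive, negative, or neutral."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# How many reviews fall into each sentiment bucket?
label_counts = Counter(to_label(polarity) for polarity, _ in results)

# Which terms show up most often in negative reviews?
negative_terms = Counter(
    token
    for polarity, tokens in results
    if to_label(polarity) == "negative"
    for token in tokens
)

print(dict(label_counts))             # {'positive': 1, 'negative': 2, 'neutral': 1}
print(negative_terms.most_common(1))  # [('app', 2)]
```

Terms that dominate the negative bucket point to the product areas worth investigating first; on real data the same aggregation could be grouped by LDA topic or by month to track trends.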
