1. What is Computational Linguistics and how does it relate to NLP?

Computational Linguistics (CL) is an interdisciplinary field that combines computer science, linguistics, and artificial intelligence to study and model human language using computational methods. Its primary goal is to enable computers to understand, interpret, and generate human language in a meaningful and contextually appropriate way.

It involves creating algorithms and models that can process linguistic data—such as grammar, syntax, semantics, and pragmatics—to analyze or produce natural language. Researchers in computational linguistics build formal representations of linguistic phenomena and test them using computational systems.

Relation to Natural Language Processing (NLP):
Computational linguistics forms the theoretical and scientific foundation of NLP. While CL focuses on understanding how language works and how it can be modeled computationally, NLP applies these theories to develop practical applications, such as machine translation, speech recognition, chatbots, and sentiment analysis.

Computational Linguistics = Science (the study and modeling of language computationally)

NLP = Engineering (the implementation and application of those models in real-world systems)

Thus, NLP is a direct outcome of advances in computational linguistics, turning linguistic theory into intelligent, language-capable technologies.

2. Briefly describe the historical evolution of Natural Language Processing.

The evolution of Natural Language Processing (NLP) reflects the progress of artificial intelligence in understanding human language. In the 1950s, pioneers like Alan Turing laid its foundation by proposing the Turing Test to evaluate machine intelligence. Early programs such as ELIZA (1966) simulated simple conversations through pattern matching but lacked real understanding. During the 1960s–1970s, researchers developed rule-based systems relying on manually crafted grammars and linguistic rules. Though effective for limited domains, they failed to handle ambiguity and real-world complexity.

From the late 1980s, the field shifted towards statistical and probabilistic models, fueled by increased computational power and the growing availability of machine-readable text. Methods like Hidden Markov Models (HMMs) and n-gram models improved applications in speech and text processing. The 1990s–2000s saw the rise of machine learning, with algorithms such as Support Vector Machines (SVMs) and Conditional Random Fields (CRFs) trained on large annotated corpora, improving accuracy in parsing and tagging tasks.

The 2010s brought the deep learning revolution: recurrent neural networks (RNNs) and, from 2017, Transformer-based models such as BERT and GPT, which capture contextual and semantic nuances far better than earlier methods. Today, NLP powers advanced applications such as chatbots, translation, and sentiment analysis, bridging human communication with artificial intelligence.

3. List and explain three major use cases of NLP in today’s tech industry.

Here are three major use cases of Natural Language Processing (NLP) in today’s tech industry:

Chatbots and Virtual Assistants:
NLP enables systems like ChatGPT, Siri, Alexa, and Google Assistant to understand and respond to spoken or typed language naturally. These applications use speech recognition (for voice input), intent detection, and dialogue management to simulate human-like conversations. Businesses deploy chatbots for customer service, automating queries and improving user experience while reducing operational costs.

Sentiment Analysis:
Companies use NLP to analyze opinions and emotions expressed in social media posts, reviews, and surveys. Sentiment analysis algorithms classify text as positive, negative, or neutral, helping brands assess public perception, monitor reputation, and make data-driven marketing decisions. For instance, analyzing tweets about a new product can reveal customer satisfaction trends.

Machine Translation:
Tools like Google Translate and DeepL rely on NLP and deep learning to translate text between languages accurately. Modern translation systems use transformer architectures to preserve meaning, tone, and context across languages, facilitating global communication and accessibility.

Overall, NLP drives automation, personalization, and cross-language understanding across industries, making technology more intuitive and human-centered.

4. What is text normalization and why is it essential in text processing tasks?

Text normalization is the process of converting raw text into a standardized, consistent, and machine-readable format before analysis or modeling. It ensures that variations in text—caused by differences in case, spelling, punctuation, or formatting—do not distort the meaning or bias computational models.

Normalization typically involves several steps, such as:

Lowercasing: Converting all letters to lowercase (e.g., “Apple” → “apple”).

Removing punctuation and special characters: Cleaning unnecessary symbols that don’t affect meaning.

Expanding contractions: Turning “don’t” into “do not.”

Lemmatization or stemming: Reducing words to their root or base form (e.g., “running” → “run”).

Removing extra spaces or stopwords: Simplifying text for efficient processing.

It is essential in text processing tasks because language data is inherently noisy and inconsistent. Without normalization, algorithms may treat semantically identical words (like “USA” and “U.S.A.”) as different tokens, reducing model accuracy. Proper normalization improves data uniformity, enhances feature extraction, and ensures that downstream NLP tasks—such as sentiment analysis, translation, or information retrieval—perform more accurately and efficiently. In short, it is the foundation of reliable and meaningful text analytics.
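The steps listed above can be sketched with the standard library alone. This is a minimal illustration, not a production normalizer: the `CONTRACTIONS` and `STOPWORDS` tables and the `normalize` function are hypothetical names invented here, and lemmatization is left out because it needs a linguistic resource.

```python
import re

# Hypothetical, deliberately tiny lookup tables; real projects use fuller lists.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOPWORDS = {"the", "a", "an", "is", "it", "to", "and"}


def normalize(text: str) -> list[str]:
    """Return a list of normalized tokens for one piece of raw text."""
    text = text.lower()                                # lowercasing
    text = text.replace("\u2019", "'")                 # unify curly apostrophes
    for short, full in CONTRACTIONS.items():           # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", "", text)            # drop punctuation and symbols
    tokens = text.split()                              # split() also collapses extra spaces
    return [t for t in tokens if t not in STOPWORDS]   # remove stopwords


print(normalize("The U.S.A.  and the USA don’t differ!"))
```

After normalization, “U.S.A.” and “USA” map to the same token, which is the uniformity the paragraph above describes. Which steps to apply remains task-dependent; the order used here is one reasonable choice.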

5. Compare and contrast stemming and lemmatization with suitable examples.

Stemming and lemmatization are both text normalization techniques used in Natural Language Processing (NLP) to reduce words to their base or root forms, but they differ in approach and accuracy.


a. Stemming:

Definition: A rule-based process that removes affixes (mostly suffixes) from words to obtain a stem, often without regard to whether the result is a real word.

Approach: Uses simple heuristics (like chopping off “-ing,” “-ed,” or “-s”) without understanding context or part of speech.

Example (Porter stemmer):

“Running” → “run”

“Studies” → “studi”

“Better” → “better” (unchanged; a stemmer cannot relate it to “good”)

Advantages: Fast and computationally inexpensive.

Disadvantages: Can produce non-words or incorrect stems due to its crude method.

b. Lemmatization:

Definition: A more sophisticated process that reduces words to their lemma (dictionary form) using vocabulary and morphological analysis.

Approach: Considers the part of speech and context of the word.

Example:

“Running” → “run” (as a verb)

“Studies” → “study”

“Better” → “good” (as an adjective)

Advantages: Produces valid dictionary words and contextually accurate results.

Disadvantages: Slower and requires linguistic resources like WordNet.

Stemming is faster but less accurate, while lemmatization is linguistically informed and yields cleaner, more meaningful results—crucial for high-quality NLP applications.
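To make the contrast concrete, here is a toy sketch in plain Python. Both functions are simplified stand-ins written for this answer, not the algorithms used by real libraries: `toy_stem` applies a few suffix rules, and `toy_lemmatize` looks words up in a hypothetical three-entry lexicon (`LEMMAS`). In practice one would reach for something like NLTK's `PorterStemmer` and `WordNetLemmatizer`.

```python
def toy_stem(word: str) -> str:
    """Crude suffix stripping: no dictionary, no part of speech."""
    word = word.lower()
    if word.endswith("ies"):
        return word[:-3] + "i"            # studies -> studi
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]                  # running -> runn
        if len(stem) > 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]              # runn -> run (undouble the final letter)
        return stem
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                  # plain plural "s"
    return word


# Hypothetical mini-lexicon keyed by (word, part of speech).
LEMMAS = {
    ("running", "v"): "run",
    ("studies", "n"): "study",
    ("better", "a"): "good",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Dictionary lookup; falls back to the lowercased word when unknown."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)


for word, pos in [("Running", "v"), ("Studies", "n"), ("Better", "a")]:
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word, pos)}")
```

The stemmer needs no resources but returns the non-word “studi” and has no way to connect “better” with “good”; the lemmatizer gets both right only because it is given a lexicon and a part-of-speech tag.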

6. Write a Python program that uses regular expressions (regex) to extract all email addresses from the following block of text:

“Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.”


In [1]:
import re

In [2]:
# Input text
text = """
Hello team, please contact us at support@xyz.com for technical issues, or reach out to our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny via jenny_clarke126@mail.co.us.
For partnership inquiries, email partners@xyz.biz.
"""

In [3]:
# Regular expression pattern for emails
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Extract all email addresses
emails = re.findall(pattern, text)

In [4]:
print("Extracted Email Addresses:")
for email in emails:
    print(email)

Extracted Email Addresses:
support@xyz.com
hr@xyz.com
john.doe@xyz.org
jenny_clarke126@mail.co.us
partners@xyz.biz
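A possible refinement, sketched under a few assumptions: the same pattern is compiled once, and duplicates are removed while preserving order. Treating addresses as case-insensitive is an assumption made here for deduplication; `EMAIL_RE`, `extract_emails`, and the `sample` string are illustrative names, not part of the exercise.

```python
import re

# Same character classes as the pattern above, compiled once for reuse.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def extract_emails(text: str) -> list[str]:
    """Return unique addresses in order of first appearance, lowercased."""
    # dict.fromkeys keeps insertion order, so it doubles as an ordered set.
    seen = dict.fromkeys(m.group(0).lower() for m in EMAIL_RE.finditer(text))
    return list(seen)


sample = "Mail hr@xyz.com or HR@xyz.com; partners@xyz.biz. Not an address: user@localhost"
print(extract_emails(sample))
```

The pattern is a pragmatic approximation rather than a full address validator: it requires a dot followed by at least two letters in the domain, so `user@localhost` is skipped, and the sentence-final period after `partners@xyz.biz` is not captured.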


7. Given the sample paragraph below, perform string tokenization and frequency distribution using Python and NLTK:

“Natural Language Processing (NLP) is a fascinating field that combines linguistics, computer science, and artificial intelligence. It enables machines to understand, interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.”

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

In [3]:
# Input paragraph
text = """
Natural Language Processing (NLP) is a fascinating field that combines linguistics, computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots, sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.
"""

In [7]:
# Download the tokenizer resources required by word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [8]:
# Tokenization
tokens = word_tokenize(text)

# Frequency distribution
freq_dist = FreqDist(tokens)

In [9]:
print("Tokens:")
print(tokens)
print("\nFrequency Distribution (Top 10 Words):")
for word, freq in freq_dist.most_common(10):
    print(f"{word}: {freq}")

Tokens:
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'It', 'enables', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'Applications', 'of', 'NLP', 'include', 'chatbots', ',', 'sentiment', 'analysis', ',', 'and', 'machine', 'translation', '.', 'As', 'technology', 'advances', ',', 'the', 'role', 'of', 'NLP', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical', '.']

Frequency Distribution (Top 10 Words):
,: 7
.: 4
NLP: 3
and: 3
is: 2
of: 2
Natural: 1
Language: 1
Processing: 1
(: 1
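In the output above, punctuation dominates the top of the distribution, and “Language” and “language” are counted separately. The sketch below shows one way to get word-only counts. It uses a regex tokenizer and `collections.Counter` as stand-ins for `word_tokenize` and `FreqDist` (an assumption made so that the snippet runs without NLTK data), on a shortened copy of the paragraph.

```python
import re
from collections import Counter

text = (
    "Natural Language Processing (NLP) is a fascinating field that combines "
    "linguistics, computer science, and artificial intelligence. It enables "
    "machines to understand, interpret, and generate human language."
)

# Stand-in tokenizer (an assumption): lowercase, then keep runs of letters only,
# which discards punctuation and merges "Language" with "language".
words = re.findall(r"[a-z]+", text.lower())

counts = Counter(words)
for word, freq in counts.most_common(3):
    print(f"{word}: {freq}")
```

With the NLTK tokens from the cell above, a similar effect should be achievable by filtering before counting, for example `FreqDist(t.lower() for t in tokens if t.isalpha())`.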


8. Create a custom annotator using spaCy or NLTK that identifies and labels proper nouns in a given text.

In [10]:
import spacy

In [11]:
# Load English language model
nlp = spacy.load("en_core_web_sm")

# Input text
text = """
OpenAI developed ChatGPT, and Google created Bard.
Elon Musk founded SpaceX and Tesla.
Microsoft is headquartered in Redmond, Washington.
"""

In [12]:
# Process the text
doc = nlp(text)

In [13]:
print("Proper Nouns Identified:")
for token in doc:
    if token.pos_ == "PROPN":   # Check if token is a proper noun
        print(f"{token.text}  →  {token.pos_}")

Proper Nouns Identified:
OpenAI  →  PROPN
Google  →  PROPN
Bard  →  PROPN
Elon  →  PROPN
Musk  →  PROPN
SpaceX  →  PROPN
Tesla  →  PROPN
Microsoft  →  PROPN
Redmond  →  PROPN
Washington  →  PROPN


9. Using Gensim, demonstrate how to train a simple Word2Vec model on the following dataset consisting of example sentences:

dataset = [
 "Natural language processing enables computers to understand human language",
 "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
 "Word2Vec is a popular word embedding technique used in many NLP applications",
 "Text preprocessing is a critical step before training word embeddings",
 "Tokenization and normalization help clean raw text for modeling"
]

Write code that tokenizes the dataset, preprocesses it, and trains a Word2Vec model using Gensim.


In [15]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [16]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

In [17]:
# Sample dataset
dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
    "Word2Vec is a popular word embedding technique used in many NLP applications",
    "Text preprocessing is a critical step before training word embeddings",
    "Tokenization and normalization help clean raw text for modeling"
]

In [18]:
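# Tokenization and preprocessing: simple_preprocess lowercases each sentence,
# keeps alphabetic tokens of 2 to 15 characters (so "a" is dropped), and
# splits "Word2Vec" into "word" and "vec"
processed_data = [simple_preprocess(sentence) for sentence in dataset]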

In [19]:
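# Train Word2Vec model: 100-dimensional vectors, context window of 5 words,
# keep every word (min_count=1), skip-gram architecture (sg=1), 100 passes over the data
model = Word2Vec(sentences=processed_data, vector_size=100, window=5, min_count=1, sg=1, epochs=100)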

In [20]:
# Explore vocabulary
print("Vocabulary:", list(model.wv.key_to_index.keys()))
print("\nMost similar to 'word':", model.wv.most_similar('word'))

Vocabulary: ['word', 'text', 'is', 'similar', 'representation', 'embeddings', 'to', 'language', 'modeling', 'for', 'raw', 'clean', 'help', 'normalization', 'and', 'tokenization', 'training', 'before', 'step', 'critical', 'preprocessing', 'applications', 'nlp', 'many', 'in', 'used', 'technique', 'embedding', 'popular', 'vec', 'have', 'meaning', 'with', 'words', 'allows', 'that', 'of', 'type', 'are', 'human', 'understand', 'computers', 'enables', 'processing', 'natural']

Most similar to 'word': [('allows', 0.594346284866333), ('to', 0.5336982011795044), ('have', 0.5336185097694397), ('with', 0.5330169200897217), ('type', 0.4748493432998657), ('popular', 0.47382819652557373), ('representation', 0.4706820845603943), ('that', 0.4632415175437927), ('words', 0.44776225090026855), ('language', 0.4428447186946869)]
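To illustrate what `sg=1` and `window` control, the sketch below generates the (center, context) pairs that a skip-gram model learns from. It is a simplified view written for this answer: `skipgram_pairs` is a hypothetical helper, and Gensim's actual training additionally varies the effective window size and uses negative sampling by default, both of which are omitted here.

```python
def skipgram_pairs(tokens: list[str], window: int) -> list[tuple[str, str]]:
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge, clipped at sentence start
        hi = min(len(tokens), i + window + 1)   # right edge, clipped at sentence end
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["text", "preprocessing", "is", "critical"]
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

Words that share many contexts end up with similar vectors, and `most_similar` ranks neighbours by cosine similarity. With only five training sentences, the similarity scores printed above probably reflect little more than which words happened to co-occur, so they should not be read as meaningful semantics.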


10. Imagine you are a data scientist at a fintech startup. You’ve been tasked with analyzing customer feedback. Outline the steps you would take to clean, process, and extract useful insights using NLP techniques from thousands of customer reviews.




As a data scientist analyzing customer feedback for a fintech startup, the goal is to transform unstructured text (reviews) into actionable insights. Here’s a structured outline of the process using NLP techniques:

1. Data Collection and Understanding

Gather customer reviews from multiple sources such as mobile apps, emails, social media, or survey forms.

Inspect data for structure (JSON, CSV, text files) and metadata (date, rating, user ID).

2. Data Cleaning and Preprocessing

Remove noise: Eliminate duplicates, null values, and irrelevant text (URLs, emojis, HTML tags).

Text normalization:

Convert to lowercase.

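Remove punctuation, numbers, and stopwords where the downstream task allows it (for sentiment analysis, keep negations such as “not”, which many stopword lists include).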

Apply tokenization to split text into words.

Use lemmatization or stemming to reduce words to their base form.

Handle special cases: Detect and manage slang, abbreviations, or domain-specific terms (e.g., “KYC”, “OTP”).

3. Exploratory Text Analysis

Generate word frequency distributions and word clouds to identify common topics.

Perform n-gram analysis (bigrams, trigrams) to detect recurring phrases like “loan approval” or “account issue.”

4. Sentiment Analysis

Use off-the-shelf tools such as VADER or TextBlob (both largely lexicon- and rule-based) to classify reviews as positive, negative, or neutral.

Aggregate sentiment scores by product, service type, or time period to identify trends.

5. Topic Modeling

Apply Latent Dirichlet Allocation (LDA) or BERTopic to uncover key themes (e.g., “customer support,” “transaction delays”).

Label each review by dominant topic for managerial insights.

6. Named Entity Recognition (NER)

Use spaCy or transformers to identify named entities like product names, competitors, or transaction types.

7. Visualization and Reporting

Visualize sentiment trends over time, word frequencies, and topic distributions.

Prepare actionable reports highlighting major customer concerns, satisfaction drivers, and improvement areas.

8. Continuous Monitoring

Automate pipelines to process new feedback in real time, integrating dashboards with tools like Streamlit or Power BI for ongoing insight delivery.

This structured NLP workflow turns raw customer reviews into strategic intelligence, helping fintech companies enhance products, improve user experience, and boost customer retention.
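A compact sketch of steps 2 to 4, using only the standard library. Everything in it is illustrative: the three reviews are invented, and the `POSITIVE` and `NEGATIVE` word sets are a toy lexicon standing in for a tool such as VADER or TextBlob.

```python
import re
from collections import Counter

# Hypothetical reviews and word lists, invented purely for illustration.
reviews = [
    "Loan approval was FAST and easy! <br> Great app.",
    "Terrible support, my OTP never arrived. See https://example.com/help",
    "Loan approval took weeks. Slow and frustrating.",
]
POSITIVE = {"fast", "easy", "great"}
NEGATIVE = {"terrible", "slow", "frustrating", "never"}


def clean(review: str) -> list[str]:
    """Strip URLs and HTML tags, lowercase, and tokenize into words."""
    review = re.sub(r"https?://\S+", " ", review)   # remove URLs
    review = re.sub(r"<[^>]+>", " ", review)        # remove HTML tags
    return re.findall(r"[a-z]+", review.lower())


def sentiment(tokens: list[str]) -> str:
    """Label by counting lexicon hits; ties count as neutral."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


cleaned = [clean(r) for r in reviews]
labels = [sentiment(tokens) for tokens in cleaned]

# Bigram counts surface recurring phrases across all reviews.
bigrams = Counter(pair for tokens in cleaned for pair in zip(tokens, tokens[1:]))

print(Counter(labels))
print(bigrams.most_common(1))
```

Even at this scale the two outputs illustrate the workflow: the label counts give an overall sentiment picture, and the most frequent bigram (“loan approval”) points at the topic customers mention most. A real pipeline would swap in proper tokenization, a validated sentiment model, and topic modeling as outlined above.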