# Theory Questions

**Question 1: What is Computational Linguistics and how does it relate to NLP?**

**Answer:** Computational Linguistics is a field that applies computer science techniques to analyze, model, and understand human language. It focuses on creating algorithms and linguistic models that explain how language works.

It is closely related to Natural Language Processing (NLP), and the two fields play complementary roles:

1. Computational Linguistics provides the theoretical and linguistic foundations (syntax, semantics, morphology, phonology).
2. NLP applies these theories to build practical language-processing systems, such as chatbots, translators, speech recognizers, and text analyzers.

**Question 2: Briefly describe the historical evolution of Natural Language Processing.**

**Answer:**
The evolution of NLP can be summarized in four main phases:

1. 1950s–1960s: Rule-Based Systems
* NLP began with symbolic, hand-crafted grammar rules and early machine translation experiments.

2. 1970s–1980s: Linguistic & Knowledge-Based Approaches
* Systems used formal grammars, parsing techniques, and expert knowledge bases to understand language.

3. 1990s–2010: Statistical NLP
* With large corpora and improved computing power, probabilistic models (HMMs, n-grams, CRFs) and machine learning became dominant.

4. 2010–Present: Neural & Deep Learning Era
* Neural networks, word embeddings, LSTMs, and later Transformers (e.g., BERT, GPT) revolutionized NLP with high accuracy and end-to-end learning.

**Question 3: List and explain three major use cases of NLP in today’s tech industry.**

**Answer:** Three major use cases of NLP in today’s tech industry are:

1. **Machine Translation:**
Converts text or speech from one language to another (e.g., Google Translate), enabling cross-language communication and global content access.

2. **Sentiment Analysis:**
Identifies opinions and emotions in text, widely used in social media monitoring, customer feedback analysis, and brand reputation management.

3. **Chatbots & Virtual Assistants:**
Systems like Siri, Alexa, and customer-support bots use NLP to understand user queries and generate meaningful responses, improving automation and user experience.
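
As a rough illustration of the sentiment analysis use case, the sketch below labels text by counting hits against two tiny word lists. The word sets and the `simple_sentiment` function are hypothetical and invented for this example; production systems typically rely on much larger lexicons or trained models.

```python
# Hypothetical mini-lexicons, invented for this illustration.
POSITIVE = {"great", "excellent", "love", "smooth", "fast"}
NEGATIVE = {"slow", "bad", "fail", "broken", "terrible"}

def simple_sentiment(text):
    """Label text by counting positive versus negative lexicon hits."""
    # Split on whitespace and strip common trailing punctuation from each token.
    tokens = [tok.strip(".,!?").lower() for tok in text.split()]
    score = sum(tok in POSITIVE for tok in tokens) - sum(tok in NEGATIVE for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for review in ["Customer support is excellent!",
               "The app is slow and payments fail.",
               "I opened the app today."]:
    print(review, "->", simple_sentiment(review))
```

A word-count approach like this ignores negation ("not great") and context, which is one reason model-based methods are usually preferred in practice.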


**Question 4: What is text normalization and why is it essential in text processing tasks?**

**Answer:**
* Text normalization is the process of converting text into a consistent, standardized form before analysis.
* It includes steps like lowercasing, removing punctuation, expanding contractions, correcting spellings, and lemmatization/stemming.
* It is essential because raw text is often messy and inconsistent. Normalization ensures uniformity, reduces noise, and improves the accuracy of downstream NLP tasks such as classification, sentiment analysis, and machine translation.
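
The normalization steps listed above can be sketched with the standard library alone. The `normalize` function and the small `CONTRACTIONS` table are assumptions made for illustration; a real pipeline would use a much fuller contraction list and would usually add spelling correction and lemmatization.

```python
import re
import string

# Hypothetical, deliberately tiny contraction table for illustration only.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "don't": "do not",
}

def normalize(text):
    """Lowercase, expand a few contractions, strip punctuation, collapse whitespace."""
    text = text.lower()
    # Map curly apostrophes to straight ones so the contraction lookup matches.
    text = text.replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

print(normalize("It's GREAT,   but I can't log in!!"))
# prints: it is great but i cannot log in
```

Contractions are expanded before punctuation is removed; otherwise the apostrophes would already be gone and the lookup would never match.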

**Question 5: Compare and contrast stemming and lemmatization with suitable examples.**

**Answer:** Stemming and lemmatization both reduce words to a base form, but they differ in method and output.

**Stemming**
* Applies heuristic rules that strip suffixes, without consulting a dictionary.
* Often produces non-dictionary words.

Example:
1. “playing” → play
2. “studies” → studi

**Lemmatization**
* Uses vocabulary and grammatical rules to return the meaningful root form (lemma).
* Produces valid dictionary words, although the result depends on knowing the word's part of speech.

Example:
1. “playing” → play
2. “studies” → study
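
The contrast can be made concrete with a toy sketch. The suffix rules below are loosely inspired by, but not equivalent to, the Porter stemmer, and `LEMMA_TABLE` is a hypothetical miniature dictionary standing in for a full vocabulary such as WordNet. Both helper functions are assumptions made for illustration.

```python
# Toy suffix rules: (suffix to strip, replacement). NOT the real Porter algorithm.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Hypothetical miniature lemma dictionary; a real lemmatizer consults a full
# vocabulary together with part-of-speech information.
LEMMA_TABLE = {"studies": "study", "playing": "play", "better": "good", "mice": "mouse"}

def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require a few characters to remain so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

for w in ["playing", "studies", "better", "mice"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The rule-based stemmer leaves irregular forms such as "better" and "mice" unchanged, while the dictionary lookup maps them to "good" and "mouse". That is the practical difference between the two approaches.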


# Practical Questions

In [1]:
'''
Question 6: Write a Python program that uses regular expressions (regex) to extract all
email addresses from the following block of text:

“Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.”

(Include your Python code and output in the code box below.)

Answer:
'''
import re

text = """
Hello team, please contact us at support@xyz.com for technical issues,
or reach out to our HR at hr@xyz.com. You can also connect with John
at john.doe@xyz.org and jenny via jenny_clarke126@mail.co.us.
For partnership inquiries, email partners@xyz.biz.
"""

# Regex pattern for email extraction
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

emails = re.findall(pattern, text)

print(emails)


['support@xyz.com', 'hr@xyz.com', 'john.doe@xyz.org', 'jenny_clarke126@mail.co.us', 'partners@xyz.biz']


In [4]:
'''
Question 7: Given the sample paragraph below, perform string tokenization and
frequency distribution using Python and NLTK:

“Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.”

(Include your Python code and output in the code box below.)

Answer:

'''

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist

text = """
Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.
"""

# Tokenization
tokens = word_tokenize(text)

# Stopword Removal
stop_words = set(stopwords.words("english"))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words and word.isalpha()]

# Frequency Distribution
freq_dist = FreqDist(filtered_tokens)

print("Filtered Tokens:")
print(filtered_tokens)

print("\nFrequency Distribution:")
for word, count in freq_dist.items():
    print(f"{word:15} : {count}")




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Filtered Tokens:
['Natural', 'Language', 'Processing', 'NLP', 'fascinating', 'field', 'combines', 'linguistics', 'computer', 'science', 'artificial', 'intelligence', 'enables', 'machines', 'understand', 'interpret', 'generate', 'human', 'language', 'Applications', 'NLP', 'include', 'chatbots', 'sentiment', 'analysis', 'machine', 'translation', 'technology', 'advances', 'role', 'NLP', 'modern', 'solutions', 'becoming', 'increasingly', 'critical']

Frequency Distribution:
Natural         : 1
Language        : 1
Processing      : 1
NLP             : 3
fascinating     : 1
field           : 1
combines        : 1
linguistics     : 1
computer        : 1
science         : 1
artificial      : 1
intelligence    : 1
enables         : 1
machines        : 1
understand      : 1
interpret       : 1
generate        : 1
human           : 1
language        : 1
Applications    : 1
include         : 1
chatbots        : 1
sentiment       : 1
analysis        : 1
machine         : 1
translation     : 1
techn

In [5]:
'''
Question 8: Create a custom annotator using spaCy or NLTK that identifies and labels
proper nouns in a given text.

(Include your Python code and output in the code box below.)

Answer:
'''
import spacy

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

text = """
Amazon is planning to open a new office in Mumbai.
Jeff Bezos visited India last year to discuss partnerships with Reliance Industries.
"""

# Process text
doc = nlp(text)

# Custom annotator: extract and label proper nouns
proper_nouns = [(token.text, "PROPER_NOUN") for token in doc if token.pos_ == "PROPN"]

print("Proper Noun Annotations:")
for word, label in proper_nouns:
    print(f"{word} --> {label}")


Proper Noun Annotations:
Amazon --> PROPER_NOUN
Mumbai --> PROPER_NOUN
Jeff --> PROPER_NOUN
Bezos --> PROPER_NOUN
India --> PROPER_NOUN
Reliance --> PROPER_NOUN
Industries --> PROPER_NOUN


In [9]:
'''
Question 9: Using Gensim, demonstrate how to train a simple Word2Vec model on the
following dataset consisting of example sentences:

dataset = [
 "Natural language processing enables computers to understand human language",
 "Word embeddings are a type of word representation that allows words with similar
meaning to have similar representation",
 "Word2Vec is a popular word embedding technique used in many NLP applications",
 "Text preprocessing is a critical step before training word embeddings",
 "Tokenization and normalization help clean raw text for modeling"
]

Write code that tokenizes the dataset, preprocesses it, and trains a Word2Vec model using
Gensim.

(Include your Python code and output in the code box below.)

Answer:

'''


import gensim
from gensim.models import Word2Vec
import re

# Dataset
dataset = [
 "Natural language processing enables computers to understand human language",
 "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
 "Word2Vec is a popular word embedding technique used in many NLP applications",
 "Text preprocessing is a critical step before training word embeddings",
 "Tokenization and normalization help clean raw text for modeling"
]

# -------- Preprocessing + Tokenization --------
def preprocess(sentence):
    sentence = sentence.lower()                         # Lowercase
    sentence = re.sub(r'[^a-zA-Z\s]', '', sentence)    # Keep letters and whitespace only (drops punctuation AND digits, so "word2vec" -> "wordvec")
    tokens = sentence.split()                          # Tokenization
    return tokens

processed_data = [preprocess(sent) for sent in dataset]

print("Tokenized & Preprocessed Sentences:\n", processed_data)

# -------- Train Word2Vec Model --------
model = Word2Vec(
    sentences=processed_data,
    vector_size=50,     # Dimensionality of vectors
    window=5,           # Context window
    min_count=1,        # Include all words
    workers=2,
    sg=1                # Skip-gram model
)

# -------- Example Output: Similar Words --------
print("\nSimilar words to 'language':")
print(model.wv.most_similar("language"))


Tokenized & Preprocessed Sentences:
 [['natural', 'language', 'processing', 'enables', 'computers', 'to', 'understand', 'human', 'language'], ['word', 'embeddings', 'are', 'a', 'type', 'of', 'word', 'representation', 'that', 'allows', 'words', 'with', 'similar', 'meaning', 'to', 'have', 'similar', 'representation'], ['wordvec', 'is', 'a', 'popular', 'word', 'embedding', 'technique', 'used', 'in', 'many', 'nlp', 'applications'], ['text', 'preprocessing', 'is', 'a', 'critical', 'step', 'before', 'training', 'word', 'embeddings'], ['tokenization', 'and', 'normalization', 'help', 'clean', 'raw', 'text', 'for', 'modeling']]

Similar words to 'language':
[('used', 0.3150634169578552), ('for', 0.2372531294822693), ('meaning', 0.21844501793384552), ('of', 0.20309294760227203), ('to', 0.1847095489501953), ('and', 0.17878921329975128), ('allows', 0.14185984432697296), ('text', 0.13890081644058228), ('are', 0.13452698290348053), ('preprocessing', 0.11213743686676025)]
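
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows how such a score is computed, using made-up 3-dimensional vectors rather than the trained embeddings; the `cosine_similarity` helper is an illustrative assumption, not part of Gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, made up for illustration only.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # points in the same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))   # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))   # 0: no shared direction
```

With only five training sentences, the low scores and the presence of function words such as "for", "of", and "to" among the nearest neighbours most likely reflect the small corpus rather than meaningful semantic similarity.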


**Question 10: Imagine you are a data scientist at a fintech startup. You’ve been tasked with analyzing customer feedback. Outline the steps you would take to clean, process, and extract useful insights using NLP techniques from thousands of customer reviews.**

**Answer:**
Steps to Clean, Process & Extract Insights from Customer Reviews

1. **Data Collection & Loading**
* Import customer reviews from CSV/Database.
* Remove duplicates and missing values.

2. **Text Cleaning**
* Lowercasing
* Removing punctuation, numbers, and special characters

3. **Tokenization**
* Split text into individual words/tokens.

4. **Stopword Removal & Lemmatization**
* Remove stopwords that carry little meaning.
* Lemmatize the remaining tokens for normalization.

5. **Exploratory Text Analysis**
* Word frequency distribution
* N-grams (common phrases)
* Word clouds or summary statistics

6. **Sentiment Analysis**
* Rule-based (VADER) or model-based analysis
* Identify positive, negative, and neutral reviews

7. **Topic Modeling**
* Use LDA to discover key themes in customer feedback
* Helps identify recurring issues and features that customers mention

8. **Insight Extraction & Reporting**
* Summaries of top complaints and positive highlights
* Visual charts to present findings to stakeholders
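
The n-gram part of the exploratory analysis step can be sketched with the standard library. The `ngrams` helper and the `cleaned_reviews` token lists are hypothetical stand-ins for real, already-cleaned review data.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical, already-cleaned token lists standing in for real reviews.
cleaned_reviews = [
    ["payment", "verification", "failed"],
    ["payment", "verification", "slow"],
    ["login", "process", "slow"],
]

# Count bigrams across all reviews to surface recurring phrases.
bigram_counts = Counter(
    gram for tokens in cleaned_reviews for gram in ngrams(tokens, 2)
)
print(bigram_counts.most_common(1))
# prints: [(('payment', 'verification'), 2)]
```

Frequent bigrams such as ("payment", "verification") often point to specific product areas more clearly than single-word counts do.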

In [12]:
'''
Question 10: Imagine you are a data scientist at a fintech startup. You’ve been tasked
with analyzing customer feedback. Outline the steps you would take to clean, process,
and extract useful insights using NLP techniques from thousands of customer reviews.

(Include your Python code and output in the code box below.)

Answer:
'''
# Install required libraries
#!pip install nltk gensim wordcloud

import pandas as pd
import re
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # required by word_tokenize in recent NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.probability import FreqDist
from gensim import corpora, models

# ---------------- SAMPLE DATA ----------------
reviews = [
    "The app is great but the login process is slow.",
    "Customer support is excellent!",
    "I faced issues with payment verification.",
    "Very smooth experience. Loved the UI.",
    "Payments fail sometimes, please fix this.",
    "Fast service but needs better security features."
]

df = pd.DataFrame({"review": reviews})

# ---------------- PREPROCESSING ----------------
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_review(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word)
              for word in tokens if word not in stop_words]
    return tokens

df["cleaned"] = df["review"].apply(clean_review)

# ---------------- SENTIMENT ANALYSIS ----------------
sia = SentimentIntensityAnalyzer()

def get_sentiment(txt):
    return sia.polarity_scores(txt)

df["sentiment"] = df["review"].apply(get_sentiment)

# ---------------- WORD FREQUENCY ----------------
all_words = [word for tokens in df["cleaned"] for word in tokens]
freq = FreqDist(all_words)

# ---------------- TOPIC MODELING ----------------
dictionary = corpora.Dictionary(df["cleaned"])
corpus = [dictionary.doc2bow(text) for text in df["cleaned"]]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# ---------------- CLEAN OUTPUT ----------------
print("CLEANED REVIEWS:")
print(df["cleaned"], "\n")

print("WORD FREQUENCY (TOP 10):")
print(freq.most_common(10), "\n")

print("SENTIMENT SCORES:")
print(df["sentiment"], "\n")

print("TOPIC MODELING (LDA):")
for i, topic in lda.print_topics():
    print(f"Topic {i}: {topic}")



CLEANED REVIEWS:
0                  [app, great, login, process, slow]
1                      [customer, support, excellent]
2               [faced, issue, payment, verification]
3                     [smooth, experience, loved, ui]
4             [payment, fail, sometimes, please, fix]
5    [fast, service, need, better, security, feature]
Name: cleaned, dtype: object 

WORD FREQUENCY (TOP 10):
[('payment', 2), ('app', 1), ('great', 1), ('login', 1), ('process', 1), ('slow', 1), ('customer', 1), ('support', 1), ('excellent', 1), ('faced', 1)] 

SENTIMENT SCORES:
0    {'neg': 0.0, 'neu': 0.779, 'pos': 0.221, 'comp...
1    {'neg': 0.0, 'neu': 0.23, 'pos': 0.77, 'compou...
2    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
3    {'neg': 0.0, 'neu': 0.546, 'pos': 0.454, 'comp...
4    {'neg': 0.357, 'neu': 0.408, 'pos': 0.235, 'co...
5    {'neg': 0.0, 'neu': 0.418, 'pos': 0.582, 'comp...
Name: sentiment, dtype: object 

TOPIC MODELING (LDA):
Topic 0: 0.055*"security" + 0.055*"service" + 0

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
