1. What is Computational Linguistics and how does it relate to NLP?

Computational Linguistics (CL) is an interdisciplinary field that focuses on the scientific study and computational modeling of human language. It involves developing algorithms, models, and theories to understand linguistic structures, syntax, semantics, and phonetics using computers. CL emphasizes theoretical aspects, such as formal grammars and knowledge representation, to analyze and generate language in a way that mimics human cognitive processes.

Natural Language Processing (NLP), while closely related, is more application-driven. It applies computational techniques to enable machines to process, understand, and interact with human language in practical scenarios, such as chatbots or translation systems.

The relationship between CL and NLP is foundational: CL provides the theoretical underpinnings and linguistic insights that inform NLP techniques, while NLP often implements CL models in real-world technologies. In essence, CL is more about studying language computationally for knowledge, whereas NLP focuses on engineering solutions for useful tasks. Many professionals work at the intersection of both fields, with CL advancing the science that powers NLP innovations.

2. Briefly describe the historical evolution of Natural Language Processing.

The field of Natural Language Processing (NLP) has evolved significantly since its inception, driven by advancements in computing, linguistics, and artificial intelligence.

* 1940s-1950s: Origins in Machine Translation: NLP began in the late 1940s post-World War II, with early efforts focused on automatic translation between languages, such as Russian to English, motivated by geopolitical needs. Pioneering work included Warren Weaver's 1949 memorandum on translation using computers. The 1950s saw symbolic approaches, including Noam Chomsky's 1957 book "Syntactic Structures," which introduced generative grammar and influenced rule-based systems.

* 1960s-1970s: Rule-Based Systems and Early AI: The 1960s introduced conversational systems like ELIZA (1966), a simple chatbot simulating a therapist, and SHRDLU (1970), which understood natural language commands in a simulated blocks world. The 1970s emphasized conceptual ontologies to structure real-world knowledge for language understanding. However, funding and progress slowed during the "AI Winter," as overhyped expectations ran into the limits of hand-written rules and available computing power.

* 1980s-1990s: Statistical and Machine Learning Shift: The 1980s marked a transition to statistical methods, using probabilities and corpora for tasks like speech recognition. The 1990s integrated machine learning, with algorithms like Hidden Markov Models enabling data-driven approaches, fueled by increasing data availability and computational resources.

* 2000s-Present: Deep Learning and Modern Era: The 2000s saw the rise of neural networks and large datasets. Key milestones include Word2Vec (2013) for word embeddings, the widespread adoption of recurrent architectures such as RNNs and LSTMs for sequence modeling, and the Transformer architecture (2017). Models such as BERT (2018) and the GPT series (from 2018 onward) revolutionized NLP with pre-trained language models. Today, NLP incorporates multimodal data, ethical considerations, and applications in generative AI, with ongoing advancements in efficiency and multilingual support.

This evolution reflects a shift from rigid rules to data-driven, learning-based systems, enabling more robust and scalable applications.

3. List and explain three major use cases of NLP in today’s tech industry.

NLP is integral to the tech industry, powering tools that enhance user experiences, automate processes, and derive insights from text data. Here are three major use cases:

1. Sentiment Analysis: This involves analyzing text to determine the emotional tone, such as positive, negative, or neutral. In the tech industry, companies like Amazon and Twitter use it to gauge customer opinions from reviews, social media posts, and feedback. For example, it helps brands monitor reputation, improve products, and personalize marketing. Advanced models classify sentiments at scale, enabling real-time insights.

2. Chatbots and Virtual Assistants: NLP enables conversational AI systems like Siri, Alexa, or customer service bots on websites. These systems understand user queries, generate responses, and perform tasks such as booking appointments or answering FAQs. In e-commerce and customer support, they reduce response times and operational costs while providing 24/7 availability.

3. Machine Translation: Tools like Google Translate use NLP to convert text or speech from one language to another. In the global tech industry, this facilitates cross-border communication, content localization for apps/websites, and accessibility. It employs neural networks for context-aware translations, improving accuracy for business expansion and user engagement.

These use cases demonstrate NLP's role in making technology more intuitive and efficient.

4. What is text normalization and why is it essential in text processing tasks?


Text normalization is the process of converting raw text into a standardized, consistent format to facilitate analysis and modeling. It involves techniques such as:

* Lowercasing all text (e.g., "Hello" → "hello").
* Removing punctuation, special characters, or noise (e.g., hashtags, URLs).
* Expanding contractions (e.g., "don't" → "do not").
* Handling numbers, dates, or acronyms uniformly.
* Correcting spelling errors or removing stop words (common words like "the," "is").

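This step ensures that surface variations of the same word (e.g., "Run," "run," "RUN") are treated as the same entity. Inflected forms such as "running" additionally require stemming or lemmatization to be mapped to "run."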

It is essential in text processing tasks because raw text from sources like social media or reviews is often inconsistent, noisy, and varied due to human input errors, abbreviations, or formatting. Without normalization, models may treat similar words differently, leading to poor performance in tasks like sentiment analysis, search, or machine learning. Normalization reduces dimensionality, improves accuracy, and enhances efficiency by creating cleaner data for downstream processes like tokenization or embedding.
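The techniques listed above can be sketched with the standard library alone. This is a minimal illustration, not a complete normalizer: the `CONTRACTIONS` and `STOP_WORDS` tables and the `normalize` function are hypothetical names with deliberately tiny contents, chosen only to show the order of operations.

```python
import re

# Illustrative, deliberately tiny lookup tables (assumptions, not a full resource).
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOP_WORDS = {"the", "is", "a", "an", "to", "and"}

def normalize(text, remove_stop_words=False):
    """Lowercase, expand contractions, strip URLs and punctuation, collapse spaces."""
    text = text.lower()
    # Map curly apostrophes to straight ones so the contraction lookup matches.
    text = text.replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop punctuation and symbols
    tokens = text.split()                       # splitting also collapses whitespace
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

print(normalize("Don't miss it: https://example.com #NLP is GREAT!!!"))
# do not miss it nlp is great
```

Contractions are expanded before punctuation is removed, because stripping the apostrophe first would turn "don't" into "don t" and the lookup would no longer match.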

5. Compare and contrast stemming and lemmatization with suitable examples.


Stemming and lemmatization are both text preprocessing techniques used to reduce words to their base or root form, but they differ in approach, accuracy, and output.

**Similarities:**

* Both aim to normalize words by reducing inflected forms (e.g., plurals, tenses) to a common root, helping in tasks like information retrieval or sentiment analysis.

* They reduce vocabulary size, improving computational efficiency and model performance by treating variants as one.

**Differences:**

**Stemming:**

A heuristic, rule-based process that chops off suffixes/prefixes to find the stem, often producing non-real words. It is faster but less accurate, as it doesn't consider context or part-of-speech (POS).

* Algorithm: Common ones include Porter or Snowball stemmer.

Examples:

* "Running" → "run"

* "Computers" → "comput"

* "Better" → "better" (no change; irregular forms are also left as they are, e.g., "went" → "went")


Pros: Simple, quick.

Cons: Can over-stem (e.g., "university" → "univers") or produce invalid words.

**Lemmatization:**

A more sophisticated method that reduces words to their dictionary base form (lemma), considering context, POS, and morphology. It requires linguistic knowledge and is slower but more precise.

* Algorithm: Uses tools like WordNet or spaCy/NLTK lemmatizers.

Examples:

* "Running" → "run"
* "Computers" → "computer"
* "Better" → "good" (when the word is tagged as an adjective)
* "Went" → "go"


Pros: Produces valid words, context-aware.

Cons: Computationally intensive, needs POS tagging.

In summary, stemming is crude and fast for large datasets where accuracy isn't critical, while lemmatization is preferred for tasks requiring semantic precision, like question answering.
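The contrast can be made concrete with a toy sketch. The code below is not the Porter stemmer or a WordNet lemmatizer: `SUFFIXES`, `LEMMAS`, `toy_stem`, and `toy_lemmatize` are invented for illustration, and they only mimic the two ideas (blind suffix stripping versus dictionary lookup) on the examples used above.

```python
# Toy suffix rules and a tiny lemma dictionary, both invented for illustration.
SUFFIXES = ["ing", "ers", "ed", "s"]
LEMMAS = {"went": "go", "better": "good", "running": "run", "computers": "computer"}

def toy_stem(word):
    """Strip the first matching suffix without checking any dictionary."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undouble a trailing consonant ("runn" -> "run"), except l, s, z.
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

def toy_lemmatize(word):
    """Look the word up in a lemma dictionary, falling back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ["Running", "Computers", "Better", "Went"]:
    print(f"{w:10} stem={toy_stem(w):10} lemma={toy_lemmatize(w)}")
```

The stemmer returns "comput" for "Computers" and leaves "went" unchanged because it only sees character patterns, while the lookup-based function can return "computer" and "go" because it relies on stored linguistic knowledge. Real lemmatizers also use the part of speech, which this sketch ignores.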

6. Write a Python program that uses regular expressions (regex) to extract all email addresses from the following block of text

In [1]:
import re

text = """Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz."""

emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)

print(emails)

['support@xyz.com', 'hr@xyz.com', 'john.doe@xyz.org', 'jenny_clarke126@mail.co.us', 'partners@xyz.biz']
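To make the pattern used above easier to read, the same expression can be written with `re.VERBOSE` and one comment per component. This is a simplified pattern that suits the sample text; it is not a full implementation of the email address grammar, and the `EMAIL_RE` name and the `sample` string are introduced here only for illustration.

```python
import re

EMAIL_RE = re.compile(
    r"""
    \b
    [A-Za-z0-9._%+-]+     # local part: letters, digits, and . _ % + -
    @
    [A-Za-z0-9.-]+        # domain labels, possibly containing dots and hyphens
    \.
    [A-Za-z]{2,}          # final label (top-level domain), at least two letters
    \b
    """,
    re.VERBOSE,
)

sample = "Mail a.b@example.org, not 'user@localhost' or foo@bar."
print(EMAIL_RE.findall(sample))
# ['a.b@example.org']
```

Addresses without a dotted domain, such as "user@localhost", are skipped because the pattern requires a literal dot followed by at least two letters.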


7. Given the sample paragraph below, perform string tokenization and frequency distribution using Python and NLTK

In [31]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [32]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import os

# Add the current directory to the NLTK data search path
nltk.data.path.append(os.path.abspath("."))

# Check whether the 'punkt' tokenizer data is available locally.
# Note: newer NLTK releases load 'punkt_tab' inside word_tokenize,
# so finding 'punkt' here does not guarantee that tokenization works.
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    pass  # resource missing; word_tokenize below will raise LookupError


text = "Natural Language Processing (NLP) is a fascinating field that combines linguistics, computer science, and artificial intelligence. It enables machines to understand, interpret, and generate human language. Applications of NLP include chatbots, sentiment analysis, and machine translation. As technology advances, the role of NLP in modern solutions is becoming increasingly critical."

tokens = word_tokenize(text)
fdist = FreqDist(tokens)

print("Tokens:", tokens)
print("Frequency Distribution:", fdist.most_common())

LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/content'
    - '/content'
    - '/content'
    - '/content'
    - '/content'
**********************************************************************


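* Since the error persists, the next cell falls back to Python's standard library (re and collections.Counter) instead of NLTK. The traceback indicates that this NLTK version needs the 'punkt_tab' resource, so running nltk.download('punkt_tab') would likely resolve it.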

In [33]:
import re
from collections import Counter

# Sample paragraph from the question
text = "Natural Language Processing (NLP) is a fascinating field that combines linguistics, computer science, and artificial intelligence. It enables machines to understand, interpret, and generate human language. Applications of NLP include chatbots, sentiment analysis, and machine translation. As technology advances, the role of NLP in modern solutions is becoming increasingly critical."

# 1. Tokenization and Normalization (in one step)
# Use regex to find all sequences of word characters (\w+)
# \b ensures we only match whole words
# .lower() converts the whole text to lowercase first
tokens = re.findall(r'\b\w+\b', text.lower())

# 2. Frequency Distribution
# Counter offers the same counting interface as NLTK's FreqDist (a Counter subclass)
fdist = Counter(tokens)

# Print the results
print("--- Standard Python Results ---")
print("\nAll Tokens:")
print(tokens)

print("\nFrequency Distribution:")
# Print the 10 most common words
print(fdist.most_common(10))

print("\nFull Frequency List:")
for word, frequency in fdist.items():
    print(f"{word}: {frequency}")

--- Standard Python Results ---

All Tokens:
['natural', 'language', 'processing', 'nlp', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', 'computer', 'science', 'and', 'artificial', 'intelligence', 'it', 'enables', 'machines', 'to', 'understand', 'interpret', 'and', 'generate', 'human', 'language', 'applications', 'of', 'nlp', 'include', 'chatbots', 'sentiment', 'analysis', 'and', 'machine', 'translation', 'as', 'technology', 'advances', 'the', 'role', 'of', 'nlp', 'in', 'modern', 'solutions', 'is', 'becoming', 'increasingly', 'critical']

Frequency Distribution:
[('nlp', 3), ('and', 3), ('language', 2), ('is', 2), ('of', 2), ('natural', 1), ('processing', 1), ('a', 1), ('fascinating', 1), ('field', 1)]

Full Frequency List:
natural: 1
language: 2
processing: 1
nlp: 3
is: 2
a: 1
fascinating: 1
field: 1
that: 1
combines: 1
linguistics: 1
computer: 1
science: 1
and: 3
artificial: 1
intelligence: 1
it: 1
enables: 1
machines: 1
to: 1
understand: 1
interpret: 1
generat

8. Create a custom annotator using spaCy or NLTK that identifies and labels proper nouns in a given text.

In [12]:
import spacy

# Load the model (if available)
nlp = spacy.load('en_core_web_sm')

def custom_proper_noun_annotator(text):
    doc = nlp(text)
    proper_nouns = []
    for token in doc:
        if token.pos_ == 'PROPN':
            proper_nouns.append((token.text, 'Proper Noun'))
    return proper_nouns

# Example text (using the paragraph from Question 7 for consistency)
text = "Natural Language Processing (NLP) is a fascinating field that combines linguistics, computer science, and artificial intelligence. It enables machines to understand, interpret, and generate human language. Applications of NLP include chatbots, sentiment analysis, and machine translation. As technology advances, the role of NLP in modern solutions is becoming increasingly critical."

proper_nouns = custom_proper_noun_annotator(text)
print("Proper Nouns:", proper_nouns)

Proper Nouns: [('Natural', 'Proper Noun'), ('Language', 'Proper Noun'), ('Processing', 'Proper Noun'), ('NLP', 'Proper Noun'), ('NLP', 'Proper Noun'), ('NLP', 'Proper Noun')]
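The output above labels "Natural", "Language", and "Processing" as three separate proper nouns. One possible extension is to merge runs of adjacent proper-noun tokens into a single span. The sketch below assumes the input is a list of (text, POS) pairs, which could be built from a spaCy document as `[(t.text, t.pos_) for t in doc]`; here a hand-written list stands in for that so the example runs without the model. The `group_proper_nouns` function is a hypothetical helper, not part of spaCy.

```python
def group_proper_nouns(tagged_tokens):
    """Merge runs of adjacent PROPN tokens into single labeled spans."""
    spans, current = [], []
    for text, pos in tagged_tokens:
        if pos == "PROPN":
            current.append(text)
        elif current:
            # A non-PROPN token closes the run collected so far.
            spans.append((" ".join(current), "Proper Noun"))
            current = []
    if current:  # flush a run that reaches the end of the input
        spans.append((" ".join(current), "Proper Noun"))
    return spans

# Hand-written (text, POS) pairs standing in for spaCy output.
tagged = [
    ("Natural", "PROPN"), ("Language", "PROPN"), ("Processing", "PROPN"),
    ("(", "PUNCT"), ("NLP", "PROPN"), (")", "PUNCT"), ("is", "AUX"),
]
print(group_proper_nouns(tagged))
# [('Natural Language Processing', 'Proper Noun'), ('NLP', 'Proper Noun')]
```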


9. Using Gensim, demonstrate how to train a simple Word2Vec model on the following dataset consisting of example sentences:

In [14]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 71.7 MB/s eta 0:00:00
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [15]:
import gensim
from gensim.models import Word2Vec

dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
    "Word2Vec is a popular word embedding technique used in many NLP applications",
    "Text preprocessing is a critical step before training word embeddings",
    "Tokenization and normalization help clean raw text for modeling"
]

tokenized_dataset = [sentence.lower().split() for sentence in dataset]

model = Word2Vec(sentences=tokenized_dataset, vector_size=50, window=5, min_count=1, workers=4, sg=1, epochs=20)

print("Vocabulary:", list(model.wv.key_to_index.keys()))

print("\nSimilar to 'word':", model.wv.most_similar('word'))

print("\nSimilar to 'language':", model.wv.most_similar('language'))

Vocabulary: ['word', 'a', 'text', 'is', 'similar', 'representation', 'embeddings', 'to', 'language', 'modeling', 'for', 'raw', 'clean', 'help', 'normalization', 'and', 'tokenization', 'training', 'before', 'step', 'critical', 'preprocessing', 'applications', 'nlp', 'many', 'in', 'used', 'technique', 'embedding', 'popular', 'word2vec', 'have', 'meaning', 'with', 'words', 'allows', 'that', 'of', 'type', 'are', 'human', 'understand', 'computers', 'enables', 'processing', 'natural']

Similar to 'word': [('before', 0.2734066843986511), ('enables', 0.26125431060791016), ('meaning', 0.24966613948345184), ('normalization', 0.22074124217033386), ('nlp', 0.2000207155942917), ('are', 0.1829235851764679), ('raw', 0.17133577167987823), ('applications', 0.16520081460475922), ('popular', 0.1566496044397354), ('help', 0.1488247513771057)]

Similar to 'language': [('used', 0.3111428916454315), ('for', 0.23686298727989197), ('meaning', 0.2165430784225464), ('of', 0.2031378597021103), ('to', 0.1837280988

10. Imagine you are a data scientist at a fintech startup. You’ve been tasked with analyzing customer feedback. Outline the steps you would take to clean, process, and extract useful insights using NLP techniques from thousands of customer reviews.

As a data scientist at a fintech startup, analyzing customer feedback from thousands of reviews involves systematic NLP techniques to uncover actionable insights like common pain points, sentiment trends, and feature requests. Below is an outline of the steps, followed by sample Python code demonstrating key parts (using available libraries like pandas for data handling and gensim for basic embedding; advanced NLP would typically use NLTK/spaCy, but simulated here due to environment constraints).

Outlined Steps:

1. Data Collection and Loading: Gather reviews from sources like app stores, surveys, or databases. Load into a structured format (e.g., pandas DataFrame) for easy manipulation.

2. Data Cleaning: Remove irrelevant elements such as HTML tags, emojis, URLs, or duplicates. Handle missing values and filter out non-textual data.

3. Text Normalization and Preprocessing: Convert to lowercase, remove punctuation/numbers, expand contractions, and apply stemming/lemmatization. Remove stop words to focus on meaningful terms.

4. Tokenization and Feature Extraction: Break text into tokens (words/phrases). Use techniques like TF-IDF for weighting or word embeddings (e.g., Word2Vec) for semantic representation.

5. Analysis and Insight Extraction: Perform sentiment analysis (e.g., polarity scoring), topic modeling (e.g., LDA), or clustering to identify themes. Calculate metrics like average sentiment per product feature.

6. Visualization and Reporting: Use plots (e.g., word clouds, bar charts) to visualize frequencies, sentiments, or topics. Derive insights like "80% negative feedback on transaction fees" and recommend actions.

7. Iteration and Model Improvement: Validate results with manual checks, fine-tune models, and deploy for ongoing monitoring.
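As a small illustration of the TF-IDF weighting mentioned in step 4, the sketch below computes term frequency times log(N / document frequency) with the standard library only. The `tf_idf` function and the three sample reviews are hypothetical; library implementations such as scikit-learn's use smoothed variants of this formula, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Made-up reviews, for illustration only.
sample_reviews = ["high fees", "fast transfers low fees", "app crashes high fees"]
for review, w in zip(sample_reviews, tf_idf(sample_reviews)):
    ranked = sorted(w.items(), key=lambda item: item[1], reverse=True)
    print(review, "->", [(term, round(score, 3)) for term, score in ranked])
```

A term that appears in every review (here "fees") receives a weight of zero, which is how TF-IDF pushes ubiquitous words down and lets review-specific terms stand out.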



Sample Python Code (Using pandas and gensim for a small example dataset; assumes basic preprocessing without NLTK):

In [16]:
import pandas as pd
import re
from gensim.models import Word2Vec
from collections import Counter  # For frequency/sentiment simulation

# Step 1: Sample data (thousands of reviews would be loaded from CSV)
reviews = [
    "The app crashes often during transactions.",
    "Great user interface but high fees.",
    "Excellent customer support and fast transfers.",
    "Security features are top-notch.",
    "Too many bugs in the latest update."
]
df = pd.DataFrame({'review': reviews})

# Step 2-3: Cleaning and Normalization
def clean_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'[^a-z\s]', '', text)  # Remove punctuation/numbers
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

df['cleaned'] = df['review'].apply(clean_text)

# Step 4: Tokenization (simple split) and Word2Vec for embeddings
tokenized = [text.split() for text in df['cleaned']]
model = Word2Vec(sentences=tokenized, vector_size=10, window=3, min_count=1, workers=4, epochs=10)

# Step 5: Extract Insights (e.g., word frequency for themes, simulated sentiment)
word_freq = Counter(word for tokens in tokenized for word in tokens)
print("Common Words/Themes:", word_freq.most_common(5))

# Simulated sentiment (keyword lookup on whole tokens; in practice, use VADER or a trained classifier)
positive_keywords = {'great', 'excellent', 'topnotch', 'fast'}
negative_keywords = {'crashes', 'high', 'bugs', 'often'}

def label_sentiment(text):
    tokens = set(text.split())  # match whole words, not substrings
    if tokens & positive_keywords:  # positive keywords take precedence
        return 'positive'
    if tokens & negative_keywords:
        return 'negative'
    return 'neutral'

df['sentiment'] = df['cleaned'].apply(label_sentiment)
print("\nSentiment Distribution:\n", df['sentiment'].value_counts())

# Example Insight: Similar words to 'fees' for related complaints
try:
    print("\nSimilar to 'fees':", model.wv.most_similar('fees'))
except KeyError:
    print("\n'fees' not in vocabulary (small dataset).")

Common Words/Themes: [('the', 2), ('app', 1), ('crashes', 1), ('often', 1), ('during', 1)]

Sentiment Distribution:
 sentiment
positive    3
negative    2
Name: count, dtype: int64

Similar to 'fees': [('security', 0.27360907196998596), ('high', 0.18073837459087372), ('the', 0.17394468188285828), ('user', 0.13758254051208496), ('features', 0.10865871608257294), ('in', 0.09499213844537735), ('transactions', 0.04095911607146263), ('update', -0.03519481420516968), ('during', -0.050490282475948334), ('often', -0.06517450511455536)]
