# NLP Introduction & Text Processing

Question 1: What is Computational Linguistics and how does it relate to NLP?


Answer: Computational Linguistics (CL) is an interdisciplinary field that deals with the computational aspects of human language. It involves applying computer science techniques to analyze, understand, and generate natural language.

Natural Language Processing (NLP) is a subfield within Artificial Intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language.

Relationship between CL and NLP:

- **Foundation**: Computational Linguistics provides the theoretical and methodological foundations for NLP. It develops the formalisms, models, and algorithms necessary for processing language computationally.
- **Application**: NLP takes these theoretical underpinnings and applies them to practical problems, such as machine translation, sentiment analysis, spam detection, and chatbots.
- **Overlap**: The two fields overlap significantly, and the terms are sometimes used interchangeably. However, CL is generally more concerned with linguistic theories and models, while NLP focuses on the engineering and application aspects of building systems that can interact with human language.

Question 2: Briefly describe the historical evolution of Natural Language Processing.


Answer: Natural Language Processing (NLP) has evolved through several distinct phases:

Symbolic NLP (1950s - early 1990s): Early approaches relied heavily on hand-crafted rules, grammars, and lexicons. Examples include ELIZA (1966) and SHRDLU (around 1970). These systems were brittle and struggled with ambiguity.

Statistical NLP (late 1980s - 2000s): With the rise of machine learning, this era focused on probabilistic models like Hidden Markov Models (HMMs) and Naive Bayes, using large text corpora. Feature engineering was a key aspect.

Machine Learning & Neural Networks (2000s - early 2010s): More sophisticated ML algorithms (e.g., SVMs) were applied. The introduction of word embeddings (e.g., Word2Vec) was a significant development, capturing semantic relationships.

Deep Learning & Neural NLP (mid-2010s - Present): This phase saw the rise of Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Convolutional Neural Networks (CNNs). Breakthroughs came with attention mechanisms and, crucially, Transformer models (from 2017 onwards) such as BERT, GPT, and T5. These models, typically pre-trained on massive datasets, led to today's Large Language Models (LLMs), which show advanced abilities in understanding, generating, and reasoning with human language.

Question 3: List and explain three major use cases of NLP in today’s tech industry.


Answer: Sentiment Analysis/Opinion Mining: This involves analyzing text to determine the emotional tone, sentiment (positive, negative, neutral), or subjective opinions expressed within it. Companies use NLP to process customer reviews, social media posts, and survey responses to understand public perception of their products or services. It helps in market research, brand management, and identifying areas for improvement.

Chatbots and Virtual Assistants: NLP is fundamental to the operation of chatbots (for customer service, support, sales) and virtual assistants (like Siri, Google Assistant, Alexa). These systems use NLP to understand user queries in natural language, extract intentions and entities, and generate appropriate responses. They enable natural and intuitive human-computer interaction, automating tasks and providing instant information or support.

Machine Translation: This is the automatic translation of text or speech from one natural language to another. Modern machine translation systems rely heavily on neural machine translation (NMT) models. This technology breaks down language barriers in communication, enables global e-commerce, and facilitates access to information across different cultures and languages.

Question 4: What is text normalization and why is it essential in text processing tasks?

Answer: Text normalization is the process of transforming text into a canonical (standard) form. The goal is to reduce variability in the text data, making it easier for machines to process and understand. It involves several sub-processes that aim to bring different variations of words or characters to a consistent representation.

Common examples of text normalization include:

- **Lowercasing**: Converting all text to lowercase (e.g., 'Apple', 'apple', 'APPLE' all become 'apple').
- **Tokenization**: Breaking down text into smaller units (words, phrases, symbols).
- **Stemming**: Reducing words to a stem by stripping affixes (e.g., 'running' and 'runs' become 'run'; an irregular form such as 'ran' is left unchanged).
- **Lemmatization**: Reducing words to their base or dictionary form (lemma), using vocabulary and morphological analysis (e.g., 'better' becomes 'good', 'caring' becomes 'care', 'ran' becomes 'run').
- **Removing punctuation**: Eliminating characters like commas, periods, exclamation marks.
- **Removing stop words**: Eliminating common words that carry little semantic meaning (e.g., 'the', 'a', 'is').
- **Handling special characters and numbers**: Deciding whether to remove, replace, or standardize them.

Text normalization is essential because natural language is highly variable and ambiguous. Without normalization, a machine would treat different forms of the same word (e.g., 'run', 'running', 'ran') as distinct entities. Normalization addresses this and provides:

- **Improved consistency**: Variations of the same word or concept are treated uniformly, which is vital for accurate analysis.
- **Reduced vocabulary size**: Collapsing different forms of a word into a single base form significantly reduces the overall vocabulary, making models more efficient and less prone to sparsity issues.
- **Enhanced accuracy**: For tasks like sentiment analysis, information retrieval, machine translation, and text classification, normalization helps match relevant terms and improves model performance.
- **Better feature representation**: Models receive cleaner, more standardized input, which lets them learn more meaningful patterns from the text data.
- **Noise handling**: Irrelevant characters, symbols, and common words that contribute little to the meaning are removed.
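
The steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a production pipeline: the `normalize` function and the tiny `STOP_WORDS` set are assumptions made for the example, and stemming or lemmatization is left out because it needs a library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def normalize(text):
    """Lowercase, tokenize on letters/digits, and drop stop words."""
    text = text.lower()                       # lowercasing
    tokens = re.findall(r"[a-z0-9]+", text)   # tokenization; punctuation is discarded
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal

print(normalize("The Apple is RED, and the apple is sweet!"))
# -> ['apple', 'red', 'apple', 'sweet']
```

Note how 'Apple' and 'apple' collapse to the same token, which is exactly the vocabulary reduction described above.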


Question 5: Compare and contrast stemming and lemmatization with suitable examples.

Answer: Stemming and lemmatization are both text normalization techniques in Natural Language Processing (NLP). Both reduce words to a base or root form, which helps improve matching in search and reduces the dimensionality of the data, but they differ in approach, accuracy, and linguistic sophistication.

Lemmatization uses a vocabulary and morphological analysis, ideally together with the word's part of speech, so that the base form (lemma) is a valid dictionary word. Stemming uses simpler, faster rule-based algorithms that chop off the ends of words, which may produce a "stem" that is not a real dictionary word.

---

| Feature              | **Stemming**                                         | **Lemmatization**                                      |
|----------------------|------------------------------------------------------|--------------------------------------------------------|
| **Definition**        | Removes suffixes/prefixes to get the root form      | Converts word to its dictionary base (lemma)           |
| **Method**            | Uses crude heuristics or rules                      | Uses vocabulary and morphological analysis             |
| **Accuracy**          | Less accurate, may produce non-words                | More accurate, always returns valid words              |
| **Speed**             | Faster, simpler                                     | Slower, computationally intensive                      |
| **Example**           | “running” → “run”, “flies” → “fli”                  | “running” → “run”, “flies” → “fly”                     |
| **Use Case**          | Quick search engines, large-scale indexing          | Chatbots, machine translation, semantic analysis       |

---

### Examples


| **Original Word** | **Stemming Result** | **Lemmatization Result** |
|-------------------|---------------------|---------------------------|
| caring            | car                 | care                      |
| better            | better              | good                      |
| studies           | studi               | study                     |
| went              | went                | go                        |

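- **Stemming** often chops off endings without understanding context (e.g., “studies” → “studi”). The exact stem depends on the algorithm: NLTK's Porter stemmer, for instance, returns “care” for “caring”, while a more aggressive stemmer can return “car” as shown in the table.
- **Lemmatization** uses grammar and, where available, part-of-speech information (e.g., “went” → “go” as the past tense of a verb; “better” → “good” only when the word is tagged as an adjective).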
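
To make the contrast concrete, here is a deliberately simplified sketch. `toy_stem` applies a handful of hand-picked suffix rules (it is not the Porter or Lancaster algorithm), and `toy_lemmatize` is a lookup table standing in for a real dictionary such as WordNet. Both names and the rule set are assumptions chosen so that the output reproduces the table above.

```python
# Toy suffix-stripping stemmer: a few hand-picked rules, NOT a real stemming algorithm.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("es", ""), ("s", "")]

def toy_stem(word):
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when at least 3 characters of the word remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

# Toy lemmatizer: dictionary lookup keyed on (word, part of speech).
LEMMA_TABLE = {
    ("caring", "VERB"): "care",
    ("better", "ADJ"): "good",
    ("studies", "NOUN"): "study",
    ("went", "VERB"): "go",
}

def toy_lemmatize(word, pos):
    # Unknown (word, pos) pairs are returned unchanged
    return LEMMA_TABLE.get((word, pos), word)

for word, pos in [("caring", "VERB"), ("better", "ADJ"), ("studies", "NOUN"), ("went", "VERB")]:
    print(f"{word:8} stem={toy_stem(word):8} lemma={toy_lemmatize(word, pos)}")
```

The stemmer never consults a dictionary, so it cannot map “went” to “go”; the lemmatizer can, but only because it was told the part of speech and has an entry for that form.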

---


Question 6: Write a Python program that uses regular expressions (regex) to extract all
email addresses from the following block of text:
“Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.”

Answer:


In [1]:
import re

# Input text
text = """
Hello team, please contact us at support@xyz.com for technical issues, or reach out to
our HR at hr@xyz.com. You can also connect with John at john.doe@xyz.org and jenny
via jenny_clarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz.
"""

# Regular expression pattern for email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Find all matches
emails = re.findall(email_pattern, text)

# Print the extracted email addresses
print("Extracted email addresses:")
for email in emails:
    print(email)

Extracted email addresses:
support@xyz.com
hr@xyz.com
john.doe@xyz.org
jenny_clarke126@mail.co.us
partners@xyz.biz
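
The pattern used above is easier to read when split into its parts. The sketch below rewrites the same expression with `re.VERBOSE` so that each piece carries a comment; the name `EMAIL_RE` and the sample string are illustrative. It is a pragmatic pattern for text like this exercise, not a full implementation of the email address grammar in RFC 5322.

```python
import re

EMAIL_RE = re.compile(
    r"""
    [a-zA-Z0-9._%+-]+   # local part: letters, digits, and . _ % + -
    @                   # separator between local part and domain
    [a-zA-Z0-9.-]+      # domain labels, which may contain dots and hyphens
    \.[a-zA-Z]{2,}      # final dot plus a top-level domain of two or more letters
    """,
    re.VERBOSE,
)

sample = "Write to support@xyz.com or jenny_clarke126@mail.co.us."
print(EMAIL_RE.findall(sample))
# -> ['support@xyz.com', 'jenny_clarke126@mail.co.us']
```

The sentence-ending period is not captured because the last component requires letters after the final dot, so the engine backtracks to the preceding `.us`.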


Question 7: Given the sample paragraph below, perform string tokenization and
frequency distribution using Python and NLTK:
“Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.”

In [None]:
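import nltk
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

# Download the tokenizer models. Recent NLTK releases look for 'punkt_tab',
# older ones for 'punkt'; requesting both is harmless. The stopwords and
# wordnet corpora are not needed for tokenization and frequency counts.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)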

# Sample paragraph
text = """
Natural Language Processing (NLP) is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables machines to understand,
interpret, and generate human language. Applications of NLP include chatbots,
sentiment analysis, and machine translation. As technology advances, the role of NLP
in modern solutions is becoming increasingly critical.
"""

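# Tokenization
tokens = word_tokenize(text)

# Keep alphabetic tokens only and lowercase them, so that punctuation such as
# ',' and '.' does not dominate the frequency counts
words = [token.lower() for token in tokens if token.isalpha()]

# Frequency Distribution
fdist = FreqDist(words)

# Display results
print("Tokenized Words:")
print(tokens)
print("\nFrequency Distribution (top 10 words):")
for word, freq in fdist.most_common(10):
    print(f"{word}: {freq}")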

Question 8: Create a custom annotator using spaCy or NLTK that identifies and labels
proper nouns in a given text.

In [5]:
import spacy

# Load the small English pipeline
try:
    nlp = spacy.load("en_core_web_sm")
except OSError as err:
    # The model is installed separately from the spaCy package
    raise SystemExit(
        "spaCy English model not found. Install it with:\n"
        "    python -m spacy download en_core_web_sm"
    ) from err

# Entity labels treated as proper nouns here. Numeric and temporal labels
# (DATE, TIME, MONEY, QUANTITY, ORDINAL, CARDINAL) are left out because they
# do not name a specific entity. Adjust the set to suit your definition.
# An alternative is to keep tokens whose token.pos_ equals "PROPN".
PROPER_NOUN_LABELS = {"PERSON", "ORG", "GPE", "LOC", "NORP", "FAC", "PRODUCT",
                      "EVENT", "WORK_OF_ART", "LAW", "LANGUAGE"}

def custom_proper_noun_annotator(text):
    """
    Identifies and labels proper nouns in a given text using spaCy's NER.
    Returns a list of {"text", "label"} dicts, one per matching entity.
    """
    doc = nlp(text)
    return [
        {"text": ent.text, "label": ent.label_}
        for ent in doc.ents
        if ent.label_ in PROPER_NOUN_LABELS
    ]

# Sample Text
sample_text = "Google was founded by Larry Page and Sergey Brin. It is headquartered in Mountain View, California, and has offices around the world, including London and New York. The company also developed the Android operating system."

# Annotate the text
annotations = custom_proper_noun_annotator(sample_text)

print("Proper Noun Annotations:")
if annotations:
    for pn in annotations:
        print(f"- Text: '{pn['text']}', Label: '{pn['label']}'")
else:
    print("No proper nouns identified.")

Proper Noun Annotations:
- Text: 'Google', Label: 'ORG'
- Text: 'Larry Page', Label: 'PERSON'
- Text: 'Sergey Brin', Label: 'PERSON'
- Text: 'Mountain View', Label: 'GPE'
- Text: 'California', Label: 'GPE'
- Text: 'London', Label: 'GPE'
- Text: 'New York', Label: 'GPE'
- Text: 'Android', Label: 'ORG'


Question 9: Using Gensim, demonstrate how to train a simple Word2Vec model on the
following dataset consisting of example sentences:
dataset = [
 "Natural language processing enables computers to understand human language",
 "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
 "Word2Vec is a popular word embedding technique used in many NLP applications",
 "Text preprocessing is a critical step before training word embeddings",
 "Tokenization and normalization help clean raw text for modeling"
]
Write code that tokenizes the dataset, preprocesses it, and trains a Word2Vec model using
Gensim.


In [None]:
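import string

import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Download tokenizer resources. Recent NLTK releases look for 'punkt_tab',
# older ones for 'punkt'; requesting both is harmless.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)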

# Sample dataset
dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
    "Word2Vec is a popular word embedding technique used in many NLP applications",
    "Text preprocessing is a critical step before training word embeddings",
    "Tokenization and normalization help clean raw text for modeling"
]

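# Preprocessing: lowercase, tokenize, remove punctuation-only tokens
PUNCTUATION = set(string.punctuation)

def preprocess(text):
    tokens = word_tokenize(text.lower())
    # Drop tokens made up entirely of punctuation characters (e.g. ',' or '...')
    return [word for word in tokens if not all(ch in PUNCTUATION for ch in word)]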

# Tokenize and preprocess each sentence
processed_data = [preprocess(sentence) for sentence in dataset]

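# Train Word2Vec model (min_count=1 keeps every word, which this tiny corpus needs)
model = Word2Vec(sentences=processed_data, vector_size=100, window=5, min_count=1, workers=4)

# Example: find most similar words to 'word'.
# With only five sentences the scores are essentially noise; this only shows the API.
similar_words = model.wv.most_similar('word', topn=5)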
print("Words similar to 'word':")
for word, score in similar_words:
    print(f"{word}: {score:.4f}")

Question 10: Imagine you are a data scientist at a fintech startup. You’ve been tasked
with analyzing customer feedback. Outline the steps you would take to clean, process,
and extract useful insights using NLP techniques from thousands of customer reviews.

Answer: **To analyze customer feedback at a fintech startup, follow a structured pipeline of data cleaning, preprocessing, modeling, and insight extraction. Each stage feeds the next, so that sentiment analysis, topic detection, and recommendations rest on clean, consistent text.**

---

### 1. Data Collection & Cleaning

- **Source**: Gather reviews from platforms like app stores, surveys, emails, or support tickets.
- **Remove noise**: Strip HTML tags, emojis, special characters, and irrelevant metadata.
- **Handle missing data**: Drop or impute null entries.
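
A minimal sketch of this cleaning step, using only the standard library. The function name `clean_review` and the sample reviews are made up for illustration, and the regex-based tag stripping is a rough approximation; a real pipeline would more likely use an HTML parser. Dropping all non-ASCII characters also removes accented letters, so this assumes English-only text.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")      # crude HTML tag matcher
WHITESPACE_RE = re.compile(r"\s+")

def clean_review(raw):
    """Return a cleaned review string, or None if nothing usable remains."""
    if raw is None:
        return None
    text = html.unescape(raw)                               # '&amp;' -> '&'
    text = TAG_RE.sub(" ", text)                            # strip HTML tags
    text = text.encode("ascii", "ignore").decode("ascii")   # drop emojis and other non-ASCII
    text = WHITESPACE_RE.sub(" ", text).strip()             # collapse whitespace
    return text or None

# Hypothetical raw reviews, including a null entry and a blank one
raw_reviews = [
    "<p>Great app &amp; fast transfers! \U0001F680</p>",
    None,
    "   ",
    "Login keeps failing<br>",
]
cleaned = [c for c in map(clean_review, raw_reviews) if c is not None]
print(cleaned)
# -> ['Great app & fast transfers!', 'Login keeps failing']
```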

---

### 2. Text Preprocessing

- **Lowercasing**: Normalize text for uniformity.
- **Tokenization**: Split text into words or phrases using tools like NLTK or spaCy.
- **Stopword removal**: Eliminate common words (e.g., “the”, “is”) that don’t add meaning.
- **Stemming/Lemmatization**: Reduce words to their root form (e.g., “running” → “run”) for consistency.
- **Spelling correction**: Use libraries like `TextBlob` or `SymSpell` to fix typos.
- **Named Entity Recognition (NER)**: Identify entities like bank names, transaction types, or locations.

---

### 3. Exploratory Data Analysis (EDA)

- **Word frequency**: Identify common terms using `FreqDist` or `Counter`.
- **N-grams**: Extract frequent phrases (e.g., “credit card”, “loan approval”).
- **Word clouds**: Visualize dominant themes.
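
Word and n-gram frequencies need nothing beyond `collections.Counter`. The sketch below assumes the reviews are already cleaned and uses whitespace splitting as a stand-in for a proper tokenizer; the `ngrams` helper and the three reviews are invented for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all consecutive n-token tuples from a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical, already-cleaned reviews
reviews = [
    "credit card payment failed",
    "credit card declined again",
    "loan approval was quick",
]

unigrams = Counter()
bigrams = Counter()
for review in reviews:
    tokens = review.split()
    unigrams.update(tokens)
    bigrams.update(ngrams(tokens, 2))

print(unigrams["credit"], unigrams["loan"])
print(bigrams.most_common(1))
# -> [(('credit', 'card'), 2)]
```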

---

### 4. Sentiment Analysis

- **Lexicon-based**: Use tools like VADER or TextBlob for polarity scores.
- **Model-based**: Train classifiers (e.g., logistic regression, BERT) to detect positive, negative, or neutral sentiment.
- **Use case**: Understand customer satisfaction, detect frustration, or praise.
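
The lexicon-based idea can be illustrated without any library. The sketch below is a toy: `LEXICON` and `NEGATIONS` are tiny hand-written assumptions, whereas real tools such as VADER use much larger, weighted lexicons and more elaborate rules. It sums word scores and flips the sign of a word that directly follows a negation.

```python
import re

# Hypothetical mini-lexicon for illustration only
LEXICON = {"great": 1, "fast": 1, "love": 1, "slow": -1, "failed": -1, "terrible": -1}
NEGATIONS = {"not", "never", "no"}

def polarity(text):
    """Sum lexicon scores, negating a word that follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        value = LEXICON.get(tok, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score

def label(text):
    """Map a polarity score to positive, negative, or neutral."""
    score = polarity(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

for review in ["Great app, fast transfers", "The payment failed and support was slow", "Not great"]:
    print(label(review), "<-", review)
```

A model-based classifier would replace `polarity` with a trained model but could keep the same three-way labeling interface.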

---

### 5. Topic Modeling

- **LDA (Latent Dirichlet Allocation)**: Discover hidden topics in reviews (e.g., “customer service”, “app usability”).
- **Clustering**: Group similar feedback using K-means or DBSCAN.

---

### 6. Keyword & Intent Extraction

- **TF-IDF**: Identify unique and important terms.
- **Dependency parsing**: Understand relationships between words (e.g., “delay in payment”).
- **Intent classification**: Categorize feedback into intents like complaint, suggestion, or praise.
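
To show what TF-IDF measures, here is a from-scratch sketch using the plain definition: term frequency multiplied by log(N / document frequency). The `tf_idf` function and the three token lists are assumptions for the example, and it expects every document to be non-empty. Libraries such as scikit-learn apply smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Number of documents each term appears in
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized reviews
docs = [
    "app crashes on login".split(),
    "app is fast".split(),
    "loan approval delayed app".split(),
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

The term "app" appears in every document, so its weight is zero, while terms unique to one review receive the highest weights.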

---

### 7. Visualization & Reporting

- **Dashboards**: Use tools like Power BI or Tableau to present insights.
- **Trend analysis**: Track sentiment or topic shifts over time.
- **Alerts**: Flag critical feedback (e.g., fraud mentions) for immediate action.
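
A rule-based alert can be as simple as a few keyword patterns run over each incoming review. The categories and patterns in `ALERT_PATTERNS` are hypothetical and would need tuning against real feedback; in practice they would sit alongside the model-based signals described earlier.

```python
import re

# Hypothetical alert categories and patterns
ALERT_PATTERNS = {
    "fraud": re.compile(r"\b(fraud|scam|unauthori[sz]ed)\b", re.IGNORECASE),
    "outage": re.compile(r"\b(app (is )?down|cannot log ?in)\b", re.IGNORECASE),
}

def flag_review(text):
    """Return the sorted list of alert categories whose pattern matches the text."""
    return sorted(name for name, pattern in ALERT_PATTERNS.items() if pattern.search(text))

print(flag_review("There was an unauthorised charge on my card"))  # -> ['fraud']
print(flag_review("Cannot login since morning, is the app down?"))  # -> ['outage']
print(flag_review("Nice app"))                                      # -> []
```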

---

### 8. Automation & Deployment

- **Pipeline**: Build an automated NLP workflow using Python, Airflow, or cloud services.
- **Model retraining**: Periodically update models with new data.
- **Integration**: Feed insights into CRM or product development tools.

---
