# NLP Introduction & Text Processing - Assignment Answers




## Question 1: What is Computational Linguistics and how does it relate to NLP?

**Answer:**


Computational Linguistics is an interdisciplinary field that combines linguistics, computer science, and artificial intelligence to study and model natural language using computational methods. It focuses on understanding the structure, meaning, and use of human language from a computational perspective.

**Relationship with NLP:**

Natural Language Processing (NLP) is often described as the applied, engineering-oriented side of Computational Linguistics, and the two fields overlap heavily. Computational Linguistics is more research-oriented and theoretical, focusing on how language works computationally, while NLP builds systems and tools that process, understand, and generate human language.

**Key Differences:**
- **Computational Linguistics**: More theoretical, focuses on linguistic phenomena, language structure, and computational models of language
- **NLP**: More applied, focuses on building practical systems like chatbots, translators, sentiment analyzers, etc.

**Relationship**: NLP implements the theories and models developed in Computational Linguistics to create real-world applications. Computational Linguistics provides the foundation (linguistic theories, parsing algorithms, semantic models), while NLP applies these foundations to solve practical problems in industry and technology.


## Question 2: Briefly describe the historical evolution of Natural Language Processing.

**Answer:**


The historical evolution of Natural Language Processing can be divided into several key phases:

**1. Early Period (1940s-1950s):**
- Foundation laid by Alan Turing's work on machine intelligence
- First machine translation experiments (Georgetown-IBM experiment, 1954)
- Focus on rule-based systems and symbolic approaches

**2. Symbolic Era (1960s-1980s):**
- Development of formal grammars (Chomsky's transformational grammar)
- Rule-based systems using hand-crafted linguistic rules
- ELIZA chatbot (1966) demonstrated early conversational AI
- Expert systems and knowledge bases for language understanding
- Limited by the complexity of encoding all linguistic rules

**3. Statistical Revolution (1990s-2000s):**
- Shift from rule-based to statistical and probabilistic methods
- Introduction of machine learning approaches
- Hidden Markov Models (HMMs) for speech recognition
- Statistical machine translation (IBM models)
- N-gram language models
- Part-of-speech tagging and parsing using statistical methods

**4. Machine Learning Era (2000s-2010s):**
- Support Vector Machines (SVMs) and Maximum Entropy models
- Conditional Random Fields (CRFs) for sequence labeling
- Feature engineering became crucial
- Improved performance on various NLP tasks

**5. Deep Learning Revolution (2010s-Present):**
- Word embeddings (Word2Vec, GloVe) revolutionized representation learning
- Recurrent Neural Networks (RNNs, LSTMs, GRUs) for sequence modeling
- Convolutional Neural Networks (CNNs) for text classification
- Attention mechanisms and Transformer architecture (2017)
- Pre-trained language models (BERT, GPT, T5) achieving state-of-the-art results
- Transfer learning and fine-tuning became standard practice

**6. Large Language Models Era (2020s-Present):**
- GPT-3, GPT-4, and other large-scale models
- Few-shot and zero-shot learning capabilities
- Multimodal models combining text, images, and other modalities
- Democratization of NLP through accessible APIs and tools

This evolution shows a progression from rigid rule-based systems to flexible, data-driven approaches that can learn patterns from large amounts of text data.


## Question 3: List and explain three major use cases of NLP in today's tech industry.

**Answer:**


**1. Chatbots and Virtual Assistants:**
- **Explanation**: NLP powers conversational AI systems that can understand user queries, maintain context, and provide appropriate responses. These systems use intent recognition, entity extraction, and dialogue management.
- **Examples**: Customer service chatbots, virtual assistants (Siri, Alexa, Google Assistant), help desk automation
- **Impact**: Reduces human workload, provides 24/7 support, handles routine queries efficiently

**2. Sentiment Analysis and Opinion Mining:**
- **Explanation**: NLP techniques analyze text to determine the emotional tone, sentiment (positive, negative, neutral), and opinions expressed. This involves text classification, aspect-based sentiment analysis, and emotion detection.
- **Examples**: Social media monitoring, product review analysis, brand reputation management, market research
- **Impact**: Helps businesses understand customer satisfaction, track brand perception, make data-driven decisions
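
As a rough illustration of the text-classification idea behind sentiment analysis, the sketch below scores a review against two tiny hand-made word lists. The lexicons, the `score_sentiment` name, and the rule that a negator flips only the next word are assumptions made for this example; production systems typically rely on trained models rather than fixed word lists.

```python
import re

# Tiny illustrative lexicons (assumptions, not a published sentiment lexicon).
POSITIVE = {"good", "great", "love", "excellent", "fast"}
NEGATIVE = {"bad", "slow", "terrible", "hate", "crash"}
NEGATORS = {"not", "never", "no"}


def score_sentiment(text: str) -> str:
    """Label text as positive, negative, or neutral using word lists."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True  # flip the polarity of the very next token only
            continue
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)  # +1, -1, or 0
        if polarity:
            score += -polarity if negate else polarity
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for review in ["The app is great and fast",
                   "Support was not good",
                   "I opened the app"]:
        print(f"{review!r}: {score_sentiment(review)}")
```

The one-word negation window is deliberately simple: a phrase such as "not very good" would be scored as positive, which is one reason trained classifiers are preferred in practice.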

**3. Machine Translation and Language Services:**
- **Explanation**: NLP enables automatic translation between languages using neural machine translation models. Modern systems use sequence-to-sequence models with attention mechanisms.
- **Examples**: Google Translate, multilingual content localization, real-time translation apps, cross-language communication tools
- **Impact**: Breaks down language barriers, enables global communication, facilitates international business

**Additional Notable Use Cases:**
- **Information Extraction**: Extracting structured data from unstructured text (named entity recognition, relation extraction)
- **Text Summarization**: Automatic generation of concise summaries from long documents
- **Search Engines**: Understanding search queries and retrieving relevant documents
- **Email Filtering**: Spam detection and email categorization
- **Content Recommendation**: Understanding user preferences from text data to recommend articles, products, etc.


## Question 4: What is text normalization and why is it essential in text processing tasks?

**Answer:**


**Text Normalization** is the process of converting text into a standard, consistent format by applying various transformations to handle variations, inconsistencies, and noise in raw text data.

**Key Components of Text Normalization:**

1. **Case Normalization**: Converting all text to lowercase (or uppercase) to ensure consistent representation
   - Example: "Hello", "HELLO", "hello" → "hello"

2. **Whitespace Normalization**: Standardizing spaces, tabs, and newlines
   - Example: "word1    word2" → "word1 word2"

3. **Punctuation Handling**: Removing or standardizing punctuation marks
   - Example: "Hello!" → "Hello" or "Hello !"

4. **Number Normalization**: Converting numbers to text or standardizing their representation
   - Example: "123" → "one hundred twenty three" or "<NUM>"

5. **Accent and Diacritic Removal**: Removing accents from characters
   - Example: "café" → "cafe"

6. **URL and Email Normalization**: Replacing URLs and emails with placeholders
   - Example: "Visit https://example.com" → "Visit <URL>"

7. **Contraction Expansion**: Expanding contractions
   - Example: "don't" → "do not"
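
The components above can be chained into a single function. The following is a minimal sketch using only the standard library; the `normalize` name, the three-entry contraction table, the placeholder tokens, and the order of the steps are assumptions chosen for illustration, and a real pipeline would adapt them to the task.

```python
import re
import unicodedata

# Tiny illustrative contraction table (an assumption; real tables are much larger).
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
NUM_RE = re.compile(r"\b\d+(?:\.\d+)?\b")


def normalize(text: str) -> str:
    """Apply the normalization steps listed above, in one fixed order."""
    text = text.lower()                                   # case normalization
    text = URL_RE.sub("<URL>", text)                      # URL placeholder
    text = EMAIL_RE.sub("<EMAIL>", text)                  # email placeholder
    for short, full in CONTRACTIONS.items():              # contraction expansion
        text = text.replace(short, full)
    decomposed = unicodedata.normalize("NFKD", text)      # split letters from accents
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    text = NUM_RE.sub("<NUM>", text)                      # number placeholder
    text = re.sub(r"[^\w\s<>]", " ", text)                # drop punctuation, keep placeholders
    return " ".join(text.split())                         # whitespace normalization


if __name__ == "__main__":
    print(normalize("Visit https://example.com, don't   wait! Café costs 123."))
```

Lowercasing runs first so that the uppercase placeholders inserted afterwards survive unchanged.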

**Why Text Normalization is Essential:**

1. **Consistency**: Ensures uniform representation of text, reducing variations that could confuse models
   - "Hello" and "hello" are treated as the same word

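2. **Improved Model Performance**: Models often perform better on normalized data, especially when training data is limited, because they do not need to learn separate representations for surface variants of the same word. The steps should still match the task: lowercasing, for example, discards capitalization cues that named entity recognition can rely on.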

3. **Reduced Vocabulary Size**: Normalization reduces the vocabulary size, making models more efficient and reducing memory requirements

4. **Better Tokenization**: Normalized text is easier to tokenize correctly, leading to better word boundaries and segmentation

5. **Noise Reduction**: Removes irrelevant characters, formatting, and inconsistencies that don't contribute to meaning

6. **Standardization Across Sources**: When processing text from multiple sources (social media, emails, documents), normalization ensures consistent format

7. **Improved Search and Matching**: Normalized text enables better exact matching and similarity comparisons

**Example Impact:**
Without normalization: "Hello", "HELLO", "hello", "Hello!" might be treated as 4 different tokens
With normalization: All become "hello" - treated as 1 token, improving model efficiency and accuracy
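
The effect on vocabulary size can be checked with a quick count. This is a small sketch; the `raw_tokens` list simply reuses the four variants from the example, and stripping a fixed set of trailing punctuation marks is a simplifying assumption.

```python
from collections import Counter

raw_tokens = ["Hello", "HELLO", "hello", "Hello!"]

# Without normalization every surface form is its own vocabulary entry.
raw_vocab = Counter(raw_tokens)

# With lowercasing and punctuation stripping they collapse into one entry.
normalized_vocab = Counter(tok.lower().strip("!?.,") for tok in raw_tokens)

print(len(raw_vocab), dict(raw_vocab))
print(len(normalized_vocab), dict(normalized_vocab))
```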


## Question 5: Compare and contrast stemming and lemmatization with suitable examples.

**Answer:**


**Stemming** and **Lemmatization** are both text normalization techniques used to reduce words to their base or root forms, but they differ in their approach and accuracy.

### **Stemming:**

**Definition**: Stemming is a heuristic process that chops off the end of words to get their root form (stem), often by removing suffixes using simple rules.

**Characteristics:**
- Fast and computationally inexpensive
- Uses rule-based approach (e.g., remove "-ing", "-ed", "-s")
- May produce stems that are not actual words
- Doesn't consider context or part of speech
- Can be aggressive and may over-stem or under-stem

**Examples:**
- "running" → "run"
- "flies" → "fli" (not a real word)
- "better" → "better" (doesn't reduce to "good")
- "studies" → "studi"
- "happily" → "happili"

### **Lemmatization:**

**Definition**: Lemmatization is a more sophisticated process that uses vocabulary and morphological analysis to return the base or dictionary form (lemma) of a word.

**Characteristics:**
- Slower and more computationally expensive
- Uses dictionary lookup and morphological analysis
- Always produces valid words
- Considers context and part of speech (POS)
- More accurate but requires POS tagging

**Examples:**
- "running" (verb) → "run"
- "flies" (noun) → "fly"
- "better" (adjective) → "good"
- "studies" → "study"
- "happily" (adverb) → "happily" (lemmatizers undo inflection, not derivation, so the adverb is not reduced to "happy")

### **Comparison Table:**

| Aspect | Stemming | Lemmatization |
|--------|----------|---------------|
| **Method** | Rule-based, heuristic | Dictionary-based, linguistic analysis |
| **Speed** | Fast | Slower |
| **Output** | May not be a real word | Always a valid word |
| **Context Awareness** | No | Yes (requires POS) |
| **Accuracy** | Lower | Higher |
| **Example** | "flies" → "fli" | "flies" → "fly" |

### **When to Use:**

**Use Stemming when:**
- Speed is critical
- Approximate matching is acceptable
- Working with large datasets where slight inaccuracy is tolerable
- Building search engines or information retrieval systems

**Use Lemmatization when:**
- Accuracy is important
- Need valid words for downstream tasks
- Working with smaller datasets where quality matters
- Building applications that require proper word forms (e.g., text generation, grammar checking)

### **Practical Example:**

**Input**: "The cats are running and the dogs were barking happily."

**Stemming Result**: "the cat are run and the dog were bark happili"

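**Lemmatization Result** (with POS tags supplied): "the cat be run and the dog be bark happily"

Note: Lemmatization maps the irregular forms "are" and "were" to "be", which suffix stripping cannot do. The adverb "happily" keeps its form, because reducing it to "happy" would be a derivational change rather than an inflectional one, while the stemmer produces the non-word "happili".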
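
The difference in mechanism can be sketched with two toy functions. Both are illustrative assumptions written for this answer: `toy_stem` uses four hand-picked suffix rules (it is not the Porter algorithm), and `toy_lemmatize` looks words up in a small hand-written table keyed by word and part of speech (it is not WordNet).

```python
# Ordered suffix rules: (suffix to strip, replacement). Illustrative only.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Hand-written lemma table keyed by (word, part of speech). Illustrative only.
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("flies", "NOUN"): "fly",
    ("studies", "NOUN"): "study",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("are", "VERB"): "be",
    ("were", "VERB"): "be",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, then collapse a doubled final consonant."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= 3 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]  # "runn" -> "run"
            return stem
    return word


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form if the (word, pos) pair is known, else the word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


if __name__ == "__main__":
    for word, pos in [("running", "VERB"), ("flies", "NOUN"),
                      ("better", "ADJ"), ("were", "VERB")]:
        print(f"{word:8s} stem={toy_stem(word):8s} lemma={toy_lemmatize(word, pos)}")
```

The stemmer needs no linguistic knowledge but yields non-words such as "fli", while the lookup handles irregular forms only because someone recorded them, and only when the part of speech is known.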


## Question 6: Write a Python program that uses regular expressions (regex) to extract all email addresses from the following block of text.

**Text:**
"Hello team, please contact us at support@xyz.com for technical issues, or reach out to our HR at hr@xyz.com.You can also connect with John at john.doe@xyz.org and jenny via jennyclarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz."

**Answer:**


In [1]:
import re


# Sample text from the question; note the missing space in "hr@xyz.com.You"
text = "Hello team, please contact us at support@xyz.com for technical issues, or reach out to our HR at hr@xyz.com.You can also connect with John at john.doe@xyz.org and jenny via jennyclarke126@mail.co.us. For partnership inquiries, email partners@xyz.biz."

# Local part, "@", first domain label, then one or more ".label" parts.
# Labels after a dot are restricted to lowercase letters so that the
# capitalised "You" glued onto "hr@xyz.com." is not captured as a bogus TLD.
# This is a heuristic that fits this text, not a full RFC 5322 validator.
email_pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[a-z]{2,})+'


emails = re.findall(email_pattern, text)


print("Extracted Email Addresses:")
print("-" * 50)
for i, email in enumerate(emails, 1):
    print(f"{i}. {email}")

print(f"\nTotal emails found: {len(emails)}")


Extracted Email Addresses:
--------------------------------------------------
1. support@xyz.com
2. hr@xyz.com
3. john.doe@xyz.org
4. jennyclarke126@mail.co.us
5. partners@xyz.biz

Total emails found: 5


## Question 7: Given the sample paragraph below, perform string tokenization and frequency distribution using Python and NLTK.

**Paragraph:**
"Natural Language Processing (NLP) is a fascinating field that combines linguistics, computer science, and artificial intelligence. It enables machines to understand, interpret, and generate human language. Applications of NLP include chatbots, sentiment analysis, and machine translation. As technology advances, the role of NLP in modern solutions is becoming increasingly critical."

**Answer:**


In [2]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import string


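try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    # nltk.download() returns False on failure instead of raising;
    # older NLTK releases ship 'punkt' rather than 'punkt_tab'
    if not nltk.download('punkt_tab', quiet=True):
        nltk.download('punkt', quiet=True)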

try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords', quiet=True)


paragraph = "Natural Language Processing (NLP) is a fascinating field that combines linguistics, computer science, and artificial intelligence. It enables machines to understand, interpret, and generate human language. Applications of NLP include chatbots, sentiment analysis, and machine translation. As technology advances, the role of NLP in modern solutions is becoming increasingly critical."

print("Original Paragraph:")
print("-" * 80)
print(paragraph)
print("\n" + "=" * 80 + "\n")


tokens = word_tokenize(paragraph)

print("Tokenized Words:")
print("-" * 80)
print(tokens)
print(f"\nTotal tokens: {len(tokens)}")
print("\n" + "=" * 80 + "\n")

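# Drop tokens made up entirely of punctuation characters. A plain
# `token not in string.punctuation` is a substring test and would let
# multi-character punctuation tokens such as "``" slip through.
tokens_clean = [
    token.lower() for token in tokens
    if not all(ch in string.punctuation for ch in token)
]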

print("Cleaned Tokens (lowercase, no punctuation):")
print("-" * 80)
print(tokens_clean)
print(f"\nTotal cleaned tokens: {len(tokens_clean)}")
print("\n" + "=" * 80 + "\n")

freq_dist = FreqDist(tokens_clean)

print("Frequency Distribution (Top 20 most common words):")
print("-" * 80)
for word, frequency in freq_dist.most_common(20):
    print(f"{word:20s} : {frequency:3d}")

print("\n" + "=" * 80 + "\n")

print("Frequency Distribution Statistics:")
print("-" * 80)
print(f"Total unique words: {len(freq_dist)}")
print(f"Most common word: '{freq_dist.most_common(1)[0][0]}' (appears {freq_dist.most_common(1)[0][1]} times)")
print(f"Total word count: {freq_dist.N()}")

print("\n" + "=" * 80)
print("Complete Frequency Distribution (all words):")
print("-" * 80)
for word, frequency in sorted(freq_dist.items(), key=lambda x: (-x[1], x[0])):
    print(f"{word:20s} : {frequency:3d}")


Original Paragraph:
--------------------------------------------------------------------------------
Natural Language Processing (NLP) is a fascinating field that combines linguistics, computer science, and artificial intelligence. It enables machines to understand, interpret, and generate human language. Applications of NLP include chatbots, sentiment analysis, and machine translation. As technology advances, the role of NLP in modern solutions is becoming increasingly critical.


Tokenized Words:
--------------------------------------------------------------------------------
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'It', 'enables', 'machines', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', '.', 'Applications', 'of', 'NLP', 'include', 'chatbots', ',', 'sentiment', 'analysis', ',', 'and', 'mac

## Question 8: Create a custom annotator using spaCy or NLTK that identifies and labels proper nouns in a given text.

**Answer:**


In [3]:

try:
    import spacy
    USE_SPACY = True

    try:
        nlp = spacy.load("en_core_web_sm")
    except OSError:
        print("spaCy English model not found. Falling back to NLTK.")
        print("To use spaCy, install it using: python -m spacy download en_core_web_sm")
        USE_SPACY = False
except ImportError:
    print("spaCy not installed. Using NLTK instead.")
    print("To install spaCy: pip install spacy")
    USE_SPACY = False


text = "Apple Inc. was founded by Steve Jobs in Cupertino, California. Tim Cook is the current CEO. The company is known for products like iPhone, iPad, and MacBook. Microsoft Corporation, based in Redmond, Washington, is another major tech company."

print("Original Text:")
print("-" * 80)
print(text)
print("\n" + "=" * 80 + "\n")

if USE_SPACY:

    doc = nlp(text)

    proper_nouns = []
    for token in doc:
        if token.pos_ == "PROPN":
            proper_nouns.append({
                'text': token.text,
                'label': token.pos_,
                'start': token.idx,
                'end': token.idx + len(token.text)
            })

    print("Identified Proper Nouns:")
    print("-" * 80)
    for i, pn in enumerate(proper_nouns, 1):
        print(f"{i}. '{pn['text']}' (Position: {pn['start']}-{pn['end']}, Label: {pn['label']})")

    print(f"\nTotal proper nouns found: {len(proper_nouns)}")

    print("\n" + "=" * 80 + "\n")

    print("Named Entity Recognition (NER) - More Detailed Analysis:")
    print("-" * 80)
    for ent in doc.ents:
        print(f"Entity: '{ent.text}' | Label: {ent.label_} | Description: {spacy.explain(ent.label_)}")

    print("\n" + "=" * 80 + "\n")

    def identify_proper_nouns(text, model=nlp):
        """
        Custom function to identify and label proper nouns in text.

        Args:
            text: Input text string
            model: spaCy language model (default: en_core_web_sm)

        Returns:
            List of dictionaries containing proper noun information
        """
        doc = model(text)
        proper_nouns = []

        for token in doc:
            if token.pos_ == "PROPN":
                proper_nouns.append({
                    'word': token.text,
                    'pos_tag': token.pos_,
                    'pos_description': 'Proper Noun',
                    'start_char': token.idx,
                    'end_char': token.idx + len(token.text),
                    'sentence': token.sent.text.strip()
                })

        return proper_nouns

    print("Using Custom Annotator Function:")
    print("-" * 80)
    result = identify_proper_nouns(text)
    for item in result:
        print(f"Proper Noun: '{item['word']}' | POS: {item['pos_tag']} | Sentence: {item['sentence'][:60]}...")

else:
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.tag import pos_tag

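    try:
        nltk.data.find('tokenizers/punkt_tab')
    except LookupError:
        # nltk.download() returns False on failure instead of raising;
        # older NLTK releases ship 'punkt' rather than 'punkt_tab'
        if not nltk.download('punkt_tab', quiet=True):
            nltk.download('punkt', quiet=True)

    # NLTK 3.9+ names the English tagger 'averaged_perceptron_tagger_eng';
    # older releases use 'averaged_perceptron_tagger'
    try:
        nltk.data.find('taggers/averaged_perceptron_tagger_eng')
    except LookupError:
        if not nltk.download('averaged_perceptron_tagger_eng', quiet=True):
            nltk.download('averaged_perceptron_tagger', quiet=True)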

    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)

    proper_nouns = []
    for word, tag in tagged:
        if tag in ['NNP', 'NNPS']:
            proper_nouns.append({
                'text': word,
                'label': tag,
                'description': 'Proper Noun (singular)' if tag == 'NNP' else 'Proper Noun (plural)'
            })

    print("Identified Proper Nouns (using NLTK):")
    print("-" * 80)
    for i, pn in enumerate(proper_nouns, 1):
        print(f"{i}. '{pn['text']}' (Label: {pn['label']}, {pn['description']})")

    print(f"\nTotal proper nouns found: {len(proper_nouns)}")

    print("\n" + "=" * 80 + "\n")


    def identify_proper_nouns(text):
        """
        Custom function to identify and label proper nouns in text using NLTK.

        Args:
            text: Input text string

        Returns:
            List of dictionaries containing proper noun information
        """
        tokens = word_tokenize(text)
        tagged = pos_tag(tokens)

        proper_nouns = []
        for word, tag in tagged:
            if tag in ['NNP', 'NNPS']:
                proper_nouns.append({
                    'word': word,
                    'pos_tag': tag,
                    'pos_description': 'Proper Noun (singular)' if tag == 'NNP' else 'Proper Noun (plural)'
                })

        return proper_nouns

    # Test the custom annotator
    print("Using Custom Annotator Function (NLTK):")
    print("-" * 80)
    result = identify_proper_nouns(text)
    for item in result:
        print(f"Proper Noun: '{item['word']}' | POS: {item['pos_tag']} | {item['pos_description']}")


Original Text:
--------------------------------------------------------------------------------
Apple Inc. was founded by Steve Jobs in Cupertino, California. Tim Cook is the current CEO. The company is known for products like iPhone, iPad, and MacBook. Microsoft Corporation, based in Redmond, Washington, is another major tech company.


Identified Proper Nouns:
--------------------------------------------------------------------------------
1. 'Apple' (Position: 0-5, Label: PROPN)
2. 'Inc.' (Position: 6-10, Label: PROPN)
3. 'Steve' (Position: 26-31, Label: PROPN)
4. 'Jobs' (Position: 32-36, Label: PROPN)
5. 'Cupertino' (Position: 40-49, Label: PROPN)
6. 'California' (Position: 51-61, Label: PROPN)
7. 'Tim' (Position: 63-66, Label: PROPN)
8. 'Cook' (Position: 67-71, Label: PROPN)
9. 'CEO' (Position: 87-90, Label: PROPN)
10. 'iPhone' (Position: 131-137, Label: PROPN)
11. 'iPad' (Position: 139-143, Label: PROPN)
12. 'MacBook' (Position: 149-156, Label: PROPN)
13. 'Microsoft' (Position: 1

## Question 9: Using Gensim, demonstrate how to train a simple Word2Vec model on the following dataset.

**Dataset:**
- "Natural language processing enables computers to understand human language"
- "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation"
- "Word2Vec is a popular word embedding technique used in many NLP applications"
- "Text preprocessing is a critical step before training word embeddings"
- "Tokenization and normalization help clean raw text for modeling"

**Answer:**


In [6]:
!pip install gensim
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize


try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    # nltk.download() returns False on failure instead of raising;
    # older NLTK releases ship 'punkt' rather than 'punkt_tab'
    if not nltk.download('punkt_tab', quiet=True):
        nltk.download('punkt', quiet=True)

# Dataset
dataset = [
    "Natural language processing enables computers to understand human language",
    "Word embeddings are a type of word representation that allows words with similar meaning to have similar representation",
    "Word2Vec is a popular word embedding technique used in many NLP applications",
    "Text preprocessing is a critical step before training word embeddings",
    "Tokenization and normalization help clean raw text for modeling"
]

print("Original Dataset:")
print("-" * 80)
for i, sentence in enumerate(dataset, 1):
    print(f"{i}. {sentence}")
print("\n" + "=" * 80 + "\n")


def preprocess_text(text):
    """
    Preprocess text: tokenize, lowercase, remove punctuation
    """

    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalpha()]
    return tokens

processed_data = [preprocess_text(sentence) for sentence in dataset]

print("Preprocessed Data (Tokenized and Lowercased):")
print("-" * 80)
for i, tokens in enumerate(processed_data, 1):
    print(f"{i}. {tokens}")
print("\n" + "=" * 80 + "\n")

print("Training Word2Vec Model...")
print("-" * 80)


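# With only five sentences the learned vectors are illustrative at best;
# similarity scores from such a small corpus are essentially noise.
model = Word2Vec(
    sentences=processed_data,  # list of token lists, one per sentence
    vector_size=100,           # dimensionality of each word vector
    window=5,                  # max distance between target and context word
    min_count=1,               # keep every word, even those seen only once
    workers=1,                 # single worker thread
    sg=0                       # 0 = CBOW, 1 = skip-gram
)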

print("Model training completed!")
print(f"Vocabulary size: {len(model.wv.key_to_index)} unique words")

total_word_count = sum(len(sentence) for sentence in processed_data)
print(f"Total word tokens in dataset: {total_word_count}")

print("\n" + "=" * 80 + "\n")

print("Vocabulary:")
print("-" * 80)
vocab = list(model.wv.key_to_index.keys())
print(f"Words in vocabulary ({len(vocab)}):")
print(vocab)

print("\n" + "=" * 80 + "\n")


print("Word Vector Examples:")
print("-" * 80)
sample_words = ['natural', 'language', 'processing', 'word', 'embeddings']
for word in sample_words:
    if word in model.wv:
        vector = model.wv[word]
        print(f"'{word}': Vector shape = {vector.shape}, First 5 values = {vector[:5]}")
    else:
        print(f"'{word}': Not in vocabulary")

print("\n" + "=" * 80 + "\n")

print("Finding Similar Words:")
print("-" * 80)
test_words = ['language', 'word', 'processing', 'text']
for word in test_words:
    if word in model.wv:
        similar = model.wv.most_similar(word, topn=5)
        print(f"\nWords similar to '{word}':")
        for similar_word, similarity in similar:
            print(f"  - {similar_word}: {similarity:.4f}")
    else:
        print(f"'{word}' not in vocabulary")

print("\n" + "=" * 80 + "\n")

print("Word Similarity Examples:")
print("-" * 80)
word_pairs = [
    ('natural', 'language'),
    ('word', 'embeddings'),
    ('text', 'processing'),
    ('tokenization', 'normalization')
]

for word1, word2 in word_pairs:
    if word1 in model.wv and word2 in model.wv:
        similarity = model.wv.similarity(word1, word2)
        print(f"Similarity between '{word1}' and '{word2}': {similarity:.4f}")
    else:
        missing = [w for w in [word1, word2] if w not in model.wv]
        print(f"Cannot compute similarity: {missing} not in vocabulary")

print("\n" + "=" * 80 + "\n")

print("Word2Vec Model Training and Demonstration Complete!")


Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0
Original Dataset:
--------------------------------------------------------------------------------
1. Natural language processing enables computers to understand human language
2. Word embeddings are a type of word representation that allows words with similar meaning to have similar representation
3. Word2Vec is a popular word embedding technique used in many NLP applications
4. Text preprocessing is a critical step before training word embeddings
5. Tokenization and normalization help clean raw text for modeling


Preprocessed Data (Tokenized and Lowercased):
---------------

## Question 10: Imagine you are a data scientist at a fintech startup. You've been tasked with analyzing customer feedback. Outline the steps you would take to clean, process, and extract useful insights using NLP techniques from thousands of customer reviews.

**Answer:**


**1. Data Cleaning & Normalization:**
- Remove duplicates, fix encoding issues, strip HTML, standardize casing, and correct obvious misspellings.
- Remove or convert emojis and special characters depending on the use case.

**2. Text Preprocessing:**
- Detect the language and filter out non-English reviews before any language-specific processing.
- Tokenize, remove stopwords while keeping negation words such as "not" and "never" (dropping them turns "not helpful" into "helpful"), then lemmatize or stem.

**3. Exploratory Text Analysis:**
- Compute word frequencies and n-grams, extract keywords with TF-IDF, and identify recurring themes to understand the dominant topics and issues.

**4. Sentiment & Emotion Classification:**
- Use pre-trained or fine-tuned transformer models (e.g., BERT, DistilBERT) to assign sentiment scores and detect emotions such as anger, trust, and confusion.

**5. Topic Modeling / Clustering:**
- Apply LDA, BERTopic, or sentence embeddings with clustering to discover underlying categories (e.g., “app crashes,” “payment delays,” “customer service”).

**6. Insight Extraction & Summarization:**
- Generate human-readable summaries per topic, extract frequently mentioned entities (NER), and identify pain points and feature requests.

**7. Dashboards & Business Reporting:**
- Build visual dashboards showing trends, sentiment over time, top complaints, and actionable recommendations for product or support teams.
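
The cleaning, preprocessing, and keyword-extraction steps could be sketched as follows, using only the standard library. The four sample reviews, the short stopword list, and the function names are hypothetical and exist only for this illustration; a real project would more likely use a library such as scikit-learn for TF-IDF and a full stopword list with negations removed from it.

```python
import html
import math
import re
from collections import Counter

# Hypothetical sample reviews; real input would be thousands of rows.
REVIEWS = [
    "The app crashes every time I try to pay!",
    "the app crashes every time i try to pay!",  # duplicate once normalized
    "Payment was <b>not</b> processed &amp; support never replied.",
    "Great app, transfers are fast and support is helpful.",
]

# Small illustrative stopword list that deliberately keeps negations.
STOPWORDS = {"the", "a", "i", "to", "is", "are", "was", "and", "every", "time"}


def clean(text: str) -> str:
    """Decode HTML entities, strip tags, lowercase, keep alphabetic tokens."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)
    return " ".join(re.findall(r"[a-z']+", text.lower()))


def tokenize(text: str) -> list[str]:
    """Clean the text and drop stopwords (negation words are kept)."""
    return [tok for tok in clean(text).split() if tok not in STOPWORDS]


def deduplicate(reviews: list[str]) -> list[str]:
    """Keep the first review among those that are identical after cleaning."""
    seen, unique = set(), []
    for review in reviews:
        key = clean(review)
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique


def tfidf_keywords(docs: list[str], top_k: int = 3) -> list[list[str]]:
    """Rank each document's terms by term frequency times log(N / document frequency)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    keywords = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
                  for term, count in tf.items()}
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords.append(ranked[:top_k])
    return keywords


if __name__ == "__main__":
    unique_reviews = deduplicate(REVIEWS)
    for review, words in zip(unique_reviews, tfidf_keywords(unique_reviews)):
        print(words, "<-", review)
```

Terms that appear in several reviews, such as "app", receive a lower weight than terms specific to one review, which is what makes TF-IDF useful for surfacing distinctive complaints.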