# Assignment: Linguistic Pre-processing and Text Representation

## Instructions
- Answer all questions with detailed explanations
- Include code examples where applicable
- Provide reasoning for your design choices
- Each question requires a comprehensive answer demonstrating understanding of concepts

---

## Question 1: Multi-level Linguistic Analysis

Consider the sentence: "The company's CEO didn't respond to our meeting invitation."

Analyze this sentence from four different linguistic perspectives:
- **Syntax**: Identify the grammatical structure and phrase composition
- **Semantics**: Explain the meaning and relationships between words
- **Morphology**: Break down word formations and their components
- **Pragmatics**: Discuss the contextual interpretation and implied meaning

**Hint**: Consider how each level provides different insights. For morphology, examine words like "didn't" and "invitation". For pragmatics, think about what this might imply in a business context.

In [None]:
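- **Syntax**: A declarative clause of the form NP + VP. The subject NP is "The company's CEO", where the possessive phrase "The company's" acts as the determiner of the head noun "CEO". The VP is "didn't respond to our meeting invitation": auxiliary "did" + negation "n't" + main verb "respond" + the PP "to our meeting invitation" (preposition "to" + NP with the noun modifier "meeting"). "Respond" takes a prepositional complement here rather than a direct object, so the pattern is subject + verb + PP, not a strict SVO.
- **Semantics**: The CEO is the agent of "respond" and the invitation is what the response would have been directed at. The negation scopes over the whole event: no responding event took place in the past. "Company's" expresses affiliation (the CEO of the company), "our" identifies the speaker's group as the sender, and "meeting invitation" means an invitation to a meeting.
- **Morphology**: didn't = did (do + past tense) + n't (contracted "not"); company's = company + possessive clitic 's; invitation = invite + -ation (derivational suffix forming a noun); meeting = meet + -ing; CEO is an initialism of "Chief Executive Officer".
- **Pragmatics**: Literally the sentence only reports the absence of a reply. In a business context it implies more: the meeting is a low priority for the CEO, or the silence is a soft refusal, and the speaker may be expressing frustration. This implied meaning can be cancelled by context (for example, if the invitation was never delivered).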

---

## Question 2: Pre-processing Pipeline Design

You are building a sentiment analysis system for customer reviews from an e-commerce platform. The reviews contain:
- Informal language and slang ("gonna", "wanna", "u")
- Emojis and special characters
- Product codes and prices
- Misspellings and typos

Design a comprehensive text pre-processing pipeline. For each step (tokenization, normalization, stop-word removal, stemming/lemmatization), explain:
1. Why you would include or exclude it
2. What specific considerations apply to this use case
3. The order of operations and why it matters

**Hint**: Consider whether stemming or lemmatization is more appropriate for sentiment analysis. Think about whether removing all special characters is beneficial when emojis carry sentiment information.

In [None]:
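Pipeline (in order):
1. Character-level normalization: lowercase the text and unify apostrophes and Unicode variants. This runs first so the tokenizer and the lookup tables see one consistent form.
2. Tokenization: use a tokenizer that keeps emojis, prices, and product codes as single tokens instead of splitting or discarding them.
3. Token-level normalization: expand slang with a mapping dict ("gonna" → "going to", "u" → "you"), correct frequent misspellings, and replace prices and product codes with placeholder tokens, since their exact values rarely carry sentiment.
4. Emojis: keep them, because they carry sentiment. Removing all special characters would throw this signal away.
5. Stop-word removal: use a reduced list only. Negations ("not", "no", "never") must stay, because removing them flips the polarity of a review.
6. Lemmatization rather than stemming: lemmas are real words, so they still match sentiment lexicons and pretrained vocabularies, while stems are often not real words.
7. Vectorize the cleaned tokens.

Order matters: slang expansion and spelling correction work on tokens, so they come after tokenization; stop-word removal comes after slang expansion, so that expanded words are filtered too; lemmatization comes last, because it relies on correctly spelled, normalized tokens.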
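
The ordering of the first steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production pipeline: the `SLANG` and `STOP_WORDS` tables, the product-code format (`letters-digits`), and the `<PRICE>`/`<PRODUCT>` placeholders are all hypothetical, and spelling correction and lemmatization are left out.

```python
import re

# Illustrative lookup tables; a real system would need far larger ones.
SLANG = {"gonna": "going to", "wanna": "want to", "u": "you"}
STOP_WORDS = {"the", "a", "an", "is", "to", "this", "it"}  # negations deliberately kept

PRICE = r"\$\d+(?:\.\d+)?"   # prices such as $19.99
PRODUCT = r"[a-z]+-\d+"      # product codes such as ab-1234 (assumed format)
TOKEN_RE = re.compile("|".join([
    PRICE,
    PRODUCT,
    r"[^\W\d_]+(?:'[^\W\d_]+)?",  # words, optionally with an apostrophe ("don't")
    r"\d+(?:\.\d+)?",             # bare numbers ("5" in "5 stars")
    r"[^\w\s]",                   # any other single symbol, including emojis
]))

def preprocess(text):
    # 1. Character-level normalization: lowercase, straighten curly apostrophes.
    text = text.lower().replace("\u2019", "'")
    # 2. Tokenization.
    tokens = TOKEN_RE.findall(text)
    # 3. Token-level normalization: placeholders, slang expansion, punctuation.
    normalized = []
    for tok in tokens:
        if re.fullmatch(PRICE, tok):
            normalized.append("<PRICE>")
        elif re.fullmatch(PRODUCT, tok):
            normalized.append("<PRODUCT>")
        elif tok in SLANG:
            normalized.extend(SLANG[tok].split())
        elif tok.isascii() and not tok[0].isalnum():
            continue  # drop ASCII punctuation; non-ASCII symbols (emojis) are kept
        else:
            normalized.append(tok)
    # 4. Stop-word removal, after slang expansion so expanded words are filtered too.
    return [tok for tok in normalized if tok not in STOP_WORDS]

print(preprocess("U gonna love this AB-1234, only $19.99 😍"))
```

One known limitation of this sketch: emojis built from several code points (for example with skin-tone modifiers) are split into separate symbols by the single-character fallback pattern.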

---

## Question 3: Stemming vs Lemmatization Trade-offs

Consider these sentences:
1. "The meeting was well organized and the organizers did a great job."
2. "She is better at organizing than her predecessor was."

Apply both stemming (Porter Stemmer) and lemmatization to these sentences. Then:
- Compare the outputs and explain the differences
- Discuss scenarios where stemming would be preferred over lemmatization and vice versa
- Analyze the impact on: search engines, text classification, and information retrieval systems

**Hint**: Consider computational cost, accuracy, and preservation of meaning. Words like "better", "organizing", and "was" behave differently under stemming vs lemmatization.

In [None]:
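Stemming strips suffixes by rule, without a dictionary or POS information, so it is fast but crude: the Porter stemmer reduces "organized", "organizers", and "organizing" all to "organ", turns "was" into "wa", and leaves "better" unchanged. Lemmatization looks up the dictionary base form and needs the POS tag to do so: with correct tags, the WordNet lemmatizer gives "organize" for the verb forms, "organizer" for the noun, "be" for "was", and "good" for the adjective "better".
- Prefer stemming when speed and recall matter more than exact forms, as in search engine indexing and information retrieval: conflating variants lets a query for "organizing" match "organizers". The cost is precision, since "organ" also matches unrelated documents.
- Prefer lemmatization when meaning must be preserved, as in text classification: the features stay real, interpretable words, and irregular forms ("better", "was") are mapped correctly. The cost is speed and the need for POS tags.
Example code:
from nltk.stem import PorterStemmer, WordNetLemmatizer
print([PorterStemmer().stem(w) for w in "organized organizers organizing better was".split()])
print(WordNetLemmatizer().lemmatize("better", pos="a"))  # requires nltk.download("wordnet")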
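
To make the mechanical difference concrete without any downloads, here is a deliberately simplified sketch. `toy_stem` is NOT the Porter algorithm; it is a hypothetical suffix stripper with a handful of made-up rules, and `LEMMA_TABLE` is a tiny hand-written stand-in for a real dictionary such as WordNet. The point is only that a stemmer works on the string alone, while a lemmatizer is a lookup keyed on word and POS.

```python
# Toy rules, longest suffix first. This is an illustration, not the Porter rule set.
SUFFIXES = ("izing", "izers", "ized", "ing", "ers", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix if at least two characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

# Hand-written stand-in for a lemma dictionary, keyed on (word, POS).
LEMMA_TABLE = {
    ("organized", "VERB"): "organize",
    ("organizing", "VERB"): "organize",
    ("organizers", "NOUN"): "organizer",
    ("was", "VERB"): "be",
    ("better", "ADJ"): "good",
}

def toy_lemmatize(word, pos):
    """Return the dictionary base form, or the word itself if it is not listed."""
    return LEMMA_TABLE.get((word, pos), word)

for word, pos in [("organized", "VERB"), ("organizers", "NOUN"),
                  ("organizing", "VERB"), ("better", "ADJ"), ("was", "VERB")]:
    print(f"{word:12} stem={toy_stem(word):8} lemma={toy_lemmatize(word, pos)}")
```

The stemmer never sees the POS tag, which is why it cannot relate "better" to "good"; the lemmatizer returns the word unchanged whenever the (word, POS) pair is missing, which is why wrong tags degrade its output.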

---

## Question 4: POS Tagging for Ambiguity Resolution

Examine these ambiguous sentences:
1. "The duck is ready to eat."
2. "They can fish."
3. "Time flies like an arrow."

Explain:
- How POS tagging helps resolve these ambiguities
- The difference between rule-based and probabilistic POS tagging approaches
- Which approach would perform better for each sentence and why
- Limitations of both approaches

**Hint**: Consider how context and word order influence tagging. Think about the Hidden Markov Model approach for probabilistic tagging vs pattern-matching rules.

In [None]:
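POS tagging resolves an ambiguity only when the readings differ in word class:
- "They can fish.": either "can" is a modal (MD) and "fish" a verb (VB), or "can" is a main verb ("put into cans", VBP) and "fish" a noun (NN). The tag sequence picks the reading.
- "Time flies like an arrow.": in the usual reading "flies" is a verb (VBZ) and "like" a preposition (IN); in the alternative readings "flies" is a plural noun (NNS) and "like" a verb, or "time" is an imperative verb.
- "The duck is ready to eat.": both readings (the duck eats, or the duck is eaten) have the same tags. The ambiguity lies in who does the eating, so POS tagging alone cannot resolve it; it needs syntactic or semantic analysis.

Rule-based taggers assign tags with hand-written or learned contextual rules (pattern matching on neighbouring words and tags). They are transparent, but brittle and costly to maintain. Probabilistic taggers such as HMMs choose the tag sequence that maximizes the product of transition probabilities P(tag | previous tag) and emission probabilities P(word | tag), both estimated from a tagged corpus.

For sentences 2 and 3 a probabilistic tagger is the better fit: it returns the most likely reading (modal "can", verbal "flies"), which is usually the intended one. A rule-based tagger matches it only if suitable rules exist. For sentence 1 neither approach helps. Limitations: rules do not generalize beyond what was written; HMMs need annotated data, look only at a short tag history, and always return the single most probable reading even when a rarer one is meant.
spaCy example (assumes the en_core_web_sm model is installed):
import spacy
print([(t.text, t.tag_) for t in spacy.load("en_core_web_sm")("They can fish.")])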
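
The probabilistic approach can be illustrated with a tiny HMM decoded by the Viterbi algorithm. This is a sketch under stated assumptions: the tag set is reduced to four tags and every probability below is invented for illustration, not estimated from a corpus.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    # best[i][tag] = (probability of the best path ending in `tag` at position i, previous tag)
    best = [{t: (start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prob, prev = max((best[-1][p][0] * trans_p[p].get(t, 0.0), p) for p in tags)
            column[t] = (prob * emit_p[t].get(word, 0.0), prev)
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

TAGS = ["PRON", "MD", "VB", "NN"]
# All numbers below are hypothetical.
START = {"PRON": 0.6, "NN": 0.3, "VB": 0.05, "MD": 0.05}
TRANS = {
    "PRON": {"MD": 0.4, "VB": 0.5, "NN": 0.05, "PRON": 0.05},
    "MD": {"VB": 0.8, "NN": 0.1, "PRON": 0.05, "MD": 0.05},
    "VB": {"NN": 0.6, "PRON": 0.2, "VB": 0.1, "MD": 0.1},
    "NN": {"VB": 0.4, "MD": 0.3, "NN": 0.2, "PRON": 0.1},
}
EMIT = {
    "PRON": {"they": 0.2},
    "MD": {"can": 0.3},
    "VB": {"can": 0.01, "fish": 0.02},
    "NN": {"can": 0.005, "fish": 0.03},
}

print(viterbi(["they", "can", "fish"], TAGS, START, TRANS, EMIT))
```

With these numbers the modal reading wins because "can" is far more likely to be emitted by MD than by VB. The "put fish into cans" reading is still a valid path, just a less probable one, which is exactly the limitation of returning only the best sequence.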

---

## Question 5: Named Entity Recognition System Design

You need to build an NER system for extracting information from medical reports. The text contains:
- Disease names ("Type 2 Diabetes", "COVID-19")
- Medication names ("Metformin", "Ibuprofen 200mg")
- Dosages and measurements
- Doctor and patient names
- Hospital names and dates

Compare dictionary-based and CRF-based NER methods for this application:
- Advantages and disadvantages of each approach
- How would you handle new drug names not in the dictionary?
- What features would you use in a CRF model?
- How would you combine both approaches for optimal results?

**Hint**: Consider that medical terminology is specialized but relatively standardized. Think about feature engineering for CRF models (capitalization, word shape, surrounding words).

In [None]:
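- Dictionary-based: high precision on known, standardized terms (disease and drug names), easy to update and to explain. Weaknesses: low recall for new drugs, misspellings, and abbreviations; no use of context to disambiguate; patient and doctor names cannot be listed in advance.
- CRF-based: learns from labeled reports, uses context and token features, and generalizes to out-of-vocabulary (OOV) terms. Weaknesses: needs annotated medical text and careful feature engineering, and its errors are harder to explain.
- New drug names: the CRF can still recognize them from shape, suffix, and context (for example a following dosage such as "200mg"); confirmed new names are then added to the dictionary.
- CRF features: lowercased token, capitalization, word shape, prefixes and suffixes, presence of digits or units, previous and next words, and a flag for dictionary matches.
- Combination: feed dictionary matches into the CRF as features, then validate and normalize the CRF output against the dictionary; entities the dictionary does not know are flagged for review rather than discarded.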
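
The feature engineering and the dictionary-as-feature idea can be sketched as a per-token feature function. This is an illustrative sketch: `DRUG_LEXICON` is a hypothetical two-entry gazetteer and the feature names are my own choice. A list of such dicts per sentence is the kind of input that CRF toolkits such as sklearn-crfsuite are commonly fed, but no CRF is trained here.

```python
DRUG_LEXICON = {"metformin", "ibuprofen"}  # hypothetical gazetteer

def word_shape(token):
    """Map uppercase to X, lowercase to x, digits to d; keep other characters."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)

def token_features(tokens, i):
    """Features for tokens[i], including its left and right neighbours."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "shape": word_shape(tok),
        "is_title": tok.istitle(),
        "has_digit": any(ch.isdigit() for ch in tok),
        "suffix3": tok[-3:].lower(),
        "in_drug_lexicon": tok.lower() in DRUG_LEXICON,  # dictionary match as a feature
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

sentence = ["Prescribed", "Ibuprofen", "200mg", "daily"]
for i in range(len(sentence)):
    print(token_features(sentence, i))
```

An unseen drug name would get `in_drug_lexicon = False`, but its shape, suffix, and the dosage token that follows it remain available to the model, which is how the CRF can generalize beyond the dictionary.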

---

## Question 6: N-gram Language Models and Perplexity

Given a small corpus:
```
"I love machine learning"
"I love deep learning"
"Machine learning is fascinating"
"Deep learning is powerful"
```

a) Build a bigram language model and calculate probabilities for:
   - "I love natural learning"
   - "Machine learning is powerful"

b) Explain the zero-probability problem and demonstrate:
   - How Laplace smoothing addresses it
   - The concept of backoff strategies
   - How to calculate and interpret perplexity

c) Discuss why lower perplexity indicates a better language model.

**Hint**: For unseen bigrams like "natural learning", consider what probability would be assigned without smoothing. Calculate perplexity as a measure of how "surprised" the model is.

In [None]:
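a) Assuming lowercasing and sentence markers `<s>` and `</s>`, the maximum-likelihood bigram estimate is P(w2 | w1) = count(w1 w2) / count(w1).
   - "Machine learning is powerful": P(machine | `<s>`) × P(learning | machine) × P(is | learning) × P(powerful | is) × P(`</s>` | powerful) = 1/4 × 2/2 × 2/4 × 1/2 × 1/1 = 1/16.
   - "I love natural learning": P(natural | love) = 0/2 = 0, so the whole sentence gets probability 0.

b) Zero-probability problem: a single unseen bigram makes the product 0 and the perplexity infinite, however plausible the rest of the sentence is.
   - Laplace smoothing adds 1 to every bigram count: P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V), where V is the vocabulary size. Unseen bigrams get a small non-zero probability, at the price of taking mass away from seen ones.
   - Backoff: when a bigram is unseen, fall back to a discounted lower-order estimate (here the unigram probability) instead of assigning 0.
   - Perplexity: PP(W) = P(w1 ... wN)^(-1/N). For "Machine learning is powerful" with N = 5 predicted tokens (including `</s>`), PP = 16^(1/5) ≈ 1.74.

c) Lower perplexity means the model assigns higher probability to held-out text, i.e. it is less "surprised"; it can be read as the average number of equally likely next words. A higher n reduces perplexity only while counts stay dense enough; beyond that, sparsity produces more unseen n-grams and smoothing becomes essential.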
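
The bigram model can be sketched in a few lines of standard-library Python. Assumptions, none of which come from the question itself: tokens are lowercased, `<s>` and `</s>` mark sentence boundaries, and the vocabulary for smoothing contains every predicted type plus one `<unk>` slot. Setting `k=0` gives the unsmoothed estimate and `k=1` gives Laplace smoothing.

```python
import math
from collections import Counter

CORPUS = [
    "I love machine learning",
    "I love deep learning",
    "Machine learning is fascinating",
    "Deep learning is powerful",
]

def tokenize(sentence):
    return ["<s>"] + sentence.lower().split() + ["</s>"]

context_counts = Counter()  # how often each word occurs as the first word of a bigram
bigram_counts = Counter()
for sentence in CORPUS:
    tokens = tokenize(sentence)
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

# Every type that can be predicted, plus one slot for unknown words (assumption).
VOCAB = {word for _, word in bigram_counts} | {"<unk>"}

def bigram_prob(prev, word, k=0.0):
    """P(word | prev) with add-k smoothing; k=0 is the MLE, k=1 is Laplace."""
    denom = context_counts[prev] + k * len(VOCAB)
    return (bigram_counts[(prev, word)] + k) / denom if denom else 0.0

def sentence_prob(sentence, k=0.0):
    tokens = tokenize(sentence)
    return math.prod(bigram_prob(p, w, k) for p, w in zip(tokens, tokens[1:]))

def perplexity(sentence, k=0.0):
    n = len(tokenize(sentence)) - 1  # number of predicted tokens, including </s>
    prob = sentence_prob(sentence, k)
    return float("inf") if prob == 0 else prob ** (-1 / n)

for s in ["Machine learning is powerful", "I love natural learning"]:
    print(s, sentence_prob(s), perplexity(s), perplexity(s, k=1))
```

Comparing the `k=0` and `k=1` columns shows the trade-off: smoothing makes the unseen sentence scorable, but it also raises the perplexity of the sentence the unsmoothed model already handled well.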

---

## Question 7: Bag-of-Words vs TF-IDF Analysis

Consider three documents:
- Doc1: "Machine learning is a subset of artificial intelligence"
- Doc2: "Deep learning is a subset of machine learning"
- Doc3: "Artificial intelligence and machine learning are transforming industries"

a) Construct the BoW representation and TF-IDF vectors for all documents

b) Calculate cosine similarity between documents using both representations

c) Explain:
   - Why the similarity scores differ between BoW and TF-IDF
   - Which representation better captures document similarity for:
     - Information retrieval
     - Document clustering
     - Topic modeling
   - Limitations of both approaches

**Hint**: Consider how TF-IDF downweights common terms like "is" and "a". Think about what information is lost (word order, context, semantics).

In [None]:
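a) BoW represents each document as a vector of raw term counts over the shared vocabulary (for example, "learning" has count 2 in Doc2). TF-IDF multiplies each count by an inverse document frequency weight such as ln(N / df), so terms that occur in many documents are downweighted.

b) With lowercased whitespace tokens, the BoW cosine similarities are about 0.78 (Doc1, Doc2), 0.50 (Doc1, Doc3), and 0.34 (Doc2, Doc3). With unsmoothed idf = ln(N / df) the TF-IDF similarities drop to about 0.48, 0.15, and 0.

c) The scores differ because BoW treats every shared word equally, while TF-IDF discounts words that do not discriminate between documents. In this small corpus the terms that occur in all three documents are "machine" and "learning", so they get weight 0 under the unsmoothed formula, and Doc2 and Doc3, which share only those two words, become orthogonal. "Is" and "a" occur in two of three documents and are only partly downweighted.
   - Information retrieval: TF-IDF, because rare query terms should dominate the ranking.
   - Document clustering: TF-IDF, because it reduces the influence of words common to the whole collection.
   - Topic modeling: count-based BoW is the usual input for count-based topic models.
   - Limitations of both: word order, context, and semantics are lost; synonyms are unrelated dimensions; vectors are sparse and grow with the vocabulary.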
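
The comparison can be reproduced with the standard library alone. This sketch assumes lowercased whitespace tokenization and the plain weighting tf × ln(N / df); library implementations often use a smoothed idf (scikit-learn's default adds 1 inside and outside the logarithm), so their numbers will differ.

```python
import math
from collections import Counter

DOCS = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Artificial intelligence and machine learning are transforming industries",
]

tokenized = [doc.lower().split() for doc in DOCS]
vocab = sorted({w for doc in tokenized for w in doc})
N = len(tokenized)
df = Counter(w for doc in tokenized for w in set(doc))  # document frequency per term

def bow_vector(doc):
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def tfidf_vector(doc):
    counts = Counter(doc)
    return [counts[w] * math.log(N / df[w]) for w in vocab]  # unsmoothed idf

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

bow = [bow_vector(doc) for doc in tokenized]
tfidf = [tfidf_vector(doc) for doc in tokenized]
for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(f"Doc{i + 1}-Doc{j + 1}: BoW={cosine(bow[i], bow[j]):.2f} "
          f"TF-IDF={cosine(tfidf[i], tfidf[j]):.2f}")
```

Terms that appear in every document get idf = ln(1) = 0 under this formula, so they contribute nothing to the TF-IDF similarity.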

---

## Question 8: Word2Vec Architectures Deep Dive

Explain the Word2Vec model by addressing:

a) **CBOW (Continuous Bag of Words)**:
   - Architecture and training objective
   - How context words predict the target word
   - Best use cases

b) **Skip-gram**:
   - Architecture and training objective
   - How target word predicts context words
   - Best use cases

c) For the sentence "The quick brown fox jumps over the lazy dog" (window size = 2):
   - Show training examples for both CBOW and Skip-gram when target word is "fox"
   - Explain which architecture works better for:
     - Small datasets
     - Rare words
     - Frequent words

**Hint**: CBOW is faster and works well with frequent words, while Skip-gram is better for rare words and smaller datasets. Consider the number of training instances generated.

In [None]:
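a) CBOW combines (averages or sums) the vectors of the context words inside the window and predicts the target word from them. It produces one training example per position, so it trains faster and smooths over individual contexts. It works best on large corpora and for frequent words.

b) Skip-gram uses the target word to predict each context word separately. It produces one training pair per context word, so every occurrence of a rare word yields several updates to its vector. It works best for rare words and smaller datasets, at a higher training cost.

c) For target "fox" with window size 2, the context is "quick", "brown", "jumps", "over".
   - CBOW: one example, [quick, brown, jumps, over] → fox.
   - Skip-gram: four examples, (fox → quick), (fox → brown), (fox → jumps), (fox → over).
   - Small datasets: Skip-gram. Rare words: Skip-gram. Frequent words: CBOW.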
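
The difference in the number of training instances can be made explicit by generating them. This is a minimal sketch of the example construction only (no training); the function names are my own, and real implementations typically add refinements such as subsampling and randomly shrunk windows.

```python
def cbow_example(tokens, i, window=2):
    """One CBOW example: (list of context words, target word)."""
    context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
    return context, tokens[i]

def skipgram_examples(tokens, i, window=2):
    """Skip-gram examples: one (target, context word) pair per context word."""
    context, target = cbow_example(tokens, i, window)
    return [(target, c) for c in context]

tokens = "the quick brown fox jumps over the lazy dog".split()
fox = tokens.index("fox")
print(cbow_example(tokens, fox))
print(skipgram_examples(tokens, fox))
```

For the same position, CBOW yields a single example while Skip-gram yields as many pairs as there are context words, which is why Skip-gram extracts more signal from each occurrence of a rare word.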

---

## Question 9: GloVe vs FastText Comparison

Compare and contrast GloVe and FastText embedding techniques:

a) **Training methodology**:
   - How does GloVe use global co-occurrence statistics?
   - How does FastText incorporate subword information?

b) **Handling Out-of-Vocabulary (OOV) words**:
   - Given the trained words: "playing", "player", "played"
   - How would each model handle the unseen word "gameplay"?
   - Which model is more suitable for morphologically rich languages (e.g., German, Turkish)?

c) **Practical considerations**:
   - Training time and computational requirements
   - Model size and memory footprint
   - Performance on rare and misspelled words

**Hint**: FastText breaks words into character n-grams (e.g., "playing" → "<pl", "pla", "lay", "ayi", "yin", "ing", "ng>"). GloVe uses matrix factorization on co-occurrence counts.

In [None]:
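a) GloVe first builds a global word-word co-occurrence matrix over the whole corpus, then fits word vectors so that their dot products approximate the logarithm of the co-occurrence counts (a weighted least-squares factorization). FastText trains like Word2Vec, but represents each word as the sum of the vectors of its character n-grams plus the word itself.

b) GloVe stores one vector per vocabulary word, so the unseen word "gameplay" has no vector and needs a fallback (an unknown-word vector, or skipping the word). FastText composes a vector for "gameplay" from its n-grams; with trigrams it shares "pla" and "lay" with "playing", "player", and "played", so the result lands near them. For the same reason FastText is more suitable for morphologically rich languages such as German or Turkish, where many word forms are rare or unseen.

c) GloVe needs memory for the co-occurrence matrix, but training then runs on the aggregated counts rather than the raw text. FastText models are larger because of the n-gram table and train more slowly than plain Word2Vec. On rare and misspelled words FastText is more robust, because most of their n-grams have been seen in other words.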
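
The subword overlap behind the OOV behaviour can be checked directly. This sketch only extracts character n-grams with the `<` and `>` boundary markers described in the hint; it does not train or load any embeddings, and the function name and defaults are illustrative (a range of 3 to 6 characters is a common FastText setting).

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    return [
        wrapped[i: i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

print(char_ngrams("playing", 3, 3))

# Trigrams that the unseen word shares with each trained word.
unseen = set(char_ngrams("gameplay", 3, 3))
for known in ["playing", "player", "played"]:
    print(known, sorted(unseen & set(char_ngrams(known, 3, 3))))
```

A model that has learned vectors for the shared n-grams can assemble a usable vector for "gameplay", whereas a word-level lookup table has nothing to return for it.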

---

## Question 10: Classical vs Distributed Representations - Application Perspective

You are tasked with building three different NLP applications:

1. **Legal document search engine** (searching through contracts and legal texts)
2. **Chatbot intent classification** (understanding user queries)
3. **Academic paper recommendation system** (suggesting related research papers)

For each application:

a) Decide whether to use classical representations (BoW/TF-IDF) or distributed representations (Word2Vec/GloVe/FastText)

b) Justify your choice by considering:
   - Semantic similarity requirements
   - Vocabulary size and domain specificity
   - Training data availability
   - Computational constraints
   - Interpretability needs

c) Discuss hybrid approaches: Could combining both representation types improve performance? How?

**Hint**: Legal documents might require exact term matching, while chatbots benefit from semantic understanding. Consider that classical methods are sparse and interpretable, while distributed representations are dense and capture semantic relationships.

In [None]:
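1. Legal document search engine: TF-IDF. Legal search depends on exact term matching (clause names, defined terms, statute references), the vocabulary is specialized, and results must be explainable: with a sparse representation one can show which terms matched.
2. Chatbot intent classification: distributed representations. User queries are short and paraphrased in many ways, so semantic similarity matters more than exact overlap. Pretrained embeddings compensate for limited labeled data, and FastText additionally tolerates typos.
3. Academic paper recommendation: hybrid. TF-IDF captures distinctive technical terms, embeddings capture topical relatedness between papers that use different wording. The two vectors are normalized and concatenated (optionally with a weight), so both exact and semantic overlap contribute to the similarity.

Hybrid approaches can also help the first two applications: embeddings can expand a legal query with related terms while TF-IDF still does the ranking, and TF-IDF features can be added to an intent classifier to keep domain keywords visible. The cost is higher dimensionality and more computation.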
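
The concatenation-based hybrid can be sketched as follows. The weighting scheme is an assumption of this sketch, not something prescribed by the question: each part is L2-normalized and scaled by the square root of its weight, so the dot product of two hybrid vectors equals `alpha` times the sparse cosine plus `1 - alpha` times the dense cosine. The input vectors are made-up toy values.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def hybrid_vector(sparse_vec, dense_vec, alpha=0.5):
    """Concatenate a TF-IDF style vector and an embedding, weighted by alpha."""
    sparse_part = [math.sqrt(alpha) * x for x in l2_normalize(sparse_vec)]
    dense_part = [math.sqrt(1 - alpha) * x for x in l2_normalize(dense_vec)]
    return sparse_part + dense_part

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy inputs: two papers with some shared terms but unrelated embeddings.
paper_a = hybrid_vector([1, 0, 1], [1, 0])
paper_b = hybrid_vector([1, 1, 0], [0, 1])
print(dot(paper_a, paper_b))
```

Tuning `alpha` on held-out data lets the same code lean towards exact term overlap (closer to 1) or towards semantic similarity (closer to 0).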

---

## Submission Guidelines

- Complete all questions in this notebook
- Include code implementations where applicable (using NLTK, spaCy, scikit-learn, or gensim)
- Provide clear explanations and reasoning
- Add visualizations if they help explain your answers
- Ensure your code is properly commented