# Assignment: Linguistic Pre-processing and Text Representation

## Instructions
- Answer all questions with detailed explanations
- Include code examples where applicable
- Provide reasoning for your design choices
- Each question requires a comprehensive answer demonstrating understanding of concepts

---

## Question 1: Multi-level Linguistic Analysis

Consider the sentence: "The company's CEO didn't respond to our meeting invitation."

Analyze this sentence from four different linguistic perspectives:
- **Syntax**: Identify the grammatical structure and phrase composition
- **Semantics**: Explain the meaning and relationships between words
- **Morphology**: Break down word formations and their components
- **Pragmatics**: Discuss the contextual interpretation and implied meaning

**Hint**: Consider how each level provides different insights. For morphology, examine words like "didn't" and "invitation". For pragmatics, think about what this might imply in a business context.

Answer :

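Syntax: SVO. NP (subject) = The company’s CEO; VP = didn’t respond; PP (complement of the verb) = to our meeting invitation.
Semantics: the agent (the CEO) did not perform the responding action; “our meeting invitation” is what went unanswered.
Morphology: didn’t = did + n’t (do + past tense + not); company’s = company + ’s (possessive); invitation = invite + -ation (noun-forming suffix).
Pragmatics: in a business context, the silence implies a soft refusal or that the meeting is a low priority.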

---

## Question 2: Pre-processing Pipeline Design

You are building a sentiment analysis system for customer reviews from an e-commerce platform. The reviews contain:
- Informal language and slang ("gonna", "wanna", "u")
- Emojis and special characters
- Product codes and prices
- Misspellings and typos

Design a comprehensive text pre-processing pipeline. For each step (tokenization, normalization, stop-word removal, stemming/lemmatization), explain:
1. Why you would include or exclude it
2. What specific considerations apply to this use case
3. The order of operations and why it matters

**Hint**: Consider whether stemming or lemmatization is more appropriate for sentiment analysis. Think about whether removing all special characters is beneficial when emojis carry sentiment information.

1. Lowercasing
The first step is to convert all text to lowercase. This ensures consistency because words like “Good” and “good” should be treated as the same. Lowercasing early helps other regex and normalization steps work more effectively.

2. Removing URLs, Product Codes, and Prices
Next, all URLs, product codes (like “AB123”), and prices (like “$29.99”) are removed or replaced with placeholders such as `<URL>`, `<CODE>`, and `<PRICE>`. These elements usually do not contribute to sentiment, so removing or masking them helps the model focus on meaningful text. This step should be done before tokenization because regular expressions work best on raw text.

3. Handling Emojis and Emoticons
Instead of removing emojis, they should be converted to textual descriptions because they carry strong emotional information. For example, 😍 can be replaced with the word “love” or “smiling_face_with_heart_eyes”. Python libraries like `emoji` can do this conversion automatically. Keeping this information is important for detecting positive or negative emotions.

4. Normalizing Slang and Contractions
Customer reviews often contain slang and short forms like “gonna”, “wanna”, and “u”. These should be replaced with their full forms (“going to”, “want to”, “you”). Similarly, contractions should be expanded: for example, “don’t” becomes “do not”. This step standardizes the vocabulary and makes word embeddings or token matching more effective.

5. Tokenization
After normalization, tokenization splits the text into words or tokens. This allows further processing like stop-word removal and lemmatization. Using a tokenizer from spaCy or NLTK is recommended because they handle punctuation and contractions well. Tokenization must occur after cleaning to avoid splitting irrelevant symbols.

6. Spelling Correction
Since user reviews often have typos (like “amazng” instead of “amazing”), a lightweight spell corrector can be applied, such as TextBlob or pyspellchecker. Correcting common spelling errors improves model accuracy since it reduces vocabulary noise. This is best done after tokenization so that correction happens at the word level.

7. Stop-word Removal
Common words like “is”, “the”, and “a” usually carry little meaning for sentiment, so they can be removed. However, negation words such as “not”, “no”, and “never” must be kept because they directly affect sentiment polarity. For example, “not good” has the opposite meaning to “good.” Stop-word removal should happen after tokenization and normalization.

8. Lemmatization (Not Stemming)
Lemmatization is preferred over stemming for this task. It reduces words to their base or dictionary form while keeping the correct meaning. For instance, “running” → “run” and, when the word is tagged as an adjective, “better” → “good.” Stemming, on the other hand, may cut words too aggressively or leave non-words (e.g., “lovely” → “love” but “happily” → “happili”), which can confuse the sentiment model.
Lemmatization preserves grammatical and semantic accuracy, which is crucial for sentiment detection.

9. Punctuation and Special Character Handling
After important words are retained, most punctuation marks and special symbols can be removed. However, exclamation marks and question marks can be kept or counted as features since they often convey intensity of emotion (e.g., “I love it!!!” is stronger than “I love it”). Thus, only truly meaningless symbols should be deleted.
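
As a minimal sketch of "counting punctuation as features", the helper below records emphasis markers before they are stripped. The function name `punctuation_features` and the feature names are hypothetical, chosen only for illustration.

```python
import re

def punctuation_features(text):
    """Count emphasis markers so the signal survives punctuation removal."""
    runs = [len(m.group()) for m in re.finditer(r"[!?]+", text)]
    return {
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        # Longest unbroken run of '!' or '?', e.g. 3 for "!!!"
        "max_run": max(runs, default=0),
    }

print(punctuation_features("I love it!!!"))
print(punctuation_features("I love it"))
```

These counts could be appended to the text vector as extra columns, so the model still sees that “I love it!!!” is more emphatic than “I love it”.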

10. Final Output or Vectorization
Once the text is cleaned, it can be converted into tokens or vectors for machine learning models (e.g., using TF-IDF, Word2Vec, or BERT). This step uses the cleaned and normalized data for training and prediction.

Why the Order Matters

The order is important because earlier steps prepare the text for later ones. Normalization must come before tokenization to ensure consistent splitting.
Noise removal should happen before spelling correction to avoid unnecessary processing. Lemmatization comes after stop-word removal so fewer words are processed. Each step builds on the clean structure provided by the previous one.

Example 

In [None]:
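import re

import emoji
import spacy
from textblob import TextBlob

nlp = spacy.load("en_core_web_sm")

# Keep placeholders as single tokens instead of '<', 'price', '>'
for placeholder in ("<PRICE>", "<CODE>"):
    nlp.tokenizer.add_special_case(placeholder, [{"ORTH": placeholder}])

SLANG = {"gonna": "going to", "wanna": "want to", "u": "you"}
STOP_WORDS = {"is", "the", "a"}  # tiny illustrative list; negations are deliberately absent

def preprocess(text):
    text = text.lower()
    # Emojis become words; space delimiters avoid stray ':' tokens
    text = emoji.demojize(text, delimiters=(" ", " "))
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'\$\d+(\.\d+)?', '<PRICE>', text)
    text = re.sub(r'\b[a-z]{2}\d+\b', '<CODE>', text)
    # Whole-word replacement; str.replace("u", "you") would corrupt words like "product"
    for slang, full in SLANG.items():
        text = re.sub(rf'\b{re.escape(slang)}\b', full, text)

    tokens = [t.text for t in nlp(text) if not t.is_space]
    # Spell-check only plain alphabetic tokens; leave placeholders and emoji names alone
    corrected = [str(TextBlob(tok).correct()) if tok.isalpha() else tok for tok in tokens]
    filtered = [tok for tok in corrected if tok not in STOP_WORDS]
    # Token-by-token lemmatization loses sentence context; acceptable for a small example
    lemmatized = [tok if tok.startswith("<") else nlp(tok)[0].lemma_ for tok in filtered]
    return lemmatized

# Test it
sample_review = "This product is awesome! 😊 I'm gonna buy it again for $99.99"
result = preprocess(sample_review)
print(result)

The exact output depends on the spaCy model version and on TextBlob's spelling corrections. The result should be a list of lowercase lemmas in which “gonna” has been expanded and `<PRICE>` is kept as a single token.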


---

## Question 3: Stemming vs Lemmatization Trade-offs

Consider these sentences:
1. "The meeting was well organized and the organizers did a great job."
2. "She is better at organizing than her predecessor was."

Apply both stemming (Porter Stemmer) and lemmatization to these sentences. Then:
- Compare the outputs and explain the differences
- Discuss scenarios where stemming would be preferred over lemmatization and vice versa
- Analyze the impact on: search engines, text classification, and information retrieval systems

**Hint**: Consider computational cost, accuracy, and preservation of meaning. Words like "better", "organizing", and "was" behave differently under stemming vs lemmatization.

Stemming using Porter Stemmer:
When we apply stemming, words are reduced by cutting off suffixes without checking their grammatical meaning.
Example output:

Sentence 1 → [the, meet, wa, well, organ, and, the, organ, did, a, great, job]

Sentence 2 → [she, is, better, at, organ, than, her, predecessor, wa]

Here, organized, organizers, and organizing all become “organ”, which is a different word with an unrelated meaning, so the link to organizing is lost. The word was becomes “wa”, which is not a word at all. This shows that stemming is purely rule-based: fast but crude.

Lemmatization:
When we apply lemmatization, each word is converted into its base form (lemma) based on vocabulary and grammar rules.
Example output:

Sentence 1 → [the, meeting, be, well, organize, and, the, organizer, do, a, great, job]

Sentence 2 → [she, be, good, at, organize, than, her, predecessor, be]

Here, organized and organizing reduce to “organize”, organizers reduces to the noun lemma “organizer”, was changes to “be”, and better changes to “good”. These results depend on giving the lemmatizer the correct part of speech for each word. Lemmatization gives meaningful results suitable for language-based models.

Comparison and Usage -

Stemming-
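Cuts suffixes mechanically; may produce misleading stems or non-words (e.g., organizing → organ, was → wa).

Fast and simple, suitable for large-scale text like search engines.

Improves recall (finds more results) but reduces precision.

Good when exact meaning isn’t critical.

Lemmatization-
Converts words to true base forms using grammar (e.g., organizing → organize, better → good).

Slower, because it needs a dictionary and part-of-speech information.

Preferred when meaning must be preserved, such as text classification and sentiment analysis.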

In [30]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["organized", "organizers", "organizing", "better", "was"]
print("Stemming:", [stemmer.stem(w) for w in words])
print("Lemmatization:", [lemmatizer.lemmatize(w, pos='v') for w in words])

Stemming: ['organ', 'organ', 'organ', 'better', 'wa']
Lemmatization: ['organize', 'organizers', 'organize', 'better', 'be']
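
To make the recall/precision trade-off concrete, here is a small sketch of a toy inverted index. The stems are copied from the Porter output above; the three documents, the assumption that Porter leaves “organ” unchanged, and the helper names are illustrative only.

```python
from collections import defaultdict

# Stems taken from the Porter output above; "organ" -> "organ" is an assumption.
PORTER_STEMS = {
    "organized": "organ",
    "organizers": "organ",
    "organizing": "organ",
    "organ": "organ",
}

docs = {
    1: "the meeting was well organized",
    2: "she is better at organizing",
    3: "the organ donor registry",  # unrelated sense of "organ"
}

def stem(word):
    return PORTER_STEMS.get(word, word)

def build_index(docs, normalize):
    """Map each normalized term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[normalize(word)].add(doc_id)
    return index

exact_index = build_index(docs, lambda w: w)
stem_index = build_index(docs, stem)

query = "organizing"
print("exact match:  ", sorted(exact_index[query]))        # [2]
print("stemmed match:", sorted(stem_index[stem(query)]))   # [1, 2, 3]
```

Stemming finds document 1 (higher recall) but also pulls in document 3, which is about a different kind of organ (lower precision).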


---

## Question 4: POS Tagging for Ambiguity Resolution

Examine these ambiguous sentences:
1. "The duck is ready to eat."
2. "They can fish."
3. "Time flies like an arrow."

Explain:
- How POS tagging helps resolve these ambiguities
- The difference between rule-based and probabilistic POS tagging approaches
- Which approach would perform better for each sentence and why
- Limitations of both approaches

**Hint**: Consider how context and word order influence tagging. Think about the Hidden Markov Model approach for probabilistic tagging vs pattern-matching rules.

Ambiguous Sentences

“The duck is ready to eat.”
→ “duck” can be a noun (animal) or a verb (to lower quickly). The sentence also has a structural ambiguity (the duck will eat vs. the duck will be eaten) that POS tags alone cannot resolve.

“They can fish.”
→ “can” can be a modal verb, a main verb (to put in cans), or a noun (container); “fish” can be a noun or a verb.

“Time flies like an arrow.”
→ “flies” can be a verb or a noun; “like” can be a verb or a preposition.

How POS Tagging Helps

POS tagging assigns each word a grammatical category (noun, verb, adjective, etc.) based on context.
It resolves ambiguity by analyzing:

Surrounding words (context window)

Syntactic structure (word order)

Probabilistic likelihood (word-tag frequency)

For example:

In “The duck is ready to eat,” the article “The” before “duck” signals noun usage, not a verb.

In “They can fish,” the auxiliary “can” followed by the verb “fish” shows a modal + verb structure.

In “Time flies like an arrow,” sequence analysis identifies “flies” as a verb, not a noun.

Rule-Based vs Probabilistic Tagging

Rule-Based Tagging:
Uses manually written linguistic rules (e.g., if a word follows “the,” tag it as a noun).
Example: “The duck” → duck = noun.

Probabilistic Tagging (HMM):
Uses probabilities from training data.
It chooses the tag sequence with the highest likelihood using context and transition probabilities.
Example: “They can fish” → can = modal verb (based on frequent usage with pronouns).
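
The HMM idea can be sketched with a small Viterbi decoder for “They can fish”. Every probability below is invented for illustration (a real tagger would estimate them from an annotated corpus), and the four-tag set is a simplification.

```python
TAGS = ["PRON", "MD", "VERB", "NOUN"]

# Made-up probabilities, for illustration only.
START = {"PRON": 0.6, "NOUN": 0.3, "MD": 0.05, "VERB": 0.05}
TRANS = {
    "PRON": {"MD": 0.5, "VERB": 0.4, "NOUN": 0.05, "PRON": 0.05},
    "MD":   {"VERB": 0.8, "NOUN": 0.1, "PRON": 0.05, "MD": 0.05},
    "VERB": {"NOUN": 0.6, "PRON": 0.2, "VERB": 0.1, "MD": 0.1},
    "NOUN": {"VERB": 0.4, "MD": 0.3, "NOUN": 0.2, "PRON": 0.1},
}
EMIT = {  # P(word | tag)
    "PRON": {"they": 0.2},
    "MD":   {"can": 0.3},
    "VERB": {"can": 0.01, "fish": 0.02},
    "NOUN": {"can": 0.005, "fish": 0.03},
}

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence under the HMM."""
    # best[i][tag] = (probability of the best path ending in tag at word i, previous tag)
    best = [{t: (start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prob, prev = max((best[-1][p][0] * trans_p[p].get(t, 0.0), p) for p in tags)
            column[t] = (prob * emit_p[t].get(word, 0.0), prev)
        best.append(column)
    # Backtrack from the most probable final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["they", "can", "fish"], TAGS, START, TRANS, EMIT))
# ['PRON', 'MD', 'VERB']
```

With these numbers the modal reading wins because PRON → MD → VERB is a high-probability tag sequence. The alternative reading (PRON, VERB, NOUN, as in “they put fish in cans”) is still possible but scores lower.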

Which Approach Performs Better

Sentence 1: “The duck is ready to eat.”
→ Rule-based works well: “The” before “duck” clearly signals a noun.

Sentence 2: “They can fish.”
→ Probabilistic performs better: it learns that a modal after a pronoun, followed by a base-form verb, is a frequent tag sequence.

Sentence 3: “Time flies like an arrow.”
→ Probabilistic (HMM) works better: it uses contextual probability to infer the most likely tag sequence.

Limitations

Rule-Based: Needs many handcrafted rules; fails with unseen or irregular patterns.

Probabilistic: Depends on large, high-quality training data; may choose wrong tag for rare phrases.
Both can misinterpret highly poetic or unusual language.

In [31]:
import spacy
nlp = spacy.load("en_core_web_sm")

sentences = [
    "The duck is ready to eat.",
    "They can fish.",
    "Time flies like an arrow."
]

for s in sentences:
    doc = nlp(s)
    print(f"\nSentence: {s}")
    for token in doc:
        print(f"{token.text:<10} {token.pos_:<10} {token.tag_:<8}")


Sentence: The duck is ready to eat.
The        DET        DT      
duck       NOUN       NN      
is         AUX        VBZ     
ready      ADJ        JJ      
to         PART       TO      
eat        VERB       VB      
.          PUNCT      .       

Sentence: They can fish.
They       PRON       PRP     
can        AUX        MD      
fish       VERB       VB      
.          PUNCT      .       

Sentence: Time flies like an arrow.
Time       NOUN       NN      
flies      VERB       VBZ     
like       ADP        IN      
an         DET        DT      
arrow      NOUN       NN      
.          PUNCT      .       


---

## Question 5: Named Entity Recognition System Design

You need to build an NER system for extracting information from medical reports. The text contains:
- Disease names ("Type 2 Diabetes", "COVID-19")
- Medication names ("Metformin", "Ibuprofen 200mg")
- Dosages and measurements
- Doctor and patient names
- Hospital names and dates

Compare dictionary-based and CRF-based NER methods for this application:
- Advantages and disadvantages of each approach
- How would you handle new drug names not in the dictionary?
- What features would you use in a CRF model?
- How would you combine both approaches for optimal results?

**Hint**: Consider that medical terminology is specialized but relatively standardized. Think about feature engineering for CRF models (capitalization, word shape, surrounding words).

1. Dictionary-Based NER

Advantages:

Easy to implement using predefined medical dictionaries (UMLS, SNOMED, DrugBank).

High accuracy for known terms (e.g., “COVID-19”, “Metformin”).

Disadvantages:

Fails on unseen or newly introduced terms.

Cannot understand context (e.g., “discharge” as a symptom vs. verb).

Needs constant dictionary updates.

2. CRF-Based NER (Conditional Random Fields)

Advantages:

Learns context and sequence patterns from data.

Handles unseen entities based on features.

Works well for ambiguous or complex sentences.

Disadvantages:

Requires annotated training data.

More computationally expensive than dictionary lookups.

3. Handling New Drug Names

Use subword and shape-based features (e.g., a capitalized word followed by a number and a unit is likely a medication with its dosage: “Ibuprofen 200mg”).

Integrate character n-grams, prefix/suffix patterns (e.g., “-mab”, “-vir”).

Combine with external drug databases for continuous updates.

4. Features for CRF Model

Key features to train the model effectively:

Word itself and lowercase form.

Part-of-speech (POS) tag.

Capitalization pattern (e.g., TitleCase, ALLCAPS).

Word shape (e.g., “Metformin” → “Xxxxxxxxx”, “200mg” → “dddxx”).

Prefixes and suffixes (common in drug names).

Context words (previous and next tokens).

Digit presence, unit tokens (“mg”, “ml”), and special characters (“%”).
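
The feature list above can be sketched as a function that turns one token into a feature dictionary, the kind of input CRF toolkits such as sklearn-crfsuite typically accept. The feature names, the `DRUG_SUFFIXES` and `UNITS` constants, and the boundary markers are illustrative assumptions; the POS-tag feature is left out because it needs a tagger.

```python
import re

DRUG_SUFFIXES = ("mab", "vir")   # suffixes mentioned in the text; extend as needed
UNITS = ("mg", "ml", "%")

def word_shape(token):
    """Map uppercase to X, lowercase to x, digits to d; keep other characters."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"\d", "d", shape)

def token_features(tokens, i):
    """Build the feature dictionary for tokens[i], including its neighbours."""
    tok = tokens[i]
    lower = tok.lower()
    return {
        "word.lower": lower,
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "shape": word_shape(tok),
        "prefix3": lower[:3],
        "suffix3": lower[-3:],
        "drug_suffix": lower.endswith(DRUG_SUFFIXES),
        "has_digit": any(c.isdigit() for c in tok),
        "has_unit": lower.endswith(UNITS),
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["prescribed", "Metformin", "500mg", "daily"]
for i in range(len(tokens)):
    print(token_features(tokens, i))
```

Because shape and suffix features do not depend on the exact word, a model trained on them can still score a drug name it has never seen.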

5. Hybrid Approach (Best Practice)

Combine both methods for higher accuracy:

Use dictionary-based tagging for known medical terms.

Apply CRF model for contextual disambiguation and unseen terms.

If both agree → high-confidence tag; if not → CRF output is prioritized.
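
A minimal sketch of this merging rule, assuming both systems return entity spans as `(start_token, end_token, label)` tuples. The span format, the confidence labels, and the choice to keep dictionary hits that the CRF did not touch are assumptions made for illustration.

```python
def overlaps(a, b):
    """True if two (start, end, label) spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]

def merge_entities(dict_spans, crf_spans):
    """Combine dictionary and CRF spans, preferring the CRF on disagreement."""
    merged = []
    for span in crf_spans:
        confidence = "high" if span in dict_spans else "crf_only"
        merged.append((*span, confidence))
    for span in dict_spans:
        # Keep a dictionary hit only if the CRF proposed nothing overlapping it
        if not any(overlaps(span, c) for c in crf_spans):
            merged.append((*span, "dict_only"))
    return sorted(merged)

# Tokens: Patient diagnosed with Type 2 Diabetes and prescribed Metformin 500mg daily
dict_spans = [(3, 6, "DISEASE"), (8, 9, "MEDICATION")]
crf_spans = [(3, 6, "DISEASE"), (8, 10, "MEDICATION")]  # hypothetical CRF output
print(merge_entities(dict_spans, crf_spans))
# [(3, 6, 'DISEASE', 'high'), (8, 10, 'MEDICATION', 'crf_only')]
```

Here the two systems agree on the disease span, while the CRF's longer medication span (drug plus dosage) replaces the shorter dictionary match.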

In [32]:
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "DISEASE", "pattern": "Type 2 Diabetes"},
    {"label": "MEDICATION", "pattern": "Metformin"}
])

text = "Patient diagnosed with Type 2 Diabetes and prescribed Metformin 500mg daily."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Type 2 Diabetes DISEASE
Metformin MEDICATION


---

## Question 6: N-gram Language Models and Perplexity

Given a small corpus:
```
"I love machine learning"
"I love deep learning"
"Machine learning is fascinating"
"Deep learning is powerful"
```

a) Build a bigram language model and calculate probabilities for:
   - "I love natural learning"
   - "Machine learning is powerful"

b) Explain the zero-probability problem and demonstrate:
   - How Laplace smoothing addresses it
   - The concept of backoff strategies
   - How to calculate and interpret perplexity

c) Discuss why lower perplexity indicates a better language model.

**Hint**: For unseen bigrams like "natural learning", consider what probability would be assigned without smoothing. Calculate perplexity as a measure of how "surprised" the model is.

Answer :

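Increasing n lowers perplexity only while the corpus has enough data; beyond that point higher-order n-grams become sparse and perplexity rises again.
Without smoothing, an unseen bigram such as “love natural” gets probability 0, so the whole sentence “I love natural learning” gets probability 0. Laplace (add-one) smoothing fixes this by adding 1 to every bigram count and the vocabulary size V to the denominator. Backoff instead falls back to the unigram probability when a bigram is unseen.
Lower perplexity is better: the model assigns higher probability to the test text, so it is less “surprised” by it.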
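
The calculation can be sketched in a few lines of Python. The choices below are assumptions, not the only valid setup: text is lowercased, each sentence is wrapped in `<s>` and `</s>`, the vocabulary reserves one `<unk>` slot for unseen words, and perplexity is computed per sentence as P^(-1/N) with N counting `</s>`.

```python
import math
from collections import Counter

corpus = [
    "I love machine learning",
    "I love deep learning",
    "Machine learning is fascinating",
    "Deep learning is powerful",
]

def tokenize(sentence):
    return ["<s>"] + sentence.lower().split() + ["</s>"]

history_counts = Counter()   # how often each word appears as a bigram history
bigram_counts = Counter()
for sentence in corpus:
    tokens = tokenize(sentence)
    history_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

# Everything the model can predict, plus one <unk> slot for unseen words
vocab = {word for _, word in bigram_counts} | {"<unk>"}
V = len(vocab)

def bigram_prob(prev, word, laplace=False):
    pair = bigram_counts[(prev, word)]
    total = history_counts[prev]
    if laplace:
        return (pair + 1) / (total + V)
    return pair / total if total else 0.0

def sentence_prob(sentence, laplace=False):
    tokens = tokenize(sentence)
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word, laplace)
    return prob

def perplexity(sentence, laplace=False):
    n = len(tokenize(sentence)) - 1   # predicted tokens, including </s>
    prob = sentence_prob(sentence, laplace)
    return math.inf if prob == 0 else prob ** (-1 / n)

for s in ["I love natural learning", "Machine learning is powerful"]:
    print(s)
    print("  unsmoothed P:", sentence_prob(s), " PP:", perplexity(s))
    print("  Laplace    P:", sentence_prob(s, True), " PP:", perplexity(s, True))
```

Unsmoothed, “Machine learning is powerful” gets 1/4 × 1 × 1/2 × 1/2 × 1 = 0.0625, while “I love natural learning” gets 0 and therefore infinite perplexity. With Laplace smoothing both sentences receive a small non-zero probability.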

---

## Question 7: Bag-of-Words vs TF-IDF Analysis

Consider three documents:
- Doc1: "Machine learning is a subset of artificial intelligence"
- Doc2: "Deep learning is a subset of machine learning"
- Doc3: "Artificial intelligence and machine learning are transforming industries"

a) Construct the BoW representation and TF-IDF vectors for all documents

b) Calculate cosine similarity between documents using both representations

c) Explain:
   - Why the similarity scores differ between BoW and TF-IDF
   - Which representation better captures document similarity for:
     - Information retrieval
     - Document clustering
     - Topic modeling
   - Limitations of both approaches

**Hint**: Consider how TF-IDF downweights common terms like "is" and "a". Think about what information is lost (word order, context, semantics).

(a) BoW and TF-IDF

BoW: Represents each document by word counts.
Example (partial):

Doc1: machine=1, learning=1, artificial=1, intelligence=1...

Doc2: deep=1, learning=2, machine=1...

TF-IDF: Weights each word by importance.
Words that occur in most documents get low weights; rare words like “deep” and “industries” get high weights. In this corpus the lowest weights go to “machine” and “learning”, which appear in all three documents.

(b) Cosine Similarity

Using BoW:

Doc1–Doc2: 0.756 (high overlap)

Doc1–Doc3: 0.535

Doc2–Doc3: 0.354

Using TF-IDF:

Doc1–Doc2: 0.694

Doc1–Doc3: 0.405

Doc2–Doc3: 0.204

BoW gives higher similarity because words shared by all documents count fully; TF-IDF lowers it since those words are down-weighted.

(c) Explanation

TF-IDF reduces the weight of common terms, focusing on unique words → better distinguishes topics.

BoW counts all words equally, so common words inflate similarity.

Best use:

TF-IDF: Information retrieval, clustering

BoW: Topic modeling (needs raw counts)

Limitations: Both ignore word order, context, and meaning.

In [34]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# --- Input documents ---
docs = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Artificial intelligence and machine learning are transforming industries"
]

# --- Bag-of-Words representation ---
bow_vec = CountVectorizer().fit_transform(docs)

# --- TF-IDF representation ---
tfidf_vec = TfidfVectorizer().fit_transform(docs)

# --- Cosine similarity matrices ---
bow_sim = cosine_similarity(bow_vec)
tfidf_sim = cosine_similarity(tfidf_vec)

# --- Display results ---
print("Vocabulary (BoW):", CountVectorizer().fit(docs).get_feature_names_out())
print("\nBoW Cosine Similarity:\n", bow_sim.round(3))
print("\nTF-IDF Cosine Similarity:\n", tfidf_sim.round(3))

Vocabulary (BoW): ['and' 'are' 'artificial' 'deep' 'industries' 'intelligence' 'is'
 'learning' 'machine' 'of' 'subset' 'transforming']

BoW Cosine Similarity:
 [[1.    0.756 0.535]
 [0.756 1.    0.354]
 [0.535 0.354 1.   ]]

TF-IDF Cosine Similarity:
 [[1.    0.694 0.405]
 [0.694 1.    0.204]
 [0.405 0.204 1.   ]]
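
For readers who want to see where these numbers come from, the sketch below recomputes both similarity matrices with the standard library only. It assumes scikit-learn's default settings: lowercase tokens of at least two word characters (so “a” is dropped), raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization (implicit in the cosine).

```python
import math
import re
from collections import Counter

docs = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Artificial intelligence and machine learning are transforming industries",
]

def tokenize(text):
    # Lowercase, keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

counts = [Counter(tokenize(d)) for d in docs]
vocab = sorted(set().union(*counts))
n_docs = len(docs)

# Document frequency and smoothed inverse document frequency
df = {t: sum(t in c for c in counts) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

bow = [[c[t] for t in vocab] for c in counts]
tfidf = [[c[t] * idf[t] for t in vocab] for c in counts]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(f"Doc{i + 1}-Doc{j + 1}: "
          f"BoW={cosine(bow[i], bow[j]):.3f}  TF-IDF={cosine(tfidf[i], tfidf[j]):.3f}")
```

The terms “machine” and “learning” occur in every document, so their idf is exactly 1, the minimum. That is why every TF-IDF similarity is lower than its BoW counterpart.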


---

## Question 8: Word2Vec Architectures Deep Dive

Explain the Word2Vec model by addressing:

a) **CBOW (Continuous Bag of Words)**:
   - Architecture and training objective
   - How context words predict the target word
   - Best use cases

b) **Skip-gram**:
   - Architecture and training objective
   - How target word predicts context words
   - Best use cases

c) For the sentence "The quick brown fox jumps over the lazy dog" (window size = 2):
   - Show training examples for both CBOW and Skip-gram when target word is "fox"
   - Explain which architecture works better for:
     - Small datasets
     - Rare words
     - Frequent words

**Hint**: CBOW is faster and works well with frequent words, while Skip-gram is better for rare words and smaller datasets. Consider the number of training instances generated.

Answer :

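CBOW predicts the target word from its averaged context; it trains faster and works well on large corpora and frequent words. Skip-gram predicts each context word from the target and gives better vectors for rare words.
With window size 2 and target “fox”, the context words are quick, brown, jumps, and over.
Small datasets: Skip-gram. Large datasets: CBOW.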
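
The training examples for “fox” can be generated with a short helper. This is only a sketch of how the (input, output) pairs are formed; the function name is hypothetical and no model is trained.

```python
def training_examples(tokens, target_index, window=2):
    """Return the CBOW example and the Skip-gram examples for one target position."""
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    context = [tokens[i] for i in range(lo, hi) if i != target_index]
    target = tokens[target_index]
    cbow = (context, target)                     # one example: context -> target
    skipgram = [(target, c) for c in context]    # one example per context word
    return cbow, skipgram

tokens = "The quick brown fox jumps over the lazy dog".lower().split()
cbow, skipgram = training_examples(tokens, tokens.index("fox"))

print("CBOW:     ", cbow)
print("Skip-gram:", skipgram)
# CBOW:      (['quick', 'brown', 'jumps', 'over'], 'fox')
# Skip-gram: [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```

Skip-gram produces four training instances for this position where CBOW produces one, which is one reason it extracts more signal from rare words and small corpora.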

---

## Question 9: GloVe vs FastText Comparison

Compare and contrast GloVe and FastText embedding techniques:

a) **Training methodology**:
   - How does GloVe use global co-occurrence statistics?
   - How does FastText incorporate subword information?

b) **Handling Out-of-Vocabulary (OOV) words**:
   - Given the trained words: "playing", "player", "played"
   - How would each model handle the unseen word "gameplay"?
   - Which model is more suitable for morphologically rich languages (e.g., German, Turkish)?

c) **Practical considerations**:
   - Training time and computational requirements
   - Model size and memory footprint
   - Performance on rare and misspelled words

**Hint**: FastText breaks words into character n-grams (e.g., "playing" → "<pl", "pla", "lay", "ayi", "yin", "ing", "ng>"). GloVe uses matrix factorization on co-occurrence counts.

Answer :

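GloVe learns one vector per word from global co-occurrence statistics, so an unseen word such as “gameplay” has no vector. FastText builds word vectors from character n-grams (subwords), so it can compose a vector for OOV words and is better suited to morphologically rich languages.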
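
The subword idea can be illustrated by extracting character trigrams, as in the hint. This sketch uses only n = 3 for readability; FastText itself combines several n-gram lengths, and the helper name is hypothetical.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("playing"))
# ['<pl', 'pla', 'lay', 'ayi', 'yin', 'ing', 'ng>']

unseen = set(char_ngrams("gameplay"))
for known in ["playing", "player", "played"]:
    shared = sorted(unseen & set(char_ngrams(known)))
    print(f"gameplay / {known}: shared n-grams {shared}")
```

“gameplay” shares the trigrams “pla” and “lay” with each trained word, so a subword model can build a vector for it from n-grams it has already learned. A model with one vector per whole word has nothing to look up.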

---

## Question 10: Classical vs Distributed Representations - Application Perspective

You are tasked with building three different NLP applications:

1. **Legal document search engine** (searching through contracts and legal texts)
2. **Chatbot intent classification** (understanding user queries)
3. **Academic paper recommendation system** (suggesting related research papers)

For each application:

a) Decide whether to use classical representations (BoW/TF-IDF) or distributed representations (Word2Vec/GloVe/FastText)

b) Justify your choice by considering:
   - Semantic similarity requirements
   - Vocabulary size and domain specificity
   - Training data availability
   - Computational constraints
   - Interpretability needs

c) Discuss hybrid approaches: Could combining both representation types improve performance? How?

**Hint**: Legal documents might require exact term matching, while chatbots benefit from semantic understanding. Consider that classical methods are sparse and interpretable, while distributed representations are dense and capture semantic relationships.

Answer :

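Legal document search: TF-IDF, because exact term matching and interpretability matter. Chatbot intent classification: embeddings, because queries with different wording must map to the same intent. Paper recommendation: a hybrid that concatenates TF-IDF and embedding vectors.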
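
One possible way to concatenate the two representations is sketched below. The `alpha` weight, the normalization step, and the function names are assumptions for illustration; in practice the TF-IDF vector and the averaged word embedding would come from a library such as scikit-learn or gensim.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def hybrid_vector(tfidf_vec, embedding_vec, alpha=0.5):
    """Concatenate a sparse TF-IDF vector with a dense embedding vector.

    alpha weights the TF-IDF part and (1 - alpha) the embedding part.
    Both parts are L2-normalized first so neither dominates by scale.
    """
    sparse = [alpha * x for x in l2_normalize(tfidf_vec)]
    dense = [(1 - alpha) * x for x in l2_normalize(embedding_vec)]
    return sparse + dense

# Toy inputs: a 4-term TF-IDF vector and a 3-dimensional document embedding
doc_vec = hybrid_vector([0.0, 1.3, 0.0, 2.1], [0.2, -0.5, 0.7], alpha=0.6)
print(len(doc_vec), doc_vec)
```

Raising `alpha` favours exact term overlap, and lowering it favours semantic similarity, so the same code can be tuned per application.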

---

## Submission Guidelines

- Complete all questions in this notebook
- Include code implementations where applicable (using NLTK, spaCy, scikit-learn, or gensim)
- Provide clear explanations and reasoning
- Add visualizations if they help explain your answers
- Ensure your code is properly commented