# Basics of NLP — Day 3

- **Basics of NLP**: Lexical, Syntactic, Semantic, Discourse, Pragmatic analysis
- **Simple Chatbot Components**: NLTK, scikit-learn, corpora, preprocessing, tokenization, vectorization, Bag-of-Words, Naive Bayes
- **Translators**: Using the `Translator` class from the unofficial `googletrans` package
- **Practical**: Build a tiny translator and a rule+ML hybrid chatbot

Tip: Run cells from top to bottom. If something fails, re-run the setup cell.

## 0) Setup
- Installs required packages (nltk, scikit-learn, googletrans)
- Downloads small NLTK resources

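If you're offline, the downloads will fail and cells that need NLTK data (tokenization, tagging, WordNet, NER) will raise a `LookupError`. Cells that use only scikit-learn or plain Python will still work.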

In [None]:
# If running in a fresh environment, uncomment the next two lines to install:
# !pip install -q nltk scikit-learn googletrans==4.0.0-rc1

import nltk

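# Download essential NLTK data quietly.
# Newer NLTK releases (3.9+) look for renamed resources, so both names are requested.
packages = [
    'punkt',                           # tokenizers
    'punkt_tab',                       # tokenizers (NLTK 3.9+ name)
    'stopwords',                       # stopwords list
    'wordnet',                         # WordNet for lemmatization/semantics
    'omw-1.4',                         # WordNet multilingual data
    'averaged_perceptron_tagger',      # POS tagger
    'averaged_perceptron_tagger_eng',  # POS tagger (NLTK 3.9+ name)
    'maxent_ne_chunker',               # NER chunker
    'maxent_ne_chunker_tab',           # NER chunker (NLTK 3.9+ name)
    'words'                            # word list for NER
]
for p in packages:
    try:
        # nltk.download usually returns False on failure instead of raising
        if not nltk.download(p, quiet=True):
            print(f'Could not download {p} (offline, or not in this NLTK version)')
    except Exception as e:
        print(f'Could not download {p}:', e)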

print('Setup complete. ✅')

## 1) Lexical Analysis (Words and Sentences)
Lexical analysis = breaking text into units (sentences, words), normalizing, and cleaning.

We'll cover:
- **Sentence tokenization**
- **Word tokenization**
- **Stopword removal**
- **Stemming vs Lemmatization**
- A simple **Regex tokenizer**


In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = (
    "NLP is fun! It helps computers understand language. "
    "Machine learning + linguistics = powerful tools."
)

# Sentence tokenization
sentences = sent_tokenize(text)
print('Sentences:', sentences)

# Word tokenization
words = word_tokenize(text)
print('Words:', words)

# Lowercasing and stopword removal
stop_words = set(stopwords.words('english'))
words_clean = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
print('Clean words (no stopwords/punct):', words_clean)

# Stemming vs Lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stems = [stemmer.stem(w) for w in words_clean]
lemmas = [lemmatizer.lemmatize(w) for w in words_clean]
print('Stems:', stems)
print('Lemmas:', lemmas)

# Regex tokenizer: only words of 2+ letters
regex_tok = RegexpTokenizer(r'[A-Za-z]{2,}')
print('Regex tokens:', regex_tok.tokenize(text))
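
For intuition about what the tokenizers above are doing, here is a minimal sketch of the same lexical steps using only the standard library. The regexes, the tiny `STOPWORDS` set, and the `crude_stem` suffix rule are illustrative assumptions. They are far simpler than NLTK's Punkt tokenizer, stopword corpus, and Porter stemmer, and they will not produce identical output.

```python
import re

# Hypothetical, hand-picked stopword list (NLTK's English list is much longer)
STOPWORDS = {'is', 'it', 'the', 'a', 'an', 'and', 'of', 'to'}

def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def split_words(text):
    # Keep alphabetic runs only; punctuation and symbols are dropped
    return re.findall(r'[A-Za-z]+', text)

def clean(words):
    # Lowercase and drop stopwords
    return [w.lower() for w in words if w.lower() not in STOPWORDS]

def crude_stem(word):
    # Toy suffix stripping, NOT the Porter algorithm
    for suffix in ('ing', 'es', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

demo_text = "NLP is fun! It helps computers understand language."
demo_sentences = split_sentences(demo_text)
demo_words = clean(split_words(demo_text))
demo_stems = [crude_stem(w) for w in demo_words]

print('Sentences:', demo_sentences)
print('Clean words:', demo_words)
print('Crude stems:', demo_stems)
```

The sentence rule breaks on abbreviations such as "Dr. Rao", which is one reason a trained tokenizer like Punkt is preferred in practice.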

## 2) Syntactic Analysis (Grammar / Structure)
Syntactic analysis = understanding parts of speech (POS) and phrase structures.

We'll do:
- **POS tagging**
- **Chunking** (simple noun phrase finder)


In [None]:
import nltk
tokens = word_tokenize("John bought a new laptop from the store in Chennai.")
pos_tags = nltk.pos_tag(tokens)
print('POS tags:', pos_tags)

# Define a simple noun phrase (NP) chunk grammar: optional determiner + any adjectives + one noun
grammar = r"NP: {<DT>?<JJ>*<NN.*>}"
cp = nltk.RegexpParser(grammar)
tree = cp.parse(pos_tags)
print(tree)  # text-based tree
# To visualize in a window (optional in local env):
# tree.draw()
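
To see what the chunk pattern `<DT>?<JJ>*<NN.*>` means, the sketch below applies the same rule by hand to a list of tagged tokens. The tags are typed in manually as an assumption about what a tagger would return for this sentence, and `np_chunks` is a hypothetical helper, not part of NLTK.

```python
def np_chunks(tagged):
    """Find spans matching: optional DT, then any number of JJ, then one NN* tag."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == 'DT':                    # optional determiner
            j += 1
        while j < n and tagged[j][1] == 'JJ':       # zero or more adjectives
            j += 1
        if j < n and tagged[j][1].startswith('NN'):  # exactly one noun (NN, NNS, NNP, ...)
            chunks.append([word for word, _ in tagged[i:j + 1]])
            i = j + 1
        else:
            i += 1                                   # no chunk starts here; move on
    return chunks

# Hand-written tags (assumed, not produced by a tagger)
tagged_demo = [
    ('John', 'NNP'), ('bought', 'VBD'), ('a', 'DT'), ('new', 'JJ'),
    ('laptop', 'NN'), ('from', 'IN'), ('the', 'DT'), ('store', 'NN'),
    ('in', 'IN'), ('Chennai', 'NNP'), ('.', '.'),
]
chunks_demo = np_chunks(tagged_demo)
print(chunks_demo)
```

Each inner list is one noun phrase. Words outside any chunk (verbs, prepositions, punctuation) are simply skipped.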


## 3) Semantic Analysis (Meaning)
Semantic analysis = getting meaning from words/sentences.

We'll do:
- **WordNet**: synonyms / definitions
- **Simple similarity**
- **Named Entity Recognition (NER)**


In [None]:
from nltk.corpus import wordnet as wn

# WordNet: look up synsets for a word
word = 'computer'
synsets = wn.synsets(word)
print(f'Synsets for {word!r}:')
for s in synsets[:3]:
    print('-', s.name(), '| definition:', s.definition())

# Simple path similarity between two senses
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print('Similarity(dog, cat):', dog.path_similarity(cat))

# NER: find named entities (like PERSON, ORGANIZATION)
sentence = "Google hired Sundar Pichai in California."
tokens = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)
ner_tree = nltk.ne_chunk(tags)
print(ner_tree)
# ner_tree.draw()  # optional GUI tree
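
Path similarity is defined as 1 / (1 + length of the shortest path between two senses in the hypernym graph). The sketch below computes that on a tiny hand-built taxonomy. `HYPERNYM` is a made-up toy hierarchy, not WordNet's real graph, and the function names are hypothetical.

```python
# Toy hypernym ("is-a") links: child -> parent. Assumed for illustration only.
HYPERNYM = {
    'dog': 'canine', 'canine': 'carnivore',
    'cat': 'feline', 'feline': 'carnivore',
    'carnivore': 'mammal', 'mammal': 'animal',
}

def ancestors_with_distance(node):
    """Map the node and each of its ancestors to the number of steps needed to reach it."""
    distances = {}
    steps = 0
    while node is not None:
        distances[node] = steps
        node = HYPERNYM.get(node)
        steps += 1
    return distances

def toy_path_similarity(a, b):
    da, db = ancestors_with_distance(a), ancestors_with_distance(b)
    common = set(da) & set(db)
    if not common:
        return None  # no connecting path in this toy graph
    shortest = min(da[n] + db[n] for n in common)
    return 1 / (1 + shortest)

print('dog vs cat:', toy_path_similarity('dog', 'cat'))
print('dog vs canine:', toy_path_similarity('dog', 'canine'))
print('dog vs dog:', toy_path_similarity('dog', 'dog'))
```

In this toy graph the shortest dog-to-cat path has four edges (dog, canine, carnivore, feline, cat), so the score is 1/5.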


## 4) Discourse Analysis (Across Sentences)
Discourse analysis looks at how multiple sentences connect (coherence, reference). True coreference is advanced, but we can do small checks.

Here: we will find simple **discourse markers** (e.g., 'however', 'therefore') and link basic references by repeated nouns.

In [None]:
import re
from collections import Counter

paragraph = (
    "Ravi bought a phone. He liked the camera; however, the battery was weak. "
    "Therefore, he returned the phone."
)
sents = sent_tokenize(paragraph)
markers = {
    'however', 'therefore', 'moreover', 'meanwhile', 'furthermore', 'nevertheless'
}

print('Sentences:')
for i, s in enumerate(sents, 1):
    # Match whole words only, and sort so the output order is stable
    found = sorted(m for m in markers if re.search(rf'\b{m}\b', s.lower()))
    print(f'{i}.', s, ('| markers: ' + ', '.join(found)) if found else '')

# Very naive entity repetition tracker: count content words (stopwords removed)
stop_words = set(stopwords.words('english'))
all_words = [w.lower() for w in word_tokenize(paragraph)
             if w.isalpha() and w.lower() not in stop_words]
counts = Counter(all_words)
print('Repeated content words (possible discourse links):',
      [w for w, c in counts.items() if c > 1])
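
As a small check in the direction of coreference, the sketch below links each pronoun to the most recently mentioned name. It assumes the caller already knows which tokens are names (`known_names`), and it ignores gender, number, and syntax, so it is nowhere near a real coreference resolver. `PRONOUNS` and `link_pronouns` are hypothetical names.

```python
import re

PRONOUNS = {'he', 'she', 'they'}  # deliberately tiny, illustrative list

def link_pronouns(text, known_names):
    """Pair each pronoun with the last known name seen before it."""
    links = []
    last_name = None
    for token in re.findall(r'[A-Za-z]+', text):
        if token in known_names:
            last_name = token
        elif token.lower() in PRONOUNS and last_name is not None:
            links.append((token, last_name))
    return links

demo_paragraph = (
    "Ravi bought a phone. He liked the camera; however, the battery was weak. "
    "Therefore, he returned the phone."
)
demo_links = link_pronouns(demo_paragraph, known_names={'Ravi'})
print(demo_links)
```

With two people in the text, this "most recent name" rule would often pick the wrong one, which illustrates why true coreference is described above as advanced.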

## 5) Pragmatic Analysis (Meaning in Context / Intent)
Pragmatics = how context changes meaning (sarcasm, politeness, intent). This is hard to code simply; modern systems use ML with context.

We'll use a tiny keyword rule to guess intent. The rule looks only at the words in the message, so it cannot see context: the same sentence always gets the same intent, whatever came before it.

In [None]:
import re

def guess_intent(utterance: str) -> str:
    u = utterance.lower().strip()

    def has_any(phrases):
        # Match whole words/phrases only, so 'hi' does not fire inside 'this'
        return any(re.search(rf'\b{re.escape(p)}\b', u) for p in phrases)

    if has_any(['price', 'cost', 'how much']):
        return 'intent: ask_price'
    if has_any(['hi', 'hello', 'hey']):
        return 'intent: greeting'
    if has_any(['bye', 'goodbye', 'see you']):
        return 'intent: farewell'
    return 'intent: unknown'

samples = [
    'Hello!',
    'How much is this phone?',
    'Ok bye'
]
for s in samples:
    print(s, '->', guess_intent(s))
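
To illustrate how context changes meaning, the sketch below interprets the same reply ("yes" or "no") differently depending on what the bot asked last. The dialogue acts (`confirm_order`, `offer_help`), the resulting intents, and the `interpret` helper are all made up for this example.

```python
# Hypothetical table: (what the bot last asked, user's reply) -> interpreted intent
CONTEXT_RULES = {
    ('confirm_order', 'yes'): 'intent: place_order',
    ('confirm_order', 'no'): 'intent: cancel_order',
    ('offer_help', 'yes'): 'intent: ask_help',
    ('offer_help', 'no'): 'intent: farewell',
}

def interpret(utterance, last_bot_act):
    # Normalize the reply, then look it up together with the dialogue context
    reply = utterance.lower().strip(' .!?')
    return CONTEXT_RULES.get((last_bot_act, reply), 'intent: unknown')

for act in ['confirm_order', 'offer_help']:
    print(f"Bot act: {act:14s} | User: 'Yes!' ->", interpret('Yes!', act))
```

The user's words are identical in both rows; only the preceding bot turn differs, and that alone changes the intent.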

# Part 2 — Components of a Simple Chatbot
We'll build a tiny text classifier to map user messages to intents using:

- **Preprocessing & Tokenization** (NLTK)
- **Vectorization** (CountVectorizer = Bag of Words)
- **Classifier** (Naive Bayes)

Then, we write a simple `respond()` function.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np

# 1) Tiny dataset (toy intents)
texts = [
    'hi', 'hello there', 'hey', 'good morning',
    'bye', 'goodbye', 'see you later',
    'what is the price', 'how much does it cost', 'price of the item',
    'what is your name', 'who are you',
    'can you help me', 'i need help', 'please assist'
]
labels = [
    'greet','greet','greet','greet',
    'bye','bye','bye',
    'ask_price','ask_price','ask_price',
    'ask_name','ask_name',
    'ask_help','ask_help','ask_help'
]

# 2) Train/test split
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=42, stratify=labels)

# 3) Pipeline: CountVectorizer + Naive Bayes
model = Pipeline([
    ('vect', CountVectorizer(lowercase=True, ngram_range=(1,2))),
    ('clf', MultinomialNB())
])
model.fit(X_train, y_train)

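# 4) Evaluate (with only 5 test messages, treat these scores as a smoke test, not a benchmark)
pred = model.predict(X_test)
# zero_division=0 silences warnings for classes the model never predicts
print(classification_report(y_test, pred, zero_division=0))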

# 5) Respond function
def respond(user_text: str) -> str:
    intent = model.predict([user_text])[0]
    if intent == 'greet':
        return 'Hello! How can I help you today?'
    if intent == 'bye':
        return 'Goodbye! Have a great day.'
    if intent == 'ask_price':
        return 'This item costs $499.'
    if intent == 'ask_name':
        return "I'm a simple demo bot."
    if intent == 'ask_help':
        return 'Sure, tell me what you need help with.'
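    # model.predict always returns one of the trained intents, so this line is
    # only reached if a new intent label is added without a reply above.
    return "I'm not sure I understood. Could you rephrase?"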

for q in ['hello', 'price please', 'who are you', 'bye']:
    print(q, '->', respond(q))

### How Bag of Words (CountVectorizer) works (quick peek)
- It builds a vocabulary of tokens from the training text (here single words and word pairs, because the pipeline sets `ngram_range=(1,2)`).
- Each message becomes a vector counting token occurrences.
- Naive Bayes learns probabilities of words per class.
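
The three steps above can be sketched in plain Python. This is a simplified illustration, assuming whitespace tokenization, single words only, and add-one (Laplace) smoothing; it is not scikit-learn's implementation, and `train_nb`, `predict_nb`, and `vectorize` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train_nb(texts, labels):
    """Build the vocabulary and count tokens per class."""
    vocab = sorted({tok for t in texts for tok in tokenize(t)})
    class_docs = Counter(labels)              # documents per class (for priors)
    token_counts = defaultdict(Counter)       # token frequencies per class
    for text, label in zip(texts, labels):
        token_counts[label].update(tokenize(text))
    return vocab, class_docs, token_counts

def vectorize(text, vocab):
    """Bag of Words: one count per vocabulary token; unseen words are ignored."""
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocab]

def predict_nb(nb, text, alpha=1.0):
    vocab, class_docs, token_counts = nb
    total_docs = sum(class_docs.values())
    vector = vectorize(text, vocab)
    scores = {}
    for label, n_docs in class_docs.items():
        total_tokens = sum(token_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok, count in zip(vocab, vector):
            if count:
                # Smoothed log P(token | class), weighted by how often the token occurs
                p = (token_counts[label][tok] + alpha) / (total_tokens + alpha * len(vocab))
                score += count * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

toy_texts = ['hi', 'hello there', 'bye', 'see you later',
             'what is the price', 'how much does it cost']
toy_labels = ['greet', 'greet', 'bye', 'bye', 'ask_price', 'ask_price']
nb = train_nb(toy_texts, toy_labels)

print('Vector for "hello hello price":', vectorize('hello hello price', nb[0]))
for q in ['hello', 'price please', 'bye']:
    print(q, '->', predict_nb(nb, q))
```

Smoothing matters here: without `alpha`, any token that never appeared with a class would give that class a probability of zero.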

In [None]:
vect = model.named_steps['vect']
vocab = vect.vocabulary_
print('Vocabulary size:', len(vocab))
# Show a few tokens
for token, idx in list(vocab.items())[:10]:
    print(token, '->', idx)

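sample_vector = vect.transform(['hello price']).toarray()[0]
nonzero = np.where(sample_vector > 0)[0]
# Map indices back to tokens; words unseen during training are silently dropped
features = vect.get_feature_names_out()
print('Non-zero features for "hello price":',
      [(features[i], int(sample_vector[i])) for i in nonzero])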

# Part 3 — Translator (Google Translator Class)
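We'll use `googletrans`, an unofficial client for Google Translate, for quick demos. It can break without notice when Google changes its endpoints. The code below assumes the `4.0.0-rc1` release pinned in the setup cell, where `translate()` is an ordinary synchronous call.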

If translation fails, re-run the setup cell and ensure internet is available.

In [None]:
try:
    from googletrans import Translator
    translator = Translator()
    print('googletrans is ready.')
except Exception as e:
    translator = None
    print('Could not initialize googletrans:', e)

def translate_text(text: str, dest: str = 'en', src: str = 'auto'):
    if translator is None:
        return '[translator unavailable]'
    try:
        res = translator.translate(text, dest=dest, src=src)
        return res.text
    except Exception as e:
        return f'[error: {e}]'

examples = [
    ('வணக்கம்', 'en'),   # Tamil -> English
    ('Hello, how are you?', 'ta'),  # English -> Tamil
    ('Buenos días', 'en'), # Spanish -> English
]
for text, dest in examples:
    print(f'{text!r} -> ({dest})', translate_text(text, dest=dest))

# Practical — Put It Together
- A helper that translates user input to English, gets chatbot reply, then translates back to the user's language.
- If translator is not available, it will just reply in English.

In [None]:
def multilingual_respond(user_text: str, user_lang: str = 'en') -> str:
    # No translator (or English input): reply in English directly
    if translator is None or user_lang == 'en':
        return respond(user_text)
    # Step 1: translate to English
    text_en = translate_text(user_text, dest='en')
    # Step 2: bot responds in English
    reply_en = respond(text_en)
    # Step 3: translate back to the user's language
    reply_user = translate_text(reply_en, dest=user_lang)
    # translate_text reports failures as '[error: ...]'; fall back to English then
    return reply_en if reply_user.startswith('[error:') else reply_user

tests = [
    ('hello', 'en'),
    ('قیمت چقدر است؟', 'fa'),   # Persian: what's the price?
    ('precio por favor', 'es'),  # Spanish
    ('வணக்கம்', 'ta'),          # Tamil
]
for text, lang in tests:
    print(f'User({lang}):', text)
    print('Bot:', multilingual_respond(text, user_lang=lang))
    print('-'*40)
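
To try the translate, respond, translate-back flow without internet access, the sketch below swaps in stand-ins for both the translator and the bot. The phrasebook dictionaries, `stub_respond`, and `round_trip` are hypothetical and exist only for this offline demo; they do not use `googletrans` or the trained model.

```python
# Hypothetical phrasebook standing in for a real translation service
TO_EN = {'hola': 'hello', 'adiós': 'bye'}
FROM_EN = {
    'es': {
        'Hello! How can I help you today?': '¡Hola! ¿En qué puedo ayudarte hoy?',
        'Goodbye! Have a great day.': '¡Adiós! Que tengas un buen día.',
    }
}

def stub_to_english(text):
    # Unknown phrases pass through unchanged
    return TO_EN.get(text.lower().strip(), text)

def stub_from_english(text, lang):
    # Fall back to the English reply when no translation is available
    return FROM_EN.get(lang, {}).get(text, text)

def stub_respond(text_en):
    replies = {
        'hello': 'Hello! How can I help you today?',
        'bye': 'Goodbye! Have a great day.',
    }
    return replies.get(text_en.lower(), "I'm not sure I understood. Could you rephrase?")

def round_trip(user_text, user_lang='en'):
    text_en = stub_to_english(user_text) if user_lang != 'en' else user_text
    reply_en = stub_respond(text_en)
    return stub_from_english(reply_en, user_lang) if user_lang != 'en' else reply_en

for text, lang in [('hello', 'en'), ('Hola', 'es'), ('adiós', 'es'), ('hola', 'fr')]:
    print(f'User({lang}):', text, '| Bot:', round_trip(text, lang))
```

The last case has no French phrasebook, so the reply stays in English, mirroring the fallback behaviour described in the bullets above.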