# Basics of NLP — Core (Beginner Friendly)

This notebook teaches the core basics of NLP only. No chatbot, no translator — just fundamental concepts with small, clear examples.

We'll cover:
- Lexical Analysis (sentences, words, stopwords, stemming, lemmatization)
- Syntactic Analysis (POS tagging, simple chunking)
- Semantic Analysis (WordNet, simple similarity, basic NER)
- Discourse Analysis (links across sentences)
- Pragmatic Analysis (basic intent via simple rules)

Run cells from top to bottom. If anything fails, re-run the Setup cell.

## 0) Prerequisites and Installation (Step-by-step)
This notebook assumes Python and pip are already installed. If you're running it locally, do the following once.

- Install the required package:
```powershell
pip install nltk
```
You can also run the next cell to install from inside the notebook (uncomment the line).

In [None]:
# Install NLTK from inside the notebook if needed (uncomment):
# !pip install -q nltk

import nltk

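# Download essential NLTK data (safe to run multiple times)
packages = [
    'punkt',                # tokenizers
    'punkt_tab',            # tokenizer data used by newer NLTK releases
    'stopwords',            # stopwords list
    'wordnet',              # WordNet for lemmatization/semantics
    'omw-1.4',              # WordNet multilingual data
    'averaged_perceptron_tagger',      # POS tagger
    'averaged_perceptron_tagger_eng',  # POS tagger name in newer NLTK releases
    'maxent_ne_chunker',    # NER chunker
    'maxent_ne_chunker_tab',  # NER chunker name in newer NLTK releases
    'words'                 # word list for NER
]
failed = []
for p in packages:
    try:
        # nltk.download() returns False on failure rather than raising
        if not nltk.download(p, quiet=False):
            failed.append(p)
    except Exception as e:
        print(f'Could not download {p}:', e)
        failed.append(p)

if failed:
    # Names your NLTK version does not know (e.g. the *_tab / *_eng variants) can be ignored
    print('Not downloaded:', failed, '- check your internet and rerun this cell if needed.')
else:
    print('Setup complete. ✅')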
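
If a later cell fails with a `LookupError`, it can help to see which data packages are actually installed before re-running Setup. The sketch below uses `nltk.data.find`, which raises `LookupError` for a resource it cannot locate. The helper name `missing_resources` and the resource paths in `expected` are assumptions for illustration; the paths follow the older package names and may differ in your NLTK version.

```python
import nltk

def missing_resources(resource_paths):
    """Return the resource paths that nltk.data.find cannot locate."""
    missing = []
    for path in resource_paths:
        try:
            nltk.data.find(path)
        except LookupError:
            missing.append(path)
    return missing

# Assumed locations of the packages downloaded in Setup (older naming).
expected = [
    "tokenizers/punkt",
    "corpora/stopwords",
    "corpora/wordnet",
    "taggers/averaged_perceptron_tagger",
    "chunkers/maxent_ne_chunker",
    "corpora/words",
]
print("Missing:", missing_resources(expected))
```

An empty list means every listed resource was found; anything listed is a candidate for another `nltk.download(...)` call.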

## 1) Lexical Analysis — Tokenization and Normalization
Lexical analysis breaks text into sentences/words and normalizes it.
We'll do: sentence tokenization, word tokenization, stopword removal, stemming, lemmatization, regex tokenization.

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = (
    "NLP is fun! It helps computers understand language. "
    "Machine learning and linguistics are powerful together."
)

# Sentence tokenization
sentences = sent_tokenize(text)
print('Sentences:', sentences)

# Word tokenization
words = word_tokenize(text)
print('Words:', words)

# Lowercase + remove stopwords/punctuation
stop_words = set(stopwords.words('english'))
words_clean = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
print('Clean words:', words_clean)

# Stemming vs Lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stems = [stemmer.stem(w) for w in words_clean]
lemmas = [lemmatizer.lemmatize(w) for w in words_clean]
print('Stems:', stems)
print('Lemmas:', lemmas)

# Regex tokenizer: words with 2+ letters
regex_tok = RegexpTokenizer(r'[A-Za-z]{2,}')
print('Regex tokens:', regex_tok.tokenize(text))
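
Stemmers differ in how aggressively they cut words down, which is easiest to see side by side. The sketch below runs three NLTK stemmers over a few sample words. The word list and the `table` layout are arbitrary choices for illustration, and none of these stemmers needs downloaded data.

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# Three stemmers that ship with NLTK
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}

# Arbitrary sample words chosen for illustration
sample_words = ["running", "connection", "studies", "happily"]

# word -> {stemmer name -> stem}
table = {
    w: {name: s.stem(w) for name, s in stemmers.items()}
    for w in sample_words
}

for w, row in table.items():
    print(f"{w:12}", row)
```

Stems are not always dictionary words. That is the main practical difference from lemmatization, which maps each word to a dictionary form.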

## 2) Syntactic Analysis — POS Tagging and Chunking
Syntactic analysis finds grammatical structure. We'll tag parts-of-speech and extract simple noun phrases (NP).

In [None]:
import nltk
tokens = word_tokenize("John bought a new laptop from the store in Chennai.")
pos_tags = nltk.pos_tag(tokens)
print('POS tags:', pos_tags)

# Simple NP chunk grammar: optional determiner, any adjectives, then a noun
grammar = r"NP: {<DT>?<JJ>*<NN.*>}"
cp = nltk.RegexpParser(grammar)
tree = cp.parse(pos_tags)
print(tree)  # text-based parse tree
# Optional GUI view if supported locally:
# tree.draw()
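
Chunk grammars can have several stages, where a later rule builds on chunks found by an earlier one. The sketch below adds a prepositional-phrase (PP) rule on top of the NP rule. The `(word, tag)` pairs are written by hand as an assumption, so the example runs without the tagger model; in the notebook you would pass `pos_tags` instead.

```python
import nltk

# Hand-written (word, tag) pairs: an assumption standing in for nltk.pos_tag output
tagged = [
    ("John", "NNP"), ("bought", "VBD"), ("a", "DT"), ("new", "JJ"),
    ("laptop", "NN"), ("from", "IN"), ("the", "DT"), ("store", "NN"),
    ("in", "IN"), ("Chennai", "NNP"), (".", "."),
]

# Stage 1 finds noun phrases; stage 2 wraps a preposition plus an NP chunk
grammar = r"""
    NP: {<DT>?<JJ>*<NN.*>}   # optional determiner, adjectives, then a noun
    PP: {<IN><NP>}           # preposition followed by an NP chunk
"""
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)
print(tree)

# Collect the text of every PP chunk
pp_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "PP")
]
print(pp_phrases)
```

The rules are applied in the order they are written, so the PP rule can refer to `<NP>` only because the NP stage comes first.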


## 3) Semantic Analysis — Word Meaning and Entities
We'll use WordNet for synonyms/definitions, a simple similarity measure, and NLTK's basic NER.

In [None]:
from nltk.corpus import wordnet as wn

word = 'computer'
synsets = wn.synsets(word)
print(f'Synsets for {word!r}:')
for s in synsets[:3]:
    print('-', s.name(), '| definition:', s.definition())

# Simple path similarity between two senses
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print('Similarity(dog, cat):', dog.path_similarity(cat))

# Named Entity Recognition (basic)
sentence = "Google hired Sundar Pichai in California."
tokens = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)
ner_tree = nltk.ne_chunk(tags)
print(ner_tree)
# ner_tree.draw()  # optional GUI
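
The dog/cat similarity score is easier to interpret once you know where it comes from: path similarity is 1 / (d + 1), where d is the number of is-a links on the shortest path between two senses. The sketch below reproduces that calculation on a tiny hand-written taxonomy. The `IS_A` table is a simplified assumption for illustration, not real WordNet data, and the function names are hypothetical.

```python
from collections import deque

# Hand-written toy is-a links (child -> parent); not real WordNet data
IS_A = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "horse": "equine",
    "equine": "mammal",
}

def neighbours(node):
    """Nodes one is-a link away, in either direction."""
    linked = {child for child, parent in IS_A.items() if parent == node}
    if node in IS_A:
        linked.add(IS_A[node])
    return linked

def shortest_path_length(a, b):
    """Breadth-first search over is-a links; None if no path exists."""
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def toy_path_similarity(a, b):
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (dist + 1)

print(toy_path_similarity("dog", "dog"))    # 0 links apart
print(toy_path_similarity("dog", "cat"))    # dog-canine-carnivore-feline-cat: 4 links
print(toy_path_similarity("dog", "horse"))  # 5 links, via mammal
```

Closer concepts share a nearer common ancestor, so the path is shorter and the score is higher. Identical concepts score 1.0.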


## 4) Discourse Analysis — Across Sentences
We look for simple discourse markers and repeated content words to see how sentences connect.

In [None]:
from collections import Counter

paragraph = (
    "Ravi bought a phone. He liked the camera; however, the battery was weak. "
    "Therefore, he returned the phone."
)
sents = sent_tokenize(paragraph)
markers = {'however', 'therefore', 'moreover', 'meanwhile', 'furthermore', 'nevertheless'}

print('Sentences:')
for i, s in enumerate(sents, 1):
    found = [m for m in markers if m in s.lower()]
    print(f'{i}.', s, ('| markers: ' + ', '.join(found)) if found else '')

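# Very naive repetition tracker (stop_words comes from the Section 1 cell)
all_words = [w.lower() for w in word_tokenize(paragraph) if w.isalpha()]
content_words = [w for w in all_words if w not in stop_words]
counts = Counter(content_words)
print('Repeated content words:', [w for w, c in counts.items() if c > 1])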
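
A repeated word says more once you know which sentences it connects. The sketch below records, for each pair of sentences, the content words they share. To stay dependency-free it makes two assumptions: the sentences are split by hand, and `STOP` is a tiny hand-picked stopword set rather than NLTK's list.

```python
import re
from itertools import combinations

# Sentences split by hand (assumption) so no tokenizer data is needed
sentences = [
    "Ravi bought a phone.",
    "He liked the camera; however, the battery was weak.",
    "Therefore, he returned the phone.",
]

# Tiny hand-picked stopword set, for illustration only
STOP = {"a", "the", "he", "was", "however", "therefore"}

def content_words(sentence):
    """Lowercased alphabetic words that are not in STOP."""
    return {w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP}

# (i, j) -> sorted content words shared by sentence i and sentence j
links = {}
for (i, a), (j, b) in combinations(enumerate(sentences, 1), 2):
    shared = content_words(a) & content_words(b)
    if shared:
        links[(i, j)] = sorted(shared)

for (i, j), shared in links.items():
    print(f"Sentence {i} <-> sentence {j}: {shared}")
```

Word overlap misses pronoun links such as "He" referring back to "Ravi". Catching those needs coreference resolution, which is beyond this notebook.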

## 5) Pragmatic Analysis — Intent (Very Simple)
Pragmatics looks at meaning in context. We'll create a tiny rule-based intent guesser to show the idea.

In [None]:
import re

def guess_intent(utterance: str) -> str:
    u = utterance.lower().strip()

    def has_any(phrases):
        # Whole-word match, so 'hi' does not fire inside 'this' or 'which'
        return any(re.search(rf'\b{re.escape(p)}\b', u) for p in phrases)

    if has_any(['price', 'cost', 'how much']):
        return 'intent: ask_price'
    if has_any(['hi', 'hello', 'hey']):
        return 'intent: greeting'
    if has_any(['bye', 'goodbye', 'see you']):
        return 'intent: farewell'
    return 'intent: unknown'

for s in ['Hello!', 'How much is this?', 'Ok bye', 'Can you help?']:
    print(s, '->', guess_intent(s))
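
"Can you help?" shows why pragmatics matters: on the surface it is a yes/no question, but the speaker usually means it as a request. One possible way to capture this is an ordered rule table where the first matching pattern wins. The sketch below is an illustration only: the name `guess_intent_rules`, the `request` intent, and the patterns are all assumptions.

```python
import re

# Ordered rules: the first pattern that matches wins.
# The 'request' rule is a hypothetical addition for illustration.
RULES = [
    ("ask_price", re.compile(r"\b(price|cost|how much)\b")),
    ("request", re.compile(r"^(can|could|would|will) you\b")),
    ("greeting", re.compile(r"\b(hi|hello|hey)\b")),
    ("farewell", re.compile(r"\b(bye|goodbye|see you)\b")),
]

def guess_intent_rules(utterance: str) -> str:
    u = utterance.lower().strip()
    for intent, pattern in RULES:
        if pattern.search(u):
            return f"intent: {intent}"
    return "intent: unknown"

examples = [
    "Hello!",
    "How much is this?",
    "Ok bye",
    "Can you help?",
    "Could you tell me the price?",
]
results = {s: guess_intent_rules(s) for s in examples}
for s, intent in results.items():
    print(s, "->", intent)
```

Rule order is a design choice. "Could you tell me the price?" matches both the price rule and the request rule, and the table above resolves it in favour of `ask_price` because that rule comes first.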

## Optional Exercises
- Change the input text and observe tokenization differences.
- Add your own stopwords (domain-specific words) and re-run.
- Try LancasterStemmer or SnowballStemmer and compare to PorterStemmer.
- Modify the chunk grammar (e.g., capture prepositional phrases).
- Look up different WordNet synsets and compare similarities.
- Extend `guess_intent()` with more rules.