# Natural Language Processing Tutorials

This notebook walks through common NLP tasks using Python libraries.
We install required packages, explore tokenization, create a bag-of-words
representation, train a simple classifier, and use pre-trained models for
sentiment analysis, named entity recognition, and word embeddings. Each
section explains what the code is doing under the hood.

In [None]:
!pip install nltk scikit-learn transformers torch spacy

In [None]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
from transformers import pipeline
import spacy
from spacy.cli import download

## 0. Tokenize text with NLTK

Tokenization is the process of breaking raw text into smaller units such as
words or punctuation symbols. NLTK provides language-specific tokenizer data
with rules for splitting text. Here we download that data and apply
`nltk.word_tokenize` to a short sentence to obtain a list of word tokens.

In [None]:
# Recent NLTK releases look for 'punkt_tab'; older ones expect 'punkt' instead.
nltk.download('punkt_tab')
text = 'Natural language processing with Python is fun!'
tokens = nltk.word_tokenize(text)
print(tokens)
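
To get a feel for what a word tokenizer does, here is a minimal regex-based
sketch. It is not NLTK's algorithm, whose rules also handle contractions,
abbreviations, and more; `simple_tokenize` is a hypothetical helper that only
separates runs of word characters from individual punctuation marks.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks (illustrative only)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sample = 'Natural language processing with Python is fun!'
simple_tokens = simple_tokenize(sample)
print(simple_tokens)
```

On this sentence the sketch separates the trailing `!` from `fun`. Unlike
NLTK, it would split a contraction such as "don't" at the apostrophe.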

## 1. Bag-of-words representation

A bag-of-words model encodes each document as a vector of token counts,
ignoring word order. `CountVectorizer` lowercases the text, builds a
vocabulary mapping each unique token to a column index, and then counts how
often each token occurs in each document. With the default settings,
single-character tokens such as `I` are dropped. The resulting sparse matrix
can be used as input to machine learning models.

In [None]:
docs = ['I love coding in Python', 'Python can be used for NLP']
vectorizer = CountVectorizer()
bag = vectorizer.fit_transform(docs)
print('Vocabulary:', vectorizer.vocabulary_)
print('Bag-of-words matrix:', bag.toarray())
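
The counting step can be reproduced by hand with the standard library. The
sketch below assumes `CountVectorizer`'s default behaviour (lowercasing,
tokens of at least two word characters, columns in alphabetical order); the
names `bow_tokenize`, `manual_vocab`, and `manual_bag` are made up for this
illustration.

```python
import re
from collections import Counter

docs = ['I love coding in Python', 'Python can be used for NLP']

def bow_tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters,
    # which is why the single-letter word 'I' does not appear.
    return re.findall(r"\b\w\w+\b", doc.lower())

tokenized_docs = [bow_tokenize(doc) for doc in docs]

# Assign column indices in alphabetical order of the unique tokens.
unique_terms = sorted({term for doc in tokenized_docs for term in doc})
manual_vocab = {term: idx for idx, term in enumerate(unique_terms)}

# One row per document, one column per vocabulary term.
manual_bag = []
for doc_tokens in tokenized_docs:
    counts = Counter(doc_tokens)  # missing terms count as 0
    manual_bag.append([counts[term] for term in unique_terms])

print('Vocabulary:', manual_vocab)
print('Bag-of-words matrix:', manual_bag)
```

Both rows have a 1 in the `python` column, the only term the two documents
share.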

## 2. Train a text classifier

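We fetch two categories from the 20 Newsgroups dataset, `rec.sport.baseball`
and `sci.space`, stripping headers, footers, and quoted replies so the
classifier cannot lean on metadata. The text is converted to a bag-of-words
representation and a logistic regression model is trained to distinguish the
two topics. After fitting, we evaluate on the held-out test split and print a
classification report with per-class precision, recall, and F1.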

In [None]:
categories = ['rec.sport.baseball', 'sci.space']
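# Strip headers, footers, and quoted replies so the model learns from the message body.
remove = ('headers', 'footers', 'quotes')
train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)
# Vectorize the text and fit logistic regression in a single pipeline.
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))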
clf.fit(train.data, train.target)
preds = clf.predict(test.data)
print(classification_report(test.target, preds, target_names=test.target_names))
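
To make the numbers in the report easier to read, the following sketch
computes precision, recall, and F1 for a single class by hand. The helper
`per_class_metrics` and the toy labels are assumptions for illustration; they
are not output from the classifier above.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted
    fp = sum(1 for t, p in pairs if t != label and p == label)  # predicted but wrong
    fn = sum(1 for t, p in pairs if t == label and p != label)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels for six documents (0 = baseball, 1 = space).
toy_true = [0, 0, 0, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1]

for label, name in enumerate(['rec.sport.baseball', 'sci.space']):
    p, r, f = per_class_metrics(toy_true, toy_pred, label)
    print(f'{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}')
```

In the toy data one baseball post is mislabeled as space, which lowers recall
for baseball and precision for space.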

## 3. Sentiment analysis with transformers

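The Hugging Face `pipeline` API downloads a pretrained transformer model
that classifies text as positive or negative. Because no model name is given,
the library falls back to its default English sentiment checkpoint and logs a
warning; pass a `model` argument to pin a specific one. We pass a sentence to
the pipeline, which returns a list with one dictionary per input holding the
predicted label and a confidence score.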

In [None]:
sentiment = pipeline('sentiment-analysis')
result = sentiment('I love using transformers for NLP!')[0]
print(f"Label: {result['label']}, score: {result['score']:.3f}")
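
For a two-class model like this one, the confidence score is typically a
softmax probability computed from the model's raw output scores (logits). The
sketch below shows that last step only; the logit values are made up and do
not come from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

class_names = ['NEGATIVE', 'POSITIVE']
toy_logits = [-2.0, 3.0]  # hypothetical values, not real model output

probs = softmax(toy_logits)
best = max(range(len(class_names)), key=lambda i: probs[i])
toy_result = {'label': class_names[best], 'score': probs[best]}
print(f"Label: {toy_result['label']}, score: {toy_result['score']:.3f}")
```

The larger the gap between the two logits, the closer the winning score gets
to 1.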

## 4. Named entity recognition with spaCy

spaCy ships with pretrained pipelines for many languages. Loading the small
English pipeline, `en_core_web_sm`, gives access to a statistical named
entity recognizer that can identify people, organizations, and locations in
text. If the pipeline is not installed, the code downloads it first. We
process a sample sentence and loop over `doc.ents` to print each entity's
text and label.

In [None]:
try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    download('en_core_web_sm')
    nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple was founded by Steve Jobs in California.')
for ent in doc.ents:
    print(ent.text, ent.label_)
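
Entity recognizers are often described in terms of per-token tags, where
`B-` marks the beginning of an entity, `I-` a continuation, and `O` a token
outside any entity. The sketch below groups such tags into entity spans. The
tags are written by hand for illustration and `bio_to_spans` is a
hypothetical helper; neither is spaCy output or spaCy API.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity text, label) pairs."""
    spans = []
    words, label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A new entity starts; close any entity that was open.
            if words:
                spans.append((' '.join(words), label))
            words, label = [token], tag[2:]
        elif tag.startswith('I-') and label == tag[2:]:
            # Continue the currently open entity.
            words.append(token)
        else:
            # 'O' (or an inconsistent I- tag) ends the open entity.
            if words:
                spans.append((' '.join(words), label))
            words, label = [], None
    if words:
        spans.append((' '.join(words), label))
    return spans

toy_tokens = ['Apple', 'was', 'founded', 'by', 'Steve', 'Jobs', 'in', 'California', '.']
toy_tags = ['B-ORG', 'O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'O', 'B-GPE', 'O']

for text, label in bio_to_spans(toy_tokens, toy_tags):
    print(text, label)
```

Note how the two tokens `Steve` and `Jobs` are merged into a single entity.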

## 5. Word embeddings and similarity

spaCy's medium English pipeline, `en_core_web_md`, includes static word
vectors that capture semantic meaning; the small pipeline used above does not
ship with them. `Token.similarity` returns the cosine similarity between two
tokens' vectors, which quantifies how similar the words are. The example
loads the pipeline and prints the pairwise similarity scores for three tokens.

In [None]:
try:
    nlp_md = spacy.load('en_core_web_md')
except OSError:
    download('en_core_web_md')
    nlp_md = spacy.load('en_core_web_md')
# Use a new name so the `tokens` list from section 0 is not overwritten.
doc_md = nlp_md('dog cat banana')
for t1 in doc_md:
    for t2 in doc_md:
        print(f'Similarity({t1.text}, {t2.text}) = {t1.similarity(t2):.3f}')
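
Cosine similarity itself is a short formula: the dot product of two vectors
divided by the product of their lengths. The sketch below implements it with
the standard library on made-up three-dimensional vectors; real spaCy vectors
have hundreds of dimensions, and returning 0.0 for a zero vector is a choice
made for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; 0.0 is a convention here
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not taken from any model.
toy_vectors = {
    'dog': [0.9, 0.8, 0.1],
    'cat': [0.8, 0.9, 0.2],
    'banana': [0.1, 0.2, 0.9],
}

for w1, v1 in toy_vectors.items():
    for w2, v2 in toy_vectors.items():
        print(f'Similarity({w1}, {w2}) = {cosine_similarity(v1, v2):.3f}')
```

Vectors that point in nearly the same direction, like the toy `dog` and
`cat`, score close to 1, while a word compared with itself scores 1.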