🔥 First Concept: NLP - Tokenization

💡 What is it?
Tokenization = Breaking down text into smaller pieces (words, subwords, or characters) so that models can understand and process them.

🧠 Why it matters?
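Models can’t operate on raw text. Text is first split into tokens (like “Hello” → ['Hello']), each token is mapped to an integer ID from the vocabulary, and each ID is then looked up as a vector of numbers (an embedding).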
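
The text → tokens → IDs → embeddings chain can be sketched in plain Python. This is only a toy illustration: the three-entry vocabulary, the `[UNK]` fallback token, and the 3-dimensional vectors are made up here, and real models learn their embedding tables during training.

```python
# Toy illustration: text -> tokens -> integer IDs -> embedding vectors.
# The vocabulary and the vectors below are hypothetical, chosen for the example.

vocab = {"[UNK]": 0, "hello": 1, "world": 2}

# One small vector per token ID; real models learn these values.
embedding_table = [
    [0.0, 0.0, 0.0],   # [UNK]
    [0.1, 0.3, -0.2],  # hello
    [0.7, -0.5, 0.4],  # world
]

def encode(text):
    """Lowercase, split on whitespace, and map each token to its ID."""
    tokens = text.lower().split()
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[i] for i in token_ids]

ids = encode("Hello world")
vectors = embed(ids)
print(ids)      # [1, 2]
print(vectors)  # [[0.1, 0.3, -0.2], [0.7, -0.5, 0.4]]
```

Words missing from the vocabulary fall back to `[UNK]` (ID 0), which is one reason subword tokenizers are preferred in practice.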

🔧 Types of Tokenizers:

- Word-level → “Hello world” → ['Hello', 'world']
- Subword-level (e.g. BPE, WordPiece) → “unhappiness” → ['un', 'happi', 'ness'] (illustrative; the real split depends on the learned vocabulary)
- Character-level → “Hi” → ['H', 'i']
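
The three levels can be compared side by side with a rough sketch. Note the assumptions: the subword function below uses greedy longest-match against a tiny hand-written vocabulary (closer to how WordPiece splits at inference time than to BPE's learned merges), and `toy_vocab` is hypothetical.

```python
def word_tokenize(text):
    """Word-level: split on whitespace."""
    return text.split()

def char_tokenize(text):
    """Character-level: one token per character."""
    return list(text)

def subword_tokenize(word, vocab):
    """Subword-level: greedy longest-match-first split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary piece matches at this position.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces

toy_vocab = {"un", "happi", "ness", "happy"}  # hypothetical vocabulary

print(word_tokenize("Hello world"))                # ['Hello', 'world']
print(subword_tokenize("unhappiness", toy_vocab))  # ['un', 'happi', 'ness']
print(char_tokenize("Hi"))                         # ['H', 'i']
```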

✅ Real Use:
Most Hugging Face models use subword tokenizers: BERT uses WordPiece, while GPT-2 uses byte-level BPE.

🤗 Transformers – Core Concept

💡 What is a Transformer?
A Transformer is a neural network architecture that processes sequences (like sentences) using self-attention – it looks at all tokens at once and learns which ones matter most for each position.

🧠 Why It Matters?
This architecture powers BERT, GPT, Claude, and Gemini – essentially all modern LLMs.

Key Ideas:

- No recurrence (unlike RNNs), just attention
- All tokens processed in parallel = fast training
- Can relate distant words and use context to resolve ambiguity (e.g., “bank” = riverbank or money)
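
To make "just attention" concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is simplified on purpose: queries, keys, and values are all the raw input vectors, whereas real Transformers first apply learned projection matrices and use several attention heads. The input `X` is made-up data.

```python
import math

def softmax(xs):
    """Turn a list of scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    outputs, all_weights = [], []
    for q in X:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # New representation: weighted average of all token vectors.
        out = [sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up 2-dimensional token vectors.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(X)
for row in attn:
    print([round(w, 3) for w in row])
```

Each row of `attn` shows how much one token attends to every other token. Every row is computed independently of the others, which is what allows the parallel processing mentioned above.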

🧱 Transformer Parts (simple view):
- Input Embeddings: Text → Vectors
- Positional Encoding: Adds word order info
- Self-Attention: Learns context
- Feed Forward Layers: Processes info
- Output: Classifies, generates, etc.
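
As one possible illustration of the Positional Encoding step, the sketch below uses the fixed sinusoidal scheme from the original Transformer design and adds it to the embeddings. This is an assumption for illustration: models such as BERT and GPT-2 learn their position embeddings instead, and the function names here are hypothetical.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def add_position(embeddings):
    """Add the position vector to each token embedding, element by element."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]

# The same made-up embedding repeated twice, e.g. the same word at two positions.
same_word_twice = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
with_pos = add_position(same_word_twice)
print(with_pos[0])  # [1.0, 2.0, 1.0, 2.0]
print(with_pos[0] != with_pos[1])  # True
```

After the addition, identical words at different positions get different vectors, which is how the model can tell word order apart.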

In [2]:
# !pip install transformers

# If the wrong Python is picked up: change it in Notepad,
# or in the system settings move Python 3.13 to the top of the PATH environment variable



In [1]:
import sys
print(sys.executable)


import transformers
print(transformers.__version__)

/Library/Developer/CommandLineTools/usr/bin/python3




4.52.4


In [4]:
# 🤗 Using Hugging Face Transformers (Hands-On)
# Let’s load a real model and run it on your own text 👇

from transformers import pipeline

# No model is named here, so the pipeline falls back to a default sentiment model
classifier = pipeline("sentiment-analysis")

# result = classifier("I love learning Hugging Face with ChatGPT")
result = classifier("happy")
# Returns a list with one dict per input:
#   'label': predicted class
#   'score': confidence (close to 1.0 = very confident)
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9998753070831299}]


🧠 What’s happening here:
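pipeline("sentiment-analysis"): Loads a pretrained model that’s fine-tuned for sentiment. With no model specified, it defaults to distilbert-base-uncased-finetuned-sst-2-english, a DistilBERT model (a smaller, distilled version of BERT) fine-tuned on the SST-2 dataset.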

You pass in raw text → it gets tokenized, embedded, processed by the transformer layers → a classification head gives a label (POSITIVE/NEGATIVE) + confidence score.
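
The last step (raw model scores → label + confidence) can be sketched as a softmax followed by picking the highest probability. This is a rough approximation of what a text-classification pipeline does, not its actual code: the logits below are made up, and the `ID2LABEL` mapping is assumed for illustration.

```python
import math

# Assumed mapping from class index to label name (illustrative).
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits):
    """Pick the most probable class and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": ID2LABEL[best], "score": probs[best]}

# Hypothetical logits for one input: [score for NEGATIVE, score for POSITIVE].
print(to_prediction([-4.2, 4.5]))
```

A large gap between the two logits pushes the score close to 1.0, which is why short, clearly positive inputs like "happy" come back with very high confidence.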