# Foundations of NLP


**Definitions:**
- **Tokenization**: Splitting text into words, subwords, or characters.
- **Stopword Removal**: Removing common words that add little meaning.
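- **Stemming**: Stripping suffixes with heuristic rules to get a crude stem, which need not be a real word (e.g. "creates" becomes "creat").
- **Lemmatization**: Reducing words to their dictionary form (lemma) using a vocabulary and the word's part of speech.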
- **POS Tagging**: Assigning grammatical categories to words.
- **Dependency Parsing**: Finding grammatical relationships between words.
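- **NER**: Named Entity Recognition; identifying named entities such as people, organizations, locations, and dates.
- **BoW/TF-IDF**: Representing text as vectors of term counts (Bag of Words) or of counts reweighted by how rare each term is across documents (TF-IDF).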
    

In [2]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

nltk.download("maxent_ne_chunker")
nltk.download("maxent_ne_chunker_tab")
nltk.download("words")


from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag, word_tokenize, ne_chunk

text = "OpenAI creates powerful AI models like GPT-4."

tokens = word_tokenize(text)
print("Tokens:", tokens)

stop_words = set(stopwords.words("english"))  # build the set once instead of reloading the list per token
filtered = [w for w in tokens if w.lower() not in stop_words]
print("Stopwords removed:", filtered)

stemmer = PorterStemmer()
print("Stems:", [stemmer.stem(w) for w in filtered])

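lemmatizer = WordNetLemmatizer()
# lemmatize() treats every word as a noun unless pos= is given, so verbs such as "creates" pass through unchanged
print("Lemmas:", [lemmatizer.lemmatize(w) for w in filtered])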

print("POS tags:", pos_tag(tokens))
print("Named Entities:", ne_chunk(pos_tag(tokens)))

vectorizer = CountVectorizer()
print("BoW:", vectorizer.fit_transform([text]).toarray())

tfidf = TfidfVectorizer()
print("TF-IDF:", tfidf.fit_transform([text]).toarray())
    

Tokens: ['OpenAI', 'creates', 'powerful', 'AI', 'models', 'like', 'GPT-4', '.']
Stopwords removed: ['OpenAI', 'creates', 'powerful', 'AI', 'models', 'like', 'GPT-4', '.']
Stems: ['openai', 'creat', 'power', 'ai', 'model', 'like', 'gpt-4', '.']
Lemmas: ['OpenAI', 'creates', 'powerful', 'AI', 'model', 'like', 'GPT-4', '.']
POS tags: [('OpenAI', 'NNP'), ('creates', 'VBZ'), ('powerful', 'JJ'), ('AI', 'NNP'), ('models', 'NNS'), ('like', 'IN'), ('GPT-4', 'NNP'), ('.', '.')]
Named Entities: (S
  (ORGANIZATION OpenAI/NNP)
  creates/VBZ
  powerful/JJ
  AI/NNP
  models/NNS
  like/IN
  GPT-4/NNP
  ./.)
BoW: [[1 1 1 1 1 1 1]]
TF-IDF: [[0.37796447 0.37796447 0.37796447 0.37796447 0.37796447 0.37796447
  0.37796447]]
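
Every TF-IDF entry above is the same value because the corpus holds a single document. The sketch below recomputes that row by hand. It assumes scikit-learn's documented defaults (smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization, lowercase tokens of two or more word characters); the helper name `tfidf_rows` is made up for this illustration and is not part of any library.

```python
import math
import re
from collections import Counter

def tfidf_rows(docs):
    """Recompute TF-IDF rows with smoothed idf and L2 normalization."""
    # Assumption: tokens are lowercase runs of 2+ word characters,
    # which mirrors scikit-learn's default token pattern.
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["OpenAI creates powerful AI models like GPT-4."])
print(vocab)
print([round(x, 8) for x in rows[0]])
```

With one document, every term has count 1 and idf 1, so the row is seven ones scaled to unit length, i.e. 1/sqrt(7) ≈ 0.37796447 per entry. The "4" in "GPT-4" is dropped because single-character tokens do not match the pattern, which is why the vectors have seven columns rather than eight.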


# Neural NLP


**Definitions:**
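- **RNN/LSTM/GRU**: Recurrent networks that process a sequence one step at a time; LSTM and GRU add gates that help retain long-range dependencies.
- **Attention**: Mechanism that weighs the parts of the input by their relevance to the current position.
- **Transformers**: Models built on self-attention that process all positions of a sequence in parallel.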
- **Encoder/Decoder**: Encoder processes input; decoder generates output.
- **Pretraining/Fine-tuning**: Training on large data first, then adapting to tasks.
- **BERT**: Encoder-only model pretrained with masked language modeling (predicting hidden tokens from context on both sides).
- **GPT**: Decoder-only autoregressive model.
- **T5/BART**: Seq2Seq transformer models.
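
To make the attention definition above concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The toy queries, keys, and values are invented for illustration; real models learn these as projections of token embeddings and run the computation on tensors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(q.k / sqrt(d_k)) weighted sum of values."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output component is the attention-weighted average of the values.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Hypothetical toy data: three positions with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
queries = [[1.0, 0.0]]

out, attn = attention(queries, keys, values)
print("weights:", [round(w, 3) for w in attn[0]])
print("output:", [round(x, 3) for x in out[0]])
```

The query matches the first and third keys equally well, so those two positions receive the same weight and the second receives less. Because each query row is computed independently of the others, the loop over queries can run in parallel, which is what makes Transformers parallelizable.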
    

In [3]:
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
print(sentiment("I love NLP!"))

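# Note: distilbert-base-uncased has no fine-tuned classification head, so the label and score
# printed for this call come from randomly initialized weights and are not meaningful predictions.
classifier = pipeline("text-classification", model="distilbert-base-uncased")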
print(classifier("The stock market is going up."))

ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
print(ner("Barack Obama was born in Hawaii."))

translator = pipeline("translation_en_to_fr", model="t5-small")
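# max_length=40 is overridden here: max_new_tokens is already set for this pipeline and takes precedence
print(translator("NLP is fascinating!", max_length=40))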
    

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998449087142944}]


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'LABEL_0', 'score': 0.5242969393730164}]


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity_group': 'PER', 'score': np.float32(0.99917614), 'word': 'Barack Obama', 'start': 0, 'end': 12}, {'entity_group': 'LOC', 'score': np.float32(0.99945), 'word': 'Hawaii', 'start': 25, 'end': 31}]


Both `max_new_tokens` (=256) and `max_length`(=40) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'translation_text': 'NLP est fascinant !'}]


# Generative AI Core Concepts


**Definitions:**
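- **Autoregressive LM**: Predicts the next token given all previous tokens (GPT).
- **Seq2Seq**: Maps an input sequence to an output sequence with an encoder and a decoder (T5, BART).
- **Generation Methods**: Greedy decoding, beam search, and top-k or top-p (nucleus) sampling.
- **Instruction Tuning**: Fine-tuning on instruction-response pairs so the model follows natural-language instructions.
- **RLHF**: Reinforcement Learning from Human Feedback; aligns model outputs with human preference judgments.
- **LoRA/PEFT**: Parameter-efficient fine-tuning; LoRA trains small low-rank update matrices while the pretrained weights stay frozen.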
- **Evaluation Metrics**: Perplexity, BLEU, ROUGE.
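
The generation methods listed above differ only in how the next token is picked from the model's probability distribution. The sketch below shows greedy choice, top-k filtering, and top-p filtering on a made-up distribution. The token names and probabilities are hypothetical, and the helpers are illustrative rather than the `transformers` implementation.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    kept, mass = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept}

def sample(probs, rng):
    """Draw one token according to the given probabilities."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution after some prompt.
next_token_probs = {"the": 0.5, "a": 0.25, "dragon": 0.125,
                    "time": 0.0625, "xylophone": 0.0625}

greedy = max(next_token_probs, key=next_token_probs.get)  # always the top token
top2 = top_k_filter(next_token_probs, k=2)                # fixed-size candidate set
nucleus = top_p_filter(next_token_probs, p=0.8)           # adaptive candidate set

rng = random.Random(0)  # seeded so the draw is repeatable
print("greedy:", greedy)
print("top-k candidates:", sorted(top2))
print("top-p candidates:", sorted(nucleus))
print("sampled from nucleus:", sample(nucleus, rng))
```

Top-k always keeps the same number of candidates, while top-p keeps more candidates when the distribution is flat and fewer when it is peaked. Beam search is omitted because it tracks several partial sequences at once rather than choosing one token per step.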
    

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Once upon a time", return_tensors="pt")

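# Greedy decoding: always take the most probable next token (max_length counts the prompt tokens too)
greedy_output = model.generate(**inputs, max_length=30)
print("Greedy:", tokenizer.decode(greedy_output[0]))

# Beam search: keep the 5 best partial sequences at each step
beam_output = model.generate(**inputs, max_length=30, num_beams=5, early_stopping=True)
print("Beam:", tokenizer.decode(beam_output[0]))

# Top-k sampling: sample from the 50 most probable tokens
topk_output = model.generate(**inputs, max_length=30, do_sample=True, top_k=50)
print("Top-k:", tokenizer.decode(topk_output[0]))

# Top-p (nucleus) sampling: sample from the smallest token set whose cumulative probability reaches 0.9
# (the default top_k=50 still applies unless top_k=0 is passed)
topp_output = model.generate(**inputs, max_length=30, do_sample=True, top_p=0.9)
print("Top-p:", tokenizer.decode(topp_output[0]))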
    

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Greedy: Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Beam: Once upon a time, it was said, there would be a time when the world would be a better place.

It was a time when


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Top-k: Once upon a time, most of us grew up believing in God, believing that he was the final and perfect figure that needed to be given the throne
Top-p: Once upon a time, the whole universe became one gigantic, multicellular being that couldn't even fly.

"The universe is all this


# Advanced NLP


**Definitions:**
- **Multimodal NLP**: Combining text with image/audio/video.
- **CLIP**: Aligns text and image embeddings.
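- **RAG**: Retrieval-Augmented Generation; retrieves relevant documents and passes them to the model as context for generation.
- **Embeddings**: Dense vector representations in which similar texts lie close together, used for similarity search.
- **Hybrid Search**: Combining keyword (sparse) and dense (embedding) retrieval scores.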
- **Long-context Models**: Models designed for large inputs (Longformer).
- **Hallucination Handling**: Mitigating incorrect generations.
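
As a rough illustration of hybrid search, the sketch below blends a keyword-overlap score with a cosine similarity over embeddings. Everything here is a stand-in: the documents, the 2-dimensional "embeddings", and the `alpha` weight are invented, and a production system would typically use BM25 for the keyword side and a trained embedding model for the dense side.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def keyword_score(query, doc):
    """Fraction of query terms that appear in the document (stand-in for BM25)."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

def hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.5):
    """Rank documents by alpha * keyword score + (1 - alpha) * dense score."""
    scored = []
    for doc, vec in zip(docs, doc_vecs):
        score = alpha * keyword_score(query, doc) + (1 - alpha) * cosine(query_vec, vec)
        scored.append((score, doc))
    return sorted(scored, reverse=True)

# Hypothetical corpus with hand-made 2-d embeddings (not produced by a real model).
docs = ["transformers use attention",
        "stock prices rose today",
        "attention is all you need"]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]

ranked = hybrid_rank("attention models", [1.0, 0.0], docs, doc_vecs)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

In a RAG setup, the top-ranked documents from a step like this would be inserted into the prompt as context before generation.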
    

In [2]:
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")
sent1 = "Artificial Intelligence is the future."
sent2 = "AI will change the world."
emb1 = embedder.encode(sent1, convert_to_tensor=True)
emb2 = embedder.encode(sent2, convert_to_tensor=True)
print("Cosine similarity:", util.cos_sim(emb1, emb2))  # cos_sim is the current name for pytorch_cos_sim
    

Cosine similarity: tensor([[0.6641]])


# Cutting-edge Research


**Definitions:**
- **Zero/Few-shot Learning**: Performing tasks with no or few examples.
- **Chain-of-Thought**: Prompting models to reason step by step.
- **LangChain/LlamaIndex**: Frameworks for LLM apps and RAG.
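- **Quantization**: Storing weights (and sometimes activations) in lower precision, such as int8 or 4-bit, to cut memory use and speed up inference.
- **Distillation**: Training a small student model to imitate a larger teacher model.
- **MoE**: Mixture of Experts; a router sends each token to a few expert sub-networks, so only part of the model runs per token.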
- **Alignment**: Making models safe and human-aligned.
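
To show what quantization means in practice, here is a minimal sketch of symmetric int8 quantization of a weight list. The weights are made up, and real toolkits usually quantize per channel or per block and handle outliers more carefully; this only illustrates the round-trip and its error.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]

# Hypothetical weights for illustration.
weights = [0.02, -1.27, 0.64, 0.0, 1.0]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight is stored as one byte plus a shared scale, instead of four bytes for float32. The round-trip error is bounded by half the scale, which is the precision given up for the smaller size.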
    

In [3]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
print(zero_shot("I love using AI for healthcare.", candidate_labels=["education", "health", "finance"]))

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = '''Classify the sentiment:
Text: "The product is amazing"
Sentiment: Positive
Text: "This is the worst purchase ever"
Sentiment: Negative
Text: "I think it's okay, not great"
Sentiment:'''

inputs = tokenizer(prompt, return_tensors="pt")
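# max_length counts the prompt tokens too; base GPT-2 does not stop after the first label,
# so the output continues with invented examples until the limit is reached
outputs = model.generate(**inputs, max_length=100)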
print("Few-shot:", tokenizer.decode(outputs[0]))
    

{'sequence': 'I love using AI for healthcare.', 'labels': ['health', 'education', 'finance'], 'scores': [0.9965513944625854, 0.0017774497391656041, 0.0016711429925635457]}


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Few-shot: Classify the sentiment:
Text: "The product is amazing"
Sentiment: Positive
Text: "This is the worst purchase ever"
Sentiment: Negative
Text: "I think it's okay, not great"
Sentiment: Positive
Text: "I'm not sure what to say"
Sentiment: Negative
Text: "I'm not sure what to say"
Sentiment: Negative
Text: "I'm not sure what to say"
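
The few-shot output above runs past the first answer because the model keeps imitating the pattern. One possible way to handle this, sketched below, is to build the prompt from a list of examples and keep only the first line generated after it. The helper names `build_few_shot_prompt` and `first_label` are hypothetical, and the sketch assumes the decoded text starts with the exact prompt string.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labeled examples, and the unlabeled query."""
    lines = [instruction]
    for text, label in examples:
        lines.append(f'Text: "{text}"')
        lines.append(f"Sentiment: {label}")
    lines.append(f'Text: "{query}"')
    lines.append("Sentiment:")
    return "\n".join(lines)

def first_label(generated, prompt):
    """Return the first line the model wrote after the prompt."""
    # Assumption: the decoded output begins with the prompt verbatim.
    continuation = generated[len(prompt):].strip()
    return continuation.splitlines()[0].strip() if continuation else ""

examples = [("The product is amazing", "Positive"),
            ("This is the worst purchase ever", "Negative")]
prompt = build_few_shot_prompt("Classify the sentiment:", examples,
                               "I think it's okay, not great")

# Stand-in for tokenizer.decode(outputs[0]); shortened from the output shown above.
generated = prompt + ' Positive\nText: "I\'m not sure what to say"\nSentiment: Negative'

print(prompt)
print("Predicted label:", first_label(generated, prompt))
```

Trimming the continuation does not make the prediction more accurate; it only discards the extra examples the model invents after answering.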

