<a href="https://colab.research.google.com/github/Mohsal2026/github.com/blob/main/1_introduction_to__natural__language__processing_n_l_p_and_some_techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

🎥 Recommended Video: [Natural Language Processing: Crash Course AI](https://www.youtube.com/watch?v=oi0JXuL19TA)

🎥 Recommended Video: [What is NLP? Learn Natural Language Processing in Artificial Intelligence](https://www.youtube.com/watch?v=QwBaFEeUUMA)


# What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and human languages. It involves enabling computers to understand, interpret, and generate human language in a meaningful way. Applications include language translation, sentiment analysis, chatbots, and more.


## Natural Language Processing (NLP) Applications

### Ontology in NLP
An ontology is a structured representation of concepts and the relationships between them, organized so that computer programs can store and query it. For example, the triple `("python", "is-a", "language")`, written in (subject, predicate, object) order, states that Python is a language.
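
To make the idea of triples concrete, here is a minimal sketch of a tiny knowledge base. It assumes the conventional (subject, predicate, object) ordering; the facts and the helper names `objects_of` and `is_a` are hypothetical and chosen only for illustration.

```python
# Toy knowledge base of (subject, predicate, object) triples.
# All facts and helper names here are illustrative assumptions.
triples = [
    ("python", "is-a", "language"),
    ("java", "is-a", "language"),
    ("language", "is-a", "tool"),
    ("python", "created-by", "Guido van Rossum"),
]

def objects_of(subject, predicate, facts):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in facts if s == subject and p == predicate]

def is_a(subject, target, facts):
    """Follow "is-a" links transitively to test category membership."""
    seen = set()
    frontier = [subject]
    while frontier:
        current = frontier.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        frontier.extend(objects_of(current, "is-a", facts))
    return False

print(objects_of("python", "is-a", triples))
print(is_a("python", "tool", triples))
```

Storing relationships as plain tuples keeps queries simple: a lookup is a filter over the list, and following "is-a" links gives basic inheritance between concepts.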


# Touring Popular NLP Libraries and Picking Up NLP Basics

In this lecture, we will explore essential Python libraries for Natural Language Processing (NLP) and learn fundamental NLP concepts. We will cover libraries like **NLTK**, **spaCy**, **Gensim**, and **TextBlob**, along with key NLP tasks such as tokenization, part-of-speech tagging, named entity recognition, stemming, and lemmatization.

---

## 1. Installing Famous NLP Libraries

### 1.1 NLTK (Natural Language Toolkit)
NLTK is one of the most popular libraries for NLP. It is widely used for educational and industrial purposes.

**Installation**:
```bash
# Using pip
pip install -U nltk

# Using conda
conda install nltk
```

---

### 1.2 spaCy
spaCy is a powerful and memory-optimized NLP library written in Cython. It uses state-of-the-art algorithms for tasks like tagging and named entity recognition.

**Installation**:
```bash
# Using pip
pip install -U spacy

# Using conda
conda install -c conda-forge spacy
```

---

### 1.3 Gensim
Gensim is a library designed for topic modeling and similarity retrieval. It is highly efficient and scalable.

**Installation**:
```bash
# Using pip
pip install --upgrade gensim

# Using conda
conda install -c conda-forge gensim
```

---

### 1.4 TextBlob
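TextBlob is built on top of NLTK and provides easy-to-use interfaces for common NLP tasks like sentiment analysis, noun-phrase extraction, and spelling correction. Older releases also offered translation, but that feature has been removed from recent versions.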

**Installation**:
```bash
# Using pip
pip install -U textblob

# Using conda
conda install -c conda-forge textblob
```

## 2. Exploring NLTK Corpora

NLTK provides over 100 corpora (text datasets) for NLP tasks. Some popular corpora include:
- **Gutenberg Corpus**: Literary works from Project Gutenberg.
- **Reuters Corpus**: News articles for text classification.
- **Movie Reviews Corpus**: Sentiment analysis dataset.
- **WordNet**: Lexical database of English words.

In [1]:
import nltk
nltk.download('names')

from nltk.corpus import names

# Print the first 10 names
print(names.words()[:10])

# Total number of names
print(len(names.words()))  # Output: 7944

['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi', 'Abbie', 'Abby', 'Abigael', 'Abigail', 'Abigale']
7944


[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.


## 3. NLP Tasks and Techniques

### 3.1 Text Vectorization

Text vectorization is the process of converting text into numerical representations.

#### Key Uses of Text Vectorization
- Preparing text for machine learning models.
- Enabling similarity comparisons between texts.

In [2]:
from io import StringIO

import requests
from sklearn.feature_extraction.text import TfidfVectorizer


# Import data files from Google Drive
def read_gd(sharing_url):
    """Download a publicly shared Google Drive file and return it as a text buffer."""
    # Sharing links look like .../file/d/<file_id>/view?usp=sharing
    file_id = sharing_url.split('/')[-2]
    download_url = 'https://drive.google.com/uc?export=download&id=' + file_id
    response = requests.get(download_url)
    response.raise_for_status()  # Fail early on HTTP errors
    return StringIO(response.text)

url = "https://drive.google.com/file/d/1FgIpdZaw7ell0zSvhbtpB9Ex4Ck3Mbcw/view?usp=sharing"
gdd = read_gd(url)

# Treat each line of the file as one document
documents = gdd.read().splitlines()

# Vectorize text: rows are documents, columns are vocabulary terms
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
print(X.toarray())

[[0.         0.         0.         0.47569405 0.         0.56343915
  0.         0.         0.35202427 0.         0.         0.
  0.         0.         0.         0.         0.40763366 0.
  0.40763366]
 [0.         0.         0.         0.70261292 0.         0.
  0.         0.         0.25997466 0.         0.         0.
  0.         0.         0.5074392  0.         0.30104295 0.
  0.30104295]
 [0.         0.50211386 0.         0.         0.         0.
  0.         0.         0.25724635 0.         0.         0.50211386
  0.         0.         0.         0.         0.29788364 0.50211386
  0.29788364]
 [0.         0.         0.         0.47569405 0.         0.56343915
  0.         0.         0.35202427 0.         0.         0.
  0.         0.         0.         0.         0.40763366 0.
  0.40763366]
 [0.         0.         0.         0.         0.         0.
  0.4501536  0.4501536  0.23062572 0.         0.         0.
  0.4501536  0.4501536  0.         0.36913239 0.         0.
  0.        

Each row of the matrix corresponds to a document and each column to a vocabulary term. A higher value means the term is more important to that document. Words that appear in many documents (like "the", "is") receive lower weights, while words specific to particular documents receive higher weights.
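
The weighting can be reproduced by hand on a toy corpus. The sketch below assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of scikit-learn's defaults. The corpus and the function name `tfidf_vector` are illustrative, and whitespace splitting stands in for the library's tokenizer.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration, not the Google Drive file used above)
docs = ["the cat purrs", "the dog barks", "the bird sings"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents in which each term occurs
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf_vector(tokenized[0])
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.3f}")
```

Because "the" occurs in every document, its weight in the first document ends up lower than the weights of "cat" and "purrs", which occur in only one.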

### 3.2 Text Classification

Text classification involves categorizing text into predefined groups.

#### Key Uses of Text Classification
- Spam detection.
- Sentiment analysis.
- Topic categorization.
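
To show what a classifier such as multinomial Naive Bayes computes internally, here is a minimal from-scratch sketch. It assumes raw word counts (not TF-IDF), add-one (Laplace) smoothing, and whitespace tokenization; the training sentences and the function name `predict` are illustrative only.

```python
import math
from collections import Counter, defaultdict

# Tiny labeled training set (illustrative)
train = [
    ("the team won the championship", "sports"),
    ("soccer match ended in a draw", "sports"),
    ("new smartphone released with advanced ai", "tech"),
    ("programming languages are evolving fast", "tech"),
]

word_counts = defaultdict(Counter)  # per-class word frequencies
class_counts = Counter()            # number of documents per class
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {word for counts in word_counts.values() for word in counts}

def predict(text):
    """Return the class with the highest log-probability for `text`."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        # Log prior: share of training documents with this label
        score = math.log(class_counts[label] / len(train))
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities
                score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("the soccer team won"))
print(predict("new ai smartphone"))
```

Working in log space turns the product of word probabilities into a sum, which avoids numerical underflow on longer documents.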

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Sample documents and labels (equal length!)
documents = [
    "The team won the championship",               # sports
    "Soccer match ended in a draw",               # sports
    "New smartphone released with advanced AI",    # tech
    "Programming languages are evolving fast",     # tech
    "Basketball playoffs start next week",        # sports
    "The new laptop has a 16-hour battery life",  # tech
    "Olympic athletes train rigorously",          # sports
    "Quantum computing breakthroughs announced"   # tech
]
labels = ["sports", "sports", "tech", "tech", "sports", "tech", "sports", "tech"]

# Vectorize Text (TF-IDF)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
print("Feature names:", vectorizer.get_feature_names_out())

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42
)

clf = MultinomialNB()
clf.fit(X_train, y_train)

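predictions = clf.predict(X_test)
# zero_division=0 reports 0.0, without warnings, for classes that were never predicted
print(classification_report(y_test, predictions, zero_division=0))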

Feature names: ['16' 'advanced' 'ai' 'announced' 'are' 'athletes' 'basketball' 'battery'
 'breakthroughs' 'championship' 'computing' 'draw' 'ended' 'evolving'
 'fast' 'has' 'hour' 'in' 'languages' 'laptop' 'life' 'match' 'new' 'next'
 'olympic' 'playoffs' 'programming' 'quantum' 'released' 'rigorously'
 'smartphone' 'soccer' 'start' 'team' 'the' 'train' 'week' 'with' 'won']
              precision    recall  f1-score   support

      sports       0.50      1.00      0.67         1
        tech       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2





### 3.3 Tokenization
Tokenization is the process of breaking text into smaller units like words or sentences.

In [4]:
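# Word tokenization with NLTK
import nltk

# nltk.download() normally returns False instead of raising when a package
# is unavailable, so check its result before falling back to 'punkt'
if not nltk.download('punkt_tab'):
    nltk.download('punkt')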

from nltk.tokenize import word_tokenize

text = "I am reading a book. It is Python Machine Learning By Example, 4th edition."
tokens = word_tokenize(text)
print(tokens)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...


['I', 'am', 'reading', 'a', 'book', '.', 'It', 'is', 'Python', 'Machine', 'Learning', 'By', 'Example', ',', '4th', 'edition', '.']


[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [5]:
# Sentence Tokenization with NLTK
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(text)
print(sentences)


['I am reading a book.', 'It is Python Machine Learning By Example, 4th edition.']


In [6]:
# Tokenization with spaCy
# Requires the small English model: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I have been to U.K. and U.S.A.")
print([token.text for token in doc])

['I', 'have', 'been', 'to', 'U.K.', 'and', 'U.S.A.']
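
To see what the library tokenizers add, compare them with a naive regex tokenizer. This is a minimal sketch; the function names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical and the rules are deliberately simplistic.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "I am reading a book. It is Python Machine Learning By Example, 4th edition."
print(simple_word_tokenize(text))
print(simple_sent_tokenize(text))

# Abbreviations expose the weakness of the naive rule:
print(simple_word_tokenize("U.K."))  # ['U', '.', 'K', '.']
```

The regex version handles the simple sentence well but breaks "U.K." into four tokens, whereas spaCy keeps it as one. Handling abbreviations, contractions, and similar cases is the main reason to prefer a trained or rule-rich tokenizer.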


### 3.4 Part-of-Speech (PoS) Tagging
PoS tagging assigns grammatical categories (e.g., noun, verb) to words in a sentence.

In [7]:
import nltk

# nltk.download() normally returns False instead of raising, so check its result
if not nltk.download('averaged_perceptron_tagger_eng'):
    nltk.download('averaged_perceptron_tagger')

from nltk.tokenize import word_tokenize
tokens = word_tokenize("I am reading a book.")
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


[('I', 'PRP'), ('am', 'VBP'), ('reading', 'VBG'), ('a', 'DT'), ('book', 'NN'), ('.', '.')]


In [8]:
# PoS Tagging with spaCy

doc = nlp("I have been to U.K. and U.S.A.")
print([(token.text, token.pos_) for token in doc])

[('I', 'PRON'), ('have', 'AUX'), ('been', 'AUX'), ('to', 'ADP'), ('U.K.', 'PROPN'), ('and', 'CCONJ'), ('U.S.A.', 'PROPN')]


### 3.5 Named Entity Recognition (NER)
NER identifies and classifies named entities like persons, organizations, and locations.

In [9]:
# NER with spaCy

doc = nlp("The book written by Hayden Liu in 2024 was sold at $30 in America.")
print([(ent.text, ent.label_) for ent in doc.ents])

[('Hayden Liu', 'PERSON'), ('2024', 'DATE'), ('30', 'MONEY'), ('America', 'GPE')]


### 3.6 Stemming and Lemmatization
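Stemming strips suffixes using heuristic rules, so the result is not always a real word (for example, "machines" becomes "machin"). Lemmatization maps a word to its dictionary form (lemma) using a vocabulary and, optionally, the word's part of speech.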

In [10]:
# Stemming with NLTK
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("machines"))
print(stemmer.stem("learning"))

machin
learn


In [11]:
# Lemmatization with NLTK

import nltk
# Dependencies
nltk.download('wordnet')  # Download the WordNet corpus
nltk.download('omw-1.4')  # Open Multilingual WordNet

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("machines"))
print(lemmatizer.lemmatize("learning", pos="v"))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


machine
learn
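
The difference between the two approaches can be sketched with toy implementations. The suffix list and lemma dictionary below are illustrative assumptions. This is not the Porter algorithm or WordNet, only a simplified picture of rule-based stripping versus dictionary lookup.

```python
# Toy rule set and dictionary (illustrative assumptions)
SUFFIXES = ("ing", "es", "s")  # checked in this order
LEMMAS = {"machines": "machine", "learning": "learn", "mice": "mouse"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary of known lemmas."""
    return LEMMAS.get(word, word)

for word in ["machines", "learning", "mice"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The stemmer produces "machin" because it only manipulates characters, while the lookup returns "machine". Irregular forms such as "mice" are out of reach for suffix rules but easy for a dictionary.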


### 3.7 Topic Modeling
Topic modeling uncovers abstract topics within a collection of documents. It identifies patterns and themes without needing any prior labels or annotations.

In [12]:
# Topic Modeling Using Gensim

# !pip uninstall -y numpy gensim scipy
# !pip install --no-cache-dir numpy==1.23.5 gensim==4.3.2 scipy==1.9.3


import gensim
from gensim import corpora
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey"
]

# Preprocessing
stop_words = set(stopwords.words('english'))
texts = [
    [word for word in document.lower().split() if word not in stop_words]
    for document in documents
]

# Create dictionary and corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# LDA model
lda_model = gensim.models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=3,
    random_state=42,
    passes=15,
    alpha='auto'
)

# Print topics
for idx, topic in lda_model.print_topics(num_words=4):
    print(f"Topic {idx}: {topic}\n")


In [None]:
import pyLDAvis.gensim_models
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)

### 3.8 Word Embeddings (BERT)

Word embeddings are dense vector representations of words that capture semantic relationships. Among various models, BERT (Bidirectional Encoder Representations from Transformers) has become a significant advancement in NLP.

#### What is BERT?
BERT is a deep learning model designed by Google to improve understanding of the context of words in a sentence. Unlike traditional embeddings, BERT uses a bidirectional approach, meaning it considers both the left and right context of a word simultaneously.

#### Key Uses of BERT
- **Text Classification**: Sentiment analysis, spam detection, etc.
- **Question Answering**: Extracting precise answers from text.
- **Named Entity Recognition**: Identifying entities in text.
- **Machine Translation**: Supplying pre-trained encoder representations to translation systems (BERT on its own does not generate text).

In [None]:
!pip uninstall -y numpy torch transformers
!pip install --no-cache-dir numpy==1.21.2 torch transformers

In [None]:
import torch
from transformers import BertTokenizer, BertModel

# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize text
inputs = tokenizer("I love natural language processing!", return_tensors="pt")

# Run the model without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(**inputs)

# Extract embeddings: one 768-dimensional vector per token,
# including the special [CLS] and [SEP] tokens added by the tokenizer
print(outputs.last_hidden_state.shape)  # torch.Size([1, num_tokens, 768])

### 3.9 Language Modeling

Language modeling assigns probabilities to sequences of words. In practice, a model is trained to predict the next word (or token) given the words that precede it.

#### Key Uses of Language Modeling
- Text generation.
- Autocomplete systems.
- Language translation.
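
Next-word prediction can be illustrated without a neural network. The sketch below builds a bigram model from counts. The corpus and the function names `next_word_probs` and `generate` are illustrative assumptions, and real language models are far larger and use learned parameters instead of raw counts.

```python
from collections import Counter, defaultdict

# Toy training corpus (illustrative)
corpus = [
    "once upon a time there was a king",
    "once upon a time there was a queen",
    "a king ruled the land",
]

# Count how often each word follows each other word
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Estimate P(next word | previous word) from relative frequencies."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def generate(start, max_words=6):
    """Greedily extend `start` with the most frequent next word."""
    words = [start]
    while len(words) < max_words and bigram_counts[words[-1]]:
        words.append(bigram_counts[words[-1]].most_common(1)[0][0])
    return " ".join(words)

print(next_word_probs("a"))
print(generate("once", max_words=3))
```

Greedy decoding always picks the most frequent continuation. Sampling from `next_word_probs` instead would give more varied text, which is the same trade-off that generation settings control in larger models.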

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load GPT-2
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

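# Generate text
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(
    **inputs,                             # input_ids and attention_mask
    max_length=50,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse end-of-text
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))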

### 3.10 Machine Translation

Machine Translation involves translating text from one language to another.

#### Key Uses of Machine Translation
- Cross-language communication.
- Global content localization.

In [None]:
from transformers import MarianMTModel, MarianTokenizer

# Load Marian model for English to French
model_name = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translate text
text = "How are you?"
inputs = tokenizer(text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### 3.11 Large Language Models (LLMs) and Attention Mechanisms

LLMs are trained on massive datasets to understand and generate human-like text. Attention mechanisms help focus on relevant parts of the input sequence.
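
How attention "focuses" on parts of the input can be sketched with scaled dot-product attention for a single query, written in plain Python. The vectors below are made-up toy values and the function names are illustrative. Real models apply the same computation to learned, high-dimensional matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(dimension)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [10.0, 0.0]]

weights, output = attention(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The first and third keys match the query, so they receive the largest weights and their values dominate the output.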

#### Key Uses of LLMs
- Summarization.
- Question answering.
- Creative writing.

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load a pre-trained model (T5)
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example usage
# text = "summarize: Machine learning enables computers to learn from data."

text = """summarize: Climate change is one of the most pressing issues of our time.
Rising global temperatures, caused primarily by human activities such as burning fossil fuels
and deforestation, are leading to severe consequences like extreme weather events, melting
ice caps, and rising sea levels. Scientists warn that without immediate action, these effects
will become irreversible. Governments worldwide are implementing policies to reduce carbon
emissions, while individuals can contribute by adopting sustainable practices like using
renewable energy, reducing waste, and supporting eco-friendly initiatives. The time to act
is now—delaying further will only exacerbate the crisis."""

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(inputs['input_ids'], max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### 3.12 Chatbots
Chatbots simulate conversation with users by understanding input and generating meaningful responses.

#### Key Uses of Chatbots
- Customer support automation.
- Personalized assistance.
- Interactive learning.

In [None]:
from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

# Load Blenderbot
model_name = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(model_name)
model = BlenderbotForConditionalGeneration.from_pretrained(model_name)

# Chatbot interaction
text = "Hello, how can I help you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(inputs['input_ids'], max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In [None]:
def chat():
    print("Bot: Hi! I'm your friendly AI. Type 'quit' to end the chat.")
    while True:
        # Get user input
        user_input = input("You: ")
        if user_input.lower() == 'quit':
            break

        # Format input and generate response
        inputs = tokenizer(user_input, return_tensors="pt")
        # temperature is omitted: it only takes effect when do_sample=True
        outputs = model.generate(
            **inputs,
            max_length=100,
            num_beams=5,          # Beam search: track the 5 most likely candidates
            early_stopping=True   # Stop once all beams have finished
        )

        # Print response
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"Bot: {response}")

# Start chatting
chat()

## 4. Real-World NLP Applications

#### 4.1 Sentiment Analysis
Sentiment analysis involves determining the sentiment (e.g., positive, negative, or neutral) expressed in a text. For example:
- **Binary Classification**: Positive or negative sentiment.
- **Multiclass Classification**: Positive, neutral, or negative sentiment.

**Use Case**: News sentiment analysis can provide valuable signals for stock market trading.
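
A very simple baseline for the three-class case is a lexicon-based scorer. The word lists and the function name `sentiment` below are hypothetical. Production systems typically rely on trained classifiers, so treat this only as a sketch of the idea.

```python
import re

# Tiny hand-made lexicons (illustrative assumptions)
POSITIVE = {"good", "great", "gain", "gains", "strong", "beat"}
NEGATIVE = {"bad", "poor", "loss", "losses", "weak", "miss"}

def sentiment(text):
    """Label text by counting positive and negative lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

headlines = [
    "Strong earnings beat expectations",
    "Weak sales and heavy losses",
    "The company held a meeting",
]
for headline in headlines:
    print(headline, "->", sentiment(headline))
```

Counting lexicon hits ignores negation and context ("not good" still scores as positive), which is one reason learned models usually outperform this kind of baseline.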

---

#### 4.2 News Topic Classification
News topic classification assigns categories (e.g., technology, sports, religion) to news articles. Categories may or may not be mutually exclusive. For example:
- An article about the Olympic Games could be labeled as both **sports** and **politics** if there is political involvement.

---

#### 4.3 Named Entity Recognition (NER)
NER identifies and classifies named entities in text, such as:
- **Persons**: Elon Musk
- **Organizations**: SpaceX
- **Locations**: California
- **Dates**: 2020
- **Quantities**: 9 meters

**Example**:
> "SpaceX[Organization], a California[Location]-based company founded by Elon Musk[Person], announced a 9[Quantity]-meter-diameter launch vehicle for 2020[Date]."

---

#### Other Key NLP Applications

#### 4.4 Language Translation
NLP powers machine translation systems like **Google Translate** and **Microsoft Translator**, enabling real-time translation between languages.

---

#### 4.5 Speech Recognition
Speech recognition converts spoken language into written text, which downstream NLP components can then interpret. Examples include virtual assistants like **Siri**, **Alexa**, and **Google Assistant**.

---

#### 4.6 Text Summarization
NLP can generate concise summaries of lengthy texts, aiding in information retrieval and content curation.
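
One classic way to produce such summaries is extractive: score each sentence by how frequent its words are in the whole text and keep the top ones. The sketch below assumes a small hand-picked stopword list and a naive sentence splitter. The function name `summarize` is illustrative, and neural summarizers work quite differently.

```python
import re
from collections import Counter

# Small stopword list (an illustrative assumption)
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "are", "in", "by", "it"}

def content_words(text):
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def summarize(text, n_sentences=1):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Preserve the original sentence order in the summary
    return " ".join(s for s in sentences if s in top)

text = (
    "Climate change is a pressing issue. "
    "Climate change causes extreme weather. "
    "Some people like tea."
)
print(summarize(text, n_sentences=1))
```

Dividing by sentence length keeps long sentences from winning simply because they contain more words.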

---

#### 4.7 Language Generation
NLP models like **Generative Pre-trained Transformers (GPTs)** can generate human-like text, including creative writing, poetry, and dialogue.

---

#### 4.8 Information Retrieval
NLP helps retrieve relevant information from unstructured data (e.g., web pages, documents). Search engines use NLP to understand user queries and fetch appropriate results.
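
A minimal sketch of query-document matching ranks documents by cosine similarity between bag-of-words vectors. The documents and the function names `cosine` and `search` are illustrative assumptions. Real search engines add indexing, term weighting, and learned ranking on top of this idea.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, documents):
    """Return (score, document) pairs sorted from most to least similar."""
    query_vec = Counter(query.lower().split())
    scored = [(cosine(query_vec, Counter(doc.lower().split())), doc) for doc in documents]
    return sorted(scored, reverse=True)

documents = [
    "python tutorial for beginners",
    "cooking pasta at home",
    "advanced python tips",
]
for score, doc in search("python tutorial", documents):
    print(f"{score:.3f}  {doc}")
```

Cosine similarity normalizes for document length, so a short document that matches the query closely can outrank a longer one that merely mentions a query word.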

---

#### 4.9 Chatbots and Virtual Assistants
NLP powers interactive systems like chatbots and virtual assistants, enabling them to answer queries, assist with tasks, and guide users.

---