# What is Natural Language Processing?

Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling machines to understand, interpret, and generate human language. It combines computational linguistics with statistical, machine learning, and deep learning models.

## Importance of NLP

NLP is crucial because it bridges the gap between human communication and machine understanding. It enables computers to:
- Process and analyze large amounts of natural language data
- Extract meaningful information from text
- Understand context and sentiment
- Generate human-like responses

## Applications

- **Sentiment Analysis**: Understanding emotions and opinions in text
- **Machine Translation**: Translating text between languages
- **Chatbots**: Conversational AI systems
- **Text Summarization**: Creating concise summaries of long documents
- **Spam Detection**: Identifying unwanted emails
- **Speech Recognition**: Converting speech to text

## Challenges

- **Ambiguity**: Words and phrases can have multiple meanings
- **Context Understanding**: Meaning depends on surrounding text
- **Data Scarcity**: Limited labeled data for training
- **Language Variations**: Slang, dialects, and evolving language


## A First Glimpse of NLP

The NLP pipeline typically follows these steps:

1. **Text**: Raw data in human language
2. **Tokens**: Breaking text into words, subwords, or characters
3. **Encoded Vector**: Converting tokens to numerical representation (machine-readable)
4. **Model**: Applying Machine Learning or Deep Learning models
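
The steps above can be sketched in plain Python. This is a minimal illustration, assuming naive whitespace tokenization and a toy bag-of-words encoding; the helper names (`tokenize`, `build_vocab`, `encode`) are hypothetical and not part of any library, and the model step is left as a comment.

```python
# Minimal sketch of Text -> Tokens -> Encoded Vector (helper names are illustrative)

def tokenize(text):
    """Step 2: split lowercased text on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_vocab(tokens):
    """Map each distinct token to an integer index, in sorted order."""
    return {token: idx for idx, token in enumerate(sorted(set(tokens)))}

def encode(tokens, vocab):
    """Step 3: count how often each vocabulary token occurs (bag of words)."""
    vector = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            vector[vocab[token]] += 1
    return vector

text = "the cat sat on the mat"   # Step 1: raw text
tokens = tokenize(text)            # Step 2: tokens
vocab = build_vocab(tokens)
vector = encode(tokens, vocab)     # Step 3: numerical representation
# Step 4 would pass `vector` to a machine learning model.

print("Tokens:", tokens)
print("Vocabulary:", vocab)
print("Vector:", vector)
```

Real pipelines replace each helper with something more robust (a proper tokenizer, a learned vocabulary, TF-IDF or embeddings), but the flow of data stays the same.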


# Tokenization

**Tokenization** is the process of breaking down text into smaller units called tokens. These tokens can be words, characters, or subwords, depending on the tokenization strategy used.

## Types of Tokenization

### 1. Word Tokenization
Splitting text into individual words. This is the most intuitive approach and is common in classical NLP pipelines.

### 2. Character Tokenization
Splitting text into individual characters. Useful for languages with no word boundaries or for character-level models.

### 3. Subword Tokenization
Splitting text into meaningful subword units (e.g., "unlocking" → "un", "lock", "ing"). This helps handle out-of-vocabulary words and reduces vocabulary size.
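
To make the idea concrete, here is a toy greedy longest-match subword tokenizer. It is only a sketch: the vocabulary `toy_vocab` is hand-picked for illustration, whereas real subword tokenizers (such as BPE or WordPiece) learn their vocabulary from a corpus. The function name `subword_tokenize` and the `[UNK]` placeholder are assumptions made for this example.

```python
# Toy greedy longest-match subword tokenizer (vocabulary is hand-picked, not learned)

def subword_tokenize(word, vocab, unk_token="[UNK]"):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it matches a vocabulary entry
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: mark the whole word as unknown
            return [unk_token]
        pieces.append(word[start:end])
        start = end
    return pieces

toy_vocab = {"un", "lock", "ing", "book", "s", "read"}

for word in ["unlocking", "books", "reading", "xyz"]:
    print(word, "->", subword_tokenize(word, toy_vocab))
```

Note that "reading" is never listed in the vocabulary, yet it can still be represented by combining "read" and "ing". This is how subword methods cope with words they have not seen as a whole.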


## Text to Tokens Example

Let's see how the text **"Don't judge a book by its cover"** is tokenized:


In [1]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

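# Download required NLTK data
# (newer NLTK releases load 'punkt_tab' instead of 'punkt', so check both)
for resource in ('punkt', 'punkt_tab'):
    try:
        nltk.data.find(f'tokenizers/{resource}')
    except LookupError:
        nltk.download(resource)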

# Example text
text = "Don't judge a book by its cover"

# Word tokenization
tokens = word_tokenize(text)
print("Original text:", text)
print("Tokens:", tokens)
print(f"\nNumber of tokens: {len(tokens)}")



# Tokens to Encoded Vector

To process text with machine learning models, we need to convert tokens into numerical representations. This process is called **encoding** or **vectorization**.


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

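# Example sentence (CountVectorizer applies its own tokenizer to the raw text)
text = "Don't judge a book by its cover"

# Using CountVectorizer (Bag of Words)
# Its default tokenizer lowercases the text and keeps only tokens of 2+ word
# characters, so "a" is dropped and "Don't" becomes "don"
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform([text])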

print("Tokens:", word_tokenize(text))
print("\nVocabulary:", count_vectorizer.get_feature_names_out())
print("\nEncoded vector (Count):")
print(count_matrix.toarray())

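# Using TF-IDF Vectorizer
# Note: with a single document every term has the same IDF, so the result is
# just the L2-normalized count vector; IDF only matters with several documents
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform([text])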

print("\nEncoded vector (TF-IDF):")
print(tfidf_matrix.toarray())


# Word Embeddings and Vector Representation

**Word Embeddings** are dense vector representations of words in a continuous vector space. Unlike sparse representations (like one-hot encoding), embeddings capture semantic relationships between words.

## Key Concepts

- **Dense Vectors**: Each word is represented by a dense vector of real numbers
- **Semantic Relationships**: Words with similar meanings are close in the vector space
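- **Context-Aware**: Contextual models (e.g., BERT, ELMo) give a word different vectors depending on its surrounding text, whereas static embeddings such as Word2Vec and GloVe assign each word a single vector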
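
"Close in the vector space" is usually measured with cosine similarity. The sketch below uses three hand-made 3-dimensional vectors invented purely for illustration (real embeddings are learned from data and typically have hundreds of dimensions), so the exact numbers carry no meaning beyond showing the computation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, invented for illustration only (not learned embeddings)
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])

print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

With these toy values, "king" and "queen" score much higher than "king" and "apple", which mirrors what one would hope to see from trained embeddings.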

## Types of Word Embeddings

1. **Word2Vec**: Predicts words from context (CBOW) or context from words (Skip-gram)
2. **GloVe**: Global Vectors for Word Representation using co-occurrence statistics
3. **FastText**: Extends Word2Vec with subword information
4. **Contextual Embeddings**: BERT, ELMo, GPT (capture context-dependent meanings)
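
To illustrate what "context from words" means for Skip-gram, the sketch below generates the (center word, context word) training pairs from a token list. It covers only the data preparation, not the training of the embedding model itself; the function name `skipgram_pairs` and the `window` parameter are assumptions made for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["a", "good", "book", "is", "a", "good", "friend"]
pairs = skipgram_pairs(tokens, window=1)

print(f"Number of pairs: {len(pairs)}")
for center, context in pairs[:6]:
    print(f"{center} -> {context}")
```

Skip-gram learns to predict the context word from the center word for each such pair. CBOW uses the same windows but predicts the center word from its surrounding context words.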


## Vector Representation Methods

### 1. One-Hot Encoding
Each word is represented as a binary vector where only one element is 1 and all others are 0.

### 2. TF-IDF (Term Frequency-Inverse Document Frequency)
Weights words based on their frequency in a document relative to their frequency across all documents.
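
The weighting can be worked through by hand. The sketch below assumes the textbook formulation, relative term frequency multiplied by `log(N / df)`, and whitespace tokenization. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

documents = [
    "a good book is a good friend",
    "the cover of the book is beautiful",
    "judge a book by its cover",
]
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / df[term])
    return tf * idf

for term in ["book", "good", "cover"]:
    scores = [round(tf_idf(term, tokens), 3) for tokens in tokenized]
    print(f"{term:>6}: {scores}")
```

"book" appears in every document, so its IDF is `log(3/3) = 0` and it contributes nothing, while "good" appears in only one document and receives the highest weight there.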


In [None]:

# Example: One-Hot Encoding
texts = ["Don't judge a book by its cover", "A good book is a good friend"]
tokens_list = [word_tokenize(text.lower()) for text in texts]

# Get all unique words
all_words = set()
for tokens in tokens_list:
    all_words.update(tokens)
all_words = sorted(list(all_words))

print("Vocabulary:", all_words)
print(f"Vocabulary size: {len(all_words)}")

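# One-hot encoding: each vocabulary word gets a vector with a single 1
word_to_idx = {word: idx for idx, word in enumerate(all_words)}
for word in all_words[:3]:  # Show the first 3 words
    one_hot = [1 if i == word_to_idx[word] else 0 for i in range(len(all_words))]
    print(f"\n'{word}' -> {one_hot}")

# Combining the one-hot vectors of a text's words gives a multi-hot
# (binary bag-of-words) vector: 1 if the word occurs in the text, else 0
multi_hot_vectors = []
for tokens in tokens_list:
    vector = [1 if word in tokens else 0 for word in all_words]
    multi_hot_vectors.append(vector)
    print(f"\nText: {' '.join(tokens)}")
    print(f"Multi-hot vector: {vector}")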


In [None]:
# TF-IDF Vectorization with multiple documents
documents = [
    "Don't judge a book by its cover",
    "A good book is a good friend",
    "The cover of the book is beautiful"
]

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

print("Documents:")
for i, doc in enumerate(documents):
    print(f"{i+1}. {doc}")

print("\nVocabulary:", tfidf_vectorizer.get_feature_names_out())
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())
print("\nShape:", tfidf_matrix.shape)


# Character Embeddings

Character embeddings represent individual characters as vectors. This is particularly useful for:
- Handling out-of-vocabulary words
- Morphologically rich languages
- Character-level language models
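
As a rough sketch of the idea, a character embedding is a lookup table with one dense row per character. The table below is filled with random values purely for illustration; in a real model these values are learned during training. The dimension `embedding_dim = 4`, the helper `embed`, and the mean pooling step are assumptions made for this example.

```python
import numpy as np

text = "book cover"
chars = sorted(set(text))
char_to_idx = {ch: idx for idx, ch in enumerate(chars)}

embedding_dim = 4  # hypothetical size, chosen small for readability
rng = np.random.default_rng(seed=0)
# Untrained table: one random row per character; training would adjust these values
embedding_table = rng.normal(size=(len(chars), embedding_dim))

def embed(word):
    """Look up each character's vector and stack them into a (len(word), dim) array."""
    indices = [char_to_idx[ch] for ch in word]
    return embedding_table[indices]

char_vectors = embed("book")
# Averaging is one simple way to pool character vectors into a single word vector
word_vector = char_vectors.mean(axis=0)

print("Character vectors shape:", char_vectors.shape)
print("Pooled word vector shape:", word_vector.shape)

# A word never seen as a whole can still be embedded from its known characters
print("Unseen word 'brook' shape:", embed("brook").shape)
```

Because the lookup works per character, any word built from known characters gets a representation, which is why character-level models handle out-of-vocabulary words gracefully.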


In [None]:
# Character-level tokenization
text = "Don't judge a book by its cover"

# Character tokenization
char_tokens = list(text)
print("Original text:", text)
print("Character tokens:", char_tokens)
print(f"Number of characters: {len(char_tokens)}")

# Character-level one-hot encoding
unique_chars = sorted(set(char_tokens))
char_to_idx = {char: idx for idx, char in enumerate(unique_chars)}

print(f"\nUnique characters: {unique_chars}")
print(f"Character to index mapping: {char_to_idx}")

# Create character-level one-hot vectors
char_vectors = []
for char in char_tokens[:10]:  # Show first 10 characters
    vector = [1 if i == char_to_idx[char] else 0 for i in range(len(unique_chars))]
    char_vectors.append(vector)
    print(f"'{char}' -> {vector[:5]}...")  # Show first 5 dimensions


# Applications of NLP

## 1. Sentiment Analysis
Analyzing emotions and opinions expressed in text to determine whether the sentiment is positive, negative, or neutral.

## 2. Machine Translation
Automatically translating text from one language to another while preserving meaning.

## 3. Chatbots
Conversational AI systems that can understand user queries and provide relevant responses.

## 4. Text Summarization
Creating concise summaries of long documents while preserving key information.

## 5. Spam Detection
Identifying and filtering unwanted or malicious emails and messages.

## 6. Speech Recognition
Converting spoken language into text format.


In [None]:
# Example: Simple Sentiment Analysis using TF-IDF
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample data
positive_texts = [
    "I love this product!",
    "This is amazing and wonderful",
    "Great quality, highly recommend",
    "Excellent service and fast delivery",
    "Best purchase I've ever made"
]

negative_texts = [
    "This is terrible and disappointing",
    "Poor quality, waste of money",
    "Very bad experience",
    "Not worth the price",
    "I hate this product"
]

# Create dataset
texts = positive_texts + negative_texts
labels = ['positive'] * len(positive_texts) + ['negative'] * len(negative_texts)

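# Split first, then fit the vectorizer on the training texts only,
# so no information from the test set leaks into the features
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)

# Train a simple classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)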

# Predictions (with only 3 test samples, the accuracy is illustrative, not meaningful)
y_pred = classifier.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Test on new text
new_text = "This product is really good!"
new_vector = vectorizer.transform([new_text])
prediction = classifier.predict(new_vector)[0]
print(f"\nText: '{new_text}'")
print(f"Predicted sentiment: {prediction}")


# Challenges in NLP

## 1. Ambiguity
Words and phrases can have multiple meanings depending on context. For example:
- "Bank" can mean a financial institution or the side of a river
- "I saw her duck" - did she duck or did I see her pet duck?

## 2. Context Understanding
Understanding meaning requires context from surrounding words, sentences, and even broader discourse.

## 3. Data Scarcity
Many NLP tasks require large amounts of labeled training data, which can be expensive and time-consuming to create.

## 4. Language Variations
- Slang and informal language
- Regional dialects
- Evolving language over time
- Multiple languages and code-switching

## 5. Sarcasm and Irony
Detecting when words mean the opposite of their literal meaning.

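## 6. Named Entity Recognition
Identifying and classifying entities such as names, locations, and organizations is difficult because the same string can refer to different entity types (for example, "Washington" can be a person, a city, or a state) and new names appear constantly.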


In [None]:
# Demonstrating ambiguity challenge
ambiguous_sentences = [
    "I saw her duck",  # Did she duck or did I see her pet duck?
    "The bank is closed",  # Financial institution or river bank?
    "Time flies like an arrow",  # Time moves fast or insects that like arrows?
    "The chicken is ready to eat"  # Ready to be eaten or ready to eat something?
]

print("Examples of Ambiguous Sentences:")
for i, sentence in enumerate(ambiguous_sentences, 1):
    print(f"{i}. {sentence}")

# Tokenization doesn't resolve ambiguity
print("\nTokenization of ambiguous sentences:")
for sentence in ambiguous_sentences:
    tokens = word_tokenize(sentence)
    print(f"'{sentence}' -> {tokens}")


# Future of NLP

The field of NLP is rapidly evolving with several exciting developments:

## Recent Advances

1. **Large Language Models (LLMs)**: Models like GPT, BERT, and T5 have revolutionized NLP
2. **Transformer Architecture**: Attention mechanisms enable better context understanding
3. **Multimodal Models**: Combining text, images, and audio
4. **Few-Shot Learning**: Models that can learn from few examples
5. **Zero-Shot Learning**: Models that can perform tasks they weren't explicitly trained on
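
The attention mechanism mentioned above can be sketched in a few lines of NumPy. This is a simplified illustration of scaled dot-product self-attention: it uses random input vectors and feeds the same matrix in as queries, keys, and values, whereas real Transformers apply learned projection matrices and several attention heads. The sizes `seq_len` and `d_model` are arbitrary choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(x)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Weight each value by how well its key matches each query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(seed=0)
seq_len, d_model = 3, 4                  # 3 tokens, 4-dimensional vectors (arbitrary)
X = rng.normal(size=(seq_len, d_model))  # stand-in for token embeddings

# Self-attention without learned projections (a simplification)
output, weights = scaled_dot_product_attention(X, X, X)

print("Attention weights (rows sum to 1):")
print(weights.round(3))
print("Output shape:", output.shape)
```

Each output row is a weighted mix of all token vectors, which is how attention lets every token's representation draw on the rest of the sequence.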

## Emerging Trends

- **Conversational AI**: More natural and context-aware chatbots
- **Multilingual Models**: Single models handling multiple languages
- **Domain-Specific Models**: Specialized models for healthcare, legal, finance, etc.
- **Efficient Models**: Smaller, faster models for edge devices


# Ethical Considerations

As NLP technology becomes more powerful, it's important to consider:

## Key Ethical Issues

1. **Bias and Fairness**: Models can perpetuate or amplify societal biases
2. **Privacy**: Handling sensitive personal information in text data
3. **Misinformation**: Potential for generating false or misleading content
4. **Transparency**: Understanding how models make decisions
5. **Accessibility**: Ensuring NLP benefits are available to all
6. **Job Displacement**: Impact on employment in language-related fields

## Responsible AI Development

- Regular bias audits
- Diverse training data
- Transparent model documentation
- User privacy protection
- Continuous monitoring and evaluation


# Conclusion

## Key Takeaways

1. **Tokenization** is the first step in NLP, breaking text into smaller units
2. **Vectorization** converts tokens into numerical representations for machine processing
3. **Word Embeddings** capture semantic relationships between words
4. **NLP Applications** span from sentiment analysis to machine translation
5. **Challenges** include ambiguity, context understanding, and data scarcity
6. **Future** holds promise with large language models and multimodal approaches

## Next Steps

- Explore pre-trained embeddings (Word2Vec, GloVe, FastText)
- Experiment with transformer models (BERT, GPT)
- Build practical NLP applications
- Study advanced topics like attention mechanisms and transfer learning


In [None]:
# Complete Example: Full NLP Pipeline

# Step 1: Text Input
text = "Natural Language Processing is fascinating and powerful!"
print("Step 1 - Text:", text)

# Step 2: Tokenization
tokens = word_tokenize(text.lower())
print("Step 2 - Tokens:", tokens)

# Step 3: Vectorization
vectorizer = TfidfVectorizer()
vector = vectorizer.fit_transform([text])
print("\nStep 3 - Encoded Vector (TF-IDF):")
print(f"Shape: {vector.shape}")
print(f"Vector (first 10 values): {vector.toarray()[0][:10]}")

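# Inspect the vocabulary learned by the vectorizer
# (Step 4, the model, would take the encoded vector above as input)
print("\nVocabulary:", vectorizer.get_feature_names_out())
print(f"Vocabulary size: {len(vectorizer.get_feature_names_out())}")

print("\n✓ Text -> tokens -> vector pipeline executed successfully!")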
