# Advanced Text Processing Techniques in NLP

In this notebook, we will explore advanced concepts in text processing, including sentence and word tokenization, as well as understanding stopwords. These techniques are important for performing more nuanced natural language processing tasks.

## Sentence vs Word Tokenization

Let's understand the difference between sentence tokenization and word tokenization:

<div style='display: grid; grid-template-columns: 1fr 1fr; gap: 20px;'>
<div style='background: #f0f8ff; padding: 20px; border-radius: 10px;'>

### 📝 Sentence Tokenization

- Splits text into sentences
- Handles abbreviations (Dr., U.S.A.)
- Useful for document analysis
- Preserves sentence structure

</div>
<div style='background: #fff8f0; padding: 20px; border-radius: 10px;'>

### 🔤 Word Tokenization

- Splits text into individual words
- Handles contractions (don't → do, n't)
- Foundation for most NLP tasks
- Creates vocabulary

</div>
</div>
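To make the abbreviation and contraction points concrete, here is a minimal sketch that uses only Python's standard library. It is a simplified illustration, not how NLTK works internally: the `ABBREVIATIONS` set and the helper names `split_sentences` and `split_words` are hypothetical and chosen just for this example.

```python
import re

text = "Dr. Smith moved to the U.S.A. in 2010. She doesn't regret it."

# Naive approach: split after every ., ! or ? that is followed by whitespace.
# This wrongly breaks after "Dr." and "U.S.A.".
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

# Hypothetical, hand-picked abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "u.s.a."}

def split_sentences(text):
    """Split on sentence-final punctuation unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

def split_words(sentence):
    """Split into words and punctuation, separating the "n't" clitic (don't -> do, n't)."""
    sentence = re.sub(r"(\w+)n't\b", r"\1 n't", sentence)
    return re.findall(r"n't|\w+|[^\w\s]", sentence)

print(len(naive_sentences))                    # the naive split over-segments
print(split_sentences(text))
print(split_words("She doesn't regret it."))
```

A fixed abbreviation list breaks down quickly (for example, when an abbreviation really does end a sentence), which is one reason to prefer a trained tokenizer such as NLTK's `sent_tokenize` over hand-written rules.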

## Understanding Stopwords

**Stopwords are very common words that carry little meaning on their own:**

**Examples:** the, is, at, which, on, a, an, and, or, but...

**Why remove them?**

- Focus on meaningful words
- Reduce noise in analysis
- Improve processing speed
- Better keyword extraction
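The effect on keyword extraction can be sketched with a simple word count. The stopword set below is a hypothetical mini list built from the examples above; real lists, such as NLTK's English list, are much longer.

```python
from collections import Counter

# Hypothetical mini stopword list (an assumption for this sketch).
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an", "and", "or", "but"}

text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()

# Count every token, then count only the tokens that are not stopwords.
all_counts = Counter(tokens)
content_counts = Counter(t for t in tokens if t not in STOP_WORDS)

print(f"Tokens before filtering: {len(tokens)}")
print(f"Tokens after filtering: {sum(content_counts.values())}")
print(f"Most frequent before: {all_counts.most_common(1)}")
print(f"Most frequent after: {content_counts.most_common(1)}")
```

Before filtering, the most frequent token is a stopword; after filtering, the top token is a content word, and there are fewer tokens left to process.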

## Advanced Processing Demo

In [None]:
import nltk

# Download the tokenizer models and the stopword lists
nltk.download('punkt')
# Newer NLTK releases load the tokenizer data from 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Natural language processing is amazing! It helps computers understand human language."

# Sentence tokenization
sentences = sent_tokenize(text)
print(f"Sentences: {len(sentences)}")

# Word tokenization with stopword removal
words = word_tokenize(text.lower())
stop_words = set(stopwords.words('english'))

# Keep alphabetic tokens that are not stopwords (this also drops punctuation)
filtered_words = [w for w in words if w not in stop_words and w.isalpha()]

print(f"Original words: {len(words)}")
print(f"After filtering: {len(filtered_words)}")
print(f"Filtered: {filtered_words}")

### Explore Advanced Concepts

[🚀 Explore Advanced Concepts in Colab](https://colab.research.google.com/github/Roopesht/codeexamples/blob/main/genai/python_easy/1/advanced.ipynb)

## Advanced Reflection

**Choosing how to tokenize and which words to filter gives us more control over text processing.**

💭 **When might you want to keep stopwords vs remove them? Think of different use cases!**
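One use case to consider is sentiment analysis, where negations matter. The sketch below uses a hypothetical mini stopword list and a hypothetical helper `remove_stopwords`; NLTK's English list likewise includes negations such as "not", so the same effect can appear with the real list.

```python
# Hypothetical mini stopword list (an assumption for this sketch).
STOP_WORDS = {"this", "is", "the", "a", "was", "not", "it"}

def remove_stopwords(text, keep=()):
    """Drop stopwords from whitespace-split text, except any words listed in `keep`."""
    keep = set(keep)
    return [w for w in text.lower().split() if w not in STOP_WORDS or w in keep]

positive = "this movie is good"
negative = "this movie is not good"

# With plain filtering, both reviews collapse to the same tokens.
print(remove_stopwords(positive))
print(remove_stopwords(negative))

# Keeping "not" preserves the difference between the two reviews.
print(remove_stopwords(negative, keep={"not"}))
```

For tasks such as topic modeling or keyword extraction, dropping these words is usually harmless. For sentiment analysis, a customized list that keeps negations may be the safer choice.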