# 📌 Topic: Text Normalization - Lowercasing

### What you will learn
- Why lowercasing is important in NLP
- How to convert text to lowercase in Python
- When lowercasing helps and when it hurts
- Batch processing multiple texts efficiently

### Why this matters
Text normalization is the second critical step in NLP pipelines. Lowercasing treats "Luna" and "luna" as the same word, which shrinks the vocabulary and often improves model performance. However, it is not always the right choice: sometimes capitalization carries meaning (e.g., when detecting proper nouns or acronyms).

---

## What is Text Normalization?

**Text normalization** is the process of converting text into a standard, consistent format. The most common normalization technique is **lowercasing**: converting all characters to lowercase.

### Why lowercase?
- **Reduces vocabulary**: "Luna", "LUNA", and "luna" are treated as the same word
- **Improves model training**: Models see more examples of each word (better statistics)
- **Simplifies matching**: Easier to find words regardless of capitalization
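
To make the vocabulary-reduction point concrete, here is a minimal sketch. The sample text is made up for illustration, and splitting on whitespace stands in for a real tokenizer.

```python
# Hypothetical sample text; .split() is a stand-in for a real tokenizer
text = "Luna likes LUNA and luna"
tokens = text.split()

# Vocabulary = set of distinct tokens
vocab_cased = set(tokens)
vocab_lower = {t.lower() for t in tokens}

print(len(vocab_cased), sorted(vocab_cased))  # 5 distinct tokens
print(len(vocab_lower), sorted(vocab_lower))  # 3 distinct tokens
```

The three spellings of "Luna" collapse into one entry, so any statistics for that word are pooled rather than split three ways.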

### When NOT to lowercase:
- **Named Entity Recognition (NER)**: You need capital letters to identify proper nouns
- **Acronyms**: "NASA" and "nasa" are no longer distinguishable, and some acronyms collide with ordinary words ("US" vs "us")
- **Sentiment analysis**: Some capitalization conveys emotion ("AMAZING" vs "amazing")
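
A small sketch of the acronym problem, using a made-up sentence and whitespace splitting as a simplification:

```python
# Hypothetical example sentence
sentence_with_acronym = "The US team visited us"
tokens = sentence_with_acronym.split()
lowered = [t.lower() for t in tokens]

# Before lowercasing, "US" (the country) and "us" (the pronoun) are distinct tokens
print(tokens.count("us"))   # 1
# After lowercasing, both become "us" and can no longer be told apart
print(lowered.count("us"))  # 2
```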

### Rule of thumb:
Lowercase for **vocabulary reduction tasks** (word counting, text classification), but preserve case for **entity detection** or **emotion analysis**.

## Part 1: Lowercasing a Single Text

In [None]:
# Store a sentence with mixed capitalization
# Notice: 'Her' is capitalized (start of sentence), 'Luna' is a proper noun
sentence = "Her cat's name is Luna"

### The `.lower()` Method

Python strings have a built-in `.lower()` method that converts all uppercase characters to lowercase. This is the simplest normalization technique.

```python
"HELLO".lower()  # Returns "hello"
"HeLLo".lower()  # Returns "hello"
```

**Important**: `.lower()` creates a NEW string; it doesn't modify the original (strings are immutable in Python).
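
As a side note, `.lower()` is usually enough for English text. For case-insensitive matching in other languages, Python also offers `str.casefold()`, which is more aggressive. The German word below is only an illustration:

```python
word = "Straße"

print(word.lower())     # straße  (the ß is kept)
print(word.casefold())  # strasse (the ß is folded to "ss")

# Caseless comparison against an all-caps spelling
print(word.lower() == "STRASSE".lower())        # False
print(word.casefold() == "STRASSE".casefold())  # True
```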

In [None]:
# Convert the entire sentence to lowercase using Python's built-in .lower() method
# This method works on any string object
lowercase_sentence = sentence.lower()

In [None]:
# Display the results
print(f"Original:    {sentence}")
print(f"Lowercase:   {lowercase_sentence}")
print("\nNotice: 'H' → 'h' and 'L' → 'l'")

## Part 2: Lowercasing Multiple Texts

In real NLP tasks, you'll process many documents, not just one. Python **list comprehensions** provide an elegant way to normalize entire datasets.

### Why use list comprehensions?
- **Concise**: One line instead of a loop
- **Efficient**: Usually somewhat faster than an explicit loop, because the comprehension avoids looking up and calling `.append()` on every iteration
- **Pythonic**: Standard pattern in data science
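
If you want to check the efficiency claim yourself, the sketch below times both approaches with the standard-library `timeit` module. The helper names and the sample data are hypothetical, and the timings depend on your Python version and machine, so treat them as indicative only.

```python
import timeit

# Hypothetical data: 10,000 copies of a short sentence
docs = ["Her cat's name is Luna"] * 10_000

def with_loop(texts):
    """Lowercase every text using an explicit loop and .append()."""
    result = []
    for t in texts:
        result.append(t.lower())
    return result

def with_comprehension(texts):
    """Lowercase every text using a list comprehension."""
    return [t.lower() for t in texts]

# Both approaches produce identical output
assert with_loop(docs) == with_comprehension(docs)

loop_time = timeit.timeit(lambda: with_loop(docs), number=50)
comp_time = timeit.timeit(lambda: with_comprehension(docs), number=50)
print(f"loop: {loop_time:.3f}s  comprehension: {comp_time:.3f}s")
```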

### Syntax:
```python
[expression for item in iterable]
# Equivalent to:
result = []
for item in iterable:
    result.append(expression)
```

In [None]:
# Create a list of sentences with various capitalization patterns
sentence_list = [
    "Her cat's name is Luna",        # Mixed case (sentence start + proper noun)
    "India Is my country",           # Irregular capitalization ('Is' mid-sentence)
    "Python is powerful and flexible"  # Only the first word is capitalized
]

# Use a list comprehension to lowercase ALL sentences at once
# For each sentence 'x' in sentence_list, apply .lower() and collect results
lowercase_list = [x.lower() for x in sentence_list]

In [None]:
# Display the results
print("Original sentences:")
for i, sent in enumerate(sentence_list, 1):
    print(f"  {i}. {sent}")

print("\nAfter lowercasing:")
for i, sent in enumerate(lowercase_list, 1):
    print(f"  {i}. {sent}")

# Notice how 'Luna' → 'luna' and 'India' → 'india'
# This is why NER must preserve case: lowercased text loses entity information!

## Side-by-Side Comparison

Let's highlight what changes when we lowercase:

In [None]:
print("Detailed comparison:")
print("=" * 60)
for original, lowercased in zip(sentence_list, lowercase_list):
    print(f"\nOriginal:    {original}")
    print(f"Lowercase:   {lowercased}")
    print("-" * 60)
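
To see exactly which words were altered, a small helper can list only the words that differ after lowercasing. The function name `changed_words` is hypothetical, and whitespace splitting is a simplification.

```python
def changed_words(text):
    """Return (original, lowercased) pairs for words that lowercasing alters."""
    return [(w, w.lower()) for w in text.split() if w != w.lower()]

for s in ["Her cat's name is Luna", "India Is my country"]:
    print(changed_words(s))
# [('Her', 'her'), ('Luna', 'luna')]
# [('India', 'india'), ('Is', 'is')]
```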

## Common Beginner Mistakes

### ❌ Mistake 1: Forgetting that strings are immutable
```python
sentence.lower()  # This doesn't modify 'sentence'!
print(sentence)   # Still has original capitalization
```

### ✅ Solution:
```python
sentence = sentence.lower()  # Assign the result back
```

---

### ❌ Mistake 2: Losing information unnecessarily
If you're doing Named Entity Recognition (NER) or detecting proper nouns, do **not** lowercase first.
```python
"Steve Jobs".lower()  # Returns "steve jobs": the capitalization cue for a person's name is gone
```

### ✅ Solution:
For NER tasks, preserve case and normalize differently (remove punctuation, etc.).
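
One way to apply this is to keep the raw text for entity detection and make a separate lowercased copy for counting. The sketch below uses a deliberately naive heuristic (capitalized words after the first word) as a stand-in for real NER. The function name and sample sentence are hypothetical.

```python
def capitalized_mid_sentence(text):
    """Naive heuristic, not real NER: words after the first that start uppercase."""
    words = text.split()
    return [w for w in words[1:] if w[:1].isupper()]

raw = "The keynote by Steve Jobs was short"
lowered = raw.lower()  # separate copy for word counting; 'raw' stays intact

print(capitalized_mid_sentence(raw))      # ['Steve', 'Jobs']
print(capitalized_mid_sentence(lowered))  # []
```

Once the text is lowercased, the heuristic finds nothing, which is the same signal a case-sensitive NER system would lose.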

---

### ❌ Mistake 3: Mismatching the casing your model expects
If you're using a pre-trained model that expects lowercased input, you **must** lowercase before feeding it text. Conversely, a model trained on cased text should receive the original casing.

### ✅ Solution:
Check your model's documentation: if it was trained on lowercased text, preprocess your input the same way.
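
A minimal sketch of making that decision explicit in code. The model names and the lookup table are hypothetical placeholders, and the real values must come from each model's documentation.

```python
# Hypothetical lookup table: fill it in from each model's documentation
MODEL_EXPECTS_LOWERCASE = {
    "example-uncased-model": True,
    "example-cased-model": False,
}

def preprocess(text, model_name):
    """Lowercase only when the table says the model expects lowercased input."""
    if MODEL_EXPECTS_LOWERCASE[model_name]:
        return text.lower()
    return text

print(preprocess("Her cat's name is Luna", "example-uncased-model"))  # her cat's name is luna
print(preprocess("Her cat's name is Luna", "example-cased-model"))    # Her cat's name is Luna
```

Looking the setting up by model name means an unknown model raises a `KeyError` instead of silently getting the wrong preprocessing.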

## Key Takeaways

1. **Lowercasing reduces vocabulary**: Same word, different cases = 1 word after lowercasing
2. **Use list comprehensions** for batch processing: Concise and efficient
3. **Know when NOT to lowercase**: NER, acronym detection, sentiment analysis
4. **Lowercasing is lossy**: You lose information about proper nouns and emphasis
5. **Check your model**: Different models expect different preprocessing

## Practice:
- Try lowercasing text with emails or URLs: what happens?
- Experiment with uppercase text in sentiment analysis: does it change the results?
- Move on to **stopword removal** and **stemming/lemmatization** for deeper normalization
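
A possible starting point for the first exercise. The address and URL are invented for illustration.

```python
# Hypothetical text containing an email address and a URL
text = "Email Support@Example.com or visit https://Example.com/Docs/ReadMe"
lowered = text.lower()

print(lowered)
# email support@example.com or visit https://example.com/docs/readme
```

Host names are case-insensitive, but URL paths can be case-sensitive, so the lowercased link may not point to the same page.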

## Preview: What's Next?

Lowercasing is just the first step. Other normalization techniques include:
- **Stopword removal**: Remove common words like "the", "is", "and"
- **Punctuation removal**: Strip `.,!?;:` etc.
- **Stemming/Lemmatization**: Reduce words to their root form ("running" → "run")
- **Unicode normalization**: Handle special characters

Each technique has trade-offs and is suited to different tasks. We'll explore them in upcoming notebooks.