# Exercise 0: Regex and Text Processing Basics

Welcome to your first NLP exercise! In this notebook, you'll learn the fundamental skills for working with text data.

## Learning Objectives
By the end of this exercise, you will be able to:
1. **Basic Text Operations**: Clean and normalize text data
2. **Regular Expressions**: Find and extract patterns from text
3. **German Text Processing**: Handle German language specifics (umlauts, compound words)
4. **Data Extraction**: Extract emails, phone numbers, dates, and URLs from text
5. **Text Validation**: Check if text matches specific patterns
6. **Advanced Cleaning**: Build comprehensive text preprocessing pipelines

## Prerequisites
- Basic Python knowledge (strings, functions, loops)
- No prior regex experience needed - we'll learn together!

## What You'll Build
- Text cleaning functions for German text
- Pattern extraction tools (emails, phones, dates)
- Text validation systems
- A complete text preprocessing pipeline

**Ready to become a text processing expert?** Let's dive in! 🚀

## Exercise 1: Basic Text Cleaning

**Goal**: Learn fundamental text cleaning operations that form the foundation of all text processing tasks.

**Your Task**: Implement basic text cleaning functions using both simple string methods and regular expressions.

### Setup and Sample Data

In [None]:
import re
import string

# Sample German text with various issues (use this for testing your functions)
sample_german_text = """
    Das ist ein   Beispieltext!!!  Er enthält GROSSBUCHSTABEN, 
    Zahlen wie 123, E-Mails wie test@uni-berlin.de,
    URLs wie https://www.example.com und Sonderzeichen: @#$%!
    Deutsche Umlaute: äöüÄÖÜß sind wichtig!   
    
    Telefonnummer: 030-12345678
    Datum: 15.03.2025
"""

print("📄 Sample Text to Work With:")
print(repr(sample_german_text))  # repr() shows whitespace and special characters
print("\n" + "="*50)

In [None]:
# Exercise 1a: Basic String Cleaning
def clean_basic_text(text):
    r"""
    Clean text using basic string methods.
    
    Your task: Implement the following cleaning steps:
    1. Remove leading/trailing whitespace
    2. Replace multiple spaces with single spaces
    3. Convert to lowercase
    4. Remove common punctuation (but keep German umlauts!)
    
    Hints:
    - Use .strip() to remove leading/trailing whitespace
    - Use re.sub(r'\s+', ' ', text) to replace multiple spaces
    - Use .lower() for lowercase conversion
    - For punctuation, use string.punctuation but be careful with German characters
    - You can use .translate() with str.maketrans() for punctuation removal
    
    Args:
        text (str): Input text to clean
    
    Returns:
        str: Cleaned text
    """
    
    # TODO: Implement your cleaning steps here
    # Step 1: Remove leading/trailing whitespace
    
    # Step 2: Replace multiple whitespaces with single space
    
    # Step 3: Convert to lowercase
    
    # Step 4: Remove punctuation (keep German umlauts)
    
    pass  # Remove this when you implement the function

# Test your function (uncomment after implementing)
# cleaned_basic = clean_basic_text(sample_german_text)
# print("Basic cleaning result:")
# print(repr(cleaned_basic))

## Exercise 2: Introduction to Regular Expressions

**Goal**: Learn regex basics for powerful pattern matching and text manipulation.

### Regex Basics - Essential Patterns

Before diving into exercises, here are the key regex patterns you'll need:

**Basic Characters:**
- `.` : Matches any single character
- `*` : Matches 0 or more repetitions  
- `+` : Matches 1 or more repetitions
- `?` : Matches 0 or 1 repetition
- `{n}` : Matches exactly n repetitions
- `{n,m}` : Matches n to m repetitions

**Character Classes:**
- `[abc]` : Matches a, b, or c
- `[a-z]` : Matches any lowercase letter
- `[A-Z]` : Matches any uppercase letter  
- `[0-9]` : Matches any digit
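- `\d` : Matches any digit (equivalent to `[0-9]` for ASCII text; in Python 3 it also matches other Unicode digits)
- `\w` : Matches word characters (letters, digits, underscore); in Python 3 this includes umlauts and ß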
- `\s` : Matches whitespace characters

**Anchors:**
- `^` : Start of string
- `$` : End of string
- `\b` : Word boundary

**Special for German:**
- `[a-zA-ZäöüÄÖÜß]` : German letters including umlauts and ß
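
The patterns listed above can be tried out on a short string. The following is a minimal sketch; the sample sentence and variable names are made up for illustration.

```python
import re

text = "Köln hat 1085664 Einwohner, Düsseldorf 629047."

# \d+ : one or more digits in a row
numbers = re.findall(r"\d+", text)

# [A-Za-z] covers ASCII letters only, so words are cut at umlauts
ascii_words = re.findall(r"[A-Za-z]+", text)

# The explicit German class keeps words with umlauts and ß intact
german_words = re.findall(r"[a-zA-ZäöüÄÖÜß]+", text)

# \b...\b combined with {6}: only standalone runs of exactly six digits
six_digit = re.findall(r"\b\d{6}\b", text)

print(numbers)
print(ascii_words)
print(german_words)
print(six_digit)
```

Comparing `ascii_words` with `german_words` shows why the German character class matters: the ASCII-only class splits `Köln` into `K` and `ln`.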

In [None]:
# Exercise 2a: Pattern Recognition Practice

def find_phone_numbers(text):
    r"""
    Find German phone numbers in text.
    
    Your task: Write a regex pattern to find phone numbers
    
    German phone formats to match:
    - 030-12345678 (area code with dash)
    - 030 12345678 (area code with space)  
    - +49 30 12345678 (international format)
    - (030) 12345678 (area code in parentheses)
    
    Hints:
    - \d matches digits
    - {n} matches exactly n repetitions
    - {n,m} matches n to m repetitions
    - [-\s] matches dash or space
    - \+ matches literal plus sign
    - \( and \) match literal parentheses (escape them outside a character class)
    - Use | for alternatives: (pattern1|pattern2)
    
    Args:
        text (str): Text to search in
        
    Returns:
        list: List of found phone numbers
    """
    
    # TODO: Write your regex pattern here
    # Hint: Start simple with one format, then expand
    phone_pattern = r''  # Your pattern goes here
    
    # TODO: Use re.findall to find all matches
    # phones = re.findall(phone_pattern, text)
    # return phones
    
    pass  # Remove this when you implement

def find_email_addresses(text):
    r"""
    Find email addresses in text.
    
    Your task: Write a regex to match email addresses
    
    Email format: username@domain.extension
    - Username: letters, numbers, dots, underscores, hyphens
    - Domain: letters, numbers, dots, hyphens
    - Extension: 2-4 letters
    
    Hints:
    - [A-Za-z0-9._-] matches valid username characters
    - + means one or more
    - @ matches literal @ symbol
    - \. matches literal dot (. is special in regex)
    - {2,4} matches 2 to 4 repetitions
    
    Args:
        text (str): Text to search in
        
    Returns:
        list: List of found email addresses
    """
    
    # TODO: Write your email pattern
    email_pattern = r''  # Your pattern goes here
    
    # TODO: Find and return emails
    pass

# Test data for your functions
test_text = """
Kontakt: Max Mustermann
Telefon: 030-12345678 oder (030) 87654321
Mobil: +49 175 1234567
E-Mail: max.mustermann@uni-berlin.de
Backup: support@example.com
"""

# Test your functions (uncomment after implementing)
# phones = find_phone_numbers(test_text)
# emails = find_email_addresses(test_text)
# print("Found phones:", phones)
# print("Found emails:", emails)

## Exercise 3: Text Validation and Advanced Cleaning  

**Goal**: Use regex for validation and text replacement operations.

**Your Tasks**: Build validators for German data formats and text cleaners.

### Key Regex Functions You'll Use:
- `re.search()`: Find first match
- `re.match()`: Match at beginning of string  
- `re.findall()`: Find all matches
- `re.sub()`: Replace matches
- `re.fullmatch()`: Check if entire string matches pattern
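
For validation, the difference between `re.search()`, `re.match()`, and `re.fullmatch()` matters most. The sketch below applies all three to the same date pattern; the candidate strings are illustrative assumptions.

```python
import re

date_pattern = r"\d{2}\.\d{2}\.\d{4}"
candidates = ["15.03.2025", "15.03.2025 um 14:30", "am 15.03.2025"]

results = {}
for candidate in candidates:
    results[candidate] = (
        re.search(date_pattern, candidate) is not None,     # match anywhere
        re.match(date_pattern, candidate) is not None,      # match at the start
        re.fullmatch(date_pattern, candidate) is not None,  # whole string must match
    )

for candidate, (anywhere, at_start, whole) in results.items():
    print(f"{candidate!r}: search={anywhere}, match={at_start}, fullmatch={whole}")
```

Only `re.fullmatch()` rejects strings with extra text before or after the date, which is usually what a validator should do.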

In [None]:
# Demonstrate different regex functions
sample_text = "Das Meeting ist am 15.03.2025 um 14:30 Uhr. Nächstes Meeting: 22.03.2025 um 10:00 Uhr."

# re.search() - Find first occurrence
date_pattern = r'\d{2}\.\d{2}\.\d{4}'
first_date = re.search(date_pattern, sample_text)
if first_date:
    print(f"Erstes Datum gefunden: {first_date.group()}")
    print(f"Position: {first_date.span()}")

# re.findall() - Find all occurrences
all_dates = re.findall(date_pattern, sample_text)
print(f"\nAlle Daten: {all_dates}")

# re.finditer() - Iterator over all matches
print("\nDetaillierte Informationen zu allen Daten:")
for match in re.finditer(date_pattern, sample_text):
    print(f"  Datum: {match.group()}, Position: {match.span()}")

# re.sub() - Replace text
censored = re.sub(date_pattern, '[DATUM ENTFERNT]', sample_text)
print(f"\nZensierter Text: {censored}")

# re.split() - Split by pattern
sentences = re.split(r'\. ', sample_text)
print(f"\nSätze: {sentences}")

## Part 2: Text Preprocessing and Cleaning

In [None]:
# Sample German text with various issues
raw_text = """
    Das ist ein   Beispieltext!!!  
    Er enthält GROSSBUCHSTABEN, Zahlen wie 123, E-Mails wie test@example.com,
    URLs wie https://www.example.com und Sonderzeichen: @#$%!
    
    Es gibt auch    mehrfache    Leerzeichen und
    Zeilenumbrüche.
    
    Deutsche Umlaute: äöüÄÖÜß sind wichtig!
"""

print("Original Text:")
print(raw_text)

In [None]:
def clean_text(text, 
               lowercase=True, 
               remove_urls=True, 
               remove_emails=True,
               remove_numbers=False,
               remove_punctuation=False,
               remove_extra_whitespace=True):
    """
    Comprehensive text cleaning function.
    
    Args:
        text (str): Input text to clean
        lowercase (bool): Convert to lowercase
        remove_urls (bool): Remove URLs
        remove_emails (bool): Remove email addresses
        remove_numbers (bool): Remove numbers
        remove_punctuation (bool): Remove punctuation
        remove_extra_whitespace (bool): Remove extra whitespace
    
    Returns:
        str: Cleaned text
    """
    # Remove URLs
    if remove_urls:
        text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    # Remove email addresses
    if remove_emails:
        text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text)
    
    # Remove numbers
    if remove_numbers:
        text = re.sub(r'\d+', '', text)
    
    # Remove punctuation (but keep German umlauts)
    if remove_punctuation:
        text = re.sub(r'[^\w\s\u00C0-\u017F]', '', text)
    
    # Convert to lowercase
    if lowercase:
        text = text.lower()
    
    # Remove extra whitespace
    if remove_extra_whitespace:
        text = re.sub(r'\s+', ' ', text)
        text = text.strip()
    
    return text

# Test cleaning function
cleaned = clean_text(raw_text)
print("Gereinigter Text:")
print(cleaned)

print("\n" + "="*50)
print("Mit verschiedenen Optionen:")
print("\nOhne Zahlen:")
print(clean_text(raw_text, remove_numbers=True))

print("\nOhne Satzzeichen:")
print(clean_text(raw_text, remove_punctuation=True))
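
The order of the cleaning steps matters: URL removal relies on `://` and dots, so it has to run before punctuation is stripped. The sketch below shows this with three small helper functions written only for this illustration (they are simplified stand-ins, not the `clean_text` function itself).

```python
import re

text = "Mehr unter https://www.example.com und per Mail an info@example.com!"

def remove_urls(t):
    return re.sub(r"https?://\S+|www\.\S+", "", t)

def remove_punctuation(t):
    return re.sub(r"[^\w\s]", "", t)

def squeeze(t):
    return re.sub(r"\s+", " ", t).strip()

# URLs first: the URL is still recognizable and disappears completely
urls_first = squeeze(remove_punctuation(remove_urls(text)))

# Punctuation first: "://" and "." are gone, so the URL pattern no longer matches
punctuation_first = squeeze(remove_urls(remove_punctuation(text)))

print(urls_first)
print(punctuation_first)
```

In the second variant the URL survives as the fragment `httpswwwexamplecom`, which is why `clean_text` removes URLs and emails before anything else.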

## Part 3: Pattern Extraction and Information Retrieval

In [None]:
# Sample text with various entities
sample_document = """
Prof. Dr. Müller lehrt an der Akademie für Wissenschaften. 
Sie können ihn unter mueller@akademie-wissen.de oder +49-30-1234-56789 erreichen.
Die Vorlesung findet am 15.03.2025 um 14:00 Uhr in Raum B456 statt.
Die Teilnahmegebühr beträgt 150,00 EUR. 
Weitere Informationen finden Sie unter https://www.akademie-wissen.de/vorlesungen.
Anmeldeschluss ist der 01.03.2025.
"""

def extract_entities(text):
    """
    Extract various entities from text using regex.
    
    Args:
        text (str): Input text
    
    Returns:
        dict: Extracted entities
    """
    entities = {}
    
    # Extract email addresses
    entities['emails'] = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
    
    # Extract phone numbers (German format)
    entities['phones'] = re.findall(r'\+\d{2}-\d{2,3}-\d{3,4}-\d{4,5}', text)
    
    # Extract dates (German format: DD.MM.YYYY)
    entities['dates'] = re.findall(r'\d{2}\.\d{2}\.\d{4}', text)
    
    # Extract times (HH:MM format)
    entities['times'] = re.findall(r'\d{1,2}:\d{2}', text)
    
    # Extract URLs (strip sentence punctuation that directly follows a URL)
    entities['urls'] = [url.rstrip('.,;:!?') for url in re.findall(r'https?://[^\s]+', text)]
    
    # Extract room numbers (pattern: A123, B456, etc.)
    entities['rooms'] = re.findall(r'\b[A-Z]\d{3}\b', text)
    
    # Extract prices (EUR format)
    entities['prices'] = re.findall(r'\d+,\d{2}\s*EUR', text)
    
    # Extract titles (Prof., Dr., etc.)
    entities['titles'] = re.findall(r'\b(Prof\.|Dr\.|Dipl\.-Ing\.)\s+', text)
    
    return entities

# Extract entities
extracted = extract_entities(sample_document)

print("Extrahierte Entitäten:")
print("="*50)
for entity_type, values in extracted.items():
    if values:
        print(f"\n{entity_type.upper()}:")
        for value in values:
            print(f"  - {value}")
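
One detail of `re.findall()` is easy to miss in the `titles` pattern: as soon as a pattern contains a capturing group, `re.findall()` returns the group contents instead of the whole match. The sketch below compares the variants on made-up sample strings.

```python
import re

text = "Prof. Dr. Müller und Dipl.-Ing. Weber"

# One capturing group: findall returns only what the group matched
with_group = re.findall(r"\b(Prof\.|Dr\.|Dipl\.-Ing\.)\s+", text)

# Non-capturing group (?:...): findall returns the whole match,
# including the trailing whitespace consumed by \s+
without_group = re.findall(r"\b(?:Prof\.|Dr\.|Dipl\.-Ing\.)\s+", text)

# Two capturing groups: findall returns one tuple per match
two_groups = re.findall(r"(\d{2})\.(\d{2})\.\d{4}", "15.03.2025, 01.03.2025")

print(with_group)
print(without_group)
print(two_groups)
```

When a group is needed only for alternation with `|`, writing it as `(?:...)` keeps the result a list of complete matches.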

## Part 4: Text Tokenization

Tokenization is the process of breaking text into smaller units (tokens) such as words, sentences, or subwords.

In [None]:
def tokenize_words(text):
    """
    Simple word tokenization using regex.
    
    Args:
        text (str): Input text
    
    Returns:
        list: List of word tokens
    """
    # Match word characters including German umlauts
    tokens = re.findall(r'\b[\w\u00C0-\u017F]+\b', text.lower())
    return tokens

def tokenize_sentences(text):
    """
    Simple sentence tokenization using regex.
    
    Args:
        text (str): Input text
    
    Returns:
        list: List of sentence tokens
    """
    # Split on sentence-ending punctuation
    sentences = re.split(r'[.!?]+', text)
    # Clean up and remove empty strings
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences

# Test tokenization
sample = "Das ist ein Satz! Und hier ist noch einer. Was für ein schöner Tag?"

print("Original Text:")
print(sample)

print("\nWort-Tokenisierung:")
word_tokens = tokenize_words(sample)
print(word_tokens)
print(f"Anzahl Tokens: {len(word_tokens)}")

print("\nSatz-Tokenisierung:")
sentence_tokens = tokenize_sentences(sample)
for i, sent in enumerate(sentence_tokens, 1):
    print(f"  {i}. {sent}")
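
Splitting on every `.`, `!`, or `?` works for the sample above but breaks on abbreviations and dates. The following sketch shows the problem and one possible refinement. The `ABBREVIATIONS` tuple and the `split_sentences` helper are hypothetical names introduced here, and the abbreviation list is far from complete.

```python
import re

text = "Prof. Dr. Müller kommt am 15.03.2025. Die Vorlesung beginnt um 14:00 Uhr!"

# Naive approach: every run of . ! ? ends a sentence
naive = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

# Illustrative abbreviation list (assumption, not exhaustive)
ABBREVIATIONS = ("Prof.", "Dr.", "Dipl.-Ing.")

def split_sentences(text):
    """Split at whitespace that follows . ! ? and precedes a capital letter,
    then re-join pieces that ended in a known abbreviation."""
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ])", text.strip())
    sentences = []
    for piece in pieces:
        if sentences and sentences[-1].endswith(ABBREVIATIONS):
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

refined = split_sentences(text)

print(naive)
print(refined)
```

The refined version keeps the date and the titles intact, but it still depends on a hand-maintained list, which is one reason dedicated tokenizers exist.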

## Part 5: Text Statistics and Analysis

In [None]:
from collections import Counter

def analyze_text(text):
    """
    Perform basic statistical analysis on text.
    
    Args:
        text (str): Input text
    
    Returns:
        dict: Text statistics
    """
    stats = {}
    
    # Basic counts
    stats['total_chars'] = len(text)
    stats['total_chars_no_spaces'] = len(re.sub(r'\s', '', text))
    
    # Word statistics
    words = tokenize_words(text)
    stats['total_words'] = len(words)
    stats['unique_words'] = len(set(words))
    stats['avg_word_length'] = sum(len(word) for word in words) / len(words) if words else 0
    
    # Sentence statistics
    sentences = tokenize_sentences(text)
    stats['total_sentences'] = len(sentences)
    stats['avg_words_per_sentence'] = len(words) / len(sentences) if sentences else 0
    
    # Most common words
    word_freq = Counter(words)
    stats['most_common_words'] = word_freq.most_common(10)
    
    # Count digits
    stats['digit_count'] = len(re.findall(r'\d', text))
    
    # Count uppercase letters (including German umlauts)
    stats['uppercase_count'] = len(re.findall(r'[A-ZÄÖÜ]', text))
    
    return stats

# Analyze sample text
analysis_text = """
Natural Language Processing (NLP) ist ein spannendes Forschungsgebiet der Informatik.
Es kombiniert Linguistik, maschinelles Lernen und künstliche Intelligenz.
Mit NLP können Computer menschliche Sprache verstehen und verarbeiten.
Anwendungen umfassen Chatbots, maschinelle Übersetzung und Sentimentanalyse.
Die Entwicklung von NLP hat in den letzten Jahren enorme Fortschritte gemacht.
"""

stats = analyze_text(analysis_text)

print("Text-Analyse:")
print("="*50)
print(f"Gesamtzeichen: {stats['total_chars']}")
print(f"Zeichen ohne Leerzeichen: {stats['total_chars_no_spaces']}")
print(f"\nGesamtwörter: {stats['total_words']}")
print(f"Eindeutige Wörter: {stats['unique_words']}")
print(f"Durchschnittliche Wortlänge: {stats['avg_word_length']:.2f}")
print(f"\nGesamtsätze: {stats['total_sentences']}")
print(f"Durchschnittliche Wörter pro Satz: {stats['avg_words_per_sentence']:.2f}")
print(f"\nZiffern: {stats['digit_count']}")
print(f"Großbuchstaben: {stats['uppercase_count']}")
print(f"\nHäufigste Wörter:")
for word, count in stats['most_common_words']:
    print(f"  {word}: {count}")

## Part 6: Advanced Pattern Matching Examples

In [None]:
# German-specific patterns
def validate_german_patterns(text):
    """
    Validate various German text patterns.
    """
    patterns = {
        'postal_code': r'\b\d{5}\b',  # German postal codes
        'iban': r'\bDE\d{20}\b',  # German IBAN (simplified)
        'license_plate': r'\b[A-ZÄÖÜ]{1,3}-[A-ZÄÖÜ]{1,2}\s?\d{1,4}\b',  # German license plates
        'academic_titles': r'\b(Prof\.|Dr\.|Dipl\.-Ing\.)\s+',
    }
    
    results = {}
    for pattern_name, pattern in patterns.items():
        matches = re.findall(pattern, text)
        if matches:
            results[pattern_name] = matches
    
    return results

# Test German patterns
german_text = """
Prof. Dr. Müller wohnt in 10115 Berlin.
Seine IBAN lautet DE89370400440532013000.
Das Auto mit dem Kennzeichen B-MW 1234 gehört ihm.
Dr. Schmidt hat einen M.Sc. in Informatik.
"""

print("Deutsche Muster-Erkennung:")
print("="*50)
found_patterns = validate_german_patterns(german_text)
for pattern_type, matches in found_patterns.items():
    print(f"\n{pattern_type.upper()}:")
    for match in matches:
        print(f"  - {match}")
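
The `iban` pattern above is marked as simplified because it checks only the shape (`DE` plus 20 digits). A stricter check can add the mod-97 checksum that IBANs carry in their two check digits. The sketch below is one way to do this; the function name `is_valid_german_iban` is an assumption made for this example.

```python
import re

def is_valid_german_iban(iban):
    """Check the format and the mod-97 checksum of a German IBAN."""
    compact = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"DE\d{20}", compact):
        return False
    # Move country code and check digits to the end,
    # then map letters to numbers (A=10 ... Z=35) via base-36 conversion
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1

print(is_valid_german_iban("DE89370400440532013000"))
print(is_valid_german_iban("DE89 3704 0044 0532 0130 00"))
print(is_valid_german_iban("DE89370400440532013001"))
```

The regex handles the format, and plain arithmetic handles the checksum. This split is typical: regex is good at shape, not at computed constraints.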

## Part 7: Building a Text Processing Pipeline

In [None]:
class TextProcessor:
    """
    A comprehensive text processing pipeline.
    """
    
    def __init__(self):
        self.processing_steps = []
    
    def add_step(self, step_name, step_function):
        """Add a processing step to the pipeline."""
        self.processing_steps.append((step_name, step_function))
        return self
    
    def process(self, text, verbose=True):
        """Process text through all steps in the pipeline."""
        if verbose:
            print("Text Processing Pipeline")
            print("="*50)
            print(f"Original text length: {len(text)} characters\n")
        
        current_text = text
        
        for step_name, step_function in self.processing_steps:
            current_text = step_function(current_text)
            if verbose:
                print(f"After {step_name}:")
                print(f"  Length: {len(current_text)} characters")
                print(f"  Preview: {current_text[:100]}...\n")
        
        return current_text

# Define processing functions
def remove_urls(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def remove_emails(text):
    return re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text)

def normalize_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

def to_lowercase(text):
    return text.lower()

def remove_special_chars(text):
    return re.sub(r'[^\w\s\u00C0-\u017F]', '', text)

# Create and configure pipeline
pipeline = TextProcessor()
pipeline.add_step("URL Removal", remove_urls)
pipeline.add_step("Email Removal", remove_emails)
pipeline.add_step("Lowercase Conversion", to_lowercase)
pipeline.add_step("Special Character Removal", remove_special_chars)
pipeline.add_step("Whitespace Normalization", normalize_whitespace)

# Test pipeline
test_text = """
    Besuchen Sie unsere Website unter https://www.example.com!
    Kontaktieren Sie uns: info@example.com
    WICHTIGE INFORMATIONEN über NLP und maschinelles Lernen!!!
    Es gibt    viele    Leerzeichen    hier.
"""

processed_text = pipeline.process(test_text)

print("\nFinal Result:")
print("="*50)
print(processed_text)

## Part 8: Practical Exercise - Document Analyzer

In [None]:
def analyze_document(text):
    """
    Comprehensive document analysis combining all techniques.
    
    Args:
        text (str): Input document
    
    Returns:
        dict: Analysis results
    """
    results = {}
    
    # Extract entities
    results['entities'] = extract_entities(text)
    
    # Text statistics
    results['statistics'] = analyze_text(text)
    
    # Pattern validation
    results['german_patterns'] = validate_german_patterns(text)
    
    return results

# Sample document for analysis
document = """
Sehr geehrte Damen und Herren,

hiermit lade ich Sie zur Konferenz "NLP in der Praxis" ein.
Die Veranstaltung findet am 15.05.2025 um 09:00 Uhr statt.
Ort: Freie Universität Berlin, Raum A123, 14195 Berlin.

Keynote-Speaker: Prof. Dr. Anna Schmidt (anna.schmidt@fu-berlin.de)
Thema: "Moderne Ansätze im Natural Language Processing"

Anmeldung unter: https://www.nlp-konferenz.de
Rückfragen an: info@nlp-konferenz.de oder +49-30-838-12345

Teilnahmegebühr: 299,00 EUR
IBAN für Überweisung: DE89370400440532013000

Mit freundlichen Grüßen,
Dr. Max Müller
"""

print("Dokumenten-Analyse:")
print("="*70)
print("\nOriginaldokument:")
print(document)
print("\n" + "="*70)

analysis = analyze_document(document)

# Display results
print("\n📧 EXTRAHIERTE ENTITÄTEN:")
print("-"*70)
for entity_type, values in analysis['entities'].items():
    if values:
        print(f"\n{entity_type.upper()}:")
        for value in values:
            print(f"  ✓ {value}")

print("\n\n📊 STATISTISCHE ANALYSE:")
print("-"*70)
stats = analysis['statistics']
print(f"Wörter: {stats['total_words']} (davon {stats['unique_words']} eindeutig)")
print(f"Sätze: {stats['total_sentences']}")
print(f"Zeichen: {stats['total_chars']}")
print(f"Durchschnittliche Wortlänge: {stats['avg_word_length']:.2f}")
print(f"Durchschnittliche Wörter pro Satz: {stats['avg_words_per_sentence']:.2f}")

print("\n\n🇩🇪 DEUTSCHE MUSTER:")
print("-"*70)
for pattern_type, matches in analysis['german_patterns'].items():
    print(f"\n{pattern_type.upper()}:")
    for match in matches:
        print(f"  ✓ {match}")

## Exercise Tasks

Complete the following tasks to practice your regex and text processing skills:

1. **Pattern Creation**:
   - Create a regex pattern to extract German street addresses
   - Write a pattern to find all capitalized words (potential proper nouns)
   - Develop a pattern for German phone numbers in various formats

2. **Text Cleaning**:
   - Build a function to remove HTML tags from text
   - Create a function to normalize different quotation marks
   - Implement a function to expand common German abbreviations

3. **Information Extraction**:
   - Extract all monetary amounts from a text (EUR, $, etc.)
   - Find and categorize all numbers (integers, floats, percentages)
   - Extract compound words (typical in German)

4. **Text Validation**:
   - Validate German postal codes
   - Check if text contains proper sentence structure
   - Identify potential spelling errors using pattern matching

5. **Advanced Pipeline**:
   - Create a pipeline that anonymizes personal information
   - Build a text normalizer for social media content
   - Develop a preprocessing pipeline for sentiment analysis
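
As a starting point for a few of these tasks, here is a minimal sketch of three small helpers. The function names are hypothetical, and each helper covers only the simplest case: the postal code check tests the format only, and the tag remover is not a full HTML parser.

```python
import re

def is_postal_code_format(value):
    """True if the value consists of exactly five digits (format check only)."""
    return re.fullmatch(r"\d{5}", value) is not None

def strip_html_tags(text):
    """Remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def normalize_quotes(text):
    """Map typographic double quotes to a plain ASCII double quote."""
    return re.sub(r"[„“”«»]", '"', text)

print(is_postal_code_format("10115"))
print(is_postal_code_format("10115 Berlin"))
print(strip_html_tags("<p>Hallo <b>Welt</b></p>"))
print(normalize_quotes("„NLP in der Praxis“"))
```

`re.fullmatch()` is the natural choice for validation because it rejects input with extra characters before or after the expected pattern.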

## Reflection Questions

1. When should you use regex vs. specialized NLP libraries?
2. What are the limitations of regex for text processing?
3. How can regex patterns be optimized for performance?
4. Why is text preprocessing important for NLP tasks?
5. What challenges are specific to German text processing?
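
As a hint for question 3, one common technique is to compile a pattern once with `re.compile()` and reuse the resulting object. The sketch below uses made-up sample lines. Python's `re` module also caches recently used patterns internally, so the speed gain is often modest and the main benefit is readability.

```python
import re

# Compile once at module level; reuse the pattern object inside loops
DATE_RE = re.compile(r"\d{2}\.\d{2}\.\d{4}")

lines = [
    "Abgabe am 15.03.2025",
    "Kein Datum hier",
    "Von 01.03.2025 bis 22.03.2025",
]

dates_per_line = [DATE_RE.findall(line) for line in lines]
print(dates_per_line)
```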

## Next Steps

- Proceed to **Notebook 01**: Introduction to NLP and Text Processing
- Explore advanced tokenization with NLTK and spaCy
- Learn about stemming and lemmatization
- Study language-specific text processing challenges

## Exercise 1: Basic Text Cleaning

In [None]:
# Exercise 1a: Basic String Cleaning
def clean_basic_text(text):
    r"""
    Clean text using basic string methods.
    
    Your task: Implement the following cleaning steps:
    1. Remove leading/trailing whitespace
    2. Replace multiple spaces with single spaces
    3. Convert to lowercase
    4. Remove common punctuation (but keep German umlauts!)
    
    Hints:
    - Use .strip() to remove leading/trailing whitespace
    - Use re.sub(r'\s+', ' ', text) to replace multiple spaces
    - Use .lower() for lowercase conversion
    - For punctuation, use string.punctuation but be careful with German characters
    - You can use .translate() with str.maketrans() for punctuation removal
    
    Args:
        text (str): Input text to clean
    
    Returns:
        str: Cleaned text
    """
    
    # TODO: Implement your cleaning steps here
    # Step 1: Remove leading/trailing whitespace
    text = text.strip()

    # Step 2: Replace multiple whitespaces with single space
    text = re.sub(r'\s+', ' ', text)

    # Step 3: Convert to lowercase
    text = text.lower()

    # Step 4: Remove punctuation (keep German umlauts)
    # Create a translation table that maps each punctuation character to None
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)

    return text  # Return the cleaned text

# Test your function (uncomment after implementing)
cleaned_basic = clean_basic_text(sample_german_text)
print("Basic cleaning result:")
print(repr(cleaned_basic))

## Exercise 2: Introduction to Regular Expressions

In [None]:
# Exercise 2a: Pattern Recognition Practice

def find_phone_numbers(text):
    r"""
    Find German phone numbers in text.
    
    Your task: Write a regex pattern to find phone numbers
    
    German phone formats to match:
    - 030-12345678 (area code with dash)
    - 030 12345678 (area code with space)  
    - +49 30 12345678 (international format)
    - (030) 12345678 (area code in parentheses)
    
    Hints:
    - \d matches digits
    - {n} matches exactly n repetitions
    - {n,m} matches n to m repetitions
    - [-\s] matches dash or space
    - \+ matches literal plus sign
    - \( and \) match literal parentheses (escape them outside a character class)
    - Use | for alternatives: (pattern1|pattern2)
    
    Args:
        text (str): Text to search in
        
    Returns:
        list: List of found phone numbers
    """
    
    # Non-capturing group (?:...) so that re.findall returns whole matches, not group tuples.
    # Prefix: +49 plus area code, area code in parentheses, or area code with leading 0.
    phone_pattern = r'(?<!\d)(?:\+49[-\s]?\d{2,4}|\(0\d{2,4}\)|0\d{2,4})[-\s]?\d{6,8}(?!\d)'

    # Find all matches
    phones = re.findall(phone_pattern, text)
    return phones

def find_email_addresses(text):
    r"""
    Find email addresses in text.
    
    Your task: Write a regex to match email addresses
    
    Email format: username@domain.extension
    - Username: letters, numbers, dots, underscores, hyphens
    - Domain: letters, numbers, dots, hyphens
    - Extension: 2-4 letters
    
    Hints:
    - [A-Za-z0-9._-] matches valid username characters
    - + means one or more
    - @ matches literal @ symbol
    - \. matches literal dot (. is special in regex)
    - {2,4} matches 2 to 4 repetitions
    
    Args:
        text (str): Text to search in
        
    Returns:
        list: List of found email addresses
    """
    
    # TODO: Write your email pattern
    email_pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}'  # Your pattern goes here
    
    # TODO: Find and return emails
    emails = re.findall(email_pattern, text)
    return emails

# Test data for your functions
test_text = """
Kontakt: Max Mustermann
Telefon: 030-12345678 oder (030) 87654321
Mobil: +49 175 1234567
E-Mail: max.mustermann@uni-berlin.de
Backup: support@example.com
"""

# Test your functions (uncomment after implementing)
phones = find_phone_numbers(test_text)
emails = find_email_addresses(test_text)
print("Found phones:", phones)
print("Found emails:", emails)

### Regex Functions in Python

Python's `re` module provides several key functions:
- `re.search()`: Find first match
- `re.match()`: Match at beginning of string
- `re.findall()`: Find all matches
- `re.finditer()`: Find all matches (returns iterator)
- `re.sub()`: Replace matches
- `re.split()`: Split string by pattern

In [None]:
# Demonstrate different regex functions
sample_text = "Das Meeting ist am 15.03.2025 um 14:30 Uhr. Nächstes Meeting: 22.03.2025 um 10:00 Uhr."

# re.search() - Find first occurrence
date_pattern = r'\d{2}\.\d{2}\.\d{4}'
first_date = re.search(date_pattern, sample_text)
if first_date:
    print(f"Erstes Datum gefunden: {first_date.group()}")
    print(f"Position: {first_date.span()}")

# re.findall() - Find all occurrences
all_dates = re.findall(date_pattern, sample_text)
print(f"\nAlle Daten: {all_dates}")

# re.finditer() - Iterator over all matches
print("\nDetaillierte Informationen zu allen Daten:")
for match in re.finditer(date_pattern, sample_text):
    print(f"  Datum: {match.group()}, Position: {match.span()}")

# re.sub() - Replace text
censored = re.sub(date_pattern, '[DATUM ENTFERNT]', sample_text)
print(f"\nZensierter Text: {censored}")

# re.split() - Split by pattern
sentences = re.split(r'\. ', sample_text)
print(f"\nSätze: {sentences}")

## Part 2: Text Preprocessing and Cleaning

In [None]:
# Sample German text with various issues
raw_text = """
    Das ist ein   Beispieltext!!!  
    Er enth√§lt GROSSBUCHSTABEN, Zahlen wie 123, E-Mails wie test@example.com,
    URLs wie https://www.example.com und Sonderzeichen: @#$%!
    
    Es gibt auch    mehrfache    Leerzeichen und
    Zeilenumbr√ºche.
    
    Deutsche Umlaute: √§√∂√º√Ñ√ñ√ú√ü sind wichtig!
"""

print("Original Text:")
print(raw_text)

In [None]:
def clean_text(text, 
               lowercase=True, 
               remove_urls=True, 
               remove_emails=True,
               remove_numbers=False,
               remove_punctuation=False,
               remove_extra_whitespace=True):
    """
    Comprehensive text cleaning function.
    
    Args:
        text (str): Input text to clean
        lowercase (bool): Convert to lowercase
        remove_urls (bool): Remove URLs
        remove_emails (bool): Remove email addresses
        remove_numbers (bool): Remove numbers
        remove_punctuation (bool): Remove punctuation
        remove_extra_whitespace (bool): Remove extra whitespace
    
    Returns:
        str: Cleaned text
    """
    # Remove URLs
    if remove_urls:
        text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    # Remove email addresses
    if remove_emails:
        text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', '', text)
    
    # Remove numbers
    if remove_numbers:
        text = re.sub(r'\d+', '', text)
    
    # Remove punctuation (but keep German umlauts)
    if remove_punctuation:
        text = re.sub(r'[^\w\s\u00C0-\u017F]', '', text)
    
    # Convert to lowercase
    if lowercase:
        text = text.lower()
    
    # Remove extra whitespace
    if remove_extra_whitespace:
        text = re.sub(r'\s+', ' ', text)
        text = text.strip()
    
    return text

# Test cleaning function
cleaned = clean_text(raw_text)
print("Cleaned text:")
print(cleaned)

print("\n" + "="*50)
print("With different options:")
print("\nWithout numbers:")
print(clean_text(raw_text, remove_numbers=True))

print("\nWithout punctuation:")
print(clean_text(raw_text, remove_punctuation=True))
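
One cleaning step the function above does not cover is Unicode normalization. An umlaut can be stored either as one code point or as a base letter plus a combining mark, and the two forms do not compare equal. The sketch below shows one way to handle this with the standard-library `unicodedata` module; `normalize_unicode` and the sample strings are illustrative names, not part of the notebook.

```python
import re
import unicodedata

composed = "M\u00fcller"      # "ü" as a single code point (U+00FC)
decomposed = "Mu\u0308ller"   # "u" followed by a combining diaeresis (U+0308)

def normalize_unicode(text):
    """Return text in NFC form so each umlaut is a single code point."""
    return unicodedata.normalize("NFC", text)

# The two spellings look identical on screen but are different strings
print(composed == decomposed)                     # False
print(normalize_unicode(decomposed) == composed)  # True

# A character class with literal umlauts only matches the composed form
tokens_raw = re.findall(r'[a-zäöüß]+', decomposed.lower())
tokens_nfc = re.findall(r'[a-zäöüß]+', normalize_unicode(decomposed).lower())
print(tokens_raw)  # the word is split at the combining mark
print(tokens_nfc)  # the word stays in one piece
```

Running normalization as the first step keeps later patterns such as `[a-zäöüß]` reliable for text copied from PDFs or macOS file names, where the decomposed form is common.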

## Part 3: Pattern Extraction and Information Retrieval

In [None]:
# Sample text with various entities
sample_document = """
Prof. Dr. Müller lehrt an der Akademie für Wissenschaften.
Sie können ihn unter mueller@akademie-wissen.de oder +49-30-1234-56789 erreichen.
Die Vorlesung findet am 15.03.2025 um 14:00 Uhr in Raum B456 statt.
Die Teilnahmegebühr beträgt 150,00 EUR.
Weitere Informationen finden Sie unter https://www.akademie-wissen.de/vorlesungen.
Anmeldeschluss ist der 01.03.2025.
"""

def extract_entities(text):
    """
    Extract various entities from text using regex.
    
    Args:
        text (str): Input text
    
    Returns:
        dict: Extracted entities
    """
    entities = {}
    
    # Extract email addresses (the TLD class is [A-Za-z]; a '|' inside [...] would be a literal pipe)
    entities['emails'] = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
    
    # Extract phone numbers (German format)
    entities['phones'] = re.findall(r'\+\d{2}-\d{2,3}-\d{3,4}-\d{4,5}', text)
    
    # Extract dates (German format: DD.MM.YYYY)
    entities['dates'] = re.findall(r'\d{2}\.\d{2}\.\d{4}', text)
    
    # Extract times (HH:MM format)
    entities['times'] = re.findall(r'\d{1,2}:\d{2}', text)
    
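    # Extract URLs and strip sentence punctuation that trails the URL (e.g. a final period)
    entities['urls'] = [url.rstrip('.,;:!?') for url in re.findall(r'https?://\S+', text)]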
    
    # Extract room numbers (pattern: A123, B456, etc.)
    entities['rooms'] = re.findall(r'\b[A-Z]\d{3}\b', text)
    
    # Extract prices (EUR format)
    entities['prices'] = re.findall(r'\d+,\d{2}\s*EUR', text)
    
    # Extract titles (Prof., Dr., etc.)
    entities['titles'] = re.findall(r'\b(Prof\.|Dr\.|Dipl\.-Ing\.)\s+', text)
    
    return entities

# Extract entities
extracted = extract_entities(sample_document)

print("Extracted entities:")
print("="*50)
for entity_type, values in extracted.items():
    if values:
        print(f"\n{entity_type.upper()}:")
        for value in values:
            print(f"  - {value}")

## Part 4: Text Tokenization

Tokenization is the process of breaking text into smaller units (tokens) such as words, sentences, or subwords.

In [None]:
def tokenize_words(text):
    """
    Simple word tokenization using regex.
    
    Args:
        text (str): Input text
    
    Returns:
        list: List of word tokens
    """
    # Match word characters including German umlauts
    tokens = re.findall(r'\b[\w\u00C0-\u017F]+\b', text.lower())
    return tokens

def tokenize_sentences(text):
    """
    Simple sentence tokenization using regex.
    
    Args:
        text (str): Input text
    
    Returns:
        list: List of sentence tokens
    """
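    # Split on runs of sentence-ending punctuation. This naive rule also splits
    # inside abbreviations ("Dr."), dates ("15.03.2025"), and URLs.
    sentences = re.split(r'[.!?]+', text)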
    # Clean up and remove empty strings
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences

# Test tokenization
sample = "Das ist ein Satz! Und hier ist noch einer. Was für ein schöner Tag?"

print("Original Text:")
print(sample)

print("\nWord tokenization:")
word_tokens = tokenize_words(sample)
print(word_tokens)
print(f"Number of tokens: {len(word_tokens)}")

print("\nSentence tokenization:")
sentence_tokens = tokenize_sentences(sample)
for i, sent in enumerate(sentence_tokens, 1):
    print(f"  {i}. {sent}")
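
Splitting on every period breaks text that contains abbreviations or dates. The sketch below is one possible refinement, not a full sentence tokenizer: it splits only where punctuation is followed by whitespace, then merges pieces that end in a known abbreviation. The `ABBREVIATIONS` set and the name `split_sentences` are illustrative assumptions.

```python
import re

# Illustrative list only; real text needs a much longer one
ABBREVIATIONS = {"prof.", "dr.", "usw.", "bzw.", "ca."}

def split_sentences(text):
    """Split text into sentences, keeping abbreviations and dates intact."""
    # Split after ., ! or ? only when whitespace follows (so 15.03.2025 survives)
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = []
    buffer = ""
    for part in parts:
        if not part:
            continue
        buffer = f"{buffer} {part}".strip()
        last_word = buffer.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep collecting
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

demo = "Prof. Dr. Müller lehrt in Berlin. Die Vorlesung findet am 15.03.2025 statt! Kommen Sie?"
demo_sentences = split_sentences(demo)
for number, sentence in enumerate(demo_sentences, 1):
    print(f"{number}. {sentence}")
```

Unlike `tokenize_sentences`, this version keeps the closing punctuation on each sentence. Libraries such as NLTK and spaCy handle many more edge cases.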

## Part 5: Text Statistics and Analysis

In [None]:
from collections import Counter

def analyze_text(text):
    """
    Perform basic statistical analysis on text.
    
    Args:
        text (str): Input text
    
    Returns:
        dict: Text statistics
    """
    stats = {}
    
    # Basic counts
    stats['total_chars'] = len(text)
    stats['total_chars_no_spaces'] = len(re.sub(r'\s', '', text))
    
    # Word statistics
    words = tokenize_words(text)
    stats['total_words'] = len(words)
    stats['unique_words'] = len(set(words))
    stats['avg_word_length'] = sum(len(word) for word in words) / len(words) if words else 0
    
    # Sentence statistics
    sentences = tokenize_sentences(text)
    stats['total_sentences'] = len(sentences)
    stats['avg_words_per_sentence'] = len(words) / len(sentences) if sentences else 0
    
    # Most common words
    word_freq = Counter(words)
    stats['most_common_words'] = word_freq.most_common(10)
    
    # Count digits
    stats['digit_count'] = len(re.findall(r'\d', text))
    
    # Count uppercase letters (including the German umlauts Ä, Ö, Ü)
    stats['uppercase_count'] = len(re.findall(r'[A-ZÄÖÜ]', text))
    
    return stats

# Analyze sample text
analysis_text = """
Natural Language Processing (NLP) ist ein spannendes Forschungsgebiet der Informatik.
Es kombiniert Linguistik, maschinelles Lernen und künstliche Intelligenz.
Mit NLP können Computer menschliche Sprache verstehen und verarbeiten.
Anwendungen umfassen Chatbots, maschinelle Übersetzung und Sentimentanalyse.
Die Entwicklung von NLP hat in den letzten Jahren enorme Fortschritte gemacht.
"""

stats = analyze_text(analysis_text)

print("Text analysis:")
print("="*50)
print(f"Total characters: {stats['total_chars']}")
print(f"Characters without whitespace: {stats['total_chars_no_spaces']}")
print(f"\nTotal words: {stats['total_words']}")
print(f"Unique words: {stats['unique_words']}")
print(f"Average word length: {stats['avg_word_length']:.2f}")
print(f"\nTotal sentences: {stats['total_sentences']}")
print(f"Average words per sentence: {stats['avg_words_per_sentence']:.2f}")
print(f"\nDigits: {stats['digit_count']}")
print(f"Uppercase letters: {stats['uppercase_count']}")
print("\nMost common words:")
for word, count in stats['most_common_words']:
    print(f"  {word}: {count}")
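
Raw frequency lists are usually dominated by function words such as "und" or "die". The sketch below shows one way to filter them out and to compute the type-token ratio (unique words divided by total words). The `STOPWORDS` set is a tiny made-up sample, and `content_word_frequencies` is a hypothetical helper rather than part of the notebook.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real lists contain a few hundred entries
STOPWORDS = {"der", "die", "das", "und", "ist", "ein", "in", "von", "mit", "es"}

def content_word_frequencies(text, top_n=3):
    """Return the most common non-stopwords and the type-token ratio."""
    words = re.findall(r'\w+', text.lower())
    content_words = [word for word in words if word not in STOPWORDS]
    type_token_ratio = len(set(words)) / len(words) if words else 0.0
    return Counter(content_words).most_common(top_n), type_token_ratio

demo = "NLP ist spannend. NLP verbindet Linguistik und Informatik, und NLP ist praktisch."
top_words, ttr = content_word_frequencies(demo)
print(top_words)
print(f"Type-token ratio: {ttr:.2f}")
```

A ratio close to 1 means almost every word is used only once. Longer texts tend to have lower ratios, so the value is only comparable between texts of similar length.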

## Part 6: Advanced Pattern Matching Examples

In [None]:
# German-specific patterns
def validate_german_patterns(text):
    """
    Find common German patterns in text (format check only; for example,
    the IBAN checksum is not verified).
    """
    patterns = {
        'postal_code': r'\b\d{5}\b',  # German postal codes
        'iban': r'\bDE\d{20}\b',  # German IBAN (simplified)
        'license_plate': r'\b[A-ZÄÖÜ]{1,3}-[A-ZÄÖÜ]{1,2}\s?\d{1,4}\b',  # German license plates
        'academic_titles': r'\b(Prof\.|Dr\.|Dipl\.-Ing\.)\s+',
    }
    
    results = {}
    for pattern_name, pattern in patterns.items():
        matches = re.findall(pattern, text)
        if matches:
            results[pattern_name] = matches
    
    return results

# Test German patterns
german_text = """
Prof. Dr. Müller wohnt in 10115 Berlin.
Seine IBAN lautet DE89370400440532013000.
Das Auto mit dem Kennzeichen B-MW 1234 gehört ihm.
Dr. Schmidt hat einen M.Sc. in Informatik.
"""

print("German pattern detection:")
print("="*50)
found_patterns = validate_german_patterns(german_text)
for pattern_type, matches in found_patterns.items():
    print(f"\n{pattern_type.upper()}:")
    for match in matches:
        print(f"  - {match}")
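
The simplified IBAN pattern above only checks the shape `DE` plus 20 digits. An actual validity check also needs the mod-97 checksum defined for IBANs. The sketch below is one way to implement that check; `is_valid_iban` is a hypothetical helper, and it deliberately does not verify country-specific lengths.

```python
import re

def is_valid_iban(iban):
    """Check the general IBAN shape and the mod-97 checksum."""
    compact = re.sub(r'\s+', '', iban).upper()
    # Two letters, two check digits, then 11 to 30 alphanumeric characters
    if not re.fullmatch(r'[A-Z]{2}\d{2}[A-Z0-9]{11,30}', compact):
        return False
    # Move country code and check digits to the end
    rearranged = compact[4:] + compact[:4]
    # Replace letters by numbers: A=10, B=11, ..., Z=35 (digits stay as they are)
    digits = ''.join(str(int(char, 36)) for char in rearranged)
    return int(digits) % 97 == 1

print(is_valid_iban("DE89370400440532013000"))       # True
print(is_valid_iban("DE89 3704 0044 0532 0130 00"))  # True
print(is_valid_iban("DE89370400440532013001"))       # False
```

The regex acts as a cheap pre-filter and the arithmetic catches most single-digit typos, which is a common split between pattern matching and real validation.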

## Part 7: Building a Text Processing Pipeline

In [None]:
class TextProcessor:
    """
    A comprehensive text processing pipeline.
    """
    
    def __init__(self):
        self.processing_steps = []
    
    def add_step(self, step_name, step_function):
        """Add a processing step to the pipeline."""
        self.processing_steps.append((step_name, step_function))
        return self
    
    def process(self, text, verbose=True):
        """Process text through all steps in the pipeline."""
        if verbose:
            print("Text Processing Pipeline")
            print("="*50)
            print(f"Original text length: {len(text)} characters\n")
        
        current_text = text
        
        for step_name, step_function in self.processing_steps:
            current_text = step_function(current_text)
            if verbose:
                print(f"After {step_name}:")
                print(f"  Length: {len(current_text)} characters")
                print(f"  Preview: {current_text[:100]}...\n")
        
        return current_text

# Define processing functions
def remove_urls(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def remove_emails(text):
    return re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text)

def normalize_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

def to_lowercase(text):
    return text.lower()

def remove_special_chars(text):
    return re.sub(r'[^\w\s\u00C0-\u017F]', '', text)

# Create and configure pipeline
pipeline = TextProcessor()
pipeline.add_step("URL Removal", remove_urls)
pipeline.add_step("Email Removal", remove_emails)
pipeline.add_step("Lowercase Conversion", to_lowercase)
pipeline.add_step("Special Character Removal", remove_special_chars)
pipeline.add_step("Whitespace Normalization", normalize_whitespace)

# Test pipeline
test_text = """
    Besuchen Sie unsere Website unter https://www.example.com!
    Kontaktieren Sie uns: info@example.com
    WICHTIGE INFORMATIONEN über NLP und maschinelles Lernen!!!
    Es gibt    viele    Leerzeichen    hier.
"""

processed_text = pipeline.process(test_text)

print("\nFinal Result:")
print("="*50)
print(processed_text)
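
Conceptually, the `TextProcessor` class applies a list of functions from left to right. For cases where the verbose logging is not needed, the same idea can be sketched with `functools.reduce`; `compose`, `strip_urls`, and `collapse_whitespace` are illustrative names, not part of the notebook.

```python
import re
from functools import reduce

def compose(*steps):
    """Return a function that applies the given steps from left to right."""
    def run(text):
        return reduce(lambda current, step: step(current), steps, text)
    return run

def strip_urls(text):
    return re.sub(r'https?://\S+', '', text)

def collapse_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

# Order matters: whitespace is collapsed last to clean up gaps left by removals
clean = compose(strip_urls, str.lower, collapse_whitespace)
result = clean("  Besuchen Sie   https://www.example.com  HEUTE  ")
print(result)  # besuchen sie heute
```

The class-based pipeline is easier to inspect step by step, while the composed function is convenient to pass around, for example to `map` over a list of documents.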

## Part 8: Practical Exercise - Document Analyzer

In [None]:
def analyze_document(text):
    """
    Comprehensive document analysis combining all techniques.
    
    Args:
        text (str): Input document
    
    Returns:
        dict: Analysis results
    """
    results = {}
    
    # Extract entities
    results['entities'] = extract_entities(text)
    
    # Text statistics
    results['statistics'] = analyze_text(text)
    
    # Pattern validation
    results['german_patterns'] = validate_german_patterns(text)
    
    return results

# Sample document for analysis
document = """
Sehr geehrte Damen und Herren,

hiermit lade ich Sie zur Konferenz "NLP in der Praxis" ein.
Die Veranstaltung findet am 15.05.2025 um 09:00 Uhr statt.
Ort: Freie Universität Berlin, Raum A123, 14195 Berlin.

Keynote-Speaker: Prof. Dr. Anna Schmidt (anna.schmidt@fu-berlin.de)
Thema: "Moderne Ansätze im Natural Language Processing"

Anmeldung unter: https://www.nlp-konferenz.de
Rückfragen an: info@nlp-konferenz.de oder +49-30-838-12345

Teilnahmegebühr: 299,00 EUR
IBAN für Überweisung: DE89370400440532013000

Mit freundlichen Grüßen,
Dr. Max Müller
"""

print("Document analysis:")
print("="*70)
print("\nOriginal document:")
print(document)
print("\n" + "="*70)

analysis = analyze_document(document)

# Display results
print("\n📧 EXTRACTED ENTITIES:")
print("-"*70)
for entity_type, values in analysis['entities'].items():
    if values:
        print(f"\n{entity_type.upper()}:")
        for value in values:
            print(f"  ✓ {value}")

print("\n\n📊 STATISTICAL ANALYSIS:")
print("-"*70)
stats = analysis['statistics']
print(f"Words: {stats['total_words']} ({stats['unique_words']} unique)")
print(f"Sentences: {stats['total_sentences']}")
print(f"Characters: {stats['total_chars']}")
print(f"Average word length: {stats['avg_word_length']:.2f}")
print(f"Average words per sentence: {stats['avg_words_per_sentence']:.2f}")

print("\n\n🇩🇪 GERMAN PATTERNS:")
print("-"*70)
for pattern_type, matches in analysis['german_patterns'].items():
    print(f"\n{pattern_type.upper()}:")
    for match in matches:
        print(f"  ✓ {match}")

## Exercise Tasks

Complete the following tasks to practice your regex and text processing skills:

1. **Pattern Creation**:
   - Create a regex pattern to extract German street addresses
   - Write a pattern to find all capitalized words (potential proper nouns)
   - Develop a pattern for German phone numbers in various formats

2. **Text Cleaning**:
   - Build a function to remove HTML tags from text
   - Create a function to normalize different quotation marks
   - Implement a function to expand common German abbreviations

3. **Information Extraction**:
   - Extract all monetary amounts from a text (EUR, $, etc.)
   - Find and categorize all numbers (integers, floats, percentages)
   - Extract compound words (typical in German)

4. **Text Validation**:
   - Validate German postal codes
   - Check if text contains proper sentence structure
   - Identify potential spelling errors using pattern matching

5. **Advanced Pipeline**:
   - Create a pipeline that anonymizes personal information
   - Build a text normalizer for social media content
   - Develop a preprocessing pipeline for sentiment analysis

## Reflection Questions

1. When should you use regex vs. specialized NLP libraries?
2. What are the limitations of regex for text processing?
3. How can regex patterns be optimized for performance?
4. Why is text preprocessing important for NLP tasks?
5. What challenges are specific to German text processing?

## Next Steps

- Proceed to **Notebook 01**: Introduction to NLP and Text Processing
- Explore advanced tokenization with NLTK and spaCy
- Learn about stemming and lemmatization
- Study language-specific text processing challenges
