# Exercise 1: Introduction to NLP and Text Processing

Welcome to Natural Language Processing! In this notebook, you'll learn the fundamental concepts and techniques for working with text data.

## Learning Objectives
By the end of this exercise, you will be able to:
1. **Tokenization**: Split text into meaningful units (words, sentences)
2. **Text Statistics**: Calculate basic metrics (word count, character count, etc.)
3. **Stop Word Removal**: Filter out common, non-informative words
4. **Normalization**: Convert text to consistent format (lowercase, etc.)
5. **Frequency Analysis**: Find most common words and patterns
6. **Visualization**: Create charts to understand text data

## What You'll Build
- A complete text preprocessing pipeline
- Text analysis tools for German language
- Visualization functions for text statistics
- A reusable NLP toolkit for future projects

## Prerequisites
- Basic Python knowledge
- Understanding of lists, dictionaries, and functions
- No prior NLP knowledge required!

**Ready to start your NLP journey?** Let's go! 🚀

## Exercise 1: Setting Up Your NLP Toolkit

**Goal**: Import necessary libraries and understand the NLP ecosystem.

**Your Task**: Set up the libraries you'll need for text processing, with fallbacks for missing dependencies.

In [None]:
# Essential imports for text processing
import re
import string
from collections import Counter, defaultdict
import matplotlib.pyplot as plt

# Try to import advanced NLP libraries with helpful error handling
try:
    import nltk
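    # Download required NLTK data
    # ('punkt_tab' is the tokenizer resource that newer NLTK releases look up; 'punkt' covers older ones)
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
    nltk.download('stopwords', quiet=True)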
    from nltk.tokenize import word_tokenize, sent_tokenize
    from nltk.corpus import stopwords
    NLTK_AVAILABLE = True
    print("✅ NLTK loaded successfully!")
except ImportError:
    print("❌ NLTK not available. Install with: pip install nltk")
    NLTK_AVAILABLE = False

try:
    import spacy
    # Try to load German model
    nlp = spacy.load('de_core_news_sm')
    SPACY_AVAILABLE = True
    print("✅ spaCy German model loaded!")
except (ImportError, OSError):
    print("❌ spaCy German model not available.")
    print("   Install with: python -m spacy download de_core_news_sm")
    SPACY_AVAILABLE = False

print("\n📚 NLP Toolkit Status:")
print(f"   NLTK: {'Available' if NLTK_AVAILABLE else 'Not available (using fallbacks)'}")
print(f"   spaCy: {'Available' if SPACY_AVAILABLE else 'Not available (using fallbacks)'}")
print("\n🚀 Ready to start text processing!")

NLTK loaded successfully!
Ready to start!
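
The status messages in the setup cell mention fallbacks for the case where NLTK or spaCy is missing. A minimal sketch of what such fallbacks could look like with only the standard library is shown here. The names `fallback_tokenize`, `fallback_remove_stopwords`, and `FALLBACK_GERMAN_STOPWORDS` are hypothetical and introduced only for illustration, and the stop word set is a small hand-picked sample, not NLTK's German list.

```python
import re

# Small hand-picked sample of German function words
# (assumption: illustrative only, far shorter than the list shipped with NLTK)
FALLBACK_GERMAN_STOPWORDS = {
    "ich", "und", "die", "der", "das", "sind", "sehr", "ist", "ein", "eine",
}

def fallback_tokenize(text):
    """Lowercase the text and return runs of word characters (umlauts included)."""
    return re.findall(r"\w+", text.lower())

def fallback_remove_stopwords(tokens, stopword_set=FALLBACK_GERMAN_STOPWORDS):
    """Drop every token that appears in the stop word set."""
    return [t for t in tokens if t not in stopword_set]

tokens = fallback_tokenize("Ich liebe Pizza und Pasta.")
print(tokens)
print(fallback_remove_stopwords(tokens))
```

In Python 3, `\w` matches Unicode letters by default, so words such as "nützlich" stay in one piece without extra flags.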


In [2]:
# Simple German text for practice
german_text = """
Ich liebe Pizza und Pasta. Computer sind sehr nützlich heute. 
Die Sonne scheint hell und warm. Hunde sind treue Freunde.
"""

print("Our text to work with:")
print(german_text)

Our text to work with:

Ich liebe Pizza und Pasta. Computer sind sehr nützlich heute. 
Die Sonne scheint hell und warm. Hunde sind treue Freunde.



### Step 1: Basic Text Statistics

In [None]:
def analyze_text_statistics(text):
    """
    Calculate comprehensive text statistics.
    
    Args:
        text (str): Input text to analyze
    
    Returns:
        dict: Dictionary containing various text statistics
    """
    # TODO: Implement comprehensive text analysis:
    # 1. Count characters, words, sentences
    # 2. Calculate average word/sentence length
    # 3. Find unique words and vocabulary size
    # 4. Analyze punctuation usage
    
    # Clean text for analysis
    clean_text = text.strip()
    
    # Basic counts
    char_count = len(clean_text)
    # Remove all whitespace (spaces, tabs, newlines), not only ' '
    char_count_no_spaces = len(''.join(clean_text.split()))
    
    # Word analysis
    words = clean_text.split()
    word_count = len(words)
    unique_words = set(word.lower().strip(string.punctuation) for word in words)
    vocabulary_size = len(unique_words)
    
    # Sentence analysis (simple approach)
    sentences = [s.strip() for s in clean_text.split('.') if s.strip()]
    sentence_count = len(sentences)
    
    # Average lengths
    avg_word_length = sum(len(word) for word in words) / word_count if word_count > 0 else 0
    avg_sentence_length = word_count / sentence_count if sentence_count > 0 else 0
    
    # Punctuation analysis
    punctuation_count = sum(1 for char in clean_text if char in string.punctuation)
    
    stats = {
        'characters_total': char_count,
        'characters_no_spaces': char_count_no_spaces,
        'words_total': word_count,
        'words_unique': vocabulary_size,
        'sentences': sentence_count,
        'avg_word_length': round(avg_word_length, 2),
        'avg_sentence_length': round(avg_sentence_length, 2),
        'punctuation_marks': punctuation_count,
        'vocabulary_richness': round(vocabulary_size / word_count * 100, 2) if word_count > 0 else 0
    }
    
    return stats

def display_text_statistics(text, title="Text Analysis"):
    """Display formatted text statistics."""
    
    print(f"📊 {title}")
    print("=" * 50)
    
    stats = analyze_text_statistics(text)
    
    print("📝 Basic Counts:")
    print(f"   Characters (total): {stats['characters_total']}")
    print(f"   Characters (no spaces): {stats['characters_no_spaces']}")
    print(f"   Words (total): {stats['words_total']}")
    print(f"   Words (unique): {stats['words_unique']}")
    print(f"   Sentences: {stats['sentences']}")
    
    print("\n📏 Averages:")
    print(f"   Average word length: {stats['avg_word_length']} characters")
    print(f"   Average sentence length: {stats['avg_sentence_length']} words")
    
    print("\n🎯 Text Quality:")
    print(f"   Punctuation marks: {stats['punctuation_marks']}")
    print(f"   Vocabulary richness: {stats['vocabulary_richness']}%")
    
    return stats

# Analyze our sample text
sample_stats = display_text_statistics(german_text, "German Sample Text Analysis")

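# Let's try it!
# `count_basic_things` is not defined anywhere in this notebook; the version below
# is a minimal stand-in (an assumption) that is consistent with the output shown
def count_basic_things(text):
    """Return (characters, words, sentences) using simple string splitting."""
    chars = len(text)
    words = len(text.split())
    sentences = len([s for s in text.split('.') if s.strip()])
    return chars, words, sentences

chars, words, sentences = count_basic_things(german_text)
print(f"Characters: {chars}")
print(f"Words: {words}")
print(f"Sentences: {sentences}")
print(f"Average words per sentence: {words/sentences:.1f}")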

Characters: 123
Words: 20
Sentences: 4
Average words per sentence: 5.0
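
Because the statistics above measure word length on whitespace-split words, trailing punctuation (as in "Pasta.") counts toward the length. The sketch below compares the average word length with and without punctuation and also computes lexical diversity (unique words divided by total words). `word_length_stats` is a hypothetical helper written for this comparison, not part of the notebook's toolkit.

```python
import string

def word_length_stats(text):
    """Return (avg length with punctuation, avg length without, lexical diversity)."""
    raw_words = text.split()
    # Strip leading/trailing punctuation, lowercase, and drop tokens that become empty
    clean_words = [w.strip(string.punctuation).lower() for w in raw_words]
    clean_words = [w for w in clean_words if w]
    if not raw_words or not clean_words:
        return 0.0, 0.0, 0.0
    avg_raw = sum(len(w) for w in raw_words) / len(raw_words)
    avg_clean = sum(len(w) for w in clean_words) / len(clean_words)
    diversity = len(set(clean_words)) / len(clean_words)
    return avg_raw, avg_clean, diversity

avg_raw, avg_clean, diversity = word_length_stats("Hunde sind treue Freunde. Hunde sind warm.")
print(f"Average length (raw):   {avg_raw:.2f}")
print(f"Average length (clean): {avg_clean:.2f}")
print(f"Lexical diversity:      {diversity:.2f}")
```

The gap between the two averages grows with the share of short sentences, since every sentence contributes at least one punctuation mark.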


### Step 2: Advanced Tokenization

In [None]:
def tokenize_text_multiple_methods(text):
    """
    Compare different tokenization approaches.
    
    Args:
        text (str): Input text to tokenize
    
    Returns:
        dict: Results from different tokenization methods
    """
    # TODO: Implement multiple tokenization approaches:
    # 1. Simple whitespace splitting
    # 2. Regular expression-based tokenization
    # 3. NLTK tokenization (if available)
    # 4. spaCy tokenization (if available)
    
    results = {}
    
    # Method 1: Simple whitespace splitting
    simple_tokens = text.split()
    results['simple_split'] = {
        'tokens': simple_tokens,
        'count': len(simple_tokens),
        'method': 'Whitespace splitting'
    }
    
    # Method 2: Regular expression tokenization
    # This handles punctuation better (`re` is already imported at the top of the notebook)
    regex_pattern = r'\b\w+\b'  # Runs of word characters between word boundaries
    regex_tokens = re.findall(regex_pattern, text.lower())
    results['regex'] = {
        'tokens': regex_tokens,
        'count': len(regex_tokens),
        'method': 'Regex word boundaries'
    }
    
    # Method 3: NLTK tokenization (if available)
    if NLTK_AVAILABLE:
        try:
            nltk_tokens = word_tokenize(text.lower(), language='german')
            results['nltk'] = {
                'tokens': nltk_tokens,
                'count': len(nltk_tokens),
                'method': 'NLTK German tokenizer'
            }
        except Exception as e:
            results['nltk'] = {'error': f"NLTK tokenization failed: {e}"}
    else:
        results['nltk'] = {'error': 'NLTK not available'}
    
    # Method 4: spaCy tokenization (if available)
    if SPACY_AVAILABLE:
        try:
            doc = nlp(text)
            spacy_tokens = [token.text.lower() for token in doc if not token.is_space]
            results['spacy'] = {
                'tokens': spacy_tokens,
                'count': len(spacy_tokens),
                'method': 'spaCy German model'
            }
        except Exception as e:
            results['spacy'] = {'error': f"spaCy tokenization failed: {e}"}
    else:
        results['spacy'] = {'error': 'spaCy not available'}
    
    return results

def compare_tokenization_methods(text):
    """Compare and display different tokenization results."""
    
    print("🔤 Tokenization Method Comparison")
    print("=" * 60)
    print(f"Input text: '{text.strip()}'")
    print()
    
    results = tokenize_text_multiple_methods(text)
    
    for method_name, data in results.items():
        print(f"📝 {method_name.upper()}:")
        
        if 'error' in data:
            print(f"   ❌ {data['error']}")
        else:
            print(f"   Method: {data['method']}")
            print(f"   Token count: {data['count']}")
            print(f"   First 10 tokens: {data['tokens'][:10]}")
        print()
    
    return results

def german_sentence_tokenization(text):
    """
    Tokenize German text into sentences.
    
    Args:
        text (str): Input text
    
    Returns:
        list: List of sentences
    """
    # TODO: Implement sentence tokenization:
    # 1. Use simple period splitting as fallback
    # 2. Use NLTK sentence tokenizer if available
    # 3. Handle German-specific sentence patterns
    
    print("📑 Sentence Tokenization")
    print("=" * 30)
    
    # Method 1: Simple approach
    simple_sentences = [s.strip() for s in text.split('.') if s.strip()]
    print(f"Simple method: {len(simple_sentences)} sentences")
    
    # Method 2: NLTK (if available)
    if NLTK_AVAILABLE:
        try:
            nltk_sentences = sent_tokenize(text, language='german')
            print(f"NLTK method: {len(nltk_sentences)} sentences")
            
            print("\nNLTK sentences:")
            for i, sentence in enumerate(nltk_sentences, 1):
                print(f"  {i}. {sentence.strip()}")
            
            return nltk_sentences
        except Exception as e:
            print(f"NLTK sentence tokenization failed: {e}")
    
    print("\nSimple sentences:")
    for i, sentence in enumerate(simple_sentences, 1):
        print(f"  {i}. {sentence}")
    
    return simple_sentences

# Test tokenization methods
print("Testing different tokenization approaches:")
tokenization_results = compare_tokenization_methods(german_text)

# Test sentence tokenization
sentences = german_sentence_tokenization(german_text)
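
The simple methods above drop punctuation or split sentences only at periods. As a rough sketch of two small refinements using only `re`, the first function keeps punctuation marks as separate tokens and the second splits after `.`, `!`, or `?`. Both `regex_tokenize_keep_punct` and `split_sentences` are hypothetical names, and the sample sentence is made up for illustration.

```python
import re

def regex_tokenize_keep_punct(text):
    """Return word tokens and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_sentences(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "Scheint die Sonne? Ja! Hunde sind treue Freunde."
print(regex_tokenize_keep_punct(sample))
print(split_sentences(sample))

# For comparison: splitting on '.' alone finds a single "sentence" here
print([s.strip() for s in sample.split('.') if s.strip()])
```

This splitter still breaks on abbreviations such as "z. B.", which is one reason trained sentence tokenizers are usually preferable when they are available.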

### Step 2: Tokenization with NLTK

In [4]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

def nltk_preprocessing(text, language='german'):
    """
    Preprocess text using NLTK tools.
    
    Args:
        text (str): Input text
        language (str): Language for stopwords
    
    Returns:
        dict: Preprocessed text components
    """
    # TODO: Implement the following preprocessing steps:
    # 1. Sentence tokenization
    # 2. Word tokenization
    # 3. Convert to lowercase
    # 4. Remove punctuation and numbers
    # 5. Remove stopwords
    
    result = {}
    
    # Sentence tokenization (uses the `language` argument instead of a hardcoded value)
    result['sentences'] = sent_tokenize(text, language=language)
    
    # Word tokenization
    tokens = word_tokenize(text, language=language)
    result['tokens'] = tokens
    
    # Lowercase conversion
    tokens_lower = [token.lower() for token in tokens]
    result['tokens_lower'] = tokens_lower
    
    # Remove punctuation and numbers
    tokens_clean = [token for token in tokens_lower if token.isalpha()]
    result['tokens_clean'] = tokens_clean
    
    # Remove stopwords for the selected language
    language_stopwords = set(stopwords.words(language))
    tokens_no_stop = [token for token in tokens_clean if token not in language_stopwords]
    result['tokens_no_stopwords'] = tokens_no_stop
    
    return result

# Apply NLTK preprocessing
nltk_result = nltk_preprocessing(german_text)

print("NLTK Preprocessing Results:")
print(f"Number of sentences: {len(nltk_result['sentences'])}")
print(f"Number of tokens: {len(nltk_result['tokens'])}")
print(f"Number of clean tokens: {len(nltk_result['tokens_clean'])}")
print(f"Number of tokens without stopwords: {len(nltk_result['tokens_no_stopwords'])}")
print("\nTokens without stopwords:")
print(nltk_result['tokens_no_stopwords'])

LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/german/

  Searched in:
    - 'C:\\Users\\Felix Neubürger/nltk_data'
    - 'c:\\Users\\Felix Neubürger\\Documents\\Lehre\\NLP_BA2526\\.venv\\nltk_data'
    - 'c:\\Users\\Felix Neubürger\\Documents\\Lehre\\NLP_BA2526\\.venv\\share\\nltk_data'
    - 'c:\\Users\\Felix Neubürger\\Documents\\Lehre\\NLP_BA2526\\.venv\\lib\\nltk_data'
    - 'C:\\Users\\Felix Neubürger\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************
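
The `LookupError` above reports that the `punkt_tab` resource is missing, and the message itself suggests `nltk.download('punkt_tab')`. One way to avoid this kind of failure is to check for each resource before tokenizing and download only what is missing. The sketch below shows that pattern. `ensure_resources` and `NLTK_RESOURCES` are hypothetical names, and the lookup and download functions are passed in as parameters so the logic can be tried without a network connection.

```python
def ensure_resources(resources, find, download):
    """
    Fetch each resource only if `find` cannot locate it.

    resources: iterable of (lookup_path, package_name) pairs
    find:      callable that raises LookupError when a path is missing
    download:  callable that fetches a package by name
    Returns the list of package names that were requested for download.
    """
    requested = []
    for lookup_path, package_name in resources:
        try:
            find(lookup_path)
        except LookupError:
            download(package_name)
            requested.append(package_name)
    return requested

# Resources this notebook relies on (lookup path, download name)
NLTK_RESOURCES = [
    ("tokenizers/punkt_tab", "punkt_tab"),
    ("corpora/stopwords", "stopwords"),
]

# With NLTK installed, the helper could be wired up like this:
# import nltk
# ensure_resources(NLTK_RESOURCES, nltk.data.find,
#                  lambda name: nltk.download(name, quiet=True))
```

After the missing resource has been downloaded, re-running the preprocessing cell should get past the tokenizer lookup.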


### Step 3: Advanced Processing with spaCy

In [None]:
def spacy_processing(text, nlp_model):
    """
    Process text using spaCy for advanced NLP features.
    
    Args:
        text (str): Input text
        nlp_model: Loaded spaCy model
    
    Returns:
        dict: Advanced NLP analysis results
    """
    # Process text with spaCy
    doc = nlp_model(text)
    
    result = {
        'tokens': [],
        'lemmas': [],
        'pos_tags': [],
        'entities': [],
        'noun_phrases': []
    }
    
    # TODO: Extract the following information:
    # 1. Tokens and their lemmas
    # 2. Part-of-speech tags
    # 3. Named entities
    # 4. Noun phrases
    
    # Extract token information
    for token in doc:
        if not token.is_space and not token.is_punct:
            result['tokens'].append(token.text)
            result['lemmas'].append(token.lemma_)
            result['pos_tags'].append((token.text, token.pos_, token.tag_))
    
    # Extract named entities
    for ent in doc.ents:
        result['entities'].append((ent.text, ent.label_, ent.start_char, ent.end_char))
    
    # Extract noun phrases
    for chunk in doc.noun_chunks:
        result['noun_phrases'].append(chunk.text)
    
    return result

# Apply spaCy processing (if the model was loaded in the setup cell)
if SPACY_AVAILABLE:
    spacy_result = spacy_processing(german_text, nlp)
    
    print("spaCy Processing Results:")
    print(f"\nLemmas: {spacy_result['lemmas'][:10]}...")  # Show first 10
    print(f"\nPOS Tags (first 10):")
    for token, pos, tag in spacy_result['pos_tags'][:10]:
        print(f"  {token}: {pos} ({tag})")
    
    print(f"\nNamed Entities:")
    for ent_text, ent_label, start, end in spacy_result['entities']:
        print(f"  {ent_text}: {ent_label}")
    
    print(f"\nNoun Phrases: {spacy_result['noun_phrases']}")
else:
    print("spaCy model not available. Please install: python -m spacy download de_core_news_sm")

### Step 4: Frequency Analysis and Visualization

In [None]:
def analyze_word_frequency(tokens, top_n=10):
    """
    Analyze word frequency and create visualizations.
    
    Args:
        tokens (list): List of tokens to analyze
        top_n (int): Number of top words to display
    
    Returns:
        Counter: Word frequency counter
    """
    # TODO: Implement frequency analysis:
    # 1. Count word frequencies
    # 2. Create bar plot of most frequent words
    # 3. Generate word cloud
    
    # Count frequencies
    word_freq = Counter(tokens)
    most_common = word_freq.most_common(top_n)
    
    # Create visualizations
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Bar plot
    words, counts = zip(*most_common) if most_common else ([], [])
    ax1.bar(words, counts)
    ax1.set_title(f'Top {top_n} Most Frequent Words')
    ax1.set_xlabel('Words')
    ax1.set_ylabel('Frequency')
    ax1.tick_params(axis='x', rotation=45)
    
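    # Word cloud (optional: needs the third-party `wordcloud` package,
    # which is not imported anywhere else in this notebook)
    try:
        from wordcloud import WordCloud
    except ImportError:
        WordCloud = None
    
    if tokens and WordCloud is not None:
        text_for_cloud = ' '.join(tokens)
        wordcloud = WordCloud(width=400, height=300, background_color='white').generate(text_for_cloud)
        ax2.imshow(wordcloud, interpolation='bilinear')
        ax2.set_title('Word Cloud')
    else:
        ax2.set_title('Word Cloud (not available)')
    ax2.axis('off')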
    
    plt.tight_layout()
    plt.show()
    
    return word_freq

# Analyze frequency of clean tokens
if 'nltk_result' in locals():
    word_frequencies = analyze_word_frequency(nltk_result['tokens_no_stopwords'])
    print("\nWord Frequency Analysis:")
    print(f"Total unique words: {len(word_frequencies)}")
    print(f"Most common words: {word_frequencies.most_common(5)}")
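
When matplotlib or the word cloud package is not available, the same frequency information can be shown as plain text. The following is a minimal sketch of that idea using only `collections.Counter`. `frequency_table` and its parameters are hypothetical names chosen for this example.

```python
from collections import Counter

def frequency_table(tokens, top_n=5, bar_width=20):
    """Return formatted lines: word, count, relative frequency, and a text bar."""
    counts = Counter(tokens)
    total = sum(counts.values())
    lines = []
    if total == 0:
        return lines
    max_count = counts.most_common(1)[0][1]
    for word, count in counts.most_common(top_n):
        # Scale each bar relative to the most frequent word
        bar = "#" * max(1, round(count / max_count * bar_width))
        lines.append(f"{word:<12} {count:>3} {count / total:6.1%} {bar}")
    return lines

for line in frequency_table(["pizza", "pasta", "pizza", "sonne"], top_n=3, bar_width=10):
    print(line)
```

Relative frequencies make texts of different lengths comparable, which raw counts do not.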

### Step 5: Text Comparison Function

In [None]:
def compare_preprocessing_methods(text):
    """
    Compare different preprocessing approaches.
    
    Args:
        text (str): Input text to compare
    
    Returns:
        pandas.DataFrame: Comparison results
    """
    # TODO: Create a comparison between different preprocessing methods:
    # 1. Raw text split
    # 2. NLTK tokenization
    # 3. spaCy tokenization (if available)
    
    methods = {}
    
    # Simple split
    simple_tokens = text.split()
    methods['Simple Split'] = {
        'token_count': len(simple_tokens),
        'unique_tokens': len(set(simple_tokens)),
        'sample_tokens': simple_tokens[:5]
    }
    
    # NLTK tokenization (if available)
    if NLTK_AVAILABLE:
        nltk_tokens = word_tokenize(text, language='german')
        methods['NLTK'] = {
            'token_count': len(nltk_tokens),
            'unique_tokens': len(set(nltk_tokens)),
            'sample_tokens': nltk_tokens[:5]
        }
    
    # spaCy tokenization (if available)
    # (`'nlp' in locals()` is always False inside a function because `nlp` is a global)
    if SPACY_AVAILABLE:
        spacy_doc = nlp(text)
        spacy_tokens = [token.text for token in spacy_doc if not token.is_space]
        methods['spaCy'] = {
            'token_count': len(spacy_tokens),
            'unique_tokens': len(set(spacy_tokens)),
            'sample_tokens': spacy_tokens[:5]
        }
    
    # Create comparison DataFrame
    # (pandas is imported here because no earlier cell imports it)
    import pandas as pd
    comparison_df = pd.DataFrame(methods).T
    return comparison_df

# Compare preprocessing methods
comparison = compare_preprocessing_methods(german_text)
print("Preprocessing Method Comparison:")
print(comparison)

## Exercise Tasks

Complete the following tasks to practice your understanding:

1. **Extend Text Statistics**: Add more sophisticated statistics like:
   - Lexical diversity (unique words / total words)
   - Average characters per word
   - Reading difficulty scores

2. **Custom Stopword List**: Create a custom stopword list for your domain and compare results.

3. **Language Detection**: Use a language detection library to automatically identify text language.

4. **Multi-text Analysis**: Process multiple texts and compare their characteristics.

5. **Interactive Preprocessing**: Create a function that allows users to choose different preprocessing options.
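
As a possible starting point for task 5, the sketch below exposes each preprocessing step as a keyword argument. It is one of many reasonable designs, not a reference solution. `preprocess` and its parameter names are hypothetical, and the tokenizer is a simple regular expression, not NLTK or spaCy.

```python
import re

def preprocess(text, lowercase=True, remove_punctuation=True,
               remove_numbers=False, stopword_set=None):
    """Tokenize `text` and apply the selected preprocessing steps in a fixed order."""
    if lowercase:
        text = text.lower()
    # Words and single punctuation marks become separate tokens
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if remove_punctuation:
        # Keep only tokens that contain at least one letter or digit
        tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
    if remove_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    if stopword_set:
        tokens = [t for t in tokens if t.lower() not in stopword_set]
    return tokens

print(preprocess("Ich liebe 2 Pizzen!"))
print(preprocess("Ich liebe 2 Pizzen!", remove_numbers=True, stopword_set={"ich"}))
print(preprocess("Ich liebe 2 Pizzen!", lowercase=False, remove_punctuation=False))
```

Keeping the order of steps fixed inside the function makes results reproducible. Swapping the order (for example, removing stop words before lowercasing) can change the output.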

## Reflection Questions

1. What are the main differences between NLTK and spaCy tokenization?
2. When would you choose lemmatization over stemming?
3. How does German text processing differ from English?
4. What preprocessing steps are most important for your specific use case?
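
One concrete point to consider for question 3 is character handling: German umlauts can be stored in composed or decomposed Unicode form, and "ß" behaves differently under `lower()` and `casefold()`. The sketch below illustrates both effects with the standard library. `normalize_german` is a hypothetical helper name.

```python
import unicodedata

def normalize_german(text):
    """Compose Unicode characters (NFC) and apply case folding."""
    return unicodedata.normalize("NFC", text).casefold()

decomposed = "nu\u0308tzlich"  # 'u' followed by a combining diaeresis
composed = "n\u00fctzlich"     # precomposed 'ü'

# The two strings look identical when printed but compare as different
print(decomposed == composed)
# After normalization they compare as equal
print(normalize_german(decomposed) == normalize_german(composed))
# lower() keeps 'ß', casefold() maps it to 'ss'
print("Straße".lower(), "Straße".casefold())
```

Whether folding "ß" to "ss" is desirable depends on the task. It helps when matching spelling variants but merges words that a reader would keep apart.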

## Next Steps

- Explore more advanced spaCy features (dependency parsing, similarity)
- Learn about regular expressions for custom text cleaning
- Practice with different types of texts (social media, formal documents, etc.)
- Study language-specific challenges in German NLP