# Module 6: Text Processing and Data Manipulation

## Welcome to Module 6!

In Module 5, you mastered file system operations. Now it's time to work with **file contents** - reading, writing, and processing text data.

### What You'll Learn

- Reading text files (Get-Content)
- Writing to files (Set-Content, Add-Content, Out-File)
- Processing CSV files (Import-Csv, Export-Csv)
- Working with JSON data (ConvertFrom-Json, ConvertTo-Json)
- String searching and filtering
- Select-String for pattern matching
- Regular expressions basics
- Text manipulation and formatting
- Building a log file analyzer

### Why Text Processing Matters

Text processing is essential for automation:
- **Analyze** log files for errors
- **Process** CSV data from spreadsheets
- **Parse** configuration files
- **Extract** information with patterns
- **Transform** data between formats

Let's become text processing experts!

## Setup: Prepare Practice Environment

In [None]:
import subprocess
from pathlib import Path

# Practice folder
practice_folder = Path.home() / "Documents" / "AutomationPractice"
practice_folder.mkdir(parents=True, exist_ok=True)  # parents=True also creates Documents if it is missing

# Create TextProcessing practice subfolder
text_practice = practice_folder / "TextProcessing_Practice"
text_practice.mkdir(exist_ok=True)

print(f"Practice folder ready: {text_practice}")
print("Let's master text processing!\n")

# Helper function
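def run_ps(command, cwd=None):
    """Run a PowerShell command in the practice folder and return stdout plus stderr."""
    if cwd is None:
        cwd = str(text_practice)

    result = subprocess.run(
        # -NoProfile skips the user's profile script so output stays reproducible
        ['powershell', '-NoProfile', '-Command', command],
        cwd=cwd,
        capture_output=True,
        text=True,
        timeout=30
    )

    # Errors are appended so they show up in the printed cell output
    return result.stdout + result.stderr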

print("✓ Helper function ready!")

## 1. Reading Text Files

**Get-Content** reads a file and, by default, returns its contents as an array with one string per line.

### Basic File Reading

In [None]:
# Create a sample text file
sample_text = """Line 1: Introduction to PowerShell
Line 2: Text processing is powerful
Line 3: Automation saves time
Line 4: Practice makes perfect
Line 5: Keep learning!"""

(text_practice / "sample.txt").write_text(sample_text, encoding='utf-8')

# Read entire file
output = run_ps('''
Write-Host "=== Read Entire File ==="
$content = Get-Content "sample.txt"
$content

Write-Host ""
Write-Host "Total lines: $($content.Count)"
''')

print(output)

### Reading Specific Lines

In [None]:
output = run_ps('''
Write-Host "=== First 3 Lines ==="
Get-Content "sample.txt" -TotalCount 3

Write-Host ""
Write-Host "=== Last 2 Lines ==="
Get-Content "sample.txt" -Tail 2

Write-Host ""
Write-Host "=== Line 3 Only ==="
$lines = Get-Content "sample.txt"
$lines[2]  # Arrays are 0-indexed
''')

print(output)

### Reading as Single String

In [None]:
output = run_ps('''
Write-Host "=== Read as Array (default) ==="
$asArray = Get-Content "sample.txt"
Write-Host "Type: $($asArray.GetType().Name)"
Write-Host "Count: $($asArray.Count)"

Write-Host ""
Write-Host "=== Read as Single String (-Raw) ==="
$asString = Get-Content "sample.txt" -Raw
Write-Host "Type: $($asString.GetType().Name)"
Write-Host "Length: $($asString.Length) characters"
Write-Host "First 50 chars: $($asString.Substring(0, 50))..."
''')

print(output)
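
The array-versus-string distinction can also be cross-checked from the Python side of this notebook. The following is a minimal sketch, assuming the same five-line sample text (inlined so the block runs on its own); the names `lines`, `raw`, `first_three`, `last_two`, and `third_line` are illustrative and not part of the original cells.

```python
# Cross-check of Get-Content semantics in plain Python (no PowerShell required).
sample_text = (
    "Line 1: Introduction to PowerShell\n"
    "Line 2: Text processing is powerful\n"
    "Line 3: Automation saves time\n"
    "Line 4: Practice makes perfect\n"
    "Line 5: Keep learning!"
)

raw = sample_text                 # comparable to Get-Content -Raw: one string
lines = sample_text.splitlines()  # comparable to default Get-Content: one item per line

first_three = lines[:3]  # comparable to -TotalCount 3
last_two = lines[-2:]    # comparable to -Tail 2
third_line = lines[2]    # index 2 is the third line (0-based indexing)

print(f"{len(lines)} lines, {len(raw)} characters")
print(third_line)
```

If the PowerShell cell reports a different line count, a trailing newline or an encoding mismatch in the file is a likely cause.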

## 2. Writing Text Files

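PowerShell provides three cmdlets for writing text to files: **Set-Content** (create or overwrite), **Add-Content** (append), and **Out-File** (write pipeline output, with optional encoding and append).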

### Set-Content (Overwrites File)

In [None]:
output = run_ps('''
Write-Host "=== Set-Content (Create/Overwrite) ==="

# Create new file
Set-Content -Path "output.txt" -Value "First line"
Write-Host "Created output.txt"

# Overwrite with new content
Set-Content -Path "output.txt" -Value "This replaces everything"
Write-Host "Overwrote output.txt"

Write-Host ""
Write-Host "Current content:"
Get-Content "output.txt"
''')

print(output)

### Add-Content (Appends to File)

In [None]:
output = run_ps('''
Write-Host "=== Add-Content (Append) ==="

# Append lines
Add-Content -Path "output.txt" -Value "Second line"
Add-Content -Path "output.txt" -Value "Third line"
Add-Content -Path "output.txt" -Value "Fourth line"

Write-Host "Appended 3 lines"
Write-Host ""
Write-Host "Current content:"
Get-Content "output.txt"
''')

print(output)

### Out-File with Encoding

In [None]:
output = run_ps('''
Write-Host "=== Out-File with Encoding ==="

# Write with an explicit encoding (in Windows PowerShell 5.1, UTF8 includes a byte order mark)
"UTF-8 encoded file" | Out-File -FilePath "utf8_file.txt" -Encoding UTF8
Write-Host "Created UTF-8 file"

# Append to file
"Appended line" | Out-File -FilePath "utf8_file.txt" -Encoding UTF8 -Append
Write-Host "Appended to file"

Write-Host ""
Write-Host "Content:"
Get-Content "utf8_file.txt"
''')

print(output)

## 3. Processing Text Line by Line

A common pattern is to read a file, split each line into fields, and write out a result per line.

In [None]:
# Create a data file
data = """apple,red,sweet
banana,yellow,sweet
lemon,yellow,sour
lime,green,sour
grape,purple,sweet"""

(text_practice / "fruits.txt").write_text(data, encoding='utf-8')

output = run_ps('''
Write-Host "=== Process Lines ==="

$lines = Get-Content "fruits.txt"

foreach ($line in $lines) {
    $parts = $line.Split(",")
    $fruit = $parts[0]
    $color = $parts[1]
    $taste = $parts[2]
    
    Write-Host "$fruit is $color and tastes $taste"
}
''')

print(output)

## 4. CSV File Processing

PowerShell has excellent CSV support with **Import-Csv** and **Export-Csv**.

### Creating and Importing CSV

In [None]:
# Create a CSV file
csv_data = """Name,Age,Department,Salary
Alice,30,Engineering,80000
Bob,25,Marketing,60000
Charlie,35,Engineering,90000
Diana,28,Sales,65000
Eve,32,Engineering,85000"""

(text_practice / "employees.csv").write_text(csv_data, encoding='utf-8')

output = run_ps('''
Write-Host "=== Import CSV ==="

$employees = Import-Csv "employees.csv"

Write-Host "Total employees: $($employees.Count)"
Write-Host ""

$employees | Format-Table -AutoSize
''')

print(output)

### Filtering and Sorting CSV Data

In [None]:
output = run_ps('''
Write-Host "=== Filter and Sort CSV ==="

$employees = Import-Csv "employees.csv"

# Filter: Engineering department only
Write-Host "Engineering Department:"
$engineers = $employees | Where-Object {$_.Department -eq "Engineering"}
$engineers | Format-Table Name, Age, Salary

Write-Host ""
# Filter: Salary > 70000
Write-Host "High Earners (>70k):"
$highEarners = $employees | Where-Object {[int]$_.Salary -gt 70000}
# Import-Csv returns every field as a string, so cast to [int] to sort numerically
$highEarners | Sort-Object {[int]$_.Salary} -Descending | Format-Table Name, Salary

Write-Host ""
# Calculate average salary
$avgSalary = ($employees | Measure-Object -Property Salary -Average).Average
Write-Host "Average Salary: `$$avgSalary"  # backtick escapes the literal dollar sign
''')

print(output)
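
The `[int]` cast in the salary filter matters because CSV fields arrive as text. The sketch below illustrates the same idea with Python's `csv` module, assuming the employee data from the cell above (inlined here); `rows`, `high_earners`, and `average_salary` are illustrative names.

```python
import csv
import io

csv_data = """Name,Age,Department,Salary
Alice,30,Engineering,80000
Bob,25,Marketing,60000
Charlie,35,Engineering,90000
Diana,28,Sales,65000
Eve,32,Engineering,85000"""

rows = list(csv.DictReader(io.StringIO(csv_data)))

# Every value is a string, so numeric comparisons need an explicit conversion.
high_earners = sorted(
    (row for row in rows if int(row["Salary"]) > 70000),
    key=lambda row: int(row["Salary"]),
    reverse=True,
)
average_salary = sum(int(row["Salary"]) for row in rows) / len(rows)

print([row["Name"] for row in high_earners])
print(average_salary)
```

Without the conversion, comparisons and sorts would be textual, which only happens to give the right order when all values have the same number of digits.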

### Modifying and Exporting CSV

In [None]:
output = run_ps('''
Write-Host "=== Modify and Export CSV ==="

$employees = Import-Csv "employees.csv"

# Add calculated property: bonus (10% of salary)
$withBonus = $employees | Select-Object Name, Age, Department, Salary, 
    @{Name="Bonus"; Expression={[int]$_.Salary * 0.1}}

Write-Host "Employees with Bonus:"
$withBonus | Format-Table -AutoSize

# Export to new CSV
$withBonus | Export-Csv "employees_with_bonus.csv" -NoTypeInformation
Write-Host ""
Write-Host "Exported to employees_with_bonus.csv"
''')

print(output)

## 5. JSON Data Processing

Work with JSON using **ConvertFrom-Json** and **ConvertTo-Json**.

### Reading JSON Files

In [None]:
# Create a JSON file
json_data = '''{
    "project": "Automation Suite",
    "version": "1.0.0",
    "authors": ["Alice", "Bob", "Charlie"],
    "dependencies": {
        "powershell": "5.1",
        "python": "3.8"
    },
    "features": [
        {"name": "File Processing", "status": "complete"},
        {"name": "Text Analysis", "status": "in-progress"},
        {"name": "Reporting", "status": "planned"}
    ]
}'''

(text_practice / "config.json").write_text(json_data, encoding='utf-8')

output = run_ps('''
Write-Host "=== Read JSON ==="

$json = Get-Content "config.json" -Raw | ConvertFrom-Json

Write-Host "Project: $($json.project)"
Write-Host "Version: $($json.version)"
Write-Host ""

Write-Host "Authors:"
$json.authors | ForEach-Object { Write-Host "  - $_" }

Write-Host ""
Write-Host "Dependencies:"
Write-Host "  PowerShell: $($json.dependencies.powershell)"
Write-Host "  Python: $($json.dependencies.python)"

Write-Host ""
Write-Host "Features:"
$json.features | ForEach-Object {
    Write-Host "  [$($_.status)] $($_.name)"
}
''')

print(output)
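
As a cross-check, the same file can be parsed with Python's `json` module to confirm what the PowerShell cell should report. This is a small sketch, assuming the `config.json` contents shown above (inlined so the block is self-contained); `config` and `features_by_status` are illustrative names.

```python
import json

json_data = """{
    "project": "Automation Suite",
    "version": "1.0.0",
    "authors": ["Alice", "Bob", "Charlie"],
    "dependencies": {
        "powershell": "5.1",
        "python": "3.8"
    },
    "features": [
        {"name": "File Processing", "status": "complete"},
        {"name": "Text Analysis", "status": "in-progress"},
        {"name": "Reporting", "status": "planned"}
    ]
}"""

config = json.loads(json_data)

# Nested objects become dicts and arrays become lists, much as
# ConvertFrom-Json produces nested objects and arrays.
features_by_status = {item["status"]: item["name"] for item in config["features"]}

print(config["project"], config["version"])
print(config["dependencies"]["powershell"])
print(features_by_status)
```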

### Creating and Writing JSON

In [None]:
output = run_ps('''
Write-Host "=== Create JSON ==="

# Create PowerShell object
$data = [PSCustomObject]@{
    Name = "System Report"
    Date = (Get-Date -Format "yyyy-MM-dd")
    Status = "Success"
    Files = @(
        @{Name="file1.txt"; Size=1024},
        @{Name="file2.txt"; Size=2048},
        @{Name="file3.txt"; Size=512}
    )
}

# Convert to JSON
$jsonOutput = $data | ConvertTo-Json -Depth 3

Write-Host "Generated JSON:"
$jsonOutput

# Save to file
$jsonOutput | Out-File "report.json" -Encoding UTF8
Write-Host ""
Write-Host "Saved to report.json"
''')

print(output)

## 6. String Searching with Select-String

**Select-String** searches for patterns in text files (like `grep` in Unix).

In [None]:
# Create a log file
log_content = """2025-11-14 10:00:00 INFO Application started
2025-11-14 10:00:05 INFO Loading configuration
2025-11-14 10:00:10 WARNING Configuration file missing
2025-11-14 10:00:15 INFO Using default settings
2025-11-14 10:00:20 ERROR Failed to connect to database
2025-11-14 10:00:25 INFO Retrying connection
2025-11-14 10:00:30 INFO Connected successfully
2025-11-14 10:00:35 ERROR File not found: data.csv
2025-11-14 10:00:40 WARNING Low disk space
2025-11-14 10:00:45 INFO Processing complete"""

(text_practice / "app.log").write_text(log_content, encoding='utf-8')

output = run_ps('''
Write-Host "=== Select-String: Find Patterns ==="

# Find all ERROR lines
Write-Host "ERROR lines:"
Select-String -Path "app.log" -Pattern "ERROR"

Write-Host ""
# Find WARNING or ERROR
Write-Host "WARNING or ERROR lines:"
Select-String -Path "app.log" -Pattern "WARNING|ERROR"

Write-Host ""
# Select-String ignores case by default; -CaseSensitive:$false only makes that explicit
Write-Host "Lines containing 'connect' (case-insensitive):"
Select-String -Path "app.log" -Pattern "connect" -CaseSensitive:$false
''')

print(output)

### Select-String with Context

In [None]:
output = run_ps('''
Write-Host "=== Select-String with Context ==="

# Show 1 line before and after ERROR
Write-Host "ERROR with context (1 line before/after):"
Select-String -Path "app.log" -Pattern "ERROR" -Context 1,1
''')

print(output)
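
To see what a context search returns, the sketch below reimplements the idea in Python: find matching lines and keep a window of neighbours around each one. It assumes the `app.log` contents from this section (inlined here); `find_with_context` is a hypothetical helper written for illustration, not a PowerShell or library function.

```python
import re

log_lines = [
    "2025-11-14 10:00:00 INFO Application started",
    "2025-11-14 10:00:05 INFO Loading configuration",
    "2025-11-14 10:00:10 WARNING Configuration file missing",
    "2025-11-14 10:00:15 INFO Using default settings",
    "2025-11-14 10:00:20 ERROR Failed to connect to database",
    "2025-11-14 10:00:25 INFO Retrying connection",
    "2025-11-14 10:00:30 INFO Connected successfully",
    "2025-11-14 10:00:35 ERROR File not found: data.csv",
    "2025-11-14 10:00:40 WARNING Low disk space",
    "2025-11-14 10:00:45 INFO Processing complete",
]


def find_with_context(lines, pattern, before=1, after=1):
    """Return each matching line with its line number and neighbouring lines."""
    regex = re.compile(pattern, re.IGNORECASE)  # mirrors Select-String's default
    hits = []
    for index, line in enumerate(lines):
        if regex.search(line):
            hits.append({
                "line_number": index + 1,
                "line": line,
                "before": lines[max(0, index - before):index],
                "after": lines[index + 1:index + 1 + after],
            })
    return hits


error_hits = find_with_context(log_lines, "ERROR")
for hit in error_hits:
    print(hit["line_number"], hit["line"])
```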

## 7. Regular Expressions Basics

Regular expressions (regex) are powerful patterns for matching text.

### Basic Regex Patterns

In [None]:
output = run_ps('''
Write-Host "=== Basic Regex Matching ==="

$text = "Email me at alice@example.com or bob@test.org"

# Match email pattern
if ($text -match "\\w+@\\w+\\.\\w+") {
    Write-Host "Found email: $($Matches[0])"
}

Write-Host ""

# Find all matches
# Single quotes keep $19 literal; in double quotes PowerShell would expand it as a variable
$text = 'Prices: $19.99, $29.99, $49.99'
Write-Host "Text: $text"
Write-Host "All prices:"
[regex]::Matches($text, '\\$\\d+\\.\\d+') | ForEach-Object {
    Write-Host "  Found: $($_.Value)"
}
''')

print(output)

### Common Regex Patterns

In [None]:
output = run_ps('''
Write-Host "=== Common Regex Patterns ==="

$tests = @(
    @{Text="abc123"; Pattern="^[a-z]+$"; Description="All lowercase letters"},
    @{Text="abc123"; Pattern="^[a-z0-9]+$"; Description="Letters and numbers"},
    @{Text="123-456-7890"; Pattern="^\\d{3}-\\d{3}-\\d{4}$"; Description="Phone format"},
    @{Text="2025-11-14"; Pattern="^\\d{4}-\\d{2}-\\d{2}$"; Description="Date YYYY-MM-DD"}
)

foreach ($test in $tests) {
    $match = $test.Text -match $test.Pattern
    $result = if ($match) { "✓ Match" } else { "✗ No match" }
    Write-Host "$result : '$($test.Text)' - $($test.Description)"
}
''')

print(output)

### Extract Information with Regex

In [None]:
output = run_ps('''
Write-Host "=== Extract with Regex ==="

$logLine = "2025-11-14 10:00:20 ERROR Failed to connect to database"

# Extract date, time, level, message
$pattern = "^(\\d{4}-\\d{2}-\\d{2}) (\\d{2}:\\d{2}:\\d{2}) (\\w+) (.+)$"

if ($logLine -match $pattern) {
    Write-Host "Date: $($Matches[1])"
    Write-Host "Time: $($Matches[2])"
    Write-Host "Level: $($Matches[3])"
    Write-Host "Message: $($Matches[4])"
}
''')

print(output)
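
Numbered groups such as `$Matches[1]` get harder to read as patterns grow. The sketch below shows the same log-line pattern with named groups, written in Python for a quick cross-check; the group names `date`, `time`, `level`, and `message` are illustrative choices. .NET regexes accept a similar `(?<name>...)` form, and PowerShell exposes those groups by name in `$Matches`.

```python
import re

LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) "
    r"(?P<message>.+)$"
)

log_line = "2025-11-14 10:00:20 ERROR Failed to connect to database"

match = LOG_PATTERN.match(log_line)
# groupdict() maps each group name to the text it captured
fields = match.groupdict() if match else {}

for name, value in fields.items():
    print(f"{name}: {value}")
```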

## 8. Practical Example: Log File Analyzer

Build a log analyzer that counts entries by level, lists the errors, reports the time range, and tallies keyword mentions.

In [None]:
# Create a more detailed log file
detailed_log = """2025-11-14 08:00:00 INFO Application startup initiated
2025-11-14 08:00:01 INFO Loading configuration from config.json
2025-11-14 08:00:02 WARNING Config file not found, using defaults
2025-11-14 08:00:03 INFO Database connection pool: 10 connections
2025-11-14 08:00:05 ERROR Failed to connect to primary database
2025-11-14 08:00:06 INFO Attempting failover to backup database
2025-11-14 08:00:07 INFO Connected to backup database successfully
2025-11-14 08:00:10 INFO Processing batch job: batch_001
2025-11-14 08:00:15 INFO Processed 1000 records
2025-11-14 08:00:20 WARNING Memory usage: 85%
2025-11-14 08:00:25 ERROR Failed to write to file: disk full
2025-11-14 08:00:26 ERROR Transaction rolled back
2025-11-14 08:00:30 INFO Disk space freed: 2GB
2025-11-14 08:00:35 INFO Retrying write operation
2025-11-14 08:00:36 INFO Write successful
2025-11-14 08:00:40 INFO Processed 2000 records
2025-11-14 08:00:45 WARNING Memory usage: 90%
2025-11-14 08:00:50 INFO Garbage collection triggered
2025-11-14 08:00:51 INFO Memory usage: 65%
2025-11-14 08:01:00 INFO Batch job completed successfully"""

(text_practice / "detailed.log").write_text(detailed_log, encoding='utf-8')

output = run_ps('''
Write-Host "=== Log File Analyzer ==="
Write-Host ""

$logFile = "detailed.log"
$lines = Get-Content $logFile

# Count by log level (the surrounding spaces match the level field, not words inside messages)
$infos = @($lines | Select-String " INFO ").Count
$warnings = @($lines | Select-String " WARNING ").Count
$errors = @($lines | Select-String " ERROR ").Count

Write-Host "Log Summary:"
Write-Host "============"
Write-Host "Total entries: $($lines.Count)"
Write-Host "INFO: $infos"
Write-Host "WARNING: $warnings"
Write-Host "ERROR: $errors"

Write-Host ""
Write-Host "Error Details:"
Write-Host "=============="
$errorLines = $lines | Select-String "ERROR"
foreach ($errorLine in $errorLines) {
    Write-Host "  - $errorLine"
}

Write-Host ""
Write-Host "Time Range:"
Write-Host "==========="
$firstLine = $lines[0]
$lastLine = $lines[-1]
if ($firstLine -match "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})") {
    Write-Host "Start: $($Matches[1])"
}
if ($lastLine -match "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})") {
    Write-Host "End: $($Matches[1])"
}

# Find keywords
Write-Host ""
Write-Host "Key Events:"
Write-Host "==========="
$keywords = @("database", "memory", "batch")
foreach ($keyword in $keywords) {
    $count = ($lines | Select-String $keyword -CaseSensitive:$false).Count
    Write-Host "  $keyword : $count mentions"
}
''')

print(output)
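
Counting levels by substring can over-count when a message happens to contain a level name. The sketch below contrasts a substring count with a count on the level field itself. It uses a short excerpt of `detailed.log` plus one hypothetical line (marked in the code) that was added only to show the difference; `level_counts` and `substring_hits` are illustrative names.

```python
from collections import Counter

log_lines = [
    "2025-11-14 08:00:00 INFO Application startup initiated",
    "2025-11-14 08:00:02 WARNING Config file not found, using defaults",
    "2025-11-14 08:00:05 ERROR Failed to connect to primary database",
    "2025-11-14 08:00:25 ERROR Failed to write to file: disk full",
    "2025-11-14 08:00:36 INFO Write successful",
    "2025-11-14 09:00:00 INFO Cleared ERROR flag",  # hypothetical line, not in detailed.log
]

# The level is the third space-separated field: date, time, level, message.
level_counts = Counter(line.split(" ", 3)[2] for line in log_lines)

# A plain substring search also counts the INFO line that mentions ERROR.
substring_hits = sum("ERROR" in line for line in log_lines)

print(dict(level_counts))
print("Substring hits for ERROR:", substring_hits)
```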

## 9. Practical Example: Generate Report from CSV

Read CSV data and generate a formatted text report.

In [None]:
output = run_ps('''
Write-Host "=== Generate Report from CSV ==="
Write-Host ""

$employees = Import-Csv "employees.csv"

# Generate report
$report = @()
$report += "="*50
$report += "EMPLOYEE SALARY REPORT"
$report += "Generated: $(Get-Date -Format 'yyyy-MM-dd HH:mm')"
$report += "="*50
$report += ""

# Group by department
$byDept = $employees | Group-Object Department

foreach ($dept in $byDept) {
    $report += "Department: $($dept.Name)"
    $report += "-"*50
    
    $deptEmployees = $dept.Group
    $totalSalary = ($deptEmployees | Measure-Object -Property Salary -Sum).Sum
    $avgSalary = ($deptEmployees | Measure-Object -Property Salary -Average).Average
    
    $report += "Employees: $($dept.Count)"
    $report += "Total Salary: `$$totalSalary"
    $report += "Average Salary: `$$([math]::Round($avgSalary, 2))"
    $report += ""
    
    foreach ($emp in $deptEmployees) {
        $report += "  - $($emp.Name), Age $($emp.Age), Salary `$$($emp.Salary)"
    }
    
    $report += ""
}

$report += "="*50
$report += "COMPANY TOTALS"
$report += "="*50
$totalComp = ($employees | Measure-Object -Property Salary -Sum).Sum
$avgComp = ($employees | Measure-Object -Property Salary -Average).Average
$report += "Total Employees: $($employees.Count)"
$report += "Total Compensation: `$$totalComp"
$report += "Average Salary: `$$([math]::Round($avgComp, 2))"

# Display report
$report | ForEach-Object { Write-Host $_ }

# Save to file
$report | Out-File "salary_report.txt" -Encoding UTF8
Write-Host ""
Write-Host "Report saved to salary_report.txt"
''')

print(output)
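
The per-department figures in the report can be verified independently. The sketch below groups the same employee data in Python, assuming the `employees.csv` contents from Section 4 (inlined here); `salaries_by_dept` and `summary` are illustrative names.

```python
import csv
import io
from collections import defaultdict

csv_data = """Name,Age,Department,Salary
Alice,30,Engineering,80000
Bob,25,Marketing,60000
Charlie,35,Engineering,90000
Diana,28,Sales,65000
Eve,32,Engineering,85000"""

# Collect salaries per department, converting the text fields to integers.
salaries_by_dept = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_data)):
    salaries_by_dept[row["Department"]].append(int(row["Salary"]))

summary = {
    dept: {
        "count": len(salaries),
        "total": sum(salaries),
        "average": round(sum(salaries) / len(salaries), 2),
    }
    for dept, salaries in sorted(salaries_by_dept.items())
}

for dept, stats in summary.items():
    print(f"{dept}: {stats['count']} employees, total ${stats['total']}, average ${stats['average']}")
```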

## 10. Practical Example: Configuration File Parser

Parse a simple `key=value` configuration file into a hashtable, skipping comments and blank lines.

In [None]:
# Create a config file
config_content = """# Application Configuration
server_address=192.168.1.100
server_port=8080
database_host=localhost
database_port=5432
database_name=myapp
max_connections=100
timeout=30
debug_mode=false
log_level=INFO"""

(text_practice / "app.config").write_text(config_content, encoding='utf-8')

output = run_ps('''
Write-Host "=== Parse Configuration File ==="
Write-Host ""

$configFile = "app.config"
$lines = Get-Content $configFile

# Parse configuration
$config = @{}

foreach ($line in $lines) {
    # Skip comments and empty lines
    if ($line -match "^\\s*#" -or $line -match "^\\s*$") {
        continue
    }
    
    # Parse key=value
    if ($line -match "^(.+?)=(.+)$") {
        $key = $Matches[1].Trim()
        $value = $Matches[2].Trim()
        $config[$key] = $value
    }
}

Write-Host "Configuration Settings:"
Write-Host "======================"
foreach ($key in $config.Keys | Sort-Object) {
    Write-Host "${key}: $($config[$key])"
}

Write-Host ""
Write-Host "Server URL: http://$($config['server_address']):$($config['server_port'])"
Write-Host "Database: $($config['database_name']) @ $($config['database_host']):$($config['database_port'])"
Write-Host "Debug Mode: $($config['debug_mode'])"
''')

print(output)
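
The parsing loop above follows three steps: skip comments and blank lines, split on the first `=`, and trim both sides. The sketch below expresses the same steps in Python on an excerpt of the config file; `parse_config` is a hypothetical helper written for illustration. One difference from the PowerShell regex: this version also accepts empty values such as `key=`.

```python
config_text = """# Application Configuration
server_address=192.168.1.100
server_port=8080

debug_mode=false
log_level = INFO"""


def parse_config(text):
    """Parse key=value lines into a dict, ignoring comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # partition splits on the first "=" only, so values may contain "="
        key, separator, value = stripped.partition("=")
        if separator:
            settings[key.strip()] = value.strip()
    return settings


settings = parse_config(config_text)
for key in sorted(settings):
    print(f"{key}: {settings[key]}")
```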

## 11. Practice Exercise: Word Frequency Counter

**Challenge**: Count word frequency in a text file.

In [None]:
# Create sample text
sample = """PowerShell is powerful. PowerShell makes automation easy.
Automation saves time. Time is valuable.
PowerShell scripting is fun. Scripting with PowerShell is efficient."""

(text_practice / "sample_text.txt").write_text(sample, encoding='utf-8')

print("=== Test file created ===")
print("Your task: Count how many times each word appears (case-insensitive)\n")

In [None]:
# Your solution here
print("=== Your Solution ===")
# Write your code using Get-Content and Group-Object

In [None]:
# Solution
output = run_ps('''
Write-Host "=== Word Frequency Counter ==="
Write-Host ""

$text = Get-Content "sample_text.txt" -Raw

# Remove punctuation and split into words
$text = $text.ToLower()
$text = $text -replace "[^a-z0-9\\s]", ""
$words = $text -split "\\s+" | Where-Object {$_ -ne ""}

# Count frequencies
$frequency = $words | Group-Object | Sort-Object Count -Descending

Write-Host "Word Frequency (Top 10):"
Write-Host "======================="
$frequency | Select-Object -First 10 | ForEach-Object {
    Write-Host "$($_.Name): $($_.Count) times"
}

Write-Host ""
Write-Host "Total unique words: $($frequency.Count)"
Write-Host "Total words: $($words.Count)"
''')

print("=== Exercise Solution ===")
print(output)
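
To check your PowerShell answer, the same counting steps (lowercase, strip punctuation, split on whitespace, group) can be sketched in Python. This assumes the sample text from the exercise, inlined below; `words` and `frequency` are illustrative names.

```python
import re
from collections import Counter

sample = """PowerShell is powerful. PowerShell makes automation easy.
Automation saves time. Time is valuable.
PowerShell scripting is fun. Scripting with PowerShell is efficient."""

# Lowercase, drop everything except letters, digits, and whitespace, then split.
cleaned = re.sub(r"[^a-z0-9\s]", "", sample.lower())
words = cleaned.split()

frequency = Counter(words)

for word, count in frequency.most_common(5):
    print(f"{word}: {count} times")

print("Total unique words:", len(frequency))
print("Total words:", len(words))
```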

## Summary and Key Takeaways

Congratulations! You've mastered text processing in PowerShell.

### What You Learned:

✓ **Get-Content** - Reading text files  
✓ **Set-Content / Add-Content** - Writing to files  
✓ **Out-File** - Writing with encoding  
✓ **Import-Csv / Export-Csv** - CSV processing  
✓ **ConvertFrom-Json / ConvertTo-Json** - JSON handling  
✓ **Select-String** - Text searching (like grep)  
✓ **Regular Expressions** - Pattern matching  
✓ **Text Manipulation** - Processing and transforming  
✓ **Real-World Projects** - Log analyzer, report generator, config parser  

### Quick Reference: Text Processing

```powershell
# Read files
Get-Content "file.txt"                    # Array of lines
Get-Content "file.txt" -Raw               # Single string
Get-Content "file.txt" -TotalCount 10     # First 10 lines
Get-Content "file.txt" -Tail 5            # Last 5 lines

# Write files
Set-Content "file.txt" -Value "text"      # Create/overwrite
Add-Content "file.txt" -Value "text"      # Append
"text" | Out-File "file.txt" -Encoding UTF8

# CSV
$data = Import-Csv "data.csv"
$data | Export-Csv "output.csv" -NoTypeInformation

# JSON
$json = Get-Content "file.json" -Raw | ConvertFrom-Json
$object | ConvertTo-Json -Depth 3 | Out-File "file.json"

# Search
Select-String -Path "*.log" -Pattern "ERROR"
Select-String -Path "file.txt" -Pattern "\d+" -AllMatches

# Regex
if ($text -match "pattern") { $Matches[0] }
[regex]::Matches($text, "pattern")
```

### Common Patterns

**Process log file**:
```powershell
$errors = Get-Content "app.log" | Select-String "ERROR"
$errors | ForEach-Object { Write-Host $_ }
```

**Filter and export CSV**:
```powershell
$data = Import-Csv "input.csv"
$filtered = $data | Where-Object {$_.Status -eq "Active"}
$filtered | Export-Csv "output.csv" -NoTypeInformation
```

**Parse structured log**:
```powershell
$logs = Get-Content "app.log"
$parsed = $logs | ForEach-Object {
    if ($_ -match "^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.+)$") {
        [PSCustomObject]@{
            Date = $Matches[1]
            Time = $Matches[2]
            Level = $Matches[3]
            Message = $Matches[4]
        }
    }
}
```

### Regex Quick Reference

- `\d` = digit (0-9)
- `\w` = word character (a-z, A-Z, 0-9, _)
- `\s` = whitespace
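- `.` = any character except a newline
- `^` = start of the string (start of each line in multiline mode)
- `$` = end of the string (end of each line in multiline mode)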
- `+` = one or more
- `*` = zero or more
- `?` = zero or one
- `{n}` = exactly n times
- `[abc]` = any of a, b, or c
- `(...)` = capture group

### Next Steps

In **Module 7: Task Automation Fundamentals**, you'll learn:
- Environment variables
- Running external programs
- Error handling (Try/Catch)
- Creating functions
- Building automation scripts

You now have powerful text processing skills!

## Cleanup

Run this cell to remove all practice files:

In [None]:
import shutil

print("Cleaning up TextProcessing_Practice folder...\n")

if text_practice.exists():
    shutil.rmtree(text_practice)
    print(f"✓ Removed {text_practice}")
    print("\nAll practice files deleted.")
    print("The main AutomationPractice folder remains for future modules.")
else:
    print("Practice folder already cleaned up!")