### Reading Text Files — Practice Problems (Advanced)

This notebook includes:

- A set of advanced (but not too advanced) practice problems for reading text files.
- Complete solutions following best practices (context managers, error handling, etc.).
- A setup cell that **creates all the data files** used in the problems.

⚠️ **Note:** The setup cell will (re)create several small sample files in the current directory and may overwrite files with the same names.

In [2]:
from pathlib import Path

# Create (or overwrite) all sample files used in this notebook.

base_path = Path('.')

# 1) temperatures.csv
temperatures_content = "\n".join([
    "date,temperature_c",
    "2023-01-01,1.5",
    "2023-01-02,.",
    "2023-01-03,-3.0",
    "2023-01-04,0.0",
    "2023-01-05,invalid",
])
with open(base_path / 'temperatures.csv', 'w', encoding='utf-8') as f:
    f.write(temperatures_content)

# 2) notes.txt
notes_content = "\n".join([
    "  This is the first note.",
    "",
    "Second note with some text.",
    "   ",
    "Third note.",
    "Fourth note.",
    "Fifth note.",
    "Sixth note.",
])
with open(base_path / 'notes.txt', 'w', encoding='utf-8') as f:
    f.write(notes_content)

# 3) app.log
app_log_content = "\n".join([
    "2025-01-01 10:00:01,INFO,App started",
    "2025-01-01 10:00:03,WARNING,Low disk space",
    "2025-01-01 10:00:05,ERROR,Could not open file",
    "2025-01-01 10:01:01,INFO,User logged in",
    "2025-01-01 10:02:00,ERROR,Unexpected input",
])
with open(base_path / 'app.log', 'w', encoding='utf-8') as f:
    f.write(app_log_content)

# 4) novel.txt
novel_content = "\n".join([
    "The quick brown fox jumps over the lazy dog and the dog does not mind.",
    "And then the fox and the dog become friends and wander through the forest.",
    "In the forest, the trees are tall and the wind is soft and the night is quiet.",
])
with open(base_path / 'novel.txt', 'w', encoding='utf-8') as f:
    f.write(novel_content)

# 5) rates_2025-01-01.csv, rates_2025-01-02.csv, rates_2025-01-03.csv
rates_files = {
    'rates_2025-01-01.csv': [
        'date,currency,rate',
        '2025-01-01,USD,1.10',
        '2025-01-01,EUR,1.00',
        '2025-01-01,GBP,0.85',
    ],
    'rates_2025-01-02.csv': [
        'date,currency,rate',
        '2025-01-02,USD,1.11',
        '2025-01-02,EUR,1.00',
        '2025-01-02,GBP,0.86',
    ],
    'rates_2025-01-03.csv': [
        'date,currency,rate',
        '2025-01-03,USD,1.15',
        '2025-01-03,EUR,1.01',
        '2025-01-03,GBP,0.87',
    ],
}
for filename, lines in rates_files.items():
    with open(base_path / filename, 'w', encoding='utf-8') as f:
        f.write("\n".join(lines))

# 6) measurements.csv
measurements_content = "\n".join([
    'id,value',
    '1,3.14',
    '2,2.71',
    '3,abc',
    '4,.',
    '5,10.0',
    'six,6.0',
])
with open(base_path / 'measurements.csv', 'w', encoding='utf-8') as f:
    f.write(measurements_content)

# 7) students.csv
students_content = "\n".join([
    'name,age',
    'Alice,20',
    'Bob,22',
    'Charlie,invalid',
])
with open(base_path / 'students.csv', 'w', encoding='utf-8') as f:
    f.write(students_content)

# 8) products.csv
products_content = "\n".join([
    'id,price',
    'p1,9.99',
    'p2,14.50',
    'p3,not_a_price',
])
with open(base_path / 'products.csv', 'w', encoding='utf-8') as f:
    f.write(products_content)

print('Sample data files created (or overwritten).')


Sample data files created (or overwritten).


**General Guidelines (Best Practices):**

- Prefer `with open(...) as f:` over calling `open` / `close` manually.
- Iterate over the file object (`for line in f:`) where possible.
- Handle bad or missing data with `try`/`except` where appropriate.
- Avoid loading huge files entirely into memory unless you really need to.
- Guard against missing files with `try`/`except FileNotFoundError`.
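
As a rough illustration of how these guidelines combine, here is a minimal sketch. The helper name `count_non_empty_lines` and the demo file names are made up for this example and are not used elsewhere in the notebook.

```python
import tempfile
from pathlib import Path


def count_non_empty_lines(file_name):
    """Count non-empty lines; return 0 if the file is missing."""
    count = 0
    try:
        # Context manager: the file is closed even if an error occurs.
        with open(file_name, encoding='utf-8') as f:
            # Iterating over the file object reads one line at a time.
            for line in f:
                if line.strip():
                    count += 1
    except FileNotFoundError:
        print(f"File not found: {file_name!r}")
    return count


# Demo in a throwaway directory (hypothetical file names).
demo_dir = Path(tempfile.mkdtemp())
demo_file = demo_dir / 'demo.txt'
demo_file.write_text("first\n\n   \nsecond\n", encoding='utf-8')

existing_count = count_non_empty_lines(demo_file)
missing_count = count_non_empty_lines(demo_dir / 'missing.txt')
```

Here `existing_count` covers only the two lines with visible text, and the missing file produces a message and a count of zero instead of a traceback.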

#### Problem 1 — Filter and Aggregate CSV Data

You are given a CSV file `temperatures.csv` with the following structure:

```text
date,temperature_c
2023-01-01,1.5
2023-01-02,.
2023-01-03,-3.0
...
```

* The first line is a header.
* A `.` means the temperature value is missing.

**Task**

1. Write a function `load_temperatures(file_name)` that:
   - Opens the file using a context manager.
   - Skips the header line.
   - Reads each remaining line, strips the newline, and splits on `','`.
   - Converts valid temperature values to `float`.
   - Ignores rows where the value is `.` or cannot be converted.
   - Returns a list of `(date_str, temperature_float)` tuples.

2. Using that function:
   - Compute the minimum, maximum, and average temperature.
   - Print them in a nicely formatted way (e.g. with 2 decimal places).

In [3]:
def load_temperatures(file_name):
    """Return a list of (date_str, temperature_float) from a CSV file.

    - Skips the header row.
    - Skips rows with missing or invalid numeric data.
    - Handles missing file gracefully.
    """
    data = []
    try:
        with open(file_name, encoding='utf-8') as f:
            # Skip header line (if present)
            _ = next(f, None)
            for line in f:
                line = line.strip()
                if not line:
                    continue
                parts = line.split(',', 1)
                if len(parts) != 2:
                    continue
                date_str, temp_str = parts
                temp_str = temp_str.strip()
                # Skip missing values
                if temp_str == '.':
                    continue
                try:
                    temperature = float(temp_str)
                except ValueError:
                    # Bad numeric data, skip row
                    continue
                data.append((date_str, temperature))
    except FileNotFoundError:
        print(f"File not found: {file_name!r}")
    return data


file_name = 'temperatures.csv'

temperatures = load_temperatures(file_name)
if temperatures:
    values = [t for _, t in temperatures]
    min_temp = min(values)
    max_temp = max(values)
    avg_temp = sum(values) / len(values)
    print(f'Min temperature: {min_temp:.2f} C')
    print(f'Max temperature: {max_temp:.2f} C')
    print(f'Average temperature: {avg_temp:.2f} C')
else:
    print('No valid temperature data found (or file missing).')


Min temperature: -3.00 C
Max temperature: 1.50 C
Average temperature: -0.50 C
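
A side note on the conversion step: `float('.')` raises `ValueError`, so the `try`/`except` alone would already reject the `.` marker. On the other hand, `float()` accepts strings such as `'nan'` and `'inf'`, which would distort the minimum, maximum, and average. The sketch below shows one way to reject those too. The helper `parse_temperature` is hypothetical, and treating non-finite values as invalid is an assumption about what the data should allow.

```python
import math


def parse_temperature(text):
    """Return a finite float, or None for missing or invalid input."""
    try:
        value = float(text.strip())
    except ValueError:
        # Covers '.', 'invalid', and empty strings alike.
        return None
    # float() also accepts 'nan' and 'inf'; treat those as invalid here.
    return value if math.isfinite(value) else None


samples = ['1.5', '.', '-3.0', 'invalid', 'nan', 'inf', '']
parsed = [parse_temperature(s) for s in samples]
```

Only `'1.5'` and `'-3.0'` come back as numbers; every other sample maps to `None`.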


#### Problem 2 — Lazy Line Reader

Sometimes you don't want to load all lines into memory at once.

**Task**

1. Implement a generator function `iter_clean_lines(file_name)` that:
   - Uses a context manager to open the file.
   - Iterates over the file line by line.
   - Strips trailing whitespace (including `\n`) from each line.
   - Yields only non-empty lines.
   - Handles missing file gracefully.

2. Demonstrate the generator by printing the first 5 non-empty lines of `notes.txt` without reading the entire file into memory.

In [4]:
from itertools import islice


def iter_clean_lines(file_name):
    """Yield stripped, non-empty lines from the given text file.

    If the file does not exist, print a message and yield nothing.
    """
    try:
        with open(file_name, encoding='utf-8') as f:
            for line in f:
                line = line.rstrip()
                if line:
                    yield line
    except FileNotFoundError:
        print(f"File not found: {file_name!r}")
        return


file_name = 'notes.txt'

print('First 5 non-empty lines:')
for line in islice(iter_clean_lines(file_name), 5):
    print(line)


First 5 non-empty lines:
  This is the first note.
Second note with some text.
Third note.
Fourth note.
Fifth note.
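
One consequence of writing the reader as a generator: the function body, including the call to `open`, does not run until iteration starts. A missing-file message therefore appears on first use, not when the generator is created. The sketch below illustrates this with `iter_lines`, a simplified stand-in for `iter_clean_lines`, and a path in a fresh temporary directory that is assumed not to exist.

```python
import tempfile
from pathlib import Path


def iter_lines(file_name):
    """Yield right-stripped lines; print a message if the file is missing."""
    try:
        with open(file_name, encoding='utf-8') as f:
            for line in f:
                yield line.rstrip()
    except FileNotFoundError:
        print(f"File not found: {file_name!r}")


# A path inside a brand-new temporary directory, so the file cannot exist.
missing = Path(tempfile.mkdtemp()) / 'missing.txt'

gen = iter_lines(missing)   # nothing is opened or printed yet
lines = list(gen)           # the body runs now; the message is printed here
```

After the second statement `lines` is an empty list, which matches the "yield nothing" behaviour described in the docstring above.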


#### Problem 3 — Parsing a Simple Log File

You have an application log file `app.log` where each line looks like this:

```text
2025-01-01 10:00:01,INFO,App started
2025-01-01 10:00:03,WARNING,Low disk space
2025-01-01 10:00:05,ERROR,Could not open file
...
```

The format is:

```text
timestamp,level,message
```

**Task**

1. Write a function `count_log_levels(file_name)` that:
   - Opens the log file with a context manager.
   - Reads it line by line.
   - Splits each line into `timestamp`, `level`, `message` using `split(',', 2)`.
   - Counts how many times each log `level` appears.
   - Returns a dictionary mapping each level to its count (e.g. `{'INFO': 10, 'ERROR': 2}`).
   - Handles missing file gracefully.

2. Call the function on `app.log` and print the result in a readable way, e.g.:

```text
INFO: 10
WARNING: 3
ERROR: 2
```

In [5]:
def count_log_levels(file_name):
    """Return a dict mapping log level -> count from a log file.

    Handles missing file gracefully.
    """
    counts = {}
    try:
        with open(file_name, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                parts = line.split(',', 2)
                if len(parts) != 3:
                    # Malformed line (missing level or message), skip it
                    continue
                _, level, _ = parts
                level = level.strip()
                if not level:
                    continue
                counts[level] = counts.get(level, 0) + 1
    except FileNotFoundError:
        print(f"File not found: {file_name!r}")
    return counts


file_name = 'app.log'

level_counts = count_log_levels(file_name)
if level_counts:
    for level in sorted(level_counts):
        print(f'{level}: {level_counts[level]}')
else:
    print('No log data found (or file missing).')


ERROR: 2
INFO: 2
WARNING: 1
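
The solution above counts with a plain dictionary and `dict.get`. As an alternative, the counting step could be sketched with `collections.Counter`. The helper `count_levels` is hypothetical and works on any iterable of lines. The sample lines are illustrative: the third message was extended with a comma to show that `split(',', 2)` keeps the message in one piece.

```python
from collections import Counter


def count_levels(lines):
    """Count the second comma-separated field of each well-formed line."""
    counts = Counter()
    for line in lines:
        # maxsplit=2: commas inside the message do not create extra fields.
        parts = line.strip().split(',', 2)
        if len(parts) == 3 and parts[1].strip():
            counts[parts[1].strip()] += 1
    return counts


sample_lines = [
    "2025-01-01 10:00:01,INFO,App started",
    "2025-01-01 10:00:03,WARNING,Low disk space",
    "2025-01-01 10:00:05,ERROR,Could not open file, retrying",
    "not a log line",
]
level_counts_demo = count_levels(sample_lines)
```

A `Counter` is a `dict` subclass, so the printing loop from the solution would work on it unchanged.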


#### Problem 4 — Count Substring in a Large File (Chunked Reading)

You need to count how many times a substring appears in a potentially very large text file `novel.txt`.  
You don't want to load the whole file into memory at once.

**Task**

1. Write a function `count_substring_in_file(file_name, substring, chunk_size=4096)` that:
   - Opens the file using a context manager.
   - Repeatedly reads chunks of size `chunk_size` using `read(chunk_size)` until the end of the file.
   - Counts how many times `substring` occurs in the **entire** file.
   - Handles missing file gracefully.

   > Hint: Be careful with occurrences of `substring` that might span the boundary between two chunks.  
   > One simple strategy is to remember the last `len(substring) - 1` characters from the previous chunk and prepend them to the next chunk before counting.

2. Test your function with a small `novel.txt` and a few substrings (e.g. `'the'`, `'and'`).
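
To see why the hint above matters, here is a small in-memory sketch. The text and the chunk size of 4 are chosen only so that `'the'` straddles a chunk boundary, and the carry-over slice assumes the substring is longer than one character.

```python
text = "on the road"
chunk_size = 4
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
# chunks == ['on t', 'he r', 'oad']: 'the' is split across the first two chunks

# Naive approach: count inside each chunk separately.
naive_total = sum(chunk.count('the') for chunk in chunks)

# Carry-over approach: prepend the last len(substring) - 1 characters
# of the previous window to the next chunk before counting.
substring = 'the'
carry_total = 0
tail = ''
for chunk in chunks:
    window = tail + chunk
    carry_total += window.count(substring)
    tail = window[-(len(substring) - 1):]
```

The naive total misses the match entirely, while the carry-over total finds it once.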

In [6]:
def count_substring_in_file(file_name, substring, chunk_size=4096):
    """Return the number of times substring appears in the file.

    The implementation is memory-friendly and handles matches
    that span across chunk boundaries. Handles missing file gracefully.
    """
    if not substring:
        raise ValueError('substring must not be empty')

    total = 0
    tail = ''
    sub_len = len(substring)

    try:
        with open(file_name, encoding='utf-8') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                # Combine tail from previous chunk with current chunk
                text = tail + chunk
                total += text.count(substring)
                # Keep the last sub_len - 1 characters for next round
                if sub_len > 1:
                    tail = text[-(sub_len - 1):]
                else:
                    tail = ''
    except FileNotFoundError:
        print(f"File not found: {file_name!r}")
        return 0

    return total


file_name = 'novel.txt'

print("Occurrences of 'the':", count_substring_in_file(file_name, 'the'))
print("Occurrences of 'and':", count_substring_in_file(file_name, 'and'))


Occurrences of 'the': 10
Occurrences of 'and': 6
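
One edge case is worth knowing about. `str.count` counts non-overlapping matches, and for a substring that can overlap itself (such as `'aa'`) the carry-over strategy can count a match that `str.count` on the whole text would not. For example, the text `'aaa'` read in chunks of 2 gives 2 with the solution above, while `'aaa'.count('aa')` is 1. The sketch below is one possible way to match `str.count` exactly. The helper `count_in_chunks` is hypothetical and takes an iterable of string chunks, not a file name.

```python
def count_in_chunks(chunks, substring):
    """Count non-overlapping matches across an iterable of string chunks."""
    if not substring:
        raise ValueError('substring must not be empty')
    keep = len(substring) - 1
    total = 0
    tail = ''
    for chunk in chunks:
        text = tail + chunk
        pos = 0
        while True:
            idx = text.find(substring, pos)
            if idx == -1:
                break
            total += 1
            pos = idx + len(substring)  # continue after this match
        # Carry over at most `keep` characters, and never characters
        # that already belong to a counted match.
        tail = text[max(pos, len(text) - keep):]
    return total


overlap_count = count_in_chunks(['aa', 'a'], 'aa')
```

The design choice is to track where the last match ended, so the carried tail never re-offers characters that a previous match has consumed.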


#### Problem 5 — Merging Data from Multiple Files

Suppose you receive daily CSV files with exchange rates, named like this:

- `rates_2025-01-01.csv`
- `rates_2025-01-02.csv`
- `rates_2025-01-03.csv`
- ...

Each file has the structure:

```text
date,currency,rate
2025-01-01,USD,1.10
2025-01-01,EUR,1.00
...
```

**Task**

1. Write a function `load_rates(file_names)` that:
   - Accepts a list of file names.
   - For each file:
     - Uses a context manager to open the file.
     - Skips the header.
     - Reads each line, strips it, and splits on `','`.
     - Converts `rate` to `float`.
     - Collects tuples of `(date_str, currency_str, rate_float)`.
   - Returns a single list containing data from **all** the files.
   - Handles missing files gracefully.

2. Using `load_rates`, given a list of several file names, build a dictionary mapping each currency to its **average** rate across all dates (e.g. `{'USD': 1.12, 'EUR': 1.0, ...}`).

In [7]:
from collections import defaultdict


def load_rates(file_names):
    """Load (date, currency, rate) tuples from multiple CSV files.

    Skips lines with invalid data and prints a message for missing files.
    """
    all_data = []
    for file_name in file_names:
        try:
            with open(file_name, encoding='utf-8') as f:
                # Skip header
                _ = next(f, None)
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    parts = line.split(',')
                    if len(parts) != 3:
                        continue
                    date_str, currency, rate_str = parts
                    try:
                        rate = float(rate_str)
                    except ValueError:
                        continue
                    all_data.append((date_str, currency, rate))
        except FileNotFoundError:
            print(f"File not found (skipping): {file_name!r}")
            continue
    return all_data


file_names = [
    'rates_2025-01-01.csv',
    'rates_2025-01-02.csv',
    'rates_2025-01-03.csv',
]

all_rates = load_rates(file_names)

if all_rates:
    sums = defaultdict(float)
    counts = defaultdict(int)

    for _, currency, rate in all_rates:
        sums[currency] += rate
        counts[currency] += 1

    avg_rates = {currency: sums[currency] / counts[currency] for currency in counts}

    print('Average rate per currency:')
    for currency in sorted(avg_rates):
        print(f'{currency}: {avg_rates[currency]:.4f}')
else:
    print('No rate data found (files missing or empty).')


Average rate per currency:
EUR: 1.0033
GBP: 0.8600
USD: 1.1200
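
The file list in the solution is written out by hand. When the files follow a naming pattern like the one above, the list could instead be discovered with `Path.glob`. This is only a sketch: the temporary directory and the empty demo files are made up for the example.

```python
import tempfile
from pathlib import Path

# Throwaway directory with a few empty files (names follow the pattern above).
demo_dir = Path(tempfile.mkdtemp())
for name in ['rates_2025-01-02.csv', 'rates_2025-01-01.csv', 'notes.txt']:
    (demo_dir / name).write_text('', encoding='utf-8')

# glob() yields matching paths in arbitrary order, so sort them explicitly.
# ISO dates (YYYY-MM-DD) in the name make alphabetical order chronological.
rate_files = sorted(demo_dir.glob('rates_*.csv'))
rate_names = [path.name for path in rate_files]
```

Since `open` accepts `Path` objects, a list like `rate_files` should be usable as the `file_names` argument of `load_rates`.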


#### Problem 6 — Robust CSV Parsing with Error Logging

You are processing a file `measurements.csv` with the following structure:

```text
id,value
1,3.14
2,2.71
3,abc
4,.
5,10.0
...
```

Some rows contain invalid numeric data (`abc`, `.`, empty values, etc.).

**Task**

1. Write a function `load_measurements(data_file, error_file)` that:
   - Opens `data_file` for reading *and* `error_file` for writing using two context managers in a single `with` statement.
   - Skips the header of `data_file`.
   - For each remaining line:
     - Strips the line and skips it if it is empty.
     - Attempts to parse `id` as `int` and `value` as `float`.
     - If both conversions succeed, stores the result in a list of `(id_int, value_float)` tuples.
     - If a conversion fails, writes the **original line** to `error_file` and continues.
   - Returns the list of successfully parsed tuples.
   - Handles missing `data_file` gracefully (and does not create an empty `error_file` in that case).

2. Call `load_measurements('measurements.csv', 'bad_measurements.csv')` and then:
   - Print how many rows were valid.
   - Print how many rows were written to the error file.

In [8]:
def load_measurements(data_file, error_file):
    """Load (id, value) tuples from data_file and log bad rows to error_file.

    Handles missing data_file gracefully.
    """
    good_rows = []

    try:
        with open(data_file, encoding='utf-8') as data_f, open(error_file, 'w', encoding='utf-8') as err_f:
            # Skip header
            _ = next(data_f, None)
            for line in data_f:
                # Preserve the original line (without trailing newline) for logging
                raw_line = line.rstrip('\n')
                stripped = raw_line.strip()
                if not stripped:
                    continue
                parts = stripped.split(',', 1)
                if len(parts) != 2:
                    err_f.write(raw_line + '\n')
                    continue
                id_str, value_str = parts
                id_str = id_str.strip()
                value_str = value_str.strip()
                try:
                    id_int = int(id_str)
                    value_float = float(value_str)
                except ValueError:
                    err_f.write(raw_line + '\n')
                    continue
                good_rows.append((id_int, value_float))
    except FileNotFoundError:
        print(f"Data file not found: {data_file!r}")

    return good_rows


data_file = 'measurements.csv'
error_file = 'bad_measurements.csv'

good_rows = load_measurements(data_file, error_file)
valid_count = len(good_rows)

# Count invalid rows if the error file exists
invalid_count = 0
try:
    with open(error_file, encoding='utf-8') as f:
        for _ in f:
            invalid_count += 1
except FileNotFoundError:
    # If the data file was missing, we might not have an error file either
    invalid_count = 0

print(f'Valid rows: {valid_count}')
print(f'Invalid rows (logged to {error_file}): {invalid_count}')


Valid rows: 3
Invalid rows (logged to bad_measurements.csv): 3
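
The cell above counts invalid rows by re-reading the error file. If the data file is missing but an error file from an earlier run still exists, that count would reflect the old file. One possible alternative, sketched here under the assumption that changing the return value is acceptable, is to count bad rows while writing them. The helper `split_rows` is hypothetical, and an `io.StringIO` object stands in for the error file.

```python
import io


def split_rows(lines, err_f):
    """Return (good_rows, bad_count); write each bad line to err_f."""
    good_rows = []
    bad_count = 0
    for line in lines:
        raw_line = line.rstrip('\n')
        if not raw_line.strip():
            continue
        parts = raw_line.split(',', 1)
        try:
            id_str, value_str = parts
            row = (int(id_str), float(value_str))
        except ValueError:
            # Wrong number of fields or a failed conversion.
            err_f.write(raw_line + '\n')
            bad_count += 1
            continue
        good_rows.append(row)
    return good_rows, bad_count


data_lines = ['1,3.14\n', '3,abc\n', '\n', 'six,6.0\n', '5,10.0\n']
errors = io.StringIO()  # stands in for the error file
good, bad = split_rows(data_lines, errors)
```

With this shape the caller gets both numbers from a single pass and never has to open the error file again.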


#### Problem 7 — Reusing File-Reading Logic

You need to work with several different CSV files that all share the same basic structure:

```text
# Example: any_data.csv
col1,col2,col3
a,1,2.0
b,3,4.5
...
```

**Task**

1. Write a general-purpose function `read_csv_rows(file_name, has_header=True)` that:
   - Opens the file using a context manager.
   - Optionally skips the first line if `has_header` is `True`.
   - Iterates over remaining lines:
     - Strips the line, skips it if empty.
     - Splits on `','`.
     - Yields the resulting list of string fields (e.g. `['a', '1', '2.0']`).
   - Handles missing file gracefully.

2. Show how you can reuse this function for two different files:
   - `students.csv`:
     ```text
     name,age
     Alice,20
     Bob,22
     ```
   - `products.csv`:
     ```text
     id,price
     p1,9.99
     p2,14.50
     ```

   For each file, use `read_csv_rows` to:
   - Print all rows.
   - Convert numeric fields (`age`, `price`) to the appropriate numeric type.

In [9]:
def read_csv_rows(file_name, has_header=True):
    """Yield lists of string fields for each non-empty row in a CSV file.

    Handles missing file gracefully.
    """
    try:
        with open(file_name, encoding='utf-8') as f:
            if has_header:
                _ = next(f, None)
            for line in f:
                line = line.strip()
                if not line:
                    continue
                fields = [field.strip() for field in line.split(',')]
                yield fields
    except FileNotFoundError:
        print(f"File not found: {file_name!r}")
        return


# Example usage with students.csv and products.csv:

students_file = 'students.csv'
products_file = 'products.csv'

students = []
for row in read_csv_rows(students_file):
    if len(row) != 2:
        continue
    name, age_str = row
    try:
        age = int(age_str)
    except ValueError:
        continue
    students.append((name, age))

products = []
for row in read_csv_rows(products_file):
    if len(row) != 2:
        continue
    product_id, price_str = row
    try:
        price = float(price_str)
    except ValueError:
        continue
    products.append((product_id, price))

print('Students:')
for name, age in students:
    print(f'- {name}: {age} years old')

print('\nProducts:')
for product_id, price in products:
    print(f'- {product_id}: {price:.2f}')


Students:
- Alice: 20 years old
- Bob: 22 years old

Products:
- p1: 9.99
- p2: 14.50
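
Splitting on `','` works for the simple files in this notebook, but it breaks as soon as a field contains a quoted comma. The standard library's `csv` module handles that case. The sketch below compares the two approaches on a made-up in-memory sample; the row `"Smith, John",42` is not part of `students.csv`.

```python
import csv
import io

# In-memory stand-in for a file; the quoted name contains a comma.
sample = io.StringIO('name,age\n"Smith, John",42\nAlice,20\n')

# Manual splitting: the quoted field is cut in two.
naive_rows = [line.strip().split(',') for line in sample][1:]

# csv.reader: quoting is respected.
sample.seek(0)
reader = csv.reader(sample)
next(reader, None)  # skip header
csv_rows = list(reader)
```

Here `naive_rows[0]` has three fragments, while `csv_rows[0]` has the two intended fields. When reading a real file with `csv.reader`, the `csv` documentation recommends opening it with `newline=''`.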
