### More Examples Reading CSV Files — Advanced Exercises


In this notebook we'll practice more advanced (but still very practical) patterns for working with CSV files using Python's built-in [`csv`](https://docs.python.org/3/library/csv.html) module.

We will assume that the following files are available in the current working directory:

- `nasdaq.csv`
- `st-2001est-01.csv` (US Census data)

Each exercise is followed immediately by a reference solution written with common best practices (context managers, `newline=''` when opening CSVs, small reusable functions, and robust type conversion).
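
To see why `newline=''` matters, here is a minimal sketch (not part of the original exercises) that writes a field containing an embedded `\r\n` to a temporary file and reads it back twice: once with `newline=''` and once without. The file name `demo.csv` and the sample rows are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical sample data: the second field contains an embedded line break
rows = [["id", "note"], ["1", "first line\r\nsecond line"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")

    # newline='' on write: the csv module controls the line endings itself
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

    # newline='' on read: embedded line breaks inside quoted fields survive intact
    with open(path, newline="") as f:
        round_trip = list(csv.reader(f))

    # Default newline handling: "\r\n" inside the quoted field is translated to "\n"
    with open(path) as f:
        translated = list(csv.reader(f))

print(round_trip == rows)      # the data comes back unchanged
print(repr(translated[1][1]))  # the embedded "\r\n" has become "\n"
```

With `newline=''` the data round-trips exactly; without it, the embedded line break is silently rewritten.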


In [1]:
import csv
import re

from collections import namedtuple
from io import StringIO
from statistics import mean


#### Exercise 1 — Parsing `nasdaq.csv` into dictionaries

The earlier example parsed `nasdaq.csv` into a list of lists and manually converted the last column to a `float`.

**Task:**

1. Use `csv.DictReader` to parse `nasdaq.csv` into a list of dictionaries, where each dictionary represents one row.
2. Convert the last column of the file (`Round Lot Size` in the sample output shown with the solution) from a string to `float`.
3. Return the parsed data from a function called `parse_nasdaq_dict`.

Your function should:

- Accept a filename.
- Not modify the header names (use them as-is from the CSV file).
- Leave non-numeric fields as strings.
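
As a warm-up, the following sketch shows how `csv.DictReader` behaves on a small in-memory sample. The rows are invented for illustration and are not taken from `nasdaq.csv`. Note in particular that a row with missing trailing fields gets `None` for those keys, which any conversion code has to allow for.

```python
import csv
from io import StringIO

# Hypothetical data; the second data row is missing its last field
demo_csv = StringIO(
    "Symbol,Company Name,Round Lot Size\n"
    "AAA,Example One,100\n"
    "BBB,Example Two\n"
)

demo_reader = csv.DictReader(demo_csv)
demo_rows = list(demo_reader)

print(demo_reader.fieldnames)          # header names, exactly as in the file
print(demo_rows[0]["Round Lot Size"])  # values are always strings: '100'
print(demo_rows[1]["Round Lot Size"])  # missing fields are filled with None
```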


##### Solution


In [2]:
def parse_nasdaq_dict(file_name):
    """Parse nasdaq-style CSV into a list of dictionaries.

    The function assumes the CSV has a header row and that the last
    column contains numeric values that should be converted to float.
    """
    records = []

    # newline='' is recommended by the csv module docs
    with open(file_name, newline='') as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        if not fieldnames:
            return records  # empty file

        last_field = fieldnames[-1]

        for row in reader:
            # Make a shallow copy before mutating (good practice when in doubt)
            row = dict(row)
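            # Convert the last column to float if the value is not empty.
            # DictReader pads short rows with None (its default restval),
            # so guard against None before calling strip().
            value = (row.get(last_field) or '').strip()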
            if value:
                row[last_field] = float(value)
            else:
                row[last_field] = None
            records.append(row)

    return records


# Example usage (displays the first 5 parsed rows; requires nasdaq.csv in the working directory)
nasdaq_file = 'nasdaq.csv'
nasdaq_records = parse_nasdaq_dict(nasdaq_file)
nasdaq_records[:5]


[{'Symbol': 'AAIT',
  'Company Name': 'iShares MSCI All Country Asia Information Technology Index Fund',
  'Security Name': 'iShares MSCI All Country Asia Information Technology Index Fund',
  'Market Category': 'G',
  'Test Issue': 'N',
  'Financial Status': 'N',
  'Round Lot Size': 100.0},
 {'Symbol': 'AAL',
  'Company Name': 'American Airlines Group, Inc.',
  'Security Name': 'American Airlines Group, Inc. - Common Stock',
  'Market Category': 'Q',
  'Test Issue': 'N',
  'Financial Status': 'N',
  'Round Lot Size': 100.0},
 {'Symbol': 'AAME',
  'Company Name': 'Atlantic American Corporation',
  'Security Name': 'Atlantic American Corporation - Common Stock',
  'Market Category': 'G',
  'Test Issue': 'N',
  'Financial Status': 'N',
  'Round Lot Size': 100.0},
 {'Symbol': 'AAOI',
  'Company Name': 'Applied Optoelectronics, Inc.',
  'Security Name': 'Applied Optoelectronics, Inc. - Common Stock',
  'Market Category': 'G',
  'Test Issue': 'N',
  'Financial Status': 'N',
  'Round Lot Siz

#### Exercise 2 — Streaming data and computing statistics

So far we have loaded all the CSV data into memory at once. For large files, it is more memory-efficient to *stream* the rows, processing them one at a time.

**Task:**

1. Write a generator function `iter_nasdaq_last_prices(file_name)` that:
   - Uses `csv.reader` to iterate over `nasdaq.csv`.
   - Skips the header.
   - Yields the last column converted to `float` for each row.
2. Use that generator to compute the average of the last prices without ever storing the full list of prices in memory at once.

*Hint:* A generator function is a function whose body contains `yield`. Calling it returns an iterator that produces its values lazily, one at a time.
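
The hint can be illustrated with a small sketch that is independent of the exercise. The helper name `iter_floats` and the sample strings are assumptions for illustration only. The point is that values are produced on demand and that a generator can be consumed only once.

```python
def iter_floats(strings):
    """Yield each string converted to float, one at a time."""
    for text in strings:
        yield float(text)


values = iter_floats(["1.5", "2.5", "5.0"])

first = next(values)       # only the first item has been converted so far
rest_total = sum(values)   # consumes the remaining items without building a list
leftover = list(values)    # the generator is now exhausted

print(first, rest_total, leftover)
```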


##### Solution


In [3]:
def iter_nasdaq_last_prices(file_name):
    """Yield last prices from a nasdaq-style CSV, one at a time.

    Assumes the last column is numeric and contains the last sale price.
    """
    with open(file_name, newline='') as f:
        reader = csv.reader(f)
        # Skip header row
        headers = next(reader, None)
        if headers is None:
            return  # empty file

        for row in reader:
            if not row:
                continue  # skip empty lines
            try:
                yield float(row[-1])
            except ValueError:
                # If conversion fails, skip that row (or log, depending on your needs)
                continue


def average_last_price(file_name):
    prices = list(iter_nasdaq_last_prices(file_name))
    if not prices:
        return None
    return sum(prices) / len(prices)


# Example usage (requires nasdaq.csv in the working directory)
avg_price = average_last_price(nasdaq_file)
avg_price


99.96966632962588

Note: `average_last_price` converts the generator to a list, which gives up the memory savings of streaming. That keeps the code simple and the focus on the generator pattern itself. A truly memory-conscious implementation keeps only a running sum and a count, as in the following variation:


In [4]:
def average_last_price_streaming(file_name):
    total = 0.0
    count = 0
    for price in iter_nasdaq_last_prices(file_name):
        total += price
        count += 1
    return (total / count) if count else None


# Example usage
average_last_price_streaming(nasdaq_file)


99.96966632962588
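
The `mean` function imported from `statistics` in the first cell also accepts an iterator, so it can be fed directly from a generator. The sketch below uses made-up prices rather than `nasdaq.csv`. Whether `mean` buffers the values internally is an implementation detail, so the explicit running sum above remains the safer choice when memory is the concern.

```python
from statistics import StatisticsError, mean


def iter_demo_prices():
    """Yield a few hypothetical prices, standing in for iter_nasdaq_last_prices."""
    for text in ["100", "99.5", "100.5"]:
        yield float(text)


demo_average = mean(iter_demo_prices())
print(demo_average)

# Unlike the functions above, mean() raises on empty input instead of returning None
try:
    mean(iter([]))
    empty_raises = False
except StatisticsError:
    empty_raises = True
print(empty_raises)
```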

#### Exercise 3 — Parsing Census data into `namedtuple`s

Previously, we parsed `st-2001est-01.csv` into lists, cleaning up the thousands separators and converting strings to `int`.

**Task:**

1. Define a `namedtuple` type called `CensusRow` with fields that match the header row of the census CSV file.
2. Write a function `load_census_records(file_name)` that:
   - Uses `csv.reader` to read the file.
   - Uses the first row as headers.
   - Cleans any numeric columns by removing commas and converting them to `int`.
   - Returns a list of `CensusRow` objects.
3. Print the first three records to verify that parsing works correctly.

You may assume that **all columns except the first** are numeric.


##### Solution


In [5]:
def load_census_records(file_name):
    """Load census CSV data into a list of CensusRow namedtuples.

    Assumes column 0 is a text label and all remaining columns are integers
    possibly using commas as thousands separators.
    """
    with open(file_name, newline='') as f:
        reader = csv.reader(f)
        headers = next(reader, None)
        if headers is None:
            return []

        # Normalize headers to valid Python identifiers
        def normalize_header(h):
            name = h.strip()
            # Replace spaces with underscores
            name = name.replace(' ', '_')
            # Remove commas
            name = name.replace(',', '')
            # Remove any remaining non-alphanumeric/underscore chars
            name = re.sub(r'\W', '', name)
            # If it starts with a digit, prefix with 'f_'
            if name and name[0].isdigit():
                name = 'f_' + name
            # Fallback if the header was empty or all invalid chars
            if not name:
                name = 'field'
            return name

        normalized = [normalize_header(h) for h in headers]

        # Ensure field names are unique
        seen = {}
        final_headers = []
        for name in normalized:
            base = name
            if base not in seen:
                seen[base] = 0
                final_headers.append(base)
            else:
                seen[base] += 1
                final_headers.append(f"{base}_{seen[base]}")

        CensusRow = namedtuple('CensusRow', final_headers)

        records = []
        for row in reader:
            if not row:
                continue

            area = row[0].strip()
            numeric_fields = []
            for field in row[1:]:
                cleaned = field.replace(',', '').strip()
                numeric_fields.append(int(cleaned))

            record = CensusRow(area, *numeric_fields)
            records.append(record)

    return records



census_file = 'st-2001est-01.csv'
census_records = load_census_records(census_file)
census_records[:3]


[CensusRow(Geographic_Area='United States', July_1_2001_Estimate=284796887, July_1_2000_Estimate=282124631, April_1_2000_Population_Estimates_Base=281421906),
 CensusRow(Geographic_Area='Alabama', July_1_2001_Estimate=4464356, July_1_2000_Estimate=4451493, April_1_2000_Population_Estimates_Base=4447100),
 CensusRow(Geographic_Area='Alaska', July_1_2001_Estimate=634892, July_1_2000_Estimate=627601, April_1_2000_Population_Estimates_Base=626932)]
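
As an aside, `namedtuple` itself offers a cruder alternative to hand-written header normalization: with `rename=True`, any invalid field name is replaced by a positional name such as `_0`. The sketch below is only an illustration of that option, not the approach used in the solution. The type name `DemoRow` is hypothetical, and the values are copied from the Alaska row above.

```python
from collections import namedtuple

raw_headers = ['Geographic Area', 'July 1, 2001 Estimate', 'July 1, 2000 Estimate']

# rename=True swaps every invalid identifier for an underscore plus its position
DemoRow = namedtuple('DemoRow', raw_headers, rename=True)
demo = DemoRow('Alaska', 634892, 627601)

print(DemoRow._fields)  # ('_0', '_1', '_2')
print(demo._0)          # 'Alaska'

# The original headers can still be paired with the values when needed
labelled = dict(zip(raw_headers, demo))
print(labelled['July 1, 2001 Estimate'])
```

The trade-off is readability: positional names are always valid, but they lose the self-documenting quality that normalized names such as `Geographic_Area` provide.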

Once the data is in `namedtuple` form, working with it becomes more expressive and self-documenting:


In [6]:
# Example: find the area with the largest value in the last numeric column
if census_records:
    last_field_name = census_records[0]._fields[-1]
    max_row = max(census_records, key=lambda r: getattr(r, last_field_name))
    last_value = getattr(max_row, last_field_name)
    # Print explicitly: a bare expression inside an `if` block is not displayed by the notebook
    print(max_row, last_field_name, last_value)
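
A few more `namedtuple` conveniences are sketched below on a tiny hand-built sample. The type `DemoRow` and its short field names are assumptions for illustration. The numbers are taken from the Alabama and Alaska rows shown earlier.

```python
from collections import namedtuple
from operator import attrgetter

DemoRow = namedtuple('DemoRow', ['area', 'estimate_2001', 'estimate_2000'])
demo_rows = [
    DemoRow('Alabama', 4464356, 4451493),
    DemoRow('Alaska', 634892, 627601),
]

# Attribute access reads better than positional indexing such as row[1] - row[2]
growth = {r.area: r.estimate_2001 - r.estimate_2000 for r in demo_rows}

# attrgetter builds a key function from a field name
smallest = min(demo_rows, key=attrgetter('estimate_2001'))

# _asdict() converts a record to a dictionary, e.g. for csv.DictWriter or JSON output
as_dict = smallest._asdict()

print(growth)
print(smallest.area, as_dict)
```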


#### Exercise 4 — Detecting dialects with `csv.Sniffer`

Sometimes you receive a CSV file with an unknown delimiter and quoting rules. Python's `csv.Sniffer` can often detect these automatically.

We'll simulate such a file using an in-memory text buffer.

**Task:**

1. Create a multi-line string representing CSV data where:
   - Fields are separated by semicolons (`;`).
   - Fields containing spaces are quoted with double quotes.
2. Use `csv.Sniffer().sniff` to detect the dialect from a sample of this string.
3. Parse the data using the detected dialect.
4. Convert the last column to `int`.

Implement this in a function `parse_with_sniffer(text)` that returns a list of rows.


##### Solution


In [7]:
sample_text = """city;state;population
"New York";NY;8419600
"Los Angeles";CA;3980400
"Chicago";IL;2716000
"""


def parse_with_sniffer(text):
    """Parse CSV text using csv.Sniffer to detect the dialect.

    Returns a list of rows where the last column has been converted to int.
    """
    # Sniff the dialect from the first 1024 characters of the text
    dialect = csv.Sniffer().sniff(text[:1024])

    # Wrap the text in a StringIO buffer so the csv module can read it like a file
    buffer = StringIO(text)

    reader = csv.reader(buffer, dialect=dialect)
    rows = []
    headers = next(reader, None)
    if headers is not None:
        rows.append(headers)

    for row in reader:
        if not row:
            continue
        # Convert last field to int
        row[-1] = int(row[-1])
        rows.append(row)

    return rows


parsed_rows = parse_with_sniffer(sample_text)
parsed_rows


[['city', 'state', 'population'],
 ['New York', 'NY', 8419600],
 ['Los Angeles', 'CA', 3980400],
 ['Chicago', 'IL', 2716000]]

You can inspect the detected dialect to see what `Sniffer` discovered:


In [8]:
dialect = csv.Sniffer().sniff(sample_text[:1024])
dialect.delimiter, dialect.quotechar, dialect.doublequote, dialect.skipinitialspace


(';', '"', False, False)
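
`Sniffer.sniff` raises `csv.Error` when it cannot determine a delimiter, so code that handles arbitrary input usually wants a fallback. The sketch below shows one possible approach and is not part of the reference solution: the helper name `sniff_or_default` and the candidate delimiter set are assumptions chosen for illustration.

```python
import csv


def sniff_or_default(text, candidates=";,\t|", default=csv.excel):
    """Return the sniffed dialect, or `default` if sniffing fails."""
    try:
        # Restricting `delimiters` keeps the sniffer from choosing an unlikely character
        return csv.Sniffer().sniff(text[:1024], delimiters=candidates)
    except csv.Error:
        return default


semicolon_text = (
    'city;state;population\n'
    '"New York";NY;8419600\n'
    '"Los Angeles";CA;3980400\n'
    '"Chicago";IL;2716000\n'
)
single_column_text = "alpha\nbeta\ngamma\n"  # contains none of the candidate delimiters

detected = sniff_or_default(semicolon_text)
fallback = sniff_or_default(single_column_text)

print(repr(detected.delimiter), repr(fallback.delimiter))
```

Falling back to `csv.excel` (comma-delimited) is a guess like any other default; depending on the data source, raising a clear error may be the better choice.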

#### Exercise 5 — Handling malformed rows gracefully

Real-world CSV files often contain malformed rows (missing columns, bad numeric values, etc.). Your parsing code should handle these cases without crashing.

**Task:**

1. Write a function `safe_parse_census(file_name)` that:
   - Uses `csv.reader` to read the census CSV.
   - Uses the first row as headers (like before).
   - Tries to parse numeric columns by removing commas and converting to `int`.
   - If a row fails numeric conversion, prints a *warning* including the row number and skips that row.
2. Return the successfully parsed rows as a list of lists: `[header_row, row1, row2, ...]`.

This exercise focuses on robust error handling rather than data structures like `namedtuple`.


##### Solution


In [9]:
def safe_parse_census(file_name):
    """Parse census CSV file and skip malformed rows.

    Returns a list of rows, with the first element being the header row.
    Prints a warning message to stdout for any row that cannot be parsed.
    """
    results = []

    with open(file_name, newline='') as f:
        reader = csv.reader(f)
        headers = next(reader, None)
        if headers is None:
            return results

        results.append(headers)

        for row_number, row in enumerate(reader, start=2):  # header is row 1
            if not row:
                continue
            try:
                area = row[0]
                numeric = [int(field.replace(',', '').strip()) for field in row[1:]]
                parsed_row = [area] + numeric
                results.append(parsed_row)
            except (ValueError, IndexError) as ex:
                print(f"Warning: skipping malformed row {row_number}: {row!r} ({ex})")

    return results


safe_census_data = safe_parse_census(census_file)
# Show the header row followed by the first four parsed data rows
safe_census_data[:5]


[['Geographic Area',
  'July 1, 2001 Estimate',
  'July 1, 2000 Estimate',
  'April 1, 2000 Population Estimates Base'],
 ['United States', 284796887, 282124631, 281421906],
 ['Alabama', 4464356, 4451493, 4447100],
 ['Alaska', 634892, 627601, 626932],
 ['Arizona', 5307331, 5165274, 5130632]]
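
Printing warnings is fine in a notebook, but library code often needs to hand the problems back to the caller. The following sketch is a variation under stated assumptions, not the reference solution: the function name `parse_rows_collecting_errors` is hypothetical, it reads from any iterable of lines rather than a file name, and it also treats a wrong column count as an error. It uses the reader's `line_num` attribute to report where each problem occurred.

```python
import csv
from io import StringIO


def parse_rows_collecting_errors(lines):
    """Return (parsed_rows, errors); errors is a list of (line_number, message)."""
    reader = csv.reader(lines)
    headers = next(reader, None)
    if headers is None:
        return [], []

    good, errors = [headers], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        try:
            if len(row) != len(headers):
                raise ValueError(f"expected {len(headers)} fields, got {len(row)}")
            numbers = [int(field.replace(',', '').strip()) for field in row[1:]]
        except ValueError as ex:
            # line_num counts physical lines read so far, not records
            errors.append((reader.line_num, str(ex)))
            continue
        good.append([row[0]] + numbers)
    return good, errors


# Made-up sample with one bad number and one short row
demo_text = '''Geographic Area,Estimate A,Estimate B
Alabama,"4,464,356","4,451,493"
Broken,not-a-number,12
Short,"1,000"
Alaska,"634,892","627,601"
'''

good_rows, problems = parse_rows_collecting_errors(StringIO(demo_text))
print(good_rows)
print(problems)
```

Returning the errors keeps the parsing function free of side effects and lets the caller decide whether to log, display, or abort.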