# HW1 - Base Python

See Canvas for details on how to complete and submit this assignment.

## Introduction

This assignment bridges foundational Python concepts with the data manipulation skills you'll need throughout the course. You'll work with nested data structures, implement string processing algorithms, clean messy data, and recreate built-in Python functionality from scratch - all essential skills for data science.

### Learning Objectives

- Become familiar with professional code styling guidelines and use them to improve the readability and maintainability of your code
- Compare and contrast different data structures (lists vs. dictionaries) through hands-on implementation
- Practice test-driven development using assertions to verify code correctness
- Transform messy, real-world data into clean, analyzable formats
- Progress from explicit loops to Pythonic idioms that you'll use with pandas and numpy

The problems follow a deliberate progression from simple text processing to complex data transformations. You'll implement solutions using basic constructs first, then explore how Python's built-in tools and methods can simplify your code. This approach mirrors real-world development, where understanding the problem deeply leads to better solution choices.

Each function you write will be tested automatically, introducing the testing practices essential for reliable data analysis. By the end, you'll have practical experience with the exact patterns you'll use when cleaning datasets, aggregating results, and transforming data structures throughout your data science journey.

This assignment should take 3-5 hours to complete, toward the higher end for graduate students.

### Generative AI Allowance

You may use GenAI tools for brainstorming, explanations, and code sketches if you disclose it, understand it, and validate it. Your submission must represent your own work and you are solely responsible for its correctness.

### Scoring

- Reading: 30pts, 15 each
- Coding: 60pts, 15 each
- Reflection: 10pts

## Reading

### Markdown Guide

Complete the [Markdown Tutorial](https://www.markdowntutorial.com) and review the [Basic Syntax section of Markdown Guide](https://www.markdownguide.org/basic-syntax/).

Add a text / markdown cell below this one and give some brief insights from that experience. Include a numbered list, some text formatting (e.g. bold and/or italics), and a level 3 header in that, along with any other formatting you would like to include.

### Quick takeaways from the Markdown Tutorial

After going through **Markdown Tutorial** and the **Basic Syntax** guide, here are a few things that stood out:

1. **Headings are simple.** Use `#` for titles and more `#` for smaller headings (e.g., `###` = level 3).
2. *Emphasis matters.* You can make text *italic* with a single `_` or `*`, **bold** with a double `__` or `**`, and ***bold italic*** with three.
3. **Lists are flexible.**
   - Ordered lists use numbers: `1.`, `2.`, `3.`
   - Unordered lists use `-` or `*`
4. **Links & images** are concise:
   - Link: [Markdown Guide](https://www.markdownguide.org/basic-syntax/)
   - Image: ![Image](https://1drv.ms/i/c/dccdca78bd3e09e9/EekJPr14ys0ggNxAGgAAAAABb8a3lC81atP3G72twLQz-Q?e=fTZdHS)
5. **Code & quotes** improve clarity:
   - Inline code like `print("hello")`
   - Fenced code blocks:
     ```python
     for i in range(3):
         print(i)
     ```
   - Blockquote for callouts:
     > “Markdown is easy to write and easy to read.”

---

The tutorial reinforces a writing-first workflow: focus on content, then add minimal formatting for structure and readability. For a quick refresher, the tutorial is at https://www.markdowntutorial.com/ and the syntax guide is at https://www.markdownguide.org/basic-syntax/.

### Python Standards

Review [PEP 8, the Style Guide for Python](https://peps.python.org/pep-0008/), focusing on the elements that are familiar to you and most applicable at your current stage of development as a Python user.

Add a text / markdown cell below this one to share your main takeaways. You might address some of the following issues and/or entirely different topics.

- Why is code styling important for collaboration and maintainability?
- Which of the PEP 8 recommendations felt the most applicable to you?
- Which do you plan to implement?
- Which were the most surprising?

**Graduate students only:** Also review [Google's Python Style Guide](https://google.github.io/styleguide/pyguide.html) and consider it in your response.

### PEP 8 & Google Python Style Guide — Key Takeaways

**Why code styling matters:** consistent style reduces cognitive load, lowers merge conflicts, and makes code reviews faster. Clean, predictable code is easier to test, debug, and onboard new collaborators to.

**Most applicable PEP 8 practices for me right now:**
1. **Naming:** modules/functions/variables use `lower_snake_case`; classes use `CapWords`; constants use `UPPER_SNAKE_CASE`.
2. **Line length:** keep code lines short (PEP 8 suggests **79**; many teams use **88** with Black). Prefer breaking long expressions with **implicit** line continuation inside `()`.
3. **Whitespace discipline:** spaces around binary operators (`a + b`), **no** extra spaces inside parentheses/brackets, and **no spaces around `=` in default args** (e.g., `def f(x=1):`).
4. **Imports:** top of file, one per line, grouped & ordered: stdlib → third-party → local. Avoid `from x import *`.
5. **Docstrings:** use triple double quotes; write a clear one-line summary, then details if needed.
6. **Readability hints:** prefer `if x is None` to `if x == None`; avoid bare `except:`; keep functions small and focused.
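
As a rough illustration of several items in the list above, the sketch below applies the naming, whitespace, default-argument, and `None`-comparison conventions in one place. The names `MAX_RADIUS` and `CircleReport` are made up for this example and do not come from the assignment.

```python
import math  # standard-library imports come first, one per line

MAX_RADIUS = 100.0  # constants use UPPER_SNAKE_CASE


class CircleReport:  # classes use CapWords
    """Hold a radius and report the matching area."""

    def __init__(self, radius=1.0):  # no spaces around "=" in a default
        self.radius = min(radius, MAX_RADIUS)

    def area(self, scale=None):  # functions and arguments use snake_case
        """Return the circle's area, optionally multiplied by scale."""
        if scale is None:  # compare to None with "is", not "=="
            scale = 1.0
        return math.pi * self.radius**2 * scale  # spaces around "*"


print(CircleReport(2.0).area())
```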

**What I plan to implement immediately:**
- Run **Black** (formatting) and **ruff/flake8** (linting) on save.
- Enforce import order with **isort** (stdlib → third-party → local).
- Add **type hints** in new/edited functions (`def f(x: int) -> str:`) and run a type checker (mypy/pyright).
- Adopt **Google-style docstrings** (`Args:`, `Returns:`, `Raises:`) for public functions.
- Keep lines ≤ 88 and break long expressions with parentheses.

**Most surprising bits:**
- PEP 8’s classic **79-char** line limit (still common in core Python); many modern teams use **88** (Black) or **100**, but staying short improves diffs.
- No spaces around `=` **in default parameters** (but do use spaces around assignment elsewhere).
- “`is` vs `==`” for `None` checks, and **never** use a bare `except:`.

>**Graduate note — what Google’s Python Style Guide adds:**

- A stricter 80-character line limit with documented exceptions.
- Prefer absolute imports and keep groups tidy and separate.
- Strong guidance on concise docstrings and consistent commenting.
- Clear expectations for typing, logging (use a format string plus arguments), and mandatory tooling such as linters.
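
To make the docstring, typing, and logging points more concrete, here is a minimal sketch of a function written in that style. The function `safe_ratio` is a hypothetical example, not something defined in either style guide.

```python
import logging

logger = logging.getLogger(__name__)


def safe_ratio(numerator: float, denominator: float) -> float:
    """Divide two numbers, rejecting a zero denominator.

    Args:
        numerator: Value to be divided.
        denominator: Value to divide by; must be nonzero.

    Returns:
        The quotient numerator / denominator.

    Raises:
        ValueError: If denominator is zero.
    """
    if denominator == 0:
        raise ValueError("denominator must be nonzero")
    # Pass values as arguments so the logging module formats them lazily,
    # instead of building the message with an f-string up front.
    logger.info("Computing %s / %s", numerator, denominator)
    return numerator / denominator


print(safe_ratio(6, 3))  # 2.0
```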

## Coding

Your code will be evaluated primarily on functionality, but basic PEP 8 compliance will be considered:

- Descriptive function names using `snake_case`
- Clear docstrings explaining function purpose
- Meaningful variable names
- Proper spacing around operators

All solutions will be implemented as functions. This is best practice for many reasons, including testability. As you will see, with functions we can write simple tests to check the correctness of implementation. This theme will be revisited and expanded on throughout the semester.

### Count Letters

This simple problem is designed to reintroduce Python and demonstrate:

- there are many ways to solve problems in Python
- some are better and easier than others
- the "hard way" is a necessary educational tool but Python provides alternatives for a reason

Write three versions of a function that takes a string and returns the number of occurrences of each letter in it:

1. `count_letters_v1` - use a list of lists where each inner list is `[letter, count]`
2. `count_letters_v2` - use a dictionary, checking if keys exist before updating
3. `count_letters_v3` - use dictionary's `.get()` method to simplify the logic

Write your functions in the cell below.

In [None]:
def count_letters_v1(text):
    """Return counts of each alphabetic letter (case-insensitive) as a list of [letter, count].

    Uses a nested loop to maintain a list of [letter, count] pairs.
    Only characters where ch.isalpha() is True are counted; text is lowercased first.
    """
    result = []
    for ch in text.lower():
        if ch.isalpha():
            for pair in result:
                if pair[0] == ch:
                    pair[1] += 1
                    break
            else:
                result.append([ch, 1])
    return result


def count_letters_v2(text):
    """Return counts of each alphabetic letter (case-insensitive) as a dict letter: count.

    Uses a single loop and checks key existence before updating.
    Only alphabetic characters are counted.
    """
    counts = {}
    for ch in text.lower():
        if ch.isalpha():
            if ch in counts:
                counts[ch] += 1
            else:
                counts[ch] = 1
    return counts


def count_letters_v3(text):
    """Return counts of each alphabetic letter (case-insensitive) as a dict letter: count.

    Uses a single loop and the dict get method to simplify updates.
    Only alphabetic characters are counted.
    """
    counts = {}
    for ch in text.lower():
        if ch.isalpha():
            counts[ch] = counts.get(ch, 0) + 1
    return counts


#### Tests

Run the code below to test your implementation. If an error is detected, use the information provided to correct your function definition.

**You must run the cell above each time you make changes to it (to create the function definition) before running these tests.**

In [None]:
def normalize_result(result):
    """Helper to compare different return types"""
    if isinstance(result, list):
        return {item[0]: item[1] for item in result}
    return result


test_cases = [
    ('Hello World', {'h': 1, 'e': 1, 'l': 3, 'o': 2, 'w': 1, 'r': 1, 'd': 1}),
    ('AAaaa', {'a': 5}),
    ('123!@#', {}),  # No letters
    ('', {}),  # Empty string
]

for text, expected in test_cases:
    assert normalize_result(count_letters_v1(text)) == expected, f"v1 failed on '{text}'"
    assert count_letters_v2(text) == expected, f"v2 failed on '{text}'"
    assert count_letters_v3(text) == expected, f"v3 failed on '{text}'"

print('All tests passed!')

All tests passed!


#### Interpretation

Add a text / markdown cell below to describe the progression from v1 to v3. Which method do you prefer and why? Specifically, why are dictionaries better suited for this problem than lists, and what is the advantage of `.get()`?

Interpretation — v1 to v3

v1 ([letter, count])
• Loops over a list of pairs, with a nested loop to find the letter each time.
• Running time grows because each new character may require scanning the entire list (approx. O(n·k), where k = count of distinct letters).
• A good choice for learning about conditionals and loops, but verbose and more error-prone.

v2 (dictionary with key check)
• Switches to a dictionary that maps letter → count.
• Every update is O(1) on average: check whether the key exists, then increment or add.
• One pass over the text, less logic, and no nested loop.

v3 (dictionary with .get())
• Same idea as v2, but uses counts.get(letter, 0) + 1.
• Removes the if/else branch, so the update is one line and less prone to error.
• Still only one pass, with O(1) updates on average.

Which method I prefer and why
I prefer v3 because it's readable, concise, and efficient. It gets across the meaning (count or start at zero) without a branch, so the code is easier to understand and maintain.

Why dictionaries are a better fit for this problem than lists
• The operation is "look up a letter and increment its count," which is exactly a key-value mapping.
• Direct key access is fast, so repeated scans of a list are not needed.
• The resulting code mirrors the mental model: for every letter, increment its own bucket.

What .get() adds
• Returns a default if a key is missing, so no existence test is needed separately.
• Prevents KeyError and removes an if/else block, lowering cognitive load.
• Maintains the update as one consistent action (read current or zero, then increment), which is both idiomatic and less prone to bugs.
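
The difference between indexing and `.get()` can be seen in a short sketch. The dictionary contents here are arbitrary illustration values.

```python
counts = {'a': 2}

# Indexing a missing key raises KeyError...
try:
    counts['b']
    missing_key_raised = False
except KeyError:
    missing_key_raised = True

# ...while .get() returns the supplied default and leaves the dict unchanged.
default_value = counts.get('b', 0)

# The v3 update therefore works for both new and existing letters.
for ch in 'ab':
    counts[ch] = counts.get(ch, 0) + 1

print(missing_key_raised, default_value, counts)  # True 0 {'a': 3, 'b': 1}
```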

#### Follow-Up (Graduate Students)

This part is for grad students only.

Implement a fourth version of the solution using [`collections.Counter` from the standard library](https://www.geeksforgeeks.org/python/counters-in-python-set-1/). Test your implementation as you did for v1-3.

In [None]:
from collections import Counter

def count_letters_v4(text):
    """Return counts of each alphabetic letter (case-insensitive) as a dictionary.

    Uses collections.Counter over a generator expression to count only alphabetic characters.
    """
    return dict(Counter(ch for ch in text.lower() if ch.isalpha()))

In [None]:
def normalize_result(result):
    """Helper to compare different return types"""
    if isinstance(result, list):
        return {item[0]: item[1] for item in result}
    return result


test_cases = [
    ('Hello World', {'h': 1, 'e': 1, 'l': 3, 'o': 2, 'w': 1, 'r': 1, 'd': 1}),
    ('AAaaa', {'a': 5}),
    ('123!@#', {}),  # No letters
    ('', {}),        # Empty string
]

for text, expected in test_cases:
    assert normalize_result(count_letters_v1(text)) == expected, f"v1 failed on '{text}'"
    assert count_letters_v2(text) == expected, f"v2 failed on '{text}'"
    assert count_letters_v3(text) == expected, f"v3 failed on '{text}'"
    assert count_letters_v4(text) == expected, f"v4 failed on '{text}'"

print('All tests passed!')

All tests passed!
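
Beyond counting, `Counter` offers a few conveniences that a plain dictionary lacks. The sketch below shows three of them on the same "Hello World" input used in the tests; the variable name `letter_counts` is only for illustration.

```python
from collections import Counter

letter_counts = Counter(ch for ch in 'Hello World'.lower() if ch.isalpha())

# Counter is a dict subclass, so it compares equal to a plain dict.
print(letter_counts == {'h': 1, 'e': 1, 'l': 3, 'o': 2, 'w': 1, 'r': 1, 'd': 1})  # True

# most_common(n) returns the n highest counts as (element, count) pairs.
print(letter_counts.most_common(2))  # [('l', 3), ('o', 2)]

# Missing elements report a count of zero instead of raising KeyError.
print(letter_counts['z'])  # 0
```

Because of the first property, wrapping the result in `dict(...)` is optional when only equality with a dictionary matters; converting is still reasonable if callers expect a plain `dict`.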


### Extract Valid Data

Create a function, `extract_valid_data`, that takes a list of lists containing an arbitrary mix of *only* `int`, `float`, and `str` data types, along with a `max_val` number. Return a list of the unique integer values less than `max_val`, sorted in ascending order. The default value of `max_val` is 10. For example, the following function call:

```python
lols = [[1, 'a', 50], [50, 101, -5], [25, 3.14]]
extract_valid_data(lols, max_val=100)
```

should return

```python
[-5, 1, 25, 50]
```

To better understand how default arguments are used when defining and calling Python functions, review the first part of [this Geeks for Geeks article](https://www.geeksforgeeks.org/python/default-arguments-in-python/). The second part, about mutable defaults, is very important; we will revisit this topic later in the course.

You will need to use either `type` or `isinstance` to identify objects of type `int` in your solution. Consult the Python documentation or use the built-in help (e.g. `help(isinstance)`) for more information.
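
The two checks differ in how they treat subclasses, which the short comparison below tries to show. The sample list is an assumption for illustration; it includes a `bool`, which the problem statement says will not appear in the real input, purely to highlight the difference.

```python
values = [7, True, 3.0, '7']

# isinstance accepts subclasses, and bool is a subclass of int.
print([isinstance(v, int) for v in values])  # [True, True, False, False]

# An exact type check accepts only objects whose type is int itself.
print([type(v) is int for v in values])  # [True, False, False, False]
```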

Write your function in the cell below.

In [None]:
def extract_valid_data(lists, max_val=10):
    """Return a sorted list of unique integers less than max_val from a list of lists.

    The input contains only int, float, or str. We count only true ints (exclude bools).
    Args:
        lists: List of lists with items of type int, float, or str.
        max_val: Upper bound (exclusive) for valid integers. Default is 10.
    Returns:
        Sorted list of unique integers < max_val.
    """
    unique_ints = set()
    for sublist in lists:
        for item in sublist:
            # type(item) is int excludes booleans (since isinstance(True, int) is True)
            if type(item) is int and item < max_val:
                unique_ints.add(item)
    return sorted(unique_ints)

#### Tests

Run the code below to test your implementation. If an error is detected, use the information provided to correct your function definition.

**You must run the cell above each time you make changes to it (to create the function definition) before running these tests.**

In [None]:
# Test 1: Basic example from problem description
lols = [[1, 'a', 50], [50, 101, -5], [25, 3.14]]
assert extract_valid_data(lols, max_val=100) == [-5, 1, 25, 50], 'Basic test failed'

# Test 2: Default max value (10)
data1 = [[1, 5, 15], [8, 12, 3], [5, 9, 10]]
assert extract_valid_data(data1) == [1, 3, 5, 8, 9], 'Default max_val=10 test failed'

# Test 3: No valid integers (all exceed max)
data2 = [[100, 200], [150, 300]]
assert extract_valid_data(data2, max_val=50) == [], 'No valid integers test failed'

# Test 4: Duplicates should be removed
data3 = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
assert extract_valid_data(data3, max_val=10) == [1, 2, 3, 4, 5], 'Duplicate removal test failed'

# Test 5: Mixed types - only integers should be included
data4 = [[1, 2.0, '3'], [4.5, 5, 'six'], [7.0, 8, 9.9]]
assert extract_valid_data(data4, max_val=10) == [1, 5, 8], 'Type filtering test failed'

# Test 6: Negative numbers
data5 = [[-5, -3, -1], [0, 1, 2]]
assert extract_valid_data(data5, max_val=3) == [-5, -3, -1, 0, 1, 2], 'Negative numbers test failed'

# Test 7: Single element sublists
data6 = [[1], [2], [3], [2], [1]]
assert extract_valid_data(data6, max_val=5) == [1, 2, 3], 'Single element test failed'

# Test 8: Large max value
data7 = [[1, 100, 1000], [50, 500, 5000]]
assert extract_valid_data(data7, max_val=10000) == [1, 50, 100, 500, 1000, 5000], (
    'Large max test failed'
)

# Test 9: Boundary case - values equal to max should be excluded
data8 = [[8, 9, 10, 11], [10, 10, 10]]
assert extract_valid_data(data8, max_val=10) == [8, 9], 'Boundary test failed (max_val=10)'

print('All tests passed!')

All tests passed!


#### Interpretation

Add a text / markdown cell below to explain how the test code works. In particular, what does `assert` do here? Are you surprised by the number of tests required to fully check the solution?

Interpretation — testing code behavior

Role of the helper
• normalize_result converts v1's pairs to a dictionary so that all versions can be compared the same way.
• A result that is already a dictionary (v2, v3) is returned unchanged.
• This lets a single statement compare return values of different types.

What the test cases demonstrate
• "Hello World" validates mixed-case input and confirms that counting is case-insensitive.
• "AAaaa" checks that upper and lower case collapse to the same letter and that totals aggregate correctly.
• "123!@#" ensures non-letters are ignored.
• The empty string checks that the functions return empty results without errors.

What assert does
• assert condition, "message" stops execution if the condition is False and raises an AssertionError with the message.
• Each assert in these tests compares the function’s output to the expected dictionary.
• A matching result passes silently; otherwise, the message shows which version failed and on which input.

Why is the loop used?
• The for loop verifies each (text, expected) tuple.
• Each function is run on each case, so every failure names the version (v1, v2, v3, v4) and the exact input it failed on.

Are there enough tests?
• They are reasonable "smoke tests" for typical letters, case folding, non-letters, and empty input.
• A complete suite would cover other edge cases, including strings of one repeated letter, long strings, all-distinct letters, accented or non-ASCII letters, whitespace-dominated strings, and letter-punctuation combinations.
• In real projects, more tests give more confidence. These four are a minimal baseline that catches common mistakes, but establishing correctness needs a larger set.
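
The behavior of `assert` described above can be observed directly by catching the exception it raises. In this sketch, `check_equal` is a hypothetical helper written only to capture the failure message; it is not part of the assignment's test code.

```python
def check_equal(actual, expected, label):
    """Return None on a match; otherwise return the AssertionError message."""
    try:
        assert actual == expected, f"{label} failed: expected {expected}, got {actual}"
    except AssertionError as err:
        return str(err)
    return None


# A passing assert is silent, so the helper returns None.
print(check_equal(sorted({3, 1, 2}), [1, 2, 3], 'sorted set'))  # None

# A failing assert raises AssertionError carrying the message.
print(check_equal(sorted({3, 1, 2}), [1, 2], 'sorted set'))
# sorted set failed: expected [1, 2], got [1, 2, 3]
```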

#### Follow-Up (Graduate Students)

This part is for grad students only.

Rewrite this function as a single list comprehension.

Is the result more or less easy to read than your original implementation? What does this tell you about when comprehensions are best used, in practice?

In [None]:
# Single list-comprehension solution (dedupe via a tiny seen set)
def extract_valid_data(lists, max_val=10):
    """
    Return a sorted list of unique integers < max_val from a list of lists.
    Uses one list comprehension for the result; a small 'seen' set prevents duplicates.
    Excludes booleans (since bool is a subclass of int) by using type(x) is int.
    """
    seen = set()
    return sorted([
        item
        for sublist in lists
        for item in sublist
        if (type(item) is int) and (item < max_val) and (item not in seen and not seen.add(item))
    ])

It’s compact, but less readable than a small for-loop because it hides the “remember what we’ve seen” state inside the condition (the not seen.add(item) trick relies on add returning None).

In practice, comprehensions are best when you’re doing simple, linear filtering/mapping with no side effects or hidden state.

If you need state (like deduplication) or multiple steps, a short loop (or a set comprehension plus sorted) is usually clearer for collaborators.
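
For comparison, the "set comprehension plus sorted" alternative mentioned above might look like the sketch below. The name `extract_valid_data_set` is chosen here only to keep it distinct from the graded function.

```python
def extract_valid_data_set(lists, max_val=10):
    """Return sorted unique ints below max_val using a set comprehension."""
    return sorted({
        item
        for sublist in lists
        for item in sublist
        # The type check runs first, so strings never reach the comparison.
        if type(item) is int and item < max_val
    })


lols = [[1, 'a', 50], [50, 101, -5], [25, 3.14]]
print(extract_valid_data_set(lols, max_val=100))  # [-5, 1, 25, 50]
```

The set handles deduplication on its own, so the condition contains only the filter and no hidden state.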

### Data Cleaning

Create a function, `clean_record`, that takes a dictionary and returns a cleaned version of the same. Each `dict` consists of four key:value pairs. All keys are strings and the expected type of each value is specified below:

- 'name': str
- 'age': int
- 'email': str
- 'score': float

To clean each record, your function should:

- convert all keys to lowercase
- convert all age and score values to integer or float values, as specified
- validate that age is positive and less than 100; if not, replace the value with `None` and print a warning message
- round score to a single digit of precision using `round(val, 1)`
- convert name to "Last, First" format
  - you can assume that all names come in "First Middle Last" format, but middle is optional
  - you can also assume that the names will not include titles (e.g. "Dr."), suffixes (e.g. "Jr."), multi-word last names (e.g. "Van Buren"), etc.
- return the cleaned version

Note: Python's `round` function uses Banker's Rounding, which can lead to unexpected results. See [this article for additional background / details](https://medium.com/@akhilnathe/understanding-pythons-round-function-from-basics-to-bankers-b64e7dd73477).
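
A few concrete cases may help show what "unexpected" means here. The values below are illustrative and not taken from the test data; the `Decimal` lines show one possible alternative when half-up rounding is required, which this assignment does not ask for.

```python
from decimal import Decimal, ROUND_HALF_UP

# Exact ties go to the nearest even result.
print(round(0.5), round(1.5), round(2.5))  # 0 2 2
print(round(0.25, 1))  # 0.2

# 2.675 cannot be stored exactly in binary; the stored value is slightly
# below 2.675, so this is not a true tie and it rounds down.
print(round(2.675, 2))  # 2.67

# A Decimal built from a string keeps the value exact and lets you pick the rule.
print(Decimal('2.675').quantize(Decimal('0.01'), rounding=ROUND_HALF_UP))  # 2.68
```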

You may assume there are no missing keys in the data.

Write your function in the cell below.

In [None]:
def clean_record(record):
    """Clean a single record dict with keys Name/Age/Email/Score (any case).

    - Keys are lowercased in the output.
    - Age is converted to int; if not in 1..99, set to None and print a warning.
    - Score is converted to float and rounded to 1 decimal using round(val, 1).
    - Name is converted from 'First [Middle] Last' to 'Last, First'.
    - Email is preserved as a string.
    """
    # Normalize keys to lowercase
    rec = {str(k).lower(): v for k, v in record.items()}

    # --- Name: 'First [Middle] Last' -> 'Last, First'
    name_raw = rec.get('name', '')
    parts = str(name_raw).strip().split()
    if len(parts) >= 2:
        first = parts[0]
        last = parts[-1]
        name_clean = f"{last}, {first}"
    else:
        # Fallback if name is malformed
        name_clean = str(name_raw).strip()

    # --- Age: to int; validate 1..99 else None with warning
    age_raw = rec.get('age')

    def _to_int(value):
        if isinstance(value, int):
            return value
        if isinstance(value, float):
            try:
                return int(value)
            except (ValueError, OverflowError):
                return None
        if isinstance(value, str):
            s = value.strip()
            try:
                if s.isdigit() or (s and s[0] in '+-' and s[1:].isdigit()):
                    return int(s)
                else:
                    return int(float(s))
            except (ValueError, OverflowError):
                return None
        return None

    age_val = _to_int(age_raw)
    if age_val is None or not (0 < age_val < 100):
        print(f"Warning: invalid age {age_raw!r}; setting to None")
        age_clean = None
    else:
        age_clean = age_val

    # --- Score: to float; round to 1 decimal
    score_raw = rec.get('score')

    def _to_float(value):
        if isinstance(value, float):
            return value
        if isinstance(value, int):
            return float(value)
        if isinstance(value, str):
            s = value.strip()
            try:
                return float(s)
            except (ValueError, OverflowError):
                return None
        return None

    score_val = _to_float(score_raw)
    score_clean = round(score_val, 1) if score_val is not None else None

    # --- Email: keep as string
    email_raw = rec.get('email')
    email_clean = None if email_raw is None else str(email_raw)

    return {
        'name': name_clean,
        'age': age_clean,
        'email': email_clean,
        'score': score_clean,
    }


#### Tests

Run the code below to test your implementation. If an error is detected (the output doesn't match the expected value for any of the 9 tests), use the information provided to correct your function definition.

**You must run the cell above each time you make changes to it (to create the function definition) before running these tests.**

In [None]:
### DO NOT CHANGE THE CODE IN THIS CELL

# Test data for clean_records function

test_input = [
    {
        'name': 'John Doe',
        'age': '25',
        'email': 'john@email.com',
        'score': '87.456',
    },
    {
        'NAME': 'Mary Jane Smith',
        'AGE': '30',
        'EMAIL': 'mj@email.com',
        'SCORE': '92.149',
    },
    {
        'Name': 'Bob Wilson',
        'Age': 42,
        'Email': 'bob@test.com',
        'Score': 81.951,
    },
    {
        'name': 'Anna Chen',
        'age': '1',
        'email': 'anna@email.com',
        'score': '95.678',
    },
    {
        'name': 'Senior Citizen',
        'age': '99',
        'email': 'senior@test.com',
        'score': 73.2,
    },
    {
        'NAME': 'Charlie Brown',
        'AGE': 19,
        'EMAIL': 'charlie@test.com',
        'SCORE': '90.5',
    },
    {
        'name': 'Jennifer Anne Marie Thompson',
        'age': '31',
        'email': 'jamt@email.com',
        'score': '88.8',
    },
    {
        'Name': 'Carlos Rodriguez',
        'Age': '28',
        'Email': 'carlos@email.com',
        'Score': 100,
    },
    {
        'name': 'Invalid Age',
        'age': '150',
        'email': 'invalid@test.com',
        'score': '80.0',
    },
]

test_expected = [
    {'name': 'Doe, John', 'age': 25, 'email': 'john@email.com', 'score': 87.5},
    {'name': 'Smith, Mary', 'age': 30, 'email': 'mj@email.com', 'score': 92.1},
    {'name': 'Wilson, Bob', 'age': 42, 'email': 'bob@test.com', 'score': 82.0},
    {'name': 'Chen, Anna', 'age': 1, 'email': 'anna@email.com', 'score': 95.7},
    {'name': 'Citizen, Senior', 'age': 99, 'email': 'senior@test.com', 'score': 73.2},
    {'name': 'Brown, Charlie', 'age': 19, 'email': 'charlie@test.com', 'score': 90.5},
    {'name': 'Thompson, Jennifer', 'age': 31, 'email': 'jamt@email.com', 'score': 88.8},
    {'name': 'Rodriguez, Carlos', 'age': 28, 'email': 'carlos@email.com', 'score': 100.0},
    {'name': 'Age, Invalid', 'age': None, 'email': 'invalid@test.com', 'score': 80.0},
]

# run tests to ensure output matches expected for each given input

for idx, data in enumerate(test_input):
    expected = test_expected[idx]
    actual = clean_record(data)

    # Check if dictionaries match
    if actual != expected:
        # Find which fields don't match
        for key in expected:
            if actual.get(key) != expected[key]:
                assert False, (
                    f"Test {idx + 1} failed on field '{key}': expected {expected[key]}, got {actual.get(key)}"
                )

print('All tests pass!')

All tests pass!


#### Interpretation

Add a text / markdown cell below to explain how the test code works. In particular, look up the `enumerate` function and `dict.get()` method. Consider how the equivalent would be written without them - how do those features simplify this implementation?

Also, what does `assert` do here and how else could it be used in other testing situations?

Interpretation — how the test code works (enumerate, dict.get, and assert)

What the loop is doing
• It walks through the input records and the expected results in lockstep.
• For each index i and record, it runs clean_record(record) to get actual, then compares actual to test_expected[i].
• If they differ, it checks each key to report exactly which field failed.

Why enumerate is used
• enumerate(test_input) gives both the index (i) and the item (record) at the same time.
• Without enumerate you’d need extra boilerplate, for example: keep a manual counter (i = 0; …; i += 1) or loop over a numeric range (for i in range(len(test_input))) just to retrieve test_expected[i].
• Using enumerate removes that clutter and makes the intent clear: you need both the item and its position.

Why dict.get is used
• actual.get(key) safely returns the value for key, or None if the key is missing.
• In the error path, this avoids a KeyError and allows a clean failure message (which test, which field, expected vs. actual).
• Without dict.get you’d write if key in actual: value = actual[key] else: value = None, which is more verbose and easier to get wrong.
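
The boilerplate that `enumerate` and `dict.get` remove can be seen side by side in a small sketch. The lists and the `record` dictionary are placeholder data invented for this comparison.

```python
inputs = ['a', 'b', 'c']
expected_values = ['A', 'B', 'C']

# Manual counter: works, but the bookkeeping is easy to get wrong.
i = 0
pairs_manual = []
for item in inputs:
    pairs_manual.append((i, item, expected_values[i]))
    i += 1

# enumerate supplies the index alongside each item.
pairs_enum = [(i, item, expected_values[i]) for i, item in enumerate(inputs)]

# zip pairs the two lists directly; enumerate still numbers the cases.
pairs_zip = [(i, item, exp) for i, (item, exp) in enumerate(zip(inputs, expected_values))]

print(pairs_manual == pairs_enum == pairs_zip)  # True

# dict.get returns None (or a chosen default) for a missing key.
record = {'name': 'Doe, John'}
print(record.get('age'), record.get('age', 'n/a'))  # None n/a
```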

What assert does here
• assert condition, message immediately stops the test when condition is false, raising an AssertionError with the given message.
• In this test, if any field doesn’t match, assert is triggered with a clear message indicating the failing test number and field.

How else assert can be used in testing
• Pytest: simply write assert actual == expected and let pytest show a rich diff; to test errors, use with pytest.raises(SomeError): call().
• Unittest (xUnit style): use self.assertEqual, self.assertDictEqual, self.assertRaises for clearer intent and built-in reporting.
• Property-based testing (Hypothesis): generate many randomized inputs and assert general properties (for example, ages are None or between 1 and 99).
• Note: asserts can be stripped when running Python with optimization flags; for production input validation, prefer explicit checks and raise descriptive exceptions (e.g., ValueError).
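
As one example of the xUnit style mentioned above, the sketch below tests a small name-formatting helper with the standard-library `unittest` module. The helper `to_last_first` and the test class are hypothetical; they are not part of `clean_record`, and raising `ValueError` for a one-word name is a design choice made only for this example.

```python
import unittest


def to_last_first(name):
    """Hypothetical helper: 'First [Middle] Last' -> 'Last, First'."""
    parts = name.split()
    if len(parts) < 2:
        raise ValueError(f"need at least two name parts: {name!r}")
    return f"{parts[-1]}, {parts[0]}"


class ToLastFirstTests(unittest.TestCase):
    def test_middle_name_is_dropped(self):
        self.assertEqual(to_last_first('Mary Jane Smith'), 'Smith, Mary')

    def test_single_word_is_rejected(self):
        with self.assertRaises(ValueError):
            to_last_first('Cher')


# Build and run the suite explicitly so this also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToLastFirstTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```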

• What the loop does: It iterates over test_input by index, applies clean_record to each input record, and checks the result against the corresponding expected dict. If they are not equal, it identifies which field was incorrect.

• Why enumerate: It provides the index and the item together, so the code can retrieve the corresponding expected record by index without an explicit counter or a range-based loop.

• Why dict.get: It retrieves a value safely (returns None if missing) so the failure message can be built without risking a KeyError. Without get, you’d need if key in actual before accessing actual[key].

• What assert does: It is a guard that stops the run with an AssertionError (and a readable message) when a check fails. In other settings, you could use pytest's plain assert with rich diffs, unittest's assertEqual/assertDictEqual, or property-based tests with Hypothesis.

• Relation to the function: clean_record also uses get for safe access (name, age, score, email), which mirrors the approach in the test: safer access, clearer errors, fewer crashes.

Bottom line: enumerate and dict.get save boilerplate and prevent avoidable errors, and assert provides prompt, explicit failure messages, which makes the tests more convenient and informative.

#### Follow-Up (Graduate Students)

This part is for grad students only.

Explain in a text / markdown cell below how your approach would have to change if you could not assume each record was complete (all four keys present).

Sketch out the code change required. You can include (non-running) code blocks in markdown cells as shown below (edit this cell to see the formatting). This does not have to run, it is for communication purposes only.



**Follow-Up (Graduate) — handling incomplete records**

If we can’t assume each record has all four keys, the cleaning function must be defensive and schema-preserving:

Always return the full schema with keys name, age, email, score, even if some values end up as None. This keeps downstream code (e.g., DataFrame creation, aggregations) stable.

Use safe lookups with dict.get and graceful defaults:

• Missing name → None (or leave as raw string if present but malformed).
• Missing age → None and print a warning (as with invalid ages).
• Missing score → None.
• Missing email → None.

Parsing stays best-effort: attempt conversion; on failure or invalid range, set to None.

Keep the “Last, First” format if a name is present and has at least two parts; otherwise return the raw name (or None) rather than guessing.

In [None]:
#Sketch of the code change (communication only; does not have to run)
# code blocks are denoted in markdown with three backticks before and after
print("This is a markdown code block.")
def clean_record(record):
    # 1) Normalize keys to lowercase (ignore unexpected keys, if any)
    rec = {str(k).lower(): v for k, v in record.items()}

    # 2) Name: handle missing or malformed
    name_raw = rec.get('name', None)
    if isinstance(name_raw, str):
        parts = name_raw.strip().split()
        if len(parts) >= 2:
            first = parts[0]
            last = parts[-1]
            name_clean = f"{last}, {first}"
        else:
            # If only one part or empty, keep as-is (or set to None by policy)
            name_clean = name_raw.strip() if name_raw.strip() else None
    else:
        name_clean = None

    # 3) Age: best-effort to int; if missing/invalid/out-of-range -> None + warning
    age_raw = rec.get('age', None)
    age_clean = None
    if age_raw is not None:
        age_clean = _to_int(age_raw)  # same helper as before
        if age_clean is None or not (0 < age_clean < 100):
            print(f"Warning: invalid or missing age {age_raw!r}; setting to None")
            age_clean = None
    else:
        print("Warning: age missing; setting to None")

    # 4) Score: best-effort to float then round(…, 1); missing -> None
    score_raw = rec.get('score', None)
    if score_raw is not None:
        val = _to_float(score_raw)    # same helper as before
        score_clean = round(val, 1) if val is not None else None
    else:
        score_clean = None

    # 5) Email: coerce to str if present; else None
    email_raw = rec.get('email', None)
    email_clean = str(email_raw).strip() if email_raw is not None else None

    # 6) Return a schema-stable dict (all four keys present)
    return {
        'name':  name_clean,
        'age':   age_clean,
        'email': email_clean,
        'score': score_clean,
    }

# Helpers (same idea as in the original solution)
def _to_int(value):
    # int, float, or strings like '25' / '25.0' -> int, else None
    ...

def _to_float(value):
    # float, int, or numeric string -> float, else None
    ...


This is a markdown code block.


### Implement a Simplified `zip()`

Create a function, `simple_zip`, that emulates some functionality of the `zip` function included with base Python:

```bash
> help(zip)
Help on class zip in module builtins:

class zip(object)
 |  zip(*iterables, strict=False)
 |
 |  The zip object yields n-length tuples, where n is the number of iterables
 |  passed as positional arguments to zip().  The i-th element in every tuple
 |  comes from the i-th iterable argument to zip().  This continues until the
 |  shortest argument is exhausted.
 |
 |  If strict is true and one of the arguments is exhausted before the others,
 |  raise a ValueError.
 |
 |     >>> list(zip('abcdefg', range(3), range(4)))
 |     [('a', 0, 0), ('b', 1, 1), ('c', 2, 2)]
```

Python's version returns a lazy *iterator* (a generator-like object) that produces values as needed rather than all at once. Your solution should return a list of tuples instead. For example, the following function call:

```python
it1 = [1, 2, 3]
it2 = ['a', 'b', 'c']
simple_zip(it1, it2)
```

should return

```python
[(1, 'a'), (2, 'b'), (3, 'c')]
```

Do not implement the `strict` argument. Instead, emulate the default behavior of `zip`: if the iterables are of differing lengths, stop when the shortest one is exhausted.

Note that the first argument in `zip` is `*iterables`, allowing it to accept any number of iterables. When you use `*VARIABLE_NAME` in this fashion, Python automatically collects all the positional arguments into a tuple called `VARIABLE_NAME` (e.g. `iterables`). It is your responsibility to extract individual arguments from the resulting tuple. The following code block demonstrates this for clarity.

In [None]:
def example(*args):
    # return args as constructed by Python from the user's arguments
    # (named args rather than vars to avoid shadowing the built-in vars())
    return args


var1 = 'first argument'
var2 = 'second argument'
result = example(var1, var2)

# inspect results
print(result)  # ('first argument', 'second argument')
print(result[0])  # 'first argument'

Write your function in the cell below.

In [None]:
def simple_zip(*iterables):
    """Return a list of tuples where the i-th tuple contains the i-th elements
    of each passed iterable. Stop at the shortest iterable (like built-in zip).

    Examples:
        simple_zip([1,2,3], ['a','b','c']) -> [(1,'a'), (2,'b'), (3,'c')]
        simple_zip([1,2], ['x','y','z'])   -> [(1,'x'), (2,'y')]
        simple_zip([1,2,3])                -> [(1,), (2,), (3,)]
        simple_zip()                       -> []
    """
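    # Turn every input into an iterator so any iterable (list, range, generator) works
    iters = [iter(it) for it in iterables]
    result = []
    if not iters:
        # No inputs: return early, otherwise the loop below would never end
        return result

    while True:
        row = []
        try:
            # Take one element from each iterator to build the next tuple
            for it in iters:
                row.append(next(it))
        except StopIteration:
            # The shortest iterable ran out; discard the partial row and stop
            break
        result.append(tuple(row))
    return result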


#### Tests

Run the code below to test your implementation. If an error is detected, use the information provided to correct your function definition.

**You must run the cell above each time you make changes to it (to create the function definition) before running these tests.**

In [None]:
# Basic test cases
assert simple_zip([1, 2], ['a', 'b']) == [(1, 'a'), (2, 'b')], 'Basic test failed'
assert simple_zip([1, 2, 3], ['a', 'b']) == [(1, 'a'), (2, 'b')], "Doesn't stop at shortest"

# Multiple iterables
assert simple_zip([1, 2], ['a', 'b'], [10, 20]) == [(1, 'a', 10), (2, 'b', 20)], (
    "Doesn't handle >2 iterables"
)

# Different types of iterables
assert simple_zip('abc', [1, 2, 3]) == [('a', 1), ('b', 2), ('c', 3)], "Doesn't handle mixed types"
assert simple_zip(range(3), 'xyz') == [(0, 'x'), (1, 'y'), (2, 'z')], (
    "Doesn't handle other iterable types"
)

print('All tests passed!')

All tests passed!


#### Interpretation

Add a text / markdown cell below to discuss how you might make your code more concise and/or readable by using list comprehensions. If you already used them in your solution, describe why you chose that approach.

Also, what tests have we overlooked? What are we assuming about the input that might cause a crash when this function is called?

**simple_zip interpretation**

• A list comprehension can reduce the solution to one line, but only if the inputs are indexable sequences (they must support len and [i]). That version is shorter but less general.

• The small while/try/except loop is more readable and works for any iterable (lists, tuples, ranges, generators), stopping cleanly at the shortest one, like the built-in zip.
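
A minimal sketch of the list-comprehension variant mentioned above, assuming every input supports `len()` and indexing. The name `simple_zip_indexed` is hypothetical:

```python
def simple_zip_indexed(*sequences):
    """Comprehension-based variant; works only for sequences with len() and [i]."""
    if not sequences:
        return []
    shortest = min(len(seq) for seq in sequences)
    # One tuple per index, built from the i-th element of every sequence
    return [tuple(seq[i] for seq in sequences) for i in range(shortest)]


print(simple_zip_indexed([1, 2, 3], 'ab'))  # [(1, 'a'), (2, 'b')]

# Generators have no len(), so this variant rejects them
try:
    simple_zip_indexed([1, 2], (x for x in range(2)))
except TypeError as err:
    print(f"TypeError: {err}")
```

The trade-off is visible in the last call: the comprehension is compact, but it fails on inputs that the loop-based version handles.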

**Tests that need to be added:**
• No inputs → []; single iterable → [(x,), …].
• Different lengths, including a case where one iterable is empty.
• Combinations of differing iterable types (list + generator, tuple + range).
• Strings as iterables (decide explicitly whether they should be accepted; the current tests assume they are).
• A long, or even infinite, iterable paired with a short finite one (should truncate at the short one and not hang).
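
The missing cases listed above could be written as assertions along these lines. So that the block runs on its own, `simple_zip` is defined here as a thin stand-in around the built-in `zip` (an assumption for illustration); in the notebook you would run the assertions against your own definition instead.

```python
import itertools


def simple_zip(*iterables):
    # Stand-in with the expected behavior; replace with your own implementation
    return list(zip(*iterables))


# No inputs and a single iterable
assert simple_zip() == [], 'No inputs should give an empty list'
assert simple_zip([1, 2, 3]) == [(1,), (2,), (3,)], 'Single iterable failed'

# One iterable is empty
assert simple_zip([], [1, 2]) == [], 'Empty iterable should give an empty list'

# Mixed iterable types: list + generator, tuple + range
assert simple_zip([1, 2], (x * x for x in range(5))) == [(1, 0), (2, 1)]
assert simple_zip((1, 2), range(10)) == [(1, 0), (2, 1)]

# Infinite iterable paired with a short one must stop, not hang
assert simple_zip('ab', itertools.count()) == [('a', 0), ('b', 1)]

print('Edge-case tests passed!')
```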

**Assumptions that may break**
• Non-iterable arguments (None, an int) will raise TypeError when passed to iter.
• Iterators are consumed as they are read; reusing the same iterator afterwards won't "restart" it.
• If an iterator raises an exception other than StopIteration, it will propagate to the caller.
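
Each of these assumptions can be observed directly. The sketch below uses the built-in `zip` as a stand-in, on the assumption that `simple_zip` follows the same element-by-element strategy. The generator `faulty` is a hypothetical example.

```python
# 1) A non-iterable argument fails as soon as iter() is called on it
try:
    iter(5)
except TypeError as err:
    type_error_message = str(err)
print(type_error_message)

# 2) Iterators are consumed: after zipping, nothing is left to read.
#    The value 3 was read before the shorter input ran out, then discarded.
numbers = iter([1, 2, 3])
pairs = list(zip(numbers, 'ab'))
leftover = list(numbers)
print(pairs)     # [(1, 'a'), (2, 'b')]
print(leftover)  # []


# 3) Any exception other than StopIteration propagates to the caller
def faulty():
    yield 1
    raise RuntimeError('boom')


try:
    list(zip(faulty(), [10, 20]))
except RuntimeError as err:
    propagated = str(err)
print(propagated)  # boom
```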

#### Follow-Up (Graduate Students)

Read more about [generator objects](https://realpython.com/introduction-to-python-generators/). Then run the following code, noting the included comments.

In [None]:
# Python's built-in zip returns a generator-like object
result1 = zip([1, 2, 3], ['a', 'b', 'c'])
print(result1)  # What do you see?
print(list(result1))  # Convert to list
print(list(result1))  # Try again - what happens?

# Your simple_zip returns a list
result2 = simple_zip([1, 2, 3], ['a', 'b', 'c'])
print(result2)  # What do you see?
print(result2)  # Try again - what happens?

<zip object at 0x790499727700>
[(1, 'a'), (2, 'b'), (3, 'c')]
[]
[(1, 'a'), (2, 'b'), (3, 'c')]
[(1, 'a'), (2, 'b'), (3, 'c')]


Based on your reading and the code above:

- What's the key difference between a generator and a list?
- Why might Python's zip return a generator instead of a list?
- Name one advantage and one disadvantage of generators vs lists.

**What each line shows**

• `<zip object at 0x790499727700>`: zip returns an iterator (generator-like), not a list.
• `[(1, 'a'), (2, 'b'), (3, 'c')]`: converting the iterator to a list consumes it and shows the pairs.
• `[]`: the iterator is now exhausted, so a second conversion yields nothing.
• `[(1, 'a'), (2, 'b'), (3, 'c')]`: simple_zip returns a real list, so you see the data.
• `[(1, 'a'), (2, 'b'), (3, 'c')]`: lists persist, so printing again shows the same data.

**Answers**
• Key difference: generators/iterators are lazy, single-pass streams (they don't store items); lists are eager, stored collections (they support len, indexing, and multiple passes).
• Why zip returns an iterator: efficiency and composability. There is no upfront memory cost, it works with very large or infinite sources, and it chains well with other iterator tools.
• One advantage of generators: memory-efficient and fast to start producing values.
• One disadvantage of generators: one-shot and no random access; if you need to reuse or index results, you must materialize them (e.g., convert to a list).
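
To make the lazy, single-pass behavior concrete, here is a minimal sketch of a generator-based variant. The name `lazy_zip` is hypothetical; the built-in `zip` is implemented differently, and this only imitates its observable behavior.

```python
import itertools


def lazy_zip(*iterables):
    """Yield tuples one at a time instead of building the whole list up front."""
    iterators = [iter(it) for it in iterables]
    if not iterators:
        return
    while True:
        row = []
        for it in iterators:
            try:
                row.append(next(it))
            except StopIteration:
                # Shortest input exhausted: end the generator cleanly
                return
        yield tuple(row)


# Works with an infinite source because values are produced only on demand
pairs = lazy_zip(itertools.count(), 'abc')
print(next(pairs))  # (0, 'a')
print(list(pairs))  # [(1, 'b'), (2, 'c')]
print(list(pairs))  # [] because the generator is exhausted
```

The advantage and the disadvantage both show up here: the infinite `itertools.count()` is handled without trouble, but the results cannot be read a second time.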

## Reflection

Address the following (concise bullets or short paragraphs are fine):

1. Key takeaway
   - What part of this assignment most surprised you or led to the most significant improvement in your Python understanding?
   - Include a concrete before/after to illustrate how this assignment has changed your approach to problem solving, syntax, styling, or other implementation details.
2. GenAI use
   - If used, specify the tool / model used, how you used it, how you verified correctness, and how it was most helpful (breadth / depth of understanding, quality of code, time to completion, etc.). Note any limits or problems you observed and how you mitigated them.
   - If not, why and when do you expect to use it in this course, if at all?
3. Feedback
   - Approximately how much time did you spend on this assignment?
   - What was the most difficult part?
   - How would you improve it?
   - Anything else you want to share or ask?

**Reflection — HW1: Base Python**

**Key takeaway**
Small, testable functions + basic PEP 8 made debugging faster. Picking the right structure matters: dict for counts/lookups, set for uniqueness, iterator vs list for memory vs reuse.

**Most surprising / biggest improvement**

Realizing zip is lazy/single-pass and why my second list(zip(...)) was empty.

Using dict.get (and Counter) collapsed “create-or-update” into one clean line.

**Concrete before/after**

Before: nested loops and if/else to update counts.
After: counts[ch] = counts.get(ch, 0) + 1 (shorter, clearer, fewer bugs).
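
Spelled out as a runnable comparison (the sample string is an arbitrary choice for illustration):

```python
from collections import Counter

text = 'banana'

# Before: explicit membership check with if/else
counts_before = {}
for ch in text:
    if ch in counts_before:
        counts_before[ch] += 1
    else:
        counts_before[ch] = 1

# After: dict.get supplies the default, so one line handles both cases
counts_after = {}
for ch in text:
    counts_after[ch] = counts_after.get(ch, 0) + 1

# Counter from the standard library gives the same tallies
assert counts_before == counts_after == dict(Counter(text))
print(counts_after)  # {'b': 1, 'a': 3, 'n': 2}
```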


**GenAI use**

Tools: ChatGPT and Claude.

How used: to clarify topics, draft function variants, and compare approaches.

Verification: ran all provided tests; added a few quick edge checks; simplified code to match course level.

Helpfulness/limits: sped up exploration and explanations; sometimes suggested advanced patterns, which I handled by simplifying and testing.

**Feedback**

Time spent: 15+ hours.

Most difficult: the overall length and repeated sections; parts felt off-track from class topics.

How to improve: shorten and de-duplicate tasks, align more with lectures, add a few guided hints/examples, and include “why this test exists” notes.

Anything else: I learned a lot, but the workload was heavy; smaller, staged milestones would help.

