# Chapter 2: Fundamentals of Python Programming for Data Analysis

This chapter teaches the core Python skills you’ll use constantly as a data analyst: writing clean code, working with common data structures, making decisions with logic, repeating work with loops, organizing code with functions, using packages, handling errors safely, and reading/writing files.

## Introduction

**Why this matters for data analytics:** Most data problems start as plain text, numbers, and lists of records. Before using specialized libraries (NumPy/Pandas), you must be comfortable with Python fundamentals—because these skills help you understand what libraries are doing *under the hood*, debug issues, and write reliable analysis code.

**You’ll learn to:**
- Write readable Python code (style + conventions).
- Use variables and built-in data types correctly.
- Work with lists/tuples/sets/dicts for real-world data.
- Use indexing/slicing to select subsets of data.
- Use conditionals and loops to express logic and repeat tasks.
- Create functions (including small lambdas) to reduce repetition.
- Import and use modules/packages.
- Handle errors with exceptions so your code fails safely.
- Read and write common file formats: TXT, CSV, JSON.

**Tip:** Run the code cells in order. Each section builds on the previous one.

In [None]:
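# Setup cell (safe to run multiple times)
# Postpone evaluation of type hints (PEP 563) so hints such as `float | None`
# used later in this chapter do not fail on Python versions older than 3.10.
from __future__ import annotations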

import csv
import json
from pathlib import Path

print('Chapter 2 setup complete.')

---

## 2.1 Python Syntax and Coding Conventions

### Python syntax: the big ideas
- **Indentation matters**: Python uses indentation (spaces) to define code blocks.
- **Statements**: usually one per line.
- **Comments**: start with `#`.
- **Docstrings**: triple quotes inside functions/classes to describe what they do.

### Coding conventions (PEP 8 basics)
These conventions make your code easier to read and maintain:
- Use `snake_case` for variables and functions.
- Use `CapWords` for class names.
- Keep lines reasonably short (often ~79–99 chars).
- Use meaningful names: `total_sales` is better than `ts`.

> **Common mistake:** Mixing tabs and spaces can break indentation. Prefer 4 spaces per indentation level.

**Resources (optional):**
- PEP 8 Style Guide: https://peps.python.org/pep-0008/
- Python tutorial (official): https://docs.python.org/3/tutorial/

In [None]:
# Indentation creates a block
x = 10
if x > 5:
    message = 'x is greater than 5'
else:
    message = 'x is 5 or less'

message
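
The other conventions listed above (comments, docstrings, `snake_case`, `CapWords`) can be sketched in a few lines. The names `SalesRecord` and `total_sales` are hypothetical, chosen only for illustration.

```python
# A comment starts with '#' and is ignored by Python.

class SalesRecord:  # CapWords for class names
    """Hold one sale: a product name and an amount (hypothetical example)."""

    def __init__(self, product, amount):
        self.product = product
        self.amount = amount


def total_sales(records):  # snake_case for functions and variables
    """Return the sum of the `amount` attribute over all records."""
    return sum(record.amount for record in records)


records = [SalesRecord('Laptop', 999.99), SalesRecord('Mouse', 25.0)]
print(total_sales.__doc__)   # the docstring is stored on the function object
print(total_sales(records))
```

Calling `help(total_sales)` in a notebook displays the same docstring, which is one reason short docstrings pay off.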

### Writing readable code (a mini-pattern)
When you write analysis code, a common pattern is:
1. **Load** data
2. **Clean/transform** data
3. **Analyze** data
4. **Report** results

Even in this fundamentals chapter, we’ll practice writing code in small, understandable steps with clear variable names.
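
As a rough sketch of the four steps above, using only in-memory data (the `raw_rows` values are made up for illustration, not taken from a real dataset):

```python
# 1. Load: here the "file" is just a list of raw text values
raw_rows = ['120', ' 80 ', '', '95']

# 2. Clean/transform: strip whitespace, drop empty entries, convert to numbers
clean_values = [float(r.strip()) for r in raw_rows if r.strip()]

# 3. Analyze: compute simple summaries
total = sum(clean_values)
average = total / len(clean_values)

# 4. Report: print a readable result
print(f'Rows kept: {len(clean_values)} of {len(raw_rows)}')
print(f'Total: {total:.2f}, average: {average:.2f}')
```

Keeping each step on its own lines makes it easier to inspect intermediate values when a result looks wrong.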

---

## 2.2 Variables and Data Types

### What is a variable?
A **variable** is a name that points to a value in memory. You assign values using `=`.

### Common built-in data types
- `int`: whole numbers (e.g., `42`)
- `float`: decimals (e.g., `3.14`)
- `str`: text (e.g., `'hello'`)
- `bool`: `True`/`False`
- `NoneType`: `None` (represents “no value”)

In data analytics, correct types matter because math and comparisons behave differently depending on the type.

> **Tip:** Use `type(value)` to inspect the type of a value.
>
> **Common mistake:** Reading numbers from files often gives you strings (e.g., `'100'`), not integers. You usually must convert them.

In [None]:
name = 'Aisha'
age = 22
gpa = 3.75
is_active = True
unknown_value = None

print(type(name), name)
print(type(age), age)
print(type(gpa), gpa)
print(type(is_active), is_active)
print(type(unknown_value), unknown_value)

### Type conversion (casting)
You often convert types to make data usable.
- `int('123')` converts a numeric string to an integer
- `float('3.5')` converts to a float
- `str(123)` converts to a string

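> **Warning:** `int('3.5')` raises a `ValueError` because `'3.5'` is not a whole-number string. Convert with `float('3.5')` first; if you then need an integer, note that `int(3.5)` truncates to `3` rather than rounding.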

In [None]:
raw_count = '120'
count = int(raw_count)

raw_price = '19.99'
price = float(raw_price)

print(count, type(count))
print(price, type(price))
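
To illustrate the warning above, this sketch shows `int()` rejecting a decimal string and the two-step conversion that works. The variable names are illustrative.

```python
raw_value = '3.5'

try:
    int(raw_value)            # raises ValueError: not a whole-number string
except ValueError as exc:
    print('Conversion failed:', exc)

as_float = float(raw_value)   # parse the decimal text first
truncated = int(as_float)     # then drop the fractional part (no rounding)

print(as_float, truncated)
```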

### Exercise 2.2
1. Create variables `product`, `units_sold`, and `unit_price`.
2. Compute `revenue = units_sold * unit_price`.
3. Print a sentence using an f-string, like: `"Laptop generated $9999.90"`.

Try changing `units_sold` to a string first (e.g., `'10'`) and see what happens—then fix it by converting it to `int`.

In [None]:
# Your turn (exercise)
product = 'Laptop'
units_sold = 10
unit_price = 999.99

revenue = units_sold * unit_price
print(f'{product} generated ${revenue:.2f}')

---

## 2.3 Numeric, String, and Boolean Operations

### Numeric operations
Python supports the usual math operators:
- `+`, `-`, `*`, `/` (division gives a float)
- `//` (floor division)
- `%` (modulo / remainder)
- `**` (power)

### String operations
- Concatenate with `+`
- Repeat with `*`
- Use f-strings for readable formatting

### Boolean operations
- Comparisons: `==`, `!=`, `<`, `<=`, `>`, `>=`
- Logical operators: `and`, `or`, `not`

> **Tip:** Use parentheses to make logic easy to read.
>
> **Common mistake:** `=` is assignment, `==` is comparison.

In [None]:
# Numeric
a = 17
b = 5

print('a + b =', a + b)
print('a / b =', a / b)
print('a // b =', a // b)
print('a % b =', a % b)
print('a ** b =', a ** b)

# Strings
first = 'Data'
second = 'Analytics'
full = first + ' ' + second
print(full)
print('ha' * 3)

# Booleans
is_big = a > 10
is_even = (a % 2) == 0
print('is_big:', is_big)
print('is_even:', is_even)
print('big AND even:', is_big and is_even)
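
The string section mentions f-strings and the tip recommends parentheses, so here is a short sketch of both (variable names and values are illustrative). In Python, `and` binds more tightly than `or`, so explicit parentheses make the intended grouping visible.

```python
item = 'Notebook'
price = 1234.5
share = 0.256

# Format specifiers go after a colon inside the braces
print(f'{item}: ${price:,.2f}')         # thousands separator, 2 decimals
print(f'Share of sales: {share:.1%}')   # shown as a percentage with 1 decimal

# `and` is evaluated before `or`
in_stock = False
on_sale = True
is_member = True

without_parens = on_sale or is_member and in_stock    # same as: on_sale or (is_member and in_stock)
with_parens = (on_sale or is_member) and in_stock

print(without_parens, with_parens)
```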

### Exercise 2.3
You work at a store. A customer gets free shipping if:
- Their cart total is at least $50, **or**
- They are a premium member

Write code that calculates `free_shipping` based on these rules, then print the result.

In [None]:
cart_total = 42.50
is_premium = True

free_shipping = (cart_total >= 50) or is_premium
print('Free shipping?', free_shipping)

---

## 2.4 Lists, Tuples, Sets, and Dictionaries

Python has four core container types that show up constantly in analytics work.

### Lists (`list`)
- Ordered collection (keeps insertion order)
- Mutable (you can change it)
- Good for sequences of values: daily sales, names, scores

### Tuples (`tuple`)
- Ordered
- Immutable (cannot be changed)
- Good for fixed records like `(x, y)` coordinates

### Sets (`set`)
- Unordered collection of **unique** values
- Great for removing duplicates or testing membership quickly

### Dictionaries (`dict`)
- Key-value pairs: `{key: value}`
- Keys are unique
- Great for mapping IDs to records, or feature names to values

> **Tip:** In analytics, dictionaries are often used to represent one “row” (record). A list of dictionaries can represent a small dataset.
>
> **Common mistake:** Using a list when you actually need a dictionary (you need keys/names, not positions).

In [None]:
# A small dataset as a list of dicts (each dict is like a row)
transactions = [
    {'customer': 'Aisha', 'amount': 120.0, 'category': 'Books'},
    {'customer': 'Omar', 'amount': 75.5, 'category': 'Groceries'},
    {'customer': 'Aisha', 'amount': 35.0, 'category': 'Groceries'},
]

transactions

In [None]:
# List operations
amounts = [t['amount'] for t in transactions]
print('Amounts:', amounts)

# Tuple example
point = (3, 7)
print('Point:', point)

# Set example: unique customers
unique_customers = {t['customer'] for t in transactions}
print('Unique customers:', unique_customers)

# Dict example: summarize by customer
totals_by_customer = {}
for t in transactions:
    customer = t['customer']
    totals_by_customer[customer] = totals_by_customer.get(customer, 0) + t['amount']

totals_by_customer
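
The properties listed above (mutable lists, immutable tuples, unique set members, keyed dictionary lookups) can be checked directly. This is a minimal sketch with made-up values.

```python
scores = [70, 85, 90]
scores[0] = 75            # lists are mutable: item assignment works
scores.append(85)         # duplicates are allowed in a list

point = (3, 7)
try:
    point[0] = 10         # tuples are immutable: this raises TypeError
except TypeError as exc:
    print('Cannot modify tuple:', exc)

unique_scores = set(scores)        # duplicates collapse to one entry
print(85 in unique_scores)         # membership test

record = {'customer': 'Aisha', 'amount': 120.0}
print(record['customer'])               # lookup by key
print(record.get('category', 'n/a'))    # .get() returns a default when the key is missing
```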

### Exercise 2.4
1. Create a list of 5 numbers (your choice).
2. Add a new number to the list.
3. Create a set from the list and observe how duplicates behave.
4. Create a dictionary with your name as a key and your favorite number as a value.

In [None]:
numbers = [2, 4, 4, 8, 16]
numbers.append(32)
unique_numbers = set(numbers)

profile = {'Haseeb': 7}

print('numbers:', numbers)
print('unique_numbers:', unique_numbers)
print('profile:', profile)

---

## 2.5 Indexing and Slicing

Indexing and slicing let you select parts of sequences (like lists and strings).

### Indexing (one item)
- `my_list[0]` is the first item
- `my_list[-1]` is the last item

### Slicing (a range)
- `my_list[start:stop]` returns items from `start` up to (not including) `stop`
- `my_list[:3]` means “first 3 items”
- `my_list[3:]` means “from index 3 to the end”
- `my_list[::2]` means “every 2nd item”

**Visual idea (indexes):**
```
values = ['a', 'b', 'c', 'd', 'e']
index:    0    1    2    3    4
index:   -5   -4   -3   -2   -1
```

> **Common mistake:** Off-by-one errors in slicing. Remember: the `stop` index is *not included*.

In [None]:
values = ['a', 'b', 'c', 'd', 'e']

print('First item:', values[0])
print('Last item:', values[-1])
print('First 3:', values[:3])
print('From index 2 onward:', values[2:])
print('Middle (1 to 3):', values[1:4])
print('Every 2nd item:', values[::2])

text = 'DataAnalytics'
print('text[:4] =', text[:4])
print('text[-5:] =', text[-5:])

### Exercise 2.5
Given a list of daily temperatures, compute:
- The first 7 days
- The last 7 days
- Every other day

In [None]:
temps = [22, 23, 21, 20, 24, 25, 26, 23, 22, 21, 20, 19, 23, 24]

first_week = temps[:7]
last_week = temps[-7:]
every_other = temps[::2]

print('first_week:', first_week)
print('last_week:', last_week)
print('every_other:', every_other)

---

## 2.6 Conditional Statements

Conditionals let your code choose different actions depending on data.

### `if`, `elif`, `else`
- Use `if` for the first condition
- Use `elif` (“else if”) for additional conditions
- Use `else` for the default case

**Why analysts use this:** to categorize data (e.g., “high/medium/low”), apply business rules, or filter records.

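> **Tip:** Order your conditions from most specific to most general, because Python checks them from top to bottom.
>
> **Common mistake:** Forgetting that only the first matching branch runs; later branches are skipped even if their conditions are also true.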

In [None]:
score = 82

if score >= 90:
    grade = 'A'
elif score >= 80:
    grade = 'B'
elif score >= 70:
    grade = 'C'
else:
    grade = 'Needs improvement'

grade
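
Because only the first matching branch runs, the order of the checks changes the result. The sketch below applies the same thresholds in two different orders; `wrong_grade` and `right_grade` are illustrative names.

```python
score = 95

# Wrong order: the broadest check comes first and captures every score of 70+
if score >= 70:
    wrong_grade = 'C'
elif score >= 80:
    wrong_grade = 'B'   # never reached
elif score >= 90:
    wrong_grade = 'A'   # never reached
else:
    wrong_grade = 'Needs improvement'

# Right order: the narrowest (highest) threshold comes first
if score >= 90:
    right_grade = 'A'
elif score >= 80:
    right_grade = 'B'
elif score >= 70:
    right_grade = 'C'
else:
    right_grade = 'Needs improvement'

print(wrong_grade, right_grade)
```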

### Exercise 2.6
Write code that categorizes a transaction amount as:
- `small` if `< 50`
- `medium` if `>= 50` and `< 200`
- `large` if `>= 200`

In [None]:
amount = 210

if amount < 50:
    size = 'small'
elif amount < 200:
    size = 'medium'
else:
    size = 'large'

size

---

## 2.7 Loops (`for`, `while`)

Loops repeat work—perfect for iterating through records and computing summaries.

### `for` loops
Use `for` when you have a collection (list, dict, file lines) and you want to process each item.

### `while` loops
Use `while` when you repeat until a condition becomes false.

> **Tip:** Prefer `for` loops for data processing; they’re clearer and less error-prone.
>
> **Common mistake:** Creating an infinite `while` loop by forgetting to update the variable that the loop condition depends on.

In [None]:
# Summing values with a for-loop
daily_sales = [120, 80, 95, 110, 60]

total = 0
for s in daily_sales:
    total += s

avg = total / len(daily_sales)
print('Total:', total)
print('Average:', avg)

In [None]:
# A small while-loop example
threshold = 300
running_total = 0
day = 0

while running_total < threshold and day < len(daily_sales):
    running_total += daily_sales[day]
    day += 1

print('Days needed to reach threshold:', day)
print('Running total:', running_total)

### Loop helpers you’ll use a lot
- `range(n)` gives the integers `0` through `n-1`
- `enumerate(items)` gives `(index, value)` pairs, with the index starting at `0` by default
- `break` exits the loop early
- `continue` skips to the next iteration

> **Tip:** `enumerate(...)` is usually better than manually tracking an index variable.

In [None]:
for i, s in enumerate(daily_sales):
    print(f'Day {i}: sales={s}')
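
`range`, `break`, and `continue` from the list above do not appear in the earlier cells, so here is a small sketch. The `readings` list, the use of `None` for a missing value, and `-1` as a stop signal are assumptions made for illustration.

```python
readings = [12, None, 7, -1, 30]   # None marks a missing value; -1 is a stop signal

valid = []
for value in readings:
    if value is None:
        continue      # skip this item and move on to the next one
    if value < 0:
        break         # stop the loop entirely
    valid.append(value)

print('valid readings:', valid)

# range(n) yields 0, 1, ..., n-1
for i in range(3):
    print('iteration', i)

# enumerate can start counting at 1 for human-friendly labels
for day_number, value in enumerate(valid, start=1):
    print(f'Reading {day_number}: {value}')
```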

### Exercise 2.7
Compute the **maximum** sale value in `daily_sales` **without using** `max()`.
(This is good practice for understanding how summaries work.)

In [None]:
max_sale = daily_sales[0]
for s in daily_sales[1:]:
    if s > max_sale:
        max_sale = s

max_sale

---

## 2.8 Functions and Lambda Expressions

### Why functions?
A function lets you name a reusable block of logic. In analytics code, functions help you:
- avoid repeating yourself
- test logic on small inputs
- make notebooks easier to read

### Function basics
- Define with `def`
- Use parameters to accept inputs
- Use `return` to output a result
- Add a docstring to explain purpose and inputs/outputs

### Lambda expressions
A `lambda` is a tiny, anonymous function, often used for simple transformations (e.g., sorting).

> **Tip:** Prefer `def` for anything non-trivial. Lambdas are best for one-line operations.
>
> **Common mistake:** Writing complex lambdas that are hard to read.

In [None]:
def average(values: list[float]) -> float:
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError('values must not be empty')
    return sum(values) / len(values)

print('Average sales:', average([120, 80, 95, 110, 60]))

In [None]:
# Lambda example: sort transactions by amount
sorted_transactions = sorted(transactions, key=lambda t: t['amount'], reverse=True)
sorted_transactions
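
A lambda and a small named function are interchangeable as a sort `key`. The sketch below uses its own sample list (`sample`, hypothetical) so it runs independently; `amount_of` is an illustrative name.

```python
sample = [
    {'customer': 'Aisha', 'amount': 120.0},
    {'customer': 'Omar', 'amount': 75.5},
    {'customer': 'Noor', 'amount': 200.0},
]

def amount_of(transaction):
    """Return the amount of one transaction dict."""
    return transaction['amount']

by_lambda = sorted(sample, key=lambda t: t['amount'])
by_function = sorted(sample, key=amount_of)

print(by_lambda == by_function)
print([t['customer'] for t in by_function])
```

The named version is longer, but it can carry a docstring and be tested on its own, which matters once the logic grows beyond one expression.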

### Exercise 2.8
1. Write a function `is_weekend(day_name)` that returns `True` for Saturday/Sunday.
2. Write a function `clean_text(s)` that:
   - strips leading/trailing spaces
   - lowercases the text

Test your functions with a few inputs.

In [None]:
def is_weekend(day_name: str) -> bool:
    day = day_name.strip().lower()
    return day in {'saturday', 'sunday'}

def clean_text(s: str) -> str:
    return s.strip().lower()

print(is_weekend(' Saturday '))
print(is_weekend('Monday'))
print(clean_text('  Data Analytics  '))

---

## 2.9 Modules and Packages

### What is a module?
A **module** is a Python file (or built-in library) that contains code you can reuse.

### What is a package?
A **package** is a collection of modules. Many analytics tools (like `pandas`) are packages.

### Import patterns you’ll see
- `import math`
- `from pathlib import Path`
- `import pandas as pd` (common alias)

**Why analysts use this:** You don’t want to rewrite CSV reading, JSON parsing, date/time logic, etc. The standard library has a lot already.

> **Tip:** If you get `ModuleNotFoundError`, the package may not be installed in your current environment.

**Resources (optional):**
- Standard library overview: https://docs.python.org/3/library/

In [None]:
import math
from statistics import mean

values = [1, 4, 9, 16]
roots = [math.sqrt(v) for v in values]

print('roots:', roots)
print('mean roots:', mean(roots))
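
The three import forms above differ in how you refer to the imported names afterwards. This sketch uses only the standard library; `not_a_real_package_xyz` is a deliberately made-up name used to trigger `ModuleNotFoundError`.

```python
import math                       # access names with the module prefix: math.sqrt
from pathlib import Path          # import one name directly: Path
import statistics as stats        # import with an alias (same idea as `import pandas as pd`)

print(math.sqrt(16))
print(Path('data') / 'sales.csv')
print(stats.median([3, 1, 2]))

try:
    import not_a_real_package_xyz  # hypothetical package that is not installed
except ModuleNotFoundError as exc:
    print('Import failed:', exc)
```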

---

## 2.10 Error Handling and Exceptions

### What is an exception?
An **exception** is how Python tells you something went wrong during execution (e.g., dividing by zero, invalid conversion, missing file).

### Why analysts care
In data work, unexpected values are common: missing data, malformed numbers, weird dates. Error handling helps your code:
- fail with a clear message
- skip bad rows safely
- continue processing while tracking issues

### `try` / `except`
Use this pattern when you *expect* something might fail and you have a safe fallback plan.

> **Warning:** Avoid a bare `except:`. It also catches `KeyboardInterrupt` and `SystemExit`, and it hides bugs you did not anticipate. Catch specific exceptions like `ValueError` or `FileNotFoundError`.
>
> **Common mistake:** Catching errors and ignoring them silently, which can hide data quality problems.

In [None]:
def safe_float(text: str) -> float | None:
    """Convert text to float. Return None if conversion fails."""
    try:
        return float(text)
    except ValueError:
        return None

samples = ['3.14', '10', 'not-a-number', '']
converted = [safe_float(s) for s in samples]
converted
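
One way to continue processing while tracking issues is to collect the rejected values alongside the good ones. This is a minimal sketch, assuming each raw value is a string; `raw_amounts` and `rejected` are illustrative names.

```python
raw_amounts = ['120.0', 'n/a', '75.5', '', '35']

amounts = []
rejected = []   # (position, raw value) pairs kept for later inspection

for position, raw in enumerate(raw_amounts):
    try:
        amounts.append(float(raw))
    except ValueError:
        rejected.append((position, raw))

print('parsed:', amounts)
print('rejected:', rejected)
```

Reporting the count and positions of rejected values keeps data quality problems visible instead of silently dropping them.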

### Raising exceptions intentionally
Sometimes you should *stop* execution if the input is invalid. This makes errors easier to diagnose.

In [None]:
def percent_change(old: float, new: float) -> float:
    """Return percent change from old to new.

    Example: old=50, new=75 -> 50.0 (meaning +50%)
    """
    if old == 0:
        raise ValueError('old must not be 0 (cannot divide by zero)')
    return ((new - old) / old) * 100

print(percent_change(50, 75))

### Exercise 2.10
Write a function `safe_int(text)` that returns an `int` if possible, otherwise returns `None`.
Test it with: `'123'`, `' 7 '`, `'7.2'`, `'abc'`.

In [None]:
def safe_int(text: str) -> int | None:
    try:
        return int(text.strip())
    except (ValueError, AttributeError):
        return None

for s in ['123', ' 7 ', '7.2', 'abc']:
    print(s, '->', safe_int(s))

---

## 2.11 Reading and Writing Files (CSV, TXT, JSON)

File I/O is a daily task for analysts. In this section you’ll learn:
- how to create a folder for outputs
- how to write and read a plain text file
- how to write and read CSV (a spreadsheet-friendly format)
- how to write and read JSON (a common web/API format)

We’ll use the **standard library** (`pathlib`, `csv`, `json`) so this notebook works in most Python installations.

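> **Tip:** Build file paths with `pathlib.Path` and the `/` operator instead of concatenating strings with hard-coded slashes; this keeps paths portable across operating systems.
>
> **Common mistake:** Writing files into unknown working directories. We’ll create a dedicated `chapter_02_outputs/` folder. It is created relative to the current working directory, and `outputs_dir.resolve()` shows its absolute location.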

In [None]:
outputs_dir = Path('chapter_02_outputs')
outputs_dir.mkdir(exist_ok=True)
outputs_dir.resolve()

### TXT: Writing and reading plain text
Plain text is useful for logs, notes, and simple data exports.

In [None]:
txt_path = outputs_dir / 'notes.txt'

lines = [
    'Chapter 2 notes',
    'Key idea: data types matter',
    'Key idea: handle errors safely',
]

txt_path.write_text('\n'.join(lines), encoding='utf-8')
print('Wrote:', txt_path)

loaded_text = txt_path.read_text(encoding='utf-8')
print('--- file contents ---')
print(loaded_text)
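
Two common follow-ups are splitting the loaded text into lines and handling a missing file. The sketch below writes to a temporary directory so it does not depend on `outputs_dir`; the file names are made up.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'demo_notes.txt'
    demo_path.write_text('first line\nsecond line\n', encoding='utf-8')

    # splitlines() returns the lines without their newline characters
    file_lines = demo_path.read_text(encoding='utf-8').splitlines()
    print(file_lines)

    missing_path = Path(tmp) / 'does_not_exist.txt'
    try:
        missing_path.read_text(encoding='utf-8')
    except FileNotFoundError:
        print('File not found:', missing_path.name)
```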

### CSV: Writing and reading tabular data
CSV (comma-separated values) is one of the most common formats in analytics.

We’ll create a tiny dataset as a list of dictionaries and write it to CSV.

> **Tip:** CSV stores everything as text. When you read it back, you usually need to convert columns to numbers yourself.

In [None]:
rows = [
    {'date': '2026-01-01', 'customer': 'Aisha', 'amount': 120.0},
    {'date': '2026-01-02', 'customer': 'Omar', 'amount': 75.5},
    {'date': '2026-01-03', 'customer': 'Aisha', 'amount': 35.0},
]

csv_path = outputs_dir / 'transactions.csv'

with csv_path.open('w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['date', 'customer', 'amount'])
    writer.writeheader()
    writer.writerows(rows)

print('Wrote:', csv_path)

In [None]:
# Read the CSV back
loaded_rows = []
with csv_path.open('r', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Convert amount from string to float
        row['amount'] = float(row['amount'])
        loaded_rows.append(row)

loaded_rows

#### A simple “table” view (no extra libraries)
This is not a full table library, but it’s a nice quick visual check for small datasets.

In [None]:
def print_table(rows: list[dict], columns: list[str]) -> None:
    # Wrap the candidates in a list so max() also works when rows is empty.
    widths = {col: max([len(col), *(len(str(r.get(col, ''))) for r in rows)]) for col in columns}
    header = ' | '.join(col.ljust(widths[col]) for col in columns)
    sep = '-+-'.join('-' * widths[col] for col in columns)
    print(header)
    print(sep)
    for r in rows:
        print(' | '.join(str(r.get(col, '')).ljust(widths[col]) for col in columns))

print_table(loaded_rows, columns=['date', 'customer', 'amount'])

### JSON: Writing and reading structured data
JSON is common in APIs and web data. It represents nested structures (lists + dictionaries).

> **Tip:** Use `indent=2` when writing JSON for readability (great for debugging).

In [None]:
json_path = outputs_dir / 'transactions.json'

payload = {
    'dataset': 'transactions',
    'rows': rows,
}

json_path.write_text(json.dumps(payload, indent=2), encoding='utf-8')
print('Wrote:', json_path)

loaded_payload = json.loads(json_path.read_text(encoding='utf-8'))
loaded_payload
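
A round trip through JSON does not always return identical Python objects, because JSON has fewer types. The sketch below (with made-up data) shows three differences: tuples come back as lists, `None` is stored as `null`, and non-string dictionary keys become strings.

```python
import json

original = {
    'point': (3, 7),            # tuple
    'missing': None,            # None
    'by_id': {101: 'Aisha'},    # integer key
}

text = json.dumps(original)
restored = json.loads(text)

print(text)
print(restored['point'])     # a list, not a tuple
print(restored['missing'])   # None again
print(restored['by_id'])     # the key is now the string '101'
```

If exact types matter after loading, convert them back explicitly (for example `tuple(restored['point'])`).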

### Exercise 2.11
1. Add a new row to `rows` and write the result to a new CSV file, `transactions_v2.csv` (the mini-project at the end of the chapter reuses it).
2. Read the CSV and compute total amount per customer (dictionary).
3. Save the totals to a JSON file called `totals_by_customer.json`.

In [None]:
# Solution example (try your own first!)
rows2 = rows + [{'date': '2026-01-04', 'customer': 'Noor', 'amount': 200.0}]

csv_path2 = outputs_dir / 'transactions_v2.csv'
with csv_path2.open('w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['date', 'customer', 'amount'])
    writer.writeheader()
    writer.writerows(rows2)

totals = {}
with csv_path2.open('r', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        customer = row['customer']
        amount = float(row['amount'])
        totals[customer] = totals.get(customer, 0.0) + amount

totals_path = outputs_dir / 'totals_by_customer.json'
totals_path.write_text(json.dumps(totals, indent=2), encoding='utf-8')

print('Totals:', totals)
print('Saved:', totals_path)

---

## Mini-Project (Chapter 2): Build a Tiny Transaction Analyzer

**Goal:** Use Python fundamentals to analyze a small dataset.

### Requirements
Using the `transactions_v2.csv` you created (or create it again):
1. Read the CSV file.
2. Convert `amount` to float.
3. Compute:
   - total number of rows
   - total amount
   - average amount
   - totals by customer (dictionary)
4. Identify the customer with the highest total.
5. Write a short text report (`report.txt`) to `chapter_02_outputs/`.

**Why this project helps:** It combines variables, types, loops, dicts, functions, and file I/O—exactly what analysts do every day.

> **Tip:** Build it in small steps and print intermediate results to verify correctness.

In [None]:
def load_transactions_csv(path: Path) -> list[dict]:
    items: list[dict] = []
    with path.open('r', newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            row['amount'] = float(row['amount'])
            items.append(row)
    return items

def summarize_transactions(items: list[dict]) -> dict:
    amounts = [item['amount'] for item in items]
    totals_by_customer: dict[str, float] = {}
    for item in items:
        customer = item['customer']
        totals_by_customer[customer] = totals_by_customer.get(customer, 0.0) + item['amount']

    top_customer = max(totals_by_customer, key=totals_by_customer.get) if totals_by_customer else None

    return {
        'row_count': len(items),
        'total_amount': sum(amounts) if amounts else 0.0,
        'average_amount': (sum(amounts) / len(amounts)) if amounts else 0.0,
        'totals_by_customer': totals_by_customer,
        'top_customer': top_customer,
        'top_customer_total': totals_by_customer.get(top_customer, 0.0) if top_customer else 0.0,
    }

data_path = outputs_dir / 'transactions_v2.csv'
items = load_transactions_csv(data_path)
summary = summarize_transactions(items)

summary

In [None]:
# Write a simple text report
report_lines = [
    'Transaction Analyzer Report',
    '=' * len('Transaction Analyzer Report'),
    f"Total rows: {summary['row_count']}",
    f"Total amount: ${summary['total_amount']:.2f}",
    f"Average amount: ${summary['average_amount']:.2f}",
    '',
    'Totals by customer:',
]

for customer, total in summary['totals_by_customer'].items():
    report_lines.append(f"  - {customer}: ${total:.2f}")

report_lines.append('')
report_lines.append(f"Top customer: {summary['top_customer']} (${summary['top_customer_total']:.2f})")

report_path = outputs_dir / 'report.txt'
report_path.write_text('\n'.join(report_lines), encoding='utf-8')

print('--- Report saved to:', report_path, '---')
print('\n'.join(report_lines))

---

## Summary / Key Takeaways

- Python’s **syntax and indentation** define structure—readability matters.
- **Variables and types** are foundational: many analytics bugs are type bugs.
- **Lists, tuples, sets, and dicts** are the core ways to store and shape data before (and even alongside) libraries.
- **Indexing/slicing** helps you extract subsets of sequences correctly.
- **Conditionals and loops** express rules and repetition for processing records.
- **Functions** keep notebooks clean and logic reusable; prefer `def` over complex lambdas.
- **Modules/packages** let you reuse code from the standard library and the wider ecosystem.
- **Exceptions** help your code fail safely and handle messy real-world data.
- **File I/O (TXT/CSV/JSON)** is essential for importing/exporting data and building reproducible workflows.

**Next chapter:** We’ll move from Python fundamentals to faster numeric work with NumPy arrays.