# Module 2 - Strings & Core Collections

**Scope note (important):** This lesson intentionally avoids *flow control* (no `if`, `for`, `while`) and avoids `try/except`.  
We focus on **data representation** and **core operations** that you will later combine with flow control to build real pipelines.

**What you should be able to do by the end**
- Explain what a **sequence** is and what strings, lists, and tuples have in common.
- Use indexing and slicing (including negative indexing) confidently.
- Apply the most useful **string methods** for cleaning and parsing text.
- Create and manipulate **lists**, **tuples**, **dictionaries**, and **sets**.
- Choose the right structure for the job (and explain why).

## What are Sequences?

A **sequence** is an **ordered** collection of items.

In this course, the three main sequence types we use are:
- `str` (strings) - sequence of **characters**
- `list` - sequence of **objects** (mutable)
- `tuple` - sequence of **objects** (immutable)

### What these three share (common sequence behavior)

All three support:
- `len(x)` → number of items
- **indexing**: `x[0]`, `x[1]`, `x[-1]`  
- **slicing**: `x[start:stop:step]`
- **membership**: `item in x`
- concatenation with `+` and repetition with `*`

> We *will* learn to iterate over sequences with `for` loops in the next module.  
> For now, we use indexing/slicing and built-ins.

In [None]:
# A quick sequence playground: str, list, tuple

text = "data"
lst  = ["d", "a", "t", "a"]
tup  = ("d", "a", "t", "a")

print(len(text), len(lst), len(tup))

print(text[0], lst[0], tup[0])     # indexing
print(text[-1], lst[-1], tup[-1])  # negative indexing

print(text[1:3], lst[1:3], tup[1:3])  # slicing

print("a" in text, "a" in lst, "a" in tup)  # membership

print(text + " engineering")
print(lst + ["!"])
print(tup + ("!",))

print(text * 3)
print(lst * 2)
print(tup * 2)

4 4 4
d d d
a a a
at ['a', 't'] ('a', 't')
True True True
data engineering
['d', 'a', 't', 'a', '!']
('d', 'a', 't', 'a', '!')
datadatadata
['d', 'a', 't', 'a', 'd', 'a', 't', 'a']
('d', 'a', 't', 'a', 'd', 'a', 't', 'a')


### Indexing and negative indexing

Python uses **0-based indexing**:
- `x[0]` is the first item
- `x[1]` is the second item

**Negative indexing** counts from the end:
- `x[-1]` is the last item
- `x[-2]` is the item before last

Negative indexing is useful because "last item" is a common need and you don't want to compute `len(x) - 1` all the time.

In [None]:
seq = "cloud"

print(seq[0])    # first char
print(seq[-1])   # last char
print(seq[len(seq)-1])  # same as [-1], but more work

### Slicing rules (the ones you actually need)

Given `seq[start:stop:step]`:
- `start` is **included** (default: 0)
- `stop` is **excluded** (default: `len(seq)`)
- `step` is the jump size (default: 1)
- Negative `step` walks backwards.

Common patterns:
- `seq[:n]` → first `n` items
- `seq[-n:]` → last `n` items
- `seq[::2]` → every second item
- `seq[::-1]` → reversed sequence (creates a new one)
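
A minimal sketch of the defaults listed above, using a made-up string (`seq` and `n` are illustrative names): leaving out `start`, `stop`, or `step` is the same as spelling out the default.

```python
seq = "pipeline"
n = 3

# Omitted parts fall back to the defaults: start=0, stop=len(seq), step=1
print(seq[:n] == seq[0:n:1])        # True
print(seq[n:] == seq[n:len(seq)])   # True

# stop is excluded, so the two pieces never overlap
print(seq[:n] + seq[n:] == seq)     # True

# A negative step walks backwards from the end
print(seq[::-1])                    # enilepip
```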

**Example:** Even-index characters (`s[::2]`), odd-index characters (`s[1::2]`), and reversing the string (`s[::-1]`).

In [None]:
s = "Roses are red, violets are blue"

s[::2]   # every second char (even indices)
s[1::2]  # odd indices
s[::-1]   # reversed

'eulb era steloiv ,der era sesoR'

In [None]:
s = "Roses are red, violets are blue"

print(s[:5])       # first 5 chars
print(s[-4:])      # last 4 chars
print(s[::2])      # every second char
print(s[::-1])     # reversed

## Strings in depth

Strings (`str`) are one of the most important data types in data engineering:
- parsing log lines, CSV rows, JSON fields
- cleaning IDs, emails, free text, column names
- building file paths, SQL snippets, messages, metrics labels

### Key property: strings are **immutable**

Strings cannot be modified once created. You cannot use indexing to replace a character; instead, you create a new string based on the existing one with your modification applied. The same goes for "manipulating" a string with any of the string methods: you actually get a **new** string back.

In [None]:
text = "Hello, Data"
clean = text.strip().lower().replace("data", "Engineers")

print(text, id(text))    # original stays the same
print(clean, id(clean))  # new string

Hello, Data 138226205881520
hello, Engineers 138226205881712
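
Because item assignment is not allowed on a `str`, the usual workaround is to build a new string from slices of the old one. A small sketch (the variable names are illustrative):

```python
word = "data"

# word[0] = "D" would raise TypeError: strings do not support item assignment
capitalized = "D" + word[1:]   # new string built from a slice of the old one

print(word)         # data  (unchanged)
print(capitalized)  # Data
```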


### Quotes and escaping

In [None]:
print('Hello\\ninja!')

Hello\ninja!


Escaping solves two common problems:

1) **Including special characters inside the quotes**
- If your string is delimited by `"..."`, then an unescaped `"` would end the string early.
- Same for `'...'` and `'`.

2) **Writing "non-printing" characters in a readable way**
- Newline and tab are real characters, but you usually want to *type* them as `\n` and `\t`.

Common escapes:
- `\n` newline, `\t` tab
- `\"` include a double quote inside `"..."`
- `\'` include a single quote inside `'...'`
- `\\` a backslash character itself

> **Tip**: *raw strings* (prefix with `r"..."`) are useful when you want backslashes to be taken literally (e.g., Windows paths or regular expressions).

In [None]:
print("line1\nline2")
print("col1\tcol2\tcol3")
print("They said: \"Ship it!\"")
print('It\'s valid to escape single quotes too.')

# Raw string: backslashes are treated literally (handy for paths/regex)
path = r"C:\Users\alon\data\logs\app.log"
print(path)

line1
line2
col1	col2	col3
They said: "Ship it!"
It's valid to escape single quotes too.
C:\Users\alon\data\logs\app.log


### String concatenation vs formatting

**Real-world example:** A daily job builds an output folder path with the current date (e.g. `c:/data/output/2026/01/22`).

Concatenation (`+`) is fine for small things, but it gets messy quickly (you start sprinkling `str(...)` everywhere).

For readable, maintainable text generation, prefer:
- **f-strings** (recommended, Python 3.6+)
- `str.format(...)` (still common in older code and libraries)

Both support the same **format specification mini-language** (width, alignment, precision, etc.).

`datetime.datetime.now()` returns an object with `year`, `month`, `day`, `hour`, `minute`, `second`, etc.

In [None]:
import datetime

FOLDER_PATH = 'c:/data/output/'
now = datetime.datetime.now()
current_date = f'{now.year}/{now.month:02d}/{now.day:02d}'  # e.g. 2026/01/22
target_folder = f'{FOLDER_PATH}{current_date}'
target_folder

'c:/data/output/2026/02/02'

In [None]:
# Returns full datetime: year, month, day, hour, minute, second, microsecond
datetime.datetime.now()

datetime.datetime(2026, 2, 2, 16, 55, 1, 677227)

In [None]:
name = "Oz"
age = 12
avg = 83.4

# Concatenation
msg1 = "The boy's name is " + name + ", his age is " + str(age) + " and his average is " + str(avg)
print(msg1)

# f-string formatting (recommended)
msg2 = f"The boy's name is {name}, his age is {age} and his average is {avg:.1f}"
print(msg2)

### Formatting essentials you should know

#### Option A - f-strings (recommended)

They embed expressions directly: `f"Hello {name}"`.
> **Important**: any variable, function, or object you use in an f-string expression must be defined and correctly set by the time the f-string is evaluated.

#### Option B - `.format(...)`
Useful when you want a reusable template: `"Hello {}".format(name)`.
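
One way to read "reusable template": store the template string once and fill it in later with different values. An f-string cannot do this, because it is evaluated right where it is written. The sketch below uses a hypothetical log template (`LOG_TEMPLATE` and the dataset names are made up for illustration):

```python
# Define the template once; the placeholders are filled in later
LOG_TEMPLATE = "[{level}] dataset={name} rows={rows:,}"

print(LOG_TEMPLATE.format(level="INFO", name="events", rows=12345))
# [INFO] dataset=events rows=12,345
print(LOG_TEMPLATE.format(level="WARN", name="users", rows=0))
# [WARN] dataset=users rows=0
```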

#### Why you should care (data engineering angle)

Formatting is how you produce:
- clean logs (`INFO 2026-01-22 10:31:05 - loaded 12,345 rows`)
- human-readable summaries ("mean=..., p95=...")
- aligned text "tables" for quick debugging in notebooks

Below are the most useful patterns.

In [None]:
price = 1234567.891

print(f"{price:.2f}")       # precision
print(f"{price:,.2f}")      # thousands separators

# Alignment inside a "field width"
print(f"|{'City':<10}|{'Value':^10}|{'Note':<12}|")
print(f"|{'Houston':<10}|{397.82:^10.2f}|{'4 bedrooms':<12}|")


1234567.89
1,234,567.89
|City      |  Value   |Note        |
|Houston   |  397.82  |4 bedrooms  |


#### `.format(...)`: positional and named placeholders

`{}` placeholders are filled left-to-right (positional), or you can name them for clarity.

- Positional: `"{} * {} = {}".format(a, b, a*b)`
- Named: `"{x} * {y} = {prod}".format(x=a, y=b, prod=a*b)`

In [None]:
a = 1.23456
b = 2.34567
prod = a * b

# Positional placeholders
print("{} * {} = {}".format(a, b, prod))

# Named placeholders (more readable when you have many values)
print("{x:.2f} * {y:.2f} = {p:.2f}".format(x=a, y=b, p=a*b))

f"{a:.2f} * {b:.2f} = {a*b:.2f}"

1.23456 * 2.34567 = 2.8958703552000005
1.23 * 2.35 = 2.90


'1.23 * 2.35 = 2.90'

#### Format specifiers (mini-language)

Inside `{...}`, after `:`, you can control how values look:

- `:.2f` → float with 2 digits after the decimal  
- `:,` → thousands separators  
- `:>10` / `:<10` / `:^10` → align right / left / center inside width 10  
- `:+` → always show sign  
- `:.1%` → percent formatting (0.123 → 12.3%)

These work in both f-strings and `.format(...)`.

In [None]:
n_rows = 1234567
ratio = 0.23891
delta = -42

print(f"rows={n_rows:,}")
print(f"ratio={ratio:.1%}")
print(f"delta={delta:+d}")

name = "Houston"
value = 397.82
note = "4 bedrooms"

print(f"|{name:<10}|{value:^10.2f}|{note:<12}|")

#### A tiny "log line" example (no loops yet)

In pipelines you often want a one-line status message that is consistent and easy to scan.

In [None]:
dataset = "events_2026_01"
n_rows = 987654
seconds = 1.728

msg = f"[LOAD] dataset={dataset:<15} rows={n_rows:>9,} time={seconds:>6.3f}s"
print(msg)

### Useful string methods for cleaning and parsing

**Useful categories:**

- Manipulation: `lower`, `upper`, `title`, `strip`, `replace`
- Alignment: `center`, `ljust`, `rjust`
- Checks: `startswith`, `isalpha`, …
- Search and count: `count`, `find`, `index`

In [None]:
# Quick reference: lower, upper, title, strip, replace | center, ljust | startswith, isalpha | count, find, index
"  Hello  ".strip().upper()

'HELLO'

A few "workhorse" methods:
- `lower()`, `upper()`, `title()`, `capitalize()`
- `strip()`, `lstrip()`, `rstrip()`
- `replace(old, new)`
- `count(sub)`
- `find(sub)` - returns **index** of first match, or **-1** if not found
- `index(sub[, start])` - like `find()`, but raises `ValueError` if not found
- `startswith(prefix)`, `endswith(suffix)`
- `split(sep)` and `join(iterable)`
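
A sketch of how several of these methods chain together to normalize a messy column header (the header text is invented for illustration):

```python
raw_header = "  First Name  "

# Each method returns a new string, so calls can be chained left to right
column = raw_header.strip().lower().replace(" ", "_")
print(column)                       # first_name

print(column.startswith("first"))   # True
print(column.count("_"))            # 1
print(column.find("name"))          # 6
print("-".join(column.split("_")))  # first-name
```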

In [None]:
path = 'c:/data/2026/01/01/sales/new_input_01012026.csv'

#path.endswith('.csv')

path.split('/')

['c:', 'data', '2026', '01', '01', 'sales', 'new_input_01012026.csv']

In [None]:
path.split('sales')[0]

'c:/data/2026/01/01/'

**Example:** `split(sep)` splits the string into a list of substrings at each occurrence of `sep`. Then we clean punctuation and rejoin with `join`.

In [None]:
lyrics = "Who let who the, dogs out? Who? Who? Who? Who!!!"

lyrics.split('Who')   # returns list of substrings between 'Who'

words_list = lyrics.replace("?", "").replace(',','').replace('!','').split(' ')

'|'.join(words_list)


'Who|let|who|the|dogs|out|Who|Who|Who|Who'

In [None]:
'HELLO'[0]

'H'

In [None]:
['H','L','L','O'][0]

'H'

In [None]:
words_list

['Who', 'let', 'who', 'the', 'dogs', 'out', 'Who', 'Who', 'Who', 'Who']

In [None]:
lyrics = "Who let who the dogs out? Who? Who? Who? Who?"

print(lyrics.lower())
print(lyrics.count("Who"))
print(lyrics.replace("Who", "Moo"))

print(lyrics.startswith("Who"))
print(lyrics.endswith("Who?"))

parts = lyrics.split("?")  # split into pieces
print(parts)

words = lyrics.replace("?", "").replace(",", "").split()
print(words)

joined = " | ".join(words[:6])
print(joined)

who let who the dogs out? who? who? who? who?
5
Moo let who the dogs out? Moo? Moo? Moo? Moo?
True
True
['Who let who the dogs out', ' Who', ' Who', ' Who', ' Who', '']
['Who', 'let', 'who', 'the', 'dogs', 'out', 'Who', 'Who', 'Who', 'Who']
Who | let | who | the | dogs | out


### `capitalize()` and `find()`

- `capitalize()` uppercases only the **first** character of the string (the rest become lowercase). Compare with `title()` which capitalizes every word.
- `find(sub)` returns the **index** of the first occurrence, or **-1** if not found. Safer than `index()` when you're not sure the substring exists.

In [None]:
sentence = "python is a POWERFUL language"

print(sentence.capitalize())  # "Python is a powerful language"
print(sentence.title())       # "Python Is A Powerful Language"

filename = "my_report.pdf"
print(filename.find("report"))   # 3 (index where "report" starts)
print(filename.find("excel"))    # -1 (not found - no error raised)

# Practical use: check whether the substring exists (no `if` needed yet)
log_line = "2026-01-22,ERROR,user=alice,action=login"
pos = log_line.find("ERROR")
print(pos)          # 11 (index where "ERROR" starts)
print(pos != -1)    # True means the substring was found

### `in` on strings

Use `sub in text` to check substring existence. Returns a boolean.

In [None]:
text = "ERROR user=alice action=login"
text = "WARNING user=alice action=login"  # reassigned: only this second value is checked below


print("ERROR" in text)
print("user=" in text)
print("WARNING" in text)

False
True
True


### A practical parsing example (no loops yet)

Pretend this is a single log line you got from a file or a stream.
We'll parse it using `split()` and simple indexing.

In [None]:
log = "2026-01-22,INFO,user=alice,action=login,lat=32.08,lon=34.78"

parts = log.split(",")
print(parts)

date = parts[0]
level = parts[1]

# key=value pairs start at index 2
kv1 = parts[2].split("=")  # ["user", "alice"]
kv2 = parts[3].split("=")  # ["action", "login"]

user = kv1[1]
action = kv2[1]

print("date:", date)
print("level:", level)
print("user:", user)
print("action:", action)

['2026-01-22', 'INFO', 'user=alice', 'action=login', 'lat=32.08', 'lon=34.78']
date: 2026-01-22
level: INFO
user: alice
action: login


In [None]:
parts = log.split(",")

date, level, user, action, lat, lon = parts  # sequence unpacking (works for lists as well as tuples)


print(date, level, user, action, lat, lon , sep='\n')

2026-01-22
INFO
user=alice
action=login
lat=32.08
lon=34.78


> Later, with `for` loops, you'll be able to turn this into a reusable parser that works for any number of `key=value` pairs.

## Q in class

You have the following path:

`file_path = 'c:/data/sales.csv'`

Return only the word `sales`, in 3 different ways.

In [None]:
file_path = 'c:/data/sales.csv'

# 1
file_path[8:13]

# 2
target_word = 'sales'
file_path.replace('.csv','')[-len(target_word):]

# 3 - split
target_ext = '.csv'
target_split = '/'

file_path.replace(target_ext,'').split(target_split)[-1]

# The same chain, one step at a time
final_file_path = file_path.replace(target_ext,'')
# print(final_file_path)
final_file_path = final_file_path.split(target_split)
# print(final_file_path)
final_file_path = final_file_path[-1]
# print(final_file_path)

# 4 - find the start index, then slice
target_word = 'sales'
len_target_word = len(target_word)

idx = file_path.find(target_word)
file_path[idx: idx+len_target_word]

# index() returns the same position as find() when the substring exists
file_path.index('sales')

# number of '/' separators in the path
file_path.count('/')

2

In [None]:
#@title Solution

file_path = 'c:/data/sales.csv'

# option 1:
file_path[8:13]

# option 2:
file_path[8:-4]

# option 3:
file_path[-9:-4:1]

# option 4:
file_path[-5:-10:-1][::-1]

#option 5:
print(file_path.split('/')[-1].split('.')[0])

sales


## Q in class - Slice
given the following word `hipopotam`

print the following:

* hip
* pop
* hpptm
* matopopih
* opop
* mat

In [None]:
s = 'hipopotam'

s[2:6][::-1]

s[-4:-8:-1]

'opop'

In [None]:
#@title Solution

print(
    s[:3], # hip
    s[2:5], # pop
    s[::2], # hpptm
    s[::-1], # matopopih
    s[-4:-8:-1], s[2:6][::-1], s[5:1:-1], # opop (three equivalent slices)
    s[:-4:-1] , # mat
    sep='\n'
)

## Lists (mutable sequences)

A `list` is a mutable sequence. Use it when:
- you need to **add/remove/change** items
- order matters
- duplicates are allowed

Lists are everywhere in data work: columns, rows, batches, results, intermediate transformations.

In [None]:
numbers = [10, 20, 30, 40, 50]

print(numbers)
print(numbers[0])
print(numbers[-1])
print(numbers[1:4])

[10, 20, 30, 40, 50]
10
50
[20, 30, 40]


Pros & cons:

* Can hold many different types, so beware of mixed lists such as `[1, 'hello', True, [5, 5, 5]]`.
* Size changes dynamically as items are added or removed.
* Every item in a list is a pointer to a separate object, which adds significant memory overhead. On a typical 64-bit CPython build, a list of 10 million integers takes roughly 0.36 GB (an 8-byte pointer plus a 28-byte `int` object per item), versus about 0.08 GB for a NumPy `int64` array.
* Performance bottlenecks: list operations are processed one item at a time by the Python interpreter, whereas libraries like Pandas or Polars run vectorized operations over whole columns in optimized compiled code.
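
The memory overhead can be checked directly with the standard library. The sketch below compares a list of integers with `array.array`, a stdlib container that, like a NumPy array, stores raw 8-byte values instead of pointers. It is used here only as a stand-in so the example runs without NumPy; exact byte counts depend on the Python build.

```python
import sys
from array import array

n = 1000
nums = list(range(n))          # list: n pointers + n separate int objects
packed = array("q", range(n))  # array: n raw 8-byte signed integers

# The list's own size (pointer table) plus the size of every int object it points to
list_bytes = sys.getsizeof(nums) + sum(map(sys.getsizeof, nums))
packed_bytes = packed.itemsize * len(packed)

print("list  :", list_bytes, "bytes")
print("array :", packed_bytes, "bytes")
print("list is larger:", list_bytes > packed_bytes)
```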


### `in` on lists

When working with a matrix such as `[[1,2,3], [4,5,6], [7,8,9]]`, `in` compares your value against each **top-level** item, so you need to search for an entire row.

If you are looking for a nested value, you need to check inside a specific row (or iterate over the rows, once we learn loops).

```
mat = [[1,2,3], [4,5,6], [7,8,9]]

print(1 in mat)        # False
print([1] in mat)      # False
print([1,2,3] in mat)  # True
```



In [None]:
mat = [[1,2,3],
       [4,5,6],
       [7,8,9]
]
#print(mat)

1 in mat[0]

True

### Lists are mutable (strings/tuples are not)

Mutability means you can change an existing list *in place*.

In [None]:
nums = [1, 2, 3]
nums[0] = 99
print(nums)

# Same idea, but with strings: you can't assign into a string
word = "data"
# word[0] = "D"  # uncomment to see the error
print(word)

[99, 2, 3]
data


In [None]:
word = "data"  # goal: "Data"

# Strings are immutable, so ask a method for a new, capitalized string
# (pronunciation aside: "day-ta" or "dah-ta"?)

word.title()

'Data'

### Adding items: `append`, `extend`, `insert`

**Difference:** `append(x)` adds the object **as a single element** at the end. `extend(iterable)` **unpacks** the iterable and adds each of its items one by one.

- `append(x)` adds one element at the end
- `extend(iterable)` adds many elements
- `insert(i, x)` inserts at a specific index

**Note:** `extend(dict)` adds the dictionary’s **keys** to the list (not the key-value pairs).
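
The difference is easiest to see when the argument is itself a list. A short sketch (the list contents are arbitrary):

```python
batch = ["a", "b"]

nested = ["x"]
nested.append(batch)    # the whole list becomes ONE element
print(nested)           # ['x', ['a', 'b']]
print(len(nested))      # 2

flat = ["x"]
flat.extend(batch)      # each item is added separately
print(flat)             # ['x', 'a', 'b']
print(len(flat))        # 3

chars = []
chars.extend("ab")      # a string is iterable too: its characters are added
print(chars)            # ['a', 'b']
```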

In [None]:
colors = ["cyan", "green", "yellow"]
colors.append("red")           # one element at the end
print(colors)

colors.extend(["black", "white"])  # add many at the end
print(colors)

colors.insert(1, "magenta")
print(colors)

['cyan', 'green', 'yellow', 'red']
['cyan', 'green', 'yellow', 'red', 'black', 'white']
['cyan', 'magenta', 'green', 'yellow', 'red', 'black', 'white']


In [None]:
a = [10,20,30]
b = {'name': 'onn', 'age': 999}

a.extend(b)
a

[10, 20, 30, 'name', 'age']

## Small Q in class

Build the list `['a','b','c','d']` in at least 2 different ways.

In [None]:
# 1
L = ['a','b','c','d']
print(L)

# 2
L = [None] * 4
L[0] = 'a'
L[1] = 'b'
L[2] = 'c'
L[3] = 'd'

print(L)

#3

L = []
L.append('a')
L.append('b')
L.append('c')
L.append('d')

print(L)

# 4
L='a,b,c,d'.split(',')
print(L)

# 5
L = []
L.extend('abcd')
print(L)

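# 6 - bonus: "edit" an immutable string by going through a list
word = 'abcd'
L = list(word)
L[0] = 'A'
print(''.join(L))


# split() also builds a list from a string
word ='hello how are you'
word.split(' ')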

['a', 'b', 'c', 'd']
['a', 'b', 'c', 'd']
['a', 'b', 'c', 'd']
['a', 'b', 'c', 'd']
['a', 'b', 'c', 'd']
Abcd


['hello', 'how', 'are', 'you']

Lists also support these operators:

* concatenate with `+`: `['a','b'] + ['c','d']` → `['a','b','c','d']`
* repeat with `*` to build a larger list: `[None] * 5` → `[None, None, None, None, None]`

### Removing items: `pop`, `remove`, `clear`

- `pop(i)` removes and returns the item at **index** `i` (default: the last item)
- `remove(x)` removes by **value** (first match only)
- `clear()` removes **all** items (empties the list)

In [None]:
items = ["a", "b", "c", "b"]
last = items.pop()
print(last)
print(items)

items.remove("b")
print(items)

In [None]:
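sql_queries = ['drop table if exists','truncate' , 'create', ' insert']

# pop() returns items from the end, so the list behaves like a stack
print(sql_queries.pop())
print(sql_queries.pop())
print(sql_queries.pop())
print(sql_queries.pop())
print(sql_queries.pop())  # fifth pop on a 4-item list: raises IndexError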


 insert
create
truncate
drop table if exists


IndexError: pop from empty list

In [None]:
cleanup = ["tmp_file_1", "tmp_file_2", "tmp_file_3"]
print("Before clear:", cleanup)

cleanup.clear()
print("After clear:", cleanup)

Before clear: ['tmp_file_1', 'tmp_file_2', 'tmp_file_3']
After clear: []


### Sorting (two options)

- `sorted(lst)` returns a **new** list
- `lst.sort()` sorts the list **in place**
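
A common slip follows from this difference: `lst.sort()` returns `None`, so writing `lst = lst.sort()` loses the list. A minimal sketch:

```python
numbers = [37, 81, 21]

result = numbers.sort()     # sorts in place and returns None
print(result)               # None
print(numbers)              # [21, 37, 81]

ordered = sorted(numbers, reverse=True)  # new list, original untouched
print(ordered)              # [81, 37, 21]
print(numbers)              # [21, 37, 81]
```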

In [None]:
numbers = [37, 81, 21, 49, 98, 57]
sorted(numbers)

[21, 37, 49, 57, 81, 98]

In [None]:
numbers = [37, 81, 21, 49, 98, 57]
numbers.sort()
numbers

[21, 37, 49, 57, 81, 98]

In [None]:
numbers = [37, 81, 21, 49, 98, 57]
print(sorted(numbers))
print(numbers)  # original unchanged

numbers.sort()
print(numbers)

numbers.sort(reverse=True)
print(numbers)

In [None]:
numbers.sort?

### `reversed()` built-in

Besides `[::-1]` and `.reverse()`, Python provides the `reversed()` built-in. It returns an **iterator** (not a list), so wrap it in `list()` if you need a list back.

Key difference:
- `lst.reverse()` modifies the list **in place**
- `reversed(lst)` returns a **new** reversed iterator (original unchanged)

In [None]:
numbers = [10, 20, 30, 40, 50]

rev_iter = reversed(numbers)
print(list(rev_iter))   # [50, 40, 30, 20, 10]
print(numbers)          # original is unchanged

### Useful built-in functions for sequences

Python has several built-in functions that work on lists, tuples, and sets alike:

- `len(x)` - number of items
- `sum(x)` - sum of all numeric items
- `min(x)` / `max(x)` - smallest / largest item

Sequences (lists, tuples, strings) also have a related **method**:

- `x.index(value)` - index of the **first** occurrence of `value` (raises `ValueError` if not found)

In [None]:
scores = [88, 72, 95, 61, 84]

print("Count:", len(scores))
print("Sum:  ", sum(scores))
print("Min:  ", min(scores))
print("Max:  ", max(scores))
print("Avg:  ", sum(scores) / len(scores))

print("Index of 95:", scores.index(95))

### Copying lists (important pitfall)

`b = a` does **not** make a copy - it makes a second reference to the same list.
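
One way to see the difference is the `is` operator, which tests whether two names point to the very same object (it is not covered elsewhere in this lesson, so treat this as a side note):

```python
a = [1, 2, 3]
b = a            # second name for the same list object
c = a.copy()     # new list with the same items

print(b is a)    # True  -> same object
print(c is a)    # False -> different object
print(c == a)    # True  -> equal contents
```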

In [None]:
a = [1, 2, 3]
b = a
b.append(4)

print("a:", a)
print("b:", b)

# Make a copy:
c = a.copy()      # or: c = a[:]
c.append(999)

print("a after copy:", a)
print("c copy:", c)

a: [1, 2, 3, 4]
b: [1, 2, 3, 4]
a after copy: [1, 2, 3, 4]
c copy: [1, 2, 3, 4, 999]


In [None]:
L1 = [1,2,3]
print(L1)

L2 = L1
L2[0] = 999

print(L1)

[1, 2, 3]
[999, 2, 3]


## Tuples (immutable sequences)

A `tuple` is an immutable sequence. Use it when:
- the structure is a **fixed record** (e.g., a coordinate, (name, age))
- you want "this should not change" semantics
- you need a hashable sequence (advanced; comes up with sets/dict keys later)

Tuples support indexing/slicing just like strings and lists.
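
As a preview of the "hashable" point: a tuple can serve as a dictionary key (dictionaries are covered later in this module). The sketch below uses made-up labels; only the first coordinate pair appears elsewhere in this lesson.

```python
# Tuples can be dictionary keys; the labels are hypothetical example values
label_by_point = {(32.08, 34.78): "site A", (40.71, -74.01): "site B"}

point = (32.08, 34.78)
print(label_by_point[point])       # site A

# Tuples are immutable: mutating methods such as append do not exist on them
print(hasattr(point, "append"))    # False
print(hasattr([1, 2], "append"))   # True
```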

In [None]:
t = (1,2,3) # tuple
L = [1,2,3] # list

In [None]:
point = (32.08, 34.78)
print(point[0])
print(point[1])

# One-item tuple (comma matters!)
single = ("only",)  # singleton
print(type(single))

32.08
34.78
<class 'tuple'>


### Tuple packing (implicit tuple creation)

In Python, **commas create tuples**.

When you write multiple comma-separated values on the right-hand side of an assignment, Python **packs** them into a tuple and assigns that tuple to the variable.

Parentheses are optional (they often improve readability, but the commas are what matter).

In [None]:
details = "Jack", 30, 1.78   # packed into a tuple
print(details)
print(type(details))

# Parentheses are optional, but common for readability:
details2 = ("Jill", 28, 1.65)
print(details2)

# You can still index a tuple (it's a sequence):
print(details[0])   # name
print(details[1])   # age
print(details[2])   # height

('Jack', 30, 1.78)
<class 'tuple'>
('Jill', 28, 1.65)
Jack
30
1.78


### Tuple unpacking

Unpacking lets you assign multiple values at once.

In [None]:
record = ("alice", "login", "2026-01-22")
user, action, date = record

print(user)
print(action)
print(date)

alice
login
2026-01-22


In [None]:
t = (1,2,3)

a,b,c = t


print(t)

(1, 2, 3)


In [None]:
# Unpack a row into variables
row = ['onn', 'KING']
user, password = row

# swap assignments - IMPORTANT!!

start_date = '01012026'
end_date = '05012025'

start_date, end_date = end_date, start_date

print(start_date, end_date)

# The classic approach in other languages needs a temporary variable
# (running it here swaps the values back again)
tmp = start_date
start_date = end_date
end_date = tmp

05012025 01012026


In [None]:
# ordered sequence -> index 0 | iter
'hello'
[1,2,3,4]
['a','c',[1,2]]
(1,2,'a')

(1, 2, 'a')

## Dictionaries (mappings: key → value)

A `dict` maps **keys** to **values**.

Use a dictionary when:
- you need fast lookup by a meaningful key (e.g., user_id → user_record)
- your "items" are not naturally positional (unlike sequences)
- you want labeled fields

> **Important**: dictionary keys are unique. No key appears twice in a dictionary (this does not apply to values, of course).
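
A quick sketch of what "unique keys" means in practice (the keys and numbers are arbitrary): writing the same key twice does not create two entries, and the later value wins.

```python
# The same key written twice: the later value wins, only one entry remains
stock = {"apples": 3, "pears": 5, "apples": 10}
print(stock)        # {'apples': 10, 'pears': 5}
print(len(stock))   # 2

# Values, in contrast, may repeat freely
same_values = {"a": 1, "b": 1}
print(len(same_values))  # 2
```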

In [None]:
record = ("alice", "login", "2026-01-22")
record[0]


record_dict = {'user_name':'alice', 'action':'login', 'date':'2026-01-22'}
record_dict['user_name']

'alice'

In [None]:
user = {"name": "alice", "role": "admin", "active": True}

user['age'] = 50 # create | update
print(user)
user['age'] = 'ERROR' # create | update
print(user)

{'name': 'alice', 'role': 'admin', 'active': True, 'age': 50}
{'name': 'alice', 'role': 'admin', 'active': True, 'age': 'ERROR'}


In [None]:
# "Rename" a key: copy its value under a new key, then remove the old key
user['gender'] = user['role']
user.pop('role')
user

{'name': 'alice', 'active': True, 'age': 'ERROR', 'gender': 'admin'}

In [None]:
user = {"name": "alice", "role": "admin", "active": True}
print(user["name"])
print(user["role"])

# Add / update

user["active"] = False
user["last_login"] = "2026-01-22"
print(user)

### Safe lookup with `get`

`d[key]` raises a `KeyError` if the key doesn't exist.  
`d.get(key[, default])` returns `default` instead (`None` when no default is given).

In [None]:
config = {"region": "eu-west-1", "retries": 3}

print(config.get("region"))
print(config.get("timeout_seconds", 30))  # not present → default

eu-west-1
30


In [None]:
config = {"region": "eu-west-1", "retries": 3}

DEFAULT_TIMEOUT = 30

region_value = config["region"]  # critical setting: we want a KeyError if it is missing
timeout = config.get("timeout_seconds", DEFAULT_TIMEOUT)  # not present → default

print(timeout)

30


In [None]:
x = config.get("timeout_seconds" , 30)
print(x)

30


### `keys`, `values`, `items`

These are *views* over the dictionary. We'll iterate them once we learn loops.
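
"View" means the object stays connected to the dictionary rather than being a snapshot. A minimal sketch without loops:

```python
d = {"a": 1, "b": 2}
keys = d.keys()          # a view, not a copy

d["c"] = 3               # change the dictionary afterwards...
print(keys)              # dict_keys(['a', 'b', 'c'])  ...and the view reflects it

print("c" in keys)       # True
print(list(keys)[0])     # a   (views have no indexing; convert to a list first)
print(sum(d.values()))   # 6
```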

In [None]:
d = {"a": 1, "b": 2, "c": 3}

print(d.keys())
print(d.values())
print(d.items())

dict_keys(['a', 'b', 'c'])
dict_values([1, 2, 3])
dict_items([('a', 1), ('b', 2), ('c', 3)])


### `update()` and `pprint`

- `d.update(other_dict)` merges another dictionary into `d`. Existing keys get overwritten with the new values.
- The `pprint` module (pretty print) formats complex dictionaries for easier reading - especially useful when debugging nested structures.

In [1]:
import pprint

config = {"region": "eu-west-1", "retries": 3}
overrides = {"retries": 5, "timeout": 30}

config.update(overrides)
print(config)

big_dict = {
    "pipeline": "daily_load",
    "steps": ["extract", "transform", "load"],
    "config": {"region": "eu-west-1", "retries": 5, "timeout": 30}
}
pprint.pprint(big_dict)

{'region': 'eu-west-1', 'retries': 5, 'timeout': 30}
{'config': {'region': 'eu-west-1', 'retries': 5, 'timeout': 30},
 'pipeline': 'daily_load',
 'steps': ['extract', 'transform', 'load']}


## Sets (unique items)

A `set` contains unique values (no duplicates). Sets are great for:
- fast membership checks: `item in my_set`
- removing duplicates
- set algebra: union / intersection / difference

> **Important**: sets are **not ordered**, so they do not support indexing/slicing.
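
When you do need a position or a predictable order, one option is to convert the set to a sorted list first. A small sketch with made-up values:

```python
tags = {"red", "green", "blue"}

# tags[0] would raise TypeError: sets do not support indexing
ordered = sorted(tags)     # sorted() returns a list in a defined order
print(ordered)             # ['blue', 'green', 'red']
print(ordered[0])          # blue
print(len(tags))           # 3
```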

In [None]:
items = ["red", "red", "blue", "green", "green", "green"]
unique = set(items)
print(unique)
print("red" in unique)

a = {"a", "b", "c"}
b = {"b", "c", "d"}

print("union:", a | b)
print("intersection:", a & b)
print("difference:", a - b)

{'red', 'green', 'blue'}
True
union: {'d', 'a', 'b', 'c'}
intersection: {'b', 'c'}
difference: {'a'}


Quick reference - what each bracket style means:

* `()` - function call
* `(1,)` - tuple (singleton)
* `[]` - indexing, slicing, or key lookup: `s[0]` / `d['key']`
* `[1,2,3]` - list
* `"hello {}"` - placeholder for `str.format`
* `{1,2,3}` - set
* `{'key':5}` - dict

In [None]:
a = {"a", "b", "c"}
b = {"b", "c", "d"}

a.union(b)

a - b

a.difference(b)

{'a'}

### Symmetric difference (`^`)

Items that are in **either** set, but **not in both**. Useful for finding "what changed" between two collections.

```python
a ^ b   # operator syntax
a.symmetric_difference(b)  # method syntax
```

In [2]:
yesterday_users = {"alice", "bob", "carol"}
today_users     = {"bob", "carol", "dave", "eve"}

print("symmetric difference:", yesterday_users ^ today_users)
print("same via method:     ", yesterday_users.symmetric_difference(today_users))

symmetric difference: {'eve', 'alice', 'dave'}
same via method:      {'eve', 'alice', 'dave'}
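
To see *which side* each change came from, the symmetric difference can be split into two one-sided differences. A sketch reusing the same two sets (`joined` and `left` are illustrative names):

```python
yesterday_users = {"alice", "bob", "carol"}
today_users     = {"bob", "carol", "dave", "eve"}

joined = today_users - yesterday_users   # present today, absent yesterday
left   = yesterday_users - today_users   # present yesterday, absent today

print(sorted(joined))    # ['dave', 'eve']
print(sorted(left))      # ['alice']

# The symmetric difference is exactly the union of the two one-sided differences
print((joined | left) == (yesterday_users ^ today_users))   # True
```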


In [None]:
phrase_1 = "hello how are you ?"

phrase_2 = "hello how are you today ?"

len(set(phrase_2.split(' ')).intersection(set(phrase_1.split(' ')))) / len(phrase_2.split(' '))

0.8333333333333334

In [None]:
import pandas as pd

df = pd.DataFrame({'id':[1,2,3] , 'age':[10,20,30] , 'name':['Onn','ONN','OnN']})

#df.columns = ['ID','SENORITY']

df['name'] = df['name'].str.lower()

# NumPy = Numerical Python

df

   id  age name
0   1   10  onn
1   2   20  onn
2   3   30  onn


## How this relates to Pandas

* Manipulating columns
  * change column headers: `df.columns = ['id','date','amount']`
  * check if a column exists: `'created_by' in df.columns` → `False`


* Manipulating values
  * lowercase every value in a column: `df['first_name'].str.lower()`
  * get the distinct customer cities: `set(df['customer_city'])`, or the Pandas way, `df['customer_city'].unique()`

* Validating data quickly
  * `set(source_df['item_id']) - set(target_df['item_id'])`
  * this can be done better with `pd.merge`, which we will learn in the Pandas module
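
The set-difference check can be tried without Pandas. In the sketch below, two hypothetical ID lists stand in for `source_df['item_id']` and `target_df['item_id']`:

```python
# Hypothetical ID lists standing in for the two DataFrame columns
source_ids = [101, 102, 103, 104, 104]
target_ids = [101, 103, 104]

missing_in_target = set(source_ids) - set(target_ids)
unexpected_in_target = set(target_ids) - set(source_ids)

print(sorted(missing_in_target))      # [102]
print(sorted(unexpected_in_target))   # []

# Comparing list length with set length reveals duplicates
print(len(source_ids) - len(set(source_ids)))   # 1
```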

## Exercises

> Run the "Setup" cell under each exercise section first.

### Exercise 1 - Slicing warmup

In [None]:
# Setup
words = "100 Bottles Of Milk On The Wall."
print(words)

Create slicing expressions that produce:
1) `"100"`  
2) `"Wall"`  
3) `"Bottles Of Milk"`  
4) `"001"` (by slicing + reversing)  
5) `"lWhnk"` (a tricky slice)

Write each expression as a separate `print(...)` line.


In [None]:
# Your code here


#### Solution

In [None]:
print(words[:3])
print(words[-5:-1])
print(words[4:19])
print(words[2::-1])
print(words[-2:-15:-3])

### Exercise 2 - Formatting a simple table

In [None]:
# Setup
row_1 = ["Houston", 397.82, "4 bedrooms", "Furnished + air conditioning"]
row_2 = ["Dallas",  43.90,  "3 beds",     "Garage, recently renovated, fully furnished"]

Use f-strings to print two aligned rows like a table.

Requirements:
- City in width 10 (left aligned)
- Price in width 10 with 2 decimals (centered is nice)
- Bedrooms in width 12 (left aligned)
- Description in width 45 (left aligned)


In [None]:
# Your code here


#### Solution

In [None]:
line_1 = f"|{row_1[0]:<10s}|{row_1[1]:^10.2f}|{row_1[2]:<12s}|{row_1[3]:<45s}|"
line_2 = f"|{row_2[0]:<10s}|{row_2[1]:^10.2f}|{row_2[2]:<12s}|{row_2[3]:<45s}|"
print(line_1)
print(line_2)

### Exercise 3 - String operations

In [None]:
# Setup
lyrics = "Let it be, let it be, let it be, let it be, Whisper words of wisdom, let it be"

1) How many times does `"let it be"` occur (case-insensitive)?  
2) Replace every `"be,"` with `"be!"`  
3) Center the string with `@` padding, adding 12 extra characters total.

Print each result.


In [None]:
# Your code here


#### Solution

In [None]:
print("1.", lyrics.lower().count("let it be"))
print("2.", lyrics.replace("be,", "be!"))
print("3.", lyrics.center(len(lyrics) + 12, "@"))

Now:
- Remove punctuation (commas) and split into words.
- Print every second word (1st, 3rd, 5th, ...).


In [None]:
# Your code here


#### Solution

In [None]:
words_list = lyrics.replace(",", "").split()
print(words_list[::2])

### Exercise 4 - Sentence summary

In [None]:
# Setup
sentence = "  Cows are great. Cows are smart!  "

Print:
- number of characters (after stripping leading/trailing spaces)
- number of words
- does it contain the substring `"cow"` (case-insensitive)?


In [None]:
# Your code here


#### Solution

In [None]:
s = sentence.strip()

print("Characters:", len(s))
print("Words:", len(s.split()))
print('Contains "cow"?', "cow" in s.lower())

### Exercise 5 - Collections (list, dict, set)

#### 5.1 - Pair animals and sounds (no loops)

Given two lists of equal length, build a new list like:
`["cat-meow", "dog-wooof!", ...]`

Constraint: no loops yet - use indexing explicitly.

In [None]:
# Setup
animals = ["cat", "dog", "sheep", "bee"]
sounds  = ["meow", "wooof!", "baaah", "bzzzz"]

In [None]:
# Your code here


##### Solution

In [None]:
paired = [
    animals[0] + "-" + sounds[0],
    animals[1] + "-" + sounds[1],
    animals[2] + "-" + sounds[2],
    animals[3] + "-" + sounds[3],
]
print(paired)

#### 5.2 - Insert a word after a given color

In class these two values come from `input()`; here they are plain variables so the notebook runs without prompting.  
Task: insert `word` immediately after `user_color`, using `.index` and `.insert`.

For example:
```
colors = ["cyan", "green", "yellow", "red", "silver", "purple", "magenta", "gold", "brown", "black", "white"]

user_color = "red"   
word = "ALERT"

--->  ["cyan", "green", "yellow", "red", "ALERT", "silver", "purple", "magenta", "gold", "brown", "black", "white"]
```

In [None]:
# Setup
colors = ["cyan", "green", "yellow", "red", "silver", "purple", "magenta", "gold", "brown", "black", "white"]

user_color = "red"   # in class: input("Color? ")
word = "ALERT"       # in class: input("Word? ")

In [None]:
# Your code here


##### Solution

In [None]:
color_index = colors.index(user_color)
colors.insert(color_index + 1, word)
print(colors)

#### 5.3 Dictionary lookup + list aggregation

Given a dictionary of subject → list of grades, compute the average for a chosen subject.
Use `sum()` and `len()` (no loops).

In [None]:
# Setup
student_tests = {
    "English": [84, 78, 93],
    "Biology": [81, 90],
    "Math": [96, 86, 95, 91],
    "History": [87],
}

subject = "Math"     # in class: input("Subject? ")

In [None]:
# Your code here


##### Solution

In [None]:
subject_tests = student_tests[subject]
average = sum(subject_tests) / len(subject_tests)
print(subject, "average test result:", round(average, 2))

#### 5.4 Deduplicate with a set

Given a list of user IDs that may repeat, produce a set of unique IDs and show membership tests (e.g., is `u3` in the set?).

In [None]:
# Setup
user_ids = ["u1", "u2", "u1", "u3", "u2", "u4"]

In [None]:
# Your code here


##### Solution


In [None]:
unique_ids = set(user_ids)
print("Unique:", unique_ids)
print("u2 present?", "u2" in unique_ids)
print("u9 present?", "u9" in unique_ids)

## What's next

Once you learn flow control (conditions + loops), you'll be able to:
- iterate over sequences (`for item in ...`)
- build parsers that handle any number of fields
- transform batches of records and generate outputs

This lesson built the "data structures toolbox" you'll use everywhere.

<hr/>