# Advanced Strings

This notebook goes beyond the basics of Python strings. We'll cover:
- Creation: quotes, escaping, raw strings, and multi-line strings
- Indexing, slicing, and immutability
- Common operations: membership, `len`, concatenation, repetition
- String methods: search, replace, split/join, strip, change case
- Unicode & normalization, casefolding
- Encoding/decoding between `str` and `bytes`
- Efficient building with `str.join`
- Formatting: f-strings and format specifiers
- Translation tables and character filtering
- Character classes and `.is*` predicates
- Practical patterns and gotchas

## 1) Creation: quotes, escaping, raw strings, multi-line

In [1]:
single = 'He said: \'hi\''
double = "He said: 'hi'"
triple = """Multi-line\nstring with "quotes" and 'quotes'"""
raw_path = r"C:\Users\me\Documents\file.txt"  # raw string: backslashes are not treated as escapes
single, double, triple.splitlines(), raw_path

("He said: 'hi'",
 "He said: 'hi'",
 ['Multi-line', 'string with "quotes" and \'quotes\''],
 'C:\\Users\\me\\Documents\\file.txt')

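**Caveat (raw strings):** A raw string cannot end with an odd number of backslashes (e.g. `r"\"` is a syntax error). Note that `r"\\"` is valid but contains two backslashes; for a single trailing backslash, use a regular literal (`"\\"`) or concatenate, e.g. `r"C:\dir" + "\\"`.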
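To make the trailing-backslash caveat concrete, here is a minimal sketch. The path is purely illustrative and no file is touched.

```python
# In a raw string each backslash is literal, so this path has single backslashes.
base = r"C:\Users\me"

# r"C:\Users\me\" would be a SyntaxError: the final backslash "protects" the quote.
# Appending an escaped backslash from a regular literal avoids the problem.
with_trailing = base + "\\"

print(with_trailing)            # C:\Users\me\
print(len(r"\\"), len("\\"))    # 2 1
```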

## 2) Indexing, slicing, and immutability

In [2]:
s = "Python rocks!"
first = s[0]
last = s[-1]
mid = s[7:11]   # 'rock'
step = s[::2]
rev = s[::-1]
len(s), first, last, mid, step, rev

(13, 'P', '!', 'rock', 'Pto ok!', '!skcor nohtyP')

In [3]:
try:
    s[0] = 'J'
except TypeError as e:
    immutability_error = str(e)
immutability_error  # strings are immutable

"'str' object does not support item assignment"

To "modify" a string, build a **new** one:

In [4]:
s2 = 'J' + s[1:]
s, s2

('Python rocks!', 'Jython rocks!')

## 3) Membership, concatenation, repetition, and comparisons

In [5]:
'rock' in s, 'java' in s, 'Py' in s, s + '!' , '-' * 10, 'Python' < 'python'  # lexicographic / ordinal comparison

(True, False, True, 'Python rocks!!', '----------', True)

**Note:** `'A' < 'a'` because strings compare by Unicode code point, and uppercase ASCII letters (65-90) come before lowercase ones (97-122). For case-insensitive comparisons, compare `.casefold()` results (or `.lower()` for plain ASCII text).
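The same ordering rule affects sorting. As a small sketch (the word list is made up for illustration), passing `key=str.casefold` to `sorted` gives a case-insensitive order:

```python
words = ["banana", "Apple", "cherry", "Zebra"]

# Default sort compares code points, so capitalized words come first.
by_code_point = sorted(words)

# key=str.casefold compares the casefolded form but returns the original strings.
case_insensitive = sorted(words, key=str.casefold)

print(by_code_point)     # ['Apple', 'Zebra', 'banana', 'cherry']
print(case_insensitive)  # ['Apple', 'banana', 'cherry', 'Zebra']
```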

## 4) Core search & replace methods (safe, no exceptions unless noted)

In [6]:
t = "abracadabra"
first_a = t.find('a')          # -1 if not found
last_b = t.rfind('b')          # -1 if not found
idx_ra = t.index('ra')         # raises ValueError if not found
count_a = t.count('a')
replaced = t.replace('abra', 'ABRA', 1)  # limit with count
first_a, last_b, idx_ra, count_a, replaced, t  # original unchanged

(0, 8, 2, 5, 'ABRAcadabra', 'abracadabra')

`startswith`/`endswith` accept tuples for multiple options; you can also set slice-like `start`, `end` bounds.

In [7]:
"hello.py".endswith((".py", ".pyw")), "data.csv".startswith("dat"), "abc".startswith("b", 1)  # True

(True, True, True)

## 5) Strip/trim, split/join, partition/splitlines, and case changing

In [8]:
u = "  \t  spam,eggs,ham  \n"
trim = u.strip()            # lstrip / rstrip for one side
split_csv = trim.split(',') # split on comma
split_max = trim.split(',', maxsplit=1)
joined = ";".join(split_csv)
parts = "key=value=extra".partition('=')    # ('key', '=', 'value=extra')
rparts = "key=value=extra".rpartition('=')  # ('key=value', '=', 'extra')
lines = "a\nb\r\nc".splitlines()          # handles different line endings
caps = "tiTle cAsE".title(), "MiXeD".lower(), "MiXeD".upper(), "straße".casefold()
trim, split_csv, split_max, joined, parts, rparts, lines, caps

('spam,eggs,ham',
 ['spam', 'eggs', 'ham'],
 ['spam', 'eggs,ham'],
 'spam;eggs;ham',
 ('key', '=', 'value=extra'),
 ('key=value', '=', 'extra'),
 ['a', 'b', 'c'],
 ('Title Case', 'mixed', 'MIXED', 'strasse'))

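**When to use `partition`?** Use it to split **once** around a delimiter while keeping the delimiter. It always returns a 3-tuple, so unpacking is safe even when the delimiter is absent (the separator and tail are then empty strings), whereas unpacking the result of `split` can fail.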
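One way this plays out in practice is parsing `key=value` lines. The helper name `parse_setting` is hypothetical, and returning `None` for a missing value is just one possible convention:

```python
def parse_setting(line):
    """Split 'key=value' once; return (key, None) when '=' is missing."""
    key, sep, value = line.partition("=")
    if not sep:  # delimiter not found: sep and value are both ''
        return key.strip(), None
    return key.strip(), value.strip()

print(parse_setting("name = Ada"))   # ('name', 'Ada')
print(parse_setting("expr=a=b"))     # ('expr', 'a=b')
print(parse_setting("verbose"))      # ('verbose', None)
```

Because only the first `=` is used, any later `=` characters stay in the value.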

## 6) Unicode details: normalization & case-insensitive matching
Different Unicode representations can look identical but compare unequal unless normalized. Use `unicodedata.normalize` and `casefold` for robust comparisons.

In [9]:
import unicodedata as ud
# 'é' as a single code point vs 'e' + combining accent
s1 = "é"
s2 = "e\u0301"
raw_equal = (s1 == s2)
nfc_equal = (ud.normalize('NFC', s1) == ud.normalize('NFC', s2))
casefold_equal = ("Straße".casefold() == "strasse".casefold())
raw_equal, nfc_equal, casefold_equal

(False, True, True)

**Tip:** For user-facing string equality or deduplication, normalize (e.g., 'NFC') and `.casefold()` before comparing.
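As a rough sketch of that tip, the hypothetical helper `match_key` below builds a comparison key and uses it to deduplicate a made-up list of names. It assumes NFC plus `casefold` is enough for the data at hand; stricter Unicode caseless matching involves additional normalization steps.

```python
import unicodedata

def match_key(text):
    """Hypothetical helper: NFC-normalize, then casefold, for comparisons."""
    return unicodedata.normalize("NFC", text).casefold()

# Escapes make the code points explicit: precomposed vs. combining accent.
names = ["Caf\u00e9", "cafe\u0301", "CAF\u00c9", "Stra\u00dfe", "STRASSE"]

unique = {}
for name in names:
    # Keep the first spelling seen for each comparison key.
    unique.setdefault(match_key(name), name)

deduplicated = list(unique.values())
print(len(names), len(deduplicated))  # 5 2
```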

## 7) Encoding/decoding between `str` and `bytes`
`str` is Unicode text; `bytes` is a sequence of raw bytes. Convert with `.encode()` and `.decode()`, passing an explicit encoding (UTF-8 is the usual choice). The `errors=` argument controls what happens when data cannot be encoded or decoded.

In [10]:
text = "π ≈ 3.14159"
b = text.encode('utf-8')           # to bytes
back = b.decode('utf-8')           # to str
len(text), len(b), b, back

(11, 14, b'\xcf\x80 \xe2\x89\x88 3.14159', 'π ≈ 3.14159')

In [11]:
# Error strategies: 'ignore', 'replace', 'backslashreplace'
bad_bytes = b"abc\xffdef"
ignored = bad_bytes.decode('utf-8', errors='ignore')
replaced = bad_bytes.decode('utf-8', errors='replace')
backslashed = bad_bytes.decode('utf-8', errors='backslashreplace')
ignored, replaced, backslashed

('abcdef', 'abc�def', 'abc\\xffdef')

**Bytes literals:** `b'...'` creates bytes. Use for I/O, hashing, protocols. Convert to/from `str` explicitly.

In [12]:
b_lit = b"hello"  # ASCII-only in bytes literals
b_lit, type(b_lit), list(b_lit)  # ints 0..255

(b'hello', bytes, [104, 101, 108, 108, 111])
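Hashing is a typical place where the `str`/`bytes` split shows up. The sketch below uses the standard-library `hashlib`; the message is just the sample text from earlier, and the choice of SHA-256 is arbitrary:

```python
import hashlib

message = "π ≈ 3.14159"

# Hash functions operate on bytes; passing a str raises TypeError.
try:
    hashlib.sha256(message)
except TypeError:
    str_rejected = True
else:
    str_rejected = False

digest = hashlib.sha256(message.encode("utf-8")).hexdigest()

# A different encoding yields different bytes, and therefore a different digest.
digest_utf16 = hashlib.sha256(message.encode("utf-16")).hexdigest()

print(str_rejected, len(digest), digest == digest_utf16)  # True 64 False
```

This is one reason to state the encoding explicitly: two parties hashing "the same text" only agree if they also agree on the bytes.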

## 8) Efficient concatenation with `str.join`
Each `+` builds a new string, so concatenating in a loop can copy the growing result repeatedly and create many temporaries. Prefer `''.join(parts)` for large loops or collections of fragments.

In [13]:
parts = [str(i) for i in range(10)]
bad = ''
for p in parts:
    bad += p  # fine for small inputs, but scales poorly
good = ''.join(parts)
bad, good, bad == good

('0123456789', '0123456789', True)
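One related gotcha: `str.join` only accepts strings, so other items need converting first. A minimal sketch with made-up data:

```python
numbers = [1, 2, 3]

# join() refuses non-str items instead of converting them implicitly.
try:
    ", ".join(numbers)
except TypeError as exc:
    join_error = str(exc)

# Convert each item explicitly, then join once.
csv_line = ", ".join(map(str, numbers))
print(csv_line)  # 1, 2, 3
```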

## 9) Formatting: f-strings and format specs
F-strings are concise and fast. Format specifiers control width, alignment, precision, number style, etc.

In [14]:
name = "Ada"
pi = 3.1415926535
n = 12345
f1 = f"Hello, {name}! pi≈{pi:.3f}"
f2 = f"{n:,}"                      # thousands separator
f3 = f"{n:>10}"                    # right align width 10
f4 = f"{n:#x}"                      # hex with prefix
f5 = f"{0.875:.0%}"                 # percentage 88%
f1, f2, f3, f4, f5

('Hello, Ada! pi≈3.142', '12,345', '     12345', '0x3039', '88%')
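Format specifiers can also take their width from a variable by nesting a replacement field inside the spec. The following sketch aligns a small, invented table this way:

```python
rows = [("apples", 3, 0.5), ("kiwis", 12, 1.25)]
width = max(len(name) for name, _, _ in rows)  # widest name decides the column width

table = []
for name, qty, price in rows:
    # {name:<{width}} left-aligns in the computed width;
    # {qty:>3} and {price:>6.2f} right-align the numbers.
    table.append(f"{name:<{width}} | {qty:>3} | {price:>6.2f}")

print("\n".join(table))
# apples |   3 |   0.50
# kiwis  |  12 |   1.25
```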

**Debugging helper:** `f"{var=}"` shows name and value (Python 3.8+).

In [15]:
total = 42
f"{total=}"  # 'total=42'

'total=42'

## 10) Translation tables and character filtering
`str.maketrans` + `str.translate` can replace or delete many characters efficiently in one pass (no regex needed).

In [16]:
s = "a+b=c; 1+2=3!"
table = str.maketrans({'+': ' plus ', '=': ' equals ', ';': ', '})
translated = s.translate(table)
delete_digits = str.maketrans('', '', '0123456789')  # third arg: delete chars
no_digits = s.translate(delete_digits)
translated, no_digits

('a plus b equals c,  1 plus 2 equals 3!', 'a+b=c; +=!')

**Filtering example:** Remove ASCII punctuation and digits using `translate` with a deletion set of unwanted characters. For this input, only letters and spaces remain.

In [17]:
import string
text = "Hello, World! 123"
to_delete = string.punctuation + string.digits
clean = text.translate(str.maketrans('', '', to_delete))
clean.strip()

'Hello World'
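A deletion set only removes the characters it lists. When the goal is strictly "letters and spaces only", an allow-list is one alternative; this sketch swaps `translate` for a generator expression and uses an invented input containing a non-ASCII symbol:

```python
text = "Hello, World! 123 \u2603"  # U+2603 (snowman) is not in string.punctuation

# Keep a character only if it is a letter or a plain space.
letters_and_spaces = "".join(ch for ch in text if ch.isalpha() or ch == " ")

# Collapse the runs of spaces left behind by removed characters.
clean = " ".join(letters_and_spaces.split())
print(clean)  # Hello World
```

Note that `str.isalpha` accepts letters from any script, not just ASCII.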

## 11) Character classes and `.is*` predicates
Useful for quick validation without regex:
- `isalpha`, `isalnum`, `isdecimal`, `isdigit`, `isnumeric`
- `isspace`, `islower`, `isupper`, `istitle`
- `isprintable` (no control chars)

In [18]:
"ABC".isalpha(), "A1".isalnum(), "²".isdigit(), "½".isnumeric(), "  \t\n".isspace(), "Hello World".istitle()

(True, True, True, True, True, True)

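**Note:** `isdecimal`, `isdigit`, and `isnumeric` are progressively broader: `"²"` is a digit but not a decimal, and `"½"` is numeric but neither a digit nor a decimal. Use `isnumeric` for the broadest coverage, but check `isdecimal` before calling `int()`, since `int("²")` raises `ValueError`.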
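To see how the three numeric predicates differ, the sketch below checks a few sample characters and then uses `isdecimal` as a guard before `int()`. The helper name `to_int_or_none` is hypothetical.

```python
# ASCII seven, superscript two, vulgar fraction one half, Arabic-Indic digit three
samples = ["7", "\u00b2", "\u00bd", "\u0663"]

checks = {ch: (ch.isdecimal(), ch.isdigit(), ch.isnumeric()) for ch in samples}
for ch, flags in checks.items():
    print(repr(ch), flags)

def to_int_or_none(text):
    """Hypothetical helper: parse only strings made entirely of decimal digits."""
    return int(text) if text.isdecimal() else None

print(to_int_or_none("42"), to_int_or_none("\u00b2"), to_int_or_none("\u0663"))  # 42 None 3
```

Signs are not decimal characters, so this guard also rejects inputs such as `"-5"`; handle those separately if they are expected.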

## 12) Practical patterns & gotchas
- **Building large text**: accumulate in a list, then `''.join` once.
- **Case-insensitive search**: normalize and `.casefold()` both sides.
- **Paths/regex patterns**: use raw strings to avoid escape headaches.
- **Binary vs text I/O**: open text files with an explicit `encoding=`; use binary mode (`'rb'`/`'wb'`) when you need raw bytes.
- **Normalization**: before deduplication or keys, normalize (e.g., NFC).
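The text-versus-binary item in the list above can be sketched with a temporary file. The file name and sample text are placeholders, and `newline=""` is passed only so the byte count is the same on every platform:

```python
import os
import tempfile

text = "na\u00efve caf\u00e9\n"  # 11 characters, two of them non-ASCII

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Text mode: write str, stating the encoding rather than relying on the platform default.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    # Binary mode: read the raw bytes, with no decoding.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode again: decoding with the same encoding restores the original str.
    with open(path, "r", encoding="utf-8", newline="") as f:
        round_trip = f.read()

print(len(text), len(raw), round_trip == text)  # 11 13 True
```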