<a href="https://colab.research.google.com/github/BELBINBENORM/my_kaggle_ml_practice/blob/main/00_00_00_The_Data_Science_Brush_Up.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python → The Data Science Brush Up
---

## PHASE  1 — Core Python Syntax  
*You don’t “know Python” until this is automatic.*

### Variables & data types
Python is **dynamically typed**, but types matter.

In [1]:
x = 10           # int
y = 3.5          # float
name = "Beno"    # str
is_active = True # bool

- Variables are references, not boxes.
- Types belong to objects, not variable names.
- You can reassign a variable to a different type:

In [2]:
x = 10
x = "ten"

type(x)


str

### Type Casting

Python doesn’t guess — you must convert types explicitly.

In [3]:
int("10")      # 10
float("3.5")   # 3.5
str(100)       # "100"
bool(0)        # False

# Beware

# int("10.5")    # ValueError

False

### Operators

- Arithmetic : +  -  *  /  //  %  **
- Comparison : ==  !=  >  <  >=  <=
- Logical: and  or  not

In [4]:
5 / 2    # 2.5
5 // 2   # 2


2

Identity vs Equality

In [5]:
a = [1, 2]
b = [1, 2]

a == b    # True (values equal)
a is b    # False (different objects)


False

### Input / Output

Input is always a string.

In [6]:
name = input("Enter your name: ")
age = int(input("Enter your age: "))


Enter your name: beno
Enter your age: 23


Print/output:

In [7]:
print(name, age)
print(f"My name is {name} and I am {age}")

beno 23
My name is beno and I am 23


### Control Flow — `if`, `elif`, `else`

In [8]:
x = 10

if x > 0:
    print("positive")
elif x == 0:
    print("zero")
else:
    print("negative")


positive


### Loops
#### For loop

In [9]:
for i in range(5):
    print(i)


0
1
2
3
4


#### While loop

In [10]:
count = 0
while count < 3:
    print(count)
    count += 1


0
1
2


#### Loop controls

- break
- continue
- pass


In [11]:
for i in range(5):
    if i == 3:
        #break
        continue
        #pass
    print(i)

0
1
2
4


### Truthy & Falsy Values

Python treats certain values as False automatically.

Falsy values: `False`, `0`, `0.0`, `""`, `[]`, `{}`, `set()`, `None`

In [12]:
if []:
    print("won't run")

if [1]:
    print("will run")


will run


---

## PHASE  2 — Data Types & Memory Model  
*This is where depth starts. Most roadmaps fail here.*

### Built-in Types (What Exists, Not Just How to Use)

#### Numeric Types

- `int`: arbitrary precision (not 32/64-bit like C)
- `float`: IEEE 754 double precision
- `bool`: subclass of `int` (`True == 1`, `False == 0`)

Why this matters:

- Overflow behaves differently than low-level languages
- Float precision bugs are not Python bugs
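
A quick sketch of both points, using only built-ins and the standard `math` module. The specific numbers are arbitrary and chosen for illustration:

```python
import math

# Integers grow as needed: no overflow, just more digits.
big = 2 ** 100
print(big)  # 1267650600228229401496703205376

# Floats are IEEE 754 doubles, so 0.1 + 0.2 is not exactly 0.3.
total = 0.1 + 0.2
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True: compare with a tolerance instead
```

The same float behavior shows up in C, Java, and JavaScript, which is why it is not a Python bug.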

#### Text Type

`str`

- Immutable
- Unicode
- Sequence of characters

Key implications:

- Every string operation creates a new object
- Heavy string concatenation can be expensive

#### Sequence Types

`list`

- Mutable
- Ordered
- Heterogeneous

`tuple`

- Immutable
- Ordered
- Can contain mutable objects

Critical insight:

- Immutability applies to the container, not its contents

#### Set Types

`set`

- Mutable
- Unordered
- Unique elements

`frozenset`

- Immutable
- Hashable

Why sets matter:

- Fast membership checks
- Enforces hashability discipline

#### Mapping Type

`dict`

- Mutable
- Keys must be hashable
- Insertion-ordered (Python 3.7+)

Important:

- Dict is a hash table, not magic
- Key design affects performance and correctness

### Mutability & Immutability (This Is Where Bugs Come From)

What Mutability Actually Means

- Mutable → object can change in place
- Immutable → object cannot change after creation

This is about memory, not syntax.

#### Object Identity

- Every object lives somewhere in memory
- `id(obj)` gives its identity (in CPython, the memory address)

Two variables can:

- Refer to the same object
- Refer to different objects with equal values

Equality (==) ≠ Identity (is)

#### Reference vs Value (Python Is Not Pass-by-Value)

Python is:
- Pass-by-object-reference

Meaning:

- Function parameters receive references to objects
- Mutation affects the original object
- Rebinding does not

This single concept explains:

- “Why did my list change?”
- “Why didn’t my integer change?”

#### Function Argument Behavior

- Mutable arguments can be modified
- Immutable arguments cannot be changed, only replaced

This is not a special rule.
It’s a direct consequence of mutability.
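
The difference between mutating and rebinding can be sketched with three small functions. The names `mutate`, `rebind`, and `bump` are made up for this example:

```python
def mutate(items):
    items.append(99)      # changes the object the caller also sees

def rebind(items):
    items = [0, 0, 0]     # rebinds the local name only

def bump(n):
    n += 1                # ints are immutable: this creates a new int
    return n

data = [1, 2]
mutate(data)
print(data)   # [1, 2, 99]

rebind(data)
print(data)   # [1, 2, 99] (unchanged)

count = 5
bump(count)
print(count)  # 5
```

Same call mechanism in all three cases. Only the mutability of the object decides what the caller sees.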

#### Immutable Containers with Mutable Elements

Example:

- Tuple containing a list

The tuple is immutable; the list inside it is not.

This breaks many people’s brains. It shouldn’t.
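
A minimal sketch of that situation, with arbitrary values:

```python
t = (1, [2, 3])

t[1].append(4)        # allowed: mutates the list, not the tuple
print(t)              # (1, [2, 3, 4])

try:
    t[1] = [9]        # not allowed: tuples do not support item assignment
except TypeError as exc:
    print("TypeError:", exc)
```

The tuple still holds the same two references it was created with. One of the referenced objects simply changed.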

#### Hashability (Quiet but Critical)

What Hashability Means

- Object has a stable hash value
- Hash does not change during lifetime
- Can be used as dict key or set element

Rules:

- Immutable ≠ always hashable (a tuple that contains a list is not)
- Built-in mutable containers (`list`, `dict`, `set`) are never hashable

Why this matters:

- Dict keys
- Set behavior
- Performance
- Bugs that appear “random”
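
To check these rules yourself, a small helper is enough. `is_hashable` is a hypothetical function written for this sketch, not something from the standard library:

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(hash((1, 2)) == hash((1, 2)))   # True: equal tuples hash equally
print(is_hashable("text"))            # True
print(is_hashable((1, 2)))            # True
print(is_hashable([1, 2]))            # False: lists are mutable
print(is_hashable((1, [2, 3])))       # False: immutable tuple, unhashable content
print(is_hashable(frozenset({1})))    # True
```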

### Copy Mechanics (Where Silent Bugs Live)

#### Assignment vs Copy

- Assignment copies the reference
- No new object is created

This is not optional knowledge. This is core Python.

#### Shallow Copy

- Copies the container
- References inside remain shared

Used when:

- You want a new outer structure
- Inner objects are safe to share

Danger:

- Mutating nested objects leaks changes

#### Deep Copy

- Recursively copies everything
- New objects all the way down

Used when:

- You need full isolation

Cost:

- Slow
- Memory-heavy
- Sometimes wrong
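
The three behaviors (assignment, shallow copy, deep copy) side by side, as a sketch with a small nested list:

```python
import copy

original = [[1, 2], [3, 4]]

alias = original                  # same object
shallow = copy.copy(original)     # new outer list, shared inner lists
deep = copy.deepcopy(original)    # fully independent

original[0].append(99)            # mutate a nested list
original.append([5, 6])           # mutate the outer list

print(alias)    # [[1, 2, 99], [3, 4], [5, 6]]
print(shallow)  # [[1, 2, 99], [3, 4]]
print(deep)     # [[1, 2], [3, 4]]
```

The shallow copy missed the outer `append` but still leaked the nested change. That is exactly the silent bug described above.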

#### When Copying Is a Mistake

- Large data structures
- Read-only workflows
- Performance-sensitive code

Often better:

- Design immutability
- Create new objects intentionally
- Avoid mutation altogether

Mental Model You Must Adopt

Python variables:

- Do not hold values
- They hold references to objects

Mutation:

- Changes the object
- Affects all references

Assignment:

- Rebinds a name
- Does not touch the object

If this clicks, Python becomes predictable.
If it doesn’t, everything feels random.

---


## PHASE  3 — Functions & Call Mechanics  
*Functions are Python’s backbone.*

- Function definition & return
- Default arguments (and why they’re dangerous)
- Positional vs keyword arguments
- Packing & unpacking (`*args`, `**kwargs`)
- Multiple return values
- Scope & closures
- Pure vs impure functions

### Function Definition & Return

Basic Structure

In [13]:
def add(a, b):
    return a + b

Key facts:

- `def` creates a function object
- `return` sends a value back and exits the function
- If no `return`, Python returns `None`

In [14]:
def f():
    pass

result = f()  # None

A function is just an object:

In [15]:
print(type(add))

<class 'function'>


### Default Arguments (Why They’re Dangerous)

#### Safe Defaults (Immutable)

In [16]:
def power(x, n=2):
    return x ** n

In [17]:
power(125)

15625

#### Dangerous Defaults (Mutable)

In [18]:
def add_item(item, items=[]):
    items.append(item)
    return items

In [19]:
add_item(13)

[13]

In [20]:
add_item(38,[10])

[10, 38]

This list is created once, not per call.
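
The calls above do not trigger the problem, because the second call passes its own list. A sketch that does trigger it calls the function twice without the second argument:

```python
def add_item(item, items=[]):   # the default list is built once, at def time
    items.append(item)
    return items

print(add_item(1))              # [1]
print(add_item(2))              # [1, 2]  <- the same list again
print(add_item.__defaults__)    # ([1, 2],)
```

The default lives on the function object itself (`__defaults__`), which is why it survives between calls.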

Correct pattern:

In [21]:
def add_item(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items

Rule:

Never use mutable objects as default arguments.

### Positional vs Keyword Arguments

#### Positional Arguments

In [22]:
def func(a, b, c):
    return a + b + c

In [23]:
func(1, 2, 3)

6

#### Keyword Arguments

In [24]:
func(a=1, b=2, c=3)

6

In [25]:
func(1, b=2, c=3)

6

### Packing & Unpacking (`*args`, `**kwargs`)

#### Packing Arguments

In [26]:
def total(*args):
    return sum(args)

total(1, 2, 3, 4)

10

- `*args` → tuple
- Captures extra positional arguments

In [27]:
def config(**kwargs):
    return kwargs

config(a=1, b=2)

{'a': 1, 'b': 2}

- `**kwargs` → dict
- Captures keyword arguments

#### Unpacking Arguments

In [28]:
nums = [1, 2, 3]
total(*nums)

6

In [29]:
params = {"a": 1, "b": 2}
config(**params)

{'a': 1, 'b': 2}

Key insight:

- Packing happens in function definition
- Unpacking happens in function call

### Multiple Return Values

A function always returns exactly one object. Several comma-separated values are packed into a single tuple.

In [30]:
def stats(x, y):
    return x + y, x - y

result = stats(5, 3)

print(result)

(8, 2)


Unpacking:

In [31]:
s, d = stats(5, 3)
print(s,d)

8 2


This is tuple unpacking, not “multiple returns”.

### Scope & Closures

*Scope Rules (LEGB)*

- Local
- Enclosing
- Global
- Built-in

In [32]:
x = 10

def outer():
    x = 20
    def inner():
        return x
    return inner()
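
To see the lookup order in action, here is a sketch that tags each scope with a string. `reads_global` is a name invented for the example:

```python
x = "global"

def outer():
    x = "enclosing"
    def inner():
        return x          # not local, so Python looks in the enclosing scope
    return inner()

def reads_global():
    return x              # no local or enclosing x, so the global is used

print(outer())         # enclosing
print(reads_global())  # global
print(len("abc"))      # 3: `len` is found in the built-in scope
```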

Closures

A closure remembers variables from its enclosing scope.

In [33]:
def multiplier(n):
    def inner(x):
        return x * n
    return inner

double = multiplier(2)
double(5)

10

`n` is remembered even after `multiplier` exits.

Closures are used in:

- Decorators
- Callbacks
- Functional patterns

### Pure vs Impure Functions

#### Pure Function

In [34]:
def square(x):
    return x * x

Characteristics:

- Same input → same output
- No side effects
- Easy to test
- Safe for parallelism

#### Impure Function

In [35]:
count = 0

def increment():
    global count
    count += 1

Characteristics:

- Depends on external state
- Causes side effects
- Harder to reason about

**Why This Matters**

ML pipelines, data processing, and APIs break when functions:

- Mutate shared state
- Depend on hidden globals
- Modify arguments silently

Prefer:

- Pure functions
- Explicit inputs and outputs

---

## PHASE 4 — Iteration, Iterators & Generators  
*Performance & scale. This is where Python stops being naive.*

- `for` loop internals
- Iteration protocol
- `iter()` / `next()`
- Generator functions (`yield`)
- Generator expressions
- Lazy evaluation
- One-time iterables
- Memory vs speed tradeoffs

### `for` Loop Internals (What Really Happens)

When you write `for item in data:`, Python does not loop over indexes.

What actually happens:

- `iter(data)` is called
- An iterator object is returned
- `next()` is called repeatedly
- Loop stops when `StopIteration` is raised

Understanding this explains:

- Why some objects can be looped only once
- Why generators behave differently from lists
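
The steps above can be written out by hand. This is a rough equivalent of a `for` loop, not the interpreter's actual implementation, and the `seen` list exists only so the result can be inspected:

```python
data = [10, 20, 30]

# Roughly what `for item in data: print(item)` does under the hood.
iterator = iter(data)
seen = []
while True:
    try:
        item = next(iterator)
    except StopIteration:
        break              # the for loop handles this exception for you
    seen.append(item)
    print(item)
```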

### The Iteration Protocol

An object is iterable if:

- It has `__iter__()` that returns an iterator

An object is an iterator if:

- It has `__iter__()` (returns itself)
- It has `__next__()`

Example:

In [36]:
it = iter([1, 2, 3])

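Each `next()` call returns the following item; once the iterator is exhausted, `next()` raises `StopIteration`:

In [37]:
next(it)  # 1 (first item; a fourth call would raise StopIteration)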

1

### `iter()` and `next()` (Explicit Control)

In [38]:
data = [10, 20, 30]
it = iter(data)

next(it)  # 10
next(it)  # 20
next(it)  # 30

30

After this, the iterator is dead.

Important:

- Iterators are stateful
- You cannot rewind them unless you recreate them

### Generator Functions (`yield`)

A function becomes a generator when it uses `yield`.

In [39]:
def count_up(n):
    for i in range(n):
        yield i

Calling it:

In [40]:
g = count_up(3)

Nothing runs yet.

Execution happens only when:

In [41]:
next(g)

0

Key properties:

- Execution pauses at yield
- State is preserved
- Memory-efficient
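
The pause-and-resume behavior is easiest to see with a few `print` calls inside the generator. `steps` is a made-up name for this sketch:

```python
def steps():
    print("start")
    yield 1
    print("resumed after first yield")
    yield 2
    print("finished")

g = steps()           # nothing is printed yet
first = next(g)       # prints "start", pauses at the first yield
second = next(g)      # prints "resumed after first yield"
print(first, second)  # 1 2

remaining = list(g)   # prints "finished"; the generator is now exhausted
print(remaining)      # []
```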

### Generator Expressions

List comprehension:

In [42]:
l=[x * 2 for x in range(5)]
type(l)

list

Generator expression:

In [43]:
g=(x * 2 for x in range(5))
type(g)

generator

Difference:

- List → builds everything in memory
- Generator → produces values on demand

This is not a style choice.

This is a performance decision.

### Lazy Evaluation (Why Generators Matter)

Generators are lazy:

- Values computed only when requested
- No unnecessary work
- Scales to large or infinite data

Example:

In [44]:
def infinite_numbers():
    n = 0
    while True:
        yield n
        n += 1

In [45]:
whole_num = infinite_numbers()
type(whole_num)

generator

In [46]:
next(whole_num)

0

Used carefully, laziness enables:

- Streaming
- Large data pipelines
- Efficient ML preprocessing

Used carelessly, it creates:

- Silent infinite loops
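
One way to consume an infinite generator safely is `itertools.islice`, which stops after a fixed number of items. A sketch reusing the generator defined above:

```python
from itertools import islice

def infinite_numbers():
    n = 0
    while True:
        yield n
        n += 1

# list(infinite_numbers()) would never finish; islice stops after 5 items.
first_five = list(islice(infinite_numbers(), 5))
print(first_five)  # [0, 1, 2, 3, 4]
```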

### One-Time Iterables (Common Trap)

Generators and iterators:

- Can be consumed only once

In [47]:
g = (x for x in range(3))

list(g)
list(g)  # Empty

[]

This is not a bug.
This is how iterators work.

Rule:

If you need to iterate multiple times, store the data.

### Memory vs Speed Tradeoffs

**Lists**

- Fast indexing
- Reusable
- High memory usage

**Generators**

- Minimal memory
- One-pass only
- Slightly slower per item

In [48]:
import sys

l=[x * 2 for x in range(5000)]
g=(x * 2 for x in range(5000))

for i in (l,g):

    print(f"Type : {type(i)}")
    print(f"Id   : {id(i)}")
    print(f"Size : {sys.getsizeof(i)} \n")


Type : <class 'list'>
Id   : 133731312519488
Size : 41880 

Type : <class 'generator'>
Id   : 133731322789584
Size : 200 



Guideline:

- Small data → lists
- Large or streaming data → generators

In ML and data engineering:

- Generators reduce memory pressure
- But can hide performance costs if abused

**When generators are the wrong tool**

Generators are bad when:

- You need random access
- You need multiple passes
- You need caching
- You need vectorization (NumPy, pandas)

They shine when:

- Streaming data
- One-pass processing
- IO-bound tasks
- Large datasets that don’t fit in memory

**The real meaning (lock this in)**

*Generators make memory usage look cheap,
but they can silently make time complexity, debugging, and reuse expensive.*

That’s the tradeoff.

---

## PHASE 5 — Strings, Text & Regex  
*Underrated, heavily used.*

- String internals
- Formatting (`f-strings`)
- Unicode basics
- Regex fundamentals
- Common text-processing patterns

### String Internals

Strings Are Immutable

In [49]:
s = "hello"
s += " world"

What actually happened:

- A new string was created
- Old string was discarded
- References were updated

Implication:

- Repeated concatenation in loops is expensive

Bad:

In [50]:
words = ["data", "science", "brush", "up"]  # sample input

s = ""
for word in words:
    s += word  # may copy the whole string on each iteration

Good:

In [None]:
s = "".join(words)

**Strings Are Sequences**

In [None]:
s = "python"

print(s[0])      # 'p'
print(s[-1])     # 'n'
print(s[1:4])    # 'yth'

Indexing and slicing create new strings, not views.

### String Formatting (f-Strings Deep Dive)

#### Basic f-Strings

In [None]:
name = "Beno"
age = 23

f"My name is {name} and I am {age}"

f-strings:

- Are evaluated at runtime
- Are faster than .format()
- Are readable

#### Alignment & Width

In [None]:
x = 7

print(f"|{x:>5}|")   # right aligned
print(f"|{x:<5}|")   # left aligned
print(f"|{x:^5}|")   # center

With fill character:

In [None]:
x = "@"

print(f"|{x:*>5}|")   # right aligned
print(f"|{x:*<5}|")   # left aligned
print(f"|{x:*^5}|")   # center

#### Numeric Formatting

In [None]:
pi = 3.14159265

print(f"{pi:.2f}")     # 3.14
print(f"{pi:>8}")
print(f"{pi:8.2f}")    # width 8
print(f"{pi:.2%}")     # percentage

Thousands separator:

In [None]:
n = 1000000
f"{n:,}"

#### Base Conversion

In [None]:
n = 255

print(f"{n:b}")   # binary
print(f"{n:o}")   # octal
print(f"{n:x}")   # hex

#### Debug Format (Python 3.8+)

In [None]:
x = 10
f"{x=}"

### Unicode Basics (Text Is Not ASCII)

Python str is:
- Unicode
- Not bytes

In [None]:
s = "café"
len(s)   # 4
type(s)  # str

But:

In [None]:
b = s.encode("utf-8")
len(b)   # 5
type(b)  # bytes
print(b)

Key takeaway:

- Characters ≠ bytes
- Encoding matters for files, APIs, databases

#### Encoding / Decoding

In [None]:
import sys
text = "Hello this is 1234@#$%^&*"

encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")

for i in (encoded,decoded):
    print(f"{i}")
    print(f"Type :{type(i)}")
    print(f"Len  :{len(i)}")
    print(f"Id   :{id(i)}")
    print(f"Size :{sys.getsizeof(i)} \n")

Mismatched encoding → corrupted text.
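
A sketch of what that corruption looks like, reusing the `"café"` string from above. Decoding UTF-8 bytes as Latin-1 is just one example of a mismatch:

```python
text = "café"
raw = text.encode("utf-8")

wrong = raw.decode("latin-1")   # wrong codec: no error, just garbage
right = raw.decode("utf-8")

print(wrong)  # cafÃ©
print(right)  # café

try:
    raw.decode("ascii")         # a strict codec fails loudly instead
except UnicodeDecodeError as exc:
    print("UnicodeDecodeError:", exc.reason)
```

The silent case is the dangerous one: nothing raises, and the garbage ends up in your database.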

### Regex (`re` Module) — Core Functions (Powerful, Dangerous)


**Core Metacharacters**

| Metacharacter | Meaning |
|--------------|--------|
| `.` | Any character except newline |
| `^` | Start of string |
| `$` | End of string |
| `*` | 0 or more of previous |
| `+` | 1 or more of previous |
| `?` | 0 or 1 (optional) |
| `{m,n}` | Repeat between m and n |
| `[]` | Character class |
| `()` | Grouping / capture |
| `\|` | OR |
| `\` | Escape character |

**Predefined Character Classes**

| Pattern | Meaning |
|--------|--------|
| `\d` | digit |
| `\D` | non-digit |
| `\w` | word character (letter, digit, `_`) |
| `\W` | non-word character |
| `\s` | whitespace |
| `\S` | non-whitespace |


#### `re.compile()`

Precompiles a regex pattern into a pattern object.

In [None]:
import re

pattern = re.compile(r"\d{2,5}")
pattern.findall("a1 b22 c333")

Why it exists:

- Avoids recompiling the same pattern repeatedly
- Faster in loops
- Cleaner code

Rule:
- Use `re.compile()` when a pattern is reused.

#### `re.escape()`

Escapes all regex metacharacters in a string.

In [None]:
re.escape("a.b*c?")

Use case:

- User input
- Filenames
- Dynamic patterns

Rule:

- If input is not trusted, escape it.
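
A sketch of the difference escaping makes. `user_input` stands in for any untrusted search term:

```python
import re

user_input = "a.b"                     # hypothetical untrusted search term

unsafe = re.findall(user_input, "a.b axb")             # '.' matches any char
safe = re.findall(re.escape(user_input), "a.b axb")    # '.' matched literally

print(unsafe)  # ['a.b', 'axb']
print(safe)    # ['a.b']
```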

#### `re.findall()`

Returns all matches as a list.

In [None]:
re.findall(r"\d+", "a1 b22 c333")


Notes:

- One capture group → returns a list of that group's strings; several groups → returns a list of tuples
- Loads everything into memory

Good for:

- Small to medium text
- When you need all matches at once
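
How capture groups change the return value of `findall()`, sketched on the same sample text:

```python
import re

text = "a1 b22 c333"

print(re.findall(r"\d+", text))           # ['1', '22', '333']
print(re.findall(r"[a-z](\d+)", text))    # ['1', '22', '333'] (group only)
print(re.findall(r"([a-z])(\d+)", text))  # [('a', '1'), ('b', '22'), ('c', '333')]
```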


#### `re.finditer()`

Returns an iterator of match objects.

In [None]:
for m in re.finditer(r"\d+", "a1 b22 c333"):
    print(m.group(), m.start(), m.end())

Why it matters:

- Lazy
- Memory efficient
- Access to positions

Use when:

- Large text
- Need match spans
- Performance matters

#### `re.fullmatch()`

Matches entire string.

In [None]:
re.fullmatch(r"\d+", "123")   # match
re.fullmatch(r"\d+", "123a")  # no match

Best for:

- Validation
- Input checking

Rule:
- Use `fullmatch()` for validation, not `search()`.

#### `re.match()`

Matches from the start of the string only.

In [None]:
re.match(r"\d+", "123abc")   # match
re.match(r"\d+", "abc123")   # no match

Important:

- Does NOT scan the whole string
- Often misunderstood

Use only when:

- You explicitly want position 0

#### `re.search()`

Searches anywhere in the string.

In [None]:
re.search(r"\d+", "abc123xyz")

Most common:

- Finds first occurrence anywhere

Rule:

- `search()` ≠ `match()`

#### `re.split()`

Splits string by a regex pattern.

In [None]:
re.split(r",\s*", "a, b, c")

Groups are included if parentheses are used:

In [None]:
re.split(r"(\d+)", "a1b2c")

#### `re.sub()`

Replaces matches with a string.

In [None]:
re.sub(r"\d+", "#", "a1b22c333")

Can use a function:

In [None]:
re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "a1b2")

#### `re.subn()`

Same as sub(), but also returns count.

In [None]:
re.subn(r"\d+", "#", "a1b22c333")

Use when:

- You need to know how many replacements occurred

#### `re.purge()`

Clears the internal regex cache.

In [None]:
re.purge()

Reality:

- Almost never needed
- Mostly for long-running systems
- Or memory diagnostics
- You can ignore this 99% of the time.

---

## PHASE 6 — Error Handling & Defensive Code

- Exceptions hierarchy
- `try / except / else / finally`
- Raising custom exceptions
- Writing failure-aware code

### Exceptions Hierarchy (Know What You’re Catching)

All exceptions inherit from BaseException.

Key branches:

- `Exception` → catch this, not `BaseException`
- `ValueError`, `TypeError`, `KeyError`, `IndexError`
- `OSError` (`IOError` is an alias for it since Python 3.3)
- `RuntimeError`

Never use a bare `except:` clause (or catch `BaseException`).

Why:

- Catches everything
- Hides real bugs
- Makes debugging hell
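
The hierarchy can be inspected directly with `issubclass`. A short sketch:

```python
print(issubclass(ValueError, Exception))             # True
print(issubclass(KeyError, LookupError))             # True
print(issubclass(KeyboardInterrupt, Exception))      # False
print(issubclass(KeyboardInterrupt, BaseException))  # True
print(IOError is OSError)                            # True
```

`KeyboardInterrupt` sits outside `Exception`, so `except Exception` still lets Ctrl+C stop the program. A bare `except:` does not.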

### `try / except / else / finally`

#### Basic Structure

In [None]:
try:
    x = int("10")
except ValueError:
    print("Invalid number")
else:
    print("Conversion successful")
finally:
    print("Always runs")

Meaning:

- `try` → risky code
- `except` → handle failure
- `else` → runs only if no exception
- `finally` → cleanup, always runs

#### Multiple Exceptions

In [None]:
value = "abc"  # sample input that cannot be parsed

try:
    data = int(value)
except (TypeError, ValueError):
    print("Invalid input")

Order matters:
- Catch specific exceptions first
- General ones last

### Raising Exceptions (Fail Fast, Fail Loud)

#### Raise Built-in Exceptions

In [None]:
if age < 0:
    raise ValueError("Age cannot be negative")

Rule:
- If something is logically impossible, raise immediately.

#### Custom Exceptions

In [None]:
class InvalidConfigError(Exception):
    pass

Usage:

In [None]:
if not config:
    raise InvalidConfigError("Missing configuration")

Why custom exceptions matter:
- Clear intent
- Easier debugging
- Better error handling upstream
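
A sketch of how a custom exception can wrap a lower-level error with `raise ... from`. The `load_config` function and the `"lr"` key are hypothetical, chosen only for illustration:

```python
class InvalidConfigError(Exception):
    """Raised when a configuration dict is missing or malformed."""

def load_config(config):
    # Hypothetical loader: expects a dict with an "lr" entry.
    if not config:
        raise InvalidConfigError("Missing configuration")
    try:
        return float(config["lr"])
    except KeyError as exc:
        raise InvalidConfigError("Missing key: lr") from exc

try:
    load_config({"batch_size": 32})
except InvalidConfigError as exc:
    print(f"{type(exc).__name__}: {exc}")   # InvalidConfigError: Missing key: lr
    print(repr(exc.__cause__))              # KeyError('lr')
```

Callers catch one meaningful exception type, and the original `KeyError` stays attached for debugging.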

### Writing Failure-Aware Code

#### Validate Early

In [None]:
if not isinstance(x, int):
    raise TypeError("x must be int")

#### Don’t Swallow Errors

Bad:

In [None]:
try:
    risky()
except Exception:
    pass

Good:

In [None]:
try:
    risky()
except Exception as e:
    log_error(e)
    raise

#### Use Exceptions, Not Return Codes

Bad:

In [None]:
def divide(a, b):
    if b == 0:
        return None
    return a / b

Good:

In [None]:
def divide(a, b):
    if b == 0:
        raise ZeroDivisionError("b cannot be zero")
    return a / b

#### Context Managers (with)

In [None]:
with open("file.txt") as f:
    data = f.read()

**Why this matters:**
- Ensures cleanup
- Prevents resource leaks
- Cleaner than try/finally
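
You can write your own context manager with `contextlib.contextmanager`. In this sketch, `resource` and the `events` list are made up to show that cleanup runs even when the block raises:

```python
from contextlib import contextmanager

events = []

@contextmanager
def resource(name):
    events.append(f"open {name}")
    try:
        yield name
    finally:
        events.append(f"close {name}")   # runs even if the block raises

try:
    with resource("db") as r:
        events.append(f"use {r}")
        raise RuntimeError("boom")
except RuntimeError:
    events.append("handled")

print(events)  # ['open db', 'use db', 'close db', 'handled']
```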

#### Defensive Coding Principles

- Assume inputs are wrong
- Validate boundaries
- Fail early
- Log before crashing
- Never hide errors
- Let errors propagate when appropriate

---

## PHASE 7 — Modules, Packages & Imports  
*This is where scripts turn into systems.*

- Import mechanics
- `__name__ == "__main__"`
- Package structure
- Relative vs absolute imports
- Virtual environments revisited

### Import Mechanics

When Python sees an `import` statement, it does not magically know where the module lives.

Python searches in order:

- Current script directory
- Directories in PYTHONPATH
- Standard library
- Site-packages

You can inspect this:

In [None]:
import sys
sys.path

Key insight:

- Imports are just filesystem lookups + execution.

A module is executed once, then cached in sys.modules.
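
The cache is easy to observe. A small sketch using `math`:

```python
import sys
import math

print("math" in sys.modules)          # True: cached after the first import

import math as m2                     # no re-execution, just a cache lookup
print(m2 is sys.modules["math"])      # True
print(m2 is math)                     # True
```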

#### Import Styles

In [None]:
import math
math.sqrt(16)

In [None]:
from math import sqrt
sqrt(16)

In [None]:
import math as m
m.sqrt(16)

Rules:

- Prefer explicit imports
- Avoid from module import *
- Aliases are fine when standard (np, pd)

### `__name__ == "__main__"`

Every Python module has a `__name__`.

- `"__main__"` → file is run directly
- Module name → file is imported

Example:

In [None]:
def main():
    print("Running main logic")

if __name__ == "__main__":
    main()

Why this matters:
- Prevents code from running on import
- Enables reuse + testing
- Required for clean package design

### Package Structure (Think in Directories)

**Basic structure:**

Key points:
- A package is a directory
- `__init__.py` marks it as a regular package (namespace packages, Python 3.3+, can omit it)
- Structure > cleverness
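
To see these points without touching your project, the sketch below builds a throwaway package in a temporary directory and imports from it. The names `demo_pkg`, `utils`, and `helper` are hypothetical:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Layout created below:
# demo_pkg/
# ├── __init__.py
# └── utils.py
with tempfile.TemporaryDirectory() as root:
    pkg = Path(root) / "demo_pkg"          # hypothetical package name
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")   # marks the directory as a package
    (pkg / "utils.py").write_text("def helper():\n    return 'helped'\n")

    sys.path.insert(0, root)               # make the parent directory importable
    importlib.invalidate_caches()          # recommended after creating modules at runtime
    try:
        utils = importlib.import_module("demo_pkg.utils")
        result = utils.helper()
    finally:
        sys.path.remove(root)

print(result)  # helped
```

In a real project the package directory sits next to your code or is installed, so the `sys.path` manipulation is not needed.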

### Relative vs Absolute Imports

#### Absolute Imports (Preferred)

In [None]:
from app.utils import helper

Clear, explicit, stable.

#### Relative Imports (Use Carefully)

In [None]:
from .utils import helper
from ..config import settings

Rules:

- Only inside packages
- Break easily when refactoring
- Never use in top-level scripts

Rule of thumb:
- Libraries → absolute
- Internal package modules → relative (sparingly)

### Virtual Environments (Revisited, Properly)

Why they exist:
- Dependency isolation
- Reproducibility
- Zero global pollution

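Create: `python -m venv .venv` (`.venv` is a conventional directory name, not a requirement)

Activate: `source .venv/bin/activate` (Linux/macOS) or `.venv\Scripts\activate` (Windows)

Install: `pip install <package>`

Freeze: `pip freeze > requirements.txt`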

Reality check:

*If you don’t use venv, you will eventually break something.*

---

## PHASE 8 — OOP (Pythonic, not Java cosplay)  
*Needed for frameworks and ML codebases.*

- Classes & instances
- `__init__`
- Instance vs class attributes
- Inheritance
- Composition
- Method resolution order
- Dunder methods
- Dataclasses

### What a class really is

A class is a factory for objects.
An object is just a bundle of data + behavior.

Nothing mystical.

In [None]:
class User:
    pass

This creates a new type.

In [None]:
u = User()

Now:
- User → the class
- u → an instance of that class

Check reality:

In [None]:
type(u)        # <class '__main__.User'>
isinstance(u, User)  # True

### `__init__` is NOT a constructor

This is where people get it wrong.

In [None]:
class User:
    def __init__(self, name):
        self.name = name

Key facts:

- `__init__` does not create the object
- The object already exists
- `__init__` just initializes it

This happens under the hood:

- Python creates an empty object (via `__new__`)
- Calls `__init__`(object, args...)

`self` is not magic.

It’s just the object being passed in.

### Instance attributes vs Class attributes (CRITICAL)

**Instance attribute**

Belongs to one object

In [None]:
class User:
    def __init__(self, name):
        self.name = name

Each instance has its own name.

**Class attribute**

Belongs to the class itself

In [None]:
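class User:
    species = "human"  # class attribute, shared by all instances

    def __init__(self, name):
        self.name = name  # instance attribute

u1 = User("A")
u2 = User("B")

u1.species  # human
u2.species  # human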

Now the trap:

In [None]:
u1.species = "alien"

This creates a new instance attribute, it does NOT modify the class.

Rule:

- Read → checks instance first, then class
- Write → always writes to instance unless explicitly done on class
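
The read and write rules can be traced step by step. A self-contained sketch, with arbitrary attribute values:

```python
class User:
    species = "human"

    def __init__(self, name):
        self.name = name

u1, u2 = User("A"), User("B")

u1.species = "alien"       # creates an instance attribute on u1 only
print(u1.species)          # alien
print(u2.species)          # human
print(User.species)        # human

User.species = "robot"     # explicit write on the class
print(u2.species)          # robot
print(u1.species)          # alien (instance attribute still shadows the class)
print(u1.__dict__)         # {'name': 'A', 'species': 'alien'}
```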

### Methods are just functions with an object attached

In [None]:
class User:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"Hi {self.name}"

Calling:

In [None]:
u = User("Beno")
u.greet()  # 'Hi Beno'

Reality:

In [None]:
User.greet(u)

That’s it.

No magic. Just function binding.

### `__str__` vs `__repr__` (don’t mix them)

In [None]:
class User:
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name

    def __repr__(self):
        return f"User(name={self.name!r})"

- `__str__` → for humans
- `__repr__` → for developers, debugging, logs

Rule:

If `__repr__` lies, debugging becomes hell.

### Inheritance (use carefully)

In [None]:
class Animal:
    def speak(self):
        return "sound"

class Dog(Animal):
    def speak(self):
        return "bark"

Dog inherits behavior; it does not copy it.

In [None]:
d = Dog()
d.speak()  # bark

Problem:

- Deep inheritance chains become untraceable
- MRO gets complex fast

### Method Resolution Order (MRO)

Python decides which method to call using MRO.

In [None]:
Dog.mro()

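Python searches the classes in the order `Dog.mro()` returns: the class itself first, then its bases left to right, with each class checked before its own parents (C3 linearization).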

Multiple inheritance?

You will shoot yourself if you don’t understand MRO.
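
A classic diamond makes the order concrete. The class names here are invented for the sketch:

```python
class Base:
    def who(self):
        return "Base"

class Left(Base):
    def who(self):
        return "Left"

class Right(Base):
    def who(self):
        return "Right"

class Child(Left, Right):
    pass

print([cls.__name__ for cls in Child.mro()])
# ['Child', 'Left', 'Right', 'Base', 'object']
print(Child().who())  # Left
```

Swap the bases to `Child(Right, Left)` and the same call returns `"Right"`. The order of the bases is part of the behavior.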

### Composition > Inheritance (this matters in ML systems)

Bad:

In [None]:
class ModelTrainer(Trainer, Logger, Validator):
    pass

Good:

In [None]:
class ModelTrainer:
    def __init__(self, logger, validator):
        self.logger = logger
        self.validator = validator

Why?

- Easier testing
- Less coupling
- No MRO madness
- Replace behavior without rewriting classes

This is how real codebases survive.

### Class methods vs Static methods

Class method

Works on the class, not instance.

In [None]:
class User:
    count = 0

    def __init__(self):
        User.count += 1

    @classmethod
    def total_users(cls):
        return cls.count

In [None]:
u=User()

In [None]:
u.total_users()

Used for:

- Factories
- Global counters
- Alternative constructors
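
An alternative constructor might look like the sketch below. `from_csv` and its `"name,age"` input format are assumptions made for the example:

```python
class User:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    @classmethod
    def from_csv(cls, line):
        """Hypothetical alternative constructor: parse 'name,age'."""
        name, age = line.split(",")
        return cls(name.strip(), int(age))

u = User.from_csv("Beno, 23")
print(u.name, u.age)  # Beno 23
```

Because it calls `cls(...)` rather than `User(...)`, subclasses get instances of their own type from the same method.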

Static method

Just namespaced function.

In [None]:
class MathUtils:
    @staticmethod
    def add(a, b):
        return a + b

In [None]:
m=MathUtils()

In [None]:
m.add(5,10)

No self, no cls.

If it doesn’t touch object or class state, make it static.

### Dataclasses (modern Python OOP)

Stop writing boilerplate.

In [None]:
from dataclasses import dataclass

@dataclass
class User:
    name: str
    age: int

This gives you:
- `__init__`
- `__repr__`
- `__eq__`

For free.

Immutable object?

In [None]:
@dataclass(frozen=True)
class Config:
    lr: float
    batch_size: int

Now instances are hashable (usable as dict keys) and their fields cannot be reassigned after creation.

Used everywhere in ML configs and pipelines.
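
What a frozen dataclass gives you in practice, as a sketch with arbitrary config values:

```python
from dataclasses import FrozenInstanceError, dataclass, replace

@dataclass(frozen=True)
class Config:
    lr: float
    batch_size: int

cfg = Config(lr=0.01, batch_size=32)

try:
    cfg.lr = 0.1                       # assignment is blocked
except FrozenInstanceError:
    print("cannot modify a frozen dataclass")

faster = replace(cfg, lr=0.1)          # build a modified copy instead
print(faster)                          # Config(lr=0.1, batch_size=32)

results = {cfg: "baseline"}            # hashable, so usable as a dict key
print(results[Config(0.01, 32)])       # baseline
```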

### The hard truth about OOP in Python

- OOP is not mandatory
- Overusing classes makes code worse
- Use classes when:
    - You model state + behavior
    - You need clear boundaries
    - You need extensibility

If you’re using classes just to “look professional”, you’re doing it wrong.

---

## PHASE 9 — Standard Library Power Tools  
*Most people jump to third-party libs too early.*

- `collections`
- `itertools`
- `functools`
- `pathlib`
- `datetime`
- `math`, `random`
- `copy`

### `collections` — Better data structures

This module exists because list and dict aren’t always enough.

#### `Counter`

Count things. Fast. Clean.

In [None]:
from collections import Counter

data = ["a", "b", "a", "c", "b", "a"]
Counter(data)

Use cases:
- Frequency analysis
- NLP token counts
- Class imbalance checks

Anti-pattern:

Using loops + dict manually for counting.

#### `defaultdict`

Avoid key-existence checks.

In [None]:
from collections import defaultdict

groups = defaultdict(list)

groups["a"].append(1)
groups["a"].append(2)

groups

No `if key in dict`.

Reality check:

If you ever write:

In [None]:
if k not in d:
    d[k] = []

You should be using defaultdict.

#### `deque`

Fast append/pop from both ends.

In [None]:
from collections import deque

q = deque([1, 2, 3])
q.appendleft(0)
q.pop()

Use cases:

- Queues
- Sliding windows
- BFS algorithms

A list is slow here: `list.pop(0)` and `list.insert(0, x)` are O(n), while `deque` handles both ends in O(1).

#### `namedtuple` (older, still useful)

Lightweight object without full class.

In [None]:
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
p = Point(2, 3)

p

### `itertools` — Performance without memory waste

This is iterator-level power.

#### `count`, `cycle`, `repeat`

Infinite streams.

In [None]:
from itertools import count

for i in count(10, step=2):
    print(i)
    if i > 20:
        break

Use when:
- Generating sequences lazily
- Avoiding giant lists

#### `chain`

Flatten iterables without copying.

In [None]:
from itertools import chain

list(chain([1,2], [3,4]))

Better than `a + b` when you only need to iterate, because `chain` doesn't allocate an intermediate list.

#### `product`, `permutations`, `combinations`

In [None]:
from itertools import product

list(product([1,2], ["a","b"]))

Used in:
- Grid search
- Parameter sweeps
- Combinatorics

Warning:

Explodes fast. Know the math before running.

### `functools` — Function-level control

#### `lru_cache`

Memoization. Huge performance win.

In [None]:
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)

In [None]:
fib(5)

Turns exponential into linear.

Use cases:
- Dynamic programming
- Repeated expensive calls
- Recursive algorithms

Danger:
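Arguments must be hashable (passing a list raises `TypeError`). Caching a function that returns mutable objects or depends on external state = bugs.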

#### `partial`

Fix arguments in advance.

In [None]:
from functools import partial

def power(base, exp):
    return base ** exp

square = partial(power, exp=2)

In [None]:
square(5)

Cleaner than lambdas in many cases.

#### `reduce`

Fold values into one.

In [None]:
from functools import reduce
reduce(lambda a, b: a + b, [1,2,3])

Truth:

- Readable only for simple ops.
- Don’t be clever. Clarity beats one-liners.

### `pathlib` — Filesystem done right

Stop building paths by hand with string concatenation or `os.path` calls. Do this:

In [None]:
from pathlib import Path

p = Path(".") / "file.txt"

Examples:

In [None]:
p.exists()

In [None]:
p.is_file()

In [None]:
p.write_text("hello")

In [None]:
p.read_text()

Why this matters:
- OS-independent
- Safer
- Cleaner
- Used everywhere in real projects

### `datetime` — Time without lying to yourself

In [None]:
from datetime import datetime, timedelta

now = datetime.now()
tomorrow = now + timedelta(days=1)

Parsing:

In [None]:
datetime.strptime("2025-01-01", "%Y-%m-%d")

Formatting:

In [None]:
now.strftime("%Y-%m-%d %H:%M")

Hard truth:

Timezones are pain.

If your system touches production data, learn `zoneinfo` (standard library since Python 3.9) or the third-party `pytz` later.
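
Until then, a minimal sketch of the naive vs aware distinction using only `datetime.timezone`. The +05:30 offset and the date are arbitrary choices for illustration:

```python
from datetime import datetime, timedelta, timezone

naive = datetime.now()                    # no timezone attached
aware = datetime.now(timezone.utc)        # explicit UTC

print(naive.tzinfo)   # None
print(aware.tzinfo)   # UTC

ist = timezone(timedelta(hours=5, minutes=30))   # fixed offset, for illustration
fixed = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(fixed.astimezone(ist).isoformat())  # 2025-01-01T17:30:00+05:30
```

Fixed offsets do not handle daylight saving time. That is the part `zoneinfo` solves.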

### `math` and `random`

#### `math`

Fast, precise math.

In [None]:
import math

math.sqrt(16)
math.log(10)
math.ceil(2.3)

Used heavily in:
- ML preprocessing
- Probability
- Optimization

#### `random`

Non-secure randomness.

In [None]:
import random

random.seed(42)
random.choice([1,2,3])
random.shuffle(data)

Never use this for:

- Security
- Tokens
- Passwords

Use `secrets` instead.

### `copy` — Understand this or suffer later

In [None]:
import copy

a = [[1, 2], [3, 4]]
b = copy.copy(a)
c = copy.deepcopy(a)

- `copy` → new container, same inner objects
- `deepcopy` → everything duplicated

Reality check:

- Deep copy is expensive
- Often signals bad design
- Immutable data avoids this entirely
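
A short sketch that makes the difference visible by mutating the original after copying:

```python
import copy

a = [[1, 2], [3, 4]]
b = copy.copy(a)        # shallow: new outer list, same inner lists
c = copy.deepcopy(a)    # deep: inner lists duplicated too

a[0].append(99)         # mutate an inner list
a.append([5, 6])        # mutate the outer list

shallow_shares_inner = b[0] is a[0]
deep_shares_inner = c[0] is a[0]
```

After this runs, `b` sees the `99` (shared inner list) but not the appended `[5, 6]` (separate outer list), while `c` sees neither.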

---

## PHASE 10 — APIs & Backend Foundations  
*This is how your code talks to the outside world.*

### API Fundamentals

**What an API actually is**

An API is a contract:

- Input format
- Output format
- Rules
- Failure modes

If you break the contract, clients break. No debate.

**REST principles (what actually matters)**

Forget buzzwords. Remember this:

- Resources, not actions
    `/users/42` not `/getUser?id=42`
- Stateless
    Every request is independent.
- Predictable
    Same input → same output.

REST is about consistency, not purity.

**HTTP Methods (behavior, not syntax)**

| Method | Meaning |
|-|-|
| GET | Read |
| POST | Create |
| PUT | Replace |
| PATCH | Partial update |
| DELETE | Remove |

Misusing these turns your API into trash.

**Status Codes (non-negotiable)**

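| Code | Meaning |
|-|-|
| 200 | OK |
| 201 | Created |
| 400 | Bad request (malformed client input) |
| 401 | Unauthorized (missing or invalid credentials) |
| 403 | Forbidden (authenticated, but not allowed) |
| 404 | Not found |
| 422 | Validation failed |
| 500 | Server error |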

Returning `200` for failures = bad engineering.

**Headers & payloads**

- Headers = metadata
- Payload = data

Common headers:

- Authorization
- Content-Type
- Accept

Payload is usually JSON.

**JSON serialization**

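Python objects are not JSON.

Mapping:
- dict → object
- list / tuple → array
- str → string
- int / float → number
- True / False → true / false
- None → null

Everything else (datetime, set, custom classes, NumPy types) must be converted first.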
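
A minimal sketch of converting unsupported types through the `default` hook of `json.dumps`. `to_jsonable` is a hypothetical helper name, and the choice to turn sets into sorted lists is an assumption for this example.

```python
import json
from datetime import date

def to_jsonable(obj):
    """Called by json.dumps only for objects it cannot serialize itself."""
    if isinstance(obj, date):
        return obj.isoformat()
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

event = {
    "user_id": 42,
    "tags": {"b", "a"},          # set: not JSON
    "day": date(2025, 1, 1),     # date: not JSON
    "score": None,               # becomes null
    "pair": (1, 2),              # tuple becomes an array
}

text = json.dumps(event, default=to_jsonable, sort_keys=True)
restored = json.loads(text)
```

Note the round trip is lossy: the tuple and the set both come back as lists, and the date comes back as a string.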

### Working with APIs in Python (requests)

**Basic request**

Never ignore:
- `status_code`
- network failures
- timeouts
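
A minimal sketch of a GET request that covers all three points, assuming the `requests` package is installed. `fetch_json` is a hypothetical helper, and the optional `session` parameter exists only so the function can be exercised without a network.

```python
import requests

def fetch_json(url, session=None, timeout=5.0):
    """Return parsed JSON from `url`, or None if the request fails."""
    http = session or requests
    try:
        response = http.get(url, timeout=timeout)   # never wait forever
        response.raise_for_status()                 # turn 4xx/5xx into an exception
        return response.json()
    except requests.exceptions.Timeout:
        print(f"Timed out after {timeout}s: {url}")
    except requests.exceptions.HTTPError as exc:
        print(f"Bad status {exc.response.status_code}: {url}")
    except requests.exceptions.RequestException as exc:
        # Connection errors and anything else requests raises.
        print(f"Request failed: {exc}")
    return None
```

Returning `None` on failure is a design choice for this sketch; in a pipeline you may prefer to re-raise so the job fails loudly.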

**POST with payload**

Always set timeouts.

No timeout = production outage waiting to happen.

**Authentication**

Most common:
- API keys
- Bearer tokens

In [None]:
headers = {
    "Authorization": "Bearer TOKEN"
}

Never hardcode secrets. Ever.
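
One common alternative is reading the token from an environment variable. A minimal sketch; the variable name `API_TOKEN` and the helper `auth_headers` are assumptions for this example.

```python
import os

def auth_headers(env_var="API_TOKEN"):
    """Build a Bearer header from an environment variable."""
    token = os.environ.get(env_var)
    if not token:
        # Fail early with a clear message instead of sending an empty token.
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"Authorization": f"Bearer {token}"}
```

The secret then lives in the shell, a `.env` file kept out of Git, or a secrets manager, and never in the notebook.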

**Pagination**

APIs rarely give all data at once.

Patterns:
- page / limit
- cursor-based

You must loop until exhausted.
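
A minimal sketch of the page / limit loop. `fetch_page` stands for whatever function performs the real API call; here a fake one slices a local list so the example runs without a network. Treating a short page as the end of the data is an assumption, since some APIs signal the end differently.

```python
def fetch_all(fetch_page, limit=100):
    """Collect every item from a page/limit style endpoint."""
    items = []
    page = 1
    while True:
        batch = fetch_page(page=page, limit=limit)
        items.extend(batch)
        if len(batch) < limit:    # short or empty page: nothing left
            break
        page += 1
    return items

# Stand-in for a real API: 250 records served in pages.
DATA = list(range(250))

def fake_fetch_page(page, limit):
    start = (page - 1) * limit
    return DATA[start:start + limit]

all_items = fetch_all(fake_fetch_page, limit=100)
```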

**Rate limiting**

APIs will throttle you.

You must:

- read headers
- back off
- retry intelligently

Blind retries = ban.
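
A minimal sketch of retrying with exponential backoff. `RateLimited` is a hypothetical exception that your own request code would raise on an HTTP 429, carrying the server's `Retry-After` value if one was sent. The `sleep` parameter is injectable only so the logic can be tested without waiting.

```python
import random
import time

class RateLimited(Exception):
    """Hypothetical error raised by the caller's code on an HTTP 429."""
    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after

def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise   # out of attempts: surface the error
            # Prefer the server's hint; otherwise back off exponentially with jitter.
            delay = exc.retry_after
            if delay is None:
                delay = base_delay * 2 ** attempt + random.uniform(0, 0.1)
            sleep(delay)
```

The jitter keeps many clients from retrying at exactly the same moment.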

**Error handling (real-world)**

Failure is expected. Handle it.

### Building APIs — `Flask` vs `FastAPI`

#### Flask (simple, flexible)

Minimal structure.

In [None]:
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/ping")
def ping():
    return jsonify({"msg": "pong"})

Good for:

- small services
- quick prototypes

Weak at:

- validation
- typing
- large systems

#### FastAPI (modern, typed, serious)

In [None]:
from fastapi import FastAPI

app = FastAPI()

@app.get("/ping")
def ping():
    return {"msg": "pong"}

Why FastAPI wins:

- Automatic validation
- Type hints checked at runtime (through Pydantic)
- Built-in docs

Many production teams now pick it for new services.

**Request validation**

In [None]:
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

Invalid input never reaches your logic.
That’s huge.
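
A minimal sketch of that rejection happening, assuming the `pydantic` package is installed. The sample values are made up for the example.

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int

ok = User(name="Beno", age=23)

try:
    User(name="Beno", age="twenty-three")   # not convertible to int
    rejected = False
except ValidationError as exc:
    rejected = True
    # Each error records which field failed.
    bad_fields = [err["loc"][0] for err in exc.errors()]
```

When `User` is used as the type of a FastAPI request body, a failure like this is typically returned to the client as a `422` response before your handler runs.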

**Response models**

Your output must be predictable.

FastAPI enforces this when you declare a response model. Flask leaves it to you.

**Middleware**

Used for:

- logging
- auth
- timing
- tracing

Every serious API has middleware.

**API documentation**

FastAPI gives:

- Swagger UI
- OpenAPI spec

If your API has no docs, it doesn’t exist.

**Reality Check**

Most ML engineers:

- Can train models
- Cannot expose them cleanly
- Cannot consume external APIs safely

That makes them blockers, not builders.

---

## PHASE 11 — Data Engineering Foundations  
*ML without data engineering is fragile.*

### Data Storage (How data actually lives)

#### CSV — simple, fragile

In [None]:
import pandas as pd

df = pd.read_csv("sales.csv")
df.head()

Problems you should notice:

- No enforced types
- Dates come as strings
- Corruption goes unnoticed

CSV is fine for:

- quick inspection
- ingestion

Not fine for:

- repeated training
- production pipelines

#### JSON — good for APIs, not analytics

In [None]:
import json

with open("event.json") as f:
    data = json.load(f)

data["user_id"]

Issues:

- Nested structure
- Hard to query
- Inefficient at scale

Rule:

JSON is for communication, not storage.

#### Parquet — correct format for ML pipelines

In [None]:
df.to_parquet("sales.parquet")
df2 = pd.read_parquet("sales.parquet")

Why this matters:

- Columnar
- Typed
- Compressed
- Fast reads

If your training data is reused → Parquet.

#### Databases & SQL (Features don’t come from CSVs)

**SQLite (local, dev-friendly)**

In [None]:
import sqlite3
import pandas as pd

conn = sqlite3.connect("data.db")

df.to_sql("sales", conn, if_exists="replace", index=False)

Query it:

In [None]:
query = """
SELECT product, SUM(revenue) AS total_revenue
FROM sales
GROUP BY product
"""

pd.read_sql(query, conn)

This is feature engineering, not “SQL work”.

#### PostgreSQL (conceptually same, production-ready)

The SQL you write barely changes (dialects differ in small details).
Mainly the connection does.

That’s why SQL matters.

### Data Pipelines (Where most systems break)

#### ETL vs ELT (shown in code)

**ETL (transform before load)**

In [None]:
raw = pd.read_csv("raw.csv")

clean = raw.dropna()
clean["price"] = clean["price"].astype(float)

clean.to_parquet("clean.parquet")

**ELT (modern approach)**

In [None]:
raw.to_parquet("raw.parquet")   # load first, untouched

# later: transform from the stored raw copy
raw = pd.read_parquet("raw.parquet")
clean = raw.dropna()

Why ELT wins:

- You don’t destroy raw data
- You can reprocess later
- Debugging is possible

#### Batch Processing (boring and powerful)

In [None]:
def daily_aggregation(df):
    return (
        df.groupby("date")
          .agg(total_sales=("revenue", "sum"))
          .reset_index()
    )

Batch jobs:

- predictable
- repeatable
- schedulable

Streaming is optional.
Batch is mandatory.

#### Incremental Data Loads (scaling 101)

The naive (bad) way

In [None]:
df = pd.read_csv("events.csv")  # reloads everything

The correct way

In [None]:
last_processed = "2025-01-01"

new_data = df[df["event_time"] > last_processed]

Store `last_processed` somewhere persistent.

If you don’t, you’ll reprocess forever.
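
A minimal sketch of a persistent watermark, assuming `pandas` is installed. Keeping the state in a small JSON file, and the names `load_new_rows` and `state.json`, are assumptions for this example; a database table would work the same way.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def load_new_rows(df, state_file):
    """Return rows newer than the saved watermark, then advance the watermark."""
    state_file = Path(state_file)
    last = None
    if state_file.exists():
        last = pd.Timestamp(json.loads(state_file.read_text())["last_processed"])

    times = pd.to_datetime(df["event_time"])
    new_data = df if last is None else df[times > last]

    if not new_data.empty:
        newest = pd.to_datetime(new_data["event_time"]).max()
        state_file.write_text(json.dumps({"last_processed": newest.isoformat()}))
    return new_data

events = pd.DataFrame({
    "event_time": ["2025-01-01", "2025-01-02", "2025-01-03"],
    "value": [1, 2, 3],
})

with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "state.json"
    first = load_new_rows(events.iloc[:2], state)   # first run: everything is new
    second = load_new_rows(events, state)           # later run: only the unseen row
    third = load_new_rows(events, state)            # rerun: nothing left to process
```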

#### Data Validation (models die quietly)

In [None]:
def validate(df):
    assert df["price"].notnull().all()
    assert (df["price"] >= 0).all()
    assert df["user_id"].dtype == "int64"

Run validation before training.

Crashing early is success.

#### Idempotency (critical, non-negotiable)

**Broken pipeline**

In [None]:
df.to_sql("features", conn, if_exists="append")

Run twice → duplicated data.

**Idempotent pipeline**

In [None]:
df.to_sql("features", conn, if_exists="replace")

or give the table a unique key, so a rerun overwrites rows instead of duplicating them.

Definition you must remember:

Same job + same input → same result

Anything else is a bug.
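
The key-based variant can be sketched with plain `sqlite3`. The `features` columns (`user_id`, `total_spend`) and the helper `write_features` are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE features (
        user_id INTEGER PRIMARY KEY,
        total_spend REAL
    )
""")

def write_features(conn, rows):
    # INSERT OR REPLACE overwrites the row that has the same primary key.
    conn.executemany(
        "INSERT OR REPLACE INTO features (user_id, total_spend) VALUES (?, ?)",
        rows,
    )
    conn.commit()

rows = [(1, 10.0), (2, 25.5)]
write_features(conn, rows)
write_features(conn, rows)   # rerun with the same input

row_count = conn.execute("SELECT COUNT(*) FROM features").fetchone()[0]
```

Unlike `if_exists="replace"`, this keeps rows for keys that are not in the current batch.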

#### Data Lakes (conceptual, but practical)

Directory structure matters. Keep raw data apart from everything derived from it.

Rules:

- Raw is immutable
- Never overwrite raw data
- Everything downstream is reproducible
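
A minimal sketch of one possible layout and of enforcing "never overwrite raw". The layer names (`raw`), the `date=` partition style, and both helper functions are assumptions for this example, not a standard.

```python
import tempfile
from pathlib import Path

def partition_path(root, layer, dataset, day):
    """Build a date-partitioned path such as <root>/raw/sales/date=2025-01-01."""
    return Path(root) / layer / dataset / f"date={day}"

def write_raw_once(path, text):
    """Write a raw file, refusing to overwrite an existing one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("x", encoding="utf-8") as f:   # mode "x" fails if the file exists
        f.write(text)

with tempfile.TemporaryDirectory() as tmp:
    raw_file = partition_path(tmp, "raw", "sales", "2025-01-01") / "part-0.csv"
    write_raw_once(raw_file, "product,revenue\nA,10\n")

    try:
        write_raw_once(raw_file, "overwritten")
        overwrote = True
    except FileExistsError:
        overwrote = False

    relative = raw_file.relative_to(tmp).as_posix()
```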

#### Orchestration (Airflow concept, simplified)

Orchestrators manage which task runs when, and what must finish before it.

With:

- retries
- scheduling
- logging
- monitoring

Airflow doesn’t move data.

It controls when and how things run.
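
To make the idea concrete, here is a toy runner in plain Python. It is not Airflow and uses none of its API; `run_pipeline` and the three task names are made up. It only shows dependency ordering, retries, and a run log, and it does not detect dependency cycles.

```python
def run_pipeline(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order, retrying each failed task."""
    done, log = set(), []

    def run(name):
        if name in done:
            return
        for upstream in dependencies.get(name, []):
            run(upstream)                         # upstream tasks go first
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                log.append((name, attempt, "ok"))
                break
            except Exception:
                log.append((name, attempt, "failed"))
                if attempt == max_retries + 1:
                    raise                         # retries exhausted
        done.add(name)

    for name in tasks:
        run(name)
    return log

attempts = {"transform": 0}

def extract():
    pass

def transform():
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise RuntimeError("transient failure")   # fails once, then succeeds

def load():
    pass

log = run_pipeline(
    {"load": load, "transform": transform, "extract": extract},
    {"transform": ["extract"], "load": ["transform"]},
)
```

Even though `load` is listed first, it runs last because its dependencies are resolved before it starts.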

---

## PHASE 12 — NumPy (`np`)  
*Now Python meets performance.*

### ndarray internals (this is the foundation)

A NumPy array is not a Python list.

In [None]:
import numpy as np

a = np.array([1, 2, 3, 4])
type(a)

Key facts:

- Fixed type
- Contiguous memory
- Homogeneous (all elements same type)

Compare with list:

In [None]:
l = [1, 2, 3, 4]

List = pointers to Python objects

ndarray = raw memory buffer

That’s why NumPy is fast.

**Shape, dtype, size**

In [None]:
print(a.shape)      # (4,)
print(a.dtype)      # int64 on most platforms
print(a.size)       # 4
print(a.ndim)       # 1

You must know these by instinct.

**Memory layout**

In [None]:
print(a.itemsize)   # bytes per element
print(a.nbytes)     # total bytes

This matters for:

- performance
- cache behavior
- large datasets

### Vectorization (the real speedup)

The slow way (Python loop)

In [None]:
new_list = []
for x in l:
    new_list.append(x * 2)
print(new_list)

This is slow because:

- Python loop
- Python integers
- Interpreter overhead

**The NumPy way**

In [None]:
a = np.array([1, 2, 3, 4])

new_array = a*2

print(new_array)

- No Python-level loop.
- Runs in compiled C.
- Often orders of magnitude faster on large arrays.

Rule:

If you see a Python loop over numbers, something is wrong.
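
A rough way to check the claim on your own machine. This sketch assumes `numpy` is installed; the array size and repeat count are arbitrary, and the timings will vary by machine, so no numbers are quoted here.

```python
import timeit

import numpy as np

values = list(range(100_000))
arr = np.arange(100_000)

def with_loop():
    return [x * 2 for x in values]

def with_numpy():
    return arr * 2

# Both approaches must agree before speed means anything.
same = with_loop() == with_numpy().tolist()

loop_time = timeit.timeit(with_loop, number=20)
numpy_time = timeit.timeit(with_numpy, number=20)
print(f"loop: {loop_time:.4f}s  numpy: {numpy_time:.4f}s")
```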

### Broadcasting (power + danger)

Broadcasting lets NumPy operate on arrays of different shapes.

In [None]:
a = np.array([1, 2, 3])
b = 10

a + b

Works because b is broadcast.

Broadcasting with arrays

In [None]:
x = np.array([[1, 2, 3],
              [4, 5, 6]])

y = np.array([10, 20, 30])

x + y

Rules (memorize):

- Align from right
- Dimensions must match or be 1

Broadcasting is powerful, and it can hide bugs.

Always check shapes.
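
A minimal sketch of a hidden broadcasting bug: subtracting a column vector from a flat vector raises no error and quietly produces a matrix. The arrays are made-up stand-ins for labels and model predictions.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])          # shape (3,)
y_pred = np.array([[1.0], [2.0], [3.0]])    # shape (3, 1), e.g. a column of predictions

diff = y_true - y_pred                      # broadcasts to (3, 3), no error raised
wrong_mse = (diff ** 2).mean()              # non-zero although predictions are perfect

fixed = y_true - y_pred.ravel()             # shapes now match: (3,)
right_mse = (fixed ** 2).mean()

print(diff.shape, fixed.shape)
```

An `assert y_true.shape == y_pred.shape` before the subtraction would have caught this immediately.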

### Indexing (where bugs are born)

#### Basic indexing

In [None]:
a = np.array([10, 20, 30, 40])

print(a[0])
print(a[-1])
print(a[1:3])

#### 2D indexing

In [None]:
m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m[0, 1])     # row 0, col 1
print(m[:, 1])     # all rows, column 1

This is not like nested lists.
Learn it properly.

#### Boolean indexing (very important)

In [None]:
a = np.array([5, 10, 15, 20])

a[a > 10]

Used everywhere:

- filtering
- cleaning
- feature selection

#### Fancy indexing (copies data)

In [None]:
a[[0, 2, 3]]

Important:

- slices = views
- fancy indexing = copies

This affects memory and performance.

#### Views vs Copies (critical concept)

In [None]:
a = np.array([1, 2, 3, 4])
b = a[1:3]

b[0] = 100
a

`a` changes because `b` is a view.

Force copy:

In [None]:
c = a[1:3].copy()

If you don’t understand this, you will introduce silent bugs.
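
When in doubt, ask NumPy. A short sketch using `np.shares_memory` and the `.base` attribute to tell views from copies:

```python
import numpy as np

a = np.arange(6)
view = a[1:4]              # basic slice
fancy = a[[1, 2, 3]]       # fancy indexing
explicit = a[1:4].copy()   # forced copy

shares = (
    np.shares_memory(a, view),
    np.shares_memory(a, fancy),
    np.shares_memory(a, explicit),
)
view_has_base = view.base is a   # a view remembers the array it came from

view[0] = 100                    # writes through to `a` only
```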

### Universal Functions (ufuncs)

These are vectorized functions.

In [None]:
print(np.sqrt(a))
print(np.log(a))
print(np.exp(a))

Binary ufuncs also work elementwise:

In [None]:
np.maximum(a, 10)

Faster than Python math + loops.

### Linear Algebra Basics (enough for ML)

#### Dot product

In [None]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(np.dot(a, b))

or

In [None]:
print(a @ b)

#### Matrix multiplication

In [None]:
A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

A @ B

This is not elementwise multiplication.

Elementwise:

In [None]:
A * B

Know the difference or you will break models.

#### Transpose

In [None]:
A.T

Shapes matter. Always.

### Random Sampling (data science backbone)

#### Random numbers

In [None]:
np.random.rand(3)

In [None]:
np.random.randn(3)

#### Reproducibility (mandatory)

In [None]:
np.random.seed(42)

If you don’t seed:

- experiments aren’t reproducible
- debugging becomes impossible

#### Sampling distributions

In [None]:
np.random.normal(loc=0, scale=1, size=25)

In [None]:
np.random.choice([0, 1], size=10)

Used in:

- train-test splits
- simulations
- initialization

### Performance Reality Check

**Bad NumPy:**

In [None]:
for i in range(len(a)):
    a[i] = a[i] * 2

**Good NumPy:**

In [None]:
a *= 2

**Best NumPy:**

- vectorized
- no Python loops
- minimal copies
- correct shapes

---

## PHASE 13 — Pandas (`pd`)  
*Real data is ugly.*

- Series vs DataFrame
- IO operations
- Indexing deep dive
- Missing data
- GroupBy mechanics
- Merge & join
- Apply vs vectorization
- Performance traps

---

## PHASE 14 — Visualization  
*Understanding > decoration.*

- Matplotlib (`pyplot`)
- Seaborn (`sns`)
- Choosing the right plot
- Interpreting plots honestly

---

## PHASE 15 — Statistics for ML  
*Enough to reason, not bluff.*

- Descriptive statistics
- Probability
- Distributions
- Correlation vs causation
- Hypothesis testing intuition

---

## PHASE 16 — Machine Learning (`sklearn`)  
*Structured thinking over algorithms.*

- ML workflow
- Preprocessing
- Core algorithms
- Pipelines
- Model evaluation
- Data leakage

---

## PHASE 17 — Deep Learning (TensorFlow)  
*Only after you understand ML.*

- Tensors
- Keras APIs
- Training loop intuition
- Overfitting control
- Model persistence

---

## PHASE 18 — NLP & LLM Systems (LangChain)  
*Modern AI stack, responsibly.*

- NLP basics
- Embeddings
- Vector search
- LangChain primitives
- RAG systems
- Failure modes

---

## PHASE 19 — Engineering & Reality  
*This is what actually pays.*

- Debugging
- Logging
- Testing
- Git
- Reading others’ code
- Writing code others can read
- Deployment basics

---