# Libraries

#### Polars
Polars is written in Rust, supports lazy evaluation and parallel execution, and treats your CPU like the expensive resource it is.

In [18]:
import polars as pl

df = pl.read_csv("logs.csv")
result = (
    df.filter(pl.col("status") == 500)
    .group_by("service")
    .len()  # group_by(...).count() is deprecated in recent Polars; len() names the column "len"
    .sort("len", descending=True)
)
print(result)


FileNotFoundError: No such file or directory (os error 2): logs.csv

#### DuckDB
DuckDB is a zero-setup analytical database that runs inside your process and queries CSV and Parquet files directly. No server. No setup.

You get:
- Vectorized execution
- Columnar storage
- SQL that actually flies

In [17]:
import duckdb

duckdb.sql("""
    SELECT country, COUNT(*) 
    FROM 'users.parquet'
    GROUP BY country
    ORDER BY COUNT(*) DESC
""").show()


IOException: IO Error: No files found that match the pattern "users.parquet"

#### Rich debugging
- Colored tracebacks.
- Progress bars.
- Tables.
- Syntax highlighting.

In [15]:
from rich.console import Console
from rich.table import Table

console = Console()
table = Table(title="Latency by Service")
table.add_column("Service")
table.add_column("ms")
table.add_row("auth", "120")
table.add_row("billing", "340")
console.print(table)


#### RapidFuzz
- Deduplication
- User input
- Messy data (which is all data)

In [None]:
from rapidfuzz import process

choices = ["database", "data science", "deep learning", "debugging"]
match = process.extractOne("databse", choices)
print(match)


#### Prefect
- Retries
- Logging
- Visualization
- Scheduling

In [None]:
from prefect import flow, task

@task
def fetch():
    return [1, 2, 3]

@task
def transform(data):
    return [x * 10 for x in data]

@flow
def pipeline():
    data = fetch()
    transform(data)

pipeline()


#### Logging for Debugging
- Log levels
- Structured output
- Files instead of terminal chaos

In [8]:
import logging

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s"
)
logging.info("Server started")
logging.warning("Cache miss")
logging.error("Database connection failed")


2026-01-17 12:17:26,441 | INFO | Server started
2026-01-17 12:17:26,444 | ERROR | Database connection failed
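The "files instead of terminal chaos" bullet can be sketched with a `FileHandler`. This is a minimal example under stated assumptions: the logger name `app` and the file name `app.log` are illustrative and do not come from the original notebook.

```python
import logging
import tempfile
from pathlib import Path

# Hypothetical log location: a temporary directory, so the working directory stays clean.
log_file = Path(tempfile.mkdtemp()) / "app.log"

# A named logger with its own FileHandler writes to the file instead of the terminal.
logger = logging.getLogger("app")  # "app" is an illustrative name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep these records out of the root logger's handlers

handler = logging.FileHandler(log_file, encoding="utf-8")
handler.setFormatter(logging.Formatter("%(asctime)s | %(levelname)s | %(message)s"))
logger.addHandler(handler)

logger.debug("Not written: DEBUG is below the INFO threshold")
logger.info("Server started")
logger.error("Database connection failed")

# Flush and release the file before reading it back.
handler.close()
logger.removeHandler(handler)

log_lines = log_file.read_text(encoding="utf-8").splitlines()
print(log_lines)
```

Only the INFO and ERROR records reach the file, because the logger's level filters out DEBUG.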


#### Argparse
argparse lets your script accept command-line arguments (like --file data.csv --verbose).
Without it, you'd need sys.argv parsing (messy). argparse handles it cleanly.

```bash
python app.py --file data.csv --verbose
```
- Automation becomes trivial
- Your scripts become reusable
- Other developers can use your code without reading it

In [None]:
import argparse

# Create argument parser (handles command-line args)
parser = argparse.ArgumentParser()

# Add --file argument (REQUIRED)
# Usage: python script.py --file data.csv
parser.add_argument("--file", required=True)

# Usage: python script.py --file data.csv --verbose
parser.add_argument("--verbose", action="store_true")

# Returns object with attributes: args.file, args.verbose
args = parser.parse_args()

print("Processing:", args.file)
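Inside a notebook there is no real command line, so `parse_args()` without arguments tends to fail on the kernel's own flags. One way around this, sketched below, is to pass the arguments as a list. The helper name `build_parser` and the help texts are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the same parser as above (helper name is hypothetical)."""
    parser = argparse.ArgumentParser(description="Process a data file.")
    parser.add_argument("--file", required=True, help="path to the input file")
    parser.add_argument("--verbose", action="store_true", help="print extra detail")
    return parser


# The list stands in for: python app.py --file data.csv --verbose
args = build_parser().parse_args(["--file", "data.csv", "--verbose"])
print(args.file, args.verbose)
```

The same pattern makes the parser easy to unit test, since no real process arguments are involved.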

#### Collections
- `Counter` – frequency counting
- `defaultdict` – dictionaries that don’t throw tantrums
- `deque` – fast queues
- `namedtuple` – lightweight records with named fields (`dataclasses.dataclass` is the alternative outside `collections`)

In [None]:
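from collections import Counter, defaultdict, namedtuple

# Sample data (hypothetical, so the cell runs on its own)
data = ["a", "b", "a", "c", "a"]

counts = Counter(data)
print(counts)

# better than
"""
counts = {}
for item in data:
    if item in counts:
        counts[item] += 1
    else:
        counts[item] = 1
"""

# Sample users (hypothetical) exposing a .country attribute
User = namedtuple("User", ["name", "country"])
users = [User("Alice", "DE"), User("Bob", "FR"), User("Cara", "DE")]

grouped = defaultdict(list)  # missing keys start as an empty list
for user in users:
    grouped[user.country].append(user)
print(dict(grouped))
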
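The list above also mentions `deque` and `namedtuple`, which the cell does not exercise. A short sketch of both follows; the values and the `Point` type are made up for illustration.

```python
from collections import deque, namedtuple

# deque: constant-time appends and pops at both ends
queue = deque(["a", "b", "c"])
queue.append("d")        # enqueue on the right
first = queue.popleft()  # dequeue from the left
print(first, list(queue))

# maxlen turns a deque into a sliding window of the most recent items
recent = deque(maxlen=3)
for n in range(5):
    recent.append(n)
print(list(recent))

# namedtuple: an immutable tuple whose fields have names
Point = namedtuple("Point", ["x", "y"])  # hypothetical record type
p = Point(x=1, y=2)
print(p.x + p.y)
```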

#### Pathlib
`pathlib.Path` treats filesystem paths as objects: build them with `/` and use methods like `.exists()`, `.glob()`, `.read_text()`.

In [None]:
from pathlib import Path

# __file__ is only defined when running a script; in a notebook, use Path.cwd() instead
base = Path(__file__).resolve().parent
file_path = base / "data" / "files" / "input.txt"
print(file_path.exists())
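The three methods named in this section can be tried without touching real files. The sketch below works in a temporary directory; the file names `notes.txt` and `data.csv` are placeholders chosen for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)

    # Create two small files to work with (names are illustrative).
    (base / "notes.txt").write_text("hello", encoding="utf-8")
    (base / "data.csv").write_text("a,b\n1,2\n", encoding="utf-8")

    exists = (base / "notes.txt").exists()    # file is present
    missing = (base / "nope.txt").exists()    # file was never created
    txt_files = sorted(p.name for p in base.glob("*.txt"))  # pattern match
    content = (base / "notes.txt").read_text(encoding="utf-8")

print(exists, missing, txt_files, content)
```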


### Type Annotations
Type hints are annotations that declare the expected types of variables, parameters, and return values. Python does not enforce them at runtime; they are optional documentation that tools such as mypy can check.

In [None]:
def greet(name: str) -> str:
    return f"Hello {name}"

# Basic
name: str = "Alice"
age: int = 25
price: float = 9.99
active: bool = True

# Collections
names: list[str] = ["Alice", "Bob"]
scores: dict[str, int] = {"Alice": 100}
coords: tuple[int, int] = (10, 20)
unique: set[str] = {"a", "b"}

# in a function, annotate each parameter and the return type
def add(a: int, b: int) -> int:
    return a + b

#### Typing

In [None]:
from typing import Optional

# Optional (can be None)
middle_name: Optional[str] = None  # same as str | None (Python 3.10+)

In [None]:
# Union lets a variable or function argument be one of several types.
# This function is fragile (value + operation fails for a str), and the explicit types make that obvious.
from typing import Union

def ugly_function(value: int, operation: Union[str, int, float, bool]) -> int:
    return value + operation

In [None]:
# Any (escape hatch - any type allowed)
from typing import Any
data: Any = "whatever"  # defeats type checking

In [None]:
#  Callable[[int], str] signifies a function that takes a single parameter of type int and returns a str.
from collections.abc import Callable

def feeder(get_next_item: Callable[[int], str]) -> None:
    return None
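To see what the `Callable[[int], str]` annotation buys you, here is a small sketch where the callback is actually invoked. The function names `describe` and `collect` are hypothetical and not part of the original cell.

```python
from collections.abc import Callable


def describe(n: int) -> str:
    """Matches Callable[[int], str]: one int in, one str out."""
    return f"item-{n}"


def collect(get_next_item: Callable[[int], str], count: int) -> list[str]:
    """Call the supplied function for 0..count-1 and gather the results."""
    return [get_next_item(i) for i in range(count)]


items = collect(describe, 3)
print(items)
```

A type checker such as mypy would flag a call like `collect(len, 3)`, because `len` does not take an `int` and return a `str`.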

#### Pydantic
- Type safety
- Validation
- Serialization
- Clear errors

In [None]:
from pydantic import BaseModel, Field

class User(BaseModel):
    id: int
    email: str
    age: int = Field(gt=0)

user = User(id="42", email="a@b.com", age=21)  # the string "42" is coerced to the int 42

#### Sorting with Key

In [None]:
people = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 20}]
sorted_name = sorted(people, key=lambda x: x["name"])
sorted_age = sorted(people, key=lambda x: x["age"])
print(sorted_name)
print(sorted_age)

[{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 20}]
[{'name': 'Bob', 'age': 20}, {'name': 'Alice', 'age': 25}]
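Two common variations on `key` are sketched below: `operator.itemgetter` in place of a lambda, and a tuple key for sorting on several fields. The third record ("Cara") is added here only to create a tie on age.

```python
from operator import itemgetter

people = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 20},
    {"name": "Cara", "age": 25},  # hypothetical extra record to create a tie
]

# itemgetter("age") behaves like lambda x: x["age"]
by_age_desc = sorted(people, key=itemgetter("age"), reverse=True)

# A tuple key sorts by age first, then by name among equal ages
by_age_then_name = sorted(people, key=lambda p: (p["age"], p["name"]))

print([p["name"] for p in by_age_desc])
print([p["name"] for p in by_age_then_name])
```

Python's sort is stable, so records with equal keys keep their original relative order, even with `reverse=True`.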


#### Enumerate

In [None]:
for index, value in enumerate(["a", "b", "c"]):
    print(index, value)

0 a
1 b
2 c


#### Exception Handling

In [22]:
def get_ratio(x: int, y: int) -> float:
    try:
        ratio = x / y
    except ZeroDivisionError:
        print("cannot divide by zero")
        y = y + 1
        ratio = x / y
    return ratio


print(get_ratio(x=400, y=0))

cannot divide by zero
400.0


#### Ternary Operator

In [None]:
# a conditional expression picks one of two values in a single line
for i in range(5):
    even_or_odd = "Even" if i % 2 == 0 else "Odd"
    print(even_or_odd)
  
  
n = 5  
# nested if-else statements
res = "Positive" if n > 0 else "Negative" if n < 0 else "Zero"
print(res)

# (condition_is_false, condition_is_true)[condition]
res = ("Odd", "Even")[n % 2 == 0]
print(res)


# condition_dict = {True: value_if_true, False: value_if_false}
# Key is True or False based on the condition a > b. 
# The corresponding value (a or b) is then selected.
a = 30
b = 40
m1 = {True: a, False: b}[a > b]
print(m1)

Even
Odd
Even
Odd
Even
Positive
Odd
40


#### Print parameters

In [29]:
a = ["english", "french", "spanish", "german", "twi"]
    
# String appended after the last value
print(*a, end=" ...")
print("\n")

# String inserted between the values
print(*a, sep=", ")

print("\n")

for language in a:
    print(language, end=" ")
    
print("\n")

for language in a:
    print(language, end=" | ")

english french spanish german twi ...

english, french, spanish, german, twi


english french spanish german twi 

english | french | spanish | german | twi | 

#### Slicing 

In [None]:
# reverse a String
a = "Hello World!"
print(a[::-1])


# reverse a List
Lst = [60, 70, 30, 20, 90, 10, 50]
print(Lst[::-1])

!dlroW olleH
[50, 10, 90, 20, 30, 70, 60]
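Reversal with `[::-1]` is one case of the general `start:stop:step` form. A few more cases, on a made-up list, might look like this:

```python
letters = ["a", "b", "c", "d", "e", "f"]

first_three = letters[:3]   # from the start up to, not including, index 3
last_two = letters[-2:]     # negative indices count from the end
middle = letters[1:4]       # indices 1, 2, 3
every_other = letters[::2]  # step of 2

print(first_three, last_two, middle, every_other)
```

Slices always return a new list (or string), so the original sequence is left unchanged.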


#### Lambda function

In [None]:
# anonymous function, called immediately
print((lambda x, y: x**y)(5, 3))

# can be assigned to a variable
square = lambda x: x**2
print(square(5))


# best used inline, as an argument to another function
tuples = [(1, "d"), (2, "b"), (4, "a"), (3, "c")]
print(sorted(tuples, key=lambda x: x[1]))  # sort by second value in the tuple


125
25
[(4, 'a'), (2, 'b'), (3, 'c'), (1, 'd')]

### Timing

#### Time performance counter

In [10]:
import time

# monotonic, high-resolution clock; better suited to timing code than time.time
start = time.perf_counter()
time.sleep(1)
end = time.perf_counter()
print(end - start)


1.0002453459983371
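If you time several blocks, the start/stop pattern can be wrapped in a small context manager. This is only a sketch; the name `timer` and the label text are assumptions, not something the notebook defines.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label: str):
    """Print how long the enclosed block took (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so the timing is always reported.
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.4f} s")


with timer("sum of squares"):
    total = sum(i * i for i in range(100_000))
```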


#### PyInstrument
PyInstrument is a Python profiler that helps you identify where most of the execution time is spent, allowing you to focus on improving those areas.

In [None]:
from pyinstrument import Profiler

profiler = Profiler()
profiler.start()
# code you want to measure
profiler.stop()
print(profiler.output_text(unicode=True, color=True))


#### The Walrus Operator (:=) – Assign and Use in One Step

Introduced in **Python 3.8**, the **walrus operator** (`:=`) allows assignment inside expressions, making the code more concise.

The `~` operator performs a bitwise inversion of an integer, flipping every bit. The result can be explained by the formula `~x == -x - 1`.

In [3]:
# Without the walrus operator
data = input("Enter a number: ")
if int(data) > 10:
    print(f"Number {data} is greater than 10")

# With the walrus operator
if (num := int(input("Enter a number: "))) > 10:
    print(f"Number {num} is greater than 10")
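The walrus operator also helps when a value is computed, tested, and then reused, for example with `re.search`. The log-like strings below are invented for this sketch.

```python
import re

lines = ["status=200", "no status here", "status=503"]  # made-up input

codes = []
for line in lines:
    # Assign the match object and test it in the same expression
    if (m := re.search(r"status=(\d+)", line)) is not None:
        codes.append(int(m.group(1)))

print(codes)
```

Without `:=`, the match would need its own assignment line before the `if`.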


### Dictionaries

#### Merging dictionaries


In [None]:
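# initialize two dictionaries
a = {"a": 1, "b": 2}
b = {"c": 3, "d": 4}

# merge into a new dictionary (Python 3.9+); a and b stay unchanged
merged = a | b
print(merged)

# merge in place: a now also holds b's items
a.update(b)
print(a)


{'a': 1, 'b': 2, 'c': 3, 'd': 4}
{'a': 1, 'b': 2, 'c': 3, 'd': 4}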


#### Dictionary.get() with Default Values

Instead of checking whether a key exists in a dictionary, ***.get()*** returns the value or a fallback default when the key is missing, which avoids a KeyError.

In [None]:
person = {"name": "Alice", "age": 25}
print(person.get("city", "Unknown"))  # Key 'city' doesn't exist
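For comparison, the sketch below puts the three behaviours side by side: indexing with `[]`, `.get()` with a default, and `.get()` without one. The variable names are chosen for illustration.

```python
person = {"name": "Alice", "age": 25}

# Indexing a missing key raises KeyError, so it needs a try/except
try:
    city_indexed = person["city"]
except KeyError:
    city_indexed = "Unknown"

city_default = person.get("city", "Unknown")  # fallback supplied
city_none = person.get("city")                # no fallback: returns None
name = person.get("name", "Unknown")          # key exists: default is ignored

print(city_indexed, city_default, city_none, name)
```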


#### Unpacking with \* in Function Calls and Loops
Python allows **iterable unpacking** with `*`, making function calls and loops cleaner.


In [7]:
def greet(name, age):
    print(f"Hello, {name}. You are {age} years old.")


data = ("John", 30)
greet(*data)  # Unpacking tuple into function arguments


Hello, John. You are 30 years old.
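The heading also mentions loops, which the cell above does not show. A sketch of starred unpacking in assignments and loops, plus `**` for keyword arguments, could look like this. The records and the function name `describe` are made up for the example.

```python
# Starred assignment: first item, then everything else as a list
first, *rest = [1, 2, 3, 4]

# Starred unpacking inside a for loop
records = [("Alice", 85, "A"), ("Bob", 90, "A+")]  # hypothetical data
summaries = []
for name, *details in records:
    summaries.append((name, details))


def describe(name, age):
    return f"Hello, {name}. You are {age} years old."


# ** unpacks a dict into keyword arguments
message = describe(**{"name": "John", "age": 30})

print(first, rest, summaries, message)
```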


#### Merging Dictionaries with | Operator (Python 3.9+)

Python 3.9 introduced the ***|*** operator for merging dictionaries. When both dictionaries contain the same key, the value from the right-hand operand wins.

In [9]:
dict1 = {"a": 1, "b": 2}
dict2 = {"b": 3, "c": 4}

merged = dict1 | dict2  # Merges two dictionaries
print(merged)


{'a': 1, 'b': 3, 'c': 4}


#### Using zip() to Iterate Over Multiple Iterables

zip() simplifies iterating over multiple lists at once.

In [18]:
names = ["Alice", "Bob", "Charlie"]
scores = [85, 90, 78]


for name, score in zip(names, scores):
    print(f"{name} scored {score}")

Alice scored 85
Bob scored 90
Charlie scored 78
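One thing to watch: `zip()` stops at the shortest input, so unequal lists lose items silently. The sketch below shows that behaviour and `itertools.zip_longest` as one possible alternative. The shortened `scores` list is an assumption made to trigger the mismatch.

```python
from itertools import zip_longest

names = ["Alice", "Bob", "Charlie"]
scores = [85, 90]  # deliberately one item short

# zip() truncates to the shortest iterable
truncated = list(zip(names, scores))

# zip_longest() pads the shorter iterable with fillvalue
padded = list(zip_longest(names, scores, fillvalue=None))

print(truncated)
print(padded)
```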


#### Quick File Reading with Path().read\_text()

Instead of ***open()***, use ***pathlib.Path*** for a cleaner file-reading approach.

In [21]:
from pathlib import Path

content = Path("README.md").read_text()
print(content)


# notebooks

Notebooks with essential data science knowledge:
If your a practical learner here is my gathered knowledge about
statistics, coding in python, machine learning etc. in executable examples.

Data sets used in the notebooks are provided here:
https://drive.google.com/drive/folders/1UzgxrOvtdJwKui7gbKhzohp5e_WQihSP?usp=sharing



#### Swapping Variables Without a Temp Variable

Python allows swapping variables in a single line.

In [23]:
a, b = 5, 10
a, b = b, a  # Swaps values without a temporary variable
print(a, b)


10 5


#### Using Counter for Quick Frequency Counting

collections.Counter is an efficient way to count occurrences in a list.

In [25]:
from collections import Counter

words = ["apple", "banana", "apple", "orange", "banana", "apple"]
freq = Counter(words)
print(freq)


Counter({'apple': 3, 'banana': 2, 'orange': 1})
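Beyond printing, a `Counter` offers a few helpers that are handy in practice. A brief sketch follows; the lookup of "kiwi" and the extra "orange" items are illustrative.

```python
from collections import Counter

words = ["apple", "banana", "apple", "orange", "banana", "apple"]
freq = Counter(words)

top_two = freq.most_common(2)  # the two most frequent items, highest first
missing = freq["kiwi"]         # unseen keys count as 0 instead of raising KeyError

freq.update(["orange", "orange"])  # add more observations in place
orange_count = freq["orange"]

print(top_two, missing, orange_count)
```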


#### List Comprehensions with if-else Conditions

List comprehensions allow ***if-else*** conditions inside them.

In [27]:
numbers = [1, 2, 3, 4, 5]
result = ["Even" if num % 2 == 0 else "Odd" for num in numbers]
print(result)


['Odd', 'Even', 'Odd', 'Even', 'Odd']
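The position of the `if` matters. Placed before `for` with an `else`, it maps every item; placed after `for`, it filters items out. A small sketch of the difference:

```python
numbers = [1, 2, 3, 4, 5]

# if-else before "for": every item is kept and transformed
labels = ["Even" if n % 2 == 0 else "Odd" for n in numbers]

# if after "for": items that fail the test are dropped
evens = [n for n in numbers if n % 2 == 0]

print(labels)
print(evens)
```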


#### any() and all() for Quick Checks

Instead of writing loops for conditions, ***any()*** and ***all()*** make checking multiple conditions easier.

In [29]:
numbers = [3, 5, 7, 9]
print(any(num % 2 == 0 for num in numbers))  # Checks if at least one even number exists
print(all(num > 0 for num in numbers))  # Checks if all numbers are positive


False
True


#### enumerate() for Index-Based Looping

Instead of manually maintaining an index, ***enumerate()*** provides an automatic counter.

In [31]:
colors = ["red", "blue", "green"]
for index, color in enumerate(colors, start=1):
    print(f"{index}: {color}")


1: red
2: blue
3: green
