# Python for DE - Basics

### Index of "Python for DE - Basics"

#### 1. Miscellaneous Topics That I think Confuse me All the time
- **Module vs Library vs Package**

#### 2. Functional Programming

#### 3. Data Structures
- **Lists**: Properties, usage, examples.
- **Tuples**: Immutable sequences.
- **Sets**: Unique unordered collections.
- **Dictionaries**: Key-value pairs.
- **Strings**: Behaviors and operations.

#### 4. Classes and Objects
- Concepts: classes, objects, constructors, and methods.
- Examples with Python code snippets.

#### 5. Basic Python Questions
- Questions on variables, data structures, loops, and functions.
- Example tasks with explanations.

#### 6. Writing Memory-Efficient Data Pipelines
- **Using Generators**: Lazy evaluation, creating generators.
- **ETL Pipeline Example**: Implementing generators for an ETL task.
- **Batch Processing**: Fetching data in batches.
- Pros and cons of generators.
- **Distributed Frameworks**: Overview of Apache Spark, Flink, and Dask.

#### 7. Dataclasses
- Overview and usage of the `@dataclass` decorator.
- Examples of mutable and immutable dataclasses.

#### 8. Type Hints
- Introduction to type annotations in Python.
- Use with static type checkers (`mypy`).
- Examples of higher-order functions.

#### 9. Itertools
- Functions for iterators: `count`, `cycle`, `islice`, etc.
- Examples: combining, filtering, grouping, and aggregating data.

#### 10. Regex
- Common patterns and usage scenarios.
- Examples: extracting emails, matching dates, and splitting text.

#### 11. Shallow Copy vs Deep Copy
- Differences between shallow and deep copies.
- Examples illustrating the difference.

## Miscellaneous Topics That I think Confuse me All the time

### 1. Module Vs Library Vs Package
- **Module**
    - A single Python file (`.py` extension) that can contain functions, classes, variables, etc.
    - Used to organise code logically
- **Library**
    - A collection of modules (or even a single module) that provides specific functionality
    - Designed to be reusable
- **Package**
    - Basically a folder with multiple `.py` files (modules).
    - The presence of an `__init__.py` file "tells Python" that this folder should be treated as a package (something you can import). Since Python 3.3 a folder without it can still be imported as a namespace package, but regular packages include the file.
    - When your project becomes too large or complex, a single module (one `.py` file) may no longer be sufficient to organize your code. In such cases, you group related modules into a package for better organization and structure.
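
A minimal sketch of the module/package distinction. The names `demo_pkg`, `cleaning`, and `strip_all` are made up for illustration, and the package is written to a temporary folder only so that the snippet is self-contained.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Package = a folder containing __init__.py plus one or more modules
    pkg_dir = Path(tmp) / "demo_pkg"            # hypothetical package name
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("")    # marks the folder as a package
    (pkg_dir / "cleaning.py").write_text(       # module = a single .py file
        "def strip_all(values):\n"
        "    return [v.strip() for v in values]\n"
    )

    sys.path.insert(0, tmp)                     # make the temporary folder importable
    try:
        cleaning = importlib.import_module("demo_pkg.cleaning")
        cleaned = cleaning.strip_all(["  a ", "b  "])
    finally:
        sys.path.remove(tmp)

print(cleaned)  # ['a', 'b']
```

In a real project the folder would live next to your code and you would simply write `from demo_pkg import cleaning`.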
 

### 2. Functional Programming
- It's a programming paradigm (basically a way of thinking about how to structure your code)
- It treats computation as the evaluation of mathematical functions
- It focuses on writing pure functions (functions whose output depends only on their input arguments and that have no side effects) and emphasizes immutability, declarative code, and higher-order functions.
- Common examples of FP style in Python are list comprehensions, lambda functions, and built-ins such as `map` and `filter`
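
A small sketch contrasting an impure function with a pure one, followed by the FP-style tools mentioned above. The function and variable names are illustrative only.

```python
from functools import reduce

audit_log = []

def add_and_log(x, y):
    # Impure: besides returning a value it changes state outside the function
    audit_log.append((x, y))
    return x + y

def add(x, y):
    # Pure: the result depends only on the arguments and nothing is mutated
    return x + y

numbers = [1, 2, 3, 4, 5]

squares_of_evens = [n * n for n in numbers if n % 2 == 0]  # list comprehension
doubled = list(map(lambda n: n * 2, numbers))              # higher-order function + lambda
total = reduce(add, numbers, 0)                            # fold the list with a pure function

print(squares_of_evens)  # [4, 16]
print(doubled)           # [2, 4, 6, 8, 10]
print(total)             # 15
print(numbers)           # [1, 2, 3, 4, 5] (the input list is unchanged)
```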

## 1. Data Structures
### 1. Lists
- Ordered collection of items
- It is mutable (can be edited after creating)
- Items can be of different data types
- Lists need additional memory to handle dynamic resizing and associated methods. They also use pointers to store the memory addresses of their elements, leading to more memory usage.

In [1]:
fruits = ["apple", "banana", "cherry"]

# Adding Items
fruits.append("orange")
print(fruits)  # Output: ['apple', 'banana', 'cherry', 'orange']

# Removing Items
fruits.remove("banana")
print(fruits)  # Output: ['apple', 'cherry', 'orange']

# Slicing
print(fruits[1:])  # Output: ['cherry', 'orange']

['apple', 'banana', 'cherry', 'orange']
['apple', 'cherry', 'orange']
['cherry', 'orange']


### 2. Tuple
- Ordered collection of items
- Immutable (cannot be edited)
- Slightly faster to create and smaller than lists, because their size is fixed once created
- Stored as a fixed-size block of references, without the spare capacity a list keeps for resizing.
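
A short sketch of these tuple properties. The size comparison at the end assumes CPython, and the exact byte counts vary by version, so no output is shown for it.

```python
import sys

point = (3, 4)
coords = [3, 4]

try:
    point[0] = 10            # tuples do not support item assignment
except TypeError as exc:
    print(f"TypeError: {exc}")

coords[0] = 10               # lists can be changed in place
print(coords)                # [10, 4]

# Tuples are hashable (when their items are), so they can be dictionary keys
distances = {(0, 0): 0.0, (3, 4): 5.0}
print(distances[point])      # 5.0

# Size in bytes of the container itself (tuple first, then list)
print(sys.getsizeof((1, 2, 3)), sys.getsizeof([1, 2, 3]))
```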

### 3. Sets
- Unordered collection of unique items
- Mutable
- Supports mathematical set operations
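
The set operations can be sketched as follows. The sample values are invented, and `sorted` is used only to make the printed order predictable, since sets are unordered.

```python
batch_a = {"user_1", "user_2", "user_3"}
batch_b = {"user_3", "user_4"}

print(sorted(batch_a & batch_b))  # intersection: ['user_3']
print(sorted(batch_a | batch_b))  # union: ['user_1', 'user_2', 'user_3', 'user_4']
print(sorted(batch_a - batch_b))  # difference: ['user_1', 'user_2']
print(sorted(batch_a ^ batch_b))  # symmetric difference: ['user_1', 'user_2', 'user_4']

# Converting a list to a set removes duplicates
deduped = set([3, 1, 3, 2, 1])
print(sorted(deduped))            # [1, 2, 3]
print(2 in deduped)               # True (fast membership test)
```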

### 4. Dictionaries
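- Collection of key-value pairs; insertion order is preserved since Python 3.7, but items are looked up by key, not by position
- Mutable
- Keys must be unique and hashable (in practice, of an immutable type such as `str`, `int`, or `tuple`)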


### 5. Strings
- Although strings are not traditionally considered data structures, they behave like immutable sequences of characters.
- Immutable
- Strings can be sliced and iterated over like lists.
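
A quick sketch of these string behaviors, using an arbitrary sample string:

```python
s = "data engineering"

print(s[:4])      # data
print(s[-11:])    # engineering

# Iterating over a string yields one character at a time
vowel_count = sum(1 for ch in s if ch in "aeiou")
print(vowel_count)  # 7

try:
    s[0] = "D"    # strings do not support item assignment
except TypeError as exc:
    print(f"TypeError: {exc}")

# String methods return a new string and leave the original untouched
title = s.title()
print(title)      # Data Engineering
print(s)          # data engineering
```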

## 2. Classes and Objects
### Classes
- It's basically a blueprint for creating objects
- It defines the attributes (properties) and methods (behaviours/functions)

```python
class ClassName:
    # Constructor method - used to initialise an object
    def __init__(self, attribute1, attribute2):
        self.attribute1 = attribute1
        self.attribute2 = attribute2

    def foo(self):
        # Placeholder method body
        pass
```
#### Constructor method
- `__init__`: This is the **constructor** method, which is called automatically when an object is instantiated (created).
- It's used to initialize the object's attributes.
- `self`: It refers to the current instance of the class (the object). Every instance method must take it as its first parameter (the name `self` is a convention).
- `self.attribute1 = attribute1 # why??` => taking the value passed to the constructor and assigning to the instance variable `self.attribute1`. This ensures all the methods within the class have access to the attributes

### Objects
- Simply the instances of a class, created from its blueprint

```python
# Creating an instance of the class (the arguments are passed to __init__)
my_object = ClassName("value1", "value2")
```

In [4]:
# Creating a Dictionary
student = {"name": "Joyan", "age": 25, "course": "CS"}

# Accessing Values with keys
print(student['name'])

# Adding or Updating a Key-Value Pair
student["grade"] = "A"
print(student)  # Output: {'name': 'Joyan', 'age': 25, 'course': 'CS', 'grade': 'A'}

# Removing a Key-Value Pair
del student["age"]
print(student)  # Output: {'name': 'Joyan', 'course': 'CS', 'grade': 'A'}

# Iterating Through a Dictionary
for key, value in student.items():
    print(f"{key}: {value}")

Joyan
{'name': 'Joyan', 'age': 25, 'course': 'CS', 'grade': 'A'}
{'name': 'Joyan', 'course': 'CS', 'grade': 'A'}
name: Joyan
course: CS
grade: A


## Basic Questions

In [8]:
# Variable: A storage location identified by its name, containing some value.
# Question: Assign a value of 10 to variable a and 20 to variable b
# Question: Store the result of a + b in a variable c and print it. What is the result of a + b?
a, b = 10, 20
c = a + b
print(f"Question 1 ans: {c}")

s = '  Some string '
# Question: How do you remove the empty spaces in front of and behind the string s?
print(s.strip())

# Data Structures are ways of representing data, each has its own pros and cons and places that they are the right fit.
## List: A collection of elements that can be accessed by knowing the location (aka index) of the element
l = [1, 2, 3, 4]

# Question: How do you access the elements in index 0 and 3? Print the results.
## NOTE: lists keep their elements in order and are accessed by position; dictionaries (Python 3.7+) keep insertion order but are accessed by key
print(f"Element at index 0: {l[0]}\nElement at index 3: {l[3]}")

## Dictionary: A collection of key-value pairs, where each key is mapped to a value using a hash function. Provides fast data retrieval based on keys.
d = {'a': 1, 'b': 2}

# Question: How do you access the values associated with keys 'a' and 'b'?
## NOTE: The dictionary cannot have duplicate keys
print(f"{d['a']}, {d['b']}")

## Set: A collection of unique elements that do not allow duplicates
my_set = set()
my_set.add(10)
my_set.add(10)
my_set.add(10)

# Question: What will be the output of my_set? => {10}

## Tuple: A collection of immutable (non-changeable) elements, tuples retain their order once created.
my_tuple = (1, 'hello', 3.14)

# Accessing elements by index
# Question: How do you access the elements in index 0 and 1 of my_tuple?
print(f"Element at index 0: {my_tuple[0]}\nElement at index 1: {my_tuple[1]}")

# Counting occurrences of an element
count_tuple = (1, 2, 3, 1, 1, 2)
# Question: How many times does the number 1 appear in count_tuple?
count_tuple.count(1) # the count method returns the number of occurrences (here 3)

# Finding the index of an element
# Question: What is the index of the first occurrence of the number 2 in count_tuple?
count_tuple.index(2) # returns the index of the first occurrence of the element (here 1)

# Loop allows a specific chunk of code to be repeated a certain number of times
# We can loop through our data structures as shown below
# Question: How do you loop through a list and print its elements?
fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
    print(fruit)

# Dictionary loop
# Question: How do you loop through a dictionary and print its keys and values?
students = {"name": "Joyan", "age": 25, "course": "CS"}

for key, value in students.items():
    print(f"Key: {key} - Value: {value}")

# Comprehension is a shorthand way of writing a loop
# Question: Multiply every element in list l with 2 and print the result
l = [elm*2 for elm in l]
print(f"Using list comprehension: {l}")

# Functions: A block of code that can be re-used as needed. This allows for us to have logic defined in one place, making it easy to maintain and use.
## For example, let's create a simple function that takes a list as an input and returns another list whose values are greater than 3

def gt_three(input_list):
    return [elt for elt in input_list if elt > 3]
## NOTE: we use list comprehension with filtering in the above function

list_1 = [1, 2, 3, 4, 5, 6]
# Question: How do you use the gt_three function to filter elements greater than 3 from list_1?
print(f"gt_three function to filter elements greater than 3 from list_1: {gt_three(list_1)}")

list_2 = [1, 2, 3, 1, 1, 1]
# Question: What will be the output of gt_three(list_2)? => [] an empty list
print(f"output of gt_three(list_2): {gt_three(list_2)}")

# Classes and Objects
# Think of a class as a blueprint and objects as things created based on that blueprint
# You can define classes in Python as shown below
class DataExtractor:

    def __init__(self, some_value):
        self.some_value = some_value

    def get_connection(self):
        # Some logic
        # some_value is accessible using self.some_value
        return self.some_value

    def close_connection(self):
        # Some logic
        # some_value is accessible using self.some_value
        pass

# Question: How do you create a DataExtractor object and print its some_value attribute?
data_extractor_object = DataExtractor("some_value_as_attribute")
print(f"DataExtractor object and print its some_value attribute: {data_extractor_object.get_connection()}")


# Libraries are code that can be reused.
# Python comes with some standard libraries to do common operations, 
# such as the datetime library to work with time (although there are better libraries)
from datetime import datetime  # You can import library or your code from another file with the import statement

# Question: How do you print the current date in the format 'YYYY MM DD'? Hint: Google strftime
datetime.now().strftime("%Y %m %d")  # strftime uses % codes: %Y = 4-digit year, %m = month, %d = day

# Exception handling: When an error occurs, we need our code to gracefully handle it without just stopping. 
# Here is how we can handle errors when the program is running
try:
    # Code that might raise an exception
    pass
except Exception as e: 
    # Code that runs if the exception occurs
    pass
else:
    # Code that runs if no exception occurs
    pass
finally:
    # Code that always runs, regardless of exceptions
    pass

# For example, let's consider exception handling on accessing an element that is not present in a list l
l = [1, 2, 3, 4, 5]

# Question: How do you handle an IndexError when accessing an invalid index in a list?
# NOTE: in the except block it's preferred to specify the exact error/exception that you want to handle
try:
    out_of_range_index = len(l)+1
    l[out_of_range_index] # accessing element out of range
except IndexError:
    print("IndexError")

Question 1 ans: 30
Some string
Element at index 0: 1
Element at index 3: 4
1, 2
Element at index 0: 1
Element at index 1: hello
apple
banana
cherry
Key: name - Value: Joyan
Key: age - Value: 25
Key: course - Value: CS
Using list comprehension: [2, 4, 6, 8]
gt_three function to filter elements greater than 3 from list_1: [4, 5, 6]
output of gt_three(list_2): []
DataExtractor object and print its some_value attribute: some_value_as_attribute
IndexError


## 3. Writing memory efficient data pipelines in Python
https://www.startdataengineering.com/post/writing-memory-efficient-dps-in-python/
### 1. Using Generators
- They can be thought of as **resumable functions**
- How a regular function works
    1. You call the function
    2. A new private namespace is created, holding its local variables
    3. A value is returned and the local variables are destroyed
- The main idea behind generators is to not throw away those local variables but rather resume processing from that point
- When you call a generator function it returns a generator object but does not execute the function body immediately -> the body runs only when you iterate over the generator. **Sounds a lot like lazy evaluation, doesn't it?**
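
The pause-and-resume behavior can be sketched with a small, made-up generator function (`countdown` is an illustrative name):

```python
def countdown(n):
    print("generator body started")  # runs on the first next(), not at call time
    while n > 0:
        yield n   # pause here and hand n to the caller
        n -= 1    # execution resumes here on the following next()

gen = countdown(3)          # nothing is printed yet: only a generator object is created
print(type(gen).__name__)   # generator

print(next(gen))  # prints "generator body started", then 3
print(next(gen))  # 2 (the local variable n was kept between calls)
print(list(gen))  # [1] (whatever is left)
print(list(gen))  # [] (an exhausted generator yields nothing more)
```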

#### How to Create Generators
1. Using **yield** keyword

```python
def my_generator():
    for i in range(5):
        yield i

# Example usage
for value in my_generator():
    print(value)  # prints 0 to 4, one value per line
```

2. Using the generator expression

```python
squares = (x**2 for x in range(1,5))
print(next(squares)) # output -> 1
print(next(squares)) # output -> 4
print(next(squares)) # output -> 9
```

#### Where can Generators come in handy for data Pipeline?
- A data pipeline processes data in stages, where the output of one stage becomes the input for the next.
- Generators are ideal for building these because they allow you to pass data between stages without loading all the data into memory.
- Lazy Evaluation: Generators compute values on demand, making them suitable for handling large datasets without consuming much memory.

In [11]:
# Writing an ETL pipeline with Generators
# TASK
'''
Let’s say our objective is to process the parking violation 2018 dataset with the following steps:

1. Keep only the violations issued by police (denoted by P in the data), to vehicles with the make FORD in NJ.
2. Replace P with police.
3. Concat house number, street name, and registration state fields into a single address field.
4. Write the results into a CSV file with the header vehicle_make,issuing_agency,address.

'''

import csv
input_file_name = "parking-violations-issued-fiscal-year-2018.csv"
output_file_name = "nj_ford_transportation_issued_pv_2018.csv"

# EXTRACT
# 1. Stream the data from input file
read_file_object = open(input_file_name, "r")
extractor = csv.reader(read_file_object) # csv.reader returns a lazy iterator - it does not load the complete file into memory, it yields one row at a time as you iterate.
#next(extractor) # prints the headers
#next(extractor) # prints the first row


# 2. Keep only required fields
# 2 => registration state, 7 => vehicle make, 8 => issuing agency, 23 => house number, 24 => street name
col_filtered_stream = ([row[2], row[7], row[8], row[23], row[24]] for row in extractor)
# every time the next item is requested, one row is pulled from the extractor
# Note we are using the () expression to create a generator

# TRANSFORM
# 3. keep only violations issued by police (P), to vehicles with the make FORD in NJ
value_filtered_stream = filter(lambda x: all([x[0]=='NJ', x[1]=='FORD', x[2]=='P']), 
                               col_filtered_stream
                              )

# 4. replace P with police
transformed_stream = (
                        [stream[0], stream[1], "police", stream[3], stream[4]]
                        for stream in value_filtered_stream
                        )

# 5. concat house number, street name, registration state into a single address field
final_stream = (
                    [stream[1], stream[2], ", ".join([stream[3], stream[4], stream[0]])]
                    for stream in transformed_stream
)

final_stream  # this is a generator object and has not yet started generating data

# LOAD
# 6. write a header row for output data
write_file_object = open(output_file_name, "w")
loader = csv.writer(
    write_file_object, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL
)
header = ["vehicle_make", "issuing_agency", "address"]
loader.writerow(header)

# 7. stream data into an output file
loader.writerows(final_stream)  # loader asks for data from final_stream
# Data is pulled from the generator chain (final_stream → transformed_stream → value_filtered_stream → col_filtered_stream → extractor).

# Close both files so the output is flushed to disk
read_file_object.close()
write_file_object.close()

#### Reading in Batches with Generators
- When pulling data from a large database you need to make a tradeoff between
    - memory: pulling complete dataset might lead to out of memory issues
    - cost: loading one row at a time can incur expensive network calls
- A good tradeoff would be to fetch data in batches. The size of a batch will depend on the memory available and speed requirements of your data pipeline.
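
Independent of any database, the batching idea itself can be sketched with `itertools.islice`. The helper name `in_batches` and the sample rows are assumptions made for illustration.

```python
from itertools import islice

def in_batches(iterable, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))  # pull at most batch_size items
        if not batch:
            return
        yield batch

rows = (f"row_{i}" for i in range(7))  # stand-in for a large data source
for batch in in_batches(rows, batch_size=3):
    print(batch)
# ['row_0', 'row_1', 'row_2']
# ['row_3', 'row_4', 'row_5']
# ['row_6']
```

Only one batch is held in memory at a time, so `batch_size` is the knob that trades memory against the number of round trips.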

In [None]:
import psycopg2

def generate_from_db(username, password, host, port, dbname, batch_size=10000):
    conn_url = f"postgresql://{username}:{password}@{host}:{port}/{dbname}"

    conn = psycopg2.connect(conn_url)
    cur = conn.cursor(name="get_large_data")
    cur.execute(
        "SELECT c1,c2,c3 FROM big_table"
    )  # this will get the data ready on the db side

    try:
        while True:
            rows = cur.fetchmany(
                batch_size
            )  # this will fetch data in batches from the ready data in db
            if not rows:
                break
            yield from rows
    finally:
        # runs when the generator is exhausted, closed, or garbage-collected
        cur.close()
        conn.close()

next(generate_from_db("username", "password", "host", 5432, "database"))

#### Pros and Cons of Generators
| **Aspect**               | **Pros**                                                                                 | **Cons**                                                                                   |
|--------------------------|------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|
| **Memory Efficiency**     | Processes one item at a time, avoiding large memory usage.                              | Cannot hold the entire dataset in memory, limiting random access.                         |
| **Lazy Evaluation**       | Computes values only when needed, saving computation time.                              | Errors may surface late, making debugging harder.                                         |
| **Chaining & Composition**| Modular pipeline stages improve readability and maintainability.                        | Complex pipelines can be harder to debug and test.                                        |
| **Scalability**           | Suitable for streaming or very large datasets.                                          | Single-threaded processing limits scalability in parallel or distributed systems.         |
| **I/O Efficiency**        | Handles data incrementally, reducing latency for large file operations.                 | Requires careful resource management (e.g., closing files manually).                      |
| **Clean Syntax**          | Python generator expressions are concise and intuitive for simple tasks.                | Writing complex pipelines with generators may have a steeper learning curve.              |
| **Random Access**         | Not applicable as generators yield values sequentially.                                 | Revisiting data requires restarting the generator or storing intermediate results.         |
| **Debugging**             | Efficient and straightforward for small, linear tasks.                                  | Debugging lazy execution errors can be challenging.                                       |
| **Built-in Features**     | Works seamlessly with `itertools` and Python’s standard library for basic operations.   | Lacks advanced features like grouping or sorting without converting to lists.             |
| **Parallelism**           | Lightweight and works well for sequential processing.                                   | Cannot leverage multi-core CPUs or distributed systems without additional libraries.       |
| **State Management**      | Generators maintain state internally, simplifying incremental tasks.                    | Once exhausted, they cannot be reused without restarting the pipeline.                    |

### 2. Using distributed frameworks
- Examples include Apache Spark, Flink or Dask
- You don't always need a distributed framework; they pay off mainly when you expect the workload to grow significantly
- Distributed frameworks allow some tasks to be processed in parallel -> faster
- If your data itself is distributed (for example on HDFS or cloud storage), distributed frameworks can perform the calculations where the data resides, avoiding costly data transfer to a single location.

## 4. Dataclasses
- `dataclasses` is a standard-library module that lets you define data objects using the `@dataclass` decorator
- Designed to represent data in Python
- It's meant to reduce boilerplate code for classes that are essentially data containers (classes that describe how the data is stored)
### Why use dataclasses
- Supports creating immutable classes using the `frozen=True` parameter
- Automatically generates methods like `__init__`, `__repr__`, `__eq__`, etc., based on class attributes -> less boilerplate code
- Encourages the use of type hints

### When to avoid Dataclasses
- For simple data that fits naturally in a dictionary, you don't need a dataclass, as creating one adds a slight overhead
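
To make the "less boilerplate" point concrete, here is a sketch comparing a hand-written class with an equivalent dataclass. `PlainPoint` and `DataPoint` are illustrative names.

```python
from dataclasses import dataclass

class PlainPoint:
    def __init__(self, x: int, y: int):
        self.x = x
        self.y = y

@dataclass
class DataPoint:
    x: int
    y: int

print(PlainPoint(1, 2) == PlainPoint(1, 2))  # False (default comparison is by identity)
print(DataPoint(1, 2) == DataPoint(1, 2))    # True (generated __eq__ compares the fields)
print(DataPoint(1, 2))                       # DataPoint(x=1, y=2) (generated __repr__)
```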

In [2]:
from dataclasses import dataclass

@dataclass
class Car:
    make: str
    model: str
    year: int

# Creating an instance
car1 = Car(make="Ford", model="Mustang", year=2022)  # year is an int, matching the type hint

print(car1)

Car(make='Ford', model='Mustang', year=2022)


In [1]:
from dataclasses import dataclass, field

@dataclass
class Product:
    name: str
    price: float
    quantity: int = 0 #int with a default value of 0
    tags: list[str] = field(default_factory=list) # unique mutable default value for each instance

    def total_cost(self) -> float:
        return self.price * self.quantity

# Example
product = Product(name="Laptop", price=999.99, quantity=3)
print(product.total_cost())  # 2999.9700000000003 (floating-point rounding)
print(product)  # Product(name='Laptop', price=999.99, quantity=3, tags=[])

2999.9700000000003
Product(name='Laptop', price=999.99, quantity=3, tags=[])


In [11]:
@dataclass(frozen=True)
class Point:
    x: int
    y: int

point = Point(1, 2)
print(hash(point))  # Instances are hashable

# Attempting to modify an attribute raises an error
point.x = 5  # Raises: FrozenInstanceError

-3550055125485641917


FrozenInstanceError: cannot assign to field 'x'

## 5. Type Hints
- These are annotations that specify the expected data types of variables, function parameters and return values
- They enable static type checking using tools like `mypy`
- `mypy` is a static type checker for Python.
    - It analyzes your code before runtime to ensure that type annotations are respected.
    - For example, if a function expects an integer and you pass a string, mypy will flag an error.
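
A minimal sketch of that situation. The function names are made up; the point is that a type checker would flag the last call as an incompatible argument type, while Python itself ignores the hints at runtime.

```python
def total_bytes(file_sizes: list[int]) -> int:
    return sum(file_sizes)

def double(n: int) -> int:
    return n * 2

print(total_bytes([100, 250]))  # 350

# A static checker such as mypy would report this call, because "ab" is a str
# and the parameter is annotated as int. Python still runs it without complaint:
print(double("ab"))  # abab
```

This is why hints are most useful together with a checker that runs before the code does, for example in CI.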

In [16]:
from typing import List
from dataclasses import dataclass

@dataclass
class SocialMediaData:
    id: str
    content: str
    timestamp: str
    # Add other attributes as needed

class SocialETL():
    pass
    
class RedditETL(SocialETL): # inheriting from SocialETL
    def extract(self,
               id: str,    # Expected to be a string
               num_records: int # Expected to be an integer
               #,client: praw.Reddit,  # Expected to be a Reddit client object (from the praw library)
               ) -> List[SocialMediaData]: # Returns a list of SocialMediaData objects
        pass


#### Using Type Hints for Higher Order Functions
- Higher Order Functions => functions that return other functions or take functions as arguments

`Callable[[List[SocialMediaData]], List[SocialMediaData]]` represents a function:
- Input: A list of `SocialMediaData`.
- Output: Another list of `SocialMediaData`.

In [18]:
from typing import Callable, List

def transformation_factory(value: str) -> Callable[[List[SocialMediaData]], List[SocialMediaData]]:
    # A dictionary of transformation functions
    factory = {
        'sd': standard_deviation_outlier_filter,
        'no_tx': no_transformation,
        'rand': random_choice_filter,
    }
    return factory[value]


# Get the transformation function for 'sd'
#transformation = transformation_factory('sd')

# Apply the returned function to a list of SocialMediaData
#transformed_data = transformation(data_list)
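
The factory above refers to transformation functions that are not defined in these notes. A self-contained sketch of the same pattern, using hypothetical stand-ins (`Post`, `no_transformation`, `top_half_filter`), might look like this:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Post:  # hypothetical stand-in for SocialMediaData
    id: str
    score: int

# Alias for "a function that takes a list of posts and returns a list of posts"
Transformation = Callable[[List[Post]], List[Post]]

def no_transformation(posts: List[Post]) -> List[Post]:
    return posts

def top_half_filter(posts: List[Post]) -> List[Post]:
    # Keep the better-scoring half of the posts
    ranked = sorted(posts, key=lambda p: p.score, reverse=True)
    return ranked[: len(ranked) // 2]

def transformation_factory(value: str) -> Transformation:
    factory: dict[str, Transformation] = {
        "no_tx": no_transformation,
        "top": top_half_filter,
    }
    return factory[value]

posts = [Post("a", 5), Post("b", 50), Post("c", 20), Post("d", 1)]
transform = transformation_factory("top")   # returns a function
print([p.id for p in transform(posts)])     # ['b', 'c']
```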

## 6. Itertools
- **Iterator** - an object representing a stream of data that returns one item at a time via `next()`; a generator is one kind of iterator
- Itertools provides functions to work with iterators
- Supports processing data in steps instead of getting the complete data set into memory
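
A short sketch of the iterator protocol that `itertools` builds on, using only built-ins:

```python
numbers = [10, 20, 30]     # a list is an iterable, not an iterator
iterator = iter(numbers)   # iter() returns an iterator over it

print(next(iterator))          # 10
print(next(iterator))          # 20
print(next(iterator))          # 30
print(next(iterator, "done"))  # done (default returned instead of raising StopIteration)

gen = (n * 2 for n in numbers)
print(iter(gen) is gen)          # True (a generator is its own iterator)
print(iter(numbers) is numbers)  # False (a list hands out a separate iterator)
```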

### Key Functions

**Generating Data**
- `count(start, step)`: Generates numbers forever (useful for indexing or filling data).
- `cycle(iterable)` : Repeats the items of an iterable over and over
- `repeat(value, times)` : Repeats a value multiple times (forever if `times` is omitted).

**Combining and Filtering**
- `chain()`: Combines multiple iterables into one
- `islice()` : Picks a part of the data without creating a new list
- `compress()` : Filters items based on a second iterable of True/False values (like a mask).

**Aggregation and Grouping**
- `accumulate()` : For running totals or sums
- `groupby()` : Groups consecutive items that share a key (like grouping data by a category); sort by that key first.
- `batched(iterable, n)` : Batches data from the iterable into tuples of length n (Python 3.12+). The last batch may be shorter than n; with `strict=True` (Python 3.13+) a short last batch raises `ValueError`.

**Generating Combinations and Permutations**
- `product()` : Cartesian product, i.e. every pairing of items from 2 or more iterables
- `combinations()` : Finds all subsets of a specified size
- `permutations()` : Finds all possible arrangements of items.

In [17]:
from itertools import (
    accumulate, chain, combinations, compress, count,
    cycle, groupby, islice, permutations, product,
)

# generate number starting from 10
print("Using count function")
for i in count(start=10, step=5):
    if i>30:
        break # else it will keep going
    print(i)

cycle_pattern = cycle(['A','B','C'])
print("Using cycle function")
for i in range(6):
    print(next(cycle_pattern)) 

print("Using chain function")
# Example: Combine two lists
data1 = [1, 2, 3]
data2 = [4, 5, 6]
combined = list(chain(data1, data2))
print(combined)
# Output: [1, 2, 3, 4, 5, 6]

print("Using islice function")
infinite_numbers = count(1)
print(list(islice(infinite_numbers, 5))) # take only the first 5 values


# Example: Filter items based on a mask
print("Using compress function")
data = ['A', 'B', 'C', 'D']
mask = [True, True, True, False]
filtered = list(compress(data, mask))
print(filtered)


print("Using accumulate function")
data = [1, 2, 3, 4]
print(list(accumulate(data)))


# Example: Group data by the first letter
print("Using Groupby function")
data = ['apple', 'apricot', 'banana', 'blueberry']
grouped = groupby(data, key=lambda x: x[0])  # Group by first letter
for key, group in grouped:
    print(key, list(group))


print("Using product function")
print(list(product([1, 2], ['A', 'B'])))

print("Using combinations function")
print(list(combinations([1, 2, 3], 2))) # all 2 item combinations from list [1,2,3]

print("Using permutations function")
print(list(permutations([1, 2, 3])))

Using count function
10
15
20
25
30
Using cycle function
A
B
C
A
B
C
Using chain function
[1, 2, 3, 4, 5, 6]
Using islice function
[1, 2, 3, 4, 5]
Using compress function
['A', 'B', 'C']
Using accumulate function
[1, 3, 6, 10]
Using Groupby function
a ['apple', 'apricot']
b ['banana', 'blueberry']
Using product function
[(1, 'A'), (1, 'B'), (2, 'A'), (2, 'B')]
Using combinations function
[(1, 2), (1, 3), (2, 3)]
Using permutations function
[(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]


In [20]:
'''
Let’s build a simple data pipeline using itertools to demonstrate how it can process and transform data efficiently. 
Imagine we have a stream of sales data (e.g., product sales) that we need to:

Filter the data to include only specific products.
Group the data by product categories.
Calculate cumulative sales for each category.
Display the top categories with the highest sales.
'''

# Sample sales data: (Product ID, Category, Sale Amount)
sales_data = [
    (101, 'Electronics', 200),
    (102, 'Electronics', 150),
    (103, 'Furniture', 300),
    (104, 'Electronics', 100),
    (105, 'Furniture', 50),
    (106, 'Clothing', 70),
    (107, 'Clothing', 30),
    (108, 'Electronics', 120),
    (109, 'Furniture', 200),
    (110, 'Clothing', 90),
]

# Step 1: Filter sales above a certain threshold (e.g., $100)
filtered_sales = filter(lambda x: x[2] > 100, sales_data)
print(f"Output of filter: {filtered_sales}")

# Step 2: Sort the filtered data by category (required for groupby to work)
sorted_sales = sorted(filtered_sales, key=lambda x: x[1])
print(f"Output of sorted: {sorted_sales}")

# Step 3: Group sales by category
grouped_sales  = groupby(sorted_sales, key=lambda x: x[1])
print(f"Output of grouped_sales: {grouped_sales}")

# Step 4: Calculate cumulative sales for each category
result = {}
for category, group in grouped_sales:
    total_sales = sum(item[2] for item in group)
    result[category] = total_sales

# Step 5: Sort the categories by total sales (descending)
sorted_result = sorted(result.items(), key=lambda x: x[1], reverse=True)

# Step 6: Display the top categories
print("Top Categories by Sales:")
for category, total in sorted_result:
    print(f"{category}: ${total}")

Output of filter: <filter object at 0x00000206607CF100>
Output of sorted: [(101, 'Electronics', 200), (102, 'Electronics', 150), (108, 'Electronics', 120), (103, 'Furniture', 300), (109, 'Furniture', 200)]
Output of grouped_sales: <itertools.groupby object at 0x00000206600184F0>
Top Categories by Sales:
Furniture: $500
Electronics: $470


## 7. Functools
- Module for higher order functions (covered above: functions that act on or return other functions)

#### Key Features of `functools`

Here are the most commonly used functions from the `functools` module:

1. **`functools.reduce()`**
2. **`functools.partial()`**
3. **`functools.lru_cache()`**
4. **`functools.wraps()`**

Easily manipulate and transform data streams with `reduce()`.
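
A compact sketch of `reduce`, `partial`, and `lru_cache`, plus `wraps` on the assumption that it is the fourth item meant above. All helper names (`convert`, `fib`, `logged`, `load`) are illustrative.

```python
from functools import lru_cache, partial, reduce, wraps

# reduce: fold a sequence into a single value
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)
print(total)  # 10

# partial: pre-fill some arguments of an existing function
def convert(amount, rate):
    return amount * rate

to_cents = partial(convert, rate=100)
print(to_cents(3))  # 300

# lru_cache: remember results of previous calls
@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040

# wraps: keep the wrapped function's name and docstring inside a decorator
def logged(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@logged
def load():
    """Load step."""
    return "loaded"

print(load.__name__)  # load
```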


## 8. Regex
- A regular expression is a small, specialized pattern language that Python exposes through the `re` module
- **Primary Use Cases**
    - Finding specific patterns in text
    - Validating inputs like email address
    - Replacing substrings
    - Extracting information from text

**Commonly Used Regex Patterns**
1. `.` - Matches any character except a new line

2. `^` - Matches the start of the string

3. `$` - Matches the end of the string

4. `*` - Matches 0 or more repetitions of the preceding element.

5. `+` - Matches 1 or more repetitions of the preceding element.

6. `?` - Matches 0 or 1 repetition of the preceding element.

7. `{n}` - Matches exactly n repetitions of the preceding element.

8. `{n,m}` - Matches between n and m repetitions of the preceding element (no space after the comma).

9. `[]` - Matches any one character within the brackets.

10. `[^]` - Matches any one character NOT inside the brackets.

11. `\d` - Matches any digit (0–9).

12. `\D` - Matches any character that is not a digit.

13. `\w` - Matches any word character (letters, digits, underscores).

14. `\W` - Matches any non-word character (!, @, #)
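
Each pattern above can be tried out with `re.findall`. The sample strings below are arbitrary.

```python
import re

print(re.findall(r"ab*", "a ab abbb"))         # ['a', 'ab', 'abbb']
print(re.findall(r"ab+", "a ab abbb"))         # ['ab', 'abbb']
print(re.findall(r"colou?r", "color colour"))  # ['color', 'colour']
print(re.findall(r"\d{4}", "2025-01-01"))      # ['2025']
print(re.findall(r"a{2,3}", "a aa aaaa"))      # ['aa', 'aaa']
print(re.findall(r"[aeiou]", "data"))          # ['a', 'a']
print(re.findall(r"[^aeiou]", "data"))         # ['d', 't']
print(re.findall(r"\D", "a1b2"))               # ['a', 'b']
print(re.findall(r"\w+", "etl_job #1!"))       # ['etl_job', '1']
print(re.findall(r"\W", "a-b c"))              # ['-', ' ']

# ^ and $ anchor the pattern to the start and end of the string
print(bool(re.search(r"^\d+$", "2025")))       # True
print(bool(re.search(r"^\d+$", "year 2025")))  # False
```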

In [1]:
import re

# SEARCH FOR A PATTERN
text = "I have 2 apples and 3 bananas."
match = re.search(r"\d", text)  # Finds the first digit
if match:
    print(f"Found match: {match.group()}")

matches = re.findall(r"\d", text)  # Finds all digits
print(f"All matching patterns: {matches}")

# SPLITTING
text = "apple,banana;cherry|date"
result = re.split(r"[,;|]", text)  # Split by comma, semicolon, or pipe
print(result)  # Output: ['apple', 'banana', 'cherry', 'date']

# REPLACING
text = "I have 2 apples and 3 bananas."
new_text = re.sub(r"\d", "#this_num_got_replaced", text)  # replace every digit with a placeholder string
print(new_text)

# MATCHING PATTERNS
text = "hello world"

# Check whether text STARTS with "hello" (re.match already anchors at the start, so ^ is optional)
if re.match(r"^hello", text):
    print("Yes, it starts with hello")

# Check if the entire string matches "hello world"
if re.fullmatch(r"hello world", text):
    print("Entire string matches 'hello world'")

Found match: 2
All matching patterns: ['2', '3']
['apple', 'banana', 'cherry', 'date']
I have #this_num_got_replaced apples and #this_num_got_replaced bananas.
Yes, it starts with hello
Entire string matches 'hello world'
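
The difference between `re.match`, `re.search`, and `re.fullmatch` is easy to mix up, so here is a small side-by-side sketch (the sample strings are illustrative only):

```python
import re

text = "say hello world"

# match only succeeds if the pattern matches at position 0.
at_start = re.match(r"hello", text)

# search scans the whole string for the first occurrence.
anywhere = re.search(r"hello", text)

# fullmatch requires the pattern to cover the entire string.
whole_partial = re.fullmatch(r"hello", "hello!")
whole_exact = re.fullmatch(r"hello", "hello")

print(at_start)             # None
print(anywhere.span())      # (4, 9)
print(whole_partial)        # None
print(whole_exact.group())  # hello
```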


In [23]:
email = "example@gmail.com"
if re.fullmatch(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", email):
    print("Valid email")
else:
    print("Invalid email")

Valid email


In [25]:
log = "2025-01-01 ERROR: Something went wrong"
date = re.search(r"\d{4}-\d{2}-\d{2}", log)  # Extract date
print(date.group())

2025-01-01
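
For pipeline work it is often handy to pull several fields out of a log line at once. The sketch below uses named groups. It assumes the log format is `<date> <LEVEL>: <message>`, as in the example above; `LOG_PATTERN` and `parse_log_line` are hypothetical names.

```python
import re

# Assumed format: "YYYY-MM-DD LEVEL: message"
LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+): (?P<message>.*)"
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not fit the format."""
    m = LOG_PATTERN.fullmatch(line)
    return m.groupdict() if m else None

parsed = parse_log_line("2025-01-01 ERROR: Something went wrong")
print(parsed)
# {'date': '2025-01-01', 'level': 'ERROR', 'message': 'Something went wrong'}

print(parse_log_line("not a log line"))  # None
```

Returning `None` for lines that do not match avoids the `AttributeError` you would get from calling `.group()` on a failed match.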


In [39]:
'''
Match a Word:
Write a regex to check if a string contains the word python (case-insensitive).
Example:
Input: "I love Python programming"
Output: Match found (python).
'''
text = 'I love PyTHon'
match = re.search(r"python", text, re.IGNORECASE)
if match:
    print(f"Python present: {match.group()}")

Python present: PyTHon


In [47]:
'''
Match Digits:
Find all digits in a given string.
Example:
Input: "The price is 42 dollars and 15 cents."
Output: ['42', '15']
'''
text = "The price is 42 dollars and 15 cents."
re.findall(r"\d+", text) # matches one or more digits

['42', '15']

In [7]:
'''
Match Email Addresses:
Write a regex to extract all email addresses from a string.
Example:
Input: "Contact us at support@example.com or sales@company.org"
Output: ['support@example.com', 'sales@company.org']
'''

pattern = r"[a-zA-Z\d._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
text = "Contact us at support@example.com or sales@company.org"
emails = re.findall(pattern, text)
print(emails)

['support@example.com', 'sales@company.org']


In [8]:
'''
Match Dates (Basic):
Match dates in the format dd-mm-yyyy.
Example:
Input: "The event is on 01-01-2025, and registration ends by 15-12-2024."
Output: ['01-01-2025', '15-12-2024']
'''

text = "The event is on 01-01-2025, and registration ends by 15-12-2024."
pattern = r"\d{2}-\d{2}-\d{4}"
re.findall(pattern, text)

['01-01-2025', '15-12-2024']
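
The pattern above only checks the shape of a date, so it would also accept something like `99-99-9999`. One possible way to filter out impossible dates is to pass each candidate to `datetime.strptime`. This is a sketch, and `extract_valid_dates` is a hypothetical helper name.

```python
import re
from datetime import datetime

def extract_valid_dates(text):
    """Find dd-mm-yyyy candidates and keep only real calendar dates."""
    candidates = re.findall(r"\b\d{2}-\d{2}-\d{4}\b", text)
    valid = []
    for candidate in candidates:
        try:
            datetime.strptime(candidate, "%d-%m-%Y")
        except ValueError:
            continue  # right shape, but not a real date
        valid.append(candidate)
    return valid

dates = extract_valid_dates("Valid: 01-01-2025. Invalid: 31-02-2025, 99-99-9999.")
print(dates)  # ['01-01-2025']
```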

In [9]:
'''
Split by Delimiters:
Split a string by commas, semicolons, or spaces.
Example:
Input: "apple,banana;cherry date"
Output: ['apple', 'banana', 'cherry', 'date']
'''
split_pattern = r"[,;\s]"
text = "apple,banana;cherry date"
re.split(split_pattern, text)

['apple', 'banana', 'cherry', 'date']
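
One thing to watch with this split pattern: when two delimiters sit next to each other (for example a comma followed by a space), splitting on a single character leaves empty strings in the result. A small sketch of the difference, using a made-up input:

```python
import re

text = "apple, banana;  cherry"

# One delimiter per split: adjacent delimiters leave empty strings behind.
single = re.split(r"[,;\s]", text)

# One or more delimiters per split: a run counts as a single separator.
collapsed = re.split(r"[,;\s]+", text)

print(single)     # ['apple', '', 'banana', '', '', 'cherry']
print(collapsed)  # ['apple', 'banana', 'cherry']
```

Leading or trailing delimiters still produce an empty first or last item, so stripping the text first can help.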

In [19]:
'''
Match Valid Phone Numbers:
Match phone numbers in these formats: (123) 456-7890, 123-456-7890, 123.456.7890.
Example:
Input: "Call me at (123) 456-7890 or 123-456-7890."
Output: ['(123) 456-7890', '123-456-7890']
'''
pattern = r"\(?\d{3}\)?[.\-\s]\d{3}[.\-\s]\d{4}"
text = "Call me at (123) 456-7890 or 123-456-7890, 123.456.7890."
re.findall(pattern, text)

['(123) 456-7890', '123-456-7890', '123.456.7890']
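
After extracting phone numbers in mixed formats, a common follow-up step is to normalize them. A minimal sketch, assuming you just want the ten digits with no formatting:

```python
import re

found = ["(123) 456-7890", "123-456-7890", "123.456.7890"]

# \D matches every non-digit, so replacing it with "" keeps only the digits.
normalized = [re.sub(r"\D", "", number) for number in found]

print(normalized)       # ['1234567890', '1234567890', '1234567890']
print(set(normalized))  # {'1234567890'}
```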

In [20]:
'''
Extract IP Addresses:
Write a regex to extract all valid IPv4 addresses from a string.
Example:
Input: "The server IPs are 192.168.1.1 and 10.0.0.255."
Output: ['192.168.1.1', '10.0.0.255']
'''
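# Escape the dots: an unescaped . matches any character, not just a literal dot.
# Note: this checks the shape only, not that each part is in the range 0-255.
pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"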
text = "The server IPs are 192.168.1.1 and 10.0.0.255."
re.findall(pattern, text)

['192.168.1.1', '10.0.0.255']
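
A regex alone cannot easily check that each part of an IPv4 address is between 0 and 255. One option is to use the regex to find candidates and let the standard-library `ipaddress` module do the validation. This is a sketch, and `extract_ipv4` is a hypothetical helper name.

```python
import ipaddress
import re

def extract_ipv4(text):
    """Find dotted-quad candidates and keep only valid IPv4 addresses."""
    candidates = re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", text)
    valid = []
    for candidate in candidates:
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # e.g. a part greater than 255
        valid.append(candidate)
    return valid

ips = extract_ipv4("Good: 192.168.1.1, 10.0.0.255. Bad: 999.1.1.1, 256.0.0.1.")
print(ips)  # ['192.168.1.1', '10.0.0.255']
```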

In [30]:
'''
Match URLs:
Extract all valid URLs from a string.
Example:
Input: "Visit https://example.com or http://test.org for more details."
Output: ['https://example.com', 'http://test.org']
'''
pattern = r"https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"  # "s" is optional; "://" is matched literally
text = "Visit https://example.com or http://test.org for more details."
re.findall(pattern, text)

['https://example.com', 'http://test.org']
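
Once URLs are extracted, `urllib.parse.urlparse` can split them into parts instead of writing more regex. A sketch under the assumption that URLs end at whitespace or a comma; the sample text is made up:

```python
import re
from urllib.parse import urlparse

text = "Visit https://example.com/docs?page=2 or http://test.org for more details."

# Grab everything after the scheme up to the next whitespace or comma.
urls = re.findall(r"https?://[^\s,]+", text)

# Let the standard library break each URL into its components.
hosts = [urlparse(url).netloc for url in urls]
first = urlparse(urls[0])

print(urls)   # ['https://example.com/docs?page=2', 'http://test.org']
print(hosts)  # ['example.com', 'test.org']
print(first.scheme, first.path, first.query)  # https /docs page=2
```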

## 9. Shallow Copy vs Deep Copy
- When you use the assignment operator in Python, you are not creating a copy of the object; both names simply refer to the same object
- The shallow vs deep distinction only matters for compound objects that contain mutable objects (lists, dicts, class instances). Immutable values such as `int`, `str`, and tuples of immutables cannot be changed in place, so sharing them is harmless

```python
a = [1,2,3]
b = a # using assignment operator to create a "copy"

b[1] = 5
print(a) # output: [1,5,3]
print(b) # output: [1,5,3]

```

- That's why we have the `copy` module, which lets us create two types of copies: **Shallow and Deep**
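
The reference sharing described above can be confirmed with the `is` operator, which checks object identity rather than equality. A minimal sketch:

```python
a = [1, 2, 3]
b = a          # assignment: b is another name for the same list
c = a.copy()   # a new list with the same contents

print(b is a)  # True  (same object)
print(c is a)  # False (different object)
print(c == a)  # True  (equal contents)

b.append(4)
print(a)       # [1, 2, 3, 4]  (changed through b)
print(c)       # [1, 2, 3]     (the copy is unaffected)
```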

### Shallow Copy
- Creates a new outer object but does not recursively copy its contents, so only the first level is copied; nested objects are shared with the original.
- **Recursive Copy**: A complete copy that also duplicates all nested lists, dicts, etc.
- Shallow copies are faster and use less memory. They’re useful when you only need to create a new top-level container but don’t mind sharing references to inner objects.
- **Use Case** : Suitable when nested objects won't change.
  
#### Ways to create shallow copy
1. Using `copy.copy(object)`
2. To create a shallow copy of a list you can use the slicing operator => `new_copy = original_list[:]`
3. Using the `list()` or `dict()` constructors, or the `.copy()` method of lists and dicts.
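
A short sketch to check that the approaches listed above behave the same way for a list: each one gives a new outer list, but the inner lists are still shared with the original.

```python
import copy

original = [[1, 2], [3, 4]]

copies = {
    "copy.copy": copy.copy(original),
    "slicing": original[:],
    "list()": list(original),
    "list.copy()": original.copy(),
}

for name, shallow in copies.items():
    new_outer = shallow is not original         # a new top-level list?
    shared_inner = shallow[0] is original[0]    # the same nested list?
    print(f"{name:<12} new outer: {new_outer}, shared inner: {shared_inner}")
```

Every line prints `new outer: True, shared inner: True`.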

### Deep Copy
- It is a recursive copy of an object, which means it also creates copies of all nested elements
- Changes to the deep copy do not affect the original, and vice versa
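
A typical place where this matters is a nested config dict that several jobs start from. The sketch below uses made-up names (`base_config`, `job_config`) to show that a deep copy can be changed freely without touching the base.

```python
import copy

# Hypothetical base configuration shared by several jobs.
base_config = {"source": "s3", "options": {"retries": 3, "tags": ["raw"]}}

job_config = copy.deepcopy(base_config)
job_config["options"]["retries"] = 5
job_config["options"]["tags"].append("daily")

print(base_config)  # {'source': 's3', 'options': {'retries': 3, 'tags': ['raw']}}
print(job_config)   # {'source': 's3', 'options': {'retries': 5, 'tags': ['raw', 'daily']}}
```

With `base_config.copy()` instead, both edits would show up in `base_config`, because the inner `options` dict would be shared.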

In [31]:
import copy

# Original list with nested structure
original_list = [1, 2, [3, 4]]

# Create a shallow copy
shallow_copy = copy.copy(original_list)

# Modify the nested list
shallow_copy[2][0] = 99

# Observe the results
print("Original List:", original_list)  # [1, 2, [99, 4]]
print("Shallow Copy:", shallow_copy)    # [1, 2, [99, 4]]

Original List: [1, 2, [99, 4]]
Shallow Copy: [1, 2, [99, 4]]


In [38]:
for i in range(len(shallow_copy)):
    print(f"ID of element {i} in Original List: {id(original_list[i])}")
    print(f"ID of element {i} in Shallow Copy: {id(shallow_copy[i])}")
    print("--------------------------")

ID of element 0 in Original List: 140719026062120
ID of element 0 in Shallow Copy: 140719026062120
--------------------------
ID of element 1 in Original List: 140719026062152
ID of element 1 in Shallow Copy: 140719026062152
--------------------------
ID of element 2 in Original List: 2377505935744
ID of element 2 in Shallow Copy: 2377505935744
--------------------------


In [39]:
original_list = [10, "hello", (1, 2, 3)]  # Contains integers, strings, and a tuple (all immutable)
shallow_copy = original_list.copy()

# Modify the first element in the shallow copy
shallow_copy[0] = 99

print("Original List:", original_list)  # [10, "hello", (1, 2, 3)]
print("Shallow Copy:", shallow_copy)    # [99, "hello", (1, 2, 3)]

Original List: [10, 'hello', (1, 2, 3)]
Shallow Copy: [99, 'hello', (1, 2, 3)]


In [40]:
# Original list with nested structure
original_list = [1, 2, [3, 4]]

# Create a deep copy
deep_copy = copy.deepcopy(original_list)

# Modify the nested list
deep_copy[2][0] = 99

# Observe the results
print("Original List:", original_list)  # [1, 2, [3, 4]]
print("Deep Copy:", deep_copy)          # [1, 2, [99, 4]]

Original List: [1, 2, [3, 4]]
Deep Copy: [1, 2, [99, 4]]


In [41]:
# Compare against deep_copy (not shallow_copy, which was reassigned in an earlier cell).
# Raw id() values differ between runs, so print the identity check instead.
for i in range(len(deep_copy)):
    same = original_list[i] is deep_copy[i]
    print(f"Element {i} is the same object in Original List and Deep Copy: {same}")

Element 0 is the same object in Original List and Deep Copy: True
Element 1 is the same object in Original List and Deep Copy: True
Element 2 is the same object in Original List and Deep Copy: False


In [45]:
# Compare the behavior of a shallow copy and a deep copy for a tuple containing lists.
orginal_list_for_shallow = ([1,2],[3,4],[5,6])
shallow_copy = orginal_list_for_shallow[:]

# make a change to first list, second element
shallow_copy[0][1] = 100

print("Original tuple of lists: ", orginal_list_for_shallow)
print("Shallow Copy of tuple of lists: ", shallow_copy)

Original tuple of lists:  ([1, 100], [3, 4], [5, 6])
Shallow Copy of tuple of lists:  ([1, 100], [3, 4], [5, 6])


In [46]:
orginal_list_for_deep = ([1,2],[3,4],[5,6])
deep_copy = copy.deepcopy(orginal_list_for_deep)  # copy the tuple defined in this cell

# make a change to first list, second element
deep_copy[0][1] = 100

print("Original tuple of lists: ", orginal_list_for_deep)
print("Deep Copy of tuple of lists: ", deep_copy)

Original tuple of lists:  ([1, 2], [3, 4], [5, 6])
Deep Copy of tuple of lists:  ([1, 100], [3, 4], [5, 6])
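
One detail worth knowing for tuples: since a tuple itself is immutable, `copy.copy` does not build a new tuple at all, and `copy.deepcopy` only builds one when something inside it is mutable. The sketch below shows the behavior in CPython's `copy` module; the variable names are made up.

```python
import copy

nested = ([1, 2], [3, 4])   # tuple holding mutable lists
flat = (1, 2, 3)            # tuple holding only immutables

shallow = copy.copy(nested)
deep = copy.deepcopy(nested)
deep_flat = copy.deepcopy(flat)

print(shallow is nested)     # True:  the very same tuple is returned
print(deep is nested)        # False: a new tuple was built
print(deep[0] is nested[0])  # False: the inner lists were copied too
print(deep_flat is flat)     # True:  nothing mutable inside, so nothing to copy
```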
