In [22]:
import sys
sys.path.append(r'C:\Users\Mahbub\Desktop\Data Engineering\Python\data')

---
1. **[Easy] Q1: Create and Use a Simple Module**  
   Create a module `math_utils.py` that contains a function to compute factorial. Import and use it in another script `main.py`.
---

In [23]:
from math_fact import factorial

In [24]:
factorial(5)

120
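
The contents of the imported module are not shown in this notebook. A minimal sketch of what `math_utils.py` might contain follows; the function body and the temporary directory are assumptions for illustration, not the actual `math_fact.py` used above. The module is written to a throwaway folder so the example runs anywhere.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module source; the notebook's real module is not shown.
MODULE_SOURCE = '''
def factorial(n: int) -> int:
    """Return n! for a non-negative integer n."""
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result
'''

# Write math_utils.py into a throwaway directory and make it importable.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "math_utils.py").write_text(MODULE_SOURCE)
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

# This line is what main.py would contain.
from math_utils import factorial

print(factorial(5))  # 120
```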

---
2. **[Easy] Q3: Import Specific Functions**  
   From a module `text_utils.py` that contains multiple string functions, import only `count_vowels()` and use it.
---

In [25]:
from math_fact_2 import factorial

In [26]:
factorial(6)

720
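
The cells above reuse `factorial`. For the `text_utils` case the question describes, a sketch might look like the following. The module body, including the extra `reverse_text` helper, is hypothetical and is written to a temporary directory so the example is self-contained.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical text_utils.py with more than one function in it.
MODULE_SOURCE = '''
def count_vowels(text: str) -> int:
    """Count the vowels a, e, i, o, u (case-insensitive)."""
    return sum(1 for ch in text.lower() if ch in "aeiou")

def reverse_text(text: str) -> str:
    """Return the text reversed."""
    return text[::-1]
'''

module_dir = Path(tempfile.mkdtemp())
(module_dir / "text_utils.py").write_text(MODULE_SOURCE)
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

# Only count_vowels is bound in this namespace; reverse_text is not.
from text_utils import count_vowels

print(count_vowels("Data Engineering"))  # 7
```

The whole module is still executed on import; `from ... import name` only limits which names are bound in the importing namespace.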

---
4. **[Easy-Medium] Q4: Create a Simple Package**  
   Create a package `data_pipeline/` with a structure:
   ```
   data_pipeline/
       __init__.py
       cleaner.py
       loader.py
   ```
   Import functions from `cleaner` and `loader` in a script outside the package.
---

In [27]:
import sys
sys.path.append(r'C:\Users\Mahbub\Desktop\Data Engineering\Python\data\data_pipeline')

In [28]:
dir()

['In',
 'Out',
 '_',
 '_16',
 '_18',
 '_24',
 '_26',
 '_3',
 '_5',
 '_7',
 '__',
 '___',
 '__builtin__',
 '__builtins__',
 '__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_dh',
 '_i',
 '_i1',
 '_i10',
 '_i11',
 '_i12',
 '_i13',
 '_i14',
 '_i15',
 '_i16',
 '_i17',
 '_i18',
 '_i19',
 '_i2',
 '_i20',
 '_i21',
 '_i22',
 '_i23',
 '_i24',
 '_i25',
 '_i26',
 '_i27',
 '_i28',
 '_i3',
 '_i4',
 '_i5',
 '_i6',
 '_i7',
 '_i8',
 '_i9',
 '_ih',
 '_ii',
 '_iii',
 '_oh',
 'clean_data',
 'cleaned',
 'data',
 'exit',
 'extract_data',
 'factorial',
 'get_ipython',
 'load_data',
 'loaded_data',
 'quit',
 'raw_data',
 'sys',
 'transform_data',
 'transformed_data',
 'validate_data',
 'validated_data']

In [29]:
from data_pipeline.cleaner import clean_data
from data_pipeline.loader import load_data

In [30]:
raw_data = load_data("fruits.txt")
cleaned = clean_data(raw_data)
print("Cleaned data:", cleaned)


Cleaned data: ['data1', 'data2']


---
5. **[Medium] Q5: Use `__all__` for Clean Import API**  
   Add `__all__` to your `__init__.py` in `data_pipeline` to control which submodules get imported with `from data_pipeline import *`.
---

To solve this problem, we need to add the `__all__` variable to the `__init__.py` file in the `data_pipeline` package. This variable will control which submodules are imported when someone uses `from data_pipeline import *`. 

### Solution Code
```python
# In data_pipeline/__init__.py

__all__ = ['submodule1', 'submodule2', 'submodule3']  # Replace with actual submodule names you want to expose
```

### Explanation
1. **Purpose of `__all__`**: The `__all__` variable in Python is a list of strings that defines which symbols should be imported when `from module import *` is used. It effectively acts as a whitelist for public API elements of the module.

2. **Implementation**: In the `__init__.py` file of the `data_pipeline` package, you define `__all__` as a list containing the names of the submodules you want to expose. For example, if your package has submodules named `submodule1`, `submodule2`, and `submodule3`, you would list them in `__all__`.

3. **Effect**: When a user writes `from data_pipeline import *`, only the submodules listed in `__all__` will be imported into their namespace. This helps in keeping the API clean and prevents unintended imports of internal or private submodules.

4. **Customization**: Replace the placeholder submodule names (`'submodule1'`, `'submodule2'`, etc.) with the actual names of the submodules you wish to expose in your `data_pipeline` package.

By following this approach, you ensure that the package's public API is clearly defined and users are not overwhelmed or confused by unnecessary imports.
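
To see the effect in isolation, the sketch below builds a throwaway package and runs the star import in a fresh namespace. The package name `demo_all_pkg` and its module contents are made up for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package; "demo_all_pkg" and its modules are made-up names.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_all_pkg"
pkg.mkdir()
(pkg / "__init__.py").write_text('__all__ = ["cleaner", "loader"]\n')
(pkg / "cleaner.py").write_text(
    "def clean_data(rows):\n    return [r.strip() for r in rows]\n"
)
(pkg / "loader.py").write_text(
    "def load_data(path):\n    return [' data1 ', 'data2 ']\n"
)
(pkg / "_internal.py").write_text("SECRET = 42\n")  # not listed in __all__

sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Run the star import in a fresh namespace so the result is easy to inspect.
namespace = {}
exec("from demo_all_pkg import *", namespace)

imported = sorted(name for name in namespace if not name.startswith("__"))
print(imported)  # ['cleaner', 'loader']

cleaned = namespace["cleaner"].clean_data(namespace["loader"].load_data("fruits.txt"))
print(cleaned)  # ['data1', 'data2']
```

With submodule names in `__all__`, the star import binds the modules themselves, so functions are reached as `cleaner.clean_data`. For a bare `clean_data(...)` call to work after `from data_pipeline import *`, the `__init__.py` would also need to import the functions and list their names in `__all__`.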

In [31]:
import sys
sys.path.append(r'C:\Users\Mahbub\Desktop\Data Engineering\Python\data')


In [32]:

from data_pipeline import *

In [33]:
raw_data = load_data("fruits.txt")
cleaned = clean_data(raw_data)
print("Cleaned data:", cleaned)

Cleaned data: ['data1', 'data2']


---

6. **[Medium] Q6: Organize a Mini ETL Package**  
   Extend the package to include `transformer.py`, and demonstrate clean imports from multiple files using absolute imports.

---

### **Absolute Imports in Python (Explained Simply)**  

**Absolute imports** refer to importing a module or its contents using its **full path** from the **project's root directory** (or Python's `sys.path`).  

### **Key Characteristics**
1. **Starts from the Top-Level Package**  
   - Always begins with the **root package name** (e.g., `data_pipeline.transformer`).  
   - No `..` or `.` (unlike relative imports).  

2. **Explicit and Unambiguous**  
   - Clearly shows where the module/function is located.  
   - Avoids confusion in large projects.  

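3. **Works From Any Module Location**  
   - The import statement reads the same in every file, as long as the top-level package can be found on `sys.path`:  
     - Modules **inside** the package can use it when the package is imported or run with `python -m`.  
     - Scripts **outside** the package can use it when the package's parent directory is on `sys.path`.  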

---

### **Example: Absolute vs. Relative Imports**
#### **Directory Structure**
```
my_project/
├── data_pipeline/
│   ├── __init__.py
│   ├── extractor.py
│   ├── transformer.py
│   └── loader.py
└── main.py  (outside the package)
```

#### **1. Absolute Import (Recommended)**
```python
# Inside main.py (or anywhere else)
from data_pipeline.transformer import transform_data
```
- **Path**: Starts from `data_pipeline` (root package).  
- **Use Case**: Scripts **outside** the package (e.g., `main.py`).  

#### **2. Relative Import (Alternative)**
```python
# Inside transformer.py (if importing from extractor.py)
from .extractor import extract_data
```
- **Path**: Uses `.` (current package) or `..` (parent package).  
- **Use Case**: Only for imports **within the same package**.  

---

### **Why Prefer Absolute Imports?**  
1. **Clarity**  
   - Clearly shows the module’s location (e.g., `data_pipeline.transformer`).  
2. **No Ambiguity**  
   - Avoids errors like:  
     ```
     ImportError: attempted relative import with no known parent package
     ```  
3. **Refactoring-Friendly**  
   - The import path does not depend on where the importing file lives, so moving that file does not break its imports.  

---

### **Common Pitfalls**  
1. **Missing `__init__.py`**  
   - Without it, the directory is at best treated as a namespace package. Add the file so Python treats the directory as a regular package.  
2. **Incorrect `sys.path`**  
   - If the package isn’t in `sys.path`, absolute imports fail. Fix:  
     ```python
     import sys
     sys.path.append("/path/to/project_root")  # Add parent of `data_pipeline`
     ```  
3. **Namespace Conflicts**  
   - Avoid naming your package after standard libraries (e.g., `json`, `math`).  

---

### **Demo: Fixing Your `data_pipeline` Import**  
If you got `ModuleNotFoundError`, ensure:  
1. Your project structure matches:  
   ```
   C:/Users/Mahbub/.../data/  # Parent dir
   └── data_pipeline/  # Package dir
       ├── __init__.py
       ├── transformer.py
       └── ...
   ```  
2. Use absolute imports in `main.py`:  
   ```python
   # Add the parent dir to Python's path
   import sys
   sys.path.append(r"C:\Users\Mahbub\Desktop\Data Engineering\Python\data")

   from data_pipeline.transformer import transform_data  # Now works!
   ```  

---

### **Summary**  
- **Absolute Import** = `from package.submodule import function`.  
- **Relative Import** = `from .sibling import function` (for intra-package use only).  
- **Best Practice**: Use absolute imports unless you’re writing package-internal code.  

This keeps your codebase clean and maintainable! 🚀

In [34]:
from etl_pipeline import extract_data,transform_data,load_data

In [35]:
from etl_pipeline import validate_data

In [36]:
data = extract_data('raw_file.txt')

Extracting data from raw_file.txt


In [37]:
data

[{'raw_data': 'sample'}]

In [38]:
transformed_data = transform_data(data)

Transforming data...


In [39]:
transformed_data

[{'cleaned_data': 'SAMPLE'}]

In [40]:
loaded_data = load_data(transformed_data,'output.txt')

Loading data to output.txt: [{'cleaned_data': 'SAMPLE'}]


In [41]:
validated_data = validate_data(transformed_data)

Validating data...


In [43]:
validated_data

[{'validated_data': 'SAMPLE'}]

---


8. **[Medium-Hard] Q8: Fail-Case Debug: Relative Import Error**  
   Create a module structure where relative imports break. Fix it using proper `__init__.py` and `sys.path` adjustments.


---

# **Q8 Solution: Debugging Relative Import Errors**

Let's create a problematic module structure, trigger a relative import error, and then fix it systematically.

## **Problem Setup (Broken Structure)**

### **Directory Structure**
```
project/
├── utils/                  # Missing __init__.py
│   └── string_cleaner.py
└── data_pipeline/
    ├── transformer.py      # Tries relative import
    └── __init__.py
```

### **1. string_cleaner.py**
```python
def clean(text: str) -> str:
    return text.strip().lower()
```

### **2. transformer.py (Problematic Relative Import)**
```python
from ..utils.string_cleaner import clean  # Will fail!

def transform(data: list) -> list:
    return [{"clean": clean(item["raw"])} for item in data]
```

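### **The Error You'll See**
Running `transformer.py` directly (`python data_pipeline/transformer.py`) raises:
```
ImportError: attempted relative import with no known parent package
```
Importing it as `data_pipeline.transformer` from the project root raises a different message: `ImportError: attempted relative import beyond top-level package`.

---

## **Why This Fails**
1. **Relative import scope**: `..` means "the parent package of `data_pipeline`", but `data_pipeline` is already a top-level package, so there is nothing above it to import from.
2. **Running the file as a script**: a file run directly has no parent package, so every relative import in it fails.
3. **Missing `__init__.py`** in `utils/`: on Python 3.3+ the directory can still be imported as a namespace package, so this does not cause the error, but adding the file makes `utils` a regular package.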

---

## **The Fixes**

### **Fix 1: Proper Package Structure**
```
project/
├── utils/
│   ├── __init__.py         # Add this!
│   └── string_cleaner.py
└── data_pipeline/
    ├── __init__.py
    └── transformer.py
```

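### **Fix 2: Update `transformer.py`**
```python
# Option A: Absolute import (works when the project root is on sys.path)
from utils.string_cleaner import clean

# Option B: Relative import. This works only if `project/` is itself a package
# (has its own __init__.py) and the module is imported as
# `project.data_pipeline.transformer`:
# from ..utils.string_cleaner import clean
```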

### **Fix 3: Adjust `sys.path` (If Needed)**
```python
# At the top of transformer.py, before the `utils` import
import sys
from pathlib import Path

# Path(__file__).parent is data_pipeline/, so .parent.parent is the project root
sys.path.append(str(Path(__file__).resolve().parent.parent))
```

---

## **Testing the Fix**
1. **Run from project root**:
   ```bash
   python -m data_pipeline.transformer
   ```
2. **Or use absolute imports** if running scripts directly.
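
The failure and the fix can also be checked in one self-contained script. This is a sketch under assumptions: the `q8_` package names are invented to avoid clashing with real modules, and the tree is built in a temporary directory rather than in your project.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Throwaway project root; the "q8_" prefixes are made up to avoid name clashes.
project = Path(tempfile.mkdtemp())
utils = project / "q8_utils"
pipeline = project / "q8_pipeline"
utils.mkdir()
pipeline.mkdir()
(utils / "__init__.py").write_text("")
(pipeline / "__init__.py").write_text("")
(utils / "string_cleaner.py").write_text(
    "def clean(text):\n    return text.strip().lower()\n"
)
# Broken: '..' climbs above q8_pipeline, which is already a top-level package.
(pipeline / "broken.py").write_text(
    "from ..q8_utils.string_cleaner import clean\n"
)
# Fixed: absolute import of the sibling top-level package.
(pipeline / "transformer.py").write_text(
    "from q8_utils.string_cleaner import clean\n"
    "\n"
    "def transform(data):\n"
    "    return [{'clean': clean(item['raw'])} for item in data]\n"
)

sys.path.insert(0, str(project))
importlib.invalidate_caches()

# The relative import fails even though both directories have __init__.py.
try:
    importlib.import_module("q8_pipeline.broken")
    error_message = None
except ImportError as exc:
    error_message = str(exc)
print("broken import:", error_message)

# The absolute import works because the project root is on sys.path.
transformer = importlib.import_module("q8_pipeline.transformer")
result = transformer.transform([{"raw": "  Hello "}])
print("fixed import:", result)  # fixed import: [{'clean': 'hello'}]
```

The broken module fails regardless of `__init__.py` files, which suggests that the import path, not the package markers, is the deciding factor here.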

---

## **Key Takeaways**
1. **`__init__.py`** marks a directory as a regular package.
2. **Relative imports** require:
   - Both modules to live under a common parent package
   - The importing module to be loaded as part of that package (not run as a standalone script)
3. **`sys.path` adjustments** may be needed so that sibling top-level packages can be imported absolutely.

This ensures robust imports in complex projects! 🚀

### **Should `__init__.py` Files Be Empty?**  
**Short Answer**: They *can* be empty, but they often contain important package metadata or initialization code.  

---

## **When to Use Empty `__init__.py`**  
✅ **Basic Packages**: If the package is just a container for modules, an empty `__init__.py` is fine.  
✅ **Python 3.3+**: Empty `__init__.py` still marks a directory as a package (implicit namespace packages exist, but explicit is better).  

**Example (Minimal Setup)**  
```
my_package/
├── __init__.py     # Empty
├── module1.py
└── module2.py
```

---

## **When to Add Code in `__init__.py`**  
### **1. Controlling Imports (`__all__`)**  
Define the public API to limit what’s imported with `from package import *`.  

```python
# __init__.py
__all__ = ["module1", "helper"]  # Only these can be imported with *
```

### **2. Package Initialization**  
Run setup code when the package is first imported (e.g., logging, configs).  

```python
# __init__.py
print(f"Initializing {__name__}...")
```

### **3. Shortening Import Paths**  
Expose key functions at the package level for cleaner imports.  

```python
# __init__.py
from .module1 import main_function  # Now users can do `from package import main_function`
```

### **4. Handling Subpackages**  
Import submodules to ensure they’re available when the parent package is imported.  

```python
# __init__.py
from . import submodule1, submodule2
```

---

## **Example: Non-Empty `__init__.py`**  
### **Directory Structure**  
```
data_pipeline/
├── __init__.py         # Exports key functions
├── extractor.py        # def extract()
├── transformer.py      # def transform()
└── loader.py           # def load()
```

### **`__init__.py` (With Useful Content)**  
```python
from .extractor import extract
from .transformer import transform
from .loader import load

__all__ = ["extract", "transform", "load"]  # Clean API for `from data_pipeline import *`
```

### **Result**  
Users can now import directly from the package:  
```python
from data_pipeline import extract  # Instead of `from data_pipeline.extractor import extract`
```
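
A runnable version of this layout is sketched below. The package name `demo_etl` and the function bodies are placeholders chosen for the demonstration; only the `__init__.py` pattern mirrors the example above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
pkg = root / "demo_etl"  # made-up package name
pkg.mkdir()
(pkg / "extractor.py").write_text("def extract():\n    return ['a', 'b']\n")
(pkg / "transformer.py").write_text(
    "def transform(rows):\n    return [r.upper() for r in rows]\n"
)
(pkg / "loader.py").write_text("def load(rows):\n    return len(rows)\n")
# __init__.py re-exports the three functions at package level.
(pkg / "__init__.py").write_text(
    "from .extractor import extract\n"
    "from .transformer import transform\n"
    "from .loader import load\n"
    "\n"
    "__all__ = ['extract', 'transform', 'load']\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

# The short import path now works; the submodule path still works too.
from demo_etl import extract, transform, load
from demo_etl.extractor import extract as extract_long

print(transform(extract()))        # ['A', 'B']
print(load(transform(extract())))  # 2
print(extract is extract_long)     # True
```

Both import paths resolve to the same function object, so re-exporting adds a shorter name without duplicating anything.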

---

## **Key Takeaways**  
1. **Empty `__init__.py` is valid**, but adding code improves usability.  
2. **Use `__all__`** to define public APIs.  
3. **Leverage `__init__.py` for**:  
   - Simplifying imports (`from package import func`).  
   - Initialization (logging, configs).  
   - Explicitly declaring submodules.  

**Best Practice**: Start with an empty `__init__.py` and add code only when needed.

---


9. **[Medium] Q9: Add a Custom Directory to `sys.path`**  
   Add a custom utilities directory to `sys.path` dynamically and import modules from there.

---

### **Solution: Adding a Custom Directory to `sys.path` Dynamically**

#### **Objective**
Dynamically add a directory (e.g., `custom_utils/`) to Python's module search path (`sys.path`) and import a module from it.

---

### **Step 1: Directory Structure**
Assume this project layout:
```
project/
├── main_script.py       # Your main Python file
└── custom_utils/        # Custom utilities directory
    ├── __init__.py      # Makes it a package (can be empty)
    └── math_ops.py      # Example utility module
```

---

### **Step 2: The Utility Module (`math_ops.py`)**
```python
# custom_utils/math_ops.py
def add(a: float, b: float) -> float:
    return a + b

def multiply(a: float, b: float) -> float:
    return a * b
```

---

### **Step 3: Dynamically Add to `sys.path` and Import**
In `main_script.py`:
```python
import sys
from pathlib import Path

# Add the `custom_utils` directory to Python's module search path
utils_path = str((Path(__file__).parent / "custom_utils").resolve())  # Absolute path
sys.path.append(utils_path)  # Affects only the current process

# Now you can import from `custom_utils` directly
from math_ops import add, multiply  # No need for `custom_utils.` prefix

# Test the imports
print(add(2, 3))       # Output: 5
print(multiply(2, 3))  # Output: 6
```

---

### **Key Notes**
1. **Why `Path(__file__).parent`?**  
   - Ensures the path is resolved relative to the script’s location (avoids hardcoding absolute paths).  
   - Works even if the script is run from another directory.

2. **Temporary vs. Persistent Path Addition**  
   - `sys.path.append()`: Only affects the current Python process.  
   - To make the directory available to every Python process started from a shell, set the `PYTHONPATH` environment variable instead (add the line to your shell profile or system settings to keep it across sessions):
     ```bash
     # Linux/Mac
     export PYTHONPATH="/path/to/project/custom_utils:$PYTHONPATH"
     # Windows (cmd.exe): set PYTHONPATH=C:\path\to\project\custom_utils;%PYTHONPATH%
     ```

3. **Alternative: Install as a Package**  
   For long-term use, convert the project into an installable package (this requires a `pyproject.toml` or `setup.py` in the project root):
   ```bash
   pip install -e /path/to/project  # Install in editable (development) mode
   ```

---

### **When to Use This**
- **Quick Scripting**: Dynamically add paths for one-off projects.  
- **Legacy Codebases**: Import from directories not structured as packages.  
- **Avoids `PYTHONPATH` Modification**: Useful when you can’t change environment variables.

---

### **Common Pitfalls**
1. **Duplicate Imports**: If the same module exists in multiple `sys.path` entries, Python uses the first one found.  
2. **Path Conflicts**: Ensure the directory name doesn’t clash with existing Python packages (e.g., `json`, `math`).  
3. **Shared Global State**: `sys.path` is shared by the whole process, so changing it while other threads are importing can make their imports resolve unpredictably (avoid in production servers).
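
The first pitfall is easy to observe directly. The sketch below uses two temporary directories that both contain a module with the same made-up name, `dup_demo_mod`, and shows that the entry earlier in `sys.path` wins.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Two throwaway directories that both contain a module with the same
# (made-up) name, each exposing a different ORIGIN value.
first_dir = Path(tempfile.mkdtemp())
second_dir = Path(tempfile.mkdtemp())
(first_dir / "dup_demo_mod.py").write_text("ORIGIN = 'first'\n")
(second_dir / "dup_demo_mod.py").write_text("ORIGIN = 'second'\n")

# Order on sys.path decides which file wins.
sys.path.insert(0, str(second_dir))
sys.path.insert(0, str(first_dir))  # now ahead of second_dir
importlib.invalidate_caches()

import dup_demo_mod

print(dup_demo_mod.ORIGIN)  # first
loaded_from = Path(dup_demo_mod.__file__).parent.resolve()
print(loaded_from == first_dir.resolve())  # True
```

Checking `module.__file__` like this is a quick way to confirm which copy of a module was actually imported.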

---

### **Final Answer**
By dynamically appending to `sys.path`, you can import modules from any directory without restructuring your project or modifying `PYTHONPATH`. This is ideal for ad-hoc development but should be replaced with proper packaging for production code.  

**Example Output**:
```
5
6
```

---

10. **[Medium-Hard] Q10: Simulate PYTHONPATH in Code**  
    Without modifying environment variables, simulate how `PYTHONPATH` allows access to external directories.

---

### **Solution: Simulating `PYTHONPATH` Programmatically**

#### **Objective**  
Modify Python's module search path (`sys.path`) at runtime to mimic the effect of setting `PYTHONPATH` without touching environment variables.

---

### **Step 1: Directory Structure**
Assume this layout where `external_lib/` is outside your project:
```
/home/user/
├── my_project/
│   ├── main.py          # Needs to import from external_lib
│   └── ...
└── external_lib/
    ├── __init__.py      # Optional here: utils.py is imported as a top-level module
    └── utils.py         # Example module
```

---

### **Step 2: The External Module (`external_lib/utils.py`)**
```python
# /home/user/external_lib/utils.py
def greet(name: str) -> str:
    return f"Hello, {name}!"
```

---

### **Step 3: Simulate `PYTHONPATH` in `main.py`**
```python
import sys
from pathlib import Path

# Dynamically add the external directory to sys.path
external_lib_path = str(Path("/home/user/external_lib").resolve())  # Absolute path
if external_lib_path not in sys.path:
    sys.path.insert(0, external_lib_path)  # Insert at start to prioritize

# Now import the external module
from utils import greet

print(greet("Alice"))  # Output: "Hello, Alice!"
```

---

### **Key Notes**
1. **Why `sys.path.insert(0, ...)`?**  
   - Ensures the external directory is checked **first** for imports, ahead of the standard library and installed packages (`PYTHONPATH` entries are likewise searched before those).  
   - Use `append()` if lower priority is acceptable.

2. **Path Resolution**  
   - `Path("/path").resolve()` converts to an absolute path, avoiding issues with relative paths.  
   - Hardcoding paths is brittle—use environment variables or config files in production.

3. **Shared Global State**  
   - `sys.path` is shared by the whole process, so changing it while other threads are importing can make their imports resolve unpredictably. Avoid this in production servers (prefer `PYTHONPATH` or package installation).

---

### **Alternative: Context Manager (Cleaner Temporary Path)**
```python
import sys
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def temporary_sys_path(path: str):
    """Temporarily add a path to sys.path."""
    path = str(Path(path).resolve())
    sys.path.insert(0, path)
    try:
        yield
    finally:
        sys.path.remove(path)  # Clean up

# Usage
with temporary_sys_path("/home/user/external_lib"):
    from utils import greet
    print(greet("Bob"))  # Output: "Hello, Bob!"

# Path is automatically removed afterward
```
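
One detail worth knowing about this pattern: removing the path entry does not unload the module. The sketch below repeats the context manager with a temporary directory and a made-up module name (`ctx_demo_utils`) to show that the import stays cached in `sys.modules` after the block exits.

```python
import importlib
import sys
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def temporary_sys_path(path):
    """Temporarily put a directory at the front of sys.path."""
    path = str(Path(path).resolve())
    sys.path.insert(0, path)
    try:
        yield path
    finally:
        sys.path.remove(path)

# Throwaway directory with a made-up module name.
lib_dir = Path(tempfile.mkdtemp())
(lib_dir / "ctx_demo_utils.py").write_text(
    "def greet(name):\n    return f'Hello, {name}!'\n"
)

with temporary_sys_path(lib_dir) as added:
    importlib.invalidate_caches()
    import ctx_demo_utils
    print(ctx_demo_utils.greet("Bob"))  # Hello, Bob!

print(added in sys.path)                # False: the path entry is gone
print("ctx_demo_utils" in sys.modules)  # True: the module stays cached
print(ctx_demo_utils.greet("Alice"))    # Hello, Alice!
```

So the context manager limits which directories later imports can see, not the lifetime of modules that were already imported.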

---

### **When to Use This**
- **Testing/Local Development**: Quickly access external code without `PYTHONPATH`.  
- **Legacy Systems**: Import from directories not structured as packages.  
- **Avoiding Global Changes**: Safer than modifying `PYTHONPATH` system-wide.

---

### **Comparison with `PYTHONPATH`**
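| Approach          | Scope                                    | Persistence                    | Changes global state at runtime | Use Case                        |
|-------------------|------------------------------------------|--------------------------------|---------------------------------|---------------------------------|
| `sys.path` mod    | Current process                          | Until the process exits        | Yes                             | Ad-hoc scripts                  |
| `PYTHONPATH`      | Every process that inherits the variable | As long as the variable is set | No                              | Shared or deployed environments |
| `pip install -e`  | Active Python environment                | Until uninstalled              | No                              | Development packages            |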
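
The scope difference can be observed by launching a child interpreter with `PYTHONPATH` set only in its environment. This is a sketch: the directory is a temporary stand-in for an external library location, and `child_sys_path` is a helper invented for the example.

```python
import os
import subprocess
import sys
import tempfile

# Throwaway directory standing in for an external library location.
external_dir = os.path.realpath(tempfile.mkdtemp())

# Ask a child interpreter to print its sys.path, one entry per line.
child_code = "import sys; print('\\n'.join(sys.path))"

def child_sys_path(extra_env):
    """Run a child Python with extra environment variables; return its sys.path."""
    env = dict(os.environ, **extra_env)
    out = subprocess.run(
        [sys.executable, "-c", child_code],
        env=env, capture_output=True, text=True, check=True,
    ).stdout
    return [os.path.realpath(p) for p in out.splitlines() if p]

with_var = child_sys_path({"PYTHONPATH": external_dir})
own_path = [os.path.realpath(p) for p in sys.path if p]

print(external_dir in with_var)  # True: the child process sees the directory
print(external_dir in own_path)  # False: the current process is unaffected
```

The interpreter reads `PYTHONPATH` once at startup and copies its entries into `sys.path`, which is why editing `sys.path` at runtime has the same effect for the current process.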

---

### **Final Answer**
By dynamically modifying `sys.path`, you simulate `PYTHONPATH` for the current script without affecting the system environment. This is ideal for temporary access to external code but should be replaced with proper packaging (`pip install -e`) or `PYTHONPATH` for production.  

**Output**:
```
Hello, Alice!
```

### **Purpose and Benefits of Simulating `PYTHONPATH` in Code**  

When you **dynamically modify `sys.path`** in Python (instead of setting `PYTHONPATH`), you are **temporarily adding directories** to Python’s module search path. This allows you to **import modules from custom locations** without permanently changing environment variables or restructuring your project.  

---

## **Key Benefits**  

### **1. Avoids Permanent Environment Changes**  
- **Problem**: Setting `PYTHONPATH` modifies the system/user environment, which can affect other Python scripts.  
- **Solution**: `sys.path.append()` only affects the **current Python session**, leaving the system untouched.  

### **2. Works Without Package Installation**  
- **Problem**: Normally, a package must be installed (or already be on `sys.path`) before you can import it.  
- **Solution**: You can **temporarily add a folder** (e.g., `external_lib/`) to `sys.path` and import directly.  

### **3. Useful for Testing/Local Development**  
- **Problem**: You’re working with a module outside your project (e.g., `../shared_utils/`).  
- **Solution**: Temporarily add its path to `sys.path` to test changes without restructuring files.  

### **4. No Need to Modify `PYTHONPATH` Manually**  
- **Problem**: Some environments (e.g., cloud notebooks) restrict modifying `PYTHONPATH`.  
- **Solution**: `sys.path.append()` works **even in restricted environments**.  

### **5. Cleaner Than Relative Imports for External Code**  
- **Problem**: Relative imports (`from ..external import x`) only work within packages.  
- **Solution**: `sys.path` lets you import **any script from any folder**.  

---

## **When Should You Use This?**  

✅ **Testing a module** before making it a proper package.  
✅ **Quick scripts** where setting `PYTHONPATH` is overkill.  
✅ **Shared utility folders** (e.g., company-wide `common_utils/`).  
✅ **Restricted environments** (e.g., Jupyter, AWS Lambda).  

❌ **Not for production code** (use `pip install` or `PYTHONPATH` instead).  

---

## **Example Scenario**  

### **Directory Structure**
```
/home/user/
├── my_project/
│   └── main.py          # Needs a function from `shared_utils/`
└── shared_utils/        # External folder
    └── math_helpers.py  # Contains `add(a, b)`
```

### **Problem**  
You **can’t** just do:  
```python
from shared_utils.math_helpers import add  # ❌ Fails (/home/user is not on sys.path)
```

### **Solution**  
Dynamically add `shared_utils/` to `sys.path`:  
```python
import sys
from pathlib import Path

# Add the external folder to Python's search path
sys.path.append(str(Path("/home/user/shared_utils").resolve()))

# Now the import works!
from math_helpers import add

print(add(2, 3))  # ✅ Output: 5
```

---

## **Alternative Solutions**  

| Method | Pros | Cons | Best For |
|--------|------|------|----------|
| **`sys.path.append()`** | No env changes, quick | Temporary, process-wide side effect | Testing, scripts |
| **`PYTHONPATH`** | Persistent, affects all scripts | Needs system access | Production |
| **`pip install -e`** | Proper package management | Requires `pyproject.toml` or `setup.py` | Development |

---

### **Final Verdict**  
- **Use `sys.path` for quick, temporary imports** (e.g., testing, notebooks).  
- **Use `PYTHONPATH` or `pip install` for production code**.  

This approach gives **flexibility without permanent changes**, making it ideal for debugging and rapid development. 🚀