In [4]:
import os
print(os.getcwd())


/Users/omkarthakur/Desktop/Ds_project/research


In [5]:
%pwd

'/Users/omkarthakur/Desktop/Ds_project/research'

In [6]:
# Go up to the parent directory (the project root)
os.chdir("/Users/omkarthakur/Desktop/Ds_project")
%pwd

'/Users/omkarthakur/Desktop/Ds_project'

In [7]:
from dataclasses import dataclass
from pathlib import Path

@dataclass
class DataIngestionConfig:
    root_dir: Path
    source_URL: str
    local_data_file: Path
    unzip_dir: Path

#### @dataclass:

- `@dataclass` is a decorator that tells Python: “This class is mainly for storing data.”
- You don’t need to write `__init__` manually; Python generates it automatically from the annotated fields.
- So whenever we use `@dataclass`, we don’t have to write this boilerplate ourselves:

      def __init__(self, root_dir, source_URL, local_data_file, unzip_dir):
          self.root_dir = root_dir
          self.source_URL = source_URL
          self.local_data_file = local_data_file
          self.unzip_dir = unzip_dir
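
To see what the decorator generates, here is a minimal sketch. `ExampleConfig` and the example URL are hypothetical and used only for illustration; they are not part of this project.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExampleConfig:  # hypothetical class, for illustration only
    root_dir: Path
    source_URL: str


# __init__ is generated from the annotated fields, in declaration order,
# so keyword and positional construction both work.
a = ExampleConfig(root_dir=Path("artifacts"), source_URL="https://example.com/data.zip")
b = ExampleConfig(Path("artifacts"), "https://example.com/data.zip")

# __eq__ compares instances field by field; __repr__ lists every field.
print(a == b)
print(a)
```

Note that `@dataclass` does not validate or convert types: passing a plain string for `root_dir` would be accepted as-is.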

In [8]:
%pip install PyYAML
import yaml; print(yaml.__version__)

Note: you may need to restart the kernel to use updated packages.
6.0.2


In [9]:
from src.datascience.constants import *
from src.datascience.utils.common import read_yaml, create_directories

In [10]:
class ConfigurationManager:
    def __init__(self,
                 config_filepath=CONFIG_FILE_PATH,
                 params_filepath=PARAMS_FILE_PATH,
                 schema_filepath=SCHEMA_FILE_PATH):
        # Load the three YAML files that drive the pipeline
        self.config = read_yaml(config_filepath)
        self.params = read_yaml(params_filepath)
        self.schema = read_yaml(schema_filepath)

        # Make sure the top-level artifacts folder exists
        create_directories([self.config.artifacts_root])

    def get_data_ingestion_config(self) -> DataIngestionConfig:
        config = self.config.data_ingestion
        create_directories([config.root_dir])

        # Wrap path-like values in Path to match the dataclass annotations
        data_ingestion_config = DataIngestionConfig(
            root_dir=Path(config.root_dir),
            source_URL=config.source_URL,
            local_data_file=Path(config.local_data_file),
            unzip_dir=Path(config.unzip_dir)
        )
        return data_ingestion_config

In [11]:
import os
import urllib.request as request
from src.datascience import logger
import zipfile

In [12]:
## Component: Data Ingestion

class DataIngestion:
    def __init__(self, config: DataIngestionConfig):
        self.config = config

    def download_file(self):
        """Download the zip file from source_URL unless it already exists locally."""
        if not os.path.exists(self.config.local_data_file):
            filename, headers = request.urlretrieve(
                url=self.config.source_URL,
                filename=self.config.local_data_file
            )
            logger.info(f"{filename} downloaded with the following info: \n{headers}")
        else:
            logger.info("File already exists")

    def extract_zip_file(self):
        """
        Extracts the downloaded zip file (local_data_file) into unzip_dir
        Function returns None
        """
        unzip_path = self.config.unzip_dir
        os.makedirs(unzip_path, exist_ok=True)
        with zipfile.ZipFile(self.config.local_data_file, 'r') as zip_ref:
            zip_ref.extractall(unzip_path)
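
The extraction step can be tried in isolation without downloading anything. The sketch below builds a small archive in a temporary directory and unpacks it the same way `extract_zip_file` does. The helper name `extract_zip` and the file names (`sample.zip`, `data.csv`) are made up for this demonstration.

```python
import os
import tempfile
import zipfile
from pathlib import Path


def extract_zip(zip_path: Path, unzip_dir: Path) -> list:
    """Extract zip_path into unzip_dir and return the archive's member names."""
    os.makedirs(unzip_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(unzip_dir)
        return zip_ref.namelist()


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    zip_path = tmp_path / "sample.zip"  # hypothetical archive name

    # Build a one-file archive to stand in for the downloaded dataset.
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("data.csv", "a,b\n1,2\n")

    members = extract_zip(zip_path, tmp_path / "unzipped")
    extracted_text = (tmp_path / "unzipped" / "data.csv").read_text()

print(members)
```

Because `os.makedirs(..., exist_ok=True)` is used, running the extraction twice is harmless: existing files are simply overwritten.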


In [13]:
try:
    config = ConfigurationManager()
    data_ingestion_config = config.get_data_ingestion_config()
    data_ingestion = DataIngestion(config=data_ingestion_config)
    data_ingestion.download_file()
    data_ingestion.extract_zip_file()
except Exception as e:
    # Log the failure, then re-raise with the original traceback intact
    logger.exception(e)
    raise

[2025-09-13 16:43:23,092: INFO: common: yaml file: config/config.yaml loaded successfully]
[2025-09-13 16:43:23,095: INFO: common: yaml file: params.yaml loaded successfully]
[2025-09-13 16:43:23,098: INFO: common: yaml file: schema.yaml loaded successfully]
[2025-09-13 16:43:23,100: INFO: common: created directory at: artifacts]
[2025-09-13 16:43:23,101: INFO: common: created directory at: artifacts/data_ingestion]
[2025-09-13 16:43:23,102: INFO: 251335794: File already exists]


The `artifacts_root` entry in `config.yaml` is a **YAML line** that is very common in configuration files for data science or ML projects. Let’s break it down carefully.

---

## 1️⃣ The line

```yaml
artifacts_root: artifacts
```

---

## 2️⃣ What it means

* **`artifacts_root`** → this is a **key** in your YAML file.

  * Usually, it represents the **base directory** where all "artifacts" of a project will be stored.
  * "Artifacts" are outputs or generated files from your project, such as:

    * trained ML models
    * processed datasets
    * evaluation metrics
    * plots or reports

* **`artifacts`** → this is the **value** associated with that key.

  * It’s likely a folder name relative to your project root.
  * So `artifacts_root` points to a folder called `artifacts/`.

---

## 3️⃣ How it would be used in Python

Assuming you read your YAML config like this:

```python
from pathlib import Path
from src.datascience.utils.common import read_yaml

cfg = read_yaml(Path("config/config.yaml"))
print(cfg.artifacts_root)
```

Output:

```
artifacts
```

You could then create directories like this:

```python
from src.datascience.utils.common import create_directories

# Create the artifacts root folder
create_directories([Path(cfg.artifacts_root)])
```

After running this, your project folder structure might look like:

```
my_project/
├─ artifacts/
├─ config/
│  └─ config.yaml
├─ src/
└─ logs/
```

---

## 4️⃣ Why it’s useful

* **Centralized storage:** All outputs go in one place (`artifacts/`), so you don’t scatter files around your project.
* **Flexible paths:** If later you want to change where artifacts are stored, just update the YAML:

  ```yaml
  artifacts_root: output_files
  ```

  You don’t need to change any Python code.
* **Consistency:** Makes your code more maintainable in teams or production pipelines.
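
The "flexible paths" point can be shown with a small sketch. A plain dict stands in for the parsed YAML here, and `make_artifacts_dir` is a hypothetical helper, not a function from this project. Everything is created inside a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path


def make_artifacts_dir(cfg: dict, project_root: Path) -> Path:
    """Create the folder named by cfg['artifacts_root'] under project_root."""
    artifacts_dir = project_root / cfg["artifacts_root"]
    artifacts_dir.mkdir(parents=True, exist_ok=True)
    return artifacts_dir


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Same function, two different config values: only the config changes.
    first = make_artifacts_dir({"artifacts_root": "artifacts"}, root)
    second = make_artifacts_dir({"artifacts_root": "output_files"}, root)
    created = sorted(p.name for p in root.iterdir())

print(created)
```

The function body never mentions a concrete folder name, which is why changing the YAML value is enough to redirect the outputs.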

---

## 5️⃣ Example in a typical ML project

```yaml
# config.yaml
artifacts_root: artifacts
data_dir: data
model_dir: models
```

Python usage:

```python
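from pathlib import Path
# Assumes save_bin is defined alongside read_yaml and create_directories
from src.datascience.utils.common import read_yaml, create_directories, save_bin

cfg = read_yaml(Path("config/config.yaml"))

# Create main folders
create_directories([Path(cfg.artifacts_root), Path(cfg.data_dir), Path(cfg.model_dir)])

# Save a model inside artifacts (`model` is assumed to be an already-trained estimator)
model_path = Path(cfg.artifacts_root) / "model.joblib"
save_bin(model, model_path)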
```

Resulting folder structure:

```
my_project/
├─ artifacts/
│  └─ model.joblib
├─ data/
├─ models/
└─ config/config.yaml
```

---

✅ **Summary**

* `artifacts_root` is just a **configuration key** pointing to a folder (`artifacts`) where your project outputs go.
* Using YAML allows you to **change this folder easily without touching your code**.
* All functions like `create_directories`, `save_json`, `save_bin` can use this folder to store results systematically.


