#### About Medallion Architecture

Medallion Architecture is a data architecture pattern commonly used in modern data engineering, particularly on lakehouse platforms such as Databricks. In our case, the layers are stored in MinIO object storage. The pattern organizes data into layers of increasing quality, which makes the data easier to manage, govern, and access. The architecture is typically divided into **three layers**, each named after a medal:

###### 1. **Bronze Layer (Raw Data)**

- **Purpose**: The Bronze layer stores raw, unprocessed data.
- **Characteristics**: Data is ingested in its original form without much transformation. This can include structured, semi-structured, or unstructured data.
- **Typical Sources**: This layer includes logs, clickstreams, and any other data from various sources such as IoT devices, APIs, databases, or files.
- **Advantages**: 
  - Acts as the raw historical record of all data.
  - Useful for reprocessing in case of errors or when new transformations are needed.

###### 2. **Silver Layer (Cleaned Data)**

- **Purpose**: The Silver layer refines the raw data.
- **Characteristics**: Data is filtered, cleaned, and enriched with additional attributes or metadata. Business logic and validation rules are often applied here.
- **Transformations**:
  - Handling missing values
  - Normalizing data formats
  - Data joins or enrichments from other sources
- **Usage**: This layer is ideal for reporting, analytics, and as input for machine learning models.
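
As a minimal sketch of the transformations listed above, the snippet below cleans a single raw record shaped like the ones scraped later in this notebook. The function name `to_silver`, the `RATING_WORDS` mapping, and the output field names are hypothetical helpers chosen for illustration; they are not part of the original code.

```python
from typing import Optional

# Hypothetical mapping from rating words to integers.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def to_silver(raw: dict) -> Optional[dict]:
    """Clean one raw (Bronze) book record; return None if it fails validation."""
    title = (raw.get("title") or "").strip()
    if not title:
        return None  # validation rule: a record must have a title

    # Normalize the price: keep only digits and the decimal point.
    digits = "".join(ch for ch in (raw.get("price") or "") if ch.isdigit() or ch == ".")
    try:
        price = float(digits)
    except ValueError:
        price = None  # missing or unparseable price stays None

    return {
        "title": title,
        "price_gbp": price,
        "in_stock": (raw.get("availability") or "").strip().lower() == "in stock",
        "rating": RATING_WORDS.get(raw.get("rating")),  # unknown word -> None
    }


raw_record = {"title": " Sharp Objects ", "price": "£47.82",
              "availability": "In stock", "rating": "Four"}
print(to_silver(raw_record))
```

Returning `None` for invalid rows keeps the validation decision in one place; a caller can filter those rows out or route them to a separate location for inspection.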

###### 3. **Gold Layer (Business-Level Data)**

- **Purpose**: The Gold layer contains high-quality, curated datasets that are ready for business consumption.
- **Characteristics**: This data is fully processed and optimized for specific use cases, such as analytics dashboards or data marts.
- **Optimizations**: Aggregations, computations, and business logic are applied to make the data fit for direct consumption by business users or applications.
- **Benefits**:
  - Optimized for performance (e.g., querying in BI tools).
  - Ensures data accuracy and business relevance.
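
To illustrate the kind of aggregation a Gold dataset might hold, here is a small sketch that summarizes cleaned rows into an average price per rating. The function name, the field names `price_gbp` and `rating`, and the sample rows are all assumptions made up for this example.

```python
from collections import defaultdict
from statistics import mean


def average_price_by_rating(silver_rows):
    """Aggregate cleaned (Silver) rows into a small Gold-style summary."""
    prices = defaultdict(list)
    for row in silver_rows:
        # Skip rows whose price or rating could not be cleaned upstream.
        if row["price_gbp"] is not None and row["rating"] is not None:
            prices[row["rating"]].append(row["price_gbp"])
    return {rating: round(mean(values), 2) for rating, values in sorted(prices.items())}


# Illustrative rows only; not real scraped data.
silver_rows = [
    {"title": "Book A", "price_gbp": 10.0, "rating": 1},
    {"title": "Book B", "price_gbp": 20.0, "rating": 1},
    {"title": "Book C", "price_gbp": 30.0, "rating": 4},
    {"title": "Book D", "price_gbp": None, "rating": 4},
]
print(average_price_by_rating(silver_rows))  # {1: 15.0, 4: 30.0}
```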


Links:

- https://www.databricks.com/glossary/medallion-architecture
- https://dataengineering.wiki/Concepts/Medallion+Architecture


#### Web Scraping and File Upload to MinIO

This script performs web scraping using Beautiful Soup and uploads the scraped data to a MinIO object storage service.

##### Dependencies
- `BeautifulSoup` from `bs4`: Used for parsing HTML and XML documents.
- `requests`: Used for making HTTP requests.
- `Minio`: Client for interacting with the MinIO storage service.
- `BytesIO`: For handling byte streams in memory.
- `datetime`: For working with date and time.

##### Setup

To use this script, ensure that the following dependencies are installed:
```bash
pip install beautifulsoup4 requests minio
```

Also, make sure you have a running MinIO instance. This script connects to a MinIO server running on `localhost:9000` with the specified access and secret keys.

##### Usage

1. **Initialize the MinIO client** by calling the `setup_minio_client` function.
2. The script checks if the bucket named 'bronze' exists and creates it if it does not.
3. You can extend this script by adding functions for web scraping and uploading data to the 'bronze' bucket.
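
The connection details do not have to live in the notebook. One possible approach, sketched below, reads them from environment variables. The variable names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and `MINIO_SECURE` and the helper `minio_settings` are assumptions for this example, not names defined by MinIO or by this script.

```python
import os


def minio_settings(env=os.environ):
    """Collect MinIO connection settings from a mapping of environment variables."""
    return {
        "endpoint": env.get("MINIO_ENDPOINT", "localhost:9000"),
        "access_key": env["MINIO_ACCESS_KEY"],  # raises KeyError if unset
        "secret_key": env["MINIO_SECRET_KEY"],
        "secure": env.get("MINIO_SECURE", "false").lower() == "true",
    }


# A plain dict stands in for os.environ so the example runs anywhere.
example_env = {"MINIO_ACCESS_KEY": "example-user", "MINIO_SECRET_KEY": "example-secret"}
settings = minio_settings(example_env)
print(settings["endpoint"], settings["secure"])

# The values can then be passed to the client the same way the script does:
# Minio(settings["endpoint"], access_key=settings["access_key"],
#       secret_key=settings["secret_key"], secure=settings["secure"])
```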

##### Example Code

In [None]:
from bs4 import BeautifulSoup
import requests
from minio import Minio
from io import BytesIO
from datetime import datetime

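def setup_minio_client():
    """Connect to the local MinIO server and make sure the 'bronze' bucket exists."""
    # Demo credentials for a local instance; avoid hard-coding real secrets.
    minio_client = Minio('localhost:9000',
                         access_key='ROOTUSER',
                         secret_key='DATAINCUBATOR',
                         secure=False)  # plain HTTP, suitable for local use only
    if not minio_client.bucket_exists('bronze'):
        minio_client.make_bucket('bronze')
        print("Bucket 'bronze' created successfully")
    return minio_client


minio_client = setup_minio_client()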





#### Scrape Books Data

This function scrapes the first catalogue page of books.toscrape.com and returns one dictionary per book.

##### Dependencies
- `requests`: Used to download the catalogue page.
- `BeautifulSoup` from `bs4`: Used to parse the downloaded HTML.

##### Function: `scrape_books_data`

###### Description

The `scrape_books_data` function requests the catalogue page, finds every `article` element with the class `product_pod`,
and extracts the title, price, availability, and rating of each book.

###### Parameters
- None. The page URL is hard-coded inside the function.

###### Returns
- `List[Dict[str, str]]`: One dictionary per book with the keys `title`, `price`, `availability`, and `rating`.
- `None`: Returned when the page cannot be fetched; the function prints the HTTP status code in that case.


In [None]:
def scrape_books_data():
    url = "https://books.toscrape.com/catalogue/page-1.html"
    response = requests.get(url, timeout=10)

    if response.status_code == 200:
        # Decode as UTF-8 so the pound sign is read as '£' rather than 'Â£'.
        response.encoding = 'utf-8'
        html_content = response.text
        soup = BeautifulSoup(html_content, 'html.parser')
        # Each book on the page is wrapped in <article class="product_pod">.
        book_rows = soup.find_all('article', class_='product_pod')

        books_data = []
        for book in book_rows:
            title = book.find('h3').find('a')['title']
            price = book.find('p', class_='price_color').text.strip()
            availability = book.find('p', class_='instock availability').text.strip()
            # The rating word is the second CSS class, e.g. "star-rating Three".
            rating = book.find('p', class_='star-rating')['class'][1]

            books_data.append({
                'title': title,
                'price': price,
                'availability': availability,
                'rating': rating
            })

        return books_data
    else:
        print(f"Failed to fetch page, status code: {response.status_code}")
        return None

book_data = scrape_books_data()
# scrape_books_data() returns None on failure, so fall back to an empty list.
for book in book_data or []:
    print(book)


{'title': 'A Light in the Attic', 'price': '£51.77', 'availability': 'In stock', 'rating': 'Three'}
{'title': 'Tipping the Velvet', 'price': '£53.74', 'availability': 'In stock', 'rating': 'One'}
{'title': 'Soumission', 'price': '£50.10', 'availability': 'In stock', 'rating': 'One'}
{'title': 'Sharp Objects', 'price': '£47.82', 'availability': 'In stock', 'rating': 'Four'}
{'title': 'Sapiens: A Brief History of Humankind', 'price': '£54.23', 'availability': 'In stock', 'rating': 'Five'}
{'title': 'The Requiem Red', 'price': '£22.65', 'availability': 'In stock', 'rating': 'One'}
{'title': 'The Dirty Little Secrets of Getting Your Dream Job', 'price': '£33.34', 'availability': 'In stock', 'rating': 'Four'}
{'title': 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'price': '£17.93', 'availability': 'In stock', 'rating': 'Three'}
{'title': 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'pr
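
The scraping cell above requests only `page-1.html`. If more pages are needed, one option is to generate the page URLs first and fetch them in a loop. The sketch below assumes that later catalogue pages follow the same `page-N.html` pattern; the helper name `catalogue_page_urls` and the default page range are hypothetical.

```python
# Assumption: every catalogue page uses the same pattern as page-1.html.
BASE_URL = "https://books.toscrape.com/catalogue/page-{page}.html"


def catalogue_page_urls(first_page=1, last_page=3):
    """Return the catalogue URLs for an inclusive range of page numbers."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return [BASE_URL.format(page=page) for page in range(first_page, last_page + 1)]


for url in catalogue_page_urls(1, 3):
    print(url)
```

Building the URL list separately from the HTTP requests keeps the pagination logic easy to check without any network access.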


#### Save Books Data to MinIO

This function saves a list of book data to MinIO object storage as a CSV file.
The file name includes the current date, so each day's run gets its own object name; a second run on the same day reuses that name.

##### Dependencies
- `csv`: Used to write CSV files.
- `io.StringIO`: Provides an in-memory string buffer for CSV data.
- `datetime`: Used to get the current date for naming the CSV file.
- `BytesIO`: Allows the conversion of string data to byte format for uploading.

##### Function: `save_books_data_to_minio`

###### Description

The `save_books_data_to_minio` function takes a list of book data and a MinIO client instance,
converts the book data into CSV format, and uploads it to the `bronze` bucket.
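
The in-memory CSV conversion can be tried without a MinIO server. The sketch below serializes a sample record to UTF-8 bytes and parses it back; the helper names `books_to_csv_bytes` and `csv_bytes_to_books` are hypothetical and exist only for this illustration.

```python
import csv
from io import BytesIO, StringIO

FIELDNAMES = ["title", "price", "availability", "rating"]


def books_to_csv_bytes(books):
    """Serialize book dictionaries to UTF-8 encoded CSV bytes."""
    buffer = StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(books)
    return buffer.getvalue().encode("utf-8")


def csv_bytes_to_books(payload):
    """Parse CSV bytes back into a list of dictionaries."""
    return list(csv.DictReader(StringIO(payload.decode("utf-8"), newline="")))


books = [{"title": "Sharp Objects", "price": "£47.82",
          "availability": "In stock", "rating": "Four"}]
payload = books_to_csv_bytes(books)
stream = BytesIO(payload)  # file-like object of the kind an upload call reads from

# The pound sign takes two bytes in UTF-8, so the byte length is one larger
# than the character length. An upload needs the byte length.
print(len(payload), len(payload.decode("utf-8")))
print(csv_bytes_to_books(payload) == books)
```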

###### Parameters
- `books_data` (`List[Dict[str, str]]`): A list of dictionaries, where each dictionary contains:
  - `title`: The title of the book.
  - `price`: The price of the book.
  - `availability`: The availability status of the book.
  - `rating`: The rating of the book.
- `minio_client` (`Minio`): An instance of the MinIO client for interacting with the MinIO server.

###### Returns
- `None`: The function does not return a value, but it prints a success or error message based on the upload outcome.

###### Example Usage

```python
if book_data:
    save_books_data_to_minio(book_data, minio_client)
```
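
Because the object name contains only the date, a second upload on the same day targets the same object name. If each run should produce its own object, one option is to include the time as well. This is a sketch of that idea rather than what the function does; the helper `build_object_name` and its timestamp format are assumptions.

```python
from datetime import datetime, timezone


def build_object_name(prefix="books_data", now=None):
    """Return an object name that includes a UTC timestamp down to the second."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}_{now.strftime('%Y%m%dT%H%M%SZ')}.csv"


fixed_time = datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone.utc)
print(build_object_name(now=fixed_time))  # books_data_20240115T093000Z.csv
```

Passing `now` explicitly makes the naming logic testable with a fixed time.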


In [6]:
import csv
from io import BytesIO, StringIO
from datetime import datetime

def save_books_data_to_minio(books_data, minio_client):
    # Name the object after the current date (format YYYYMMDD).
    current_date = datetime.now().strftime('%Y%m%d')
    object_name = f'books_data_{current_date}.csv'

    # Build the CSV in memory.
    csv_data = StringIO()
    fieldnames = ["title", "price", "availability", "rating"]
    writer = csv.DictWriter(csv_data, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(books_data)

    # put_object expects a byte stream and its length in bytes.
    csv_bytes = csv_data.getvalue().encode('utf-8')

    try:
        minio_client.put_object(
            'bronze', object_name, BytesIO(csv_bytes), len(csv_bytes)
        )
        print(f"Book data saved successfully as {object_name}")
    except Exception as e:
        print("An error occurred while uploading to Minio:", e)

# Requires `book_data` from the scraping cell and `minio_client` from the setup cell.
if book_data:
    save_books_data_to_minio(book_data, minio_client)


NameError: name 'minio_client' is not defined