# Data Ingestion

**Purpose**:

This notebook aims to conduct a lightweight data processing test using Python and SQL on an SQLite backend. We will adopt the medallion architecture, organising our data into Bronze, Silver, and Gold layers. Each layer will serve a specific purpose, beginning with the Bronze layer, which focuses on ingesting raw data.

Our data source is KaggleHub, and the dataset used is Goodreads-books. Because the dataset does not include library activity, we will generate synthetic data to fill the gaps: member information, borrowing records, and return dates.

Finally, we will create views for the Gold layer to support refined and analytical use cases.
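The three layers can be sketched end to end on a toy example. This is only an illustration under assumed names (`bronze_demo`, `silver_demo`, and `gold_demo` are hypothetical and not part of the notebook), using an in-memory SQLite database:

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# Bronze: raw rows stored as received, including stray whitespace and gaps.
conn.execute("CREATE TABLE bronze_demo (title TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO bronze_demo VALUES (?, ?)",
    [("  Dune ", 4.2), ("Emma", None), ("Ulysses", 3.7)],
)

# Silver: a cleaned copy (text trimmed, incomplete rows dropped).
conn.execute(
    """
    CREATE TABLE silver_demo AS
    SELECT TRIM(title) AS Title, rating AS AvgRating
    FROM bronze_demo
    WHERE title IS NOT NULL AND rating IS NOT NULL
    """
)

# Gold: a view that answers one analytical question.
conn.execute(
    """
    CREATE VIEW gold_demo AS
    SELECT Title, AvgRating FROM silver_demo WHERE AvgRating >= 4.0
    """
)

gold_rows = conn.execute("SELECT * FROM gold_demo").fetchall()
print(gold_rows)
```

Only the trimmed, complete, highly rated row reaches the Gold view; the row with a missing rating stops at Bronze.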

## Import 
We begin by importing the necessary libraries and functions required for data ingestion and processing.

In [1]:

from src.data.dataset_utils import (
    create_sqlite_dataset,
    download_dataset_from_kagglehub,
    generate_borrowing_records,
    generate_member,
)
import pandas as pd



## Dataset

The dataset will be downloaded from KaggleHub (`jealousleopard/goodreadsbooks`) and stored in the local KaggleHub cache (under `~/.cache/kagglehub`) for subsequent access and processing.

In [2]:


path = download_dataset_from_kagglehub("jealousleopard/goodreadsbooks")
path

Path to dataset files: /home/anish/.cache/kagglehub/datasets/jealousleopard/goodreadsbooks/versions/2


'/home/anish/.cache/kagglehub/datasets/jealousleopard/goodreadsbooks/versions/2'

## SQLite Dataset
Next, we will create a database using the SQLite backend to store and manage the dataset for subsequent processing.

In [3]:
from src.data.dataset_utils import delete_sqlit_database
delete_sqlit_database("library.db")  # delete any existing library.db for a clean start

Database deleted.


In [4]:

conn, cursor = create_sqlite_dataset("library.db")
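The two helpers used above live in `src.data.dataset_utils`, and their source is not shown in this notebook. A minimal sketch of what such helpers might look like follows; the function names `delete_database` and `open_database` are hypothetical stand-ins, not the project's actual implementation:

```python
import sqlite3
import tempfile
from pathlib import Path


def delete_database(db_path):
    """Remove the database file if present; report whether one was removed."""
    path = Path(db_path)
    existed = path.exists()
    path.unlink(missing_ok=True)
    return existed


def open_database(db_path):
    """Open (or create) the database file and return (connection, cursor)."""
    conn = sqlite3.connect(db_path)
    return conn, conn.cursor()


# Usage against a throwaway directory rather than the real library.db.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.db"
    removed_before = delete_database(demo_path)  # nothing to delete yet
    conn, cursor = open_database(demo_path)
    cursor.execute("CREATE TABLE t (x INTEGER)")
    conn.commit()
    conn.close()
    removed_after = delete_database(demo_path)  # the file exists now
```

Deleting the file before reconnecting is what makes the later `CREATE TABLE IF NOT EXISTS` statements start from an empty database on every run.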

## Bronze Layer Table

Next, we will load the dataset. The Goodreads-books dataset contains a `books.csv` file, which we will read using Pandas, skipping any malformed lines. This data will then be stored in our SQLite database as the `bronze_books` table, representing the raw data ingestion layer.

In [5]:
bronze_books = pd.read_csv(
    f"{path}/books.csv",
    on_bad_lines='skip',    
    quotechar='"',
    sep=","
)
display(bronze_books.sample(5))

bronze_books.to_sql("bronze_books", conn, if_exists="replace", index=False)


|      | bookID | title | authors | average_rating | isbn | isbn13 | language_code | num_pages | ratings_count | text_reviews_count | publication_date | publisher |
|------|--------|-------|---------|----------------|------|--------|---------------|-----------|---------------|--------------------|------------------|-----------|
| 909  | 2997 | My Secret Garden: Women's Sexual Fantasies | Nancy Friday | 3.68 | 671019872 | 9780671019877 | eng | 361 | 1817 | 123 | 10/28/2003 | Pocket Books |
| 1855 | 6568 | Asylum (Blackstone Chronicles #6) | John Saul | 4.1 | 449227944 | 9780449227947 | eng | 97 | 932 | 24 | 6/3/1997 | Fawcett |
| 7134 | 27368 | Letters from the Bay of Islands: The Story of ... | Marianne Williams/Caroline Fitzgerald | 3.72 | 143019295 | 9780143019299 | eng | 270 | 24 | 2 | 1/1/2004 | Penguin Books |
| 8939 | 34607 | The Damnation Game | Clive Barker | 3.82 | 517681137 | 9780517681138 | eng | 374 | 0 | 0 | 1/13/1989 | Random House |
| 948  | 3120 | Public Enemies (On The Run #5) | Gordon Korman | 4.19 | 439651409 | 9780439651400 | eng | 150 | 2050 | 76 | 12/1/2005 | Scholastic |


11123
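The `11123` printed above appears to be the return value of `to_sql`, which in recent pandas versions reports the number of rows written. The effect of `on_bad_lines='skip'` can be seen on a small, hypothetical CSV in which one row has an extra field; the table name `bronze_demo` is an assumption for illustration:

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV; the second data row has one field too many.
raw_csv = io.StringIO(
    "bookID,title,authors\n"
    "1,Dune,Frank Herbert\n"
    "2,Emma,Jane Austen,unexpected extra field\n"
    "3,Ulysses,James Joyce\n"
)

# on_bad_lines="skip" drops rows with too many fields instead of raising.
demo_books = pd.read_csv(raw_csv, on_bad_lines="skip")

# Write the surviving rows to an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
demo_books.to_sql("bronze_demo", conn, if_exists="replace", index=False)

stored = conn.execute("SELECT COUNT(*) FROM bronze_demo").fetchone()[0]
print(len(demo_books), stored)
```

Skipped rows are dropped silently, so it can be worth comparing the loaded row count against the line count of the source file when completeness matters.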

## Silver Layer Table

We will now create a cleaned version of the `bronze_books` table by performing the following tasks:

1. Create a `silver_books` table.
2. Filter out rows from the `bronze_books` table where the title, authors, publication date, or average rating is missing.
3. Remove any leading or trailing whitespace from book titles and author names.
4. Eliminate duplicate rows to ensure data consistency.

In [6]:
cursor.execute("""
CREATE TABLE IF NOT EXISTS silver_books AS
SELECT DISTINCT
    bookID AS BookID,
    TRIM(title) AS Title,
    TRIM(authors) AS Author,
    publication_date AS YearPublished,
    average_rating AS AvgRating
FROM bronze_books
WHERE title IS NOT NULL
  AND authors IS NOT NULL
  AND publication_date IS NOT NULL
  AND average_rating IS NOT NULL;
""")
conn.commit()


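Let's preview the first five rows of the `silver_books` table to verify the cleaning step. Note that `YearPublished` still holds the full publication date string, because the query copies `publication_date` unchanged.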

In [7]:
pd.read_sql_query("SELECT * FROM silver_books LIMIT 5;", conn)


|   | BookID | Title | Author | YearPublished | AvgRating |
|---|--------|-------|--------|---------------|-----------|
| 0 | 1 | Harry Potter and the Half-Blood Prince (Harry ... | J.K. Rowling/Mary GrandPré | 9/16/2006 | 4.57 |
| 1 | 2 | Harry Potter and the Order of the Phoenix (Har... | J.K. Rowling/Mary GrandPré | 9/1/2004 | 4.49 |
| 2 | 4 | Harry Potter and the Chamber of Secrets (Harry... | J.K. Rowling | 11/1/2003 | 4.42 |
| 3 | 5 | Harry Potter and the Prisoner of Azkaban (Harr... | J.K. Rowling/Mary GrandPré | 5/1/2004 | 4.56 |
| 4 | 8 | Harry Potter Boxed Set Books 1-5 (Harry Potte... | J.K. Rowling/Mary GrandPré | 9/13/2004 | 4.78 |
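The preview shows `YearPublished` holding values such as `9/16/2006` rather than a year. If a numeric year is wanted, one possible approach is to take the last four characters of the date string. This is a sketch that assumes every date follows the `M/D/YYYY` pattern seen in the sample rows; the table `bronze_demo` is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bronze_demo (publication_date TEXT)")
conn.executemany(
    "INSERT INTO bronze_demo VALUES (?)",
    [("9/16/2006",), ("11/1/2003",), ("1/13/1989",)],
)

# substr(X, -4) takes the last four characters, i.e. the year in M/D/YYYY.
years = [
    row[0]
    for row in conn.execute(
        "SELECT CAST(substr(publication_date, -4) AS INTEGER) "
        "FROM bronze_demo ORDER BY rowid"
    )
]
print(years)
```

The same expression could replace `publication_date AS YearPublished` in the Silver query, provided the date format assumption holds for the whole dataset.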


### Synthetic Data Generation for Members

We will now generate synthetic data to represent library members and their book borrowing behavior, as this information is not included in the original dataset.

#### Members Table
We will create a `Members` table to store synthetic data representing individual library members. Each row holds a member ID, a name, and a join date.

In [8]:
cursor.execute("""
CREATE TABLE IF NOT EXISTS Members (
    MemberID 
        INTEGER 
        PRIMARY KEY 
        AUTOINCREMENT,
    Name TEXT NOT NULL,
    JoinDate DATE DEFAULT (DATE('now'))
);
""")
conn.commit()


We will generate 50 synthetic entries for the `Members` table to simulate a realistic set of library users.

In [9]:
members_list = generate_member(50)
members_list

[('Bob Evans', datetime.date(2024, 10, 28)),
 ('Alice Evans', datetime.date(2024, 8, 11)),
 ('Ivy Smith', datetime.date(2024, 7, 19)),
 ('Grace Taylor', datetime.date(2024, 10, 19)),
 ('Diana Taylor', datetime.date(2024, 12, 29)),
 ('Alice Evans', datetime.date(2025, 5, 3)),
 ('Frank Walker', datetime.date(2024, 12, 11)),
 ('Charlie Thomas', datetime.date(2024, 8, 1)),
 ('Alice Thomas', datetime.date(2024, 9, 6)),
 ('Eve Evans', datetime.date(2025, 3, 12)),
 ('Frank Roberts', datetime.date(2024, 10, 9)),
 ('Alice Walker', datetime.date(2024, 9, 28)),
 ('Charlie Evans', datetime.date(2024, 6, 30)),
 ('Alice Jones', datetime.date(2025, 2, 26)),
 ('Frank White', datetime.date(2025, 4, 1)),
 ('Ivy Evans', datetime.date(2025, 3, 13)),
 ('Eve Wilson', datetime.date(2024, 9, 14)),
 ('Eve Smith', datetime.date(2025, 3, 18)),
 ('Ivy Thomas', datetime.date(2024, 9, 25)),
 ('Diana Jones', datetime.date(2024, 8, 18)),
 ('Eve Wilson', datetime.date(2024, 7, 3)),
 ('Eve Jones', datetime.date(2025, 4
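The source of `generate_member` is not shown here. Judging only by the output above (a list of `(name, join_date)` tuples), a generator of this shape could be sketched as follows. The name pools, the one-year window, and the `seed` parameter are all assumptions made for illustration:

```python
import datetime
import random

# Hypothetical name pools; the real helper's pools are not shown in this notebook.
FIRST_NAMES = ["Alice", "Bob", "Charlie", "Diana", "Eve"]
LAST_NAMES = ["Evans", "Smith", "Taylor", "Walker", "Jones"]


def sketch_generate_members(count, today, max_days_back=365, seed=None):
    """Return `count` (name, join_date) tuples with join dates up to `today`."""
    rng = random.Random(seed)
    members = []
    for _ in range(count):
        name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
        join_date = today - datetime.timedelta(days=rng.randint(0, max_days_back))
        members.append((name, join_date))
    return members


# Arbitrary reference date and seed so the sketch is reproducible.
demo_members = sketch_generate_members(5, today=datetime.date(2025, 6, 9), seed=42)
```

Because names are drawn independently from small pools, duplicates such as the two `Alice Evans` entries above are expected, which is why `MemberID` rather than `Name` identifies a member.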

Next, we will insert the generated list into the `Members` table and read it back.

In [10]:
cursor.executemany(
    "INSERT INTO Members (Name, JoinDate) VALUES (?, ?);",
    members_list
)
conn.commit()

pd.read_sql_query("SELECT * FROM Members;", conn)

|   | MemberID | Name | JoinDate |
|---|----------|------|----------|
| 0 | 1 | Bob Evans | 2024-10-28 |
| 1 | 2 | Alice Evans | 2024-08-11 |
| 2 | 3 | Ivy Smith | 2024-07-19 |
| 3 | 4 | Grace Taylor | 2024-10-19 |
| 4 | 5 | Diana Taylor | 2024-12-29 |
| 5 | 6 | Alice Evans | 2025-05-03 |
| 6 | 7 | Frank Walker | 2024-12-11 |
| 7 | 8 | Charlie Thomas | 2024-08-01 |
| 8 | 9 | Alice Thomas | 2024-09-06 |
| 9 | 10 | Eve Evans | 2025-03-12 |
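The insert above passes `datetime.date` objects straight to `sqlite3`, which relies on the module's default date adapter. That adapter is deprecated as of Python 3.12, so a more explicit variant converts the dates to ISO 8601 strings first. A minimal sketch on two of the rows shown above:

```python
import datetime
import sqlite3

members = [
    ("Bob Evans", datetime.date(2024, 10, 28)),
    ("Alice Evans", datetime.date(2024, 8, 11)),
]

# Convert each date to an ISO 8601 string before handing it to sqlite3.
rows = [(name, joined.isoformat()) for name, joined in members]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Members ("
    "MemberID INTEGER PRIMARY KEY AUTOINCREMENT, "
    "Name TEXT NOT NULL, "
    "JoinDate DATE)"
)
conn.executemany("INSERT INTO Members (Name, JoinDate) VALUES (?, ?)", rows)

stored = conn.execute(
    "SELECT MemberID, Name, JoinDate FROM Members ORDER BY MemberID"
).fetchall()
print(stored)
```

The stored values are the same `YYYY-MM-DD` strings shown in the table above, so SQLite date functions such as `JULIANDAY` and `strftime` work on them unchanged.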


#### Borrowing Records Table

We will now create a `BorrowingRecords` table to capture synthetic borrowing activity. Each record will include the book ID, member ID, borrow date, and return date. A `NULL` value in the return date column will indicate that the book has not yet been returned.


In [11]:
book_ids = pd.read_sql_query("SELECT BookID FROM silver_books LIMIT 100;", conn)["BookID"].tolist()
member_ids = pd.read_sql_query("SELECT MemberID FROM Members;", conn)["MemberID"].tolist()

In [12]:
borrowing_records = generate_borrowing_records(book_ids, member_ids)
borrowing_records

[(80, 1, datetime.date(2025, 4, 5), datetime.date(2025, 4, 17)),
 (53, 1, datetime.date(2025, 3, 25), datetime.date(2025, 4, 21)),
 (21, 2, datetime.date(2025, 5, 16), None),
 (147, 3, datetime.date(2025, 5, 5), None),
 (142, 4, datetime.date(2025, 4, 15), datetime.date(2025, 4, 21)),
 (152, 5, datetime.date(2025, 4, 8), datetime.date(2025, 4, 20)),
 (156, 5, datetime.date(2025, 3, 21), None),
 (130, 6, datetime.date(2025, 5, 24), datetime.date(2025, 6, 16)),
 (162, 7, datetime.date(2025, 4, 11), datetime.date(2025, 4, 19)),
 (4, 7, datetime.date(2025, 6, 3), datetime.date(2025, 6, 25)),
 (28, 7, datetime.date(2025, 6, 2), datetime.date(2025, 6, 20)),
 (5, 8, datetime.date(2025, 5, 13), datetime.date(2025, 5, 21)),
 (93, 9, datetime.date(2025, 5, 10), datetime.date(2025, 6, 4)),
 (58, 9, datetime.date(2025, 4, 11), datetime.date(2025, 4, 29)),
 (83, 9, datetime.date(2025, 5, 30), datetime.date(2025, 6, 7)),
 (31, 10, datetime.date(2025, 4, 15), datetime.date(2025, 5, 15)),
 (30, 10, da
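As with the member generator, the source of `generate_borrowing_records` is not shown. The output suggests tuples of `(book_id, member_id, borrow_date, return_date)` with `None` for outstanding loans. The sketch below reproduces that shape only; the loan counts per member, the date windows, and the share of unreturned books are assumptions:

```python
import datetime
import random


def sketch_generate_borrowings(book_ids, member_ids, today, seed=None):
    """Return (book_id, member_id, borrow_date, return_date_or_None) tuples."""
    rng = random.Random(seed)
    records = []
    for member_id in member_ids:
        # Assumed: each member borrows between one and three books.
        for _ in range(rng.randint(1, 3)):
            borrow_date = today - datetime.timedelta(days=rng.randint(0, 90))
            # Assumed: roughly one loan in five is still outstanding.
            if rng.random() < 0.2:
                return_date = None
            else:
                return_date = borrow_date + datetime.timedelta(days=rng.randint(1, 30))
            records.append((rng.choice(book_ids), member_id, borrow_date, return_date))
    return records


# Arbitrary IDs, reference date, and seed for a reproducible demonstration.
demo_records = sketch_generate_borrowings(
    [1, 2, 4, 5, 8], [1, 2, 3], today=datetime.date(2025, 6, 9), seed=7
)
```

Drawing book IDs from the list queried out of `silver_books` keeps every generated `BookID` valid for the joins in the Gold views.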

In [13]:
cursor.execute("""
CREATE TABLE IF NOT EXISTS BorrowingRecords (
    RecordID 
        INTEGER 
        PRIMARY KEY 
        AUTOINCREMENT,
    BookID INTEGER,
    MemberID INTEGER,
    BorrowDate 
        DATE 
        DEFAULT (DATE('now')),
    ReturnDate DATE,
    FOREIGN KEY (BookID) 
        REFERENCES silver_books(BookID),
    FOREIGN KEY (MemberID) 
        REFERENCES Members(MemberID)
);
""")
conn.commit()
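One caveat about the `FOREIGN KEY` clauses: SQLite does not enforce foreign keys unless `PRAGMA foreign_keys = ON` is set on the connection, and the referenced column must be a `PRIMARY KEY` or `UNIQUE`. Since `silver_books` is built with `CREATE TABLE ... AS SELECT`, it carries no such constraint, so in this notebook the clauses act as documentation. The sketch below shows enforcement on hypothetical tables (`books_demo`, `loans_demo`) where the parent does have a primary key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite

# Hypothetical parent table with an explicit primary key for the FK to target.
conn.execute("CREATE TABLE books_demo (BookID INTEGER PRIMARY KEY, Title TEXT)")
conn.execute(
    "CREATE TABLE loans_demo ("
    "RecordID INTEGER PRIMARY KEY AUTOINCREMENT, "
    "BookID INTEGER REFERENCES books_demo(BookID))"
)
conn.execute("INSERT INTO books_demo VALUES (1, 'Dune')")
conn.execute("INSERT INTO loans_demo (BookID) VALUES (1)")  # valid reference

try:
    conn.execute("INSERT INTO loans_demo (BookID) VALUES (999)")  # no such book
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

With enforcement on, the insert that points at a missing book is rejected instead of silently creating an orphaned record.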


Next, we will insert the generated borrowing data into the `BorrowingRecords` table to complete our synthetic dataset.

In [14]:
cursor.executemany(
    """
    INSERT INTO BorrowingRecords (BookID, MemberID, BorrowDate, ReturnDate)
    VALUES (?, ?, ?, ?);
    """,
    borrowing_records
)
conn.commit()

# Preview
pd.read_sql_query("SELECT * FROM BorrowingRecords LIMIT 5;", conn)


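|   | RecordID | BookID | MemberID | BorrowDate | ReturnDate |
|---|----------|--------|----------|------------|------------|
| 0 | 1 | 80 | 1 | 2025-04-05 | 2025-04-17 |
| 1 | 2 | 53 | 1 | 2025-03-25 | 2025-04-21 |
| 2 | 3 | 21 | 2 | 2025-05-16 |  |
| 3 | 4 | 147 | 3 | 2025-05-05 |  |
| 4 | 5 | 142 | 4 | 2025-04-15 | 2025-04-21 |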


## Gold Layer View

For the Gold layer, we will adopt a different approach by creating SQL views instead of physical tables. These views will include refined queries to support analytics and reporting use cases.
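The practical difference between a view and a physical table is that a view stores only the query, so it always reflects the current contents of the underlying tables. A small sketch with hypothetical names (`silver_demo`, `gold_demo`) illustrates this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE silver_demo (Title TEXT, AvgRating REAL)")
conn.execute("INSERT INTO silver_demo VALUES ('Dune', 4.6)")

conn.execute(
    "CREATE VIEW gold_demo AS "
    "SELECT Title FROM silver_demo WHERE AvgRating >= 4.5"
)
before = conn.execute("SELECT Title FROM gold_demo ORDER BY Title").fetchall()

# A later insert into the base table shows up in the view with no refresh step.
conn.execute("INSERT INTO silver_demo VALUES ('Emma', 4.8)")
after = conn.execute("SELECT Title FROM gold_demo ORDER BY Title").fetchall()

print(before, after)
```

The trade-off is that the query runs on every read, which is acceptable for a dataset of this size.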

### Top Rated Books View

We will create a view to identify top-rated books. It selects books from the `silver_books` table with an average rating of 4.5 or higher, sorted by rating in descending order and limited to the top 20.



In [15]:
cursor.execute("""
CREATE VIEW IF NOT EXISTS gold_top_books AS
SELECT 
    Title, 
    Author, 
    YearPublished, 
    AvgRating
FROM silver_books
WHERE AvgRating >= 4.5
ORDER BY AvgRating DESC
LIMIT 20;
""")

conn.commit()


In [16]:
pd.read_sql_query("SELECT * FROM gold_top_books LIMIT 5;", conn)

|   | Title | Author | YearPublished | AvgRating |
|---|-------|--------|---------------|-----------|
| 0 | Comoediae 1: Acharenses/Equites/Nubes/Vespae/P... | Aristophanes/F.W. Hall/W.M. Geldart | 2/22/1922 | 5.0 |
| 1 | Willem de Kooning: Late Paintings | Julie Sylvester/David Sylvester | 9/1/2006 | 5.0 |
| 2 | Literature Circle Guide: Bridge to Terabithia:... | Tara MacCarthy | 1/1/2002 | 5.0 |
| 3 | Middlesex Borough (Images of America: New Jersey) | Middlesex Borough Heritage Committee | 3/17/2003 | 5.0 |
| 4 | Zone of the Enders: The 2nd Runner Official St... | Tim Bogenn | 3/6/2003 | 5.0 |


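The sample shows several books with a perfect 5.0 average. Because `silver_books` does not carry `ratings_count`, the view cannot distinguish a 5.0 built on thousands of ratings from one built on a handful, so these results should be read with caution.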

### Top Borrowed Books View

We will create a view named `gold_top_borrowed_books` to identify the most frequently borrowed books. This view will:

* Join the `BorrowingRecords` table with the `silver_books` table on the `BookID` column.
* Retrieve relevant metadata such as book title and author.
* Group the data by `BookID`, count the total borrowings for each book, and order the results in descending order of borrow count.
* Limit the output to the top 10 most borrowed books.


In [17]:
cursor.execute("""
CREATE VIEW IF NOT EXISTS gold_top_borrowed_books AS
SELECT 
    book.Title, 
    book.Author, 
    COUNT(*) AS BorrowCount
FROM BorrowingRecords AS borrow
JOIN silver_books AS book 
    ON borrow.BookID = book.BookID
GROUP BY borrow.BookID
ORDER BY BorrowCount DESC
LIMIT 10;
""")
pd.read_sql_query("SELECT * FROM gold_top_borrowed_books;", conn)


|   | Title | Author | BorrowCount |
|---|-------|--------|-------------|
| 0 | The Lord of the Rings: Complete Visual Companion | Jude Fisher | 5 |
| 1 | Simply Beautiful Beading: 53 Quick and Easy Pr... | Heidi Boyd | 4 |
| 2 | CliffsNotes on Tolstoy's Anna Karenina | Marianne Sturman/Leo Tolstoy | 3 |
| 3 | Mapping the Big Picture: Integrating Curriculu... | Heidi Hayes Jacobs | 3 |
| 4 | Rising from the Plains | John McPhee | 3 |
| 5 | The Lord of the Rings (The Lord of the Rings ... | J.R.R. Tolkien | 3 |
| 6 | Dinner with Anna Karenina | Gloria Goldreich | 2 |
| 7 | Anna Karenina | Leo Tolstoy/Louise Maude/Aylmer Maude | 2 |
| 8 | Anna Karenina | Leo Tolstoy/David Magarshack/Priscilla Meyer | 2 |
| 9 | Ruby Ann's Down Home Trailer Park BBQin' Cookbook | Ruby Ann Boxcar | 2 |


### Top Members by Number of Borrowings

We will create a view to identify the top 10 members based on borrowing frequency. This will involve:

* Joining the `BorrowingRecords` table with the `Members` table on the `MemberID` column.
* Grouping the results by `MemberID` and counting the total number of borrowings per member.
* Ordering the results in descending order and limiting the output to the top 10 borrowers.



In [18]:
cursor.execute("""
CREATE VIEW IF NOT EXISTS gold_top_members AS
SELECT 
     member.Name, 
     COUNT(*) AS TotalBorrows        
FROM BorrowingRecords AS borrow          
JOIN Members AS member 
     ON borrow.MemberID = member.MemberID                  
GROUP BY borrow.MemberID
ORDER BY TotalBorrows DESC               
LIMIT 10;
""")
pd.read_sql_query("SELECT * FROM gold_top_members;", conn)


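|   | Name | TotalBorrows |
|---|------|--------------|
| 0 | Charlie White | 3 |
| 1 | Charlie Thomas | 3 |
| 2 | Charlie Walker | 3 |
| 3 | Charlie Wilson | 3 |
| 4 | Eve Brown | 3 |
| 5 | Grace Walker | 3 |
| 6 | Hassan Taylor | 3 |
| 7 | Hassan Wilson | 3 |
| 8 | Alice Wilson | 3 |
| 9 | Grace Wilson | 3 |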
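Every member in this result has the same count, so which ten rows survive the `LIMIT` depends on an ordering that the query leaves unspecified. One way to make the result deterministic is to add tie-breaking sort keys. The sketch below uses small hypothetical data, including two distinct members who share a name, and also selects `MemberID` so such members stay distinguishable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Members (MemberID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE BorrowingRecords (BookID INTEGER, MemberID INTEGER)")
conn.executemany(
    "INSERT INTO Members VALUES (?, ?)",
    [(1, "Eve Brown"), (2, "Alice Wilson"), (3, "Alice Wilson")],
)
conn.executemany(
    "INSERT INTO BorrowingRecords VALUES (?, ?)",
    [(10, 1), (11, 1), (12, 2), (13, 2), (14, 3)],
)

# Ties on TotalBorrows are broken by name, then by MemberID.
top = conn.execute(
    """
    SELECT member.MemberID, member.Name, COUNT(*) AS TotalBorrows
    FROM BorrowingRecords AS borrow
    JOIN Members AS member ON borrow.MemberID = member.MemberID
    GROUP BY member.MemberID
    ORDER BY TotalBorrows DESC, member.Name ASC, member.MemberID ASC
    LIMIT 2
    """
).fetchall()
print(top)
```

The same extra sort keys could be added to `gold_top_borrowed_books`, where several books are tied on two and three borrowings.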


### Books Not Yet Returned

We will create a view to list all books that have not yet been returned by members. In this view:

* Only records with a `NULL` return date will be included.
* We will flag a book as overdue when more than 14 days have passed since its borrow date.
* The result will include the borrower's name, the book title, the borrow date, the number of days out (`DaysOut`), and an `IsOverdue` flag.



In [19]:
cursor.execute("""
CREATE VIEW IF NOT EXISTS gold_unreturned_books AS
SELECT 
    member.Name AS Borrower,
    book.Title,
    borrow.BorrowDate,
    JULIANDAY('now') - JULIANDAY(borrow.BorrowDate) AS DaysOut,
    CASE 
        WHEN JULIANDAY('now') - JULIANDAY(borrow.BorrowDate) > 14 THEN 'Yes'
        ELSE 'No'
    END AS IsOverdue
FROM BorrowingRecords AS borrow
JOIN Members AS member 
    ON borrow.MemberID = member.MemberID
JOIN silver_books AS book 
    ON borrow.BookID = book.BookID
WHERE borrow.ReturnDate IS NULL;
""")
pd.read_sql_query("SELECT * FROM gold_unreturned_books;", conn)


|   | Borrower | Title | BorrowDate | DaysOut | IsOverdue |
|---|----------|-------|------------|---------|-----------|
| 0 | Alice Evans | A Short History of Nearly Everything | 2025-05-16 | 24.369222 | Yes |
| 1 | Ivy Smith | Rails Cookbook: Recipes for Rapid Web Developm... | 2025-05-05 | 35.369222 | Yes |
| 2 | Diana Taylor | Anna Karenina | 2025-03-21 | 80.369222 | Yes |
| 3 | Eve Evans | J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an... | 2025-05-13 | 27.369222 | Yes |
| 4 | Charlie Evans | Giving Good Weight | 2025-04-23 | 47.369222 | Yes |
| 5 | Alice Jones | What to Sell on ebay and Where to Get It: The ... | 2025-05-26 | 14.369222 | Yes |
| 6 | Alice Jones | The Changeling (Daughters of England #15) | 2025-04-15 | 55.369222 | Yes |
| 7 | Frank White | The Heidi Chronicles | 2025-04-25 | 45.369222 | Yes |
| 8 | Diana Jones | Molly Hatchet - 5 of the Best | 2025-04-18 | 52.369222 | Yes |
| 9 | Grace Wilson | Heidi (Heidi #1-2) | 2025-03-23 | 78.369222 | Yes |
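`DaysOut` is fractional because `JULIANDAY('now')` includes the current time of day, which is why a book borrowed exactly 14 days earlier (14.369222) is already flagged as overdue. If whole days are preferred, the difference can be cast to an integer. The sketch below uses a fixed reference date instead of `'now'` so that the result is reproducible; that substitution is an assumption made for the example only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BorrowingRecords (BorrowDate TEXT, ReturnDate TEXT)")
conn.executemany(
    "INSERT INTO BorrowingRecords VALUES (?, ?)",
    [("2025-05-16", None), ("2025-06-01", None), ("2025-04-05", "2025-04-17")],
)

# A fixed reference date keeps the sketch reproducible; the view uses 'now'.
as_of = "2025-06-09"
rows = conn.execute(
    """
    SELECT
        BorrowDate,
        CAST(JULIANDAY(:as_of) - JULIANDAY(BorrowDate) AS INTEGER) AS DaysOut,
        CASE
            WHEN JULIANDAY(:as_of) - JULIANDAY(BorrowDate) > 14 THEN 'Yes'
            ELSE 'No'
        END AS IsOverdue
    FROM BorrowingRecords
    WHERE ReturnDate IS NULL
    ORDER BY BorrowDate
    """,
    {"as_of": as_of},
).fetchall()
print(rows)
```

In the view itself, replacing `JULIANDAY('now')` with `JULIANDAY(DATE('now'))` would have a similar effect, since `DATE('now')` drops the time component.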


### Month-by-Month Borrowing Activity

We will create a view to analyse borrowing trends over time by aggregating the number of books borrowed each month. This will help us observe borrowing patterns and seasonal activity.


In [20]:
cursor.execute("""
CREATE VIEW IF NOT EXISTS gold_monthly_borrowing AS
SELECT 
    strftime('%Y-%m', BorrowDate) AS Month,
    COUNT(*) AS BorrowCount
FROM BorrowingRecords
GROUP BY Month
ORDER BY Month ASC;
""")
pd.read_sql_query("SELECT * FROM gold_monthly_borrowing;", conn)


|   | Month | BorrowCount |
|---|-------|-------------|
| 0 | 2025-03 | 18 |
| 1 | 2025-04 | 35 |
| 2 | 2025-05 | 36 |
| 3 | 2025-06 | 9 |
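The monthly grouping works because `BorrowDate` is stored as an ISO 8601 string (`YYYY-MM-DD`), which `strftime` can parse. A string in another layout, such as the `M/D/YYYY` dates in the books data, yields `NULL` instead. A short sketch on hypothetical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# ISO dates parse; the M/D/YYYY layout does not and yields NULL (None in Python).
iso_month, us_month = conn.execute(
    "SELECT strftime('%Y-%m', '2025-04-15'), strftime('%Y-%m', '4/15/2025')"
).fetchone()

conn.execute("CREATE TABLE BorrowingRecords (BorrowDate TEXT)")
conn.executemany(
    "INSERT INTO BorrowingRecords VALUES (?)",
    [("2025-03-25",), ("2025-04-05",), ("2025-04-15",)],
)
monthly = conn.execute(
    """
    SELECT strftime('%Y-%m', BorrowDate) AS Month, COUNT(*) AS BorrowCount
    FROM BorrowingRecords
    GROUP BY Month
    ORDER BY Month ASC
    """
).fetchall()
print(iso_month, us_month, monthly)
```

Since the `YYYY-MM` strings sort lexicographically in calendar order, `ORDER BY Month` gives a correct timeline without any date conversion.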


## Data Pipeline

Now that the data structure and transformations are validated, we will create a data ingestion pipeline to automate the population of all three layers—Bronze, Silver, and Gold.



In [21]:
from src.etl.bronze_etl import BronzeETL
from src.etl.silver_etl import SilverETL
from src.etl.gold_etl import GoldETL

def run_bronze_etl():
    bronze = BronzeETL()
    bronze.download_data()
    bronze.load_raw_books()
    bronze.write_to_sqlite()

def run_silver_etl():
    silver = SilverETL()
    silver.transform_bronze_books()
    silver.create_members_table(total_members=20)
    silver.create_borrowing_records_table()
    silver.preview_tables()

def run_gold_etl():
    gold = GoldETL()
    gold.create_top_books_table()
    gold.create_top_borrowed_books_view()
    gold.create_unreturned_books_view()
    gold.create_monthly_borrowing_view()
    gold.preview_gold_outputs()

def main():
    print("Running Bronze ETL...")
    run_bronze_etl()

    print("\nRunning Silver ETL...")
    run_silver_etl()

    print("\nRunning Gold ETL...")
    run_gold_etl()


# main()