# Remote Work Tracker BI Project: ETL, DB, and Utility Scripts

This Jupyter notebook walks through the Extract, Transform, Load (ETL) process, database interaction, and utility functions developed for the Remote Work Tracker Business Intelligence (BI) project. The scripts process raw job data (e.g., from the Remotive.com API) and store it in a structured SQLite database, ready for analysis and visualization in tools like Power BI.

## Project Components

The core components covered in this notebook are:
1.  **`db_schema_and_etl_design.md`**: Documentation outlining the database schema and ETL process.
2.  **`db_connector.py`**: Python script for connecting to and interacting with the SQLite database.
3.  **`etl_script.py`**: Python script implementing the Extract, Transform, Load logic.
4.  **`utils.py`**: Python script containing general utility functions, such as logging.

## 1. Database Schema Design

The foundation of our BI project is a well-defined database schema. We use a simple SQLite database with a single table, `remote_jobs`, to store the processed job data. Each row holds one job posting, with the columns listed below.

### `remote_jobs` Table Structure


| Column Name                   | Data Type   | Constraints       | Description                                      |
| :---------------------------- | :---------- | :---------------- | :----------------------------------------------- |
| `id`                          | INTEGER     | PRIMARY KEY       | Unique identifier for each job posting (from API)|
| `job_title`                   | TEXT        | NOT NULL          | Title of the job                                 |
| `company_name`                | TEXT        | NOT NULL          | Name of the hiring company                       |
| `publication_date`            | TEXT        | NOT NULL          | Date and time the job was published (ISO format) |
| `job_type`                    | TEXT        |                   | Type of employment (e.g., full_time, contract)   |
| `category`                    | TEXT        |                   | Job category (e.g., Software Development)        |
| `candidate_required_location` | TEXT        |                   | Geographical restrictions for candidates         |
| `salary_range`                | TEXT        |                   | Stated salary range, if available                |
| `job_description`             | TEXT        |                   | Full description of the job                      |
| `source_url`                  | TEXT        | UNIQUE, NOT NULL  | URL to the original job posting                  |
| `company_logo`                | TEXT        |                   | URL to the company logo                          |
| `job_board`                   | TEXT        | NOT NULL          | Source job board (e.g., Remotive.com)            |
| `ingestion_timestamp`         | TIMESTAMP   | DEFAULT CURRENT_TIMESTAMP | Timestamp when the record was ingested   |

*Note: The `db_schema_and_etl_design.md` file contains a more detailed explanation of the schema and ETL process.*
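
The table above could be expressed as the following DDL. This is a minimal sketch, assuming the column names and constraints exactly as listed; the `SCHEMA_SQL` name and the in-memory database are illustrative and not taken from the project files.

```python
import sqlite3

# Hypothetical constant: DDL derived from the schema table above.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS remote_jobs (
    id INTEGER PRIMARY KEY,
    job_title TEXT NOT NULL,
    company_name TEXT NOT NULL,
    publication_date TEXT NOT NULL,
    job_type TEXT,
    category TEXT,
    candidate_required_location TEXT,
    salary_range TEXT,
    job_description TEXT,
    source_url TEXT UNIQUE NOT NULL,
    company_logo TEXT,
    job_board TEXT NOT NULL,
    ingestion_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"""

# An in-memory database keeps the sketch free of side effects on disk.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA_SQL)

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(remote_jobs)")]
print(columns)
conn.close()
```

Listing the columns with `PRAGMA table_info` is a quick way to check that the created table matches the documented schema.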

## 2. Database Connector (`db_connector.py`)

The `db_connector.py` script provides a `DBConnector` class to manage interactions with the SQLite database. It handles connecting, disconnecting, creating the `remote_jobs` table, and inserting job data from a Pandas DataFrame.

### Key Features:

- **Connection Management**: `connect()` and `disconnect()` open and close the SQLite connection and its cursor.
- **Schema Initialization**: `create_table()` ensures the `remote_jobs` table exists with the defined schema.
- **Data Insertion**: `insert_jobs(df)` inserts DataFrame records with `INSERT OR IGNORE`, so a row whose `id` (the primary key supplied by the API) already exists is skipped rather than inserted again. This lets repeated daily scrapes be loaded without re-inserting old postings.
- **Data Retrieval**: `fetch_all_jobs()` allows for easy retrieval of all stored job data into a Pandas DataFrame for analysis.
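
The deduplication behavior of `INSERT OR IGNORE` can be illustrated with a small, self-contained sketch. It assumes a reduced three-column version of `remote_jobs` and plain tuples standing in for DataFrame records; the `insert_rows` helper and the example URLs are hypothetical and are not the project's `insert_jobs` implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical version of remote_jobs: only the columns needed here.
conn.execute(
    "CREATE TABLE remote_jobs ("
    "id INTEGER PRIMARY KEY, "
    "job_title TEXT NOT NULL, "
    "source_url TEXT UNIQUE NOT NULL)"
)


def count_rows(conn):
    """Return the current number of rows in remote_jobs."""
    return conn.execute("SELECT COUNT(*) FROM remote_jobs").fetchone()[0]


def insert_rows(conn, rows):
    """Insert (id, job_title, source_url) tuples and return how many were new."""
    before = count_rows(conn)
    # OR IGNORE silently skips any row that violates a uniqueness constraint.
    conn.executemany(
        "INSERT OR IGNORE INTO remote_jobs (id, job_title, source_url) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return count_rows(conn) - before


day_one = [
    (1, "Backend Engineer", "https://example.com/jobs/1"),
    (2, "Data Analyst", "https://example.com/jobs/2"),
]
# The second run repeats job 2 and adds job 3.
day_two = [
    (2, "Data Analyst", "https://example.com/jobs/2"),
    (3, "DevOps Engineer", "https://example.com/jobs/3"),
]

first = insert_rows(conn, day_one)
second = insert_rows(conn, day_two)
total = count_rows(conn)
print(first, second, total)
```

Only job 3 is added on the second run, because job 2 collides with an existing primary key and is ignored instead of raising an error.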

### Code (`db_connector.py`):

In [None]:
import sqlite3
import pandas as pd
from datetime import datetime

class DBConnector:
    def __init__(self, db_name="remote_jobs.db"):
        self.db_name = db_name
        self.conn = None
        self.cursor = None

    def connect(self):
        "Establishes a connection to the SQLite database."
        try:
            self.conn = sqlite3.connect(self.db_name)
            self.cursor = self.conn.cursor()
            print(f"Connected to database: {self.db_name}")
        except sqlite3.Error as e:
            print(f"Error connecting to database: {e}")

    def disconnect(self):
        "Closes the connection to the SQLite database."
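        if self.conn:
            self.conn.close()
            # Drop stale handles so later calls cannot reuse a closed connection.
            self.conn = None
            self.cursor = None
            print("Disconnected from database.")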

    def create_table(self):
        "Creates the jobs table if it doesn't exist."
        create_table_query = """
        CREATE TABLE IF NOT EXISTS remote_jobs (
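            id INTEGER PRIMARY KEY,  -- job ID supplied by the API, not auto-generated
            job_title TEXT NOT NULL,
            company_name TEXT NOT NULL,
            publication_date TEXT NOT NULL,
            job_type TEXT,
            category TEXT,
            candidate_required_location TEXT,
            salary_range TEXT,
            job_description TEXT,
            source_url TEXT UNIQUE NOT NULL,
            company_logo TEXT,
            job_board TEXT NOT NULL,
            ingestion_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP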
        );
        """
        try:
            self.cursor.execute(create_table_query)
            self.conn.commit()
            print("Jobs table created or already exists.")
        except sqlite3.Error as e:
            print(f"Error creating table: {e}")
                    