## **PROJECT - NOTEBOOK #1: Raw Data**

> 🚧 Run this notebook only once to migrate the data to your DB (it will take a few minutes).

This notebook sets up our data pipeline: it configures the environment, imports the essential libraries, downloads the raw CSV files and loads them into Pandas DataFrames, creates a SQLAlchemy engine, transfers the data into a PostgreSQL database, and verifies the transfer with a simple query.

---

### **Setting Environment**

In [6]:
import os 
print(os.getcwd())

try:
    os.chdir("../../project_etl")

except FileNotFoundError:
    print("""
        FileNotFoundError - The directory may not exist or you might not be in the specified path.
        If this has already worked, do not run this block again, as the current directory is already set to project_etl.
        """)
    
print(os.getcwd())

d:\U\FIFTH SEMESTER\ETL\project_etl

        FileNotFoundError - The directory may not exist or you might not be in the specified path.
        If this has already worked, do not run this block again, as the current directory is already set to project_etl.
        
d:\U\FIFTH SEMESTER\ETL\project_etl
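
The relative `os.chdir` call in this cell fails once the working directory has already been changed, which is why the cell cannot be re-run safely. A minimal sketch of an idempotent alternative follows. It assumes the project folder is named `project_etl` and is the current directory or one of its ancestors; `find_project_root` is a hypothetical helper, not part of the project.

```python
import os
from pathlib import Path


def find_project_root(start, name="project_etl"):
    """Return the closest directory called `name` at or above `start`, else None."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if candidate.name == name:
            return candidate
    return None


root = find_project_root(Path.cwd())
if root is not None:
    os.chdir(root)  # safe to repeat: the result is the same on every run
else:
    print("Could not find a 'project_etl' directory at or above", Path.cwd())
print(os.getcwd())
```

Searching upward by folder name, instead of stepping through `../..`, means the outcome does not depend on how many times the cell has been executed.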


### **Importing modules and libraries**

In [7]:
import pandas as pd
from functions.db_connection.connection import creating_engine
from sqlalchemy import text
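
The source of `creating_engine` is not shown in this notebook. As a rough illustration only, a helper of this kind might look like the sketch below. It assumes a PostgreSQL database reached through the `psycopg2` driver and credentials stored in environment variables; the variable names (`PG_USER`, `PG_PASSWORD`, `PG_HOST`, `PG_PORT`, `PG_DATABASE`) and the `build_db_url` function are hypothetical and may differ from the project's real module.

```python
import os
from urllib.parse import quote


def build_db_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables (names are assumptions)."""
    user = quote(env.get("PG_USER", "postgres"), safe="")
    # Percent-encode the password so characters such as '@' or '/' stay valid in a URL.
    password = quote(env.get("PG_PASSWORD", ""), safe="")
    host = env.get("PG_HOST", "localhost")
    port = env.get("PG_PORT", "5432")
    database = env.get("PG_DATABASE", "postgres")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"


def creating_engine():
    """Hypothetical stand-in for the project's helper of the same name."""
    # Imported lazily so build_db_url can be used without SQLAlchemy installed.
    from sqlalchemy import create_engine

    return create_engine(build_db_url())
```

Keeping credentials in the environment rather than in the notebook avoids committing them alongside the cell outputs.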

### **Ingest Data**

In [8]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("arshkon/linkedin-job-postings")
print("Path to dataset files:", path)

dataset_dir = path

dfs = {}

csv_files = {
    'jobs': os.path.join(dataset_dir, "postings.csv"),
    'benefits': os.path.join(dataset_dir, "jobs", "benefits.csv"),
    'companies': os.path.join(dataset_dir, "companies", "companies.csv"),
    'employee_counts': os.path.join(dataset_dir, "companies", "employee_counts.csv"), 
    
    # CSV files added alongside the main ones in order to build the fact table
    'industries': os.path.join(dataset_dir, "mappings", "industries.csv"), 
    'skills_industries': os.path.join(dataset_dir, "mappings", "skills.csv"),
    'salaries': os.path.join(dataset_dir, "jobs", "salaries.csv"),
}

for name, file_path in csv_files.items():
    if os.path.exists(file_path):
        dfs[name] = pd.read_csv(file_path)
        print(f"\nDataFrame '{name}' loaded successfully!")
        print(f"First 5 rows of '{name}':")
        print(dfs[name].head())
    else:
        print(f"Error: '{file_path}' not found in {dataset_dir}")
        # List the folder that should contain the file, if that folder exists
        parent_dir = os.path.dirname(file_path)
        if os.path.isdir(parent_dir):
            print("Available files:", os.listdir(parent_dir))

# Access a specific DataFrame (the postings data is stored under the 'jobs' key)
if 'jobs' in dfs:
    print("\nColumns in jobs DataFrame:")
    print(dfs['jobs'].columns)

Path to dataset files: C:\Users\sebas\.cache\kagglehub\datasets\arshkon\linkedin-job-postings\versions\13

DataFrame 'jobs' loaded successfully!
First 5 rows of 'jobs':
     job_id            company_name  \
0    921716   Corcoran Sawyer Smith   
1   1829192                     NaN   
2  10998357  The National Exemplar    
3  23221523  Abrams Fensterman, LLP   
4  35982263                     NaN   

                                               title  \
0                              Marketing Coordinator   
1                  Mental Health Therapist/Counselor   
2                        Assitant Restaurant Manager   
3  Senior Elder Law / Trusts and Estates Associat...   
4                                 Service Technician   

                                         description  max_salary pay_period  \
0  Job descriptionA leading real estate firm in N...        20.0     HOURLY   
1  At Aspen Therapy and Wellness , we are committ...        50.0     HOURLY   
2  The National Exempl
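
The ingest loop reports a missing file only when it reaches it, after earlier files have already been read. A minimal sketch of checking every expected path up front is shown below. The relative paths mirror the `csv_files` dictionary in the cell; `EXPECTED_FILES` and `missing_files` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Relative locations inside the downloaded dataset, mirroring `csv_files`.
EXPECTED_FILES = {
    "jobs": "postings.csv",
    "benefits": "jobs/benefits.csv",
    "companies": "companies/companies.csv",
    "employee_counts": "companies/employee_counts.csv",
    "industries": "mappings/industries.csv",
    "skills_industries": "mappings/skills.csv",
    "salaries": "jobs/salaries.csv",
}


def missing_files(dataset_dir, expected=EXPECTED_FILES):
    """Return {name: path} for every expected CSV that is not on disk."""
    base = Path(dataset_dir)
    return {
        name: base / relative
        for name, relative in expected.items()
        if not (base / relative).is_file()
    }
```

Calling `missing_files(path)` right after the download and stopping when the result is non-empty would surface an incomplete dataset before any DataFrame is built.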

### **Transfer data to DB in PostgreSQL**

In [9]:
engine = creating_engine()

In [10]:
try:
    with engine.connect() as connection:
        result = connection.execute(text('SELECT version();'))
        print(f'PostgreSQL Version: {result.fetchone()[0]}')
except Exception as e:
    print(f'Connection failed: {e}')

for name, df in dfs.items():
    try:
        df.to_sql(name.lower(), con=engine, schema='public', if_exists='replace', index=False)
        print(f"Table '{name}' successfully loaded into PostgreSQL!")
    except Exception as e:
        print(f"Error loading '{name}' into PostgreSQL: {e}")

try:
    with engine.connect() as connection:
        result = connection.execute(text("SELECT table_name FROM information_schema.tables WHERE table_schema = 'public';"))
        print('\nTables in PostgreSQL database:')
        for row in result:
            print(row[0])
except Exception as e:
    print(f'Verification failed: {e}')

PostgreSQL Version: PostgreSQL 16.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17), 64-bit
Table 'jobs' successfully loaded into PostgreSQL!
Table 'benefits' successfully loaded into PostgreSQL!
Table 'companies' successfully loaded into PostgreSQL!
Table 'employee_counts' successfully loaded into PostgreSQL!
Table 'industries' successfully loaded into PostgreSQL!
Table 'skills_industries' successfully loaded into PostgreSQL!
Table 'salaries' successfully loaded into PostgreSQL!

Tables in PostgreSQL database:
jobs
benefits
companies
employee_counts
industries
skills_industries
salaries
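
Listing table names confirms that the tables exist, but not that every row arrived. A possible extra check is to compare each DataFrame's length with the row count of its table, as in the sketch below. It uses an in-memory SQLite database as a stand-in so that it runs without a server; `load_and_verify`, the `chunksize` value, and the demo data are assumptions made for illustration, not part of the notebook.

```python
import sqlite3

import pandas as pd


def load_and_verify(frames, con):
    """Write each DataFrame to `con` and return {name: (rows_in_df, rows_in_table)}."""
    report = {}
    for name, df in frames.items():
        table = name.lower()
        # chunksize bounds the rows sent per batch; 10_000 is an arbitrary choice.
        df.to_sql(table, con=con, if_exists="replace", index=False, chunksize=10_000)
        loaded = pd.read_sql_query(f'SELECT COUNT(*) AS n FROM "{table}"', con)
        report[name] = (len(df), int(loaded["n"].iloc[0]))
    return report


# Tiny made-up frames, used only to exercise the function.
demo = {
    "jobs": pd.DataFrame({"job_id": [1, 2, 3]}),
    "salaries": pd.DataFrame({"job_id": [1]}),
}
con = sqlite3.connect(":memory:")
for name, (expected, actual) in load_and_verify(demo, con).items():
    status = "OK" if expected == actual else "MISMATCH"
    print(f"{name}: {actual}/{expected} rows {status}")
con.close()
```

With the notebook's own data, the same function could be called as `load_and_verify(dfs, engine)`; the `schema='public'` argument used in the cell would then need to be passed to `to_sql` as well.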
