## **PROJECT - NOTEBOOK #1: Raw Data**

This notebook sets up our data pipeline: it configures the environment, imports the essential libraries, creates a SQLAlchemy engine, loads the raw data from the CSV files into Pandas DataFrames, transfers that data into a MySQL database, and verifies the transfer with a simple query.

---

### **Setting Environment**

In [68]:
import os 
print(os.getcwd())

try:
    os.chdir("../../project_etl")

except FileNotFoundError:
    print("""
        FileNotFoundError - The directory may not exist or you might not be in the specified path.
        If this has already worked, do not run this block again, as the current directory is already set to workshop-001.
        """)
    
print(os.getcwd())

d:\U\FIFTH SEMESTER\ETL\project_etl

        FileNotFoundError - The directory may not exist or you might not be in the specified path.
        If this has already worked, do not run this block again, as the current directory is already set to workshop-001.
        
d:\U\FIFTH SEMESTER\ETL\project_etl
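
The relative `os.chdir("../../project_etl")` above only succeeds when the notebook starts exactly two levels below the project folder, which is why re-running the cell prints the warning. As a rough alternative, the sketch below walks up from a starting directory until it finds a folder with the project's name. The helper `find_project_root` is hypothetical and not part of this project.

```python
from pathlib import Path


def find_project_root(start, name="project_etl"):
    """Return the nearest directory called `name` at or above `start`, or None."""
    start = Path(start).resolve()
    # Check the starting directory itself, then each of its ancestors.
    for candidate in (start, *start.parents):
        if candidate.name == name:
            return candidate
    return None


# Possible use in the notebook (left commented out so the sketch has no side effects):
# root = find_project_root(Path.cwd())
# if root is not None:
#     os.chdir(root)
```

Because the lookup depends on the folder name rather than on the current depth, the cell could be re-run safely once the working directory is already `project_etl`.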


### **Importing modules and libraries**

In [69]:
import pandas as pd
from functions.db_connection.connection import creating_engine
from sqlalchemy import text

### **Ingest Data**

In [70]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("arshkon/linkedin-job-postings")
print("Path to dataset files:", path)

dataset_dir = path

dfs = {}

csv_files = {
    'jobs': os.path.join(dataset_dir, "postings.csv"),
    'benefits': os.path.join(dataset_dir, "jobs", "benefits.csv"),
    'companies': os.path.join(dataset_dir, "companies", "companies.csv"),
    'employee_counts': os.path.join(dataset_dir, "companies", "employee_counts.csv"), 
    
    # Additional CSV files, beyond the main ones, needed to build the fact table
    'industries': os.path.join(dataset_dir, "mappings", "industries.csv"), 
    'skills_industries': os.path.join(dataset_dir, "mappings", "skills.csv"),
    'salaries': os.path.join(dataset_dir, "jobs", "salaries.csv"),
}

for name, file_path in csv_files.items():
    if os.path.exists(file_path):
        dfs[name] = pd.read_csv(file_path)
        print(f"\nDataFrame '{name}' loaded successfully!")
        print(f"First 5 rows of '{name}':")
        print(dfs[name].head())
    else:
        print(f"Error: '{file_path}' not found")
        # List the folder that should contain the file, if that folder exists
        parent_dir = os.path.dirname(file_path)
        if os.path.isdir(parent_dir):
            print("Available files:", os.listdir(parent_dir))

# Access a specific DataFrame (postings.csv is stored under the 'jobs' key)
if 'jobs' in dfs:
    print("\nColumns in jobs DataFrame:")
    print(dfs['jobs'].columns)

Path to dataset files: C:\Users\sebas\.cache\kagglehub\datasets\arshkon\linkedin-job-postings\versions\13

DataFrame 'jobs' loaded successfully!
First 5 rows of 'jobs':
     job_id            company_name  \
0    921716   Corcoran Sawyer Smith   
1   1829192                     NaN   
2  10998357  The National Exemplar    
3  23221523  Abrams Fensterman, LLP   
4  35982263                     NaN   

                                               title  \
0                              Marketing Coordinator   
1                  Mental Health Therapist/Counselor   
2                        Assitant Restaurant Manager   
3  Senior Elder Law / Trusts and Estates Associat...   
4                                 Service Technician   

                                         description  max_salary pay_period  \
0  Job descriptionA leading real estate firm in N...        20.0     HOURLY   
1  At Aspen Therapy and Wellness , we are committ...        50.0     HOURLY   
2  The National Exempl
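
The loading loop above checks each path with `os.path.exists` while it reads. A small sketch of doing that check up front is shown here, so that missing files are reported together before any CSV is read. The helper `split_by_existence` is a hypothetical name introduced only for illustration.

```python
import os


def split_by_existence(paths_by_name):
    """Split a {name: path} mapping into (found, missing) dictionaries."""
    found, missing = {}, {}
    for name, path in paths_by_name.items():
        # os.path.exists is the same check the notebook's loop uses.
        target = found if os.path.exists(path) else missing
        target[name] = path
    return found, missing
```

In the notebook, `found, missing = split_by_existence(csv_files)` could run before the loop, with `pd.read_csv` applied only to the entries in `found`.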

### **Transfer data to the MySQL database**

In [71]:
engine = creating_engine()
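
The body of `creating_engine` is not shown in this notebook. The sketch below illustrates one way a connection URL for the `mysql.connector` driver might be assembled from environment variables. The function `build_mysql_url` and the variable names `DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT` and `DB_NAME` are assumptions made for this example, not the project's actual configuration.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(env=os.environ):
    """Build a SQLAlchemy-style MySQL URL from environment-like settings.

    The variable names are hypothetical; adapt them to the real configuration.
    """
    # Percent-encode credentials so characters such as '@' or '/' do not break the URL.
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "3306")
    database = env["DB_NAME"]
    return f"mysql+mysqlconnector://{user}:{password}@{host}:{port}/{database}"
```

A URL built this way could be passed to `sqlalchemy.create_engine`. Reading credentials from the environment keeps them out of the notebook and makes a wrong user or password easier to spot.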

In [72]:
try:
    with engine.connect() as connection:
        result = connection.execute(text("SELECT VERSION()"))
        print(f"MySQL Version: {result.fetchone()[0]}")
except Exception as e:
    print(f"Connection failed: {e}")

# Send all DataFrames to MySQL
for name, df in dfs.items():
    try:
        df.to_sql(name, con=engine, if_exists='replace', index=False)
        print(f"Table '{name}' successfully loaded into MySQL!")
    except Exception as e:
        print(f"Error loading '{name}' into MySQL: {e}")

# Verify tables
with engine.connect() as connection:
    result = connection.execute(text("SHOW TABLES"))
    print("\nTables in MySQL database:")
    for row in result:
        print(row[0])

Connection failed: (mysql.connector.errors.ProgrammingError) 1045 (28000): Access denied for user 'mysql'@'localhost' (using password: YES)
(Background on this error at: https://sqlalche.me/e/20/f405)
Error loading 'jobs' into MySQL: (mysql.connector.errors.ProgrammingError) 1045 (28000): Access denied for user 'mysql'@'localhost' (using password: YES)
(Background on this error at: https://sqlalche.me/e/20/f405)
Error loading 'benefits' into MySQL: (mysql.connector.errors.ProgrammingError) 1045 (28000): Access denied for user 'mysql'@'localhost' (using password: YES)
(Background on this error at: https://sqlalche.me/e/20/f405)
Error loading 'companies' into MySQL: (mysql.connector.errors.ProgrammingError) 1045 (28000): Access denied for user 'mysql'@'localhost' (using password: YES)
(Background on this error at: https://sqlalche.me/e/20/f405)
Error loading 'employee_counts' into MySQL: (mysql.connector.errors.ProgrammingError) 1045 (28000): Access denied for user 'mysql'@'localhost' (u

ProgrammingError: (mysql.connector.errors.ProgrammingError) 1045 (28000): Access denied for user 'mysql'@'localhost' (using password: YES)
(Background on this error at: https://sqlalche.me/e/20/f405)
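
In the output above, the connectivity check fails with error 1045 (access denied), yet every `to_sql` call is still attempted and fails for the same reason, and the unguarded `SHOW TABLES` block finally raises. The sketch below shows a fail-fast variant, assuming the check and the per-table load are passed in as callables. `run_load`, `check_connection` and `load_table` are hypothetical names.

```python
def run_load(check_connection, load_table, table_names):
    """Load tables only if the connection check succeeds; return the loaded names."""
    try:
        check_connection()
    except Exception as exc:
        # Stop here: every later step would fail with the same error.
        print(f"Connection failed, skipping load: {exc}")
        return []

    loaded = []
    for name in table_names:
        try:
            load_table(name)
            loaded.append(name)
        except Exception as exc:
            print(f"Error loading '{name}': {exc}")
    return loaded
```

With the notebook's objects, this might be called as `run_load(lambda: engine.connect().close(), lambda n: dfs[n].to_sql(n, con=engine, if_exists='replace', index=False), dfs)`, so that an access-denied error is reported once rather than once per table.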