## **PROJECT - NOTEBOOK #3: Data Cleansing and Transformation**

---

In [None]:
import os 
print(os.getcwd())

try:
    os.chdir("../../project_etl")

except FileNotFoundError:
    print("""
        FileNotFoundError - The directory may not exist or you might not be in the specified path.
        If this has already worked, do not run this block again, as the current directory is already set to project_etl.
        """)
    
print(os.getcwd())

d:\U\FIFTH SEMESTER\ETL\project_etl\notebooks
d:\U\FIFTH SEMESTER\ETL\project_etl


In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import logging
from sqlalchemy import text
from src.database.db_connection import create_gcp_engine

plt.style.use('ggplot')

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

In [5]:
try:
    engine = create_gcp_engine()
    logger.info("Successfully connected to GCP database")
except Exception as e:
    logger.error(f"Failed to connect to GCP database: {str(e)}")
    raise

INFO:src.database.db_connection:Successfully created GCP database engine
INFO:__main__:Successfully connected to GCP database


In [6]:
jobs_df = pd.read_sql("SELECT * FROM raw.jobs", con=engine)
salaries_df = pd.read_sql("SELECT * FROM raw.salaries", con=engine)
benefits_df = pd.read_sql("SELECT * FROM raw.benefits", con=engine)
employee_counts_df = pd.read_sql("SELECT * FROM raw.employee_counts", con=engine)
industries_df = pd.read_sql("SELECT * FROM raw.industries", con=engine)
skills_industries_df = pd.read_sql("SELECT * FROM raw.skills_industries", con=engine)
companies_df = pd.read_sql("SELECT * FROM raw.companies", con=engine)

logger.info("DataFrames loaded from raw schema of project-etl database.")

INFO:__main__:DataFrames loaded from raw schema of project-etl database.


## Data Cleaning (jobs_df)

Now that we know how we're going to handle the data, thanks to notebook `02_read_data.ipynb`, it's time to start cleaning and transforming the data.

We'll begin by deleting the columns defined previously:

+ `closed_time`
+ `skills_desc`
+ `med_salary`, `max_salary`, `min_salary`
+ `compensation_type`
+ `listed_time`, `expiry`
+ `fips` 
+ `work_type`
+ `applies`
+ `application_url`
+ `posting_domain`

In [7]:
cols_to_drop = ['med_salary', 'work_type', 'applies', 'closed_time', 'skills_desc', 'max_salary', 'min_salary', 'fips', 'listed_time', 'expiry', 'compensation_type', 'application_url', 'posting_domain']
jobs_df.drop(columns=cols_to_drop, inplace=True, errors='ignore')
logger.info("Dropped unnecessary columns from jobs_df.")

INFO:__main__:Dropped unnecessary columns from jobs_df.


1. The dates are stored as floats (epoch milliseconds), which makes them hard to read. That's why we're going to convert them to a datetime format.

2. The postal code is a float, just like the dates, but we'll change this one to a string so it can hold different postal code values and formats.

3. Some fields, such as the IDs and the views, are stored as floats, which is an error, so we will cast them from float to int.

4. Some jobs do not specify whether remote work is allowed. In our case `remote_allowed` is a float, which we will convert to a boolean so that jobs that do not allow, or do not specify, remote work are `False`.

5. The `pay_period` column shows how often employees are paid, but the normalized salary generally expresses an annual salary, so any `pay_period` value other than `YEARLY` will be converted to that value.

In [8]:
columns_to_replace_not_specified = ["zip_code", "formatted_experience_level"]
jobs_df["zip_code"] = jobs_df["zip_code"].astype(str)
jobs_df[columns_to_replace_not_specified] = jobs_df[columns_to_replace_not_specified].replace(["nan", None], "No specified")
jobs_df["original_listed_time"] = pd.to_datetime(jobs_df["original_listed_time"], unit="ms")
jobs_df["company_id"] = jobs_df["company_id"].fillna(-1).astype(int)
jobs_df["views"] = jobs_df["views"].fillna(0).astype(int)
jobs_df["remote_allowed"] = jobs_df["remote_allowed"].fillna(0).astype(bool)

In [9]:
columns_to_replace = ["currency", "pay_period"]
jobs_df[columns_to_replace] = jobs_df[columns_to_replace].replace([None, pd.NA], "Unknown")

logger.info("Transformed data types and handled missing values in jobs_df.")

INFO:__main__:Transformed data types and handled missing values in jobs_df.
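
Casting a float column straight to `str` keeps the float formatting (for example `8540.0`). The following is a minimal sketch, on toy data, of one way to get a clean string for item 2. It assumes US-style five-digit ZIP codes, which the notebook does not state, and the `toy` frame is purely illustrative.

```python
import pandas as pd

# Toy frame standing in for jobs_df; the values are illustrative only.
toy = pd.DataFrame({"zip_code": [8540.0, 80521.0, None]})

# Go through the nullable Int64 dtype first so the ".0" suffix is dropped,
# then cast to string and left-pad to five digits.
# Assumption: codes are US-style five-digit ZIP codes.
zip_as_str = (
    toy["zip_code"]
    .astype("Int64")
    .astype("string")
    .str.zfill(5)
    .fillna("No specified")  # same placeholder the notebook uses
)
print(zip_as_str.tolist())
```

Padding restores the leading zero that the float representation loses. If the data contains non-US postal codes, the `zfill` step should be left out.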


In [10]:
jobs_df.head()

Unnamed: 0,job_id,company_name,title,description,pay_period,location,company_id,views,formatted_work_type,original_listed_time,remote_allowed,job_posting_url,application_type,formatted_experience_level,sponsored,currency,normalized_salary,zip_code
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,HOURLY,"Princeton, NJ",2774458,20,Full-time,2024-04-17 23:45:08,False,https://www.linkedin.com/jobs/view/921716/?trk...,ComplexOnsiteApply,No specified,0,USD,38480.0,8540.0
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",HOURLY,"Fort Collins, CO",-1,1,Full-time,2024-04-11 17:51:27,False,https://www.linkedin.com/jobs/view/1829192/?tr...,ComplexOnsiteApply,No specified,0,USD,83200.0,80521.0
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,YEARLY,"Cincinnati, OH",64896719,8,Full-time,2024-04-16 14:26:54,False,https://www.linkedin.com/jobs/view/10998357/?t...,ComplexOnsiteApply,No specified,0,USD,55000.0,45202.0
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,YEARLY,"New Hyde Park, NY",766262,16,Full-time,2024-04-12 04:23:32,False,https://www.linkedin.com/jobs/view/23221523/?t...,ComplexOnsiteApply,No specified,0,USD,157500.0,11040.0
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,YEARLY,"Burlington, IA",-1,3,Full-time,2024-04-18 14:52:23,False,https://www.linkedin.com/jobs/view/35982263/?t...,ComplexOnsiteApply,No specified,0,USD,70000.0,52601.0


Here we see that `job_id` has very long values. We keep that column, since it is the job's real identifier, but we add a shorter one called `job_id_modify` to make searches, graphs, and queries simpler. The same applies to the company identifier, through `company_id_modify`.

In [11]:
jobs_df["job_id_modify"] = range(1, len(jobs_df) + 1)
jobs_df["company_id_modify"] = range(1, len(jobs_df) + 1)
logger.info("Added modified IDs to jobs_df.")

INFO:__main__:Added modified IDs to jobs_df.
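
Assigning `range(1, len(jobs_df) + 1)` to `company_id_modify` gives every row its own value, even when two jobs share a company. If the intent is one short ID per company, `pd.factorize` is one option. This is a sketch on toy data, not the notebook's method.

```python
import pandas as pd

# Toy frame: two jobs share company 2774458, two have the -1 placeholder.
toy = pd.DataFrame({"company_id": [2774458, -1, 64896719, 2774458, -1]})

# pd.factorize returns integer codes starting at 0, numbered in order of
# first appearance; adding 1 makes the surrogate IDs start at 1.
codes, uniques = pd.factorize(toy["company_id"])
toy["company_id_modify"] = codes + 1
print(toy)
```

Rows with the same `company_id` now share a `company_id_modify`, so the short ID can be used for grouping and joins.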


In [12]:
jobs_df.isnull().sum()

job_id                            0
company_name                   1719
title                             0
description                       7
pay_period                        0
location                          0
company_id                        0
views                             0
formatted_work_type               0
original_listed_time              0
remote_allowed                    0
job_posting_url                   0
application_type                  0
formatted_experience_level        0
sponsored                         0
currency                          0
normalized_salary             87776
zip_code                          0
job_id_modify                     0
company_id_modify                 0
dtype: int64

In [13]:
jobs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123849 entries, 0 to 123848
Data columns (total 20 columns):
 #   Column                      Non-Null Count   Dtype         
---  ------                      --------------   -----         
 0   job_id                      123849 non-null  int64         
 1   company_name                122130 non-null  object        
 2   title                       123849 non-null  object        
 3   description                 123842 non-null  object        
 4   pay_period                  123849 non-null  object        
 5   location                    123849 non-null  object        
 6   company_id                  123849 non-null  int64         
 7   views                       123849 non-null  int64         
 8   formatted_work_type         123849 non-null  object        
 9   original_listed_time        123849 non-null  datetime64[ns]
 10  remote_allowed              123849 non-null  bool          
 11  job_posting_url             123849 non-

In [14]:
pd.options.display.float_format = '{:.2f}'.format
jobs_df.describe()

Unnamed: 0,job_id,company_id,views,original_listed_time,sponsored,normalized_salary,job_id_modify,company_id_modify
count,123849.0,123849.0,123849.0,123849,123849.0,36073.0,123849.0,123849.0
mean,3896402138.07,12034820.09,14.42,2024-04-15 03:38:58.799942144,0.0,205327.04,61925.0,61925.0
min,921716.0,-1.0,0.0,2023-12-05 21:08:53,0.0,0.0,1.0,1.0
25%,3894586595.0,12979.0,3.0,2024-04-11 19:14:36,0.0,52000.0,30963.0,30963.0
50%,3901998406.0,204770.0,4.0,2024-04-17 23:03:59,0.0,81500.0,61925.0,61925.0
75%,3904707077.0,7260866.0,7.0,2024-04-18 22:12:04,0.0,125000.0,92887.0,92887.0
max,3906267224.0,103472979.0,9975.0,2024-04-20 00:26:43,0.0,535600000.0,123849.0,123849.0
std,84043545.16,25403872.0,85.33,,0.0,5097626.76,35752.27,35752.27


With `describe`, we see that there is a value in `normalized_salary` that is unusually high:  

**535,600,000.00**  

The 75th percentile is **125,000**, which suggests that few values lie far above that number. Rather than rely on that impression, we apply a statistical rule: we compute the interquartile range (IQR) and cap every value above Q3 + 1.5 × IQR.



In [15]:
q1 = jobs_df["normalized_salary"].quantile(0.25)
q3 = jobs_df["normalized_salary"].quantile(0.75)
iqr = q3 - q1
upper_cap = q3 + 1.5 * iqr

jobs_df["normalized_salary"] = jobs_df["normalized_salary"].clip(upper=upper_cap)
logger.info("Capped outliers in normalized_salary for jobs_df.")

INFO:__main__:Capped outliers in normalized_salary for jobs_df.


In [16]:
jobs_df["normalized_salary"].describe()

count    36073.00
mean     93505.17
std      52147.55
min          0.00
25%      52000.00
50%      81500.00
75%     125000.00
max     234500.00
Name: normalized_salary, dtype: float64

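After capping, the maximum of `normalized_salary` is 234,500, which is exactly Q3 + 1.5 × IQR = 125,000 + 1.5 × (125,000 − 52,000). The quartiles are unchanged, and the extreme value no longer distorts the mean and standard deviation.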
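
Clipping overwrites the extreme values in place, so the capped rows can no longer be told apart afterwards. This sketch, on toy data, applies the same IQR rule but records a flag first. The helper `iqr_upper_cap` and the `salary_was_capped` column are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

def iqr_upper_cap(series: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR; quantile() skips NaN values."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return q3 + k * (q3 - q1)

# Toy values chosen so the quartiles match the ones reported by describe().
toy = pd.DataFrame(
    {"normalized_salary": [40_000.0, 52_000.0, 81_500.0, 125_000.0, np.nan, 535_600_000.0]}
)

cap = iqr_upper_cap(toy["normalized_salary"])
# Flag rows above the cap before overwriting them (NaN compares as False).
toy["salary_was_capped"] = toy["normalized_salary"] > cap
toy["normalized_salary"] = toy["normalized_salary"].clip(upper=cap)
print(cap)
print(toy)
```

Keeping the flag makes it possible to exclude or review the capped rows later, instead of treating the cap value as a real salary.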

# Data Cleaning (salaries_df)

In [17]:
salaries_df.head()

Unnamed: 0,salary_id,job_id,max_salary,med_salary,min_salary,pay_period,currency,compensation_type
0,1,3884428798,,20.0,,HOURLY,USD,BASE_SALARY
1,2,3887470552,25.0,,23.0,HOURLY,USD,BASE_SALARY
2,3,3884431523,120000.0,,100000.0,YEARLY,USD,BASE_SALARY
3,4,3884911725,200000.0,,10000.0,YEARLY,USD,BASE_SALARY
4,5,3887473220,35.0,,33.0,HOURLY,USD,BASE_SALARY


1. We know that most values in `min_salary`, `med_salary`, and `max_salary` are null, so we will perform some operations on the rows to normalize these values and store them in a column called `raw_salary`.

2. We know that salary periods are given in different time frames (`HOURLY`, `WEEKLY`, `BIWEEKLY`, `YEARLY`, etc.). We will standardize them to yearly values by assuming the employees' working hours for `HOURLY`, the number of weeks in a year for `WEEKLY`, and so on.

3. Since `compensation_type` marks these amounts as `BASE_SALARY`, all calculations use the base salary. Overtime wages will not be an issue, as they are not included in the base salary.
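
The yearly standardization described in item 2 could be sketched as follows. The conversion factors are assumptions (2,080 working hours, 52 weeks, 26 biweekly periods, and 12 months per year) and are not taken from the notebook. `ANNUAL_FACTORS`, `annualize`, and `annual_salary` are hypothetical names.

```python
import pandas as pd

# Assumed multipliers to convert an amount per pay period into a yearly amount.
ANNUAL_FACTORS = {
    "HOURLY": 2080,   # assumption: 40 hours/week * 52 weeks
    "WEEKLY": 52,
    "BIWEEKLY": 26,
    "MONTHLY": 12,
    "YEARLY": 1,
}

def annualize(amount: pd.Series, pay_period: pd.Series) -> pd.Series:
    """Multiply each amount by the factor for its pay period.

    Pay periods missing from ANNUAL_FACTORS map to NaN, so the result is
    NaN rather than a silently wrong number.
    """
    factors = pay_period.map(ANNUAL_FACTORS)
    return amount * factors

toy = pd.DataFrame({
    "raw_salary": [20.0, 110_000.0, 1_500.0, 50.0],
    "pay_period": ["HOURLY", "YEARLY", "BIWEEKLY", "ONCE"],
})
toy["annual_salary"] = annualize(toy["raw_salary"], toy["pay_period"])
print(toy)
```

Leaving unknown pay periods as NaN keeps them visible for a later decision, instead of guessing a factor.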


In [18]:
import numpy as np
def get_unified_salary(row):
    if not pd.isna(row['med_salary']):
        return row['med_salary']
    
    min_sal = row['min_salary']
    max_sal = row['max_salary']
    
    if not pd.isna(min_sal) and not pd.isna(max_sal):
        return (min_sal + max_sal) / 2
    
    if not pd.isna(min_sal):
        return min_sal
    if not pd.isna(max_sal):
        return max_sal
    
    return np.nan

salaries_df['raw_salary'] = salaries_df.apply(get_unified_salary, axis=1)
logger.info("Created raw_salary column in salaries_df.")

INFO:__main__:Created raw_salary column in salaries_df.


In [19]:
salaries_df.head()

Unnamed: 0,salary_id,job_id,max_salary,med_salary,min_salary,pay_period,currency,compensation_type,raw_salary
0,1,3884428798,,20.0,,HOURLY,USD,BASE_SALARY,20.0
1,2,3887470552,25.0,,23.0,HOURLY,USD,BASE_SALARY,24.0
2,3,3884431523,120000.0,,100000.0,YEARLY,USD,BASE_SALARY,110000.0
3,4,3884911725,200000.0,,10000.0,YEARLY,USD,BASE_SALARY,105000.0
4,5,3887473220,35.0,,33.0,HOURLY,USD,BASE_SALARY,34.0


In [20]:
salaries_df.describe()

Unnamed: 0,salary_id,job_id,max_salary,med_salary,min_salary,raw_salary
count,40785.0,40785.0,33947.0,6838.0,33947.0,40785.0
mean,20393.0,3895563848.87,96209.87,21370.3,65085.41,70709.22
std,11773.76,94966718.0,658737.34,51338.56,465061.24,513146.76
min,1.0,921716.0,1.0,0.0,1.0,0.0
25%,10197.0,3894608085.0,50.0,18.5,39.0,30.0
50%,20393.0,3901980104.0,85000.0,25.0,62300.0,60000.0
75%,30589.0,3904576109.0,142500.0,2207.0,100000.0,113670.0
max,40785.0,3906267224.0,120000000.0,750000.0,85000000.0,102500000.0


1. A `raw_salary` column is created based on the other three salary values according to the performed operations.

2. The maximum salary values are extreme: for example, `max_salary` peaks at 120,000,000 against a mean of about 96,210. To address this, we cap each salary column at an upper bound of Q3 + 1.5 × IQR.

**Note:** This was not applied to `jobs_df` because the number of null values was too high.


In [22]:
salary_columns = ['max_salary', 'med_salary', 'min_salary', 'raw_salary']

def cap_outliers(df, column):
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    upper_cap = q3 + 1.5 * iqr
    df[column] = df[column].clip(upper=upper_cap)
    return df

for col in salary_columns:
    salaries_df = cap_outliers(salaries_df, col)
    
logger.info("Capped outliers in salary columns for salaries_df.")

INFO:__main__:Capped outliers in salary columns for salaries_df.


In [23]:
salaries_df.describe()

Unnamed: 0,salary_id,job_id,max_salary,med_salary,min_salary,raw_salary
count,40785.0,40785.0,33947.0,6838.0,33947.0,40785.0
mean,20393.0,3895563848.87,90263.02,1372.17,61934.4,66908.08
std,11773.76,94966718.0,88349.98,2306.1,58244.6,71286.75
min,1.0,921716.0,1.0,0.0,1.0,0.0
25%,10197.0,3894608085.0,50.0,18.5,39.0,30.0
50%,20393.0,3901980104.0,85000.0,25.0,62300.0,60000.0
75%,30589.0,3904576109.0,142500.0,2207.0,100000.0,113670.0
max,40785.0,3906267224.0,356175.0,5489.75,249941.5,284130.0


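After capping, each column's maximum equals its own Q3 + 1.5 × IQR bound (for example, 284,130 for `raw_salary`), so extreme values no longer dominate the mean and standard deviation.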
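
The salary columns mix hourly and yearly amounts, so a single bound per column is driven mostly by the yearly rows. One possible refinement, sketched on toy data, is to compute the bound separately for each `pay_period`. The function `cap_outliers_by_group` is hypothetical and is not what the notebook does.

```python
import pandas as pd

def cap_outliers_by_group(df: pd.DataFrame, column: str, group_col: str, k: float = 1.5) -> pd.DataFrame:
    """Clip `column` at Q3 + k * IQR, computed within each group of `group_col`."""
    grouped = df.groupby(group_col)[column]
    # transform broadcasts each group's quantile back onto its rows.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    upper = q3 + k * (q3 - q1)
    out = df.copy()
    out[column] = out[column].clip(upper=upper)
    return out

toy = pd.DataFrame({
    "pay_period": ["HOURLY"] * 5 + ["YEARLY"] * 5,
    "raw_salary": [20.0, 22.0, 24.0, 26.0, 500.0,
                   50_000.0, 60_000.0, 70_000.0, 80_000.0, 2_000_000.0],
})
capped = cap_outliers_by_group(toy, "raw_salary", "pay_period")
print(capped.groupby("pay_period")["raw_salary"].max())
```

With per-group bounds, an implausible hourly rate is capped against other hourly rates rather than against yearly salaries.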

So far, we have completed `salaries_df`.


# Data Cleaning (benefits_df)

In [24]:
benefits_df.head()

Unnamed: 0,job_id,inferred,type
0,3887473071,0,Medical insurance
1,3887473071,0,Vision insurance
2,3887473071,0,Dental insurance
3,3887473071,0,401(k)
4,3887473071,0,Student loan assistance


1. Here, we see that the same `job_id` is repeated once per benefit. Since those rows belong to the same job posting, we need to unify them into a single row.

2. We will also remove the `inferred` column, as it does not seem relevant.


In [25]:
if 'inferred' in benefits_df.columns:
    benefits_df = benefits_df.drop(columns=['inferred'])

benefits_df = benefits_df.groupby('job_id')['type'].apply(list).reset_index()
logger.info("Grouped benefits by job_id and dropped inferred column in benefits_df.")

INFO:__main__:Grouped benefits by job_id and dropped inferred column in benefits_df.


In [26]:
benefits_df.head()

Unnamed: 0,job_id,type
0,23221523,[401(k)]
1,56482768,"[401(k), Dental insurance, Disability insurance]"
2,69333422,"[Medical insurance, Vision insurance, Dental i..."
3,95428182,"[Medical insurance, Dental insurance, Disabili..."
4,111513530,"[Medical insurance, Paid maternity leave, Pens..."


1. We group the benefits into a list and remove the `inferred` column.
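
The grouping step can be reproduced on a small frame to see its effect. The sketch below also adds a comma-separated text column, `type_joined`, which is a hypothetical extra for targets that cannot store list values. The notebook itself stores the list.

```python
import pandas as pd

# Toy rows in the same shape as benefits_df.
toy = pd.DataFrame({
    "job_id": [3887473071, 3887473071, 3887473071, 23221523],
    "inferred": [0, 0, 0, 0],
    "type": ["Medical insurance", "Vision insurance", "Dental insurance", "401(k)"],
})

# One row per job_id, with all benefit types collected into a list.
grouped = (
    toy.drop(columns=["inferred"])
    .groupby("job_id")["type"]
    .agg(list)
    .reset_index()
)

# Optional flat-text variant of the same information.
grouped["type_joined"] = grouped["type"].str.join(", ")
print(grouped)
```

`groupby` sorts by `job_id`, so the output order differs from the input order, as it does in the notebook's own result.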



# Data Cleaning (employee_counts)


In [27]:
employee_counts_df.head()

Unnamed: 0,company_id,employee_count,follower_count,time_recorded
0,391906,186,32508,1712346173
1,22292832,311,4471,1712346173
2,20300,1053,6554,1712346173
3,3570660,383,35241,1712346173
4,878353,52,26397,1712346173


In [28]:
employee_counts_df.isnull().sum()

company_id        0
employee_count    0
follower_count    0
time_recorded     0
dtype: int64

There are no null values in the table, but we notice that `time_recorded` is in an unusual format (it looks like a Unix timestamp in seconds), and we are unsure of its actual meaning. Therefore, we will convert it into a readable format.

In [29]:
employee_counts_df["time_recorded"] = pd.to_datetime(employee_counts_df["time_recorded"], unit="s").dt.date
logger.info("Converted time_recorded to readable date format in employee_counts_df.")

INFO:__main__:Converted time_recorded to readable date format in employee_counts_df.


In [30]:
employee_counts_df.head()

Unnamed: 0,company_id,employee_count,follower_count,time_recorded
0,391906,186,32508,2024-04-05
1,22292832,311,4471,2024-04-05
2,20300,1053,6554,2024-04-05
3,3570660,383,35241,2024-04-05
4,878353,52,26397,2024-04-05


Now, the date is in a readable format.
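
As a quick check of the conversion, the value shown in the table above can be converted on its own. This sketch assumes, as the notebook does, that the value is a Unix timestamp in seconds (UTC).

```python
import datetime

import pandas as pd

# The time_recorded value shown in employee_counts_df.head().
raw = pd.Series([1712346173])

# unit="s" interprets the integers as seconds since 1970-01-01 UTC.
as_timestamp = pd.to_datetime(raw, unit="s")

# .dt.date drops the time of day and returns datetime.date objects.
as_date = as_timestamp.dt.date
print(as_timestamp.iloc[0], as_date.iloc[0])
```

Note that `.dt.date` produces an `object` column of Python dates; keeping the `datetime64` column (or using `.dt.normalize()`) preserves a native datetime dtype if later steps need one.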


# **Data Cleaning (industries_df)**

In [31]:
industries_df.head()

Unnamed: 0,industry_id,industry_name
0,1,Defense and Space Manufacturing
1,3,Computer Hardware Manufacturing
2,4,Software Development
3,5,Computer Networking Products
4,6,"Technology, Information and Internet"


In [32]:
industries_df.isnull().sum()

industry_id       0
industry_name    34
dtype: int64

There are 34 null values in `industry_name`, so we will convert them to `"Unknown"`.

In [33]:
industries_df["industry_name"] = industries_df["industry_name"].replace([None, pd.NA], "Unknown")
logger.info("Replaced null values in industry_name with 'Unknown' in industries_df.")

INFO:__main__:Replaced null values in industry_name with 'Unknown' in industries_df.


In [34]:
industries_df.isnull().sum()

industry_id      0
industry_name    0
dtype: int64

# Data Cleaning (skills_industries)

In [35]:
skills_industries_df.head()

Unnamed: 0,skill_abr,skill_name
0,ART,Art/Creative
1,DSGN,Design
2,ADVR,Advertising
3,PRDM,Product Management
4,DIST,Distribution


# Data Cleaning (companies)

In [36]:
companies_df.head()

Unnamed: 0,company_id,name,description,company_size,state,country,city,zip_code,address,url
0,1009,IBM,"At IBM, we do more than work. We create. We cr...",7.0,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm
1,1016,GE HealthCare,Every day millions of people feel the impact o...,7.0,0,US,Chicago,0,-,https://www.linkedin.com/company/gehealthcare
2,1025,Hewlett Packard Enterprise,Official LinkedIn of Hewlett Packard Enterpris...,7.0,Texas,US,Houston,77389,1701 E Mossy Oaks Rd Spring,https://www.linkedin.com/company/hewlett-packa...
3,1028,Oracle,We’re a cloud technology company that provides...,7.0,Texas,US,Austin,78741,2300 Oracle Way,https://www.linkedin.com/company/oracle
4,1033,Accenture,Accenture is a leading global professional ser...,7.0,0,IE,Dublin 2,0,Grand Canal Harbour,https://www.linkedin.com/company/accenture


In [37]:
companies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24473 entries, 0 to 24472
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   company_id    24473 non-null  int64  
 1   name          24472 non-null  object 
 2   description   24176 non-null  object 
 3   company_size  21699 non-null  float64
 4   state         24451 non-null  object 
 5   country       24473 non-null  object 
 6   city          24472 non-null  object 
 7   zip_code      24445 non-null  object 
 8   address       24451 non-null  object 
 9   url           24473 non-null  object 
dtypes: float64(1), int64(1), object(8)
memory usage: 1.9+ MB


In [38]:
companies_df.isnull().sum()

company_id         0
name               1
description      297
company_size    2774
state             22
country            0
city               1
zip_code          28
address           22
url                0
dtype: int64

In this case, all data types are correct, so we will only perform a few operations on them.

1. Only about 11% of the `company_size` values are null (2,774 of 24,473), so we can fill them with the median.

2. Null descriptions can be filled with a placeholder (`"No description"`).

3. Null `zip_code`, `state`, and `city` values can be set to `"Unknown"`, and null addresses to `"No specific address"`.

4. The single row with a null company name can be removed.


In [39]:
companies_df.fillna({
    'zip_code': 'Unknown',
    'state': 'Unknown',
    'company_size': companies_df['company_size'].median(),
    'description': 'No description',
    'address': 'No specific address',
    'city': 'Unknown'
}, inplace=True)

companies_df.dropna(subset=['name'], inplace=True)

companies_df['company_size'] = companies_df['company_size'].astype(int)

companies_df['zip_code'] = companies_df['zip_code'].replace('0', 'Unknown')
companies_df['state'] = companies_df['state'].replace('0', 'Unknown')

logger.info("Handled missing values and corrected data types in companies_df.")

INFO:__main__:Handled missing values and corrected data types in companies_df.


In [40]:
companies_df.head()

Unnamed: 0,company_id,name,description,company_size,state,country,city,zip_code,address,url
0,1009,IBM,"At IBM, we do more than work. We create. We cr...",7,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm
1,1016,GE HealthCare,Every day millions of people feel the impact o...,7,Unknown,US,Chicago,Unknown,-,https://www.linkedin.com/company/gehealthcare
2,1025,Hewlett Packard Enterprise,Official LinkedIn of Hewlett Packard Enterpris...,7,Texas,US,Houston,77389,1701 E Mossy Oaks Rd Spring,https://www.linkedin.com/company/hewlett-packa...
3,1028,Oracle,We’re a cloud technology company that provides...,7,Texas,US,Austin,78741,2300 Oracle Way,https://www.linkedin.com/company/oracle
4,1033,Accenture,Accenture is a leading global professional ser...,7,Unknown,IE,Dublin 2,Unknown,Grand Canal Harbour,https://www.linkedin.com/company/accenture


This way, our data remains consistent.
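
A small check can confirm that no nulls remain after cleaning. The helper `remaining_nulls` below is a hypothetical name and is shown on toy data. Applied to the real frame, `remaining_nulls(companies_df)` would be expected to return an empty dict.

```python
import pandas as pd

def remaining_nulls(df: pd.DataFrame) -> dict:
    """Return {column: null_count} for the columns that still contain nulls."""
    counts = df.isnull().sum()
    return {col: int(n) for col, n in counts.items() if n > 0}

# Toy frame: one row is missing its company name.
toy = pd.DataFrame({
    "name": ["IBM", None],
    "state": ["NY", "Unknown"],
    "company_size": [7, 7],
})

before = remaining_nulls(toy)
after = remaining_nulls(toy.dropna(subset=["name"]))
print(before, after)
```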

In [41]:
dataframes_to_load = {
    'jobs': jobs_df,
    'salaries': salaries_df,
    'benefits': benefits_df,
    'employee_counts': employee_counts_df,
    'industries': industries_df,
    'skills_industries': skills_industries_df,
    'companies': companies_df
}

# Load each DataFrame into the 'cleaned' schema
for table_name, df in dataframes_to_load.items():
    try:
        with engine.begin() as connection:
            df.to_sql(table_name, connection, schema='cleaned', if_exists='replace', index=False)
            logger.info(f"Successfully loaded {table_name} into cleaned.{table_name}")

            # Validate
            result = connection.execute(text(f"SELECT COUNT(*) FROM cleaned.{table_name}"))
            row_count = result.fetchone()[0]
            logger.info(f"Validation: cleaned.{table_name} has {row_count} rows")
    except Exception as e:
        logger.error(f"Error loading {table_name} into cleaned schema: {str(e)}")
        raise

# Verify all tables in the cleaned schema
try:
    with engine.connect() as connection:
        result = connection.execute(text("SELECT table_name FROM information_schema.tables WHERE table_schema = 'cleaned';"))
        logger.info('\nTables in cleaned schema of project-etl database:')
        for row in result:
            logger.info(row[0])
except Exception as e:
    logger.error(f'Verification failed: {e}')
    raise

INFO:__main__:Successfully loaded jobs into cleaned.jobs
INFO:__main__:Validation: cleaned.jobs has 123849 rows
INFO:__main__:Successfully loaded salaries into cleaned.salaries
INFO:__main__:Validation: cleaned.salaries has 40785 rows
INFO:__main__:Successfully loaded benefits into cleaned.benefits
INFO:__main__:Validation: cleaned.benefits has 30023 rows
INFO:__main__:Successfully loaded employee_counts into cleaned.employee_counts
INFO:__main__:Validation: cleaned.employee_counts has 35787 rows
INFO:__main__:Successfully loaded industries into cleaned.industries
INFO:__main__:Validation: cleaned.industries has 422 rows
INFO:__main__:Successfully loaded skills_industries into cleaned.skills_industries
INFO:__main__:Validation: cleaned.skills_industries has 35 rows
INFO:__main__:Successfully loaded companies into cleaned.companies
INFO:__main__:Validation: cleaned.companies has 24472 rows
INFO:__main__:
Tables in cleaned schema of project-etl database:
INFO:__main__:jobs
INFO:__main__:
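
The validation above logs the row count but does not compare it with the DataFrame. A minimal sketch of a stricter check follows, using an in-memory SQLite database so that it runs without the GCP engine. The function `load_and_validate` is hypothetical, and SQLite has no `cleaned` schema, so the `schema` argument is omitted here.

```python
import sqlite3

import pandas as pd

def load_and_validate(df: pd.DataFrame, table_name: str, conn: sqlite3.Connection) -> int:
    """Write df to table_name and fail if the stored row count differs."""
    df.to_sql(table_name, conn, if_exists="replace", index=False)
    # table_name comes from our own dict of DataFrames, not from user input.
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table_name}"').fetchone()[0]
    if row_count != len(df):
        raise ValueError(
            f"{table_name}: expected {len(df)} rows, found {row_count}"
        )
    return row_count

conn = sqlite3.connect(":memory:")
toy = pd.DataFrame({"industry_id": [1, 3, 4], "industry_name": ["A", "B", "C"]})
loaded = load_and_validate(toy, "industries", conn)
conn.close()
print(loaded)
```

Raising on a mismatch stops the pipeline at the failing table instead of leaving a partial load to be discovered later.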

In [42]:
print("Jobs Data Distribution:")
print(jobs_df[['normalized_salary', 'original_listed_time']].describe())
print("\nDate Distribution:")
print(jobs_df['original_listed_time'].value_counts().sort_index())
print(f"\nTotal unique jobs: {jobs_df['job_id'].nunique()}")
print(f"Total rows: {len(jobs_df)}")

Jobs Data Distribution:
       normalized_salary           original_listed_time
count           36073.00                         123849
mean            93505.17  2024-04-15 03:38:58.799942144
min                 0.00            2023-12-05 21:08:53
25%             52000.00            2024-04-11 19:14:36
50%             81500.00            2024-04-17 23:03:59
75%            125000.00            2024-04-18 22:12:04
max            234500.00            2024-04-20 00:26:43
std             52147.55                            NaN

Date Distribution:
original_listed_time
2023-12-05 21:08:53    1
2023-12-08 15:47:14    1
2023-12-21 18:49:15    1
2024-01-05 20:18:41    1
2024-01-05 20:19:04    1
                      ..
2024-04-20 00:26:07    1
2024-04-20 00:26:08    1
2024-04-20 00:26:28    1
2024-04-20 00:26:30    1
2024-04-20 00:26:43    1
Name: count, Length: 65036, dtype: int64

Total unique jobs: 123849
Total rows: 123849


With the cleaned tables stored in the new `cleaned` schema, we close the database connection.

In [43]:
engine.dispose()
logger.info("Closed connection to GCP database.")

INFO:__main__:Closed connection to GCP database.
