## **PROJECT - NOTEBOOK #3: Data Cleansing and Transformation**

---

In [26]:
import os 
print(os.getcwd())

try:
    os.chdir("../../project_etl")

except FileNotFoundError:
    print("""
        FileNotFoundError - The directory may not exist or you might not be in the specified path.
        If this has already worked, do not run this block again, as the current directory is already set to workshop-001.
        """)
    
print(os.getcwd())

d:\U\FIFTH SEMESTER\ETL\project_etl

        FileNotFoundError - The directory may not exist or you might not be in the specified path.
        If this has already worked, do not run this block again, as the current directory is already set to workshop-001.
        
d:\U\FIFTH SEMESTER\ETL\project_etl


In [27]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

#from functions.db_connection.connection import create_engine
from functions.db_connection.connection import creating_engine

In [28]:
from sqlalchemy import create_engine, text

In [29]:
existing_engine = creating_engine()

In [30]:
from functions.db_connection.clean_connection import create_clean_engine

In [31]:
clean_engine = create_clean_engine()

In [32]:
jobs_df = pd.read_sql("SELECT * FROM public.jobs", con=existing_engine)
salaries_df = pd.read_sql("SELECT * FROM public.salaries", con=existing_engine)
benefits_df = pd.read_sql("SELECT * FROM public.benefits", con=existing_engine)
employee_counts_df = pd.read_sql("SELECT * FROM public.employee_counts", con=existing_engine)
industries_df = pd.read_sql("SELECT * FROM public.industries", con=existing_engine)
skills_industries_df = pd.read_sql("SELECT * FROM public.skills_industries", con=existing_engine)
companies_df = pd.read_sql("SELECT * FROM public.companies", con=existing_engine)

print("DataFrames loaded from PostgreSQL (from raw database).")

DataFrames loaded from PostgreSQL (from raw database).


## Data Cleaning (jobs_df)

Now that we know how we're going to handle the data, thanks to notebook `02_read_data.ipynb`, it's time to start cleaning and transforming the data.

We'll begin by deleting the columns defined previously:

+ `closed_time`
+ `skills_desc`
+ `med_salary`, `max_salary`, `min_salary`
+ `compensation_type`
+ `listed_time`, `expiry`
+ `fips` 
+ `work_type`
+ `applies`
+ `application_url`
+ `posting_domain`

In [33]:
cols_to_drop = ['med_salary', 'work_type', 'applies', 'closed_time', 'skills_desc', 'max_salary', 'min_salary', 'fips', 'listed_time','expiry', 'compensation_type', 'application_url', 'posting_domain']
jobs_df.drop(columns=cols_to_drop, inplace=True, errors='ignore')

1. We know that the dates are being handled in float format, which creates reading problems. That's why we're going to change them to date format.

2. The postal code is float type, just like the dates, but we'll change this one to string format to handle different postal code values and formats.

3. Some columns, such as the IDs and `views`, are stored as floats, which is incorrect, so we'll convert them from float to int.

4. Some jobs do not specify whether remote work is allowed. In our case, `remote_allowed` is a float that we'll convert to a boolean, so that jobs that do not allow or do not specify remote work are marked `False`.

5. The `pay_period` column shows how often employees are paid, but `normalized_salary` generally holds the annual salary, so any `pay_period` value other than `YEARLY` will be converted to that value.

In [34]:
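# zip_code: cast to string so that different postal code formats can be stored
columns_to_replace_not_specified = ["zip_code", "formatted_experience_level"]
jobs_df["zip_code"] = jobs_df["zip_code"].astype(str)
# Missing zip codes (now the string "nan") and missing experience levels become "No specified"
jobs_df[columns_to_replace_not_specified] = jobs_df[columns_to_replace_not_specified].replace(["nan", None], "No specified")
# original_listed_time: epoch milliseconds -> datetime
jobs_df["original_listed_time"] = pd.to_datetime(jobs_df["original_listed_time"], unit="ms")
# IDs and views: float -> int (a missing company_id becomes -1, missing views become 0)
jobs_df["company_id"] = jobs_df["company_id"].fillna(-1).astype(int)
jobs_df["views"] = jobs_df["views"].fillna(0).astype(int)
# remote_allowed: float -> bool, with missing values treated as False
jobs_df["remote_allowed"] = jobs_df["remote_allowed"].fillna(0).astype(bool)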
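Casting a float column straight to `str` keeps the decimal part (`21042.0`) and loses leading zeros (`8091.0`). The sketch below shows one possible way to avoid both. The helper name `zip_to_string` and the padding width of 5 are assumptions for illustration; the width only suits US ZIP codes.

```python
import pandas as pd


def zip_to_string(series, width=5, missing="No specified"):
    """Turn float-encoded postal codes into clean strings.

    Hypothetical helper: width=5 is an assumption that fits US ZIP codes only.
    """
    numeric = pd.to_numeric(series, errors="coerce")     # non-numeric values -> NaN
    as_text = numeric.astype("Int64").astype("string")   # 8091.0 -> "8091"
    padded = as_text.str.zfill(width)                     # "8091" -> "08091"
    return padded.fillna(missing).astype(object)


example = pd.Series([21042.0, 8091.0, float("nan")])
print(zip_to_string(example).tolist())
```

Going through the nullable `Int64` dtype drops the `.0` suffix while still letting missing values pass through to the final `fillna`.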


In [35]:
columns_to_replace = ["currency", "pay_period"]
jobs_df[columns_to_replace] = jobs_df[columns_to_replace].replace([None, pd.NA], "Unknown")

In [36]:
jobs_df.head()

Unnamed: 0,job_id,company_name,title,description,pay_period,location,company_id,views,formatted_work_type,original_listed_time,remote_allowed,job_posting_url,application_type,formatted_experience_level,sponsored,currency,normalized_salary,zip_code,job_id_modify,company_id_modify
0,3853386067,"CrossCountry Mortgage, LLC",Licensed Loan Partner,CrossCountry Mortgage is a leading mortgage le...,YEARLY,"Ellicott City, MD",3021785,2,Full-time,2024-04-11 18:40:39,False,https://www.linkedin.com/jobs/view/3853386067/...,ComplexOnsiteApply,No specified,0,USD,42500.0,21042.0,1,1
1,3853717462,"Spruce InfoTech, Inc",Quality process analyst | Hybrid in West Berli...,Consultant's Title: Quality Process EngineerWo...,Unknown,"West Berlin, NJ",4803413,2,Full-time,2024-04-19 13:46:13,False,https://www.linkedin.com/jobs/view/3853717462/...,ComplexOnsiteApply,No specified,0,Unknown,,8091.0,2,2
2,3853719293,"Miracle Software Systems, Inc",Business Development Account Manager,"Hello conections ,\nI trust you are doing well...",Unknown,"Novi, MI",15388,4,Full-time,2024-04-11 18:05:05,False,https://www.linkedin.com/jobs/view/3853719293/...,ComplexOnsiteApply,No specified,0,Unknown,,48374.0,3,3
3,3853995874,,Professional Singer,Summary:\nWe are looking for a professional or...,Unknown,"Boston, MA",-1,3,Temporary,2024-04-15 18:58:35,True,https://www.linkedin.com/jobs/view/3853995874/...,ComplexOnsiteApply,No specified,0,Unknown,,2108.0,4,4
4,3854137450,NTCA–The Rural Broadband Association,Accounting Manager,NTCA – The Rural Broadband Association is look...,Unknown,"Arlington, VA",39231,22,Full-time,2024-04-15 19:15:02,False,https://www.linkedin.com/jobs/view/3854137450/...,OffsiteApply,No specified,0,Unknown,,22201.0,5,5


Here we realize that the `job_id` values are very long. We will keep that column, since it holds the real ID of the job, but we will add another one called `job_id_modify` to make searches, graphs, and queries simpler. The same applies to the company ID, with `company_id_modify`.

In [37]:
jobs_df["job_id_modify"] = range(1, len(jobs_df) + 1)
jobs_df["company_id_modify"] = range(1, len(jobs_df) + 1)
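Note that `range(1, len(jobs_df) + 1)` gives every row its own number, so two postings from the same company receive different `company_id_modify` values. If the intent is one short ID per company, a possible alternative is `pd.factorize`. The helper name `add_surrogate_id` below is hypothetical, and this is only a sketch of that idea.

```python
import pandas as pd


def add_surrogate_id(df, key, new_col):
    """Return a copy of df with a compact 1-based ID per distinct value of `key`."""
    codes, _ = pd.factorize(df[key])   # same key value -> same integer code
    out = df.copy()
    out[new_col] = codes + 1           # start numbering at 1 instead of 0
    return out


example = pd.DataFrame({"company_id": [3021785, 15388, 3021785, -1]})
print(add_surrogate_id(example, "company_id", "company_id_modify"))
```

With this approach the placeholder `-1` (unknown company) is treated like any other value and also gets a single surrogate ID.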

In [38]:
jobs_df.isnull().sum()

job_id                            0
company_name                   1719
title                             0
description                       7
pay_period                        0
location                          0
company_id                        0
views                             0
formatted_work_type               0
original_listed_time              0
remote_allowed                    0
job_posting_url                   0
application_type                  0
formatted_experience_level        0
sponsored                         0
currency                          0
normalized_salary             87776
zip_code                          0
job_id_modify                     0
company_id_modify                 0
dtype: int64

In [39]:
jobs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123849 entries, 0 to 123848
Data columns (total 20 columns):
 #   Column                      Non-Null Count   Dtype         
---  ------                      --------------   -----         
 0   job_id                      123849 non-null  int64         
 1   company_name                122130 non-null  object        
 2   title                       123849 non-null  object        
 3   description                 123842 non-null  object        
 4   pay_period                  123849 non-null  object        
 5   location                    123849 non-null  object        
 6   company_id                  123849 non-null  int64         
 7   views                       123849 non-null  int64         
 8   formatted_work_type         123849 non-null  object        
 9   original_listed_time        123849 non-null  datetime64[ns]
 10  remote_allowed              123849 non-null  bool          
 11  job_posting_url             123849 non-

In [40]:
pd.options.display.float_format = '{:.2f}'.format
jobs_df.describe()

Unnamed: 0,job_id,company_id,views,original_listed_time,sponsored,normalized_salary,job_id_modify,company_id_modify
count,123849.0,123849.0,123849.0,123849,123849.0,36073.0,123849.0,123849.0
mean,3896402138.07,12034820.09,14.42,2024-04-15 03:38:58.799941632,0.0,93505.17,61925.0,61925.0
min,921716.0,-1.0,0.0,2023-12-05 21:08:53,0.0,0.0,1.0,1.0
25%,3894586595.0,12979.0,3.0,2024-04-11 19:14:36,0.0,52000.0,30963.0,30963.0
50%,3901998406.0,204770.0,4.0,2024-04-17 23:03:59,0.0,81500.0,61925.0,61925.0
75%,3904707077.0,7260866.0,7.0,2024-04-18 22:12:04,0.0,125000.0,92887.0,92887.0
max,3906267224.0,103472979.0,9975.0,2024-04-20 00:26:43,0.0,234500.0,123849.0,123849.0
std,84043545.16,25403872.0,85.33,,0.0,52147.55,35752.27,35752.27


With `describe`, we see that there is a value in `normalized_salary` that is unusually high:  

**535,600,000.00**  

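We observe that the 75th percentile is approximately **125,000**, which suggests that there may not be many values significantly higher than that number. However, to confirm this, we perform a statistical calculation based on the interquartile range (IQR) and cap the values above the resulting upper bound.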



In [41]:
q1 = jobs_df["normalized_salary"].quantile(0.25)
q3 = jobs_df["normalized_salary"].quantile(0.75)
iqr = q3 - q1
upper_cap = q3 + 1.5 * iqr

jobs_df["normalized_salary"] = jobs_df["normalized_salary"].clip(upper=upper_cap)

In [42]:
jobs_df["normalized_salary"].describe()

count    36073.00
mean     93505.17
std      52147.55
min          0.00
25%      52000.00
50%      81500.00
75%     125000.00
max     234500.00
Name: normalized_salary, dtype: float64

This way, we ensure the integrity of the data.
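As a quick check of the cap, the arithmetic can be reproduced by hand from the quartiles reported in the summary above (Q1 = 52,000 and Q3 = 125,000). The function name `iqr_upper_cap` is only illustrative.

```python
def iqr_upper_cap(q1, q3, k=1.5):
    """Upper bound of the IQR rule: Q3 + k * (Q3 - Q1)."""
    return q3 + k * (q3 - q1)


# Quartiles taken from the normalized_salary summary shown above
print(iqr_upper_cap(52_000, 125_000))  # 125000 + 1.5 * 73000 = 234500.0
```

The result matches the maximum of `normalized_salary` after clipping, which is what `clip(upper=upper_cap)` is expected to produce.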

## Data Cleaning (salaries_df)

In [43]:
salaries_df

Unnamed: 0,salary_id,job_id,max_salary,med_salary,min_salary,pay_period,currency,compensation_type,raw_salary
0,1,3884428798,,20.00,,HOURLY,USD,BASE_SALARY,20.00
1,2,3887470552,25.00,,23.00,HOURLY,USD,BASE_SALARY,24.00
2,3,3884431523,120000.00,,100000.00,YEARLY,USD,BASE_SALARY,110000.00
3,4,3884911725,200000.00,,10000.00,YEARLY,USD,BASE_SALARY,105000.00
4,5,3887473220,35.00,,33.00,HOURLY,USD,BASE_SALARY,34.00
...,...,...,...,...,...,...,...,...,...
40780,40781,3902881498,,15.50,,HOURLY,USD,BASE_SALARY,15.50
40781,40782,3902883232,,25.00,,HOURLY,USD,BASE_SALARY,25.00
40782,40783,3902866633,21.53,,21.10,HOURLY,USD,BASE_SALARY,21.32
40783,40784,3902879720,125000.00,,100000.00,YEARLY,USD,BASE_SALARY,112500.00


1. We know that most values in `min_salary`, `med_salary`, and `max_salary` are null, so we will perform some operations on the rows to normalize these values and store them in a column called `raw_salary`.

2. We know that salary periods are given in different time frames (`HOURLY`, `WEEKLY`, `BIWEEKLY`, `YEARLY`, etc.). We will standardize them to yearly values by assuming the employees' working hours for `HOURLY`, the number of weeks in a year for `WEEKLY`, and so on.

3. Since we know that the base salary is stored in `BASE_SALARY`, all calculations will consider this. Overtime wages will not be an issue, as they are not included in the base salary.


In [44]:
import numpy as np
def get_unified_salary(row):
    if not pd.isna(row['med_salary']):
        return row['med_salary']
    
    min_sal = row['min_salary']
    max_sal = row['max_salary']
    
    if not pd.isna(min_sal) and not pd.isna(max_sal):
        return (min_sal + max_sal) / 2
    
    if not pd.isna(min_sal):
        return min_sal
    if not pd.isna(max_sal):
        return max_sal
    
    return np.nan

salaries_df['raw_salary'] = salaries_df.apply(get_unified_salary, axis=1)
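The row-wise function above can also be written without `apply`. The following is a sketch of an equivalent vectorized form, assuming the same three column names; `unified_salary` is a hypothetical name. It relies on the fact that a row-wise `mean` skips missing values by default.

```python
import numpy as np
import pandas as pd


def unified_salary(df):
    """Prefer med_salary; otherwise use the mean of whichever of min/max is present."""
    # mean(axis=1) ignores NaN: both present -> midpoint, one present -> that value,
    # none present -> NaN
    range_midpoint = df[["min_salary", "max_salary"]].mean(axis=1)
    return df["med_salary"].fillna(range_midpoint)


example = pd.DataFrame({
    "max_salary": [np.nan, 25.0, np.nan, np.nan],
    "med_salary": [20.0, np.nan, np.nan, np.nan],
    "min_salary": [np.nan, 23.0, 10.0, np.nan],
})
print(unified_salary(example))
```

On large tables this usually runs much faster than `apply(..., axis=1)`, while giving the same result for each of the four cases handled by `get_unified_salary`.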

In [45]:
salaries_df.head()

Unnamed: 0,salary_id,job_id,max_salary,med_salary,min_salary,pay_period,currency,compensation_type,raw_salary
0,1,3884428798,,20.0,,HOURLY,USD,BASE_SALARY,20.0
1,2,3887470552,25.0,,23.0,HOURLY,USD,BASE_SALARY,24.0
2,3,3884431523,120000.0,,100000.0,YEARLY,USD,BASE_SALARY,110000.0
3,4,3884911725,200000.0,,10000.0,YEARLY,USD,BASE_SALARY,105000.0
4,5,3887473220,35.0,,33.0,HOURLY,USD,BASE_SALARY,34.0
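The yearly standardization described in point 2 could be sketched as follows. The multipliers are assumptions (40 hours per week over 52 weeks for `HOURLY`, 52 weeks for `WEEKLY`, 26 periods for `BIWEEKLY`, 12 for `MONTHLY`), and both `ANNUAL_FACTORS` and `annualize` are hypothetical names rather than part of the project's code.

```python
import pandas as pd

# Assumed number of pay periods per year (hypothetical values)
ANNUAL_FACTORS = {
    "HOURLY": 40 * 52,   # 2080 working hours per year
    "WEEKLY": 52,
    "BIWEEKLY": 26,
    "MONTHLY": 12,
    "YEARLY": 1,
}


def annualize(df, salary_col, period_col="pay_period"):
    """Return salary_col converted to a yearly amount; unknown periods give NaN."""
    factors = df[period_col].map(ANNUAL_FACTORS)
    return df[salary_col] * factors


example = pd.DataFrame({
    "raw_salary": [20.0, 110000.0, 1000.0, 5.0],
    "pay_period": ["HOURLY", "YEARLY", "WEEKLY", "Unknown"],
})
print(annualize(example, "raw_salary"))
```

Leaving unmapped periods as `NaN` makes rows with an unrecognized `pay_period` easy to find instead of silently treating them as yearly.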


In [46]:
salaries_df.describe()

Unnamed: 0,salary_id,job_id,max_salary,med_salary,min_salary,raw_salary
count,40785.0,40785.0,33947.0,6838.0,33947.0,40785.0
mean,20393.0,3895563848.87,90263.02,1372.17,61934.4,63570.08
std,11773.76,94966718.0,88349.98,2306.1,58244.6,71136.02
min,1.0,921716.0,1.0,0.0,1.0,0.0
25%,10197.0,3894608085.0,50.0,18.5,39.0,30.0
50%,20393.0,3901980104.0,85000.0,25.0,62300.0,53800.0
75%,30589.0,3904576109.0,142500.0,2207.0,100000.0,110075.0
max,40785.0,3906267224.0,356175.0,5489.75,249941.5,303058.25


1. A `raw_salary` column is created based on the other three salary values according to the performed operations.

2. There is an issue with the maximum salary values, as they deviate significantly from the mean. To address this, we cap each salary column at an upper bound computed from its interquartile range (Q3 + 1.5 × IQR).

**Note:** This was not applied to `jobs_df` because the number of null values was too high.


In [47]:
salary_columns = ['max_salary', 'med_salary', 'min_salary', 'raw_salary']

def cap_outliers(df, column):
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    upper_cap = q3 + 1.5 * iqr
    df[column] = df[column].clip(upper=upper_cap)
    return df

for col in salary_columns:
    salaries_df = cap_outliers(salaries_df, col)

In [48]:
salaries_df.describe()

Unnamed: 0,salary_id,job_id,max_salary,med_salary,min_salary,raw_salary
count,40785.0,40785.0,33947.0,6838.0,33947.0,40785.0
mean,20393.0,3895563848.87,90263.02,1372.17,61934.4,63390.13
std,11773.76,94966718.0,88349.98,2306.1,58244.6,70566.28
min,1.0,921716.0,1.0,0.0,1.0,0.0
25%,10197.0,3894608085.0,50.0,18.5,39.0,30.0
50%,20393.0,3901980104.0,85000.0,25.0,62300.0,53800.0
75%,30589.0,3904576109.0,142500.0,2207.0,100000.0,110075.0
max,40785.0,3906267224.0,356175.0,5489.75,249941.5,275142.5


This way, we ensure that no salary value exceeds the IQR-based upper bound.

So far, we have completed `salaries_df`.


## Data Cleaning (benefits_df)

In [49]:
benefits_df

Unnamed: 0,job_id,type
0,23221523,{401(k)}
1,56482768,"{401(k),""Dental insurance"",""Disability insuran..."
2,69333422,"{""Medical insurance"",""Vision insurance"",""Denta..."
3,95428182,"{""Medical insurance"",""Dental insurance"",""Disab..."
4,111513530,"{""Medical insurance"",""Paid maternity leave"",""P..."
...,...,...
30018,3906266212,"{""Dental insurance"",""Vision insurance"",401(k)}"
30019,3906266229,"{""Medical insurance"",""Vision insurance"",""Denta..."
30020,3906266248,"{401(k),""Vision insurance"",""Medical insurance""}"
30021,3906266272,"{""Medical insurance"",""Vision insurance"",""Denta..."


1. Here, we see that the same `job_id` is repeated multiple times, once for each benefit. Since those rows belong to the same job posting, we need to unify them.

2. We will also remove the `inferred` column, as it does not seem relevant.


In [50]:
benefits_df = benefits_df.drop(columns=['inferred'])

benefits_df = benefits_df.groupby('job_id')['type'].apply(list).reset_index()


KeyError: "['inferred'] not found in axis"
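The `KeyError` suggests that the `inferred` column is no longer present in the table being read, for example because the cell had already been run on this data. A minimal sketch of a version that tolerates a missing column is shown below. The function name `unify_benefits` is hypothetical, and the sketch assumes the raw layout with one row per job and benefit.

```python
import pandas as pd


def unify_benefits(df):
    """Drop `inferred` if it exists, then collect each job's benefits into a list."""
    out = df.drop(columns=["inferred"], errors="ignore")  # no error if already dropped
    return out.groupby("job_id")["type"].apply(list).reset_index()


example = pd.DataFrame({
    "job_id": [1, 1, 2],
    "inferred": [0, 1, 0],
    "type": ["401(k)", "Dental insurance", "401(k)"],
})
print(unify_benefits(example))
```

If the table has already been grouped, running this again would wrap each existing value in a new list, so the grouping step should still be applied only once.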

In [52]:
benefits_df

Unnamed: 0,job_id,type
0,23221523,{401(k)}
1,56482768,"{401(k),""Dental insurance"",""Disability insuran..."
2,69333422,"{""Medical insurance"",""Vision insurance"",""Denta..."
3,95428182,"{""Medical insurance"",""Dental insurance"",""Disab..."
4,111513530,"{""Medical insurance"",""Paid maternity leave"",""P..."
...,...,...
30018,3906266212,"{""Dental insurance"",""Vision insurance"",401(k)}"
30019,3906266229,"{""Medical insurance"",""Vision insurance"",""Denta..."
30020,3906266248,"{401(k),""Vision insurance"",""Medical insurance""}"
30021,3906266272,"{""Medical insurance"",""Vision insurance"",""Denta..."


1. We group the benefits into a list and remove the `inferred` column.



## Data Cleaning (employee_counts_df)


In [53]:
employee_counts_df

Unnamed: 0,company_id,employee_count,follower_count,time_recorded
0,391906,186,32508,2024-04-05
1,22292832,311,4471,2024-04-05
2,20300,1053,6554,2024-04-05
3,3570660,383,35241,2024-04-05
4,878353,52,26397,2024-04-05
...,...,...,...,...
35782,2852377,1999,7943,2024-04-20
35783,64379,144,31910,2024-04-20
35784,19114724,4,102,2024-04-20
35785,307079,602,32199,2024-04-20


In [54]:
employee_counts_df.isnull().sum()

company_id        0
employee_count    0
follower_count    0
time_recorded     0
dtype: int64

There are no null values in the table, but we notice that `time_recorded` is in an unusual format, and we are unsure of its actual meaning. Therefore, we will convert it into a readable format.

In [55]:
employee_counts_df["time_recorded"] = pd.to_datetime(
    employee_counts_df["time_recorded"], 
    unit="s"
).dt.date

ValueError: unit='s' not valid with non-numerical val='2024-04-05', at position 0
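The `ValueError` indicates that `time_recorded` already holds date strings such as `2024-04-05` rather than epoch seconds, which is why `unit="s"` is rejected. One way to make the conversion safe to re-run is to check the dtype first. This is a sketch under that assumption, and `to_date` is a hypothetical helper name.

```python
import datetime

import pandas as pd
from pandas.api.types import is_numeric_dtype


def to_date(series):
    """Convert epoch seconds or date-like values to datetime.date objects."""
    if is_numeric_dtype(series):
        parsed = pd.to_datetime(series, unit="s")   # raw epoch seconds
    else:
        parsed = pd.to_datetime(series)             # already date-like values
    return parsed.dt.date


print(to_date(pd.Series([1712275200])).tolist())
print(to_date(pd.Series(["2024-04-05"])).tolist())
```

Both calls yield the same calendar date, so the cell can be executed on raw or already converted data without raising.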

In [56]:
employee_counts_df

Unnamed: 0,company_id,employee_count,follower_count,time_recorded
0,391906,186,32508,2024-04-05
1,22292832,311,4471,2024-04-05
2,20300,1053,6554,2024-04-05
3,3570660,383,35241,2024-04-05
4,878353,52,26397,2024-04-05
...,...,...,...,...
35782,2852377,1999,7943,2024-04-20
35783,64379,144,31910,2024-04-20
35784,19114724,4,102,2024-04-20
35785,307079,602,32199,2024-04-20


Now, the date is in a readable format.


## Data Cleaning (industries_df)

In [57]:
industries_df

Unnamed: 0,industry_id,industry_name
0,1,Defense and Space Manufacturing
1,3,Computer Hardware Manufacturing
2,4,Software Development
3,5,Computer Networking Products
4,6,"Technology, Information and Internet"
...,...,...
417,3249,Surveying and Mapping Services
418,3250,Retail Pharmacies
419,3251,Climate Technology Product Manufacturing
420,3252,Climate Data and Analytics


In [58]:
industries_df.isnull().sum()

industry_id      0
industry_name    0
dtype: int64

There are 34 null values in `industry_name`, so we will convert them to `"Unknown"`.

In [59]:
industries_df["industry_name"] = industries_df["industry_name"].replace([None, pd.NA], "Unknown")

In [60]:
industries_df.isnull().sum()

industry_id      0
industry_name    0
dtype: int64

## Data Cleaning (skills_industries_df)

In [61]:
skills_industries_df

Unnamed: 0,skill_abr,skill_name
0,ART,Art/Creative
1,DSGN,Design
2,ADVR,Advertising
3,PRDM,Product Management
4,DIST,Distribution
5,EDU,Education
6,TRNG,Training
7,PRJM,Project Management
8,CNSL,Consulting
9,PRCH,Purchasing


Nothing needs to be done here for now.

## Data Cleaning (companies_df)

In [62]:
companies_df.head()

Unnamed: 0,company_id,name,description,company_size,state,country,city,zip_code,address,url
0,1009,IBM,"At IBM, we do more than work. We create. We cr...",7,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm
1,1016,GE HealthCare,Every day millions of people feel the impact o...,7,Unknown,US,Chicago,Unknown,-,https://www.linkedin.com/company/gehealthcare
2,1025,Hewlett Packard Enterprise,Official LinkedIn of Hewlett Packard Enterpris...,7,Texas,US,Houston,77389,1701 E Mossy Oaks Rd Spring,https://www.linkedin.com/company/hewlett-packa...
3,1028,Oracle,We’re a cloud technology company that provides...,7,Texas,US,Austin,78741,2300 Oracle Way,https://www.linkedin.com/company/oracle
4,1033,Accenture,Accenture is a leading global professional ser...,7,Unknown,IE,Dublin 2,Unknown,Grand Canal Harbour,https://www.linkedin.com/company/accenture


In [63]:
companies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24472 entries, 0 to 24471
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   company_id    24472 non-null  int64 
 1   name          24472 non-null  object
 2   description   24472 non-null  object
 3   company_size  24472 non-null  int64 
 4   state         24472 non-null  object
 5   country       24472 non-null  object
 6   city          24472 non-null  object
 7   zip_code      24472 non-null  object
 8   address       24472 non-null  object
 9   url           24472 non-null  object
dtypes: int64(2), object(8)
memory usage: 1.9+ MB


In [64]:
companies_df.isnull().sum()

company_id      0
name            0
description     0
company_size    0
state           0
country         0
city            0
zip_code        0
address         0
url             0
dtype: int64

In this case, all data types are correct, so we will only perform a few operations on them.

1. We see that there are not many null values for company size, so we can use the median to determine the size.

2. Null descriptions can be handled the same way as in the other tables (`"No description"`).

3. `zip_code` can be treated as `"Unknown"`, as can `state` and `city`, while a missing address becomes `"No specific address"`.

4. The only row with a null company name can be removed.


In [65]:
companies_df.fillna({
    'zip_code': 'Unknown',
    'state': 'Unknown',
    'company_size': companies_df['company_size'].median(),
    'description': 'No description',
    'address': 'No specific address',
    'city': 'Unknown'
}, inplace=True)

companies_df.dropna(subset=['name'], inplace=True)

companies_df['company_size'] = companies_df['company_size'].astype(int)

companies_df['zip_code'] = companies_df['zip_code'].replace('0', 'Unknown')
companies_df['state'] = companies_df['state'].replace('0', 'Unknown')
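The same `'0'` replacement can be applied to several text columns at once. The sketch below assumes that `'0'` is a placeholder for a missing value in every column passed in, which should be checked against the data first; `replace_placeholders` is a hypothetical helper.

```python
import pandas as pd


def replace_placeholders(df, columns, placeholders=("0",), fill="Unknown"):
    """Return a copy of df where placeholder strings in `columns` become `fill`."""
    out = df.copy()
    out[columns] = out[columns].replace(list(placeholders), fill)
    return out


example = pd.DataFrame({
    "country": ["US", "0"],
    "city": ["Austin", "0"],
    "zip_code": ["78741", "0"],
})
print(replace_placeholders(example, ["country", "city"]))
```

Columns that are not listed, such as `zip_code` in this example, are left untouched.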

In [66]:
companies_df

Unnamed: 0,company_id,name,description,company_size,state,country,city,zip_code,address,url
0,1009,IBM,"At IBM, we do more than work. We create. We cr...",7,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm
1,1016,GE HealthCare,Every day millions of people feel the impact o...,7,Unknown,US,Chicago,Unknown,-,https://www.linkedin.com/company/gehealthcare
2,1025,Hewlett Packard Enterprise,Official LinkedIn of Hewlett Packard Enterpris...,7,Texas,US,Houston,77389,1701 E Mossy Oaks Rd Spring,https://www.linkedin.com/company/hewlett-packa...
3,1028,Oracle,We’re a cloud technology company that provides...,7,Texas,US,Austin,78741,2300 Oracle Way,https://www.linkedin.com/company/oracle
4,1033,Accenture,Accenture is a leading global professional ser...,7,Unknown,IE,Dublin 2,Unknown,Grand Canal Harbour,https://www.linkedin.com/company/accenture
...,...,...,...,...,...,...,...,...,...,...
24467,103456466,Foundation Model Startup,No description,3,Unknown,0,0,Unknown,0,https://www.linkedin.com/company/foundation-mo...
24468,103456527,Kinder Prep Montessori Nursery & Preschool,Explore our renowned daycare and preschool cen...,1,New York,US,Brooklyn,11249,49 Broadway,https://www.linkedin.com/company/kinder-prep-m...
24469,103466352,Centent Consulting LLC,Centent Consulting LLC is a reputable human re...,3,Unknown,0,0,Unknown,0,https://www.linkedin.com/company/centent-consu...
24470,103468936,WebUnite,Our mission at WebUnite is to offer experience...,3,Pennsylvania,US,Southampton,18966,720 2nd Street Pike,https://www.linkedin.com/company/webunite


This way, our data remains consistent.

In [67]:
dataframes_to_send = {
    'jobs': jobs_df,
    'salaries': salaries_df,
    'benefits': benefits_df,
    'employee_counts': employee_counts_df,
    'industries': industries_df,
    'skills_industries': skills_industries_df,
    'companies': companies_df
}

for table_name, df in dataframes_to_send.items():
    try:
        #df.to_sql(table_name, existing_engine, schema='public', if_exists='replace', index=False)
        df.to_sql(table_name, clean_engine, schema='public', if_exists='replace', index=False)
        print(f"Successfully sent {table_name} DataFrame to linkedin-postings-clean")
    except Exception as e:
        print(f"Error sending {table_name} to linkedin-postings-clean: {str(e)}")

# Verify the tables were created by querying the new database
#with existing_engine.connect() as connection:
with clean_engine.connect() as connection:
    for table_name in dataframes_to_send.keys():
        result = connection.execute(text(f"SELECT COUNT(*) FROM public.{table_name}"))
        row_count = result.fetchone()[0]
        print(f"Table {table_name} in linkedin-postings-clean has {row_count} rows")

print("All cleaned DataFrames successfully sent to linkedin-postings-clean database.")

Successfully sent jobs DataFrame to linkedin-postings-clean
Successfully sent salaries DataFrame to linkedin-postings-clean
Successfully sent benefits DataFrame to linkedin-postings-clean
Successfully sent employee_counts DataFrame to linkedin-postings-clean
Successfully sent industries DataFrame to linkedin-postings-clean
Successfully sent skills_industries DataFrame to linkedin-postings-clean
Successfully sent companies DataFrame to linkedin-postings-clean
Table jobs in linkedin-postings-clean has 123849 rows
Table salaries in linkedin-postings-clean has 40785 rows
Table benefits in linkedin-postings-clean has 30023 rows
Table employee_counts in linkedin-postings-clean has 35787 rows
Table industries in linkedin-postings-clean has 422 rows
Table skills_industries in linkedin-postings-clean has 35 rows
Table companies in linkedin-postings-clean has 24472 rows
All cleaned DataFrames successfully sent to linkedin-postings-clean database.
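The row counts printed above can also be compared automatically with the length of each DataFrame. The sketch below illustrates the idea with an in-memory SQLite database as a stand-in for the PostgreSQL target, so it omits the `schema` argument; `load_and_verify` is a hypothetical helper, not part of the project's `functions` package.

```python
import sqlite3

import pandas as pd


def load_and_verify(frames, conn):
    """Write each DataFrame to a table and return {table: (expected, loaded)} mismatches."""
    mismatches = {}
    for table_name, df in frames.items():
        df.to_sql(table_name, conn, if_exists="replace", index=False)
        loaded = conn.execute(f'SELECT COUNT(*) FROM "{table_name}"').fetchone()[0]
        if loaded != len(df):
            mismatches[table_name] = (len(df), loaded)
    return mismatches


frames = {
    "industries": pd.DataFrame({"industry_id": [1, 3], "industry_name": ["A", "B"]}),
    "skills_industries": pd.DataFrame({"skill_abr": ["ART"], "skill_name": ["Art/Creative"]}),
}
conn = sqlite3.connect(":memory:")
print(load_and_verify(frames, conn))  # an empty dict means every count matched
conn.close()
```

Returning the mismatches instead of printing a fixed success message makes it harder to miss a table that failed to load completely.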


With the cleaned data stored in the new clean database, we close the connections to both databases.

In [None]:
clean_engine.dispose()
existing_engine.dispose()
print("Connections to both databases closed.")

Connections to both databases closed.
