## **PROJECT - NOTEBOOK #2: Data Cleansing, Transformation and Exploratory Data Analysis (EDA)**

---

### **Setting Environment**

In [None]:
import os 
print(os.getcwd())

try:
    os.chdir("../../project_etl")

except FileNotFoundError:
    print("""
        FileNotFoundError - The directory may not exist or you might not be in the specified path.
        If this has already worked, do not run this block again, as the current directory is already set to project_etl.
        """)
    
print(os.getcwd())

d:\U\FIFTH SEMESTER\ETL\project_etl\notebooks
d:\U\FIFTH SEMESTER\ETL\project_etl


### **Load Data**

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import logging
from src.database.db_connection import create_gcp_engine

plt.style.use('ggplot')

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

In [3]:
try:
    engine = create_gcp_engine()
    logger.info("Successfully connected to GCP database")
except Exception as e:
    logger.error(f"Failed to connect to GCP database: {str(e)}")
    raise

INFO:src.database.db_connection:Successfully created GCP database engine
INFO:__main__:Successfully connected to GCP database


### **Read Tables in database project_etl**

In [4]:
from sqlalchemy import create_engine, text

In [5]:
jobs_df = pd.read_sql("SELECT * FROM raw.jobs", con=engine)
salaries_df = pd.read_sql("SELECT * FROM raw.salaries", con=engine)
benefits_df = pd.read_sql("SELECT * FROM raw.benefits", con=engine)
employee_counts_df = pd.read_sql("SELECT * FROM raw.employee_counts", con=engine)
industries_df = pd.read_sql("SELECT * FROM raw.industries", con=engine)
skills_industries_df = pd.read_sql("SELECT * FROM raw.skills_industries", con=engine)
companies_df = pd.read_sql("SELECT * FROM raw.companies", con=engine)

logger.info("DataFrames loaded from raw schema of LJP database.")

INFO:__main__:DataFrames loaded from raw schema of LJP database.


**The database contains 7 tables:**

+ jobs_df
+ salaries_df
+ benefits_df
+ employee_counts_df
+ industries_df
+ skills_industries_df
+ companies_df

We will analyse each table one by one to perform the necessary cleanups and analyses.
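Before going table by table, a quick overview of every DataFrame can help decide where to focus. The following is a minimal sketch, not part of the original notebook: `summarise_tables` is a hypothetical helper, and the two small frames are toy stand-ins for the real tables loaded above.

```python
import pandas as pd


def summarise_tables(tables):
    """Return one row per table with its row count, column count and total null cells.

    `tables` is a dict mapping a label to a DataFrame (hypothetical helper).
    """
    rows = []
    for name, df in tables.items():
        rows.append({
            "table": name,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "null_cells": int(df.isnull().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("table")


# Toy stand-ins for the real DataFrames (illustrative values only).
demo_tables = {
    "jobs_df": pd.DataFrame({"job_id": [1, 2, 3], "company_name": ["A", None, "C"]}),
    "benefits_df": pd.DataFrame({"job_id": [1, 1], "type": ["401(k)", "Dental insurance"]}),
}

summary = summarise_tables(demo_tables)
print(summary)
```

In the notebook itself, the dictionary would hold the seven DataFrames read from the `raw` schema instead of the toy frames.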

## **jobs_df**

In [6]:
jobs_df.head()

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,Requirements: \n\nWe are seeking a College or ...,1713398000000.0,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,,1712858000000.0,,0,FULL_TIME,USD,BASE_SALARY,83200.0,80521.0,8069.0
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,We are currently accepting resumes for FOH - A...,1713278000000.0,,0,FULL_TIME,USD,BASE_SALARY,55000.0,45202.0,39061.0
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,This position requires a baseline understandin...,1712896000000.0,,0,FULL_TIME,USD,BASE_SALARY,157500.0,11040.0,36059.0
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,,1713452000000.0,,0,FULL_TIME,USD,BASE_SALARY,70000.0,52601.0,19057.0


### **Information**

The `jobs` table contains 31 columns and approximately 124,000 rows (123,849 according to `jobs_df.info()`), each describing a job posting on LinkedIn.

In [7]:
jobs_df.isnull().sum()

job_id                             0
company_name                    1719
title                              0
description                        7
max_salary                     94056
pay_period                     87776
location                           0
company_id                      1717
views                           1689
med_salary                    117569
min_salary                     94056
formatted_work_type                0
applies                       100529
original_listed_time               0
remote_allowed                108603
job_posting_url                    0
application_url                36665
application_type                   0
expiry                             0
closed_time                   122776
formatted_experience_level     29409
skills_desc                   121410
listed_time                        0
posting_domain                 39968
sponsored                          0
work_type                          0
currency                       87776
c

This table has a significant number of null values. Many of them fall in columns we do not need, but others are in columns that could otherwise provide very useful information.

**The columns with the most null values are:**

+ `closed_time`: 122,776
+ `skills_desc`: 121,410
+ `med_salary`: 117,569
+ `remote_allowed`: 108,603
+ `applies`: 100,529
+ `max_salary`: 94,056
+ `min_salary`: 94,056
+ `compensation_type`: 87,776
+ `normalized_salary`: 87,776
+ `pay_period`: 87,776
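A ranking like the one above can be produced directly instead of being read off the raw `isnull().sum()` output. This is a minimal sketch under assumptions: `null_report` is a hypothetical helper, and `demo` is a toy frame standing in for `jobs_df`.

```python
import numpy as np
import pandas as pd


def null_report(df, top=10):
    """Rank columns by number of nulls and add the share of rows affected."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })
    return report.sort_values("null_count", ascending=False).head(top)


# Toy stand-in for jobs_df (illustrative values only).
demo = pd.DataFrame({
    "job_id": [1, 2, 3, 4],
    "closed_time": [np.nan, np.nan, np.nan, 1.7e12],
    "max_salary": [20.0, np.nan, 65000.0, np.nan],
})

report = null_report(demo)
print(report)
```

Showing the percentage next to the count makes it easier to judge whether a column is worth keeping.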

Some of these columns are not relevant for this analysis, while others that are important have too many null values. For this reason, some of these columns will be removed.

### **Columns to Remove from `jobs_df`:**

+ `closed_time`: The closing date is not relevant in this case.
+ `skills_desc`: Although it is important, it has too many null values.
+ `med_salary`, `max_salary`, `min_salary`: We have the `normalized_salary` column, which has fewer null values and generalises the salary regardless of the currency type.
+ `compensation_type`: It only has two values (`BASE_SALARY`, `None`), and in this form, it does not provide any significant information other than indicating a base salary offered to the employee.
+ `listed_time`, `expiry`: The update date and expiry date are not needed for this analysis; we only require the original posting date (`original_listed_time`).
+ `fips`: This is a FIPS code, which identifies a U.S. county rather than a specific address. We already have `zip_code` and `location`, which are largely sufficient to precisely identify the job location.
+ `work_type`: This column indicates a specific work type using formats like **FULL_TIME** or **CONTRACT**. However, we already have the `formatted_work_type` column, which provides a more user-friendly format, such as **Full-time** or **Contract**. Therefore, having both columns might be redundant.
+ The URL columns are not relevant for the intended analysis either, so we will keep only the `job_posting_url` column to identify the job posting itself.
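The removals listed above could be applied in one step. The sketch below is an assumption about how this might look, not the notebook's actual transformation code: `COLUMNS_TO_DROP` is a hypothetical name, `application_url` is included as one reading of "the URL columns", and `demo` is a toy frame with only a few of the real columns.

```python
import pandas as pd

# Columns named in the list above; `application_url` is an assumed
# interpretation of "the URL columns" (job_posting_url is kept).
COLUMNS_TO_DROP = [
    "closed_time", "skills_desc",
    "med_salary", "max_salary", "min_salary",
    "compensation_type",
    "listed_time", "expiry",
    "fips", "work_type",
    "application_url",
]

# Toy stand-in for jobs_df (illustrative values only).
demo = pd.DataFrame({
    "job_id": [921716],
    "fips": [34021.0],
    "work_type": ["FULL_TIME"],
    "formatted_work_type": ["Full-time"],
    "job_posting_url": ["https://www.linkedin.com/jobs/view/921716"],
    "application_url": [None],
})

# errors="ignore" skips names that are absent from the frame instead of raising.
cleaned = demo.drop(columns=COLUMNS_TO_DROP, errors="ignore")
print(cleaned.columns.tolist())
```

Keeping the list in one constant makes the decision easy to review and to reuse in a later transformation step.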

### **Dataset information**

In [8]:
jobs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123849 entries, 0 to 123848
Data columns (total 31 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   job_id                      123849 non-null  int64  
 1   company_name                122130 non-null  object 
 2   title                       123849 non-null  object 
 3   description                 123842 non-null  object 
 4   max_salary                  29793 non-null   float64
 5   pay_period                  36073 non-null   object 
 6   location                    123849 non-null  object 
 7   company_id                  122132 non-null  float64
 8   views                       122160 non-null  float64
 9   med_salary                  6280 non-null    float64
 10  min_salary                  29793 non-null   float64
 11  formatted_work_type         123849 non-null  object 
 12  applies                     23320 non-null   float64
 13  original_liste

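1. The `zip_code` column is currently of type *float* but should be of type *string*. This preserves leading zeros (the first row shows `8540.0` for Princeton, NJ, whose ZIP code is 08540) and allows formats with hyphens "-", as is common in many places.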

2. The dates are in *float* format when they should actually be in *datetime* format to enable correct date interpretation. Therefore, we will convert them.  
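The two conversions above can be sketched as follows. This is an illustration under assumptions rather than the notebook's own code: the timestamps are assumed to be Unix epoch values in milliseconds (their magnitude, around 1.7e12, suggests this), ZIP codes are assumed to be 5-digit U.S. codes, and `demo` is a toy frame standing in for `jobs_df`.

```python
import pandas as pd

# Toy stand-in for jobs_df, using values seen in the head() output above.
demo = pd.DataFrame({
    "zip_code": [8540.0, 80521.0, None],
    "original_listed_time": [1713398000000.0, 1712858000000.0, 1713278000000.0],
})

# float -> nullable integer -> string, then left-pad to 5 digits.
# Missing values stay missing instead of becoming the text "nan".
demo["zip_code"] = (
    demo["zip_code"].astype("Int64").astype("string").str.zfill(5)
)

# Assumption: the floats are epoch milliseconds.
demo["original_listed_time"] = pd.to_datetime(
    demo["original_listed_time"], unit="ms"
)

print(demo.dtypes)
print(demo)
```

Going through the nullable `Int64` type first avoids strings such as `"8540.0"`, which a direct float-to-string cast would produce.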

### **Transformed Dataset (Explanation)**

After this cleaning, the dataset would no longer include those columns, leaving only 21 columns that could potentially be relevant.

## **salaries_df**

In [9]:
salaries_df.head()

Unnamed: 0,salary_id,job_id,max_salary,med_salary,min_salary,pay_period,currency,compensation_type
0,1,3884428798,,20.0,,HOURLY,USD,BASE_SALARY
1,2,3887470552,25.0,,23.0,HOURLY,USD,BASE_SALARY
2,3,3884431523,120000.0,,100000.0,YEARLY,USD,BASE_SALARY
3,4,3884911725,200000.0,,10000.0,YEARLY,USD,BASE_SALARY
4,5,3887473220,35.0,,33.0,HOURLY,USD,BASE_SALARY


### **Information**

This DataFrame contains approximately 41,000 rows (40,785 according to `salaries_df.info()`) and 8 columns.

In [10]:
salaries_df.isnull().sum()

salary_id                0
job_id                   0
max_salary            6838
med_salary           33947
min_salary            6838
pay_period               0
currency                 0
compensation_type        0
dtype: int64

1. We observe that approximately 7,000 records in `max_salary` and `min_salary` have null values, whereas `med_salary` has nearly 34,000 null values. Therefore, we can calculate the average between `max_salary` and `min_salary` to derive `med_salary` and then remove the records that lack values in all three of these columns.

2. We also notice that the DataFrame includes the `job_id` column, which means we can, in the future, join this table with the `jobs_df` table to enhance the information available in that table.
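The two ideas above can be sketched on toy data. This is a hedged illustration, not the notebook's final transformation: the frames `salaries` and `jobs` are hypothetical stand-ins, and the join assumes at most one salary row per `job_id` (otherwise job rows would be duplicated).

```python
import numpy as np
import pandas as pd

# Toy stand-in for salaries_df (illustrative values only).
salaries = pd.DataFrame({
    "salary_id": [1, 2, 3, 4],
    "job_id": [101, 102, 103, 104],
    "max_salary": [np.nan, 25.0, 120000.0, np.nan],
    "med_salary": [20.0, np.nan, np.nan, np.nan],
    "min_salary": [np.nan, 23.0, 100000.0, np.nan],
})

# Fill missing med_salary with the midpoint of max and min where both exist.
midpoint = (salaries["max_salary"] + salaries["min_salary"]) / 2
salaries["med_salary"] = salaries["med_salary"].fillna(midpoint)

# Drop rows where all three salary columns are still missing.
salaries = salaries.dropna(
    subset=["max_salary", "med_salary", "min_salary"], how="all"
)

# Toy stand-in for jobs_df; a left join keeps every job, with or without salary.
jobs = pd.DataFrame({
    "job_id": [101, 102, 105],
    "title": ["Marketing Coordinator", "Service Technician", "Therapist"],
})
enriched = jobs.merge(salaries[["job_id", "med_salary"]], on="job_id", how="left")
print(enriched)
```

Since `jobs_df` already has its own salary columns, a real join would need either to drop those first or to pass `suffixes` to `merge` so the names do not collide.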

## **benefits_df**

In [11]:
benefits_df.head()

Unnamed: 0,job_id,inferred,type
0,3887473071,0,Medical insurance
1,3887473071,0,Vision insurance
2,3887473071,0,Dental insurance
3,3887473071,0,401(k)
4,3887473071,0,Student loan assistance


### **Information**

Here we see that job postings offer various benefits, but each benefit is stored in its own row, so the same `job_id` is repeated once per benefit. This means we can group the benefit types into a single list per job, rather than keeping them in separate rows.
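Collecting the benefit types into one list per job could look like the sketch below. It is an assumption about one possible approach: `benefits` is a toy frame based on the rows shown above plus one made-up job, and `benefits_by_job` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for benefits_df; job_id 1 is a made-up extra row.
benefits = pd.DataFrame({
    "job_id": [3887473071] * 5 + [1],
    "inferred": [0, 0, 0, 0, 0, 1],
    "type": [
        "Medical insurance", "Vision insurance", "Dental insurance",
        "401(k)", "Student loan assistance", "401(k)",
    ],
})

# One row per job, with all of its benefit types gathered into a list.
benefits_by_job = (
    benefits.groupby("job_id")["type"]
    .agg(list)
    .reset_index(name="benefits")
)
print(benefits_by_job)
```

The resulting frame has a unique `job_id` per row, which makes a later join with `jobs_df` straightforward.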

In [12]:
benefits_df.isnull().sum()

job_id      0
inferred    0
type        0
dtype: int64

This table has no null values.

In [13]:
employee_counts_df.head()

Unnamed: 0,company_id,employee_count,follower_count,time_recorded
0,391906,186,32508,1712346173
1,22292832,311,4471,1712346173
2,20300,1053,6554,1712346173
3,3570660,383,35241,1712346173
4,878353,52,26397,1712346173


In [14]:
industries_df.head()

Unnamed: 0,industry_id,industry_name
0,1,Defense and Space Manufacturing
1,3,Computer Hardware Manufacturing
2,4,Software Development
3,5,Computer Networking Products
4,6,"Technology, Information and Internet"


In [15]:
skills_industries_df.head()

Unnamed: 0,skill_abr,skill_name
0,ART,Art/Creative
1,DSGN,Design
2,ADVR,Advertising
3,PRDM,Product Management
4,DIST,Distribution


In [17]:
salaries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40785 entries, 0 to 40784
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   salary_id          40785 non-null  int64  
 1   job_id             40785 non-null  int64  
 2   max_salary         33947 non-null  float64
 3   med_salary         6838 non-null   float64
 4   min_salary         33947 non-null  float64
 5   pay_period         40785 non-null  object 
 6   currency           40785 non-null  object 
 7   compensation_type  40785 non-null  object 
dtypes: float64(3), int64(2), object(3)
memory usage: 2.5+ MB


In [18]:
benefits_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67943 entries, 0 to 67942
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   job_id    67943 non-null  int64 
 1   inferred  67943 non-null  int64 
 2   type      67943 non-null  object
dtypes: int64(2), object(1)
memory usage: 1.6+ MB


In [19]:
employee_counts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35787 entries, 0 to 35786
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   company_id      35787 non-null  int64
 1   employee_count  35787 non-null  int64
 2   follower_count  35787 non-null  int64
 3   time_recorded   35787 non-null  int64
dtypes: int64(4)
memory usage: 1.1 MB


In [21]:
industries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 422 entries, 0 to 421
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   industry_id    422 non-null    int64 
 1   industry_name  388 non-null    object
dtypes: int64(1), object(1)
memory usage: 6.7+ KB


In [22]:
skills_industries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   skill_abr   35 non-null     object
 1   skill_name  35 non-null     object
dtypes: object(2)
memory usage: 692.0+ bytes


In [24]:
companies_df

Unnamed: 0,company_id,name,description,company_size,state,country,city,zip_code,address,url
0,1009,IBM,"At IBM, we do more than work. We create. We cr...",7.0,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm
1,1016,GE HealthCare,Every day millions of people feel the impact o...,7.0,0,US,Chicago,0,-,https://www.linkedin.com/company/gehealthcare
2,1025,Hewlett Packard Enterprise,Official LinkedIn of Hewlett Packard Enterpris...,7.0,Texas,US,Houston,77389,1701 E Mossy Oaks Rd Spring,https://www.linkedin.com/company/hewlett-packa...
3,1028,Oracle,We’re a cloud technology company that provides...,7.0,Texas,US,Austin,78741,2300 Oracle Way,https://www.linkedin.com/company/oracle
4,1033,Accenture,Accenture is a leading global professional ser...,7.0,0,IE,Dublin 2,0,Grand Canal Harbour,https://www.linkedin.com/company/accenture
...,...,...,...,...,...,...,...,...,...,...
24468,103456527,Kinder Prep Montessori Nursery & Preschool,Explore our renowned daycare and preschool cen...,1.0,New York,US,Brooklyn,11249,49 Broadway,https://www.linkedin.com/company/kinder-prep-m...
24469,103466352,Centent Consulting LLC,Centent Consulting LLC is a reputable human re...,,0,0,0,0,0,https://www.linkedin.com/company/centent-consu...
24470,103467540,"Kings and Queens Productions, LLC",We are a small but mighty collection of thinke...,,0,0,0,0,0,https://www.linkedin.com/company/kings-and-que...
24471,103468936,WebUnite,Our mission at WebUnite is to offer experience...,,Pennsylvania,US,Southampton,18966,720 2nd Street Pike,https://www.linkedin.com/company/webunite
