## **PROJECT - NOTEBOOK #2: Data Cleansing, Transformation and Exploratory Data Analysis (EDA)**

---

### **Setting Environment**

In [74]:
import os 
print(os.getcwd())

try:
    os.chdir("../project_etl")

except FileNotFoundError:
    print("""
        FileNotFoundError - The directory may not exist or you might not be in the specified path.
        If this has already worked, do not run this block again, as the current directory is already set to project_etl.
        """)
    
print(os.getcwd())

d:\U\FIFTH SEMESTER\ETL\project_etl
d:\U\FIFTH SEMESTER\ETL\project_etl


### **Load Data**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

from src.database.connection import creating_engine

In [76]:
engine = creating_engine()

### **Read Tables in database project_etl**

In [77]:
from sqlalchemy import create_engine, text

In [78]:
jobs_df = pd.read_sql("SELECT * FROM public.jobs", con=engine)
salaries_df = pd.read_sql("SELECT * FROM public.salaries", con=engine)
benefits_df = pd.read_sql("SELECT * FROM public.benefits", con=engine)
employee_counts_df = pd.read_sql("SELECT * FROM public.employee_counts", con=engine)
industries_df = pd.read_sql("SELECT * FROM public.industries", con=engine)
skills_industries_df = pd.read_sql("SELECT * FROM public.skills_industries", con=engine)
companies_df = pd.read_sql("SELECT * FROM public.companies", con=engine)

print("DataFrames loaded from PostgreSQL.")

DataFrames loaded from PostgreSQL.


**The database contains 7 tables, loaded into the following DataFrames:**

+ jobs_df
+ salaries_df
+ benefits_df
+ employee_counts_df
+ industries_df
+ skills_industries_df
+ companies_df

We will analyse each table one by one to perform the necessary cleanups and analyses.

## **jobs_df**

In [79]:
jobs_df.head()

Unnamed: 0,job_id,company_name,title,description,pay_period,location,company_id,views,formatted_work_type,original_listed_time,remote_allowed,job_posting_url,application_type,formatted_experience_level,sponsored,currency,normalized_salary,zip_code,job_id_modify,company_id_modify
0,3853386067,"CrossCountry Mortgage, LLC",Licensed Loan Partner,CrossCountry Mortgage is a leading mortgage le...,YEARLY,"Ellicott City, MD",3021785,2,Full-time,2024-04-11 18:40:39,False,https://www.linkedin.com/jobs/view/3853386067/...,ComplexOnsiteApply,No specified,0,USD,42500.0,21042.0,1,1
1,3853717462,"Spruce InfoTech, Inc",Quality process analyst | Hybrid in West Berli...,Consultant's Title: Quality Process EngineerWo...,Unknown,"West Berlin, NJ",4803413,2,Full-time,2024-04-19 13:46:13,False,https://www.linkedin.com/jobs/view/3853717462/...,ComplexOnsiteApply,No specified,0,Unknown,,8091.0,2,2
2,3853719293,"Miracle Software Systems, Inc",Business Development Account Manager,"Hello conections ,\nI trust you are doing well...",Unknown,"Novi, MI",15388,4,Full-time,2024-04-11 18:05:05,False,https://www.linkedin.com/jobs/view/3853719293/...,ComplexOnsiteApply,No specified,0,Unknown,,48374.0,3,3
3,3853995874,,Professional Singer,Summary:\nWe are looking for a professional or...,Unknown,"Boston, MA",-1,3,Temporary,2024-04-15 18:58:35,True,https://www.linkedin.com/jobs/view/3853995874/...,ComplexOnsiteApply,No specified,0,Unknown,,2108.0,4,4
4,3854137450,NTCA–The Rural Broadband Association,Accounting Manager,NTCA – The Rural Broadband Association is look...,Unknown,"Arlington, VA",39231,22,Full-time,2024-04-15 19:15:02,False,https://www.linkedin.com/jobs/view/3854137450/...,OffsiteApply,No specified,0,Unknown,,22201.0,5,5


### **Information**

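The `jobs` table contains information about job postings on LinkedIn: 31 columns in its raw form and approximately 124,000 entries. The version loaded in this notebook has 20 columns and 123,849 entries, as the `jobs_df.info()` output further down shows.  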

In [80]:
jobs_df.isnull().sum()

job_id                            0
company_name                   1719
title                             0
description                       7
pay_period                        0
location                          0
company_id                        0
views                             0
formatted_work_type               0
original_listed_time              0
remote_allowed                    0
job_posting_url                   0
application_type                  0
formatted_experience_level        0
sponsored                         0
currency                          0
normalized_salary             87776
zip_code                          0
job_id_modify                     0
company_id_modify                 0
dtype: int64

The raw table has a significant number of null values. Many of them fall in columns that are not needed, but others fall in columns that could provide very useful information.

**The columns with the most null values are:**

+ `closed_time`: 122,776
+ `skills_desc`: 121,410
+ `med_salary`: 117,569
+ `remote_allowed`: 108,603
+ `applies`: 100,529
+ `max_salary`: 94,056
+ `min_salary`: 94,056
+ `compensation_type`: 87,776
+ `normalized_salary`: 87,776
+ `pay_period`: 87,776

Some of these columns are not relevant for this analysis, while others that are important have too many null values. For this reason, some of these columns will be removed.

### **Columns to Remove from `jobs_df`:**

+ `closed_time`: The closing date is not relevant in this case.
+ `skills_desc`: Although it is important, it has too many null values.
+ `med_salary`, `max_salary`, `min_salary`: We have the `normalized_salary` column, which has fewer null values and generalises the salary regardless of the currency type.
+ `compensation_type`: It only has two values (`BASE_SALARY`, `None`), and in this form, it does not provide any significant information other than indicating a base salary offered to the employee.
+ `listed_time`, `expiry`: The update date and expiry date are not needed for this analysis; we only require the original posting date (`original_listed_time`).
+ `fips`: It indicates a specific address number, but we already have `zip_code` and `location`, which are largely sufficient to precisely identify the job location.
+ `work_type`: This column indicates a specific work type using formats like **FULL_TIME** or **CONTRACT**. However, we already have the `formatted_work_type` column, which provides a more user-friendly format, such as **Full-time** or **Contract**. Therefore, having both columns might be redundant.
+ The URL columns are not relevant to the planned analysis either, so we will keep only the `job_posting_url` column to identify the job posting.
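
The removal described above could be sketched as follows. The column names come from the list, except `application_url`, which is an assumed name for a non-essential URL column; the helper `drop_irrelevant_columns` and the toy data are also illustrative.

```python
import pandas as pd

# Names taken from the list above; "application_url" is an assumed column name.
COLUMNS_TO_DROP = [
    "closed_time", "skills_desc", "med_salary", "max_salary", "min_salary",
    "compensation_type", "listed_time", "expiry", "fips", "work_type",
    "application_url",
]


def drop_irrelevant_columns(df: pd.DataFrame, columns=COLUMNS_TO_DROP) -> pd.DataFrame:
    """Return a copy of df without the listed columns.

    errors="ignore" skips names that are absent, so the helper is safe to
    re-run on a table that has already been cleaned.
    """
    return df.drop(columns=list(columns), errors="ignore")


# Toy frame with a few of the raw columns (values are made up).
toy_raw = pd.DataFrame({
    "job_id": [1, 2],
    "closed_time": [None, None],
    "work_type": ["FULL_TIME", "CONTRACT"],
    "job_posting_url": ["https://example.com/1", "https://example.com/2"],
})

toy_clean = drop_irrelevant_columns(toy_raw)
print(toy_clean.columns.tolist())
```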

### **Dataset information**

In [81]:
jobs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123849 entries, 0 to 123848
Data columns (total 20 columns):
 #   Column                      Non-Null Count   Dtype         
---  ------                      --------------   -----         
 0   job_id                      123849 non-null  int64         
 1   company_name                122130 non-null  object        
 2   title                       123849 non-null  object        
 3   description                 123842 non-null  object        
 4   pay_period                  123849 non-null  object        
 5   location                    123849 non-null  object        
 6   company_id                  123849 non-null  int64         
 7   views                       123849 non-null  int64         
 8   formatted_work_type         123849 non-null  object        
 9   original_listed_time        123849 non-null  datetime64[ns]
 10  remote_allowed              123849 non-null  bool          
 11  job_posting_url             123849 non-

1. The `zip_code` column is currently of type *float* but should be of type *string* to allow formats with hyphens "-" as is common in many places.  

2. In the raw data the dates are stored in *float* format when they should be in *datetime* format so that they can be interpreted correctly, so we convert them. In the `info()` output above, `original_listed_time` already appears as `datetime64[ns]`.  
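
Both conversions can be sketched as below. Two assumptions are made that the notebook does not confirm: that zip codes are 5-digit US codes, so zero-padding is appropriate (for example `8091.0` becoming `"08091"`), and that the float dates are milliseconds since the Unix epoch. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def zip_to_string(series: pd.Series, width: int = 5) -> pd.Series:
    """Convert float zip codes to zero-padded strings, keeping missing values.

    Assumption: codes are US-style and `width` digits long.
    """
    as_int = series.astype("Int64")      # nullable integer drops the ".0"
    as_str = as_int.astype("string")     # missing values stay <NA>
    return as_str.str.zfill(width)


def epoch_ms_to_datetime(series: pd.Series) -> pd.Series:
    """Convert float timestamps to datetime.

    Assumption: values are milliseconds since 1970-01-01.
    """
    return pd.to_datetime(series, unit="ms")


# Toy values (made up, except the zip codes seen in the head() output).
zips = zip_to_string(pd.Series([21042.0, 8091.0, np.nan]))
dates = epoch_ms_to_datetime(pd.Series([86_400_000.0, np.nan]))
print(zips.tolist())
print(dates.tolist())
```

Storing zip codes as strings also leaves room for hyphenated formats, which a numeric column cannot represent.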

### **Transformed Dataset (Explanation)**

With these changes, the cleaned dataset no longer includes those columns, leaving only 21 columns that could be relevant.

## **salaries_df**

In [82]:
salaries_df.head()

Unnamed: 0,salary_id,job_id,max_salary,med_salary,min_salary,pay_period,currency,compensation_type,raw_salary
0,1,3884428798,,20.0,,HOURLY,USD,BASE_SALARY,20.0
1,2,3887470552,25.0,,23.0,HOURLY,USD,BASE_SALARY,24.0
2,3,3884431523,120000.0,,100000.0,YEARLY,USD,BASE_SALARY,110000.0
3,4,3884911725,200000.0,,10000.0,YEARLY,USD,BASE_SALARY,105000.0
4,5,3887473220,35.0,,33.0,HOURLY,USD,BASE_SALARY,34.0


### **Information**

This DataFrame contains 40,785 rows and 9 columns.

In [83]:
salaries_df.isnull().sum()

salary_id                0
job_id                   0
max_salary            6838
med_salary           33947
min_salary            6838
pay_period               0
currency                 0
compensation_type        0
raw_salary               0
dtype: int64

1. Approximately 7,000 records have null values in `max_salary` and `min_salary`, whereas `med_salary` has nearly 34,000 null values. We can therefore fill the missing `med_salary` values with the average of `max_salary` and `min_salary`, and then remove the records that lack values in all three of these columns.

2. We also notice that the DataFrame includes the `job_id` column, which means we can, in the future, join this table with the `jobs_df` table to enhance the information available in that table.
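
The two points above can be sketched on toy data as follows. The helper name `fill_med_salary` and the sample values are illustrative assumptions. In the `head()` output, `raw_salary` already appears to hold this midpoint (24.0 for a 23.0 to 25.0 range), so the sketch mainly shows the technique.

```python
import numpy as np
import pandas as pd


def fill_med_salary(salaries: pd.DataFrame) -> pd.DataFrame:
    """Fill missing med_salary with the min/max midpoint, then drop empty rows."""
    out = salaries.copy()
    # skipna=False: the midpoint is only defined when both bounds are present.
    midpoint = out[["min_salary", "max_salary"]].mean(axis=1, skipna=False)
    out["med_salary"] = out["med_salary"].fillna(midpoint)
    # Remove rows that have no salary information in any of the three columns.
    return out.dropna(subset=["min_salary", "med_salary", "max_salary"], how="all")


# Toy tables (made-up values).
toy_salaries = pd.DataFrame({
    "job_id": [10, 20, 30],
    "max_salary": [np.nan, 25.0, np.nan],
    "med_salary": [20.0, np.nan, np.nan],
    "min_salary": [np.nan, 23.0, np.nan],
})
toy_jobs = pd.DataFrame({"job_id": [10, 20, 40], "title": ["A", "B", "C"]})

filled = fill_med_salary(toy_salaries)

# A left join keeps every job posting, with NaN where no salary row matches.
enriched = toy_jobs.merge(filled[["job_id", "med_salary"]], on="job_id", how="left")
print(enriched)
```

If a job can have more than one salary row, the join would duplicate that job, so it may be worth checking `job_id` for duplicates in the salaries table first.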

## **benefits_df**

In [84]:
benefits_df.head()

Unnamed: 0,job_id,type
0,23221523,{401(k)}
1,56482768,"{401(k),""Dental insurance"",""Disability insuran..."
2,69333422,"{""Medical insurance"",""Vision insurance"",""Denta..."
3,95428182,"{""Medical insurance"",""Dental insurance"",""Disab..."
4,111513530,"{""Medical insurance"",""Paid maternity leave"",""P..."


### **Information**

Here we see that companies offer various benefits, but they are listed under different IDs, even for the same company. This means we can create a list of the types of benefits offered, rather than keeping them separate.
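
The `type` values shown above look like PostgreSQL array literals read back as plain strings (for example `{401(k),"Dental insurance"}`). Assuming that is the case, a minimal sketch for turning them into Python lists and counting benefit types could look like this. The helper `parse_pg_array` is hypothetical and handles only simple quoting, not backslash escapes.

```python
import csv

import pandas as pd


def parse_pg_array(value: str) -> list:
    """Split a string such as '{401(k),"Dental insurance"}' into a list of items."""
    inner = value.strip()
    if inner.startswith("{") and inner.endswith("}"):
        inner = inner[1:-1]
    if not inner:
        return []
    # csv.reader handles commas that appear inside double-quoted items.
    return next(csv.reader([inner], quotechar='"', skipinitialspace=True))


# Toy rows imitating the head() output (job ids are made up).
toy_benefits = pd.DataFrame({
    "job_id": [1, 2],
    "type": ['{401(k)}', '{"Medical insurance","Vision insurance"}'],
})

toy_benefits["benefit_list"] = toy_benefits["type"].apply(parse_pg_array)

# One row per (job, benefit) pair makes counting straightforward.
benefit_counts = toy_benefits.explode("benefit_list")["benefit_list"].value_counts()
print(benefit_counts)
```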

In [85]:
benefits_df.isnull().sum()

job_id    0
type      0
dtype: int64

This table has no null values.

In [86]:
employee_counts_df.head()

Unnamed: 0,company_id,employee_count,follower_count,time_recorded
0,391906,186,32508,2024-04-05
1,22292832,311,4471,2024-04-05
2,20300,1053,6554,2024-04-05
3,3570660,383,35241,2024-04-05
4,878353,52,26397,2024-04-05


In [87]:
industries_df.head()

Unnamed: 0,industry_id,industry_name
0,1,Defense and Space Manufacturing
1,3,Computer Hardware Manufacturing
2,4,Software Development
3,5,Computer Networking Products
4,6,"Technology, Information and Internet"


In [88]:
skills_industries_df.head()

Unnamed: 0,skill_abr,skill_name
0,ART,Art/Creative
1,DSGN,Design
2,ADVR,Advertising
3,PRDM,Product Management
4,DIST,Distribution


In [89]:
jobs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123849 entries, 0 to 123848
Data columns (total 20 columns):
 #   Column                      Non-Null Count   Dtype         
---  ------                      --------------   -----         
 0   job_id                      123849 non-null  int64         
 1   company_name                122130 non-null  object        
 2   title                       123849 non-null  object        
 3   description                 123842 non-null  object        
 4   pay_period                  123849 non-null  object        
 5   location                    123849 non-null  object        
 6   company_id                  123849 non-null  int64         
 7   views                       123849 non-null  int64         
 8   formatted_work_type         123849 non-null  object        
 9   original_listed_time        123849 non-null  datetime64[ns]
 10  remote_allowed              123849 non-null  bool          
 11  job_posting_url             123849 non-

In [90]:
salaries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40785 entries, 0 to 40784
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   salary_id          40785 non-null  int64  
 1   job_id             40785 non-null  int64  
 2   max_salary         33947 non-null  float64
 3   med_salary         6838 non-null   float64
 4   min_salary         33947 non-null  float64
 5   pay_period         40785 non-null  object 
 6   currency           40785 non-null  object 
 7   compensation_type  40785 non-null  object 
 8   raw_salary         40785 non-null  float64
dtypes: float64(4), int64(2), object(3)
memory usage: 2.8+ MB


In [91]:
benefits_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30023 entries, 0 to 30022
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   job_id  30023 non-null  int64 
 1   type    30023 non-null  object
dtypes: int64(1), object(1)
memory usage: 469.2+ KB


In [92]:
employee_counts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35787 entries, 0 to 35786
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   company_id      35787 non-null  int64 
 1   employee_count  35787 non-null  int64 
 2   follower_count  35787 non-null  int64 
 3   time_recorded   35787 non-null  object
dtypes: int64(3), object(1)
memory usage: 1.1+ MB


In [93]:
employee_counts_df.head()

Unnamed: 0,company_id,employee_count,follower_count,time_recorded
0,391906,186,32508,2024-04-05
1,22292832,311,4471,2024-04-05
2,20300,1053,6554,2024-04-05
3,3570660,383,35241,2024-04-05
4,878353,52,26397,2024-04-05


In [94]:
industries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 422 entries, 0 to 421
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   industry_id    422 non-null    int64 
 1   industry_name  422 non-null    object
dtypes: int64(1), object(1)
memory usage: 6.7+ KB


In [95]:
skills_industries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   skill_abr   35 non-null     object
 1   skill_name  35 non-null     object
dtypes: object(2)
memory usage: 692.0+ bytes


In [96]:
skills_industries_df.head()

Unnamed: 0,skill_abr,skill_name
0,ART,Art/Creative
1,DSGN,Design
2,ADVR,Advertising
3,PRDM,Product Management
4,DIST,Distribution


In [97]:
companies_df

Unnamed: 0,company_id,name,description,company_size,state,country,city,zip_code,address,url
0,1009,IBM,"At IBM, we do more than work. We create. We cr...",7,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm
1,1016,GE HealthCare,Every day millions of people feel the impact o...,7,Unknown,US,Chicago,Unknown,-,https://www.linkedin.com/company/gehealthcare
2,1025,Hewlett Packard Enterprise,Official LinkedIn of Hewlett Packard Enterpris...,7,Texas,US,Houston,77389,1701 E Mossy Oaks Rd Spring,https://www.linkedin.com/company/hewlett-packa...
3,1028,Oracle,We’re a cloud technology company that provides...,7,Texas,US,Austin,78741,2300 Oracle Way,https://www.linkedin.com/company/oracle
4,1033,Accenture,Accenture is a leading global professional ser...,7,Unknown,IE,Dublin 2,Unknown,Grand Canal Harbour,https://www.linkedin.com/company/accenture
...,...,...,...,...,...,...,...,...,...,...
24467,103456466,Foundation Model Startup,No description,3,Unknown,0,0,Unknown,0,https://www.linkedin.com/company/foundation-mo...
24468,103456527,Kinder Prep Montessori Nursery & Preschool,Explore our renowned daycare and preschool cen...,1,New York,US,Brooklyn,11249,49 Broadway,https://www.linkedin.com/company/kinder-prep-m...
24469,103466352,Centent Consulting LLC,Centent Consulting LLC is a reputable human re...,3,Unknown,0,0,Unknown,0,https://www.linkedin.com/company/centent-consu...
24470,103468936,WebUnite,Our mission at WebUnite is to offer experience...,3,Pennsylvania,US,Southampton,18966,720 2nd Street Pike,https://www.linkedin.com/company/webunite
