<a href="https://colab.research.google.com/github/Niranjana-08/AI-Ascent/blob/main/notebooks/data_cleaning/data_cleaning_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Notebook Objective:**
*   This notebook is the central hub for data integration. It loads 11 raw CSV files, each holding one part of the overall job data, and merges the selected columns into a single master DataFrame.
*   The merged file is the foundation for all subsequent cleaning and analysis.










## Setup and Data Loading

In [None]:
import pandas as pd
from google.colab import drive

In [None]:
print("Mounting Google Drive...")
drive.mount('/content/drive')
print("Drive mounted successfully.")

Mounting Google Drive...
Mounted at /content/drive
Drive mounted successfully.


### Defining the file paths for ALL files

In [None]:
base_path = '/content/drive/My Drive/job-analysis/job-analysis-dataset/'

In [None]:
postings_path = base_path + 'postings.csv'

In [None]:
# Files in the 'companies' folder
companies_path = base_path + 'companies/companies.csv'
company_industries_path = base_path + 'companies/company_industries.csv'
company_specialities_path = base_path + 'companies/company_specialities.csv'
employee_counts_path = base_path + 'companies/employee_counts.csv'

In [None]:
# Files in the 'jobs' folder
benefits_path = base_path + 'jobs/benefits.csv'
job_industries_path = base_path + 'jobs/job_industries.csv'
job_skills_path = base_path + 'jobs/job_skills.csv'
salaries_path = base_path + 'jobs/salaries.csv'

In [None]:
# Files in the 'mappings' folder
industries_path = base_path + 'mappings/industries.csv'
skills_path = base_path + 'mappings/skills.csv'

### Loading ALL the CSV files into pandas DataFrames

In [None]:
print("\nLoading all CSV files...")
try:
    postings_df = pd.read_csv(postings_path)

    companies_df = pd.read_csv(companies_path)
    company_industries_df = pd.read_csv(company_industries_path)
    company_specialities_df = pd.read_csv(company_specialities_path)
    employee_counts_df = pd.read_csv(employee_counts_path)

    benefits_df = pd.read_csv(benefits_path)
    job_industries_df = pd.read_csv(job_industries_path)
    job_skills_df = pd.read_csv(job_skills_path)
    salaries_df = pd.read_csv(salaries_path)

    industries_df = pd.read_csv(industries_path)
    skills_df = pd.read_csv(skills_path)

    print("All 11 files loaded successfully!")

except FileNotFoundError as e:
    # No placeholders in this message, so a plain string is enough.
    print("\nError: A file was not found. Please check your folder and file names.")
    print(f"Details: {e}")


Loading all CSV files...
All 11 files loaded successfully!
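
The loading cell stops at the first missing file, so it reports only one problem per run. As a rough sketch of an alternative (the `load_csvs` helper and the demo folder layout are hypothetical, not part of the original notebook), the paths can be kept in a dictionary and every missing file reported at once:

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csvs(base_dir, relative_paths):
    """Load every CSV that exists; return (frames, missing), keyed by short name."""
    frames, missing = {}, []
    for name, rel_path in relative_paths.items():
        full_path = Path(base_dir) / rel_path
        if full_path.is_file():
            frames[name] = pd.read_csv(full_path)
        else:
            missing.append(rel_path)
    return frames, missing


# Demo on a temporary folder standing in for the Drive dataset folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "mappings").mkdir()
    # Made-up toy row, only to have one file on disk.
    pd.DataFrame({"skill_abr": ["MRKT"], "skill_name": ["Marketing"]}).to_csv(
        Path(tmp) / "mappings" / "skills.csv", index=False
    )
    demo_paths = {
        "skills": "mappings/skills.csv",
        "industries": "mappings/industries.csv",  # deliberately absent
    }
    demo_frames, demo_missing = load_csvs(tmp, demo_paths)

print("Loaded:", sorted(demo_frames))
print("Missing:", demo_missing)
```

In the notebook itself, `base_dir` would be `base_path` and the dictionary would list all 11 relative paths.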


In [None]:
postings_df.head()

Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,Requirements: \n\nWe are seeking a College or ...,1713398000000.0,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,,1712858000000.0,,0,FULL_TIME,USD,BASE_SALARY,83200.0,80521.0,8069.0
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,We are currently accepting resumes for FOH - A...,1713278000000.0,,0,FULL_TIME,USD,BASE_SALARY,55000.0,45202.0,39061.0
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,This position requires a baseline understandin...,1712896000000.0,,0,FULL_TIME,USD,BASE_SALARY,157500.0,11040.0,36059.0
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,,1713452000000.0,,0,FULL_TIME,USD,BASE_SALARY,70000.0,52601.0,19057.0


In [None]:
postings_df.shape


(123849, 31)

In [None]:
postings_df.columns

Index(['job_id', 'company_name', 'title', 'description', 'max_salary',
       'pay_period', 'location', 'company_id', 'views', 'med_salary',
       'min_salary', 'formatted_work_type', 'applies', 'original_listed_time',
       'remote_allowed', 'job_posting_url', 'application_url',
       'application_type', 'expiry', 'closed_time',
       'formatted_experience_level', 'skills_desc', 'listed_time',
       'posting_domain', 'sponsored', 'work_type', 'currency',
       'compensation_type', 'normalized_salary', 'zip_code', 'fips'],
      dtype='object')

In [None]:
companies_df.head()

Unnamed: 0,company_id,name,description,company_size,state,country,city,zip_code,address,url
0,1009,IBM,"At IBM, we do more than work. We create. We cr...",7.0,NY,US,"Armonk, New York",10504,International Business Machines Corp.,https://www.linkedin.com/company/ibm
1,1016,GE HealthCare,Every day millions of people feel the impact o...,7.0,0,US,Chicago,0,-,https://www.linkedin.com/company/gehealthcare
2,1025,Hewlett Packard Enterprise,Official LinkedIn of Hewlett Packard Enterpris...,7.0,Texas,US,Houston,77389,1701 E Mossy Oaks Rd Spring,https://www.linkedin.com/company/hewlett-packa...
3,1028,Oracle,We’re a cloud technology company that provides...,7.0,Texas,US,Austin,78741,2300 Oracle Way,https://www.linkedin.com/company/oracle
4,1033,Accenture,Accenture is a leading global professional ser...,7.0,0,IE,Dublin 2,0,Grand Canal Harbour,https://www.linkedin.com/company/accenture


## Foundation: Creating the Base DataFrame

The required columns were selected after reviewing every CSV file.

In [None]:
# Columns to retain from the original postings_df
desired_columns = [
    'job_id', 'title', 'description', 'max_salary', 'med_salary', 'min_salary',
    'pay_period', 'location', 'company_id', 'formatted_work_type',
    'formatted_experience_level', 'skills_desc', 'work_type', 'currency',
    'compensation_type'
]

existing_desired_columns = [col for col in desired_columns if col in postings_df.columns]
final_df = postings_df[existing_desired_columns].copy()

print("New 'final_df' created with selected columns. Displaying info:")
final_df.info()

print("\nFirst 5 rows of the new DataFrame:")
print(final_df.head())

New 'final_df' created with selected columns. Displaying info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123849 entries, 0 to 123848
Data columns (total 15 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   job_id                      123849 non-null  int64  
 1   title                       123849 non-null  object 
 2   description                 123842 non-null  object 
 3   max_salary                  29793 non-null   float64
 4   med_salary                  6280 non-null    float64
 5   min_salary                  29793 non-null   float64
 6   pay_period                  36073 non-null   object 
 7   location                    123849 non-null  object 
 8   company_id                  122132 non-null  float64
 9   formatted_work_type         123849 non-null  object 
 10  formatted_experience_level  94440 non-null   object 
 11  skills_desc                 2439 non-null    object 
 12  work_type

###  Merging Company Details

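Adding the required columns from companies.csv: the company name (renamed to company_name) and company_size. The company description is also in the candidate list, but it is skipped because final_df already has a description column (the job description).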

In [None]:
company_cols_to_add = [
    'name',
    'description',
    'company_size'
]

cols_to_merge = ['company_id'] + [col for col in company_cols_to_add if col not in final_df.columns]
if len(cols_to_merge) > 1:
    print(f"Merging the following new columns: {cols_to_merge[1:]}")
    final_df = pd.merge(
        final_df,
        companies_df[cols_to_merge],
        on='company_id',
        how='left'
    )
    if 'name' in final_df.columns:
        final_df.rename(columns={'name': 'company_name'}, inplace=True)
else:
    print("No new company columns to add, as they already exist.")

print("\nCompany details merged. Displaying first 5 rows:")
final_df.head()

Merging the following new columns: ['name', 'company_size']

Company details merged. Displaying first 5 rows:


Unnamed: 0,job_id,title,description,max_salary,med_salary,min_salary,pay_period,location,company_id,formatted_work_type,formatted_experience_level,skills_desc,work_type,currency,compensation_type,company_name,company_size
0,921716,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,,17.0,HOURLY,"Princeton, NJ",2774458.0,Full-time,,Requirements: \n\nWe are seeking a College or ...,FULL_TIME,USD,BASE_SALARY,Corcoran Sawyer Smith,2.0
1,1829192,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,,30.0,HOURLY,"Fort Collins, CO",,Full-time,,,FULL_TIME,USD,BASE_SALARY,,
2,10998357,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,,45000.0,YEARLY,"Cincinnati, OH",64896719.0,Full-time,,We are currently accepting resumes for FOH - A...,FULL_TIME,USD,BASE_SALARY,The National Exemplar,1.0
3,23221523,Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,,140000.0,YEARLY,"New Hyde Park, NY",766262.0,Full-time,,This position requires a baseline understandin...,FULL_TIME,USD,BASE_SALARY,"Abrams Fensterman, LLP",2.0
4,35982263,Service Technician,Looking for HVAC service tech with experience ...,80000.0,,60000.0,YEARLY,"Burlington, IA",,Full-time,,,FULL_TIME,USD,BASE_SALARY,,
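
A left merge like the one above silently duplicates job rows if the lookup table has repeated keys. As a small sketch on made-up toy data (the `jobs` and `companies` frames here are illustrative, not the real files), the `validate` and `indicator` parameters of `pd.merge` can check this and show how many rows found a match:

```python
import pandas as pd

# Toy stand-ins for final_df and companies_df (values are made up).
jobs = pd.DataFrame({"job_id": [1, 2, 3], "company_id": [10.0, None, 30.0]})
companies = pd.DataFrame({"company_id": [10.0, 20.0], "name": ["Acme", "Globex"]})

merged = pd.merge(
    jobs,
    companies,
    on="company_id",
    how="left",
    validate="many_to_one",  # raise if company_id repeats on the right side
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
print(merged["_merge"].value_counts())

# A lookup table with a repeated key fails the validation.
dup_companies = pd.concat([companies, companies.iloc[[0]]], ignore_index=True)
duplicate_detected = False
try:
    pd.merge(jobs, dup_companies, on="company_id", how="left", validate="many_to_one")
except pd.errors.MergeError as exc:
    duplicate_detected = True
    print("Duplicate keys detected:", exc)
```

Rows marked `left_only` correspond to jobs whose company_id is missing or has no entry in the company table, which matches the NaN company fields seen in the output above.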


### Merging Company Specialities

Adding the required column from company_specialities.csv.

A company can have multiple specialities. We first group the company_specialities_df by company_id to aggregate all specialities into a list. Then, we merge this list into our final_df.
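
As a minimal illustration of this group-then-merge pattern, here is a sketch on made-up toy data (the frames and values are assumptions for the example, not rows from the dataset):

```python
import pandas as pd

# Toy stand-ins for company_specialities_df and final_df.
specialities = pd.DataFrame({
    "company_id": [10, 10, 20],
    "speciality": ["real estate", "new development", "litigation"],
})
jobs = pd.DataFrame({"job_id": [1, 2, 3], "company_id": [10, 20, 30]})

# One row per company, with all of its specialities collected into a list.
per_company = (
    specialities.groupby("company_id")["speciality"].apply(list).reset_index()
)

# Left merge keeps every job; companies without specialities get NaN...
jobs = jobs.merge(per_company, on="company_id", how="left")

# ...which is then replaced by an empty list so the column has one type.
jobs["speciality"] = jobs["speciality"].apply(
    lambda value: value if isinstance(value, list) else []
)
print(jobs)
```

Aggregating before merging keeps one row per job; merging the raw specialities table directly would repeat each job once per speciality.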

In [None]:
if 'speciality' not in final_df.columns:
    print("Merging company specialities...")
    company_id_to_specialities = company_specialities_df.groupby('company_id')['speciality'].apply(list).reset_index()
    final_df = pd.merge(
        final_df,
        company_id_to_specialities,
        on='company_id',
        how='left'
    )
    final_df['speciality'] = final_df['speciality'].apply(lambda x: x if isinstance(x, list) else [])

    cols = final_df.columns.tolist()
    cols.remove('speciality')
    new_order = cols + ['speciality']
    final_df = final_df[new_order]

    print("Company specialities merged successfully.")
else:
    print("The 'speciality' column already exists.")

print("\nDisplaying first 5 rows with 'speciality' column on the far right:")
final_df.head()

Merging company specialities...
Company specialities merged successfully.

Displaying first 5 rows with 'speciality' column on the far right:


Unnamed: 0,job_id,title,description,max_salary,med_salary,min_salary,pay_period,location,company_id,formatted_work_type,formatted_experience_level,skills_desc,work_type,currency,compensation_type,company_name,company_size,speciality
0,921716,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,,17.0,HOURLY,"Princeton, NJ",2774458.0,Full-time,,Requirements: \n\nWe are seeking a College or ...,FULL_TIME,USD,BASE_SALARY,Corcoran Sawyer Smith,2.0,"[real estate, new development]"
1,1829192,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,,30.0,HOURLY,"Fort Collins, CO",,Full-time,,,FULL_TIME,USD,BASE_SALARY,,,[]
2,10998357,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,,45000.0,YEARLY,"Cincinnati, OH",64896719.0,Full-time,,We are currently accepting resumes for FOH - A...,FULL_TIME,USD,BASE_SALARY,The National Exemplar,1.0,[]
3,23221523,Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,,140000.0,YEARLY,"New Hyde Park, NY",766262.0,Full-time,,This position requires a baseline understandin...,FULL_TIME,USD,BASE_SALARY,"Abrams Fensterman, LLP",2.0,"[Civil Litigation, Corporate & Securities Law,..."
4,35982263,Service Technician,Looking for HVAC service tech with experience ...,80000.0,,60000.0,YEARLY,"Burlington, IA",,Full-time,,,FULL_TIME,USD,BASE_SALARY,,,[]


### Merging Employee Counts

Adding the required column from employee_counts.csv.

Employee counts can change over time. To get the most relevant number, we'll sort the employee_counts_df by time and take the most recent entry for each company before merging.
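
A small sketch of the "latest record per company" step on made-up toy data (the counts and timestamps are invented for illustration; only the column names follow the notebook):

```python
import pandas as pd

# Toy stand-in for employee_counts_df: company 10 was recorded twice.
counts = pd.DataFrame({
    "company_id": [10, 10, 20],
    "employee_count": [380, 402, 15],
    "time_recorded": [1712000000, 1713000000, 1712500000],
})

# Sort oldest to newest, then keep the last (newest) row for each company.
latest = counts.sort_values("time_recorded").drop_duplicates(
    "company_id", keep="last"
)
latest_by_company = latest.set_index("company_id")["employee_count"].to_dict()
print(latest_by_company)
```

Because the result has exactly one row per company_id, the left merge that follows cannot multiply job rows.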

In [None]:
try:
    employee_counts_df
except NameError:
    print("Loading employee_counts.csv...")
    employee_counts_df = pd.read_csv(base_path + 'companies/employee_counts.csv')

if 'employee_count' not in final_df.columns:
    print("Merging employee counts...")
    recent_employee_counts = employee_counts_df.sort_values('time_recorded').drop_duplicates('company_id', keep='last')

    final_df = pd.merge(
        final_df,
        recent_employee_counts[['company_id', 'employee_count']],
        on='company_id',
        how='left'
    )

    cols = final_df.columns.tolist()
    cols.remove('employee_count')
    new_order = cols + ['employee_count']
    final_df = final_df[new_order]

    print("Employee counts merged successfully.")
else:
    print("The 'employee_count' column already exists.")

print("\nDisplaying first 5 rows with new 'employee_count' column:")
final_df.head()

Merging employee counts...
Employee counts merged successfully.

Displaying first 5 rows with new 'employee_count' column:


Unnamed: 0,job_id,title,description,max_salary,med_salary,min_salary,pay_period,location,company_id,formatted_work_type,formatted_experience_level,skills_desc,work_type,currency,compensation_type,company_name,company_size,speciality,employee_count
0,921716,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,,17.0,HOURLY,"Princeton, NJ",2774458.0,Full-time,,Requirements: \n\nWe are seeking a College or ...,FULL_TIME,USD,BASE_SALARY,Corcoran Sawyer Smith,2.0,"[real estate, new development]",402.0
1,1829192,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,,30.0,HOURLY,"Fort Collins, CO",,Full-time,,,FULL_TIME,USD,BASE_SALARY,,,[],
2,10998357,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,,45000.0,YEARLY,"Cincinnati, OH",64896719.0,Full-time,,We are currently accepting resumes for FOH - A...,FULL_TIME,USD,BASE_SALARY,The National Exemplar,1.0,[],15.0
3,23221523,Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,,140000.0,YEARLY,"New Hyde Park, NY",766262.0,Full-time,,This position requires a baseline understandin...,FULL_TIME,USD,BASE_SALARY,"Abrams Fensterman, LLP",2.0,"[Civil Litigation, Corporate & Securities Law,...",222.0
4,35982263,Service Technician,Looking for HVAC service tech with experience ...,80000.0,,60000.0,YEARLY,"Burlington, IA",,Full-time,,,FULL_TIME,USD,BASE_SALARY,,,[],


### Merging Salary Information

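Adding the required columns from salaries.csv. In this run all five candidate columns already exist in final_df (they were taken from postings.csv), so the merge is skipped.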

In [None]:
try:
    salaries_df
except NameError:
    print("Loading salaries.csv...")
    salaries_df = pd.read_csv(base_path + 'jobs/salaries.csv')

salary_cols_to_add = [
    'max_salary',
    'min_salary',
    'pay_period',
    'currency',
    'compensation_type'
]

cols_to_merge = ['job_id'] + [col for col in salary_cols_to_add if col not in final_df.columns]

if len(cols_to_merge) > 1:
    print(f"Merging the following new columns: {cols_to_merge[1:]}")
    final_df = pd.merge(
        final_df,
        salaries_df[cols_to_merge],
        on='job_id',
        how='left'
    )

    new_cols = [col for col in cols_to_merge if col != 'job_id']
    existing_cols = [col for col in final_df.columns if col not in new_cols]
    new_order = existing_cols + new_cols
    final_df = final_df[new_order]

    print("Salary information merged successfully.")
else:
    print("No new salary columns to add, as they already exist.")

print("\nDisplaying first 5 rows with new salary columns:")
final_df.head()

No new salary columns to add, as they already exist.

Displaying first 5 rows with new salary columns:


Unnamed: 0,job_id,title,description,max_salary,med_salary,min_salary,pay_period,location,company_id,formatted_work_type,formatted_experience_level,skills_desc,work_type,currency,compensation_type,company_name,company_size,speciality,employee_count
0,921716,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,,17.0,HOURLY,"Princeton, NJ",2774458.0,Full-time,,Requirements: \n\nWe are seeking a College or ...,FULL_TIME,USD,BASE_SALARY,Corcoran Sawyer Smith,2.0,"[real estate, new development]",402.0
1,1829192,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,,30.0,HOURLY,"Fort Collins, CO",,Full-time,,,FULL_TIME,USD,BASE_SALARY,,,[],
2,10998357,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,,45000.0,YEARLY,"Cincinnati, OH",64896719.0,Full-time,,We are currently accepting resumes for FOH - A...,FULL_TIME,USD,BASE_SALARY,The National Exemplar,1.0,[],15.0
3,23221523,Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,,140000.0,YEARLY,"New Hyde Park, NY",766262.0,Full-time,,This position requires a baseline understandin...,FULL_TIME,USD,BASE_SALARY,"Abrams Fensterman, LLP",2.0,"[Civil Litigation, Corporate & Securities Law,...",222.0
4,35982263,Service Technician,Looking for HVAC service tech with experience ...,80000.0,,60000.0,YEARLY,"Burlington, IA",,Full-time,,,FULL_TIME,USD,BASE_SALARY,,,[],


### Merging Job Skills

Adding the required column from job_skills.csv and skills.csv, grouped by job_id.

This is a multi-step process to get a clean list of skills for each job:

1. Map: Merge job_skills_df with skills_df to map skill abbreviations (skill_abr) to full skill names.

2. Group: Group the results by job_id to aggregate all skill names into a list.

3. Merge: Merge the final skill lists into our final_df.
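
The three steps above can be sketched on made-up toy data as follows (the abbreviations and names are illustrative values, not a claim about the contents of the mapping file):

```python
import pandas as pd

# Toy stand-ins for job_skills_df, skills_df and final_df.
job_skills = pd.DataFrame({
    "job_id": [1, 1, 2],
    "skill_abr": ["MRKT", "SALE", "IT"],
})
skills = pd.DataFrame({
    "skill_abr": ["MRKT", "SALE", "IT"],
    "skill_name": ["Marketing", "Sales", "Information Technology"],
})
jobs = pd.DataFrame({"job_id": [1, 2, 3]})

# 1. Map: attach the full skill name to every (job_id, skill_abr) pair.
mapped = job_skills.merge(skills, on="skill_abr", how="left")

# 2. Group: collect all skill names of a job into one list.
skill_lists = mapped.groupby("job_id")["skill_name"].apply(list).reset_index()

# 3. Merge: add the lists to the jobs; jobs without skills get an empty list.
jobs = jobs.merge(skill_lists, on="job_id", how="left")
jobs["skill_name"] = jobs["skill_name"].apply(
    lambda value: value if isinstance(value, list) else []
)
print(jobs)
```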

In [None]:
try:
    job_skills_df
except NameError:
    print("Loading job_skills.csv...")
    job_skills_df = pd.read_csv(base_path + 'jobs/job_skills.csv')
try:
    skills_df
except NameError:
    print("Loading skills.csv...")
    skills_df = pd.read_csv(base_path + 'mappings/skills.csv')

if 'skill_name' not in final_df.columns:
    print("Preparing and merging skill information...")

    skill_map_df = pd.merge(job_skills_df, skills_df, on='skill_abr', how='left')

    job_id_to_skills = skill_map_df.groupby('job_id')['skill_name'].apply(list).reset_index()

    final_df = pd.merge(
        final_df,
        job_id_to_skills,
        on='job_id',
        how='left'
    )

    final_df['skill_name'] = final_df['skill_name'].apply(lambda x: x if isinstance(x, list) else [])

    cols = final_df.columns.tolist()
    cols.remove('skill_name')
    new_order = cols + ['skill_name']
    final_df = final_df[new_order]

    print("Skill information merged successfully.")
else:
    print("The 'skill_name' column already exists.")

print("\nDisplaying first 5 rows with new 'skill_name' column:")
final_df.head()

Preparing and merging skill information...
Skill information merged successfully.

Displaying first 5 rows with new 'skill_name' column:


Unnamed: 0,job_id,title,description,max_salary,med_salary,min_salary,pay_period,location,company_id,formatted_work_type,formatted_experience_level,skills_desc,work_type,currency,compensation_type,company_name,company_size,speciality,employee_count,skill_name
0,921716,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,,17.0,HOURLY,"Princeton, NJ",2774458.0,Full-time,,Requirements: \n\nWe are seeking a College or ...,FULL_TIME,USD,BASE_SALARY,Corcoran Sawyer Smith,2.0,"[real estate, new development]",402.0,"[Marketing, Sales]"
1,1829192,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,,30.0,HOURLY,"Fort Collins, CO",,Full-time,,,FULL_TIME,USD,BASE_SALARY,,,[],,[Health Care Provider]
2,10998357,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,,45000.0,YEARLY,"Cincinnati, OH",64896719.0,Full-time,,We are currently accepting resumes for FOH - A...,FULL_TIME,USD,BASE_SALARY,The National Exemplar,1.0,[],15.0,"[Management, Manufacturing]"
3,23221523,Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,,140000.0,YEARLY,"New Hyde Park, NY",766262.0,Full-time,,This position requires a baseline understandin...,FULL_TIME,USD,BASE_SALARY,"Abrams Fensterman, LLP",2.0,"[Civil Litigation, Corporate & Securities Law,...",222.0,[Other]
4,35982263,Service Technician,Looking for HVAC service tech with experience ...,80000.0,,60000.0,YEARLY,"Burlington, IA",,Full-time,,,FULL_TIME,USD,BASE_SALARY,,,[],,[Information Technology]


## Saving the Final Dataset

In [None]:
output_path = base_path + 'final_merged_jobs.csv'

print(f"Saving the final DataFrame to: {output_path}")
final_df.to_csv(output_path, index=False)

print("\nFile saved successfully!")
print("You can now use 'final_merged_jobs.csv' for all your analysis.")

Saving the final DataFrame to: /content/drive/My Drive/job-analysis/job-analysis-dataset/final_merged_jobs.csv

File saved successfully!
You can now use 'final_merged_jobs.csv' for all your analysis.
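
One thing to keep in mind when reloading this file: CSV has no list type, so list columns such as speciality and skill_name come back as plain strings. A possible way to restore them, sketched on a made-up two-row frame with an in-memory buffer standing in for the saved file, is `ast.literal_eval`:

```python
import ast
import io

import pandas as pd

# Toy frame with a list column, similar in shape to skill_name.
df = pd.DataFrame({"job_id": [1, 2], "skill_name": [["Marketing", "Sales"], []]})

# Write to an in-memory buffer instead of Drive, then read it back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip each cell is a string such as "['Marketing', 'Sales']".
was_string = isinstance(reloaded.loc[0, "skill_name"], str)

# literal_eval safely parses the string form back into a Python list.
reloaded["skill_name"] = reloaded["skill_name"].apply(ast.literal_eval)
print(reloaded["skill_name"].tolist())
```

This assumes every cell in the column holds the string form of a list, which is the case here because missing values were replaced with empty lists before saving.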
