<a href="https://colab.research.google.com/github/AbhinavKumar0000/Data_collection_pipeline/blob/main/Cleaning_script.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Job Posting Data Cleaning

This notebook cleans the job posting data stored in `raw_data.csv`. It removes duplicate postings (matched by URL), drops rows with a missing job description, normalizes the job description text, and saves the processed data to `cleaned_data.csv`.

## Prerequisites
- Ensure `raw_data.csv` exists (generated from the previous data collection step)
- Required libraries: `pandas` (third-party) and `re` (part of the Python standard library)
- The script assumes the input CSV has columns: `job_title`, `company_name`, `location`, `job_description`, `url`
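
The column assumption above can be checked before cleaning starts. The following is a minimal sketch, not part of the original notebook: `missing_columns` and `EXPECTED_COLUMNS` are hypothetical names, and the in-memory CSV is a made-up stand-in for `raw_data.csv`.

```python
import io

import pandas as pd

# Columns the cleaning script expects to find in raw_data.csv
EXPECTED_COLUMNS = ['job_title', 'company_name', 'location', 'job_description', 'url']


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]


# Tiny in-memory stand-in for raw_data.csv, deliberately missing 'location'
sample_csv = io.StringIO(
    "job_title,company_name,job_description,url\n"
    "QA Test Engineer,PrimeWorks,Some text,https://example.com/1\n"
)
sample_df = pd.read_csv(sample_csv)

print(missing_columns(sample_df))  # ['location']
```

An empty list means every expected column is present, so the later column selection should not raise a `KeyError`.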

## Steps
1. Import necessary libraries
2. Define a function to clean job description text
3. Load the raw data and handle duplicates and nulls
4. Apply text cleaning to job descriptions
5. Save the cleaned dataset and display results

In [10]:
# Import the necessary libraries
import re
import pandas as pd

In [1]:
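def clean_job_description(text):
    """Normalize a job description: lowercase it, strip punctuation, collapse whitespace."""
    # Non-string input (e.g. NaN from pandas) becomes an empty string
    if not isinstance(text, str):
        return ""

    # Standardize text to lowercase
    text = text.lower()

    # Remove everything except lowercase letters, digits, and whitespace
    # (hyphenated words are joined, e.g. "us-based" becomes "usbased")
    text = re.sub(r'[^a-z0-9\s]', '', text)

    # Collapse runs of whitespace into a single space and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text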
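
To make the effect of each step concrete, here is a small usage sketch. The function body repeats the logic of the cell above so the snippet runs on its own; the sample string is invented for illustration and does not come from the dataset.

```python
import re


def clean_job_description(text):
    # Same logic as the notebook cell, repeated so this snippet is self-contained
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()


# Made-up sample with mixed case, punctuation, a newline, and extra spaces
sample = "  We're a US-based company!\nSalary: $90,000 - $110,000  "

print(clean_job_description(sample))
# were a usbased company salary 90000 110000

print(repr(clean_job_description(None)))
# ''
```

Note that removing punctuation merges hyphenated and apostrophized words ("US-based" becomes "usbased", "We're" becomes "were") and strips separators from numbers. Whether that is acceptable depends on how the cleaned text is used downstream.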

In [6]:
try:
    df = pd.read_csv('raw_data.csv')
    print(f"Loaded raw_data.csv with {len(df)} initial records")

    # De-duplicate records based on the job URL (the first occurrence is kept)
    df = df.drop_duplicates(subset=['url'])

    # Remove records whose job description is missing
    df = df.dropna(subset=['job_description'])
    print(f"After handling duplicates and nulls, {len(df)} records remain")

    # Apply the text cleaning to the job_description column
    df['cleaned_description'] = df['job_description'].apply(clean_job_description)

    # Select and reorder columns for the cleaned dataset
    cleaned_df = df[['job_title', 'company_name', 'location', 'cleaned_description', 'url']]
    cleaned_df.to_csv('cleaned_data.csv', index=False)
    print("Data cleaning complete and saved to cleaned_data.csv")

except FileNotFoundError:
    print("Error: raw_data.csv not found")

Loaded raw_data.csv with 50 initial records
After handling duplicates and nulls, 50 records remain
Data cleaning complete and saved to cleaned_data.csv
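
Because the 50 records here contained no duplicates or missing descriptions, the output above does not show the filtering in action. The sketch below applies the same two pandas calls to a small made-up frame (the `example.com` URLs and descriptions are placeholders, not real data) so the effect of each step is visible.

```python
import pandas as pd

# Toy data: one repeated URL and one missing description
toy = pd.DataFrame({
    'url': [
        'https://example.com/a',
        'https://example.com/a',  # duplicate of the first row
        'https://example.com/b',
        'https://example.com/c',
    ],
    'job_description': ['First copy', 'Second copy', None, 'Kept'],
})

# Keep the first row for each URL, then drop rows with no description
deduped = toy.drop_duplicates(subset=['url'])
result = deduped.dropna(subset=['job_description'])

print(len(toy), len(deduped), len(result))  # 4 3 2
print(result['job_description'].tolist())   # ['First copy', 'Kept']
```

With the default `keep='first'`, the earliest row for each URL survives, so the order of rows in `raw_data.csv` decides which copy of a duplicated posting is retained.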


In [8]:
if 'cleaned_df' in locals():
    print("Cleaned Data:")
    display(cleaned_df.head())

Cleaned Data:


|   | job_title | company_name | location | cleaned_description | url |
|---|-----------|--------------|----------|---------------------|-----|
| 0 | QA Test Engineer | PrimeWorks | USA | you will be working for a usbased technology c... | https://remotive.com/remote-jobs/qa/qa-test-en... |
| 1 | Senior Product Manager, Platform | ServiceUp | USA | about the role serviceup is seeking a senior p... | https://remotive.com/remote-jobs/product/senio... |
| 2 | Business Development Associate | Forbes Advisor | USA | company description at forbes advisor our miss... | https://remotive.com/remote-jobs/sales-busines... |
| 3 | Graphic Designer | DMS International | USA | data management services inc dba dms internati... | https://remotive.com/remote-jobs/design/graphi... |
| 4 | Client Success Manager | Momentive Software, Inc. | USA | this description is a summary of our understan... | https://remotive.com/remote-jobs/customer-supp... |


In [9]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_title            50 non-null     object
 1   company_name         50 non-null     object
 2   location             50 non-null     object
 3   cleaned_description  50 non-null     object
 4   url                  50 non-null     object
dtypes: object(5)
memory usage: 2.1+ KB
