<a href="https://colab.research.google.com/github/SaraAljuraybah/saudi-tech-job-skills-analysis/blob/main/notebooks/data_collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Step 1: Load the Raw Dataset**

The raw dataset collected from the JSearch API was loaded into the notebook using pandas. The dataset contains job postings from the Saudi labor market and includes both structured and unstructured features.

In [6]:
import pandas as pd

df = pd.read_excel("jobs_sa_raw.xlsx")

df.shape

(1024, 31)

**Step 2: Identify Missing Values**

Missing values were analyzed across all columns to identify features with high null counts. Columns with 100% missing values or irrelevant information were considered for removal.

In [7]:
df.isnull().sum().sort_values(ascending=False)

column,missing_count
job_posted_at_timestamp,1024
job_max_salary,1024
job_salary_period,1024
job_salary,1024
job_benefits,1024
job_state,1024
job_posted_at_datetime_utc,1024
job_min_salary,1024
employer_logo,635
job_posted_at,391
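
Raw null counts are easier to compare when expressed as a share of rows. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and the variable names `missing_share` and `fully_missing` are illustrative assumptions, not part of the original notebook) showing one way to list columns that are entirely missing:

```python
import pandas as pd

# Hypothetical toy frame standing in for the job postings data.
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "ML Engineer", "Data Analyst", "BI Developer"],
    "job_salary": [None, None, None, None],            # entirely missing
    "employer_logo": ["a.png", None, None, "d.png"],   # partially missing
})

# Share of missing values per column, highest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Names of columns in which every value is missing.
fully_missing = missing_share[missing_share == 1.0].index.tolist()

print(missing_share)
print(fully_missing)
```

Applied to the real `df`, the same two lines would separate the columns with 1024 missing values from partially missing ones such as `employer_logo`.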


**Step 3: Remove Irrelevant Features/Empty Columns**

Columns that contained only missing values (the salary, benefits, state, and posting-timestamp fields) or that were not relevant to the research objective (employer_logo and employer_website) were removed to improve dataset quality and reduce noise.

In [8]:
cols_to_drop = [
    "job_salary",
    "job_min_salary",
    "job_max_salary",
    "job_salary_period",
    "job_benefits",
    "job_state",
    "job_posted_at_timestamp",
    "job_posted_at_datetime_utc",
    "employer_logo",
    "employer_website"
]

df = df.drop(columns=cols_to_drop)
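
As an alternative to listing every empty column by hand, the removal can be split into two parts. This is a sketch under assumptions, using a hypothetical toy frame rather than the real dataset: `dropna(axis=1, how="all")` removes columns that are entirely missing, and `drop(..., errors="ignore")` removes named columns while skipping any name that is not present.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror a few of those in the notebook.
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "ML Engineer"],
    "job_salary": [None, None],          # entirely missing
    "employer_logo": ["a.png", None],    # partially missing, judged irrelevant
})

# Drop columns in which every value is missing.
no_empty = toy.dropna(axis=1, how="all")

# Drop named columns; errors="ignore" skips names absent from the frame.
trimmed = no_empty.drop(columns=["employer_logo", "employer_website"], errors="ignore")

print(list(trimmed.columns))
```

The explicit list in the cell above has the advantage of documenting exactly which columns were removed, so this variant is a convenience rather than a replacement.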

**Step 4: Handle Duplicate Records**

The dataset was checked for duplicate rows. Four duplicate records were identified and removed to ensure the integrity and reliability of the analysis.

In [9]:
df.duplicated().sum()

np.int64(4)

In [10]:
df = df.drop_duplicates()
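
To illustrate what the duplicate check counts, here is a small sketch on a hypothetical toy frame (the `job_id` column and its values are assumptions made for the example). `duplicated()` flags every repeat of an earlier, fully identical row, and `drop_duplicates()` keeps the first occurrence by default.

```python
import pandas as pd

# Hypothetical toy frame with one exact repeat of the second row.
toy = pd.DataFrame({
    "job_id": ["a1", "a2", "a2", "a3"],
    "job_title": ["Data Scientist", "ML Engineer", "ML Engineer", "Data Analyst"],
})

# Count rows that repeat an earlier row across all columns.
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```

Note that only rows identical in every column are treated as duplicates here; postings that differ in a single field would be kept.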

In [11]:
df["job_description"].head()

index,job_description
0,Be the change. Join the world’s most visionary...
1,About Mozn\n\nMOZN is a leading Enterprise AI ...
2,About MOZN\n\nMOZN is a leading Enterprise AI ...
3,A data solutions company located in Riyadh is ...
4,The Data Scientist is responsible for deliveri...


**Step 5: Text Preprocessing**

Since job descriptions are unstructured text, preprocessing was applied: each description was converted to lowercase, newline characters were replaced with spaces, every character other than ASCII letters, digits, and whitespace was removed, and whitespace was normalized.
The cleaned text was stored in a new column named clean_description so that the original job_description column is preserved.

In [12]:
import re

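def clean_text(text):
    # Missing descriptions (NaN) are not strings; return an empty string for them.
    if not isinstance(text, str):
        return ""
    text = text.lower()
    # Replace newlines with spaces so adjacent words do not merge.
    text = re.sub(r'\n', ' ', text)
    # Keep only ASCII letters, digits, and whitespace.
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r'\s+', ' ', text).strip()
    return text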

df["clean_description"] = df["job_description"].apply(clean_text)
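
The effect of these steps is easiest to see on a single string. The sketch below repeats the same four regex steps on a made-up sample (the `sample` text is an assumption for illustration, not a row from the dataset):

```python
import re

def clean_text(text):
    text = text.lower()                          # lowercase
    text = re.sub(r'\n', ' ', text)              # newlines -> spaces
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)   # drop special characters
    text = re.sub(r'\s+', ' ', text).strip()     # normalize whitespace
    return text

# Made-up sample resembling a job description.
sample = "About MOZN\n\nMOZN is a leading Enterprise AI company.  Skills: Python, C++ & SQL!"
cleaned = clean_text(sample)
print(cleaned)
# about mozn mozn is a leading enterprise ai company skills python c sql
```

One consequence worth keeping in mind for skill extraction: punctuation-bearing skill names are altered, so "C++" becomes "c" in this sample, and any non-ASCII text (for example Arabic) is removed entirely by the character filter.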

In [13]:
df.shape

(1020, 22)

**Step 6: Export Cleaned Dataset**

After completing the cleaning process, the final cleaned dataset was exported as a CSV file for further exploratory data analysis (EDA) and modeling.

In [14]:
df.to_csv("jobs_sa_cleaned.csv", index=False)
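
A quick round-trip check can confirm that the exported file reloads with the expected shape and columns. This is a minimal sketch on a hypothetical toy frame written to a temporary directory (the file names and data are assumptions); it also shows why `index=False` is used, since writing the default index produces an extra `Unnamed: 0` column on reload.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the cleaned dataset.
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "ML Engineer"],
    "clean_description": ["about mozn", "a data solutions company"],
})

with tempfile.TemporaryDirectory() as tmp:
    # Export without the index, then reload.
    path = os.path.join(tmp, "toy_cleaned.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

    # Export with the default index for comparison.
    path_with_index = os.path.join(tmp, "toy_with_index.csv")
    toy.to_csv(path_with_index)
    reloaded_with_index = pd.read_csv(path_with_index)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)

print(same_shape, same_columns)
print(list(reloaded_with_index.columns))
```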