# **Data Collection Notebook**

## Objectives

* Fetch the dataset from Kaggle and save it as raw data under `inputs/datasets/raw`.
* Inspect the data and save it under `outputs/datasets/collection`.

## Inputs

* `kaggle.json`: the Kaggle authentication token.

## Outputs

* Generated dataset: `outputs/datasets/collection/ai_job_dataset1.csv`

## Additional Comments

* Dataset: **Global AI Job Market & Salary Trends 2025** by Bisma Sajjad on Kaggle.
* 15,000+ synthetic job listings from 50+ countries with 20 features.
* In the workplace, data is not pushed to public repositories for security reasons.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\chahi\\Desktop\\vscode-project\\the-ai-salary-index\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory.

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\chahi\\Desktop\\vscode-project\\the-ai-salary-index'
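Re-running the `os.chdir` cell moves up one more level each time, which is easy to do by accident in a notebook. A minimal sketch of a guard, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path printed earlier); the helper name `move_to_project_root` is hypothetical and not part of the original notebook:

```python
import os

def move_to_project_root(notebooks_folder="jupyter_notebooks"):
    """Move to the parent folder only if we are still inside the notebooks folder."""
    current_dir = os.getcwd()
    # Step up only when the current folder is the notebooks folder (assumed name),
    # so re-running the cell does not climb past the project root.
    if os.path.basename(current_dir) == notebooks_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, because the project root is not named `jupyter_notebooks`.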

# Fetch data from Kaggle

Install the Kaggle package to fetch data.

In [4]:
%pip install kaggle==2.0.0

Collecting kaggle==2.0.0
  Downloading kaggle-2.0.0-py3-none-any.whl.metadata (15 kB)
Collecting bleach (from kaggle==2.0.0)
  Using cached bleach-6.3.0-py3-none-any.whl.metadata (31 kB)
Collecting kagglesdk<1.0,>=0.1.15 (from kaggle==2.0.0)
  Downloading kagglesdk-0.1.15-py3-none-any.whl.metadata (13 kB)
Collecting protobuf (from kaggle==2.0.0)
  Downloading protobuf-6.33.5-cp310-abi3-win_amd64.whl.metadata (593 bytes)
Collecting python-slugify (from kaggle==2.0.0)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting requests (from kaggle==2.0.0)
  Using cached requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting tqdm (from kaggle==2.0.0)
  Downloading tqdm-4.67.3-py3-none-any.whl.metadata (57 kB)
Collecting urllib3>=1.15.1 (from kaggle==2.0.0)
  Downloading urllib3-2.6.3-py3-none-any.whl.metadata (6.9 kB)
Collecting webencodings (from bleach->kaggle==2.0.0)
  Using cached webencodings-0.5.1-py2.py3-none-any.whl.metadata (2.1 kB)
Collecting tex


[notice] A new release of pip is available: 25.0.1 -> 26.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


You need a `kaggle.json` file (authentication token) in the project root.
* Download it from your Kaggle account under **Account → API → Create New API Token**.

In [6]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
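Before downloading, it can help to confirm that the token file is actually where `KAGGLE_CONFIG_DIR` points. The sketch below assumes the legacy token format, a JSON object with `username` and `key` fields; the helper name `check_kaggle_token` is hypothetical:

```python
import json
import os

def check_kaggle_token(config_dir):
    """Return True if config_dir holds a kaggle.json with 'username' and 'key' fields."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path, encoding="utf-8") as f:
            token = json.load(f)
    except (OSError, json.JSONDecodeError):
        # Unreadable or malformed file counts as a missing token
        return False
    # Assumption: the legacy token is a JSON object with these two keys
    return isinstance(token, dict) and {"username", "key"} <= set(token)
```

For example, `check_kaggle_token(os.getcwd())` returning `False` would suggest the file is missing from the project root or is not in the assumed format.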

Define the Kaggle dataset path and the destination folder, then download it.

In [7]:
KaggleDatasetPath = "bismasajjad/global-ai-job-market-and-salary-trends-2025"
DestinationFolder = "inputs/datasets/raw"
os.makedirs(DestinationFolder, exist_ok=True)

! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/bismasajjad/global-ai-job-market-and-salary-trends-2025
License(s): CC0-1.0
Downloading global-ai-job-market-and-salary-trends-2025.zip to inputs/datasets/raw




  0%|          | 0.00/1.08M [00:00<?, ?B/s]
 93%|█████████▎| 1.00M/1.08M [00:00<00:00, 2.17MB/s]
100%|██████████| 1.08M/1.08M [00:00<00:00, 2.13MB/s]


Unzip the downloaded file.

In [9]:
import glob
import zipfile

# Find all zip files in the folder
zip_files = glob.glob(os.path.join(DestinationFolder, "*.zip"))

# Extract each zip file and then delete it
for zip_path in zip_files:
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(DestinationFolder)
    os.remove(zip_path)

# Remove kaggle.json if it exists to avoid exposing credentials
kaggle_json = os.path.join(os.getcwd(), "kaggle.json")
if os.path.exists(kaggle_json):
    os.remove(kaggle_json)

print("All ZIP files extracted and deleted.")

All ZIP files extracted and deleted.
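The loop above deletes each archive immediately after extraction. A slightly more defensive sketch checks the archive's integrity first and returns the extracted file names, so the CSV name can be confirmed before it is read. The helper name `extract_and_remove` is hypothetical:

```python
import os
import zipfile

def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination and delete the archive only if it is intact."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # testzip() returns the name of the first corrupt member, or None
        bad_member = zip_ref.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in {zip_path}: {bad_member}")
        zip_ref.extractall(destination)
        extracted = zip_ref.namelist()
    # Reached only when extraction succeeded
    os.remove(zip_path)
    return extracted
```

In this notebook it could be called as `extract_and_remove(zip_path, DestinationFolder)` inside the existing `for` loop.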


# Inspect the data

In [14]:
import pandas as pd
df = pd.read_csv(f"{DestinationFolder}/ai_job_dataset1.csv")
print(f"Shape: {df.shape}")
df.head()

ModuleNotFoundError: No module named 'pandas'
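The error above means pandas is not installed in the kernel that runs this notebook; only `kaggle` was installed earlier. A small sketch for checking which packages are importable before running the inspection cells; the helper name `missing_packages` is hypothetical:

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(["pandas"])
if missing:
    print(f"Install before continuing: {', '.join(missing)}")
```

If pandas is reported, running `%pip install pandas` in a cell and then re-running the inspection cell should resolve the error.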

Check data types and missing values.

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.describe()

---


# Push files to Repo

* The cell below creates the output folder listed under **Outputs**.

In [None]:
import os
try:
  # Create the folder that will hold the collected dataset
  os.makedirs(name='outputs/datasets/collection', exist_ok=True)
except Exception as e:
  print(e)
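The Outputs section lists `outputs/datasets/collection/ai_job_dataset1.csv`, but no cell shown in this notebook writes that file. A minimal sketch, assuming the raw CSV is saved unchanged; the helper name `copy_to_collection` is hypothetical:

```python
import os
import shutil

def copy_to_collection(source_csv, output_folder="outputs/datasets/collection"):
    """Copy source_csv into output_folder, creating the folder if needed."""
    os.makedirs(output_folder, exist_ok=True)
    # copy2 keeps the file name and returns the path of the new copy
    return shutil.copy2(source_csv, output_folder)
```

Here the call would be `copy_to_collection("inputs/datasets/raw/ai_job_dataset1.csv")`. If the inspection step changes the data, saving with `df.to_csv(...)` would be the natural alternative.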


# Conclusions and Next Steps

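* The dataset was downloaded from Kaggle and extracted to `inputs/datasets/raw`.
* The inspection cells did not run because pandas was not installed in this kernel; install it and re-run them before moving on.
* Next step: **02 - JobMarketStudy**, covering Exploratory Data Analysis and a correlation study.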