# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under `outputs/datasets/collection`

## Inputs

* Kaggle JSON file — the authentication token.

## Outputs

* Generate Dataset: `outputs/datasets/collection/ai_job_dataset1.csv`

## Additional Comments

* Dataset: **Global AI Job Market & Salary Trends 2025** by Bisma Sajjad on Kaggle.
* 15,000+ synthetic job listings from 50+ countries with 20 features.
* In the workplace, data is not pushed to public repositories for security reasons.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new working directory.
* `os.path.dirname()` returns the parent directory of a path
* `os.chdir()` sets the new working directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory.

In [None]:
current_dir = os.getcwd()
current_dir
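Re-running the `os.chdir` cell above moves one more level up each time, which is easy to do by accident in a notebook. A minimal sketch of a guard against that, assuming the project root can be recognised by a marker file. The helper name `find_project_root` and the default marker `README.md` are hypothetical and not part of the original notebook:

```python
import os
from pathlib import Path


def find_project_root(start, marker="README.md"):
    """Return the nearest directory at or above `start` that contains `marker`.

    `marker` is an assumed file name; use any file that only exists in the
    project root. Returns None when no ancestor contains it.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None


root = find_project_root(os.getcwd())
if root is not None:
    os.chdir(root)  # safe to re-run: the result is the same every time
print(os.getcwd())
```

Because the search starts from the current directory and stops at the first match, running the cell twice leaves the working directory unchanged.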

# Fetch data from Kaggle

Install the Kaggle package to fetch data.

In [None]:
%pip install kaggle==1.5.12

You need a `kaggle.json` file (authentication token) in the project root.
* Download it from your Kaggle account under **Account → API → Create New API Token**.

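In [None]:
import os

# Tell the Kaggle CLI to look for kaggle.json in the current (project root) directory
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# Restrict the token file to owner read/write so the CLI does not warn about permissions
! chmod 600 kaggle.json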
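A missing or malformed token only shows up as an error once the download command runs. As a rough sketch, the file can be checked first. This assumes the token is a JSON object with `username` and `key` fields, which is the layout Kaggle generates; the helper name `check_kaggle_token` is hypothetical:

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return a list of problems found with the token file (empty if none)."""
    token_path = Path(path)
    if not token_path.is_file():
        return [f"{token_path} not found"]
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError as exc:
        return [f"{token_path} is not valid JSON: {exc}"]
    if not isinstance(token, dict):
        return [f"{token_path} does not contain a JSON object"]
    # Assumed layout: credentials stored under "username" and "key".
    return [f"missing or empty field: {name}"
            for name in ("username", "key") if not token.get(name)]


problems = check_kaggle_token()
print(problems or "kaggle.json looks usable")
```

The check only inspects the file layout; it does not contact Kaggle, so a revoked token would still pass.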

Define the Kaggle dataset path and the destination folder, then download it.

In [None]:
KaggleDatasetPath = "bismasajjad/global-ai-job-market-and-salary-trends-2025"
DestinationFolder = "outputs/datasets/collection"
os.makedirs(DestinationFolder, exist_ok=True)

! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file.

In [None]:
import zipfile

# Build the archive path once and reuse it for extraction and cleanup
zip_path = f"{DestinationFolder}/global-ai-job-market-and-salary-trends-2025.zip"
with zipfile.ZipFile(zip_path, 'r') as z:
    z.extractall(DestinationFolder)

# Remove the archive now that its contents are extracted
os.remove(zip_path)
print("Unzipped successfully")
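The unzip cell hardcodes the archive name, so it breaks if the downloaded file is named differently. One possible variation, shown here only as a sketch, extracts whatever `.zip` files are in the folder and reports their contents. The helper name `extract_all_zips` is hypothetical:

```python
import zipfile
from pathlib import Path


def extract_all_zips(folder):
    """Extract every .zip in `folder` into it, delete the archive,
    and return the names of the extracted members."""
    extracted = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        zip_path.unlink()  # remove the archive once its contents are out
    return extracted


# Prints an empty list when the folder holds no archives.
print(extract_all_zips("outputs/datasets/collection"))
```

The returned list makes it easy to confirm that the expected CSV file was actually in the archive before reading it.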

# Inspect the data

In [None]:
import pandas as pd
df = pd.read_csv(f"{DestinationFolder}/ai_job_dataset1.csv")
print(f"Shape: {df.shape}")
df.head()

Check data types and missing values.

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.describe()
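The three inspection cells above can also be condensed into a single table with one row per column. This is a sketch rather than part of the original workflow; the helper name `summarize_columns` and the toy columns `salary` and `title` are made up for illustration:

```python
import pandas as pd


def summarize_columns(frame):
    """One row per column: dtype, missing count, missing percentage, unique values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "missing_pct": (frame.isnull().mean() * 100).round(2),
        "unique": frame.nunique(),
    })


# Toy frame for illustration only; in the notebook, pass `df` instead.
example = pd.DataFrame({"salary": [100.0, None, 120.0], "title": ["a", "b", "b"]})
print(summarize_columns(example))
```

Sorting the result by `missing_pct` is a quick way to see which columns need attention first.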

---

# Push files to Repo

* The download step already created `outputs/datasets/collection`; the cell below only makes sure the folder exists before the files are committed.

In [None]:
import os

try:
    # exist_ok=True makes this a no-op when the folder is already present
    os.makedirs(name='outputs/datasets/collection', exist_ok=True)
except OSError as e:
    print(e)


# Conclusions and Next Steps

* The dataset has been downloaded from Kaggle and saved to `outputs/datasets/collection/ai_job_dataset1.csv`.
* Next step: **02 - JobMarketStudy** — Exploratory Data Analysis and correlation study.