## 🧭 01 – Data Collection & Setup

### 🎯 Objective

The goal of this notebook is to **collect, verify, and document** all datasets used in the project **“Enhanced Skill Gap Analysis & Career Path Optimization.”**
This phase ensures all primary and supporting datasets are properly downloaded, described, and validated before any cleaning or modeling begins.

---

### 🧩 Project Context

This project aims to analyze the **global job market** to identify current and emerging skills, salary trends, and career optimization opportunities.
To achieve this, multiple **Kaggle datasets** are used, each representing a unique aspect of the technology job ecosystem.

---

### 📦 Primary Datasets Used

| Dataset Filename                | Kaggle Source                                                                                          | Rows (Approx.) | Purpose                                                                 |
| ------------------------------- | ------------------------------------------------------------------------------------------------------ | -------------- | ----------------------------------------------------------------------- |
| `dataset1_data_science_job.csv` | [brsahan / Data Science Job Market](https://www.kaggle.com/datasets/brsahan/data-science-job-market)   | ~10,000        | Core dataset containing job postings, skills, salary, and company data. |
| `dataset2_all_job_post.csv`     | [batuhanmutlu / Job Skill Set](https://www.kaggle.com/datasets/batuhanmutlu/job-skill-set)             | ~50,000        | Provides extracted skills, job categories, and detailed descriptions.   |
| `dataset3_ai_job_dataset.csv`   | [bismasajjad / Global AI Job Market](https://www.kaggle.com/datasets/bismasajjad/global-ai-job-market) | ~15,000        | Captures geographic and temporal job market trends in AI roles.         |
| `dataset3_ai_job_dataset1.csv`  | –                                                                                                      | ~12,000        | Variant dataset used for validation and cross-checking.                 |

---

### 🗂️ Folder Structure

All downloaded datasets are stored in the following structure:

```
skill-gap-analysis/
├── data/
│   ├── raw/          ← unmodified Kaggle datasets (CSV)
│   ├── processed/    ← cleaned datasets (after preprocessing)
│   └── external/     ← optional additional datasets (Stack Overflow, ESCO)
└── notebooks/
    └── 01_data_collection.ipynb
```
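
The layout above can be created programmatically before any files are downloaded. The following is a minimal sketch using only the standard library; the function name `ensure_data_dirs` and the constant `DATA_SUBDIRS` are illustrative and not part of the project.

```python
from pathlib import Path

# Subdirectory names taken from the folder tree above.
# DATA_SUBDIRS and ensure_data_dirs are hypothetical names used for illustration.
DATA_SUBDIRS = ("raw", "processed", "external")


def ensure_data_dirs(project_root):
    """Create data/raw, data/processed and data/external under project_root.

    Existing directories are left untouched. Returns the list of paths.
    """
    root = Path(project_root)
    paths = []
    for name in DATA_SUBDIRS:
        path = root / "data" / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        paths.append(path)
    return paths
```

Calling `ensure_data_dirs("skill-gap-analysis")` from the directory that contains the project would create any missing folders without touching files already in place.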

---

### ⚙️ Data Source Validation

This notebook validates that all required datasets exist in the `data/raw/` directory and confirms their shape, column structure, and encoding consistency.
Each dataset will be loaded using `pandas` and summarized to ensure correct formatting before moving to the preprocessing phase.
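
One way to sketch these checks is shown below. The file names come from the dataset table; the helper names (`find_missing_files`, `load_csv_with_fallback`) and the choice of fallback encodings are assumptions made for illustration, not the notebook's actual implementation.

```python
from pathlib import Path

import pandas as pd

# File names as listed in the dataset table.
EXPECTED_FILES = (
    "dataset1_data_science_job.csv",
    "dataset2_all_job_post.csv",
    "dataset3_ai_job_dataset.csv",
    "dataset3_ai_job_dataset1.csv",
)


def find_missing_files(raw_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]


def load_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in order; return (DataFrame, encoding that worked).

    The encoding list is an assumption; adjust it to the files at hand.
    """
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error
```

In the notebook, `find_missing_files("data/raw")` would be expected to return an empty list, and the `shape` and `columns` of each loaded DataFrame can then be inspected. Recording which encoding succeeded for each file makes inconsistencies between datasets visible early.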

---

### 🧾 Metadata Captured

For every dataset, the following metadata will be logged:

* File name
* Number of rows and columns
* Key columns (e.g., job_title, company, salary, skills)
* Missing value summary
* Data types
* Observations about structure or anomalies

This metadata will be stored in a file named **`data/raw/raw_metadata_summary.csv`** for reference.
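
A minimal sketch of how such a summary could be assembled with `pandas` follows. The column names of the output file (`file_name`, `n_rows`, and so on) and both function names are assumptions; the source only specifies which pieces of metadata are logged and where the file is saved.

```python
import pandas as pd

# Key columns taken from the examples in the list above.
KEY_COLUMNS = ("job_title", "company", "salary", "skills")


def summarize_dataframe(file_name, df, key_columns=KEY_COLUMNS):
    """Build one metadata row for a loaded dataset (hypothetical schema)."""
    missing = df.isna().sum()  # missing-value count per column
    return {
        "file_name": file_name,
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        "key_columns_present": ", ".join(c for c in key_columns if c in df.columns),
        "missing_values": "; ".join(
            f"{col}={int(count)}" for col, count in missing.items() if count > 0
        ),
        "dtypes": "; ".join(f"{col}={dtype}" for col, dtype in df.dtypes.items()),
        "observations": "",  # free-text notes, filled in after manual inspection
    }


def write_metadata_summary(frames, out_path):
    """Write one summary row per dataset; frames maps file name -> DataFrame."""
    rows = [summarize_dataframe(name, df) for name, df in frames.items()]
    summary = pd.DataFrame(rows)
    summary.to_csv(out_path, index=False)
    return summary
```

With the datasets loaded into a dictionary keyed by file name, `write_metadata_summary(frames, "data/raw/raw_metadata_summary.csv")` would produce one row per dataset. The `observations` column is left empty on purpose, since notes about anomalies come from manual review.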

---

### 📜 License & Usage Notes

* All datasets are publicly available on **Kaggle** and are used in this project for **educational and research purposes only**.
* No personally identifiable information (PII) is contained in these datasets.
* Each dataset includes attribution to its original author on Kaggle.

---

### 🧠 Expected Outcome

At the end of this notebook, you should have:

* Verified the presence and structure of all raw CSV files.
* Logged metadata summaries for all datasets.
* Confirmed that all sources are properly cited and licensed.
* Placed the validated datasets in `data/raw/`, ready for cleaning in `02_data_cleaning.ipynb`.


