# Project Title
## Career and Skills Intelligence Recommendation System for Future Workforce Readiness  
*A Data-Driven Approach to Career Guidance, Skills Development, and Education Pathway Alignment*

---

# Business Understanding

## 1. Background Information

The global labor market is undergoing rapid transformation driven by technological advancement, artificial intelligence (AI), automation, globalization, and the digital economy. Traditional career pathways are evolving, new job roles are emerging, and the skills required for employability are continuously changing. As a result, individuals — particularly students, graduates, and early-career professionals — often struggle to identify relevant career opportunities and the competencies needed to remain competitive.

In Kenya, educational reforms such as the Competency-Based Curriculum (CBC) emphasize practical skills, creativity, and learner-centered career pathways. While these reforms aim to better prepare learners for the workforce, many students, parents, and educators still lack clear guidance on translating acquired competencies into viable career opportunities. Additionally, higher education institutions and Technical and Vocational Education and Training (TVET) programs offer diverse learning pathways, yet alignment between these programs and labor market demands remains fragmented.

At the same time, Massive Open Online Courses (MOOCs) and digital learning platforms have expanded access to skill development opportunities globally. These platforms enable learners to acquire industry-relevant skills outside traditional academic environments. However, the large number of available courses often creates confusion regarding which learning pathways best support specific career goals.

Furthermore, concerns about AI-driven job displacement are increasingly shaping workforce discussions. While some occupations face automation risks, others are emerging due to technological innovation. Understanding these trends is essential for informed career planning and workforce readiness.

Given these dynamics, there is a growing need for intelligent systems that integrate labor market data, education pathways, emerging skill requirements, and individual user profiles to provide actionable career guidance.



## 2. Challenges Identified

Several factors contribute to difficulties in career decision-making:

- Fragmented information about careers, required skills, and education pathways  
- Skills mismatch between graduates and labor market demands  
- Limited awareness of emerging careers and future workforce trends  
- Rapid technological change affecting job stability  
- Difficulty navigating multiple learning options such as university programs, TVET courses, and online certifications  
- Limited personalized career guidance tools tailored to evolving labor market conditions  

These challenges highlight the need for a data-driven approach to career guidance.



## 3. Proposed Solution

This project proposes the development of an **AI-driven Career and Skills Recommendation System** that integrates multiple data sources to provide personalized career insights.

The system will incorporate:

- Labor market data from job postings and occupational datasets  
- Skills extraction using Natural Language Processing (NLP)  
- Education pathway mapping including universities, TVET programs, and certifications  
- Online learning recommendations from MOOCs and digital learning platforms  
- AI-driven insights on job trends, emerging careers, and automation risks  

Based on user profiles (skills, education level, interests, career goals), the system will provide:

- Recommended career paths  
- Required skills and competency levels  
- Suggested academic programs or certifications  
- Relevant learning resources and skill development pathways  
- Future job outlook and industry trends  



## 4. Problem Statement

Many students, graduates, and professionals face challenges identifying suitable career paths and the skills required to succeed in an evolving labor market. This is due to fragmented career information, emerging AI-driven job disruptions, evolving education systems, and the growing number of learning opportunities available through both traditional institutions and online platforms.

There is therefore a need for a holistic, data-driven career recommendation system that integrates labor market intelligence, education pathways, skill requirements, and future workforce trends to support informed career decision-making.



## 5. Project Objectives

### General Objective

To develop an intelligent recommendation system that provides personalized career, skills, and education pathway suggestions based on labor market data, user profiles, and future workforce trends.

### Specific Objectives

- Collect and analyze job market data across multiple industries  
- Extract required skills and competencies using NLP techniques  
- Analyze AI automation trends and their impact on employment  
- Recommend relevant academic programs, TVET courses, and MOOCs  
- Identify skill gaps and suggest personalized learning pathways  
- Develop a user-friendly platform for interactive career guidance  



## 6. Stakeholders

### Primary Stakeholders

- Students and learners  
- Job seekers and early-career professionals  
- Career counselors and educators  
- Training institutions and universities  

### Secondary Stakeholders

- Employers and industry organizations  
- Government education and labor agencies  
- Online learning platforms  
- Policymakers interested in workforce development  



## 7. Expected Impact

By integrating career data, education pathways, skills intelligence, and workforce trends, this system aims to:

- Improve career decision-making  
- Reduce skills mismatch in the labor market  
- Support lifelong learning and workforce adaptability  
- Enhance alignment between education systems and employment needs  

Ultimately, the project seeks to contribute to workforce readiness in an increasingly technology-driven global economy.




# Data Understanding

## 1. Introduction
This project uses multiple datasets to support a **Career and Skills Intelligence Recommendation System** that connects:
- career opportunities and job market demand,
- skills and competency requirements,
- education and training pathways,
- AI/automation trends and workforce changes,
- learning resources (MOOCs).

Because the system is designed to be future-oriented and data-driven, we curated datasets that include:
- job titles and job-level attributes (salary, education, experience),
- AI impact indicators (exposure, automation probability, displacement risk),
- online learning quality signals (course reviews/sentiment),
- a course catalog for skill-to-course mapping (`Coursera`),
- CBC pathway context covering Kenya education tracks and competencies (`cbc_pathways`).


These datasets provide both structured data (tables) and semi-structured text data. In this notebook, we focus on:
1) loading each dataset,
2) inspecting its structure (rows/columns, dtypes, missingness),
3) confirming whether the data supports our project objectives.



## 2. Datasets in This Project (Folder: `Cap_Stone_Project/DATA`)
We currently maintain 8 datasets stored in the `DATA/` folder:

1. `career_dataset_large`  
2. `ai_job_trends_dataset`  
3. `AI_Impact_on_Jobs_2030`  
4. `ai_impact_jobs_2010_2025`
5. `Coursera`      
6. `reviews`  
7. `reviews_by_course`
8. `cbc_pathways`



> Note: `career_dataset_large` is an Excel workbook (`.xlsx`), so we will load it using `pandas.read_excel()`; the remaining seven files are CSVs.



## 3. Dataset Roles in the System (High-Level)
- **Career prediction / profile-to-career mapping:** `career_dataset_large`
- **Future job market & AI disruption signals:** `AI_Impact_on_Jobs_2030`, `ai_job_trends_dataset`, `ai_impact_jobs_2010_2025`
- **Skill-to-course mapping (course catalog):** `Coursera`
- **Education pathway context (CBC tracks and competencies):** `cbc_pathways`
- **MOOC quality signals (course feedback):** `reviews`, `reviews_by_course`

In later notebooks, we will clean, standardize job titles/education levels across datasets, and connect:
**user profile → career(s) → required skills → recommended learning resources** (MOOCs/certifications), while considering **AI risk and workforce trends**.
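
The skill-gap step in this chain can be illustrated with a minimal sketch. The function names `normalize_skills` and `skill_gap`, the lowercase/trim normalization rule, and the example skill lists are assumptions made for illustration; they are not part of the project code.

```python
from __future__ import annotations


def normalize_skills(raw: str) -> set[str]:
    """Split a comma-separated skills string into a set of lowercase, trimmed names."""
    return {part.strip().lower() for part in raw.split(",") if part.strip()}


def skill_gap(user_skills: str, required_skills: str) -> list[str]:
    """Return the required skills the user does not yet list, sorted alphabetically."""
    return sorted(normalize_skills(required_skills) - normalize_skills(user_skills))


# Hypothetical inputs: a user profile and the skills assumed to be required
# for a target career.
user = "Accounting, SQL, Data Analysis"
required = "SQL, Python, Data Analysis, Statistics"
gap = skill_gap(user, required)
print(gap)
```

The resulting gap list is the natural input to the learning-resource step, where each missing skill is matched against course skill tags.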


In [1]:
from pathlib import Path
import sys
import pandas as pd

# Make sure the notebook can import from /src
sys.path.append(str(Path.cwd()))

from src.data_utils import DatasetSpec, inspect_dataset

DATA_DIR = Path("DATA")
DATA_DIR.exists(), DATA_DIR.resolve()


(True,
 WindowsPath('C:/Users/ELITEBOOK/Documents/DSF-FT14/phase_5/Cap_Stone_Project/DATA'))

In [2]:
DATASETS = [
    DatasetSpec(name="Career Dataset (Large)", base_name="raw/career_dataset_large"),
    DatasetSpec(name="AI Job Trends Dataset", base_name="raw/ai_job_trends_dataset"),
    DatasetSpec(name="AI Impact on Jobs 2030", base_name="raw/AI_Impact_on_Jobs_2030"),
    DatasetSpec(name="AI Impact Jobs 2010–2025", base_name="raw/ai_impact_jobs_2010_2025"),
    DatasetSpec(name="Coursera", base_name="raw/Coursera"),
    DatasetSpec(name="Coursera Reviews", base_name="raw/reviews"),
    DatasetSpec(name="Coursera Reviews by Course", base_name="raw/reviews_by_course"),
    DatasetSpec(name="Cbc Pathways Dataset", base_name="raw/cbc_pathways"),
]


In [3]:
# This will show the real file names found (with extensions)
from src.data_utils import resolve_dataset_file

resolved = []
for spec in DATASETS:
    p = resolve_dataset_file(DATA_DIR, spec.base_name)
    resolved.append({"dataset": spec.name, "file_found": p.name})

pd.DataFrame(resolved)


Unnamed: 0,dataset,file_found
0,Career Dataset (Large),career_dataset_large.xlsx
1,AI Job Trends Dataset,ai_job_trends_dataset.csv
2,AI Impact on Jobs 2030,AI_Impact_on_Jobs_2030.csv
3,AI Impact Jobs 2010–2025,ai_impact_jobs_2010_2025.csv
4,Coursera,Coursera.csv
5,Coursera Reviews,reviews.csv
6,Coursera Reviews by Course,reviews_by_course.csv
7,Cbc Pathways Dataset,cbc_pathways.csv
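
The `src.data_utils` module is not shown in this notebook. As a rough idea of how an extension-less base name such as `raw/Coursera` could be resolved to a real file, here is a minimal sketch. `SpecSketch`, `resolve_file_sketch`, and the list of candidate extensions are hypothetical stand-ins, not the project's actual `DatasetSpec` or `resolve_dataset_file`.

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SpecSketch:
    """Hypothetical stand-in for DatasetSpec: display name plus extension-less path."""
    name: str
    base_name: str


def resolve_file_sketch(
    data_dir: Path,
    base_name: str,
    extensions: tuple[str, ...] = (".csv", ".xlsx", ".xls"),  # assumed candidates
) -> Path:
    """Return the first existing file named base_name plus one of the extensions."""
    for ext in extensions:
        candidate = data_dir / f"{base_name}{ext}"
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"No file for '{base_name}' in {data_dir} with {extensions}")


# Usage with a throwaway directory (the notebook itself points DATA_DIR at "DATA").
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "raw").mkdir()
    (root / "raw" / "demo.xlsx").touch()
    spec = SpecSketch(name="Demo", base_name="raw/demo")
    found_name = resolve_file_sketch(root, spec.base_name).name

print(found_name)
```

Trying `.csv` before `.xlsx` is an arbitrary choice in this sketch; it only matters if the same base name exists in more than one format.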


## Dataset 1: Career Dataset (Large)

**Purpose:** Baseline mapping from learner profile attributes (education, specialization, skills, certifications, academic performance) to a recommended career label.  
This dataset will be used to train an initial supervised model that predicts a likely career path from a user’s profile and to simulate realistic user inputs during prototyping.


### 1) Load and Inspect the Dataset
In this subsection, we load the dataset and review its size, structure, and basic quality indicators (duplicates and missing values).

> **Expected outcome:** confirm that the dataset is usable for baseline modeling with minimal cleaning.




In [4]:
spec = DATASETS[0]  # change index for each dataset
out = inspect_dataset(DATA_DIR, spec)
out["summary"]


Unnamed: 0,dataset,rows,cols,duplicates,missing_cells,memory_mb
0,Career Dataset (Large),5000,6,1,596,1.709


### 2) Summary Interpretation (Shape, Missingness, Duplicates)

From the summary table:

- **Rows:** 5,000 — the dataset contains 5,000 learner profiles.  
- **Columns:** 6 — the dataset includes 5 feature columns plus 1 target label.  
- **Missing cells:** 596, all in the `certifications` column (11.92% of rows); these need an explicit fill value or category before modeling.  
- **Duplicates:** 1 duplicate row — we will remove duplicates during preprocessing to avoid biasing the model.  
- **Memory footprint:** ~1.7 MB — lightweight and easy to work with in a notebook and during deployment.

Overall, this dataset is largely clean and suitable for building a baseline career recommendation model once the missing `certifications` values and the single duplicate row are handled.
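
The implementation of `inspect_dataset` is not shown here. The following is a minimal sketch of how the summary fields could be computed with pandas, assuming that `duplicates` counts fully repeated rows and that `memory_mb` uses deep memory usage divided by 1024². The function name `summarize` and the toy frame are illustrative only.

```python
import pandas as pd


def summarize(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """Build a one-row summary: shape, duplicated rows, missing cells, memory in MB."""
    return pd.DataFrame([{
        "dataset": name,
        "rows": df.shape[0],
        "cols": df.shape[1],
        # Rows that exactly repeat an earlier row.
        "duplicates": int(df.duplicated().sum()),
        # Total NaN/None cells across all columns.
        "missing_cells": int(df.isna().sum().sum()),
        # deep=True also measures the contents of object (string) columns.
        "memory_mb": round(df.memory_usage(deep=True).sum() / 1024**2, 3),
    }])


# Tiny made-up frame: one repeated row and one missing certification.
toy = pd.DataFrame({
    "skills": ["SQL", "SQL", "Python"],
    "certifications": ["AWS Certified", "AWS Certified", None],
})
summary = summarize(toy, "toy")
print(summary)
```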



In [5]:
out["head"]


Unnamed: 0,education_level,specialization,skills,certifications,cgpa/percentage,recommended_career
0,Bachelor's,Finance,"Counseling, MS Office, Machine Learning",Tally ERP,67,Business Analyst
1,Intermediate,Science,"Accounting, MS Office",AWS Certified,67,Software Engineer
2,Master's,Business,"Accounting, SQL, Data Analysis",Mental Health Basics,90,Financial Analyst
3,Bachelor's,Computer Science,Communication,,75,Clerk
4,Matric,Business,Data Analysis,Tally ERP,83,Sales Assistant


### 3) Preview Interpretation (Sample Rows)

The sample rows confirm that each record represents a learner profile:

- **education_level:** Highest education attained (e.g., Matric, Intermediate, Bachelor’s, Master’s).  
- **specialization:** Area of study or background (e.g., Finance, Business, Computer Science).  
- **skills:** A comma-separated list of skills associated with the profile.  
- **certifications:** Certification(s) held; blank (missing) in the preview when no certification is recorded.  
- **cgpa/percentage:** Numeric performance indicator.  
- **recommended_career:** The **target label** for supervised learning.

This structure aligns directly with our project goal of generating career recommendations based on user background and skills.



In [6]:
out["columns"].head(25)


Unnamed: 0,column,dtype,missing,missing_pct,nunique
3,certifications,object,596,11.92,7
2,skills,object,0,0.0,757
4,cgpa/percentage,int64,0,0.0,36
5,recommended_career,object,0,0.0,12
1,specialization,object,0,0.0,8
0,education_level,object,0,0.0,5


### 4) Column Report Interpretation (Data Types and Cardinality)

The column-level report shows:

- Most features are **categorical text fields** (`object`), including `education_level`, `specialization`, `skills`, `certifications`, and `recommended_career`.  
- `cgpa/percentage` is numeric (`int64`), which can be used directly after standard scaling (if needed).  
- The **skills column has high variability** (many unique values). This is expected since skills are naturally diverse, but it implies we must preprocess this field carefully (e.g., split into lists, normalize terms, then vectorize).

**Unique values (nunique):**
- `education_level`: 5 unique levels  
- `specialization`: 8 unique fields  
- `certifications`: 7 unique categories (missing values are not counted)  
- `recommended_career`: 12 unique career labels  
- `skills`: 757 unique combinations  

This distribution is suitable for classification, but the `skills` field requires text/list preprocessing before modeling.
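
The split, normalize, and vectorize steps mentioned above can be sketched in plain Python. This is one possible approach (a multi-hot encoding over a sorted vocabulary), not necessarily the one used in later notebooks. The helper names are hypothetical, and the three input strings are taken from the preview rows.

```python
from __future__ import annotations


def split_skills(raw: str) -> list[str]:
    """Split a comma-separated skills string into trimmed, lowercase skill names."""
    return [s.strip().lower() for s in raw.split(",") if s.strip()]


def build_vocabulary(rows: list[str]) -> list[str]:
    """Collect every distinct skill across all rows, in sorted order."""
    return sorted({skill for row in rows for skill in split_skills(row)})


def multi_hot(skills: list[str], vocabulary: list[str]) -> list[int]:
    """Encode a skill list as a 0/1 vector aligned with the vocabulary."""
    present = set(skills)
    return [1 if term in present else 0 for term in vocabulary]


rows = [
    "Counseling, MS Office, Machine Learning",
    "Accounting, MS Office",
    "Accounting, SQL, Data Analysis",
]
vocab = build_vocabulary(rows)
vectors = [multi_hot(split_skills(r), vocab) for r in rows]

print(vocab)
for vec in vectors:
    print(vec)
```

Encoding individual skills rather than whole strings reduces the 757 distinct combinations to a much smaller set of columns, one per skill.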


## Dataset 2: AI Job Trends Dataset

**Purpose:** This dataset captures how artificial intelligence is influencing job market dynamics across industries, including job growth trends, salary patterns, automation risk, required education levels, and workforce diversity indicators. It supports the future workforce intelligence component of our Career and Skills Recommendation System.


### 1) Load and Inspect the Dataset

In this subsection, we load the dataset and review its size, structure, and basic quality indicators such as duplicates, missing values, and overall dataset scale.

> **Expected outcome:** confirm whether the dataset is sufficiently complete and structured for workforce trend analysis and career recommendation augmentation.



In [7]:
spec = DATASETS[1]  # change index for each dataset
out = inspect_dataset(DATA_DIR, spec)
out["summary"]


Unnamed: 0,dataset,rows,cols,duplicates,missing_cells,memory_mb
0,AI Job Trends Dataset,30000,13,0,0,13.89


### 2) Summary Interpretation (Shape, Missingness, Duplicates)

From the dataset summary:

- **Rows:** 30,000 — a relatively large dataset providing broad coverage of occupations and industries.  
- **Columns:** 13 — includes workforce indicators such as salaries, education requirements, automation risk, and job outlook.  
- **Missing cells:** 0 — no missing values detected, indicating high dataset completeness.  
- **Duplicates:** None detected, suggesting records are unique.  
- **Memory footprint:** ~13.9 MB — manageable for analysis and modeling workflows.

Overall, the dataset appears clean and comprehensive, making it suitable for labor market trend analysis and AI impact assessment.



In [8]:
out["head"]

Unnamed: 0,job_title,industry,job_status,ai_impact_level,median_salary_(usd),required_education,experience_required_(years),job_openings_(2024),projected_openings_(2030),remote_work_ratio_(%),automation_risk_(%),location,gender_diversity_(%)
0,Investment analyst,IT,Increasing,Moderate,42109.76,Master’s Degree,5,1515,6342,55.96,28.28,UK,44.63
1,"Journalist, newspaper",Manufacturing,Increasing,Moderate,132298.57,Master’s Degree,15,1243,6205,16.81,89.71,USA,66.39
2,Financial planner,Finance,Increasing,Low,143279.19,Bachelor’s Degree,4,3338,1154,91.82,72.97,Canada,41.13
3,Legal secretary,Healthcare,Increasing,High,97576.13,Associate Degree,15,7173,4060,1.89,99.94,Australia,65.76
4,Aeronautical engineer,IT,Increasing,Low,60956.63,Master’s Degree,13,5944,7396,53.76,37.65,Germany,72.57


### 3) Preview Interpretation (Sample Records)

The preview shows that each record represents a specific job role along with contextual workforce attributes:

Key observed features include:

- **job_title:** Occupation name.  
- **industry:** Sector classification (e.g., IT, Finance, Healthcare).  
- **job_status:** Indicates whether the job is increasing or decreasing due to AI adoption.  
- **ai_impact_level:** Degree of AI influence on the occupation (Low, Moderate, High).  
- **median_salary_(usd):** Estimated annual salary.  
- **required_education:** Typical educational requirement.  
- **experience_required_(years):** Years of experience needed.  
- **job_openings_(2024) and projected_openings_(2030):** Current and future workforce demand indicators.  
- **remote_work_ratio_(%):** Proportion of roles that can be performed remotely.  
- **automation_risk_(%):** Estimated likelihood of automation.  
- **location:** Country of reference.  
- **gender_diversity_(%):** Workforce diversity indicator.

These attributes align strongly with the project’s objective of providing future-aware career recommendations.



In [9]:
out["columns"].head(25)

Unnamed: 0,column,dtype,missing,missing_pct,nunique
4,median_salary_(usd),float64,0,0.0,29968
10,automation_risk_(%),float64,0,0.0,9519
9,remote_work_ratio_(%),float64,0,0.0,9466
7,job_openings_(2024),int64,0,0.0,9439
8,projected_openings_(2030),int64,0,0.0,9410
12,gender_diversity_(%),float64,0,0.0,5965
0,job_title,object,0,0.0,639
6,experience_required_(years),int64,0,0.0,21
1,industry,object,0,0.0,8
11,location,object,0,0.0,8


### 4) Column Report Interpretation (Data Types and Cardinality)

The column analysis reveals:

- A mix of **numeric features** (salary, automation risk, job openings, remote work ratio) and **categorical features** (job title, industry, education level, AI impact level).  
- Several numeric columns have very high cardinality (thousands of unique values), which is expected for continuous workforce indicators.  
- The dataset contains:
  - 639 unique job titles,
  - 8 industries,
  - 5 education levels,
  - 3 AI impact categories,
  - 2 job status categories (Increasing/Decreasing).

This diversity makes the dataset suitable for trend modeling, clustering, and recommendation enrichment.
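
As one example of trend modeling, a demand-growth indicator can be derived from the two openings columns. This derived feature is an assumption for illustration and is not a column in the dataset; the function name `projected_growth_pct` is hypothetical, and the two input pairs come from the preview rows.

```python
def projected_growth_pct(openings_2024: int, openings_2030: int) -> float:
    """Percentage change from 2024 openings to projected 2030 openings."""
    if openings_2024 <= 0:
        raise ValueError("openings_2024 must be positive to compute a percentage change")
    return (openings_2030 - openings_2024) / openings_2024 * 100.0


# (job_title, job_openings_(2024), projected_openings_(2030)) from the preview.
sample = [
    ("Investment analyst", 1515, 6342),
    ("Financial planner", 3338, 1154),
]
growth = {
    title: round(projected_growth_pct(now, later), 1)
    for title, now, later in sample
}
print(growth)
```

A positive value indicates growing projected demand and a negative value indicates shrinking demand, which can be compared against the `job_status` label during cleaning.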

## Dataset 3: AI Impact on Jobs 2030

**Purpose:**  
This dataset provides forward-looking insights into how artificial intelligence, automation, and technological advancement may influence different occupations by the year 2030. It includes indicators such as automation probability, AI exposure, required education levels, salary estimates, and synthetic skill dimensions. This dataset supports the future workforce intelligence component of the Career and Skills Recommendation System.



### 1) Load and Inspect the Dataset

In this subsection, the dataset is loaded and examined to understand its size, structure, and basic quality indicators such as duplicates, missing values, and overall dataset scale.

> **Expected outcome:** confirm that the dataset is suitable for analyzing future job stability, AI exposure, and skill requirements.



In [10]:
spec = DATASETS[2]  # change index for each dataset
out = inspect_dataset(DATA_DIR, spec)
out["summary"]

Unnamed: 0,dataset,rows,cols,duplicates,missing_cells,memory_mb
0,AI Impact on Jobs 2030,3000,18,0,0,0.905


### 2) Summary Interpretation (Shape, Missingness, Duplicates)

From the dataset summary:

- **Rows:** 3,000 — representing different occupations and workforce indicators.  
- **Columns:** 18 — covering salary, education level, AI exposure, automation risk, and multiple skill dimensions.  
- **Missing cells:** 0 — the dataset has no missing values, indicating strong completeness.  
- **Duplicates:** None detected, suggesting unique job entries.  
- **Memory footprint:** ~0.9 MB — relatively small and easy to handle computationally.

Overall, the dataset appears clean and ready for analysis without extensive preprocessing.



In [11]:
out["head"]

Unnamed: 0,job_title,average_salary,years_experience,education_level,ai_exposure_index,tech_growth_factor,automation_probability_2030,risk_category,skill_1,skill_2,skill_3,skill_4,skill_5,skill_6,skill_7,skill_8,skill_9,skill_10
0,Security Guard,45795,28,Master's,0.18,1.28,0.85,High,0.45,0.1,0.46,0.33,0.14,0.65,0.06,0.72,0.94,0.0
1,Research Scientist,133355,20,PhD,0.62,1.11,0.05,Low,0.02,0.52,0.4,0.05,0.97,0.23,0.09,0.62,0.38,0.98
2,Construction Worker,146216,2,High School,0.86,1.18,0.81,High,0.01,0.94,0.56,0.39,0.02,0.23,0.24,0.68,0.61,0.83
3,Software Engineer,136530,13,PhD,0.39,0.68,0.6,Medium,0.43,0.21,0.57,0.03,0.84,0.45,0.4,0.93,0.73,0.33
4,Financial Analyst,70397,22,High School,0.52,1.46,0.64,Medium,0.75,0.54,0.59,0.97,0.61,0.28,0.3,0.17,0.02,0.42


### 3) Preview Interpretation (Sample Records)

The preview confirms that each record corresponds to a specific job role with indicators related to future technological impact.

Key observed attributes include:

- **job_title:** Name of the occupation.  
- **average_salary:** Estimated annual salary for the role.  
- **years_experience:** Average experience required.  
- **education_level:** Highest education typically required.  
- **ai_exposure_index:** Degree to which the job interacts with AI technologies.  
- **tech_growth_factor:** Rate of technological advancement in the field.  
- **automation_probability_2030:** Estimated likelihood that the job may be automated by 2030.  
- **risk_category:** Categorized automation risk level (Low, Medium, High).  
- **skill_1 – skill_10:** Numeric indicators representing different skill dimensions or competency levels.

These attributes are particularly relevant for understanding how careers may evolve in response to technological change.



In [12]:
out["columns"].head(25)

Unnamed: 0,column,dtype,missing,missing_pct,nunique
1,average_salary,int64,0,0.0,2960
4,ai_exposure_index,float64,0,0.0,101
5,tech_growth_factor,float64,0,0.0,101
8,skill_1,float64,0,0.0,101
9,skill_2,float64,0,0.0,101
10,skill_3,float64,0,0.0,101
11,skill_4,float64,0,0.0,101
12,skill_5,float64,0,0.0,101
13,skill_6,float64,0,0.0,101
14,skill_7,float64,0,0.0,101


### 4) Column Report Interpretation (Data Types and Cardinality)

Column analysis shows:

- A strong presence of **numeric workforce indicators**, including salary, automation probability, AI exposure index, and technology growth factor.  
- Synthetic skill indicators (`skill_1` to `skill_10`) each contain 101 unique values, which is consistent with scores between 0 and 1 rounded to two decimals and suggests scaled competency measures rather than categorical labels.  
- The dataset includes:
  - 20 unique job titles,
  - 4 education levels,
  - 3 automation risk categories.

This structure makes the dataset suitable for predictive modeling, clustering, and career trend analysis.
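
Because `skill_1` to `skill_10` are numeric, records can be compared as vectors. The sketch below uses cosine similarity, which is one common choice and is an assumption here rather than the project's chosen method. The two vectors are the skill values of the Software Engineer and Research Scientist preview rows; they represent single records, not averages per job title.

```python
from __future__ import annotations

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# skill_1..skill_10 from two preview records.
software_engineer = [0.43, 0.21, 0.57, 0.03, 0.84, 0.45, 0.4, 0.93, 0.73, 0.33]
research_scientist = [0.02, 0.52, 0.4, 0.05, 0.97, 0.23, 0.09, 0.62, 0.38, 0.98]

similarity = cosine_similarity(software_engineer, research_scientist)
print(round(similarity, 3))
```

The same function could score a user's self-assessed skill vector against job records, provided the meaning of each skill dimension is known.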

## Dataset 4: AI Impact Jobs 2010–2025

**Purpose:**  
This dataset provides historical insights into how artificial intelligence has influenced job markets globally between 2010 and 2025. It includes job postings, AI skill requirements, automation risk indicators, salary trends, industry adoption stages, and workforce reskilling signals. This dataset supports the trend analysis component of the Career and Skills Recommendation System by helping identify how AI adoption has evolved across industries and occupations.



### 1) Load and Inspect the Dataset

In this subsection, the dataset is loaded and examined to understand its structure, scale, completeness, and overall data quality.

> **Expected outcome:** determine whether the dataset can support longitudinal analysis of AI workforce trends and skill evolution.



In [13]:
spec = DATASETS[3]  # change index for each dataset
out = inspect_dataset(DATA_DIR, spec)
out["summary"]


Unnamed: 0,dataset,rows,cols,duplicates,missing_cells,memory_mb
0,AI Impact Jobs 2010–2025,5000,22,0,6754,4.871


### 2) Summary Interpretation (Shape, Missingness, Duplicates)

From the dataset summary:

- **Rows:** 5,000 — representing job records spanning multiple years and industries.  
- **Columns:** 22 — covering workforce attributes, AI adoption indicators, salary information, skills, and industry characteristics.  
- **Missing cells:** 6,754, all in the `ai_keywords` and `ai_skills` columns (3,377 each).  
- **Duplicates:** None detected, indicating unique job records.  
- **Memory footprint:** ~4.9 MB — manageable for analysis.

Although the dataset is relatively clean structurally, the presence of missing values (particularly in AI-related skill fields) will require careful preprocessing.



In [14]:
out["head"]

Unnamed: 0,job_id,posting_year,country,region,city,company_name,company_size,industry,job_title,seniority_level,...,ai_intensity_score,core_skills,ai_skills,salary_usd,salary_change_vs_prev_year_percent,automation_risk_score,reskilling_required,ai_job_displacement_risk,job_description_embedding_cluster,industry_ai_adoption_stage
0,836b4774-702e-49ef-93d3-2f255ce1e910,2018,Brazil,South America,London,NextGen Technologies,Small,Education,Policy Analyst,Lead,...,0.81,"Research, Project Management, Business Analysis",reinforcement learning,61586,12.68,0.11,True,Low,14,Growing
1,43699e93-7b15-4728-a4c6-9e41ff438a25,2015,UAE,Middle East,Singapore,Future Solutions,Medium,Energy,Data Scientist,Executive,...,0.04,"Research, SQL, Business Analysis, Python, Clou...",,62045,-3.98,0.71,False,High,19,Emerging
2,fc9d1854-3cbf-4bab-90df-77304dfc59df,2016,Nepal,South Asia,Sydney,Future Analytics,Startup,Finance,Product Manager,Junior,...,0.15,"Statistics, Project Management, Cloud Computin...",,27035,3.55,0.86,False,High,2,Emerging
3,05c1c7d3-2add-4919-91eb-f6c78bfe23d1,2015,Spain,Europe,Nairobi,Global Technologies,Large,Government,Data Scientist,Mid,...,0.19,"Cloud Computing, SQL, Project Management, Comm...",,72894,-2.8,0.7,False,Low,15,Emerging
4,5e739937-d1b0-44d7-935c-7ebb3fc1f6e8,2014,Taiwan,East Asia,Sydney,Future Technologies,Small,Manufacturing,ML Engineer,Lead,...,0.11,"SQL, Python, Communication, Software Engineeri...",,57215,0.85,0.87,False,High,13,Emerging


### 3) Preview Interpretation (Sample Records)

The preview shows that each record corresponds to a job posting enriched with AI-related workforce indicators.

Key observed attributes include:

- **job_id:** Unique identifier for each job posting.  
- **posting_year:** Year of the job listing (2010–2025 range).  
- **country, region, city:** Geographic context of the job.  
- **company_name and company_size:** Employer characteristics.  
- **industry:** Sector classification.  
- **job_title and seniority_level:** Role description and experience level.  
- **core_skills and ai_skills:** Required general and AI-specific competencies.  
- **salary_usd:** Estimated annual salary.  
- **salary_change_vs_prev_year_percent:** Salary trend indicator.  
- **automation_risk_score:** Likelihood of automation impact.  
- **reskilling_required:** Indicator of whether workforce upskilling is needed.  
- **ai_job_displacement_risk:** Categorized automation risk level.  
- **industry_ai_adoption_stage:** Stage of AI adoption within the industry.

These attributes provide valuable historical context for analyzing workforce transformation due to AI.



In [15]:
out["columns"].head(25)

Unnamed: 0,column,dtype,missing,missing_pct,nunique
11,ai_keywords,object,3377,67.54,617
14,ai_skills,object,3377,67.54,617
0,job_id,object,0,0.0,5000
15,salary_usd,int64,0,0.0,4883
13,core_skills,object,0,0.0,4162
16,salary_change_vs_prev_year_percent,float64,0,0.0,1815
12,ai_intensity_score,float64,0,0.0,77
17,automation_risk_score,float64,0,0.0,62
2,country,object,0,0.0,44
20,job_description_embedding_cluster,int64,0,0.0,20


### 4) Column Report Interpretation (Data Types and Cardinality)

The column analysis indicates:

- A mix of categorical, numeric, and boolean features representing workforce characteristics.  
- Significant missing values in:
  - **ai_keywords (~67.5%)**
  - **ai_skills (~67.5%)**
  This suggests that many roles either did not explicitly mention AI skills or that AI adoption varied significantly across jobs.

Other observations include:

- 44 unique countries and multiple geographic regions, providing global workforce coverage.  
- Diverse industries and job roles, supporting broad labor market analysis.  
- Boolean indicators (`ai_mentioned`, `reskilling_required`) useful for binary classification or trend analysis.

Overall, the dataset contains rich contextual information despite partial missingness in AI-specific skill fields.
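
One way to handle the missing AI skill fields is to keep the missingness as an explicit feature instead of dropping rows. The sketch below is a possible preprocessing step and not the project's confirmed approach. The column names `has_ai_skills` and `ai_skills_list` are hypothetical, and the three rows mirror values visible in the preview.

```python
import pandas as pd

postings = pd.DataFrame({
    "job_title": ["Policy Analyst", "Data Scientist", "ML Engineer"],
    "ai_skills": ["reinforcement learning", None, None],
})

# Share of rows with no AI skills recorded, as a percentage.
missing_pct = round(float(postings["ai_skills"].isna().mean()) * 100, 2)

# Flag whether a posting lists any AI skills at all.
postings["has_ai_skills"] = postings["ai_skills"].notna()

# Convert each value to a list of skills; missing values become an empty list.
postings["ai_skills_list"] = postings["ai_skills"].apply(
    lambda value: [s.strip() for s in value.split(",")] if isinstance(value, str) else []
)

print(missing_pct)
print(postings[["job_title", "has_ai_skills", "ai_skills_list"]])
```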

## Dataset 5: Coursera Courses Dataset

### Purpose
This dataset contains structured information about online courses offered through Coursera, including course subjects, institutions, skill outcomes, ratings, and learner engagement metrics.

It supports the **learning pathway recommendation component** of the Career and Skills Recommendation System by enabling:

- Course recommendations aligned with career skill gaps  
- Skill-to-course mapping  
- Quality-based ranking of learning resources  

### 1) Load and Inspect the Dataset

The dataset is loaded to assess its structure, content diversity, and suitability for course recommendation modeling.

**Expected outcome:**

- Confirm availability of skill tags  
- Evaluate course diversity and quality indicators  
- Identify preprocessing requirements  

In [16]:
spec = DATASETS[4]  # change index for each dataset
out = inspect_dataset(DATA_DIR, spec)
out["summary"]


Unnamed: 0,dataset,rows,cols,duplicates,missing_cells,memory_mb
0,Coursera,3404,9,0,0,2.584


### 2) Summary Interpretation

Key observations:

- **Rows:** 3,404 courses.  
- **Columns:** 9 attributes describing course content and quality.  
- **Duplicates:** None detected.  
- **Missing values:** None observed.  
- **Memory footprint:** Approximately 2.6 MB.

The dataset is clean and ready for integration into the recommendation pipeline.

In [17]:
out["head"]

Unnamed: 0,subject,title,institution,learning_product,level,duration,gained_skills,rate,reviews
0,Business,Business Analysis & Process Management,Coursera Project Network,Guided Project,Beginner,Less Than 2 Hours,"Process Analysis, Business Process, Business A...",4.4,6100
1,Business,Getting Started with Microsoft Excel,Coursera Project Network,Guided Project,Intermediate,Less Than 2 Hours,"Microsoft Excel, Excel Formulas, Spreadsheet S...",4.6,11000
2,Business,Financial Markets,Yale University,Course,Beginner,1 - 3 Months,"Investment Banking, Risk Management, Financial...",4.8,30000
3,Business,Investment Risk Management,Coursera Project Network,Guided Project,Intermediate,Less Than 2 Hours,"Investment Management, Risk Management, Financ...",4.4,1800
4,Business,Food & Beverage Management,Università Bocconi,Course,Mixed,1 - 3 Months,"Food and Beverage, Hospitality, Restaurant Man...",4.8,4800


### 3) Preview Interpretation

Important attributes include:

- **subject:** Broad discipline classification (Business, Data Science, etc.).  
- **title:** Course name.  
- **institution:** Course provider (university or organization).  
- **learning_product:** Course format (Course, Guided Project, etc.).  
- **level:** Difficulty level (e.g., Beginner, Intermediate, Mixed).  
- **duration:** Estimated course completion time.  
- **gained_skills:** Skills acquired from the course.  
- **rate:** Average learner rating.  
- **reviews:** Number of learner reviews.

These features provide both content relevance and quality indicators.

In [18]:
out["columns"].head(25)

Unnamed: 0,column,dtype,missing,missing_pct,nunique
6,gained_skills,object,0,0.0,2762
1,title,object,0,0.0,2753
8,reviews,int64,0,0.0,782
2,institution,object,0,0.0,208
7,rate,float64,0,0.0,26
3,learning_product,object,0,0.0,5
0,subject,object,0,0.0,4
4,level,object,0,0.0,4
5,duration,object,0,0.0,4


### 4) Column Report Interpretation

Key observations:

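- Over **2,700 unique skill combinations** (2,762 across 3,404 courses), indicating diverse course offerings.  
- **2,753 unique titles** across 3,404 rows, so some course titles appear more than once even though no fully duplicated rows were found.  
- Wide range of institutions (208 distinct providers).  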
- Ratings and review counts enable credibility assessment.  
- Difficulty levels and duration support personalized learning recommendations.
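
One possible way to combine `rate` and `reviews` into a single credibility signal is a weighted (Bayesian-style) average that pulls ratings backed by few reviews toward the global mean. This is a sketch of a common technique, not the method used in this project; the `weighted_rating` helper, the `credibility` column, and the `prior_weight` value of 500 are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows that reuse the dataset's column names; the values are illustrative only.
courses = pd.DataFrame({
    "title": ["A", "B", "C"],
    "rate": [4.8, 5.0, 4.4],
    "reviews": [30000, 10, 1800],
})


def weighted_rating(df: pd.DataFrame, prior_weight: float = 500.0) -> pd.Series:
    """Shrink each course rating toward the global mean rating.

    prior_weight acts like a number of 'virtual' reviews at the global mean,
    so courses with few real reviews move most.
    """
    global_mean = df["rate"].mean()
    n_reviews = df["reviews"]
    return (n_reviews * df["rate"] + prior_weight * global_mean) / (n_reviews + prior_weight)


courses["credibility"] = weighted_rating(courses)
ranked = courses.sort_values("credibility", ascending=False)
print(ranked[["title", "rate", "reviews", "credibility"]])
```

In this toy example, course B has a perfect 5.0 rating from only 10 reviews, so it ranks below course A, whose 4.8 rating is supported by 30,000 reviews.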

## Dataset 6: Coursera Reviews Dataset

**Purpose:**  
This dataset contains learner reviews of online courses from the Coursera platform. It includes textual feedback and rating labels that reflect learner satisfaction with different courses. The dataset supports the learning pathway recommendation component of the Career and Skills Recommendation System by helping identify high-quality courses aligned with career skill requirements.


### 1) Load and Inspect the Dataset

In this subsection, the dataset is loaded and inspected to evaluate its structure, size, and overall data quality.

> **Expected outcome:** confirm the dataset’s suitability for sentiment analysis, course quality evaluation, and MOOC recommendation support.



In [19]:
spec = DATASETS[5]  # change index for each dataset
out = inspect_dataset(DATA_DIR, spec)
out["summary"]


Unnamed: 0,dataset,rows,cols,duplicates,missing_cells,memory_mb
0,Coursera Reviews,107018,3,0,0,22.961


### 2) Summary Interpretation (Shape, Missingness, Duplicates)

From the dataset summary:

- **Rows:** 107,018 — a large dataset representing learner feedback on multiple courses.  
- **Columns:** 3 — including review ID, textual feedback, and rating label.  
- **Missing cells:** None detected, indicating complete records.  
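- **Duplicates:** No fully duplicated rows detected. Because every row carries a unique `id`, this check cannot flag repeated review text.  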
- **Memory footprint:** ~23 MB — moderately large but manageable for NLP analysis.

Overall, the dataset appears clean and suitable for text analysis and course recommendation modeling.
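
Since `id` is unique for every row, a full-row duplicate check cannot reveal repeated review text. The sketch below shows one way to check the `review` column on its own, using toy data with the same column names; the variable names and the whitespace and case normalization step are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data with the dataset's column names; the values are illustrative only.
reviews = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "review": ["good and interesting", "Great course", "good and interesting", "Great course "],
    "label": [5, 5, 4, 5],
})

# Full-row check: always zero here because `id` differs on every row.
full_row_dupes = int(reviews.duplicated().sum())

# Text-only check: counts rows whose review text already appeared earlier.
text_dupes = int(reviews.duplicated(subset=["review"]).sum())

# Optional normalization so trailing spaces or casing do not hide repeats.
normalized = reviews["review"].str.strip().str.lower()
normalized_dupes = int(normalized.duplicated().sum())

print(full_row_dupes, text_dupes, normalized_dupes)
```

Whether repeated short reviews should be dropped is a modeling decision: identical text from different learners can be legitimate feedback.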



In [20]:
out["head"]

Unnamed: 0,id,review,label
0,0,good and interesting,5
1,1,"This class is very helpful to me. Currently, I...",5
2,2,like!Prof and TAs are helpful and the discussi...,5
3,3,Easy to follow and includes a lot basic and im...,5
4,4,Really nice teacher!I could got the point eazl...,4


### 3) Preview Interpretation (Sample Records)

Each record consists of:

- **id:** Unique identifier for each review.  
- **review:** Textual learner feedback about a course.  
- **label:** Numeric rating representing learner satisfaction (typically 1–5 stars).

The textual reviews contain qualitative insights about course quality, teaching effectiveness, content clarity, and learner experience.

This unstructured text data is particularly suitable for Natural Language Processing (NLP) tasks such as sentiment analysis and keyword extraction.



In [21]:
out["columns"].head(25)

Unnamed: 0,column,dtype,missing,missing_pct,nunique
0,id,int64,0,0.0,107018
1,review,object,0,0.0,100038
2,label,int64,0,0.0,5


### 4) Column Report Interpretation (Data Types and Cardinality)

Column analysis indicates:

- The **review** column contains 100,038 unique texts across 107,018 rows. Feedback is highly varied, but about 6,980 rows repeat a review text that appears elsewhere.  
- The **label** column has 5 unique values representing rating levels.  
- The **id** column uniquely identifies each record.

This structure makes the dataset suitable for supervised sentiment classification, clustering, and recommendation ranking.
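
For supervised sentiment classification, the star ratings in `label` can be collapsed into sentiment classes. The mapping below (1 to 2 as negative, 3 as neutral, 4 to 5 as positive) is a common convention and an assumption here, not a rule taken from the dataset; the `to_sentiment` helper and the `sentiment` column are illustrative names.

```python
import pandas as pd

# Toy data with the dataset's column names; the values are illustrative only.
reviews = pd.DataFrame({
    "review": ["great", "nice", "okay", "bad", "poor"],
    "label": [5, 4, 3, 1, 2],
})


def to_sentiment(stars: int) -> str:
    """Map a 1-5 star rating to a coarse sentiment class (assumed thresholds)."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


reviews["sentiment"] = reviews["label"].map(to_sentiment)

# Class shares help reveal imbalance before training a classifier.
class_share = reviews["sentiment"].value_counts(normalize=True)
print(class_share)
```

Checking `class_share` on the real data is worthwhile before modeling, because review datasets are often dominated by high ratings.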


## Dataset 7: Coursera Reviews by Course Dataset

**Purpose:**  
This dataset contains learner reviews grouped by specific Coursera courses. Unlike the general reviews dataset, this version associates each review with a particular course identifier, enabling course-level analysis of learner satisfaction. It supports the MOOC recommendation component of the Career and Skills Recommendation System by helping identify high-quality courses aligned with skill development needs.


### 1) Load and Inspect the Dataset

In this subsection, the dataset is loaded and inspected to assess its structure, completeness, and overall quality.

> **Expected outcome:** confirm suitability for course ranking, sentiment analysis, and learning resource recommendation.



In [22]:
spec = DATASETS[6]  # change index for each dataset
out = inspect_dataset(DATA_DIR, spec)
out["summary"]


Unnamed: 0,dataset,rows,cols,duplicates,missing_cells,memory_mb
0,Coursera Reviews by Course,140320,3,3016,3,36.848


### 2) Summary Interpretation (Shape, Missingness, Duplicates)

From the dataset summary:

- **Rows:** 140,320 — a large dataset representing learner reviews across many courses.  
- **Columns:** 3 — course identifier, textual review, and rating label.  
- **Missing cells:** 3 missing review entries detected (very minimal).  
- **Duplicates:** 3,016 duplicate records identified, likely repeated reviews.  
- **Memory footprint:** ~36.8 MB — relatively large but manageable for NLP analysis.

Overall, the dataset is usable but requires minor cleaning, particularly duplicate removal and handling of missing reviews.
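
The cleaning described above could look like the following sketch, shown on toy data with the same column names. The `raw`, `cleaned`, and `removed` names are illustrative, and treating exact duplicates as removable is an assumption: two learners can legitimately submit the same short review with the same rating.

```python
import pandas as pd

# Toy data with the dataset's column names; the values are illustrative only.
raw = pd.DataFrame({
    "courseid": ["c1", "c1", "c1", "c2", "c2"],
    "review": ["Bravo !", "Bravo !", None, "BOring", "Very good"],
    "label": [5, 5, 4, 1, 5],
})

cleaned = (
    raw.dropna(subset=["review"])   # drop rows with a missing review text
       .drop_duplicates()           # drop rows identical in all three columns
       .reset_index(drop=True)
)

removed = len(raw) - len(cleaned)
print(f"Removed {removed} of {len(raw)} rows")
```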



In [23]:
out["head"]

Unnamed: 0,courseid,review,label
0,2-speed-it,BOring,1
1,2-speed-it,Bravo !,5
2,2-speed-it,Very goo,5
3,2-speed-it,"Great course - I recommend it for all, especia...",5
4,2-speed-it,One of the most useful course on IT Management!,5


### 3) Preview Interpretation (Sample Records)

Each record contains:

- **courseid:** Identifier for a specific Coursera course.  
- **review:** Learner textual feedback about the course.  
- **label:** Numeric rating representing satisfaction (typically 1–5).

This structure allows aggregation of reviews at the course level, enabling evaluation of course quality and learner sentiment.
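
Course-level aggregation can be sketched with a pandas `groupby` on `courseid`. The example uses toy data with the dataset's column names; the output column names and the choice of a rating of 4 or above as "positive" are assumptions made for illustration.

```python
import pandas as pd

# Toy data with the dataset's column names; the values are illustrative only.
reviews = pd.DataFrame({
    "courseid": ["c1", "c1", "c1", "c2", "c2"],
    "review": ["BOring", "Bravo !", "Very good", "Useful", "Average"],
    "label": [1, 5, 5, 4, 3],
})

course_stats = (
    reviews.groupby("courseid")
    .agg(
        n_reviews=("label", "count"),                          # number of ratings per course
        mean_label=("label", "mean"),                          # average star rating
        share_positive=("label", lambda s: (s >= 4).mean()),   # fraction rated 4 or 5
    )
    .reset_index()
)
print(course_stats)
```

Keeping `n_reviews` next to `mean_label` matters for ranking, since an average based on a handful of reviews is less reliable than one based on hundreds.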



In [24]:
out["columns"].head(25)

Unnamed: 0,column,dtype,missing,missing_pct,nunique
1,review,object,3,0.0,123240
0,courseid,object,0,0.0,1835
2,label,int64,0,0.0,5


### 4) Column Report Interpretation (Data Types and Cardinality)

Column analysis indicates:

- **courseid:** 1,835 unique courses represented.  
- **review:** Over 123,000 unique textual reviews, indicating diverse learner feedback.  
- **label:** 5 rating levels corresponding to satisfaction scores.

This dataset supports both sentiment analysis and course-level ranking.


## Dataset 8: CBC Pathways Dataset

**Purpose:**  
This dataset provides structured information about the Competency-Based Curriculum (CBC) senior school pathways, tracks, and subject combinations. It enables mapping of Kenyan education pathways to potential careers, skills development trajectories, and learning recommendations.

Within the Career and Skills Recommendation System, this dataset supports:

- Mapping student subject combinations to CBC pathways and career directions.  
- Linking education tracks to relevant industries and occupational clusters.  
- Aligning secondary school education choices with workforce skill demands.  
- Providing localized career guidance tailored to the Kenyan CBC framework.

### 1) Load and Inspect the Dataset

In this subsection, the dataset is loaded and inspected to understand its structure, completeness, and diversity.



In [25]:
spec = DATASETS[7]  # change index for each dataset
out = inspect_dataset(DATA_DIR, spec)
out["summary"]

Unnamed: 0,dataset,rows,cols,duplicates,missing_cells,memory_mb
0,Cbc Pathways Dataset,535,5,0,0,0.205


### 2) Summary Interpretation (Shape, Missingness, Duplicates)

From the dataset summary:

- **Rows:** 535, representing unique CBC program combinations.  
- **Columns:** 5, including pathways, tracks, subject combinations, and identifiers.  
- **Missing cells:** None detected, indicating complete records.  
- **Duplicates:** None detected.  
- **Memory footprint:** ~0.2 MB, lightweight and computationally efficient.

Overall, the dataset is clean, structured, and ready for integration into the modeling pipeline.

In [26]:
out["head"]

Unnamed: 0,pathway,track,subjects,program_code,pathway_id
0,ARTS & SPORTS SCIENCE,ARTS,"Arabic,Fine Arts,Theatre & Film",AS1001,25211070-b138-4b82-813d-bbb7383a5c1d
1,ARTS & SPORTS SCIENCE,ARTS,"Biology,Fine Arts,Theatre & Film",AS1002,1cf6be29-e82d-4e59-ac5f-10d4c3da7a22
2,ARTS & SPORTS SCIENCE,ARTS,"Business Studies,Fine Arts,Theatre & Film",AS1003,84bc8e17-5fb0-4242-b23f-83141081fec4
3,ARTS & SPORTS SCIENCE,ARTS,"Computer Studies,Fine Arts,Theatre & Film",AS1004,d5d0a13e-fc32-4c57-8e59-65fc1b49fb97
4,ARTS & SPORTS SCIENCE,ARTS,"Christian Religious Education,Fine Arts,Theatr...",AS1005,21d1a140-fd7b-416b-822e-6846af18c3f3


### 3) Preview Interpretation

Each record contains:

- **pathway:** Broad CBC learning pathway (e.g., Arts & Sports Science).  
- **track:** Specific specialization track within the pathway.  
- **subjects:** Combination of subjects offered under that track.  
- **program_code:** Unique code identifying each program.  
- **pathway_id:** Unique identifier for each pathway record.

This structure allows subject selections to be mapped to structured education pathways, which can later be linked to career and industry datasets.
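
A minimal sketch of this mapping: parse each `subjects` string into a set and look up the program whose set equals a student's selection. The two rows are copied from the preview above; the `subject_set` and `match_programs` helpers and the order-insensitive, case-insensitive matching rule are assumptions, and the split assumes no subject name itself contains a comma.

```python
import pandas as pd

# Two rows copied from the preview; the full dataset has 535 such rows.
pathways = pd.DataFrame({
    "pathway": ["ARTS & SPORTS SCIENCE", "ARTS & SPORTS SCIENCE"],
    "track": ["ARTS", "ARTS"],
    "subjects": ["Arabic,Fine Arts,Theatre & Film", "Biology,Fine Arts,Theatre & Film"],
    "program_code": ["AS1001", "AS1002"],
})


def subject_set(cell: str) -> frozenset:
    """Turn a comma-separated subjects string into a normalized set."""
    return frozenset(part.strip().lower() for part in cell.split(",") if part.strip())


pathways["subject_set"] = [subject_set(cell) for cell in pathways["subjects"]]


def match_programs(chosen: list, table: pd.DataFrame) -> pd.DataFrame:
    """Return programs whose subject combination equals the chosen subjects."""
    wanted = frozenset(name.strip().lower() for name in chosen)
    mask = [subjects == wanted for subjects in table["subject_set"]]
    return table.loc[mask, ["pathway", "track", "program_code"]]


hits = match_programs(["Theatre & Film", "biology", "Fine Arts"], pathways)
print(hits)
```

Using sets makes the lookup independent of the order in which a student lists subjects. A subset test (`wanted <= subjects`) could be swapped in to suggest programs for a partial selection.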

In [27]:
out["columns"].head(25)

Unnamed: 0,column,dtype,missing,missing_pct,nunique
2,subjects,object,0,0.0,535
3,program_code,object,0,0.0,535
4,pathway_id,object,0,0.0,535
1,track,object,0,0.0,7
0,pathway,object,0,0.0,3


### 4) Column Report Interpretation (Data Types and Cardinality)

Key observations:

- **subjects:** 535 unique combinations, indicating diverse subject pathways.  
- **program_code** and **pathway_id:** Unique identifiers useful for indexing and merging.  
- **track:** 7 unique specialization tracks.  
- **pathway:** 3 broad CBC pathways.

These characteristics make the dataset suitable for:

- Education pathway modeling.  
- CBC-to-career mapping.  
- Integration with job market datasets for localized career recommendations.