In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
khushikyad001_ai_impact_on_jobs_2030_path = kagglehub.dataset_download('khushikyad001/ai-impact-on-jobs-2030')

print('Data source import complete.')


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/ai-impact-on-jobs-2030/AI_Impact_on_Jobs_2030.csv')
df.sample(6)

# Data Quality Check

In [None]:
print(df.shape)
df.info()

In [None]:
df.describe()

Next:
1) Missing values check.
2) Visualization of missing values.
3) Duplicates check.
4) Initial visualization of key numerical distributions.

Graph notes:
**Average Salary** looks roughly symmetric with a hint of bimodality, which would point to two main salary clusters.
**Years of Experience** is spread across a wide range, suggesting the jobs cover both entry-level and very senior positions.

In [None]:
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)

if missing_values.empty:
    print("\nNo missing values found in the dataset.")
else:
    print("\nMissing Values per Column:")
    print(missing_values.to_markdown(numalign="left", stralign="left"))


    plt.figure(figsize=(10, 6))
    sns.barplot(x=missing_values.index, y=missing_values.values)
    plt.title('Missing Values Count per Column')
    plt.ylabel('Count')
    plt.xlabel('Columns')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()


duplicates_count = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates_count}")


numerical_cols = ['Average_Salary', 'Years_Experience']
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for i, col in enumerate(numerical_cols):
    sns.histplot(df[col], kde=True, ax=axes[i], bins=20, color='skyblue')
    axes[i].set_title(f'Distribution of {col}', fontsize=12)
    axes[i].set_xlabel(col, fontsize=10)
    axes[i].set_ylabel('Frequency', fontsize=10)

plt.tight_layout()
plt.show()

**No data cleaning is required:** the checks above found no missing values and no duplicate rows.

# Exploratory Data Analysis (EDA)

**Q1) How does the Average_Salary vary across different Education_Levels? (Box Plot)**

**ans-** There is a clear positive trend: higher education levels generally correlate with higher average salaries.
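To put numbers behind the box plot, the medians per education level can be tabulated. The sketch below is one way to do that; the helper name `salary_by_education` and the toy rows are illustrative assumptions, not part of the original notebook or dataset.

```python
import pandas as pd

EDUCATION_ORDER = ["High School", "Bachelor's", "Master's", "PhD"]

def salary_by_education(frame):
    """Median, mean and row count of Average_Salary per education level."""
    summary = frame.groupby("Education_Level")["Average_Salary"].agg(
        ["median", "mean", "count"]
    )
    # Reorder rows from lowest to highest education level.
    return summary.reindex(EDUCATION_ORDER)

# Toy rows for illustration only; these are not values from the dataset.
toy = pd.DataFrame({
    "Education_Level": ["High School", "High School", "Bachelor's", "Bachelor's",
                        "Master's", "Master's", "PhD", "PhD"],
    "Average_Salary": [30000, 34000, 45000, 47000, 52000, 60000, 64000, 70000],
})

summary = salary_by_education(toy)
medians_increase = bool(summary["median"].is_monotonic_increasing)
print(summary)
print("medians increase with level:", medians_increase)
```

In the notebook, `salary_by_education(df)` would produce the same table for the real data.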

**Q2) What is the distribution of jobs across the different Risk_Category groups? (Count Plot)**

**ans-** The Medium risk category contains the largest number of jobs ($\mathbf{1{,}521}$, about 50.7% of the 3,000 rows), so just over half of the jobs fall into a moderate level of AI risk. The High ($\mathbf{740}$) and Low ($\mathbf{739}$) categories are almost identical in size, each roughly half as large as Medium.
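The class shares follow directly from the three counts quoted above. A quick arithmetic check (the dictionary simply restates those counts):

```python
# Counts quoted in the answer above.
risk_counts = {"Low": 739, "Medium": 1521, "High": 740}

total = sum(risk_counts.values())
shares = {label: count / total for label, count in risk_counts.items()}

for label, share in shares.items():
    print(f"{label:<6} {risk_counts[label]:>5}  {share:.1%}")
```

On the real data, `df['Risk_Category'].value_counts(normalize=True)` gives the same shares without hard-coding the counts.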

**Q3) Is there a correlation between Years_Experience and Average_Salary? (Scatter Plot)**

**ans-** The scatter plot shows a weak positive correlation between Years_Experience and Average_Salary.
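"Weak" is easier to judge with a coefficient next to the plot. A minimal sketch, assuming both columns are numeric; the helper name and the toy rows are illustrative and not taken from the notebook or the dataset:

```python
import pandas as pd

def experience_salary_correlation(frame, x="Years_Experience", y="Average_Salary"):
    """Return Pearson and Spearman coefficients for two numeric columns."""
    pearson = frame[x].corr(frame[y])  # linear association
    # Spearman is the Pearson correlation of the ranks (monotonic association).
    spearman = frame[x].rank().corr(frame[y].rank())
    return {"pearson": float(pearson), "spearman": float(spearman)}

# Toy rows for illustration only; these are not values from the dataset.
toy = pd.DataFrame({
    "Years_Experience": [1, 3, 5, 7, 9, 11],
    "Average_Salary": [40000, 52000, 47000, 61000, 58000, 70000],
})

result = experience_salary_correlation(toy)
print(result)
```

Calling `experience_salary_correlation(df)` in the notebook would show how far the real coefficient is from zero.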

**Q4) How does the Automation_Probability_2030 relate to the Risk_Category? (Box Plot)**

**ans-** The Risk_Category aligns strongly with Automation_Probability_2030: Low risk jobs have a median automation probability near $\mathbf{0.2}$, Medium risk jobs near $\mathbf{0.5}$, and High risk jobs near $\mathbf{0.8}$. This suggests the risk categorization is based heavily on this automation metric.
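One way to test whether the categories are simple cuts of the automation probability is to compare the minimum and maximum per category: if the ranges never overlap, a pair of thresholds reproduces the labels exactly. The helper names and toy rows below are assumptions made for illustration.

```python
import pandas as pd

def probability_ranges(frame, category="Risk_Category",
                       value="Automation_Probability_2030"):
    """Min and max of `value` within each category, sorted by the minimum."""
    return frame.groupby(category)[value].agg(["min", "max"]).sort_values("min")

def ranges_overlap(ranges):
    """True if any category's range reaches into the next one up."""
    upper = ranges["max"].to_numpy()[:-1]
    next_lower = ranges["min"].to_numpy()[1:]
    return bool((upper >= next_lower).any())

# Toy rows for illustration only; these are not values from the dataset.
toy = pd.DataFrame({
    "Risk_Category": ["Low", "Low", "Medium", "Medium", "High", "High"],
    "Automation_Probability_2030": [0.10, 0.28, 0.35, 0.66, 0.72, 0.95],
})

ranges = probability_ranges(toy)
overlap = ranges_overlap(ranges)
print(ranges)
print("overlap:", overlap)
```

Running `ranges_overlap(probability_ranges(df))` on the real data and getting `False` would mean the target is a deterministic function of a single feature.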

**Q5) Which jobs have the highest and lowest average AI_Exposure_Index? (Top/Bottom 10 Bar Chart)**

**ans-** Highest AI Exposure (Red Bars): Jobs like Graphic Designer, Construction Worker, and Delivery Driver top the list. This suggests these jobs frequently interact with or rely on tasks that are heavily impacted by current AI technology.
Lowest AI Exposure (Green Bars): Jobs like Research Scientist, Data Analyst, and Teacher are at the bottom. This might indicate that the core tasks of these roles require complex, nuanced human interaction or are not yet easily quantifiable by the AI Exposure Index.



In [None]:
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
risk_order = ["Low", "Medium", "High"]
plt.figure(figsize=(10, 6))
sns.boxplot(x='Education_Level', y='Average_Salary', data=df, order=education_order, palette='viridis')
plt.title('Q1) Average Salary Distribution by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Average Salary')
plt.tight_layout()
plt.show()


plt.figure(figsize=(8, 6))
sns.countplot(x='Risk_Category', data=df, order=risk_order, palette='magma')
plt.title('Q2) Job Count by Risk Category')
plt.xlabel('Risk Category')
plt.ylabel('Number of Jobs')
plt.tight_layout()
plt.show()


plt.figure(figsize=(8, 6))
sns.scatterplot(x='Years_Experience', y='Average_Salary', data=df, alpha=0.6)
sns.regplot(x='Years_Experience', y='Average_Salary', data=df, scatter=False, color='red')
plt.title('Q3) Correlation between Years of Experience and Average Salary')
plt.xlabel('Years of Experience')
plt.ylabel('Average Salary')
plt.tight_layout()
plt.show()


plt.figure(figsize=(8, 6))
sns.boxplot(x='Risk_Category', y='Automation_Probability_2030', data=df, order=risk_order, palette='cividis')
plt.title('Q4) Automation Probability by Risk Category')
plt.xlabel('Risk Category')
plt.ylabel('Automation Probability (2030)')
plt.tight_layout()
plt.show()


avg_ai_exposure = df.groupby('Job_Title')['AI_Exposure_Index'].mean().sort_values(ascending=False)
top_10_ai = avg_ai_exposure.head(10)
bottom_10_ai = avg_ai_exposure.tail(10)
combined_ai_exposure = pd.concat([top_10_ai, bottom_10_ai])

plt.figure(figsize=(12, 8))
colors = ['red'] * 10 + ['green'] * 10
sns.barplot(x=combined_ai_exposure.values, y=combined_ai_exposure.index, palette=colors)
plt.title('Q5) Top 10 Highest and Bottom 10 Lowest Average AI Exposure Index by Job Title')
plt.xlabel('Average AI Exposure Index')
plt.ylabel('Job Title')
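# The bars carry no labels, so build the legend from proxy patches.
from matplotlib.patches import Patch
plt.legend(handles=[Patch(color='red', label='Highest 10'), Patch(color='green', label='Lowest 10')])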
plt.tight_layout()
plt.show()


# Feature Engineering and Model Training
Model: Support Vector Machine classifier (SVC)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

le = LabelEncoder()
y_encoded = le.fit_transform(df['Risk_Category'])
X = df.drop('Risk_Category', axis=1)

categorical_features = ['Education_Level', 'Job_Title']
numerical_features = X.columns.drop(categorical_features).tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ('scaler', StandardScaler(), numerical_features)
    ],
    remainder='passthrough'
)

# Stratified train/test split, preprocessing fitted on the training set only, then the model
X_train, X_test, y_train_encoded, y_test_encoded = train_test_split(X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded)

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

svc_model = SVC(kernel='rbf', random_state=42, class_weight='balanced')
svc_model.fit(X_train_processed, y_train_encoded)

y_pred_encoded = svc_model.predict(X_test_processed)


final_accuracy = accuracy_score(y_test_encoded, y_pred_encoded)

print(f"\n Test Set Accuracy: {final_accuracy * 100:.2f}%")

# Summary
AI Job Risk Prediction
The goal was to perform Multi-Class Classification to predict the Risk Category (Low, Medium, or High) of a job based on its attributes.

1. Data Quality and Exploratory Analysis (EDA)

**Data Quality:** The dataset ($\mathbf{3{,}000}$ rows) was clean: there were no missing values to impute and no duplicate rows to remove.
**Key EDA Insight:** Exploratory analysis showed a very strong relationship between the target variable, Risk_Category, and Automation_Probability_2030 (Q4). The link between Risk_Category and other features such as AI_Exposure_Index was not examined directly.

2. Feature Engineering and Preprocessing
SVC needs numeric inputs and is sensitive to feature scale, so the following preprocessing steps were applied:

Target Label Encoding: The string labels in the target variable, $\mathbf{Y}$ (Risk_Category), were converted to integers ($0, 1, 2$) using LabelEncoder. scikit-learn's $\text{SVC}$ also accepts string class labels directly, so this step is a convenience rather than a requirement.

Categorical Encoding: Features like Job_Title and Education_Level were converted into a machine-readable format using OneHotEncoder.

Numerical Scaling (Crucial for SVM): All numerical features (including Average_Salary, Years_Experience, and all Skill_ columns) were standardized to zero mean and unit variance using StandardScaler. This keeps the distance computations inside the RBF kernel from being dominated by features with larger scales.
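The effect of scaling on distances can be seen on three toy rows with one salary-like and one experience-like column. This is a sketch with made-up values; `standardize` reimplements the default z-score transform by hand so the example needs only NumPy.

```python
import numpy as np

# Columns: salary, years of experience. Illustrative values, not from the dataset.
X = np.array([
    [50000.0, 2.0],
    [50500.0, 30.0],
    [90000.0, 3.0],
])

def standardize(values):
    """Column-wise z-score: subtract the mean, divide by the population std."""
    return (values - values.mean(axis=0)) / values.std(axis=0)

def distance(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

Z = standardize(X)

raw_01, raw_02 = distance(X[0], X[1]), distance(X[0], X[2])
scaled_01, scaled_02 = distance(Z[0], Z[1]), distance(Z[0], Z[2])

print(f"raw:    d(0,1) = {raw_01:.1f}, d(0,2) = {raw_02:.1f}")
print(f"scaled: d(0,1) = {scaled_01:.3f}, d(0,2) = {scaled_02:.3f}")
```

On the raw values, row 0 looks far closer to row 1 than to row 2 even though rows 0 and 1 differ by 28 years of experience, because the salary column dominates. After standardization the two distances are of similar size.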

3. Model Training and Evaluation

Support Vector Classifier (SVC): The $\text{SVC}$ model (RBF kernel, balanced class weights) was trained on the processed data using a stratified split ($80\%$ train, $20\%$ test).

The near-perfect test accuracy ($\mathbf{99.33\%}$) strongly suggests that Risk_Category is defined by clear, quantifiable rules (likely thresholds on Automation_Probability_2030, possibly combined with other features such as AI_Exposure_Index), which the $\text{SVC}$ model was able to recover almost exactly.
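The threshold hypothesis can be probed with a deliberately simple baseline: a shallow decision tree that sees only one feature. If it matches the SVC, the label is effectively a cut on that feature. The sketch below runs on synthetic data; the thresholds 0.3 and 0.7 and the sample size are assumptions chosen for illustration, not values known from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: labels are cut from a single feature at hypothetical
# thresholds (0.3 and 0.7). The real dataset's rule is unknown.
probability = rng.uniform(0, 1, size=600)
labels = np.where(probability < 0.3, "Low",
                  np.where(probability < 0.7, "Medium", "High"))

X_single = probability.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X_single, labels, test_size=0.2, random_state=42, stratify=labels
)

# A depth-2 tree can place two nested cuts on the feature, enough for three bands.
baseline = DecisionTreeClassifier(max_depth=2, random_state=42)
baseline.fit(X_train, y_train)

baseline_accuracy = baseline.score(X_test, y_test)
print(f"Single-feature baseline accuracy: {baseline_accuracy:.3f}")
```

In the notebook, the equivalent check would fit the same tree on `X_train[['Automation_Probability_2030']]` and compare its test accuracy with the SVC's.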

4. Model Choice and Real-World Accuracy
Why Was the Support Vector Machine (SVM) Chosen?

The Support Vector Machine (implemented as SVC) was a reasonable choice for this multi-class classification problem for two reasons:

**High-Dimensional Space:** One-hot encoding the categorical features (Job_Title, Education_Level) adds one column per category, which increases the dimensionality of the data. SVMs look for the separating boundary that maximizes the margin between classes and tend to remain effective as the number of features grows.

**Strong Generalization:** Margin maximization acts as a form of regularization, so an SVM trained on standardized features is often less prone to overfitting individual noisy points. The model here used default hyperparameters, so this was not verified by tuning.

The observed accuracy of $\mathbf{99.33\%}$ is likely **overstated** compared to what would be achieved in the real world.
The high accuracy indicates that the features are extremely strong predictors of the label as defined in this dataset, but expect some degradation when applying the model to messy real-world data, where risk is unlikely to be a clean function of these features.
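A single 80/20 split gives one accuracy figure with no sense of its spread. Stratified k-fold cross-validation with the preprocessing wrapped in a Pipeline is one way to get a steadier estimate, since the encoder and scaler are refit inside every fold. The sketch below uses a synthetic stand-in whose column names mirror the notebook; the values, the sample size, and the thresholds used to build the labels are all made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "Education_Level": rng.choice(
        ["High School", "Bachelor's", "Master's", "PhD"], size=n
    ),
    "Average_Salary": rng.normal(60000, 15000, size=n),
    "Automation_Probability_2030": rng.uniform(0, 1, size=n),
})
# Hypothetical labelling rule: cut the probability at 0.3 and 0.7.
target = pd.cut(
    toy["Automation_Probability_2030"],
    bins=[0, 0.3, 0.7, 1],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
).astype(str)

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Education_Level"]),
    ("scaler", StandardScaler(), ["Average_Salary", "Automation_Probability_2030"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("svc", SVC(kernel="rbf", class_weight="balanced")),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, toy, target, cv=cv, scoring="accuracy")

print(f"accuracy per fold: {np.round(scores, 3)}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Cross-validation only measures stability within this dataset. It does not address the larger concern above, that the label may be derived from the features themselves.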