<p align="center">
  <img src="banner.png" alt="Personality Traits Career Prediction Banner" style="width:100%; max-width:1200px; border-radius:12px;">
</p>
<br>
<div style="
  max-width:1200px;
  margin:auto;
  padding:28px;
  background: linear-gradient(135deg, #1e1e1e, #2b2b2b);
  color:white;
  text-align:center;
  border-radius:14px;
  box-shadow: 0 8px 20px rgba(0,0,0,0.25);
">
  <h1 style="margin-bottom:10px;">
    Personality Traits Based Career Prediction
  </h1>
  <p style="font-size:1.1em; opacity:0.9; margin-top:0;">
    Using behavioral insights to recommend suitable career paths
  </p>
  <a href="https://www.kaggle.com/datasets/utkarshshrivastav07/career-prediction-dataset">Career Prediction Dataset (Kaggle)</a>
</div>


# 🎯 Project Goal and Objective

The goal of this project is to develop a system that can recommend the most suitable career paths for an individual based on their personality traits and aptitude scores.


In [41]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np

## <font color="cyan">Step 1: Dataset Familiarization and Description</font>

The first step of this project is to familiarize ourselves with the dataset and understand its basic characteristics.
This includes examining the structure of the data, the type of variables present, and the overall composition of the dataset.

The dataset contains numerical values representing:
- Personality traits based on the Big Five model (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism)
- Cognitive and aptitude test scores
- A target variable representing the most suitable career


In [None]:
df = pd.read_csv("career_dataset.csv")
df.head()

In [None]:
# df.info() lists each column's non-null count and data type, both of which matter for preprocessing; df.shape gives (rows, columns)
df.info()
df.shape

In [None]:
df.describe()

In [None]:
df["Career"].value_counts()

### <font color="cyan">Data Quality Assessment and Key Observations</font>

Based on the initial inspection, the dataset is clean and well-structured.

Key observations:
- The dataset contains 155 records and 11 columns (10 input features plus the "Career" target).
- All input features are numerical and stored as float values.
- The target is "Career", which is already categorical.
- No missing values.
- Feature values are balanced.

<font color="red">**BUT** !</font>
- Most careers appear only once, and a few appear twice.
- This extreme class sparsity makes <font color="red">**traditional supervised classification infeasible**</font>.
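
The sparsity described above can be quantified by counting how many careers have one sample, two samples, and so on. The following is a minimal sketch on hypothetical toy labels (the career names are placeholders, not values from the dataset); on the real data, `careers` would be `df["Career"]`.

```python
import pandas as pd

# Hypothetical toy labels standing in for df["Career"]
careers = pd.Series(
    ["Engineer", "Nurse", "Teacher", "Engineer", "Chef", "Pilot"],
    name="Career",
)

# Number of rows available for each career
samples_per_career = careers.value_counts()

# Number of careers that have exactly 1 row, exactly 2 rows, ...
sparsity = samples_per_career.value_counts().sort_index()
print(sparsity)

# Share of careers represented by a single row
singleton_share = (samples_per_career == 1).mean()
print(f"Careers with a single sample: {singleton_share:.0%}")
```

When nearly every class is a singleton, there is nothing to hold out for validation, which is why a classifier cannot be trained or evaluated in the usual way.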

As a result, the approach chosen is <font color="lightGreen">**similarity-based recommendation**</font>:
- For a new individual, the system identifies the career(s) whose traits are **closest** to their own.
- This allows us to recommend the **top careers** that best match an individual's personality and aptitudes, even with very few samples per career.


## <font color="cyan">Step 2: Exploratory Data Analysis (EDA)</font>

Exploratory Data Analysis helps us understand the underlying patterns and relationships in the dataset. 

In this step, we will:
- Examine the distribution of personality traits and aptitude scores.
- Identify correlations between features.
- Observe patterns that may indicate which traits are most relevant for different careers.

EDA provides insights that are useful for building a similarity-based career recommendation system.
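
Besides a heatmap, correlations can be ranked as a list of feature pairs, which is easier to scan when there are many columns. The sketch below uses synthetic toy data, and the column names (including `Numerical_Aptitude`) are hypothetical placeholders rather than the dataset's actual columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features; names and values are placeholders
rng = np.random.default_rng(0)
base = rng.uniform(1, 10, size=50)
toy = pd.DataFrame({
    "Openness": base,
    # Built to track Openness closely, so this pair should rank first
    "Numerical_Aptitude": base + rng.normal(0, 0.5, size=50),
    "Extraversion": rng.uniform(1, 10, size=50),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once
# and the self-correlations on the diagonal are dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank the pairs by absolute correlation, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Strongly correlated pairs carry overlapping information, so they effectively count twice in a Euclidean distance. That is worth knowing before building a distance-based recommender.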


In [None]:
features = df.drop("Career", axis=1)
target = df["Career"]

In [None]:
features.hist(bins=10, layout=(4,3), color='skyblue', edgecolor='black')
plt.suptitle("Distribution of Features", fontsize=16)
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

In [None]:
plt.figure(figsize=(10,8))
corr = features.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Between Features")
plt.show()

## <font color="cyan">Step 3: Reduce Dataset to One Sample per Career</font>

As discovered in our initial analysis, most careers in the dataset appear only once, and a few appear twice.  

To simplify the similarity-based recommendation approach, we will keep **only one sample per career**. This ensures:
- Each career is represented by a **unique trait profile**.
- The model will find the **closest career** to a new individual's traits rather than trying to generalize across multiple samples per career.

This forms the basis for a **nearest-neighbor recommendation system**.
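
To make the reduction concrete, the sketch below applies it to hypothetical toy data (placeholder careers and trait columns). It also shows one possible alternative, averaging duplicate rows into a single profile, purely for comparison; that alternative is an assumption of this example and not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical toy data; careers and trait columns are placeholders
toy = pd.DataFrame({
    "Career": ["Engineer", "Nurse", "Engineer"],
    "Openness": [8.0, 5.0, 6.0],
    "Extraversion": [3.0, 7.0, 5.0],
})

# Option A (the approach described above): keep the first row per career
first_only = toy.drop_duplicates(subset=["Career"], keep="first")

# Option B (alternative, for comparison only): average duplicate rows
averaged = toy.groupby("Career", as_index=False).mean()

print(first_only)
print(averaged)
```

Keeping the first row discards the second sample entirely, while averaging blends both into a profile that no real individual had. With at most two rows per career, either choice is a judgment call.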


In [None]:
# Keep only the first occurrence of each career
df_unique = df.drop_duplicates(subset=["Career"], keep="first")
print("Original dataset shape:", df.shape)
print("Reduced dataset shape (1 sample per career):", df_unique.shape)

df_unique.to_csv("careerFixed.csv", index=False)
print("Reduced dataset saved as 'careerFixed.csv'")

## <font color="cyan">Step 4: Model Training and Testing</font>

With one sample per career, we cannot train a traditional supervised classifier.  

Instead, we will use a **similarity-based approach**:

1. **Keep all numerical features on a comparable scale** so that each contributes equally to distance calculations; standardize them if their ranges differ.
2. Compute the **Euclidean distance** between a new individual's traits and every career profile, as in a **K-Nearest Neighbors (KNN)** lookup.
3. Recommend the **top career(s)** based on the closest match.

This approach effectively treats each career profile as a point in multi-dimensional space and finds the nearest neighbor(s) to the input profile.
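
The steps above can be sketched with scikit-learn's `StandardScaler` and `NearestNeighbors`. This is only an illustration on hypothetical toy profiles (the careers, trait values, and the `new_person` input are made up), assuming explicit standardization is wanted; it is not the exact procedure of the cell that follows.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical career profiles, one row per career
profiles = pd.DataFrame({
    "Career": ["Engineer", "Nurse", "Artist", "Pilot"],
    "Openness": [6.0, 5.0, 9.0, 4.0],
    "Conscientiousness": [9.0, 8.0, 4.0, 9.0],
    "Extraversion": [3.0, 7.0, 6.0, 5.0],
})
X = profiles.drop(columns="Career")
y = profiles["Career"]

# Step 1: standardize so every feature has zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: index the career profiles for Euclidean nearest-neighbor search
knn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(X_scaled)

# Step 3: scale the new individual with the same scaler, then query
new_person = pd.DataFrame([[8.5, 4.5, 6.0]], columns=X.columns)
distances, indices = knn.kneighbors(scaler.transform(new_person))

top_careers = y.iloc[indices[0]].tolist()
print("Top matches:", top_careers)
print("Distances:", distances[0])
```

The new individual must be transformed with the scaler fitted on the career profiles, not with a freshly fitted one. Otherwise the query and the profiles end up in different coordinate systems and the distances are meaningless.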

In [None]:
# No scaling is applied here because all features share the same score range (1-10)
df = pd.read_csv("careerFixed.csv")
X = df.drop("Career", axis=1)
y = df["Career"]

test_sample = np.array([[5.45,8.67,3.45,5.34,4.23,9.23,4.56,6.78,7.89,6.12]])
distances = euclidean_distances(test_sample, X)
print("Distances from test sample to all careers:\n", distances)
# Find the closest career
closest_idx = np.argmin(distances)
predicted_career = y.iloc[closest_idx]

print("Closest career match:", predicted_career)

# Optional: Find top 3 closest careers
top3_idx = np.argsort(distances[0])[:3]
top3_careers = y.iloc[top3_idx]
print("Top 3 recommended careers:", top3_careers.values)


Distances from test sample to all careers:
 [[20.36771465 20.6327652  21.61276937 22.57283544 20.59024526 21.91084435
  20.77949229 21.7721106  21.28561721 19.46356083 20.55868186 20.86343931
  19.19083375 19.12927861 18.500927   19.7377557  22.49788879 18.45557639
  20.81721883 19.67321529 19.37209333 21.28207227 20.54603125 18.13672793
  19.40466696 19.27926347 20.93048017 19.01409477 19.40032474 19.52959549
  18.64863266 18.68907167 20.2145319  18.60850075 18.97079334 18.55400226
  20.76613349 19.90553943 19.04240006 19.95475883 19.74152476 19.01734471
  20.82490096 19.92486386 20.9646512  20.90027273 19.28557492 21.15232139
  19.89531352 20.82490096 20.91063366 21.15232139 19.54793595 18.90076983
  21.13547255 20.66992743 19.64882694 19.92486386 21.26162976 19.60897499
  18.77384883 21.96945379 19.71771538 19.53631234 20.91571419 19.06541896
  18.41548805 19.96529739 19.80567343 20.69877049 18.8363213  19.38297707
  19.92486386 19.86589288 19.80567343 22.01589426 20.64775048 20.869