# 01 - Exploratory Data Analysis

**BuzzwordLearner: Predicting Career Domain and Seniority from LinkedIn Profiles**

This notebook explores the LinkedIn CV dataset to understand:
1. Dataset structure and size
2. Label distributions (department/domain and seniority)
3. Text characteristics (length, language, patterns)
4. Career history patterns
5. Data quality issues

## Setup

In [None]:
import sys
sys.path.insert(0, '..')

import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from pathlib import Path

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

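# Data and figure directories
DATA_DIR = Path('../data')
FIG_DIR = Path('../reports/figures')
# Create the figure directory up front so later plt.savefig calls do not fail
FIG_DIR.mkdir(parents=True, exist_ok=True)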

print("Setup complete!")

## 1. Load the Data

In [None]:
# Load annotated LinkedIn CVs
with open(DATA_DIR / 'linkedin-cvs-annotated.json', 'r', encoding='utf-8') as f:
    cvs_annotated = json.load(f)

# Load non-annotated LinkedIn CVs
with open(DATA_DIR / 'linkedin-cvs-not-annotated.json', 'r', encoding='utf-8') as f:
    cvs_not_annotated = json.load(f)

# Load label dictionaries
department_labels = pd.read_csv(DATA_DIR / 'department-v2.csv')
seniority_labels = pd.read_csv(DATA_DIR / 'seniority-v2.csv')

print(f"Annotated CVs: {len(cvs_annotated)}")
print(f"Non-annotated CVs: {len(cvs_not_annotated)}")
print(f"Department label examples: {len(department_labels)}")
print(f"Seniority label examples: {len(seniority_labels)}")

## 2. Understanding the Data Structure

Each CV is a list of positions. Let's examine the structure:

In [None]:
# Look at the first CV
print("First CV (person with multiple positions):")
print(json.dumps(cvs_annotated[0], indent=2))

In [None]:
# Look at another example
print("\nSecond CV (person with career history):")
print(json.dumps(cvs_annotated[1], indent=2))

In [None]:
# Extract all fields from positions
sample_position = cvs_annotated[0][0]
print("Position fields:")
for key, value in sample_position.items():
    print(f"  - {key}: {type(value).__name__} (example: {repr(value)[:50]}...)")

### Key Observations:
- Each CV is a **list of positions** (current and past jobs)
- Position fields: `organization`, `linkedin`, `position`, `startDate`, `endDate`, `status`, `department`, `seniority`
- **status**: `ACTIVE` (current job), `INACTIVE` (past job), `UNKNOWN`
- **Target variables**: `department` and `seniority` (only in annotated data)
- **Input for prediction**: The `position` field (job title)
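
As a rough illustration of the structure described above, the sketch below builds one hypothetical CV and extracts the titles of its `ACTIVE` positions. The field names follow the list above; every value, the date format, and the helper `active_titles` are assumptions made up for illustration and are not taken from the dataset.

```python
# Hypothetical CV: a list of position dicts using the field names listed above.
# All values (including the date format) are invented for illustration.
example_cv = [
    {
        "organization": "Example GmbH",
        "linkedin": "https://www.linkedin.com/company/example",
        "position": "Senior Sales Manager",
        "startDate": "2021-03",
        "endDate": None,
        "status": "ACTIVE",
        "department": "Sales",
        "seniority": "Senior",
    },
    {
        "organization": "Sample AG",
        "linkedin": "https://www.linkedin.com/company/sample",
        "position": "Sales Representative",
        "startDate": "2017-01",
        "endDate": "2021-02",
        "status": "INACTIVE",
        "department": "Sales",
        "seniority": "Professional",
    },
]


def active_titles(cv):
    """Return the job titles of all positions marked ACTIVE in one CV."""
    return [pos["position"] for pos in cv if pos.get("status") == "ACTIVE"]


print(active_titles(example_cv))
```

The job title is the only model input here; `department` and `seniority` would be the targets for the active position.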

## 3. Flatten Data for Analysis

In [None]:
def flatten_cvs(cvs, cv_id_start=0):
    """Flatten list of CVs into a DataFrame of positions."""
    records = []
    for cv_id, cv in enumerate(cvs, start=cv_id_start):
        for pos_idx, position in enumerate(cv):
            record = {
                'cv_id': cv_id,
                'position_idx': pos_idx,
                **position
            }
            records.append(record)
    return pd.DataFrame(records)

# Flatten annotated data
df_annotated = flatten_cvs(cvs_annotated)
print(f"Total positions in annotated data: {len(df_annotated)}")
print(f"Total CVs (unique cv_id): {df_annotated['cv_id'].nunique()}")
df_annotated.head(10)

In [None]:
# Basic info
df_annotated.info()

## 4. Filter to ACTIVE Positions (Our Target)

According to the project description, we need to predict characteristics of the **current job** (status = ACTIVE).

In [None]:
# Status distribution
print("Position Status Distribution:")
print(df_annotated['status'].value_counts())
print(f"\nPercentage ACTIVE: {(df_annotated['status'] == 'ACTIVE').mean():.1%}")

In [None]:
# Filter to active positions only
df_active = df_annotated[df_annotated['status'] == 'ACTIVE'].copy()
print(f"Active positions: {len(df_active)}")
print(f"Unique CVs with active positions: {df_active['cv_id'].nunique()}")

In [None]:
# Some people have multiple active positions!
active_per_cv = df_active.groupby('cv_id').size()
print("Active positions per CV:")
print(active_per_cv.value_counts().sort_index())

## 5. Department (Domain) Analysis

In [None]:
# Department distribution in active positions
dept_counts = df_active['department'].value_counts()
print(f"Unique departments: {len(dept_counts)}")
print("\nDepartment Distribution:")
print(dept_counts)

In [None]:
# Visualize department distribution
fig, ax = plt.subplots(figsize=(12, 6))
dept_counts.plot(kind='bar', ax=ax, color=sns.color_palette('husl', len(dept_counts)))
ax.set_title('Department Distribution (Active Positions)', fontsize=14, fontweight='bold')
ax.set_xlabel('Department')
ax.set_ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('../reports/figures/department_distribution.png', dpi=150)
plt.show()

In [None]:
# Department proportions
dept_props = (dept_counts / dept_counts.sum() * 100).round(1)
print("Department Proportions (%):")
for dept, prop in dept_props.items():
    print(f"  {dept}: {prop}%")

### Department Observations:
- **"Other"** is the most common category - this is a catch-all for unclassified positions
- **Class imbalance** is significant - some departments have very few samples
- Main departments: Sales, Information Technology, Marketing, Project Management, Consulting, Business Development

## 6. Seniority Analysis

In [None]:
# Seniority distribution
seniority_counts = df_active['seniority'].value_counts()
print(f"Unique seniority levels: {len(seniority_counts)}")
print("\nSeniority Distribution:")
print(seniority_counts)

In [None]:
# Visualize seniority distribution
fig, ax = plt.subplots(figsize=(10, 5))

# Order seniority levels logically
seniority_order = ['Junior', 'Professional', 'Senior', 'Lead', 'Director', 'Management']
seniority_ordered = seniority_counts.reindex([s for s in seniority_order if s in seniority_counts.index])

colors = plt.cm.RdYlGn(np.linspace(0.2, 0.8, len(seniority_ordered)))
seniority_ordered.plot(kind='bar', ax=ax, color=colors)
ax.set_title('Seniority Distribution (Active Positions)', fontsize=14, fontweight='bold')
ax.set_xlabel('Seniority Level')
ax.set_ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('../reports/figures/seniority_distribution.png', dpi=150)
plt.show()

In [None]:
# Seniority proportions
seniority_props = (seniority_counts / seniority_counts.sum() * 100).round(1)
print("Seniority Proportions (%):")
for sen, prop in seniority_props.items():
    print(f"  {sen}: {prop}%")

### Seniority Observations:
- **Professional** is the most common level (mid-level workers)
- **Lead** positions are also well-represented
- **Management** and **Senior** have decent representation
- **Junior** and **Director** are less common
- More balanced than department distribution

## 7. Position Title (Text) Analysis

In [None]:
# Text length analysis
df_active['title_length'] = df_active['position'].str.len()
df_active['title_word_count'] = df_active['position'].str.split().str.len()

print("Position Title Statistics:")
print(df_active[['title_length', 'title_word_count']].describe())

In [None]:
# Visualize title length distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Character length
axes[0].hist(df_active['title_length'].dropna(), bins=50, edgecolor='black', alpha=0.7)  # dropna: missing titles yield NaN lengths
axes[0].set_title('Position Title Length (Characters)', fontsize=12)
axes[0].set_xlabel('Number of Characters')
axes[0].set_ylabel('Frequency')
axes[0].axvline(df_active['title_length'].median(), color='red', linestyle='--', label=f"Median: {df_active['title_length'].median():.0f}")
axes[0].legend()

# Word count
axes[1].hist(df_active['title_word_count'].dropna(), bins=20, edgecolor='black', alpha=0.7)  # dropna: missing titles yield NaN counts
axes[1].set_title('Position Title Length (Words)', fontsize=12)
axes[1].set_xlabel('Number of Words')
axes[1].set_ylabel('Frequency')
axes[1].axvline(df_active['title_word_count'].median(), color='red', linestyle='--', label=f"Median: {df_active['title_word_count'].median():.0f}")
axes[1].legend()

plt.tight_layout()
plt.savefig('../reports/figures/title_length_distribution.png', dpi=150)
plt.show()

In [None]:
# Sample position titles by department
print("Sample Position Titles by Department:\n")
for dept in df_active['department'].unique()[:8]:  # First 8 departments
    print(f"=== {dept} ===")
    samples = df_active[df_active['department'] == dept]['position'].head(5).tolist()
    for s in samples:
        print(f"  - {s}")
    print()

## 8. Language Detection (Multilingual Data)

In [None]:
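# Simple language detection based on common words
import re

def detect_language_simple(text):
    """Simple heuristic-based language detection using whole-word matches."""
    # Missing or non-string titles cannot be classified
    if not isinstance(text, str):
        return 'English/Other'

    # Tokenize so that only whole words count; a substring test would
    # match 'de' inside 'developer' or 'der' inside 'leader'
    tokens = set(re.findall(r'\w+', text.lower()))

    # German indicators
    german_words = {'und', 'der', 'die', 'für', 'bei', 'leiter', 'berater', 'mitarbeiter', 'geschäftsführer'}
    # French indicators
    french_words = {'de', 'du', 'la', 'le', 'responsable', 'directeur', 'chargé', 'chef'}
    # Spanish indicators (listed for reference but not scored: 'de' overlaps
    # with French and 'director' with English, so Spanish titles end up as
    # 'French' or 'English/Other')
    spanish_words = {'de', 'del', 'director', 'gerente', 'jefe'}

    german_score = len(tokens & german_words)
    french_score = len(tokens & french_words)

    if german_score > french_score and german_score > 0:
        return 'German'
    elif french_score > german_score and french_score > 0:
        return 'French'
    else:
        return 'English/Other'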

df_active['detected_language'] = df_active['position'].apply(detect_language_simple)

print("Detected Languages:")
print(df_active['detected_language'].value_counts())

In [None]:
# Sample titles by detected language
for lang in df_active['detected_language'].unique():
    print(f"\n=== {lang} Examples ===")
    lang_titles = df_active.loc[df_active['detected_language'] == lang, 'position']
    # Fixed seed keeps the sampled examples reproducible across runs
    samples = lang_titles.sample(min(5, len(lang_titles)), random_state=42).tolist()
    for s in samples:
        print(f"  - {s}")

### Language Observations:
- The dataset is **multilingual** (English, German, French, Spanish, etc.)
- This is important for model design:
  - Rule-based matching needs multilingual patterns
  - Embeddings should use multilingual models

## 9. Label Dictionary Analysis

In [None]:
# Analyze department label dictionary
print("Department Label Dictionary:")
print(f"  Total examples: {len(department_labels)}")
print(f"  Unique labels: {department_labels['label'].nunique()}")
print("\n  Labels and counts:")
print(department_labels['label'].value_counts())

In [None]:
# Analyze seniority label dictionary
print("Seniority Label Dictionary:")
print(f"  Total examples: {len(seniority_labels)}")
print(f"  Unique labels: {seniority_labels['label'].nunique()}")
print("\n  Labels and counts:")
print(seniority_labels['label'].value_counts())

In [None]:
# Sample text -> label mappings
print("Sample Department Mappings:")
department_labels.sample(10)

In [None]:
print("Sample Seniority Mappings:")
seniority_labels.sample(10)

## 10. Career History Patterns (Extensions)

In [None]:
# Positions per CV
positions_per_cv = df_annotated.groupby('cv_id').size()

print("Positions per CV:")
print(positions_per_cv.describe())

fig, ax = plt.subplots(figsize=(10, 5))
positions_per_cv.hist(bins=20, ax=ax, edgecolor='black', alpha=0.7)
ax.set_title('Number of Positions per CV', fontsize=14, fontweight='bold')
ax.set_xlabel('Number of Positions')
ax.set_ylabel('Number of CVs')
ax.axvline(positions_per_cv.median(), color='red', linestyle='--', label=f"Median: {positions_per_cv.median():.0f}")
ax.legend()
plt.tight_layout()
plt.savefig('../reports/figures/positions_per_cv.png', dpi=150)
plt.show()

In [None]:
# Department consistency within CVs
# A CV is consistent if all of its positions share a single department value
# (dropna=False counts a missing department as its own value, like .unique())
depts_per_cv = df_annotated.groupby('cv_id')['department'].nunique(dropna=False)
consistent_count = int((depts_per_cv == 1).sum())
n_cvs = len(depts_per_cv)

print(f"CVs with consistent department across all positions: {consistent_count}/{n_cvs} ({consistent_count/n_cvs:.1%})")

## 11. Summary Statistics

In [None]:
# Create summary table
summary = {
    'Metric': [
        'Total CVs (annotated)',
        'Total CVs (unannotated)',
        'Total positions (annotated)',
        'Active positions (our training set)',
        'Avg positions per CV',
        'Unique departments',
        'Unique seniority levels',
        'Department label examples',
        'Seniority label examples',
        'Avg title length (chars)',
        'Avg title length (words)',
    ],
    'Value': [
        len(cvs_annotated),
        len(cvs_not_annotated),
        len(df_annotated),
        len(df_active),
        f"{positions_per_cv.mean():.1f}",
        df_active['department'].nunique(),
        df_active['seniority'].nunique(),
        len(department_labels),
        len(seniority_labels),
        f"{df_active['title_length'].mean():.1f}",
        f"{df_active['title_word_count'].mean():.1f}",
    ]
}

summary_df = pd.DataFrame(summary)
print(summary_df.to_string(index=False))

## 12. Key Findings & Implications for Modeling

### Data Characteristics:
1. **Multilingual data**: Position titles in English, German, French, Spanish → Need multilingual embeddings
2. **Class imbalance**: "Other" dominates departments → May need class weighting or resampling
3. **Short text**: Most titles are 2-5 words → Limited context for models
4. **Multiple active positions**: Some people have multiple current jobs → Handle carefully
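
One possible way to approach the class weighting mentioned in point 2 is inverse-frequency weights, sketched below. The formula `n_samples / (n_classes * count)` is the same heuristic scikit-learn uses for `class_weight="balanced"`; the helper name `balanced_class_weights` and the toy label list are assumptions for illustration and do not reflect the real label counts.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count(label))."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical toy labels, not the real department distribution
toy_labels = ["Other"] * 6 + ["Sales"] * 3 + ["Marketing"] * 1
weights = balanced_class_weights(toy_labels)

for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.2f}")
```

Rare classes get weights above 1 and frequent ones below 1, so a weighted loss would pay more attention to departments with few samples. In the notebook, `df_active['department'].tolist()` could be passed in place of the toy list.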

### Modeling Implications:
1. **Rule-based**: Can leverage label dictionaries directly for exact/fuzzy matching
2. **Embedding-based**: Use multilingual models like `paraphrase-multilingual-MiniLM-L12-v2`
3. **Supervised**: Enough labeled examples for fine-tuning, but watch for class imbalance
4. **Extensions**: Career history may inform seniority (e.g. the number of prior positions as a rough proxy for experience); this is a hypothesis that the analysis above does not test
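
A minimal sketch of the rule-based idea in point 1, assuming the label dictionary can be turned into a lowercase `title -> label` mapping. It tries an exact match first and then falls back to fuzzy matching with the standard library's `difflib.get_close_matches`. The dictionary contents, the helper `match_label`, the `cutoff` value, and the `"Other"` fallback are illustrative assumptions rather than the project's actual rules.

```python
from difflib import get_close_matches

# Hypothetical title -> label dictionary (the real one would come from the label CSVs)
toy_dictionary = {
    "sales manager": "Sales",
    "software engineer": "Information Technology",
    "marketing manager": "Marketing",
}


def match_label(title, dictionary, cutoff=0.8, default="Other"):
    """Return the label for a title via exact match, then fuzzy match, else default."""
    key = title.strip().lower()
    # 1) Exact match on the normalized title
    if key in dictionary:
        return dictionary[key]
    # 2) Fuzzy match: closest dictionary key with similarity >= cutoff
    close = get_close_matches(key, list(dictionary), n=1, cutoff=cutoff)
    return dictionary[close[0]] if close else default


for title in ["Sales Manager", "Software Enginer", "Astronaut"]:
    print(f"{title!r} -> {match_label(title, toy_dictionary)}")
```

The cutoff trades precision for recall: a higher value only tolerates small typos, while a lower one matches more titles at the risk of wrong labels.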

### Next Steps:
1. Implement baseline rule-based classifier using exact matching
2. Test embedding similarity with label descriptions
3. Train TF-IDF + LogReg as supervised baseline
4. Compare approaches on held-out test set
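
Since some CVs contribute several active positions, a held-out split by row could place positions from the same person on both sides. The sketch below shows one way to avoid that by splitting on `cv_id` instead. The helper `split_by_cv`, the 20% test fraction, and the toy ID list are assumptions for illustration only.

```python
import random


def split_by_cv(cv_ids, test_fraction=0.2, seed=42):
    """Split row indices so that all rows of one CV land on the same side."""
    unique_ids = sorted(set(cv_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    # Hold out at least one CV, otherwise roughly test_fraction of all CVs
    n_test = max(1, round(len(unique_ids) * test_fraction))
    test_ids = set(unique_ids[:n_test])
    train_idx = [i for i, cid in enumerate(cv_ids) if cid not in test_ids]
    test_idx = [i for i, cid in enumerate(cv_ids) if cid in test_ids]
    return train_idx, test_idx


# Hypothetical cv_id column: several CVs have more than one active position
toy_cv_ids = [0, 0, 1, 2, 2, 3, 4, 4, 4, 5]
train_idx, test_idx = split_by_cv(toy_cv_ids)

print("train rows:", train_idx)
print("test rows:", test_idx)
```

With the notebook's data, `df_active['cv_id'].tolist()` could be passed in and the returned positions used with `df_active.iloc[...]`.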

In [None]:
# Save active positions for modeling
df_active.to_csv(DATA_DIR / 'active_positions_processed.csv', index=False)
print(f"Saved {len(df_active)} active positions to 'active_positions_processed.csv'")