# LinkedIn Professional Matching – Starter Notebook

This notebook demonstrates how to use the **profiles** and **compatibility_pairs** datasets. It is designed as the first Kaggle notebook linked to the dataset, covering:

- Loading both CSV files
- Reviewing the basic schema
- Running simple exploratory data analysis (EDA) on profiles
- Joining profile-level information to compatibility pairs
- Inspecting compatibility scores and explanations

You can treat this as a template for your own analysis or model development.

## 1. Setup & Imports

We use standard Python data science libraries. If you run this on Kaggle, these are already available.

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# The 'seaborn-v0_8' style name requires matplotlib >= 3.6.
plt.style.use('seaborn-v0_8')
sns.set_palette('tab10')

## 2. Locate the Data Files

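This notebook is written to work in two settings:
- **On Kaggle**: the CSV files (e.g. `profiles.csv`, `compatibility_pairs.csv`) may sit in the working directory, but attached datasets are typically mounted read-only under `/kaggle/input/<dataset-name>/`.
- **Locally in this repo**: the files live under `data/processed/`.

The helper below looks for each CSV in the current directory first, then in `data/processed/`, and finally searches `/kaggle/input/` recursively.

In [None]:
def find_data_file(filename: str) -> Path:
    """Return the first existing path for `filename` among the known locations."""
    # Direct candidates: current directory, then the repo's processed-data folder.
    for candidate in (Path(filename), Path('data/processed') / filename):
        if candidate.exists():
            return candidate

    # On Kaggle, attached datasets are typically mounted under /kaggle/input/.
    kaggle_root = Path('/kaggle/input')
    if kaggle_root.exists():
        matches = sorted(kaggle_root.rglob(filename))
        if matches:
            return matches[0]

    raise FileNotFoundError(
        f'Could not find {filename} in current dir, data/processed/, or /kaggle/input/'
    )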

profiles_path = find_data_file('profiles.csv')
pairs_path = find_data_file('compatibility_pairs.csv')

profiles_path, pairs_path

## 3. Load the Datasets

- `profiles.csv` contains one row per professional profile.
- `compatibility_pairs.csv` contains one row per ordered pair of profiles, with pre-computed compatibility scores and explanations.

In [None]:
profiles = pd.read_csv(profiles_path)
pairs = pd.read_csv(pairs_path)

profiles.shape, pairs.shape

### Quick peek at the data

In [None]:
profiles.head()

In [None]:
pairs.head()

## 4. Basic Schema Overview

Some key columns in `profiles`:
- `profile_id`: unique identifier for each profile.
- `name`, `location`, `headline`, `about`: basic profile info.
- `current_role`, `current_company`, `industry`: current position.
- `years_experience`, `seniority_level`: experience & seniority.
- `skills`: list-like string of skills.
- `goals`, `needs`, `can_offer`: networking intent and value.

Some key columns in `compatibility_pairs`:
- `pair_id`: unique identifier for each pair.
- `profile_a_id`, `profile_b_id`: IDs of the two profiles.
- `compatibility_score`: overall score (0–100).
- `skill_match_score`, `career_alignment_score`, etc.: component scores.
- `mutual_benefit_explanation`: text explanation of why the match is valuable.
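
Because `skills` is stored as a list-like string, it usually needs parsing before any per-skill analysis. The exact serialization is not specified here, so the following is only a minimal sketch. It assumes the value is either a Python-style list literal (such as `"['Python', 'SQL']"`) or a comma/semicolon-delimited string; `parse_skills` is a hypothetical helper name, not part of the dataset.

```python
import ast


def parse_skills(raw):
    """Turn a list-like string into a list of skill names (hypothetical helper)."""
    # Missing values (None, NaN) and blank strings become an empty list.
    if not isinstance(raw, str) or not raw.strip():
        return []
    text = raw.strip()
    # Assumption: the string may be a Python-style list literal.
    if text.startswith('['):
        try:
            parsed = ast.literal_eval(text)
            return [str(item).strip() for item in parsed if str(item).strip()]
        except (ValueError, SyntaxError, TypeError):
            # Not a valid literal (e.g. unquoted items): strip the brackets
            # and fall through to delimiter-based splitting.
            text = text.strip('[]')
    # Fallback assumption: comma- or semicolon-delimited text.
    parts = [part.strip().strip('\'"') for part in text.replace(';', ',').split(',')]
    return [part for part in parts if part]
```

Applied to the notebook's data, this could look like `profiles['skills_list'] = profiles['skills'].apply(parse_skills)`. Check a few rows of `profiles['skills']` first to confirm which format the file actually uses.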

## 5. Simple EDA on Profiles

In [None]:
profiles.describe(include='all').transpose().head(20)

### Seniority distribution

In [None]:
seniority_counts = profiles['seniority_level'].value_counts().sort_index()
seniority_counts

In [None]:
plt.figure(figsize=(6,4))
sns.barplot(x=seniority_counts.index, y=seniority_counts.values)
plt.title('Seniority Level Distribution')
plt.xlabel('Seniority level')
plt.ylabel('Count of profiles')
plt.tight_layout()
plt.show()

### Years of experience by seniority

In [None]:
plt.figure(figsize=(7,4))
sns.boxplot(
    data=profiles,
    x='seniority_level',
    y='years_experience',
    # Same sorted order as the seniority bar chart above.
    order=sorted(profiles['seniority_level'].dropna().unique()),
)
plt.title('Years of Experience by Seniority Level')
plt.xlabel('Seniority level')
plt.ylabel('Years of experience')
plt.tight_layout()
plt.show()

### Top industries

In [None]:
top_industries = profiles['industry'].value_counts().head(10)
top_industries

In [None]:
plt.figure(figsize=(7,4))
sns.barplot(y=top_industries.index, x=top_industries.values)
plt.title('Top 10 Industries by Profile Count')
plt.xlabel('Count of profiles')
plt.ylabel('Industry')
plt.tight_layout()
plt.show()

## 6. Working with Compatibility Pairs

We now inspect the distribution of compatibility scores and how they relate to different component scores.

In [None]:
pairs['compatibility_score'].describe()

In [None]:
plt.figure(figsize=(7,4))
sns.histplot(pairs['compatibility_score'], bins=30, kde=True)
plt.title('Distribution of Compatibility Scores')
plt.xlabel('Compatibility score')
plt.ylabel('Number of pairs')
plt.tight_layout()
plt.show()

### Relationship between skill match and compatibility score

In [None]:
plt.figure(figsize=(6,5))
sns.scatterplot(
    data=pairs.sample(min(5000, len(pairs)), random_state=42),
    x='skill_match_score',
    y='compatibility_score',
    alpha=0.4
)
plt.title('Skill Match vs Compatibility Score (sample)')
plt.xlabel('Skill match score')
plt.ylabel('Compatibility score')
plt.tight_layout()
plt.show()
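
To look beyond a single component, one option is to correlate every component score with the overall score. The sketch below assumes that component columns are numeric and end in `_score`, as the schema overview suggests; `component_correlations` is a hypothetical helper name.

```python
import pandas as pd


def component_correlations(pairs, target='compatibility_score'):
    """Pearson correlation of each numeric *_score column with the target."""
    # Assumption: component scores are the numeric columns ending in '_score'.
    score_cols = [
        col for col in pairs.columns
        if col.endswith('_score')
        and col != target
        and pd.api.types.is_numeric_dtype(pairs[col])
    ]
    # corrwith computes the column-wise correlation against a single Series.
    correlations = pairs[score_cols].corrwith(pairs[target])
    return correlations.sort_values(ascending=False)
```

Calling `component_correlations(pairs)` returns a Series indexed by component name. A component with a correlation near 1 moves almost in lockstep with the overall score, while a value near 0 suggests it contributes little linear signal.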

## 7. Joining Profiles with Pairs

To analyze pairs with full profile context, we join `pairs` to `profiles` twice:
- once on `profile_a_id`, with the profile columns prefixed `a_`
- once on `profile_b_id`, with the profile columns prefixed `b_`

In [None]:
pairs_with_profiles = pairs.merge(
    profiles.add_prefix('a_'),
    left_on='profile_a_id',
    right_on='a_profile_id',
    how='left'
)

pairs_with_profiles = pairs_with_profiles.merge(
    profiles.add_prefix('b_'),
    left_on='profile_b_id',
    right_on='b_profile_id',
    how='left'
)

pairs_with_profiles.head()
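
A left join leaves the profile columns empty when an ID has no match, and it duplicates pair rows when `profile_id` is not unique. As a hedged sketch, the `validate` and `indicator` arguments of `merge` can surface both problems. The helper name `attach_profile` and the `_merge_a` / `_merge_b` column names are illustrative and not part of the dataset.

```python
import pandas as pd


def attach_profile(pairs, profiles, side):
    """Left-join profile columns for one side ('a' or 'b') with sanity checks."""
    prefixed = profiles.add_prefix(f'{side}_')
    merged = pairs.merge(
        prefixed,
        left_on=f'profile_{side}_id',
        right_on=f'{side}_profile_id',
        how='left',
        validate='many_to_one',      # raises MergeError if a profile_id repeats
        indicator=f'_merge_{side}',  # 'both' or 'left_only' for each row
    )
    # The prefixed key column duplicates profile_<side>_id, so drop it.
    return merged.drop(columns=f'{side}_profile_id')
```

For example, `checked = attach_profile(attach_profile(pairs, profiles, 'a'), profiles, 'b')` builds the same joined frame, and `(checked['_merge_b'] == 'left_only').sum()` counts pairs whose `profile_b_id` has no matching profile.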

### Example: high-compatibility pairs with explanation

We can now look at the top matches and inspect both sides of the pair.

In [None]:
top_pairs = (
    pairs_with_profiles
    .sort_values('compatibility_score', ascending=False)
    .head(5)
)

columns_to_show = [
    'profile_a_id', 'a_name', 'a_current_role', 'a_current_company',
    'profile_b_id', 'b_name', 'b_current_role', 'b_current_company',
    'compatibility_score', 'mutual_benefit_explanation'
]

top_pairs[columns_to_show]

## 8. Next Steps

Here are some ideas for extending this notebook:

- Build a **recommender** that, given a profile, ranks the best potential matches using the provided compatibility scores.
- Engineer additional features from the raw text fields (e.g., embedding skills lists, industry, and goals).
- Train a supervised model to predict `compatibility_score` from profile features only, and compare against the provided scores.
- Analyze how compatibility varies by seniority, industry, geographic location, or goals/needs.
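
As a starting point for the recommender idea, the sketch below ranks partners for one profile using only the provided scores. It assumes the column names from the schema overview and treats a profile as a candidate whether it appears as `profile_a_id` or `profile_b_id`. Whether scores are symmetric across the two orderings is not stated, so the sketch keeps the highest score per partner; `top_matches` is a hypothetical helper name.

```python
import pandas as pd


def top_matches(pairs, profile_id, k=5):
    """Rank the k best partners for profile_id by compatibility_score."""
    # Rows where the profile is on the A side: the partner is profile B.
    as_a = pairs.loc[
        pairs['profile_a_id'] == profile_id,
        ['profile_b_id', 'compatibility_score'],
    ].rename(columns={'profile_b_id': 'partner_id'})
    # Rows where the profile is on the B side: the partner is profile A.
    as_b = pairs.loc[
        pairs['profile_b_id'] == profile_id,
        ['profile_a_id', 'compatibility_score'],
    ].rename(columns={'profile_a_id': 'partner_id'})
    candidates = pd.concat([as_a, as_b], ignore_index=True)
    # Ignore any self-pairs.
    candidates = candidates[candidates['partner_id'] != profile_id]
    # Assumption: if a partner appears in both orderings, keep the higher score.
    best = candidates.groupby('partner_id')['compatibility_score'].max()
    return best.sort_values(ascending=False).head(k)
```

`top_matches(pairs, some_profile_id, k=10)` returns a Series of scores indexed by `partner_id`, which can then be joined back to `profiles` to show names and roles.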

Feel free to fork this notebook and adapt it to your own experiments.