# 🤝 LinkedIn Professional Match - Getting Started

Welcome to the LinkedIn Professional Match dataset! This notebook will help you:
- 📊 Explore the dataset
- 🤖 Load and use the pre-trained model
- 🎯 Make compatibility predictions
- 💡 Discover contribution opportunities

## 🌟 Project Overview

This is an **open-source ML system** for predicting professional networking compatibility on LinkedIn.

**What's included:**
- 50K+ synthetic professional profiles
- 500K+ compatibility pairs with scores
- Pre-trained Gradient Boosting model (R² = 1.0)
- 18 engineered features with explainable AI

**GitHub:** https://github.com/Likitha-Gedipudi/LinkedIn_Match_Algorithm

**Live API:** https://linkedin-match-algorithm-4ce8d98dc007.herokuapp.com

---

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print('✅ Libraries loaded!')

## 📊 Part 1: Dataset Overview

In [None]:
# Load profiles dataset
profiles = pd.read_csv('profiles_enhanced.csv')

print(f'Profiles Dataset Shape: {profiles.shape}')
print(f'Total Profiles: {len(profiles):,}')
print(f'Features: {profiles.shape[1]}')
print(f'\nMemory Usage: {profiles.memory_usage(deep=True).sum() / 1024**2:.1f} MB')

profiles.head()
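
Before exploring further, it can help to check each column for missing values and to count duplicate rows. The sketch below is one possible way to do that; `summarize_quality` is a hypothetical helper (not part of the project), and the small `demo` frame only stands in for `profiles` so the example runs on its own.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column dtype, missing-value count, and distinct-value count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isna().sum(),
        'unique': df.nunique(dropna=True),
    })


# Tiny stand-in frame; in the notebook you would pass `profiles` instead.
demo = pd.DataFrame({
    'connections': [500, 120, None, 120],
    'industry': ['Tech', 'Finance', 'Tech', 'Finance'],
})

quality = summarize_quality(demo)
duplicate_rows = int(demo.duplicated().sum())  # rows identical to an earlier row

print(quality)
print(f'Duplicate rows: {duplicate_rows}')
```

In the notebook itself, `summarize_quality(profiles)` would give the same overview for the real dataset.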

In [None]:
# Load compatibility pairs
pairs = pd.read_csv('compatibility_pairs_enhanced.csv')

print(f'Pairs Dataset Shape: {pairs.shape}')
print(f'Total Pairs: {len(pairs):,}')
print(f'Features: {pairs.shape[1]}')

pairs.head()

## 🔍 Part 2: Data Exploration

In [None]:
# Profile statistics
print('=== Profile Statistics ===')
print(f"\nAverage Connections: {profiles['connections'].mean():.0f}")
print(f"Average Skills: {profiles['skills'].str.split(',').str.len().mean():.1f}")
print(f"Average Experience Years: {profiles['experience_years'].mean():.1f}")

# Visualize distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Connections distribution
axes[0, 0].hist(profiles['connections'], bins=50, edgecolor='black')
axes[0, 0].set_title('Connections Distribution')
axes[0, 0].set_xlabel('Number of Connections')

# Experience years
axes[0, 1].hist(profiles['experience_years'], bins=30, edgecolor='black', color='orange')
axes[0, 1].set_title('Experience Years Distribution')
axes[0, 1].set_xlabel('Years')

# Top industries
top_industries = profiles['industry'].value_counts().head(10)
axes[1, 0].barh(top_industries.index, top_industries.values)
axes[1, 0].set_title('Top 10 Industries')
axes[1, 0].set_xlabel('Count')

# Top locations
top_locations = profiles['location'].value_counts().head(10)
axes[1, 1].barh(top_locations.index, top_locations.values, color='green')
axes[1, 1].set_title('Top 10 Locations')
axes[1, 1].set_xlabel('Count')

plt.tight_layout()
plt.show()

In [None]:
# Compatibility score distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(pairs['compatibility_score'], bins=50, edgecolor='black', color='purple')
axes[0].set_title('Compatibility Score Distribution')
axes[0].set_xlabel('Score (0-100)')
axes[0].set_ylabel('Frequency')
axes[0].axvline(pairs['compatibility_score'].mean(), color='red', linestyle='--', label='Mean')
axes[0].legend()

# Score by recommendation
recommendation_counts = pairs['recommendation'].value_counts()
axes[1].bar(recommendation_counts.index, recommendation_counts.values, color=['green', 'blue', 'orange', 'red'])
axes[1].set_title('Recommendations Distribution')
axes[1].set_xlabel('Recommendation')
axes[1].set_ylabel('Count')
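# Rotate the labels on the right-hand axes explicitly rather than on the "current" axes
axes[1].tick_params(axis='x', rotation=45)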

plt.tight_layout()
plt.show()

print(f"\nAverage Compatibility Score: {pairs['compatibility_score'].mean():.2f}")
print(f"Median: {pairs['compatibility_score'].median():.2f}")
print(f"Std Dev: {pairs['compatibility_score'].std():.2f}")
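
To see how scores fall into recommendation bands, one option is `pd.cut`. The sketch below assumes the 40/60/80 thresholds used in this notebook's prediction example; the short labels (`Low`, `Moderate`, `Good`, `High`) are illustrative and may not match the values stored in the `recommendation` column.

```python
import numpy as np
import pandas as pd

# Band edges follow the 40 / 60 / 80 thresholds from the prediction example.
# The labels are illustrative placeholders, not the dataset's own values.
BINS = [0, 40, 60, 80, np.inf]
LABELS = ['Low', 'Moderate', 'Good', 'High']


def band_scores(scores: pd.Series) -> pd.Series:
    """Assign each score to a left-closed band: [0, 40), [40, 60), [60, 80), [80, inf)."""
    return pd.cut(scores, bins=BINS, labels=LABELS, right=False)


demo_scores = pd.Series([10, 40, 59.9, 60, 80, 100])
bands = band_scores(demo_scores)
print(bands.tolist())
```

With the real data, `band_scores(pairs['compatibility_score']).value_counts()` could be compared against the counts in the `recommendation` column.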

## 🤖 Part 3: Load Pre-trained Model

In [None]:
# Load the trained model
model_data = joblib.load('compatibility_scorer.joblib')

pipeline = model_data['pipeline']
feature_names = model_data['feature_names']
model_type = model_data['model_type']

print('✅ Model loaded successfully!')
print(f'Model Type: {model_type}')
print(f'Features: {len(feature_names)}')
print(f'\nFeature Names:')
for i, feat in enumerate(feature_names, 1):
    print(f'  {i}. {feat}')
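
Because the model expects the columns listed in `feature_names`, it is worth checking an input frame against that list before calling `predict`. The following is a minimal sketch of such a check; `align_features` is a hypothetical helper, and the two-feature list and the `pair_id` column are made up for the demonstration.

```python
import pandas as pd


def align_features(df: pd.DataFrame, feature_names: list) -> pd.DataFrame:
    """Return df restricted to feature_names, in that order; fail loudly on gaps."""
    missing = [name for name in feature_names if name not in df.columns]
    if missing:
        raise KeyError(f'Missing feature columns: {missing}')
    return df[list(feature_names)]


# Hypothetical two-feature example; the real list comes from model_data['feature_names'].
expected = ['skill_match_score', 'industry_match']
raw = pd.DataFrame([{'industry_match': 80, 'pair_id': 1, 'skill_match_score': 75}])

aligned = align_features(raw, expected)
print(list(aligned.columns))
```

Extra columns are dropped and the remaining ones are reordered, so the result can be passed to the pipeline directly.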

## 🎯 Part 4: Make Predictions

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Take the first 100 pairs as a quick sanity check.
# Note: if these rows were part of the model's training data, the metrics
# below will overstate how well it generalizes to unseen pairs.
X_test = pairs[feature_names].head(100)
y_test = pairs['compatibility_score'].head(100)

# Make predictions
predictions = pipeline.predict(X_test)

# Calculate metrics
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)

print('=== Model Performance ===')
print(f'MAE: {mae:.3f}')
print(f'RMSE: {rmse:.3f}')
print(f'R² Score: {r2:.3f}')

# Visualize predictions vs actual
plt.figure(figsize=(10, 6))
plt.scatter(y_test, predictions, alpha=0.6)
plt.plot([0, 100], [0, 100], 'r--', label='Perfect Prediction')
plt.xlabel('Actual Score')
plt.ylabel('Predicted Score')
plt.title('Predictions vs Actual')
plt.legend()
plt.grid(True)
plt.show()
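
For readers who want to see what the three metrics above measure, here is a small NumPy sketch that computes them by hand using the standard definitions. The function name `regression_metrics` and the sample numbers are illustrative only and are not taken from the dataset.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, RMSE, and R² from their standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))          # average absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))   # penalizes large errors more
    ss_res = np.sum(residuals ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot                # undefined if y_true is constant
    return {'mae': mae, 'rmse': rmse, 'r2': r2}


# Made-up scores for illustration
actual = [50, 60, 70, 80]
predicted = [52, 58, 71, 79]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

An R² of exactly 1.0 requires every residual to be zero, so a perfect score on rows the model may already have seen is best confirmed on held-out pairs.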

In [None]:
# Example: Predict compatibility for a custom profile pair
custom_features = pd.DataFrame([{
    'skill_match_score': 75,
    'skill_complementarity_score': 85,
    'network_value_a_to_b': 60,
    'network_value_b_to_a': 65,
    'career_alignment_score': 70,
    'experience_gap': 5,
    'industry_match': 80,
    'geographic_score': 90,
    'seniority_match': 75,
    'network_value_avg': 62.5,
    'network_value_diff': 5,
    'skill_total': 160,
    'skill_balance': 63.75,
    'exp_gap_squared': 25,
    'is_mentorship_gap': 1,
    'is_peer': 0,
    'skill_x_network': 53.125,
    'career_x_industry': 56.0
}])

# Select columns in the order the model expects before predicting
score = pipeline.predict(custom_features[feature_names])[0]
print(f'\n🎯 Predicted Compatibility Score: {score:.1f}/100')

if score >= 80:
    print('✅ Recommendation: Highly Compatible - Strong mutual benefit expected!')
elif score >= 60:
    print('✅ Recommendation: Good Match - Potential for valuable connection')
elif score >= 40:
    print('⚠️ Recommendation: Moderate Match - Some synergies present')
else:
    print('❌ Recommendation: Low Match - Limited mutual benefit')
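
Several values in the custom example appear to be derived from the nine base scores. The sketch below reproduces them with formulas inferred from the example numbers alone (for instance, `skill_balance` matches 75 × 85 / 100); these formulas are an assumption and are not confirmed by the project's feature-engineering code. The two flags, `is_mentorship_gap` and `is_peer`, are left as inputs because their thresholds are not stated here. `add_derived_features` is a hypothetical helper name.

```python
def add_derived_features(base: dict) -> dict:
    """Add derived columns using formulas inferred from the example values (assumption)."""
    features = dict(base)
    a_to_b = base['network_value_a_to_b']
    b_to_a = base['network_value_b_to_a']
    skill_match = base['skill_match_score']
    skill_comp = base['skill_complementarity_score']

    features['network_value_avg'] = (a_to_b + b_to_a) / 2
    features['network_value_diff'] = abs(a_to_b - b_to_a)
    features['skill_total'] = skill_match + skill_comp
    features['skill_balance'] = skill_match * skill_comp / 100
    features['exp_gap_squared'] = base['experience_gap'] ** 2
    features['skill_x_network'] = skill_comp * features['network_value_avg'] / 100
    features['career_x_industry'] = (
        base['career_alignment_score'] * base['industry_match'] / 100
    )
    return features


# Base scores copied from the custom example; flags are passed through unchanged.
base_scores = {
    'skill_match_score': 75,
    'skill_complementarity_score': 85,
    'network_value_a_to_b': 60,
    'network_value_b_to_a': 65,
    'career_alignment_score': 70,
    'experience_gap': 5,
    'industry_match': 80,
    'geographic_score': 90,
    'seniority_match': 75,
    'is_mentorship_gap': 1,
    'is_peer': 0,
}

derived = add_derived_features(base_scores)
print(derived)
```

If these formulas do match the project's pipeline, a helper like this avoids typing derived values by hand and keeps them consistent with the base scores.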

## 💡 Part 5: Contribution Ideas

### 🚀 How You Can Contribute

This is an **open-source project** and we welcome contributions! Here are some ideas:

#### 🎯 Beginner-Friendly
1. **Data Analysis** - Create visualizations and insights from the dataset
2. **Feature Engineering** - Add new compatibility features
3. **Documentation** - Improve README, add tutorials
4. **Bug Reports** - Find and report issues

#### 🔥 Intermediate
5. **Model Improvements** - Try different algorithms (Neural Networks, LightGBM)
6. **Hyperparameter Tuning** - Optimize model performance
7. **New Features** - Add conversation starters, red-flag detection
8. **API Enhancements** - Add caching, rate limiting, authentication

#### 🚀 Advanced
9. **Deep Learning** - Build transformer-based models
10. **Real Data Integration** - Connect to the real LinkedIn API (ethically)
11. **Web Dashboard** - Build React/Streamlit interface
12. **Graph Neural Networks** - Use network structure for predictions
13. **Explainable AI** - Add SHAP/LIME interpretability
14. **A/B Testing Framework** - Compare model versions

### 📝 How to Contribute

1. **Fork** the GitHub repo: https://github.com/Likitha-Gedipudi/LinkedIn_Match_Algorithm
2. **Clone** your fork: `git clone <your-fork-url>`
3. **Create branch**: `git checkout -b feature/amazing-feature`
4. **Make changes** and commit: `git commit -m 'Add amazing feature'`
5. **Push**: `git push origin feature/amazing-feature`
6. **Open Pull Request** on GitHub

### 🏆 Recognition

All contributors will be:
- Listed in CONTRIBUTORS.md
- Credited in release notes
- Acknowledged in README

---

## 🌐 Resources

- **GitHub Repository**: https://github.com/Likitha-Gedipudi/LinkedIn_Match_Algorithm
- **Live API**: https://linkedin-match-algorithm-4ce8d98dc007.herokuapp.com
- **Chrome Extension**: [Link to extension folder]
- **Documentation**: [Link to docs]

## 📧 Contact

- Questions? Open an issue on GitHub
- Suggestions? Start a discussion
- Want to collaborate? Reach out!

---

**Happy coding! 🚀**

In [None]:
# Feature importance (only tree-based models expose feature_importances_)
try:
    regressor = pipeline.named_steps['regressor']
    if hasattr(regressor, 'feature_importances_'):
        importance_df = pd.DataFrame({
            'feature': feature_names,
            'importance': regressor.feature_importances_
        }).sort_values('importance', ascending=False)

        # Horizontal bars, most important feature at the top
        plt.figure(figsize=(10, 8))
        plt.barh(importance_df['feature'], importance_df['importance'])
        plt.xlabel('Importance')
        plt.title('Feature Importance')
        plt.gca().invert_yaxis()
        plt.tight_layout()
        plt.show()

        print('\nTop 5 Most Important Features:')
        print(importance_df.head())
    else:
        print('Feature importance not available for this model type.')
except Exception as e:
    print(f'Feature importance not available: {e}')