# Developer Role Classification: Exploratory Data Analysis

This notebook contains exploratory analysis revealing dataset properties, patterns, and potential risk factors for the Developer Role Classification project.

We'll visually explore the data and document key observations to better understand:
1. The distribution and balance of developer roles
2. Key features and their relationships with roles
3. Patterns in commit behavior across different roles
4. Potential challenges and risk factors for modeling

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
from collections import Counter
import re
import sklearn  # needed for sklearn.__version__ below
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA, TruncatedSVD
import warnings
import os
import sys
import platform

# Print environment information for reproducibility
print(f"Python version: {platform.python_version()}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"scikit-learn version: {sklearn.__version__}")
print(f"NLTK version: {nltk.__version__}")

# For reproducibility - Set fixed seeds everywhere
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
import random
random.seed(RANDOM_SEED)

# scikit-learn has no global seed of its own: estimators fall back to NumPy's
# global RNG (seeded above) unless random_state=RANDOM_SEED is passed explicitly

# Suppress warnings
warnings.filterwarnings('ignore')

# Set visualization style; the seaborn theme is applied last and takes precedence
plt.style.use('ggplot')
sns.set_theme(style="whitegrid")

## 1. Data Loading and Overview

First, let's load our processed dataset and get an overview of its structure.

In [None]:
# Load the processed dataset
try:
    data = pd.read_csv('final_dataset.csv')
    print(f"Dataset loaded successfully with {data.shape[0]} rows and {data.shape[1]} columns.")
except FileNotFoundError as err:
    # Stop here: every later cell depends on `data`
    raise FileNotFoundError("Dataset not found. Please run the preprocessing notebook first.") from err
    
# Display the first few rows
print("\nFirst 5 rows of the dataset:")
display(data.head())

# Get basic information about the dataset
print("\nDataset info:")
data.info()

# Basic statistics
print("\nBasic statistics:")
display(data.describe())

## 2. Target Variable Analysis

Let's examine the distribution of developer roles (our target variable) to assess class balance and representation.

In [None]:
# Check the distribution of developer roles
role_counts = data['role'].value_counts()
print("Developer role distribution:")
display(role_counts)

# Calculate percentage distribution
role_percentages = data['role'].value_counts(normalize=True) * 100
print("\nPercentage distribution:")
display(role_percentages)

# Visualize the distribution
plt.figure(figsize=(12, 6))

# Bar plot of role counts
plt.subplot(1, 2, 1)
ax = sns.countplot(y=data['role'], order=role_counts.index)
plt.title('Distribution of Developer Roles')
plt.xlabel('Count')
plt.ylabel('Role')
# Add count labels
for i, count in enumerate(role_counts):
    ax.text(count + 5, i, f"{count} ({role_percentages[role_counts.index[i]]:.1f}%)", va='center')

# Pie chart of roles
plt.subplot(1, 2, 2)
plt.pie(role_counts, labels=role_counts.index, autopct='%1.1f%%', startangle=90, shadow=True)
plt.axis('equal')
plt.title('Proportion of Developer Roles')

plt.tight_layout()
plt.show()

# Check for class imbalance
min_class = role_counts.min()
max_class = role_counts.max()
imbalance_ratio = max_class / min_class
print(f"\nClass imbalance ratio (largest/smallest): {imbalance_ratio:.2f}")

if imbalance_ratio > 1.5:
    print("⚠️ There is significant class imbalance. Consider using techniques like SMOTE for balancing.")
else:
    print("✅ Class distribution is relatively balanced.")
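
If the imbalance warning fires, class weighting is a lighter-weight alternative to resampling. The sketch below is not part of the original analysis: it assumes scikit-learn's `compute_class_weight` with the `'balanced'` heuristic, and `balanced_class_weights` and the toy label list are hypothetical stand-ins for `data['role']`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
    return {str(c): float(w) for c, w in zip(classes, weights)}

# Hypothetical toy labels standing in for data['role']
toy_roles = ["backend"] * 6 + ["frontend"] * 3 + ["devops"] * 1
toy_weights = balanced_class_weights(toy_roles)
print(toy_weights)
```

Many scikit-learn classifiers accept such a dictionary through their `class_weight` parameter, so rare roles count more during training without synthesizing new samples.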

## 3. Feature Analysis

### 3.1 Numerical Feature Distributions

Let's examine the distribution of our numerical features to identify patterns and potential outliers.

In [None]:
# Select numerical features (the target and any text columns are excluded)
numerical_features = [
    col for col in data.select_dtypes(include=np.number).columns
    if col not in ['role', 'commit_message', 'processed_message', 'clean_commit_message']
]

# Plot histograms for numerical features on a grid sized to the feature count
n_cols = 4
n_rows = int(np.ceil(len(numerical_features) / n_cols))
plt.figure(figsize=(15, 3 * n_rows))
for i, feature in enumerate(numerical_features):
    plt.subplot(n_rows, n_cols, i+1)
    sns.histplot(data[feature], kde=True)
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

# Box plots for numerical features by developer role
n_cols = 3
n_rows = int(np.ceil(len(numerical_features) / n_cols))
plt.figure(figsize=(15, 4 * n_rows))
for i, feature in enumerate(numerical_features):
    plt.subplot(n_rows, n_cols, i+1)
    sns.boxplot(x='role', y=feature, data=data)
    plt.title(f'{feature} by Role')
    plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
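
The box plots flag outliers visually; a count per feature makes them easier to compare. This is a minimal sketch, assuming the usual 1.5 × IQR rule (the same rule box-plot whiskers use by default). The helper `iqr_outlier_counts` and the toy columns are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame's columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum().sort_values(ascending=False)

# Hypothetical toy data: one column with a single extreme value
toy = pd.DataFrame({
    "lines_changed": [1, 2, 3, 4, 100],
    "files_changed": [1, 2, 3, 4, 5],
})
toy_outliers = iqr_outlier_counts(toy)
print(toy_outliers)
```

On the real data this would be called as `iqr_outlier_counts(data[numerical_features])`.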

### 3.2 Feature Correlation Analysis

Let's examine correlations between features to identify potential multicollinearity and redundant features that carry overlapping information.

In [None]:
# Calculate the correlation matrix
corr_matrix = data[numerical_features].corr()

# Plot heatmap of feature correlations
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

# Identify highly correlated feature pairs
high_corr_threshold = 0.7
high_corr_pairs = []

for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) >= high_corr_threshold:
            high_corr_pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))

if high_corr_pairs:
    print(f"Highly correlated feature pairs (|r| >= {high_corr_threshold}):")
    for feat1, feat2, corr in high_corr_pairs:
        print(f"{feat1} and {feat2}: r = {corr:.2f}")
else:
    print("No highly correlated feature pairs found.")
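
The pair search above can also be written without explicit loops. The following sketch assumes the same |r| threshold of 0.7 and masks the correlation matrix to its upper triangle so that each pair appears once. `correlated_pairs` and the toy frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.7):
    """Return a Series of correlations indexed by (feature_1, feature_2)."""
    corr = df.corr()
    # Keep only entries strictly above the diagonal (each pair once, no self-pairs)
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (masked cells) never satisfy the comparison, so they drop out here
    return pairs[pairs.abs() >= threshold]

# Hypothetical toy data: b is an exact multiple of a, c is weakly related to both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})
toy_pairs = correlated_pairs(toy)
print(toy_pairs)
```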

### 3.3 Feature Importance for Role Prediction

Let's analyze which features are most discriminative for different developer roles.

In [None]:
# Group by role and calculate mean for each feature
role_feature_means = data.groupby('role')[numerical_features].mean()
display(role_feature_means)

# Create heatmap of feature means by role
plt.figure(figsize=(14, 8))
sns.heatmap(role_feature_means, annot=True, cmap='viridis', fmt='.2f')
plt.title('Average Feature Values by Developer Role')
plt.tight_layout()
plt.show()

# Calculate feature importance as the variance of feature means across roles
# (scale-dependent: features measured in larger units score higher regardless of separation)
feature_importance = role_feature_means.var(axis=0)
feature_importance = feature_importance.sort_values(ascending=False)

print("\nFeature discriminative power (higher values indicate better role separation):")
display(feature_importance)

# Visualize feature importance
plt.figure(figsize=(12, 6))
feature_importance.plot(kind='bar')
plt.title('Feature Discriminative Power for Role Classification')
plt.ylabel('Variance of means across roles')
plt.xlabel('Feature')
plt.tight_layout()
plt.show()
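
Because the variance of role means depends on each feature's units, a scale-free score is a useful cross-check. The sketch below swaps in a different technique from the one used above: the one-way ANOVA F-statistic from scikit-learn's `f_classif`, which compares between-role to within-role variance. `anova_f_scores` and the toy columns are hypothetical.

```python
import pandas as pd
from sklearn.feature_selection import f_classif

def anova_f_scores(df, feature_cols, target_col):
    """One-way ANOVA F-statistic per feature; unaffected by a feature's units."""
    f_values, _p_values = f_classif(df[feature_cols].values, df[target_col].values)
    return pd.Series(f_values, index=feature_cols).sort_values(ascending=False)

# Hypothetical toy data: one feature separates the roles, one does not
toy = pd.DataFrame({
    "role": ["a"] * 3 + ["b"] * 3 + ["c"] * 3,
    "separating": [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2],
    "overlapping": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
})
# Rescaling a feature leaves its F-statistic unchanged
toy["separating_x1000"] = toy["separating"] * 1000
toy_f = anova_f_scores(toy, ["separating", "overlapping", "separating_x1000"], "role")
print(toy_f)
```

On the real data this would be `anova_f_scores(data, numerical_features, 'role')`, assuming the features contain no missing values.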

## 4. Text Content Analysis

Let's analyze the commit messages to understand patterns and differences across roles.

In [None]:
# Ensure we have the commit messages
if 'commit_message' not in data.columns:
    print("Commit message column not found in the dataset.")
else:
    # Download NLTK resources if needed
    try:
        nltk.data.find('corpora/stopwords')
    except LookupError:
        nltk.download('stopwords')
    
    # Set up stopwords
    stop_words = set(stopwords.words('english'))
    
    # Function to clean and tokenize text
    def process_text_for_wordcloud(text):
        if not isinstance(text, str):
            return ""
        # Convert to lowercase and remove special characters
        text = text.lower()
        text = re.sub(r'[^\w\s]', '', text)
        # Remove stopwords
        tokens = [word for word in text.split() if word not in stop_words]
        return ' '.join(tokens)
    
    # Process commit messages for analysis
    data['processed_for_cloud'] = data['commit_message'].apply(process_text_for_wordcloud)
    
    # Generate word clouds for each role
    unique_roles = data['role'].unique()
    
    plt.figure(figsize=(18, 4 * len(unique_roles)))
    for i, role in enumerate(unique_roles):
        role_messages = ' '.join(data[data['role'] == role]['processed_for_cloud'])
        
        # Generate word cloud
        wordcloud = WordCloud(
            width=800, height=400,
            background_color='white',
            colormap='viridis',
            max_words=100,
            contour_width=3,
            contour_color='steelblue'
        ).generate(role_messages)
        
        # Plot the word cloud
        plt.subplot(len(unique_roles), 1, i+1)
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.title(f'Common Words in {role} Commit Messages')
        plt.axis('off')
    
    plt.tight_layout()
    plt.show()
    
    # Extract top keywords for each role
    print("Top keywords by role:")
    
    for role in unique_roles:
        role_messages = ' '.join(data[data['role'] == role]['processed_for_cloud']).split()
        role_word_counts = Counter(role_messages)
        top_words = role_word_counts.most_common(10)
        
        print(f"\n{role}:")
        for word, count in top_words:
            print(f"  - {word}: {count}")

## 5. Dimensionality Reduction for Visualization

Let's reduce the dimensionality of our feature space to visualize how well the roles separate in a lower-dimensional space.

In [None]:
from sklearn.preprocessing import StandardScaler

# Select features for dimensionality reduction
X_features = data[numerical_features].values
y_roles = data['role'].values

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_features)

# Apply PCA for dimensionality reduction
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Create a DataFrame for visualization
pca_df = pd.DataFrame({
    'PCA1': X_pca[:, 0],
    'PCA2': X_pca[:, 1],
    'Role': y_roles
})

# Plot the PCA results
plt.figure(figsize=(12, 8))
sns.scatterplot(x='PCA1', y='PCA2', hue='Role', data=pca_df, palette='viridis', s=100, alpha=0.7)
plt.title('PCA Visualization of Developer Roles')
plt.xlabel(f'Principal Component 1 (Explained Variance: {pca.explained_variance_ratio_[0]:.2%})')
plt.ylabel(f'Principal Component 2 (Explained Variance: {pca.explained_variance_ratio_[1]:.2%})')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

# Print explained variance
print(f"Total variance explained by first 2 PCA components: {sum(pca.explained_variance_ratio_):.2%}")

# Check feature contributions to principal components
pca_components = pd.DataFrame(
    pca.components_,
    columns=numerical_features,
    index=['PC1', 'PC2']
)

print("\nPCA Components Feature Contributions:")
display(pca_components)

# Visualize feature contributions to PCA components
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
sns.barplot(x=pca_components.columns, y=pca_components.iloc[0])
plt.title('Feature Contributions to PC1')
plt.xticks(rotation=90)

plt.subplot(1, 2, 2)
sns.barplot(x=pca_components.columns, y=pca_components.iloc[1])
plt.title('Feature Contributions to PC2')
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()
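
Explained variance describes how faithful the 2-D projection is, not how well the roles separate in it. One way to put a number on separation is the silhouette coefficient of the role labels. This is a sketch under that assumption, using scikit-learn's `silhouette_score`; `role_silhouette` and the toy points are hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def role_silhouette(points, roles):
    """Mean silhouette coefficient of the role labels in the given coordinates.

    Values near 1 indicate compact, well-separated roles; values near 0 indicate overlap.
    """
    return float(silhouette_score(np.asarray(points), np.asarray(roles)))

# Hypothetical toy projection: two tight, distant groups
toy_points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
toy_roles = ["frontend"] * 3 + ["backend"] * 3
toy_score = role_silhouette(toy_points, toy_roles)
print(f"Silhouette: {toy_score:.2f}")
```

With the variables from the cell above, the call would be `role_silhouette(X_pca, y_roles)`; passing `X_scaled` instead scores separation in the full feature space.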

## 6. Text Feature Analysis with TF-IDF

Let's explore how commit message content differs between roles using TF-IDF vectorization.

In [None]:
# Check if we have processed commit messages
if 'processed_message' not in data.columns and 'commit_message' in data.columns:
    # Basic processing
    data['processed_message'] = data['commit_message'].apply(process_text_for_wordcloud)

if 'processed_message' in data.columns:
    # Apply TF-IDF vectorization
    tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
    # Empty messages are read back from CSV as NaN, which the vectorizer rejects
    tfidf_matrix = tfidf.fit_transform(data['processed_message'].fillna(''))
    
    print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
    
    # Get feature names
    feature_names = tfidf.get_feature_names_out()
    
    # Reduce dimensionality for visualization
    svd = TruncatedSVD(n_components=2, random_state=RANDOM_SEED)
    tfidf_2d = svd.fit_transform(tfidf_matrix)
    
    # Create DataFrame for plotting
    tfidf_df = pd.DataFrame({
        'SVD1': tfidf_2d[:, 0],
        'SVD2': tfidf_2d[:, 1],
        'Role': data['role'].values
    })
    
    # Plot TF-IDF projection
    plt.figure(figsize=(12, 8))
    sns.scatterplot(x='SVD1', y='SVD2', hue='Role', data=tfidf_df, palette='viridis', s=100, alpha=0.7)
    plt.title('TF-IDF Text Features Visualization (SVD)')
    plt.xlabel(f'Component 1 (Explained Variance: {svd.explained_variance_ratio_[0]:.2%})')
    plt.ylabel(f'Component 2 (Explained Variance: {svd.explained_variance_ratio_[1]:.2%})')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    plt.show()
    
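    # Find the terms with the highest average TF-IDF weight for each role
    print("Highest-weighted terms by role (mean TF-IDF):")
    
    # Average TF-IDF values for each role
    role_tfidf = {}
    roles_array = data['role'].values
    for role in data['role'].unique():
        # Positional row indices, so this also works when data has a non-default index
        role_indices = np.flatnonzero(roles_array == role)
        role_tfidf[role] = tfidf_matrix[role_indices].mean(axis=0)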
    
    # Get top terms for each role
    for role, tfidf_avg in role_tfidf.items():
        # Convert to array and get top indices
        tfidf_array = np.asarray(tfidf_avg)[0]
        top_indices = tfidf_array.argsort()[-10:][::-1]
        top_terms = [(feature_names[idx], tfidf_array[idx]) for idx in top_indices]
        
        print(f"\n{role}:")
        for term, score in top_terms:
            print(f"  - {term}: {score:.4f}")

## 7. Identifying Risk Factors

Let's identify potential challenges and risk factors for our model.

In [None]:
# 1. Check for class imbalance (again)
print("Class Balance Assessment:")
role_percentages = data['role'].value_counts(normalize=True) * 100
min_percentage = role_percentages.min()
print(f"Smallest class represents {min_percentage:.2f}% of the data")

if min_percentage < 15:
    print("⚠️ HIGH RISK: Severe class imbalance detected")
elif min_percentage < 25:
    print("⚠️ MEDIUM RISK: Moderate class imbalance detected")
else:
    print("✅ LOW RISK: Classes are reasonably balanced")

# 2. Check for feature overlap between classes
print("\nFeature Separation Assessment:")

# Calculate feature overlap using coefficient of variation of means across roles
feature_means = data.groupby('role')[numerical_features].mean()

# Coefficient of variation of each feature's role means. The absolute mean keeps the
# ratio non-negative; features whose overall mean is zero become NaN and are skipped
cv_means = feature_means.std() / feature_means.mean().abs().replace(0, np.nan)
avg_cv = cv_means.mean()

print(f"Average coefficient of variation of feature means across roles: {avg_cv:.4f}")
if avg_cv < 0.2:
    print("⚠️ HIGH RISK: Low feature separation between roles")
elif avg_cv < 0.5:
    print("⚠️ MEDIUM RISK: Moderate feature separation between roles")
else:
    print("✅ LOW RISK: Good feature separation between roles")

# 3. Check for potential overfitting risk (feature count vs. sample count)
print("\nOverfitting Risk Assessment:")
n_samples = len(data)
n_features = len(numerical_features)
samples_per_feature = n_samples / n_features

print(f"Number of samples: {n_samples}")
print(f"Number of features: {n_features}")
print(f"Samples per feature ratio: {samples_per_feature:.2f}")

if samples_per_feature < 10:
    print("⚠️ HIGH RISK: Low sample-to-feature ratio may lead to overfitting")
elif samples_per_feature < 50:
    print("⚠️ MEDIUM RISK: Moderate sample-to-feature ratio")
else:
    print("✅ LOW RISK: Good sample-to-feature ratio")

# 4. Check for data quality issues
print("\nData Quality Assessment:")

# Missing values
missing_values = data.isnull().sum().sum()
print(f"Total missing values: {missing_values}")
if missing_values > 0:
    print("⚠️ RISK: Dataset contains missing values")
else:
    print("✅ No missing values found")

# Check for role ambiguity using keyword overlaps
if 'frontend_keywords' in data.columns and 'backend_keywords' in data.columns:
    # Calculate average keyword counts by role
    keyword_cols = [col for col in data.columns if col.endswith('_keywords')]
    keyword_means = data.groupby('role')[keyword_cols].mean()
    
    print("\nKeyword distribution by role:")
    display(keyword_means)
    
    # Check for ambiguity
    ambiguity_scores = []
    for role in keyword_means.index:
        primary_keyword = role.lower() + "_keywords" if role.lower() + "_keywords" in keyword_cols else None
        if primary_keyword:
            # Get primary and secondary keyword values
            primary_val = keyword_means.loc[role, primary_keyword]
            secondary_vals = [keyword_means.loc[role, col] for col in keyword_cols if col != primary_keyword]
            max_secondary = max(secondary_vals) if secondary_vals else 0
            
            # Calculate ambiguity score (higher means more ambiguous)
            if primary_val > 0:
                ambiguity = max_secondary / primary_val
                ambiguity_scores.append((role, ambiguity))
    
    if ambiguity_scores:
        print("\nRole ambiguity scores (higher means more ambiguous):")
        for role, score in ambiguity_scores:
            print(f"{role}: {score:.2f}")
            if score > 0.8:
                print(f"⚠️ HIGH RISK: {role} role shows high keyword ambiguity")
            elif score > 0.5:
                print(f"⚠️ MEDIUM RISK: {role} role shows moderate keyword ambiguity")
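
The total missing-value count above does not show which columns are affected. A per-column breakdown, sketched here with a hypothetical helper `missing_value_report` and toy data, makes it easier to decide between dropping rows and imputing.

```python
import pandas as pd

def missing_value_report(df):
    """Per-column missing count and percentage, limited to columns with gaps."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical toy data with gaps in two of three columns
toy = pd.DataFrame({
    "role": ["frontend", "backend", None, "backend"],
    "commit_message": ["fix css", None, None, "add api"],
    "lines_changed": [10, 20, 30, 40],
})
toy_report = missing_value_report(toy)
print(toy_report)
```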

## 8. Summary of Findings

Let's summarize our key findings from the exploratory analysis.

In [None]:
print("# Summary of Exploratory Analysis\n")

# 1. Dataset Overview
print("## Dataset Overview")
print(f"- Total number of commits: {len(data)}")
print(f"- Number of unique developer roles: {data['role'].nunique()}")
role_counts = data['role'].value_counts()
for role, count in role_counts.items():
    print(f"  - {role}: {count} commits ({count/len(data)*100:.1f}%)")

# 2. Key Features
print("\n## Key Discriminative Features")
# Get top 5 features by importance
top_features = feature_importance.head(5)
for feature, importance in top_features.items():
    print(f"- {feature}: {importance:.4f}")

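# 3. Role Separation
print("\n## Role Separation")
pca_variance = sum(pca.explained_variance_ratio_)
print(f"- PCA explains {pca_variance:.1%} of variance with 2 components")
# Explained variance measures how faithful the 2-D projection is, not how well roles separate
if pca_variance < 0.5:
    print("- ⚠️ The 2-D projection loses most of the variance; read role separation from the PCA plot with caution")
else:
    print("- ✅ The 2-D projection retains most of the variance; the PCA plot is a fair view of role separation")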

# 4. Risk Factors
print("\n## Risk Factors")
# Class imbalance
max_class = role_counts.max()
min_class = role_counts.min()
imbalance_ratio = max_class / min_class
if imbalance_ratio > 1.5:
    print(f"- ⚠️ Class imbalance: largest/smallest class ratio = {imbalance_ratio:.2f}")

# Feature overlap
if avg_cv < 0.3:
    print(f"- ⚠️ Low feature separation (CV = {avg_cv:.2f})")

# Overfitting risk
if samples_per_feature < 20:
    print(f"- ⚠️ Risk of overfitting: only {samples_per_feature:.1f} samples per feature")

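# 5. Most Frequent Words by Role
print("\n## Most Frequent Words by Role")
if 'processed_for_cloud' in data.columns:
    for role in data['role'].unique():
        role_messages = ' '.join(data[data['role'] == role]['processed_for_cloud']).split()
        role_word_counts = Counter(role_messages)
        top_words = role_word_counts.most_common(5)

        print(f"\n### {role}:")
        for word, _ in top_words:
            print(f"- {word}")
else:
    # The column is only created in Section 4 when commit messages are present
    print("- Commit messages not available; word counts skipped")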

## Conclusion

This exploratory analysis characterizes the developer role classification dataset: its class balance, its most discriminative features, the vocabulary of each role's commit messages, and the risk factors that should inform our modeling approach.

### Key Insights:
1. **Feature Importance**: The variance of per-role feature means ranks the features that best distinguish developer roles.
2. **Text Patterns**: Word clouds, keyword counts, and TF-IDF weights show how commit-message vocabulary differs between roles, which can be leveraged for classification.
3. **Role Separation**: The PCA projection shows how well (or poorly) the roles separate in the feature space.
4. **Risk Factors**: The risk checks cover class imbalance, feature overlap between roles, the sample-to-feature ratio (overfitting), missing values, and keyword ambiguity.

### Next Steps:
1. Address any class imbalance in the modeling phase, for example with SMOTE
2. Focus on the most discriminative features
3. Consider feature engineering to improve role separation
4. Mitigate the remaining risks flagged in Section 7 (overfitting, missing values, keyword ambiguity)