# Developer Role Classification: Data Preprocessing

This notebook documents the complete preprocessing pipeline for the Developer Role Classification project. We'll transform raw developer commit data into model-ready examples in seven steps, each covered in its own section.

The preprocessing pipeline includes:
1. Data loading and inspection
2. Data cleaning
3. Feature engineering
4. Data normalization/scaling
5. Train-test split
6. Data augmentation (if needed)
7. Saving the processed dataset

## 1. Data Loading and Inspection

First, let's import the necessary libraries and load our raw dataset.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import random
import platform

# Print environment information for reproducibility
print(f"Python version: {platform.python_version()}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"NLTK version: {nltk.__version__}")

# For reproducibility: seed Python's and NumPy's global random generators
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

# Try to set seeds for other libraries if they're installed
try:
    import tensorflow as tf
    tf.random.set_seed(RANDOM_SEED)
    print(f"TensorFlow version: {tf.__version__}")
except ImportError:
    print("TensorFlow not installed")

try:
    import torch
    torch.manual_seed(RANDOM_SEED)
    torch.cuda.manual_seed_all(RANDOM_SEED)
    print(f"PyTorch version: {torch.__version__}")
except ImportError:
    print("PyTorch not installed")

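# Download NLTK resources
# Recent NLTK releases load the tokenizer from 'punkt_tab' rather than 'punkt',
# so both are requested here.
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')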

# Load the raw dataset
# Assuming the raw data is stored in a CSV file with commit messages and metadata
try:
    raw_data = pd.read_csv('raw_developer_commits.csv')
    print(f"Dataset loaded successfully with {raw_data.shape[0]} rows and {raw_data.shape[1]} columns.")
except FileNotFoundError:
    # If the file doesn't exist, we'll create a sample dataset for demonstration
    print("Raw data file not found. Creating a sample dataset for demonstration.")
    
    # Sample data with commit messages from different developer roles
    sample_data = {
        'commit_id': [f'commit_{i}' for i in range(1, 101)],
        'author': [f'dev_{np.random.randint(1, 20)}' for _ in range(100)],
        'timestamp': pd.date_range(start='2020-01-01', periods=100, freq='D'),
        'commit_message': [
            "Fixed bug in authentication module",
            "Added new UI components for dashboard",
            "Optimized database queries for better performance",
            "Updated documentation for API endpoints",
            "Implemented CI/CD pipeline configuration",
            "Refactored user service for better maintainability",
            "Created unit tests for payment processing",
            "Fixed security vulnerability in login form",
            "Integrated third-party analytics service",
            "Deployed new version to production server"
        ] * 10,
        'files_changed': [np.random.randint(1, 10) for _ in range(100)],
        'insertions': [np.random.randint(10, 200) for _ in range(100)],
        'deletions': [np.random.randint(5, 100) for _ in range(100)],
        'role': np.random.choice(['Frontend', 'Backend', 'DevOps', 'Full Stack', 'QA'], 100)
    }
    
    raw_data = pd.DataFrame(sample_data)
    print(f"Sample dataset created with {raw_data.shape[0]} rows and {raw_data.shape[1]} columns.")

# Display the first few rows
print("\nFirst 5 rows of the dataset:")
display(raw_data.head())

In [None]:
# Check basic information about the dataset
print("Dataset info:")
raw_data.info()

# Check for missing values
print("\nMissing values per column:")
print(raw_data.isnull().sum())

# Basic statistics for numerical columns
print("\nBasic statistics for numerical columns:")
display(raw_data.describe())

# Distribution of developer roles (target variable)
print("\nDistribution of developer roles:")
role_counts = raw_data['role'].value_counts()
display(role_counts)

# Visualize the distribution
plt.figure(figsize=(10, 6))
sns.countplot(y=raw_data['role'])
plt.title('Distribution of Developer Roles')
plt.xlabel('Count')
plt.ylabel('Role')
plt.tight_layout()
plt.show()

## 2. Data Cleaning

Now let's clean our dataset by:
1. Handling missing values
2. Removing duplicates
3. Standardizing text data
4. Removing outliers if necessary
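
The outlier step relies on IQR fences. As a minimal, dependency-free sketch of that rule (the helper names `iqr_bounds` and `drop_outliers` and the sample values are hypothetical, not part of the pipeline), the bounds can be worked out by hand like this. The `inclusive` method interpolates linearly between data points, which should agree with the default behaviour of pandas' `quantile`.

```python
import statistics

def iqr_bounds(values, k=3.0):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    # method="inclusive" interpolates linearly between sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(values, k=3.0):
    """Keep only the values that fall inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]

# Hypothetical 'insertions' counts with one extreme commit
insertions = [12, 15, 18, 20, 22, 25, 30, 900]
lower, upper = iqr_bounds(insertions)
kept = drop_outliers(insertions)
print(f"fences: [{lower}, {upper}], kept {len(kept)} of {len(insertions)} values")
```

A multiplier of 3 only removes extreme values; the more common 1.5 is stricter and would discard more commits.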

In [None]:
# Handle missing values
# For numeric columns, fill with median
# include='number' also catches int32 columns, which NumPy produces by default on some platforms
numeric_cols = raw_data.select_dtypes(include='number').columns
for col in numeric_cols:
    if raw_data[col].isnull().sum() > 0:
        raw_data[col] = raw_data[col].fillna(raw_data[col].median())

# For text columns, fill with empty string
text_cols = raw_data.select_dtypes(include=['object']).columns
for col in text_cols:
    if col != 'role':  # Don't fill missing values in the target column
        if raw_data[col].isnull().sum() > 0:
            raw_data[col] = raw_data[col].fillna('')

# Check if we have any missing values left
print("Missing values after cleaning:")
print(raw_data.isnull().sum())

# Remove duplicate entries based on commit_id
duplicates = raw_data.duplicated(subset=['commit_id'])
print(f"\nNumber of duplicate entries: {duplicates.sum()}")
raw_data = raw_data.drop_duplicates(subset=['commit_id'])
print(f"Dataset shape after removing duplicates: {raw_data.shape}")

# Standardize text in commit messages
def clean_text(text):
    if not isinstance(text, str):
        return ""
    # Convert to lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    # Remove special characters and numbers
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\d', ' ', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply text cleaning to commit messages
raw_data['clean_commit_message'] = raw_data['commit_message'].apply(clean_text)

# Display sample of cleaned messages
print("\nSample of original vs. cleaned commit messages:")
sample_comparison = pd.DataFrame({
    'Original': raw_data['commit_message'].head(),
    'Cleaned': raw_data['clean_commit_message'].head()
})
display(sample_comparison)

# Check for outliers in numerical columns
plot_cols = [col for col in numeric_cols if col != 'commit_id']  # Skip non-meaningful numeric IDs
fig, axes = plt.subplots(1, len(plot_cols), figsize=(15, 5), squeeze=False)
for ax, col in zip(axes[0], plot_cols):
    sns.boxplot(y=raw_data[col], ax=ax)
    ax.set_title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()

# Remove extreme outliers (optional)
# Here we're using IQR method to identify outliers
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 3 * IQR
    upper_bound = Q3 + 3 * IQR
    
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Apply outlier removal to numerical columns if needed
# This is commented out by default as it might remove important data points
# for col in ['files_changed', 'insertions', 'deletions']:
#     raw_data = remove_outliers(raw_data, col)
# print(f"Dataset shape after outlier removal: {raw_data.shape}")

# Clean data after preprocessing
clean_data = raw_data.copy()
print(f"\nClean dataset shape: {clean_data.shape}")

## 3. Feature Engineering

In this section, we'll:
1. Extract features from commit messages
2. Create new features from existing metadata
3. Encode categorical variables
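
One detail of keyword-based features is worth keeping in mind: counting keywords with a substring test (`keyword in text`) also matches short keywords inside unrelated words. The sketch below compares that approach with whole-token matching. The function names, the shortened keyword list, and the example messages are hypothetical and only serve as an illustration.

```python
import re

def count_keywords_substring(text, keywords):
    """Count keywords that occur anywhere in the text, even inside other words."""
    text = text.lower()
    return sum(1 for kw in keywords if kw in text)

def count_keywords_tokens(text, keywords):
    """Count keywords that occur as whole alphabetic tokens."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sum(1 for kw in keywords if kw in tokens)

keywords = ['ui', 'ci', 'test', 'deploy']  # hypothetical subset of the role keywords

# No keyword is actually present, but 'ui', 'ci' and 'test' hide inside other words
unrelated = "rebuilt the latest pricing guide"
print(count_keywords_substring(unrelated, keywords))  # 3 false positives
print(count_keywords_tokens(unrelated, keywords))     # 0

# Token matching has its own cost: inflected forms such as 'deployed' are missed
related = "deployed ci test for ui"
print(count_keywords_substring(related, keywords))    # 4
print(count_keywords_tokens(related, keywords))       # 3
```

Neither variant is strictly better: substring matching tolerates inflections, while token matching avoids accidental hits on two- and three-letter keywords.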

In [None]:
# Set up text processing tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Function to process text with stopword removal and lemmatization
def process_text(text):
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords and lemmatize
    processed = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(processed)

# Apply text processing
clean_data['processed_message'] = clean_data['clean_commit_message'].apply(process_text)

# Extract features from commit messages
# 1. Message length
clean_data['message_length'] = clean_data['commit_message'].apply(lambda x: len(str(x)))
clean_data['word_count'] = clean_data['processed_message'].apply(lambda x: len(str(x).split()))

# 2. Check for keywords related to different roles
frontend_keywords = ['ui', 'css', 'html', 'react', 'angular', 'vue', 'component', 'style', 'interface']
backend_keywords = ['api', 'database', 'query', 'endpoint', 'server', 'service', 'model', 'controller']
devops_keywords = ['deploy', 'pipeline', 'docker', 'kubernetes', 'ci', 'cd', 'configuration', 'infrastructure']
qa_keywords = ['test', 'bug', 'fix', 'issue', 'verify', 'validation', 'coverage', 'quality']

def count_keywords(text, keyword_list):
    if not isinstance(text, str):
        return 0
    text = text.lower()
    return sum(1 for keyword in keyword_list if keyword in text)

clean_data['frontend_keywords'] = clean_data['processed_message'].apply(lambda x: count_keywords(x, frontend_keywords))
clean_data['backend_keywords'] = clean_data['processed_message'].apply(lambda x: count_keywords(x, backend_keywords))
clean_data['devops_keywords'] = clean_data['processed_message'].apply(lambda x: count_keywords(x, devops_keywords))
clean_data['qa_keywords'] = clean_data['processed_message'].apply(lambda x: count_keywords(x, qa_keywords))

# 3. Create day of week and hour features from timestamp
# Parse the column first if it was loaded as text (e.g., from a CSV file)
if not pd.api.types.is_datetime64_any_dtype(clean_data['timestamp']):
    clean_data['timestamp'] = pd.to_datetime(clean_data['timestamp'])
clean_data['day_of_week'] = clean_data['timestamp'].dt.dayofweek
clean_data['hour_of_day'] = clean_data['timestamp'].dt.hour

# 4. Calculate ratio features
clean_data['insertion_deletion_ratio'] = clean_data['insertions'] / (clean_data['deletions'] + 1)  # +1 to avoid division by zero

# 5. One-hot encode the role (target variable)
# We'll use pandas get_dummies for simplicity
role_encoded = pd.get_dummies(clean_data['role'], prefix='role')
clean_data = pd.concat([clean_data, role_encoded], axis=1)

# Display the new features
print("Dataset with engineered features:")
display(clean_data.head())

# Check correlation between engineered features and roles
# Create a correlation matrix for the new features and encoded roles
feature_cols = ['message_length', 'word_count', 'frontend_keywords', 'backend_keywords', 
                'devops_keywords', 'qa_keywords', 'files_changed', 'insertions', 'deletions',
                'insertion_deletion_ratio']

role_cols = [col for col in clean_data.columns if col.startswith('role_')]
corr_cols = feature_cols + role_cols

# Calculate correlation matrix
corr_matrix = clean_data[corr_cols].corr()

# Plot heatmap of correlations
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

## 4. Data Normalization/Scaling

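Now we'll standardize our numerical features (zero mean, unit variance) so they're on comparable scales for the model. Note that the cell below fits the scaler on the full dataset before the train-test split, which lets test-set statistics influence the training features; for a strict evaluation, fit the scaler on the training split only and reuse it to transform the test split.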
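
Standardization statistics are usually learned from training data alone and then reused on held-out data. The following dependency-free sketch shows that idea with plain Python; `fit_standardizer`, `standardize`, and the sample values are hypothetical names and data for illustration. It uses the population standard deviation, as `StandardScaler` does, and falls back to a scale of 1 for constant columns, which is intended to mirror how `StandardScaler` treats zero-variance features.

```python
import statistics

def fit_standardizer(train_values):
    """Learn mean and population standard deviation from training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    # Avoid division by zero for constant features
    return mean, (std if std > 0 else 1.0)

def standardize(values, mean, std):
    """Apply previously learned statistics to any set of values."""
    return [(v - mean) / std for v in values]

# Hypothetical 'insertions' values for each split
train_insertions = [10, 20, 30, 40]
test_insertions = [25, 100]

mean, std = fit_standardizer(train_insertions)
train_scaled = standardize(train_insertions, mean, std)
test_scaled = standardize(test_insertions, mean, std)  # reuses the training statistics
print(train_scaled)
print(test_scaled)
```

With scikit-learn, the same pattern would be `scaler.fit(X_train)` followed by `scaler.transform(X_train)` and `scaler.transform(X_test)`.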

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Select numerical features for scaling
numerical_features = ['message_length', 'word_count', 'frontend_keywords', 'backend_keywords', 
                      'devops_keywords', 'qa_keywords', 'files_changed', 'insertions', 'deletions',
                      'insertion_deletion_ratio', 'day_of_week', 'hour_of_day']

# Create a copy of the data for scaling
scaled_data = clean_data.copy()

# Apply StandardScaler (z-score normalization)
scaler = StandardScaler()
scaled_data[numerical_features] = scaler.fit_transform(scaled_data[numerical_features])

# Display scaled data
print("Scaled numerical features (first 5 rows):")
display(scaled_data[numerical_features].head())

# Visualize distribution of scaled features
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 4, i+1)
    sns.histplot(scaled_data[feature], kde=True)
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

# Save the scaler for later use
import joblib
joblib.dump(scaler, 'feature_scaler.pkl')
print("Scaler saved to 'feature_scaler.pkl'")

## 5. Train-Test Split

Split the processed data into training and testing sets, using stratification so that both sets preserve the overall class proportions.
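
To make the effect of stratification concrete, here is a simplified sketch that splits each class separately so the test set keeps the same class proportions. It illustrates the idea only and is not scikit-learn's implementation; the function `stratified_split` and the `roles` list are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class (at least one example)
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced label list
roles = ['Backend'] * 50 + ['Frontend'] * 30 + ['QA'] * 20
train_idx, test_idx = stratified_split(roles)
test_counts = Counter(roles[i] for i in test_idx)
print(len(train_idx), len(test_idx), dict(test_counts))
```

A purely random split could, by chance, leave a rare role with very few test examples; splitting per class avoids that.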

In [None]:
from sklearn.model_selection import train_test_split

# Define features and target
X = scaled_data[numerical_features]
y = clean_data['role']  # Original categorical target

# Split into train and test sets (80% train, 20% test)
# Reuse the notebook-wide seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y
)

# Verify the split
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

# Check class distribution in train and test sets
print("\nClass distribution in training set:")
display(y_train.value_counts(normalize=True))

print("\nClass distribution in testing set:")
display(y_test.value_counts(normalize=True))

# Create a DataFrame with the original text and the features for both train and test sets
train_indices = y_train.index
test_indices = y_test.index

train_data = scaled_data.loc[train_indices].copy()
test_data = scaled_data.loc[test_indices].copy()

print(f"\nFinal training data shape: {train_data.shape}")
print(f"Final testing data shape: {test_data.shape}")

## 6. Data Augmentation (if needed)

Check whether class imbalance warrants augmentation. If so, we'll apply SMOTE (Synthetic Minority Over-sampling Technique) to the training set only, generating synthetic examples for minority classes; the test set is left untouched so that evaluation reflects real data.
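
The core of SMOTE is linear interpolation between a minority-class sample and one of its nearest minority-class neighbours. The sketch below shows that single step in plain Python under simplifying assumptions: it always uses the single nearest neighbour, whereas SMOTE picks randomly among several. The helper names and the sample points are hypothetical.

```python
import math
import random

def nearest_neighbor(point, candidates):
    """Return the candidate closest to point (Euclidean distance), excluding point itself."""
    others = [c for c in candidates if c is not point]
    return min(others, key=lambda c: math.dist(point, c))

def synthesize(point, neighbor, rng):
    """Create a synthetic sample on the line segment between point and neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)
# Hypothetical scaled feature vectors of a minority class
minority = [[1.0, 2.0], [1.5, 2.5], [5.0, 8.0]]

base = minority[0]
neighbor = nearest_neighbor(base, minority)
synthetic = synthesize(base, neighbor, rng)
print(neighbor, synthetic)
```

Because synthetic rows are interpolated feature vectors, they do not correspond to any real commit and therefore have no commit message.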

In [None]:
# Check for class imbalance
class_counts = y_train.value_counts()
print("Class distribution before augmentation:")
display(class_counts)

# Calculate imbalance ratio (majority class / minority class)
imbalance_ratio = class_counts.max() / class_counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")

# If the imbalance ratio is high (e.g., > 1.5), apply SMOTE
if imbalance_ratio > 1.5:
    from imblearn.over_sampling import SMOTE
    
    print("\nApplying SMOTE to balance the classes...")
    smote = SMOTE(random_state=RANDOM_SEED)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
    
    print(f"Original training set shape: {X_train.shape}")
    print(f"Resampled training set shape: {X_train_resampled.shape}")
    
    print("\nClass distribution after SMOTE:")
    display(pd.Series(y_train_resampled).value_counts())
    
    # Update the training data
    X_train = X_train_resampled
    y_train = y_train_resampled
else:
    print("\nClass distribution is relatively balanced. No augmentation needed.")

## 7. Save Processed Dataset

Finally, let's save our processed dataset for use in the modeling phase.
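
The choice between CSV and pickle matters for data types. As a small stdlib-only illustration (the `record` dictionary is hypothetical), a CSV round trip turns every value into a string, while a pickle round trip returns the original Python objects.

```python
import csv
import io
import pickle
from datetime import datetime

# Hypothetical single row of processed data
record = {
    'files_changed': 3,
    'insertion_deletion_ratio': 1.5,
    'timestamp': datetime(2020, 1, 1),
}

# CSV round trip: values come back as plain strings
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

# Pickle round trip: values keep their original types
from_pickle = pickle.loads(pickle.dumps(record))

print({key: type(value).__name__ for key, value in from_csv.items()})
print({key: type(value).__name__ for key, value in from_pickle.items()})
```

CSV files remain the more portable option, and pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.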

In [None]:
# Combine features and target into final datasets
final_train = pd.DataFrame(X_train, index=X_train.index)
final_train['role'] = y_train

final_test = pd.DataFrame(X_test, index=X_test.index)
final_test['role'] = y_test

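# Add back the original text for reference
# After SMOTE the training index no longer refers to rows of clean_data
# (synthetic rows have no commit message), so only attach the text when
# the original training index is intact.
if X_train.index.equals(train_indices):
    final_train['commit_message'] = clean_data.loc[final_train.index, 'commit_message']
else:
    final_train['commit_message'] = pd.NA
final_test['commit_message'] = clean_data.loc[final_test.index, 'commit_message']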

# Save the processed datasets
final_dataset = pd.concat([final_train, final_test])
final_dataset.to_csv('final_dataset.csv', index=False)
print("Final processed dataset saved to 'final_dataset.csv'")

# Save train and test sets separately
final_train.to_csv('train_dataset.csv', index=False)
final_test.to_csv('test_dataset.csv', index=False)
print("Training dataset saved to 'train_dataset.csv'")
print("Testing dataset saved to 'test_dataset.csv'")

# Save the processed dataset in pickle format for preserving data types
import pickle

with open('processed_data.pkl', 'wb') as f:
    pickle.dump({
        'X_train': X_train,
        'y_train': y_train,
        'X_test': X_test,
        'y_test': y_test,
        'feature_names': numerical_features,
        'scaler': scaler
    }, f)

print("Processed data saved in pickle format to 'processed_data.pkl'")

# Create a preprocessing script that can be used for new data
with open('preprocess.py', 'w') as f:
    f.write("""
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import joblib

def preprocess_commit_data(data):
    # Download NLTK resources if needed
    resources = {
        'punkt': 'tokenizers/punkt',
        'punkt_tab': 'tokenizers/punkt_tab',  # used by recent NLTK releases
        'stopwords': 'corpora/stopwords',
        'wordnet': 'corpora/wordnet',
    }
    for name, path in resources.items():
        try:
            nltk.data.find(path)
        except LookupError:
            nltk.download(name)
    
    # Clean text
    data['clean_commit_message'] = data['commit_message'].apply(clean_text)
    
    # Process text
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    data['processed_message'] = data['clean_commit_message'].apply(
        lambda x: ' '.join([lemmatizer.lemmatize(word) for word in word_tokenize(x) 
                           if word not in stop_words])
    )
    
    # Extract features
    data['message_length'] = data['commit_message'].apply(lambda x: len(str(x)))
    data['word_count'] = data['processed_message'].apply(lambda x: len(str(x).split()))
    
    # Keyword counts
    frontend_keywords = ['ui', 'css', 'html', 'react', 'angular', 'vue', 'component', 'style', 'interface']
    backend_keywords = ['api', 'database', 'query', 'endpoint', 'server', 'service', 'model', 'controller']
    devops_keywords = ['deploy', 'pipeline', 'docker', 'kubernetes', 'ci', 'cd', 'configuration', 'infrastructure']
    qa_keywords = ['test', 'bug', 'fix', 'issue', 'verify', 'validation', 'coverage', 'quality']
    
    data['frontend_keywords'] = data['processed_message'].apply(lambda x: count_keywords(x, frontend_keywords))
    data['backend_keywords'] = data['processed_message'].apply(lambda x: count_keywords(x, backend_keywords))
    data['devops_keywords'] = data['processed_message'].apply(lambda x: count_keywords(x, devops_keywords))
    data['qa_keywords'] = data['processed_message'].apply(lambda x: count_keywords(x, qa_keywords))
    
    # Time features
    # Check that the column exists before accessing it; pd.to_datetime
    # leaves columns that are already datetime unchanged.
    if 'timestamp' in data.columns:
        try:
            data['timestamp'] = pd.to_datetime(data['timestamp'])
            data['day_of_week'] = data['timestamp'].dt.dayofweek
            data['hour_of_day'] = data['timestamp'].dt.hour
        except (ValueError, TypeError):
            # Fall back to zeros when timestamps cannot be parsed
            data['day_of_week'] = 0
            data['hour_of_day'] = 0
    else:
        data['day_of_week'] = 0
        data['hour_of_day'] = 0
    
    # Ratio features
    if 'insertions' in data.columns and 'deletions' in data.columns:
        data['insertion_deletion_ratio'] = data['insertions'] / (data['deletions'] + 1)
    else:
        data['insertion_deletion_ratio'] = 0
    
    # Select features
    features = ['message_length', 'word_count', 'frontend_keywords', 'backend_keywords', 
                'devops_keywords', 'qa_keywords', 'files_changed', 'insertions', 'deletions',
                'insertion_deletion_ratio', 'day_of_week', 'hour_of_day']
    
    # Make sure all required columns exist
    for feature in features:
        if feature not in data.columns:
            data[feature] = 0
    
    # Scale features
    scaler = joblib.load('feature_scaler.pkl')
    data[features] = scaler.transform(data[features])
    
    return data, features

def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r'http\\S+|www\\S+|https\\S+', '', text)
    text = re.sub(r'\\W', ' ', text)
    text = re.sub(r'\\d', ' ', text)
    text = re.sub(r'\\s+', ' ', text).strip()
    return text

def count_keywords(text, keyword_list):
    if not isinstance(text, str):
        return 0
    text = text.lower()
    return sum(1 for keyword in keyword_list if keyword in text)

if __name__ == "__main__":
    # Example usage
    data = pd.read_csv('raw_developer_commits.csv')
    processed_data, features = preprocess_commit_data(data)
    processed_data.to_csv('processed_dataset.csv', index=False)
    print(f"Processed {len(processed_data)} commits and saved to 'processed_dataset.csv'")
""")

print("Preprocessing script saved to 'preprocess.py'")

## Summary

In this notebook, we've processed raw developer commit data into model-ready examples through:

1. **Data Loading and Inspection**: Loaded the raw data, examined its structure, and identified data quality issues.

2. **Data Cleaning**: Handled missing values, removed duplicates, standardized text, and inspected outliers with boxplots (IQR-based removal is available but disabled by default).

3. **Feature Engineering**: Created meaningful features from commit messages and metadata, including:
   - Text-based features: message length, word count, keyword counts
   - Time-based features: day of week, hour of day
   - Ratio-based features: insertion/deletion ratio

4. **Data Normalization**: Standardized numerical features to ensure they're on comparable scales.

5. **Train-Test Split**: Split the data into training and testing sets with stratification to maintain class distribution.

6. **Data Augmentation**: Applied SMOTE to the training set when the imbalance ratio (largest class / smallest class) exceeded 1.5.

7. **Saved Processed Dataset**: Saved the final processed dataset and created a reusable preprocessing script.

The final dataset is now ready for modeling: each commit is represented by 12 scaled numerical features derived from its message, metadata, and timestamp, alongside its role label.