# Deliverable 3: Data Preprocessing Pipeline + Baseline ML Models

## AI-Powered Resume Screening System - Machine Learning

**Goal:** Build a complete data preprocessing pipeline and train baseline ML models to classify resumes into job categories.

---

## Table of Contents
1. Load Dataset
2. Data Exploration & Cleaning
3. Feature Engineering (TF-IDF + Structured Features)
4. Train-Test Split
5. Baseline Model 1: Random Forest
6. Baseline Model 2: Logistic Regression
7. Model Comparison & Evaluation
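8. Detailed Classification Report & Confusion Matrix
9. Feature Importance Analysis
10. Save Models for Future Use
11. Summary & Conclusion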

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Machine Learning libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, classification_report, confusion_matrix)

print("✅ All libraries imported successfully!")

## Step 1: Load Dataset

In [None]:
# Download dataset using kagglehub
import kagglehub

path = kagglehub.dataset_download("snehaanbhawal/resume-dataset")
print(f"Dataset path: {path}")

# Load the data
df = pd.read_csv(path + '/Resume/Resume.csv')

print(f"\n📊 Dataset Information:")
print(f"Total resumes: {len(df)}")
print(f"Number of job categories: {df['Category'].nunique()}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nDataset shape: {df.shape}")

## Step 2: Data Exploration & Cleaning

In [None]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())

# Display category distribution
print("\n📋 Job Category Distribution:")
print(df['Category'].value_counts())

# Visualize category distribution
plt.figure(figsize=(14, 6))
df['Category'].value_counts().plot(kind='bar', color='steelblue')
plt.title('Distribution of Job Categories', fontsize=14, fontweight='bold')
plt.xlabel('Job Category')
plt.ylabel('Number of Resumes')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Text cleaning function
def clean_resume_text(text):
    """
    Clean resume text by:
    1. Removing URLs
    2. Removing special characters
    3. Converting to lowercase
    4. Removing extra whitespace
    """
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # Remove special characters but keep letters, numbers, and spaces
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

# Apply cleaning
print("🧹 Cleaning resume text...")
df['Resume_cleaned'] = df['Resume_str'].apply(clean_resume_text)

# Display example
print("\n📄 Original Resume (first 200 chars):")
print(df['Resume_str'].iloc[0][:200])
print("\n📄 Cleaned Resume (first 200 chars):")
print(df['Resume_cleaned'].iloc[0][:200])

print("\n✅ Text cleaning complete!")
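
To see what the cleaning rules do in practice, here is a minimal sketch that applies the same four steps to a made-up string (the sample text is an assumption, not a row from the dataset). One side effect worth noticing: stripping every non-alphanumeric character turns a token such as `C++` into plain `c`.

```python
import re

def clean_resume_text(text):
    """Same steps as the cell above: drop URLs, strip symbols, lowercase, squeeze whitespace."""
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    text = text.lower()
    return ' '.join(text.split())

# Hypothetical sample text, not taken from the dataset
sample = "Visit https://example.com! Skilled in C++ & Python (5+ yrs)."
cleaned = clean_resume_text(sample)
print(cleaned)  # visit skilled in c python 5 yrs
```

If symbols like `+` or `#` carry meaning for the target job categories, the character class in the second `re.sub` could be widened to keep them.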

## Step 3: Feature Engineering

We'll create two types of features:
1. **TF-IDF Features**: Capture important words in resumes
2. **Structured Features**: Skills count, resume length, word count

In [None]:
# 1. TF-IDF Features (Text Vectorization)
print("🔤 Creating TF-IDF features...")

# TF-IDF converts text into numerical features
# max_features=300 means we keep the 300 most important words
# ngram_range=(1,2) means we consider both single words and pairs of words
vectorizer = TfidfVectorizer(
    max_features=300, 
    stop_words='english',  # Remove common words like 'the', 'is', etc.
    ngram_range=(1, 2)     # Use unigrams and bigrams
)

# Fit and transform the cleaned resume text
# (the cleaned column created in Step 2 is 'Resume_cleaned')
tfidf_features = vectorizer.fit_transform(df['Resume_cleaned'])

print(f"✅ TF-IDF features created: {tfidf_features.shape}")
print(f"   Number of resumes: {tfidf_features.shape[0]}")
print(f"   Number of TF-IDF features: {tfidf_features.shape[1]}")

In [None]:
# 2. Structured Features
print("\n🔧 Creating structured features...")

# Count skills in each resume
def count_skills(text):
    """Count how many of the listed skill keywords appear in the resume (substring match)"""
    skills = ['python', 'java', 'sql', 'javascript', 'machine learning', 
              'aws', 'docker', 'react', 'nodejs', 'git', 'excel', 
              'data analysis', 'leadership', 'management']
    count = sum(1 for skill in skills if skill in text.lower())
    return count

# Create structured features
df['num_skills'] = df['Resume_str'].apply(count_skills)
df['resume_length'] = df['Resume_cleaned'].apply(len)
df['word_count'] = df['Resume_cleaned'].apply(lambda x: len(x.split()))

print("✅ Structured features created:")
print(f"   - num_skills: Count of technical skills")
print(f"   - resume_length: Character count")
print(f"   - word_count: Number of words")

# Display examples
print("\n📊 Feature Examples:")
print(df[['num_skills', 'resume_length', 'word_count']].head())
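
Because `count_skills` uses substring matching, `'java'` also fires on "javascript" and `'git'` on "digital". A possible alternative, sketched here under the assumption that whole-word matches are what is wanted, uses regex word boundaries. The helper names `count_skills_substring` and `count_skills_whole_word` and the sample sentence are hypothetical.

```python
import re

SKILLS = ['python', 'java', 'sql', 'javascript', 'machine learning',
          'aws', 'docker', 'react', 'nodejs', 'git', 'excel',
          'data analysis', 'leadership', 'management']

# One precompiled whole-word pattern per skill
_PATTERNS = [re.compile(r'\b' + re.escape(skill) + r'\b') for skill in SKILLS]

def count_skills_substring(text):
    """Behaviour of the notebook's count_skills: plain substring test."""
    lowered = text.lower()
    return sum(1 for skill in SKILLS if skill in lowered)

def count_skills_whole_word(text):
    """Hypothetical variant: a skill counts only when it appears as a whole word."""
    lowered = text.lower()
    return sum(1 for pattern in _PATTERNS if pattern.search(lowered))

# Hypothetical sample sentence
sample = "JavaScript developer with digital marketing experience"
print(count_skills_substring(sample))   # 3: 'java', 'javascript', 'git' (inside 'digital')
print(count_skills_whole_word(sample))  # 1: 'javascript' only
```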

In [None]:
# 3. Combine all features
from scipy.sparse import hstack

print("\n🔗 Combining features...")

# Get structured features as numpy array
structured_features = df[['num_skills', 'resume_length', 'word_count']].values

# Combine TF-IDF features (sparse matrix) with structured features (dense array)
X = hstack([tfidf_features, structured_features])

print(f"✅ Final feature matrix shape: {X.shape}")
print(f"   Total features: {X.shape[1]} (300 TF-IDF + 3 structured)")

# Prepare labels (job categories)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['Category'])

print(f"\n📋 Labels prepared:")
print(f"   Number of classes: {len(label_encoder.classes_)}")
print(f"   Classes: {label_encoder.classes_[:5]}... (showing first 5)")
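
TF-IDF values lie between 0 and 1, while `resume_length` can run into the thousands, so the raw structured columns may dominate a scale-sensitive model such as Logistic Regression. One option, shown as a sketch on invented numbers, is to standardize the structured columns before stacking. In a real run the scaler would be fit on the training rows only.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for tfidf_features and the three structured columns
tfidf_demo = csr_matrix(np.array([[0.0, 0.7], [0.5, 0.0], [0.3, 0.3]]))
structured_demo = np.array([[2, 5400, 810],
                            [5, 12000, 1900],
                            [1, 3000, 450]], dtype=float)

# Rescale each structured column to zero mean and unit variance
scaler = StandardScaler()
structured_scaled = scaler.fit_transform(structured_demo)

# Stack sparse TF-IDF columns with the scaled dense columns
X_demo = hstack([tfidf_demo, csr_matrix(structured_scaled)]).tocsr()
print(X_demo.shape)
```

Tree-based models such as Random Forest are insensitive to this rescaling, so the step mainly affects the linear baseline.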

## Step 4: Train-Test Split

Split data into training (80%) and testing (20%) sets:

In [None]:
# Split data: 80% training, 20% testing
# stratify=y ensures each class is proportionally represented in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing
    random_state=42,    # For reproducibility
    stratify=y          # Maintain class distribution
)

print("📊 Data Split:")
print(f"Training samples: {X_train.shape[0]} ({X_train.shape[0]/len(df)*100:.1f}%)")
print(f"Testing samples:  {X_test.shape[0]} ({X_test.shape[0]/len(df)*100:.1f}%)")
print(f"Features per sample: {X_train.shape[1]}")
print("\n✅ Ready to train models!")
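
The effect of `stratify=y` is easiest to see on an imbalanced toy label vector. The sketch below uses synthetic data (80 samples of class 0, 20 of class 1) and checks that both splits keep the 80/20 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data: 80 samples of class 0, 20 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts)  # [64 16]
print(test_counts)   # [16  4]
```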

## Step 5: Baseline Model 1 - Random Forest Classifier

Random Forest is an ensemble learning method that combines multiple decision trees.

In [None]:
import time

print("🌲 Training Random Forest Classifier...")
start_time = time.time()

# Create Random Forest model
# n_estimators=100 means we use 100 decision trees
# max_depth=20 limits tree depth to prevent overfitting
# random_state=42 ensures reproducibility
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=20,
    random_state=42,
    n_jobs=-1  # Use all CPU cores for faster training
)

# Train the model
rf_model.fit(X_train, y_train)

training_time = time.time() - start_time
print(f"✅ Training completed in {training_time:.2f} seconds")

# Make predictions on test set
rf_predictions = rf_model.predict(X_test)

# Calculate performance metrics
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_precision = precision_score(y_test, rf_predictions, average='weighted', zero_division=0)
rf_recall = recall_score(y_test, rf_predictions, average='weighted')
rf_f1 = f1_score(y_test, rf_predictions, average='weighted')

print("\n📊 Random Forest Performance:")
print("="*50)
print(f"Accuracy:  {rf_accuracy:.4f} ({rf_accuracy*100:.2f}%)")
print(f"Precision: {rf_precision:.4f}")
print(f"Recall:    {rf_recall:.4f}")
print(f"F1-Score:  {rf_f1:.4f}")
print("="*50)
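
The metrics above use `average='weighted'`, which weights each class's score by its number of test samples. The toy example below uses invented labels to contrast it with `average='macro'`, which treats every class equally and therefore penalizes weak performance on small classes more heavily.

```python
from sklearn.metrics import f1_score

# Invented labels: class 0 has four samples, class 1 has two
y_true_demo = [0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1]

# Per-class F1: class 0 = 8/9, class 1 = 2/3
weighted_f1 = f1_score(y_true_demo, y_pred_demo, average='weighted')
macro_f1 = f1_score(y_true_demo, y_pred_demo, average='macro')

print(f"weighted F1: {weighted_f1:.4f}")  # (4 * 8/9 + 2 * 2/3) / 6 = 0.8148
print(f"macro F1:    {macro_f1:.4f}")     # (8/9 + 2/3) / 2 = 0.7778
```

Reporting both can be informative when the job categories are unevenly sized.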

## Step 6: Baseline Model 2 - Logistic Regression

Logistic Regression is a linear model for classification.

In [None]:
print("📈 Training Logistic Regression Classifier...")
start_time = time.time()

# Create Logistic Regression model
# max_iter=1000 allows enough iterations for convergence
# The default lbfgs solver handles the multiclass job categories directly
lr_model = LogisticRegression(
    max_iter=1000,
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)

# Train the model
lr_model.fit(X_train, y_train)

training_time = time.time() - start_time
print(f"✅ Training completed in {training_time:.2f} seconds")

# Make predictions on test set
lr_predictions = lr_model.predict(X_test)

# Calculate performance metrics
lr_accuracy = accuracy_score(y_test, lr_predictions)
lr_precision = precision_score(y_test, lr_predictions, average='weighted', zero_division=0)
lr_recall = recall_score(y_test, lr_predictions, average='weighted')
lr_f1 = f1_score(y_test, lr_predictions, average='weighted')

print("\n📊 Logistic Regression Performance:")
print("="*50)
print(f"Accuracy:  {lr_accuracy:.4f} ({lr_accuracy*100:.2f}%)")
print(f"Precision: {lr_precision:.4f}")
print(f"Recall:    {lr_recall:.4f}")
print(f"F1-Score:  {lr_f1:.4f}")
print("="*50)
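
In this notebook the TF-IDF vectorizer is fit on all resumes before the train-test split, so vocabulary and IDF statistics from the test set reach the model. A common way to avoid that, sketched here on an invented eight-document corpus, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so the vectorizer only ever sees training text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented mini-corpus with two categories (not from the dataset)
texts = [
    "python sql machine learning models",
    "java backend services docker",
    "python data analysis pandas",
    "react javascript frontend developer",
    "recruiting onboarding employee relations",
    "payroll benefits hiring interviews",
    "talent acquisition employee training",
    "performance reviews hiring policy",
]
labels = ["IT", "IT", "IT", "IT", "HR", "HR", "HR", "HR"]

# Split the raw text first, then let the pipeline fit TF-IDF on the training part only
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

pipe = make_pipeline(
    TfidfVectorizer(stop_words='english', ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000, random_state=42),
)
pipe.fit(train_texts, train_labels)
preds = pipe.predict(test_texts)
print(list(zip(test_texts, preds)))
```

The same pattern applies to Random Forest. The structured features would need a `ColumnTransformer` or similar to join the pipeline, which is left out of this sketch.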

## Step 7: Model Comparison & Evaluation

Let's compare both models side by side:

In [None]:
# Create comparison table
comparison_df = pd.DataFrame({
    'Model': ['Random Forest', 'Logistic Regression'],
    'Accuracy': [rf_accuracy, lr_accuracy],
    'Precision': [rf_precision, lr_precision],
    'Recall': [rf_recall, lr_recall],
    'F1-Score': [rf_f1, lr_f1]
})

print("\n📊 MODEL COMPARISON TABLE")
print("="*70)
print(comparison_df.to_string(index=False))
print("="*70)

# Determine best model
best_model_name = 'Random Forest' if rf_f1 > lr_f1 else 'Logistic Regression'
best_f1 = max(rf_f1, lr_f1)

print(f"\n🏆 Best Model: {best_model_name}")
print(f"   F1-Score: {best_f1:.4f}")

# Bar chart comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: All metrics comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
rf_scores = [rf_accuracy, rf_precision, rf_recall, rf_f1]
lr_scores = [lr_accuracy, lr_precision, lr_recall, lr_f1]

x = np.arange(len(metrics))
width = 0.35

axes[0].bar(x - width/2, rf_scores, width, label='Random Forest', color='steelblue')
axes[0].bar(x + width/2, lr_scores, width, label='Logistic Regression', color='coral')
axes[0].set_ylabel('Score')
axes[0].set_title('Model Performance Comparison')
axes[0].set_xticks(x)
axes[0].set_xticklabels(metrics)
axes[0].legend()
axes[0].set_ylim([0, 1])
axes[0].grid(axis='y', alpha=0.3)

# Plot 2: F1-Score comparison
models = ['Random Forest', 'Logistic Regression']
f1_scores = [rf_f1, lr_f1]
colors = ['steelblue', 'coral']

axes[1].bar(models, f1_scores, color=colors)
axes[1].set_ylabel('F1-Score')
axes[1].set_title('F1-Score Comparison (Higher is Better)')
axes[1].set_ylim([0, 1])
axes[1].grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(f1_scores):
    axes[1].text(i, v + 0.02, f'{v:.4f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

## Step 8: Detailed Classification Report & Confusion Matrix

In [None]:
# Classification report for Random Forest
# (use lr_predictions instead if Logistic Regression scored higher in Step 7)
print("📋 DETAILED CLASSIFICATION REPORT - RANDOM FOREST")
print("="*70)
print(classification_report(y_test, rf_predictions, target_names=label_encoder.classes_))
print("="*70)

In [None]:
# Confusion Matrix for Random Forest
cm = confusion_matrix(y_test, rf_predictions)

plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=label_encoder.classes_, 
            yticklabels=label_encoder.classes_,
            cbar_kws={'label': 'Count'})
plt.title('Confusion Matrix - Random Forest Classifier', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

print("\n💡 Confusion Matrix Interpretation:")
print("   - Diagonal values (dark blue) = Correct predictions")
print("   - Off-diagonal values = Misclassifications")
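
With many job categories the heatmap can be hard to read, so it may help to list the most frequently confused class pairs directly. The sketch below does this on invented labels. The class names are placeholders, and rows of the matrix are true labels while columns are predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels and placeholder class names
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]
class_names = ["HR", "IT", "SALES"]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Zero out the diagonal so only misclassifications remain
off_diag = cm_demo.copy()
np.fill_diagonal(off_diag, 0)

# (true class, predicted class, count) for every non-zero off-diagonal cell
confused = [
    (class_names[i], class_names[j], int(off_diag[i, j]))
    for i, j in zip(*np.nonzero(off_diag))
]
confused.sort(key=lambda pair: pair[2], reverse=True)

for true_name, pred_name, count in confused:
    print(f"{true_name} predicted as {pred_name}: {count}")
```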

## Step 9: Feature Importance Analysis

In [None]:
# Get feature importance from Random Forest
feature_names = vectorizer.get_feature_names_out().tolist() + ['num_skills', 'resume_length', 'word_count']
importances = rf_model.feature_importances_

# Get top 15 most important features
top_indices = np.argsort(importances)[-15:]
top_features = [feature_names[i] for i in top_indices]
top_importances = importances[top_indices]

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(range(15), top_importances, color='steelblue')
plt.yticks(range(15), top_features)
plt.xlabel('Importance Score')
plt.title('Top 15 Most Important Features - Random Forest', fontweight='bold')
plt.tight_layout()
plt.show()

print("\n🔑 Top 5 Most Important Features:")
# argsort is ascending, so reverse the last five to list the most important first
for i, (feature, importance) in enumerate(zip(top_features[-5:][::-1], top_importances[-5:][::-1]), 1):
    print(f"   {i}. {feature}: {importance:.4f}")

## Step 10: Save Models for Future Use

In [None]:
import joblib
import os

# Create directory for models if it doesn't exist
os.makedirs('models', exist_ok=True)

# Save models
joblib.dump(rf_model, 'models/random_forest_model.pkl')
joblib.dump(lr_model, 'models/logistic_regression_model.pkl')
joblib.dump(vectorizer, 'models/tfidf_vectorizer.pkl')
joblib.dump(label_encoder, 'models/label_encoder.pkl')

print("💾 Models saved successfully!")
print("   ✅ models/random_forest_model.pkl")
print("   ✅ models/logistic_regression_model.pkl")
print("   ✅ models/tfidf_vectorizer.pkl")
print("   ✅ models/label_encoder.pkl")
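
A quick check that a saved model reloads correctly is to round-trip it through `joblib` and compare predictions. The sketch below does this with a small synthetic model in a temporary directory, so it does not touch the `models/` folder. The data and the file name `demo_model.pkl` are made up for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends on the sign of the first column
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(40, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

same_predictions = bool(np.array_equal(model.predict(X_demo), restored.predict(X_demo)))
print(same_predictions)
```

When the real models are reloaded for inference, a new resume has to pass through the same steps as in training: `clean_resume_text`, `vectorizer.transform`, the three structured features, and `hstack` in the same column order.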

## Summary & Conclusion

In [None]:
print("="*70)
print("🎉 DELIVERABLE 3 COMPLETE!")
print("="*70)

print("\n📊 SUMMARY:")
print("-" * 70)
print(f"Dataset Size: {len(df)} resumes")
print(f"Number of Job Categories: {len(label_encoder.classes_)}")
print(f"Training Samples: {X_train.shape[0]}")
print(f"Testing Samples: {X_test.shape[0]}")
print(f"Total Features: {X.shape[1]} (300 TF-IDF + 3 structured)")
print("-" * 70)

print("\n🤖 MODELS TRAINED:")
print("-" * 70)
print("1. Random Forest Classifier")
print(f"   - Accuracy: {rf_accuracy:.4f}")
print(f"   - F1-Score: {rf_f1:.4f}")
print()
print("2. Logistic Regression")
print(f"   - Accuracy: {lr_accuracy:.4f}")
print(f"   - F1-Score: {lr_f1:.4f}")
print("-" * 70)

print(f"\n🏆 Best Model: {best_model_name}")
print(f"   F1-Score: {best_f1:.4f}")

print("\n✅ All components successfully implemented:")
print("   ✓ Data preprocessing pipeline")
print("   ✓ Feature engineering (TF-IDF + structured features)")
print("   ✓ Two baseline ML models trained")
print("   ✓ Performance metrics calculated")
print("   ✓ Visualizations created")
print("   ✓ Models saved for future use")

print("\n📈 Next Steps (Deliverable 4):")
print("   → Integrate BERT for semantic matching")
print("   → Implement Reinforcement Learning agent")
print("   → Add SHAP/LIME explainability")
print("="*70)