# Intern Performance Prediction Using Machine Learning

## Generating Synthetic Dataset

### Dataset Overview
Records: 1,500 intern records

Features: 34 columns per record, including the identifier and the target variables

Target Variables: Multiple (performance_tier, final_performance_score, will_convert_to_ft, success_probability)

In [2]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random

# Set random seed for reproducibility
np.random.seed(42)
random.seed(42)

def generate_intern_performance_dataset(n_records=1500):
    # Departments and their performance characteristics
    departments = ['Engineering', 'Data Science', 'Marketing', 'Design', 'Business', 'Research']
    
    # Performance tiers (target variable)
    performance_tiers = ['Low', 'Medium', 'High', 'Exceptional']
    
    # Task types with different difficulty levels
    task_types = ['Documentation', 'Coding', 'Analysis', 'Presentation', 'Research', 'Testing']
    task_difficulty = {'Documentation': 2, 'Coding': 4, 'Analysis': 3, 'Presentation': 3, 'Research': 2, 'Testing': 2}
    
    # Feedback categories
    feedback_categories = ['Technical Skills', 'Communication', 'Problem Solving', 'Teamwork', 'Initiative']
    
    data = []
    
    for i in range(n_records):
        intern_id = f"INT_{i+1:05d}"
        department = random.choice(departments)
        
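        # Base performance probability based on department.
        # NOTE: this mapping is never applied below, so department does not
        # influence final_performance_score.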
        dept_performance_bias = {
            'Engineering': 0.7, 'Data Science': 0.75, 'Marketing': 0.6, 
            'Design': 0.65, 'Business': 0.55, 'Research': 0.8
        }
        
        # Generate engagement metrics (key predictors)
        attendance_rate = np.random.beta(8, 2)  # Beta distribution skewed toward high attendance
        meeting_participation = np.random.beta(6, 3)
        code_commits = np.random.poisson(lam=15 if department in ['Engineering', 'Data Science'] else 8)
        
        # Task completion metrics
        total_tasks_assigned = random.randint(15, 40)
        tasks_completed_on_time = int(total_tasks_assigned * np.random.beta(6, 3))
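        # Cap late completions so on-time + late never exceeds the total assigned
        # (otherwise tasks_incomplete can go negative and completion can exceed 100%)
        tasks_completed_late = min(int(total_tasks_assigned * np.random.beta(2, 4)),
                                   total_tasks_assigned - tasks_completed_on_time)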
        tasks_incomplete = total_tasks_assigned - tasks_completed_on_time - tasks_completed_late
        
        # Task quality scores
        average_task_quality = np.random.beta(7, 3) * 10  # Scale to 0-10
        task_complexity_score = np.random.normal(6, 1.5)
        
        # Feedback scores from mentors (1-10 scale)
        feedback_scores = {}
        for category in feedback_categories:
            base_score = np.random.normal(7, 1.5)
            # Adjust based on attendance and task completion
            if attendance_rate > 0.8:
                base_score += 0.5
            if tasks_completed_on_time / total_tasks_assigned > 0.7:
                base_score += 0.5
            feedback_scores[category] = max(1, min(10, round(base_score, 1)))
        
        # Learning progress metrics
        skill_improvement_rate = np.random.beta(5, 2) * 10
        certification_completion = random.randint(0, 5)
        workshops_attended = random.randint(2, 10)
        
        # Communication metrics
        messages_sent = np.random.poisson(lam=50)
        code_reviews_participated = np.random.poisson(lam=8 if department in ['Engineering', 'Data Science'] else 3)
        questions_asked = np.random.poisson(lam=15)
        
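        # Calculate composite performance score
        # NOTE: the engagement weights below sum to 0.7, not 1.0, which pulls
        # the final score down relative to the tier thresholds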
        engagement_score = (attendance_rate * 0.3 + meeting_participation * 0.2 + 
                           (messages_sent / 100) * 0.1 + (questions_asked / 20) * 0.1)
        
        task_score = ((tasks_completed_on_time / total_tasks_assigned) * 0.4 + 
                     (average_task_quality / 10) * 0.3 + 
                     (task_complexity_score / 10) * 0.3)
        
        feedback_score = sum(feedback_scores.values()) / len(feedback_scores) / 10
        
        learning_score = (skill_improvement_rate / 10 * 0.4 + 
                         (certification_completion / 5) * 0.3 + 
                         (workshops_attended / 10) * 0.3)
        
        # Final performance score (0-100)
        final_performance_score = (
            engagement_score * 25 + 
            task_score * 35 + 
            feedback_score * 25 + 
            learning_score * 15
        )
        
        # Determine performance tier based on final score
        if final_performance_score >= 85:
            performance_tier = 'Exceptional'
            success_probability = np.random.beta(9, 1)  # High success probability
        elif final_performance_score >= 70:
            performance_tier = 'High'
            success_probability = np.random.beta(7, 2)
        elif final_performance_score >= 55:
            performance_tier = 'Medium'
            success_probability = np.random.beta(5, 3)
        else:
            performance_tier = 'Low'
            success_probability = np.random.beta(2, 5)
        
        # Generate timeline data (weekly progress)
        internship_duration = random.choice([8, 12, 16])
        weekly_progress = []
        current_progress = 0
        for week in range(1, internship_duration + 1):
            weekly_growth = np.random.normal(8, 3)  # Weekly progress percentage
            current_progress = min(100, current_progress + max(0, weekly_growth))
            weekly_progress.append(round(current_progress, 1))
        
        # Final outcomes
        will_convert_to_ft = np.random.binomial(1, success_probability)
        recommendation_score = final_performance_score / 100 * np.random.beta(8, 2)
        
        data.append({
            'intern_id': intern_id,
            'department': department,
            'internship_duration_weeks': internship_duration,
            'attendance_rate': round(attendance_rate * 100, 1),
            'meeting_participation_rate': round(meeting_participation * 100, 1),
            'total_tasks_assigned': total_tasks_assigned,
            'tasks_completed_on_time': tasks_completed_on_time,
            'tasks_completed_late': tasks_completed_late,
            'tasks_incomplete': tasks_incomplete,
            'task_completion_rate': round((tasks_completed_on_time + tasks_completed_late) / total_tasks_assigned * 100, 1),
            'on_time_completion_rate': round(tasks_completed_on_time / total_tasks_assigned * 100, 1),
            'average_task_quality': round(average_task_quality, 1),
            'task_complexity_score': round(task_complexity_score, 1),
            'code_commits': code_commits,
            'messages_sent': messages_sent,
            'code_reviews_participated': code_reviews_participated,
            'questions_asked': questions_asked,
            'skill_improvement_rate': round(skill_improvement_rate, 1),
            'certification_completion': certification_completion,
            'workshops_attended': workshops_attended,
            'technical_skills_feedback': feedback_scores['Technical Skills'],
            'communication_feedback': feedback_scores['Communication'],
            'problem_solving_feedback': feedback_scores['Problem Solving'],
            'teamwork_feedback': feedback_scores['Teamwork'],
            'initiative_feedback': feedback_scores['Initiative'],
            'average_feedback_score': round(sum(feedback_scores.values()) / len(feedback_scores), 1),
            'final_performance_score': round(final_performance_score, 1),
            'performance_tier': performance_tier,
            'success_probability': round(success_probability, 3),
            'will_convert_to_ft': bool(will_convert_to_ft),
            'recommendation_score': round(recommendation_score, 3),
            'weekly_progress_trend': ';'.join(map(str, weekly_progress)),
            'engagement_score': round(engagement_score * 100, 1),
            'learning_velocity': round(skill_improvement_rate / internship_duration, 2)
        })
    
    return pd.DataFrame(data)

# Generate the dataset
print("Generating intern performance prediction dataset...")
df = generate_intern_performance_dataset(1500)

# Save to CSV
csv_filename = 'intern_performance_prediction.csv'
df.to_csv(csv_filename, index=False)

print(f"Dataset successfully saved as '{csv_filename}'")
print(f"Dataset shape: {df.shape}")

# Display comprehensive summary
print("\n" + "="*60)
print("DATASET SUMMARY FOR INTERN PERFORMANCE PREDICTION")
print("="*60)

print(f"\nTotal records: {len(df):,}")
print(f"Number of features: {len(df.columns)}")

print(f"\n=== PERFORMANCE DISTRIBUTION ===")
performance_dist = df['performance_tier'].value_counts().sort_index()
performance_pct = df['performance_tier'].value_counts(normalize=True).sort_index() * 100
for tier in performance_dist.index:
    print(f"{tier:<12}: {performance_dist[tier]:>4} interns ({performance_pct[tier]:.1f}%)")

print(f"\n=== DEPARTMENT-WISE PERFORMANCE ===")
dept_performance = df.groupby('department').agg({
    'final_performance_score': 'mean',
    'performance_tier': lambda x: (x == 'Exceptional').mean(),
    'intern_id': 'count'
}).round(3)
dept_performance.columns = ['avg_performance_score', 'exceptional_rate', 'count']
print(dept_performance)

print(f"\n=== KEY METRIC CORRELATIONS WITH PERFORMANCE ===")
# Calculate correlations with final performance score
correlation_metrics = [
    'attendance_rate', 'meeting_participation_rate', 'task_completion_rate',
    'on_time_completion_rate', 'average_task_quality', 'code_commits',
    'average_feedback_score', 'skill_improvement_rate'
]

correlations = df[correlation_metrics + ['final_performance_score']].corr()['final_performance_score'].drop('final_performance_score')
print("Correlation with Final Performance Score:")
for metric, corr in correlations.items():
    print(f"{metric:<30}: {corr:.3f}")

print(f"\n=== CONVERSION RATES BY PERFORMANCE TIER ===")
conversion_rates = df.groupby('performance_tier')['will_convert_to_ft'].mean()
for tier, rate in conversion_rates.items():
    print(f"{tier:<12}: {rate:.1%}")

print(f"\n=== TOP 10 PREDICTORS FOR MACHINE LEARNING ===")
# Feature importance preview: hand-assigned labels, not computed from the data
feature_importance_preview = {
    'attendance_rate': 'High',
    'on_time_completion_rate': 'High', 
    'average_task_quality': 'High',
    'average_feedback_score': 'High',
    'skill_improvement_rate': 'Medium-High',
    'meeting_participation_rate': 'Medium',
    'code_commits': 'Medium',
    'questions_asked': 'Medium',
    'technical_skills_feedback': 'Medium',
    'task_complexity_score': 'Low-Medium'
}
for feature, importance in feature_importance_preview.items():
    print(f"{feature:<30}: {importance}")


Generating intern performance prediction dataset...
Dataset successfully saved as 'intern_performance_prediction.csv'
Dataset shape: (1500, 34)

DATASET SUMMARY FOR INTERN PERFORMANCE PREDICTION

Total records: 1,500
Number of features: 34

=== PERFORMANCE DISTRIBUTION ===
High        :   88 interns (5.9%)
Low         :   54 interns (3.6%)
Medium      : 1358 interns (90.5%)

=== DEPARTMENT-WISE PERFORMANCE ===
              avg_performance_score  exceptional_rate  count
department                                                  
Business                     62.917               0.0    245
Data Science                 63.511               0.0    219
Design                       63.130               0.0    271
Engineering                  62.668               0.0    234
Marketing                    63.114               0.0    242
Research                     63.071               0.0    289

=== KEY METRIC CORRELATIONS WITH PERFORMANCE ===
Correlation with Final Performance Score:
attend
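
The performance distribution printout above lists tiers alphabetically because sort_index() sorts the string labels, and a tier with no interns is omitted entirely. One possible way to print tiers in their natural order, with zero counts included, is to reindex against an explicit tier list. A minimal sketch on a small hypothetical series (sample_tiers is illustrative, not the notebook's data):

```python
import pandas as pd

# Natural ordering of the tiers used by the generator
tier_order = ['Low', 'Medium', 'High', 'Exceptional']

# Hypothetical stand-in for df['performance_tier']
sample_tiers = pd.Series(['Medium', 'High', 'Medium', 'Low', 'Medium'])

# reindex() enforces the order and fills missing tiers with 0
counts = sample_tiers.value_counts().reindex(tier_order, fill_value=0)

for tier, n in counts.items():
    print(f"{tier:<12}: {n:>4} interns ({n / len(sample_tiers):.1%})")
```

In the notebook, the same reindex call could be applied to df['performance_tier'].value_counts() so that an empty Exceptional tier shows up as 0 instead of disappearing.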

## Initial Analysis of Generated Dataset

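### Performance Distribution
The generated dataset is heavily concentrated in the Medium tier, and no intern reaches Exceptional. Department averages sit near 63 points, well below the 85-point Exceptional cutoff:

Low: 54 interns (3.6%)

Medium: 1,358 interns (90.5%)

High: 88 interns (5.9%)

Exceptional: 0 interns (0.0%)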
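
A rough check of where scores centre can be made by plugging the mean of each sampling distribution into the scoring formula. The sketch below makes two simplifications that are assumptions, not part of the notebook: the mean feedback score is bracketed between 7.0 and 8.0 (each score starts from a mean of 7 and can gain up to 1 point in bonuses), and integer truncation and clipping are ignored.

```python
# Means of the distributions used by the generator
attendance = 8 / (8 + 2)    # Beta(8, 2)
meeting = 6 / (6 + 3)       # Beta(6, 3)
messages = 50 / 100         # Poisson(50), divided by 100 in the formula
questions = 15 / 20         # Poisson(15), divided by 20 in the formula
on_time = 6 / (6 + 3)       # Beta(6, 3), ignoring int() truncation
quality = 7 / (7 + 3)       # Beta(7, 3), expressed on a 0-1 scale
complexity = 6 / 10         # Normal(6, 1.5), divided by 10 in the formula
skill = 5 / (5 + 2)         # Beta(5, 2), expressed on a 0-1 scale
certs = 2.5 / 5             # randint(0, 5) has mean 2.5
workshops = 6 / 10          # randint(2, 10) has mean 6

# Component scores with the same weights as the generator
engagement = 0.3 * attendance + 0.2 * meeting + 0.1 * messages + 0.1 * questions
task = 0.4 * on_time + 0.3 * quality + 0.3 * complexity
learning = 0.4 * skill + 0.3 * certs + 0.3 * workshops

def expected_score(feedback_mean):
    """Approximate mean final score for a given mean feedback (1-10 scale)."""
    return 25 * engagement + 35 * task + 25 * (feedback_mean / 10) + 15 * learning

low, high = expected_score(7.0), expected_score(8.0)
print(f"Expected score between {low:.1f} and {high:.1f}")
# Expected score between 62.2 and 64.7
```

Under these assumptions the typical score lands in the Medium band (55 to 70), which is consistent with the department averages of about 63 in the printed output.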

### Department Performance Insights
Department has almost no effect on performance in this run, because dept_performance_bias is defined in the generator but never applied:

Data Science: Highest average performance (63.5)

Engineering: Lowest average performance (62.7)

All departments: Averages within 0.9 points of each other, with an exceptional rate of 0.0%
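
The department table in the printed output can be summarised by the gap between the best and worst department averages. A small sketch using the means transcribed from that table (variable names are illustrative):

```python
# Average final_performance_score per department, copied from the printed table
dept_means = {
    'Business': 62.917,
    'Data Science': 63.511,
    'Design': 63.130,
    'Engineering': 62.668,
    'Marketing': 63.114,
    'Research': 63.071,
}

highest = max(dept_means, key=dept_means.get)
lowest = min(dept_means, key=dept_means.get)
spread = dept_means[highest] - dept_means[lowest]

print(f"{highest} - {lowest} = {spread:.3f} points")
# Data Science - Engineering = 0.843 points
```

A gap of less than one point on a 100-point scale is what one would expect when department only affects code_commits and code_reviews_participated, neither of which enters the score.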

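### Strong Predictive Features
The correlation printout above is truncated, so measured correlations are not shown. The scoring formula does fix the maximum number of points, out of 100, that each feature can contribute:

Average Feedback Score: 25 points

On-time Completion Rate: 14 points (0.4 × 35)

Average Task Quality: 10.5 points (0.3 × 35)

Attendance Rate: 7.5 points (0.3 × 25)

Skill Improvement Rate: 6 points (0.4 × 15)

Task Completion Rate does not enter the formula directly; it relates to the score only through its on-time component.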
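
To measure how strongly each feature tracks the score, the correlations can be computed and ranked directly. The helper below is a sketch: rank_correlations and the toy data frame are hypothetical names introduced for illustration, and the toy relationship (one strong driver, one weak driver) is an assumption, not the notebook's data.

```python
import numpy as np
import pandas as pd

def rank_correlations(frame, target, features):
    """Pearson correlation of each feature with target, strongest first."""
    corr = frame[features + [target]].corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data: 'strong' drives the score far more than 'weak'
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'strong': rng.normal(size=500),
    'weak': rng.normal(size=500),
})
toy['score'] = 3.0 * toy['strong'] + 0.5 * toy['weak'] + rng.normal(size=500)

ranked = rank_correlations(toy, 'score', ['weak', 'strong'])
print(ranked.round(3))
```

On the generated dataset, the equivalent call would be rank_correlations(df, 'final_performance_score', correlation_metrics). Features that are sampled independently of one another cannot all correlate very strongly with the same score, so a ranking with several values above 0.8 would be a sign that something is off.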

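### Conversion Rates by Performance
The per-tier conversion printout is not included in the output above. By construction, will_convert_to_ft is a Bernoulli draw whose probability comes from a tier-specific Beta(a, b) distribution, so the expected conversion rate for each tier is a / (a + b):

Exceptional: 90.0% expected, Beta(9, 1); no intern reached this tier in the run above

High: 77.8% expected, Beta(7, 2)

Medium: 62.5% expected, Beta(5, 3)

Low: 28.6% expected, Beta(2, 5)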
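
The generator draws success_probability from a tier-specific Beta distribution and then draws will_convert_to_ft as a single Bernoulli trial, so the conversion rate to expect for each tier is the Beta mean a / (a + b). A minimal sketch that checks this by simulation with the standard library only (the sample size and seed are arbitrary choices, not taken from the notebook):

```python
import random

# Beta(a, b) parameters assigned to each tier in the generator
tier_params = {
    'Exceptional': (9, 1),
    'High': (7, 2),
    'Medium': (5, 3),
    'Low': (2, 5),
}

rng = random.Random(0)
n_draws = 20_000

expected = {}
simulated = {}
for tier, (a, b) in tier_params.items():
    # Analytical mean of Beta(a, b)
    expected[tier] = a / (a + b)
    # Draw a probability, then a Bernoulli outcome, as the generator does
    conversions = sum(rng.random() < rng.betavariate(a, b) for _ in range(n_draws))
    simulated[tier] = conversions / n_draws

for tier in tier_params:
    print(f"{tier:<12}: expected {expected[tier]:.1%}, simulated {simulated[tier]:.1%}")
```

Observed rates in a 1,500-record dataset will deviate from these expectations by sampling noise, most visibly in tiers with few interns.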