# Exploratory Data Analysis: AI Elite Scores Dataset

## Comprehensive EDA Report
**Dataset**: AI Elite Scores Data  
**Date**: January 2026  
**Objective**: Detailed statistical and visual exploration of student performance scores

---

## Table of Contents
1. [Data Loading & Cleaning](#data-loading)
2. [Descriptive Statistics](#statistics)
3. [Batch-wise Analysis](#batch-analysis)
4. [Distribution Analysis](#distributions)
5. [Visualizations](#visualizations)
6. [Key Insights](#insights)


## 1. Data Loading & Cleaning <a id='data-loading'></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Load data
data = pd.read_csv('scores_data.csv')

# Strip whitespace
data.columns = data.columns.str.strip()
data = data.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)

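# Extract the leading integer from each score string.
# A raw string avoids the invalid-escape warning for \d; expand=False returns a Series.
data['Score_Numeric'] = data['Score'].str.extract(r'(\d+)', expand=False).astype(int)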

print('Dataset Shape:', data.shape)
print('\nFirst Few Records:')
print(data.head(10))
print('\nData Info:')
data.info()  # info() prints directly and returns None, so it is not wrapped in print()
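
The extraction step above fails with an opaque error if any `Score` entry has no digits, and it accepts values outside the 0-7 scale without complaint. A minimal sketch of a stricter parser follows. The helper name `parse_scores`, the `max_score` parameter, and the sample strings such as `'5/7'` are assumptions for illustration and are not taken from `scores_data.csv`.

```python
import pandas as pd

def parse_scores(raw: pd.Series, max_score: int = 7) -> pd.Series:
    """Extract the first integer from each entry and check it lies in 0..max_score."""
    # expand=False returns a Series; entries without digits become NaN
    extracted = raw.astype(str).str.extract(r'(\d+)', expand=False)
    scores = pd.to_numeric(extracted, errors='coerce')
    bad = scores.isna() | (scores < 0) | (scores > max_score)
    if bad.any():
        raise ValueError(
            f'{int(bad.sum())} unparseable or out-of-range score(s): {raw[bad].tolist()}'
        )
    return scores.astype(int)

# Hypothetical sample values, used only to exercise the helper
sample = pd.Series(['5/7', ' 7 / 7', '0/7'])
parsed = parse_scores(sample)
print(parsed.tolist())
```

Raising early keeps a malformed row from silently skewing every statistic computed later in the report.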

## 2. Descriptive Statistics <a id='statistics'></a>

In [None]:
print('=' * 80)
print('COMPREHENSIVE DESCRIPTIVE STATISTICS')
print('=' * 80)

print('\n1. OVERALL SCORE STATISTICS:')
print(f'   Total Students: {len(data)}')
print(f'   Mean Score: {data["Score_Numeric"].mean():.4f}/7')
print(f'   Median Score: {data["Score_Numeric"].median():.4f}/7')
print(f'   Mode Score: {data["Score_Numeric"].mode().values[0]}/7')
print(f'   Std Deviation: {data["Score_Numeric"].std():.4f}')
print(f'   Variance: {data["Score_Numeric"].var():.4f}')
print(f'   Min Score: {data["Score_Numeric"].min()}/7')
print(f'   Max Score: {data["Score_Numeric"].max()}/7')
print(f'   Range: {data["Score_Numeric"].max() - data["Score_Numeric"].min()}')

print('\n2. QUANTILES:')
for q in [0.25, 0.50, 0.75]:
    print(f'   Q{int(q*100)}: {data["Score_Numeric"].quantile(q):.2f}')

print('\n3. DISTRIBUTION CHARACTERISTICS:')
print(f'   Skewness: {data["Score_Numeric"].skew():.4f}')
print(f'   Kurtosis: {data["Score_Numeric"].kurtosis():.4f}')
print(f'   IQR: {data["Score_Numeric"].quantile(0.75) - data["Score_Numeric"].quantile(0.25):.4f}')

print('\n4. SCORE DISTRIBUTION:')
score_dist = data['Score_Numeric'].value_counts().sort_index()
print(score_dist)
print('\n   Percentage:')
for score, count in score_dist.items():
    pct = (count / len(data)) * 100
    print(f'   Score {int(score)}: {pct:.2f}% ({int(count)} students)')

## 3. Batch-wise Analysis <a id='batch-analysis'></a>

In [None]:
print('\n' + '=' * 80)
print('BATCH-WISE PERFORMANCE ANALYSIS')
print('=' * 80)

# Batch distribution
print('\n1. BATCH DISTRIBUTION:')
batch_dist = data['Batch'].value_counts().sort_index()
print(batch_dist)

# Detailed batch statistics
print('\n2. BATCH STATISTICS:')
batch_stats = data.groupby('Batch')['Score_Numeric'].agg([
    ('Count', 'count'),
    ('Mean', 'mean'),
    ('Median', 'median'),
    ('Std Dev', 'std'),
    ('Min', 'min'),
    ('Max', 'max')
]).round(4)
print(batch_stats)

# Batch ranking
print('\n3. BATCH RANKING (by Mean Score):')
batch_ranking = data.groupby('Batch')['Score_Numeric'].mean().sort_values(ascending=False)
for rank, (batch, score) in enumerate(batch_ranking.items(), 1):
    print(f'   {rank}. {batch}: {score:.4f}/7')

# Performance categories
print('\n4. PERFORMANCE CATEGORIES:')
def categorize_performance(score):
    if score >= 6:
        return 'Excellent'
    elif score >= 5:
        return 'Good'
    elif score >= 3:
        return 'Average'
    else:
        return 'Below Average'

data['Performance'] = data['Score_Numeric'].apply(categorize_performance)

perf_dist = data['Performance'].value_counts()
print(perf_dist)
print('\n   Performance Percentages:')
for perf, count in perf_dist.items():
    pct = (count / len(data)) * 100
    print(f'   {perf}: {pct:.2f}%')
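
The same thresholds can also be expressed with `pd.cut`, which yields an ordered categorical so that counts print from lowest to highest category rather than by frequency. This is a sketch of an alternative, not the method used in the cell above. The bin edges are chosen to reproduce `categorize_performance` for integer scores, and the sample scores are hypothetical.

```python
import pandas as pd

# Hypothetical integer scores on the 0-7 scale
scores = pd.Series([0, 2, 3, 4, 5, 6, 7])

# Right-closed bins: (-1, 2], (2, 4], (4, 5], (5, 7]
bins = [-1, 2, 4, 5, 7]
labels = ['Below Average', 'Average', 'Good', 'Excellent']
performance = pd.cut(scores, bins=bins, labels=labels)

# sort=False keeps the category order instead of sorting by count
print(performance.value_counts(sort=False))
```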

## 4. Distribution Analysis <a id='distributions'></a>

In [None]:
print('\n' + '=' * 80)
print('DISTRIBUTION ANALYSIS')
print('=' * 80)

# Normality test
print('\n1. NORMALITY TEST (Shapiro-Wilk):')
stat, p_value = stats.shapiro(data['Score_Numeric'])
print(f'   Test Statistic: {stat:.4f}')
print(f'   P-value: {p_value:.6f}')
# A large p-value means normality is not rejected; it does not prove the data are normal.
if p_value < 0.05:
    print('   Result: Normality rejected (p < 0.05)')
else:
    print('   Result: No evidence against normality (p >= 0.05)')

# Outlier detection
print('\n2. OUTLIER ANALYSIS (IQR Method):')
Q1 = data['Score_Numeric'].quantile(0.25)
Q3 = data['Score_Numeric'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = data[(data['Score_Numeric'] < lower_bound) | (data['Score_Numeric'] > upper_bound)]
print(f'   Q1 (25th percentile): {Q1:.2f}')
print(f'   Q3 (75th percentile): {Q3:.2f}')
print(f'   IQR: {IQR:.2f}')
print(f'   Lower Bound: {lower_bound:.2f}')
print(f'   Upper Bound: {upper_bound:.2f}')
print(f'   Number of Outliers: {len(outliers)}')

# Top and bottom performers
print('\n3. TOP PERFORMERS:')
top_5 = data.nlargest(5, 'Score_Numeric')[['User_ID', 'Batch', 'Score_Numeric']]
print(top_5.to_string(index=False))

print('\n4. BOTTOM PERFORMERS:')
bottom_5 = data.nsmallest(5, 'Score_Numeric')[['User_ID', 'Batch', 'Score_Numeric']]
print(bottom_5.to_string(index=False))
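
To make the IQR rule used above concrete, here is a small worked sketch on a toy series. The helper name `iqr_outliers`, the multiplier argument `k`, and the toy values are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5):
    """Return (lower bound, upper bound, flagged values) under the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = values[(values < lower) | (values > upper)]
    return lower, upper, flagged

# Hypothetical scores: Q1 = 4, Q3 = 5, so IQR = 1 and the bounds are 2.5 and 6.5
toy = pd.Series([4, 4, 4, 5, 5, 5, 5, 6, 0])
lower, upper, flagged = iqr_outliers(toy)
print(lower, upper, flagged.tolist())
```

On a short integer scale the IQR can be as small as 1, so the bounds may be narrow enough to flag scores that are low but perfectly valid. The count is best read alongside the bounds themselves.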

## 5. Visualizations <a id='visualizations'></a>

In [None]:
# Create comprehensive visualization
fig = plt.figure(figsize=(20, 14))

# 1. Overall Score Distribution
ax1 = plt.subplot(3, 3, 1)
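# Centre one bin on each integer score (0-7) so the count labels below line up with the bars
plt.hist(data['Score_Numeric'], bins=np.arange(-0.5, 8.5, 1), color='steelblue', edgecolor='black', alpha=0.7)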
plt.xlabel('Score (out of 7)', fontsize=11, fontweight='bold')
plt.ylabel('Frequency', fontsize=11, fontweight='bold')
plt.title('Score Distribution - All Students', fontsize=12, fontweight='bold')
plt.grid(axis='y', alpha=0.3)
for i in range(8):
    count = (data['Score_Numeric'] == i).sum()
    plt.text(i, count + 0.5, str(count), ha='center', fontweight='bold')

# 2. Box Plot by Batch
ax2 = plt.subplot(3, 3, 2)
data.boxplot(column='Score_Numeric', by='Batch', ax=ax2, patch_artist=True)
plt.xlabel('Batch', fontsize=11, fontweight='bold')
plt.ylabel('Score', fontsize=11, fontweight='bold')
plt.title('Score Distribution by Batch', fontsize=12, fontweight='bold')
plt.suptitle('')

# 3. Violin Plot
ax3 = plt.subplot(3, 3, 3)
sns.violinplot(data=data, x='Batch', y='Score_Numeric', palette='Set2', ax=ax3)
plt.xlabel('Batch', fontsize=11, fontweight='bold')
plt.ylabel('Score', fontsize=11, fontweight='bold')
plt.title('Score Distribution Density by Batch', fontsize=12, fontweight='bold')

# 4. Performance Pie Chart
ax4 = plt.subplot(3, 3, 4)
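perf_counts = data['Performance'].value_counts()
# Map colours by category so each category keeps the same colour regardless of frequency order
perf_colors = {'Excellent': '#2ecc71', 'Good': '#3498db', 'Average': '#f39c12', 'Below Average': '#e74c3c'}
plt.pie(perf_counts.values, labels=perf_counts.index, autopct='%1.1f%%',
        colors=[perf_colors[c] for c in perf_counts.index], startangle=90, textprops={'fontweight': 'bold'})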
plt.title('Performance Category Distribution', fontsize=12, fontweight='bold')

# 5. Batch Comparison
ax5 = plt.subplot(3, 3, 5)
# Use a separate name so the batch_stats table from Section 3 is not overwritten
batch_means = data.groupby('Batch')['Score_Numeric'].mean().sort_values(ascending=False)
bars = plt.bar(range(len(batch_means)), batch_means.values,
              color=['#27ae60', '#2980b9', '#e67e22'], edgecolor='black', alpha=0.8)
plt.xticks(range(len(batch_means)), batch_means.index, fontweight='bold')
plt.ylabel('Mean Score', fontsize=11, fontweight='bold')
plt.title('Average Score by Batch', fontsize=12, fontweight='bold')
plt.ylim([0, 7])
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
            f'{height:.2f}', ha='center', va='bottom', fontweight='bold')
plt.grid(axis='y', alpha=0.3)

# 6. Cumulative Distribution
ax6 = plt.subplot(3, 3, 6)
sorted_scores = np.sort(data['Score_Numeric'])
cumulative = np.arange(1, len(sorted_scores)+1) / len(sorted_scores) * 100
plt.plot(sorted_scores, cumulative, marker='o', linewidth=2.5, markersize=6, color='#e74c3c')
plt.xlabel('Score', fontsize=11, fontweight='bold')
plt.ylabel('Cumulative Percentage (%)', fontsize=11, fontweight='bold')
plt.title('Cumulative Distribution Function', fontsize=12, fontweight='bold')
plt.grid(True, alpha=0.3)

# 7. Score Count by Batch
ax7 = plt.subplot(3, 3, 7)
batch_score_counts = pd.crosstab(data['Batch'], data['Score_Numeric'])
batch_score_counts.T.plot(kind='bar', ax=ax7, color=['#3498db', '#e74c3c', '#2ecc71'],
                          width=0.8, edgecolor='black')
plt.xlabel('Score', fontsize=11, fontweight='bold')
plt.ylabel('Count', fontsize=11, fontweight='bold')
plt.title('Score Distribution by Batch (Detailed)', fontsize=12, fontweight='bold')
plt.legend(title='Batch', loc='upper right')
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)

# 8. Student Count per Batch
ax8 = plt.subplot(3, 3, 8)
batch_counts = data['Batch'].value_counts().sort_index()
bars = plt.bar(range(len(batch_counts)), batch_counts.values, color='#9b59b6',
              edgecolor='black', linewidth=1.5, alpha=0.8)
plt.xticks(range(len(batch_counts)), batch_counts.index, fontweight='bold')
plt.ylabel('Number of Students', fontsize=11, fontweight='bold')
plt.title('Sample Size by Batch', fontsize=12, fontweight='bold')
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(height)}', ha='center', va='bottom', fontweight='bold')
plt.grid(axis='y', alpha=0.3)

# 9. Q-Q Plot
ax9 = plt.subplot(3, 3, 9)
stats.probplot(data['Score_Numeric'], dist="norm", plot=plt)
plt.title('Q-Q Plot - Normality Assessment', fontsize=12, fontweight='bold')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
print('✓ Comprehensive visualization created')

In [None]:
# Additional focused visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Heatmap
ax_heat = axes[0, 0]
heatmap_data = pd.crosstab(data['Batch'], data['Score_Numeric'], normalize='index') * 100
sns.heatmap(heatmap_data, annot=True, fmt='.1f', cmap='YlOrRd', ax=ax_heat,
           cbar_kws={'label': 'Percentage (%)'})
ax_heat.set_title('Score Distribution Heatmap by Batch (%)', fontsize=12, fontweight='bold')
ax_heat.set_ylabel('Batch', fontsize=11, fontweight='bold')
ax_heat.set_xlabel('Score', fontsize=11, fontweight='bold')

# KDE Plot
ax_kde = axes[0, 1]
for batch in sorted(data['Batch'].unique()):
    batch_data = data[data['Batch'] == batch]['Score_Numeric']
    batch_data.plot.kde(ax=ax_kde, label=batch, linewidth=2.5)
ax_kde.set_xlabel('Score', fontsize=11, fontweight='bold')
ax_kde.set_ylabel('Density', fontsize=11, fontweight='bold')
ax_kde.set_title('Kernel Density Estimation by Batch', fontsize=12, fontweight='bold')
ax_kde.legend(fontsize=10)
ax_kde.grid(True, alpha=0.3)

# Performance by Batch
ax_perf = axes[1, 0]
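# Fix the column order and map colours by category so each category keeps a consistent colour
perf_colors = {'Excellent': '#2ecc71', 'Good': '#3498db', 'Average': '#f39c12', 'Below Average': '#e74c3c'}
perf_batch = pd.crosstab(data['Batch'], data['Performance']).reindex(columns=list(perf_colors), fill_value=0)
perf_batch.plot(kind='bar', ax=ax_perf, color=[perf_colors[c] for c in perf_batch.columns],
               width=0.8, edgecolor='black')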
ax_perf.set_ylabel('Count', fontsize=11, fontweight='bold')
ax_perf.set_xlabel('Batch', fontsize=11, fontweight='bold')
ax_perf.set_title('Performance Categories by Batch', fontsize=12, fontweight='bold')
ax_perf.legend(title='Performance', loc='upper right')
plt.setp(ax_perf.xaxis.get_majorticklabels(), rotation=0)
ax_perf.grid(axis='y', alpha=0.3)

# Statistics table
ax_table = axes[1, 1]
ax_table.axis('tight')
ax_table.axis('off')

stats_data = []
for batch in sorted(data['Batch'].unique()):
    batch_data = data[data['Batch'] == batch]['Score_Numeric']
    stats_data.append([batch, f"{batch_data.count()}", f"{batch_data.mean():.3f}",
                      f"{batch_data.median():.1f}", f"{batch_data.std():.3f}",
                      f"{int(batch_data.min())}/{int(batch_data.max())}"])

overall_data = data['Score_Numeric']
stats_data.append(['Overall', f"{overall_data.count()}", f"{overall_data.mean():.3f}",
                  f"{overall_data.median():.1f}", f"{overall_data.std():.3f}",
                  f"{int(overall_data.min())}/{int(overall_data.max())}"])

table = ax_table.table(cellText=stats_data,
                       colLabels=['Batch', 'N', 'Mean', 'Median', 'Std Dev', 'Min/Max'],
                       cellLoc='center', loc='center', bbox=[0, 0, 1, 1])
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2)

for i in range(6):
    table[(0, i)].set_facecolor('#3498db')
    table[(0, i)].set_text_props(weight='bold', color='white')

for i in range(6):
    table[(len(stats_data), i)].set_facecolor('#ecf0f1')
    table[(len(stats_data), i)].set_text_props(weight='bold')

ax_table.set_title('Summary Statistics by Batch', fontsize=12, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()
print('✓ Detailed visualizations created')

## 6. Key Insights <a id='insights'></a>

In [None]:
print('\n' + '=' * 80)
print('KEY INSIGHTS & FINDINGS')
print('=' * 80)

print(f"""
1. DATASET OVERVIEW:
   • Total Students Analyzed: {len(data)}
   • Number of Batches: {data['Batch'].nunique()}
   • Score Range: {data['Score_Numeric'].min()}-{data['Score_Numeric'].max()} out of 7
   • Data Quality: {data.notna().all(axis=1).mean() * 100:.1f}% of rows complete ({int(data.isna().sum().sum())} missing values)

2. CENTRAL TENDENCY:
   • Mean Score: {data['Score_Numeric'].mean():.3f}/7 ({data['Score_Numeric'].mean() / 7 * 100:.1f}% of maximum)
   • Median Score: {data['Score_Numeric'].median():.1f}/7 ({data['Score_Numeric'].median() / 7 * 100:.1f}% of maximum)
   • Most Common Score: {data['Score_Numeric'].mode().values[0]}/7 (occurs {(data['Score_Numeric'] == data['Score_Numeric'].mode().values[0]).sum()} times)

3. VARIABILITY:
   • Standard Deviation: {data['Score_Numeric'].std():.3f}
   • Coefficient of Variation: {(data['Score_Numeric'].std() / data['Score_Numeric'].mean() * 100):.2f}%
   • Interquartile Range: {data['Score_Numeric'].quantile(0.75) - data['Score_Numeric'].quantile(0.25):.1f}
   • Skewness: {data['Score_Numeric'].skew():.3f} (values near 0 indicate a roughly symmetric distribution)

4. BATCH PERFORMANCE:
   • Best Batch: {batch_ranking.idxmax()} (Mean: {batch_ranking.max():.3f}/7)
   • Second: {batch_ranking.index[1]} (Mean: {batch_ranking.iloc[1]:.3f}/7)
   • Third: {batch_ranking.index[2]} (Mean: {batch_ranking.iloc[2]:.3f}/7)
   • Performance Gap: {batch_ranking.max() - batch_ranking.min():.3f} points

5. PERFORMANCE DISTRIBUTION:
   • Excellent (6-7): {len(data[data['Score_Numeric'] >= 6])} students ({len(data[data['Score_Numeric'] >= 6])/len(data)*100:.1f}%)
   • Good (5): {len(data[data['Score_Numeric'] == 5])} students ({len(data[data['Score_Numeric'] == 5])/len(data)*100:.1f}%)
   • Average (3-4): {len(data[data['Score_Numeric'].isin([3,4])])} students ({len(data[data['Score_Numeric'].isin([3,4])])/len(data)*100:.1f}%)
   • Below Average (0-2): {len(data[data['Score_Numeric'] < 3])} students ({len(data[data['Score_Numeric'] < 3])/len(data)*100:.1f}%)

6. OUTLIERS:
   • Number of Outliers (IQR Method): {len(outliers)}
   • Scores outside {lower_bound:.2f} to {upper_bound:.2f} are flagged as outliers
   • A count of 0 means no extreme values raise data quality concerns
""")