# 📊 Notebook 2: EDA & Visualization
## Final Project - Ordinal vs Nominal Sentiment Analysis
### Atharv Chaudhary

---

**Purpose:** Explore the cleaned review data and create visualizations for the report.

**Input:** `amazon_electronics_cleaned.csv`

**Output:** `class_distribution.png`, `review_length_analysis.png`

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Plot settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
plt.rcParams['figure.dpi'] = 150

print("✅ Libraries imported")

## Step 1: Load Cleaned Data

In [None]:
# Load cleaned data from Notebook 1
df = pd.read_csv('amazon_electronics_cleaned.csv')

print(f"✅ Loaded {len(df):,} reviews")
print(f"   Columns: {list(df.columns)}")
df.head()
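
The later cells read two columns, `rating` and `text`, so it can help to check them right after loading. The snippet below is a minimal sketch of such a check. The helper name `validate_reviews` and the toy rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical helper (an assumption, not from the original notebook):
# checks the two columns that the plotting cells depend on.
def validate_reviews(frame):
    """Return a list of problems found; an empty list means the frame looks usable."""
    problems = []
    for col in ("rating", "text"):
        if col not in frame.columns:
            problems.append(f"missing column: {col}")
    if problems:
        return problems
    if not frame["rating"].isin([1, 2, 3, 4, 5]).all():
        problems.append("rating has values outside 1-5")
    if frame["text"].isna().any():
        problems.append("text has missing values")
    return problems

# Toy frame with made-up rows, only to show the call
toy = pd.DataFrame({"rating": [5, 1, 7], "text": ["great", None, "meh"]})
print(validate_reviews(toy))
```

In the notebook itself the call would be `validate_reviews(df)`, and an empty list would suggest the remaining cells can run as written.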

## Step 2: Class Distribution Visualization

In [None]:
# ============================================================================
# CLASS DISTRIBUTION PLOT (For Report)
# ============================================================================

fig, ax = plt.subplots(figsize=(10, 6))

rating_counts = df['rating'].value_counts().sort_index()
colors = ['#e74c3c', '#e67e22', '#f1c40f', '#2ecc71', '#27ae60']

bars = ax.bar(rating_counts.index, rating_counts.values, color=colors, 
              edgecolor='black', linewidth=1.2)

# Add value labels on bars
for bar, count in zip(bars, rating_counts.values):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, height + max(rating_counts)*0.01, 
            f'{count:,}\n({count/len(df)*100:.1f}%)', 
            ha='center', va='bottom', fontsize=11, fontweight='bold')

ax.set_xlabel('Star Rating', fontsize=12, fontweight='bold')
ax.set_ylabel('Number of Reviews', fontsize=12, fontweight='bold')
ax.set_title(f'Class Distribution of Amazon Electronics Reviews\n(N={len(df):,})', 
             fontsize=14, fontweight='bold')
ax.set_xticks([1, 2, 3, 4, 5])
ax.set_xticklabels(['1 ⭐', '2 ⭐', '3 ⭐', '4 ⭐', '5 ⭐'], fontsize=11)

# Add grid
ax.yaxis.grid(True, linestyle='--', alpha=0.7)
ax.set_axisbelow(True)

plt.tight_layout()
plt.savefig('class_distribution.png', dpi=150, bbox_inches='tight', 
            facecolor='white', edgecolor='none')
plt.show()

print("\n✅ Saved: class_distribution.png")
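
The plot above pairs five fixed colors with whatever rows `value_counts()` returns. If a rating were absent from the data, there would be fewer than five bars and the colors would likely shift onto the wrong ratings. A possible safeguard, sketched here on made-up data, is to reindex the counts to all five ratings. The helper name `rating_table` is hypothetical.

```python
import pandas as pd

# Hypothetical helper (an assumption, not from the original notebook)
def rating_table(frame):
    """Counts and percentages per star rating; absent ratings appear as 0."""
    counts = (
        frame["rating"]
        .value_counts()
        .reindex([1, 2, 3, 4, 5], fill_value=0)
    )
    table = pd.DataFrame({"count": counts, "percent": counts / len(frame) * 100})
    table.index.name = "rating"
    return table

# Toy frame with made-up ratings; ratings 2 and 3 are deliberately absent
toy = pd.DataFrame({"rating": [5, 5, 5, 4, 1]})
print(rating_table(toy).round(1))
```

The same table also gives the counts and percentages printed on the bars, which can be handy when copying numbers into the report.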

## Step 3: Review Length Analysis

In [None]:
# Calculate text statistics
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()

print("📏 Review Length Statistics:")
print(df[['text_length', 'word_count']].describe())
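
`describe()` above summarizes lengths over the whole dataset. Review lengths are often right-skewed, so a per-rating breakdown with the median alongside the mean may be more informative. The following is a sketch on made-up rows. The helper name `length_stats_by_rating` is hypothetical.

```python
import pandas as pd

# Hypothetical helper (an assumption, not from the original notebook)
def length_stats_by_rating(frame):
    """Word-count summary (count, mean, median, max) for each star rating."""
    words = frame["text"].str.split().str.len()
    return words.groupby(frame["rating"]).agg(["count", "mean", "median", "max"])

# Toy frame with made-up reviews
toy = pd.DataFrame({
    "rating": [1, 1, 5],
    "text": ["a b c", "a", "a b"],
})
print(length_stats_by_rating(toy))
```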

In [None]:
# ============================================================================
# REVIEW LENGTH BY RATING
# ============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Word count distribution
for rating in [1, 2, 3, 4, 5]:
    subset = df[df['rating'] == rating]['word_count']
    axes[0].hist(subset, bins=50, alpha=0.5, label=f'{rating} ⭐', density=True)

axes[0].set_xlabel('Word Count', fontsize=11)
axes[0].set_ylabel('Density', fontsize=11)
axes[0].set_title('Review Length Distribution by Rating', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].set_xlim([0, 500])

# Average word count by rating
avg_words = df.groupby('rating')['word_count'].mean()
bars = axes[1].bar(avg_words.index, avg_words.values, color=colors, edgecolor='black')
axes[1].set_xlabel('Star Rating', fontsize=11)
axes[1].set_ylabel('Average Word Count', fontsize=11)
axes[1].set_title('Average Review Length by Rating', fontsize=12, fontweight='bold')
axes[1].set_xticks([1, 2, 3, 4, 5])

for bar, val in zip(bars, avg_words.values):
    axes[1].text(bar.get_x() + bar.get_width()/2, val + 2, f'{val:.0f}', 
                 ha='center', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.savefig('review_length_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n✅ Saved: review_length_analysis.png")
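
The bar chart compares means, which a few very long reviews can distort. One way to check whether length really falls as rating rises is a rank correlation between the two. The sketch below computes a Spearman-style coefficient as the Pearson correlation of the ranks, using only pandas. The helper name `length_rating_rank_corr` and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical helper (an assumption, not from the original notebook)
def length_rating_rank_corr(frame):
    """Pearson correlation of ranks (ties get average ranks), i.e. Spearman's rho."""
    words = frame["text"].str.split().str.len()
    return frame["rating"].rank().corr(words.rank())

# Toy frame with made-up reviews whose length shrinks as rating grows
toy = pd.DataFrame({
    "rating": [1, 2, 3, 4, 5],
    "text": ["a b c d e", "a b c d", "a b c", "a b", "a"],
})
print(length_rating_rank_corr(toy))
```

A value near -1 would mean lower ratings go with longer reviews, and a value near 0 would mean no monotonic relationship.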

## Step 4: Sample Reviews

In [None]:
# Show sample reviews for each rating
print("=" * 70)
print("SAMPLE REVIEWS BY RATING")
print("=" * 70)

for rating in [1, 2, 3, 4, 5]:
    sample = df[df['rating'] == rating]['text'].iloc[0][:200]
    print(f"\n{'⭐' * rating} ({rating}-star):")
    print(f"   {sample}...")
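
The loop above always shows the first review of each rating, which depends on row order in the CSV. A reproducible random sample may be more representative. This is a sketch on made-up rows. The helper name `sample_per_rating` and its `n` and `seed` parameters are hypothetical.

```python
import pandas as pd

# Hypothetical helper (an assumption, not from the original notebook)
def sample_per_rating(frame, n=2, seed=42):
    """Up to n randomly chosen rows per rating; a fixed seed keeps it repeatable."""
    parts = []
    for _, group in frame.groupby("rating"):
        # Never ask for more rows than the group holds
        parts.append(group.sample(n=min(n, len(group)), random_state=seed))
    return pd.concat(parts)

# Toy frame with made-up reviews
toy = pd.DataFrame({
    "rating": [1, 1, 1, 5],
    "text": ["broke fast", "never worked", "refund please", "love it"],
})
print(sample_per_rating(toy, n=2))
```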

## Step 5: Key Insights

In [None]:
# ============================================================================
# KEY INSIGHTS FOR REPORT
# ============================================================================

print("=" * 70)
print("📋 KEY INSIGHTS FOR REPORT")
print("=" * 70)

rating_counts = df['rating'].value_counts().sort_index()

print(f"""
1. DATASET SIZE:
   - Total reviews: {len(df):,}
   - Source: Amazon Electronics Reviews (McAuley Lab, UCSD)

2. CLASS IMBALANCE:
   - 5-star reviews: {rating_counts[5]:,} ({rating_counts[5]/len(df)*100:.1f}%)
   - 1-star reviews: {rating_counts[1]:,} ({rating_counts[1]/len(df)*100:.1f}%)
   - Imbalance ratio (largest:smallest class): {rating_counts.max()/rating_counts.min():.1f}:1

3. REVIEW LENGTH:
   - Average words: {df['word_count'].mean():.0f}
   - Median words: {df['word_count'].median():.0f}
   - Negative reviews tend to be longer (more detail about complaints)

4. CHALLENGE FOR CLASSIFICATION:
   - Adjacent ratings (4★ vs 5★) use similar vocabulary
   - Class imbalance affects minority class performance
   - Ordinal structure: 1 < 2 < 3 < 4 < 5
""")
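
The ordinal structure noted in point 4 matters when scoring predictions. Accuracy treats every wrong label the same, while a distance-based score such as mean absolute error penalizes a 5★ review predicted as 1★ more than one predicted as 4★. The sketch below illustrates this with made-up labels. It is not the evaluation used in the later notebooks.

```python
def accuracy(y_true, y_pred):
    """Fraction of exact matches; ignores how far off a wrong label is."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average distance in stars between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up labels for illustration only
y_true = [5, 5, 4, 1]
near_miss = [4, 4, 5, 2]  # every prediction is off by one star
far_miss = [1, 1, 1, 5]   # every prediction is off by three or four stars

print(accuracy(y_true, near_miss), accuracy(y_true, far_miss))
print(mean_absolute_error(y_true, near_miss), mean_absolute_error(y_true, far_miss))
```

Both prediction sets score the same accuracy, but their mean absolute errors differ. That gap is the information a purely nominal treatment of the ratings discards.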

In [None]:
# Clean up helper columns
df = df.drop(columns=['text_length', 'word_count'], errors='ignore')

# Download visualizations
try:
    from google.colab import files
    files.download('class_distribution.png')
    files.download('review_length_analysis.png')
    print("📥 Downloads started...")
except ImportError:  # google.colab is unavailable outside Colab
    print("Files saved locally")

---
## ✅ Summary

**Visualizations created:**
- `class_distribution.png` - For Dataset section of report
- `review_length_analysis.png` - Additional analysis

**Key findings:**
- Severe class imbalance (5-star dominant)
- Negative reviews are longer on average
- Ordinal structure should be leveraged

**Next:** Run `3_Models_Nominal.ipynb`