# 🚀 E-Commerce Analytics Platform: Advanced Customer Insights & Sales Forecasting

**Developed by:** [Anishitha Varma](https://github.com/Anishithavaram-4242)  
**GitHub:** [@Anishithavaram-4242](https://github.com/Anishithavaram-4242?tab=repositories)

---

## Project Overview
This notebook performs a comprehensive analysis of e-commerce electronics sales data, including:
- **Data Preprocessing & Feature Engineering**: Automated cleaning and transformation pipelines
- **Customer Segmentation**: K-Means clustering to identify distinct customer groups
- **Sales Forecasting**: Time series analysis and Random Forest predictive modeling
- **Product Recommendation System**: Collaborative filtering using cosine similarity
- **Statistical Analysis**: Correlation analysis, normality testing, and business insights


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.decomposition import PCA

from scipy import stats
from scipy.stats import chi2_contingency

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")


## 1. Data Loading and Initial Exploration


In [None]:
df = pd.read_csv('electronics.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nColumn names: {df.columns.tolist()}")
print(f"\nData types:\n{df.dtypes}")
df.head()


In [None]:
print("=" * 60)
print("DATASET INFORMATION")
print("=" * 60)
df.info()
print("\n" + "=" * 60)
print("MISSING VALUES")
print("=" * 60)
print(df.isnull().sum())
print("\n" + "=" * 60)
print("DUPLICATE ROWS")
print("=" * 60)
print(f"Total duplicates: {df.duplicated().sum()}")
print("\n" + "=" * 60)
print("BASIC STATISTICS")
print("=" * 60)
df.describe()


## 2. Data Preprocessing and Feature Engineering
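
As a quick illustration of the calendar features derived in this section, the sketch below applies the same `.dt` accessors to two hypothetical dates. The dates and the `demo` frame are made up for illustration and are not taken from `electronics.csv`. pandas numbers weekdays from Monday = 0, so Saturday and Sunday map to 5 and 6, which is why the weekend flag tests `>= 5`.

```python
import pandas as pd

# Hypothetical timestamps: a Saturday and the following Monday
demo = pd.DataFrame({'timestamp': ['2015-03-07', '2015-03-09']})
demo['timestamp'] = pd.to_datetime(demo['timestamp'])

demo['day_of_week'] = demo['timestamp'].dt.dayofweek   # Monday=0 ... Sunday=6
demo['quarter'] = demo['timestamp'].dt.quarter         # 1..4
demo['is_weekend'] = (demo['day_of_week'] >= 5).astype(int)

print(demo)
```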


In [None]:
data = df.copy()

data['timestamp'] = pd.to_datetime(data['timestamp'])

data['year'] = data['timestamp'].dt.year
data['month'] = data['timestamp'].dt.month
data['day_of_week'] = data['timestamp'].dt.dayofweek
data['quarter'] = data['timestamp'].dt.quarter
data['is_weekend'] = (data['day_of_week'] >= 5).astype(int)

data['user_id'] = data['user_id'].astype(str)
data['item_id'] = data['item_id'].astype(str)

data['rating'] = data['rating'].astype(float)

print("Feature engineering completed!")
print(f"\nNew columns: {[col for col in data.columns if col not in df.columns]}")
data.head()


In [None]:
initial_shape = data.shape[0]
data = data.drop_duplicates()
final_shape = data.shape[0]
print(f"Removed {initial_shape - final_shape} duplicate rows")
print(f"Final dataset shape: {data.shape}")

print("\nMissing values after preprocessing:")
print(data.isnull().sum())


## 3. Exploratory Data Analysis (EDA)


In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

sns.countplot(data=data, x='rating', ax=axes[0], palette='viridis')
axes[0].set_title('Distribution of Ratings', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Count')

sns.boxplot(data=data, y='rating', ax=axes[1], palette='viridis')
axes[1].set_title('Rating Distribution (Box Plot)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Rating')

plt.tight_layout()
plt.show()

print(f"\nRating Statistics:")
print(data['rating'].describe())
print(f"\nMode rating: {data['rating'].mode()[0]}")


In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

yearly_sales = data.groupby('year').size()
axes[0, 0].bar(yearly_sales.index, yearly_sales.values, color='steelblue', alpha=0.7)
axes[0, 0].set_title('Total Sales by Year', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Year')
axes[0, 0].set_ylabel('Number of Transactions')
axes[0, 0].grid(axis='y', alpha=0.3)

monthly_sales = data.groupby('month').size()
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
axes[0, 1].plot(monthly_sales.index, monthly_sales.values, marker='o', linewidth=2, markersize=8)
axes[0, 1].set_title('Sales Trend by Month', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Month')
axes[0, 1].set_ylabel('Number of Transactions')
axes[0, 1].set_xticks(range(1, 13))
axes[0, 1].set_xticklabels(month_names, rotation=45)
axes[0, 1].grid(alpha=0.3)

quarterly_sales = data.groupby('quarter').size()
axes[1, 0].bar(quarterly_sales.index, quarterly_sales.values, color='coral', alpha=0.7)
axes[1, 0].set_title('Sales by Quarter', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Quarter')
axes[1, 0].set_ylabel('Number of Transactions')
axes[1, 0].grid(axis='y', alpha=0.3)

weekend_analysis = data.groupby('is_weekend').size()
axes[1, 1].pie(weekend_analysis.values, labels=['Weekday', 'Weekend'], autopct='%1.1f%%',
               startangle=90, colors=['lightblue', 'lightcoral'])
axes[1, 1].set_title('Weekend vs Weekday Sales', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()


In [None]:
top_categories = data['category'].value_counts().head(10)

plt.figure(figsize=(14, 8))
colors = sns.color_palette("husl", len(top_categories))
bars = plt.barh(range(len(top_categories)), top_categories.values, color=colors)
plt.yticks(range(len(top_categories)), top_categories.index)
plt.xlabel('Number of Transactions', fontsize=12)
plt.title('Top 10 Product Categories by Sales Volume', fontsize=16, fontweight='bold')
plt.gca().invert_yaxis()

for i, val in enumerate(top_categories.values):
    # Offset each label by 1% of the longest bar so it scales with the data
    plt.text(val + top_categories.max() * 0.01, i, f'{val:,}', va='center', fontsize=10)

plt.tight_layout()
plt.show()

print(f"\nTotal unique categories: {data['category'].nunique()}")
print(f"\nCategory with most sales: {top_categories.index[0]} ({top_categories.values[0]:,} transactions)")


## 4. Customer Behavior Analysis
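
The engagement score computed in this section is a weighted sum: 0.4 × total reviews + 0.3 × unique products + 0.3 × unique categories. The sketch below works through one hypothetical customer; the helper name `engagement_score` and the input numbers are illustrative assumptions, and only the weights come from the notebook.

```python
def engagement_score(total_reviews, unique_products, unique_categories):
    """Weighted sum using the 0.4 / 0.3 / 0.3 weights from this notebook."""
    return 0.4 * total_reviews + 0.3 * unique_products + 0.3 * unique_categories

# Hypothetical customer: 10 reviews covering 8 products in 3 categories
# 0.4 * 10 + 0.3 * 8 + 0.3 * 3 = 4.0 + 2.4 + 0.9
score = engagement_score(10, 8, 3)
print(round(score, 2))
```

Because the three inputs are not rescaled before weighting, the score is driven mostly by whichever count is largest, which is typically the review count.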


In [None]:
customer_features = data.groupby('user_id').agg({
    'rating': ['mean', 'count', 'std'],
    'item_id': 'nunique',
    'category': 'nunique'
}).reset_index()

customer_features.columns = ['user_id', 'avg_rating', 'total_reviews', 'rating_std', 
                            'unique_products', 'unique_categories']
customer_features = customer_features.fillna(0)

customer_features['engagement_score'] = (
    customer_features['total_reviews'] * 0.4 +
    customer_features['unique_products'] * 0.3 +
    customer_features['unique_categories'] * 0.3
)

print("Customer features created:")
print(customer_features.head())
print(f"\nTotal unique customers: {len(customer_features)}")


In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

axes[0, 0].hist(customer_features['avg_rating'], bins=30, color='skyblue', edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Distribution of Average Ratings per Customer', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Average Rating')
axes[0, 0].set_ylabel('Number of Customers')
axes[0, 0].axvline(customer_features['avg_rating'].mean(), color='red', 
                   linestyle='--', label=f'Mean: {customer_features["avg_rating"].mean():.2f}')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

axes[0, 1].hist(customer_features['total_reviews'], bins=50, color='lightgreen', edgecolor='black', alpha=0.7)
axes[0, 1].set_title('Distribution of Reviews per Customer', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Number of Reviews')
axes[0, 1].set_ylabel('Number of Customers')
axes[0, 1].set_xlim(0, 100)
axes[0, 1].grid(alpha=0.3)

axes[1, 0].hist(customer_features['engagement_score'], bins=50, color='coral', edgecolor='black', alpha=0.7)
axes[1, 0].set_title('Customer Engagement Score Distribution', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Engagement Score')
axes[1, 0].set_ylabel('Number of Customers')
axes[1, 0].grid(alpha=0.3)

scatter = axes[1, 1].scatter(customer_features['total_reviews'], 
                             customer_features['avg_rating'],
                             c=customer_features['engagement_score'],
                             cmap='viridis', alpha=0.6, s=20)
axes[1, 1].set_title('Customer Rating vs Review Count', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Total Reviews')
axes[1, 1].set_ylabel('Average Rating')
axes[1, 1].set_xlim(0, 100)
plt.colorbar(scatter, ax=axes[1, 1], label='Engagement Score')
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()


## 5. Customer Segmentation using K-Means Clustering
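
K-Means assigns points by Euclidean distance, so a feature with a wide range (such as `total_reviews`) would dominate a narrow one (such as `avg_rating`) if left unscaled. The sketch below reproduces the standardization step by hand with NumPy on three hypothetical customers. `StandardScaler` applies the same per-column formula, (x - mean) / std, using the population standard deviation.

```python
import numpy as np

# Hypothetical customers: columns are [avg_rating, total_reviews]
X = np.array([
    [4.5,  2.0],
    [3.0, 40.0],
    [5.0,  6.0],
])

# Column-wise z-score; np.std defaults to the population std (ddof=0)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.round(3))
```

After scaling, each column has mean 0 and unit variance, so both features contribute on a comparable scale to the distance computation.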


In [None]:
features_for_clustering = customer_features[['avg_rating', 'total_reviews', 
                                             'unique_products', 'unique_categories']].copy()

features_for_clustering = features_for_clustering.replace([np.inf, -np.inf], np.nan).fillna(0)

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features_for_clustering)

inertias = []
K_range = range(2, 8)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(features_scaled)
    inertias.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (k)', fontsize=12)
plt.ylabel('Inertia', fontsize=12)
plt.title('Elbow Method for Optimal k', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.show()

print("Elbow method completed. Optimal k appears to be around 3-4 clusters.")


In [None]:
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
customer_features['cluster'] = kmeans.fit_predict(features_scaled)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

scatter1 = axes[0].scatter(customer_features['total_reviews'], 
                          customer_features['avg_rating'],
                          c=customer_features['cluster'], 
                          cmap='Set1', alpha=0.6, s=30)
axes[0].set_title('Customer Clusters: Rating vs Review Count', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Total Reviews')
axes[0].set_ylabel('Average Rating')
axes[0].set_xlim(0, 100)
axes[0].grid(alpha=0.3)
plt.colorbar(scatter1, ax=axes[0], label='Cluster')

scatter2 = axes[1].scatter(customer_features['unique_products'], 
                          customer_features['unique_categories'],
                          c=customer_features['cluster'], 
                          cmap='Set1', alpha=0.6, s=30)
axes[1].set_title('Customer Clusters: Product Diversity', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Unique Products')
axes[1].set_ylabel('Unique Categories')
axes[1].grid(alpha=0.3)
plt.colorbar(scatter2, ax=axes[1], label='Cluster')

plt.tight_layout()
plt.show()

print("\n" + "=" * 60)
print("CLUSTER CHARACTERISTICS")
print("=" * 60)
cluster_summary = customer_features.groupby('cluster').agg({
    'avg_rating': 'mean',
    'total_reviews': 'mean',
    'unique_products': 'mean',
    'unique_categories': 'mean',
    'user_id': 'count'
}).round(2)
cluster_summary.columns = ['Avg Rating', 'Avg Reviews', 'Avg Unique Products', 
                          'Avg Unique Categories', 'Customer Count']
print(cluster_summary)


## 6. Sales Forecasting with Time Series Analysis
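
Two details of the forecasting cells in this section are easy to miss. A rolling mean with `window=3` is undefined for the first two periods, and the train/test split is chronological (the first 80% of months train the model, the last 20% test it) rather than shuffled. A minimal sketch on a made-up five-month series:

```python
import pandas as pd

# Hypothetical monthly sales counts
sales = pd.Series([10, 20, 30, 40, 50])

# 3-period moving average: the first two entries are NaN
moving_avg = sales.rolling(window=3).mean()

# Chronological 80/20 split: no shuffling, so the test set is the most recent data
split_idx = int(len(sales) * 0.8)
train, test = sales[:split_idx], sales[split_idx:]

print(moving_avg.tolist())
print(train.tolist(), test.tolist())
```

Keeping the split in time order avoids training on months that come after the ones being predicted.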


In [None]:
time_series_data = data.groupby([data['timestamp'].dt.to_period('M')]).size().reset_index()
time_series_data.columns = ['period', 'sales_count']
time_series_data['period'] = time_series_data['period'].astype(str)
time_series_data['date'] = pd.to_datetime(time_series_data['period'])
time_series_data = time_series_data.sort_values('date').reset_index(drop=True)

plt.figure(figsize=(16, 6))
plt.plot(time_series_data['date'], time_series_data['sales_count'], 
         marker='o', linewidth=2, markersize=6, color='steelblue')
plt.title('Sales Trend Over Time', fontsize=16, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Number of Sales', fontsize=12)
plt.grid(alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print(f"\nTime series statistics:")
print(time_series_data['sales_count'].describe())


In [None]:
window_size = 3
time_series_data['moving_avg'] = time_series_data['sales_count'].rolling(window=window_size).mean()

time_series_data['month_num'] = time_series_data['date'].dt.month
time_series_data['year_num'] = time_series_data['date'].dt.year
time_series_data['period_num'] = range(len(time_series_data))

forecast_data = time_series_data[['period_num', 'month_num', 'year_num', 'sales_count']].dropna()

if len(forecast_data) > 5:
    X = forecast_data[['period_num', 'month_num']]
    y = forecast_data['sales_count']
    
    split_idx = int(len(forecast_data) * 0.8)
    X_train, X_test = X[:split_idx], X[split_idx:]
    y_train, y_test = y[:split_idx], y[split_idx:]
    
    rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
    rf_model.fit(X_train, y_train)
    
    y_pred = rf_model.predict(X_test)
    
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    plt.figure(figsize=(16, 6))
    plt.plot(time_series_data['date'], time_series_data['sales_count'], 
             label='Actual Sales', marker='o', linewidth=2, markersize=6)
    plt.plot(time_series_data['date'], time_series_data['moving_avg'], 
             label=f'Moving Average (window={window_size})', linewidth=2, linestyle='--')
    
    test_dates = time_series_data['date'].iloc[split_idx:split_idx+len(y_test)]
    plt.plot(test_dates, y_pred, label='Forecasted Sales', 
             marker='s', linewidth=2, markersize=6, color='red')
    
    plt.title('Sales Forecasting: Actual vs Predicted', fontsize=16, fontweight='bold')
    plt.xlabel('Date', fontsize=12)
    plt.ylabel('Number of Sales', fontsize=12)
    plt.legend()
    plt.grid(alpha=0.3)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    print(f"\nForecasting Model Performance:")
    print(f"Mean Squared Error: {mse:.2f}")
    print(f"R² Score: {r2:.4f}")
else:
    print("Insufficient data for forecasting model")


## 7. Statistical Analysis
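
`DataFrame.corr()` computes the Pearson coefficient by default. As a sanity check on what the heatmap values in this section mean, the sketch below computes the coefficient by hand for two made-up variables and compares it with NumPy's `np.corrcoef`. The arrays `x` and `y` are illustrative and unrelated to the dataset.

```python
import numpy as np

# Made-up paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson r: covariance of the centered values divided by the product of their norms
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Same quantity from NumPy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```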


In [None]:
numeric_cols = ['rating', 'year', 'month', 'day_of_week', 'quarter']
correlation_matrix = data[numeric_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Numerical Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

ratings_sample = data['rating'].sample(min(5000, len(data)), random_state=42)
shapiro_stat, shapiro_p = stats.shapiro(ratings_sample)

print("\n" + "=" * 60)
print("STATISTICAL TESTS")
print("=" * 60)
print(f"\nShapiro-Wilk Normality Test (Rating Distribution):")
print(f"  Statistic: {shapiro_stat:.4f}")
print(f"  p-value: {shapiro_p:.4e}")
print(f"  Result: {'Normality rejected at the 5% level' if shapiro_p < 0.05 else 'No evidence against normality at the 5% level'}")

yearly_stats = data.groupby('year').agg({
    'rating': ['mean', 'count']
}).reset_index()
yearly_stats.columns = ['year', 'avg_rating', 'total_sales']

print(f"\nYear-over-Year Statistics:")
print(yearly_stats)


## 8. Product Recommendation System (Collaborative Filtering)
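
Item-based collaborative filtering with cosine similarity compares items by the vectors of ratings they received. The sketch below computes the similarity matrix by hand with NumPy for a tiny hypothetical user-item matrix; the ratings and item labels are made up for illustration. As in this notebook's pivot table, unrated cells are filled with 0, which is a simplification because it treats "not rated" as a rating of 0.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items A, B, C (0 = not rated)
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 3.0],
])

item_vectors = ratings.T                                    # one row per item
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
# cos(a, b) = (a . b) / (|a| * |b|) for every pair of items
similarity = (item_vectors @ item_vectors.T) / (norms @ norms.T)

# Most similar item to A (index 0), excluding A itself
scores = similarity[0].copy()
scores[0] = -np.inf
most_similar = int(np.argmax(scores))

print(similarity.round(3))
print(most_similar)
```

Items A and B are rated by the same two users with similar scores, so they come out as near neighbors, while item C shares no raters with either and gets a similarity of 0.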


In [None]:
sample_size = min(50000, len(data))
sample_data = data.sample(n=sample_size, random_state=42)

user_item_matrix = sample_data.pivot_table(
    index='user_id', 
    columns='item_id', 
    values='rating', 
    fill_value=0
)

print(f"User-Item Matrix Shape: {user_item_matrix.shape}")
# Share of empty (zero-filled) cells in the user-item matrix
sparsity = (user_item_matrix == 0).to_numpy().mean() * 100
print(f"Matrix Sparsity: {sparsity:.2f}%")

from sklearn.metrics.pairwise import cosine_similarity

item_similarity = cosine_similarity(user_item_matrix.T)
item_similarity_df = pd.DataFrame(
    item_similarity, 
    index=user_item_matrix.columns, 
    columns=user_item_matrix.columns
)

print("\nItem similarity matrix created!")
print(f"Shape: {item_similarity_df.shape}")

def get_recommendations(item_id, n_recommendations=5):
    if item_id not in item_similarity_df.index:
        return []
    
    similar_items = item_similarity_df[item_id].sort_values(ascending=False)
    similar_items = similar_items[similar_items.index != item_id]
    return similar_items.head(n_recommendations).index.tolist()

if len(user_item_matrix.columns) > 0:
    sample_item = user_item_matrix.columns[0]
    recommendations = get_recommendations(sample_item, n_recommendations=5)
    print(f"\nExample: Recommendations for item {sample_item}:")
    print(f"Top 5 similar items: {recommendations[:5]}")


## 9. Business Insights and Recommendations


In [None]:
print("=" * 70)
print("BUSINESS INSIGHTS & STRATEGIC RECOMMENDATIONS")
print("=" * 70)

peak_year = data.groupby('year').size().idxmax()
peak_month = data.groupby('month').size().idxmax()
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

print(f"\n1. SALES PERFORMANCE:")
print(f"   • Peak sales year: {peak_year}")
print(f"   • Peak sales month: {month_names[peak_month-1]} ({peak_month})")
print(f"   • Total transactions: {len(data):,}")

top_3_categories = data['category'].value_counts().head(3)
print(f"\n2. TOP PERFORMING CATEGORIES:")
for idx, (cat, count) in enumerate(top_3_categories.items(), 1):
    percentage = (count / len(data)) * 100
    print(f"   {idx}. {cat}: {count:,} transactions ({percentage:.2f}%)")

print(f"\n3. CUSTOMER SEGMENTATION:")
for cluster_id in sorted(customer_features['cluster'].unique()):
    cluster_data = customer_features[customer_features['cluster'] == cluster_id]
    print(f"   Cluster {cluster_id}:")
    print(f"      • Customers: {len(cluster_data):,}")
    print(f"      • Avg Rating: {cluster_data['avg_rating'].mean():.2f}")
    print(f"      • Avg Reviews: {cluster_data['total_reviews'].mean():.1f}")
    print(f"      • Product Diversity: {cluster_data['unique_products'].mean():.1f}")

avg_rating = data['rating'].mean()
high_ratings = (data['rating'] >= 4).sum()
high_rating_pct = (high_ratings / len(data)) * 100

print(f"\n4. CUSTOMER SATISFACTION:")
print(f"   • Average rating: {avg_rating:.2f}/5.0")
print(f"   • High ratings (≥4): {high_ratings:,} ({high_rating_pct:.2f}%)")

print(f"\n5. STRATEGIC RECOMMENDATIONS:")
print(f"   • Focus marketing efforts during {month_names[peak_month-1]} for maximum impact")
print(f"   • Prioritize inventory for top categories: {', '.join(top_3_categories.index[:2])}")
print(f"   • Develop targeted campaigns for different customer segments")
print(f"   • Implement recommendation system to increase cross-selling")
print(f"   • Monitor sales trends and adjust inventory accordingly")

print("\n" + "=" * 70)


## 10. Summary and Conclusions

### Key Findings:
1. **Sales Trends**: Identified peak sales periods and seasonal patterns
2. **Customer Segmentation**: Successfully segmented customers into 4 distinct groups
3. **Product Performance**: Analyzed top-performing categories and products
4. **Forecasting**: Built a Random Forest model to forecast monthly sales
5. **Recommendations**: Developed collaborative filtering system for product recommendations

### Technical Achievements:
- Advanced data preprocessing and feature engineering
- Machine learning models (K-Means clustering, Random Forest regression)
- Statistical analysis (correlation matrix and Shapiro-Wilk normality test)
- Time series analysis and forecasting
- Collaborative filtering recommendation system

### Business Value:
- Data-driven insights for strategic decision-making
- Customer segmentation for targeted marketing
- Sales forecasting for inventory planning
- Product recommendations to increase revenue
