
# Step 1: Load the Dataset and Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
#import dataset
df = pd.read_excel("/content/Sample B2C Dataset EDA.xlsx")

In [None]:
#print dataset
df

In [None]:
# Print a summary of the dataset (df.info() prints directly and returns None)
df.info()


# Step 2: Data Cleaning & Preparation


In [None]:
# Assume a fixed rate of 1 EUR = 45 TRY (the same rate applied in Step 4)
try_columns = ['Price', 'Competitor Price']
conversion_rate = 45

2.2 Handle Missing Data

In [None]:
# Print original column names for reference
print("Original Columns:")
print(df.columns.tolist())

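# Clean column names: trim whitespace, replace spaces with underscores, drop parentheses.
# regex=False treats '(' and ')' as literal characters rather than regex syntax.
df.columns = (
    df.columns.str.strip()
    .str.replace(' ', '_', regex=False)
    .str.replace('(', '', regex=False)
    .str.replace(')', '', regex=False)
)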
print("\nCleaned Columns:")
print(df.columns.tolist())


In [None]:
# Fill missing Advertising Spend with the median
# (assign back instead of using inplace=True on a column selection)
df['Advertising_Spend_EUR'] = df['Advertising_Spend_EUR'].fillna(df['Advertising_Spend_EUR'].median())


In [None]:
print(df.isnull().sum())


In [None]:
# Fill missing values with the column median
df['Sales_Volume'] = df['Sales_Volume'].fillna(df['Sales_Volume'].median())
df['Customer_Reviews'] = df['Customer_Reviews'].fillna(df['Customer_Reviews'].median())

# Check if anything remains missing
print(df.isnull().sum())

# The median was chosen to impute missing values because it is less sensitive to outliers than the mean.
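
To illustrate why the median is the more robust choice, here is a minimal sketch on made-up numbers (they are not taken from the dataset). A single extreme value drags the mean far from the typical values, while the median barely moves.

```python
import pandas as pd

# Made-up sales figures with one extreme value and one missing entry
sales = pd.Series([10, 12, 11, 13, 500, None])

mean_value = sales.mean()      # pulled upward by the outlier (500)
median_value = sales.median()  # stays close to the typical values

# Compare what each strategy would write into the missing slot
filled_with_mean = sales.fillna(mean_value)
filled_with_median = sales.fillna(median_value)

print(f"mean={mean_value:.1f}, median={median_value:.1f}")
```

With the mean, the imputed row would look like a high seller that never existed; with the median it looks like an ordinary product.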

# Step 4: Standardise the Columns

* Currency Conversion (TRY → EUR)

- The conversion uses a fixed rate of 1 EUR = 45 TRY, which assumes the exchange rate was stable over the period covered by the data.

In [None]:
# Set exchange rate
exchange_rate = 45

# Identify rows in TRY
try_mask = df['Pricing_Currency'] == 'TRY'

# Convert relevant columns from TRY to EUR
df.loc[try_mask, ['Price', 'Competitor_Price']] = df.loc[try_mask, ['Price', 'Competitor_Price']] / exchange_rate

# Optional: update currency column to EUR after conversion
df.loc[try_mask, 'Pricing_Currency'] = 'EUR'

# Confirm conversion
print(df[try_mask][['Product_Name', 'Price', 'Competitor_Price', 'Pricing_Currency']].head())
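
Because the conversion depends on a single assumed rate, it can help to see how much the converted prices move under different rates. The sketch below uses made-up TRY prices and illustrative candidate rates; only the rate of 45 comes from the cell above.

```python
import pandas as pd

# Made-up prices quoted in TRY (not from the dataset)
prices_try = pd.Series([900.0, 1350.0, 4500.0], name="Price_TRY")

# Candidate fixed rates in TRY per 1 EUR; 45 is the rate used above,
# the other two are illustrative assumptions
candidate_rates = [30, 45, 50]

# One column of EUR prices per candidate rate
sensitivity = pd.DataFrame(
    {f"EUR_at_{rate}": prices_try / rate for rate in candidate_rates}
)
print(sensitivity.round(2))
```

If downstream results such as price tiers change noticeably across these columns, the fixed-rate assumption deserves a closer look.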


# Step 5: Handle Outliers



In [None]:
# Replace zero or negative prices with median of valid prices
valid_price_median = df[df['Price'] > 0]['Price'].median()
df.loc[df['Price'] <= 0, 'Price'] = valid_price_median

valid_comp_price_median = df[df['Competitor_Price'] > 0]['Competitor_Price'].median()
df.loc[df['Competitor_Price'] <= 0, 'Competitor_Price'] = valid_comp_price_median
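
The cell above only repairs non-positive prices. As a possible complement (this is not part of the original analysis), the sketch below flags unusually large or small values with the common 1.5 × IQR rule. The helper name `iqr_outlier_mask` and the sample prices are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up prices in EUR with one implausibly large value
prices = pd.Series([19.9, 24.5, 22.0, 21.3, 23.8, 950.0])
mask = iqr_outlier_mask(prices)

# Show only the flagged values
print(prices[mask])
```

Flagged rows are candidates for inspection rather than automatic replacement, since a genuinely expensive product would also be flagged.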


In [None]:
# Check if any products had sales but stock was 0
stock_issue = df[(df['Warehouse_Stock_Level'] == 0) & (df['Sales_Volume'] > 0)]

# Show a few for inspection
print(stock_issue[['Product_Name', 'Sales_Volume', 'Warehouse_Stock_Level']])


# Step 6: Feature Engineering

In [None]:
# 1. Price Differential: Difference from competitor
df['Price_Differential'] = df['Price'] - df['Competitor_Price']

# 2. Ad Efficiency: Sales per euro spent
df['Ad_Efficiency'] = df['Sales_Volume'] / (df['Advertising_Spend_EUR'] + 1)  # +1 avoids division by zero

# 3. Is_Stockout: Flag zero stock (1 = out of stock, 0 = in stock)
df['Is_Stockout'] = (df['Warehouse_Stock_Level'] == 0).astype(int)

# 4. Price Tier: Categorize into Low, Medium, High price products
df['Price_Tier'] = pd.qcut(df['Price'], q=3, labels=['Low', 'Medium', 'High'])


In [None]:
# Check result
print(df[['Product_Name', 'Price', 'Competitor_Price', 'Price_Differential', 'Ad_Efficiency', 'Is_Stockout', 'Price_Tier']].head())
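
To make the four engineered features concrete, here is a small worked example on six made-up products. The column names match the notebook, but every value is illustrative.

```python
import pandas as pd

# Six made-up products (values are illustrative, not from the dataset)
toy = pd.DataFrame({
    "Price": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "Competitor_Price": [12.0, 18.0, 30.0, 45.0, 48.0, 65.0],
    "Sales_Volume": [100, 80, 60, 40, 20, 10],
    "Advertising_Spend_EUR": [0.0, 9.0, 19.0, 39.0, 99.0, 4.0],
    "Warehouse_Stock_Level": [5, 0, 12, 0, 7, 3],
})

# Negative differential means we are cheaper than the competitor
toy["Price_Differential"] = toy["Price"] - toy["Competitor_Price"]
# The +1 keeps the ratio finite when ad spend is zero
toy["Ad_Efficiency"] = toy["Sales_Volume"] / (toy["Advertising_Spend_EUR"] + 1)
toy["Is_Stockout"] = (toy["Warehouse_Stock_Level"] == 0).astype(int)
# qcut splits the products into three equal-sized groups by price
toy["Price_Tier"] = pd.qcut(toy["Price"], q=3, labels=["Low", "Medium", "High"])

print(toy)
```

Note that the first product has zero ad spend, so its `Ad_Efficiency` equals its sales volume. The `+1` offset avoids division by zero but slightly understates efficiency for products with very small budgets.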


# Step 7: Exploratory Data Analysis (EDA)

In [None]:
# Set style
sns.set(style='whitegrid')

# 1. Total Sales Volume by Category
category_sales = df.groupby('Category')['Sales_Volume'].sum().reset_index()
sns.barplot(data=category_sales, x='Category', y='Sales_Volume', palette='Set2')
plt.title('Total Sales Volume by Category')
plt.xticks(rotation=45)
plt.show()

This bar chart shows that Electronics had the highest total sales volume among all categories. Home Goods and Clothing followed, but with noticeably less volume. This suggests Electronics are the top-performing segment overall.


In [None]:
# 2. Total Sales Volume by Price Tier
# observed=True makes the handling of the categorical Price_Tier column explicit
price_tier_sales = df.groupby('Price_Tier', observed=True)['Sales_Volume'].sum().reset_index()
sns.barplot(data=price_tier_sales, x='Price_Tier', y='Sales_Volume', palette='Set1')
plt.title('Total Sales Volume by Price Tier')
plt.show()

Sales are highest in the medium-priced tier, followed by low-priced products. High-priced items had the lowest sales volume, which is plausible because fewer customers can afford them.


# Step 8: Correlation & Trend Detection


Pearson Correlation Heatmap

In [None]:
# Correlation matrix for numeric variables
correlation_matrix = df[['Sales_Volume', 'Price', 'Competitor_Price',
                         'Price_Differential', 'Advertising_Spend_EUR',
                         'Customer_Reviews', 'Weather_Index',
                         'Social_Media_Mentions']].corr()

# Plot heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()


Most variables had weak correlations with sales volume. Surprisingly, Advertising Spend and Competitor Price showed almost no linear relationship with sales. This suggests that sales are influenced by several small factors rather than one dominant driver.
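
Pearson correlation only measures linear association, so a weak value does not rule out a monotonic but curved relationship. As a hedged complement to the heatmap, the sketch below compares Pearson with Spearman (rank) correlation on made-up data. Spearman is computed here as the Pearson correlation of the ranks, which is its definition.

```python
import pandas as pd

# Made-up data with a monotonic but non-linear relationship
toy = pd.DataFrame({
    "Advertising_Spend_EUR": [1, 2, 3, 4, 5, 6],
    "Sales_Volume": [1, 8, 27, 64, 125, 216],  # grows as the cube of spend
})

pearson = toy.corr(method="pearson").loc["Advertising_Spend_EUR", "Sales_Volume"]

# Spearman correlation equals the Pearson correlation of the ranked values
spearman = toy.rank().corr(method="pearson").loc["Advertising_Spend_EUR", "Sales_Volume"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Running the same comparison on the real columns would show whether any of the weak Pearson values hide a stronger monotonic trend.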


Regression Analysis (Sales vs. Drivers)

In [None]:
import statsmodels.api as sm

# Define features and target
X = df[['Price', 'Competitor_Price', 'Advertising_Spend_EUR']]
X = sm.add_constant(X)  # Adds intercept
y = df['Sales_Volume']

# Fit model
model = sm.OLS(y, X).fit()

# View results
print(model.summary())


- The regression explains very little of the variance in sales volume (R² ≈ 0.01), so it is not useful for prediction.


- Price, Competitor Price, and Ad Spend show weak or no statistically significant effect on sales volume on their own. Other factors, such as product category or social engagement, may be more influential and could be added as variables.
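
For readers unfamiliar with R², the sketch below computes it by hand as 1 − SS_res / SS_tot on simulated data. It uses NumPy least squares rather than statsmodels to stay self-contained; the variable names and simulated values are assumptions for illustration only.

```python
import numpy as np

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_and_score(x: np.ndarray, y: np.ndarray) -> float:
    """Fit y = a + b*x by ordinary least squares and return R^2."""
    X = np.column_stack([np.ones_like(x), x])  # intercept column + predictor
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return r_squared(y, X @ coef)

rng = np.random.default_rng(0)
n = 500
price = rng.uniform(5, 100, size=n)        # simulated predictor
sales_noise = rng.normal(200, 50, size=n)  # sales unrelated to price
sales_linear = 300 - 2 * price             # sales fully determined by price

r2_noise = fit_and_score(price, sales_noise)
r2_linear = fit_and_score(price, sales_linear)

print(f"R^2 (unrelated): {r2_noise:.3f}")
print(f"R^2 (linear):    {r2_linear:.3f}")
```

An R² near 0.01, as reported above, is in the same range as the unrelated case: the predictors explain almost none of the variation in sales.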


# Step 9: Competitor Pricing Impact (Visual Insight)


In [None]:
# Scatter plot: Price Differential vs Sales Volume
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='Price_Differential', y='Sales_Volume', hue='Category', palette='Set2')
plt.axvline(0, color='red', linestyle='--', label='Price Match')
plt.title('Price Differential vs Sales Volume')
plt.legend()
plt.show()


There is no strong pattern here. Some products priced below competitors sold more, but others did not. This suggests that lowering prices alone may not lead to higher sales, and that other factors may matter more.
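
A simple numeric check can back up the visual impression. The sketch below, on made-up values, splits products by whether they are priced below the competitor and compares sales in the two groups. The `Price_Position` column is a hypothetical helper, not a column in the dataset.

```python
import numpy as np
import pandas as pd

# Made-up price differentials (own price minus competitor price) and sales
toy = pd.DataFrame({
    "Price_Differential": [-5.0, -2.0, 0.0, 1.5, 4.0, -1.0],
    "Sales_Volume": [120, 80, 95, 110, 60, 70],
})

# Label each product by its position relative to the competitor
toy["Price_Position"] = np.where(
    toy["Price_Differential"] < 0, "Below competitor", "At or above competitor"
)

# Compare group size and typical sales
summary = toy.groupby("Price_Position")["Sales_Volume"].agg(["count", "mean", "median"])
print(summary)
```

If the two groups show similar means and medians on the real data, that supports the reading that undercutting competitors is not a reliable sales driver on its own.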


# Step 10: Additional Insights: External Drivers

This step analyzes:

* Ad Spend effectiveness

* Customer Review impact

* Social Media buzz

* Weather effect



1. Ad Spend vs Sales Volume


In [None]:
sns.scatterplot(data=df, x='Advertising_Spend_EUR', y='Sales_Volume', hue='Category')
plt.title('Ad Spend vs Sales Volume')
plt.show()


Products with higher ad spend did not always sell more, and some low-spend products had higher sales. Increasing the budget alone is therefore not reliably effective; the effect appears to depend on the product.


2. Customer Reviews by Category



In [None]:
sns.boxplot(data=df, x='Category', y='Customer_Reviews')
plt.title('Customer Reviews by Category')
plt.show()


In most product categories, the median rating falls between 3.5 and 4.5. Clothing shows more variability, which may mean that customer satisfaction is inconsistent across items in that category.


3. Social Media Mentions vs Sales Volume

In [None]:
sns.lmplot(data=df, x='Social_Media_Mentions', y='Sales_Volume', hue='Category', aspect=1.5)
plt.title('Social Media Mentions vs Sales Volume')
plt.show()


There’s no strong upward trend here. Even products with lots of mentions didn’t always have higher sales. This suggests social media might be more useful for engagement after purchase, not before.


4. Weather Index vs Sales Volume

In [None]:
sns.lmplot(data=df, x='Weather_Index', y='Sales_Volume', hue='Category', aspect=1.5)
plt.title('Weather Index vs Sales Volume')
plt.show()


Clothing sales seem to slightly increase with warmer weather, but overall, the relationship is weak. So weather might affect a few categories, but it's not a major sales driver across the board.
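
Since the effect seems to differ by category, a per-category correlation can quantify what the regression lines hint at. The sketch below uses made-up values in which one category responds to weather and the other does not; the real numbers may look quite different.

```python
import pandas as pd

# Made-up data: Clothing sales rise with the weather index, Electronics do not
toy = pd.DataFrame({
    "Category": ["Clothing"] * 4 + ["Electronics"] * 4,
    "Weather_Index": [1, 2, 3, 4, 1, 2, 3, 4],
    "Sales_Volume": [10, 14, 19, 25, 30, 28, 31, 29],
})

# Pearson correlation between weather and sales within each category
per_category_corr = {
    category: group["Weather_Index"].corr(group["Sales_Volume"])
    for category, group in toy.groupby("Category")
}

for category, corr in per_category_corr.items():
    print(f"{category}: r = {corr:.2f}")
```

A pooled correlation across all categories would average these effects together, which is one way a category-specific relationship can appear weak overall.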


# Step 11: Final Recommendations

Based on the data analysis and feature engineering, here are four recommendations to support decision-making.

---

#### 1. Focus on Low-Priced Electronics to Increase Market Share
Products like Bluetooth Headphones and Tablets show **high sales volumes** despite minimal price advantage over competitors. Because the overall relationship between price differential and sales was weak, treat this as a hypothesis to test: **small price reductions** in Electronics (especially budget items) may lift volume.

> **Why it matters:** Electronics are often comparison-shopped online. A small lead in pricing can shift consumer choice — even more than advertising.

---

#### 2. Advertising Budget Can Be Reduced or Reallocated
Neither the correlation heatmap nor the regression model showed a **strong relationship** between advertising spend and sales volume. In fact, some low-ad-spend products like Jeans and Coffee Filters still had high sales.

> **Action:** Rather than increasing spend, consider shifting budget **towards products with high Ad Efficiency** (sales per euro spent). This will improve ROI without overspending.
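
One way to act on this is to rank products by `Ad_Efficiency` and review the budget from the top down. The sketch below does this on made-up products with hypothetical names; it uses the same formula as the feature engineering step.

```python
import pandas as pd

# Made-up products (names and values are illustrative)
toy = pd.DataFrame({
    "Product_Name": ["Product A", "Product B", "Product C", "Product D"],
    "Sales_Volume": [500, 300, 450, 100],
    "Advertising_Spend_EUR": [49.0, 299.0, 9.0, 99.0],
})

# Same definition as in Step 6: sales per euro spent, with +1 to avoid division by zero
toy["Ad_Efficiency"] = toy["Sales_Volume"] / (toy["Advertising_Spend_EUR"] + 1)

# Highest efficiency first
ranked = toy.sort_values("Ad_Efficiency", ascending=False).reset_index(drop=True)
print(ranked[["Product_Name", "Advertising_Spend_EUR", "Ad_Efficiency"]])
```

High efficiency at low spend does not guarantee that extra budget will scale sales proportionally, so any reallocation is best trialled on a small scale first.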

---

#### 3. Stockouts Are Harming Revenue — Especially on High-Demand Products
Several products (such as Luxury Jacket and Vacuum Cleaner) recorded sales while their warehouse stock level was **zero**. These are likely missed opportunities that basic inventory tracking could help prevent.

> **Recommendation:** Use `Is_Stockout` flag as a monitoring tool. Combine with weekly sales forecasts to **build a predictive stock planning model.**
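
As a starting point for such monitoring, the sketch below lists products that are out of stock and sell above the median. The median threshold for "high demand" is an assumption made for illustration, and the product names and values are made up.

```python
import pandas as pd

# Made-up products (names and values are illustrative)
toy = pd.DataFrame({
    "Product_Name": ["Product A", "Product B", "Product C", "Product D", "Product E"],
    "Sales_Volume": [400, 120, 350, 90, 200],
    "Warehouse_Stock_Level": [0, 0, 25, 10, 0],
})
toy["Is_Stockout"] = (toy["Warehouse_Stock_Level"] == 0).astype(int)

# Assumption: "high demand" means sales strictly above the median
demand_threshold = toy["Sales_Volume"].median()

# Out-of-stock products with above-median sales are the most urgent to restock
at_risk = toy[(toy["Is_Stockout"] == 1) & (toy["Sales_Volume"] > demand_threshold)]
print(at_risk[["Product_Name", "Sales_Volume"]])
```

A forecasting model would replace the fixed threshold with predicted demand, but a filter like this is enough for a first weekly report.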

---

#### 4. Re-evaluate the Role of Social Media in Sales Strategy

Although social media mentions are often treated as key performance indicators, our analysis showed **very weak correlation** with actual sales volume.

> **Insight:** Social media activity may reflect **post-purchase engagement** or brand buzz rather than **purchase intent**.

> **Recommendation:** Focus more on improving **Customer Reviews** and **Product Ratings**, which showed stronger influence on sales. Social media can still be valuable for **brand presence** but shouldn’t be the core driver in marketing strategy.

---

These recommendations were derived from:
- Feature Engineering: `Ad_Efficiency`, `Price_Differential`, `Is_Stockout`
- Correlation Heatmap & Regression Analysis
- Category- and Tier-level segmentation in EDA

*I cross-checked each insight against visual and statistical evidence before including it here.*
