In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### About Dataset


In [2]:
# Credits to the dataset author: https://data.world/anilsharma87 - Analyzing and Maximizing Online Business Performance
# By Anil

# Project: E-Commerce Sales Forecasting

### Objective:

The goal of this project is to build a machine learning model that predicts future sales for an e-commerce business based on historical sales data. 
Sales forecasting is crucial for businesses to optimize inventory management, set realistic sales targets, and make data-driven decisions. 
Accurate forecasts can help reduce costs and maximize profits by anticipating future demand trends.

### Problem Definition:

I'm tasked with predicting sales volume for a chosen period (e.g., daily, weekly, or monthly) from the historical order data of an e-commerce platform. 
The challenge is to account for seasonality, trends, and external factors that may influence sales, such as holidays, discounts, and other promotional activities.
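
To make the choice of forecasting period concrete, here is a minimal sketch of rolling order-level rows up into daily, weekly, and monthly totals with `resample`. The `orders` frame is toy data invented for illustration; only the `Date` and `Amount` column names come from the dataset.

```python
import pandas as pd

# Toy order-level data (invented for illustration, not taken from the real file).
orders = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2022-04-29", "2022-04-30", "2022-04-30", "2022-05-01", "2022-05-08"]
    ),
    "Amount": [100.0, 250.0, 50.0, 300.0, 120.0],
})

# Index by date so the series can be resampled to different granularities.
sales = orders.set_index("Date")["Amount"]

daily = sales.resample("D").sum()     # one row per calendar day (empty days sum to 0)
weekly = sales.resample("W").sum()    # weeks ending on Sunday (pandas default)
monthly = sales.resample("MS").sum()  # one row per month, labelled by month start
```

Which granularity to forecast is a modelling decision: coarser periods are smoother but leave fewer observations to train on.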

### Key Assumptions and Factors to Explore:

1. **Trend and Seasonality:** 
   - Analyze whether sales exhibit long-term trends or seasonal patterns, such as increased sales during holidays or weekends.
   
2. **Price Influence:**
   - Investigate the relationship between price fluctuations and sales volume, examining how discounts or price increases impact consumer behavior.

3. **Promotional Activities:**
   - Assess the effect of promotions, discounts, and marketing campaigns on sales and how they can skew the forecast.

4. **Product Categories:**
   - Examine whether specific product categories exhibit different sales patterns compared to others.
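
As a rough illustration of how the seasonality and category questions above could be checked together, the sketch below cross-tabulates sales by category and weekday/weekend. The rows are toy data and `day_type` is a hypothetical helper column; `Category`, `Date`, and `Amount` are the dataset's column names.

```python
import pandas as pd

# Toy rows (invented for illustration).
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2022-04-29", "2022-04-30", "2022-04-30", "2022-05-01"]),
    "Category": ["Set", "kurta", "kurta", "Set"],
    "Amount": [600.0, 400.0, 300.0, 700.0],
})

# Hypothetical helper column: Saturday (5) and Sunday (6) count as weekend.
toy["day_type"] = (toy["Date"].dt.dayofweek >= 5).map(
    {True: "weekend", False: "weekday"}
)

# Total sales per category, split by weekday vs weekend.
by_category = toy.pivot_table(
    index="Category",
    columns="day_type",
    values="Amount",
    aggfunc="sum",
    fill_value=0,
)
```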

### Approach:

1. **Exploratory Data Analysis (EDA):**
   - Perform an initial data exploration to understand the structure of the dataset, identify trends, and discover correlations between variables.

2. **Feature Engineering:**
   - Create new features such as day of the week, month, holidays, and other relevant factors that may affect sales.

3. **Model Selection:**
   - Train multiple forecasting models (e.g., Linear Regression, ARIMA, Prophet, XGBoost) to compare performance.
   - Evaluate models based on metrics like RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error).

4. **Forecasting:**
   - Generate future sales predictions based on the chosen model and validate accuracy using test data.
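
The feature-engineering and evaluation steps above can be sketched as follows. This is not one of the models named in the list: it swaps in two naive baselines (overall mean and per-weekday mean) on a synthetic daily series, purely to show calendar features, a chronological train/test split, and RMSE/MAE. The helper names `rmse` and `mae` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales with a weekly pattern (illustrative only).
daily = pd.DataFrame({"date": pd.date_range("2022-04-01", periods=28, freq="D")})
daily["sales"] = 100.0 + 10.0 * daily["date"].dt.dayofweek

# Calendar features derived from the date.
daily["day_of_week"] = daily["date"].dt.dayofweek  # Monday=0 ... Sunday=6
daily["month"] = daily["date"].dt.month
daily["is_weekend"] = daily["day_of_week"] >= 5

# Chronological split: hold out the last 7 days, never shuffle a time series.
train, test = daily.iloc[:-7], daily.iloc[-7:]
actual = test["sales"].to_numpy()

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Baseline 1: predict the overall training mean for every test day.
mean_pred = np.full(len(test), train["sales"].mean())

# Baseline 2: predict the training mean of the matching weekday.
weekday_mean = train.groupby("day_of_week")["sales"].mean()
weekday_pred = test["day_of_week"].map(weekday_mean).to_numpy()

scores = {
    "mean": (rmse(actual, mean_pred), mae(actual, mean_pred)),
    "weekday": (rmse(actual, weekday_pred), mae(actual, weekday_pred)),
}
```

On this toy series the mean baseline scores an RMSE of 20.0, while the weekday baseline is exact by construction. Any real model would need to beat baselines like these on held-out data to justify its complexity.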

### Outcome:

The final output will be a forecasting model, validated on held-out test data, whose predictions support business decisions such as inventory planning, staffing, and financial forecasting.

## Step 1: Data Loading

In [5]:
df = pd.read_csv('Amazon Sale Report.csv', low_memory=False)
df.head(3)

| | index | Order ID | Date | Status | Fulfilment | Sales Channel | ship-service-level | Style | SKU | Category | ... | currency | Amount | ship-city | ship-state | ship-postal-code | ship-country | promotion-ids | B2B | fulfilled-by | Unnamed: 22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 405-8078784-5731545 | 04-30-22 | Cancelled | Merchant | Amazon.in | Standard | SET389 | SET389-KR-NP-S | Set | ... | INR | 647.62 | MUMBAI | MAHARASHTRA | 400081.0 | IN | NaN | False | Easy Ship | NaN |
| 1 | 1 | 171-9198151-1101146 | 04-30-22 | Shipped - Delivered to Buyer | Merchant | Amazon.in | Standard | JNE3781 | JNE3781-KR-XXXL | kurta | ... | INR | 406.0 | BENGALURU | KARNATAKA | 560085.0 | IN | Amazon PLCC Free-Financing Universal Merchant ... | False | Easy Ship | NaN |
| 2 | 2 | 404-0687676-7273146 | 04-30-22 | Shipped | Amazon | Amazon.in | Expedited | JNE3371 | JNE3371-KR-XL | kurta | ... | INR | 329.0 | NAVI MUMBAI | MAHARASHTRA | 410210.0 | IN | IN Core Free Shipping 2015/04/08 23-48-5-108 | True | NaN | NaN |
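
The preview shows `Date` as `MM-DD-YY` text and `ship-postal-code` as a float. Below is a minimal sketch of declaring types while loading, using an in-memory sample instead of the real file. The two rows are trimmed from the preview above; writing the postal codes without a trailing `.0` is an assumption about the raw CSV.

```python
import io
import pandas as pd

# Two rows trimmed from the preview; postal-code formatting is an assumption.
sample_csv = io.StringIO(
    "index,Order ID,Date,Amount,ship-postal-code\n"
    "0,405-8078784-5731545,04-30-22,647.62,400081\n"
    "1,171-9198151-1101146,04-30-22,406.0,560085\n"
)

sample = pd.read_csv(
    sample_csv,
    index_col="index",                     # reuse the file's own row counter
    dtype={"ship-postal-code": "string"},  # postal codes are labels, not numbers
)

# Parse the date strings with an explicit format to avoid day/month ambiguity.
sample["Date"] = pd.to_datetime(sample["Date"], format="%m-%d-%y")
```

Parsing `Date` up front matters for forecasting, since resampling and calendar features both need a real datetime column.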


## Step 2: Initial Data Exploration

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128975 entries, 0 to 128974
Data columns (total 24 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   index               128975 non-null  int64  
 1   Order ID            128975 non-null  object 
 2   Date                128975 non-null  object 
 3   Status              128975 non-null  object 
 4   Fulfilment          128975 non-null  object 
 5   Sales Channel       128975 non-null  object 
 6   ship-service-level  128975 non-null  object 
 7   Style               128975 non-null  object 
 8   SKU                 128975 non-null  object 
 9   Category            128975 non-null  object 
 10  Size                128975 non-null  object 
 11  ASIN                128975 non-null  object 
 12  Courier Status      122103 non-null  object 
 13  Qty                 128975 non-null  int64  
 14  currency            121180 non-null  object 
 15  Amount              121180 non-nul

In [7]:
df.describe()

| | index | Qty | Amount | ship-postal-code |
|---|---|---|---|---|
| count | 128975.0 | 128975.0 | 121180.0 | 128942.0 |
| mean | 64487.0 | 0.904431 | 648.561465 | 463966.236509 |
| std | 37232.019822 | 0.313354 | 281.211687 | 191476.764941 |
| min | 0.0 | 0.0 | 0.0 | 110001.0 |
| 25% | 32243.5 | 1.0 | 449.0 | 382421.0 |
| 50% | 64487.0 | 1.0 | 605.0 | 500033.0 |
| 75% | 96730.5 | 1.0 | 788.0 | 600024.0 |
| max | 128974.0 | 15.0 | 5584.0 | 989898.0 |
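
The summary shows a minimum of 0 for both `Qty` and `Amount`, so some rows record no units sold. One possible follow-up, sketched here on toy rows invented for illustration, is to count zero-quantity rows by `Status` to see whether they line up with cancellations. What it would show on the real data is not known from the output above.

```python
import pandas as pd

# Toy rows (invented); column names match the dataset.
toy = pd.DataFrame({
    "Status": ["Cancelled", "Shipped", "Shipped", "Cancelled"],
    "Qty": [0, 1, 2, 0],
    "Amount": [0.0, 406.0, 658.0, 647.62],
})

# How many zero-quantity rows fall under each order status?
zero_qty_by_status = toy.loc[toy["Qty"] == 0, "Status"].value_counts()
```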


In [8]:
df.isnull().sum()

index                     0
Order ID                  0
Date                      0
Status                    0
Fulfilment                0
Sales Channel             0
ship-service-level        0
Style                     0
SKU                       0
Category                  0
Size                      0
ASIN                      0
Courier Status         6872
Qty                       0
currency               7795
Amount                 7795
ship-city                33
ship-state               33
ship-postal-code         33
ship-country             33
promotion-ids         49153
B2B                       0
fulfilled-by          89698
Unnamed: 22           49050
dtype: int64
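
Raw counts are easier to judge as shares of the 128975 rows; for example, the 7795 missing `Amount` values are about 6% of the data. The sketch below computes missing percentages and checks whether two columns are missing on the same rows, which the identical counts for `currency` and `Amount` suggest but do not prove. The frame is toy data invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (invented); column names match the dataset.
toy = pd.DataFrame({
    "Amount": [647.62, np.nan, 329.0, 406.0],
    "currency": ["INR", None, "INR", "INR"],
    "fulfilled-by": ["Easy Ship", None, None, None],
})

# Percentage of missing values per column, largest first.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

# True when Amount and currency are missing on exactly the same rows.
same_rows = (toy["Amount"].isnull() == toy["currency"].isnull()).all()
```

Whether to drop, impute, or keep the sparse columns depends on how they are used downstream, so this only measures the gaps.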