# ***Problem Statement-***
"In the dynamic world of retail, forecasting sales accurately is a critical aspect of optimizing operations, managing inventory, and ensuring profitability. This project revolves around a retail dataset spanning four years from a global superstore. The primary objective is to conduct Exploratory Data Analysis (EDA) and develop a predictive model to forecast sales for the next 7 days from the last date of the training dataset."
* Row ID => Unique ID for each row.
* Order ID => Unique Order ID for each Customer.
* Order Date => Order Date of the product.
* Ship Date => Shipping Date of the Product.
* Ship Mode => Shipping Mode specified by the Customer.
* Customer ID => Unique ID to identify each Customer.
* Customer Name => Name of the Customer.
* Segment => The segment where the Customer belongs.
* Country => Country of residence of the Customer.
* City => City of residence of the Customer.
* State => State of residence of the Customer.
* Postal Code => Postal Code of every Customer.
* Region => Region where the Customer belongs.
* Product ID => Unique ID of the Product.
* Category => Category of the product ordered.
* Sub-Category => Sub-Category of the product ordered.
* Product Name => Name of the Product.
* Sales => Sales of the Product.
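The forecasting target in the problem statement (sales for the 7 days after the last training date) can be made concrete with a small sketch. The toy rows below are made up for illustration and only mimic the `Order Date` and `Sales` columns; the aggregation to one value per day is an assumption about how the target could be framed, not a step taken from this notebook.

```python
import pandas as pd

# Hypothetical rows standing in for train.csv; dates use the day/month/year layout
orders = pd.DataFrame({
    "Order Date": ["28/12/2018", "28/12/2018", "29/12/2018", "30/12/2018"],
    "Sales": [100.0, 50.0, 20.0, 75.5],
})
orders["Order Date"] = pd.to_datetime(orders["Order Date"], format="%d/%m/%Y")

# One value per calendar day; days without orders are filled with 0
daily_sales = orders.groupby("Order Date")["Sales"].sum().asfreq("D", fill_value=0.0)

# The seven dates to forecast, starting the day after the last training date
last_date = daily_sales.index.max()
forecast_dates = pd.date_range(last_date + pd.Timedelta(days=1), periods=7, freq="D")

print(daily_sales)
print(forecast_dates)
```

Framing the target this way keeps the time order of the data explicit, which matters when choosing how to split training and evaluation periods.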

# ***Hypothesis Generation-***

This is a very important stage in any data science/machine learning pipeline. It involves understanding the problem in detail by brainstorming as many factors as possible that can have an impact on the target variable.
We look at this problem from the point of view of a business manager and try to identify weak areas where sales could be improved. Some of the questions we will try to answer are:

* What are the average sales per month for the store?
* What is the top demanded product in the United States?
* What is the favourite shipping mode for customers?

# ***Loading module, data and related-***

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import seaborn as sns
%matplotlib inline 
import matplotlib.pyplot as plt # side-stepping mpl backend
import matplotlib.gridspec as gridspec # subplots
import mpld3 as mpl
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix,classification_report, accuracy_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# ***Understanding Data-*** 

In [None]:
df_s=pd.read_csv(r'/kaggle/input/sales-forecasting/train.csv')
df_s.head()

In [None]:
df=df_s.copy()

# ***Exploratory Data Analysis-***

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.nunique()

In [None]:
df[df.duplicated()]

In [None]:
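# Use the first 6993 rows for training and the rest as an unlabeled test set
# (the second slice starts at 6993 so that no row is skipped between the two)
train = df.iloc[:6993]
test = df.iloc[6993:].drop(columns='Sales')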

In [None]:
tr_c=train.copy()
ts_c=test.copy()

In [None]:
tr_c.drop(['Row ID', 'Order ID', 'Customer ID', 'Country', 'Product ID', 'Product Name'], axis=1, inplace=True)

In this dataset, there are mainly three types of data: categorical, numerical, and datetime.

* Categorical features: Ship Mode, Customer Name, Segment, City, State, Region, Category, Sub-Category
* Numerical features: Postal Code, Sales
* Datetime features: Order Date, Ship Date
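The grouping above can also be derived from the column dtypes. The sketch below uses a small made-up frame (`toy`, not the Kaggle file) and `DataFrame.select_dtypes`; it assumes the date column has already been parsed, since unparsed date strings would be counted as categorical.

```python
import pandas as pd

# Hypothetical two-row frame with one column of each kind
toy = pd.DataFrame({
    "Ship Mode": ["Second Class", "Standard Class"],
    "Postal Code": [42420.0, 90036.0],
    "Sales": [261.96, 14.62],
    "Order Date": pd.to_datetime(["08/11/2017", "12/06/2017"], format="%d/%m/%Y"),
})

numerical_cols = toy.select_dtypes(include="number").columns.tolist()
datetime_cols = toy.select_dtypes(include="datetime").columns.tolist()
# Everything that is neither numeric nor datetime is treated as categorical
categorical_cols = toy.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

print(categorical_cols, numerical_cols, datetime_cols)
```

Note that `Postal Code` is numeric by dtype but behaves like a category, so a dtype-based split is only a starting point.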

**Independent variable (categorical)-**

**Ordinal-**

In [None]:
tr_c['Ship Mode'].unique()

In [None]:
tr_c['Ship Mode'].value_counts(normalize=True, dropna=False)

**Nominal-**

In [None]:
tr_c['Customer Name'].value_counts(normalize=True, dropna=False)

In [None]:
tr_c['Segment'].unique()

In [None]:
tr_c['Segment'].value_counts(normalize=True, dropna=False)

In [None]:
tr_c['City'].value_counts(normalize=True, dropna=False)

In [None]:
tr_c['State'].value_counts(normalize=True, dropna=False)

In [None]:
tr_c['Region'].value_counts(normalize=True, dropna=False)

In [None]:
tr_c['Category'].value_counts(normalize=True, dropna=False)

In [None]:
tr_c['Sub-Category'].value_counts(normalize=True, dropna=False)

**Numerical-**

In [None]:
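# Assign the result back rather than calling fillna(inplace=True) on a column
# selection, which may not update tr_c in recent pandas versions
tr_c['Postal Code'] = tr_c['Postal Code'].fillna(tr_c['Postal Code'].median())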

In [None]:
tr_c['Postal Code']=tr_c['Postal Code'].astype('int')

**Date_Time_column-**

In [None]:
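# Parse the day/month/year strings into pandas datetimes
tr_c['Order Date'] = pd.to_datetime(tr_c['Order Date'], format='%d/%m/%Y')
tr_c['Ship Date'] = pd.to_datetime(tr_c['Ship Date'], format='%d/%m/%Y')

# Split the ship date into year, month and day of month ('Ship date', lowercase d, holds the day)
tr_c['Ship Year'] = tr_c['Ship Date'].dt.year
tr_c['Ship Month'] = tr_c['Ship Date'].dt.month
tr_c['Ship date'] = tr_c['Ship Date'].dt.day

# Same split for the order date ('Order date', lowercase d, holds the day of month)
tr_c['Order Year'] = tr_c['Order Date'].dt.year
tr_c['Order Month'] = tr_c['Order Date'].dt.month
tr_c['Order date'] = tr_c['Order Date'].dt.day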

In [None]:
tr_c.drop(['Order Date', 'Ship Date'], axis=1, inplace=True)

In [None]:
tr_c.info()

## Univariate Analysis-

**Categorical-**

In [None]:
# Plot the normalized frequency of each categorical feature side by side
sns.set_theme(style="darkgrid")
cat_cols = ['Ship Mode', 'Segment', 'Region', 'Category']
fig, axes = plt.subplots(1, 4, figsize=(11, 4))
for ax, col, color in zip(axes, cat_cols, ['C5', 'C6', 'C7', 'C8']):
    tr_c[col].value_counts(normalize=True, dropna=False).plot.bar(ax=ax, title=col, color=color)
plt.tight_layout()

**Observation:**
- Around 60% of shipments use the 'Standard Class' mode.
- The 'Consumer' segment makes up around 50% of the dataset.
- Around 60% of the items are 'Office Supplies'.
- The 'West' region accounts for the largest share of items in the dataset.

**Numerical-**

In [None]:
import warnings
warnings.filterwarnings("ignore")

# Distribution of Sales in the training data: density (left) and box plot (right)
plt.figure(1)
plt.subplot(121)
sns.kdeplot(tr_c['Sales'], color='C1');

plt.subplot(122)
tr_c['Sales'].plot.box(figsize=(8,4))

plt.tight_layout()

## Bivariate Analysis- 
* **The average sales/month for the store-**

In [None]:
monthly_sales = pd.DataFrame(tr_c.groupby('Order Month')['Sales'].sum()).reset_index()
plt.figure(figsize=(5, 3))
sns.barplot(x = 'Order Month', y = 'Sales',data = monthly_sales)
plt.xticks(rotation=90)
plt.show()

In [None]:
yearly_sales = pd.DataFrame(tr_c.groupby('Order Year')['Sales'].sum()).reset_index()
plt.figure(figsize=(3, 3))
sns.barplot(x = 'Order Year', y = 'Sales',data = yearly_sales)
plt.xticks(rotation=90)
plt.show()

In [None]:
daily_sales = pd.DataFrame(tr_c.groupby('Order date')['Sales'].sum()).reset_index()
plt.figure(figsize=(8, 3))
sns.barplot(x = 'Order date', y = 'Sales',data = daily_sales)
plt.xticks(rotation=90)
plt.show()

**Observation-**
* Total sales were highest in December (by month), in 2018 (by year), and on the 8th (by day of month).
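The bar charts above show totals summed over all years, while the question posed earlier asks for average sales per month. A minimal sketch of one way to compute that average, on made-up numbers (`toy` is hypothetical; the column names match the ones created in this notebook):

```python
import pandas as pd

# Hypothetical order rows with the derived year and month columns
toy = pd.DataFrame({
    "Order Year":  [2017, 2017, 2017, 2018, 2018],
    "Order Month": [1, 1, 12, 1, 12],
    "Sales":       [100.0, 50.0, 300.0, 250.0, 500.0],
})

# Total sales for each (year, month) pair
monthly_totals = toy.groupby(["Order Year", "Order Month"])["Sales"].sum()

# Average of those totals for each calendar month across years
avg_by_month = monthly_totals.groupby(level="Order Month").mean()

# A single figure: average sales per month over the whole period
overall_monthly_avg = monthly_totals.mean()

print(avg_by_month)
print(overall_monthly_avg)
```

Averaging the per-year totals keeps a month from looking stronger only because it appears in more years of data.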

* **Most ordered category per region-**

In [None]:
plt.figure(figsize=(5, 3))
# value_counts() counts rows (order lines) per region and category, not units sold
tr_c.groupby('Region')['Category'].value_counts().plot(kind='barh',title='Number of orders per region and category');

**Observation-**
* Office Supplies orders from the South region were the most frequent.

* **States with the highest sales**

In [None]:
plt.figure(figsize=(7, 5))
tr_c['State'].value_counts().head(10).plot(kind='pie', autopct='%1.1f%%', colors=sns.color_palette("rocket"), textprops={'weight':'bold', 'color':'#8A8D8F'}) #creates a pie chart
plt.show();

**Observation-**
* The plot above shows the 10 states with the most orders (row counts from `value_counts()`), used here as a proxy for sales.
* California has the most orders.
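The pie chart is built from `value_counts()`, which counts rows rather than summing `Sales`, so the two rankings can differ. A small sketch on made-up data (`toy` is hypothetical) that contrasts them:

```python
import pandas as pd

# Hypothetical rows: California has the most orders, Texas the most revenue
toy = pd.DataFrame({
    "State": ["California", "California", "California", "Texas", "New York"],
    "Sales": [10.0, 20.0, 30.0, 500.0, 40.0],
})

# Ranking by number of order rows (what value_counts() measures)
orders_per_state = toy["State"].value_counts()

# Ranking by total sales amount
sales_per_state = toy.groupby("State")["Sales"].sum().sort_values(ascending=False)

print(orders_per_state.head(10))
print(sales_per_state.head(10))
```

Applying the second ranking to `tr_c` would show whether the states with the most orders are also the states with the highest revenue.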

* **States with the lowest sales**

In [None]:
plt.figure(figsize=(7, 5))
tr_c['State'].value_counts().tail(10).plot(kind='pie', autopct='%1.1f%%', colors=sns.color_palette("rocket"), textprops={'weight':'bold', 'color':'#8A8D8F'}) #creates a pie chart
plt.show();

**Observation-**
* The plot above shows the 10 states with the fewest orders (row counts from `value_counts()`), used here as a proxy for sales.
* Wyoming has the fewest orders.

In [None]:
# Ten cities with the most orders (left) and ten with the fewest (right)
sns.set_theme(style="darkgrid")
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.countplot(data=tr_c, x='City', order=tr_c['City'].value_counts().head(10).index, ax=axes[0])
axes[0].tick_params(axis='x', labelrotation=90)

sns.countplot(data=tr_c, x='City', order=tr_c['City'].value_counts().tail(10).index, ax=axes[1])
axes[1].tick_params(axis='x', labelrotation=90)

plt.tight_layout();

**Observation-**
* The plots above show the 10 cities with the most orders and the 10 cities with the fewest.
* New York City has the most orders.

* **The top demanded product category in the United States**

In [None]:
category_sales = pd.DataFrame(tr_c.groupby('Category')['Sales'].sum()).reset_index()
plt.figure(figsize=(3, 3))
sns.barplot(x = 'Category', y = 'Sales',data = category_sales)
plt.xticks(rotation=45)
plt.show()

**Observation-**
* The category with the highest sales in the US is Technology.

* **Sales for each Region-**

In [None]:
region_sales = pd.DataFrame(tr_c.groupby('Region')['Sales'].sum()).reset_index()
plt.figure(figsize=(3, 3))
sns.barplot(x = 'Region', y = 'Sales',data = region_sales)
plt.xticks(rotation=45)
plt.show()

**Observation-**
* Sales were highest in the West region.

# ***Missing value-***

In [None]:
tr_c.isnull().sum()

# ***Outliers Treatment-***

In [None]:
tr_c['Sales'].plot.box(figsize=(5,4));

The interquartile range (IQR) is a measure of statistical dispersion, that is, the spread of the data. It is the difference between the third and first quartiles and covers the middle 50% of the values, which makes it less sensitive to extreme values than the full range. The IQR can be used to identify outliers in a data set: a common rule flags values more than 1.5 × IQR below the first quartile or above the third quartile.
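As a minimal sketch of the IQR rule on made-up numbers (the series `sales` is hypothetical, and the 1.5 multiplier is the conventional choice rather than something tuned for this dataset):

```python
import pandas as pd

# Hypothetical sales values with one obvious extreme value
sales = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 500.0])

q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower_fence, upper_fence] is flagged as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sales[(sales < lower_fence) | (sales > upper_fence)]

# Capping (winsorizing) keeps the rows but pulls extreme values back to the fences
capped = sales.clip(lower=lower_fence, upper=upper_fence)

print(iqr, lower_fence, upper_fence)
print(outliers.tolist())
print(capped.max())
```

Capping keeps every row in the dataset, which is useful when the extreme values are genuine large orders rather than data errors.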

In [None]:
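# Fences at 1.5 * IQR beyond the first and third quartiles
IQR= tr_c['Sales'].quantile(0.75) - tr_c['Sales'].quantile(0.25)
lower_bridge= tr_c['Sales'].quantile(0.25)-(IQR*1.5)
upper_bridge= tr_c['Sales'].quantile(0.75)+(IQR*1.5)
# Only the upper tail is capped; lower_bridge is computed for reference but not applied
tr_c.loc[tr_c['Sales']>upper_bridge,'Sales'] = upper_bridge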

In [None]:
tr_c['Sales'].plot.box(figsize=(5,4));

## Label Encoding for categorical columns-

Label Encoding is a technique that is used to convert categorical columns into numerical ones so that they can be fitted by machine learning models which only take numerical data. It is an important pre-processing step in a machine-learning project.

In [None]:
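from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
# Columns to integer-encode (mostly nominal; 'Postal Code' is numeric but treated as a category here)
categorical_cols=[ 'Ship Mode', 'Customer Name', 'Segment', 'State', 'Postal Code', 'Category', 'Sub-Category', 'City', 'Region']
for column in categorical_cols:
    # fit_transform refits the encoder on every column, so codes are specific to each column
    tr_c[column]=le.fit_transform(tr_c[column])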
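Because a single encoder is refitted for every column, only the mapping for the last column is still available afterwards. If the same mapping needs to be applied to the test set or reversed for reporting, one option is to keep one encoder per column. The sketch below uses a made-up frame (`toy`) and a hypothetical `encoders` dictionary; it is an illustration, not the notebook's method.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame with two categorical columns
toy = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Consumer", "Home Office"],
    "Region": ["West", "East", "South", "West"],
})

# Keep one fitted encoder per column so each mapping can be reused later
encoders = {}
for column in ["Segment", "Region"]:
    enc = LabelEncoder()
    toy[column] = enc.fit_transform(toy[column])
    encoders[column] = enc

segment_codes = toy["Segment"].tolist()
# inverse_transform recovers the original labels from the integer codes
segment_labels = encoders["Segment"].inverse_transform(toy["Segment"]).tolist()

print(segment_codes)
print(segment_labels)
```

Label encoding assigns codes in sorted label order, so the integers imply an ordering that nominal columns do not really have; tree-based models tolerate this better than distance-based or linear ones.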

# **Evaluation Metrics for regression problems-**
The process of model building is not complete without evaluating the model's performance. Given the model's predictions, how can we decide whether they are accurate? We can plot the predictions against the actual values, or calculate the distance between them: the smaller this distance, the more accurate the predictions.
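The "distance" between predictions and actual values is usually summarized with a few standard regression metrics. A minimal sketch on made-up numbers (`y_true` and `y_hat` are hypothetical arrays, not results from this notebook):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual values and predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

mae = mean_absolute_error(y_true, y_hat)   # average absolute error
mse = mean_squared_error(y_true, y_hat)    # average squared error
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_hat)               # share of variance explained

print(mae, mse, rmse, r2)
```

RMSE and MAE are expressed in the units of `Sales`, which makes them easier to interpret than MSE, while R² is unitless and is what `score()` returns for scikit-learn regressors.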

In [None]:
X_tr=tr_c.drop(columns=["Sales"])
y_tr=tr_c["Sales"]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X_tr, y_tr, test_size=0.2, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler


## Using Pipeline-

In [None]:
from sklearn.pipeline import Pipeline
p1=Pipeline([('sc',StandardScaler()),('sv',SVR())])
p2=Pipeline([('sc',StandardScaler()),('kn',KNeighborsRegressor())])
p3=Pipeline([('sc',StandardScaler()),('lr',LinearRegression())])
p4=Pipeline([('sc',StandardScaler()),('dt',DecisionTreeRegressor())])
p5=Pipeline([('sc',StandardScaler()),('rf',RandomForestRegressor())])
pipe=[p1,p2, p3, p4, p5]
for i in pipe:
    i.fit(X_train,y_train)

In [None]:
for i in pipe:
    y_pred = i.predict(X_test)    
    print(f'{i[1]},train - { i.score(X_train, y_train)},test {r2_score(y_test, y_pred)}')

**Using PCA-**

In [None]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
p6=Pipeline([('sc',StandardScaler()), ('pca',PCA()),('sv',SVR())])
p7=Pipeline([('sc',StandardScaler()),('pca',PCA()), ('kn',KNeighborsRegressor())])
p8=Pipeline([('sc',StandardScaler()),('pca',PCA()), ('lr',LinearRegression())])
p9=Pipeline([('sc',StandardScaler()),('pca',PCA()), ('dt',DecisionTreeRegressor())])
p10=Pipeline([('sc',StandardScaler()),('pca',PCA()), ('rf',RandomForestRegressor())])
pipe=[p6,p7, p8, p9, p10]
for j in pipe:
    j.fit(X_train,y_train)

In [None]:
for j in pipe:
    y_pred = j.predict(X_test)    
    print(f'{j[2]},train - { j.score(X_train, y_train)},test {r2_score(y_test, y_pred)}')

## Hyperparameter Tuning-

**GridSearch CV-**

In [None]:
p_GS_knn=Pipeline([('sc',StandardScaler()),('kn',GridSearchCV(KNeighborsRegressor(), param_grid={'metric':['euclidean', 'minkowski', 'manhattan'],
           'n_neighbors':range(1,11)}, cv=5))])
p_GS_dt=Pipeline([('sc',StandardScaler()),('dt',GridSearchCV(DecisionTreeRegressor(), param_grid={"max_depth": range(3,6),
              "max_features": range(1,11),
              "min_samples_split": range(2,11)}, cv=5))])
pipe=[p_GS_knn, p_GS_dt]
for i in pipe:
    i.fit(X_train,y_train)

In [None]:
for i in pipe[:2]:
    y_pred = i.predict(X_test)    
    print(f'{i[1].best_estimator_},train - { i.score(X_train, y_train)},test {r2_score(y_test, y_pred)}')

**RandomizedSearch CV-**

In [None]:
p_RS_knn=Pipeline([('sc',StandardScaler()),('kn',RandomizedSearchCV(KNeighborsRegressor(), param_distributions={'metric':['euclidean', 'minkowski', 'manhattan'],
           'n_neighbors':range(1,11)}, cv=5))])
p_RS_dt=Pipeline([('sc',StandardScaler()),('dt',RandomizedSearchCV(DecisionTreeRegressor(), param_distributions={"max_depth": range(3,6),
              "max_features": range(1,11),
              "min_samples_split": range(2,11)}, cv=5))])
pipe=[p_RS_knn, p_RS_dt]
for i in pipe:
    i.fit(X_train,y_train)

In [None]:
for i in pipe[:2]:
    y_pred = i.predict(X_test)    
    print(f'{i[1].best_estimator_},train - { i.score(X_train, y_train)},test {r2_score(y_test, y_pred)}')

# Feature importance-
Feature importance is a technique that assigns a score to input features based on how useful they are at predicting a target variable. It is useful for machine learning tasks because it allows practitioners to understand which features in a dataset are contributing most to the final prediction, and which features are less important.
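To illustrate what these scores look like, the sketch below fits a random forest on synthetic data in which only one column drives the target. The column names (`signal`, `noise_a`, `noise_b`) and the data are made up for this example; the importances shown are the impurity-based ones exposed by scikit-learn as `feature_importances_`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on 'signal' only
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.uniform(0, 10, 300),
    "noise_a": rng.uniform(0, 10, 300),
    "noise_b": rng.uniform(0, 10, 300),
})
y = 5 * X["signal"] + rng.normal(0, 0.5, 300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Importances are non-negative and sum to 1 across features
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor columns with many distinct values, so they are best read as a rough ranking rather than an exact measure of contribution.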

In [None]:
rf=RandomForestRegressor()
rf.fit(X_train, y_train)
rf.feature_importances_

In [None]:
features_imp = pd.DataFrame({'importance': rf.feature_importances_}, index= X_train.columns).sort_values('importance')

In [None]:
features_imp.plot.barh();

## Using Feature Importance-

In [None]:
X_imp =X_train[features_imp[features_imp['importance'] > 0.05].index]

In [None]:
# Fix the seed so the split (and the scores below) are reproducible
X_train_imp,X_test_imp,y_train_imp,y_test_imp = train_test_split(X_imp,y_train,test_size = 0.3, random_state=1)

In [None]:
p1=Pipeline([('sc',StandardScaler()),('sv',SVR())])
p2=Pipeline([('sc',StandardScaler()),('kn',KNeighborsRegressor())])
p3=Pipeline([('sc',StandardScaler()),('lr',LinearRegression())])
p4=Pipeline([('sc',StandardScaler()),('dt',DecisionTreeRegressor())])
p5=Pipeline([('sc',StandardScaler()),('rf',RandomForestRegressor())])
pipe=[p1,p2, p3, p4, p5]
for i in pipe:
    i.fit(X_train_imp,y_train_imp)

In [None]:
for i in pipe:
    y_pred = i.predict(X_test_imp)    
    print(f'{i[1]},train - { i.score(X_train_imp, y_train_imp)},test {r2_score(y_test_imp, y_pred)}, error {np.mean((y_pred - y_test_imp)**2)}')

In [None]:
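# Square root of an MSE of 20327 (presumably the "error" value reported by the loop above), i.e. the RMSE in Sales units
20327**0.5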

In [None]:
sample = X_imp.iloc[0,:]
sample

In [None]:
sample.index