![Capgemini](https://raw.githubusercontent.com/interviewquery/takehomes/capgemini_3/capgemini_3/logo.png)
# Purpose

The team has asked that you prepare a 20 to 30-minute presentation (in this notebook) on one of the topics below. The purpose of this
exercise is to demonstrate your ability to draw insights from data, present
those insights in a business-friendly format, and confirm your coding knowledge. These
topics are similar in nature to the projects we run. Please make sure that
your presentation is accessible to a general technical audience (aside from the sections of code, of course).


# Data


**Sales Forecasting**

Build and evaluate models to predict national retail store sales for
each store and department. Our sales are very seasonal, and we make
our money during holidays like Super Bowl, Labor Day, Thanksgiving,
and Christmas. The data is contained in `sales_forecasting.csv`.

# Deliverables
Your notebook should contain:

1.  Description of the problem; state what you are solving/analyzing
2.  Presentation of the insights/conclusions you generate
3.  Relevant descriptive statistics (charts, graphs, etc.)
4.  Specification of predictive model (mathematical formulation)
5.  Relevant model diagnostics
6.  Model interpretation (what do the coefficients mean, how do you use
    them?)
7.  Please specify language, packages and libraries used to develop your
    solution

# Evaluation Criteria

- Presentation on analysis conducted that covers business outcomes and statistical methodologies
- Perform exploratory data analysis to gather starting insights and conclusions
- Selection of ML/Predictive modeling technique(s) & feature extraction
- Knowledge of data ingestion tools/languages
- Ability to conduct appropriate data cleansing, if any
- Ability to code in open source languages

# Rules

- Use any open source language of your choice

- Solution you provide should be your own. Reference any material desired.

- Be prepared to discuss your code in-depth (what it does, how it does it etc.)

- Utilize any statistical or ML technique(s) you deem relevant. For
each technique that you use, be prepared to talk about model
diagnostics, results, and mathematics behind your technique(s).

- No time limit on developing your solution. Let us know when you are ready.


In [1]:
!git clone --branch capgemini_3 https://github.com/interviewquery/takehomes.git
%cd takehomes/capgemini_3
!ls

Cloning into 'takehomes'...
remote: Enumerating objects: 1963, done.
remote: Counting objects: 100% (1963/1963), done.
remote: Compressing objects: 100% (1220/1220), done.
remote: Total 1963 (delta 752), reused 1927 (delta 726), pack-reused 0 (from 0)
Receiving objects: 100% (1963/1963), 297.43 MiB | 12.64 MiB/s, done.
Resolving deltas: 100% (752/752), done.
/content/takehomes/capgemini_3
logo.png  metadata.json  sales_forecasting.csv	takehomefile.ipynb


In [20]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import seaborn as sns
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor, cv

# show full tables in notebook output
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

os.listdir()

['sales_forecasting.csv', 'metadata.json', 'takehomefile.ipynb', 'logo.png']

# READ DATA

In [3]:
# read data
df = pd.read_csv("sales_forecasting.csv")

# print
print(df.shape)
df.head()

(421570, 5)


Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday
0,1,1,2010-02-05,24924.5,False
1,1,1,2010-02-12,46039.49,True
2,1,1,2010-02-19,41595.55,False
3,1,1,2010-02-26,19403.54,False
4,1,1,2010-03-05,21827.9,False
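
Because `pd.read_csv` is called without `parse_dates`, the `Date` column is loaded as plain text. The sketch below shows one way to convert it to a datetime and derive calendar features that could help capture the seasonality mentioned in the task. It uses a small hand-made frame rather than the real file, and the `year` and `week_of_year` column names are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy rows copied from the head() output above; only two columns are kept.
toy = pd.DataFrame({
    "Date": ["2010-02-05", "2010-02-12", "2010-02-19"],
    "Weekly_Sales": [24924.5, 46039.49, 41595.55],
})

# Convert the text dates to datetime64 so they sort and plot chronologically.
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

# Hypothetical calendar features (names are assumptions for illustration).
toy["year"] = toy["Date"].dt.year
toy["week_of_year"] = toy["Date"].dt.isocalendar().week.astype(int)

print(toy)
```

The same two lines could be applied to `df` directly; whether ISO week numbers are the right seasonal feature for this data is a modeling choice to validate.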


# EXPLORATORY ANALYSIS

In [58]:
# NaN check: count missing values in each column
for str_col in df.columns.tolist():
  n_nan = df[str_col].isna().sum()
  print(str_col + " has NaN values: " + str(n_nan))


Store has NaN values: 0
Dept has NaN values: 0
Date has NaN values: 0
Weekly_Sales has NaN values: 0
IsHoliday has NaN values: 0
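
A per-column NaN count can also be obtained in a single call with `DataFrame.isna().sum()`. The sketch below demonstrates this on a small made-up frame that reuses the dataset's column names; the values, including the deliberately missing ones, are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as sales_forecasting.csv (values are made up).
toy = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date": ["2010-02-05", "2010-02-12", None],
    "Weekly_Sales": [24924.5, np.nan, 19403.54],
    "IsHoliday": [False, True, False],
})

# isna() marks every missing cell; sum() counts them column by column.
nan_counts = toy.isna().sum()
print(nan_counts)
```

On the real data this would be `df.isna().sum()`, which checks each column against its own values.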


In [51]:
# general overview
d_dict_formula = {"Dept":"nunique",
                  "Date":["min", "max"],
                  "Weekly_Sales":["min", "mean", "max", "std"]
                  }
df.groupby(["Store"], as_index=False).agg(d_dict_formula)

Unnamed: 0_level_0,Store,Dept,Date,Date,Weekly_Sales,Weekly_Sales,Weekly_Sales,Weekly_Sales
Unnamed: 0_level_1,Unnamed: 1_level_1,nunique,min,max,min,mean,max,std
0,1,77,2010-02-05,2012-10-26,-863.0,21710.543621,203670.47,27748.945511
1,2,78,2010-02-05,2012-10-26,-1098.0,26898.070031,285353.53,33077.612059
2,3,72,2010-02-05,2012-10-26,-1008.96,6373.033983,155897.94,14251.034807
3,4,78,2010-02-05,2012-10-26,-898.0,29161.210415,385051.04,34583.677814
4,5,72,2010-02-05,2012-10-26,-101.26,5053.415813,93517.72,8068.22105
5,6,77,2010-02-05,2012-10-26,-698.0,21913.243624,342578.65,23633.427075
6,7,76,2010-02-05,2012-10-26,-459.0,8358.766148,222921.09,10679.008085
7,8,76,2010-02-05,2012-10-26,-100.0,13133.014768,153431.69,15132.069598
8,9,73,2010-02-05,2012-10-26,-496.0,8772.890379,139427.43,12446.502614
9,10,77,2010-02-05,2012-10-26,-798.0,26332.303819,693099.36,32133.006264


## store

In [96]:
# general overview of the time series for each store
fig = go.Figure()
# one trace per store: average weekly sales across departments
for n_store in df['Store'].unique():
  df_store = df[df['Store'] == n_store].groupby(["Store", "Date"], as_index=False).agg({"Weekly_Sales":"mean"}).sort_values("Date")
  x_axis = df_store["Date"].unique()
  y_axis = df_store["Weekly_Sales"].tolist()
  data = go.Scatter(x=x_axis,
                    y=y_axis,
                    name=f"Store {n_store}"
                    )
  fig.add_trace(data)
# set the title once, after all traces are added
fig.update_layout(title=dict(text='weekly_sales avg for each store'))
# print
fig.show()

## dept

In [97]:
# general overview of the time series for each department
fig = go.Figure()
# one trace per department: average weekly sales across stores
for n_dept in df['Dept'].unique():
  df_dept = df[df['Dept'] == n_dept].groupby(["Dept", "Date"], as_index=False).agg({"Weekly_Sales":"mean"}).sort_values("Date")
  x_axis = df_dept["Date"].unique()
  y_axis = df_dept["Weekly_Sales"].tolist()
  data = go.Scatter(x=x_axis,
                    y=y_axis,
                    name=f"Dept {n_dept}"
                    )
  fig.add_trace(data)
# set the title once, after all traces are added
fig.update_layout(title=dict(text='weekly_sales avg for each department'))
# print
fig.show()

## holidays

In [93]:
# dataframe with holidays (one row per holiday week, numbered from 1)
# uses the boolean IsHoliday column loaded from the CSV
df_holidays = pd.DataFrame({"Date":sorted(df.loc[df["IsHoliday"],"Date"].unique())})
df_holidays.reset_index(drop=False, inplace=True)
df_holidays["index"] = df_holidays["index"] + 1

# dataframe with all dates
df_dates = pd.DataFrame({"Date":sorted(df["Date"].unique())})
# merge: non-holiday weeks get NaN in "index"
df_dates = df_dates.merge(df_holidays, on="Date", how="left")
df_dates["d_is_holiday"] = np.where(df_dates["Date"].isin(df_holidays["Date"]), df_dates["index"],np.nan)
df_dates["str_is_holiday"] = np.where(df_dates["d_is_holiday"].notna(), df_dates["Date"], "")

In [92]:
# holiday weeks over time, each labeled with its date
fig = go.Figure()
x_axis = df_dates["Date"].unique()
y_axis = df_dates["index"].tolist()
data = go.Scatter(x=x_axis,
                  y=y_axis,
                  name="Holidays",
                  mode="markers+text",
                  text=df_dates["str_is_holiday"],
                  textposition="bottom center"
                  )
fig.add_trace(data)
fig.update_layout(title=dict(text='Holidays'))
# print
fig.show()
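
Since the task states that sales peak around holidays, a quick numeric check is to compare average `Weekly_Sales` between holiday and non-holiday weeks. The sketch below does this on a made-up frame with the dataset's column names. It is a raw comparison that does not control for store or department, so it should be read as a first look rather than an estimate of the holiday effect.

```python
import pandas as pd

# Toy frame with invented values; IsHoliday is boolean as in the CSV.
toy = pd.DataFrame({
    "Date": ["2010-02-05", "2010-02-12", "2010-02-19"],
    "Weekly_Sales": [100.0, 300.0, 200.0],
    "IsHoliday": [False, True, False],
})

# Average sales in holiday weeks and in regular weeks.
holiday_mean = toy.loc[toy["IsHoliday"], "Weekly_Sales"].mean()
regular_mean = toy.loc[~toy["IsHoliday"], "Weekly_Sales"].mean()

# Relative uplift of holiday weeks over regular weeks.
uplift = holiday_mean / regular_mean - 1

print(f"holiday mean: {holiday_mean:.2f}")
print(f"regular mean: {regular_mean:.2f}")
print(f"uplift: {uplift:.1%}")
```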