## Beans & Pods Coffee Shop Case Study
---
#### A Practical Data Science Primer

Goals:

1. Analyze the dataset provided by Beans & Pods Coffee Shop.
2. Look for patterns in the data.
3. Provide recommendations for a new marketing campaign.
4. Suggest additional data to collect for future analysis.
5. Prepare a presentation of the data.

In [None]:
# import all libraries as necessary
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# When working with matplotlib in a notebook environment, this magic command
# displays the output of plotting commands inline within the notebook:
# Reference: https://www.statology.org/matplotlib-inline/
%matplotlib inline

In [None]:
# How to read an Excel file stored locally with pandas (and list its sheet names):
# https://stackoverflow.com/questions/46599016/reading-xlsx-file-using-jupyter-notebook

path = './06_BeansDataSet.xlsx'
x1 = pd.ExcelFile(path)
print(x1.sheet_names)

In [None]:
# Load the Excel file into a dataframe (reads the first sheet by default):
data = pd.read_excel('./06_BeansDataSet.xlsx')

# view the head of the dataset:
data.head(10)

In [None]:
# Need to clean up the data and get rid of the "NaN" values

# the "dropna" method drops rows that contain missing (NaN) values
# how='all' drops only the rows in which every value is NaN
# how='any' drops every row that has at least one NaN value (even if other cells contain data)
data.dropna(how='all', inplace=True)

# Now we can view the nice clean data values
data.head(10)
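
The difference between `how='all'` and `how='any'` is easiest to see on a tiny example. The sketch below uses a made-up three-row frame (not the Beans & Pods data) in which one row is partly missing and one row is entirely missing.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: row 1 is partly missing, row 2 is entirely missing.
toy = pd.DataFrame({
    "Channel": ["Store", "Online", np.nan],
    "Robusta": [100.0, np.nan, np.nan],
})

dropped_all = toy.dropna(how="all")  # removes only the fully empty row
dropped_any = toy.dropna(how="any")  # removes every row with at least one NaN

print(len(toy), len(dropped_all), len(dropped_any))
```

With `how='all'` the partly filled row survives, which is why it is the safer choice when the goal is only to strip blank spreadsheet rows.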

In [None]:
# make a copy of the data to work with and display it:

df = data.copy()
df.head()

In [None]:
# view info for our dataframe to check for irregularities:

df.info()

In [None]:
# view description of the dataset to check statistics:

df.describe()

## Explore, Dissect & Analyze Data
---

In [None]:
df.head()

In [None]:
# How many unique Regions are in the dataset?

df.Region.nunique()

In [None]:
# We know there are 3 unique Regions
# How many sales records (rows) are in each region?

# display numerical values:
df.Region.value_counts()

In [None]:
# Or make it a fancy bar graph:

df.Region.value_counts().plot.bar(title='Total Sales by Region')
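
Note that `value_counts()` counts rows, not revenue, so a region with many small records can rank above a region with a few large ones. The following sketch, on made-up numbers with only two product columns, contrasts the row count with the summed revenue per region; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up data: South has the most rows, North has the most revenue.
toy = pd.DataFrame({
    "Region": ["South", "South", "North", "Central"],
    "Robusta": [10, 20, 100, 5],
    "Arabica": [1, 2, 50, 5],
})

products = ["Robusta", "Arabica"]

# Number of records per region (what value_counts reports).
rows_per_region = toy["Region"].value_counts()

# Revenue per region: sum each product within a region, then add the products together.
revenue_per_region = toy.groupby("Region")[products].sum().sum(axis=1)

print(rows_per_region)
print(revenue_per_region)
```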

In [None]:
# How many sales per channel?

# display numerical values:
df.Channel.value_counts()

In [None]:
# display as a bar graph:

df.Channel.value_counts().plot.bar(ylabel='Total Sales', title="Total Sales per Channel", color=['blue', 'red'])

In [None]:
df.head()

In [None]:
# Let's group by Channel and aggregate another column, such as Robusta:
# notice that the agg method takes a dictionary mapping each column to a list of aggregations

df.groupby('Channel').agg({'Robusta':['min','max','mean','sum']})

In [None]:
# plot the sum of sales for all products by Channel:

df.groupby('Channel').agg({'Robusta':['sum'],
                           'Arabica':['sum'],
                           'Espresso':['sum'],
                           'Lungo':['sum'],
                           'Latte':['sum'],
                           'Cappuccino':['sum'],}).plot.bar(ylabel='Millions of $', title='Total Sales by Channel for All Products')

In [None]:
# plot the mean and total sales for all products by Channel:
# Note the legend has been turned off (legend=False) to unclutter the bar graph

# store info as a new variable
channel_sales = df.groupby('Channel').agg({'Robusta':['mean','sum'],
                                           'Arabica':['mean','sum'],
                                           'Espresso':['mean','sum'],
                                           'Lungo':['mean','sum'],
                                           'Latte':['mean','sum'],
                                           'Cappuccino':['mean','sum'],})

# display it as a bar graph:
# Note that the mean values are too small to see next to the sums
channel_sales.plot.bar(legend=False,
                       ylabel='Millions of $',
                       title='Mean and Total Sales by Channel for All Products')
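
Because the means and sums share one axis, the means are dwarfed by the sums. One possible way around this, sketched here on made-up data, is to split the aggregated frame by its second column level with `DataFrame.xs`, so that each statistic can be plotted on its own chart. The names `means` and `totals` are illustrative.

```python
import pandas as pd

# Made-up data with two products only.
toy = pd.DataFrame({
    "Channel": ["Store", "Store", "Online", "Online"],
    "Robusta": [100, 300, 10, 30],
    "Latte": [20, 40, 60, 80],
})

agg = toy.groupby("Channel").agg({"Robusta": ["mean", "sum"],
                                  "Latte": ["mean", "sum"]})

# Take a cross-section of the column MultiIndex: level 1 holds 'mean' / 'sum'.
means = agg.xs("mean", axis=1, level=1)
totals = agg.xs("sum", axis=1, level=1)

print(means)
print(totals)
```

Each of the two resulting frames has plain product columns, so calling `.plot.bar()` on them separately gives every statistic its own scale.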

In [None]:
# what is the data type of our new variable?

type(channel_sales)
# it is a dataframe: so we can perform the same type of dataframe operations on it if needed.

In [None]:
# plot the mean and total sales for all products by Region:
# Note the legend has been turned off (legend=False) to unclutter the bar graph

# store info as a new variable
region_sales = df.groupby('Region').agg({'Robusta':['mean','sum'],
                                         'Arabica':['mean','sum'],
                                         'Espresso':['mean','sum'],
                                         'Lungo':['mean','sum'],
                                         'Latte':['mean','sum'],
                                         'Cappuccino':['mean','sum'],})

# display it as a bar graph:
# Note that the mean values are too small to see next to the sums
region_sales.plot.bar(legend=False,
                      ylabel='Millions of $',
                      title='Mean and Total Sales by Region for All Products')

In [None]:
# check out our custom sales per channel dataframe:
channel_sales.head()

In [None]:
# check out our custom sales per region dataframe:
region_sales.head()

In [None]:
# We still need a complete picture of where the best performers are. We don't know
# what region is best online or in store, so we need that data next.

sales = df.groupby(['Channel', 'Region']).agg({'Robusta':['mean','sum'],
                                               'Arabica':['mean','sum'],
                                               'Espresso':['mean','sum'],
                                               'Lungo':['mean','sum'],
                                               'Latte':['mean','sum'],
                                               'Cappuccino':['mean','sum']})
# This is much more comprehensive:
# Channel per Region:
sales.head(6)

In [None]:
# Check info in the new combined dataframe:
# Note that Channel and Region are no longer columns; together they form the (Multi)index of the dataframe
# Note that the columns now have 2 levels (product, aggregation) instead of just 1

sales.info()

In [None]:
# more info on the columns:

sales.columns

In [None]:
# more information on the indices:

sales.index
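
If the two-level index and two-level columns get in the way (for example when exporting the table), they can be flattened. The sketch below shows one common approach on made-up data: join each column tuple into a single name and move the index levels back into ordinary columns. The `_` separator and the name `flat` are arbitrary choices.

```python
import pandas as pd

# Made-up data with a single product.
toy = pd.DataFrame({
    "Channel": ["Store", "Store", "Online"],
    "Region": ["South", "North", "South"],
    "Robusta": [100, 300, 10],
})

agg = toy.groupby(["Channel", "Region"]).agg({"Robusta": ["mean", "sum"]})

flat = agg.copy()
# ('Robusta', 'mean') -> 'Robusta_mean'
flat.columns = ["_".join(col) for col in flat.columns]
# Channel and Region become regular columns again.
flat = flat.reset_index()

print(flat)
```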

In [None]:
# make a new copy of our custom sales dataframe

df1 = sales.copy()

In [None]:
# make sure this is the data we want to work with

df1.head(6)

In [None]:
# select a row by its index with the .loc method
# These are the data values for Online sales in the Central region by Product

df1.loc[('Online', 'Central')]
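
A tuple passed to `.loc` selects one exact (Channel, Region) row. Passing only the first level, or using `xs` with a level name, selects a whole slice instead. The sketch below illustrates the three forms on made-up totals with single-level columns, so it is simpler than `df1` above.

```python
import pandas as pd

# Made-up data; two rows share the (Online, Central) combination.
toy = pd.DataFrame({
    "Channel": ["Store", "Store", "Online", "Online", "Online"],
    "Region": ["South", "Central", "South", "Central", "Central"],
    "Robusta": [100, 50, 10, 5, 5],
    "Latte": [20, 10, 60, 30, 10],
})

totals = toy.groupby(["Channel", "Region"])[["Robusta", "Latte"]].sum()

one_row = totals.loc[("Online", "Central")]     # Series: one channel, one region
online = totals.loc["Online"]                   # DataFrame: all regions for Online
central = totals.xs("Central", level="Region")  # DataFrame: all channels for Central

print(one_row)
print(online)
print(central)
```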

In [None]:
# Create a bar plot:

df1.loc[('Online', 'Central')].plot.bar(ylabel='Dollars', title='Online Sales, Central Region')

In [None]:
# turns out we don't need the mean, so we can make another dataframe:

sales1 = df.groupby(['Channel', 'Region']).agg({'Robusta':['sum'],
                                               'Arabica':['sum'],
                                               'Espresso':['sum'],
                                               'Lungo':['sum'],
                                               'Latte':['sum'],
                                               'Cappuccino':['sum']})
# This is much more comprehensive:
# Channel per Region:
sales1.head(6)

In [None]:
# Alternative: get the same sales totals using the pandas pivot_table function:
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html
# note that "data" from the beginning of this exercise can be replaced with its copy, "df"

# aggfunc='sum' (a string) is used because recent pandas versions warn when np.sum is passed
sales_table = pd.pivot_table(data,
                             values=['Robusta', 'Arabica', 'Espresso', 'Lungo', 'Latte', 'Cappuccino'],
                             index=['Channel', 'Region'],
                             aggfunc='sum')

# display our table: note the lack of "sum" in the columns
sales_table
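
To confirm that the pivot table and the groupby approach produce the same totals, the two results can be compared directly. The sketch below does this on made-up data; the columns are sorted first because `pivot_table` returns them in alphabetical order.

```python
import pandas as pd

# Made-up data with two products.
toy = pd.DataFrame({
    "Channel": ["Store", "Store", "Online", "Online", "Online"],
    "Region": ["South", "Central", "South", "Central", "Central"],
    "Robusta": [100, 50, 10, 5, 5],
    "Latte": [20, 10, 60, 30, 10],
})

products = ["Robusta", "Latte"]

by_groupby = toy.groupby(["Channel", "Region"])[products].sum()
by_pivot = pd.pivot_table(toy, values=products,
                          index=["Channel", "Region"], aggfunc="sum")

# Align the column order before comparing.
same = by_groupby.sort_index(axis=1).equals(by_pivot.sort_index(axis=1))
print(same)
```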

In [None]:
# make a new copy of the new sales dataframe

df2 = sales1.copy()
df2.head(6)

In [None]:
# create a new bar plot from the new dataframe using only the sum data:

df2.loc[('Online', 'Central')].plot.bar(ylabel='Dollars', title='Online Sales, Central Region')

In [None]:
# plot all sales for all channels and regions

df2.plot.bar(figsize=(15, 10), ylabel='Millions of $', title='All Sales for all Regions')

### Trend analysis so far:
---

Trends By Region:

1. Overall, the South region has the most sales revenue by far.
2. Overall, the Central region has the lowest sales revenue.

Trends by product:

1. For In-Store sales, all regions, Robusta grosses the most revenue by far.
2. For In-Store sales, all regions, Latte grosses the least revenue.
3. For Online sales, all regions, Espresso grosses the most revenue, followed by Arabica.
4. For Online sales, all regions, Lungo and Cappuccino gross the least revenue.
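
Rankings like these can be computed rather than read off the charts. The sketch below is one way to do so; the numbers are made up for illustration and are not the Beans & Pods figures, and only three products are included.

```python
import pandas as pd

# Made-up revenue figures, not the real dataset.
toy = pd.DataFrame({
    "Channel": ["Store", "Store", "Online", "Online"],
    "Region": ["South", "North", "South", "Central"],
    "Robusta": [500, 300, 60, 20],
    "Espresso": [100, 80, 200, 150],
    "Latte": [50, 40, 90, 70],
})

products = ["Robusta", "Espresso", "Latte"]
totals = toy.groupby(["Channel", "Region"])[products].sum()

# Total revenue per region, highest first.
region_revenue = (totals.groupby(level="Region").sum()
                        .sum(axis=1)
                        .sort_values(ascending=False))

# Best and worst product per channel.
channel_totals = totals.groupby(level="Channel").sum()
best = channel_totals.idxmax(axis=1)
worst = channel_totals.idxmin(axis=1)

print(region_revenue)
print(best)
print(worst)
```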

### Seller Recommendations:
---
Problem: The data is a static snapshot. Modeling trends requires a time element, which the dataset lacks.

1. Add a time element to the data so that trends can be modeled. When was each record logged? Growth or decline in sales over time would be very useful.

2. For In-Store sales, track sales by individual salesperson over time as a performance measure.

3. Build a recommendation system based on market analysis of the items customers buy, linked to customer profiles, to reveal purchasing patterns.

    3.1 This can be buyer based: customer profiles and purchase histories are used to recommend products to each individual customer.
    
    3.2 This can be product based: products that are frequently bought together are linked, so when someone adds a product to their cart, they are recommended the products most often bought with it.
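
The product-based idea in 3.2 can be sketched with simple co-occurrence counts. This is a minimal illustration and not a production recommender: the `baskets` list is hypothetical (the current dataset has no per-order item data), and `recommend` is a helper invented for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order baskets; this data would need to be collected first.
baskets = [
    {"Espresso", "Arabica"},
    {"Espresso", "Arabica", "Latte"},
    {"Robusta", "Latte"},
    {"Espresso", "Lungo"},
]

# Count how often each unordered pair of products appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1


def recommend(product, top_n=2):
    """Return up to top_n products most often bought together with `product`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == product:
            scores[b] += n
        elif b == product:
            scores[a] += n
    return [item for item, _ in scores.most_common(top_n)]


print(recommend("Espresso", top_n=1))
```

Raw counts favor popular products, so a real system would likely normalize them (for example by each product's overall frequency) before ranking.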