## Data Scientist Challenge

In this challenge you are assuming the position of a data scientist for a large corporation with offices in many states. The company you work for is looking to gain insights from its data from accross the many states that it does business over the course of the year 2014. The company runs many 'sales' over the course of the year in its states. The data you have is of the 10,000 different sales run in different states. Each sale lasts for a particular amount of time denotd by 'sale_start' and 'sale_end' columns. Along with each sale is the revenue for that particular sale.

The company is primarily interested in determining what is happening to revenue over time and what strategy should be undertaken to maximize revenue.

You don't need any advanced statistics for this assignment. Good exploration and visualizations will reveal everything you need to know about what is happening. Check out pandas time-series module documentation here - http://pandas.pydata.org/pandas-docs/stable/timeseries.html

Please fork the repository into your github profile, clone it and then start working on it locally.

In [1]:
# starter code
import pandas as pd
import numpy as np
import calendar

In [2]:
from bokeh.io import output_notebook, show, vplot
import bokeh.plotting as bk_plotting
import bokeh.layouts as bk_layouts
from bokeh.models import Legend
from bokeh.models.ranges import FactorRange
output_notebook()

In [3]:
from bokeh.charts import Bar
from bokeh.charts import BoxPlot
from bokeh.charts import Histogram
from bokeh.models.ranges import Range1d

In [4]:
sales = pd.read_csv('data/sales.csv', parse_dates=['sale_start', 'sale_end'])

In [5]:
sales.head()

Unnamed: 0,state,sale_start,sale_end,sale_key,revenue
0,Arkansas,2014-12-24,2014-12-24,0,1311.0
1,Florida,2014-10-15,2014-10-17,1,698.0
2,Iowa,2014-09-07,2014-09-07,2,1193.0
3,Indiana,2014-05-19,2014-05-22,3,469.0
4,Maine,2014-04-19,2014-04-19,4,334.0


In [6]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
state         10000 non-null object
sale_start    10000 non-null datetime64[ns]
sale_end      10000 non-null datetime64[ns]
sale_key      10000 non-null int64
revenue       10000 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1), object(1)
memory usage: 390.7+ KB


In [7]:
sales['sale_duration_int'] = (sales.sale_end - sales.sale_start).dt.days
sales['sale_duration'] = sales['sale_duration_int'].map(lambda t: '{} days'.format(t))
sales['sale_month'] = sales.sale_start.dt.month.map(lambda x: calendar.month_abbr[x])
sales['sale_month_int'] = sales.sale_start.dt.month
sales = sales[['state', 'sale_month_int', 'sale_month', 'sale_start', 'sale_end', 'sale_duration', 
               'sale_duration_int', 'sale_key', 'revenue']]
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

sales.head(3)

Unnamed: 0,state,sale_month_int,sale_month,sale_start,sale_end,sale_duration,sale_duration_int,sale_key,revenue
0,Arkansas,12,Dec,2014-12-24,2014-12-24,0 days,0,0,1311.0
1,Florida,10,Oct,2014-10-15,2014-10-17,2 days,2,1,698.0
2,Iowa,9,Sep,2014-09-07,2014-09-07,0 days,0,2,1193.0


In [8]:
g = sales.groupby(['state', 'sale_duration'])

In [9]:
sales_states_duration = g['revenue'].mean().unstack('sale_duration')

In [10]:
h = sales_states_duration.mean(axis=1)

In [11]:
sales = sales.sort_values('sale_duration')
sales.head()

Unnamed: 0,state,sale_month_int,sale_month,sale_start,sale_end,sale_duration,sale_duration_int,sale_key,revenue
0,Arkansas,12,Dec,2014-12-24,2014-12-24,0 days,0,0,1311.0
6459,Oregon,4,Apr,2014-04-06,2014-04-06,0 days,0,6459,771.0
6458,Oregon,2,Feb,2014-02-28,2014-02-28,0 days,0,6458,856.0
2294,Oklahoma,10,Oct,2014-10-07,2014-10-07,0 days,0,2294,471.0
6454,Alaska,1,Jan,2014-01-31,2014-01-31,0 days,0,6454,647.0


In [12]:
width, height = 900, 600
p1 = Bar(sales.sort_values('sale_duration'), label='state', values='revenue', agg='mean', stack='sale_duration',
        title="Average revenue per state (per sale duration)", xlabel='', legend='top_right',
        plot_width=width, plot_height=height)

p2 = Bar(sales, label='state', values='revenue', agg='count', stack='sale_duration',
        title="Number of sales per state (per sale duration)", xlabel='', legend='top_right',
        plot_width=width, plot_height=height)

p3 = Bar(sales.sort_values('sale_month_int'), label='state', values='revenue', agg='count', stack='sale_month',
        title="Number of sales per state (per month)", xlabel='', legend='top_right',
        plot_width=width, plot_height=height)

p4 = BoxPlot(sales, label='state', values='revenue',
            xlabel='', ylabel='Revenue', title='Revenue per state', legend=None,
            plot_width=width, plot_height=height)

p5 = BoxPlot(sales, label='sale_month', values='revenue',
            xlabel='', ylabel='Revenue', title='Revenue per month', legend=None,
            plot_width=width, plot_height=height)
p5.x_range = FactorRange(factors=months)

gp = bk_layouts.column(p1, p2, p3, p4, p5)
show(gp)

### Split by semester

In [13]:
sales.head()

Unnamed: 0,state,sale_month_int,sale_month,sale_start,sale_end,sale_duration,sale_duration_int,sale_key,revenue
0,Arkansas,12,Dec,2014-12-24,2014-12-24,0 days,0,0,1311.0
6459,Oregon,4,Apr,2014-04-06,2014-04-06,0 days,0,6459,771.0
6458,Oregon,2,Feb,2014-02-28,2014-02-28,0 days,0,6458,856.0
2294,Oklahoma,10,Oct,2014-10-07,2014-10-07,0 days,0,2294,471.0
6454,Alaska,1,Jan,2014-01-31,2014-01-31,0 days,0,6454,647.0


In [14]:
q1 = ['Jan', 'Feb', 'Mar']
q2 = ['Apr', 'May', 'Jun']
q3 = ['Jul', 'Aug', 'Sep']
q4 = ['Oct', 'Nov', 'Dec']

def to_quarter(month):
    global q1, q2, q3, q4
    if month in q1:
        return 'Q1'
    elif month in q2:
        return 'Q2'
    elif month in q3:
        return 'Q3'
    else:
        return 'Q4'
    
sales['quarter'] = sales.sale_month.map(to_quarter)
sales.head()

Unnamed: 0,state,sale_month_int,sale_month,sale_start,sale_end,sale_duration,sale_duration_int,sale_key,revenue,quarter
0,Arkansas,12,Dec,2014-12-24,2014-12-24,0 days,0,0,1311.0,Q4
6459,Oregon,4,Apr,2014-04-06,2014-04-06,0 days,0,6459,771.0,Q2
6458,Oregon,2,Feb,2014-02-28,2014-02-28,0 days,0,6458,856.0,Q1
2294,Oklahoma,10,Oct,2014-10-07,2014-10-07,0 days,0,2294,471.0,Q4
6454,Alaska,1,Jan,2014-01-31,2014-01-31,0 days,0,6454,647.0,Q1


In [45]:
st = list(sales.state.unique())
for sales in st
sales.loc[(sales.quarter == 'Q1') & (sales.state == 'Arkansas'), 'revenue'].sum()

21774.0

In [15]:
width, height = 900, 600
p1 = Bar(sales.sort_values('quarter'), label='state', values='revenue', agg='count', stack='quarter',
        title="", xlabel='', ylabel='Number of sales', legend='top_right',
        plot_width=width, plot_height=height)

p2 = Bar(sales.sort_values('quarter'), label='state', values='revenue', agg='mean', stack='quarter',
        title="", xlabel='', ylabel='Average revenue', legend='top_right',
        plot_width=width, plot_height=height)

p3 = Bar(sales.sort_values('quarter'), label='state', values='revenue', agg='sum', stack='quarter',
        title="", xlabel='', ylabel='Total revenue', legend='top_right',
        plot_width=width, plot_height=height)

gp = bk_layouts.column(p1, p2, p3)
show(gp)

In [16]:
gr_month = sales.groupby('sale_month')['revenue']
sales_month = gr_month.agg(['sum', 'count', 'mean'])
sales_month.index = months
sales_month['month'] = months
sales_month.columns = ['Monthly_Revenue', 'Monthly_Sales', 'Average_Sale_Revenue', 'Month']
sales_month = sales_month[['Month', 'Monthly_Sales', 'Monthly_Revenue', 'Average_Sale_Revenue']]
sales_month

Unnamed: 0,Month,Monthly_Sales,Monthly_Revenue,Average_Sale_Revenue
Jan,Jan,757,618229.0,816.682959
Feb,Feb,892,846619.0,949.124439
Mar,Mar,820,833648.0,1016.643902
Apr,Apr,727,534894.0,735.755158
May,May,791,537733.0,679.814159
Jun,Jun,832,765917.0,920.573317
Jul,Jul,806,707292.0,877.533499
Aug,Aug,749,576577.0,769.795728
Sep,Sep,895,770376.0,860.755307
Oct,Oct,918,922144.0,1004.514161


In [17]:
np.arange(200,1800, 200)
sales['sale_type'] = pd.cut(sales.revenue.values, np.arange(200,1800, 200), 
                            labels=[' 200 < sale < 400', ' 400 < sale < 600',
                                    ' 600 < sale < 800', ' 800 < sale < 1000',
                                    '1000 < sale < 1200', '1200 < sale < 1400',
                                    '1400 < sale < 1600'])

In [18]:
width, height = 800, 300
p1 = Bar(sales, label='sale_duration', values='revenue', color='sale_duration', agg='mean', 
        title="Average revenue per sale duration", legend=None,
        plot_width=width, plot_height=height)

p2 = Bar(sales, label='sale_month', values='revenue', color='sale_month', agg='mean', 
        title="Average revenue per month", legend=None,
        plot_width=width, plot_height=height)
p2.x_range = FactorRange(factors=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

p3 = Bar(sales, label='sale_month', values='revenue', color='sale_month', agg='sum', 
        title="Total revenue per month", legend=None,
        plot_width=width, plot_height=height)
p3.x_range = p2.x_range

p4 = Bar(sales, label='sale_month', values='revenue', agg='count', 
        title="Number of sales per month", legend=None,
        plot_width=width, plot_height=height)
p4.x_range = p2.x_range

show(bk_layouts.column(p1, p2, p3, p4))

In [19]:
width, height = 400, 400
p0 = Bar(sales[sales.sale_duration_int==0], label='state', values='revenue', agg='mean', 
        title="Average revenue per state (same-day sale)", legend=None,
        plot_width=width, plot_height=height)
p1 = Bar(sales[sales.sale_duration_int==1], label='state', values='revenue', agg='mean', 
        title="Average revenue per state (1-day sale)", legend=None,
        plot_width=width, plot_height=height)
p2 = Bar(sales[sales.sale_duration_int==2], label='state', values='revenue', agg='mean', 
        title="Average revenue per state (2-day sale)", legend=None,
        plot_width=width, plot_height=height)
p3 = Bar(sales[sales.sale_duration_int==3], label='state', values='revenue', agg='mean', 
        title="Average revenue per state (3-day sale)", legend=None,
        plot_width=width, plot_height=height)

gp = bk_layouts.gridplot([[p0,p1], [p2,p3]])
show(gp)

In [20]:
width, height = 900, 300
p1 = Histogram(sales, 'revenue',
        plot_width=width, plot_height=height)
show(p1)

In [21]:
width, height = 440, 300
p1 = Histogram(sales[sales.revenue<=1000], 'revenue', title = 'Revenue <= 1000',
        plot_width=width, plot_height=height)
p2 = Histogram(sales[sales.revenue>1000], 'revenue', title = 'Revenue > 1000',
        plot_width=width, plot_height=height)

p1.y_range = Range1d(0,400)
p2.y_range = p1.y_range

gp = bk_layouts.row(p1,p2)
show(gp)

In [22]:
width, height = 900, 600
p1 = Bar(sales.sort_values(['sale_type']), label='sale_month', values='revenue', stack='sale_type', agg='count', 
        title="Number of sales per type of sale (per month)", legend='top_right',
        plot_width=width, plot_height=height)

p1.toolbar_location = 'above'
p1.y_range = Range1d(0, 1700)
p1.x_range = FactorRange(factors=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

show(p1)