# Marijuana Sales Prediction

## Project Objective
- Goal: Predict sales of each brand
- Sub-problems: 

## Load Data
- Load from local dataset

In [59]:
# import required packages
import sys
import os
import pandas as pd
# Load Dataset
avgRetail = pd.read_csv("../data/BrandAverageRetailPrice.csv")
brandDetail = pd.read_csv("../data/BrandDetails.csv")
totalSales = pd.read_csv("../data/BrandTotalSales.csv")
totalUnits = pd.read_csv("../data/BrandTotalUnits.csv")

## Construct Dataset 
- Time-series data
    - Convert all dates into the same timestamp format
- Clean Data 
    - without imputation
- Feature Engineering
    - Feature augmentation
    - Combine features from the same or different datasets

### Find the dataset to start with
- Criteria: The information each dataset contains
- Objective: Since the final goal is to predict future sales, start with the dataset that appears most strongly related to sales and has the fewest features. 

#### Total Sales dataset

In [60]:
totalSales.info()
totalSales.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25279 entries, 0 to 25278
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Months           25279 non-null  object
 1   Brand            25279 non-null  object
 2   Total Sales ($)  25279 non-null  object
dtypes: object(3)
memory usage: 592.6+ KB


,Months,Brand,Total Sales ($)
0,09/2018,10x Infused,1711.334232
1,09/2018,1964 Supply Co.,25475.215945
2,09/2018,3 Bros Grow,120153.644757
3,09/2018,3 Leaf,6063.529785
4,09/2018,350 Fire,631510.048155


#### Total units dataset

In [61]:
totalUnits.info()
totalUnits.head(5)
# next: find out how many brands the data currently contains

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27686 entries, 0 to 27685
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Brands            27686 non-null  object 
 1   Months            27686 non-null  object 
 2   Total Units       25712 non-null  object 
 3   vs. Prior Period  24935 non-null  float64
dtypes: float64(1), object(3)
memory usage: 865.3+ KB


,Brands,Months,Total Units,vs. Prior Period
0,#BlackSeries,08/2020,1616.339004,
1,#BlackSeries,09/2020,,-1.0
2,#BlackSeries,01/2021,715.532838,
3,#BlackSeries,02/2021,766.669135,0.071466
4,#BlackSeries,03/2021,,-1.0


### Clean the chosen dataset
- Selected dataset: Total Sales
- Breakdown: By brand
- Reason: 
    - It contains more information about sales. 
    - Its brand-level data is much narrower: three columns, none with missing values. 

#### Preprocess the selected dataset

In [62]:
import numpy as np
# convert the "Months" column from MM/YYYY strings to timestamps
totalSales["Months"] = pd.to_datetime(totalSales["Months"], format="%m/%Y")
# convert the sales figures to numeric form (strip thousands separators first)
totalSales["Total Sales ($)"] = totalSales["Total Sales ($)"].str.replace(',', '')
totalSales["Total Sales ($)"] = pd.to_numeric(totalSales["Total Sales ($)"])
totalSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25279 entries, 0 to 25278
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Months           25279 non-null  datetime64[ns]
 1   Brand            25279 non-null  object        
 2   Total Sales ($)  25279 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 592.6+ KB


#### Find all the brand names

In [63]:
brands = list(totalSales["Brand"].unique())
print("==================================")
print("Total: {} different brands".format(len(brands)))
print("==================================")

Total: 1627 different brands


#### Find the brand to start with
- Start with a brand that has the most monthly records. 

In [66]:
# start with one of the brands
# list only the top three brands by number of records
totalSales["Brand"].value_counts().head(3)

Lift Ticket Laboratories    37
Garden Society              37
Field Extracts              37
Name: Brand, dtype: int64

### Feature Engineering 
- Transform and add more features in the selected dataset. 
- Add more features from other related datasets. 

#### Sales info about selected brand

In [98]:
brandName = 'Garden Society'
brandData = totalSales[totalSales.Brand == brandName].set_index("Months").drop(columns=["Brand"])
brandData.head(10)



Months,Total Sales ($)
2018-09-01,679.796207
2018-10-01,9847.971509
2018-11-01,17585.544522
2018-12-01,13796.748683
2019-01-01,13525.256162
2019-02-01,129564.314831
2019-03-01,104925.011206
2019-04-01,131054.937778
2019-05-01,121876.366551
2019-06-01,218701.840227


#### Adding Features to Dataset

In [120]:
# adding from current dataset 
# add last month's data (a positive shift moves each value down one row,
# so every month sees the month before it)
brandData.loc[:, 'Previous Month Sales'] = brandData.loc[:,"Total Sales ($)"].shift(1)

#======================Adding Rolling Data======================
# add rolling sales data (mean of the previous three months)
# calculate the sum
rollSum, rollMonths = 0, 3
for month in range(1, rollMonths + 1): 
    rollSum += brandData.loc[:,"Total Sales ($)"].shift(month)
brandData.loc[:, 'Rolling Sales (3 months)'] = rollSum / rollMonths

# add rolling sales data (mean of the previous six months)
rollSum, rollMonths = 0, 6
for month in range(1, rollMonths + 1): 
    rollSum += brandData.loc[:,"Total Sales ($)"].shift(month)
brandData.loc[:, 'Rolling Sales (6 months)'] = rollSum / rollMonths

brandData.head()

Months,Total Sales ($),Previous Month Sales,Rolling Sales (3 months),Rolling Sales (6 months)
2018-09-01,679.796207,,,
2018-10-01,9847.971509,679.796207,,
2018-11-01,17585.544522,9847.971509,,
2018-12-01,13796.748683,17585.544522,9371.104079,
2019-01-01,13525.256162,13796.748683,13743.421571,
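
Lag and rolling-average features of this kind can also be written with pandas' built-in `shift` and `rolling` methods. The sketch below is one possible formulation, not the notebook's own code: the helper name `add_lag_features` is hypothetical, and the `sales` frame holds made-up numbers. Shifting by one month before taking the rolling mean restricts every feature to earlier months, so the month being predicted never feeds its own inputs.

```python
import pandas as pd


def add_lag_features(frame, column, windows=(3, 6)):
    """Return a copy of `frame` with lag and rolling-mean features built from past months only."""
    out = frame.copy()
    # shift(1) moves each value down one row: row t now holds month t-1
    past = out[column].shift(1)
    out["Previous Month Sales"] = past
    for window in windows:
        # mean of the `window` months that precede the current row
        out[f"Rolling Sales ({window} months)"] = past.rolling(window).mean()
    return out


# made-up monthly sales, indexed by month start (illustration only)
months = pd.date_range("2018-09-01", periods=8, freq="MS")
sales = pd.DataFrame(
    {"Total Sales ($)": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]},
    index=months,
)
features = add_lag_features(sales, "Total Sales ($)")
print(features)
```

The first rows of each rolling column stay empty because there is not yet a full window of history; they can be dropped before training rather than imputed.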


#### Add features from other datasets
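
One way to bring in features from another dataset is to join on brand and month. The sketch below is an assumption about how that could look, using tiny made-up frames in place of `totalSales` and `totalUnits`. It relies on two details visible in the `info()` output above: the units table names its key column `Brands` rather than `Brand`, and `Total Units` is stored as text. The helper name `merge_units` is hypothetical, and the comma stripping assumes the units use the same thousands separators as the sales figures.

```python
import pandas as pd


def merge_units(sales, units):
    """Left-join unit counts onto the sales table by brand and month."""
    units = units.rename(columns={"Brands": "Brand"}).copy()
    # bring the month strings into the same timestamp format as the sales table
    units["Months"] = pd.to_datetime(units["Months"], format="%m/%Y")
    # assumed: units are text with thousands separators; unparseable values become NaN
    units["Total Units"] = pd.to_numeric(
        units["Total Units"].str.replace(",", ""), errors="coerce"
    )
    return sales.merge(
        units[["Brand", "Months", "Total Units"]], on=["Brand", "Months"], how="left"
    )


# made-up stand-ins for the real tables (illustration only)
sales = pd.DataFrame({
    "Months": pd.to_datetime(["09/2018", "10/2018"], format="%m/%Y"),
    "Brand": ["Brand A", "Brand A"],
    "Total Sales ($)": [1000.0, 1500.0],
})
units = pd.DataFrame({
    "Brands": ["Brand A", "Brand A"],
    "Months": ["09/2018", "10/2018"],
    "Total Units": ["1,200.5", None],
})
merged = merge_units(sales, units)
print(merged)
```

A left join keeps every sales row and leaves missing unit counts empty, which matches the plan to clean the data without imputation.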

## Explore Data 
- Visualize Data 
- Explore data (correlation)
- Objective of the feature combinations

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
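
As a starting point for the correlation step, pandas can compute pairwise correlations directly. The sketch below uses a small synthetic frame, not the real brand data; the `Noise` column and all values are made up for illustration. The resulting matrix could then be passed to a plotting call such as `seaborn.heatmap`.

```python
import numpy as np
import pandas as pd

# synthetic monthly sales (illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Total Sales ($)": rng.uniform(1_000, 5_000, size=24)})
demo["Previous Month Sales"] = demo["Total Sales ($)"].shift(1)
demo["Noise"] = rng.normal(size=24)

# pairwise Pearson correlation; rows with missing values are skipped per pair
corr = demo.corr()

# rank the candidate features by the strength of their correlation with the target
target_corr = (
    corr["Total Sales ($)"]
    .drop("Total Sales ($)")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```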

## Data Preprocessing
- Develop several pipelines
- Visualize pipelines 
- (pipeline update -- based on training results) 

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
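
A minimal sketch of what one such pipeline could look like is shown below. The choice of `StandardScaler` followed by `LinearRegression` is an assumption made for illustration, not a step the notebook has settled on, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic features and target (illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100_000, size=(30, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=100.0, size=30)

# make_pipeline names each step after its lowercased class name
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X, y)

print(list(pipeline.named_steps))
print(pipeline.score(X, y))  # R^2 on the training data
```

Because the scaler lives inside the pipeline, it is fitted only on the data passed to `fit`, which keeps statistics from the test set out of training.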



## Hyper-parameters
- Test/Train ratio
- Hyperparameters for training
- Hyperparameters for model 

## Split Datasets
- Split into training/validation and testing datasets

In [None]:
from sklearn.model_selection import train_test_split
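
Because the sales data is ordered in time, a chronological split is one reasonable option: the most recent months are held out for testing instead of being sampled at random. The sketch below assumes that approach and uses a made-up frame; `train_test_split` supports it through `shuffle=False`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# made-up monthly data, sorted by date (illustration only)
months = pd.date_range("2018-09-01", periods=20, freq="MS")
demo = pd.DataFrame(
    {"feature": np.arange(20.0), "target": np.arange(20.0) * 2},
    index=months,
)

# shuffle=False keeps row order, so the last 20% of months become the test set
X_train, X_test, y_train, y_test = train_test_split(
    demo[["feature"]], demo["target"], test_size=0.2, shuffle=False
)
print(X_train.index.max(), X_test.index.min())
```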

## Deploy ML/DL
- Model selection
- Compare the baseline models
- Select best baseline model
- Search for best parameters (grid/random search)
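
The steps above could be sketched as follows. The two baselines (`LinearRegression` and `Ridge`), the `alpha` grid, and the synthetic data are all assumptions chosen for illustration; `TimeSeriesSplit` is used so that each validation fold comes after its training fold in time.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, cross_val_score

# synthetic features and target (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)

# every fold trains on earlier rows and validates on later ones
cv = TimeSeriesSplit(n_splits=4)

# compare baseline models by mean cross-validated R^2
baselines = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in baselines.items()
}
best_name = max(scores, key=scores.get)

# grid search over one hyperparameter of a candidate model
search = GridSearchCV(
    Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=cv, scoring="r2"
)
search.fit(X, y)
print(scores, best_name, search.best_params_)
```

`RandomizedSearchCV` follows the same pattern when the grid is too large to search exhaustively.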

## Visualize Result 
- Visualization tools
    - Tableau
    - Seaborn
- Check important features 
    - Go back to exploring data (if necessary)
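
For the feature check, tree-based models expose importances directly. The sketch below assumes a `RandomForestRegressor` and synthetic data in which one feature dominates by construction; the column names are borrowed from the feature-engineering step, but the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# synthetic features; the target depends mostly on the first column (illustration only)
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(200, 3)),
    columns=["Previous Month Sales", "Rolling Sales (3 months)", "Noise"],
)
y = (
    3.0 * X["Previous Month Sales"]
    + 0.5 * X["Rolling Sales (3 months)"]
    + rng.normal(scale=0.1, size=200)
)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# impurity-based importances are normalized to sum to 1
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(
    ascending=False
)
print(importances)
```

Features that rank near the bottom are candidates for removal or for another look during data exploration.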