# Marijuana Sales Prediction

## Project Objective
- Goal: Predict sales of each brand
- Sub-problems: 

## Load Data
- Load from local dataset

In [1]:
# import required packages
import sys
import os
import pandas as pd
# Load Dataset
avgRetail = pd.read_csv("../data/BrandAverageRetailPrice.csv")
brandDetail = pd.read_csv("../data/BrandDetails.csv")
totalSales = pd.read_csv("../data/BrandTotalSales.csv")
totalUnits = pd.read_csv("../data/BrandTotalUnits.csv")

## Construct Dataset 
- Build a time-series base table
    - Convert all datasets to the same timestamp format
- Clean Data 
    - without imputation
- Feature Engineering
    - Feature augmentation
    - Combine features from the same or different datasets

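The timestamp conversion listed above could look roughly like the sketch below. It assumes month strings in the `MM/YYYY` layout seen in the Total Units file and assumes that numeric columns may arrive as strings with thousands separators; the toy rows and brand names are illustrative, not taken from the real data.

```python
import pandas as pd

# Toy rows in the "MM/YYYY" layout (brand names and values are illustrative).
units = pd.DataFrame({
    "Brands": ["Brand A", "Brand A", "Brand B"],
    "Months": ["08/2020", "09/2020", "01/2021"],
    "Total Units": ["1,616.34", None, "715.53"],
})

# Parse month strings into first-of-month timestamps.
units["Months"] = pd.to_datetime(units["Months"], format="%m/%Y")

# Assumption: strip thousands separators, then coerce to float.
# Missing or unparseable entries stay NaN (no imputation).
units["Total Units"] = pd.to_numeric(
    units["Total Units"].str.replace(",", "", regex=False),
    errors="coerce",
)
```

After this step both the month column and the value column have dtypes that can be joined against the Total Sales table.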
### Find the dataset to start with
- Criteria: The information each dataset contains
- Objective: Since the final goal is to predict future sales, we start with the dataset that appears most strongly correlated with sales and that has the fewest features.

#### Total Sales dataset

In [15]:
totalSales.info()
totalSales.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25279 entries, 0 to 25278
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Months           25279 non-null  datetime64[ns]
 1   Brand            25279 non-null  object        
 2   Total Sales ($)  25279 non-null  object        
dtypes: datetime64[ns](1), object(2)
memory usage: 592.6+ KB


Unnamed: 0,Months,Brand,Total Sales ($)
0,2018-09-01,10x Infused,1711.334232
1,2018-09-01,1964 Supply Co.,25475.215945
2,2018-09-01,3 Bros Grow,120153.644757
3,2018-09-01,3 Leaf,6063.529785
4,2018-09-01,350 Fire,631510.048155


#### Total units dataset

In [16]:
totalUnits.info()
totalUnits.head(5)
# next: find out how many brands we currently have

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27686 entries, 0 to 27685
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Brands            27686 non-null  object 
 1   Months            27686 non-null  object 
 2   Total Units       25712 non-null  object 
 3   vs. Prior Period  24935 non-null  float64
dtypes: float64(1), object(3)
memory usage: 865.3+ KB


Unnamed: 0,Brands,Months,Total Units,vs. Prior Period
0,#BlackSeries,08/2020,1616.339004,
1,#BlackSeries,09/2020,,-1.0
2,#BlackSeries,01/2021,715.532838,
3,#BlackSeries,02/2021,766.669135,0.071466
4,#BlackSeries,03/2021,,-1.0


### Clean the chosen dataset
- Selected dataset: Total Sales
- Breakdown: By brand
- Reason: 
    - It contains the most direct information about sales. 
    - It has far fewer features per brand. 

#### Preprocess the selected dataset

In [27]:
import numpy as np
# convert the month column to timestamps
totalSales["Months"] = pd.to_datetime(totalSales["Months"])
# convert the sales column to floating-point numbers
totalSales["Total Sales ($)"] = totalSales["Total Sales ($)"].astype(np.float64)
totalSales.head(5)

Unnamed: 0,Months,Brand,Total Sales ($)
0,2018-09-01,10x Infused,1711.334232
1,2018-09-01,1964 Supply Co.,25475.215945
2,2018-09-01,3 Bros Grow,120153.644757
3,2018-09-01,3 Leaf,6063.529785
4,2018-09-01,350 Fire,631510.048155


#### Find all brand names

In [9]:
# collect the distinct brand names
brands = list(totalSales["Brand"].unique())
print("==================================")
print("Total: {} different brands".format(len(brands)))
print("==================================")

Total: 1627 different brands


#### Find the brand to start with
- Start with a brand that has the most monthly records. 

In [13]:
# count records per brand to pick one with a long history
totalSales["Brand"].value_counts()

Lift Ticket Laboratories    37
Garden Society              37
Field Extracts              37
Northern Emeralds           37
Fiori                       37
                            ..
Rambling Rose Farm           1
Goldie's Vault               1
530                          1
Lost Coast Alchemy           1
Zanna                        1
Name: Brand, Length: 1627, dtype: int64

#### Clean the selected brand data

In [None]:
brand = "Garden Society"
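
One way this cleaning step might proceed is sketched below: filter the rows for the chosen brand, index them by month, and reindex onto a continuous monthly range so that gaps show up as `NaN` instead of being imputed. The small `sales` frame stands in for `totalSales`, and its figures and the `Other Brand` row are made up for illustration.

```python
import pandas as pd

# Illustrative stand-in for totalSales; the figures are made up.
sales = pd.DataFrame({
    "Months": pd.to_datetime(
        ["2018-11-01", "2018-09-01", "2018-09-01", "2018-12-01"]
    ),
    "Brand": ["Garden Society", "Garden Society", "Other Brand", "Garden Society"],
    "Total Sales ($)": [120.0, 100.0, 50.0, 130.0],
})

brand = "Garden Society"

# Keep only the chosen brand, indexed and sorted by month.
brand_sales = (
    sales.loc[sales["Brand"] == brand, ["Months", "Total Sales ($)"]]
    .set_index("Months")
    .sort_index()
)

# Reindex onto every month start between the first and last record,
# so missing months become NaN rows (no imputation).
full_range = pd.date_range(
    brand_sales.index.min(), brand_sales.index.max(), freq="MS"
)
brand_sales = brand_sales.reindex(full_range)

# Months with no recorded sales for this brand.
missing_months = brand_sales.index[brand_sales["Total Sales ($)"].isna()]
```

Listing `missing_months` explicitly makes it easy to decide later whether a brand's history is complete enough to model.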


### Feature Engineering 
- Transform and add more features in the selected dataset. 
- Add more features from other related datasets. 
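
As a rough illustration of both bullets, the sketch below joins a units table onto a sales table by month and derives a few time-based features. The column names `sales_lag_1`, `sales_roll_3`, and `month` are hypothetical, and the numbers are toy values.

```python
import pandas as pd

months = pd.date_range("2020-01-01", periods=5, freq="MS")

# Toy single-brand tables; values are illustrative only.
sales = pd.DataFrame(
    {"Months": months, "Total Sales ($)": [10.0, 20.0, 30.0, 40.0, 50.0]}
)
units = pd.DataFrame(
    {"Months": months, "Total Units": [1.0, 2.0, 3.0, 4.0, 5.0]}
)

# Add features from a related dataset by joining on the month.
features = sales.merge(units, on="Months", how="left")

# Previous month's sales.
features["sales_lag_1"] = features["Total Sales ($)"].shift(1)

# Mean of the three preceding months; shifting first keeps the
# current month's value out of its own feature.
features["sales_roll_3"] = (
    features["Total Sales ($)"].shift(1).rolling(3).mean()
)

# Calendar month as a simple seasonality signal.
features["month"] = features["Months"].dt.month
```

The first rows of the lag and rolling columns are `NaN` by construction; they can be dropped before training or left for a model that tolerates missing values.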

## Explore Data 
- Visualize Data 
- Explore data (correlation)
- Objective of feature combinations
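
A correlation check along these lines could be sketched as follows. The `avg_price` column is a hypothetical feature name and all values are toy data, so the ranking shown here says nothing about the real datasets.

```python
import pandas as pd

# Toy feature table; "avg_price" is a hypothetical column name.
features = pd.DataFrame({
    "Total Sales ($)": [10.0, 20.0, 30.0, 40.0],
    "Total Units": [1.0, 2.0, 3.0, 4.0],
    "avg_price": [9.0, 7.0, 8.0, 6.0],
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_sales = (
    features.corr()["Total Sales ($)"]
    .drop("Total Sales ($)")
    .sort_values(ascending=False)
)
print(corr_with_sales)
```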

## Data Preprocessing
- Develop several pipelines
- Visualize pipelines 
- (Update pipelines based on training results) 

## Hyper-parameters
- Test/Train ratio
- Hyperparameters for training
- Hyperparameters for model 

## Split Datasets
- Split into training/validation and testing datasets
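
For time-series data a chronological split is usually safer than a random one, since shuffling would let the model see future months during training. A minimal sketch is shown below; the helper name `chronological_split` and the default ratios are assumptions, not choices made in this notebook.

```python
import pandas as pd

def chronological_split(df, test_ratio=0.2, val_ratio=0.2):
    """Split a time-indexed frame into train/validation/test in time order."""
    df = df.sort_index()
    n = len(df)
    n_test = int(n * test_ratio)
    n_val = int(n * val_ratio)
    train = df.iloc[: n - n_val - n_test]          # earliest rows
    val = df.iloc[n - n_val - n_test : n - n_test]  # middle block
    test = df.iloc[n - n_test :]                    # most recent rows
    return train, val, test

# Toy monthly series for demonstration.
data = pd.DataFrame(
    {"y": range(10)},
    index=pd.date_range("2020-01-01", periods=10, freq="MS"),
)
train, val, test = chronological_split(data)
```

With ten rows and the default ratios this yields six training, two validation, and two test rows, each block strictly later than the previous one.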

## Deploy ML/DL
- Model selection
- Compare baseline models
- Select best baseline model
- Search for best parameters (grid/random search)
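
Before any ML/DL model is trained, the baseline comparison could start from simple forecasting rules. The sketch below compares a naive last-value forecast with a three-month moving average using mean absolute error; both baselines and the names `naive` and `moving_avg_3` are illustrative assumptions, and the series is toy data.

```python
import pandas as pd

# Toy monthly sales series.
y = pd.Series(
    [10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
    index=pd.date_range("2020-01-01", periods=6, freq="MS"),
)

# Baseline 1: predict last month's value.
naive = y.shift(1)

# Baseline 2: predict the mean of the three preceding months.
moving_avg = y.shift(1).rolling(3).mean()

# Score both on the months where every baseline has a prediction.
eval_idx = y.index[3:]
scores = {
    "naive": (y - naive).loc[eval_idx].abs().mean(),
    "moving_avg_3": (y - moving_avg).loc[eval_idx].abs().mean(),
}

# Lowest mean absolute error wins.
best = min(scores, key=scores.get)
```

Any learned model should beat the better of these baselines on the same evaluation months before it is worth tuning with grid or random search.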

## Visualize Result 
- Visualize results
    - Tableau
    - Seaborn
- Check important features 
    - Go back to exploring data (if necessary)