Modelling Mary Jane
==============

***Using machine learning to reveal insights and predict performance of cannabis dispensaries***

**Author:** *Scott Butters*



# Abstract

In 2012, Washington state passed I-502 and legalized the recreational sale, use, and possession of marijuana. Since 2014, approximately 500 state-licensed dispensaries have opened throughout the state, with nearly 150 of those here in Seattle. The industry is heavily tracked and regulated, and a wealth of sales and business statistics is publicly available. In this project I scour the web for publicly available data that might be predictive of how a cannabis dispensary performs, such as customer reviews, inventory distributions, and local demographics. I then train machine learning models to predict a dispensary's monthly revenue and analyze the resulting models to distill insights about what drives sales in the marijuana market.

# Obtain the Data

The data for this project is derived from several sources:

## Dispensary profiles from [Leafly](https://www.leafly.com)

Leafly is an information aggregator for cannabis that maintains a profile for most of the dispensaries in the state. As part of my dataset, I've scraped the following features from the Leafly website for each dispensary where they were available:

* Average customer rating and number of customer reviews
* Inventory counts (number of products under descriptions like "flower", "edibles", "concentrates", etc.)
* Categorical qualities, such as whether or not the store is ADA accessible or has an ATM onsite
* Metadata such as name, address, phone number, etc.

The combination of these features gives us a profile of each dispensary that allows us to draw insights from our model into what makes for a successful dispensary.
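
As a rough illustration of how the inventory counts can be pulled out of scraped text, the sketch below parses a block that alternates product names with parenthesized counts. The sample string is an assumption about what the scraped text looks like (inferred from the `parse_products` helper in the collection code), and `parse_inventory` is a hypothetical name used only here.

```python
def parse_inventory(text):
    """Turn alternating name / (count) lines into a {name: count} dictionary."""
    # Drop the parentheses around counts, then split into clean lines.
    cleaned = text.replace('(', '').replace(')', '')
    lines = [line.strip().lower() for line in cleaned.split('\n')]

    counts = {}
    previous = None
    for line in lines:
        if not line:
            continue
        if line.isnumeric() and previous is not None:
            # A number belongs to the product name on the preceding line.
            counts[previous] = int(line)
        else:
            # A product name; default to 0 until a count is seen.
            counts[line] = 0
            previous = line
    return counts


# Hypothetical sample of scraped text (format assumed, not copied from Leafly).
sample = "Flower\n(115)\nEdibles\n(175)\nSeeds"
inventory = parse_inventory(sample)
print(inventory)
```

Products listed without a count keep the default of 0, which keeps them distinguishable from products that never appeared on the page at all.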

## Demographics from [WA HomeTownLocator](https://washington.hometownlocator.com/)

Of course, having the best inventory, friendliest staff, and prettiest building in the state doesn't amount to anything if a dispensary is in the middle of nowhere. This is where demographic data comes in. WA HomeTownLocator maintains a database of demographic statistics for nearly every zip code in the state of Washington. The data is produced by Esri Demographics and updated four times per year using data from the federal census, the IRS, and the USPS, as well as local data sources and more. From this website I scraped data likely to be predictive of a local market, such as:

* Population density
* Diversity
* Average income

These data give our model an image of what a dispensary's customer base is like, allowing us to characterize what makes for a good location to establish a dispensary.
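
Because the demographics are keyed by zip code, each dispensary's zip has to be reduced to its 5-digit form before a page can be requested. The sketch below shows one way to do that, using the URL pattern from the collection code; the sample zip values and the `demographics_url` helper are hypothetical.

```python
def demographics_url(zip_code):
    """Build the per-zip demographics URL from a 5- or 9-digit zip code."""
    # License records may hold ZIP+4 values; keep only the first 5 digits.
    zip5 = str(zip_code)[:5]
    return ('https://washington.hometownlocator.com/'
            'zip-codes/data,zipcode,{}.cfm'.format(zip5))


# Made-up zip values in mixed formats (integers and strings, with and without +4).
raw_zips = [981031234, '98362', 983620001]

# A set removes duplicates so each zip code is requested only once.
urls = sorted({demographics_url(z) for z in raw_zips})
for url in urls:
    print(url)
```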

## [Washington State Liquor and Cannabis Board (WSLCB)](https://lcb.wa.gov/)

Lastly, all that data would get us nowhere if we didn't have any target data to train our models on. That's where the WSLCB comes in. The WSLCB maintains data on every dispensary in the state, including monthly reports of revenue (which is what our model is predicting). Their data is scattered across a couple of different outlets, but for this project I used spreadsheets downloadable from [this obscure page](https://lcb.wa.gov/records/frequently-requested-lists) to get sales data dating back to November 2017. Because the only identifying information in that spreadsheet is the license number of the dispensary, I also downloaded a spreadsheet listing metadata for every entity that has applied for a marijuana license, which I then joined with the sales data in order to link it up with data scraped from other resources.
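
The join on license number can be sketched with a pair of toy tables. The column names `License #` and `ZipCode` follow the code later in this notebook; `Total Sales`, `Tradename`, and all of the values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two WSLCB spreadsheets (all values are invented).
sales = pd.DataFrame({
    'License #': [1001, 1002, 1003],
    'Total Sales': [250000.0, 90000.0, 410000.0],
})
licenses = pd.DataFrame({
    'License #': [1001, 1002],
    'Tradename': ['Shop A', 'Shop B'],
    'ZipCode': [981031234, 98362],
})

# Reduce ZIP+4 values to a 5-character string for later zip-level joins.
licenses['zip_code'] = licenses['ZipCode'].astype(str).str[:5]

# A left join keeps every sales row, even when no license record matches;
# indicator=True adds a '_merge' column recording where each row came from.
merged = sales.merge(licenses, how='left', on='License #', indicator=True)

# License numbers with sales but no metadata are worth inspecting by hand.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'License #'].tolist()
print(unmatched)
```

Checking the unmatched rows before dropping them is a cheap way to catch license numbers that were reformatted or reissued between the two spreadsheets.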

## Data Collection

The code below contains a pipeline to visit each of our sources and scrape or download all of the desired data into a few files stored in the data/raw/ directory to be scrubbed and processed later.

In [275]:
## %%writefile ../src/data/make_dataset.py

# Imports
import json
import os
import random
import re
import requests
import sys
import time

import numpy as np
import pandas as pd

from fake_useragent import UserAgent
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

# Helper functions
def parse_products(text):
    '''
    Parses string of products into dictionary of products with counts
    Input: string of products as scraped from Leafly dispensary page
    Output: dictionary of {product: count} relationships
    '''
    repl = ['(', ')']
    for char in repl:
        text = text.replace(char, '')
    prod_list = text.split('\n')
    prod_list = [prod.strip().lower() for prod in prod_list]
    prod_dict = {}
    for i, element in enumerate(prod_list):
        if element.isnumeric():
            prod_dict[prod_list[i - 1]] = int(element)
        elif 'difference' in element:
            pass
        else:
            prod_dict[element.strip()] = 0
    return prod_dict


def scrape_disp(disp, driver, user_agent):
    """
    Scrapes dispensary-specific page on leafly for additional data and adds it
    to existing dictionary dataset
    Input: dictionary containing metadata for a single dispensary
    Output: dictionary with additional metadata for given dispensary
    """
    url = 'https://www.leafly.com/dispensary-info/'
    slug = disp['slug']
    url += slug
    
    if 'OR' in disp['formattedShortLocation']:
        return {}
    
    response  = requests.get(url, headers=user_agent)
    if not response.ok:
        print('Connection to {} failed'.format(disp['name']))
        return {}
    
    # Open page
    driver.get(url)
    
    # Confirm over 21
    try:
        yes_button = driver.find_element(By.XPATH, '//button[text()="Yes"]')
        yes_button.click()
    except Exception:
        # No age prompt found; continue
        pass

    # Scrape categoricals
    try:
        cat_selector = driver.find_element(By.CLASS_NAME, 'jsx-4153842418')
        # cat_selector = driver.find_element(By.TAG_NAME, 'ul')
        items = cat_selector.find_elements(By.TAG_NAME, 'li')
        categories = {item.text.lower(): True for item in items}
        disp.update(categories)
    except Exception:
        print('Failed to scrape categories for {}'.format(disp['name']))

    # Scrape products
    try:
        products = driver.find_elements(By.CLASS_NAME, 'jsx-1433915045')
        products_text = products[0].text
        product_dict = parse_products(products_text)
        disp.update(product_dict)
    except Exception:
        print('Failed to scrape products for {}'.format(disp['name']))
    
    print('Successfully scraped {}'.format(disp['name']))
    return disp


def scrape_leafly_disps(path, disp_data_filename, data):
    """
    Gets JSON file of data on dispensaries from Leafly, either by loading
    pre-existing file or by re-scraping Leafly
    Input: path and filename for output file, index  of basic dispensary metadata
    Output: Index formatted JSON with one dictionary for each found dispensary
    """
#     filepath = '../data/raw/dispensary_data.json'
    filepath = path + disp_data_filename
    if os.path.isfile(filepath):
        overwrite = input(
            '''Dispensaries data dict already exists. Scrape data again? y/n\n
            Note: this could take several minutes.''')
        if overwrite.lower() != 'y':
            with open(filepath) as json_file:
                data = json.load(json_file)
            return data

    print("Beginning scrape...")
    ua = UserAgent()
    user_agent = {'User-agent': ua.random}
    chromedriver = "/Applications/chromedriver"
    os.environ["webdriver.chrome.driver"] = chromedriver
    driver = webdriver.Chrome(chromedriver)

    for disp in data:
        new_data = scrape_disp(data[disp], driver, user_agent)
        data[disp].update(new_data)
    
    with open(filepath, 'w') as outfile:  
        json.dump(data, outfile)
        print('Scraped data written to {}'.format(filepath))
    
    return data


def retry(TL_lat, TL_lon, cell_size=0.5):
    '''
    If request hits Leafly API limit, split cell into 4 overlapping subcells
    and retry
    Input: Lat/lon coordinates for top left of map and optionally a size for
    the map area (defaults to 0.5)
    Output: dictionary of dictionaries containing metadata for each dispensary 
    found in map area
    '''
    # Top-left corners of the 2 x 2 grid of subcells
    TL_lats = [TL_lat, TL_lat - 0.4 * cell_size]
    TL_lons = [TL_lon, TL_lon + 0.4 * cell_size]
    disp_data = {}
    for lat in TL_lats:
        for lon in TL_lons:
            data = get_disp_data_by_coords(lat, lon, cell_size=0.6 * cell_size)
            disp_data.update(data)
    return disp_data


def get_disp_data_by_coords(TL_lat, TL_lon, cell_size=0.5):
    """
    Performs search for all dispensaries within a map region on Leafly
    Input: Lat/lon coordinates for top left of map and optionally a size for
    the map area (defaults to 0.5)
    Output: dictionary of dictionaries containing metadata for each dispensary 
    found in map area
    """
    # Setup
    BR_lat = TL_lat - cell_size
    BR_lon = TL_lon + cell_size
    coords = TL_lat, TL_lon, BR_lat, BR_lon
    
    url = (
        'https://web-finder.leafly.com/api/searchThisArea?topLeftLat={}&topLeftLon={}&bottomRightLat={}&bottomRightLon={}&userLat=47.6&userLon=-122.3'
        ).format(TL_lat, TL_lon, BR_lat, BR_lon)
    
    # Scrape
    time.sleep(.5+2*random.random())
    r = requests.get(url)
    if r.status_code != 200:
        print('Leafly search failed at {}'.format(coords))
        return {}
    disps = r.json()
    
    # Parse
    fields = ['name', 'address1', 'address2', 'city', 'location', 'phone',
              'formattedShortLocation', 'medical', 'recreational', 'tier', 
              'lastMenuUpdate', 'starRating', 'numberOfReviews', 'slug']

    disp_data = {
        d['name']: {k: d[k] for k in fields} for d in disps['dispensaries']}
    entries = len(disp_data)
    
    # Check results; retry if necessary and return data
    if entries > 200:
        return retry(TL_lat, TL_lon, cell_size)
    elif entries < 1:
#         print('no results at {}'.format(coords))
        return {}
    else:
#         print('{} results found at {}'.format(len(disp_data), coords))
        return disp_data
    
    
def get_rect_disp_data(TL_lat, TL_lon, BR_lat, BR_lon, cell_size=0.5):
    """
    Performs grid search on sub-rectangles with slight overlap, gathering data 
    on each cell
    Input: lat/lon coords of top left and bottom right corners, as well as 
    optional cell size parameter (defaults to 0.5)
    Output: dictionary of dictionaries representing all dispensaries in
    rectangle
    """
    coords = TL_lat, TL_lon, BR_lat, BR_lon
    max_step = 0.8 * cell_size
    # np.linspace requires an integer number of samples
    lat_steps = int(np.ceil((TL_lat - BR_lat - cell_size) / max_step))
    lon_steps = int(np.ceil((BR_lon - TL_lon - cell_size) / max_step))

    TL_lats = np.linspace(TL_lat, BR_lat + cell_size, lat_steps + 1)
    TL_lons = np.linspace(TL_lon, BR_lon - cell_size, lon_steps + 1)

    disp_data = {}

    for lat in TL_lats:
        for lon in TL_lons:
            data = get_disp_data_by_coords(lat, lon, cell_size)
            disp_data.update(data)

    print('Total dispensaries found: ', len(disp_data))
    return disp_data


def get_disp_dict(path):
    """
    Performs a grid search across Washington for all dispensaries with an
    account on Leafly and scrapes metadata for each
    Input: relative path to raw data directory
    Output: Index formatted JSON with one dictionary for each found dispensary
    """
    filepath = path + 'dispensary_list.json'
    
    if os.path.isfile(filepath):
        overwrite = input(
            '''Initial dispensary list already exists. Scrape data again? y/n\n 
            Note: this could take several minutes.''')
        if overwrite.lower() != 'y':
            with open(filepath) as json_file:
                data = json.load(json_file)
            return data
    print("Beginning scrape...")
    
    # WA State bounding coordinates
    north = 49
    west = -124.8
    south = 45.4
    east = -116.8
    
    data = get_rect_disp_data(north, west, south, east, cell_size=1.4)
    
    with open(filepath, 'w') as outfile:  
        json.dump(data, outfile)
        print('Scraped data written to {}'.format(filepath))
        
    return data


def get_leafly_disp_data(path, disp_filename):
    """
    Steps through all helper functions to scrape data from Leafly
    Input: raw data path and desired filename for output
    Output: JSON file containing scraped data
    """
    disp_dict = get_disp_dict(path)
    disp_data = scrape_leafly_disps(path, disp_filename, disp_dict)
    return


def get_demo_data(path, license_filename, demo_filename):
    """
    Scrapes zip code based demographic data from washington.hometownlocator.com
    for all zip codes containing a dispensary found in WSLCB license data
    Input: relative path to raw data directory, license data filename, 
    demographics data filename
    Output: saves demographic dataset to csv in raw data directory
    """
    license_filepath = path + license_filename
    demo_filepath = path + demo_filename

    if os.path.isfile(demo_filepath):
        overwrite = input(
            '''Demographics file already exists. Scrape data again? y/n\n 
            Note: this could take several minutes.''')
        if overwrite.lower() != 'y':
            return
    
    license_data = pd.read_excel(license_filepath, sheet_name=2, header=0)
    zips = license_data['ZipCode'].astype(str).str[:5].unique()
    demographics = pd.DataFrame()
    
    print("Beginning scrape...")
    for zip_code in sorted(zips):
        url = f'https://washington.hometownlocator.com/zip-codes/data,zipcode,{zip_code}.cfm'
        r = requests.get(url)
        if 'table' in r.text:
            df0, df1 = pd.read_html(url, index_col=0)[:2]
            df0.columns = [str(zip_code)]
            df1.columns = [str(zip_code)]
            df = pd.concat([df0, df1], axis=0).T.dropna(axis=1)
            df.drop(['INCOME', 'HOUSEHOLDS'], axis=1, inplace=True)
            demographics = pd.concat([demographics, df])
            print('Scraped {}/{} zips. Latest: {}'
                  .format(len(demographics), len(zips), zip_code), end='\r')
            sys.stdout.flush()
        else:
            print(f'\nNo data found for {zip_code}')
        
    demographics.to_csv(demo_filepath)
    print('Scraped data written to {}'.format(demo_filepath))
    return
    
    
def download_dataset(url, path, filename):
    """
    Downloads dataset from specified url and saves file to raw data directory
    Input: url from which to retrieve data, filename to store data in
    Output: dataset stored in raw data file directory
    """
#     filepath = '../data/raw/{}'.format(filename)
    filepath = path + filename
    file_exists = os.path.isfile(filepath)
    if file_exists:
        overwrite = input('{} already exists. Update? y/n'.format(filename))
        if overwrite.lower() != 'y':
            return
    print("Beginning file download...")
    r = requests.get(url)
    if not r.ok:
        print('Download failed')
        return
    with open(filepath, 'wb') as f:  
        f.write(r.content)
    print('File written to {}\n'.format(filepath))
    return
    
    
def get_sales_data(path, sales_filename, license_filename):
    """
    Gets links for most up-to-date dispensary sales and license information
    from WSLCB and downloads datasets
    Input: relative path to raw data directory, filenames for the sales and
    license data
    Output: downloaded files to raw data directory
    """
    # Get urls for most up-to-date sales and license data
    url = 'https://lcb.wa.gov/records/frequently-requested-lists'
    response = requests.get(url)
    if response.ok:
        soup = BeautifulSoup(response.text, "html.parser")
        links = soup.find_all('a')
        for link in links:
            if 'Traceability' in link.text:
                sales_url = link['href']
                print(f'\nLatest sales data found:\n{sales_url}')
                #filename = 'sales_data.xlsx'
                download_dataset(sales_url, path, sales_filename)
            elif 'Applicants' in link.text:
                licenses_url = link['href']
                print(f'\nLatest license data found:\n{licenses_url}')
                #filename = 'license_data.xls'
                download_dataset(licenses_url, path, license_filename)
    else:
        print('Failed to download sales data')

    return

    
def run():
    """
    Executes a set of helper functions that download data from several
    sources and save those datasets to the data/raw directory.
    """
    path = '../data/raw/'
    
    sales_filename = 'sales_data.xlsx'
    license_filename = 'license_data.xls'
    demo_filename = 'demographics.csv'
    disp_filename = 'dispensary_data.json'
    
    get_sales_data(path, sales_filename, license_filename)
    get_demo_data(path, license_filename, demo_filename)
    get_leafly_disp_data(path, disp_filename)
    
    print('\nData acquisition complete.\n')
    return

In [273]:
run()


Latest license data found:
https://lcb.wa.gov/sites/default/files/publications/Public_Records/2019/MarijuanaApplicants.xls
license_data.xls already exists. Update? y/ny
Beginning file download...
File written to ../data/raw/license_data.xls

Latest sales data found:
https://lcb.wa.gov/sites/default/files/publications/Marijuana/sales_activity/2019-04-10-MJ-Sales-Activity-by-License-Number-Traceability-Contingency-Reporting.xlsx
sales_data.xlsx already exists. Update? y/ny
Beginning file download...
File written to ../data/raw/sales_data.xlsx
Demographics file already exists. Scrape data again? y/nn
Initial dispensary list already exists. Scrape data again? y/nn
Dispensaries data dict already exists. Scrape data again? y/nn
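
The overlapping grid search that `get_rect_disp_data` runs over the state can be sketched in isolation. This is a minimal sketch, assuming the same 20% overlap and the Washington bounding box used above; `cell_origins` is a hypothetical helper, not part of the pipeline.

```python
import numpy as np


def cell_origins(start, stop, cell_size, overlap=0.2):
    """Return evenly spaced cell origins whose cells cover [start, stop]."""
    span = abs(stop - start)
    # Adjacent cells may be at most this far apart to keep the overlap.
    max_step = (1 - overlap) * cell_size
    # np.linspace needs an integer count, so cast the result of np.ceil.
    steps = max(int(np.ceil((span - cell_size) / max_step)), 0)
    # The last cell starts one cell_size short of the far edge.
    direction = 1 if stop >= start else -1
    last_origin = stop - direction * cell_size
    return np.linspace(start, last_origin, steps + 1)


# Latitude runs north to south, longitude west to east.
lats = cell_origins(49, 45.4, cell_size=1.4)
lons = cell_origins(-124.8, -116.8, cell_size=1.4)
cells = [(lat, lon) for lat in lats for lon in lons]
print(len(lats), len(lons), len(cells))
```

Spacing the origins evenly, rather than stepping by a fixed amount, means the final cell ends exactly on the bounding box instead of spilling past it.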


# Scrub the Data

*Look through the raw data files and see what you will need to do to them in order to have a workable data set. If your source data is already well-formatted, you may want to ask yourself why it hasn't already been analyzed and what other people may have overlooked when they were working on it. Are there other data sources that might give you more insights on some of the data you have here?*

*The end goal of this step is to produce a [design matrix](https://en.wikipedia.org/wiki/Design_matrix), containing one column for every variable that you are modeling, including a column for the outputs, and one row for every observation in your data set. It needs to be in a format that won't cause any problems as you visualize and model your data.*

In [276]:
## %%writefile ../src/features/build_features.py

# Imports
import json
import re

import numpy as np
import pandas as pd

def join_sales_data():
    """
    Loads sales and license data files and joins them into one table
    Also creates a column with zip code as a 5 digit string for later use
    Input:
    Output: returns merged dataframe
    """
    path = '../data/raw/sales_data.xlsx'
    disp_sales_data = pd.read_excel(path, sheet_name=0, header=3)
    disp_sales_data.rename(columns={'License Number':'License #'}, inplace=True)
    disp_sales_data.set_index(keys='License #', inplace=True)

    path = '../data/raw/license_data.xls'
    license_data = pd.read_excel(path, sheet_name=2, header=0, index_col=1)

    sales_data = pd.merge(disp_sales_data, license_data, how='left', on='License #')
    sales_data['zip_code'] = sales_data['ZipCode'].astype(str).str[:5]
    
    return sales_data


def clean_leafly_data(path, filename):
    """
    Loads and cleans raw data scraped from Leafly.
    Input: dictionaries containing paths and filenames for input/output files
    Output: a cleaned and pickled dataframe of data scraped from Leafly
    """
    raw_filename = path['raw'] + filename['raw_leafly']
    int_filename = path['interim'] + filename['int_leafly']
    
    leafly = pd.read_json(raw_filename)
        
    leafly.to_pickle(int_filename)
    return

    
def run():
    """
    Executes a set of helper functions that read files from data/raw, 
    clean them, and convert the data into a design matrix that is ready
    for modeling.
    """
    path = {
        'raw': '../data/raw/',
        'interim': '../data/interim/',
        'processed': '../data/processed/'
    }
    
    filename = {
        'raw_leafly': 'dispensary_data.json',
        'raw_demo': 'demographics.csv',
        'raw_license': 'license_data.xls',
        'raw_sales': 'sales_data.xlsx',
        'int_leafly': 'leafly.pkl',
        'int_demo': 'demographics.pkl',
        'int_sales': 'sales.pkl',
        'processed': 'data.pkl'
    }
    
    clean_leafly_data(path, filename)
    clean_demographic_data(path, filename)
    clean_wslcb_data(path, filename)
    join_cleaned_data(path, filename)
    build_features(path, filename)
    
    print('\nData cleaning complete.\n')
    return

In [289]:
pd.set_option('display.max_columns', 50)
path = {
    'raw': '../data/raw/',
    'interim': '../data/interim/',
    'processed': '../data/processed/'
}
    
filename = {
    'raw_leafly': 'dispensary_data.json',
    'raw_demo': 'demographics.csv',
    'raw_license': 'license_data.xls',
    'raw_sales': 'sales_data.xlsx',
    'int_leafly': 'leafly.pkl',
    'int_demo': 'demographics.pkl',
    'int_sales': 'sales.pkl',
    'processed': 'data.pkl'
}

In [292]:
raw_filename = path['raw'] + filename['raw_leafly']
int_filename = path['interim'] + filename['int_leafly']

data = pd.read_json(raw_filename, orient='index')

In [293]:
display(data.info())
data.columns

<class 'pandas.core.frame.DataFrame'>
Index: 635 entries, Mister Buds to Canna4Life - Clarkston
Data columns (total 30 columns):
accessories               11 non-null float64
ada accessible            289 non-null float64
address1                  633 non-null object
address2                  134 non-null object
all products              255 non-null float64
atm                       312 non-null float64
cartridges                36 non-null float64
city                      635 non-null object
concentrates              238 non-null float64
debit cards accepted      45 non-null float64
edibles                   245 non-null float64
flower                    248 non-null float64
formattedShortLocation    635 non-null object
lastMenuUpdate            624 non-null object
location                  635 non-null object
medical                   635 non-null int64
name                      635 non-null object
numberOfReviews           635 non-null int64
other                     200 non-null 

None

Index(['accessories', 'ada accessible', 'address1', 'address2', 'all products',
       'atm', 'cartridges', 'city', 'concentrates', 'debit cards accepted',
       'edibles', 'flower', 'formattedShortLocation', 'lastMenuUpdate',
       'location', 'medical', 'name', 'numberOfReviews', 'other', 'phone',
       'pre-rolls', 'recreational', 'seeds', 'slug', 'starRating',
       'storefront', 'tier', 'topicals', 'ufcw discount', 'veteran discount'],
      dtype='object')

In [309]:
meta_cols = ['name', 'address1', 'address2', 'city', 'formattedShortLocation', 
            'location', 'lastMenuUpdate', 'phone', 'slug', 'tier']
metadata = data[meta_cols]
display(metadata.head())
metadata.info()

Unnamed: 0,name,address1,address2,city,formattedShortLocation,location,lastMenuUpdate,phone,slug,tier
Mister Buds,Mister Buds,536 Marine Dr,,Port Angeles,"Port Angeles, WA","{'lat': 48.1219849, 'lon': -123.4437221}",2017-04-10T17:54:28.278868+00:00,(360) 797-1966,mister-buds,900
Origins Port Angeles,Origins Port Angeles,1215 E Front Street,,Port Angeles,"Port Angeles, WA","{'lat': 48.1115394, 'lon': -123.4118052}",2019-04-12T01:37:53.395365+00:00,360.406.4902,sparket-rnr,300
Cannabis Coast,Cannabis Coast,193161 Highway 101,,Forks,"Forks, WA","{'lat': 47.9683179, 'lon': -124.404138}",2016-04-07T23:37:15.980746+00:00,(360) 374-4020,cannabis-coast,900
Lux Pot Shop - Ballard,Lux Pot Shop - Ballard,4912 17th Ave NW,,Seattle,"Seattle, WA","{'lat': 47.6648419, 'lon': -122.3786413}",2019-04-12T00:37:20.749868+00:00,206-294-5586,stash-pot-shop,200
Nature's Gifts - Sequim,Nature's Gifts - Sequim,755 W Washington St,Suite C,Sequim,"Sequim, WA","{'lat': 48.0791305, 'lon': -123.1204523}",2019-04-12T03:17:37.187515+00:00,360-797-1993,natures-gifts-sequim,300


<class 'pandas.core.frame.DataFrame'>
Index: 635 entries, Mister Buds to Canna4Life - Clarkston
Data columns (total 10 columns):
name                      635 non-null object
address1                  633 non-null object
address2                  134 non-null object
city                      635 non-null object
formattedShortLocation    635 non-null object
location                  635 non-null object
lastMenuUpdate            624 non-null object
phone                     624 non-null object
slug                      635 non-null object
tier                      635 non-null int64
dtypes: int64(1), object(9)
memory usage: 54.6+ KB


In [310]:
cat_cols = ['ada accessible', 'atm', 'debit cards accepted', 'medical', 
            'recreational', 'storefront', 'ufcw discount', 'veteran discount']
categoricals = data[cat_cols].astype('category')
display(categoricals.head())
categoricals.info()

Unnamed: 0,ada accessible,atm,debit cards accepted,medical,recreational,storefront,ufcw discount,veteran discount
Mister Buds,,,,0,1,1.0,,
Origins Port Angeles,1.0,1.0,,0,1,1.0,,1.0
Cannabis Coast,,1.0,,0,1,1.0,,
Lux Pot Shop - Ballard,1.0,1.0,1.0,0,1,1.0,,
Nature's Gifts - Sequim,1.0,1.0,,0,1,1.0,,1.0


<class 'pandas.core.frame.DataFrame'>
Index: 635 entries, Mister Buds to Canna4Life - Clarkston
Data columns (total 8 columns):
ada accessible          289 non-null category
atm                     312 non-null category
debit cards accepted    45 non-null category
medical                 635 non-null category
recreational            635 non-null category
storefront              396 non-null category
ufcw discount           8 non-null category
veteran discount        230 non-null category
dtypes: category(8)
memory usage: 10.6+ KB
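
Every scraped flag above is either 1.0 or missing, since `scrape_disp` only records a flag when Leafly lists it. One possible treatment, sketched below on a toy frame, is to read a missing flag as "absent". That reading is an assumption: a row in which every flag is missing may instead reflect a page that failed to scrape.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the scraped flags: 1.0 where listed, NaN otherwise.
flags = pd.DataFrame(
    {'atm': [np.nan, 1.0, 1.0], 'veteran discount': [np.nan, 1.0, np.nan]},
    index=['Shop A', 'Shop B', 'Shop C'],
)

# Assumption: a missing flag means the amenity is absent, not unknown.
filled = flags.fillna(0).astype(bool)

# Rows with no flags at all are candidates for a failed scrape.
no_flags = flags.isna().all(axis=1)

print(filled.sum())
print(no_flags[no_flags].index.tolist())
```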


In [311]:
num_cols = ['accessories', 'all products','cartridges', 'concentrates', 
            'edibles', 'flower', 'numberOfReviews', 'other', 'pre-rolls', 
            'seeds', 'starRating', 'topicals']
numerical = data[num_cols]
display(numerical.head())
numerical.info()

Unnamed: 0,accessories,all products,cartridges,concentrates,edibles,flower,numberOfReviews,other,pre-rolls,seeds,starRating,topicals
Mister Buds,,,,,,,3,,,,5.0,
Origins Port Angeles,,621.0,,176.0,175.0,115.0,26,29.0,126.0,,4.961538,
Cannabis Coast,,,,,,,2,,,,5.0,
Lux Pot Shop - Ballard,,2849.0,,634.0,765.0,659.0,52,49.0,742.0,,4.358974,
Nature's Gifts - Sequim,,618.0,,262.0,73.0,153.0,19,,130.0,,4.594737,


<class 'pandas.core.frame.DataFrame'>
Index: 635 entries, Mister Buds to Canna4Life - Clarkston
Data columns (total 12 columns):
accessories        11 non-null float64
all products       255 non-null float64
cartridges         36 non-null float64
concentrates       238 non-null float64
edibles            245 non-null float64
flower             248 non-null float64
numberOfReviews    635 non-null int64
other              200 non-null float64
pre-rolls          245 non-null float64
seeds              2 non-null float64
starRating         635 non-null float64
topicals           37 non-null float64
dtypes: float64(11), int64(1)
memory usage: 64.5+ KB


In [287]:
leafly.describe()

Unnamed: 0,accessories,ada accessible,all products,atm,cartridges,concentrates,debit cards accepted,edibles,flower,medical,...,other,pre-rolls,recreational,seeds,starRating,storefront,tier,topicals,ufcw discount,veteran discount
count,11.0,289.0,255.0,312.0,36.0,238.0,45.0,245.0,248.0,635.0,...,200.0,245.0,635.0,2.0,635.0,396.0,635.0,37.0,8.0,230.0
mean,50.090909,1.0,653.207843,1.0,108.0,173.084034,1.0,138.771429,170.375,0.415748,...,73.015,119.734694,0.951181,1.5,4.359332,1.0,414.96063,20.054054,1.0,1.0
std,93.586809,0.0,516.822934,0.0,119.279504,148.463357,0.0,116.375333,145.833582,0.493239,...,167.71915,97.708662,0.215659,0.707107,1.055869,0.0,301.227442,20.771169,0.0,0.0
min,0.0,1.0,1.0,1.0,1.0,2.0,1.0,0.0,1.0,0.0,...,0.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0
25%,3.0,1.0,337.5,1.0,21.75,74.0,1.0,66.0,76.5,0.0,...,11.75,60.0,1.0,1.25,4.404785,1.0,200.0,6.0,1.0,1.0
50%,7.0,1.0,534.0,1.0,74.0,146.0,1.0,109.0,146.5,0.0,...,24.5,95.0,1.0,1.5,4.656604,1.0,300.0,14.0,1.0,1.0
75%,50.0,1.0,820.0,1.0,133.5,229.5,1.0,185.0,215.25,1.0,...,56.75,154.0,1.0,1.75,4.8,1.0,900.0,29.0,1.0,1.0
max,312.0,1.0,3651.0,1.0,536.0,1031.0,1.0,949.0,861.0,1.0,...,1449.0,742.0,1.0,2.0,5.0,1.0,900.0,107.0,1.0,1.0


*Before moving on to exploratory analysis, write down some notes about challenges encountered while working with this data that might be helpful for anyone else (including yourself) who may work through this later on.*

# Explore the Data

*Before you start exploring the data, write out your thought process about what you're looking for and what you expect to find. Take a minute to confirm that your plan actually makes sense.*

*Calculate summary statistics and plot some charts to give you an idea what types of useful relationships might be in your dataset. Use these insights to go back and download additional data or engineer new features if necessary. Not now though... remember we're still just trying to finish the MVP!*

In [None]:
## %%writefile ../src/visualization/visualize.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that read files from 
    data/processed, calculate descriptive statistics for the population,
    and plot charts that visualize interesting relationships between 
    features.
    """
    # data = load_features('data/processed')
    # describe_features(data, 'reports/')
    # generate_charts(data, 'reports/figures/')
    pass


*What did you learn? What relationships do you think will be most helpful as you build your model?*

# Model the Data

*Describe the algorithm or algorithms that you plan to use to train with your data. How do these algorithms work? Why are they good choices for this data and problem space?*

In [None]:
## %%writefile ../src/models/train_model.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that read features from 
    data/processed, split them into training and test sets, fit a model
    on the training set, and save the trained model to models/.
    """
    # data = load_features('data/processed/')
    # train, test = train_test_split(data)
    # save_train_test(train, test, 'data/processed/')
    # model = build_model()
    # model.fit(train)
    # save_model(model, 'models/')
    pass


In [None]:
## %%writefile ../src/models/predict_model.py

# imports
# helper functions go here

def run():
    """
    Executes a set of helper functions that load the test data from 
    data/processed and the trained model from models/, generate 
    predictions, and save evaluation metrics to reports/.
    """
    # test_X, test_y = load_test_data('data/processed')
    # trained_model = load_model('models/')
    # predictions = trained_model.predict(test_X)
    # metrics = evaluate(test_y, predictions)
    # save_metrics('reports/')
    pass



_Write down any thoughts you may have about working with these algorithms on this data. What other ideas do you want to try out as you iterate on this pipeline?_

# Interpret the Model

_Write up the things you learned, and how well your model performed. Be sure to address the model's strengths and weaknesses. What types of data does it handle well? What types of observations tend to give it a hard time? What future work might you or someone reading this want to do, building on the lessons learned and tools developed in this project?_