# Title: Cannabis Retail Sales 
# Authors: Tae Sakong

## Problem Statement
- What is the domain?
- Why the interest?
- Existing data and models
- Implications/Relevance

## Objectives
1. Process the data
2. Explore the data
3. Develop a Model
4. Train and Fit the model
5. Make predictions by using the model
6. Evaluate the model
7. What are the next steps?

## Data Processing
1. Import the data
2. Clean and process the data

## Exploratory Data Analysis:
1. Explore the 

## Data Analysis

1. Test Assumptions
    - Normal Distribution
    - Homogeneity of Variance

Hypothesis:
- Direction and magnitude?

Rationale:
- Evidence from literature, EDA, etc

Predictions:
- Direction and magnitude?

## Model Development

## Model Deployment

## Results

## Summary

In [13]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Data Processing:

Data Source: https://data.ct.gov/api/views/ucaf-96h6/rows.csv?accessType=DOWNLOAD

Summary of Data:
- What is the data about?
- When was the data collected?
- What is the source of the data?
- Who authored the data?
- Cite the source

### 1. Import the data

In [2]:
# URL for the data (CSV export)
url = "https://data.ct.gov/api/views/ucaf-96h6/rows.csv?accessType=DOWNLOAD"

# Convert csv to dataframe
df = pd.read_csv(url)

In [4]:
# Inspect the dataframe
df.head()

| | Week Ending | Adult-Use Retail Sales | Medical Marijuana Retail Sales | Total Adult-Use and Medical Sales | Adult-Use Products Sold | Medical Products Sold | Total Products Sold | Adult-Use Average Product Price | Medical Average Product Price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 01/14/2023 | 1485019.32 | 1776700.69 | 3261720.01 | 33610 | 49312 | 82922 | 44.25 | 36.23 |
| 1 | 01/21/2023 | 1487815.81 | 2702525.61 | 4190341.42 | 33005 | 77461 | 110466 | 45.08 | 34.89 |
| 2 | 01/28/2023 | 1553216.3 | 2726237.56 | 4279453.86 | 34854 | 76450 | 111304 | 44.56 | 35.65 |
| 3 | 01/31/2023 | 578840.62 | 863287.86 | 1442128.48 | 12990 | 24023 | 37013 | 44.56 | 35.93 |
| 4 | 02/04/2023 | 1047436.2 | 1971731.4 | 3019167.6 | 24134 | 56666 | 80800 | 43.49 | 34.84 |


In [5]:
# Check datatypes and variables
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 9 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Week Ending                        41 non-null     object 
 1   Adult-Use Retail Sales             41 non-null     float64
 2   Medical Marijuana Retail Sales     41 non-null     float64
 3   Total Adult-Use and Medical Sales  41 non-null     float64
 4   Adult-Use Products Sold            41 non-null     int64  
 5   Medical Products Sold              41 non-null     int64  
 6   Total Products Sold                41 non-null     int64  
 7   Adult-Use Average Product Price    41 non-null     float64
 8   Medical Average Product Price      41 non-null     float64
dtypes: float64(5), int64(3), object(1)
memory usage: 3.0+ KB


In [9]:
# Check for missing values
df.isnull().sum()

Week Ending                          0
Adult-Use Retail Sales               0
Medical Marijuana Retail Sales       0
Total Adult-Use and Medical Sales    0
Adult-Use Products Sold              0
Medical Products Sold                0
Total Products Sold                  0
Adult-Use Average Product Price      0
Medical Average Product Price        0
dtype: int64

This dataset contains 9 variables, each with 41 observations.

The columns (variables) are:
    - Week Ending
    - Adult-Use Retail Sales
    - Medical Marijuana Retail Sales
    - Total Adult-Use and Medical Sales
    - Adult-Use Products Sold
    - Medical Products Sold
    - Total Products Sold
    - Adult-Use Average Product Price
    - Medical Average Product Price


The data can be organized into the following:

1. All reported values are indexed by "Week Ending," which marks the end of the time interval over which the sales data were collected.

2. Products are categorized as either "Adult-Use" or "Medical Marijuana." Sales are reported for each category, both as quantities sold and as dollar amounts. Additional columns give the totals across both categories, again as quantities and as dollar amounts.

3. Alongside the quantities and dollar amounts sold, the data also contains the average product price for each category at each time interval.

In summary, this dataset contains a time series of cannabis product sales, segmented by product type (Adult-Use vs. Medical Marijuana).
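
The structure described above can be spot-checked directly. The sketch below is only an illustration: it rebuilds the first five rows from the `df.head()` output as a small frame named `sample` (a hypothetical name, not part of the notebook) and tests whether the two category columns add up to the reported totals. It also computes an implied adult-use price as dollars divided by units. The reported average need not equal that ratio exactly, so the gap is printed rather than assumed to be zero.

```python
import numpy as np
import pandas as pd

# First five rows copied from the df.head() output (illustrative sample only)
sample = pd.DataFrame({
    "Adult-Use Retail Sales": [1485019.32, 1487815.81, 1553216.30, 578840.62, 1047436.20],
    "Medical Marijuana Retail Sales": [1776700.69, 2702525.61, 2726237.56, 863287.86, 1971731.40],
    "Total Adult-Use and Medical Sales": [3261720.01, 4190341.42, 4279453.86, 1442128.48, 3019167.60],
    "Adult-Use Products Sold": [33610, 33005, 34854, 12990, 24134],
    "Medical Products Sold": [49312, 77461, 76450, 24023, 56666],
    "Total Products Sold": [82922, 110466, 111304, 37013, 80800],
    "Adult-Use Average Product Price": [44.25, 45.08, 44.56, 44.56, 43.49],
})

# Dollar totals: compare with a one-cent tolerance to absorb float rounding
sales_ok = np.isclose(
    sample["Adult-Use Retail Sales"] + sample["Medical Marijuana Retail Sales"],
    sample["Total Adult-Use and Medical Sales"],
    atol=0.01,
)

# Unit totals: integer counts, so exact equality is appropriate
units_ok = (
    sample["Adult-Use Products Sold"] + sample["Medical Products Sold"]
) == sample["Total Products Sold"]

# Implied price (dollars / units) versus the reported average price
implied_price = sample["Adult-Use Retail Sales"] / sample["Adult-Use Products Sold"]
price_gap = (implied_price - sample["Adult-Use Average Product Price"]).abs()

print("Dollar totals consistent:", bool(sales_ok.all()))
print("Unit totals consistent:", bool(units_ok.all()))
print("Largest gap between implied and reported price:", round(price_gap.max(), 2))
```

The same three checks can be run on the full `df` by replacing `sample` with `df`. Any rows that fail would be worth inspecting before modeling.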

### 2. Clean and Process Data

In general, most of the datatypes are correct.

The float64 columns hold dollar amounts (sales and average prices):
    - Adult-Use Retail Sales
    - Medical Marijuana Retail Sales
    - Total Adult-Use and Medical Sales
    - Adult-Use Average Product Price
    - Medical Average Product Price

The int64 columns hold the number of products sold:
    - Adult-Use Products Sold
    - Medical Products Sold 
    - Total Products Sold

However, "Week Ending" is stored as an object (string) and needs to be converted to datetime. This variable represents the time intervals at which the sales data were presumably collected, so a datetime type will simplify later statistical and time-series operations.

There are no missing values, so no imputation or removal of missing data is required.

In [10]:
# Convert Week Ending to Datetime
# The dates are formatted as MM/DD/YYYY, so state the format explicitly
df['Week Ending'] = pd.to_datetime(df['Week Ending'], format='%m/%d/%Y')

In [12]:
# Check datatypes after type conversion
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 9 columns):
 #   Column                             Non-Null Count  Dtype         
---  ------                             --------------  -----         
 0   Week Ending                        41 non-null     datetime64[ns]
 1   Adult-Use Retail Sales             41 non-null     float64       
 2   Medical Marijuana Retail Sales     41 non-null     float64       
 3   Total Adult-Use and Medical Sales  41 non-null     float64       
 4   Adult-Use Products Sold            41 non-null     int64         
 5   Medical Products Sold              41 non-null     int64         
 6   Total Products Sold                41 non-null     int64         
 7   Adult-Use Average Product Price    41 non-null     float64       
 8   Medical Average Product Price      41 non-null     float64       
dtypes: datetime64[ns](1), float64(5), int64(3)
memory usage: 3.0 KB
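
Once "Week Ending" is a datetime, the spacing between observations can be checked. The sketch below is a minimal illustration using only the five dates visible in the `df.head()` output. The names `week_ending`, `dates`, `gap_days`, and `short_periods` are hypothetical and not part of the notebook. In this sample not every interval is seven days long: the period ending 01/31/2023 follows the previous one by only three days. This may indicate that weeks are split at month end, but that is an assumption to confirm against the dataset's documentation.

```python
import pandas as pd

# "Week Ending" values copied from the df.head() output (illustrative sample only)
week_ending = pd.Series(
    ["01/14/2023", "01/21/2023", "01/28/2023", "01/31/2023", "02/04/2023"],
    name="Week Ending",
)

# Parse with an explicit MM/DD/YYYY format
dates = pd.to_datetime(week_ending, format="%m/%d/%Y")

# Days elapsed since the previous observation (the first row has no predecessor)
gap_days = dates.diff().dt.days

# Observations that cover fewer than seven days
short_periods = dates[gap_days < 7]

print(gap_days.tolist())
print(short_periods.dt.strftime("%m/%d/%Y").tolist())
```

On the full data, the same check is `df['Week Ending'].diff().dt.days`. Partial periods matter when comparing weekly totals, since a three-day period will naturally show lower sales than a seven-day one.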


## Exploratory Data Analysis

1. Visualize Tot