# Iowa Liquor Sales 

## Project Summary 

In this project, I will analyze a moderately large data set (~790 MB) provided by Iowa's [Alcoholic Beverages Division](https://abd.iowa.gov/division) on spirit purchases from January 1, 2012 to the present. The data set contains time-stamped, store-level information, which allows for both a geo-spatial analysis of purchases by region and a time series analysis of purchases.

This analysis will draw on several technologies. I will store the raw data in MongoDB, a NoSQL database. I will then use NumPy and pandas, along with any other relevant Python modules, to transform the raw data into a useful format. Finally, I will use Plotly and other Python visualization tools to illustrate the insights found in the analysis.

Ultimately, my main goal for this project is to gain a deeper understanding of using Python to analyze geographic information and time series data. My secondary goals are to learn to process information efficiently with Python, since this data set is relatively large, and to discover one meaningful insight about the data. I hope that this project will eventually translate into a real-world application, as geographic information about purchases has obvious commercial value.

## Data Set 

The State of Iowa's Alcoholic Beverages Division [dataset](https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy) consists of spirit purchase information from January 1, 2012 to the present.

> The Division can provide this level of information because Iowa is one of 17 states that directly controls the sale and distribution of alcoholic beverages. (Source: <https://abd.iowa.gov/division>)

Note that the data set is limited to holders of the Iowa Class “E” liquor license, which covers:

> Grocery stores, liquor stores, convenience stores, etc. Allows commercial establishments to sell liquor for off-premises consumption in original unopened containers. No sales by the drink.

Also, while this data set does not record direct consumer sales, it can serve as a proxy for them, because most stores should only be buying spirits that sell well in their stores.

Each record has the following structure, shown in JSON format:

    {
      "DATE": [ "02/26/2015" ],
      "CONVENIENCE.STORE": [ "" ],
      "STORE": [ 2515 ],
      "NAME": [ "Hy-Vee Food Store #1 / Mason City" ],
      "ADDRESS": [ "2400 4TH ST SW" ],
      "CITY": [ "MASON CITY" ],
      "ZIPCODE": [ "50401" ],
      "STORE.LOCATION": [ "2400 4TH ST SW\nMASON CITY 50401\n(43.148463097000047, -93.236272961999987)" ],
      "COUNTY.NUMBER": [ 17 ],
      "COUNTY": [ "Cerro Gordo" ],
      "CATEGORY": [ 1022100 ],
      "CATEGORY.NAME": [ "TEQUILA" ],
      "VENDOR.NO": [ 434 ],
      "VENDOR": [ "Luxco-St Louis" ],
      "ITEM": [ 87937 ],
      "DESCRIPTION": [ "Juarez Tequila Silver" ],
      "PACK": [ 12 ],
      "LITER.SIZE": [ 1000 ],
      "STATE.BTL.COST": [ "6.92" ],
      "BTL.PRICE": [ "10.38" ],
      "BOTTLE.QTY": [ 48 ],
      "TOTAL": [ "498.24" ]
    }
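Because every value in the record above is wrapped in a one-element list, and the coordinates are buried at the end of `STORE.LOCATION`, some flattening is likely needed before analysis. The following is a minimal sketch of one way to do that, not a settled design: the function name `flatten_record` and the added `LATITUDE` and `LONGITUDE` keys are hypothetical, and prices are converted to `float` only for simplicity.

```python
import re
from datetime import datetime

# Matches a trailing "(latitude, longitude)" pair such as the one in STORE.LOCATION.
_COORDS = re.compile(r"\((-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)\)\s*$")


def flatten_record(raw):
    """Unwrap one-element lists and convert a few fields to native types."""
    flat = {
        key: value[0] if isinstance(value, list) and len(value) == 1 else value
        for key, value in raw.items()
    }
    if "DATE" in flat:
        # Dates in the sample record use month/day/year.
        flat["DATE"] = datetime.strptime(flat["DATE"], "%m/%d/%Y")
    for key in ("STATE.BTL.COST", "BTL.PRICE", "TOTAL"):
        if key in flat:
            flat[key] = float(flat[key])
    # LATITUDE and LONGITUDE are hypothetical keys added for convenience.
    match = _COORDS.search(str(flat.get("STORE.LOCATION") or ""))
    if match:
        flat["LATITUDE"] = float(match.group(1))
        flat["LONGITUDE"] = float(match.group(2))
    else:
        flat["LATITUDE"] = flat["LONGITUDE"] = None
    return flat


sample = {
    "DATE": ["02/26/2015"],
    "COUNTY": ["Cerro Gordo"],
    "STORE.LOCATION": [
        "2400 4TH ST SW\nMASON CITY 50401\n(43.148463097000047, -93.236272961999987)"
    ],
    "BTL.PRICE": ["10.38"],
    "BOTTLE.QTY": [48],
    "TOTAL": ["498.24"],
}
record = flatten_record(sample)
print(record["DATE"].date(), record["LATITUDE"], record["LONGITUDE"], record["TOTAL"])
```

In the sample record, `TOTAL` equals `BTL.PRICE` times `BOTTLE.QTY` (10.38 × 48 = 498.24), which suggests a simple consistency check that could be run after flattening.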

## Approach 

### Data Management 

The program will automatically pull the data from <https://data.iowa.gov>.
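One way the download step might look is sketched below. It assumes the portal exposes the data set through a Socrata-style endpoint at `/resource/m3tr-qhgy.json` (the identifier comes from the data set link) and that paging works through the `$limit`, `$offset`, and `$order` query parameters. The function names and the page size are hypothetical, and the field names returned by the endpoint may differ from the sample record shown earlier.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Socrata-style JSON endpoint built from the data set id "m3tr-qhgy".
BASE_URL = "https://data.iowa.gov/resource/m3tr-qhgy.json"


def page_url(offset, limit=50000, base_url=BASE_URL):
    """Build the URL for one page of results."""
    # Assumption: ordering by the system field ":id" keeps paging stable.
    query = urlencode(
        {"$limit": limit, "$offset": offset, "$order": ":id"}, safe="$:"
    )
    return f"{base_url}?{query}"


def fetch_pages(limit=50000, max_pages=None):
    """Yield one list of records per page until an empty page comes back."""
    offset = 0
    pages = 0
    while max_pages is None or pages < max_pages:
        with urlopen(page_url(offset, limit)) as response:
            rows = json.load(response)
        if not rows:
            break
        yield rows
        offset += limit
        pages += 1


# Building a URL does not touch the network; fetch_pages() does.
print(page_url(0, limit=1000))
```

Downloading page by page keeps memory use bounded, which matters for a file of roughly 790 MB, and `max_pages` makes it easy to test against a small slice first.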

The program will then write the information to a MongoDB collection. This step requires some user interaction: MongoDB must be installed, and the user must open a MongoDB connection.

    from IPython.display import IFrame
    IFrame('https://en.wikipedia.org/wiki/MongoDB', width=900, height=250)
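The loading step could be sketched roughly as follows. This is an illustration under assumptions, not the final design: `load_records` and `sanitize_keys` are hypothetical helpers, the batch size is arbitrary, and the connection string, database name, and collection name in the usage comment are placeholders. The only MongoDB call relied on is `insert_many`, which PyMongo collections provide. Dots in field names are replaced because MongoDB query syntax treats a dot as a path into a nested document.

```python
from itertools import islice


def sanitize_keys(record):
    """Replace dots in field names, e.g. "STORE.LOCATION" -> "STORE_LOCATION"."""
    return {key.replace(".", "_"): value for key, value in record.items()}


def load_records(collection, records, batch_size=1000):
    """Insert records in batches and return how many documents were sent.

    `collection` is any object with an insert_many(list_of_dicts) method,
    such as a PyMongo collection.
    """
    iterator = iter(records)
    total = 0
    while True:
        batch = [sanitize_keys(record) for record in islice(iterator, batch_size)]
        if not batch:
            return total
        collection.insert_many(batch)
        total += len(batch)


# Usage with PyMongo (requires a running MongoDB server; names are placeholders):
#
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   sent = load_records(client["iowa"]["liquor_sales"], records)
```

Batching avoids building one enormous insert for the whole data set while still making far fewer round trips than inserting one document at a time.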

### Analysis 

The analysis will rely mainly on NumPy and pandas, and possibly other useful Python modules. I will build in some user interaction at this stage, such as the ability to set filters and possibly to choose which columns to analyze. There will also be validation at this stage to ensure the user does not attempt an inappropriate operation.

    IFrame('https://en.wikipedia.org/wiki/NumPy', width=900, height=250)

    IFrame('https://en.wikipedia.org/wiki/Pandas_(software)', width=900, height=250)
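To make the idea of user-set filters with validation more concrete, here is a small pandas sketch. The function name `summarize`, its parameters, and the specific checks (unknown columns, non-numeric value column) are assumptions about what the validation might cover. The rows after the first are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize(df, value_col, group_col, filters=None):
    """Sum value_col per group_col after applying equality filters.

    Raises KeyError for unknown columns and TypeError when the column to be
    summed is not numeric.
    """
    filters = filters or {}
    missing = [c for c in (value_col, group_col, *filters) if c not in df.columns]
    if missing:
        raise KeyError(f"Unknown column(s): {missing}")
    if not is_numeric_dtype(df[value_col]):
        raise TypeError(f"Cannot sum non-numeric column {value_col!r}")
    mask = pd.Series(True, index=df.index)
    for column, wanted in filters.items():
        mask &= df[column] == wanted
    return df.loc[mask].groupby(group_col)[value_col].sum()


# First row taken from the sample record; the other rows are illustrative only.
sales = pd.DataFrame(
    {
        "COUNTY": ["Cerro Gordo", "Cerro Gordo", "Polk"],
        "CATEGORY.NAME": ["TEQUILA", "VODKA", "TEQUILA"],
        "TOTAL": [498.24, 100.00, 250.00],
    }
)
print(summarize(sales, "TOTAL", "COUNTY", filters={"CATEGORY.NAME": "TEQUILA"}))
```

Checking column names and types up front gives the user a clear error message instead of a confusing failure deep inside a pandas call.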

### Visualization
Finally, the graphical representation of both the time series and geo-spatial analyses will be done in Plotly. Because the data is at the state level, I have some concern that Plotly isn't designed for state-level analysis, so I may need to use another module.

    IFrame('https://plot.ly/python/', width=900, height=550)

## Goals 

I want to dive deep into geo-spatial analysis using Python and different visualization techniques. Some of the most significant and lasting visualizations I have encountered were GIS-related, and their impact came from seeing the information represented spatially.

I want to fully understand how to set up and carry out a time series analysis. Time-stamped data is interesting and commonly analyzed because it is useful to understand changes over time. I also hope to become more versatile with date-time data types in Python and with how to analyze them.
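As a small illustration of the date-time handling involved, the sketch below parses `DATE` strings in the month/day/year form used by the sample record and rolls purchase totals up by calendar month with pandas. The function name `monthly_totals` is hypothetical, and every row except the first is made up for the example.

```python
import pandas as pd


def monthly_totals(df, date_col="DATE", value_col="TOTAL"):
    """Return total purchases per calendar month, indexed by month start."""
    dates = pd.to_datetime(df[date_col], format="%m/%d/%Y")
    values = pd.to_numeric(df[value_col])  # TOTAL arrives as a string
    series = pd.Series(values.to_numpy(), index=pd.DatetimeIndex(dates)).sort_index()
    # "MS" groups by month and labels each bin with the first day of the month.
    return series.resample("MS").sum()


# First row from the sample record; the rest are illustrative only.
purchases = pd.DataFrame(
    {
        "DATE": ["02/26/2015", "02/27/2015", "04/01/2015"],
        "TOTAL": ["498.24", "1.76", "10.00"],
    }
)
print(monthly_totals(purchases))
```

Months with no purchases appear with a total of zero instead of being dropped, which keeps the series evenly spaced for plotting and for later seasonal comparisons.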

Finally, I hope that by the end of this process I will have found one meaningful insight. A data set this large should hold some interesting information, such as how the timing of liquor sales relates to holidays or to general Iowa shopping trends.