In [1]:
# Import libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Background
In 2006, the rapid decline of the honeybee population, an integral component of American honey agriculture, raised global concern. Large numbers of hives were lost to Colony Collapse Disorder, a phenomenon in which worker bees disappear and the remaining colony collapses. Speculation about the cause points to hive diseases and pesticides harming the pollinators, though no consensus has been reached. The U.S. used to produce over half the honey it consumes each year. Today most honey is imported: 350 of the 400 million pounds consumed every year come from overseas. This dataset provides insight into honey production supply and demand in America from 1998 to 2016.

# Objective: 
To visualize how honey production has changed over the years (1998–2016) in the United States. 

Key questions to be answered:

* How has honey production yield changed from 1998 to 2016?
* Over time, what have been the major production trends across the states?
* Is there any pattern between total honey production and the value of production each year? How has the value of production, which could in some sense be tied to demand, changed from year to year?

# Dataset:

* **state**: U.S. state in which the honey was produced
* **year**: Year of production
* **stocks**: Stocks held by producers. The unit is pounds.
* **numcol**: Number of honey-producing colonies, defined as the maximum number of colonies from which honey was taken during the year. Honey can be taken from colonies that did not survive the entire year.
* **yieldpercol**: Honey yield per colony. The unit is pounds.
* **totalprod**: Total production (numcol × yieldpercol). The unit is pounds.
* **priceperlb**: Average price per pound based on expanded sales. The unit is dollars.
* **prodvalue**: Value of production (totalprod × priceperlb). The unit is dollars.
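
The two derived columns can be sanity-checked against their definitions. The sketch below is a minimal illustration, not part of the original analysis: it hard-codes two 1998 rows that appear in this notebook's `df.head()` output, and the variable names (`sample`, `calc_totalprod`, `calc_prodvalue`) are illustrative. It assumes `prodvalue` is reported rounded to the nearest 1,000 dollars, which is why the comparison allows some slack instead of demanding exact equality.

```python
import pandas as pd

# Two 1998 rows copied from the df.head() output in this notebook.
sample = pd.DataFrame({
    "state": ["Alabama", "Arizona"],
    "numcol": [16000.0, 55000.0],
    "yieldpercol": [71, 60],
    "totalprod": [1136000.0, 3300000.0],
    "priceperlb": [0.72, 0.64],
    "prodvalue": [818000.0, 2112000.0],
})

# Recompute the derived columns from their stated definitions.
calc_totalprod = sample["numcol"] * sample["yieldpercol"]
calc_prodvalue = sample["totalprod"] * sample["priceperlb"]

# totalprod should match to within a pound.
totalprod_ok = (calc_totalprod - sample["totalprod"]).abs() <= 0.5

# Assumption: prodvalue is rounded to the nearest 1,000 dollars,
# so accept any difference of up to 1,000.
prodvalue_ok = (calc_prodvalue - sample["prodvalue"]).abs() <= 1000

print(pd.DataFrame({
    "state": sample["state"],
    "totalprod_ok": totalprod_ok,
    "prodvalue_ok": prodvalue_ok,
}))
```

Running the same comparison on the full DataFrame would flag any rows where the published figures deviate from these definitions by more than the assumed rounding.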


In [2]:
# Read the dataset
df = pd.read_csv("honeyproduction1998-2016-1.csv")

In [4]:
df.head()

,state,numcol,yieldpercol,totalprod,stocks,priceperlb,prodvalue,year
0,Alabama,16000.0,71,1136000.0,159000.0,0.72,818000.0,1998
1,Arizona,55000.0,60,3300000.0,1485000.0,0.64,2112000.0,1998
2,Arkansas,53000.0,65,3445000.0,1688000.0,0.59,2033000.0,1998
3,California,450000.0,83,37350000.0,12326000.0,0.62,23157000.0,1998
4,Colorado,27000.0,72,1944000.0,1594000.0,0.7,1361000.0,1998


In [5]:
df.shape

(785, 8)

In [6]:
df.tail()

,state,numcol,yieldpercol,totalprod,stocks,priceperlb,prodvalue,year
780,Virginia,5000.0,38,190000.0,30000.0,5.85,1112000.0,2016
781,Washington,84000.0,35,2940000.0,412000.0,1.99,5851000.0,2016
782,West Virginia,5000.0,32,160000.0,43000.0,3.92,627000.0,2016
783,Wisconsin,54000.0,62,3348000.0,1205000.0,2.67,8939000.0,2016
784,Wyoming,40000.0,68,2720000.0,190000.0,1.78,4842000.0,2016


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 785 entries, 0 to 784
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   state        785 non-null    object 
 1   numcol       785 non-null    float64
 2   yieldpercol  785 non-null    int64  
 3   totalprod    785 non-null    float64
 4   stocks       785 non-null    float64
 5   priceperlb   785 non-null    float64
 6   prodvalue    785 non-null    float64
 7   year         785 non-null    int64  
dtypes: float64(5), int64(2), object(1)
memory usage: 49.2+ KB


In [9]:
df.describe().T

,count,mean,std,min,25%,50%,75%,max
numcol,785.0,61686.62,92748.94,2000.0,9000.0,26000.0,65000.0,510000.0
yieldpercol,785.0,60.57834,19.42783,19.0,46.0,58.0,72.0,136.0
totalprod,785.0,4140957.0,6884594.0,84000.0,470000.0,1500000.0,4096000.0,46410000.0
stocks,785.0,1257629.0,2211794.0,8000.0,119000.0,391000.0,1380000.0,13800000.0
priceperlb,785.0,1.695159,0.9306234,0.49,1.05,1.48,2.04,7.09
prodvalue,785.0,5489739.0,9425394.0,162000.0,901000.0,2112000.0,5559000.0,83859000.0
year,785.0,2006.818,5.491523,1998.0,2002.0,2007.0,2012.0,2016.0


In [10]:
df.isnull().sum()

state          0
numcol         0
yieldpercol    0
totalprod      0
stocks         0
priceperlb     0
prodvalue      0
year           0
dtype: int64

# Univariate Analysis

In [13]:
df['numcol'].value_counts()

numcol
7000.0      51
6000.0      37
8000.0      35
9000.0      30
5000.0      27
            ..
355000.0     1
103000.0     1
66000.0      1
270000.0     1
54000.0      1
Name: count, Length: 164, dtype: int64
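
Because `numcol` takes 164 distinct values, raw value counts are hard to read for a quantity like this. One possible alternative, sketched here as an illustration and not as the notebook's own method, is to group the values into bands with `pd.cut` before counting. The bin edges below are an assumption: they reuse the quartiles and maximum reported by `df.describe()` above (9,000, 26,000, 65,000, 510,000). The ten values are the `numcol` entries from the `df.head()` and `df.tail()` outputs, so the resulting counts describe only that small sample, not the full dataset.

```python
import pandas as pd

# numcol values copied from the df.head() and df.tail() outputs above.
numcol = pd.Series(
    [16000.0, 55000.0, 53000.0, 450000.0, 27000.0,
     5000.0, 84000.0, 5000.0, 54000.0, 40000.0],
    name="numcol",
)

# Assumed bin edges, taken from the quartiles and max in df.describe().
bin_edges = [0, 9000, 26000, 65000, 510000]
bin_labels = ["<=9k", "9k-26k", "26k-65k", ">65k"]

# pd.cut uses right-closed intervals by default: (0, 9000], (9000, 26000], ...
size_band = pd.cut(numcol, bins=bin_edges, labels=bin_labels)

# Count rows per band, ordered from the smallest band to the largest.
band_counts = size_band.value_counts().sort_index()
print(band_counts)
```

On the full data the same pattern would be `pd.cut(df['numcol'], bins=bin_edges, labels=bin_labels)`; a histogram of `numcol` is another reasonable way to see the distribution.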