### Table of Contents

* [Goals](#Goals)
* [Data](#Data)
    * [Loading the Data](#section1_1)
    * [Data Information](#section1_2)
* [Data Cleaning](#cleaning)
* [Exploratory Data Analysis](#EDA)
    * [Big Picture](#section2_1)
    * [Humidity and Flow](#section2_2)
    * [Humidity and Weight](#section2_3)
    * [Temperature and Flow](#section2_4)
    * [Temperature and Weight](#section2_5)
* [Conclusion](#conclusion)

### Goals <a class="anchor" id="Goals"></a> 
This notebook contains an analysis of beehive metric data. The goals for this project were to:

* Get acquainted with the data
* Clean the data so it is ready for analysis
* Develop questions for analysis
* Analyze variables within the data to find patterns and gain insights on these questions

### Data <a class="anchor" id="Data"></a>

The data for this project was downloaded from Kaggle:

https://www.kaggle.com/datasets/se18m502/bee-hive-metrics?select=flow_2017.csv

Information regarding the features of the data is located in the Detail section on the website.

#### Loading the Data <a class="anchor" id="section1_1"></a>

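First, the necessary libraries are loaded into the notebook. The pandas library is then used to import the four 2017 CSV files (`flow_2017.csv`, `humidity_2017.csv`, `temperature_2017.csv`, and `weight_2017.csv`), merge them into a single DataFrame, and preview the first 100 rows.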

In [None]:
import piplite
await piplite.install('seaborn')

In [None]:
#import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import linregress

In [None]:
#load data files
flow_2017 = pd.read_csv('flow_2017.csv')
humidity_2017 = pd.read_csv('humidity_2017.csv')
temperature_2017 = pd.read_csv('temperature_2017.csv')
weight_2017 = pd.read_csv('weight_2017.csv')
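As noted in the Data Information section, the `timestamp` column is read in as an object (string) column. A minimal sketch of parsing it while loading is shown here, using a tiny inline CSV in place of the Kaggle files. The column names `timestamp` and `flow` and the sample values are assumptions made for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for flow_2017.csv; column names and values are assumptions.
sample_csv = io.StringIO(
    "timestamp,flow\n"
    "2017-01-01 05:15:00,0\n"
    "2017-01-01 05:16:00,-1\n"
    "2017-01-01 05:17:00,2\n"
)

# parse_dates converts the named column to datetime64 while reading.
flow_sample = pd.read_csv(sample_csv, parse_dates=["timestamp"])

print(flow_sample.dtypes)
```

With a datetime column, later steps such as resampling or filtering by date become possible without a separate conversion.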

In [None]:
#explore the datasets
print(flow_2017.describe())
print(weight_2017.describe())
print(humidity_2017.describe())
print(temperature_2017.describe())
print(flow_2017.columns)
print(weight_2017.columns)
print(humidity_2017.columns)
print(temperature_2017.columns)

In [None]:
#first merge
flow_weight_data = pd.merge(flow_2017, weight_2017, how='left')

In [None]:
print(flow_weight_data.head(100))

In [None]:
#second merge
humidity_temp_data = pd.merge(humidity_2017, temperature_2017, how='left')

In [None]:
print(humidity_temp_data.head(100))

In [None]:
#final merge
beehive_data = pd.merge(flow_weight_data, humidity_temp_data, how='inner')
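When `pd.merge` is called without `on`, pandas joins on every column name the two DataFrames share. The sketch below uses two hypothetical miniature tables to show the same left and inner joins with the key named explicitly and with `validate` checking that the key is unique on both sides. It assumes `timestamp` is the only shared column.

```python
import pandas as pd

# Hypothetical miniature versions of two of the tables.
flow = pd.DataFrame({
    "timestamp": ["2017-01-01 00:00", "2017-01-01 00:01", "2017-01-01 00:02"],
    "flow": [0, -2, 1],
})
weight = pd.DataFrame({
    "timestamp": ["2017-01-01 00:00", "2017-01-01 00:02"],
    "weight": [52.1, 52.3],
})

# Name the join key explicitly and ask pandas to verify it is unique on both sides.
left_joined = pd.merge(flow, weight, on="timestamp", how="left", validate="one_to_one")
inner_joined = pd.merge(flow, weight, on="timestamp", how="inner", validate="one_to_one")

print(left_joined)   # keeps all three flow rows; weight is NaN at 00:01
print(inner_joined)  # keeps only the two timestamps present in both tables
```

A left join can introduce missing values where the right table has no matching timestamp, so it is worth checking for nulls after each merge.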

#### Data Information <a class="anchor" id="section1_2"></a>

Some immediate insights are:

* There are 4 measurement columns (`flow`, `weight`, `humidity`, and `temperature`) plus the `timestamp` column, and 227,500 rows.
* Most values are floats in this dataset; the only integer values are in the `flow` column, and the only object is in the `timestamp` column.
* The `flow` column contains mostly 0 values, meaning there was no change in bee flow (no bees entering or leaving the hive) at that timestamp. In fact, the interquartile range for `flow` is 0. 
* The `flow` column has a minimum value of -345 and a maximum value of 293, which seem to be outliers possibly due to a measurement error or data entry error.
* The `weight` column has a minimum value of -0.1 and a maximum value of 68.7. The minimum seems to be an outlier, since the weight of a beehive cannot be below 0 kg. The maximum value is within 3 standard deviations of the mean, so it is not treated as an outlier.
* The `humidity` column has a minimum value of -66, which also seems to be an outlier since humidity cannot be below 0%.
* The `temperature` column has a minimum value of -227 and a maximum value of 56, which also seem to be outliers, since the temperature would not be expected to go below -32 degrees Celsius or above 38 degrees Celsius.
* None of the values are null in any of our columns.

In [None]:
#exploring the final dataset
print(beehive_data.head(100))
print(beehive_data.describe())
print(beehive_data.info())
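The three-standard-deviation rule of thumb used in the observations above can be checked directly rather than by eye. This is a minimal sketch on synthetic weights, not the hive data. The helper name `outside_k_std` is hypothetical.

```python
import numpy as np
import pandas as pd

def outside_k_std(series, k=3.0):
    """Return a boolean mask of values more than k standard deviations from the mean."""
    z_scores = (series - series.mean()) / series.std()
    return z_scores.abs() > k

# Synthetic weights (kg): 200 values near 50 plus one implausible reading.
rng = np.random.default_rng(0)
weights = pd.Series(np.append(rng.normal(50, 2, 200), -0.1))

mask = outside_k_std(weights)
print(weights[mask])  # rows flagged as potential outliers
```

Extreme values inflate the standard deviation itself, so this rule can miss outliers when several are present. It is best treated as a first screen.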

### Data Cleaning <a class="anchor" id="cleaning"></a>

As mentioned before, none of our values are null. 
However, we do have outliers in all of our columns and our float values contain too many decimals to be read easily. 
To clean the data, the following must be done:

* Round float values in `weight`, `humidity`, and `temperature` columns to 2 decimal places
* Remove outliers from all columns


In [None]:
#round float values
beehive_data['weight'] = round(beehive_data['weight'], 2)
beehive_data['humidity'] = round(beehive_data['humidity'], 2)
beehive_data['temperature'] = round(beehive_data['temperature'], 2)

#summary statistics
print(beehive_data.describe())

In [None]:
#remove outliers by keeping the middle 95% of flow values (2.5th to 97.5th percentile)
flow_2017_mid_95 = np.percentile(beehive_data['flow'], [2.5, 97.5])
beehive_data_mid_95 = beehive_data[(beehive_data['flow'] >= flow_2017_mid_95[0]) & (beehive_data['flow'] <= flow_2017_mid_95[1])]
print(beehive_data_mid_95.describe())
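The cell above trims on `flow` only. If the same percentile rule were wanted for several columns at once, one possible approach is sketched below on synthetic data. The helper `trim_to_percentiles` is hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd

def trim_to_percentiles(df, columns, lower=2.5, upper=97.5):
    """Keep rows whose values fall inside the [lower, upper] percentiles for every listed column."""
    keep = pd.Series(True, index=df.index)
    for column in columns:
        low, high = np.percentile(df[column], [lower, upper])
        keep &= df[column].between(low, high)  # inclusive on both ends
    return df[keep]

# Synthetic stand-in for beehive_data.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "flow": rng.integers(-5, 6, 1000),
    "temperature": rng.normal(20, 5, 1000),
})
demo.loc[0, "temperature"] = -227.0  # one implausible reading

trimmed = trim_to_percentiles(demo, ["flow", "temperature"])
print(len(demo), len(trimmed))
print(trimmed["temperature"].min())
```

Trimming each column independently removes more rows than trimming one column, so the row count should be checked afterwards.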



In [None]:
#use histogram to visualize spread
plt.hist(beehive_data_mid_95['temperature'])
plt.xlabel('temperature')
plt.ylabel('count')
plt.show()

Keeping the middle 95% of `flow` values still leaves extreme outliers in the `temperature` column, so the data will now be filtered to remove temperatures below -32 degrees Celsius and above 38 degrees Celsius.

In [None]:
#keep temperatures between -32 and 38 degrees Celsius
filtered_beehive_data = beehive_data_mid_95[(beehive_data_mid_95['temperature'] >= -32) & (beehive_data_mid_95['temperature'] <= 38)]

#use boxplot to show outliers
plt.figure()
sns.boxplot(filtered_beehive_data['temperature'])
plt.show()

#summary statistics
print(filtered_beehive_data['temperature'].describe())

After removing the outliers, our data is more symmetrical and now ready for analysis.
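One way to back up the claim of improved symmetry is to compare skewness before and after filtering. The sketch below does this on synthetic temperatures with two injected extreme readings. The values are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic temperatures plus two implausible readings.
rng = np.random.default_rng(2)
temps = pd.Series(rng.normal(20, 5, 2000))
temps_with_outliers = pd.concat([temps, pd.Series([-227.0, -200.0])], ignore_index=True)

# Apply the plausible-range filter described above.
filtered = temps_with_outliers[temps_with_outliers.between(-32, 38)]

# Skewness near 0 indicates a roughly symmetric distribution.
print(temps_with_outliers.skew())
print(filtered.skew())
```

On the real data, the same comparison would be `beehive_data['temperature'].skew()` against `filtered_beehive_data['temperature'].skew()`.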

### Exploratory Data Analysis <a class="anchor" id="EDA"></a>

After some data cleaning and tidying, the DataFrame is ready for EDA. The following independent variables will be focused on in the analysis:

* `humidity`
* `temperature` 

The goal will be to see how these independent variables associate with the following dependent variables:

* `flow`
* `weight`

The hope is that through summary statistics and visualizations the following questions can be answered:

* Does humidity of the outdoor climate affect bees' flow to and from the beehive?
* Does humidity of the outdoor climate affect the weight of the beehive?
* Does temperature of the outdoor climate affect bees' flow to and from the beehive?
* Does temperature of the outdoor climate affect the weight of the beehive?

Along the way, these questions may be refined and more questions may come up.

#### Big Picture <a class="anchor" id="section2_1"></a>
To observe the dataset as a whole, `DataFrame.hist()` is used. It gives a full view of the distribution of every numerical variable.

Next, correlations between all numerical variables are viewed using a heatmap. The heatmap shows that `weight` may have some correlation with both `humidity` and `temperature`. Interestingly, `flow` may not be correlated with either `humidity` or `temperature`. This is something to look into a bit more in the analysis.

In [None]:
fig = plt.figure(figsize = (10,20))
ax = fig.gca()
filtered_beehive_data.hist(ax=ax);

The overview shows the `weight` column is skewed to the right, while the others appear normally distributed.

In [None]:
#create correlation heatmap
columns_to_correlate = ['flow', 'weight', 'humidity', 'temperature']

plt.figure(figsize=(10,10))
# heat matrix that shows correlation across all numerical variables
sns.heatmap(filtered_beehive_data[columns_to_correlate].corr(), annot=True)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

The correlation heatmap offers initial insight into our variables. It appears `humidity` and `weight` might be correlated.
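Reading the strongest pairs off a heatmap can be error-prone, so the correlation matrix can also be flattened into a ranked list. This is a sketch on synthetic data with a built-in negative humidity and weight relationship. The numbers are not those of the hive dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 500
humidity = rng.uniform(30, 90, n)
demo = pd.DataFrame({
    "humidity": humidity,
    "weight": 70 - 0.3 * humidity + rng.normal(0, 2, n),  # built-in negative relationship
    "flow": rng.integers(-3, 4, n),                        # generated independently
})

corr = demo.corr()
# Keep each pair once (upper triangle, excluding the diagonal) and sort by strength.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting by absolute value puts strong negative correlations alongside strong positive ones.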

#### Humidity and Flow <a class="anchor" id="section2_2"></a>

Now it's time to start looking further into our Big Question A: Does humidity of the outdoor climate affect bees' flow to and from the beehive?

We have already seen in the heatmap above that `humidity` and `flow` don't appear to be correlated. To further our confidence, a scatterplot will be created to inspect the variables.

In [None]:
#create a scatterplot
fig = plt.figure(figsize=(8,8))
ax = fig.gca()
sns.scatterplot(x=filtered_beehive_data['humidity'], y=filtered_beehive_data['flow'])
plt.title('Humidity vs. Flow')
plt.show()
plt.clf()

As expected, there is no obvious relationship between `humidity` and `flow`.
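With many overlapping points and a `flow` column that is mostly 0, a scatterplot can hide a weak trend. A complementary check, sketched here on synthetic data generated with no relationship, is to average `flow` within humidity bins. The bin edges are an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
demo = pd.DataFrame({
    "humidity": rng.uniform(20, 100, 5000),
    "flow": rng.integers(-2, 3, 5000),  # generated independently of humidity
})

# Split humidity into equal-width bins and average flow within each bin.
demo["humidity_bin"] = pd.cut(demo["humidity"], bins=np.arange(20, 101, 20))
binned_flow = demo.groupby("humidity_bin", observed=True)["flow"].agg(["mean", "count"])
print(binned_flow)
```

If the bin means stay close to each other across the humidity range, that supports the reading that there is no obvious relationship.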

#### Humidity and Weight <a class="anchor" id="section2_3"></a>

Next we will look further into our Big Question B: Does humidity of the outdoor climate affect the weight of a beehive?

We have seen in the previous correlation heatmap that `humidity` and `weight` could have a negative correlation. To further our confidence, a scatterplot will be created to inspect the variables.

In [None]:
#create a scatterplot
fig = plt.figure(figsize=(8,8))
ax = fig.gca()
sns.scatterplot(x=filtered_beehive_data['humidity'], y=filtered_beehive_data['weight'])
plt.title('Humidity vs. Weight')
plt.show()
plt.clf()

As expected, there appears to be a negative relationship between `humidity` and `weight`. To explore further, let's find the regression line for this relationship.

In [None]:
#create regression line
model = sm.OLS.from_formula('weight~humidity', data = filtered_beehive_data)
results = model.fit()
print(results.summary())
print(results.params)

fitted_values = results.predict(filtered_beehive_data)
residuals = filtered_beehive_data.weight - fitted_values
#check normality
plt.hist(residuals)
plt.show()
plt.clf()
#check homoscedasticity
plt.scatter(fitted_values, residuals)
plt.show()
plt.clf()
#plot line
slope, intercept, r_value, p_value, std_err = linregress(filtered_beehive_data['humidity'], filtered_beehive_data['weight'])
def regression_line(x):
    return slope * x + intercept
x_pred = np.linspace(min(filtered_beehive_data['humidity']), max(filtered_beehive_data['humidity']), 100)
y_pred = regression_line(x_pred)
plt.scatter(filtered_beehive_data['humidity'], filtered_beehive_data['weight'], label='Data points')
plt.plot(x_pred, y_pred, color='red', label='Regression line')
plt.xlabel('Humidity')
plt.ylabel('Weight')
plt.title('Humidity vs. Weight Regression Line')
plt.legend()
plt.show()

The overall relationship between `humidity` and `weight` is strong and linear. The regression line has a negative slope of -0.312772, indicating that for each 1 percentage point increase in `humidity`, the `weight` of the beehive decreases by about 0.31 kg.
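The strength of a linear fit can be quantified with the coefficient of determination. For a simple linear regression this is the square of the Pearson correlation, which is the `r_value` that `linregress` already returns. The sketch below uses NumPy on synthetic data whose slope was chosen to resemble the one reported. It does not reproduce the hive results.

```python
import numpy as np

rng = np.random.default_rng(5)
humidity = rng.uniform(30, 90, 400)
weight = 70 - 0.31 * humidity + rng.normal(0, 2, 400)  # synthetic, not the hive data

slope, intercept = np.polyfit(humidity, weight, 1)  # least-squares line
r = np.corrcoef(humidity, weight)[0, 1]             # Pearson correlation
r_squared = r ** 2                                   # share of variance explained

# Predicted change in weight (kg) for a 10-point rise in humidity.
predicted_change = slope * 10
print(slope, r_squared, predicted_change)
```

In the notebook itself, `r_value ** 2` from the `linregress` call gives the same quantity for the real data.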

#### Temperature and Flow <a class="anchor" id="section2_4"></a>
Now we will look further into our Big Question C: Does temperature of the outdoor climate affect bees' flow to and from the beehive?

Looking at our correlation heatmap, there does not appear to be an indicator of any obvious relationship between `temperature` and `flow`. To further our confidence, a scatterplot will be created to inspect the variables.

In [None]:
#create a scatterplot
fig = plt.figure(figsize=(8,8))
ax = fig.gca()
sns.scatterplot(x=filtered_beehive_data['temperature'], y=filtered_beehive_data['flow'])
plt.title('Temperature vs. Flow')
plt.show()
plt.clf()

As expected, there is no obvious relationship between `temperature` and `flow`.

#### Temperature and Weight <a class="anchor" id="section2_5"></a>
Next we will look further into our Big Question D: Does temperature of the outdoor climate affect the weight of the beehive?

Looking at our correlation heatmap, we see an indication of a weak positive correlation between `temperature` and `weight`. To further our confidence, a scatterplot will be created to inspect the variables.

In [None]:
#create a scatterplot
fig = plt.figure(figsize=(8,8))
ax = fig.gca()
sns.scatterplot(x=filtered_beehive_data['temperature'], y=filtered_beehive_data['weight'])
plt.title('Temperature vs. Weight')
plt.show()
plt.clf()

It appears there might be a weak positive correlation between `temperature` and `weight`. To explore further, let's find the regression line for this relationship.

In [None]:
#create regression line
model2 = sm.OLS.from_formula('weight~temperature', data = filtered_beehive_data)
results2 = model2.fit()
print(results2.summary())
print(results2.params)

fitted_values2 = results2.predict(filtered_beehive_data)
residuals2 = filtered_beehive_data.weight - fitted_values2
#check normality
plt.hist(residuals2)
plt.show()
plt.clf()
#check homoscedasticity
plt.scatter(fitted_values2, residuals2)
plt.show()
plt.clf()
#plot line
slope, intercept, r_value, p_value, std_err = linregress(filtered_beehive_data['temperature'], filtered_beehive_data['weight'])
def regression_line(x):
    return slope * x + intercept
x_pred2 = np.linspace(min(filtered_beehive_data['temperature']), max(filtered_beehive_data['temperature']), 100)
y_pred2 = regression_line(x_pred2)
plt.scatter(filtered_beehive_data['temperature'], filtered_beehive_data['weight'], label='Data points')
plt.plot(x_pred2, y_pred2, color='red', label='Regression line')
plt.xlabel('Temperature')
plt.ylabel('Weight')
plt.title('Temperature vs. Weight Regression Line')
plt.legend()
plt.show()
plt.clf()

The overall relationship between `temperature` and `weight` is weak and linear. The regression line has a positive slope of 0.231429, indicating that for each 1 degree Celsius increase in `temperature`, the `weight` of the beehive increases by about 0.23 kg.

### Conclusion <a class="anchor" id="conclusion"></a>

#### Findings Overview

The strongest relationship found was between humidity and weight, and it is negative and linear. Temperature and weight also have a linear relationship, though it is weak and positive. This suggests that humidity and temperature are associated with hive weight, which may reflect the bees' honey production, although this analysis alone cannot establish cause. Interestingly, there was no apparent correlation between flow and either humidity or temperature, suggesting that bees move in and out of their hives regardless of these conditions. These insights could be helpful to beekeepers looking to adjust their bees' honey production and to scientists studying bee behavior or honey production.

#### Next Steps

More datasets could be collected to further research. Some things to consider in a future analysis are:
* Geographical data
* Bee species
* Bee food supply (species of flowers used for nectar and pollen)
* Hobbyist vs. commercial beekeeping

