<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Avocado Production

- [Avocado Production Worldwide](https://ourworldindata.org/grapher/avocado-production)
- [California Avocado Production 1980-2020](https://www.kaggle.com/datasets/jarredpriester/california-avocado-production-19802020)
- [3 datasets on Dataworld](https://data.world/datasets/avocados)
- [GitHub Topic on Avocado](https://github.com/topics/avocado-dataset)
- [Statista Avocado production worldwide](https://www.statista.com/statistics/577455/world-avocado-production/)
- [Hass Avocados R Package](https://cran.r-project.org/web/packages/avocado/vignettes/a_intro.html)
- [Avocado Source](https://www.avocadosource.com/)
- [Hass Avocado Board](https://hassavocadoboard.com/)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

### Table of Contents <a class="anchor" id="AVO_toc"></a>

* [Table of Contents](#AVO_toc)
    * [Page 1 - Abstract](#AVO_page_1)
    * [Page 2 - Imported Libraries](#AVO_page_2)
    * [Page 3 - Import the Dataset](#AVO_page_3)
    * [Page 4 - Setting Notebook Options](#AVO_page_4)
    * [Page 5 - Looking at the Data](#AVO_page_5)
    * [Page 6 - Get Descriptive Statistics about the Dataset](#AVO_page_6)
    * [Page 7 - Filter for Three Cities](#AVO_page_7)
    * [Page 8 - Recoding](#AVO_page_8)
    * [Page 9 - Test for Assumptions](#AVO_page_9)
    * [Page 10 - Correlation analysis](#AVO_page_10)
    * [Page 11 - Regression analysis](#AVO_page_11)
    * [Page 12 - T-tests](#AVO_page_12)
    * [Page 13 - Chi-squared test](#AVO_page_13)
    * [Page 14 - Time-series analysis](#AVO_page_14)
    * [Page 15 - Summary](#AVO_page_15)
    * [Page 16 - Future Work](#AVO_page_16)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 1 - Abstract <a class="anchor" id="AVO_page_1"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

Research on Avocado Production

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 2 - Imported Libraries<a class="anchor" id="AVO_page_2"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import bartlett, f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd, MultiComparison

import warnings
warnings.filterwarnings("ignore")

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 3 - Import the Dataset <a class="anchor" id="AVO_page_3"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

## Research, find datasets and read in your data
We found three datasets; this notebook works with the first:

- [USA 2015-2018](../Data/avocados-us-2015-1018.csv)

This dataset contains information about avocado sales in various regions from 2015 to 2018. Here are some interesting descriptive statistics that can be derived from this dataset:

1. The average price of avocados across all regions and years.
2. The total volume of avocados sold across all regions and years.
3. The total volume of each type of avocado sold (4046, 4225, 4770) across all regions and years.
4. The average price of avocados in each region and for each year.
5. The total number of bags sold across all regions and years, and the proportion of each type of bag (small, large, XL) sold.
6. The number of avocados sold in each region and for each year.
7. The distribution of avocado prices for each region and for each year.

Here are the other two:
- [Worldwide 1961-2020](../Data/avocado-production-worldwide-1961-2021.csv)
- [California 1980-2020](../Data/avocados-california-1980-2020.csv)

In [None]:
### start code
df = pd.read_csv('../Data/avocados-us-2015-1018.csv')
df = df.iloc[:,1:]
df.head()
### end code
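
The cell above drops the first column with `iloc[:, 1:]`. Assuming that column is a row index saved alongside the data (an assumption, since the file itself is not shown here), a minimal sketch of an alternative is to let `read_csv` treat it as the index. The three-row inline CSV below is hypothetical sample data, not the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the leading unnamed column mimics a saved row index.
csv_text = """,Date,AveragePrice,Total Volume,type,year,region
0,2015-12-27,1.33,64236.62,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,conventional,2015,Albany
"""

# index_col=0 uses the first column as the index, so it never becomes a data column.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample.columns.tolist())
```

Either approach leaves the same data columns; `index_col=0` just makes the intent explicit at load time.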

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 4 - Setting Notebook Options<a class="anchor" id="AVO_page_4"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

#### Check number of rows and columns

In [None]:
print(f'Rows: {df.shape[0]}')
print(f'Columns: {df.shape[1]}')

In [None]:
# reset the options
#pd.reset_option('display.max_rows')

# set the option to display the maximum number of columns
pd.set_option('display.max_columns', 20)

# set the option to display the minimum and maximum number of rows
pd.set_option('display.min_rows', 200)
pd.set_option('display.max_rows', 1000)

pd.describe_option('display.max_rows')
pd.describe_option('display.max_columns')
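
Options set with `pd.set_option` stay in effect for the rest of the session. When a wider display is only needed for one cell, `pd.option_context` is a possible alternative. The sketch below only reads the option back to show that it reverts; the value 1000 mirrors the setting above.

```python
import pandas as pd

before = pd.get_option("display.max_rows")

# Inside the block the option is overridden; it reverts automatically on exit.
with pd.option_context("display.max_rows", 1000):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")
print(before, inside, after)
```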

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 5 - EDA: Looking at the Data<a class="anchor" id="AVO_page_5"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

In [None]:
# check the column names 
df.columns

In [None]:
#print unique values for type column
df.type.unique()

In [None]:
#print unique values for region column
df.region.unique()

In [None]:
df.region.describe()

In [None]:
df.AveragePrice.describe()

In [None]:
# usually object columns are your key factors/independent variables, whereas floats and ints are continuous/dependent variables
df.info()

In [None]:
# notice that the Date column is an object (string) and not a proper datetime column
# the commented-out line below would convert the string column to a datetime column
#df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
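
The commented-out conversion could look like the following minimal sketch. The month/day/year format is the one named in the commented line; whether the CSV actually stores dates that way is an assumption, and the three sample dates are hypothetical.

```python
import pandas as pd

# Hypothetical dates in month/day/year form; the real file may use another layout.
dates = pd.DataFrame({"Date": ["12/27/2015", "01/03/2016", "03/25/2018"]})
dates["Date"] = pd.to_datetime(dates["Date"], format="%m/%d/%Y")

# Once parsed, calendar parts are available through the .dt accessor.
dates["month"] = dates["Date"].dt.month
print(dates.dtypes)
```

If the format string does not match the data, `to_datetime` raises an error rather than guessing, which makes a mismatch easy to spot.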

In [None]:
# this is a large dataset and will be truncated to 11 rows unless you set your row and column options
df

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 6 - Get Descriptive Statistics about the Dataset<a class="anchor" id="AVO_page_6"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">


In [None]:
#What is the average price of avocados across all regions and years.

# Calculate the mean of the 'AveragePrice' column
average_price = df["AveragePrice"].mean()

print("The average price of avocados across all regions and years is:", average_price)

This code works on the DataFrame loaded earlier. It uses the mean() method to calculate the average of the 'AveragePrice' column, which gives the unweighted average price of avocados across all regions and years, and prints the result using the print() function.
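
The mean of 'AveragePrice' gives every row equal weight, regardless of how many avocados that row represents. A volume-weighted average is one possible complement. The sketch below uses a hypothetical two-row frame to show how the two figures can differ; it is an illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a cheap, high-volume row and an expensive, low-volume row.
toy = pd.DataFrame({
    "AveragePrice": [1.00, 2.00],
    "Total Volume": [300.0, 100.0],
})

simple_mean = toy["AveragePrice"].mean()
# Each price is weighted by the volume sold at that price.
weighted_mean = np.average(toy["AveragePrice"], weights=toy["Total Volume"])
print(simple_mean, weighted_mean)
```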



In [None]:
# Group the DataFrame by 'year' and 'region' and calculate the mean of the 'AveragePrice' column for each group
avg_price_by_region_year = df.groupby(['year', 'region'])['AveragePrice'].mean()

# Reset the index of the resulting DataFrame
avg_price_by_region_year = avg_price_by_region_year.reset_index()

# Set the size of the figure using matplotlib
plt.figure(figsize=(12, 6))

# Create a line plot of the average price of avocados across all regions and years
sns.lineplot(data=avg_price_by_region_year, x='year', y='AveragePrice')

# Set the title of the plot
plt.title('Average Price of Avocados Across All Regions and Years')

# Set the labels of the x- and y-axes
plt.xlabel('Year')
plt.ylabel('Average Price (in dollars)')

# Show the plot
plt.show()

In this code, we use the groupby() method to group the DataFrame by 'year' and 'region' and calculate the mean of the 'AveragePrice' column for each group. We store the result in the variable avg_price_by_region_year.

Next, we use the reset_index() method to turn the grouped result back into a flat DataFrame in which 'year' and 'region' are ordinary columns. This lets seaborn refer to 'year' by name.

We then use the plt.figure(figsize=(12, 6)) statement to set the size of the figure using matplotlib.

We then call the seaborn.lineplot() function to create a line plot of the average price of avocados across all regions and years, with the x parameter set to 'year' and the y parameter set to 'AveragePrice'.

Finally, we set the title of the plot using plt.title(), set the labels of the x- and y-axes using plt.xlabel() and plt.ylabel(), and show the plot using plt.show().
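
To make the groupby() and reset_index() steps concrete, here is a small sketch on a hypothetical four-row frame (the prices are made up). It shows that the grouped result is a Series with a two-level index, and that reset_index() moves those levels back into columns.

```python
import pandas as pd

# Hypothetical prices; (2015, Albany) has two rows so the mean is visible.
toy = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016],
    "region": ["Albany", "Albany", "Boston", "Albany"],
    "AveragePrice": [1.0, 1.2, 2.0, 1.5],
})

# Series indexed by (year, region)
grouped = toy.groupby(["year", "region"])["AveragePrice"].mean()

# Flat DataFrame with 'year' and 'region' as ordinary columns
flat = grouped.reset_index()
print(flat)
```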

In [None]:
# What is the total volume of avocados sold across all regions and years.

# Calculate the sum of the 'Total Volume' column
total_volume = df["Total Volume"].sum()

print("The total volume of avocados sold across all regions and years is:", total_volume)

This code uses the sum() method to calculate the sum of the 'Total Volume' column of the DataFrame loaded earlier, which gives the total volume of avocados sold across all regions and years, and prints the result using the print() function.

In [None]:
# What is the total volume of each type of avocado sold (4046, 4225, 4770) across all regions and years.
# Calculate the sum of the '4046', '4225', and '4770' columns
total_4046 = df["4046"].sum()
total_4225 = df["4225"].sum()
total_4770 = df["4770"].sum()

print("The total volume of 4046 avocados sold is:", total_4046)
print("The total volume of 4225 avocados sold is:", total_4225)
print("The total volume of 4770 avocados sold is:", total_4770)

In [None]:
# Group the DataFrame by 'year' and sum the values of the '4046', '4225', and '4770' columns for each group
total_volume_by_type_year = df.groupby('year')[['4046', '4225', '4770']].sum()

# Reset the index of the resulting DataFrame
total_volume_by_type_year = total_volume_by_type_year.reset_index()

# Set the size of the figure using matplotlib
plt.figure(figsize=(12, 6))

# Create a line plot of the total volume of each type of avocado sold for each year
plt.plot(total_volume_by_type_year['year'], total_volume_by_type_year['4046'], label='4046')
plt.plot(total_volume_by_type_year['year'], total_volume_by_type_year['4225'], label='4225')
plt.plot(total_volume_by_type_year['year'], total_volume_by_type_year['4770'], label='4770')

# Set the title of the plot
plt.title('Total Volume of Each Type of Avocado Sold Across All Regions and Years')

# Set the labels of the x- and y-axes
plt.xlabel('Year')
plt.ylabel('Total Volume (units)')

# Show the legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# The average price of avocados in each region and for each year.
# Calculate the mean of the 'AveragePrice' column grouped by 'region' and 'year'
avg_price_by_region_year = df.groupby(['region', 'year'])['AveragePrice'].mean()

print("The average price of avocados in each region and for each year is:\n", avg_price_by_region_year)

In [None]:

# Reset the index of the resulting DataFrame
avg_price_by_region_year = avg_price_by_region_year.reset_index()

# Set the size of the figure using matplotlib
plt.figure(figsize=(16, 9))

# Create a line plot of the average price of avocados for each region and year
sns.lineplot(data=avg_price_by_region_year, x='year', y='AveragePrice', hue='region')

# Move the legend to the top and make it smaller
#plt.legend(loc='upper left', bbox_to_anchor=(0.2, 0.50), ncol=3, fontsize='small')

# Put the legend on the right side and make it smaller
plt.legend(bbox_to_anchor=(1.01, 1), borderaxespad=0, prop={'size': 8})

# Show the plot
plt.show()

In [None]:
# The total number of bags sold across all regions and years, and the proportion of each type of bag (small, large, XL) sold.

# Calculate the sum of the 'Total Bags', 'Small Bags', 'Large Bags', and 'XLarge Bags' columns
total_bags = df["Total Bags"].sum()
total_small_bags = df["Small Bags"].sum()
total_large_bags = df["Large Bags"].sum()
total_xlarge_bags = df["XLarge Bags"].sum()

# Calculate the proportion of each type of bag sold
prop_small_bags = total_small_bags / total_bags
prop_large_bags = total_large_bags / total_bags
prop_xlarge_bags = total_xlarge_bags / total_bags

print("The total number of bags sold across all regions and years is:", total_bags)
print("The proportion of small bags sold is:", prop_small_bags)
print("The proportion of large bags sold is:", prop_large_bags)
print("The proportion of XLarge bags sold is:", prop_xlarge_bags)

In [None]:
# The number of avocados sold in each region and for each year.

# Group the DataFrame by 'region' and 'year' and sum the 'Total Volume' column
avocado_count_by_region_year = df.groupby(['region', 'year'])['Total Volume'].sum()

print("The number of avocados sold in each region and for each year is:\n", avocado_count_by_region_year)

In this code, we use the groupby() method to group the DataFrame by 'region' and 'year', and then calculate the sum of the 'Total Volume' column for each group using the sum() method. The result is a pandas Series, indexed by region and year, that contains the volume of avocados sold in each region for each year. Finally, we print the results using the print() function.

In [None]:
# The distribution of avocado prices for each region and for each year.

# Group the DataFrame by 'region' and 'year' and get the 'AveragePrice' column
avocado_prices_by_region_year = df.groupby(['region', 'year'])['AveragePrice']

# Set the width of the plot using the 'figure()' function from matplotlib
plt.figure(figsize=(16,10))

# Use the 'histplot()' method in seaborn to plot the distribution of avocado prices for each region and for each year
sns.histplot(data=df, x='AveragePrice', hue='region', multiple='stack', kde=True)

# Show the plot
plt.show()

In this code, we use the groupby() method to group the DataFrame by 'region' and 'year' and select the 'AveragePrice' column of each group. We store this object in the variable avocado_prices_by_region_year. Note that the plot is drawn from df directly, so this grouped object is not used by it.

Next, we use the figure() function from matplotlib to create a new figure with a width of 16 inches and a height of 10 inches, passed as a tuple to the figsize parameter.

We then use the histplot() function in seaborn to plot the distribution of avocado prices. It takes the DataFrame (data=df), the column to plot (x='AveragePrice'), the grouping variable (hue='region'), the option to stack the histograms (multiple='stack'), and the option to add a kernel density estimate (kde=True). Because the hue is 'region' only, each region's histogram pools all years together.

Finally, we use the plt.show() function to display the plot.
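
A numeric counterpart to the histogram is a per-group summary table. The sketch below, on a hypothetical toy frame, calls describe() on the same kind of grouped object that the cell stores in avocado_prices_by_region_year, giving count, mean, spread, and quartiles for each region and year.

```python
import pandas as pd

# Hypothetical prices for two regions in one year.
toy = pd.DataFrame({
    "region": ["Albany", "Albany", "Boston", "Boston"],
    "year": [2015, 2015, 2015, 2015],
    "AveragePrice": [1.0, 1.4, 2.0, 2.2],
})

# One row of summary statistics per (region, year) group
summary = toy.groupby(["region", "year"])["AveragePrice"].describe()
print(summary)
```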

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 7 - Filter for Three Cities<a class="anchor" id="AVO_page_7"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

#### Filter for 3 cities
The dataset covers many more regions than three, so the next step filters it down to the cities of interest. The code below makes a list of the regions to keep, then uses the isin() method on the region column to keep only the rows that match.



In [None]:
# Filter the DataFrame for New York, Los Angeles, and Chicago
cities = ['NewYork', 'LosAngeles', 'Chicago']

# select the 'region' column with bracket notation and keep the rows whose value is in the list, using isin()
df_filtered = df[df['region'].isin(cities)]

df_filtered.head()
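
isin() silently ignores any requested name that does not occur in the column, so a misspelled region simply yields fewer rows. The following sketch, using a hypothetical toy frame and a deliberately misspelled name, shows one way to check for that before filtering.

```python
import pandas as pd

# Hypothetical rows using the dataset's space-free region naming.
toy = pd.DataFrame({
    "region": ["NewYork", "LosAngeles", "Chicago", "Boston"],
    "AveragePrice": [1.9, 1.2, 1.6, 1.5],
})

# 'Los Angeles' contains a space on purpose, so it will not match.
requested = ["NewYork", "Los Angeles", "Chicago"]
missing = sorted(set(requested) - set(toy["region"].unique()))

filtered = toy[toy["region"].isin(requested)]
print("Not found:", missing)
print(filtered)
```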

In [None]:
# Set the size of the figure using matplotlib
plt.figure(figsize=(12, 6))

# Create a line plot of the average price of avocados for each year and city
sns.lineplot(data=df_filtered, x='year', y='AveragePrice', hue='region')

# Set the title of the plot
plt.title('Average Price of Avocados in New York, Los Angeles, and Chicago')

# Set the labels of the x- and y-axes
plt.xlabel('Year')
plt.ylabel('Average Price (in dollars)')

# Show the plot
plt.show()

In [None]:
df_filtered.region.describe()

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 8 - Recoding<a class="anchor" id="AVO_page_8"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

#### Recode 'type' and 'region' columns

In [None]:
# Copy the dataset
df_recoded = df.copy()

# Recode the 'type' column
df_recoded['type'] = df_recoded['type'].replace({'conventional': 0, 'organic': 1})

# Recode the 'region' column
region_map = {'Albany': 0, 'Atlanta': 1, 'BaltimoreWashington': 2, 'Boise': 3, 'Boston': 4, 'BuffaloRochester': 5, 'California': 6, 'Charlotte': 7, 'Chicago': 8, 'CincinnatiDayton': 9, 'Columbus': 10, 'DallasFtWorth': 11, 'Denver': 12, 'Detroit': 13, 'GrandRapids': 14, 'GreatLakes': 15, 'HarrisburgScranton': 16, 'HartfordSpringfield': 17, 'Houston': 18, 'Indianapolis': 19, 'Jacksonville': 20, 'LasVegas': 21, 'LosAngeles': 22, 'Louisville': 23, 'MiamiFtLauderdale': 24, 'Midsouth': 25, 'Nashville': 26, 'NewOrleansMobile': 27, 'NewYork': 28, 'Northeast': 29, 'NorthernNewEngland': 30, 'Orlando': 31, 'Philadelphia': 32, 'PhoenixTucson': 33, 'Pittsburgh': 34, 'Plains': 35, 'Portland': 36, 'RaleighGreensboro': 37, 'RichmondNorfolk': 38, 'Roanoke': 39, 'Sacramento': 40, 'SanDiego': 41, 'SanFrancisco': 42, 'Seattle': 43, 'SouthCarolina': 44, 'SouthCentral': 45, 'Southeast': 46, 'Spokane': 47, 'StLouis': 48, 'Syracuse': 49, 'Tampa': 50, 'TotalUS': 51, 'West': 52, 'WestTexNewMexico': 53}
df_recoded['region'] = df_recoded['region'].replace(region_map)
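
The hand-written region_map assigns codes in alphabetical order. A possible shortcut, sketched here on a hypothetical four-value Series, is to convert the column to a pandas categorical, whose integer codes follow the sorted category order by default.

```python
import pandas as pd

# Hypothetical region values in arbitrary order.
regions = pd.Series(["Boston", "Albany", "Chicago", "Albany"], name="region")

as_category = regions.astype("category")
codes = as_category.cat.codes                       # integer code per row
mapping = dict(enumerate(as_category.cat.categories))  # code -> region name
print(codes.tolist(), mapping)
```

Keeping the mapping dictionary makes it possible to translate the codes back to region names later.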

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 9 - Test for Assumptions<a class="anchor" id="AVO_page_9"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

#### Test for assumptions

- Normality
- Homogeneity of Variance

In [None]:
from scipy.stats import normaltest

# Test for normality of the 'AveragePrice' column
alpha = 0.05
stat, p = normaltest(df_recoded['AveragePrice'])

# Print the results of the normality test
print(f"Normality test result for 'AveragePrice':\nStatistic={stat:.4f}, p-value={p:.4f}")
if p < alpha:
    print("The null hypothesis (data is normally distributed) can be rejected.")
else:
    print("The null hypothesis (data is normally distributed) cannot be rejected.")

In [None]:
# Without Transformation
sns.displot(df_recoded['Total Volume'], kde=True).set(title='Without Transformation')

# With Square Root Transformation
sns.displot(np.sqrt(df_recoded['Total Volume']), kde=True).set(title='With Square Root Transformation')

# With Log Transformation
sns.displot(np.log(df_recoded['Total Volume']), kde=True).set(title='With Log Transformation')

In [None]:
# Perform a log transformation on the 'Total Volume' column of the df_recoded DataFrame
df_recoded['Total Volume Log'] = np.log(df_recoded['Total Volume'])

In [None]:
df_recoded.head()

In [None]:
# Copy the log-transformed 'Total Volume' column to the original dataset
df['Total Volume Log'] = df_recoded['Total Volume Log']
df.head()

In [None]:
#In this code, we first import the necessary functions from the scipy.stats module: bartlett(), levene(), and shapiro(). We define a significance level alpha of 0.05.


from scipy.stats import bartlett, levene, shapiro

# Perform Bartlett's test for homogeneity of variances
alpha = 0.05
stat, p = bartlett(df_recoded['Total Volume'], df_recoded['Total Volume Log'])
print(f"Bartlett's test result:\nStatistic={stat:.4f}, p-value={p:.4f}")
if p < alpha:
    print("The null hypothesis (equal variances) can be rejected.")
else:
    print("The null hypothesis (equal variances) cannot be rejected.")

# Perform Levene's test for homogeneity of variances
stat, p = levene(df_recoded['Total Volume'], df_recoded['Total Volume Log'])
print(f"Levene's test result:\nStatistic={stat:.4f}, p-value={p:.4f}")
if p < alpha:
    print("The null hypothesis (equal variances) can be rejected.")
else:
    print("The null hypothesis (equal variances) cannot be rejected.")

# Perform Shapiro-Wilk test for normality
stat, p = shapiro(df_recoded['Total Volume Log'])
print(f"Shapiro-Wilk test result:\nStatistic={stat:.4f}, p-value={p:.4f}")
if p < alpha:
    print("The null hypothesis (data is normally distributed) can be rejected.")
else:
    print("The null hypothesis (data is normally distributed) cannot be rejected.")


## Explaining the results

The tests check the assumptions of normality and homogeneity of variance, which underlie many statistical analyses. Here is how to read each of them:

Bartlett's test: The null hypothesis for Bartlett's test is that the variances of the groups are equal. If the p-value is less than the chosen significance level (0.05 in this case), we reject the null hypothesis and conclude that the variances are not equal; otherwise we cannot reject it. Note that the cell above compares the Total Volume column with its own log transform. These two columns are on very different scales, so the comparison says little about homogeneity of variance in the usual sense, which concerns the same variable measured across groups (for example, across regions). The conclusion should be read from the printed p-value rather than assumed.

Levene's test: Levene's test is another test for homogeneity of variances, which is more robust than Bartlett's test when the data is not normally distributed. Its p-value is read the same way, and the same caveat about comparing a column with its own log transform applies.

Shapiro-Wilk test: The null hypothesis for the Shapiro-Wilk test is that the data is normally distributed. If the p-value is less than the chosen significance level (0.05 in this case), we reject the null hypothesis and conclude that the data is not normally distributed. In this case, the p-value is less than the chosen significance level (p < 0.05), so we reject the null hypothesis and conclude that the log-transformed Total Volume column of the df_recoded DataFrame is not normally distributed. SciPy's documentation notes that for samples larger than 5,000 observations the Shapiro-Wilk p-value may not be accurate, so it should be treated as approximate for a dataset of that size.

In summary, the results suggest that the log-transformed Total Volume column of the df_recoded DataFrame does not follow a normal distribution. The variance comparison between Total Volume and Total Volume Log should be interpreted with caution for the reason given above. This information is useful for selecting appropriate statistical tests for analyzing the data.
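
Homogeneity of variance is usually checked for one variable across several groups. The following sketch illustrates that usage of Levene's test on three synthetic groups drawn with NumPy (hypothetical data, one group deliberately given a larger spread); it is not a rerun of the analysis above.

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(0)

# Synthetic groups: same mean, but group_c has five times the standard deviation.
group_a = rng.normal(loc=10.0, scale=1.0, size=200)
group_b = rng.normal(loc=10.0, scale=1.0, size=200)
group_c = rng.normal(loc=10.0, scale=5.0, size=200)

stat, p = levene(group_a, group_b, group_c)
print(f"Levene statistic={stat:.4f}, p-value={p:.4g}")
if p < 0.05:
    print("Equal variances can be rejected.")
else:
    print("Equal variances cannot be rejected.")
```

With real data, the groups would be the values of one column (for example 'Total Volume Log') split by region.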


In [None]:
df.head()

In [None]:
from scipy.stats import f_oneway

# Select the rows corresponding to the selected cities
cities = ['Chicago', 'NewYork', 'LosAngeles']
df_cities = df.loc[df['region'].isin(cities)]

df_cities.head()

In [None]:
# Perform one-way ANOVA
alpha = 0.05
stat, p = f_oneway(df_cities['Total Volume Log'][df_cities['region'] == 'Chicago'],
                   df_cities['Total Volume Log'][df_cities['region'] == 'NewYork'],
                   df_cities['Total Volume Log'][df_cities['region'] == 'LosAngeles'])
print(f"One-way ANOVA result:\nStatistic={stat:.4f}, p-value={p:.4f}")
if p < alpha:
    print("The null hypothesis (the means of the populations are equal) can be rejected.")
else:
    print("The null hypothesis (the means of the populations are equal) cannot be rejected.")

## Explained results

The one-way ANOVA compares the mean log-transformed Total Volume of avocados sold in the three cities. The null hypothesis is that the population means are equal, while the alternative hypothesis is that at least one of the means is different.

The ANOVA test produces two key outputs: the F statistic and the p-value. The F statistic is the ratio of the variability between the group means to the variability within the groups, while the p-value is the probability of observing an F statistic at least this large if the null hypothesis were true.

In this case, the one-way ANOVA result shows that the test statistic is 34.1755, and the p-value is printed as 0.0000, meaning it is below 0.00005 at four decimal places. Since the p-value is less than the commonly used threshold of 0.05, we reject the null hypothesis that the population means are equal, and conclude that at least one city's mean differs from the others.

In other words, the ANOVA test suggests that there is evidence of a difference in avocado sales volume between the three cities. However, the ANOVA test alone cannot tell us which cities differ from each other. To determine that, we need post-hoc comparisons such as Tukey's HSD or pairwise tests with a Bonferroni correction.
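
To show where the F statistic comes from, the sketch below computes it by hand for three small hypothetical groups and compares the result with scipy's f_oneway. The numbers are made up for illustration and are unrelated to the avocado data.

```python
import numpy as np
from scipy.stats import f_oneway

# Three hypothetical groups of three observations each.
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_scipy, p_scipy = f_oneway(*groups)
print(f_manual, f_scipy, p_scipy)
```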

In [None]:
df_cities['region'].unique()

In [None]:
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

In [None]:
# Perform one-way ANOVA
anova = f_oneway(df_cities[df_cities['region'] == cities[0]]['Total Volume'],
                 df_cities[df_cities['region'] == cities[1]]['Total Volume'],
                 df_cities[df_cities['region'] == cities[2]]['Total Volume'])

# Print ANOVA results
print("One-way ANOVA result:\n", anova)

# Perform Bartlett's test for homogeneity of variances
bartlett_test = bartlett(df_cities[df_cities['region'] == cities[0]]['Total Volume'],
                         df_cities[df_cities['region'] == cities[1]]['Total Volume'],
                         df_cities[df_cities['region'] == cities[2]]['Total Volume'])

# Print Bartlett's test results
print("\n")
print("Bartlett's test for homogeneity of variances:\n", bartlett_test)

# Perform Tukey's HSD post-hoc test
mc = MultiComparison(df_cities['Total Volume'], df_cities['region'])
tukey = mc.tukeyhsd()

# Print Tukey's HSD results
print("\n")
print("Tukey's HSD post-hoc test:\n", tukey)

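# Apply a Bonferroni adjustment to the pairwise p-values:
# multiply by the number of comparisons and cap the result at 1
p_values = tukey.pvalues
adjusted_p_values = np.minimum(p_values * len(p_values), 1.0)
reject = adjusted_p_values < 0.05

# Print Bonferroni correction results
print("Bonferroni correction post-hoc test:\n", adjusted_p_values)
print("Reject null hypothesis:\n", reject)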


## Explained results

The results show the output of a one-way ANOVA on the Total Volume of avocados sold in three cities: Chicago, Los Angeles, and New York.

The F_onewayResult output indicates that there is a significant difference between the means of the three cities (F-statistic = 115.182, p-value = 8.62e-46).

Bartlett's test for homogeneity of variances is a test for whether the variances are equal across groups. The BartlettResult output indicates that the variances are significantly different across groups (Bartlett's statistic = 562.733, p-value = 6.37e-123). Therefore, the equal-variance assumption of the standard one-way ANOVA is not met for the raw Total Volume, and the F-test above should be interpreted with caution.

Tukey's HSD post-hoc test is a multiple comparison test that compares the means of all pairs of groups. The output shows the mean difference between each pair of groups, the p-value adjusted for multiple comparisons (Tukey's method controls the family-wise error rate using the studentized range distribution), and whether the null hypothesis of equal means can be rejected (reject=True) or not (reject=False). The output indicates that all pairs of groups have significantly different means (p < 0.05), and thus the null hypothesis of equal means can be rejected.

The Bonferroni correction is another method for controlling the family-wise error rate (FWER) when performing multiple comparisons. The code multiplies each pairwise p-value by the number of comparisons (3) and compares the result with 0.05, which is equivalent to comparing the unadjusted p-value with 0.05/3 ≈ 0.0167. Because the p-values passed in are already Tukey-adjusted, this is a conservative double adjustment. The output indicates that all pairwise comparisons remain significant at the 0.05 level after Bonferroni correction.
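
A Bonferroni correction is more commonly applied to unadjusted pairwise p-values. The sketch below does that with Welch t-tests on three synthetic samples; the data is hypothetical, the city names are reused only as labels, and choosing Welch's test (equal_var=False) is an assumption made here because the variances differed above.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Synthetic samples; 'LosAngeles' is given a clearly higher mean.
samples = {
    "Chicago": rng.normal(10.0, 1.0, 50),
    "LosAngeles": rng.normal(12.0, 1.0, 50),
    "NewYork": rng.normal(10.1, 1.0, 50),
}

pairs = list(combinations(samples, 2))
raw_p = np.array([
    ttest_ind(samples[a], samples[b], equal_var=False).pvalue
    for a, b in pairs
])

# Bonferroni: multiply by the number of comparisons and cap at 1.
adjusted_p = np.minimum(raw_p * len(pairs), 1.0)

for (a, b), p_raw, p_adj in zip(pairs, raw_p, adjusted_p):
    print(f"{a} vs {b}: raw p={p_raw:.4g}, adjusted p={p_adj:.4g}")
```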


<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 10 - Correlation analysis<a class="anchor" id="AVO_page_10"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

## Correlation analysis

In [None]:
# Calculate the correlation coefficient between Average Price and Total Volume
corr_coeff = df['AveragePrice'].corr(df['Total Volume'])

print("The correlation coefficient between Average Price and Total Volume is:", corr_coeff)

## Explained results

The corr() method in pandas calculates the correlation coefficient between two variables (Pearson's coefficient by default). The result is a value between -1 and 1, where -1 indicates a perfect negative linear correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive linear correlation.

In this case, we are interested in the correlation between Average Price and Total Volume. A positive coefficient close to 1 would indicate that as the Total Volume increases, the Average Price also tends to increase. A negative coefficient close to -1 would indicate that as the Total Volume increases, the Average Price tends to decrease. A coefficient close to 0 would indicate that there is no linear relationship between the two variables.
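
As a check on what corr() computes, the sketch below evaluates Pearson's coefficient from its definition on a hypothetical four-row frame and compares it with the pandas result. The values are made up so that price falls as volume rises.

```python
import numpy as np
import pandas as pd

# Hypothetical data: higher prices go with lower volumes.
toy = pd.DataFrame({
    "AveragePrice": [1.0, 1.2, 1.5, 1.9],
    "Total Volume": [400.0, 350.0, 200.0, 100.0],
})

# Deviations from the mean of each column
x = toy["AveragePrice"] - toy["AveragePrice"].mean()
y = toy["Total Volume"] - toy["Total Volume"].mean()

# Pearson's r: covariance divided by the product of the standard deviations
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = toy["AveragePrice"].corr(toy["Total Volume"])
print(r_manual, r_pandas)
```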


In [None]:
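# Calculate the correlation matrix of the numeric columns
# (numeric_only=True skips string columns such as 'type' and 'region')
corr_matrix = df.corr(numeric_only=True)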

# Set up the figure and plot the heatmap
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', ax=ax)

# Add a title to the plot
ax.set_title('Correlation Matrix of Avocado Dataset')

# Show the plot
plt.show()

In this code, we calculate the correlation matrix of the DataFrame using the corr() method in pandas. Then, we use the seaborn library's heatmap() function to plot the correlation matrix, with annot=True adding the exact correlation coefficient to each cell. Finally, we add a title to the plot using matplotlib and display the plot using plt.show().

The resulting plot shows the correlation between all pairs of numeric variables in the avocado dataset. With the 'coolwarm' colormap, cells shaded in red indicate positive correlation, while those shaded in blue indicate negative correlation. The more saturated the color, the stronger the correlation.

The figsize parameter in plt.subplots() is set to the tuple (10, 8), which creates a plot with a width of 10 inches and a height of 8 inches. You can adjust these values to your desired size.

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 11 - Regression analysis<a class="anchor" id="AVO_page_11"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

In [None]:
# Set up the regression model
X = df[['Total Volume', 'type', 'region']]
# dtype=float keeps the dummy columns numeric so that statsmodels can fit them
X = pd.get_dummies(X, columns=['type', 'region'], drop_first=True, dtype=float)
y = df['AveragePrice']
X = sm.add_constant(X)

# Fit the model and print the results
model = sm.OLS(y, X).fit()
print(model.summary())

In this code, we set up the regression model using OLS from statsmodels. We specify the independent variables as Total Volume, type, and region, and convert the categorical variables type and region into dummy variables using pd.get_dummies(). With drop_first=True, the first category of each variable is dropped and serves as the reference level.

Next, we fit the model using the fit() method and print the summary of the results using model.summary(). The summary includes information on the coefficients of each variable, the R-squared value, and other statistics.

The plot produced by the next cell shows four subplots: the observed and fitted values versus Total Volume, the residuals versus Total Volume, the partial regression plot of Average Price on Total Volume with the influence of the other independent variables removed, and the component-plus-residual (CCPR) plot. These plots can help to assess the assumptions of the regression model and identify any issues with the data.

In [None]:
# Plot the results
fig = plt.figure(figsize=(16, 10))
sm.graphics.plot_regress_exog(model, 'Total Volume', fig=fig)
plt.show()

## Explained Results

The output shows the results of an Ordinary Least Squares (OLS) regression analysis. The dependent variable is the Average Price of avocados, and the independent variables include Total Volume, type, and region.

The R-squared value of 0.548 indicates that approximately 54.8% of the variability in the dependent variable can be explained by the independent variables.
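
To make the meaning of R-squared concrete, here is a minimal sketch that computes it by hand as 1 - SS_res / SS_tot for a least-squares fit. The data are synthetic and all names ending in `_demo` are illustrative assumptions; this is not a re-run of the avocado model.

```python
import numpy as np

# Synthetic data, invented for illustration: y depends linearly on x plus noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=200)
y_demo = 2.0 - 0.3 * x_demo + rng.normal(scale=0.5, size=200)

# Design matrix with an intercept column, fitted by ordinary least squares
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])
beta_demo, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# R-squared = 1 - (residual sum of squares / total sum of squares)
residuals_demo = y_demo - X_demo @ beta_demo
ss_res = np.sum(residuals_demo ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r_squared_demo = 1.0 - ss_res / ss_tot

print(f"R-squared: {r_squared_demo:.3f}")
```

A value close to 1 means the residuals are small relative to the overall spread of the dependent variable; a value close to 0 means the model explains little of that spread.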

The F-statistic of 400.4 and the associated p-value of 0.00 suggest that the regression model is statistically significant.

The coefficients for each independent variable can be interpreted as follows:

- Total Volume: For a one-unit increase in Total Volume, the Average Price is expected to decrease by 1.231e-06 units, holding all other variables constant.
- Type: The 'organic' type of avocado is associated with a higher Average Price than the 'conventional' type.
- Region: The regression model includes a separate coefficient for each region, representing the average difference in Average Price compared to the reference region (Albany). For example, the coefficient for the region Atlanta is -0.03, indicating that the Average Price in Atlanta is 0.03 lower on average than in Albany, holding all other variables constant.

Overall, the regression analysis suggests that Total Volume, type, and region are significant predictors of Average Price for avocados.

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 12 - T-tests<a class="anchor" id="AVO_page_12"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

In [None]:
import scipy.stats as stats

In [None]:
# perform a t-test to determine if there is a significant difference in the Average Price of conventional avocados versus organic avocados:

conventional = df[df['type'] == 'conventional']['AveragePrice']
organic = df[df['type'] == 'organic']['AveragePrice']

t_stat, p_val = stats.ttest_ind(conventional, organic)

print('T-test result:')
print('t-statistic:', t_stat)
print('p-value:', p_val)

The t-test result provides the t-statistic and the associated p-value. If the p-value is less than the significance level (usually 0.05), we can reject the null hypothesis and conclude that there is a significant difference between the means of the two groups.

The t-test result shows a t-statistic of -105.587 and a p-value of 0.0, so there is a significant difference in the Average Price of conventional versus organic avocados. The negative t-statistic means that the average price of conventional avocados (the first group passed to the test) is lower than that of organic avocados. The p-value is printed as 0.0 because it is too small to represent as a floating-point number, not because it is exactly zero. The probability of obtaining such a large difference in means by chance alone is extremely low, and we can reject the null hypothesis that there is no difference between the means of the two groups.

Note that the example above assumes that the two groups have equal variances. If the variances are unequal, a Welch's t-test can be used instead:

In [None]:
t_stat, p_val = stats.ttest_ind(conventional, organic, equal_var=False)
print('t-statistic:', t_stat, 'p-value:', p_val)
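
One way to judge whether the equal-variance assumption is reasonable is to compare the sample variances of the two groups. The sketch below uses synthetic samples (not the avocado prices; `group_a` and `group_b` are hypothetical names) to compute Welch's t-statistic by hand and check it against stats.ttest_ind(..., equal_var=False).

```python
import numpy as np
import scipy.stats as stats

# Synthetic samples with clearly different variances (illustrative only)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=1.2, scale=0.2, size=300)
group_b = rng.normal(loc=1.6, scale=0.4, size=250)

# Sample variances (ddof=1) and the unpooled standard error used by Welch's test
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)
std_err = np.sqrt(var_a / group_a.size + var_b / group_b.size)

# Welch's t-statistic: difference in means divided by the unpooled standard error
t_manual = (group_a.mean() - group_b.mean()) / std_err

# The same test via SciPy
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print('variance ratio:', var_b / var_a)
print('manual t:', t_manual)
print('scipy t: ', t_welch, 'p-value:', p_welch)
```

When the variance ratio is far from 1, as in this synthetic example, Welch's version is the safer choice; when the variances are similar, both variants give nearly the same result.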

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 13 - Chi-squared test<a class="anchor" id="AVO_page_13"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

In [None]:
# create contingency table
contingency_table = pd.crosstab(df['type'], df['region'])
print(contingency_table)

# conduct chi-squared test
chi2_stat, p_val, dof, ex = stats.chi2_contingency(contingency_table)
print("Chi-squared test statistic:", chi2_stat)
print("p-value:", p_val)

## Explained Results

The chi-squared test result indicates no significant association between the type of avocado and the region. The test statistic is very low (0.026), meaning the observed frequencies are almost identical to the frequencies expected under independence, and the p-value of 1.0 gives no evidence against the null hypothesis of no association. Note what the contingency table counts: pd.crosstab(df['type'], df['region']) tallies the number of rows for each combination of type and region, so the result says that both types are recorded about equally often in every region. It describes how balanced the dataset is and says nothing about differences in price or volume between regions.
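
The expected frequencies behind the test are computed as row total times column total divided by the grand total. As a hedged illustration with a made-up, perfectly balanced table (the counts and the name `toy_table` are invented), the sketch below shows that when the observed and expected counts coincide, the statistic is 0 and the p-value is 1.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Made-up balanced contingency table: same number of rows per type in every region
toy_table = pd.DataFrame(
    [[10, 10, 10], [10, 10, 10]],
    index=['conventional', 'organic'],
    columns=['Albany', 'Atlanta', 'Boston'],
)

# Expected counts under independence: (row total * column total) / grand total
row_totals = toy_table.sum(axis=1).to_numpy()
col_totals = toy_table.sum(axis=0).to_numpy()
expected_manual = np.outer(row_totals, col_totals) / toy_table.to_numpy().sum()

# SciPy returns the statistic, p-value, degrees of freedom, and expected counts
chi2_toy, p_toy, dof_toy, expected_scipy = stats.chi2_contingency(toy_table.to_numpy())

print('chi2:', chi2_toy, 'p-value:', p_toy, 'dof:', dof_toy)
```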


<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 14 - Time-series analysis<a class="anchor" id="AVO_page_14"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

To perform time-series analysis, we first need to convert the 'Date' column to a pandas datetime object and set it as the index of the DataFrame. Then, we can use various time-series analysis techniques such as decomposition, forecasting, and autocorrelation to examine trends and patterns in the data over time.

For example, to examine if there is a seasonal pattern in the sales of avocados, we can use seasonal decomposition. We can use the statsmodels library to perform seasonal decomposition and plot the results.

In [None]:
df.info()

In [None]:
# create a deep copy of df and assign it to df2
df2 = df.copy()

In [None]:
# Convert the Date column to datetime format (it is set as the index in a later cell)
df2['Date'] = pd.to_datetime(df2['Date'])

In [None]:
df2.info()

In [None]:
# Set the 'Date' column as the index of df2.
# With inplace=True, set_index modifies df2 directly instead of returning a new DataFrame.
# 'Date' then becomes the index and no longer appears as a regular column.
df2.set_index('Date', inplace=True)

In [None]:
# notice that the Date column has vanished
df2.info()

In [None]:
# Group the data by week and calculate the average price and total volume for each week
weekly_data = df2.resample('W').agg({'AveragePrice': 'mean', 'Total Volume': 'sum'})

In [None]:
# Plot the time series of Total Volume
fig, ax = plt.subplots(figsize=(12,6))
weekly_data['Total Volume'].plot(ax=ax)
ax.set(title='Total Avocado Sales by Week', xlabel='Date', ylabel='Total Volume')
plt.show()

In [None]:
# Decompose the time series to visualize any trends and seasonal patterns
decomposition = sm.tsa.seasonal_decompose(weekly_data['Total Volume'], model='additive')
fig = decomposition.plot()
fig.set_figheight(8)
fig.set_figwidth(12)
plt.show()

The cells above convert the 'Date' column to datetime format, set it as the index, and group the data by week. They calculate the average price and total volume for each week and then plot the time series of the total volume. The series is then decomposed with the seasonal decomposition function from the statsmodels library to visualize any trends and seasonal patterns. The resulting plot shows the original time series, the trend component, the seasonal component, and the residual component.
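
The weekly grouping step can be illustrated on a toy frame with two rows on each of two Sundays. The values and the names `toy_weekly_input` and `toy_weekly` are invented for illustration. Rows that fall in the same week are combined: prices are averaged and volumes are summed.

```python
import pandas as pd

# Toy data, invented for illustration: two rows (e.g. two regions) per weekly date
toy_weekly_input = pd.DataFrame(
    {
        'AveragePrice': [1.0, 2.0, 1.5, 2.5],
        'Total Volume': [100, 300, 200, 400],
    },
    index=pd.to_datetime(['2015-01-04', '2015-01-04', '2015-01-11', '2015-01-11']),
)

# 'W' bins end on Sunday by default; each bin yields one aggregated row
toy_weekly = toy_weekly_input.resample('W').agg(
    {'AveragePrice': 'mean', 'Total Volume': 'sum'}
)

print(toy_weekly)
```

Using the mean for price and the sum for volume mirrors the aggregation in the notebook: a sum of prices would not be meaningful, while volumes across rows add up naturally.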

In time series analysis, a time series can be decomposed into several components: trend, seasonal, and residual.

The trend component represents the long-term changes in the time series. It is the underlying pattern or direction in which the series is moving over time, regardless of short-term fluctuations. The trend component can be linear, non-linear, or a combination of both.

The seasonal component represents the periodic fluctuations that occur within the time series at fixed intervals, such as daily, weekly, monthly, or yearly. Seasonality is often observed in economic, environmental, and social data, and it can be caused by various factors such as weather, holidays, and cultural events.

The residual component, also known as the error or noise component, represents the random variation in the time series that cannot be explained by the trend or seasonal components. It includes all other factors that affect the series but are not accounted for by the model.

To summarize, the original time series is the complete set of data over time, including all the components mentioned above. The trend component represents the long-term pattern of change, the seasonal component represents the periodic fluctuations, and the residual component represents the unexplained variation or noise in the series.




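The seasonal decomposition plot can help us identify recurring patterns in the data. Because the series is aggregated by week, patterns shorter than a week cannot show up; what the plot can reveal is a pattern that repeats over a longer span, such as a yearly cycle in weekly sales. Such patterns can inform forecasting models and help us make predictions about future sales.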

In [None]:
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Decompose the time series to visualize any trends and seasonal patterns
decomposition = sm.tsa.seasonal_decompose(weekly_data['Total Volume'], model='additive')

# Create a 4x1 grid of subplots (four rows, one column)
fig, axs = plt.subplots(4, 1, figsize=(12, 10))

# Plot the original time series
axs[0].plot(weekly_data['Total Volume'])
axs[0].set_title('Original Time Series')

# Plot the trend component
axs[1].plot(decomposition.trend)
axs[1].set_title('Trend Component')

# Plot the seasonal component
axs[2].plot(decomposition.seasonal)
axs[2].set_title('Seasonal Component')

# Plot the residual component
axs[3].plot(decomposition.resid)
axs[3].set_title('Residual Component')

# Adjust the spacing between the subplots
plt.tight_layout()

# Show the plot
plt.show()


In [None]:
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Decompose the time series to visualize any trends and seasonal patterns
decomposition = sm.tsa.seasonal_decompose(weekly_data['Total Volume'], model='additive')

# Create a new figure
fig = plt.figure(figsize=(12, 6))

# Plot the original time series
plt.plot(weekly_data['Total Volume'], color='black', label='Original Time Series')

# Plot the trend component
plt.plot(decomposition.trend, color='red', label='Trend Component')

# Plot the seasonal component
plt.plot(decomposition.seasonal, color='green', label='Seasonal Component')

# Plot the residual component
plt.plot(decomposition.resid, color='blue', label='Residual Component')

# Set the x and y-axis labels and title
plt.xlabel('Time')
plt.ylabel('Total Volume')
plt.title('Time Series Decomposition')

# Add a legend to the plot
plt.legend(loc='upper left')

# Show the plot
plt.show()


<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 15 - Summary<a class="anchor" id="AVO_page_15"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

Write summary here

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

# Page 16 - Future Work<a class="anchor" id="AVO_page_16"></a>

[Back to Top](#AVO_toc)

<hr style="height:5px;border-width:0;color:MediumAquamarine;background-color:MediumAquamarine">

write Future ideas and work here