# Description: Lab 6_Visualizing Numbers
### (1) In this exercise, you will use a dataset of avocado prices, modified from Kaggle: https://www.kaggle.com/datasets/neuromusic/avocado-prices. The columns are:


*   Date - The date of the observation
*   AveragePrice - The average price of a single avocado
*   type - Conventional or organic
*   year - The year of the observation
*   city - The city or region of the observation
*   state - The state of the observation
*   region - The region of the observation
*   Total Volume - Total number of avocados sold
*   4046 - Total number of avocados with PLU 4046 sold
*   4225 - Total number of avocados with PLU 4225 sold
*   4770 - Total number of avocados with PLU 4770 sold
*   Small Bags - The number of small bags of avocados sold
*   Large Bags - The number of large bags of avocados sold
*   XLarge Bags - The number of extra large bags of avocados sold
*   Total Bags - The total number of bags of avocados sold

### (2) You are asked to create **five** visualizations using the avocado dataset.
### (3) One or more examples are provided for each type of visualization.


# Step 0: Import the data

In [None]:
# 0. Import the main packages we will be using: pandas, seaborn, NumPy, matplotlib, and squarify
import pandas as pd
import seaborn as sns
import numpy as np
import io
import matplotlib.pyplot as plt
!pip install squarify # squarify is not a default package on Google Colab, so it must be installed first
import squarify

In [None]:
# If you are on Colab, use this cell to upload the avocado dataset
from google.colab import files
uploaded = files.upload()

In [None]:
# Read the CSV file into a pandas DataFrame
data = pd.read_csv(io.BytesIO(uploaded['avocado.csv']))
# The dataset is now stored in a pandas DataFrame.
# If you run the notebook locally, you can use the following line to read the CSV file instead:
#data = pd.read_csv('avocado.csv')

# Step 1: Explore the data

In [None]:
# 1.1 show information about the data
data.info()

In [None]:
# 1.2 The Date column is not stored as a datetime type, so we convert it here.
data['Date']= pd.to_datetime(data['Date'])
data.info()
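
To see what the conversion in step 1.2 changes, here is a minimal sketch on a made-up three-row table (the `toy` DataFrame and its values are hypothetical, not taken from avocado.csv). After the conversion, the `.dt` accessor becomes available, which is handy for pulling out parts of a date such as the month.

```python
import pandas as pd

# Hypothetical stand-in for the avocado data (values are made up)
toy = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2016-01-03"],
    "AveragePrice": [1.33, 1.35, 1.20],
})

# Before the conversion the column holds plain strings
print(toy["Date"].dtype)

# Convert the strings to datetime values, as in step 1.2
toy["Date"] = pd.to_datetime(toy["Date"])
print(toy["Date"].dtype)

# Datetime columns expose the .dt accessor, e.g. to extract the month
toy["month"] = toy["Date"].dt.month
print(toy)
```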

In [None]:
# 1.3 returns the number of unique values for each column; another way to explore the data
data.nunique()

In [None]:
# 1.4 Print the first 10 rows of the data.
data.head(10)

In [None]:
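# 1.5 Calculate a correlation matrix for the numeric columns. More info: https://www.datacamp.com/tutorial/tutorial-datails-on-correlation
# numeric_only=True skips text columns such as type and region; recent pandas versions raise an error without it
data.corr(numeric_only=True)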
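
As a small illustration of what a correlation matrix contains, the sketch below uses a made-up table (the `toy` DataFrame is hypothetical) in which price rises exactly as volume falls, so the correlation between the two columns is -1. It assumes pandas 1.5 or later, where `corr` accepts `numeric_only`.

```python
import pandas as pd

# Hypothetical data: price goes up in equal steps while volume goes down
toy = pd.DataFrame({
    "AveragePrice": [1.0, 1.2, 1.4, 1.6],
    "Total Volume": [400, 300, 200, 100],
    "type": ["conventional", "organic", "conventional", "organic"],
})

# numeric_only=True leaves the text column "type" out of the calculation
corr = toy.corr(numeric_only=True)
print(corr)
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.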

# Step 2: Create visualizations


*   You are required to create 5 visualizations
*   An example is given for each visualization



# 2.1 Line chart

In [None]:
# Line chart example - the total volume sold by year by avocado type. Note that we use the Seaborn library here.
sns.relplot(data=data, kind='line',
            x='year', y='Total Volume', hue='type')

In [None]:
# Line chart question: Please create a line chart that shows the average price by year by region
# Solution:


# 2.2 Bar Chart

In [None]:
#Bar chart example 1: Create a bar plot that shows the mean, median, and standard deviation of the Total Volume column by year
data1=data.groupby('year')['Total Volume'].agg(['mean','median','std'])
data1
data1.plot.bar()
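
To make the intermediate table in bar chart example 1 easier to picture, here is a minimal sketch of the same `groupby` and `agg` pattern on made-up numbers (the `toy` DataFrame is hypothetical). The result has one row per year and one column per statistic, which is the shape `plot.bar()` turns into grouped bars.

```python
import pandas as pd

# Hypothetical data: two observations per year
toy = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016],
    "Total Volume": [100, 300, 200, 600],
})

# One row per year, one column per statistic
summary = toy.groupby("year")["Total Volume"].agg(["mean", "median", "std"])
print(summary)
```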

In [None]:
# Bar chart example 2: Create a bar chart of the total volume of organic avocados sold, grouped by region and year. Use region as the color hue and year as the x-axis
sns.catplot(data=data.query('type == "organic"'),
            kind='bar', x='year', y='Total Volume', hue='region')

In [None]:
# Bar chart question: Create a grouped bar chart showing the average price of avocados sold, grouped by region and year. Use region as the color hue. Use the color palette Set2. More info on color palettes: https://seaborn.pydata.org/tutorial/color_palettes.html
# Solution:



# 2.3 Showing distribution using histograms or boxplots

In [None]:
# Histogram example: show the distribution of average price using 15 bins. 
data_hist = data['AveragePrice']
sns.histplot(data_hist,bins=15)


In [None]:
# Boxplot example:
# Create a grouped boxplot: show the distribution of average price by year
# Optionally, call plt.style.use("seaborn-white") first to set the basic style of the visualizations
# (in matplotlib 3.6 and later this style is named "seaborn-v0_8-white")
sns.boxplot(data=data, x='year', y='AveragePrice',
            palette='husl')


In [None]:
# Boxplot question (you are required to pick either the boxplot or the histogram question): create a grouped boxplot showing the distribution of average price by year and by type of avocado (for color). You can choose any color palette. You should see 8 boxes.
# Boxplot solution:



In [None]:
# Histogram question (you are required to pick either the boxplot or the histogram question): create a histogram showing the distribution of average price for the different avocado types
# Histogram solution:



# 2.4 Scatterplot

In [None]:
# Scatterplot example 1: create a scatterplot for total volume and average price using a generic method in seaborn. More info: https://seaborn.pydata.org/generated/seaborn.relplot.html 
sns.relplot(data=data, kind='scatter', x='AveragePrice', y='Total Volume')

In [None]:
# Scatterplot example 2: Create the same plot, but with the hue parameter set to year and each dot sized in proportion to its total volume
sns.relplot(data=data, kind='scatter', x='AveragePrice', y='Total Volume', hue='year',
            size='Total Volume')

In [None]:
# Scatterplot question: create a 4*4 grid of scatterplots showing the relationship between average price and total volume by year and region; color each dot by avocado type
# Hint: you will use the hue, col, and row parameters of the relplot function.
# Solution:


# 2.5 Treemap

In [None]:
# Treemap example: create a treemap with fake data
# Create a data frame with fake data
df = pd.DataFrame({'nb_people':[8,3,4,2], 'group':["group A", "group B", "group C", "group D"] })
# plot it
squarify.plot(sizes=df['nb_people'], label=df['group'], alpha=.8 )
#plt.axis('off')
plt.show()

In [None]:
# Treemap example 2: Create a treemap showing the composition of total avocado sales (total bags) by year. More info on colors: https://www.geeksforgeeks.org/treemaps-in-python-using-squarify/
# Group the number of bags sold by year
n = data.groupby('year')['Total Bags'].sum()
# Extract the sizes and labels as lists
sizes = n.tolist()
a = n.index.tolist()
#print(n)
squarify.plot(sizes=sizes, label=a, alpha=.8, color=["orange","green","blue","grey"], pad=True)
plt.axis('off')
plt.show()

In [None]:
# Treemap question: Create a treemap showing the composition of avocado sales by year and by region
# Solution:
# Hint 1: you will need to group the number of bags sold by year and region
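
Hint 1 can be illustrated with a minimal sketch on made-up numbers. The `toy` DataFrame and the region names "West" and "South" are hypothetical and only stand in for the real data. Grouping by two columns yields one total per (year, region) pair, and the pairs can be turned into text labels for the rectangles.

```python
import pandas as pd

# Hypothetical data with two years and two regions
toy = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2016, 2016],
    "region": ["West", "West", "South", "West", "South", "South"],
    "Total Bags": [100, 50, 80, 120, 60, 40],
})

# One total per (year, region) pair; the result has a two-level index
grouped = toy.groupby(["year", "region"])["Total Bags"].sum()
print(grouped)

# Rectangle sizes, and one label per rectangle built from the index pairs
sizes = grouped.tolist()
labels = [f"{year}\n{region}" for year, region in grouped.index]
print(labels)
```

Lists like `sizes` and `labels` are the kind of input that `squarify.plot(sizes=..., label=...)` takes in the treemap examples.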


# Optional - additional exploration

In [None]:
# Filter the data so it contains just the rows for 2015, for the conventional type, and for the region named Northeast. Store just the Date, Total Bags, and Small Bags columns in a new DataFrame.
data_reduced = data.query('year == 2015 and region == "Northeast" and type == "conventional"')[['Date','Total Bags','Small Bags']]

In [None]:
#Melt the data in the Total Bags and Small Bags columns, but not the values in the Date column. Name the column that contains the type of bag Bags, and name the column that contains the number of bags Count. Then, display the resulting DataFrame.
data_melted = pd.melt(data_reduced, id_vars='Date', value_vars=['Total Bags','Small Bags'],
                     var_name='Bags', value_name='Count')
data_melted
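
Melting is easier to follow on a tiny table. The sketch below applies the same `pd.melt` call to a made-up two-row DataFrame (`toy` and its values are hypothetical): the two bag columns are stacked into a single `Count` column, and `Bags` records which column each value came from.

```python
import pandas as pd

# Hypothetical wide-format data: one column per bag type
toy = pd.DataFrame({
    "Date": ["2015-01-04", "2015-01-11"],
    "Total Bags": [10, 20],
    "Small Bags": [6, 15],
})

# Long format: one row per (Date, bag type) combination
melted = pd.melt(toy, id_vars="Date", value_vars=["Total Bags", "Small Bags"],
                 var_name="Bags", value_name="Count")
print(melted)
```

The long format is what lets Seaborn draw one line per bag type through the `hue` parameter.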

In [None]:
# Plot the melted data with Seaborn in a line plot, using the hue parameter to distinguish between the bag types.
sns.relplot(data=data_melted, kind='line', x='Date', y='Count', hue='Bags')

In [None]:
#Bin the data in the Total Volume column into four quantiles labeled ‘poor’, ‘modest’, ‘good’, and ‘excellent’, and store the bin labels in a new column.
data['Sales Volume'] = pd.qcut(data['Total Volume'], q=4, labels=['poor','modest','good','excellent'])
data
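
As a quick check of how `pd.qcut` assigns labels, here is a minimal sketch on eight made-up volumes (the `volumes` Series is hypothetical). With `q=4`, the values are split into four equal-sized groups, from the lowest quarter to the highest.

```python
import pandas as pd

# Hypothetical volumes, already in ascending order for readability
volumes = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# Four quantile-based bins with the same labels as in the cell above
bins = pd.qcut(volumes, q=4, labels=["poor", "modest", "good", "excellent"])
print(bins.tolist())

# Each quartile receives the same number of observations
print(bins.value_counts())
```

Because the bins are based on quantiles rather than fixed cut points, each label covers roughly a quarter of the rows regardless of how skewed the volumes are.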

In [None]:
# Plot the binned data by year using a Seaborn count plot.
sns.catplot(data=data, kind='count', x='year', hue='Sales Volume')

# The end of the assignment