# Descriptive Statistics - Tutorial

**Dr Chao Shu (chao.shu@qmul.ac.uk)**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

sns.set_theme(style="ticks")

# Import the tutorial dataset
food_consumption_df = pd.read_csv("datasets/T01_food_consumption.csv", index_col=0)

In this tutorial, we'll be working with the [2018 Food Carbon Footprint Index](https://www.nu3.de/blogs/nutrition/food-carbon-footprint-index-2018) from nu3. For each country and food category, the `food_consumption` dataset records the kilograms of food consumed per person per year (`consumption`) and the carbon footprint of that food category (`co2_emission`), measured in kilograms of carbon dioxide (CO<sub>2</sub>) per person per year.

Let's first take a quick look at the dataset.

In [None]:
food_consumption_df.head()

## Rice Consumption

Rice is one of the most essential and widely consumed staple foods worldwide. It is a cereal grain that serves as a primary source of sustenance for billions of people, particularly in Asia, Africa, and parts of the Americas.

Let's analyse the consumption of rice in different countries worldwide based on the `food_consumption` dataset. 

**Q1.1**: Subset `food_consumption_df` to get the rows where `food_category` is `'rice'`. Create a histogram of `consumption` for rice and show the plot. What does the histogram show? What do the x-axis and y-axis represent? Set suitable labels for both axes.


In [None]:
# Subset for food_category equals rice
rice_consumption_df = food_consumption_df[food_consumption_df['___'] == '___']

# Histogram of rice consumption and show plot
fig, ax = plt.subplots()
sns.histplot(data=___, bins=range(0, 200, 20), ax=ax)
# Set ticks for x-axis based on the bins
ax.set_xticks(range(0, 200, 20))
# Set labels for x- and y-axis
ax.set_xlabel('___')
ax.set_ylabel('___')

plt.show()

**Q1.2**: What is the shape of the distribution?

**Q1.3**: Calculate the mean and median of rice consumption of all countries.

In [None]:
# Calculate mean and median of consumption with .agg()
print(rice_consumption_df['___'].agg(['___', '___']))

**Q1.4**: Given the skewness of this data, which measure of centre best summarises the kilograms of rice consumed per person per year?
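
As a hedged illustration of why skewness matters when choosing a measure of centre, the sketch below uses a small made-up sample (hypothetical numbers, not values from the dataset) with a long right tail. A single large observation pulls the mean upwards, while the median stays close to the bulk of the data.

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are small, one is very large
sample = pd.Series([2, 3, 3, 4, 5, 6, 8, 120])

mean_value = sample.mean()      # pulled towards the large value
median_value = sample.median()  # middle of the sorted values
skewness = sample.skew()        # positive for a right-skewed sample

print(f"mean = {mean_value:.2f}, median = {median_value:.2f}, skew = {skewness:.2f}")
```

Comparing the two printed centres with a histogram of the same sample is a quick way to judge which one describes a "typical" value.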

## Food Consumption in Different Countries

It is interesting to know the total amount of food (across all food categories) consumed per person per year in different countries. Additionally, since the dataset records the annual CO<sub>2</sub> emissions per person for different nations/regions worldwide, it is also interesting to find the CO<sub>2</sub> emissions generated by food production per person per year for the diet in each country. The analysis may indicate which countries could significantly reduce their carbon footprint by switching to a plant-based diet.

**Q2.1**: Get the total annual CO<sub>2</sub> emissions per person and the total kilograms of food (across all food categories) consumed per person per year in each country. After this step, each country should have exactly one row with two values: `consumption` and `co2_emission`.

In [None]:
# Get the sum of food consumption and co2 emission for each country
food_consumption_per_country_df = food_consumption_df.groupby('___')[['___', '___']].sum()

# Show the aggregated dataframe
food_consumption_per_country_df

**Q2.2**: Get the five-number summary, variance and standard deviation for `consumption` and `co2_emission`.
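
One possible way to compute these statistics with pandas is sketched below on a made-up data frame (`toy_df` and its numbers are hypothetical, not taken from the dataset). The five-number summary is the minimum, first quartile, median, third quartile and maximum; the sketch obtains all five with a single `.quantile()` call and the spread measures with `.agg()`.

```python
import pandas as pd

# Hypothetical per-country totals (made-up numbers, not from the dataset)
toy_df = pd.DataFrame(
    {
        "consumption": [100.0, 200.0, 300.0, 400.0, 500.0],
        "co2_emission": [50.0, 150.0, 250.0, 350.0, 1200.0],
    }
)

# Five-number summary: minimum, Q1, median, Q3, maximum
five_number = toy_df.quantile([0, 0.25, 0.5, 0.75, 1])
five_number.index = ["min", "Q1", "median", "Q3", "max"]

# Sample variance and standard deviation (pandas uses ddof=1 by default)
spread = toy_df.agg(["var", "std"])

print(five_number)
print(spread)
```

`DataFrame.describe()` reports most of the same values in one call, although it does not include the variance.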

**Q2.3**: Create box plots for `consumption` and `co2_emission` of all countries.

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(6, 3))

# Create box plot for food consumption of all countries
sns.boxplot(y=food_consumption_per_country_df['___'], ax=axs[0])

# Create box plot for co2 emission of all countries
sns.boxplot(y=food_consumption_per_country_df['___'], ax=axs[1])

fig.tight_layout()
plt.show()


**Q2.4**: Based on the box plots, can you identify any outliers in the `consumption` and `co2_emission` data? What is the shape of the distribution of `co2_emission` likely to be?

> 💬 **Discussion:** How can the outliers be identified programmatically?


In [None]:
# Get Q1 and Q3  
q1 = food_consumption_per_country_df['___'].quantile(___)
q3 = food_consumption_per_country_df['___'].quantile(___)

# Calculate IQR
iqr = ___

# Calculate the lower and upper cutoffs for outliers
lower = ___
upper = ___

# Select the rows whose co2_emission lies outside the cutoffs (the outliers)
outliers = food_consumption_per_country_df.query('co2_emission < @lower | co2_emission > @upper')
outliers
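
For reference, the sketch below applies one common convention, the 1.5 × IQR rule that seaborn's box plots use by default for their whiskers, to a small made-up series (the numbers are hypothetical and unrelated to the dataset). It builds a boolean mask so that the same cutoffs can be used either to list the outliers or to drop them.

```python
import pandas as pd

# Hypothetical values with one unusually large observation
values = pd.Series([10, 12, 13, 15, 16, 18, 20, 95], name="value")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional box-plot cutoffs: 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask marking values outside the cutoffs
is_outlier = (values < lower) | (values > upper)

print(values[is_outlier])   # the flagged observations
print(values[~is_outlier])  # the data with the flagged observations removed
```

The multiplier 1.5 is a convention rather than a law; a different factor changes which points are flagged.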

**Q2.5**: Find out which countries have the largest/smallest amount of food consumption per person per year, and which countries have the largest/smallest amount of CO<sub>2</sub> emission generated by food production based on their diet. 

> 💬 **Discussion:** Before you start processing the data, think about which types of country you expect to have the largest/smallest amount of food consumption per person per year, and which you expect to have the largest/smallest amount of CO<sub>2</sub> emissions generated by food production. Do the results you find from the data meet your expectations? What are the possible reasons for the results?

In [None]:
# Apply idxmin() and idxmax() to find the index labels of the minimum and maximum values in the `consumption` and `co2_emission` columns
food_consumption_per_country_df.agg(['___', '___'])
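
As a small illustration of what `idxmin()` and `idxmax()` return, the sketch below calls the two methods directly on a made-up data frame (`toy_df`, its country labels and its numbers are all hypothetical). Each method returns, for every column, the index label of the row holding the extreme value rather than the value itself.

```python
import pandas as pd

# Hypothetical totals indexed by made-up country labels
toy_df = pd.DataFrame(
    {"consumption": [310.0, 120.0, 450.0], "co2_emission": [900.0, 150.0, 600.0]},
    index=["Country A", "Country B", "Country C"],
)

# Index label of the smallest value in each column
smallest = toy_df.idxmin()
# Index label of the largest value in each column
largest = toy_df.idxmax()

print(smallest)
print(largest)
```

In this toy example the country with the largest `consumption` is not the one with the largest `co2_emission`, which is the kind of mismatch worth looking for in the real data.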

## Food Consumption Comparison between China and the UK

We know that people in Western and Asian countries have different diets. As a QMUL-BUPT JP student, you might be wondering how food consumption differs between China and the UK, based on the dataset we have.

The food consumption dataset contains the amount of food consumed and the carbon footprint for different types of food. Let's start by analysing the consumption of different types of food in China and the UK.

**Q3.1**: What type of data is `food_category`? What kind of plot is commonly used to visualise and analyse this type of data?

**Q3.2**: Use the plot you chose in **Q3.1** to visualise the annual food consumption per person of each food type in China and the UK.

> 💬 **Discussion:** What interesting differences can you find from the figure?

In [None]:
# Subset for China and United Kingdom only
cn_and_uk_df = food_consumption_df.query('country == "___" | country == "___"')

# Draw a suitable plot to show the annual food consumption per person of each food type in China and the UK.
sns.catplot(data=___, x='___', y='___', hue='___', kind='___', height=4, aspect=2)

**Q3.3**: Compare the total annual food consumption per person (`consumption`) and the corresponding carbon footprint (`co2_emission`) in China and the UK.

> 💬 **Discussion:** What can the summary statistics tell you? Try to use the bar chart you plotted to understand the summary statistics.

In [None]:
# Group by country, select consumption column, and compute mean, median and sum
print(cn_and_uk_df.groupby(['___'])['___'].agg(['___', '___', '___']))
# Group by country, select co2_emission column, and compute mean, median and sum
print(cn_and_uk_df.groupby(['___'])['___'].agg(['___', '___', '___']))