## Measures of center

##### Mean and median
In this chapter, you'll be working with the 2018 Food Carbon Footprint Index from nu3. The food_consumption dataset contains the kilograms of food consumed per person per year in each country for each food category (consumption), as well as the carbon footprint of that food category (co2_emission), measured in kilograms of carbon dioxide, or CO2, per person per year in each country.

In this exercise, you'll compute measures of center to compare food consumption in the US and Belgium using your pandas and numpy skills.

pandas is imported as pd for you and food_consumption is pre-loaded.



1)
Import numpy with the alias np.
Create two DataFrames: one that holds the rows of food_consumption for 'Belgium' and another that holds rows for 'USA'. Call these be_consumption and usa_consumption.
Calculate the mean and median of kilograms of food consumed per person per year for both countries.

In [None]:
import numpy as np
import pandas as pd

# Assuming food_consumption is already loaded

# Create DataFrames for Belgium and USA
be_consumption = food_consumption[food_consumption['country'] == 'Belgium']
usa_consumption = food_consumption[food_consumption['country'] == 'USA']

# Calculate mean and median for Belgium
be_mean_consumption = be_consumption['consumption'].mean()
be_median_consumption = be_consumption['consumption'].median()

# Calculate mean and median for USA
usa_mean_consumption = usa_consumption['consumption'].mean()
usa_median_consumption = usa_consumption['consumption'].median()

print("Belgium:")
print("Mean consumption:", be_mean_consumption)
print("Median consumption:", be_median_consumption)

print("\nUSA:")
print("Mean consumption:", usa_mean_consumption)
print("Median consumption:", usa_median_consumption)


2)Subset food_consumption for rows with data about Belgium and the USA.
Group the subsetted data by country and select only the consumption column.
Calculate the mean and median of the kilograms of food consumed per person per year in each country using .agg().

In [None]:
# Import numpy as np
import numpy as np

# Subset for Belgium and USA only
be_and_usa = food_consumption[food_consumption['country'].isin(['Belgium', 'USA'])]

# Group by country, select consumption column, and compute mean and median
# (string names give the same result as np.mean/np.median and avoid the
# FutureWarning that recent pandas versions emit for NumPy callables)
print(be_and_usa.groupby('country')['consumption'].agg(['mean', 'median']))


##### Mean vs. median
In the video, you learned that the mean is the sum of all the data points divided by the total number of data points, and the median is the middle value of the dataset where 50% of the data is less than the median, and 50% of the data is greater than the median. In this exercise, you'll compare these two measures of center.
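
To make the difference between the two definitions concrete, here is a minimal sketch on a hypothetical five-value array (the numbers are invented for illustration and are not taken from food_consumption). It shows how a single extreme value pulls the mean while leaving the median in place.

```python
import numpy as np

# Hypothetical toy values; 100 is a deliberate outlier
values = np.array([1, 2, 3, 4, 100])

# Mean: sum of all data points divided by the number of data points
mean_value = np.mean(values)

# Median: middle value once the data is sorted
median_value = np.median(values)

print("Mean:", mean_value)      # 22.0
print("Median:", median_value)  # 3.0
```

Four of the five values sit at or below 4, so the median of 3 describes a typical value better than the mean of 22 in this skewed example.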

pandas is loaded as pd, numpy is loaded as np, and food_consumption is available.



1)
Import matplotlib.pyplot with the alias plt.
Subset food_consumption to get the rows where food_category is 'rice'.
Create a histogram of co2_emission for rice and show the plot.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assuming food_consumption is already loaded

# Subset data for 'rice' food category
rice_consumption = food_consumption[food_consumption['food_category'] == 'rice']

# Create a histogram of co2_emission for rice
plt.hist(rice_consumption['co2_emission'], bins=20, edgecolor='black')
plt.xlabel('CO2 Emissions')
plt.ylabel('Frequency')
plt.title('Histogram of CO2 Emissions for Rice')
plt.show()


2)Use .agg() to calculate the mean and median of co2_emission for rice.

In [None]:
# Subset for food_category equals rice
rice_consumption = food_consumption[food_consumption['food_category'] == 'rice']

# Calculate mean and median of co2_emission with .agg()
# (a list of function names is the most portable form across pandas versions)
print(rice_consumption['co2_emission'].agg(['mean', 'median']))


## Measures of spread

##### Quartiles, quantiles, and quintiles
Quantiles are a great way of summarizing numerical data since they can be used to measure center and spread, as well as to get a sense of where a data point stands in relation to the rest of the data set. For example, you might want to give a discount to the 10% most active users on a website.

In this exercise, you'll calculate quartiles, quintiles, and deciles, which split up a dataset into 4, 5, and 10 pieces, respectively.
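
As a rough illustration of how "n pieces" translates into "n + 1 cut points", the sketch below uses a hypothetical array holding the integers 1 through 101 and builds the cut points with np.linspace. The helper name split_points is an assumption made for this example only.

```python
import numpy as np

# Hypothetical data: the integers 1..101 (not the real dataset)
data = np.arange(1, 102)

def split_points(values, pieces):
    """Return the pieces + 1 quantiles that cut values into equal-sized pieces."""
    # np.linspace(0, 1, pieces + 1) yields evenly spaced fractions from 0 to 1
    return np.quantile(values, np.linspace(0, 1, pieces + 1))

quartiles = split_points(data, 4)   # 5 cut points
quintiles = split_points(data, 5)   # 6 cut points
deciles = split_points(data, 10)    # 11 cut points

print("Quartiles:", quartiles)
print("Quintiles:", quintiles)
print("Deciles:", deciles)
```

The first and last cut points are always the minimum and maximum of the data, and the middle quartile is the median.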

Both pandas as pd and numpy as np are loaded and food_consumption is available.

1)Calculate the quartiles of the co2_emission column of food_consumption.

In [None]:
# Calculate the quartiles of co2_emission
print(np.quantile(food_consumption['co2_emission'], [0, 0.25, 0.5, 0.75, 1]))

2)Calculate the six quantiles that split up the data into 5 pieces (quintiles) of the co2_emission column of food_consumption.


In [None]:
# Calculate the quintiles of co2_emission
print(np.quantile(food_consumption['co2_emission'], [0, 0.2, 0.4, 0.6, 0.8, 1]))

3)Calculate the eleven quantiles of co2_emission that split up the data into ten pieces (deciles).

In [None]:
# Calculate the deciles of co2_emission
# np.linspace(0, 1, 11) produces the fractions 0, 0.1, ..., 1
print(np.quantile(food_consumption['co2_emission'], np.linspace(0, 1, 11)))

##### Variance and standard deviation
Variance and standard deviation are two of the most common ways to measure the spread of a variable, and you'll practice calculating these in this exercise. Spread is important since it can help inform expectations. For example, if a salesperson sells a mean of 20 products a day, but has a standard deviation of 10 products, there will probably be days where they sell 40 products, but also days where they only sell one or two. Information like this is important, especially when making predictions.
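
The sketch below works through the calculation by hand on a hypothetical three-day sales record, chosen so that the mean is 20 and the standard deviation is 10 as in the salesperson example. It also compares the result with pandas and NumPy, assuming the usual defaults: pandas divides by n - 1 (sample variance), while np.var divides by n unless ddof=1 is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales figures, invented for illustration
sales = pd.Series([10, 20, 30])

# Step 1: distance of each point from the mean
deviations = sales - sales.mean()

# Step 2: sum of squared distances divided by n - 1 gives the sample variance
manual_variance = (deviations ** 2).sum() / (len(sales) - 1)

# Step 3: the standard deviation is the square root of the variance
manual_std = np.sqrt(manual_variance)

# pandas uses ddof=1 by default, so it matches the manual calculation
pandas_variance = sales.var()
pandas_std = sales.std()

# NumPy uses ddof=0 by default (population variance), so it is smaller here
population_variance = np.var(sales.to_numpy())

print("Manual variance:", manual_variance, "| pandas:", pandas_variance)
print("Manual std dev:", manual_std, "| pandas:", pandas_std)
print("NumPy population variance:", population_variance)
```

The ddof difference is worth keeping in mind when comparing numbers computed with pandas methods against numbers computed with NumPy functions on raw arrays.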

Both pandas as pd and numpy as np are loaded, and food_consumption is available.

Calculate the variance and standard deviation of co2_emission for each food_category by grouping and aggregating.
Import matplotlib.pyplot with alias plt.
Create a histogram of co2_emission for the beef food_category and show the plot.
Create a histogram of co2_emission for the eggs food_category and show the plot.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assuming food_consumption is already loaded

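# Calculate variance and standard deviation of co2_emission for each food_category
# ('var' and 'std' are the sample versions, ddof=1; string names avoid the
# FutureWarning that recent pandas versions emit for np.var/np.std)
variance_and_sd = food_consumption.groupby('food_category')['co2_emission'].agg(variance='var', std_dev='std')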

print("Variance and Standard Deviation:")
print(variance_and_sd)

# Create histogram of co2_emission for food_category 'beef'
plt.hist(food_consumption[food_consumption['food_category'] == 'beef']['co2_emission'], bins=20, edgecolor='black')
plt.xlabel('CO2 Emissions')
plt.ylabel('Frequency')
plt.title('Histogram of CO2 Emissions for Beef')
plt.show()

# Create histogram of co2_emission for food_category 'eggs'
plt.hist(food_consumption[food_consumption['food_category'] == 'eggs']['co2_emission'], bins=20, edgecolor='black')
plt.xlabel('CO2 Emissions')
plt.ylabel('Frequency')
plt.title('Histogram of CO2 Emissions for Eggs')
plt.show()


##### Finding outliers using IQR
Outliers can have big effects on statistics like mean, as well as statistics that rely on the mean, such as variance and standard deviation. Interquartile range, or IQR, is another way of measuring spread that's less influenced by outliers. IQR is also often used to find outliers. If a value is less than $\text{Q1} - 1.5 \times \text{IQR}$ or greater than $\text{Q3} + 1.5 \times \text{IQR}$, it's considered an outlier. In fact, this is how the lengths of the whiskers in a matplotlib box plot are calculated.
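
Here is a small sketch of the 1.5 × IQR rule (values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are flagged) on a hypothetical Series of seven totals. The labels A to G and the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical totals for seven made-up groups; 90 is far from the rest
totals = pd.Series([10, 12, 13, 15, 16, 18, 90], index=list("ABCDEFG"))

# First and third quartiles, then the interquartile range
q1 = totals.quantile(0.25)
q3 = totals.quantile(0.75)
iqr = q3 - q1

# Cutoffs: anything outside [lower, upper] is treated as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = totals[(totals < lower) | (totals > upper)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Cutoffs:", lower, upper)
print(outliers)
```

With these values only group G falls outside the cutoffs, even though its presence would noticeably shift the mean and standard deviation.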

[Figure: diagram of a box plot showing median, quartiles, and outliers]

In this exercise, you'll calculate IQR and use it to find some outliers. pandas as pd and numpy as np are loaded and food_consumption is available.

1)Calculate the total co2_emission per country by grouping by country and taking the sum of co2_emission. Store the resulting DataFrame as emissions_by_country.

In [None]:
import numpy as np
import pandas as pd

# Assuming food_consumption is already loaded

# Calculate total co2_emission per country
emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()

print(emissions_by_country)


2)Compute the first and third quartiles of emissions_by_country and store these as q1 and q3.
Calculate the interquartile range of emissions_by_country and store it as iqr.

In [None]:
import numpy as np
import pandas as pd

# Assuming food_consumption is already loaded

# Calculate total co2_emission per country
emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()

# Calculate the first and third quartiles
q1 = np.percentile(emissions_by_country, 25)
q3 = np.percentile(emissions_by_country, 75)

# Calculate the interquartile range (IQR)
iqr = q3 - q1

print("First Quartile (Q1):", q1)
print("Third Quartile (Q3):", q3)
print("Interquartile Range (IQR):", iqr)


3)Calculate the lower and upper cutoffs for outliers of emissions_by_country, and store these as lower and upper.

In [None]:
import numpy as np
import pandas as pd

# Assuming food_consumption is already loaded

# Calculate total co2_emission per country
emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()

# Calculate the first and third quartiles
q1 = np.percentile(emissions_by_country, 25)
q3 = np.percentile(emissions_by_country, 75)

# Calculate the interquartile range (IQR)
iqr = q3 - q1

# Calculate lower and upper cutoffs for outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

print("Lower Cutoff for Outliers:", lower)
print("Upper Cutoff for Outliers:", upper)


4) Subset emissions_by_country to get countries with a total emission greater than the upper cutoff or a total emission less than the lower cutoff.

In [None]:
import numpy as np
import pandas as pd

# Assuming food_consumption is already loaded

# Calculate total co2_emission per country
emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()

# Calculate the first and third quartiles
q1 = np.percentile(emissions_by_country, 25)
q3 = np.percentile(emissions_by_country, 75)

# Calculate the interquartile range (IQR)
iqr = q3 - q1

# Calculate lower and upper cutoffs for outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Subset countries with emissions greater than upper cutoff or less than lower cutoff
outliers = emissions_by_country[(emissions_by_country > upper) | (emissions_by_country < lower)]

print("Outliers:")
print(outliers)


## What are the chances?

##### Calculating probabilities
You're in charge of the sales team, and it's time for performance reviews, starting with Amir. As part of the review, you want to randomly select a few of the deals that he's worked on over the past year so that you can look at them more deeply. Before you start selecting deals, you'll first figure out what the chances are of selecting certain deals.

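Recall that the probability of an event can be calculated by

$$P(\text{event}) = \frac{\text{number of ways the event can happen}}{\text{total number of possible outcomes}}$$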

Both pandas as pd and numpy as np are loaded and amir_deals is available.
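
Before working with amir_deals, it may help to see the counting idea on a tiny stand-in. The sketch below uses a hypothetical four-row DataFrame named deals (not the real data) and divides each product's count by the total number of rows. It also shows value_counts(normalize=True), a pandas option that produces the same proportions in one step.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a deals table: four deals across three products
deals = pd.DataFrame({"product": ["A", "A", "B", "C"]})

# Number of deals per product
counts = deals["product"].value_counts()

# Probability of randomly picking a deal for each product
probs = counts / len(deals)

# Equivalent shortcut: let pandas normalize the counts directly
probs_shortcut = deals["product"].value_counts(normalize=True)

print(probs)
print("Probabilities sum to:", probs.sum())
```

Product A accounts for two of the four deals, so its probability is 0.5, and the probabilities across all products sum to 1.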

1)Count the number of deals Amir worked on for each product type and store in counts.

In [None]:
# Count the deals for each product
counts = amir_deals['product'].value_counts()
print(counts)

2)Calculate the probability of selecting a deal for the different product types by dividing the counts by the total number of deals Amir worked on. Save this as probs.

In [None]:
# Count the deals for each product
counts = amir_deals['product'].value_counts()

# Calculate probability of picking a deal with each product
probs = counts / amir_deals.shape[0]
print(probs)

##### Sampling deals
In the previous exercise, you counted the deals Amir worked on. Now it's time to randomly pick five deals so that you can reach out to each customer and ask if they were satisfied with the service they received. You'll try doing this both with and without replacement.

Additionally, you want to make sure this is done randomly and that it can be reproduced in case you get asked how you chose the deals, so you'll need to set the random seed before sampling from the deals.
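
The sketch below illustrates both ideas on a hypothetical ten-row DataFrame named deals. Note that it swaps in the random_state argument of DataFrame.sample rather than np.random.seed: this is an alternative way to make a single draw reproducible, not the approach the exercise asks for.

```python
import pandas as pd

# Hypothetical stand-in for a deals table with ten rows
deals = pd.DataFrame({"deal_id": range(1, 11)})

# Without replacement: each row can be picked at most once
sample_without = deals.sample(5, random_state=24)

# Same seed, same call: the draw is reproduced exactly
sample_again = deals.sample(5, random_state=24)

# With replacement: a row may be picked more than once,
# so the sample can even be larger than the table itself
sample_with = deals.sample(15, replace=True, random_state=24)

print(sample_without)
print("Reproducible:", sample_without.equals(sample_again))
print("Rows drawn with replacement:", len(sample_with))
```

Sampling without replacement suits the customer follow-up scenario, since contacting the same customer twice about the same deal would add nothing.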

Both pandas as pd and numpy as np are loaded and amir_deals is available.

1)Set the random seed to 24.
Take a sample of 5 deals without replacement and store them as sample_without_replacement.

In [None]:
# Set random seed (np.random.seed returns None, so there is nothing to assign)
np.random.seed(24)

# Sample 5 deals without replacement
sample_without_replacement = amir_deals.sample(5)
print(sample_without_replacement)

2)Take a sample of 5 deals with replacement and save it as sample_with_replacement.

In [None]:
# Set random seed
np.random.seed(24)

# Sample 5 deals with replacement
sample_with_replacement = amir_deals.sample(5,replace=True)
print(sample_with_replacement)