# Cereals Exploration

This notebook explores the cereals dataset to find good candidate tasks for a small prototype study.

Reference: http://lib.stat.cmu.edu/datasets/1993.expo/

Here are some facts about nutrition that might help you in your analysis. Nutritional recommendations are drawn from the references at the end of this document:

- Adults should consume between 20 and 35 grams of dietary fiber per day.
- The recommended daily intake (RDI) for calories is 2200 for women and 2900 for men.
- Calories come from three food components: fat provides 9 calories per gram, while carbohydrate and protein each provide 4 calories per gram.
- Overall, in your diet, no more than 10% of your calories should be consumed from simple carbohydrates (sugars), and no more than 30% should come from fat. The RDI of protein is 50 grams for women and 63 grams for men. The balance of calories should be consumed in the form of complex carbohydrates (starches).
- The average adult with no defined risk factors or other dietary restrictions should consume between 1800 and 2400 mg of sodium per day.
- The type and amount of milk added to cereal can make a significant difference in the fat and protein content of your breakfast.
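
The calorie arithmetic in the list above can be checked with a few lines of Python. This is a minimal sketch: the function names and the sample gram values are illustrative assumptions, not values taken from the cereals dataset.

```python
# Calories per gram for each macronutrient, as stated in the list above.
CALORIES_PER_GRAM = {"fat": 9, "carbohydrate": 4, "protein": 4}


def macro_calories(fat_g, carbohydrate_g, protein_g):
    """Estimate calories from grams of fat, carbohydrate, and protein."""
    return (
        CALORIES_PER_GRAM["fat"] * fat_g
        + CALORIES_PER_GRAM["carbohydrate"] * carbohydrate_g
        + CALORIES_PER_GRAM["protein"] * protein_g
    )


def fat_calorie_share(fat_g, carbohydrate_g, protein_g):
    """Fraction of the estimated calories that comes from fat."""
    total = macro_calories(fat_g, carbohydrate_g, protein_g)
    return CALORIES_PER_GRAM["fat"] * fat_g / total if total else 0.0


# Hypothetical serving: 1 g fat, 20 g carbohydrate, 3 g protein.
serving = {"fat_g": 1, "carbohydrate_g": 20, "protein_g": 3}
print(macro_calories(**serving))     # 9*1 + 4*20 + 4*3 = 101
print(fat_calorie_share(**serving))  # 9 / 101, well under the 30% guideline
```

Comparing an estimate like this with the calories listed for a cereal is one quick sanity check on the data.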

One possible task is to develop a graphic that would allow the consumer to quickly compare a particular cereal to other possible choices. Some additional questions to consider, and try to answer with effective graphics:
- Can you find the correlations you might expect? Are there any surprising correlations?
- What is the true "dimensionality" of the data?
- Are there any cereals which are virtually identical?
- Is there any way to discriminate among the major manufacturers by cereal characteristics, or do they each have a "balanced portfolio" of cereals?
- Do the nutritional claims made in cereal advertisements stand the scrutiny of data analysis?
- Are there cereals which are clearly nutritionally superior, or inferior? Are there clusters of cereals?
- Is a ranking or scoring scheme possible or reasonable, and if so, are there cereals which are nutritionally superior or inferior under all reasonable weighting schemes?
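
One way to approach the last question without committing to particular weights is a dominance check: if cereal A is at least as good as cereal B on every criterion and strictly better on at least one, then A outscores B under any linear scheme with strictly positive weights. The sketch below assumes each criterion has been oriented so that larger is better (for example, by negating sugars and fat). The cereal names, criteria, and values are hypothetical.

```python
def dominates(a, b):
    """True if a is >= b on every criterion and > b on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(items):
    """Return the names of items that no other item dominates."""
    return [
        name
        for name, values in items.items()
        if not any(
            dominates(other, values)
            for other_name, other in items.items()
            if other_name != name
        )
    ]


# Hypothetical cereals scored as (fiber, protein, -sugars, -fat),
# so that a larger number is better on every criterion.
cereals_scored = {
    "A": (10, 4, -5, -1),
    "B": (3, 2, -12, -2),  # worse than A on every criterion
    "C": (1, 6, -3, -1),   # less fiber than A, but more protein and less sugar
}

print(non_dominated(cereals_scored))  # ['A', 'C']
```

Cereals left in the non-dominated set are the only candidates for "best" under some weighting. Which of them wins still depends on the weights chosen.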

In [2]:
import polars as pl
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
cereals = pl.read_csv("../etc/samples/cereals-cleaned.csv")

In [4]:
cereals.head(10)

## Preliminary Exploration

### Missing Data

In [1]:
# Count the null values in each column.
(cereals
 .lazy()
 .select(pl.all().null_count())
).collect()

In [28]:
# Select only the numeric columns, then keep the rows where any of them is negative.
# Negative values may be a missing-data marker; that is an assumption to verify.
cereals.select(pl.col(pl.NUMERIC_DTYPES)).filter(pl.any_horizontal(pl.col("*") < 0))

### Characterization

In [8]:
sns.boxenplot(cereals, y="rating")

In [5]:
# Show the row for the cereal with the highest rating.
cereals[cereals["rating"].arg_max()]

In [6]:
# Regress calories on each macronutrient.
g = sns.PairGrid(cereals, x_vars=["carbs", "protein", "fat"], y_vars="calories")
g.map(sns.regplot);

In [7]:
g = sns.PairGrid(cereals, x_vars=["calories", "carbs", "protein", "fat", "vitamins"], y_vars="rating")
g.map(sns.regplot);

In [6]:
g = sns.PairGrid(cereals, x_vars=["shelf"], y_vars="sugars")
g.map(sns.regplot);

In [16]:
# Column name assumed to be "carbs", matching the PairGrid cells above.
sns.scatterplot(cereals, x="carbs", y="sugars")