## Notebook01b

### Setup

Run all of the following before starting the notebook.

In [1]:
! wget -q -nc https://raw.githubusercontent.com/taylor-arnold/fds-py/refs/heads/main/funs.py

In [2]:
import numpy as np
import polars as pl

from funs import *
from plotnine import *
from polars import col as c
theme_set(theme_minimal())

ub = "https://raw.githubusercontent.com/taylor-arnold/fds-py-nb/refs/heads/main/"

In [3]:
food = pl.read_csv(ub + "data/food.csv").drop(c.description)

### Questions

Let's look at the `food` dataset that we loaded above. To get an idea about its structure, write the name of the dataset in the block below and run the cell with nothing else. This will show the column names and data types along with the first five and last five rows.

In [4]:
food

item,food_group,calories,total_fat,sat_fat,cholesterol,sodium,carbs,fiber,sugar,protein,iron,vitamin_a,vitamin_c,wiki,color
str,str,i64,f64,f64,i64,i64,f64,f64,f64,f64,i64,i64,i64,str,str
"""Apple""","""fruit""",52,0.1,0.028,0,1,13.81,2.4,10.39,0.26,1,1,8,"""apple""","""red"""
"""Asparagus""","""vegetable""",20,0.1,0.046,0,2,3.88,2.1,1.88,2.2,12,15,9,"""asparagus""","""green"""
"""Avocado""","""fruit""",160,14.6,2.126,0,7,8.53,6.7,0.66,2.0,3,3,17,"""avocado""","""green"""
"""Banana""","""fruit""",89,0.3,0.112,0,1,22.84,2.6,12.23,1.09,1,1,15,"""banana""","""yellow"""
"""Chickpea""","""grains""",180,2.9,0.309,0,243,29.98,8.6,5.29,9.54,17,0,3,"""chickpea""","""brown"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Tomato""","""vegetable""",18,0.0,0.046,0,5,3.92,1.2,2.63,0.88,2,17,21,"""tomato""","""red"""
"""Tuna""","""fish""",153,3.9,0.811,53,366,0.41,0.0,0.09,27.3,5,3,4,"""tuna""","""red"""
"""Turkey""","""meat""",187,7.0,1.999,77,69,0.0,0.0,0.0,28.9,8,0,0,"""turkey_(bird)""","""white"""
"""Potato""","""vegetable""",104,2.0,0.458,0,254,19.36,1.7,0.82,1.66,2,2,12,"""potato""","""white"""


What kinds of information in this dataset are stored as numbers?

**Answer**: The variables for calories, total_fat, sat_fat, cholesterol, sodium, carbs, fiber, sugar, protein, iron, vitamin_a, and vitamin_c are all stored as numbers.

What variables in this dataset are stored as strings?

**Answer**: The variables for item, food_group, wiki, and color are stored as strings.

How many observations are there in the dataset?

**Answer**: 61

Can you figure out how the rows of the data are ordered?

**Answer**: Alphabetical order by item name.

Throughout this book, we will want to apply data manipulation, visualization, and modeling functions to data. Whenever possible, which is most of the time, we will write these as *method chains* in a very specific format. The first line of a code block starts with an opening parenthesis and the final line has the closing parenthesis. Inside, each step of the analysis goes on its own line with four leading spaces (you should be able to hit tab in Colab to get this). The dataset also goes on its own line.

In the code below, create a method chain that starts with `food` (on its own line), applies the method `.head(10)` to grab the first ten rows, and then applies the method `.tail(3)` to grab the last three rows of the remaining data. Hint: The code should have five total lines.

In [15]:
(
    food
    .head(10)
    .tail(3)
)

item,food_group,calories,total_fat,sat_fat,cholesterol,sodium,carbs,fiber,sugar,protein,iron,vitamin_a,vitamin_c,wiki,color
str,str,i64,f64,f64,i64,i64,f64,f64,f64,f64,i64,i64,i64,str,str
"""Bell Pepper""","""vegetable""",26,0.0,0.059,0,2,6.03,2.0,4.2,0.99,2,63,317,"""bell_pepper""","""green"""
"""Crab""","""fish""",87,1.0,0.222,78,293,0.04,0.0,0.0,18.06,4,0,5,"""callinectes_sapidus""","""red"""
"""Broccoli""","""vegetable""",34,0.3,0.039,0,33,6.64,2.6,1.7,2.82,4,12,149,"""broccoli""","""green"""


What rows, relative to the original dataset, are returned by the sequence of steps above?

**Answer**: The last 3 rows of the first 10, i.e. rows 8 through 10 of the original dataset when counting from 1 (zero-based indices 7 through 9).

Note that method chains never modify the original dataset. After running the code above, `food` still has all of the original data and the new version you printed out is no longer available. If we want to save a copy of the output, we need to add something like `new_name =` before the parenthesis on the first line of the code. When we do this, the original dataset will remain but we have a copy in the `new_name` object. In the code below, repeat the steps we did above with `food` but create a new object `food_sub` that consists of the subset of three rows.

In [19]:
food_sub = (
    food
    .head(10)
    .tail(3)
)

Notice that we do not get to see the output of the operation above. Python does not print out the result because we saved it as a new variable. To see the result, we need to write the name of the dataset on an extra line of its own at the end of the code block. This, for example, is how we printed the `food` dataset in the first question above. Do this below, recreating `food_sub` and then including the name `food_sub` on its own line to see the data.

In [18]:
food_sub

item,food_group,calories,total_fat,sat_fat,cholesterol,sodium,carbs,fiber,sugar,protein,iron,vitamin_a,vitamin_c,wiki,color
str,str,i64,f64,f64,i64,i64,f64,f64,f64,f64,i64,i64,i64,str,str
"""Bell Pepper""","""vegetable""",26,0.0,0.059,0,2,6.03,2.0,4.2,0.99,2,63,317,"""bell_pepper""","""green"""
"""Crab""","""fish""",87,1.0,0.222,78,293,0.04,0.0,0.0,18.06,4,0,5,"""callinectes_sapidus""","""red"""
"""Broccoli""","""vegetable""",34,0.3,0.039,0,33,6.64,2.6,1.7,2.82,4,12,149,"""broccoli""","""green"""


In these notebooks, please only save the output with a new object name if specifically instructed to do so. When asked to create a new object, include the object name as an extra line at the end so that we can also see and understand the results.