# Python Lists

## Creating

In [64]:
# this is a python list
a = [42, 7, 13, 24601, 2001, 3.50]

In [65]:
# this is a list comprehension -- think of it as a compact for loop

# the following gives us a list in which we multiplied each element in a by 2
z = [i * 2 for i in a]
z

[84, 14, 26, 49202, 4002, 7.0]
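The comprehension above is shorthand for an ordinary loop. Here is a minimal sketch of the equivalence, reusing the values of `a`; the names `doubled_loop`, `doubled_comp`, and `big_only` are only for illustration.

```python
a = [42, 7, 13, 24601, 2001, 3.50]

# explicit for loop: build the list one element at a time
doubled_loop = []
for i in a:
    doubled_loop.append(i * 2)

# list comprehension: the same result in a single expression
doubled_comp = [i * 2 for i in a]

# a comprehension can also filter with a trailing "if"
big_only = [i for i in a if i > 1000]

print(doubled_loop == doubled_comp)  # True
print(big_only)                      # [24601, 2001]
```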

## Indexing

In [66]:
# you can index into it
a[0]

42

In [67]:
# what's the 3rd element?
# (indices start at 0, so the 3rd element lives at index 2)
a[2]

13

In [0]:
# indices can also be negative
# this gives you the last element
a[-1]

## Slicing

In [0]:
# you can also get subsets of the list with slicing
#     a[start:end]
# [start, end)

# this returns the 3rd and 4th entries (indices 2 and 3 -- note we exclude 4!)
a[2:4]

In [0]:
# if you leave one side blank, it automatically goes all the way
# first five:
a[:5]

In [0]:
# how do you get the last three elements?
a[len(a) - 3:]

In [0]:
# slices can also skip numbers
# a[start:end:interval]

# this gives us every other number, starting with the first
a[::2]

In [0]:
# the interval can also be negative
# what does that do?

a[::-2]
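To check the slicing rules in this section, here is a short sketch that runs each kind of slice on the same list. The variable names are only illustrative.

```python
a = [42, 7, 13, 24601, 2001, 3.50]

# a negative start counts from the end, so this is the last three elements
last_three = a[-3:]          # same as a[len(a) - 3:]

# a step of 2 takes indices 0, 2, 4
every_other = a[::2]

# a negative step walks backwards: indices 5, 3, 1
backwards_by_two = a[::-2]

# a step of -1 is a common way to get a reversed copy
reversed_copy = a[::-1]

print(last_three)        # [24601, 2001, 3.5]
print(every_other)       # [42, 13, 2001]
print(backwards_by_two)  # [3.5, 24601, 7]
print(reversed_copy)     # [3.5, 2001, 24601, 13, 7, 42]
```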

# Numpy

In [0]:
import numpy as np

## Creating

In [0]:
# numpy arrays can be created from a python list
b = np.array(a)
b

Right now, it looks an awful lot like a python list, but there are some key points you should know.

numpy arrays are:
- homogeneous (all elements in an array have the same type)
- multidimensional

In [0]:
# Homogeneous: all numpy arrays have an associated data type.
# numbers are usually ints or floats
b.dtype

In [0]:
# Multidimensional: numpy arrays can have multiple dimensions, like a nested list.
# We can reshape b into a 3x2 matrix
# Note: this doesn't change b. That's why we assign it to a new variable: m
m = b.reshape(3, 2)
m

In [0]:
# Each dimension is called an axis
# The size across each axis is called the shape
# These are two very important concepts!
m.shape
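Here is a small sketch of how dtype, shape, and reshape fit together, using the same values as `b`. The name `m2` and the `-1` shortcut are additions for illustration and are not used elsewhere in this notebook.

```python
import numpy as np

b = np.array([42, 7, 13, 24601, 2001, 3.50])

# mixing ints with one float makes the whole array float
print(b.dtype)   # float64

# reshape returns a new array object; b keeps its own shape
m = b.reshape(3, 2)
print(b.shape)   # (6,)
print(m.shape)   # (3, 2)
print(m.ndim)    # 2 (the number of axes)

# -1 asks numpy to work out that dimension from the total size
m2 = b.reshape(-1, 3)
print(m2.shape)  # (2, 3)
```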

## Indexing

In [0]:
# We index into numpy arrays much the same way as python lists.
b[0]

In [0]:
# But N-dimensional arrays mean we can be more expressive with indexing
# This gives us [0th index of axis 0, 1st index of axis 1]
# You can think of this as a grid
# Alternatively, this is like m[0][1]
m[0, 1]

In [0]:
# We can also pass in multiple indices as a list
# This gives us the 1st, 2nd, and 5th values of b
b[[0, 1, 4]]

In [0]:
# Let's combine these two facts to get the 2nd and 3rd items in the second column of m
m[[1,2], 1]

In [0]:
# We can also incorporate our previous knowledge of slices.
# So to get the second column
# This gives us the entire range on axis 0, and only index 1 (the second column) on axis 1
m[:,1]
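The indexing forms in this section are easier to follow on an array whose values are easy to predict. This sketch uses made-up arrays (`grid` and `vec` are hypothetical names) rather than the `b` and `m` from above.

```python
import numpy as np

# grid is [[0, 1],
#          [2, 3],
#          [4, 5]]
grid = np.arange(6).reshape(3, 2)
vec = np.array([10, 20, 30, 40, 50])

single = grid[0, 1]          # row 0, column 1
chained = grid[0][1]         # same element, indexed in two steps
picked = vec[[0, 1, 4]]      # a list of indices selects several items
rows_in_col = grid[[1, 2], 1]  # rows 1 and 2 of column 1
whole_col = grid[:, 1]       # every row of column 1

print(single, chained)  # 1 1
print(picked)           # [10 20 50]
print(rows_in_col)      # [3 5]
print(whole_col)        # [1 3 5]
```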

## Math

In [0]:
# numpy gives us a lot of math functions to work with
# I'll only show you a couple, but you can find them all in the documentation

np.sum(b)  # guess what this does?

In [0]:
np.mean(b)  # and this?

In [0]:
# for convenience, you can also call
b.mean()

In [0]:
# you can also apply these functions to only one axis
# sum within each row (read: apply the sum along axis 1, leaving one total per row)
np.sum(m, axis=1)
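One way to remember the `axis` argument is that it names the axis that disappears from the result. Here is a minimal sketch on a made-up 3x2 array (`grid` is a hypothetical name).

```python
import numpy as np

# grid is [[0, 1],
#          [2, 3],
#          [4, 5]]
grid = np.arange(6).reshape(3, 2)

total = grid.sum()             # no axis: sum of every element
row_totals = grid.sum(axis=1)  # axis 1 collapses, one value per row
col_totals = grid.sum(axis=0)  # axis 0 collapses, one value per column

print(total)       # 15
print(row_totals)  # [1 5 9]
print(col_totals)  # [6 9]
```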

In [0]:
# numpy has a concept called broadcasting
# It stretches smaller operands so that compatible but non-matching shapes line up.
# 2 is a scalar, but we can still multiply m by it
# it just repeats the 2 across all instances of m
m * 2
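NumPy's documentation calls this shape-matching behaviour broadcasting. It also works with arrays, not just scalars, as long as the shapes are compatible. This is a sketch on made-up data (`grid`, `per_column`, and `per_row` are hypothetical names).

```python
import numpy as np

# grid is [[0, 1],
#          [2, 3],
#          [4, 5]]
grid = np.arange(6).reshape(3, 2)

# a scalar is repeated for every element
doubled = grid * 2

# shape (2,) matches the last axis, so it is repeated for every row
per_column = grid + np.array([10, 100])

# shape (3, 1) matches the first axis, so it is repeated for every column
per_row = grid + np.array([[1], [2], [3]])

# shape (3,) does not line up with the last axis (size 2), so this fails
try:
    grid + np.array([1, 2, 3])
    failed = False
except ValueError:
    failed = True

print(doubled.tolist())     # [[0, 2], [4, 6], [8, 10]]
print(per_column.tolist())  # [[10, 101], [12, 103], [14, 105]]
print(per_row.tolist())     # [[1, 2], [4, 5], [7, 8]]
print(failed)               # True
```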

# Pandas

In [0]:
import pandas as pd

## Creating

Pandas lets us read all sorts of data into a DataFrame. Think of this as a collection of named lists, one per column. Let's look at an example.

In [0]:
df = pd.read_csv("./cereal.csv")
type(df)

In [0]:
# head() gives us the first 5 rows in the dataframe (pd.DataFrame) by default
df.head()

In [0]:
# you can think of each column as a list (or a 1D numpy array)
# in practice, these are called pandas Series (pd.Series)
# you can index into the dataframe with a string to get one column
df["name"]

In [0]:
type(df["name"])

## Pandas Series vs Numpy Arrays

In [0]:
# There are many similarities between pd.Series and np.ndarray
# for example:
df["carbo"].mean()

In [0]:
# In fact, we can turn pd.Series into a numpy array
# again, this returns a numpy array -- df["carbo"] doesn't change.
df["carbo"].to_numpy()

In [0]:
# The key difference is that Series are indexed
# See the 0, 1, ... 76 on the left? That is the index of each item.
# Right now they are just positions, but theoretically they can be any unique identifier for the row
# Think: ID, username, etc
df["carbo"].index
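To see what a non-positional index looks like, here is a small sketch with a Series built by hand. The values and labels are made up and are not taken from `cereal.csv`.

```python
import pandas as pd

# hypothetical carbohydrate values, labelled by a made-up cereal id
carbs = pd.Series([5.0, 8.0, 7.0], index=["bran", "oats", "corn"])

print(carbs.index)       # the labels, not positions
print(carbs["oats"])     # look up a value by its label
print(carbs.to_numpy())  # the underlying values, with the index dropped
print(carbs.mean())      # aggregate methods work the same way as on numpy arrays
```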

## Indexing into DataFrames and Series

In [0]:
# Indexing is a little bit different in pandas.
# One parallel to what you've been used to is .loc[]
# this is the row at index 0
df.loc[0]

In [0]:
# multiple indices work
df.loc[[1, 2, 3]]

In [0]:
# caveat: remember that pandas doesn't require zero-indexing. indices can be anything.
# this means slicing might not work all the time (what would df.loc["asdf":"hjkl"] even mean?)
# in the cases that you actually want to index by row number, you can always do that with .iloc[]
# again, this will behave the same as .loc[] with our dataset because our data is 0-indexed
df.iloc[0]
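The difference between `.loc[]` and `.iloc[]` only shows up when the index is not 0, 1, 2, and so on. This sketch uses a made-up DataFrame (`toy`) whose index labels are 10, 20, 30.

```python
import pandas as pd

toy = pd.DataFrame(
    {"name": ["A", "B", "C"], "protein": [4, 3, 2]},
    index=[10, 20, 30],  # labels that are not row positions
)

by_label = toy.loc[10]     # the row whose label is 10
by_position = toy.iloc[0]  # the row at position 0 (the same row here)

# there is no row labelled 0, so .loc[0] raises a KeyError
try:
    toy.loc[0]
    missing_label = False
except KeyError:
    missing_label = True

# label slices include both endpoints; position slices exclude the end
label_slice = toy.loc[10:20]  # rows labelled 10 and 20
pos_slice = toy.iloc[0:1]     # only the row at position 0

print(by_label.equals(by_position))     # True
print(missing_label)                    # True
print(len(label_slice), len(pos_slice)) # 2 1
```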

In [0]:
# We can also use boolean indexing by passing a list of booleans like so:
df[[True] + [False] * 76]
# Let me explain:
# - [True] + [False] * 76 gives us a list that looks like [True, False, ..., False] with 1 True and 76 Falses
# - This matches the number of rows in our data (77)
# - pandas returns all the rows with a corresponding True (in this case, only the first one)

In [0]:
# This is powerful because we can also make comparisons with Series and values.
df["protein"] > 3

In [0]:
# Combining these two things, we have a very expressive way of filtering.
# This gives us all the rows in which the protein is greater than 3.
df[df["protein"] > 3]
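Boolean Series can be combined before filtering. Here is a minimal sketch on made-up data (`toy` and its values are hypothetical). Note that pandas uses `&`, `|`, and `~` here rather than `and`, `or`, and `not`, and each comparison needs its own parentheses when written inline.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "protein": [4, 3, 2, 6],
    "sugars": [6, 8, 5, 14],
})

high_protein = toy["protein"] > 3  # True, False, False, True
low_sugar = toy["sugars"] < 10     # True, True, True, False

both = toy[high_protein & low_sugar]    # rows where both conditions hold
either = toy[high_protein | low_sugar]  # rows where at least one holds
not_high = toy[~high_protein]           # rows where the condition is False

# the same filter written inline, with parentheses around each comparison
inline = toy[(toy["protein"] > 3) & (toy["sugars"] < 10)]

print(both["name"].tolist())      # ['A']
print(either["name"].tolist())    # ['A', 'B', 'C', 'D']
print(not_high["name"].tolist())  # ['B', 'C']
```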

## Manipulating Series

Often when we're preprocessing data, we want to make uniform changes to a specific column. We can do this by applying functions.

In [0]:
# Suppose we want to make the cereals more appetizing.
# Let's add "Delicious " to the beginning of every name.

# The pattern is we define a function for a single entry
def make_delicious(name):
    return "Delicious " + name

# and then call apply on the series to apply the function to each element in the series
df["name"].apply(make_delicious)

In [0]:
# this returns the changes, but doesn't apply them in place.
# that means on our original dataframe, the cereals are still bland
df.head()

In [0]:
# we can fix this by assigning the new names to the column.
df["name"] = df["name"].apply(make_delicious)
df.head()

In [0]:
# here's another example.
# Jackson is a skeptic and doesn't believe calling things "Delicious" makes them taste better.
# But he does think adding sugar will make them taste better.
# How can we add 10 grams of sugar to every cereal?
df["sugars"] = df["sugars"].apply(lambda sug: 10 + sug)
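For simple arithmetic or string concatenation like the two examples above, a Series also supports the operators directly, which is usually shorter than `apply`. This is a sketch on made-up data (`toy` is hypothetical), showing that both routes give the same result.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Bran Flakes", "Corn Pops"],
    "sugars": [5, 12],
})

# element-by-element with apply, as in the cells above
via_apply = toy["sugars"].apply(lambda sug: 10 + sug)

# the same thing with plain arithmetic on the whole column
via_arith = toy["sugars"] + 10

# string concatenation works on a whole column of strings too
delicious = "Delicious " + toy["name"]

print(via_apply.equals(via_arith))  # True
print(delicious.tolist())           # ['Delicious Bran Flakes', 'Delicious Corn Pops']
```

`apply` is still the tool to reach for when the per-element logic does not fit a single operator.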


## Groups and Aggregates

When we have lots and lots of data, it's more useful to look at aggregate statistics like the mean or median. But sometimes we lose too much detail aggregating across the whole dataset.

The solution is to aggregate across groups. For example, maybe we're less interested in the mean calorie count of all cereals and more interested in the mean for each manufacturer.

In [0]:
# First, we can see how many (and which) unique manufacturers there are
# Note: this gives us a numpy array
df["mfr"].unique()

In [0]:
# Now let's group by the manufacturers
# This gives us a groupby object across the dataframe
mfrs = df.groupby("mfr")
mfrs

In [0]:
# what happens if we try to access the calories column?
mfrs["calories"]

In [0]:
# now let's try to get the mean
mfrs["calories"].mean()

In [0]:
# we can also aggregate across multiple columns, and even use different aggregations
# let's get the average calorie count but the maximum protein
mfrs[["calories", "protein"]].agg({"calories": "mean", "protein": "max"})
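The same group-then-aggregate pattern can be checked by hand on a tiny table. The data below is made up (`toy` and the manufacturer codes are hypothetical), so the numbers are not from `cereal.csv`.

```python
import pandas as pd

toy = pd.DataFrame({
    "mfr": ["K", "K", "G", "G", "G"],
    "calories": [100, 140, 110, 90, 130],
    "protein": [3, 2, 6, 1, 2],
})

groups = toy.groupby("mfr")

# one row per manufacturer: mean calories and maximum protein
summary = groups.agg({"calories": "mean", "protein": "max"})

# how many rows fell into each group
counts = groups.size()

print(summary)
print(counts)
```

Here `summary` has manufacturer `G` with mean calories 110.0 and max protein 6, and `K` with mean calories 120.0 and max protein 3.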

# Exercises

Unless otherwise noted, these should be one line of code.

In [133]:
# here is a Python list:

a = [1, 2, 3, 4, 5, 6]

# get a list containing the last 3 elements of a
b = a[len(a) - 3:]
# reverse the list
c = a[::-1]
# get a list where each entry in a is cubed (so the new list is [1, 8, 27, 64, 125, 216])
d = [x ** 3 for x in a]

In [135]:
# create a numpy array from this list
b = np.array(a)

In [0]:
# find the mean of b
b.mean()

In [142]:
# change b from a length-6 array to a 2x3 matrix
m = b.reshape(2, 3)
m

array([[1, 2, 3],
       [4, 5, 6]])

In [137]:
# find the mean value of each row
m.mean(axis=1)

array([2., 5.])

In [138]:
# find the mean value of each column
m.mean(axis=0)

array([2.5, 3.5, 4.5])

In [139]:
# find the third column of m
m[:, 2]

array([3, 6])

In [143]:
# get an array where each entry in b is cubed (so the new numpy array is [1, 8, 27, 64, 125, 216])
# use a different (numpy-specific) approach: arithmetic operators apply to every element
b ** 3

array([  1,   8,  27,  64, 125, 216])

In [145]:
# load in the "starbucks.csv" dataset
fu = pd.read_csv("starbucks.csv")
fu

Unnamed: 0,Beverage_category,Beverage,Beverage_prep,Calories,Total Fat (g),Trans Fat (g),Saturated Fat (g),Sodium (mg),Total Carbohydrates (g),Cholesterol (mg),Dietary Fibre (g),Sugars (g),Protein (g),Vitamin A (% DV),Vitamin C (% DV),Calcium (% DV),Iron (% DV),Caffeine (mg)
0,Coffee,Brewed Coffee,Short,3,0.1,0.0,0.0,0,5,0,0,0,0.3,0%,0%,0%,0%,175
1,Coffee,Brewed Coffee,Tall,4,0.1,0.0,0.0,0,10,0,0,0,0.5,0%,0%,0%,0%,260
2,Coffee,Brewed Coffee,Grande,5,0.1,0.0,0.0,0,10,0,0,0,1.0,0%,0%,0%,0%,330
3,Coffee,Brewed Coffee,Venti,5,0.1,0.0,0.0,0,10,0,0,0,1.0,0%,0%,2%,0%,410
4,Classic Espresso Drinks,Caffè Latte,Short Nonfat Milk,70,0.1,0.1,0.0,5,75,10,0,9,6.0,10%,0%,20%,0%,75
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
237,Frappuccino® Blended Crème,Strawberries & Crème (Without Whipped Cream),Soymilk,320,3 2,0.4,0.0,0,250,67,1,64,5.0,6%,8%,20%,10%,0
238,Frappuccino® Blended Crème,Vanilla Bean (Without Whipped Cream),Tall Nonfat Milk,170,0.1,0.1,0.0,0,160,39,0,38,4.0,6%,0%,10%,0%,0
239,Frappuccino® Blended Crème,Vanilla Bean (Without Whipped Cream),Whole Milk,200,3.5,2.0,0.1,10,160,39,0,38,3.0,6%,0%,10%,0%,0
240,Frappuccino® Blended Crème,Vanilla Bean (Without Whipped Cream),Soymilk,180,1.5,0.2,0.0,0,160,37,1,35,3.0,4%,0%,10%,6%,0


In [147]:
# this is nutritional info for starbucks items
# let's see if we can answer some questions

# what is the average # calories across all items?
fu.columns
fu["Calories"].mean()

193.87190082644628

In [149]:
# how many different categories of beverages are there?
# unique() lists the categories; nunique() would return just the count
fu["Beverage_category"].unique()

array(['Coffee', 'Classic Espresso Drinks', 'Signature Espresso Drinks',
       'Tazo® Tea Drinks', 'Shaken Iced Beverages', 'Smoothies',
       'Frappuccino® Blended Coffee', 'Frappuccino® Light Blended Coffee',
       'Frappuccino® Blended Crème'], dtype=object)

In [150]:
# what is the average # calories for each beverage category?
fu.groupby("Beverage_category")["Calories"].mean()

Beverage_category
Classic Espresso Drinks              140.172414
Coffee                                 4.250000
Frappuccino® Blended Coffee          276.944444
Frappuccino® Blended Crème           233.076923
Frappuccino® Light Blended Coffee    162.500000
Shaken Iced Beverages                114.444444
Signature Espresso Drinks            250.000000
Smoothies                            282.222222
Tazo® Tea Drinks                     177.307692
Name: Calories, dtype: float64

In [152]:
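# what beverage preparation includes the most sugar?
# this gives the maximum sugar for each preparation; the row with the largest value is the answer
# note: the column name in this dataset starts with a space
fu.groupby("Beverage_prep").agg({' Sugars (g)': "max"})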

Beverage_prep,Sugars (g)
2% Milk,74
Doppio,0
Grande,65
Grande Nonfat Milk,62
Short,33
Short Nonfat Milk,29
Solo,0
Soymilk,80
Tall,49
Tall Nonfat Milk,45
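The table above still leaves the reader to spot the largest value. One possible way to get the answer directly is `idxmax()` or a sort. This is a sketch on made-up rows: `toy` is hypothetical, and its column name has no leading space, unlike the real dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Beverage_prep": ["Short", "Tall", "Soymilk", "Soymilk", "Tall"],
    "Sugars (g)": [33, 49, 80, 12, 20],
})

# maximum sugar within each preparation (a Series indexed by preparation)
max_by_prep = toy.groupby("Beverage_prep")["Sugars (g)"].max()

# idxmax returns the index label of the largest value
top_prep = max_by_prep.idxmax()

# sorting shows the full ranking instead of only the winner
ranked = max_by_prep.sort_values(ascending=False)

print(top_prep)  # Soymilk
print(ranked)
```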


In [164]:
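# what is the average % daily value calcium content for each beverage?
# HINT: make sure your columns have the datatypes you want
# (you can use more than one line for this one)
# the column holds strings like "20%", so strip the "%" and convert to float first
# (run the conversion only once; a second run fails because the column no longer holds strings)
fu[' Calcium (% DV) '] = fu[' Calcium (% DV) '].str.rstrip('%').astype(float)
fu.groupby("Beverage")[' Calcium (% DV) '].mean()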

Beverage
Banana Chocolate Smoothie                              20.000000
Brewed Coffee                                           0.500000
Caffè Americano                                         1.500000
Caffè Latte                                            35.000000
Caffè Mocha (Without Whipped Cream)                    30.000000
Cappuccino                                             22.500000
Caramel                                                11.000000
Caramel (Without Whipped Cream)                        12.000000
Caramel Apple Spice (Without Whipped Cream)             0.000000
Caramel Macchiato                                      28.333333
Coffee                                                 12.333333
Espresso                                                0.000000
Hot Chocolate (Without Whipped Cream)                  35.000000
Iced Brewed Coffee (With Classic Syrup)                 0.000000
Iced Brewed Coffee (With Milk & Classic Syrup)          8.000000
Java Chip       
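Here is a self-contained sketch of the datatype hint above, on made-up rows (`toy` is hypothetical). It also strips the stray spaces from the column names, which is an optional extra step and not something the cells above do.

```python
import pandas as pd

toy = pd.DataFrame({
    "Beverage": ["Latte", "Latte", "Coffee"],
    " Calcium (% DV) ": ["20%", "50%", "0%"],  # strings, with padded header
})

# optional: remove leading and trailing spaces from every column name
toy.columns = toy.columns.str.strip()

# "20%" -> "20" -> 20.0
toy["Calcium (% DV)"] = toy["Calcium (% DV)"].str.rstrip("%").astype(float)

means = toy.groupby("Beverage")["Calcium (% DV)"].mean()

print(toy.dtypes)
print(means)
```

In this toy table `Latte` averages 35.0 and `Coffee` averages 0.0.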

In [167]:
# It's bulking season. What drink should Jackson get so that he maximizes protein but minimizes fat?
# (you can use more than one line for this one)
fu.groupby("Beverage").agg({
    ' Protein (g) ': "max",
    ' Total Fat (g)': "min",
    'Trans Fat (g) ': "min",
    'Saturated Fat (g)': "min",
})

Beverage,Protein (g),Total Fat (g),Trans Fat (g),Saturated Fat (g)
Banana Chocolate Smoothie,20.0,2.5,1.5,0.0
Brewed Coffee,1.0,0.1,0.0,0.0
Caffè Americano,1.0,0.0,0.0,0.0
Caffè Latte,16.0,0.1,0.1,0.0
Caffè Mocha (Without Whipped Cream),17.0,1.5,1.0,0.0
Cappuccino,10.0,0.1,0.1,0.0
Caramel,5.0,0.1,0.0,0.0
Caramel (Without Whipped Cream),5.0,0.1,0.0,0.0
Caramel Apple Spice (Without Whipped Cream),0.0,0.0,0.0,0.0
Caramel Macchiato,13.0,1.0,0.5,0.0
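"Maximize protein but minimize fat" can be read several ways. One possible reading is to rank drinks by protein per gram of fat. The sketch below uses made-up rows and a hypothetical `toy` table with simplified column names. It also shows `pd.to_numeric(..., errors="coerce")`, which turns malformed entries such as the `3 2` visible in the table printout earlier into `NaN`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Beverage": ["Latte", "Mocha", "Smoothie", "Creme"],
    "protein": [16.0, 17.0, 20.0, 5.0],
    "fat": ["0.1", "1.5", "2.5", "3 2"],  # strings; the last one is malformed
}).set_index("Beverage")

# convert to numbers; anything that cannot be parsed becomes NaN
toy["fat"] = pd.to_numeric(toy["fat"], errors="coerce")

# an illustrative score: grams of protein per gram of fat
# (a drink with exactly 0 fat would score infinity, so treat this as a rough guide)
score = toy["protein"] / toy["fat"]

# idxmax skips NaN values by default
best = score.idxmax()

print(toy)
print(best)  # Latte
```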
