# Agenda

- Series (1d data structures)
- Indexes
- Series methods
- `nan` and Pandas
- Data frames (2d data structures)

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Apache Arrow

Apache Arrow is a high-speed, in-memory columnar data format (not a database in the usual sense) optimized for the kinds of things we do with NumPy and Pandas. It has ways to work with Pandas, R, and Spark (among other things). It also has on-disk file formats, such as Feather, that you can use to save and retrieve data.

As of Pandas 2.0 (released in April 2023), you can use Arrow instead of NumPy in the back end.

For some operations (such as searching), it's *far* faster than NumPy.

For other operations (such as grouping), it's actually much slower.
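
Here's a minimal sketch of opting into the Arrow back end for a single series. It assumes Pandas 2.0 or later and that the optional `pyarrow` package is installed; the names `arrow_s` and `arrow_dtype` are just for illustration.

```python
import pandas as pd

try:
    import pyarrow  # noqa: F401  (optional dependency, needed for Arrow dtypes)
except ImportError:
    # without pyarrow there is no Arrow back end to switch to
    arrow_s = None
    arrow_dtype = None
else:
    # the "[pyarrow]" suffix asks Pandas to store the values in an
    # Arrow array instead of a NumPy array
    arrow_s = pd.Series([10, 20, 30, 40, 50], dtype="int64[pyarrow]")
    arrow_dtype = str(arrow_s.dtype)

print(arrow_dtype)
```

Everything else in these notes uses the default NumPy back end.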



In [3]:
# let's create a series!

# this is similar to creating a 1D NumPy array
# we pass a list of Python objects (or a NumPy array), and we get a series out of it

s = Series([10, 20, 30, 40, 50])

In [4]:
type(s)

pandas.core.series.Series

In [6]:
# when we look at the representation of s,
# we see the index (along the left side) and the 
# values (on the right side).

# we also see the dtype -- just like NumPy arrays, series have dtypes

s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [8]:
# let's look at the values in our series

s.values   # it's a NumPy array!

array([10, 20, 30, 40, 50])
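
As a small aside, the same NumPy array is also available through the `to_numpy()` method. This sketch rebuilds `s` so that it runs on its own; `arr` is just an illustrative name.

```python
import numpy as np
from pandas import Series

s = Series([10, 20, 30, 40, 50])

arr = s.to_numpy()     # same data as s.values, via a method call
print(type(arr))       # a NumPy ndarray
print(arr * 2)         # so NumPy operations work on it directly
```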

In [9]:
# because we have NumPy arrays in the back end, many of the things
# we've learned about NumPy also apply to Pandas series

s.mean()

30.0

In [10]:
s.std()

15.811388300841896
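
One place where Pandas and NumPy differ by default is the standard deviation: `Series.std` uses `ddof=1` (sample standard deviation), while `np.std` uses `ddof=0` (population standard deviation). A short sketch of the difference, with variable names chosen only for illustration:

```python
import numpy as np
from pandas import Series

s = Series([10, 20, 30, 40, 50])

pandas_std = s.std()               # sample standard deviation (ddof=1)
numpy_std = np.std(s.to_numpy())   # population standard deviation (ddof=0)
matched_std = s.std(ddof=0)        # ask Pandas for NumPy's default

print(pandas_std, numpy_std, matched_std)
```

If you compare results between the two libraries, passing `ddof` explicitly avoids surprises.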

In [11]:
s.min()

10

In [12]:
s.max()

50

In [13]:
s.sum()

150

In [14]:
s.count()   # how many non-NaN values are there?

5
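
Our series has no `NaN` values, so `count` and `len` agree here. A small sketch with a made-up series (`with_gap` is hypothetical data, not part of the example above) shows where they differ:

```python
import numpy as np
from pandas import Series

with_gap = Series([10, np.nan, 30, np.nan, 50])

print(len(with_gap))       # 5: every element, NaN included
print(with_gap.count())    # 3: only the non-NaN values
print(with_gap.mean())     # 30.0: NaN values are skipped by default
```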

In [15]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [16]:
# I can retrieve with []  -- but we'll talk about this in a moment
s[2]

30

In [17]:
# fancy indexing
s[[2, 4]]

2    30
4    50
dtype: int64

In [18]:
# if I get a series back, then I can run any method I'd like
s[[2, 4]].mean()

40.0

# Indexes on series

In NumPy, we're stuck with the same sort of numeric index as many other objects in Python use -- starting at 0, and going up to len-1.

However, in Pandas, we can use any sort of data as our index, sort of like how it's done with dictionaries. This means that our series can be much more expressive.
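
For instance, here's a minimal sketch of a series with strings as its index. The data and the name `totals` are made up for illustration.

```python
from pandas import Series

# hypothetical data: monthly totals, indexed by month name
totals = Series([10, 20, 30], index=["jan", "feb", "mar"])

print(totals.loc["feb"])                  # 20
print(totals.loc[["jan", "mar"]].sum())   # 40
```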

In [19]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [20]:
# we can always replace the existing index by assigning
# to s.index. The new index must be a list or NumPy array
# with the same length as the series

s.index

RangeIndex(start=0, stop=5, step=1)

In [21]:
s.index = [50, 51, 203, 62, 18]
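
To see why the length matters, here's a sketch of what happens if we assign an index with too few labels. It rebuilds `s` with its default index so that it runs on its own; `error_message` is just an illustrative name.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50])

try:
    s.index = [50, 51, 203]   # three labels for five values
except ValueError as e:
    error_message = str(e)
    print(error_message)
else:
    error_message = None

print(len(s.index))   # the original five-element index is still in place
```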

In [23]:
s  # look at the index now!

50     10
51     20
203    30
62     40
18     50
dtype: int64

In [24]:
# how do I retrieve from our series now?
# the answer -- two different special accessors, .loc and .iloc
# you can still use [] to retrieve via the index, BUT DON'T DO THAT!

In [25]:
s[51]  # what's wrong with this?

20
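
One way to see the problem: with an integer index, `[]` looks up labels, not positions. Someone reading `s[0]` and expecting the first element gets a `KeyError` instead. A sketch, rebuilding `s` with the index we assigned above (`lookup_failed` is an illustrative name):

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50], index=[50, 51, 203, 62, 18])

print(s[51])       # 20: found by label, not by position

try:
    s[0]           # there is no label 0 in this index
except KeyError:
    lookup_failed = True
    print("KeyError: 0 is not in the index")
else:
    lookup_failed = False
```

Because `[]` doesn't say whether you mean a label or a position, the explicit accessors are easier to read.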

In [26]:
# if I want to retrieve via the index, I can use .loc
# note that you use [] with .loc, not ()

s.loc[51]  # retrieve the item with index 51

20

In [27]:
s.loc[62]

40

In [28]:
s.loc[[51, 62]]   # retrieve the items at indexes 51 and 62

51    20
62    40
dtype: int64

In [29]:
s.loc[[51, 62]].mean()   # get the mean of the items at indexes 51 + 62

30.0
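
The positional counterpart to `.loc` is `.iloc`, mentioned earlier. Here's a brief sketch of the two side by side, rebuilding `s` so that it runs on its own:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50], index=[50, 51, 203, 62, 18])

print(s.loc[51])     # 20: the item whose index label is 51
print(s.iloc[1])     # 20: the item at position 1, whatever its label

# positions 1 and 3 hold 20 and 40, the same items as labels 51 and 62
print(s.iloc[[1, 3]].mean())   # 30.0
```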