# 1.2 🐼 Pandas Core Concepts - `pandas.Series`

NumPy is fairly low level. We want to manipulate our data in more descriptive ways, and pandas lets us do just that. It extends the NumPy concepts above by adding labels to our data in the form of indices.


In [8]:
# It's convention to import numpy as 'np'.
import numpy as np
# It's customary to import pandas as pd
import pandas as pd

In [158]:
# Let's create a series that holds the price of some fruit
s = pd.Series(
    [0.50, 2.10, 0.50, 0.19, 1.29], index=['grapefruit','watermelon','apple','banana','starfruit'], name='fruits'
)

display(s)

grapefruit    0.50
watermelon    2.10
apple         0.50
banana        0.19
starfruit     1.29
Name: fruits, dtype: float64

The index of a series is a sequence of values that label each data point in that series. The labels can be of any hashable datatype: strings and integers are the most common, along with datetime indices for time series data.

A series can optionally have a name that labels what the data represents.

In [144]:
display(s.index)

Index(['grapefruit', 'watermelon', 'apple', 'banana', 'starfruit'], dtype='object')

In [159]:
display(s.values)

array([0.5 , 2.1 , 0.5 , 0.19, 1.29])

In [160]:
display(s.name)

'fruits'
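
The index does not have to be made of strings. As a minimal sketch, the cell below builds one series with integer labels and one with a datetime index. The names `by_id` and `by_day`, the ID numbers, and the dates are made up for illustration.

```python
import pandas as pd

# Integer labels: these are labels, not positions.
by_id = pd.Series([0.50, 2.10], index=[101, 205], name='price')

# Datetime labels, built with pd.date_range (one entry per day).
by_day = pd.Series(
    [0.50, 0.55, 0.52],
    index=pd.date_range('2024-01-01', periods=3, freq='D'),
    name='apple_price',
)

# .loc looks values up by label in both cases.
print(by_id.loc[205])
print(by_day.loc[pd.Timestamp('2024-01-02')])
print(by_day.index)
```

Note that `by_id.loc[205]` works even though the series only has two entries, because `205` is a label rather than a position.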

### Accessing Series Values

We can access the values in our series in a variety of ways. 

 - With its label, using the standard dictionary-like accessor, e.g. `['foo']`
 - With its label, using the `.loc` accessor
 - With its integer position, using the `.iloc` accessor

In [146]:
s['apple']

0.5

In [147]:
s.loc['watermelon']

2.1

In [148]:
s.iloc[3] # Remember arrays are zero based.

0.19

At first the `.loc` accessor might seem redundant, but it allows us to do some powerful things.


In [149]:
# Slice along the axis (in the order of the axis - not alphabetical order unless you sort the index first)
s.loc['watermelon':'banana']

watermelon    2.10
apple         0.50
banana        0.19
dtype: float64

In [150]:
# Supply a list of True or False values to return only the entries marked True.
s.loc[[False, True, False, True, True]]

watermelon    2.10
banana        0.19
starfruit     1.29
dtype: float64
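
In practice the list of booleans is rarely typed by hand. It usually comes from a comparison on the series itself. The sketch below rebuilds the fruit series so it runs on its own. The names `mask`, `pricey`, `by_label` and `by_position` are illustrative. It also shows that a label slice with `.loc` includes both endpoints, while a positional slice with `.iloc` excludes the stop position.

```python
import pandas as pd

s = pd.Series(
    [0.50, 2.10, 0.50, 0.19, 1.29],
    index=['grapefruit', 'watermelon', 'apple', 'banana', 'starfruit'],
    name='fruits',
)

# A comparison returns a boolean Series with the same index as s.
mask = s > 1.0

# Passing that mask to .loc keeps only the entries where it is True.
pricey = s.loc[mask]
print(pricey)

# Label slices include both endpoints; positional slices exclude the stop.
by_label = s.loc['watermelon':'banana']   # watermelon, apple, banana
by_position = s.iloc[1:3]                 # watermelon, apple
print(len(by_label), len(by_position))
```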

### Series Operations
We can do the same kinds of operations on a Series as we could on a numpy array.


In [151]:
# Operations where one input is a scalar are 'broadcast' to the whole series.
s + 1

grapefruit    1.50
watermelon    3.10
apple         1.50
banana        1.19
starfruit     2.29
dtype: float64

In [152]:
# Pandas series share many of the aggregate functions of a numpy array
s.mean()

0.916
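
Operations between two series work element by element as well, with values matched up by label. The sketch below assumes a hypothetical `quantity` series (the counts are invented for illustration) and also shows a few more aggregates.

```python
import pandas as pd

s = pd.Series(
    [0.50, 2.10, 0.50, 0.19, 1.29],
    index=['grapefruit', 'watermelon', 'apple', 'banana', 'starfruit'],
    name='fruits',
)

# Hypothetical quantities, deliberately listed in a different order.
quantity = pd.Series(
    [12, 6, 2, 3, 1],
    index=['banana', 'apple', 'grapefruit', 'starfruit', 'watermelon'],
)

# Multiplication pairs values by label, not by position.
total = s * quantity
print(total)

# Aggregates, plus the labels where the extremes occur.
print(s.max(), s.idxmax())
print(s.min(), s.idxmin())
```

Because the values are paired by label, the order in which `quantity` lists the fruit does not matter.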

### The `apply` method

The `apply` method is one of the more powerful concepts in pandas. It will loop over each value in the series (or DataFrame - see below) and apply a function to it.

In [153]:
# Decide which fruits are too expensive.

def is_expensive(price):
    if price > 1.0:
        return 'expensive'
    else:
        return 'cheap'
    
s.apply(is_expensive)

grapefruit        cheap
watermelon    expensive
apple             cheap
banana            cheap
starfruit     expensive
dtype: object

If you are familiar with Python's [lambda](https://www.w3schools.com/python/python_lambda.asp) syntax for anonymous functions (not to be confused with AWS Lambda), a handy shorthand is:

In [154]:
s.apply(lambda x: 'expensive' if x > 1 else 'cheap')

grapefruit        cheap
watermelon    expensive
apple             cheap
banana            cheap
starfruit     expensive
dtype: object

#### A note about applies

Earlier we noted that numpy has some nice fast methods for numeric aggregations and operations on arrays. This is true for pandas too, for example when we sum or add to a series. But applies are not vectorised: they call your Python function once per value, so they run slower than built-in methods like `.sum()` or operators like addition or multiplication.

This means you should be careful that applies on large datasets aren't slowing down your program. If you do need to run an apply on a large set of data, there are ways to make it faster: look into frameworks like https://dask.org/.

A lot of the time though doing an apply is just fine, and they're extremely useful.
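
When the logic is a simple condition, one possible alternative to an apply is `np.where`, which evaluates the condition on the whole array at once. This is only a sketch of one option, not the only way to speed things up. The names `labels_apply` and `labels_where` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [0.50, 2.10, 0.50, 0.19, 1.29],
    index=['grapefruit', 'watermelon', 'apple', 'banana', 'starfruit'],
    name='fruits',
)

# Element-by-element: the lambda is called once per value.
labels_apply = s.apply(lambda x: 'expensive' if x > 1 else 'cheap')

# Vectorised: the comparison and the selection run over the whole array.
# np.where returns a plain array, so we re-attach the index ourselves.
labels_where = pd.Series(np.where(s > 1, 'expensive', 'cheap'), index=s.index)

print(labels_where)
```

Both approaches give the same labels here. The difference only starts to matter as the series gets large.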

### NaN values in Pandas

NaN values in pandas use the same `np.nan` value as numpy does. *However*, pandas aggregations such as sums and means ignore NaN values by default.

In [110]:
s2 = pd.Series([1,2,np.nan,7,np.nan])
s2

0    1.0
1    2.0
2    NaN
3    7.0
4    NaN
dtype: float64

In [111]:
s2.sum()

10.0
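
To make the contrast with numpy concrete, the sketch below sums the same values as a plain numpy array and as a series. The `skipna=False` argument asks pandas to propagate NaN instead of ignoring it. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = [1, 2, np.nan, 7, np.nan]
s2 = pd.Series(values)

# A plain numpy sum propagates NaN.
numpy_total = np.sum(np.array(values))

# The pandas sum skips NaN by default...
pandas_total = s2.sum()

# ...unless we ask it not to.
strict_total = s2.sum(skipna=False)

# The mean is taken over the non-NaN values only (3 of them here).
pandas_mean = s2.mean()

print(numpy_total, pandas_total, strict_total, pandas_mean, s2.count())
```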

Pandas also has some useful methods for selecting and filling nans.


In [124]:
display(s2.isna())

print("\nSelected with isna()")
display(s2.loc[s2.isna()])

0    False
1    False
2     True
3    False
4     True
dtype: bool


Selected with isna()


2   NaN
4   NaN
dtype: float64

The `~` operator (tilde) negates a boolean series, so it can be used to invert a selection like the one above.

In [118]:
s2.loc[~s2.isna()]

0    1.0
1    2.0
3    7.0
dtype: float64

But there's also a handy `notna()` method.

In [155]:
s2.notna()

0     True
1     True
2    False
3     True
4    False
dtype: bool
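
For filling or dropping NaN values, `fillna()` and `dropna()` are the usual starting points. A minimal sketch follows. Filling with zero or with the mean are just two example choices, and the right one depends on your data.

```python
import numpy as np
import pandas as pd

s2 = pd.Series([1, 2, np.nan, 7, np.nan])

# Replace every NaN with a fixed value.
filled_zero = s2.fillna(0)

# Replace every NaN with the mean of the non-NaN values.
filled_mean = s2.fillna(s2.mean())

# Drop the NaN entries; the remaining labels are kept as they were.
dropped = s2.dropna()

print(filled_zero)
print(filled_mean)
print(dropped)
```

All three methods return a new series and leave `s2` unchanged.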

## Pandas datatypes
The most common datatypes you will see in pandas are:

| dtype | description |
|-------|---------------|
| int64 | Integer Array |
| float64 | Float Array |
| object | Object/String array |
| datetime64[ns] | datetime with nanosecond precision |
| category | categorical data (e.g. strings with only a small number of valid options), <br/> may be ordered (e.g. low, medium, high) or unordered (e.g. apple, orange, banana) |
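
As a rough illustration of the table above, the sketch below builds one small series per datatype and prints its `dtype`. All of the values are made up. The exact dtype reported for strings and datetimes can vary between pandas versions, so the cell prints the dtypes rather than assuming them.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
floats = pd.Series([0.5, 2.1])
strings = pd.Series(['apple', 'banana'])

# Parse date strings into a datetime series.
dates = pd.to_datetime(pd.Series(['2024-01-01', '2024-01-02']))

# An ordered categorical: low < medium < high.
size_type = pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True)
sizes = pd.Series(['low', 'high', 'medium', 'low']).astype(size_type)

for name, series in [('ints', ints), ('floats', floats), ('strings', strings),
                     ('dates', dates), ('sizes', sizes)]:
    print(name, series.dtype)

# Ordered categories support comparisons and min/max.
print(sizes > 'low')
print(sizes.min(), sizes.max())
```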

