# 1.2 🐼 Pandas Core Concepts - `pandas.Series`

NumPy is fairly low-level. We want to manipulate our data in more descriptive ways, and pandas lets us do just that. It extends the NumPy concepts above by adding labels to our data in the form of indices.


In [None]:
# It's convention to import numpy as 'np'.
import numpy as np
# It's customary to import pandas as pd
import pandas as pd

In [None]:
# Let's create a series that holds the price of some fruit
s = pd.Series(
    [0.50, 2.10, 0.50, 0.19, 1.29],
    index=['grapefruit', 'watermelon', 'apple', 'banana', 'starfruit'],
    name='fruits',
)

display(s)

The index of a series is a sequence of values that labels each data point in that series. Labels can be of any hashable datatype: strings and integers are the most common, along with datetime indices for time series data.

A series can optionally have a name that labels what the data represents.
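
As a small sketch of a non-string index, the series below is labelled by dates rather than fruit names. The prices and the name `banana_price` are made up for illustration.

```python
import pandas as pd

# Hypothetical daily banana prices, labelled by date instead of by fruit name.
dates = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03'])
banana_prices = pd.Series([0.19, 0.21, 0.20], index=dates, name='banana_price')

# Converting the labels with pd.to_datetime gives the series a DatetimeIndex.
print(type(banana_prices.index).__name__)

# Values are looked up by label, here a Timestamp.
print(banana_prices.loc[pd.Timestamp('2024-01-02')])
```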

In [None]:
display(s.index)

In [None]:
display(s.values)

In [None]:
display(s.name)

### Accessing Series Values

We can access the values in our series in a variety of ways. 

 - With its label using the standard dictionary-like accessor, e.g. `['foo']`
 - With its label index using the `.loc` accessor
 - With its numeric index using the `.iloc` accessor

In [None]:
s['apple']

In [None]:
s.loc['watermelon']

In [None]:
s.iloc[3]  # Remember positions are zero-based.

At first the `.loc` accessor might seem redundant, but it allows us to do some powerful things.


In [None]:
# Slice along the axis (in the order of the axis - not alphabetical order unless you sort the index first)
s.loc['watermelon':'banana']
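
One detail worth checking when slicing: a `.loc` label slice includes its stop label, while an `.iloc` position slice excludes its stop position, as with Python lists. A minimal sketch using the same fruit series:

```python
import pandas as pd

s = pd.Series(
    [0.50, 2.10, 0.50, 0.19, 1.29],
    index=['grapefruit', 'watermelon', 'apple', 'banana', 'starfruit'],
    name='fruits',
)

# Label slice: the stop label 'banana' is included.
by_label = s.loc['watermelon':'banana']

# Position slice: the stop position 3 is excluded.
by_position = s.iloc[1:3]

print(list(by_label.index))     # ['watermelon', 'apple', 'banana']
print(list(by_position.index))  # ['watermelon', 'apple']
```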

In [None]:
# Supply a list of True or False values to return the entries marked True.
s.loc[[False, True, False, True, True]]

In [None]:
# You can also pass .loc an expression that evaluates to True or False for each value.
s.loc[s > 1]
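
Boolean expressions can also be combined before being passed to `.loc`. The sketch below uses `&` (and); `|` (or) works the same way. The variable names `cheap` and `not_apple` are only illustrative.

```python
import pandas as pd

s = pd.Series(
    [0.50, 2.10, 0.50, 0.19, 1.29],
    index=['grapefruit', 'watermelon', 'apple', 'banana', 'starfruit'],
    name='fruits',
)

# Each condition produces one True/False value per entry.
cheap = s < 1
not_apple = s.index != 'apple'

# Combine with & and wrap any inline comparisons in parentheses,
# e.g. s.loc[(s < 1) & (s.index != 'apple')].
selected = s.loc[cheap & not_apple]
print(selected)
```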

### Series Operations
We can do the same kinds of operations on a Series as we could on a numpy array.


In [None]:
# Operations where one input is a scalar are 'broadcast' to the whole series.
s + 1

In [None]:
# Pandas series share many of the aggregate functions of a numpy array
s.mean()
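
One difference from a plain NumPy array is that operations between two series match values by index label rather than by position. A small sketch with made-up prices and quantities:

```python
import pandas as pd

# Hypothetical data: the two series list the same fruits in a different order.
prices = pd.Series([0.50, 0.25], index=['apple', 'banana'])
quantities = pd.Series([3, 10], index=['banana', 'apple'])

# Values are paired up by label, so apple's price meets apple's quantity.
totals = prices * quantities
print(totals)
```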

### The `apply` method

The `apply` method is one of the more powerful concepts in pandas. It will loop over each value in the series (or DataFrame - see below) and apply a function to it.

In [None]:
# Decide which fruits are too expensive.

def is_expensive(price):
    if price > 1.0:
        return 'expensive'
    else:
        return 'cheap'
    
s.apply(is_expensive)

A handy shorthand for this if you are familiar with python's [lambda](https://www.w3schools.com/python/python_lambda.asp) syntax (not to be confused with AWS lambda) for anonymous functions is as follows:

In [None]:
s.apply(lambda x: 'expensive' if x > 1 else 'cheap')

#### A note about applies

Earlier we noted that NumPy has fast methods for numeric aggregations and operations on arrays. This is true for pandas too, for example when we sum or add to a series. But applies are not vectorized: they call a Python function once per value, so they run slower than built-in methods like `.sum()` or operators like addition and multiplication.

This means you should be careful that applies on large datasets aren't slowing down your program. If you do need an apply on a large set of data, there are ways to make it faster - look into frameworks like https://dask.org/.

A lot of the time though doing an apply is just fine, and they're extremely useful.
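
When the logic is a simple condition, one possible vectorized alternative to an apply is `np.where`. The sketch below reproduces the cheap/expensive labelling and checks it against the apply version; it is one option among several, not the only way to do this.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [0.50, 2.10, 0.50, 0.19, 1.29],
    index=['grapefruit', 'watermelon', 'apple', 'banana', 'starfruit'],
    name='fruits',
)

# np.where evaluates the condition for the whole array at once.
labels = pd.Series(np.where(s > 1, 'expensive', 'cheap'), index=s.index)

# The same result computed one value at a time with apply.
via_apply = s.apply(lambda x: 'expensive' if x > 1 else 'cheap')

print((labels == via_apply).all())
```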

### nan values in Pandas

NaN values in pandas use the same `np.nan` value as NumPy does. *However*, pandas aggregations such as sums and means ignore NaN values by default.

In [None]:
s2 = pd.Series([1,2,np.nan,7,np.nan])
s2

In [None]:
s2.sum()
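
To make the default visible, the sketch below compares the pandas sum with and without `skipna`, and with a plain NumPy sum over the same values.

```python
import numpy as np
import pandas as pd

s2 = pd.Series([1, 2, np.nan, 7, np.nan])

# Default behaviour: NaN values are skipped.
print(s2.sum())              # 10.0

# Passing skipna=False lets NaN propagate into the result.
print(s2.sum(skipna=False))  # nan

# A plain NumPy sum over the underlying array also returns nan.
print(np.sum(s2.to_numpy()))
```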

Pandas also has some useful methods for selecting and filling nans.


In [None]:
display(s2.isna())

print("\nSelected with isna()")
display(s2.loc[s2.isna()])

The `~` (tilde) operator negates each True/False value in a boolean series, so it can be used to invert a selection like the one above.

In [None]:
s2.loc[~s2.isna()]

But there's also a handy notna() method.

In [None]:
s2.notna()
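
For filling or removing missing values, `fillna` and `dropna` are the usual starting points. A minimal sketch, where filling with 0 is just an example choice:

```python
import numpy as np
import pandas as pd

s2 = pd.Series([1, 2, np.nan, 7, np.nan])

# Replace every NaN with a fixed value (0 is an arbitrary choice here).
filled = s2.fillna(0)

# Or drop the NaN entries entirely; the remaining labels are kept as they were.
dropped = s2.dropna()

print(filled)
print(dropped)
```

Both methods return a new series and leave `s2` unchanged.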

## Pandas datatypes
The most common datatypes you will see in pandas are:

| dtype | description |
|-------|---------------|
| int64 | Integer Array |
| float64 | Float Array |
| object | Object/String array |
| datetime64[ns] | datetime with nanosecond precision |
| category | categorical data (e.g. strings with only a small number of valid options), <br/> may be ordered (e.g. low, medium, high) or unordered (e.g. apple, orange, banana) |
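
As a sketch of the `category` row above, the example below converts a series of strings into an ordered categorical. The series `sizes` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data with only three valid options.
sizes = pd.Series(['low', 'high', 'medium', 'low'])

# Declare the valid options and their order, then convert.
size_type = pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True)
ordered = sizes.astype(size_type)

print(ordered.dtype)    # category

# Because the categories are ordered, comparisons and max() follow that order.
print(ordered.max())    # high
print(ordered > 'low')
```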



## Methods for specific datatypes

For series of different datatypes there are often methods that apply to specific data types. Most often these are useful for string and datetime datatypes.

For string type series objects we can access numerous string methods with `.str.method_name()`

In [None]:
s3 = pd.Series(['MaTThEw','Mark','Luke','John','Paul','GeOrge','RinGO'])

s3.str.upper()

In [None]:
s3.str.title()

In [None]:
s3.str.contains('Ma')

In [None]:
# Extract just the capital letters. Note this adds another level to the index to allow for multiple matches!
s3.str.extractall(r'([A-Z]+)')

Datetime series objects have methods you can access with `.dt`

In [None]:
s4 = pd.to_datetime(pd.Series(['1987-01-01 00:00:00','2016-04-01 05:06:07','2020-08-06 12:10:20']))
s4

In [None]:
s4.dt.date

In [None]:
s4.dt.day

In [None]:
s4.dt.year

In [None]:
s4.dt.is_leap_year

## Exercises:

Answer the following about the series `wc` below
 - who won the world cup in 1994? Hint: use the .loc accessor (or just print the series...)
 - who has won it the most times? Hint: try calling value_counts() on the series.

Try to do the following:
 - change the winner for 1974 to 'Wales'. Hint: you can assign values to the series with .loc too.
 - filter the series to years where Germany won. Hint: Pass an expression to the .loc accessor

In [None]:
import string
string.ascii_lowercase

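# Let's learn something about this not-very-interesting series.
# read_csv no longer accepts squeeze=True (removed in pandas 2.0), so squeeze the single column afterwards.
wc = pd.read_csv('data/worldcupwinners.csv', index_col='year').squeeze('columns')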

In [None]:
# Who won in 1994
wc.loc[1994]

In [None]:
# Who won the most times.
wc.value_counts()

In [None]:
# Change the 1974 winner to Wales
wc.loc[1974] = 'Wales'
wc

In [None]:
# Filter to years where Germany won
display(wc.loc[wc == 'Germany'])

# Extra credit - what about West Germany?
display(wc.loc[wc.isin(['Germany', 'West Germany'])])

# Or a different way
display(wc.loc[wc.str.contains("Germany")])